# Qwen3.5-27B-Text2SQL
A fine-tune of Qwen/Qwen3.5-27B for text-to-SQL generation: given a database schema and a natural-language question, the model outputs a clean SQL query.
## Key Results
| Metric | Base Model | This Model | Improvement |
|---|---|---|---|
| Gretel Execution Accuracy (3,492 samples) | 26.5% | 66.7% | +40.2% |
| Gretel Valid SQL | 37.8% | 89.4% | +51.6% |
| Spider Execution Accuracy (1,032, gold std) | 46.6% | 55.1% | +8.5% |
| Spider Exact Match | 0.0% | 22.2% | +22.2% |
| Spider Keyword Score | 45.5% | 85.4% | +39.9% |
### Regression (standard benchmarks via lm-eval-harness)
| Benchmark | Base | This Model | Delta |
|---|---|---|---|
| MMLU (humanities) | 81.9% | 83.5% | +1.6% (no regression) |
| MMLU (STEM) | 87.2% | 86.7% | -0.5% (no regression) |
| MMLU (social sciences) | 92.0% | 92.0% | 0% (no regression) |
| MMLU (other) | 87.5% | 87.5% | 0% (no regression) |
| GSM8K (math, strict) | 60.4% | 35.4% | -25.0% (regression) |
| HellaSwag (common sense) | 79.6% | 84.1% | +4.5% (improved) |
| ARC-Challenge (reasoning) | 69.3% | 71.3% | +2.0% (improved) |
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "mahernaija/Qwen3.5-27B-Text2SQL",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "mahernaija/Qwen3.5-27B-Text2SQL",
    trust_remote_code=True,
)

schema = "CREATE TABLE employees (id INTEGER, name TEXT, department TEXT, salary REAL);"
question = "Find all employees in Engineering with salary above 90000."
prompt = (
    "<|im_start|>system\nYou are a SQL expert. Given a database schema and a "
    "natural language question, write the correct SQL query.<|im_end|>\n"
    f"<|im_start|>user\nSchema: {schema}\nQuestion: {question}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Greedy decoding: generate() rejects temperature=0, so use do_sample=False instead.
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
# SELECT * FROM employees WHERE department = 'Engineering' AND salary > 90000
```
### With vLLM
```shell
vllm serve mahernaija/Qwen3.5-27B-Text2SQL --tensor-parallel-size 2
```
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
response = client.chat.completions.create(
    model="mahernaija/Qwen3.5-27B-Text2SQL",
    messages=[
        {"role": "system", "content": "You are a SQL expert. Given a database schema and a natural language question, write the correct SQL query."},
        {"role": "user", "content": "Schema: CREATE TABLE products (id INT, name TEXT, price REAL);\nQuestion: What are the 3 most expensive products?"},
    ],
    max_tokens=256,
    temperature=0,
)
print(response.choices[0].message.content)
# SELECT name FROM products ORDER BY price DESC LIMIT 3
```
## Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3.5-27B (26.9B params) |
| Method | Full fine-tuning with FSDP |
| Hardware | 16× NVIDIA H200 (2 nodes, Nebius AI Cloud) |
| Training time | 3 hours 6 minutes |
| Dataset | sql-create-context (78K) + Gretel synthetic (100K) = 178K samples |
| Epochs | 1 |
| Batch size | 2 per GPU × 8 grad accum × 16 GPUs = 256 global |
| Learning rate | 2e-5 (cosine decay, 5% warmup) |
| Sequence length | 512 (SQL samples P99=373 tokens) |
| Precision | BF16 |
| GPU utilization | 98-100% |
| Final train loss | 0.144 |
| Final eval loss | 0.176 |
### Data Preprocessing
- Schemas stripped of INSERT data (model learns SQL from schema structure, not memorized answers)
- Manual chat formatting (bypasses Qwen3.5's `<think>` tag injection)
- Label masking: loss computed only on the SQL output; prompt tokens masked with -100
- Deduplication and a contamination check between train and eval splits
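The label-masking step can be sketched as follows. This is a simplified, self-contained illustration (the actual training code is not published); `mask_prompt_labels` is a hypothetical helper operating on already-tokenized ids:

```python
def mask_prompt_labels(prompt_ids, answer_ids, max_len=512, ignore_index=-100):
    """Concatenate prompt and answer token ids, masking prompt positions
    with -100 so the cross-entropy loss is computed only on the SQL output."""
    input_ids = (prompt_ids + answer_ids)[:max_len]
    labels = ([ignore_index] * len(prompt_ids) + answer_ids)[:max_len]
    return input_ids, labels

# Dummy ids for illustration: three prompt tokens, two answer tokens.
ids, labels = mask_prompt_labels([1, 2, 3], [7, 8])
print(labels)  # [-100, -100, -100, 7, 8]
```

PyTorch's cross-entropy loss ignores positions labelled -100 by default, which is why that sentinel value is used.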
## Evaluation
Gretel Execution Accuracy (gold-standard metric: the generated SQL is executed in SQLite and the result sets are compared):
| Complexity | Base | This Model |
|---|---|---|
| Basic SQL | 23.8% | 71.4% |
| Aggregation | 18.2% | 54.5% |
| Single JOIN | 25.0% | 75.0% |
| Window functions | 0.0% | 33.3% |
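Execution accuracy of this kind can be measured with a sketch like the following. This is a simplified illustration, not the actual evaluation harness: it builds the schema in an in-memory SQLite database, runs the gold and predicted queries, and compares result sets order-insensitively.

```python
import sqlite3

def execution_match(schema_sql: str, gold_sql: str, pred_sql: str) -> bool:
    """True if gold and predicted queries return the same rows
    (order-insensitive) on an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_sql)
        gold = sorted(conn.execute(gold_sql).fetchall())
        pred = sorted(conn.execute(pred_sql).fetchall())
        return gold == pred
    except sqlite3.Error:
        return False  # invalid SQL counts as a miss (cf. the Valid SQL metric)
    finally:
        conn.close()

schema = """
CREATE TABLE employees (id INTEGER, name TEXT, department TEXT, salary REAL);
INSERT INTO employees VALUES (1, 'Ada', 'Engineering', 95000),
                             (2, 'Bob', 'Sales', 60000);
"""
print(execution_match(
    schema,
    "SELECT name FROM employees WHERE salary > 90000",
    "SELECT name FROM employees WHERE department = 'Engineering'",
))  # True: both queries return only ('Ada',) on this data
```

Note that semantically different queries can still match on a given database instance, which is why large sample counts (3,492 here) matter for this metric.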
Spider Benchmark (1,034 dev questions, public standard):
- Exact match: 22.2%
- Keyword score: 85.4%
## Known Limitations
**Catastrophic forgetting:** full fine-tuning on SQL-only data regressed general capabilities; the model tries to answer non-SQL questions with SQL (56% SQL contamination on general prompts). For production use with mixed tasks, consider:
- LoRA fine-tuning instead of full FT
- Mixed training data (SQL + general chat)
- Using this model only for SQL-specific pipelines
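For the last option, a minimal routing sketch is shown below. The keyword heuristic and model names used for the fallback are illustrative assumptions, not part of this release; a production router would likely use a classifier.

```python
import re

# Illustrative heuristic: route prompts with SQL-flavoured keywords
# to the fine-tuned model, everything else to the base model.
SQL_HINTS = re.compile(
    r"\b(sql|query|select|schema|table|join|group by)\b", re.IGNORECASE
)

def route(prompt: str) -> str:
    """Pick a model name for the given prompt."""
    if SQL_HINTS.search(prompt):
        return "mahernaija/Qwen3.5-27B-Text2SQL"
    return "Qwen/Qwen3.5-27B"

print(route("Schema: CREATE TABLE t (id INT); Question: count rows"))
# mahernaija/Qwen3.5-27B-Text2SQL
```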
### Regression Test
| Category | Base | This Model | SQL Contamination |
|---|---|---|---|
| General Knowledge | 84% | 44% | 2/5 |
| Math | 100% | 40% | 3/5 |
| Code | 48% | 47% | 5/5 |
| Language | 90% | 78% | 4/5 |
| Common Sense | 82% | 93% | 0/5 |
## Architecture
- Architecture: `Qwen3_5ForConditionalGeneration` (a VLM with text + vision)
- Text backbone: 64 layers, hidden_size=5120, 24 attention heads, 4 KV heads
- Attention: Hybrid GDN (linear_attention + full_attention)
- Context: 262K tokens
- Vision: Built-in 27-layer ViT (weights from base model, not finetuned)
## Files
| File | Description |
|---|---|
| `model-00001-of-00003.safetensors` | Text backbone weights (shard 1/3) |
| `model-00002-of-00003.safetensors` | Text backbone weights (shard 2/3) |
| `model-00003-of-00003.safetensors` | Text backbone weights (shard 3/3) |
| `model-visual.safetensors` | Vision encoder weights (from the base model) |
| `config.json` | Full VLM config (required by vLLM) |
| `tokenizer.json` | Tokenizer |
## Citation
```bibtex
@misc{naija2026qwen35text2sql,
  title={Qwen3.5-27B-Text2SQL: Fine-tuned Qwen 3.5 for Text-to-SQL},
  author={Maher Naija},
  year={2026},
  url={https://huggingface.co/mahernaija/Qwen3.5-27B-Text2SQL}
}
```
## License
Apache 2.0 (same as base model Qwen/Qwen3.5-27B)
## Acknowledgments
- Base model: Qwen/Qwen3.5-27B by Alibaba
- Training data: sql-create-context + Gretel synthetic_text_to_sql