# Qwen3.5-27B-Text2SQL
A fine-tune of Qwen/Qwen3.5-27B for text-to-SQL generation: given a database schema and a natural-language question, the model outputs a clean SQL query.
## Key Results
| Metric | Base Model | This Model | Improvement |
|---|---|---|---|
| Gretel Execution Accuracy (3,492 samples) | 26.5% | 66.7% | +40.2% |
| Gretel Valid SQL | 37.8% | 89.4% | +51.6% |
| Spider Execution Accuracy (1,032, gold std) | 46.6% | 55.1% | +8.5% |
| Spider Exact Match | 0.0% | 22.2% | +22.2% |
| Spider Keyword Score | 45.5% | 85.4% | +39.9% |
### Regression (standard benchmarks via lm-eval-harness)
| Benchmark | Base | This Model | Delta |
|---|---|---|---|
| MMLU (humanities) | 81.9% | 83.5% | +1.6% (no regression) |
| MMLU (STEM) | 87.2% | 86.7% | -0.5% (no regression) |
| MMLU (social sciences) | 92.0% | 92.0% | 0% (no regression) |
| MMLU (other) | 87.5% | 87.5% | 0% (no regression) |
| GSM8K (math, strict) | 60.4% | 35.4% | -25.0% (regression) |
| HellaSwag (common sense) | 79.6% | 84.1% | +4.5% (improved) |
| ARC-Challenge (reasoning) | 69.3% | 71.3% | +2.0% (improved) |
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "mahernaija/Qwen3.5-27B-Text2SQL",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "mahernaija/Qwen3.5-27B-Text2SQL",
    trust_remote_code=True,
)

schema = "CREATE TABLE employees (id INTEGER, name TEXT, department TEXT, salary REAL);"
question = "Find all employees in Engineering with salary above 90000."
prompt = (
    "<|im_start|>system\nYou are a SQL expert. Given a database schema and a "
    "natural language question, write the correct SQL query.<|im_end|>\n"
    f"<|im_start|>user\nSchema: {schema}\nQuestion: {question}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Greedy decoding: generate() rejects temperature=0, so use do_sample=False instead.
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
# SELECT * FROM employees WHERE department = 'Engineering' AND salary > 90000
```
### With vLLM
```shell
vllm serve mahernaija/Qwen3.5-27B-Text2SQL --tensor-parallel-size 2
```
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
response = client.chat.completions.create(
    model="mahernaija/Qwen3.5-27B-Text2SQL",
    messages=[
        {"role": "system", "content": "You are a SQL expert. Given a database schema and a natural language question, write the correct SQL query."},
        {"role": "user", "content": "Schema: CREATE TABLE products (id INT, name TEXT, price REAL);\nQuestion: What are the 3 most expensive products?"},
    ],
    max_tokens=256,
    temperature=0,
)
print(response.choices[0].message.content)
# SELECT name FROM products ORDER BY price DESC LIMIT 3
```
## Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3.5-27B (26.9B params) |
| Method | Full fine-tuning with FSDP |
| Hardware | 16× NVIDIA H200 (2 nodes, Nebius AI Cloud) |
| Training time | 3 hours 6 minutes |
| Dataset | sql-create-context (78K) + Gretel synthetic (100K) = 178K samples |
| Epochs | 1 |
| Batch size | 2 per GPU × 8 grad accum × 16 GPUs = 256 global |
| Learning rate | 2e-5 (cosine decay, 5% warmup) |
| Sequence length | 512 (SQL samples P99=373 tokens) |
| Precision | BF16 |
| GPU utilization | 98-100% |
| Final train loss | 0.144 |
| Final eval loss | 0.176 |
### Data Preprocessing
- Schemas stripped of INSERT data (model learns SQL from schema structure, not memorized answers)
- Manual chat formatting (bypasses Qwen3.5's `<think>` tag injection)
- Label masking: loss computed only on the SQL output; prompt tokens masked with -100
- Deduplication and a contamination check between train and eval splits
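The label-masking step can be sketched as follows. This is a simplified, self-contained illustration (the actual training code is not published); `mask_prompt_labels` is a hypothetical helper operating on already-tokenized ids:

```python
def mask_prompt_labels(prompt_ids, answer_ids, max_len=512, ignore_index=-100):
    """Concatenate prompt and answer token ids, masking prompt positions
    with -100 so the cross-entropy loss is computed only on the SQL output."""
    input_ids = (prompt_ids + answer_ids)[:max_len]
    labels = ([ignore_index] * len(prompt_ids) + answer_ids)[:max_len]
    return input_ids, labels

# Dummy ids for illustration: three prompt tokens, two answer tokens.
ids, labels = mask_prompt_labels([1, 2, 3], [7, 8])
print(labels)  # [-100, -100, -100, 7, 8]
```

PyTorch's cross-entropy loss ignores positions labelled -100 by default, which is why that sentinel value is used.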
## Evaluation
Gretel Execution Accuracy (gold-standard metric: the generated SQL is executed in SQLite and the result sets are compared):
| Complexity | Base | This Model |
|---|---|---|
| Basic SQL | 23.8% | 71.4% |
| Aggregation | 18.2% | 54.5% |
| Single JOIN | 25.0% | 75.0% |
| Window functions | 0.0% | 33.3% |
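Execution accuracy of this kind can be measured with a sketch like the following. This is a simplified illustration, not the actual evaluation harness: it builds the schema in an in-memory SQLite database, runs the gold and predicted queries, and compares result sets order-insensitively.

```python
import sqlite3

def execution_match(schema_sql: str, gold_sql: str, pred_sql: str) -> bool:
    """True if gold and predicted queries return the same rows
    (order-insensitive) on an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_sql)
        gold = sorted(conn.execute(gold_sql).fetchall())
        pred = sorted(conn.execute(pred_sql).fetchall())
        return gold == pred
    except sqlite3.Error:
        return False  # invalid SQL counts as a miss (cf. the Valid SQL metric)
    finally:
        conn.close()

schema = """
CREATE TABLE employees (id INTEGER, name TEXT, department TEXT, salary REAL);
INSERT INTO employees VALUES (1, 'Ada', 'Engineering', 95000),
                             (2, 'Bob', 'Sales', 60000);
"""
print(execution_match(
    schema,
    "SELECT name FROM employees WHERE salary > 90000",
    "SELECT name FROM employees WHERE department = 'Engineering'",
))  # True: both queries return only ('Ada',) on this data
```

Note that semantically different queries can still match on a given database instance, which is why large sample counts (3,492 here) matter for this metric.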
Spider Benchmark (1,034 dev questions, public standard):
- Exact match: 22.2%
- Keyword score: 85.4%
## Known Limitations
**Catastrophic forgetting:** full fine-tuning on SQL-only data regressed general capabilities; the model tries to answer non-SQL questions with SQL (56% SQL contamination on general prompts). For production use with mixed tasks, consider:
- LoRA fine-tuning instead of full FT
- Mixed training data (SQL + general chat)
- Using this model only for SQL-specific pipelines
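For the last option, a minimal routing sketch is shown below. The keyword heuristic and model names used for the fallback are illustrative assumptions, not part of this release; a production router would likely use a classifier.

```python
import re

# Illustrative heuristic: route prompts with SQL-flavoured keywords
# to the fine-tuned model, everything else to the base model.
SQL_HINTS = re.compile(
    r"\b(sql|query|select|schema|table|join|group by)\b", re.IGNORECASE
)

def route(prompt: str) -> str:
    """Pick a model name for the given prompt."""
    if SQL_HINTS.search(prompt):
        return "mahernaija/Qwen3.5-27B-Text2SQL"
    return "Qwen/Qwen3.5-27B"

print(route("Schema: CREATE TABLE t (id INT); Question: count rows"))
# mahernaija/Qwen3.5-27B-Text2SQL
```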
### Regression Test
| Category | Base | This Model | SQL Contamination |
|---|---|---|---|
| General Knowledge | 84% | 44% | 2/5 |
| Math | 100% | 40% | 3/5 |
| Code | 48% | 47% | 5/5 |
| Language | 90% | 78% | 4/5 |
| Common Sense | 82% | 93% | 0/5 |
## Architecture
- Architecture: `Qwen3_5ForConditionalGeneration` (a VLM with text + vision)
- Text backbone: 64 layers, hidden_size=5120, 24 attention heads, 4 KV heads
- Attention: Hybrid GDN (linear_attention + full_attention)
- Context: 262K tokens
- Vision: Built-in 27-layer ViT (weights from base model, not finetuned)
## Files
| File | Description |
|---|---|
| `model-00001-of-00003.safetensors` | Text backbone weights (shard 1/3) |
| `model-00002-of-00003.safetensors` | Text backbone weights (shard 2/3) |
| `model-00003-of-00003.safetensors` | Text backbone weights (shard 3/3) |
| `model-visual.safetensors` | Vision encoder weights (from the base model) |
| `config.json` | Full VLM config (required by vLLM) |
| `tokenizer.json` | Tokenizer |
## Citation
```bibtex
@misc{naija2026qwen35text2sql,
  title={Qwen3.5-27B-Text2SQL: Fine-tuned Qwen 3.5 for Text-to-SQL},
  author={Maher Naija},
  year={2026},
  url={https://huggingface.co/mahernaija/Qwen3.5-27B-Text2SQL}
}
```
## License
Apache 2.0 (same as base model Qwen/Qwen3.5-27B)
## Acknowledgments
- Base model: Qwen/Qwen3.5-27B by Alibaba
- Training data: sql-create-context + Gretel synthetic_text_to_sql