Qwen3-4B-Thinking-2507-Heretic-GGUF

llama.cpp imatrix quantizations of Qwen3-4B-Thinking-2507-Heretic by becnic (from the original Qwen3-4B-Thinking-2507)

Using llama.cpp release b7120 for quantization.

Original model: https://huggingface.co/becnic/Qwen3-4B-Thinking-2507-Heretic

Run them in LM Studio

Run them directly with llama.cpp, or any other llama.cpp-based project (see the sketch below)
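
For example, here is a minimal sketch using llama-cpp-python, one llama.cpp-based binding. The file name and settings are illustrative placeholders; adjust them to the quant you actually download from the table below:

```python
# Minimal sketch using llama-cpp-python (pip install llama-cpp-python).
# File name and parameters are illustrative, not prescriptive.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3-4B-Thinking-2507-Heretic-Q4_K_M.gguf",
    n_ctx=8192,        # working context; the model supports up to 262,144 natively
    n_gpu_layers=-1,   # offload all layers to GPU if one is available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain GGUF quantization in one paragraph."}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```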

Download a file (not the whole branch) from below:

| Filename | Quant type | File Size | Split | Description |
| --- | --- | --- | --- | --- |
| Qwen3-4B-Thinking-2507-Heretic-f16.gguf | f16 | 8.05GB | false | Full precision, highest possible quality |
| Qwen3-4B-Thinking-2507-Heretic-Q8_0.gguf | Q8_0 | 4.28GB | false | Extremely high quality |
| Qwen3-4B-Thinking-2507-Heretic-Q6_K.gguf | Q6_K | 3.31GB | false | Near-lossless high quality |
| Qwen3-4B-Thinking-2507-Heretic-Q5_K_S.gguf | Q5_K_S | 2.82GB | false | Premium high quality |
| Qwen3-4B-Thinking-2507-Heretic-Q5_K_M.gguf | Q5_K_M | 2.89GB | false | Very high quality |
| Qwen3-4B-Thinking-2507-Heretic-Q5_0.gguf | Q5_0 | 2.82GB | false | High quality |
| Qwen3-4B-Thinking-2507-Heretic-Q4_K_S.gguf | Q4_K_S | 2.38GB | false | Strong mid-high quality |
| Qwen3-4B-Thinking-2507-Heretic-Q4_K_M.gguf | Q4_K_M | 2.50GB | false | Balanced mid-high quality |
| Qwen3-4B-Thinking-2507-Heretic-Q4_0.gguf | Q4_0 | 2.37GB | false | Good balance of size and quality |
| Qwen3-4B-Thinking-2507-Heretic-Q3_K_S.gguf | Q3_K_S | 1.89GB | false | Higher tier Q3 |
| Qwen3-4B-Thinking-2507-Heretic-Q3_K_M.gguf | Q3_K_M | 2.08GB | false | Mid-range |
| Qwen3-4B-Thinking-2507-Heretic-Q2_K.gguf | Q2_K | 1.67GB | false | Smallest size, lowest quality |

Downloading using huggingface-cli


First, make sure you have huggingface-cli installed:

pip install -U "huggingface_hub[cli]"

Then, you can target the specific file you want:

huggingface-cli download ZuzeTt/Qwen3-4B-Thinking-2507-Heretic-GGUF --include "Qwen3-4B-Thinking-2507-Heretic-Q8_0.gguf" --local-dir ./

If the model is bigger than 50GB, it will have been split into multiple files. To download them all to a local folder, run:

huggingface-cli download ZuzeTt/Qwen3-4B-Thinking-2507-Heretic-GGUF --include "Qwen3-4B-Thinking-2507-Heretic-Q8_0.gguf/*" --local-dir ./
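
If you prefer Python, the same download can be scripted with the huggingface_hub library; a minimal sketch (the filename is a placeholder, pick any entry from the table above):

```python
# Minimal sketch using huggingface_hub (pip install -U huggingface_hub).
from huggingface_hub import hf_hub_download

# Download a single quant file; the filename here is illustrative.
path = hf_hub_download(
    repo_id="ZuzeTt/Qwen3-4B-Thinking-2507-Heretic-GGUF",
    filename="Qwen3-4B-Thinking-2507-Heretic-Q8_0.gguf",
    local_dir="./",
)
# For split models, huggingface_hub.snapshot_download with allow_patterns
# serves the same purpose as the --include glob above.
print(path)
```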

Abliteration parameters

| Parameter | Value |
| --- | --- |
| direction_index | 19.42 |
| attn.o_proj.max_weight | 1.23 |
| attn.o_proj.max_weight_position | 22.34 |
| attn.o_proj.min_weight | 0.69 |
| attn.o_proj.min_weight_distance | 10.42 |
| mlp.down_proj.max_weight | 1.12 |
| mlp.down_proj.max_weight_position | 29.64 |
| mlp.down_proj.min_weight | 1.08 |
| mlp.down_proj.min_weight_distance | 20.24 |
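
For context, abliteration works by projecting a learned "refusal direction" out of selected weight matrices. The following is a generic sketch of that projection, not Heretic's actual implementation; the names and shapes are illustrative:

```python
# Generic sketch of directional ablation (not Heretic's exact code).
# W is a weight matrix (e.g. attn.o_proj or mlp.down_proj) of shape
# (out_features, in_features); d is the refusal direction in the output
# space; `weight` plays the role of the per-matrix weights tabled above.
import torch

def ablate(W: torch.Tensor, d: torch.Tensor, weight: float) -> torch.Tensor:
    d = d / d.norm()                     # ensure the direction is unit length
    projection = torch.outer(d, d) @ W   # component of W along d
    return W - weight * projection      # remove the (scaled) refusal component
```

With weight = 1.0 this fully orthogonalizes W against d; the fractional weights in the table correspond to partial removal, varied per layer.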

Performance

| Metric | This model | Original model (Qwen/Qwen3-4B-Thinking-2507) |
| --- | --- | --- |
| KL divergence | 0.06 | 0 (by definition) |
| Refusals | 6/100 | 96/100 |
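
KL divergence here measures how far this model's next-token distribution drifts from the original's (0 means identical behavior). A minimal sketch of the per-position computation, with illustrative numbers rather than values from the actual evaluation:

```python
# Sketch of per-token KL divergence D_KL(P_original || Q_modified).
# p and q are next-token probability distributions over the same vocabulary;
# the numbers below are illustrative only.
import math

def kl_divergence(p: list[float], q: list[float]) -> float:
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.70, 0.20, 0.10]       # original model's distribution
q = [0.65, 0.25, 0.10]       # modified model's distribution
print(kl_divergence(p, q))   # small value -> behavior close to the original
```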

Model Overview

Qwen3-4B-Thinking-2507 has the following features:

  • Type: Causal Language Models
  • Training Stage: Pretraining & Post-training
  • Number of Parameters: 4.0B
  • Number of Parameters (Non-Embedding): 3.6B
  • Number of Layers: 36
  • Number of Attention Heads (GQA): 32 for Q and 8 for KV
  • Context Length: 262,144 natively.

NOTE: This model supports only thinking mode; specifying enable_thinking=True is no longer required.

Additionally, to enforce thinking, the default chat template automatically includes an opening <think>. It is therefore normal for the model's output to contain only </think> without an explicit opening <think> tag.
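
To see this in practice, a minimal sketch with transformers that prints the rendered prompt (illustrative only; it uses the original Qwen repo, not the GGUF files):

```python
# Sketch: inspect the chat template's trailing <think> tag.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Thinking-2507")
text = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hi"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(text)  # the prompt ends with an opening <think>, so the model's
             # own output contains only the closing </think>
```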

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

Supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
