Qwen3-4B-Thinking-2507-Heretic-GGUF

llama.cpp imatrix quantizations of Qwen3-4B-Thinking-2507-Heretic by becnic (from the original Qwen3-4B-Thinking-2507)

Using llama.cpp release b7120 for quantization.

Original model: https://huggingface.co/becnic/Qwen3-4B-Thinking-2507-Heretic

Run them in LM Studio

Run them directly with llama.cpp, or any other llama.cpp-based project (see the sketch below)
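
For example, here is a minimal sketch using llama-cpp-python, one llama.cpp-based binding. The file name and settings are illustrative placeholders; adjust them to the quant you actually download from the table below:

```python
# Minimal sketch using llama-cpp-python (pip install llama-cpp-python).
# File name and parameters are illustrative, not prescriptive.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3-4B-Thinking-2507-Heretic-Q4_K_M.gguf",
    n_ctx=8192,        # working context; the model supports up to 262,144 natively
    n_gpu_layers=-1,   # offload all layers to GPU if one is available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain GGUF quantization in one paragraph."}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```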

Download a file (not the whole branch) from below:

| Filename | Quant type | File Size | Split | Description |
| --- | --- | --- | --- | --- |
| Qwen3-4B-Thinking-2507-Heretic-f16.gguf | f16 | 8.05GB | false | Full precision, highest possible quality |
| Qwen3-4B-Thinking-2507-Heretic-Q8_0.gguf | Q8_0 | 4.28GB | false | Extremely high quality |
| Qwen3-4B-Thinking-2507-Heretic-Q6_K.gguf | Q6_K | 3.31GB | false | Near-lossless high quality |
| Qwen3-4B-Thinking-2507-Heretic-Q5_K_S.gguf | Q5_K_S | 2.82GB | false | Premium high quality |
| Qwen3-4B-Thinking-2507-Heretic-Q5_K_M.gguf | Q5_K_M | 2.89GB | false | Very high quality |
| Qwen3-4B-Thinking-2507-Heretic-Q5_0.gguf | Q5_0 | 2.82GB | false | High quality |
| Qwen3-4B-Thinking-2507-Heretic-Q4_K_S.gguf | Q4_K_S | 2.38GB | false | Strong mid-high quality |
| Qwen3-4B-Thinking-2507-Heretic-Q4_K_M.gguf | Q4_K_M | 2.50GB | false | Balanced mid-high quality |
| Qwen3-4B-Thinking-2507-Heretic-Q4_0.gguf | Q4_0 | 2.37GB | false | Good balance of size and quality |
| Qwen3-4B-Thinking-2507-Heretic-Q3_K_S.gguf | Q3_K_S | 1.89GB | false | Higher tier Q3 |
| Qwen3-4B-Thinking-2507-Heretic-Q3_K_M.gguf | Q3_K_M | 2.08GB | false | Mid-range |
| Qwen3-4B-Thinking-2507-Heretic-Q2_K.gguf | Q2_K | 1.67GB | false | Smallest size, lowest quality |

Downloading using huggingface-cli


First, make sure you have huggingface-cli installed:

pip install -U "huggingface_hub[cli]"

Then, you can target the specific file you want:

huggingface-cli download ZuzeTt/Qwen3-4B-Thinking-2507-Heretic-GGUF --include "Qwen3-4B-Thinking-2507-Heretic-Q8_0.gguf" --local-dir ./

If the model is bigger than 50GB, it will have been split into multiple files. To download them all to a local folder, run:

huggingface-cli download ZuzeTt/Qwen3-4B-Thinking-2507-Heretic-GGUF --include "Qwen3-4B-Thinking-2507-Heretic-Q8_0.gguf/*" --local-dir ./
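
If you prefer Python, the same download can be scripted with the huggingface_hub library; a minimal sketch (the filename is a placeholder, pick any entry from the table above):

```python
# Minimal sketch using huggingface_hub (pip install -U huggingface_hub).
from huggingface_hub import hf_hub_download

# Download a single quant file; the filename here is illustrative.
path = hf_hub_download(
    repo_id="ZuzeTt/Qwen3-4B-Thinking-2507-Heretic-GGUF",
    filename="Qwen3-4B-Thinking-2507-Heretic-Q8_0.gguf",
    local_dir="./",
)
# For split models, huggingface_hub.snapshot_download with allow_patterns
# serves the same purpose as the --include glob above.
print(path)
```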

Abliteration parameters

| Parameter | Value |
| --- | --- |
| direction_index | 19.42 |
| attn.o_proj.max_weight | 1.23 |
| attn.o_proj.max_weight_position | 22.34 |
| attn.o_proj.min_weight | 0.69 |
| attn.o_proj.min_weight_distance | 10.42 |
| mlp.down_proj.max_weight | 1.12 |
| mlp.down_proj.max_weight_position | 29.64 |
| mlp.down_proj.min_weight | 1.08 |
| mlp.down_proj.min_weight_distance | 20.24 |
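
For context, abliteration works by projecting a learned "refusal direction" out of selected weight matrices. The following is a generic sketch of that projection, not Heretic's actual implementation; the names and shapes are illustrative:

```python
# Generic sketch of directional ablation (not Heretic's exact code).
# W is a weight matrix (e.g. attn.o_proj or mlp.down_proj) of shape
# (out_features, in_features); d is the refusal direction in the output
# space; `weight` plays the role of the per-matrix weights tabled above.
import torch

def ablate(W: torch.Tensor, d: torch.Tensor, weight: float) -> torch.Tensor:
    d = d / d.norm()                     # ensure the direction is unit length
    projection = torch.outer(d, d) @ W   # component of W along d
    return W - weight * projection      # remove the (scaled) refusal component
```

With weight = 1.0 this fully orthogonalizes W against d; the fractional weights in the table correspond to partial removal, varied per layer.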

Performance

| Metric | This model | Original model (Qwen/Qwen3-4B-Thinking-2507) |
| --- | --- | --- |
| KL divergence | 0.06 | 0 (by definition) |
| Refusals | 6/100 | 96/100 |
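
KL divergence here measures how far this model's next-token distribution drifts from the original's (0 means identical behavior). A minimal sketch of the per-position computation, with illustrative numbers rather than values from the actual evaluation:

```python
# Sketch of per-token KL divergence D_KL(P_original || Q_modified).
# p and q are next-token probability distributions over the same vocabulary;
# the numbers below are illustrative only.
import math

def kl_divergence(p: list[float], q: list[float]) -> float:
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.70, 0.20, 0.10]       # original model's distribution
q = [0.65, 0.25, 0.10]       # modified model's distribution
print(kl_divergence(p, q))   # small value -> behavior close to the original
```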

Model Overview

Qwen3-4B-Thinking-2507 has the following features:

  • Type: Causal Language Models
  • Training Stage: Pretraining & Post-training
  • Number of Parameters: 4.0B
  • Number of Parameters (Non-Embedding): 3.6B
  • Number of Layers: 36
  • Number of Attention Heads (GQA): 32 for Q and 8 for KV
  • Context Length: 262,144 natively.

NOTE: This model supports only thinking mode; specifying enable_thinking=True is no longer required.

Additionally, to enforce thinking, the default chat template automatically includes an opening <think>. It is therefore normal for the model's output to contain only </think> without an explicit opening <think> tag.
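
To see this in practice, a minimal sketch with transformers that prints the rendered prompt (illustrative only; it uses the original Qwen repo, not the GGUF files):

```python
# Sketch: inspect the chat template's trailing <think> tag.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Thinking-2507")
text = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hi"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(text)  # the prompt ends with an opening <think>, so the model's
             # own output contains only the closing </think>
```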

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

Supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
