TN5000 Thyroid Ultrasound Classifier

Fine-tuned SwinV2-Base for Benign vs Malignant Thyroid Nodule Classification

📋 Table of Contents

Overview
Model Architecture
Dataset
Training Methodology
Results
External Validation
How to Use
Limitations & Disclaimers
Citation

Overview

This model classifies thyroid ultrasound images as benign or malignant and is intended to assist in the early detection of thyroid cancer. It was fine-tuned from Microsoft's SwinV2-Base vision transformer on the TN5000 ROI dataset from Kaggle.

Key Design Decisions:

Optimized for sensitivity (87.5%) to minimize missed malignancies — critical in cancer screening
AUC-ROC of 0.94 on the TN5000 validation and test sets indicates strong in-distribution discrimination (0.71 on the external dataset)
Focal loss with class weighting handles the benign/malignant class imbalance
Early stopping prevents overfitting on the small medical dataset

Model Architecture

Property	Value
Base Model	microsoft/swinv2-base-patch4-window8-256
Architecture	Swin Transformer V2
Parameters	86.9M
Input Size	256 × 256
Patch Size	4 × 4
Window Size	8 × 8
Number of Classes	2 (benign, malignant)
License	Apache 2.0

Why SwinV2? Swin Transformers build hierarchical feature maps with shifted-window attention, which suits medical imaging where local texture patterns (echogenicity, microcalcifications, irregular margins) are diagnostically important. SwinV2 adds residual post-normalization and scaled cosine attention, which stabilize training as model capacity and input resolution grow.
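To make the hierarchy concrete, the sketch below works out the token grid and window count per stage from the input, patch, and window sizes in the table. It assumes the standard four-stage Swin layout in which each patch-merging step halves the grid side; the function names are illustrative.

```python
# Token-grid arithmetic for a Swin-style backbone at 256 x 256 input.
INPUT_SIZE = 256   # from the architecture table
PATCH_SIZE = 4     # from the architecture table
WINDOW_SIZE = 8    # from the architecture table
NUM_STAGES = 4     # assumption: standard four-stage Swin layout


def stage_grids(input_size=INPUT_SIZE, patch_size=PATCH_SIZE, num_stages=NUM_STAGES):
    """Return the token-grid side length at each stage (halved by patch merging)."""
    side = input_size // patch_size
    grids = []
    for _ in range(num_stages):
        grids.append(side)
        side //= 2
    return grids


def windows_per_stage(grids, window_size=WINDOW_SIZE):
    """Number of non-overlapping attention windows at each stage."""
    return [max(1, side // window_size) ** 2 for side in grids]


grids = stage_grids()
print(grids)                     # [64, 32, 16, 8]
print(windows_per_stage(grids))  # [64, 16, 4, 1]
```

Under these assumptions the last stage is a single 8 × 8 window, so attention there covers the whole ROI patch.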

Dataset

Primary Training Dataset: TN5000 ROI

Source: Kaggle - ROI Dataset TN5000
Type: Thyroid ultrasound Region-of-Interest (ROI) patches
Total Images: 4,250

Split	Images	Benign	Malignant
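Train (80%)	2,800	~800	~2,000
Validation (20%)	700	200	500
Test (held-out)	750	208	542

Class Distribution: The dataset is imbalanced toward malignant cases (roughly 72% of images). Per-class counts for the validation and test splits match the confusion matrices in Results; the training counts are those implied by the balanced class weights. The balanced class weights (benign: 1.75, malignant: 0.70) upweight the minority benign class, and focal loss (γ=2.0) concentrates training on hard-to-classify examples.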
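The quoted weights match the common "balanced" heuristic, weight_c = n_samples / (n_classes × n_c). The sketch below applies that formula to hypothetical per-class counts (800 benign, 2,000 malignant), chosen here because they reproduce the stated weights for a 2,800-image training split. The function name is illustrative, and the card does not say which implementation was used.

```python
def balanced_class_weights(counts):
    """Balanced heuristic: n_samples / (n_classes * n_c) for each class c."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}


# Hypothetical counts that reproduce the weights quoted in the text.
train_counts = {"benign": 800, "malignant": 2000}
weights = balanced_class_weights(train_counts)
print(weights)  # {'benign': 1.75, 'malignant': 0.7}
```

The rarer class receives the larger weight, so each of its examples contributes more to the loss.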

External Validation Dataset

Source: Johnyquest7/thyroid-cancer-classification-ultrasound-dataset
Images: 3,115 total (train + test splits)
Purpose: Independent validation on unseen data from a different source

Training Methodology

Data Preprocessing

Transform	Training	Validation/Test
Resize	RandomResizedCrop(256)	Resize(256) + CenterCrop(256)
Horizontal Flip	50% probability	No
Rotation	±10°	No
Color Jitter	brightness=0.2, contrast=0.2	No
Normalization	ImageNet mean/std	ImageNet mean/std

Training Configuration

learning_rate: 2e-5
batch_size: 16 (per device)
gradient_accumulation_steps: 2
effective_batch_size: 32
epochs: 30 (early stopping patience: 5)
warmup_ratio: 0.1
optimizer: AdamW (β1=0.9, β2=0.999)
scheduler: Linear with warmup
mixed_precision: bf16
seed: 42
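Two pieces of this configuration are simple arithmetic: the effective batch size (16 × 2 = 32 on a single device) and the learning-rate curve. The sketch below shows the shape of a linear schedule with a 10% warmup. The total step count of 1,000 is a hypothetical value for illustration, not the number of steps actually run.

```python
PER_DEVICE_BATCH = 16
GRAD_ACCUM_STEPS = 2
print(PER_DEVICE_BATCH * GRAD_ACCUM_STEPS)  # 32 (single device assumed)


def linear_warmup_lr(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Linear ramp from 0 to base_lr during warmup, then linear decay to 0."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 1000  # hypothetical; depends on dataset size and epochs run
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_lr(step, TOTAL_STEPS))
```

The rate peaks at 2e-5 when warmup ends (step 100 here) and reaches zero at the final step.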

Loss Function: Focal Loss

Standard cross-entropy treats all misclassifications equally. In thyroid screening, missing a malignant case (false negative) is far more costly than a false alarm. We used focal loss:

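FL(p_t) = −(1 − p_t)^γ · log(p_t)

Here p_t is the predicted probability of the true class. With γ=2.0, well-classified examples (p_t near 1) contribute little to the loss, so training concentrates on hard-to-classify cases of either class. The class weights (benign: 1.75, malignant: 0.70) additionally scale each class's contribution, giving benign examples the larger per-example weight.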
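A small numeric sketch shows how the focal term reshapes the loss. It evaluates the formula for a single example and compares it with plain cross-entropy. The `alpha_t` argument is an assumed way of applying a per-class weight, not a detail confirmed by the card.

```python
import math


def cross_entropy(p_t):
    """Cross-entropy for one example, given the probability of its true class."""
    return -math.log(p_t)


def focal_loss(p_t, gamma=2.0, alpha_t=1.0):
    """Focal loss for one example; alpha_t is an optional class weight (assumption)."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


for p_t in (0.9, 0.5, 0.1):
    ratio = focal_loss(p_t) / cross_entropy(p_t)
    print(f"p_t={p_t:.1f}  CE={cross_entropy(p_t):.3f}  "
          f"FL={focal_loss(p_t):.3f}  FL/CE={ratio:.2f}")
```

The ratio FL/CE equals (1 − p_t)^γ, so with γ=2 an easy example at p_t = 0.9 keeps 1% of its cross-entropy loss while a hard one at p_t = 0.1 keeps 81%.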

Model Selection Criterion

The best model was selected by validation AUC-ROC (not accuracy), ensuring optimal discrimination between benign and malignant cases across all thresholds.
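AUC-ROC is threshold-free because it is the probability that a randomly chosen malignant case scores higher than a randomly chosen benign one. The sketch below computes it by direct pairwise comparison on made-up scores. It is quadratic in the number of examples, so it is meant for illustration rather than large datasets.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = malignant) and malignant-class scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(f"{auc:.3f}")  # 8 of 9 pairs are ranked correctly
```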

Results

Validation Set (700 images)

Metric	Value
Accuracy	87.9%
F1-Score	91.3%
Sensitivity (Recall)	88.8%
Specificity	85.5%
PPV (Precision)	93.9%
NPV	75.3%
AUC-ROC	0.940

Confusion Matrix:

              Predicted
           Benign  Malignant
Actual Benign    171       29
      Malignant   56      444
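The reported metrics follow directly from the four confusion-matrix counts, with malignant as the positive class. The helper below (an illustrative name) recomputes them for the validation matrix.

```python
def binary_metrics(tn, fp, fn, tp):
    """Standard binary metrics, treating malignant as the positive class."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "accuracy": accuracy,
        "f1": f1,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
    }


# Validation confusion matrix: rows are actual, columns are predicted.
val = binary_metrics(tn=171, fp=29, fn=56, tp=444)
for name, value in val.items():
    print(f"{name}: {value:.1%}")
```

The same function applied to the test or external matrices reproduces their tables as well.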

Test Set (750 images — held out)

Metric	Value
Accuracy	87.2%
F1-Score	90.8%
Sensitivity (Recall)	87.5%
Specificity	86.5%
PPV (Precision)	94.4%
NPV	72.6%
AUC-ROC	0.937

Confusion Matrix:

              Predicted
           Benign  Malignant
Actual Benign    180       28
      Malignant   68      474

Training Curves

The model converged around epochs 18 to 22, with validation AUC-ROC peaking at 0.940 at epoch 22. With a patience of 5, early stopping triggered at epoch 27 and the best checkpoint (epoch 22) was reloaded.

Epoch	Train Loss	Val AUC-ROC	Val Sensitivity	Val Specificity
1	0.356	0.713	0.714	0.590
5	0.229	0.912	0.940	0.715
10	0.187	0.922	0.858	0.835
15	0.148	0.934	0.928	0.805
18	0.125	0.939	0.846	0.885
22	0.143	0.940	0.888	0.855
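The early-stopping rule can be sketched as a patience counter on validation AUC-ROC. The per-epoch trace below is hypothetical (only a few epochs are tabulated above). It is constructed to peak at epoch 22 so that a patience of 5 stops training at epoch 27.

```python
def early_stopping_epoch(val_scores, patience=5):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs fail to beat the best score.
    """
    best_score = float("-inf")
    best_epoch = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_scores)


# Hypothetical trace: rises steadily to a peak at epoch 22, then plateaus lower.
trace = [0.70 + 0.24 * e / 22 for e in range(1, 23)] + [0.935] * 8
best_epoch, stop_epoch = early_stopping_epoch(trace, patience=5)
print(best_epoch, stop_epoch)  # 22 27
```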

External Validation

To assess generalization, we tested the model on an independent dataset without any fine-tuning:

Metric	Value
Accuracy	66.8%
F1-Score	44.7%
Sensitivity	34.5%
Specificity	87.4%
PPV	63.5%
NPV	67.7%
AUC-ROC	0.707

Confusion Matrix (External):

              Predicted
           Benign  Malignant
Actual Benign   1665      240
      Malignant  793      417

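Analysis: External validation shows a substantial performance drop. AUC-ROC falls from 0.94 to 0.71, and sensitivity falls from 87.5% to 34.5%, meaning the model misses roughly two-thirds of malignant nodules (793 of 1,210) on this dataset. Likely contributing factors: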

Domain shift: Different ultrasound machines, protocols, and image preprocessing
Different ROI extraction: The external dataset may use different cropping strategies
Population differences: Different patient demographics and disease prevalence

This highlights the importance of domain adaptation or fine-tuning on local data before clinical deployment.
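The card does not describe any threshold tuning. As one lightweight illustration of adapting to local data short of fine-tuning, the sketch below picks the decision threshold that reaches a target sensitivity on a labelled local validation set. The labels and probabilities are made up, and the function names are illustrative.

```python
import math


def threshold_for_sensitivity(labels, malignant_probs, target=0.90):
    """Highest threshold t such that predicting malignant when prob >= t
    reaches at least `target` sensitivity on the given data."""
    pos = sorted((p for y, p in zip(labels, malignant_probs) if y == 1), reverse=True)
    if not pos:
        raise ValueError("need at least one malignant example")
    needed = max(1, math.ceil(target * len(pos) - 1e-9))  # guard float error
    return pos[needed - 1]


def sensitivity_specificity(labels, malignant_probs, threshold):
    """Sensitivity and specificity when predicting malignant at prob >= threshold."""
    tp = sum(1 for y, p in zip(labels, malignant_probs) if y == 1 and p >= threshold)
    fn = sum(1 for y, p in zip(labels, malignant_probs) if y == 1 and p < threshold)
    tn = sum(1 for y, p in zip(labels, malignant_probs) if y == 0 and p < threshold)
    fp = sum(1 for y, p in zip(labels, malignant_probs) if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Made-up local validation data (1 = malignant).
labels = [1, 1, 1, 1, 1, 0, 0, 0]
probs = [0.95, 0.80, 0.60, 0.35, 0.20, 0.50, 0.30, 0.10]
thr = threshold_for_sensitivity(labels, probs, target=0.8)
sens, spec = sensitivity_specificity(labels, probs, thr)
print(thr, sens, round(spec, 3))
```

Lowering the threshold trades specificity for sensitivity, so any threshold chosen this way should be checked on separate held-out local data.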

How to Use

Quick Inference with Pipeline

from transformers import pipeline
from PIL import Image

# Load model
classifier = pipeline("image-classification", model="Johnyquest7/TN5000_model")

# Predict
image = Image.open("thyroid_ultrasound.png").convert("RGB")
results = classifier(image)

# Results format:
# [{'label': 'malignant', 'score': 0.944}, {'label': 'benign', 'score': 0.056}]
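The pipeline returns a list of label/score dictionaries, so it is safer to look up the malignant score by label than by list position. This is a small sketch based on the result format shown above; the helper name is illustrative.

```python
def malignant_score(results, positive_label="malignant"):
    """Return the score for the malignant label from pipeline-style results."""
    for entry in results:
        if entry["label"] == positive_label:
            return entry["score"]
    raise KeyError(f"label {positive_label!r} not found in results")


# Example in the documented result format.
example = [{"label": "malignant", "score": 0.944}, {"label": "benign", "score": 0.056}]
print(malignant_score(example))  # 0.944
```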

Manual Inference

import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Load model and processor
model = AutoModelForImageClassification.from_pretrained("Johnyquest7/TN5000_model")
processor = AutoImageProcessor.from_pretrained("Johnyquest7/TN5000_model")

# Preprocess
image = Image.open("thyroid_ultrasound.png").convert("RGB")
inputs = processor(image, return_tensors="pt")

# Predict
with torch.no_grad():
    logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]

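# Map each class index to its label name via the model config,
# rather than assuming a fixed index order
scores = {model.config.id2label[i]: p.item() for i, p in enumerate(probs)}

print(f"Malignant: {scores['malignant']:.1%}")
print(f"Benign: {scores['benign']:.1%}")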

Gradio Demo

Try the live demo: 🩺 Thyroid Nodule Classifier Demo

Limitations & Disclaimers

⚠️ CRITICAL: This model is for research and educational purposes only.

Not FDA-approved for clinical use
External validation showed performance degradation (AUC 0.71 vs 0.94) — domain shift is a real concern
Trained on ROI patches, not full ultrasound images — the model expects pre-cropped nodule regions
Class imbalance in training data may bias predictions
No multi-institutional validation — performance may vary across hospitals and equipment
Always consult a radiologist or endocrinologist for diagnosis

Intended Use Cases:

Research on AI-assisted thyroid screening
Educational tool for medical students
Prototype for integration into PACS systems (with proper validation)

Not Intended For:

Direct patient diagnosis
Replacing human radiologists
Screening without supervision

Citation

If you use this model in your research, please cite:

@misc{tn5000_model,
  title={TN5000 Thyroid Ultrasound Classifier},
  author={Johnyquest7},
  year={2026},
  howpublished={\url{https://huggingface.co/Johnyquest7/TN5000_model}},
  note={Fine-tuned SwinV2-Base for benign vs malignant thyroid nodule classification}
}

Base Model:

@inproceedings{liu2022swinv2,
  title={Swin Transformer V2: Scaling Up Capacity and Resolution},
  author={Liu, Ze and Hu, Han and Lin, Yutong and Yao, Zhuliang and Xie, Zhenda and Wei, Yixuan and Ning, Jia and Cao, Yue and Zhang, Zheng and Dong, Li and Wei, Furu and Guo, Baining},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}

Dataset:

TN5000 ROI Dataset: Kaggle

Acknowledgments

Model trained using Hugging Face Transformers and Datasets libraries
Compute provided by Hugging Face GPU credits
Base model: Microsoft SwinV2-Base

Generated by ML Intern — an agent for machine learning research and development on the Hugging Face Hub.
