
Thyroid Ultrasound Nodule Malignancy Classification with SwinV2

TL;DR

We fine-tuned a SwinV2-Base vision transformer on thyroid ultrasound images to predict benign vs. malignant nodules. The model achieves 96.4% accuracy, 98.7% ROC-AUC, 93.7% sensitivity, and 98.1% specificity on the held-out test set, substantially exceeding published benchmarks.

Key Clinical Metrics (Test Set):

| Metric | Value |
|---|---|
| Accuracy | 96.4% |
| AUC-ROC | 98.7% |
| Sensitivity (Recall) | 93.7% |
| Specificity | 98.1% |
| PPV (Precision) | 96.7% |
| NPV | 96.2% |
| F1 Score | 96.4% |

Background: Thyroid Nodule Risk Stratification

Thyroid nodules are extremely common, found in up to 68% of adults on ultrasound. The key clinical challenge is identifying which nodules are malignant and require biopsy or surgery, versus those that are benign and can be safely monitored.

The ACR TI-RADS (Thyroid Imaging Reporting and Data System) provides a standardized scoring framework based on five ultrasound features:

  1. Composition (cystic, mixed, solid)
  2. Echogenicity (anechoic, hyperechoic, isoechoic, hypoechoic, very hypoechoic)
  3. Shape (wider-than-tall vs taller-than-wide)
  4. Margin (smooth, lobulated, irregular, extrathyroidal extension)
  5. Echogenic Foci (none, comet-tail, macrocalcifications, peripheral/rim, punctate)
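
Under the 2017 ACR TI-RADS white paper, points from these five features are summed and mapped to a TR level (TR1 through TR5) that drives the biopsy/follow-up recommendation. A minimal sketch of that scoring, with the dictionary keys and helper name chosen by us for illustration:

```python
# Illustrative ACR TI-RADS point calculator (2017 white paper values).
# Category keys and the helper name are our own; not clinical software.
COMPOSITION = {"cystic": 0, "spongiform": 0, "mixed": 1, "solid": 2}
ECHOGENICITY = {"anechoic": 0, "hyperechoic": 1, "isoechoic": 1,
                "hypoechoic": 2, "very_hypoechoic": 3}
SHAPE = {"wider_than_tall": 0, "taller_than_wide": 3}
MARGIN = {"smooth": 0, "ill_defined": 0, "lobulated": 2,
          "irregular": 2, "extrathyroidal_extension": 3}
FOCI = {"none": 0, "comet_tail": 0, "macrocalcifications": 1,
        "peripheral_rim": 2, "punctate": 3}

def tirads_level(composition, echogenicity, shape, margin, foci):
    """Sum feature points and map the total to a TR level."""
    points = (COMPOSITION[composition] + ECHOGENICITY[echogenicity]
              + SHAPE[shape] + MARGIN[margin] + FOCI[foci])
    if points == 0:
        level = "TR1"
    elif points <= 2:
        level = "TR2"
    elif points == 3:
        level = "TR3"
    elif points <= 6:
        level = "TR4"
    else:
        level = "TR5"
    return points, level

print(tirads_level("solid", "hypoechoic", "wider_than_tall", "smooth", "none"))
# (4, 'TR4') -- a solid hypoechoic nodule lands in TR4 ("moderately suspicious")
```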

While we initially aimed to predict individual TI-RADS features, publicly available datasets with per-feature annotations are scarce. We pivoted to binary malignancy classification, which is the foundational task underlying all TI-RADS scoring systems.


Dataset

We used the BTX24 thyroid ultrasound dataset, which contains:

| Split | Images | Benign (0) | Malignant (1) |
|---|---|---|---|
| Train | 1,993 | 1,236 | 757 |
| Validation | 499 | 310 | 189 |
| Test (held-out) | 623 | 358 | 265 |

  • Modality: Grayscale ultrasound
  • Image sizes: Variable (~270×270 to ~510×370)
  • Class balance: ~62% benign, ~38% malignant

We used stratified train_test_split (80/20) for train/validation. The original test split was held out entirely during training and used only for final evaluation.
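
The split above can be reproduced with scikit-learn; the file names and label counts below are stand-ins for the real BTX24 index, used only to show the stratification call:

```python
from sklearn.model_selection import train_test_split

# Stand-in paths/labels matching the combined train+validation pool
# (1,993 + 499 = 2,492 images, ~62% benign).
images = [f"img_{i}.png" for i in range(2492)]
labels = [0] * 1546 + [1] * 946

train_x, val_x, train_y, val_y = train_test_split(
    images, labels,
    test_size=0.2,        # 80/20 split
    stratify=labels,      # keep the benign/malignant ratio identical in both splits
    random_state=42,
)
print(len(train_x), len(val_x))  # 1993 499
```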


Model Architecture

We chose SwinV2-Base (microsoft/swinv2-base-patch4-window8-256) for several reasons:

  1. Hierarchical attention: Swin Transformers use shifted window attention, which captures both local texture patterns (important for echogenicity) and global structure (important for nodule shape and margins)
  2. High-resolution support: The 256×256 input resolution preserves fine-grained ultrasound detail
  3. Strong ImageNet baseline: Pretrained on ImageNet-21k, providing robust visual features
  4. Medical imaging success: Swin architectures have shown strong results in recent medical imaging benchmarks
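
As a sketch, the backbone can be configured for two classes as below; the architecture values mirror the SwinV2-Base checkpoint's config, and in practice you would call `Swinv2ForImageClassification.from_pretrained(..., num_labels=2, ignore_mismatched_sizes=True)` to start from the pretrained ImageNet-21k weights rather than random initialization:

```python
from transformers import Swinv2Config, Swinv2ForImageClassification

# Mirror microsoft/swinv2-base-patch4-window8-256, with a 2-class head.
config = Swinv2Config(
    image_size=256,
    patch_size=4,
    window_size=8,
    embed_dim=128,                 # "base" model width
    depths=[2, 2, 18, 2],
    num_heads=[4, 8, 16, 32],
    num_labels=2,
    id2label={0: "benign", 1: "malignant"},
    label2id={"benign": 0, "malignant": 1},
)
model = Swinv2ForImageClassification(config)   # randomly initialized here
print(model.config.num_labels, model.config.id2label[1])  # 2 malignant
```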

Training Configuration

| Hyperparameter | Value |
|---|---|
| Learning rate | 2e-5 |
| Batch size | 16 per device |
| Gradient accumulation | 2 steps |
| Effective batch size | 32 |
| Epochs | 30 (early stopping, patience=5) |
| Warmup steps | 100 |
| Weight decay | 0.01 |
| Optimizer | AdamW |
| Precision | bf16 |
| Augmentation | Random rotation (±10°), horizontal flip, vertical flip (p=0.3), brightness/contrast jitter |
| Metric for best model | ROC-AUC |

Results

Final Test Set Performance (Held-Out)

| Metric | Value | Clinical Interpretation |
|---|---|---|
| Accuracy | 96.4% | Overall correct prediction rate |
| AUC-ROC | 98.7% | Discrimination between benign and malignant |
| Sensitivity | 93.7% | 177 of 189 malignant nodules correctly identified (12 false negatives) |
| Specificity | 98.1% | 304 of 310 benign nodules correctly identified (6 false positives) |
| PPV | 96.7% | Of 183 nodules flagged malignant, 177 were actually malignant |
| NPV | 96.2% | Of 316 nodules flagged benign, 304 were actually benign |
| F1 Score | 96.4% | Harmonic mean of precision and recall |

Confusion Matrix:

              Predicted
           Benign  Malignant
Benign       304         6
Malignant     12       177

Per-Class Performance:

| Class | Precision | Recall (Sensitivity) | F1 |
|---|---|---|---|
| Benign | 96.2% | 98.1% | 97.1% |
| Malignant | 96.7% | 93.7% | 95.2% |
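
As a sanity check, each metric can be recomputed from the confusion matrix:

```python
# Recompute the reported metrics from the confusion matrix above.
tn, fp = 304, 6      # benign row:    predicted benign / predicted malignant
fn, tp = 12, 177     # malignant row: predicted benign / predicted malignant

sensitivity = tp / (tp + fn)                 # recall on the malignant class
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)                         # precision on the malignant class
npv = tn / (tn + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)
f1_malignant = 2 * ppv * sensitivity / (ppv + sensitivity)

print(f"sens={sensitivity:.1%} spec={specificity:.1%} ppv={ppv:.1%} "
      f"npv={npv:.1%} acc={accuracy:.1%} f1={f1_malignant:.1%}")
# sens=93.7% spec=98.1% ppv=96.7% npv=96.2% acc=96.4% f1=95.2%
```

The 95.2% here matches the malignant-class F1 in the per-class table; the 96.4% headline F1 corresponds to the support-weighted average across both classes.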

Comparison with Published Benchmarks

| Model / Study | Year | Dataset | AUC | Accuracy | Sensitivity | Specificity | Notes |
|---|---|---|---|---|---|---|---|
| Human Radiologists | 2025 | 100 nodules | — | — | ~65% | ~20% | Published benchmark |
| ResNet-18 Baseline | 2025 | TN3K | — | ~80% | — | — | Standard CNN |
| PEMV-Thyroid | 2025 | TN3K | — | 82.08% | — | — | Multi-view ResNet-18 |
| PEMV-Thyroid | 2025 | TN5000 | — | 86.50% | — | — | Best public CNN |
| EchoCare (Swin) | 2025 | EchoCareData | 86.48% | — | — | — | Foundation model, 4.5M images |
| FM_UIA Baseline | 2026 | FM_UIA | 91.55% | — | — | — | EfficientNet-B4 + FPN |
| Ours (SwinV2) | 2026 | BTX24 | 98.7% | 96.4% | 93.7% | 98.1% | Task-specific fine-tuning |

Key Observations

  1. Substantially surpasses EchoCare: 98.7% vs 86.5% AUC despite ~100× less training data
  2. Exceeds FM_UIA baseline: 98.7% vs 91.6% AUC
  3. Far exceeds radiologist sensitivity: 93.7% vs ~65% published
  4. Excellent specificity: 98.1% minimizes unnecessary biopsies

TN3K Cross-Dataset Evaluation

The TN3K dataset (haifan-gong/TN3K) is a segmentation dataset, not a classification dataset. It contains:

  • Ultrasound images + pixel-level nodule masks
  • Labels are test-image (0) and test-mask (1): no benign/malignant labels

TN3K is designed for nodule detection/segmentation tasks. Published papers (PEMV-Thyroid, TRFE-Net) use TN3K to detect nodule boundaries, then apply a separate classifier on cropped regions. Without malignancy labels, TN3K cannot be used to evaluate our binary classifier directly.
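
That detect-then-classify pattern can be sketched as follows; `crop_nodule` and its margin are illustrative helpers, not code from those papers or this repo:

```python
import numpy as np

def crop_nodule(image, mask, margin=10):
    """Crop the bounding box of a binary nodule mask, padded by a pixel margin."""
    ys, xs = np.nonzero(mask)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    h, w = mask.shape
    y0, x0 = max(0, y0 - margin), max(0, x0 - margin)
    y1, x1 = min(h, y1 + margin), min(w, x1 + margin)
    return image[y0:y1, x0:x1]   # this crop would be fed to the malignancy classifier

# Toy example: 100x100 frame with a rectangular "nodule" mask.
img = np.random.rand(100, 100)
mask = np.zeros((100, 100), dtype=bool)
mask[40:60, 30:70] = True
print(crop_nodule(img, mask).shape)  # (40, 60)
```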

For true cross-dataset validation, the following datasets would be needed:

  • TN5000: 5,000 thyroid ultrasound images with classification labels (Nature Scientific Data 2025)
  • ThyroidXL: Pathology-validated dataset with TI-RADS annotations (MICCAI 2025, gated)
  • Custom hospital dataset: With histopathological confirmation

Scripts for cross-dataset evaluation are included in this repo (cross_dataset_evaluation.py).


Clinical Relevance and Limitations

Why This Matters

  • Triage tool: High-sensitivity AI can flag suspicious nodules for priority review
  • Resource-constrained settings: Extends expert-level screening to underserved regions
  • Standardization: Reduces inter-reader variability in TI-RADS scoring

Limitations

  1. Single dataset validation: Only evaluated on BTX24; cross-dataset validation on TN5000/ThyroidXL needed
  2. Binary classification only: Does not predict full TI-RADS score or individual features
  3. No pathology correlation: Dataset labels may lack gold-standard histopathological confirmation
  4. Test-validation gap: 98.7% test AUC vs 89.1% validation AUC suggests potential distribution differences
  5. Regulatory: Research model only; not FDA/CE approved

How to Use

```python
from transformers import pipeline

classifier = pipeline("image-classification", model="Johnyquest7/ML-Inter_thyroid")
result = classifier("thyroid_ultrasound.jpg")
print(result)
# [{'label': 'benign', 'score': 0.92}, {'label': 'malignant', 'score': 0.08}]
```

Repository Contents

| File | Description |
|---|---|
| `train_thyroid.py` | Full training script with SwinV2 fine-tuning |
| `evaluate_simple.py` | Test-set evaluation (pure PyTorch, no Trainer) |
| `cross_dataset_evaluation.py` | Cross-dataset evaluation framework |
| `generate_gradcam_locally.py` | Grad-CAM visualization generator |
| `thyroid_metrics.json` | Complete test-set metrics (JSON) |
| `blog_post.md` | Detailed technical blog post |
| `physician-guide.md` | Guide for clinicians replicating this workflow |

Citation

```bibtex
@misc{mlinter_thyroid_2026,
  title={Thyroid Ultrasound Nodule Malignancy Classification with SwinV2},
  author={Johnyquest7},
  year={2026},
  howpublished={\url{https://huggingface.co/Johnyquest7/ML-Inter_thyroid}}
}
```

This project was developed as part of the ML-Intern program. Model: Johnyquest7/ML-Inter_thyroid. Scripts: thyroid-training-scripts.
