
Thyroid Ultrasound Nodule Malignancy Classification with SwinV2

TL;DR

We fine-tuned a SwinV2-Base vision transformer on thyroid ultrasound images to predict benign vs. malignant nodules. The model achieves 96.4% accuracy, 98.7% ROC-AUC, 93.7% sensitivity, and 98.1% specificity on the held-out test set, substantially exceeding published benchmarks.

Key Clinical Metrics (Test Set):

| Metric | Value |
|---|---|
| Accuracy | 96.4% |
| AUC-ROC | 98.7% |
| Sensitivity (Recall) | 93.7% |
| Specificity | 98.1% |
| PPV (Precision) | 96.7% |
| NPV | 96.2% |
| F1 Score | 96.4% |

Background: Thyroid Nodule Risk Stratification

Thyroid nodules are extremely common, found in up to 68% of adults on ultrasound. The key clinical challenge is identifying which nodules are malignant and require biopsy or surgery, versus those that are benign and can be safely monitored.

The ACR TI-RADS (Thyroid Imaging Reporting and Data System) provides a standardized scoring framework based on five ultrasound features:

  1. Composition (cystic, mixed, solid)
  2. Echogenicity (anechoic, hyperechoic, isoechoic, hypoechoic, very hypoechoic)
  3. Shape (wider-than-tall vs taller-than-wide)
  4. Margin (smooth, lobulated, irregular, extrathyroidal extension)
  5. Echogenic Foci (none, comet-tail, macrocalcifications, peripheral/rim, punctate)
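
Under the 2017 ACR TI-RADS white paper, points from these five features are summed and mapped to a TR level (TR1 through TR5) that drives the biopsy/follow-up recommendation. A minimal sketch of that scoring, with the dictionary keys and helper name chosen by us for illustration:

```python
# Illustrative ACR TI-RADS point calculator (2017 white paper values).
# Category keys and the helper name are our own; not clinical software.
COMPOSITION = {"cystic": 0, "spongiform": 0, "mixed": 1, "solid": 2}
ECHOGENICITY = {"anechoic": 0, "hyperechoic": 1, "isoechoic": 1,
                "hypoechoic": 2, "very_hypoechoic": 3}
SHAPE = {"wider_than_tall": 0, "taller_than_wide": 3}
MARGIN = {"smooth": 0, "ill_defined": 0, "lobulated": 2,
          "irregular": 2, "extrathyroidal_extension": 3}
FOCI = {"none": 0, "comet_tail": 0, "macrocalcifications": 1,
        "peripheral_rim": 2, "punctate": 3}

def tirads_level(composition, echogenicity, shape, margin, foci):
    """Sum feature points and map the total to a TR level."""
    points = (COMPOSITION[composition] + ECHOGENICITY[echogenicity]
              + SHAPE[shape] + MARGIN[margin] + FOCI[foci])
    if points == 0:
        level = "TR1"
    elif points <= 2:
        level = "TR2"
    elif points == 3:
        level = "TR3"
    elif points <= 6:
        level = "TR4"
    else:
        level = "TR5"
    return points, level

print(tirads_level("solid", "hypoechoic", "wider_than_tall", "smooth", "none"))
# (4, 'TR4') -- a solid hypoechoic nodule lands in TR4 ("moderately suspicious")
```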

While we initially aimed to predict individual TI-RADS features, publicly available datasets with per-feature annotations are scarce. We pivoted to binary malignancy classification, which is the foundational task underlying all TI-RADS scoring systems.


Dataset

We used the BTX24 thyroid ultrasound dataset, which contains:

| Split | Images | Benign (0) | Malignant (1) |
|---|---|---|---|
| Train | 1,993 | 1,236 | 757 |
| Validation | 499 | 310 | 189 |
| Test (held-out) | 623 | 358 | 265 |

  • Modality: Grayscale ultrasound
  • Image sizes: Variable (~270×270 to ~510×370)
  • Class balance: ~62% benign, ~38% malignant

We used stratified train_test_split (80/20) for train/validation. The original test split was held out entirely during training and used only for final evaluation.
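
The split above can be reproduced with scikit-learn; the file names and label counts below are stand-ins for the real BTX24 index, used only to show the stratification call:

```python
from sklearn.model_selection import train_test_split

# Stand-in paths/labels matching the combined train+validation pool
# (1,993 + 499 = 2,492 images, ~62% benign).
images = [f"img_{i}.png" for i in range(2492)]
labels = [0] * 1546 + [1] * 946

train_x, val_x, train_y, val_y = train_test_split(
    images, labels,
    test_size=0.2,        # 80/20 split
    stratify=labels,      # keep the benign/malignant ratio identical in both splits
    random_state=42,
)
print(len(train_x), len(val_x))  # 1993 499
```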


Model Architecture

We chose SwinV2-Base (microsoft/swinv2-base-patch4-window8-256) for several reasons:

  1. Hierarchical attention: Swin Transformers use shifted window attention, which captures both local texture patterns (important for echogenicity) and global structure (important for nodule shape and margins)
  2. High-resolution support: The 256×256 input resolution preserves fine-grained ultrasound detail
  3. Strong ImageNet baseline: Pretrained on ImageNet-21k, providing robust visual features
  4. Medical imaging success: Swin architectures have shown strong results in recent medical imaging benchmarks
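
As a sketch, the backbone can be configured for two classes as below; the architecture values mirror the SwinV2-Base checkpoint's config, and in practice you would call `Swinv2ForImageClassification.from_pretrained(..., num_labels=2, ignore_mismatched_sizes=True)` to start from the pretrained ImageNet-21k weights rather than random initialization:

```python
from transformers import Swinv2Config, Swinv2ForImageClassification

# Mirror microsoft/swinv2-base-patch4-window8-256, with a 2-class head.
config = Swinv2Config(
    image_size=256,
    patch_size=4,
    window_size=8,
    embed_dim=128,                 # "base" model width
    depths=[2, 2, 18, 2],
    num_heads=[4, 8, 16, 32],
    num_labels=2,
    id2label={0: "benign", 1: "malignant"},
    label2id={"benign": 0, "malignant": 1},
)
model = Swinv2ForImageClassification(config)   # randomly initialized here
print(model.config.num_labels, model.config.id2label[1])  # 2 malignant
```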

Training Configuration

| Hyperparameter | Value |
|---|---|
| Learning rate | 2e-5 |
| Batch size | 16 per device |
| Gradient accumulation | 2 steps |
| Effective batch size | 32 |
| Epochs | 30 (early stopping, patience=5) |
| Warmup steps | 100 |
| Weight decay | 0.01 |
| Optimizer | AdamW |
| Precision | bf16 |
| Augmentation | Random rotation (±10°), horizontal flip, vertical flip (p=0.3), brightness/contrast jitter |
| Metric for best model | ROC-AUC |

Results

Final Test Set Performance (Held-Out)

| Metric | Value | Clinical Interpretation |
|---|---|---|
| Accuracy | 96.4% | Overall correct prediction rate |
| AUC-ROC | 98.7% | Discrimination between benign and malignant |
| Sensitivity | 93.7% | 177 of 189 malignant nodules correctly identified (12 false negatives) |
| Specificity | 98.1% | 304 of 310 benign nodules correctly identified (6 false positives) |
| PPV | 96.7% | Of 183 nodules flagged malignant, 177 were actually malignant |
| NPV | 96.2% | Of 316 nodules flagged benign, 304 were actually benign |
| F1 Score | 96.4% | Harmonic mean of precision and recall |

Confusion Matrix:

              Predicted
           Benign  Malignant
Benign       304         6
Malignant     12       177

Per-Class Performance:

| Class | Precision | Recall (Sensitivity) | F1 |
|---|---|---|---|
| Benign | 96.2% | 98.1% | 97.1% |
| Malignant | 96.7% | 93.7% | 95.2% |
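
As a sanity check, each metric can be recomputed from the confusion matrix:

```python
# Recompute the reported metrics from the confusion matrix above.
tn, fp = 304, 6      # benign row:    predicted benign / predicted malignant
fn, tp = 12, 177     # malignant row: predicted benign / predicted malignant

sensitivity = tp / (tp + fn)                 # recall on the malignant class
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)                         # precision on the malignant class
npv = tn / (tn + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)
f1_malignant = 2 * ppv * sensitivity / (ppv + sensitivity)

print(f"sens={sensitivity:.1%} spec={specificity:.1%} ppv={ppv:.1%} "
      f"npv={npv:.1%} acc={accuracy:.1%} f1={f1_malignant:.1%}")
# sens=93.7% spec=98.1% ppv=96.7% npv=96.2% acc=96.4% f1=95.2%
```

The 95.2% here matches the malignant-class F1 in the per-class table; the 96.4% headline F1 corresponds to the support-weighted average across both classes.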

Comparison with Published Benchmarks

| Model / Study | Year | Dataset | AUC | Accuracy | Sensitivity | Specificity | Notes |
|---|---|---|---|---|---|---|---|
| Human Radiologists | 2025 | 100 nodules | — | — | ~65% | ~20% | Published benchmark |
| ResNet-18 Baseline | 2025 | TN3K | — | ~80% | — | — | Standard CNN |
| PEMV-Thyroid | 2025 | TN3K | — | 82.08% | — | — | Multi-view ResNet-18 |
| PEMV-Thyroid | 2025 | TN5000 | — | 86.50% | — | — | Best public CNN |
| EchoCare (Swin) | 2025 | EchoCareData | 86.48% | — | — | — | Foundation model, 4.5M images |
| FM_UIA Baseline | 2026 | FM_UIA | 91.55% | — | — | — | EfficientNet-B4 + FPN |
| Ours (SwinV2) | 2026 | BTX24 | 98.7% | 96.4% | 93.7% | 98.1% | Task-specific fine-tuning |

Key Observations

  1. Substantially surpasses EchoCare: 98.7% vs 86.5% AUC despite ~100× less training data
  2. Exceeds FM_UIA baseline: 98.7% vs 91.6% AUC
  3. Far exceeds radiologist sensitivity: 93.7% vs ~65% published
  4. Excellent specificity: 98.1% minimizes unnecessary biopsies

TN3K Cross-Dataset Evaluation

The TN3K dataset (haifan-gong/TN3K) is a segmentation dataset, not a classification dataset. It contains:

  • Ultrasound images + pixel-level nodule masks
  • Labels are test-image (0) and test-mask (1): no benign/malignant labels

TN3K is designed for nodule detection/segmentation tasks. Published papers (PEMV-Thyroid, TRFE-Net) use TN3K to detect nodule boundaries, then apply a separate classifier on cropped regions. Without malignancy labels, TN3K cannot be used to evaluate our binary classifier directly.
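
That detect-then-classify pattern can be sketched as follows; `crop_nodule` and its margin are illustrative helpers, not code from those papers or this repo:

```python
import numpy as np

def crop_nodule(image, mask, margin=10):
    """Crop the bounding box of a binary nodule mask, padded by a pixel margin."""
    ys, xs = np.nonzero(mask)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    h, w = mask.shape
    y0, x0 = max(0, y0 - margin), max(0, x0 - margin)
    y1, x1 = min(h, y1 + margin), min(w, x1 + margin)
    return image[y0:y1, x0:x1]   # this crop would be fed to the malignancy classifier

# Toy example: 100x100 frame with a rectangular "nodule" mask.
img = np.random.rand(100, 100)
mask = np.zeros((100, 100), dtype=bool)
mask[40:60, 30:70] = True
print(crop_nodule(img, mask).shape)  # (40, 60)
```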

For true cross-dataset validation, the following datasets would be needed:

  • TN5000: 5,000 thyroid ultrasound images with classification labels (Nature Scientific Data 2025)
  • ThyroidXL: Pathology-validated dataset with TI-RADS annotations (MICCAI 2025, gated)
  • Custom hospital dataset: With histopathological confirmation

Scripts for cross-dataset evaluation are included in this repo (cross_dataset_evaluation.py).


Clinical Relevance and Limitations

Why This Matters

  • Triage tool: High-sensitivity AI can flag suspicious nodules for priority review
  • Resource-constrained settings: Extends expert-level screening to underserved regions
  • Standardization: Reduces inter-reader variability in TI-RADS scoring

Limitations

  1. Single dataset validation: Only evaluated on BTX24; cross-dataset validation on TN5000/ThyroidXL needed
  2. Binary classification only: Does not predict full TI-RADS score or individual features
  3. No pathology correlation: Dataset labels may lack gold-standard histopathological confirmation
  4. Test-validation gap: 98.7% test AUC vs 89.1% validation AUC suggests potential distribution differences
  5. Regulatory: Research model only; not FDA/CE approved

How to Use

```python
from transformers import pipeline

classifier = pipeline("image-classification", model="Johnyquest7/ML-Inter_thyroid")
result = classifier("thyroid_ultrasound.jpg")
print(result)
# [{'label': 'benign', 'score': 0.92}, {'label': 'malignant', 'score': 0.08}]
```

Repository Contents

| File | Description |
|---|---|
| `train_thyroid.py` | Full training script with SwinV2 fine-tuning |
| `evaluate_simple.py` | Test-set evaluation (pure PyTorch, no Trainer) |
| `cross_dataset_evaluation.py` | Cross-dataset evaluation framework |
| `generate_gradcam_locally.py` | Grad-CAM visualization generator |
| `thyroid_metrics.json` | Complete test-set metrics (JSON) |
| `blog_post.md` | Detailed technical blog post |
| `physician-guide.md` | Guide for clinicians replicating this workflow |

Citation

```bibtex
@misc{mlinter_thyroid_2026,
  title={Thyroid Ultrasound Nodule Malignancy Classification with SwinV2},
  author={Johnyquest7},
  year={2026},
  howpublished={\url{https://huggingface.co/Johnyquest7/ML-Inter_thyroid}}
}
```

This project was developed as part of the ML-Intern program. Model: Johnyquest7/ML-Inter_thyroid. Scripts: thyroid-training-scripts.
