TN5000 Thyroid Ultrasound Classifier

Fine-tuned SwinV2-Base for Benign vs Malignant Thyroid Nodule Classification

📋 Table of Contents

  1. Overview
  2. Model Architecture
  3. Dataset
  4. Training Methodology
  5. Results
  6. External Validation
  7. How to Use
  8. Limitations & Disclaimers
  9. Citation

Overview

This model classifies thyroid ultrasound images as benign or malignant and is designed to assist in the early detection of thyroid cancer. It was fine-tuned from Microsoft's SwinV2-Base vision transformer on the TN5000 ROI dataset from Kaggle.

Key Design Decisions:

  • Optimized for sensitivity (87.5% on the held-out test set) to minimize missed malignancies, which is critical in cancer screening
  • AUC-ROC of 0.94 on the internal validation and test sets indicates strong in-domain discriminative ability
  • Focal loss with class weighting handles the benign/malignant class imbalance
  • Early stopping prevents overfitting on the small medical dataset

Model Architecture

Property            Value
Base Model          microsoft/swinv2-base-patch4-window8-256
Architecture        Swin Transformer V2
Parameters          86.9M
Input Size          256 × 256
Patch Size          4 × 4
Window Size         8 × 8
Number of Classes   2 (benign, malignant)
License             Apache 2.0

Why SwinV2? Swin Transformers use hierarchical feature maps and shifted-window attention, which makes them well suited to medical imaging, where local texture patterns (echogenicity, microcalcifications, irregular margins) are diagnostically important. SwinV2 improves training stability through residual post-normalization and scaled cosine attention, which allows the architecture to scale to larger capacity and resolution.
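To make the hierarchy concrete, the sketch below works through the token-grid arithmetic implied by the table above: a 256 × 256 input, 4 × 4 patches, and 8 × 8 attention windows. The four-stage layout with 2× downsampling between stages follows the general Swin design and is assumed here rather than read from this model's config; `stage_layout` is an illustrative helper, not a library function.

```python
def stage_layout(image_size=256, patch_size=4, window_size=8, num_stages=4):
    """Return (tokens_per_side, window_count) for each stage of a Swin-style hierarchy."""
    side = image_size // patch_size  # 256 / 4 = 64 patch tokens per side at stage 1
    layout = []
    for _ in range(num_stages):
        # A window cannot be larger than the feature map it partitions.
        win = min(window_size, side)
        layout.append((side, (side // win) ** 2))
        side //= 2  # patch merging halves the resolution between stages
    return layout


for stage, (side, windows) in enumerate(stage_layout(), start=1):
    print(f"stage {stage}: {side}x{side} tokens, {windows} attention windows")
```

Under these assumptions the last stage is an 8 × 8 grid covered by a single window, so attention there is effectively global.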


Dataset

Primary Training Dataset: TN5000 ROI

Split              Images   Benign   Malignant
Train (80%)        2,800    ~800     ~2,000
Validation (20%)   700      200      500
Test (held-out)    750      208      542

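Class Distribution: The dataset is moderately imbalanced, with malignant cases outnumbering benign ones by roughly 5:2 (consistent with the confusion matrices in Results). We used balanced class weights (benign: 1.75, malignant: 0.70), which upweight the minority benign class, together with focal loss (γ=2.0), which concentrates training on hard-to-classify examples.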
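The quoted weights match the common "balanced" heuristic, weight_c = N / (K · n_c), if the training split holds 800 benign and 2,000 malignant images. Those counts are an inference from the weights themselves and from the 200 benign / 500 malignant rows of the validation confusion matrix; they are not stated by the dataset. A minimal sketch under that assumption:

```python
def balanced_class_weights(counts):
    """weight_c = N / (K * n_c), with N the total sample count and K the number of classes."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


# Assumed training counts (inferred, see the note above).
weights = balanced_class_weights({"benign": 800, "malignant": 2000})
print(weights)
```

With this heuristic the rarer class always receives the larger weight, so here it is errors on benign cases that are scaled up.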

External Validation Dataset


Training Methodology

Data Preprocessing

Transform         Training                       Validation/Test
Resize            RandomResizedCrop(256)         Resize(256) + CenterCrop(256)
Horizontal Flip   50% probability                No
Rotation          ±10°                           No
Color Jitter      brightness=0.2, contrast=0.2   No
Normalization     ImageNet mean/std              ImageNet mean/std

Training Configuration

learning_rate: 2e-5
batch_size: 16 (per device)
gradient_accumulation_steps: 2
effective_batch_size: 32
epochs: 30 (early stopping patience: 5)
warmup_ratio: 0.1
optimizer: AdamW (β1=0.9, β2=0.999)
scheduler: Linear with warmup
mixed_precision: bf16
seed: 42
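As a rough illustration of the schedule, the sketch below computes the learning rate after a given number of optimizer steps for linear warmup followed by linear decay, using the base rate and warmup ratio listed above. The total of 2,640 steps is a hypothetical figure (about 88 steps per epoch for 2,800 images at an effective batch size of 32, over 30 epochs); the actual run stopped early, and `lr_at` is an illustrative helper rather than a library function.

```python
def lr_at(step, total_steps=2640, base_lr=2e-5, warmup_ratio=0.1):
    """Learning rate after `step` optimizer updates: linear warmup, then linear decay to zero."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 132, 264, 1452, 2640):
    print(step, f"{lr_at(step):.2e}")
```

The rate peaks at 2e-5 once warmup ends (step 264 under these assumptions) and then falls linearly to zero at the final step.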

Loss Function: Focal Loss

Standard cross-entropy treats all misclassifications equally. In thyroid screening, missing a malignant case (false negative) is far more costly than a false alarm. We used focal loss:

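FL(p_t) = −(1 − p_t)^γ · log(p_t)

Here p_t is the predicted probability of the true class. With γ=2.0, the (1 − p_t)^γ factor shrinks the loss on confidently correct examples, so training concentrates on hard-to-classify cases. The class weights (benign: 1.75, malignant: 0.70) additionally upweight the minority benign class to offset the imbalance.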
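A small numeric sketch of the formula above shows how the modulating factor behaves for a single example. The optional `alpha_t` multiplier stands in for a per-class weight; how the class weights were actually combined with the focal term during training is not stated here, so treat that part as an assumption.

```python
import math


def focal_loss(p_t, gamma=2.0, alpha_t=1.0):
    """Focal loss for one example; p_t is the predicted probability of the true class."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


for p_t in (0.9, 0.6, 0.3):
    ce = focal_loss(p_t, gamma=0.0)  # gamma = 0 reduces to plain cross-entropy
    fl = focal_loss(p_t, gamma=2.0)
    print(f"p_t={p_t:.1f}  cross-entropy={ce:.4f}  focal={fl:.4f}")
```

An example the model already gets right with p_t = 0.9 keeps only about 1% of its cross-entropy loss, while a hard example with p_t = 0.3 keeps about 49%.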

Model Selection Criterion

The best model was selected by validation AUC-ROC (not accuracy), ensuring optimal discrimination between benign and malignant cases across all thresholds.
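For readers unfamiliar with the metric, AUC-ROC can be read as the probability that a randomly chosen malignant case receives a higher score than a randomly chosen benign one. The sketch below computes it directly from that definition on made-up scores; it is a quadratic-time illustration, not the evaluation code used for this model.

```python
def auc_roc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC-ROC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = malignant, 0 = benign; scores are made-up malignant probabilities.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.2]
print(auc_roc(labels, scores))
```

Because the value depends only on how cases are ranked, it does not change with the decision threshold, which is why it suits checkpoint selection better than accuracy at a fixed cut-off.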


Results

Validation Set (700 images)

Metric Value
Accuracy 87.9%
F1-Score 91.3%
Sensitivity (Recall) 88.8%
Specificity 85.5%
PPV (Precision) 93.9%
NPV 75.3%
AUC-ROC 0.940

Confusion Matrix:

              Predicted
           Benign  Malignant
Actual Benign    171       29
      Malignant   56      444
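The table values can be re-derived from the confusion matrix, treating malignant as the positive class. The helper name `binary_metrics` is illustrative.

```python
def binary_metrics(tp, fn, fp, tn):
    """Standard screening metrics with malignant as the positive class."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Validation confusion matrix from above: rows are actual, columns are predicted.
val = binary_metrics(tp=444, fn=56, fp=29, tn=171)
for name, value in val.items():
    print(f"{name}: {value:.1%}")
```

The same function applied to the other confusion matrices in this card gives a quick consistency check on their metric tables.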

Test Set (750 images, held out)

Metric Value
Accuracy 87.2%
F1-Score 90.8%
Sensitivity (Recall) 87.5%
Specificity 86.5%
PPV (Precision) 94.4%
NPV 72.6%
AUC-ROC 0.937

Confusion Matrix:

              Predicted
           Benign  Malignant
Actual Benign    180       28
      Malignant   68      474

Training Curves

The model converged around epoch 18-22 with validation AUC-ROC peaking at 0.940. Early stopping triggered at epoch 27, loading the best checkpoint.

Epoch   Train Loss   Val AUC-ROC   Val Sensitivity   Val Specificity
1       0.356        0.713         0.714             0.590
5       0.229        0.912         0.940             0.715
10      0.187        0.922         0.858             0.835
15      0.148        0.934         0.928             0.805
18      0.125        0.939         0.846             0.885
22      0.143        0.940         0.888             0.855
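The stopping behaviour described above (best AUC-ROC at epoch 22, stop at epoch 27 with patience 5) can be reproduced with a small patience counter. The per-epoch values below are hypothetical: only the 0.940 peak at epoch 22 comes from the table, the rest are interpolated for illustration, and `EarlyStopper` is not the callback actually used in training.

```python
class EarlyStopper:
    """Stop when the monitored metric has not improved for `patience` consecutive epochs."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best_value = float("-inf")
        self.best_epoch = None
        self.stale_epochs = 0

    def update(self, epoch, value):
        """Record one epoch; return True when training should stop."""
        if value > self.best_value:
            self.best_value, self.best_epoch = value, epoch
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience


# Hypothetical validation AUC-ROC per epoch: rises to 0.940 at epoch 22, then plateaus lower.
history = {epoch: 0.70 + 0.24 * epoch / 22 for epoch in range(1, 23)}
history.update({epoch: 0.938 for epoch in range(23, 31)})

stopper = EarlyStopper(patience=5)
for epoch in sorted(history):
    if stopper.update(epoch, history[epoch]):
        break
print(f"stopped at epoch {epoch}, best epoch {stopper.best_epoch}")
```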

External Validation

To assess generalization, we tested the model on an independent dataset without any fine-tuning:

Metric Value
Accuracy 66.8%
F1-Score 44.7%
Sensitivity 34.5%
Specificity 87.4%
PPV 63.5%
NPV 67.7%
AUC-ROC 0.707

Confusion Matrix (External):

              Predicted
           Benign  Malignant
Actual Benign   1665      240
      Malignant  793      417

Analysis: External validation shows a substantial performance drop (AUC 0.94 → 0.71, sensitivity 87.5% → 34.5%). Likely contributing factors include:

  1. Domain shift: Different ultrasound machines, protocols, and image preprocessing
  2. Different ROI extraction: The external dataset may use different cropping strategies
  3. Population differences: Different patient demographics and disease prevalence

This highlights the importance of domain adaptation or fine-tuning on local data before clinical deployment.
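Prevalence differences alone change the predictive values, even for a classifier whose sensitivity and specificity stay the same. The sketch below holds the test-set operating point fixed and recomputes PPV and NPV at each dataset's malignancy prevalence, taken from the confusion matrices. It is a thought experiment that isolates the prevalence effect, not a model of the observed drop, since external sensitivity itself fell to 34.5%.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """PPV and NPV from Bayes' rule for a given malignancy prevalence."""
    tp = sensitivity * prevalence
    fn = (1.0 - sensitivity) * prevalence
    tn = specificity * (1.0 - prevalence)
    fp = (1.0 - specificity) * (1.0 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)


# Test-set operating point from the Results section, held fixed.
sens, spec = 0.875, 0.865
for name, prevalence in (("internal test", 542 / 750), ("external", 1210 / 3115)):
    ppv, npv = predictive_values(sens, spec, prevalence)
    print(f"{name}: prevalence={prevalence:.1%}  PPV={ppv:.1%}  NPV={npv:.1%}")
```

At the lower external prevalence, PPV falls and NPV rises under this fixed operating point, so predictive values reported on one population should not be carried over to another.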


How to Use

Quick Inference with Pipeline

from transformers import pipeline
from PIL import Image

# Load model
classifier = pipeline("image-classification", model="Johnyquest7/TN5000_model")

# Predict
image = Image.open("thyroid_ultrasound.png").convert("RGB")
results = classifier(image)

# Results format:
# [{'label': 'malignant', 'score': 0.944}, {'label': 'benign', 'score': 0.056}]
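The pipeline returns its list sorted by score, so the first entry is the top prediction rather than a fixed class. A small helper (the name `malignant_score` is illustrative) can pull out the malignant probability regardless of order:

```python
def malignant_score(results):
    """Return the 'malignant' score from pipeline-style output, independent of list order."""
    scores = {item["label"]: item["score"] for item in results}
    if "malignant" not in scores:
        raise KeyError(f"no 'malignant' label in {sorted(scores)}")
    return scores["malignant"]


# Example output copied from the results format shown above.
results = [{"label": "malignant", "score": 0.944}, {"label": "benign", "score": 0.056}]
print(f"P(malignant) = {malignant_score(results):.3f}")
```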

Manual Inference

import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Load model and processor
model = AutoModelForImageClassification.from_pretrained("Johnyquest7/TN5000_model")
processor = AutoImageProcessor.from_pretrained("Johnyquest7/TN5000_model")

# Preprocess
image = Image.open("thyroid_ultrasound.png").convert("RGB")
inputs = processor(image, return_tensors="pt")

# Predict
with torch.no_grad():
    logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]

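# Get probabilities (assumes index 0 = benign and index 1 = malignant;
# check model.config.id2label to confirm the mapping)
malignant_prob = probs[1].item()
benign_prob = probs[0].item()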

print(f"Malignant: {malignant_prob:.1%}")
print(f"Benign: {benign_prob:.1%}")

Gradio Demo

Try the live demo: 🩺 Thyroid Nodule Classifier Demo


Limitations & Disclaimers

โš ๏ธ CRITICAL: This model is for research and educational purposes only.

  1. Not FDA-approved for clinical use
  2. External validation showed substantial performance degradation (AUC 0.71 vs 0.94; sensitivity 34.5% vs 87.5%), so domain shift is a real concern
  3. Trained on ROI patches, not full ultrasound images: the model expects pre-cropped nodule regions
  4. Class imbalance in training data may bias predictions
  5. No multi-institutional validation: performance may vary across hospitals and equipment
  6. Always consult a radiologist or endocrinologist for diagnosis

Intended Use Cases:

  • Research on AI-assisted thyroid screening
  • Educational tool for medical students
  • Prototype for integration into PACS (with proper validation)

Not Intended For:

  • Direct patient diagnosis
  • Replacing human radiologists
  • Screening without supervision

Citation

If you use this model in your research, please cite:

@misc{tn5000_model,
  title={TN5000 Thyroid Ultrasound Classifier},
  author={Johnyquest7},
  year={2026},
  howpublished={\url{https://huggingface.co/Johnyquest7/TN5000_model}},
  note={Fine-tuned SwinV2-Base for benign vs malignant thyroid nodule classification}
}

Base Model:

@inproceedings{liu2022swinv2,
  title={Swin Transformer V2: Scaling Up Capacity and Resolution},
  author={Liu, Ze and Hu, Han and Lin, Yutong and Yao, Zhuliang and Xie, Zhenda and Wei, Yixuan and Ning, Jia and Cao, Yue and Zhang, Zheng and Dong, Li and Wei, Furu and Guo, Baining},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}

Dataset:


Acknowledgments

  • Model trained using Hugging Face Transformers and Datasets libraries
  • Compute provided by Hugging Face GPU credits
  • Base model: Microsoft SwinV2-Base

Generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.
