BIG-TB: A Benchmark Dataset for Genomic Resistance Prediction and Interpretability

BIG-TB is a large-scale, multimodal benchmark dataset designed to advance machine-learning-based antibiotic resistance prediction and biological interpretability in Mycobacterium tuberculosis.

The dataset integrates genomic, protein, and structural features across 11 WHO-priority drugs and enables rigorous evaluation of both predictive accuracy and causal variant recovery.


🔬 Key Contributions

  • Curated 17K isolates with unified resistance labels and high-confidence mutation annotations from WHO 2023 and 2021 catalogues.
  • Developed reproducible pipelines for sequence, structure, and evolutionary data integration (DNA + protein + 3D features).
  • Introduced interpretability benchmarks via Precision@k and SHAP-based variant recovery.
  • Released accompanying code and documentation to promote transparency and reproducibility.

🧠 Methods

Models include:

  • CNNs and Transformers for sequence-level modeling
  • ESM-based protein embeddings and PCA-compressed representations
  • Linear fused-ridge baselines for interpretable structure-aware learning

🏆 Impact

Presented as a spotlight talk at the Machine Learning for Computational Biology (MLCB) Workshop 2025, co-located with NeurIPS.
BIG-TB serves as a foundation for evaluating biological faithfulness in ML systems and fostering explainable AI for global health.