Research
I am broadly interested in interpretable machine learning for biological discovery, with a focus on genomic resistance prediction in Mycobacterium tuberculosis. My work bridges protein sequence modeling, evolutionary augmentation, and causal variant discovery, aiming to make machine learning models biologically faithful and practically useful for antimicrobial resistance surveillance.
BIG-TB Benchmark
The BIG-TB Benchmark is a large-scale, multimodal dataset of over 17,000 M. tuberculosis isolates spanning 11 WHO-priority antibiotics.
It standardizes resistance prediction as a unified ML task and enables fair comparison across DNA-, protein-, and structure-based models.
- Designed a cross-validated benchmark for resistance prediction and interpretability
- Integrated genomic, proteomic, and evolutionary features across drug pathways
- Evaluated CNN and Transformer models alongside foundation-model embeddings (ESM2, DNABERT)
📄 Manuscript in preparation: “BIG-TB: A Benchmark Dataset for Genomic Resistance Prediction and Interpretability in M. tuberculosis”
Resistance Forecast Project
The Resistance Forecast project combines structural bioinformatics, machine learning, and evolutionary constraints to identify causal variants driving resistance.
We integrate:
- ΔΔG stability changes from Rosetta and FoldX
- 3D-structure-aware proximity metrics and fused-ridge feature models
- SHAP-based variant attribution for explainability
Goal: Bridge mechanistic biology with interpretable ML, yielding causal insights into mutation effects.
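To make the fused-ridge idea concrete, here is a minimal sketch rather than the project's actual implementation. It assumes each feature corresponds to one variant position and that an edge list marks pairs of positions that are close in the 3D structure; the names fused_ridge, edges, lam_ridge, and lam_fuse are hypothetical. The penalty pulls the coefficients of structurally neighbouring positions toward each other, and the minimiser has the closed form (XᵀX + lam_ridge·I + lam_fuse·L)w = Xᵀy, where L is the graph Laplacian of the proximity graph.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def solve_linear(A: Matrix, b: Sequence[float]) -> List[float]:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fused_ridge(
    X: Matrix,
    y: Sequence[float],
    edges: Sequence[Tuple[int, int]],
    lam_ridge: float = 1.0,
    lam_fuse: float = 1.0,
) -> List[float]:
    """Minimise ||y - Xw||^2 + lam_ridge*||w||^2 + lam_fuse*sum_(i,j)(w_i - w_j)^2.

    `edges` lists each structurally adjacent feature pair once (an assumption
    of this sketch; how proximity is defined is not specified here).
    """
    p = len(X[0])
    # X^T X plus the ridge term on the diagonal.
    A = [
        [sum(row[i] * row[j] for row in X) + (lam_ridge if i == j else 0.0)
         for j in range(p)]
        for i in range(p)
    ]
    # Add lam_fuse times the graph Laplacian of the proximity graph.
    for i, j in edges:
        A[i][i] += lam_fuse
        A[j][j] += lam_fuse
        A[i][j] -= lam_fuse
        A[j][i] -= lam_fuse
    b = [sum(row[i] * yk for row, yk in zip(X, y)) for i in range(p)]
    return solve_linear(A, b)


# Toy data: three features, only the first carries signal,
# and features 0 and 1 are treated as neighbours in the structure.
X_toy = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y_toy = [1.0, 0.0, 0.0]
w_plain = fused_ridge(X_toy, y_toy, edges=[], lam_ridge=1.0, lam_fuse=0.0)
w_fused = fused_ridge(X_toy, y_toy, edges=[(0, 1)], lam_ridge=1.0, lam_fuse=10.0)
```

In the toy example the fusion penalty shares weight between the two neighbouring features, whereas plain ridge leaves the second one at zero. A real pipeline would more likely use a numerical library for the linear solve; the pure-Python solver is only there to keep the sketch self-contained.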
Evolutionary Augmentation
I develop phylogenetic data augmentation techniques that extend supervised ML datasets using multi-species homologous sequences (via UniProt and InterPro).
This method increases training-set diversity and model robustness in low-data antibiotic resistance settings.
- Treats homologs as weakly-labeled susceptible sequences
- Uses ESM2 embeddings and InterProScan domain annotations for fine-tuning
- Improves generalization and variant recovery on rare-mutation datasets
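The weak-labeling step can be sketched as follows. This is an illustrative outline under stated assumptions, not the actual pipeline: the Example class, the augment_with_homologs function, and the 0.2 down-weighting factor are all hypothetical. Homologs are added with a susceptible label and a reduced sample weight so that a weighted loss trusts them less than experimentally labeled isolates, and a homolog never overrides a sequence that already carries a real label.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Example:
    sequence: str
    label: int      # 1 = resistant, 0 = susceptible
    weight: float   # per-sample weight for a weighted training loss


def augment_with_homologs(
    labeled: Iterable[Tuple[str, int]],
    homologs: Iterable[str],
    homolog_weight: float = 0.2,  # hypothetical value; would need tuning
) -> List[Example]:
    """Combine labeled sequences with homologs treated as weak negatives."""
    labeled = list(labeled)
    seen = {seq for seq, _ in labeled}
    out = [Example(seq, lab, 1.0) for seq, lab in labeled]
    for seq in homologs:
        if seq in seen:
            continue  # keep real labels; skip duplicate homologs
        seen.add(seq)
        out.append(Example(seq, 0, homolog_weight))
    return out


# Toy sequences, for illustration only.
train = augment_with_homologs(
    labeled=[("MKT", 1), ("MKA", 0)],
    homologs=["MKA", "MRT", "MRT"],
)
```

The resulting weights can be passed to any trainer that accepts per-sample weights; how strongly to down-weight homologs is a modeling choice that this sketch does not settle.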
Related work: “Beyond Sequence-Only Models: Leveraging Structural Constraints for Antibiotic Resistance Prediction” (ICLR MLGenX 2025 Workshop)
Other Interests
I’m also exploring:
- Multi-modal integration of protein and genomic embeddings
- Transfer learning for cross-species resistance prediction
- Benchmark design and interpretability evaluation pipelines
For full paper details, see my Publications page.