Research

I am broadly interested in interpretable machine learning for biological discovery, with a focus on genomic resistance prediction in Mycobacterium tuberculosis. My work bridges protein sequence modeling, evolutionary augmentation, and causal variant discovery, aiming to make machine learning models biologically faithful and practically useful for antimicrobial resistance surveillance.


BIG-TB Benchmark

The BIG-TB Benchmark is a large-scale, multimodal dataset of over 17,000 M. tuberculosis isolates covering 11 WHO-priority antibiotics.
It standardizes resistance prediction as a unified ML task and enables fair comparison across DNA-, protein-, and structure-based models.
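
As a rough illustration of what a unified task can look like, the sketch below treats each isolate as one record with a per-drug binary label and scores any model through the same per-drug loop. The record layout, field names, and the `per_drug_metrics` helper are assumptions made for this example, not the BIG-TB schema or its evaluation code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Isolate:
    """Hypothetical record layout; the actual benchmark schema may differ."""
    isolate_id: str
    features: Dict[str, object]        # e.g. DNA-, protein-, or structure-derived inputs
    labels: Dict[str, Optional[int]]   # drug -> 1 resistant, 0 susceptible, None untested


def per_drug_metrics(
    isolates: List[Isolate],
    predict: Callable[[Isolate, str], int],
    drugs: List[str],
) -> Dict[str, Dict[str, float]]:
    """Compute sensitivity and specificity per drug, skipping untested isolates."""
    results: Dict[str, Dict[str, float]] = {}
    for drug in drugs:
        tp = fn = tn = fp = 0
        for iso in isolates:
            truth = iso.labels.get(drug)
            if truth is None:
                continue  # no phenotype recorded for this drug
            pred = predict(iso, drug)
            if truth == 1 and pred == 1:
                tp += 1
            elif truth == 1:
                fn += 1
            elif pred == 0:
                tn += 1
            else:
                fp += 1
        results[drug] = {
            "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
            "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
            "n": float(tp + fn + tn + fp),
        }
    return results
```

Because every model is reduced to the same `predict(isolate, drug)` signature, DNA-, protein-, and structure-based predictors can be scored on identical isolates and drugs.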

📄 Manuscript in preparation: “BIG-TB: A Benchmark Dataset for Genomic Resistance Prediction and Interpretability in M. tuberculosis”


Resistance Forecast Project

The Resistance Forecast project combines structural bioinformatics, machine learning, and evolutionary constraints to identify causal variants driving resistance.

We integrate:

Goal: Bridge mechanistic biology with interpretable ML, yielding causal insights into mutation effects.


Evolutionary Augmentation

I develop phylogenetic data augmentation techniques that extend supervised ML datasets using multi-species homologous sequences (via UniProt and InterPro).
These techniques increase training-data diversity and improve model robustness in low-data antibiotic resistance settings.
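
One simple way to picture homolog-guided augmentation is sketched below. It assumes the reference protein and its homologs are already aligned to equal length (with `-` for gaps), and that substitutions observed in homologs outside a set of protected positions leave the resistance label unchanged. Both assumptions, and the `augment_with_homologs` function itself, are illustrative and are not taken from the published method.

```python
import random
from typing import Dict, List, Sequence, Set


def augment_with_homologs(
    reference: str,
    homologs: Sequence[str],
    protected: Set[int],
    n_variants: int,
    max_subs: int = 2,
    seed: int = 0,
) -> List[str]:
    """Create variants of `reference` using residues seen in aligned homologs.

    Positions in `protected` (0-based, e.g. known resistance sites) are never
    changed. Each variant is assumed to inherit the label of `reference`.
    """
    if any(len(h) != len(reference) for h in homologs):
        raise ValueError("homologs must be aligned to the reference length")

    rng = random.Random(seed)

    # Collect, per column, the residues homologs carry that differ from the reference.
    alternatives: Dict[int, List[str]] = {}
    for pos, ref_res in enumerate(reference):
        if pos in protected or ref_res == "-":
            continue
        observed = sorted({h[pos] for h in homologs} - {ref_res, "-"})
        if observed:
            alternatives[pos] = observed

    positions = sorted(alternatives)
    if not positions:
        return []

    variants: List[str] = []
    for _ in range(n_variants):
        # Substitute between 1 and max_subs positions per variant.
        k = rng.randint(1, min(max_subs, len(positions)))
        seq = list(reference)
        for pos in rng.sample(positions, k):
            seq[pos] = rng.choice(alternatives[pos])
        variants.append("".join(seq))
    return variants
```

In practice the homolog set would come from a database query and a multiple sequence alignment step, both omitted here. Whether a label really survives a given substitution is an empirical question, so a sketch like this is best read as a starting point rather than a validated procedure.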

Related work: “Beyond Sequence-Only Models: Leveraging Structural Constraints for Antibiotic Resistance Prediction” (ICLR MLGenX 2025 Workshop)


Other Interests

I’m also exploring:

For full paper details, see my Publications page.