Our Model: VirSentAI

The Viral Sentry AI (VirSentAI) model uses machine learning to predict the zoonotic potential of a virus from its genome sequence and associated metadata. It is designed to assist researchers and public health officials in identifying emerging threats. The released model (virsentai-v2-hyena-dna-16k) was obtained by fine-tuning the pretrained HyenaDNA model LongSafari/hyenadna-medium-160k-seqlen-hf, which handles DNA sequences up to 160,000 bases long and therefore enables robust analysis of complete viral genomes.
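For orientation, the sketch below shows how the pretrained checkpoint can be loaded as a binary sequence classifier with the Hugging Face transformers library. It is a minimal sketch, assuming the -hf checkpoint exposes a sequence-classification head via trust_remote_code; it is not the actual VirSentAI training code.

```python
# Minimal sketch of the fine-tuning setup, assuming the Hugging Face
# transformers library and that the -hf checkpoint supports
# AutoModelForSequenceClassification via trust_remote_code.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINT = "LongSafari/hyenadna-medium-160k-seqlen-hf"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    CHECKPOINT,
    num_labels=2,  # binary label: human host vs. non-human host
    trust_remote_code=True,
)
```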

Training & Hardware

Training Dataset: 31,728 viruses
Training Time: 150 hours
Hardware: NVIDIA GPU with 24 GB VRAM

Model Performance

Validation Accuracy: 0.8724
Training Accuracy: 0.8789
Validation AUROC: 0.9496
Training AUROC: 0.9512

Dataset

Databases: NCBI, VirusHostDB, ViPR

Data points: 31,728 complete genomes forming a balanced dataset of viruses with human and non-human hosts (maximum sequence length of 160,000 bases).
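As a rough illustration of how such a dataset could be assembled, the sketch below labels complete genomes by host and caps each sequence at the 160,000-base limit. The FASTA file names and the Biopython-based parsing are assumptions for illustration, not the actual VirSentAI data pipeline.

```python
# Hypothetical preprocessing sketch: label complete genomes by host and keep
# only the first 160,000 bases of each. The FASTA file names are placeholders.
from Bio import SeqIO  # Biopython

MAX_BASES = 160_000

def load_genomes(fasta_path, label):
    """Read genomes from a FASTA file and attach a binary host label."""
    records = []
    for rec in SeqIO.parse(fasta_path, "fasta"):
        records.append({
            "id": rec.id,
            "sequence": str(rec.seq)[:MAX_BASES],  # truncate to the model's context
            "label": label,  # 1 = human host, 0 = non-human host
        })
    return records

dataset = load_genomes("human_host.fasta", 1) + load_genomes("non_human_host.fasta", 0)
```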

Model Training Details

Precision: 16-bit floating-point (fp16) arithmetic for efficient computation.

Epochs: Trained for 15 epochs.

Batch Size: A batch size of 2 was used.

Gradient Accumulation: 8 steps. This is a key technique for training large models on limited hardware: gradients are accumulated over 8 mini-batches before each optimizer update, giving an effective batch size of 16 (a configuration sketch follows this list).

Optimizer: AdamW, an adaptive optimizer well-suited for transformer models.
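These settings map directly onto a standard Hugging Face Trainer configuration. The sketch below mirrors the reported hyperparameters; the output directory and dataset variables are placeholders, and this is not the actual VirSentAI training script.

```python
# Sketch of a Trainer configuration matching the hyperparameters listed above.
# `model` is the classifier from the loading sketch; the datasets are placeholders.
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="virsentai-finetune",   # placeholder output directory
    num_train_epochs=15,               # 15 epochs
    per_device_train_batch_size=2,     # batch size of 2
    gradient_accumulation_steps=8,     # effective batch size of 2 x 8 = 16
    fp16=True,                         # 16-bit float precision
    optim="adamw_torch",               # AdamW optimizer
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,       # tokenized training split (placeholder)
    eval_dataset=eval_dataset,         # tokenized validation split (placeholder)
)
trainer.train()
```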

Pre-trained HyenaDNA Model Advantages

HyenaDNA’s main advantage is its ability to efficiently model extremely long sequences with single-nucleotide precision and global context, thanks to its Hyena operator. This makes it well suited to genomics and to any application where long-range dependencies in sequence data are critical. The architecture offers several advantages over traditional Transformer models:

  • Massive Context Length: Supports sequences up to 1 million tokens, enabling the modeling of long-range dependencies crucial for genomics.
  • Subquadratic Scaling: Uses a Hyena operator, an efficient alternative to attention, allowing faster processing of long sequences with lower memory requirements.
  • Single Nucleotide Resolution: Pretrained at single nucleotide resolution, allowing for fine-grained analysis and precise downstream predictions (illustrated in the tokenization sketch after this list).
  • Global Receptive Field: The implicit long convolution allows every position in the sequence to influence every other position at every layer.
  • State-of-the-Art Performance: Set new state-of-the-art results on 23 downstream genomic tasks.
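To make the single-nucleotide resolution concrete, the short sketch below tokenizes a DNA string with the checkpoint's character-level tokenizer. Assuming it behaves as a standard character tokenizer, each base maps to exactly one token when special tokens are disabled.

```python
# Illustration of single-nucleotide tokenization (assumes the character-level
# tokenizer shipped with the -hf checkpoint; not part of the VirSentAI code).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "LongSafari/hyenadna-medium-160k-seqlen-hf", trust_remote_code=True
)

sequence = "ACGTACGTTTGACA"
ids = tokenizer(sequence, add_special_tokens=False)["input_ids"]
print(len(sequence), len(ids))  # with a character tokenizer, both lengths match
```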

DNA Models: VirSentAI vs. Evo

Feature comparison (VirSentAI Model vs. Evo Model):

  • Model Size: VirSentAI has 120M parameters (model size under 30 MB); Evo has 7B parameters (Evo-1) or 40B (Evo-2).
  • Model Tokenization: both use single-nucleotide, byte-level tokenization.
  • Training Time: VirSentAI trains in less than one week on a 24 GB GPU with 4 sequences per batch (16 GB suffices for fewer sequences per batch); Evo takes approximately 4 weeks on large datasets.
  • Inference Resources: VirSentAI needs at most 13.5 GB of VRAM for 4 sequences, and CPU-plus-RAM inference is practical; Evo needs high-memory GPUs for million-base context.
  • Max Sequence Length: VirSentAI handles 160,000 base pairs; Evo handles up to 1 million base pairs (Evo-2).

Architecture comparison (HyenaDNA vs. StripedHyena):

  • Core Architecture: HyenaDNA is a stack of Hyena operators that replaces attention with implicit convolutions and gating; StripedHyena is a hybrid of rotary grouped attention and gated convolutions arranged in Hyena blocks.
  • Efficiency: HyenaDNA trains up to 160x faster than Transformers at a sequence length of 1M; StripedHyena offers faster training and inference at long context (>3x the speed of Transformers at 131k tokens).
  • Scalability: HyenaDNA's subquadratic operations enable much longer context windows (up to 1M tokens); StripedHyena scales nearly linearly in compute and memory with sequence length.
  • Application Focus: HyenaDNA is a genomic foundation model specialized for DNA sequences; StripedHyena targets general sequence modeling, including biological sequences.
  • Attention Mechanism: HyenaDNA uses no traditional attention, replacing it with the efficient Hyena operator; StripedHyena combines partial rotary grouped attention with Hyena layers.
  • Autoregressive Generation: not specified for HyenaDNA, which focuses on long-context understanding; StripedHyena supports efficient generation with caching.

Limitations of VirSentAI

  • Limited training time: the AUROC curve indicates that the model can still improve with further training.
  • Limited sequence length: only the first 160,000 DNA bases of any viral sequence are used for prediction (see the inference sketch after this list).
  • Limited database scans: for the moment, VirSentAI scans only the NCBI database.
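As a consequence of the sequence-length limit, any genome longer than 160,000 bases is truncated before scoring. The hedged sketch below shows what single-sequence inference might look like, reusing the `model` and `tokenizer` objects from the loading sketch; the genome string is a placeholder, not real data.

```python
# Hypothetical inference sketch: score the zoonotic potential of one genome.
# Only the first 160,000 bases are used, mirroring the limitation above.
# `model` and `tokenizer` come from the loading sketch; `genome` is a placeholder.
import torch

genome = "ACGT" * 40_000  # placeholder; use a complete viral genome string

inputs = tokenizer(
    genome,
    truncation=True,
    max_length=160_000,  # bases beyond 160,000 are dropped
    return_tensors="pt",
)

model.eval()
with torch.no_grad():
    logits = model(input_ids=inputs["input_ids"]).logits

prob_human_host = torch.softmax(logits, dim=-1)[0, 1].item()
print(f"Predicted zoonotic potential: {prob_human_host:.3f}")
```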

References