The Viral Sentry AI (VirSentAI) model uses machine learning to predict zoonotic potential from viral sequences and metadata, and is designed to assist researchers and public health officials in identifying emerging threats. The VirSentAI model (virsentai-v2-hyena-dna-16k) was obtained by fine-tuning the pretrained HyenaDNA model LongSafari/hyenadna-medium-160k-seqlen-hf, which handles DNA sequences up to 160,000 bases long and therefore supports analysis of complete viral genomes.
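
For orientation, the base checkpoint can be loaded through the Hugging Face transformers API before fine-tuning. This is a minimal sketch, assuming the checkpoint's remote code exposes a sequence-classification head; the binary label setup (human-infecting vs. non-human host) is an illustrative assumption, not the released training script.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Base checkpoint that VirSentAI was fine-tuned from.
BASE_CHECKPOINT = "LongSafari/hyenadna-medium-160k-seqlen-hf"

# trust_remote_code is needed because the HyenaDNA checkpoints ship custom model code.
tokenizer = AutoTokenizer.from_pretrained(BASE_CHECKPOINT, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    BASE_CHECKPOINT,
    num_labels=2,               # illustrative: human-infecting vs. non-human host
    torch_dtype=torch.float16,  # 16-bit precision, matching the training setup below
    trust_remote_code=True,
)
```
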
- Databases: NCBI, VirusHostDB, and ViPR (viprbrc.org).
- Data points: 31,728 complete viral genomes (maximum length 160,000 bases), balanced between human and non-human hosts.
- Precision: 16-bit floating point for efficient computation.
- Epochs: 15.
- Batch size: 2.
- Gradient accumulation: 8 steps. Accumulating gradients over several forward/backward passes before each optimizer update is a key technique for training large models on limited hardware, and here yields an effective batch size of 2 × 8 = 16.
- Optimizer: AdamW, an adaptive optimizer well suited to transformer-family sequence models (the full training configuration is sketched below).
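
Taken together, these hyperparameters correspond roughly to the Trainer configuration below. This is a hedged sketch rather than the exact training script: only the values listed above come from the training run, while the output directory and the `train_ds`/`eval_ds` dataset objects are placeholders.

```python
from transformers import Trainer, TrainingArguments

# Values from the list above; paths and datasets are placeholders.
training_args = TrainingArguments(
    output_dir="virsentai-finetune",  # placeholder output path
    fp16=True,                        # 16-bit float precision
    num_train_epochs=15,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,    # effective batch size: 2 x 8 = 16
    optim="adamw_torch",              # AdamW optimizer
)

trainer = Trainer(
    model=model,             # HyenaDNA classification model loaded earlier
    args=training_args,
    train_dataset=train_ds,  # placeholder: tokenized genome dataset
    eval_dataset=eval_ds,    # placeholder: held-out genomes
)
trainer.train()
```
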
HyenaDNA's main advantage is its ability to model extremely long sequences efficiently, with single-nucleotide precision and global context, thanks to its Hyena operator. This makes it well suited to genomics and to any application where long-range dependencies in sequence data are critical. The tables below compare VirSentAI with the Evo models and contrast the underlying HyenaDNA and StripedHyena architectures, highlighting the advantages this topology offers over traditional Transformer models:

| Feature | VirSentAI Model | Evo Model |
|---|---|---|
| Model Size | 120 M parameters (less than 30 MB) | 7B parameters (Evo-1), 40B (Evo-2) |
| Model Tokenization | Single-nucleotide, byte-level resolution | Single-nucleotide, byte-level resolution |
| Training Time | Less than one week on a 24 GB GPU processing 4 sequences at a time (16 GB suffices for fewer sequences) | Approximately 4 weeks on large datasets |
| Inference Resources | At most 13.5 GB VRAM for 4 sequences; inference can also run on CPU with system RAM | High-memory GPUs needed for million-base context |
| Max Sequence Length | 160,000 base pairs | Up to 1 million base pairs (Evo-2) |
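
As a rough illustration of the inference footprint quoted above, sequences can be scored in half precision under `torch.inference_mode()`. The example sequence and the choice of which logit corresponds to the human-host class are placeholders, not specifications of the released model; the snippet assumes the model and tokenizer loaded in the earlier sketches.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"  # CPU inference also works, just more slowly
model.to(device).eval()

# Placeholder fragment; real inputs are complete viral genomes up to 160,000 bases.
sequence = "ATGACCGTAGGTACCTTAAG" * 500

input_ids = tokenizer(sequence, return_tensors="pt")["input_ids"].to(device)
with torch.inference_mode():
    logits = model(input_ids).logits

probs = torch.softmax(logits.float(), dim=-1)
print(f"Zoonotic-potential score: {probs[0, 1].item():.3f}")  # label index 1 = human host (assumed)
```
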

| Architecture Aspect | HyenaDNA | StripedHyena |
|---|---|---|
| Core Architecture | Stack of Hyena operators replacing attention with implicit convolutions and gating | Hybrid of rotary grouped attention and gated convolutions arranged in Hyena blocks |
| Efficiency | 160x faster training at sequence length 1M compared to Transformers | Faster training and inference at long context (>3x speed at 131k tokens compared to Transformers) |
| Scalability | Subquadratic operations enabling much longer context windows (up to 1M tokens) | Near-linear compute/memory scaling with sequence length |
| Application Focus | Genomic foundation model specialized for DNA sequences | General sequence modeling including biological sequences |
| Attention Mechanism | No traditional attention; replaced by efficient Hyena operator | Partial rotary grouped attention combined with Hyena layers |
| Autoregressive Generation | Not specified, focused on long context understanding | Supports efficient generation with caching |
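
The efficiency and scalability rows can be made concrete with a back-of-the-envelope estimate: dense self-attention materializes an L × L score matrix, which is exactly what Hyena-style operators avoid. The figures below are purely illustrative (fp16 scores, a single head, no optimizations such as FlashAttention) and are not measurements of either model.

```python
def attention_matrix_bytes(seq_len: int, bytes_per_element: int = 2) -> int:
    """Memory for one dense L x L attention score matrix (fp16, single head)."""
    return seq_len * seq_len * bytes_per_element

for length in (16_000, 160_000, 1_000_000):
    gib = attention_matrix_bytes(length) / 1024**3
    print(f"L = {length:>9,}: ~{gib:,.1f} GiB per attention matrix")

# Quadratic growth in L is why subquadratic operators (Hyena, StripedHyena)
# are needed to reach 160k-1M token contexts on commodity GPUs.
```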