Caching FASTAs
Overview
The caching system for genomic region generation prevents redundant computation and speeds up repeated or similar requests. It generates a unique cache key for each specific set of parameters and stores the corresponding output path in a Redis cache through dogpile.cache, so the pipeline can reuse previous outputs rather than recompute them. The cache also handles locking, so genomic region generation tasks can run in parallel without downloading or computing the same files concurrently.
The genomic region generator runs when the genomic_region_generation_forms key in a pipeline’s formdata payload contains the appropriate configuration.
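The locking behaviour described in the overview can be illustrated with a standard-library sketch. This is not the dogpile.cache API: it swaps in a plain dictionary and per-key threading.Lock objects to show the general "get or create" pattern, in which only one worker generates the value for a given key while the others wait and then reuse it. The names get_or_create, _cache and _locks are hypothetical.

```python
import threading
from collections import defaultdict
from typing import Callable, Dict

# Hypothetical in-process stand-ins for the Redis-backed cache and its locks.
_cache: Dict[str, str] = {}
_locks: Dict[str, threading.Lock] = defaultdict(threading.Lock)
_locks_guard = threading.Lock()


def get_or_create(key: str, creator: Callable[[], str]) -> str:
    """Return the cached output path for `key`, running `creator` at most once."""
    with _locks_guard:  # protect the lock table itself
        lock = _locks[key]
    with lock:  # only one worker per key may generate
        if key not in _cache:
            _cache[key] = creator()
        return _cache[key]
```

Workers that request different keys do not block each other, because each key has its own lock.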
What is Cached?
Each high-level request is decomposed into individual region-specific forms (for example, one for gene, one for exon, etc.).
For each form:
- A unique cache key is generated from selected form fields
- If our cache contains the key and the associated path points to an existing directory, the region generation step is skipped and the results are reused.
- Otherwise, generation runs, its output files (such as .fna) are stored in a directory like the one below, and the output path is added to the cache:
  cache/generated/cached_genomic_<hash>/
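The decomposition step can be sketched as follows. This is a minimal illustration that assumes the form carries a genomic_regions mapping whose values are the strings "true" and "false"; the function name split_into_region_forms is hypothetical.

```python
import copy
from typing import Any, Dict, List


def split_into_region_forms(form: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return one form per requested region, with only that region set to "true"."""
    regions = form.get("genomic_regions", {})
    requested = [name for name, flag in regions.items() if flag == "true"]
    forms = []
    for selected in requested:
        single = copy.deepcopy(form)  # leave the original request untouched
        single["genomic_regions"] = {
            name: "true" if name == selected else "false" for name in regions
        }
        forms.append(single)
    return forms
```

Each returned form can then be hashed and looked up independently, so a request for several regions can still reuse regions that were cached by earlier requests.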
Two-Level Cache
The caching system employs a two-level cache strategy to optimize both performance and resource usage, with explicit workflows and directory layouts for each level:
- Level 1: Region FASTA Cache
  Level 1 is the cache for generated FASTA (.fna) files for each specific genomic region, such as a gene, exon, or other user-requested interval. Each Level 1 cache entry is keyed by a unique hash derived from a subset of the form fields (see below), which encode the selected data source, taxon/species, annotation release, and the precise set of genomic regions. Since these FASTA files are generated for a specific region selection, they are typically small and tied directly to the user’s query. The cache key is deterministic and ensures that only exact matches for the same region and parameters are reused.
  Example directory layout:
  cache/generated/
  ├── cached_genomic_<hash>/
  │   └── annotation/
  │       ├── gene1.fna
  │       └── gene2.fna
- Level 2: Raw Asset Cache from NCBI/Ensembl
  Level 2 is the heavy, persistent cache for raw data assets downloaded from NCBI or Ensembl. These assets are typically large, ranging from hundreds of megabytes to several gigabytes, and comprise the official compressed files (.fna.gz, .gtf.gz) as distributed by the source. The cache stores both the original compressed files and their uncompressed forms, which avoids repeated decompression overhead for downstream processing.
  All Level 2 downloads are verified for integrity: NCBI files are checked against the md5checksums.txt file and Ensembl files against the CHECKSUMS file provided in the respective FTP directory. Only files passing verification are used or cached.
  The cache directory is partitioned by source but otherwise flat. Example layouts:
  - NCBI Cache Example:
    cache/ncbi/
    ├── annotation.gtf.gz
    ├── annotation.gtf
    ├── genome.fna.gz
    └── genome.fna
  - Ensembl Cache Example:
    cache/ensembl/
    ├── annotation.gtf.gz
    ├── annotation.gtf
    ├── genome.fna.gz
    └── genome.fna
This structure ensures that both the original and ready-to-use files are always available for fast access and reuse.
- Workflow Summary:
  - Level 1 cache lookup: Return the region-specific .fna files for the current query if present, otherwise proceed to Level 2.
  - Level 2 cache lookup: Check whether the required assets are present in the Level 2 cache. If present, skip the following step.
  - Level 2 resource download: Download the required assets from FTP and cache both compressed and uncompressed copies for future use.
  - Level 2 processing: Verify checksums (using NCBI md5checksums.txt or Ensembl CHECKSUMS files) and return file paths.
  - Level 1 region generation: Build a temporary custom YAML config pointing to the uncompressed .gtf and .fna files in the Level 2 cache, and run the genomic region generator using these assets to generate the Level 1 cache outputs.
  - Reuse until expiry: Once Level 2 assets are cached and verified, they are reused for all future queries requiring the same species/taxon and annotation release, avoiding repeated downloads and decompression.
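The Level 2 verification and decompression steps might look like the sketch below. It covers only the MD5 case and assumes manifest lines of the form "<md5>  ./<file name>", with the file still carrying its upstream name when it is checked. Ensembl CHECKSUMS files use a different format and are not handled here. All function names are hypothetical.

```python
import gzip
import hashlib
import shutil
from pathlib import Path
from typing import Dict


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through MD5 so large assets are never loaded whole."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5_manifest(text: str) -> Dict[str, str]:
    """Map file name -> expected MD5, assuming '<md5>  ./<name>' lines."""
    expected = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            checksum, name = parts
            expected[Path(name.strip()).name] = checksum.lower()
    return expected


def verify_md5(path: Path, manifest_text: str) -> bool:
    """True only if the file is listed in the manifest and its digest matches."""
    expected = parse_md5_manifest(manifest_text).get(path.name)
    return expected is not None and file_md5(path) == expected


def ensure_uncompressed(gz_path: Path) -> Path:
    """Return the uncompressed sibling of gz_path, creating it only if missing."""
    target = gz_path.with_suffix("")  # genome.fna.gz -> genome.fna
    if not target.exists():
        partial = target.with_name(target.name + ".partial")
        with gzip.open(gz_path, "rb") as src, partial.open("wb") as dst:
            shutil.copyfileobj(src, dst)
        partial.replace(target)  # rename only after a complete write
    return target
```

Writing to a temporary name and renaming afterwards means an interrupted decompression is not mistaken for a finished file on the next lookup.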
Cache Key Construction
The cache key is a SHA256 hash generated from a subset of the form data:
{
"source": "NCBI",
"source_params": {
"taxon": "9606",
"species": "Homo_sapiens",
"annotation_release": "110"
},
"genomic_regions": {
"gene": "true",
"exon": "false"
}
}
Only these fields are considered. Any change to them will produce a new cache key and therefore a new cached directory.
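One way to derive such a key is sketched below. The serialization details (sorted keys, compact separators, UTF-8 encoding) are assumptions chosen to make the hash independent of key order; the actual implementation may serialize the subset differently. The names KEY_FIELDS and make_cache_key are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict

# The form fields that take part in the key; everything else is ignored.
KEY_FIELDS = ("source", "source_params", "genomic_regions")


def make_cache_key(form: Dict[str, Any]) -> str:
    """Return a SHA256 hex digest over a canonical JSON dump of the key fields."""
    subset = {field: form.get(field) for field in KEY_FIELDS}
    canonical = json.dumps(subset, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Sorting the keys before hashing means two forms with the same content but different field order map to the same cached directory.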
Caching Workflow
- A user submits a request to the API endpoint.
- The server generates per-region forms where only one genomic region is set to true.
- For each form:
  - A cache key is computed
  - The server checks for cached output under /cache/generated/cached_genomic_<key>/annotation/*.fna
  - If a cached directory exists:
    - The contents are reused
  - If no cache is found:
    - A YAML configuration is created
    - The genomic region generator runs
    - Output is stored in the designated cache directory
    - The temporary YAML file is deleted after use
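The per-form part of this workflow can be sketched as below. It assumes the generator can be represented as a callable that receives the path of the temporary YAML file; the YAML content written here is a placeholder, not the real configuration schema, and the names get_or_generate and CACHE_ROOT are hypothetical.

```python
from pathlib import Path
from typing import Callable

CACHE_ROOT = Path("cache/generated")  # assumed location of the Level 1 cache


def get_or_generate(
    key: str,
    generate: Callable[[Path], None],
    cache_root: Path = CACHE_ROOT,
) -> Path:
    """Reuse cached_genomic_<key>/ if it holds .fna files, else generate it."""
    out_dir = cache_root / f"cached_genomic_{key}"
    annotation_dir = out_dir / "annotation"
    if annotation_dir.is_dir() and any(annotation_dir.glob("*.fna")):
        return out_dir  # cache hit: skip generation entirely
    annotation_dir.mkdir(parents=True, exist_ok=True)
    config_path = out_dir / f"config_genomic_{key}.yaml"
    config_path.write_text(f"output_dir: {annotation_dir}\n")  # placeholder content
    try:
        generate(config_path)  # the generator writes .fna files into annotation/
    finally:
        config_path.unlink(missing_ok=True)  # temporary config is always removed
    return out_dir
```

Deleting the config in a finally block keeps the cache directory free of stale YAML files even when generation fails.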
Cache Directory Layout
cache/generated/
├── cached_genomic_<hash>/
│ ├── annotation/
│ │ ├── gene1.fna
│ │ └── gene2.fna
│ └── config_genomic_<hash>.yaml (deleted after execution)
Cache Cleanup
To prevent excessive disk usage, cached directories are periodically purged.
Cleanup Logic
A scheduled job executes a cleanup script that removes cache directories under the following conditions:
- The directory has not been accessed within the last 30 days
- The directory name is not on the exclusion list
Exclusion Example
EXCLUDE_DIRS = [
"cached_genomic_special_human",
"cached_genomic_mouse_reference"
]
These directories are preserved regardless of access time.
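A minimal sketch of the cleanup logic follows, assuming "accessed" is judged by the access time (st_atime) of each directory and that the exclusion list holds directory names. The function name purge_stale_dirs is hypothetical, and the contents of the real cleanup script are not shown in this document.

```python
import shutil
import time
from pathlib import Path
from typing import Iterable, List, Optional

MAX_AGE_SECONDS = 30 * 24 * 60 * 60  # 30 days


def purge_stale_dirs(
    cache_root: Path,
    exclude: Iterable[str],
    now: Optional[float] = None,
) -> List[str]:
    """Delete non-excluded directories not accessed in 30 days; return their names."""
    now = time.time() if now is None else now
    excluded = set(exclude)
    removed = []
    for entry in sorted(cache_root.iterdir()):
        if not entry.is_dir() or entry.name in excluded:
            continue
        if now - entry.stat().st_atime > MAX_AGE_SECONDS:
            shutil.rmtree(entry)
            removed.append(entry.name)
    return removed
```

Access times depend on filesystem mount options (for example noatime or relatime), so a production script may need to track last use in some other way.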
Cron Job Configuration
To automate cleanup, a cron job is scheduled as follows:
0 3 1 * * /path/to/venv/bin/python /path/to/cleanup_cache_dirs.py >> /var/log/cache_cleanup.log 2>&1
This runs the cleanup script at 03:00 on the first day of every month.