Caching FASTAs
Overview
The caching system for the genomic region extraction pipeline prevents redundant computations and speeds up repeated or similar requests. It works by generating a unique cache key for each specific set of parameters and storing the resulting files under a deterministic directory name. This allows the pipeline to reuse previous outputs rather than recomputing them.
What is Cached?
Each high-level request (e.g., to /api/genomic/cascaded/ncbi or /ensembl) is decomposed into individual region-specific forms (for example, one for gene, one for exon, etc.).
For each form:
- A unique cache key is generated from selected form fields
- Output files (such as .fna) are stored in a directory like /cache/cached_genomic_<hash>/
- If the directory already exists and contains valid output, the pipeline step is skipped and the results are reused
Two-Level Cache
The caching system employs a two-level cache strategy to optimize both performance and resource usage, with explicit workflows and directory layouts for each level:
Level 1: Region FASTA Cache
Level 1 caches the generated FASTA (.fna) files for each specific genomic region, such as a gene, exon, or other user-requested interval. Each Level 1 cache entry is keyed by a unique hash derived from a subset of the form fields (see Cache Key Construction below), which encode the selected data source, taxon/species, annotation release, and the precise set of genomic regions. A Level 1 cache hit occurs when the directory /cache/cached_genomic_<hash>/annotation/ contains all the expected .fna files for the requested regions, based on the exact combination of parameters. A miss occurs when any required file is missing or the directory does not exist for the cache key. Because these FASTA files are generated for a specific region selection, they are typically small and tied directly to the user’s query. The cache key is deterministic and ensures that only exact matches for the same region and parameters are reused.
Example directory layout:
/cache/cached_genomic_<hash>/
└── annotation/
    ├── gene1.fna
    └── gene2.fna
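The Level 1 hit/miss rule described above can be sketched as a small check; the function name and signature are illustrative, not the pipeline's actual API, and region files are assumed to be named `<region>.fna`:

```python
import os

def level1_hit(cache_root: str, key: str, expected_regions: list) -> bool:
    """Return True only if the Level 1 directory exists and contains a
    .fna file for every requested region; any missing file is a miss."""
    annotation_dir = os.path.join(cache_root, f"cached_genomic_{key}", "annotation")
    if not os.path.isdir(annotation_dir):
        return False
    return all(
        os.path.isfile(os.path.join(annotation_dir, f"{region}.fna"))
        for region in expected_regions
    )
```

Checking every expected file (rather than just the directory) is what makes a partially written cache entry count as a miss.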
Level 2: Raw Asset Cache from NCBI/Ensembl
Level 2 is the heavy, persistent cache for raw data assets downloaded from NCBI or Ensembl. These assets are typically large, ranging from hundreds of megabytes to several gigabytes, and comprise the official compressed files (.fna.gz, .gtf.gz) as distributed by the source. The cache stores both the original compressed files in a raw/ subdirectory and their decompressed forms in a decompressed/ subdirectory. This avoids repeated decompression (gunzip) overhead for downstream processing.
All Level 2 downloads are verified for integrity: NCBI files are checked against the md5checksums.txt provided in the FTP directory, while Ensembl files are checked against their .md5 sidecar files. Only files passing verification are used or cached.
Both NCBI and Ensembl directory structures are deterministic, mirroring the official FTP layouts for traceability and reproducibility. Example layouts:
NCBI Cache Example:
/cache/ncbi_raw/
└── <taxon_id>/
    ├── raw/
    │   ├── annotation.gtf.gz
    │   └── genome.fna.gz
    └── decompressed/
        ├── annotation.gtf
        └── genome.fna
Ensembl Cache Example:
/cache/ensembl_raw/
└── <species_name>/
    ├── raw/
    │   ├── annotation.gtf.gz
    │   └── genome.fna.gz
    └── decompressed/
        ├── annotation.gtf
        └── genome.fna
This structure ensures that both the original and ready-to-use files are always available for fast access and reuse.
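The NCBI checksum verification described above can be sketched as follows. The parsing assumes the conventional md5sum line format (`<md5>  <path>`) used in NCBI's md5checksums.txt; the function names are illustrative:

```python
import hashlib
import os

def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through MD5 so multi-gigabyte downloads are never
    loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_ncbi_download(download_path: str, checksums_path: str) -> bool:
    """Look up the downloaded file's entry in md5checksums.txt by
    basename and compare digests. An unlisted file fails verification."""
    wanted = os.path.basename(download_path)
    with open(checksums_path) as fh:
        for line in fh:
            parts = line.split()
            if len(parts) == 2 and os.path.basename(parts[1]) == wanted:
                return md5sum(download_path) == parts[0]
    return False  # file not listed in the manifest: treat as failed
```

Ensembl's .md5 sidecar files contain a single line in the same format, so the same comparison applies with `checksums_path` pointing at the sidecar.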
Workflow Summary:
1. Level 1 cache miss: If the region-specific .fna files for the current query are not found (cache miss), proceed to Level 2.
2. Level 2 lookup: Check whether the required raw and decompressed files are present in the Level 2 cache (raw/ and decompressed/ under the appropriate NCBI or Ensembl subdirectory).
3. If Level 2 present: Build a temporary custom YAML config pointing to the decompressed .gtf and .fna files in the Level 2 cache, and run the region extraction pipeline using these assets to generate the Level 1 cache outputs.
4. If Level 2 absent: Download the required assets from FTP, verify checksums (using NCBI md5checksums.txt or Ensembl .md5 files), decompress into the decompressed/ folder, and cache both compressed and decompressed copies for future use. Then proceed as in step 3.
5. Reuse forever: Once Level 2 assets are cached and verified, they are reused for all future queries requiring the same species/taxon and annotation release, avoiding repeated downloads and decompression.
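The workflow steps above can be sketched as a small orchestration function. All four helpers are injected callables with hypothetical names; the real pipeline's entry points may differ:

```python
def get_region_fastas(key, level1_hit, level2_present,
                      download_and_verify, run_extraction):
    """Orchestrate the two-level cache lookup.

    Injected helpers (illustrative names):
    - level1_hit(key)         -> bool: region .fna files already cached?
    - level2_present(key)     -> bool: raw + decompressed assets cached?
    - download_and_verify(key): fetch from FTP, check md5, decompress
    - run_extraction(key)     : build the temporary YAML config and
                                generate the Level 1 outputs
    """
    if level1_hit(key):
        return "level1-hit"       # reuse region FASTAs directly
    if not level2_present(key):
        download_and_verify(key)  # populate Level 2 (steps 4)
    run_extraction(key)           # Level 2 assets -> Level 1 outputs (step 3)
    return "generated"
```

Note that Level 2 is only ever consulted on a Level 1 miss, and downloads only happen on a Level 2 miss, so a warm cache answers without touching FTP at all.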
Cache Key Construction
The cache key is a SHA256 hash generated from a subset of the form data:
{
"source": "NCBI",
"source_params": {
"taxon": "9606",
"species": "Homo_sapiens",
"annotation_release": "110"
},
"genomic_regions": {
"gene": "true",
"exon": "false"
}
}
Only these fields are considered. Any change to them will produce a new cache key and therefore a new cached directory.
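A minimal sketch of this key derivation, assuming the hash is taken over canonical JSON (sorted keys, compact separators) so that dict ordering never changes the key; the function name is illustrative:

```python
import hashlib
import json

def make_cache_key(form: dict) -> str:
    """SHA256 over a canonical JSON rendering of the selected form
    fields. Any change to these fields yields a new key and therefore
    a new cached directory."""
    canonical = json.dumps(form, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

form = {
    "source": "NCBI",
    "source_params": {
        "taxon": "9606",
        "species": "Homo_sapiens",
        "annotation_release": "110",
    },
    "genomic_regions": {"gene": "true", "exon": "false"},
}
key = make_cache_key(form)
cache_dir = f"/cache/cached_genomic_{key}"
```

The resulting 64-character hex digest is the `<hash>` used in the directory names throughout this document.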
Caching Workflow
- A user submits a request to the API endpoint.
- The server generates per-region forms where only one genomic region is set to true.
- For each form:
  - A cache key is computed
  - The server checks for cached output under /cache/cached_genomic_<key>/annotation/*.fna
- If a cached directory exists:
  - The contents are reused
  - The directory’s access time is updated
- If no cache is found:
  - A YAML configuration is created
  - The genomic region generator runs
  - Output is stored in the designated cache directory
  - The temporary YAML file is deleted after use
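The create-run-delete lifecycle of the temporary YAML config can be sketched as below. The YAML keys (annotation, genome, output_dir) are assumptions for illustration; the real pipeline's schema may differ:

```python
import os
import tempfile

def run_with_temp_config(gtf_path, fna_path, out_dir, run_pipeline):
    """Write a throwaway YAML config pointing at the decompressed
    assets, invoke the generator, and always delete the config
    afterwards, even if the run raises."""
    fd, config_path = tempfile.mkstemp(suffix=".yaml", prefix="config_genomic_")
    try:
        with os.fdopen(fd, "w") as fh:
            fh.write(f"annotation: {gtf_path}\n")
            fh.write(f"genome: {fna_path}\n")
            fh.write(f"output_dir: {out_dir}\n")
        run_pipeline(config_path)  # e.g. the genomic region generator
    finally:
        os.remove(config_path)     # the config never outlives the run
```

The `finally` block is what guarantees the "deleted after use" step even when the generator fails partway through.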
Cache Directory Layout
cache/
├── cached_genomic_<hash>/
│ ├── annotation/
│ │ ├── gene1.fna
│ │ └── gene2.fna
│ └── config_genomic_<hash>.yaml (deleted after execution)
Cache Cleanup
To prevent excessive disk usage, cached directories are periodically purged.
Cleanup Logic
A scheduled job executes a cleanup script that removes cache directories under the following conditions:
- The directory has not been accessed within the last 30 days
- The directory name is not on the exclusion list
Exclusion Example
EXCLUDE_DIRS = [
"cached_genomic_special_human",
"cached_genomic_mouse_reference"
]
These directories are preserved regardless of access time.
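A minimal sketch of the cleanup logic, assuming the script compares each directory's access time (st_atime) against a 30-day threshold; the function name and the `now` parameter (for testability) are illustrative:

```python
import os
import shutil
import time

EXCLUDE_DIRS = [
    "cached_genomic_special_human",
    "cached_genomic_mouse_reference",
]
MAX_AGE_SECONDS = 30 * 24 * 3600  # 30 days

def cleanup_cache(cache_root, now=None):
    """Remove cache directories not accessed within 30 days, skipping
    anything on the exclusion list. Returns the removed names."""
    now = time.time() if now is None else now
    removed = []
    for name in os.listdir(cache_root):
        path = os.path.join(cache_root, name)
        if not os.path.isdir(path) or name in EXCLUDE_DIRS:
            continue
        if now - os.stat(path).st_atime > MAX_AGE_SECONDS:
            shutil.rmtree(path)
            removed.append(name)
    return removed
```

Note that relying on atime requires the cache filesystem to record access times (e.g. not mounted with noatime), which is why the caching workflow explicitly touches the directory's access time on every hit.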
Cron Job Configuration
To automate cleanup, a cron job is scheduled as follows:
0 3 1 * * /path/to/venv/bin/python /path/to/cleanup_cache_dirs.py >> /var/log/cache_cleanup.log 2>&1
This runs the cleanup script at 03:00 on the first day of every month.