DNA File Formats for DNA Data Storage and Bioinformatics
Introduction
DNA file formats are specialized file types used to store, encode, and process DNA sequences. These formats are crucial for DNA data storage, genome analysis, and bioinformatics applications.
1 Common DNA File Formats
1. FASTA (.fasta, .fa)
Description: Stores nucleotide (DNA/RNA) or protein sequences in a simple text format.
Structure:
- First line: Header (starts with > followed by the sequence identifier and an optional description).
- Following lines: The sequence itself, either DNA (A, T, C, G) or protein residues, optionally wrapped across several lines.
Example:
>Human_Gene1
ATGCGTACGTAGCTAGCTAGCTAGCTAGC
Uses:
- Genome sequencing
- Storing and sharing DNA sequences
- Input to bioinformatics tools (BLAST, ClustalW)
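As a minimal sketch of how the header/sequence structure can be read, the following Python function collects each record into a dictionary. The function name parse_fasta is hypothetical, and the sketch assumes the whole file fits in memory; production pipelines usually rely on an established library instead.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:]          # start a new record
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


example = ">Human_Gene1\nATGCGTACGTAGCTAGCTAGCTAGCTAGC\n"
records = parse_fasta(example)
print(records)
```

Sequence lines are joined because a single FASTA record may be wrapped across many lines.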
2. FASTQ (.fastq, .fq)
Description: Stores raw sequencing data, including quality scores.
Structure:
- Line 1: Identifier (@ followed by sequence ID).
- Line 2: DNA sequence.
- Line 3: + separator, optionally followed by a repeat of the sequence ID.
- Line 4: Quality scores (ASCII-encoded Phred scores, one character per base in line 2).
Example:
@SEQ_ID
GATTTGGGGTTTCCCAGTCACGAC
+
!''*((((***+))%%%++)(%%%
Uses:
- Next-Generation Sequencing (NGS) data storage
- Read quality analysis
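The four-line record layout and the Phred encoding can be illustrated with a short Python sketch. It assumes the common Phred+33 offset (quality = ASCII code minus 33) and unwrapped four-line records; the function name parse_fastq is hypothetical, and the quality string in the sample is an illustrative value trimmed to the read length.

```python
def parse_fastq(text):
    """Yield (read_id, sequence, phred_scores) for each four-line record."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for i in range(0, len(lines), 4):
        ident, seq, sep, qual = lines[i:i + 4]
        if not ident.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"malformed record number {i // 4 + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality strings differ in length")
        # Assumption: Phred+33 encoding, so '!' (ASCII 33) means quality 0.
        scores = [ord(ch) - 33 for ch in qual]
        yield ident[1:], seq, scores


sample = "@SEQ_ID\nGATTTGGGGTTTCCCAGTCACGAC\n+\n!''*((((***+))%%%++)(%%%\n"
read_id, seq, scores = next(parse_fastq(sample))
print(read_id, len(seq), scores[:4])
```

The length check matters because every base must have exactly one quality character.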
3. GenBank (.gb, .gbk)
Description: Stores DNA sequences along with annotations, including gene names, features, and references.
Structure:
- LOCUS: Sequence name, length, type
- DEFINITION: Brief description
- FEATURES: Gene annotations
- ORIGIN: DNA sequence
Example (simplified):
LOCUS       SCU49845     5028 bp    DNA
DEFINITION  Yeast mitochondrion gene.
FEATURES             Location/Qualifiers
     gene            1..5028
                     /gene="COX1"
ORIGIN
ATGCGTACGTAGCTAGCTAGCTAGC
Uses:
- Storing annotated genetic data
- Genome databases (NCBI, EMBL, DDBJ)
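To show how the keyword sections relate to each other, here is a rough Python sketch that pulls the LOCUS name, the DEFINITION text, and the ORIGIN sequence out of a simplified record. The function name parse_genbank_simple is hypothetical. The sketch assumes single-line DEFINITION entries and ignores FEATURES entirely, so it is not a substitute for a full GenBank parser. The sample record uses the numbered, lowercase sequence layout found in real GenBank files.

```python
def parse_genbank_simple(text):
    """Extract locus name, definition, and sequence from a simplified record."""
    record = {"locus": None, "definition": None, "sequence": ""}
    in_origin = False
    seq_parts = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("//"):
            break  # '//' terminates a GenBank record
        if line.startswith("LOCUS"):
            record["locus"] = line.split()[1]
        elif line.startswith("DEFINITION"):
            record["definition"] = line[len("DEFINITION"):].strip()
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif in_origin:
            # Drop position numbers and spaces, keep only the bases.
            seq_parts.append("".join(ch for ch in stripped if ch.isalpha()))
    record["sequence"] = "".join(seq_parts).upper()
    return record


example = """LOCUS       SCU49845     5028 bp    DNA
DEFINITION  Yeast mitochondrion gene.
FEATURES             Location/Qualifiers
     gene            1..5028
                     /gene="COX1"
ORIGIN
        1 atgcgtacgt agctagctag ctagc
//
"""
rec = parse_genbank_simple(example)
print(rec)
```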
4. GFF/GTF (.gff, .gtf)
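Description: Tab-delimited gene annotation formats that map genes and other features to coordinates on a sequence.
Structure:
- Columns (nine, tab-separated): sequence ID (e.g., chromosome), source, feature type, start, end, score, strand, phase, attributes.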
Example (GFF3 format):
chr1 Ensembl gene 1000 5000 . + . ID=Gene1;Name=COX1
Uses:
- Gene annotations in genomic research
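The column layout can be sketched in Python as follows. The function name parse_gff3_line is hypothetical, and the sketch assumes a tab-separated GFF3 feature line with key=value attributes separated by semicolons; GTF attributes use a different syntax and are not handled.

```python
GFF3_COLUMNS = ["seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"]


def parse_gff3_line(line):
    """Split one GFF3 feature line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    feature = dict(zip(GFF3_COLUMNS, fields))
    feature["start"] = int(feature["start"])  # coordinates are 1-based, inclusive
    feature["end"] = int(feature["end"])
    feature["attributes"] = dict(
        pair.split("=", 1) for pair in feature["attributes"].split(";") if pair
    )
    return feature


line = "chr1\tEnsembl\tgene\t1000\t5000\t.\t+\t.\tID=Gene1;Name=COX1"
feature = parse_gff3_line(line)
print(feature["type"], feature["start"], feature["end"], feature["attributes"])
```

Because both coordinates are inclusive, the feature length is end - start + 1.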
5. SAM/BAM (.sam, .bam)
Description: Stores DNA sequence alignments to a reference genome.
SAM is a tab-delimited text format; BAM is its compressed binary equivalent.
Uses:
- DNA sequence alignment from high-throughput sequencing
- Storing large genomic datasets efficiently
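As an illustration of the text form, each SAM alignment line begins with eleven mandatory tab-separated fields. The Python sketch below splits one such line and decodes two bits of the FLAG field. The function name parse_sam_line and the sample read are hypothetical; BAM files are binary and would normally be read through a dedicated library rather than parsed this way.

```python
SAM_FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual"]


def parse_sam_line(line):
    """Parse the 11 mandatory fields of a SAM alignment line."""
    if line.startswith("@"):
        return None  # header lines start with '@'
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    rec = dict(zip(SAM_FIELDS, fields[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        rec[key] = int(rec[key])
    rec["is_unmapped"] = bool(rec["flag"] & 0x4)   # FLAG bit 0x4: read unmapped
    rec["is_reverse"] = bool(rec["flag"] & 0x10)   # FLAG bit 0x10: reverse strand
    return rec


# Hypothetical alignment: an 8-base read mapped to the reverse strand of chr1.
sam_line = "read1\t16\tchr1\t1000\t60\t8M\t*\t0\t0\tGATTTGGG\tIIIIIIII"
aln = parse_sam_line(sam_line)
print(aln["rname"], aln["pos"], aln["cigar"], aln["is_reverse"])
```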
6. VCF (.vcf)
Description: Stores genetic variations (SNPs, insertions/deletions, and other mutations) relative to a reference genome.
Uses:
- Storing human genetic variation data
- Population genetics studies
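A VCF data line starts with eight fixed tab-separated columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER, and INFO. The Python sketch below reads those columns and flags simple single-nucleotide variants. The function name parse_vcf_line and the sample variant are hypothetical, and the SNP test is deliberately simplistic.

```python
def parse_vcf_line(line):
    """Parse the eight fixed columns of a VCF data line."""
    if line.startswith("#"):
        return None  # meta-information and header lines start with '#'
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 columns")
    chrom, pos, var_id, ref, alt, qual, flt, info = fields[:8]
    alts = alt.split(",")  # several alternate alleles are comma-separated
    bases = set("ACGT")
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "ref": ref,
        "alt": alts,
        "filter": flt,
        "info": info,
        # Simplistic check: one reference base replaced by one base per allele.
        "is_snp": ref in bases and all(a in bases for a in alts),
    }


# Hypothetical variant: A replaced by G at position 1000 on chr1.
vcf_line = "chr1\t1000\trs0001\tA\tG\t50\tPASS\tDP=20"
variant = parse_vcf_line(vcf_line)
print(variant["chrom"], variant["pos"], variant["ref"], variant["alt"], variant["is_snp"])
```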
7. DNA Data Storage-Specific Formats
DNA Fountain: An encoding scheme for digital DNA storage based on fountain codes; strictly an encoding technique rather than a file format.
Twist Bioscience Format: Custom format for synthetic DNA storage.
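To make the idea of encoding digital data as DNA concrete, the Python sketch below maps every two bits to one base. This is a naive illustration only: it is not DNA Fountain and not any vendor format. The mapping table and the function names encode and decode are assumptions chosen for the example, and practical schemes add error correction and avoid problematic sequences such as long runs of one base.

```python
# Assumed mapping for illustration: two bits per nucleotide.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Convert bytes to a DNA string, four bases per byte."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(dna: str) -> bytes:
    """Convert a DNA string produced by encode() back to bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


dna = encode(b"Hi")
print(dna)
print(decode(dna))
```

The resulting string could be written to a FASTA file like any other sequence.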
Each DNA file format serves a unique purpose, from storing raw sequencing data (FASTQ) to annotated genetic databases (GenBank, GFF) and genomic variations (VCF).