File formats – Bioinformatics for NGS

Below are three reads produced by an Illumina sequencer, in FASTQ format. This page describes the read naming scheme used by the widespread Illumina sequencers.

@M02007:58:000000000-AW0NA:1:1101:11070:1384 1:N:0:CAGAGGCA+CTAAGCCT
CCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGTCCGCACTCCTTTTGCACCCCTTCCCCGTGTTTGAAGC
+
6BCCCFEE9)88B@@FE@FCFGD7@CCFE,6,C@,CC,,<,8+++;6;,,6,;,,CB+:,:6,9+8,6,:,,,,,
@M02007:58:000000000-AW0NA:1:1101:19460:1444 1:N:0:AAGAGGCA+CTACGCCT
CCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGCCCCACCCCTGTTCCAGCCTTCCCGCGTGTTTGTTCC
+
@CCCCGGG>)=@FFGG<FFGGGG7,C,EE9C9FE,C,,,;,8,+,86:,,6,<,,;C,9,:,,++8+6,6,,,,,
@M02007:58:000000000-AW0NA:1:1101:19666:1451 1:N:0:AAGAGGCA+CTAAGCCT
CCCTACGGGAGGCAGCAGTAGGGAATCTTCCACAATGGGCGAAACCCTGTTCGAGCACCTCCGCTTCCGTGTAGC
+
<BCCCFEG>)=@CFFC<EFFFGGFFEEFCFEDFE,C@,,;+++++;6;,,6,6,6B+,,,:6+,,8,,,::,,,,

First, remember that the name is the part after the “@” and before the first space (in the first read above, M02007:58:000000000-AW0NA:1:1101:11070:1384). It must be unique within a single file.
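The rule above can be sketched in a few lines of Python. This is a minimal illustration, not a full FASTQ parser: it assumes every record spans exactly four lines, and the function names `read_name` and `find_duplicate_names` are hypothetical, introduced only for this example. Header lines are located by position rather than by a leading “@”, because a quality line can also start with “@” (as in the second read above).

```python
from collections import Counter


def read_name(header_line):
    """Return the read name: the text after '@' and before the first whitespace."""
    if not header_line.startswith("@"):
        raise ValueError(f"not a FASTQ header line: {header_line!r}")
    parts = header_line[1:].split(maxsplit=1)
    if not parts:
        raise ValueError("empty read name")
    return parts[0]


def find_duplicate_names(lines):
    """Return the sorted read names that occur more than once.

    Assumes four lines per record, so headers sit at positions 0, 4, 8, ...
    """
    names = [read_name(line) for i, line in enumerate(lines) if i % 4 == 0]
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)
```

An empty list returned by `find_duplicate_names` means all names in the file are unique.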

As described on the Illumina website, the read name is composed of these parts (separated by colons):

Instrument (e.g. M02007)
Run number (e.g. 58)
Flowcell ID (e.g. 000000000-AW0NA)
(These three codes are constant within a single FASTQ file produced by a single flowcell.)
Lane (e.g. 1)
Tile (e.g. 1101)
X coordinate (e.g. 11070)
Y coordinate (e.g. 1384)
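As an illustration of the field list above, the name can be split on colons and mapped to named fields. The sketch below assumes the name has exactly seven fields, as in the example reads; the class `IlluminaReadName` and the function `parse_illumina_name` are hypothetical names chosen for this example.

```python
from typing import NamedTuple


class IlluminaReadName(NamedTuple):
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int


def parse_illumina_name(name):
    """Split a read name (without the leading '@' and without the comment)."""
    fields = name.split(":")
    if len(fields) != 7:
        # Assumption: names with a different number of fields are not handled here.
        raise ValueError(f"expected 7 colon-separated fields, got {len(fields)}")
    instrument, run, flowcell, lane, tile, x, y = fields
    return IlluminaReadName(
        instrument, int(run), flowcell, int(lane), int(tile), int(x), int(y)
    )
```

For the first read above, `parse_illumina_name("M02007:58:000000000-AW0NA:1:1101:11070:1384")` gives instrument M02007, run 58, flowcell 000000000-AW0NA, lane 1, tile 1101, and coordinates 11070 and 1384.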

As you can see, the name is followed by a “comment” (for example, 1:N:0:CAGAGGCA+CTAAGCCT in the first read), which specifies, among other things, the index used.
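The comment can be split in the same way. The sketch below assumes the common Illumina layout of four colon-separated fields (read number, filter flag, control number, index sequence) and a dual index joined by “+”, as in the example reads. Only the index is mentioned in the text above, so the meaning of the other three fields is an assumption here, and `parse_illumina_comment` is a hypothetical helper name.

```python
def parse_illumina_comment(comment):
    """Split a comment such as '1:N:0:CAGAGGCA+CTAAGCCT' into its fields."""
    fields = comment.strip().split(":")
    if len(fields) != 4:
        raise ValueError(f"expected 4 colon-separated fields, got {len(fields)}")
    read_number, filter_flag, control, index = fields
    return {
        "read_number": int(read_number),     # assumed: 1 or 2 for paired-end data
        "is_filtered": filter_flag == "Y",   # assumed: 'Y' marks a filtered read
        "control_number": int(control),      # assumed: 0 when no control bits are set
        "index": index.split("+"),           # one entry per index sequence
    }
```

For the first read above, the `index` entry is the list `["CAGAGGCA", "CTAAGCCT"]`.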

@SRR5232030.1 1 length=101
NATCAATAGTATTCGTACCAATAGAACGAATATCCGCCAGCACCATTTGTTTGGCGGCGTCGCCCACCACGACAATGGAAACCACCGACGCAATACCGATT
+
#>BBABFFFFFFGGGGGGGGGGHHHHHGGGGGHHHGGGGGGHHHHHHHHHGHHGHGGGGGGGGGGGGGHHGGGGGHHHHHGHHHGGGGGGGGFGHHHGGGG
