Illumina reads naming scheme

Below are three reads produced by an Illumina sequencer, in FASTQ format. This section describes the read naming scheme used by Illumina sequencers.

@M02007:58:000000000-AW0NA:1:1101:11070:1384 1:N:0:CAGAGGCA+CTAAGCCT
CCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGTCCGCACTCCTTTTGCACCCCTTCCCCGTGTTTGAAGC
+
6BCCCFEE9)88B@@FE@FCFGD7@CCFE,6,C@,CC,,<,8+++;6;,,6,;,,CB+:,:6,9+8,6,:,,,,,
@M02007:58:000000000-AW0NA:1:1101:19460:1444 1:N:0:AAGAGGCA+CTACGCCT
CCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGCCCCACCCCTGTTCCAGCCTTCCCGCGTGTTTGTTCC
+
@CCCCGGG>)=@FFGG<FFGGGG7,C,EE9C9FE,C,,,;,8,+,86:,,6,<,,;C,9,:,,++8+6,6,,,,,
@M02007:58:000000000-AW0NA:1:1101:19666:1451 1:N:0:AAGAGGCA+CTAAGCCT
CCCTACGGGAGGCAGCAGTAGGGAATCTTCCACAATGGGCGAAACCCTGTTCGAGCACCTCCGCTTCCGTGTAGC
+
<BCCCFEG>)=@CFFC<EFFFGGFFEEFCFEDFE,C@,,;+++++;6;,,6,6,6B+,,,:6+,,8,,,::,,,,

First, remember that the name is the part after the “@” and before the first space (in the first read above, M02007:58:000000000-AW0NA:1:1101:11070:1384). It must be unique within a single file.

As described on the Illumina website, the read name is composed of these parts (separated by colons):

  1. Instrument (e.g. M02007)
  2. Run number (e.g. 58)
  3. Flowcell ID (e.g. 000000000-AW0NA)
    (These three codes are constant within a single FASTQ file produced by a single flowcell.)
  4. Lane
  5. Tile
  6. X coordinate
  7. Y coordinate

Note that the name is followed, after a space, by a “comment” that carries additional information, for example the index sequences used (CAGAGGCA+CTAAGCCT in the first read).
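The fields listed above can be pulled apart with a few lines of Python. This is only a minimal sketch: the names `IlluminaReadName` and `parse_illumina_header` are made up for illustration, converting the run number, lane, tile and coordinates to integers is an assumption, and so is reading the index from the last colon-separated field of the comment (which is where it sits in the reads shown above).

```python
from typing import NamedTuple, Tuple


class IlluminaReadName(NamedTuple):
    """The seven colon-separated parts of an Illumina read name (hypothetical helper type)."""
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int


def parse_illumina_header(header: str) -> Tuple[IlluminaReadName, str]:
    """Split a FASTQ header line into the parsed read name and the raw comment."""
    if not header.startswith("@"):
        raise ValueError("a FASTQ header line must start with '@'")
    # The name ends at the first space; everything after it is the comment.
    name, _, comment = header[1:].strip().partition(" ")
    fields = name.split(":")
    if len(fields) != 7:
        raise ValueError(f"expected 7 colon-separated fields, found {len(fields)}")
    instrument, run, flowcell, lane, tile, x, y = fields
    parsed = IlluminaReadName(
        instrument, int(run), flowcell, int(lane), int(tile), int(x), int(y)
    )
    return parsed, comment


header = "@M02007:58:000000000-AW0NA:1:1101:11070:1384 1:N:0:CAGAGGCA+CTAAGCCT"
read_name, comment = parse_illumina_header(header)
print(read_name.instrument, read_name.flowcell_id, read_name.tile)
# Assumption: the index is the last colon-separated field of the comment.
print(comment.split(":")[-1])
```

Because the first three fields are constant within a file, parsing only the first header is usually enough to learn which instrument, run and flowcell produced it.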

FASTQ file: the common output from NGS sequencers

Most NGS sequencers will save their output as text files in FASTQ format. In the modern incarnation of this format each sequence is written using 4 lines:

  1. The sequence name, preceded by the “@” symbol
  2. The DNA sequence itself
  3. A separator line: a “+”, optionally followed by the sequence name (repeated)
  4. The quality line

An example of a single sequencing read written in FASTQ format is:

@SRR5232030.1 1 length=101
NATCAATAGTATTCGTACCAATAGAACGAATATCCGCCAGCACCATTTGTTTGGCGGCGTCGCCCACCACGACAATGGAAACCACCGACGCAATACCGATT
+
#>BBABFFFFFFGGGGGGGGGGHHHHHGGGGGHHHGGGGGGHHHHHHHHHGHHGHGGGGGGGGGGGGGHHGGGGGHHHHHGHHHGGGGGGGGFGHHHGGGG
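The four-line layout can be read with a short generator. The sketch below is illustrative only: `read_fastq` is a hypothetical name, it assumes every record takes exactly four lines (no wrapped sequences), and the two records fed to it are toy data (the first reuses the first six bases and quality characters of the read above).

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (name, comment, sequence, quality) for each four-line record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lines differ in length")
        # The name is the part after '@' and before the first space.
        name, _, comment = header[1:].rstrip("\r\n").partition(" ")
        yield name, comment, sequence, quality


# Toy input: two short records, the second repeating its name after '+'.
toy = io.StringIO("@read1 some comment\nNATCAA\n+\n#>BBAB\n@read2\nACGT\n+read2\nFFFF\n")
for name, comment, sequence, quality in read_fastq(toy):
    print(name, len(sequence), quality)
```

For real files, an established parser (for example the one in Biopython) is likely a better choice; the point here is only to show how the four lines map onto a record.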

The quality is encoded so that a single character represents the Phred score of one base. This means that the quality of the tenth base is encoded in the tenth character of the quality line.
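As a rough illustration of this one-character-per-base encoding, the sketch below turns quality characters into numeric scores. It assumes the Phred+33 offset (ASCII code minus 33), which the text above does not state but which is the common convention in current FASTQ files; older files may use a different offset. The function names are hypothetical.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert each quality character to its Phred score (assumes a Phred+33 offset by default)."""
    return [ord(char) - offset for char in quality]


def error_probability(score: int) -> float:
    """Probability that a base call is wrong, from the Phred definition Q = -10 * log10(P)."""
    return 10 ** (-score / 10)


# First ten quality characters of the SRR5232030.1 read shown above.
quality_start = "#>BBABFFFF"
scores = phred_scores(quality_start)
print(scores[0])   # score of the first base ('#')
print(scores[9])   # score of the tenth base ('F')
print(error_probability(scores[9]))
```

Under this assumption the leading “#” decodes to a score of 2, which fits the uncalled “N” at the first position of that read, while “F” decodes to 37.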