Coverage tracks from BAM files

Sequence coverage (or sequencing depth) refers to the number of reads that include a specific nucleotide of a reference genome. In the screenshot below a small region of the Human genome is shown in a genome browser, where the alignment of a re-sequencing experiment was loaded. Each gray arrow in the main area of the window represents a single read, while the highlighted area is a coverage plot, summarizing for each nucleotide how many times was sequenced in this experiment.

IGV can display a detailed coverage track from a BAM file,
but will stop rendering if the region is >50Kbp.

An extended version of this article was published on Medium.

Visualizing coverage from a BAM file

As shown in the introduction, IGV is a simple and effective tool to inspect an alignment file. You need the reference genome used to perform the alignment (load it via Genome → Load from file…) and the alignment file in BAM format (sorted and indexed).

The major limitation is that you can view up to 50 kbp with IGV, if you zoom out you’ll stop seeing alignments. This makes IGV perfect for detailed inspection of small regions (you’ll see the alignment read by read!) but less useful for broader inspection.

Typical case: which regions have a very low (or no) coverage? Which ones have a very high coverage?

Extracting coverage information with samtools

There is a samtools subprogram, called depth, that calculates the sequence coverage at each position³:

samtools depth -a FILE.bam > FILE.txt

The output is a tabular three columns table: chromosome name, position, coverage. It is worth noting that samtools checks the CIGAR field and if your sample has a deletion will report zero coverage on the deletion. Most of the times this is not wanted (it is required for the variant calling process, but that’s another story).

Bedtools genome coverage

Bedtools has an even better option to calculate coverage, as it produces a standard BED file.

bedtools genomecov -bga FILE.bam > FILE.coverage.bed

The output is a standard BED file: a tabular file with these columns: chromosome, start, end, coverage. The difference with the previous output is that adjacent bases with the same coverage will be collapsed in the same interval.

Covtobed: from conda with love

We recently published covtobed, a simple tool to extract coverage information from sorted BAM files.

To install the tool you can use Miniconda:

conda install -y -c bioconda covtobed

And then, the usage to extract regions covered bedtween 0X (included) and 1X (excluded) is as simple as:

covtobed -m 0 -M 1 input.bam > not_covered.bed

Bash script: safer with “set -euxo pipefail”

An excellent blog post from Van Eyckt explains how starting a Bash script with:

#!/bin/bash
set -euxo pipefail

can help a lot in troubleshooting, making the script easier to write. With this single line the script will exit when a single command fails (-e), even when part of a pipe fails (-o).
The -u switch will prevent the use of unbound variables, while -x will print each command before execution, saving some echoes but messing the screen. Most of the times, I avoid the -x.

About the “percentage of homology”

Two structures or two sequence can either be homologous or not. Homologous means deriving from the same structure from a common ancestor. Bat wings and human arms, for example, are superficially quite different, but they are homologous structures. This fact become less surprising when inspecting the bones anatomy.

So… it makes no sense to talk about the percentage of homology, while it is correct to measure the percentage of identity of two sequences (that can suggest their homology).

I think this distinction is quite important, yet even very popular papers can help carrying this error:

From Taxonomic Note: A Place for DNA-DNA Reassociation and 16S rRNA Sequence Analysis in the Present Species Definition in Bacteriology

Illumina reads naming scheme

These are three reads produced by an Illumina sequencer, and they are in FASTQ format. What we describe here is the naming scheme of the widespread Illumina sequencers.

@M02007:58:000000000-AW0NA:1:1101:11070:1384 1:N:0:CAGAGGCA+CTAAGCCT
CCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGTCCGCACTCCTTTTGCACCCCTTCCCCGTGTTTGAAGC
+
6BCCCFEE9)88B@@FE@FCFGD7@CCFE,6,C@,CC,,<,8+++;6;,,6,;,,CB+:,:6,9+8,6,:,,,,,
@M02007:58:000000000-AW0NA:1:1101:19460:1444 1:N:0:AAGAGGCA+CTACGCCT
CCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGCCCCACCCCTGTTCCAGCCTTCCCGCGTGTTTGTTCC
+
@CCCCGGG>)=@FFGG<FFGGGG7,C,EE9C9FE,C,,,;,8,+,86:,,6,<,,;C,9,:,,++8+6,6,,,,,
@M02007:58:000000000-AW0NA:1:1101:19666:1451 1:N:0:AAGAGGCA+CTAAGCCT
CCCTACGGGAGGCAGCAGTAGGGAATCTTCCACAATGGGCGAAACCCTGTTCGAGCACCTCCGCTTCCGTGTAGC
+
<BCCCFEG>)=@CFFC<EFFFGGFFEEFCFEDFE,C@,,;+++++;6;,,6,6,6B+,,,:6+,,8,,,::,,,,

First, remember that the name is the part after the “@” and before any space (it’s highlighted in bold in the example above). It must be unique within a single file.

As described in the Illumina website the read name is composed by these parts (separated by columns):

  1. Instrument (i.e. M02007)
  2. Run number (i.e. 58)
  3. Flowcell ID (i.e. 000000000-AW0NA)
    These three codes are constant in a single FASTQ file, produced by a single flowcell)
  4. Lane
  5. Tile
  6. X coordinate
  7. Y coordinate

As you can note it’s followed by a “comment” specifying the Index used, for example.

FastQ file: the common output from NGS sequencers

Most NGS sequencers will save their output as text files in FASTQ format. In the modern incarnation of this format each sequence is written using 4 lines:

  1. The first will contain the sequence name, followed by the “@” symbol
  2. The DNA sequence itself
  3. A spacing line,  a “+”, optionally followed by the sequence name (repeated)
  4. The quality line

An example of a single sequencing read written in FASTQ format is:

@SRR5232030.1 1 length=101
NATCAATAGTATTCGTACCAATAGAACGAATATCCGCCAGCACCATTTGTTTGGCGGCGTCGCCCACCACGACAATGGAAACCACCGACGCAATACCGATT
+
#>BBABFFFFFFGGGGGGGGGGHHHHHGGGGGHHHGGGGGGHHHHHHHHHGHHGHGGGGGGGGGGGGGHHGGGGGHHHHHGHHHGGGGGGGGFGHHHGGGG

 The quality is encoded to have a single character representing the Phred score of a base. This means that the quality of the tenth base is encoded in the tenth character of the quality line.