Command-line manipulation of sequence files

Adrian Baez-Ortega
22 September 2018

Here I present some command-line approaches, hoarded over the years, for manipulating and converting between different popular sequence data files formats, namely BAM/SAM, FASTQ and FASTA. While they are by no means the only methods available for these tasks, they are perhaps the most simple.

1. Converting BAM to SAM

BAM is just a binary compressed version of the SAM format, and the easier way to convert between them is via samtools view command from the SAMtools software suite. The -h option allows including the BAM file header in the output SAM, which is often necessary.

samtools view -h INPUT.bam > OUTPUT.sam

2. Converting SAM to BAM

Here, the -bS options indicate that the input is SAM and the output is BAM.

samtools view -bS INPUT.sam > OUTPUT.bam

3. Converting SAM/BAM to FASTQ

Converting from SAM to FASTQ (both of which are text-based formats) may be as simple as selecting specific columns of the SAM file and arranging them into lines of the FASTQ file. The initial grep command discards the SAM file header, which cannot go into the FASTQ file.

grep -v ^@ INPUT.sam | awk '{print "@"$1"\n"$10"\n+\n"$11}' > OUTPUT.fastq

Obviously, the SAM file itself can be produced from a BAM file using samtools view as shown above. There is, however, a problem with this grep|awk approach: each record in a SAM/BAM file represent a read alignment to a reference sequence, and it is possible for a given sequence read to have more than one alignment. In contrast, each record in a FASTQ should represent a unique read, which implies that we should only consider one alignment (the primary alignment) for each read in the SAM/BAM file.

The easiest way (to my knowledge) of generating a FASTQ file from a BAM file, such that reads with multiple alignments are considered only once, is using the bamtofastq command from the biobambam2 software suite, which automatically ignores secondary alignments.

By default, bamtofastq generates two FASTQ files, containing the forward and reverse reads in each read pair, respectively. The output file names are indicated via the F and F2 arguments. Orphan forward and reverse reads lacking a paired mate are also output to two files, indicated by O and O2. For simplicity, the same file name can be used for F and O, and for F2 and O2 (meaning that we don’t care whether the reads have a mate, but only whether they are forward or reverse).

bamtofastq filename=INPUT.bam F=OUTPUT_1.fastq O=OUTPUT_1.fastq \
F2=OUTPUT_2.fastq O2=OUTPUT_2.fastq

Additionally, the inputformat argument allows using SAM input files, while gz=1 can be used to generate compressed FASTQ files (fq.gz extension).

bamtofastq inputformat=sam gz=1 filename=INPUT.sam F=OUTPUT_1.fq.gz O=OUTPUT_1.fq.gz \
F2=OUTPUT_2.fq.gz O2=OUTPUT_2.fq.gz

Finally, if we simply want all the reads to be in a single FASTQ file, we can just omit all the output name arguments and redirect the output to a file.

bamtofastq filename=INPUT.bam > OUTPUT.fastq

For more info, run bamtofastq -h.

4. Converting FASTQ to FASTA

This involves simply selecting the first two lines of every four-line record in the FASTQ file, and replacing the @ at the beginning of the read identifier with >. Note that, for this to work, each sequence must be on a single line (without line breaks), that is, every record in the FASTQ must be composed of exactly four lines.

awk '{ if ('NR%4==1' || 'NR%4==2'){ gsub("@",">",$1); print }}' INPUT.fastq > OUTPUT.fasta

5. Sorting FASTQ files by read identifier

To sort a FASTQ file with four-line records (i.e. with each sequence on a single line) by read ID, we can transform it into a four-column table, sort by the value of the first column, and convert back to FASTQ format, all within a single line. The -S argument in the sort command allows us to define the maximum amount of memory to use (4G means a maximum of 4 GB).

cat INPUT.fastq | paste - - - - | sort -k1,1 -S 4G | tr '\t' '\n' > OUTPUT.fastq

6. Merging FASTQ files with identical read names

When we have a pair of FASTQ files containing, respectively, the forward and reverse reads in each read pair, sometimes we find that the two reads in a same pair have identical identifiers, such that if we merge both FASTQ files into one we end up with duplicated read IDs. This can be solved by adding a suffix (in this case, _1 or _2) to the IDs in each file while merging them (by concatenation).

awk '{ if ('NR%4==1'){ $1=$1"_1" } print }' INPUT_1.fastq > OUTPUT.fastq
awk '{ if ('NR%4==1'){ $1=$1"_2" } print }' INPUT_2.fastq >> OUTPUT.fastq

7. Merging BAM files with different read groups

The samtools merge command from the SAMtools suite allows us to easily merge two or more BAM files as follows.

samtools merge OUTPUT.bam INPUT_1.bam INPUT_2.bam [INPUT_3.bam ...]

However, this command preserves only the header of the first input file (INPUT_1.bam) and discards the headers of all subsequent files. This means that, if the reads in the different BAM files belong to different read groups, we will lose their read group information, as this is contained in the @RG records within the BAM header. Therefore, the @RG records of the subsequent input files first need to be appended to a copy of the first file’s header, before providing samtools merge with the resulting modified header (via the -h argument). After merging, the output BAM should be indexed with samtools index.

samtools view -H INPUT_1.bam > header.txt
samtools view -H INPUT_2.bam | grep "@RG" >> header.txt
## Repeat the line above for INPUT_3.bam, etc.
samtools merge -h header.txt OUTPUT.bam INPUT_1.bam INPUT_2.bam [INPUT_3.bam ...]
samtools index OUTPUT.bam