Parsing and analyzing BAM files
- Binary Alignment/Map (BAM) file (.bam) is a compressed binary format of Sequence Alignment/Map (SAM) file (.sam), which is used for storing the sequence alignment information. BAM file is compressed by the BGZF library and it takes less disk space as compared to text-based SAM file. For example, the 6 GB SAM file can be stored as ~800 MB BAM file.
- The majority of downstream bioinformatics analyses, including sequence assembly, read quantification, alignment viewer (IGV), and so on, require BAM files.
- SAMtools, a software package, allows parsing and analyzing the SAM/BAM files. Here, I am using SAMtools v1.10.
Interconversion of SAM and BAM files
Convert SAM to BAM,
samtools view -bS PC14_L001_R1.sam > PC14_L001_R1.bam
Convert BAM to SAM,
samtools view -h PC14_L001_R1.bam > out.sam
Convert BAM to CRAM,
CRAM is a highly compressed alternative to the BAM file. CRAM file uses indexed reference sequences for the compression. It is fully compatible with BAM. For example, the 800 MB BAM file can be stored as a ~400 MB CRAM file.
samtools view -C -T potato_dm_v404_all_pm_un.fasta PC14_L001_R1_pos_sorted.bam > PC14_L001_R1_pos_sorted.cram
# convert CRAM to BAM (slow as compared to BAM to CRAM)
samtools view -b -T potato_dm_v404_all_pm_un.fasta PC14_L001_R1_pos_sorted.cram > out.bam
Sorting and indexing BAM files
BAM files can be unsorted. Sorted (position-based) BAM files are essential for several Bioinformatics applications such as for StringTie transcript assembly, viewing the alignment using IGV, etc.
# sort BAM file by chromosomal position
# @: number of threads
samtools sort -@ 16 -o PC14_L001_R1_pos_sorted.bam PC14_L001_R1.bam
# sort BAM file by read names (QNAME field)
samtools sort -@ 16 -n -o PC14_L001_R1_read_sorted.bam PC14_L001_R1.bam
# index sorted BAM file (generate .bai file)
# indexing allows fast random access of records of sorted BAM files
samtools index -@ 16 PC14_L001_R1_pos_sorted.bam
You can also use the Picard SortSam command to sort the BAM file by chromosomal position and read name. Read more
here
Split BAM file by chromosome
Split the BAM file based on chromosome name
samtools view -b PC14_L001_R1.bam chr00 > chr00.bam
Merge BAM files
Merge multiple BAM files into a single BAM file,
samtools merge merge_out.bam chr00.bam chr01.bam
# add RG tag to each alignment record for source file name (e.g., RG:Z:chr00)
samtools merge -r merge_out.bam chr00.bam chr01.bam
# merge large number of BAM files
samtools merge merge_out.bam *.bam
Calculate sequence coverage or depth
Get sequence coverage or depth from genome mapped data
# get read depth for each position on chromosome
# use -a parameter to get read depth for all positions
samtools depth PC14_L001_R1.bam > read_depth.txt
# get overall read depth
awk '{sum+=$3;} END {print sum/NR;}' read_depth.txt
14.6749
# $3 means read depth at each position of chromosome (third column from read_depth.txt)
# NR means total rows in read_depth.txt [total mapped chromosome size (with -a option, you will get whole chromosome
# size)]
Extract records from specific regions
Specific region alignment records can be extracted from the BAM file by providing chromosome name and region coordinates
# extract alignment records from chr01 between specific regions
samtools view PC14_L001_R1.bam chr01:1322100-1332100
BAM to FASTA and FASTQ
Convert BAM file into FASTA and FASTQ file format,
# get all mapped reads
# 4 is SAM flag for reads unmapped (0x4)
samtools fasta -F 4 PC14_L001_R1.bam > PC14_L001_R1.fasta
# get unmapped reads
samtools fasta -f 4 PC14_L001_R1.bam > PC14_L001_R1.fasta
# similarly replace fasta with fastq for FASTQ format file
References
- Samtools
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R. The sequence alignment/map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9.
If you have any questions, comments or recommendations, please email me at reneshbe@gmail.com
This work is licensed under a Creative Commons Attribution 4.0 International License