Alignment (Solving the Jigsaw Puzzle)¶
Bowtie2 alignment BAM-files single-end paired-end samtools genome-index MAPQ quality-control multimapping ChIP-seq flagstat MultiQC filtering
Basic Concept (The "Puzzle")¶
Imagine your Reference Genome is the Picture on the Puzzle Box (a complete image of the DNA). Your Reads (FASTAS) are the millions of tiny Puzzle Pieces scattered on the floor.
Alignment is simply picking up every piece and finding exactly where it fits on the picture.
- The Input: Millions of jumbled reads.
- The Tool: Bowtie2 (The Puzzle Solver).
- The Output: A BAM File. This is the digital record of where every piece belongs.
Execution (Solving It)¶
Step 1: The Index (Building the Map)¶
Before Bowtie2 can work, it needs to process the reference genome into a format it can search quickly. This is called an Index.
# Example syntax: bowtie2-build [genome.fa] [prefix_name]
bowtie2-build genome_index/ce.fa genome_index/ce_index # ce.fa = C. elegans reference genome
You only do this once!
You can download the C. elegans reference genome from Ensembl using the following link: C. elegans genome. Place it in the genome_index directory, decompress it, and rename the file to ce.fa
After Indexing (Bowtie2)¶
chipseq_tutorial/
├── fastq_raw/
│ └── ...
├── fastq_cleaned/ ← Fastp cleaned reads
│ └── ...
├── fastp_reports/
│ └── ...
├── genome_index/ ← Bowtie2 index files
│ ├── ce_index.1.bt2
│ ├── ce_index.2.bt2
│ ├── ce_index.3.bt2
│ ├── ce_index.4.bt2
│ ├── ce_index.rev.1.bt2
│ └── ce_index.rev.2.bt2
└── sample_id.txt
Step 2: Single Sample Alignment¶
Run Bowtie2 alignment for a single-end sample
Input files needed:
- Reference genome index:
genome_index/ce_index(created in previous step) - Cleaned FASTQ file:
fastq_cleaned/H3K27me3_IP_rep1.clean.fastq.gz
mkdir -p bowalign
bowtie2 -x genome_index/ce_index \
-U fastq_cleaned/H3K27me3_IP_rep1.clean.fastq.gz \
-p 6 --no-unal \
2> bowalign/H3K27me3_IP_rep1.log | \
samtools sort -@ 6 -o bowalign/H3K27me3_IP_rep1.sorted.bam
samtools index bowalign/H3K27me3_IP_rep1.sorted.bam
What this does:
- bowtie2 aligns reads to the reference genome using the
-Uflag for single-end data - -p 6 uses 6 CPU threads for faster processing
- --no-unal suppresses unaligned reads from output (reads that don't match the genome are excluded, saving disk space and processing time downstream)
- 2> saves alignment statistics to a log file
- samtools sort sorts alignments by genomic position (required for downstream analysis)
- samtools index creates an index file (.bai) for fast random access
Paired-End Data
If you have paired-end data, use -1 and -2 flags instead of -U:
bowtie2 -x genome_index/ce_index \
-1 fastq_cleaned/Sample_R1.clean.fastq.gz \
-2 fastq_cleaned/Sample_R2.clean.fastq.gz \
-p 6 --no-unal \
2> bowalign/Sample.log | samtools sort -@ 6 -o bowalign/Sample.sorted.bam
samtools index bowalign/Sample.sorted.bam
Final Touch: We always "index" the BAM file. Think of this as creating a Table of Contents so the computer can jump to any chromosome instantly.
samtools index bowalign/H3K27me3_IP_rep1.sorted.bam
Output Structure¶
After running this step, your directory should look like:
bowalign/
├── H3K27me3_IP_rep1.log
├── H3K27me3_IP_rep1.sorted.bam
└── H3K27me3_IP_rep1.sorted.bam.bai
Once this single run completes successfully, you can confidently automate for all samples.
Step 3: Automation Loop¶
After creating sample_id.txt (see Section 2.4), here is the script to run this for all your samples:
#!/bin/bash
set -euo pipefail
mkdir -p bowalign bowalign_log
while read -r sample; do
echo "Aligning $sample"
bowtie2 -x genome_index/ce_index \
-U "fastq_cleaned/${sample}.clean.fastq.gz" \
-p 6 --no-unal \
2> "bowalign_log/${sample}.bowtie2.log" \
| samtools sort -@ 6 -o "bowalign/${sample}.sorted.bam"
samtools index "bowalign/${sample}.sorted.bam"
done < sample_id.txt
Paired-End While Loop
If your samples have _R1 and_R2 files:
while read -r sample; do
echo "Aligning $sample"
bowtie2 -x genome_index/ce_index \
-1 "fastq_cleaned/${sample}_R1.clean.fastq.gz" \
-2 "fastq_cleaned/${sample}_R2.clean.fastq.gz" \
-p 6 --no-unal \
2> "bowalign_log/${sample}.log" | \
samtools sort -@ 6 -o "bowalign/${sample}.sorted.bam"
samtools index "bowalign/${sample}.sorted.bam"
done < sample_id.txt
Optimization¶
Optimization: Threads vs. Jobs¶
You have a limited number of CPU cores (computers brains). You can use them in two ways:
- Multi-Threading (
-p 6): One sample uses 6 cores. It finishes very fast, but you only do one sample at a time.- Best for: Large genomes, low memory.
- Parallel Jobs: You run 3 samples at once, and each sample uses 2 cores.
- Best for: Many small samples (RNA-seq, small genomes).
Rule of Thumb: bowtie2 stops getting faster after about 8 threads. Don't give it 50 threads; it's a waste!
Benchmarking Bowtie2 Threading - Jeff Kaufman (2023)
Guidance with using multiple threads with samtools - GitHub
Example with GNU Parallel:
#!/bin/bash
set -euo pipefail
mkdir -p bowalign bowalign_log
parallel -j 2 '
bowtie2 \
-x genome_index/ce_index \
-U fastq_cleaned/{}.clean.fastq.gz \
-p 4 \
--no-unal \
2> bowalign_log/{}.log \
| samtools sort \
-@ 2 \
-m 1G \
-o bowalign/{}.sorted.bam
samtools index bowalign/{}.sorted.bam
' :::: sample_id.txt
Effective CPU usage (implied by this setup):
- 2 parallel samples (
-j 2) - Per sample: bowtie2 (4 threads,
-p 4) + samtools sort (2 threads,-@ 2) - Total ≈ 12 threads → safe on a 16-core machine
chipseq_tutorial/
├── fastq_raw/
├── fastq_cleaned/
├── fastp_reports/
├── genome_index/
├── bowalign/ ← Aligned BAM files
│ ├── H3K27me3_IP_rep1.sorted.bam
│ ├── H3K27me3_IP_rep1.sorted.bam.bai
│ └── ...
├── bowalign_log/ ← Alignment statistics
│ ├── H3K27me3_IP_rep1.log
│ └── ...
└── sample_id.txt
Understanding BAM File Structure¶
Before we check alignment quality, let's look at what's inside a BAM file. BAM files store alignment information in a structured format with 11 mandatory columns:
# View the first few alignments in human-readable format
samtools view bowalign/H3K27me3_IP_rep1.sorted.bam | head -3
Example output (single-end data):
SRR7297996.3814722 0 I 15200 42 76M * 0 0 ACGTACGT... IIIIIIII...
SRR7297996.7661710 0 II 98432 30 76M * 0 0 TGCATGCA... IIIHHHHH...
SRR7297996.11560734 16 III 45123 0 76M * 0 0 GATTACA... IIIHHHGG...
Column Breakdown (First 5 of 11 columns):
| Column | Name | Example | Description |
|---|---|---|---|
| 1 | QNAME | SRR7297996.3814722 | Read/Query name |
| 2 | FLAG | 0, 16 | Bitwise FLAG (flagstat reads this!) |
| 3 | RNAME | I, II, III | Reference sequence name (chromosome) |
| 4 | POS | 15200 | Leftmost mapping position |
| 5 | MAPQ | 42, 30, 0 | Mapping Quality score (0-255) |
What Each QC Tool Evaluates:
samtools flagstat reads Column 2 (FLAG):
- Bit 4: Read unmapped?
- Bit 256: Secondary alignment?
- Bit 1024: PCR/optical duplicate?
- Bit 2048: Supplementary alignment?
- And more... (see FLAG section below)
Decode FLAGS
Use Explain SAM Flags to understand bitwise FLAG values.
Multimapping & The "Lost GPS"¶
Sometimes a read is repetitive (e.g., "ATATATAT"). It fits in 50 different places on the genome.
The aligner doesn't know which spot is correct, so it gives it a Low MAPQ Score (Mapping Quality).
MAPQ filtering (-q 30) reads Column 5 (MAPQ):
- MAPQ = 0: "I have NO clue where this goes. It fits in many places." (Multimapped)
- MAPQ > 30: "I am highly confident this read belongs EXACTLY here." (Unique)
Quality Check: samtools flagstat & samtools stats¶
Did the alignment work? Let's check the score using two tools:
- flagstat: Quick alignment summary (mapped reads, pairs, etc.)
- stats: Comprehensive metrics including MAPQ distribution (for MultiQC visualization)
# Create QC output directory for alignment metrics
mkdir -p bowalign_qc
# Quick summary with flagstat
samtools flagstat bowalign/H3K27me3_IP_rep1.sorted.bam > bowalign_qc/H3K27me3_IP_rep1.flagstat.txt
# Comprehensive stats (includes MAPQ distribution for MultiQC)
samtools stats bowalign/H3K27me3_IP_rep1.sorted.bam > bowalign_qc/H3K27me3_IP_rep1.stats.txt
Batch processing for all samples:
set -euo pipefail
THREADS=8
mkdir -p bowalign_qc
while read -r sample; do
echo "Running QC for: $sample"
samtools flagstat -@ "$THREADS" "bowalign/${sample}.sorted.bam" > "bowalign_qc/${sample}.flagstat.txt"
samtools stats -@ "$THREADS" "bowalign/${sample}.sorted.bam" > "bowalign_qc/${sample}.stats.txt"
done < sample_id.txt
echo "Generating MultiQC report..."
multiqc bowalign_qc/ -o bowalign_qc/ -n alignment_qc_report
echo "MultiQC report saved: bowalign_qc/alignment_qc_report.html"
This creates an interactive HTML report (alignment_qc_report.html) with visualizations of alignment metrics across all samples, including MAPQ distribution plots.
What to look for:
- Mapping Rate: Ideally >80-90%. If it's <50%, your DNA might be contaminated (e.g., bacteria in a sample).
- Goal: High mapping percentage indicates good quality alignment.
- Warning: If mapping is low (<50%), you may have the wrong organism or bad sequencing.
The Sensitivity-Specificity Trade-off:
Multi-mapping reads critically influence the balance between sensitivity and specificity in ChIP-seq peak detection. Excluding multi-mappers, standard practice in most peak callers, improves specificity by preventing artificial signal inflation in repetitive regions but reduces sensitivity for genuine binding events within those regions ([Nakato et al., 2016](https://doi.org/10.1093/bib/bbw023
)). This trade-off is acceptable for transcription factors and chromatin features predominantly in unique genomic loci. However, repetitive and transposable elements (REs/TEs) constitute significant genome portions with important regulatory roles, and discarding multi-mapped reads substantially underrepresents regulatory events in these regions ([Morrissey et al., 2024](https://doi.org/10.1101/gr.278638.123
)).
Quick Manual MAPQ Check¶
For a quick manual inspection of MAPQ distribution (complementary to the MultiQC visualization):
samtools view bowalign/H3K27me3_IP_rep1.sorted.bam | awk '{print $5}' | sort -n | uniq -c > bowalign_qc/H3K27me3_IP_rep1.mapq.txt
Understanding the Manual Output:
- If you see lots of 0s: Your file includes "Lost GPS" reads (Multimappers).
- If your scores start at 30+: Your file has Already Filtered the bad reads.
Output format: count MAPQ_score
(chip) rajaishaqnabikhan@Mac % cat bowalign_qc/H3K27me3_IP_rep1.mapq.txt
133889 0 # MAPQ = 0 → completely ambiguous alignments; no positional confidence
1933336 1
53822 2
15117 3
3816 4
1830 5
470602 6
55346 7
14767 8
25440 11
50763 12
7006 14
12688 15
5168 16
32747 17
22396 18
23353 21
19845 22
21062 23
32201 24
19869 25
20007 26
16640 27
246444 30
151768 31
119155 32 # MAPQ = 32 → high-confidence, near-unique alignments (passes MAPQ ≥ 30 filter)
6925 33
133902 34
129735 35
106899 36
126589 37
130911 38
150140 39
100539 40
18872247 42 # MAPQ = 42 → very high-confidence, uniquely mapped reads; dominant unique class
Verdict: This BAM file contains multimapped reads.
Filtering Multi-mapping Reads¶
For standard ChIP-seq analysis (TFs, histone marks in unique regions), filter BAM files to retain only high-quality uniquely mapped reads:
# Create directory for filtered BAM files
mkdir -p bowalign_filtered
# Filter out multi-mappers (keep only MAPQ ≥ 30)
samtools view -b -q 30 bowalign/H3K27me3_IP_rep1.sorted.bam > bowalign_filtered/H3K27me3_IP_rep1.filtered.bam
# Index the filtered BAM
samtools index bowalign_filtered/H3K27me3_IP_rep1.filtered.bam
Batch processing for all samples:
#!/bin/bash
set -euo pipefail
THREADS=8
mkdir -p bowalign_filtered
while read -r sample; do
echo "Filtering multi-mappers for: $sample"
samtools view -@ "$THREADS" -b -q 30 bowalign/${sample}.sorted.bam > bowalign_filtered/${sample}.filtered.bam
samtools index -@ "$THREADS" bowalign_filtered/${sample}.filtered.bam
done < sample_id.txt
Note
The -q 30 flag filters out reads with MAPQ < 30 (removes multi-mappers and low-confidence alignments). Most downstream tools (MACS2, deepTools) can also filter by MAPQ, so this step is optional but recommended for cleaner data.
Directory Structure After Alignment QC¶
chipseq_tutorial/
├── fastq_raw/
├── fastq_cleaned/
├── genome_index/
├── bowalign/ ← Aligned BAM files
│ ├── H3K27me3_IP_rep1.sorted.bam
│ ├── H3K27me3_IP_rep1.sorted.bam.bai
│ └── ...
├── bowalign_log/ ← Alignment statistics
│ ├── H3K27me3_IP_rep1.log
│ └── ...
├── bowalign_qc/ ← Quality control metrics
│ ├── H3K27me3_IP_rep1.flagstat.txt
│ ├── H3K27me3_IP_rep1.stats.txt
│ ├── H3K27me3_IP_rep1.mapq.txt
│ ├── alignment_qc_report.html
│ ├── alignment_qc_report_data/
│ └── ...
├── bowalign_filtered/ ← Filtered BAM files (MAPQ ≥ 30)
│ ├── H3K27me3_IP_rep1.filtered.bam
│ ├── H3K27me3_IP_rep1.filtered.bam.bai
│ └── ...
└── sample_id.txt
Summary¶
- Analogy: Alignment is placing puzzle pieces onto the reference picture.
- Action: Use
bowtie2to align andsamtools sortto organize. - Result: A Sorted BAM file (the solved puzzle), ready for peak calling.
Up Next
Before peak calling, we need to remove PCR duplicates and assess library quality.