Experiment Design & BAM Quality Control¶
ENCODE BAM-files samtools flagstat MAPQ quality-control alignment-QC deepTools single-end
1. Basic Concept (The Experiment & The File)¶
1.1 The Experiment Story¶
This tutorial uses real data from BLaER1 cells (human immune cells).
- Treatment: Cells were treated for 18 hours with Estradiol, Interleukin-3, and CSF1 to activate specific genes.
- The Targets:
- CEBPA: A Transcription Factor (The "Driver" that turns on genes).
- H3K27me3: A Histone Mark for Closed/Repressed DNA (The "Stop Sign").
- H3K9ac: A Histone Mark for Open/Active DNA (The "Go Sign").
- Input: Random background DNA (The "Noise" control).
1.2 The "Zip File" Analogy (BAM vs SAM)¶
- SAM File (Sequence Alignment Map): This is a huge, readable text file. It's like a 1000-page printed manuscript.
- BAM File (Binary Alignment Map): This is the Compressed Zip File version. It contains the exact same info but is smaller and faster for the computer to read.
- Rule: We always work with BAM files to save space and time.
Data Availability:
2. Data used in the tutorial¶
2.1 Sample Table¶
Here are the files we are analyzing. In real life, you should make a table like this to track your work.
| Biosample Accession | ChIP Type | Target | Custom BAM Filename |
|---|---|---|---|
| ENCFF327JFG | TF ChIP-seq | CEBPA | ceb_ENCFF327JFG.bam |
| ENCFF744SVA | TF ChIP-seq | CEBPA | ceb_ENCFF744SVA.bam |
| ENCFF164ALR | Histone ChIP-seq | H3K27me3 | H3K27me3_ENCFF164ALR.bam |
| ENCFF532DQH | Histone ChIP-seq | H3K27me3 | H3K27me3_ENCFF532DQH.bam |
| ENCFF193NPE | Histone ChIP-seq | H3K9ac | H3K9ac_ENCFF193NPE.bam |
| ENCFF534IPX | Histone ChIP-seq | H3K9ac | H3K9ac_ENCFF534IPX.bam |
| ENCFF110SOB | Control ChIP-seq | Input | Input_ENCFF110SOB.bam |
| ENCFF919XCV | Control ChIP-seq | Input | Input_ENCFF919XCV.bam |
2.2 Simplest and easiet way to donwload all the bam files in the working folder¶
#!/bin/bash
set -euo pipefail
cat <<EOF | xargs -n 2 -P 4 wget -c -O
ceb_ENCFF327JFG.bam https://www.encodeproject.org/files/ENCFF327JFG/@@download/ENCFF327JFG.bam
ceb_ENCFF744SVA.bam https://www.encodeproject.org/files/ENCFF744SVA/@@download/ENCFF744SVA.bam
H3K27me3_ENCFF164ALR.bam https://www.encodeproject.org/files/ENCFF164ALR/@@download/ENCFF164ALR.bam
H3K27me3_ENCFF532DQH.bam https://www.encodeproject.org/files/ENCFF532DQH/@@download/ENCFF532DQH.bam
H3K9ac_ENCFF193NPE.bam https://www.encodeproject.org/files/ENCFF193NPE/@@download/ENCFF193NPE.bam
H3K9ac_ENCFF534IPX.bam https://www.encodeproject.org/files/ENCFF534IPX/@@download/ENCFF534IPX.bam
Input_ENCFF110SOB.bam https://www.encodeproject.org/files/ENCFF110SOB/@@download/ENCFF110SOB.bam
Input_ENCFF919XCV.bam https://www.encodeproject.org/files/ENCFF919XCV/@@download/ENCFF919XCV.bam
EOF
3. Basic Quality Checks¶
Before processing, we verify the BAM files are healthy. For detailed explanations of these metrics, see Section 05: Alignment QC.
3.1. Create a smaller test file (Optional) Working with full genomes takes time. For testing, we can extract just chromosome 11 and 12:
samtools view -b -h ceb_ENCFF327JFG.bam chr11 chr12 | samtools sort -o ceb_ENCFF327JFG.chr11_12.bam
3.2. Get Alignment Stats
For detailed explanation of flagstat output, see Understanding BAM File Structure.
# Quick summary
samtools flagstat ceb_ENCFF327JFG.bam > ceb_ENCFF327JFG.flagstat.txt
cat ceb_ENCFF327JFG.flagstat.txt
Output:
2565563 + 0 in total (QC-passed reads + QC-failed reads)
2565563 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
2565563 + 0 mapped (100.00% : N/A)
2565563 + 0 primary mapped (100.00% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
Verdict: 100% mapping rate - High-quality alignment, single-end data, no duplicates
3.3. Check MAPQ Distribution
For detailed explanation of MAPQ scores and multimapping, see Multimapping & The "Lost GPS".
samtools view ceb_ENCFF327JFG.bam | awk '{print $5}' | sort -n | uniq -c
Output:
123964 30 ← Lowest score is 30 (Good confidence)
1928 31
20741 32
1477 33
4040 34
34434 35
14294 36
53329 37
30665 38
88528 39
77123 40
2960636 42 ← Highest score is 42 (Perfect confidence)
Verdict: MAPQ ≥ 30 - File contains only uniquely mapped, high-quality reads
Directory Structure After BAM QC¶
chipseq_tutorial/
├── encode_bam/ ← ENCODE BAM files
│ ├── ceb_ENCFF327JFG.bam
│ ├── H3K9ac_ENCFF193NPE.bam
│ └── ... (8 BAM files total)
├── encode_bam_qc/ ← QC metrics for ENCODE BAMs
│ ├── ceb_ENCFF327JFG.flagstat.txt
│ ├── ceb_ENCFF327JFG.mapq.txt
│ ├── H3K9ac_ENCFF193NPE.flagstat.txt
│ ├── H3K9ac_ENCFF193NPE.mapq.txt
│ └── ... (16 QC files total)
└── sample_id.txt
Summary¶
- Context: We are analyzing active/repressed marks in BLaER1 cells.
- Files: BAMs are compressed alignment maps.
- QC: We use
samtools flagstatand check MAPQ scores to ensure we aren't analyzing "lost" multimapping reads.
!!! NOTE Up Next: We'll perform cross-correlation analysis to estimate fragment length and assess signal quality.