Welcome to the Practical ChIP-seq Tutorial¶
1. What is ChIP-seq?¶
ChIP-seq (Chromatin Immunoprecipitation followed by sequencing) answers one key question: Where do proteins interact with DNA in our genome?
Think of your genome as a massive library written in about 3 billion letters (= base pairs). Certain proteins act as "bookmarks" that control which genes are active. ChIP-seq enables us to identify all these bookmarks simultaneously across the entire genome.
By mapping these binding locations, we learn how genes are turned on and off, which is critical for understanding both normal biology and diseases like cancer.
contact
ishaaq.raja@gmail.com
2. A Brief History & Why ChIP-seq Matters¶
Before ChIP-seq, researchers had limited options for studying protein-DNA interactions. ChIP-PCR could only examine a handful of pre-selected regions, like searching for a word in a book by checking only 10 pages. ChIP-chip improved on this by using microarrays, but it remained constrained to predefined genomic regions and offered limited resolution (Park, 2009).
The arrival of next-generation sequencing in the mid-2000s changed everything. ChIP-seq enabled genome-wide, high-resolution mapping for the first time, allowing scientists to see the complete picture of protein-DNA interactions across the entire genome.
This breakthrough enabled landmark studies. The ENCODE Project (2012) used ChIP-seq extensively and reported biochemical activity for approximately 80% of the human genome, challenging the long-held "junk DNA" view (how much of that activity is truly functional is still debated). That same year, researchers used ChIP-seq to reveal how our 24-hour body clock is reflected in chromatin, offering a molecular-level view of why circadian disruption is associated with increased disease risk.
These discoveries illustrate why ChIP-seq matters for cancer research, personalized medicine, and our understanding of gene regulation.
3. How ChIP-seq Works (The Experiment)¶
A ChIP-seq experiment captures where proteins bind to DNA through five connected steps. First, formaldehyde cross-links proteins to DNA, freezing them in place like taking a snapshot. Next, the DNA is sheared into small fragments—imagine cutting a long rope into shorter segments that are easier to handle.
The key step is immunoprecipitation: antibodies that recognize your protein of interest act like magnets, pulling out only the DNA fragments attached to that specific protein. After this enrichment, reverse cross-linking releases the DNA from the proteins, leaving you with purified DNA fragments that were bound by your target. Finally, these fragments are sequenced, generating millions of short reads that reveal the genomic locations where your protein was bound (Furey, 2012).
4. Computational Analysis Pipeline¶
The millions of reads from Section 3 arrive as data files. Here's how we process them:
FASTQ → BAM → Peaks + BigWig
After sequencing, you receive data in FASTQ format: a text file containing millions of short DNA sequences (= reads), each with a quality score for every letter (= base call) showing how confident the sequencer is in that call. At this stage, we don't know where in the genome these sequences came from. That's what the next step figures out.
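To make the FASTQ description concrete, here is a minimal Python sketch that reads one record and decodes its quality string. The record itself is made up for illustration, and the sketch assumes the common Phred+33 encoding and a simple four-line record; real files are usually handled by dedicated tools rather than hand-written parsers.

```python
# A single FASTQ record is four lines: header, sequence, separator, qualities.
# This example record is invented for illustration only.
record = """@read_1
GATTACA
+
IIII5+!
"""


def parse_fastq_record(text):
    """Split one four-line FASTQ record into (name, sequence, Phred scores)."""
    header, sequence, _separator, quality = text.strip().split("\n")
    # Assumption: Phred+33 encoding, so subtract 33 from each ASCII code.
    scores = [ord(ch) - 33 for ch in quality]
    return header[1:], sequence, scores


def error_probability(phred):
    """Convert a Phred score Q into an error probability, P = 10^(-Q/10)."""
    return 10 ** (-phred / 10)


name, seq, scores = parse_fastq_record(record)
print(name, seq, scores)
print(error_probability(20))  # a score of 20 means about a 1-in-100 chance of error
```

Each character in the fourth line encodes the confidence of the base directly above it, which is why the sequence and quality strings always have the same length.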
Alignment (= mapping) uses software like Bowtie2 to match each read to its location on a reference genome. The output is a SAM file (Sequence Alignment/Map), which records where each read landed, how well it matched, and other details.
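As a rough illustration of what a SAM record holds, the sketch below splits one alignment line into its leading fields and decodes three bits of the FLAG value. The alignment line is invented for this example, and the helper name `parse_sam_line` is hypothetical; in practice you would use a tool such as samtools instead of parsing by hand.

```python
# A made-up SAM alignment line (tab-separated); values are illustrative only.
sam_line = "read_1\t16\tchrI\t1000\t42\t7M\t*\t0\t0\tTGTAATC\tIIIIIII"


def parse_sam_line(line):
    """Extract the core mandatory fields of one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    return {
        "read_name": fields[0],
        "chromosome": fields[2],
        "position": int(fields[3]),  # 1-based leftmost mapped position
        "mapq": int(fields[4]),      # mapping quality
        "cigar": fields[5],          # how the read matches the reference
        # FLAG is a bit field; these three bits are defined in the SAM spec.
        "is_unmapped": bool(flag & 0x4),
        "is_reverse_strand": bool(flag & 0x10),
        "is_duplicate": bool(flag & 0x400),
    }


alignment = parse_sam_line(sam_line)
print(alignment)
```

The "where each read landed" part is the chromosome and position, while "how well it matched" is summarized by the mapping quality and the CIGAR string.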
SAM files are plain text and take up a lot of space, so we compress them into BAM format (Binary Alignment/Map): the same information, but smaller and faster to work with. In most pipelines, SAM files are never saved; the aligner's output is converted straight to BAM.
Next comes peak calling. Tools like MACS3 scan the BAM file and find regions where reads pile up more than expected from the background. These "peaks" are likely protein binding sites. The output includes peak coordinates with their statistics (= p-values, q-values, fold enrichment), saved as BED-style files, along with signal tracks saved as bedGraph files.
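The idea of "more reads than expected" can be sketched with a simple Poisson test. This is a toy illustration, not MACS3's actual implementation: the window counts, the fixed background rate, and the `alpha` cutoff are all assumptions chosen for the example.

```python
import math


def poisson_upper_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected)."""
    cumulative = sum(
        math.exp(-expected) * expected ** i / math.factorial(i)
        for i in range(observed)
    )
    # Guard against tiny negative values caused by floating-point rounding.
    return max(0.0, 1.0 - cumulative)


def call_enriched_windows(counts, expected, alpha=1e-5):
    """Return indices of windows whose read count is unlikely under the background."""
    return [
        i for i, count in enumerate(counts)
        if poisson_upper_tail(count, expected) < alpha
    ]


# Hypothetical read counts in six consecutive genomic windows.
window_counts = [3, 5, 4, 40, 6, 2]
print(call_enriched_windows(window_counts, expected=4.0))  # → [3]
```

Only the window with 40 reads stands out against a background of about 4 reads per window; the others are well within what chance alone would produce.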
BED files are simple lists of genomic locations (chromosome, start position, end position). They're used for many downstream tasks like finding DNA motifs or linking peaks to nearby genes (= annotation).
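A minimal sketch of reading the three core BED columns in Python follows. The peak coordinates are invented for illustration, and `parse_bed` is a hypothetical helper name. One detail worth knowing: BED start positions are 0-based and end positions are exclusive, so a region's width is simply end minus start.

```python
# Three made-up peaks in BED format (tab-separated: chromosome, start, end).
bed_text = """chr1\t100\t250
chr1\t900\t1000
chr2\t50\t75
"""


def parse_bed(text):
    """Return a list of (chromosome, start, end) tuples from BED text."""
    peaks = []
    for line in text.splitlines():
        # Skip blank lines and optional header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        peaks.append((chrom, int(start), int(end)))
    return peaks


peaks = parse_bed(bed_text)
# BED is 0-based, half-open: width = end - start.
widths = [end - start for _chrom, start, end in peaks]
print(peaks)
print(widths)  # → [150, 100, 25]
```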
bedGraph files are similar to BED files but include a fourth column: a numerical value (like signal intensity or coverage) for each region. This format is human-readable text, useful for inspection, but results in large file sizes for whole-genome data.
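To show where the fourth column comes from, the sketch below collapses a per-base coverage list into bedGraph lines, merging neighbouring positions that share the same value. The coverage values and the function name `coverage_to_bedgraph` are assumptions made for this example; real pipelines generate bedGraph files with dedicated tools.

```python
def coverage_to_bedgraph(chrom, coverage):
    """Collapse per-base coverage into bedGraph lines (runs of equal value)."""
    lines = []
    run_start = 0
    for pos in range(1, len(coverage) + 1):
        # Close the current run at the end of the data or when the value changes.
        if pos == len(coverage) or coverage[pos] != coverage[run_start]:
            lines.append(f"{chrom}\t{run_start}\t{pos}\t{coverage[run_start]}")
            run_start = pos
    return lines


# Hypothetical read coverage at seven consecutive bases.
per_base = [0, 0, 2, 2, 2, 5, 0]
for line in coverage_to_bedgraph("chrI", per_base):
    print(line)
```

Seven positions become four lines here. Across a whole genome the same text representation still adds up to a very large file.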
BigWig files contain the same signal information as bedGraph, but differ in two key ways:
- Binary format: Data is stored in compressed binary rather than plain text, reducing file size significantly
- Indexed structure: An internal index allows software to retrieve data from any genomic region without reading the entire file
In practice, this means: when you open a bedGraph in a genome browser, the software must scan from the beginning of the file to find your region of interest. With BigWig, the software uses the index to jump directly to the relevant data block. For a 3 billion base pair human genome, this difference makes BigWig the standard format for visualization.
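The difference between scanning and jumping can be illustrated with a small Python analogy. This is a conceptual sketch only: it does not reproduce the real BigWig file layout, and the intervals and function names are invented for the example. A sorted list of start positions stands in for the index.

```python
import bisect

# Sorted, non-overlapping signal intervals on one chromosome: (start, end, value).
intervals = [(0, 100, 1.0), (100, 250, 4.5), (250, 400, 0.5), (400, 900, 7.0)]


def linear_scan(intervals, position):
    """bedGraph-style lookup: read records from the top until one contains position."""
    for steps, (start, end, value) in enumerate(intervals, start=1):
        if start <= position < end:
            return value, steps
    return None, len(intervals)


# The stand-in "index": just the sorted start coordinates.
starts = [start for start, _end, _value in intervals]


def indexed_lookup(intervals, starts, position):
    """BigWig-style lookup: binary search the index, then read one record."""
    i = bisect.bisect_right(starts, position) - 1
    if i >= 0 and intervals[i][0] <= position < intervals[i][1]:
        return intervals[i][2]
    return None


print(linear_scan(intervals, 450))             # → (7.0, 4): four records read
print(indexed_lookup(intervals, starts, 450))  # → 7.0
```

With four intervals the saving is trivial, but the number of records read by the linear scan grows with the file size, while the indexed lookup needs only a handful of comparisons.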
5. Who Is This Tutorial For?¶
This tutorial is designed for:
- Biology students new to bioinformatics
- Researchers who want hands-on ChIP-seq analysis skills
- Anyone who prefers learning by doing, not just reading
No prior coding experience is required. We explain every step.
6. Why This Tutorial?¶
We built this course to solve common frustrations of bioinformatics learning. Here's what makes it different:
The "Tiered" Learning Method¶
We believe you shouldn't just run code; you should understand it. Every chapter is broken into three levels:
| Level | Focus | What You'll Get |
|---|---|---|
| Level 1: Basic Concept | The "Why" | Simple explanations with real-world analogies |
| Level 2: Execution | The "How" | Exact code to run, line-by-line |
| Level 3: Interpretation | The "So What?" | How to read output and spot good vs. bad results |
By the end, you'll have the skills and the code to analyze your own ChIP-seq data.
7. Tutorial Structure¶
The tutorial consists of 16 comprehensive chapters organized by workflow stage:
Setup & Data Acquisition¶
01. Environment Setup¶
02. Bash Automation Fundamentals¶
- Introduction: Why Learn Bash for Bioinformatics?
- The Foundation: Setting Up Safe Scripts
- Part 1: Understanding Sample Lists
- Part 2: Creating Your Sample List
- Part 3: Using Your Sample List in Automation
03. GEO/FASTQ Download¶
- Level 1: Basic Concept
- Level 2: Fetching the data
- Connecting GEO to SRA
- Technical Replicates (Multi-lane)
04. FASTQ Concepts & QC¶
- Basic Concept (The Anatomy of a Read)
- Level 2: Execution (The Car Wash)
- Level 3: Advanced Analysis (The Math)
Alignment & Initial QC¶
05. Alignment with Bowtie2¶
06. Duplicate Removal & QC¶
- Basic Concept: The "Photocopier" Analogy
- Understanding the Details
- Marking & Removing Duplicates: Why Picard is Unique
- Samtools (Simple Alternative)
07. Library Complexity Assessment¶
- Level 1: Basic Concept (The Photographer)
- Level 2: Execution (The Calculator)
- Pipeline Summary: Pre-processing Workflow Complete
- Transition to ENCODE BAM Files
08. BAM Quality Metrics¶
09. Strand Cross-Correlation¶
10. BAM Summary & Fingerprint Plots¶
- Basic Concept (The Health Check)
- Running QC on BAM Files Before Peak Calling
- Level 3: Reading the Charts
Peak Calling & Reproducibility¶
11. MACS3 Peak Calling¶
12. FRiP Quality Metrics¶
13. IDR & Consensus Peaks¶
- Reproducibility Analysis: IDR (Irreproducible Discovery Rate)
- Running IDR on CEBPA Replicates
- Motif Analysis: Finding DNA Binding Sequences
- Motif Discovery with HOMER
Visualization & Annotation¶
14. BigWig Generation¶
- Basic Concept (The Traffic Map)
- Requirements (Effective Genome Size)
- Execution (The Converter)
- Fine Tuning (Under the Hood)
15. Visualization with deepTools¶
- Basic Concept (Camera Modes)
- The Blueprint & The Photo - Basic requirement
- Reading the Pictures
- Average Signal Analysis
- Normalization to Input Controls
- CEBPA Peak-Focused Analysis
16. Peak Annotation with ChIPseeker¶
8. Dataset Used in This Tutorial¶
We use two datasets to teach different parts of the pipeline:
Part 1: Preprocessing (FASTQ → BAM)¶
Source: GSE115704. Histone modifications in C. elegans sperm, oocytes, and early embryos.
Why this dataset? It's publicly available and demonstrates the practical steps of downloading, organizing, and aligning raw data.
Part 2: Downstream Analysis (BAM → Peaks → Visualization)¶
Source: ENCODE BLaER1 data. Human cell line with ChIP-seq for CEBPA, H3K27me3, and H3K9ac.
Why this dataset? Pre-aligned, high-quality data that lets us focus on peak calling, normalization, and comparative analysis.
Let's Get Started¶
Up Next
Before diving into analysis, we'll set up your computational environment with the bioinformatics tools you'll need throughout this tutorial.