Welcome to the Practical ChIP-seq Tutorial¶

ChIP-seq chromatin-immunoprecipitation epigenetics genome-wide-binding transcription-factors histone-modifications NGS introduction

1. What is ChIP-seq?¶

ChIP-seq (Chromatin Immunoprecipitation followed by sequencing) answers one key question: Where do proteins interact with DNA in our genome?

Think of your genome as a massive library with 3 billion books. Certain proteins act as "bookmarks" that control which genes are active. ChIP-seq enables us to identify all these bookmarks simultaneously across the entire genome.

By mapping these binding locations, we learn how genes are turned on and off, which is critical for understanding both normal biology and diseases like cancer.

contact

ishaaq.raja@gmail.com

2. A Brief History & Why ChIP-seq Matters¶

Before ChIP-seq, researchers had limited options for studying protein-DNA interactions. ChIP-PCR could only examine a handful of pre-selected regions, like searching for a word in a book by checking only 10 pages. ChIP-chip improved on this by using microarrays, but it remained constrained to predefined genomic regions and offered limited resolution (Park, 2009).

The arrival of next-generation sequencing in the mid-2000s changed everything. ChIP-seq enabled genome-wide, high-resolution mapping for the first time, allowing scientists to see the complete picture of protein-DNA interactions across the entire genome.

This breakthrough enabled landmark discoveries. The ENCODE Project (2012) used ChIP-seq extensively to demonstrate that approximately 80% of the human genome has biochemical function—fundamentally overturning the long-held "junk DNA" myth. That same year, researchers used ChIP-seq to reveal how our 24-hour body clock is encoded in chromatin, explaining at the molecular level why circadian disruption increases disease risk.

These discoveries demonstrate ChIP-seq's direct impact on personalized medicine, cancer research, and our understanding of gene regulation.

3. How ChIP-seq Works (The Experiment)¶

A ChIP-seq experiment captures where proteins bind to DNA through five connected steps. First, formaldehyde cross-links proteins to DNA, freezing them in place like taking a snapshot. Next, the DNA is sheared into small fragments—imagine cutting a long rope into shorter segments that are easier to handle.

The key step is immunoprecipitation: antibodies that recognize your protein of interest act like magnets, pulling out only the DNA fragments attached to that specific protein. After this enrichment, reverse cross-linking releases the DNA from the proteins, leaving you with purified DNA fragments that were bound by your target. Finally, these fragments are sequenced, generating millions of short reads that reveal the genomic locations where your protein was bound (Furey, 2012).

4. Computational Analysis Pipeline¶

The millions of reads from Section 3 arrive as data files. Here's how we process them:

FASTQ → BAM → Peaks + BigWig

After sequencing, you receive data in FASTQ format—a text file containing millions of short DNA sequences (= reads), each with a quality score showing how confident we are in each letter (= base call). At this stage, we don't know where in the genome these sequences came from. That's what the next step figures out.

Alignment (= mapping) uses software like Bowtie2 to match each read to its location on a reference genome. The output is a SAM file (Sequence Alignment/Map), which records where each read landed, how well it matched, and other details.

SAM files are plain text and take up a lot of space. So we compress them into BAM format (Binary Alignment/Map) same information, but smaller and faster to work with. In most pipelines, SAM files are never saved; the aligner writes directly to BAM.

Next comes peak calling. Tools like MACS3 scan the BAM file and find regions where reads pile up more than expected. These "peaks" are likely protein binding sites. The output includes peak coordinates and statistical parameters (= p-values, q-values, fold enrichment), saved as BED files and bedGraph files.

BED files are simple lists of genomic locations (chromosome, start position, end position). They're used for many downstream tasks like finding DNA motifs or linking peaks to nearby genes (= annotation).

bedGraph files are similar to BED files but include a fourth column: a numerical value (like signal intensity or coverage) for each region. This format is human-readable text, useful for inspection, but results in large file sizes for whole-genome data.

BigWig files contain the same signal information as bedGraph, but differ in two key ways:

Binary format: Data is stored in compressed binary rather than plain text, reducing file size significantly
Indexed structure: An internal index allows software to retrieve data from any genomic region without reading the entire file

In practice, this means: when you open a bedGraph in a genome browser, the software must scan from the beginning of the file to find your region of interest. With BigWig, the software uses the index to jump directly to the relevant data block. For a 3 billion base pair human genome, this difference makes BigWig the standard format for visualization.

5. Who Is This Tutorial For?¶

This tutorial is designed for:

Biology students new to bioinformatics
Researchers who want hands-on ChIP-seq analysis skills
Anyone who prefers learning by doing, not just reading

No prior coding experience is required. We explain every step.

6. Why This Tutorial?¶

We built this course to solve common frustrations of bioinformatics learning. Here's what makes it different:

The "Tiered" Learning Method¶

We believe you shouldn't just run code, you should understand it. Every chapter is broken into three levels:

Level	Focus	What You'll Get
Level 1: Basic Concept	The "Why"	Simple explanations with real-world analogies
Level 2: Execution	The "How"	Exact code to run, line-by-line
Level 3: Interpretation	The "So What?"	How to read output and spot good vs. bad results

By the end, you'll have the skills and the code to analyze your own ChIP-seq data.

7. Tutorial Structure¶

The tutorial consists of 16 comprehensive chapters organized by workflow stage:

Setup & Data Acquisition¶

01. Environment Setup ¶

02. Bash Automation Fundamentals ¶

03. GEO/FASTQ Download ¶

04. FASTQ Concepts & QC ¶

Alignment & Initial QC¶

05. Alignment with Bowtie2 ¶

06. Duplicate Removal & QC ¶

07. Library Complexity Assessment ¶

08. BAM Quality Metrics ¶

09. Strand Cross-Correlation ¶

10. BAM Summary & Fingerprint Plots ¶

Peak Calling & Reproducibility¶

11. MACS3 Peak Calling ¶

12. FRiP Quality Metrics ¶

13. IDR & Consensus Peaks ¶

Visualization & Annotation¶

14. BigWig Generation ¶

15. Visualization with deepTools ¶

16. Peak Annotation with ChIPseeker ¶

8. Dataset Used in This Tutorial¶

We use two datasets to teach different parts of the pipeline:

Part 1: Preprocessing (FASTQ → BAM)¶

Source: GSE115704 Histone modifications in C. elegans sperm, oocytes, and early embryos.

Why this dataset? It's publicly available and demonstrates the practical steps of downloading, organizing, and aligning raw data.

Part 2: Downstream Analysis (BAM → Peaks → Visualization)¶

Source: ENCODE BLaER1 data Human cell line with ChIP-seq for CEBPA, H3K27me3, and H3K9ac.

Why this dataset? Pre-aligned, high-quality data that lets us focus on peak calling, normalization, and comparative analysis.

Let's Get Started¶

Up Next

Before diving into analysis, we'll set up your computational environment with the bioinformatics tools you'll need throughout this tutorial.