Understanding and Cleaning Your Data (FASTQ & fastp)¶
FASTQ-format fastp quality-control adapter-trimming Phred-scores read-filtering QC-reports quality-assessment
1: Basic Concept (The Anatomy of a Read)¶
A FASTQ file is just a text file full of DNA sequences. But unlike a simple list of letters, every single read carries extra baggage (its quality score).
Think of every read like a Luggage Tag with 4 lines of information:
- Line 1 (The Header): Starts with
@. This is the ID Card. It tells you the machine name, flowcell lane, and coordinates. - Line 2 (The Sequence): The DNA letters (
ACTG...). This is the Content inside the bag. - Line 3 (The Spacer): Starts with
+. Just a divider. - Line 4 (The Quality): A string of weird characters (
F:F#,,...). This is the Trust Score. Each character represents the probability that the corresponding base in Line 2 is wrong.
Example Read:
@SN227:495:CA0TUACXX:1:1106:1159:2114 1:N:0:ATCACG <-- ID: Read #1
GTAAAAAGATTACATATATATTTAAAGTACACTGTAATTCTTANCA <-- DNA: "N" means the machine failed to call that base
+ <-- Spacer
FDFFFHHHHHJIJGIJIJJHJJJJJJIIJIJGIHFIIJJIIIIJG <-- Quality: Each character encodes a Phred quality score (Illumina 1.8+, ASCII offset 33).
Level 2: Execution (The Car Wash)¶
Before we start analyzing, we need to clean our data.
- The Problem: Sequencers sometimes make mistakes, especially at the ends of reads. They also leave "adapters" (artificial tags) attached to the DNA.
- The Solution: We use a tool called fastp. It acts like an automatic car wash: dirty reads go in, clean reads come out.
2.1 Basic Cleaning (Single-End)¶
# -i: Input (dirty)
# -o: Output (clean)
fastp -i fastq_raw/H3K27me3_IP_rep1.fastq.gz -o fastq_cleaned/H3K27me3_IP_rep1.clean.fastq.gz
Single-End vs Paired-End
This dataset uses single-end sequencing. If you have paired-end data, you would use:
fastp -i in.R1.fq.gz -I in.R2.fq.gz -o out.R1.fq.gz -O out.R2.fq.gz
What does fastp do automatically?
- Quality Filtering: Drops reads if too many bases have low scores (default: Phred -q < 15).
- Length Filtering: Drops reads that become too short after trimming (default: -l < 15bp).
- Adapter Removal: Finds and cuts off adapter sequences automatically.
Level 3: Advanced Analysis (The Math)¶
3.1 Quick Stats with AWK¶
Sometimes you don't want to run a full report; you just want to know "How many reads do I have?"
You can use awk (a math tool for text) to count directly from the compressed file.
Count Total Reads:
# A FASTQ record is 4 lines. We count total lines and divide by 4.
gzcat fastq_raw/H3K27me3_IP_rep1.fastq.gz | wc -l | awk '{print $1/4 " reads"}'
3.2 Batch Processing¶
If you have 50 files, you can use a script to run fastp on all of them in parallel.
The fastp developers provide a handy script called parallel.py:
mkdir -p fastq_cleaned fastp_reports
# Process 3 files at a time (-f 3), using 2 threads per file (-t 2)
python parallel.py -i fastq_raw -o fastq_cleaned -r fastp_reports -a '-f 3 -t 2'
Parameter explanation:
-f 3: Sets the batch size—the script processes 3 files at a time-t 2: Sets the number of threads per file—each file is processed with 2 threads
This automatically finds pairs and generates HTML reports for every sample.
Important
parallel.py avoids the need to explicitly loop over sample_id.txt in a Bash script.
chipseq_tutorial/
├── fastq_raw/ # Original FASTQ files
│ ├── H3K27me3_IP_rep1.fastq.gz
│ ├── H3K27me3_IP_rep2.fastq.gz
│ └── Input_rep1.fastq.gz
├── fastq_cleaned/ # Cleaned FASTQ files
│ ├── H3K27me3_IP_rep1.clean.fastq.gz
│ ├── H3K27me3_IP_rep2.clean.fastq.gz
│ └── Input_rep1.clean.fastq.gz
├── fastp_reports/ # HTML QC reports
│ ├── H3K27me3_IP_rep1.html
│ ├── H3K27me3_IP_rep2.html
│ └── Input_rep1.html
└── sample_id.txt
Summary¶
- Understand: FASTQ files have 4 lines per read; line 4 is the quality score.
- Action: Always run
fastpto trim adapters, low-quality bases, and too-short reads. - Check: Use
wc -lorawkfor instant feedback on your data size.
Up Next
With clean reads in hand, we're ready to align them to a reference genome using Bowtie2.