GATK base calling score recalibration — first task on Duke Campus

In our center, the major focus of our research has been finding variants in the human genome. With NGS technology, people often ask whether the “base calling” scores obtained from Illumina sequencers (GAII, HiSeq, etc.) have been recalibrated. Unfortunately, our existing pipeline does NOT incorporate the procedure implemented in GATK to recalibrate the base calling scores. It would be nice to have, but the question is: do we really need this recalibration at all? To assess this and make a judgment call, we would first like to ask these three questions:

1. What does it do in GATK?
2. How feasible is it to incorporate GATK into the pipeline?
3. What is the gain?

Facts about GATK:

Here is the GATK link.
The GATK paper was published in Genome Research in 2010.

It incorporates a common functional programming paradigm called “map and reduce” (MapReduce).
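To make the “map and reduce” idea concrete, here is a minimal sketch in plain Python. It is not GATK code: the toy data and the names map_base and reduce_tallies are made up for illustration. Each base is mapped to a small tally on its own, and the tallies are then reduced into one summary.

```python
from collections import Counter
from functools import reduce

# Toy input (made-up): one (reported quality, is_mismatch) pair per base.
bases = [(30, False), (30, True), (20, False), (20, False), (30, False)]

def map_base(base):
    """Map step: turn a single base into a per-quality tally."""
    quality, is_mismatch = base
    return Counter({(quality, "obs"): 1, (quality, "mismatch"): int(is_mismatch)})

def reduce_tallies(acc, tally):
    """Reduce step: merge one tally into the running total by summing counts."""
    acc.update(tally)
    return acc

# Each base is processed independently (map), then everything is combined (reduce).
totals = reduce(reduce_tallies, map(map_base, bases), Counter())

for key in sorted(totals):
    print(key, totals[key])
```

Because the map step looks at one base at a time, the work can in principle be split across many machines and merged at the end, which is the appeal of the paradigm.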

Five major components in the GATK package:

1. Initial alignment
2. MSA realignment
3. Q-score recalibration
4. Single-sample genotyping
5. SNP filtering

For the Q-score recalibration:
It was reported that “The quality scores reported by the Solexa, SOLiD, and 454 base callers are inaccurate”. In the Q-score recalibration:
– We examine the aligned reads and use the reference mismatch rate at non-dbSNP sites to recalibrate the reported quality scores
– We can also account for covariates of base errors, such as local sequence context and machine cycle, to identify subsets of higher-quality bases
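The two points above can be sketched roughly as follows. This is a toy illustration in Python, not the GATK implementation: the input tuple layout, the function names, and the +1/+2 smoothing are all my assumptions. Bases are grouped by their covariates, mismatches against the reference are counted while skipping known dbSNP sites, and the observed mismatch rate is converted back to a Phred-scaled score.

```python
import math
from collections import defaultdict

def phred(error_rate):
    """Convert an error probability to a Phred-scaled quality score."""
    return -10.0 * math.log10(error_rate)

def empirical_qualities(observations):
    """Estimate an empirical quality for each covariate combination.

    observations: iterable of tuples (hypothetical layout)
        (reported_q, dinuc, cycle, is_mismatch, at_known_site)
    Returns a dict mapping (reported_q, dinuc, cycle) to an empirical Q-score.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [n_observed, n_mismatch]
    for reported_q, dinuc, cycle, is_mismatch, at_known_site in observations:
        if at_known_site:
            # A mismatch at a known dbSNP site may be a real variant, so skip it.
            continue
        entry = counts[(reported_q, dinuc, cycle)]
        entry[0] += 1
        entry[1] += int(is_mismatch)
    # The +1/+2 smoothing (an assumption) avoids log10(0) for error-free bins.
    return {key: phred((mm + 1) / (n + 2)) for key, (n, mm) in counts.items()}

# Made-up data: 999 bases reported as Q30 in one bin, with a single mismatch.
obs = [(30, "AC", 5, False, False)] * 998 + [(30, "AC", 5, True, False)]
table = empirical_qualities(obs)
print(table[(30, "AC", 5)])  # about 27, i.e. slightly worse than the reported 30
```

In this toy case the sequencer claimed Q30 but the data support only about Q27, so the recalibrated score for that bin would be lowered.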

Gains from the GATK process include:
A. A Q-Q plot (base calling scores?)
B. Difference in the number of “dinucleotide” (covariates)
C. Increased % of Q25 and Q30 bases
D. Increased % in SNP calls
E. Ti/Tv ratio of ~2.1 for genomes and ~2.8 for exomes
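For item E, the Ti/Tv ratio is simply the number of transitions (A<->G, C<->T) divided by the number of transversions (all other substitutions) in a SNP call set. Here is a minimal sketch in Python, assuming the SNPs have already been extracted as (ref, alt) pairs; the function name and the input layout are mine, not from GATK or samtools.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ti_tv_ratio(snps):
    """Compute the transition/transversion ratio from (ref, alt) base pairs."""
    ti = 0
    tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref == alt:
            continue  # not a substitution, ignore
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions found; Ti/Tv is undefined")
    return ti / tv

# Made-up calls: two transitions and one transversion.
calls = [("A", "G"), ("C", "T"), ("A", "C")]
print(ti_tv_ratio(calls))  # 2.0
```

A call set whose ratio drifts well below the expected value is often taken as a hint that it contains more false positives.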

To be compared with samtools variant calling.

Start practising with GATK
1. A good tutorial on GATK wiki

2. Another good forum site hosted/answered by Erin

3. Nowadays, we may not need a “rod” file; a VCF file will work just fine. The VCF files can be downloaded from dbSNP.

With the VCF file, we need to change the parameter setting to -B:name,type file
For example: -B:dbsnp,VCF dbsnp132.vcf

4. Sample code to do the recalibration will look something like this. First, count the covariates:

java -Xmx4g -jar GenomeAnalysisTK.jar \
  -R Homo_sapiens_assembly18.fasta \
  -D dbsnp_129_hg18.rod \
  -I original.bam \
  -T CountCovariates \
  -cov ReadGroupCovariate \
  -cov QualityScoreCovariate \
  -cov DinucCovariate \
  -cov CycleCovariate \
  -recalFile table.recal_data.csv

Then, apply the resulting table to write a recalibrated BAM:

java -Xmx4g -jar GenomeAnalysisTK.jar \
  -R Homo_sapiens_assembly18.fasta \
  -I original.bam \
  -T TableRecalibration \
  -recalFile table.recal_data.csv \
  -outputBam recal.bam

Initially, we used “var calling”, which generates the pileup format; with newer versions, people use “variant calling” with mpileup, which generates the VCF format.

According to Jessica Miya, GATK will fail on an ad hoc converted VCF file (from pileup); I will keep an eye on it.

Here is the FTP site for downloading resources from the Broad Institute.

About jianyingli18

I started to use the WordPress since 2009...