Abstract 33P
Background
In NGS data analysis, when single-nucleotide polymorphisms (SNPs) lie within genomic regions of low mappability, misalignments can arise, along with additional mismatches that may be identified as somatic variants. Because SNPs differ between individuals, no modern variant caller can algorithmically distinguish artifacts of this origin. A pre-indexed knowledgebase may help distinguish such artifacts from real somatic variants.
Methods
The goal was to construct a knowledgebase of genomic regions that are both poorly mappable and highly polymorphic. We generated a synthetic dataset, stored as a FASTA file, by dividing the human reference genome into reads of 300 bp with a step of 75 bp. BLAST was then used to search for regions of similarity between these reads and the reference genome, and a preliminary inclusion criterion for similar regions was set. To validate artifacts of the hypothesized origin, we generated a second FASTA file by inserting SNPs of NA18595 from the 1000 Genomes Project (1KGP) into the original FASTA file. This second FASTA file was aligned to the reference genome so that the origin of mismatches could be explored.
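The read-tiling and SNP-insertion steps described above can be illustrated with a minimal Python sketch. It is not the pipeline used in the study. The function names `tile_reference` and `insert_snps`, the 0-based coordinates, and the dictionary format for SNPs are assumptions made for illustration only. Only the 300 bp read length and 75 bp step come from the text.

```python
from typing import Dict, Iterator, Tuple


def tile_reference(seq: str, read_len: int = 300, step: int = 75) -> Iterator[Tuple[int, str]]:
    """Yield (0-based start, read) for every full-length window over seq."""
    # Windows that would run past the end of the sequence are skipped.
    for start in range(0, len(seq) - read_len + 1, step):
        yield start, seq[start:start + read_len]


def insert_snps(start: int, read: str, snps: Dict[int, str]) -> str:
    """Return the read with alternate alleles substituted at SNP positions.

    snps maps 0-based reference positions to alternate bases
    (a hypothetical format chosen for this sketch).
    """
    bases = list(read)
    for offset in range(len(bases)):
        alt = snps.get(start + offset)
        if alt is not None:
            bases[offset] = alt
    return "".join(bases)
```

Each read produced this way overlaps its neighbors by 225 bp, so a given SNP is carried by several reads with the variant at a different offset in each.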
Results
As expected, no mismatches were detected when the synthetic data were free of SNPs. After germline variants were inserted, a total of 91 mismatches were identified at exome scale. All artifacts arose from reads that harbored SNPs and were misaligned to genomic regions of similar sequence context. Of the 91 artifacts, 36% occurred at the SNP loci themselves and 58% occurred at loci adjacent to SNPs. Our artifact knowledgebase covered 94% of the artifacts. In addition, 59% of the potential artifacts in our knowledgebase were reported in COSMIC. Although this is only a rough estimate, since we selected only artifact sites adjacent to polymorphic sites of high population frequency (>5%), the high percentage implies that artifacts exist in public cancer mutation knowledgebases.
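The split of artifacts into those at SNP loci and those adjacent to SNPs can be sketched as a simple position lookup. The abstract does not define how close a locus must be to count as adjacent, so the `window` parameter and its default of 10 bp are purely hypothetical, as are the function name `classify_mismatch` and the category labels.

```python
from bisect import bisect_left
from typing import List


def classify_mismatch(pos: int, snp_positions: List[int], window: int = 10) -> str:
    """Classify a mismatch position relative to a sorted list of SNP positions.

    window is an assumed adjacency distance in bp; the source does not state one.
    """
    i = bisect_left(snp_positions, pos)
    if i < len(snp_positions) and snp_positions[i] == pos:
        return "at_snp"
    # Distances to the nearest SNP on each side, where one exists.
    distances = []
    if i > 0:
        distances.append(pos - snp_positions[i - 1])
    if i < len(snp_positions):
        distances.append(snp_positions[i] - pos)
    if distances and min(distances) <= window:
        return "adjacent"
    return "other"
```

Counting the labels over all detected mismatches would give proportions of the kind reported above, with the result depending on the chosen window.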
Conclusions
Our analysis indicates that differences between the reference genome and an individual's genome can lead to misalignment, especially when such polymorphisms occur within regions of low mappability. These misalignments may introduce false somatic variants. By constructing a BLAST-guided knowledgebase, we were able to reliably detect artifacts of this origin and achieve higher specificity in somatic variant detection.
Clinical trial identification
Editorial acknowledgement
Legal entity responsible for the study
Genetron Health (Beijing) Co. Ltd., 102206, Beijing, China.
Funding
Genetron Health (Beijing) Co. Ltd., 102206, Beijing, China.
Disclosure
All authors have declared no conflicts of interest.