Using this approach we identified 288,957 SNPs that both have a high probability according to PolyBayes and are located in good sequence neighborhoods. Using this conservative set of SNPs, we obtained a density of 2.4 SNPs per 100 bp for T. cruzi coding regions. The great majority of the observed SNPs were bi-allelic; however, there were 2,990 tri-allelic SNPs and 10 tetra-allelic SNPs. These are very interesting SNPs that can be exploited in the design of strain typing assays. One such assay, based on one tetra-allelic SNP and a number of tri-allelic SNPs, has just been developed using this information. All this information is available in Additional file 1, Table S1 and has also been integrated in a new release of the TcSNP database.
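As an illustration of the two summary quantities reported above (allelic class and SNP density), a minimal sketch follows. It is not the pipeline used in this study: the function names `allelic_class` and `snps_per_100bp`, the toy sites and the toy totals are all hypothetical and were chosen only to show the arithmetic.

```python
from collections import Counter

VALID_BASES = set("ACGT")

def allelic_class(alleles):
    """Label a site by the number of distinct bases observed (hypothetical helper)."""
    distinct = {a.upper() for a in alleles if a.upper() in VALID_BASES}
    names = {2: "bi-allelic", 3: "tri-allelic", 4: "tetra-allelic"}
    return names.get(len(distinct), "not a SNP")

def snps_per_100bp(n_snps, total_bp):
    """SNP density expressed as SNPs per 100 bp of analyzed sequence."""
    return 100.0 * n_snps / total_bp

# Toy data, invented for illustration only (not taken from the study).
sites = [["A", "G"], ["A", "G", "T"], ["A", "C", "G", "T"], ["C", "C"]]
class_counts = Counter(allelic_class(s) for s in sites)

# 24 SNPs in 1,000 bp of sequence corresponds to 2.4 SNPs per 100 bp.
toy_density = snps_per_100bp(24, 1000)
```

In this sketch a site with a single observed base is reported as "not a SNP", and characters other than A, C, G and T (for example gaps or ambiguity codes) are ignored; both choices are assumptions rather than rules taken from the text.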
Experimental validation of candidate SNPs
To validate the strategy used in silico, and to assess the quality of the SNPs and the probability of them being true SNPs, we performed a small-scale re-sequencing study on 47 loci. This set contained 1,136 predicted SNPs with probabilities ranging from 0 to 1, obtained from genes with different numbers of predicted polymorphisms (low, medium and high). PCR amplification of selected fragments from these loci was followed by direct sequencing of the amplified products and identification of SNPs from the raw chromatogram sequence data, including heterozygous peaks. This re-sequencing experiment allowed us to validate 96% of the predicted SNPs that had PolyBayes probabilities ≥ 0.7, whereas the success rate for SNPs with probabilities between 0 and 0.4 fell to 12.5%.
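The per-bin success rates quoted above are simple proportions: the number of predicted SNPs confirmed by re-sequencing divided by the number of predicted SNPs tested within a probability range. A minimal sketch of that calculation is given below. The function name `validation_rate`, the inclusive bin edges and the toy records are assumptions made for illustration; they are not the study's data or code.

```python
def validation_rate(snps, low, high):
    """Fraction of tested SNPs confirmed, for probabilities with low <= p <= high.

    `snps` is an iterable of (probability, confirmed) pairs. Returns None
    when no SNP falls inside the requested range.
    """
    selected = [confirmed for p, confirmed in snps if low <= p <= high]
    if not selected:
        return None
    return sum(selected) / len(selected)

# Toy records, invented for illustration only (not taken from the study).
toy = [
    (0.95, True), (0.80, True), (0.72, False), (0.99, True),
    (0.10, False), (0.35, True), (0.05, False), (0.20, False),
]

high_rate = validation_rate(toy, 0.7, 1.0)  # 3 of 4 confirmed
low_rate = validation_rate(toy, 0.0, 0.4)   # 1 of 4 confirmed
```

Comparing the two proportions in this way is what supports the statement that the probability score ranks candidate SNPs usefully; with real data, the number of SNPs in each bin should be reported alongside the rate.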
The results of this small-scale study suggest that, overall, the scoring strategy used to rank the SNPs worked well. We also identified 43 new heterozygous SNPs within the CL Brener strain and 1,261 new SNPs from other T. cruzi strains. The majority of these new CL Brener SNPs escaped the initial in silico prediction because of artifacts in the assembly of the T. cruzi genome, which resulted, for example, in a missing allele for a hypothetical protein with high similarity to the yeast ERG10 gene. In the T. cruzi genome database there is only one allele reported for this gene. As a consequence, the few polymorphisms identified by our computational strategy were derived from the comparison of this allele against a short CL Brener EST sequence. However, upon PCR amplification from CL Brener DNA, we were able to uncover additional heterozygous polymorphisms.
Both pieces of evidence suggest that there is a second allele that was probably merged or missed during genome assembly. Apart from this case, this small-scale re-sequencing experiment confirmed the majority of the SNPs identified in silico, which is in agreement with the expected sequence coverage quality of the genomic and transcriptomic data used. A complete table listing all loci analyzed and their SNPs is available in Additional file 2, Table S2.
Table 3 Transitions and transversions in T. cruzi