Nuzhat A Akram*
Department of Genetics, University of Karachi, Karachi, Pakistan
Received date: October 15, 2012; Accepted date: October 27, 2012; Published date: October 29, 2012
Citation: Akram NA, Farooqi SR (2012) DNASF: A Statistical Package to Analyze the Distribution and Polymorphism of CODIS STR Loci in a Heterogeneous Population. J Forensic Res 3:170. doi:10.4172/2157-7145.1000170
Copyright: © 2012 Akram NA, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Short Tandem Repeat (STR) markers are moderately repetitive DNA segments that serve efficiently as core sequences for human identification. Their use as identification markers involves many technical and statistical issues. DNASF (DNA Statistics for Forensics) is a package of statistical programs designed to analyze the STR distribution in a heterogeneous population. It includes two software programs, DNA Forensics GenePro and DNA Forensics, and a Microsoft Excel workbook, DNA AF. Together they compute a number of parameters used to estimate the forensic utility of STR loci, including genetic diversity, unbiased heterozygosity, Shannon information index, polymorphism information content, probability of exclusion and power of discrimination. In these programs each individual or subpopulation is defined on the basis of two variables, namely paternal ethnicity and mother tongue. The options for the two variables consist mainly of Indian subcontinent ethnicities and native languages, but this does not undermine the utility of the software for researchers working on other populations. The input data are CODIS STR genotype data for DNA Forensics GenePro and allele frequency data for DNA Forensics. DNA AF calculates allele frequencies and other descriptive statistics from genotype data. Each component of DNASF is user friendly and is provided with a set of instructions. For validation studies, genotype data of five Pakistani subpopulations and allele frequency data of fifteen world populations were used. The validation studies support DNASF as a reliable and effective tool for forensic investigations.
STR polymorphism; Statistical package; Software; Forensic parameters; Validation studies
Currently STR loci are the most informative genetic markers for identity testing. The high degree of STR polymorphism has shown great promise for DNA typing in forensic applications [2-6]. The American FBI has designated thirteen STR loci as a core set to be used in determining an individual's identity [7-9]. However, their use as human identification markers in a population is subject to various issues. One of them is the level of polymorphism, which determines their efficiency for human identification purposes; another is the presence of substructure in a population, i.e. the presence of genetically differentiated subpopulations within a population. Subpopulation is a generic term indicating a cluster within a heterogeneous population. Profile frequencies calculated from population averages might be seriously misleading for particular subpopulations [12-16]. Extensive studies from a wide variety of databases show that there are indeed substantial frequency differences among the major racial and linguistic groups, and within these groups there is often a statistically significant departure from random proportions. The National Research Council (NRC) reports of 1992 and 1996 suggested the use of random samples from "relatively genetically homogeneous" populations [14,17]. Construction of subpopulation databases can therefore play a crucial role in establishing confidence in the results of DNA typing.
Advances in current DNA typing technology have made the construction of such databases at finer levels of population stratification straightforward. However, software tools are needed for various purposes such as storing, tracking, comparing and analyzing such databases [1,18]. A wide variety of software has been written to facilitate the task of managing, error checking, and analyzing genotype and phenotype data for genetic studies. DNASF is a statistical package that offers STR data entry and performs the otherwise cumbersome analyses needed to estimate forensic parameters in a user-friendly manner. It comprises two software programs and a Microsoft Excel workbook. Each component is provided with a user guide/manual. All the components, along with their user guides/manuals, are available from the corresponding author.
In April 2011, the FBI laboratory proposed an expanded set of core STR loci for the United States in order to reduce the likelihood of adventitious matches [20-22]. However, the current battery of STR loci is still validated for the analysis of single source DNA profile cases. More autosomal STR loci are needed for kinship and DNA mixture analyses [22,23]. Therefore only thirteen STRs published in FBI CODIS program 1997 (http://www.cstl.nist.gov/strbase/fbicore.htm) are included in the program by default along with the list of their alleles (http://www.cstl.nist.gov/strbase/).
DNASF consists of two software programs: DNA Forensics GenePro and DNA Forensics. It also includes a Microsoft Excel workbook, DNA AF.
DNA Forensics GenePro: The software needs genotype data for the estimation of forensic parameters (Figure 1). Each subpopulation is defined on the basis of paternal ethnicity and mother tongue. Each individual's genotype data are entered under his or her subpopulation.
DNA Forensics: The software needs allele frequency data as the input file for the estimation of the forensic parameters for a population or subpopulation, which is defined on the basis of paternal ethnicity and mother tongue (Figure 2).
DNA AF: This is an Excel workbook which can calculate various statistics from genotype data (Figure 3). These include allele frequency, its variance and 95% confidence interval, the numbers of heterozygotes and homozygotes, the number of chromosomes genotyped and the sample size of the population. It can also calculate the various forensic parameters.
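The descriptive statistics listed for DNA AF can be illustrated with a minimal Python sketch. It is not the workbook's implementation: the function name `allele_stats`, the sample genotypes, the binomial sampling variance p(1 - p)/2N and the normal-approximation 95% confidence interval are all assumptions made here for illustration, since the text does not state which variance or interval formulas the workbook uses.

```python
from collections import Counter
from math import sqrt


def allele_stats(genotypes):
    """Summarise one locus from a list of (allele_1, allele_2) genotypes.

    Assumption: variance is the binomial sampling variance p(1 - p) / 2N and
    the 95% interval is the normal approximation p +/- 1.96 * sqrt(variance).
    """
    sample_size = len(genotypes)        # number of individuals
    chromosomes = 2 * sample_size       # two alleles per individual
    counts = Counter(allele for pair in genotypes for allele in pair)
    homozygotes = sum(1 for a, b in genotypes if a == b)

    alleles = {}
    for allele in sorted(counts):
        p = counts[allele] / chromosomes
        variance = p * (1 - p) / chromosomes
        half_width = 1.96 * sqrt(variance)
        alleles[allele] = {
            "frequency": p,
            "variance": variance,
            "ci95": (max(0.0, p - half_width), min(1.0, p + half_width)),
        }

    return {
        "sample_size": sample_size,
        "chromosomes": chromosomes,
        "homozygotes": homozygotes,
        "heterozygotes": sample_size - homozygotes,
        "alleles": alleles,
    }


# Hypothetical genotypes at a single STR locus (repeat numbers), not real data.
genotypes = [(9, 10), (10, 10), (9, 11), (10, 12)]
summary = allele_stats(genotypes)
for allele, stats in summary["alleles"].items():
    print(allele, round(stats["frequency"], 3), stats["ci95"])
```

With samples this small the normal approximation is crude; it is used here only to keep the sketch short.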
1. Five Pakistani subpopulations, namely Balochi, Muhajir, Pathan, Punjabi and Sindhi, were genotyped for three CODIS STR loci: CSF1PO, TPOX and TH01. Genotype data were analyzed with the statistical program Powerstat (http://www.promega.com/geneticidtools/powerstats/) and with DNA Forensics GenePro for the estimation of forensic parameters. The parameters include heterozygosity, polymorphism information content, probability of exclusion, match probability and power of discrimination. Regression analysis was performed on three of the parameters, namely match probability, power of discrimination and polymorphism information content, to estimate the accuracy of computation (Figures 4-6).
2. Allele frequencies for each of the three STR loci across the five Pakistani subpopulations were calculated using the Excel workbook DNA AF. These allele frequencies were used as input data for DNA Forensics. Forensic parameters were calculated, and regression analyses were performed between them and the parameters calculated by Powerstat to estimate the accuracy of the software (Figures 7-9).
3. The forensic parameters heterozygosity and power of discrimination were calculated for the loci CSF1PO, TPOX and TH01 by DNA Forensics using allele frequency data reported in the literature (Table 1) (http://dnaa.bravehost.com/index.html). The calculated parameters were then compared with those reported in the literature. Regression analysis was performed to estimate the accuracy of computation (Figures 10-15).
|No.||Population||Ref. No.a||CSF1PO Heterozygosity||CSF1PO Power of Discrimination (PD)||TPOX Heterozygosity||TPOX Power of Discrimination (PD)||TH01 Heterozygosity||TH01 Power of Discrimination (PD)|
|7||Andhra Pradesh Dravidian 1||4||0.735||0.88||0.716||0.865||0.773||0.908|
|11||South African Whites||12||0.728||0.873||0.645||0.822||0.75||0.899|
|12||South African Blacks||12||0.781||0.916||0.788||0.921||0.718||0.867|
a Refers to the reference number provided at the website http://dnaa.bravehost.com/index.html
Table 1: Fifteen world populations’ heterozygosities and power of discrimination for the loci CSF1PO, TPOX and TH01 calculated by DNA Forensics.
Calculations for forensic parameters
1. Unbiased Heterozygosity (H)
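Unbiased heterozygosity is calculated as $H = \frac{2n}{2n-1}\left(1 - \sum_i p_i^2\right)$, where $n$ is the number of individuals sampled (so that $2n$ is the number of chromosomes examined) and $p_i$ is the frequency of the i-th allele, so that $p_i^2$ is the expected frequency of homozygotes for that allele. If the number of individuals sampled for the STR locus is 30, then $2n = 60$ and H will be calculated as
$$H = \frac{60}{59}\left(1 - \sum_i p_i^2\right) \qquad (1)$$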
2. Probability of Identity (PI) and Power of Discrimination (PD)
PI is given by $PI = \sum_i x_i^2 + \sum_{i<j} x_{ij}^2$, where $x_i$ stands for the frequency of homozygotes for the i-th allele and is equal to $p_i^2$, while $x_{ij}$ stands for the frequency of heterozygotes carrying the i-th and j-th alleles and is equal to $2 p_i p_j$; $p_i$ and $p_j$ stand for the frequencies of the i-th and j-th alleles of a locus. PD is defined as
$$PD = 1 - \left(\sum_i x_i^2 + \sum_{i<j} x_{ij}^2\right) = 1 - PI \qquad (2)$$
3. Polymorphism Information Content (PIC)
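PIC is defined as
$$PIC = 1 - \sum_{i=1}^{n} p_i^2 - 2\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} p_i^2 p_j^2 \qquad (3)$$
where $p_i$ and $p_j$ stand for the frequencies of the i-th and j-th alleles of a locus, and $n$ here is the number of alleles at the locus.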
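The three calculations above can be sketched in a few lines of Python. This is an illustrative sketch, not the DNASF source code: the function names are hypothetical, the heterozygosity function takes n as the number of individuals (so 2n chromosomes), and PI uses the Hardy-Weinberg genotype frequencies p_i^2 and 2 p_i p_j given in the text.

```python
from itertools import combinations


def unbiased_heterozygosity(freqs, n_individuals):
    """H = 2n (1 - sum p_i^2) / (2n - 1), with n individuals sampled."""
    two_n = 2 * n_individuals
    return two_n * (1 - sum(p * p for p in freqs)) / (two_n - 1)


def probability_of_identity(freqs):
    """PI = sum of squared expected genotype frequencies."""
    homozygous = sum((p * p) ** 2 for p in freqs)
    heterozygous = sum((2 * p * q) ** 2 for p, q in combinations(freqs, 2))
    return homozygous + heterozygous


def power_of_discrimination(freqs):
    """PD = 1 - PI."""
    return 1 - probability_of_identity(freqs)


def polymorphism_information_content(freqs):
    """PIC = 1 - sum p_i^2 - 2 * sum over i < j of p_i^2 p_j^2."""
    homozygosity = sum(p * p for p in freqs)
    cross_term = sum((p * p) * (q * q) for p, q in combinations(freqs, 2))
    return 1 - homozygosity - 2 * cross_term


# Hypothetical locus with two equally frequent alleles, 30 individuals sampled.
freqs = [0.5, 0.5]
print(unbiased_heterozygosity(freqs, 30))       # 60 * 0.5 / 59
print(probability_of_identity(freqs))           # 0.375
print(power_of_discrimination(freqs))           # 0.625
print(polymorphism_information_content(freqs))  # 0.375
```

`itertools.combinations` enumerates each unordered allele pair once, which matches the i < j double sum in equation (3).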
1. Regression analysis between POWERSTAT and DNA Forensics GenePro shows a coefficient of determination of 1, which confirms the accuracy of the software in calculating the parameters (Figures 4-6).
2. Regression analysis between POWERSTAT and DNA Forensics for the Pakistani subpopulations shows a coefficient of determination approaching 1 (0.88 to 0.99), which confirms the accuracy of the software in calculating the parameters (Figures 7-9).
3. Regression analysis between the values calculated by DNA Forensics and those reported in the literature shows a coefficient of determination from 0.80 to 0.999 for heterozygosity, which confirms the accuracy of the software in calculating this parameter (Figures 10-12). The coefficient of determination is lower for power of discrimination, ranging from 0.14 (CSF1PO) to 0.54 (TPOX) (Figures 13-15). However, the p values are significant (p << 0.05) only for the coefficients of determination of TPOX and TH01.
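The coefficient of determination used in these comparisons can be computed as follows. This is a minimal sketch assuming a simple linear regression of one program's output on the other's, in which case R^2 equals the squared Pearson correlation. The function name `r_squared` and the numbers in the usage example are hypothetical placeholders, not values from the validation studies.

```python
def r_squared(x, y):
    """Coefficient of determination for a simple linear regression of y on x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # For one predictor, R^2 is the squared Pearson correlation coefficient.
    return (sxy * sxy) / (sxx * syy)


# Placeholder values standing in for one parameter computed by two programs.
reference_program = [0.70, 0.75, 0.80, 0.85]
program_under_test = [0.71, 0.74, 0.81, 0.85]
print(round(r_squared(reference_program, program_under_test), 4))
```

A value of 1 means the two sets of estimates are perfectly linearly related; it does not by itself show they are equal, so the slope and intercept of the fitted line are worth checking as well.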
Molecular information from highly variable DNA markers is widely used to identify individuals or evidence for forensic purposes [4,26,27]. Various software programs have been introduced to make the use of DNA for identification purposes a fast and trouble-free process. DNASF comprises relatively simple programs; nonetheless, their utility for the forensic community should not be underestimated, as they provide the basic calculations considered essential for forensic investigations.
Moreover, the programs encourage the researcher to categorize the population under study into smaller groups that can be differentiated on the basis of their paternal ethnicity and/or mother tongue. This is in accordance with the recommendations of the National Research Council (1996) to construct DNA databases of genetically homogeneous populations. There may be important differences in allele frequencies between subpopulations or ethnic groups, and these differences may influence the calculations. For example, the match probability may be higher when a person is compared to his or her own ethnic group than when he or she is compared to the whole population. When population substructure is ignored, the match probability is simply the relative frequency of the defendant's profile in the suspected population of the culprit. Essentially, this treats each human population as large and randomly mating, ignoring possible subpopulations. People in these subpopulations may tend to mate within their own subpopulation, which would lead to allele frequencies different from those estimated from the overall population. To estimate these possible differences, it is necessary to build databases for each subpopulation or ethnic group within a larger population. Another measure to minimize the effect of background relatedness among the subpopulations on forensic calculations is the use of the inbreeding coefficient (θ) [29,30]. In 1994, Balding and Nichols proposed a method for calculating match probabilities which makes use of this inbreeding coefficient.
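To make the role of θ concrete, the sketch below uses the commonly cited single-locus form of the Balding and Nichols correction, in which the genotype match probability is [2θ + (1 - θ)p_i][3θ + (1 - θ)p_i] / [(1 + θ)(1 + 2θ)] for a homozygote and 2[θ + (1 - θ)p_i][θ + (1 - θ)p_j] / [(1 + θ)(1 + 2θ)] for a heterozygote. The function name, the allele frequencies and the value θ = 0.03 are illustrative assumptions, and nothing in the text indicates that DNASF itself implements this correction.

```python
def match_probability(p_i, p_j=None, theta=0.0):
    """Single-locus genotype match probability with a theta correction.

    p_i, p_j: allele frequencies; pass only p_i for a homozygous genotype.
    theta: inbreeding (coancestry) coefficient; theta = 0 recovers the
    Hardy-Weinberg values p_i^2 and 2 * p_i * p_j.
    """
    denominator = (1 + theta) * (1 + 2 * theta)
    if p_j is None:
        # Homozygote carrying two copies of allele i.
        numerator = (2 * theta + (1 - theta) * p_i) * (3 * theta + (1 - theta) * p_i)
    else:
        # Heterozygote carrying alleles i and j.
        numerator = 2 * (theta + (1 - theta) * p_i) * (theta + (1 - theta) * p_j)
    return numerator / denominator


# Illustrative allele frequencies, not taken from any database.
print(match_probability(0.1))                    # homozygote, no substructure
print(match_probability(0.1, theta=0.03))        # homozygote, theta = 0.03
print(match_probability(0.1, 0.2))               # heterozygote, no substructure
print(match_probability(0.1, 0.2, theta=0.03))   # heterozygote, theta = 0.03
```

For rare alleles the corrected value is larger than the uncorrected one, which is the conservative direction: the evidence is reported as less rare when background relatedness within a subpopulation is allowed for.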
Another feature of DNASF is that it provides a number of options for the user. For example, options are given for paternal ethnicity, mother tongue, CODIS STR loci and their alleles. Data entry by the user is kept to a minimum, thus reducing the chances of error. Although the options for paternal ethnicity and mother tongue consist mainly of ethnicities and native languages of Indian subcontinent populations, this does not undermine the utility of the software for researchers working on other populations. They can use the software by selecting the options 'unknown' or 'other' for paternal ethnicity or mother tongue. Moreover, the validation studies of the statistical package support it as a reliable and efficient tool for researchers working on CODIS STR loci.