Centre for Human Genetics, Edith Cowan University, 100 Joondalup Drive, Perth WA 6027, Australia
Received date: 25 November 2005; Accepted date: 13 December 2005; Available online 30 December 2005; Published date: 30 December 2005
© Copyright: Alan H Bittles
Visit for more related articles at Journal of Molecular and Genetic Medicine
Population stratification and its influence on genetic association studies is a controversial topic. Although it has been suggested that stratification is unlikely to bias the results of association studies conducted in developed countries, convincing contrary empirical evidence has been published. However, it is in populations where historical ethnic, religious and language barriers exist that community subdivisions will predictably exert the greatest genetic effect, and influence the organization of association studies. In many of the populations of the Indian sub-continent, these basic population divisions are compounded by a strict tradition of intracommunity marriage and by marriage between close biological relatives. Data on the very significant levels of genetic diversity that characterize the populations of India and Pakistan, with some 50,000-60,000 caste and non-caste communities in India, and average first cousin marriage rates of 40%-50% in Pakistan, are presented and discussed. Under these circumstances, failure to explicitly control for caste/biraderi membership and the presence of consanguinity could seriously jeopardize, and may totally invalidate, the results of association/case-control studies and clinical trials.
Stratification, endogamy, consanguinity, association studies, India, Pakistan
The reliability and reproducibility of results obtained in genetic association studies have increasingly been questioned, especially when applied to complex diseases and in dealing with rare alleles with small effect sizes. In many cases poor study design has been identified as the main cause of misleading results (Cardon and Bell, 2001), with a lack of appropriate power calculations and insufficiently large datasets creating particular problems (Dahlman et al, 2002). Factors such as the effect size of susceptibility loci, the frequency of disease alleles, the frequency of marker alleles correlated with disease alleles, and the extent of linkage disequilibrium at or close to the region of the genome under investigation also have been cited as potential problem areas (Zondervan and Cardon, 2004), and the possible role(s) of epistasis remains to be investigated in appropriate detail (Carlborg and Haley, 2004).
Since the evolutionary history of haplotypes and patterns of linkage disequilibrium often vary widely between different ethnic populations, population stratification by ethnicity can cause problems in terms of the power of association studies (Cardon and Bell, 2001). Except in ethnically diverse samples, it has been suggested that population stratification may not be a major cause of spurious allelic associations (Cardon and Palmer, 2003). However, the influence of stratification on genetic association studies has been demonstrated even in well-designed protocols, with greatest effect in recently admixed populations and for diseases with variant prevalence rates in the ancestral source populations (Freedman et al, 2004).
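The way in which stratification can generate a spurious allelic association may be easier to see with a small numerical sketch. The Python example below is illustrative only: the two communities, their sizes, disease prevalences and allele frequencies are hypothetical values chosen for the illustration, not data from any cited study. Within each community the allele is assumed to be independent of disease, so the true odds ratio is 1.

```python
def pooled_odds_ratio(strata):
    """Allelic odds ratio (cases vs controls) after pooling strata.

    Each stratum is a tuple (n_people, disease_prevalence, allele_freq).
    Within every stratum the allele is assumed independent of disease,
    so any departure of the pooled odds ratio from 1 is due to
    stratification alone.
    """
    case_a = case_other = ctrl_a = ctrl_other = 0.0
    for n_people, prevalence, freq in strata:
        cases = n_people * prevalence
        controls = n_people * (1 - prevalence)
        # Two alleles per person; expected counts under independence.
        case_a += 2 * cases * freq
        case_other += 2 * cases * (1 - freq)
        ctrl_a += 2 * controls * freq
        ctrl_other += 2 * controls * (1 - freq)
    return (case_a * ctrl_other) / (case_other * ctrl_a)


# Hypothetical endogamous communities: (size, prevalence, allele frequency).
community_a = (10_000, 0.10, 0.60)
community_b = (10_000, 0.02, 0.20)

or_a = pooled_odds_ratio([community_a])
or_b = pooled_odds_ratio([community_b])
or_pooled = pooled_odds_ratio([community_a, community_b])

print(round(or_a, 2), round(or_b, 2), round(or_pooled, 2))
```

With these illustrative inputs the odds ratio is 1 within each community but roughly 1.78 when the two are pooled, even though the allele has no effect on disease in either group.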
Specific attention also has been drawn to the phenomenon of cryptic relatedness, i.e., kinship among cases or controls that is unknown to or unrecognized by the investigator. Although not a problem in appropriately designed studies conducted on outbred populations, cryptic relatedness can result in false positives where there has been a sampling bias towards the recruitment of relatives and/or in founder populations that have undergone rapid, recent growth (Voight and Pritchard, 2005). Its undetected presence may therefore be a major source of error in association studies when these conditions apply.
Estimates of the levels of genetic variation within human populations vary from lower bound values of 83%-88% (Excoffier and Hamilton, 2003) to 93%-95% of total genetic variance (Rosenberg et al, 2002). The full implications of these findings have largely been obscured by debate as to whether or not ethnicity should be considered in biomedical studies and clinical practice (Cooper et al, 2003; González Burchard et al, 2003), and because of this controversy the potential impact of intra-population allelic differences often seems to be overlooked or ignored in health-based studies. The aim of this mini-review is to highlight underlying causes of genomic variation in the populations of India and Pakistan that can be ascribed to their very diverse social, demographic and genetic substructures, and to consider the effects of this variation on the results of genetic association/case-control studies and clinical trials conducted in the sub-continent.
Potential problems with respect to stratification in association studies especially apply to populations in the Indian sub-continent because of the multiple ethnic, language, religious and socio-demographic boundaries that historically have restricted inter-community gene flow. In terms of language differentials it is estimated that there are just four major language families in India, i.e., Austric, Dravidian, Indo-European and Sino-Tibetan (Gadgil et al, 1998), but 15 major languages are spoken which can be further subdivided into 4,647 ‘mother tongues’ used by specific communities (Bhasin et al, 1992; Gadgil et al, 1998).
More importantly from a genetic perspective, the stratification effects of language divisions are exacerbated by caste differentials which apply to the 82% of Indians who are Hindu. The caste system is believed to have been introduced to India by Indo-European invaders some 3,500 years before present (BP). Since that time caste has traditionally defined the type of work undertaken by individuals and hence has governed their position and status within society (Thapar, 1966; Basham, 1967; Gadgil and Guha, 1995). As caste membership is hereditary, and to the present day virtually all Hindu marriages are contracted within caste boundaries, caste acts as a potent and long-established mechanism of genetic subdivision within Hindu society. At the same time, since the patterns of employment and living conditions of most individuals within Indian society continue to be largely dependent on their caste status, caste also can be regarded as an ‘environmental’ variable, which has important implications for the design of complex trait association studies.
Current estimates suggest that there are approximately 3,000 castes, together with more than 1,600 ‘scheduled tribes and castes’ whose members effectively live outside the caste system (Bhasin et al, 1992). In the early days of the caste system entire non-Indian communities could be accorded caste status within Hindu society (Thapar, 1966; Basham, 1967), and during the 18th and 19th centuries the splitting of castes into endogamous sub-castes was quite commonly observed (Census of Mysore, 1871; Basham, 1967). Further, some tribal communities are reported to have attained lower caste status on adopting settled agriculture practices (Bhattacharya et al, 1999). During the last century caste boundaries became more inflexible, although with continuing opportunities for hypergamy, i.e., the change of caste by a woman of lower caste on marriage to a husband of higher caste. This may explain reports of limited female but not male gene flow between closely ranked castes (Bamshad et al, 1998; Bhattacharya et al, 1999). However, male gene flow from caste groups to tribal communities has been demonstrated (Ramana et al, 2001; Cordaux et al, 2004).
Some 130 million of the non-Hindu population of India are Muslim, and there also are major Christian, Sikh, Buddhist and Jain communities which individually number in the millions. Although caste restrictions generally do not apply to these non-Hindu communities, they often exhibit stringent religious and social stratification. As a result of these various population subdivisions it has been estimated that there are 50,000-60,000 separate endogamous communities in India as a whole (Gadgil et al, 1998), ranging in total numbers from fewer than a hundred individuals to hundreds of thousands.
Given the potential relationship between founder effect and cryptic relatedness (Voight and Pritchard, 2005), it is important to note the very rapid growth of the Indian population during the course of the 20th century, from an estimated 238 million in 1901 (Dyson, 2001) to the current population of 1,104 million (PRB, 2005). In addition, in past generations the total number of communities across the Indian sub-continent may have been closer to 75,000 (Cavalli-Sforza et al, 1994), many of which would have been numerically very small. A combination of restricted initial effective population sizes, rapid population growth and strict marital endogamy would be expected to significantly increase the opportunities for founder effect and random drift at community level, with population bottlenecks also potentially affecting specific communities, e.g., following major epidemics. Because of these past events, high levels of genomic homozygosity can be observed even in communities which proscribe intra-familial marriage, as evidenced by studies on U.K. migrants of north Indian origin (Overall and Nicholls, 2001).
Preferential consanguineous marriage is, however, a potent additional factor that would be expected to facilitate genetic stratification in many Indian communities and thus influence the results of association studies. The National Family and Health Survey conducted in India during 1992-1993 indicated that 11.9% of marriages were consanguineous, equivalent to a mean coefficient of inbreeding, α = 0.0075 (IIPS, 1995). But this composite figure masks the very marked ethnic, regional and religious differences in the prevalence of consanguineous marriages across the country. In the majority Hindu population, consanguineous unions are proscribed for North Indian communities (Kapadia, 1958; Bittles et al, 1991), but uncle-niece (F = 0.125), with progeny homozygous at 12.5% of loci, and first cousin marriages (F = 0.0625), with progeny homozygous at 6.25% of loci, are especially popular in the three southern states of Andhra Pradesh, Karnataka and Tamil Nadu, which have a combined population approaching 200 million. In these three states, an estimated 29.7%-38.2% of marriages are consanguineous, with α = 0.0180-0.0266 (IIPS, 1995; Bittles, 1998, 2002), and 7.5%, 10.6% and 21.0% consanguinity additionally was reported in neighbouring Kerala, Goa and Maharashtra (IIPS, 1995). It should be stressed that these estimates apply to a single generation only. Where there has been a longstanding tradition of consanguineous marriage across generations, the resultant level of cumulative homozygosity would predictably be very much higher (Bittles et al, 1991; Bittles, 2001).
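The mean coefficient of inbreeding quoted above is conventionally calculated as the weighted sum α = Σ p_i F_i, where p_i is the proportion of all marriages of type i and F_i is the coefficient of inbreeding of their progeny. A minimal Python sketch of this calculation follows. The F values for uncle-niece and first cousin unions are those given in the text, and the double first cousin and second cousin values are the standard ones; the marriage distribution in `example` is hypothetical and is not taken from the survey data.

```python
# Coefficient of inbreeding (F) of the progeny of each type of union.
F_BY_UNION = {
    "uncle-niece": 1 / 8,           # 0.125, as quoted in the text
    "double first cousin": 1 / 8,   # 0.125
    "first cousin": 1 / 16,         # 0.0625, as quoted in the text
    "second cousin": 1 / 64,        # 0.015625
}


def mean_inbreeding(proportions):
    """Return alpha = sum(p_i * F_i) for a single generation.

    `proportions` maps union type to its share of ALL marriages
    (non-consanguineous marriages contribute F = 0 and are omitted).
    """
    if sum(proportions.values()) > 1 + 1e-9:
        raise ValueError("proportions of marriages cannot exceed 1")
    return sum(share * F_BY_UNION[union] for union, share in proportions.items())


# Hypothetical distribution of marriage types, for illustration only.
example = {"uncle-niece": 0.05, "first cousin": 0.20, "second cousin": 0.05}
alpha = mean_inbreeding(example)
print(f"alpha = {alpha:.4f}")
```

As in the text, a value computed in this way reflects a single generation only and does not capture cumulative homozygosity from earlier generations of consanguineous marriage.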
Among non-Hindu minorities, first cousin unions nationally account for 20.8% of Muslim marriages (Bittles and Hussain, 2000), and uncle-niece and first cousin marriages are favoured by many south Indian Christian groups, e.g., with 18.6% consanguineous marriage (α = 0.0173) among Christians in Karnataka (Bittles et al, 1991). The specific patterns of consanguineous marriage contracted vary by religion, with uncle-niece marriages proscribed for Muslims but double first cousin unions (also F = 0.125) permitted within Islam. Similarly, almost all Hindu and Christian first cousin marriages are cross-cousin, most commonly between a man and his mother's brother's daughter, whereas in Muslim communities all four types of first cousin marriage are permissible. While this difference would have no major effect at autosomal loci, it could significantly influence the expression of genes located on the X-chromosome (Bittles, 2001).
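The X-chromosome point can be illustrated with the standard path-counting rule for X-linked loci, sketched below in Python. This is a textbook rule rather than a method described in the sources cited above, and it assumes that the common grandparents are themselves not inbred. Each path runs from the husband, up to a common grandparent, and down to the wife; a path containing two consecutive males contributes nothing, since fathers do not transmit an X-chromosome to sons, and otherwise contributes (1/2) raised to the number of females on the path. The result applies to daughters only, as sons carry a single X-chromosome. The path strings and function names are illustrative.

```python
def x_path_contribution(sexes):
    """Contribution of one pedigree path to a daughter's X-linked F.

    `sexes` is a string of "M"/"F" giving the sex of each person on the
    path: husband, his parent, common grandparent, wife's parent, wife.
    """
    for first, second in zip(sexes, sexes[1:]):
        if first == "M" and second == "M":
            return 0.0  # an X cannot pass through two consecutive males
    return 0.5 ** sexes.count("F")


def x_linked_f(paths):
    """Sum contributions over all paths (one per common grandparent)."""
    return sum(x_path_contribution(path) for path in paths)


# Paths via the common grandfather and the common grandmother, from the
# husband's point of view, for the four types of first cousin marriage.
FIRST_COUSIN_PATHS = {
    "mother's brother's daughter": ["MFMMF", "MFFMF"],
    "mother's sister's daughter": ["MFMFF", "MFFFF"],
    "father's brother's daughter": ["MMMMF", "MMFMF"],
    "father's sister's daughter": ["MMMFF", "MMFFF"],
}

x_f = {kind: x_linked_f(paths) for kind, paths in FIRST_COUSIN_PATHS.items()}
for kind, value in x_f.items():
    print(f"{kind}: F_x = {value}")
```

Under these assumptions the autosomal F is 0.0625 for all four types, whereas the X-linked value for daughters is 1/8 for mother's brother's daughter unions, 3/16 for mother's sister's daughter unions, and zero for both unions traced through the husband's father.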
A comparable picture of community endogamy exists in Pakistan but with different emphases on the underlying causes of population stratification. Some 98% of the current population of 162 million are Muslim; however, major language and ethnic differences exist, with some 18 ethnic groups and more than 60 languages identified (Qamar et al, 2002; Hussain, 2005). Rather than the straightforward Sunni/Shia divide generally perceived outside Pakistan, major subdivisions exist within the different branches of Islam. The Sunni majority is divided into four major endogamous religious groups, the Hanafi, Shafei, Maliki and Hanbali, based on different schools of Islamic jurisprudence, and the Shia minority is similarly subdivided into the Ishnahary, Ismailis, and Dawoodi Bohras (Hussain, 2005). Sufism is quite widespread, and there are also other minor, non-Muslim, doctrinal subgroups, such as the Ahmadis and Qadianis.
Superimposed on these basic population divisions, marriage in Pakistan is usually contracted within traditional social and occupational groups, variously termed biraderis, quoms or zats (Shami et al, 1994; Hussain, 2005), which to an extent parallel the Hindu caste system. In addition to inter-ethnic genetic differences (Mohyuddin et al, 2001; Qamar et al, 2002), genomic studies have shown major differences between the members of co-resident biraderis. Although marriages between spouses from different biraderis can occur, these differences appear indicative of very limited levels of past biraderi intermarriage (Wang et al, 2000).
Most importantly within the Pakistan context, major national and regional studies have determined that 46.2%-61.2% of marriages in the current generation are consanguineous (α = 0.0242-0.0332), with first cousin unions accounting for 40%-50% of all marriages (NIPS, 1992; Bittles et al, 1993; Yaqoob et al, 1993; Hussain and Bittles, 1998). Varying levels of consanguineous marriage have, however, been reported in studies conducted in different communities, ranging from 31.1% (α = 0.0163) in urban Swat (Wahab and Ahmad, 1996) to 77.1% (α = 0.0414) among Army recruits (Hashmi, 1997), and there also is significant variation in the overall levels of consanguinity observed in different biraderis (Shami et al, 1994). In general, consanguineous marriages have been found to be much less common in the small Christian and Hindu communities (Hussain and Bittles, 1998).
Equivalently high levels of consanguineous marriage have been reported in the U.K. Pakistani community that mainly originates from the Mirpur district of Kashmir (Darr and Modell, 1988; Bundey et al, 1990; Corry, 2002). The degree to which consanguinity adversely affects the health of this community continues to arouse controversy in the medical literature, with divergent opinions often expressed (Bundey et al, 1992; Ahmad, 1994; Modell and Darr, 2002; Devereux et al, 2004), and recently the subject also has attracted wider public attention (Dyer, 2005). Unfortunately, discussion of the role of population sub-division is almost always missing from these disputations, despite the fact that within Pakistani society biraderi membership can circumscribe, and thus effectively define, the spectrum of mutations within the different constituent gene pools.
Association/case-control studies implicitly assume the absence of significant confounding variables. With such strong evidence of population stratification, and of random and preferential inbreeding, in both India and Pakistan, it is very surprising that the potential influence of either caste/biraderi membership or consanguinity has rarely been mentioned, or apparently considered, in studies conducted in either country. This is illustrated in Table 1, using information compiled from a search of PubMed that employed association study, case-control study and clinical trial as keywords, and matching of these terms with caste or biraderi and consanguinity. According to the data accessed, the term caste was cited in a maximum of 6.6% of Pakistani and 4.2% of Indian studies, with consanguinity listed in a maximum of 4.6% and 0.2% of studies in Pakistan and India, respectively.
|Country|Keywords|No of papers|Percentage|
|---|---|---|---|
|India|association study caste|39|4.2|
|India|association study consanguinity|10|1|
|India|case-control study caste|37|1.9|
|India|case-control study consanguinity|4|0.2|
|India|clinical trial caste|8|0.5|
|India|clinical trial consanguinity|1|<0.1|
|Pakistan|association study caste|10|6.6|
|Pakistan|association study biraderi|0|0|
|Pakistan|association study consanguinity|7|4.6|
|Pakistan|case-control study caste|7|1.9|
|Pakistan|case-control study biraderi|0|0|
|Pakistan|case-control study consanguinity|15|4.0|
|Pakistan|clinical trial caste|1|0.6|
|Pakistan|clinical trial biraderi|0|0|
|Pakistan|clinical trial consanguinity|0|0|

Table 1: Frequency of citation of caste, biraderi and consanguinity in association studies, case-control studies and clinical trials in India and Pakistan*
*Data collated from PubMed on 8 November 2005
In the absence of explicit authors' statements to the contrary, it is unlikely that caste/biraderi or consanguinity were included as explanatory variables other than in the small minority of the studies identified in Table 1. It also seems highly improbable that the apparent lack of control for these core variables could have been based on access to pre-existing information demonstrating non-significant effects of caste/biraderi membership and consanguinity on allele profiles and frequencies at the loci investigated.
It has been proposed that genetic studies which recruit participants from a small number of geographically proximate sites may be partially protected from the effects of population stratification, and thus benefit from a level of de facto control over environmental variation between cases and controls (Foster and Sharp, 2004). However, the scale and multiple levels of internal subdivision that characterize both the Indian and Pakistani populations suggest that this tactic alone would be insufficient to nullify the effects of population subdivisions or potential ‘environmental’ variables in genetic association studies conducted in either country.
From a population perspective the South Asian region is dominated by the populations of India and Pakistan, which together comprise 20% of the global total, and large Indian and Pakistani migrant communities are now permanently resident abroad. An additional 144 million people live in Bangladesh and 20 million in Sri Lanka (PRB, 2005). As in the rest of the sub-continent, major ethnic and religious boundaries exist in both of these countries and consanguineous marriage is a widely permissible option in most of their constituent communities.
Under these circumstances, urgent attention needs to be paid to the possible effects of population stratification in genetic association studies, especially since India has a rapidly expanding and increasingly international pharmaceutical industry, with growing numbers of clinical trials conducted and reported. An initiative of this nature will first require a thorough understanding and appreciation of the highly complex demographic and social structures of the numerous constituent sub-populations of the sub-continent and their population genetics, a caveat that equally applies to the overseas migrant communities. Unless and until it can be shown that caste/biraderi differentials do not significantly influence patterns of genomic variation in local and regional populations, e.g., by using unlinked genetic markers (Pritchard and Rosenberg, 1999), association studies that effectively ignore caste differentials in their recruitment schedules must be open to serious criticism. Conversely, the results of association studies conducted on members of a single caste may have little relevance to other caste and non-caste communities in the wider population.
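One simple form of the unlinked-marker check referred to above is to genotype cases and controls at markers that are unlinked to the candidate locus, compute an allelic chi-square statistic at each marker, and sum the statistics. In the absence of stratification the sum should behave approximately as a chi-square variable with one degree of freedom per biallelic marker. The Python sketch below illustrates this idea only and is not a full implementation of the published procedure; the allele counts are hypothetical, and the function names are introduced for the illustration.

```python
def allelic_chi_square(case_a, case_b, ctrl_a, ctrl_b):
    """Pearson chi-square (1 df) for a 2x2 table of allele counts."""
    total = case_a + case_b + ctrl_a + ctrl_b
    margins = (
        case_a + case_b,   # alleles sampled from cases
        ctrl_a + ctrl_b,   # alleles sampled from controls
        case_a + ctrl_a,   # copies of allele A
        case_b + ctrl_b,   # copies of allele B
    )
    if 0 in margins:
        return 0.0  # uninformative marker
    cross = case_a * ctrl_b - case_b * ctrl_a
    return total * cross ** 2 / (margins[0] * margins[1] * margins[2] * margins[3])


def stratification_statistic(markers):
    """Sum of per-marker statistics and the matching degrees of freedom."""
    statistics = [allelic_chi_square(*counts) for counts in markers]
    return sum(statistics), len(statistics)


# Hypothetical allele counts at three unlinked biallelic markers:
# (case allele A, case allele B, control allele A, control allele B).
unlinked_markers = [
    (120, 80, 100, 100),
    (90, 110, 110, 90),
    (150, 50, 120, 80),
]

total_chi2, df = stratification_statistic(unlinked_markers)
CRITICAL_5PCT_3DF = 7.815  # chi-square critical value, 3 df, 5% level
print(f"sum = {total_chi2:.2f} on {df} df; "
      f"evidence of stratification: {total_chi2 > CRITICAL_5PCT_3DF}")
```

In a real study many more markers would be needed, and cases and controls drawn from different castes or biraderis would be expected to inflate the summed statistic well beyond its null expectation.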
A similar situation exists with respect to consanguinity, especially in Pakistan where the majority of marriages are preferential intra-familial unions. Consanguinity is thus an important factor in the expression of genes at the family level and, as previously noted, within specific communities it can serve to reinforce the actions of other important population genetic influences, such as founder effect and drift. Failure to control for these genetic variables could result in cryptic relatedness and thus invalidate many of the data that are derived from association studies, with no guarantee that the results reported are either biologically or clinically meaningful, or capable of replication.
These reservations apply equally to the populations of many other populous regions, including West and Central Asia, the Middle East, and North and sub-Saharan Africa, which also retain strong traditions of clan and tribal endogamy and consanguineous marriage (Bittles, 2005; www.consang.net). If pharmacogenetic profiling is to fulfil the promise of highly specific, customized treatment protocols, ongoing advances are clearly needed in our knowledge of the patterns and structure of genetic variation across the genome, assisted by collaborative initiatives such as the HapMap project (McVean et al, 2005). At the same time, there seems to be a major gap in comprehending the importance of the basic parameters that govern the transmission of genes within and between human populations, which makes complementary detailed empirical studies into this topic an equally important and urgent issue.
The author declared no competing interests.