Spatial Disease Cluster Detection: An Application to Childhood Asthma in Manitoba, Canada

Cluster detection is an important part of spatial epidemiology because it may help suggest potential factors associated with disease and thus guide further investigation of the nature of diseases. Many different methods have been proposed to test for disease clusters. The most popular methods for detecting spatial focused clusters are the circular spatial scan statistic (CSS), the flexible spatial scan statistic (FSS) and Bayesian disease mapping (BYM). Only the latter approach is based on a rigorous modeling approach. However, Bayesian inference may depend on the choice of priors. We propose a frequentist approach, which yields maximum likelihood estimates, to identify potential focused clusters. The proposed approach is based on the recently introduced method of data cloning. We can also provide predictions (and prediction intervals) for relative risk values. The advantages of the data cloning approach are that the answers are independent of the choice of priors and non-estimable parameters are flagged automatically. We illustrate the proposed approach, and compare it with the aforementioned approaches, by analyzing a dataset of childhood asthma visits to hospital in the province of Manitoba, Canada, during 2000-2010. Our results showed that the potential clusters are mainly located in the north-central part of the province.

Citation: Torabi M (2012) Spatial Disease Cluster Detection: An Application to Childhood Asthma in Manitoba, Canada. J Biomet Biostat S7:010. doi:10.4172/2155-6180.S7-010. Journal of Biometrics & Biostatistics, ISSN: 2155-6180.


Introduction
Asthma is a severe disease that inflames and narrows the airways, causing difficulty in breathing. The primary cause of asthma is known to be sensitization to allergic and non-allergic triggers. Allergic triggers can be mould, animal dander, pollen, cockroaches, and dust mites, and non-allergic triggers can be weather, humidity, rain/precipitation, high surface pressure, low solar irradiance, winds, air pollution, respiratory viral infections, chemicals, and certain drugs. The major risk factors for developing asthma are known to be a family history of asthma and/or allergy (eczema, allergic rhinitis); exposure, in infancy, to high levels of antigen such as house dust mites; and exposure to tobacco smoke or chemical irritants in the workplace.
According to the World Health Organization, asthma is now a serious public health problem with over 300 million sufferers worldwide [1]. Over the past two decades, asthma has reached epidemic proportions in large areas of North America. Asthma rates have been increasing remarkably, particularly in children, where the disease occurs in up to 12% of all children in North America, and about twice as frequently in children living in poorer conditions, such as inner cities [2]. Asthma affects approximately 8% of the Canadian population [3]. According to Statistics Canada, 10% of Canadian children had been diagnosed with asthma (2008-2009), and it is the major cause of hospitalization of children in Canada [4]. Asthma is responsible for increasing numbers and proportions of emergency room visits and hospitalizations, with some increase in deaths as well [2]. With such an impact, it is important to identify trends in asthma incidence that may motivate further epidemiological studies to identify risk factors and any changes in important factors. Trends may occur across regions, and the focus of our paper is to examine geographical variation in the number of asthma visits to hospital during 2000 to 2010 in the province of Manitoba, Canada.
A limited region within the study regions with a high ratio of disease cases is defined as a spatial cluster [5]. The identification of a cluster of disease can help to find potential factors associated with disease and lead to improved understanding of etiology. Moreover, identification of clusters may lead to more detailed investigations to find out the association between exposures and disease interventions [6].
Statistical cluster detection methods are generally classified into two main categories, focused and general (also called non-focused). Methods for focused cluster detection are designed to identify regions with an excess number of cases in the vicinity of potential causes (e.g., a toxic waste site) [7,8]. On the other hand, methods for general clusters are designed to identify regions with an excess number of cases anywhere in the study region. Typically, these models adopt extra-Poisson variability in different ways [9,10,11]. These methods are reviewed and compared in [12].
Methods for focused cluster detection include, but are not limited to, the circular spatial scan statistic (CSS) [13], the flexible spatial scan statistic (FSS) [14], and Bayesian disease mapping (BYM) [9]. The methods for general cluster detection include the Besag and Newell (BN) test [15,16] and the maximizing excess events test (MEET) [17]. The aim of focused tests is to test the null hypothesis of no local spatial cluster, while the general tests are used to detect potential clusters in the study region. In other words, for the focused tests (CSS, FSS, and BYM), the goal is to find a cluster for a specific region of interest, and consequently the test statistics are designed to capture the potential cluster. For the general tests (BN and MEET), the goal is to find any significant cluster in the study region without specifying any region of interest.
In this paper, we mainly focus on the focused cluster detection approaches. With advances in computational power, the Bayesian approach, especially the non-informative Bayesian approach, has become quite popular as a modeling approach to identify the potential clusters in a research study. However, the inference may depend on the choice of priors.
Recently, Lele et al. [18] introduced an alternative approach, called data cloning (DC), to compute the maximum likelihood estimates (MLE) and their standard errors for general hierarchical models. Lele et al. [19] also described an approach to compute prediction and prediction intervals for the random effects. The DC approach, thus, is well suited to address the issues in spatial focused cluster detection using the frequentist paradigm. The other advantages of the DC method are that the answers are invariant to the choice of priors and non-estimable parameters are flagged automatically.
In this paper, we propose a frequentist approach via data cloning for identifying the potential focused clusters. In particular, we evaluate the performance of the proposed approach, and compare it with other focused cluster detection approaches such as CSS, FSS and BYM, by applying it to a real dataset of childhood asthma visits to hospital in the province of Manitoba, Canada, during 2000-2010.

Study subjects
The study was based on a yearly dataset of asthma visits to hospital by children (age ≤ 20) in the Canadian province of Manitoba during the 2000-2010 fiscal years (see http://atlas.nrcan.gc.ca/site/english/maps/reference/national/can_political_e/map.pdf for a map of Canada). The population of Manitoba was stable during the study period, from 1.15 million in 2000 to 1.20 million in 2010, with an average population of children of around 336,000. The province consisted of eleven Regional Health Authorities that were responsible for the delivery of health care services. These eleven regions were further sub-divided into 56 Regional Health Authorities Districts (RHAD); these RHAD are the geographic units used in our model, and all data were linked to these geographic boundaries. For simplicity, we call these regions 1,2,...,56. In addition, a population-based centroid was provided for each RHAD, and these centroids were not necessarily geographic centres. The data were aggregated over the study period 2000-2010.
The number of asthma visits totaled 14,691 over the study period with mean and median number of yearly cases per region of 26 and 17 (range 3 to 422), respectively. The regional child population sizes varied from 319 to 173,400, with mean and median numbers of 5,998 and 2,432, respectively. The largest population was in region 56, while region 42 had the smallest population.
The key data requirements for the focused methods are the number of cases and the number of expected cases or the population size for each region. When the expected number of disease cases varies by important strata, such as year and gender, adjustments can be made. The expected number of disease cases is then adjusted by year (1-10) and gender (male, female). We first briefly review the spatial focused cluster detection methods CSS, FSS, and BYM, and then explain the proposed approach of data cloning.

Circular spatial scan statistic (CSS)
The spatial scan statistic has been used in a wide range of applications within the field of epidemiology [20]. The circular spatial scan statistic imposes a circular window S on each region, and for any of those regions, the radius of the circle varies from zero to a pre-specified maximum distance d or a pre-specified maximum number of regions K to be included in the cluster. The set of all circular windows to be scanned, over the regions $i = 1, \ldots, m$, is denoted $S_1$. For each window, consider the likelihood under the null and alternative hypotheses, where the null hypothesis is no cluster in region i and the alternative hypothesis is a cluster in region i based on its j-th nearest neighbours. The likelihood ratio statistic, equation (1), is then computed from $C_i$ and $E_i$, the observed and expected numbers of cases in a circle, respectively. The circles with the highest likelihood ratio values are identified as potential clusters. We can implement this method using SaTScan [21] or FleXScan [22] software. In general, K is chosen to include at most 50% of the population at risk. We used K = 15, the FleXScan default, and since our example uses aggregate data, the region centroid had to be included in the radius of the circle for the region to be part of the circle.

Flexible spatial scan statistic (FSS)
This method is similar to the method of CSS; however, the detected cluster is allowed to be flexible in shape while at the same time the cluster is confined to a relatively small neighbourhood of each region. The flexible scan statistic imposes an irregularly shaped window S on each region by connecting its adjacent regions. For each region i, irregularly shaped windows of length j (that is, j connected regions including i) are formed, with j ranging from 1 to the pre-specified maximum J. The connected regions are restricted to subsets of the set consisting of region i and its (J-1) nearest neighbours, where J is a pre-specified maximum length of cluster. The set of all windows to be scanned by the flexible spatial scan statistic is denoted $S_2$. Note that the circular spatial scan statistic considers J circles for a given region i; however, the flexible spatial scan statistic considers J circles in addition to all sets of connected regions whose centroids are located within the J-th largest concentric circle. As a consequence, the size of $S_2$ is much larger than that of $S_1$, which is at most $mJ$. Under the Poisson assumption, the test statistic for the flexible spatial scan statistic based on the likelihood ratio test is obtained by (1), where the window in (1) now belongs to $S_2$ rather than $S_1$. We implement this method with the FleXScan software, using the default setting J=15. Similar to the circular spatial scan statistic, the windows with the highest likelihood ratio values are identified as potential clusters.

Bayesian disease mapping (BYM)
A Bayesian approach using Markov chain Monte Carlo (MCMC) can also be used for cluster detection [9,10,23,24]. This approach was first used by Besag et al. (BYM) [9] and the model consists of two parts. In the first part, the cases are assumed to follow a Poisson distribution with an area-specific parameter: $C_i \mid \theta_i \sim \mathrm{Poisson}(E_i \theta_i)$, where $C_i$ and $E_i$ are the observed and expected numbers of cases in region i, respectively. The second part of the model is obtained by $\log(\theta_i) = \mu + \eta_i$, where $\theta_i$ is the relative risk (RR) in region i, $\mu$ is the overall mean ratio over regions and $\eta_i$ represents spatially correlated random effects. We use a conditionally autoregressive (CAR) model to capture the spatial random effects $\eta_i$. A variety of CAR models may be specified by taking a collection of mutually compatible conditional distributions of $\eta_i$ given the effects in the neighbouring regions, where $\partial_i$ denotes the set of neighbours of the i-th region [9]. We consider a general proper CAR model for the spatial effects $\eta = (\eta_1, \ldots, \eta_m)'$, defined in terms of the neighbourhood structure and the identity matrix of dimension m (see [25] for details of this proper CAR model). The parameters can then be estimated within the Bayesian framework (MCMC) using vague priors for the parameters. This produces the posterior distributions for the parameters in the model.
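For orientation, a proper CAR model for the spatial effects $\eta_i$ can take, for example, the following form. This is only a sketch of one common parametrization, not necessarily the exact specification used in [25] or in this analysis; the symbols $\rho$ (spatial dependence), $\sigma^2$ (spatial dispersion), $n_i$ (number of neighbours of region i), $W$ (0/1 adjacency matrix) and $D_n = \mathrm{diag}(n_1, \ldots, n_m)$ are assumptions introduced here for illustration.

```latex
\begin{align*}
\eta_i \mid \{\eta_j : j \in \partial_i\}
  &\sim N\!\left( \frac{\rho}{n_i} \sum_{j \in \partial_i} \eta_j,\; \frac{\sigma^2}{n_i} \right),
  \qquad i = 1, \ldots, m, \quad |\rho| < 1, \\
\eta = (\eta_1, \ldots, \eta_m)'
  &\sim N_m\!\left( 0,\; \sigma^2 (D_n - \rho W)^{-1} \right)
  = N_m\!\left( 0,\; \sigma^2 (I_m - \rho D_n^{-1} W)^{-1} D_n^{-1} \right).
\end{align*}
```

The second line shows how the identity matrix $I_m$ enters the joint covariance once the adjacency matrix is row-normalized by the neighbour counts.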
A cluster is defined as a region where the estimated relative risk is significantly larger than 2 (in terms of their credibility sets) [26]. To implement this method, we used WinBUGS software [25] to compute the relative risk values.
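The decision rule above can be sketched in a few lines of code. This is a minimal illustration, assuming that "significantly larger than 2" means the lower bound of an equal-tailed 95% credible interval exceeds 2; the function name `flag_clusters`, the array layout and the toy draws are hypothetical and are not taken from the WinBUGS analysis.

```python
import numpy as np

def flag_clusters(rr_draws, threshold=2.0, level=0.95):
    """Flag regions whose relative-risk interval lies entirely above `threshold`.

    rr_draws: array of shape (n_draws, n_regions) holding sampled RR values
              (e.g. MCMC output); this layout is an assumption of the sketch.
    Returns the lower bounds, upper bounds and a boolean flag per region.
    """
    rr_draws = np.asarray(rr_draws, dtype=float)
    alpha = 1.0 - level
    # Equal-tailed interval computed column by column (one column per region).
    lower = np.quantile(rr_draws, alpha / 2.0, axis=0)
    upper = np.quantile(rr_draws, 1.0 - alpha / 2.0, axis=0)
    flagged = lower > threshold
    return lower, upper, flagged

# Toy draws for three hypothetical regions (deterministic grids, not real data).
draws = np.column_stack([
    np.linspace(2.5, 3.5, 1001),  # interval entirely above 2
    np.linspace(1.5, 3.0, 1001),  # interval straddles 2
    np.linspace(0.5, 1.5, 1001),  # interval entirely below 2
])
lower, upper, flagged = flag_clusters(draws)
print(flagged)
```

Changing `threshold` gives a more or less conservative rule without touching the sampling step.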

Frequentist approach using data cloning for disease mapping (DC)
The DC method uses the Bayesian computational approach for frequentist purposes. In DC, the observations y are cloned L times, and the likelihood for the L copies of the original data replaces the original likelihood in the posterior distribution (3) of the model parameters $\alpha$. Lele et al. [18,19] showed that, for L large enough, this posterior distribution converges to a multivariate Normal distribution with mean equal to the MLE of the model parameters and variance-covariance matrix equal to 1/L times the inverse of the Fisher information matrix for the MLE. This factor of 1/L adjusts for the fact that the cloned dataset has L times more information than the original dataset. Hence, this distribution is nearly degenerate at the MLE $\hat{\alpha}$ for large L. Moreover, the sample mean vector of the random numbers generated from (3) provides the MLE of the model parameters, and L times their sample variance-covariance matrix is an estimate of the asymptotic variance-covariance matrix for the MLE $\hat{\alpha}$. Lele et al. [19] also provided various checks to determine the adequate number of clones L.
Prediction of relative risk: Prediction of relative risk (random effects), particularly from the frequentist viewpoint, is usually problematic. A naive approach, when $\alpha$ is estimated using the data, is to use $\pi(\mathrm{RR} \mid \hat{\alpha}, y)$, where $\mathrm{RR} = (\mathrm{RR}_1, \ldots, \mathrm{RR}_m)'$. However, this approach does not take into account the variability introduced by estimating the model parameters. An approach that has been suggested in the literature (e.g., Hamilton [27]) to take into account the variation of the estimator is to use the density in equation (4), in which $\phi(\cdot; \xi, \sigma^2)$ denotes the Normal density with mean $\xi$ and variance $\sigma^2$, here equal to the MLE and the inverse of the Fisher information matrix, respectively. In this paper, we obtain the prediction of the RR using the density in equation (4) along with MCMC sampling. Similar to the Bayesian approach, a cluster is defined as a region where the estimated relative risk is significantly larger than 2 (in terms of its prediction interval). We used the dclone package [28] in the R software [29] to compute the relative risk values.
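The behaviour of the cloned posterior can be illustrated with a toy example. The sketch below is not the spatial model of this paper and does not use dclone: it assumes i.i.d. Poisson counts with a conjugate Gamma prior, so that the posterior for L clones is available in closed form and can be sampled directly instead of by MCMC. The function names, the prior values and the data are hypothetical.

```python
import numpy as np

def cloned_posterior_gamma(y, n_clones, prior_shape=0.001, prior_rate=0.001):
    """Gamma posterior (shape, rate) of a Poisson rate when y is cloned n_clones times.

    Cloning multiplies the sufficient statistics (sum of y, sample size) by
    n_clones, while the prior enters only once.
    """
    y = np.asarray(y, dtype=float)
    shape = prior_shape + n_clones * y.sum()
    rate = prior_rate + n_clones * y.size
    return shape, rate

def dc_estimate(y, n_clones, n_draws=20000, seed=0, **prior):
    """Data-cloning estimates: posterior mean, and n_clones times posterior variance."""
    shape, rate = cloned_posterior_gamma(y, n_clones, **prior)
    rng = np.random.default_rng(seed)
    draws = rng.gamma(shape, 1.0 / rate, size=n_draws)  # NumPy uses shape and scale
    mle_hat = draws.mean()                    # approximates the MLE
    var_hat = n_clones * draws.var(ddof=1)    # approximates the MLE's asymptotic variance
    return mle_hat, var_hat

y = [3, 5, 4, 6, 2, 4]  # toy counts; the MLE is mean(y) = 4 with variance mean(y)/n = 4/6
for prior in ({"prior_shape": 0.001, "prior_rate": 0.001},
              {"prior_shape": 50.0, "prior_rate": 1.0}):
    for n_clones in (1, 10, 1000):
        mle_hat, var_hat = dc_estimate(y, n_clones, **prior)
        print(prior, n_clones, round(mle_hat, 3), round(var_hat, 3))
```

With a single clone the two priors give clearly different answers, whereas for a large number of clones both sets of estimates approach the MLE and its asymptotic variance, which is the prior-invariance property described above.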
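For reference, the prediction density referred to as equation (4) can be written along the following lines. This is a hedged reconstruction of the general form described by Lele et al. [19], not a verbatim copy of the equation in this paper; the symbols $f$ (density of the data given the relative risks), $g$ (density of the relative risks given the parameters) and $C(y)$ (normalizing constant) are assumptions introduced here.

```latex
\[
\pi(\mathrm{RR} \mid y)
  = \frac{1}{C(y)} \int f(y \mid \mathrm{RR}, \alpha)\,
    g(\mathrm{RR} \mid \alpha)\,
    \phi\!\left(\alpha;\, \hat{\alpha},\, I^{-1}(\hat{\alpha})\right) d\alpha ,
\]
```

Here $\hat{\alpha}$ is the MLE obtained by data cloning and $I^{-1}(\hat{\alpha})$ is the inverse of the Fisher information matrix, so the Normal density $\phi$ plays the role that a prior would play in a Bayesian analysis, but is determined by the data rather than chosen by the analyst.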
Note that these focused methods have different assumptions. While the CSS and FSS methods are distribution free, the number of cases in BYM and DC methods is assumed to follow a Poisson distribution. We also need to specify the number of regions to be included in the cluster for the CSS and FSS methods while it is not required for the BYM and DC methods.
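To make the likelihood ratio statistic (1) used by the scan methods concrete, the sketch below scans nearest-neighbour windows around each region centroid. It assumes the widely used Kulldorff-type Poisson form of the log likelihood ratio and expected counts that sum to the total number of cases; it may differ in detail from equation (1), omits the Monte Carlo significance test, and is not the SaTScan or FleXScan implementation. All names and the toy data are hypothetical.

```python
import numpy as np

def poisson_llr(c_in, e_in, c_total):
    """Log likelihood ratio for one window (Kulldorff-type Poisson form).

    Returns 0 unless the window has more observed than expected cases.
    """
    if c_in <= e_in:
        return 0.0
    llr = c_in * np.log(c_in / e_in)
    c_out, e_out = c_total - c_in, c_total - e_in
    if c_out > 0:  # 0 * log(0) is taken as 0
        llr += c_out * np.log(c_out / e_out)
    return float(llr)

def circular_scan(cases, expected, centroids, max_regions):
    """Return (best LLR, best window) over windows of 1..max_regions nearest regions."""
    cases = np.asarray(cases, dtype=float)
    expected = np.asarray(expected, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    c_total = cases.sum()
    # Pairwise Euclidean distances between region centroids.
    diffs = centroids[:, None, :] - centroids[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    best_llr, best_window = 0.0, ()
    for i in range(len(cases)):
        order = np.argsort(dist[i], kind="stable")[:max_regions]
        for j in range(1, len(order) + 1):
            window = order[:j]  # region i plus its (j-1) nearest neighbours
            llr = poisson_llr(cases[window].sum(), expected[window].sum(), c_total)
            if llr > best_llr:
                best_llr = llr
                best_window = tuple(sorted(int(r) for r in window))
    return best_llr, best_window

# Toy data: five regions on a line, with an excess of cases in regions 0 and 1.
cases = [30, 28, 5, 6, 6]
expected = [15, 15, 15, 15, 15]  # sums to the total number of cases (75)
centroids = [(0, 0), (1, 0), (2.5, 0), (10, 0), (11, 0)]
best_llr, best_window = circular_scan(cases, expected, centroids, max_regions=2)
print(best_window, round(best_llr, 2))
```

A flexible scan would replace the inner loop over concentric windows with a loop over connected subsets of the nearest neighbours, which is why its window set is much larger.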
In Figure 1, the areas that are statistically significant (potential clusters) are shown for each method separately. The results are summarized in Table 1, which also reports the order of significance of the regions for each method; that is, the regions are ranked by how strongly each one is indicated as a cluster. For instance, rank 1 in the DC method means that region 37 is most likely to constitute a significant cluster, while rank 6 means that region 26 is the least likely among the detected regions. Hence, it is easy to see which regions contribute most to a cluster.
The CSS and FSS methods identified somewhat similar regions as potential clusters, with 13 regions for the FSS method and 14 regions for the CSS method. In particular, the CSS method detected the regions {10, 14, 20, 21, 26, 31, 33, 34, 35, 36, 37, 38, 40, 41} as potential clusters, while the FSS method identified the regions {25, 26, 29, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41} as potential clusters. The main reason for the different results between CSS and FSS is the non-circular shape of some regions in the province of Manitoba: unlike the CSS method, the FSS method is able to identify those non-circular regions as potential clusters.
The DC method detected the regions {26, 28, 34, 36, 37, 41} as potential clusters. The same regions were also identified as potential clusters by the BYM method but with a different order of significance (e.g., regions 28 and 36). However, the BYM approach may depend on the choice of priors, and we may get different results when using different priors; note that we used a gamma distribution with shape and scale parameters 0.001 for the inverse of the variance component and a Normal distribution with mean 0 and variance $10^6$ for the fixed effect. It is worth mentioning that the regions identified as potential clusters by the DC and BYM methods were also detected by the CSS and FSS methods, except for region 28.

Discussion
The most popular approaches for detecting spatial focused clusters are distribution-free methods such as CSS and FSS. The Bayesian method (BYM), which is based on a Poisson model, is also popular as a method for identifying spatial focused clusters. However, the Bayesian inference may depend on the choice of priors.
Using DC, we have proposed a frequentist approach which identifies potential clusters with a high ratio of disease. The advantages of the DC approach are that the answers are independent of the choice of priors and non-estimable parameters are flagged automatically. We applied the proposed approach to a real dataset of childhood asthma visits to hospital in the province of Manitoba, Canada. We also compared the proposed approach with other methods such as CSS, FSS, and BYM. The CSS and FSS methods detected some different regions as potential clusters due to the non-circular shape of some regions in the province of Manitoba. The BYM and DC methods identified fewer regions as potential clusters than the CSS and FSS methods. Although the results of DC and BYM were similar for detecting potential clusters in our analysis, one may get different results for BYM, unlike DC, when using different priors.
In the BYM and DC approaches, we conservatively defined a region as a cluster if the credibility set (BYM) or prediction interval (DC) of the estimated relative risk lay entirely above two. One may define a different decision rule, with a threshold for the estimated relative risk that is larger or smaller than two [30].
We adjusted our expected number of asthma cases by two important factors, gender and year. The proposed method can also be easily extended to include covariates directly, which may be required for some applications.
In general, the potential clusters are located in the north-central part of the province. These findings may represent real increases or may be indicative of different distributions of important covariates, such as demographic characteristics of the population of the north-central region, that are unmeasured and unadjusted for in our modeling. Further investigation is needed to explore these findings.