Validation of an Alzheimer’s Disease Assessment Battery in Asian Participants With Mild to Moderate Alzheimer’s Disease

Background: There is a lack of validated tools for assessing Alzheimer’s disease (AD) across Asia. This study evaluates the psychometric properties of the Alzheimer’s Disease Assessment Scale-Cognitive Subscale (ADAS-Cog), Disability Assessment for Dementia (DAD), and Neuropsychological Test Battery (NTB) in Asian participants. Methods: Participants with mild to moderate AD (n=251) and healthy controls (n=51) from Mainland China, Taiwan, Singapore, Hong Kong, and South Korea completed selected instruments at several time points. Results: Test-retest reliability was better than 0.70 for all tests. AD participants performed significantly more poorly than controls on every score. Within the AD group, greater disease severity corresponded to significantly poorer performance. The disease in the AD group worsened over time, and there was a trend for worse performance in AD compared to healthy controls over time. Conclusions: The ADAS-Cog, DAD, and NTB are reliable, valid, and responsive measures in this population and could be used for clinical trials across Asian countries/regions.


Introduction
Alzheimer's disease (AD) is a neurodegenerative disorder and the major cause of dementia in the elderly. AD-related medical complications are among the most common causes of death in the elderly population [1]. Approved treatments have been developed in clinical trials conducted largely in North America. According to a report from the Institute of Medicine [2], such studies were an insufficient guide to practice as they had too few patients from some countries or from different ethnic groups. As AD has become a global concern, including patients from Asia in clinical trials and translational research is important given that China and other Asian countries have the highest number of people with dementia [3][4][5]. Yet a lack of standardized assessment tools has hindered clinical trials in this region.
Cognitive and functional instruments, such as the Alzheimer's Disease Assessment Scale-Cognitive Subscale (ADAS-Cog) [6], Disability Assessment for Dementia (DAD) [7], and the Neuropsychological Test Battery (NTB), including elements of the Wechsler Memory Scale [8,9], measure the severity of AD-related symptoms and are considered important for exploring and providing evidence of treatment efficacy in research trials. However, the comparability of the psychometric properties of these instruments in Asian populations across regions has not been adequately assessed. For example, although the ADAS-Cog was validated in Chinese, the sample size was small (n=39) and longitudinal data were not available [10]. Even within the Chinese language, expressions and interpretations can differ dramatically depending on the region.
This study aims to evaluate the psychometric properties of the ADAS-Cog, DAD, and NTB in Asian participants with mild to moderate AD, including floor and ceiling effects, test-retest reliability, intra- and inter-rater reliability, construct validity (convergent, divergent, and discriminant validity), and sensitivity to change over the longitudinal course of the study (approximately 78 weeks or 1.5 years).

Instruments/Translation
In addition to the ADAS-Cog, DAD, and NTB, the Neuropsychiatric Inventory (NPI) [11], Clinical Dementia Rating-Sum of Boxes (CDR-SB) [12], Mini-Mental State Examination (MMSE) [13], and Dependence Scale (DS) [14] were selected as references for validation. All instruments went through a rigorous and standardized translation process that involved forward translation, backward translation, in-country clinician review, and debriefing with native-language-speaking subjects, such as normal subjects and/or Alzheimer's disease caregivers. This process was intended to ensure that the translated versions were not only conceptually equivalent to the original instrument but also culturally relevant and understandable to the target population in the target country. Efforts were made to ensure that cultural adaptations, where necessary, were consistent across all translations. For each instrument, there were 7 linguistically validated translations to evaluate in the study: Simplified Chinese (for mainland China), Traditional Chinese (for Taiwan, Hong Kong, and Singapore), English (for Hong Kong and Singapore), and Korean (for Korea).

Subjects
This study used a multicenter, longitudinal, observational design in participants with mild to moderate AD and cognitively normal controls from Mainland China, Taiwan, Singapore, Hong Kong, and South Korea. After informed consent was obtained, individuals entered a screening period of up to 31 days and, if eligible, entered the study and were evaluated over the next 78 weeks.
Eligibility criteria for all participants were: 1) age 50-85 years; 2) Rosen Modified Hachinski Ischemic (RMHI) score ≤ 4; and 3) fluency in the local primary language and at least an elementary education or equivalent. Inclusion criteria for the AD group were: 1) diagnosis of probable AD according to the National Institute of Neurological and Communicative Disorders and Stroke-Alzheimer's Disease and Related Disorders Association (NINCDS-ADRDA) criteria; 2) MMSE score of 13 to 26, inclusive; 3) CDR global score ≥0.5; and 4) screening visit brain magnetic resonance imaging (MRI) scan consistent with the diagnosis of AD. Inclusion criteria for the healthy controls were: 1) no significant memory complaints, drawn from the normal population aged 50 to 85 years; 2) MMSE score of 21 to 30, inclusive; 3) CDR global score equal to 0, with a Memory Box score equal to 0; 4) cognitively normal, based on absence of significant impairment in cognitive functions or activities of daily living; and 5) normal brain MRI scan findings.

Instrument scoring
The ADAS-Cog, DAD, NTB, CDR-SB, MMSE, DS, and NPI were administered at screening, baseline, and weeks 13, 26, 52, and 78. The NTB included the following subtests: Wechsler Memory Scale Visual-Paired Associates immediate and delayed scores [15], Rey Auditory Verbal Learning Test (RAVLT; immediate and delayed) [16], Wechsler Memory Scale-Digit Span forward and backward [15], Controlled Word Association Test (COWAT) [17], and Category Fluency Test (CFT) [18]. All scores were computed according to standard scoring instructions. Z-scores were calculated for each of the nine NTB components using the baseline mean and SD for all healthy controls with baseline scores. An 'executive function' z-score was obtained by averaging the z-scores from the NTB components measuring executive function (CFT, COWAT, WMS-R-Digit Span). The remaining six components, which measure memory, were averaged to obtain a 'memory' z-score (WMS-R-Visual-Paired Associates, WMS-R-Verbal-Paired Associates, and RAVLT, all with immediate and delayed components). Signs were reversed, as needed, before components were combined, such that higher NTB z-scores indicate better cognitive functioning.
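The z-score standardization described above can be illustrated with a minimal sketch. It assumes the sample SD of the healthy-control baseline scores is used as the denominator; the function names and the numbers are hypothetical and serve only to show the arithmetic, not the study's actual scoring code.

```python
from statistics import mean, stdev


def z_scores(values, control_values, higher_is_better=True):
    """Standardize scores against the healthy-control baseline mean and SD.

    Assumption: the sample SD (n - 1 denominator) of the controls is used.
    The sign is flipped for components where a higher raw score is worse,
    so that a higher z-score always indicates better functioning.
    """
    mu = mean(control_values)
    sd = stdev(control_values)
    sign = 1.0 if higher_is_better else -1.0
    return [sign * (v - mu) / sd for v in values]


def composite(component_z):
    """Average z-scores across components, participant by participant."""
    return [mean(zs) for zs in zip(*component_z)]


# Hypothetical baseline raw scores (not study data).
controls = {"cft": [18, 20, 22], "cowat": [30, 35, 40], "digit_span": [10, 12, 14]}
patients = {"cft": [10, 20], "cowat": [20, 35], "digit_span": [8, 12]}

# Executive function composite: mean of the CFT, COWAT, and Digit Span z-scores.
executive_z = composite([z_scores(patients[k], controls[k]) for k in controls])
print(executive_z)
```

A memory composite would be formed the same way from the six memory components.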

Laboratory apolipoprotein E (ApoE) genotyping
ApoE genotypes were determined by Quest Diagnostics using the QIAGEN PyroMark™ ApoE Test Kit.

Analysis
Test-retest reliability of all tests was evaluated by calculating the intraclass correlation coefficient (ICC) using data from the screening and baseline assessments (25 to 31 days apart). ICCs were also calculated to evaluate inter- and intra-rater reliability using videotaped assessments. Two AD subjects from each site were videotaped during ADAS-Cog, DAD, and NTB administration at the baseline visit. For intra-rater reliability, the video recording of the baseline scale administrations was reviewed by the same rater and scored again within 7-21 days of the live assessment. For inter-rater reliability, a rater different from the one who performed the initial assessment viewed the video recordings and scored the assessments within 7-21 days of the live assessment.
Spearman correlation coefficients were calculated among all scores to assess convergent and divergent validity. To assess discriminant validity, we compared mean scores between the AD and control groups using analysis of covariance (ANCOVA) controlling for age and education. ANCOVA was also used to compare AD participants with mild disease (MMSE 20-26) versus moderate or severe disease (MMSE<20), to compare AD participants across regions, and to compare ApoE4 carriers with non-carriers.
Change from baseline was calculated for all scores. The responsiveness index (i.e., effect size), defined as the mean change in the AD group divided by the standard deviation of the change scores in the healthy control group, was calculated to evaluate the magnitude of change over time. We also compared mean change scores between the AD and control groups with adjustment for baseline scores using ANCOVA. Longitudinal data were analyzed using mixed effects linear models for repeated measures. The mixed effects models included study group, visit, and group*visit as fixed effects, controlling for other baseline covariates (age, gender, region, and education).
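As an illustration of how an ICC can be computed from paired assessments, the sketch below implements one common form, ICC(2,1) in the Shrout and Fleiss notation (two-way random effects, absolute agreement, single measurement). The text does not state which ICC form was used, so this choice is an assumption, and the function name and data are hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rating.

    `ratings` holds one row per subject; each row has k scores, for example
    (screening, baseline) for test-retest or (live, video) for rater checks.
    Assumption: this ICC form is chosen for illustration only.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: subjects (rows), occasions (columns), error.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical (screening, baseline) score pairs for five participants.
pairs = [[22, 24], [30, 29], [18, 20], [35, 33], [27, 27]]
print(round(icc_2_1(pairs), 3))
```

Because this form penalizes systematic shifts between occasions as well as random disagreement, a practice effect between screening and baseline would lower the estimate.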
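The Spearman coefficient used for convergent and divergent validity is the Pearson correlation of the ranked scores, with tied values sharing their average rank. The following is a minimal pure-Python sketch; the function names and the example scores are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores: higher MMSE is better, higher ADAS-Cog is worse,
# so a negative rho is the expected direction for these two scales.
mmse = [26, 24, 20, 18, 15]
adas_cog = [12, 15, 22, 21, 30]
print(round(spearman_rho(mmse, adas_cog), 2))
```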
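The responsiveness index defined above reduces to a single ratio. The sketch below assumes the sample SD of the control change scores is used; the function name and data are hypothetical.

```python
from statistics import mean, stdev


def responsiveness_index(ad_changes, control_changes):
    """Mean change in the AD group divided by the SD of control change scores.

    Assumption: the sample SD (n - 1 denominator) is used for the controls.
    """
    return mean(ad_changes) / stdev(control_changes)


# Hypothetical change-from-baseline scores (not study data).
ad_changes = [4, 6, 5, 5]
stable_controls = [-1, 0, 1]           # SD = 1
very_stable_controls = [-0.1, 0, 0.1]  # SD = 0.1

print(responsiveness_index(ad_changes, stable_controls))
print(responsiveness_index(ad_changes, very_stable_controls))
```

The two calls use the same AD change scores; only the control variability differs. This shows that a very small control-group SD inflates the index even when the mean change in the AD group is unchanged.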

Participants
The screening phase included 333 potential participants, of whom 31 (29 AD, 2 healthy controls) did not complete the screening process, resulting in 251 AD participants and 51 healthy controls. Sites in mainland China (9 sites, 115 AD participants, 18 controls) represented nearly half of the sample, followed by Korea (6 sites, 66 AD participants, 12 controls), Hong Kong (3 sites, 33 AD participants, 11 controls), Taiwan (3 sites, 25 AD participants, 6 controls), and Singapore (3 sites, 12 AD participants, 4 controls). Of these participants, 208 AD participants and 49 healthy controls completed the entire study. Across visits, compliance with test completion ranged from 94% to 100%. The mean age was 70.5 (8.62) years. Table 1 summarizes other demographic and clinical characteristics by study group. The ADAS-Cog exhibited no floor or ceiling effects in either group. In 14 of 17 NTB subcomponent tests, at least some AD participants scored the minimum possible, although the extent varied greatly across tests (1%-70%). In only 4 NTB tests did some AD participants score the maximum possible (2%-18%). In contrast, healthy controls rarely scored the minimum, while in 8 of 17 NTB tests some control participants achieved the maximum possible (8%-80%). Notably, 9% of AD participants achieved the best possible DAD score at baseline, compared to 96% of controls.
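Floor and ceiling effects of the kind reported above are the proportions of participants scoring at the minimum or maximum possible value of a test. A minimal sketch, with a hypothetical function name and made-up scores:

```python
def floor_ceiling_rates(scores, minimum, maximum):
    """Return (floor_rate, ceiling_rate) as proportions of participants.

    `minimum` and `maximum` are the lowest and highest possible scores on
    the instrument, not the lowest and highest scores observed.
    """
    n = len(scores)
    floor_rate = sum(1 for s in scores if s == minimum) / n
    ceiling_rate = sum(1 for s in scores if s == maximum) / n
    return floor_rate, ceiling_rate


# Hypothetical delayed-recall scores on a 0-10 scale (not study data).
scores = [0, 0, 5, 10]
print(floor_ceiling_rates(scores, minimum=0, maximum=10))
```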

Reliability
Test-retest reliability was supported by acceptable ICCs between screening and baseline, with ICC>0.7 for 17 of 19 measures. The two measures with ICC<0.7 were the Wechsler Memory Scale (WMS) Visual-Paired Associates Immediate (ICC=0.5) and Delayed (ICC=0.49) tests. Inter- and intra-rater reliability was assessed using data from 45 videos of participants. ICC estimates were ≥ 0.91 for all tests except the WMS Visual-Paired Associates Immediate test, for which ICC=0.85 and 0.86 within and between raters, respectively.

Convergent and divergent validity
Among non-NTB tests, 14 of 21 comparisons had Spearman's rho 0.30 or greater. NPI-caregiver distress was poorly correlated with all scales (rho: 0.14 to 0.23) except NPI (rho=0.84). When comparing to NTB, ADAS-Cog, MMSE, and CDR-SB were significantly (p<0.001) correlated with Executive Function, Memory Function and Total NTB scores (rho: 0.33 to 0.71); DAD and Dependence Scale were significantly (p<0.001) correlated with Executive Function and Memory Function (rho: 0.36 to 0.36); DAD and Dependence Scale were significantly (p<0.01) correlated with Total NTB scores with rho=0.19 and 0.18 for DAD and Dependence Scale, respectively.

Discriminant validity
Comparisons of demographic and clinical characteristics across groups demonstrated statistically significant differences in age and education. Therefore, age and education were accounted for in the following series of ANCOVA analyses. As shown in Table 2, AD participants performed more poorly (p<0.001) than healthy controls on all comparisons, with effect sizes ranging from 0.75 (NPI) to 3.27 (total NTB score). As shown in Table 3, participants with moderate or severe AD showed significantly (p<0.001) poorer performance than participants with mild AD on nearly every assessment, with effect sizes ranging from 0.9 (Dependence Scale) to 3.48 (MMSE). NPI total and caregiver distress scores did not significantly differ across AD severity levels.
Table 3: Analysis of covariance (ANCOVA) between MMSE-derived impairment groups at screening and baseline, controlling for age and education
In terms of regions, differences in assessment scores among AD patients were found for MMSE (p<0.001), ADAS-Cog (p<0.001), RAVLT Immediate (p=0.010), Digit Span (p<0.001), and Executive Function (p<0.001). It should be noted that the sample sizes for each region were relatively small except for mainland China (n=110, 32, 66, 11, and 25 for China, Hong Kong, Korea, Taiwan, and Singapore, respectively), so this result should be interpreted with caution. The following sensitivity to change analyses were adjusted for region.

Sensitivity to change
Estimated means from the longitudinal mixed effects models are plotted in Figures 1a-1j. The primary term of interest from these models is the interaction between week and group, which evaluates whether the trend over time differs between groups. A significant interaction between group and time was observed for all measures except the NPI Caregiver scores where p=0.066. The trend over time differed between Mainland China and Korea, the only countries with sufficient sample size for subgroup analyses, only in Memory Function (p=0.040) with a greater degree of change observed for China.
As shown in Table 4, results of the multiple-regression-based change score analysis, which incorporates age, gender, education, region, and baseline score, indicated that a substantial proportion of AD participants worsened to a statistically significant degree relative to the healthy controls (all p<0.001).
In a simplified analysis that did not adjust for demographic differences, the mean change from baseline to Week 78, adjusted for baseline score, was evaluated within each group against the null hypothesis that the change is equal to zero (Table 5). Patients with AD significantly (p<0.001) worsened on all scores, while the control group significantly (p<0.05) worsened on ADAS-Cog, DS, Memory Function, Executive Function, and overall NTB scores. The responsiveness index ranged from 0.14 (Memory Function) to 22.31 (DAD). Significant change score differences between the AD and control groups were found on all but two scores (ADAS-Cog and MMSE), as also shown in Table 5.

Discussion
This is the first large-scale study to examine the psychometric properties of major AD instruments across multiple Asian countries/regions. Using data from both patients with AD and healthy controls, we verified the psychometric properties of commonly used assessment tools, including acceptable test-retest reliability, inter- and intra-rater reliability, validity, and responsiveness over a period of 78 weeks. Within the AD group, test-retest reliability was better than 0.70 for all tests. DAD, ADAS-Cog, DS, CDR-SB, and MMSE scores correlated well with NTB component, summary, and total scores, achieving significance in nearly every comparison. After adjustment for age and education differences, AD participants performed more poorly than controls on every assessment at all visits, with large effect sizes. Effect sizes for NPI total and caregiver distress scores were the lowest among the assessments. Within the AD group, greater disease severity corresponded to significantly poorer performance on nearly every assessment. Only NPI total and caregiver distress scores did not significantly differ between AD participants with low versus high disease severity [19,20].
Some differences emerged between the performance of the instruments in this Asia-only cohort and previous global studies. The mean change in the ADAS-Cog (3.9) in the AD group was somewhat smaller than in past studies (usual range 5-8). For example, recent studies of solanezumab and semagacestat had ADAS-Cog declines of 4.5 points in Expedition 1 and 6.6 points in Expedition 2 (solanezumab) [21] and 7.8 points in the semagacestat studies [22]. Similarly, changes in DAD, NTB Total, and MMSE were smaller than those observed in these prior studies. Many factors could contribute to this difference, including better overall support for trial patients and better adherence to treatments. A smaller change score, if reproducible, would affect power calculations and the sample sizes required to show a drug-placebo difference in Asian populations.
ApoE4 was detected in significantly more AD participants than healthy controls, with similar rates across regions. This finding is consistent with the literature in which ApoE4 is considered a risk factor for developing AD [23]. The rate of ApoE4 was lower than in many previously reported AD studies [24], yet similar to reports that Asian patients with AD had a lower prevalence of ApoE4 compared to US [25] and northern European [26] populations. Despite its significant role in predicting AD, ApoE4 had little association with the psychometric assessments, except for the NTB executive function score. Our finding matched the literature [27], in which including or excluding ApoE4 did not influence the predictive accuracy of AD progression (81% versus 80% for inclusion and exclusion, respectively).
The AD group demonstrated substantial worsening on most scores, with large effect sizes as represented by the responsiveness index. There was large variability in the responsiveness index across the psychometric tests, with effect sizes ranging in magnitude from 0.14 for Memory Function (or 1.3 for the non-NTB ADAS-Cog) to 22.3 for the DAD. It should be noted that the variability in the healthy control group change scores, which constitutes the denominator of the responsiveness index, was small for most tests (ranging from 0.02 for DS and CDR-SB to 0.35 for ADAS-Cog), and this may be part of what drives the larger responsiveness index values. Compared to healthy controls and adjusted for demographics and baseline score, the trend over time was significantly worse for the AD group for all measures except the NPI Caregiver scores, which was expected. Yet significant NPI caregiver score differences between the AD and healthy control groups were found at all time points.
Prior to this study, these instruments had been the subject of minimal psychometric research in Asia, although they have been shown to be valid, reliable, and responsive to change in the nations and regions in which they were developed. Specifically, prior studies have shown that the ADAS-Cog is sensitive to age-related decline in patients with mild to moderate AD [28]. The DAD has been used as an endpoint to assess functional outcomes of AD patients after treatment [7]. The individual measures of the NTB have been shown to be reliable, valid for use in AD, and sensitive to cognitive decline [29]. The NTB also evaluates delayed recall and executive function, cognitive domains that are not adequately assessed with the ADAS-Cog [30]. The NPI [11,31], CDR [32], and MMSE [13,33] were all developed specifically to assess patients with dementia or other cognitive impairment. The results of this study suggest that, in Asia, these instruments are also reliable, valid in differentiating cognitively impaired from cognitively healthy subjects, and sensitive in documenting longitudinal change in an AD patient group.
According to the Global Burden of Disease estimates for the 2003 World Health Report [34], dementia contributed 11.2% of years lived with disability in people aged 60 years and older. Using a Delphi consensus approach conducted by Alzheimer's Disease International, Ferri and colleagues [5] reported that although the expert consensus was for a higher prevalence of dementia in developed regions, it is China and its developing western-Pacific neighbors that have the highest number of people with dementia (6 million), followed by western Europe with 4.9 million and North America with 3.4 million. They also predicted that by 2040 China and its western-Pacific neighbors will have three times more people living with dementia than western Europe. Zhang and colleagues reported that the prevalence among persons 65 years or older in China was 4.8% for Alzheimer's disease and 6.8% for dementia, after post-hoc correction for negative screening errors. Chan and colleagues [35] provided updated estimates of 9.19 million (5.92-12.48) dementia patients and about 5.69 million (3.85-7.53) people with Alzheimer's disease in 2010 [36]. Catindig and colleagues [37] claimed that the dementia subtype pattern appeared to have changed over time, with AD becoming more prevalent in East Asian countries since 1990. All of these reports highlight the importance of including Asian countries in global clinical trials. The impact of dementia in Asia on health, society, and the economy requires more attention. More studies using standardized, cross-culturally sensitive cognitive instruments and ascertainment of functional and social decline are needed to better understand the burden and cause of early dementia [37]. The results of this study fill this important gap for Asia. The study not only provides validated instruments for future AD research in this region but also offers invaluable information on how to conduct AD clinical trials in these countries/regions.
Some limitations are noted. The sample was limited to five Asian countries/regions, and therefore the results cannot be generalized to other Asian countries. Additionally, although the overall sample size in this study was sufficient for psychometric validation analyses, sample sizes were not equal across regions, and only China and Korea had sample sizes greater than 50. These unequal and small sample sizes within regions limited the potential for examining the impact of cultural differences across all regions using advanced item response theory models [38,39]. Yet to our knowledge, this is one of the first validation studies to use sufficient numbers of patients across Asian regions to cover a more diverse Asian population, so that the results could be reasonably generalizable. Future studies that recruit more diverse samples across more regions can enhance the generalizability of the results.
In conclusion, the psychometric properties of the ADAS-Cog, DAD, and NTB were verified using data from patients with mild to moderate AD recruited from Asia. These instruments can be used for future clinical trials in the participating countries/regions. Additionally, a significant amount of information was obtained, including the rates of ApoE4 status in Asian AD patients, which warrant further investigation. The trial was complicated and challenging, but it further demonstrated the potential and capacity of the participating sites across countries/regions to collaboratively conduct global AD trials.