Beyond IC50s: Towards Robust Statistical Methods for in vitro Association Studies

Cell line cytotoxicity assays have become increasingly popular approaches for genetic and genomic studies of differential cytotoxic response. There are an increasing number of success stories, but relatively little evaluation of the statistical approaches used in such studies. In the vast majority of these studies, concentration response is summarized using curve-fitting approaches, and then summary measure(s) are used as the phenotype in subsequent genetic association studies. The curve is usually summarized by a single parameter such as the curve’s inflection point (e.g. the EC/IC50). Such modeling makes major assumptions and has statistical limitations that should be considered. In the current review, we discuss the limitations of the EC/IC50 as a phenotype in association studies, and highlight some potential limitations with a simulation experiment. Finally, we discuss some alternative analysis approaches that have been shown to be more robust.

limitations of such in vitro assays has previously been reviewed [1,2]. An increasing number of success stories for such experiments are emerging [3][4][5][6][7][8][9], but the statistical methodologies applied in such experiments have not been examined in detail.
Cell-based studies allow for the examination of drug response at greater resolutions by measuring cellular response across a spectrum of concentrations rather than limited set of concentrations afforded by traditional studies. These types of dose-response or concentration-response studies measure some indication of cellular health or response such as total ATP, cell viability/morphology, or transcript expression levels as a function of increasing drug concentration. These data points are then fit to a statistical model, usually some form of a 4-parameter logistic curve (sometimes referred to as the hill equation), to produce a dose-response curve. Figure 1 shows an example of anannotated concentrationresponse curve. The curve is usually summarized by a single parameter such as the curve's inflection point (e.g. the EC/IC 50 ) [10] or the slope of the curve (called the hill-slope) [11]. Perhaps the most widely used summary in pharmacogenomics cell line experiments is the IC 50 , which represents the concentration where the response achieves 50% of maximal activity [3][4][5][6][7][8][9]. This notion of IC 50 can be generalized further such that IC X is the concentration at which the response is X% between minimal and maximal activity. IC 50 s (and their IC X cousins) have been widely used in areas such as toxicology, pharmacology/ pharmacogenomics, and industrial drug development [10]. Its popularity derives from the fact that it is a concise and interpretable summary of a drug's activity, which conveys an indication of the drug's potency. In association studies, this value is treated as a quantitative trait and standard QTL methods are then applied to link genotype to this derived phenotype [12].
However, traditional analytic and statistical methods are often illequipped to analyze this type of data and subsequent inference based on the IC 50 poses many challenges. First, the appropriateness of the hillslope model from which the IC 50 is derived is often unchecked, which may have large implications on the resulting conclusions [12]. This model is based on ligand-macromolecule binding dynamics [13], which may be an appropriate model in some instances, but inappropriate in others. Assuming this model is a correct description of the underlying biology, accurate calculation of the IC 50 may still proves problematic. Estimating the IC 50 using this model is highly sensitive to observing the full dose-response curve in the tested concentration range. If either the minimum or maximum asymptote of this curve is not observed it can have a very large impacton the estimated IC 50 which will have a correspondingly large impact on the biological conclusions. Due to the non-linearity of this model even the "well-behaved" responses may result in unstable IC 50 estimates. Two analysts may reach different IC 50 values because they used different software packages or because they usedthe same software with different configuration settings. Such differences may cause the software to fail to produce a solution at all or may produce very different IC 50 estimates, with no clear procedure for determining the correct value [14].
For example, assume a study measures total ATP across 8 concentrations using five technical replicates. Some summary statistic of these replicates, such as the mean or median ATP level at each concentration may then be fit to the hill-slope model to obtain an IC 50 value. The amount of uncertainty introduced from sampling the response at each concentration can have a considerable impact on the estimated IC 50 . To highlight this issue and give some sense of how it can affect the IC 50 , we performed a small simulation experiment. For a concentration set of {0.0001,0.001,0.01,0.1,1,10,50,100} μM, we simulated 5 technical replicates for 2 levels of noise. Each technical replicate was generated from the hill-slope model plus a small amount of random noise with a true IC 50 value of 50 μM. The noise was a random value of +/-X% of the true response value, where the values for X we tested were 1% and 5%. This is a 'heteroskedastic' noise model and is consistent with our experience the amount of variation in a response is proportional the size of the response itself. We then took mean of the 5 technical replicates and fit a curve using the nls() function in the R statistical language [15] to this mean response and recorded the estimated IC 50 . Note that we supplied the algorithm with true parameter values as starting values so as to minimize the amount of IC 50 variation coming from the fitting process. We repeated this process 10,000 times for both levels of noise. Figure 2 shows histograms for the 1% and 5% noise level models. Even in the presence of a small amount of noise (1%), the IC 50 estimates span a range of 40-70 μM and inspection of the estimated confidence interval for one such response yields a similar estimated range of variability. The situation for higher noise model is much worse with the IC 50 estimate ranging from 40-212 μM, which was outside of the tested concentration range. This amount of uncertainty results in an IC 50 measure that is not very useful in practice because it is statistically indistinguishable from a potentially wide range of other IC 50 values. In the context of association studies, this could be of great harm. Imagine that there are two populations where one population has an estimated IC 50 that is 2-3x that of the reference population, indicating that this population may be highly tolerant to the drug under study. It would be of great interest to locate any genetic loci that may be involved in this process. However, as the Figure 2 implies, these two populations may in fact have the same tolerance for the drug, but the noise introduced through sampling and estimating the IC 50 has obfuscated our ability to see this, resulting in wasted time and effort looking for the underlying causes of a phantom difference.
This leads to yet another issue with IC 50 based inference, namely that once all of the proper variation is accounted for, IC 50 s may show little meaningful variation in the statistical sense. This may result in two compounds, which by other measures would be considered to have different activity, to fail to be declared distinct, because their IC 50 s are not statistically separable. Statistical models are built to explain variation, but in the absence of meaningful variation, they will be unable to detect any genetic signal that may be present. This will be of increasing importance if pharmacogenomics, and genomics more generally, is to unravel complex traits that do not have large, single gene effects.
A somewhat larger point worthy of consideration is just how relevant an IC 50 , even one estimated with highprecision, is to the underlying scientific question. Statistical methods are only valid to the extent to which they mapback to the research question being asked. Even in the absence of all the issues discussed so far with IC 50 based inference, it may be that a "true" statistically significant difference is not very meaningful from a biological perspective. Why might we assume a priori that this parameter from this model is the best representation of a compound's activity? In this sense, it is not clear that IC 50 s are always a relevant measure or summary of a compound's activity, if potency is not a meaningful proxy for the latent biological difference. If the IC 50 is a poor proxy, then methods that take the full dose-response into consideration should be considered.
With both the promise of in vitro studies and the analytic challenges they present in mind, we hope to draw attention to some of the issues that must be addressed to maximize the utility of these types of assays. There are alternatives to the IC 50 based significance testing approach that have been and continue to be developed. The area under the curve (AUC) statistic computes the area between the dose-response curve and the x-axis and is a global measure of compound's activity [12]. This type of summary is potentially more robust than an interpolated parameter such as an IC 50 . Determination of statistical difference between two compound's AUC relies on a permutation testing based procedure and may be very computationally expensive for large datasets. However, since permutation testing can be readily parallelized, the availability of computing clusters can reduce the time needed for this type of analysis. Multivariate ANOVA Genome-Wide Association Software (MAGWAS) [16] was shown to be a very attractive approach with many desirable properties including high statistical power and computational efficiency. However, MAGWAS is sensitive to changes that occur only at one concentration, which may not be desirable in some instances. Both AUC and MAGWAS incorporate the full dose response curve into the association tested.
Each of these concerns is only exaggerated by the increasingly high throughput nature of these experiments. As robotics has enabled rapid, high-throughput phenotyping for such experiments, investigators are now able to readily assay dozens or even hundreds of chemicals across hundreds of cell lines for dose response [17]. This makes it less likely that all assumptions are met or checked across such large numbers of results. This magnifies the importance of considered statistical approaches that minimize the impact of violations from these assumptions.
It is our hope that this discussion can help further the continued consideration on best practices for in vitro association studies. While we acknowledge that that IC 50 s can be of great utility when used properly and in the correct context, we hope to raise awareness of potential issues with these approaches and highlight alternatives that could further our understanding of gene based drug response.  Distribution of the IC 50 under two levels of noise. Note the wide range of estimated IC 50 s, especially under the slightly higher 5% noise model with many estimates being 2-3x larger the true value of 50 μM. Note also that the second histogram is no longer symmetric, implying that these IC 50 s are not normally distributed.