Received Date: June 01, 2015; Accepted Date: July 13, 2015; Published Date: July 20, 2015
Citation: Nordhausen K, Oja H, Pärssinen O (2015) Mixed-Effects Regression Splines to Model Myopia Data. J Biom Biostat 6: 239. doi:10.4172/2155-6180.1000239
Copyright: © 2015 Nordhausen K, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Visit for more related articles at Journal of Biometrics & Biostatistics
Myopia is a disorder of ocular refraction with varying rates of progression. Although the disorder has a dynamic nature, prospective longitudinal studies with long term follow-ups have been remarkably few. In this paper, we show how mixed-effects regression splines with different choices of basis functions can be used to model myopia progression data in a flexible way. We show how the estimated model may be used to find prediction curves with corresponding confidence and tolerance intervals for a new myopic subject. We discuss alternative choices of the basis functions such as the truncated polynomial spline functions (2 types) and B-spline functions. Principal component functions may be used for an analysis of the variation of the curves in the population. The theory is collected together and presented in a coherent way as well as illustrated with a careful analysis of myopia progression data from a Finnish myopia study.
Basis function; B-spline; Linear mixed model; Prediction curve; Principal curve; Progression; Truncated polynomial spline
Myopia, also known as nearsightedness, is defined as that state of ocular refraction in which parallel rays of light entering the eye at rest are brought to focus in front of the retina. In this situation, distant objects cannot be perceived distinctly. Myopia can occur at any age but in most cases it appears at school age and usually progresses several years, in few cases progression continues throughout life. In practice myopia is treated either by prescribing corrective lenses, such as glasses or contact lenses, or by performing refractive surgery. For people with myopia and for those who treat myopia and prescribe glasses, reliable predictions of myopia progression as early as possible would be extremely valuable.
The prevalence of myopia has markedly increased during recent decades in many countries. This increase is hard to explain only by hereditary factors. Epidemiological studies have confirmed that a longer education and higher occupational status, for example, are connected with an increased prevalence of myopia . Many studies have shown that the younger the age of onset the faster is myopia progression . Based on a small sample, Thorn et al.  estimated that myopia beginning in childhood “stabilizes” in 80% of cases by about 19-years of age. There has been an increased interest in long term follow-up studies to model the progression of myopia. The estimated models with possible covariates then often provide individual prediction curves as well. One may also wish to test, for example, whether the progression of myopia levels off at a certain age.
The dynamic progression of myopia has not been widely studied in the epidemiological or biostatistical literature. The studies mainly consider the individual progression curves from retrospective material or by means of progression in different materials . Möttönen at al.  suggested to model myopia using a polynomial (quadratic or cubic) random coefficient model. The model is not satisfactory, however, as even cubic myopia progression curves are not flexible enough to explain strong changes between 10 and 20 years. In this paper we will show that mixed-effects regression analysis using quadratic or cubic splines is a flexible tool to model the progression of myopia that easily provides individual prediction curves. Furthermore, we show that different important questions concerning the progression of myopia may be answered simply by changing the spline basis functions. For a general theory for mixed-effects regression splines, see e.g. [6,7]. One of the aims of this paper is to collect and present the theory in an easy and coherent way for the users. Mixed-effects regression spline models have recently been applied to CD4 counts  and to the growth of cattle . However, these models have never before been applied to myopia progression data.
The paper is structured as follows. In Section 2 we discuss the use of the random effect regression model in the case that subject i, i=1, ..., n, has pi measurements at time points ti1 , ..., tipi . Note that the number of measurements as well as the time points vary individually as is always the case for myopia follow-up data. Each subject is then supposed to have his/her own myopia progression curve as a function of time. This curve is observed only through the measurements at the pi time points (with measurement errors coming from N (0, τ2)). The individual curves are linear combinations of k basis curves, and the k coefficients are assumed to come from a k-variate normal distribution with an unknown mean vector depending on parameter θ and unknown covariance matrix Σ. Parameters θ and Σ (providing the mean curve and variation of the curves) are the population quantities we are interested in. We discuss the estimation of parameters θ, Σ, and τ2 and show how the estimated parameters may be used to find prediction curves with corresponding confidence and tolerance intervals. In Section 3 we discuss alternative choices of the basis functions; a special interest is in the set of principal component functions that can be used for a careful analysis of the variation of the curves in the population. In Section 4 we discuss the use of spline functions, truncated polynomial spline functions (2 types) and B-spline functions, as basis functions. In Section 5 the theory is then illustrated with a real data set from a Finnish longitudinal myopia study. The paper ends with some final comments in Section 6.
Random coefficient regression model using k basis functions
Let be the selected k basis functions for myopia regression curve. We write
for the corresponding vector valued function. We assume that individual i has pi observations
at time points ti1 , ..., tipi, i=1, ..., n. We then write
for the pi × k time design matrix for individual i, i=1, ..., n. The r-variate covariate vector xi=(xi1, ..., xir )′ is used to explain the myopia progression curve of the ith individual, i=1, ..., n. Finally, write for the total number of measurements.
We then make the following model assumption.
(I) The myopia progression curve for individual i is
Θ is a k × r-variate matrix of regression coefficients, and ξi∼Nk (0, Σ), i=1, ..., n.
(II) For individual i, the observed refraction values are
(III) The random variables and are all mutually independent.
The parameter Θ stands for the connection between xi and the myopia progression curve for the ith individual, Σ shows the variation of the curves (in the population of the progression curves), and τ2 tells about the random variation of the measurements around the individuals curves. Note that, if we collect the assumptions together, the model can be also written as a mixed model
where ⊗ denotes the Kronecker product. Then is the matrix of fixed effects and Fi is the matrix of random effects for individual i, i=1,..., n. If we further write and then
yi ∼ NPi (Xiθ, Vi) and, if Σ and τ were known, then the maximum likelihood (ML) estimate of θ, namely
has a multivariate normal distribution
The maximum likelihood (ML) estimates of the parameters θ, Σ, and τ2 can be found by minimizing the -2 times the log-likelihood function
Prediction curves based on the model
Throughout this section we assume that we have a model with estimated parameters, and their estimated variances and covariances. Consider first the prediction curve for a new individual from the same population with a covariate vector x, throughout this section we are interested in the curves on some time interval [t0, t1] only. The mean progression curve of the new individual is then
Consider next the estimate of the mean progression curve with its pointwise 100(1-α)% confidence interval as well as its 100(1-α)% confidence band. One also often wishes to estimate the 100(1-α)% point wise tolerance interval and the 100(1-α)% tolerance band for the progression curve. These are found as follows. Notice that the confidence intervals and bands in 2 and 3 are based on the joint limiting normal distribution of the estimates and therefore only approximate.
The mean progression curve is estimated by
An approximate 100(1-α)% pointwise confidence interval for the mean progression curve at time t is given by
1. Notice that pointwise confidence intervals do not give a confidence band for the (whole) mean progressive curve over all possible values of t. The confidence band is considered next.
An approaximate 100(1-α)% confidence band for the mean progression curve is given by
where c1−α is given by
where . Note that c1−α depends on limits t0 and t1 but can be easily found by simulation. Note also that .
An estimate for 100(1-α)% pointwise tolerance interval for the individual value of the refraction curve at time t, that is, (in the subpopulation of individuals having the same covariate vector value x) is given by
Finally an estimate for 100(1-α)% tolerance band for (in the subpopulation of individuals having the same covariate vector value x) is given by
where c1−α is now determined by
where . Now
Assume next that the parameters in the model are known, and we predict the mean progression curve of individual i with covariate vector xi, that is, the curve
The prediction is based on observation vector
Then we have the following estimates and predictions.
1. The predicted progression curve for is where
2. The pointwise 100 (1-α)% tolerance interval for is
3. The 100(1-α)% tolerance band for is given by
where c1−α is determined by
where . Now
Alternative choices of the basis curves
Let us assume that the model fitting is performed using k basis functions f1, ..., fk but, for the interpretation purposes, we wish to present the results using another set of basis functions g1,… , gk. Assume also that these two sets of basis functions span the same set of progression functions, that is,
Then, if we write
and there exists a full-rank k × k matrix A such that f (t)=Ag(t), for all t.
and the model is transformed
The parameters are then transformed in the following way
Recall that, in the ML estimation, the parameter estimates are transformed in the same way.
Principal component analysis for the variation of the curves
One popular set of basis functions are principal component functions which often can be interpreted and are also ordered according to the amount of variation they explain.
Consider the random functions
where, as before, is a vector valued function and . Then we define the principal component functions and corresponding scores z1 , ..., zk as follows.
1. Write Σf for the positive definite k × k matrix with the elements
2. Find the eigenvector eigenvalue decomposition
where Λ is a diagonal matrix containing the eigenvalues of in decreasing order and U is an orthogonal matrix having has ith column the eigenvector corresponding to the ith eigenvalue in Λ.
The magnitude of an eigenvalue related to a principal component function reflects its relevance and the associated score of it how strong expressed the feature represented by this function is, similar as in traditional PCA. For further ideas about the decomposition of the random effects covariance matrix, see for example .
Mixed-effects regression splines
Truncated polynomial spline functions
We first consider the splines based on truncated power functions. For the definition, let be the m+2 time points (knots) and assume that all the measurements are on the interval [t0,tm+1]. Write
The truncated linear spline function is based on m+2 basis functions
f0,1(t) ≡ 1, f1,1(t)=t, f2,1(t)=(t − t1)+ ,... , fm+1,1(t)=(t − tm)+
The truncated quadratic spline functions has a basis of m+3 functions
Finally the splines based on truncated polynomials of degree p are given by
and k=1, ..., m.
The function space spanned by these functions
is the set of piecewise p-polynomials on the interval [t0, tm+1] with continuous p-1 derivatives at the knot points. Note that the same set of functions is obtained if one uses the basis functions
and , k=1, ..., m.
then , and the sum of the first p+1 terms in and give the corresponding p-polynomials on the first and last intervals, correspondingly. Then, for example, the null hypothesis
for the set of functions says that the mean curve is constant after the last knot point tk, and this hypothesis can be tested using . More generally, any linear hypothesis
for a chosen r × p matrix A having rank(A)=p and for a chosen r-vector b can be tested using that is, under the null hypothesis, approximate r-variate normal with zero mean vector and covariance matrix
Under the null hypothesis, the limiting distribution of the squared form test statistic
is then a chi squared distribution with r degrees of freedom.
as follows. First, basis functions of degree p=0 are given by piecewise constant
k=0, ±1, ±2, ... For p=1, 2, ..., then
k=0, ±1, ±2, ... The choice of the additional knot points outside the interval [t0, tm+1] has naturally an effect on some of the functions Bk,p(t) on the interval [t0 , tm+1] but the function space
spanned by does not depend on the choice of the outside knots. Note also that, at interval [tk,tk+1], is a linear combination of p +1 functions (polynomials) only. It is then remarkable that
Remark 1: Note that, naturally, all three parametric function sets , , and provide the same fit (see for example Section 3.7.1 of ). Estimation of parameters of has best numerical properties (see for example Chapter 5 of ) but the interpretation of the regression parameters in , and is often easier. The first p+1 parameters in , for example, give the p-polynomial on the last time interval [tm, tm+1 ]. The null hypothesis H0: β1=...=βp=0 then says that the mean value is constant (does not depend on time) on the last interval.
The data set
We illustrate the theory with a data set from a Finnish longitudinal myopia study. 240 children in central Finland were recruited for this study in 1983-1984. For details about the design and inclusion criteria, see . In this analysis, the measurements are refraction values on the right eyes of 118 girls and 118 boys; 4 children were excluded for various reasons. In the beginning of the study the ages of the children were between 8.8 and 12.8 years old (mean age 10.9 years). In the first three years the children were measured yearly (they actually were also randomized into three different treatment arms which is ignored here; there were no statistical differences between the arms . As a part of the follow-up, the subjects were supposed to see the same ophthalmologist for measurements 15 and 25 years after the start of the study. Available measurements based on glass prescriptions and files from different ophthalmologists and opticians were used to obtain additional observations between these time points. During the followup, 2 subjects died and 18 subjects dropped out because of refractive surgery. The measurements were not equally spaced and often sparse with the mean number of measurements per subject 8.1 (range 2-15). The total number of recordings is 1908 with the oldest age 39.0 years. Table 1 lists the numbers of measurements for different time intervals.
|Age Interval (Years)|
|[8.8, 10)||[10, 15)||[15, 20)||[20, 25)||[25, 30)||[30, 35)||[35, 40)|
Table 1: Number of measurements for different age intervals.
In the original study, measurements of several covariates were collected but, for our demonstration purpose, we consider only sex. The observed individual curves are then shown separately for boys and girls in Figure 1. These are the raw data for our modeling, and the aim is to build prediction curves (with confidence and tolerance bands) for a new subject based on these data. A minor question was to consider the age when the progression of myopia levels off. The statistical tools described in the previous sections are used for the analysis. The analysis was done using R 2.15.0 , especially the packages splines and lme4 .
Estimates of the parameters for B-spline basis functions
B-spline quadratic basis functions were used for the estimation with the knots at the ages 12, 16, 22. These choices were based on the consultations with specialists. If no prior knowledge were available, the number and locations of the knots could be determined using crossvalidation, model se- lection criteria (AIC or BIC), likelihood ratio tests or by using a roughness penalization approach, etc (see for example ). In our model the number of basis functions is then k=6. With one dichotomous explaining variable (sex), we have 12 parameters for θ (mean curves) and 21 parameters for Σ, and the residual variance τ2.
Using the R-function lmer , we obtain the fixed effects estimates of θ that are presented in Table 2. The covariance matrix of the random effects, the estimate of Σ, is
Finally, the error variance estimate is
Table 2: Estimates of fixed effects and their standard errors.
The estimated mean curves for boys and girls based on are shown in Figure 2. The girls seem to have a faster development of myopia but the differences between boys and girls seem to vanish with the time. After the last knot at age 22, the boys’ mean curve seems still to be almost linearly descending while the girls’ curve seems to level off. From a medical point of view, it is interesting to predict the progression of myopia just using the measurement at the first visit. As an example, we provide in Figure 3 the progression chart for the first visit at the age of 9 years.
As explained before, the prediction curves may be based on several observations as well. To illustrate this, we randomly selected one boy and one girl with at least eight measurements. For both subjects, we found the prediction curves with pointwise 95% tolerance intervals and 95% tolerance bands based on the first three, first five, and first seven observations. As explained in Section 2.2, tolerance bands may be here based on (“conservative”) or on the constants (“simulated”) that can be estimated through simulations. Our simulated values are based upon 2000 replications. The predictions, tolerance intervals, and tolerance bands are given in Figure 4.
For the girl, the prediction curves based on three or seven observations seem to work well. If five points were used, the (strange) later upward trend at later age is missed. The results for the boy look similar. The conservative tolerance band that is easy to compute is too conservative to be helpful here.
Alternative sets of basis functions
B-spline basis functions are mathematically and numerically attractive but the interpretation of the parameters in Table 2 is difficult. Consider first the presentation of the model using the principal component functions as presented in Section 3.2. First, the matrix Σf can be computed using numerical integration, for example. The six eigenvalues of then are 2.8463, 0.1375, 0.0477, 0.0315, 0.0120, 0.0017, and the cumulative proportions of the variation explained by the eigenvectors are 0.9251, 0.9698, 0.9853, 0.9955, 0.9994, 1.0000, respectively. The B-spline basis functions and the principal component functions are shown in Figure 5 (two first columns). It is remarkable that the first three PCA basis functions explain together over 98% of the variation. The first principal function is simply for the general progression level (with a typical progression). The second component seems to be an indicator for strong early progression as a subject with a large score will exhibit much faster progression than one with a small score. The third function roughly describes the contrast between the first and third interval reflected by the two opposing peaks of the curve.
Although the principal component functions may be used to find the main type of variations between the individual curves, they are not practical if one wishes to have parameters to tell what is happening in the beginning or in the end of the follow-up period. These studies can be performed with the truncated power splines. One may be interested, for example, whether the progression of myopia levels off after the last knot value. Figure 2 shows that on the average the stabilization does happen neither for boys nor for girls. Separate analyses were made for boys and girls, however, to confirm that. For both cases, we fitted the model using B-splines and then changed to the truncated power spline functions shown in Figure 5 (the rightmost column) as explained in Section 3.1. (The matrix A may be simply found by calculating the values of all 12 functions at six suitable time points).
To demonstrate the conversion Tables 3 and 4 gives the estimates for the fixed effects using B-spline functions and truncated power spline functions for boys and girls, respectively. The null hypothesis that the myopia progression levels off at 22, then says that the second and third coefficients for truncated power spline functions are zero. Both p-values are less than 0.0001 and the null hypotheses should therefore be rejected, see Section 4.1.
Table 3: Estimates of fixed effects and their standard errors for different splines for the boys only data.
Table 4: Estimates of fixed effects and their standard errors for different splines for the girls only data.
In this paper we showed that the random regression analysis using polynomial spline functions is a flexible way to model myopia progression. The estimated model then also provides easy prediction charts for myopia progression depending on previous measurements. In the estimation of the parameters in the model, the B-spline functions are preferable but the results can be easily transformed to any set of basis functions spanning the same function space. The model also easily allows the use of covariates.
In our example, the theory was illustrated by testing the hypothesis that on the average the myopia progression levels off at the age of 22. The hypothesis was rejected but we believe that this may have happened partly due to the bias coming from the data collection procedure. The numbers of measurements pi and data points are informative in the sense that fast myopia progression is naturally connected to a high number of measurements. The subjects with early stabilized level have sparse late measurements, and the estimates of the parameters describing the progression in the end of the follow-up are then mainly based on the measurements from subjects with late progression. This causes bias in the estimation of the model. This is similar to the analysis of missing data problems (with informative missingness) or to the analysis of clustered data with informative cluster size . To correct the bias is an interesting topic for a future work. Furthermore, robust estimation procedures should be developed here as the data contain clear outliers, that is, the individual curves with atypical or even impossible behaviour. The outliers naturally have a strong effect on the fit of the data.
We thank the referee for careful reading of the paper and helpful comments. The work of Klaus Nordhausen and Hannu Oja was supported by the Academy of Finland (grants 131929, 218327 and 268703). The work of Olavi Pärssinen was supported by grants from the Finnish Eye Foundation and The Society of Finnish Ophthalmologists.