## Linear Regression as a Starting Point

We will start very simply by fitting a linear regression model with interactions. Due to collinearity, we use Larval stage instead of Series in this first model. We also include the three-way interaction between the drugs. The model has the form:

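$$
\begin{aligned}
\text{Length}_i = {} & \alpha + \beta_1 \times \text{Amino\_FZ}_i + \beta_2 \times \text{Propofol}_i + \beta_3 \times \text{Ethanol}_i \\
& + \beta_4 \times \text{Amino\_FZ}_i \times \text{Propofol}_i + \beta_5 \times \text{Amino\_FZ}_i \times \text{Ethanol}_i \\
& + \beta_6 \times \text{Propofol}_i \times \text{Ethanol}_i + \beta_7 \times \text{Amino\_FZ}_i \times \text{Propofol}_i \times \text{Ethanol}_i \\
& + \beta_8 \times \text{LarvalStage}_i + \varepsilon_i
\end{aligned}
\tag{8.7}
$$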

All the terms are fitted as nominal variables. Note that the notation for the first part implies that the main terms, the two-way interactions and the three-way interaction between the three drugs are all included. For the residuals, we assume $\varepsilon_i \sim N(0, \sigma^2)$. This is just a straightforward linear regression, and it can also be called analysis of variance. Its main underlying assumptions are homogeneity of variance, independence, no residual patterns, and normality (although this is the least important assumption). To verify these assumptions, we need to fit the model, obtain the residuals, and inspect these for violation of the assumptions.
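The fit-then-inspect workflow can be sketched in a few lines. The example below is only an illustration on simulated data, not the analysis used in this chapter: the 0/1 coding of the three drugs, the three larval stage levels, the sample size and all coefficient values are assumptions made for the sketch, and ordinary least squares is done with NumPy rather than with a dedicated statistics package.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: three drugs coded 0/1 and a larval stage with levels 1-3.
n = 240
amino = rng.integers(0, 2, n)
propofol = rng.integers(0, 2, n)
ethanol = rng.integers(0, 2, n)
stage = rng.integers(1, 4, n)

# Simulated response; the noise grows with stage to mimic heterogeneity.
length = (1.0 + 0.3 * amino - 0.2 * propofol + 0.1 * ethanol
          + 0.8 * (stage == 2) + 1.9 * (stage == 3)
          + rng.normal(0.0, 0.1 * stage, n))

# Design matrix: intercept, main terms, two-way and three-way interactions,
# plus dummy variables for larval stage (stage 1 is the baseline).
X = np.column_stack([
    np.ones(n),
    amino, propofol, ethanol,
    amino * propofol, amino * ethanol, propofol * ethanol,
    amino * propofol * ethanol,
    stage == 2, stage == 3,
]).astype(float)

# Ordinary least squares fit, then fitted values and residuals.
beta, *_ = np.linalg.lstsq(X, length, rcond=None)
fitted = X @ beta
residuals = length - fitted
```

The arrays `fitted` and `residuals` are what the validation graphs are built from: residuals against fitted values, and residuals against each covariate.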

To verify homogeneity, plot the residuals versus the fitted values, and the residuals versus each explanatory variable, both those used in the model and those that were not. The spread (variation) of the residuals should be the same along the gradients in these graphs; if it is not, homogeneity is violated and action should be taken. Figure 8.5a shows the residuals versus the fitted values. The gap along the horizontal axis is not important; it is the difference in spread we are after. The residuals from observations with Larval stage 3 have more variation (this can be seen by using different colours for each larval stage in Fig. 8.5a).

Fig. 8.5 (a) Residuals versus fitted values. Violation of homogeneity is detected because the spread increases with increasing fitted values. (b) Residuals versus Series. Note the violation of independence, which can be clearly seen from the pattern in the residuals.

The question is then why we have this heterogeneity. The first thing we did was to plot the residuals versus each drug (these are simple boxplots, and are not shown here) to see whether any of the drugs causes higher or lower variation in the data. However, no indication of heterogeneity could be seen in the residuals versus Ethanol, Propofol or Amino_FZ. A graph of the residuals versus Larval stage, in contrast, shows a clear difference in spread.
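A numerical companion to these boxplots is the standard deviation of the residuals at each level of a factor. The sketch below uses simulated residuals, not the residuals of the model above. The helper name `spread_by_level`, the group sizes and the standard deviations are assumptions chosen so that the spread differs by larval stage but not by drug.

```python
import numpy as np

def spread_by_level(residuals, factor):
    """Return {level: sample standard deviation of the residuals at that level}."""
    residuals = np.asarray(residuals, dtype=float)
    factor = np.asarray(factor)
    return {level.item(): residuals[factor == level].std(ddof=1)
            for level in np.unique(factor)}

rng = np.random.default_rng(1)

# Hypothetical layout: 100 observations per larval stage, drug alternating 0/1.
stage = np.repeat([1, 2, 3], 100)
ethanol = np.tile([0, 1], 150)

# Simulated residuals whose spread is larger at stage 3 only.
residuals = rng.normal(0.0, np.where(stage == 3, 0.6, 0.2))

print("by Ethanol:     ", spread_by_level(residuals, ethanol))
print("by Larval stage:", spread_by_level(residuals, stage))
```

Similar values across the levels of a factor are consistent with homogeneity along that factor; a markedly larger value at one level points to the source of the heterogeneity.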

As to dependence, we can distinguish two types. The first is due to a missing covariate, or to modelling a covariate as linear when in reality it has a non-linear effect. You can easily detect this by plotting the residuals of the linear regression model versus each continuous explanatory variable. You should not see any patterns in these graphs. If you do see a pattern, then the model has to be extended. This type of violation of independence is due to a bad model, and can be solved by improving the model: adding more covariates or interactions, or using smoothing models.

A more problematic type of dependence arises if you repeatedly sample the same physical unit (e.g. the same patient or animal). To include this type of dependence structure in a regression model or GAM, a residual correlation structure can be implemented. Alternatively, a random intercept can be used, which introduces the so-called compound symmetric correlation between observations from the same patient or animal. Here, we have multiple observations from the same batch. We did fit the model in Eq. 8.6, which uses a random intercept for batch, but the likelihood ratio test showed that it was not better than the linear regression model (details on how to compare models with and without random effects can be found in Pinheiro and Bates 2000; West et al. 2006; Zuur et al. 2009).
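To see why a random intercept induces a compound symmetric correlation, note that two observations from the same batch share the same random intercept, so their covariance equals the random intercept variance, while each observation also carries its own residual variance. The sketch below builds the implied correlation matrix for one batch. The function name and the numerical values of the two standard deviations are assumptions for illustration, not estimates from the data.

```python
import numpy as np

def random_intercept_correlation(sigma_batch, sigma_resid, n_obs):
    """Correlation matrix implied by a random intercept for n_obs
    observations taken from the same batch."""
    # Shared intercept variance everywhere, residual variance on the diagonal.
    cov = (sigma_batch ** 2 * np.ones((n_obs, n_obs))
           + sigma_resid ** 2 * np.eye(n_obs))
    sd = np.sqrt(np.diag(cov))
    return cov / np.outer(sd, sd)

# Hypothetical values: batch SD 1.0, residual SD 2.0, three observations.
corr = random_intercept_correlation(1.0, 2.0, 3)
print(corr)
```

Every off-diagonal element equals sigma_batch^2 / (sigma_batch^2 + sigma_resid^2): any two observations from the same batch are equally correlated, whatever their order or spacing.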

Figure 8.5b, which plots the residuals versus Series, shows a clear pattern. Notice that even though Series was not used in the model, we should still produce this graph, as it helps to explain where the violation takes place. This dependence structure is likely to be due to a bad model: the effect of time is modelled via the categorical variable Larval stage. We therefore conclude that this model is not good, due to heterogeneity and lack of independence. Our first attempt to improve the model is to adopt a non-linear model that uses Series instead of Larval stage. This implies the use of either polynomial models or a GAM. We will go for the second option, as it more easily allows one to model non-linear patterns.
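The kind of pattern described here can be reproduced with a small simulation. The sketch below is a hedged illustration, not the chapter's data: it assumes 12 series grouped four at a time into three larval stages, a response that increases steadily with Series, and a model that only estimates one mean per stage. All names and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical design: 12 series, 20 observations each, 4 series per stage.
series = np.repeat(np.arange(1, 13), 20)
stage = (series - 1) // 4 + 1

# The response changes gradually with Series (assumed slope and noise level).
length = 0.5 * series + rng.normal(0.0, 0.3, series.size)

# Model time only through the categorical stage: fitted value = stage mean.
fitted = np.zeros_like(length)
for s in np.unique(stage):
    fitted[stage == s] = length[stage == s].mean()
residuals = length - fitted

# Mean residual per series; without a pattern these would all be near zero.
mean_by_series = {int(k): residuals[series == k].mean()
                  for k in np.unique(series)}
for k, m in mean_by_series.items():
    print(f"Series {k:2d}: mean residual {m:+.2f}")
```

Within each stage the mean residual moves from negative to positive as Series increases, because the step function for stage cannot follow the gradual change over time. This is the motivation for modelling Series directly with a smoother.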
