## Fine Tuning the Model Using Data Exploration Results

So, what does the data exploration tell us? There are no outliers, there is serious heterogeneity, and time and larval stage are collinear. Heterogeneity can be dealt with by transforming length. However, there are several arguments against a transformation. First, it changes the relationship between the variables (Quinn and Keough 2002). Second, we end up with estimated values on the logarithmic or square root scale (or whichever transformation is used), and these are more difficult to interpret. Back-transforming the results is not always easy either! Another convincing argument against transformation was emphasised by Keele (2008): a transformation affects the entire Y-X relationship, and as can be seen in Fig. 8.3, the length-Series relationship is partly linear and partly non-linear. We therefore try to avoid applying a transformation.

Before discussing the collinearity issue, we briefly describe the models that we will use in this chapter. They can broadly be written as

$$\text{Length} = \text{fixed part} + \text{random part} \qquad (8.2)$$

The fixed part models the response variable Length in terms of the explanatory variables Amino_FZ, Propofol, Ethanol, Series and Larval stage. The random part is unexplained information, also called noise. If we apply a linear regression model, then (8.2) looks as follows:

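$$\text{Length}_i = \alpha + b_1 \times \text{Amino\_FZ}_i + b_2 \times \text{Propofol}_i + b_3 \times \text{Ethanol}_i + b_4 \times \text{Series}_i + b_5 \times \text{Larvae}_i + \varepsilon_i \qquad (8.3)$$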

The fixed part contains the terms from $\alpha$ to $b_5 \times \text{Larvae}_i$, and this part can (and will) be made more complicated by adding interactions between the three treatment variables, Series, or larval stage. The random part is $\varepsilon_i$, which we assume to be independently and normally distributed with mean 0 and variance $\sigma^2$. It is common to write this as

$$\varepsilon_i \sim N(0, \sigma^2) \qquad (8.4)$$
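To make the split into a fixed and a random part concrete, the following is a minimal sketch in Python with NumPy. It simulates a response from assumed parameter values and recovers them by ordinary least squares. The data, the binary coding of the treatments, and all numerical values are hypothetical and are not the chapter's data; larval stage is left out of this sketch to keep it short.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical explanatory variables (assumed coding, not the chapter's data).
amino_fz = rng.integers(0, 2, n)               # treatment indicator, assumed binary
propofol = rng.integers(0, 2, n)
ethanol = rng.integers(0, 2, n)
series = rng.integers(1, 11, n).astype(float)  # sampling occasion

# Assumed "true" parameters, used only to simulate a response.
alpha, b1, b2, b3, b4, sigma = 5.0, 0.5, -0.3, 0.2, 0.8, 1.0

fixed_part = alpha + b1 * amino_fz + b2 * propofol + b3 * ethanol + b4 * series
random_part = rng.normal(0.0, sigma, n)        # independent N(0, sigma^2) noise
length = fixed_part + random_part

# Ordinary least squares: a column of ones carries the intercept.
X = np.column_stack([np.ones(n), amino_fz, propofol, ethanol, series])
coef, *_ = np.linalg.lstsq(X, length, rcond=None)

residuals = length - X @ coef
sigma2_hat = residuals @ residuals / (n - X.shape[1])  # unbiased variance estimate

print("estimated coefficients:", np.round(coef, 2))
print("estimated residual variance:", round(float(sigma2_hat), 2))
```

Because the design matrix contains an intercept column, the residuals of this fit sum to zero by construction, so only the variance and the distributional shape of the random part remain to be checked.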

Due to the high collinearity between Series and larval stage, either larval stage or Series should be used in the fixed part of the model, but never both at the same time! High collinearity causes problems with the model selection process and also gives higher p-values (Montgomery and Peck 1992). The other problem we noticed from Fig. 8.3 is that the length-Series relationship is non-linear. This means that if we decide to use Series in the model (instead of larval stage), modelling it as $b_4 \times \text{Series}_i$ is not a good option. The alternatives are (i) to use a quadratic function of Series and omit larval stage, (ii) to drop Series and use larval stage as a categorical variable, or (iii) to drop larval stage and use a smoothing function of Series. The third option gives a generalised additive model (GAM) of the form

$$\text{Length}_i = \alpha + b_1 \times \text{Amino\_FZ}_i + b_2 \times \text{Propofol}_i + b_3 \times \text{Ethanol}_i + f(\text{Series}_i) + \varepsilon_i \qquad (8.5)$$

The notation $f(\text{Series}_i)$ stands for a smoothing function and is further discussed in Section 8.6. Hence, we have many options for the fixed part of the model. None of them will give a perfect model; such a model simply does not exist. It is the task of the researcher to find the least bad model. There are multiple tools available for this task:

1. Choose a model that makes biological sense.

2. Choose a model that does not contain any residual information, and complies with all its underlying assumptions.

3. Use an information criterion like the Akaike Information Criterion (AIC, Akaike 1973) that measures goodness of fit and model complexity.

Obviously, any model needs to make biological sense, hence the first point. Now, for the models that do make biological sense, you also need to ensure that these comply at least with the most important assumptions. These are homogeneity and independence. If these assumptions are violated, then you cannot trust the statistical inference from the models. With this we mean that our biological conclusions are based on a model that gives us estimated parameters. These parameters are based on one sample, yet, we pretend that they hold for the larger population from which the samples were taken. As to the third point, you may have a series of models that make biological sense, and for which the important assumptions are valid. How do you choose among these models? An option is to put a number on each model, and choose the one with the best number. The golden question is then how to put a number on a model. The AIC can be used for this; it is a function that measures the goodness of fit (based on either the residual sum of squares or maximum likelihood criterion) and the number of parameters. There are different definitions of the AIC (e.g. a small sample adjustment) and alternative selection criteria exist, e.g. the Bayesian Information Criterion (BIC), which has a higher penalty for model complexity (Schwarz 1978). The use of the AIC is not without criticism; see for example Burnham and Anderson (2002).
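As an illustration of putting a number on each model, the sketch below computes the AIC and BIC for three candidate fixed parts: Series as a linear term, Series as a quadratic, and larval stage as a categorical variable. It uses the maximum likelihood definition for a Gaussian linear model, $-2 \log L + 2k$, with $k$ counting the regression parameters plus the variance. The simulated data, the assumed layout of 12 sampling occasions grouped into 4 larval stages, and the function names are all hypothetical, and the treatment variables are omitted for brevity.

```python
import numpy as np

def fit_loglik(y, X):
    """Least-squares fit; returns the maximised Gaussian log-likelihood, k and n."""
    n = y.size
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    log_lik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1.0)
    k = X.shape[1] + 1                      # regression parameters plus sigma^2
    return log_lik, k, n

def aic(y, X):
    log_lik, k, _ = fit_loglik(y, X)
    return -2.0 * log_lik + 2.0 * k

def bic(y, X):
    log_lik, k, n = fit_loglik(y, X)
    return -2.0 * log_lik + k * np.log(n)   # heavier penalty once log(n) > 2

# Hypothetical data: 12 sampling occasions, 10 observations each, 4 larval stages.
rng = np.random.default_rng(2)
series = np.repeat(np.arange(1, 13), 10).astype(float)
stage = ((series - 1) // 3).astype(int)     # stage is a step function of Series
length = 2.0 + 1.5 * series - 0.08 * series**2 + rng.normal(0.0, 0.5, series.size)

ones = np.ones_like(series)
designs = {
    "linear Series": np.column_stack([ones, series]),
    "quadratic Series": np.column_stack([ones, series, series**2]),
    "stage as factor": np.column_stack(
        [ones] + [(stage == j).astype(float) for j in (1, 2, 3)]
    ),
}

aic_scores = {name: aic(length, X) for name, X in designs.items()}
bic_scores = {name: bic(length, X) for name, X in designs.items()}
best = min(aic_scores, key=aic_scores.get)  # lower AIC is better

for name in designs:
    print(f"{name:18s} AIC={aic_scores[name]:8.1f}  BIC={bic_scores[name]:8.1f}")
print("lowest AIC:", best)
```

Only differences in AIC between models fitted to the same response are meaningful; the absolute values carry no interpretation. A low AIC also does not replace the first two points: a model can have the lowest AIC and still violate homogeneity or independence.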

If you think that there are fewer options for the random part of the model in Eq. 8.2, then unfortunately you are wrong. We can clearly see heterogeneity in Fig. 8.2. In fact, the spread in length values differs per larval stage, which means that the assumption in Eq. 8.4, that the residuals are normally distributed with mean 0 and variance $\sigma^2$, is incorrect. "Incorrect" can refer either to the normal distribution or to the variance $\sigma^2$. Due to the design of linear regression in Eq. 8.3, the mean of the residuals is always 0. It is the variance $\sigma^2$ we are after; why would each residual have the same variance? It is mathematically easy to adopt something of the form

$$\varepsilon_i \sim N(0, \sigma_j^2)$$

where $j = 1, \ldots, 4$. Such a model has four variances, one for each larval stage. Although this may look complicated, the underlying mathematics is reasonably simple, and modern statistical software allows one to extend linear regression and additive models with multiple variances (Pinheiro and Bates 2000).
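The idea of one variance per larval stage can be sketched with a simple two-step procedure: fit ordinary least squares, estimate a residual variance within each stage, and refit with weights equal to the inverse of those variances. This is a simplification chosen for illustration and is not the likelihood-based estimation described by Pinheiro and Bates (2000). The simulated data, the single covariate `x`, and the standard deviations are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_per_stage = 100
stage = np.repeat(np.arange(4), n_per_stage)   # four larval stages, coded 0..3
true_sd = np.array([0.5, 1.0, 2.0, 4.0])       # assumed, for simulation only

x = rng.uniform(0.0, 1.0, stage.size)          # a hypothetical covariate
y = 2.0 + 1.5 * x + rng.normal(0.0, true_sd[stage])

X = np.column_stack([np.ones_like(x), x])

# Step 1: ordinary least squares, ignoring the heterogeneity.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
res = y - X @ beta_ols

# Step 2: one residual variance per larval stage.
sigma2_j = np.array([res[stage == j].var(ddof=1) for j in range(4)])

# Step 3: weighted least squares; each row is scaled by 1 / sigma_j.
w = 1.0 / np.sqrt(sigma2_j[stage])
beta_wls, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)

print("variance per stage:", np.round(sigma2_j, 2))
print("OLS coefficients:  ", np.round(beta_ols, 2))
print("WLS coefficients:  ", np.round(beta_wls, 2))
```

Observations from the stages with a large spread receive less weight in the second fit, so they no longer dominate the estimated coefficients and standard errors.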

So, heterogeneity is one aspect of the random part in Eq. 8.2. Two other aspects are random effects (leading to linear mixed effects models) and violation of independence; these two are actually related. Linear mixed effects models can be used if we have multiple observations from the same patient, beach, bird nest, school, or indeed, the same batch of meat. The simplest application to our data is of the form

$$\text{Length}_{is} = \alpha + b_1 \times \text{Amino\_FZ}_{is} + b_2 \times \text{Propofol}_{is} + b_3 \times \text{Ethanol}_{is} + a_i + \varepsilon_{is}$$

We switched from $\text{Length}_i$ to $\text{Length}_{is}$. In the first notation, $i$ was the observation number (or row number in the spreadsheet), whereas in the second notation, $\text{Length}_{is}$ is observation number $s$ in batch $i$, where $i$ runs from 1 to 128. The random intercept $a_i$ is assumed to be normally distributed with mean 0 and variance $\sigma_a^2$. The consequence of using the random intercept is that observations from the same batch are allowed to be correlated. We can even allow for more complicated correlation structures between sequential observations if we repeatedly sample the same physical unit (Zuur et al. 2009).
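Why a random intercept makes observations from the same batch correlated follows from a short calculation. The sketch below writes $\sigma_a^2$ for the variance of the random intercept $a_i$ and $\sigma^2$ for the variance of the residuals $\varepsilon_{is}$, and it assumes that the two are independent of each other, which is the usual assumption for this type of model. The subscripts $s$ and $t$ denote two different observations from the same batch $i$.

```latex
\begin{aligned}
\operatorname{var}(\text{Length}_{is}) &= \sigma_a^2 + \sigma^2, \\
\operatorname{cov}(\text{Length}_{is}, \text{Length}_{it}) &= \operatorname{var}(a_i) = \sigma_a^2, \qquad s \neq t, \\
\operatorname{cor}(\text{Length}_{is}, \text{Length}_{it}) &= \frac{\sigma_a^2}{\sigma_a^2 + \sigma^2}.
\end{aligned}
```

Under these assumptions, every pair of observations within a batch has the same correlation, and observations from different batches are uncorrelated. The larger the between-batch variance is relative to the residual variance, the stronger the within-batch correlation.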

It is possible to test whether adding any of the random structures (heterogeneity, random effects, and correlation) improves the model.
