Data Exploration

The first step of any analysis is to explore the data. Zuur et al. (2007) divided data exploration into three main steps, relying mainly on graphical tools:

a. Outliers in the response variable. We use Cleveland dotplots (Cleveland 1993) and boxplots. Cleveland dotplots are simple scatterplots of the observed values versus their index number (Zuur et al. 2009b). A publication dedicated solely to the Cleveland dotplot is Jacoby (2006). Outliers are points whose values differ markedly from the rest. Note that we do not advocate removing outliers; we just need to know whether such observations are present. If they turn out to be influential in the analysis, we can always remove them at that stage, if desired.

b. Spread of the data. If the spread is the same in each stratum (or along a gradient such as time), we can safely assume homogeneity of variance. If the spread differs, we talk about heterogeneity. Conditional boxplots and conditional Cleveland dotplots allow us to see whether the variation is similar or different per stratum of each explanatory variable. Linear regression assumes homogeneity of variance.

c. Collinearity. This is defined as explanatory variables being highly correlated with each other. Biologically, it means that we use explanatory variables that represent the same biological signal. Using collinear explanatory variables in a linear regression model (or any of its extensions) causes trouble with model selection procedures and inflates the standard errors of the estimated parameters, which gives larger p-values (Montgomery and Peck 1992; Draper and Smith 1998; Zuur et al. 2007). Tools to detect collinearity are Pearson correlation coefficients between pairs of continuous explanatory variables, scatterplots, or, if a large number of explanatory variables is used, variance inflation factors (Montgomery and Peck 1992).
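The collinearity tools in step (c) can be sketched in a few lines. The example below is only an illustration in Python (not the language or code of the original analysis), and the variables `x1` and `x2` hold made-up values. It computes a Pearson correlation coefficient and, for the special case of a model with exactly two explanatory variables, the variance inflation factor, which then reduces to 1 / (1 - r^2).

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF shared by both variables when the model has exactly two of them.

    In general VIF_j = 1 / (1 - R_j^2), with R_j^2 taken from regressing
    variable j on all other explanatory variables. With only one other
    variable, R_j^2 equals the squared Pearson correlation.
    """
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical explanatory variables (illustration only): x2 closely tracks x1.
x1 = [10.1, 11.3, 12.0, 13.2, 14.1, 15.0, 16.2, 17.1]
x2 = [5.0, 5.9, 6.1, 7.2, 7.4, 8.1, 8.8, 9.6]

r = pearson(x1, x2)
vif = vif_two_predictors(x1, x2)
print(f"Pearson r = {r:.3f}, VIF = {vif:.1f}")
```

With more than two explanatory variables, each VIF requires a regression of one variable on all the others, which is easier to do with a statistics package than by hand.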

A boxplot and Cleveland dotplot of length are given in Fig. 8.1. The boxplot may lead you to believe that there are outliers, but the dotplot shows that these points form a whole group of observations that simply have smaller length values. Hence, there are no observations with extremely large or small values. The advantage of a Cleveland dotplot over a boxplot is that it gives a more detailed overview of the spread of the data.
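The contrast between the two graphs can be reproduced numerically. The sketch below uses made-up length values (not the data behind Fig. 8.1) and assumes the common boxplot convention of flagging points more than 1.5 times the interquartile range beyond the quartiles. It then checks where the flagged points sit in the file order, which is the information a Cleveland dotplot adds.

```python
import statistics


def boxplot_flags(values):
    """Return the values lying outside the 1.5 * IQR boxplot whiskers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical lengths in file order: a first group of small values,
# followed by a larger group of bigger values.
lengths = [5.1, 5.4, 5.0, 5.6, 5.3,
           17.2, 18.1, 17.8, 18.5, 19.0, 17.5, 18.3, 18.8, 17.9, 18.6,
           18.2, 17.7, 18.9, 18.4, 17.6, 18.0, 18.7, 17.4, 19.1, 18.3]

flagged = boxplot_flags(lengths)

# Index positions of the flagged points, i.e. their place on the
# vertical axis of a Cleveland dotplot.
positions = [i for i, v in enumerate(lengths) if v in flagged]

print("Flagged by the boxplot rule:", flagged)
print("Positions in the file order:", positions)
```

In this constructed example the boxplot rule flags five points, but they occupy five consecutive positions in the file, so they look like a distinct group of smaller observations rather than isolated extreme values.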

Fig. 8.1 (a) Boxplot of length. (b) Cleveland dotplot of length. The x-axis shows the value of an observation and the y-axis the order of the data as imported from the spreadsheet (which is in this case ordered by time). Note that there are no observations with extremely small or large values

Fig. 8.2 (a) Boxplot of length conditional on larval stage. (b) Cleveland dotplot of length conditional on larval stage. The x-axis shows the value of an observation and the y-axis the order of the data grouped by larval stage

You should make the same graphs for all continuous explanatory variables in your data set. We also made boxplots and Cleveland dotplots for Series, but they did not show any problems.

Figure 8.2a shows a boxplot of length conditional on larval stage, and Fig. 8.2b shows the Cleveland dotplot of the length data again, but this time with the observations grouped by larval stage. Both graphs show the same problem: the spread differs between larval stages, indicating potential heterogeneity problems (larval stage 3 shows more spread than the others). In Fig. 8.3, we again show boxplots of length, this time one for each value of Series. We can see a difference in spread, and also a non-linear pattern.
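A numerical companion to the conditional boxplots is the sample variance of length within each stratum. The sketch below uses made-up values and hypothetical stage labels (`L1`, `L2`, `L3`), not the data shown in Fig. 8.2. It only illustrates how a difference in spread between groups might be quantified, and it does not replace the graphs.

```python
import statistics
from collections import defaultdict


def variance_by_group(values, groups):
    """Sample variance of the values within each group label."""
    by_group = defaultdict(list)
    for value, group in zip(values, groups):
        by_group[group].append(value)
    return {g: statistics.variance(vs) for g, vs in by_group.items()}


# Hypothetical lengths and larval stage labels (illustration only).
length = [5.0, 5.2, 5.1, 4.9,
          9.8, 10.1, 10.0, 10.3,
          14.0, 18.5, 16.2, 12.1]
stage = ["L1"] * 4 + ["L2"] * 4 + ["L3"] * 4

variances = variance_by_group(length, stage)

# Ratio of the largest to the smallest group variance; a value far
# above 1 points at heterogeneity of variance.
ratio = max(variances.values()) / min(variances.values())

for group, var in variances.items():
    print(f"{group}: variance = {var:.3f}")
print(f"largest / smallest variance = {ratio:.1f}")
```

In this constructed example the third group has by far the largest variance, which is the kind of pattern a conditional boxplot makes visible at a glance.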

The last point we discuss is collinearity. Due to the experimental design, there are no collinearity issues with the drug variables Amino_FZ, Propofol, and Ethanol. However, Fig. 8.4 shows that there is serious collinearity between Series and larval stage; this is to be expected, as larval stage increases with time.

Fig. 8.3 Boxplot of length conditional on the time variable series. Note that there is a non-linear pattern over time

Fig. 8.4 Boxplot of Series conditional on larval stage. Note that the L3 and L3P stages correspond to higher values for time
