Mar 06

Organism of the week #20 – Don’t point that thing at me

It’s amazing how informative an anus can be.

Take this sea urchin. The orange pucker in the middle of the spines is its “around-the-bum”, although zoologists would insist on writing that in Greek as “periproct”. The bright orange ring-piece is characteristic of this species, and marks it out as Diadema setosum, rather than any of the less rectally blessed species of Diadema.

Diadema setosum [CC-BY-SA-3.0 Steve Cook]

Diadema setosum

The butt-hole of an urchin is actually the second it will own, because urchins go through a metamorphosis that shames even that of a butterfly. The larva of an urchin looks not even a little bit like the adult…

Pluteus larva (Public domain: out-of-copyright edition of the Encyclopædia Britannica)

Pluteus larva of a sea urchin. The adult will develop as a ball inside the larva’s body. The spikes on the larva are nothing to do with the spikes on the adult

…and the adult urchin develops like a well-organised tumour within the body of the larva. For this reason, the adult’s anus is an entirely different hole from the larva’s anus.

The way the larva’s original butt-hole develops from a fertilised egg turns out to be quite revealing. Surprisingly, it marks out sea urchins and their relatives – like sea cucumbers and starfish – as much closer relatives of you and other backboned animals than they are of insects or worms or jellyfish, or indeed of pretty much any other animal.

As a fertilised human or sea urchin egg divides, it forms a hollow ball of cells, somewhat like a football. Then, some of the cells on the surface fold in on themselves, forming a shape rather like what you get if you punch your fist into a half-deflated football. The dent drills its way through, and eventually opens out through the other side of the ball. What you end up with is a double-walled tube, with a hole at either end.

Gastrulation [CC-BY-SA-3.0 Steve Cook]

In humans and all other animals with backbones, and in the larva of sea urchins and starfish and sea cucumbers, the first hole – the one formed by the dent – becomes the anus; and the second hole – where the dent punches through to the other side – becomes the mouth.

In most other animals, the first hole becomes the mouth, and the second the anus (pedant alert: I’m glossing over some details here).

Humans and sea urchins develop arse-first. Or mouth-second, as zoologists would prudishly have it, preferably euphemised further by writing it in Greek. Humans and fish, and sea urchins and starfish are all “deuterostomes”.

The development of the chocolate starfish of a starfish and of the asshole of an ass hint at a deep evolutionary connection between two very different groups of animals. Enlightenment can be found in the most unexpected places.

Mar 03

Nonlinear regression

Nonlinear regression is used to see whether one continuous variable is correlated with another continuous variable, but in a nonlinear way, i.e. when a set of x vs. y data you plan to collect do not form a straight line, but do fall on a curve that can be modelled in some sensible way by a known equation, e.g.

v = \frac{ v_{max} \cdot [S] }{ K_M + [S] }

Some important general considerations for fitting models of this sort include:

  • The model must make physical sense. R (and Excel) can happily stick polynomial curves (e.g. a cubic like y = ax³ + bx² + cx + d) through a data set, but fitting random equations through data is a pointless exercise, as the values of a, b, c and d are meaningless and do not relate to some useful quantity that characterises the behaviour of the data. If you want to fit a curve to a data set, it has to be a curve (and therefore an equation) you’ve chosen because it estimates something meaningful (e.g. the Michaelis constant, KM).
  • There must be enough data points. In general, you cannot fit a useful model of n parameters to a data set smaller than n+1 in size. In linear regression, you cannot fit a slope and an intercept (2 parameters) to just one datum, as there are an infinite number of lines that pass through a single point and no way to choose between them. You shouldn’t fit a 2-parameter model to just 2 data points either, as this doesn’t buy you anything: your model is at least as complex as a simple list of the two data values. The Michaelis-Menten model has two parameters, so you need at least three concentrations of S, and preferably twice this. As ever, collect the data needed for the analysis you plan to do; don’t just launch into collecting the data and then wonder how you will analyse it, because often the answer will be “with great difficulty” or “not at all”.
  • The data set should aim to cover the interesting span of the response, even if you don’t really know what that span is. A linear series of concentrations of S is likely to miss the interesting bit of an enzyme kinetic curve (around KM) unless you have done some preliminary experiments. Those preliminary experiments will probably need to use a logarithmic series of concentrations, as this is much more likely to span the interesting bit. This is particularly important in dose/response experiments: use a concentration series like 2, 4, 8, 16…mM, or 10, 100, 1000, 10000…µM, rather than 2, 4, 6, 8, 10…mM or 10, 20, 30, 40…µM; but bear in mind the saturated solubility (and your own safety!) when choosing whether to use a base-2, a base-10, or base-whatever series. A quick sketch of generating such series in R follows this list.
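
For example, linear and logarithmic series are equally easy to generate in R (the specific numbers here are arbitrary):

linear.series <- seq( from=2, to=10, by=2 )   # 2, 4, 6, 8, 10: clusters within one order of magnitude
base2.series  <- 2^(1:5)                      # 2, 4, 8, 16, 32: doubles at each step
base10.series <- 10^(1:4)                     # 10, 100, 1000, 10000: spans four orders of magnitude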

The data in enzyme_kinetics.csv gives the velocity, v, of the enzyme acid phosphatase (µmol min−1) at different concentrations of a substrate called nitrophenolphosphate, [S] (mM). The data can be modelled using the Michaelis-Menten equation given at the top of this post, and nonlinear regression can be used to estimate KM and vmax without having to resort to the Lineweaver-Burk linearisation.

In R, nonlinear regression is implemented by the function nls(). It requires three arguments. These are:

  • The equation you’re trying to fit
  • The data-frame to which it’s trying to fit the model
  • A vector of starting estimates for the parameters it’s trying to estimate

Fitting a linear model (like linear regression or ANOVA) is an analytical method. It will always yield a globally optimal solution, i.e. a ‘perfect’ line of best fit, because under the hood, all that linear regression is doing is finding the minimum on a curve of summed squared residuals vs. slope, which is a matter of elementary calculus. However, fitting a nonlinear model is a numerical method. Under the hood, R uses an iterative algorithm rather than a simple equation, and as a result, it is not guaranteed to find the optimal curve of best fit. It may instead get “jammed” on a local optimum. The better the starting estimates you can give to nls(), the less likely it is to get jammed, or – indeed – to charge off to infinity and not fit anything at all.

For the equation at the start of this post, the starting estimates are easy to estimate from the plot:

enzyme.kinetics<-read.csv( "H:/R/enzyme_kinetics.csv" )
plot(
    v ~ S,
    data = enzyme.kinetics,
    xlab = "[S] / mM",
    ylab = expression(v/"µmol " * min^-1),
    main = "Acid phosphatase saturation kinetics"
)

Acid phosphatase saturation kinetics scatterplot [CC-BY-SA-3.0 Steve Cook]

The horizontal asymptote is about v=9 or so, and therefore KM (value of [NPP] giving vmax/2) is about 2.
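
If you’d rather pull rough starting values out of the data than read them off the plot, something like this works (a rough-and-ready sketch, assuming the enzyme.kinetics data frame loaded above):

vmax.guess <- max( enzyme.kinetics$v )   # near the horizontal asymptote, if [S] goes high enough
KM.guess   <- enzyme.kinetics$S[ which.min( abs( enzyme.kinetics$v - vmax.guess/2 ) ) ]   # [S] nearest to half-maximal v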

The syntax for nls() on this data set is:

enzyme.model<-nls(
    v ~ vmax * S /( KM + S ),
    data  = enzyme.kinetics,
    start = c( vmax=9, KM=2 )
)

The parameters in the equation you are fitting using the usual ~ tilde syntax can be called whatever you like as long as they meet the usual variable naming conventions for R. The model can be summarised in the usual way:

summary( enzyme.model )
Formula: v ~ vmax * S/(KM + S)
Parameters:
     Estimate Std. Error t value Pr(>|t|)    
vmax 11.85339    0.05618  211.00 7.65e-13 ***
KM    3.34476    0.03860   86.66 1.59e-10 ***
---
Signif. codes: 
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.02281 on 6 degrees of freedom
Number of iterations to convergence: 4 
Achieved convergence tolerance: 6.044e-07

The curve-fitting has worked (it has converged after 4 iterations, rather than going into an infinite loop), the estimates of KM and vmax are not far off what we made them by eye, and both are significantly different from zero.

If instead you get something like this:

Error in nls( y ~ equation.blah.blah.blah, ): singular gradient

this means the curve-fitting algorithm has choked and charged off to infinity. You may be able to rescue it by trying nls(…) again, but with different starting estimates. However, there are some cases where the equation simply will not fit (e.g. if the data are nothing like the model you’re trying to fit!) and some pathological cases where the algorithm can’t settle on an estimate. Larger data sets are less likely to have these problems, but sometimes there’s not much you can do aside from trying to fit a simpler equation, or estimating some rough-and-ready parameters by eye.
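
If you are fitting models inside a script, a failed nls() will stop everything dead at this error. One defensive pattern – a sketch only, using deliberately poor starting values for illustration – is to wrap the call in try():

possible.fit <- try(
    nls(
        v ~ vmax * S / ( KM + S ),
        data  = enzyme.kinetics,
        start = c( vmax=100, KM=50 )   # deliberately poor guesses
    ),
    silent = TRUE
)
if ( inherits( possible.fit, "try-error" ) ) {
    cat( "nls() failed; try different starting estimates\n" )
}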

If you do get a model, you can use it to predict the y values for the x values you have, and use that to add a curve of best fit to your data:

# Create 100 evenly spaced S values in a one-column
# data frame headed 'S'
predicted.values <- data.frame( 
    S = seq( 
       from       = min( enzyme.kinetics$S ),
       to         = max( enzyme.kinetics$S ),
       length.out = 100
    )
)

# Use the fitted model to predict the corresponding v values, and assign them
# to a second column labelled 'v' in the data frame predicted.values
predicted.values$v <- predict( enzyme.model, newdata=predicted.values )

# Add these 100 x,y data points to the graph, joined up by 99 line-segments
# This will look like a curve at the resolution of the graph
lines( v ~ S, data=predicted.values )

Acid phosphatase nonlinear regression model [CC-BY-SA-3.0 Steve Cook]

If you need to extract a value from the model, you need coef():

vmax.estimate <- coef( enzyme.model )[1]
KM.estimate   <- coef( enzyme.model )[2]

This returns the first and second coefficients out of the model, in the order listed in the summary().
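
Since coef() returns a named vector (the names come from your start vector), you can also extract the estimates by name, which is more robust than relying on their order:

vmax.estimate <- coef( enzyme.model )["vmax"]
KM.estimate   <- coef( enzyme.model )["KM"]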

Exercises

Try fitting nonlinear regressions to the data sets below.

  1. Estimate the doubling time from the bacterial data set in bacterial_growth.csv. N is the optical density of bacterial cells at time t; N0 is the OD at time 0 (you’ll need to estimate N0 too: it’s ‘obviously’ 0.032, but this might have a larger experimental error than the rest of the data set). t is the time (min); and td is the doubling time (min).
N = N_0 e^{ \frac{ \ln{2} }{ t_d } \cdot t }
  2. The response of a bacterium’s growth rate constant μ to an increasing concentration of a toxin, [X], is often well-modelled by a sigmoidal equation of the form below, with a response that depends on the log of the toxin concentration. The file dose_response.csv contains data on the response of the growth constant μ (“mu” in the file, hr−1) in Enterobacter sp. to the concentration of ammonium ions, [NH4+] (“C” in the file, µM). The graph below shows this data plotted with a nonlinear regression curve. Reproduce it, including the superscripts, Greek letters, italics, etc., in the axis titles; and the smooth curve predicted from the fitted model.
\mu = \frac{ \mu_{max} }{ 1 + e^{ - \frac{ (\ln{[X]} - \ln{IC_{50}} ) }{ s } } }
  • μ is the exponential phase growth constant (hr−1) at any given concentration of X.
  • μmax is the maximum value that μ takes, i.e. the value of μ when the concentration of toxin, [X], is zero. The starting estimate should be whatever the curve seems to be flattening out to on the left.
  • IC50 is the 50% inhibitory concentration, i.e. the concentration of X needed to reduce μ from μmax to ½μmax. The starting estimate of ln IC50 should be the natural log of the value of [X] that gives you half the maximum growth rate.
  • s is the shape parameter: if s is small the curve drops off sharply like a cliff-edge; if s is large, the curve slopes more gently. If s is negative, the curve is Z-shaped; if s is positive, the curve is S-shaped. Its starting estimate should be (ln IC25 − ln IC75)/2. If you wish to prove this to yourself, pretend that e=3 rather than 2.718… and work out what [X] has to be to give you this value.
  • Note the x-variable in the equation and on the plot is the natural logarithm of [X], so you’ll need to use log(C) in your calls to plot() and nls().

Dose-response nonlinear regression model [CC-BY-SA-3.0 Steve Cook]

Answers

  1. The doubling time td is around 25 min.
bacterial.growth<-read.csv( "H:/R/bacterial_growth.csv" )
plot(
    N    ~ t,
    data = bacterial.growth,
    xlab = "N",
    ylab = "t / min",
    main = "Bacteria grow exponentially"
)

head( bacterial.growth )
   t    N
1  0 0.032
2 10 0.046
…

We estimate the parameters from the plot: N0 is obviously 0.032 from the data above; and the plot (below) shows it takes about 20 min for N to increase from 0.1 to 0.2, so td is about 20 min.

bacterial.model<-nls(
    N     ~ N0 * exp( (log(2) / td ) * t),
    data  = bacterial.growth,
    start = c( N0 = 0.032, td = 20 )
)
summary( bacterial.model )
Formula: N ~ N0 * exp((log(2) / td) * t)
Parameters:
    Estimate Std. Error t value Pr(>|t|)    
N0 3.541e-02  6.898e-04   51.34 2.30e-11 ***
td 2.541e+01  2.312e-01  109.91 5.25e-14 ***
---
Signif. codes: 
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
Residual standard error: 0.002658 on 8 degrees of freedom
Number of iterations to convergence: 5 
Achieved convergence tolerance: 1.052e-06

This bit adds the curve to the plot below.

predicted.values <- data.frame( 
    t = seq( 
       from       = min( bacterial.growth$t ),
       to         = max( bacterial.growth$t ),
       length.out = 100
    )
)
predicted.values$N <- predict( bacterial.model, newdata = predicted.values )
lines( N ~ t, data = predicted.values )

Bacterial growth nonlinear model [CC-BY-SA-3.0 Steve Cook]

  2. Dose/response model and graph
dose.response<-read.csv( "H:/R/dose_response.csv" )
plot(
    mu   ~ log(C),
    data = dose.response,
    xlab = expression("ln" * "[" * NH[4]^"+" * "]"
/ µM),
    ylab = expression(mu/hr^-1),
    main = expression("Sigmoidal dose/response to ammonium ions in " *
italic("Enterobacter"))
)

The plot indicates that the starting parameters should be

  • μmax ≈ 0.7
  • ln IC50 ≈ ln(3), because IC50 is the value of [X] corresponding to μmax/2 ≈ 0.35, and this is about 3. We’ll call ln(IC50) logofIC50 in the formula to nls() below so it’s clear this is a parameter we’re estimating, not a function we’re calling.
  • s ≈ − 0.6, because s ≈ (ln IC25−ln IC75) /2 = (ln(2)− ln(7)) /2
dose.model <- nls(
    mu    ~ mumax / ( 1 + exp( -( log(C) - logofIC50 ) / s ) ),
    data  = dose.response,
    start = c( mumax=0.7, logofIC50=log(3), s=-0.6 )
)
summary( dose.model )
Formula: mu ~ mumax/(1 + exp(-(log(C) - logofIC50)/s))
Parameters:
          Estimate Std. Error t value Pr(>|t|)    
mumax     0.695369   0.006123  113.58 1.00e-09 ***
logofIC50 1.088926   0.023982   45.41 9.78e-08 ***
s        -0.509478   0.020415  -24.96 1.93e-06 ***
---
Signif. codes: 
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
Residual standard error: 0.00949 on 5 degrees of freedom
Number of iterations to convergence: 5 
Achieved convergence tolerance: 6.394e-06

This lists the estimates of μmax, ln(IC50) and s, plus the standard error of these estimates and their p values. Note that they’re all pretty close to what we guessed in the first place, so we can be fairly sure they’re good estimates. To get the actual value of IC50 from the model:

exp( coef(dose.model)[2] )

This gives 2.97 µM, which corresponds well with our initial guess.

To add the curve, we do the same as before:

predicted.values <- data.frame( 
    C = seq( 
       from       = min( dose.response$C ),
       to         = max( dose.response$C ),
       length.out = 100
    )
)
predicted.values$mu <- predict( dose.model, newdata = predicted.values )
lines( mu ~ log(C), data=predicted.values )

Feb 24

Analysis of variance: ANOVA (2 way)

The technique for a one-way ANOVA can be extended to situations where there is more than one factor, or – indeed – where there are several factors with several levels each, which may have synergistic or antagonistic effects on each other.

In the models we have seen so far (linear regression, one-way ANOVA) all we have really done is tested the difference between a null model (“y is a constant”, “y = ȳ”) and a single alternative model (“y varies by group” or “y = a + bx”) using an F test. However, in two-way ANOVA there are several possible models, and we will probably need to proceed through some model simplification from a complex model to a minimally adequate model.

The file wheat_yield.csv contains data on the yield (tn ha−1) of wheat from a large number of replicate plots that were either unfertilised, given nitrate alone, phosphate alone, or both forms of fertiliser. This requires a two-factor two-level ANOVA.

wheat.yield<-read.csv( "H:/R/wheat_yield.csv" )
interaction.plot(
    response     = wheat.yield$yield,
    x.factor     = wheat.yield$N,
    trace.factor = wheat.yield$P
)

As we’re plotting two factors, a box-and-whisker plot would make no sense, so instead we plot an interaction plot. It doesn’t particularly matter here whether we use N(itrate) as the x.factor (i.e. the thing we plot on the x-axis) and P(hosphate) as the trace.factor (i.e. the thing we plot two different trace lines for), or vice versa:

Wheat yield interaction plot [CC-BY-SA-3.0 Steve Cook]

You’ll note that the addition of nitrate seems to increase yield: both traces slope upwards from the N(o) to Y(es) level on the x-axis, which represents the nitrate factor. From the lower trace, it appears addition of just nitrate increases yield by about 1 tn ha−1.

You’ll also note that the addition of phosphate seems to increase yield: the Y(es) trace for phosphate is higher than the N(o) trace for phosphate. From comparing the upper and lower traces at the left (no nitrate), it appears that addition of just phosphate increases yield by about 2 tn ha−1.

Finally, you may notice there is a positive (synergistic) interaction. The traces are not parallel, and the top-right ‘point’ (Y(es) to both nitrate and phosphate) is higher than you would expect from additivity: the top-right is maybe 4, rather than 1+2=3 tn ha−1 higher than the completely unfertilised point.

We suspect there is an interaction, this interaction is biologically plausible, and we have 30 samples in each of the four treatments. We fit a two-factor (two-way) ANOVA maximal model, to see whether this interaction is significant.

First, we fit the model using N*P to fit the ‘product’ of the nitrate and phosphate factors, i.e. N + P + N:P: both main effects plus their interaction:

wheat.model<-aov(yield ~ N*P, data=wheat.yield )
anova( wheat.model )
Analysis of Variance Table
Response: yield       
           Df  Sum Sq Mean Sq  F value   Pr(>F)    
N           1  83.645  83.645  43.3631   9.122e-10 ***
P           1 256.136 256.136 132.7859   < 2.2e-16 ***
N:P         1  11.143  11.143   5.7767   0.01759 *  
Residuals 136 262.336   1.929                       
---
Signif. codes: 
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

N shows the effect of nitrate, P the effect of phosphate, and N:P is the interaction term. As you can see, this interaction term appears to be significant: the nitrate+phosphate combination seems to give a higher yield than you would expect from just adding up the individual effects of nitrate and phosphate alone. To test this explicitly, we can fit a model that lacks the interaction term, using N+P to fit the ‘sum’ of the factors without the N:P term:

wheat.model.no.interaction<-aov(yield ~ N+P, data=wheat.yield )
anova( wheat.model.no.interaction )
Analysis of Variance Table
Response: yield         
          Df  Sum Sq Mean Sq F value    Pr(>F)   
N          1  83.645  83.645  41.902    1.583e-09 ***
P          1 256.136 256.136 128.312    < 2.2e-16 ***
Residuals 137 273.479   1.996                      
---
Signif. codes: 
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can then use the anova() function with two arguments to compare these two models with an F test:

anova( wheat.model, wheat.model.no.interaction )
Analysis of Variance Table
Model 1: yield ~ N * P
Model 2: yield ~ N + P
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
1    136 262.34        
2    137 273.48 -1   -11.143 5.7767 0.01759 *
---
Signif. codes: 
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

An alternative way of doing the same thing is to update the maximal model by deletion. First you fit the maximal model, as before:

wheat.model<-aov(yield ~ N*P, data=wheat.yield )
anova( wheat.model )

You note the least significant term (largest p value) is the N:P term, and you selectively remove that from the model to produce a revised model with update():

wheat.model.no.interaction<-update( wheat.model, ~.-N:P)
anova( wheat.model, wheat.model.no.interaction )

The ~.-N:P is shorthand for “fit (~) the current wheat.model (.) minus (-) the interaction term (N:P)”. This technique is particularly convenient when you are iteratively simplifying a model with a larger number of factors and a large number of interactions.

Whichever method we use, we note that the deletion of the interaction term significantly reduces the explanatory power of the model, and therefore that our minimally adequate model is the one including the interaction term.
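
If you want to see the fitted cell means themselves, rather than just the significance of the terms, model.tables() will tabulate them from the fitted aov object (a quick sketch, assuming wheat.model from above):

# Mean yield for each level of N, each level of P, and each N:P combination
model.tables( wheat.model, type="means" )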

If you have larger numbers of factors (say a, b and c) each with a large number of levels (say 4, 2, and 5), it is possible to fit a maximal model (y~a*b*c), and to simplify down from that. However, fitting a maximal model in this case would involve estimating 40 separate parameters, one for each combination of the 4 levels of a, with the 2 levels of b, and the 5 levels of c. It is unlikely that your data set is large enough to make a model of 40 parameters a useful simplification compared to the raw data set itself; you would even saturate it (leaving no residual degrees of freedom) if you had just one datum for each possible {a,b,c} combination. Remember that one important point of a model is to provide a simplification of a large data set. If you want to detect an interaction between factors a and b reliably, you need enough data to do so. In an experimental situation, this might impact on whether you actually want to make c a variable at all, or rather to control it instead, if this is possible.

Exercises

Analyse the following data set with ANOVA

  1. The file glucose_conc.csv contains fasting blood serum glucose levels (mg dL−1) for humans homozygous for either a mutant allele or the wild-type allele at two loci (Rtfm and Stfw) thought to be involved in blood glucose control. Do the data support a correlation between possession of the mutant alleles and elevated glucose levels?

Answers

  1. Glucose concentrations
glucose.conc<-read.csv( "H:/R/glucose_conc.csv" )
interaction.plot(
    response     = glucose.conc$conc,
    x.factor     = glucose.conc$Stfw,
    trace.factor = glucose.conc$Rtfm
)

Glucose interaction plot [CC-BY-SA-3.0 Steve Cook]

The lines are parallel, so there seems little evidence of interaction between the loci. The difference between the mutant and wildtypes for the Rtfm locus doesn’t look large, and may not be significant. However, the Stfw wildtypes seem to have better control of blood glucose. A two-way ANOVA can investigate this:

glucose.model<-aov( conc ~ Stfw*Rtfm, data = glucose.conc )
anova( glucose.model )
Analysis of Variance Table
Response: conc
           Df Sum Sq Mean Sq  F value  Pr(>F)   
Stfw        1 133233  133233 311.4965 < 2e-16 ***
Rtfm        1   1464    1464   3.4238 0.06526 .  
Stfw:Rtfm   1     17      17   0.0407 0.84020    
Residuals 296 126605     428                     
---
Signif. codes: 
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As we suspected, the Stfw:Rtfm interaction is not significant. We will remove it by deletion:

glucose.model.no.interaction<-update( glucose.model, ~.-Stfw:Rtfm )
anova( glucose.model, glucose.model.no.interaction )
Analysis of Variance Table

Model 1: conc ~ Stfw * Rtfm
Model 2: conc ~ Stfw + Rtfm
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    296 126605                           
2    297 126622 -1   -17.421 0.0407 0.8402

The p value is much larger than 0.05, so the model including the interaction term is not significantly better than the one excluding it.

The reduced model is now:

anova(glucose.model.no.interaction )
Analysis of Variance Table
Response: conc
           Df Sum Sq Mean Sq  F value  Pr(>F)   
Stfw        1 133233  133233 312.5059 < 2e-16 ***
Rtfm        1   1464    1464   3.4349 0.06482 .  
Residuals 297 126622     426                     
---
Signif. codes: 
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

When we remove the interaction term, its variance is redistributed to the remaining factors, which will change their values compared to the maximal model we fitted at the start. It appears that the wildtype and mutants of Rtfm do not significantly differ in their glucose levels, so we will now remove that term too:

glucose.model.stfw.only<-update( glucose.model.no.interaction, ~.-Rtfm )
anova( glucose.model.no.interaction, glucose.model.stfw.only )
Analysis of Variance Table
Model 1: conc ~ Stfw + Rtfm
Model 2: conc ~ Stfw
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
1    297 126622                              
2    298 128087 -1   -1464.4 3.4349 0.06482 .
---
Signif. codes: 
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Again, we detect no significant difference between the models, so we accept the simpler one (Occam’s razor).

anova( glucose.model.stfw.only )
Analysis of Variance Table

Response: conc
           Df Sum Sq Mean Sq F value    Pr(>F)    
Stfw        1 133233  133233  309.97 < 2.2e-16 ***
Residuals 298 128087     430                      
---
Signif. codes: 
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Stfw locus does have a very significant effect on blood glucose levels. We could try deletion testing Stfw too, but here it is not really necessary, as the ANOVA table above is comparing the Stfw-only model to the null model in any case. To determine what the effect of Stfw is, we can use a Tukey test:

TukeyHSD( glucose.model.stfw.only )
  Tukey multiple comparisons of means
    95% family-wise confidence level
Fit: aov(formula = conc ~ Stfw, data = glucose.conc)
$Stfw
                     diff     lwr       upr p adj
wildtype-mutant -42.14783 -46.859 -37.43666     0

… or, as this model has in fact simplified down to the two-levels of one factor, this is equivalent to just doing a t test at this point:

t.test( conc ~ Stfw, data = glucose.conc)
        Welch Two Sample t-test
data:  conc by Stfw
t = 17.6061, df = 289.288, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 37.43609 46.85958
sample estimates:
  mean in group mutant mean in group wildtype 
             123.87824               81.73041

Given this analysis, the best way to represent our data is probably a simple boxplot ignoring Rtfm, with an explanatory legend:

boxplot(
    conc ~ Stfw,
    data = glucose.conc,
    xlab = expression( italic(Stfw) ),
    ylab = expression( "[Glucose] / mg "*dL^-1 ),
    main = "Mutants at the Stfw locus are less able\nto control their blood glucose levels"
)

Glucose boxplot [CC-BY-SA-3.0 Steve Cook]

Homozygous mutants at the Stfw locus were found to be less well able to control their blood glucose levels (F=310, p≪0.001). Homozygous mutation of the Rtfm locus was found to have no significant effect on blood glucose levels (F on deletion from ANOVA model = 3.4, p=0.06). Homozygous mutation at the Stfw locus was associated with a blood glucose level c. 42 mg dL−1 (95% CI: 37.4–46.9) higher than the wildtype (t=17.6, p≪0.001).

Next up… Nonlinear regression.

Feb 24

Analysis of variance: ANOVA (1 way)

Analysis of variance is the technique to use when you might otherwise be considering a large number of pairwise F and t tests, i.e. where you want to know whether a factor with more than 2 levels is a useful predictor of a dependent variable.

For example, cuckoo_eggs.csv contains data on the length of cuckoo eggs laid into different host species, the (meadow) pipit, the (reed) warbler, and the wren.

A box-and-whisker plot is a useful way to view data of this sort:

cuckoo.eggs<-read.csv( "H:/R/cuckoo_eggs.csv" )
boxplot( 
    egg.length ~ species,
    data = cuckoo.eggs,
    xlab = "Host species",
    ylab = "Egg length / mm"
)

Cuckoo eggs boxplot [CC-BY-SA-3.0 Steve Cook]

You might be tempted to try t testing each pairwise comparison (pipit vs. wren, warbler vs. pipit, and warbler vs. wren), but a one-factor analysis of variance (ANOVA) is what you actually want here. ANOVA works by fitting individual means to the three levels (warbler, pipit, wren) of the factor (host species) and seeing whether this results in a significantly smaller residual variance than fitting a simple overall mean to the entire data set.
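
You can see the three group-specific means that ANOVA will fit directly with tapply() (assuming the cuckoo.eggs data frame loaded above):

# Mean egg length for each host species: these are the three fitted means
tapply( cuckoo.eggs$egg.length, cuckoo.eggs$species, mean )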

Conceptually, this is very similar to what we did with linear regression: ANOVA compares the residuals on the model represented by the “y is a constant” graph below:

Cuckoo eggs null model [CC-BY-SA-3.0 Steve Cook]

…with a model where three individual means have been fitted, the “y varies by group” model:

Cuckoo eggs ANOVA model [CC-BY-SA-3.0 Steve Cook]

It’s not immediately obvious that fitting three separate means has bought us much: the model is more complicated, but the length of the red lines doesn’t seem to have changed hugely. However, R can tell us precisely whether or not this is the case. The syntax for categorical model fitting is aov(), for analysis of variance:

aov( egg.length ~ species, data=cuckoo.eggs )
Call:
   aov(formula = egg.length ~ species, data = cuckoo.eggs)
Terms:
                 species Residuals
Sum of Squares  35.57612  55.85047
Deg. of Freedom        2        57
Residual standard error: 0.989865
Estimated effects may be unbalanced

As with linear regression, you may well wish to save the model for later use. It is traditional to display the results of an ANOVA in tabular format, which can be produced using anova():

cuckoo.model<-aov( egg.length ~ species, data=cuckoo.eggs )
anova( cuckoo.model )
Analysis of Variance Table
Response: egg.length
          Df Sum Sq Mean Sq F value    Pr(>F)    
species    2 35.576 17.7881  18.154 7.938e-07 ***
Residuals 57 55.850  0.9798                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The “y is a constant” model has only one variance associated with it: the total sum of squares of the deviances from the overall mean (SST, or SSY, whichever you prefer) divided by the degrees of freedom in the data set (n−1).

The “y varies by group” model decomposes this overall variance into two components.

  • The first component is associated with categorising the data into k groups (k=3 here). It is the sum of squares of the deviances of each datum from its group-specific mean (SSA) divided by the number of degrees of freedom those k means have, (k−1). This represents the between group variation in the data.
  • The second component is the residual error remaining, which is the sum of squares of the remaining deviances (SSE) divided by the degrees of freedom left after estimating those k individual means, (n−k). This represents the within group variation in the data.

In the ANOVA table, the Sum Sq is the sum of the squares of the deviances of the data points from a mean. The Df is the degrees of freedom, and the Mean Sq is the sum of squares divided by the degrees of freedom, which is the corresponding variance.

The F value is the mean-square for species divided by the mean-square of the Residuals. The p value indicates that categorising the data into three groups does make a significant difference to explaining the variance in the data, i.e. estimating three separate means for each host species, rather than one grand mean, does make a significant difference to how well we can explain the data. The length of the eggs the cuckoo lays does vary by host species.
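
You can check this arithmetic directly from the ANOVA table above:

# F is the between-group mean square divided by the within-group mean square
17.7881 / 0.9798   # = 18.15, the F value reported in the table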

Compare this with linear regression, where you’re trying to find out whether y = a + bx is a better model of the data than y = ȳ. This is very similar to ANOVA, where we are trying to find out whether “y varies by group, i.e. the levels of a factor” is a better model than “y is a constant, i.e. the overall mean”.

You might well now ask “but which means are different?” This can be investigated using TukeyHSD() (Tukey’s honestly significant differences), which performs a (better!) version of the pairwise t test you were probably considering at the top of this post.

TukeyHSD( cuckoo.model )
  Tukey multiple comparisons of means
    95% family-wise confidence level
Fit: aov(formula = egg.length ~ species, data = cuckoo.eggs)
$species
                 diff        lwr        upr    p adj
Warbler-Pipit -1.5395 -2.2927638 -0.7862362 0.000023
Wren-Pipit    -1.7135 -2.4667638 -0.9602362 0.000003
Wren-Warbler  -0.1740 -0.9272638  0.5792638 0.843888

This confirms that – as the box-and-whisker plots suggested – the eggs laid in pipit and warbler nests are not significantly different in size, but those laid in wren nests are significantly smaller than those in the warbler or pipit nests.

ANOVA has the same sorts of assumption as the F and t tests and as linear regression: normality of residuals, homoscedasticity, representativeness of the sample, no error in the treatment variable, and independence of data points. You should therefore use the same checks after model fitting as you used for linear regression:

plot( residuals( cuckoo.model ) ~ fitted( cuckoo.model ) )


Cuckoo eggs residuals [CC-BY-SA-3.0 Steve Cook]

We do not expect a starry sky in the residual plot, as the fitted data are in three discrete levels. However, if the residuals are homoscedastic, we expect them to be of similar spread in all three treatments, i.e. the residuals shouldn’t be more scattered around the zero line in the pipits than in the wrens, for example. This plot seems consistent with that.
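
To put a rough number on “similar spread”, you can compare the standard deviation of the residuals within each host species (a quick sketch, assuming the objects defined above):

# Within-group spread of the residuals: these should be broadly similar
tapply( residuals( cuckoo.model ), cuckoo.eggs$species, sd )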

qqnorm( residuals( cuckoo.model ) )

Cuckoo eggs normal QQ plot [CC-BY-SA-3.0 Steve Cook]

On the normal Q-Q plot, we do expect a straight line, which – again – we appear to have (although it’s a bit jagged, and a tiny bit S-shaped). We can accept that the residuals are more-or-less normal, and therefore that the analysis of variance was valid.

Exercises

Analyse the following data set with ANOVA

  1. The file venus_flytrap.csv contains data on the wet biomass (g) of Venus flytraps prevented from catching flies (control), allowed to catch flies, or prevented from catching flies but instead given fertiliser applied to the soil. Does the feeding treatment significantly affect the biomass? If so, which of the three means differ significantly, and in what directions? Have a look at the residuals in the same way as you did for the regression. How do they look?

Answers

  1. Venus flytrap biomass data
venus.flytrap<-read.csv("H:/R/venus_flytrap.csv")
plot(
    biomass ~ feed,
    data = venus.flytrap,
    xlab = "Feeding treatment",
    ylab = "Wet biomass / g",
    main = expression("Venus flytrap feeding regime affects wet biomass")
)

Venus flytrap boxplot [CC-BY-SA-3.0 Steve Cook]

flytrap.model<-aov( biomass ~ feed, data = venus.flytrap )
anova( flytrap.model )
Analysis of Variance Table
Response: biomass
         Df  Sum Sq Mean Sq F value    Pr(>F)   
feed      2  5.8597 2.92983  22.371 1.449e-08 ***
Residuals 87 11.3939 0.13096                      
---
Signif. codes: 
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD( flytrap.model )
  Tukey multiple comparisons of means
    95% family-wise confidence level
Fit: aov(formula = biomass ~ feed, data = venus.flytrap)
$feed
                         diff         lwr        upr     p adj
fertiliser-control -0.3086667 -0.53147173 -0.0858616 0.0039294
fly-control         0.3163333  0.09352827  0.5391384 0.0030387
fly-fertiliser      0.6250000  0.40219493  0.8478051 0.0000000

The feeding treatment makes a significant difference to the wet biomass (F=22.4, p=1.45 × 10−8) and a Tukey HSD test shows that all three means are different, with the fly treatment having a beneficial effect (on average, the fly-treated plants are 0.32 g heavier than the control plants), but the fertiliser has an actively negative effect on growth, with these plants on average being 0.31 g lighter than the control plants.

To plot the residuals, we use the same code as for the linear regression:

plot( residuals(flytrap.model) ~ fitted(flytrap.model) )

Venus flytrap residuals [CC-BY-SA-3.0 Steve Cook]

There is perhaps a bit more scatter in the residuals of the control (the ones in the middle), but nothing much to worry about.

qqnorm( residuals( flytrap.model ) )

Venus flytrap normal QQ plot [CC-BY-SA-3.0 Steve Cook]

On the normal Q-Q plot, we do expect a straight line, which – again – we appear to have. We can accept that the residuals are essentially homoscedastic and normal, and therefore that the analysis of variance was valid.

Next up… Two-way ANOVA.

Feb 17

Comparison of expected and observed count data: the χ² test

A χ2 test is used to measure the discrepancy between the observed and expected values of count data.

  • The dependent data must – by definition – be count data.
  • If there are independent variables, they must be categorical.

The test statistic derived from the two data sets is called χ2, and it is defined as the sum, over all categories, of the squared discrepancy between the observed and expected counts, divided by the expected count:

\chi^2 = \sum{ \frac{ (O - E)^2 }{ E } }

The reference distribution for the χ2 test is Pearson’s χ2. This reference distribution has a single parameter: the number of degrees of freedom remaining in the data set.

A χ2 test compares the χ2 statistic from your empirical data with the Pearson’s χ2 value you’d expect under the null hypothesis given the degrees of freedom in the data set. The p value of the test is the probability of obtaining a test χ2 statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis (“there is no discrepancy between the observed and expected values”) is true. i.e. The p value is the probability of observing your data (or something more extreme), if the data do not truly differ from your expectation.

The comparison is only valid if the data are:

  • Representative of the larger population, i.e. the counts are sampled in an unbiased way.
  • Of sufficiently large sample size. In general, observed counts (and expected counts) less than 5 may make the test unreliable, and cause you to accept the null hypothesis when it is false (i.e. a ‘false negative’). R will automatically apply Yates’ continuity correction to 2×2 tables, and will warn you if it thinks your expected counts are sailing too close to the wind.
  • Independent.

Do not use a χ2 test unless these assumptions are met. Fisher’s exact test, fisher.test(), may be more suitable if the data set is small.
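
fisher.test() accepts the same kind of contingency table as chisq.test(); a minimal sketch with invented counts, purely for illustration:

# Fisher's exact test on a small 2x2 contingency table (made-up data)
small.table <- matrix( c( 3, 9, 7, 2 ), nrow=2, byrow=TRUE )
fisher.test( small.table )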

In R, a χ2-test is performed using chisq.test(). This acts on a contingency table, so the first thing you need to do is construct one from your raw data. The file tit_distribution.csv contains counts of the total number of birds (the great tit, Parus major, and the blue tit, Cyanistes caeruleus) at different layers of a canopy over a period of one day.

tit.distribution<-read.csv( "H:/R/tit_distribution.csv" )
print( tit.distribution )

This will spit out all 706 observations: remember that the raw data you import into R should have a row for each ‘individual’; here, each individual is a “this bird in that layer” observation. You can see just the start of the data using head():

head( tit.distribution )
     Bird  Layer
1 Bluetit Ground
2 Bluetit Ground
3 Bluetit Ground
4 Bluetit Ground
5 Bluetit Ground
6 Bluetit Ground

and look at a summary of the data frame object with str():

str( tit.distribution )
'data.frame':   706 obs. of  2 variables:
 $ Bird : Factor w/ 2 levels "Bluetit","Greattit": 1 1 1 1 1 1 1 1 1 1 ...
 $ Layer: Factor w/ 3 levels "Ground","Shrub",..: 1 1 1 1 1 1 1 1 1 1 ...

To create a contingency table, use table():

tit.table<-table( tit.distribution$Bird, tit.distribution$Layer )
tit.table
           Ground Shrub Tree
  Bluetit      52   72  178
  Greattit     93  247   64

If you already had a table of the count data, and didn’t fancy making the raw data CSV file from it, just to have to turn it back into a contingency table anyway, you could construct the table manually using matrix():

tit.table<-matrix( c( 52, 72, 178, 93, 247, 64 ), nrow=2, byrow=TRUE )
# nrow=2 means cut the vector into two rows
# byrow=TRUE means fill the data in horizontally (row-wise)
# rather than vertically (column-wise)
tit.table
     [,1] [,2] [,3]
[1,]   52   72  178
[2,]   93  247   64

The matrix can be prettified with labels (if you wish) using dimnames(), which expects a list() of two vectors, the first of which are the row names, the second of which are the column names:

dimnames( tit.table )<-list( c("Bluetit","Greattit" ), c("Ground","Shrub","Tree" ) )
tit.table
         Ground Shrub Tree
Bluetit      52    72  178
Greattit     93   247   64

To see whether the observed values (above) differ from the expected values, you need to know what those expected values are. For a simple homogeneity χ2-test, the expected values are simply calculated from the corresponding column (C), row (R) and grand (N) totals:

E = \frac{R \times C}{ N }
              Ground                 Shrub                      Tree                       Row total
Blue tit   O  52                     72                         178                        302
           E  302×145/706 = 62.0     302×319/706 = 136.5        302×242/706 = 103.5
           χ² (52−62)²/62 = 1.6      (72−136.5)²/136.5 = 30.5   (178−103.5)²/103.5 = 53.6
Great tit  O  93                     247                        64                         404
           E  404×145/706 = 83.0     404×319/706 = 182.5        404×242/706 = 138.5
           χ² (93−83)²/83 = 1.2      (247−182.5)²/182.5 = 22.7  (64−138.5)²/138.5 = 40.1
Column total  145                    319                        242                        706

The individual χ2 values show the discrepancies for each of the six individual cells of the table. Their sum is the overall χ2 for the data, which is 149.7. R does all this leg-work for you, with the same result:

chisq.test( tit.table )
       Pearson's Chi-squared test
data: tit.table 
X-squared = 149.6866, df = 2, p-value < 2.2e-16

The individual tits’ distributions are significantly different from homogeneous, i.e. there are a lot more blue tits in the trees and great tits in the shrub layer than you would expect just from the overall distribution of birds.
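
If you want to see R’s leg-work, the expected counts and the per-cell contributions can be pulled out of the object that chisq.test() returns:

tit.chisq <- chisq.test( tit.table )
tit.chisq$expected      # expected counts, E = R*C/N
tit.chisq$residuals^2   # per-cell (O-E)^2/E values; sum() these to recover the overall chi-squared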

Sometimes, the expected values are known, or can be calculated from a model. For example, if you have 164 observations of progeny from a dihybrid selfing genetic cross, where you expect a 9:3:3:1 ratio, you’d perform a χ2 manually like this:

     A−B−                       A−bb                       aaB−                       aabb
O    94                         33                         28                         9
E    164×9/16 = 92.25           164×3/16 = 30.75           164×3/16 = 30.75           164×1/16 = 10.25
χ²   (94−92.25)²/92.25 = 0.033  (33−30.75)²/30.75 = 0.165  (28−30.75)²/30.75 = 0.246  (9−10.25)²/10.25 = 0.152

This gives a total χ² of 0.596. To do the equivalent in R, you should supply chisq.test() with a second, named parameter called p, which is a vector of expected probabilities:

dihybrid.table<-matrix( c( 94, 33, 28, 9 ), nrow=1, byrow=TRUE )
dimnames( dihybrid.table )<-list( c( "Counts" ), c( "A-B-","A-bb","aaB-","aabb" ) )
dihybrid.table
       A-B- A-bb aaB- aabb
Counts   94   33   28    9
null.probs<-c( 9/16, 3/16, 3/16, 1/16 )
chisq.test( dihybrid.table, p=null.probs )
    Chi-squared test for given probabilities
data: dihybrid.table 
X-squared = 0.5962, df = 3, p-value = 0.8973

The data are not significantly different from a 9:3:3:1 ratio, so the A and B loci appear to be unlinked and non-interacting, i.e. they are inherited in a Mendelian fashion.

The most natural way to plot count data is using a barplot() bar-chart:

barplot( dihybrid.table, xlab="Genotype", ylab="N", main="Dihybrid cross" )

Dihybrid cross barplot [CC-BY-SA-3.0 Steve Cook]

Exercises

Use the χ2 test to investigate the following data sets.

  1. Clover plants can produce cyanide in their leaves if they possess a particular gene. This is thought to deter herbivores. Clover seedlings of the CN+ (cyanide-producing) and CN− (cyanide-free) phenotypes were planted out and the amount of rabbit nibbling to leaves was measured after 48 hr. Leaves with >10% nibbling were scored as ‘nibbled’, those with less were scored as ‘un-nibbled’. Do the data support the idea that cyanide reduces herbivore damage?
     Nibbled  Un-nibbled
CN+  26       74
CN−  34       93
  2. In a dihybrid selfing cross between maize plants heterozygous for the A/a (A is responsible for anthocyanin production) and Pr/pr (Pr is responsible for modification of anthocyanins from red to purple) loci, we expect an F2 ratio of 9 A− Pr−: 3 A− pr pr: 3 a a Pr− : 1 a a pr pr. The interaction between the loci results in the a a Pr−  and a a pr pr individuals being indistinguishable in the colour of their kernels. The file maize_kernels.csv contains a tally of kernel colours. Do the data support the gene-interaction model?

Answers

  1. As the data are already in a table, it is easier to construct the contingency table directly as a matrix. The data do not support the hypothesis that the two phenotypes differ in their damage from rabbit nibbling:
clover.table<-matrix( c( 26, 74, 34, 93 ), nrow=2, byrow=TRUE )
dimnames( clover.table )<-list( 
  c( "CN.plus",  "CN.minus" ), 
  c( "Nibbled", "Un.nibbled" )
)
clover.table
         Nibbled Un.nibbled
CN.plus      26         74
CN.minus     34         93
chisq.test( clover.table )
        Pearson's Chi-squared test with Yates' continuity correction
data: clover.table 
X-squared = 0, df = 1, p-value = 1
  2. The data are in a file, so we construct a simple contingency table from that and then test against the expected frequencies of 9 purple : 3 red : 4 colourless. Make sure you get them in the right order! The data support the model, as the χ2 value has a p value greater than 0.05, i.e. we can accept that the data are consistent with a 9:3:4 ratio.
maize.kernels<-read.csv( "H:/R/maize_kernels.csv" )
head( maize.kernels )
      Kernel
1        Red
2 Colourless
3 Colourless
4 Colourless
5     Purple
6 Colourless
maize.table<-table( maize.kernels$Kernel )
maize.table
Colourless    Purple        Red 
       229       485        160
chisq.test( maize.table, p=c( 4/16, 9/16, 3/16 ) )
Chi-squared test for given probabilities
data: maize.table 
X-squared = 0.6855, df = 2, p-value = 0.7098

Next up… One-way ANOVA.

Feb 10

Correlation of data: linear regression

Linear regression is used to see whether one continuous variable is correlated with another continuous variable in a linear way, i.e. can the dependent variable y be modelled with a straight-line response to changes in the independent covariate x:

y = a + bx + \epsilon

Here b is the estimated slope of the best-fit line (a.k.a. gradient, often written m), a is its y-intercept (often written c), and ϵ is the residual error. If the x and y data are perfectly correlated, then ϵ=0 for each and every x,y pair in the data set; however, this is extremely unlikely to occur in real-world data.

When you fit a linear model like this to a data set, each coefficient you fit (here, the intercept and the slope) will be associated with a t value and p value, which are essentially the result of a one-sample t test comparing the fitted value to 0.

Linear regression is only valid if:

  • The x and y variables have a linear relationship. If y is a function of the square of x, or has a hyperbolic relationship with x, then naïve linear regression must not be used, as it will try to fit a straight line to what is clearly not a straight-line relationship. It is often possible to transform curved relationships to straight-line relationships using transformations (logs, reciprocals, etc.). Always plot() and eyeball your data before modelling!
  • The data sets are representative of the larger population. As usual, if you collect data that is biased, fraudulent or of very small size, then any kind of statistical analysis is likely to be broken.
  • The residuals are normally distributed and homoscedastic. When the linear model is fitted, there will be some residual ‘noise’, i.e. the ϵ error term above. These residuals must be normally distributed, and should not be a function of the value of the x variable, i.e. the variance of y data at small values of x should be the same as the variance of y data at large values of x.
  • Each pair of data is independent. Each x,y pair should be independent of every other x,y pair.
  • The x variable is measured without error. Only the y variable can have an error associated with it.

Linear regression is very commonly used in circumstances where it is not technically appropriate, e.g. time-series data (where later x,y pairs are most certainly not independent of earlier pairs), or where the x-variable does have some error associated with it (e.g. from pipetting errors), or where a transformation has been used that will make the residuals non-normal. You should at least be aware you are breaking the assumptions of the linear regression procedure if you use it for data of this sort.

The file cricket_chirps.csv contains data on the frequency of cricket chirps (Hz) at different temperatures (°C). A quick plot of the data seems to show a positive, linear relationship:

cricket.chirps<-read.csv( "H:/R/cricket_chirps.csv" )
plot(
    Frequency ~ Temperature,
    data = cricket.chirps,
    xlab = "Temperature / °C",
    ylab = "Frequency / Hz",
    main ="Crickets chirp more frequently at higher temperatures",
    pch  = 15
         # The pch option can be used to control the plotting character
)

Cricket chirps scatterplot [CC-BY-SA-3.0 Steve Cook]

To model the data, you need to use lm( y ~ x, data=data.frame ). The lm() stands for “linear model”.

lm( Frequency ~ Temperature, data=cricket.chirps )
Call:
lm(formula = Frequency ~ Temperature, data = cricket.chirps)
Coefficients:
(Intercept)       Temperature  
    -0.1140       0.1271

You’ll often want to save the model for later use, so you can assign it to a variable. summary() can then be used to see what R thinks the slope and intercept are:

chirps.model<-lm( Frequency ~ Temperature, data=cricket.chirps )
summary( chirps.model )
Call:
lm(formula = Frequency ~ Temperature, data=cricket.chirps)

Residuals:
     Min       1Q    Median       3Q       Max 
-0.39779 -0.11544  -0.00191  0.12603   0.33985 

Coefficients:

             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.113971   0.152264  -0.749    0.467    
Temperature  0.127059   0.005714  22.235 2.55e-12 ***
---
Signif. codes: 
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2107 on 14 degrees of freedom
Multiple R-squared:  0.9725,    Adjusted R-squared:  0.9705 
F-statistic: 494.4 on 1 and 14 DF,  p-value: 2.546e-12
  • The (Intercept) is a from the formula at the top of this section. It is the y-intercept of the line of best fit through the x,y data pairs. It is the value of y (Frequency) when x (Temperature) is zero, i.e. how frequently the crickets chirp at the freezing point of water. The estimated value is -0.1140 Hz, which is impossible(!), but satisfyingly does not appear to be significantly different from zero (p=0.467).
  • The Temperature is b from the formula at the top of this section. It is the slope of line of best fit through the x,y data pairs. The estimated value is 0.1271 Hz °C−1, i.e. for every 10°C increase in temperature, the chirping rate increases by about 1.3 Hz.
  • The Multiple R-squared value is the square of the correlation coefficient R for Frequency on Temperature. Values of R2 close to 1 indicate y is well correlated with the x covariate with relatively little scatter; values close to 0 indicate the scatter is large and x is a poor predictor of y.
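
As a quick sanity check of that last point: in a single-covariate regression, the Multiple R-squared is just the square of Pearson’s correlation coefficient between the two variables:

# R-squared for a single-covariate model is the squared correlation coefficient
cor( cricket.chirps$Temperature, cricket.chirps$Frequency )^2   # 0.9725, as in the summary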

To understand the meaning of the F-statistic part of the report, it is important to understand what actually happens when you perform a linear regression. What you’re really trying to find out in linear regression is whether a straight line with non-zero slope, “y = a + bx”, is a better model of the dependent variable than a straight line of slope zero, with a y-intercept equal to a constant, usually the mean of the y values: “y = ȳ”. These two possible models are shown below. The code needed to display them with the deviance of each datum from the regression line picked out as red (col="red") vertical line segments is also shown.

Linear regression model “y=a+bx” with non-zero slope…

chirps.model<-lm( Frequency ~ Temperature, data=cricket.chirps )
abline( chirps.model )
chirps.predicted<-data.frame( Temperature=cricket.chirps$Temperature )
chirps.predicted$Frequency<-predict( chirps.model, newdata=chirps.predicted )
segments(
    cricket.chirps$Temperature,   cricket.chirps$Frequency,
    chirps.predicted$Temperature, chirps.predicted$Frequency,
    col="red"
)

We use the predict() function to predict the Frequency values from the model and a data frame containing Temperature values. We put the predicted Frequency values into a new column in the data frame by assigning them using the $ dollar syntax.

To add a regression line to the current plot using the fitted model, we use abline(). Like many of the other functions we’ve seen, this can either take an explicit intercept and slope:

# abline( a=intercept, b=slope )
abline( a=-0.1140, b=0.1271 )

Or it can take a tilde ~ modelled-by formula:

abline( chirps.model )

We use the segments() function to add red line segments to represent the deviation of each datum from the regression line. segments() takes four vectors as arguments, the x and y coordinates to start each segment from (here, the measured Temperature, Frequency data points), plus the x and y coordinates to finish each line (the equivalent columns from the data frame containing the predicted data: these are the corresponding points on the regression line).

Cricket chirps linear model [CC-BY-SA-3.0 Steve Cook]

The “y is a constant” model, “y = ȳ”, with zero slope…

mean.frequency<-mean( cricket.chirps$Frequency )
abline( a=mean.frequency, b=0 )
segments(
    cricket.chirps$Temperature, cricket.chirps$Frequency,
    cricket.chirps$Temperature, rep( mean.frequency, length(cricket.chirps$Frequency) ),
    col="red"
)

The constant is the mean of the Frequency measurements. The predicted Frequency values are therefore just 16 copies of this mean. We use length() to avoid having to hard-code the ‘16’.

Cricket chirps null model [CC-BY-SA-3.0 Steve Cook]

It is ‘obvious’ that the y = a + bx model is better than the y = ȳ model. The y = ȳ model estimates just one parameter from the data (the mean of y), but leaves a huge amount of residual variance unexplained. The y = a + bx model estimates one more parameter, but with an enormous decrease in the residual variance, and a correspondingly enormous increase in the model’s explanatory power.

How much better? The degree to which the y = a + bx model is better than the y = ȳ model is easily quantified using an F test, and in fact R has already done this for you in the output from summary( chirps.model ):

F-statistic: 494.4 on 1 and 14 DF,  p-value: 2.546e-12

Accounting for the covariate Temperature makes a significant difference to our ability to explain the variance in the Frequency values. The F statistic is the result of an F test comparing the residual variance in the y = a + bx model (i.e. the alternative hypothesis: “Temperature makes a difference to frequency of chirps”) with the residual variance of the y = ȳ model (i.e. the null hypothesis: “Temperature makes no difference to frequency of chirps”).

An F test tells you whether two variances are significantly different: these can be the variances of two different data sets, or – as here – these can be the variances of two different models. The F value is very large (494) and the difference in explanatory power of the two models is therefore significantly different: by estimating just one extra parameter, the slope, which requires us to remove just one extra degree of freedom, we can explain almost all of the variance in the data.
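
R has already run this comparison for you, but you can reproduce it explicitly by fitting the “y is a constant” model yourself (in R, Frequency ~ 1 fits an intercept-only model) and handing both models to anova():

# Fit the null 'y is a constant' model explicitly, then compare it with the full model
chirps.null <- lm( Frequency ~ 1, data=cricket.chirps )
anova( chirps.null, chirps.model )   # F test comparing the two models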

Once we have fitted a linear model, we should check that the fit is good and that the assumption about the normality of the residual variance in the y variable is satisfied.

plot( residuals(chirps.model) ~ fitted(chirps.model) )

Cricket chirps residuals [CC-BY-SA-3.0 Steve Cook]

This plots the fitted Frequency values (i.e. Frequency.fitted = 0.1271×Temperature-0.1140) as the x variable against the residual values (ε=Frequency-Frequency.fitted) as the y variable. If the residuals are behaving themselves (i.e. they are normal), this should look like a starry sky, with equal numbers of points above and below 0. If the residuals increase or decrease (i.e. it looks like you could stick a line or curve through them) with the fitted values, or are asymmetrically distributed, then your data break the assumptions of linear regression, and you should be careful in their interpretation.

You should also look at the normal quantile-quantile (QQ) plot of the residuals:

qqnorm( residuals( cricket.chirps.model ) )

Cricket chirps normal QQ plot [CC-BY-SA-3.0 Steve Cook]

The points on this graph should lie on a straight line. If they’re curved, again, your data break the assumptions of linear regression, and you should be careful in their interpretation. You can scan through these and other diagnostic plots using:

plot( cricket.chirps.model )

Exercises

Fit linear models to the following data sets. Criticise the modelling: are the assumptions of the linear regression met?

  1. Using the sycamore_seeds.csv file you have already made, model the data and add a suitable regression line to the graph. You saved the code for the graph, and the CSV file from before, didn’t you?
  2. The file nadh_absorbance.csv contains data on the absorbance (A) at 340 nm of solutions containing increasing micromolar concentrations (C) of NADH. What is the Beer-Lambert molar extinction coefficient (ϵ) for NADH at 340 nm? How confident are you in your estimate? Is there anything about the data set that violates the linear regression assumptions? [Note that the epsilon here is the standard symbol for molar extinction coefficient and has nothing to do with the residuals]
A=\epsilon C l
  3. There is a relationship between the size of an island (or other defined area) and the number of species it contains. This relationship is modelled by the equation below, where S is the number of species, A is the area, and C and z are data-specific constants. Using logs, convert this equation into a linear form. Use R to transform the data below, and to estimate C and z. Criticise your model, and comment on the reliability of the results.
S=C A^z
Island     Area of island / km²   Number of (non-bat) mammal species
Jersey     116.3                  9
Guernsey   63.5                   5
Alderney   7.9                    3
Sark       5.2                    2
Herm       1.3                    2

Answers

  1. Sycamore seeds regression
sycamore.seeds<-read.csv( "H:/R/sycamore_seeds.csv" )
plot(
    descent.speed ~ wing.length,
    data = sycamore.seeds,
    xlab = "Wing length / mm",
    ylab = expression("Descent speed " / m*s^-1),
    main = "Sycamore seeds with longer wings fall more slowly"
)
sycamore.seeds.model<-lm( descent.speed ~ wing.length, data=sycamore.seeds )
abline( sycamore.seeds.model )
summary( sycamore.seeds.model )
Call:
lm(formula = descent.speed ~ wing.length, data = sycamore.seeds)
Residuals:
      Min        1Q   Median        3Q       Max 
-0.073402 -0.034124 -0.005326  0.005395 0.105636 
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.388333   0.150479  15.872 9.56e-07 ***
wing.length -0.040120   0.004607  -8.709 5.28e-05 ***
---
Signif. codes: 
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.06416 on 7 degrees of freedom
Multiple R-squared:  0.9155,    Adjusted R-squared:  0.9034 
F-statistic: 75.85 on 1 and 7 DF,  p-value: 5.28e-05

Both the intercept and the slope are significantly different from zero. The slope is negative, around −0.04 m s⁻¹ mm⁻¹. The residuals don’t look too bad, but note that if you accept the model without thinking, you’ll predict that a wing of length −intercept/slope ≈ 60 mm (i.e. the x-intercept) would allow the seed to defy gravity forever. Beware extrapolation!

Sycamore seeds linear model [CC-BY-SA-3.0 Steve Cook]

  2. Beer-Lambert law for NADH
nadh.absorbance<-read.csv( "H:/R/nadh_absorbance.csv" )
plot(
    A340 ~ Conc.uM,
    data = nadh.absorbance,
    xlab = "[NADH] / µM",
    ylab = expression(A[340]),
    main = "Absorbance at 340 nm shows linear\nBeer-Lambert law for NADH"
)
nadh.absorbance.model<-lm( A340 ~ Conc.uM, data=nadh.absorbance )
abline( nadh.absorbance.model )
summary( nadh.absorbance.model )
Call:
lm(formula = A340 ~ Conc.uM, data = nadh.absorbance)
Residuals:
       Min         1Q     Median         3Q       Max 
-0.0043482 -0.0020392 -0.0004086  0.0020603 0.0057544 
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.973e-03  8.969e-04   3.314  0.00203 ** 
Conc.uM     6.267e-03  3.812e-05 164.378  < 2e-16 ***
---
Signif. codes: 
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.002783 on 38 degrees of freedom
Multiple R-squared: 0.9986,    Adjusted R-squared:  0.9986 
F-statistic: 2.702e+04 on 1 and 38 DF,  p-value: < 2.2e-16

NADH linear model [CC-BY-SA-3.0 Steve Cook]

The slope is 6.267×10⁻³ µM⁻¹, which means ϵ is 6.267×10³ M⁻¹ cm⁻¹ (taking the path length, l, as 1 cm). This is very significantly different from zero; however, so too is the intercept, which – theoretically – should be zero. You’ll also note that the Q-Q plot:

qqnorm( residuals( nadh.absorbance.model ) )

NADH normal QQ plot [CC-BY-SA-3.0 Steve Cook]

is very clearly not a straight line, indicating that the variance in the residuals is not constant.
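Incidentally, if you want to pull the slope out of the model rather than reading it off the summary, coef() will do it; a sketch (the factor of 10⁶ converts from per-µM to per-M, and assumes the 1 cm path length mentioned above):

# Molar extinction coefficient in M^-1 cm^-1, assuming l = 1 cm
epsilon<-coef( nadh.absorbance.model )["Conc.uM"] * 1e6
epsilon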

  3. Species-area requires a log/log transformation
\log{S} = \log{C}+z\log{A}
logS<-log(c(     9,    5,   3,   2,   2 ))
logA<-log(c( 116.3, 63.5, 7.9, 5.2, 1.3 ))
species.area<-data.frame(logS=logS,logA=logA)
plot( 
    logS ~ logA,
    data = species.area,
    xlab = expression("ln( Area of island"/ km^2 *" )"),
    ylab = "ln( Number of species )",
    main = "Species supported by islands of different areas"
)
species.area.model<-lm( logS ~ logA, data=species.area )
abline( species.area.model )
summary( species.area.model )
Call:
lm(formula = logS ~ logA, data = species.area)

Residuals:
       1        2        3        4        5 
 0.22137 -0.16716  0.00828 -0.25948  0.19699 

Coefficients:

           Estimate Std. Error t value Pr(>|t|) 
(Intercept) 0.40977    0.20443   2.004   0.139  
logA        0.32927    0.06674   4.934   0.016 *
---
Signif. codes: 
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2471 on 3 degrees of freedom
Multiple R-squared: 0.8903,    Adjusted R-squared:  0.8537 
F-statistic: 24.34 on 1 and 3 DF,  p-value: 0.01597

log C is the (Intercept), 0.4098 (so C itself is e^0.4098 ≈ 1.5), and z is the slope associated with logA, 0.329 (z is an exponent, so it is dimensionless). The residual plots are a little difficult to interpret as the sample size is small; and you’ll note the large error in the estimate of log C, which is not significantly different from 0 (i.e. C may well be 1). I wouldn’t want to bet much money on the estimates of C or of z here.

Species-area log-log linear model [CC-BY-SA-3.0 Steve Cook]
Next up… The χ²-test.

Feb 03

Statistical testing

If you want a random yes/no answer to a question, like “who should kick-off this football match?” it’s very common to entrust the decision to the flip of a coin, on the assumption that the coin doesn’t care which side gets the advantage.

But what if that trust is misplaced? What if the coin gives the answer “tails – your opponent kicks-off” more often than “heads – you kick-off”, and so gives the advantage to your opponent more often than to you? And, most importantly, how could you tell if the coin is biased before you entrust your decision-making to it?

One answer is to perform a statistical test on a sample of coin-flips, and to see whether the number of heads is lower than you would expect if the coin were fair. In this post, we’ll see how to do this from scratch for coin-flipping, so that when you do more complex tests, you have a better idea of what they’re doing under the hood.

A statistical test is a way of choosing whether to reject a hypothesis. So, before you consider a statistical test, you should have in mind at least one clear hypothesis:

  • The null hypothesis, H0, which usually states something like “there is no difference between A and B”, e.g. “this coin has the same probability of coming up heads as tails”.

You might also have in mind another hypothesis…

  • The alternative hypothesis, H1, which states something at odds with the null hypothesis: “there is a difference between A and B”, e.g. “this coin is biased, and produces – on average – more heads than tails when flipped”.

However, proper handling of alternative hypotheses will have to wait for another post.

If you collected just a single piece of data – a single coin-flip – then you would have absolutely no idea of whether the coin is biased, because the chance of a single head is exactly the same as the chance of a single tail: 50%, or 0.5. We can plot these two probabilities against the number of heads to produce a probability distribution (technically, a probability mass function) like this:

Probability distribution for one flip of a fair coin [CC-BY-SA-3.0 Steve Cook]

The probability of getting zero heads from just one flip of a fair coin, is 0.5; and the probability of getting one head from one flip is also 0.5.

OK, so how about a few more flips? If you did three coin-flips and got not a single head, you might suspect something were awry, but most people wouldn’t bet money on it. Why? Well, if the coin is fair, the probability of getting a head on any given flip is 0.5, and so too is the probability of getting a tail. Furthermore, every flip of the coin is independent of the last – unless you’re a fraudulent coin-flipping genius with terrifying motor control. So, assuming that the coin is fair, and that you are fairly clumsy, the chance of zero heads from three coin-flips is 0.5 (for the first tail) times 0.5 (for the second) times 0.5 (for the third) = 0.5³ = 0.125 = 12.5%, i.e. it’s something that will happen more than 10% of the time.

However, if you did thirty coin-flips and got not a single head, you’d be almost certain the coin were biased, and the maths backs this up: 0.5³⁰ is, as near as damn-it, a 1 in a billion chance. It’s not impossible that the coin is unbiased – experiments can never rule anything out completely! – but this would be a particularly unlikely fluke.

But what if you got 10 heads in 30 coin-flips? Or 3 heads in 10? Or 8 heads in 40? What number of failures to get a head in a sample of flips would be enough to convince you the coin were biased? Indeed, how many times should you flip the coin to check its fairness anyway?

Let’s take the example of two flips of an unbiased coin.  The possible results from two flips are:

  1. HH
  2. HT
  3. TH
  4. TT

The probability of zero heads is 0.5×0.5 = 0.25.

The probability of one head and one tail is 2×0.5×0.5 = 0.5, because there are two different combinations in which you can get a single head and a single tail (HT or TH).

The probability of two heads is 0.5×0.5 = 0.25.

If we plot these probabilities against the number of heads, as before, we get this distribution:

Note that the heights of the bars are in the ratio 1:2:1, showing the 1 way you can get zero heads, the 2 ways you can get a head and a tail, and the 1 way you can get two heads.

Probability distribution for two flips of a fair coin [CC-BY-SA-3.0 Steve Cook]

For three flips, there are eight possibilities:

  1. HHH
  2. HHT
  3. HTH
  4. THH
  5. HTT
  6. THT
  7. TTH
  8. TTT

The probability of zero heads is 0.5×0.5×0.5 = 0.5³ = 0.125. The probability of one head and two tails is 3×0.5×0.5×0.5 = 0.375, because there are three different combinations of one head and two tails. The same is true of two heads and one tail; and the probability of three heads (zero tails) is the same as for three tails, 0.125. Here’s the graph for three flips of a fair coin; again note the ratio: this time it’s 1:3:3:1.

Probability distribution for three flips of a fair coin [CC-BY-SA-3.0 Steve Cook]

And finally, for four flips, there are 16 possible outcomes, each with a probability of 0.5⁴ = 0.0625:

  1. HHHH
  2. HHHT
  3. HHTH
  4. HTHH
  5. THHH
  6. HHTT
  7. TTHH
  8. HTTH
  9. THHT
  10. THTH
  11. HTHT
  12. TTTH
  13. TTHT
  14. THTT
  15. HTTT
  16. TTTT

There is 1 way to get zero heads, 4 ways to get one head, 6 ways to get two heads, 4 ways to get three heads, and 1 way to get four heads.

Probability distribution for four flips of a fair coin [CC-BY-SA-3.0 Steve Cook]

You may recognise the ratio of the bars in these graphs from Pascal’s triangle, where each number is formed by summing together the two numbers diagonally above it:

Pascal's triangle [CC-BY-SA-3.0 Conrad Irwin]

Pascal’s triangle [CC-BY-SA-3.0 Conrad Irwin]

The probability distributions graphed above are three specific examples of the binomial distribution, which can be calculated in the general case using the following formula:

pmf(k, n, P) = \binom{n}{k} P^k (1-P)^{n-k} = \frac{n!}{k!(n-k)!} P^k (1-P)^{n-k}
  • pmf(k, n, P) is the probability of getting k heads in a particular sample of size n, where the probability of a head on any particular flip is P.
  • P is the long-run probability of the coin giving you a head, i.e. the ratio of the number of heads to the total number of coin-flips for some extremely large sample. For a fair coin, this will be 0.5.
  • n is the number of coin-flips you have sampled.
  • \binom{n}{k} is the binomial coefficient from Pascal’s triangle, which can be calculated from scratch using factorials (3! = 3×2×1, etc.) if you are a masochist.

For a fair coin, P = 0.5, and this formula simplifies to:

pmf(k, n) = \binom{n}{k} 0.5^n
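As a quick sanity check, this matches the four-flip enumeration above, where 6 of the 16 equally likely outcomes contained exactly two heads:

choose( 4, 2 ) * 0.5^4   # 6 ways of getting 2 heads, each with probability 0.0625
dbinom( 2, size=4, prob=0.5 )

Both should give 0.375.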

To see whether a coin is fair, we would like to know how likely it is that the number of heads we get in a sample is consistent with a coin that gives heads 50% of the time and tails 50% of the time. We are much more likely to be convinced that getting 10 heads in 30 flips (33% heads) means the coin is biased, than we would be by getting 0 heads in 1 flip, even though that’s a superficially much more damning 0% heads. So it is clear that our opinion about the coin’s fairness will depend critically on n, the number of coin-flips we collect. This is called the sample size.

In this sort of statistical testing, the first thing we do is to define a null hypothesis. In this case, our null hypothesis is that the coin is fair. Under the null hypothesis, the probability of getting a head on a particular flip is 0.5.

We then assume that the null hypothesis really is true, and we ask ourselves: what probability distribution of results would we get, assuming that the null hypothesis is true? For the coin-flip, we’d get a binomial distribution of results: the number of heads in a sample of size n would be modelled by the equation shown above.

We then collect our sample of coin-flips. Let’s say we collect 30 in this case, so n = 30. We can then generate the precise probability distribution for a fair coin, flipped 30 times. It looks like this:

Probability distribution for thirty flips of a fair coin [CC-BY-SA-3.0 Steve Cook]

Code for (nearly) this graph is:

P<-0.5
n<-30
k<-c(0:n)
pmf<-dbinom(k, size=n, prob=P)
names(pmf)<-k
barplot( pmf, space=0, xlab="Number of heads", ylab="pmf" )

The function dbinom(k, size=n, prob=P) is R’s implementation of the function pmf( k, n, P ) discussed above: it gives you the probability of k heads in a sample of size n with a coin producing heads at prob P. Here we pass it a vector of values of k between 0 and n, which returns a vector containing the whole probability distribution. The names function can be used to give names to the items in the probability distribution, which is useful for labelling the x-axis of a barchart.

Let’s say that only 10 of the 30 flips were heads. As you can see from the graph above, this is far into the left tail of the probability distribution. The vast majority of possible outcomes of 30 coin-flips contain more than 10 heads. It is therefore pretty unlikely that 10 heads in 30 flips is consistent with the null hypothesis that the coin is unbiased. We would probably be happy to reject the null hypothesis, but can we be a little more objective (or at least consistently subjective!) about the criterion for rejection?

Yes, we can. We can work out from the model exactly how likely it is that 30 coin-flips would produce 10 or fewer heads. It’s important that we include ‘or fewer’, because if we’re convinced of bias by 10 heads out of 30, we’d be even more convinced by 5 out of 30, or 0 out of 30. This can be worked out from the cumulative distribution function for the binomial distribution. This is found for 10 heads by simply summing up the probabilities for 10, 9, 8 … 2, 1, or 0 heads; and more generally by summing up the probabilities “below” a given number of heads. Here is the cumulative distribution for 30 coin-flips:

Cumulative distribution for thirty flips of a fair coin [CC-BY-SA-3.0 Steve Cook]

0.04937 is just slightly less than 0.05, or 5%, or 1-in-20. Assuming our coin is fair, there is only a 1-in-20 chance that we would get 10 or fewer heads in a sample of 30. We call this probability the p value. On both graphs, I’ve marked the region where the p value is less than 0.05 in red. Note that the p value is not the same thing as the probability of the coin producing a head, which I’ve symbolised as P above.
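You don’t have to sum the pmf by hand: pbinom() is the cumulative counterpart of dbinom(), and gives this p value directly:

pbinom( 10, size=30, prob=0.5 )   # 0.04937: the probability of 10 or fewer heads in 30 flips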

In scientific writing, the results of statistical tests with p values less than 0.05 are called statistically significant, and those with p values greater than 0.05 statistically insignificant. This cut-off at 0.05 is essentially arbitrary, and sometimes 0.01 or 0.001 are used instead, but 0.05 is used fairly widely in science as the threshold for “significant” data.

The concept of p-values and the use of statistical significance as a way of analysing data were first proposed  by Ronald Fisher. Fisher’s “significance testing” approach tends to get confused (even by scientists!) with the conceptually different “hypothesis testing” approach of Jerzy Neyman and Egon Pearson, which we’ll visit in a later post. It’s important to know what a p value is to at least give you a chance of avoiding this confusion yourself.

The p value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.

It is not the probability that the null hypothesis is true, or that the alternative hypothesis is false. It is not the probability the results are a fluke. It is not even the probability that you have rejected the null hypothesis incorrectly. All the p value tells you for this experiment is how probable it is that 10 or fewer heads would be observed from 30 flips of a fair coin: the answer is “not very”.

To sum up:

  • The null hypothesis was that the coin was fair, so we can therefore assume  that the probability of any one flip coming up heads is 0.5.
  • The test statistic we have obtained from our data is simply the number of heads, which is 10. In other kinds of statistical test, the test statistic will usually be something more complex.
  • The probability of obtaining 10 heads was worked out using a reference distribution, the binomial.
  • Generating this reference distribution requires one additional parameter, and that is the sample size, n, which was 30.
  • If we follow this procedure, we get a p value less than 0.05, and therefore we can – perhaps! – say that the data we have collected are inconsistent with the null hypothesis of a fair coin.

All this rigmarole sounds like a lot of effort, and it is. This is why you’re using R. The whole of the analysis above can be done automatically for you by simply typing:

binom.test( 10, 30, p=0.5, alternative="less" )
        Exact binomial test
data:  10 and 30
number of successes = 10, number of trials = 30, p-value = 0.04937
alternative hypothesis: true probability of success is less than 0.5
95 percent confidence interval:
 0.0000000 0.4994387
sample estimates:
probability of success 
             0.3333333

This carries out a binomial test for 10 ‘successes’ (heads) in 30 ‘trials’ (coin-flips), assuming that the coin is fair. The p argument in the call to the function is the probability of success in a single trial, which we called P above, not the p value. The letter “p” tends to get rather overloaded with conflicting meanings in statistics, so be careful about what it is being used to symbolise in any given circumstance. We also specify that we’re only interested in whether there are less (fewer!) successes than we’d expect. You’ll note the resulting p value is as we calculated above.

The test described above is a one-tailed test, as we have only checked to see whether the coin is biased towards tails. In general, we’d use a two-tailed test, because a coin that produces an excess of heads is just as biased as one that produces an excess of tails, and – without good reason to suspect bias in one direction or the other – we should be prepared for either form of bias. This changes the analysis a little, as we’d be looking at the tails on both sides of the probability distribution, but the principles are the same.
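For the coin data above, the two-tailed test is simply the default behaviour of binom.test():

binom.test( 10, 30, p=0.5 )   # alternative="two.sided" is the default

Because the distribution under the null hypothesis is symmetrical, the two-tailed p value is double the one-tailed one (about 0.099), and 10 heads in 30 flips would no longer be judged significant at the 0.05 level.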

Other kinds of statistical test, such as F, and t, and χ² are more complex, and – unlike this exact test – they only estimate p values based on some additional assumptions, rather than calculate them exactly from first principles. One reason for this is that – for example – calculating precise binomial p values for large values of n requires creating some pretty enormous numbers (because of all those factorials: 30! = 265252859812191058636308480000000). This becomes impractically slow and/or inaccurate, making it more practical to use approximations that are more computationally tractable. However, inexact tests are still doing similar things under the hood. They compare a test statistic (e.g. t) derived from your data, to a reference distribution (e.g. Student’s t) that you’d expect to get assuming the null hypothesis is true (“the means of two samples are the same”), and accounting for any important parameters of the data (degrees of freedom). They then return a p value that tells you how probable it is that a test statistic at least as extreme as the one you actually got would have been observed, assuming the null hypothesis is true.

A very reasonable question to ask at the end of all this is: how many times should you flip the coin to see if it is biased? That very reasonable question unfortunately has a very long answer, and will require another post!

Exercises

  1. But what if you got 3 heads in 10?
  2. Or 8 heads in 40?
  3. And what if you didn’t care whether the coin was biased to heads or to tails, for 8 heads in 40?
  4. What is the largest number of heads you would consider significant at p=0.05, for a sample size of 40 flips? You’ll need the qbinom function, which will return the smallest value of k for which the cumulative distribution function is larger than your p value.
  5. Produce a barchart of the probability distribution for n=40, like the one shown in the post for n=30. For extra happiness, colour the left and right p=0.025 tails red. You’ll need the col argument to the barplot function, which can take a vector of colours such as "red" and "white" as its value.
  6. Produce a barchart of the cumulative distribution for n=40, like the ones shown in the post for n=30. For extra happiness, colour the left and right p=0.025 tails red. You’ll need the pbinom function, which will return the cumulative distribution function for each value of k.

Answers

  1. 3 in 10, one-tailed: not significantly less than what you’d expect, given the null.
binom.test( 3, 10, p=0.5, alternative="less" )
        Exact binomial test
data:  3 and 10
number of successes = 3, number of trials = 10, p-value = 0.1719
alternative hypothesis: true probability of success is less than 0.5
95 percent confidence interval:
 0.0000000 0.6066242
sample estimates:
probability of success 
                   0.3
  2. 8 in 40, one-tailed: significantly less than you’d expect, given the null.
binom.test( 8, 40, p=0.5, alternative="less" )
        Exact binomial test
data:  8 and 40
number of successes = 8, number of trials = 40, p-value = 9.108e-05
alternative hypothesis: true probability of success is less than 0.5
95 percent confidence interval:
 0.0000000 0.3320277
sample estimates:
probability of success 
                   0.2
  3. 8 in 40, two-tailed: [still] significantly different from what you’d expect, given the null. Note that the p value is double the p value from the previous answer: this is because the 5% is now divided half-and-half between the ‘extremely few heads’ tail of the probability distribution, and the ‘extremely many heads’ tail. This makes it harder for a two-tailed test to give you a significant p value than a one-tailed test. Consequently – as we scientists are a conservative lot – we generally use the two-tailed version of tests unless there is some extremely good reason not to.
binom.test( 8, 40, p=0.5 )
        Exact binomial test
data:  8 and 40
number of successes = 8, number of trials = 40, p-value = 0.0001822
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.09052241 0.35647802
sample estimates:
probability of success 
                   0.2
  4. The largest number of heads in a sample of 40 that would still be significant at p=0.05 is either 14 (for a one-tailed test) or 13 (for a two-tailed test). Note that the qbinom function returns the smallest value of k for which cdf(k, n, P) ≥ p, i.e. it returns the next value of k ‘up’ from the number of heads we’d consider suspicious.
qbinom(0.05, size=40, prob=0.5)
15
qbinom(0.05/2, size=40, prob=0.5)
14
  5. Graph of 40-flip PMF. We set up a vector called colorcode containing 41 copies of the word “white”, and then modify this to “red” for the items in colorcode whose corresponding values of k have a CDF of less than 0.025. Note the use of a conditional in the subscript call to colorcode, and that because we want both tails, we need those values with CDFs less than 0.025 (or greater than 0.975).
P<-0.5
n<-40
k<-c(0:n)
p.crit<-0.05/2

pmf<-dbinom(k, size=n, prob=P)
names(pmf)<-k

# Values of k below k.crit have a cumulative probability of less than p.crit
k.crit<-qbinom(p.crit, size=n, prob=P)

colorcode<-c( rep("white", n+1) )
colorcode[ k < k.crit ]<-"red"
colorcode[ k > n-k.crit ]<-"red"

barplot( pmf, space=0, xlab="Number of heads", ylab="pmf", col=colorcode )

Probability distribution for forty flips of a fair coin [CC-BY-SA Steve Cook]

  6. Graph of 40-flip CDF. Note that the p-value for 13-or-fewer plus 27-or-more is actually a fair bit less than 0.05 (it’s about 0.038), but because the distribution is symmetrical, if we took the next two values (14 and 26) and added them to the tails too, we’d get a p value of 0.081, which is larger than 0.05. As the distribution is discrete, we can’t do anything but err on the side of caution and exclude 14 and 26 from the tails.
P<-0.5
n<-40
k<-c(0:n)
p.crit<-0.05/2

cdf<-pbinom(k, size=n, prob=P)
names(cdf)<-as.character(k)

k.crit<-qbinom(p.crit, size=n, prob=P)

colorcode<-c( rep("white", n+1) )
colorcode[ k < k.crit ]<-"red"
colorcode[ k > n-k.crit ]<-"red"

barplot( cdf, space=0, xlab="Number of heads", ylab="cdf", col=colorcode )
segments( 0, p.crit, n+1, p.crit, col="red" )
segments( 0, 1-p.crit, n+1, 1-p.crit, col="red" )

Cumulative distribution for forty flips of a fair coin [CC-BY-SA-3.0 Steve Cook]
Next up… Statistical power (when I’ve written it. For now skip to the test)

Feb 03

Comparison of means: the t test

A t test is used to compare the means of two data sets, and is calculated as follows:

t = \frac{ | \bar{x}_1 - \bar{x}_2 | }{ \sqrt{ \frac{ (n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 }{ n_1 + n_2 - 2 } \cdot ( \frac{1}{n_1} + \frac{1}{n_2} ) } }

Where x̅, s and n are the mean, standard deviation and sample size of the two data sets.

A t test can also be used to compare the mean of a single data set to a reference value, xref:

t = \frac{ | \bar{x} - x_{ref} | }{ \sqrt{ \frac{s^2}{n} } }

The test statistic derived from the two data sets is called t, and it is defined as the difference between the two means x̅1 and x̅2 (or the difference between a mean x̅ and a reference value, xref) divided by the pooled standard error of the mean(s).

The standard error is a measure of the unreliability of a statistic. The standard error of a mean is:

SE_x = \sqrt{ \frac{s^2}{n} }

The unreliability of an estimate of the population mean increases with the variance of the sample you are using to estimate it (hence the variance, s2, in the numerator) but decreases with sample size (hence n in the denominator).

The reference distribution for the t test is Student’s t. Unlike Fisher’s F, this reference distribution has just one parameter: the total number of degrees of freedom remaining in the two data sets. As you are comparing one mean from each data set, the remaining degrees of freedom for each data set is n−1, and the total number of degrees of freedom is n1+n2−2.

A t test compares the t statistic from your empirical data with the Student’s t value you’d expect under the null hypothesis, given the degrees of freedom in the two data sets. The p value of the test is the probability of obtaining a test t statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis (“the means are equal”) is true. i.e. The p value is the probability of observing your data (or something more extreme), if the means really are equal.
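To make this concrete, here is a minimal sketch that computes Student’s t and its two-tailed p value by hand from the formula above, and checks them against R’s built-in test (the data are made up for illustration):

x1<-c(4.1, 5.2, 6.3, 5.5, 4.8)
x2<-c(6.9, 7.4, 6.1, 7.8, 7.0)
n1<-length(x1); n2<-length(x2)
# Pooled variance, weighting each sample's variance by its degrees of freedom
pooled.var<-( (n1-1)*var(x1) + (n2-1)*var(x2) ) / ( n1 + n2 - 2 )
t.stat<-abs( mean(x1) - mean(x2) ) / sqrt( pooled.var * ( 1/n1 + 1/n2 ) )
# Two-tailed p value from Student's t with n1+n2-2 degrees of freedom
2 * pt( t.stat, df=n1+n2-2, lower.tail=FALSE )
# Should agree with the t and p reported by:
t.test( x1, x2, var.equal=TRUE )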

The comparison is only valid if the sample data sets are:

  • Representative of the larger population.
  • Normally distributed. If the normality assumption is badly violated, a non-parametric test such as the Wilcoxon rank-sum (“Mann-Whitney U”) test, using wilcox.test(), might be appropriate.
  • Independent of one another. A modification of the t test can be made for certain kinds of dependent data (e.g. where the samples in each group fall into natural pairs).
  • Homoscedastic; that is, they have the same variance. You can test for this assumption using the F test. If the two samples have different variances, they are heteroscedastic. The Welch modification of Student’s t test can be used if this assumption is broken.

Do not use a Student’s t test if these assumptions are broken.

In R, a t-test is performed using t.test(). Here we generate thirty items of random normal data using rnorm() and compare two such data sets. We add 10 to all the data in dataset2 to make the means differ. The output of your code will differ as the data is random (if you’re very un/lucky, the difference may not even be significant!):

dataset1 <-rnorm(30)
dataset2 <-10+rnorm(30)
t.test( dataset1, dataset2 )
        Welch Two Sample t-test
data: dataset1 and dataset2
t = -46.0075, df = 56.604, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -10.444369  -9.572988
sample estimates:
  mean of x   mean of y 
 0.01425418   10.02293268

Here the p value is less than 0.05, so we can say the difference in means is significant.

In R, the default t.test() actually assumes the variances are unequal (heteroscedastic), as this is very common, and it automatically uses Welch’s modification for unequal variances, which is rather less gnarly to look at in equation form:

t = \frac{ | \bar{x}_1 - \bar{x}_2 | }{ \sqrt{ \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} } }

The degrees of freedom for Welch’s modification are no longer an integer, but instead:

 df = \frac{ ( s_1^2/n_1 + s_2^2/n_2 )^2 }{ \frac{(s_1^2 / n_1)^2}{ n_1 - 1} + \frac{ (s_2^2 / n_2)^2 }{ n_2 - 1} }

You can force R to use Student’s original t test for homoscedastic data by setting the option var.equal=TRUE.

# The code below should only be used if an F test indicates the variances of
# the two data sets are equal

t.test( dataset1, dataset2, var.equal=TRUE )

# The code below has the same effect as calling
t.test( dataset1, dataset2 )

# i.e. it uses Welch's modification to account for the difference in variances
t.test( dataset1, dataset2, var.equal=FALSE)

In general, you should use an F test to see whether the variances of your two data sets are equal: if so, use Student’s t (var.equal=TRUE); if not, use Welch’s modification of t, (var.equal=FALSE).
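That recipe can be written out explicitly; a minimal sketch, using the p value stored in the object that var.test() returns:

# Use Student's t if the F test gives no evidence of unequal variances,
# and Welch's modification otherwise
f.result<-var.test( dataset1, dataset2 )
t.test( dataset1, dataset2, var.equal=( f.result$p.value > 0.05 ) )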

As with plot(), it is usually more convenient to use the ‘modelled-by’ tilde ~ operator to perform t tests within named data frames, rather than supplying the test with two vectors of data.

Here we t test the dog-whelk data in dog_whelks.csv that you already made in the exercises of an earlier post. As we have already established that the variances are not significantly different, we should use Student’s t rather than the Welch modification:

dog.whelks<-read.csv( "H:/R/dog_whelks.csv" )
t.test( Height ~ Exposure, data = dog.whelks, var.equal=TRUE )
        Two Sample t-test
data:  Height by Exposure
t = -8.6399, df = 54, p-value = 9.274e-12
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -9.989079   -6.226306
sample estimates:
  mean in group Exposed   mean in group Sheltered       
  20.19231                28.30000

You can read this as:

Do a t test on the data in the dog.whelks data frame, using Height as the dependent variable, and grouping the data into two sets according to the Exposure factor

You should never put raw R output into written prose. If you wish to include the results of the F and t tests on the dog-whelk data above in a report, they should be written long-hand. In typical scientific literature, this might be something like…

Dog-whelk shell-height on exposed beaches was no more variable than on sheltered beaches (F=0.97, p=0.95), based on samples of 26 and 30 shells. However, the shells on sheltered beaches were found to be significantly taller, by 8 mm (t=8.64, p ≪0.001).

…although this style of report mixes up Fisher and Neyman-Pearson approaches to hypothesis testing: by reporting a p value, you’re probably trying to persuade people of the importance of the ‘highly significant’ t test result (Fisher); however you’re also reporting your acceptance of the alternative hypothesis (Neyman-Pearson) under which the p value is irrelevant (except insofar as it is smaller than 0.05), but the power (which you’ve not stated, let alone used to determine an appropriate sample size!) is critical to the interpretation of the results. Ho hum.
Dog whelks boxplot [CC-BY-SA-3.0 Steve Cook]

You can select the heights of the sheltered-beach shells by using tapply(). This applies a function to a vector of data, grouped by another vector of the same length containing the levels of a factor.

N.B. The “t” in tapply() has nothing to do with the “t” in t test: there are several other apply() functions with similarly cryptic names.

Remembering that the dog.whelks data frame looks like this:

   Height  Exposure
1      25 Sheltered
2      30 Sheltered
3      29 Sheltered
…
54     20   Exposed
55     24   Exposed
56     30   Exposed

We can apply the mean() function to the Height column (dog.whelks$Height) grouped by the Exposure factor (dog.whelks$Exposure) using this syntax:

# tapply( vector.to.group, factors.to.group.by, function.to.call )
means<-tapply( dog.whelks$Height, dog.whelks$Exposure, mean )
means["Sheltered"]-means["Exposed"]
Sheltered 
 8.107692

Shells from sheltered beaches are 8 mm taller. There are several simpler ways of working this out, e.g. using the linear model function lm(), which we will cover in later posts.

lm( Height ~ Exposure, data = dog.whelks )

To check the normality assumption for both F and t tests holds, you can use a qqnorm() plot and/or a hist() plot.

The normal quantile-quantile (QQ) plot plots the real data against theoretical normally-distributed data of the same mean and standard error. If the real data are normally distributed, then the QQ plot should be a straight line. Curved or S-shaped QQ plots are indicative of non-normal data. Note the use of the [ condition ] indexing brackets to extract out from the vector only those items for which the Exposure factor is “Sheltered“: the overall data set will certainly not be normal, as the t-test shows the two exposures come from two different (normal) distributions:

qqnorm( dog.whelks$Height[ dog.whelks$Exposure=="Sheltered" ] )

Dog whelks normal QQ plot [CC-BY-SA-3.0 Steve Cook]

A histogram of the data should look like a normal bell-curve, without obvious tails (skew) or peakiness/flatness (kurtosis); however, for small data sets, the histogram is unlikely to be smooth and ‘perfect’ looking, as you can see below. For this reason, the normal QQ plot is generally the preferred way of seeing whether data are normal enough to be analysed using F and t tests.

hist( dog.whelks$Height[dog.whelks$Exposure=="Sheltered"] )

Dog whelks histogram [CC-BY-SA-3.0 Steve Cook]

You might like to plot a QQ plot and a histogram of the “Exposed” portion of the data set, and perhaps look at a histogram of the entire data set, to confirm that it is indeed composed of two overlapping (approximately) normal curves.
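Those plots are direct parallels of the code above, along the lines of:

qqnorm( dog.whelks$Height[ dog.whelks$Exposure=="Exposed" ] )
hist( dog.whelks$Height[ dog.whelks$Exposure=="Exposed" ] )
# The complete data set: two overlapping (approximately) normal curves
hist( dog.whelks$Height )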

Some things to bear in mind when writing summaries of statistics in prose:

  • Do not include raw R output.
  • Include the name of the test you performed, the test statistic and its p value.
  • R will usually give you the p value itself, or – as here – an indication that it is lower than the minimum value it can calculate. There is no need to additionally look up your t statistic (8.64) in tables to confirm that this is indeed associated with a p value below the critical value of your choice. This makes about as much sense as performing the calculation 34378 times 83623 on a calculator, ignoring the result, and then using a slide-rule to estimate the product instead.
  • Don’t simply say a difference in means investigated by a t test is “significant”. Say which level of the factor is associated with the larger or smaller mean: it’s not just that the dog-whelk heights were different, the dog-whelks on sheltered beaches were – on average – taller. This is acceptable, as a one-tailed t test (which would test for ‘taller’ or ‘shorter’) will always have a p value less than 0.05 if the corresponding two-tailed t test  has a p value less than 0.05.
  • More generally, make sure the reader knows why you’ve done the statistics: it’s too easy to forget that the whole point is to see whether some dependent variable of interest (dog-whelk shell height) can be modelled/explained using an independent variable of interest (type of beach the shell is found on).
  • Give an indication of the sample size.

Exercises

Use F and t tests to investigate the following data sets. For each data set, write a short piece of prose describing the results of the statistical analysis. Plot suitable graphs for each set to accompany the text.

  1. The data in pollen_tubes.csv show the growth rate of pollen tubes from Impatiens pollen, grown in an agar-solidified basal medium containing sucrose, vs. the same medium with borate and calcium supplementation.
  2. The data in pipette_check.csv show the mass (g) of water measured when two brands of 1 mL adjustable pipettes are set to their maximum volume and used to pipette pure water. Which brand of pipette is better?

Answers

  1. The rate of growth of germinating pollen tubes (n=15 per treatment) in supplemented medium was found to be significantly larger (t=3.73, p=0.00085). The growth rate (mean±sd / µm hr−1) was increased from 560(±224) in the basal medium to 891(±261), but no effect was observed on the variance (F=0.73, p=0.56).

Pollen tubes boxplot [CC-BY-SA-3.0 Steve Cook]

pollen.tubes<-read.csv( "H:/R/pollen_tubes.csv" )
var.test( Growth.Rate ~ Medium, data = pollen.tubes )
        F test to compare two variances
data:  Growth.Rate by Medium
F = 0.7309, num df = 14, denom df = 14, p-value = 0.5653
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.2453878 2.1770768
sample estimates:
ratio of variances 
         0.7309091
t.test( Growth.Rate ~ Medium, data = pollen.tubes, var.equal=TRUE )
        Two Sample t-test
data:  Growth.Rate by Medium
t = -3.7345, df = 28, p-value = 0.0008522
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
    -514.1032 -149.8968
sample estimates:
    mean in group Basal       mean in group Supplemented  
    559.3333                  891.3333
tapply( pollen.tubes$Growth.Rate, pollen.tubes$Medium, mean )
       Basal        Supplemented 
       559.3333     891.3333
tapply( pollen.tubes$Growth.Rate, pollen.tubes$Medium, sd )
       Basal        Supplemented 
       223.7389     261.7037
# Plot the data with suitable title
plot( 
    Growth.Rate ~ Medium,
    data = pollen.tubes,
    xlab = "Growth medium",
    ylab = expression("Growth rate / µm " * hr^-1),
    main= "Pollen tubes grow more quickly in borate/calcium-supplemented medium"
)
  2. The Piprecision pipettes have a variance at least twice as high as that of the Accurette model (F=0.23, p=0.00016), indicating the Accurette pipettes are more precise. No difference in accuracy was observed (t=0.48, p=0.63) across the two brands.

Pipette boxplot [CC-BY-SA-3.0 Steve Cook]

pipette.check<-read.csv( "H:/R/pipette_check.csv" )

var.test( Mass ~ Brand, data=pipette.check )
        F test to compare two variances
data:  Mass by Brand
F = 0.2298, num df = 29, denom df = 29, p-value = 0.0001639
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.1093666  0.4827644
sample estimates:
  ratio of variances         
  0.2297789
t.test( Mass ~ Brand, data=pipette.check )
        Welch Two Sample t-test
data:  Mass by Brand
t = -0.4835, df = 41.659, p-value = 0.6313
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.03967369  0.02434036
sample estimates:
  mean in group Accurette  mean in group Piprecision 
                 0.999000                   1.006667
tapply( pipette.check$Mass, pipette.check$Brand, mean )
  Accurette  Piprecision 
   0.999000     1.006667
tapply( pipette.check$Mass, pipette.check$Brand, sd )
  Accurette  Piprecision 
 0.03754078  0.07831560
# Plot the data with suitable title
plot( 
    Mass ~ Brand,
    data = pipette.check,
    xlab = "Pipette brand",
    ylab = "Mass of water in 1 mL measurement / g",
    main="Both brands of pipette are accurate, but Piprecision is less precise than Accurette"
)

Next up… Linear regression.

Feb 03

Comparison of variances: the F test

An F test is used to compare the variances of two data sets:

  • As it is used to compare variances, the dependent data must – by definition – be numeric.
  • As it is used to compare two distinct sets of data, these sets represent the two levels of a factor.

The test statistic we use to compare the variances of the two data sets is called F, and it is defined very simply: F is the larger of the two variances divided by the smaller.

F = \frac{s_1^2}{s_2^2}

The reference distribution for the F test is Fisher’s F distribution. This reference distribution has two parameters: the number of degrees of freedom in the first data set, and the number of degrees of freedom in the second data set.

An F test compares the F statistic from your experimental data with the Fisher’s F value under the null hypothesis for the given degrees of freedom. The p value of the test is the probability of obtaining a test F statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis (“the variances are equal”) is true. i.e. the p value is the probability of observing your data (or something even more extreme), if the variances really are equal.
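As a sketch of what the test is doing under the hood (illustrative random data; var.test(), used below, automates all of this):

x1<-rnorm(30)
x2<-rnorm(30)
# F is the larger of the two variances divided by the smaller
f.stat<-max( var(x1), var(x2) ) / min( var(x1), var(x2) )
# Two-tailed p value from Fisher's F, with n-1 degrees of freedom for each sample
2 * pf( f.stat, df1=length(x1)-1, df2=length(x2)-1, lower.tail=FALSE )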

The comparison is only valid if the sample data sets are:

  • Both representative of the larger population. The larger the data set, the more likely this is to be true; however, it is also critical that the data be collected in an unbiased way, i.e. with suitable randomisation.
  • Both normally distributed. If you plot the data as box-and-whisker plots, gross deviations from normality are often obvious. You can also use normal quantile-quantile (QQ) plots to examine the data: help(qqnorm)
  • Independent of one another, i.e. there is no reason to suppose that the data in the first set are correlated with the data in the second set. If the data represent samples from the same organism at two different times, there is every reason to suppose they will be correlated. If the data represent paired samples, e.g. from one male child and from one female child, repeated across many families, there is again every reason to suppose they will be correlated.

Do not use an F test if these assumptions are broken.

In R, an F-test is performed using var.test(). Here we generate thirty items of random normal data using rnorm() and compare two such data sets. The output of your code will differ as the data is random:

dataset1<-rnorm(30)
dataset2<-rnorm(30)
var.test( dataset1, dataset2 )
        F test to compare two variances
data: dataset1 and dataset2
F = 1.7706, num df = 29, denom df = 29, p-value = 0.1298
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.8427475  3.7200422
sample estimates:
ratio of variances
1.770609

Here the p value is greater than 0.05, so we say the difference in variances is not significant.

As with plot(), it is usually more convenient to use the ‘modelled-by’ tilde ~ operator to perform F tests within named data frames, rather than supplying the test with two vectors of data. Here we use the dog_whelks.csv data again.

dog.whelks<-read.csv( "H:/R/dog_whelks.csv" )
var.test( Height ~ Exposure, data = dog.whelks )
        F test to compare two variances
data:  Height by Exposure
F = 0.9724, num df = 25, denom df = 29, p-value = 0.9504
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.4540033 2.1297389
sample estimates:
ratio of variances 
    0.9724247

You can read this as:

“Do an F test on the data in the dog.whelks data frame, using Height as the dependent variable, and grouping the data into two sets according to the Exposure factor”

Exercises

See the next post on the t test.

Jan 30

Organism of the week #19 – Bird-brained humans

Most people who want to attract birds to their gardens put out bird-seed, or – if they’re really adventurous – mealworms. However, on the East London dockside where I lived until recently, one of the residents strewed the quayside with livers in December last year. Whether this was a deliberate ploy to attract carrion crows for bird-on-variety-meat action, I don’t know, but whether or not this was the intention, it was certainly the result.

Corvus corone [CC-BY-SA-3.0 Steve Cook]

Corvus corone, the carrion crow, pecking at a liver

Crows have a reputation as being very intelligent. I wouldn’t argue against this, but it’s interesting that intelligence is something humans tend to see in any animal that behaves somewhat similarly to them. As humans are highly social apes, most of the important things in their environment are other humans; and much of their neurological effort is therefore directed at predicting (and outwitting) the behaviour of other humans.

Above all else, humans consider their mental talents the sine qua non of human-ness: any old sparrow can walk on two legs, and even a bloody rat can carry off the last-chicken-in-the-shop look, but solving crossword puzzles and finding new ways of fucking over your conspecifics – now that’s what being human really means.

So perhaps it’s not terribly surprising that the animals humans see as intelligent – dolphins, chimps, dogs, crows, ‘even’ bees – are those that live in complex social groups, where knowing how to interpret and predict the behaviour of others is an important survival skill. Their communication, prediction, and manipulation remind humans of their own dubious talents. Humans delight in the familiarity they see in the activities of other intelligent animals, whilst smugly assuming that their human world is a tapestry beside which the scanty threads of bird-brains pall.

But it’s all survival, adaptation, drift. Species eke out a living in the niche in which they find themselves. The human niche is different from that of a liver-pecking crow, and the talents they require are different. But it takes a human to reinterpret ‘different’ as ‘better’. The real sine qua non of human-ness is self-aggrandisement, and compared to soaring on wings of midnight, that’s not a terribly edifying talent.
