But what would happen if dead nettles were such good fakers that they became very common? The bunnies would rarely meet the real thing, and would probably never learn to avoid nettle-like plants of any sort. Even if the bunnies did occasionally meet stinging nettles, those reckless bunnies that threw caution to the wind and ate things that looked like nettles would still tend to get more to eat than more cautious bunnies. In either case, the dead nettles would get nibbled back into relative rarity. And then the more reckless bunnies would get stung more often, as they’d meet real stinging nettles more frequently, and this would – in its turn – favour bunnies that were more cautious again, leading to a resurgence of the dead nettles. And so on, and so on.
The relative rarity of dead nettles and stinging nettles wouldn’t necessarily roller-coaster up and down like this: the cycles could be quite small. However, it’s interesting that neither a field of dead nettles on their own, nor of stinging nettles on their own, is stable. A field of nothing but stinging nettles is prone to invasion by fake dead nettles; but if the number of dead nettles gets too high, the bunnies will never meet the real thing, and won’t learn to avoid nettle-like plants in the first place. There is likely to be some ratio of real to fake nettles (and of cautious to reckless rabbits) that is stable in the long term, but it won’t be 0% or 100%.
These sorts of ‘game’ between mimics – the dead nettle “fakers” – and their models – the stinging nettle pain “makers” – are very common in biology, and are an important part of the ecology of many organisms. Wherever an organism has made some sort of ‘effort’, there is likely to be a living made scrounging off them, or mimicking their appearance.
But of course, it’s always a bit more complicated in biology. Not all mimics are fakes. Some mimics benefit from looking dangerous because they really are dangerous.
The honeybee and bumblebee in the image above both have black and yellow striped bodies. Both are able to sting, and both seem to have similar colours. Is one mimicking the other, and if so, why?
As I said earlier, you should be skeptical of plausible stories. Bumblebees and honeybees are quite closely related, so perhaps the black-and-yellow is just a colour-scheme they’ve inherited from their common ancestor that has nothing to do with mimicry. We need more evidence.
As it turns out, there is very good evidence that black-and-yellow is meaningful mimicry, not accidental similarity. For example, the cinnabar moth caterpillar in the third image is not closely related to the bees, so it is likely that this caterpillar’s colours have evolved independently from those of the bees. Can it sting? Not exactly, but it is poisonous, because it mostly eats ragwort, and it steals the ragwort’s poisons for its own defence. Any bird that has learnt to avoid black-and-yellow insects through unhappy run-ins with bees is likely to avoid this caterpillar too. Importantly, this works both ways: any bird that’s had a bad experience with cinnabar moth caterpillars is also likely to avoid bees (and wasps, and other similar insects).
This sort of mimicry, where makers – the animals and plants that can back up their threats – all come to have similar warning colours is called Müllerian mimicry. If you need any more convincing, it’s telling that there are also many Batesian fakers of the black-and-yellow “warning” colour-scheme too, like the harmless hoverfly shown in the fourth image.
The natural world is full of liars and cheats; except when it isn’t.
$PLACE_OF_WORK
.
I first arrived at what would become my workplace as a badly coiffured youth in 1995 to do a biology degree. South Kensington seemed a great improvement over Croydon, where I had endured my previous 18 years: there was a refreshing absence of casual street violence, and a greatly improved proximity to the grubby delights of Soho. At that time, my Hall of Residence was directly above the first-year lecture theatre, and in the same building as the Students’ Union. Despite this tempting proximity to cheap vodka – and even cheaper dates – I somehow managed to attend almost every lecture of my first year, aside from a week (a week!) of lectures on algae, which I traded for bossing Munchkins about in the Questor’s Theatre in Ealing. I met my personal tutor at least twice, survived two under-catered field-trips to somewhere, somewhere in a field in Berkshire, and made friends whom I treasure to this day.
Second-year forced me out into less convenient accommodation: an ill-conceived double-Georgian knock-through near Brompton Cemetery with 18 bedrooms, and anything up to 2 working bathrooms on any given day. Due to sometwo else both failing their first-year exams, I found myself promoted to Homosexual in Chief of the LGBT society, for which dubious honour I now have a pot behind the Union bar.
My final-year project on copper-tolerant fungi somewhere, somewhere in that field in Berkshire led to the offer of a PhD in wood preservation, which I leapt upon, having received no careers guidance whatsoever up to that point, and having begun to fear moving back to Croydon for want of any botanical PhD opportunities in London. My undergraduateship ended with a viva voce, upon which I thought hung the fate of my entire degree; in fact, I turned out to be a control, and what I had thought would be a bowel-loosening grilling turned out to be entirely unmemorable.
Like most postgraduate research degrees, mine was a heady mix of disappointment, poverty, and the growing realisation that week-day nights-out were incompatible with competent laboratory work. My department had moved out of the timeshare flat with the Students’ Union and into a brand-new building during the summer between my BSc and PhD, but someone had been a little unrealistic about the space available in the new labs. The first and second years of my PhD were spent trying not to poison myself with arsenic trioxide amongst a labyrinth of broken vacuum impregnators, quickfit glassware, and bottles of solvent with labels written in Linear A; the third and fourth years spent trying to fit research into the gaps between the demonstrating in lab practicals I had to do in order to have enough money to eat. Somehow I captured the heart of a young aeronautical engineer, who has miraculously put up with my questionable charms ever since.
I presented my ground-breaking findings on the bacterial biotransformation of an anti-sapstain chemical to a conference in glamorous Cardiff, and left it at that. My contribution to the greater knowledge of humankind will forever be a few grey literature conference proceedings, and a large blue book buried in quicklime below the College library.
Having drifted into a PhD, I continued on my under-thought career path by applying for a three-year post-doctoral position that combined part-time research with a part-time PGCE in secondary school education. In retrospect, combining the laugh-a-minute relaxation of academic research with the delights of herding teenagers through GCSEs may not have been the best life decision I’ve made. There were amusing moments – the attempts of year 7 students to embarrass me during sex-ed lessons were doomed from the start – but mostly it was exhausting and impossible. I somehow made it through to the other side, but with no interest whatsoever in ever darkening the door of a secondary school or research lab again. Fortunately, I had kept up a bit of lab demonstrating on the side, and had even been roped into giving a few first-year lectures in the twilight of my PhD. A temporary position opened up convening a first-year biology course, giving a few lectures, and running some of the practicals I’d been demonstrating for the best part of a decade. And so began a slow accretion from ‘stop-gap teaching gimp’ to ‘senior teaching fellow’.
Many of the staff who taught me as an undergrad have since retired or moved on; even the new-born building of 1998 is now old enough to legally have sex and drive a moped. Some 1700 students have learned – or at least endured – first-year molecular biology and enzymology with me, and the pile of marking in front of me (for which writing this banal drivel is the sort of displacement activity against which I’ve hypocritically warned those very students) probably contains the ten thousandth script I have scrawled with the Biros of judgement.
I probably ought to get back to it.
In confirmation of the universe’s pitiless malevolence, I now give the lectures on algae that I skived off in my first-year.
The UK’s track record at maintaining its biodiversity has been – to put it generously – somewhat patchy. We have wiped out a goodly swathe of our large mammals: brown bears, elks, lynxes, and wolves; we drove our blue backed stag beetles to oblivion; and Davall’s Sedge has not been spotted since 1930. One species that was formerly so common in the UK that Shakespeare felt the need to warn theatre-goers about its favoured nest-building materials is the red kite:
My Trafficke is sheetes: when the Kite builds, looke to lesser Linnen.
This beautiful bird was very nearly wiped out in the UK by the early 20th century; only a handful of breeding pairs were left by 1990, in – you guessed it – Wales. Its populations in southern Europe continue to decline, and it is still considered near threatened. However, since the 1990s, the red kite has been the target of a major reintroduction program in the UK, and in a few places they are once again a common sight, soaring on thermals and seeking out rabbits, carrion, and recently washed pillowcases.
A good place to see these impressive birds is the Chilterns, a range of chalk hills just north of London. I’m not generally a charismatic-megafauna kind of biologist, but getting close enough for even this somewhat blurry action shot was thrilling:
The kites particularly like to hang around on airfields, presumably on the look-out for tasty pilots. Their blasé attitude to the planes and gliders is amusing if you’re on the ground. It is somewhat less amusing when you meet them in the air, and they remind you in no uncertain terms that their lineage has been flying since before your lineage even took to the trees, let alone came back down from them.
Just ten stops down the District Line from $WORK lies the Royal Botanic Gardens Kew. The gardens have three enormous glasshouses, a number of smaller glasshouses, and 121 hectares of trees, beds and desperately awful architecture to explore. Unfortunately, it also has an entry fee (for non-concession adults) of £14.50, which is a little steep, and possibly one of the reasons that disappointingly few of my students seem to have visited it, despite its proximity.
My favourite indoor displays at Kew are the two rooms of carnivorous plants in the Princess of Wales Conservatory (don’t miss the newer cloud-forest full of Nepenthes), and the ever-changing contents of the Alpine and Waterlily Houses. The latter often has large clumps of sensitive plants (Mimosa pudica and relatives) to poke. I also enjoy the very Victorian approach to health-and-safety in the walkway at the top of the Palm House.
Botanerd highlights.
Small but perfectly formed, the Chelsea Physic Garden is one of the oldest botanic gardens in the world (Oxford, below, claims the top spot). It specialises in plants used by humans, including (when I last went) a special display of plant fibre ropes. Entry about £10.
Botanerd highlights.
I don’t remember the Royal Botanic Garden Edinburgh being this sunny either time we visited, but apparently it was on at least one trip. Unlike Kew, the glasshouses are squidged together, so if the weather’s misbehaving, your fern to rain ratio will be much higher than in London. Unfortunately, like Kew, at the time of writing, some of the glasshouses are shut for renovations. Console yourself with the fact that entry to the gardens themselves is free.
Botanerd highlights.
Sitting on the slopes of Montjuïc, just below the Olympic Stadium, the Jardí Botànic de Barcelona is my most recent bagging. Unlike the other gardens here, it is entirely outdoors, with no glasshouses, and therefore specialises in plants from Mediterranean scrub habitats like Chile, South West Australia and California. Entry fee to the gardens is a very reasonable €3.
Botanerd highlights.
The photo below of De Hortus Botanicus Amsterdam doesn’t do it justice, but it’s well worth the €8.50 entry. The glasshouses are very well laid out, and they have a very good selection of carnivorous plants, obscure ferns (including Marattia) and cycads.
Botanerd highlights.
Claiming to be the oldest botanic garden in the world (and I’ve no reason to doubt them!), the University of Oxford Botanic Garden is a snip at £4.50 entry, and has a good mixture of outdoor beds and glasshouses. The glasshouses are small, but absolutely rammed with stuff, including Pachypodium (below), assorted ferns, jade vines, a lovely Amorphophallus rivieri (well, lovely until you stick your nose over it), but – as it turns out – no Orchis fatalis.
Botanerd highlights.
The Botanischer Garten und Botanisches Museum Berlin-Dahlem claims to be the second-largest in the world (after Kew), and now has a dedicated moss garden (which unfortunately post-dates my visit) as well as the usual beds and (extensive) glasshouses. Entry fee is €6.
Botanerd highlights.
I didn’t quite make it into the San Francisco Botanical Garden, but perhaps one day I’ll return with more time, and having not been recently fleeced at the California Academy of Sciences ($30 entry!).
Brussels has a wholly confusing pair of botanic gardens, the National Botanic Garden of Belgium, which is just north of Brussels, and the Botanical Garden of Brussels, which sits on the real botanic garden’s old site in the middle of Brussels. I got the former mixed up with the latter, much to my disappointment. It’s perfectly pleasant, but not really a botanic garden.
Darwin’s House at Down in Kent has a small glasshouse with a good collection of carnivorous plants. Well worth a visit, and a wander down the sandwalk.
In particular, I’d love to know where I can see the following obscure corners of the vegetable empire:
Take this sea urchin. The orange pucker in the middle of the spines is its “around-the-bum”, although zoologists would insist on writing that in Greek as “periproct”. The bright orange ring-piece is characteristic of this species, and marks it out as Diadema setosum, rather than any of the less rectally blessed species of Diadema.
The butt-hole of an urchin is actually the second it will own, because urchins go through a metamorphosis that shames even that of a butterfly. The larva of an urchin looks not even a little bit like the adult…
…and the adult urchin develops like a well-organised tumour within the body of the larva. For this reason, the adult’s anus is an entirely different hole from the larva’s anus.
The way the larva’s original butt-hole forms as the fertilised egg develops turns out to be quite revealing. Surprisingly, it marks out sea urchins and their relatives – like sea cucumbers and starfish – as much closer relatives of yours and other backboned animals than they are of insects or worms or jellyfish, or indeed pretty much any other animal.
As a fertilised human or sea urchin egg divides, it forms a hollow ball of cells, somewhat like a football. Then, some of the cells on the surface fold in on themselves, forming a shape rather like what you get if you punch your fist into a half-deflated football. The dent drills its way through, and eventually opens out through the other side of the ball. What you end up with is a double-walled tube, with a hole at either end.
In humans and all other animals with backbones, and in the larva of sea urchins and starfish and sea cucumbers, the first hole – the one formed by the dent – becomes the anus; and the second hole – where the dent punches through to the other side – becomes the mouth.
In most other animals, the first hole becomes the mouth, and the second the anus (pedant alert: I’m glossing over some details here).
Humans and sea urchins develop arse-first. Or mouth-second, as zoologists would prudishly have it, preferably euphemised further by writing it in Greek. Humans and fish, and sea urchins and starfish are all “deuterostomes”.
The development of the chocolate starfish of a starfish and of the asshole of an ass hint at a deep evolutionary connection between two very different groups of animals. Enlightenment can be found in the most unexpected places.
Some important general considerations for fitting models of this sort include:
The data in enzyme_kinetics.csv gives the velocity, v, of the enzyme acid phosphatase (µmol min^{−1}) at different concentrations of a substrate called nitrophenolphosphate, [S] (mM). The data can be modelled using the Michaelis-Menten equation given at the top of this post, and nonlinear regression can be used to estimate K_{M} and v_{max} without having to resort to the Lineweaver-Burk linearisation.
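For reference, the Michaelis-Menten equation in its standard form is written out here. The symbols match the formula v ~ vmax * S / (KM + S) that is passed to nls() in this post, with [S] as the substrate concentration:

```latex
v = \frac{v_{max}\,[S]}{K_{M} + [S]}
```

At [S] = K_{M} the fraction reduces to v_{max}/2, which is why K_{M} can be read off a plot as the substrate concentration giving half the maximal velocity.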
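In R, nonlinear regression is implemented by the function nls(). It requires three parameters: the equation to be fitted (written as a formula), the data frame holding the data (data), and a set of starting estimates for the parameters being fitted (start).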
Fitting a linear model (like linear regression or ANOVA) is an analytical method. It will always yield a globally optimal solution, i.e. a ‘perfect’ line of best fit, because under the hood, all that linear regression is doing is finding the minimum on a curve of residuals vs. slope, which is a matter of elementary calculus. However, fitting a nonlinear model is a numerical method. Under the hood, R uses an iterative algorithm rather than a simple equation, and as a result, it is not guaranteed to find the optimal curve of best fit. It may instead get “jammed” on a local optimum. The better the starting estimates you can give to nls(), the less likely it is to get jammed, or – indeed – to charge off to infinity and not fit anything at all.
For the equation at the start of this post, the starting estimates are easy to read off the plot:
enzyme.kinetics <- read.csv( "H:/R/enzyme_kinetics.csv" )
plot(
  v ~ S,
  data = enzyme.kinetics,
  xlab = "[S] / mM",
  ylab = expression(v/"µmol " * min^-1),
  main = "Acid phosphatase saturation kinetics"
)
The horizontal asymptote is about v=9 or so, and therefore K_{M} (value of [NPP] giving v_{max}/2) is about 2.
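If you would rather not judge the starting estimates by eye, the same reasoning can be roughed out in code. This is only a sketch under stated assumptions: the S and v values below are invented stand-ins (not the contents of enzyme_kinetics.csv), and vmax.start and KM.start are names made up for the illustration.

```r
# Hypothetical stand-in data: these are NOT the values in enzyme_kinetics.csv
S <- c(0.5, 1, 2, 4, 8, 16, 32)
v <- 12 * S / (3 + S)   # Michaelis-Menten curve with vmax = 12 and KM = 3

# The largest observed velocity is a (slightly low) guess at the asymptote
vmax.start <- max(v)

# Interpolate the substrate concentration at which v is half of that guess
KM.start <- approx(x = v, y = S, xout = vmax.start / 2)$y
```

Because the highest measured velocity usually sits below the true asymptote, both guesses tend to come out a little low, but they only need to be in the right neighbourhood for nls() to take over.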
The syntax for nls() on this data set is:
enzyme.model <- nls(
  v ~ vmax * S / ( KM + S ),
  data = enzyme.kinetics,
  start = c( vmax = 9, KM = 2 )
)
The parameters in the equation you are fitting using the usual ~ tilde syntax can be called whatever you like as long as they meet the usual variable naming conventions for R. The model can be summarised in the usual way:
summary( enzyme.model )
Formula: v ~ vmax * S/(KM + S)

Parameters:
     Estimate Std. Error t value Pr(>|t|)
vmax 11.85339    0.05618  211.00 7.65e-13 ***
KM    3.34476    0.03860   86.66 1.59e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.02281 on 6 degrees of freedom

Number of iterations to convergence: 4
Achieved convergence tolerance: 6.044e-07
The curve-fitting has worked (it has converged after 4 iterations, rather than going into an infinite loop), the estimates of K_{M} and v_{max} are not far off what we made them by eye, and both are significantly different from zero.
If instead you get something like this:
Error in nls( y ~ equation.blah.blah.blah, ): singular gradient
this means the curve-fitting algorithm has choked and charged off to infinity. You may be able to rescue it by trying nls(…) again, but with different starting estimates. However, there are some cases where the equation simply will not fit (e.g. if the data are nothing like the model you’re trying to fit!) and some pathological cases where the algorithm can’t settle on an estimate. Larger data sets are less likely to have these problems, but sometimes there’s not much you can do aside from trying to fit a simpler equation, or estimating some rough-and-ready parameters by eye.
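Retrying with different starting estimates can be automated. The sketch below is one possible way to do it, not the only one: it simulates some Michaelis-Menten data (the sim data frame and the list of candidate starts are invented for the example) and uses tryCatch() to move on to the next candidate whenever nls() throws an error.

```r
# Simulated stand-in data (assumed values, not enzyme_kinetics.csv)
set.seed(1)
sim <- data.frame( S = c(0.5, 1, 2, 4, 8, 16, 32) )
sim$v <- 12 * sim$S / ( 3 + sim$S ) + rnorm( nrow(sim), sd = 0.05 )

# Candidate starting estimates, tried in order: a wild guess, then a sensible one
starts <- list(
  c( vmax = 1000, KM = 0.001 ),
  c( vmax = 10, KM = 2 )
)

fit <- NULL
for ( s in starts ) {
  # If nls() fails (e.g. "singular gradient"), return NULL instead of stopping
  fit <- tryCatch(
    nls( v ~ vmax * S / ( KM + S ), data = sim, start = s ),
    error = function(e) NULL
  )
  if ( !is.null(fit) ) break   # stop at the first set of starts that converges
}
```

If fit is still NULL once the loop has finished, none of the candidates worked, and you are back to trying a simpler equation or estimating by eye.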
If you do get a model, you can use it to predict the y values for the x values you have, and use that to add a curve of best fit to your data:
# Create 100 evenly spaced S values in a one-column data frame headed 'S'
predicted.values <- data.frame(
  S = seq(
    from = min( enzyme.kinetics$S ),
    to = max( enzyme.kinetics$S ),
    length.out = 100
  )
)
# Use the fitted model to predict the corresponding v values, and assign them
# to a second column labelled 'v' in the data frame
predicted.values$v <- predict( enzyme.model, newdata = predicted.values )
# Add these 100 x,y data points to the graph, joined up by 99 line-segments
# This will look like a curve at the resolution of the graph
lines( v ~ S, data = predicted.values )
If you need to extract a value from the model, you need coef():
vmax.estimate <- coef( enzyme.model )[1]
KM.estimate <- coef( enzyme.model )[2]
This returns the first and second coefficients out of the model, in the order listed in the summary().
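Picking coefficients out by position works, but it silently breaks if the order of the parameters in the formula ever changes. A slightly more defensive sketch indexes them by name instead; the sim data and the sim.model name are assumptions made so the example runs on its own.

```r
# Simulated stand-in data (assumed values, not enzyme_kinetics.csv)
set.seed(2)
sim <- data.frame( S = c(0.5, 1, 2, 4, 8, 16, 32) )
sim$v <- 12 * sim$S / ( 3 + sim$S ) + rnorm( nrow(sim), sd = 0.05 )
sim.model <- nls( v ~ vmax * S / ( KM + S ), data = sim, start = c( vmax = 10, KM = 2 ) )

# coef() returns a named vector, so parameters can be looked up by name
vmax.by.name <- coef( sim.model )["vmax"]
KM.by.name <- coef( sim.model )["KM"]

# The standard errors live in the coefficient table held by summary()
std.errors <- summary( sim.model )$coefficients[ , "Std. Error" ]
```

The names are whatever you called the parameters in the formula, so the same lookup works for any model fitted with nls().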
Try fitting non-linear regressions to the data sets below
log(C) in your calls to plot() and nls().
bacterial.growth <- read.csv( "H:/R/bacterial_growth.csv" )
plot(
  N ~ t,
  data = bacterial.growth,
  xlab = "t / min",
  ylab = "N",
  main = "Bacteria grow exponentially"
)
head( bacterial.growth )
   t     N
1  0 0.032
2 10 0.046
…
We estimate the parameters from the plot: N_{0} is obviously 0.032 from the data above; and the plot (below) shows it takes about 20 min for N to increase from 0.1 to 0.2, so t_{d} is about 20 min.
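Starting estimates for an exponential model can also be obtained from a straight-line fit to the logged data, since ln(N) = ln(N0) + (ln 2 / td) t. The sketch below assumes some invented noise-free growth data (the growth data frame is not bacterial_growth.csv) and recovers N0 and td from the intercept and slope.

```r
# Hypothetical stand-in data: N0 = 0.03 and a doubling time of 25 min
growth <- data.frame( t = seq( 0, 90, by = 10 ) )
growth$N <- 0.03 * 2^( growth$t / 25 )

# On a log scale exponential growth is a straight line
log.fit <- lm( log(N) ~ t, data = growth )

# Intercept is ln(N0); slope is ln(2) / td
N0.start <- exp( coef( log.fit )[["(Intercept)"]] )
td.start <- log(2) / coef( log.fit )[["t"]]
```

With real, noisy data these values will not be exact, but they are usually close enough to hand to nls() as the start argument.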
bacterial.model <- nls(
  N ~ N0 * exp( ( log(2) / td ) * t ),
  data = bacterial.growth,
  start = c( N0 = 0.032, td = 20 )
)
summary( bacterial.model )
Formula: N ~ N0 * exp((log(2) / td) * t)

Parameters:
    Estimate Std. Error t value Pr(>|t|)
N0 3.541e-02  6.898e-04   51.34 2.30e-11 ***
td 2.541e+01  2.312e-01  109.91 5.25e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.002658 on 8 degrees of freedom

Number of iterations to convergence: 5
Achieved convergence tolerance: 1.052e-06
This bit adds the curve to the plot below.
predicted.values <- data.frame(
  t = seq(
    from = min( bacterial.growth$t ),
    to = max( bacterial.growth$t ),
    length.out = 100
  )
)
predicted.values$N <- predict( bacterial.model, newdata = predicted.values )
lines( N ~ t, data = predicted.values )
dose.response <- read.csv( "H:/R/dose_response.csv" )
plot(
  mu ~ log(C),
  data = dose.response,
  xlab = expression("ln" * "[" * NH[4]^"+" * "]" / µM),
  ylab = expression(mu/hr^-1),
  main = expression("Sigmoidal dose/response to ammonium ions in " * italic("Enterobacter"))
)
The plot indicates that the starting parameters should be roughly those passed to start in the call below: mumax = 0.7, an IC_{50} of about 3 (so ln(IC_{50}) = log(3)), and s = −0.7. The parameter is named logofIC50 in the formula to nls() below so it’s clear this is a parameter we’re estimating, not a function we’re calling.
dose.model <- nls(
  mu ~ mumax / ( 1 + exp( -( log(C) - logofIC50 ) / s ) ),
  data = dose.response,
  start = c( mumax = 0.7, logofIC50 = log(3), s = -0.7 )
)
summary( dose.model )
Formula: mu ~ mumax/(1 + exp(-(log(C) - logofIC50)/s))

Parameters:
           Estimate Std. Error t value Pr(>|t|)
mumax      0.695369   0.006123  113.58 1.00e-09 ***
logofIC50  1.088926   0.023982   45.41 9.78e-08 ***
s         -0.509478   0.020415  -24.96 1.93e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.00949 on 5 degrees of freedom

Number of iterations to convergence: 5
Achieved convergence tolerance: 6.394e-06
Achieved convergence tolerance: 5.337e-06
This lists the estimates of μ_{max}, ln(IC_{50}) and s, plus the standard error of these estimates and their p values. Note that they’re all pretty close to what we guessed in the first place, so we can be fairly sure they’re good estimates. To get the actual value of IC_{50} from the model:
exp( coef(dose.model)[2] )
This gives 2.97 µM, which corresponds well with our initial guess.
To add the curve, we do the same as before:
predicted.values <- data.frame(
  C = seq(
    from = min( dose.response$C ),
    to = max( dose.response$C ),
    length.out = 100
  )
)
predicted.values$mu <- predict( dose.model, newdata = predicted.values )
lines( mu ~ log(C), data = predicted.values )
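One optional refinement when the x-axis is logged: values of C that are evenly spaced on the linear scale bunch up at the right-hand end of a log axis, so the left-hand end of the curve is drawn from very few points. A sketch of an alternative grid, evenly spaced in log(C), is given below; C.observed is an invented stand-in for the C column of the data.

```r
# Hypothetical concentrations standing in for dose.response$C
C.observed <- c( 0.5, 1, 2, 4, 8, 16 )

# Space the grid evenly on the log scale, then back-transform to concentrations
C.grid <- exp( seq(
  from = log( min( C.observed ) ),
  to = log( max( C.observed ) ),
  length.out = 100
) )

# Every step along the logged axis is now the same size
log.steps <- diff( log( C.grid ) )
```

A data frame with C = C.grid can then be handed to predict() as newdata in exactly the same way as the linearly spaced version.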
In the models we have seen so far (linear regression, one-way ANOVA) all we have really done is tested the difference between a null model (“y is a constant”, “y=y̅”) and a single alternative model (“y varies by group” or “y=a+bx”) using an F test. However, in two-way ANOVA there are several possible models, and we will probably need to proceed through some model simplification from a complex model to a minimally adequate model.
The file wheat_yield.csv contains data on the yield (tn ha^{−1}) of wheat from a large number of replicate plots that were either unfertilised, given nitrate alone, phosphate alone, or both forms of fertiliser. This requires a two-factor two-level ANOVA.
wheat.yield <- read.csv( "H:/R/wheat_yield.csv" )
interaction.plot(
  response = wheat.yield$yield,
  x.factor = wheat.yield$N,
  trace.factor = wheat.yield$P
)
As we’re plotting two factors, a box-and-whisker plot would make no sense, so instead we plot an interaction plot. It doesn’t particularly matter here whether we use N(itrate) as the x.factor (i.e. the thing we plot on the x-axis) and P(hosphate) as the trace.factor (i.e. the thing we plot two different trace lines for):
You’ll note that the addition of nitrate seems to increase yield: both traces slope upwards from the N(o) to Y(es) level on the x-axis, which represents the nitrate factor. From the lower trace, it appears addition of just nitrate increases yield by about 1 tn ha^{−1}.
You’ll also note that the addition of phosphate seems to increase yield: the Y(es) trace for phosphate is higher than the N(o) trace for phosphate. From comparing the upper and lower traces at the left (no nitrate), it appears that addition of just phosphate increases yield by about 2 tn ha^{−1}.
Finally, you may notice there is a positive (synergistic) interaction. The traces are not parallel, and the top-right ‘point’ (Y(es) to both nitrate and phosphate) is higher than you would expect from additivity: the top-right is maybe 4, rather than 1+2=3 tn ha^{−1} higher than the completely unfertilised point.
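The “higher than you would expect from additivity” reasoning can be made explicit by working from the four cell means. The sketch below uses a small invented data set (the plots data frame and its yields are assumptions chosen to mimic the pattern described, not the wheat data) and computes how far the doubly fertilised mean sits above the additive expectation.

```r
# Hypothetical plots: 3 replicates of each nitrate (N) x phosphate (P) combination
plots <- expand.grid( rep = 1:3, N = c("N", "Y"), P = c("N", "Y") )
plots$yield <- c( 4.9, 5.0, 5.1,    # no nitrate, no phosphate
                  5.9, 6.0, 6.1,    # nitrate only
                  6.9, 7.0, 7.1,    # phosphate only
                  8.9, 9.0, 9.1 )   # both

# 2 x 2 table of mean yields: rows are nitrate levels, columns are phosphate levels
cell.means <- tapply( plots$yield, list( plots$N, plots$P ), mean )

nitrate.effect <- cell.means["Y", "N"] - cell.means["N", "N"]
phosphate.effect <- cell.means["N", "Y"] - cell.means["N", "N"]

# What the 'both' cell would be if the two effects simply added up
additive.expectation <- cell.means["N", "N"] + nitrate.effect + phosphate.effect

# Anything left over is the interaction (positive means synergy)
interaction.excess <- cell.means["Y", "Y"] - additive.expectation
```

This is the quantity the interaction plot shows as non-parallel traces; whether it is bigger than you would expect by chance is what the ANOVA tests.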
We suspect there is an interaction, this interaction is biologically plausible, and we have plenty of replication (140 plots across the four treatments). We fit a two-factor (two-way) ANOVA maximal model, to see whether this interaction is significant.
First, we fit the model using N*P to fit the ‘product’ of the nitrate and phosphate factors, i.e. N + P + N:P:
wheat.model <- aov( yield ~ N*P, data = wheat.yield )
anova( wheat.model )
Analysis of Variance Table

Response: yield
           Df  Sum Sq Mean Sq  F value    Pr(>F)
N           1  83.645  83.645  43.3631 9.122e-10 ***
P           1 256.136 256.136 132.7859 < 2.2e-16 ***
N:P         1  11.143  11.143   5.7767   0.01759 *
Residuals 136 262.336   1.929
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
N shows the effect of nitrate, P the effect of phosphate, and N:P is the interaction term. As you can see, this interaction term appears to be significant: the nitrate+phosphate combination seems to give a higher yield than you would expect from just adding up the individual effects of nitrate and phosphate alone. To test this explicitly, we can fit a model that lacks the interaction term, using N+P to fit the ‘sum’ of the factors without the N:P term:
wheat.model.no.interaction <- aov( yield ~ N+P, data = wheat.yield )
anova( wheat.model.no.interaction )
Analysis of Variance Table

Response: yield
           Df  Sum Sq Mean Sq F value    Pr(>F)
N           1  83.645  83.645  41.902 1.583e-09 ***
P           1 256.136 256.136 128.312 < 2.2e-16 ***
Residuals 137 273.479   1.996
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can then use the anova() function with two arguments to compare these two models with an F test:
anova( wheat.model, wheat.model.no.interaction )
Analysis of Variance Table

Model 1: yield ~ N * P
Model 2: yield ~ N + P
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)
1    136 262.34
2    137 273.48 -1   -11.143 5.7767 0.01759 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
An alternative way of doing the same thing is to update the maximal model by deletion. First you fit the maximal model, as before:
wheat.model <- aov( yield ~ N*P, data = wheat.yield )
anova( wheat.model )
You note the least significant term (largest p value) is the N:P term, and you selectively remove that from the model to produce a revised model with update():
wheat.model.no.interaction <- update( wheat.model, ~.-N:P )
anova( wheat.model, wheat.model.no.interaction )
The ~.-N:P is shorthand for “fit (~) the current wheat.model (.) minus (-) the interaction term (N:P)”. This technique is particularly convenient when you are iteratively simplifying a model with a larger number of factors and a large number of interactions.
Whichever method we use, we note that the deletion of the interaction term significantly reduces the explanatory power of the model, and therefore that our minimally adequate model is the one including the interaction term.
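If the formula shorthand feels opaque, you can ask R to spell out which terms a formula contains without fitting anything. The sketch below works on bare formula objects; the names maximal and reduced are just labels chosen for the example.

```r
# N*P expands to the two main effects plus their interaction
maximal <- yield ~ N * P
maximal.terms <- attr( terms( maximal ), "term.labels" )

# update() also works on a formula: '.' stands for "what was there before"
reduced <- update( maximal, ~ . - N:P )
reduced.terms <- attr( terms( reduced ), "term.labels" )
```

Here maximal.terms holds "N", "P" and "N:P", while reduced.terms holds just "N" and "P", which is the same model as writing yield ~ N + P by hand.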
If you have larger numbers of factors (say a, b and c) each with a large number of levels (say 4, 2, and 5), it is possible to fit a maximal model (y~a*b*c), and to simplify down from that. However, fitting a maximal model in this case would involve estimating 40 separate parameters, one for each combination of the 4 levels of a, with the 2 levels of b, and the 5 levels of c. It is unlikely that your data set is large enough to make a model of 40 parameters a useful simplification compared to the raw data set itself. It would even saturate it if you have just one datum for each possible {a,b,c} combination. Remember that one important point of a model is to provide a simplification of a large data set. If you want to detect an interaction between factors a and b reliably, you need enough data to do so. In an experimental situation, this might impact on whether you actually want to make c a variable at all, or rather to control it instead, if this is possible.
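The parameter count can be checked before you collect any data by building the design matrix for a hypothetical experiment. The sketch below assumes one row per {a,b,c} combination (the design data frame is invented for the purpose) and counts the columns that each model would need to estimate.

```r
# One row for every combination of 4 x 2 x 5 factor levels
design <- expand.grid( a = factor(1:4), b = factor(1:2), c = factor(1:5) )

# Columns of the model matrix = parameters the model has to estimate
n.parameters <- ncol( model.matrix( ~ a * b * c, data = design ) )

# For comparison, the purely additive model with no interactions
n.additive <- ncol( model.matrix( ~ a + b + c, data = design ) )
```

The maximal model needs 40 parameters, exactly as many as there are rows in this unreplicated design, which is what saturation means; the additive model needs only 9.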
Analyse the following data set with ANOVA
glucose.conc <- read.csv( "H:/R/glucose_conc.csv" )
interaction.plot(
  response = glucose.conc$conc,
  x.factor = glucose.conc$Stfw,
  trace.factor = glucose.conc$Rtfm
)
The lines are parallel, so there seems little evidence of interaction between the loci. The difference between the mutant and wildtypes for the Rtfm locus doesn’t look large, and may not be significant. However, the Stfw wildtypes seem to have better control of blood glucose. A two-way ANOVA can investigate this:
glucose.model <- aov( conc ~ Stfw*Rtfm, data = glucose.conc )
anova( glucose.model )
Analysis of Variance Table

Response: conc
           Df Sum Sq Mean Sq  F value  Pr(>F)
Stfw        1 133233  133233 311.4965 < 2e-16 ***
Rtfm        1   1464    1464   3.4238 0.06526 .
Stfw:Rtfm   1     17      17   0.0407 0.84020
Residuals 296 126605     428
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As we suspected, the Stfw:Rtfm interaction is not significant. We will remove it by deletion:
glucose.model.no.interaction <- update( glucose.model, ~.-Stfw:Rtfm )
anova( glucose.model, glucose.model.no.interaction )
Analysis of Variance Table

Model 1: conc ~ Stfw * Rtfm
Model 2: conc ~ Stfw + Rtfm
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    296 126605
2    297 126622 -1   -17.421 0.0407 0.8402
The p value is much larger than 0.05, so the model including the interaction term is not significantly better than the one excluding it.
The reduced model is now:
anova(glucose.model.no.interaction )
Analysis of Variance Table Response: conc Df Sum Sq Mean Sq F value Pr(>F) Stfw 1 133233 133233 312.5059 < 2e-16 *** Rtfm 1 1464 1464 3.4349 0.06482 . Residuals 297 126622 426 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
When we remove the interaction term, its variance is redistributed to the remaining factors, which will change their values compared to the maximal model we fitted at the start. It appears that the wildtype and mutants of Rtfm do not significantly differ in their glucose levels, so we will now remove that term too:
glucose.model.stfw.only<-update( glucose.model.no.interaction, ~.-Rtfm )
anova( glucose.model.no.interaction, glucose.model.stfw.only )

Analysis of Variance Table

Model 1: conc ~ Stfw + Rtfm
Model 2: conc ~ Stfw
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)
1    297 126622
2    298 128087 -1   -1464.4 3.4349 0.06482 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Again, we detect no significant difference between the models, so we accept the simpler one (Occam’s razor).
anova( glucose.model.stfw.only )
Analysis of Variance Table

Response: conc
           Df Sum Sq Mean Sq F value    Pr(>F)
Stfw        1 133233  133233  309.97 < 2.2e-16 ***
Residuals 298 128087     430
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Stfw locus does have a very significant effect on blood glucose levels. We could try deletion testing Stfw too, but here it is not really necessary as the ANOVA table above is comparing the Stfw only model to the null model in any case. To determine what the effect of Stfw is, we can use a Tukey test:
TukeyHSD( glucose.model.stfw.only )
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = conc ~ Stfw, data = glucose.conc)

$Stfw
                     diff     lwr       upr p adj
wildtype-mutant -42.14783 -46.859 -37.43666     0
… or, as this model has in fact simplified down to the two-levels of one factor, this is equivalent to just doing a t test at this point:
t.test( conc ~ Stfw, data = glucose.conc)
	Welch Two Sample t-test

data:  conc by Stfw
t = 17.6061, df = 289.288, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 37.43609 46.85958
sample estimates:
  mean in group mutant mean in group wildtype
             123.87824               81.73041
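The equivalence can be checked numerically: for a single factor with two levels, the square of the pooled-variance t statistic equals the ANOVA F value (here 17.6061² ≈ 309.97, matching the table above). The sketch below shows the same thing on ToothGrowth, a data set that ships with R, since the glucose file is not bundled with this post; the choice of data set and the variable names are assumptions made purely for illustration.

```r
# ToothGrowth ships with R: tooth length (len) under two supplement types (supp)
tooth.model <- aov( len ~ supp, data = ToothGrowth )
f.value <- anova( tooth.model )[ "supp", "F value" ]

# var.equal = TRUE requests the classical pooled-variance t test,
# rather than the Welch test that t.test() performs by default
tooth.t <- t.test( len ~ supp, data = ToothGrowth, var.equal = TRUE )
t.value <- unname( tooth.t$statistic )

# The two numbers printed here should be identical
print( c( t.squared = t.value^2, F = f.value ) )
```

Note that t.test() defaults to the Welch test, which does not assume equal variances; it is only guaranteed to give the same t value as the pooled test when the two groups are the same size.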
Given this analysis, the best way to represent our data is probably a simple boxplot ignoring Rtfm, with an explanatory legend:
boxplot(
  conc ~ Stfw,
  data = glucose.conc,
  xlab = expression( italic(Stfw) ),
  ylab = expression( "[Glucose] / mg "*dL^-1 ),
  main = "Mutants at the Stfw locus are less able\nto control their blood glucose levels"
)
Homozygous mutants at the Stfw locus were found to be less well able to control their blood glucose levels (F=310, p≪0.001). Homozygous mutation of the Rtfm locus was found to have no significant effect on blood glucose levels (F on deletion from ANOVA model = 3.4, p=0.06). Homozygous mutation at the Stfw locus was associated with a blood glucose level 42 mg dL^{−1} (95% CI: 37.4…46.9) higher than the wildtype (t=17.6, p≪0.001).
Next up… Nonlinear regression.
For example, cuckoo_eggs.csv contains data on the length of cuckoo eggs laid into different host species, the (meadow) pipit, the (reed) warbler, and the wren.
A box-and-whisker plot is a useful way to view data of this sort:
cuckoo.eggs<-read.csv( "H:/R/cuckoo_eggs.csv" )
boxplot(
  egg.length ~ species,
  data = cuckoo.eggs,
  xlab = "Host species",
  ylab = "Egg length / mm"
)
You might be tempted to try t testing each pairwise comparison (pipit vs. wren, warbler vs. pipit, and warbler vs. wren), but a one-factor analysis of variance (ANOVA) is what you actually want here. ANOVA works by fitting individual means to the three levels (warbler, pipit, wren) of the factor (host species) and seeing whether this results in a significantly smaller residual variance than fitting a simple overall mean to the entire data set.
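One reason usually given for avoiding repeated pairwise tests is that the chance of at least one false positive grows with the number of comparisons. The sketch below treats the three tests as independent, which is a simplification (they share data), so the figure is indicative only.

```r
alpha <- 0.05                     # significance threshold for each individual test
n.comparisons <- choose( 3, 2 )   # pipit/wren, warbler/pipit, warbler/wren

# Probability that at least one of the tests is 'significant' by chance alone,
# assuming (as a simplification) that the tests are independent
family.wise.error <- 1 - ( 1 - alpha )^n.comparisons
print( family.wise.error )
```

With three comparisons the nominal 5% error rate roughly triples, and it gets worse quickly as the number of levels increases.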
Conceptually, this is very similar to what we did with linear regression: ANOVA compares the residuals on the model represented by the “y is a constant” graph below:
…with a model where three individual means have been fitted, the “y varies by group” model:
It’s not immediately obvious that fitting three separate means has bought us much: the model is more complicated, but the length of the red lines doesn’t seem to have changed hugely. However, R can tell us precisely whether or not this is the case. The syntax for categorical model fitting is aov(), for analysis of variance:
aov( egg.length ~ species, data=cuckoo.eggs )
Call:
   aov(formula = egg.length ~ species, data = cuckoo.eggs)

Terms:
                 species Residuals
Sum of Squares  35.57612  55.85047
Deg. of Freedom        2        57

Residual standard error: 0.989865
Estimated effects may be unbalanced
As with linear regression, you may well wish to save the model for later use. It is traditional to display the results of an ANOVA in tabular format, which can be produced using anova():
cuckoo.model<-aov( egg.length ~ species, data=cuckoo.eggs )
anova( cuckoo.model )

Analysis of Variance Table

Response: egg.length
          Df Sum Sq Mean Sq F value    Pr(>F)
species    2 35.576 17.7881  18.154 7.938e-07 ***
Residuals 57 55.850  0.9798
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The “y is a constant” model has only one variance associated with it: the total sum of squares of the deviances from the overall mean (SS_{T} or SS_{Y}, whichever you prefer) divided by the degrees of freedom in the data set (n−1).
The “y varies by group” model decomposes this overall variance into two components.
In the ANOVA table, the Sum Sq is the sum of the squares of the deviances of the data points from a mean. The Df is the degrees of freedom, and the Mean Sq is the sum of squares divided by the degrees of freedom, which is the corresponding variance.
The F value is the mean-square for species divided by the mean-square of the Residuals. The p value indicates that categorising the data into three groups does make a significant difference to explaining the variance in the data, i.e. estimating a separate mean for each host species, rather than one grand mean, does make a significant difference to how well we can explain the data. The length of the eggs the cuckoo lays does vary by species.
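To make the columns of the ANOVA table concrete, the sums of squares, mean squares and F value can be rebuilt by hand. The sketch below uses PlantGrowth, a three-group data set that ships with R, rather than the cuckoo data (which needs the CSV file); the data set choice and all the variable names are assumptions for illustration.

```r
# PlantGrowth ships with R: plant weight under three treatment groups
weights <- PlantGrowth$weight
groups  <- PlantGrowth$group

grand.mean  <- mean( weights )
group.means <- ave( weights, groups )   # each datum's own group mean

# Total sum of squares, and its two components
ss.total <- sum( ( weights - grand.mean )^2 )       # "y is a constant" model
ss.group <- sum( ( group.means - grand.mean )^2 )   # explained by the factor
ss.resid <- sum( ( weights - group.means )^2 )      # left unexplained

df.group <- nlevels( groups ) - 1
df.resid <- length( weights ) - nlevels( groups )

# Mean squares are sums of squares divided by their degrees of freedom
f.value <- ( ss.group / df.group ) / ( ss.resid / df.resid )
p.value <- pf( f.value, df.group, df.resid, lower.tail = FALSE )

print( c( SS.total = ss.total, SS.group = ss.group, SS.resid = ss.resid ) )
print( c( F = f.value, p = p.value ) )

# For comparison: the table R builds itself
print( anova( aov( weight ~ group, data = PlantGrowth ) ) )
```

The group and residual sums of squares should add up to the total, and the hand-built F should match the one in R's table.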
Compare this with linear regression, where you’re trying to find out whether y=a+bx is a better model of the data than y=y̅. This is very similar to ANOVA, where we are trying to find out whether “y varies by group, i.e. the levels of a factor” is a better model than “y is a constant, i.e. the overall mean”.
You might well now ask “but which means are different?” This can be investigated using TukeyHSD() (honest significant differences) which performs a (better!) version of the pairwise t test you were probably considering at the top of this post.
TukeyHSD( cuckoo.model )
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = egg.length ~ species, data = cuckoo.eggs)

$species
                 diff        lwr        upr    p adj
Warbler-Pipit -1.5395 -2.2927638 -0.7862362 0.000023
Wren-Pipit    -1.7135 -2.4667638 -0.9602362 0.000003
Wren-Warbler  -0.1740 -0.9272638  0.5792638 0.843888
This confirms that – as the box-and-whisker plots suggested – the eggs laid in wren and warbler nests are not significantly different in size (p=0.84), but those laid in pipit nests are significantly larger than those in the warbler or wren nests.
ANOVA has the same sorts of assumption as the F and t tests and as linear regression: normality of residuals, homoscedasticity, representativeness of the sample, no error in the treatment variable, and independence of data points. You should therefore use the same checks after model fitting as you used for linear regression:
plot( residuals( cuckoo.model ) ~ fitted( cuckoo.model ) )
We do not expect a starry sky in the residual plot, as the fitted data are in three discrete levels. However, if the residuals are homoscedastic, we expect them to be of similar spread in all three treatments, i.e. the residuals shouldn’t be more scattered around the zero line in the pipits than in the wrens, for example. This plot seems consistent with that.
qqnorm( residuals( cuckoo.model ) )
On the normal Q-Q plot, we do expect a straight line, which – again – we appear to have (although it’s a bit jagged, and a tiny bit S-shaped). We can accept that the residuals are more-or-less normal, and therefore that the analysis of variance was valid.
Analyse the following data set with ANOVA
venus.flytrap<-read.csv("H:/R/venus_flytrap.csv")
plot(
  biomass ~ feed,
  data = venus.flytrap,
  xlab = "Feeding treatment",
  ylab = "Wet biomass / g",
  main = expression("Venus flytrap feeding regime affects wet biomass")
)
flytrap.model<-aov( biomass ~ feed, data = venus.flytrap )
anova( flytrap.model )

Analysis of Variance Table

Response: biomass
          Df  Sum Sq Mean Sq F value    Pr(>F)
feed       2  5.8597 2.92983  22.371 1.449e-08 ***
Residuals 87 11.3939 0.13096
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD( flytrap.model )
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = biomass ~ feed, data = venus.flytrap)

$feed
                         diff         lwr        upr     p adj
fertiliser-control -0.3086667 -0.53147173 -0.0858616 0.0039294
fly-control         0.3163333  0.09352827  0.5391384 0.0030387
fly-fertiliser      0.6250000  0.40219493  0.8478051 0.0000000
The feeding treatment makes a significant difference to the wet biomass (F=22.4, p=1.45 × 10^{−8}) and a Tukey HSD test shows that all three means are different, with the fly treatment having a beneficial effect (on average, the fly-treated plants are 0.32 g heavier than the control plants), but the fertiliser has an actively negative effect on growth, with these plants on average being 0.31 g lighter than the control plants.
To plot the residuals, we use the same code as for the linear regression:
plot( residuals(flytrap.model) ~ fitted(flytrap.model) )
There is perhaps a bit more scatter in the residuals of the control (the ones in the middle), but nothing much to worry about.
qqnorm( residuals( flytrap.model ) )
On the normal Q-Q plot, we do expect a straight line, which – again – we appear to have. We can accept that the residuals are essentially homoscedastic and normal, and therefore that the analysis of variance was valid.
Next up… Two-way ANOVA.
The test statistic derived from the two data sets is called χ^{2}, and it is defined as the square of the discrepancy between the observed (O) and expected (E) value of a count variable divided by the expected value, summed over all the categories: χ^{2} = Σ (O−E)^{2}/E.
The reference distribution for the χ^{2} test is Pearson’s χ^{2}. This reference distribution has a single parameter: the number of degrees of freedom remaining in the data set.
A χ^{2} test compares the χ^{2} statistic from your empirical data with the Pearson’s χ^{2} value you’d expect under the null hypothesis given the degrees of freedom in the data set. The p value of the test is the probability of obtaining a test χ^{2} statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis (“there is no discrepancy between the observed and expected values”) is true, i.e. the p value is the probability of observing your data (or something more extreme), if the data do not truly differ from your expectation.
The comparison is only valid if the data are:
Do not use a χ^{2} test unless these assumptions are met. The Fisher’s exact test fisher.test() may be more suitable if the data set is small.
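As a rough illustration of the small-sample case, the sketch below runs fisher.test() on a 2×2 table of invented counts. The counts are hypothetical, and the "expected counts of at least 5" guideline mentioned in the comments is a common rule of thumb rather than something taken from this post.

```r
# Hypothetical counts, invented purely for illustration
small.table <- matrix( c( 3, 1, 1, 3 ), nrow = 2, byrow = TRUE )

# Expected counts under homogeneity: row total x column total / grand total
expected <- outer( rowSums( small.table ), colSums( small.table ) ) / sum( small.table )
print( expected )   # every cell is far below the usual rule-of-thumb minimum of 5

# Fisher's exact test takes the contingency table directly
small.fisher <- fisher.test( small.table )
print( small.fisher$p.value )
```

With so few observations, the apparent association in the table is nowhere near significant.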
In R, a χ^{2}-test is performed using chisq.test(). This acts on a contingency table, so the first thing you need to do is construct one from your raw data. The file tit_distribution.csv contains counts of the total number of birds (the great tit, Parus major, and the blue tit, Cyanistes caeruleus) at different layers of a canopy over a period of one day.
tit.distribution<-read.csv( "H:/R/tit_distribution.csv" )
print( tit.distribution )
This will spit out all 706 observations: remember that the raw data you import into R should have a row for each ‘individual’; here, each individual is a “This bird in that layer” observation. You can see just the start of the data using head():
head( tit.distribution )
     Bird  Layer
1 Bluetit Ground
2 Bluetit Ground
3 Bluetit Ground
4 Bluetit Ground
5 Bluetit Ground
6 Bluetit Ground
and look at a summary of the data frame object with str():
str( tit.distribution )
'data.frame': 706 obs. of 2 variables:
 $ Bird : Factor w/ 2 levels "Bluetit","Greattit": 1 1 1 1 1 1 1 1 1 1 ...
 $ Layer: Factor w/ 3 levels "Ground","Shrub",..: 1 1 1 1 1 1 1 1 1 1 ...
To create a contingency table, use table():
tit.table<-table( tit.distribution$Bird, tit.distribution$Layer )
tit.table

           Ground Shrub Tree
  Bluetit      52    72  178
  Greattit     93   247   64
If you already had a table of the count data, and didn’t fancy making the raw data CSV file from it, just to have to turn it back into a contingency table anyway, you could construct the table manually using matrix():
tit.table<-matrix( c( 52, 72, 178, 93, 247, 64 ), nrow=2, byrow=TRUE )
# nrow=2 means cut the vector into two rows
# byrow=TRUE means fill the data in horizontally (row-wise)
# rather than vertically (column-wise)
tit.table

     [,1] [,2] [,3]
[1,]   52   72  178
[2,]   93  247   64
The matrix can be prettified with labels (if you wish) using dimnames(), which expects a list() of two vectors, the first of which are the row names, the second of which are the column names:
dimnames( tit.table )<-list( c("Bluetit","Greattit" ), c("Ground","Shrub","Tree" ) )
tit.table

         Ground Shrub Tree
Bluetit      52    72  178
Greattit     93   247   64
To see whether the observed values (above) differ from the expected values, you need to know what those expected values are. For a simple homogeneity χ^{2}-test, the expected values are simply calculated from the corresponding column (C), row (R) and grand (N) totals:
 | Ground | Shrub | Tree | Row totals |
Blue tit | 52 | 72 | 178 | 302 |
E | 302×145/706 = 62.0 | 302×319/706 = 136.5 | 302×242/706 = 103.5 | |
χ^{2} | (52−62)^{2}/62 = 1.6 | (72−136.5)^{2}/136.5 = 30.5 | (178−103.5)^{2}/103.5 = 53.6 | |
Great tit | 93 | 247 | 64 | 404 |
E | 404×145/706 = 83.0 | 404×319/706 = 182.5 | 404×242/706 = 138.5 | |
χ^{2} | (93−83)^{2}/83 = 1.2 | (247−182.5)^{2}/182.5 = 22.8 | (64−138.5)^{2}/138.5 = 40.1 | |
Column totals | 145 | 319 | 242 | 706 |
The individual χ^{2} values show the discrepancies for each of the six individual cells of the table. Their sum is the overall χ^{2} for the data, which is 149.7. R does all this leg-work for you, with the same result:
chisq.test( tit.table )
	Pearson's Chi-squared test

data:  tit.table
X-squared = 149.6866, df = 2, p-value < 2.2e-16
The individual tits’ distributions are significantly different from homogeneous, i.e. there are a lot more blue tits in the trees and great tits in the shrub layer than you would expect just from the overall distribution of birds.
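The hand calculation in the table above can also be written out in a few lines of R, which may help show where the expected values and the degrees of freedom come from. This is a sketch using the counts from this post; the variable names (expected, cell.chisq and so on) are illustrative.

```r
tit.table <- matrix( c( 52, 72, 178, 93, 247, 64 ), nrow = 2, byrow = TRUE )

# Expected count for each cell: row total x column total / grand total
expected <- outer( rowSums( tit.table ), colSums( tit.table ) ) / sum( tit.table )

# Each cell's contribution to the statistic, and their sum
cell.chisq  <- ( tit.table - expected )^2 / expected
chisq.total <- sum( cell.chisq )

# Degrees of freedom for a contingency table: (rows - 1) x (columns - 1)
chisq.df <- ( nrow( tit.table ) - 1 ) * ( ncol( tit.table ) - 1 )

# Probability of a statistic at least this large under the null hypothesis
p.value <- pchisq( chisq.total, df = chisq.df, lower.tail = FALSE )

print( round( expected, 1 ) )
print( round( cell.chisq, 1 ) )
print( c( X.squared = chisq.total, df = chisq.df, p = p.value ) )
```

The same expected counts are available from R directly as chisq.test( tit.table )$expected.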
Sometimes, the expected values are known, or can be calculated from a model. For example, if you have 164 observations of progeny from a dihybrid selfing genetic cross, where you expect a 9:3:3:1 ratio, you’d perform a χ^{2} manually like this:
 | A- B- | A- bb | aa B- | aa bb |
O | 94 | 33 | 28 | 9 |
E | 164×9/16 = 92.25 | 164×3/16 = 30.75 | 164×3/16 = 30.75 | 164×1/16 = 10.25 |
χ^{2} | (94−92.25)^{2}/92.25 = 0.033 | (33−30.75)^{2}/30.75 = 0.165 | (28−30.75)^{2}/30.75 = 0.246 | (9−10.25)^{2}/10.25 = 0.152 |
For a total χ^{2} of 0.596. To do the equivalent in R, you should supply chisq.test() with a second, named parameter called p, which is a vector of expected probabilities:
dihybrid.table<-matrix( c( 94, 33, 28, 9 ), nrow=1, byrow=TRUE )
dimnames( dihybrid.table )<-list( c( "Counts" ), c( "A-B-","A-bb","aaB-","aabb" ) )
dihybrid.table

       A-B- A-bb aaB- aabb
Counts   94   33   28    9

null.probs<-c( 9/16, 3/16, 3/16, 1/16 )
chisq.test( dihybrid.table, p=null.probs )

	Chi-squared test for given probabilities

data:  dihybrid.table
X-squared = 0.5962, df = 3, p-value = 0.8973
The data are not significantly different from a 9:3:3:1 ratio, so the A and B loci appear to be unlinked and non-interacting, i.e. they are inherited in a Mendelian fashion.
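For a goodness-of-fit test like this one, the statistic and p value can also be computed directly from the observed counts and the expected ratio. The sketch below repeats the manual table in R; the variable names are illustrative.

```r
observed   <- c( 94, 33, 28, 9 )       # A-B-, A-bb, aaB-, aabb
null.probs <- c( 9, 3, 3, 1 ) / 16     # the 9:3:3:1 expectation

# Expected counts are the total number of progeny shared out in that ratio
expected <- sum( observed ) * null.probs

chisq.total <- sum( ( observed - expected )^2 / expected )

# With k categories and fixed probabilities there are k - 1 degrees of freedom
chisq.df <- length( observed ) - 1
p.value  <- pchisq( chisq.total, df = chisq.df, lower.tail = FALSE )

print( expected )
print( c( X.squared = chisq.total, df = chisq.df, p = p.value ) )
```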
The most natural way to plot count data is using a barplot() bar-chart:
barplot( dihybrid.table, xlab="Genotype", ylab="N", main="Dihybrid cross" )
Use the χ^{2} test to investigate the following data sets.
 | Nibbled | Un-nibbled |
CN^{+} | 26 | 74 |
CN^{−} | 34 | 93 |
clover.table<-matrix( c( 26, 74, 34, 93 ), nrow=2, byrow=TRUE )
dimnames( clover.table )<-list( c( "CN.plus", "CN.minus" ), c( "Nibbled", "Un.nibbled" ) )
clover.table

         Nibbled Un.nibbled
CN.plus       26         74
CN.minus      34         93

chisq.test( clover.table )

	Pearson's Chi-squared test with Yates' continuity correction

data:  clover.table
X-squared = 0, df = 1, p-value = 1
maize.kernels<-read.csv( "H:/R/maize_kernels.csv" )
head( maize.kernels )

      Kernel
1        Red
2 Colourless
3 Colourless
4 Colourless
5     Purple
6 Colourless

maize.table<-table( maize.kernels$Kernel )
maize.table

Colourless     Purple        Red
       229        485        160

chisq.test( maize.table, p=c( 4/16, 9/16, 3/16 ) )

	Chi-squared test for given probabilities

data:  maize.table
X-squared = 0.6855, df = 2, p-value = 0.7098
Next up… One-way ANOVA.
Here b is the estimated slope of the best-fit line (a.k.a. gradient, often written m), a is its y-intercept (often written c), and ϵ is the residual error. If the x and y data are perfectly correlated, then ϵ=0 for each and every x,y pair in the data set; however, this is extremely unlikely to occur in real-world data.
When you fit a linear model like this to a data set, each coefficient you fit (here, the intercept and the slope) will be associated with a t value and p value, which are essentially the result of a one-sample t test comparing the fitted value to 0.
Linear regression is only valid if:
plot() and eyeball your data before modelling!
Linear regression is very commonly used in circumstances where it is not technically appropriate, e.g. time-series data (where later x,y pairs are most certainly not independent of earlier pairs), or where the x-variable does have some error associated with it (e.g. from pipetting errors), or where a transformation has been used that will make the residuals non-normal. You should at least be aware that you are breaking the assumptions of the linear regression procedure if you use it for data of this sort.
The file cricket_chirps.csv contains data on the frequency of cricket chirps (Hz) at different temperatures (°C). A quick plot of the data seems to show a positive, linear relationship:
cricket.chirps<-read.csv( "H:/R/cricket_chirps.csv" )
plot(
  Frequency ~ Temperature,
  data = cricket.chirps,
  xlab = "Temperature / °C",
  ylab = "Frequency / Hz",
  main = "Crickets chirp more frequently at higher temperatures",
  pch  = 15 # The pch option can be used to control the plotting character
)
To model the data, you need to use lm( y ~ x, data=data.frame ). The lm() stands for “linear model”.
lm( Frequency ~ Temperature, data=cricket.chirps )
Call:
lm(formula = Frequency ~ Temperature, data = cricket.chirps)

Coefficients:
(Intercept)  Temperature
    -0.1140       0.1271
You’ll often want to save the model for later use, so you can assign it to a variable. summary() can then be used to see what R thinks the slope and intercept are:
chirps.model<-lm( Frequency ~ Temperature, data=cricket.chirps )
summary( chirps.model )

Call:
lm(formula = Frequency ~ Temperature, data=cricket.chirps)

Residuals:
     Min       1Q   Median       3Q      Max
-0.39779 -0.11544 -0.00191  0.12603  0.33985

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.113971   0.152264  -0.749    0.467
Temperature  0.127059   0.005714  22.235 2.55e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2107 on 14 degrees of freedom
Multiple R-squared: 0.9725, Adjusted R-squared: 0.9705
F-statistic: 494.4 on 1 and 14 DF, p-value: 2.546e-12
(Intercept) is a from the formula at the top of this section. It is the y-intercept of the line of best fit through the x,y data pairs. It is the value of y (Frequency) when x (Temperature) is zero, i.e. how frequently the crickets chirp at the freezing point of water. The estimated value is -0.1140 Hz, which is impossible(!), but satisfyingly does not appear to be significantly different from zero (p=0.467).
Temperature is b from the formula at the top of this section. It is the slope of the line of best fit through the x,y data pairs. The estimated value is 0.1271 Hz °C^{−1}, i.e. for every 10°C increase in temperature, the chirping rate increases by about 1.3 Hz.
The Multiple R-squared value is the square of the correlation coefficient R for Frequency on Temperature. Values of R^{2} close to 1 indicate y is well correlated with the x covariate with relatively little scatter; values close to 0 indicate the scatter is large and x is a poor predictor of y.
To understand the meaning of the F-statistic part of the report, it is important to understand what actually happens when you perform a linear regression. What you’re really trying to find out in linear regression is whether a straight-line with non-zero slope “y=a+bx” is a better model of the dependent variable than a straight line of slope zero, with a y-intercept equal to a constant, usually the mean of the y values: “y=y̅”. These two possible models are shown below. The code needed to display them with the deviance of each datum from the regression line picked out as red (col="red") vertical line segments is also shown.
Linear regression model “y=a+bx” with non-zero slope…
chirps.model<-lm( Frequency ~ Temperature, data=cricket.chirps )
abline( chirps.model )
chirps.predicted<-data.frame( Temperature=cricket.chirps$Temperature )
chirps.predicted$Frequency<-predict( chirps.model, newdata=chirps.predicted )
segments(
  cricket.chirps$Temperature,
  cricket.chirps$Frequency,
  chirps.predicted$Temperature,
  chirps.predicted$Frequency,
  col="red"
)
We use the predict() function to predict the Frequency values from the model and a data frame containing Temperature values. We put the predicted Frequency values into a new column in the data frame by assigning them using the $ dollar syntax.
To add a regression line to the current plot using the fitted model, we use abline(). Like many of the other functions we’ve seen, this can either take an explicit intercept and slope:
# abline( a=intercept, b=slope )
abline( a=-0.1140, b=0.1271 )
Or it can take the fitted model object itself:
abline( chirps.model )
We use the segments() function to add red line segments to represent the deviation of each datum from the regression line. segments() takes four vectors as arguments, the x and y coordinates to start each segment from (here, the measured Temperature, Frequency data points), plus the x and y coordinates to finish each line (the equivalent columns from the data frame containing the predicted data: these are the corresponding points on the regression line).
y is a constant “y=y̅ ” model with zero slope…
mean.frequency<-mean( cricket.chirps$Frequency )
abline( a=mean.frequency, b=0 )
segments(
  cricket.chirps$Temperature,
  cricket.chirps$Frequency,
  cricket.chirps$Temperature,
  rep( mean.frequency, length(cricket.chirps$Frequency) ),
  col="red"
)
The constant is the mean of the Frequency measurements. The predicted Frequency values are therefore just 16 copies of this mean. We use length() to avoid having to hard-code the ’16’.
It is ‘obvious’ that the y=a+bx model is better than the y=y̅ model. The y=y̅ model estimates just one parameter from the data (the mean of y), but leaves a huge amount of residual variance unexplained. The y=a+bx model estimates one more parameter, but with an enormous decrease in the residual variance, and a correspondingly enormous increase in the model’s explanatory power.
How much better? The degree to which the y=a+bx model is better than the y=y̅ model is easily quantified using an F test, and in fact R has already done this for you in the output from summary( chirps.model ):
F-statistic: 494.4 on 1 and 14 DF, p-value: 2.546e-12
Accounting for the covariate Temperature makes a significant difference to our ability to explain the variance in the Frequency values. The F statistic is the result of an F test comparing the residual variance in the y=a+bx model (i.e. the alternative hypothesis: “Temperature makes a difference to frequency of chirps”) with the residual variance in the y=y̅ model (i.e. the null hypothesis “Temperature makes no difference to frequency of chirps”).
An F test tells you whether two variances are significantly different: these can be the variances of two different data sets, or – as here – these can be the variances of two different models. The F value is very large (494) and the two models therefore differ significantly in their explanatory power: by estimating just one extra parameter, the slope, which requires us to remove just one extra degree of freedom, we can explain almost all of the variance in the data.
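The numbers in the summary() report are tied together quite tightly, and it can be reassuring to see this by hand. In a regression with a single covariate, the slope's t value is its estimate divided by its standard error, the F statistic is the square of that t value, and R² can be recovered from F and the residual degrees of freedom. The sketch below uses the rounded figures from the cricket output above, so it only reproduces them approximately; the variable names are illustrative.

```r
# Figures copied from summary( chirps.model ) above
slope    <- 0.127059
slope.se <- 0.005714
resid.df <- 14

# t value: how many standard errors the slope is from zero
t.value <- slope / slope.se
t.p     <- 2 * pt( -abs( t.value ), df = resid.df )

# With only one covariate, the F statistic is the square of the slope's t value
f.value <- t.value^2
f.p     <- pf( f.value, 1, resid.df, lower.tail = FALSE )

# R-squared can be recovered from F and the residual degrees of freedom
r.squared <- f.value / ( f.value + resid.df )

print( c( t = t.value, F = f.value, R.squared = r.squared ) )
print( c( p.from.t = t.p, p.from.F = f.p ) )
```

The two p values are the same quantity computed two ways, which is why the report shows 2.55e-12 both for Temperature and for the F statistic.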
Once we have fitted a linear model, we should check that the fit is good and that the assumption about the normality of the residual variance in the y variable is satisfied.
plot( residuals(chirps.model) ~ fitted(chirps.model) )
This plots the fitted Frequency values (i.e. Frequency.fitted = 0.1271×Temperature-0.1140) as the x variable against the residual values (ε=Frequency-Frequency.fitted) as the y variable. If the residuals are behaving themselves (i.e. they are normal), this should look like a starry sky, with equal numbers of points above and below 0. If the residuals increase or decrease (i.e. it looks like you could stick a line or curve through them) with the fitted values, or are asymmetrically distributed, then your data break the assumptions of linear regression, and you should be careful in their interpretation.
You should also look at the normal quantile-quantile (QQ) plot of the residuals:
qqnorm( residuals( chirps.model ) )
The points on this graph should lie on a straight line. If they’re curved, again, your data break the assumptions of linear regression, and you should be careful in their interpretation. You can scan through these and other diagnostic plots using:
plot( chirps.model )
Fit linear models to the following data sets. Criticise the modelling: are the assumptions of the linear regression met?
Island | Area of island / km^{2} | Number of (non-bat) mammal species |
Jersey | 116.3 | 9 |
Guernsey | 63.5 | 5 |
Alderney | 7.9 | 3 |
Sark | 5.2 | 2 |
Herm | 1.3 | 2 |
sycamore.seeds<-read.csv( "H:/R/sycamore_seeds.csv" )
plot(
  descent.speed ~ wing.length,
  data = sycamore.seeds,
  xlab = "Wing length / mm",
  ylab = expression("Descent speed " / m*s^-1),
  main = "Sycamore seeds with longer wings fall more slowly"
)
sycamore.seeds.model<-lm( descent.speed ~ wing.length, data=sycamore.seeds )
abline( sycamore.seeds.model )
summary( sycamore.seeds.model )

Call:
lm(formula = descent.speed ~ wing.length, data = sycamore.seeds)

Residuals:
      Min        1Q    Median        3Q       Max
-0.073402 -0.034124 -0.005326  0.005395  0.105636

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.388333   0.150479  15.872 9.56e-07 ***
wing.length -0.040120   0.004607  -8.709 5.28e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.06416 on 7 degrees of freedom
Multiple R-squared: 0.9155, Adjusted R-squared: 0.9034
F-statistic: 75.85 on 1 and 7 DF, p-value: 5.28e-05
Both the intercept and the slope are significantly different from zero. The slope is negative, around −0.04 m s^{−1} mm^{−1}. The residuals don’t look too bad, but note that if you accept the model without thinking, you’ll predict that a wing of length −y-intercept/slope = c. 60 mm (i.e. the x-intercept) would allow the seed to defy gravity forever. Beware extrapolation!
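The x-intercept quoted above is a one-line calculation from the fitted coefficients. As a sketch, using the rounded estimates from the summary (the variable names are illustrative):

```r
# Coefficients copied from summary( sycamore.seeds.model ) above
intercept <- 2.388333    # descent speed at zero wing length, m/s
slope     <- -0.040120   # change in descent speed per mm of wing

# The line y = a + b*x crosses y = 0 at x = -a/b
x.intercept <- -intercept / slope
print( x.intercept )   # wing length (mm) at which predicted descent speed is zero
```

With the model object to hand, the same value would come from the two elements of coef( sycamore.seeds.model ).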
nadh.absorbance<-read.csv( "H:/R/nadh_absorbance.csv" )
plot(
  A340 ~ Conc.uM,
  data = nadh.absorbance,
  xlab = "[NADH] / µM",
  ylab = expression(A[340]),
  main = "Absorbance at 340 nm shows linear\nBeer-Lambert law for NADH"
)
nadh.absorbance.model<-lm( A340 ~ Conc.uM, data=nadh.absorbance )
abline( nadh.absorbance.model )
summary( nadh.absorbance.model )

Call:
lm(formula = A340 ~ Conc.uM, data = nadh.absorbance)

Residuals:
       Min         1Q     Median         3Q        Max
-0.0043482 -0.0020392 -0.0004086  0.0020603  0.0057544

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.973e-03  8.969e-04   3.314  0.00203 **
Conc.uM     6.267e-03  3.812e-05 164.378  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.002783 on 38 degrees of freedom
Multiple R-squared: 0.9986, Adjusted R-squared: 0.9986
F-statistic: 2.702e+04 on 1 and 38 DF, p-value: < 2.2e-16
The slope is 6.267×10^{−3} µM^{−1}, which means ϵ is 6.267×10^{3} M^{−1} (per unit path length). This is very significantly different from zero; however, so too is the intercept, which – theoretically – should be zero. You’ll also note that the Q-Q plot:
qqnorm( residuals( nadh.absorbance.model ) )
is very clearly not a straight line, indicating that the residuals are not normally distributed.
logS<-log(c( 9, 5, 3, 2, 2 ))
logA<-log(c( 116.3, 63.5, 7.9, 5.2, 1.3 ))
species.area<-data.frame(logS=logS,logA=logA)
plot(
  logS ~ logA,
  data = species.area,
  xlab = expression("ln( Area of island"/ km^2 *" )"),
  ylab = "ln( Number of species )",
  main = "Species supported by islands of different areas"
)
species.area.model<-lm( logS ~ logA, data=species.area )
abline( species.area.model )
summary( species.area.model )

Call:
lm(formula = logS ~ logA, data = species.area)

Residuals:
       1        2        3        4        5
 0.22137 -0.16716  0.00828 -0.25948  0.19699

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.40977    0.20443   2.004    0.139
logA         0.32927    0.06674   4.934    0.016 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2471 on 3 degrees of freedom
Multiple R-squared: 0.8903, Adjusted R-squared: 0.8537
F-statistic: 24.34 on 1 and 3 DF, p-value: 0.01597
log C is the (Intercept), 0.4098 (so C itself is e^{0.4098} = 1.5), and z is the slope associated with logA, 0.329 (z is an exponent, so it has no units). The residual plots are a little difficult to interpret as the sample size is small; and you’ll note the large error in the estimate of log C, which is not significantly different from 0 (i.e. C may well be 1). I wouldn’t want to bet much money on the estimates of C or of z here.
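Because the model was fitted on log-transformed data, the coefficients need back-transforming before they can be used to predict species numbers. The sketch below refits the model from the island table and converts it back to the S = C·A^z form; the data frame layout and variable names here are illustrative, and with only five islands the predictions should be treated with the same caution as the estimates.

```r
# Island data from the table in this exercise
species.area <- data.frame(
  island = c( "Jersey", "Guernsey", "Alderney", "Sark", "Herm" ),
  A      = c( 116.3, 63.5, 7.9, 5.2, 1.3 ),   # area / km^2
  S      = c( 9, 5, 3, 2, 2 )                 # non-bat mammal species
)

# Fit ln(S) = ln(C) + z * ln(A)
log.model <- lm( log( S ) ~ log( A ), data = species.area )

C <- exp( coef( log.model )[[ 1 ]] )   # back-transform the intercept
z <- coef( log.model )[[ 2 ]]          # the slope is the exponent itself

# Predicted species numbers on the original (untransformed) scale
species.area$S.predicted <- C * species.area$A^z

print( c( C = C, z = z ) )
print( species.area )
```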
Next up… The χ²-test.