Statistical power

In a recent(ish) post, we saw that if a fair coin is flipped 30 times, the probability it will give us 10 or fewer heads is less than 5% (4.937% to be pointlessly precise). Fisher quantified this using the p value of a data set: the probability of obtaining data (or a test statistic based on those data) at least as extreme as that actually observed, assuming that the null hypothesis is true.
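As a quick check of that figure, here is a minimal sketch using pbinom, R's cumulative binomial function. The variable name p.ten.or.fewer is just for illustration:

```r
# P( 10 or fewer heads in 30 flips of a fair coin )
p.ten.or.fewer <- pbinom( 10, size=30, prob=0.5 )
print( round( 100 * p.ten.or.fewer, 3 ) )
```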

If you ask people (even people who use them on a daily basis) what a p value is, you will get a range of answers, probably including:

  • The probability of the data being due to random chance
  • The probability of rejecting the null hypothesis incorrectly
  • The probability of the data
  • The probability of the null hypothesis being true
  • The probability of the null hypothesis being false
  • The probability that the data are significant
  • The probability of the data being accurate

…none of which are correct. Some of these errors can be traced back to the historical mixing-up of Fisher’s significance testing, and a different form of hypothesis testing proposed by Jerzy Neyman and Egon Pearson. Both of these involve calculating p values, and both frequently use the arbitrary cut-off of 0.05, but Neyman-Pearson testing is conceptually distinct. We’ll try to untangle them here.

If we simulate an infinitely large collective of coin-flipping trials, each consisting of 30 flips of a fair coin, we will obtain a binomial distribution. Unfortunately, we don’t have infinite time, so we’ll make do with a randomly generated collection of one thousand 30-flip trials instead:

fair.heads <- rbinom( n=1000, size=30, prob=0.5 )

# breaks allows us to set the bin sizes for a histogram
hist(
    fair.heads,
    main   = "Number of heads in 30 coin-flips of a fair coin",
    xlim   = c( 0, 30 ),
    breaks = seq( 0, 30, by = 1 ),
    xlab   = "Number of heads in 30 flips",
    ylab   = "Frequency"
)

Fair coin simulation [CC-BY-SA-3.0 Steve Cook]

Fair coin simulated data (1000 trials)

36 observations of 10 or fewer heads in 1000 observations is 3.6%, which is reasonably close to the 4.937% we expect from theory.

The function rbinom(how.many, size=n, prob=P) is R’s implementation of a random binomial data generator. The ‘r’ stands for random. rbinom(this.many, size=n, prob=P) gives you this.many random variables drawn from a binomial distribution of size n with a coin producing heads at prob P.

rbinom(1000, size=30, prob=0.5) returns one thousand random variables, each representing the number of heads got from 30 flips of a fair coin.
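If you want your own run to be repeatable, a sketch along these lines fixes the random seed first and then counts the trials with 10 or fewer heads. The seed value (42) and the name prop.low are arbitrary choices of mine, and the exact proportion you get will depend on the seed:

```r
set.seed( 42 )
fair.heads <- rbinom( n=1000, size=30, prob=0.5 )

# proportion of the 1000 trials that gave 10 or fewer heads
prop.low <- mean( fair.heads <= 10 )
print( prop.low )
```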

Other distributions have similar random generator functions available: rnorm (normal), rpois (Poisson), rt (t), rf (F), etc.

To simplify things, we can go straight for the binomial distribution to give us a ‘perfect’ set of coin-flip data:

# create a vector into which we can put the number-of-heads-in-30-flips for 1000 observations
simulated.heads <- vector( "numeric" )

# This uses a for loop, the syntax for which is:
# for ( variable in list ) { do.something.with( variable ) }

for ( k in 0:30 ) {
    # dbinom will give the pmf for each number of heads, k, in 30 flips
    # The number of observations of k-heads-in-30-flips is 1000 times this
    number.of.obs <- round( 1000 * dbinom( k, size=30, prob=0.5 ) )
    # e.g. when k=10, append 28 repetitions of "10" to the simulated head-count data
    simulated.heads <- c( simulated.heads, rep( k, times = number.of.obs ) )
}

hist(
    simulated.heads,
    main   = "Theoretical number of heads in 30 coin-flips of a fair coin",
    xlim   = c( 0, 30 ),
    breaks = seq( 0, 30, by = 1 ),
    xlab   = "Number of heads in 30 flips",
    ylab   = "Theoretical frequency"
)

Fair coin theoretical [CC-BY-SA-3.0 Steve Cook]

Fair coin theoretical data (1000 trials)

So, 10 heads in 30 flips would allow us to reject the null hypothesis that the coin is fair at a significance level of 0.05, but it tells us nothing about how biased the coin is. A coin that is only slightly biased to tails would also be unlikely to produce 10 heads in 30 flips, and a coin so biased that it always lands heads up, or always lands tails up is also inconsistent with 10 heads in 30 flips.

How can we decide between different hypotheses?

You might think: “isn’t that what the p value shows?”, but this is not correct. A p value tells you something about the probability of data under a null hypothesis (H0), but it does not tell you anything about the probability of data under any particular alternative hypothesis (H1). Comparing two hypotheses needs rather more preparatory work.
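To make that concrete, here is a small sketch that works out the probability of 10 or fewer heads in 30 flips under the fair-coin null and under two made-up alternatives. P=0.4 and P=1/3 are arbitrary illustrations of mine, not values used elsewhere in this post. Only the first number is the p value:

```r
# candidate values for the probability of heads; 0.5 is the null
prob.heads <- c( 0.5, 0.4, 1/3 )

# P( 10 or fewer heads in 30 flips ) under each candidate
p.low <- pbinom( 10, size=30, prob=prob.heads )
print( round( p.low, 3 ) )
```

The same data are far more probable under the tails-biased alternatives, which is information the p value on its own does not carry.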

If we consider a coin that may be fair, or may be loaded to heads, we might have two competing hypotheses:

  • H0: the coin is fair, and the probability of heads is 1/2 (P=0.50).
  • H1: the coin is loaded, and the probability of heads is 2/3 (P=0.666…).

Why P=2/3? And why is P=2/3 the alternative rather than the null hypothesis?

I’ve chosen P=2/3 because I have decided – before I start the experiment – that I am not bothered by coins any less biased than this. Perhaps this is for reasons of cost: making coins that are better balanced is not worth the investment in new coin-stamping dies. The difference in the probability of heads under the null and alternative hypotheses is called the effect size. Here, the effect size is the difference of 0.1666… in the probability of getting a head. Other measures of effect size are appropriate to different kinds of hypotheses.

I’ve made P=2/3 the alternative hypothesis because it is more costly to me if I accept that hypothesis when it is – in fact – false. Again, perhaps this is for reasons of cost: sticking with the old coin dies is a lot cheaper than replacing them, and it’s unlikely that a coin will be used to decide anything important: just trivial nonsense like kick-offs in football that I’m not really bothered by.

If you’re already objecting to this pre-emptive bean-counting approach to scientific data collection, you have hit upon one of the things Fisher didn’t like about Neyman and Pearson’s method, so you are in good company. That said…

The Neyman-Pearson method tries to control for the long-term probability of making mistakes in deciding between hypotheses. In the example above, we are trying to decide between two hypotheses, and there are four possible outcomes:

Decision  | H0 is true          | H0 is false
Accept H0 | Correctly accept H0 | Type II error
Reject H0 | Type I error        | Correctly reject H0

In two of these outcomes, we make a mistake. These are called type I and type II errors:

  • Type I error: rejecting the null hypothesis when it is – in fact – true. This is a kind of false-positive error. The coin isn’t really biased, but we’re unlucky and get data that is unusually full of heads.
  • Type II error: accepting the null hypothesis when it is – in fact – false. (If we are considering only two hypotheses, this is equivalent to rejecting the alternative hypothesis when it is – in fact – true) This is a kind of false-negative error. The coin really is biased, but we’re unlucky and get data that looks like a 50:50 ratio of heads to tails.

If we were to repeat a coin-flipping experiment a very large number of times, any decision rule that allows us to decide between H0 and H1 will give us the wrong decision from time to time. There’s nothing we can do about this, but we can arrange the experiment so that the probabilities of making either of the two mistakes are reduced to acceptable levels:

  • We call the long-term probability of making a type I error across a large number of repeats of the experiment, α. In probability notation, α = P(reject H0 | H0), “the probability of rejecting the null hypothesis given (|) that the null hypothesis is true”.
  • We call the long-term probability of making a type II error across a large number of repeats of the experiment, β. In probability notation, β = P(accept H0 | ¬H0), “the probability of accepting the null hypothesis given that the null hypothesis is not (¬) true”.
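These long-term error rates can be checked by brute force. The sketch below assumes the decision rule "reject H0 if we see more than 19 heads in 30 flips" and takes the 2/3-loaded coin as the alternative. The seed, the number of repeats and the variable names are my own choices:

```r
set.seed( 1 )
repeats <- 100000

# many experiments where H0 is true (fair coin) and where H1 is true (2/3-loaded coin)
heads.fair   <- rbinom( n=repeats, size=30, prob=0.5 )
heads.loaded <- rbinom( n=repeats, size=30, prob=2/3 )

# decision rule (an assumption for this sketch): reject H0 if heads > 19
alpha.hat <- mean( heads.fair   >  19 )   # rejected H0 although it was true
beta.hat  <- mean( heads.loaded <= 19 )   # accepted H0 although it was false

print( c( alpha=alpha.hat, beta=beta.hat ) )
```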

We can choose whatever values we think appropriate for α and β, but typically we set α to 0.05. This is one of the things that results in the confusion between the Neyman-Pearson and Fisher approaches.

For two coins – one fair, and one biased to give 2/3 heads – the probability distributions for 30 flips look like this:

Fair and loaded coin pmf distribution overlaps [CC-BY-SA-3.0 Steve Cook]

Fair (left) and 2/3-loaded (right) coin PMF distributions, showing overlaps

The vertical line cuts the x-axis at 19 heads. This is the critical value, i.e. the number of heads above which we would reject the null hypothesis, because the probability of getting more than 19 heads (i.e. 20 or more) in 30 flips of a fair coin is less than 0.05. The blue area on the right is therefore just under 5% of the area beneath the fair coin’s PMF curve, and it represents the probability of a false positive: i.e. rejecting the null hypothesis when it is – in fact – true, which is α.

The value ’19’ is easily determined using qbinom, which – as we saw in the last post – gives us the number of heads corresponding to a specified p-value.

critical.value = qbinom( p=0.05, size=30, prob=0.5, lower.tail=FALSE )
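A quick way to convince yourself that 19 is the right cut-off is to look at the upper-tail probabilities either side of it. This sketch uses pbinom with lower.tail=FALSE, which gives P(more than x heads). The names p.above.crit and p.one.lower are mine:

```r
critical.value <- qbinom( p=0.05, size=30, prob=0.5, lower.tail=FALSE )

# P( more than 19 heads ), i.e. 20 or more: this is under 0.05
p.above.crit <- pbinom( critical.value,     size=30, prob=0.5, lower.tail=FALSE )
# P( more than 18 heads ), i.e. 19 or more: this is over 0.05
p.one.lower  <- pbinom( critical.value - 1, size=30, prob=0.5, lower.tail=FALSE )

print( c( critical.value, p.above.crit, p.one.lower ) )
```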

For coin-flipping, the statistic of interest – the one that we can use to make a decision between the null and alternative hypotheses – is simply the number of heads. For other kinds of test, it may be something more complicated (e.g. a t value, or F ratio), but the same general principle applies: there is a critical value of the test statistic above which p is less than 0.05, and above which we can therefore reject the null hypothesis with a long-term error rate of α.

The red region on the left is the portion of the 2/3-loaded coin’s PMF that is ‘cut off’ when we decide on a long-term false positive rate of 5%. Its area represents the probability of a false negative: i.e. accepting the null hypothesis when it is – in fact – false. This is β.

You can hopefully see that β is given by pbinom( critical.value, size=30, prob=2/3 ). Its value – in this case 0.42 – depends on:

  1. The critical value (which depends on the choice of α).
  2. The number of coin-flips (here 30), i.e. the sample size.
  3. The bias of the coin under the alternative hypothesis (2/3), which is related to the effect size, i.e. the smallest difference in ‘fair’ and ‘loaded’ behaviour that we are interested in (2/3 – 1/2 = 0.1666…)
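Putting numbers on this, a minimal sketch (the variable names are mine) computes β for the fair vs. 2/3-loaded decision on 30 flips:

```r
critical.value <- qbinom( p=0.05, size=30, prob=0.5, lower.tail=FALSE )

# beta: probability that a 2/3-loaded coin gives critical.value heads or fewer
beta <- pbinom( critical.value, size=30, prob=2/3 )
print( round( beta, 2 ) )
```

Changing size, prob or the p passed to qbinom lets you see how each of the three items above moves β.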

This is where the Neyman-Pearson approach diverges from Fisher’s. Fisher’s approach makes no mention of an alternative hypothesis, so it makes no attempt to quantify the false-negative rate. Neyman-Pearson sets up two competing hypotheses, and forces you to explicitly consider – and preferably control – β.

If the sample size and effect size are held constant, then decreasing α increases β, i.e. there is a trade-off between the false-positive rate and the false-negative rate: imagine sliding the vertical line left or right, and consider what would happen to the red and blue areas. For the decision between ‘fair’ vs. ‘2/3-loaded’, on a sample size of 30, beta is 0.42: we are eight times as likely to make a type II error as a type I error. It is very likely that we will wrongly say a coin is fair when in fact it is biased, and it is much more likely that we will make that mistake than we will call a fair coin biased when it isn’t. A mere 30 coin-flips is therefore woefully underpowered to distinguish between P=1/2 and P=2/3.
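The trade-off is easy to see numerically. In this sketch the three α values are arbitrary picks of mine. Because the number of heads is discrete, nearby α values can share a critical value, so I have spaced them well apart:

```r
alpha <- c( 0.20, 0.05, 0.01 )

# critical number of heads for each alpha (30 flips, fair coin)
crit <- qbinom( p=alpha, size=30, prob=0.5, lower.tail=FALSE )
# matching false-negative rate against a 2/3-loaded coin
beta <- pbinom( crit, size=30, prob=2/3 )

print( data.frame( alpha, crit, beta=round( beta, 2 ) ) )
```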

The power of a test, π, is:

π = 1 – β = P( reject H0 | H1 )

and is somewhat more convenient than β to remember:

  • α = P(reject H0 | H0)
  • π = P(reject H0 | H1)

What is a good power to aim for? Typically, we want an experiment to have a β of 0.20 (i.e. a power of 0.8). You might ask “why the asymmetry?” between choosing α=0.05 and β=0.20. There’s no particularly good general reason to want β to be larger than α. You might justify this on the basis that type II errors are less costly so you don’t mind making them more often than type I errors (do you really want to replace all those coin dies?) but the real reason is usually more pragmatic: detecting even small effect sizes requires quite large sample sizes:

Coin distribution overlaps for increasing sample size [CC-BY-SA-3.0 Steve Cook]

Coin distribution overlaps for increasing sample size

[To make R plot graphs on a grid like this, you want: par( mfrow=c( 2, 2 ) ) which sets up a 2-by-2 grid: each call to plot then adds a new graph until it fills up]
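For the curious, here is a rough sketch of how a grid like this could be drawn. It is not the code behind the figure above: the colours, the small sideways offset for the loaded coin and the dashed cut-off line are all my own guesses at a reasonable layout:

```r
sizes <- c( 15, 30, 60, 120 )
betas <- numeric( length( sizes ) )

old.par <- par( mfrow=c( 2, 2 ) )    # 2-by-2 grid, remembering the old settings
for ( i in seq_along( sizes ) ) {
    n      <- sizes[i]
    k      <- 0:n
    fair   <- dbinom( k, size=n, prob=0.5 )
    loaded <- dbinom( k, size=n, prob=2/3 )
    crit   <- qbinom( p=0.05, size=n, prob=0.5, lower.tail=FALSE )
    betas[i] <- pbinom( crit, size=n, prob=2/3 )

    plot( k, fair, type="h", col="blue", ylim=c( 0, max( fair, loaded ) ),
          main=paste( n, "flips" ), xlab="Number of heads", ylab="Probability" )
    lines( k + 0.3, loaded, type="h", col="red" )
    abline( v=crit + 0.5, lty=2 )    # reject H0 to the right of this line
}
par( old.par )                       # put the plotting settings back
```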

In the graphs, you can see that as we increase the sample size from 15 through to 120, the distributions get ‘tighter’, the overlap gets smaller, and the area representing β gets smaller (and so the power of the test gets higher). If we plot the value of β against the sample size, we get this:

Coin power versus sample size [CC-BY-SA-3.0 Steve Cook]

We need 60-odd flips to detect a 2/3-biased coin with a β of 0.20, but we need nearly 100 flips for a β of 0.05. The weird scatteriness is because the binomial distribution is discrete.

The sample size needed to detect a 2/3-loaded coin is c. 60 flips for a β of 0.20 (a power of 0.80), and c. 100 flips for a β of 0.05 (a power of 0.95). If we have already decided on the value of α and the effect size of interest, then fixing the sample size is the only control we have over the type II error rate.
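The numbers behind a plot like this can be sketched in a few lines. I have assumed sample sizes from 1 to 120 flips and the same one-sided α of 0.05. Note that, because of the scatteriness, the first sample size to dip under a given β is no guarantee that every larger one stays under it:

```r
size <- 1:120
crit <- qbinom( p=0.05, size=size, prob=0.5, lower.tail=FALSE )
beta <- pbinom( crit,   size=size, prob=2/3 )

# first sample sizes at which beta dips to 0.20 and to 0.05
first.80 <- min( which( beta <= 0.20 ) )
first.95 <- min( which( beta <= 0.05 ) )
print( c( first.80, first.95 ) )

plot( size, beta, type="l", xlab="Sample size (flips)", ylab="beta" )
```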

It is fairly rare for people to do power calculations of this sort before they start collecting data: more often than not, they simply collect a sample as large as they can afford. This means that – whilst type I errors are controlled at 0.05 – type II errors are often ignored, assumed (wrongly!) to be low, or – at best – calculated retrospectively. Retrospective calculations can give unpleasant surprises: as we saw above, 30 flips will result in quite a lot of quite badly loaded coins going undetected.

Ideally, you should calculate the sample size needed for a particular power before the experiment: that way, you will collect precisely the amount of data needed to give you the long-term error rates you think are acceptable (0.05 for α and 0.20 for β, or whatever). This brings us on to the last major confusion between Fisher and Neyman-Pearson testing:

  • p is the probability of a test statistic being at least as extreme as the test statistic we actually got, assuming that the null hypothesis is true.
  • α is the long-term probability of mistakenly rejecting the null hypothesis when it is actually true, if we were to repeat the experiment many times.

In Neyman-Pearson testing, we calculate a test statistic from our data (in this case just the number of heads), and compare that to the critical value of that test statistic. If the calculated value is larger than the critical value, we reject the null hypothesis. However, we can make exactly the same decision by calculating a p value from our data, and if it is less than α, rejecting the null hypothesis. Both of these approaches result in the same decision. The former used to be much more common because it was easy to obtain printed tables of critical values of test statistics like t and F for α=0.05. More recently, since computing power has made it far easier to calculate the p value directly, the latter approach has become more common.
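That the two routes always agree can be checked exhaustively for our coin example. This sketch assumes the one-sided test used throughout (reject for too many heads) and α=0.05. The variable names are mine:

```r
alpha <- 0.05
heads <- 0:30

# route 1: compare the test statistic (number of heads) to the critical value
crit           <- qbinom( p=alpha, size=30, prob=0.5, lower.tail=FALSE )
reject.by.crit <- heads > crit

# route 2: compare the p value, P( this many heads or more | fair coin ), to alpha
p.value        <- pbinom( heads - 1, size=30, prob=0.5, lower.tail=FALSE )
reject.by.p    <- p.value <= alpha

print( all( reject.by.crit == reject.by.p ) )
```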

Remember that in Fisher’s approach, a p value of 0.01 is ‘more significant’ than a p value of 0.05, and Fisher suggested (wrongly, IMHO) that this should give you more confidence in your rejection of the null. In Neyman-Pearson testing, if a p value is calculated at all, it is only to compare it to α. If you have done the power calculation, and collected the correct sample size, working out the p value is simply a means to an end: either it is larger than α (…accept null) or it is smaller than α (…reject null), but its value has no further importance. In Neyman-Pearson hypothesis testing, there is no such thing as “highly significant, p=0.01”.

Unfortunately, a lot of people want to have their statistical cake and eat it: they don’t do a power calculation and collect a sample size that is expedient rather than justified (screw you Neyman-Pearson!), but they then want to interpret a very low p value (yay Fisher!) as being evidence for a particular alternative hypothesis (love you really, Neyman-Pearson!). This isn’t really very coherent, and many statisticians now prefer to side-step the whole issue and go for confidence intervals or Bayesian approaches instead, which I leave to others to explain.

This brings us to the question – finally! – of how many coin-flips we should collect to distinguish a loaded coin from a fair one. There are several answers to this, depending on our constraints, but two are particularly important:

  1. We have limited money and time, so we can only do 30 flips. What is the minimum effect size we can reliably detect?
  2. We have unlimited money and time (ha ha ha!), so we can do as many flips as are necessary, but we’re only interested in coins that are biased to 2/3 heads or more. What sample size do we need to reliably detect a particular effect size of interest?

So, firstly, for the number of flips we can be bothered to make, what is the minimum effect size we can reliably (α=0.05, β=0.20) detect? For some common tests, R has built-in power calculations, but in general, the simplest way to work this out is by simulation. This requires a little code. In this case, it’s only actually about 12 lines, but I’ve commented it heavily as it’s a little tricky:

# Set up empty vectors to contain numeric data. sample.size will be filled in 
# with the minimum sample size needed to distinguish a fair coin from a
# loaded coin of various degrees of bias. effect.size will be filled in
# with the effect size for each loaded coin

sample.size <- vector('numeric')
effect.size <- vector('numeric')

# We'll try out numbers of flips between 1 and 200. Starting at 1 (not 0)
# means that item number n of any vector built from 'size' is for n flips

size  <- c(1:200)

# Loop through some values of 'loaded' from 19/32 to 1. We don't go any
# lower than this because even 18/32 requires more than 200 flips!

for ( loaded in seq( from=19/32, to=32/32, by=1/32 ) ) {
    # since 'size' is a vector, qbinom will return a vector of 'crit' values
    # for all trial sizes between 1 and 200 coin-flips; and pbinom will then
    # return the beta values for each ('crit', 'size') pair.

    crit <- qbinom( p=0.05, size=size, prob=0.5, lower.tail=FALSE )
    beta <- pbinom( crit,   size=size, prob=loaded )

    # 'which' returns all the indices into the beta vector for which
    # the beta value is <= 0.20. Since these were put in in 'size' 
    # order (i.e. beta item number 5 is the beta value for a trial of 5 flips)
    # using 'min' to return the smallest of these indices 
    # immediately gives us the smallest sample size required to
    # reliably detect a bias in the coin of 'loaded'

    smallest.needed <- min( which( beta <= 0.20 ) )
    # Append the value we've just worked out to the sample.size vector
    sample.size     <- c( sample.size, smallest.needed )
    # Effect size is the degree of bias minus the P for a fair coin (0.5)
    effect.size     <- c( effect.size, loaded - 0.5 )
}

# Which is the first index into sample.size for which the value is 30 or less?

index.for.thirty <- min( which( sample.size <= 30 ) )

# Since effect.size and sample.size were made in parallel, using this index
# into effect.size will give the minimum effect size that could be detected
# on 30 coin flips

effect.size[ index.for.thirty ]

The answer is that 30 flips is enough to reliably detect a bias only if the effect size is at least 0.22 (0.21875 on our grid), i.e. to distinguish between a fair coin and a coin that is biased to give 23/32 (about 72%) heads. Does this surprise you? You need quite a lot of flips to distinguish even a large bias.

[NB – we made rather coarse steps of 1/32 in the loop that generated this data, so the answer is only resolved to the nearest 1/32. If you do smaller steps of 1/64, or 1/128, you can pin the value down more finely]

A graph of effect size against sample size for (α=0.05, β=0.20) is useful in answering this question and the next:

Coin flipping effect size vs sample size [CC-BY-SA-3.0 Steve Cook]

Effect size and sample size are inversely related: a small sample size can only detect a large bias; a large sample size is needed to detect a small bias

Secondly, what is the minimum sample size we need to reliably (α=0.05, β=0.20) detect an effect size of interest to us?

The same code as above generates the necessary data, but now we want:

index.for.twothirds <- min( which( effect.size >= 2/3 - 1/2 ) )
sample.size[ index.for.twothirds ]

At least 44 flips are needed to distinguish a fair coin from a coin biased to give 22/32 heads, which is the first value in our loop at or above 2/3. [Again, bear in mind the coarseness of the loop: the real value for a coin biased to exactly 2/3 heads is nearer 60].
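If you would rather avoid the grid altogether, the search can be wrapped in a small helper and asked about one bias at a time. This is only a sketch: min.flips and its argument names are hypothetical, not something from base R or the pwr package. It reports the first sample size at which β dips to the target, which (as the scattery plot earlier shows) does not guarantee that every larger sample is also adequate:

```r
# Smallest number of flips (up to max.size) at which a one-sided test of a fair
# coin has a false-negative rate of beta.max or less against a 'loaded' coin
min.flips <- function( loaded, alpha=0.05, beta.max=0.20, max.size=2000 ) {
    size <- 1:max.size    # starting at 1 so that index n means n flips
    crit <- qbinom( p=alpha, size=size, prob=0.5, lower.tail=FALSE )
    beta <- pbinom( crit,    size=size, prob=loaded )
    ok   <- which( beta <= beta.max )
    if ( length( ok ) == 0 ) return( NA )    # nothing adequate within max.size
    min( ok )
}

print( min.flips( 2/3 ) )
print( min.flips( 3/4 ) )
```

Because the helper takes the bias directly, there is no rounding up to the nearest step of the loop.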

All this may sound like an awful lot of bother merely to be told that your sample size is woefully inadequate, or that you are going to miss lots of biased coins. Fortunately, R makes it easy to obtain this unwelcome information, at least for common tests like t and ANOVA, and easy-ish for some other tests, through the pwr package, which you can install from CRAN with:

install.packages( "pwr" )

…load into a script with:

library( pwr )

…and find information about usage with:

help( package="pwr" )

Exercises

  1. What sample size is needed to distinguish a fair coin from a slightly heads-biased coin (P=0.55) with α=0.05, and a power of 0.80? What is the critical number of heads that would allow you to reject the null?

Answers

  1. A power of 0.80 means β=0.20. The same code we used for the effect vs. sample size calculation can be modified to give us the data we want. The paste() function allows us to put the value of variables into strings for use in print() statements or as axis titles, etc. Note that the proportion of heads required to convince us of P=0.55 rather than P=0.50 is actually only 330/620 = 0.53, because the sample size is so large.

# size starts at 1 so that item number n of crit and beta is for n flips
size <- c(1:1000)
crit <- qbinom( p=0.05, size=size, prob=0.50, lower.tail=FALSE )
beta <- pbinom( crit,   size=size, prob=0.55 )
n    <- min( which( beta <= 0.20 ) )

print( paste( "Critical value is", crit[n], "heads" ) )
print( paste( "Sample size required is", n ) )
"Critical value is 330 heads"
"Sample size required is 620"

Next up… The F-test.

Organism of the week #26 – Oxymoron

Plants can have some very odd names. Bears are not renowned for their trousers, and this spiky sod is the last thing anyone would want to make a pair of trousers from, but “bear’s breeches” it is. Even its Latin name is odd: acanthus means spiny, and mollis means smooth; a literal oxymoron.

Acanthus mollis [CC-BY-SA-3.0 Steve Cook]

Acanthus mollis

It might not look very familiar, but it may be the most quietly famous plant in the world, having been immortalised in stone through much of the world for over 2000 years.

As any classicist – and many a bored sixth-former stuffing their CV with General Studies – can tell you, there are three ways to cap off an architectural column in the Greek style: plainly, fussily, or gaudily (the Romans later added boringly and ludicrously).

The gaudy version is decorated with the leaves of Acanthus: you can see them here at the top of the columns outside the Royal Institution in London:

Royal Institution (T.H. Shepherd) [Public domain]

Royal Institution (T.H. Shepherd)

Why Acanthus was chosen rather than any other local Mediterranean plant is as much a mystery as the plant’s strange common name. There is a story, quoted by Vitruvius, that Acanthus was found growing through a votive basket left on the grave of a young girl, and this inspired the sculptor and architect Callimachus to invent a new kind of column. This sounds about as likely to me as those ridiculous backronym etymologies of swear words (“Fornicating Under Consent of the King” – yeah, right), but whatever the reason, I think it’s rather nice that Callimachus (or whoever) elevated an obscure, prickly thing like Acanthus to such heights, rather than going for an obvious, safe choice like grapes or olives. But I guess I would, wouldn’t I?

Organism of the week #25 – Bull headed

This is another of the things we have found down a microscope in one of our undergraduate practicals, but for once it’s not a ciliate.

Bucephalus minimus [CC-BY-SA-3.0 Steve Cook]

Bucephalus minimus

This is the larva of a parasitic fluke called Bucephalus, which is the Greek for ‘bull headed’. It’s appropriate for this fluke not because it looks like Alexander the Great’s horse, but because its two “tails” (furcae) look like horns when they are fully extended.

Haeckel Bucephalus [Public Domain]

Bucephalus as interpreted by Ernst Haeckel

Like most of its relatives, Bucephalus has a ludicrously complicated life-cycle, making life miserable for no less than three separate hosts: a mollusc (typically a clam), a small foraging fish (like a smelt or goby), and a larger predatory fish (such as a sea-bass):

Bucephalus life-cycle [CC-BY-SA-3.0 Steve Cook, Didier Descouends, Citron, Roberto Pillon]

Two of these hosts – the clams and sea-bass – are economically important sea foods. Like many flukes, the larval stages in the first intermediate host – the clam – chemically castrate the host so that it diverts more resources to the survival and propagation of the parasitic fluke than it does to its own. In the sea bass – the definitive host, where the fluke reproduces sexually – the adults are found in the fish’s gut, and heavy infections cause weight loss in both wild and farmed fish.

Although Bucephalus cannot infect humans, we are the definitive host for several other flukes, most importantly the liver fluke, the Chinese liver fluke, the intestinal fluke, and the blood fluke that causes schistosomiasis (bilharzia). These have similarly complicated life cycles involving molluscs and assorted other intermediate hosts, and between them they infect many tens of millions of people worldwide.

Drug treatment options for fluke diseases tend to be limited and have unpleasant side effects, and schistosomiasis in particular is on the WHO list of neglected tropical diseases, for the damning ratio of infections in the developing world to investment made by the developed world in its treatment and prevention. It really is about time this changed.

A graph to show

I’ve never been sure where “a graph to show…” comes from. As far as I can tell, A-level specifications don’t use or specify this wording, and you wouldn’t typically see it in a figure legend in a scientific paper. But if you ask first-year students to put a title on a graph without any further guidance, almost every one of them will default to this mindless boilerplate.

I hate “a graph to show how y varies with x” with a passion bordering on pathology.

“A graph to show” tells me nothing whatsoever that a properly drawn graph doesn’t already show. It is as superfluous as gold paint on a lily or jackboots on Theresa May:

  • If it’s an x,y scatter-plot, then the y axis will have a clear label, stating what y is in terms that the expected audience will understand, including details of the units in which it has been measured (if any).
  • If it’s an x,y scatter-plot, then the x axis will also have a clear label with the equivalent details.
  • If it’s an x,y scatter-plot, then you presumably wouldn’t have plotted such a thing if you didn’t want to show me how y varies with x.
  • And finally, if it’s an x,y scatter-plot, then the very last thing you should waste your breath or word-count telling me is the fact that it’s a “graph”. I know it’s a graph, because – you know – it’s got f***ing axes and f***ing data points and all the other sexy trappings of graphdom.

Titling a graph with “a graph to show how y varies with x” is a waste of time. But it’s worse than that. By training students to write mindless titles, you divert their attention from actually writing a title (and/or legend) that is useful to the reader, and to the writer.

A useful title should tell the reader enough about how (and why) the data has been collected for the graph to stand alone.

Some real(ly annoying) examples:

A graph to show how the absorbance varies with wavelength

The absorbance of what chemical? Wavelength of light, or of some other wave? How does it vary? Why should I care? Does a graph that shows absorbance (or emission) against wavelength of light have a specific name?

A graph to show how the rate of the enzyme varies with pH

Which enzyme? What substrate? How does it vary? Is there – perhaps – an optimum? What pH value gives this optimum? Is it (in)consistent with the typical pH values in which this enzyme is found?

A graph to show how the pH varies with the amount of base added to the acid

Which acid? Which base? What concentrations? Does it buffer? How many times and at what pKa values? Is there a specific name for this experimental procedure?

A graph to show how the number of lichens in a wood varies with size

Number of lichens or number of species of lichen? Which wood? Latitude and longitude? Presumably the size of the wood, not the lichen (do you mean ‘area’?) Is there a well-known mathematical relationship between these two variables? What parameters have you estimated from it?

Writing better graph titles means really thinking about what your data show and how they were collected. Putting in the effort to give your graphs meaningful titles will result in better discussion of those results. And if you don’t…

A graph to show [CC-BY-SA-3.0 Steve Cook]

A graph to show how your score will vary with the number of times you say “a graph to show how y varies with x”

Organism of the week #24 – Danse Macabre

For three centuries, the Black Death was routinely epidemic in London. The first outbreak – in 1348 – probably killed half the population of England; the last outbreak – from 1665 to 1666 – probably killed a quarter of the population of London.

In 1665, Isleworth was a small village several hours’ walk (or row) from London proper, but the Great Plague found its way there anyway. As in many places, so many died that digging individual graves became impractical, so instead, the bodies were interred in communal plague pits.

Taxus baccata at All Saints' Isleworth plague pit [CC-BY-SA-3.0 Steve Cook]

All Saints’ Isleworth plague pit memorial

Isleworth is one of the few places in present-day London where there is evidence above of the burials below. A cairn of stones and a yew tree sit atop the pit and mark the resting place of the 149 people who died there.

All Saint's Isleworth plague pit plaque [CC-BY-SA-3.0 Steve Cook]


Yew trees (Taxus baccata) have long had an association with churchyards. The optimistic may consider this appropriate because yews are an evergreen reminder of the life eternal; the cynical may have other analogies to draw.

Taxus baccata trunk at All Saints' Isleworth plague pit [CC-BY-SA-3.0 Steve Cook]

Taxus baccata yew trunk at All Saints’ Isleworth

Since the 1980s, there has been some debate about the cause of the Black Death. The majority of evidence pointed to the bacterium Yersinia pestis, a parasite of rats and other rodents that gets from place to place in the guts of fleas, and causes swelling of lymph nodes, fever, coughing, bleeding under the skin, gangrene and – all too frequently – death.

Flea from Hooke's Micrographia [Public Domain: Steve Cook]

Flea from Robert Hooke’s Micrographia. I still can’t believe the lovely people at the Royal Institution let me touch it

In particular, an outbreak of Y. pestis plague that started in China in 1855, and continued worldwide until 1959, had similar symptoms to those reported by mediaeval scholars for the Black Death. However, some remained unconvinced, and pointed the finger instead at haemorrhagic fevers, anthrax, or other agents.

Yersinia_pestis [Public Domain, credit: NIH]

Yersinia pestis bacteria in the gut of a flea [Public Domain, credit: NIH]

Recent evidence from DNA sequencing of samples taken from plague pits and other burials appears to back up Y. pestis as being the culprit for both the mediaeval Black Death and the even earlier Justinian Plague, which devastated the Byzantine Empire in 541-542.

Plague is currently easily treated with antibiotics like streptomycin if caught early enough, but it’s never really gone away: rodents still carry plague in the US, India, China, Brazil, and southern Africa, and all of these countries have reported infections in the last 40 years.

It’s strange to think that one of the greatest killers in all of human history exists not just in GCSE history books, but also out there in the real world.

Waiting patiently in the shadows.

Organism of the week #23 – Rattled

My annual summer ritual to stave off death for one more year involves running round Kensington Gardens and Hyde Park, which are situated conveniently close to $WORK.

I lumbered merry as a shroud.

That aches and sweats o’er trails and heights,

When all at once I saw a crowd,

A host, of golden parasites:

Rhinanthus minor (field) [CC-BY-SA-3.0 Steve Cook]

Yellow rattle growing in the north of Kensington Gardens (Rhinanthus minor)

Yellow rattle is a member of the broomrape family, which are almost all parasites of other plants. Broomrape itself is completely parasitic, and obtains all the sugars and other nutrients it needs to grow from other plants.

Orobanche minor [CC-BY-SA-3.0 Rosser1954]

Broomrape (Orobanche minor) is a plant that completely lacks chlorophyll, so it is a ghostly white colour [CC-BY-SA-3.0 Rosser1954]

This is why it has no need for the green pigment chlorophyll, which most plants use to capture sunlight. Yellow rattle, as you can see from the image below, has normal-looking green leaves, so it must be making at least some of its own food.

Rhinanthus minor [CC-BY-SA-3.0 Steve Cook]

Yellow rattle flowers and leaves

So how do we know it’s a parasite? If you very carefully dig up yellow rattle roots, you’ll find they grow tightly around the roots of grasses and other nearby plants, and make physical connections with them. The rattle uses these connections to tap into the roots of other plants and to steal their water and nutrients. It’d be nice to show you this, but whenever I’m near these plants I am a sweaty, exhausted mess and it’s all I can do to take a photo, let alone go digging. Also, as it’s a Royal Park, digging up rattle is probably treason or some similar nonsense.

You’ll have to make do with circumstantial evidence instead. The big bald patch in the image below is where the rattle is growing. The grass is about half the height of the grass surrounding the infested patch, and much sparser.

Rhinanthus minor bald patch [CC-BY-SA-3.0 Steve Cook]

Bald patch in a sward of grass caused by yellow rattle infestation

[If you’re a good ecologist, you might suggest that this is merely evidence for competition rather than parasitism, and you’d be right.]

At the end of flowering, the plant dies, leaving just a spike of seed capsules:

Rhinanthus minor (field) capsules [CC-BY-SA-3.0 Steve Cook]

Yellow rattle capsules (bottom of image) mixed in with unhappy grass

It’s the capsules that give the rattle its name: they make a satisfying noise when gently jiggled.

Although the rattle does damage the grass, it has a positive effect overall on the biodiversity of the field by keeping the grasses in check. As yellow rattle allows other species to grow that would normally get shaded out, it makes for a useful addition to those wildflower meadows that are nearly as beloved of the chattering classes as is the middle-aged PE of running round parks.

Apologies to William Wordsworth, and poetry more generally.

Phage vs. host

For a recent schools’ outreach day, I put together a card-game based around the arms-races that develop between bacterial hosts and their viruses (bacteriophages). It’s mostly just a bit of fun, but if anyone finds it useful or can suggest improvements (or just make them! I release this under a CC-BY-SA-4.0 license) I’d be happy to hear them.

Phage vs. host [CC-BY-SA-4.0 Steve Cook]


We had the annual “looking at muck down a microscope” practical last week. As usual, the best thing we saw was a ciliate in some pond water, in this case a little trumpet animalcule:

Stentor sp. [CC-BY-SA-3.0 Steve Cook]

Stentor sp. The green bits are either symbiotic algae, or dinner, possibly a bit of both.

Previous winners: Vorticella and Lacrymaria. The Ciliata really are the phylum that keeps on giving.

A queen’s Christmas message

Well, at least 2:8 is plausible.

2:1 And it came to pass 10 years after the death of Herod the Great, that there went out a decree from Caesar Augustus that all the world – except those irrelevant bits that the Romans hadn’t conquered – should be taxed.

2:2 (And this taxing was first made when Quirinius was governor of Syria during what – by a large stretch of the imagination – may have been his second term, his first (entirely undocumented) term having been in 4 BCE, during which an (entirely undocumented) census almost certainly didn’t take place either.)

2:3 And all went to be taxed, every one into his own ancestors’ city, in direct contravention of previous Roman policy and of common sense.

2:4 And Joseph also went up from Galilee, out of the city of Nazareth – perhaps using his wooden time-machine to travel back through the centuries required for the town to come into existence – into Judaea, unto the city of David, which is called Bethlehem.

2:5 To be taxed with Mary his espoused wife, being great with her child, whom she claimed – completely plausibly – to have fallen into her womb from heaven, rather than to have been formed in the usual grisly fashion.

2:6 And so it was, that, while they were there, the days were accomplished that she should be delivered of her son, in accordance with a variety of mistranslated prophecies that meant something quite different.

2:7 And she brought forth her firstborn son, and wrapped him in swaddling clothes, and laid him in a manger; because there was no room for them in the inn, this being full of the whole population of Judaea, who – like Joseph – had unrealistic ideas about Fisher’s relatedness coefficient and the importance of Y chromosomes.

2:8 And there were in the same country shepherds abiding in the field, keeping watch over their flock by night.

2:9 And, lo, the angel of the Lord came upon them. The glory and bowel-loosening terror of the Lord shone round about them: and they were sore afraid, particularly of the cherubim.

2:10 And the angel said unto them, Fear not: for, behold, I bring you good tidings of great joy, which shall be to all people. For some value of ‘people’. And of ‘all’.

2:11 For unto you is born this day in the city of David a man who is God, and also the son of God, and also the son of a girl from a town that doesn’t exist. But definitely not the son of Joseph. Despite the effort we’ve gone to in establishing his back-story.

2:12 And this shall be a sign unto you; Ye shall find the logical abomination wrapped in swaddling clothes, lying in a manger, possibly attended by a number of Persian priests or kings or wise-men, whom the author of this document will casually forget to mention.

2:13 And suddenly there was with the angel a multitude of the four-faced, six-winged heavenly host praising God, and saying,

2:14 Glory to God in the highest, and on earth peace and good will toward men, except Monophysites, Monothelites, Arians, Nestorians, Manichaeans, Marcionites, Ebionites, Sadducees, Pharisees, Docetists, Cathars, and especially not towards those bloody atheists.

2:15 And it came to pass, as the angels were gone away from them into heaven, the shepherds said one to another, Let us now go even unto Bethlehem, and see this thing which is come to pass, which the Lord hath made known unto us.

2:16 And they came with haste, and found Mary, and Joseph, and the babe lying in a manger.

2:17 And when they had seen it, they made known abroad the saying which was told them concerning this child, surely much to Joseph’s delight.

2:18 And all they that heard it wondered at those things which were told them by the shepherds. You would wonder about it, wouldn’t you? Wouldn’t you?

2:19 But Mary kept all these things, and pondered them in her heart, for she knew that her remaining verses were numbered.

2:20 And the shepherds returned, glorifying and praising God for all the things that they had heard and seen, as it was told unto them.

2:21 And (as will probably be prudishly edited out when you hear this read in the dim and distant future), when eight days were accomplished for the cutting off of part of the child’s penis, his name was called Joshua, which was so named of the angel before he was conceived in the womb.

A very Merry Joshuamas to you all.

Organism of the week #22 – Faking it and making it

Nettles have a rather unhappy reputation as bringers of painful welts, and – at this time of year – dribbling noses too. The welts are probably caused by histamine, and the pain by oxalic and tartaric acids, which the nettle injects into your skin through the tiny brittle hairs that cover its stems and leaves. If you’re stung badly, the pain can last for several hours.

Urtica dioica [CC-BY-SA-3.0 Steve Cook]

Stinging nettles (Urtica dioica) in Russia Dock Woodlands, Rotherhithe

If you are stung by nettles at some point, you’ll probably avoid trampling barefoot on them in future. If getting trampled by humans is a big ecological problem for nettles, then a less stingy nettle stands a poor chance of growing up to make baby nettles. Stingless nettles will therefore go extinct, and their stingier competitors will inherit the earth. Praise be to Darwin.

Urtica dioica trichomes [CC-BY-SA-3.0 Frank Vincentz]

Stinging nettle stinging hairs [CC-BY-SA-3.0 Frank Vincentz]

Unfortunately for you (and fortunately for those who make a living from it), it’s almost always true that “I think you’ll find it’s a bit more complicated than that” in biology. If humans can be persuaded to avoid meddling with nettles through a single painful experience, there is a lot of opportunity for cheats to exploit your fear. In the case of nettles, one well-known group of cheats are the so-called dead nettles:

Lamium galeobdolon [CC-BY-SA-3.0 Steve Cook]

Yellow archangel, a common dead nettle (Lamium galeobdolon)

The leaves of dead nettles look remarkably like those of stinging nettles, but the dead nettles neither sting, nor are they even close relatives of stinging nettles: stinging nettles are related to hops and cannabis; dead nettles to mint and sage. The flowers give the game away at this time of year, but in spring, the two plants are really very similar. You need to get quite close to spot the missing stings on the dead nettles, and if you’ve had a bad experience with the real thing in the past, getting quite close is probably something you – or a fluffy wuffy bunny, or whatever – would think twice about.

Urtica dioica and Lamium album (spot the difference) [CC-BY-SA-3.0 Steve Cook]

Stinging nettle (Urtica dioica) and white dead-nettle (Lamium album): spot the difference. The dead nettle has conspicuous white flowers; the stinging nettle’s flowers are greenish-brown tassels

Good biologists should always be skeptical of plausible stories, so I should add that I’ve not actually been able to track down any experimental studies seeing whether bunnies who have learnt to avoid stinging nettles also avoid dead nettles, let alone any that show dead nettles are more successful at making seeds when real nettles are in the same area. Assuming this actually is the case, dead nettles would be “Batesian mimics” of stinging nettles, or – if you’d rather – fakers. They don’t have to waste energy making histamine and oxalic acid and hypodermic needles; they merely have to look somewhat similar to stinging nettles to receive all the benefits of having bunnies avoid them, with fewer of the costs.

But what would happen if dead nettles were such good fakers that they became very common? The bunnies would rarely meet the real thing, and would probably never learn to avoid nettle-like plants of any sort. Even if the bunnies did occasionally meet stinging nettles, those reckless bunnies that threw caution to the wind and ate things that looked like nettles would still tend to get more to eat than more cautious bunnies. In either case, the dead nettles would get nibbled back into relative rarity. And then the more reckless bunnies would get stung more often, as they’d meet real stinging nettles more frequently, and this would – in its turn – favour bunnies that were more cautious again, leading to a resurgence of the dead nettles. And so on, and so on.

The relative rarity of dead nettles and stinging nettles wouldn’t necessarily roller-coaster up and down like this: the cycles could be quite small. However, it’s interesting that neither a field of dead nettles on their own, nor of stinging nettles on their own, is stable. A field of nothing but stinging nettles is prone to invasion by fake dead nettles; but if the number of dead nettles gets too high, the bunnies will never meet the real thing, and won’t learn to avoid nettle-like plants in the first place. There is likely to be some ratio of real to fake nettles (and of cautious to reckless rabbits) that is stable in the long term, but it won’t be 0% or 100%.

These sorts of ‘game’ between mimics – the dead nettle “fakers” – and their models – the stinging nettle pain “makers” – are very common in biology, and are an important part of the ecology of many organisms. Wherever an organism has made some sort of ‘effort’, there is likely to be a living made scrounging off them, or mimicking their appearance.

But of course, it’s always a bit more complicated in biology. Not all mimics are fakes. Some mimics benefit from looking dangerous because they really are dangerous.

Mimicry [CC-BY-SA-3.0 Steve Cook]

Honeybee (Apis mellifera), bumblebee (Bombus terrestris), cinnabar moth caterpillar (Tyria jacobaeae), hoverfly (Eupeodes luniger)

The honeybee and bumblebee in the image above both have black and yellow striped bodies. Both are able to sting, and both seem to have similar colours. Is one mimicking the other, and if so, why?

As I said earlier, you should be skeptical of plausible stories. Bumblebees and honeybees are quite closely related, so perhaps the black-and-yellow is just a colour-scheme they’ve inherited from their common ancestor that has nothing to do with mimicry. We need more evidence.

As it turns out, there is very good evidence that black-and-yellow is meaningful mimicry, not accidental similarity. For example, the cinnabar moth caterpillar in the third image is not closely related to the bees, so it is likely that this caterpillar’s colours have evolved independently from those of the bees. Can it sting? Not exactly, but it is poisonous, because it mostly eats ragwort, and it steals the ragwort’s poisons for its own defence. Any bird that has learnt to avoid black-and-yellow insects through unhappy run-ins with bees is likely to avoid this caterpillar too. Importantly, this works both ways: any bird that’s had a bad experience with cinnabar moth caterpillars is also likely to avoid bees (and wasps, and other similar insects).

This sort of mimicry, where makers – the animals and plants that can back up their threats – all come to have similar warning colours is called Müllerian mimicry. If you need any more convincing, it’s telling that there are also many Batesian fakers of the black-and-yellow “warning” colour-scheme too, like the harmless hoverfly shown in the fourth image.

The natural world is full of liars and cheats; except when it isn’t.
