Recursion [CC-BY-SA-3.0 Steve Cook, Andreas Thomson, Frank Vinzent]


Bark [CC-BY-SA-3.0 Steve Cook]

Top left to bottom right: Betula utilis var. jacquemontii, Pinus sylvestris, Prunus serrula, Pinus pinaster, Quercus suber, Araucaria araucana, Metasequoia glyptostroboides, Pinus bungeana, Betula ermanii × pubescens, Pinus nigra ssp. laricio, Betula albosinensis cv. Red Panda, Platanus × acerifolia.

Wisley in Winter

It’s not really a botanic garden, but the Royal Horticultural Society’s garden at Wisley is near enough as makes no difference. We visited in what should have been the dead of winter, but which in reality was this weird sprautumn mash-up that is now December in the UK. The heather garden was particularly pretty, despite the wind:

RHS Wisley heather garden [CC-BY-SA-3.0 Steve Cook]

RHS Wisley heather garden

At £12, the entry price is a bit steep: even Kew does reduced rates when most of the plants are underground, sleeping off the summer’s excesses.

Botanerd highlights included the biggest pine-cone I have yet added to my collection (yes, it is a collection, not a sickness)…

Pinus × holfordiana cone [CC-BY-SA-3.0 Steve Cook]

Holford pine (Pinus × holfordiana) cone – nothing satisfies like a 12 inch Pinus

…and the – apparently legal – purchase of class A drugs from the attached garden centre:

Lophophora williamsii [CC-BY-SA-3.0 Steve Cook]

Peyote cactus (Lophophora williamsii)

From what I gather, owning a peyote cactus is legal in the UK unless you ‘prepare it for consumption’, which magically transforms it into a 7-year jail term. This presumably includes accidentally letting it die and dry out into a button whilst on holiday. I may be having a teensy bit of buyer’s regret.


Most years the pond-water microscopy practical throws up an exciting ciliate (or two, or three), but this year, the only ones we saw were duplicates, or too bloody fast to photograph. Ho hum.

So this year you’ll have to make do with an imposter. It’s still got cilia, but it is not a ciliate, and although it’s no bigger than a single-celled organism, it is in fact a full-blown multicellular animal:

Euchlanis rotifer [CC-BY-SA-3.0]

Euchlanis rotifer

This is a rotifer (‘wheel animal’). Rotifers are famous for two things: being ridiculously small, and – in one infamous subgroup – being ridiculously averse to having sex. They’re also really cute when they turn on their feeding wheels. Happy New Year.

Bagging botanical Brussels

The last time we went to Brussels, I got terribly excited that the hotel we were staying in was right next door to the Botanical Garden of Brussels. Unfortunately – as we discovered in short order – at some point in the 1930s the plants had mostly been shipped off elsewhere, leaving the garden not very botanical, and Dr Cook not very impressed.

That elsewhere was the Plantentuin Meise / Jardin Botanique Meise, which is to the north of Brussels, conveniently situated – for the purposes of this second attempt to visit them – between Brussels airport and Antwerp.

Brussels botanic gardens [CC-BY-SA-3.0 Steve Cook]

Botanic Gardens, Meise

We were visiting in autumn, and they were in the middle of renovating part of their main glasshouse – the Plant Palace – but there was still plenty to see, including a good arboretum, an orchid exhibition, and some lovely Japanese beautyberries:

Callicarpa japonica [CC-BY-SA-3.0 Steve Cook]

Japanese beautyberry (Callicarpa japonica)

The entry fee (€7) is very reasonable, and the restaurant helpfully ensures all sandwiches come with a colonic depth-charge of salad; just the ticket to cure the faecal impaction resulting from the ‘food’ served to us on the flight out with <insert awful British airline here>. When we were there, there was also an art installation by Roos Van de Velde in praise of the best plant in the world – the maidenhair tree Ginkgo biloba – but you’ve only got until mid-January to catch it:

Ginkgo biloba Meise [CC-BY-SA-3.0 Steve Cook]

Roos Van de Velde’s Ginkgo biloba installation

Botanerd highlights:

  1. Whereas Kew’s evolution house (currently being renovated as part of the Temperate House restoration) always felt like a holding pen for a few cycads and ferns that didn’t physically fit into the Palm House, the Meise version was much more complete, with a good selection of horsetails, ferns, and high-quality fakes of fossil plants like Cooksonia, Lepidodendron, etc. That said, I would very much like to see these ‘evolution house’ efforts re-branded as ‘phylodiversity houses’ or similar, as the usual Scala Naturae bullshit of using modern mosses, lycopods, ferns, and conifers as stand-ins for prehistoric forms gets right up my arse.
  2. The pinetum is small, but has a good selection of species, including quite a few weird cultivars of Japanese cedar (Cryptomeria japonica) that I’ve not seen before.

Cryptomeria japonica cv. Cristata [CC-BY-SA-3.0 Steve Cook]

Cryptomeria japonica cv. Cristata

Previous baggings…

Organism of the week #29 – Galling

Today is the first day of the new (academic) year at $WORK, but – aside from a couple of intro lectures – this is the calm before the real storm, which arrives in the form of a deluge of biological chemistry in November. If you’re lucky, you’ll get a few mugshots of some weird ciliate or other around then, otherwise, expect tumbleweed until the new (calendar) year…

Insects and flowering plants have been co-evolving for 200 million years or so, so it’s unsurprising that many of their interactions have become very intimate. The relationships between gall-forming insects and their host plants are particularly interesting. Gall-forming insects lay their eggs on plants, but when the larvae hatch, instead of munching through the plant in the usual fashion, the larvae make the plant build bizarre little houses for them. These are the galls. Galls provide protection from weather, some protection from things that might like to eat the gall-former itself, and food for the developing larva.

In the UK, there are many gall-forming insects that target oak trees. These are mostly very small wasps. Among the most obvious are knopper galls, the sticky crater-shaped things you often see attached to acorns around August.

Rather like the fluke we met a few weeks ago, knopper gall wasps have a complicated life-cycle involving several different hosts, in this case two different species of oak: the English (Quercus robur) and the Turkey (Quercus cerris). Female wasps emerge in spring from the galls that fell from English oaks the previous autumn. These females then lay asexual eggs on the catkins of the Turkey oak, which hatch out, and form rather inconspicuous galls. The males and females that hatch from these galls in April then mate, and the females lay their eggs on the developing acorns of English oaks, where they develop into the knopper galls in late summer.

Andricus quercuscalicis on Quercus robur [CC-BY-SA-3.0 Steve Cook]

Knopper galls caused by the larvae of the wasp Andricus quercuscalicis, on oak (Quercus robur)

Like so many pests and diseases, the knopper gall is actually a fairly recent introduction: the first galls in the UK were recorded only in the 1950s.

Smaller but creepier, the silk-button spangle gall is a trypophobe’s delight. Like the knopper gall, it’s caused by a small wasp, but it’s much less obvious until you flip a leaf over:

Neuroterus numismalis and Neuroterus quercusbaccarum on Quercus robur [CC-BY-SA-3.0 Steve Cook]

Mostly silk-button spangle galls caused by the larvae of the wasp Neuroterus numismalis, also on oak

The less creepy common spangle gall isn’t as likely to make you feel like you have maggots crawling under your skin, but like the silk-button spangles, that’s exactly what the plant has got.

Neuroterus quercusbaccarum on Quercus robur [CC-BY-SA-3.0 Steve Cook]

Common spangle-galls (Neuroterus quercusbaccarum) on – you’ll never guess what. There are a few older reddish spangles in the silk-button image above too.

The most famous and most useful gall found on oak is the oak apple. A mixture of mashed up oak apple galls and iron salts was – for a millennium and a half – what you dunked a quill pen into, if you were lucky enough to know how to write.

Oak apple [CC-BY-SA-2.0 Bob Embleton]

Oak apple gall: these are caused by the larvae of several different species of cynipid wasps [CC-BY-SA-2.0 Bob Embleton]

Like the knopper and spangle galls, oak apples are caused by the larvae of a tiny wasp, but it’s not just wasps that can cause galls, or indeed just insects. Roundworms (nematodes) can also form galls, usually on the roots of plants; as can some fungi. Mites, which are more closely related to spiders than to insects, are responsible for many galls of sycamores and other maples:

Vasates quadripedes on Acer saccharinum [CC-BY-SA-3.0 Steve Cook]

Bladder galls caused by the mite Vasates quadripedes on – gasp, not oak! – silver maple (Acer saccharinum)

Aceria macrorhynchus on Acer pseudoplatanus [CC-BY-SA-2.0 Lairich Rig]

Mite galls caused by Aceria macrorhynchus on sycamore (Acer pseudoplatanus) [CC-BY-SA-2.0 Lairich Rig]

Why would a plant build a house for a parasitic insect or mite? The long answer is very long, but the short answer is that the parasite secretes plant hormones called cytokinins to manipulate the plant into forming the gall. In plants, cytokinins cause cell division and growth. They also act as “feed me” markers: sugars and other nutrients are preferentially directed towards tissues that have cytokinin in them. Cytokinins also cause plant cells to make red pigments. By secreting cytokinins (and other plant hormones) gall-forming insects, mites, fungi and nematodes manipulate the plant into building small – often red! – houses for them, and to supply those houses with sugar.

Zeatin [Public domain]

Zeatin, a cytokinin hormone: the structure is a modified form of the DNA base adenine

The galls formed by insects and mites are amazing structures, but at a molecular level, the most extraordinary galls of all are formed not by insects or mites, but by bacteria:

Crown gall [CC-BY-SA-3.0 Bhai]

Crown gall on Kalanchoe, caused by the bacterium Rhizobium radiobacter (Agrobacterium tumefaciens) [CC-BY-SA-3.0 Bhai]

Crown galls may not look so impressive, but the way in which their inhabitants cause the gall to be formed most certainly is. The galls are formed by the bacterium Rhizobium radiobacter. [For any biologists doing a double-take at this point: until recently Rhizobium radiobacter was known as Agrobacterium tumefaciens, but it’s been reclassified. This is Brontosaurus all over again…]

Rhizobium radiobacter is the world’s smallest genetic engineer. Rather than simply secreting plant hormones to make the plant direct food at it, like the gall-forming insects do, this bacterium actually injects a piece of DNA into the plant’s cells to bend them to its will. The DNA encodes enzymes for plant hormone synthesis, which causes the plant cells to grow into a cancerous mass: the crown gall.

But even that’s not enough for this bacterium. The nutrients directed into the gall by the plant are not quite to the bacterium’s taste, so the DNA it injects also encodes genes that turn plant cells into factories for bacteria-friendly food. This food even leaks out of the plant’s roots, feeding the bacterium’s sisters in the soil.

It took until 1982 for humans to develop the technology to genetically manipulate a crop plant to make better food. A mindless single-celled organism has been pulling off the same trick for millions of years. There’s nothing new under the sun.

Organism of the week #28 – Fractal art

Flowers are essentially tarts. Prostitutes for the bees. (Uncle Monty, Withnail and I)

Our tiny garden has only passing acquaintance with sunshine, so about the only plants that really thrive in its dingy clutches are shade-loving ferns. This Japanese painted fern is my current favourite: who needs flowers anyway, when leaves look like this?

Athyrium niponicum cv. Pictum [CC-BY-SA-3.0 Steve Cook]

Japanese painted fern (Athyrium niponicum cv. Pictum)

The colour is spectacular, but I also love the shape of the leaves. I have a rabbit’s-foot fern on my windowsill at $WORK, which branches not once, not twice, not thrice, but fwice, with each leaflet at each level looking very much like a miniaturised version of the whole leaf:

Humata tyermannii [CC-BY-SA-3.0 Steve Cook]

Humata tyermannii

Self-similar shapes like this are called fractals. This fern’s leaf is not a true mathematical fractal, because its self-similarity goes only four levels deep, rather than infinitely so. Also, from a developmental point of view, the leaves of ferns are produced by interactions between signalling molecules that have very little in common with the maths of fractals. However, it is remarkable how easy it is to produce a fractal that looks very much like a fern – especially when someone else has done all the hard work of discovering one for you. This one is called the Barnsley fern fractal, after Michael Barnsley:

Barnsley fern [Public domain]

Barnsley fern

This rather cute simulator allows you to play about with the Barnsley fern without going to the hassle of coding up any R yourself, but for those who enjoy such things, here’s my version of the code written in R, based on the coefficients from its Wikipedia article. An entirely imaginary prize is available for anyone who can produce a good simulation of either of the two ferns above.

fm <- vector( "list", 4 )
fc <- vector( "list", 4 )
fp <- vector( "list", 4 )

# Stem
fm[[1]] <- matrix( c( 0.00, 0.00, 0.00, 0.16 ), nrow=2, byrow=TRUE )
fc[[1]] <- matrix( c( 0.00, 0.00 ), nrow=2 )
fp[[1]] <- 0.01

# Leaflets
fm[[2]] <- matrix( c( 0.85, 0.04, -0.04, 0.85 ), nrow=2, byrow=TRUE )
fc[[2]] <- matrix( c( 0.00, 1.60 ), nrow=2 )
fp[[2]] <- 0.85

# Left largest leaflet
fm[[3]] <- matrix( c( 0.20, -0.26, 0.23, 0.22 ), nrow=2, byrow=TRUE )
fc[[3]] <- matrix( c( 0.00, 1.60 ), nrow=2 )
fp[[3]] <- 0.07

# Right largest leaflet
fm[[4]] <- matrix( c( -0.15, 0.28, 0.26, 0.24 ), nrow=2, byrow=TRUE )
fc[[4]] <- matrix( c( 0.00, 0.44 ), nrow=2 )
fp[[4]] <- 0.07

n<-250000 # you might want to make this smaller if it takes forever to render
x <- numeric( n )
y <- numeric( n )
x[1] <- 0
y[1] <- 0

for( i in 1:(n-1) ) {
    # pick one of the four transformations at random, weighted by fp
    # (unlist: sample() wants a plain numeric vector of probabilities)
    choice <- sample( 1:4, prob=unlist( fp ), size=1 )
    # apply the chosen affine map to the current point to get the next point <- fm[[ choice ]] %*% c( x[i], y[i] ) + fc[[ choice ]]
    x[i+1] <-[1]
    y[i+1] <-[2]
}

plot( x, y, cex=0.05, col="dark green" )

Living dangerously

Towards the end of the last millennium, I spent a lot of time befriending arsenic. The last two years of my PhD involved measuring how much iodine had fallen off a chemical used as a wood preservative day-in, day-out, and arsenic trioxide was the most exciting component of an otherwise excruciatingly dull test for iodide ions.

My little bottles of arsenic warned me to beware of pirates…


…for there were enough pirates in each 10 gram bottle to kill me ten times over. I apparently managed to handle the stuff safely, as – at least so far – I’m not dead even one time over.

Arsenic became so popular as a detergent for getting stubborn relatives out of the family that by the late nineteenth century people started referring to it as “inheritance powder”.

A (not very) cheap and (exceptionally) nasty way to work out how poisonous something like arsenic is, is to take ten cages containing ten rats each, and give each cageful of rats an increasingly large dose of the poison in question. The individual rats will differ in how easy they are to kill with the poison, but the rats in the cages that get very small doses will probably all survive, and the rats in the cages that get very large doses will probably all die. Somewhere in the middle you’ll probably find a cage containing 5 corpses and 5 survivors. The dose of poison that kills half the rats in a cage is called the “lethal dose, 50%” or LD50. The LD50 for arsenic trioxide is about 10 milligrams for rats.
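The cages-of-rats design above is exactly the sort of data a dose-response regression handles. As a sketch in R – with entirely made-up doses and an assumed logistic dose-response curve, not real toxicology data – the LD50 falls out of a binomial GLM as the dose at which the fitted probability of death crosses one half:

```r
# Simulated cage experiment: 10 cages of 10 rats, increasing doses (mg per rat)
set.seed( 1 )
dose      <- seq( 2, 20, by=2 )
true.ld50 <- 10
# assume the probability of death rises logistically with log(dose)
p.death   <- plogis( 3 * ( log( dose ) - log( true.ld50 ) ) )
deaths    <- rbinom( length( dose ), size=10, prob=p.death )

# logistic regression on the dead/alive counts per cage
fit      <- glm( cbind( deaths, 10 - deaths ) ~ log( dose ), family=binomial )
# LD50 estimate: the dose at which the fitted log-odds of death is zero
ld50.est <- exp( -coef( fit )[1] / coef( fit )[2] )
```

MASS::dose.p does the same back-calculation with a standard error attached, if you want error bars on your pirate bottles.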

Humans are a lot heavier than rats: I’m 60-odd kilograms on a good day, a rat only half a kilo or so. Bigger animals can survive higher doses of poison because the poison ends up diluted into their larger bodies, so LD50 values are usually scaled for the size of the victim. If 10 mg kills a 0.5 kg rat, the LD50 is 20 mg kg−1. By maths and educated guesswork about the relative susceptibilities of rats and humans to arsenic (apparently the definitive experiment is unethical), the LD50 for a human (this human, specifically) is just over 1 gram.
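The scaling in this paragraph is just a couple of divisions, using the figures above:

```r
# scale the rat LD50 to body mass, then back up to a 60 kg human
rat.ld50.mg <- 10                               # kills half a cage of rats
rat.mass.kg <- 0.5
ld50.per.kg <- rat.ld50.mg / rat.mass.kg        # 20 mg per kg
my.mass.kg  <- 60
my.ld50.g   <- ld50.per.kg * my.mass.kg / 1000  # 1.2 g: 'just over 1 gram'
```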

Arsenic is pretty poisonous, but some other poisons beat it hands-down. Botulinum toxin, the active ingredient in Botox injections, has an LD50 of 1 ng kg−1, which is about 10 million times more lethal. But, as even 16th century proto-doctors noted, it’s the dose that makes the poison. Everything is poisonous in sufficient quantity, even things as seemingly innocuous as water.

Even the most child-proofed kitchen could kill you three times over. You just need to put in a little effort:

Water LD50 [CC-BY-SA-3.0 Steve Cook]

Lethal dose of water: 6 L

The LD50 of water is about 100 mL kg−1, at least in rats. For a runt like me, that means around 6 L of water drunk over a short period of time would be lethal, mostly by diluting the salts in my blood and tissue fluid to the point where water would flood into my cells, making them swell and pop.

The opposite effect could be achieved by eating too much common table salt, which would suck water out of my cells, causing them to shrivel. With an LD50 of about 3 g kg−1, pretty much every kitchen in the land has a pot of salt in it containing at least one lethal dose:

Salt LD50 [CC-BY-SA-3.0 Steve Cook]

Lethal dose of salt: 200 g

It’s pretty difficult to eat even a teaspoon of neat salt without puking, so I shouldn’t worry about anyone trying to get their inheritance early by force-feeding you table salt rather than good old arsenic, but it might just be possible to get someone to eat the necessary dose of sugar. With an LD50 of 30 g kg−1, that would mean eating 2 kg of the stuff, but it’s certainly a sweeter way to go than many.

Sucrose LD50 [CC-BY-SA-3.0 Steve Cook]

Lethal dose of sugar: 2 kg

Organism of the week #27 – Alien haemorrhoids

A quickie this week: magnolia fruit look like something from another planet.

Magnolia ×soulangeana fruit [CC-BY-SA-3.0 Steve Cook]

Magnolia ×soulangeana fruit

Statistical power

In a recent(ish) post, we saw that if a fair coin is flipped 30 times, the probability it will give us 10 or fewer heads is less than 5% (4.937% to be pointlessly precise). Fisher quantified this using the p value of a data set: the probability of obtaining data (or a test statistic based on those data) at least as extreme as that actually observed, assuming that the null hypothesis is true.
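That pointlessly precise 4.937% is one line of R: the probability of 10 or fewer heads under the fair-coin null is the lower tail of the binomial distribution.

```r
# P( 10 or fewer heads in 30 flips of a fair coin )
p <- pbinom( 10, size=30, prob=0.5 )
round( 100 * p, 3 )   # 4.937
```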

If you ask people (even people who use them on a daily basis) what a p value is, you will get a range of answers, probably including:

  • The probability of the data being due to random chance
  • The probability of rejecting the null hypothesis incorrectly
  • The probability of the data
  • The probability of the null hypothesis being true
  • The probability of the null hypothesis being false
  • The probability that the data are significant
  • The probability of the data being accurate

…none of which are correct. Some of these errors can be traced back to the historical mixing-up of Fisher’s significance testing, and a different form of hypothesis testing proposed by Jerzy Neyman and Egon Pearson. Both of these involve calculating p values, and both frequently use the arbitrary cut-off of 0.05, but Neyman-Pearson testing is conceptually distinct. We’ll try to untangle them here.

If we simulate an infinitely large collective of coin-flipping trials, each consisting of 30 flips of a fair coin, we will obtain a binomial distribution. Unfortunately, we don’t have infinite time, so we’ll make do with a randomly generated collection of one thousand 30-flip trials instead:

fair.heads <- rbinom( n=1000, size=30, prob=0.5 )

# breaks allows us to set the bin sizes for a histogram
hist(
    fair.heads,
    main   = "Number of heads in 30 coin-flips of a fair coin", 
    xlim   = c( 0, 30 ),
    breaks = seq( 0, 30, by = 1 ),
    xlab   = "Number of heads in 30 flips",
    ylab   = "Frequency"
)

Fair coin simulation [CC-BY-SA-3.0 Steve Cook]

Fair coin simulated data (1000 trials)

36 observations of 10 or fewer heads in 1000 observations is 3.6%, which is reasonably close to the 4.937% we expect from theory.

The function rbinom( how.many, size=n, prob=P ) is R’s implementation of a random binomial data generator: the ‘r’ stands for random. It gives you how.many random variables drawn from a binomial distribution of size n, with a coin producing heads at probability P.

rbinom(1000, size=30, prob=0.5) returns one thousand random variables, each representing the number of heads got from 30 flips of a fair coin.

Other distributions have similar random generator functions available: rnorm (normal), rpois (Poisson), rt (t), rf (F), etc.

To simplify things, we can go straight for the binomial distribution to give us a ‘perfect’ set of coin-flip data:

# create a vector into which we can put the number-of-heads-in-30-flips for 1000 observations
simulated.heads <- vector( "numeric" ) 

# This uses a for loop, the syntax for which is: 
# for ( variable in list ) { do.something.with( variable ) }

for ( k in 0:30 ) {
    # dbinom will give the pmf for each number of heads, k, in 30 flips
    # The number of observations of k-heads-in-30-flips is 1000 times this <- round( 1000 * dbinom( k, size=30, prob=0.5 ) )
    # e.g. when k=10, append 28 repetitions of "10" to the simulated head-count data
    simulated.heads <- c( simulated.heads, rep( k, times = ) )
}

hist(
    simulated.heads,
    main   = "Theoretical number of heads in 30 coin-flips of a fair coin", 
    xlim   = c( 0, 30 ), 
    breaks = seq( 0, 30, by = 1 ),
    xlab   = "Number of heads in 30 flips",
    ylab   = "Theoretical frequency"
)

Fair coin theoretical [CC-BY-SA-3.0 Steve Cook]

Fair coin theoretical data (1000 trials)

So, 10 heads in 30 flips would allow us to reject the null hypothesis that the coin is fair at a significance level of 0.05, but it tells us nothing about how biased the coin is. A coin that is only slightly biased to tails would also be unlikely to produce 10 heads in 30 flips, and a coin so biased that it always lands heads up, or always lands tails up is also inconsistent with 10 heads in 30 flips.

How can we decide between different hypotheses?

You might think “isn’t that what the p value shows?”, but this is not correct. A p value tells you something about the probability of data under a null hypothesis (H0), but it does not tell you anything about the probability of data under any particular alternative hypothesis (H1). Comparing two hypotheses needs rather more preparatory work.

If we consider a coin that may be fair, or may be loaded to heads, we might have two competing hypotheses:

  • H0: the coin is fair, and the probability of heads is 1/2 (P=0.50).
  • H1: the coin is loaded, and the probability of heads is 2/3 (P=0.666…).

Why P=2/3? And why is P=2/3 the alternative rather than the null hypothesis?

I’ve chosen P=2/3 because I have decided – before I start the experiment – that I am not bothered by coins any less biased than this. Perhaps this is for reasons of cost: making coins that are better balanced is not worth the investment in new coin-stamping dies. The difference in the probability of heads under the null and alternative hypotheses is called the effect size. Here, the effect size is the difference of 0.1666… in the probability of getting a head. Other measures of effect size are appropriate to different kinds of hypotheses.

I’ve made P=2/3 the alternative hypothesis because it is more costly to me if I accept that hypothesis when it is – in fact – false. Again, perhaps this is for reasons of cost: sticking with the old coin dies is a lot cheaper than replacing them, and it’s unlikely that a coin will be used to decide anything important: just trivial nonsense like kick-offs in football that I’m not really bothered by.

If you’re already objecting to this pre-emptive bean-counting approach to scientific data collection, you have hit upon one of the things Fisher didn’t like about Neyman and Pearson’s method, so you are in good company. That said…

The Neyman-Pearson method tries to control for the long-term probability of making mistakes in deciding between hypotheses. In the example above, we are trying to decide between two hypotheses, and there are four possible outcomes:

Decision   | H0 is true          | H0 is false
-----------|---------------------|---------------------
Accept H0  | Correctly accept H0 | Type II error
Reject H0  | Type I error        | Correctly reject H0

In two of these outcomes, we make a mistake. These are called type I and type II errors:

  • Type I error: rejecting the null hypothesis when it is – in fact – true. This is a kind of false-positive error. The coin isn’t really biased, but we’re unlucky and get data that is unusually full of heads.
  • Type II error: accepting the null hypothesis when it is – in fact – false. (If we are considering only two hypotheses, this is equivalent to rejecting the alternative hypothesis when it is – in fact – true.) This is a kind of false-negative error. The coin really is biased, but we’re unlucky and get data that looks like a 50:50 ratio of heads to tails.

If we were to repeat a coin-flipping experiment a very large number of times, any decision rule that allows us to decide between H0 and H1 will give us the wrong decision from time to time. There’s nothing we can do about this, but we can arrange the experiment so that the probabilities of making either of the two mistakes are reduced to acceptable levels:

  • We call the long-term probability of making a type I error across a large number of repeats of the experiment, α. In probability notation, α = P(reject H0 | H0), “the probability of rejecting the null hypothesis given (|) that the null hypothesis is true”.
  • We call the long-term probability of making a type II error across a large number of repeats of the experiment, β. In probability notation, β = P(accept H0 | ¬H0), “the probability of accepting the null hypothesis given that the null hypothesis is not (¬) true”.

We can choose whatever values we think appropriate for α and β, but typically we set α to 0.05. This is one of the things that results in the confusion between the Neyman-Pearson and Fisher approaches.

For two coins – one fair, and one biased to give 2/3 heads – the probability distributions for 30 flips look like this:

Fair and loaded coin pmf distribution overlaps [CC-BY-SA-3.0 Steve Cook]

Fair (left) and 2/3-loaded (right) coin PMF distributions, showing overlaps

The vertical line cuts the x-axis at 19 heads. This is the critical value, i.e. the number of heads above which we would reject the null hypothesis, because the probability of getting more than 19 heads in 30 flips of a fair coin is less than 0.05. The blue area on the right is therefore 5% of the area beneath the fair coin’s PMF curve, and it represents the probability of a false positive: i.e. rejecting the null hypothesis when it is – in fact – true, which is α.

The value ’19’ is easily determined using qbinom, which – as we saw in the last post – gives us the number of heads corresponding to a specified p-value.

critical.value <- qbinom( p=0.05, size=30, prob=0.5, lower.tail=FALSE )

For coin-flipping, the statistic of interest – the one that we can use to make a decision between the null and alternative hypotheses – is simply the number of heads. For other kinds of test, it may be something more complicated (e.g. a t value, or F ratio), but the same general principle applies: there is a critical value of the test statistic above which p is less than 0.05, and above which we can therefore reject the null hypothesis with a long-term error rate of α.

The red region on the left is the portion of the 2/3-loaded coin’s PMF that is ‘cut off’ when we decide on a long-term false positive rate of 5%. Its area represents the probability of a false negative: i.e. accepting the null hypothesis when it is – in fact – false. This is β.

You can hopefully see that β is given by pbinom( critical.value, size=30, prob=2/3 ). Its value – in this case 0.42 – depends on:

  1. The critical value (which depends on the choice of α).
  2. The number of coin-flips (here 30), i.e. the sample size.
  3. The bias of the coin under the alternative hypothesis (2/3), which is related to the effect size, i.e. the smallest difference in ‘fair’ and ‘loaded’ behaviour that we are interested in (2/3 – 1/2 = 0.1666…)

This is where the Neyman-Pearson approach diverges from Fisher’s. Fisher’s approach makes no mention of an alternative hypothesis, so it makes no attempt to quantify the false-negative rate. Neyman-Pearson sets up two competing hypotheses, and forces you to explicitly consider – and preferably control – β.

If the sample size and effect size are held constant, then decreasing α increases β, i.e. there is a trade-off between the false-positive rate and the false-negative rate: imagine sliding the vertical line left or right, and consider what would happen to the red and blue areas. For the decision between ‘fair’ vs. ‘2/3-loaded’, on a sample size of 30, β is 0.42: we are eight times as likely to make a type II error as a type I error. It is very likely that we will wrongly say a coin is fair when in fact it is biased, and it is much more likely that we will make that mistake than we will call a fair coin biased when it isn’t. A mere 30 coin-flips is therefore woefully underpowered to distinguish between P=1/2 and P=2/3.
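The numbers in this section can be reproduced in a few lines of R:

```r
# critical value for alpha = 0.05 with a fair coin and n = 30 flips
n <- 30
critical.value <- qbinom( p=0.05, size=n, prob=1/2, lower.tail=FALSE )  # 19
# beta: probability of 'critical value or fewer' heads from the loaded coin
beta  <- pbinom( critical.value, size=n, prob=2/3 )                     # ~0.42
power <- 1 - beta                                                       # ~0.58
```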

The power of a test, π, is:

π = 1 – β = P( reject H0 | H1 )

and is somewhat more convenient than β to remember:

  • α = P(reject H0 | H0)
  • π = P(reject H0 | H1)

What is a good power to aim for? Typically, we want an experiment to have a β of 0.20 (i.e. a power of 0.8). You might ask “why the asymmetry?” between choosing α=0.05 and β=0.20. There’s no particularly good general reason to want β to be larger than α. You might justify this on the basis that type II errors are less costly so you don’t mind making them more often than type I errors (do you really want to replace all those coin dies?) but the real reason is usually more pragmatic: detecting even small effect sizes requires quite large sample sizes:

Coin distribution overlaps for increasing sample size [CC-BY-SA-3.0 Steve Cook]

Coin distribution overlaps for increasing sample size

[To make R plot graphs on a grid like this, you want: par( mfrow=c( 2, 2 ) ) which sets up a 2-by-2 grid: each call to plot then adds a new graph until it fills up]

In the graphs, you can see that as we increase the sample size from 15 through to 120, the distributions get ‘tighter’, the overlap gets smaller, the blue area (β) gets smaller (and so the power of the test gets higher). If we plot the value of β against the sample size, we get this:

Coin power versus sample size [CC-BY-SA-3.0 Steve Cook]

We need 60-odd flips to detect a 2/3-biased coin with a β of 0.20, but we need nearly 100 flips for a β of 0.05. The weird scatteriness is because the binomial distribution is discrete.

The sample size needed to detect a 2/3-loaded coin is c. 60 flips for a β of 0.20 (a power of 0.80), and c. 100 flips for a β of 0.05 (a power of 0.95). If we have already decided on the value of α and the effect size of interest, then fixing the sample size is the only control we have over the type II error rate.
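Wrapping the β calculation into a function makes the effect of sample size easy to check; the exact values wobble a little from n to n because of the discreteness mentioned above, so the comments below are only approximate:

```r
# beta as a function of sample size, for alpha = 0.05 and a 2/3-loaded coin
beta.for.n <- function( n, alpha=0.05, p0=1/2, p1=2/3 ) {
    # number of heads above which we reject H0 at this alpha
    critical <- qbinom( alpha, size=n, prob=p0, lower.tail=FALSE )
    # probability of failing to reject H0 when the coin is really loaded
    pbinom( critical, size=n, prob=p1 )
}

beta.for.n( 30 )    # ~0.42: woefully underpowered
beta.for.n( 60 )    # around the 0.20 target
beta.for.n( 100 )   # around 0.05
```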

It is fairly rare for people to do power calculations of this sort before they start collecting data: more often than not, they simply collect a sample as large as they can afford. This means that – whilst type I errors are controlled at 0.05 – type II errors are often ignored, assumed (wrongly!) to be low, or – at best – calculated retrospectively. Retrospective calculations can give unpleasant surprises: as we saw above, 30 flips will result in quite a lot of quite badly loaded coins going undetected.

Ideally, you should calculate the sample size needed for a particular power before the experiment: that way, you will collect precisely the amount of data needed to give you the long-term error rates you think are acceptable (0.05 for α and 0.20 for β, or whatever). This brings us on to the last major confusion between Fisher and Neyman-Pearson testing:

  • p is the probability of a test statistic being at least as extreme as the test statistic we actually got, assuming that the null hypothesis is true.
  • α is the long-term probability of mistakenly rejecting the null hypothesis when it is actually true, if we were to repeat the experiment many times.

In Neyman-Pearson testing, we calculate a test statistic from our data (in this case just the number of heads), and compare that to the critical value of that test statistic. If the calculated value is larger than the critical value, we reject the null hypothesis. However, we can make exactly the same decision by calculating a p value from our data, and if it is less than α, rejecting the null hypothesis. Both of these approaches result in the same decision. The former used to be much more common because it was easy to obtain printed tables of critical values of test statistics like t and F for α=0.05. More recently, since computing power has made it far easier to calculate the p value directly, the latter approach has become more common.
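To see the equivalence, suppose (hypothetically) we observed 22 heads in 30 flips:

```r
# Hypothetical data: 22 heads in 30 flips of a possibly loaded coin
heads <- 22
n     <- 30
# Decision rule 1: compare the test statistic to its critical value
crit  <- qbinom( p=0.05, size=n, prob=0.5, lower.tail=FALSE )
print( heads > crit )   # TRUE: reject the null
# Decision rule 2: compare the p value, P(X >= heads | fair coin), to alpha
p     <- pbinom( heads - 1, size=n, prob=0.5, lower.tail=FALSE )
print( p < 0.05 )       # TRUE: the same decision, as it always will be
```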

Remember that in Fisher’s approach, a p value of 0.01 is ‘more significant’ than a p value of 0.05, and Fisher suggested (wrongly, IMHO) that this should give you more confidence in your rejection of the null. In Neyman-Pearson testing, if a p value is calculated at all, it is only to compare it to α. If you have done the power calculation and collected the correct sample size, working out the p value is simply a means to an end: either it is larger than α (…accept the null) or it is smaller than α (…reject the null), but its value has no further importance. In Neyman-Pearson hypothesis testing, there is no such thing as “highly significant, p=0.01”. Unfortunately, a lot of people want to have their statistical cake and eat it: they don’t do a power calculation, and they collect a sample size that is expedient rather than justified (screw you, Neyman-Pearson!), but they then want to interpret a very low p value (yay, Fisher!) as evidence for a particular alternative hypothesis (love you really, Neyman-Pearson!). This isn’t really very coherent, and many statisticians now prefer to side-step the whole issue and go for confidence intervals or Bayesian approaches instead, which I leave to others to explain.

This brings us to the question – finally! – of how many coin-flips we should collect to distinguish a loaded coin from a fair one. There are several answers to this, depending on our constraints, but two are particularly important:

  1. We have limited money and time, so we can only do 30 flips. What is the minimum effect size we can reliably detect?
  2. We have unlimited money and time (ha ha ha!), so we can do as many flips as necessary, but we’re only interested in coins that are biased to 2/3 heads or more. What sample size do we need to reliably detect a particular effect size of interest?

So, firstly, for the number of flips we can be bothered to make, what is the minimum effect size we can reliably (α=0.05, β=0.20) detect? For some common tests, R has built-in power calculations, but in general, the simplest way to work this out is by simulation. This requires a little code. In this case, it’s only actually about 12 lines, but I’ve commented it heavily as it’s a little tricky:

# Set up empty vectors to contain numeric data. sample.size will be filled in 
# with the minimum sample size needed to distinguish a fair coin from a
# loaded coin of various degrees of bias. effect.size will be filled in
# with the effect size for each loaded coin

sample.size <- vector('numeric')
effect.size <- vector('numeric')

# We'll try out numbers of flips between 0 and 200

size  <- c(0:200)

# Loop through some values of 'loaded' from 19/32 to 1. We don't go any
# lower than this because even 18/32 requires more than 200 flips!

for ( loaded in seq( from=19/32, to=32/32, by=1/32 ) ) {
    # since 'size' is a vector, qbinom will return a vector of 'crit' values
    # for all trial sizes between 0 and 200 coin-flips; and pbinom will then
    # return the beta values for each ('crit', 'size') pair.

    crit <- qbinom( p=0.05, size=size, prob=0.5, lower.tail=FALSE )
    beta <- pbinom( crit,   size=size, prob=loaded )

    # 'which' returns all the indices into the beta vector for which
    # the beta value is <= 0.20. These were filled in in 'size' order,
    # but note that 'size' starts at 0, so beta item number 6 holds the
    # beta value for a trial of 5 flips: 'min' therefore gives us the
    # smallest sample size required to reliably detect a bias of
    # 'loaded', give or take one flip (close enough at this resolution)

    smallest.needed <- min( which( beta <= 0.20 ) )
    # Append the value we've just worked out to the sample.size vector
    sample.size     <- c( sample.size, smallest.needed )
    # Effect size is the degree of bias minus the P for a fair coin (0.5)
    effect.size     <- c( effect.size, loaded - 0.5 )
}

# Which is the first index into sample.size for which the value is 30 or less?

index.for.thirty <- min( which( sample.size <= 30 ) )

# Since effect.size and sample.size were made in parallel, using this index
# into effect.size will give the minimum effect size that could be detected
# on 30 coin flips

effect.size[ index.for.thirty ]

The answer is that 30 flips is enough to reliably detect a bias only if the effect size is at least 0.25, i.e. to distinguish between a fair coin and a coin that is biased to give 3/4 heads. Does this surprise you? You need quite a lot of flips to distinguish even a large bias.

[NB – we made rather coarse steps of 1/32 in the loop that generated this data. If you do smaller steps of 1/64, or 1/128, you’ll find the real value is nearer 0.23]
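One way to script that finer check (steps of 1/128; here 'size' starts at 1 so that the index into the beta vector equals the number of flips):

```r
# Search a finer grid of biases, stopping at the first (smallest) bias
# that 30 or fewer flips can reliably (alpha=0.05, beta<=0.20) detect
size <- 1:200
# the critical values don't depend on the bias, so compute them once
crit <- qbinom( p=0.05, size=size, prob=0.5, lower.tail=FALSE )
min.effect <- NA
for ( loaded in seq( from=0.5 + 1/128, to=1, by=1/128 ) ) {
    beta <- pbinom( crit, size=size, prob=loaded )
    hits <- which( beta <= 0.20 )   # indices = numbers of flips here
    if ( length( hits ) > 0 && min( hits ) <= 30 ) {
        min.effect <- loaded - 0.5
        break
    }
}
min.effect   # about 0.22 to 0.23: a little below the coarse-grid 0.25
```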

A graph of effect size against sample size for (α=0.05, β=0.20) is useful in answering this question and the next:

Coin flipping effect size vs sample size [CC-BY-SA-3.0 Steve Cook]

Effect size and sample size are inversely related: a small sample size can only detect a large bias; a large sample size is needed to detect a small bias

Secondly, what is the minimum sample size we need to reliably (α=0.05, β=0.20) detect an effect size of interest to us?

The same code as above generates the necessary data, but now we want:

index.for.twothirds <- min( which( effect.size >= 2/3 - 1/2 ) )
sample.size[ index.for.twothirds ]

At least 45 flips are needed to distinguish a fair coin from a coin biased to give 2/3 heads. [Again, bear in mind the coarseness of the loop: the real value is closer to 59].

All this may sound like an awful lot of bother merely to be told that your sample size is woefully inadequate, or that you are going to miss lots of biased coins. Fortunately, R makes it easy to obtain this unwelcome information, at least for common tests like t and ANOVA, and easy-ish for some other tests, through the pwr package, which you can install from CRAN with:

install.packages( "pwr" )

…load into a script with:

library( pwr )

…and find information about usage with:

help( package="pwr" )

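For the coin example itself, pwr would treat this as a one-sample proportion test (pwr.p.test), which is parameterised by Cohen’s effect size h rather than by the raw difference in proportions. The underlying arithmetic is just a normal approximation, which can be sketched in base R; the 2/3 bias, α=0.05 and power of 0.80 are the values used above:

```r
# Cohen's effect size h for proportions 2/3 vs 1/2 (arcsine transformation)
h <- 2 * asin( sqrt( 2/3 ) ) - 2 * asin( sqrt( 1/2 ) )
# Sample size from the normal approximation: one-sided alpha=0.05, power=0.80
n <- ( ( qnorm( 0.95 ) + qnorm( 0.80 ) ) / h )^2
ceiling( n )   # about 54 flips by this approximation
```

This is a little lower than the exact binomial answer of c. 59, as you might expect from a normal approximation to a discrete distribution.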
Exercise

  1. What sample size is needed to distinguish a fair coin from a slightly heads-biased coin (P=0.55) with α=0.05, and a power of 0.80? What is the critical number of heads that would allow you to reject the null?


Answer

  1. A power of 0.80 means β=0.20. The same code we used for the effect vs. sample size calculation can be modified to give us the data we want. The paste() function allows us to put the values of variables into strings for use in print() statements or as axis titles, etc. Note that the number of heads required to convince us of P=0.55 rather than P=0.50 is actually only 330/621 = 0.53, because the sample size is so large.

size <- c(0:1000)
crit <- qbinom( p=0.05, size=size, prob=0.50, lower.tail=FALSE )
beta <- pbinom( crit,   size=size, prob=0.55 )
n    <- min( which( beta <= 0.20 ) )

print( paste( "Critical value is", crit[n], "heads" ) )
print( paste( "Sample size required is", n ) )
# Output:
# "Critical value is 330 heads"
# "Sample size required is 621"

Next up… The F-test.
