Jan 06

## Wisley in Winter

It’s not *really* a botanic garden, but the Royal Horticultural Society’s garden at Wisley is near enough as makes no difference. We visited in what should have been the dead of winter, but which in reality was this weird sprautumn mash-up that is now December in the UK. The heather garden was particularly pretty, despite the wind:

Botanerd highlights included the biggest pine-cone I have yet added to my collection (yes, it *is* a collection, *not* a sickness)…

*Owning* a peyote cactus is legal in the UK unless you ‘prepare it for consumption’, which magically transforms it into a 7-year jail term. This presumably includes accidentally letting it die and dry out into a button whilst on holiday. I may be having a teensy bit of buyer’s regret.

Jan 06

## Imposter

Most years the pond-water microscopy practical throws up an exciting ciliate (or two, or three), but this year, the only ones we saw were duplicates, or too bloody fast to photograph. Ho hum.

So this year you’ll have to make do with an imposter. It’s still got cilia, but it is not a ciliate, and although it’s no bigger than a single-celled organism, it is in fact a full-blown multicellular animal:

This is a rotifer (‘wheel animal’). Rotifers are famous for two things: being ridiculously small, and – in one infamous subgroup – being ridiculously averse to having sex. They’re also really cute when they turn on their feeding wheels. Happy New Year.

Dec 22

## Bagging botanical Brussels

The last time we went to Brussels, I got terribly excited that the hotel we were staying in was right next door to the Botanical Garden of Brussels. Unfortunately – as we discovered in short order – at some point in the 1930s the plants had mostly been shipped off elsewhere, leaving the garden not very botanical, and Dr Cook not very impressed.

That elsewhere was the Plantentuin Meise / Jardin Botanique Meise, which is to the north of Brussels, conveniently situated – for the purposes of this second attempt to visit them – between Brussels airport and Antwerp.

We were visiting in autumn, and they were in the middle of renovating part of their main glasshouse – the Plant Palace – but there was still plenty to see, including a good arboretum, an orchid exhibition, and some lovely Japanese beautyberries:

The entry fee (€7) is very reasonable, and the restaurant helpfully ensures all sandwiches come with a colonic depth-charge of salad; just the ticket to cure the faecal impaction resulting from the ‘food’ served to us on the flight out with <insert awful British airline here>.

When we were there, there was also an art installation by Roos Van de Velde in praise of the best plant in the world – the maidenhair tree, *Ginkgo biloba* – but you’ve only got until mid-January to catch it:

Botanerd highlights:

- Whereas Kew’s evolution house (currently being renovated as part of the Temperate House restoration) always felt like a holding pen for a few cycads and ferns that didn’t physically fit into the Palm House, the Meise version was much more complete, with a good selection of horsetails, ferns, and high-quality fakes of fossil plants like *Cooksonia*, *Lepidodendron*, *etc.* That said, I would very much like to see these ‘evolution house’ efforts re-branded as ‘phylodiversity houses’ or similar, as the usual *Scala Naturae* bullshit of using modern mosses, lycopods, ferns, and conifers as stand-ins for prehistoric forms gets right up my arse.
- The pinetum is small, but has a good selection of species, including quite a few weird cultivars of Japanese cedar (*Cryptomeria japonica*) that I’ve not seen before.

Oct 06

## Organism of the week #29 – Galling

Today is the first day of the new (academic) year at `$WORK`, but – aside from a couple of intro lectures – this is the calm before the real storm, which arrives in the form of a deluge of biological chemistry in November. If you’re lucky, you’ll get a few mugshots of some weird ciliate or other around then; otherwise, expect tumbleweed until the new (calendar) year…

Insects and flowering plants have been co-evolving for 200 million years or so, so it’s unsurprising that many of their interactions have become very intimate. The relationships between gall-forming insects and their host plants are particularly interesting. Gall-forming insects lay their eggs on plants, but when the larvae hatch, instead of munching through the plant in the usual fashion, the larvae make the plant build bizarre little houses for them. These are the galls. Galls provide protection from weather, some protection from things that might like to eat the gall-former itself, and food for the developing larva.

In the UK, there are many gall-forming insects that target oak trees. These are mostly very small wasps. Among the most obvious are knopper galls – the sticky, crater-shaped growths you often see attached to acorns around August.

Rather like the fluke we met a few weeks ago, knopper gall wasps have a complicated life-cycle involving several different hosts, in this case two different species of oak: the English (*Quercus robur*) and the Turkey (*Quercus cerris*). Female wasps emerge in spring from the galls that fell from English oaks the previous autumn. These females then lay asexual eggs on the catkins of the Turkey oak, which hatch out, and form rather inconspicuous galls. The males and females that hatch from these galls in April then mate, and the females lay their eggs on the developing acorns of English oaks, where they develop into the knopper galls in late summer.

Smaller but creepier, the silk-button spangle gall is a trypophobe’s delight. Like the knopper gall, they’re caused by small wasps, but are much less obvious until you flip a leaf over:

The less creepy common spangle gall isn’t as likely to make you feel like you have maggots crawling under your skin, but like the silk-button spangles, that’s exactly what the plant *has* got.

The most famous and most useful gall found on oak is the oak apple. A mixture of mashed-up oak apple galls and iron salts was – for a millennium and a half – what you dunked a quill pen into, if you were lucky enough to know how to write.

Like the knopper and spangle galls, oak apples are caused by the larvae of a tiny wasp, but it’s not just wasps that can cause galls, or indeed just insects. Roundworms (nematodes) can also form galls, usually on the roots of plants, as can some fungi. Mites, which are more closely related to spiders than to insects, are responsible for many galls of sycamores and other maples.

Why would a plant build a house for a parasitic insect or mite? The long answer is very long, but the short answer is that the parasite secretes plant hormones called cytokinins to manipulate the plant into forming the gall. In plants, cytokinins cause cell division and growth. They also act as “feed me” markers: sugars and other nutrients are preferentially directed towards tissues that have cytokinin in them. Cytokinins also cause plant cells to make red pigments. By secreting cytokinins (and other plant hormones), gall-forming insects, mites, fungi and nematodes manipulate the plant into building small – often red! – houses for them, and into supplying those houses with sugar.

The galls formed by insects and mites are amazing structures, but at a molecular level, the most extraordinary galls of all are formed not by insects or mites, but by bacteria. Crown galls may not look so impressive, but the way in which their inhabitants cause the gall to be formed most certainly is. The galls are formed by the bacterium *Rhizobium radiobacter*. [For any biologists doing a double-take at this point: until recently *Rhizobium radiobacter* was known as *Agrobacterium tumefaciens*, but it’s been reclassified. This is *Brontosaurus* all over again…]

*Rhizobium radiobacter* is the world’s smallest genetic engineer. Rather than simply secreting plant hormones to make the plant direct food at it, like the gall-forming insects do, this bacterium actually injects a piece of DNA into the plant’s cells to bend them to its will. The DNA encodes enzymes for plant hormone synthesis, which causes the plant cells to grow into a cancerous mass: the crown gall.

But even that’s not enough for *this* bacterium. The nutrients directed into the gall by the plant are not quite to the bacterium’s taste, so the DNA it injects also encodes genes that turn plant cells into factories for bacteria-friendly food. This food even leaks out of the plant’s roots, feeding the bacterium’s sisters in the soil.

It took until 1982 for humans to develop the technology to genetically manipulate a crop plant to make better food. A mindless single-celled organism has been pulling off the same trick for millions of years. There’s nothing new under the sun.

Sep 17

## Organism of the week #28 – Fractal art

> Flowers are essentially tarts. Prostitutes for the bees.
>
> – Uncle Monty, *Withnail and I*

Our tiny garden has only passing acquaintance with sunshine, so about the only plants that really thrive in its dingy clutches are shade-loving ferns. This Japanese painted fern is my current favourite: who needs flowers anyway, when leaves look like this?

The colour is spectacular, but I also love the shape of the leaves. I have a rabbit’s-foot fern on my windowsill at `$WORK`, which branches not once, not twice, not thrice, but fwice, with each leaflet at each level looking very much like a miniaturised version of the whole leaf:

Self-similar shapes like this are called fractals. This fern’s leaf is not a *true* mathematical fractal, because its self-similarity goes only four levels deep, rather than infinitely so. Also, from a developmental point of view, the leaves of ferns are produced by interactions between signalling molecules that have very little in common with the maths of fractals. However, it is remarkable how easy it is to produce a fractal that looks very much like a fern – especially when someone else has done all the hard work of discovering one for you. This one is called the Barnsley fern fractal, after Michael Barnsley:

This rather cute simulator allows you to play about with the Barnsley fern without going to the hassle of coding up any R yourself, but for those who enjoy such things, here’s my version of the code written in R, based on the coefficients from its Wikipedia article. An entirely imaginary prize is available for anyone who can produce a good simulation of either of the two ferns above.

```r
fm <- vector( "list", 4 )
fc <- vector( "list", 4 )
fp <- vector( "list", 4 )

# Stem
fm[[1]] <- matrix( c( 0.00, 0.00, 0.00, 0.16 ), nrow=2, byrow=TRUE )
fc[[1]] <- matrix( c( 0.00, 0.00 ), nrow=2 )
fp[[1]] <- 0.01

# Leaflets
fm[[2]] <- matrix( c( 0.85, 0.04, -0.04, 0.85 ), nrow=2, byrow=TRUE )
fc[[2]] <- matrix( c( 0.00, 1.60 ), nrow=2 )
fp[[2]] <- 0.85

# Left largest leaflet
fm[[3]] <- matrix( c( 0.20, -0.26, 0.23, 0.22 ), nrow=2, byrow=TRUE )
fc[[3]] <- matrix( c( 0.00, 1.60 ), nrow=2 )
fp[[3]] <- 0.07

# Right largest leaflet
fm[[4]] <- matrix( c( -0.15, 0.28, 0.26, 0.24 ), nrow=2, byrow=TRUE )
fc[[4]] <- matrix( c( 0.00, 0.44 ), nrow=2 )
fp[[4]] <- 0.07

n <- 250000 # you might want to make this smaller if it takes forever to render
x <- numeric( n )
y <- numeric( n )
x[1] <- 0
y[1] <- 0

for ( i in 1:(n-1) ) {
    choice  <- sample( 1:4, prob=fp, size=1 )
    xy.next <- fm[[ choice ]] %*% c( x[i], y[i] ) + fc[[ choice ]]
    x[i+1]  <- xy.next[1]
    y[i+1]  <- xy.next[2]
}

plot( x, y, cex=0.05, col="dark green" )
```

Sep 06

## Living dangerously

Towards the end of the last millennium, I spent a lot of time befriending arsenic. The last two years of my PhD involved measuring – day-in, day-out – how much iodine had fallen off a chemical used as a wood preservative, and arsenic trioxide was the most exciting component of an otherwise excruciatingly dull test for iodide ions.

My little bottles of arsenic warned me to beware of pirates…

…for there were enough pirates in each 10 gram bottle to kill me ten times over. I apparently managed to handle the stuff safely, as – at least so far – I’m not dead even one time over.

Arsenic became so popular as a detergent for getting stubborn relatives out of the family that by the late nineteenth century people started referring to it as “inheritance powder”.

A (not very) cheap and (exceptionally) nasty way to work out how poisonous something like arsenic is, is to take ten cages containing ten rats each, and give each cageful of rats an increasingly large dose of the poison in question. The individual rats will differ in how easy they are to kill with the poison, but the rats in the cages that get very small doses will probably all survive, and the rats in the cages that get very large doses will probably all die. Somewhere in the middle you’ll probably find a cage containing 5 corpses and 5 survivors. The dose of poison that kills half the rats in a cage is called the “lethal dose, 50%”, or *LD*₅₀. The *LD*₅₀ for arsenic trioxide is about 10 milligrams for rats.

Humans are a lot heavier than rats: I’m 60-odd kilograms on a good day, a rat only half a kilo or so. Bigger animals can survive higher doses of poisons because the poison ends up diluted into their larger bodies, so *LD*₅₀ values are usually scaled for the size of the victim. If 10 mg kills a 0.5 kg rat, the *LD*₅₀ is 20 mg kg⁻¹. By maths and educated guesswork about the relative susceptibilities of rats and humans to arsenic (apparently the definitive experiment is unethical), the *LD*₅₀ for a human (this human, specifically) is just over 1 gram.

Arsenic is pretty poisonous, but some other poisons beat it hands-down. Botulinum toxin, the active ingredient in Botox injections, has an *LD*₅₀ of 1 ng kg⁻¹, which is about 10 million times more lethal. But, as even 16th century proto-doctors noted, it’s the *dose* that makes the poison. *Everything* is poisonous in sufficient quantity, even things as seemingly innocuous as water.

Even the most child-proofed kitchen could kill you three times over. You just need to put in a little effort:

The *LD*₅₀ of water is about 100 mL kg⁻¹, at least in rats. For a runt like me, that means around 6 L of water drunk over a short period of time would be lethal, mostly by diluting the salts in my blood and tissue fluid to the point where water would flood into my cells, making them swell and pop.

The opposite effect could be achieved by eating too much common table salt, which would suck water out of my cells, causing them to shrivel. With an *LD*₅₀ of about 3 g kg⁻¹, pretty much every kitchen in the land has a pot of salt in it containing at least one lethal dose:

It would *just* be possible to get someone to eat the necessary dose of sugar. With an *LD*₅₀ of 30 g kg⁻¹, that would mean eating 2 kg of the stuff, but it’s certainly a sweeter way to go than many.
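
Scaling those per-kilogram doses back up to a whole human is simple multiplication. Here’s a quick R sketch using the values quoted above (the 60 kg body weight is my own round number; adjust to taste):

```r
# LD50 values quoted above, converted to grams per kg of body weight
ld50.g.per.kg <- c( arsenic.trioxide = 0.02,  # 20 mg/kg
                    botulinum.toxin  = 1e-9,  # 1 ng/kg
                    water            = 100,   # 100 mL/kg; 1 mL of water weighs 1 g
                    salt             = 3,
                    sugar            = 30 )

body.weight <- 60 # kg

lethal.dose <- ld50.g.per.kg * body.weight # grams
lethal.dose[ "arsenic.trioxide" ]          # 1.2 g
lethal.dose[ "water" ] / 1000              # 6 kg of water, i.e. about 6 L
```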

Sep 06

## Organism of the week #27 – Alien haemorrhoids

A quickie this week: magnolia fruit look like something from another planet.

Aug 26

## Statistical power

In a recent(ish) post, we saw that if a fair coin is flipped 30 times, the probability it will give us 10 or fewer heads is less than 5% (4.937% to be pointlessly precise). Fisher quantified this using the *p* value of a data set: the probability of obtaining data (or a test statistic based on those data) at least as extreme as that actually observed, assuming that the null hypothesis is true.
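
That 4.937% comes straight from the binomial cumulative distribution function, which R exposes as `pbinom`:

```r
# probability of 10 or fewer heads in 30 flips of a fair coin
pbinom( 10, size=30, prob=0.5 ) # 0.04937 to four significant figures
```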

If you ask people (even people who use them on a daily basis) what a *p* value is, you will get a range of answers, probably including:

- The probability of the data being due to random chance
- The probability of rejecting the null hypothesis incorrectly
- The probability of the data
- The probability of the null hypothesis being true
- The probability of the null hypothesis being false
- The probability that the data are significant
- The probability of the data being accurate

…none of which are correct. Some of these errors can be traced back to the historical mixing-up of Fisher’s significance testing and a different form of hypothesis testing proposed by Jerzy Neyman and Egon Pearson. Both of these involve calculating *p* values, and both frequently use the arbitrary cut-off of 0.05, but Neyman-Pearson testing is conceptually distinct. We’ll try to untangle them here.

If we simulate an infinitely large **collective** of coin-flipping trials, each consisting of 30 flips of a fair coin, we will obtain a binomial distribution. Unfortunately, we don’t have infinite time, so we’ll make do with a randomly generated collection of one thousand 30-flip trials instead:

```r
fair.heads <- rbinom( n=1000, size=30, prob=0.5 )

# breaks allows us to set the bin sizes for a histogram
hist( fair.heads,
      main = "Number of heads in 30 coin-flips of a fair coin",
      xlim = c( 0, 30 ), breaks = seq( 0, 30, by = 1 ),
      xlab = "Number of heads in 30 flips", ylab = "Frequency" )
```

36 observations of 10 or fewer heads in 1000 observations is 3.6%, which is in reasonable agreement with the 4.937% we expect from theory.

**`rbinom`**

The function `rbinom( how.many, size=n, prob=P )` is R’s implementation of a random binomial data generator: the ‘r’ stands for random. `rbinom( this.many, size=n, prob=P )` gives you `this.many` random variables drawn from a binomial distribution of `size` n, with a coin producing heads at `prob` P. Other distributions have similar random generator functions available: `rnorm`, `rpois`, `runif`, and so on.

To simplify things, we can go straight for the binomial distribution to give us a ‘perfect’ set of coin-flip data:

```r
# create a vector into which we can put the number-of-heads-in-30-flips
# for 1000 observations
simulated.heads <- vector( "numeric" )

# This uses a for loop, the syntax for which is:
# for ( variable in list ) { do.something.with( variable ) }
for ( k in 0:30 ) {
    # dbinom will give the pmf for each number of heads, k, in 30 flips
    # The number of observations of k-heads-in-30-flips is 1000 times this
    number.of.heads.in.30.flips <- round( 1000 * dbinom( k, size=30, prob=0.5 ) )

    # e.g. when k=10, append 28 repetitions of "10" to the simulated head-count data
    simulated.heads <- c( simulated.heads, rep( k, times = number.of.heads.in.30.flips ) )
}

hist( simulated.heads,
      main = "Theoretical number of heads in 30 coin-flips of a fair coin",
      xlim = c( 0, 30 ), breaks = seq( 0, 30, by = 1 ),
      xlab = "Number of heads in 30 flips", ylab = "Theoretical frequency" )
```

So, 10 heads in 30 flips would allow us to reject the null hypothesis that the coin is fair at a significance level of 0.05, but it tells us nothing about *how* biased the coin is. A coin that is only *slightly* biased to tails would also be unlikely to produce 10 heads in 30 flips, and a coin so biased that it *always* lands heads up, or *always* lands tails up, is also inconsistent with 10 heads in 30 flips.

How can we decide between different hypotheses?

You might think: “isn’t that what the *p* value shows?”, but this is not correct. A *p* value tells you something about the probability of data under a *null* hypothesis (H₀), but it does not tell you anything about the probability of data under any particular *alternative* hypothesis (H₁). Comparing two hypotheses needs rather more preparatory work.

If we consider a coin that may be fair, or may be loaded to heads, we might have two competing hypotheses:

- H₀: the coin is fair, and the probability of heads is 1/2 (*P* = 0.50).
- H₁: the coin is loaded, and the probability of heads is 2/3 (*P* = 0.666…).

Why *P*=2/3? And why is *P*=2/3 the alternative rather than the null hypothesis?

I’ve chosen *P* = 2/3 because I have decided – **before I start the experiment** – that I am not bothered by coins any less biased than this. Perhaps this is for reasons of cost: making coins that are better balanced is not worth the investment in new coin-stamping dies. The difference in the probability of heads under the null and alternative hypotheses is called the **effect size**. Here, the effect size is the difference of 0.1666… in the probability of getting a head. Other measures of effect size are appropriate to different kinds of hypotheses.

I’ve made *P*=2/3 the alternative hypothesis because it is more costly to me if I accept that hypothesis when it is – in fact – false. Again, perhaps this is for reasons of cost: sticking with the old coin dies is a lot cheaper than replacing them, and it’s unlikely that a coin will be used to decide anything important: just trivial nonsense like kick-offs in football that I’m not really bothered by.

If you’re already objecting to this pre-emptive bean-counting approach to scientific data collection, you have hit upon one of the things Fisher didn’t like about Neyman and Pearson’s method, so you are in good company. That said…

**The Neyman-Pearson method tries to control for the long-term probability of making mistakes in deciding between hypotheses**. In the example above, we are trying to decide between two hypotheses, and there are four possible outcomes:

| Decision | Reality: H₀ is true | Reality: H₀ is false |
|---|---|---|
| Accept H₀ | Correctly accept H₀ | Type II error |
| Reject H₀ | Type I error | Correctly reject H₀ |
In two of these outcomes, we make a mistake. These are called type I and type II errors:

- Type I error: **rejecting the null hypothesis when it is – in fact – true**. This is a kind of false-positive error. The coin isn’t really biased, but we’re unlucky and get data that is unusually full of heads.
- Type II error: **accepting the null hypothesis when it is – in fact – false**. (If we are considering only two hypotheses, this is equivalent to **rejecting the alternative hypothesis when it is – in fact – true**.) This is a kind of false-negative error. The coin really *is* biased, but we’re unlucky and get data that looks like a 50:50 ratio of heads to tails.

If we were to repeat a coin-flipping experiment a very large number of times, any **decision rule** that allows us to decide between H₀ and H₁ *will* give us the wrong decision from time to time. There’s nothing we can do about this, but we can arrange the experiment so that the probabilities of making either of the two mistakes are reduced to acceptable levels:

- We call the long-term probability of making a type I error across a large number of repeats of the experiment *α*. In probability notation, *α* = P(reject H₀ | H₀), “the probability of rejecting the null hypothesis given (|) that the null hypothesis is true”.
- We call the long-term probability of making a type II error across a large number of repeats of the experiment *β*. In probability notation, *β* = P(accept H₀ | ¬H₀), “the probability of accepting the null hypothesis given that the null hypothesis is not (¬) true”.

We can choose whatever values we think appropriate for *α* and *β*, but typically we set *α* to 0.05. This is one of the things that results in the confusion between the Neyman-Pearson and Fisher approaches.

For two coins – one fair, and one biased to give 2/3 heads – the probability distributions for 30 flips look like this:

The vertical line cuts the *x*-axis at 19 heads. This is the **critical value**, *i.e.* the number of heads above which we would reject the null hypothesis, because the probability of getting more than 19 heads in 30 flips of a fair coin is less than 0.05. The blue area on the right is therefore 5% of the area beneath the fair coin’s PMF curve, and it represents the probability of a false positive: *i.e.* rejecting the null hypothesis when it is – in fact – true, which is *α*.
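
The original figure hasn’t survived here, but a rough sketch of it is easy to reproduce (the plotting choices below are mine, not the original’s):

```r
k <- 0:30

# PMF of heads for the fair coin (blue) and the 2/3-loaded coin (red)
plot( k, dbinom( k, size=30, prob=0.5 ), type="h", col="blue",
      xlab="Number of heads in 30 flips", ylab="Probability" )
points( k + 0.2, dbinom( k, size=30, prob=2/3 ), type="h", col="red" )

# critical value: we reject the null hypothesis above 19 heads
abline( v=19.5, lty=2 )
```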

The value ‘19’ is easily determined using `qbinom`, which – as we saw in the last post – gives us the number of heads corresponding to a specified *p* value.

```r
critical.value <- qbinom( p=0.05, size=30, prob=0.5, lower.tail=FALSE )
```

For coin-flipping, the statistic of interest – the one that we can use to make a decision between the null and alternative hypotheses – is simply the number of heads. For other kinds of test, it may be something more complicated (*e.g.* a *t* value, or *F* ratio), but the same general principle applies: there is a critical value of the test statistic above which *p* is less than 0.05, and above which we can therefore reject the null hypothesis with a long-term error rate of *α*.

The red region on the left is the portion of the 2/3-loaded coin’s PMF that is ‘cut off’ when we decide on a long-term false positive rate of 5%. Its area represents the probability of a false negative: *i.e.* accepting the null hypothesis when it is – in fact – false. This is *β*.

You can hopefully see that *β* is given by `pbinom( critical.value, size=30, prob=2/3 )`. Its value – in this case 0.42 – depends on:

- The critical value (which depends on the **choice of *α***).
- The number of coin-flips (here 30), *i.e.* the **sample size**.
- The bias of the coin under the alternative hypothesis (2/3), which is related to the **effect size**, *i.e.* the smallest difference in ‘fair’ and ‘loaded’ behaviour that we are interested in (2/3 − 1/2 = 0.1666…).

This is where the Neyman-Pearson approach diverges from Fisher’s. **Fisher’s approach makes no mention of an alternative hypothesis, so it makes no attempt to quantify the false-negative rate**. Neyman-Pearson sets up two competing hypotheses, and forces you to explicitly consider – and preferably control – *β*.

If the sample size and effect size are held constant, then decreasing *α* increases *β*, *i.e.* there is a trade-off between the false-positive rate and the false-negative rate: imagine sliding the vertical line left or right, and consider what would happen to the red and blue areas. For the decision between ‘fair’ *vs.* ‘2/3-loaded’, on a sample size of 30, *β* is 0.42: we are *eight times* as likely to make a type II error as a type I error. It is very likely that we will wrongly say a coin is fair when in fact it is biased, and it is much more likely that we will make that mistake than we will call a fair coin biased when it isn’t. A mere 30 coin-flips is therefore woefully **underpowered** to distinguish between *P* = 1/2 and *P* = 2/3.

The **power of a test**, *π*, is:

*π* = 1 − *β* = P(reject H₀ | H₁)

and is somewhat more convenient than *β* to remember:

- *α* = P(reject H₀ | H₀)
- *π* = P(reject H₀ | H₁)
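
Putting those definitions together for the 30-flip example reproduces the numbers quoted above:

```r
# critical value for alpha = 0.05: reject the null above this many heads
critical.value <- qbinom( p=0.05, size=30, prob=0.5, lower.tail=FALSE ) # 19

# beta: the chance a 2/3-loaded coin still gives critical.value or fewer heads
beta <- pbinom( critical.value, size=30, prob=2/3 ) # 0.42

# power: the chance we correctly reject the null for that loaded coin
power <- 1 - beta # 0.58, well short of the conventional 0.80
```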

What is a good power to aim for? Typically, we want an experiment to have a *β* of 0.20 (*i.e.* a power of 0.8). You might ask “why the asymmetry?” between choosing *α* = 0.05 and *β* = 0.20. There’s no particularly good general reason to want *β* to be larger than *α*. You might justify this on the basis that type II errors are less costly, so you don’t mind making them more often than type I errors (do you *really* want to replace all those coin dies?), but the real reason is usually more pragmatic: detecting even small effect sizes requires quite large sample sizes:

[The four plots were produced using `par( mfrow=c( 2, 2 ) )`, which sets up a 2-by-2 grid: each call to `plot` then adds a new graph until it fills up.]
In the graphs, you can see that as we increase the sample size from 15 through to 120, the distributions get ‘tighter’, the overlap gets smaller, and the blue area (*β*) gets smaller (and so the power of the test gets higher). If we plot the value of *β* against the sample size, we get this:

It takes *c.* 60 flips for a *β* of 0.20 (a power of 0.80), and *c.* 100 flips for a *β* of 0.05 (a power of 0.95). If we have already decided on the value of *α* and the effect size of interest, then **fixing the sample size is the only control we have over the type II error rate**.

It is fairly rare for people to do power calculations of this sort before they start collecting data: more often than not, they simply collect a sample as large as they can afford. This means that – whilst type I errors are controlled at 0.05 – type II errors are often ignored, assumed (wrongly!) to be low, or – at best – calculated retrospectively. Retrospective calculations can give unpleasant surprises: as we saw above, 30 flips will result in quite a lot of quite badly loaded coins going undetected.

**Ideally, you should calculate the sample size needed for a particular power before the experiment**: that way, you will collect precisely the amount of data needed to give you the long-term error rates you think are acceptable (0.05 for *α* and 0.20 for *β*, or whatever). This brings us on to the last major confusion between Fisher and Neyman-Pearson testing:

- *p* is the probability of a test statistic being at least as extreme as the test statistic we actually got, assuming that the null hypothesis is true.
- *α* is the long-term probability of mistakenly rejecting the null hypothesis when it is actually true, if we were to repeat the experiment many times.

In Neyman-Pearson testing, we calculate a test statistic from our data (in this case just the number of heads), and compare that to the critical value of that test statistic. If the calculated value is larger than the critical value, we reject the null hypothesis. However, we can make exactly the same decision by calculating a *p* value from our data, and if it is less than *α*, rejecting the null hypothesis. Both of these approaches result in the same decision. The former used to be much more common because it was easy to obtain printed tables of critical values of test statistics like *t* and *F* for *α*=0.05. More recently, since computing power has made it far easier to calculate the *p* value directly, the latter approach has become more common.
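
As a concrete (and entirely hypothetical) example, suppose we observed 21 heads in 30 flips; both routes lead to the same decision:

```r
observed.heads <- 21 # hypothetical data

# Route 1: compare the test statistic to its critical value
critical.value <- qbinom( p=0.05, size=30, prob=0.5, lower.tail=FALSE )
observed.heads > critical.value # TRUE, so reject the null

# Route 2: compare the p value to alpha
p.value <- pbinom( observed.heads - 1, size=30, prob=0.5, lower.tail=FALSE )
p.value < 0.05 # also TRUE: the same decision, reject the null
```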

Remember that in Fisher’s approach, a *p* value of 0.01 is ‘more significant’ than a *p* value of 0.05, and Fisher suggested (wrongly, IMHO) that this should give you more confidence in your rejection of the null. In Neyman-Pearson testing, if a *p* value is calculated at all, it is only to compare it to *α*. If you have done the power calculation, and collected the correct sample size, working out the *p* value is simply a means to an end: either it is larger than *α* (…accept null) or it is smaller than *α* (…reject null), but its value has no further importance. **In Neyman-Pearson hypothesis testing, there is no such thing as “highly significant, p=0.01”**. Unfortunately, a lot of people want to have their statistical cake and eat it: they don’t do a power calculation and collect a sample size that is expedient rather than justified (screw you, Neyman-Pearson!), but they then want to interpret a very low *p* value (yay, Fisher!) as being evidence for a particular alternative hypothesis (love you really, Neyman-Pearson!). This isn’t really very coherent, and many statisticians now prefer to side-step the whole issue and go for confidence intervals or Bayesian approaches instead, which I leave to others to explain.

This brings us to the question – finally! – of how many coin-flips we *should* collect to distinguish a loaded coin from a fair one. There are several answers to this, depending on our constraints, but two are particularly important:

- We have limited money and time, so we can only do 30 flips. **What is the minimum effect size we can reliably detect?**
- We have unlimited money and time (ha ha ha!), so we can do as many flips as is necessary, but we’re only interested in coins that are biased to 2/3 heads or more. **What sample size do we need to reliably detect a particular effect size of interest?**

So, firstly, for the number of flips we can be bothered to make, what is the minimum effect size we can reliably (*α*=0.05, *β*=0.20) detect? For some common tests, R has built-in power calculations, but in general, the simplest way to work this out is by simulation. This requires a little code. In this case, it’s only actually about 12 lines, but I’ve commented it heavily as it’s a little tricky:

```r
# Set up empty vectors to contain numeric data. sample.size will be filled in
# with the minimum sample size needed to distinguish a fair coin from a
# loaded coin of various degrees of bias. effect.size will be filled in
# with the effect size for each loaded coin
sample.size <- vector( 'numeric' )
effect.size <- vector( 'numeric' )

# We'll try out numbers of flips between 0 and 200
size <- c( 0:200 )

# Loop though some values of 'loaded' from 19/32 to 1. We don't go any
# lower than this because even 18/32 requires more than 200 flips!
for ( loaded in seq( from=19/32, to=32/32, by=1/32 ) ) {

    # since 'size' is a vector, qbinom will return a vector of 'crit' values
    # for all trial sizes between 0 and 200 coin-flips; and pbinom will then
    # return the beta values for each ('crit', 'size') pair.
    crit <- qbinom( p=0.05, size=size, prob=0.5, lower.tail=FALSE )
    beta <- pbinom( crit, size=size, prob=loaded )

    # 'which' returns all the indices into the beta vector for which
    # the beta value is <= 0.20. Since these were put in in 'size'
    # order, using 'min' to return the smallest of these indices
    # immediately gives us the smallest sample size required to
    # reliably detect a bias in the coin of 'loaded'
    smallest.needed <- min( which( beta <= 0.20 ) )

    # Append the value we've just worked out to the sample.size vector
    sample.size <- c( sample.size, smallest.needed )

    # Effect size is the degree of bias minus the P for a fair coin (0.5)
    effect.size <- c( effect.size, loaded - 0.5 )
}

# Which is the first index into sample.size for which the value is 30 or less?
index.for.thirty <- min( which( sample.size <= 30 ) )

# Since effect.size and sample.size were made in parallel, using this index
# into effect.size will give the minimum effect size that could be detected
# on 30 coin flips
effect.size[ index.for.thirty ]
```

The answer is that **30 flips is enough to reliably detect a bias only if the effect size is at least 0.25**, *i.e.* to distinguish between a fair coin and a coin that is biased to give 3/4 heads. Does this surprise you? You need quite a lot of flips to distinguish even a large bias.

[NB – we made rather coarse steps of 1/32 in the loop that generated this data. If you do smaller steps of 1/64, or 1/128, you’ll find the real value is nearer 0.23]
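As a quick sanity-check – my own addition, not part of the loop above – you can ask the same question of a single coin directly. For 30 flips, find the critical number of heads for a fair coin, then ask how often a 3/4-biased coin would fail to reach it:

```r
# One-shot check: can 30 flips reliably detect a coin that gives 3/4 heads?
# crit is the smallest number of heads whose upper-tail probability
# under a fair coin is <= 0.05
crit <- qbinom( p=0.05, size=30, prob=0.5, lower.tail=FALSE )
# beta is the chance that a 3/4-biased coin still fails to exceed crit heads
beta <- pbinom( crit, size=30, prob=0.75 )
crit
beta
# beta comes out comfortably below 0.20, so an effect size of 0.25
# is indeed detectable with 30 flips
```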

A graph of effect size against sample size for (*α*=0.05, *β*=0.20) is useful in answering this question and the next:
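If you want to draw that graph yourself, the loop above already makes the two vectors it needs; something like this will do it (the axis labels and styling are just my suggestion):

```r
# Rebuild the effect.size and sample.size vectors exactly as in the loop above
sample.size <- vector('numeric')
effect.size <- vector('numeric')
size <- c(0:200)
for ( loaded in seq( from=19/32, to=32/32, by=1/32 ) ) {
    crit <- qbinom( p=0.05, size=size, prob=0.5, lower.tail=FALSE )
    beta <- pbinom( crit, size=size, prob=loaded )
    sample.size <- c( sample.size, min( which( beta <= 0.20 ) ) )
    effect.size <- c( effect.size, loaded - 0.5 )
}

# Plot minimum sample size against effect size
plot( sample.size ~ effect.size, type="b",
      xlab="Effect size (bias minus 0.5)",
      ylab="Minimum number of flips",
      main="Flips needed to detect a loaded coin (alpha=0.05, beta=0.20)" )
```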

**What is the minimum sample size we need to reliably (*α*=0.05, *β*=0.20) detect an effect size of interest to us?**

The same code as above generates the necessary data, but now we want:

```r
index.for.twothirds <- min( which( effect.size >= 2/3 - 1/2 ) )
sample.size[ index.for.twothirds ]
```

**At least 45 flips are needed to distinguish a fair coin from a coin biased to give 2/3 heads**. [Again, bear in mind the coarseness of the loop: the real value is closer to 59].
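If you'd rather check this for a bias of exactly 2/3, rather than stepping through fractions of 1/32, the loop can be skipped entirely. This sketch of mine keeps the same conventions as the code above:

```r
# beta for a coin biased to give exactly 2/3 heads, at every trial size up to 200
size <- c(0:200)
crit <- qbinom( p=0.05, size=size, prob=0.5, lower.tail=FALSE )
beta <- pbinom( crit, size=size, prob=2/3 )
# first sample size at which the bias is reliably detected
min( which( beta <= 0.20 ) )
```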

All this may sound like an awful lot of bother merely to be told that your sample size is woefully inadequate, or that you are going to miss lots of biased coins. Fortunately, R makes it easy to obtain this unwelcome information, at least for common tests like *t* and ANOVA, and easy-ish for some other tests, through the `pwr` package, which you can install from CRAN with:

```r
install.packages( "pwr" )
```

…load into a script with:

```r
library(pwr)
```

…and find information about usage with:

```r
??power
```
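As a taster – this is my own sketch, not part of the exercises – `pwr` measures effect sizes for proportions with Cohen's *h*, an arcsine transform of the two *P* values, rather than our simple 'bias minus 0.5'. Its answers therefore use the normal approximation rather than the exact binomial calculation above:

```r
# Cohen's h for distinguishing P=0.75 from P=0.5; this is the arcsine
# transform that pwr's ES.h(0.75, 0.5) computes
h <- 2 * asin( sqrt(0.75) ) - 2 * asin( sqrt(0.5) )
# Normal-approximation sample size for a one-sided test:
# alpha = 0.05 (hence qnorm(0.95)) and power = 0.80 (hence qnorm(0.80))
n <- ( ( qnorm(0.95) + qnorm(0.80) ) / h )^2
ceiling( n )
# with pwr loaded, the equivalent call would be:
# pwr.p.test( h = ES.h(0.75, 0.5), sig.level = 0.05,
#             power = 0.80, alternative = "greater" )
```

Note that the approximate answer (23 flips) is rather smaller than the 30 we got from the exact binomial calculation: the arcsine approximation is optimistic at small sample sizes.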

# Exercises

- What sample size is needed to distinguish a fair coin from a slightly heads-biased coin (*P*=0.55) with *α*=0.05, and a power of 0.80? What is the critical number of heads that would allow you to reject the null?

# Answers

- A power of 0.80 means *β*=0.20. The same code we used for the effect *vs*. sample size calculation can be modified to give us the data we want. The `paste()` function allows us to put the values of variables into strings for use in `print()` statements or as axis titles, *etc*. Note that the number of heads required to convince us of *P*=0.55 rather than *P*=0.50 is actually only 330/621 ≈ 0.53, because the sample size is so large.

```r
size <- c(0:1000)
crit <- qbinom( p=0.05, size=size, prob=0.50, lower.tail=FALSE )
beta <- pbinom( crit, size=size, prob=0.55 )
n    <- min( which( beta <= 0.20 ) )
print( paste( "Critical value is", crit[n], "heads" ) )
print( paste( "Sample size required is", n ) )
```

```
"Critical value is 330 heads"
"Sample size required is 621"
```

Next up… The *F*-test.