In a recent(ish) post, we saw that if a fair coin is flipped 30 times, the probability it will give us 10 or fewer heads is less than 5% (4.937% to be pointlessly precise). Fisher quantified this using the p value of a data set: the probability of obtaining data (or a test statistic based on those data) at least as extreme as that actually observed, assuming that the null hypothesis is true.
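As a quick check of that number, the p value for the coin example can be calculated directly with R’s pbinom function (a one-line sketch; pbinom gives the probability of obtaining at most the stated number of heads):
# P( 10 or fewer heads in 30 flips of a fair coin ): approximately 0.04937
pbinom( 10, size=30, prob=0.5 )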
If you ask people (even people who use them on a daily basis) what a p value is, you will get a range of answers, probably including:
 The probability of the data being due to random chance
 The probability of rejecting the null hypothesis incorrectly
 The probability of the data
 The probability of the null hypothesis being true
 The probability of the null hypothesis being false
 The probability that the data are significant
 The probability of the data being accurate
…none of which are correct. Some of these errors can be traced back to the historical mixing-up of Fisher’s significance testing, and a different form of hypothesis testing proposed by Jerzy Neyman and Egon Pearson. Both of these involve calculating p values, and both frequently use the arbitrary cutoff of 0.05, but Neyman-Pearson testing is conceptually distinct. We’ll try to untangle them here.
If we simulate an infinitely large collective of coin-flipping trials, each consisting of 30 flips of a fair coin, we will obtain a binomial distribution. Unfortunately, we don’t have infinite time, so we’ll make do with a randomly generated collection of one thousand 30-flip trials instead:
fair.heads <- rbinom( n=1000, size=30, prob=0.5 )
# breaks allows us to set the bin sizes for a histogram
hist(
fair.heads,
main = "Number of heads in 30 coinflips of a fair coin",
xlim = c( 0, 30 ),
breaks = seq( 0, 30, by = 1 ),
xlab = "Number of heads in 30 flips",
ylab = "Frequency"
)
Fair coin simulated data (1000 trials)
36 observations of 10 or fewer heads in 1000 trials is 3.6%, which is reasonably close to the 4.937% we expect from theory.
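(The count of 36 will vary from run to run, since rbinom draws random data; calling set.seed() before rbinom makes a run reproducible. To tally the extreme trials in your own simulation:)
# count the simulated trials that gave 10 or fewer heads, as a fraction of all trials
sum( fair.heads <= 10 ) / length( fair.heads )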
rbinom 
The function rbinom(this.many, size=n, prob=P) is R’s implementation of a random binomial data generator (the ‘r’ stands for random). It gives you this.many random variables drawn from a binomial distribution of size n, with a coin producing heads at probability P.
rbinom(1000, size=30, prob=0.5) returns one thousand random variables, each representing the number of heads got from 30 flips of a fair coin.
Other distributions have similar random generator functions available: rnorm (normal), rpois (Poisson), rt (t), rf (F), etc. 
To simplify things, we can go straight for the binomial distribution to give us a ‘perfect’ set of coin-flip data:
# create a vector into which we can put the number-of-heads-in-30-flips for 1000 observations
simulated.heads <- vector( "numeric" )
# This uses a for loop, the syntax for which is:
# for ( variable in list ) { do.something.with( variable ) }
for ( k in 0:30 ) {
# dbinom will give the pmf for each number of heads, k, in 30 flips
# The number of observations of k-heads-in-30-flips is 1000 times this
number.of.heads.in.30.flips <- round( 1000 * dbinom( k, size=30, prob=0.5 ) )
# e.g. when k=10, append 28 repetitions of "10" to the simulated head-count data
simulated.heads <- c( simulated.heads, rep( k, times = number.of.heads.in.30.flips ) )
}
hist(
simulated.heads,
main = "Theoretical number of heads in 30 coinflips of a fair coin",
xlim = c( 0, 30 ),
breaks = seq( 0, 30, by = 1 ),
xlab = "Number of heads in 30 flips",
ylab = "Theoretical frequency"
)
Fair coin theoretical data (1000 trials)
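(As an aside, the same ‘perfect’ data set can be built without an explicit loop, since dbinom and rep are both vectorised; a sketch:)
# dbinom over every possible head-count at once, then repeat each count
# the corresponding (rounded) number of times
k <- 0:30
simulated.heads <- rep( k, times = round( 1000 * dbinom( k, size=30, prob=0.5 ) ) )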
So, 10 heads in 30 flips would allow us to reject the null hypothesis that the coin is fair at a significance level of 0.05, but it tells us nothing about how biased the coin is. A coin that is only slightly biased to tails would also be unlikely to produce 10 heads in 30 flips, and a coin so biased that it always lands heads up, or always lands tails up, is also inconsistent with 10 heads in 30 flips.
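We can see this by asking how probable ‘exactly 10 heads in 30 flips’ is under a few different assumed biases (a sketch using dbinom; the particular biases are chosen purely for illustration):
# P( exactly 10 heads in 30 flips ) under different assumed probabilities of heads:
# fair; slightly tails-biased (still fairly improbable, though somewhat less so
# than for a fair coin); always-heads and always-tails (impossible)
dbinom( 10, size=30, prob=c( 0.5, 0.45, 1, 0 ) )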
How can we decide between different hypotheses?
You might think: “isn’t that what the p value shows?”, but this is not correct. A p value tells you something about the probability of data under a null hypothesis (H_{0}), but it does not tell you anything about the probability of data under any particular alternative hypothesis (H_{1}). Comparing two hypotheses needs rather more preparatory work.
If we consider a coin that may be fair, or may be loaded to heads, we might have two competing hypotheses:
 H_{0}: the coin is fair, and the probability of heads is 1/2 (P=0.50).
 H_{1}: the coin is loaded, and the probability of heads is 2/3 (P=0.666…).
Why P=2/3? And why is P=2/3 the alternative rather than the null hypothesis?
I’ve chosen P=2/3 because I have decided – before I start the experiment – that I am not bothered by coins any less biased than this. Perhaps this is for reasons of cost: making coins that are better balanced is not worth the investment in new coin-stamping dies. The difference in the probability of heads under the null and alternative hypotheses is called the effect size. Here, the effect size is the difference of 0.1666… in the probability of getting a head. Other measures of effect size are appropriate to different kinds of hypotheses.
I’ve made P=2/3 the alternative hypothesis because it is more costly to me if I accept that hypothesis when it is – in fact – false. Again, perhaps this is for reasons of cost: sticking with the old coin dies is a lot cheaper than replacing them, and it’s unlikely that a coin will be used to decide anything important: just trivial nonsense like kickoffs in football that I’m not really bothered by.
If you’re already objecting to this pre-emptive bean-counting approach to scientific data collection, you have hit upon one of the things Fisher didn’t like about Neyman and Pearson’s method, so you are in good company. That said…
The Neyman-Pearson method tries to control the long-term probability of making mistakes in deciding between hypotheses. In the example above, we are trying to decide between two hypotheses, and there are four possible outcomes:

Decision \ Reality | H_{0} is true | H_{0} is false
Accept H_{0} | Correctly accept H_{0} | Type II error
Reject H_{0} | Type I error | Correctly reject H_{0}
In two of these outcomes, we make a mistake. These are called type I and type II errors:
 Type I error: rejecting the null hypothesis when it is – in fact – true. This is a kind of false-positive error. The coin isn’t really biased, but we’re unlucky and get data that is unusually full of heads.
 Type II error: accepting the null hypothesis when it is – in fact – false. (If we are considering only two hypotheses, this is equivalent to rejecting the alternative hypothesis when it is – in fact – true.) This is a kind of false-negative error. The coin really is biased, but we’re unlucky and get data that looks like a 50:50 ratio of heads to tails.
If we were to repeat a coin-flipping experiment a very large number of times, any decision rule that allows us to decide between H_{0} and H_{1} will give us the wrong decision from time to time. There’s nothing we can do about this, but we can arrange the experiment so that the probabilities of making either of the two mistakes are reduced to acceptable levels:
 We call the long-term probability of making a type I error across a large number of repeats of the experiment, α. In probability notation, α = P(reject H_{0} | H_{0}), “the probability of rejecting the null hypothesis given (|) that the null hypothesis is true”.
 We call the long-term probability of making a type II error across a large number of repeats of the experiment, β. In probability notation, β = P(accept H_{0} | ¬H_{0}), “the probability of accepting the null hypothesis given that the null hypothesis is not (¬) true”.
We can choose whatever values we think appropriate for α and β, but typically we set α to 0.05. This is one of the things that results in the confusion between the Neyman-Pearson and Fisher approaches.
For two coins – one fair, and one biased to give 2/3 heads – the probability distributions for 30 flips look like this:
Fair (left) and 2/3-loaded (right) coin PMF distributions, showing overlaps
The vertical line cuts the x-axis at 19 heads. This is the critical value, i.e. the number of heads above which we would reject the null hypothesis, because the probability of getting more than 19 heads in 30 flips of a fair coin is less than 0.05. The blue area on the right is therefore 5% of the area beneath the fair coin’s PMF curve, and it represents the probability of a false positive: i.e. rejecting the null hypothesis when it is – in fact – true, which is α.
The value ’19’ is easily determined using qbinom, which – as we saw in the last post – gives us the number of heads corresponding to a specified p value.
critical.value <- qbinom( p=0.05, size=30, prob=0.5, lower.tail=FALSE )
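We can confirm that 19 is the critical value by looking at the upper-tail probabilities either side of it:
# P( more than 19 heads in 30 flips of a fair coin ) is just under 0.05...
pbinom( 19, size=30, prob=0.5, lower.tail=FALSE )
# ...whereas P( more than 18 heads ) is comfortably above 0.05
pbinom( 18, size=30, prob=0.5, lower.tail=FALSE )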
For coin-flipping, the statistic of interest – the one that we can use to make a decision between the null and alternative hypotheses – is simply the number of heads. For other kinds of test, it may be something more complicated (e.g. a t value, or F ratio), but the same general principle applies: there is a critical value of the test statistic above which p is less than 0.05, and above which we can therefore reject the null hypothesis with a long-term error rate of α.
The red region on the left is the portion of the 2/3-loaded coin’s PMF that is ‘cut off’ when we decide on a long-term false positive rate of 5%. Its area represents the probability of a false negative: i.e. accepting the null hypothesis when it is – in fact – false. This is β.
You can hopefully see that β is given by pbinom( critical.value, size=30, prob=2/3 ). Its value – in this case 0.42 – depends on:
 The critical value (which depends on the choice of α).
 The number of coin-flips (here 30), i.e. the sample size.
 The bias of the coin under the alternative hypothesis (2/3), which is related to the effect size, i.e. the smallest difference in ‘fair’ and ‘loaded’ behaviour that we are interested in (2/3 – 1/2 = 0.1666…); the sketch below shows how β shrinks as this assumed bias grows.
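As a rough illustration of that last point, here is β recalculated for a few alternative hypotheses of increasingly loaded coins, holding α and the 30-flip sample size fixed (the particular biases are arbitrary):
# beta for the same 30-flip test against progressively more badly loaded coins:
# the larger the assumed bias, the smaller beta becomes
pbinom( critical.value, size=30, prob=c( 0.55, 0.60, 2/3, 0.75 ) )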
This is where the Neyman-Pearson approach diverges from Fisher’s. Fisher’s approach makes no mention of an alternative hypothesis, so it makes no attempt to quantify the false-negative rate. Neyman-Pearson sets up two competing hypotheses, and forces you to explicitly consider – and preferably control – β.
If the sample size and effect size are held constant, then decreasing α increases β, i.e. there is a trade-off between the false-positive rate and the false-negative rate: imagine sliding the vertical line left or right, and consider what would happen to the red and blue areas. For the decision between ‘fair’ vs. ‘2/3-loaded’, on a sample size of 30, β is 0.42: we are roughly eight times as likely to make a type II error as a type I error. It is very likely that we will wrongly say a coin is fair when in fact it is biased, and much more likely that we will make that mistake than that we will call a fair coin biased when it isn’t. A mere 30 coin-flips is therefore woefully underpowered to distinguish between P=1/2 and P=2/3.
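A minimal sketch of that trade-off, holding the sample size at 30 and the alternative hypothesis at a 2/3-loaded coin, and varying only α:
# stricter values of alpha push the critical value up, and beta (the false-negative
# rate against the 2/3-loaded coin) rises with it
alpha <- c( 0.20, 0.05, 0.01 )
crit <- qbinom( p=alpha, size=30, prob=0.5, lower.tail=FALSE )
beta <- pbinom( crit, size=30, prob=2/3 )
rbind( alpha, crit, beta )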
The power of a test, π, is:
π = 1 – β = P( reject H_{0} | H_{1} )
and is somewhat more convenient than β to remember:
 α = P(reject H_{0} | H_{0})
 π = P(reject H_{0} | H_{1})
What is a good power to aim for? Typically, we want an experiment to have a β of 0.20 (i.e. a power of 0.8). You might ask “why the asymmetry?” between choosing α=0.05 and β=0.20. There’s no particularly good general reason to want β to be larger than α. You might justify this on the basis that type II errors are less costly, so you don’t mind making them more often than type I errors (do you really want to replace all those coin dies?), but the real reason is usually more pragmatic: detecting even small effect sizes requires quite large sample sizes:
Coin distribution overlaps for increasing sample size
[To make R plot graphs on a grid like this, you want:
par( mfrow=c( 2, 2 ) )
which sets up a 2-by-2 grid: each call to plot then adds a new graph until the grid fills up.]
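Each panel in such a grid can be drawn directly from the two theoretical distributions; a rough sketch for a single panel (the sample size, colours and labels are just one way of doing it):
n <- 60    # sample size for this panel
k <- 0:n   # possible numbers of heads
crit <- qbinom( p=0.05, size=n, prob=0.5, lower.tail=FALSE )
# PMFs of the fair coin (grey) and the 2/3-loaded coin (red), with the critical value marked
plot( k, dbinom( k, size=n, prob=0.5 ), type="h", col="grey",
 xlab="Number of heads", ylab="Probability", main=paste( n, "flips" ) )
points( k, dbinom( k, size=n, prob=2/3 ), type="h", col="red" )
abline( v=crit )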
In the graphs, you can see that as we increase the sample size from 15 through to 120, the distributions get ‘tighter’, the overlap gets smaller, and the area representing β gets smaller (and so the power of the test gets higher). If we plot the value of β against the sample size, we get this:
We need 60-odd flips to detect a 2/3-biased coin with a β of 0.20, but we need nearly 100 flips for a β of 0.05. The weird scatteriness is because the binomial distribution is discrete.
The sample size needed to detect a 2/3-loaded coin is c. 60 flips for a β of 0.20 (a power of 0.80), and c. 100 flips for a β of 0.05 (a power of 0.95). If we have already decided on the value of α and the effect size of interest, then fixing the sample size is the only control we have over the type II error rate.
It is fairly rare for people to do power calculations of this sort before they start collecting data: more often than not, they simply collect a sample as large as they can afford. This means that – whilst type I errors are controlled at 0.05 – type II errors are often ignored, assumed (wrongly!) to be low, or – at best – calculated retrospectively. Retrospective calculations can give unpleasant surprises: as we saw above, 30 flips will result in quite a lot of quite badly loaded coins going undetected.
Ideally, you should calculate the sample size needed for a particular power before the experiment: that way, you will collect precisely the amount of data needed to give you the long-term error rates you think are acceptable (0.05 for α and 0.20 for β, or whatever). This brings us on to the last major confusion between Fisher and Neyman-Pearson testing:
 p is the probability of a test statistic being at least as extreme as the test statistic we actually got, assuming that the null hypothesis is true.
 α is the long-term probability of mistakenly rejecting the null hypothesis when it is actually true, if we were to repeat the experiment many times.
In Neyman-Pearson testing, we calculate a test statistic from our data (in this case just the number of heads), and compare that to the critical value of that test statistic. If the calculated value is larger than the critical value, we reject the null hypothesis. However, we can make exactly the same decision by calculating a p value from our data, and if it is less than α, rejecting the null hypothesis. Both of these approaches result in the same decision. The former used to be much more common because it was easy to obtain printed tables of critical values of test statistics like t and F for α=0.05. More recently, since computing power has made it far easier to calculate the p value directly, the latter approach has become more common.
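As a concrete sketch of that equivalence, suppose we had observed 22 heads in 30 flips (an arbitrary example):
observed <- 22
# decision by test statistic: is the observed count beyond the critical value?
observed > critical.value
# decision by p value: is P( at least 22 heads in 30 flips of a fair coin ) below alpha?
p.value <- pbinom( observed - 1, size=30, prob=0.5, lower.tail=FALSE )
p.value < 0.05
# both comparisons lead to the same decision: reject the null hypothesis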
Remember that in Fisher’s approach, a p value of 0.01 is ‘more significant’ than a p value of 0.05, and Fisher suggested (wrongly, IMHO) that this should give you more confidence in your rejection of the null. In Neyman-Pearson testing, if a p value is calculated at all, it is only to compare it to α. If you have done the power calculation, and collected the correct sample size, working out the p value is simply a means to an end: either it is larger than α (…accept null) or it is smaller than α (…reject null), but its value has no further importance. In Neyman-Pearson hypothesis testing, there is no such thing as “highly significant, p=0.01”. Unfortunately, a lot of people want to have their statistical cake and eat it: they don’t do a power calculation and collect a sample size that is expedient rather than justified (screw you Neyman-Pearson!), but they then want to interpret a very low p value (yay Fisher!) as being evidence for a particular alternative hypothesis (love you really, Neyman-Pearson!). This isn’t really very coherent, and many statisticians now prefer to sidestep the whole issue and go for confidence intervals or Bayesian approaches instead, which I leave to others to explain.
This brings us to the question – finally! – of how many coin-flips we should collect to distinguish a loaded coin from a fair one. There are several answers to this, depending on our constraints, but two are particularly important:
 We have limited money and time, so we can only do 30 flips. What is the minimum effect size we can reliably detect?
 We have unlimited money and time (ha ha ha!), so we can do as many flips as are necessary, but we’re only interested in coins that are biased to 2/3 heads or more. What sample size do we need to reliably detect a particular effect size of interest?
So, firstly, for the number of flips we can be bothered to make, what is the minimum effect size we can reliably (α=0.05, β=0.20) detect? For some common tests, R has built-in power calculations, but in general, the simplest way to work this out is by direct calculation. This requires a little code. In this case, it’s only actually about 12 lines, but I’ve commented it heavily as it’s a little tricky:
# Set up empty vectors to contain numeric data. sample.size will be filled in
# with the minimum sample size needed to distinguish a fair coin from a
# loaded coin of various degrees of bias. effect.size will be filled in
# with the effect size for each loaded coin
sample.size <- vector('numeric')
effect.size <- vector('numeric')
# We'll try out numbers of flips between 0 and 200
size <- c(0:200)
# Loop through some values of 'loaded' from 19/32 to 1. We don't go any
# lower than this because even 18/32 requires more than 200 flips!
for ( loaded in seq( from=19/32, to=32/32, by=1/32 ) ) {
# since 'size' is a vector, qbinom will return a vector of 'crit' values
# for all trial sizes between 0 and 200 coinflips; and pbinom will then
# return the beta values for each ('crit', 'size') pair.
crit <- qbinom( p=0.05, size=size, prob=0.5, lower.tail=FALSE )
beta <- pbinom( crit, size=size, prob=loaded )
# 'which' returns all the indices into the beta vector for which
# the beta value is <= 0.20. Since these were put in in 'size'
# order (beta item k is the value for a trial of k-1 flips, because 'size'
# starts at 0), using 'min' to return the smallest of these indices
# gives us (to within one flip, which hardly matters here) the smallest
# sample size required to reliably detect a bias in the coin of 'loaded'
smallest.needed <- min( which( beta <= 0.20 ) )
# Append the value we've just worked out to the sample.size vector
sample.size <- c( sample.size, smallest.needed )
# Effect size is the degree of bias minus the P for a fair coin (0.5)
effect.size <- c( effect.size, loaded - 0.5 )
}
# Which is the first index into sample.size for which the value is 30 or less?
index.for.thirty <- min( which( sample.size <= 30 ) )
# Since effect.size and sample.size were made in parallel, using this index
# into effect.size will give the minimum effect size that could be detected
# on 30 coin flips
effect.size[ index.for.thirty ]
The answer is that 30 flips is enough to reliably detect a bias only if the effect size is at least 0.25, i.e. to distinguish between a fair coin and a coin that is biased to give 3/4 heads. Does this surprise you? You need quite a lot of flips to distinguish even a large bias.
[NB – we made rather coarse steps of 1/32 in the loop that generated this data. If you do smaller steps of 1/64, or 1/128, you’ll find the real value is nearer 0.23]
A graph of effect size against sample size for (α=0.05, β=0.20) is useful in answering this question and the next:
Effect size and sample size are inversely related: a small sample size can only detect a large bias; a large sample size is needed to detect a small bias
Secondly, what is the minimum sample size we need to reliably (α=0.05, β=0.20) detect an effect size of interest to us?
The same code as above generates the necessary data, but now we want:
index.for.twothirds <- min( which( effect.size >= 2/3 - 1/2 ) )
sample.size[ index.for.twothirds ]
At least 45 flips are needed to distinguish a fair coin from a coin biased to give 2/3 heads. [Again, bear in mind the coarseness of the loop: the real value is closer to 59].
All this may sound like an awful lot of bother merely to be told that your sample size is woefully inadequate, or that you are going to miss lots of biased coins. Fortunately, R makes it easy to obtain this unwelcome information, at least for common tests like t and ANOVA, and easy-ish for some other tests, through the pwr package, which you can install from CRAN with:
install.packages( "pwr" )
…load into a script with:
library(pwr)
…and find information about usage with:
??power
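For the coin example itself, the relevant function is pwr.p.test, which works on a transformed effect size produced by ES.h; note that it uses a normal approximation, so its answer will differ somewhat from the exact binomial calculation above (a sketch):
library( pwr )
# sample size needed to detect a 2/3-loaded coin with a one-sided test,
# alpha = 0.05 and power = 0.80; ES.h converts the two proportions into the
# effect size measure that pwr expects
pwr.p.test( h = ES.h( 2/3, 0.5 ), sig.level = 0.05, power = 0.80, alternative = "greater" )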
Exercises
 What sample size is needed to distinguish a fair coin from a slightly heads-biased coin (P=0.55) with α=0.05, and a power of 0.80? What is the critical number of heads that would allow you to reject the null?
Answers
 A power of 0.80 means β=0.20. The same code we used for the effect size vs. sample size calculation can be modified to give us the data we want. The paste() function allows us to put the values of variables into strings for use in print() statements or as axis titles, etc. Note that the proportion of heads required to convince us of P=0.55 rather than P=0.50 is actually only 330/621 ≈ 0.53, because the sample size is so large.
size <- c(0:1000)
crit <- qbinom( p=0.05, size=size, prob=0.50, lower.tail=FALSE )
beta <- pbinom( crit, size=size, prob=0.55 )
n <- min( which( beta <= 0.20 ) )
print( paste( "Critical value is", crit[n], "heads" ) )
print( paste( "Sample size required is", n ) )
"Critical value is 330 heads"
"Sample size required is 621"
Next up… The F-test.