A χ^{2} test is used to measure the discrepancy between the observed and expected values of count data.
- The dependent data must – by definition – be count data.
- If there are independent variables, they must be categorical.
The test statistic derived from the two data sets is called χ^{2}, and it is defined as the square of the discrepancy between the observed and expected value of a count variable divided by the expected value.
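For instance, each count contributes (O − E)²/E to the statistic; a toy calculation with invented numbers:

```r
# One cell's contribution to chi-squared: (observed - expected)^2 / expected
observed <- 60   # invented count
expected <- 50   # invented expectation
( observed - expected )^2 / expected   # gives 2
```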
The reference distribution for the χ^{2} test is Pearson’s χ^{2}. This reference distribution has a single parameter: the number of degrees of freedom remaining in the data set.
A χ^{2} test compares the χ^{2} statistic from your empirical data with the Pearson’s χ^{2} value you’d expect under the null hypothesis given the degrees of freedom in the data set. The p value of the test is the probability of obtaining a test χ^{2} statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis (“there is no discrepancy between the observed and expected values”) is true. i.e. The p value is the probability of observing your data (or something more extreme), if the data do not truly differ from your expectation.
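This tail probability can be read off the reference distribution directly with R’s pchisq(); for example, the familiar 5% critical value for one degree of freedom is about 3.84:

```r
# Upper-tail probability of Pearson's chi-squared distribution:
# the p value for a test statistic of 3.84 with 1 degree of freedom
pchisq( 3.84, df=1, lower.tail=FALSE )   # approximately 0.05
```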
The comparison is only valid if the data are:
- Representative of the larger population, i.e. the counts are sampled in an unbiased way.
- Of sufficiently large sample size. In general, expected counts less than 5 may make the χ^{2} approximation unreliable and the test result misleading. For 2×2 tables, R automatically applies Yates’ continuity correction, and chisq.test() will warn you if any expected count is below 5, i.e. if it thinks you’re sailing too close to the wind.
- Independent.
Do not use a χ^{2} test unless these assumptions are met. Fisher’s exact test, fisher.test(), may be more suitable if the data set is small.
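As a sketch of that alternative, here is fisher.test() applied to a hypothetical 2×2 table far too small for chisq.test() (the counts are invented):

```r
# Fisher's exact test on a (hypothetical) small 2x2 table,
# where expected counts would be too low for a chi-squared test
small.table <- matrix( c( 3, 1, 1, 3 ), nrow=2 )
fisher.test( small.table )
```

With counts this small the exact test gives a valid p value where the χ^{2} approximation would not.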
In R, a χ^{2} test is performed using chisq.test(). This acts on a contingency table, so the first thing you need to do is construct one from your raw data. The file tit_distribution.csv contains counts of the total number of birds (the great tit, Parus major, and the blue tit, Cyanistes caeruleus) at different layers of a canopy over a period of one day.
tit.distribution <- read.csv( "H:/R/tit_distribution.csv" )
print( tit.distribution )
This will spit out all 706 observations: remember that the raw data you import into R should have a row for each ‘individual’; here, each individual is a “this bird in that layer” observation. You can see just the start of the data using head():

head( tit.distribution )
     Bird  Layer
1 Bluetit Ground
2 Bluetit Ground
3 Bluetit Ground
4 Bluetit Ground
5 Bluetit Ground
6 Bluetit Ground
and look at a summary of the data frame object with str():

str( tit.distribution )
'data.frame': 706 obs. of 2 variables:
 $ Bird : Factor w/ 2 levels "Bluetit","Greattit": 1 1 1 1 1 1 1 1 1 1 ...
 $ Layer: Factor w/ 3 levels "Ground","Shrub",..: 1 1 1 1 1 1 1 1 1 1 ...
To create a contingency table, use table():

tit.table <- table( tit.distribution$Bird, tit.distribution$Layer )
tit.table
           Ground Shrub Tree
  Bluetit      52    72  178
  Greattit     93   247   64
If you already had a table of the count data, and didn’t fancy making the raw data CSV file from it, just to have to turn it back into a contingency table anyway, you could construct the table manually using matrix():

tit.table <- matrix( c( 52, 72, 178, 93, 247, 64 ), nrow=2, byrow=TRUE )
# nrow=2 means cut the vector into two rows
# byrow=TRUE means fill the data in horizontally (row-wise)
# rather than vertically (column-wise)
tit.table
     [,1] [,2] [,3]
[1,]   52   72  178
[2,]   93  247   64
The matrix can be prettified with labels (if you wish) using dimnames(), which expects a list() of two vectors: the first contains the row names, the second the column names:

dimnames( tit.table ) <- list( c( "Bluetit", "Greattit" ), c( "Ground", "Shrub", "Tree" ) )
tit.table
           Ground Shrub Tree
  Bluetit      52    72  178
  Greattit     93   247   64
To see whether the observed values (above) differ from the expected values, you need to know what those expected values are. For a simple homogeneity χ^{2} test, each expected value is calculated from the corresponding column (C), row (R) and grand (N) totals as E = R×C/N:
|               | Ground                | Shrub                   | Tree                    | Row totals |
|---------------|-----------------------|-------------------------|-------------------------|------------|
| Blue tit O    | 52                    | 72                      | 178                     | 302        |
| E             | 302×145/706 = 62.0    | 302×319/706 = 136.5     | 302×242/706 = 103.5     |            |
| χ^{2}         | (52−62.0)^{2}/62.0 = 1.6 | (72−136.5)^{2}/136.5 = 30.5 | (178−103.5)^{2}/103.5 = 53.6 |  |
| Great tit O   | 93                    | 247                     | 64                      | 404        |
| E             | 404×145/706 = 83.0    | 404×319/706 = 182.5     | 404×242/706 = 138.5     |            |
| χ^{2}         | (93−83.0)^{2}/83.0 = 1.2 | (247−182.5)^{2}/182.5 = 22.7 | (64−138.5)^{2}/138.5 = 40.1 |  |
| Column totals | 145                   | 319                     | 242                     | 706        |
The individual χ^{2} values show the discrepancies for each of the six individual cells of the table. Their sum is the overall χ^{2} for the data, which is 149.7. R does all this leg-work for you, with the same result:
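The hand calculation above can be reproduced in a few lines of R (the intermediate object names here are mine):

```r
# Expected counts are (row total x column total) / grand total for each cell
tit.table <- matrix( c( 52, 72, 178, 93, 247, 64 ), nrow=2, byrow=TRUE )
expected  <- outer( rowSums( tit.table ), colSums( tit.table ) ) / sum( tit.table )
cell.chisq <- ( tit.table - expected )^2 / expected   # six individual contributions
sum( cell.chisq )                                     # 149.69, the overall chi-squared
```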
chisq.test( tit.table )
	Pearson's Chi-squared test

data:  tit.table
X-squared = 149.6866, df = 2, p-value < 2.2e-16
The two species’ distributions differ significantly from homogeneity, i.e. there are many more blue tits in the trees, and many more great tits in the shrub layer, than you would expect from the overall distribution of birds.
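The object returned by chisq.test() also stores the expected counts and the Pearson residuals, (O − E)/√E, which show which cells drive the significance; a quick look (the object name result is mine):

```r
# Inspect which cells contribute most to the chi-squared statistic
tit.table <- matrix( c( 52, 72, 178, 93, 247, 64 ), nrow=2, byrow=TRUE,
                     dimnames=list( c( "Bluetit", "Greattit" ),
                                    c( "Ground", "Shrub", "Tree" ) ) )
result <- chisq.test( tit.table )
result$expected    # expected counts under homogeneity
result$residuals   # Pearson residuals; large positive values mean an excess of birds
```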
Sometimes, the expected values are known, or can be calculated from a model. For example, if you have 164 observations of progeny from a dihybrid selfing genetic cross, where you expect a 9:3:3:1 ratio, you’d perform a χ^{2} test manually like this:
|       | A- B-                      | A- bb                      | aa B-                      | aa bb                      |
|-------|----------------------------|----------------------------|----------------------------|----------------------------|
| O     | 94                         | 33                         | 28                         | 9                          |
| E     | 164×9/16 = 92.25           | 164×3/16 = 30.75           | 164×3/16 = 30.75           | 164×1/16 = 10.25           |
| χ^{2} | (94−92.25)^{2}/92.25 = 0.033 | (33−30.75)^{2}/30.75 = 0.165 | (28−30.75)^{2}/30.75 = 0.246 | (9−10.25)^{2}/10.25 = 0.152 |
This gives a total χ^{2} of 0.596. To do the equivalent in R, supply chisq.test() with a second, named parameter called p, which is a vector of expected probabilities:
dihybrid.table <- matrix( c( 94, 33, 28, 9 ), nrow=1, byrow=TRUE )
dimnames( dihybrid.table ) <- list( c( "Counts" ), c( "A-B-", "A-bb", "aaB-", "aabb" ) )
dihybrid.table
       A-B- A-bb aaB- aabb
Counts   94   33   28    9
null.probs <- c( 9/16, 3/16, 3/16, 1/16 )
chisq.test( dihybrid.table, p=null.probs )
	Chi-squared test for given probabilities

data:  dihybrid.table
X-squared = 0.5962, df = 3, p-value = 0.8973
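Both numbers can be checked by hand; a short sketch (the variable names are mine):

```r
# Recompute the chi-squared statistic and its p value from first principles
observed <- c( 94, 33, 28, 9 )
expected <- 164 * c( 9, 3, 3, 1 ) / 16        # 9:3:3:1 ratio scaled to 164 progeny
chisq    <- sum( ( observed - expected )^2 / expected )
chisq                                          # 0.596, as in the table above
pchisq( chisq, df=3, lower.tail=FALSE )        # df = 4 categories - 1
```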
The data are not significantly different from a 9:3:3:1 ratio, so the A and B loci appear to be unlinked and non-interacting, i.e. they are inherited in a Mendelian fashion.
The most natural way to plot count data is as a bar chart, using barplot():

barplot( dihybrid.table, xlab="Genotype", ylab="N", main="Dihybrid cross" )
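For a two-way table like the tit data, one possible sketch is a grouped bar chart, using barplot()’s beside=TRUE argument (the labels are my own choices):

```r
# Grouped bar chart of the tit contingency table:
# beside=TRUE draws the two species' bars side by side within each layer
tit.table <- matrix( c( 52, 72, 178, 93, 247, 64 ), nrow=2, byrow=TRUE,
                     dimnames=list( c( "Bluetit", "Greattit" ),
                                    c( "Ground", "Shrub", "Tree" ) ) )
barplot( tit.table, beside=TRUE, legend.text=TRUE,
         xlab="Canopy layer", ylab="N", main="Tit distribution by layer" )
```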
Exercises
Use the χ^{2} test to investigate the following data sets.
- Clover plants can produce cyanide in their leaves if they possess a particular gene. This is thought to deter herbivores. Clover seedlings of the CN^{+} (cyanide producing) and CN^{−} (cyanide free) phenotypes were planted out and the amount of rabbit nibbling to leaves was measured after 48 hr. Leaves with >10% nibbling were scored as ‘nibbled’, those with less were scored as ‘un-nibbled’. Do the data support the idea that cyanide reduces herbivore damage?
|        | Nibbled | Un-nibbled |
|--------|---------|------------|
| CN^{+} | 26      | 74         |
| CN^{−} | 34      | 93         |
- In a dihybrid selfing cross between maize plants heterozygous for the A/a (A is responsible for anthocyanin production) and Pr/pr (Pr is responsible for modification of anthocyanins from red to purple) loci, we expect an F_{2} ratio of 9 A− Pr−: 3 A− pr pr: 3 a a Pr− : 1 a a pr pr. The interaction between the loci results in the a a Pr− and a a pr pr individuals being indistinguishable in the colour of their kernels. The file maize_kernels.csv contains a tally of kernel colours. Do the data support the gene-interaction model?
Answers
- As the data are already in a table, it is easiest to construct the contingency table directly as a matrix. The data do not support the hypothesis that the two phenotypes differ in the damage they receive from rabbit nibbling:
clover.table <- matrix( c( 26, 74, 34, 93 ), nrow=2, byrow=TRUE )
dimnames( clover.table ) <- list( c( "CN.plus", "CN.minus" ), c( "Nibbled", "Un.nibbled" ) )
clover.table
         Nibbled Un.nibbled
CN.plus       26         74
CN.minus      34         93
chisq.test( clover.table )
	Pearson's Chi-squared test with Yates' continuity correction

data:  clover.table
X-squared = 0, df = 1, p-value = 1
- The data are in a file, so we construct a simple contingency table from it and then test against the expected frequencies of 9 purple : 3 red : 4 colourless. Make sure you get them in the right order! The data support the model: the χ^{2} statistic has a p value greater than 0.05, i.e. we can accept that the data are consistent with a 9:3:4 ratio.
maize.kernels <- read.csv( "H:/R/maize_kernels.csv" )
head( maize.kernels )
      Kernel
1        Red
2 Colourless
3 Colourless
4 Colourless
5     Purple
6 Colourless
maize.table <- table( maize.kernels$Kernel )
maize.table
Colourless     Purple        Red
       229        485        160
chisq.test( maize.table, p=c( 4/16, 9/16, 3/16 ) )
	Chi-squared test for given probabilities

data:  maize.table
X-squared = 0.6855, df = 2, p-value = 0.7098
Next up… One-way ANOVA.