Inference using chi-squared tests

This last lecture in the course deals with some essential points about using chi-squared tests in the analysis of categorical data. In all cases, it is not the classification that is tested, but the distribution of observed frequencies: the observed distribution is compared with the expected frequency distribution derived from an appropriate null hypothesis. As already shown, there are two types of research question involving frequency data. The procedures differ in some respects but share a common framework that we have already encountered in other statistical procedures: Data, Assumptions, Comparison, Inference. First I shall outline the common features, and then deal with the specific details.

You are not expected to do the actual chi-squared calculations for this section of the course. What is required is to know how to interpret the results based on a clear understanding of topics dealt with earlier:

  1. the purpose of the analysis, which determines the null hypothesis and degrees of freedom
  2. observed and expected frequencies
  3. percentages
  4. classification and counting

General

The following points always apply. Refer to the textbook for more details.

Data
A table of frequencies is obtained by sorting the data and counting sub-totals.
Assumptions
  1. Classification should be mutually exclusive and exhaustive. The observations should be independent: No person should be counted twice. This is not the same as classifying a person on more than one dimension (see contingency tables below). Lack of independence may also occur in more subtle ways. For example, the distribution of males and females in a particular situation may be determined by a quota system.
  2. A null hypothesis. This depends on the type of problem (see below).
  3. The level of significance as determined by convention, usually alpha = .05.
Compare
The observed distribution of frequencies is compared with an expected distribution based on a null hypothesis. The amount of discrepancy is indicated by:
The calculated value of chi-squared
The formula for calculating chi-squared is always the same: For each ijth cell in the table, you find the difference between the observed frequency (O_{ij}) and the expected frequency (E_{ij}). This value is then squared and divided by the expected frequency. You then sum these values over the entire table to obtain the result:
                \chi^2 = \sum_{i} \sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
The critical value.
The calculated value of chi-square is compared to the critical value for a particular significance level and degrees of freedom.
Decide
The result is significant if the calculated value exceeds the critical value. Note that it is a non-directional test: because the differences are squared, chi-squared gets bigger whether the observed frequencies are greater or less than the expected frequencies.
Infer
A significant result indicates that the population distribution is not as expected in terms of the null hypothesis.
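
As a rough illustration of the Compare and Decide steps, the calculation can be sketched in Python. This is only a sketch: the function names and the example counts are illustrative assumptions and not part of the course material, and the critical values are those found in standard chi-squared tables for alpha = .05.

```python
# Critical values of chi-squared at alpha = .05 (from standard tables),
# keyed by degrees of freedom.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def chi_squared(observed, expected):
    """Sum (O - E)^2 / E over every cell of two equally shaped tables."""
    total = 0.0
    for obs_row, exp_row in zip(observed, expected):
        for o, e in zip(obs_row, exp_row):
            total += (o - e) ** 2 / e
    return total


def is_significant(statistic, df, critical=CRITICAL_05):
    """Decide: significant only if the critical value is exceeded."""
    return statistic > critical[df]


# Hypothetical counts (an assumption for illustration only).
expected = [[50, 50]]
low_first = chi_squared([[30, 70]], expected)
high_first = chi_squared([[70, 30]], expected)

# The test is non-directional: both discrepancies give the same value.
print(low_first, high_first)
print(is_significant(low_first, df=1))
```

Swapping the two observed counts leaves the statistic unchanged, which is what "non-directional" means here.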

Goodness-of-fit

A test on the distribution of frequencies in a one-way classification.

Null Hypothesis

Specified by you on the basis of some prior information. This provides the expected frequencies. For example, in the case of handedness it was 1/10 "left" versus 9/10 "right". If no prior information is available you assume the distribution would be random (equally probable).

Degrees of freedom

Always one less than the number of categories (k): df = k - 1. For example, with two categories (k = 2), if you have 55 females in a sample of 100 people, then there must be 45 males because the categories are mutually exclusive and exhaustive. Only one frequency is free to vary, so df = 1.

Inference

If the result is significant, you conclude that the observed distribution is not as expected. If it is not significant, you conclude that the observed distribution does not differ significantly from the expected distribution, i.e. it is a "good fit".
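
A goodness-of-fit test along these lines might be sketched as follows. The 1/10 versus 9/10 null hypothesis is the handedness example mentioned above, but the observed counts are hypothetical (the actual data are not reproduced in this lecture), and the function name is an illustrative assumption.

```python
def goodness_of_fit(observed, proportions):
    """One-way chi-squared test against hypothesised proportions.

    Returns the expected frequencies, the chi-squared value and df = k - 1.
    """
    n = sum(observed)
    expected = [n * p for p in proportions]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return expected, statistic, df


# Hypothetical sample: 30 left-handers and 170 right-handers.
observed = [30, 170]
expected, statistic, df = goodness_of_fit(observed, [1 / 10, 9 / 10])

CRITICAL_05_DF1 = 3.841  # standard table value for alpha = .05, df = 1
print(expected, round(statistic, 2), df)
print("significant" if statistic > CRITICAL_05_DF1 else "good fit")
```

With no prior information, the same function could be called with equal proportions, e.g. `[1 / 2, 1 / 2]`.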

Contingency tables

Also known as two-way tables, because there are two dimensions of classification. The purpose is to detect an association. That is what contingency refers to: When two sets of attributes are contingent then they are associated. Cross-classification and inspection of the frequency distribution will reveal if attributes tend to coincide. Conversely, independence implies no contingency, lack of association.

Null hypothesis

Mutual independence. If this is true, then the distribution of cell frequencies will reflect the marginal row and column totals.
Expected frequency:
For each cell in the table: E_{ij} = (Row_i x Col_j) / T, where Row_i is the ith row marginal total, Col_j is the jth column marginal total and T is the grand total. In the rape conviction example, if the null hypothesis were true, then for any row a similar distribution of cell frequencies should be found (irrespective of attributions about the victim's blame). The marginal total was 258 guilty versus 100 not guilty verdicts, which is roughly 3:1 (more precisely, about 2.6:1). If the null hypothesis were true, then at each level of blame we should expect approximately the same 3:1 distribution of guilty versus not guilty verdicts. The same principle applies for any column:

Table of Expected frequencies
                   Blame
Verdict        High     Low    Total
Guilty        130.4   127.6      258
Not Guilty     50.6    49.4      100
Total         181     177        358

(note that the expected frequencies by definition give the same row and column totals as the observed frequencies).
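
The expected frequencies depend only on the marginal totals, which a short Python sketch can make concrete. The marginal totals are those of the rape conviction example; the function name is an illustrative assumption.

```python
def expected_frequencies(row_totals, col_totals):
    """E_ij = (Row_i x Col_j) / T for every cell of the table."""
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


row_totals = [258, 100]  # Guilty, Not guilty
col_totals = [181, 177]  # High blame, Low blame

expected = expected_frequencies(row_totals, col_totals)
for label, row in zip(["Guilty", "Not guilty"], expected):
    print(label, [round(e, 1) for e in row])

# Under the null hypothesis the guilty : not guilty ratio is the same
# at each level of blame, and equal to the ratio of the row totals.
for j in range(len(col_totals)):
    print(round(expected[0][j] / expected[1][j], 2))
```

Summing the computed cells across any row or down any column returns the corresponding marginal total.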

Degrees of freedom:
df = (R - 1) x (C - 1) where R is the number of rows in the table and C is the number of columns in the table. Given the total for any row all but one of the cell frequencies are free to vary, and for any column the same applies.

You can see this in the rape conviction example where df = 1. Suppose all cell frequencies were unknown except the number of "not guilty" verdicts for "low blame" victims (i.e. 24). Given the total of 100 "not guilty" cases, there must be 76 in the "high blame" cell. Note that the total frequencies are irrelevant to the truth or falsity of the null hypothesis.
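
The meaning of df = 1 can be sketched in Python: once the marginal totals are fixed, a single known cell determines the other three. The marginal totals and the value 24 come from the example above; the function name is an illustrative assumption.

```python
def complete_2x2(row_totals, col_totals, known_cell, known_value):
    """Fill a 2 x 2 table from its marginal totals and one known cell."""
    i, j = known_cell
    table = [[None, None], [None, None]]
    table[i][j] = known_value
    # Same row: the row total fixes the neighbouring cell.
    table[i][1 - j] = row_totals[i] - known_value
    # Same column: the column total fixes the cell in the other row.
    table[1 - i][j] = col_totals[j] - known_value
    # The last cell follows from the other row total.
    table[1 - i][1 - j] = row_totals[1 - i] - table[1 - i][j]
    return table


row_totals = [258, 100]  # Guilty, Not guilty
col_totals = [181, 177]  # High blame, Low blame

# Only the "not guilty" / "low blame" cell (row 1, column 1) is known.
table = complete_2x2(row_totals, col_totals, known_cell=(1, 1), known_value=24)
df = (len(row_totals) - 1) * (len(col_totals) - 1)
print(table, df)
```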

Significance

If the chi-squared test is significant then there must be some association or contingency. In the rape prosecution example the calculated value of chi-square was 35.93 and the result was significant. The conclusion is that verdict is contingent on blame. Inspection of the data table shows a distribution of approximately six guilty to one not guilty verdict when there is low blame, but less than two to one when there is high blame. Differences between observed and expected frequencies are quite large. This is most clearly seen in the table of calculated chi-squared values for each cell (remember that the calculated chi-squared value is the sum total of chi-squared for each cell):

Table of calculated chi-squared values
                   Blame
Verdict        High     Low
Guilty         4.96    5.07
Not Guilty    12.80   13.09

The large chi-squared values in the "not guilty" row indicate that the largest discrepancies were found there. Comparison between the observed and expected frequencies for these cells reveals more "not guilty" verdicts observed than expected for high blame, and fewer "not guilty" verdicts than expected for low blame.
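
The per-cell values can be reproduced with a short Python sketch. Note the assumption: the observed table is not printed in this lecture, so the counts used here are reconstructed from the marginal totals (258, 100, 181, 177) and the 24 "not guilty" verdicts for low blame mentioned earlier. The function name is illustrative.

```python
def cell_contributions(observed):
    """Return (O - E)^2 / E for each cell of a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    result = []
    for i, row in enumerate(observed):
        cells = []
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total
            cells.append((o - e) ** 2 / e)
        result.append(cells)
    return result


# Reconstructed observed frequencies (see the assumption above).
observed = [
    [105, 153],  # Guilty:     High blame, Low blame
    [76, 24],    # Not guilty: High blame, Low blame
]

contributions = cell_contributions(observed)
total = sum(sum(row) for row in contributions)
for row in contributions:
    print([round(c, 2) for c in row])
print(round(total, 2))
```

Inspecting the contributions cell by cell, not just the total, is what shows where the departure from independence lies.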

This example assumes a causal relation between attributions of blame and resulting verdict. That assumption is supported by the experimental design of the study. Clearly attribution of blame had an effect.

Another example

The distribution of male and female students in the PSY307F class has already been noted. Data were obtained from class lists for 2002, as shown in the previous lecture. Inspection of the table showed declining numbers of males as students progress from 2nd year through to Honours. This was more evident from the percentages. Percentages were calculated on the assumption that the independent variable is the level of study. To test whether the distribution of male versus female students depends on the level of study, the chi-square test for a contingency table will be applied. Given the null hypothesis of mutual independence, the expected frequencies for these data are:

Table of Expected Frequencies
          PSY206F  PSY307F  PSY400W
Male         74.7     48.9      6.3
Female      244.3    160.1     20.7

Table of calculated chi-squared values
          PSY206F  PSY307F  PSY400W
Male         3.12     3.43     0.85
Female       0.96     1.05     0.26

The calculated total chi-squared value was 9.67. With df = 2 the critical value is 5.99, so the null hypothesis is rejected. It seems that the distribution of male versus female students depends on the level of study. Examination of the table above shows large chi-squared values for males in PSY206F and PSY307F. Comparing observed and expected frequencies for these cells reveals that, relative to the null hypothesis, there were "too many" males in PSY206F and "too few" in PSY307F. This could be interpreted as a "bulge" in the number of males in PSY206F.

Note that the test does not indicate anything about the decline in numbers across levels of study, nor about the disproportion of males versus females: It only indicates that the disproportion is not the same across levels of study.

The data for 2000 are given below for comparison:

Table of Observed frequencies in 2000
          PSY206F  PSY307F  PSY400W  Total
Male           58       23        4     85
Female        225      114       30    369
Total         283      137       34    454

Table of Percentages in 2000
          PSY206F  PSY307F  PSY400W
Male           20       17       12
Female         80       83       88
Total         100      100      100

The calculated value of chi-square for these data is 1.99, with two degrees of freedom. This is less than the critical value (5.99), so we cannot reject the null hypothesis of independence: these data give no grounds to claim that the distribution of male versus female students depends on the level of study. Although there appears to be a pattern, we have insufficient evidence to claim that it is statistically significant.
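
For readers who want to check the arithmetic, the test on the 2000 data can be sketched in Python using the observed frequencies tabulated above. The function name is an illustrative assumption. Computed without intermediate rounding, the statistic comes out at roughly 2.0, marginally above the 1.99 reported here; the small gap presumably reflects rounding in the original calculation and does not affect the decision.

```python
def chi_squared_contingency(observed):
    """Chi-squared value and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total
            statistic += (o - e) ** 2 / e
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


# Observed frequencies in 2000: PSY206F, PSY307F, PSY400W.
observed_2000 = [
    [58, 23, 4],     # Male
    [225, 114, 30],  # Female
]

statistic, df = chi_squared_contingency(observed_2000)
CRITICAL_05_DF2 = 5.991  # standard table value for alpha = .05, df = 2
print(round(statistic, 2), df, statistic > CRITICAL_05_DF2)
```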

P.S. In 2001 the same calculations were done, and the opposite pattern appeared in the observed frequencies, i.e. there appeared to be an increasing number of males. However, the chi-square test again was not significant. We do not have strong evidence that the proportion of males versus females depends on the level of study in psychology.

Sample size, number of categories, and expected frequencies

Howell cautions against using chi-squared tests when you have expected frequencies that are less than five. This is a rule-of-thumb, but quite important because it sets a lower limit on the sample size. You can see that a contingency table can have very many cells. With small samples that inevitably leads to problems with small expected frequencies. On the other hand, using large numbers of categories increases the degrees of freedom, which is generally a good thing. But if you inspect the table of critical values for chi-square, it will be obvious that the critical value increases with degrees of freedom: It becomes more difficult to reject the null hypothesis as the critical value increases, which reduces power. So there is a tradeoff to be considered when planning the research. Increasing the number of categories will yield more information, but this may reduce the power unless there are also enough heads to be counted (sample size). Fortunately, counting is not very costly. Categorical data often come cheaply, and relatively large samples can be obtained quickly.
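
A quick check of the rule-of-thumb might look like the following Python sketch. The first table is the 2000 class data from this lecture; the second is a hypothetical class of about half the size, invented purely for illustration. The function names are assumptions.

```python
def expected_frequencies(observed):
    """Expected cell frequencies under independence for a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def cells_below(expected, minimum=5):
    """Return (row, column) positions whose expected frequency is too small."""
    return [
        (i, j)
        for i, row in enumerate(expected)
        for j, e in enumerate(row)
        if e < minimum
    ]


observed_2000 = [[58, 23, 4], [225, 114, 30]]      # from the lecture
hypothetical_small = [[29, 12, 2], [112, 57, 15]]  # invented, about half size

print(cells_below(expected_frequencies(observed_2000)))
print(cells_below(expected_frequencies(hypothetical_small)))
```

In the smaller hypothetical sample the male PSY400W cell falls below five, which is the situation the rule-of-thumb warns about; collecting more cases or combining categories are the usual remedies.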

Another consideration in this tradeoff is the fact that frequency distributions across multiple categories can be very difficult to interpret. Clarity and economy can be gained by reducing the numbers of categories. Simplicity is generally a good thing, as Popper's quote in my earlier lecture would suggest.



Copyright 2002, University of Cape Town