On the other hand, reduced degrees of freedom and the consequent loss of power pose a new threat to conclusion validity in the guise of increased risk of a Type II error. To understand why, a close look at the notion of power is required.
In the diagram the sampling distribution of the mean under the null hypothesis is shown in red, and the sampling distribution of the mean when the alternate hypothesis is true is shown in blue. For simplicity, a one-tailed test is illustrated.
Consider the situation on the left. If your sample mean (in green) falls in the rejection region, then by the rules you must reject the null hypothesis. The decision follows the rules, but in this situation it is in reality a Type I error: random sampling error has produced a result that is pure noise, yet it looks indistinguishable from the situation on the right, in which the alternate hypothesis is true and there is no error. Fortunately, we ourselves formulate the null hypothesis and also set the level of alpha according to our research aims, and therefore can manage that risk effectively. The four possible outcomes and their probabilities are:

| | H0 true | H0 false |
|---|---|---|
| Reject H0 | Type I error (alpha) | Correct rejection (power, 1 - beta) |
| Retain H0 | Correct retention (1 - alpha) | Type II error (beta) |
Power refers to the probability that the outcome of your research will be in the top right-hand corner of the figure, where you correctly reject H0. Thus power counteracts the threat of a Type I error: the inevitable (albeit manageable) risk that a decision to reject the null hypothesis could be wrong is countered by the power to ensure that it will be right. Unfortunately, this power depends partly on the true state of nature and is thus not entirely within your control. That fact is indicated in the table above by the probabilities, which add to 1.0 downwards but not across. In plain words, the probability of a Type I error (alpha) and the probability of a Type II error (beta) are conditional on different states of nature, so one is not simply 1 minus the other. For example, you could have a high alpha (risk of a Type I error) and at the same time very high power, which is ideal for exploratory research where you might need to discern weak signals amidst much noise. On the other hand, you could have high alpha and low power, the worst scenario, in which there is little control over threats to conclusion validity.
This is why the table has two independent dimensions. Whatever decision you make is not going to affect reality (except perhaps in the long term), and whatever the state of nature is will not directly affect your decision.
The other risk is that of a Type II error (its probability is denoted β, or beta). Again, this is partly beyond our control: our null hypothesis might really be false, but the effect we seek may be too subtle to be detected, so the null hypothesis is mistakenly retained. The result would be written off as inconclusive. However, with sufficient power even very subtle effects can be detected. Therefore, if you can show that despite having to retain the null hypothesis your research had good power to detect an effect if there was one, then you can further argue that your result was not inconclusive. So power is the complement of β, in other words 1 - β.
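The relation between β and power can be made concrete with a small calculation. The following is only a minimal sketch, assuming an upper-tailed one-sample z-test with a known population standard deviation; the function name `one_tailed_z_power` and the numbers (a null mean of 100, an alternate mean of 105, sigma of 15, n of 25) are hypothetical and chosen purely for illustration.

```python
from statistics import NormalDist

def one_tailed_z_power(mu0, mu1, sigma, n, alpha=0.05):
    """Return (critical_mean, beta, power) for an upper-tailed one-sample z-test."""
    se = sigma / n ** 0.5                          # standard error of the mean
    null_dist = NormalDist(mu0, se)                # sampling distribution if H0 is true
    alt_dist = NormalDist(mu1, se)                 # sampling distribution if HA is true
    critical_mean = null_dist.inv_cdf(1 - alpha)   # rejection region lies above this value
    beta = alt_dist.cdf(critical_mean)             # HA true, yet the mean falls below it
    return critical_mean, beta, 1 - beta

# Hypothetical example values, not taken from the text.
crit, beta, power = one_tailed_z_power(mu0=100, mu1=105, sigma=15, n=25)
print(f"critical mean = {crit:.2f}, beta = {beta:.3f}, power = {power:.3f}")
```

Both β and power are areas under the same (alternate) distribution, split at the critical value, which is why they must sum to 1.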
With respect to planning, the key is that power, alpha, effect size, and sample size are linked: given any three of these, the fourth can be calculated. Hence you can plan for a suitable level of power. As defined, power is the chance of success, and you want it to be as high as possible. Howell gives details for precise calculations, but what matters here is only this: how can you maximize it?
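As an illustration of solving for one quantity from the other three, the sketch below finds the sample size needed for a target power. It assumes the simplest case (an upper-tailed one-sample z-test, using the normal approximation) and a standardized effect size d; the function name `required_n` is hypothetical, and other tests need different formulas.

```python
from math import ceil
from statistics import NormalDist

def required_n(d, alpha=0.05, power=0.80):
    """Smallest n giving at least `power` for an upper-tailed one-sample z-test.

    d is the standardized effect size (mean difference divided by sigma).
    Assumes d > 0 and power > alpha.
    """
    z = NormalDist()                      # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha)        # cut-off that leaves alpha in the upper tail
    z_power = z.inv_cdf(power)            # distance needed to place `power` beyond the cut-off
    return ceil(((z_alpha + z_power) / d) ** 2)

# Illustrative effect sizes only.
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: n = {required_n(d)}")
```

The same relation can be rearranged to solve for power, alpha, or the detectable effect size instead.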
The figure below shows how the choice of alpha changes power in a one-tailed test. The red and blue curves in these diagrams are sampling distributions of the mean. They show all possible outcomes in the situation where either H0 is true, or where HA is true.
It is only meaningful to speak of power and Type II errors in terms of the alternate hypothesis. Power is an area beneath the blue curve. The risk β of a Type II error (in brown), when the null is retained but in fact is false, is also under the blue curve. Power is the remainder: 1 - β. It comprises all possible samples with a mean greater than what is required for rejection of H0 and therefore occupies the region to the right (in this particular one-tailed test).
Increasing alpha (e.g. to .10) increases power, but it also increases the risk of a Type I error. You are free to choose any significance level, but convention often fixes alpha at .05 or less.
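The trade-off can also be checked by simulation. The sketch below is one possible illustration, not a prescribed method: it draws many sample means from a world where H0 is true and from a world where HA is true, then counts how often each is rejected at several alpha levels. All numeric values (means of 100 and 105, sigma of 15, n of 25, 20,000 trials, the seed) are hypothetical.

```python
import random
from statistics import NormalDist

MU0, MU1, SIGMA, N = 100, 105, 15, 25    # hypothetical population values
SE = SIGMA / N ** 0.5                    # standard error of the mean
TRIALS = 20_000

rng = random.Random(42)                  # fixed seed so the run is repeatable
# Draw sample means directly from each sampling distribution of the mean.
means_h0 = [rng.gauss(MU0, SE) for _ in range(TRIALS)]
means_ha = [rng.gauss(MU1, SE) for _ in range(TRIALS)]

def rejection_rate(sample_means, alpha):
    """Share of sample means that land in the upper-tail rejection region."""
    critical_mean = NormalDist(MU0, SE).inv_cdf(1 - alpha)
    return sum(m > critical_mean for m in sample_means) / len(sample_means)

results = {}
for alpha in (0.01, 0.05, 0.10):
    type_1_rate = rejection_rate(means_h0, alpha)   # H0 true: every rejection is an error
    power = rejection_rate(means_ha, alpha)         # HA true: every rejection is correct
    results[alpha] = (type_1_rate, power)
    print(f"alpha = {alpha:.2f}: Type I rate = {type_1_rate:.3f}, power = {power:.3f}")
```

Moving the cut-off to the left enlarges the rejection region under both curves at once, so the Type I rate and the power rise together.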
The power increases with effect size:
Effect size is formally defined in terms of population parameters. We have little control over population parameters; they are fixed by the Almighty! How, then, can we improve power? Trochim correctly points out that effect size really is the ratio of signal to noise (where noise is sigma). Poor methods can weaken the signal and increase the noise; reliable methods will increase the observed effect size relative to the noise, and so increase power.
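For reference, the signal-to-noise reading can be written out as follows. This is a sketch assuming the standardized mean difference commonly written d; the symbols (mu_0 for the mean under H0, mu_1 for the mean under HA, sigma for the population standard deviation) are assumptions, not notation taken from the text.

```latex
d = \frac{\mu_1 - \mu_0}{\sigma}
\qquad \text{signal: } \mu_1 - \mu_0, \quad \text{noise: } \sigma
```

Anything that shrinks sigma, such as more reliable instruments or tighter procedures, enlarges d even when the raw difference between the means is unchanged.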
What remains is the sample size: increasing sample size directly increases power.
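To see how strongly sample size and effect size drive power, the sketch below tabulates approximate power over a small grid. It assumes an upper-tailed one-sample z-test at alpha = .05 and a standardized effect size d; the function name `power_for` and the grid values are hypothetical.

```python
from statistics import NormalDist

def power_for(d, n, alpha=0.05):
    """Approximate power of an upper-tailed one-sample z-test."""
    z = NormalDist()                          # standard normal distribution
    delta = d * n ** 0.5                      # effect size scaled up by the square root of n
    return 1 - z.cdf(z.inv_cdf(1 - alpha) - delta)

sample_sizes = (10, 25, 50, 100)
effect_sizes = (0.2, 0.5, 0.8)                # illustrative values only

print("n    " + "  ".join(f"d={d:.1f}" for d in effect_sizes))
for n in sample_sizes:
    row = "  ".join(f"{power_for(d, n):5.3f}" for d in effect_sizes)
    print(f"{n:<4} {row}")
```

Reading down a column shows the gain from a larger sample; reading across a row shows the gain from a larger effect size.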
This leads to the pragmatic side of the matter: Research is often costly. To obtain funding you might justify the need for your expenses in terms of the research goals. But do you have a reasonable chance of achieving these goals? In statistical terms, power can answer this question. It balances the pragmatic and statistical issues in a neat equation.
Suppose you are planning some research and work out that with the proposed methodology you have a 20% chance of making a real contribution to knowledge. Fund managers would look askance. How about a similar research proposal that can claim an 80% chance? The 20% versus 80% comparison is the question of power in pragmatic terms. It is the same as saying that one research proposal claims a .2 probability of correctly rejecting a false H0, versus a claim of .8 probability to be able to do the same.
On the other hand, large samples can be costly. If it can be shown that reasonable power is available with a small sample rather than a big one then the costs and benefits can be evaluated. Such a comparison can be made, but only if it is possible to estimate power before the research is carried out.
| Exploratory | Confirmatory |
|---|---|
| Much noise | Little noise |
| Type I OK | Type II OK |
| High alpha | Low alpha |
| Good design | |
| Minimize noise | Randomize noise |
| Repeated measures | Randomized groups |
| Matched groups | |
| Reliable instruments | |
| Reliable procedures | |
Refer to the comparison in Howell's Table 15.2 (3rd and 4th editions) for an example of the difference between using repeated measures (one sample tests) versus randomized groups (two-sample tests).
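One way to get a feel for the contrast between the two designs is the rough sketch below. It assumes the usual normal-approximation formulas, in which the one-sample case scales the effect size by the square root of n and the two-sample case by the square root of n/2 (n per group). The function names are hypothetical, and the comparison holds the nominal effect size fixed, which is a simplification: in a repeated-measures design the effect size is defined on difference scores, so it is not strictly the same quantity.

```python
from statistics import NormalDist

def power_from_delta(delta, alpha=0.05):
    """Approximate power of an upper-tailed test, given the scaled effect delta."""
    z = NormalDist()
    return 1 - z.cdf(z.inv_cdf(1 - alpha) - delta)

def delta_one_sample(d, n):
    """Repeated measures, treated as a one-sample test on n difference scores."""
    return d * n ** 0.5

def delta_two_sample(d, n_per_group):
    """Randomized groups, treated as a two-sample test with equal group sizes."""
    return d * (n_per_group / 2) ** 0.5

d, n = 0.5, 25                            # illustrative values only
repeated = power_from_delta(delta_one_sample(d, n))
randomized = power_from_delta(delta_two_sample(d, n))
print(f"repeated measures: {repeated:.3f}, randomized groups: {randomized:.3f}")
```

Under these assumptions the one-sample design reaches higher power from the same n, which is the kind of difference the comparison in Howell illustrates.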
Howell shows how to calculate the sample size needed to get any level of power. Unfortunately, the calculation is different for each kind of statistical test, and some statistical tests require computations that are only available in specialized computer packages. For these reasons power calculations are beyond the scope of this course. You need only deal with it on a conceptual level as outlined above, especially in terms of Trochim's discussion of threats and remedies.