When we conduct statistical tests, remember that we can make 2 kinds of
correct decisions and 2 kinds of errors:
| Experimenter's conclusion | Actual situation: H0 is correct | Actual situation: H0 is wrong |
| Reject H0 | oops! Type I error | Yay! correct |
| Retain H0 | Yay! correct | oops! Type II error |
As scientists, we set up our hypothesis tests to
minimize the chances of making a Type I error (concluding that there is
an effect when there really isn't). In Psychology and many other
disciplines, we typically set a Type I error rate at α = .05. This
means that when the null hypothesis is true, 5% of the time a
statistical test will lead us to believe that the null hypothesis is
unlikely to be true and so we will incorrectly reject it. Of course, as
researchers, we never know when we've made a Type I error until further
data are collected by our research team or by other researchers. What
usually happens is that we get a result that suggests that the effect
we were hoping for is present (i.e., we reject the null hypothesis and
conclude that the alternative hypothesis is supported). After
publishing our "discovery", other researchers might try to "replicate"
the effect. A "replication study" is a study that tries to make sure
that a result is real by doing a new study that is the same as an earlier
study.
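To get a feel for what a 5% Type I error rate means, here is a minimal simulation sketch. It assumes a z-test on sample means and uses illustrative values (a null population with μ = 50 and σ = 20, samples of n = 30); the function name, the number of simulated studies, and the random seed are made up for this example.

```python
import random
from statistics import NormalDist

def simulate_type_i_rate(mu0=50.0, sigma=20.0, n=30, alpha=0.05,
                         trials=20_000, seed=1):
    """Fraction of simulated studies that reject a TRUE null hypothesis."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed cutoff, about 1.96
    std_error = sigma / n ** 0.5
    rejections = 0
    for _ in range(trials):
        # Every sample is drawn from the null population, so H0 is true here.
        sample_mean = sum(rng.gauss(mu0, sigma) for _ in range(n)) / n
        z = (sample_mean - mu0) / std_error
        if abs(z) > z_crit:
            rejections += 1  # rejecting a true H0 is a Type I error
    return rejections / trials

rate = simulate_type_i_rate()
print(f"Proportion of false rejections: {rate:.3f}")  # should land near alpha
```

Each "study" in the loop is a fresh sample from the null population, so every rejection it counts is a fluke of the kind described above.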
When efforts to replicate a finding fail repeatedly
(i.e., the null hypothesis cannot be rejected and so it is retained),
the researchers who reported the original finding usually try to
replicate it themselves (maybe the other researchers aren't doing the
study correctly). However, if new data keep coming back from new
studies and we cannot reject the null hypothesis (except for maybe 5%
of the time), we have to conclude that the first study we did was just
a fluke and that we committed a Type I error. There is no shame in
this. It is just bad luck. We could cut the probability of making a
Type I error by setting α to a lower level (e.g., .01), but there is an
unfortunate side effect of setting a low α: it increases the
probability of making a Type II error! How can a researcher catch a
break? If it isn't a Type I error, it's a Type II error! Actually,
there are a number of things that can be done to reduce the chances of
both types of errors.
First let's review what a Type II error is:
Type II error:
H0 is really wrong, but the experimenter didn't believe
that the evidence was strong enough to reject it. The
probability of making a Type II error is called β (beta).
Closely related to β is something called statistical power (also referred to
as simply "power"). Power is the probability that a statistical test
will correctly reject a false null hypothesis. So power = 1 - β. The
more "powerful" the test, the more readily
it will detect the effect of a variable if it is there.
Now let's consider how the two types of errors are
related. We need to consider the situation where H0 is wrong,
that is when there are two populations, the treatment population and
the null population. Consider the figure below. It represents a
situation in which there REALLY IS A DIFFERENCE between two
distributions.
- The Null distribution comes from the original
comparison group (null because if the null hypothesis were correct,
both distributions would be exactly the same).
- We define the "critical region" by α in the null distribution.
- The alternative or "treatment" distribution is the
distribution that
resulted after a particular treatment has been administered or some
variable has changed the original null distribution to make it
different in some way.
- Draw the line defined by α
up through the treatment distribution. The region that is on the same
side of the line as the critical region is Power.
Power is the probability of
obtaining sample data in the critical region when the null hypothesis
is false.
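The steps above can be sketched numerically. This is only an illustration: it assumes a one-tailed z-test on sample means with the critical region in the upper tail, and the population values (μ = 100 vs. μ = 106, σ = 15, n = 25) are hypothetical numbers chosen for the example.

```python
from statistics import NormalDist

# Hypothetical values, chosen only for illustration.
mu_null, mu_treatment = 100.0, 106.0
sigma, n, alpha = 15.0, 25, 0.05

std_error = sigma / n ** 0.5
null_means = NormalDist(mu_null, std_error)            # sample means if H0 is true
treatment_means = NormalDist(mu_treatment, std_error)  # sample means after treatment

# The line defined by alpha: it cuts off the top 5% of the null distribution.
critical_value = null_means.inv_cdf(1 - alpha)

# Power is the part of the treatment distribution beyond that line;
# beta is the part that falls short of it.
power = 1 - treatment_means.cdf(critical_value)
beta = treatment_means.cdf(critical_value)

print(f"critical value = {critical_value:.2f}")
print(f"power = {power:.2f}, beta = {beta:.2f}")
```

The same line splits the treatment distribution into two parts, which is why power and β always add up to 1.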
| If the sample mean falls in the critical region, then we reject the H0 | If the sample mean falls outside of the critical region, then we retain the H0 |
| Here we've come to the correct conclusion. | Here we've made a Type II error. |
Factors that affect power
1) The size of the difference that we're
trying to detect.
So when there are two populations, the power will be related to how big
a difference there is between the two.
- A big difference between the two populations: notice that the shaded
region is large. The chance to correctly reject the null hypothesis is
good.
- A smaller difference between the two populations: notice that the
shaded region is smaller. The chance to correctly reject the null
hypothesis is not nearly as good.
2) Increasing α
increases power.
3) One-tailed tests have more power
than two-tailed tests, given that you have specified the correct tail.
If you specify the wrong tail, power is usually near 0. For this
reason, most researchers conduct 2-tailed tests. Even though
they are less powerful, they prevent the awkwardness of having to
retain the null hypothesis even when the mean difference is huge but in
the opposite direction of what was expected.
- One-tailed test, α = 0.05: all of the critical region (α) is on one
side of the distribution.
- Two-tailed test, α = 0.05: because a specific direction is not
predicted, the critical region (α) is spread out equally on both sides
of the distribution. As a result, the power is smaller.
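The comparison between one-tailed and two-tailed tests can be checked with a short sketch. It assumes a z-test on sample means; the helper `z_test_power` and the population values (μ = 100 vs. μ = 106, σ = 15, n = 25) are hypothetical and used only for illustration.

```python
from statistics import NormalDist

def z_test_power(mu0, mu1, sigma, n, alpha=0.05, tail="two"):
    """Power of a z-test on a sample mean; tail is 'two', 'upper', or 'lower'."""
    z = NormalDist()
    # True difference between the means, in standard-error units.
    shift = (mu1 - mu0) / (sigma / n ** 0.5)
    if tail == "upper":   # H1: mean is higher than the null mean
        return 1 - z.cdf(z.inv_cdf(1 - alpha) - shift)
    if tail == "lower":   # H1: mean is lower than the null mean
        return z.cdf(z.inv_cdf(alpha) - shift)
    if tail == "two":     # H1: mean is different in either direction
        z_crit = z.inv_cdf(1 - alpha / 2)
        return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)
    raise ValueError("tail must be 'two', 'upper', or 'lower'")

# Hypothetical populations: the treatment really raises the mean.
args = dict(mu0=100.0, mu1=106.0, sigma=15.0, n=25)
right_tail = z_test_power(tail="upper", **args)  # correct direction
two_tail = z_test_power(tail="two", **args)
wrong_tail = z_test_power(tail="lower", **args)  # wrong direction

print(f"one-tailed, correct tail: {right_tail:.3f}")
print(f"two-tailed:               {two_tail:.3f}")
print(f"one-tailed, wrong tail:   {wrong_tail:.4f}")
```

With these numbers the correct one-tailed test has the most power, the two-tailed test has somewhat less, and the wrong-tailed test has almost none.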
4) Increasing sample size increases
power by reducing the standard error.
- Small n, α = 0.05: relatively large standard error.
- Larger n, α = 0.05: smaller standard error. As a result, the power is
greater.
Here is an Excel spreadsheet
that illustrates aspects of statistical power. Download it to somewhere you'll be
able to find it again and open it.
Let's walk through how to use it.
Suppose you collect data from n = 30 people and the
sample mean is 55. You think that this sample did not come from a population
with μ = 50 and σ = 20. You didn't specify whether you expected your
sample to have a higher or lower mean than the population so this is a
2-tailed hypothesis. The null hypothesis is that the sample does come from that
population with μ = 50 and σ = 20. Now, suppose that, in reality and unbeknownst to you, the
sample came from a different population with μ = 55 and σ = 20. Thus,
although you are not in a position to know this for sure, the null
hypothesis is false.
First, set the "Null Hypothesis is" option box to "False." If you don't
do this first, you won't be able to set the mean and standard deviation
of the alternative distribution and you won't be able to calculate power.
Since power is the probability of rejecting a false null hypothesis,
power is irrelevant when the null hypothesis is true.

Next, set the "Hypothesis Type" option box to "2-tailed (H1≠H0)."

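Now set the rest of the numbers to match the scenario above: null
distribution μ = 50 and σ = 20, alternative distribution μ = 55 and
σ = 20, n = 30, sample mean = 55, and α = .05.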

The graph below should look something like this:

The blue
distribution is the distribution of sample means of the null
distribution and the pink
distribution is the distribution of sample means of the
alternative distribution.
The black vertical line is the sample mean (right now it falls on
55).
The critical regions are
shaded in blue. When the vertical line (the sample mean) falls in a
critical region, the null hypothesis is rejected.
The pink shaded regions represent statistical power. They
represent the part of the alternative distribution that is beyond the
critical regions. You can't see the very small pink portion that is in
the blue critical region on the left side because it is covered, but it
is also included in the calculation of statistical power.
Notice that the sample mean is 55, exactly the same as
the alternative distribution mean. You'd think that we'd reject the
null hypothesis in a situation like this, right? Wrong. Unfortunately,
the sample mean did not fall in a critical region so we must retain the
null hypothesis. Thus, we have made a Type II error.
It turns out that making a Type II error was very
likely in this situation. If you look at the "Type II Error Rate (β)"
box, you'll see that the probability of making a Type II error was
quite high (about 72%). Power was only about .28, meaning that randomly
sampling 30 people from the alternative distribution would result in a
correct decision only 28% of the time. If you were a researcher, you
would not want to put in hours of toil for only a 28% chance of
success. You'd want to improve your power substantially.
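The numbers in the spreadsheet can be reproduced with a short sketch. It assumes the spreadsheet performs a 2-tailed z-test on the sample mean (the text does not show its formulas); the variable names are made up for this example.

```python
from statistics import NormalDist

# Values from the walk-through above.
mu0, mu1, sigma, n, alpha = 50.0, 55.0, 20.0, 30, 0.05
sample_mean = 55.0

z = NormalDist()
std_error = sigma / n ** 0.5
z_crit = z.inv_cdf(1 - alpha / 2)  # two-tailed cutoff, about 1.96

# Decision for the observed sample mean.
z_obs = (sample_mean - mu0) / std_error
reject = abs(z_obs) > z_crit

# Power: the share of the alternative distribution inside either critical region.
shift = (mu1 - mu0) / std_error
power = (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)
beta = 1 - power

print(f"z = {z_obs:.2f}, reject H0: {reject}")         # z falls short of the cutoff
print(f"power = {power:.2f}, Type II error rate = {beta:.2f}")  # about .28 and .72
```

The second term in the power calculation is the very small pink portion hidden inside the left critical region.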
The effect of 1-tailed tests
If you had specified a 1-tailed hypothesis, your power
would have improved somewhat (if you guessed in the right direction).
Set the "Hypothesis Type" option box to "1-tailed (H1>H0)."
Blackboard 1) What is your power now?
Blackboard 2) Does the sample mean fall in the critical region now that
you have a 1-tailed hypothesis?
Set the "Hypothesis Type" option box back to "2-tailed (H1≠H0)."
The effect of raising α
Raising α is the easiest way to raise power. It is
also the stupidest. Don't do
it when conducting real data analysis. People will laugh at you
and won't take you seriously anymore. ;)
However, just to see the effect of raising it,
temporarily raise α from .05 to .2.
Blackboard 3) What happened to the size of the critical regions when you
raised α from .05 to .2? (Remember that the "Hypothesis Type" option box
should be set back to "2-tailed (H1≠H0).")
Blackboard 4) Does the sample mean fall in the
critical region now that you have raised α from .05 to .2?
You can see that power increased from .28 to .54. On
the surface, this would seem to be a good thing. It isn't. The problem
with raising α is that it increases the frequency of Type I errors.
Type I errors are false facts and they are harder to get rid of than
Type II errors once they've been entered into the scientific literature.
Reset α to .05 again.
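The jump in power from .28 to .54 can be checked with a short sketch. It assumes a 2-tailed z-test on the sample mean with the walk-through values (μ = 50 vs. μ = 55, σ = 20, n = 30); the function name is made up for this example.

```python
from statistics import NormalDist

def two_tailed_power(mu0, mu1, sigma, n, alpha):
    """Power of a two-tailed z-test on a sample mean."""
    z = NormalDist()
    shift = (mu1 - mu0) / (sigma / n ** 0.5)
    z_crit = z.inv_cdf(1 - alpha / 2)  # a larger alpha gives a smaller cutoff
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

power_05 = two_tailed_power(50.0, 55.0, 20.0, 30, alpha=0.05)
power_20 = two_tailed_power(50.0, 55.0, 20.0, 30, alpha=0.20)

print(f"alpha = .05: power = {power_05:.2f}")  # about .28
print(f"alpha = .20: power = {power_20:.2f}")  # about .54
```

The extra power comes entirely from accepting a 20% chance of rejecting a true null hypothesis, which is the cost the text warns about.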
The effect of increasing sample size
The most effective way to increase power that is
directly under the researcher's control is to increase the sample size.
Blackboard 5) What happens to the distributions in the graph when you
change the sample size from n = 30 to n = 100? (Remember that α must be
set to .05.)
The result you noticed in question 5 should not be
surprising and you probably predicted what would happen before you did
it. The graph is the distribution of sample means. The width of this
distribution is related to the standard deviation of the distribution.
The standard deviation of the distribution of sample means is also
called the standard error. Since you may recall that the standard error
is the original population standard deviation divided by the square
root of n, you can guess what happens to the standard error when you
divide by the square root of a large number.
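As a quick illustration of the standard error formula, the sketch below computes σ divided by the square root of n for a few sample sizes, using the σ = 20 from the walk-through. The function name and the extra sample size of 400 are chosen only for this example.

```python
import math

def standard_error(sigma, n):
    """Standard deviation of the distribution of sample means."""
    return sigma / math.sqrt(n)

sigma = 20.0
for n in (30, 100, 400):
    print(f"n = {n:3d}: standard error = {standard_error(sigma, n):.2f}")
```

Because n sits under a square root, you have to quadruple the sample size to cut the standard error in half.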
Try experimenting
with different values of n and see what happens.
Blackboard 6) What is the lowest n that causes the 2-tailed null
hypothesis to be rejected? (Hint: Pay attention to the "Decision" box in
the upper right corner. It will change to red when the null hypothesis
is rejected.)
The effect of larger mean differences
Right now, the 2 distribution means are .25 standard
deviations apart ((55 - 50) / 20 = 5 / 20 = .25). Suppose that the
alternative distribution had μ = 60 instead of μ = 55. Now the
distributions are .5 standard deviations apart ((60 - 50) / 20 = 10 /
20 = .5). If n is set back to 30, you can compare the original
power we started with (.28) to what you have now. As you can see, the
odds of correctly rejecting the null hypothesis have gone up
considerably.
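The arithmetic above, and its effect on power, can be sketched as follows. This assumes a 2-tailed z-test on the sample mean with α = .05; the function name is made up for this example, and you can compare its output with the spreadsheet.

```python
from statistics import NormalDist

def two_tailed_power(mu0, mu1, sigma, n, alpha=0.05):
    """Power of a two-tailed z-test on a sample mean."""
    z = NormalDist()
    shift = (mu1 - mu0) / (sigma / n ** 0.5)
    z_crit = z.inv_cdf(1 - alpha / 2)
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

mu0, sigma, n = 50.0, 20.0, 30
for mu1 in (55.0, 60.0):
    # Distance between the two population means, in standard deviations.
    distance = (mu1 - mu0) / sigma
    power = two_tailed_power(mu0, mu1, sigma, n)
    print(f"alternative mean = {mu1:.0f}: distance = {distance:.2f} SD, power = {power:.2f}")
```

Doubling the distance between the means doubles the shift of the alternative distribution in standard-error units, which is why the power rises so sharply.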
Blackboard 7) What is the power now that μ = 60 for the alternative
distribution? (Remember to set n back to 30.)
Blackboard 8) If the sample mean is also set to 60, is the null
hypothesis
rejected or retained?
The effect of larger population standard deviations
Researchers rarely have any control over this one so
it borders on pointlessness to discuss it. However, for the sake of
completeness, we should note that a large population standard deviation
reduces power. Remember that the z-score formula has a standard
deviation in the denominator. Dividing by a large number makes the z
smaller. In order to reject the null hypothesis, z has to be big (i.e.,
far from 0).
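A small sketch makes the point concrete. It assumes the z-score for a sample mean, where the denominator is the standard error (σ divided by the square root of n), and reuses the sample mean of 55, null mean of 50, and n = 30 from the walk-through. The σ values of 10 and 40 are hypothetical comparisons.

```python
from statistics import NormalDist

def z_for_sample_mean(sample_mean, mu0, sigma, n):
    """z-score of a sample mean under the null hypothesis."""
    return (sample_mean - mu0) / (sigma / n ** 0.5)

z_crit = NormalDist().inv_cdf(0.975)  # two-tailed cutoff for alpha = .05

for sigma in (10.0, 20.0, 40.0):
    z = z_for_sample_mean(55.0, 50.0, sigma, 30)
    decision = "reject H0" if abs(z) > z_crit else "retain H0"
    print(f"sigma = {sigma:4.0f}: z = {z:.2f} -> {decision}")
```

The same 5-point mean difference leads to a different decision depending only on how spread out the population is.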
Review
Power is the probability of correctly rejecting
the null hypothesis when the null hypothesis is false.
Power is influenced by:
1. Type of hypothesis (1-tailed --> more power)
2. α (larger α --> more power)
3. Sample size (larger n --> more power)
4. Mean differences (larger differences --> more power)
5. Population standard deviation (smaller σ --> more power)