Lab 14

Statistical Power

 

    When we conduct statistical tests, remember that we can make 2 kinds of correct decisions and 2 kinds of errors:


                                    Actual situation
    Experimenter's Conclusions      H0 is correct             H0 is wrong
    Reject H0                       oops! Type I error        Yay! correct
    Retain H0                       Yay! correct              oops! Type II error

    As scientists, we set up our hypothesis tests to minimize the chances of making a Type I error (concluding that there is an effect when there really isn't). In Psychology and many other disciplines, we typically set the Type I error rate at α = .05. This means that when the null hypothesis is true, 5% of the time a statistical test will lead us to believe that the null hypothesis is unlikely to be true and so we will incorrectly reject it. Of course, as researchers, we never know when we've made a Type I error until further data are collected by our research team or by other researchers. What usually happens is that we get a result that suggests that the effect we were hoping for is present (i.e., we reject the null hypothesis and conclude that the alternative hypothesis is supported). After publishing our "discovery", other researchers might try to "replicate" the effect. A "replication study" is a study that tries to make sure that a result is real by doing a new study that is the same as an earlier study.
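    The 5% figure can be checked with a quick simulation. The Python sketch below is an illustration and not part of the lab materials: it draws many samples from a population in which the null hypothesis is true and counts how often a two-tailed z-test rejects it anyway. The population values (μ = 50, σ = 20), the sample size (n = 30), the number of simulated studies, and the function name are all assumptions chosen for the example.

```python
import random
from statistics import NormalDist, fmean

def type_i_error_rate(mu0=50.0, sigma=20.0, n=30, alpha=0.05,
                      trials=20_000, seed=1):
    """Fraction of simulated studies that reject a TRUE null hypothesis."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed cutoff
    se = sigma / n ** 0.5                          # standard error of the mean
    rejections = 0
    for _ in range(trials):
        # Every sample really does come from the null population.
        sample_mean = fmean(rng.gauss(mu0, sigma) for _ in range(n))
        if abs((sample_mean - mu0) / se) > z_crit:
            rejections += 1                        # a Type I error
    return rejections / trials

rate = type_i_error_rate()
print(f"Simulated Type I error rate: {rate:.3f}")
```

    The printed rate should land close to α, because every rejection in this simulation is a fluke by construction.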

    When efforts to replicate a finding fail repeatedly (i.e., the null hypothesis cannot be rejected and so it is retained), the researchers who reported the original finding usually try to replicate it themselves (maybe the other researchers aren't doing the study correctly). However, if new data keep coming back from new studies and we cannot reject the null hypothesis (except for maybe 5% of the time), we have to conclude that the first study we did was just a fluke and that we committed a Type I error. There is no shame in this. It is just bad luck. We could cut the probability of making a Type I error by setting α to a lower level (e.g., .01) but there is an unfortunate side-effect of setting a low α: It increases the probability of making a Type II error! How can a researcher catch a break? If it isn't a Type I error, it's a Type II error! Actually, there are a number of things that can be done to reduce both types of errors.

    First let's review what a Type II error is:

      Type II error: H0 is really wrong, but the experimenter did not find the evidence strong enough to reject it. The probability of making a Type II error is called β (beta).

    Closely related to β is something called statistical power (also referred to as simply "power"). Power is the probability that a statistical test will correctly reject a false null hypothesis. So power = 1 - β. The more "powerful" the test, the more readily it will detect the effect of a variable if it is there.

    Now let's consider how the two types of errors are related. We need to consider the situation where H0 is wrong, that is when there are two populations, the treatment population and the null population. Consider the figure below. It represents a situation in which there REALLY IS A DIFFERENCE between two distributions.

    • The Null distribution comes from the original comparison group (null because if the null hypothesis were correct, both distributions would be exactly the same).
      • We define the "critical region" by α in the null distribution.
    • The alternative or "treatment" distribution is the distribution that resulted after a particular treatment has been administered or some variable has changed the original null distribution to make it different in some way.
      • Draw the line defined by α up through the treatment distribution. The region that is on the same side of the line as the critical region is Power.

        Power is the probability of obtaining sample data in the critical region when the null hypothesis is false.

        If the sample mean falls in the critical region, then we reject H0. Here we've come to the correct conclusion.
        If the sample mean falls outside of the critical region, then we retain H0. Here we've made a Type II error.
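    The "draw the line up through the treatment distribution" idea can be sketched in a few lines of Python. This is an illustration under assumed values: both distributions are expressed in standard-error units, the treatment distribution is placed 2.5 units above the null, and the test is one-tailed in the upper direction. None of these numbers come from the lab.

```python
from statistics import NormalDist

# Assumed sampling distributions, in standard-error units (illustrative only).
null_dist = NormalDist(mu=0.0, sigma=1.0)        # sample means if H0 is true
treatment_dist = NormalDist(mu=2.5, sigma=1.0)   # sample means if H0 is false

alpha = 0.05

# The line: the point in the null distribution with alpha of its area above it.
critical_value = null_dist.inv_cdf(1 - alpha)

# Power: the part of the treatment distribution on the critical-region side.
power = 1 - treatment_dist.cdf(critical_value)

# Beta: the part of the treatment distribution on the other side of the line.
beta = treatment_dist.cdf(critical_value)

print(f"critical value = {critical_value:.3f}")
print(f"power = {power:.3f}, beta = {beta:.3f}")
```

    The two areas always add up to 1, which is just power = 1 - β seen as a picture.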

      Factors that affect power
        1) The size of the difference that we're trying to detect.
          So when there are two populations, the power will be related to how big a difference there is between the two.
        a big difference between the two populations

        notice that the shaded region is large

        the chance to correctly reject the null hypothesis is good

        a smaller difference between the two populations

        notice that the shaded region is smaller

        the chance to correctly reject the null hypothesis is not nearly as good

        2) Increasing α increases power.

        3) One-tailed tests have more power than two-tailed tests, provided that you have specified the correct tail. If you specify the wrong tail, power is usually near 0. For this reason, most researchers conduct 2-tailed tests. Even though they are less powerful, they prevent the awkwardness of having to retain the null hypothesis when the mean difference is huge but in the opposite direction of what was expected.

        One-tailed test
        α = 0.05

        all of the critical region (α) is on one side of the distribution

        Two-tailed test
        α = 0.05

        because a specific direction is not predicted, the critical region (α) is spread out equally on both sides of the distribution

        as a result the power is smaller

        4) Increasing sample size increases power by reducing the standard error.

        Small n
        α = 0.05

        relatively large standard error

        Larger n
        α = 0.05

        Smaller standard error

        as a result the power is greater
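    The four factors above can be tried out side by side. The following Python sketch is one possible way to compute the power of a z-test for a sample mean; the function name `power_z`, its parameters, and the baseline values (null mean 100, true mean 106, σ = 15, n = 25) are hypothetical and were picked only so that each factor can be varied one at a time.

```python
from statistics import NormalDist

def power_z(mu0, mu1, sigma, n, alpha=0.05, tail="two"):
    """Power of a z-test for a sample mean when the true mean is mu1.

    tail: "two", "upper" (H1 > H0), or "lower" (H1 < H0).
    """
    se = sigma / n ** 0.5                 # standard error of the mean
    alt = NormalDist(mu1, se)             # sample means when H0 is false
    std = NormalDist()
    if tail == "two":
        z = std.inv_cdf(1 - alpha / 2)    # alpha split across both tails
        return alt.cdf(mu0 - z * se) + (1 - alt.cdf(mu0 + z * se))
    z = std.inv_cdf(1 - alpha)            # all of alpha in one tail
    if tail == "upper":
        return 1 - alt.cdf(mu0 + z * se)
    if tail == "lower":
        return alt.cdf(mu0 - z * se)
    raise ValueError("tail must be 'two', 'upper', or 'lower'")

# Hypothetical baseline values (not from the lab).
base = dict(mu0=100, mu1=106, sigma=15, n=25)

print(f"baseline (two-tailed, alpha = .05): {power_z(**base):.3f}")
print(f"1) smaller difference (mu1 = 103): {power_z(**{**base, 'mu1': 103}):.3f}")
print(f"2) smaller alpha (.01):            {power_z(**base, alpha=0.01):.3f}")
print(f"3) one-tailed, correct tail:       {power_z(**base, tail='upper'):.3f}")
print(f"3) one-tailed, wrong tail:         {power_z(**base, tail='lower'):.5f}")
print(f"4) larger sample (n = 100):        {power_z(**{**base, 'n': 100}):.3f}")
```

    Changing one argument at a time makes it easy to see which direction each factor pushes power, including how close to 0 it gets when the wrong tail is specified.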


    Here is an Excel spreadsheet that illustrates aspects of statistical power. Download it to somewhere you'll be able to find it again and open it.

    Let's walk through how to use it.

    Suppose you collect data from n = 30 people and the sample mean is 55. You think that this sample did not come from a population with μ = 50 and σ = 20. You didn't specify whether you expected your sample to have a higher or lower mean than the population so this is a 2-tailed hypothesis. The null hypothesis is that the sample does come from that population with μ = 50 and σ = 20. Now, suppose that, in reality and unbeknownst to you, the sample came from a different population with μ = 55 and σ = 20. Thus, although you are not in a position to know this for sure, the null hypothesis is false.

    First, set the "Null Hypothesis is" option box to "False." If you don't do this first, you won't be able to set the mean and standard deviation of the alternative distribution and you won't be able to calculate power. Since power is the probability of rejecting a false null hypothesis, power is irrelevant when the null hypothesis is true.

    Next, set the "Hypothesis Type" option box to "2-tailed (H1≠H0)" like this:

    Now set the rest of the numbers like this:

Set the null distribution (H0) mean to 50 and the standard deviation to 20.
Set the alternative or "treatment" distribution (H1) mean to 55 and the standard deviation to 20.
Set the α level to .05.
Set the sample size (n) to 30.

    It should look like this:

    The graph below should look something like this:

    The blue distribution is the distribution of sample means of the null distribution and the pink distribution is the distribution of sample means of the alternative distribution.
    The black vertical line is the sample mean (right now it falls on 55).
    The critical regions are shaded in blue. When the vertical line (the sample mean) falls in a critical region, the null hypothesis is rejected.
    The pink shaded regions represent statistical power. They represent the part of the alternative distribution that is beyond the critical regions. You can't see the very small pink portion that is in the blue critical region on the left side because it is covered, but it is also included in the calculation of statistical power.

    Notice that the sample mean is 55, exactly the same as the alternative distribution mean. You'd think that we'd reject the null hypothesis in a situation like this, right? Wrong. Unfortunately, the sample mean did not fall in a critical region so we must retain the null hypothesis. Thus, we have made a Type II error.

    It turns out that making a Type II error was very likely in this situation. If you look at the "Type II Error Rate (β)" box, you'll see that the probability of making a Type II error was quite high (about 72%). Power was only about .28, meaning that randomly sampling 30 people from the alternative distribution would result in a correct decision only 28% of the time. If you were a researcher, you would not want to put in hours of toil for only a 28% chance of success. You'd want to improve your power substantially.
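    The numbers in the spreadsheet can be reproduced by hand. The sketch below assumes the spreadsheet uses a z-test on the distribution of sample means, which is consistent with the values reported above but is not confirmed by the lab text. All inputs are the ones entered in the walkthrough; the variable names are made up for the example.

```python
from statistics import NormalDist

# Values entered in the walkthrough.
mu0, mu1, sigma, n, alpha = 50, 55, 20, 30, 0.05
sample_mean = 55

se = sigma / n ** 0.5                          # standard error of the mean
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed cutoff
lower_crit = mu0 - z_crit * se
upper_crit = mu0 + z_crit * se

# Decision: reject only if the sample mean lands in a critical region.
reject = sample_mean < lower_crit or sample_mean > upper_crit

# Power: area of the alternative distribution inside both critical regions.
alt = NormalDist(mu1, se)
power = alt.cdf(lower_crit) + (1 - alt.cdf(upper_crit))
beta = 1 - power

print(f"standard error  = {se:.2f}")                               # 3.65
print(f"critical values = {lower_crit:.2f} and {upper_crit:.2f}")  # 42.84 and 57.16
print(f"decision        = {'reject H0' if reject else 'retain H0'}")
print(f"power = {power:.2f}, beta = {beta:.2f}")                   # 0.28, 0.72
```

    A sample mean of 55 sits below the upper critical value, which is why the null hypothesis is retained even though the sample mean equals the true mean of the alternative distribution.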

    The effect of 1-tailed tests

    If you had specified a 1-tailed hypothesis, your power would have improved somewhat (if you guessed in the right direction).
    Set the "Hypothesis Type" option box to "1-tailed (H1>H0)"

    Blackboard 1) What is your power now?
    Blackboard 2) Does the sample mean fall in the critical region now that you have a 1-tailed hypothesis?

    Set the "Hypothesis Type" option box back to "2-tailed (H1≠H0)."

    The effect of raising α

    Raising α is the easiest way to raise power. It is also the stupidest. Don't do it when conducting real data analysis. People will laugh at you and won't take you seriously anymore. ;)

    However, just to see the effect of raising it, temporarily raise α from .05 to .2.

    Blackboard 3) What happened to the size of the critical regions when you raised α from .05 to .2? (Remember to set the "Hypothesis Type" option box back to "2-tailed (H1≠H0).")

    Blackboard 4) Does the sample mean fall in the critical region now that you have raised α from .05 to .2?

    You can see that power increased from .28 to .54. On the surface, this would seem to be a good thing. It isn't. The problem with raising α is that it increases the frequency of Type I errors. Type I errors are false facts and they are harder to get rid of than Type II errors once they've been entered into the scientific literature.
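    The trade-off can also be seen by simulation. The Python sketch below is an illustration, not the spreadsheet's method: it simulates many studies with the walkthrough values (μ = 50 under the null, μ = 55 under the alternative, σ = 20, n = 30) and counts rejections at each α. As a shortcut, each simulated sample mean is drawn directly from the distribution of sample means; the function name, the number of trials, and the seed are assumptions.

```python
import random
from statistics import NormalDist

def rejection_rate(true_mean, alpha, mu0=50, sigma=20, n=30,
                   trials=50_000, seed=7):
    """Share of simulated studies in which a two-tailed z-test rejects H0."""
    rng = random.Random(seed)
    se = sigma / n ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        sample_mean = rng.gauss(true_mean, se)   # one simulated sample mean
        if abs(sample_mean - mu0) / se > z_crit:
            rejections += 1
    return rejections / trials

for alpha in (0.05, 0.20):
    # H0 true: every rejection is a Type I error.
    type_i = rejection_rate(true_mean=50, alpha=alpha)
    # H0 false: every rejection is a correct decision (power).
    power = rejection_rate(true_mean=55, alpha=alpha)
    print(f"alpha = {alpha:.2f}: Type I error rate ~ {type_i:.2f}, power ~ {power:.2f}")
```

    Both columns rise together: the extra power at α = .2 is bought with four times as many false positives when the null hypothesis is true.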

    Reset α to .05 again.

    The effect of increasing sample size

    The most effective way to increase power that is directly under the researcher's control is to increase the sample size.

    Blackboard 5) What happens to the distributions in the graph when you change the sample size from n=30 to n=100? (Remember that α must be set to .05.)

    The result you noticed in question 5 should not be surprising and you probably predicted what would happen before you did it. The graph is the distribution of sample means. The width of this distribution is related to the standard deviation of the distribution. The standard deviation of the distribution of sample means is also called the standard error. Since you may recall that the standard error is the original population standard deviation divided by the square root of n, you can guess what happens to the standard error when you divide by the square root of a large number.
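    Written out, the standard error calculation described above looks like this for the two sample sizes in question 5. The symbol for the standard error is an assumed notation; σ = 20 is the population standard deviation from the walkthrough.

```latex
\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}
\qquad
n = 30:\ \frac{20}{\sqrt{30}} \approx 3.65
\qquad
n = 100:\ \frac{20}{\sqrt{100}} = 2
```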

    Try experimenting with different values of n and see what happens.

    Blackboard 6) What is the lowest n that causes the 2-tailed null hypothesis to be rejected? (Hint: Pay attention to the "Decision" box in the upper right corner. It will change to red when the null hypothesis is rejected.)

    The effect of larger mean differences

    Right now, the two distribution means are .25 standard deviations apart ((55 - 50) / 20 = 5 / 20 = .25). Suppose that the alternative distribution had μ = 60 instead of μ = 55. Now the distributions are .5 standard deviations apart ((60 - 50) / 20 = 10 / 20 = .5). Set n back to 30 and compare the original power we started with (.28) to what you have now. As you can see, the probability of correctly rejecting the null hypothesis has gone up considerably.
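    The same comparison can be expressed in terms of the standardized difference. In the Python sketch below, `standardized_difference` and `power_from_d` are hypothetical helper names, and the power formula assumes a two-tailed z-test on the sample mean. It is a way to check your spreadsheet result, not a replacement for it.

```python
from statistics import NormalDist

def standardized_difference(mu0, mu1, sigma):
    """Distance between the two population means in standard deviations."""
    return (mu1 - mu0) / sigma

def power_from_d(d, n, alpha=0.05):
    """Two-tailed z-test power, written in terms of the standardized difference d."""
    std = NormalDist()
    shift = abs(d) * n ** 0.5                  # difference in standard-error units
    z_crit = std.inv_cdf(1 - alpha / 2)
    # Area of the alternative distribution beyond each critical value.
    return (1 - std.cdf(z_crit - shift)) + std.cdf(-z_crit - shift)

for mu1 in (55, 60):
    d = standardized_difference(50, mu1, 20)
    print(f"mu1 = {mu1}: d = {d:.2f}, power at n = 30 is {power_from_d(d, 30):.2f}")
```

    Written this way, power depends on the mean difference and σ only through their ratio, together with n and α.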

    Blackboard 7) What is the power now that μ = 60 for the alternative distribution? (Remember to set n back to 30.)


    Blackboard 8) If the sample mean is also set to 60, is the null hypothesis rejected or retained?

    The effect of larger population standard deviations

    Researchers rarely have any control over this one so it borders on pointlessness to discuss it. However, for the sake of completeness, we should note that a large population standard deviation reduces power. Remember that the z-score formula for a sample mean has the standard error (σ divided by the square root of n) in the denominator. A larger σ means dividing by a larger number, which makes z smaller. In order to reject the null hypothesis, z has to be big (i.e., far from 0).
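    A short Python sketch makes the denominator effect concrete. It uses the walkthrough's sample mean (55), null mean (50), and n = 30; the σ values of 10 and 40 are hypothetical comparisons, and the function name is an assumption.

```python
def z_for_sample_mean(sample_mean, mu0, sigma, n):
    """z-score of a sample mean: its distance from mu0 in standard-error units."""
    standard_error = sigma / n ** 0.5
    return (sample_mean - mu0) / standard_error

# Same 5-point mean difference, three different population standard deviations.
for sigma in (10, 20, 40):
    z = z_for_sample_mean(sample_mean=55, mu0=50, sigma=sigma, n=30)
    print(f"sigma = {sigma}: z = {z:.2f}")
```

    Doubling σ halves z, so the same mean difference ends up closer to 0 and is harder to tell apart from the null distribution.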

    Review
    Power is the probability of correctly rejecting the null hypothesis when the null hypothesis is false.
    Power is influenced by:
    1. Type of hypothesis (1-tailed --> more power)
    2. α (larger α --> more power)
    3. Sample size (larger n --> more power)
    4. Mean differences (larger differences --> more power)
    5. Population standard deviation (smaller σ --> more power)