Lab 12

Sampling distributions

 

When we discussed z-scores, we used them to locate a score or set of scores in the population. However, in our last lab we discussed the practical necessity of sampling. Recall that we don't usually have access to the entire population, so instead we take a subset of the population, a sample. In this lab we will begin to learn some of the inferential statistical methods that will allow us to make claims about population parameters based on sample statistics.

    Suppose that you take 3 different random samples from the same population. They are probably going to be different from one another. See the figure below for an example of what I mean.

    The samples may have different shapes, different means, and different variability. So how do you figure out what the best estimate of the population mean is?

    There are essentially an infinite number of samples that can be taken from a population if we sample with replacement (put the ones we choose back into the population each time). But the huge set of possible samples forms a simple, orderly, and predictable pattern (a sampling distribution). Because of this, we are able to base our predictions about sample characteristics on the distribution of sample means.

    The distribution of sample means is the collection of sample means for all the possible random samples of a particular size (n) that can be obtained from a population.

In other words, what we want to do is look at all of the possible samples (of a particular size, this part is important) and make predictions based on the properties of all of them. We do this the same way that we've done it in the past: We essentially find the average of those properties.


The Distribution of Sample Means

We can create a distribution of sample means by looking at all possible samples of a certain size (n) and considering the means of each of those samples.

Let's look at a concrete example:

    Consider the following small population of scores: 2, 4, 6, 8

    Because this population is so small, we can actually know its mean (and variability): $\mu = (2+4+6+8)/4 = 5$. But suppose that we didn't, and wanted to be able to make an estimate of this population from samples chosen from the population (like we do when we conduct a research study).

      step 1: pick a sample size: for this example we'll pick samples of n = 2

        - we'll talk more about sample size a little later, but typically the bigger your sample size, the more likely that your samples will be similar to one another (and to the population as a whole)

      step 2: Because we selected such a small population, we can actually list all of the possible samples that you could get (sampling with replacement, so the same score can be drawn twice) and look at their distribution.

        Okay, imagine that each score is a number on a tile. Put all four tiles in a bag. To create the first sample, pull out one tile, record the number on the tile, replace the tile in the bag, and then select the second tile (recall our sample size is 2) and record the value on that tile. The table below presents all 16 possible n = 2 samples that could result from this process along with the mean of each sample.

      sample   first score   second score   sample mean ($\bar{X}$)
      ------   -----------   ------------   -----------------------
      1        2             2              2
      2        2             4              3
      3        2             6              4
      4        2             8              5
      5        4             2              3
      6        4             4              4
      7        4             6              5
      8        4             8              6
      9        6             2              4
      10       6             4              5
      11       6             6              6
      12       6             8              7
      13       8             2              5
      14       8             4              6
      15       8             6              7
      16       8             8              8

      Distribution of the sample means
        Now let's plot all of the different sample means (and provide a frequency distribution table). This is the distribution of sample means (where scores are means from the samples you chose).

      sample mean ($\bar{X}$)   frequency (f)
      -----------------------   -------------
      2                         1
      3                         2
      4                         3
      5                         4
      6                         3
      7                         2
      8                         1

      step 3: Now you're ready to answer questions like: What is the probability of getting a sample with a mean greater than 7? $p(\bar{X} > 7) = ?$

        Looking at our distribution of sample means, we find that 1 out of the 16 samples has a mean greater than 7. So that's our answer: 1/16 = .0625 = 6.25%
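The three steps above can also be checked by brute force. The following is a minimal Python sketch (not part of the original lab materials) that lists every ordered sample of size 2 drawn with replacement from the population 2, 4, 6, 8, tabulates the sample means, and counts how many exceed 7. The variable names are illustrative only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

population = [2, 4, 6, 8]
n = 2  # sample size

# Every ordered sample of size n drawn with replacement (4 ** 2 = 16 samples).
samples = list(product(population, repeat=n))

# Exact sample means, kept as fractions to avoid rounding.
sample_means = [Fraction(sum(sample), n) for sample in samples]

# Frequency distribution of the sample means.
frequencies = Counter(sample_means)
for mean in sorted(frequencies):
    print(mean, frequencies[mean])

# Probability of drawing a sample whose mean is greater than 7.
count_above_7 = sum(1 for mean in sample_means if mean > 7)
p_greater_than_7 = Fraction(count_above_7, len(sample_means))
print(float(p_greater_than_7))  # 0.0625
```

Exhaustive listing like this only works because the population is tiny. For realistic populations the number of possible samples is far too large, which is why the rest of the lab relies on simulation and on the normal curve.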



Sampling Distribution Simulations

In the example that we walked through above we've simplified things greatly. We have a really small population, and we took a pretty small sample. Most situations, however, will be much more complex. Let's look at a more complex example that you can create.

Click the button below to open the Sampling Distribution Simulation applet.

Click the "Begin" button. It may take a while for Java to open.

The top distribution is the population distribution. We'll start with a Normally distributed population (it should already be set for that). On the left are the descriptive statistics which describe this distribution.

From this population distribution we can draw some random samples.

Click on the "Animated" button and watch the animation. This will randomly sample 5 individuals (n = 5) from the population (they'll drop down on to the graph immediately below the population graph). The applet will also compute the mean of this sample and plot the sample mean on the third plot (below the plot of the 5 individuals). Click the "Animated" button 4 more times and watch what happens to the 3rd graph (the Distribution of Means).

If you were to click the "Animated" button many more times, you would eventually see a close approximation of the Distribution of Sample Means. That would take a long, long time. To speed things up, click the "10,000" button. This is equivalent to pressing the "Animated" button 10,000 times. That is, it draws 10,000 samples from the original population, computes 10,000 means, and graphs them in the "Distribution of Means" plot. Notice what the distribution of means looks like.

Okay, let's experiment with changing the sample size. We are going to compare 2 different distributions of sample means with 2 different sample sizes.

Change the sample size in the "Distribution of Means" to N=2. On the bottom graph (the 4th one that should be empty), change the box that says "None" to "Mean" and change the sample size to N=25. Click on "Animated" and watch what happens. Now click on the "10,000" button and see what happens.

WebCT 1) Which distribution is wider?

You can play around with other sample sizes and figure out how sample size is related to the variability in the distribution of sample means.

Let's see how the shape of the original population is related to the distribution of sample means and how the sample size affects that relationship.

Change the population shape option from "Normal" to "Uniform." Set the sample size to N=2 on the 3rd graph and N=5 on the 4th graph. Click the "10,000" button.

WebCT 2) How does the shape of the distribution of means (when N=2) compare to the shape of the original population?
WebCT 3) How does the shape of the distribution of means (when N=5) compare to the shape of the original population?

Change the population shape to "Skewed." Set the sample size to N=2 on the 3rd graph and N=25 on the 4th graph. Click the "10,000" button.

WebCT 4) How does the shape of the distribution of means (when N=2) compare to the shape of the original population?
WebCT 5) How does the shape of the distribution of means (when N=25) compare to the shape of the original population?

The applet allows you to use your mouse to draw a custom population of any shape you like. You can play around with it and change the sample sizes to further explore the relationship between the original population shape and the shape of the distribution of sample means.
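If the applet will not run, the same experiment can be approximated in a few lines of Python. This is only a rough sketch and not the applet's actual code: the three populations and their parameters (a normal with mean 16 and SD 5, a uniform from 0 to 32, and an exponential with mean 8 standing in for "Skewed") are assumptions chosen for illustration, and the function name `simulate_sample_means` is hypothetical.

```python
import random
import statistics


def simulate_sample_means(draw, n, reps=10_000, seed=0):
    """Draw `reps` random samples of size `n` and return their means.

    `draw` is a function that takes a random.Random instance and
    returns one score from the population.
    """
    rng = random.Random(seed)
    return [
        statistics.fmean([draw(rng) for _ in range(n)])
        for _ in range(reps)
    ]


# Assumed populations; the applet's real parameters are not given in the lab.
populations = {
    "normal": lambda rng: rng.gauss(16, 5),
    "uniform": lambda rng: rng.uniform(0, 32),
    "skewed": lambda rng: rng.expovariate(1 / 8),
}

for name, draw in populations.items():
    for n in (2, 5, 25):
        means = simulate_sample_means(draw, n)
        print(
            f"{name:8s} n={n:2d} "
            f"mean of means={statistics.fmean(means):6.2f} "
            f"SD of means={statistics.stdev(means):5.2f}"
        )
```

Each call mirrors one press of the "10,000" button. To compare shapes rather than just spread, the returned list of means can be passed to any histogram tool.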

Standard Error

The standard deviation of the distribution of sample means is called the standard error. The standard error is influenced by two factors: the variability of the population ($\sigma$) and the sample size (n).

We'll consider each of these factors below:

    (A) the variability of the population - the bigger the variability of the population, the more variability you'll have in the sample means.

    large $\sigma$: big differences from the population mean

    small $\sigma$: small differences from the population mean

    (B) the size of the sample - the larger your sample size (n), the more accurately the sample represents the population. This is known as the Law of large numbers.

      Think of it this way:
      - If I randomly selected 1 score, how accurately would that score predict the population's mean?
      - Suppose that I take 5 scores. Are things more accurate?
      - What about 100 scores?

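These two characteristics are combined in the formula for the standard error.

standard error of $\bar{X}$ = $\sigma_{\bar{X}} = \sigma / \sqrt{n}$, where $\sigma$ is the population standard deviation and n is the sample size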
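As a quick illustration of how the two factors combine, here is a small Python sketch of the standard error calculation (population standard deviation divided by the square root of the sample size). The function name `standard_error` is just a label chosen for this example, and the value 15 is borrowed from the IQ example later in the lab.

```python
import math


def standard_error(sigma, n):
    """Standard error of the mean: population SD divided by sqrt(n)."""
    if n < 1:
        raise ValueError("sample size n must be at least 1")
    return sigma / math.sqrt(n)


# With the population SD held at 15, larger samples give a smaller standard error.
for n in (1, 4, 9, 25, 100):
    print(n, standard_error(15, n))
```

Note that with n = 1 the standard error is the same as the population standard deviation, which matches the "1 score" case in the list above.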

WebCT 6) Answer the WebCT question about standard errors.



Properties of the distribution of sample means

  • Mean:
      The average of all of the sample means will equal the mean of the population. The average of all of the sample means is called the expected value of the mean. It is "expected" because, on average, a sample mean should equal the population mean $\mu$.
  • Variability:
      The standard deviation of the distribution of sample means is called the standard error of the mean. 

  • Shape:
      The shape of the distribution of sample means tends to be normal. In fact, when n is large (around 30 or more), the distribution of sample means is almost perfectly normal. Does this make sense? It should. If you select a bunch of samples from the same population, most of the means should "pile up" near the population mean $\mu$ (if they don't, then you must have some kind of bias in your sampling).
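The mean and variability properties can be verified exactly on the small population 2, 4, 6, 8 used earlier in this lab. The sketch below is an illustration only: it recomputes all 16 sample means for n = 2 and compares their mean and standard deviation with the population mean and with the population standard deviation divided by the square root of n.

```python
import math
import statistics
from itertools import product

population = [2, 4, 6, 8]
n = 2

# Means of all 16 possible samples of size 2 (sampling with replacement).
sample_means = [statistics.fmean(s) for s in product(population, repeat=n)]

mu = statistics.fmean(population)       # population mean
sigma = statistics.pstdev(population)   # population standard deviation

expected_value = statistics.fmean(sample_means)   # mean of the sample means
standard_error = statistics.pstdev(sample_means)  # SD of the sample means

print(expected_value, mu)
print(standard_error, sigma / math.sqrt(n))
```

Both pairs of printed values agree (up to floating-point rounding). The shape property is not tested here, since with n = 2 and four scores the distribution of means is symmetric but only roughly bell-shaped.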

Central limit theorem

All of these properties (shape, mean, variability) are covered in the Central Limit Theorem.
    Central Limit Theorem: For any population with mean $\mu$ and standard deviation $\sigma$, the distribution of sample means for sample size n will approach a normal distribution with a mean of $\mu$ and a standard deviation of $\sigma / \sqrt{n}$ as n approaches infinity.
      Note: for practical purposes, if the original population is normal, the distribution of sample means is going to be normal no matter what sample size is used. For most non-normal distributions, the distribution of sample means comes very close to normal when n > 30 (that is, for sample sizes larger than 30).



Using the Distribution of Sample Means to Determine Sample Likelihood

Often we are not interested in where individuals fall in population distributions but rather where a sample is in the distribution of sample means. We might have data from a sample and we might wonder if it really came from a particular population or if it is from a different population. We can calculate the probability that a sample came from a particular population using the properties of the normal curve and the z-score distribution.

Remember that, from the Central Limit Theorem, we know the distribution of sample means is approximately normal if n is greater than 30 OR the population is normal.

    Example:
      Consider the following situation. An instructor is interested in the IQ of her students. She has 9 students in her class and thinks that they are, on the average, really smart. What is the probability that the group of students has a mean greater than or equal to 112?

      In other words, we don't want to know the probability of each individual having a score of 112 or better separately. Instead we want to know as a group, what is the probability of getting an average score of 112 or better.

        We need to start by getting the population parameters
          for the standardized IQ test: $\mu = 100$, $\sigma = 15$

        Next we need to get the mean and standard deviation (i.e., standard error) of the distribution of sample means so that we can calculate the z-score. (Note: we'll assume a normal distribution because the original population distribution of IQ scores is normally distributed.)

        $\mu = 100$ (because the mean of the distribution of sample means equals the population mean, which is 100).

        $\sigma_{\bar{X}} = \sigma / \sqrt{n} = 15 / \sqrt{9} = 15/3 = 5$

        Now we need to figure out the z-score that corresponds to this sample mean: the z-score formula looks exactly the same except for the new subscripts. The subscripts make no difference in the calculations. They just show that something different is being calculated (we're locating a sample in the distribution of sample means rather than finding a single score in a population):

          $z = (\bar{X} - \mu) / \sigma_{\bar{X}}$

          so for our example:

          $P(\bar{X} \geq 112) = P(Z \geq (112 - 100)/5) = P(Z \geq 2.4) = 0.0082$

          In other words, the probability that we'll get a sample of size n = 9 students with an average IQ equal to or greater than 112 is very small (0.0082). In our next labs we will extend this result to make claims concerning hypotheses about our population and our sample.

        Does this answer make sense? Let's look at the pictures of our distributions.

        Population distribution
        - at first it looks wrong

        - it seems like 112 should be less than a z = 1, because 115 is where z should equal 1

        Distribution of Sample means
        - however, we must remember that this isn't the correct distribution to be looking at, we need to look at the distribution of sample means.

        - We know that the distribution of sample means has a standard error = 5 and a mean = 100.

        - So 112 should have a z > 2 (it is 12 points, or 2.4 standard errors, above the mean).
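The IQ example can be reproduced in Python with the standard library's `statistics.NormalDist`, used here as a stand-in for the unit normal table. This is a sketch for checking the arithmetic, not part of the original lab. It also computes the probability for a single individual, to show why the population distribution is the wrong picture to look at.

```python
import math
from statistics import NormalDist

mu, sigma, n = 100, 15, 9
sample_mean = 112

# Locate the sample mean in the distribution of sample means.
standard_error = sigma / math.sqrt(n)      # 15 / 3 = 5
z_group = (sample_mean - mu) / standard_error
p_group = 1 - NormalDist().cdf(z_group)    # area above z in the unit normal

# For comparison: a single score of 112 in the population distribution.
z_single = (sample_mean - mu) / sigma
p_single = 1 - NormalDist().cdf(z_single)

print(round(z_group, 2), round(p_group, 4))    # 2.4 0.0082
print(round(z_single, 2), round(p_single, 4))  # 0.8 0.2119
```

A single person scoring 112 or higher is fairly common (about 21% of the population), but a group of 9 averaging 112 or higher is rare (under 1%).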

    Let's look at a different kind of example.
      Example:
        How high a mean would a group of 25 have to have on IQ to be in the top 10% of the IQ distribution for groups of this size?
          First we need to get the mean and standard deviation (i.e., standard error) for the distribution of sample means
            $\mu = 100$

            $\sigma_{\bar{X}} = \sigma / \sqrt{n} = 15 / \sqrt{25} = 15/5 = 3$

          Now we need to work backwards because we don't know the z-score. We can determine the z-score for the range based on the portion of the distribution we're looking for. We want the top 10% of the distribution. This corresponds to a cumulative proportion of .90 for the distribution (i.e., the top 10% is higher than 90% or .90 of the distribution). If we use the NORMINV function in Excel (select any cell and type "=NORMINV(.9,0,1)" [without any quotes]), we find that .90 corresponds to a z-score of about 1.28.

          So for our example:

            step 1: look at the unit normal table for the z-score that cuts off the top 10% (z = 1.28)

            step 2: work backwards through the z-score formula to solve for $\bar{X}$

            $\bar{X} = z \cdot \sigma_{\bar{X}} + \mu = (1.28)(3) + 100 = 103.84$

          so, for a group of 25 people, they'd have to have a mean of just under 104 to be in the top 10%.

          Note: an easier way to solve this is to type the mean and standard error of the distribution of sample means directly into the NORMINV function in Excel like this: =NORMINV(.9,100,3)
          An even easier way is to use the spreadsheet I made. Set the option box on the left to "Sample Mean" rather than "Single Score." Then specify the sample size. From there the spreadsheet works exactly as before.
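For anyone working outside Excel, the same "work backwards" calculation can be sketched in Python. Here `NormalDist.inv_cdf` from the standard library plays the role that NORMINV plays in the lab. Both routes are shown: going through the z-score, and passing the mean and standard error directly.

```python
import math
from statistics import NormalDist

mu, sigma, n = 100, 15, 25
standard_error = sigma / math.sqrt(n)   # 15 / 5 = 3

# Route 1: find the z-score with 90% of the distribution below it,
# then convert it back to a sample mean.
z_cutoff = NormalDist().inv_cdf(0.90)
cutoff_mean = mu + z_cutoff * standard_error

# Route 2: give the mean and standard error directly,
# like =NORMINV(.9,100,3) in Excel.
cutoff_direct = NormalDist(mu, standard_error).inv_cdf(0.90)

print(round(z_cutoff, 2), round(cutoff_mean, 2))  # 1.28 103.84
print(round(cutoff_direct, 2))                    # 103.84
```

The two routes give the same cutoff, which is the point of the note above: the z-score step is a convenience for table lookups, not a separate calculation.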



      Now try some on your own on WebCT questions 7 through 9.

      WebCT 7 through 9) Answer 3 questions about group psychotherapy.
      Answer additional review questions.