Lab 8

Correlation


So far we've spent most of our time looking at how a single variable is distributed. As researchers, however, we are usually more interested in how two (or more) variables are related to one another, that is, how they act together. For example, suppose that we're interested in variables related to a person's extraversion. Our first step is to identify a variable that we think might be related (e.g., number of friends, use of drugs, success in sales, and so forth), and then we examine how the distributions of these variables co-vary with one another. By co-vary, I mean that as the values of extraversion go up, the values of the other variable go up (or down, depending on the variable in question). Variance measures how much the values of a variable deviate from the mean. Covariance measures how much a pair of variables tends to deviate from their means in the same direction. For example, if extraversion and cigarette smoking are positively related, we should expect people who score well above the mean on extraversion to smoke more cigarettes than average, and people who score low on extraversion to smoke fewer than average.

Scatterplots

A scatterplot shows the relationship between two quantitative variables measured on the same individuals.

The values of one variable appear on one axis and the values of the other on the other axis. A point on the scatter plot represents the values of each variable for a particular individual. Note: if you have an experiment in which you've declared a response variable and an explanatory variable, always plot the response variable (Y) on the vertical axis and the explanatory variable (X) on the horizontal axis.

Consider the following example:

Data Set

Person  X       Y
A       1       1
B       1       3
C       3       2
D       4       5
E       6       4
F       7       5

Scatterplot

[Scatterplot of the data set: X on the horizontal axis, Y on the vertical axis]

Notice that each dot represents a single individual. The location of the dot is determined by the values of the two variables for that individual.
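
To see how each individual's pair of scores becomes a single point, here is a minimal sketch in Python (not part of the SPSS workflow for this lab) that draws a rough text version of the scatterplot from the data set above. The function name text_scatterplot and the grid size are illustrative choices, and the sketch assumes small non-negative whole-number scores with no two people sharing the same point.

```python
# Example data from the Data Set table above: (X, Y) for persons A-F.
data = {"A": (1, 1), "B": (1, 3), "C": (3, 2), "D": (4, 5), "E": (6, 4), "F": (7, 5)}


def text_scatterplot(points, width=8, height=6):
    """Return a crude text scatterplot: X runs left to right, Y bottom to top.

    Each person is drawn as their own letter. Assumes whole-number scores
    with 0 <= X < width and 0 <= Y < height, and one person per cell.
    """
    grid = [["." for _ in range(width)] for _ in range(height)]
    for label, (x, y) in points.items():
        grid[y][x] = label
    # List the highest Y first so that Y increases upward, as on a real plot.
    rows = [f"{y} " + " ".join(grid[y]) for y in range(height - 1, -1, -1)]
    # Bottom line: the X axis labels.
    rows.append("  " + " ".join(str(x) for x in range(width)))
    return "\n".join(rows)


print(text_scatterplot(data))
```

Reading the printed grid, person A sits at the lower left (low X, low Y) and person F at the upper right (high X, high Y), which is the same information the dots on a real scatterplot carry.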

To interpret a scatterplot we should look at its form, direction, and strength.

Form refers to how the scores cluster together.

Is the relationship linear or non-linear (we'll pretty much restrict our discussions to linear relationships)? A linear relationship is one that can be described as more or less following a straight line. A non-linear relationship is one in which there is a clear relationship but it does not follow a straight line. Examples of non-linear relationships are parabolas, ellipses, logarithms, trigonometric functions, and hyperbolas. This course will not discuss non-linear relationships other than to note that they exist.

Direction refers to the kind of relationship.


Variables are positively associated when above-average values of one variable tend to accompany above-average values of the other (and likewise for below-average values).
Variables are negatively associated when above-average values of one variable tend to accompany below-average values of the other.
There is no association when the scatterplot shows no apparent pattern.

 

Strength refers to how tightly clustered the points are. If the relationship is linear, strength refers to how close the points are to the best-fitting line (i.e., the line that is closest to all of the points in the graph) passing through the data.

In the above example, we have a fairly linear relationship, the association is positive, and the points are fairly tightly clustered without any outliers.
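
One rough way to check direction without looking at a graph is to count how many individuals fall on the same side of both means and how many fall on opposite sides. The Python sketch below does this for the example data. The helper name count_agreements is made up for illustration, and this counting is only an informal check of direction, not the correlation itself.

```python
# Example data from the Data Set table earlier in this lab (persons A-F).
xs = [1, 1, 3, 4, 6, 7]
ys = [1, 3, 2, 5, 4, 5]


def count_agreements(xs, ys):
    """Count individuals on the same side vs. opposite sides of the two means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    same = opposite = 0
    for x, y in zip(xs, ys):
        product = (x - mean_x) * (y - mean_y)
        if product > 0:      # both above average, or both below average
            same += 1
        elif product < 0:    # one above average, the other below
            opposite += 1
        # A product of exactly 0 (a score equal to its mean) counts as neither.
    return same, opposite


same, opposite = count_agreements(xs, ys)
print(f"same side of both means: {same}, opposite sides: {opposite}")
```

For this data set all six people land on the same side of both means, which agrees with the positive association seen in the scatterplot.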

Match the following graphs to the descriptions:


Blackboard 1) Which graph shows a perfect positive correlation?
Blackboard 2) Which graph shows a perfect negative correlation?
Blackboard 3) Which graph shows a strong but less than perfect positive correlation?
Blackboard 4) Which graph shows a strong but less than perfect negative correlation?
Blackboard 5) Which graph shows the correlation that would be closest to 0?

 

Creating scatterplots in SPSS

There is a new way to make scatterplots in SPSS, but I find it unnecessarily confusing. Fortunately, the old method can still be found under Graphs >> Legacy Dialogs >> Scatter/Dot.


Select "Simple Scatter"


To assign your X and Y variables, select them from the variable listing and move them into the X Axis and Y Axis fields.

You can also mark the individual data points by adding a categorical variable to the "Set Markers by" field.

 

This will open up an output window and display your scatterplot.

For this part of the lab you will need to download the data file height.sav.

Blackboard 6) Make a scatterplot with height on the Y-axis and weight on the X-axis.  Which of the 3 graphs in Blackboard most closely matches your plot?
Blackboard 7) Make a scatterplot with height on the Y-axis and smoke15 (Average cigarettes per day smoked from age 10 to 15) on the X-axis.  Which of the 3 graphs below most closely matches your plot?
Blackboard 8) Make a scatterplot with height on the Y-axis and income on the X-axis.  Which of the 3 graphs below most closely matches your plot?

Blackboard 9) Of weight, smoke15, and income, which has the strongest positive relationship with height?
Blackboard 10) Of weight, smoke15, and income, which has the strongest negative relationship with height?


 

Computing Pearson's Correlation Coefficient (r)


Scatterplots display the form, direction, and strength of a relationship. However, it is often difficult to "see" the strength of a relationship. By adding some numerical descriptions to our scatterplots, our understanding of the relationship between the variables becomes much clearer.

Correlation is a statistical technique that measures and describes the relationship between two variables. The value of the correlation gives us a measure of strength, and the sign (positive or negative) gives us a measure of direction.

Properties of the correlation:

Computing the correlation has 3 parts

Parts 1 and 2: Variability of X and Y separately. We'll use the Sum of Squares as a measure of variability for X and for Y (that is, SSX for variable X and SSY for variable Y).

SSX is the sum of the squared deviations of each X from the mean of the X's:

$$SS_X = \sum (X - \bar{X})^2$$

SSY is the sum of the squared deviations of each Y from the mean of the Y's:

$$SS_Y = \sum (Y - \bar{Y})^2$$



Part 3: Covariability of X and Y. We'll call this the Sum of the Products (SP):

$$SP = \sum (X - \bar{X})(Y - \bar{Y})$$

What this means is that for each individual (each point on the scatterplot) we figure out how far X deviates from the mean of the X's and how far Y deviates from the mean of the Y's. Then we multiply these two deviations together and add up the products across all individuals. This gives us a measure of how much X and Y are varying together (or how much they covary).
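
To make the deviation products concrete, here is a small Python sketch that computes them for the six-person example data set from the scatterplot section (not the practice table later in this lab, so it does not give away any answers). The function name deviation_products is an illustrative choice.

```python
# Example data from the Data Set table earlier in this lab (persons A-F).
people = ["A", "B", "C", "D", "E", "F"]
xs = [1, 1, 3, 4, 6, 7]
ys = [1, 3, 2, 5, 4, 5]


def deviation_products(xs, ys):
    """Return each individual's (X deviation) * (Y deviation)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]


products = deviation_products(xs, ys)
sp = sum(products)  # SP: the Sum of the Products

for person, product in zip(people, products):
    print(f"{person}: {product:6.2f}")
print(f"SP = {sp:.2f}")
```

Every product is positive here because each person is either above both means or below both means, so the products add up to a positive SP.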

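So now we have the top and bottom parts of the equation: SP goes in the numerator, and SSX and SSY go in the denominator. There is one more detail. The values in the denominator are sums of squared deviations, so we need to take the square root of their product. This leaves us with the following formula:

$$r = \frac{SP}{\sqrt{SS_X \, SS_Y}}$$
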
This is the formula for the Pearson Correlation Coefficient. It is symbolized with the letter r when referring to a sample statistic and the Greek letter rho (ρ) when referring to a population parameter.
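
As a check on the three parts, the following Python sketch computes SSX, SSY, SP, and r for the six-person example data set from the scatterplot section. The function names are illustrative, and the sketch assumes that neither variable is constant (otherwise a Sum of Squares is 0 and the division fails).

```python
import math

# Example data from the Data Set table earlier in this lab (persons A-F).
xs = [1, 1, 3, 4, 6, 7]
ys = [1, 3, 2, 5, 4, 5]


def sum_of_squares(values):
    """SS: sum of squared deviations of each score from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def sum_of_products(xs, ys):
    """SP: sum over individuals of (X deviation) * (Y deviation)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))


def pearson_r(xs, ys):
    """Pearson correlation: SP divided by the square root of SSX * SSY."""
    return sum_of_products(xs, ys) / math.sqrt(sum_of_squares(xs) * sum_of_squares(ys))


print(f"SSX = {sum_of_squares(xs):.2f}")
print(f"SSY = {sum_of_squares(ys):.2f}")
print(f"SP  = {sum_of_products(xs, ys):.2f}")
print(f"r   = {pearson_r(xs, ys):.3f}")
```

For this data set r comes out at about .77, a fairly strong positive correlation, which fits the description of the scatterplot earlier in the lab.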


Okay, let's consider the following set of data:


Our first step should be to make the scatterplot, but to save time we will skip this step.

Our second step is to compute the correlation coefficient r. We'll start by computing the SP.

Make a table that looks like the one below and complete the missing blanks (feel free to use a calculator). 


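         X     Y     X - Mean   Y - Mean   (X - Mean)(Y - Mean)   (X - Mean)^2   (Y - Mean)^2
         0     1     ____       ____       ____                   ____           ____
         10    3     ____       ____       ____                   ____           ____
         4     1     ____       ____       ____                   ____           ____
         8     2     ____       ____       ____                   ____           ____
         8     3     ____       ____       ____                   ____           ____
Sums     30    10    ____       ____       ____                   ____           ____
Means    6     2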

Blackboard 11) Calculate the SP.
Blackboard 12) Calculate the SSX.
Blackboard 13) Calculate the SSY.
Blackboard 14) Calculate the Pearson correlation (r).

 

Using SPSS to compute the correlation coefficient

Under the Analyze menu you will find the Correlate submenu.

From the Correlate submenu, select "Bivariate".


In the bivariate correlation window, select the variables that you want correlated (you can have more than two at a time). For today's lab, make sure that Pearson is selected (the others are other kinds of correlations).


The output that you get is a correlation matrix. It correlates each variable with every variable (including itself). You should notice that the table contains redundant information (e.g., you'll find an r for height correlated with weight and an r for weight correlated with height; these two values are identical).
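
The redundancy in a correlation matrix can be seen in a small Python sketch. It builds a two-variable matrix from the six-person X and Y example data used earlier in this lab (not from height.sav), and the layout is only a rough imitation of the SPSS table, not its actual output.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: SP divided by the square root of SSX * SSY."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ssx = sum((x - mean_x) ** 2 for x in xs)
    ssy = sum((y - mean_y) ** 2 for y in ys)
    return sp / math.sqrt(ssx * ssy)


# Example data from the Data Set table earlier in this lab (persons A-F).
variables = {"X": [1, 1, 3, 4, 6, 7], "Y": [1, 3, 2, 5, 4, 5]}
names = list(variables)

# Correlate every variable with every variable, including itself.
matrix = {a: {b: pearson_r(variables[a], variables[b]) for b in names} for a in names}

print("    " + "".join(f"{name:>8}" for name in names))
for a in names:
    print(f"{a:<4}" + "".join(f"{matrix[a][b]:8.3f}" for b in names))
```

The diagonal entries are 1 because each variable correlates perfectly with itself, and the two off-diagonal entries are the same number, so only one of them needs to be read.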

In SPSS you'll also get some additional information in the correlation matrix.

For now you can ignore the "Sig. (2-tailed)" rows. N is simply the number of cases in your data set.


So in the correlation matrix above, height and weight have an r = .794. This is a fairly strong positive correlation.


Now let's do a problem in SPSS with the same dataset as we used with the scatterplots. 

Blackboard 15)   What is the correlation of the average height of parents (avgphgt) and height?
Blackboard 16)   What is the correlation of average cigarettes per day smoked from age 10 to 15 (smoke15) and height?