So far we've spent most of our time looking at how a single variable is distributed. However, as researchers, we are usually more interested in how two (or more) different variables may be related to one another, that is, how they act together. For example, suppose that we're interested in variables related to a person's extraversion. Our first step is to identify a variable that we think might be related (e.g., number of friends, use of drugs, success in sales, and so forth), and then we examine how the distributions of these variables co-vary with one another. By co-vary, I mean that as the values of extraversion go up, the values of the other variable go up (or down, depending on the variable in question).
Variance measures how much the values of a variable deviate from the mean. Covariance measures how much a pair of variables tends to deviate from their means in the same direction. For example, if we expect that extraversion and cigarette smoking are positively related, we should expect to see that people who score well above the mean on extraversion also smoke more cigarettes than average, and that people who score low on extraversion smoke less than average.
Scatterplots
A scatterplot shows the relationship between two quantitative variables measured on the same individuals.
The values of one variable appear on one axis and the values of the other variable on the other axis. A point on the scatterplot represents the values of both variables for a particular individual. Note: if you have an experiment in which you've declared a response variable and an explanatory variable, always plot the response variable (Y) on the vertical axis and the explanatory variable (X) on the horizontal axis.
Consider the following example:
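Data Set

| Person | X | Y |
|---|---|---|
| A | 1 | 1 |
| B | 1 | 3 |
| C | 3 | 2 |
| D | 4 | 5 |
| E | 6 | 4 |
| F | 7 | 5 |

[Scatterplot of the data set: X on the horizontal axis, Y on the vertical axis]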
Notice that each dot represents a single individual. The location of the dot is determined by the values of the two variables for that individual.
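If you'd like to see how each dot gets its location, here is a minimal sketch in Python (not SPSS) that draws the six individuals from the data set above on a plain text grid. The function name `text_scatterplot` and the grid layout are my own choices for illustration; this is not how SPSS draws its plots.

```python
# Data set from the example above: person -> (X, Y)
data = {"A": (1, 1), "B": (1, 3), "C": (3, 2),
        "D": (4, 5), "E": (6, 4), "F": (7, 5)}


def text_scatterplot(points, x_max, y_max):
    """Return the rows of a text scatterplot; each person is drawn as a letter."""
    # grid[y][x] holds the label plotted at (x, y); "." marks an empty spot
    grid = [["." for _ in range(x_max + 1)] for _ in range(y_max + 1)]
    for label, (x, y) in points.items():
        grid[y][x] = label
    rows = []
    # Emit the largest Y first so the vertical axis increases upward
    for y in range(y_max, -1, -1):
        rows.append(f"{y} | " + " ".join(grid[y]))
    # Horizontal axis labels (X values)
    rows.append("    " + " ".join(str(x) for x in range(x_max + 1)))
    return rows


plot_rows = text_scatterplot(data, x_max=7, y_max=5)
print("\n".join(plot_rows))
```

Reading the output, person D sits at X = 4 on the row for Y = 5, which is exactly the pair of values listed for D in the data set.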
To interpret a scatterplot we should look at its form, direction, and strength:
Form refers to how the scores cluster together. Is the relationship linear or non-linear (we'll pretty much restrict our discussions to linear relationships)? A linear relationship is one that can be described as more or less following a straight line. A non-linear relationship is one in which there is a clear relationship but it does not follow a straight line. Examples of non-linear relationships are parabolas, ellipses, logarithms, trigonometric functions, and hyperbolas. This course will not discuss non-linear relationships other than to note that they exist.
Direction refers to the kind of relationship.
Variables are positively associated when above-average values of one variable tend to accompany above-average values of the other variable (and the same for below-average values).
Variables are negatively associated when above-average values of one variable tend to accompany below-average values of the other variable.
There is no association when there doesn't appear to be a pattern to the scatterplot.
Strength refers to how tightly clustered the points are. If the relationship is linear, strength refers to how close the points are to the best-fitting line (i.e., the line that is closest to all of the points in the graph) passing through the data.
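One rough way to check direction without looking at the plot is to count how many individuals fall on the same side of both means. The sketch below does this in Python for the six-person example data set; the function name `direction_counts` is hypothetical, and this count is only an informal check, not a statistic that SPSS reports.

```python
# (X, Y) pairs from the example data set (persons A through F)
pairs = [(1, 1), (1, 3), (3, 2), (4, 5), (6, 4), (7, 5)]


def direction_counts(pairs):
    """Count individuals whose X and Y deviations agree or disagree in sign."""
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    same = opposite = 0
    for x, y in pairs:
        # The product is positive when both values are above average
        # or both are below average, and negative when they disagree.
        product = (x - mean_x) * (y - mean_y)
        if product > 0:
            same += 1
        elif product < 0:
            opposite += 1
    return same, opposite


same, opposite = direction_counts(pairs)
print(f"same side of both means: {same}, opposite sides: {opposite}")
```

For this data set every individual is either above average on both variables or below average on both, which is what a positive association looks like.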
In the above example, we have a fairly linear relationship, the association is positive, and the points are fairly tightly clustered without any outliers.
Match the following graphs to the descriptions:
Blackboard 1) Which graph shows a perfect positive correlation?
Blackboard 2) Which graph shows a perfect negative correlation?
Blackboard 3) Which graph shows a strong but less than perfect positive correlation?
Blackboard 4) Which graph shows a strong but less than perfect negative correlation?
Blackboard 5) Which graph shows the correlation that would be closest to 0?
Creating scatterplots in SPSS
There is a new way to make scatterplots in SPSS, but I find it unnecessarily confusing. Fortunately, the old method can still be found under Graphs > Legacy Dialogs > Scatter/Dot.
Select "Simple Scatter".
To assign your X and Y variables, select them from the variable listing and move them into the X Axis and Y Axis fields. You can also mark the individual data points by adding a categorical variable to the "Set Markers by" field.
This will open an output window and display your scatterplot.
For this part of the lab you will need to download the data file height.sav.
Blackboard 6) Make a scatterplot with height on the Y-axis and weight on the X-axis. Which of the 3 graphs in Blackboard most closely matches your plot?
Blackboard 7) Make a scatterplot with height on the Y-axis and smoke15 (average cigarettes per day smoked from age 10 to 15) on the X-axis. Which of the 3 graphs below most closely matches your plot?
Blackboard 8) Make a scatterplot with height on the Y-axis and income on the X-axis. Which of the 3 graphs below most closely matches your plot?
Blackboard 9) Of weight, smoke15, and income, which has the strongest positive relationship with height?
Blackboard 10) Of weight, smoke15, and income, which has the strongest negative relationship with height?
Computing Pearson's Correlation Coefficient (r)
Scatterplots display the form, direction, and strength of a relationship. However, it is often difficult to "see" the strength of a relationship. Adding numerical descriptions to our scatterplots makes our understanding of the relationship between the variables much clearer.
Correlation is a statistical technique that measures and describes the relationship between two variables. The size of the correlation (how far it is from 0) gives us a measure of strength, and the sign (positive or negative) gives us a measure of direction.
Properties of the correlation:
Computing the correlation has 3 parts.
Parts 1 and 2: Variability of X and Y separately. We'll use the Sum of Squares as a measure of variability for X and for Y (that is, SSX for variable X and SSY for variable Y).
$$SS_X = \sum (X - \bar{X})^2$$
SSX is the sum of the squared deviations of each X from the mean of the X's.
$$SS_Y = \sum (Y - \bar{Y})^2$$
SSY is the sum of the squared deviations of each Y from the mean of the Y's.

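Part 3: Variability of X and Y together. We'll use the Sum of Products (SP) of the deviations.
$$SP = \sum (X - \bar{X})(Y - \bar{Y})$$
What this means is that for each individual (each point on the scatterplot) we figure out how much X deviates from the mean of the X's and how much Y deviates from the mean of the Y's. Then we multiply these two deviations together and add up the products across individuals. This gives us a measure of how much X and Y are varying together (or how much they covary).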
So now we have the top and bottom parts of the equation, except for one detail. The terms in the denominator are sums of squared deviations, so we need to take the square root of their product. This leaves us with the following formula:
$$r = \frac{SP}{\sqrt{SS_X \, SS_Y}}$$
This is the formula for the Pearson Correlation Coefficient. It is symbolized with the letter r when referring to a sample statistic and the Greek letter rho (ρ) when referring to a population parameter.
Okay, let's consider the following set of data (the X and Y values are listed in the table below):
Our first step should be to make the scatterplot, but to save time we will skip this step.
Our second step is to compute the correlation coefficient r. We'll start by computing the SP.
Make a table that looks like the one below and fill in the blanks (feel free to use a calculator).
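| | X | Y | $X - \bar{X}$ | $Y - \bar{Y}$ | $(X - \bar{X})(Y - \bar{Y})$ | $(X - \bar{X})^2$ | $(Y - \bar{Y})^2$ |
|---|---|---|---|---|---|---|---|
| | 0 | 1 | | | | | |
| | 10 | 3 | | | | | |
| | 4 | 1 | | | | | |
| | 8 | 2 | | | | | |
| | 8 | 3 | | | | | |
| Sums | 30 | 10 | | | | | |
| Means | 6 | 2 | | | | | |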
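Once you've filled in the table by hand, you can check your arithmetic with a short script. The sketch below is in Python rather than SPSS and assumes the usual form of Pearson's r, that is, SP divided by the square root of SSX times SSY. The variable names (`sp`, `ss_x`, `ss_y`) are my own.

```python
import math

# X and Y values from the table above
xs = [0, 10, 4, 8, 8]
ys = [1, 3, 1, 2, 3]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# One row per individual:
# X, Y, X deviation, Y deviation, product, squared X deviation, squared Y deviation
rows = []
for x, y in zip(xs, ys):
    dx = x - mean_x
    dy = y - mean_y
    rows.append((x, y, dx, dy, dx * dy, dx ** 2, dy ** 2))

sp = sum(row[4] for row in rows)    # Sum of Products
ss_x = sum(row[5] for row in rows)  # Sum of Squares for X
ss_y = sum(row[6] for row in rows)  # Sum of Squares for Y
r = sp / math.sqrt(ss_x * ss_y)

for x, y, dx, dy, prod, dx2, dy2 in rows:
    print(f"{x:>3} {y:>3} {dx:>6.1f} {dy:>6.1f} {prod:>6.1f} {dx2:>6.1f} {dy2:>6.1f}")
print(f"SP = {sp}, SSX = {ss_x}, SSY = {ss_y}, r = {r}")
```

Each printed row corresponds to one row of the table, so you can compare column by column before comparing the final value of r.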
Using SPSS to compute the correlation coefficient
Under the Analyze menu you will find the Correlate submenu. From the Correlate submenu, select "Bivariate".
In the Bivariate Correlations window, select the variables that you want correlated (you can have more than two at a time). For today's lab, make sure that Pearson is selected (the other options are other kinds of correlations).
The output that you get is a correlation matrix. It correlates each variable with every variable (including itself). You should notice that the table contains redundant information (e.g., you'll find an r for height correlated with weight and an r for weight correlated with height; these two values are identical).
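To see why the matrix is redundant, here is a small Python sketch (not SPSS output) that builds a correlation matrix for the X and Y values from the hand-calculation table earlier in this lab. It assumes the usual form of Pearson's r, SP divided by the square root of SSX times SSY, and the function name `pearson_r` and the dictionary layout are hypothetical choices for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r computed as SP / sqrt(SSX * SSY)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp / math.sqrt(ss_x * ss_y)


# X and Y values from the hand-calculation table
variables = {"X": [0, 10, 4, 8, 8], "Y": [1, 3, 1, 2, 3]}

# Correlate every variable with every variable, including itself
matrix = {
    row: {col: pearson_r(variables[row], variables[col]) for col in variables}
    for row in variables
}

for row, cells in matrix.items():
    print(row, " ".join(f"{value:6.3f}" for value in cells.values()))
```

Each variable correlates perfectly with itself, so the diagonal is all 1s, and the X-with-Y entry equals the Y-with-X entry, so you only need to read one half of the matrix.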
In SPSS you'll also get some additional information in the correlation matrix. For now you can ignore the "Sig. (2-tailed)" values. N is simply the number of cases in your data set.
So in the correlation matrix above, height and weight have an r = .794. This is a fairly strong positive correlation.
Now let's do a problem in SPSS with the same dataset as we used with the scatterplots.
Blackboard 15) What is the correlation of the average height of parents (avgphgt) and height?
Blackboard 16) What is the correlation of average cigarettes per day smoked from age 10 to 15 (smoke15) and height?