6.1 Correlation between Variables

In the previous section we saw how to create crosstabs tables, relating one variable with another and we computed the Chi-Square statistics to tell us if the variables are independent or not. While this type of analysis is very useful for categorical data, for numerical data the resulting tables would (usually) be too big to be useful. Therefore we need to learn different methods for dealing with numerical variables to decide whether two such variables are related.

Example: Suppose that 5 students were asked their high school GPA and their College GPA, with the answers as follows:

Student HS GPA College GPA
A 3.8 2.8
B 3.1 2.2
C 4.0 3.5
D 2.5 1.9
E 3.3 2.5

We want to know: is high school and college GPA related according to this data, and if they are related, how can I use the high school GPA to predict the college GPA?

There are two answers to give:

Casually looking at this data it seems clear that the college GPA is always worse than the high school one, and the smaller the high school GPA the smaller the college GPA. But how strong a relationship, if any, seems difficult to quantify.

We will first discuss how to compute and interpret the so-called correlation coefficient to help decide whether two numeric variables are related or not. In other words, it can answer our first question. We will answer the second question in later sections. First, let's define the correlation coefficient mathematically.

Definition of the Correlation Coefficient

If your data is given in (x,y) pairs, then compute the following quantities:

where the "sigma" symbol indicates summation and n stands for the number of data points. With these quantities computed, the correlation coefficient is defined as:

These formulas are, indeed, quiet a "hand-full" but with a little effort we can manually compute the correlation coefficient just fine.

To compute the correlation coefficient for our above GPA example we make a table containing both variables, with additional columns for their squares as well as their product as follows:

Student HS GPA
(x)
College GPA
(y)
x2 y2 x*y
A 3.8 2.8 3.82 = 14.44 2.82 = 7.84 3.8*2.8 = 10.64
B 3.1 2.2 3.12 = 9.61 2.22 = 4.84 3.1*2.2 = 6.82
C 4.0 3.5 4.02 = 16.00 3.52 = 12.25 4.0*3.5 = 14.00
D 2.5 1.9 2.52 = 6.25 1.92 = 3.61 2.5*1.9 = 4.75
E 3.3 2.5 3.32 = 10.89 2.52 = 6.25 3.3*2.5 =  8.25
Sum 16.7 12.9 57.19 34.79 44.46

The last row contains the sum of the x's, y's, x-squared, y-squared, and x*y, which are precisely the quantities that we need to compute Sxx, Syy, and Sxy. In this case we can compute these quantities as follows:

so that the correlation coefficient for this data is: 1.374 / sqrt(1.412 * 1.508) = 0.9416

Interpretation of the Correlation Coefficient

The correlation coefficient as defined above measures how strong a linear relationship exists between two numeric variables x and y. Specifically:

In the above example the correlation coefficient is very close to +1. Therefore we can conclude that there indeed is a strong positive relationship between high school GPA and college GPA in this particular example.

Using Excel to computer the Correlation Coefficient

While the table above certainly helps in computing the correlation coefficient, it is still a lot of work, especially if there are lots of (x, y) data points. Even using Excel to help compute the table seems like a lot of work. However, Excel has a convenient function to quickly compute the correlation coefficient without us having to construct a complicated table.

The Excel built-in function

=CORREL(RANGE1, RANGE2)

returns the correlation coefficient of the the cells in RANGE1 and the cells in RANGE2. All arguments should be numbers, and no cell should be empty.

Example: To use this Excel function to compute the correlation coefficient for the previous GPA example, we would enter the data and the formulas as follows:

Example:

Consider the following artificial example: some data for x and y (which have no particular meaning right now) is listed below, in a "case A", "case B", and "case C" situation.

Case A:

x = 10, y = 20
x = 20, y = 40
x = 30, y = 60
x = 40, y = 80
x = 50, y = 100

Case B:

x = 10, y = 200
x = 20, y = 160
x = 30, y = 120
x = 40, y = 80
x = 50, y = 40

Case C:

x = 10, y = 100
x = 20, y = 20
x = 30, y = 200
x = 40, y = 50
x = 50, y = 100

Just looking at this data, it seems pretty obvious that:

Indeed, using Excel to compute each correlation coefficient (we will explain the procedure below), confirms this:

Note that in "real world" data, the correlation is almost never as clear-cut as in this artificial example.

Example: In a previous section we looked at an Excel data set that shows various information about employees. Here is the spreadsheet data, but the salary is left as an actual number instead of a category (as we previously had).

Employees (numerical)

Download this file into Excel and find out whether there is a linear relationship between the salary and the years of education of an employee.

  1. Type
       =CORREL(
  2. Select the first input range (corresponding to the salary) by dragging the mouse across all cells containing numbers in the "Salary" column
  3. Type
       ,
    (a comma)
  4. Select the second input range (corresponding to the months on the job) by dragging the mouse across the "Months on the Job" column containing numbers
  5. Type
       )
    and hit RETURN

Since the correlation coefficient is 0.66 it means that there is indeed some positive relation between years of schooling and salary earnings. But since the value is not that close to +1.0, the relationship is not strong.