KITADA

Lesson #12

Correlation and the Correlation Coefficient

Motivation:

One of the patterns we look for in a scatterplot is the direction and strength of the relationship between two quantitative variables. Correlation describes the strength and direction of this relationship as long as the relationship is linear. We can assign a value that measures the strength and direction of the linear relationship between two quantitative variables: the correlation coefficient. We’ll concentrate on the correlation coefficient in this lesson.

What you need to know from this lesson: After completing this lesson, you should be able to

interpret the correlation coefficient
estimate the correlation coefficient between two quantitative variables
match a scatterplot with its correlation coefficient
use technology to obtain the correlation coefficient
explain what affects the correlation coefficient

To accomplish the above “What You Need to Know”, do the following:

1. Attend lecture and answer the questions on the following pages of this lesson.
2. Read Section 2.5 in the text (pages 106 – 112)
3. Do the Lesson 12 questions at the end of the lesson notes

The Lesson

1. What does the correlation coefficient measure?

It measures the strength and direction of the linear relationship between two quantitative variables.

2. What is the symbol for the sample correlation coefficient? What is the symbol for the population correlation coefficient?

sample correlation = r

population correlation = \( \rho \)

3. The formula for the sample correlation coefficient, r, is:

SEE LESSON HANDOUT
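
The handout's formula is not reproduced in these notes. As a rough sketch, one standard textbook form (the handout may write it in a different but equivalent way) is the sum of the products of the standardized values divided by n - 1: \( r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) \). The R code below computes r this way and compares it with the built-in cor(). The vectors x and y are made-up illustrative values, not data from this lesson.

```r
# Made-up illustrative data (NOT from the lesson's datasets)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 2.9, 4.2, 4.8, 6.1, 6.8)

n <- length(x)

# Standardize each variable with its sample mean and sample standard deviation
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)

# r is the sum of the products of the z-scores, divided by n - 1
r_by_hand <- sum(z_x * z_y) / (n - 1)

# The built-in function should agree up to rounding error
r_builtin <- cor(x, y)

print(c(by_hand = r_by_hand, builtin = r_builtin))
```

Because each variable is standardized before the products are taken, the units cancel, which is why r has no units.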

4. Properties of the correlation coefficient:

1) \( -1 \leq r \leq 1 \)
2) the sign of the correlation coefficient indicates the direction of the association
3) a correlation coefficient of zero indicates no linear relationship (a strong non-linear relationship may still be present)
4) a correlation coefficient of +1 or –1 indicates a perfect linear association
5) the correlation coefficient is a measure of the strength of a linear association. It should not be used when the relationship is non-linear!
6) the correlation coefficient does not make a distinction between the response variable and the explanatory variable. That is, the correlation of x with y is the same as the correlation of y with x.
7) there are no units for the correlation coefficient
8) the correlation coefficient is sensitive to outliers
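
Several of these properties can be checked numerically. The following is a minimal sketch in R, assuming made-up vectors x and y (illustrative values only, not data from this lesson). It checks properties 2, 6, 7, and 8.

```r
# Made-up illustrative data (NOT from the lesson's datasets)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(3, 5, 4, 7, 8, 8, 10, 12)

r_xy <- cor(x, y)

# Property 6: swapping the roles of x and y does not change r
r_yx <- cor(y, x)

# Property 7: changing units (a positive linear rescaling) does not change r
r_rescaled <- cor(x * 2.54, y / 1000 + 5)

# Property 2: reversing the direction of one variable flips the sign of r
r_flipped <- cor(x, -y)

# Property 8: a single outlying point can change r substantially
x_out <- c(x, 20)
y_out <- c(y, 0)
r_with_outlier <- cor(x_out, y_out)

print(round(c(r_xy = r_xy, r_yx = r_yx, r_rescaled = r_rescaled,
              r_flipped = r_flipped, r_with_outlier = r_with_outlier), 3))
```

In this toy example the one added point (20, 0) is enough to turn a strong positive correlation into a negative one, which is why it is worth looking at the scatterplot before trusting r.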

5. Draw a scatter plot that represents data with the following correlation coefficients:

SEE LESSON HANDOUT

6. Match the following scatterplots with the correct correlation coefficients.

Choices for correlation coefficient: 0.834, -0.993, 0.460, 0.215, -0.275

SEE LESSON HANDOUT

7. Comment on each of the following scatterplots (linearity, outliers, etc.). Also, estimate r.

a. Is there an association between the number of wins in a season for a major league baseball team and its team batting average? The graph on the right shows the relationship between batting average and number of wins for the 30 major league baseball teams for the 2011 regular season.

library(dplyr)

BASEBALL<-read.csv("/Users/heatherhisako1/Desktop/Teaching/ST352_Summer16/BASEBALL.csv",
                header=TRUE)

names(BASEBALL)

## [1] "AVG"    "WINS"   "LEAGUE"

N_MLB<-BASEBALL%>%
  filter(LEAGUE=="N")

A_MLB<-BASEBALL%>%
  filter(LEAGUE=="A")

### CORRELATION FOR ENTIRE DATASET
with(BASEBALL, cor(AVG, WINS))

## [1] 0.331612

## OVERALL TREND
plot(BASEBALL$AVG, BASEBALL$WINS, type="n", xlab="Batting Average", 
     ylab="Wins")
points(N_MLB$AVG, N_MLB$WINS, col="red", pch=15)
points(A_MLB$AVG, A_MLB$WINS, pch=16)

abline(coefficients(lm(BASEBALL$WINS~BASEBALL$AVG)), 
       lwd=3, col="blue", lty=2)

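[Figure: scatterplot of wins versus batting average for all 30 teams, with the overall least-squares line (dashed blue); National League teams are red squares, American League teams are black circles]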

Each team belongs to one of two leagues: the National League or the American League. Different symbols are used for the two different leagues. For each league, what is your estimate of the correlation coefficient? Based on this, would it be better to analyze each league separately or all together?

### CORRELATION AMERICAN AND NATIONAL 
N_MLB<-BASEBALL%>%
  filter(LEAGUE=="N")

with(N_MLB, cor(AVG, WINS))

## [1] 0.071264

A_MLB<-BASEBALL%>%
  filter(LEAGUE=="A")

with(A_MLB, cor(AVG, WINS))

## [1] 0.5547613

## FIT SEPARATE LINES FOR EACH LEAGUE
plot(BASEBALL$AVG, BASEBALL$WINS, type="n", xlab="Batting Average", 
     ylab="Wins")
points(N_MLB$AVG, N_MLB$WINS, col="red", pch=15)
points(A_MLB$AVG, A_MLB$WINS, pch=16)

abline(coefficients(lm(N_MLB$WINS~N_MLB$AVG)), 
       lwd=3, col="red", lty=2)
abline(coefficients(lm(A_MLB$WINS~A_MLB$AVG)), 
       lwd=3, col="black", lty=2)

[Figure: the same scatterplot with a separate dashed least-squares line for each league (red for National, black for American)]
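
The per-league correlations above were obtained by filtering the data twice. As an alternative sketch, the same comparison can be made in one step with base R by splitting the rows on the grouping variable. The data frame toy below borrows the column names AVG, WINS, and LEAGUE, but its values are invented for illustration and are not the 2011 MLB data.

```r
# Toy data with the same column names as BASEBALL; the VALUES ARE INVENTED
toy <- data.frame(
  AVG    = c(0.245, 0.250, 0.255, 0.260, 0.265,
             0.250, 0.255, 0.260, 0.265, 0.270),
  WINS   = c(70, 75, 82, 88, 95,
             85, 78, 90, 72, 83),
  LEAGUE = rep(c("A", "N"), each = 5)
)

# Correlation ignoring the grouping variable
r_all <- with(toy, cor(AVG, WINS))

# Correlation within each league: split the rows by LEAGUE, then apply cor()
r_by_league <- sapply(split(toy, toy$LEAGUE),
                      function(d) cor(d$AVG, d$WINS))

print(r_all)
print(r_by_league)
```

When the within-group correlations differ noticeably from each other, as they do for the two leagues in the lesson's output, a single overall r can hide that difference.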

b. Is there a relationship between the duration of the last eruption of Old Faithful and the interval between the last eruption and the next eruption?

## FIT LINEAR MODEL
my.lm.of <- with(OLDFAITHFUL, lm(interval~duration))

with(OLDFAITHFUL, 
     plot(duration, interval, main = "Old Faithful",
          ylab="interval between eruptions (minutes)", 
          xlab="duration of current eruption (minutes)")
)
mtext("Scatterplot of interval between eruptions versus
      duration of current eruption", cex = 0.8)
abline(coefficients(my.lm.of ), 
       lwd=3, col="black", lty=2)

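[Figure: scatterplot of interval between eruptions versus duration of current eruption for Old Faithful, with the fitted least-squares line (dashed)]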

#To find the correlation coefficient between duration and interval
with(OLDFAITHFUL, cor(duration,interval))

## [1] 0.8772266

c. Is there a relationship between the sweetness index (a higher sweetness index indicates sweeter orange juice) and the amount of water-soluble pectin in orange juice?

## FIT LINEAR MODEL
my.lm.oj <- with(OJUICE, lm(sweetness.index~pectin.ppm))




#To find the correlation coefficient between pectin and sweetness index
with(OJUICE, cor(pectin.ppm,sweetness.index))

## [1] -0.4781458

removeOutliers<-OJUICE%>%
  filter(pectin.ppm<350)

#To find the correlation coefficient between pectin and sweetness index WITHOUT THE OUTLIERS
with(removeOutliers, cor(pectin.ppm,sweetness.index))

## [1] -0.3091354

with(OJUICE, 
     plot(pectin.ppm, sweetness.index, main = "Orange Juice Example",
          ylab="sweetness index", xlab="pectin in ppm")
)
mtext("Scatterplot of sweetness index versus amount of pectin", cex = 0.8)
abline(coefficients(my.lm.oj), 
       lwd=3, col="black", lty=2) 
abline(coefficients(lm(removeOutliers$sweetness.index~removeOutliers$pectin.ppm)), 
       lwd=3, col="blue", lty=2)

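[Figure: scatterplot of sweetness index versus pectin (ppm), with least-squares lines fitted to all points (dashed black) and to the data without the outliers (dashed blue)]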

d. Describe the relationship between X and Y.

my.lm.XY <- with(XYQUADRATIC, lm(y~x))

with(XYQUADRATIC, 
     plot(x, y, main = "Scatterplot of Y versus X",
          ylab="Y", xlab="X")
)
abline(coefficients(my.lm.XY), lwd=2, col="red", lty=2)

[Figure: scatterplot of Y versus X with the fitted least-squares line (dashed red)]

with(XYQUADRATIC,cor(x,y))

## [1] 0
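
A correlation of zero here does not mean X and Y are unrelated. The contents of XYQUADRATIC are not shown in these notes, so the sketch below assumes a made-up, exactly quadratic dataset that is symmetric about x = 0, to illustrate how a perfect non-linear relationship can still give r near zero.

```r
# Made-up data with an exact quadratic relationship, symmetric about x = 0
# (an assumption for illustration; NOT the XYQUADRATIC dataset itself)
x <- -3:3
y <- x^2

# y is completely determined by x, yet the linear correlation is essentially zero
r_quad <- cor(x, y)
print(r_quad)

# Restricting to x >= 0 keeps only the increasing half of the curve,
# and the correlation becomes strongly positive
r_right_half <- cor(x[x >= 0], y[x >= 0])
print(r_right_half)
```

The decreasing left half and the increasing right half cancel in the calculation of r, so r only summarizes the linear part of a relationship and should be read alongside the scatterplot.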