KITADA

Lesson #12

Correlation and the Correlation Coefficient

Motivation:

One of the patterns we look for in a scatterplot is the direction and strength of the relationship between two quantitative variables. Correlation describes the strength and direction of this relationship as long as the relationship is linear. We can assign a value that measures the strength and direction of the linear relationship between two quantitative variables: the correlation coefficient. We’ll concentrate on the correlation coefficient in this lesson.

What you need to know from this lesson: After completing this lesson, you should be able to

To accomplish the above “What You Need to Know”, do the following:

The Lesson

1. What does the correlation coefficient measure?

It measures the strength and direction of the linear relationship between two quantitative variables.

2. What is the symbol for the sample correlation coefficient? What is the symbol for the population correlation coefficient?

sample correlation = r

population correlation = \( \rho \)

3. The formula for the sample correlation coefficient, r, is:

SEE LESSON HANDOUT
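The handout's formula is not reproduced in these notes. For reference, one standard way to write the sample correlation coefficient is shown below; the notation here is an assumption and may differ from the handout's. In it, n is the number of (x, y) pairs, \( \bar{x} \) and \( \bar{y} \) are the sample means, and \( s_x \) and \( s_y \) are the sample standard deviations.

```latex
r = \frac{1}{n-1} \sum_{i=1}^{n}
      \left( \frac{x_i - \bar{x}}{s_x} \right)
      \left( \frac{y_i - \bar{y}}{s_y} \right)
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The first form reads as "the average product of the z-scores" (using n - 1 in place of n); the second is the same quantity with the standard deviations written out.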

4. Properties of the correlation coefficient:
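
A few standard properties of r can be checked numerically. The sketch below uses invented data (the vectors `x` and `y` are hypothetical and do not come from the lesson) and only illustrates the properties; it does not prove them.

```r
# Hypothetical data, invented only to illustrate the properties.
x <- c(2, 4, 5, 7, 9, 12)
y <- c(10, 14, 13, 19, 22, 30)

r <- cor(x, y)

# 1. r always lies between -1 and 1.
in_range <- (r >= -1) && (r <= 1)

# 2. r is symmetric: swapping the two variables does not change it.
r_swapped <- cor(y, x)

# 3. r has no units: a positive linear change of scale
#    (for example inches to centimeters) leaves it unchanged.
r_rescaled <- cor(2.54 * x + 10, y)

# 4. Multiplying one variable by a negative number flips the sign of r.
r_flipped <- cor(-x, y)

# 5. Points lying exactly on a line give r = 1 (upward) or r = -1 (downward).
r_line_up <- cor(x, 3 + 2 * x)
r_line_down <- cor(x, 3 - 2 * x)

print(c(r = r, swapped = r_swapped, rescaled = r_rescaled, flipped = r_flipped))
print(c(line_up = r_line_up, line_down = r_line_down))
```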

5. Draw a scatter plot that represents data with the following correlation coefficients:

SEE LESSON HANDOUT

6. Match the following scatterplots with the correct correlation coefficients.

Choices for correlation coefficient: 0.834, -0.993, 0.460, 0.215, -0.275

SEE LESSON HANDOUT

7. Comment on each of the following scatterplots (linearity, outliers, etc.). Also, estimate r.

library(dplyr)
BASEBALL<-read.csv("/Users/heatherhisako1/Desktop/Teaching/ST352_Summer16/BASEBALL.csv",
                header=TRUE)

names(BASEBALL)
## [1] "AVG"    "WINS"   "LEAGUE"
N_MLB<-BASEBALL%>%
  filter(LEAGUE=="N")

A_MLB<-BASEBALL%>%
  filter(LEAGUE=="A")
### CORRELATION FOR ENTIRE DATASET
with(BASEBALL, cor(AVG, WINS))
## [1] 0.331612
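
The call to `cor()` above returns the sample correlation coefficient r. As a rough sketch of what it computes, the same value can be rebuilt from z-scores. The vectors `avg` and `wins` below are invented for illustration and are not the values in BASEBALL.csv.

```r
# Hypothetical batting averages and win totals (NOT the BASEBALL.csv data).
avg  <- c(0.245, 0.251, 0.258, 0.262, 0.270, 0.275)
wins <- c(70, 85, 78, 90, 81, 95)

n <- length(avg)

# Standardize each variable with its sample mean and sample standard deviation.
z_avg  <- (avg - mean(avg)) / sd(avg)
z_wins <- (wins - mean(wins)) / sd(wins)

# r is the sum of the products of the z-scores, divided by n - 1.
r_by_hand <- sum(z_avg * z_wins) / (n - 1)

# Built-in computation for comparison.
r_builtin <- cor(avg, wins)

print(c(by_hand = r_by_hand, builtin = r_builtin))
```

The two printed values should agree up to floating-point rounding.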
## OVERALL TREND
plot(BASEBALL$AVG, BASEBALL$WINS, type="n", xlab="Batting Average", 
     ylab="Wins")
points(N_MLB$AVG, N_MLB$WINS, col="red", pch=15)
points(A_MLB$AVG, A_MLB$WINS, pch=16)

abline(coefficients(lm(BASEBALL$WINS~BASEBALL$AVG)), 
       lwd=3, col="blue", lty=2)

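[Figure: scatterplot of wins versus batting average for all teams. National League teams are red squares, American League teams are black circles, and the dashed blue line is the least-squares fit to the combined data.]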

Each team belongs to one of two leagues: the National League or the American League. Different symbols are used for the two different leagues. For each league, what is your estimate of the correlation coefficient? Based on this, would it be better to analyze each league separately or all together?

### CORRELATION AMERICAN AND NATIONAL 
N_MLB<-BASEBALL%>%
  filter(LEAGUE=="N")

with(N_MLB, cor(AVG, WINS))
## [1] 0.071264
A_MLB<-BASEBALL%>%
  filter(LEAGUE=="A")

with(A_MLB, cor(AVG, WINS))
## [1] 0.5547613
## FIT SEPARATELY
plot(BASEBALL$AVG, BASEBALL$WINS, type="n", xlab="Batting Average", 
     ylab="Wins")
points(N_MLB$AVG, N_MLB$WINS, col="red", pch=15)
points(A_MLB$AVG, A_MLB$WINS, pch=16)

abline(coefficients(lm(N_MLB$WINS~N_MLB$AVG)), 
       lwd=3, col="red", lty=2)
abline(coefficients(lm(A_MLB$WINS~A_MLB$AVG)), 
       lwd=3, col="black", lty=2)

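[Figure: the same scatterplot with a separate dashed least-squares line for each league: red for the National League, black for the American League.]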
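
In the baseball data the pooled correlation (about 0.33) sits between the National League value (about 0.07) and the American League value (about 0.55), so a single r can hide different behavior in subgroups. The sketch below reproduces that kind of pattern on a small invented data frame; `toy`, `group`, `x`, and `y` are hypothetical names, and base R's `split()` and `sapply()` are used here as an alternative to the `filter()` calls in the lesson.

```r
# Hypothetical data: two groups with different relationships between x and y.
toy <- data.frame(
  group = rep(c("A", "N"), each = 5),
  x = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  y = c(2, 4, 5, 8, 10, 6, 5, 7, 5, 6)
)

# Correlation with all rows pooled together.
r_pooled <- with(toy, cor(x, y))

# Correlation within each group: split the rows by group, then apply cor().
r_by_group <- sapply(split(toy, toy$group), function(d) cor(d$x, d$y))

print(r_pooled)
print(r_by_group)
```

Here group A has a strong positive linear relationship and group N has essentially none, while the pooled value falls in between and describes neither group well.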

## FIT LINEAR MODEL
## (the OLDFAITHFUL data frame is assumed to be loaded already; its import step is not shown)
my.lm.of <- with(OLDFAITHFUL, lm(interval~duration))

with(OLDFAITHFUL, 
     plot(duration, interval, main = "Old Faithful",
          ylab="interval between eruptions (minutes)", 
          xlab="duration of current eruption (minutes)")
)
mtext("Scatterplot of interval between eruptions versus duration of current eruption",
      cex = 0.8)
abline(coefficients(my.lm.of), 
       lwd=3, col="black", lty=2)

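[Figure: Old Faithful scatterplot of interval between eruptions (minutes) versus duration of current eruption (minutes), with the dashed least-squares line.]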

#To find the correlation coefficient between duration and interval
with(OLDFAITHFUL, cor(duration,interval))
## [1] 0.8772266
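
The fitted line and r are closely linked: the least-squares slope equals r times the ratio of the standard deviations, so the slope and r always share the same sign. The sketch below checks this on R's built-in `faithful` data set (columns `eruptions` and `waiting`), used here only as a stand-in; it is an assumption that it resembles the lesson's OLDFAITHFUL data, and the two are not necessarily the same.

```r
# Stand-in data: R's built-in `faithful` data set, which is NOT necessarily
# the same as the lesson's OLDFAITHFUL data frame.
fit <- lm(waiting ~ eruptions, data = faithful)
r <- with(faithful, cor(eruptions, waiting))

# Least-squares slope taken from the fitted model.
slope_from_lm <- unname(coef(fit)["eruptions"])

# The same slope rebuilt from r and the two sample standard deviations.
slope_from_r <- r * sd(faithful$waiting) / sd(faithful$eruptions)

print(c(lm = slope_from_lm, from_r = slope_from_r))
```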
## FIT LINEAR MODEL
## (the OJUICE data frame is assumed to be loaded already; its import step is not shown)
my.lm.oj <- with(OJUICE, lm(sweetness.index~pectin.ppm))

#To find the correlation coefficient between pectin and sweetness index
with(OJUICE, cor(pectin.ppm,sweetness.index))
## [1] -0.4781458
removeOutliers<-OJUICE%>%
  filter(pectin.ppm<350)

#To find the correlation coefficient between pectin and sweetness index WITHOUT THE OUTLIERS (pectin of 350 ppm or more)
with(removeOutliers, cor(pectin.ppm,sweetness.index))
## [1] -0.3091354
with(OJUICE, 
     plot(pectin.ppm, sweetness.index, main = "Orange Juice Example",
          ylab="sweetness index", xlab="pectin in ppm")
)
mtext("Scatterplot of sweetness index versus amount of pectin", cex = 0.8)
abline(coefficients(my.lm.oj), 
       lwd=3, col="black", lty=2) 
abline(coefficients(lm(removeOutliers$sweetness.index~removeOutliers$pectin.ppm)), 
       lwd=3, col="blue", lty=2)

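[Figure: orange juice scatterplot of sweetness index versus pectin (ppm). The dashed black line is the least-squares fit to all points; the dashed blue line is the fit after removing the points with pectin of 350 ppm or more.]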
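
In the orange juice example, dropping the points with pectin of 350 ppm or more moves r from about -0.48 to about -0.31, so a few extreme points account for a noticeable share of the correlation. The sketch below shows the same sensitivity in a more exaggerated form. All values are invented for illustration and are unrelated to the OJUICE data.

```r
# Hypothetical data with no linear relationship between x and y.
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(5, 3, 6, 4, 5, 3, 6, 4)
r_without <- cor(x, y)

# Add one extreme point far from the rest of the data.
x_out <- c(x, 20)
y_out <- c(y, 20)
r_with <- cor(x_out, y_out)

print(c(without_outlier = r_without, with_outlier = r_with))
```

A single added point takes r from essentially zero to a strongly positive value, which is why it helps to look at the scatterplot before interpreting r.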

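## FIT LINEAR MODEL
## (the XYQUADRATIC data frame is assumed to be loaded already; its import step is not shown)
my.lm.XY <- with(XYQUADRATIC, lm(y~x))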

with(XYQUADRATIC, 
     plot(x, y, main = "Scatterplot of Y versus X",
          ylab="Y", xlab="X")
)
abline(coefficients(my.lm.XY), lwd=2, col="red", lty=2) 

[Figure: scatterplot of Y versus X for the XYQUADRATIC data, with the dashed red least-squares line.]

with(XYQUADRATIC,cor(x,y))
## [1] 0
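
A correlation of 0 means there is no linear relationship, not that there is no relationship at all. The sketch below makes the point with invented data in which y is completely determined by x; the vectors are hypothetical and are not the XYQUADRATIC values.

```r
# Hypothetical data: y is an exact function of x, but the relationship is curved.
x <- -5:5
y <- x^2

# r measures only the linear part of the relationship, which is absent here.
r_quadratic <- cor(x, y)
print(r_quadratic)
```

Because the x values are symmetric about zero, the upward and downward halves of the parabola cancel and r comes out at zero (up to rounding), even though knowing x tells you y exactly.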