KITADA
Lesson #12
Motivation:
One of the patterns we look for in a scatterplot is the direction and strength of the relationship between two quantitative variables. Correlation describes the strength and direction of this relationship as long as the relationship is linear. We can assign a value that measures the strength and direction of the linear relationship between two quantitative variables: the correlation coefficient. We’ll concentrate on the correlation coefficient in this lesson.
What you need to know from this lesson: After completing this lesson, you should be able to
To accomplish the above “What You Need to Know”, do the following:
The Lesson
1. What does the correlation coefficient measure?
It measures the strength and direction of the linear relationship between two quantitative variables.
2. What is the symbol for the sample correlation coefficient? What is the symbol for the population correlation coefficient?
sample correlation = r
population correlation = \( \rho \)
3. The formula for the sample correlation coefficient, r, is:
SEE LESSON HANDOUT
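The handout's formula is not reproduced in these notes. One standard way to write the sample correlation coefficient, which should agree with the handout's version up to algebraic rearrangement, is as the sum of products of z-scores divided by n - 1: \( r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) \). The sketch below computes r this way and compares it with R's built-in cor(). The vectors x and y are made up for illustration and are not from the lesson's datasets.

```r
# Hypothetical data for illustration only (not from the lesson datasets)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

n <- length(x)

# Standardize each variable: subtract the mean, divide by the sample SD
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)

# r is the sum of the z-score products divided by n - 1
r_by_hand <- sum(zx * zy) / (n - 1)

r_by_hand
cor(x, y)   # built-in Pearson correlation; should match r_by_hand
```

Because each variable is standardized before multiplying, the units cancel, which is why r itself has no units.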
4. Properties of the correlation coefficient:
1) \( -1 \leq r \leq 1 \)
2) the sign of the correlation coefficient indicates the direction of the association
3) a correlation coefficient of zero indicates no linear relationship (a strong non-linear relationship can still have r = 0)
4) a correlation coefficient of +1 or –1 indicates a perfect linear association
5) the correlation coefficient is a measure of the strength of a linear association. It should not be used when the relationship is non-linear!
6) the correlation coefficient does not make a distinction between the response variable and the explanatory variable. That is, the correlation of x with y is the same as the correlation of y with x.
7) there are no units for the correlation coefficient
8) the correlation coefficient is sensitive to outliers
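Several of these properties can be checked numerically. The sketch below uses made-up height and weight values (hypothetical, not from the lesson's datasets) to illustrate properties 2, 6, and 7. The conversion factors are the usual approximate ones.

```r
# Hypothetical measurements for illustration only
height_in <- c(60, 62, 65, 68, 70, 72)
weight_lb <- c(115, 120, 140, 155, 160, 180)

# Property 6: swapping x and y does not change r
r_xy <- cor(height_in, weight_lb)
r_yx <- cor(weight_lb, height_in)

# Property 7: r has no units, so converting units leaves it unchanged
height_cm <- height_in * 2.54     # inches to centimeters
weight_kg <- weight_lb * 0.4536   # pounds to kilograms (approximate)
r_metric <- cor(height_cm, weight_kg)

# Property 2: reversing the direction of one variable flips the sign of r
r_flipped <- cor(-height_in, weight_lb)

c(r_xy = r_xy, r_yx = r_yx, r_metric = r_metric, r_flipped = r_flipped)
```

The first three values should agree, and the last should be the same size with the opposite sign.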
5. Draw a scatter plot that represents data with the following correlation coefficients:
SEE LESSON HANDOUT
6. Match the following scatterplots with the correct correlation coefficients.
Choices for correlation coefficient: 0.834, -0.993, 0.460, 0.215, -0.275
SEE LESSON HANDOUT
7. Comment on each of the following scatterplots (linearity, outliers, etc.). Also, estimate r.
library(dplyr)
BASEBALL<-read.csv("/Users/heatherhisako1/Desktop/Teaching/ST352_Summer16/BASEBALL.csv",
header=TRUE)
names(BASEBALL)
## [1] "AVG" "WINS" "LEAGUE"
N_MLB<-BASEBALL%>%
filter(LEAGUE=="N")
A_MLB<-BASEBALL%>%
filter(LEAGUE=="A")
### CORRELATION FOR ENTIRE DATASET
with(BASEBALL, cor(AVG, WINS))
## [1] 0.331612
## OVERALL TREND
plot(BASEBALL$AVG, BASEBALL$WINS, type="n", xlab="Batting Average",
ylab="Wins")
points(N_MLB$AVG, N_MLB$WINS, col="red", pch=15)
points(A_MLB$AVG, A_MLB$WINS, pch=16)
abline(coefficients(lm(BASEBALL$WINS~BASEBALL$AVG)),
lwd=3, col="blue", lty=2)
Each team belongs to one of two leagues: the National League or the American League. Different symbols are used for the two leagues. For each league, what is your estimate of the correlation coefficient? Based on this, would it be better to analyze each league separately or all together?
### CORRELATION WITHIN EACH LEAGUE
# N_MLB and A_MLB were created above with filter()
# National League
with(N_MLB, cor(AVG, WINS))
## [1] 0.071264
# American League
with(A_MLB, cor(AVG, WINS))
## [1] 0.5547613
## FIT SEPARATE
plot(BASEBALL$AVG, BASEBALL$WINS, type="n", xlab="Batting Average",
ylab="Wins")
points(N_MLB$AVG, N_MLB$WINS, col="red", pch=15)
points(A_MLB$AVG, A_MLB$WINS, pch=16)
abline(coefficients(lm(N_MLB$WINS~N_MLB$AVG)),
lwd=3, col="red", lty=2)
abline(coefficients(lm(A_MLB$WINS~A_MLB$AVG)),
lwd=3, col="black", lty=2)
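The baseball example shows that a pooled correlation can hide different relationships within subgroups. As a sketch of the same idea without the CSV file, the code below builds a small made-up data frame (the name teams and its columns group, x, and y are hypothetical) and computes the correlation overall and within each group using base R's split() and sapply() in place of dplyr's filter().

```r
# Hypothetical data: two groups with different x-y relationships
teams <- data.frame(
  group = rep(c("A", "B"), each = 5),
  x = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  y = c(2, 4, 5, 8, 10, 6, 5, 7, 5, 6)
)

# Pooled correlation ignores the grouping
r_all <- with(teams, cor(x, y))

# Within-group correlation: split the rows by group, then apply cor() to each piece
r_by_group <- sapply(split(teams, teams$group), function(d) cor(d$x, d$y))

r_all
r_by_group
```

In this made-up data, group A has a strong positive association and group B has essentially none, so the pooled value falls in between and describes neither group well.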
## FIT LINEAR MODEL
my.lm.of <- with(OLDFAITHFUL, lm(interval~duration))
with(OLDFAITHFUL,
plot(duration, interval, main = "Old Faithful",
ylab="interval between eruptions (minutes)",
xlab="duration of current eruption (minutes)")
)
mtext("Scatterplot of interval between eruptions versus duration of current eruption",
cex = 0.8)
abline(coefficients(my.lm.of),
lwd=3, col="black", lty=2)
#To find the correlation coefficient between duration and interval
with(OLDFAITHFUL, cor(duration,interval))
## [1] 0.8772266
## FIT LINEAR MODEL
my.lm.oj <- with(OJUICE, lm(sweetness.index~pectin.ppm))
#To find the correlation coefficient between pectin and sweetness index
with(OJUICE, cor(pectin.ppm,sweetness.index))
## [1] -0.4781458
removeOutliers<-OJUICE%>%
filter(pectin.ppm<350)
#To find the correlation coefficient between pectin and sweetness index WITHOUT THE OUTLIERS
with(removeOutliers, cor(pectin.ppm,sweetness.index))
## [1] -0.3091354
with(OJUICE,
plot(pectin.ppm, sweetness.index, main = "Orange Juice Example",
ylab="sweetness index", xlab="pectin in ppm")
)
mtext("Scatterplot of sweetness index versus amount of pectin", cex = 0.8)
abline(coefficients(my.lm.oj),
lwd=3, col="black", lty=2)
abline(coefficients(lm(removeOutliers$sweetness.index~removeOutliers$pectin.ppm)),
lwd=3, col="blue", lty=2)
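The orange juice example shows how much a few extreme points can change r. The sketch below reproduces the effect on made-up data (the vectors and the cutoff of 10 are hypothetical, chosen only for illustration): a single far-away point makes a weak association look strong.

```r
# Hypothetical data: a weak pattern plus one extreme point at x = 20
x <- c(1, 2, 3, 4, 5, 6, 20)
y <- c(5, 3, 6, 4, 6, 5, 15)

# Correlation using every point, including the extreme one
r_with <- cor(x, y)

# Drop the extreme point with a logical filter on x (the cutoff is an assumption)
keep <- x < 10
r_without <- cor(x[keep], y[keep])

c(with_outlier = r_with, without_outlier = r_without)
```

Whether removing a point is justified is a separate question; the sketch only shows that r should be reported alongside a scatterplot so that influential points are visible.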
my.lm.XY <- with(XYQUADRATIC, lm(y~x))
with(XYQUADRATIC,
plot(x, y, main = "Scatterplot of Y versus X",
ylab="Y", xlab="X")
)
abline(coefficients(my.lm.XY), lwd=2, col="red", lty=2)
with(XYQUADRATIC,cor(x,y))
## [1] 0
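A correlation of 0 here does not mean x and y are unrelated. As a sketch with made-up data (not the XYQUADRATIC file itself), the code below builds a y that is completely determined by x through a symmetric curve and shows that r is still essentially zero, because r only measures linear association.

```r
# Hypothetical data: y is completely determined by x, but not linearly
x <- -3:3
y <- x^2

# Essentially zero: there is no linear association
r_linear <- cor(x, y)

# Strongly positive once the curve is "unfolded" by using the distance from 0
r_unfolded <- cor(abs(x), y)

c(r_linear = r_linear, r_unfolded = r_unfolded)
```

This is why the scatterplot should be checked for linearity before r is interpreted.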