Correlation is a measure of linear dependence between two variables. We measure the correlation by calculating the correlation coefficient. The correlation coefficient can be any number between -1 and +1.
If the correlation coefficient between two variables is +1, the variables have a perfect (the strongest possible) positive linear relationship. The + sign tells us that the relationship is positive: as one variable increases, the other variable also increases.
If the correlation coefficient between two variables is -1, the variables have a perfect negative linear relationship. The - sign tells us that the relationship is negative: as one variable increases, the other variable decreases.
If the correlation coefficient between two variables is 0, there may be no linear relationship between them.
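To make these three cases concrete, here is a minimal sketch using made-up vectors (x, y_pos, y_neg, and y_none are illustrative values, not part of any dataset we use later):
x <- 1:10
y_pos <- 2 * x + 3 # increases exactly as x increases
y_neg <- -2 * x + 3 # decreases exactly as x increases
set.seed(1) # make the random example reproducible
y_none <- rnorm(10) # random noise unrelated to x
cor(x, y_pos) # +1: perfect positive linear relationship
cor(x, y_neg) # -1: perfect negative linear relationship
cor(x, y_none) # close to 0: little or no linear relationship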
Let’s look at the data set cars.
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
From the head of cars, we can see that there are two variables in this dataset: speed and dist.
speed: speed of car in mph
dist: stopping distance of car in ft
When we are looking at the relationship between two variables, one variable is called the independent variable and the other is called the dependent variable. When we graph the relationship between two variables, the independent variable is on the x-axis while the dependent variable is on the y-axis. We can say that the dependent variable is a function of the independent variable.
What do you think the linear relationship between speed and stopping distance is? Do you think that as the speed of a car increases, the stopping distance increases? Or do you think that as the speed of a car increases, the stopping distance decreases?
We can check the linear relationship between these two variables by finding their correlation coefficient.
cor(cars$speed, cars$dist)
## [1] 0.8068949
0.8068949 is positive, so the linear relationship between speed and stopping distance is positive. In other words, as the speed of a car increases, the stopping distance also increases.
0.8068949 is also very close to +1, but how do we know if we can say that this is a strong positive linear relationship?
There are common cut-offs we use to classify linear relationships as weak, moderate, or strong. The sign of the coefficient tells us the direction of the relationship; its size (absolute value) tells us the strength.
If the correlation coefficient is between 0 and 0.5 in absolute value, we say that the linear relationship is weak.
If the correlation coefficient is between 0.5 and 0.8 in absolute value, we say that the linear relationship is moderate.
If the correlation coefficient is between 0.8 and 1 in absolute value, we say that the linear relationship is strong.
Since our correlation coefficient is between 0.8 and 1, we can say that a car's speed and stopping distance have a strong positive linear relationship.
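One way to encode these cut-offs in R is a small helper function. This is just an illustrative sketch; the name correlation_strength and the exact boundary handling are our own choices, not a built-in R function:
correlation_strength <- function(r) {
  a <- abs(r) # the sign gives the direction, the size gives the strength
  if (a < 0.5) {
    "weak"
  } else if (a < 0.8) {
    "moderate"
  } else {
    "strong"
  }
}
correlation_strength(cor(cars$speed, cars$dist)) # "strong"
correlation_strength(-0.2) # "weak" (the sign is ignored when judging strength)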
Let’s graph speed vs stopping distance so we can see what a strong positive linear relationship looks like.
par(pin = c(4, 2.8))
plot(x = cars$speed,
     y = cars$dist,
     pch = 19, # `pch` controls the symbol type of the data points
     col = "blue", # `col` controls the color of the data points
     cex = 1.2, # `cex` controls symbol size
     main = "Stopping Distance as a function of car speed",
     ylab = "dist (feet)",
     xlab = "speed (miles per hour)"
)
When we run the code chunk above, we can see that the points in the plot fall roughly along an upward-sloping line: the variables are positively correlated.
In the example of stopping distance and speed, speed is the independent variable and stopping distance is the dependent variable. We can say that stopping distance is a function of speed.
Let's look at a dataset with more information on cars from Motor Trend magazine. The data come from 1974, so these cars will seem ancient!
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
You can get information on all of the columns, but we can guess that mpg is miles per gallon and wt is weight.
The 1/4 mile time (qsec) is also known as the drag racing time (time from standing stop to quarter mile mark).
library(imager)
drag_img <- load.image(file.path(data_dir, "maxresdefault.jpg"))
plot(drag_img,
axes=FALSE
)
When we run the code chunk below we can see what all of the columns mean.
help(mtcars)
Let’s look at the correlation coefficient of weight vs miles per gallon.
cor(mtcars$wt, mtcars$mpg)
## [1] -0.8676594
Based on what we learned earlier in this module, can you determine what kind of linear relationship these two variables have?
Let’s graph weight vs miles per gallon to see this relationship.
par(pin = c(4, 2.8))
plot(x = mtcars$wt,
     y = mtcars$mpg,
     pch = 19, # `pch` controls the symbol type of the data points
     col = "red", # `col` controls the color of the data points
     cex = 1.2, # `cex` controls symbol size
     main = "Miles per gallon as a function of car weight",
     ylab = "Miles per gallon (mpg)",
     xlab = "Weight (1000 lbs)"
)
When we run the code chunk above, we can see that the points in the plot fall roughly along a downward-sloping line: the variables are negatively correlated.
Let’s also look at the correlation coefficient of weight and 1/4 mile time from mtcars.
cor(mtcars$wt, mtcars$qsec)
## [1] -0.1747159
Based on what we learned earlier in this module, can you determine what kind of linear relationship these two variables have?
Let’s graph weight vs 1/4 mile time to see this relationship.
par(pin = c(4, 2.8))
plot(x = mtcars$wt,
     y = mtcars$qsec,
     pch = 19, # `pch` controls the symbol type of the data points
     col = "red", # `col` controls the color of the data points
     cex = 1.2, # `cex` controls symbol size
     main = "1/4 Mile time as a function of car weight",
     ylab = "1/4 Mile time",
     xlab = "Weight (1000 lbs)"
)
When we run the code chunk, we can see that the points are much more scattered: these variables have a weak linear relationship.
Now we want to look at the summary of the linear regression model, which we get with summary().
Before we look at the summary, let's fit the linear regression model and store it in the object model. Then run the code chunk below to see the summary.
model <- lm(formula = dist ~ speed, data = cars)
summary(model)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
When running this code chunk, we get a lot of statistics.
Let’s go through everything step by step.
model$call
## lm(formula = dist ~ speed, data = cars)
model$residuals
## 1 2 3 4 5 6 7
## 3.849460 11.849460 -5.947766 12.052234 2.119825 -7.812584 -3.744993
## 8 9 10 11 12 13 14
## 4.255007 12.255007 -8.677401 2.322599 -15.609810 -9.609810 -5.609810
## 15 16 17 18 19 20 21
## -1.609810 -7.542219 0.457781 0.457781 12.457781 -11.474628 -1.474628
## 22 23 24 25 26 27 28
## 22.525372 42.525372 -21.407036 -15.407036 12.592964 -13.339445 -5.339445
## 29 30 31 32 33 34 35
## -17.271854 -9.271854 0.728146 -11.204263 2.795737 22.795737 30.795737
## 36 37 38 39 40 41 42
## -21.136672 -11.136672 10.863328 -29.069080 -13.069080 -9.069080 -5.069080
## 43 44 45 46 47 48 49
## 2.930920 -2.933898 -18.866307 -6.798715 15.201285 16.201285 43.201285
## 50
## 4.268876
The code chunk above prints all of the residuals. When we call summary() on model, the Residuals section gives the five-number summary of these values.
The residuals show the vertical distance from each data point to the fitted line. For this, we want the Min value and the Max value to be approximately the same distance from zero. We also want the 1Q value and the 3Q value to be approximately the same distance from zero. We also would like the Median to be close to zero.
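As a quick check, we can compute the same five numbers directly from the residuals stored in the model (one way to do this with base R):
summary(model$residuals) # Min, 1st Qu., Median, Mean, 3rd Qu., Max of the residuals
quantile(model$residuals) # the same quartiles, without the mean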
The third section (under the word "Coefficients:") shows the least squares estimates for the fitted line.
In the Estimate column, we are given the estimate for the y-intercept (the first row) and the slope (the second row, speed). We can also pull these estimates directly out of the model:
model$coefficients
## (Intercept) speed
## -17.579095 3.932409
* In our case, we can see that our y-intercept is estimated to be -17.5791 and our slope is
estimated to be 3.9324.
* The columns `Std. Error` and `t value` are used to calculate the `p-value` (`Pr(>|t|)`).
* The `Pr(>|t|)` column gives the `p-value`. Each `p-value` tests the null hypothesis that the corresponding true coefficient (the y-intercept or the slope) is equal to zero.
* If the true slope were 0, the x-variable would not help us predict the y-variable, and we should use a different model for estimating the values.
* For general analysis, we are not interested in the y-intercept `p-value`. We are only interested
in the `p-value` for the slope.
* We want the `p-value` for the slope to be less than or equal to 0.05. If the `p-value` is less
than or equal to 0.05, it is said to be **statistically significant**.
* If the `p-value` for the slope is statistically significant, the linear model is a good
predictor of the y-variable.
* In our case, our y-variable is `dist`.
* Our `p-value` for the slope is 1.49e-12 (0.00000000000149). This means that the `p-value` is
statistically significant. Because the `p-value` is statistically significant, we can say that
this linear model is a good predictor of `dist`.
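If we want to pull the slope's p-value out of the summary in code, instead of reading it off the printed table, one way is to index into the coefficient matrix that summary() returns (the object names coef_table and slope_p below are our own):
coef_table <- summary(model)$coefficients # Estimate, Std. Error, t value, and Pr(>|t|) for each row
slope_p <- coef_table["speed", "Pr(>|t|)"] # p-value for the slope
slope_p <= 0.05 # TRUE, so the slope is statistically significant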
The next part of the output that we want to look at is the Multiple R-squared value. Multiple R-squared is the proportion of the variation in the y-variable that the model explains; here it is 0.6511, so speed explains about 65% of the variation in stopping distance.
The final part we want to look at is the p-value of the F-statistic.
If this p-value is less than 0.05, or statistically significant, we can say that speed gives us a reliable estimate of the stopping distance.
Since this p-value is 1.49e-12 (0.00000000000149), we can in fact say that speed gives us a reliable estimate of the stopping distance. I know this was a lot to take in.
From this, we can say that speed is a reliable predictor of stopping distance using the linear model. This means that our linear regression equation is a good way to estimate stopping distance given the speed of a car.
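Both of these quantities can also be extracted directly from the summary object. In this sketch, summary(model)$r.squared gives the Multiple R-squared, and summary(model)$fstatistic gives the F value together with its two degrees of freedom, from which pf() computes the p-value:
s <- summary(model)
s$r.squared # Multiple R-squared (0.6511)
f <- s$fstatistic # F value, numerator df, denominator df
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE) # p-value of the F-statistic (1.49e-12)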
We can use lm() to determine what the regression equation is (the equation we can use to predict the y-value when given an x-value [also called the line of best fit]).
We want to use speed to predict stopping distance. That is why we write dist ~ speed (read: dist as a function of speed).
Our equation is going to be of the form dist = b + (m * speed), where b is the y-intercept and m is the slope.
reg <- lm(dist ~ speed, data = cars)
reg
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Coefficients:
## (Intercept) speed
## -17.579 3.932
When we run the code chunk above, we can see that the y-intercept is -17.579 and the slope is 3.932.
Note: We passed the linear model into an object reg so we can use this value later.
Plugging the y-intercept and slope values into our equation above, we get that:
dist = -17.579 + 3.932*(speed)
If I gave you the speed of a car, could you find the estimated stopping distance of the car?
Let’s say the speed of a car is 55 mph. Run the code chunk below to see what the stopping distance is according to the linear regression equation.
dist55 = -17.579 + 3.932*(55)
dist55
## [1] 198.681
If a car was traveling at 55 mph, we can estimate that it will take 198.681 feet to stop.
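Instead of plugging the rounded coefficients in by hand, we can also let R do the prediction with predict(), using the reg object we created above. The answer should agree closely with the hand calculation (small differences come from rounding the coefficients). Keep in mind that 55 mph is well above the speeds recorded in cars (4 to 25 mph), so this estimate is an extrapolation.
predict(reg, newdata = data.frame(speed = 55)) # estimated stopping distance (in feet) at 55 mph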
Try this yourself:
If a car is traveling at 40mph, what is the estimated stopping distance of the car?
Use the hints and your previous knowledge to try to do these practice problems.
Let’s look at the dataset Orange.
head(Orange)
## Tree age circumference
## 1 1 118 30
## 2 1 484 58
## 3 1 664 87
## 4 1 1004 115
## 5 1 1231 120
## 6 1 1372 142
We can see that there are three variables in this dataset: Tree, age and circumference.
Tree: an ordered factor indicating the tree on which the measurement is made
age: the age of the tree (days since 12/31/1968)
circumference: the circumference of the trunk at breast height (in mm)
Before we do any analysis, let’s pass the age and circumference columns into objects.
age <- Orange$age
circumference <- Orange$circumference
Use this dataset to calculate the correlation coefficient between age and circumference (Hint: look back at how we used cor() earlier in this module).
Can you guess what the plot will look like?
Now, create a simple scatterplot of age vs circumference (age is the y-variable and circumference is the x-variable).
#plot(x = , y = , main = "Age as a function of circumference")
Can you see why the correlation coefficient is what it is?
Now, look at the summary of the linear model (Hint: look back at how we used lm() and summary() earlier in this module).
What is the p-value? What does this p-value tell us? Is the linear model a good fit for the relationship between age and circumference?
Find the linear regression equation between age and circumference (we want to use the circumference of the tree to predict the age of the tree) (Hint: look back at how we found the regression equation for the cars data).
What is the linear regression equation?
If we know that a tree's circumference is 50 mm, what is the estimated age, in days, of this tree (Hint: look back at how we estimated the stopping distance for a given speed)?