Correlation

Correlation is a measure of linear dependence between two variables. We measure the correlation by calculating the correlation coefficient. The correlation coefficient can be any number between -1 and +1.

Let’s look at the data set cars.

head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

From the head of cars, we can see that there are two variables in this dataset: speed and dist.

When we are looking at the relationship between two variables, one variable is called the independent variable and the other is called the dependent variable. When we graph the relationship between two variables, the independent variable is on the x-axis while the dependent variable is on the y-axis. We can say that the dependent variable is a function of the independent variable.

What do you think the linear relationship between speed and stopping distance is? Do you think that as the speed of a car increases, the stopping distance increases? Or do you think that as the speed of a car increases, the stopping distance decreases?

We can check the linear relationship between these two variables by finding their correlation coefficient.

cor(cars$speed, cars$dist)
## [1] 0.8068949

0.8068949 is positive, so the linear relationship between speed and stopping distance is positive. In other words, as the speed of a car increases, the stopping distance also increases.

0.8068949 is also very close to +1, but how do we know if we can say that this a strong positive linear relationship?

There are ways we classify strong, moderate, and weak linear relationships.

Since our correlation coefficient is between 0.8 and 1, we can say that a cars speed and stopping distance have a strong positive linear relationship.

Let’s graph speed vs stopping distance so we can see what a strong positive linear relationship looks like.

par(pin = c(4,2.8))
plot(x = cars$speed,     
     y = cars$dist, 
     pch = 19,                  # `pch` controls the symbol type of the data points                                
     col = "blue",              # `col` controls the symbol type of the data points
     cex = 1.2,                 # `cex` controls symbol size
     main = "Stopping Distance as a function of car speed",     
     ylab = "dist (feet)",
     xlab = "speed (miles per hour)"
     )

When we run the code chunk above, we can see that the points in the plot create a line that is positively correlated.

In the example of stopping distance and speed, speed is the independent variable and stopping distance is the dependent variable. We can say that stopping distance is a function of speed.

Let’s look at a dataset with more information on cars from Motor Trends magazine. The magazine is from 1974, so these cars will seem ancient!

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

You can get information on all of the columns, but we can guess that mpg is miles per gallon and wt is weight.

The 1/4 mile time (qsec) is also known as the drag racing time (time from standing stop to quarter mile mark).

library(imager)
## Warning: package 'imager' was built under R version 4.0.5
## Loading required package: magrittr
## 
## Attaching package: 'imager'
## The following object is masked from 'package:magrittr':
## 
##     add
## The following objects are masked from 'package:stats':
## 
##     convolve, spectrum
## The following object is masked from 'package:graphics':
## 
##     frame
## The following object is masked from 'package:base':
## 
##     save.image
drag_img <- load.image(file.path(data_dir, "maxresdefault.jpg"))

plot(drag_img,
     axes=FALSE
)

When we run the code chunk below we can see what all of the columns mean.

help(mtcars)
## starting httpd help server ... done

Let’s look at the correlation coefficient of weight vs miles per gallon.

cor(mtcars$wt, mtcars$mpg)
## [1] -0.8676594

Based on what we learned earlier in this module, can you determine what kind of linear relationship these two variables have?

Let’s graph weight vs miles per gallon to see this relationship.

par(pin = c(4,2.8))
plot(x = mtcars$wt,     
     y = mtcars$mpg, 
     pch = 19,                  # `pch` controls the symbol type of the data points                                
     col = "red",               # `col` controls the symbol type of the data points
     cex = 1.2,                 # `cex` controls symbol size
     main = "Miles per gallon as a function of car weight",     
     ylab = "Miles per gallon (mpg)",
     xlab = "Weight (1000 lbs)"
     )

When we run the code chunk above, we can see that the points in the plot create a line that is negatively correlated.

Let’s also look at the correlation coefficient of weight and 1/4 mile time from mtcars.

cor(mtcars$wt, mtcars$qsec)
## [1] -0.1747159

Based on what we learned earlier in this module, can you determine what kind of linear realtionship these two variables have?

Let’s graph weight vs 1/4 mile time to see this relationship.

par(pin = c(4,2.8))
plot(x = mtcars$wt,     
     y = mtcars$qsec, 
     pch = 19,                  # `pch` controls the symbol type of the data points                                
     col = "red",               # `col` controls the symbol type of the data points
     cex = 1.2,                 # `cex` controls symbol size
     main = "1/4 Mile time as a function of car weight",     
     ylab = "1/4 Mile time",
     xlab = "Weight (1000 lbs)"
     )

When we run the code chunk, we can see that these variables have a weak linear relationship.

Summary Linear Regression Model

Now we want to look at the summary of the linear regression model. We use summary() to look at the summary of this.

Run the code chunk below to see the summary. Before we look at the summary of the linear regression model, let’s pass the linear regression model into the object model.

model <- lm(formula = dist ~ speed, data = cars)
summary(model)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

When running this code chunk, we get a lot of statistics.

Let’s go through everything step by step.

model$call
## lm(formula = dist ~ speed, data = cars)
model$residuals
##          1          2          3          4          5          6          7 
##   3.849460  11.849460  -5.947766  12.052234   2.119825  -7.812584  -3.744993 
##          8          9         10         11         12         13         14 
##   4.255007  12.255007  -8.677401   2.322599 -15.609810  -9.609810  -5.609810 
##         15         16         17         18         19         20         21 
##  -1.609810  -7.542219   0.457781   0.457781  12.457781 -11.474628  -1.474628 
##         22         23         24         25         26         27         28 
##  22.525372  42.525372 -21.407036 -15.407036  12.592964 -13.339445  -5.339445 
##         29         30         31         32         33         34         35 
## -17.271854  -9.271854   0.728146 -11.204263   2.795737  22.795737  30.795737 
##         36         37         38         39         40         41         42 
## -21.136672 -11.136672  10.863328 -29.069080 -13.069080  -9.069080  -5.069080 
##         43         44         45         46         47         48         49 
##   2.930920  -2.933898 -18.866307  -6.798715  15.201285  16.201285  43.201285 
##         50 
##   4.268876

The code chunk above prints all of the residuals. When we do the summary of model, we get the 5 number summary of these values.

This shows the distance from the data to the fitted line. For this, we want the Min value and the Max value to be approximately the same distance from zero. We also want the 1Q value and 3Q value to be approximately the same distance from zero. We also would like the Median to be close to zero.

model$coefficients
## (Intercept)       speed 
##  -17.579095    3.932409
* In our case, we can see that our y-intercept is estimated to be -17.5791 and our slope is
estimated to be 3.9324.

* The columns `Std. Error` and `t value` are used to calculate the `p-value` (`Pr(>|t|)`).

* The `Pr(>|t|)` column gives the `p-value`. The `p-value` tests whether the estimate for the
y-intercept and the slope are equal to zero or not. 

  * If the estimate for the y-intercept or the slope is 0, we should use a different model for
    estimating the values.

* For general analysis, we are not interested in the y-intercept `p-value`. We are only interested
in the `p-value` for the slope.

  * We want the `p-value` for the slope to be less than or equal to 0.05. If the `p-value` is less
  than or equal to 0.05, it is said to be **statistically significant**.
  
  * If the `p-value` for the slope is statistically significant, the linear model is a good
  predictor of the y-variable.
  
    *  In our case, our y-variable is `dist`.
    
  * Our `p-value` for the slope is 1.49e-12 (0.00000000000149). This means that the `p-value` is
  statistically significant. Because the `p-value` is statistically significant, we can say that
  this linear model is a good predictor of `dist`.
  

I know this was a lot to take in.

From this, we can say that the speed is a reliable estimate for the stopping distance using the linear model. This means that our linear regression equation is a good way to estimate stopping distance given the speed of a car.

Linear Regression Equation

We can use lm() to determine what the regression equation is (the equation we can use to predict the y-value when given an x-value [also called the line of best fit]).

We want to know the relationship between speed and distance. That is why we write speed ~ dist.

Our equation is gonna be in the form: dist = b + (m*speed) where b = y-intercept and m = slope.

reg <- lm(dist ~ speed, data = cars)
reg
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932

When we run the code chunk above, we can see that the y-intercept is -17.579 and the slope is 3.932.

Note: We passed the linear model into an object reg so we can use this value later.

Plugging the y-intercept and slope values into our equation above, we get that:

dist = -17.579 + 3.932*(speed)

If I gave you the speed of a car, could you find the estimated stopping distance of the car?

Let’s say the speed of a car is 55 mph. Run the code chunk below to see what the stopping distance is according to the linear regression equation.

dist55 = -17.579 + 3.92*(55)
dist55
## [1] 198.021

If a car was traveling at 55mph, we can estimate that it will take it 198.021 feet to stop.

Try this yourself:

If a car is traveling at 40mph, what is the estimated stopping distance of the car?

Practice

Use the hints and your previous knowledge to try to do these practice problems.

Let’s look at the dataset Orange.

head(Orange)
##   Tree  age circumference
## 1    1  118            30
## 2    1  484            58
## 3    1  664            87
## 4    1 1004           115
## 5    1 1231           120
## 6    1 1372           142

We can see that there are three variables in this dataset: Tree, age and circumference.

Before we do any analysis, let’s pass the age and circumference columns into objects.

age <- Orange$age
circumference <- Orange$circumference

Use this dataset to calculate the correlation coefficient (Hint: line 54).

Can you guess what the plot will look like?

Now, create a simple scatterplot of age vs circumference (age is the y-variable and circumference is the x-variable).

#plot(x = , y = , main = "Age as a function of circumference")

Can you see what the correlation coefficient is what it is?

Now, look at the summary of the linear model (Hint: line 175-176).

What is the p-value? What does this p-value tell us? Is the linear model a good fit for the relationship between age and circumference?

Find the linear regression equation between age and circumference (we want to use the circumference of the tree to predict the age of the tree) (Hint: lines 264 - 265).

What is the linear regression equation?

If we know that a tree’s circumference is 50mm, what is the estimated age, in days, of this tree (Hint: lines 281 - 282)?