Overview

Linear regression is a technique used to assess the relationship between two or more variables. The strength of this relationship can be measured using regression output statistics such as the t value and the p value, which help us determine whether the relationship between two variables is statistically significant. We can use the tidyverse, a collection of R packages, to assess this relationship graphically.

Much of the tidyverse's power comes from the fact that it bundles eight core packages behind a single installation (listed in the attach message below). This vignette will use the tidyverse to communicate and visualise the results of a simple linear regression problem: whether there is a statistically significant relationship between the speed a car is travelling at (mph) and the distance taken to stop the car (ft).

Loading the necessary packages and data

First, the tidyverse package must be loaded, followed by the data set for analysis: the built-in cars data set.

library(tidyverse)
## -- Attaching packages ------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0       v purrr   0.3.2  
## v tibble  2.1.1       v dplyr   0.8.0.1
## v tidyr   0.8.3       v stringr 1.4.0  
## v readr   1.3.1       v forcats 0.4.0
## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
data("cars")

Now that the appropriate package and data have been loaded, we can apply a few functions to the data set to get a sense of its characteristics.

str(cars) 
## 'data.frame':    50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Plotting the two variables

As detailed above, there are two variables, speed and distance, each with 50 observations. The ggplot2 package (included in the tidyverse) can be used to visualise this data set. The speed of the car will be plotted on the x axis (the independent/predictor variable) and the distance taken to stop the car on the y axis (the dependent/outcome variable).

Speed <- cars$speed
Distance <- cars$dist
ggplot(data = cars) +
  geom_point(mapping = aes(x = Speed, y = Distance)) +
  xlab("Speed (mph)") +
  ylab("Distance (ft)")

Regression Analysis

The lm() function will be used to regress the distance taken to stop the car (ft) against the speed of the car (mph).

slm <- lm(Distance ~ Speed) #regress the outcome variable on the predictor variable
coef(slm) #return the coefficients of the regression analysis
## (Intercept)       Speed 
##  -17.579095    3.932409

The results of the regression are displayed above. However, we are not interested in the distance taken to stop a car when the speed is 0 mph. So, to give the y intercept a more useful interpretation, the predictor variable can be centred about its mean.

slm_centred <- lm(cars$dist ~ I(cars$speed - mean(cars$speed))) #centre the predictor variable about its mean
summary(slm_centred)
## 
## Call:
## lm(formula = cars$dist ~ I(cars$speed - mean(cars$speed)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       42.9800     2.1750  19.761  < 2e-16 ***
## I(cars$speed - mean(cars$speed))   3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Since we have centred the predictor variable, the y intercept now has a more meaningful interpretation: it equals the expected stopping distance at the average speed, which is simply the mean of the stopping distances. So, when the car is travelling at its average speed (about 15.4 mph), the predicted distance taken to stop is approximately 43 feet.
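As a quick check, the intercept of a model with a mean-centred predictor is simply the sample mean of the outcome, which we can confirm directly:

mean(cars$dist) #the mean stopping distance equals the centred model's intercept
## [1] 42.98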

Moreover, it is evident from this summary that the p value for the slope is very low (1.49e-12), well below the 0.05 level, so we can reject the null hypothesis that the slope is zero; speed is a statistically significant predictor of stopping distance.
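If the p value is needed programmatically rather than read off the printed summary, it can be extracted from the model's coefficient table (a minimal sketch; row 2 of the table is the slope term):

coef(summary(slm_centred))[2, "Pr(>|t|)"] #extract the slope's p value from the coefficient table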

Finally, we can overlay the regression line on the scatter plot created earlier.

Speed <- cars$speed
Distance <- cars$dist
ggplot(data = cars, mapping = aes(x = Speed, y = Distance)) +
  geom_point() +
  geom_smooth(method = 'lm', size = 2, alpha = 0.75) + #overlay the fitted line with its confidence band
  xlab("Speed (mph)") +
  ylab("Distance (ft)")

This vignette will now look at the correlation between the two variables. We will explore the concept of normalisation, and how plotting the variables after normalising them gives a clearer visualisation and explanation of the correlation between them.

Correlation

Given the statistically significant relationship between speed and distance, we can show the correlation between the variables both numerically and graphically. First, the cor() function will allow us to calculate the correlation coefficient between the variables.

cor(Speed, Distance) #calculate the correlation between speed and distance
## [1] 0.8068949

The correlation between speed and distance is 0.807, which represents a strong positive correlation between the two variables.
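As a complementary check using base R's stats package, cor.test() tests whether this correlation differs significantly from zero; in a simple linear regression its t statistic is identical to the slope's t value (9.464) from the regression summary:

cor.test(Speed, Distance) #test the null hypothesis that the true correlation is zero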

Now, we will normalise each of the variables (subtract the mean and divide by the standard deviation) to demonstrate that normalisation does not change the correlation between them. This allows us to graph the correlation directly as the slope of a line.

distance1 <- (cars$dist - mean(cars$dist))/sd(cars$dist) # normalise each variable
speed1 <- (cars$speed - mean(cars$speed))/sd(cars$speed)
cor(distance1,speed1)
## [1] 0.8068949

As can be seen above, normalising the variables has not affected the correlation coefficient of 0.807.
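Equivalently, base R's scale() function performs the same standardisation in one step (note that scale() returns a one-column matrix, so cor() prints a 1 x 1 matrix here):

speed_z <- scale(cars$speed) #scale() subtracts the mean and divides by the standard deviation
dist_z <- scale(cars$dist)
cor(speed_z, dist_z) #again 0.8068949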

Finally, we can represent this correlation graphically to better illustrate the point.

distance1 <- (cars$dist - mean(cars$dist))/sd(cars$dist) #normalise the variable
speed1 <- (cars$speed - mean(cars$speed))/sd(cars$speed) #normalise the variable
rho <- cor(speed1, distance1) #assign 'rho' to the correlation between the variables
ggplot(data = cars, mapping = aes(x = speed1, y = distance1)) +
  geom_point() +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0) +
  geom_abline(intercept = 0, slope = rho, size = 2, colour = "blue") +
  xlab("Speed (standardised)") +
  ylab("Distance (standardised)")

The graph above shows the speed and distance variables after normalisation, meaning each variable has mean 0 and standard deviation 1. Reference lines at x = 0 and y = 0 mark the origin, which is now the mean of both variables. The blue line, whose slope equals the correlation coefficient, represents the correlation between the variables. The graph also illustrates the concept of regression to the mean: the predicted standardised distance is the correlation coefficient (which is less than 1) multiplied by the standardised speed, so predictions are pulled towards zero. And since the variables have been normalised, their mean is zero, hence regression to the mean.
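This interpretation can be confirmed numerically: regressing the standardised distance on the standardised speed yields a slope exactly equal to the correlation coefficient and an intercept of zero (a short check using the normalised variables defined above):

coef(lm(distance1 ~ speed1)) #slope equals rho (0.8068949); intercept is zero up to machine precision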

Result

This vignette has worked through a simple linear regression problem, using the tidyverse package to analyse the relationship between speed and stopping distance and to represent the results of the regression analysis graphically.
