Scatterplot to Linear Regression model

Introduction

As this is my first ever vignette, I thought I should keep it simple. Using the ‘mtcars’ data set that ships with R and two of its variables, miles per gallon and car weight, this vignette shows how to investigate the relationship between them by creating a simple scatter plot, calculating a correlation coefficient and adding a regression line to the scatter plot.

Getting Started

I will use the mtcars data set, which is built into R as part of the datasets package, so it is available in RStudio without any extra installation.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
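Since only two of these columns are used in what follows, it can help to look at them on their own first. The lines below are a minimal sketch; the name `vars` is just an illustrative choice and is not used elsewhere in this vignette.

```r
# mtcars ships with base R (datasets package), so no download is needed
vars <- mtcars[, c("mpg", "wt")]   # keep only the two columns of interest

dim(vars)       # number of cars (rows) and variables (columns)
summary(vars)   # range, quartiles and mean of each variable
```

According to the `?mtcars` help page, `mpg` is miles per (US) gallon and `wt` is weight in units of 1000 lbs.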

I will use the ggplot2 package to create the visualizations.

library(ggplot2)

Creating a Basic Scatterplot

Scatter plots have been described as “arguably the most versatile, polymorphic and generally useful invention in the history of statistical graphics”. That sounds like a pretty big call, so I thought this would be a good place to start.

A simple scatter plot can be created using the R code below. Using the ‘mtcars’ data set, I will plot the weight “wt” on the x axis and miles per gallon “mpg” on the y axis, and draw the points with the geom_point function.

# Create a scatter plot of mpg vs weight
ggplot(mtcars, aes(x=wt, y=mpg)) + geom_point()

To make the axis names easier to understand, we can rename the axes using the scale_x_continuous and scale_y_continuous functions.

#Add names to x and y axis
ggplot(mtcars, aes(x=wt, y=mpg)) + geom_point() + scale_x_continuous("Weight of Car") + scale_y_continuous("Miles Per Gallon")
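As an alternative sketch, the same axis titles can be set with ggplot2's `labs()` function, which changes only the labels and leaves the scales alone. The object name `p` and the unit in the x-axis label (taken from the `?mtcars` help page) are additions for illustration.

```r
library(ggplot2)

# labs() sets the axis titles without touching the scales themselves
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight of Car (1000 lbs)", y = "Miles Per Gallon")

p  # printing the object draws the plot
```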

The scatter plot gives a visual impression of the relationship between the two variables. From this observation we could probably say that the relationship is linear, negative and fairly strong.

Quantifying the Strength of the Relationship

I now want to quantify the strength of the linear relationship between the two variables. I will use the Pearson product-moment correlation, calculated with the code below.

# Calculate the correlation coefficient
cor(mtcars$wt, mtcars$mpg)
## [1] -0.8676594

The correlation coefficient is a number between -1 and 1. Its magnitude indicates the strength of the linear relationship, and its sign gives the direction of that relationship. A coefficient of exactly -1 indicates a perfect negative linear relationship, and a value close to -1 a nearly perfect one. In this case we observe a correlation coefficient of -0.8676594, which indicates a strong, negative linear relationship between the two variables, consistent with what we observed in the scatter plot.
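To see where this number comes from, the coefficient can also be computed by hand as the covariance of the two variables divided by the product of their standard deviations. The sketch below uses only base R; the names `x`, `y` and `r_manual` are illustrative.

```r
x <- mtcars$wt
y <- mtcars$mpg

# Pearson's r is the covariance scaled by both standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))
r_manual

# Should agree with the built-in cor()
all.equal(r_manual, cor(x, y))

# cor.test() adds a confidence interval and a test of the hypothesis r = 0
cor.test(x, y)
```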

Adding a Regression Line to the Scatterplot

I can now add a regression line. The simple linear regression model for miles per gallon as a function of weight can be visualized on the scatter plot as a straight line. This is a “best fit” line: it cuts through the data in the way that minimizes the sum of the squared vertical distances between the line and the data points.
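The “best fit” idea can be checked numerically. The sketch below assumes the line is fitted by ordinary least squares with R's lm function (used later in this vignette) and compares its residual sum of squares with that of two slightly different lines. The helper `rss` and the names `fit`, `a_hat` and `b_hat` are hypothetical, introduced only for this illustration.

```r
# Residual sum of squares for a candidate line mpg = a + b * wt
rss <- function(a, b) sum((mtcars$mpg - (a + b * mtcars$wt))^2)

# Least-squares fit and its two coefficients
fit <- lm(mpg ~ wt, data = mtcars)
a_hat <- coef(fit)[["(Intercept)"]]
b_hat <- coef(fit)[["wt"]]

rss(a_hat, b_hat)        # the least-squares line
rss(a_hat, b_hat + 0.5)  # same intercept, flatter slope: larger RSS
rss(a_hat + 1, b_hat)    # same slope, shifted upwards: larger RSS
```

Any change to the intercept or slope increases the residual sum of squares, which is what “minimizes” means here.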

I can add the regression line using the geom_smooth function in the code below:

# Add the regression line
ggplot(mtcars, aes(x=wt, y=mpg)) + 
  geom_point() +
  # method = "lm" fits a straight line; the grey band is the default confidence interval
  geom_smooth(method="lm")

Now that we have seen the regression line, we can build the linear model itself. Using the lm function, R will calculate the intercept and the slope.

lm(mtcars$mpg~mtcars$wt)
## 
## Call:
## lm(formula = mtcars$mpg ~ mtcars$wt)
## 
## Coefficients:
## (Intercept)    mtcars$wt  
##      37.285       -5.344

The aim of linear regression is to model a continuous variable Y as a mathematical function of one or more X variable(s), so that we can use the regression model to predict the Y when only the X is known. This mathematical equation can be generalized as follows:

Y = β1 + β2X + ϵ

where β1 is the intercept and β2 is the slope. Collectively, they are called the regression coefficients. ϵ is the error term, the part of Y that the regression model is unable to explain.
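For a single predictor, the least-squares estimates of the two coefficients have a simple closed form: the slope is the covariance of X and Y divided by the variance of X, and the intercept makes the line pass through the point of means. The sketch below applies this to the data used here; the variable names are illustrative.

```r
x <- mtcars$wt
y <- mtcars$mpg

# Slope (β2): covariance of x and y divided by the variance of x
slope <- cov(x, y) / var(x)

# Intercept (β1): the fitted line passes through (mean(x), mean(y))
intercept <- mean(y) - slope * mean(x)

round(c(intercept = intercept, slope = slope), 3)
```

These values should match the coefficients reported by lm above.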

Using the above formula, as we now have the values of both coefficients, we can predict the value of Y (miles per gallon) for a given value of X (weight):

mpg = 37.285 + (-5.344 * wt)
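Rather than plugging numbers into the equation by hand, R's predict function can do the arithmetic. This sketch refits the model with the formula/data interface (`mpg ~ wt, data = mtcars`), because as far as I know predict can only match new data by column name when the model is specified that way. The weights in `new_cars` are made-up example values, not cars from the data set.

```r
# Fit with the formula/data interface so predict() can find "wt" in new data
fit <- lm(mpg ~ wt, data = mtcars)

# Hypothetical cars weighing 2,500, 3,000 and 3,500 lbs (wt is in 1000 lbs)
new_cars <- data.frame(wt = c(2.5, 3.0, 3.5))

# Predicted miles per gallon for each hypothetical car
pred <- predict(fit, newdata = new_cars)
pred
```

Each prediction is the same calculation as the equation above, for example 37.285 + (-5.344 * 3.0) for the second car.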

The question now is how statistically significant the model is.

Checking Statistical Significance

To check the statistical significance of the model, we can print the summary statistics using the code below:

summary(lm(mtcars$mpg~mtcars$wt))
## 
## Call:
## lm(formula = mtcars$mpg ~ mtcars$wt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## mtcars$wt    -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
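The individual figures in this summary can also be pulled out programmatically instead of being read off the printout. The sketch below assumes the model is refitted with the formula/data interface, so the slope row is named `wt` rather than `mtcars$wt`; the names `fit` and `s` are illustrative.

```r
fit <- lm(mpg ~ wt, data = mtcars)
s <- summary(fit)

# Coefficient table: estimate, standard error, t value and p-value per term
coef(s)

# p-value for the slope
coef(s)["wt", "Pr(>|t|)"]

# R-squared: share of the variance in mpg explained by wt
s$r.squared

# For a one-predictor model, R-squared equals the squared correlation
all.equal(s$r.squared, cor(mtcars$wt, mtcars$mpg)^2)
```

The last line ties the summary back to the correlation coefficient calculated earlier.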

Conclusion

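Looking at the p-value (bottom row, right-hand side of the summary), the model returns a value of 1.294e-10.

We generally consider a model to be statistically significant only when the p-value is below 0.05. In this case the p-value is far below 0.05, so the relationship between weight and miles per gallon is statistically significant. The Multiple R-squared of 0.7528 also indicates that weight explains about 75% of the variation in miles per gallon.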
