As this is my first ever vignette, I thought I should keep it simple. Using the ‘mtcars’ data set and two of its variables, miles per gallon and car weight, this vignette shows how to investigate the correlation between them by creating a simple scatter plot, calculating a correlation coefficient and adding a regression line to the scatter plot.
I will use the mtcars data set, which is built into R (it is part of the datasets package), so it is available in RStudio without installing anything.
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
I will use the ggplot2 package to create the visualizations.
library(ggplot2)
Scatter plots have been described as “arguably the most versatile polymorphic and generally useful invention in the history of statistical graphics”. That sounds like a pretty big call, so I thought this would be a good place to start.
A simple scatter plot can be created using the R code below. Using the ‘mtcars’ data set, I will plot the weight “wt” on the x axis and miles per gallon “mpg” on the y axis, and draw the points with the geom_point function.
#Create a scatter plot of mpg vs weight
ggplot(mtcars, aes(x=wt, y=mpg)) + geom_point()
To make the axis names more understandable, we can rename the axes using the scale_x_continuous and scale_y_continuous functions.
#Add names to x and y axis
ggplot(mtcars, aes(x=wt, y=mpg)) + geom_point() + scale_x_continuous("Weight of Car") + scale_y_continuous("Miles Per Gallon")
The scatter plot gives a visual impression of the relationship between the two variables. From this plot we could probably say that the relationship is linear, negative and moderately strong.
I now want to quantify the strength of the linear relationship between the two variables. I will use the Pearson product-moment correlation coefficient, calculated with the code below.
#Calculate the correlation coefficient
cor(mtcars$wt, mtcars$mpg)
## [1] -0.8676594
The correlation coefficient is a number between -1 and 1. The magnitude of the number indicates the strength of the linear relationship, and the sign gives the direction of that relationship. A correlation coefficient of -1 indicates a perfect negative linear relationship. In this case we observe a correlation coefficient of -0.8676594, which indicates a strong, negative linear relationship between the two variables, consistent with what we observed in the scatter plot.
I can now add a regression line. The simple linear regression model for miles per gallon as a function of weight can be visualized on the scatter plot by a straight line. This is a “best fit” line: it cuts through the data in a way that minimizes the sum of the squared vertical distances between the line and the data points.
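To see what cor is doing, the coefficient can be reproduced from its definition: the covariance of the two variables divided by the product of their standard deviations. This is only a minimal sketch in base R; the names x, y, r_manual and r_builtin are mine, chosen for illustration.

```r
# Reproduce Pearson's r from its definition (variable names are illustrative)
x <- mtcars$wt
y <- mtcars$mpg

# Covariance scaled by the product of the two standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# The built-in function, for comparison
r_builtin <- cor(x, y)

# TRUE if the two agree up to floating-point error
all.equal(r_manual, r_builtin)
```

Because the value is scaled by both standard deviations, it does not depend on the units the two variables are measured in.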
I can add the regression line using the geom_smooth function in the code below:
# Add the regression line
ggplot(mtcars, aes(x=wt, y=mpg)) +
  geom_point() +
  geom_smooth(method=lm)
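By default geom_smooth also draws a shaded band around the fitted line, which is a 95% confidence interval. As a small variation, and assuming ggplot2 is installed, the sketch below switches the band off with se = FALSE and spells out the formula explicitly. The variable name p is just my choice for holding the plot.

```r
library(ggplot2)

# Same scatter plot, with the regression line drawn without the
# shaded confidence band (se = FALSE)
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Printing the object draws the plot
print(p)
```

Whether to keep the band is a matter of taste; leaving it on shows how uncertain the line is at the extremes of weight, where there are fewer cars.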
Now that we have the regression line, we can build the linear model itself. Using the lm function, R will calculate the intercept and the slope.
lm(mtcars$mpg~mtcars$wt)
##
## Call:
## lm(formula = mtcars$mpg ~ mtcars$wt)
##
## Coefficients:
## (Intercept)    mtcars$wt
##      37.285       -5.344
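For a single predictor, the least-squares slope and intercept have simple closed forms, so the numbers reported by lm can be checked by hand. The sketch below is a rough illustration in base R; x, y, slope, intercept and fit are names I have introduced, and the model is refit with the formula mpg ~ wt and data = mtcars, which gives the same estimates.

```r
# Least-squares estimates for simple linear regression, computed by hand
x <- mtcars$wt
y <- mtcars$mpg

# Slope: covariance of x and y divided by the variance of x
slope <- cov(x, y) / var(x)

# Intercept: the fitted line passes through the point (mean(x), mean(y))
intercept <- mean(y) - slope * mean(x)

c(intercept = intercept, slope = slope)

# Compare with the coefficients estimated by lm
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)
```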
The aim of linear regression is to model a continuous variable Y as a mathematical function of one or more X variable(s), so that we can use the regression model to predict the Y when only the X is known. This mathematical equation can be generalized as follows:
Y = β1 + β2X + ϵ
where β1 is the intercept and β2 is the slope. Collectively, they are called regression coefficients. ϵ is the error term, the part of Y the regression model is unable to explain.
Using the above formula, and now that we have the values of both coefficients, we can predict the value of Y (miles per gallon) given a value of X (weight):
mpg = 37.285 + (-5.344 * wt)
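The same prediction can be made in R with the predict function. This is a minimal sketch under a couple of assumptions of mine: the weights in new_cars are hypothetical values picked for illustration (wt is recorded in units of 1000 lbs, so 3.0 means a 3000 lb car), and the model is fitted as mpg ~ wt with data = mtcars so that predict can look up the wt column in the new data frame.

```r
# Fit the model with a data argument so predict() can match new data by name
fit <- lm(mpg ~ wt, data = mtcars)

# Hypothetical car weights, in units of 1000 lbs
new_cars <- data.frame(wt = c(2.5, 3.0, 3.5))

# Predicted miles per gallon for each weight
predicted_mpg <- predict(fit, newdata = new_cars)
predicted_mpg

# The same values from the equation: intercept + slope * wt
manual_mpg <- coef(fit)[["(Intercept)"]] + coef(fit)[["wt"]] * new_cars$wt
manual_mpg
```

For a 3000 lb car the model predicts roughly 21 miles per gallon. Predictions are most trustworthy inside the range of weights seen in the data.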
The question now is: how statistically significant is the model?
To check the statistical significance of the model we can print the summary statistics using the code below:
summary(lm(mtcars$mpg~mtcars$wt))
##
## Call:
## lm(formula = mtcars$mpg ~ mtcars$wt)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.5432 -2.3647 -0.1252  1.4096  6.8727
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## mtcars$wt    -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
Looking at the p-value (bottom row, right-hand side), the model returns a value of 1.294e-10.
We generally only consider a model to be statistically significant when the p-value is < 0.05; in this case the p-value is far below 0.05.
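Rather than reading the numbers off the printed summary, they can be pulled out of the summary object directly. The sketch below is one way to do that, with object names (fit, fit_summary, slope_p, r_squared, r) that I have made up for illustration. It also checks that, with a single predictor, the Multiple R-squared is simply the square of the correlation coefficient calculated earlier.

```r
# Fit the model and keep its summary as an object
fit <- lm(mpg ~ wt, data = mtcars)
fit_summary <- summary(fit)

# p-value for the slope, taken from the coefficient table
slope_p <- coef(fit_summary)["wt", "Pr(>|t|)"]
slope_p

# Is the slope significant at the usual 0.05 level?
slope_p < 0.05

# Proportion of the variance in mpg explained by weight
r_squared <- fit_summary$r.squared
r_squared

# With one predictor, R-squared equals the squared correlation
r <- cor(mtcars$wt, mtcars$mpg)
all.equal(r_squared, r^2)
```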
http://campus.datacamp.com/courses/correlation-and-regression/simple-linear-regression
http://blog.yhat.com/posts/r-lm-summary.html
http://www.sthda.com
http://www.r-tutor.com/elementary-statistics/simple-linear-regression