A linear regression is a linear approximation of a causal relationship between two or more variables. A regression model is commonly used to make predictions and to draw inferences about a larger population.
In a simple linear regression model, there are two variables involved: a dependent (outcome) variable and an independent (predictor) variable.
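Written as an equation, the model can be sketched as follows. The symbols y (outcome), x (predictor), and the error term are notation assumed here for illustration; b0 and b1 are the coefficient names used later in this post.

```latex
y = b_0 + b_1 x + \varepsilon, \qquad \hat{y} = b_0 + b_1 x
```

Here b0 is the intercept (the expected value of y when x = 0) and b1 is the slope (the expected change in y for a one-unit increase in x).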
I will fit a simple linear regression model using a sample dataset taken from the 365 Data Science Introduction to R Programming course.
First, I will load the packages needed. Because I am only fitting a simple linear regression model, the lm() function from R’s built-in ‘stats’ package is all I need for the model itself; dplyr and the tidyverse (which provides ggplot2) are loaded for data handling and plotting.
#loading packages for usage
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.4 v purrr 0.3.4
## v tibble 3.1.2 v stringr 1.4.0
## v tidyr 1.1.3 v forcats 0.5.1
## v readr 1.4.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Next, I will import the data and get a summary of it.
college<-read.csv("A:/RStudio/R_Projects/elatih/365/Data/regression-example.csv")
View(college)
summary(college)
## SAT GPA
## Min. :1634 Min. :2.400
## 1st Qu.:1772 1st Qu.:3.190
## Median :1846 Median :3.380
## Mean :1845 Mean :3.330
## 3rd Qu.:1934 3rd Qu.:3.502
## Max. :2050 Max. :3.810
As you can see, the data contains two columns, ‘SAT’ and ‘GPA’. I would like to fit a simple regression model to see whether SAT scores are predictive of a student’s GPA upon college graduation. Understanding the extent of the relationship between SAT scores and graduating GPA can provide insight into how public and private stakeholders view academic achievement in secondary and tertiary education.
Before I fit the regression model, I want to draw a scatter plot to see whether the relationship between the two variables looks linear.
#plotting data for exploration
sat <- ggplot(college, aes(x = SAT, y = GPA)) +
  geom_point() +
  theme_minimal() +
  labs(x = "SAT Scores",
       y = "GPA upon graduation",
       title = "SAT and GPA")
sat
It is important to check that the relationship between the variables is roughly linear before running the actual regression model. The plot above shows a positive, approximately linear association between SAT scores and GPA.
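A visual check can be backed up with a correlation coefficient. The sketch below is only an illustration: the data frame `demo` is simulated stand-in data (a hypothetical name and hypothetical values, not the course dataset), built so that the block runs on its own.

```r
# Hypothetical stand-in data: simulated for illustration, NOT the course dataset
set.seed(42)
demo <- data.frame(SAT = round(runif(84, min = 1634, max = 2050)))
demo$GPA <- 0.275 + 0.0017 * demo$SAT + rnorm(84, sd = 0.21)

# Pearson correlation: values near +1 or -1 suggest a strong linear association
r <- cor(demo$SAT, demo$GPA)

# cor.test() adds a significance test and a confidence interval for r
ct <- cor.test(demo$SAT, demo$GPA)

print(r)
print(ct$conf.int)
```

With the real data, the same check would be `cor(college$SAT, college$GPA)`.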
#building and running regression model
model<-lm(GPA~SAT, data=college)
summary(model)
##
## Call:
## lm(formula = GPA ~ SAT, data = college)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.71289 -0.12825 0.03256 0.11660 0.43957
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2750403 0.4087394 0.673 0.503
## SAT 0.0016557 0.0002212 7.487 7.2e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2106 on 82 degrees of freedom
## Multiple R-squared: 0.406, Adjusted R-squared: 0.3988
## F-statistic: 56.05 on 1 and 82 DF, p-value: 7.2e-11
#adding a regression line to previous scatterplot
sat + stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
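To see where the numbers in the SAT row of the coefficient table come from, they can be reproduced by hand. This is a minimal sketch that copies the estimate, standard error, and residual degrees of freedom from the output above; the variable names are my own.

```r
# Values copied from the SAT row of the coefficient table above
estimate  <- 0.0016557
std_error <- 0.0002212
df_resid  <- 82          # residual degrees of freedom

# t value = estimate divided by its standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# 95% confidence interval for the slope
t_crit <- qt(0.975, df = df_resid)
ci_95  <- estimate + c(-1, 1) * t_crit * std_error

print(t_value)
print(p_value)
print(ci_95)
```

The t value and p-value agree with the table up to rounding of the copied inputs. With the fitted object available, `confint(model)` should give the same interval.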
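From the coefficients table, we can see that b0 = 0.275 and b1 = 0.0017. This means that, for every one-point increase in SAT score, the graduating GPA is expected to increase by about 0.0017 (roughly 0.17 GPA points per 100 SAT points). The p-value for the SAT coefficient (p = 7.2e-11, i.e. p < 0.001) indicates that SAT score is a statistically significant predictor of graduating college GPA. The intercept, by contrast, is not significantly different from zero (p = 0.503).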
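As a worked example of what the coefficients imply, the fitted line can be used to compute expected GPAs for a few SAT scores. The sketch below copies the estimates from the summary output; the SAT values chosen are arbitrary illustrations within the observed range (1634 to 2050).

```r
# Coefficients copied from the model summary
b0 <- 0.2750403   # intercept
b1 <- 0.0016557   # slope for SAT

# Arbitrary example scores inside the observed SAT range
sat_scores <- c(1700, 1850, 2000)

# Fitted line: predicted GPA = b0 + b1 * SAT
predicted_gpa <- b0 + b1 * sat_scores

print(data.frame(SAT = sat_scores, predicted_GPA = round(predicted_gpa, 2)))
```

For example, an SAT score of 1850 gives a predicted GPA of about 3.34. With the fitted object, `predict(model, newdata = data.frame(SAT = sat_scores))` should return the same values. Predictions for scores outside the observed range should be treated with caution.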
The adjusted R-squared tells us that the regression model with only SAT scores explains about 39.88% of the variability in the students’ graduating GPA (the unadjusted multiple R-squared is 40.6%). However, since the model explains less than 50% of the variability, there are probably other variables, not taken into account here, that also help predict college GPA. To get a fuller picture, a multiple linear regression model is recommended so that other possible variables (such as gender, income, IQ score, or parents’ level of education) can be included in the regression model.
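The link between the two R-squared values in the summary can be checked directly. This is a small sketch using the standard adjustment formula; the sample size of 84 is inferred from the 82 residual degrees of freedom plus the two estimated coefficients.

```r
# Values taken from the model summary
r2 <- 0.406        # multiple R-squared
n  <- 82 + 2       # observations: residual df + 2 estimated coefficients
p  <- 1            # number of predictors (SAT only)

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(adj_r2, 4))
```

The result matches the reported 0.3988 to rounding. The adjustment matters more in a multiple regression, where adding predictors always raises the unadjusted R-squared, even when the new predictors carry little information.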