Linear regression is a linear approximation of the relationship between two or more variables. On its own, it describes an association and does not prove causation. A regression model is commonly used to make predictions and draw inferences about a larger population.

In a simple linear regression model, there are two variables involved:

  1. The dependent variable (y): The variable being predicted.
  2. The independent variable (x): The variable that acts as the predictor.

I will fit a simple linear regression model using a sample dataset taken from the 365 Data Science Introduction to R Programming course.

First, I will load the packages needed. Because I am fitting only a simple linear regression model, the lm() function from R’s built-in ‘stats’ package is sufficient. The tidyverse is loaded for ggplot2, which I use for plotting.

#loading packages for usage
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.4     v purrr   0.3.4
## v tibble  3.1.2     v stringr 1.4.0
## v tidyr   1.1.3     v forcats 0.5.1
## v readr   1.4.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Next, I will import the data and get a summary of it.

college<-read.csv("A:/RStudio/R_Projects/elatih/365/Data/regression-example.csv")
View(college)
summary(college)
##       SAT            GPA       
##  Min.   :1634   Min.   :2.400  
##  1st Qu.:1772   1st Qu.:3.190  
##  Median :1846   Median :3.380  
##  Mean   :1845   Mean   :3.330  
##  3rd Qu.:1934   3rd Qu.:3.502  
##  Max.   :2050   Max.   :3.810

As you can see, the data contains two columns of variables titled ‘SAT’ and ‘GPA’. I would like to fit a simple regression model to see whether SAT scores are predictive of a student’s GPA upon college graduation. Understanding the extent of the relationship between SAT scores and graduating GPA can offer insight into how public and private stakeholders view academic achievement in secondary and tertiary education.

Before I fit the regression model, I want to draw a scatter plot to check whether the relationship between the two variables looks linear.

#plotting data for exploration

sat <- ggplot(college, aes(x = SAT, y = GPA)) +
  geom_point() +
  theme_minimal() +
  labs(x = "SAT Scores",
       y = "GPA upon graduation",
       title = "SAT and GPA")
sat

It is important to confirm a linear relationship between the variables before running the actual regression model. The plot above indeed shows that the two variables are linearly correlated.
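A numeric check can complement the visual one. The sketch below uses simulated stand-in data (an assumption, since the course CSV is not reproduced here) to show how the Pearson correlation could be computed and tested with cor() and cor.test() from the 'stats' package. The data frame `sim` and its values are hypothetical and only loosely mimic the SAT and GPA ranges in the summary above.

```r
# Simulated stand-in data: hypothetical values, NOT the course dataset
set.seed(42)
n <- 84
sim <- data.frame(SAT = round(runif(n, min = 1634, max = 2050)))
sim$GPA <- 0.275 + 0.0017 * sim$SAT + rnorm(n, mean = 0, sd = 0.21)

# Pearson correlation coefficient: values near +1 or -1 suggest
# a strong linear association, values near 0 a weak one
r <- cor(sim$SAT, sim$GPA)

# Test of the null hypothesis that the true correlation is zero
ct <- cor.test(sim$SAT, sim$GPA)

print(r)
print(ct$p.value)
```

On the real data, the analogous call would be cor(college$SAT, college$GPA). A correlation coefficient summarizes only linear association, so it supplements the scatter plot and does not replace it.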

#building and running regression model
model<-lm(GPA~SAT, data=college)
summary(model)
## 
## Call:
## lm(formula = GPA ~ SAT, data = college)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.71289 -0.12825  0.03256  0.11660  0.43957 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.2750403  0.4087394   0.673    0.503    
## SAT         0.0016557  0.0002212   7.487  7.2e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2106 on 82 degrees of freedom
## Multiple R-squared:  0.406,  Adjusted R-squared:  0.3988 
## F-statistic: 56.05 on 1 and 82 DF,  p-value: 7.2e-11
#adding a regression line to previous scatterplot
sat + stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

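From the coefficients table, we can see that b0=0.275 and b1=0.0017. This means that for every one-point increase in SAT score, the graduating GPA is expected to increase by about 0.0017, or roughly 0.17 GPA points per 100 SAT points. The p-value for the SAT coefficient (p = 7.2e-11, well below 0.001) indicates that SAT score is a statistically significant predictor of graduating college GPA. The intercept (p = 0.503) is not significantly different from zero, which matters little here because an SAT score of zero lies far outside the observed range.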
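To make the interpretation concrete, here is a minimal sketch that reuses the numbers printed in the summary output. The helper `predict_gpa` and the example score of 1850 are hypothetical and chosen only for illustration. The t statistic and p-value are recomputed from the rounded estimate and standard error, so they may differ slightly from the printed values.

```r
# Values copied from the summary(model) output
b0 <- 0.2750403      # intercept
b1 <- 0.0016557      # slope for SAT
se_b1 <- 0.0002212   # standard error of the slope
df_resid <- 82       # residual degrees of freedom

# Hypothetical helper: predicted GPA for a given SAT score
predict_gpa <- function(sat) b0 + b1 * sat

# Prediction for an illustrative SAT score inside the observed range (1634 to 2050)
gpa_1850 <- predict_gpa(1850)

# Expected difference in GPA for a 100-point difference in SAT
gpa_per_100 <- b1 * 100

# t statistic and two-sided p-value for the slope
t_b1 <- b1 / se_b1
p_b1 <- 2 * pt(-abs(t_b1), df = df_resid)

# 95% confidence interval for the slope
ci_b1 <- b1 + c(-1, 1) * qt(0.975, df = df_resid) * se_b1

print(c(gpa_1850 = gpa_1850, gpa_per_100 = gpa_per_100))
print(c(t = t_b1, p = p_b1))
print(ci_b1)
```

With the fitted model object available, predict(model, newdata = data.frame(SAT = 1850)) and confint(model) should give essentially the same quantities, up to rounding.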

The adjusted R-squared tells us that the regression model with only SAT scores explains about 39.88% of the variability in the students’ graduating GPA. However, since the model explains less than 50% of the variability, there are probably other variables that we did not take into account that could also predict college GPA. To get a fuller picture, a multiple linear regression model is recommended so that other possible variables (such as gender, income, IQ score, and parents’ level of education) can be included in the regression model.
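The adjusted R-squared can be recovered from the multiple R-squared, the sample size, and the number of predictors. The sketch below assumes the standard adjustment formula and infers n = 84 from the 82 residual degrees of freedom plus the two estimated coefficients. The function name `adjusted_r2` is hypothetical. Because the input R-squared is already rounded to 0.406, the result only approximately matches the reported 0.3988.

```r
# Hypothetical helper implementing the standard adjustment:
# 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

r_squared <- 0.406      # Multiple R-squared from the summary output
df_resid <- 82          # residual degrees of freedom from the summary output
p <- 1                  # number of predictors (SAT only)
n <- df_resid + p + 1   # sample size implied by the degrees of freedom

adj_r_squared <- adjusted_r2(r_squared, n, p)
print(adj_r_squared)
```

The adjustment penalizes each additional predictor, which is why adjusted R-squared is the more useful figure when comparing this model against a multiple regression with more variables.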