LINEAR REGRESSION SAMPLE ANALYSIS

Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

This dataset describes the exam performance of high school students, with math, reading, and writing scores plus demographic information. It is presented as covering three high schools in the United States, but the source notes that the data is synthetic: “This dataset was created for educational purposes and was generated, not collected from actual data sources.”

Importing Data

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.1.8
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(dplyr)
exams <- read_csv("/Users/otheraccount/Downloads/exams.csv")
## Rows: 1000 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): gender, race/ethnicity, parental level of education, lunch, test pr...
## dbl (3): math score, reading score, writing score
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
exams
## # A tibble: 1,000 × 8
##    gender `race/ethnicity` parental leve…¹ lunch test …² math …³ readi…⁴ writi…⁵
##    <chr>  <chr>            <chr>           <chr> <chr>     <dbl>   <dbl>   <dbl>
##  1 female group D          some college    stan… comple…      59      70      78
##  2 male   group D          associate's de… stan… none         96      93      87
##  3 female group D          some college    free… none         57      76      77
##  4 male   group B          some college    free… none         70      70      63
##  5 female group D          associate's de… stan… none         83      85      86
##  6 male   group C          some high scho… stan… none         68      57      54
##  7 female group E          associate's de… stan… none         82      83      80
##  8 female group B          some high scho… stan… none         46      61      58
##  9 male   group C          some high scho… stan… none         80      75      73
## 10 female group C          bachelor's deg… stan… comple…      57      69      77
## # … with 990 more rows, and abbreviated variable names
## #   ¹​`parental level of education`, ²​`test preparation course`, ³​`math score`,
## #   ⁴​`reading score`, ⁵​`writing score`
head(exams)
## # A tibble: 6 × 8
##   gender `race/ethnicity` parental level…¹ lunch test …² math …³ readi…⁴ writi…⁵
##   <chr>  <chr>            <chr>            <chr> <chr>     <dbl>   <dbl>   <dbl>
## 1 female group D          some college     stan… comple…      59      70      78
## 2 male   group D          associate's deg… stan… none         96      93      87
## 3 female group D          some college     free… none         57      76      77
## 4 male   group B          some college     free… none         70      70      63
## 5 female group D          associate's deg… stan… none         83      85      86
## 6 male   group C          some high school stan… none         68      57      54
## # … with abbreviated variable names ¹​`parental level of education`,
## #   ²​`test preparation course`, ³​`math score`, ⁴​`reading score`,
## #   ⁵​`writing score`
summary(exams)
##     gender          race/ethnicity     parental level of education
##  Length:1000        Length:1000        Length:1000                
##  Class :character   Class :character   Class :character           
##  Mode  :character   Mode  :character   Mode  :character           
##                                                                   
##                                                                   
##                                                                   
##     lunch           test preparation course   math score     reading score   
##  Length:1000        Length:1000             Min.   : 15.00   Min.   : 25.00  
##  Class :character   Class :character        1st Qu.: 58.00   1st Qu.: 61.00  
##  Mode  :character   Mode  :character        Median : 68.00   Median : 70.50  
##                                             Mean   : 67.81   Mean   : 70.38  
##                                             3rd Qu.: 79.25   3rd Qu.: 80.00  
##                                             Max.   :100.00   Max.   :100.00  
##  writing score   
##  Min.   : 15.00  
##  1st Qu.: 59.00  
##  Median : 70.00  
##  Mean   : 69.14  
##  3rd Qu.: 80.00  
##  Max.   :100.00
# Rename the column `math score` to `math` and assign the result back to exams
# (calling rename() without assignment only prints a renamed copy)
exams <- exams %>% 
    rename("math" = "math score")
print(exams)
## # A tibble: 1,000 × 8
##    gender `race/ethnicity` parental level …¹ lunch test …²  math readi…³ writi…⁴
##    <chr>  <chr>            <chr>             <chr> <chr>   <dbl>   <dbl>   <dbl>
##  1 female group D          some college      stan… comple…    59      70      78
##  2 male   group D          associate's degr… stan… none       96      93      87
##  3 female group D          some college      free… none       57      76      77
##  4 male   group B          some college      free… none       70      70      63
##  5 female group D          associate's degr… stan… none       83      85      86
##  6 male   group C          some high school  stan… none       68      57      54
##  7 female group E          associate's degr… stan… none       82      83      80
##  8 female group B          some high school  stan… none       46      61      58
##  9 male   group C          some high school  stan… none       80      75      73
## 10 female group C          bachelor's degree stan… comple…    57      69      77
## # … with 990 more rows, and abbreviated variable names
## #   ¹​`parental level of education`, ²​`test preparation course`,
## #   ³​`reading score`, ⁴​`writing score`

Building a linear model

# Plot the response (math) on the y-axis against the predictor (gender)
ggplot(exams, aes(x = gender, y = math)) +
  geom_point()

Fitting the model with the “lm” function:

exams_lm <- lm(math ~ gender, data = exams)
exams_lm
## 
## Call:
## lm(formula = math ~ gender, data = exams)
## 
## Coefficients:
## (Intercept)   gendermale  
##      64.774        5.976

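The fitted regression equation is math = 64.774 + 5.976 ∗ gendermale, where gendermale is an indicator equal to 1 for male students and 0 for female students. The predicted math score is therefore 64.774 for female students and 64.774 + 5.976 = 70.750 for male students.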
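With a single two-level predictor, the intercept is the mean of the reference group and the gendermale coefficient is the difference between the two group means. The sketch below checks this on simulated scores. The data frame `sim`, the seed, and the chosen means and standard deviation are assumptions made for illustration and are not the exams data.

```r
# Simulated stand-in for the exams data (hypothetical values, not the real file)
set.seed(1)
sim <- data.frame(
  gender = rep(c("female", "male"), each = 50),
  math   = c(rnorm(50, mean = 65, sd = 15), rnorm(50, mean = 71, sd = 15))
)

# lm() turns the character column into a factor; levels are sorted
# alphabetically, so "female" becomes the reference level
fit <- lm(math ~ gender, data = sim)

# Group means computed directly from the data
mean_female <- mean(sim$math[sim$gender == "female"])
mean_male   <- mean(sim$math[sim$gender == "male"])

# Intercept should match the female mean,
# and the gendermale coefficient should match the difference in means
coef(fit)
c(female_mean = mean_female, difference = mean_male - mean_female)
```

Because the model only estimates two group means, it is equivalent to comparing the average math score of the two genders.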

# group = 1 lets the fitted line connect the two gender categories
ggplot(data = exams, aes(x = gender, y = math, group = 1)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'

summary(exams_lm)
## 
## Call:
## lm(formula = math ~ gender, data = exams)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.750  -9.774   1.226  10.250  33.226 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  64.7744     0.6745  96.028  < 2e-16 ***
## gendermale    5.9756     0.9464   6.314 4.08e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.96 on 998 degrees of freedom
## Multiple R-squared:  0.03841,    Adjusted R-squared:  0.03745 
## F-statistic: 39.87 on 1 and 998 DF,  p-value: 4.084e-10
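The Multiple R-squared of 0.03841 in the summary means that gender accounts for roughly 3.8% of the variation in math scores, and the residual standard error of 14.96 is the typical size of a residual in score points. As a rough sketch of where these two numbers come from, the code below recomputes them by hand on simulated data. The data frame `sim` and its parameters are assumptions for illustration only.

```r
# Simulated stand-in for the exams data (hypothetical values, not the real file)
set.seed(2)
sim <- data.frame(
  gender = rep(c("female", "male"), each = 100),
  math   = c(rnorm(100, mean = 65, sd = 15), rnorm(100, mean = 71, sd = 15))
)
fit <- lm(math ~ gender, data = sim)

# Residual and total sums of squares
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((sim$math - mean(sim$math))^2)

# R-squared: share of total variation explained by the model
r2_manual <- 1 - ss_res / ss_tot

# Residual standard error: sqrt of residual sum of squares over residual df
rse_manual <- sqrt(ss_res / df.residual(fit))

# Compare the manual values with what summary() reports
c(manual = r2_manual,  summary = summary(fit)$r.squared)
c(manual = rse_manual, summary = summary(fit)$sigma)
```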

Residual Analysis

ggplot(data = exams_lm, aes(x = .fitted, y = .resid)) +
  geom_point()

par(mfrow = c(2, 2))
plot(exams_lm)
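Because gender has only two levels, the model produces only two distinct fitted values, so a residuals-versus-fitted plot shows two vertical strips rather than a cloud. In that setting, one possible way to judge the model is to compare the residual spread between the two groups and to check the residuals for normality. The sketch below does this on simulated data. The data frame `sim`, its parameters, and the choice of Bartlett and Shapiro-Wilk tests are assumptions for illustration, not part of the original analysis.

```r
# Simulated stand-in for the exams data (hypothetical values, not the real file)
set.seed(3)
sim <- data.frame(
  gender = rep(c("female", "male"), each = 100),
  math   = c(rnorm(100, mean = 65, sd = 15), rnorm(100, mean = 71, sd = 15))
)
fit <- lm(math ~ gender, data = sim)
res <- residuals(fit)

# A two-level predictor yields exactly two distinct fitted values
n_fitted <- length(unique(round(fitted(fit), 8)))

# Constant-variance check: residual standard deviation within each group
sd_by_group <- tapply(res, sim$gender, sd)

# Formal checks: equal variances across groups and normality of residuals
var_check  <- bartlett.test(res, sim$gender)
norm_check <- shapiro.test(res)

n_fitted
sd_by_group
var_check$p.value
norm_check$p.value
```

Small p-values from either test would suggest that the equal-variance or normality assumption is questionable. With 1,000 rows, such tests can flag even minor departures, so they are best read alongside the diagnostic plots.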

https://www.kaggle.com/code/jekeelmayurshah/students-performance-prediction