LINEAR REGRESSION SAMPLE ANALYSIS
Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?
This dataset describes the performance of high school students in mathematics, reading, and writing, together with demographic information. It is described as coming from three high schools in the United States, but the source also states: “This dataset was created for educational purposes and was generated, not collected from actual data sources.” The scores should therefore be treated as synthetic data.
Importing Data
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.1.8
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(dplyr)
exams <- read_csv("/Users/otheraccount/Downloads/exams.csv")
## Rows: 1000 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): gender, race/ethnicity, parental level of education, lunch, test pr...
## dbl (3): math score, reading score, writing score
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
exams
## # A tibble: 1,000 × 8
## gender `race/ethnicity` parental leve…¹ lunch test …² math …³ readi…⁴ writi…⁵
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 female group D some college stan… comple… 59 70 78
## 2 male group D associate's de… stan… none 96 93 87
## 3 female group D some college free… none 57 76 77
## 4 male group B some college free… none 70 70 63
## 5 female group D associate's de… stan… none 83 85 86
## 6 male group C some high scho… stan… none 68 57 54
## 7 female group E associate's de… stan… none 82 83 80
## 8 female group B some high scho… stan… none 46 61 58
## 9 male group C some high scho… stan… none 80 75 73
## 10 female group C bachelor's deg… stan… comple… 57 69 77
## # … with 990 more rows, and abbreviated variable names
## # ¹`parental level of education`, ²`test preparation course`, ³`math score`,
## # ⁴`reading score`, ⁵`writing score`
head(exams)
## # A tibble: 6 × 8
## gender `race/ethnicity` parental level…¹ lunch test …² math …³ readi…⁴ writi…⁵
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 female group D some college stan… comple… 59 70 78
## 2 male group D associate's deg… stan… none 96 93 87
## 3 female group D some college free… none 57 76 77
## 4 male group B some college free… none 70 70 63
## 5 female group D associate's deg… stan… none 83 85 86
## 6 male group C some high school stan… none 68 57 54
## # … with abbreviated variable names ¹`parental level of education`,
## # ²`test preparation course`, ³`math score`, ⁴`reading score`,
## # ⁵`writing score`
summary(exams)
## gender race/ethnicity parental level of education
## Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## lunch test preparation course math score reading score
## Length:1000 Length:1000 Min. : 15.00 Min. : 25.00
## Class :character Class :character 1st Qu.: 58.00 1st Qu.: 61.00
## Mode :character Mode :character Median : 68.00 Median : 70.50
## Mean : 67.81 Mean : 70.38
## 3rd Qu.: 79.25 3rd Qu.: 80.00
## Max. :100.00 Max. :100.00
## writing score
## Min. : 15.00
## 1st Qu.: 59.00
## Median : 70.00
## Mean : 69.14
## 3rd Qu.: 80.00
## Max. :100.00
# Preview renaming the `math score` column to `math` (the result is not saved here)
exams %>%
rename("math" = "math score")
## # A tibble: 1,000 × 8
## gender `race/ethnicity` parental level …¹ lunch test …² math readi…³ writi…⁴
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 female group D some college stan… comple… 59 70 78
## 2 male group D associate's degr… stan… none 96 93 87
## 3 female group D some college free… none 57 76 77
## 4 male group B some college free… none 70 70 63
## 5 female group D associate's degr… stan… none 83 85 86
## 6 male group C some high school stan… none 68 57 54
## 7 female group E associate's degr… stan… none 82 83 80
## 8 female group B some high school stan… none 46 61 58
## 9 male group C some high school stan… none 80 75 73
## 10 female group C bachelor's degree stan… comple… 57 69 77
## # … with 990 more rows, and abbreviated variable names
## # ¹`parental level of education`, ²`test preparation course`,
## # ³`reading score`, ⁴`writing score`
# Assign the result so that the renamed column persists (dplyr is already loaded)
exams <- exams %>%
  rename("math" = "math score")
print(exams)
## # A tibble: 1,000 × 8
## gender `race/ethnicity` parental level …¹ lunch test …² math readi…³ writi…⁴
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 female group D some college stan… comple… 59 70 78
## 2 male group D associate's degr… stan… none 96 93 87
## 3 female group D some college free… none 57 76 77
## 4 male group B some college free… none 70 70 63
## 5 female group D associate's degr… stan… none 83 85 86
## 6 male group C some high school stan… none 68 57 54
## 7 female group E associate's degr… stan… none 82 83 80
## 8 female group B some high school stan… none 46 61 58
## 9 male group C some high school stan… none 80 75 73
## 10 female group C bachelor's degree stan… comple… 57 69 77
## # … with 990 more rows, and abbreviated variable names
## # ¹`parental level of education`, ²`test preparation course`,
## # ³`reading score`, ⁴`writing score`
Building a linear model
# Plot the response (math) on the y-axis against the predictor (gender)
ggplot(exams, aes(x = gender, y = math)) +
  geom_point()
Fitting the model with the “lm” function:
exams_lm <- lm(math ~ gender, data = exams)
exams_lm
##
## Call:
## lm(formula = math ~ gender, data = exams)
##
## Coefficients:
## (Intercept) gendermale
## 64.774 5.976
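The fitted regression equation is math = 64.774 + 5.976 × gendermale, where gendermale is 1 for male students and 0 for female students. The model therefore predicts a math score of 64.774 for female students and 64.774 + 5.976 = 70.750 for male students.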
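The arithmetic behind the equation can be checked with a short sketch. It copies the two coefficients from the printed output rather than refitting the model, and the names b0, b1, and predicted are only illustrative.

```r
# Coefficients copied from the lm() output above (not re-estimated here)
b0 <- 64.774  # (Intercept): predicted math score for the baseline level, female
b1 <- 5.976   # gendermale: shift in the predicted score for male students

# R codes the two-level predictor as a 0/1 dummy variable named gendermale
gendermale <- c(female = 0, male = 1)

# Predicted math score for each gender
predicted <- b0 + b1 * gendermale
print(predicted)
```

When the fitted object is available, `coef(exams_lm)` returns the same two coefficients at full precision.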
# group = 1 lets stat_smooth() fit one line across both gender categories
ggplot(data = exams, aes(x = gender, y = math, group = 1)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
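With a single two-level predictor such as gender, the least-squares fit reduces to the two group means, and the slope test matches a pooled two-sample t-test. The sketch below illustrates this on a small simulated data frame. The data frame demo and its values are hypothetical stand-ins, not the exams data.

```r
set.seed(42)

# Hypothetical stand-in for the exams data: 60 female and 40 male students
demo <- data.frame(gender = rep(c("female", "male"), times = c(60, 40)))
demo$math <- round(ifelse(demo$gender == "male", 70, 65) +
                     rnorm(nrow(demo), sd = 15))

demo_lm <- lm(math ~ gender, data = demo)
coefs <- coef(demo_lm)

# Mean math score within each gender
group_means <- tapply(demo$math, demo$gender, mean)

# The intercept is the female mean; intercept + slope is the male mean
print(group_means)
print(c(female = unname(coefs["(Intercept)"]),
        male   = unname(coefs["(Intercept)"] + coefs["gendermale"])))

# The pooled two-sample t-test gives the same t statistic, up to sign
tt <- t.test(math ~ gender, data = demo, var.equal = TRUE)
lm_t <- summary(demo_lm)$coefficients["gendermale", "t value"]
print(c(t_test = unname(tt$statistic), lm = lm_t))
```

Seen this way, the regression of math on gender is a comparison of two means, which is useful to keep in mind when reading the diagnostic plots.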
summary(exams_lm)
##
## Call:
## lm(formula = math ~ gender, data = exams)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.750 -9.774 1.226 10.250 33.226
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64.7744 0.6745 96.028 < 2e-16 ***
## gendermale 5.9756 0.9464 6.314 4.08e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.96 on 998 degrees of freedom
## Multiple R-squared: 0.03841, Adjusted R-squared: 0.03745
## F-statistic: 39.87 on 1 and 998 DF, p-value: 4.084e-10
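The figures in this summary are tied together by simple identities, which makes them easy to sanity-check. The sketch below copies the reported values and recomputes the F-statistic, R-squared, and adjusted R-squared from them. The variable names are illustrative, and small differences come from rounding in the printed output.

```r
# Values copied from the summary(exams_lm) output above
t_value  <- 6.314   # t value for gendermale
f_stat   <- 39.87   # F-statistic on 1 and 998 DF
df_resid <- 998     # residual degrees of freedom
n        <- 1000    # number of observations

# With one predictor, the F-statistic is the square of the slope's t value
f_from_t <- t_value^2

# R-squared can be recovered from F and its degrees of freedom (1 and 998)
r_squared <- f_stat / (f_stat + df_resid)

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / df_resid

print(c(f_from_t = f_from_t, r_squared = r_squared,
        adj_r_squared = adj_r_squared))
```

An R-squared of about 0.038 means that gender accounts for roughly 3.8% of the variation in math scores, even though the coefficient is statistically significant.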
Residual Analysis
ggplot(data = exams_lm, aes(x = .fitted, y = .resid)) +
geom_point()
par(mfrow = c(2, 2))
plot(exams_lm)
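Because the only predictor has two levels, the fitted values take just two distinct values, so a residuals-versus-fitted plot shows two vertical strips rather than a cloud. A numeric check per group can complement the plots. The sketch below uses a simulated data frame (demo, a hypothetical stand-in for exams) to show the idea.

```r
set.seed(1)

# Hypothetical stand-in for the exams data: 50 students per gender
demo <- data.frame(gender = rep(c("female", "male"), each = 50))
demo$math <- ifelse(demo$gender == "male", 70, 65) +
  rnorm(nrow(demo), sd = 15)

demo_lm <- lm(math ~ gender, data = demo)

# Only two distinct fitted values exist: one per gender
unique_fitted <- unique(round(fitted(demo_lm), 6))
print(unique_fitted)

# Residuals average to zero within each group by construction
resid_mean <- tapply(resid(demo_lm), demo$gender, mean)
print(resid_mean)

# Similar residual spread in both groups supports the constant-variance assumption
resid_sd <- tapply(resid(demo_lm), demo$gender, sd)
print(resid_sd)
```

On the real data, the same calls with exams_lm and exams$gender would show whether the two groups have comparable spread, and the normal Q-Q panel from plot(exams_lm) would show whether the residuals are roughly normal.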
https://www.kaggle.com/code/jekeelmayurshah/students-performance-prediction