HOMEWORK
- Load your chosen dataset into R Markdown
- Select the dependent variable you are interested in, along with
independent variables that you believe influence the dependent
variable
- Create a linear model using the “lm()” command and save it to an
object (using the arrow, “<-”)
- Call “summary()” on your new model
- Interpret the model’s R-squared and p-values. How much of the
dependent variable does the overall model explain? What are the
significant variables? What are the insignificant variables?
- Choose some significant independent variables. Interpret their
Estimates (or Beta Coefficients). How do the independent variables
individually affect the dependent variable? (Cyl goes up by 1, hp goes
up by) (If you don’t have significant variables, just pick one and
pretend it’s significant)
- Does the model you created meet or violate the assumption of
linearity? Show your work with “plot(x, which = 1)”. Don’t be afraid
of getting it wrong; if you’re not sure, it’s usually nonlinear.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
district_data <- read_excel("district.xls")
library(dplyr)
# Give the coded columns descriptive names (new_name = old_name)
district_data <- district_data %>%
  rename(
    Four_Year_Grad_Rate_Class_2021 = DAGC4X21R,
    Number_of_Students_Per_Teacher = DPSTKIDR,
    Percentage_African_American_Students = DPETBLAP,
    Percentage_White_Students = DPETWHIP,
    Perecentage_Hispanic_Students = DPETHISP,
    Spending_Per_Pupil = DPFEAOPFK,
    Revenue_Per_Pupil = DPFRAALLK
  )
# Keep only the variables used in the model
District_Data_Frame <- district_data %>%
  select(Four_Year_Grad_Rate_Class_2021, Number_of_Students_Per_Teacher,
         Percentage_African_American_Students, Percentage_White_Students,
         Perecentage_Hispanic_Students, Spending_Per_Pupil, Revenue_Per_Pupil)
View(District_Data_Frame)
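Because lm() drops any row with a missing value in a model variable by default, it can help to count incomplete rows before fitting. The sketch below uses a small made-up data frame (toy_data and its column names are hypothetical, not taken from the district file) to show one way to do that.

```r
# Hypothetical toy data standing in for District_Data_Frame
toy_data <- data.frame(
  outcome   = c(90, 85, NA, 70, 95),
  predictor = c(10, NA, 12, 15, 9)
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy_data))

# Rows with no missing value in any column; these are the rows lm() would use
rows_kept <- sum(complete.cases(toy_data))
rows_dropped <- nrow(toy_data) - rows_kept

print(missing_per_column)
print(c(kept = rows_kept, dropped = rows_dropped))
```

Running the same two checks on District_Data_Frame would show which columns account for the rows that the model leaves out.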
Educational_Attainment_Model<-lm(Four_Year_Grad_Rate_Class_2021~Number_of_Students_Per_Teacher + Percentage_African_American_Students + Percentage_White_Students + Perecentage_Hispanic_Students + Spending_Per_Pupil + Revenue_Per_Pupil, data = District_Data_Frame)
summary(Educational_Attainment_Model)
##
## Call:
## lm(formula = Four_Year_Grad_Rate_Class_2021 ~ Number_of_Students_Per_Teacher +
## Percentage_African_American_Students + Percentage_White_Students +
## Perecentage_Hispanic_Students + Spending_Per_Pupil + Revenue_Per_Pupil,
## data = District_Data_Frame)
##
## Residuals:
## Min 1Q Median 3Q Max
## -96.165 -1.361 1.763 4.473 59.102
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.038e+02 8.219e+00 12.635 < 2e-16 ***
## Number_of_Students_Per_Teacher -2.304e-01 1.715e-01 -1.343 0.17950
## Percentage_African_American_Students -9.754e-02 9.288e-02 -1.050 0.29389
## Percentage_White_Students 7.767e-02 8.109e-02 0.958 0.33833
## Perecentage_Hispanic_Students 5.365e-03 7.707e-02 0.070 0.94452
## Spending_Per_Pupil -1.060e-03 1.562e-04 -6.786 1.91e-11 ***
## Revenue_Per_Pupil 2.275e-04 8.072e-05 2.818 0.00492 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.21 on 1065 degrees of freedom
## (135 observations deleted due to missingness)
## Multiple R-squared: 0.09478, Adjusted R-squared: 0.08968
## F-statistic: 18.58 on 6 and 1065 DF, p-value: < 2.2e-16
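The R-squared values and p-values printed above can also be pulled out of the summary object instead of being read off the screen. This is a minimal sketch on the built-in mtcars data (the dataset the assignment's Cyl/hp hint appears to refer to), not on the district data; example_model and the 0.05 cutoff are assumptions made for illustration.

```r
# Illustrative model on the built-in mtcars data (not the district data)
example_model <- lm(mpg ~ cyl + hp, data = mtcars)
example_summary <- summary(example_model)

# Share of the outcome's variance explained by the model
r_squared <- example_summary$r.squared
adj_r_squared <- example_summary$adj.r.squared

# Coefficient table with columns Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(example_summary)

# Predictors with p < 0.05 (assumed cutoff), excluding the intercept row
p_values <- coef_table[-1, "Pr(>|t|)"]
significant_vars <- names(p_values)[p_values < 0.05]

print(round(c(r_squared = r_squared, adj_r_squared = adj_r_squared), 3))
print(significant_vars)
```

The same accessors should work on Educational_Attainment_Model, since it is also an lm object.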
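# The overall model explains only about 9.5% of the variance in the dependent variable, Four-Year Graduation Rate (Multiple R-squared = 0.09478, Adjusted R-squared = 0.08968).
# The overall p-value is very small (< 2.2e-16), so the model as a whole is statistically significant: if none of the predictors were related to graduation rate, an F-statistic this large would be extremely unlikely. The probability is very close to 0 but not exactly 0.
# The significant variables are Spending Per Pupil (p = 1.91e-11) and Revenue Per Pupil (p = 0.00492). Holding the other variables constant, a 1-unit increase in Revenue Per Pupil is associated with an increase of about 0.00023 in the graduation rate, and a 1-unit increase in Spending Per Pupil is associated with a decrease of about 0.0011. Spending Per Pupil has both the larger estimate in absolute value and the stronger statistical evidence (t = -6.786 versus 2.818 for Revenue Per Pupil).
# The insignificant variables are Number of Students Per Teacher (p = 0.18) and the three racial-composition percentages (African American p = 0.29, White p = 0.34, Hispanic p = 0.94). Race therefore appears to be statistically insignificant for the graduation rate of the class of 2021 in this model.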
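The estimates for the two significant variables are tiny because they describe a change of a single unit. As a rough worked example, the sketch below rescales the estimates reported in the summary() output to a step of 1,000 units. The step size is an assumption chosen for readability, and the variables are assumed (not confirmed by the output) to be measured in dollars per pupil, with graduation rate on a 0 to 100 scale.

```r
# Estimates copied from the summary() output above
spending_estimate <- -1.060e-03
revenue_estimate  <-  2.275e-04

# Assumed step size: 1,000 units of per-pupil spending or revenue
step <- 1000

# Expected change in graduation rate for one step, other variables held constant
spending_effect <- spending_estimate * step
revenue_effect  <- revenue_estimate * step

print(c(spending = spending_effect, revenue = revenue_effect))
```

Under those assumptions, 1,000 more in spending per pupil goes with a graduation rate about 1.06 lower, and 1,000 more in revenue per pupil goes with a graduation rate about 0.23 higher.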
plot(Educational_Attainment_Model, which =1)

# The red line tries to follow the dotted horizontal line at zero in the center, but it is not straight, which suggests the model violates the assumption of linearity.
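To make clearer what the residuals-versus-fitted plot is showing, the sketch below rebuilds a similar plot by hand on the built-in mtcars data. It is an approximation for illustration, not the exact code behind plot(x, which = 1): the model, the object names, and the use of lowess() with default settings for the red line are all assumptions.

```r
# Illustrative model on the built-in mtcars data (not the district data)
example_model <- lm(mpg ~ hp, data = mtcars)

# Fitted values and residuals, the two quantities shown in the plot
plot_data <- data.frame(
  fitted   = fitted(example_model),
  residual = resid(example_model)
)

# Smoothed trend of the residuals, similar in spirit to the red line
smooth <- lowess(plot_data$fitted, plot_data$residual)

plot(plot_data$fitted, plot_data$residual,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)      # dotted reference line at zero
lines(smooth, col = "red")  # a flat red line would support linearity
```

If the relationship is linear, the residuals scatter evenly around zero and the red line stays close to the dotted line across the whole range of fitted values. A clear bend in the red line points to nonlinearity.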