HOMEWORK

  1. Load your chosen dataset into R Markdown.
  2. Select the dependent variable you are interested in, along with the independent variables you believe affect it.
  3. Create a linear model using the “lm()” command and save it to an object (using the assignment arrow, “<-”).
  4. Call “summary()” on your new model.
  5. Interpret the model’s R-squared and p-values. How much of the variation in the dependent variable does the overall model explain? Which variables are significant? Which are insignificant?
  6. Choose some significant independent variables and interpret their Estimates (or Beta Coefficients). How does each independent variable individually affect the dependent variable? (Cyl goes up by 1, hp goes up by) (If you don’t have significant variables, just pick one and pretend it’s significant.)
  7. Does the model you created meet or violate the assumption of linearity? Show your work with “plot(x, which = 1)”. Don’t be afraid of getting it wrong; if you’re not sure, it’s usually nonlinear.
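As a minimal sketch of steps 3 through 7, the workflow might look like the following on R's built-in `mtcars` data (the dataset that the Cyl/hp example in step 6 appears to refer to). The choice of `hp` as the dependent variable, `cyl` and `wt` as predictors, and the object name `example_model` are illustrative assumptions, not part of the assignment.

```r
# Illustrative only: mtcars ships with R; swap in your own dataset and variables.
data(mtcars)

# Step 3: fit a linear model and save it to an object with the arrow
example_model <- lm(hp ~ cyl + wt, data = mtcars)

# Step 4: print R-squared, coefficient estimates, and p-values
summary(example_model)

# Step 6: each estimate is the expected change in hp for a one-unit
# increase in that predictor, holding the other predictor constant
coef(example_model)

# Step 7: residuals vs. fitted plot for checking linearity
plot(example_model, which = 1)
```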
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
district_data <- read_excel("district.xls")
library(dplyr)
# Rename the columns used in the model to descriptive names (new_name = old_name)
district_data <- district_data %>%
  rename(
    Four_Year_Grad_Rate_Class_2021 = DAGC4X21R,
    Number_of_Students_Per_Teacher = DPSTKIDR,
    Percentage_African_American_Students = DPETBLAP,
    Percentage_White_Students = DPETWHIP,
    Perecentage_Hispanic_Students = DPETHISP,
    Spending_Per_Pupil = DPFEAOPFK,
    Revenue_Per_Pupil = DPFRAALLK
  )
# Keep only the dependent variable and the six independent variables
District_Data_Frame <- district_data %>%
  select(Four_Year_Grad_Rate_Class_2021, Number_of_Students_Per_Teacher,
         Percentage_African_American_Students, Percentage_White_Students,
         Perecentage_Hispanic_Students, Spending_Per_Pupil, Revenue_Per_Pupil)
# View() opens an interactive viewer, so only call it outside of knitting
if (interactive()) View(District_Data_Frame)
# Regress four-year graduation rate on the six district-level predictors
Educational_Attainment_Model <- lm(
  Four_Year_Grad_Rate_Class_2021 ~ Number_of_Students_Per_Teacher +
    Percentage_African_American_Students + Percentage_White_Students +
    Perecentage_Hispanic_Students + Spending_Per_Pupil + Revenue_Per_Pupil,
  data = District_Data_Frame
)
summary(Educational_Attainment_Model)
## 
## Call:
## lm(formula = Four_Year_Grad_Rate_Class_2021 ~ Number_of_Students_Per_Teacher + 
##     Percentage_African_American_Students + Percentage_White_Students + 
##     Perecentage_Hispanic_Students + Spending_Per_Pupil + Revenue_Per_Pupil, 
##     data = District_Data_Frame)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -96.165  -1.361   1.763   4.473  59.102 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           1.038e+02  8.219e+00  12.635  < 2e-16 ***
## Number_of_Students_Per_Teacher       -2.304e-01  1.715e-01  -1.343  0.17950    
## Percentage_African_American_Students -9.754e-02  9.288e-02  -1.050  0.29389    
## Percentage_White_Students             7.767e-02  8.109e-02   0.958  0.33833    
## Perecentage_Hispanic_Students         5.365e-03  7.707e-02   0.070  0.94452    
## Spending_Per_Pupil                   -1.060e-03  1.562e-04  -6.786 1.91e-11 ***
## Revenue_Per_Pupil                     2.275e-04  8.072e-05   2.818  0.00492 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.21 on 1065 degrees of freedom
##   (135 observations deleted due to missingness)
## Multiple R-squared:  0.09478,    Adjusted R-squared:  0.08968 
## F-statistic: 18.58 on 6 and 1065 DF,  p-value: < 2.2e-16
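The numbers printed by `summary()` can also be read off the summary object directly, which avoids copying them by hand. The sketch below uses the built-in `mtcars` data so that it runs on its own; the model formula and the object names (`example_model`, `model_summary`, and so on) are illustrative assumptions. The same accessors should work on any `lm` object, including the one fitted above.

```r
# Illustrative model on built-in data so the example is self-contained
data(mtcars)
example_model <- lm(hp ~ cyl + wt, data = mtcars)
model_summary <- summary(example_model)

# Share of variance in the dependent variable explained by the model
r_squared <- model_summary$r.squared
adj_r_squared <- model_summary$adj.r.squared

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(model_summary)
p_values <- coef_table[, "Pr(>|t|)"]

# Predictors significant at the conventional 0.05 level (intercept excluded)
significant <- setdiff(names(p_values)[p_values < 0.05], "(Intercept)")

print(r_squared)
print(adj_r_squared)
print(significant)
```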
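# The overall model explains only about 9.5% of the variance in the dependent variable, four-year graduation rate (Multiple R-squared = 0.09478; Adjusted R-squared = 0.08968).
# The p-value of the overall F-test is below 2.2e-16, so a result like this would be very unlikely if none of the independent variables were related to graduation rate. The p-value is extremely small, but it is not exactly 0.
# The significant variables are Spending_Per_Pupil (p = 1.91e-11) and Revenue_Per_Pupil (p = 0.00492). Holding the other variables constant, a one-unit increase in revenue per pupil is associated with a 0.0002275-point increase in graduation rate (about 0.23 points per 1,000 units), and a one-unit increase in spending per pupil is associated with a 0.00106-point decrease (about 1.06 points per 1,000 units). Spending per pupil has both the larger estimate in absolute value and the smaller p-value, so its estimated association with graduation rate is the stronger of the two.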

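# The racial composition variables (percentage of African American, White, and Hispanic students) are not statistically significant predictors of the graduation rate for the class of 2021 (all p > 0.29), and neither is the number of students per teacher (p = 0.18).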
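Because each Estimate is the expected change in graduation rate for a one-unit change in a predictor, very small estimates are easier to read after rescaling to a larger step. The arithmetic below uses the two significant estimates exactly as printed in the summary output; the step of 1,000 units is an arbitrary choice for illustration, and the units themselves (presumably dollars) are an assumption, since the output does not state them.

```r
# Estimates copied from the summary() output above
spending_estimate <- -1.060e-03
revenue_estimate <- 2.275e-04

# Assumption: a step of 1,000 units is more meaningful than a step of 1
step <- 1000

# Expected change in graduation rate (in points) per 1,000-unit increase,
# holding the other predictors constant
spending_per_step <- spending_estimate * step
revenue_per_step <- revenue_estimate * step

# How many times larger the spending estimate is in absolute value
magnitude_ratio <- abs(spending_estimate) / abs(revenue_estimate)

print(c(spending = spending_per_step, revenue = revenue_per_step))
print(magnitude_ratio)
```

Comparing raw estimates like this only makes sense when both predictors are measured in the same units, which is assumed here.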
plot(Educational_Attainment_Model, which = 1)

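# In the Residuals vs Fitted plot, the red smoothed line roughly follows the dashed horizontal line at zero but bends away from it rather than staying flat, which suggests the model violates the assumption of linearity.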
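To see what a clear violation looks like in this kind of plot, the sketch below fits a straight line to simulated data that is curved by construction. Everything in it (the seed, the quadratic relationship, the noise level, and the object names) is made up for illustration and has nothing to do with the district data. `lowess()` is used here as a comparable smoother for tracing the trend in the residuals.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulated data: y depends on x quadratically, not linearly
x <- seq(0, 10, length.out = 100)
y <- x^2 + rnorm(100, sd = 2)

# Fit a straight line anyway
curved_model <- lm(y ~ x)

# Residuals vs. fitted: the red line should form a clear U shape
# instead of lying flat along the dashed line at zero
plot(curved_model, which = 1)

# The same pattern in numbers: residuals are positive at both ends
# of the fitted range and negative in the middle
model_residuals <- resid(curved_model)
smooth_line <- lowess(fitted(curved_model), model_residuals)
print(range(smooth_line$y))
```

A roughly flat red line would instead suggest that a straight-line relationship is a reasonable description of the data.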