HOMEWORK

  1. Load your chosen dataset into Rmarkdown
  2. Select the dependent variable you are interested in, along with independent variables which you believe are causing the dependent variable
  3. Create a linear model using the “lm()” command and save it to an object
  4. Call “summary()” on your new model
  5. Interpret the model’s r-squared and p-values. How much of the dependent variable does the overall model explain? What are the significant variables? What are the insignificant variables?
  6. Choose some significant independent variables. Interpret their Estimates (or Beta Coefficients). How do the independent variables individually affect the dependent variable? (If you have no significant variables, just pick one and pretend that it’s significant.)
  7. Does the model you create meet or violate the assumption of linearity? Show your work with “plot(x,which=1)”
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)

#1
project_data <- read_excel("texas federal funds.xlsx")
#2, #3
environmental_model <- lm(
  `ENVIRONMENTAL HEALTH` ~
    `MATERNAL AND CHILD HEALTH SERVICES BLOCK GRANT TO THE STATES` +
    `EVEN START - STATE EDUCATIONAL AGENCIES` +
    `STATE ADMINISTRATIVE EXPENSES FOR CHILD NUTRITION`,
  data = project_data
)

#4
summary(environmental_model)
## 
## Call:
## lm(formula = `ENVIRONMENTAL HEALTH` ~ `MATERNAL AND CHILD HEALTH SERVICES BLOCK GRANT TO THE STATES` + 
##     `EVEN START - STATE EDUCATIONAL AGENCIES` + `STATE ADMINISTRATIVE EXPENSES FOR CHILD NUTRITION`, 
##     data = project_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1705532 -1212895   400060   649241  1778931 
## 
## Coefficients:
##                                                                  Estimate
## (Intercept)                                                     5.264e+05
## `MATERNAL AND CHILD HEALTH SERVICES BLOCK GRANT TO THE STATES`  1.927e-02
## `EVEN START - STATE EDUCATIONAL AGENCIES`                      -6.671e-02
## `STATE ADMINISTRATIVE EXPENSES FOR CHILD NUTRITION`             1.520e-01
##                                                                Std. Error
## (Intercept)                                                     1.927e+06
## `MATERNAL AND CHILD HEALTH SERVICES BLOCK GRANT TO THE STATES`  3.954e-02
## `EVEN START - STATE EDUCATIONAL AGENCIES`                       8.558e-02
## `STATE ADMINISTRATIVE EXPENSES FOR CHILD NUTRITION`             1.068e-01
##                                                                t value Pr(>|t|)
## (Intercept)                                                      0.273    0.791
## `MATERNAL AND CHILD HEALTH SERVICES BLOCK GRANT TO THE STATES`   0.487    0.638
## `EVEN START - STATE EDUCATIONAL AGENCIES`                       -0.780    0.456
## `STATE ADMINISTRATIVE EXPENSES FOR CHILD NUTRITION`              1.423    0.188
## 
## Residual standard error: 1346000 on 9 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.1897, Adjusted R-squared:  -0.08034 
## F-statistic: 0.7025 on 3 and 9 DF,  p-value: 0.574

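#5 The r-squared value is quite low: 0.1897 means the chosen programs in our model explain only about 19% of the variance in ENVIRONMENTAL HEALTH funding. The adjusted r-squared is even negative (-0.08), which indicates that, once the number of predictors is taken into account, the model does no better than simply using the mean. The overall p-value of 0.574 also does little to instill confidence: it is far above 0.05, so we fail to reject the null hypothesis (the selected health programs have no relationship with the funding patterns for environmental programs). This does not confirm the null hypothesis; it only means the data give no evidence against it. Consistent with this, all selected programs (variables) are insignificant, with p-values well above 0.05 (0.638, 0.456, and 0.188).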
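
As a quick check on the figures in the summary() output, the adjusted r-squared, F-statistic, and overall p-value can be recomputed by hand from the printed r-squared and degrees of freedom. This is only a sketch: the sample size of 13 is inferred from the 9 residual degrees of freedom plus the 4 estimated coefficients, and the variable names (r2, k, df_resid, and so on) are my own.

```r
# Values copied from the summary() output; object names are illustrative.
r2       <- 0.1897   # Multiple R-squared
k        <- 3        # number of predictors (intercept not counted)
df_resid <- 9        # residual degrees of freedom
n        <- df_resid + k + 1   # observations actually used in the fit

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F-statistic and its p-value (upper tail of the F distribution).
f_stat    <- (r2 / k) / ((1 - r2) / df_resid)
p_overall <- pf(f_stat, k, df_resid, lower.tail = FALSE)

round(c(adj_r2 = adj_r2, f_stat = f_stat, p_overall = p_overall), 3)
```

The results should agree with the printed summary up to rounding, since the inputs here are themselves rounded.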

#6 Let’s pretend that the STATE ADMINISTRATIVE EXPENSES FOR CHILD NUTRITION (SAECN) variable is significant. Its estimate (AKA beta coefficient) is 1.520e-01, which indicates that for a single unit increase in SAECN, the dependent variable (ENVIRONMENTAL HEALTH program) is expected to increase by about 0.152 units, holding the other independent variables constant. (The 1.423 in the output is the t value, not the estimate.) The other independent variables listed follow the same structure based on their estimate values (MATERNAL… = about 0.019 increase, EVEN START… = about 0.067 decrease).
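
Because the Estimate and t value columns are easy to mix up, here is a small sketch of how they relate: each t value is the estimate divided by its standard error, and the p-value comes from a t distribution with the residual degrees of freedom (9 here). The numbers are copied from the summary() output; the shortened term labels and the data frame name coefs are my own.

```r
# Estimates and standard errors copied from the summary() output.
# Term labels are shortened for readability (illustrative only).
coefs <- data.frame(
  term      = c("(Intercept)", "MATERNAL", "EVEN START", "SAECN"),
  estimate  = c(5.264e+05, 1.927e-02, -6.671e-02, 1.520e-01),
  std_error = c(1.927e+06, 3.954e-02,  8.558e-02, 1.068e-01)
)

# t value = estimate / standard error
coefs$t_value <- coefs$estimate / coefs$std_error

# Two-sided p-value from a t distribution with 9 residual degrees of freedom
coefs$p_value <- 2 * pt(-abs(coefs$t_value), df = 9)

print(coefs, digits = 3)
```

The estimate is what gets interpreted as the change in the dependent variable per unit of the predictor; the t value and p-value only say how distinguishable that estimate is from zero.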

#7
plot(environmental_model, which=1)

#7 The model I have created violates the assumption of linearity. For the relationship to be linear, we would need to see the red line stay relatively straight and close to the dotted line at zero. What we see instead is multiple peaks and valleys, which strongly suggests a violation of the assumption of linearity. Not even our in-class example of a violated model had multiple peaks and valleys (it simply curved), making my model easy to identify as non-linear.
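
To make clear what the red line and dotted line represent, the sketch below draws a residuals-vs-fitted plot by hand. It uses the built-in mtcars data as a stand-in (an assumption for illustration, not the homework data), and it approximates the red trend line with lowess(), which I believe is the kind of smoother plot() uses for which=1.

```r
# Stand-in example: mtcars is a built-in dataset, NOT the homework data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

res <- resid(fit)    # residuals: observed minus fitted values
fv  <- fitted(fit)   # fitted (predicted) values

# Scatter of residuals against fitted values
plot(fv, res,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted (drawn by hand)")

# Dotted reference line at zero: where residuals sit on average if the
# linear form is adequate
abline(h = 0, lty = 3)

# Smoothed trend of the residuals; strong bends away from zero hint at
# non-linearity
smooth <- lowess(fv, res)
lines(smooth, col = "red")
```

With only a handful of observations, the smoothed line can wiggle a good deal, so the shape is worth reading alongside the sample size.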