Data 605 Discussion Post 11

Load in the Education dataset

library(ggplot2)
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 3.5.3

## -- Attaching packages ------------------------------------------------------------------------------------ tidyverse 1.2.1 --

## v tibble  2.0.1     v purrr   0.2.5
## v tidyr   0.8.2     v dplyr   0.7.8
## v readr   1.3.1     v stringr 1.3.1
## v tibble  2.0.1     v forcats 0.3.0

## -- Conflicts --------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

# Read the state-level education data (the path is specific to the author's machine)
df <- read.csv("C:/Users/arnou/Documents/Education Data/states_education_data.csv")

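# Spending per enrolled student; the *1000 assumes TOTAL_EXPENDITURE is in thousands of dollars
df$Expenditure_per_child <- df$TOTAL_EXPENDITURE * 1000 / df$GRADES_ALL_G
# Mean of the four average math and reading scores (grades 4 and 8)
df$Total_Scores <- (df$AVG_MATH_4_SCORE + df$AVG_MATH_8_SCORE +
                    df$AVG_READING_4_SCORE + df$AVG_READING_8_SCORE) / 4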


ggplot(data = df, aes(x = Expenditure_per_child, y = Total_Scores)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = FALSE) +
  xlim(0, 25000)

## Warning: Removed 1073 rows containing non-finite values (stat_smooth).

## Warning: Removed 1073 rows containing missing values (geom_point).

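# Simple linear regression of average score on per-child spending
linear_model <- lm(Total_Scores ~ Expenditure_per_child, data = df)

# Default diagnostics: residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
plot(linear_model)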

linear_model

## 
## Call:
## lm(formula = Total_Scores ~ Expenditure_per_child, data = df)
## 
## Coefficients:
##           (Intercept)  Expenditure_per_child  
##             2.408e+02              7.304e-04

summary(linear_model)

## 
## Call:
## lm(formula = Total_Scores ~ Expenditure_per_child, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.681  -2.878   0.904   4.065  11.230 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.408e+02  9.849e-01 244.505   <2e-16 ***
## Expenditure_per_child 7.304e-04  7.822e-05   9.337   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.289 on 418 degrees of freedom
##   (1072 observations deleted due to missingness)
## Multiple R-squared:  0.1726, Adjusted R-squared:  0.1706 
## F-statistic: 87.19 on 1 and 418 DF,  p-value: < 2.2e-16

Conclusion

For each additional thousand dollars in tax money that a state spends per student, the average test score is about 0.73 points higher (the slope is 7.304e-04 points per dollar). The model is statistically significant (p < 2e-16). The R-squared value is 0.17, which means that per-student expenditure explains about 17% of the variation in test scores.
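The arithmetic behind these two numbers can be checked with a short sketch. The values below are copied by hand from the `summary(linear_model)` output; the variable names are illustrative only, and the confidence interval is an extra step not reported in the post.

```r
# Values copied from the summary(linear_model) output above
slope_per_dollar <- 7.304e-04   # change in Total_Scores per extra dollar per child
slope_se         <- 7.822e-05   # standard error of the slope
resid_df         <- 418         # residual degrees of freedom

# Rescale the slope to "per additional $1,000 per child"
slope_per_thousand <- slope_per_dollar * 1000
round(slope_per_thousand, 2)    # 0.73

# Approximate 95% confidence interval for the slope, on the same per-$1,000 scale
t_crit <- qt(0.975, df = resid_df)
ci_per_thousand <- (slope_per_dollar + c(-1, 1) * t_crit * slope_se) * 1000
round(ci_per_thousand, 2)

# With a single predictor, R-squared can be recovered from the F statistic
f_stat    <- 87.19
r_squared <- f_stat / (f_stat + resid_df)
round(r_squared, 4)             # 0.1726
```

With the fitted model in memory, `coef(linear_model)`, `confint(linear_model)` and `summary(linear_model)$r.squared` return the same quantities without copying numbers by hand.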

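From the normal Q-Q plot we can see that the residuals are approximately normally distributed except at the extreme ends. The residual summary points the same way: the minimum (-33.7) lies much farther from zero than the maximum (11.2), so the lower tail is heavier than a normal distribution would give.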
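One way to look at the Q-Q plot on its own, and to back it with a number, is sketched below. Because the education data file is not available here, the example fits a model to simulated stand-in data (`x`, `y` and `fit` are hypothetical); for the post's model, `fit` would be replaced by `linear_model`.

```r
# Simulated stand-in data (hypothetical, NOT the education dataset)
set.seed(1)
x <- runif(200, min = 5000, max = 20000)
y <- 240 + 7e-04 * x + rnorm(200, sd = 6)
fit <- lm(y ~ x)

# which = 2 draws only the normal Q-Q plot from the plot.lm diagnostics
plot(fit, which = 2)

# Numeric companions to the plot: the range of the standardized residuals
# and a Shapiro-Wilk test of normality
std_res <- rstandard(fit)
range(std_res)
shapiro.test(std_res)
```

Standardized residuals far beyond about 3 in absolute value, or a very small Shapiro-Wilk p-value, would point to the kind of tail departure described above.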