library(fastDummies)
Hchron<-read.csv("HoustonChronicle.csv")
head(Hchron)
## District X.Repeating.1st.Grade X.Low.income.students Year County
## 1 Alvin 4.1 49.7 2004 Brazoria
## 2 Alvin 5.8 41.1 1994 Brazoria
## 3 Angleton 7.1 44.2 2004 Brazoria
## 4 Angleton 6.7 30.2 1994 Brazoria
## 5 Brazosport 7.3 49.4 2004 Brazoria
## 6 Brazosport 2.6 33.7 1994 Brazoria
m<-lm(X.Repeating.1st.Grade~X.Low.income.students,Hchron)
plot(Hchron$X.Low.income.students,Hchron$X.Repeating.1st.Grade)
summary(m)
##
## Call:
## lm(formula = X.Repeating.1st.Grade ~ X.Low.income.students, data = Hchron)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9845 -2.5072 -0.4184 1.8505 11.1067
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.91419 0.83836 3.476 0.000709 ***
## X.Low.income.students 0.07550 0.01823 4.141 6.47e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.821 on 120 degrees of freedom
## Multiple R-squared: 0.125, Adjusted R-squared: 0.1177
## F-statistic: 17.14 on 1 and 120 DF, p-value: 6.472e-05
abline(m)
hist(m$residuals)
qqnorm(m$residuals)
qqline(m$residuals)
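Both the t value for the low-income estimate and the F test for the model as a whole show that low income is a significant predictor (p = 6.47e-05 for both, since the model has a single predictor). Because the estimate is positive (0.0755), students repeating first grade increase as low-income students increase: each one-point rise in X.Low.income.students is associated with a rise of about 0.076 points in X.Repeating.1st.Grade. The multiple R-squared of 0.125 shows that the model explains only about 12.5% of the variation, so the relationship is significant but weak.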
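To make the size of the slope concrete, the fitted line can be evaluated by hand. The sketch below copies the two coefficients from the summary above and applies them to two hypothetical low-income values (30 and 50) chosen only for illustration; the variable names are not part of the original analysis.

```r
# Coefficients copied from the summary(m) output above.
intercept <- 2.91419
slope     <- 0.07550

# Hypothetical low-income values, picked only for illustration.
low_income <- c(30, 50)

# Fitted values from the regression line: intercept + slope * x.
predicted <- intercept + slope * low_income
predicted

# Change in the fitted repeat measure for a 20-point rise in low income.
diff(predicted)
```

The fitted values come out at roughly 5.18 and 6.69, a difference of about 1.5 points, which equals 20 times the slope.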
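# Drop the text columns so that dummy_cols() only encodes Year
h<-Hchron[,!(names(Hchron) %in% c('County','District'))]
# Treat Year as a categorical variable rather than a number
h$Year<-as.factor(h$Year)
# Add one 0/1 indicator column per level of Year
r<-dummy_cols(h)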
head(r)
## X.Repeating.1st.Grade X.Low.income.students Year Year_1994 Year_2004
## 1 4.1 49.7 2004 0 1
## 2 5.8 41.1 1994 1 0
## 3 7.1 44.2 2004 0 1
## 4 6.7 30.2 1994 1 0
## 5 7.3 49.4 2004 0 1
## 6 2.6 33.7 1994 1 0
m1<-lm(X.Repeating.1st.Grade~X.Low.income.students+Year_2004,r)
summary(m1)
##
## Call:
## lm(formula = X.Repeating.1st.Grade ~ X.Low.income.students +
## Year_2004, data = r)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6768 -2.5451 -0.4769 1.6624 11.3469
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.84900 0.84995 3.352 0.001076 **
## X.Low.income.students 0.07248 0.01917 3.782 0.000245 ***
## Year_2004 0.38311 0.72716 0.527 0.599274
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.832 on 119 degrees of freedom
## Multiple R-squared: 0.127, Adjusted R-squared: 0.1124
## F-statistic: 8.659 on 2 and 119 DF, p-value: 0.0003083
A dummy variable for the year was added to the data frame and used in a second linear regression. The dummy variable Year_2004 is 1 when the row belongs to the 2004 group and 0 otherwise, so 1994 is the reference group. Its estimate is positive (0.383), meaning that, for the same level of low-income students, the model's estimate of students repeating first grade is higher for the 2004 group than for the 1994 group. Taken at face value, this suggests an increase between 1994 and 2004.
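As a cross-check on the dummy coding, the sketch below builds the indicator by hand in base R and compares it with letting lm() encode the factor itself. It uses only the six rows printed by head(Hchron) above as a toy sample, so the coefficients will not match the full-data summary; the names toy, fit_dummy and fit_factor are made up for this illustration, and base R is swapped in for dummy_cols() to keep the example self-contained.

```r
# The six rows shown by head(Hchron) above: a toy sample, not the full data.
toy <- data.frame(
  X.Repeating.1st.Grade = c(4.1, 5.8, 7.1, 6.7, 7.3, 2.6),
  X.Low.income.students = c(49.7, 41.1, 44.2, 30.2, 49.4, 33.7),
  Year = factor(c(2004, 1994, 2004, 1994, 2004, 1994))
)

# Hand-built indicator: 1 for 2004 rows, 0 for 1994 rows.
toy$Year_2004 <- as.integer(toy$Year == "2004")

# Model with the explicit dummy column.
fit_dummy <- lm(X.Repeating.1st.Grade ~ X.Low.income.students + Year_2004, toy)

# Model where lm() encodes the factor, using 1994 as the reference level.
fit_factor <- lm(X.Repeating.1st.Grade ~ X.Low.income.students + Year, toy)

# The year coefficient is named Year_2004 in one fit and Year2004 in the
# other, but the estimates should agree.
coef(fit_dummy)
coef(fit_factor)
```

Both fits should give the same estimates, which suggests that the explicit dummy column is a convenience rather than a requirement when Year is already a factor.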
Looking at the summary again, we find that although the model as a whole is statistically significant (F test p value of 0.0003), the estimate for the dummy variable Year_2004 is not: its p value is about 0.60, far above the usual 0.05 threshold. So the difference between 1994 and 2004 is not statistically significant.
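The same conclusion can be read from a confidence interval. The sketch below rebuilds the t statistic, p value and a 95% interval for Year_2004 from the estimate, standard error and residual degrees of freedom printed in the summary of m1; the variable names are illustrative only, and the inputs are the rounded values shown above.

```r
# Values copied from the Year_2004 row of summary(m1) above.
estimate  <- 0.38311
std_error <- 0.72716
df_resid  <- 119

# t statistic and two-sided p value.
t_stat  <- estimate / std_error
p_value <- 2 * pt(-abs(t_stat), df_resid)

# 95% confidence interval: estimate +/- critical t value * standard error.
t_crit <- qt(0.975, df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error

t_stat
p_value
ci
```

The interval runs from roughly -1.06 to 1.82 and so includes zero, which is consistent with the non-significant p value. With the fitted model available, confint(m1) should give the same interval without the rounding.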