Cuny 621 - Week 5

MARR Chap 5 - Ex 2

library(fastDummies)
Hchron<-read.csv("HoustonChronicle.csv")
head(Hchron)

##     District X.Repeating.1st.Grade X.Low.income.students Year   County
## 1      Alvin                   4.1                  49.7 2004 Brazoria
## 2      Alvin                   5.8                  41.1 1994 Brazoria
## 3   Angleton                   7.1                  44.2 2004 Brazoria
## 4   Angleton                   6.7                  30.2 1994 Brazoria
## 5 Brazosport                   7.3                  49.4 2004 Brazoria
## 6 Brazosport                   2.6                  33.7 1994 Brazoria

An increase in the percentage of low-income students is associated with an increase in the percentage of students repeating first grade

m<-lm(X.Repeating.1st.Grade~X.Low.income.students,Hchron)
plot(Hchron$X.Low.income.students,Hchron$X.Repeating.1st.Grade)
summary(m)

## 
## Call:
## lm(formula = X.Repeating.1st.Grade ~ X.Low.income.students, data = Hchron)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9845 -2.5072 -0.4184  1.8505 11.1067 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2.91419    0.83836   3.476 0.000709 ***
## X.Low.income.students  0.07550    0.01823   4.141 6.47e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.821 on 120 degrees of freedom
## Multiple R-squared:  0.125,  Adjusted R-squared:  0.1177 
## F-statistic: 17.14 on 1 and 120 DF,  p-value: 6.472e-05

abline(m)

hist(m$residuals)

qqnorm(m$residuals)
qqline(m$residuals)

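Both the t-test for the low-income coefficient and the F-test for the model as a whole show the percentage of low-income students to be statistically significant (p = 6.47e-05 in both cases, since the model has a single predictor). Because the estimate is positive (0.0755), we can conclude that an increase in the percentage of low-income students is associated with an increase in the percentage of students repeating first grade: each additional percentage point of low-income students corresponds to roughly 0.076 percentage points more students repeating. The relationship is weak, however, with an R-squared of only 0.125.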
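As a check on how the numbers in the summary fit together, the t value, its p-value, and the F statistic can be recomputed by hand from the estimate and standard error printed above. This is only a sketch using the rounded values from the output, so the last digits may differ slightly; the variable names are introduced here for illustration.

```r
# Values copied (rounded) from summary(m) above
est <- 0.07550   # slope estimate for X.Low.income.students
se  <- 0.01823   # its standard error
df  <- 120       # residual degrees of freedom

# t statistic is the estimate divided by its standard error
t_val <- est / se

# Two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_val), df)

# With a single predictor, the model F statistic equals t squared
f_val <- t_val^2

# 95% confidence interval for the slope
ci <- est + c(-1, 1) * qt(0.975, df) * se

print(c(t = t_val, p = p_val, F = f_val))
print(ci)
```

The confidence interval lies entirely above zero, which is consistent with the positive and significant slope reported in the summary.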

There has been an increase in the percentage of students repeating first grade between 1994-1995 and 2004-2005

h<-Hchron[,!(names(Hchron) %in% c('County','District'))]
h$Year<-as.factor(h$Year)
r<-dummy_cols(h)
head(r)

##   X.Repeating.1st.Grade X.Low.income.students Year Year_1994 Year_2004
## 1                   4.1                  49.7 2004         0         1
## 2                   5.8                  41.1 1994         1         0
## 3                   7.1                  44.2 2004         0         1
## 4                   6.7                  30.2 1994         1         0
## 5                   7.3                  49.4 2004         0         1
## 6                   2.6                  33.7 1994         1         0

m1<-lm(X.Repeating.1st.Grade~X.Low.income.students+Year_2004,r)
summary(m1)

## 
## Call:
## lm(formula = X.Repeating.1st.Grade ~ X.Low.income.students + 
##     Year_2004, data = r)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6768 -2.5451 -0.4769  1.6624 11.3469 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2.84900    0.84995   3.352 0.001076 ** 
## X.Low.income.students  0.07248    0.01917   3.782 0.000245 ***
## Year_2004              0.38311    0.72716   0.527 0.599274    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.832 on 119 degrees of freedom
## Multiple R-squared:  0.127,  Adjusted R-squared:  0.1124 
## F-statistic: 8.659 on 2 and 119 DF,  p-value: 0.0003083

A dummy variable for the year was added to the data frame and included in the linear regression. Year_2004 is 1 when the row belongs to the 2004 group and 0 otherwise, so its coefficient is the estimated difference in the percentage of students repeating first grade between 2004 and 1994 at a fixed percentage of low-income students. The estimate is positive (0.383), meaning the fitted percentage repeating first grade is higher for the 2004 group than for the 1994 group. However, its p-value (0.599) is far above 0.05, so the data do not provide statistically significant evidence of an increase between 1994 and 2004.
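The dummy coding described above can also be illustrated without fastDummies. The sketch below rebuilds only the six rows shown in head(Hchron), creates the 0/1 indicator in base R, and checks that it gives the same coefficient as letting lm() expand a factor. The data frame name h6 and the fit names are assumptions introduced for this example, and because it uses six rows rather than the full data set, its estimates will not match the summary above.

```r
# Six rows copied from head(Hchron); NOT the full data set
h6 <- data.frame(
  X.Repeating.1st.Grade = c(4.1, 5.8, 7.1, 6.7, 7.3, 2.6),
  X.Low.income.students = c(49.7, 41.1, 44.2, 30.2, 49.4, 33.7),
  Year = c(2004, 1994, 2004, 1994, 2004, 1994)
)

# Manual dummy: 1 for the 2004 group, 0 for the 1994 (baseline) group
h6$Year_2004 <- as.integer(h6$Year == 2004)

# Fit with the explicit dummy column
fit_dummy <- lm(X.Repeating.1st.Grade ~ X.Low.income.students + Year_2004,
                data = h6)

# Fit with a factor; lm() creates the same 0/1 column internally,
# using the first level (1994) as the baseline
fit_factor <- lm(X.Repeating.1st.Grade ~ X.Low.income.students + factor(Year),
                 data = h6)

b_dummy  <- coef(fit_dummy)[["Year_2004"]]
b_factor <- coef(fit_factor)[["factor(Year)2004"]]
print(c(dummy = b_dummy, factor = b_factor))
```

Both approaches estimate the same shift between the 2004 and 1994 groups, so the explicit dummy column is a convenience rather than a requirement.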

Any association between percentage of students repeating first grade and the percentage of low income students differs between 1994-1995 and 2004-2005

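Here we look at the summary of m1 again. Although the model as a whole is statistically significant, with a low F-test p-value (0.0003083), the estimate for the dummy variable Year_2004 is not: its p-value is 0.599. So the difference between 1994 and 2004 is not statistically significant. Note that Year_2004 enters m1 only as a shift in the intercept. A direct test of whether the slope on the percentage of low-income students differs between the two years would need an interaction term between X.Low.income.students and Year_2004, which m1 does not include.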
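One way to test whether the association itself changes between the two years is to add an interaction term and compare the nested models. The sketch below shows only the mechanics, again on the six rows from head(Hchron), so its numbers say nothing about the full data. The names h6, fit_add, and fit_int are assumptions introduced here; with the full data, the data frame r from above would take the place of h6.

```r
# Six rows copied from head(Hchron); for illustrating the mechanics only
h6 <- data.frame(
  X.Repeating.1st.Grade = c(4.1, 5.8, 7.1, 6.7, 7.3, 2.6),
  X.Low.income.students = c(49.7, 41.1, 44.2, 30.2, 49.4, 33.7),
  Year_2004 = c(1, 0, 1, 0, 1, 0)
)

# Additive model: same slope in both years, intercept shifts by Year_2004
fit_add <- lm(X.Repeating.1st.Grade ~ X.Low.income.students + Year_2004,
              data = h6)

# Interaction model: `*` expands to both main effects plus their product,
# so the slope on low-income students may differ between years
fit_int <- lm(X.Repeating.1st.Grade ~ X.Low.income.students * Year_2004,
              data = h6)

# The interaction coefficient is the difference in slopes (2004 minus 1994)
print(coef(fit_int))

# Partial F-test comparing the two nested models
anova_tab <- anova(fit_add, fit_int)
print(anova_tab)
```

A small p-value for the interaction term on the full data would indicate that the slope differs between years; a large one would give no evidence of a difference.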

Peter Kowalchuk

September 24, 2019
