1 Description

The interest is to model the number of days of absence from school using student’s gender, the standardized test in math, and the type of program in which the student is enrolled.

Data: The data set days absent contains attendance data on 314 high school juniors from high schools. The response variable of interest is days absent from school. The standardized math score for each student was given as well as the program variable indicating the type of instructional program in which the student was enrolled.

Column 1: Student ID
Column 2: Gender ID
Column 3: Program ID
Column 4: Standardized mathematics score
Column 5: Days absent from school

根據學生的性別、標準化的數學測驗得分,以及學生錄取的課程類型統計缺席日數。

## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows

2 input data

 dta2 <- read.csv("C:/Users/HANK/Desktop/HOMEWORK/days_absent.csv")

str(dta2)
## 'data.frame':    314 obs. of  5 variables:
##  $ ID     : chr  "S1001" "S1002" "S1003" "S1004" ...
##  $ Gender : chr  "male" "male" "female" "female" ...
##  $ Program: chr  "Academic" "Academic" "Academic" "Academic" ...
##  $ Math   : int  63 27 20 16 2 71 63 3 51 49 ...
##  $ Days   : int  4 4 2 3 3 13 11 7 10 9 ...
head(dta2)
##      ID Gender  Program Math Days
## 1 S1001   male Academic   63    4
## 2 S1002   male Academic   27    4
## 3 S1003 female Academic   20    2
## 4 S1004 female Academic   16    3
## 5 S1005 female Academic    2    3
## 6 S1006 female Academic   71   13
summary(dta2)
##       ID               Gender            Program               Math      
##  Length:314         Length:314         Length:314         Min.   : 1.00  
##  Class :character   Class :character   Class :character   1st Qu.:28.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :48.00  
##                                                           Mean   :48.27  
##                                                           3rd Qu.:70.00  
##                                                           Max.   :99.00  
##       Days       
##  Min.   : 0.000  
##  1st Qu.: 1.000  
##  Median : 4.000  
##  Mean   : 5.955  
##  3rd Qu.: 8.000  
##  Max.   :35.000
ggplot(dta2, 
       aes(Days)) +
 geom_histogram(binwidth = 1, fill=8, col=1)+
 facet_grid(Gender ~ Program) +
 labs(x="Days absent", 
      y="Frequency") +
 theme_minimal()+
 theme(legend.position = c(.95, .9))

以“性別”、“課程類型”作為區分組別,繪製缺席天數與次數表。

3 means and variances

aggregate(cbind(Days, Math)~ Gender+Program, data=dta2, FUN=mean)
##   Gender    Program      Days     Math
## 1 female   Academic  7.759036 43.69880
## 2   male   Academic  6.119048 41.50000
## 3 female    General 12.363636 44.40909
## 4   male    General  8.555556 45.33333
## 5 female Vocational  2.709091 58.00000
## 6   male Vocational  2.634615 58.84615
aggregate(cbind(Days, Math)~ Gender+Program, data=dta2, FUN=var)
##   Gender    Program     Days     Math
## 1 female   Academic 68.55098 586.7252
## 2   male   Academic 41.81698 689.8193
## 3 female    General 84.33766 804.5390
## 4   male    General 41.67320 923.5294
## 5 female Vocational 10.28418 512.4815
## 6   male Vocational 18.07956 364.9170

4 Effects of Program, gender, and math

ggplot(dta2, 
       aes(Math, Days, 
           color=Gender)) +
 stat_smooth(method="glm",
             formula=y ~ x,
             method.args=list(family=poisson),
             size=rel(.5)) +
 geom_point(pch=20, alpha=.5) +
 facet_grid(. ~ Program) +
 labs(y="Days absent", 
      x="Math score") +
 theme_minimal()+
 theme(legend.position = c(.95, .9))

結果顯示,無論是男生或是女生,在“Academic”與“General”兩種課程類型中,缺席的天數愈少,數學成績會愈高;而“Vocational”的學生的數學成績則與出缺席天數無太大相關。可能與不同的課程類型有不同的學習內容有關,例如:技職教育的特定學科不強調數學如:美容。

5 Parameter estimates

sjPlot::tab_model(m0 <- glm(Days ~ Program*Gender*Math, data=dta2, 
                            family=poisson(link=log)), show.se=T, show.r2=F, show.obs=F)
  Days
Predictors Incidence Rate Ratios std. Error CI p
(Intercept) 10.43 0.08 8.94 – 12.14 <0.001
Program [General] 1.47 0.13 1.13 – 1.91 0.004
Program [Vocational] 0.23 0.25 0.14 – 0.36 <0.001
Gender [male] 0.85 0.11 0.69 – 1.06 0.146
Math 0.99 0.00 0.99 – 1.00 <0.001
Program [General] *
Gender [male]
0.97 0.20 0.65 – 1.43 0.863
Program [Vocational] *
Gender [male]
1.59 0.37 0.76 – 3.27 0.214
Program [General] * Math 1.00 0.00 1.00 – 1.01 0.479
Program [Vocational] *
Math
1.01 0.00 1.00 – 1.02 0.019
Gender [male] * Math 1.00 0.00 0.99 – 1.00 0.274
(Program [General]
Gender [male])
Math
1.00 0.00 0.99 – 1.01 0.692
(Program [Vocational]
Gender [male])
Math
1.00 0.01 0.98 – 1.01 0.634

6 Goodness of fit and overdispersion

1 - pchisq(deviance(m0), df.residual(m0))
## [1] 0
performance::check_overdispersion(m0)
## # Overdispersion test
## 
##        dispersion ratio =    6.492
##   Pearson's Chi-Squared = 1960.446
##                 p-value =  < 0.001
## Overdispersion detected.
plot(log(fitted(m0)), log((dta2$Days-fitted(m0))^2), 
     xlab=expression(hat(mu)),
     ylab=expression((y-hat(mu))^2))
grid()
abline(0, 1, col=2)

7 Dispersion parameter

knitr::kable(as.data.frame(summary(m0, dispersion=6.492)$coef))
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.3450136 0.1989867 11.7847779 0.0000000
ProgramGeneral 0.3873043 0.3411743 1.1352096 0.2562875
ProgramVocational -1.4878389 0.6257345 -2.3777477 0.0174187
Gendermale -0.1588403 0.2785265 -0.5702878 0.5684825
Math -0.0071045 0.0043252 -1.6425893 0.1004679
ProgramGeneral:Gendermale -0.0347250 0.5144980 -0.0674929 0.9461893
ProgramVocational:Gendermale 0.4617659 0.9469746 0.4876222 0.6258175
ProgramGeneral:Math 0.0019809 0.0071360 0.2775861 0.7813301
ProgramVocational:Math 0.0094842 0.0103312 0.9180121 0.3586125
Gendermale:Math -0.0026914 0.0062654 -0.4295662 0.6675112
ProgramGeneral:Gendermale:Math -0.0017088 0.0110050 -0.1552787 0.8766016
ProgramVocational:Gendermale:Math -0.0029735 0.0159262 -0.1867067 0.8518906