The data set days absent contains attendance data on 314 high school juniors from high schools. The response variable of interest is days absent from school.
The standardized math score for each student was given as well as the program variable indicating the type of instructional program in which the student was enrolled.
Column 1: Student ID
Column 2: Gender ID
Column 3: Program ID
Column 4: Standardized mathematics score
Column 5: Days absent from school
The interest is to model the number of days of absence from school using student’s gender, the standardized test in math, and the type of program in which the student is enrolled.
#input data
dta <- read.csv("C:/Users/Ching-Fang Wu/Documents/data/days_absent.csv",h=T)
head(dta,8)
## ID Gender Program Math Days
## 1 S1001 male Academic 63 4
## 2 S1002 male Academic 27 4
## 3 S1003 female Academic 20 2
## 4 S1004 female Academic 16 3
## 5 S1005 female Academic 2 3
## 6 S1006 female Academic 71 13
## 7 S1007 female Academic 63 11
## 8 S1008 male Academic 3 7
tail(dta,8)
## ID Gender Program Math Days
## 307 S2150 male Vocational 63 3
## 308 S2151 female Vocational 59 7
## 309 S2152 female Vocational 46 1
## 310 S2153 male Academic 26 1
## 311 S2154 female Vocational 79 3
## 312 S2155 female Academic 59 0
## 313 S2156 female Vocational 90 0
## 314 S2157 female Vocational 77 2
str(dta)
## 'data.frame': 314 obs. of 5 variables:
## $ ID : chr "S1001" "S1002" "S1003" "S1004" ...
## $ Gender : chr "male" "male" "female" "female" ...
## $ Program: chr "Academic" "Academic" "Academic" "Academic" ...
## $ Math : int 63 27 20 16 2 71 63 3 51 49 ...
## $ Days : int 4 4 2 3 3 13 11 7 10 9 ...
類別變數有三個:ID、Gender、Program 數值變數有兩個:Math、Days
summary(dta)
## ID Gender Program Math
## Length:314 Length:314 Length:314 Min. : 1.00
## Class :character Class :character Class :character 1st Qu.:28.00
## Mode :character Mode :character Mode :character Median :48.00
## Mean :48.27
## 3rd Qu.:70.00
## Max. :99.00
## Days
## Min. : 0.000
## 1st Qu.: 1.000
## Median : 4.000
## Mean : 5.955
## 3rd Qu.: 8.000
## Max. :35.000
標準化數學分數的平均值=48.27和中位數差不多。 缺席天數平均值=5.955,中位數=4。
library(ggplot2)
ggplot(dta,
aes(Days)) +
geom_histogram(binwidth = 1, fill=8, col=1)+
facet_grid(Gender ~ Program) + #依據Gender及Program來畫缺席天數次數表
labs(x="Days absent",
y="Frequency") +
theme_minimal()+
theme(legend.position = c(.95, .9))
knitr::kable(aggregate(cbind(Days, Math)~ Gender+Program, data=dta, FUN=mean))
Gender | Program | Days | Math |
---|---|---|---|
female | Academic | 7.759036 | 43.69880 |
male | Academic | 6.119048 | 41.50000 |
female | General | 12.363636 | 44.40909 |
male | General | 8.555556 | 45.33333 |
female | Vocational | 2.709091 | 58.00000 |
male | Vocational | 2.634615 | 58.84615 |
Academic program 的女學生平均缺席天數為7.759,數學標準化分數平均為43.698。 General program的女學生平均缺席天數為12.364,數學標準化分數平均為44.409。 Vocational program的女學生平均缺席天數為2.709,數學標準化分數平均為58。
knitr::kable(aggregate(cbind(Days, Math)~ Gender+Program, data=dta, FUN=var))
Gender | Program | Days | Math |
---|---|---|---|
female | Academic | 68.55098 | 586.7252 |
male | Academic | 41.81698 | 689.8193 |
female | General | 84.33766 | 804.5390 |
male | General | 41.67320 | 923.5294 |
female | Vocational | 10.28418 | 512.4815 |
male | Vocational | 18.07956 | 364.9170 |
ggplot(dta,
aes(Math, Days,
color=Gender)) +
stat_smooth(method="glm",
formula=y ~ x,
method.args=list(family=poisson),
size=rel(.5)) +
geom_point(pch=20, alpha=.5) +
facet_grid(. ~ Program) +
labs(y="Days absent",
x="Math score") +
theme_minimal()+
theme(legend.position = c(.95, .9))
Acdemic及Generalprogram的學生,不論男女,缺課天數越少,數學分數越高。
不同program學生的缺課天數,缺課天數最多的是General的學生,Acdemic次之,Vactional最少。
sjPlot::tab_model(m0 <- glm(Days ~ Program*Gender*Math, data=dta,family=poisson(link=log)), show.se=T, show.r2=F, show.obs=F)
Days | ||||
---|---|---|---|---|
Predictors | Incidence Rate Ratios | std. Error | CI | p |
(Intercept) | 10.43 | 0.08 | 8.94 – 12.14 | <0.001 |
Program [General] | 1.47 | 0.13 | 1.13 – 1.91 | 0.004 |
Program [Vocational] | 0.23 | 0.25 | 0.14 – 0.36 | <0.001 |
Gender [male] | 0.85 | 0.11 | 0.69 – 1.06 | 0.146 |
Math | 0.99 | 0.00 | 0.99 – 1.00 | <0.001 |
Program [General] * Gender [male] |
0.97 | 0.20 | 0.65 – 1.43 | 0.863 |
Program [Vocational] * Gender [male] |
1.59 | 0.37 | 0.76 – 3.27 | 0.214 |
Program [General] * Math | 1.00 | 0.00 | 1.00 – 1.01 | 0.479 |
Program [Vocational] * Math |
1.01 | 0.00 | 1.00 – 1.02 | 0.019 |
Gender [male] * Math | 1.00 | 0.00 | 0.99 – 1.00 | 0.274 |
(Program [General] Gender [male]) Math |
1.00 | 0.00 | 0.99 – 1.01 | 0.692 |
(Program [Vocational] Gender [male]) Math |
1.00 | 0.01 | 0.98 – 1.01 | 0.634 |
1 - pchisq(deviance(m0), df.residual(m0))
## [1] 0
performance::check_overdispersion(m0)
## # Overdispersion test
##
## dispersion ratio = 6.492
## Pearson's Chi-Squared = 1960.446
## p-value = < 0.001
## Overdispersion detected.
dispersion ratio = 6.492
plot(log(fitted(m0)), log((dta$Days-fitted(m0))^2),
xlab=expression(hat(mu)),
ylab=expression((y-hat(mu))^2))
grid()
abline(0, 1, col=2)
knitr::kable(as.data.frame(summary(m0, dispersion=6.492)$coef))
Estimate | Std. Error | z value | Pr(>|z|) | |
---|---|---|---|---|
(Intercept) | 2.3450136 | 0.1989867 | 11.7847779 | 0.0000000 |
ProgramGeneral | 0.3873043 | 0.3411743 | 1.1352096 | 0.2562875 |
ProgramVocational | -1.4878389 | 0.6257345 | -2.3777477 | 0.0174187 |
Gendermale | -0.1588403 | 0.2785265 | -0.5702878 | 0.5684825 |
Math | -0.0071045 | 0.0043252 | -1.6425893 | 0.1004679 |
ProgramGeneral:Gendermale | -0.0347250 | 0.5144980 | -0.0674929 | 0.9461893 |
ProgramVocational:Gendermale | 0.4617659 | 0.9469746 | 0.4876222 | 0.6258175 |
ProgramGeneral:Math | 0.0019809 | 0.0071360 | 0.2775861 | 0.7813301 |
ProgramVocational:Math | 0.0094842 | 0.0103312 | 0.9180121 | 0.3586125 |
Gendermale:Math | -0.0026914 | 0.0062654 | -0.4295662 | 0.6675112 |
ProgramGeneral:Gendermale:Math | -0.0017088 | 0.0110050 | -0.1552787 | 0.8766016 |
ProgramVocational:Gendermale:Math | -0.0029735 | 0.0159262 | -0.1867067 | 0.8518906 |