Student performance in Exams
This dataset was collected from high school students in the United States. It contains 1000 observations (rows) and 9 columns.
In this assignment, I would like to study the relationship between gender and scores.
library(Zelig)
## Loading required package: survival
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
performance<-read_csv("/Users/Jessica/Desktop/performance.csv")
## Parsed with column specification:
## cols(
## ID = col_double(),
## gender = col_character(),
## race = col_character(),
## parental = col_character(),
## lunch = col_character(),
## `test preparation course` = col_character(),
## math = col_double(),
## reading = col_double(),
## writing = col_double()
## )
# read_csv() above already loaded `performance`, so no data() call is needed
head(performance)
## # A tibble: 6 x 9
## ID gender race parental lunch `test preparati… math reading writing
## <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 female group… bachelo… stan… none 72 72 74
## 2 2 female group… some co… stan… completed 69 90 88
## 3 3 female group… master'… stan… none 90 95 93
## 4 4 male group… associa… free… none 47 57 44
## 5 5 male group… some co… stan… none 76 78 75
## 6 6 female group… associa… stan… none 71 83 78
# recode gender
# (first attempt: as.integer() cannot parse "male"/"female", so sex is NA in every row)
performance <- performance %>%
  mutate(sex = as.integer(gender))
## Warning in evalq(as.integer(gender), <environment>): NAs introduced by
## coercion
head(performance)
## # A tibble: 6 x 10
## ID gender race parental lunch `test preparati… math reading writing
## <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 female grou… bachelo… stan… none 72 72 74
## 2 2 female grou… some co… stan… completed 69 90 88
## 3 3 female grou… master'… stan… none 90 95 93
## 4 4 male grou… associa… free… none 47 57 44
## 5 5 male grou… some co… stan… none 76 78 75
## 6 6 female grou… associa… stan… none 71 83 78
## # … with 1 more variable: sex <int>
performance <- performance %>%
mutate(gender = sjmisc::rec(gender, rec = "male=0; female=1")) %>%
select(ID, gender, race, everything()) %>%
select(-lunch)
head(performance)
## # A tibble: 6 x 9
## ID gender race parental `test preparati… math reading writing sex
## <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <int>
## 1 1 1 group… bachelo… none 72 72 74 NA
## 2 2 1 group… some co… completed 69 90 88 NA
## 3 3 1 group… master'… none 90 95 93 NA
## 4 4 0 group… associa… none 47 57 44 NA
## 5 5 0 group… some co… none 76 78 75 NA
## 6 6 1 group… associa… none 71 83 78 NA
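The `sex` column is all `NA` because `as.integer()` only converts strings that look like numbers. As a minimal sketch of an alternative recode in base R, the logical comparison below gives the same 0/1 coding as the `sjmisc::rec()` call (male = 0, female = 1). The `toy` data frame is a hypothetical stand-in for the first six rows of `performance`, not the real file.

```r
# Hypothetical toy data standing in for the first rows of `performance`
toy <- data.frame(
  ID = 1:6,
  gender = c("female", "female", "female", "male", "male", "female"),
  stringsAsFactors = FALSE
)

# as.integer() on non-numeric labels returns NA (normally with a warning)
failed <- suppressWarnings(as.integer(toy$gender))

# A logical comparison converts cleanly: TRUE -> 1 (female), FALSE -> 0 (male)
toy$sex <- as.integer(toy$gender == "female")

print(failed)
print(toy)
```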
Let’s model gender (female = 1, male = 0) as a function of students’ exam scores, using logistic regression.
#math
m0 <- glm(gender ~ math, family = "binomial", data = performance)
summary(m0)
##
## Call:
## glm(formula = gender ~ math, family = "binomial", data = performance)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.607 -1.181 0.862 1.135 1.486
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.589260 0.297472 5.343 9.16e-08 ***
## math -0.022913 0.004376 -5.235 1.65e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1385.0 on 999 degrees of freedom
## Residual deviance: 1356.3 on 998 degrees of freedom
## AIC: 1360.3
##
## Number of Fisher Scoring iterations: 4
#reading
m1 <- glm(gender ~ reading, family = "binomial", data = performance)
summary(m1)
##
## Call:
## glm(formula = gender ~ reading, family = "binomial", data = performance)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6967 -1.1499 0.7706 1.0896 1.9668
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.381634 0.333302 -7.146 8.96e-13 ***
## reading 0.035504 0.004729 7.507 6.04e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1385.0 on 999 degrees of freedom
## Residual deviance: 1323.6 on 998 degrees of freedom
## AIC: 1327.6
##
## Number of Fisher Scoring iterations: 4
#writing
m2 <- glm(gender ~ writing, family = "binomial", data = performance)
summary(m2)
##
## Call:
## glm(formula = gender ~ writing, family = "binomial", data = performance)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.828 -1.116 0.678 1.057 2.250
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.882989 0.330868 -8.713 <2e-16 ***
## writing 0.043450 0.004763 9.123 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1385.0 on 999 degrees of freedom
## Residual deviance: 1290.1 on 998 degrees of freedom
## AIC: 1294.1
##
## Number of Fisher Scoring iterations: 4
#math + reading + writing
m3 <- glm(gender ~ math + reading + writing, family = "binomial", data = performance)
summary(m3)
##
## Call:
## glm(formula = gender ~ math + reading + writing, family = "binomial",
## data = performance)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.01710 -0.34709 0.05175 0.40383 2.55775
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.44996 0.53166 -4.608 4.06e-06 ***
## math -0.36591 0.02366 -15.467 < 2e-16 ***
## reading 0.07242 0.02623 2.761 0.00577 **
## writing 0.31897 0.02935 10.867 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1385.0 on 999 degrees of freedom
## Residual deviance: 585.1 on 996 degrees of freedom
## AIC: 593.1
##
## Number of Fisher Scoring iterations: 6
coef(m3)
## (Intercept) math reading writing
## -2.44995527 -0.36590614 0.07242259 0.31896702
We can interpret model 4 like this:
Take writing first, because it is highly significant. Holding math and reading scores constant, an increase in writing score by 1 unit will on average increase the log odds of being female by 0.319. However, holding reading and writing scores constant, an increase in math score by 1 unit will on average decrease the log odds of being female by 0.366.
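Log odds are hard to read directly, so it can help to exponentiate the coefficients into odds ratios. The sketch below does this for the model 4 coefficients copied from the `coef(m3)` output; the variable names `b`, `odds_ratios`, and `pct_change` are only illustrative.

```r
# Coefficients of model 4, copied from the coef(m3) output
b <- c(math = -0.36590614, reading = 0.07242259, writing = 0.31896702)

# exp() turns a change in log odds into a multiplicative change in odds
odds_ratios <- exp(b)

# Percentage change in the odds of being female per one-point score increase
pct_change <- 100 * (odds_ratios - 1)

print(round(odds_ratios, 3))
print(round(pct_change, 1))
```

An odds ratio above 1 (about 1.38 for writing) means the odds of being female rise with the score, and an odds ratio below 1 (about 0.69 for math) means they fall, with the other two scores held constant.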
library(texreg)
## Version: 1.36.23
## Date: 2017-03-03
## Author: Philip Leifeld (University of Glasgow)
##
## Please cite the JSS article in your publications -- see citation("texreg").
htmlreg(list(m0,m1,m2,m3))
| | Model 1 | Model 2 | Model 3 | Model 4 |
|---|---|---|---|---|
| (Intercept) | 1.59*** | -2.38*** | -2.88*** | -2.45*** |
| | (0.30) | (0.33) | (0.33) | (0.53) |
| math | -0.02*** | | | -0.37*** |
| | (0.00) | | | (0.02) |
| reading | | 0.04*** | | 0.07** |
| | | (0.00) | | (0.03) |
| writing | | | 0.04*** | 0.32*** |
| | | | (0.00) | (0.03) |
| AIC | 1360.31 | 1327.59 | 1294.15 | 593.10 |
| BIC | 1370.13 | 1337.41 | 1303.96 | 612.73 |
| Log Likelihood | -678.16 | -661.80 | -645.07 | -292.55 |
| Deviance | 1356.31 | 1323.59 | 1290.15 | 585.10 |
| Num. obs. | 1000 | 1000 | 1000 | 1000 |

***p < 0.001, **p < 0.01, *p < 0.05
Looking at both AIC and BIC, the single-score models improve from math to reading to writing, and the best performing model on both metrics is model 4, which includes all three scores.
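As a quick check on where these numbers come from: for a 0/1 response, the deviance equals -2 times the log likelihood, so AIC is the deviance plus 2k and BIC is the deviance plus k log(n), where k is the number of estimated coefficients. The sketch below recomputes both from the deviances in the table; the vector names are illustrative, and small differences are only rounding in the reported deviances.

```r
# Deviances and reported criteria, copied from the regression table
dev <- c(model1 = 1356.31, model2 = 1323.59, model3 = 1290.15, model4 = 585.10)
aic_reported <- c(1360.31, 1327.59, 1294.15, 593.10)
bic_reported <- c(1370.13, 1337.41, 1303.96, 612.73)

# Number of estimated coefficients, including the intercept
k <- c(model1 = 2, model2 = 2, model3 = 2, model4 = 4)
n <- 1000  # number of observations

# With a binary response, deviance = -2 * logLik
aic <- dev + 2 * k
bic <- dev + k * log(n)

print(round(cbind(aic, aic_reported, bic, bic_reported), 2))
```

BIC charges each extra coefficient log(1000), roughly 6.9, instead of 2, yet model 4 still comes out well ahead on both criteria.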
anova(m0, m1, m2, m3, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: gender ~ math
## Model 2: gender ~ reading
## Model 3: gender ~ writing
## Model 4: gender ~ math + reading + writing
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 998 1356.3
## 2 998 1323.6 0 32.72
## 3 998 1290.2 0 33.44
## 4 996 585.1 2 705.05 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
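In our analysis of deviance, model 4 is the best because it has by far the lowest residual deviance (585.1). Models 1 to 3 are not nested in one another, so no test is reported for those rows; the chi-square test in the last row compares model 4 with model 3 and is highly significant.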
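The test in the last row of the table is a likelihood ratio test between the writing-only model and the full model. A minimal sketch of that calculation, using the residual deviances and degrees of freedom reported above (the variable names are illustrative):

```r
# Residual deviance and residual df of the nested pair, from the outputs above
dev_writing <- 1290.15  # gender ~ writing
df_writing <- 998
dev_full <- 585.10      # gender ~ math + reading + writing
df_full <- 996

# Drop in deviance and the number of extra coefficients in the full model
lr_stat <- dev_writing - dev_full
lr_df <- df_writing - df_full

# Upper-tail chi-square probability of a drop at least this large
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

print(c(statistic = lr_stat, df = lr_df))
print(p_value)
```

This comparison is valid because the writing-only model is a special case of the full model. The same cannot be said for math versus reading versus writing alone, which is why AIC and BIC are the better tools for ranking those three.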
In model 4, this can be interpreted as follows: holding reading and writing constant, an increase in math score by 1 unit will on average decrease the log odds of being female by 0.37.
library(visreg)
visreg(m0, "math",scale="response")
In the math score graph above, the predicted probability of being female falls as the math score rises, so we can conclude that male students tend to have better math scores.
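With `scale="response"`, the curve is drawn on the probability scale, that is, the inverse logit of the linear predictor. As a rough sketch of what that curve represents, the code below computes the fitted probability of being female at a few math scores by hand, using the model 1 coefficients from `summary(m0)`. The grid of scores is an arbitrary choice for illustration.

```r
# Coefficients of gender ~ math, copied from summary(m0)
b0 <- 1.589260
b1 <- -0.022913

# An arbitrary grid of math scores for illustration
math_scores <- seq(0, 100, by = 20)

# plogis() is the inverse logit: 1 / (1 + exp(-x))
p_female <- plogis(b0 + b1 * math_scores)

print(data.frame(math = math_scores, p_female = round(p_female, 3)))
```

The probabilities fall steadily as the math score rises, which matches the downward slope in the plot.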
library(visreg)
visreg(m1, "reading",scale="response")
In the reading score graph above, the predicted probability of being female rises with the reading score, so we can conclude that female students tend to have better reading scores.
library(visreg)
visreg(m2, "writing",scale="response")
In the writing score graph above, the predicted probability of being female rises with the writing score, so we can conclude that female students tend to have better writing scores.
In conclusion, our analysis suggests that male students tend to do better in science subjects like math, and female students tend to have better scores in arts subjects like reading and writing.