Homework 4 Binary

Description of Data

Dataset:

Student performance in Exams

Description

This dataset covers high school students in the United States. It contains 1000 observations (rows) and 9 columns.

In this assignment, I would like to study the relationship between gender and exam scores.

Reading the data, renaming the features, and coding male to 0 and female to 1

library(Zelig)

## Loading required package: survival

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readr)
performance<-read_csv("/Users/Jessica/Desktop/performance.csv")

## Parsed with column specification:
## cols(
##   ID = col_double(),
##   gender = col_character(),
##   race = col_character(),
##   parental = col_character(),
##   lunch = col_character(),
##   `test preparation course` = col_character(),
##   math = col_double(),
##   reading = col_double(),
##   writing = col_double()
## )

# read_csv() above already loaded the file into `performance`; data() only loads
# datasets bundled with packages, so no data(performance) call is needed here

head(performance)

## # A tibble: 6 x 9
##      ID gender race   parental lunch `test preparati…  math reading writing
##   <dbl> <chr>  <chr>  <chr>    <chr> <chr>            <dbl>   <dbl>   <dbl>
## 1     1 female group… bachelo… stan… none                72      72      74
## 2     2 female group… some co… stan… completed           69      90      88
## 3     3 female group… master'… stan… none                90      95      93
## 4     4 male   group… associa… free… none                47      57      44
## 5     5 male   group… some co… stan… none                76      78      75
## 6     6 female group… associa… stan… none                71      83      78

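# first attempt at recoding gender: as.integer() cannot parse the strings
# "male"/"female", so sex comes out as NA (see the warning below); the working
# recode uses sjmisc::rec() further down
performance <- performance %>%
  mutate(sex = as.integer(gender))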

## Warning in evalq(as.integer(gender), <environment>): NAs introduced by
## coercion

head(performance)

## # A tibble: 6 x 10
##      ID gender race  parental lunch `test preparati…  math reading writing
##   <dbl> <chr>  <chr> <chr>    <chr> <chr>            <dbl>   <dbl>   <dbl>
## 1     1 female grou… bachelo… stan… none                72      72      74
## 2     2 female grou… some co… stan… completed           69      90      88
## 3     3 female grou… master'… stan… none                90      95      93
## 4     4 male   grou… associa… free… none                47      57      44
## 5     5 male   grou… some co… stan… none                76      78      75
## 6     6 female grou… associa… stan… none                71      83      78
## # … with 1 more variable: sex <int>

performance <- performance %>% 
  mutate(gender = sjmisc::rec(gender, rec = "male=0; female=1")) %>% 
  select(ID, gender, race, everything()) %>% 
  select(-lunch)
head(performance)

## # A tibble: 6 x 9
##      ID gender race   parental `test preparati…  math reading writing   sex
##   <dbl>  <dbl> <chr>  <chr>    <chr>            <dbl>   <dbl>   <dbl> <int>
## 1     1      1 group… bachelo… none                72      72      74    NA
## 2     2      1 group… some co… completed           69      90      88    NA
## 3     3      1 group… master'… none                90      95      93    NA
## 4     4      0 group… associa… none                47      57      44    NA
## 5     5      0 group… some co… none                76      78      75    NA
## 6     6      1 group… associa… none                71      83      78    NA
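The same 0/1 coding can also be produced without extra packages. The following is a minimal sketch in base R, assuming gender contains only the strings "male" and "female"; the toy vector is illustrative and simply mirrors the six rows printed above.

```r
# Toy vector mirroring the first six rows printed above (illustrative only)
gender <- c("female", "female", "female", "male", "male", "female")

# gender == "female" yields TRUE/FALSE; as.integer() maps these to 1/0
sex <- as.integer(gender == "female")

# Cross-tabulate to confirm that female maps to 1 and male to 0
table(gender, sex)
```

Comparing against the string before converting avoids the NA values that as.integer() returns when it is applied directly to text.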

Running the regression

Let’s try to predict a student’s gender (female = 1, male = 0) from their exam scores

#math
m0 <- glm(gender ~ math, family = "binomial", data = performance)
summary(m0)

## 
## Call:
## glm(formula = gender ~ math, family = "binomial", data = performance)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.607  -1.181   0.862   1.135   1.486  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.589260   0.297472   5.343 9.16e-08 ***
## math        -0.022913   0.004376  -5.235 1.65e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1385.0  on 999  degrees of freedom
## Residual deviance: 1356.3  on 998  degrees of freedom
## AIC: 1360.3
## 
## Number of Fisher Scoring iterations: 4

#reading
m1 <- glm(gender ~ reading, family = "binomial", data = performance)
summary(m1)

## 
## Call:
## glm(formula = gender ~ reading, family = "binomial", data = performance)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6967  -1.1499   0.7706   1.0896   1.9668  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.381634   0.333302  -7.146 8.96e-13 ***
## reading      0.035504   0.004729   7.507 6.04e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1385.0  on 999  degrees of freedom
## Residual deviance: 1323.6  on 998  degrees of freedom
## AIC: 1327.6
## 
## Number of Fisher Scoring iterations: 4

#writing
m2 <- glm(gender ~ writing, family = "binomial", data = performance)
summary(m2)

## 
## Call:
## glm(formula = gender ~ writing, family = "binomial", data = performance)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.828  -1.116   0.678   1.057   2.250  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.882989   0.330868  -8.713   <2e-16 ***
## writing      0.043450   0.004763   9.123   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1385.0  on 999  degrees of freedom
## Residual deviance: 1290.1  on 998  degrees of freedom
## AIC: 1294.1
## 
## Number of Fisher Scoring iterations: 4

#math + reading + writing
m3 <- glm(gender ~ math + reading + writing, family = "binomial", data = performance)
summary(m3)

## 
## Call:
## glm(formula = gender ~ math + reading + writing, family = "binomial", 
##     data = performance)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -3.01710  -0.34709   0.05175   0.40383   2.55775  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.44996    0.53166  -4.608 4.06e-06 ***
## math        -0.36591    0.02366 -15.467  < 2e-16 ***
## reading      0.07242    0.02623   2.761  0.00577 ** 
## writing      0.31897    0.02935  10.867  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1385.0  on 999  degrees of freedom
## Residual deviance:  585.1  on 996  degrees of freedom
## AIC: 593.1
## 
## Number of Fisher Scoring iterations: 6

coef(m3)

## (Intercept)        math     reading     writing 
## -2.44995527 -0.36590614  0.07242259  0.31896702

Some Interpretations

We can interpret model 4 like this:

Take writing first, because it is highly significant: holding math and reading constant, a 1-point increase in writing score increases the log odds of being female by 0.319 on average. By contrast, a 1-point increase in math score decreases the log odds of being female by 0.366 on average.
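Log odds are hard to read directly, so it may help to convert them. The sketch below uses the coefficients copied from the coef(m3) output above, exponentiates them into odds ratios, and computes a predicted probability for one hypothetical student (the scores 60/70/70 are made up for illustration).

```r
# Coefficients copied from the coef(m3) output above
b <- c("(Intercept)" = -2.44995527, math = -0.36590614,
       reading = 0.07242259, writing = 0.31896702)

# Exponentiating a log-odds coefficient gives an odds ratio
odds_ratio <- exp(b[c("math", "reading", "writing")])
round(odds_ratio, 3)

# Predicted probability of being female for a hypothetical student
scores <- c(math = 60, reading = 70, writing = 70)    # made-up scores
eta <- b[["(Intercept)"]] + sum(b[names(scores)] * scores)  # linear predictor
p_female <- plogis(eta)                                # inverse logit
p_female
```

An odds ratio above 1 (reading, writing) means the odds of being female rise with the score, and an odds ratio below 1 (math) means they fall, in each case holding the other two scores constant.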

Table and Compare

library(texreg)

## Version:  1.36.23
## Date:     2017-03-03
## Author:   Philip Leifeld (University of Glasgow)
## 
## Please cite the JSS article in your publications -- see citation("texreg").

htmlreg(list(m0,m1,m2,m3))

Statistical models
                 Model 1     Model 2     Model 3     Model 4
(Intercept)      1.59***     -2.38***    -2.88***    -2.45***
                 (0.30)      (0.33)      (0.33)      (0.53)
math             -0.02***                            -0.37***
                 (0.00)                              (0.02)
reading                      0.04***                 0.07**
                             (0.00)                  (0.03)
writing                                  0.04***     0.32***
                                         (0.00)      (0.03)
AIC              1360.31     1327.59     1294.15     593.10
BIC              1370.13     1337.41     1303.96     612.73
Log Likelihood   -678.16     -661.80     -645.07     -292.55
Deviance         1356.31     1323.59     1290.15     585.10
Num. obs.        1000        1000        1000        1000
*** p < 0.001, ** p < 0.01, * p < 0.05

Looking at both AIC and BIC (lower is better), the single-score models improve from math to reading to writing, and the best performing model on both metrics is model 4, which includes all three scores.
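As a check on where these numbers come from, the AIC and BIC values can be recomputed from the log likelihoods in the table. This is a sketch using the rounded table values, so results may differ from the reported ones in the second decimal place; the parameter counts k assume one intercept plus one slope per score.

```r
# Log likelihoods taken from the table above (rounded to two decimals)
loglik <- c(m0 = -678.16, m1 = -661.80, m2 = -645.07, m3 = -292.55)
k      <- c(m0 = 2, m1 = 2, m2 = 2, m3 = 4)  # intercept + number of slopes
n      <- 1000                                # number of observations

aic <- 2 * k - 2 * loglik        # AIC = 2k - 2 * logLik
bic <- log(n) * k - 2 * loglik   # BIC = k * log(n) - 2 * logLik

round(rbind(AIC = aic, BIC = bic), 2)
```

BIC penalizes each extra parameter by log(1000), roughly 6.9, instead of 2, yet model 4 still comes out lowest on both criteria.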

anova(m0, m1, m2, m3, test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: gender ~ math
## Model 2: gender ~ reading
## Model 3: gender ~ writing
## Model 4: gender ~ math + reading + writing
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       998     1356.3                          
## 2       998     1323.6  0    32.72              
## 3       998     1290.2  0    33.44              
## 4       996      585.1  2   705.05 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

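In the analysis of deviance, model 4 has by far the lowest residual deviance (585.1). Models 1 to 3 are not nested in one another, so the table reports no test between them; the only test shown compares model 4 with model 3, where adding math and reading reduces the deviance by 705.05 on 2 degrees of freedom (p < 2.2e-16).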
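The chi-square test in the last row of the table can be reproduced by hand. A minimal sketch, using the residual deviances and degrees of freedom reported in the summaries above for the writing-only model and the full model:

```r
# Residual deviance and residual df from the summaries above
dev_writing <- 1290.15; df_writing <- 998  # gender ~ writing
dev_full    <- 585.10;  df_full    <- 996  # gender ~ math + reading + writing

# Likelihood ratio statistic: drop in deviance from the smaller to the larger model
lr_stat <- dev_writing - dev_full
lr_df   <- df_writing - df_full

# Upper-tail chi-square probability
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
c(statistic = lr_stat, df = lr_df, p = p_value)
```

This comparison is valid because the writing-only model is nested in the full model; the same calculation could be repeated for the math-only and reading-only models against the full model.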

In model 4, this can be interpreted as follows: holding reading and writing constant, a 1-point increase in math score decreases the log odds of being female by 0.37 on average.

library(visreg)
visreg(m0, "math",scale="response")

The math score graph above shows the predicted probability of being female falling as math score rises, which suggests that male students tend to have better math scores.
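The curve that visreg draws on the response scale can be approximated by hand. The sketch below uses the intercept and slope copied from summary(m0) above and evaluates the inverse logit on an illustrative grid of math scores; it does not reproduce the confidence band.

```r
# Coefficients copied from summary(m0) above
b0 <- 1.589260   # intercept
b1 <- -0.022913  # slope for math

# Illustrative grid of math scores from 0 to 100
math_grid <- seq(0, 100, by = 10)

# Predicted probability of being female: inverse logit of the linear predictor
p_female <- plogis(b0 + b1 * math_grid)

data.frame(math = math_grid, p_female = round(p_female, 3))
```

Because the slope is negative, the predicted probability of being female decreases steadily across the grid.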

library(visreg)
visreg(m1, "reading",scale="response")

The reading score graph above suggests that female students tend to have better reading scores.

library(visreg)
visreg(m2, "writing",scale="response")

The writing score graph above suggests that female students tend to have better writing scores.

In conclusion, our analysis suggests that male students tend to score higher in math, while female students tend to score higher in language arts subjects like reading and writing.