Today, I will be using student performance data from Kaggle, which includes the following variables: lunch (standard or free/reduced), test preparation course (whether or not the student completed the prep course), math, reading, and writing scores, students’ gender, parental level of education (high school, some college, bachelor’s degree, etc., which will be recoded), and the racial/ethnic group the student belongs to (A, B, or C). I am going to analyze how math scores are affected by gender, test preparation, and parental level of education. My hypothesis is that higher math scores are driven by parental level of education and test preparation, not gender.

performance <- read.csv("C:/Users/Marcy/Documents/soc 712/StudentsPerformance.csv")
head(performance)

First, I load the packages that are needed for this analysis.

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(magrittr)
library(Zelig)
## Warning: package 'Zelig' was built under R version 3.5.3
## Loading required package: survival
## Warning: package 'survival' was built under R version 3.5.3
library(ZeligChoice)
## Warning: package 'ZeligChoice' was built under R version 3.5.3
library(faraway)
## Warning: package 'faraway' was built under R version 3.5.3
## 
## Attaching package: 'faraway'
## The following objects are masked from 'package:survival':
## 
##     rats, solder
library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
## 
##     extract
library(survival)

Next, I recode parental level of education into three categories: College Degree, Some College Degree, and No College Degree.

performance$parental.level.of.education <- as.character(performance$parental.level.of.education)

# Collapse the six original levels into three categories
performance$parental.level.of.education <- recode(performance$parental.level.of.education,
  "bachelor's degree" = "College Degree", "master's degree" = "College Degree",
  "associate's degree" = "Some College Degree", "some college" = "Some College Degree",
  "high school" = "No College Degree", "some high school" = "No College Degree")
performance$parental.level.of.education <- as.factor(performance$parental.level.of.education)
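The same collapsing of categories can be sketched in base R with a named lookup vector, which makes the mapping easy to check before it is applied to the real column. This is only an illustration: `edu` is a hypothetical toy vector standing in for `performance$parental.level.of.education`, and the lookup approach is a swapped-in alternative to `recode()`, not the method used above.

```r
# Hypothetical toy vector (an assumption, not the Kaggle data)
edu <- c("bachelor's degree", "some college", "high school",
         "master's degree", "associate's degree", "some high school")

# Named lookup vector: original level -> collapsed category
lookup <- c("bachelor's degree"  = "College Degree",
            "master's degree"    = "College Degree",
            "associate's degree" = "Some College Degree",
            "some college"       = "Some College Degree",
            "high school"        = "No College Degree",
            "some high school"   = "No College Degree")

# Index the lookup by the original values, then convert to a factor
edu_recoded <- factor(unname(lookup[edu]))

# Any level missing from the lookup would show up as NA here
stopifnot(!anyNA(edu_recoded))
table(edu_recoded)
```

Checking for `NA` after the lookup is a cheap way to catch a misspelled or unexpected level before modeling.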

Now I choose a model to work with. Model 3 looks like the best fit: it has the lowest AIC and BIC of the three models. It includes an interaction between the student’s gender and parental level of education, among other variables.
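As a rough illustration of how this kind of comparison works, the sketch below fits two Poisson models with base R's `glm()` and compares them by AIC, where lower is better. The data are simulated stand-ins (the names `toy`, `m_small`, and `m_large` are assumptions for this example), and `glm()` is used here in place of `zelig()` only to keep the sketch dependency-free.

```r
set.seed(712)
n <- 500

# Simulated stand-in data, not the Kaggle file
toy <- data.frame(
  gender  = factor(sample(c("female", "male"), n, replace = TRUE)),
  reading = round(rnorm(n, mean = 69, sd = 14))
)
# Counts generated so that both gender and reading score matter
toy$math <- rpois(n, lambda = exp(3.0 + 0.15 * (toy$gender == "male") +
                                    0.015 * toy$reading))

# Two candidate Poisson models (log link is the default)
m_small <- glm(math ~ gender,           family = poisson, data = toy)
m_large <- glm(math ~ gender + reading, family = poisson, data = toy)

# Lower AIC indicates a better trade-off between fit and complexity
aic_table <- AIC(m_small, m_large)
aic_table
best <- rownames(aic_table)[which.min(aic_table$AIC)]
best
```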

mod1 <- zelig(math.score ~ writing.score, model = "poisson", data = performance, cite = FALSE)
mod2 <- zelig(math.score ~ gender + lunch, model = "poisson", data = performance, cite = FALSE)
mod3 <- zelig(math.score ~ gender * parental.level.of.education + test.preparation.course + reading.score + lunch, model = "poisson", data = performance, cite = FALSE)
texreg::htmlreg(list(mod1, mod2, mod3), doctype = FALSE)
Statistical models
                                                            Model 1     Model 2     Model 3
(Intercept)                                                 3.33***     4.04***     3.02***
                                                            (0.02)      (0.01)      (0.03)
writing.score                                               0.01***
                                                            (0.00)
gendermale                                                              0.07***     0.18***
                                                                        (0.01)      (0.02)
lunchstandard                                                           0.17***     0.07***
                                                                        (0.01)      (0.01)
parental.level.of.educationNo College Degree                                        -0.01
                                                                                    (0.02)
parental.level.of.educationSome College Degree                                      0.01
                                                                                    (0.01)
test.preparation.coursenone                                                         0.02*
                                                                                    (0.01)
reading.score                                                                       0.01***
                                                                                    (0.00)
gendermale:parental.level.of.educationNo College Degree                             0.01
                                                                                    (0.02)
gendermale:parental.level.of.educationSome College Degree                           -0.00
                                                                                    (0.02)
AIC                                                         7429.71     9187.41     6755.87
BIC                                                         7439.52     9202.13     6800.04
Log Likelihood                                              -3712.85    -4590.70    -3368.93
Deviance                                                    1428.91     3184.60     741.07
Num. obs.                                                   1000        1000        1000
*** p < 0.001, ** p < 0.01, * p < 0.05
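Because these are Poisson models with a log link, the coefficients are on the log scale, and exponentiating a coefficient gives the multiplicative change in the expected math score. A small sketch of that arithmetic, using the rounded Model 3 coefficients as reported in the table (the vector name `coefs` is just an illustrative label):

```r
# Rounded coefficients copied from the Model 3 column of the table
coefs <- c(gendermale                  = 0.18,
           lunchstandard               = 0.07,
           test.preparation.coursenone = 0.02)

# exp(beta) is the multiplicative effect on the expected count
rate_ratios <- exp(coefs)

# Expressed as a percentage change relative to the reference category
pct_change <- 100 * (rate_ratios - 1)
round(pct_change, 1)
```

For example, a coefficient of 0.18 for `gendermale` corresponds to an expected math score about 20% higher for male students than for female students, holding the other variables in the model fixed.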

Let’s now explore the simulated difference in expected math score between male and female students, using a model that interacts gender with test preparation.

mod4<- zelig(math.score ~ gender * test.preparation.course, model = "poisson", data = performance, cite = F)

x <- setx(mod4, gender = "female")
x1 <- setx(mod4, gender = "male")
s <- sim(mod4, x = x, x1 = x1)
fd <- s$get_qi(xvalue="x1", qi="fd")
gen.difference <- as.data.frame(cbind(fd))
gen.difference <- gen.difference %>%
  gather(test.preparation.course, simv)
gen.difference %>% 
  group_by(test.preparation.course) %>% 
  summarise(mean = mean(simv), sd = sd(simv))
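To make clearer what a simulated first difference is, here is a minimal base R sketch of the general idea: draw coefficient vectors from their approximate sampling distribution, compute the expected count for each profile, and subtract. It is an approximation of the technique under stated assumptions, not Zelig's actual implementation, and the data (`toy`) are simulated with a built-in gap of 5 points.

```r
set.seed(1)
n <- 2000

# Simulated stand-in data: male mean 69, female mean 64 (assumed values)
toy <- data.frame(gender = factor(sample(c("female", "male"), n, replace = TRUE)))
toy$math <- rpois(n, lambda = ifelse(toy$gender == "male", 69, 64))

fit <- glm(math ~ gender, family = poisson, data = toy)

# Draw coefficients from N(beta_hat, V) using a Cholesky factor of V
n_sims   <- 1000
beta_hat <- coef(fit)
V        <- vcov(fit)
z        <- matrix(rnorm(n_sims * length(beta_hat)), nrow = n_sims)
draws    <- sweep(z %*% chol(V), 2, beta_hat, "+")

# Covariate profiles: (intercept, gendermale)
x_female <- c(1, 0)
x_male   <- c(1, 1)

# Expected counts under each profile, then their difference per draw
ev_female <- exp(draws %*% x_female)
ev_male   <- exp(draws %*% x_male)
fd        <- as.vector(ev_male - ev_female)

c(mean = mean(fd), sd = sd(fd))
```

The result is on the scale of the outcome, so here the first difference is a number of math score points, not a ratio or a frequency.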

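Note that mod4 models math score, so this first difference is the simulated difference in expected math score between male (x1) and female (x) students, with test preparation held at its default value. The figure of about 5 that comes out of this analysis is therefore a gap of roughly 5 points in expected math score, not evidence that female students take the preparation course 5 times more often than males.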

This difference seems large, so to check that it is plausible, I am going to look at the frequency distribution of male and female students.

library (ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:Zelig':
## 
##     stat
ggplot(performance, aes(x=gender))+geom_bar(fill="green")+labs(y= "number of students")+ggtitle("Students' Gender Frequency Distribution") + theme(plot.title = element_text(hjust = 0.5)) 

As shown, this sample contains a roughly even number of male and female students, with slightly more females. Thus, the sample is reasonably balanced by gender.

Now, I will see how many students took the preparation course. As seen below, almost half of the students did not take the test preparation course.

ggplot(performance, aes(x=test.preparation.course))+geom_bar(fill="blue")+labs(y= "number of students")+ggtitle("Students' Test Preparation Completion Distribution") + theme(plot.title = element_text(hjust = 0.5)) 
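A cross-tabulation is one way to check directly whether course completion differs by gender, since the two bar charts above only show each variable separately. The sketch below uses a small made-up data frame (`toy` is an assumption, not the Kaggle data); the same two calls could be applied to `performance`.

```r
# Made-up rows for illustration only
toy <- data.frame(
  gender = c("female", "female", "female", "male",
             "male", "male", "male", "female"),
  test.preparation.course = c("completed", "none", "none", "completed",
                              "none", "none", "completed", "none")
)

# Counts of students in each gender by course-completion cell
counts <- table(toy$gender, toy$test.preparation.course)

# Row proportions: share of each gender that completed the course
shares <- prop.table(counts, margin = 1)

counts
round(shares, 2)
```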

Now, I am going to analyze gender differences across the parental levels of education, which were recoded into the following categories: “No College Degree,” “Some College Degree,” and “College Degree.”

c1x <- setx(mod3, gender = "male", parental.level.of.education = "College Degree")
c1x1 <- setx(mod3, gender = "female", parental.level.of.education = "College Degree")
c1s <- sim(mod3, x = c1x, x1 = c1x1)


sc1x <- setx(mod3, gender = "male", parental.level.of.education = "Some College Degree")
sc1x1 <- setx(mod3, gender = "female", parental.level.of.education = "Some College Degree")
sc1s <- sim(mod3, x = sc1x, x1 = sc1x1)

nc1x <- setx(mod3, gender = "male", parental.level.of.education = "No College Degree")
nc1x1 <- setx(mod3, gender = "female", parental.level.of.education = "No College Degree")
nc1s <- sim(mod3, x = nc1x, x1 = nc1x1)

pd1 <- c1s$get_qi(xvalue="x1", qi="fd")
pd2 <- sc1s$get_qi(xvalue="x1", qi="fd")
pd3 <- nc1s$get_qi(xvalue="x1", qi="fd")

ppfd <- as.data.frame(cbind(pd1, pd2, pd3))

perdd <-ppfd %>% 
  gather(edu, edudmv, 1:3)

head(perdd)
perdd %>% 
  group_by(edu) %>% 
  summarise(mean = mean(edudmv), sd = sd(edudmv))

Plotting the results of the distribution:

ggplot(perdd, aes(edudmv)) + geom_histogram() + facet_grid(~edu)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Apparently, the analysis shows that the simulated gender difference is much the same whether parents have some college or no college degree, and it does not change much as parental level of education increases.

ggplot(gen.difference, aes(simv)) + geom_histogram(fill="brown") + facet_grid(~test.preparation.course) + labs(x = "Simulated First Difference (Mean)", y= "Test Preparation Event")+
ggtitle("Gender Difference in Test Preparation")+theme(plot.title = element_text(hjust = 0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Now, I will look at the gender difference in math score by lunch type (standard or free/reduced) among students who completed the test preparation course.

#test preparation taken/standard lunch
lunx <- setx(mod3, gender = "male",   lunch = "standard", test.preparation.course ="completed")
lunx1 <- setx(mod3, gender = "female",  lunch = "standard", test.preparation.course ="completed")
l <- sim(mod3, x = lunx, x1 = lunx1)

# test preparation - taken/reduced lunch
lunxy <- setx(mod3, gender = "male",   lunch = "free/reduced", test.preparation.course ="completed")
lunxy1 <- setx(mod3, gender = "female",  lunch= "free/reduced", test.preparation.course ="completed")

lot1 <- sim(mod3, x = lunxy, x1 = lunxy1)

Now, I am going to see how parental level of education impacts math score of male and female students.

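# Plot the mean math score by parental education, with bars split by gender
performance %>% group_by(parental.level.of.education, gender) %>% summarise(mean.math = mean(math.score)) %>%
  ggplot(aes(x = parental.level.of.education, y = mean.math, fill = gender)) +
  geom_col(position = "dodge") + labs(title = "Data", y = "mean math.score")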

Thus, math score is associated with both gender and parental level of education, so my hypothesis was not correct. Female students of college-educated parents have the highest math score of all groups.