LUKA ČERNILA

RESEARCH QUESTION:
How genderand result score on writting exam affect the result score on reading exam?

mydata <- read.table("./StudentsPerformance.csv", header = TRUE, sep = ",")

head(mydata)
##   gender race.ethnicity parental.level.of.education        lunch test.preparation.course math.score reading.score
## 1 female        group B           bachelor's degree     standard                    none         72            72
## 2 female        group C                some college     standard               completed         69            90
## 3 female        group B             master's degree     standard                    none         90            95
## 4   male        group A          associate's degree free/reduced                    none         47            57
## 5   male        group C                some college     standard                    none         76            78
## 6 female        group B          associate's degree     standard                    none         71            83
##   writing.score
## 1            74
## 2            88
## 3            93
## 4            44
## 5            75
## 6            78
mydata$race.ethnicity <- NULL

mydata$parental.level.of.education <- NULL

mydata$lunch <- NULL

mydata$test.preparation.course <- NULL

mydata$math.score <- NULL

˙˙

head(mydata, 10)
##    gender reading.score writing.score
## 1  female            72            74
## 2  female            90            88
## 3  female            95            93
## 4    male            57            44
## 5    male            78            75
## 6  female            83            78
## 7  female            95            92
## 8    male            43            39
## 9    male            64            67
## 10 female            60            50

Unit of observation: one student

I have found this dataset on Kaggle.com with the title Students Performance on Exams. Since the sample size (1000) is way to large, i will choose a random sample of 150 students.

set.seed(1)
mydata <- mydata[sample(nrow(mydata), 150), ]
head(mydata)
##     gender reading.score writing.score
## 836 female            64            74
## 679   male            75            78
## 129   male            82            74
## 930 female            56            51
## 509   male            78            77
## 471 female            85            90

Factoring

mydata$genderF <- factor(mydata$gender,
                         levels = c("male", "female"),
                         labels = c("male", "female"))

mydata$ID <- seq(1, nrow(mydata))
head(mydata)
##     gender reading.score writing.score genderF ID
## 836 female            64            74  female  1
## 679   male            75            78    male  2
## 129   male            82            74    male  3
## 930 female            56            51  female  4
## 509   male            78            77    male  5
## 471 female            85            90  female  6

Descriptive statistics

summary(mydata[c("genderF", "reading.score", "writing.score")])
##    genderF   reading.score    writing.score   
##  male  :68   Min.   : 24.00   Min.   : 15.00  
##  female:82   1st Qu.: 58.00   1st Qu.: 56.00  
##              Median : 69.50   Median : 68.50  
##              Mean   : 68.71   Mean   : 67.95  
##              3rd Qu.: 79.00   3rd Qu.: 80.00  
##              Max.   :100.00   Max.   :100.00

In the explanatory variable gender “Female” is going to be the reference group, since it has higher frequency in our sample than “male”.
Average score on reading comprehension is 68.71 and average score on writing comprehension is 67.95. Max score on both exams was 100, while lowest score on writing was lower by 9 points.

Regression model

In my regression model I will try to explain how big of a effect on reading score, gender and writing score have. I expect that reading score would increase, if writing score is high as well. Therefore I assume that the relationship between reading and writing score will be positive. I also assume that females would do better on the exam than males, However, this is just a perosnal hunch and could be wrong. Therefore I predict that the model where I would include both explanatory variables (gender and writing score) would have a stronger relationship with reading score, compared to the model where I would include only writing score as an explanatory variable.

scatterplot(reading.score ~ writing.score | genderF, 
            ylab = "reading score in points", 
            xlab = "writing score in points", 
            smooth = FALSE,
            boxplots = FALSE,
            data = mydata)

Due to the two different gradients of the lines I will first test for interactions. However the difference in slopes is minimal and could be just due to the fact that we took a random sample.

Checkinmg for interactions

fit_F <- lm(reading.score ~ writing.score, 
                 data = mydata[mydata$genderF == "female", ])

summary(fit_F)
## 
## Call:
## lm(formula = reading.score ~ writing.score, data = mydata[mydata$genderF == 
##     "female", ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.7938 -2.6650  0.1906  2.7368  9.4181 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    8.76219    2.09270   4.187 7.21e-05 ***
## writing.score  0.87881    0.02847  30.867  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.179 on 80 degrees of freedom
## Multiple R-squared:  0.9225, Adjusted R-squared:  0.9216 
## F-statistic: 952.8 on 1 and 80 DF,  p-value: < 2.2e-16
fit_M<- lm(reading.score ~ writing.score, 
                 data = mydata[mydata$genderF == "male", ])

summary(fit_M)
## 
## Call:
## lm(formula = reading.score ~ writing.score, data = mydata[mydata$genderF == 
##     "male", ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.1025  -3.2365  -0.5496   3.8111  10.4504 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.97279    2.62174   2.278    0.026 *  
## writing.score  0.93089    0.04031  23.092   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.758 on 66 degrees of freedom
## Multiple R-squared:  0.8899, Adjusted R-squared:  0.8882 
## F-statistic: 533.2 on 1 and 66 DF,  p-value: < 2.2e-16
fit0 <- lm(reading.score ~ writing.score + genderF+ writing.score:genderF, 
          data = mydata)

summary(fit0)
## 
## Call:
## lm(formula = reading.score ~ writing.score + genderF + writing.score:genderF, 
##     data = mydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.1025  -3.0253  -0.0155   3.4412  10.4504 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  5.97279    2.45214   2.436   0.0161 *  
## writing.score                0.93089    0.03770  24.689   <2e-16 ***
## genderFfemale                2.78940    3.31340   0.842   0.4012    
## writing.score:genderFfemale -0.05208    0.04838  -1.076   0.2835    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.45 on 146 degrees of freedom
## Multiple R-squared:  0.9132, Adjusted R-squared:  0.9114 
## F-statistic: 511.7 on 3 and 146 DF,  p-value: < 2.2e-16

We can disregard the interactions at p=0,29 and conclude that there is not enough statistical evidance that there is a difference in affect between the two genders.

Linearity

library(car)
scatterplot(reading.score ~ writing.score, 
            ylab = "points scored on reading", 
            xlab = "Points scored on writing", 
            smooth = FALSE, 
            data = mydata)

We can see from the scaterplot above that the assumption of linearity is met.

Ggplot

library(ggplot2)
ggplot(mydata, aes(y = reading.score, x = writing.score)) +
  geom_point(colour = "darkred") +
  geom_smooth(method = 'lm', formula = y ~ x, se = FALSE, color = "darkorange") +
  ylab("reading score in points") + 
  xlab("writing score in points") +
  scale_x_continuous(breaks = seq(0, 100, 10), limits = c(0, 100)) +
  scale_y_continuous(breaks = seq(0, 100, 10), limits = c(0, 100)) +
  theme_dark() +
  geom_text(
    label = mydata$ID, 
    nudge_x = 0.50, nudge_y = 0.3, 
    check_overlap = T, size = 2.5) +
  geom_hline(yintercept = mean(mydata$reading.score), linetype = "dashed", colour = "blue") +
  geom_vline(xintercept = mean(mydata$writing.score), linetype = "dashed", colour = "blue") 
## Warning: Removed 3 rows containing missing values (`geom_text()`).

I have used this funnction to display the three deviations, so unexplained, explained and total and because we did the same at the seminars.

fit <- lm(reading.score ~ writing.score, 
          data = mydata)

Testing of homoskedasticity

H0: Homoscedasticity is met.
H1: Homoscedasticity is violated.

mydata$StdFitted <- scale(fit$fitted.values)

library(car)
scatterplot(y = mydata$StdResid, x = mydata$StdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

From the scatterplot above I could assume homoskedasticity, but i will rather also check it with Breuch pagan test.

#install.packages("olsrr")
library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                   Data                    
##  -----------------------------------------
##  Response : reading.score 
##  Variables: fitted values of reading.score 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    0.4421539 
##  Prob > Chi2   =    0.5060847

Based on the p value (p=0,51) we cannot reject the null hypothesis.therefore i will assume that homoskedasticity is not violated.

Normal distribution of errors, outliers and high impact units

HO: Errors are normally distributed.
H1: Errors are not normally distributed.

mydata$StdResid <- round(rstandard(fit), 3) 
mydata$CooksD <- round(cooks.distance(fit), 3)

hist(mydata$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

From the look at the histogram I would say that this distribution looks somewhat normal. But just in case i will still perform wilk shapiro test. Even if thisn test will show results that this is not normal distribution, i will disregard that and treat it as such since we said, we can assume normall distribution if we have a asample larger than 30, at university.

shapiro.test(mydata$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResid
## W = 0.99583, p-value = 0.9493

Based on the p value (p=0,95), we cannot reject the H0 and can assume normal distribution.

hist(mydata$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histrogram of Cooks distances")

Based on this histogram i can tell that i need to drop some units (those that are not attached to others). Therefore I will remove all units with Cooks distance greater than 0,05.

head(mydata[order(- mydata$StdResid),], 10)
##     gender reading.score writing.score genderF  ID  StdFitted StdResid CooksD
## 304   male            76            64    male  96 -0.2475293    2.440  0.021
## 187   male            76            65    male  10 -0.1849165    2.239  0.017
## 729 female            92            84  female  24  1.0047269    2.024  0.028
## 874   male            90            82    male  14  0.8795013    1.974  0.023
## 716 female            94            87  female 130  1.1925653    1.872  0.029
## 871   male            58            47    male  90 -1.3119471    1.818  0.031
## 129   male            82            74    male   3  0.3785988    1.778  0.012
## 382   male           100            95    male  30  1.6934678    1.620  0.035
## 855   male            64            55    male 105 -0.8110446    1.551  0.013
## 802   male            80            73    male  28  0.3159860    1.529  0.009
head(mydata[order(mydata$StdResid),], 10)
##     gender reading.score writing.score genderF  ID  StdFitted StdResid CooksD
## 487   male            47            56    male 104 -0.7484318   -2.492  0.033
## 836 female            64            74  female   1  0.3785988   -2.283  0.020
## 811   male            31            36    male 122 -2.0006880   -2.093  0.076
## 889 female            65            74  female  46  0.3785988   -2.057  0.016
## 110 female            64            72  female  88  0.2533732   -1.879  0.013
## 43  female            58            65  female 141 -0.1849165   -1.821  0.012
## 684 female            40            44  female  51 -1.4997855   -1.660  0.031
## 422 female            58            64  female  39 -0.2475293   -1.619  0.009
## 214   male            51            56    male 135 -0.7484318   -1.588  0.013
## 299   male            46            50    male   7 -1.1241087   -1.510  0.018

As we could already see from the histogram above and also now with std residuals displayed in a chart we dont have to remove any units here since all of them are withing +- 3.

head(mydata[order(-mydata$CooksD),], 10)
##     gender reading.score writing.score genderF  ID  StdFitted StdResid CooksD
## 811   male            31            36    male 122 -2.0006880   -2.093  0.076
## 339 female            38            27  female 138 -2.5642033    1.358  0.049
## 382   male           100            95    male  30  1.6934678    1.620  0.035
## 111 female            89            98  female  40  1.8813063   -1.499  0.035
## 487   male            47            56    male 104 -0.7484318   -2.492  0.033
## 684 female            40            44  female  51 -1.4997855   -1.660  0.031
## 871   male            58            47    male  90 -1.3119471    1.818  0.031
## 716 female            94            87  female 130  1.1925653    1.872  0.029
## 729 female            92            84  female  24  1.0047269    2.024  0.028
## 874   male            90            82    male  14  0.8795013    1.974  0.023

i only have to remove unit ID 122 due to the gaps in Cooks distance (high impact unit).

#install.packages("dplyr")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mydata <- mydata %>%
  filter(!ID %in% c(122))
tail(mydata)
##     gender reading.score writing.score genderF  ID  StdFitted StdResid CooksD
## 144 female            63            59  female 145 -0.5605934    0.517  0.001
## 145 female            68            73  female 146  0.3159860   -1.178  0.005
## 146   male            48            46    male 147 -1.3745599   -0.248  0.001
## 147 female            86            83  female 148  0.9421141    0.868  0.005
## 148   male            60            57    male 149 -0.6858190    0.243  0.000
## 149 female            62            56  female 150 -0.7484318    0.897  0.004

We can see that the unit was removed since we now have only 149 units in our sample.

Model with writting score

H0: Ro^2 is equal to 0.
H1: Ro^2 > 0.

summary(fit)
## 
## Call:
## lm(formula = reading.score ~ writing.score, data = mydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.0302  -2.9239  -0.1583   3.4091  10.8199 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.98088    1.59294    5.01 1.53e-06 ***
## writing.score  0.89374    0.02282   39.16  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.45 on 148 degrees of freedom
## Multiple R-squared:  0.912,  Adjusted R-squared:  0.9114 
## F-statistic:  1533 on 1 and 148 DF,  p-value: < 2.2e-16
sqrt(summary(fit)$r.squared)
## [1] 0.9549744

Based on the random sample and the p value (p<o,oo1), i can reject the HO and conclude that writing score has an effect on the reading score.

Explanation of the Multiple R-squared: 91,2% of variability of points scored on the reading exam is explained by variability of the points scored on the writing exam.

Explanation of estimated coefficient: for each additional point scored on the writing exam, reading exam points increas on average by 0,893 points.

Explanation of multiple correlation coefficient: The relationship between reading score and writing score is positive and very strong

Model with added gender

fit1 <- lm(reading.score ~ writing.score + genderF, 
          data = mydata)
summary(fit1)
## 
## Call:
## lm(formula = reading.score ~ writing.score + genderF, data = mydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.5364  -2.9319  -0.1174   3.3038  10.3296 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     8.5983     1.6000   5.374 2.98e-07 ***
## writing.score   0.8918     0.0236  37.781  < 2e-16 ***
## genderFfemale  -0.7642     0.7475  -1.022    0.308    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.397 on 146 degrees of freedom
## Multiple R-squared:  0.9114, Adjusted R-squared:  0.9102 
## F-statistic: 750.7 on 2 and 146 DF,  p-value: < 2.2e-16

Conclusion

Gender is statistically insignificant when checking its affect on the reading exam score. Therefore taking the first, simpler model is adviced.