RESEARCH QUESTION:
How genderand result score on writting exam affect the result score on
reading exam?
mydata <- read.table("./StudentsPerformance.csv", header = TRUE, sep = ",")
head(mydata)
## gender race.ethnicity parental.level.of.education lunch test.preparation.course math.score reading.score
## 1 female group B bachelor's degree standard none 72 72
## 2 female group C some college standard completed 69 90
## 3 female group B master's degree standard none 90 95
## 4 male group A associate's degree free/reduced none 47 57
## 5 male group C some college standard none 76 78
## 6 female group B associate's degree standard none 71 83
## writing.score
## 1 74
## 2 88
## 3 93
## 4 44
## 5 75
## 6 78
mydata$race.ethnicity <- NULL
mydata$parental.level.of.education <- NULL
mydata$lunch <- NULL
mydata$test.preparation.course <- NULL
mydata$math.score <- NULL
˙˙
head(mydata, 10)
## gender reading.score writing.score
## 1 female 72 74
## 2 female 90 88
## 3 female 95 93
## 4 male 57 44
## 5 male 78 75
## 6 female 83 78
## 7 female 95 92
## 8 male 43 39
## 9 male 64 67
## 10 female 60 50
Unit of observation: one student
I have found this dataset on Kaggle.com with the title Students Performance on Exams. Since the sample size (1000) is way to large, i will choose a random sample of 150 students.
set.seed(1)
mydata <- mydata[sample(nrow(mydata), 150), ]
head(mydata)
## gender reading.score writing.score
## 836 female 64 74
## 679 male 75 78
## 129 male 82 74
## 930 female 56 51
## 509 male 78 77
## 471 female 85 90
mydata$genderF <- factor(mydata$gender,
levels = c("male", "female"),
labels = c("male", "female"))
mydata$ID <- seq(1, nrow(mydata))
head(mydata)
## gender reading.score writing.score genderF ID
## 836 female 64 74 female 1
## 679 male 75 78 male 2
## 129 male 82 74 male 3
## 930 female 56 51 female 4
## 509 male 78 77 male 5
## 471 female 85 90 female 6
summary(mydata[c("genderF", "reading.score", "writing.score")])
## genderF reading.score writing.score
## male :68 Min. : 24.00 Min. : 15.00
## female:82 1st Qu.: 58.00 1st Qu.: 56.00
## Median : 69.50 Median : 68.50
## Mean : 68.71 Mean : 67.95
## 3rd Qu.: 79.00 3rd Qu.: 80.00
## Max. :100.00 Max. :100.00
In the explanatory variable gender “Female” is going to be the
reference group, since it has higher frequency in our sample than
“male”.
Average score on reading comprehension is 68.71 and average score on
writing comprehension is 67.95. Max score on both exams was 100, while
lowest score on writing was lower by 9 points.
In my regression model I will try to explain how big of a effect on reading score, gender and writing score have. I expect that reading score would increase, if writing score is high as well. Therefore I assume that the relationship between reading and writing score will be positive. I also assume that females would do better on the exam than males, However, this is just a perosnal hunch and could be wrong. Therefore I predict that the model where I would include both explanatory variables (gender and writing score) would have a stronger relationship with reading score, compared to the model where I would include only writing score as an explanatory variable.
scatterplot(reading.score ~ writing.score | genderF,
ylab = "reading score in points",
xlab = "writing score in points",
smooth = FALSE,
boxplots = FALSE,
data = mydata)
Due to the two different gradients of the lines I will first test for interactions. However the difference in slopes is minimal and could be just due to the fact that we took a random sample.
fit_F <- lm(reading.score ~ writing.score,
data = mydata[mydata$genderF == "female", ])
summary(fit_F)
##
## Call:
## lm(formula = reading.score ~ writing.score, data = mydata[mydata$genderF ==
## "female", ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.7938 -2.6650 0.1906 2.7368 9.4181
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.76219 2.09270 4.187 7.21e-05 ***
## writing.score 0.87881 0.02847 30.867 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.179 on 80 degrees of freedom
## Multiple R-squared: 0.9225, Adjusted R-squared: 0.9216
## F-statistic: 952.8 on 1 and 80 DF, p-value: < 2.2e-16
fit_M<- lm(reading.score ~ writing.score,
data = mydata[mydata$genderF == "male", ])
summary(fit_M)
##
## Call:
## lm(formula = reading.score ~ writing.score, data = mydata[mydata$genderF ==
## "male", ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.1025 -3.2365 -0.5496 3.8111 10.4504
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.97279 2.62174 2.278 0.026 *
## writing.score 0.93089 0.04031 23.092 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.758 on 66 degrees of freedom
## Multiple R-squared: 0.8899, Adjusted R-squared: 0.8882
## F-statistic: 533.2 on 1 and 66 DF, p-value: < 2.2e-16
fit0 <- lm(reading.score ~ writing.score + genderF+ writing.score:genderF,
data = mydata)
summary(fit0)
##
## Call:
## lm(formula = reading.score ~ writing.score + genderF + writing.score:genderF,
## data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.1025 -3.0253 -0.0155 3.4412 10.4504
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.97279 2.45214 2.436 0.0161 *
## writing.score 0.93089 0.03770 24.689 <2e-16 ***
## genderFfemale 2.78940 3.31340 0.842 0.4012
## writing.score:genderFfemale -0.05208 0.04838 -1.076 0.2835
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.45 on 146 degrees of freedom
## Multiple R-squared: 0.9132, Adjusted R-squared: 0.9114
## F-statistic: 511.7 on 3 and 146 DF, p-value: < 2.2e-16
We can disregard the interactions at p=0,29 and conclude that there is not enough statistical evidance that there is a difference in affect between the two genders.
library(car)
scatterplot(reading.score ~ writing.score,
ylab = "points scored on reading",
xlab = "Points scored on writing",
smooth = FALSE,
data = mydata)
We can see from the scaterplot above that the assumption of linearity is met.
library(ggplot2)
ggplot(mydata, aes(y = reading.score, x = writing.score)) +
geom_point(colour = "darkred") +
geom_smooth(method = 'lm', formula = y ~ x, se = FALSE, color = "darkorange") +
ylab("reading score in points") +
xlab("writing score in points") +
scale_x_continuous(breaks = seq(0, 100, 10), limits = c(0, 100)) +
scale_y_continuous(breaks = seq(0, 100, 10), limits = c(0, 100)) +
theme_dark() +
geom_text(
label = mydata$ID,
nudge_x = 0.50, nudge_y = 0.3,
check_overlap = T, size = 2.5) +
geom_hline(yintercept = mean(mydata$reading.score), linetype = "dashed", colour = "blue") +
geom_vline(xintercept = mean(mydata$writing.score), linetype = "dashed", colour = "blue")
## Warning: Removed 3 rows containing missing values (`geom_text()`).
I have used this funnction to display the three deviations, so unexplained, explained and total and because we did the same at the seminars.
fit <- lm(reading.score ~ writing.score,
data = mydata)
H0: Homoscedasticity is met.
H1: Homoscedasticity is violated.
mydata$StdFitted <- scale(fit$fitted.values)
library(car)
scatterplot(y = mydata$StdResid, x = mydata$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
From the scatterplot above I could assume homoskedasticity, but i will rather also check it with Breuch pagan test.
#install.packages("olsrr")
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## -----------------------------------------
## Response : reading.score
## Variables: fitted values of reading.score
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 0.4421539
## Prob > Chi2 = 0.5060847
Based on the p value (p=0,51) we cannot reject the null hypothesis.therefore i will assume that homoskedasticity is not violated.
HO: Errors are normally distributed.
H1: Errors are not normally distributed.
mydata$StdResid <- round(rstandard(fit), 3)
mydata$CooksD <- round(cooks.distance(fit), 3)
hist(mydata$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
From the look at the histogram I would say that this distribution looks somewhat normal. But just in case i will still perform wilk shapiro test. Even if thisn test will show results that this is not normal distribution, i will disregard that and treat it as such since we said, we can assume normall distribution if we have a asample larger than 30, at university.
shapiro.test(mydata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.99583, p-value = 0.9493
Based on the p value (p=0,95), we cannot reject the H0 and can assume normal distribution.
hist(mydata$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histrogram of Cooks distances")
Based on this histogram i can tell that i need to drop some units (those that are not attached to others). Therefore I will remove all units with Cooks distance greater than 0,05.
head(mydata[order(- mydata$StdResid),], 10)
## gender reading.score writing.score genderF ID StdFitted StdResid CooksD
## 304 male 76 64 male 96 -0.2475293 2.440 0.021
## 187 male 76 65 male 10 -0.1849165 2.239 0.017
## 729 female 92 84 female 24 1.0047269 2.024 0.028
## 874 male 90 82 male 14 0.8795013 1.974 0.023
## 716 female 94 87 female 130 1.1925653 1.872 0.029
## 871 male 58 47 male 90 -1.3119471 1.818 0.031
## 129 male 82 74 male 3 0.3785988 1.778 0.012
## 382 male 100 95 male 30 1.6934678 1.620 0.035
## 855 male 64 55 male 105 -0.8110446 1.551 0.013
## 802 male 80 73 male 28 0.3159860 1.529 0.009
head(mydata[order(mydata$StdResid),], 10)
## gender reading.score writing.score genderF ID StdFitted StdResid CooksD
## 487 male 47 56 male 104 -0.7484318 -2.492 0.033
## 836 female 64 74 female 1 0.3785988 -2.283 0.020
## 811 male 31 36 male 122 -2.0006880 -2.093 0.076
## 889 female 65 74 female 46 0.3785988 -2.057 0.016
## 110 female 64 72 female 88 0.2533732 -1.879 0.013
## 43 female 58 65 female 141 -0.1849165 -1.821 0.012
## 684 female 40 44 female 51 -1.4997855 -1.660 0.031
## 422 female 58 64 female 39 -0.2475293 -1.619 0.009
## 214 male 51 56 male 135 -0.7484318 -1.588 0.013
## 299 male 46 50 male 7 -1.1241087 -1.510 0.018
As we could already see from the histogram above and also now with std residuals displayed in a chart we dont have to remove any units here since all of them are withing +- 3.
head(mydata[order(-mydata$CooksD),], 10)
## gender reading.score writing.score genderF ID StdFitted StdResid CooksD
## 811 male 31 36 male 122 -2.0006880 -2.093 0.076
## 339 female 38 27 female 138 -2.5642033 1.358 0.049
## 382 male 100 95 male 30 1.6934678 1.620 0.035
## 111 female 89 98 female 40 1.8813063 -1.499 0.035
## 487 male 47 56 male 104 -0.7484318 -2.492 0.033
## 684 female 40 44 female 51 -1.4997855 -1.660 0.031
## 871 male 58 47 male 90 -1.3119471 1.818 0.031
## 716 female 94 87 female 130 1.1925653 1.872 0.029
## 729 female 92 84 female 24 1.0047269 2.024 0.028
## 874 male 90 82 male 14 0.8795013 1.974 0.023
i only have to remove unit ID 122 due to the gaps in Cooks distance (high impact unit).
#install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata <- mydata %>%
filter(!ID %in% c(122))
tail(mydata)
## gender reading.score writing.score genderF ID StdFitted StdResid CooksD
## 144 female 63 59 female 145 -0.5605934 0.517 0.001
## 145 female 68 73 female 146 0.3159860 -1.178 0.005
## 146 male 48 46 male 147 -1.3745599 -0.248 0.001
## 147 female 86 83 female 148 0.9421141 0.868 0.005
## 148 male 60 57 male 149 -0.6858190 0.243 0.000
## 149 female 62 56 female 150 -0.7484318 0.897 0.004
We can see that the unit was removed since we now have only 149 units in our sample.
H0: Ro^2 is equal to 0.
H1: Ro^2 > 0.
summary(fit)
##
## Call:
## lm(formula = reading.score ~ writing.score, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.0302 -2.9239 -0.1583 3.4091 10.8199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.98088 1.59294 5.01 1.53e-06 ***
## writing.score 0.89374 0.02282 39.16 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.45 on 148 degrees of freedom
## Multiple R-squared: 0.912, Adjusted R-squared: 0.9114
## F-statistic: 1533 on 1 and 148 DF, p-value: < 2.2e-16
sqrt(summary(fit)$r.squared)
## [1] 0.9549744
Based on the random sample and the p value (p<o,oo1), i can reject the HO and conclude that writing score has an effect on the reading score.
Explanation of the Multiple R-squared: 91,2% of variability of points scored on the reading exam is explained by variability of the points scored on the writing exam.
Explanation of estimated coefficient: for each additional point scored on the writing exam, reading exam points increas on average by 0,893 points.
Explanation of multiple correlation coefficient: The relationship between reading score and writing score is positive and very strong
fit1 <- lm(reading.score ~ writing.score + genderF,
data = mydata)
summary(fit1)
##
## Call:
## lm(formula = reading.score ~ writing.score + genderF, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.5364 -2.9319 -0.1174 3.3038 10.3296
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.5983 1.6000 5.374 2.98e-07 ***
## writing.score 0.8918 0.0236 37.781 < 2e-16 ***
## genderFfemale -0.7642 0.7475 -1.022 0.308
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.397 on 146 degrees of freedom
## Multiple R-squared: 0.9114, Adjusted R-squared: 0.9102
## F-statistic: 750.7 on 2 and 146 DF, p-value: < 2.2e-16
Gender is statistically insignificant when checking its affect on the reading exam score. Therefore taking the first, simpler model is adviced.