The motivation is to understand how parental background, test preparation and similar factors influence student performance. The dataset comprises 1,000 rows and 8 columns, which we use to determine which features play a vital role in student performance. We also examine some common myths, for example that students who are good at math are bad at writing.
The aim is to determine the most significant factors affecting student scores and to explore whether some of these urban myths have statistical support.
Some of the urban myths we explore are:
1, Does one particular gender excel another?
2, Does practice help to excel scores ?
3, Does one particular race outperform the others?
4, Are students who are good at math bad at writing?
The dataset consists of the marks secured by students in various subjects and is accessible from Kaggle (Student Performance in Exams). There are 1,000 observations and 8 columns.
A further column, ‘percent’, is created as the average score across all three subjects: percent = (math_score + reading_score + writing_score)/3
library(readr)
library(dplyr)   # %>% and mutate()
library(ggplot2) # ggplot() for the boxplots
library(car)     # qqPlot() and leveneTest()
studentPerformance <- read_csv("students-performance/StudentsPerformance.csv")
studentPerformance <- na.omit(studentPerformance) # remove rows with NA, if there are any
studentPerformance <- studentPerformance %>% mutate(percent = (math_score + reading_score + writing_score) / 3) # percent = average of the three scores
head(studentPerformance)
## # A tibble: 6 x 9
## gender `race/ethnicity` `parental level~ lunch test_preparatio~
## <chr> <chr> <chr> <chr> <chr>
## 1 female group B bachelor's degr~ stan~ none
## 2 female group C some college stan~ completed
## 3 female group B master's degree stan~ none
## 4 male group A associate's deg~ free~ none
## 5 male group C some college stan~ none
## 6 female group B associate's deg~ stan~ none
## # ... with 4 more variables: math_score <dbl>, reading_score <dbl>,
## # writing_score <dbl>, percent <dbl>
Data visualizations supporting each of the four questions listed above are shown below.
# 1, Does one particular gender excel another? (compared on percent, the average of all three scores)
a<-studentPerformance %>%
group_by(gender) %>%
summarise(avg = mean(percent))
tab1<-xtabs(avg~gender,a)
barplot(tab1, main="Avg. score",
xlab="Gender", ylab="Mean percent", col='blue')
# 2, Does practice help to excel scores ?
ggplot(data=studentPerformance, mapping=aes(x=`test_preparation_course`, y=math_score, col=`test_preparation_course` ))+
theme_bw() +
geom_boxplot()+
scale_y_continuous(limits=c(0,110),breaks = seq(0,110,10))+
labs(title="The Urban Myth #2", subtitle="Does practice help to excel scores ?", x="Test preparation course status", y="Math score")
# 3, Does one particular race outperform the others?
ggplot(data=studentPerformance, mapping=aes(x=`race/ethnicity`, y=percent, col=`race/ethnicity` ))+
theme_bw() +
geom_boxplot()+
scale_y_continuous(limits=c(0,110),breaks = seq(0,110,10))+
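labs(title="The Urban Myth #3", subtitle="Does one particular race outperform the others?", x="Race Group", y="Avg score")
# 4, Are students who are good at math bad at writing?
# diff compares math with writing (the original subtracted reading_score, which does not match the question)
studentPerformance <- studentPerformance %>% mutate(diff = math_score - writing_score)
abb <- c(
sum(studentPerformance$diff < 0), # students who scored higher in writing than in math
sum(studentPerformance$diff > 0)) # students who scored higher in math than in writing
barplot(abb, names.arg=c("Writing > Math", "Math > Writing"), main="Are students who are good at math bad at writing?",
xlab="Students who scored higher in writing than in math, and vice versa", col='red', ylab="Count")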
Question 1: To answer this question, the average of the three scores (math_score, reading_score, writing_score) is taken and a bar plot is used to visualise the result. From the plot, females tend to have a higher average score than males.
Question 2: Boxplots are plotted for this question. They suggest that students who completed the test preparation course score higher, with the bulk of their scores lying roughly between 60 and 80, than students who did not take any test preparation course.
Question 3: A boxplot visualization answers this question. Group E scores higher than all other race/ethnicity groups, and Group A has the lowest scores, so in this sample one group does outperform the others. Similarly, Group D outperforms Group C.
Question 4: A bar plot is used to identify the trend. Fewer students scored higher in math than in writing than the other way around.
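The bar plot counts which subject each student scored higher in, but the myth is really about association: whether high math scores go together with low writing scores. One way to look at that directly, which is not part of the original analysis, is a correlation test. The sketch below uses simulated scores (the `toy` data frame is hypothetical) so that it runs on its own; a clearly negative correlation would support the myth, while a positive one would contradict it.

```r
# Simulated scores for illustration only; swap in studentPerformance for the real check.
set.seed(1)
ability <- rnorm(200, mean = 66, sd = 12)           # shared component (assumption)
toy <- data.frame(
  math_score    = ability + rnorm(200, sd = 6),
  writing_score = ability + rnorm(200, sd = 6)
)

# Pearson correlation with a two-sided test of H0: correlation = 0
res <- cor.test(toy$math_score, toy$writing_score, method = "pearson")
res$estimate  # sample correlation coefficient
res$p.value   # p-value of the test
```

On the real data the call would be `cor.test(studentPerformance$math_score, studentPerformance$writing_score)`.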
The descriptive statistics for the influence of important features such as race, gender, parental education background and test preparation course status are shown below.
# summary of the average score (percent) by test preparation course status
studentPerformance %>% group_by(test_preparation_course) %>% summarise(Min = min(percent,na.rm = TRUE),
Q1 = quantile(percent ,probs= .25,na.rm = TRUE),
Median = median(percent, na.rm = TRUE),
Q3 = quantile(percent,probs = .75,na.rm = TRUE),
Max = max(percent,na.rm = TRUE),
Mean = mean(percent, na.rm = TRUE),
SD = sd(percent, na.rm = TRUE),
n = n(),
Missing = sum(is.na(percent))) -> table1
knitr::kable(table1)
test_preparation_course | Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
---|---|---|---|---|---|---|---|---|---|
completed | 34.33333 | 65.00000 | 73.50000 | 82.16667 | 100 | 72.66946 | 13.03696 | 358 | 0 |
none | 9.00000 | 55.41667 | 65.33333 | 75.00000 | 100 | 65.03894 | 14.18671 | 642 | 0 |
# summary on gender determining the average score
studentPerformance %>% group_by(gender) %>% summarise(Min = min(percent,na.rm = TRUE),
Q1 = quantile(percent ,probs= .25,na.rm = TRUE),
Median = median(percent, na.rm = TRUE),
Q3 = quantile(percent,probs = .75,na.rm = TRUE),
Max = max(percent,na.rm = TRUE),
Mean = mean(percent, na.rm = TRUE),
SD = sd(percent, na.rm = TRUE),
n = n(),
Missing = sum(is.na(percent))) -> table2
knitr::kable(table2)
gender | Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
---|---|---|---|---|---|---|---|---|---|
female | 9 | 60.66667 | 70.33333 | 78.66667 | 100 | 69.56950 | 14.54181 | 518 | 0 |
male | 23 | 56.00000 | 66.33333 | 76.25000 | 100 | 65.83748 | 13.69884 | 482 | 0 |
# summary of the average score (percent) by parental level of education
studentPerformance %>% group_by(`parental level of education`) %>% summarise(Min = min(percent,na.rm = TRUE),
Q1 = quantile(percent ,probs= .25,na.rm = TRUE),
Median = median(percent, na.rm = TRUE),
Q3 = quantile(percent,probs = .75,na.rm = TRUE),
Max = max(percent,na.rm = TRUE),
Mean = mean(percent, na.rm = TRUE),
SD = sd(percent, na.rm = TRUE),
n = n(),
Missing = sum(is.na(percent))) -> table3
knitr::kable(table3)
parental level of education | Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
---|---|---|---|---|---|---|---|---|---|
associate’s degree | 31.66667 | 58.66667 | 69.66667 | 79.00000 | 100.00000 | 69.56907 | 13.67091 | 222 | 0 |
bachelor’s degree | 39.00000 | 64.08333 | 71.16667 | 80.66667 | 100.00000 | 71.92373 | 13.94661 | 118 | 0 |
high school | 18.33333 | 53.91667 | 65.00000 | 72.66667 | 95.66667 | 63.09694 | 13.51058 | 196 | 0 |
master’s degree | 44.66667 | 63.16667 | 73.33333 | 85.50000 | 97.66667 | 73.59887 | 13.60102 | 59 | 0 |
some college | 23.33333 | 60.00000 | 68.66667 | 78.00000 | 99.00000 | 68.47640 | 13.71097 | 226 | 0 |
some high school | 9.00000 | 55.66667 | 66.66667 | 76.50000 | 99.00000 | 65.10801 | 14.98408 | 179 | 0 |
# summary of the average score (percent) by race/ethnicity
studentPerformance %>% group_by(`race/ethnicity`) %>% summarise(Min = min(percent,na.rm = TRUE),
Q1 = quantile(percent ,probs= .25,na.rm = TRUE),
Median = median(percent, na.rm = TRUE),
Q3 = quantile(percent,probs = .75,na.rm = TRUE),
Max = max(percent,na.rm = TRUE),
Mean = mean(percent, na.rm = TRUE),
SD = sd(percent, na.rm = TRUE),
n = n(),
Missing = sum(is.na(percent))) -> table4
knitr::kable(table4)
race/ethnicity | Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
---|---|---|---|---|---|---|---|---|---|
group A | 23.33333 | 52.00000 | 61.33333 | 73.00000 | 96.33333 | 62.99251 | 14.44460 | 89 | 0 |
group B | 18.33333 | 56.66667 | 65.00000 | 76.83333 | 96.66667 | 65.46842 | 14.73213 | 190 | 0 |
group C | 9.00000 | 57.66667 | 68.33333 | 77.00000 | 98.66667 | 67.13166 | 13.87221 | 319 | 0 |
group D | 31.00000 | 60.33333 | 70.00000 | 78.58333 | 99.00000 | 69.17939 | 13.25278 | 262 | 0 |
group E | 26.00000 | 64.66667 | 73.50000 | 82.41667 | 100.00000 | 72.75238 | 14.56502 | 140 | 0 |
Summary statistics are computed for four attributes of the data frame (test_preparation_course, gender, parental level of education, race/ethnicity) against percent, which is the mean of math_score, reading_score and writing_score.
The summary statistics show that the mean score of students who completed a test_preparation_course is 72.7, compared with 65.0 for those who did not. Grouped by gender, the mean is 69.6 for females and 65.8 for males. For parental level of education, the highest mean (73.6) is for students whose parents hold a master’s degree and the lowest (63.1) for students whose parents have a high school education. For race/ethnicity, the mean is highest for Group E (72.8) and lowest for Group A (63.0).
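The four summary tables repeat the same `summarise()` call with a different grouping column each time. As a sketch of how that repetition could be factored out, the helper below wraps the call in a function; `summarise_percent` and the `toy` tibble are hypothetical names introduced for illustration, and the code assumes dplyr 1.0 or later.

```r
library(dplyr)

# Hypothetical helper, not part of the original analysis:
# summary statistics of `percent` for each level of one grouping column.
summarise_percent <- function(data, group_col) {
  data %>%
    group_by({{ group_col }}) %>%
    summarise(
      Min     = min(percent, na.rm = TRUE),
      Q1      = quantile(percent, probs = 0.25, na.rm = TRUE, names = FALSE),
      Median  = median(percent, na.rm = TRUE),
      Q3      = quantile(percent, probs = 0.75, na.rm = TRUE, names = FALSE),
      Max     = max(percent, na.rm = TRUE),
      Mean    = mean(percent, na.rm = TRUE),
      SD      = sd(percent, na.rm = TRUE),
      n       = n(),
      Missing = sum(is.na(percent)),
      .groups = "drop"
    )
}

# Toy data standing in for studentPerformance
toy <- tibble(
  gender  = c("female", "female", "male", "male"),
  percent = c(70, 80, 60, 66)
)
toy_summary <- summarise_percent(toy, gender)
toy_summary
```

With the real data, each table would become a one-liner such as `summarise_percent(studentPerformance, gender)`, with backticks around column names that contain spaces or slashes.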
For the first two questions, the descriptive statistics suggest:
1, Does one particular gender excel another?
The average female score is higher than the average male score.
2, Does practice help to excel scores ?
Students who took the test preparation course scored higher on average than students who did not.
However, because of sampling error, a hypothesis test is needed to provide statistical evidence for these observations. Since the groups in each comparison are independent, we perform a two-sided independent-samples t-test.
Step 1: Check normality of the groups for question 1
# For gender - 'male'
male_fil <- studentPerformance %>% filter(gender == "male")
male_fil$percent %>% qqPlot(dist="norm")
## [1] 296 170
# For gender - 'female'
female_fil <- studentPerformance %>% filter(gender == "female")
female_fil$percent %>% qqPlot(dist="norm")
## [1] 31 506
Check normality of the groups for question 2
#For test_preparation_course - 'Completed'
comp_fil <- studentPerformance %>% filter(test_preparation_course == "completed")
comp_fil$percent %>% qqPlot(dist="norm")
## [1] 299 238
#For test_preparation_course - 'None'
None_fil <- studentPerformance %>% filter(test_preparation_course == "none")
None_fil$percent %>% qqPlot(dist="norm")
## [1] 43 632
In the Q-Q plots, most points in every group fall within the confidence bands, so we treat the data as approximately normally distributed. A few points fall outside the bands, but each group has well over 30 observations, so by the central limit theorem the t-test is robust against these minor deviations from normality.
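A Q-Q plot is a visual check. If a numeric complement is wanted, one option (an addition, not something the original analysis ran) is the Shapiro-Wilk test from base R. The sketch below uses simulated data in place of a real group; note that with several hundred observations the test can flag deviations too small to matter for a t-test, so it is best read alongside the plot.

```r
# Simulated stand-in for one group's percent scores (assumption, not real data)
set.seed(42)
toy_percent <- rnorm(300, mean = 68, sd = 14)

# Shapiro-Wilk test: H0 is that the sample comes from a normal distribution
sw <- shapiro.test(toy_percent)
sw$statistic  # W, close to 1 for normal-looking data
sw$p.value    # a small p-value suggests departure from normality
```

For the real groups the argument would be, for example, `male_fil$percent`.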
Step 2: Levene test
#1
leveneTest(percent ~ gender, data = studentPerformance)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0.1345 0.7139
## 998
#2
leveneTest(percent ~ test_preparation_course, data = studentPerformance)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 2.8851 0.08971 .
## 998
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Levene's test:
Question 1: The p-value (0.7139) is greater than 0.05, so we can assume that the two gender groups have equal variance.
Question 2: The p-value (0.08971) is greater than 0.05, so we can assume that the two test preparation groups have equal variance.
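To make the logic of this step concrete: Levene's test with `center = median` is a one-way ANOVA on the absolute deviations of each observation from its group median. The sketch below re-implements that in base R on simulated data and then applies the 0.05 decision rule. `levene_median` and `toy` are hypothetical names, and this is an illustration of the idea rather than a replacement for `car::leveneTest`.

```r
# Hypothetical base-R version of Levene's test (median-centred)
levene_median <- function(y, g) {
  g <- factor(g)
  z <- abs(y - ave(y, g, FUN = median))  # absolute deviation from group median
  anova(lm(z ~ g))                       # one-way ANOVA on the deviations
}

# Simulated two-group data (assumption, not the real dataset)
set.seed(7)
toy <- data.frame(
  group = rep(c("a", "b"), each = 100),
  score = c(rnorm(100, 68, 14), rnorm(100, 66, 14))
)

lev <- levene_median(toy$score, toy$group)
p_levene <- lev[["Pr(>F)"]][1]

# Decision rule: pooled-variance t-test if variances look equal, else Welch
use_equal_var <- p_levene > 0.05
t.test(score ~ group, data = toy, var.equal = use_equal_var)
```

When the p-value falls below 0.05, `var.equal = FALSE` makes `t.test` fall back to Welch's test, which does not assume equal variances.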
Step 3: Two-sided independent t-test for both question 1 and question 2
#1
t.test(
percent ~ gender,
data = studentPerformance,
paired = FALSE,
var.equal = TRUE,
alternative = "two.sided",
conf.level = 0.95 # t.test() has no conf.interval argument; conf.level sets the CI width
)
##
## Two Sample t-test
##
## data: percent by gender
## t = 4.1699, df = 998, p-value = 3.312e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.975745 5.488286
## sample estimates:
## mean in group female mean in group male
## 69.56950 65.83748
#2
t.test(
percent ~ test_preparation_course,
data = studentPerformance,
paired = FALSE,
var.equal = TRUE,
alternative = "two.sided",
conf.level = 0.95 # t.test() has no conf.interval argument; conf.level sets the CI width
)
##
## Two Sample t-test
##
## data: percent by test_preparation_course
## t = 8.3909, df = 998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 5.846012 9.415027
## sample estimates:
## mean in group completed mean in group none
## 72.66946 65.03894
Question 1: A two-sided independent t-test was chosen for hypothesis testing, with H0: μ1 - μ2 = 0 and HA: μ1 - μ2 ≠ 0, where μ1 and μ2 are the mean percent scores of female and male students. The significance level is 5% (0.05).
Levene’s test was performed to check homogeneity of variance between the two groups. The p-value is 0.7139, which is greater than the 0.05 significance level, so we assume the population variances are equal and unknown. We also assume the population data are normally distributed. The null hypothesis (H0) is that there is no difference between the mean percent scores of female and male students; the alternative hypothesis (HA) is that the means differ.
Interpretation:
A two-sided t-test was used to test for a significant difference in mean percent score between female and male students. The p-value is 3.312e-05, i.e. p < 0.001, which is below the significance level (α = 0.05), and the 95% CI of the mean difference [1.98, 5.49] does not contain 0. We therefore reject the null hypothesis (H0): the result is statistically significant, and because the interval lies entirely above 0, the mean female percent score is higher than the mean male percent score.
Thus we have statistical evidence that the mean female percent score is higher than the mean male percent score.
Question 2: A two-sided independent t-test was chosen for hypothesis testing, with H0: μ1 - μ2 = 0 and HA: μ1 - μ2 ≠ 0, where μ1 and μ2 are the mean percent scores of students who completed the test preparation course and students who did not. The significance level is 5% (0.05).
Levene’s test was performed to check homogeneity of variance between the two groups. The p-value is 0.08971, which is greater than the 0.05 significance level, so we assume the population variances are equal and unknown. We also assume the population data are normally distributed. The null hypothesis (H0) is that there is no difference in mean percent between students who took the course and students who did not; the alternative hypothesis (HA) is that the means differ.
Interpretation:
A two-sided independent t-test was used to test for a significant difference in mean percent between students who completed the test preparation course and students who did not. The p-value is below 2.2e-16, i.e. p < 0.001, which is below the significance level (α = 0.05), and the 95% CI of the mean difference [5.85, 9.42] does not contain 0. We therefore reject the null hypothesis (H0): the result is statistically significant, and because the interval lies entirely above 0, the mean percent of students who took the course is higher than that of students who did not.
Thus we have statistical evidence that the mean percent of students who took the test preparation course is higher than the mean percent of students who did not.
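As a worked check of the arithmetic behind both tests, the pooled-variance t statistic can be recomputed from the means, standard deviations and group sizes in the summary tables. The sketch below does that and also reports Cohen's d as an effect size; `pooled_t` is a hypothetical helper, and the effect size is an addition that the original analysis does not report.

```r
# Hypothetical helper: pooled-variance two-sample t-test from summary statistics
pooled_t <- function(m1, s1, n1, m2, s2, n2) {
  df <- n1 + n2 - 2
  sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / df)  # pooled SD
  se <- sp * sqrt(1 / n1 + 1 / n2)                      # SE of the mean difference
  t_stat <- (m1 - m2) / se
  list(
    t        = t_stat,
    df       = df,
    p        = 2 * pt(-abs(t_stat), df),  # two-sided p-value
    cohens_d = (m1 - m2) / sp             # standardised mean difference
  )
}

# Values copied from the gender and test_preparation_course summary tables
gender_res <- pooled_t(69.56950, 14.54181, 518, 65.83748, 13.69884, 482)
prep_res   <- pooled_t(72.66946, 13.03696, 358, 65.03894, 14.18671, 642)

gender_res$t  # should agree with t = 4.1699 from t.test()
prep_res$t    # should agree with t = 8.3909 from t.test()
```

The effect sizes come out at roughly 0.26 for gender and 0.55 for test preparation, which suggests the test preparation difference is the larger of the two in practical terms even though both are statistically significant.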
For the problem statement above, the data visualizations and the t-tests performed to account for sampling error lead to the following conclusions:
1, Does one particular gender excel another?
Yes, the mean female percent score is higher than the mean male percent score, and the t-test provides statistical evidence supporting the observation from the descriptive statistics.
2, Does practice help to excel scores ?
Yes, the mean percent of students who took the test preparation course is higher than the mean percent of students who did not, and the t-test provides statistical evidence supporting the observation from the descriptive statistics.
3, Does one particular race outperform the others?
The data visualization shows that Group E outperforms the other groups; the decreasing order of performance is (E, D, C, B, A).
4, Are students who are good at math bad at writing?
The count of students who did better in math than in writing is less than the count of students who did better in writing, so we conclude there is not enough evidence that students who are good at math are bad at writing.
Limitations and future recommendations: there may be availability bias, which can be handled by gathering more data in future.
Strengths: the t-tests account for sampling error.
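The conclusion for question 3 rests on the boxplot and group means alone. With five groups, a natural follow-up that the original analysis does not perform is a one-way ANOVA followed by Tukey's HSD for pairwise comparisons. The sketch below shows the shape of that analysis on simulated data with three groups; the `toy` data frame and its group means are assumptions chosen only so the example runs.

```r
# Simulated groups for illustration only (not the real dataset)
set.seed(3)
toy <- data.frame(
  group   = factor(rep(c("group A", "group B", "group C"), each = 60)),
  percent = c(rnorm(60, 63, 14), rnorm(60, 67, 14), rnorm(60, 73, 14))
)

# One-way ANOVA: H0 is that all group means are equal
fit <- aov(percent ~ group, data = toy)
summary(fit)

# Tukey's HSD: pairwise differences with family-wise adjusted p-values
tukey <- TukeyHSD(fit)
tukey$group
```

On the real data the formula would need backticks around the column name, as in ``aov(percent ~ `race/ethnicity`, data = studentPerformance)``, and the same normality and equal-variance checks used for the t-tests would apply.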