Introduction

The motivation for this analysis is to understand how parental background, test preparation and similar factors influence student performance. The dataset comprises 1,000 rows and 8 columns, which we use to determine which features play a vital role in student performance. We also examine some common myths, for example that students who are good at math are bad at writing.

Problem Statement

To determine the most significant factors affecting students' scores, and to explore whether some urban myths are statistically supported.

Some of the urban myths we explore are:

1. Does one particular gender excel another?
2. Does practice help to improve scores?
3. Does one particular race/ethnicity group outperform the others?
4. Are students who are good at math bad at writing?

Data

The dataset consists of the marks secured by students in various subjects and is available from Kaggle (Student Performance in Exams). There are 1,000 observations and 8 columns: gender, race/ethnicity, parental level of education, lunch, test_preparation_course, math_score, reading_score and writing_score.

Another column, ‘percent’, has been created with the average score across all three subjects: percent = (math_score + reading_score + writing_score) / 3

library(readr)
library(dplyr)    # %>%, mutate(), group_by(), summarise()
library(ggplot2)  # ggplot()
library(car)      # qqPlot(), leveneTest()
studentPerformance <- read_csv("students-performance/StudentsPerformance.csv")

studentPerformance <- na.omit(studentPerformance)  # remove rows with NA if there are any

# create column percent with the average of the three subject scores
studentPerformance <- studentPerformance %>%
  mutate(percent = (math_score + reading_score + writing_score) / 3)


head(studentPerformance)
## # A tibble: 6 x 9
##   gender `race/ethnicity` `parental level~ lunch test_preparatio~
##   <chr>  <chr>            <chr>            <chr> <chr>           
## 1 female group B          bachelor's degr~ stan~ none            
## 2 female group C          some college     stan~ completed       
## 3 female group B          master's degree  stan~ none            
## 4 male   group A          associate's deg~ free~ none            
## 5 male   group C          some college     stan~ none            
## 6 female group B          associate's deg~ stan~ none            
## # ... with 4 more variables: math_score <dbl>, reading_score <dbl>,
## #   writing_score <dbl>, percent <dbl>

Descriptive Statistics and Visualisation

Data visualizations supporting each of the four questions listed above are shown below.

# 1, Does one particular gender excel another?

# mean percent (average of the three subjects) per gender
a <- studentPerformance %>%
  group_by(gender) %>%
  summarise(avg = mean(percent))

tab1 <- xtabs(avg ~ gender, a)

barplot(tab1, main = "Avg. score",
        xlab = "Gender", ylab = "Mean percent", col = "blue")

# 2, Does practice help to improve scores?

ggplot(data = studentPerformance, mapping = aes(x = test_preparation_course, y = math_score, col = test_preparation_course)) +
  theme_bw() +
  geom_boxplot() +
  scale_y_continuous(limits = c(0, 110), breaks = seq(0, 110, 10)) +
  labs(title = "The Urban Myth #2", subtitle = "Does practice help to improve scores?", x = "Test preparation course status", y = "Math score")

# 3, Does one particular race/ethnicity group outperform the others?


ggplot(data = studentPerformance, mapping = aes(x = `race/ethnicity`, y = percent, col = `race/ethnicity`)) +
  theme_bw() +
  geom_boxplot() +
  scale_y_continuous(limits = c(0, 110), breaks = seq(0, 110, 10)) +
  labs(title = "The Urban Myth #3", subtitle = "Does one particular race/ethnicity group outperform the others?", x = "Race/ethnicity group", y = "Average score (percent)")

# 4, Are students who are good at math bad at writing?


# positive diff: math score is higher; negative diff: writing score is higher
studentPerformance <- studentPerformance %>% mutate(diff = math_score - writing_score)

abb <- c(
  sum(studentPerformance$diff < 0),
  sum(studentPerformance$diff > 0))

barplot(abb, names.arg = c("writing > math", "math > writing"),
        main = "Are students who are good at math bad at writing?",
        xlab = "Count of students who scored higher in writing than in math, and vice versa", col = "red", ylab = "count")

1. Does one particular gender excel another?

To answer this question, the average of the three scores (math_score, reading_score and writing_score) is taken and visualised in a bar plot. From the plot, female students tend to have a higher average score than male students.

2. Does practice help to improve scores?

To answer this question, boxplots are plotted. They indicate that students who completed the test preparation course score higher than students who did not take it, with most of their scores falling roughly between 60 and 80.

3. Does one particular race/ethnicity group outperform the others?

A boxplot visualization answers this question. Group E scores well compared to all other race/ethnicity groups, and Group A students have the lowest scores, so the plot suggests that performance differs between groups. Similarly, Group D scores higher than Group C.

4. Are students who are good at math bad at writing?

A bar plot is used to identify the trend. Fewer students scored higher in math than in writing than the other way round.
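The counting behind question 4 can be sketched on a small example. The five score pairs below are invented for illustration and are not taken from the dataset; the `tied` count is an addition that the analysis above does not report, since students with equal scores fall into neither bar.

```r
# Toy data: hypothetical scores, NOT rows from the real dataset.
toy <- data.frame(
  math_score    = c(72, 69, 90, 47, 76),
  writing_score = c(74, 88, 93, 44, 75)
)

# Positive difference: math is higher; negative: writing is higher.
score_diff <- toy$math_score - toy$writing_score

counts <- c(
  writing_higher = sum(score_diff < 0),
  math_higher    = sum(score_diff > 0),
  tied           = sum(score_diff == 0)
)
print(counts)
```

Note that comparing these two counts only shows which subject tends to receive the higher mark; it does not by itself show whether a high math score goes together with a low writing score.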

Descriptive Statistics

The descriptive statistics for the influence of important features such as race/ethnicity, gender, parental education background and test preparation course status are shown below.

# summary of the average score (percent) by test preparation course status
studentPerformance %>% group_by(test_preparation_course) %>% summarise(Min = min(percent,na.rm = TRUE),
                                                                        Q1 = quantile(percent ,probs= .25,na.rm = TRUE),
                                                                        Median = median(percent, na.rm = TRUE),
                                                                        Q3 = quantile(percent,probs = .75,na.rm = TRUE),
                                                                        Max = max(percent,na.rm = TRUE),
                                                                        Mean = mean(percent, na.rm = TRUE),
                                                                        SD = sd(percent, na.rm = TRUE),
                                                                        n = n(),
                                                                        Missing = sum(is.na(percent))) -> table1
knitr::kable(table1)

| test_preparation_course | Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|---|
| completed | 34.33333 | 65.00000 | 73.50000 | 82.16667 | 100 | 72.66946 | 13.03696 | 358 | 0 |
| none | 9.00000 | 55.41667 | 65.33333 | 75.00000 | 100 | 65.03894 | 14.18671 | 642 | 0 |

# summary of the average score (percent) by gender
studentPerformance %>% group_by(gender) %>% summarise(Min = min(percent,na.rm = TRUE),
                                                                        Q1 = quantile(percent ,probs= .25,na.rm = TRUE),
                                                                        Median = median(percent, na.rm = TRUE),
                                                                        Q3 = quantile(percent,probs = .75,na.rm = TRUE),
                                                                        Max = max(percent,na.rm = TRUE),
                                                                        Mean = mean(percent, na.rm = TRUE),
                                                                        SD = sd(percent, na.rm = TRUE),
                                                                        n = n(),
                                                                        Missing = sum(is.na(percent))) -> table2
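knitr::kable(table2)

| gender | Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|---|
| female | 9 | 60.66667 | 70.33333 | 78.66667 | 100 | 69.56950 | 14.54181 | 518 | 0 |
| male | 23 | 56.00000 | 66.33333 | 76.25000 | 100 | 65.83748 | 13.69884 | 482 | 0 |

# summary of the average score (percent) by parental level of education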

studentPerformance %>% group_by(`parental level of education`) %>% summarise(Min = min(percent,na.rm = TRUE),
                                                                        Q1 = quantile(percent ,probs= .25,na.rm = TRUE),
                                                                        Median = median(percent, na.rm = TRUE),
                                                                        Q3 = quantile(percent,probs = .75,na.rm = TRUE),
                                                                        Max = max(percent,na.rm = TRUE),
                                                                        Mean = mean(percent, na.rm = TRUE),
                                                                        SD = sd(percent, na.rm = TRUE),
                                                                        n = n(),
                                                                        Missing = sum(is.na(percent))) -> table3
knitr::kable(table3)

| parental level of education | Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|---|
| associate’s degree | 31.66667 | 58.66667 | 69.66667 | 79.00000 | 100.00000 | 69.56907 | 13.67091 | 222 | 0 |
| bachelor’s degree | 39.00000 | 64.08333 | 71.16667 | 80.66667 | 100.00000 | 71.92373 | 13.94661 | 118 | 0 |
| high school | 18.33333 | 53.91667 | 65.00000 | 72.66667 | 95.66667 | 63.09694 | 13.51058 | 196 | 0 |
| master’s degree | 44.66667 | 63.16667 | 73.33333 | 85.50000 | 97.66667 | 73.59887 | 13.60102 | 59 | 0 |
| some college | 23.33333 | 60.00000 | 68.66667 | 78.00000 | 99.00000 | 68.47640 | 13.71097 | 226 | 0 |
| some high school | 9.00000 | 55.66667 | 66.66667 | 76.50000 | 99.00000 | 65.10801 | 14.98408 | 179 | 0 |

# summary of the average score (percent) by race/ethnicity
studentPerformance %>% group_by(`race/ethnicity`) %>% summarise(Min = min(percent,na.rm = TRUE),
                                                                        Q1 = quantile(percent ,probs= .25,na.rm = TRUE),
                                                                        Median = median(percent, na.rm = TRUE),
                                                                        Q3 = quantile(percent,probs = .75,na.rm = TRUE),
                                                                        Max = max(percent,na.rm = TRUE),
                                                                        Mean = mean(percent, na.rm = TRUE),
                                                                        SD = sd(percent, na.rm = TRUE),
                                                                        n = n(),
                                                                        Missing = sum(is.na(percent))) -> table4
knitr::kable(table4)

| race/ethnicity | Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|---|
| group A | 23.33333 | 52.00000 | 61.33333 | 73.00000 | 96.33333 | 62.99251 | 14.44460 | 89 | 0 |
| group B | 18.33333 | 56.66667 | 65.00000 | 76.83333 | 96.66667 | 65.46842 | 14.73213 | 190 | 0 |
| group C | 9.00000 | 57.66667 | 68.33333 | 77.00000 | 98.66667 | 67.13166 | 13.87221 | 319 | 0 |
| group D | 31.00000 | 60.33333 | 70.00000 | 78.58333 | 99.00000 | 69.17939 | 13.25278 | 262 | 0 |
| group E | 26.00000 | 64.66667 | 73.50000 | 82.41667 | 100.00000 | 72.75238 | 14.56502 | 140 | 0 |

Summary statistics of percent, the mean of math_score, reading_score and writing_score, are computed for four attributes of the data frame (test_preparation_course, gender, parental level of education, race/ethnicity).

The summary statistics show that the mean score of students who completed the test preparation course is 72.7, compared with 65.0 for students who did not. Grouped by gender, the mean is 69.6 for females and 65.8 for males. For parental level of education, the highest mean (73.6) belongs to students whose parents hold a master’s degree and the lowest (63.1) to students whose parents have a high school education. For race/ethnicity, Group E has the highest mean (72.8) and Group A the lowest (63.0).
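The four summaries above repeat the same set of statistics for different grouping columns. A minimal base-R sketch of a reusable helper follows; `summarise_by`, the column name `grp` and the toy data frame are hypothetical and introduced only for illustration, they are not part of the original analysis.

```r
# Hypothetical helper: the same statistics as above for any grouping column.
summarise_by <- function(df, group, value) {
  parts <- split(df[[value]], df[[group]])
  rows <- lapply(names(parts), function(g) {
    x <- parts[[g]]
    data.frame(
      group   = g,
      Min     = min(x, na.rm = TRUE),
      Q1      = unname(quantile(x, 0.25, na.rm = TRUE)),
      Median  = median(x, na.rm = TRUE),
      Q3      = unname(quantile(x, 0.75, na.rm = TRUE)),
      Max     = max(x, na.rm = TRUE),
      Mean    = mean(x, na.rm = TRUE),
      SD      = sd(x, na.rm = TRUE),
      n       = length(x),
      Missing = sum(is.na(x))
    )
  })
  do.call(rbind, rows)
}

# Toy data with invented values, NOT the real scores.
toy <- data.frame(
  grp     = c("a", "a", "a", "b", "b", "b"),
  percent = c(60, 70, 80, 50, 55, 90)
)
out <- summarise_by(toy, "grp", "percent")
print(out)
```

With the real data, a call such as `summarise_by(studentPerformance, "gender", "percent")` would be expected to give the same figures as the gender table, assuming the column names shown in the output above.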

Hypothesis Testing

For the first two questions, the descriptive statistics already suggest an answer:

1. Does one particular gender excel another?

The average female score is higher than the average male score.

2. Does practice help to improve scores?

Students who took the test preparation course scored higher on average than students who did not.

However, because of sampling error, a two-sided test is needed to provide statistical evidence for these observations. Since the groups in each comparison are independent, we perform a two-sided independent-samples t-test for each question.

Step 1: Check normality of the groups for question 1

# For gender - 'male'
male_fil <- studentPerformance %>% filter(gender == "male")
male_fil$percent %>% qqPlot(dist="norm")

## [1] 296 170
# For gender - 'female'
female_fil <- studentPerformance %>% filter(gender == "female")
female_fil$percent %>% qqPlot(dist="norm")

## [1]  31 506

Check normality of the groups for question 2

#For test_preparation_course - 'Completed'
comp_fil <- studentPerformance %>% filter(test_preparation_course == "completed")
comp_fil$percent %>% qqPlot(dist="norm")

## [1] 299 238
#For test_preparation_course - 'None'
None_fil <- studentPerformance %>% filter(test_preparation_course == "none")
None_fil$percent %>% qqPlot(dist="norm")

## [1]  43 632

In the Q-Q plots, most points of every group fall within the plotted confidence bands, so it is reasonable to treat the groups as approximately normally distributed. A few points fall outside these bands, but every group has far more than 30 observations, so the t-test is robust against such minor deviations from normality.
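As a rough numeric companion to the visual check, the correlation between sample quantiles and theoretical normal quantiles can be computed with base R's `qqnorm`. This is only a sketch on simulated data: the group sizes, means and standard deviations are loosely based on the gender summary table, the samples themselves are synthetic, `qq_cor` is a hypothetical helper, and this is not what `qqPlot` from the car package computes.

```r
set.seed(1)

# Synthetic stand-ins for the two gender groups; NOT the real scores.
female_sim <- rnorm(518, mean = 69.6, sd = 14.5)
male_sim   <- rnorm(482, mean = 65.8, sd = 13.7)

# Correlation between theoretical and sample quantiles of a normal Q-Q plot.
qq_cor <- function(x) {
  q <- qqnorm(x, plot.it = FALSE)  # q$x: theoretical, q$y: sample quantiles
  cor(q$x, q$y)
}

print(c(female = qq_cor(female_sim), male = qq_cor(male_sim)))
```

Values close to 1 indicate that the points lie near a straight line in the Q-Q plot; the real `percent` columns could be passed to `qq_cor` in the same way.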

Step 2: Levene's test for homogeneity of variance

#1
leveneTest(percent ~ gender, data = studentPerformance)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   1  0.1345 0.7139
##       998
#2
leveneTest(percent ~ test_preparation_course, data = studentPerformance)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group   1  2.8851 0.08971 .
##       998                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

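Levene's test:

Question 1: The p-value (0.7139) is greater than 0.05, so we do not reject homogeneity of variance and it is safe to assume that the two gender groups have equal variances.

Question 2: The p-value (0.08971) is greater than 0.05, so we do not reject homogeneity of variance and it is safe to assume that the two test preparation groups have equal variances.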
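Levene's test with `center = median`, as in the output above, amounts to a one-way ANOVA on the absolute deviations of each observation from its group median. The sketch below reproduces that idea in base R on simulated data; `levene_median` is a hypothetical helper and the data are synthetic, so the numbers will not match the output above.

```r
# Hypothetical helper: median-centred Levene test as an ANOVA on
# absolute deviations from the group medians.
levene_median <- function(y, g) {
  g <- factor(g)
  z <- abs(y - ave(y, g, FUN = median))  # deviation from own group's median
  tab <- anova(lm(z ~ g))
  list(
    df      = tab[["Df"]],
    F       = tab[["F value"]][1],
    p.value = tab[["Pr(>F)"]][1]
  )
}

set.seed(42)
# Synthetic scores for two groups with the same spread; NOT the real data.
y <- c(rnorm(60, mean = 70, sd = 14), rnorm(40, mean = 66, sd = 14))
g <- rep(c("female", "male"), times = c(60, 40))

res <- levene_median(y, g)
print(res)
```

A p-value above the chosen significance level means the equal-variance assumption is not rejected, which is what justifies `var.equal = TRUE` in the t-tests.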

Step 3: Two-sided independent-samples t-test for questions 1 and 2

#1 


t.test(
  percent ~ gender,
  data = studentPerformance,
  paired = FALSE,
  var.equal = TRUE,
  alternative = "two.sided",
  conf.level = 0.95
)
## 
##  Two Sample t-test
## 
## data:  percent by gender
## t = 4.1699, df = 998, p-value = 3.312e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.975745 5.488286
## sample estimates:
## mean in group female   mean in group male 
##             69.56950             65.83748
#2
t.test(
  percent ~ test_preparation_course,
  data = studentPerformance,
  paired = FALSE,
  var.equal = TRUE,
  alternative = "two.sided",
  conf.level = 0.95
)
## 
##  Two Sample t-test
## 
## data:  percent by test_preparation_course
## t = 8.3909, df = 998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  5.846012 9.415027
## sample estimates:
## mean in group completed      mean in group none 
##                72.66946                65.03894
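The two test statistics above can be cross-checked by hand from the group means, standard deviations and sizes reported in the summary tables, using the pooled-variance formula behind `var.equal = TRUE`. The helper name `pooled_t` is hypothetical; the inputs are the rounded values printed in the tables, so the results agree with the `t.test` output only up to rounding.

```r
# Hypothetical helper: pooled-variance two-sample t-test from summary statistics.
pooled_t <- function(m1, s1, n1, m2, s2, n2, conf.level = 0.95) {
  df   <- n1 + n2 - 2
  sp2  <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / df  # pooled variance
  se   <- sqrt(sp2 * (1 / n1 + 1 / n2))             # SE of the mean difference
  d    <- m1 - m2
  t    <- d / se
  crit <- qt(1 - (1 - conf.level) / 2, df)
  list(
    t        = t,
    df       = df,
    p.value  = 2 * pt(-abs(t), df),                 # two-sided p-value
    conf.int = d + c(-1, 1) * crit * se
  )
}

# Question 1: female vs male (values from the gender summary table)
gender_test <- pooled_t(69.56950, 14.54181, 518, 65.83748, 13.69884, 482)

# Question 2: completed vs none (values from the test preparation table)
prep_test <- pooled_t(72.66946, 13.03696, 358, 65.03894, 14.18671, 642)

print(round(c(gender_t = gender_test$t, prep_t = prep_test$t), 4))
print(round(rbind(gender = gender_test$conf.int, prep = prep_test$conf.int), 3))
```

Both calls use 998 degrees of freedom, matching the `df = 998` reported by `t.test`.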

Hypothesis Testing Cont.

Question 1: A two-sided independent-samples t-test was chosen for hypothesis testing, with $H_0: \mu_1 - \mu_2 = 0$ and $H_A: \mu_1 - \mu_2 \neq 0$, where $\mu_1$ and $\mu_2$ are the mean percent scores of female and male students. The significance level is 5% ($\alpha = 0.05$).

Levene’s test was performed to check the homogeneity of variances of the two groups. The p-value is 0.7139, which is greater than 0.05 (the significance level), so we assume the variances to be equal and unknown. We also assume the population data to be normally distributed. The null hypothesis (Ho) is that there is no difference between the mean percent scores of female and male students; the alternative hypothesis (Ha) is that the two means differ.

Interpretation:

A two-sided t-test was used to test for a significant difference in mean percent score between female and male students. The p-value is 3.312e-05, i.e. p < 0.001, which is less than the significance level ($\alpha = 0.05$), and the 95% CI of the mean difference [1.98, 5.49] does not capture $H_0: \mu_1 - \mu_2 = 0$. We therefore reject the null hypothesis (Ho), and the result is statistically significant. Because the whole interval lies above zero, the mean female percent score is higher than the mean male percent score.

Thus we have statistical evidence that the mean female percent score is higher than the mean male percent score.

Question 2: A two-sided independent-samples t-test was chosen for hypothesis testing, with $H_0: \mu_1 - \mu_2 = 0$ and $H_A: \mu_1 - \mu_2 \neq 0$, where $\mu_1$ and $\mu_2$ are the mean percent scores of students who completed the test preparation course and of students who did not. The significance level is 5% ($\alpha = 0.05$).

Levene’s test was performed to check the homogeneity of variances of the two groups. The p-value is 0.08971, which is greater than 0.05 (the significance level), so we assume the variances to be equal and unknown. We also assume the population data to be normally distributed. The null hypothesis (Ho) is that there is no difference between the mean percent of students who took the test preparation course and the mean percent of students who did not; the alternative hypothesis (Ha) is that the two means differ.

Interpretation:

A two-sided independent-samples t-test was used to test for a significant difference in mean percent score between students who completed the test preparation course and students who did not. The p-value is below 2.2e-16, i.e. p < 0.001, which is less than the significance level ($\alpha = 0.05$), and the 95% CI of the mean difference [5.85, 9.42] does not capture $H_0: \mu_1 - \mu_2 = 0$. We therefore reject the null hypothesis (Ho), and the result is statistically significant. Because the whole interval lies above zero, the mean percent of students who took the course is higher than the mean percent of students who did not.

Thus we have statistical evidence that the mean percent of students who took the test preparation course is higher than the mean percent of students who did not.

Discussion

For the problem statement above, the data visualizations and the t-tests performed to account for sampling error lead to the following conclusions:

1. Does one particular gender excel another?

Yes. The mean percent score of female students is higher than that of male students, and the t-test provides statistical evidence supporting this observation from the descriptive statistics.

2. Does practice help to improve scores?

Yes. The mean percent of students who took the test preparation course is higher than the mean percent of students who did not, and the t-test provides statistical evidence supporting this observation from the descriptive statistics.

3. Does one particular race/ethnicity group outperform the others?

The data visualization shows that Group E outperforms the other groups; in decreasing order of performance the groups are E, D, C, B, A. This comparison rests on the visualization and descriptive statistics only, as no hypothesis test was run for it.

4. Are students who are good at math bad at writing?

The count of students who did better in math than in writing is smaller than the count of students who did better in writing than in math. We therefore conclude that there is not enough evidence that students who are good at math are bad at writing.

Limitations and future recommendations: there may be availability bias, which can be addressed by gathering more data in future.

Strengths: the hypothesis tests account for sampling error.

References