Introduction

The motivation for this analysis is to understand how factors such as parental background and test preparation influence student performance. The dataset comprises 1,000 rows and 8 columns. We use it to determine which features play a vital role in student performance, and to examine some common myths, for example that students who are good at math are bad at writing.

Problem Statement

To determine the most significant factors affecting students' scores, and to explore whether some urban myths are supported by statistical evidence.

Some of the urban myths we try to explore are:

1, Does one particular gender excel another?
2, Does practice help to excel scores?
3, Does one particular race outperform the others?
4, Are students who are good at math bad at writing?

Data

The dataset consists of the marks secured by students in various subjects and is available on Kaggle as "Student Performance in Exams". There are 1,000 observations and 8 columns:

gender
race / ethnicity
parental level of education
lunch
test preparation course
math score
reading score
writing score

A further column, ‘percent’, is created as the average score across all three subjects: percent = (math_score + reading_score + writing_score)/3

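library(readr)    # read_csv()
library(dplyr)    # %>%, mutate(), group_by(), summarise()
library(ggplot2)  # ggplot() box plots
library(car)      # qqPlot(), leveneTest()
studentPerformance <- read_csv("students-performance/StudentsPerformance.csv")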


studentPerformance <- na.omit(studentPerformance)  # remove rows with NA values, if any

studentPerformance <- studentPerformance %>% mutate(percent = (`math_score` + `reading_score`+`writing_score`)/3) # create column percent as the average of the three scores


head(studentPerformance)

## # A tibble: 6 x 9
##   gender `race/ethnicity` `parental level~ lunch test_preparatio~
##   <chr>  <chr>            <chr>            <chr> <chr>           
## 1 female group B          bachelor's degr~ stan~ none            
## 2 female group C          some college     stan~ completed       
## 3 female group B          master's degree  stan~ none            
## 4 male   group A          associate's deg~ free~ none            
## 5 male   group C          some college     stan~ none            
## 6 female group B          associate's deg~ stan~ none            
## # ... with 4 more variables: math_score <dbl>, reading_score <dbl>,
## #   writing_score <dbl>, percent <dbl>

Descriptive Statistics and Visualisation

Data visualizations supporting each of the 4 questions listed above are shown below.

#1  Does one particular gender excel another?


# mean of percent for each gender
a <- studentPerformance %>% 
  group_by(gender) %>% 
  summarise(avg = mean(percent))

tab1 <- xtabs(avg ~ gender, a)

barplot(tab1, main="Avg. score", 
        xlab="Gender", ylab="Avg. percent", col='blue')

# 2, Does practice help to excel scores ?

ggplot(data=studentPerformance, mapping=aes(x=`test_preparation_course`, y=math_score, col=`test_preparation_course` ))+
  theme_bw() +
  geom_boxplot()+
  scale_y_continuous(limits=c(0,110),breaks = seq(0,110,10))+
  labs(title="The Urban Myth #2", subtitle="Does practice help to excel scores?", x="Test preparation course status", y="Math score")

# 3, Does one particular race outperform the others?


ggplot(data=studentPerformance, mapping=aes(x=`race/ethnicity`, y=percent, col=`race/ethnicity` ))+
  theme_bw() +
  geom_boxplot()+
  scale_y_continuous(limits=c(0,110),breaks = seq(0,110,10))+
  labs(title="The Urban Myth #3", subtitle="Does one particular race outperform the others?", x="Race Group", y="Avg Score")

#4, Are students who are good at math bad at writing?


# positive diff: math score higher than writing score
studentPerformance <- studentPerformance %>% mutate(diff = `math_score` - `writing_score`)

abb<-c(
  sum(studentPerformance$diff < 0),
  sum(studentPerformance$diff > 0))

barplot( abb, main="Are students who are good at math bad at writing?", 
         names.arg=c("math < writing", "math > writing"),
         xlab="Students who scored lower in math than in writing, and vice versa", col='red', ylab="count")

1 Does one particular gender excel another?

To answer this question, the average of the three scores (math_score, reading_score and writing_score) is taken and shown in a bar plot. The plot shows that females tend to have a higher average score than males.

2 Does practice help to excel scores?

To answer this, box plots are drawn for the two groups. They suggest that students who completed the test preparation course score higher, with the bulk of their scores lying roughly between 60 and 80, than students who did not take any test preparation course.

3, Does one particular race outperform the others?

A box plot answers this question. Group E scores higher than every other race/ethnicity group, and Group A has the lowest scores, so the groups clearly differ in this sample. Similarly, Group D outperforms Group C.

4, Are students who are good at math bad at writing?

A bar plot is used to identify the trend. It shows that the number of students who scored higher in math than in writing is smaller than the number who scored higher in writing than in math.
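The bar plot compares each student's two scores, but the myth is really about whether high math scores go together with low writing scores. One way to look at that directly is the correlation between the two columns, where a negative value would support the myth. The following is a minimal sketch on a small made-up data frame (`toy` is invented for illustration and is not the Kaggle data); with the real data the same `cor()` call could be applied to the math_score and writing_score columns of studentPerformance.

```r
# Toy data, invented for illustration only (not taken from the dataset)
toy <- data.frame(
  math_score    = c(72, 69, 90, 47, 76, 71, 88, 40),
  writing_score = c(74, 88, 93, 44, 75, 78, 92, 39)
)

# Pearson correlation: values near +1 mean students who do well in math
# also tend to do well in writing; values near -1 would support the myth
r <- cor(toy$math_score, toy$writing_score)

# Counts corresponding to the two bars of the bar plot
score_diff <- toy$math_score - toy$writing_score
counts <- c(math_lower = sum(score_diff < 0), math_higher = sum(score_diff > 0))

print(r)
print(counts)
```

The counts only say which score is higher for each student, while the correlation says whether the two scores move together, so the two views answer slightly different questions.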

Descriptive Statistics

The descriptive statistics showing the influence of important features such as race, gender, parental education background and test preparation course status are shown below.

# summary on test preparation course status determining the average score
studentPerformance %>% group_by(test_preparation_course) %>% summarise(Min = min(percent,na.rm = TRUE),
                                                                        Q1 = quantile(percent ,probs= .25,na.rm = TRUE),
                                                                        Median = median(percent, na.rm = TRUE),
                                                                        Q3 = quantile(percent,probs = .75,na.rm = TRUE),
                                                                        Max = max(percent,na.rm = TRUE),
                                                                        Mean = mean(percent, na.rm = TRUE),
                                                                        SD = sd(percent, na.rm = TRUE),
                                                                        n = n(),
                                                                        Missing = sum(is.na(percent))) -> table1
knitr::kable(table1)

test_preparation_course	Min	Q1	Median	Q3	Max	Mean	SD	n	Missing
completed	34.33333	65.00000	73.50000	82.16667	100	72.66946	13.03696	358	0
none	9.00000	55.41667	65.33333	75.00000	100	65.03894	14.18671	642	0

# summary on gender determining the average score
studentPerformance %>% group_by(gender) %>% summarise(Min = min(percent,na.rm = TRUE),
                                                                        Q1 = quantile(percent ,probs= .25,na.rm = TRUE),
                                                                        Median = median(percent, na.rm = TRUE),
                                                                        Q3 = quantile(percent,probs = .75,na.rm = TRUE),
                                                                        Max = max(percent,na.rm = TRUE),
                                                                        Mean = mean(percent, na.rm = TRUE),
                                                                        SD = sd(percent, na.rm = TRUE),
                                                                        n = n(),
                                                                        Missing = sum(is.na(percent))) -> table2
knitr::kable(table2)

gender	Min	Q1	Median	Q3	Max	Mean	SD	n	Missing
female	9	60.66667	70.33333	78.66667	100	69.56950	14.54181	518	0
male	23	56.00000	66.33333	76.25000	100	65.83748	13.69884	482	0

# summary on parental level of education determining the average score

studentPerformance %>% group_by(`parental level of education`) %>% summarise(Min = min(percent,na.rm = TRUE),
                                                                        Q1 = quantile(percent ,probs= .25,na.rm = TRUE),
                                                                        Median = median(percent, na.rm = TRUE),
                                                                        Q3 = quantile(percent,probs = .75,na.rm = TRUE),
                                                                        Max = max(percent,na.rm = TRUE),
                                                                        Mean = mean(percent, na.rm = TRUE),
                                                                        SD = sd(percent, na.rm = TRUE),
                                                                        n = n(),
                                                                        Missing = sum(is.na(percent))) -> table3
knitr::kable(table3)

parental level of education	Min	Q1	Median	Q3	Max	Mean	SD	n
associate’s degree	31.66667	58.66667	69.66667	79.00000	100.00000	69.56907	13.67091	222
bachelor’s degree	39.00000	64.08333	71.16667	80.66667	100.00000	71.92373	13.94661	118
high school	18.33333	53.91667	65.00000	72.66667	95.66667	63.09694	13.51058	196
master’s degree	44.66667	63.16667	73.33333	85.50000	97.66667	73.59887	13.60102	59
some college	23.33333	60.00000	68.66667	78.00000	99.00000	68.47640	13.71097	226
some high school	9.00000	55.66667	66.66667	76.50000	99.00000	65.10801	14.98408	179

# summary on race/ethnicity determining the average score
studentPerformance %>% group_by(`race/ethnicity`) %>% summarise(Min = min(percent,na.rm = TRUE),
                                                                        Q1 = quantile(percent ,probs= .25,na.rm = TRUE),
                                                                        Median = median(percent, na.rm = TRUE),
                                                                        Q3 = quantile(percent,probs = .75,na.rm = TRUE),
                                                                        Max = max(percent,na.rm = TRUE),
                                                                        Mean = mean(percent, na.rm = TRUE),
                                                                        SD = sd(percent, na.rm = TRUE),
                                                                        n = n(),
                                                                        Missing = sum(is.na(percent))) -> table4
knitr::kable(table4)

race/ethnicity	Min	Q1	Median	Q3	Max	Mean	SD	n
group A	23.33333	52.00000	61.33333	73.00000	96.33333	62.99251	14.44460	89
group B	18.33333	56.66667	65.00000	76.83333	96.66667	65.46842	14.73213	190
group C	9.00000	57.66667	68.33333	77.00000	98.66667	67.13166	13.87221	319
group D	31.00000	60.33333	70.00000	78.58333	99.00000	69.17939	13.25278	262
group E	26.00000	64.66667	73.50000	82.41667	100.00000	72.75238	14.56502	140

Summary statistics of percent, the mean of math_score, reading_score and writing_score, are computed for four attributes of the data frame: test_preparation_course, gender, parental level of education and race/ethnicity.

The summary statistics show that the mean score of students who completed the test preparation course is 72.7, compared with a lower 65.0 for those who did not. Grouped by gender, the mean is 69.6 for females and 65.8 for males. For parental level of education, the highest mean (73.6) is for students whose parents hold a master’s degree and the lowest (63.1) for students whose parents have a high school education. By race/ethnicity, Group E has the highest mean (72.8) and Group A the lowest (63.0).
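The four summary tables repeat the same set of statistics with only the grouping column changed. As a sketch of how that repetition could be factored out, the helper below computes the same columns for any grouping column using base R only. The function name `summarise_percent` and the toy data frame are hypothetical and not part of the original analysis.

```r
# Hypothetical helper: same summary columns as the tables above,
# for any grouping column, using base R only
summarise_percent <- function(data, group_col, value_col = "percent") {
  groups <- split(data[[value_col]], data[[group_col]])
  rows <- lapply(groups, function(x) {
    data.frame(
      Min = min(x, na.rm = TRUE),
      Q1 = unname(quantile(x, probs = 0.25, na.rm = TRUE)),
      Median = median(x, na.rm = TRUE),
      Q3 = unname(quantile(x, probs = 0.75, na.rm = TRUE)),
      Max = max(x, na.rm = TRUE),
      Mean = mean(x, na.rm = TRUE),
      SD = sd(x, na.rm = TRUE),
      n = length(x),
      Missing = sum(is.na(x))
    )
  })
  # One row per group, with the group label as the first column
  data.frame(group = names(groups), do.call(rbind, rows), row.names = NULL)
}

# Toy data, invented for illustration
toy <- data.frame(
  gender = c("female", "female", "female", "male", "male", "male"),
  percent = c(70, 80, 90, 60, 65, 70)
)
toy_summary <- summarise_percent(toy, "gender")
print(toy_summary)
```

With the real data, a call such as `summarise_percent(studentPerformance, "gender")` would be expected to give the same statistics as the corresponding table.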

Hypothesis Testing

For the first two questions, the descriptive statistics suggest the following:

1, Does one particular gender excel another?

The average female score is higher than the average male score.

2, Does practice help to excel scores ?

Students who took the test preparation course scored higher on average than students who did not.

However, because of sampling error, a hypothesis test is needed to provide statistical evidence for these observations. Since the two groups in each comparison are independent, we perform a two-sided independent-samples t-test.

Step 1: check normality of the groups for question 1

#For gender - 'male'
male_fil <- studentPerformance %>% filter(gender == "male")
male_fil$percent %>% qqPlot(dist="norm")

## [1] 296 170

#For gender - 'female'
female_fil <- studentPerformance %>% filter(gender == "female")
female_fil$percent %>% qqPlot(dist="norm")

## [1]  31 506

check normality of the groups for question 2

#For test_preparation_course - 'Completed'
comp_fil <- studentPerformance %>% filter(test_preparation_course == "completed")
comp_fil$percent %>% qqPlot(dist="norm")

## [1] 299 238

#For test_preparation_course - 'None'
None_fil <- studentPerformance %>% filter(test_preparation_course == "none")
None_fil$percent %>% qqPlot(dist="norm")

## [1]  43 632

In the QQ plots, most of the points in every group fall within the confidence band, so we treat the groups as approximately normally distributed. A few points fall outside the band, but each group contains far more than 30 observations, so by the central limit theorem the t-test is robust against these minor deviations from normality.
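Reading a QQ plot is a judgment by eye. A numerical check that is sometimes used alongside it is the Shapiro-Wilk test, available in base R as `shapiro.test()`. The sketch below runs it on simulated values standing in for one group's percent scores; the data are generated here and are not the real scores. With groups of several hundred students this test can flag even small, harmless deviations, so it is best read together with the QQ plot rather than instead of it.

```r
set.seed(1)  # make the simulated example reproducible

# Simulated stand-in for one group's percent values (not the real data)
sim_percent <- rnorm(300, mean = 68, sd = 14)

# Shapiro-Wilk test: H0 is that the sample comes from a normal distribution
sw <- shapiro.test(sim_percent)
print(sw$statistic)  # W close to 1 indicates agreement with normality
print(sw$p.value)    # a small p-value would be evidence against normality
```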

Step 2: Levene test

#1
leveneTest(percent ~ gender, data = studentPerformance)

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   1  0.1345 0.7139
##       998

#2
leveneTest(percent ~ test_preparation_course, data = studentPerformance)

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group   1  2.8851 0.08971 .
##       998                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Levene Test:

Question 1: The p-value of Levene’s test (0.7139) is greater than 0.05, so we can assume that the two gender groups have equal variances.

Question 2: The p-value of Levene’s test (0.08971) is greater than 0.05, so we can assume that the two test preparation groups have equal variances.
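To make clear what the Levene test with center = median measures, the sketch below reproduces its logic by hand in base R: take each observation's absolute deviation from its own group median, then run a one-way ANOVA on those deviations. The data are simulated for illustration and the variable names are hypothetical. This is meant to show the idea behind the test, not to replace the `leveneTest()` call used in the analysis.

```r
set.seed(42)

# Simulated scores for two groups with the same spread (illustration only)
sim <- data.frame(
  group = rep(c("a", "b"), times = c(120, 80)),
  score = c(rnorm(120, mean = 70, sd = 14), rnorm(80, mean = 66, sd = 14))
)

# Absolute deviation of each score from its own group's median
group_median <- ave(sim$score, sim$group, FUN = median)
sim$abs_dev <- abs(sim$score - group_median)

# One-way ANOVA on the absolute deviations gives the Levene F statistic
levene_by_hand <- anova(lm(abs_dev ~ group, data = sim))
print(levene_by_hand)
```

A large F value (small p-value) would mean the typical distance from the median differs between groups, that is, the variances are unlikely to be equal.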

Step 3: Two-sided independent t-test for both question 1 and 2

#1 


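t.test(
  percent ~ gender,            # two independent samples (the default; no 'paired' argument needed)
  data = studentPerformance,
  var.equal = TRUE,            # pooled variance, supported by Levene's test
  alternative = "two.sided",
  conf.level = 0.95            # t.test() has no conf.interval argument
)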

## 
##  Two Sample t-test
## 
## data:  percent by gender
## t = 4.1699, df = 998, p-value = 3.312e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.975745 5.488286
## sample estimates:
## mean in group female   mean in group male 
##             69.56950             65.83748

#2
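t.test(
  percent ~ test_preparation_course,  # two independent samples (the default; no 'paired' argument needed)
  data = studentPerformance,
  var.equal = TRUE,                   # pooled variance, supported by Levene's test
  alternative = "two.sided",
  conf.level = 0.95                   # t.test() has no conf.interval argument
)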

## 
##  Two Sample t-test
## 
## data:  percent by test_preparation_course
## t = 8.3909, df = 998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  5.846012 9.415027
## sample estimates:
## mean in group completed      mean in group none 
##                72.66946                65.03894
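As a check on what the equal-variance t-test computes, the sketch below rebuilds the test statistic, p-value and confidence interval from group means, standard deviations and sizes alone. The function name `pooled_t` is hypothetical; the inputs are copied from the gender and test_preparation_course summary tables. The results should agree with the t.test() output above up to rounding of the inputs (t of about 4.17 and 8.39).

```r
# Pooled-variance two-sample t statistic from group means, SDs and sizes
pooled_t <- function(m1, s1, n1, m2, s2, n2, conf_level = 0.95) {
  df <- n1 + n2 - 2
  sp2 <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / df  # pooled variance
  se <- sqrt(sp2 * (1 / n1 + 1 / n2))              # SE of the mean difference
  t_stat <- (m1 - m2) / se
  margin <- qt(1 - (1 - conf_level) / 2, df) * se
  list(
    t = t_stat,
    df = df,
    p_value = 2 * pt(-abs(t_stat), df),            # two-sided p-value
    conf_int = c(m1 - m2 - margin, m1 - m2 + margin)
  )
}

# Inputs copied from the summary tables (female vs male, completed vs none)
by_gender <- pooled_t(69.56950, 14.54181, 518, 65.83748, 13.69884, 482)
by_course <- pooled_t(72.66946, 13.03696, 358, 65.03894, 14.18671, 642)

print(by_gender$t)
print(by_gender$conf_int)
print(by_course$t)
print(by_course$conf_int)
```

Both confidence intervals lie entirely above zero, which is the same information as the two-sided p-value being below 0.05.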

Hypothesis Testing Cont.

Question 1: A two-sided independent-samples t-test was chosen for hypothesis testing, with $H_0: \mu_\Delta = 0$ and $H_A: \mu_\Delta \neq 0$, where $\mu_\Delta$ is the difference between the two group means. The significance level is 5% (0.05).

Levene’s test is performed to check the homogeneity of variances of the two samples. The p-value from Levene’s test is 0.7139, which is greater than the significance level of 0.05, so we assume the population variances to be equal (and unknown). We also assume the population data to be normally distributed. The null hypothesis (Ho) is that there is no difference between the mean percent of females and males; the alternative hypothesis (Ha) is that the two means differ.

Interpretation:

A two-sided t-test was used to test for a significant difference in mean percent between males and females. The p-value from the t-test is 3.312e-05, i.e. p < 0.001, which is less than the significance level ($\alpha = 0.05$), and the 95% CI of the mean difference (1.98 to 5.49) does not capture $H_0: \mu_\Delta = 0$. We therefore reject the null hypothesis (Ho); the result is statistically significant. Because the whole interval lies above zero, the mean female percent is higher than the mean male percent.

Thus we have statistical evidence that the mean female percent is higher than the mean male percent.

Question 2: A two-sided independent-samples t-test was chosen for hypothesis testing, with $H_0: \mu_\Delta = 0$ and $H_A: \mu_\Delta \neq 0$, where $\mu_\Delta$ is the difference between the two group means. The significance level is 5% (0.05).

Levene’s test is performed to check the homogeneity of variances of the two samples. The p-value from Levene’s test is 0.08971, which is greater than the significance level of 0.05, so we assume the population variances to be equal (and unknown). We also assume the population data to be normally distributed. The null hypothesis (Ho) is that there is no difference between the mean percent of students who took the test preparation course and those who did not; the alternative hypothesis (Ha) is that the two means differ.

Interpretation:

A two-sided independent-samples t-test was used to test for a significant difference in mean percent between students who completed the test preparation course and students who did not. The p-value from the t-test is less than 2.2e-16, i.e. p < 0.001, which is less than the significance level ($\alpha = 0.05$), and the 95% CI of the mean difference (5.85 to 9.42) does not capture $H_0: \mu_\Delta = 0$. We therefore reject the null hypothesis (Ho); the result is statistically significant. Because the whole interval lies above zero, the mean percent of students who took the course is higher than that of students who did not.

Thus we have statistical evidence that the mean percent of students who took the test preparation course is higher than the mean percent of students who did not.

Discussion

For the problem statement above, the data visualizations and the t-tests performed to account for sampling error lead to the following conclusions:

1, Does one particular gender excel another?

Yes, the mean percent for females is higher than for males, and the t-test provides statistical evidence supporting this observation from the descriptive statistics.

2, Does practice help to excel scores ?

Yes, the mean percent of students who took the test preparation course is higher than that of students who did not, and the t-test provides statistical evidence supporting this observation from the descriptive statistics.

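3, Does one particular race outperform the others?

The data visualization and the summary statistics show that Group E outperforms the other groups; the decreasing order of performance is E, D, C, B, A. No hypothesis test was carried out for this question, so this conclusion is descriptive only.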
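Because there are five race/ethnicity groups, a two-sample t-test does not apply directly. One common option for comparing several group means is a one-way ANOVA followed by pairwise comparisons. The sketch below shows the shape of such a check on simulated data with three made-up groups; it was not part of the original analysis, and the group labels, means and sizes are invented.

```r
set.seed(7)

# Simulated scores for three groups (illustration only, not the real data)
sim <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 60)),
  percent = c(rnorm(60, 63, 14), rnorm(60, 67, 14), rnorm(60, 73, 14))
)

# One-way ANOVA: H0 is that all group means are equal
fit <- aov(percent ~ group, data = sim)
anova_table <- summary(fit)[[1]]
print(anova_table)

# Pairwise comparisons with Tukey's HSD, which adjusts for multiple testing
print(TukeyHSD(fit))
```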

4, Are students who are good at math bad at writing?

The count of students who did better in math than in writing is smaller than the count of students who did better in writing, so there is not enough evidence to conclude that students who are good at math are bad at writing.

Limitations and future recommendations: there may be availability bias, which can be addressed by gathering more data in future.

Strengths: the hypothesis tests account for sampling error.

STUDENT PERFORMANCE ANALYSIS
