DATA606 - Data Project

December 3, 2020

Students Performance in Exams

Overview

The data set comes from Kaggle, a web-based data science environment for exploring data and building models.

The data set contains 1000 observations of 8 variables, and each case represents a student in the United States.

It is an observational study in which:

The response variable is the mean test score, which is numerical.
The explanatory variable is the test preparation course, which is categorical.

Research question:

“Does the average test score differ between students who completed a test preparation course and those who did not?”

The data

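# Load the packages used in the analysis

library(dplyr)    # %>%, select(), transmute(), filter()
library(psych)    # describe()
library(ggplot2)  # ggplot(), geom_histogram()
library(infer)    # specify(), hypothesize(), generate(), calculate()

# Load data from Github repository

data <- read.csv("https://raw.githubusercontent.com/jnataky/DATA-607/master/A2_Various_dataset_transformation/students_performance.csv")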


data_sub <- data %>%
  select(test.preparation.course, math.score, writing.score, reading.score)

data_clean <- data_sub %>%
  transmute(test.preparation.course, tests_score = (math.score + writing.score + reading.score) / 3)

names(data_clean) <- c("test_prep", "tests_score")

# Check for missing values

sum(is.na(data_clean))

# Recode the preparation status as "yes" / "no"

data_final <- data_clean %>%
  transmute(test_prep_course = ifelse(test_prep == "completed", "yes", "no"), tests_score)
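
The transformation above can be checked by hand on a tiny example. The sketch below uses base R only and a hypothetical three-student data frame (the column names mirror those in the code above, the values are invented), so the averaged score and the recoded group are easy to verify.

```r
# Hypothetical three-student data frame; values are invented for illustration.
toy <- data.frame(
  test.preparation.course = c("completed", "none", "completed"),
  math.score    = c(90, 60, 75),
  writing.score = c(80, 70, 75),
  reading.score = c(70, 50, 75)
)

# Average the three subject scores row by row.
toy$tests_score <- rowMeans(toy[, c("math.score", "writing.score", "reading.score")])

# Recode the preparation status as "yes" / "no".
toy$test_prep_course <- ifelse(toy$test.preparation.course == "completed", "yes", "no")

toy_final <- toy[, c("test_prep_course", "tests_score")]
print(toy_final)
```

Each row of toy_final holds one student's group label and the mean of that student's three scores, which is the same shape as data_final.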

Exploratory data analysis

test_prep_yes <- data_final %>% filter(test_prep_course == "yes")

test_prep_no <- data_final %>% filter(test_prep_course == "no")

boxplot(test_prep_yes$tests_score, test_prep_no$tests_score,
        names = c("Test score with preparation", "Test score with no preparation"))

Summary statistics

Description

describe(test_prep_yes$tests_score)

##    vars   n  mean    sd median trimmed  mad   min max range  skew kurtosis   se
## X1    1 358 72.67 13.04   73.5   73.03 12.6 34.33 100 65.67 -0.26    -0.26 0.69

describe(test_prep_no$tests_score)

##    vars   n  mean    sd median trimmed   mad min max range  skew kurtosis   se
## X1    1 642 65.04 14.19  65.33   65.33 14.33   9 100    91 -0.28     0.22 0.56

Summary statistics

Plot: Distribution of test scores for students who had test preparation course

ggplot(test_prep_yes, aes(x = tests_score)) + geom_histogram() +
  ggtitle("Distribution of test scores for students who had test preparation course")

Summary statistics

Plot: Distribution of test scores for students who had no test preparation course

ggplot(test_prep_no, aes(x = tests_score)) + geom_histogram() +
  ggtitle("Distribution of test scores for students who had no test preparation course")

Inference

Conditions:

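Independence: The sample consists of different students, whose scores are assumed to be independent of one another.
Random samples: The observations are treated as a random sample.
Approximately normal: Each group has more than 30 observations (358 and 642), so the central limit theorem applies.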

Hypotheses:

H0: The average test score of students who completed the test preparation course is the same as that of students who did not.
H1: The average test score of students who completed the test preparation course is different from that of students who did not.
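
The same hypotheses can be written symbolically. The symbols \(\mu_{yes}\) and \(\mu_{no}\) are notation introduced here, not in the original slides; they stand for the population mean test scores of students with and without the test preparation course.

```latex
\[
H_0:\ \mu_{yes} - \mu_{no} = 0
\qquad \text{versus} \qquad
H_1:\ \mu_{yes} - \mu_{no} \neq 0
\]
```

The alternative is two-sided, which matches the direction used for the p-value later in the analysis.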

Significance level:

\(\alpha = 0.05\)

Inference

Point estimate

point_estimate <- data_final %>%
  specify(tests_score ~ test_prep_course) %>%
  calculate(stat = "diff in means" , order = c("yes", "no"))

Null distribution

set.seed(1412)

ci_null_dist <- data_final %>%
  specify(tests_score ~ test_prep_course) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 200, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
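
To make the permutation step less of a black box, here is a minimal base-R sketch of the same idea. It is not the infer implementation: the scores are simulated rather than taken from the exam data, and the helper diff_in_means() is a hypothetical name used only for this illustration.

```r
# Base-R sketch of a permutation null distribution for a difference in means.
# The scores below are simulated, not the exam data.
set.seed(1)
scores <- c(rnorm(30, mean = 72, sd = 13), rnorm(50, mean = 65, sd = 14))
group  <- rep(c("yes", "no"), times = c(30, 50))

# Difference in group means, ordered "yes" minus "no".
diff_in_means <- function(y, g) mean(y[g == "yes"]) - mean(y[g == "no"])

observed <- diff_in_means(scores, group)

# Shuffle the group labels to break any link between group and score,
# then recompute the statistic; repeat to build the null distribution.
reps <- 200
null_stats <- replicate(reps, diff_in_means(scores, sample(group)))

summary(null_stats)
```

Because the labels are shuffled, the permuted differences scatter around zero, which is what the null hypothesis of independence predicts.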

Inference

Confidence interval

ci_null_dist %>%
  get_confidence_interval(point_estimate = point_estimate,
                          level = 0.95,
                          type = "se")

## # A tibble: 1 x 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     5.77     9.50

We are 95% confident that the difference in mean test scores between students who completed the test preparation course and those who did not lies in the interval (5.77, 9.50).
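
As a rough cross-check, a similar interval can be computed from the summary statistics reported by describe() using the usual formula: point estimate plus or minus a critical value times the standard error. This sketch assumes the unpooled standard error of a difference between two independent means and a normal critical value; it is not the method get_confidence_interval() applies to the simulated distribution, so small differences are expected.

```r
# Summary statistics copied from the describe() output above.
n_yes <- 358; mean_yes <- 72.67; sd_yes <- 13.04
n_no  <- 642; mean_no  <- 65.04; sd_no  <- 14.19

# Observed difference in means, "yes" minus "no".
diff_means <- mean_yes - mean_no

# Standard error of a difference between two independent sample means.
se_diff <- sqrt(sd_yes^2 / n_yes + sd_no^2 / n_no)

# Normal critical value for a 95% interval (about 1.96).
z_star <- qnorm(0.975)

ci_formula <- c(lower = diff_means - z_star * se_diff,
                upper = diff_means + z_star * se_diff)
round(ci_formula, 2)
```

The formula-based interval comes out near (5.89, 9.37), slightly narrower than the simulation-based interval but leading to the same conclusion: zero is well outside it.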

Inference

P value

ci_null_dist %>%
  get_p_value(obs_stat = point_estimate,
              direction = "two-sided")

## Warning: Please be cautious in reporting a p-value of 0. This result is an approximation based on
## the number of `reps` chosen in the `generate()` step. See `?get_p_value()` for more information.

## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1       0

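p_value < 0.05 (the reported 0 means that none of the 200 permuted differences was as extreme as the observed one)

We reject the null hypothesis. The data provide evidence for the alternative.

Thus, the average test score of students who completed the test preparation course is different from that of students who did not.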
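
The warning about a p-value of 0 is easier to interpret with a small sketch of how a permutation p-value is computed. The function perm_p_value() below is a hypothetical helper using one common convention (the share of permuted statistics at least as extreme in absolute value); infer's own computation may differ in detail, and the null statistics here are simulated for illustration.

```r
# Two-sided permutation p-value: share of permuted statistics at least as
# extreme (in absolute value) as the observed one.
perm_p_value <- function(null_stats, observed) {
  mean(abs(null_stats) >= abs(observed))
}

# Illustrative values only: 200 simulated null statistics centred on zero
# and an observed difference far out in the tail.
set.seed(1412)
null_stats <- rnorm(200, mean = 0, sd = 1)
observed   <- 7.63

p_raw <- perm_p_value(null_stats, observed)

# With a finite number of permutations, a raw value of 0 only means
# "smaller than 1 / reps", so report that bound instead.
p_bound <- 1 / length(null_stats)
c(p_raw = p_raw, p_bound = p_bound)
```

With 200 repetitions the smallest non-zero p-value that can be observed is 1/200 = 0.005, so a result of 0 is better reported as p < 0.005. Increasing reps in generate() tightens that bound.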

Inference

Visualization

ci_null_dist %>%
  visualise()

Conclusion

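This analysis suggests that test preparation matters for students' success: students who completed the test preparation course scored about 7.6 points higher on average (72.67 versus 65.04). Because this is an observational study, the result shows an association and does not by itself prove that the course causes the improvement. Students should still consider taking a test preparation course in order to be successful in exams.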

One limitation is that test preparation is not quantified in this data set. Ideally, the data would record the number of hours of preparation for each student who took the test. This would give a better idea of the average preparation time needed to be successful in exams, although the learning curve differs from one student to another.

References

https://www.kaggle.com/spscientist/students-performance-in-exams