December 3, 2020
The data come from Kaggle, a web-based data science environment for exploring data and building models.
The data set contains 1,000 observations of 8 variables, and each case represents a student in the United States.
This is an observational study.
Research question:
“Is the average test score different between students who completed a test preparation course and those who did not?”
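# Packages used below: dplyr (data wrangling), ggplot2 (plots),
# psych (describe), infer (permutation inference)
library(dplyr)
library(ggplot2)
library(psych)
library(infer)
# Load data from GitHub repository
data <- read.csv("https://raw.githubusercontent.com/jnataky/DATA-607/master/A2_Various_dataset_transformation/students_performance.csv")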
# Keep the grouping variable and the three subject scores
data_sub <- data %>%
  select(test.preparation.course, math.score, writing.score, reading.score)
# Average the three subject scores into a single tests_score per student
data_clean <- data_sub %>%
  transmute(test.preparation.course,
            tests_score = (math.score + writing.score + reading.score) / 3)
names(data_clean) <- c("test_prep", "tests_score")
# Check for missing values
sum(is.na(data_clean))
# Recode the grouping variable: "completed" becomes "yes", anything else "no"
data_final <- data_clean %>%
  transmute(test_prep_course = ifelse(test_prep == "completed", "yes", "no"), tests_score)
test_prep_yes <- data_final %>% filter(test_prep_course == "yes")
test_prep_no <- data_final %>% filter(test_prep_course == "no")
boxplot(test_prep_yes$tests_score, test_prep_no$tests_score,
        names = c("Test score with preparation", "Test score with no preparation"))
Description
describe(test_prep_yes$tests_score)
##    vars   n  mean    sd median trimmed  mad   min max range  skew kurtosis   se
## X1    1 358 72.67 13.04   73.5   73.03 12.6 34.33 100 65.67 -0.26    -0.26 0.69
describe(test_prep_no$tests_score)
##    vars   n  mean    sd median trimmed   mad min max range  skew kurtosis   se
## X1    1 642 65.04 14.19  65.33   65.33 14.33   9 100    91 -0.28     0.22 0.56
Plot: Distribution of test scores for students who completed the test preparation course
ggplot(test_prep_yes, aes(x = tests_score)) + geom_histogram(bins = 30) +
  ggtitle("Distribution of test scores: students who completed the test preparation course")
Plot: Distribution of test scores for students who did not complete the test preparation course
ggplot(test_prep_no, aes(x = tests_score)) + geom_histogram(bins = 30) +
  ggtitle("Distribution of test scores: students with no test preparation course")
Conditions:
Hypotheses:
H0: The average test score of students who completed a test preparation course is the same as that of students who did not.
H1: The average test score of students who completed a test preparation course is different from that of students who did not.
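Written in symbols, the two hypotheses can be sketched as follows. The labels \(\mu_{\mathrm{yes}}\) and \(\mu_{\mathrm{no}}\) are introduced here for illustration (they do not appear in the original analysis) and stand for the population mean test scores of students who did and did not complete the course:

\[
H_0:\ \mu_{\mathrm{yes}} - \mu_{\mathrm{no}} = 0
\qquad \text{versus} \qquad
H_1:\ \mu_{\mathrm{yes}} - \mu_{\mathrm{no}} \neq 0
\]

The difference is taken as yes minus no, which matches the `order = c("yes", "no")` argument used in the code.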
Significance level:
\(\alpha = 0.05\)
Point estimate
point_estimate <- data_final %>%
  specify(tests_score ~ test_prep_course) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
Null distribution
set.seed(1412)
ci_null_dist <- data_final %>%
  specify(tests_score ~ test_prep_course) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 200, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
Confidence interval
ci_null_dist %>%
  get_confidence_interval(point_estimate = point_estimate,
                          level = 0.95,
                          type = "se")
## # A tibble: 1 x 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     5.77     9.50
We are 95% confident that the difference in mean test scores between students who completed the test preparation course and those who did not lies in (5.77, 9.50).
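As a rough cross-check, a similar interval can be approximated directly from the summary statistics printed by describe(). The sketch below is not the method used in this analysis: it assumes a large-sample normal approximation with an unpooled standard error, and all variable names are illustrative.

```r
# Summary statistics copied from the describe() output in this report
n_yes <- 358; mean_yes <- 72.67; sd_yes <- 13.04
n_no  <- 642; mean_no  <- 65.04; sd_no  <- 14.19

# Observed difference in means (yes minus no)
diff_means <- mean_yes - mean_no

# Unpooled standard error of the difference (assumption: independent groups)
se_diff <- sqrt(sd_yes^2 / n_yes + sd_no^2 / n_no)

# 95% interval using the normal critical value (large-sample assumption)
z_crit <- qnorm(0.975)
ci <- diff_means + c(-1, 1) * z_crit * se_diff

print(round(c(diff = diff_means, se = se_diff, lower = ci[1], upper = ci[2]), 2))
```

This gives an interval of roughly (5.89, 9.37), close to the (5.77, 9.50) reported above. A small gap is expected because the reported interval takes its standard error from only 200 permutation replicates.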
P value
ci_null_dist %>%
  get_p_value(obs_stat = point_estimate,
              direction = "two-sided")
## Warning: Please be cautious in reporting a p-value of 0. This result is an approximation based on
## the number of `reps` chosen in the `generate()` step. See `?get_p_value()` for more information.
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1       0
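The warning arises because a permutation p-value is a proportion of replicates, so with 200 replicates it can only take values in steps of 1/200, and 0 simply means that no permuted statistic was as extreme as the observed one. The base R sketch below illustrates this on simulated stand-in data (not the Kaggle data). The helper diff_in_means(), the group sizes, and the add-one adjustment are assumptions for illustration, and the two-sided rule used here (comparing absolute values) is one common definition that may differ from the one infer applies.

```r
set.seed(1)  # arbitrary seed for this illustration

# Simulated stand-in data: two groups whose true means differ
scores <- c(rnorm(60, mean = 72, sd = 13), rnorm(100, mean = 65, sd = 14))
group  <- rep(c("yes", "no"), times = c(60, 100))

# Hypothetical helper: difference in group means, yes minus no
diff_in_means <- function(x, g) mean(x[g == "yes"]) - mean(x[g == "no"])

obs  <- diff_in_means(scores, group)
reps <- 200

# Permutation null distribution: shuffle the group labels and recompute
perm <- replicate(reps, diff_in_means(scores, sample(group)))

# Two-sided count of permuted statistics at least as extreme as observed
n_extreme <- sum(abs(perm) >= abs(obs))

p_raw      <- n_extreme / reps              # can be exactly 0
p_adjusted <- (n_extreme + 1) / (reps + 1)  # never below 1 / (reps + 1)

print(c(p_raw = p_raw, p_adjusted = p_adjusted))
```

When the raw value is 0, it is safer to report it as "p < 1/reps" (here, p < 0.005) or to increase reps, rather than to report a p-value of exactly zero.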
The p-value is less than 0.05.
We reject the null hypothesis. The data provide evidence for the alternative.
Thus, the average test score of students who completed a test preparation course is different from that of students who did not.
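To get a sense of how strong the evidence is beyond "no permuted statistic was as extreme", a large-sample z statistic can be computed from the describe() summary statistics. This is a sketch under a normal approximation with an unpooled standard error, not part of the original permutation analysis, and the variable names are illustrative.

```r
# Summary statistics copied from the describe() output in this report
n_yes <- 358; mean_yes <- 72.67; sd_yes <- 13.04
n_no  <- 642; mean_no  <- 65.04; sd_no  <- 14.19

# Standardized difference in means under H0: no difference
se_diff <- sqrt(sd_yes^2 / n_yes + sd_no^2 / n_no)
z_stat  <- (mean_yes - mean_no) / se_diff

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z_stat))

alpha <- 0.05
print(c(z = round(z_stat, 2), reject_H0 = p_value < alpha))
```

The z statistic is about 8.6 standard errors from zero, so this approximation leads to the same decision as the permutation test at \(\alpha = 0.05\).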
Visualization
ci_null_dist %>% visualise()
This analysis suggests that test preparation matters for student success: students who completed the course scored about 7.6 points higher on average (72.67 versus 65.04). Students should consider taking a test preparation course to improve their exam results.
A limitation is that test preparation is not quantified in this data set. Ideally, the data would record at least the number of hours each student spent preparing. That would give a better idea of how much preparation time is needed, on average, to succeed in exams, although the learning curve differs from one student to another.
https://www.kaggle.com/spscientist/students-performance-in-exams