The dataset this project analyzes is the Students Exam Scores: Extended Dataset, which is available on Kaggle. We uploaded the CSV file to a GitHub repository and use the following code to read it into R.
# load data
library(tidyverse)
library(psych)
data <- read.csv("https://raw.githubusercontent.com/sokkarbishoy/DATA607/main/Expanded_data_with_more_features.csv")
# First, remove rows with NA values or empty strings in the columns of interest
data <- data[complete.cases(data), ]
data <- data[data$EthnicGroup != "", ]
data <- data[data$ParentEduc != "", ]
data <- data[data$TestPrep != "", ]
data <- data[data$ParentMaritalStatus != "", ]
data <- data[data$PracticeSport != "", ]
data <- data[data$WklyStudyHours != "", ]
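# Delete variables that are not relevant to the questions asked: whether the child was the first child in the family, and the transportation method
data <- data[, -9]   # drops column 9
data <- data[, -10]  # drops what was originally column 11; positions shift by one after the first drop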
# Create a new overall score that averages the three test scores
data <- data %>%
  mutate(Total_avg_scores = rowMeans(select(., MathScore, WritingScore, ReadingScore), na.rm = TRUE))
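The cleaning steps above can also be sketched more compactly. The example below is a minimal illustration on a small made-up table (the rows and the `raw_csv` name are hypothetical, not the real data): it asks `read.csv` to treat empty strings as `NA` through `na.strings`, so a single `complete.cases` call drops every incomplete row. Note that it treats blanks in every column as missing, so on the real file it could drop more rows than the column-by-column filters above.

```r
# Hypothetical toy CSV standing in for the real file; values are made up.
raw_csv <- "Gender,PracticeSport,WklyStudyHours,MathScore
female,regularly,< 5,71
male,,5 - 10,64
female,never,,80
male,sometimes,> 10,NA
female,sometimes,5 - 10,90"

# Treat both empty strings and the literal NA as missing while reading.
toy <- read.csv(text = raw_csv, na.strings = c("", "NA"))

# Keep only rows with no missing value in any column.
toy_clean <- toy[complete.cases(toy), ]

nrow(toy)        # rows before cleaning
nrow(toy_clean)  # rows after cleaning
```

Dropping the unneeded columns before this step would keep blanks in those columns from removing otherwise usable rows.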
You should phrase your research question in a way that matches up with the scope of inference your data set allows for.
My final question for the project will be one of the following questions.
Does practicing sports have an effect on student grades?
Can parents’ education level predict a student’s grades?
Does the marital status of parents have an effect on the student’s academic performance?
What are the cases, and how many are there?
There are 30,641 observations in the data set, each representing a different student, before removing the NA values. After tidying the data, we are left with 22,058 observations in which every row has a value for each variable.
Describe the method of data collection.
The data was obtained from Kaggle. The data set contains three test scores and a range of personal and socio-economic factors that could potentially influence these scores. The CSV file was stored in a public GitHub repository and loaded using the read.csv function.
What type of study is this (observational/experiment)?
This is an observational study.
If you collected the data, state self-collected. If not, provide a citation/link.
The original dataset generator was created by Royce Kimmons. The data was obtained from kaggle.com.
Kimmons, Royce. “Exam Scores: Exam Scores for Students at a Public School.” Royce Kimmons: Understanding Digital Participation Divides, 2012, roycekimmons.com/tools/generated_data/exams.
What is the response variable? Is it quantitative or qualitative?
The dependent variable is the student’s exam scores. There are three different exams, and we also create a new variable, Total_avg_scores (the average of the three scores), which measures the overall performance of a student. The dependent variable in this case is quantitative.
The independent variables depend on the question asked, but examples include parents’ education level, parents’ marital status, and test preparation. Most of these variables are qualitative (categorical).
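One way to check which variables are quantitative and which are qualitative is to inspect how R stores each column. The sketch below uses a small made-up data frame (the rows in `toy` are hypothetical; only the column names mirror the data set) and splits the columns by whether they are numeric.

```r
# Hypothetical rows; only the column names mirror the data set.
toy <- data.frame(
  ParentEduc          = c("high school", "some college", "master's degree"),
  ParentMaritalStatus = c("married", "single", "divorced"),
  TestPrep            = c("none", "completed", "none"),
  MathScore           = c(66, 72, 81),
  stringsAsFactors    = FALSE
)

# Numeric columns are quantitative; character columns are qualitative.
is_quant <- sapply(toy, is.numeric)
quantitative <- names(toy)[is_quant]
qualitative  <- names(toy)[!is_quant]

quantitative
qualitative
```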
Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plots, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
describe(data$MathScore)
##    vars     n  mean    sd median trimmed   mad min max range  skew kurtosis  se
## X1    1 22058 66.56 15.38     67   66.77 16.31   0 100   100 -0.16    -0.23 0.1
describe(data$ReadingScore)
##    vars     n  mean    sd median trimmed   mad min max range  skew kurtosis  se
## X1    1 22058  69.4 14.78     70   69.66 14.83  10 100    90 -0.19    -0.27 0.1
describe(data$WritingScore)
##    vars     n  mean    sd median trimmed   mad min max range  skew kurtosis  se
## X1    1 22058 68.47 15.47     69   68.68 16.31   4 100    96 -0.16     -0.3 0.1
describe(data$Total_avg_scores)
##    vars     n  mean    sd median trimmed   mad min max range  skew kurtosis  se
## X1    1 22058 68.14 14.48  68.33   68.39 15.32   9 100    91 -0.19    -0.27 0.1
table(data$PracticeSport)
##
##     never regularly sometimes
##      2927      8005     11126
At a glance, the following plot suggests a relationship between practicing sports and the overall average score. H1: Practicing sports sometimes or regularly is associated with higher test scores. H0: Practicing sports does not have an effect on students’ test performance.
ggplot(data, aes(x = PracticeSport, y = Total_avg_scores, color = PracticeSport)) +
  geom_boxplot() +
  # geom_smooth(method = "lm") is dropped: a linear fit is not meaningful over a categorical x-axis
  geom_jitter(width = 0.1, alpha = 0.5)
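The hypotheses above could be tested with a one-way ANOVA, which compares mean scores across the three sport-practice groups. This is only a sketch under assumptions: the data below is simulated (the group means and the `sim` name are made up), and ANOVA is one possible choice, not necessarily the method this project will use. Because the study is observational, a small p-value would indicate an association, not a causal effect.

```r
set.seed(607)  # arbitrary seed so the simulated example is reproducible

# Simulated stand-in data: 100 students per group with made-up group means.
sim <- data.frame(
  PracticeSport = factor(rep(c("never", "sometimes", "regularly"), each = 100)),
  Total_avg_scores = c(rnorm(100, mean = 66, sd = 14),
                       rnorm(100, mean = 68, sd = 14),
                       rnorm(100, mean = 69, sd = 14))
)

# One-way ANOVA: H0 says all three group means are equal.
fit <- aov(Total_avg_scores ~ PracticeSport, data = sim)
anova_table <- summary(fit)[[1]]

# The first row holds the F statistic and p-value for PracticeSport.
f_value <- anova_table[["F value"]][1]
p_value <- anova_table[["Pr(>F)"]][1]
print(anova_table)
```

On the real data, `sim` would be replaced by the cleaned `data` frame, and the ANOVA assumptions (roughly normal residuals, similar group variances) would need to be checked first.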
table(data$ParentEduc)
##
## associate's degree  bachelor's degree        high school    master's degree
##               4254               2611               4339               1561
##       some college   some high school
##               5064               4229
From the following plot, we can see that the higher the education level completed by the parents, the higher the student’s overall scores. H1: Parents’ education level can predict students’ exam performance. H0: Parents’ education level has no effect on students’ exam performance.
ggplot(data, aes(x = ParentEduc, y = Total_avg_scores, color = ParentEduc)) +
  geom_boxplot() +
  # geom_smooth(method = "lm") is dropped: a linear fit is not meaningful over a categorical x-axis
  geom_jitter(width = 0.1, alpha = 0.5) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
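Because ParentEduc values are sorted alphabetically by default, the boxes in the plot do not follow the natural progression of education levels. One way to handle this is sketched below: it converts the column to a factor with explicit levels and computes the mean score per level. The level names come from the table above, but their ranking and the toy scores are assumptions made for illustration.

```r
# Education levels as they appear in the table; the ranking is an assumption.
educ_levels <- c("some high school", "high school", "some college",
                 "associate's degree", "bachelor's degree", "master's degree")

# Hypothetical toy rows, not taken from the real data.
toy <- data.frame(
  ParentEduc = c("master's degree", "high school", "some college",
                 "high school", "master's degree", "some high school"),
  Total_avg_scores = c(80, 60, 70, 64, 76, 58)
)

# Fix the level order so tables and plots list levels from lowest to highest.
toy$ParentEduc <- factor(toy$ParentEduc, levels = educ_levels)

# Mean score per education level; levels with no rows come back as NA.
group_means <- tapply(toy$Total_avg_scores, toy$ParentEduc, mean)
group_means
```

Applying the same `factor()` call to the real `data$ParentEduc` before plotting would make ggplot draw the boxes in this order.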
table(data$ParentMaritalStatus)
##
## divorced  married   single  widowed
##     3736    12609     5274      439
Based on the following plot, we can set aside the research question mentioned above (does the marital status of parents have an effect on the student’s academic performance?), since the plot shows that students’ scores are very close across the marital status groups.
ggplot(data, aes(x = ParentMaritalStatus, y = Total_avg_scores)) +
  geom_boxplot(fill = "skyblue", color = "darkblue", alpha = 0.7) +
  geom_jitter(aes(color = ParentMaritalStatus), width = 0.1, alpha = 0.5)
From the above, we have two questions worth investigating:
Does practicing sports have an effect on student grades?
Can parents’ education level predict a student’s grades?