DATA 606 Data Project Proposal

Data Preparation

The dataset the project will analyze is the Students Exam Scores: Extended Dataset, which can be found here: . We uploaded the CSV file to a GitHub repository and use the following code to load it into R.

# load data
library(tidyverse)

library(psych)

data <- read.csv("https://raw.githubusercontent.com/sokkarbishoy/DATA607/main/Expanded_data_with_more_features.csv")

# First we remove all rows that contain NA values or empty strings
# complete.cases() drops every row with an NA (na.omit() would do the same, so one call is enough)
data <- data[complete.cases(data), ]
# read.csv() keeps blank text fields as "", so those are filtered out column by column
data <- data[data$EthnicGroup != "",]
data <- data[data$ParentEduc != "",]
data <- data[data$TestPrep != "",]
data <- data[data$ParentMaritalStatus != "",]
data <- data[data$PracticeSport != "",]
data <- data[data$WklyStudyHours != "",]

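# Delete variables that are not relevant to the questions asked, such as the transportation method and whether the child was the first child in the family
# Note: once column 9 is dropped the remaining columns shift left, so index 10 below refers to what was originally column 11

data <- data[, -9]
data <- data[, -10]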

# Create a new overall score that is the average of the three test scores
data <- data %>%
  mutate(Total_avg_scores = rowMeans(select(., MathScore, WritingScore, ReadingScore), na.rm = TRUE))
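The cleaning steps above could also be written more compactly. The following is only a sketch on a made-up toy CSV, not the project's actual pipeline: it assumes the two dropped columns are named IsFirstChild and TransportMeans (hypothetical names, since the original code removes them by position), and it relies on the `na.strings` argument of `read.csv` to turn blank fields into NA at load time.

```r
# Toy data standing in for the real CSV (all values are made up)
csv_text <- "Gender,PracticeSport,IsFirstChild,NrSiblings,TransportMeans,MathScore
female,regularly,yes,3,school_bus,71
male,,yes,1,private,69
female,sometimes,no,,private,87
male,never,no,2,,45"

# na.strings = "" makes read.csv treat empty fields as NA, so a single
# complete.cases() call later covers both blank text and missing numbers
toy <- read.csv(text = csv_text, na.strings = c("", "NA"))

# Drop columns by name rather than by position, so the result does not
# depend on column order (column names here are assumptions)
toy <- toy[, !(names(toy) %in% c("IsFirstChild", "TransportMeans"))]

# Keep only rows that have a value in every remaining column
toy_clean <- toy[complete.cases(toy), ]

nrow(toy)        # rows before cleaning
nrow(toy_clean)  # rows after cleaning
```

Dropping the unused columns before calling `complete.cases()` means a blank in a column that is discarded anyway does not cost a row, which matches the original code, where those two columns are never filtered for blanks.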

Research question

You should phrase your research question in a way that matches up with the scope of inference your data set allows for.

My final research question will be one of the following:

Does practicing sports have an effect on student grades?
Can parents’ education level predict a student’s grades?
Does the marital status of parents have an effect on the student’s academic performance?

Cases

What are the cases, and how many are there?

Each case is one student. There are 30,641 observations in the data set before the rows with missing values are removed. After tidying the data we are left with 22,058 observations in which every column has a value.

Data collection

Describe the method of data collection.

The data set is hosted on Kaggle. It contains three test scores per student along with a range of personal and socio-economic factors that could potentially influence these scores. The CSV file was stored in a public GitHub repository and read into R using the read.csv function.

Type of Study

What type of study is this (observational/experiment)?

This is an observational study.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.
The original data set generator was created by Mr. Royce Kimmons. The data itself was obtained from kaggle.com.

Kimmons, Royce. “Exam Scores: Exam Scores for Students at a Public School.” Royce Kimmons: Understanding Digital Participation Divides, 2012, roycekimmons.com/tools/generated_data/exams.

Dependent Variable

What is the response variable? Is it quantitative or qualitative?

The dependent variable is the student’s exam score. There are three different exams, and we also create a new variable, Total_avg_scores, that averages the three scores to measure a student’s overall performance. The dependent variable in this case is quantitative.

Independent Variable(s)

The independent variables depend on the question asked, but examples include parents’ education level, parents’ marital status, and test preparation. Most of these variables are qualitative (categorical); parents’ education level is ordinal, since its categories have a natural order.
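Because parents’ education level is categorical with a natural ranking, it can help to store it as an ordered factor so that tables and plots follow that ranking instead of alphabetical order. The sketch below uses a small made-up vector; the level names are the ones that appear in the ParentEduc column, but the low-to-high ordering shown is an assumption.

```r
# Education levels as they appear in the ParentEduc column,
# listed from lowest to highest (this ordering is an assumption)
educ_levels <- c("some high school", "high school", "some college",
                 "associate's degree", "bachelor's degree", "master's degree")

# Small made-up sample of ParentEduc values
parent_educ <- c("master's degree", "high school", "some college",
                 "some high school", "bachelor's degree", "high school")

# An ordered factor records that the categories have a natural ranking
parent_educ <- factor(parent_educ, levels = educ_levels, ordered = TRUE)

levels(parent_educ)  # levels now follow the ranking, not the alphabet
table(parent_educ)   # counts are reported in the same order
```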

Relevant summary statistics

Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plots, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

describe(data$MathScore)

##    vars     n  mean    sd median trimmed   mad min max range  skew kurtosis  se
## X1    1 22058 66.56 15.38     67   66.77 16.31   0 100   100 -0.16    -0.23 0.1

describe(data$ReadingScore)

##    vars     n mean    sd median trimmed   mad min max range  skew kurtosis  se
## X1    1 22058 69.4 14.78     70   69.66 14.83  10 100    90 -0.19    -0.27 0.1

describe(data$WritingScore)

##    vars     n  mean    sd median trimmed   mad min max range  skew kurtosis  se
## X1    1 22058 68.47 15.47     69   68.68 16.31   4 100    96 -0.16     -0.3 0.1

describe(data$Total_avg_scores)

##    vars     n  mean    sd median trimmed   mad min max range  skew kurtosis  se
## X1    1 22058 68.14 14.48  68.33   68.39 15.32   9 100    91 -0.19    -0.27 0.1

table(data$PracticeSport)

## 
##     never regularly sometimes 
##      2927      8005     11126

At a glance, the following plot suggests a relationship between practicing sports and the overall average score. H1: Practicing sports sometimes or regularly is associated with higher test scores. H0: Practicing sports has no effect on students’ test performance.

ggplot(data, aes(x = PracticeSport, y = Total_avg_scores, color = PracticeSport)) +
  geom_boxplot()+
  geom_smooth(method = "lm", se = FALSE, color = "darkblue")+
  geom_jitter(aes(color = PracticeSport), width = 0.1, alpha = 0.5)

## `geom_smooth()` using formula = 'y ~ x'
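One way the sports hypotheses could later be tested is a one-way ANOVA, with a Kruskal-Wallis test as a nonparametric check. The sketch below runs on simulated data only: the group sizes, means, and standard deviation are made up for illustration and are not results from the real data set.

```r
set.seed(606)  # arbitrary seed so the simulated data is reproducible

# Simulated stand-in for the cleaned data: three sport groups, 50 students each
toy <- data.frame(
  PracticeSport = factor(rep(c("never", "sometimes", "regularly"), each = 50)),
  Total_avg_scores = c(rnorm(50, mean = 66, sd = 14),
                       rnorm(50, mean = 68, sd = 14),
                       rnorm(50, mean = 69, sd = 14))
)

# One-way ANOVA: H0 says the mean score is the same in every group
fit <- aov(Total_avg_scores ~ PracticeSport, data = toy)
summary(fit)

# Kruskal-Wallis rank-sum test: a nonparametric alternative that does not
# assume normally distributed scores
kw <- kruskal.test(Total_avg_scores ~ PracticeSport, data = toy)
kw
```

Since this is an observational study, a small p-value from either test would indicate an association between sport practice and scores, not a causal effect.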

table(data$ParentEduc)

## 
## associate's degree  bachelor's degree        high school    master's degree 
##               4254               2611               4339               1561 
##       some college   some high school 
##               5064               4229

The following plot suggests that the higher the education level completed by the parents, the higher the student’s overall scores. H1: Parents’ education level can predict students’ exam performance. H0: Parents’ education level has no effect on students’ exam performance.

ggplot(data, aes(x = ParentEduc, y = Total_avg_scores, color = ParentEduc)) +
  geom_boxplot()+
  geom_smooth(method = "lm", se = FALSE, color = "darkblue")+
  geom_jitter(aes(color = ParentEduc), width = 0.1, alpha = 0.5)+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

## `geom_smooth()` using formula = 'y ~ x'
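To examine whether parents’ education level predicts scores, one option is a linear model with ParentEduc as a categorical predictor. This is a sketch on simulated data (only three of the six education levels, with made-up means), meant to show how the coefficients are read rather than to report any finding.

```r
set.seed(606)  # arbitrary seed so the simulated data is reproducible

# Simulated stand-in: three education levels, 40 students each
educ_levels <- c("some high school", "high school", "bachelor's degree")
toy <- data.frame(
  ParentEduc = factor(rep(educ_levels, each = 40), levels = educ_levels),
  Total_avg_scores = c(rnorm(40, mean = 65, sd = 14),
                       rnorm(40, mean = 67, sd = 14),
                       rnorm(40, mean = 71, sd = 14))
)

# With a factor predictor, the intercept is the mean of the first level
# ("some high school") and each other coefficient is that level's
# difference in mean score from the first level
model <- lm(Total_avg_scores ~ ParentEduc, data = toy)
coef(model)

# Share of the variance in scores explained by education level
summary(model)$r.squared
```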

table(data$ParentMaritalStatus)

## 
## divorced  married   single  widowed 
##     3736    12609     5274      439

Based on the following plot, we can set aside the research question mentioned above (Does the marital status of parents have an effect on the student’s academic performance?), since the plot shows that the students’ scores are distributed very similarly across the marital status groups.

ggplot(data, aes(x = ParentMaritalStatus, y = Total_avg_scores)) +
  geom_boxplot(fill = "skyblue", color = "darkblue", alpha = 0.7) +
  geom_jitter(aes(color = ParentMaritalStatus), width = 0.1, alpha = 0.5)
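The visual impression that the groups are close could be backed up with per-group summary statistics. The sketch below uses `aggregate` on a tiny made-up data frame; the scores are invented for illustration and do not come from the data set.

```r
# Tiny made-up sample: two students per marital status group
toy <- data.frame(
  ParentMaritalStatus = c("married", "married", "single", "single",
                          "divorced", "divorced", "widowed", "widowed"),
  Total_avg_scores = c(70, 66, 69, 67, 68, 66, 71, 65)
)

# Mean, median, and standard deviation of the overall score per group;
# groups are returned in alphabetical order
group_summary <- aggregate(
  Total_avg_scores ~ ParentMaritalStatus,
  data = toy,
  FUN = function(x) c(mean = mean(x), median = median(x), sd = sd(x))
)
group_summary
```

If the group means and medians on the real data differ by only a point or two while the standard deviations are around 14 to 15, that would support dropping this question.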

From the above, we have two questions worth investigating:

Does practicing sports have an effect on student grades?
Can parents’ education level predict a student’s grades?