The Student Performance dataset encompasses various factors that may influence a student’s academic performance. This report aims to provide a comprehensive descriptive analysis of the dataset, shedding light on the key features and distributions of the variables.
Let’s begin by loading the dataset and taking a preliminary look at its structure and content.
library(tidyverse)
# Load the dataset
url <- "https://raw.githubusercontent.com/Naik-Khyati/data_621/main/blogs/blog1/StudentsPerformance.csv"
student_performance <- read.csv(url)
# Display basic information about the dataset
str(student_performance)
## 'data.frame': 1000 obs. of 8 variables:
## $ gender : chr "female" "female" "female" "male" ...
## $ race.ethnicity : chr "group B" "group C" "group B" "group A" ...
## $ parental.level.of.education: chr "bachelor's degree" "some college" "master's degree" "associate's degree" ...
## $ lunch : chr "standard" "standard" "standard" "free/reduced" ...
## $ test.preparation.course : chr "none" "completed" "none" "none" ...
## $ math.score : int 72 69 90 47 76 71 88 40 64 38 ...
## $ reading.score : int 72 90 95 57 78 83 95 43 64 60 ...
## $ writing.score : int 74 88 93 44 75 78 92 39 67 50 ...
# Display the first few rows of the dataset
head(student_performance)
The dataset contains 1,000 observations of eight variables: five categorical (gender, race/ethnicity, parental level of education, lunch type, and test preparation course) and three numeric (scores in math, reading, and writing).
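The split between categorical and numeric columns can also be checked programmatically. The following is a minimal sketch, not part of the original analysis: `sample_rows` is a hypothetical toy data frame holding three of the eight columns for the first four rows shown in the `str()` output, so that the example runs without downloading the file.

```r
# Toy data frame (hypothetical name) copied from the first four rows of the
# str() output above; only three of the eight columns are included.
sample_rows <- data.frame(
  gender = c("female", "female", "female", "male"),
  lunch = c("standard", "standard", "standard", "free/reduced"),
  math.score = c(72L, 69L, 90L, 47L),
  stringsAsFactors = FALSE
)

# Classify each column as numeric or categorical by its storage type
is_numeric <- vapply(sample_rows, is.numeric, logical(1))
numeric_cols <- names(sample_rows)[is_numeric]
categorical_cols <- names(sample_rows)[!is_numeric]

# Count missing values per column
missing_counts <- colSums(is.na(sample_rows))

print(numeric_cols)
print(categorical_cols)
print(missing_counts)
```

Assuming the full data frame has been loaded as `student_performance`, the same calls should work on it unchanged by substituting that name for `sample_rows`.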
Next, let’s compute summary statistics for the numeric variables to gain insights into central tendency, dispersion, and overall distributions.
# Summary statistics for numeric variables
summary(student_performance[, c("math.score", "reading.score", "writing.score")])
## math.score reading.score writing.score
## Min. : 0.00 Min. : 17.00 Min. : 10.00
## 1st Qu.: 57.00 1st Qu.: 59.00 1st Qu.: 57.75
## Median : 66.00 Median : 70.00 Median : 69.00
## Mean : 66.09 Mean : 69.17 Mean : 68.05
## 3rd Qu.: 77.00 3rd Qu.: 79.00 3rd Qu.: 79.00
## Max. :100.00 Max. :100.00 Max. :100.00
The summary statistics provide a concise overview of the numeric variables ‘math.score’, ‘reading.score’, and ‘writing.score’: the minimum and maximum values, the quartiles, and the mean. For instance, the mean math score is approximately 66.09, the median (50th percentile) is 66.00, and the minimum and maximum scores are 0.00 and 100.00, respectively. Reading and writing scores average slightly higher, at 69.17 and 68.05. In all three subjects the mean lies within one point of the median.
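`summary()` reports quartiles but not the standard deviation or the interquartile range directly. As a rough sketch of how those dispersion measures could be added, the example below uses only the first ten values of each score column as printed in the `str()` output, so the numbers it produces describe that small sample and not the full 1,000 rows. The names `scores` and `dispersion` are introduced here for illustration.

```r
# First ten values of each score column, copied from the str() output above.
scores <- data.frame(
  math.score    = c(72, 69, 90, 47, 76, 71, 88, 40, 64, 38),
  reading.score = c(72, 90, 95, 57, 78, 83, 95, 43, 64, 60),
  writing.score = c(74, 88, 93, 44, 75, 78, 92, 39, 67, 50)
)

# Build a matrix with one row per statistic and one column per subject
dispersion <- sapply(scores, function(x) {
  c(mean = mean(x), sd = sd(x), iqr = IQR(x))
})

print(round(dispersion, 2))
```

On the full dataset, passing `student_performance[, c("math.score", "reading.score", "writing.score")]` in place of `scores` should give the corresponding figures for all students.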
Visualizations can enhance our understanding of the data. Let’s create boxplots for each score variable to visualize the distribution and identify potential outliers.
# Boxplots for each score variable, drawn side by side
old_par <- par(mfrow = c(1, 3))
boxplot(student_performance$math.score, main = "Math Score", col = "skyblue", border = "black")
boxplot(student_performance$reading.score, main = "Reading Score", col = "lightgreen", border = "black")
boxplot(student_performance$writing.score, main = "Writing Score", col = "lightcoral", border = "black")
# Restore the previous layout so later plots are not squeezed into the 1 x 3 grid
par(old_par)
These boxplots provide a visual summary of the distribution of scores in each subject, highlighting any variations or outliers. The box spans the interquartile range (IQR), with the median indicated by a line inside the box. Each whisker extends to the most extreme observation that lies within 1.5 times the IQR of the box edge; observations beyond that range are drawn as individual points and treated as potential outliers.
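The 1.5 times IQR rule can be worked through by hand from the quartiles already reported by `summary()`. The sketch below does that arithmetic; `score_summary` and its column names are hypothetical and introduced only for this illustration. One caveat: `boxplot()` computes its box edges as hinges via `fivenum()`, which can differ slightly from the quartiles printed by `summary()`, so the fences here are an approximation of what the plots use.

```r
# Quartiles and extremes copied from the summary() output above
score_summary <- data.frame(
  subject = c("math", "reading", "writing"),
  min = c(0, 17, 10),
  q1 = c(57, 59, 57.75),
  q3 = c(77, 79, 79),
  max = c(100, 100, 100)
)

# Fences under the 1.5 * IQR rule
score_summary$iqr <- score_summary$q3 - score_summary$q1
score_summary$lower_fence <- score_summary$q1 - 1.5 * score_summary$iqr
score_summary$upper_fence <- score_summary$q3 + 1.5 * score_summary$iqr

# An extreme value beyond a fence means at least one outlier on that side
score_summary$has_low_outlier <- score_summary$min < score_summary$lower_fence
score_summary$has_high_outlier <- score_summary$max > score_summary$upper_fence

print(score_summary)
```

Under these approximate fences, the minimum score in every subject falls below the lower fence while the maximum of 100 stays inside the upper fence, which suggests the outliers are all on the low side.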
Now, let’s explore the categorical variables. We’ll create a frequency table for gender and bar plots for race/ethnicity, parental education, lunch type, and test preparation to understand how the students are distributed across categories.
# Frequency table for gender
table(student_performance$gender)
##
## female male
## 518 482
# Bar plot for race/ethnicity
barplot(table(student_performance$race.ethnicity), main = "Race/Ethnicity Distribution", col = "skyblue", border = "black")
# Bar plot for parental education
barplot(table(student_performance$parental.level.of.education), main = "Parental Education Distribution", col = "lightgreen", border = "black")
# Bar plot for lunch type
barplot(table(student_performance$lunch), main = "Lunch Type Distribution", col = "lightcoral", border = "black")
# Bar plot for test preparation
barplot(table(student_performance$test.preparation.course), main = "Test Preparation Distribution", col = "gold", border = "black")
These frequency tables and bar plots offer insights into the distribution of categorical variables, allowing us to understand the composition of the dataset.
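Raw counts are often easier to compare as percentages. A small sketch, using only the gender counts printed in the frequency table above (the names `gender_counts` and `gender_pct` are introduced here for illustration):

```r
# Gender counts copied from the frequency table above
gender_counts <- c(female = 518, male = 482)

# Convert counts to percentages of all students
gender_pct <- round(100 * prop.table(gender_counts), 1)

print(gender_pct)
```

Assuming the data frame is loaded as `student_performance`, the same conversion can be applied to any categorical column by wrapping its frequency table, for example `prop.table(table(student_performance$lunch))`.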
Understanding the demographics and academic performance of students has real-world implications. For instance, educational policymakers can use these insights to identify potential achievement gaps among different demographic groups. Schools can tailor intervention programs to address specific needs based on parental education levels, offering targeted support where it is most needed. Additionally, knowledge of how test preparation relates to performance can guide the development of effective educational strategies and resources.
In the corporate world, understanding the factors influencing academic performance can inform hiring practices and diversity initiatives. Companies may consider these insights when designing educational programs for their employees or supporting initiatives that promote equal opportunities for professional development.
In conclusion, the descriptive analysis of the Student Performance dataset not only provides valuable insights into the characteristics of the dataset but also lays the groundwork for informed decision-making in education and beyond. The real-world applications of these analyses extend to educational institutions, policymakers, and organizations striving to create inclusive and effective learning environments.