Descriptive Analysis Report for Student Performance Dataset

Introduction:

The Student Performance dataset encompasses various factors that may influence a student’s academic performance. This report aims to provide a comprehensive descriptive analysis of the dataset, shedding light on the key features and distributions of the variables.

1. Loading and Exploring the Dataset:

Let’s begin by loading the dataset and taking a preliminary look at its structure and content.

library(tidyverse)

## Warning: package 'readr' was built under R version 4.3.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Load the dataset
url <- "https://raw.githubusercontent.com/Naik-Khyati/data_621/main/blogs/blog1/StudentsPerformance.csv"
student_performance <- read.csv(url)

# Display basic information about the dataset
str(student_performance)

## 'data.frame':    1000 obs. of  8 variables:
##  $ gender                     : chr  "female" "female" "female" "male" ...
##  $ race.ethnicity             : chr  "group B" "group C" "group B" "group A" ...
##  $ parental.level.of.education: chr  "bachelor's degree" "some college" "master's degree" "associate's degree" ...
##  $ lunch                      : chr  "standard" "standard" "standard" "free/reduced" ...
##  $ test.preparation.course    : chr  "none" "completed" "none" "none" ...
##  $ math.score                 : int  72 69 90 47 76 71 88 40 64 38 ...
##  $ reading.score              : int  72 90 95 57 78 83 95 43 64 60 ...
##  $ writing.score              : int  74 88 93 44 75 78 92 39 67 50 ...

# Display the first few rows of the dataset
head(student_performance)

The dataset contains 1,000 observations of 8 variables. Five are categorical and stored as character strings (gender, race/ethnicity, parental level of education, lunch type, and test preparation course); the remaining three are integer scores in math, reading, and writing.
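Before summarizing, it can help to check for missing values and to convert the character columns to factors so that their levels are explicit. The following is a minimal sketch under stated assumptions: `demo` is a stand-in built from the first four rows printed by str() above, not the full dataset, and the same calls would apply to `student_performance`.

```r
# Stand-in data: the first four rows shown in the str() output above
demo <- data.frame(
  gender = c("female", "female", "female", "male"),
  lunch = c("standard", "standard", "standard", "free/reduced"),
  math.score = c(72L, 69L, 90L, 47L),
  stringsAsFactors = FALSE
)

# Count missing values in each column
missing_counts <- colSums(is.na(demo))
print(missing_counts)

# Convert every character column to a factor
is_chr <- vapply(demo, is.character, logical(1))
demo[is_chr] <- lapply(demo[is_chr], factor)

# The structure now lists factor levels instead of raw strings
str(demo)
```

Factors also make summary() report a count per level, which is often more useful than the length and class it reports for character columns.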

2. Summary Statistics:

Next, let’s compute summary statistics for the numeric variables to gain insights into central tendency, dispersion, and overall distributions.

# Summary statistics for numeric variables
summary(student_performance[, c("math.score", "reading.score", "writing.score")])

##    math.score     reading.score    writing.score   
##  Min.   :  0.00   Min.   : 17.00   Min.   : 10.00  
##  1st Qu.: 57.00   1st Qu.: 59.00   1st Qu.: 57.75  
##  Median : 66.00   Median : 70.00   Median : 69.00  
##  Mean   : 66.09   Mean   : 69.17   Mean   : 68.05  
##  3rd Qu.: 77.00   3rd Qu.: 79.00   3rd Qu.: 79.00  
##  Max.   :100.00   Max.   :100.00   Max.   :100.00

The summary statistics give a concise overview of the three score variables: minimum and maximum values, quartiles, the median, and the mean. For instance, math scores have a mean of approximately 66.09 and a median (50th percentile) of 66.00, and they range from 0 to 100. Reading and writing scores average slightly higher (69.17 and 68.05) and have higher minimums (17 and 10).
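summary() reports quartiles but no standard deviation, so dispersion can be computed separately. This is a rough sketch, assuming the three score columns are numeric; `scores` here holds only the first ten values of each column as printed by str() above, so the results will not match the full dataset.

```r
# Stand-in data: the first ten scores per subject from the str() output
scores <- data.frame(
  math.score    = c(72, 69, 90, 47, 76, 71, 88, 40, 64, 38),
  reading.score = c(72, 90, 95, 57, 78, 83, 95, 43, 64, 60),
  writing.score = c(74, 88, 93, 44, 75, 78, 92, 39, 67, 50)
)

# For each column, compute standard deviation, interquartile range,
# and the distance between the smallest and largest value
dispersion <- sapply(scores, function(x) {
  c(sd = sd(x), IQR = IQR(x), range = diff(range(x)))
})

print(round(dispersion, 2))
```

On the real data, the same call could be applied to `student_performance[, c("math.score", "reading.score", "writing.score")]`.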

3. Visual Exploration:

Visualizations can enhance our understanding of the data. Let’s create boxplots for each score variable to visualize the distribution and identify potential outliers.

# Boxplots for each score variable, drawn side by side
old_par <- par(mfrow = c(1, 3))
boxplot(student_performance$math.score, main = "Math Score", col = "skyblue", border = "black")
boxplot(student_performance$reading.score, main = "Reading Score", col = "lightgreen", border = "black")
boxplot(student_performance$writing.score, main = "Writing Score", col = "lightcoral", border = "black")
par(old_par)  # restore the previous layout so later plots are not squeezed into three columns

These boxplots provide a visual summary of the distribution of scores in each subject, highlighting any variations or outliers. The box spans the interquartile range (IQR), with the median indicated by a line inside the box. Each whisker extends to the most extreme observation that lies within 1.5 times the IQR of the box edge; observations beyond the whiskers are drawn as individual points and flagged as potential outliers.
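The whisker rule described above can be checked numerically with boxplot.stats(), the base R function that returns the values a boxplot is drawn from. This is a sketch on the first ten math scores printed by str() above, not the full column. Note that boxplot.stats() uses hinges, which can differ slightly from the quartiles reported by quantile() or summary().

```r
# Stand-in data: the first ten math scores from the str() output
math <- c(72, 69, 90, 47, 76, 71, 88, 40, 64, 38)

# $stats holds: lower whisker, lower hinge, median, upper hinge, upper whisker
# $out holds any observations that fall beyond the whiskers
bs <- boxplot.stats(math, coef = 1.5)

# Fences: 1.5 times the hinge spread beyond each edge of the box
hinge_spread <- bs$stats[4] - bs$stats[2]
lower_fence <- bs$stats[2] - 1.5 * hinge_spread
upper_fence <- bs$stats[4] + 1.5 * hinge_spread

print(bs$stats)
print(c(lower_fence = lower_fence, upper_fence = upper_fence))
print(bs$out)
```

Running the same call on `student_performance$math.score` would list the specific scores that the boxplot draws as outlier points.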

4. Categorical Variables:

Now, let’s explore the categorical variables. We’ll build a frequency table for gender and draw bar plots for race/ethnicity, parental education, lunch type, and test preparation to understand how each is distributed.

# Frequency table for gender
table(student_performance$gender)

## 
## female   male 
##    518    482

# Bar plot for race/ethnicity
barplot(table(student_performance$race.ethnicity), main = "Race/Ethnicity Distribution", col = "skyblue", border = "black")

# Bar plot for parental education
barplot(table(student_performance$parental.level.of.education), main = "Parental Education Distribution", col = "lightgreen", border = "black")

# Bar plot for lunch type
barplot(table(student_performance$lunch), main = "Lunch Type Distribution", col = "lightcoral", border = "black")

# Bar plot for test preparation
barplot(table(student_performance$test.preparation.course), main = "Test Preparation Distribution", col = "gold", border = "black")

The frequency table and bar plots show how the dataset is composed across each categorical variable. The gender split, for example, is nearly even, with 518 female and 482 male students.
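Counts are often easier to compare as proportions. As a small sketch, the gender counts printed above can be turned into percentages with prop.table(); the named vector `gender_counts` is typed in by hand here to keep the example self-contained.

```r
# Gender counts copied from the frequency table above
gender_counts <- c(female = 518, male = 482)

# prop.table() divides each count by the total
gender_share <- prop.table(gender_counts)

# Express the shares as percentages
print(round(100 * gender_share, 1))
```

On the loaded data, `prop.table(table(student_performance$gender))` should give the same shares, and the same pattern applies to the other categorical columns.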

Khyati Naik

November 26, 2023

Real-World Uses:

Conclusion: