DATA 101 - Project #1
Starter Code
Loading the libraries and the dataset. More about the dataset can be found on the OpenIntro page listed under References.
library(tidyverse)
library(ggplot2)  # already attached by tidyverse; kept to make the plotting dependency explicit
setwd("C:/Users/kpeter81/OneDrive - montgomerycollege.edu/Datasets")
hsb <- read_csv("hsb2 - hsb2.csv")
Introduction
On which standardized test (math, reading, writing, science, or social studies) did students with low social economic status score worst? Data was collected on 200 randomly selected 12th-grade students by the National Center for Education Statistics. Eleven variables were recorded, including gender, race, social economic status, school type, and scores on various standardized tests (OpenIntro).
The social economic status variable (ses) is a categorical variable with three levels: low, middle, and high. The standardized test scores (read, write, math, science, socst) are all numeric variables. In this project I will find which school subject students with low social economic status scored worst on.
Data Analysis
Since I want to determine which test students with low social economic status scored worst on, I first need to make sure that all the variable names are lowercase and contain no spaces, and that there are no NAs in the columns I will be using. Then I will create a table that displays the median score on each test subject (reading, writing, math, science, and social studies) for each economic status, so that I can compare scores across statuses when I examine the results. To do this, I will group the dataset by social economic status and use the summarize function to compute the median for each subject. The lowest median in the low-status row then identifies the subject those students scored worst on.
Cleaning the dataset
# ensuring the variables are lowercase and contain no spaces
names(hsb) <- tolower(names(hsb))
names(hsb) <- gsub(" ","_",names(hsb))
head(hsb)
# A tibble: 6 × 11
     id gender race  ses    schtyp prog        read write  math science socst
  <dbl> <chr>  <chr> <chr>  <chr>  <chr>      <dbl> <dbl> <dbl>   <dbl> <dbl>
1    70 male   white low    public general       57    52    41      47    57
2   121 female white middle public vocational    68    59    53      63    61
3    86 male   white high   public general       44    33    54      58    31
4   141 male   white high   public vocational    63    44    47      53    56
5   172 male   white middle public academic      47    52    57      53    61
6   113 male   white middle public academic      44    52    51      63    61
# Checking for NA values
colSums(is.na(hsb))
     id  gender    race     ses  schtyp    prog    read   write    math science   socst
      0       0       0       0       0       0       0       0       0       0       0
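One optional extra cleaning step, sketched here under assumptions: because ses is read in as a character column, grouped tables and plot axes sort its values alphabetically (high, low, middle). Converting it to a factor with explicit levels would put the categories in their natural order. The name `ses_demo` and its rows are stand-ins for illustration only; the level spellings are taken from the head(hsb) output.

```r
library(dplyr)
library(tibble)

# Stand-in data: only the ses values from the six rows shown by head(hsb).
# `ses_demo` is a hypothetical name; with the real data, apply the same
# mutate() call to `hsb`.
ses_demo <- tibble(ses = c("low", "middle", "high", "high", "middle", "middle"))

# Convert ses to a factor whose levels run from low to high.
ses_demo <- ses_demo |>
  mutate(ses = factor(ses, levels = c("low", "middle", "high")))

# The levels, and any grouped summary, now follow low -> middle -> high.
levels(ses_demo$ses)
count(ses_demo, ses)
```

Applied to hsb before grouping and plotting, the same mutate() call should make the summary table and the box plots list the statuses from low to high rather than alphabetically.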
Data Analysis
# median score on each test, by social economic status
hsb1 <- hsb |>
  group_by(ses) |>
  summarize(read_med = median(read),
            write_med = median(write),
            math_med = median(math),
            sci_med = median(science),
            socst_med = median(socst))
hsb1
# A tibble: 3 × 6
  ses    read_med write_med math_med sci_med socst_med
  <chr>     <dbl>     <dbl>    <dbl>   <dbl>     <dbl>
1 high       57.5        59       57      58        61
2 low        47          49       46      47        46
3 middle     50          54       52      53        51
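Reading the lowest median off the table by eye works for three rows, but the same comparison can be done in code. The following is a minimal sketch, not part of the original analysis: it retypes the medians from the table above into a hypothetical tibble called `medians`, reshapes it to one row per status and subject, and keeps the subject or subjects with the lowest median in each status, ties included.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Medians copied by hand from the summary table above (not recomputed here).
# `medians` and `lowest` are hypothetical names used only in this sketch.
medians <- tribble(
  ~ses,     ~read, ~write, ~math, ~science, ~socst,
  "high",   57.5,  59,     57,    58,       61,
  "low",    47,    49,     46,    47,       46,
  "middle", 50,    54,     52,    53,       51
)

# One row per (ses, subject), then keep the lowest median within each ses.
# with_ties = TRUE keeps every subject that shares the minimum.
lowest <- medians |>
  pivot_longer(-ses, names_to = "subject", values_to = "median_score") |>
  group_by(ses) |>
  slice_min(median_score, n = 1, with_ties = TRUE) |>
  ungroup()

lowest
```

Keeping ties matters here because two subjects can share the lowest median within one status, and dropping one of them would misstate the result.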
Visualization
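The five box plots in this section all follow the same template, with only the score column and the labels changing. As a compact alternative, the scores could be reshaped to long format and drawn as one faceted figure. This is a sketch under assumptions: `hsb_demo` holds only the six rows shown by head(hsb) as stand-in data, and the names `hsb_long` and `p_facets` are hypothetical.

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(tibble)

# Stand-in data: the six rows printed by head(hsb), test-score columns only.
# With the full dataset, replace `hsb_demo` with `hsb`.
hsb_demo <- tribble(
  ~id, ~ses,     ~read, ~write, ~math, ~science, ~socst,
  70,  "low",    57,    52,     41,    47,       57,
  121, "middle", 68,    59,     53,    63,       61,
  86,  "high",   44,    33,     54,    58,       31,
  141, "high",   63,    44,     47,    53,       56,
  172, "middle", 47,    52,     57,    53,       61,
  113, "middle", 44,    52,     51,    63,       61
)

# Long format: one row per student and subject.
hsb_long <- hsb_demo |>
  pivot_longer(read:socst, names_to = "subject", values_to = "score")

# One panel per subject, sharing the same y-axis so panels are comparable.
p_facets <- ggplot(hsb_long, aes(x = ses, y = score)) +
  geom_boxplot(fill = "steelblue") +
  facet_wrap(~ subject) +
  labs(title = "Standardized Test Scores by Social Economic Status",
       x = "Social Economic Status",
       y = "Standardized Score")
p_facets
```

A shared y-axis makes it easier to compare subjects against each other, which separate plots with independent scales do not guarantee.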
# Math scores by social economic status
hsb_box <- hsb |>
  ggplot(aes(x = ses, y = math, fill = ses)) +
  geom_boxplot() +
  scale_fill_manual(values = c("steelblue", "steelblue", "steelblue")) +
  labs(title = "Math Scores by Social Economic Status",
       x = "Social Economic Status",
       y = "Standardized Math Scores")
hsb_box
# Reading scores by social economic status
hsb_box_read <- hsb |>
  ggplot(aes(x = ses, y = read, fill = ses)) +
  geom_boxplot() +
  scale_fill_manual(values = c("steelblue", "steelblue", "steelblue")) +
  labs(title = "Reading Scores by Social Economic Status",
       x = "Social Economic Status",
       y = "Standardized Reading Scores")
hsb_box_read
# Writing scores by social economic status
hsb_box_write <- hsb |>
  ggplot(aes(x = ses, y = write, fill = ses)) +
  geom_boxplot() +
  scale_fill_manual(values = c("steelblue", "steelblue", "steelblue")) +
  labs(title = "Writing Scores by Social Economic Status",
       x = "Social Economic Status",
       y = "Standardized Writing Scores")
hsb_box_write
# Social Studies scores by social economic status
hsb_box_socst <- hsb |>
  ggplot(aes(x = ses, y = socst, fill = ses)) +
  geom_boxplot() +
  scale_fill_manual(values = c("steelblue", "steelblue", "steelblue")) +
  labs(title = "Social Studies Scores by Social Economic Status",
       x = "Social Economic Status",
       y = "Standardized Social Studies Scores")
hsb_box_socst
# Science scores by social economic status
hsb_box_science <- hsb |>
  ggplot(aes(x = ses, y = science, fill = ses)) +
  geom_boxplot() +
  scale_fill_manual(values = c("steelblue", "steelblue", "steelblue")) +
  labs(title = "Science Scores by Social Economic Status",
       x = "Social Economic Status",
       y = "Standardized Science Scores")
hsb_box_science
Conclusion
Through my code I found that students with low social economic status scored worst on math and social studies, which tied for the lowest median score at 46. More interesting, however, is the clear positive association between social economic status and test scores. On every test, students with high economic status had the highest median score, those with middle economic status fell in between, and those with low economic status had the lowest. This suggests that, though low social status doesn’t necessarily cause low scores, the two variables are related. Further research should be conducted to understand both the manner and the extent of that relationship, and possibly to find ways to narrow the score gap between social economic groups.
References
- Dataset retrieved from OpenIntro, collected by UCLA Institute for Digital Research & Education - Statistical Consulting. (https://www.openintro.org/data/index.php?data=hsb2)