DATA 101 - Project #1

Author

Kalina Peterson

Starter Code

Loading the Libraries and the dataset.

More about the dataset can be found on OpenIntro: https://www.openintro.org/data/index.php?data=hsb2

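# Load packages (tidyverse already attaches ggplot2, so the second call is harmless)
library(tidyverse)
library(ggplot2)
# Point R at the folder that holds the CSV, then read it into a tibble
setwd("C:/Users/kpeter81/OneDrive - montgomerycollege.edu/Datasets")
hsb <- read_csv("hsb2 - hsb2.csv")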

Introduction

Which standardized test (math, reading, writing, science, or social studies) did students with low social economic status score worst on? The data come from a survey of 200 randomly selected 12th-grade students conducted by the National Center for Education Statistics. The dataset contains 11 variables, including gender, race, social economic status, school type, and scores on five standardized tests (OpenIntro).

The social economic status variable (ses) is a categorical variable with three levels: low, middle, and high. Because the file is loaded with read_csv, it is stored as a character column rather than a factor. The standardized test scores (read, write, math, science, socst) are all numeric variables. In this project I will find which subject students with low social economic status scored worst on.

Data Analysis

Since I want to determine which test students with low social economic status scored worst on, I first need to ensure that all the variable names are lowercase and contain no spaces. I also want to confirm that there are no NA’s in the columns I will be using. Then I will create a table of the median score on each test (reading, writing, math, science, and social studies) for each social economic status, so that I can compare scores across statuses when I examine the results. To do this, I will group the dataset by social economic status and then use the summarize function to compute the median for each subject. From that table I can read off which subject had the lowest median for each status.

Cleaning the dataset

# ensuring the variables are lowercase and contain no spaces
names(hsb) <- tolower(names(hsb))
names(hsb) <- gsub(" ","_",names(hsb))
head(hsb)

# A tibble: 6 × 11
     id gender race  ses    schtyp prog        read write  math science socst
  <dbl> <chr>  <chr> <chr>  <chr>  <chr>      <dbl> <dbl> <dbl>   <dbl> <dbl>
1    70 male   white low    public general       57    52    41      47    57
2   121 female white middle public vocational    68    59    53      63    61
3    86 male   white high   public general       44    33    54      58    31
4   141 male   white high   public vocational    63    44    47      53    56
5   172 male   white middle public academic      47    52    57      53    61
6   113 male   white middle public academic      44    52    51      63    61

# Checking for NA values
colSums(is.na(hsb))

     id  gender    race     ses  schtyp    prog    read   write    math science 
      0       0       0       0       0       0       0       0       0       0 
  socst 
      0
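The two cleaning checks above can also be written as assertions that stop the script if something is wrong, rather than relying on reading the printed output. The following is a minimal sketch in base R. The helper name check_clean and the small sample_rows data frame (the first two rows printed by head(hsb), with only the columns used later) are my own illustration and are not part of the original analysis.

```r
# Hypothetical helper (not part of the original analysis): raises an error if
# any column name has uppercase letters or spaces, or if any of the listed
# columns contains an NA value.
check_clean <- function(data, cols) {
  nms <- names(data)
  stopifnot(
    "column names must be lowercase" = identical(nms, tolower(nms)),
    "column names must not contain spaces" = !any(grepl(" ", nms, fixed = TRUE)),
    "selected columns must not contain NA" = !any(is.na(data[cols]))
  )
  invisible(TRUE)
}

# First two rows of the dataset as printed by head(hsb), analysis columns only
sample_rows <- data.frame(
  ses     = c("low", "middle"),
  read    = c(57, 68),
  write   = c(52, 59),
  math    = c(41, 53),
  science = c(47, 63),
  socst   = c(57, 61)
)

score_cols <- c("ses", "read", "write", "math", "science", "socst")
check_clean(sample_rows, score_cols)
```

On the full dataset the equivalent call would be check_clean(hsb, score_cols). It returns silently when the data passes and stops with one of the messages above otherwise.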

Data Analysis

hsb1 <- hsb |>
  group_by(ses) |>
  summarize(read_med = median(read),
            write_med = median(write),
            math_med = median(math),
            sci_med = median(science),
            socst_med = median(socst))
hsb1

# A tibble: 3 × 6
  ses    read_med write_med math_med sci_med socst_med
  <chr>     <dbl>     <dbl>    <dbl>   <dbl>     <dbl>
1 high       57.5        59       57      58        61
2 low        47          49       46      47        46
3 middle     50          54       52      53        51
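Rather than scanning the table by eye, the lowest median for each status can be picked out programmatically. The sketch below is one possible way to do it, assuming dplyr and tidyr are installed. It re-enters the medians from the table above by hand (with the columns renamed to plain subject names, which is my own choice), reshapes them to one row per status and subject, and keeps the smallest value in each group.

```r
library(dplyr)
library(tidyr)

# Median scores copied by hand from the summary table above
medians <- data.frame(
  ses     = c("high", "low", "middle"),
  read    = c(57.5, 47, 50),
  write   = c(59, 49, 54),
  math    = c(57, 46, 52),
  science = c(58, 47, 53),
  socst   = c(61, 46, 51)
)

# Reshape to one row per (ses, subject) pair, then keep the lowest median
# within each status; with_ties = TRUE keeps every subject that shares it
lowest <- medians |>
  pivot_longer(-ses, names_to = "subject", values_to = "median_score") |>
  group_by(ses) |>
  slice_min(median_score, with_ties = TRUE) |>
  ungroup()

lowest
```

Keeping ties matters here: for the low group, math and social studies share the lowest median of 46, so both rows are returned.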

Visualization

# Math scores by social economic status
hsb_box <- hsb |>
  ggplot(aes(x = ses, y = math, fill = ses)) +
  geom_boxplot() +
  scale_fill_manual(values = c("steelblue", "steelblue", "steelblue")) +
  labs(title = "Math Scores by Social Economic Status", x = "Social Economic Status", y = "Standardized Math Scores")
hsb_box

# Reading scores by social economic status
hsb_box_read <- hsb |>
  ggplot(aes(x = ses, y = read, fill = ses)) +
  geom_boxplot() +
  scale_fill_manual(values = c("steelblue", "steelblue", "steelblue")) +
  labs(title = "Reading Scores by Social Economic Status", x = "Social Economic Status", y = "Standardized Reading Scores")
hsb_box_read

# Writing scores by social economic status
hsb_box_write <- hsb |>
  ggplot(aes(x = ses, y = write, fill = ses)) +
  geom_boxplot() +
  scale_fill_manual(values = c("steelblue", "steelblue", "steelblue")) +
  labs(title = "Writing Scores by Social Economic Status", x = "Social Economic Status", y = "Standardized Writing Scores")
hsb_box_write

# Social Studies scores by social economic status
hsb_box_socst <- hsb |>
  ggplot(aes(x = ses, y = socst, fill = ses)) +
  geom_boxplot() +
  scale_fill_manual(values = c("steelblue", "steelblue", "steelblue")) +
  labs(title = "Social Studies Scores by Social Economic Status", x = "Social Economic Status", y = "Standardized Social Studies Scores")
hsb_box_socst

# Science scores by social economic status
hsb_box_science <- hsb |>
  ggplot(aes(x = ses, y = science, fill = ses)) +
  geom_boxplot() +
  scale_fill_manual(values = c("steelblue", "steelblue", "steelblue")) +
  labs(title = "Science Scores by Social Economic Status", x = "Social Economic Status", y = "Standardized Science Scores")
hsb_box_science
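The five plots above share the same structure, so they could also be drawn as a single faceted figure, which puts all subjects on a common scale. This is only a sketch of an alternative, not a replacement for the plots above. The helper name plot_scores_by_ses is hypothetical, and ordering the levels as low, middle, high (instead of alphabetically) is my own choice.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical helper: draws all five subjects as one faceted boxplot figure.
# Expects a data frame with columns ses, read, write, math, science, socst.
plot_scores_by_ses <- function(data) {
  data |>
    # Order the levels low < middle < high instead of alphabetically
    mutate(ses = factor(ses, levels = c("low", "middle", "high"))) |>
    # Stack the five score columns into subject / score pairs
    pivot_longer(c(read, write, math, science, socst),
                 names_to = "subject", values_to = "score") |>
    ggplot(aes(x = ses, y = score)) +
    geom_boxplot(fill = "steelblue") +
    facet_wrap(~ subject) +
    labs(title = "Standardized Test Scores by Social Economic Status",
         x = "Social Economic Status",
         y = "Standardized Score")
}

# Usage with the dataset loaded at the top of this document:
# plot_scores_by_ses(hsb)
```

Setting fill = "steelblue" inside geom_boxplot() colors every box the same way without mapping fill to ses, so no legend is drawn.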

Conclusion

Through my code I found that students with low social economic status scored worst on math and social studies, which tied for the lowest median score (46). More interesting, however, is the clear positive association between social economic status and test scores. In every subject, students with high social economic status had the highest median score, students with middle status fell in between, and students with low status had the lowest. This implies that, though low social economic status doesn’t necessarily cause low scores, the two variables are related. Further research should be conducted to understand both the manner and extent to which they are related, and possibly to discover a way to close the gap in scores between social economic groups.

References

Dataset retrieved from OpenIntro, collected by UCLA Institute for Digital Research & Education - Statistical Consulting. (https://www.openintro.org/data/index.php?data=hsb2)