Introduction

In this R script, we follow a structured approach to visualize per capita income by state in the U.S. using the built-in dataset state.x77.

The dataset ships with base R in the datasets package and is derived from U.S. Census Bureau publications of the 1970s. It is a matrix with one row per state and eight socio-economic indicators as columns. Its Income column records per capita income for 1974, not median household income, so the figures are historical. Because it is available in every R session without a download, it serves as a classic example for teaching statistical analysis and data visualization.

Step 1: Load library

The script begins by loading the tidyverse package, which is essential for data manipulation and visualization. This package includes tools such as ggplot2, dplyr, and tibble that streamline the data analysis process.

# Step 1: Load necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Step 2: Load and prepare the dataset

The built-in dataset state.x77 provides various statistics about U.S. states. It belongs to the datasets package, which R attaches at startup, so it is available without an explicit loading step. The dataset is converted from its default matrix form into a data frame to make manipulation easier. A new column named State is then added to this data frame, containing the state names that serve as row names in the matrix.

# Step 2: Load and explore the dataset
data(state)  # Optional: state.x77 is part of the 'state' data set, which is available by default
# Convert the dataset to a data frame and add state names as a column
data <- as.data.frame(state.x77)
data$State <- rownames(state.x77)  # Add state names as a column

# View the structure of the dataset
glimpse(data)
## Rows: 50
## Columns: 9
## $ Population <dbl> 3615, 365, 2212, 2110, 21198, 2541, 3100, 579, 8277, 4931, …
## $ Income     <dbl> 3624, 6315, 4530, 3378, 5114, 4884, 5348, 4809, 4815, 4091,…
## $ Illiteracy <dbl> 2.1, 1.5, 1.8, 1.9, 1.1, 0.7, 1.1, 0.9, 1.3, 2.0, 1.9, 0.6,…
## $ `Life Exp` <dbl> 69.05, 69.31, 70.55, 70.66, 71.71, 72.06, 72.48, 70.06, 70.…
## $ Murder     <dbl> 15.1, 11.3, 7.8, 10.1, 10.3, 6.8, 3.1, 6.2, 10.7, 13.9, 6.2…
## $ `HS Grad`  <dbl> 41.3, 66.7, 58.1, 39.9, 62.6, 63.9, 56.0, 54.6, 52.6, 40.6,…
## $ Frost      <dbl> 20, 152, 15, 65, 20, 166, 139, 103, 11, 60, 0, 126, 127, 12…
## $ Area       <dbl> 50708, 566432, 113417, 51945, 156361, 103766, 4862, 1982, 5…
## $ State      <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California", "…
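
The backticks around Life Exp and HS Grad in the output above mark column names that contain spaces. As a minimal sketch, assuming names that can be typed without backticks are preferred, base R's make.names() can convert them. The object name data_syntactic is hypothetical and is not used elsewhere in the script.

```r
# Sketch: give the data frame syntactic column names (optional convenience step)
data_syntactic <- as.data.frame(state.x77)
data_syntactic$State <- rownames(state.x77)

# make.names() replaces characters that are invalid in R names (here, spaces) with dots
names(data_syntactic) <- make.names(names(data_syntactic))

names(data_syntactic)
```

The rest of the script only uses State and Income, which are already syntactic, so this step is not required for the plot.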

Step 3: Clean the data

In this step, the script selects only the columns relevant to the analysis: State and Income. It also filters out any rows where the Income value is missing. state.x77 contains no missing values, so the filter removes nothing here and acts only as a safeguard.

# Step 3: Clean the data by selecting relevant columns and handling missing values
clean_data <- data %>%
  select(State, Income) %>%  # Select only the 'State' and 'Income' columns
  filter(!is.na(Income))     # Filter out rows with missing 'Income' values (if any)
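
Whether the filter actually removes anything can be checked directly. The following is a minimal base R sketch; the names income and n_missing are illustrative and do not appear in the original script.

```r
# Sketch: count missing Income values in the raw matrix
income <- state.x77[, "Income"]
n_missing <- sum(is.na(income))
n_missing

# Number of states that would survive the filter
n_kept <- sum(!is.na(income))
n_kept
```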

Step 4: Create visualization

Using ggplot2, a bar plot is created to visualize per capita income across U.S. states. The aes function maps the State column to the x-axis (reordered by income from highest to lowest) and Income to the y-axis. The geom_bar function, called with stat = "identity" so that bar heights are taken directly from Income, draws the bars in steel blue. The coord_flip function swaps the axes so that the bars run horizontally, which keeps the 50 state names legible. The labs function adds a title and axis labels, and theme_minimal provides a clean background. Finally, theme reduces the text size on the vertical axis, which shows the state names after the flip.

# Step 4: Create the bar plot
ggplot(clean_data, aes(x = reorder(State, -Income), y = Income)) +  # Order by income (high to low)
  geom_bar(stat = "identity", fill = "steelblue") +  # Create a bar plot with steelblue color
  coord_flip() +  # Flip coordinates for better readability (horizontal bars)
  labs(title = "Per Capita Income by State in the U.S. (1974)",
       x = "State",
       y = "Per Capita Income (1974 USD)") +  # Add title and axis labels
  theme_minimal() +  # Use a clean minimal theme
  theme(axis.text.y = element_text(size = 8))  # Adjust y-axis text size for readability
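
How reorder() determines the bar order can be inspected without drawing the plot. A minimal base R sketch follows; states, income, and state_order are hypothetical names. reorder() returns a factor whose levels are sorted by the second argument, so negating Income puts the highest-income state first. Because coord_flip() draws the first level nearest the origin, that state ends up at the bottom of the flipped chart; passing Income without the minus sign would place it at the top instead.

```r
# Sketch: inspect the factor levels that reorder() produces
states <- rownames(state.x77)
income <- state.x77[, "Income"]

state_order <- reorder(states, -income)  # levels sorted by descending income

head(levels(state_order), 3)  # first levels: highest incomes
tail(levels(state_order), 3)  # last levels: lowest incomes
```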

Step 5: Interpretation

The visualization of 1974 per capita income by state reveals clear regional disparities across the United States. States in the Northeast, on the West Coast, and in parts of the Midwest generally have higher incomes than those in the South. This pattern may reflect the presence of high-paying industries such as technology and finance in these regions, although the dataset contains no industry variables with which to test that explanation.

In contrast, Southern states often report lower per capita incomes. This could be due to a combination of historical economic differences, economies more focused on agriculture, and lower levels of educational attainment. The Midwest displays a range of incomes, with states like Illinois and Minnesota reporting relatively high figures. The lowest figures, however, belong to Southern states such as Mississippi and West Virginia.
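
The regional comparison can be checked numerically with state.region, a factor in the same built-in state data set that assigns each state to one of four Census regions (Northeast, South, North Central, West; North Central corresponds to the Midwest). This is a minimal sketch, assuming an unweighted mean of the state values is an acceptable summary; region_means is a hypothetical name.

```r
# Sketch: unweighted mean of the Income column by Census region
# state.region is ordered like the rows of state.x77
region_means <- tapply(state.x77[, "Income"], state.region, mean)

sort(round(region_means), decreasing = TRUE)
```

An unweighted mean treats every state equally regardless of population, so it describes the typical state in a region rather than the typical resident.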

There are a few notable observations from the data. Alaska stands out with the highest per capita income in the dataset, which might reflect its unique economic conditions and wealth in natural resources. Additionally, states with similar geographic locations or economic characteristics, such as those in the Great Plains or the Southeast, often show similar income levels.
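
To see how far Alaska stands out, the states can be ranked by income. This is a minimal base R sketch; ranked and gap are hypothetical names.

```r
# Sketch: rank states by the Income column and compare the top two
ranked <- sort(state.x77[, "Income"], decreasing = TRUE)

head(ranked, 5)  # five highest values
tail(ranked, 5)  # five lowest values

# Difference between the highest value and the runner-up
gap <- unname(ranked[1] - ranked[2])
gap
```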

Several factors may contribute to these income disparities. Cost of living plays a role: states with expensive housing and transportation usually have higher incomes that offset those costs. Economic diversity also matters, with states that host a mix of high-paying industries typically reporting higher incomes. Education is another factor; states with higher educational attainment tend to have higher incomes, and state.x77 includes an HS Grad column that allows a partial check. Lastly, state policies on taxation, education, and economic development can also affect income levels. Apart from education, none of these factors is recorded in state.x77, so they remain hypotheses rather than findings of this analysis.
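
Of the factors listed above, only education has a counterpart in state.x77: the HS Grad column, which gives the percentage of high-school graduates. The following is a minimal sketch of how its association with income could be checked, with the caveat that a correlation across 50 states does not establish causation; income_edu_cor is a hypothetical name.

```r
# Sketch: Pearson correlation between income and high-school graduation rate
income_edu_cor <- cor(state.x77[, "Income"], state.x77[, "HS Grad"])
round(income_edu_cor, 2)
```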

Overall, this visualization provides a useful view of regional variation in per capita income across the U.S. in the mid-1970s. It suggests how factors such as geography, industry, education, and state policies may relate to economic disparities, although most of these factors would need additional data to test.