Introduction

This report presents the required visualizations and analyses for the HRS2010_psych_R7.sav dataset. It uses four variables: MI861 (Pulse), MC010 (Diabetes Status), MG006 (Climbing Stairs Difficulty), and MC070 (Arthritis Status), which serves as the grouping variable.

Setup and Data Loading

The code below prepares the R environment and loads the necessary data and packages.

Step-by-Step Walkthrough:

  1. Libraries: Required packages (haven, tidyverse, DT, corrplot) are loaded.
  2. Data Loading: The read_sav() function reads the SPSS file.
  3. Data Cleaning & Variable Preparation:
    • The four final variables are selected.
    • MI861 (Pulse) is kept as is.
    • MC010 (Diabetes) is converted to a factor (DIABETES_cat) for plotting and to a numeric 0/1 score (DIABETES_num) for correlation.
    • MG006 (Stairs Difficulty) is converted to an ordered factor (STAIRS_cat) for plotting. Only codes 1 to 3 are declared as levels, so any other code becomes NA.
    • MC070 (Arthritis Status) is converted to the main grouping factor (ARTHRITIS_cat).
    • Rows with codes outside the ranges kept by the filter (MI861 > 0, MG006 < 6, MC010 < 5, MC070 < 5) are removed, and drop_na() then removes any row that still contains a missing value.

# List of required packages
packages <- c("haven", "tidyverse", "DT", "corrplot")

# Check, install, and load packages
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
# invisible() suppresses the printed list of attached packages
invisible(lapply(packages, library, character.only = TRUE))

# -----------------
# 2. LOAD DATA
# -----------------
data <- read_sav("HRS2010_psych_R7.sav")

# -----------------
# 3. SELECT AND PREPARE VARIABLES
# -----------------
data_clean <- data %>%
  select(MI861, MC010, MG006, MC070) %>%
  # Filter out missing codes (typically > 5 or negative in HRS data)
  filter(MI861 > 0 & MG006 < 6 & MC010 < 5 & MC070 < 5) %>%
  
  # Prepare Categorical Variables
  mutate(
    # MC070 (Arthritis) as main grouping factor
    ARTHRITIS_cat = as_factor(MC070), 
    # MC010 (Diabetes) as a plotting factor
    DIABETES_cat = as_factor(MC010),
    # MG006 (Stairs Difficulty) as an ordered factor for plotting
    STAIRS_cat = factor(MG006, ordered = TRUE, levels = c(1, 2, 3), 
                        labels = c("1: Not Difficult", "2: A Little Difficult", "3: Very Difficult/Cannot Do")),
    # MC010 for correlation (numeric 0=No, 1=Yes)
    DIABETES_num = if_else(MC010 == 1, 1, 0)
  ) %>%
  drop_na() # Remove any remaining NA values

# Preview the first rows of the cleaned data
head(data_clean)
# Use the DT package to create a paged, interactive data table
DT::datatable(data_clean,
              options = list(pageLength = 10, scrollX = TRUE),
              caption = "Interactive Data Table of Selected Variables (Pulse, Diabetes, Stairs, Arthritis)")
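
The STAIRS_cat step declares only levels 1 to 3, so any other code turns into NA and is later removed by drop_na(). The sketch below shows that behavior on a made-up vector of codes (the values are illustrative and not taken from the HRS file) and why tabulating raw codes before filtering is a cheap check. Whether this affects MG006 depends on its actual coding, which should be confirmed against the codebook.

```r
# Toy stand-in for a coded survey item; these values are illustrative only
codes <- c(1, 1, 5, 5, 5, 6, 8, 9)

# Tabulate the raw codes before deciding on a filter
raw_counts <- table(codes)
print(raw_counts)

# Same pattern as STAIRS_cat: only codes 1, 2 and 3 are declared as levels
as_levels <- factor(codes, levels = c(1, 2, 3))

# Codes that match no declared level become NA
n_unmatched <- sum(is.na(as_levels))

# What would survive a drop_na() on the factor column
kept <- codes[!is.na(as_levels)]
print(kept)
```

If every surviving row carries the same code, as in this toy vector, the variable has zero variance in the cleaned data.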

1. Visualize the Basic Distribution of Certain Key Variables

1A. Distribution of Pulse (Heart Rate) (MI861) (Histogram)

Step-by-Step Walkthrough:

Plot Type: A histogram is chosen to show the frequency distribution of the continuous variable MI861 (Pulse).

geom_histogram(): Adds the histogram layer. binwidth = 5 groups the heart rates in steps of 5 bpm.

# Histogram for MI861 (Pulse)
data_clean %>%
  ggplot(aes(x = MI861)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
  labs(
    title = "Distribution of Respondent's Heart Rate (Pulse)",
    x = "Pulse (Beats Per Minute, bpm)",
    y = "Count of Respondents"
  ) +
  theme_minimal()
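
To make the effect of binwidth = 5 concrete, the sketch below bins a small made-up vector of pulse values into 5 bpm intervals with base R. The values and the break points are assumptions chosen for illustration; ggplot2 chooses its own bin boundaries, so its counts can differ slightly from a manual binning like this one.

```r
# Illustrative pulse values (not from the HRS file)
pulse <- c(58, 61, 64, 66, 70, 72, 73, 79, 85)

# Break points every 5 bpm, wide enough to cover the toy data
breaks <- seq(55, 90, by = 5)

# Assign each value to a left-closed interval such as [70,75)
bins <- cut(pulse, breaks = breaks, right = FALSE)

# Count how many values fall into each bin; empty bins are kept
bin_counts <- table(bins)
print(bin_counts)
```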

1B. Distribution of Difficulty Climbing Stairs (MG006) (Bar Chart)

Step-by-Step Walkthrough:

Plot Type: A bar chart displays the counts for the ordinal variable STAIRS_cat.

geom_bar(): Plots the frequency of each difficulty level.

# Bar Chart for STAIRS_cat (Difficulty Climbing Stairs)
data_clean %>%
  ggplot(aes(x = STAIRS_cat)) +
  geom_bar(fill = "gold", color = "black") +
  labs(
    title = "Distribution of Difficulty Climbing Stairs",
    x = "Stairs Difficulty Level",
    y = "Count of Respondents"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
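
The counts that geom_bar() draws can be checked numerically with table(). The sketch below uses a made-up ordered factor with shortened labels (both are assumptions for illustration). Note that table() reports levels with zero rows, whereas a ggplot2 discrete axis drops unused levels unless scale_x_discrete(drop = FALSE) is added.

```r
# Illustrative ordered factor; codes and labels are made up for this sketch
stairs <- factor(
  c(1, 1, 2, 1, 2, 1),
  levels  = c(1, 2, 3),
  labels  = c("Not Difficult", "A Little Difficult", "Very Difficult"),
  ordered = TRUE
)

# Frequency of each level, in level order, including empty levels
level_counts <- table(stairs)
print(level_counts)

# The same information as shares of the total
level_share <- prop.table(level_counts)
print(round(level_share, 2))
```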

2. Run Scatterplots

Scatterplot of Pulse vs. Diabetes Status, Grouped by Arthritis Status

Step-by-Step Walkthrough:

Variables: Plots the continuous variable Pulse (MI861) (y-axis) against the categorical variable Diabetes Status (DIABETES_cat) (x-axis), with points colored by Arthritis Status (ARTHRITIS_cat).

geom_jitter(): Used to spread out the points on the categorical x-axis, revealing the density of Pulse scores within each group.

geom_boxplot(): Adds a boxplot layer behind the jittered points to show the median and quartiles clearly for comparison.

# Scatterplot (Jittered) comparing Pulse by Diabetes Status, grouped by Arthritis
data_clean %>%
  ggplot(aes(x = DIABETES_cat, y = MI861, color = ARTHRITIS_cat)) +
  # Boxplots to show distribution
  geom_boxplot(width = 0.5, alpha = 0.5, outlier.shape = NA) +
  # Jittered points, dodged by arthritis group so they line up with their boxes
  geom_jitter(position = position_jitterdodge(jitter.width = 0.1, dodge.width = 0.5), alpha = 0.6) +
  labs(
    title = "Heart Rate (Pulse) by Diabetes Status",
    subtitle = "Grouped by Arthritis Status",
    x = "Diabetes Status (MC010)",
    y = "Pulse (Beats Per Minute, bpm)",
    color = "Arthritis Status (MC070)"
  ) +
  theme_minimal() +
  scale_color_brewer(palette = "Set1")
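
Each box in the plot summarizes one combination of diabetes and arthritis status. A numeric counterpart is a table of group medians, sketched below with base R on a made-up data frame (the column names and values are assumptions for illustration, not the HRS variables).

```r
# Illustrative data; column names and values are made up for this sketch
toy <- data.frame(
  pulse     = c(60, 64, 70, 72, 66, 68, 80, 76),
  diabetes  = c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"),
  arthritis = c("Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No")
)

# Median pulse for every diabetes x arthritis combination
group_medians <- aggregate(pulse ~ diabetes + arthritis, data = toy, FUN = median)
print(group_medians)
```

A combination that is missing from this table has no rows in the data, which is also why no box would be drawn for it.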

3. Visualize and Compare Among Some Key Variables

Comparison of Pulse (Heart Rate) by Arthritis Status (Boxplot)

Step-by-Step Walkthrough:

Plot Type: A boxplot compares the distribution of the continuous variable Pulse (MI861) across the two categories of Arthritis Status (ARTHRITIS_cat).

geom_boxplot(): Creates the boxplot, illustrating differences in the median, spread, and potential outliers.

# Boxplot for comparing MI861 (Pulse) between MC070 (Arthritis) groups
data_clean %>%
  ggplot(aes(x = ARTHRITIS_cat, y = MI861, fill = ARTHRITIS_cat)) +
  geom_boxplot(alpha = 0.8) +
  labs(
    title = "Pulse Distribution by Self-Reported Arthritis Status",
    x = "Arthritis Status (MC070)",
    y = "Pulse (Beats Per Minute, bpm)",
    fill = "Arthritis Status"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
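
The median, spread, and outliers that a boxplot shows can be reproduced by hand. The sketch below assumes the usual 1.5 × IQR whisker rule, which is the documented default of geom_boxplot(), and applies it to a made-up pulse vector (the values are illustrative only).

```r
# Illustrative pulse values with one extreme reading
pulse <- c(55, 60, 62, 64, 66, 68, 70, 72, 74, 110)

# Lower hinge, median, upper hinge
q <- quantile(pulse, probs = c(0.25, 0.5, 0.75))
iqr <- q[["75%"]] - q[["25%"]]

# Fences at 1.5 * IQR beyond the hinges
fence_lo <- q[["25%"]] - 1.5 * iqr
fence_hi <- q[["75%"]] + 1.5 * iqr

# Whiskers end at the most extreme values still inside the fences
whisker_lo <- min(pulse[pulse >= fence_lo])
whisker_hi <- max(pulse[pulse <= fence_hi])

# Anything beyond the fences is drawn as an individual outlier point
outliers <- pulse[pulse < fence_lo | pulse > fence_hi]
print(outliers)
```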

4. Correlation Visualizations Among Some Key Variables

Correlation Matrix Plot (Correlogram)

Step-by-Step Walkthrough:

1. Data Preparation: Calculates Pearson correlation coefficients (the cor() default) among Pulse, the 0/1 diabetes indicator, and the numeric stairs difficulty code.

2. Tool: The dedicated corrplot package is used for visualizing the matrix.

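3. FIX: The order parameter is changed from “hclust” to “original” to avoid the NA/NaN error caused by zero variance in one of the variables. This only works around the symptom: cor() still returns NA for every correlation involving the constant variable, as the warning in the output below indicates, so the filtering step should be checked to see why that variable takes a single value.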

# 1. Select the final numerical variables for correlation
numeric_data <- data_clean %>%
  select(MI861, DIABETES_num, MG006) %>%
  rename(Pulse = MI861, Diabetes = DIABETES_num, Stairs_Difficulty = MG006)

# 2. Calculate the correlation matrix
correlation_matrix <- cor(numeric_data)
## Warning in cor(numeric_data): the standard deviation is zero
# 3. Visualize the correlation matrix (using order = "original" to prevent clustering error)
corrplot::corrplot(
  correlation_matrix,
  method = "circle",      # Shape of the representation
  type = "upper",         # Only show the upper half
  diag = FALSE,           
  order = "original",     # FIX: Use original ordering to avoid hclust failure
  addCoef.col = "black",  # Add correlation coefficients as text
  tl.srt = 45,            # Rotate axis labels
  title = " \nCorrelation Matrix of Pulse, Diabetes, and Stairs Difficulty" 
)
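
The "standard deviation is zero" warning means at least one column is constant, and cor() reports NA for every pair that involves it. The sketch below reproduces this on a made-up data frame (values are illustrative; the constant column is deliberate) and shows one possible guard: keeping only columns with non-zero standard deviation before computing the matrix.

```r
# Illustrative data; the Stairs column is constant on purpose
toy <- data.frame(
  Pulse    = c(60, 72, 68, 80, 75),
  Diabetes = c(0, 1, 0, 1, 1),
  Stairs   = c(1, 1, 1, 1, 1)
)

# cor() warns and returns NA for pairs involving the constant column
m <- suppressWarnings(cor(toy))
print(m)

# Identify columns that actually vary
has_variance <- vapply(toy, function(x) sd(x) > 0, logical(1))
print(names(toy)[!has_variance])

# Correlation matrix restricted to the varying columns
m_ok <- cor(toy[, has_variance, drop = FALSE])
print(m_ok)
```

Dropping the constant column makes the matrix usable, but it does not explain why the column is constant; that usually traces back to the filtering and recoding steps.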

5. Select one or two of your favorite plots and try to mimic

Mimicking a Stacked Bar Chart (Arthritis and Stairs Difficulty)

# 1. Summarize data to calculate percentages within each Arthritis category
data_summary_stacked <- data_clean %>%
  group_by(ARTHRITIS_cat, STAIRS_cat) %>%
  summarise(count = n(), .groups = 'drop') %>%
  group_by(ARTHRITIS_cat) %>%
  mutate(percentage = count / sum(count))

# 2. Create the Stacked Bar Chart
ggplot(data_summary_stacked, aes(x = ARTHRITIS_cat, y = percentage, fill = STAIRS_cat)) +
  geom_col(position = "fill") +
  labs(
    title = "Proportion of Stairs Difficulty by Arthritis Status",
    x = "Arthritis Status (MC070)",
    y = "Proportion of Respondents (100% Stacked)",
    fill = "Stairs Difficulty (MG006)"
  ) +
  scale_y_continuous(labels = scales::percent) +
  theme_minimal() +
  scale_fill_brewer(palette = "YlOrRd")
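
The within-group percentages computed by the group_by(ARTHRITIS_cat) and mutate() steps can be cross-checked with a base R contingency table. The sketch below uses made-up vectors with shortened labels (both are assumptions for illustration); margin = 1 normalizes each row, so the shares within every arthritis group sum to 1.

```r
# Illustrative vectors; labels and values are made up for this sketch
arthritis <- c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No")
stairs    <- c("Not Difficult", "A Little", "A Little", "Very",
               "Not Difficult", "Not Difficult", "Not Difficult", "A Little")

# Cross-tabulate counts: rows are arthritis groups, columns are stairs levels
counts <- table(arthritis, stairs)

# Row-wise proportions, the quantity shown by a 100% stacked bar
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```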