Introduction

This report presents the required visualizations and analyses based on the HRS2010_psych_R7.sav dataset. The finalized set of variables is MI861 (Pulse), MC010 (Diabetes Status), and MG006 (Climbing Stairs Difficulty), with MC070 (Arthritis Status) as the grouping variable.

Setup and Data Loading

The code below prepares the R environment and loads the necessary data and packages.

Step-by-Step Walkthrough:

Libraries: Required packages (haven, tidyverse, DT, corrplot) are loaded.
Data Loading: The read_sav() function reads the SPSS file.
Data Cleaning & Variable Preparation:
- The four final variables are selected.
- MI861 (Pulse) is kept as is.
- MC010 (Diabetes) is recoded to a numeric indicator (1 = Yes, 0 otherwise) for correlation, and separately converted to a factor (DIABETES_cat) for plotting.
- MG006 (Stairs Difficulty) is converted to a factor (STAIRS_cat) for plotting.
- MC070 (Arthritis Status) is converted to the main factor grouping variable (ARTHRITIS_cat).
- Data is filtered to remove missing values and invalid codes (e.g., negative values for Refused/Don’t Know).

# List of required packages
packages <- c("haven", "tidyverse", "DT", "corrplot")

# Install any packages that are missing, then load them all
new_packages <- packages[!(packages %in% installed.packages()[, "Package"])]
if (length(new_packages) > 0) install.packages(new_packages)
# invisible() suppresses the list of attached packages that lapply() returns
invisible(lapply(packages, library, character.only = TRUE))

# -----------------
# 2. LOAD DATA
# -----------------
data <- read_sav("HRS2010_psych_R7.sav")

# -----------------
# 3. SELECT AND PREPARE VARIABLES
# -----------------
data_clean <- data %>%
  select(MI861, MC010, MG006, MC070) %>%
  # Keep only valid codes: drop negative values and high missing-data codes.
  # Cut-offs should be confirmed against the HRS codebook.
  filter(
    MI861 > 0,
    MG006 > 0, MG006 < 6,
    MC010 > 0, MC010 < 5,
    MC070 > 0, MC070 < 5
  ) %>%
  # Prepare categorical variables
  mutate(
    # MC070 (Arthritis) as main grouping factor
    ARTHRITIS_cat = as_factor(MC070),
    # MC010 (Diabetes) as a plotting factor
    DIABETES_cat = as_factor(MC010),
    # MG006 (Stairs Difficulty) as an ordered factor for plotting;
    # codes outside levels 1-3 become NA and are removed by drop_na() below
    STAIRS_cat = factor(MG006, ordered = TRUE, levels = c(1, 2, 3),
                        labels = c("1: Not Difficult", "2: A Little Difficult", "3: Very Difficult/Cannot Do")),
    # MC010 for correlation (numeric: 1 = Yes, 0 = any other retained code)
    DIABETES_num = if_else(MC010 == 1, 1, 0)
  ) %>%
  drop_na() # Remove any remaining NA values

# Preview the first rows of the clean data
head(data_clean)

# Use the DT package to create a paged, interactive data table
DT::datatable(data_clean,
              options = list(pageLength = 10, scrollX = TRUE),
              caption = "Interactive Data Table of Selected Variables (Pulse, Diabetes, Stairs, Arthritis)")
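
One consequence of the factor() call above is worth checking: any MG006 code outside the declared levels (1, 2, 3) becomes NA in STAIRS_cat, and drop_na() then removes that row. The following is a minimal base R sketch of that behavior. The codes in stairs_codes are made up for illustration and are not taken from the HRS file.

```r
# Hypothetical stairs-difficulty codes (illustrative only, not HRS data)
stairs_codes <- c(1, 2, 3, 5, 1)

# Same pattern as STAIRS_cat: only levels 1-3 are declared
stairs_factor <- factor(stairs_codes, levels = c(1, 2, 3), ordered = TRUE)

# Codes outside the declared levels are silently converted to NA
print(stairs_factor)

# Number of rows that a later drop_na() would remove
n_dropped <- sum(is.na(stairs_factor))
print(n_dropped)
```

Comparing nrow(data) with nrow(data_clean) after the pipeline gives the same kind of check on the real data.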

1. Visualize the Basic Distribution of Certain Key Variables

1A. Distribution of Pulse (Heart Rate) (MI861) (Histogram)

Step-by-Step Walkthrough:

Plot Type: A histogram is chosen to show the frequency distribution of the continuous variable MI861 (Pulse).

geom_histogram(): Adds the histogram layer. binwidth = 5 groups the heart rates in steps of 5 bpm.

Note: the echo = FALSE chunk option prevents the R code that generates a plot from being printed in the knitted report.

# Histogram for MI861 (Pulse)
data_clean %>%
  ggplot(aes(x = MI861)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
  labs(
    title = "Distribution of Respondent's Heart Rate (Pulse)",
    x = "Pulse (Beats Per Minute, bpm)",
    y = "Count of Respondents"
  ) +
  theme_minimal()
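
To make the effect of binwidth = 5 concrete, the sketch below groups a handful of hypothetical pulse readings into 5-bpm bins with base R. The values and the bin edges (55 to 90) are assumptions chosen for illustration. ggplot2 picks its own bin edges unless boundary or center is set, so the exact counts in the plot can differ.

```r
# Hypothetical pulse readings in bpm (illustrative only)
pulse <- c(58, 61, 64, 66, 70, 72, 73, 79, 85)

# 5-bpm bin edges covering the range of the toy data
breaks <- seq(55, 90, by = 5)

# Assign each reading to a left-closed bin such as [70,75)
bins <- cut(pulse, breaks = breaks, right = FALSE)

# Readings per bin, analogous to the bar heights of a histogram
bin_counts <- table(bins)
print(bin_counts)
```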

1B. Distribution of Difficulty Climbing Stairs (MG006) (Bar Chart)

Step-by-Step Walkthrough:

Plot Type: A bar chart displays the counts for the ordinal variable STAIRS_cat.

geom_bar(): Plots the frequency of each difficulty level.

# Bar Chart for STAIRS_cat (Difficulty Climbing Stairs)
data_clean %>%
  ggplot(aes(x = STAIRS_cat)) +
  geom_bar(fill = "gold", color = "black") +
  labs(
    title = "Distribution of Difficulty Climbing Stairs",
    x = "Stairs Difficulty Level",
    y = "Count of Respondents"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

2. Run Scatterplots

Scatterplot of Pulse vs. Diabetes Status, Grouped by Arthritis Status

Step-by-Step Walkthrough:

Variables: Plots the continuous variable Pulse (MI861) (y-axis) against the categorical variable Diabetes Status (DIABETES_cat) (x-axis), with points colored by Arthritis Status (ARTHRITIS_cat).

geom_jitter(): Used to spread out the points on the categorical x-axis, revealing the density of Pulse scores within each group.

geom_boxplot(): Adds a boxplot layer behind the jittered points to show the median and quartiles clearly for comparison.

# Scatterplot (Jittered) comparing Pulse by Diabetes Status, grouped by Arthritis
data_clean %>%
  ggplot(aes(x = DIABETES_cat, y = MI861, color = ARTHRITIS_cat)) +
  # Boxplots to show distribution
  geom_boxplot(width = 0.5, alpha = 0.5, outlier.shape = NA) +
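  # Jittered points to show raw data density, dodged so each point sits over its own arthritis-group box
  geom_jitter(position = position_jitterdodge(jitter.width = 0.1, dodge.width = 0.5), alpha = 0.6) +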
  labs(
    title = "Heart Rate (Pulse) by Diabetes Status",
    subtitle = "Grouped by Arthritis Status",
    x = "Diabetes Status (MC010)",
    y = "Pulse (Beats Per Minute, bpm)",
    color = "Arthritis Status (MC070)"
  ) +
  theme_minimal() +
  scale_color_brewer(palette = "Set1")

3. Visualize and Compare Among Some Key Variables

Comparison of Pulse (Heart Rate) by Arthritis Status (Boxplot)

Step-by-Step Walkthrough:

Plot Type: A boxplot compares the distribution of the continuous variable Pulse (MI861) across the two categories of Arthritis Status (ARTHRITIS_cat).

geom_boxplot(): Creates the boxplot, illustrating differences in the median, spread, and potential outliers.

# Boxplot for comparing MI861 (Pulse) between MC070 (Arthritis) groups
data_clean %>%
  ggplot(aes(x = ARTHRITIS_cat, y = MI861, fill = ARTHRITIS_cat)) +
  geom_boxplot(alpha = 0.8) +
  labs(
    title = "Pulse Distribution by Self-Reported Arthritis Status",
    x = "Arthritis Status (MC070)",
    y = "Pulse (Beats Per Minute, bpm)",
    fill = "Arthritis Status"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
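
The numbers a boxplot summarizes can also be computed directly. The sketch below uses a small hypothetical data frame (toy, with made-up pulse values and group labels) to compute the per-group median and interquartile range in base R. It is meant as a companion check, not as part of the original analysis.

```r
# Hypothetical data: pulse (bpm) for two arthritis groups (illustrative only)
toy <- data.frame(
  pulse = c(60, 64, 68, 72, 76, 62, 70, 74, 80, 88),
  arthritis = factor(rep(c("Yes", "No"), each = 5))
)

# Median per group: the line drawn inside each box
medians <- tapply(toy$pulse, toy$arthritis, median)

# Interquartile range per group: the height of each box
iqrs <- tapply(toy$pulse, toy$arthritis, IQR)

print(medians)
print(iqrs)
```

On the real data, the same calls with data_clean$MI861 and data_clean$ARTHRITIS_cat would give the values to compare against the plot.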

4. Correlation Visualizations Among Some Key Variables

Correlation Matrix Plot (Correlogram)

Step-by-Step Walkthrough:

1. Data Preparation: Calculates the correlation coefficients between the three numerical variables.

2. Tool: The dedicated `corrplot` package is used for visualizing the matrix.

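3. FIX: The `order` parameter is changed from “hclust” to “original”. The warning from cor() shows that one of the variables has zero variance after cleaning (it is constant), so its correlations with the other variables are NA, and hierarchical clustering fails with an NA/NaN error on such a matrix. Keeping the original order avoids that error but does not supply the missing coefficients, so it is worth checking which variable is constant and whether the filtering step caused it.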

# 1. Select the final numerical variables for correlation
numeric_data <- data_clean %>%
  select(MI861, DIABETES_num, MG006) %>%
  rename(Pulse = MI861, Diabetes = DIABETES_num, Stairs_Difficulty = MG006)

# 2. Calculate the correlation matrix
correlation_matrix <- cor(numeric_data)

## Warning in cor(numeric_data): the standard deviation is zero

# 3. Visualize the correlation matrix (using order = "original" to prevent clustering error)
corrplot::corrplot(
  correlation_matrix,
  method = "circle",      # Shape of the representation
  type = "upper",         # Only show the upper half
  diag = FALSE,           # Hide the diagonal
  order = "original",     # FIX: Use original ordering to avoid hclust failure
  addCoef.col = "black",  # Add correlation coefficients as text
  tl.srt = 45,            # Rotate axis labels
  title = "Correlation Matrix of Pulse, Diabetes, and Stairs Difficulty",
  mar = c(0, 0, 2, 0)     # Leave room at the top so the title is not clipped
)
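
The zero-variance warning shown above can be reproduced and diagnosed with a few lines of base R. The sketch below uses a hypothetical data frame (toy) in which the stairs column is deliberately constant. The column names and values are assumptions for illustration only. It shows how to find the constant column and that the correlations among the remaining columns are still well defined.

```r
# Hypothetical data: `stairs` is constant, mimicking a zero-variance column
toy <- data.frame(
  pulse    = c(60, 72, 68, 80, 75),
  diabetes = c(0, 1, 0, 1, 1),
  stairs   = c(1, 1, 1, 1, 1)
)

# Standard deviation per column; a value of 0 flags a constant variable
col_sd <- sapply(toy, sd)
constant_cols <- names(col_sd)[col_sd == 0]
print(constant_cols)

# cor() yields NA for pairs that involve the constant column
# (the "standard deviation is zero" warning is suppressed here)
cor_all <- suppressWarnings(cor(toy))
print(cor_all)

# Correlations among the non-constant columns are well defined
cor_ok <- cor(toy[, setdiff(names(toy), constant_cols)])
print(cor_ok)
```

Running sapply(numeric_data, sd) on the real data would identify which of Pulse, Diabetes, or Stairs_Difficulty is constant.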

5. Select One or Two of Your Favorite Plots and Try to Mimic Them

Mimicking a Stacked Bar Chart (Arthritis and Stairs Difficulty)

# 1. Summarize data to calculate percentages within each Arthritis category
data_summary_stacked <- data_clean %>%
  group_by(ARTHRITIS_cat, STAIRS_cat) %>%
  summarise(count = n(), .groups = 'drop') %>%
  group_by(ARTHRITIS_cat) %>%
  mutate(percentage = count / sum(count))

# 2. Create the Stacked Bar Chart
ggplot(data_summary_stacked, aes(x = ARTHRITIS_cat, y = percentage, fill = STAIRS_cat)) +
  geom_col(position = "fill") +
  labs(
    title = "Proportion of Stairs Difficulty by Arthritis Status",
    x = "Arthritis Status (MC070)",
    y = "Proportion of Respondents (100% Stacked)",
    fill = "Stairs Difficulty (MG006)"
  ) +
  scale_y_continuous(labels = scales::percent) +
  theme_minimal() +
  scale_fill_brewer(palette = "YlOrRd")
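
The within-group percentages computed in step 1 can be cross-checked with base R. The sketch below applies prop.table() to a hypothetical table of counts. The numbers are invented for illustration. Each row (arthritis group) is scaled so that its proportions sum to 1, which is what a 100% stacked bar displays.

```r
# Hypothetical counts by arthritis status (rows) and stairs difficulty (columns)
counts <- matrix(
  c(30, 15, 5,
    10, 20, 20),
  nrow = 2, byrow = TRUE,
  dimnames = list(arthritis = c("No", "Yes"),
                  stairs = c("1", "2", "3"))
)

# Row-wise proportions: each arthritis group sums to 1
props <- prop.table(counts, margin = 1)
print(round(props, 2))
```

With the real data, table(data_clean$ARTHRITIS_cat, data_clean$STAIRS_cat) would supply the counts.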

HRS 2010 Psychological Data Analysis Report

Your Name

November 04, 2025
