This report presents the required visualizations and analyses based on the HRS2010_psych_R7.sav dataset, using the finalized set of variables: MI861 (Pulse), MC010 (Diabetes Status), MG006 (Climbing Stairs Difficulty), and MC070 (Arthritis Status) as the grouping variable.
The code below prepares the R environment and loads the necessary data and packages.
Step-by-Step Walkthrough:
The required packages (haven, tidyverse, DT, corrplot) are loaded.
The read_sav() function reads the SPSS file.
MI861 (Pulse) is kept as is.
MC010 (Diabetes) is converted to a numerical score (0 or 1) for correlation, and separately to a factor (DIABETES_cat) for plotting.
MG006 (Stairs Difficulty) is converted to a factor (STAIRS_cat) for plotting.
MC070 (Arthritis Status) is converted to the main factor grouping variable (ARTHRITIS_cat).
# List of required packages
packages <- c("haven", "tidyverse", "DT", "corrplot")
# Check, install, and load packages
new_packages <- packages[!(packages %in% installed.packages()[, "Package"])]
if (length(new_packages) > 0) install.packages(new_packages)
# invisible() suppresses the printed list of attached packages that lapply() returns
invisible(lapply(packages, library, character.only = TRUE))
# -----------------
# 2. LOAD DATA
# -----------------
data <- read_sav("HRS2010_psych_R7.sav")
# -----------------
# 3. SELECT AND PREPARE VARIABLES
# -----------------
data_clean <- data %>%
  select(MI861, MC010, MG006, MC070) %>%
  # Keep positive pulse readings and drop high-valued missing/refusal codes;
  # the cutoffs below should be checked against the HRS codebook
  filter(MI861 > 0 & MG006 < 6 & MC010 < 5 & MC070 < 5) %>%
  # Prepare categorical variables
  mutate(
    # MC070 (Arthritis) as main grouping factor
    ARTHRITIS_cat = as_factor(MC070),
    # MC010 (Diabetes) as a plotting factor
    DIABETES_cat = as_factor(MC010),
    # MG006 (Stairs Difficulty) as an ordered factor for plotting;
    # any MG006 code other than 1, 2, or 3 becomes NA here
    STAIRS_cat = factor(MG006, ordered = TRUE, levels = c(1, 2, 3),
                        labels = c("1: Not Difficult", "2: A Little Difficult", "3: Very Difficult/Cannot Do")),
    # MC010 for correlation (numeric: 1 if MC010 == 1, otherwise 0)
    DIABETES_num = if_else(MC010 == 1, 1, 0)
  ) %>%
  drop_na() # Remove rows with any remaining NA, including unmatched MG006 codes
# Display the first rows of the clean data
head(data_clean)
# Use the DT package to create a paged, interactive data table
DT::datatable(
  data_clean,
  options = list(pageLength = 10, scrollX = TRUE),
  caption = "Interactive Data Table of Selected Variables (Pulse, Diabetes, Stairs, Arthritis)"
)
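The recoding step relies on how factor() treats codes that are not listed in levels. The sketch below illustrates that behavior on a toy vector; the values in toy_codes are hypothetical and are not taken from the HRS file.

```r
# Toy codes standing in for MG006 (hypothetical values, not HRS data)
toy_codes <- c(1, 2, 3, 5, 1, 5, 3)

# Same call shape as STAIRS_cat: only levels 1, 2, and 3 are declared
toy_factor <- factor(
  toy_codes,
  ordered = TRUE,
  levels = c(1, 2, 3),
  labels = c("1: Not Difficult", "2: A Little Difficult",
             "3: Very Difficult/Cannot Do")
)

# Codes outside the declared levels (here, 5) silently become NA
n_lost <- sum(is.na(toy_factor))

# Tabulate with NA included to see how many rows drop_na() would remove
level_counts <- table(toy_factor, useNA = "ifany")
print(level_counts)
```

Running the same table() call on data_clean before drop_na() would show how many respondents are removed by the recoding rather than by the filter.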
#1. Visualize the Basic Distribution of Certain Key Variables
#1A. Distribution of Pulse (Heart Rate) (MI861) (Histogram)
Step-by-Step Walkthrough:
Plot Type: A histogram is chosen to show the frequency distribution of the continuous variable MI861 (Pulse).
geom_histogram(): Adds the histogram layer. binwidth = 5 groups the heart rates in steps of 5 bpm.
# Histogram for MI861 (Pulse)
data_clean %>%
  ggplot(aes(x = MI861)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
  labs(
    title = "Distribution of Respondent's Heart Rate (Pulse)",
    x = "Pulse (Beats Per Minute, bpm)",
    y = "Count of Respondents"
  ) +
  theme_minimal()
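To make the effect of binwidth = 5 concrete, the following base R sketch counts toy pulse readings in 5 bpm bins without drawing a plot. The readings are hypothetical, and ggplot2 may align its bin edges differently (see its boundary and center arguments), so the edges here are only illustrative.

```r
# Toy pulse readings in bpm (hypothetical values)
toy_pulse <- c(58, 61, 64, 66, 70, 72, 73, 75, 81, 88)

# Bin edges every 5 bpm, spanning the range of the data
binwidth <- 5
breaks <- seq(
  floor(min(toy_pulse) / binwidth) * binwidth,
  ceiling(max(toy_pulse) / binwidth) * binwidth,
  by = binwidth
)

# Count readings per bin without drawing anything
bin_counts <- hist(toy_pulse, breaks = breaks, plot = FALSE)$counts
names(bin_counts) <- paste0(head(breaks, -1), "-", tail(breaks, -1))
print(bin_counts)
```

A smaller binwidth would split these counts across more, narrower bars; a larger one would merge them.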
#1B. Distribution of Difficulty Climbing Stairs (MG006) (Bar Chart)
Step-by-Step Walkthrough:
Plot Type: A bar chart displays the counts for the ordinal variable STAIRS_cat.
geom_bar(): Plots the frequency of each difficulty level.
# Bar Chart for STAIRS_cat (Difficulty Climbing Stairs)
data_clean %>%
  ggplot(aes(x = STAIRS_cat)) +
  geom_bar(fill = "gold", color = "black") +
  labs(
    title = "Distribution of Difficulty Climbing Stairs",
    x = "Stairs Difficulty Level",
    y = "Count of Respondents"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
#2. Run Scatterplots
Scatterplot of Pulse vs. Diabetes Status, Grouped by Arthritis Status
Step-by-Step Walkthrough:
Variables: Plots the continuous variable Pulse (MI861) (y-axis) against the categorical variable Diabetes Status (DIABETES_cat) (x-axis), with points colored by Arthritis Status (ARTHRITIS_cat).
geom_jitter(): Used to spread out the points on the categorical x-axis, revealing the density of Pulse scores within each group.
geom_boxplot(): Adds a boxplot layer behind the jittered points to show the median and quartiles clearly for comparison.
# Scatterplot (Jittered) comparing Pulse by Diabetes Status, grouped by Arthritis
data_clean %>%
  ggplot(aes(x = DIABETES_cat, y = MI861, color = ARTHRITIS_cat)) +
  # Boxplots to show distribution
  geom_boxplot(width = 0.5, alpha = 0.5, outlier.shape = NA) +
  # Jittered points to show raw data density
  geom_jitter(width = 0.1, alpha = 0.6) +
  labs(
    title = "Heart Rate (Pulse) by Diabetes Status",
    subtitle = "Grouped by Arthritis Status",
    x = "Diabetes Status (MC010)",
    y = "Pulse (Beats Per Minute, bpm)",
    color = "Arthritis Status (MC070)"
  ) +
  theme_minimal() +
  scale_color_brewer(palette = "Set1")
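The plot above compares pulse across combinations of diabetes and arthritis status. As a numeric companion, the sketch below computes the median and group size for each combination with base R. The data frame toy and its column names are hypothetical stand-ins for MI861, DIABETES_cat, and ARTHRITIS_cat.

```r
# Toy data (hypothetical values) mirroring the plotted variables
toy <- data.frame(
  pulse     = c(60, 64, 70, 72, 66, 80, 74, 68),
  diabetes  = c("Yes", "Yes", "No", "No", "Yes", "No", "No", "Yes"),
  arthritis = c("Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No")
)

# Median pulse for every Diabetes x Arthritis combination
# (rows = diabetes status, columns = arthritis status)
cell_medians <- tapply(toy$pulse, list(toy$diabetes, toy$arthritis), median)
print(cell_medians)

# Number of observations behind each median
cell_counts <- table(toy$diabetes, toy$arthritis)
print(cell_counts)
```

Checking the counts alongside the medians helps flag combinations with too few respondents for the boxplot to be informative.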
#3. Visualize and Compare Among Some Key Variables
Comparison of Pulse (Heart Rate) by Arthritis Status (Boxplot)
Step-by-Step Walkthrough:
Plot Type: A boxplot compares the distribution of the continuous variable Pulse (MI861) across the two categories of Arthritis Status (ARTHRITIS_cat).
geom_boxplot(): Creates the boxplot, illustrating differences in the median, spread, and potential outliers.
# Boxplot for comparing MI861 (Pulse) between MC070 (Arthritis) groups
data_clean %>%
  ggplot(aes(x = ARTHRITIS_cat, y = MI861, fill = ARTHRITIS_cat)) +
  geom_boxplot(alpha = 0.8) +
  labs(
    title = "Pulse Distribution by Self-Reported Arthritis Status",
    x = "Arthritis Status (MC070)",
    y = "Pulse (Beats Per Minute, bpm)",
    fill = "Arthritis Status"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
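The median, spread, and outliers that a boxplot displays can also be inspected numerically. The sketch below uses base R's boxplot.stats() on hypothetical pulse readings; it applies the usual 1.5 x IQR rule, though ggplot2 computes its hinges from quantiles and may differ slightly on small samples.

```r
# Toy pulse readings in bpm (hypothetical values, one extreme reading)
toy_pulse <- c(52, 60, 62, 64, 66, 68, 70, 72, 74, 110)

# boxplot.stats() returns the whisker ends, hinges, and median in $stats
# and the points beyond the whiskers in $out
stats <- boxplot.stats(toy_pulse)

five_num <- stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
outliers <- stats$out    # readings more than 1.5 x IQR beyond the hinges

print(five_num)
print(outliers)
```

Applying this to each arthritis group separately, for instance with tapply(), would give the numbers behind each box.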
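The corrplot package is used for visualizing the matrix.
The order parameter is changed from “hclust” to “original” to avoid the NA/NaN error caused by zero variance in one of the variables. A variable with zero variance has a standard deviation of zero, so cor() returns NA for its correlations with the other variables, and hierarchical clustering cannot order a matrix that contains NA values.
# 1. Select the final numerical variables for correlation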
numeric_data <- data_clean %>%
  select(MI861, DIABETES_num, MG006) %>%
  rename(Pulse = MI861, Diabetes = DIABETES_num, Stairs_Difficulty = MG006)
# 2. Calculate the correlation matrix
correlation_matrix <- cor(numeric_data)
## Warning in cor(numeric_data): the standard deviation is zero
# 3. Visualize the correlation matrix (using order = "original" to prevent clustering error)
corrplot::corrplot(
  correlation_matrix,
  method = "circle",      # Shape of the representation
  type = "upper",         # Only show the upper half
  diag = FALSE,
  order = "original",     # FIX: Use original ordering to avoid hclust failure
  addCoef.col = "black",  # Add correlation coefficients as text
  tl.srt = 45,            # Rotate axis labels
  title = " \nCorrelation Matrix of Pulse, Diabetes, and Stairs Difficulty"
)
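The warning "the standard deviation is zero" means at least one column is constant. A minimal sketch of how such a column could be detected and set aside before computing correlations is shown below. The toy data frame is hypothetical, and which column is constant here is an assumption made only for illustration.

```r
# Toy data (hypothetical values); Stairs_Difficulty is deliberately constant
toy <- data.frame(
  Pulse = c(60, 72, 68, 80, 75),
  Diabetes = c(0, 1, 0, 1, 1),
  Stairs_Difficulty = c(1, 1, 1, 1, 1)
)

# Standard deviation per column; a value of 0 marks a constant variable
col_sd <- sapply(toy, sd)
constant_cols <- names(col_sd)[col_sd == 0]
print(constant_cols)

# cor() yields NA between a constant column and any other column (with a warning)
full_cor <- suppressWarnings(cor(toy))

# Dropping constant columns leaves a matrix without NA entries
reduced_cor <- cor(toy[, col_sd > 0, drop = FALSE])
print(reduced_cor)
```

Running sapply(numeric_data, sd) in the report would identify which variable triggers the warning; a constant variable usually points back to the filtering or recoding step rather than to the correlation itself.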
#5. Select one or two of your favorite plots and try to mimic
Mimicking a Stacked Bar Chart (Arthritis and Stairs Difficulty)
# 1. Summarize data to calculate percentages within each Arthritis category
data_summary_stacked <- data_clean %>%
  group_by(ARTHRITIS_cat, STAIRS_cat) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(ARTHRITIS_cat) %>%
  mutate(percentage = count / sum(count)) %>%
  ungroup() # Remove grouping so later steps are not affected by it
# 2. Create the Stacked Bar Chart
# geom_col() plots the precomputed values; position = "fill" scales each bar to 100%
ggplot(data_summary_stacked, aes(x = ARTHRITIS_cat, y = percentage, fill = STAIRS_cat)) +
  geom_col(position = "fill") +
  labs(
    title = "Proportion of Stairs Difficulty by Arthritis Status",
    x = "Arthritis Status (MC070)",
    y = "Proportion of Respondents (100% Stacked)",
    fill = "Stairs Difficulty (MG006)"
  ) +
  scale_y_continuous(labels = scales::percent) +
  theme_minimal() +
  scale_fill_brewer(palette = "YlOrRd")
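The within-group percentages behind the stacked bars can be cross-checked with base R. The sketch below uses prop.table() on a two-way table of hypothetical data; the column names arthritis and stairs are stand-ins for ARTHRITIS_cat and STAIRS_cat.

```r
# Toy data (hypothetical values)
toy <- data.frame(
  arthritis = c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No"),
  stairs    = c("1", "2", "3", "3", "1", "1", "1", "2", "3")
)

# Counts for each Arthritis x Stairs combination (rows = arthritis status)
counts <- table(toy$arthritis, toy$stairs)

# margin = 1 divides each row by its own total, so every row sums to 1
row_props <- prop.table(counts, margin = 1)
print(round(row_props, 2))
```

Each row corresponds to one bar in the chart, and the entries in a row correspond to the stacked segments of that bar.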