How does the type of health insurance (public vs private) influence health score distributions? How do chronic group vs non-chronic groups defer in health scores? How does your health score influence your activity level?
I am using a data set of doctors visit metrics.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(readr)
DoctorVisits <- read_csv("C:/Users/bleac/OneDrive/Documents/summer bridge final/DoctorVisits.csv")
## New names:
## • `` -> `...1`
## Rows: 5190 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): gender, private, freepoor, freerepat, nchronic, lchronic
## dbl (7): ...1, visits, age, income, illness, reduced, health
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(DoctorVisits)
## # A tibble: 6 × 13
## ...1 visits gender age income illness reduced health private freepoor
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 1 1 female 0.19 0.55 1 4 1 yes no
## 2 2 1 female 0.19 0.45 1 2 1 yes no
## 3 3 1 male 0.19 0.9 3 0 0 no no
## 4 4 1 male 0.19 0.15 1 0 0 no no
## 5 5 1 male 0.19 0.45 2 5 1 no no
## 6 6 1 female 0.19 0.35 5 1 9 no no
## # ℹ 3 more variables: freerepat <chr>, nchronic <chr>, lchronic <chr>
Now lets explore the data and get summary statistics:
#Summary statistics
summary(DoctorVisits)
## ...1 visits gender age
## Min. : 1 Min. :0.0000 Length:5190 Min. :0.1900
## 1st Qu.:1298 1st Qu.:0.0000 Class :character 1st Qu.:0.2200
## Median :2596 Median :0.0000 Mode :character Median :0.3200
## Mean :2596 Mean :0.3017 Mean :0.4064
## 3rd Qu.:3893 3rd Qu.:0.0000 3rd Qu.:0.6200
## Max. :5190 Max. :9.0000 Max. :0.7200
## income illness reduced health
## Min. :0.0000 Min. :0.000 Min. : 0.0000 Min. : 0.000
## 1st Qu.:0.2500 1st Qu.:0.000 1st Qu.: 0.0000 1st Qu.: 0.000
## Median :0.5500 Median :1.000 Median : 0.0000 Median : 0.000
## Mean :0.5832 Mean :1.432 Mean : 0.8619 Mean : 1.218
## 3rd Qu.:0.9000 3rd Qu.:2.000 3rd Qu.: 0.0000 3rd Qu.: 2.000
## Max. :1.5000 Max. :5.000 Max. :14.0000 Max. :12.000
## private freepoor freerepat nchronic
## Length:5190 Length:5190 Length:5190 Length:5190
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## lchronic
## Length:5190
## Class :character
## Mode :character
##
##
##
Doctor visits is a dataset of n=5190
Getting statistics on the age distribution (age is divided by 100):
mean(DoctorVisits$age)
## [1] 0.4063854
median(DoctorVisits$age)
## [1] 0.32
min(DoctorVisits$age)
## [1] 0.19
max(DoctorVisits$age)
## [1] 0.72
sd(DoctorVisits$age)
## [1] 0.2047818
Mean age is ~41, median age is 32, min is 19, max is 72, with a standard deviation of +/- 20.5 years
Getting statistics on annual income in hundred of thousands of $:
mean(DoctorVisits$income)
## [1] 0.5831599
median(DoctorVisits$income)
## [1] 0.55
min(DoctorVisits$income)
## [1] 0
max(DoctorVisits$income)
## [1] 1.5
sd(DoctorVisits$income)
## [1] 0.3689067
The Mean income is $58k, with median income being $55K. Min income is $0 and Max income is $150k with a standard deviation of +/- $37k.
Getting sex statistics:
sex=table(DoctorVisits$gender)
print(sex)
##
## female male
## 2702 2488
The sample is almost evenly split with 2702 females and 2488 males
Data Wrangling:
Do patients with private health insurance have different conditions than patients with public insurance?
Going to split up the data into the three types of insurances.
private_ins = DoctorVisits %>% filter(private == "yes")
medicaid = DoctorVisits %>% filter(freepoor == "yes")
medicare = DoctorVisits %>% filter(freerepat == "yes")
# Combining all three data sets into one data frame for easy plotting
combined_df = rbind(
data.frame(Insurance_Type = 'Private', Health_Score = private_ins$health),
data.frame(Insurance_Type = 'Medicaid', Health_Score = medicaid$health),
data.frame(Insurance_Type = 'Medicare', Health_Score = medicare$health)
)
Now lets compare between the three. Frst Health scores (0 to 36 with 36 being the worse):
meanpriv=mean(private_ins$health)
meanmedicaid=mean(medicaid$health)
meanmedicare=mean(medicare$health)
medpriv=median(private_ins$health)
medmedicaid=median(medicaid$health)
medmedicare=median(medicare$health)
print(meanpriv)
## [1] 1.097476
print(meanmedicaid)
## [1] 1.797297
print(meanmedicare)
## [1] 1.498625
print(medpriv)
## [1] 0
print(medmedicaid)
## [1] 1
print(medmedicare)
## [1] 0
The mean and median health scores for all three groups are close. It is expected that the median for medicaid to be higher since the medicaid group includes disabled individuals and also veterans
Lets plot the distribution of health scores per insurance group:
# convert Health_Score to factor because the sample sizes of the three grou0ps are different
combined_df$Health_Score = as.factor(combined_df$Health_Score)
# plot
ggplot(combined_df, aes(x=Health_Score, fill = Insurance_Type)) +
geom_bar(position = "dodge", aes(y = ..prop.., group = Insurance_Type)) +
scale_y_continuous(labels = scales::percent) +
labs(title = "Distribution of Health Scores", x = "Health Score", y = "Proportion", fill = "Insurance Type") +
theme_minimal() +
theme(legend.position = "top")
## Warning: The dot-dot notation (`..prop..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(prop)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
60% of private health insurance member have a health score of 0 compared to a score of 53% for medicare and 45% medicaid
Lets further divide the three groups into chronic vs non chronic:
priv_chronic = private_ins %>% filter(lchronic == "yes")
medicaid_chronic = medicaid %>% filter(lchronic == "yes")
medicare_chronic = medicare %>% filter(lchronic == "yes")
# Non-chronic patients subsets
priv_no_chronic = private_ins %>% filter(lchronic == "no")
medicaid_no_chronic = medicaid %>% filter(lchronic == "no")
medicare_no_chronic = medicare %>% filter(lchronic == "no")
# Creating a data frame for chronic patients
combined_df_chronic = rbind(
data.frame(Insurance_Type = 'Private', Health_Score = as.factor(priv_chronic$health)),
data.frame(Insurance_Type = 'Medicaid', Health_Score = as.factor(medicaid_chronic$health)),
data.frame(Insurance_Type = 'Medicare', Health_Score = as.factor(medicare_chronic$health))
)
# Creating a data frame for non-chronic patients
combined_df_non_chronic = rbind(
data.frame(Insurance_Type = 'Private', Health_Score = as.factor(priv_no_chronic$health)),
data.frame(Insurance_Type = 'Medicaid', Health_Score = as.factor(medicaid_no_chronic$health)),
data.frame(Insurance_Type = 'Medicare', Health_Score = as.factor(medicare_no_chronic$health))
)
Plotting for chronic patients:
# Plotting for chronic patients
ggplot(combined_df_chronic, aes(x=Health_Score, fill = Insurance_Type)) +
geom_bar(position = "dodge", aes(y = ..prop.., group = Insurance_Type)) +
scale_y_continuous(labels = scales::percent) +
labs(title = "Chronic Patients: Distribution of Health Scores", x = "Health Score", y = "Proportion", fill = "Insurance Type") +
theme_minimal() +
theme(legend.position = "top")
chronic patients have a higher chance of having a healthscore of 0 in private health insurance when compared to private health insurance
A score of 12 chances are you a a medicare patient
Plotting for non-chronic patients:
# Plotting for non-chronic patients
ggplot(combined_df_non_chronic, aes(x=Health_Score, fill = Insurance_Type)) +
geom_bar(position = "dodge", aes(y = ..prop.., group = Insurance_Type)) +
scale_y_continuous(labels = scales::percent) +
labs(title = "Non-Chronic Patients: Distribution of Health Scores", x = "Health Score", y = "Proportion", fill = "Insurance Type") +
theme_minimal() +
theme(legend.position = "top")
Non chronic Medicaid has a higher chance of having worse health score.
Both non chronic and chronic have similar distribution patterns of health score, with the majority.
Being concentrated towards the beginning (meaning that they are healthier)
Chronic has more spread (expected)
How does your health score influence your activity level?
We will plot average days of reduced activities vs health score
First, i will add a combined column for insurance type and chronic condition:
# adding a combined column for insurance type and chronic condition
combined_df = rbind(
data.frame(Insurance_Type = 'Private', Health_Score = private_ins$health),
data.frame(Insurance_Type = 'Medicaid', Health_Score = medicaid$health),
data.frame(Insurance_Type = 'Medicare', Health_Score = medicare$health)
)
# Aggregating the data for each insurance type and condition
priv_agg_chronic = priv_chronic %>%
group_by(health) %>%
summarise(mean_reduced = mean(reduced, na.rm = TRUE))
priv_agg_non_chronic = priv_no_chronic %>%
group_by(health) %>%
summarise(mean_reduced = mean(reduced, na.rm = TRUE))
medicaid_agg_chronic = medicaid_chronic %>%
group_by(health) %>%
summarise(mean_reduced = mean(reduced, na.rm = TRUE))
medicaid_agg_non_chronic = medicaid_no_chronic %>%
group_by(health) %>%
summarise(mean_reduced = mean(reduced, na.rm = TRUE))
medicare_agg_chronic = medicare_chronic %>%
group_by(health) %>%
summarise(mean_reduced = mean(reduced, na.rm = TRUE))
medicare_agg_non_chronic = medicare_no_chronic %>%
group_by(health) %>%
summarise(mean_reduced = mean(reduced, na.rm = TRUE))
Now I’m going to plot the data for each insurance type and condition. I am also going to perform a linear regression for each.
Private Insurance:
ggplot(data = priv_agg_chronic, aes(x = health, y = mean_reduced)) +
geom_point() +
labs(title = "Mean Reduced Activity vs Health Score for Private Insurance - Chronic",
x = "Health Score",
y = "Mean Days of Reduced Activity") +
theme_minimal()
priv_chronic_model = lm(mean_reduced ~ health, data = priv_agg_chronic)
summary(priv_chronic_model)
##
## Call:
## lm(formula = mean_reduced ~ health, data = priv_agg_chronic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.0359 -1.2450 0.1789 1.0897 5.1171
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.4992 1.3927 1.795 0.100
## health 0.3215 0.1970 1.632 0.131
##
## Residual standard error: 2.657 on 11 degrees of freedom
## Multiple R-squared: 0.195, Adjusted R-squared: 0.1218
## F-statistic: 2.665 on 1 and 11 DF, p-value: 0.1309
#Interesting. We see a direct linear relatrionship up until health score 4, then there is variability
#for patients with chronic conditions and private health insurance,
#Their health score is proportional to their reduced days of activities under health score of 5
ggplot(data = priv_agg_non_chronic, aes(x = health, y = mean_reduced)) +
geom_point() +
labs(title = "Mean Reduced Activity vs Health Score for Private Insurance - Non Chronic",
x = "Health Score",
y = "Mean Days of Reduced Activity") +
theme_minimal()
priv_non_chronic_model = lm(mean_reduced ~ health, data = priv_agg_non_chronic)
summary(priv_non_chronic_model)
##
## Call:
## lm(formula = mean_reduced ~ health, data = priv_agg_non_chronic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.69252 -0.64984 -0.05565 0.67618 1.85243
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.86196 0.49723 1.734 0.111
## health 0.10581 0.07032 1.505 0.161
##
## Residual standard error: 0.9487 on 11 degrees of freedom
## Multiple R-squared: 0.1707, Adjusted R-squared: 0.0953
## F-statistic: 2.264 on 1 and 11 DF, p-value: 0.1606
#Here we see less of a relationship compared to chronic
Interesting. We see a direct linear relatrionship up until health score 4, then there is variability.
For patients with chronic conditions and private health insurance, their health score is proportional to their reduced days of activities under health score of 5.
Medicaid:
# Medicaid
ggplot(data = medicaid_agg_chronic, aes(x = health, y = mean_reduced)) +
geom_point() +
labs(title = "Mean Reduced Activity vs Health Score for Medicaid - Chronic",
x = "Health Score",
y = "Mean Days of Reduced Activity") +
theme_minimal()
medicaid_chronic_model = lm(mean_reduced ~ health, data = medicaid_agg_chronic)
summary(medicaid_chronic_model)
##
## Call:
## lm(formula = mean_reduced ~ health, data = medicaid_agg_chronic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.598 -2.882 -1.645 1.266 9.845
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.8076 2.8874 0.280 0.789
## health 0.5580 0.6227 0.896 0.405
##
## Residual standard error: 4.803 on 6 degrees of freedom
## Multiple R-squared: 0.118, Adjusted R-squared: -0.02897
## F-statistic: 0.8029 on 1 and 6 DF, p-value: 0.4047
ggplot(data = medicaid_agg_non_chronic, aes(x = health, y = mean_reduced)) +
geom_point() +
labs(title = "Mean Reduced Activity vs Health Score for Medicaid - Non Chronic",
x = "Health Score",
y = "Mean Days of Reduced Activity") +
theme_minimal()
# Medicaid - Non-Chronic
medicaid_non_chronic_model = lm(mean_reduced ~ health, data = medicaid_agg_non_chronic)
summary(medicaid_non_chronic_model)
##
## Call:
## lm(formula = mean_reduced ~ health, data = medicaid_agg_non_chronic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.29496 -0.82872 -0.03818 0.49558 2.91450
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.009137 0.766278 -0.012 0.9907
## health 0.209463 0.108368 1.933 0.0794 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.462 on 11 degrees of freedom
## Multiple R-squared: 0.2535, Adjusted R-squared: 0.1857
## F-statistic: 3.736 on 1 and 11 DF, p-value: 0.0794
Medicaid chronic larger r squared but close to x axis --> less days of reduced activity
Medicare:
ggplot(data = medicare_agg_chronic, aes(x = health, y = mean_reduced)) +
geom_point() +
labs(title = "Mean Reduced Activity vs Health Score for Medicare - Chronic",
x = "Health Score",
y = "Mean Days of Reduced Activity") +
theme_minimal()
medicare_chronic_model = lm(mean_reduced ~ health, data = medicare_agg_chronic)
summary(medicare_chronic_model)
##
## Call:
## lm(formula = mean_reduced ~ health, data = medicare_agg_chronic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4416 -1.4126 -0.6727 0.8732 4.9270
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2724 1.0542 0.258 0.800866
## health 0.8801 0.1491 5.903 0.000103 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.011 on 11 degrees of freedom
## Multiple R-squared: 0.76, Adjusted R-squared: 0.7382
## F-statistic: 34.84 on 1 and 11 DF, p-value: 0.0001027
ggplot(data = medicare_agg_non_chronic, aes(x = health, y = mean_reduced)) +
geom_point() +
labs(title = "Mean Reduced Activity vs Health Score for Medicare - Non Chronic",
x = "Health Score",
y = "Mean Days of Reduced Activity") +
theme_minimal()
medicare_non_chronic_model = lm(mean_reduced ~ health, data = medicare_agg_non_chronic)
summary(medicare_non_chronic_model)
##
## Call:
## lm(formula = mean_reduced ~ health, data = medicare_agg_non_chronic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3642 -0.7603 -0.2274 0.6096 4.0414
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.1334 1.1936 0.950 0.365
## health 0.2028 0.1838 1.103 0.296
##
## Residual standard error: 2.198 on 10 degrees of freedom
## Multiple R-squared: 0.1085, Adjusted R-squared: 0.01937
## F-statistic: 1.217 on 1 and 10 DF, p-value: 0.2957
Medicare chronic seems to have a direct linear trend going on
So does medicare non chronic
It seems like the higher your health score is, the more days of reduced activity you have (not true for medicaid)
Now lets look at the distribution of health scores by gender:
# Boxplot
ggplot(DoctorVisits, aes(x = gender, y = health, fill = gender)) +
geom_boxplot() +
labs(title = "Boxplot of Health Score by Gender",
x = "Gender",
y = "Health Score",
fill = "Gender") +
theme_minimal()
# Violin plot
ggplot(DoctorVisits, aes(x = gender, y = health, fill = gender)) +
geom_violin() +
labs(title = "Violin Plot of Health Score by Gender",
x = "Gender",
y = "Health Score",
fill = "Gender") +
theme_minimal()
We see that men are the higher health scores. Men also have more variablility in health scores than women. Woman health scores were lower.
Conclusion:
Regarding the health insurance, the data was divided into three groups based on the type of health insurance: Private, Medicaid, and Medicare. The health scores of the three groups were found to be closely comparable, suggesting no immediate observable impact of the type of insurance on health scores. However, further analysis demonstrated that individuals with private insurance and non-chronic conditions had a higher likelihood of having a health score of 0, signifying better health, compared to those with public health insurance.
Chronic patients, compared to non-chronic, exhibited a wider spread in their health score distribution, which was expected due to the nature of their chronic conditions. For non-chronic patients, those insured through Medicaid demonstrated a higher chance of having a worse health score.
Next, the influence of health scores on activity level was examined. There was a general trend that higher health scores (indicating worse health) corresponded to increased days of reduced activity. This relationship was particularly strong for chronic patients with private insurance and both chronic and non-chronic Medicare patients. It was less pronounced among Medicaid patients, with the non-chronic group showing little correlation.
Finally, when examining the distribution of health scores by gender, it was observed that males had higher and more varied health scores compared to females. This indicates men generally had poorer health outcomes, but also a wider spread in health conditions, in contrast to women, whose health scores were overall lower, indicating better health outcomes.