Final Project – CUNY SPS MSDS R Summer Bridge

How does the type of health insurance (public vs private) influence health score distributions? How do chronic and non-chronic groups differ in health scores? How does your health score influence your activity level?

I am using a data set of doctor visit metrics.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(readr)
DoctorVisits <- read_csv("C:/Users/bleac/OneDrive/Documents/summer bridge final/DoctorVisits.csv")

## New names:
## • `` -> `...1`

## Rows: 5190 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): gender, private, freepoor, freerepat, nchronic, lchronic
## dbl (7): ...1, visits, age, income, illness, reduced, health
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(DoctorVisits)

## # A tibble: 6 × 13
##    ...1 visits gender   age income illness reduced health private freepoor
##   <dbl>  <dbl> <chr>  <dbl>  <dbl>   <dbl>   <dbl>  <dbl> <chr>   <chr>   
## 1     1      1 female  0.19   0.55       1       4      1 yes     no      
## 2     2      1 female  0.19   0.45       1       2      1 yes     no      
## 3     3      1 male    0.19   0.9        3       0      0 no      no      
## 4     4      1 male    0.19   0.15       1       0      0 no      no      
## 5     5      1 male    0.19   0.45       2       5      1 no      no      
## 6     6      1 female  0.19   0.35       5       1      9 no      no      
## # ℹ 3 more variables: freerepat <chr>, nchronic <chr>, lchronic <chr>

Now let's explore the data and get summary statistics:

#Summary statistics
summary(DoctorVisits)

##       ...1          visits          gender               age        
##  Min.   :   1   Min.   :0.0000   Length:5190        Min.   :0.1900  
##  1st Qu.:1298   1st Qu.:0.0000   Class :character   1st Qu.:0.2200  
##  Median :2596   Median :0.0000   Mode  :character   Median :0.3200  
##  Mean   :2596   Mean   :0.3017                      Mean   :0.4064  
##  3rd Qu.:3893   3rd Qu.:0.0000                      3rd Qu.:0.6200  
##  Max.   :5190   Max.   :9.0000                      Max.   :0.7200  
##      income          illness         reduced            health      
##  Min.   :0.0000   Min.   :0.000   Min.   : 0.0000   Min.   : 0.000  
##  1st Qu.:0.2500   1st Qu.:0.000   1st Qu.: 0.0000   1st Qu.: 0.000  
##  Median :0.5500   Median :1.000   Median : 0.0000   Median : 0.000  
##  Mean   :0.5832   Mean   :1.432   Mean   : 0.8619   Mean   : 1.218  
##  3rd Qu.:0.9000   3rd Qu.:2.000   3rd Qu.: 0.0000   3rd Qu.: 2.000  
##  Max.   :1.5000   Max.   :5.000   Max.   :14.0000   Max.   :12.000  
##    private            freepoor          freerepat           nchronic        
##  Length:5190        Length:5190        Length:5190        Length:5190       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    lchronic        
##  Length:5190       
##  Class :character  
##  Mode  :character  
##                    
##                    
##

DoctorVisits is a data set of n = 5190 observations and 13 columns.

Getting statistics on the age distribution (age is divided by 100):

mean(DoctorVisits$age)

## [1] 0.4063854

median(DoctorVisits$age)

## [1] 0.32

min(DoctorVisits$age)

## [1] 0.19

max(DoctorVisits$age)

## [1] 0.72

sd(DoctorVisits$age)

## [1] 0.2047818

The mean age is about 41, the median age is 32, the minimum is 19, and the maximum is 72, with a standard deviation of about 20.5 years.
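Because age is stored as years divided by 100, it can be easier to rescale once and then summarise. The following is only a minimal sketch on a small made-up vector (`age_scaled` is hypothetical, not the real column), showing the conversion back to years:

```r
# Hypothetical ages stored as years / 100, mimicking the encoding of DoctorVisits$age
age_scaled <- c(0.19, 0.22, 0.32, 0.62, 0.72)

# Convert back to years before summarising
age_years <- age_scaled * 100

# Collect the same five statistics used above in one named vector
age_stats <- c(
  mean   = mean(age_years),
  median = median(age_years),
  min    = min(age_years),
  max    = max(age_years),
  sd     = sd(age_years)
)
print(round(age_stats, 1))
```

With the real data, `DoctorVisits$age * 100` would take the place of the made-up vector.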

Getting statistics on annual income. Assuming this is the DoctorVisits data from the AER package, income is recorded in tens of thousands of dollars (not hundreds of thousands):

mean(DoctorVisits$income)

## [1] 0.5831599

median(DoctorVisits$income)

## [1] 0.55

min(DoctorVisits$income)

## [1] 0

max(DoctorVisits$income)

## [1] 1.5

sd(DoctorVisits$income)

## [1] 0.3689067

On that scale, the mean income is about $5.8k and the median income is $5.5k. The minimum income is $0 and the maximum is $15k, with a standard deviation of about $3.7k.

Getting sex statistics:

sex=table(DoctorVisits$gender)
print(sex)

## 
## female   male 
##   2702   2488

The sample is almost evenly split, with 2702 females (52%) and 2488 males (48%).

Data Wrangling:

Do patients with private health insurance have different conditions than patients with public insurance?

I will split the data into three insurance types, using US program names as labels: private (Private), freepoor (Medicaid), and freerepat (Medicare).

private_ins = DoctorVisits %>% filter(private == "yes")
medicaid = DoctorVisits %>% filter(freepoor == "yes")
medicare = DoctorVisits %>% filter(freerepat == "yes")

# Combining all three data sets into one data frame for easy plotting
combined_df = rbind(
  data.frame(Insurance_Type = 'Private', Health_Score = private_ins$health),
  data.frame(Insurance_Type = 'Medicaid', Health_Score = medicaid$health),
  data.frame(Insurance_Type = 'Medicare', Health_Score = medicare$health)
)
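The same combined data frame could also be built with `dplyr::bind_rows()`, which records the group label through its `.id` argument. This is just a sketch on a made-up `toy` data frame (hypothetical rows; only the column names follow the real data):

```r
library(dplyr)

# Toy stand-in for DoctorVisits (made-up rows; column names match the real data)
toy <- data.frame(
  health    = c(0, 1, 0, 3, 2, 0),
  private   = c("yes", "yes", "no", "no", "no", "no"),
  freepoor  = c("no", "no", "yes", "yes", "no", "no"),
  freerepat = c("no", "no", "no", "no", "yes", "yes")
)

# Named list of subsets; bind_rows() stores each list name in Insurance_Type
combined_toy <- bind_rows(
  list(
    Private  = filter(toy, private == "yes"),
    Medicaid = filter(toy, freepoor == "yes"),
    Medicare = filter(toy, freerepat == "yes")
  ),
  .id = "Insurance_Type"
) %>%
  select(Insurance_Type, Health_Score = health)

print(combined_toy)
```

Compared with `rbind()` over hand-built data frames, this keeps the label and the filter next to each other, which makes it harder to mislabel a group.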

Now let's compare the three groups. First, health scores (a higher score means worse health; the scores in this data run from 0 to 12):

meanpriv=mean(private_ins$health)
meanmedicaid=mean(medicaid$health)
meanmedicare=mean(medicare$health)

medpriv=median(private_ins$health)
medmedicaid=median(medicaid$health)
medmedicare=median(medicare$health)

print(meanpriv)

## [1] 1.097476

print(meanmedicaid)

## [1] 1.797297

print(meanmedicare)

## [1] 1.498625

print(medpriv)

## [1] 0

print(medmedicaid)

## [1] 1

print(medmedicare)

## [1] 0

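The mean health scores are fairly close: 1.10 for Private, 1.50 for Medicare, and 1.80 for Medicaid. Medicaid is also the only group with a non-zero median (1 versus 0). Note that, going by the variable definitions in the AER package documentation, disabled individuals and veterans are covered by freerepat (the Medicare group here), while freepoor (the Medicaid group) is free government coverage due to low income, so disability and veteran status cannot explain the higher Medicaid median.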
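The six separate `mean()` and `median()` calls above can be collapsed into one grouped summary. A minimal sketch, assuming a data frame shaped like combined_df; the `scores` values below are made up for illustration:

```r
library(dplyr)

# Made-up scores in the same shape as combined_df (not the real data)
scores <- data.frame(
  Insurance_Type = c("Private", "Private", "Private",
                     "Medicaid", "Medicaid", "Medicare", "Medicare"),
  Health_Score   = c(0, 0, 3, 1, 4, 0, 2)
)

# One row per insurance type with its size, mean, and median
group_summary <- scores %>%
  group_by(Insurance_Type) %>%
  summarise(
    n            = n(),
    mean_score   = mean(Health_Score),
    median_score = median(Health_Score),
    .groups      = "drop"
  )
print(group_summary)
```

Including `n` in the table also makes the difference in group sizes visible, which matters when comparing the means.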

Let's plot the distribution of health scores per insurance group:

# Convert Health_Score to a factor so that each score gets its own bar.
# Proportions (not counts) are plotted because the three groups differ in size.

combined_df$Health_Score = as.factor(combined_df$Health_Score)

# plot (after_stat(prop) replaces the ..prop.. notation deprecated in ggplot2 3.4.0)
ggplot(combined_df, aes(x=Health_Score, fill = Insurance_Type)) +
  geom_bar(position = "dodge", aes(y = after_stat(prop), group = Insurance_Type)) +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Distribution of Health Scores", x = "Health Score", y = "Proportion", fill = "Insurance Type") +
  theme_minimal() +
  theme(legend.position = "top")

About 60% of private health insurance members have a health score of 0, compared with 53% of Medicare members and 45% of Medicaid members.
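Percentages like these can also be computed directly instead of being read off the bars. The sketch below uses base R's `table()` and `prop.table()` on made-up data (`scores` is hypothetical); `margin = 1` makes each row (insurance type) sum to 1:

```r
# Made-up scores in the same shape as combined_df (not the real data)
scores <- data.frame(
  Insurance_Type = c(rep("Private", 5), rep("Medicaid", 4), rep("Medicare", 4)),
  Health_Score   = c(0, 0, 0, 1, 2,   0, 1, 2, 3,   0, 0, 1, 5)
)

# Cross-tabulate, then convert counts to within-group proportions
prop_by_group <- prop.table(
  table(scores$Insurance_Type, scores$Health_Score),
  margin = 1
)
print(round(prop_by_group, 2))

# Share of each group with a health score of 0
print(prop_by_group[, "0"])
```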

Let's further divide the three groups into chronic (lchronic == "yes") vs non-chronic (lchronic == "no"):

priv_chronic = private_ins %>% filter(lchronic == "yes")
medicaid_chronic = medicaid %>% filter(lchronic == "yes")
medicare_chronic = medicare %>% filter(lchronic == "yes")

# Non-chronic patients subsets
priv_no_chronic = private_ins %>% filter(lchronic == "no")
medicaid_no_chronic = medicaid %>% filter(lchronic == "no")
medicare_no_chronic = medicare %>% filter(lchronic == "no")
# Creating a data frame for chronic patients
combined_df_chronic = rbind(
  data.frame(Insurance_Type = 'Private', Health_Score = as.factor(priv_chronic$health)),
  data.frame(Insurance_Type = 'Medicaid', Health_Score = as.factor(medicaid_chronic$health)),
  data.frame(Insurance_Type = 'Medicare', Health_Score = as.factor(medicare_chronic$health))
)

# Creating a data frame for non-chronic patients
combined_df_non_chronic = rbind(
  data.frame(Insurance_Type = 'Private', Health_Score = as.factor(priv_no_chronic$health)),
  data.frame(Insurance_Type = 'Medicaid', Health_Score = as.factor(medicaid_no_chronic$health)),
  data.frame(Insurance_Type = 'Medicare', Health_Score = as.factor(medicare_no_chronic$health))
)

Plotting for chronic patients:

# Plotting for chronic patients
ggplot(combined_df_chronic, aes(x=Health_Score, fill = Insurance_Type)) +
  geom_bar(position = "dodge", aes(y = after_stat(prop), group = Insurance_Type)) +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Chronic Patients: Distribution of Health Scores", x = "Health Score", y = "Proportion", fill = "Insurance Type") +
  theme_minimal() +
  theme(legend.position = "top")

Chronic patients with private health insurance have a higher chance of a health score of 0 than chronic patients with public health insurance (Medicaid or Medicare).

A score of 12 is proportionally most common among Medicare patients.

Plotting for non-chronic patients:

# Plotting for non-chronic patients
ggplot(combined_df_non_chronic, aes(x=Health_Score, fill = Insurance_Type)) +
  geom_bar(position = "dodge", aes(y = after_stat(prop), group = Insurance_Type)) +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Non-Chronic Patients: Distribution of Health Scores", x = "Health Score", y = "Proportion", fill = "Insurance Type") +
  theme_minimal() +
  theme(legend.position = "top")

Non-chronic Medicaid patients have a higher chance of a worse health score.

Both non-chronic and chronic patients show similar health score distributions, with the majority concentrated at the low end (meaning that they are healthier).

The chronic group has more spread, as expected.

How does your health score influence your activity level?

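We will plot the average days of reduced activity against health score.

First, I rebuild combined_df so that Health_Score is numeric again, then aggregate the mean days of reduced activity by health score for each insurance type and chronic condition:

# Rebuild combined_df (Health_Score returns to numeric after the earlier factor conversion)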

combined_df = rbind(
  data.frame(Insurance_Type = 'Private', Health_Score = private_ins$health),
  data.frame(Insurance_Type = 'Medicaid', Health_Score = medicaid$health),
  data.frame(Insurance_Type = 'Medicare', Health_Score = medicare$health)
)


# Aggregating the data for each insurance type and condition
priv_agg_chronic = priv_chronic %>%
  group_by(health) %>%
  summarise(mean_reduced = mean(reduced, na.rm = TRUE))

priv_agg_non_chronic = priv_no_chronic %>%
  group_by(health) %>%
  summarise(mean_reduced = mean(reduced, na.rm = TRUE))

medicaid_agg_chronic = medicaid_chronic %>%
  group_by(health) %>%
  summarise(mean_reduced = mean(reduced, na.rm = TRUE))

medicaid_agg_non_chronic = medicaid_no_chronic %>%
  group_by(health) %>%
  summarise(mean_reduced = mean(reduced, na.rm = TRUE))

medicare_agg_chronic = medicare_chronic %>%
  group_by(health) %>%
  summarise(mean_reduced = mean(reduced, na.rm = TRUE))

medicare_agg_non_chronic = medicare_no_chronic %>%
  group_by(health) %>%
  summarise(mean_reduced = mean(reduced, na.rm = TRUE))
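The six near-identical aggregations above could be written as a single grouped summary. This is only a sketch: it assumes a single `insurance` column, which the real data does not have (it has three yes/no columns), and the `toy` rows are made up:

```r
library(dplyr)

# Made-up rows; `insurance` is a hypothetical column that would first have to be
# derived from private / freepoor / freerepat in the real data
toy <- data.frame(
  insurance = c("Private", "Private", "Private", "Medicaid", "Medicaid", "Medicare"),
  lchronic  = c("yes", "yes", "no", "no", "no", "yes"),
  health    = c(0, 0, 1, 2, 2, 5),
  reduced   = c(1, 3, 0, 4, 6, 10)
)

# One pass: mean reduced-activity days per insurance type, chronic flag, and score
agg_all <- toy %>%
  group_by(insurance, lchronic, health) %>%
  summarise(
    mean_reduced = mean(reduced, na.rm = TRUE),
    n_patients   = n(),
    .groups      = "drop"
  )
print(agg_all)

# Any one of the six tables is then a simple filter
priv_chronic_toy <- filter(agg_all, insurance == "Private", lchronic == "yes")
print(priv_chronic_toy)
```

Keeping `n_patients` alongside each mean shows how many observations sit behind every point in the scatter plots that follow.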

Now I’m going to plot the data for each insurance type and condition. I am also going to perform a linear regression for each.

Private Insurance:

ggplot(data = priv_agg_chronic, aes(x = health, y = mean_reduced)) +
  geom_point() +
  labs(title = "Mean Reduced Activity vs Health Score for Private Insurance - Chronic",
       x = "Health Score",
       y = "Mean Days of Reduced Activity") +
  theme_minimal()

priv_chronic_model = lm(mean_reduced ~ health, data = priv_agg_chronic)
summary(priv_chronic_model)

## 
## Call:
## lm(formula = mean_reduced ~ health, data = priv_agg_chronic)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.0359 -1.2450  0.1789  1.0897  5.1171 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   2.4992     1.3927   1.795    0.100
## health        0.3215     0.1970   1.632    0.131
## 
## Residual standard error: 2.657 on 11 degrees of freedom
## Multiple R-squared:  0.195,  Adjusted R-squared:  0.1218 
## F-statistic: 2.665 on 1 and 11 DF,  p-value: 0.1309

# Interesting. The points rise roughly linearly up to a health score of 4, then become more variable.
# For patients with chronic conditions and private health insurance, days of reduced activity
# increase with health score below a score of 5, but the overall fit is not significant (slope 0.32, p = 0.13).

ggplot(data = priv_agg_non_chronic, aes(x = health, y = mean_reduced)) +
  geom_point() +
  labs(title = "Mean Reduced Activity vs Health Score for Private Insurance - Non Chronic",
       x = "Health Score",
       y = "Mean Days of Reduced Activity") +
  theme_minimal()

priv_non_chronic_model = lm(mean_reduced ~ health, data = priv_agg_non_chronic)
summary(priv_non_chronic_model)

## 
## Call:
## lm(formula = mean_reduced ~ health, data = priv_agg_non_chronic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.69252 -0.64984 -0.05565  0.67618  1.85243 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.86196    0.49723   1.734    0.111
## health       0.10581    0.07032   1.505    0.161
## 
## Residual standard error: 0.9487 on 11 degrees of freedom
## Multiple R-squared:  0.1707, Adjusted R-squared:  0.0953 
## F-statistic: 2.264 on 1 and 11 DF,  p-value: 0.1606

# Here the relationship is weaker than for chronic patients (slope 0.11, p = 0.16)

Medicaid:

# Medicaid
ggplot(data = medicaid_agg_chronic, aes(x = health, y = mean_reduced)) +
  geom_point() +
  labs(title = "Mean Reduced Activity vs Health Score for Medicaid - Chronic",
       x = "Health Score",
       y = "Mean Days of Reduced Activity") +
  theme_minimal()

medicaid_chronic_model = lm(mean_reduced ~ health, data = medicaid_agg_chronic)
summary(medicaid_chronic_model)

## 
## Call:
## lm(formula = mean_reduced ~ health, data = medicaid_agg_chronic)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.598 -2.882 -1.645  1.266  9.845 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.8076     2.8874   0.280    0.789
## health        0.5580     0.6227   0.896    0.405
## 
## Residual standard error: 4.803 on 6 degrees of freedom
## Multiple R-squared:  0.118,  Adjusted R-squared:  -0.02897 
## F-statistic: 0.8029 on 1 and 6 DF,  p-value: 0.4047

ggplot(data = medicaid_agg_non_chronic, aes(x = health, y = mean_reduced)) +
  geom_point() +
  labs(title = "Mean Reduced Activity vs Health Score for Medicaid - Non Chronic",
       x = "Health Score",
       y = "Mean Days of Reduced Activity") +
  theme_minimal()

# Medicaid - Non-Chronic
medicaid_non_chronic_model = lm(mean_reduced ~ health, data = medicaid_agg_non_chronic)
summary(medicaid_non_chronic_model)

## 
## Call:
## lm(formula = mean_reduced ~ health, data = medicaid_agg_non_chronic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.29496 -0.82872 -0.03818  0.49558  2.91450 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -0.009137   0.766278  -0.012   0.9907  
## health       0.209463   0.108368   1.933   0.0794 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.462 on 11 degrees of freedom
## Multiple R-squared:  0.2535, Adjusted R-squared:  0.1857 
## F-statistic: 3.736 on 1 and 11 DF,  p-value: 0.0794

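The Medicaid non-chronic model has the larger R-squared (0.25, versus 0.12 for chronic), but its points sit close to the x-axis, meaning fewer days of reduced activity. Neither slope is significant at the 0.05 level (p = 0.40 for chronic, p = 0.079 for non-chronic).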

Medicare:

ggplot(data = medicare_agg_chronic, aes(x = health, y = mean_reduced)) +
  geom_point() +
  labs(title = "Mean Reduced Activity vs Health Score for Medicare - Chronic",
       x = "Health Score",
       y = "Mean Days of Reduced Activity") +
  theme_minimal()

medicare_chronic_model = lm(mean_reduced ~ health, data = medicare_agg_chronic)
summary(medicare_chronic_model)

## 
## Call:
## lm(formula = mean_reduced ~ health, data = medicare_agg_chronic)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4416 -1.4126 -0.6727  0.8732  4.9270 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.2724     1.0542   0.258 0.800866    
## health        0.8801     0.1491   5.903 0.000103 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.011 on 11 degrees of freedom
## Multiple R-squared:   0.76,  Adjusted R-squared:  0.7382 
## F-statistic: 34.84 on 1 and 11 DF,  p-value: 0.0001027

ggplot(data = medicare_agg_non_chronic, aes(x = health, y = mean_reduced)) +
  geom_point() +
  labs(title = "Mean Reduced Activity vs Health Score for Medicare - Non Chronic",
       x = "Health Score",
       y = "Mean Days of Reduced Activity") +
  theme_minimal()

medicare_non_chronic_model = lm(mean_reduced ~ health, data = medicare_agg_non_chronic)
summary(medicare_non_chronic_model)

## 
## Call:
## lm(formula = mean_reduced ~ health, data = medicare_agg_non_chronic)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3642 -0.7603 -0.2274  0.6096  4.0414 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   1.1334     1.1936   0.950    0.365
## health        0.2028     0.1838   1.103    0.296
## 
## Residual standard error: 2.198 on 10 degrees of freedom
## Multiple R-squared:  0.1085, Adjusted R-squared:  0.01937 
## F-statistic: 1.217 on 1 and 10 DF,  p-value: 0.2957

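Medicare chronic shows a clear linear trend (slope 0.88, R-squared 0.76, p = 0.0001).

Medicare non-chronic also slopes upward, but the fit is weak and not significant (slope 0.20, R-squared 0.11, p = 0.30).

All six fitted slopes are positive, so a higher health score goes with more days of reduced activity. However, only the Medicare chronic slope is significant at the 0.05 level, and the Medicaid chronic fit is the noisiest (adjusted R-squared below zero).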
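One caveat about the regressions above: each point is a group mean, so a health score held by a handful of patients counts as much as a score held by hundreds. A possible alternative (not what the analysis above does) is to weight each mean by its patient count, which gives the same coefficients as fitting on the patient-level rows. A minimal sketch on made-up data (`patients` is hypothetical):

```r
library(dplyr)

# Made-up patient-level rows (not the real data)
patients <- data.frame(
  health  = c(0, 0, 0, 0, 1, 1, 2, 2, 3, 6),
  reduced = c(0, 0, 1, 0, 1, 2, 1, 3, 4, 2)
)

# Aggregate to one mean per health score, keeping the group size
agg <- patients %>%
  group_by(health) %>%
  summarise(mean_reduced = mean(reduced), n_patients = n(), .groups = "drop")

# Unweighted fit on the means (the approach used in this report)
unweighted <- lm(mean_reduced ~ health, data = agg)

# Weighted fit on the means, and a fit on the raw patient rows
weighted <- lm(mean_reduced ~ health, data = agg, weights = n_patients)
patient  <- lm(reduced ~ health, data = patients)

print(coef(unweighted))
print(coef(weighted))
print(coef(patient))
```

The weighted and patient-level coefficients agree, while the unweighted fit can be pulled around by sparsely populated scores at the high end.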

Now let's look at the distribution of health scores by gender:

# Boxplot
ggplot(DoctorVisits, aes(x = gender, y = health, fill = gender)) +
  geom_boxplot() +
  labs(title = "Boxplot of Health Score by Gender",
       x = "Gender",
       y = "Health Score",
       fill = "Gender") +
  theme_minimal()

# Violin plot
ggplot(DoctorVisits, aes(x = gender, y = health, fill = gender)) +
  geom_violin() +
  labs(title = "Violin Plot of Health Score by Gender",
       x = "Gender",
       y = "Health Score",
       fill = "Gender") +
  theme_minimal()

We see that men have the higher health scores. Men also have more variability in health scores than women, whose health scores were lower.

Conclusion:

Regarding the health insurance, the data was divided into three groups based on the type of health insurance: Private, Medicaid, and Medicare. The health scores of the three groups were found to be closely comparable, suggesting no immediate observable impact of the type of insurance on health scores. However, further analysis demonstrated that individuals with private insurance and non-chronic conditions had a higher likelihood of having a health score of 0, signifying better health, compared to those with public health insurance.

Chronic patients, compared to non-chronic, exhibited a wider spread in their health score distribution, which was expected due to the nature of their chronic conditions. For non-chronic patients, those insured through Medicaid demonstrated a higher chance of having a worse health score.

Next, the influence of health scores on activity level was examined. There was a general trend that higher health scores (indicating worse health) corresponded to more days of reduced activity: all six fitted slopes were positive. This relationship was strong and statistically significant only for chronic Medicare patients (R-squared 0.76, p = 0.0001). For the other five groups the slopes were not significant at the 0.05 level, with R-squared values between 0.11 and 0.25; non-chronic Medicaid patients came closest (p = 0.079).

Finally, when examining the distribution of health scores by gender, it was observed that males had higher and more varied health scores compared to females. This indicates men generally had poorer health outcomes, but also a wider spread in health conditions, in contrast to women, whose health scores were overall lower, indicating better health outcomes.

Jean Jimenez

2023-07-26