Final Project – CUNY SPS MSDS R Summer Bridge

How does the type of health insurance (public vs private) patients have influence doctor visit metrics?

Are there any differences between privately insured high and low risk populations?

I am using a data set of doctor visit metrics.

# Import the dataset
# (importing from a URL failed because of the file size, so a local copy is read)
library(readr)
DoctorVisits <- read_csv("C:/Users/bleac/OneDrive/Documents/summer bridge final/DoctorVisits.csv")

## New names:
## Rows: 5190 Columns: 13
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (6): gender, private, freepoor, freerepat, nchronic, lchronic dbl (7): ...1,
## visits, age, income, illness, reduced, health
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

head(DoctorVisits)

## # A tibble: 6 × 13
##    ...1 visits gender   age income illness reduced health private freepoor
##   <dbl>  <dbl> <chr>  <dbl>  <dbl>   <dbl>   <dbl>  <dbl> <chr>   <chr>   
## 1     1      1 female  0.19   0.55       1       4      1 yes     no      
## 2     2      1 female  0.19   0.45       1       2      1 yes     no      
## 3     3      1 male    0.19   0.9        3       0      0 no      no      
## 4     4      1 male    0.19   0.15       1       0      0 no      no      
## 5     5      1 male    0.19   0.45       2       5      1 no      no      
## 6     6      1 female  0.19   0.35       5       1      9 no      no      
## # ℹ 3 more variables: freerepat <chr>, nchronic <chr>, lchronic <chr>

Data Exploration:

Now let's explore the data and get some summary statistics.

#Summary statistics
summary(DoctorVisits)

##       ...1          visits          gender               age        
##  Min.   :   1   Min.   :0.0000   Length:5190        Min.   :0.1900  
##  1st Qu.:1298   1st Qu.:0.0000   Class :character   1st Qu.:0.2200  
##  Median :2596   Median :0.0000   Mode  :character   Median :0.3200  
##  Mean   :2596   Mean   :0.3017                      Mean   :0.4064  
##  3rd Qu.:3893   3rd Qu.:0.0000                      3rd Qu.:0.6200  
##  Max.   :5190   Max.   :9.0000                      Max.   :0.7200  
##      income          illness         reduced            health      
##  Min.   :0.0000   Min.   :0.000   Min.   : 0.0000   Min.   : 0.000  
##  1st Qu.:0.2500   1st Qu.:0.000   1st Qu.: 0.0000   1st Qu.: 0.000  
##  Median :0.5500   Median :1.000   Median : 0.0000   Median : 0.000  
##  Mean   :0.5832   Mean   :1.432   Mean   : 0.8619   Mean   : 1.218  
##  3rd Qu.:0.9000   3rd Qu.:2.000   3rd Qu.: 0.0000   3rd Qu.: 2.000  
##  Max.   :1.5000   Max.   :5.000   Max.   :14.0000   Max.   :12.000  
##    private            freepoor          freerepat           nchronic        
##  Length:5190        Length:5190        Length:5190        Length:5190       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    lchronic        
##  Length:5190       
##  Class :character  
##  Mode  :character  
##                    
##                    
##

DoctorVisits is a data set of n = 5190 people.

Getting statistics on the age distribution (age is stored as years divided by 100):

mean(DoctorVisits$age)

## [1] 0.4063854

median(DoctorVisits$age)

## [1] 0.32

min(DoctorVisits$age)

## [1] 0.19

max(DoctorVisits$age)

## [1] 0.72

sd(DoctorVisits$age)

## [1] 0.2047818

Mean age is ~41, median age is 32, min is 19, and max is 72, with a standard deviation of about 20.5 years.

Getting statistics on annual income (interpreted here as hundreds of thousands of dollars):

mean(DoctorVisits$income)

## [1] 0.5831599

median(DoctorVisits$income)

## [1] 0.55

min(DoctorVisits$income)

## [1] 0

max(DoctorVisits$income)

## [1] 1.5

sd(DoctorVisits$income)

## [1] 0.3689067

The mean income is $58k and the median income is $55k. Minimum income is $0 and maximum income is $150k, with a standard deviation of about $37k.
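The same five statistics reported above for age and income can be gathered in one call. The sketch below is only an illustration: `describe` and its `scale` argument are hypothetical names, and the toy vector stands in for `DoctorVisits$age`.

```r
# Hypothetical helper: summarise a numeric vector in one call.
# `scale` converts stored units back to natural units (e.g. age / 100 -> years).
describe <- function(x, scale = 1) {
  x <- x * scale
  c(mean = mean(x), median = median(x), min = min(x), max = max(x), sd = sd(x))
}

# Made-up values standing in for DoctorVisits$age
toy_age <- c(0.19, 0.22, 0.32, 0.62, 0.72)
age_summary <- describe(toy_age, scale = 100)
print(round(age_summary, 1))
```

On the real data this would be called as `describe(DoctorVisits$age, scale = 100)`, which keeps the rescaling in one place instead of in the prose.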

Getting sex statistics:

sex=table(DoctorVisits$gender)
print(sex)

## 
## female   male 
##   2702   2488

The sample is almost evenly split, with 2702 females (52%) and 2488 males (48%).

Data Wrangling:

Do patients with private health insurance have different conditions than patients with public insurance?

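I am going to split the data by the three types of insurance. Each type is a separate yes/no column, so the three groups are not guaranteed to be mutually exclusive or to cover every patient.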

Private insurance:

list_of_priv=split(DoctorVisits,DoctorVisits$private)

private_ins=list_of_priv$yes

head(private_ins)

## # A tibble: 6 × 13
##    ...1 visits gender   age income illness reduced health private freepoor
##   <dbl>  <dbl> <chr>  <dbl>  <dbl>   <dbl>   <dbl>  <dbl> <chr>   <chr>   
## 1     1      1 female  0.19   0.55       1       4      1 yes     no      
## 2     2      1 female  0.19   0.45       1       2      1 yes     no      
## 3     9      1 female  0.19   0.65       2       0      5 yes     no      
## 4    10      1 male    0.19   0.15       1       0      0 yes     no      
## 5    15      1 male    0.19   0.25       3       1      0 yes     no      
## 6    17      2 male    0.19   0.45       1       0      5 yes     no      
## # ℹ 3 more variables: freerepat <chr>, nchronic <chr>, lchronic <chr>

Medicaid:

list_of_pub_poor=split(DoctorVisits,DoctorVisits$freepoor)

medicaid=list_of_pub_poor$yes

head(medicaid)

## # A tibble: 6 × 13
##    ...1 visits gender   age income illness reduced health private freepoor
##   <dbl>  <dbl> <chr>  <dbl>  <dbl>   <dbl>   <dbl>  <dbl> <chr>   <chr>   
## 1    26      1 female  0.19   0.15       1       2      6 no      yes     
## 2    28      1 male    0.19   0          1       0      0 no      yes     
## 3    34      1 female  0.19   0.06       1       0      0 no      yes     
## 4    46      1 female  0.19   0.25       1       0      1 no      yes     
## 5    50      1 male    0.19   0.15       1       0      4 no      yes     
## 6    59      2 male    0.19   0.25       1       0      0 no      yes     
## # ℹ 3 more variables: freerepat <chr>, nchronic <chr>, lchronic <chr>

Medicare:

list_of_pub_old=split(DoctorVisits,DoctorVisits$freerepat)

medicare=list_of_pub_old$yes

head(medicare)

## # A tibble: 6 × 13
##    ...1 visits gender   age income illness reduced health private freepoor
##   <dbl>  <dbl> <chr>  <dbl>  <dbl>   <dbl>   <dbl>  <dbl> <chr>   <chr>   
## 1    12      1 male    0.19   0.25       2       0      2 no      no      
## 2    25      1 female  0.19   0.25       2      14     11 no      no      
## 3    45      1 female  0.19   0.55       2       3      0 no      no      
## 4    83      1 female  0.19   0.25       1       0      9 no      no      
## 5   140      1 male    0.22   0.25       2       0     12 no      no      
## 6   144      1 female  0.22   0.25       1       0      0 no      no      
## # ℹ 3 more variables: freerepat <chr>, nchronic <chr>, lchronic <chr>
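Because the three insurance flags are independent yes/no columns, it can help to check how the groups overlap before comparing them. The sketch below uses a made-up data frame with the same column names; the values are not from the real data.

```r
# Made-up rows with the same flag columns as DoctorVisits
toy <- data.frame(
  visits    = c(1, 0, 2, 0, 3, 1),
  private   = c("yes", "no", "no", "yes", "no", "no"),
  freepoor  = c("no", "yes", "no", "no", "no", "no"),
  freerepat = c("no", "no", "yes", "no", "yes", "no")
)

flags <- c("private", "freepoor", "freerepat")

# Size of each insurance group
group_sizes <- colSums(toy[flags] == "yes")
print(group_sizes)

# Number of "yes" flags per patient: 0 means none of the three types,
# 2 or more would mean the groups overlap
n_flags <- rowSums(toy[flags] == "yes")
print(table(n_flags))
```

A patient with zero flags is left out of all three subsets, so the three group means do not describe the whole sample.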

Now let's compare the three groups.

First compare visits:

meanpriv=mean(private_ins$visits)
meanmedicaid=mean(medicaid$visits)
meanmedicare=mean(medicare$visits)

print(meanpriv)

## [1] 0.294604

print(meanmedicaid)

## [1] 0.1576577

print(meanmedicare)

## [1] 0.4665445

min(private_ins$illness)

## [1] 0

On average, people with Medicare have more doctor visits (0.47) than people with private insurance (0.29) or Medicaid (0.16).
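To put the gap between the groups in relative terms, the three means printed above can be compared directly. This is a small sketch using only those reported means; it compares averages and is not a significance test.

```r
# Mean visits per group, copied from the output above
mean_visits <- c(private = 0.294604, medicaid = 0.1576577, medicare = 0.4665445)

# Group with the highest mean
top_group <- names(which.max(mean_visits))

# How many times larger the top mean is than each group's mean
ratio_to_top <- round(mean_visits[[top_group]] / mean_visits, 2)

print(top_group)
print(ratio_to_top)
```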

Now, let's make a new column in the private group called risk. Risk is a made-up score meant to flag people who are more likely to be unhealthy and cost more money.

The score depends on age, income, and number of illnesses.

calculate_risk = function(age, income, num_illnesses) {
  
  # Maximum possible values 
  max_age = 72
  max_income = 1.5
  max_illnesses = 5
  
  # Normalize each parameter to a 0-1 scale
  age_score = age / max_age
  income_score = income / max_income
  illness_score = num_illnesses / max_illnesses
  
  # Calculate risk. As written, the three scores are simply added, so higher
  # age, higher income, and more illnesses each raise the score
  risk = (age_score + income_score + illness_score ) * 100
  
  return(risk)
}

priv_risk=calculate_risk(age=(private_ins$age*100),income=(private_ins$income),num_illnesses=private_ins$illness)
private_ins$risk=priv_risk
head(private_ins)

## # A tibble: 6 × 14
##    ...1 visits gender   age income illness reduced health private freepoor
##   <dbl>  <dbl> <chr>  <dbl>  <dbl>   <dbl>   <dbl>  <dbl> <chr>   <chr>   
## 1     1      1 female  0.19   0.55       1       4      1 yes     no      
## 2     2      1 female  0.19   0.45       1       2      1 yes     no      
## 3     9      1 female  0.19   0.65       2       0      5 yes     no      
## 4    10      1 male    0.19   0.15       1       0      0 yes     no      
## 5    15      1 male    0.19   0.25       3       1      0 yes     no      
## 6    17      2 male    0.19   0.45       1       0      5 yes     no      
## # ℹ 4 more variables: freerepat <chr>, nchronic <chr>, lchronic <chr>,
## #   risk <dbl>

min(private_ins$risk)

## [1] 26.38889

max(private_ins$risk)

## [1] 300

avg_risk=mean(private_ins$risk)
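A worked example makes the score easier to follow. The sketch below re-implements the formula as written and, for comparison only, a hypothetical variant in which higher income lowers the score; `risk_as_written` and `risk_income_inverted` are illustrative names, not part of the analysis above.

```r
# The score as written: three 0-1 components added, times 100
risk_as_written <- function(age, income, num_illnesses,
                            max_age = 72, max_income = 1.5, max_illnesses = 5) {
  (age / max_age + income / max_income + num_illnesses / max_illnesses) * 100
}

# Hypothetical variant (an assumption, not the author's method):
# income enters as 1 - income_score, so higher income reduces the score
risk_income_inverted <- function(age, income, num_illnesses,
                                 max_age = 72, max_income = 1.5, max_illnesses = 5) {
  (age / max_age + (1 - income / max_income) + num_illnesses / max_illnesses) * 100
}

# First private-insurance row: age 19, income 0.55, 1 illness
print(risk_as_written(19, 0.55, 1))       # about 83.06
print(risk_income_inverted(19, 0.55, 1))  # about 109.72

# Extremes of the score as written match the min and max reported above
print(risk_as_written(19, 0, 0))    # about 26.39
print(risk_as_written(72, 1.5, 5))  # 300
```

Each component can contribute at most 100 points, so the score as written ranges from 0 to 300.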

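Now let's divide further into low-risk and high-risk private groups, using the mean score as the cutoff. Note that, as coded, low_risk holds the patients whose score is above the mean and high_risk holds those at or below it.

# Split at the mean score: low_risk gets scores above it, high_risk the rest
low_risk=subset(private_ins, risk > avg_risk)
high_risk=subset(private_ins, risk <= avg_risk)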

head(low_risk)

## # A tibble: 6 × 14
##    ...1 visits gender   age income illness reduced health private freepoor
##   <dbl>  <dbl> <chr>  <dbl>  <dbl>   <dbl>   <dbl>  <dbl> <chr>   <chr>   
## 1    35      1 female  0.19   0.45       4       0      0 yes     no      
## 2    37      1 female  0.19   1.1        2       0      1 yes     no      
## 3   119      1 male    0.22   1.1        2       7      1 yes     no      
## 4   126      1 male    0.22   0.75       3       6      3 yes     no      
## 5   128      2 male    0.22   0.65       3       0      0 yes     no      
## 6   141      1 female  0.22   0.9        2       0      0 yes     no      
## # ℹ 4 more variables: freerepat <chr>, nchronic <chr>, lchronic <chr>,
## #   risk <dbl>

head(high_risk)

## # A tibble: 6 × 14
##    ...1 visits gender   age income illness reduced health private freepoor
##   <dbl>  <dbl> <chr>  <dbl>  <dbl>   <dbl>   <dbl>  <dbl> <chr>   <chr>   
## 1     1      1 female  0.19   0.55       1       4      1 yes     no      
## 2     2      1 female  0.19   0.45       1       2      1 yes     no      
## 3     9      1 female  0.19   0.65       2       0      5 yes     no      
## 4    10      1 male    0.19   0.15       1       0      0 yes     no      
## 5    15      1 male    0.19   0.25       3       1      0 yes     no      
## 6    17      2 male    0.19   0.45       1       0      5 yes     no      
## # ℹ 4 more variables: freerepat <chr>, nchronic <chr>, lchronic <chr>,
## #   risk <dbl>
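In the split above, low_risk receives the scores above the mean. For comparison, here is a minimal sketch of the opposite labeling, where a higher score means higher risk. The scores are made up and `risk_group` is a hypothetical column name.

```r
# Made-up risk scores
toy_risk <- c(26.4, 83.1, 150.0, 210.5, 300.0)

# Cutoff at the mean score, as in the analysis above
threshold <- mean(toy_risk)

# Alternative labeling (an assumption): scores above the cutoff are "high"
risk_group <- ifelse(toy_risk > threshold, "high", "low")

print(threshold)
print(table(risk_group))
```

Storing the label as a column, instead of creating two separate data frames, also makes it easy to facet or colour plots by group.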

Graphics:

Lets plot the age distributions of these two groups

Plotting the density plot of low risk:

# Plot the density plot of low risk
library(ggplot2)
ggplot(low_risk, aes(x=age*100)) +
  geom_density(alpha=0.5) +
  theme_minimal() +
  labs(title="Age distribution of low risk private insurance", 
       x="Age", 
       y="Density")+
  xlim(0,100)

Plotting the density plot of high risk:

# Plot the density plot of high risk

ggplot(high_risk, aes(x=age*100)) +
  geom_density(alpha=0.5) +
  theme_minimal() +
  labs(title="Age distribution of high risk private insurance", 
       x="Age", 
       y="Density")+
  xlim(0,100)

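As you can see, the age distribution of the low_risk private group is more spread out than the age distribution of the high_risk private group.

Most patients in the high_risk group are young. This is consistent with how the groups were built: age adds to the score, and high_risk holds the patients with scores at or below the mean.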
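"More spread out" can also be stated as a number rather than read off the density plots. The sketch below compares the standard deviation and interquartile range of age in two groups; the ages are made up for illustration and are not taken from the real subsets.

```r
# Made-up ages (in years) for the two groups
toy_ages <- data.frame(
  group = rep(c("low_risk", "high_risk"), each = 5),
  age   = c(22, 37, 52, 62, 72, 19, 19, 22, 22, 27)
)

# Spread of age within each group
sd_by_group  <- tapply(toy_ages$age, toy_ages$group, sd)
iqr_by_group <- tapply(toy_ages$age, toy_ages$group, IQR)

print(rbind(sd = sd_by_group, iqr = iqr_by_group))
```

On the real data the same calls would be `sd(low_risk$age * 100)` and `sd(high_risk$age * 100)`.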

Now, let's look at histograms of the number of illnesses in the low and high risk groups.

Histogram for Low Risk Patients:

# Histogram for Low Risk Patients

ggplot(low_risk, aes(x=illness)) +
  geom_histogram(binwidth=1, color="black", fill="blue") +
  labs(title="Histogram of Number of Illnesses for Low Risk Patients", 
       x="Number of Illnesses", 
       y="Frequency")

Histogram of high risk patients:

# Histogram for High Risk Patients

ggplot(high_risk, aes(x=illness)) +
  geom_histogram(binwidth=1, color="black", fill="red") +
  labs(title="Histogram of Number of Illnesses for High Risk Patients", 
       x="Number of Illnesses", 
       y="Frequency")

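Patients in the low_risk group appear to have more illnesses.

Because illnesses add to the score and low_risk holds the above-average scores, this pattern follows from the way the groups were defined. It does not by itself show that the number of illnesses has a low influence on the risk score.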
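How strongly each input tracks the score can be checked directly instead of being inferred from the histograms. The sketch below computes the correlation of each component with the score on made-up data, using the same additive formula as `calculate_risk`; the values are illustrative only.

```r
# Made-up patients: age in years, income and illness on the data set's scales
toy <- data.frame(
  age     = c(19, 22, 32, 52, 62, 72),
  income  = c(0.55, 0.25, 0.90, 0.45, 1.10, 0.15),
  illness = c(1, 0, 2, 5, 3, 1)
)

# Same additive score as calculate_risk above
toy$risk <- (toy$age / 72 + toy$income / 1.5 + toy$illness / 5) * 100

# Correlation of each input with the score
component_cor <- sapply(toy[c("age", "income", "illness")],
                        function(v) cor(v, toy$risk))
print(round(component_cor, 2))
```

Each input is scaled to 0-1 before being added, so which one drives the score most depends on how much it varies across patients, not on a built-in weight.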

Now let's see the distribution of income in low risk patients.

Boxplot for Income Distribution for Low Risk Patients:

ggplot(low_risk, aes(y=income*100000)) +
  geom_boxplot(fill="green", outlier.shape = NA) +
  labs(title="Income Distribution for Low Risk Patients", 
       y="Annual Income in $", 
       x="")+
  ylim(0,150000)+
  theme_minimal()

Here you can see the distribution of income in low risk patients (outliers are hidden by outlier.shape = NA).
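The quantities a boxplot summarises can also be printed as numbers. The sketch below computes the minimum, quartiles, and maximum for made-up income values, using the same multiplication by 100000 as the plot.

```r
# Made-up incomes on the data set's scale
toy_income <- c(0.25, 0.45, 0.55, 0.65, 0.90, 1.10, 1.50)

# Minimum, quartiles, and maximum in dollars (same scaling as the plot)
income_quartiles <- quantile(toy_income * 100000,
                             probs = c(0, 0.25, 0.5, 0.75, 1))
print(income_quartiles)
```

On the real data this would be `quantile(low_risk$income * 100000)`, which gives the median and box edges without reading them off the axis.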

When examining the types of health insurance, the analysis showed that people with Medicare have more doctor visits on average than those with private insurance or Medicaid. A plausible explanation is that Medicare patients tend to be older and older adults generally visit the doctor more often, although age by insurance type was not compared here. This led to a further investigation into the private insurance subgroup, where a risk score was calculated for each patient based on their age, income, and number of illnesses.

The private insurance patients were then divided into low and high-risk groups at the mean risk score. The age distribution of the low-risk group was more spread out than that of the high-risk group, where younger patients predominated. Despite being classified as low-risk, these patients might also have more illnesses. Because the low-risk group was defined as the patients with above-average scores, and both age and illnesses raise the score, these patterns likely reflect how the groups were labeled rather than a weak influence of illnesses on the score.

In conclusion, this analysis gives a first look at the relationship between health insurance type, patient characteristics, and a simple risk score. It also points to areas for future investigation, such as refining the risk calculation, checking the direction of the low/high cutoff, and exploring why younger patients are more likely to fall into the high-risk group.

Jean Jimenez

2023-07-25