How does the type of health insurance (public vs private) patients have influence doctor visit metrics?
Are there any differences between privately insured high and low risk populations?
I am using a data set of doctors visit metrics.
#Importing the dataset
#tried to import from url but failed due to file size
library(readr)
DoctorVisits <- read_csv("C:/Users/bleac/OneDrive/Documents/summer bridge final/DoctorVisits.csv")
## New names:
## Rows: 5190 Columns: 13
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (6): gender, private, freepoor, freerepat, nchronic, lchronic dbl (7): ...1,
## visits, age, income, illness, reduced, health
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
head(DoctorVisits)
## # A tibble: 6 × 13
## ...1 visits gender age income illness reduced health private freepoor
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 1 1 female 0.19 0.55 1 4 1 yes no
## 2 2 1 female 0.19 0.45 1 2 1 yes no
## 3 3 1 male 0.19 0.9 3 0 0 no no
## 4 4 1 male 0.19 0.15 1 0 0 no no
## 5 5 1 male 0.19 0.45 2 5 1 no no
## 6 6 1 female 0.19 0.35 5 1 9 no no
## # ℹ 3 more variables: freerepat <chr>, nchronic <chr>, lchronic <chr>
Data Exploration:
Now lets explore the data and get some summary statistics
#Summary statistics
summary(DoctorVisits)
## ...1 visits gender age
## Min. : 1 Min. :0.0000 Length:5190 Min. :0.1900
## 1st Qu.:1298 1st Qu.:0.0000 Class :character 1st Qu.:0.2200
## Median :2596 Median :0.0000 Mode :character Median :0.3200
## Mean :2596 Mean :0.3017 Mean :0.4064
## 3rd Qu.:3893 3rd Qu.:0.0000 3rd Qu.:0.6200
## Max. :5190 Max. :9.0000 Max. :0.7200
## income illness reduced health
## Min. :0.0000 Min. :0.000 Min. : 0.0000 Min. : 0.000
## 1st Qu.:0.2500 1st Qu.:0.000 1st Qu.: 0.0000 1st Qu.: 0.000
## Median :0.5500 Median :1.000 Median : 0.0000 Median : 0.000
## Mean :0.5832 Mean :1.432 Mean : 0.8619 Mean : 1.218
## 3rd Qu.:0.9000 3rd Qu.:2.000 3rd Qu.: 0.0000 3rd Qu.: 2.000
## Max. :1.5000 Max. :5.000 Max. :14.0000 Max. :12.000
## private freepoor freerepat nchronic
## Length:5190 Length:5190 Length:5190 Length:5190
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## lchronic
## Length:5190
## Class :character
## Mode :character
##
##
##
Doctor visits is a data set of n=5190 people
Getting statistics on the age distribution (age is divided by 100)
mean(DoctorVisits$age)
## [1] 0.4063854
median(DoctorVisits$age)
## [1] 0.32
min(DoctorVisits$age)
## [1] 0.19
max(DoctorVisits$age)
## [1] 0.72
sd(DoctorVisits$age)
## [1] 0.2047818
Mean age is ~41, median age is 32, min is 19, max is 72, with a standard deviation of +/- 20.5 years
Getting statistics on annual income in hundred of thousands of $:
mean(DoctorVisits$income)
## [1] 0.5831599
median(DoctorVisits$income)
## [1] 0.55
min(DoctorVisits$income)
## [1] 0
max(DoctorVisits$income)
## [1] 1.5
sd(DoctorVisits$income)
## [1] 0.3689067
The Mean income is $58k, with median income being $55K. Min income is $0 and Max income is $150k with a standard deviation of +/- $37k.
Getting sex statistics:
sex=table(DoctorVisits$gender)
print(sex)
##
## female male
## 2702 2488
The sample is almost evenly split with 2702 females and 2488 males
Data Wrangling:
Do patients with private health insurance have different conditions than patients with public insurance?
Going to split up the data into the three types of insurances.
Private insurance:
list_of_priv=split(DoctorVisits,DoctorVisits$private)
private_ins=list_of_priv$yes
head(private_ins)
## # A tibble: 6 × 13
## ...1 visits gender age income illness reduced health private freepoor
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 1 1 female 0.19 0.55 1 4 1 yes no
## 2 2 1 female 0.19 0.45 1 2 1 yes no
## 3 9 1 female 0.19 0.65 2 0 5 yes no
## 4 10 1 male 0.19 0.15 1 0 0 yes no
## 5 15 1 male 0.19 0.25 3 1 0 yes no
## 6 17 2 male 0.19 0.45 1 0 5 yes no
## # ℹ 3 more variables: freerepat <chr>, nchronic <chr>, lchronic <chr>
Medicaid:
list_of_pub_poor=split(DoctorVisits,DoctorVisits$freepoor)
medicaid=list_of_pub_poor$yes
head(medicaid)
## # A tibble: 6 × 13
## ...1 visits gender age income illness reduced health private freepoor
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 26 1 female 0.19 0.15 1 2 6 no yes
## 2 28 1 male 0.19 0 1 0 0 no yes
## 3 34 1 female 0.19 0.06 1 0 0 no yes
## 4 46 1 female 0.19 0.25 1 0 1 no yes
## 5 50 1 male 0.19 0.15 1 0 4 no yes
## 6 59 2 male 0.19 0.25 1 0 0 no yes
## # ℹ 3 more variables: freerepat <chr>, nchronic <chr>, lchronic <chr>
Medicare:
list_of_pub_old=split(DoctorVisits,DoctorVisits$freerepat)
medicare=list_of_pub_old$yes
head(medicare)
## # A tibble: 6 × 13
## ...1 visits gender age income illness reduced health private freepoor
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 12 1 male 0.19 0.25 2 0 2 no no
## 2 25 1 female 0.19 0.25 2 14 11 no no
## 3 45 1 female 0.19 0.55 2 3 0 no no
## 4 83 1 female 0.19 0.25 1 0 9 no no
## 5 140 1 male 0.22 0.25 2 0 12 no no
## 6 144 1 female 0.22 0.25 1 0 0 no no
## # ℹ 3 more variables: freerepat <chr>, nchronic <chr>, lchronic <chr>
Now lets compare between the three
First compare visits:
meanpriv=mean(private_ins$visits)
meanmedicaid=mean(medicaid$visits)
meanmedicare=mean(medicare$visits)
print(meanpriv)
## [1] 0.294604
print(meanmedicaid)
## [1] 0.1576577
print(meanmedicare)
## [1] 0.4665445
min(private_ins$illness)
## [1] 0
People with medicare are more likely to visit the doctor than people with private insurance/medicaid
Now, lets make a new category in private called risk. Risk is a score that will flag people who are more at risk to be unhealthy and cost more money (made up)
The score will be dependent on age, income, and # of illness
calculate_risk = function(age, income, num_illnesses) {
# Maximum possible values
max_age = 72
max_income = 1.5
max_illnesses = 5
# Normalize each parameter to a 0-1 scale
age_score = age / max_age
income_score = income / max_income
illness_score = num_illnesses / max_illnesses
# Calculate risk. Lower age, higher income, and fewer illnesses reduce risk
risk = (age_score + income_score + illness_score ) * 100
return(risk)
}
priv_risk=calculate_risk(age=(private_ins$age*100),income=(private_ins$income),num_illnesses=private_ins$illness)
private_ins$risk=priv_risk
head(private_ins)
## # A tibble: 6 × 14
## ...1 visits gender age income illness reduced health private freepoor
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 1 1 female 0.19 0.55 1 4 1 yes no
## 2 2 1 female 0.19 0.45 1 2 1 yes no
## 3 9 1 female 0.19 0.65 2 0 5 yes no
## 4 10 1 male 0.19 0.15 1 0 0 yes no
## 5 15 1 male 0.19 0.25 3 1 0 yes no
## 6 17 2 male 0.19 0.45 1 0 5 yes no
## # ℹ 4 more variables: freerepat <chr>, nchronic <chr>, lchronic <chr>,
## # risk <dbl>
min(private_ins$risk)
## [1] 26.38889
max(private_ins$risk)
## [1] 300
avg_risk=mean(private_ins$risk)
Now lets divide further. Low risk and High risk private
low_risk=subset(private_ins, risk > avg_risk)
high_risk=subset(private_ins, risk <= avg_risk)
head(low_risk)
## # A tibble: 6 × 14
## ...1 visits gender age income illness reduced health private freepoor
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 35 1 female 0.19 0.45 4 0 0 yes no
## 2 37 1 female 0.19 1.1 2 0 1 yes no
## 3 119 1 male 0.22 1.1 2 7 1 yes no
## 4 126 1 male 0.22 0.75 3 6 3 yes no
## 5 128 2 male 0.22 0.65 3 0 0 yes no
## 6 141 1 female 0.22 0.9 2 0 0 yes no
## # ℹ 4 more variables: freerepat <chr>, nchronic <chr>, lchronic <chr>,
## # risk <dbl>
head(high_risk)
## # A tibble: 6 × 14
## ...1 visits gender age income illness reduced health private freepoor
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 1 1 female 0.19 0.55 1 4 1 yes no
## 2 2 1 female 0.19 0.45 1 2 1 yes no
## 3 9 1 female 0.19 0.65 2 0 5 yes no
## 4 10 1 male 0.19 0.15 1 0 0 yes no
## 5 15 1 male 0.19 0.25 3 1 0 yes no
## 6 17 2 male 0.19 0.45 1 0 5 yes no
## # ℹ 4 more variables: freerepat <chr>, nchronic <chr>, lchronic <chr>,
## # risk <dbl>
Graphics:
Lets plot the age distributions of these two groups
Plotting the density plot of low risk:
# Plot the density plot of low risk
library(ggplot2)
ggplot(low_risk, aes(x=age*100)) +
geom_density(alpha=0.5) +
theme_minimal() +
labs(title="Age distribution of low risk private insurance",
x="Age",
y="Frequency")+
xlim(0,100)
Plotting the density plot of high risk:
# Plot the density plot of high risk
ggplot(high_risk, aes(x=age*100)) +
geom_density(alpha=0.5) +
theme_minimal() +
labs(title="Age distribution of high risk private insurance",
x="Age",
y="Frequency")+
xlim(0,100)
As you can see, the age distribution of lower risk private insurance more spread out than the age distribution of high risk private insurance
Most high risk private insurance are young
Now, lets look at histogram of number of illnesses in low and high risk groups
Histogram for Low Risk Patients:
# Histogram for Low Risk Patients
ggplot(low_risk, aes(x=illness)) +
geom_histogram(binwidth=1, color="black", fill="blue") +
labs(title="Histogram of Number of Illnesses for Low Risk Patients",
x="Number of Illnesses",
y="Frequency")
Histogram of high risk patients:
# Histogram for High Risk Patients
ggplot(high_risk, aes(x=illness)) +
geom_histogram(binwidth=1, color="black", fill="red") +
labs(title="Histogram of Number of Illnesses for High Risk Patients",
x="Number of Illnesses",
y="Frequency")
Low risk patients might have more illnesses.
Number of illnesses might have a low influence on risk score
Now lets see the distribution of income in low risk patients
Boxplot for Income Distribution for Low Risk Patients:
ggplot(low_risk, aes(y=income*100000)) +
geom_boxplot(fill="green", outlier.shape = NA) +
labs(title="Income Distribution for Low Risk Patients",
y="Annual Income in $",
x="")+
ylim(0,150000)+
theme_minimal()
Here you can see the distribution of the income in low risk patients.
When examining the types of health insurance, the analysis revealed that people with Medicare visit doctors more frequently than those with private insurance or Medicaid. This makes sense since Medicare patients are older and older adults generally visit the doctor more frequently. This led to a further investigation into the private insurance subgroup, where a risk score was calculated for each patient based on their age, income, and the number of illnesses.
Dividing the private insurance patients into low and high-risk groups based on the calculated risk score, various insights were derived. Firstly, the age distribution among lower-risk private insurance patients was more spread out compared to the high-risk group, where younger patients predominated. Interestingly, despite being classified as low-risk, these patients might have more illnesses, indicating that the number of illnesses has less influence on the risk score.
In conclusion, this analysis provides valuable insights into the relationship between different health insurance types, patient characteristics, and their associated risk levels. However, it also reveals areas for future investigations such as refining the risk calculation method and exploring why younger patients are more likely to fall into the high-risk group.