Project Title: Exploratory Data Analysis of Healthcare Costs

Ask

Objective:

In this project, we will perform exploratory data analysis (EDA) on a dataset focusing on healthcare costs. The dataset contains information about various factors affecting healthcare costs, such as age, BMI, region, smoking habits, etc. We will utilize Excel for data cleaning, SQL for data manipulation (optional), and R for data analysis and visualization.

Questions for Analysis:

The following questions will be the focus of my analysis.
- How do healthcare costs vary by demographic factors such as age, sex, BMI, and region?
- What is the impact of smoking status on healthcare costs?
- What is the relationship between BMI (Body Mass Index) and healthcare costs?
- Are there any regional differences in healthcare costs?

Prepare

Data Souce:
* Dataset: “Medical Cost Personal Datasets” via Kaggle
* Description: This dataset contains information about patients including their age, sex, BMI, children, smoking status, region, and their healthcare costs.

Limitaions:
The dataset contains a limited number of variables, focusing primarily on demographic information (age, sex, BMI, children) and lifestyle factors (smoking status), with healthcare costs being the primary outcome variable. Other important factors such as pre-existing medical conditions, type of healthcare coverage, and specific medical procedures are not included, which could limit the comprehensiveness of the analysis.

Process

Clean Data in Excel

I chose to initally clean the dataset in Excel and took the following steps to clean and organize the data:
- Remove any duplicate rows.
- Checked for missing values.
- Cheched for inconsistencies in the data types and fixed them if necessary.
- Changed formatting for “charges” column to reflect currency.

Analyze and Share

Next, I used R to analyze the data and create visualizations.

Importing Data Set and Libraries

I started by loading the libraries and dataset:

library(tidyverse)
library(lubridate)
library(ggplot2)
library(ggpubr)
library(tidyr)
library(janitor)
library(scales)
healthcare_data <- read_csv ("insurance.csv")
## Rows: 1338 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): sex, smoker, region
## dbl (4): age, bmi, children, charges
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(healthcare_data)
## # A tibble: 6 Ă— 7
##     age sex      bmi children smoker region    charges
##   <dbl> <chr>  <dbl>    <dbl> <chr>  <chr>       <dbl>
## 1    19 female  27.9        0 yes    southwest  16885.
## 2    18 male    33.8        1 no     southeast   1726.
## 3    28 male    33          3 no     southeast   4449.
## 4    33 male    22.7        0 no     northwest  21984.
## 5    32 male    28.9        0 no     northwest   3867.
## 6    31 female  25.7        0 no     southeast   3757.
clean_names(healthcare_data)
## # A tibble: 1,338 Ă— 7
##      age sex      bmi children smoker region    charges
##    <dbl> <chr>  <dbl>    <dbl> <chr>  <chr>       <dbl>
##  1    19 female  27.9        0 yes    southwest  16885.
##  2    18 male    33.8        1 no     southeast   1726.
##  3    28 male    33          3 no     southeast   4449.
##  4    33 male    22.7        0 no     northwest  21984.
##  5    32 male    28.9        0 no     northwest   3867.
##  6    31 female  25.7        0 no     southeast   3757.
##  7    46 female  33.4        1 no     southeast   8241.
##  8    37 female  27.7        3 no     northwest   7282.
##  9    37 male    29.8        2 no     northeast   6406.
## 10    60 female  25.8        0 no     northwest  28923.
## # ℹ 1,328 more rows
Analysis and Visualizations

I then began to analyze each section and create visualizations to help with the process.

The first section I looked at was Age. The available ranged from ages 18 to 64.

healthcare_data %>%
  select(age) %>%
  summary()
##       age       
##  Min.   :18.00  
##  1st Qu.:27.00  
##  Median :39.00  
##  Mean   :39.21  
##  3rd Qu.:51.00  
##  Max.   :64.00

The chart below shows how as we age the cost of medical care increases.

ggplot(data=healthcare_data)+
  geom_smooth(mapping=aes(x=age,y=charges))+
  labs(title="Age")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Next I looked at how medical expenses compare for females vs males and if there was any substantial differences. As seen below, data shows that Males typically spend more on medical expenses than females.

ggplot(data = healthcare_data, aes(x=sex,y=charges))+
  geom_col(fill="darkblue")+
  labs(title= "Males VS Females")+
  annotate("text",x=1,y=7500000,label="On average,females spend $12,569.58",color="white",fontface="bold",size=3)+
  annotate("text",x=2,y=8500000,label="On average,males spend $13,959.75",color="white",fontface="bold",size=3)

We also looked at BMI to see if that played a role and if so by how much. The range of BMI in our data is from 15.96 at the lowest and 53.13 at the highest with a median BMI at 30.40.

healthcare_data %>%
  select(bmi) %>%
  summary()
##       bmi       
##  Min.   :15.96  
##  1st Qu.:26.30  
##  Median :30.40  
##  Mean   :30.66  
##  3rd Qu.:34.69  
##  Max.   :53.13
ggplot(data=healthcare_data)+
  geom_smooth(mapping=aes(x=bmi,y=charges))+
  labs(title="Body Mass Index (BMI)")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

In regards to smokers vs non-smokers, we see a mix but overall smokers tend to spend more on medical expenses than non-smokers.

ggplot(data=healthcare_data)+
  geom_jitter(mapping=aes(x=smoker,y=charges),color="darkblue")+
  labs(title="Smoker VS Non-Smoker")

And lastly, I looked at how the amount spent on medical expenses varies throughout the 4 main regions in the U.S.

We don’t see to much variation from one region to the next but on average less was spent in the Southwest compared to in the Southeast where they spent the most on average:

  • Southwest: $12,346.94
  • Northwest: $12,417.58
  • Northeast: $13,406.38
  • Southeast: $14,735.41
ggplot(data = healthcare_data, aes(x=region,y=charges))+
  geom_col(fill="darkblue")+
  labs(title= "Regions")+
  annotate("text",x=4,y=4e+05,label="$12,346.94",color="white",fontface="bold",size=4)+
  annotate("text",x=2,y=4e+05,label="$12,417.58",color="white",fontface="bold",size=4)+
  annotate("text",x=1,y=4e+05,label="$13,406.38",color="white",fontface="bold",size=4)+
  annotate("text",x=3,y=4e+05,label="$14,735.41",color="white",fontface="bold",size=4)

Act

Recommendations
  • Promotion of Healthy Lifestyle Choices: Encourage initiatives aimed at promoting healthy lifestyle choices, such as smoking cessation programs, nutrition education, and physical activity promotion. By reducing risk factors like smoking and obesity, individuals may experience improved health outcomes and lower healthcare costs in the long term.

  • Preventive Healthcare Interventions: Advocate for increased access to preventive healthcare services, including routine screenings, vaccinations, and preventive care visits. Early detection and management of chronic conditions can help prevent costly complications and reduce overall healthcare expenditures.

  • Targeted Interventions for High-Risk Groups: Identify high-risk demographic groups, such as smokers or individuals with high BMI, and develop targeted interventions tailored to their specific needs. This may include subsidized smoking cessation programs, weight management interventions, or targeted health education campaigns.