EDA and Hypothesis testing on public data

Khusi Sirohi

Student code: S4015423

Introduction

Health insurance plays a pivotal role in the healthcare system, providing individuals with financial protection against high or unexpected healthcare costs. This analysis dives deep into a dataset that encapsulates variables such as age, gender, smoking status, and their associated medical charges. The objective is to extract meaningful insights, identify trends, and examine the relationship between different factors and medical charges.

Data

The dataset used in this study is sourced from Kaggle. It consists of the following variables:

-Age: Age of the primary beneficiary.

-Sex: Gender of the beneficiary.

-BMI: Body Mass Index, providing an understanding of body fat based on weight and height.

-Children: Number of children/dependents covered by health insurance.

-Smoker: Smoking status of the beneficiary.

-Region: Residential area in the US.

-Charges: Individual medical costs billed by health insurance.

Data Preprocessing

Before analysis, the dataset was preprocessed to ensure its quality and reliability. It was checked for missing values, and outliers were treated so that they would not distort the results.
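The report does not show its preprocessing code, so the following is only a minimal sketch of how such checks might look in R. The helper names count_missing and flag_outliers_iqr are hypothetical, and the 1.5 x IQR rule is an assumed definition of an outlier, not necessarily the one used in this analysis.

```r
# Hypothetical helpers; the original preprocessing code is not shown in the report.

# Count missing values in each column of a data frame.
count_missing <- function(df) {
  colSums(is.na(df))
}

# Flag values outside the k * IQR fences (Tukey's rule with k = 1.5).
# This definition of "outlier" is an assumption made for illustration.
flag_outliers_iqr <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Apply to the report's data frame only if it is loaded in the session.
if (exists("insurance_data")) {
  print(count_missing(insurance_data))
  print(table(flag_outliers_iqr(insurance_data$charges)))
}
```

Flagging rather than deleting keeps the decision about how to treat each extreme value (cap, transform, or drop) separate from the detection step.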

Descriptive Statistics

The initial exploration of the data reveals some interesting patterns. The age distribution skews towards younger individuals, with a notable proportion between 20 and 30 years. Gender distribution is fairly balanced. Charges differ markedly between smokers and non-smokers, with smokers tending to have higher medical charges. Correlation analysis suggests a positive correlation between age and medical charges.

summary(insurance_data)
##       age            sex                 bmi           children    
##  Min.   :18.00   Length:1338        Min.   :15.96   Min.   :0.000  
##  1st Qu.:27.00   Class :character   1st Qu.:26.30   1st Qu.:0.000  
##  Median :39.00   Mode  :character   Median :30.40   Median :1.000  
##  Mean   :39.21                      Mean   :30.66   Mean   :1.095  
##  3rd Qu.:51.00                      3rd Qu.:34.69   3rd Qu.:2.000  
##  Max.   :64.00                      Max.   :53.13   Max.   :5.000  
##     smoker             region             charges     
##  Length:1338        Length:1338        Min.   : 1122  
##  Class :character   Class :character   1st Qu.: 4740  
##  Mode  :character   Mode  :character   Median : 9382  
##                                        Mean   :13270  
##                                        3rd Qu.:16640  
##                                        Max.   :63770
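The smoker versus non-smoker gap and the age-charges correlation described above can be computed directly. The sketch below assumes a data frame with the columns age, smoker and charges, as in the summary; the function name summarise_charges is hypothetical.

```r
# Hypothetical helper: mean charges per smoker group and the Pearson
# correlation between age and charges.
summarise_charges <- function(df) {
  list(
    mean_by_smoker  = tapply(df$charges, df$smoker, mean),
    age_charges_cor = cor(df$age, df$charges)
  )
}

# Apply to the report's data frame only if it is loaded in the session.
if (exists("insurance_data")) {
  print(summarise_charges(insurance_data))
}
```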

Visualisation

Age Distribution

  library(ggplot2)

  # Histogram of beneficiary ages in 5-year bins
  ggplot(insurance_data, aes(x = age)) +
    geom_histogram(binwidth = 5, fill = "blue", color = "black") +
    theme_minimal() +
    labs(title = "Age Distribution")

Gender Distribution

  ggplot(insurance_data, aes(x = sex)) +
    geom_bar(fill = "blue", color = "black") +
    theme_minimal() +
    labs(title = "Gender Distribution")

Charges vs. Smoker Status

  ggplot(insurance_data, aes(x = smoker, y = charges)) +
    geom_boxplot() +
    theme_minimal() +
    labs(title = "Charges Distribution by Smoker Status")

Charges vs. Age

  # Charges against age, with points coloured by smoking status
  ggplot(insurance_data, aes(x = age, y = charges)) +
    geom_point(aes(color = smoker)) +
    theme_minimal() +
    labs(title = "Scatter Plot of Charges vs. Age", color = "Smoker")

Correlation Matrix

  corrplot(correlations, method = "circle")
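The call above expects a correlations object whose construction is not shown in the report. One plausible way to build it, assuming it is the Pearson correlation matrix of the numeric columns (age, bmi, children, charges), is sketched below. The helper name numeric_correlations is hypothetical.

```r
# Hypothetical helper: Pearson correlation matrix of the numeric columns only.
# Character columns such as sex, smoker and region are skipped.
numeric_correlations <- function(df) {
  numeric_cols <- vapply(df, is.numeric, logical(1))
  cor(df[, numeric_cols, drop = FALSE], use = "complete.obs")
}

# Draw the plot only if the data and the corrplot package are available.
if (exists("insurance_data") && requireNamespace("corrplot", quietly = TRUE)) {
  correlations <- numeric_correlations(insurance_data)
  corrplot::corrplot(correlations, method = "circle")
}
```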

Hypothesis Testing and Its Discussion

To examine the observed difference in charges between smokers and non-smokers more formally, a Welch two-sample t-test was conducted. The null hypothesis posits that mean charges are equal in the two groups, while the alternative hypothesis states that they differ. The test rejects the null hypothesis (t = -32.75, p < 2.2e-16), indicating a statistically significant difference in mean medical charges between smokers and non-smokers.

  test_result
## 
##  Welch Two Sample t-test
## 
## data:  charges by smoker
## t = -32.752, df = 311.85, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
##  -25034.71 -22197.21
## sample estimates:
##  mean in group no mean in group yes 
##          8434.268         32050.232
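The output above has the form produced by R's t.test() with the formula interface, whose default is the Welch (unequal variance) test. The call that created test_result is not shown in the report; a likely reconstruction, offered as an assumption, is sketched below with the hypothetical wrapper name welch_by_smoker.

```r
# Hypothetical wrapper: Welch two-sample t-test of charges by smoker status.
# var.equal = FALSE is the default of t.test(); it is spelled out for clarity.
welch_by_smoker <- function(df) {
  t.test(charges ~ smoker, data = df, var.equal = FALSE)
}

# Apply to the report's data frame only if it is loaded in the session.
if (exists("insurance_data")) {
  test_result <- welch_by_smoker(insurance_data)
  print(test_result)
}
```

As a quick consistency check on the printed output, the difference in sample means is 8434.268 - 32050.232 = -23615.96, which is the midpoint of the reported confidence interval (-25034.71, -22197.21).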

Hypothesis Visualisation

  ggplot(insurance_data, aes(x = smoker, y = charges)) +
    geom_boxplot(aes(fill = smoker)) +
    labs(title = "Charges Distribution by Smoker Status",
         x = "Smoker",
         y = "Charges",
         fill = "Smoker") +
    theme_minimal()

Discussion

The analysis of the health insurance dataset offers several key takeaways. Most prominently, the impact of smoking on medical charges underscores the financial implications of health behaviors. While the dataset provides a comprehensive overview, potential limitations include the lack of certain variables like income levels, which could further influence medical charges. Future studies might consider a more granular exploration, perhaps examining how specific medical conditions or treatments interact with other variables in determining costs.

Key Takeaway: Smokers in this dataset incur significantly higher medical charges than non-smokers (a mean of about 32,050 versus 8,434), highlighting the financial burden of unhealthy behaviors in addition to their health implications.