EDA and Hypothesis testing on public data

Khusi Sirohi

Student code: S4015423

Introduction

Health insurance is essential to the healthcare system because it protects people financially against large or unforeseen medical expenses. This investigation delves deeply into a dataset that includes elements like age, gender, smoking status, and the medical costs linked to each of these factors. The goal is to discover patterns, gather useful insights, and investigate the correlation between various variables and medical costs.

Data

The dataset for this investigation was obtained from an open data portal. It contains 1,338 records with seven variables: age, sex, bmi, children, smoker, region, and charges.

Data Preprocessing:

Before analysis began, the dataset was preprocessed to ensure its reliability and quality: missing values were checked for and outliers were handled.
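The checks described above could look something like the following sketch. The source does not say how missing values or outliers were actually treated, so the 1.5 * IQR rule, the decision to drop rows, and the `toy` data frame (illustrative values, not the real dataset) are all assumptions.

```r
# Toy stand-in for insurance_data; values are illustrative only
toy <- data.frame(
  age     = c(19, 33, 45, NA, 52, 61),
  charges = c(1800, 4200, 7300, 9100, 11000, 60000)
)

# Count missing values in each column
missing_per_column <- colSums(is.na(toy))

# Flag outliers in charges with the 1.5 * IQR rule (one common convention)
q <- unname(quantile(toy$charges, c(0.25, 0.75)))
fence <- 1.5 * (q[2] - q[1])
is_outlier <- toy$charges < q[1] - fence | toy$charges > q[2] + fence

# Keep only complete rows that are not flagged as outliers
cleaned <- toy[complete.cases(toy) & !is_outlier, ]
```

Whether flagged rows should be dropped, capped, or kept depends on the question being asked; very high charges may be genuine rather than erroneous.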

Descriptive Statistics

Our initial analysis reveals several trends in the data. Ages range from 18 to 64, with a median of 39 and an interquartile range of 27 to 51; the youngest ages are somewhat more densely represented, but no single decade holds a majority. The gender distribution is largely balanced. Charges are right-skewed, with a mean of 13,270 well above the median of 9,382. Smokers typically incur higher medical costs than non-smokers, and charges vary considerably between the two groups. Correlation analysis suggests that age and medical costs are positively correlated.
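As a rough illustration of how the group comparison and the age correlation might be computed, the sketch below uses base R on a small made-up data frame. The `toy` values are hypothetical and do not come from the dataset; only the column names mirror it.

```r
# Toy stand-in for insurance_data; values are illustrative only
toy <- data.frame(
  age     = c(19, 25, 31, 40, 48, 55, 60, 64),
  smoker  = c("no", "yes", "no", "no", "yes", "no", "yes", "no"),
  charges = c(1700, 16000, 3900, 6200, 24000, 10500, 30000, 13800)
)

# Mean charges within each smoker group
mean_by_smoker <- tapply(toy$charges, toy$smoker, mean)

# Pearson correlation between age and charges
age_charges_cor <- cor(toy$age, toy$charges)

print(mean_by_smoker)
print(age_charges_cor)
```

The same two calls should work on `insurance_data` directly, assuming `smoker` holds "yes"/"no" values as the test output later in the deck suggests.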

summary(insurance_data)
##       age            sex                 bmi           children    
##  Min.   :18.00   Length:1338        Min.   :15.96   Min.   :0.000  
##  1st Qu.:27.00   Class :character   1st Qu.:26.30   1st Qu.:0.000  
##  Median :39.00   Mode  :character   Median :30.40   Median :1.000  
##  Mean   :39.21                      Mean   :30.66   Mean   :1.095  
##  3rd Qu.:51.00                      3rd Qu.:34.69   3rd Qu.:2.000  
##  Max.   :64.00                      Max.   :53.13   Max.   :5.000  
##     smoker             region             charges     
##  Length:1338        Length:1338        Min.   : 1122  
##  Class :character   Class :character   1st Qu.: 4740  
##  Mode  :character   Mode  :character   Median : 9382  
##                                        Mean   :13270  
##                                        3rd Qu.:16640  
##                                        Max.   :63770

Visualisation

Age Distribution

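  library(ggplot2)  # provides ggplot(), geom_histogram(), theme_minimal(), labs()

  # Histogram of ages in 5-year bins
  ggplot(insurance_data, aes(x = age)) +
    geom_histogram(binwidth = 5, fill = "blue", color = "black") +
    theme_minimal() +
    labs(title = "Age Distribution")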

Visualisation (Contd.)

Gender Distribution

  ggplot(insurance_data, aes(x = sex)) +
  geom_bar(fill = "blue", color = "black") +
  theme_minimal() +
  labs(title = "Gender Distribution")

Visualisation (Contd.)

Charges vs. Smoker Status

  ggplot(insurance_data, aes(x = smoker, y = charges)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Charges Distribution by Smoker Status")

Visualisation (Contd.)

Charges vs. Age

  ggplot(insurance_data, aes(x = age, y = charges)) +
  geom_point(aes(color = smoker)) +
  theme_minimal() +
  labs(title = "Scatter Plot of Charges vs. Age", color = "Smoker")

Correlation Matrix

   corrplot(correlations, method = "circle")
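The deck does not show how `correlations` was built. One plausible construction, offered here as an assumption rather than the author's method, is to take the pairwise Pearson correlations of the numeric columns. The `toy` data frame is made up for illustration, and the plot is only drawn if the corrplot package is installed.

```r
# Toy stand-in for insurance_data; values are illustrative only
toy <- data.frame(
  age      = c(19, 28, 33, 45, 52, 61),
  bmi      = c(27.9, 33.8, 22.7, 30.1, 36.4, 29.0),
  children = c(0, 1, 0, 2, 3, 1),
  smoker   = c("yes", "no", "no", "no", "yes", "no"),
  charges  = c(16900, 1700, 4400, 7200, 38000, 12600)
)

# Keep only numeric columns; cor() cannot handle character columns
numeric_cols <- toy[sapply(toy, is.numeric)]

# Pairwise Pearson correlations
correlations <- cor(numeric_cols)
print(round(correlations, 2))

# Draw the plot only when corrplot is available
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(correlations, method = "circle")
}
```

Character variables such as sex, smoker, and region are left out of this sketch; including them would require encoding them numerically first.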

Hypothesis Testing and Its Discussion

The difference in charges observed between smokers and non-smokers was investigated further with a Welch two-sample t-test. The null hypothesis states that mean charges are equal in the two groups; the alternative hypothesis states that the means differ.

  test_result
## 
##  Welch Two Sample t-test
## 
## data:  charges by smoker
## t = -32.752, df = 311.85, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
##  -25034.71 -22197.21
## sample estimates:
##  mean in group no mean in group yes 
##          8434.268         32050.232
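The call that produced `test_result` is not shown. Output of this form ("charges by smoker", Welch) is what `t.test()` gives with a formula and its default `var.equal = FALSE`, so a call like the one sketched below is a reasonable guess. The `toy` data are invented for illustration, and the manual calculation is included only to show what the t statistic measures.

```r
# Toy stand-in for insurance_data; values are illustrative only
toy <- data.frame(
  smoker  = c("no", "no", "no", "no", "no", "yes", "yes", "yes", "yes"),
  charges = c(2100, 4500, 6800, 9300, 12000, 18000, 25000, 33000, 41000)
)

# Welch two-sample t-test (var.equal = FALSE is the default)
test_result <- t.test(charges ~ smoker, data = toy)
print(test_result)

# The same statistic by hand: difference in means over its standard error
m <- tapply(toy$charges, toy$smoker, mean)
v <- tapply(toy$charges, toy$smoker, var)
n <- tapply(toy$charges, toy$smoker, length)
t_manual <- (m[["no"]] - m[["yes"]]) /
  sqrt(v[["no"]] / n[["no"]] + v[["yes"]] / n[["yes"]])
```

The sign of t is negative because R subtracts the second group ("yes") from the first ("no"), which matches the negative statistic and confidence interval reported above.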

Hypothesis Visualisation

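The test shows a statistically significant difference in mean medical costs between smokers and non-smokers (t = -32.752, df = 311.85, p < 2.2e-16), so the null hypothesis is rejected. Mean charges for smokers (32,050) exceed those for non-smokers (8,434) by about 23,616, with a 95 percent confidence interval of roughly 22,197 to 25,035.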

 ggplot(insurance_data, aes(x = smoker, y = charges)) +
  geom_boxplot(aes(fill = smoker)) +
  labs(title = "Charges Distribution by Smoker Status",
       x = "Smoker",
       y = "Charges",
       fill = "Smoker") +
  theme_minimal()

Discussion

A few important conclusions can be drawn from this examination of the health insurance dataset. Most notably, the effect of smoking on medical costs highlights the financial cost of unhealthy behaviors. Although the dataset offers a broad view, it has limitations: it lacks some variables, such as income level, that may also affect medical expenditure. Future research could investigate in more depth how particular medical conditions or treatments interact with other factors to affect costs.

Key Takeaway: Smoking has a major impact on medical costs, emphasizing both the financial burden and health effects of harmful behaviors.