Student code: S4015423
Health insurance is essential to the healthcare system because it protects people financially against large or unforeseen medical expenses. This investigation examines a dataset of insurance beneficiaries that records age, sex, BMI, number of children, smoking status, region, and the medical charges billed for each person. The goal is to identify patterns and to investigate how these variables relate to medical costs.
The dataset for this investigation was obtained from an open data portal. It includes the following variables:
Age: the primary beneficiary’s age.
Sex: the beneficiary’s sex.
BMI: Body Mass Index, which estimates body fat from a person’s height and weight.
Children: the number of dependents/children covered by the health insurance.
Smoker: whether the beneficiary smokes.
Region: the beneficiary’s residential area in the US.
Charges: the medical costs billed by the health insurance company.
The dataset was preprocessed before analysis to ensure its reliability and quality: it was checked for missing values, and outliers were handled so that they would not distort the results.
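The report does not say which checks were used, so the following is only a minimal sketch of one common approach, assuming a per-column count of missing values and Tukey's 1.5 × IQR rule for flagging outliers. The `toy` data frame and its values are invented for illustration and are not the insurance data.

```r
# Toy stand-in for the insurance data; values are made up for illustration.
toy <- data.frame(
  age     = c(19, 33, 45, 52, 61, 28),
  bmi     = c(27.9, 22.7, NA, 30.1, 29.0, 33.0),
  charges = c(1700, 4400, 8200, 11000, 13500, 60000)
)

# Count missing values in each column.
missing_per_column <- colSums(is.na(toy))

# Flag charges that lie more than 1.5 * IQR beyond the quartiles (Tukey's rule).
# This rule is an assumption; the report does not name its outlier method.
q <- unname(quantile(toy$charges, c(0.25, 0.75)))
fence <- 1.5 * (q[2] - q[1])
is_outlier <- toy$charges < q[1] - fence | toy$charges > q[2] + fence

print(missing_per_column)
print(toy[is_outlier, ])
```

Whether flagged rows should be dropped, capped, or kept is a judgment call. In insurance data, very high charges are often genuine and may be worth keeping.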
Initial analysis reveals several trends. Ages range from 18 to 64 with a median of 39 and a mean of 39.21, so the sample is spread across the adult age range rather than concentrated among people in their twenties. The sexes are roughly balanced. Smokers typically incur much higher medical charges than non-smokers. Age and medical charges appear to be positively correlated.
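The report does not show the correlation figure it refers to, so here is a hedged sketch of how such a check could be run with base R's `cor` and `cor.test`. The `toy` data frame is invented for illustration. On the real data the same calls would use the `age` and `charges` columns of `insurance_data`.

```r
# Toy stand-in for the age and charges columns; values are made up.
toy <- data.frame(
  age     = c(19, 25, 33, 41, 50, 58, 64),
  charges = c(1800, 3100, 4500, 6900, 9400, 11800, 14200)
)

# Pearson correlation coefficient (the default method of cor).
r <- cor(toy$age, toy$charges)

# Test of the null hypothesis that the true correlation is zero.
test <- cor.test(toy$age, toy$charges)

print(r)
print(test$p.value)
```

A positive coefficient only indicates that charges tend to rise with age. It does not show that age alone drives the increase.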
##       age            sex                 bmi           children
##  Min.   :18.00   Length:1338        Min.   :15.96   Min.   :0.000
##  1st Qu.:27.00   Class :character   1st Qu.:26.30   1st Qu.:0.000
##  Median :39.00   Mode  :character   Median :30.40   Median :1.000
##  Mean   :39.21                      Mean   :30.66   Mean   :1.095
##  3rd Qu.:51.00                      3rd Qu.:34.69   3rd Qu.:2.000
##  Max.   :64.00                      Max.   :53.13   Max.   :5.000
##     smoker             region             charges
##  Length:1338        Length:1338        Min.   : 1122
##  Class :character   Class :character   1st Qu.: 4740
##  Mode  :character   Mode  :character   Median : 9382
##                                        Mean   :13270
##                                        3rd Qu.:16640
##                                        Max.   :63770
library(ggplot2)

# Histogram of beneficiary age in 5-year bins
ggplot(insurance_data, aes(x = age)) +
  geom_histogram(binwidth = 5, fill = "blue", color = "black") +
  theme_minimal() +
  labs(title = "Age Distribution")

The difference in charges between smokers and non-smokers was then tested with a Welch two-sample t-test. The null hypothesis is that the two groups have the same mean charges; the alternative hypothesis is that the means differ.
##
## Welch Two Sample t-test
##
## data: charges by smoker
## t = -32.752, df = 311.85, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
## -25034.71 -22197.21
## sample estimates:
## mean in group no mean in group yes
## 8434.268 32050.232
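The test shows a statistically significant difference in mean charges between smokers and non-smokers (t = -32.752, p < 2.2e-16), so the null hypothesis is rejected. Smokers’ mean charges (32,050) are nearly four times those of non-smokers (8,434), and the 95 percent confidence interval puts the difference between 22,197 and 25,035.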
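The header "data: charges by smoker" is what R's `t.test` prints for a formula call, so the test was presumably run along the lines below. This is a sketch on an invented `toy` data frame, not the report's actual code or data. It also recomputes the Welch statistic by hand to show what the reported `t` value measures.

```r
# Toy stand-in for insurance_data; values are invented for illustration only.
toy <- data.frame(
  smoker  = rep(c("no", "yes"), each = 5),
  charges = c(2100, 4300, 6800, 9100, 12500,
              19000, 24500, 31000, 38000, 47000)
)

# Welch two-sample t-test (var.equal = FALSE is the default in t.test).
result <- t.test(charges ~ smoker, data = toy)

# The same statistic computed by hand from group means and variances.
no  <- toy$charges[toy$smoker == "no"]
yes <- toy$charges[toy$smoker == "yes"]
t_manual <- (mean(no) - mean(yes)) /
  sqrt(var(no) / length(no) + var(yes) / length(yes))

print(result)
print(t_manual)
```

Because the groups are ordered alphabetically ("no" before "yes"), the difference is computed as non-smokers minus smokers, which is why the statistic and confidence interval are negative when smokers have higher charges.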
# Boxplot of charges for smokers and non-smokers
ggplot(insurance_data, aes(x = smoker, y = charges)) +
  geom_boxplot(aes(fill = smoker)) +
  labs(title = "Charges Distribution by Smoker Status",
       x = "Smoker",
       y = "Charges",
       fill = "Smoker") +
  theme_minimal()

A few important conclusions can be drawn from this examination of the health insurance dataset. Most clearly, the large difference in charges between smokers and non-smokers highlights the financial costs associated with unhealthy behaviors. Although the dataset covers several relevant factors, it lacks some variables, such as income level, that may also affect medical expenditures. Future research could look more closely at how particular medical conditions or treatments interact with other factors to affect costs.
Key Takeaway: Smoking is associated with substantially higher medical costs, which underlines both the financial burden and the health effects of harmful behaviors.