This dataset contains detailed information about insurance customers, including their age, sex, body mass index (BMI), number of children, smoking status, region and the insurance charges paid in dollars($). This dataset was gotten from kaggle and the location and time frame was not given.
To explore and analyze the demographic and health factors that most significantly influence individual medical insurance charges.
Investigate how age, Body Mass Index (BMI), number of children, and region relate to charges.
Visualize the difference in charges between smokers and non-smokers.
Explore the relationship between age,BMI,and smoking status.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'ggplot2' was built under R version 4.5.2
## Warning: package 'tidyr' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.2
## Warning: package 'purrr' was built under R version 4.5.2
## Warning: package 'forcats' was built under R version 4.5.2
## Warning: package 'lubridate' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(summarytools)
## Warning: package 'summarytools' was built under R version 4.5.2
##
## Attaching package: 'summarytools'
##
## The following object is masked from 'package:tibble':
##
## view
library(dplyr)
mydata<- read.csv("C:\\Users\\HP ELITE BOOK\\Desktop\\insurance.csv")
View(mydata)
head(mydata)
## index age sex bmi children smoker region charges
## 1 0 19 female 27.900 0 yes southwest 16884.924
## 2 1 18 male 33.770 1 no southeast 1725.552
## 3 2 28 male 33.000 3 no southeast 4449.462
## 4 3 33 male 22.705 0 no northwest 21984.471
## 5 4 32 male 28.880 0 no northwest 3866.855
## 6 5 31 female 25.740 0 no southeast 3756.622
str(mydata)
## 'data.frame': 1338 obs. of 8 variables:
## $ index : int 0 1 2 3 4 5 6 7 8 9 ...
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : chr "female" "male" "male" "male" ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : chr "yes" "no" "no" "no" ...
## $ region : chr "southwest" "southeast" "southeast" "northwest" ...
## $ charges : num 16885 1726 4449 21984 3867 ...
#check for missing value
colSums(is.na(mydata))
## index age sex bmi children smoker region charges
## 0 0 0 0 0 0 0 0
summary(mydata)
## index age sex bmi
## Min. : 0.0 Min. :18.00 Length:1338 Min. :15.96
## 1st Qu.: 334.2 1st Qu.:27.00 Class :character 1st Qu.:26.30
## Median : 668.5 Median :39.00 Mode :character Median :30.40
## Mean : 668.5 Mean :39.21 Mean :30.66
## 3rd Qu.:1002.8 3rd Qu.:51.00 3rd Qu.:34.69
## Max. :1337.0 Max. :64.00 Max. :53.13
## children smoker region charges
## Min. :0.000 Length:1338 Length:1338 Min. : 1122
## 1st Qu.:0.000 Class :character Class :character 1st Qu.: 4740
## Median :1.000 Mode :character Mode :character Median : 9382
## Mean :1.095 Mean :13270
## 3rd Qu.:2.000 3rd Qu.:16640
## Max. :5.000 Max. :63770
mydata$sex<- factor(x= mydata$sex, levels = c("male","female"))
mydata$smoker<- factor(x=mydata$smoker, levels= c("yes","no"))
mydata$region<- factor(x= mydata$region, levels= c("northeast","northwest","southwest","southeast") )
str(mydata)
## 'data.frame': 1338 obs. of 8 variables:
## $ index : int 0 1 2 3 4 5 6 7 8 9 ...
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : Factor w/ 2 levels "male","female": 2 1 1 1 1 2 2 2 1 2 ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : Factor w/ 2 levels "yes","no": 1 2 2 2 2 2 2 2 2 2 ...
## $ region : Factor w/ 4 levels "northeast","northwest",..: 3 4 4 2 2 4 4 2 1 2 ...
## $ charges : num 16885 1726 4449 21984 3867 ...
table(mydata$sex)
##
## male female
## 676 662
table(mydata$smoker)
##
## yes no
## 274 1064
table(mydata$region)
##
## northeast northwest southwest southeast
## 324 325 325 364
descr(mydata[, c("age", "bmi", "children", "charges")])
## Descriptive Statistics
## mydata
## N: 1338
##
## age bmi charges children
## ----------------- --------- --------- ---------- ----------
## Mean 39.21 30.66 13270.42 1.09
## Std.Dev 14.05 6.10 12110.01 1.21
## Min 18.00 15.96 1121.87 0.00
## Q1 27.00 26.29 4738.27 0.00
## Median 39.00 30.40 9382.03 1.00
## Q3 51.00 34.70 16657.72 2.00
## Max 64.00 53.13 63770.43 5.00
## MAD 17.79 6.20 7440.81 1.48
## IQR 24.00 8.40 11899.63 2.00
## CV 0.36 0.20 0.91 1.10
## Skewness 0.06 0.28 1.51 0.94
## SE.Skewness 0.07 0.07 0.07 0.07
## Kurtosis -1.25 -0.06 1.59 0.19
## N.Valid 1338.00 1338.00 1338.00 1338.00
## N 1338.00 1338.00 1338.00 1338.00
## Pct.Valid 100.00 100.00 100.00 100.00
sex_count<-mydata%>%
count(sex)
sex_count
## sex n
## 1 male 676
## 2 female 662
ggplot(sex_count, aes(x=sex,y=n,fill=sex))+
geom_col()+
geom_text(aes(label=n),vjust=-0.2)+
labs(title = "Dstribution by sex",
x = "sex",
y = "count")+
theme_minimal()
ggplot(mydata, aes(x = age, y = charges)) +
geom_point(alpha = 0.6, color = "blue") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Relationship between Age and Charges",
x = "Age", y = "Charges($)")
## `geom_smooth()` using formula = 'y ~ x'
As age increases, insurance charges tend to rise, which implies that older people have higher medical cost, but even at the same age group, charges vary which shows that variation is caused by other factors.
ggplot(mydata, aes(x = bmi, y = charges)) +
geom_point(alpha = 0.6, color = "darkgreen") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Relationship between BMI and Charges",
x = "BMI", y = "Charges($)")
## `geom_smooth()` using formula = 'y ~ x'
The red line indicates that as BMI charges increases, insurance charges tend to rise. We could also see that people with same BMI have varying charges, but the case is a little different starting from BMI 30 and above, we could see that there is a wide variation in the charges paid from BMI 30 and above(Those in BMI 30 and above are in the obese class).
# Summary statistics
mydata %>%
group_by(children) %>%
summarise(mean_charges = mean(charges),
max_charges= max(charges),
min_charges= min(charges))
## # A tibble: 6 × 4
## children mean_charges max_charges min_charges
## <int> <dbl> <dbl> <dbl>
## 1 0 12366. 63770. 1122.
## 2 1 12731. 58571. 1711.
## 3 2 15074. 49578. 2304.
## 4 3 15355. 60021. 3443.
## 5 4 13851. 40182. 4505.
## 6 5 8786. 19023. 4688.
ggplot(mydata, aes(x = factor(children), y = charges, fill = factor(children))) +
geom_boxplot() +
labs(title = "Relationship between number of
children and charges ",
x = "Number of Children",
y = "Charges($)") +
theme_minimal()
There is no particular trend which shows that there is no much relationship between the number of dependent(number of children) with the charges paid.
# Summary statistics
mydata %>%
group_by(region) %>%
summarise(mean_charges = mean(charges),
max_charges= max(charges),
min_charges= min(charges))
## # A tibble: 4 × 4
## region mean_charges max_charges min_charges
## <fct> <dbl> <dbl> <dbl>
## 1 northeast 13406. 58571. 1695.
## 2 northwest 12418. 60021. 1621.
## 3 southwest 12347. 52591. 1242.
## 4 southeast 14735. 63770. 1122.
ggplot(mydata, aes(x = region, y = charges, fill = region)) +
geom_boxplot() +
labs(title = "Charges by Region", x = "Region", y = "Charges($)")
Same applies with Charges by region. There is no particular trend which shows that there is no much relationship between the region with the charges.
# Summary statistics
mydata %>%
group_by(smoker) %>%
summarise(mean_charges = mean(charges),
max_charges= max(charges),
min_charges= min(charges))
## # A tibble: 2 × 4
## smoker mean_charges max_charges min_charges
## <fct> <dbl> <dbl> <dbl>
## 1 yes 32050. 63770. 12829.
## 2 no 8434. 36911. 1122.
ggplot(mydata, aes(x = smoker, y = charges, fill = smoker)) +
geom_boxplot() +
labs(title = "Charges by Smoker Status", x = "Smoker", y = "Charges($)")
This is the most influencing factor as visualized, with the minimum charges of smokers resulting into 12,829 while those of non-smokers is 1,122, and the highest charges paid by smokers being 63,770 while that of non-smokers being 36,911.
# 1. Summarize the data to get the average charges
average_charges <- mydata %>%
dplyr::group_by(smoker) %>%
dplyr::summarise(mean_charges = mean(charges))
# 2. Plot the summarized data using 'geom_col'
ggplot(average_charges, aes(x = smoker, y = mean_charges, fill = smoker)) +
geom_col() +
labs(title = "Average Charges by Smoker Status", x = "Smoker", y = "Average Charges($)") +
theme_minimal()
Also,when we look at the average charges of smokers compared to non-smokers, it is observed that there is a big difference.
ggplot(mydata, aes(x = age, y = charges, color = smoker)) +
geom_point(alpha = 0.9) +
labs(title = "Age vs Charges (Smoker vs Non-Smoker)", x = "age", y = "Charges($)")
Since those in the same age group have varying charges, I decided to further Check the age against charges putting into consideration smoking status and it is very obvious that those that pay high in the same age group are the smokers.
BMI vs Charges (Smoker vs Non-Smoker)
ggplot(mydata, aes(x = bmi, y = charges, color = smoker)) +
geom_point(alpha = 0.9) +
labs(title = "BMI vs Charges (Smoker vs Non-Smoker)", x = "BMI", y = "Charges($)")
I did the same with BMI since those in the same BMI group have varying
charges. I also Checked the BMI against charges putting into
consideration smoking status and it is very obvious that those that pay
high in the same BMI group are the smokers and the charges was extremely
high for obsessed (BMI>= 30)people who are smokers.
In conclusion, an individual’s smoking status is the most powerful predictor of high insurance charges. While age and BMI are contributing factors, they primarily amplify the cost for those who smoke. For insurance providers, this analysis validates the use of smoking status as a critical variable in risk assessment and premium calculation.