PROJECT REPORT

About Dataset

This dataset contains detailed information about insurance customers, including their age, sex, body mass index (BMI), number of children, smoking status, region and the insurance charges paid in dollars($). This dataset was gotten from kaggle and the location and time frame was not given.

Aim

To explore and analyze the demographic and health factors that most significantly influence individual medical insurance charges.

Objectives

Investigate how age, Body Mass Index (BMI), number of children, and region relate to charges.
Visualize the difference in charges between smokers and non-smokers.
Explore the relationship between age,BMI,and smoking status.

Loading Necessary Libraries

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.5.2

## Warning: package 'ggplot2' was built under R version 4.5.2

## Warning: package 'tidyr' was built under R version 4.5.2

## Warning: package 'readr' was built under R version 4.5.2

## Warning: package 'purrr' was built under R version 4.5.2

## Warning: package 'forcats' was built under R version 4.5.2

## Warning: package 'lubridate' was built under R version 4.5.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(summarytools)

## Warning: package 'summarytools' was built under R version 4.5.2

## 
## Attaching package: 'summarytools'
## 
## The following object is masked from 'package:tibble':
## 
##     view

library(dplyr)

Import the data

mydata<- read.csv("C:\\Users\\HP ELITE BOOK\\Desktop\\insurance.csv")
View(mydata)

Data cleaning

head(mydata)

##   index age    sex    bmi children smoker    region   charges
## 1     0  19 female 27.900        0    yes southwest 16884.924
## 2     1  18   male 33.770        1     no southeast  1725.552
## 3     2  28   male 33.000        3     no southeast  4449.462
## 4     3  33   male 22.705        0     no northwest 21984.471
## 5     4  32   male 28.880        0     no northwest  3866.855
## 6     5  31 female 25.740        0     no southeast  3756.622

str(mydata)

## 'data.frame':    1338 obs. of  8 variables:
##  $ index   : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : chr  "female" "male" "male" "male" ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : chr  "yes" "no" "no" "no" ...
##  $ region  : chr  "southwest" "southeast" "southeast" "northwest" ...
##  $ charges : num  16885 1726 4449 21984 3867 ...

#check for missing value
colSums(is.na(mydata))

##    index      age      sex      bmi children   smoker   region  charges 
##        0        0        0        0        0        0        0        0

summary(mydata)

##      index             age            sex                 bmi       
##  Min.   :   0.0   Min.   :18.00   Length:1338        Min.   :15.96  
##  1st Qu.: 334.2   1st Qu.:27.00   Class :character   1st Qu.:26.30  
##  Median : 668.5   Median :39.00   Mode  :character   Median :30.40  
##  Mean   : 668.5   Mean   :39.21                      Mean   :30.66  
##  3rd Qu.:1002.8   3rd Qu.:51.00                      3rd Qu.:34.69  
##  Max.   :1337.0   Max.   :64.00                      Max.   :53.13  
##     children        smoker             region             charges     
##  Min.   :0.000   Length:1338        Length:1338        Min.   : 1122  
##  1st Qu.:0.000   Class :character   Class :character   1st Qu.: 4740  
##  Median :1.000   Mode  :character   Mode  :character   Median : 9382  
##  Mean   :1.095                                         Mean   :13270  
##  3rd Qu.:2.000                                         3rd Qu.:16640  
##  Max.   :5.000                                         Max.   :63770

mydata$sex<- factor(x= mydata$sex, levels = c("male","female"))

mydata$smoker<- factor(x=mydata$smoker, levels= c("yes","no"))

mydata$region<- factor(x= mydata$region, levels= c("northeast","northwest","southwest","southeast") )
str(mydata)

## 'data.frame':    1338 obs. of  8 variables:
##  $ index   : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : Factor w/ 2 levels "male","female": 2 1 1 1 1 2 2 2 1 2 ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : Factor w/ 2 levels "yes","no": 1 2 2 2 2 2 2 2 2 2 ...
##  $ region  : Factor w/ 4 levels "northeast","northwest",..: 3 4 4 2 2 4 4 2 1 2 ...
##  $ charges : num  16885 1726 4449 21984 3867 ...

table(mydata$sex)

## 
##   male female 
##    676    662

table(mydata$smoker)

## 
##  yes   no 
##  274 1064

table(mydata$region)

## 
## northeast northwest southwest southeast 
##       324       325       325       364

descr(mydata[, c("age", "bmi", "children", "charges")])

## Descriptive Statistics  
## mydata  
## N: 1338  
## 
##                         age       bmi    charges   children
## ----------------- --------- --------- ---------- ----------
##              Mean     39.21     30.66   13270.42       1.09
##           Std.Dev     14.05      6.10   12110.01       1.21
##               Min     18.00     15.96    1121.87       0.00
##                Q1     27.00     26.29    4738.27       0.00
##            Median     39.00     30.40    9382.03       1.00
##                Q3     51.00     34.70   16657.72       2.00
##               Max     64.00     53.13   63770.43       5.00
##               MAD     17.79      6.20    7440.81       1.48
##               IQR     24.00      8.40   11899.63       2.00
##                CV      0.36      0.20       0.91       1.10
##          Skewness      0.06      0.28       1.51       0.94
##       SE.Skewness      0.07      0.07       0.07       0.07
##          Kurtosis     -1.25     -0.06       1.59       0.19
##           N.Valid   1338.00   1338.00    1338.00    1338.00
##                 N   1338.00   1338.00    1338.00    1338.00
##         Pct.Valid    100.00    100.00     100.00     100.00

Distribution by Sex

sex_count<-mydata%>%
  count(sex)
sex_count

##      sex   n
## 1   male 676
## 2 female 662

ggplot(sex_count, aes(x=sex,y=n,fill=sex))+
      geom_col()+
  geom_text(aes(label=n),vjust=-0.2)+
  labs(title = "Dstribution by sex", 
       x = "sex",
       y = "count")+
  theme_minimal()

1i. Relationship between Age and charges

ggplot(mydata, aes(x = age, y = charges)) +
  geom_point(alpha = 0.6, color = "blue") +
   geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Relationship between Age and Charges",
       x = "Age", y = "Charges($)")

## `geom_smooth()` using formula = 'y ~ x'

As age increases, insurance charges tend to rise, which implies that older people have higher medical cost, but even at the same age group, charges vary which shows that variation is caused by other factors.

ii. Relationship between BMI and charges

ggplot(mydata, aes(x = bmi, y = charges)) +
  geom_point(alpha = 0.6, color = "darkgreen") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Relationship between BMI and Charges",
       x = "BMI", y = "Charges($)")

## `geom_smooth()` using formula = 'y ~ x'

The red line indicates that as BMI charges increases, insurance charges tend to rise. We could also see that people with same BMI have varying charges, but the case is a little different starting from BMI 30 and above, we could see that there is a wide variation in the charges paid from BMI 30 and above(Those in BMI 30 and above are in the obese class).

iii. Relationship between number of children and charges

# Summary statistics
mydata %>%
  group_by(children) %>%
  summarise(mean_charges = mean(charges),
            max_charges= max(charges),
            min_charges= min(charges))

## # A tibble: 6 × 4
##   children mean_charges max_charges min_charges
##      <int>        <dbl>       <dbl>       <dbl>
## 1        0       12366.      63770.       1122.
## 2        1       12731.      58571.       1711.
## 3        2       15074.      49578.       2304.
## 4        3       15355.      60021.       3443.
## 5        4       13851.      40182.       4505.
## 6        5        8786.      19023.       4688.

ggplot(mydata, aes(x = factor(children), y = charges, fill = factor(children))) +
  geom_boxplot() +
  labs(title = "Relationship between number of 
       children and charges ",
       x = "Number of Children",
       y = "Charges($)") +
  theme_minimal()

There is no particular trend which shows that there is no much relationship between the number of dependent(number of children) with the charges paid.

iv. Relationship between Region and charges

# Summary statistics
mydata %>%
  group_by(region) %>%
  summarise(mean_charges = mean(charges),
            max_charges= max(charges),
            min_charges= min(charges))

## # A tibble: 4 × 4
##   region    mean_charges max_charges min_charges
##   <fct>            <dbl>       <dbl>       <dbl>
## 1 northeast       13406.      58571.       1695.
## 2 northwest       12418.      60021.       1621.
## 3 southwest       12347.      52591.       1242.
## 4 southeast       14735.      63770.       1122.

ggplot(mydata, aes(x = region, y = charges, fill = region)) +
  geom_boxplot() +
  labs(title = "Charges by Region", x = "Region", y = "Charges($)")

Same applies with Charges by region. There is no particular trend which shows that there is no much relationship between the region with the charges.

2. Relationship between Smokers status and charges

# Summary statistics
mydata %>%
  group_by(smoker) %>%
  summarise(mean_charges = mean(charges),
            max_charges= max(charges),
            min_charges= min(charges))

## # A tibble: 2 × 4
##   smoker mean_charges max_charges min_charges
##   <fct>         <dbl>       <dbl>       <dbl>
## 1 yes          32050.      63770.      12829.
## 2 no            8434.      36911.       1122.

ggplot(mydata, aes(x = smoker, y = charges, fill = smoker)) +
  geom_boxplot() +
  labs(title = "Charges by Smoker Status", x = "Smoker", y = "Charges($)")

This is the most influencing factor as visualized, with the minimum charges of smokers resulting into 12,829 while those of non-smokers is 1,122, and the highest charges paid by smokers being 63,770 while that of non-smokers being 36,911.

# 1. Summarize the data to get the average charges
average_charges <- mydata %>%
  dplyr::group_by(smoker) %>%
  dplyr::summarise(mean_charges = mean(charges))

# 2. Plot the summarized data using 'geom_col'
ggplot(average_charges, aes(x = smoker, y = mean_charges, fill = smoker)) +
  geom_col() +
  labs(title = "Average Charges by Smoker Status", x = "Smoker", y = "Average Charges($)") +
  theme_minimal()

Also,when we look at the average charges of smokers compared to non-smokers, it is observed that there is a big difference.

3. Age vs Charges (Smoker vs Non-Smoker)

ggplot(mydata, aes(x = age, y = charges, color = smoker)) +
  geom_point(alpha = 0.9) +
  labs(title = "Age vs Charges (Smoker vs Non-Smoker)", x = "age", y = "Charges($)")

Since those in the same age group have varying charges, I decided to further Check the age against charges putting into consideration smoking status and it is very obvious that those that pay high in the same age group are the smokers.

BMI vs Charges (Smoker vs Non-Smoker)

ggplot(mydata, aes(x = bmi, y = charges, color = smoker)) +
  geom_point(alpha = 0.9) +
  labs(title = "BMI vs Charges (Smoker vs Non-Smoker)", x = "BMI", y = "Charges($)")

I did the same with BMI since those in the same BMI group have varying charges. I also Checked the BMI against charges putting into consideration smoking status and it is very obvious that those that pay high in the same BMI group are the smokers and the charges was extremely high for obsessed (BMI>= 30)people who are smokers.

Conclusion

In conclusion, an individual’s smoking status is the most powerful predictor of high insurance charges. While age and BMI are contributing factors, they primarily amplify the cost for those who smoke. For insurance providers, this analysis validates the use of smoking status as a critical variable in risk assessment and premium calculation.

PROJECT

ODEYEMI Bukunmi. S

2025-12-08