An insurance company in Arizona,was in dillemma :whether is charging its clients too high or too low.They decided to consult Mugo the data scientist.Mugo used kaggle data to explain to them the insights .They thanked him very much.Look what he told them.
GENERAL OBJECTIVE
To determine the effect of associated factors to charges in insurance.
SPECIFIC OBJECTIVES
1.0 to determine if age has effect on insurance charges
2.0 to determine if sex has effect on insurance charges
3.0 to determine if smoking has effect on insurance charges
4.0 to determine if number of children has effect on insurance charges
5.0 to determine if region has effect on insurance charges
NULL HYPOTHESIS
1.0 AGE has no effects to insurance charges
2.0 sex has no effect to insurance charges
3.0 smoking has no effect to insurance charges
4.0 region has no effect to insurance charges
5.0 no of children have no effect to insurance charges
library(readr)
insurance <- read_csv("C:/Users/USER/Desktop/insurance.csv")
## Rows: 1338 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): sex, smoker, region
## dbl (4): age, bmi, children, charges
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(insurance)
## # A tibble: 6 × 7
## age sex bmi children smoker region charges
## <dbl> <chr> <dbl> <dbl> <chr> <chr> <dbl>
## 1 19 female 27.9 0 yes southwest 16885.
## 2 18 male 33.8 1 no southeast 1726.
## 3 28 male 33 3 no southeast 4449.
## 4 33 male 22.7 0 no northwest 21984.
## 5 32 male 28.9 0 no northwest 3867.
## 6 31 female 25.7 0 no southeast 3757.
summary(insurance)
## age sex bmi children
## Min. :18.00 Length:1338 Min. :15.96 Min. :0.000
## 1st Qu.:27.00 Class :character 1st Qu.:26.30 1st Qu.:0.000
## Median :39.00 Mode :character Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## smoker region charges
## Length:1338 Length:1338 Min. : 1122
## Class :character Class :character 1st Qu.: 4740
## Mode :character Mode :character Median : 9382
## Mean :13270
## 3rd Qu.:16640
## Max. :63770
names(insurance)
## [1] "age" "sex" "bmi" "children" "smoker" "region" "charges"
insurance$children<-as.factor(insurance$children)
insurance$region<-as.factor(insurance$region)
insurance$sex<-as.factor(insurance$sex)
insurance$smoker<-as.factor(insurance$smoker)
levels(insurance$children)
## [1] "0" "1" "2" "3" "4" "5"
class(insurance$children)
## [1] "factor"
levels(insurance$sex)
## [1] "female" "male"
str(insurance)
## spc_tbl_ [1,338 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ age : num [1:1338] 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
## $ bmi : num [1:1338] 27.9 33.8 33 22.7 28.9 ...
## $ children: Factor w/ 6 levels "0","1","2","3",..: 1 2 4 1 1 1 2 4 3 1 ...
## $ smoker : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
## $ region : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
## $ charges : num [1:1338] 16885 1726 4449 21984 3867 ...
## - attr(*, "spec")=
## .. cols(
## .. age = col_double(),
## .. sex = col_character(),
## .. bmi = col_double(),
## .. children = col_double(),
## .. smoker = col_character(),
## .. region = col_character(),
## .. charges = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
explore the data
library(psych)
pairs.panels(insurance[1:4],gap=0,bg=c("red","green","blue"))
as shown in the table a bove there is no multicollinearity problem.But as shown below all the numeric variables are not normally distributed;since they to accept the null hypotheis;null hypothesis is there is no difference between avariable distribution curve with a normally distributed curve;since their pv are <0.5.
solution is to split the variables each into categories/binning
shapiro.test(insurance$age)
##
## Shapiro-Wilk normality test
##
## data: insurance$age
## W = 0.9447, p-value < 2.2e-16
shapiro.test(insurance$bmi)
##
## Shapiro-Wilk normality test
##
## data: insurance$bmi
## W = 0.99389, p-value = 2.605e-05
binning
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::%+%() masks psych::%+%()
## ✖ ggplot2::alpha() masks psych::alpha()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
insurance<-insurance%>%mutate(bmis=cut(bmi,breaks =c(15,25,35,45,55)))
insurance<-insurance%>%mutate(ages=cut(age,breaks =c(15,35,55,75)))
tail(insurance)
## # A tibble: 6 × 9
## age sex bmi children smoker region charges bmis ages
## <dbl> <fct> <dbl> <fct> <fct> <fct> <dbl> <fct> <fct>
## 1 52 female 44.7 3 no southwest 11412. (35,45] (35,55]
## 2 50 male 31.0 3 no northwest 10601. (25,35] (35,55]
## 3 18 female 31.9 0 no northeast 2206. (25,35] (15,35]
## 4 18 female 36.8 0 no southeast 1630. (35,45] (15,35]
## 5 21 female 25.8 0 no southwest 2008. (25,35] (15,35]
## 6 61 female 29.1 0 yes northwest 29141. (25,35] (55,75]
levels(insurance$bmis)
## [1] "(15,25]" "(25,35]" "(35,45]" "(45,55]"
levels(insurance$region)
## [1] "northeast" "northwest" "southeast" "southwest"
levels(insurance$sex)
## [1] "female" "male"
levels(insurance$smoker)
## [1] "no" "yes"
lm<-lm(charges~ages+children+bmis+sex+smoker+region,data=insurance)
summary(lm)
##
## Call:
## lm(formula = charges ~ ages + children + bmis + sex + smoker +
## region, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14832 -3030 -1106 1669 30821
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1738.6 573.6 3.031 0.00249 **
## ages(35,55] 4544.6 384.2 11.829 < 2e-16 ***
## ages(55,75] 9388.4 509.2 18.438 < 2e-16 ***
## children1 1002.0 448.1 2.236 0.02553 *
## children2 2324.7 497.6 4.672 3.29e-06 ***
## children3 1897.5 576.0 3.294 0.00101 **
## children4 3463.4 1290.1 2.685 0.00735 **
## children5 1615.5 1518.8 1.064 0.28766
## bmis(25,35] 2669.7 464.5 5.748 1.12e-08 ***
## bmis(35,45] 5728.1 560.5 10.220 < 2e-16 ***
## bmis(45,55] 6166.7 1489.8 4.139 3.71e-05 ***
## sexmale -125.5 345.1 -0.364 0.71613
## smokeryes 23724.1 429.6 55.227 < 2e-16 ***
## regionnorthwest -406.6 495.0 -0.821 0.41154
## regionsoutheast -779.4 495.2 -1.574 0.11574
## regionsouthwest -838.5 496.2 -1.690 0.09126 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6283 on 1322 degrees of freedom
## Multiple R-squared: 0.7338, Adjusted R-squared: 0.7308
## F-statistic: 243 on 15 and 1322 DF, p-value: < 2.2e-16
results apart from region all the explanatory variable have a pv < 0.5 we render them statistically significant;They all fail to accept null hypothesis.
the objective have been achieved and all the related questions are positively answeared;apart from region.
73% of changes in charges are explained by the explanatory variables in the model.
smokers are charged 23724.1 higher than non smokers holding other variable constant in the model,males are charges 125.5 less than females holding other factors constant.
Holding other factors constant these who have 1,2,3,4,5 children are charged 1002 ,2324 , 1897, 3463 ,1615 higher than reference point(these who have o children)
Holding other factors constant in the model,age groups 35-55, 55-75 are charged higher then reference age group by 4544 and 9388 repectively
since region has ho significance i drop it from the model. PV of 0.3049 from LRT below means we accept null hypotheis that there is no statistically diffence between the two models and hence we drop region.but u can maintain it to improve predictive effect and causatic effect.
lm2<-lm(charges~ages+children+bmis+sex+smoker,data=insurance)
anova(lm,lm2,test="LRT")
## Analysis of Variance Table
##
## Model 1: charges ~ ages + children + bmis + sex + smoker + region
## Model 2: charges ~ ages + children + bmis + sex + smoker
## Res.Df RSS Df Sum of Sq Pr(>Chi)
## 1 1322 5.2190e+10
## 2 1325 5.2333e+10 -3 -143097306 0.3049
plot(lm)
```