An insurance company in Arizona,was in dillemma :whether is charging its clients too high or too low.They decided to consult Mugo the data scientist.Mugo used kaggle data to explain to them the insights .They thanked him very much.Look what he told them.

GENERAL OBJECTIVE

To determine the effect of associated factors to charges in insurance.

SPECIFIC OBJECTIVES

1.0 to determine if age has effect on insurance charges

2.0 to determine if sex has effect on insurance charges

3.0 to determine if smoking has effect on insurance charges

4.0 to determine if number of children has effect on insurance charges

5.0 to determine if region has effect on insurance charges

NULL HYPOTHESIS

1.0 AGE has no effects to insurance charges

2.0 sex has no effect to insurance charges

3.0 smoking has no effect to insurance charges

4.0 region has no effect to insurance charges

5.0 no of children have no effect to insurance charges

library(readr)
insurance <- read_csv("C:/Users/USER/Desktop/insurance.csv")
## Rows: 1338 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): sex, smoker, region
## dbl (4): age, bmi, children, charges
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(insurance)
## # A tibble: 6 × 7
##     age sex      bmi children smoker region    charges
##   <dbl> <chr>  <dbl>    <dbl> <chr>  <chr>       <dbl>
## 1    19 female  27.9        0 yes    southwest  16885.
## 2    18 male    33.8        1 no     southeast   1726.
## 3    28 male    33          3 no     southeast   4449.
## 4    33 male    22.7        0 no     northwest  21984.
## 5    32 male    28.9        0 no     northwest   3867.
## 6    31 female  25.7        0 no     southeast   3757.
summary(insurance)
##       age            sex                 bmi           children    
##  Min.   :18.00   Length:1338        Min.   :15.96   Min.   :0.000  
##  1st Qu.:27.00   Class :character   1st Qu.:26.30   1st Qu.:0.000  
##  Median :39.00   Mode  :character   Median :30.40   Median :1.000  
##  Mean   :39.21                      Mean   :30.66   Mean   :1.095  
##  3rd Qu.:51.00                      3rd Qu.:34.69   3rd Qu.:2.000  
##  Max.   :64.00                      Max.   :53.13   Max.   :5.000  
##     smoker             region             charges     
##  Length:1338        Length:1338        Min.   : 1122  
##  Class :character   Class :character   1st Qu.: 4740  
##  Mode  :character   Mode  :character   Median : 9382  
##                                        Mean   :13270  
##                                        3rd Qu.:16640  
##                                        Max.   :63770
names(insurance)
## [1] "age"      "sex"      "bmi"      "children" "smoker"   "region"   "charges"
insurance$children<-as.factor(insurance$children)
insurance$region<-as.factor(insurance$region)
insurance$sex<-as.factor(insurance$sex)
insurance$smoker<-as.factor(insurance$smoker)
levels(insurance$children)
## [1] "0" "1" "2" "3" "4" "5"
class(insurance$children)
## [1] "factor"
levels(insurance$sex)
## [1] "female" "male"
str(insurance)
## spc_tbl_ [1,338 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ age     : num [1:1338] 19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
##  $ bmi     : num [1:1338] 27.9 33.8 33 22.7 28.9 ...
##  $ children: Factor w/ 6 levels "0","1","2","3",..: 1 2 4 1 1 1 2 4 3 1 ...
##  $ smoker  : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
##  $ region  : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
##  $ charges : num [1:1338] 16885 1726 4449 21984 3867 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   age = col_double(),
##   ..   sex = col_character(),
##   ..   bmi = col_double(),
##   ..   children = col_double(),
##   ..   smoker = col_character(),
##   ..   region = col_character(),
##   ..   charges = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

explore the data

library(psych)
pairs.panels(insurance[1:4],gap=0,bg=c("red","green","blue"))

as shown in the table a bove there is no multicollinearity problem.But as shown below all the numeric variables are not normally distributed;since they to accept the null hypotheis;null hypothesis is there is no difference between avariable distribution curve with a normally distributed curve;since their pv are <0.5.

solution is to split the variables each into categories/binning

shapiro.test(insurance$age)
## 
##  Shapiro-Wilk normality test
## 
## data:  insurance$age
## W = 0.9447, p-value < 2.2e-16
shapiro.test(insurance$bmi)
## 
##  Shapiro-Wilk normality test
## 
## data:  insurance$bmi
## W = 0.99389, p-value = 2.605e-05

binning

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::%+%()   masks psych::%+%()
## ✖ ggplot2::alpha() masks psych::alpha()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
insurance<-insurance%>%mutate(bmis=cut(bmi,breaks =c(15,25,35,45,55)))
insurance<-insurance%>%mutate(ages=cut(age,breaks =c(15,35,55,75)))
tail(insurance)
## # A tibble: 6 × 9
##     age sex      bmi children smoker region    charges bmis    ages   
##   <dbl> <fct>  <dbl> <fct>    <fct>  <fct>       <dbl> <fct>   <fct>  
## 1    52 female  44.7 3        no     southwest  11412. (35,45] (35,55]
## 2    50 male    31.0 3        no     northwest  10601. (25,35] (35,55]
## 3    18 female  31.9 0        no     northeast   2206. (25,35] (15,35]
## 4    18 female  36.8 0        no     southeast   1630. (35,45] (15,35]
## 5    21 female  25.8 0        no     southwest   2008. (25,35] (15,35]
## 6    61 female  29.1 0        yes    northwest  29141. (25,35] (55,75]
levels(insurance$bmis)
## [1] "(15,25]" "(25,35]" "(35,45]" "(45,55]"
levels(insurance$region)
## [1] "northeast" "northwest" "southeast" "southwest"
levels(insurance$sex)
## [1] "female" "male"
levels(insurance$smoker)
## [1] "no"  "yes"
lm<-lm(charges~ages+children+bmis+sex+smoker+region,data=insurance)
summary(lm)
## 
## Call:
## lm(formula = charges ~ ages + children + bmis + sex + smoker + 
##     region, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -14832  -3030  -1106   1669  30821 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1738.6      573.6   3.031  0.00249 ** 
## ages(35,55]       4544.6      384.2  11.829  < 2e-16 ***
## ages(55,75]       9388.4      509.2  18.438  < 2e-16 ***
## children1         1002.0      448.1   2.236  0.02553 *  
## children2         2324.7      497.6   4.672 3.29e-06 ***
## children3         1897.5      576.0   3.294  0.00101 ** 
## children4         3463.4     1290.1   2.685  0.00735 ** 
## children5         1615.5     1518.8   1.064  0.28766    
## bmis(25,35]       2669.7      464.5   5.748 1.12e-08 ***
## bmis(35,45]       5728.1      560.5  10.220  < 2e-16 ***
## bmis(45,55]       6166.7     1489.8   4.139 3.71e-05 ***
## sexmale           -125.5      345.1  -0.364  0.71613    
## smokeryes        23724.1      429.6  55.227  < 2e-16 ***
## regionnorthwest   -406.6      495.0  -0.821  0.41154    
## regionsoutheast   -779.4      495.2  -1.574  0.11574    
## regionsouthwest   -838.5      496.2  -1.690  0.09126 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6283 on 1322 degrees of freedom
## Multiple R-squared:  0.7338, Adjusted R-squared:  0.7308 
## F-statistic:   243 on 15 and 1322 DF,  p-value: < 2.2e-16

results apart from region all the explanatory variable have a pv < 0.5 we render them statistically significant;They all fail to accept null hypothesis.

the objective have been achieved and all the related questions are positively answeared;apart from region.

73% of changes in charges are explained by the explanatory variables in the model.

smokers are charged 23724.1 higher than non smokers holding other variable constant in the model,males are charges 125.5 less than females holding other factors constant.

Holding other factors constant these who have 1,2,3,4,5 children are charged 1002 ,2324 , 1897, 3463 ,1615 higher than reference point(these who have o children)

Holding other factors constant in the model,age groups 35-55, 55-75 are charged higher then reference age group by 4544 and 9388 repectively

since region has ho significance i drop it from the model. PV of 0.3049 from LRT below means we accept null hypotheis that there is no statistically diffence between the two models and hence we drop region.but u can maintain it to improve predictive effect and causatic effect.

lm2<-lm(charges~ages+children+bmis+sex+smoker,data=insurance)
anova(lm,lm2,test="LRT")
## Analysis of Variance Table
## 
## Model 1: charges ~ ages + children + bmis + sex + smoker + region
## Model 2: charges ~ ages + children + bmis + sex + smoker
##   Res.Df        RSS Df  Sum of Sq Pr(>Chi)
## 1   1322 5.2190e+10                       
## 2   1325 5.2333e+10 -3 -143097306   0.3049
plot(lm)

```