Read data

setwd("C:/Users/user/Desktop")
insurance<-read.csv(file="insurance.csv",header=TRUE) #read csv

View(insurance)
head(insurance) #the first six data
##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622
str(insurance) #structure of data
## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : chr  "female" "male" "male" "male" ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : chr  "yes" "no" "no" "no" ...
##  $ region  : chr  "southwest" "southeast" "southeast" "northwest" ...
##  $ charges : num  16885 1726 4449 21984 3867 ...
dim(insurance) #rows and columns
## [1] 1338    7
summary(insurance) #summary of data
##       age            sex                 bmi           children    
##  Min.   :18.00   Length:1338        Min.   :15.96   Min.   :0.000  
##  1st Qu.:27.00   Class :character   1st Qu.:26.30   1st Qu.:0.000  
##  Median :39.00   Mode  :character   Median :30.40   Median :1.000  
##  Mean   :39.21                      Mean   :30.66   Mean   :1.095  
##  3rd Qu.:51.00                      3rd Qu.:34.69   3rd Qu.:2.000  
##  Max.   :64.00                      Max.   :53.13   Max.   :5.000  
##     smoker             region             charges     
##  Length:1338        Length:1338        Min.   : 1122  
##  Class :character   Class :character   1st Qu.: 4740  
##  Mode  :character   Mode  :character   Median : 9382  
##                                        Mean   :13270  
##                                        3rd Qu.:16640  
##                                        Max.   :63770
sum(is.na(insurance)) #to check if there is na in data
## [1] 0

The data is from Kaggle. URL: https://www.kaggle.com/code/sudhirnl7/linear-regression-tutorial/data

The raw data contains 1338 rows and 7 columns.

There is no NA in data.

The variables in insurance are age,sex,bmi,children,smoker,region,charges.

age: Age

sex: Gender

bmi: Body mass index

children: Number of children

smoker: Indicator if you smoke

region: Region where the participant lives

charges: insurance fees

I want to find a linear model predict charges.

We can find out that in this data, their mean bmi is 30.66, it seems a little unhealthy!

Dummy variables

a <- sub("no","0",insurance$smoker)
b <- sub("yes","1",a)
insurance$smoker<- b 


a1 <- sub("female","0",insurance$sex)
b1 <- sub("male","1",a1)
insurance$sex <- b1

insurance$smoker<-as.numeric(insurance$smoker)
insurance$sex<-as.numeric(insurance$sex)

I changed who smoked before to 1, who doesn’t have to 0.

I changed who is male to 1, who is female to 0.

Plot

par(mfrow=c(1,3))
boxplot(insurance$age,main="Age")
boxplot(insurance$bmi,main="Bmi")
boxplot(insurance$charges,main="Charges")

par(mfrow=c(1,2))
histogram(insurance$smoker,col="darkblue",main="Smoking")

histogram(insurance$sex,col="green",main="Sex")

pairs(insurance[,-6], pch = '.', upper.panel = panel.smooth, lower.panel = NULL,  col = 'brown')

Correlation of variables

insurance_new<-insurance[,-6]
rcorr(as.matrix(insurance_new), type="pearson") #the correlation matrix of all variables
##            age   sex  bmi children smoker charges
## age       1.00 -0.02 0.11     0.04  -0.03    0.30
## sex      -0.02  1.00 0.05     0.02   0.08    0.06
## bmi       0.11  0.05 1.00     0.01   0.00    0.20
## children  0.04  0.02 0.01     1.00   0.01    0.07
## smoker   -0.03  0.08 0.00     0.01   1.00    0.79
## charges   0.30  0.06 0.20     0.07   0.79    1.00
## 
## n= 1338 
## 
## 
## P
##          age    sex    bmi    children smoker charges
## age             0.4459 0.0000 0.1205   0.3605 0.0000 
## sex      0.4459        0.0900 0.5305   0.0053 0.0361 
## bmi      0.0000 0.0900        0.6410   0.8910 0.0000 
## children 0.1205 0.5305 0.6410          0.7792 0.0129 
## smoker   0.3605 0.0053 0.8910 0.7792          0.0000 
## charges  0.0000 0.0361 0.0000 0.0129   0.0000

Most of the variables are not significant correlated with each other. Charges and sex,bmi,smoker has significant correlated in 5% level.

Linear Model

insurance_lm<- lm(charges ~age+sex+bmi+children+smoker, data = insurance_new)

summary(insurance_lm)
## 
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker, data = insurance_new)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11837.2  -2916.7   -994.2   1375.3  29565.5 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -12052.46     951.26 -12.670  < 2e-16 ***
## age            257.73      11.90  21.651  < 2e-16 ***
## sex           -128.64     333.36  -0.386 0.699641    
## bmi            322.36      27.42  11.757  < 2e-16 ***
## children       474.41     137.86   3.441 0.000597 ***
## smoker       23823.39     412.52  57.750  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6070 on 1332 degrees of freedom
## Multiple R-squared:  0.7497, Adjusted R-squared:  0.7488 
## F-statistic:   798 on 5 and 1332 DF,  p-value: < 2.2e-16
plot(insurance_lm, which = 1) # Plot of the fitted values against the residuals for cvmod, with a line showing the relationship between the two.

plot(insurance_lm, which = 2) # Plot of the theoretical quantiles according to the model, against the quantiles of the standardised residuals.

plot(insurance_lm, which = 3) # Plot of the fitted values (model predictions) against the square root of the abs standardised residuals.

hist( x = residuals(insurance_lm),
      xlab = "Value of residual",
      main = "",
      breaks = 20)

vif(insurance_lm, digits = 3) # to check collinearity
##      age      sex      bmi children   smoker 
## 1.015129 1.008878 1.014578 1.002242 1.006457

The Model:Charges=-12052.46+257.73* age-128.64* sex+322.36* bmi+474.41* children+23823.39* smoker

Every single variables except sex that I put in the model has arrived significant standard in 5% confidence level.

The R-squared is 0.75, it indicated that the regression model accounts for 75% of the variability in the outcome measure.

A funny thing is that in 5 % confidence level, every variable has positive coefficients. We can say that if our age,bmi,children, smoker are greater, we have more possibility to have greater charges.

Among all variable in data set, we can conclude that the most effective variable would be smoker, because its β is the largest.

Just take two example of coeffiecient explanations:

Regression coefficient of 322.36(BMI) means that if increase BMI by 1, then charges will increase by 322.36 while control other variables. It also assumes that fatter person would have more charges.

Regression coefficient of 23823.39(Smoker) means that if increase Smoker by 1, then Heart disease will increase by 23823.39 while control other variables. It also assumes that the smoker yes(1) and smoker no(0) group differences.

The VIF values are all below 4, we can conclude that they don’t have collinearity problem.

In the plot, we can find out that the data may not fit Normality, its residuals points don’t really follow the straight dashed line.

In the plot, we can find out that the data may not have great linearity and homoscedasticity, the horizontal line doesn’t equally spread.