setwd("C:/Users/user/Desktop")
insurance<-read.csv(file="insurance.csv",header=TRUE) #read csv
View(insurance)
head(insurance) #the first six data
## age sex bmi children smoker region charges
## 1 19 female 27.900 0 yes southwest 16884.924
## 2 18 male 33.770 1 no southeast 1725.552
## 3 28 male 33.000 3 no southeast 4449.462
## 4 33 male 22.705 0 no northwest 21984.471
## 5 32 male 28.880 0 no northwest 3866.855
## 6 31 female 25.740 0 no southeast 3756.622
str(insurance) #structure of data
## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : chr "female" "male" "male" "male" ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : chr "yes" "no" "no" "no" ...
## $ region : chr "southwest" "southeast" "southeast" "northwest" ...
## $ charges : num 16885 1726 4449 21984 3867 ...
dim(insurance) #rows and columns
## [1] 1338 7
summary(insurance) #summary of data
## age sex bmi children
## Min. :18.00 Length:1338 Min. :15.96 Min. :0.000
## 1st Qu.:27.00 Class :character 1st Qu.:26.30 1st Qu.:0.000
## Median :39.00 Mode :character Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## smoker region charges
## Length:1338 Length:1338 Min. : 1122
## Class :character Class :character 1st Qu.: 4740
## Mode :character Mode :character Median : 9382
## Mean :13270
## 3rd Qu.:16640
## Max. :63770
sum(is.na(insurance)) #to check if there is na in data
## [1] 0
The data is from Kaggle. URL: https://www.kaggle.com/code/sudhirnl7/linear-regression-tutorial/data
The raw data contains 1338 rows and 7 columns.
There is no NA in data.
The variables in insurance are age,sex,bmi,children,smoker,region,charges.
age: Age
sex: Gender
bmi: Body mass index
children: Number of children
smoker: Indicator if you smoke
region: Region where the participant lives
charges: insurance fees
I want to find a linear model predict charges.
We can find out that in this data, their mean bmi is 30.66, it seems a little unhealthy!
a <- sub("no","0",insurance$smoker)
b <- sub("yes","1",a)
insurance$smoker<- b
a1 <- sub("female","0",insurance$sex)
b1 <- sub("male","1",a1)
insurance$sex <- b1
insurance$smoker<-as.numeric(insurance$smoker)
insurance$sex<-as.numeric(insurance$sex)
I changed who smoked before to 1, who doesn’t have to 0.
I changed who is male to 1, who is female to 0.
par(mfrow=c(1,3))
boxplot(insurance$age,main="Age")
boxplot(insurance$bmi,main="Bmi")
boxplot(insurance$charges,main="Charges")
par(mfrow=c(1,2))
histogram(insurance$smoker,col="darkblue",main="Smoking")
histogram(insurance$sex,col="green",main="Sex")
pairs(insurance[,-6], pch = '.', upper.panel = panel.smooth, lower.panel = NULL, col = 'brown')
insurance_new<-insurance[,-6]
rcorr(as.matrix(insurance_new), type="pearson") #the correlation matrix of all variables
## age sex bmi children smoker charges
## age 1.00 -0.02 0.11 0.04 -0.03 0.30
## sex -0.02 1.00 0.05 0.02 0.08 0.06
## bmi 0.11 0.05 1.00 0.01 0.00 0.20
## children 0.04 0.02 0.01 1.00 0.01 0.07
## smoker -0.03 0.08 0.00 0.01 1.00 0.79
## charges 0.30 0.06 0.20 0.07 0.79 1.00
##
## n= 1338
##
##
## P
## age sex bmi children smoker charges
## age 0.4459 0.0000 0.1205 0.3605 0.0000
## sex 0.4459 0.0900 0.5305 0.0053 0.0361
## bmi 0.0000 0.0900 0.6410 0.8910 0.0000
## children 0.1205 0.5305 0.6410 0.7792 0.0129
## smoker 0.3605 0.0053 0.8910 0.7792 0.0000
## charges 0.0000 0.0361 0.0000 0.0129 0.0000
Most of the variables are not significant correlated with each other. Charges and sex,bmi,smoker has significant correlated in 5% level.
insurance_lm<- lm(charges ~age+sex+bmi+children+smoker, data = insurance_new)
summary(insurance_lm)
##
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker, data = insurance_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11837.2 -2916.7 -994.2 1375.3 29565.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12052.46 951.26 -12.670 < 2e-16 ***
## age 257.73 11.90 21.651 < 2e-16 ***
## sex -128.64 333.36 -0.386 0.699641
## bmi 322.36 27.42 11.757 < 2e-16 ***
## children 474.41 137.86 3.441 0.000597 ***
## smoker 23823.39 412.52 57.750 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6070 on 1332 degrees of freedom
## Multiple R-squared: 0.7497, Adjusted R-squared: 0.7488
## F-statistic: 798 on 5 and 1332 DF, p-value: < 2.2e-16
plot(insurance_lm, which = 1) # Plot of the fitted values against the residuals for cvmod, with a line showing the relationship between the two.
plot(insurance_lm, which = 2) # Plot of the theoretical quantiles according to the model, against the quantiles of the standardised residuals.
plot(insurance_lm, which = 3) # Plot of the fitted values (model predictions) against the square root of the abs standardised residuals.
hist( x = residuals(insurance_lm),
xlab = "Value of residual",
main = "",
breaks = 20)
vif(insurance_lm, digits = 3) # to check collinearity
## age sex bmi children smoker
## 1.015129 1.008878 1.014578 1.002242 1.006457
The Model:Charges=-12052.46+257.73* age-128.64* sex+322.36* bmi+474.41* children+23823.39* smoker
Every single variables except sex that I put in the model has arrived significant standard in 5% confidence level.
The R-squared is 0.75, it indicated that the regression model accounts for 75% of the variability in the outcome measure.
A funny thing is that in 5 % confidence level, every variable has positive coefficients. We can say that if our age,bmi,children, smoker are greater, we have more possibility to have greater charges.
Among all variable in data set, we can conclude that the most effective variable would be smoker, because its β is the largest.
Just take two example of coeffiecient explanations:
Regression coefficient of 322.36(BMI) means that if increase BMI by 1, then charges will increase by 322.36 while control other variables. It also assumes that fatter person would have more charges.
Regression coefficient of 23823.39(Smoker) means that if increase Smoker by 1, then Heart disease will increase by 23823.39 while control other variables. It also assumes that the smoker yes(1) and smoker no(0) group differences.
The VIF values are all below 4, we can conclude that they don’t have collinearity problem.
In the plot, we can find out that the data may not fit Normality, its residuals points don’t really follow the straight dashed line.
In the plot, we can find out that the data may not have great linearity and homoscedasticity, the horizontal line doesn’t equally spread.