setwd("C:/Users/Owner/Desktop/MachineLearningR_sampleData")
insurance <- read.csv("insurance.csv", stringsAsFactors = TRUE)
head(insurance)
## age sex bmi children smoker region expenses
## 1 19 female 27.9 0 yes southwest 16884.92
## 2 18 male 33.8 1 no southeast 1725.55
## 3 28 male 33.0 3 no southeast 4449.46
## 4 33 male 22.7 0 no northwest 21984.47
## 5 32 male 28.9 0 no northwest 3866.86
## 6 31 female 25.7 0 no southeast 3756.62
There are six independent features (age, sex, bmi, children, smoker, region) plus the expenses column, which is the dependent variable we want to predict.
str(insurance)
## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 25.7 33.4 27.7 29.8 25.8 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
## $ region : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
## $ expenses: num 16885 1726 4449 21984 3867 ...
summary(insurance)
## age sex bmi children smoker
## Min. :18.00 female:662 Min. :16.00 Min. :0.000 no :1064
## 1st Qu.:27.00 male :676 1st Qu.:26.30 1st Qu.:0.000 yes: 274
## Median :39.00 Median :30.40 Median :1.000
## Mean :39.21 Mean :30.67 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.70 3rd Qu.:2.000
## Max. :64.00 Max. :53.10 Max. :5.000
## region expenses
## northeast:324 Min. : 1122
## northwest:325 1st Qu.: 4740
## southeast:364 Median : 9382
## southwest:325 Mean :13270
## 3rd Qu.:16640
## Max. :63770
1,338 people are enrolled as primary beneficiaries in this program.
Age, BMI, number of children and expenses are numeric variables; sex, region and smoking status are factors (non-numeric).
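As a quick check of those types (a sketch reusing the insurance data frame loaded above), we can ask R for each column's class:
sapply(insurance, class)   # numeric/integer vs. factor for each column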
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.2
ggplot(insurance, aes(x = expenses)) +
theme_bw() +
geom_histogram(binwidth = 7500) +
labs(y = "Frequency",
x = "Expenses",
title = "Medical Expenses Distribtion") +
theme(text = element_text(size=20))
The distribution is positively skewed. Most of the people enrolled in this program have annual medical expenses between $0 and $20,000.
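As a rough sanity check on that claim (a sketch, reusing the same data frame), we can compute the share of beneficiaries whose annual expenses fall below $20,000:
mean(insurance$expenses < 20000)   # proportion of beneficiaries under $20,000 per year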
cor(insurance[c("age", "bmi", "children", "expenses")])
## age bmi children expenses
## age 1.0000000 0.10934101 0.04246900 0.29900819
## bmi 0.1093410 1.00000000 0.01264471 0.19857626
## children 0.0424690 0.01264471 1.00000000 0.06799823
## expenses 0.2990082 0.19857626 0.06799823 1.00000000
library(psych)
## Warning: package 'psych' was built under R version 3.4.2
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
pairs.panels(insurance[c("age", "bmi", "children", "expenses")])
As we get older, we tend to spend disproportionately more on our health insurance.
I will create two new features, "age2" and "age3", which contain the square and cube of the original age values, respectively.
insurance$age2 <- insurance$age^2
insurance$age3 <- insurance$age^3
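To confirm the new columns hold the squared and cubed ages, a quick peek at the first few rows (sketch):
head(insurance[c("age", "age2", "age3")])   # e.g. age 19 gives age2 = 361, age3 = 6859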
BMI has a threshold effect on our health. A BMI below 30 might not affect an individual's health or medical expenses, but once a person becomes obese (BMI of 30 or above), they might see an increase in their medical expenses.
I will convert the numeric BMI into a binary indicator (bmi30) that distinguishes obese from non-obese individuals.
insurance$bmi30 <- ifelse(insurance$bmi >= 30, 1, 0)
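A quick tabulation (sketch) shows how many beneficiaries the indicator flags as obese versus non-obese:
table(insurance$bmi30)   # 0 = BMI below 30, 1 = BMI of 30 or above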
The combination of obesity and smoking might have a more profound effect on our health and medical expenses.
Likewise, aging combined with smoking can have a more negative effect on our health, which can lead to increases in our medical expenses. Both interactions are included in the regression formula below.
Step 1: Randomize (shuffle) the rows of our dataset.
insurance <- insurance[sample(nrow(insurance)),]
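If you want the shuffle (and therefore the train/test split below) to be reproducible, set a seed before sampling; a sketch, where the seed value 123 is an arbitrary choice:
set.seed(123)                                      # any fixed seed makes the shuffle repeatable
insurance <- insurance[sample(nrow(insurance)), ]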
Step 2: Split our data into a training set (rows 1 to 1,100) and a testing set (rows 1,101 to 1,338).
insurance_train <- insurance[1:1100, ]
insurance_test <- insurance[1101:1338, ]
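Because the rows were shuffled at random, the two subsets should have similar expense distributions; a quick comparison (sketch):
summary(insurance_train$expenses)
summary(insurance_test$expenses)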
insurance_model <- lm(expenses ~ age + age2 + age3 + children + bmi + sex +
bmi30*smoker + age3*bmi + age3*smoker + region, data = insurance_train)
insurance_model
##
## Call:
## lm(formula = expenses ~ age + age2 + age3 + children + bmi +
## sex + bmi30 * smoker + age3 * bmi + age3 * smoker + region,
## data = insurance_train)
##
## Coefficients:
## (Intercept) age age2 age3
## 2.635e+03 -2.848e+02 1.033e+01 -5.170e-02
## children bmi sexmale bmi30
## 7.003e+02 1.444e+02 -6.273e+02 -1.201e+03
## smokeryes regionnorthwest regionsoutheast regionsouthwest
## 1.352e+04 -3.091e+02 -1.115e+03 -1.441e+03
## bmi30:smokeryes age3:bmi age3:smokeryes
## 1.979e+04 -5.688e-05 -1.141e-03
The intercept is not very informative for this particular purpose because some of the variables have no true zero; for example, we cannot have a BMI of zero.
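Individual coefficients can also be pulled out by name when we want to quote one directly; a sketch using the fitted model above:
coef(insurance_model)["smokeryes"]   # main-effect term for smokers (the interaction terms add to the full smoking effect)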
summary(insurance_model)
##
## Call:
## lm(formula = expenses ~ age + age2 + age3 + children + bmi +
## sex + bmi30 * smoker + age3 * bmi + age3 * smoker + region,
## data = insurance_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17091.0 -1771.7 -1274.1 -682.4 22661.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.635e+03 3.854e+03 0.684 0.494241
## age -2.848e+02 3.087e+02 -0.923 0.356379
## age2 1.033e+01 8.100e+00 1.275 0.202434
## age3 -5.170e-02 6.809e-02 -0.759 0.447871
## children 7.003e+02 1.220e+02 5.741 1.22e-08 ***
## bmi 1.444e+02 4.588e+01 3.148 0.001691 **
## sexmale -6.273e+02 2.759e+02 -2.274 0.023181 *
## bmi30 -1.201e+03 4.782e+02 -2.511 0.012197 *
## smokeryes 1.352e+04 5.964e+02 22.674 < 2e-16 ***
## regionnorthwest -3.091e+02 3.936e+02 -0.785 0.432440
## regionsoutheast -1.115e+03 3.967e+02 -2.812 0.005014 **
## regionsouthwest -1.441e+03 3.959e+02 -3.640 0.000285 ***
## bmi30:smokeryes 1.979e+04 6.814e+02 29.049 < 2e-16 ***
## age3:bmi -5.688e-05 3.144e-04 -0.181 0.856452
## age3:smokeryes -1.141e-03 4.603e-03 -0.248 0.804264
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4550 on 1085 degrees of freedom
## Multiple R-squared: 0.8615, Adjusted R-squared: 0.8597
## F-statistic: 482 on 14 and 1085 DF, p-value: < 2.2e-16
The multiple R-squared value of about 0.86 suggests that this model explains roughly 86% of the variation in expenses (the dependent variable).
prediction <- predict(insurance_model, insurance_test)
cor(prediction, insurance_test$expenses)
## [1] 0.9431769
There is an excellent correlation (> 0.90) between the predicted expenses and the true expenses in the testing dataset.
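Correlation alone does not tell us how large the errors are in dollar terms; mean absolute error and root mean squared error are common complements (a sketch on the same test set and prediction vector):
mae <- mean(abs(prediction - insurance_test$expenses))       # average dollar error per beneficiary
rmse <- sqrt(mean((prediction - insurance_test$expenses)^2)) # penalizes large misses more heavily
mae
rmse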