Data set for Homework 3, was found on Kaggle. I used the same data set as I used in Homework 1 and 2. Link to data set is https://www.kaggle.com/datasets/harshsingh2209/medical-insurance-payout .
mydata <- read.table("./expenses.csv", header=TRUE, sep=",", dec=".")
head(mydata)
## age sex bmi children smoker region charges
## 1 19 female 27.900 0 yes southwest 16884.924
## 2 18 male 33.770 1 no southeast 1725.552
## 3 28 male 33.000 3 no southeast 4449.462
## 4 33 male 22.705 0 no northwest 21984.471
## 5 32 male 28.880 0 no northwest 3866.855
## 6 31 female 25.740 0 no southeast 3756.622
Explanation of data set
Unit of observation: customer of the insurance company
Sample size: 1338 units
mydata$sexF <- factor(mydata$sex, # creating a factor for variable sex
levels = c("male", "female"),
labels = c("male", "female"))
mydata$smokerF <- factor(mydata$smoker, # creating a factor for variable smoker
levels = c("yes", "no"),
labels = c("yes", "no"))
mydata$regionF <- factor(mydata$region, # creating a factor for variable region
labels = c("northeast", "southeast", "southwest", "northwest"),
levels = c("northeast", "southeast", "southwest", "northwest"))
head(mydata, 3)
## age sex bmi children smoker region charges sexF smokerF regionF
## 1 19 female 27.90 0 yes southwest 16884.924 female yes southwest
## 2 18 male 33.77 1 no southeast 1725.552 male no southeast
## 3 28 male 33.00 3 no southeast 4449.462 male no southeast
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
set.seed(4)
mydata1 <- mydata %>% # for the purpose of exercise, I will take a random sample of 400 units
sample_n(400)
head(mydata1)
## age sex bmi children smoker region charges sexF smokerF
## 1 21 male 20.235 3 no northeast 3861.210 male no
## 2 47 female 26.125 1 yes northeast 23401.306 female yes
## 3 27 female 24.750 0 yes southeast 16577.780 female yes
## 4 53 male 24.320 0 no northwest 9863.472 male no
## 5 61 female 21.090 0 no northwest 13415.038 female no
## 6 39 female 22.800 3 no northeast 7985.815 female no
## regionF
## 1 northeast
## 2 northeast
## 3 southeast
## 4 northwest
## 5 northwest
## 6 northeast
summary(mydata1 [,c(-2,-5,-6)])
## age bmi children charges sexF
## Min. :18.00 Min. :17.29 Min. :0.000 Min. : 1122 male :190
## 1st Qu.:27.00 1st Qu.:26.10 1st Qu.:0.000 1st Qu.: 4687 female:210
## Median :37.50 Median :30.25 Median :1.000 Median : 9018
## Mean :38.54 Mean :30.49 Mean :1.097 Mean :12402
## 3rd Qu.:50.00 3rd Qu.:34.41 3rd Qu.:2.000 3rd Qu.:14352
## Max. :64.00 Max. :47.74 Max. :5.000 Max. :60021
## smokerF regionF
## yes: 76 northeast: 98
## no :324 southeast:119
## southwest: 98
## northwest: 85
##
##
library(psych)
describe(mydata1 [,c(-2,-5,-6,-8,-9,-10)])
## vars n mean sd median trimmed mad min max
## age 1 400 38.54 13.91 37.50 38.19 17.79 18.00 64.00
## bmi 2 400 30.49 6.05 30.26 30.32 6.17 17.29 47.74
## children 3 400 1.10 1.22 1.00 0.93 1.48 0.00 5.00
## charges 4 400 12402.02 11278.53 9018.46 10173.07 6928.94 1121.87 60021.40
## range skew kurtosis se
## age 46.00 0.15 -1.19 0.70
## bmi 30.45 0.27 -0.22 0.30
## children 5.00 1.02 0.59 0.06
## charges 58899.53 1.71 2.47 563.93
In my regression model charges is dependent variable and age, BMI and smoking status are independent variables. We can expect that with higher age, expenditures of insurance company increases, because the older the person is, more likely it is that the person have heath issues and consequently expenditures of insurance company are higher. If we observe BMI variable, we can expect that BMI(body max index) will increase expenditures, because people with higher BMI usually have more health issues and that means that insurance company expenditures are higher. And lastly, we can expect that people who smoke, have more health issues than others and consequently expenditures are higher.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
scatterplotMatrix(mydata1[ , c(7, 3, 1)],
smooth = FALSE)
From the scatterplot above we can not be sure that there is linearity between charges and bmi. I did logarithms of charges and bmi in another file but after I did a scatterplot, it was even worse. I decided to keep variables without logarithms.
fit1 <- lm(charges ~ age + bmi + smokerF, #Dependent and explanatory variable
data = mydata1) #Name of the data frame
vif(fit1)
## age bmi smokerF
## 1.024468 1.017070 1.011959
mean(vif(fit1))
## [1] 1.017833
All tree VIF are close to 1 and also mean is very close to 1, so we can say that we do not have problems with multicolinearity.
mydata1$StdResid <- round(rstandard(fit1), 3) #Standardized residuals
mydata1$CooksD <- round(cooks.distance(fit1), 3) #Cooks distances
hist(mydata1$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
From the graph we can observe that there are some outliers (values that are below -3 or above than 3) that we need to remove.
I will now do the Saphiro Wilk test for normality. In our case this is not needed because we have enough large sample, but I will just do it as a purpose of practice.
shapiro.test(mydata1$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata1$StdResid
## W = 0.88057, p-value < 2.2e-16
We reject the null hypothesis (p=0.001) and conclude that errors are not normally distributed, however, this does not matter since the sample size of 400 is sufficiently big.
hist(mydata1$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
All values are less than 1 which is good, but it should not be any gap between values.
head(mydata1[order(-mydata1$StdResid),], 15)
## age sex bmi children smoker region charges sexF smokerF
## 275 52 male 34.485 3 yes northwest 60021.40 male yes
## 89 34 male 22.420 2 no northeast 27375.90 male no
## 178 21 female 32.680 2 no northwest 26018.95 female no
## 137 23 male 18.715 0 no northwest 21595.38 male no
## 90 28 male 36.400 1 yes southwest 51194.56 male yes
## 291 23 female 24.225 2 no northeast 22395.74 female no
## 199 26 female 29.640 4 no northeast 24671.66 female no
## 326 28 female 24.320 1 no northeast 23288.93 female no
## 395 60 male 28.595 0 no northeast 30260.00 male no
## 138 60 female 25.840 0 no northwest 28923.14 female no
## 111 53 female 32.300 2 no northeast 29186.48 female no
## 257 19 female 27.930 3 no northwest 18838.70 female no
## 380 52 male 26.400 3 no southeast 25992.82 male no
## 8 28 female 27.500 2 no southwest 20177.67 female no
## 115 35 female 38.095 2 no northeast 24915.05 female no
## regionF StdResid CooksD
## 275 northwest 4.053 0.074
## 89 northeast 3.818 0.029
## 178 northwest 3.560 0.026
## 137 northwest 3.520 0.048
## 90 southwest 3.453 0.053
## 291 northeast 3.341 0.025
## 199 northeast 3.297 0.015
## 326 northeast 3.280 0.019
## 395 northeast 2.915 0.020
## 138 northwest 2.846 0.023
## 111 northeast 2.813 0.011
## 257 northwest 2.713 0.016
## 380 southeast 2.648 0.012
## 8 southwest 2.591 0.009
## 115 northeast 2.519 0.012
head(mydata1[order(mydata1$StdResid),], 15)
## age sex bmi children smoker region charges sexF smokerF
## 340 38 male 19.300 0 yes southwest 15820.70 male yes
## 66 42 male 24.640 0 yes southeast 19515.54 male yes
## 106 53 male 20.900 0 yes southeast 21195.82 male yes
## 62 43 male 20.130 2 yes southeast 18767.74 male yes
## 71 33 male 24.795 0 yes northeast 17904.53 male yes
## 127 25 male 24.130 0 yes northwest 15817.99 male yes
## 195 29 female 21.850 0 yes northeast 16115.30 female yes
## 3 27 female 24.750 0 yes southeast 16577.78 female yes
## 375 39 male 26.410 0 yes northeast 20149.32 male yes
## 235 42 male 30.000 0 yes southwest 22144.03 male yes
## 366 27 male 28.500 0 yes northwest 18310.74 male yes
## 185 20 male 27.300 0 yes southwest 16232.85 male yes
## 255 40 female 22.220 2 yes southeast 19444.27 female yes
## 44 36 female 22.600 2 yes southwest 18608.26 female yes
## 290 29 female 21.755 1 yes northeast 16657.72 female yes
## regionF StdResid CooksD
## 340 southwest -1.897 0.019
## 66 southeast -1.732 0.012
## 106 southeast -1.700 0.018
## 62 southeast -1.655 0.015
## 71 northeast -1.643 0.010
## 127 northwest -1.631 0.011
## 195 northeast -1.620 0.012
## 3 southeast -1.619 0.010
## 375 northeast -1.600 0.009
## 235 southwest -1.586 0.009
## 366 northwest -1.534 0.008
## 185 southwest -1.533 0.010
## 255 southeast -1.532 0.011
## 44 southwest -1.529 0.010
## 290 northeast -1.525 0.011
head(mydata1[order(-mydata1$CooksD),], 10)
## age sex bmi children smoker region charges sexF smokerF
## 275 52 male 34.485 3 yes northwest 60021.40 male yes
## 90 28 male 36.400 1 yes southwest 51194.56 male yes
## 137 23 male 18.715 0 no northwest 21595.38 male no
## 89 34 male 22.420 2 no northeast 27375.90 male no
## 178 21 female 32.680 2 no northwest 26018.95 female no
## 38 44 female 38.060 0 yes southeast 48885.14 female yes
## 291 23 female 24.225 2 no northeast 22395.74 female no
## 138 60 female 25.840 0 no northwest 28923.14 female no
## 368 25 male 45.540 2 yes southeast 42112.24 male yes
## 395 60 male 28.595 0 no northeast 30260.00 male no
## regionF StdResid CooksD
## 275 northwest 4.053 0.074
## 90 southwest 3.453 0.053
## 137 northwest 3.520 0.048
## 89 northeast 3.818 0.029
## 178 northwest 3.560 0.026
## 38 southeast 2.333 0.025
## 291 northeast 3.341 0.025
## 138 northwest 2.846 0.023
## 368 southeast 1.580 0.021
## 395 northeast 2.915 0.020
We need to eliminate all units with standardized residuals below -3 and above than +3. This units are: ID 275 89 178 137 90 291 199 326. From calculating Cooks distances, I identified that units with high impact are units with ID 275, 90 and 137. This three units were also identified as an outliers.
mydata2 <- mydata1[abs(mydata1$StdResid) <= 3, ]
I removed outliers. Because units with high impact were also outliers, I did not need to use another code to eliminate them.
fit2 <- lm(charges ~ age + bmi + smokerF,
data = mydata2)
mydata2$StdFitted <- scale(fit2$fitted.values)
mydata2$StdResid <- round(rstandard(fit2), 3)
library(car)
scatterplot(y = mydata2$StdResid, x = mydata2$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## -----------------------------------
## Response : charges
## Variables: fitted values of charges
##
## Test Summary
## -------------------------------
## DF = 1
## Chi2 = 116.9762
## Prob > Chi2 = 2.905388e-27
From the scatterplot we can observe that variances are probably not constant. To prove it I used Breusch Pagan test.
Based on the sample data we can reject H0(p<0.001) and conclude that heteroscedasticity is present. We will need to use robust standard errors.
library(estimatr)
fit_robust <- lm_robust(charges ~ age + bmi + smokerF,
se_type = "HC1",
data = mydata2)
summary(fit_robust)
##
## Call:
## lm_robust(formula = charges ~ age + bmi + smokerF, data = mydata2,
## se_type = "HC1")
##
## Standard error type: HC1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) 9600.0 1815.79 5.287 2.081e-07 6030.0 13170.0 388
## age 261.8 19.29 13.571 1.332e-34 223.9 299.7 388
## bmi 354.4 50.87 6.966 1.402e-11 254.3 454.4 388
## smokerFno -22880.9 1038.74 -22.027 2.458e-70 -24923.1 -20838.6 388
##
## Multiple R-squared: 0.7651 , Adjusted R-squared: 0.7633
## F-statistic: 225.3 on 3 and 388 DF, p-value: < 2.2e-16
Charges = 9600 + 261.8age + 354.4bmi - 22880.9smokerFno
Multiple R-squared: 76.51%. - 76.51 % of variability of the charges is explained by variability of age, bmi and smoking status.
F-statistic:
Based on the sample data we reject H0(p<0.001) and conclude that at least one independent variable has impact on dependent variable. This model is better model than model without any independent variable.
Based on the sample data, we found that age, Body Max Index and smoking status influence the expenditures of insurance company.