Introduction

In this notebook, I fit a linear regression model to this dataset to estimate the average insurance charge for each group of beneficiaries.

According to Investopedia, insurance is a contract, represented by a policy, in which a policyholder receives financial protection or reimbursement against losses from an insurance company. The company pools clients’ risks to make payments more affordable for the insured. Most people have some insurance: for their car, their house, their healthcare, or their life. 1

About Dataset

Content

The following description of the dataset comes from its Data Card on Kaggle.

age : age of primary beneficiary

sex: insurance contractor gender, female, male

bmi: body mass index, an objective index of body weight relative to height (kg / m ^ 2), computed as weight divided by height squared and indicating weights that are relatively high or low for a given height, ideally 18.5 to 24.9

children: Number of children covered by health insurance / Number of dependents

smoker: smoking status, yes, no

region: the beneficiary’s residential area in the US, northeast, southeast, southwest, northwest.

charges: Individual medical costs billed by health insurance
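The bmi definition above can be illustrated with a short calculation. This is only a sketch: the weight and height below are made-up values for a hypothetical person, not rows from the dataset.

```r
# Hypothetical person; these values are illustrative, not taken from the dataset.
weight_kg <- 70
height_m  <- 1.75

# BMI is weight in kilograms divided by the square of height in metres.
bmi <- weight_kg / height_m^2
round(bmi, 2)
```

A value in the 18.5 to 24.9 range would fall in the "ideal" band mentioned in the data card.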

Loading libraries

First, I load the packages that I will use during the analysis.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.1.8
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(psych)
## 
## Attaching package: 'psych'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(ggsci)

Getting the data

insurance<-read.csv("insurance.csv")

Exploration of the data

The structure of the data

str(insurance)
## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : chr  "female" "male" "male" "male" ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : chr  "yes" "no" "no" "no" ...
##  $ region  : chr  "southwest" "southeast" "southeast" "northwest" ...
##  $ charges : num  16885 1726 4449 21984 3867 ...

In addition to the dependent variable (charges), we have bmi, a numeric variable; two integer variables giving the individual’s age and number of children; and three character variables giving the individual’s sex, region, and smoking status.

The first fifteen rows of the data

head(insurance,15)
##    age    sex    bmi children smoker    region   charges
## 1   19 female 27.900        0    yes southwest 16884.924
## 2   18   male 33.770        1     no southeast  1725.552
## 3   28   male 33.000        3     no southeast  4449.462
## 4   33   male 22.705        0     no northwest 21984.471
## 5   32   male 28.880        0     no northwest  3866.855
## 6   31 female 25.740        0     no southeast  3756.622
## 7   46 female 33.440        1     no southeast  8240.590
## 8   37 female 27.740        3     no northwest  7281.506
## 9   37   male 29.830        2     no northeast  6406.411
## 10  60 female 25.840        0     no northwest 28923.137
## 11  25   male 26.220        0     no northeast  2721.321
## 12  62 female 26.290        0    yes southeast 27808.725
## 13  23   male 34.400        0     no southwest  1826.843
## 14  56 female 39.820        0     no southeast 11090.718
## 15  27   male 42.130        0    yes southeast 39611.758

We’ll analyze this data through the following steps.

1- Look at basic descriptive statistics for the data.

2- Search for duplicates and NAs.

3- Look at the distributions of the numeric variables and bar plots of the character variables.

4- Look at the scatter plot of each variable against the dependent variable.

5- Prepare the data for the analysis.

6- Fit a multiple linear regression model using all the data.

7- Evaluate the model and try to improve its performance.

Basic descriptive statistics for the data

describe(insurance)
##          vars    n     mean       sd  median  trimmed     mad     min      max
## age         1 1338    39.21    14.05   39.00    39.01   17.79   18.00    64.00
## sex*        2 1338     1.51     0.50    2.00     1.51    0.00    1.00     2.00
## bmi         3 1338    30.66     6.10   30.40    30.50    6.20   15.96    53.13
## children    4 1338     1.09     1.21    1.00     0.94    1.48    0.00     5.00
## smoker*     5 1338     1.20     0.40    1.00     1.13    0.00    1.00     2.00
## region*     6 1338     2.52     1.10    3.00     2.52    1.48    1.00     4.00
## charges     7 1338 13270.42 12110.01 9382.03 11076.02 7440.81 1121.87 63770.43
##             range  skew kurtosis     se
## age         46.00  0.06    -1.25   0.38
## sex*         1.00 -0.02    -2.00   0.01
## bmi         37.17  0.28    -0.06   0.17
## children     5.00  0.94     0.19   0.03
## smoker*      1.00  1.46     0.14   0.01
## region*      3.00 -0.04    -1.33   0.03
## charges  62648.55  1.51     1.59 331.07

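The trimmed mean is close to the mean for every variable except charges, where it is noticeably lower (11076.02 versus 13270.42), which suggests a positive skew; the skew value of 1.51 supports this. The kurtosis values reported by describe() are, by default, excess kurtosis measured relative to a normal distribution, so the positive value for charges (1.59) points to somewhat heavier tails than normal, while the negative values for age and region point to lighter tails.2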
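To make the comparison between the mean and the trimmed mean concrete, here is a minimal sketch on a small made-up vector with one large value. The vector `x` is an assumption for illustration only, and the skewness and kurtosis use simple moment-based formulas that may differ slightly from the defaults of describe().

```r
# Made-up right-skewed sample: one large value pulls the mean upward.
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 50)

plain_mean   <- mean(x)
# Drop 10% of the observations from each end before averaging.
trimmed_mean <- mean(x, trim = 0.1)

# Central moments for simple moment-based skewness and excess kurtosis.
m2 <- mean((x - plain_mean)^2)
m3 <- mean((x - plain_mean)^3)
m4 <- mean((x - plain_mean)^4)

skewness        <- m3 / m2^1.5
excess_kurtosis <- m4 / m2^2 - 3   # 0 for a normal distribution

c(mean = plain_mean, trimmed = trimmed_mean,
  skew = skewness, excess_kurtosis = excess_kurtosis)
```

When the trimmed mean sits well below the plain mean, as here, the upper tail is doing most of the pulling, which is the same pattern seen for charges.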

Checking for NAs

colSums(is.na(insurance))
##      age      sex      bmi children   smoker   region  charges 
##        0        0        0        0        0        0        0

Checking for duplicates

dim(insurance)
## [1] 1338    7
dim(unique(insurance))
## [1] 1337    7

There is one duplicate in the dataset; let’s look at the duplicated row.

insurance[duplicated(insurance),]
##     age  sex   bmi children smoker    region  charges
## 582  19 male 30.59        0     no northwest 1639.563

Given the values in this row, it could be a genuine repeated observation, but let’s remove it.

insurance<-distinct (insurance)
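The duplicate check and removal can be sketched on a toy data frame using base R only. The data frame `toy` below is made up for illustration; `duplicated()` flags the second and later copies of a row, which is why only one of the two identical rows is dropped.

```r
# Toy data frame (illustrative values) in which rows 1 and 2 are identical.
toy <- data.frame(
  age     = c(19, 19, 45),
  smoker  = c("no", "no", "yes"),
  charges = c(1639.563, 1639.563, 21000),
  stringsAsFactors = FALSE
)

# TRUE marks a row that repeats an earlier row.
dup_flags <- duplicated(toy)

# Keep only the first occurrence of each row.
deduped <- toy[!dup_flags, ]
nrow(deduped)
```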

Distributions of the variables

Let’s start with the age.

ggplot(insurance, aes(x =age)) +
  geom_histogram(fill="#3B9C9C",binwidth=4,col="white")+
  theme_solarized()+
  scale_fill_brewer(palette="Set2")+
  ggtitle("Distribution of age")+
  theme(plot.title = element_text(hjust = 0.5))+
  xlab("Age")

Apart from a slight excess of individuals aged 18 to 22, the distribution is nearly uniform, so all age groups are represented about equally in the dataset.

BMI

ggplot(insurance, aes(x =bmi)) +
  geom_histogram(fill="#5E5A80",binwidth=4,col="white")+
  theme_solarized()+
  scale_fill_brewer(palette="Set2")+
  ggtitle("Distribution of bmi")+
  theme(plot.title = element_text(hjust = 0.5))+
  xlab("BMI")

The bmi distribution is approximately normal with a slight positive skew.

Charges

ggplot(insurance, aes(x =charges)) +
  geom_histogram(fill="#728FCE",col="white")+
  theme_solarized()+
  scale_fill_brewer(palette="Set2")+
  ggtitle("Distribution of charges")+
  theme(plot.title = element_text(hjust = 0.5))+
  xlab("Charges")

The charges variable appears to follow a roughly log-normal distribution. Normal distributions may present a few problems that log-normal distributions can solve. Mainly, a normal distribution allows negative values, while a log-normal distribution takes only positive values.

One of the most common applications of log-normal distributions in finance is the analysis of stock prices.3

The log-normal distribution is right-skewed, with a long tail towards the right.4
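As a small illustration of what log-normality implies, the sketch below simulates a log-normal sample and compares its skewness before and after taking logs. The sample, its parameters, and the helper `skew()` are assumptions made for illustration; nothing here uses the insurance data.

```r
# Simulated log-normal sample; the parameters are arbitrary illustrative choices.
set.seed(1)
simulated <- rlnorm(1000, meanlog = 9, sdlog = 1)

# Simple moment-based skewness (hypothetical helper for this example).
skew <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

skew_raw <- skew(simulated)        # strongly positive for a log-normal sample
skew_log <- skew(log(simulated))   # close to zero, since log(x) is normal

c(raw = skew_raw, logged = skew_log)
```

If charges behaved the same way, a histogram of log(charges) would look much more symmetric than the histogram above.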

Sex

ggplot(insurance, aes(x =sex,fill=sex)) +
  geom_bar()+
  theme_solarized()+
  scale_fill_brewer(palette="Set1")+
  ggtitle("Distribution of sex")+
  theme(plot.title = element_text(hjust = 0.5))+
  xlab("Sex")

Number of children

ggplot(insurance, aes(x =children,fill=as.factor(children))) +
  geom_bar()+
  theme_solarized()+
  scale_fill_brewer(palette="Set2")+
  ggtitle("Distribution of children")+
  theme(plot.title = element_text(hjust = 0.5))+
  theme(legend.position = "none")+
  xlab("Children")

Smoker

ggplot(insurance, aes(x =smoker,fill=smoker)) +
  geom_bar()+
  theme_solarized()+
  scale_fill_brewer(palette="Accent")+
  ggtitle("Distribution of smoker")+
  theme(plot.title = element_text(hjust = 0.5))+
  xlab("Smoker")

Region

ggplot(insurance, aes(x =region,fill=region)) +
  geom_bar()+
  theme_solarized()+
  scale_fill_brewer(palette="Dark2")+
  ggtitle("Distribution of region")+
  theme(plot.title = element_text(hjust = 0.5))+
  xlab("Region")

We seem to have roughly equal numbers for each sex and for each region, but many more non-smokers than smokers. Also, the number of people decreases as the number of children increases, giving a right-skewed distribution.

Effect of the character variables on the charges

Sex

ggplot(insurance,aes(y=charges,x=sex,fill=sex))+
  geom_boxplot()+
  ggtitle("Boxplot of the charges by sex")+
  xlab("Sex")+
  ylab("Charges")+
  theme_solarized()+
  scale_fill_brewer(palette="Set2")+
  theme(plot.title = element_text(hjust = 0.5))

Number of children

ggplot(insurance,aes(y=charges,x=as.factor(children),fill=as.factor(children)))+
  geom_boxplot()+
  ggtitle("Boxplot of the charges by number of children")+
  ylab("charges")+
  xlab("Number of children")+
  theme_solarized()+
  scale_fill_brewer(palette="Set1")+
  theme(legend.position = "none")+
  theme(plot.title = element_text(hjust = 0.5))

Smoker

ggplot(insurance,aes(y=charges,x=smoker,fill=smoker))+
  geom_boxplot()+
  ggtitle("Boxplot of the charges by smoker")+
  ylab("charges")+
  xlab("smoker")+
  theme_solarized()+
  scale_fill_startrek()+
  theme(plot.title = element_text(hjust = 0.5))

Region

ggplot(insurance,aes(y=charges,x=region,fill=region))+
  geom_boxplot()+
  ggtitle("Boxplot of the charges by region")+
  ylab("charges")+
  xlab("region")+
  theme_solarized()+
  scale_fill_jco()+
  theme(plot.title = element_text(hjust = 0.5))

Based on the plots, there appear to be no substantial differences in charges across the levels of sex, region, or number of children. Strictly speaking, an ANOVA test is needed to confirm that, but we will skip it for now because any differences seem to be small, and we can neglect them at this stage.

Smoking status, on the other hand, will likely weigh heavily in our upcoming model.
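For reference, the ANOVA test mentioned above could look like the following sketch. It assumes a one-way ANOVA is what is meant and runs on simulated data, so the data frame `toy` and its values are illustrative only and say nothing about the real dataset.

```r
# Simulated data: four regions whose charges come from the same distribution.
set.seed(42)
toy <- data.frame(
  region  = rep(c("northeast", "northwest", "southeast", "southwest"), each = 25),
  charges = rlnorm(100, meanlog = 9, sdlog = 0.8)
)

# One-way ANOVA of charges across regions.
fit <- aov(charges ~ region, data = toy)

# Extract the p-value of the F-test for the region factor.
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
p_value
```

A large p-value would be consistent with no difference between groups. Because charges are strongly skewed, a rank-based alternative such as kruskal.test() might also be worth considering.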

Visualizing the relationship between the dependent and independent variables

We create scatter plots of the charges against each numeric independent variable and then look at how the character variables affect them.

ggplot(insurance,aes(x=age,y=charges))+
  geom_point(col="#488AC7")+
  theme_solarized()+
  ggtitle("Relationship between charges and age")+
  xlab("Age")+
  ylab("charges")+
  theme(plot.title = element_text(hjust = 0.5))

Charges clearly tend to increase with age, and the points fall into distinct bands, which suggests that the level of the charge is determined by one or two character variables. The relationship looks linear, or possibly a mild second- or third-degree polynomial.

Let’s separate the plot by the character variables.

ggplot(insurance,aes(x=age,y=charges,col=sex))+
  geom_point()+
  theme_solarized()+
  ggtitle("Relationship between charges and age by sex")+
  xlab("Age")+
  ylab("charges")+
  theme(plot.title = element_text(hjust = 0.5))

ggplot(insurance,aes(x=age,y=charges,col=smoker))+
  geom_point()+
  theme_solarized()+
  ggtitle("Relationship between charges and age by smoker")+
  xlab("Age")+
  ylab("charges")+
  theme(plot.title = element_text(hjust = 0.5))

ggplot(insurance,aes(x=age,y=charges,col=region))+
  geom_point()+
  theme_solarized()+
  ggtitle("Relationship between charges and age by region")+
  xlab("Age")+
  ylab("charges")+
  theme(plot.title = element_text(hjust = 0.5))

ggplot(insurance,aes(x=age,y=charges,col=as.factor(children)))+
  geom_point()+
  theme_solarized()+
  ggtitle("Relationship between charges and age by number of children ")+
  xlab("Age")+
  ylab("charges")+
  guides(color = guide_legend(title = "Number of children"))+
  theme(plot.title = element_text(hjust = 0.5))

As the boxplots suggested, smoking status clearly separates the charges and has a large impact on the dependent variable. The two bands appear to be parallel, so we don’t need an interaction term between the age and smoker variables5. There also seems to be another separating variable besides the character variables we have, so we will create a new character variable, which we’ll call “obs,” based on the body mass index (bmi) according to these rules: underweight (under 18.5 kg/m2), normal weight (18.5 to 24.9), overweight (25 to 29.9), and obese (30 or more).6

# Conditions are checked in order, so no BMI value falls between categories
insurance$obs<-case_when(
  insurance$bmi<18.5~"underweight",
  insurance$bmi<25~"normal",
  insurance$bmi<30~"overweight",
  insurance$bmi>=30~"obese")
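The same classification can also be written with base R's cut(), shown here as a sketch on a few made-up BMI values. The vector `bmi_values` is hypothetical; `right = FALSE` makes each interval closed on the left and open on the right, so every value lands in exactly one category.

```r
# Made-up BMI values, including values right at and near the category boundaries.
bmi_values <- c(17, 18.5, 24.9995, 25, 29.99, 30, 42.13)

# Intervals are [-Inf, 18.5), [18.5, 25), [25, 30), [30, Inf).
bmi_category <- cut(
  bmi_values,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("underweight", "normal", "overweight", "obese"),
  right  = FALSE
)

as.character(bmi_category)
```

cut() returns a factor, which lm() handles in the same way as a character column.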

Visualize the numbers of the new variable and its impact on the charges.

ggplot(insurance, aes(x =obs,fill=obs)) +
  geom_bar()+
  theme_solarized()+
  scale_fill_brewer(palette="Set1")+
  ggtitle("Distribution of category of BMI")+
  guides(fill = guide_legend(title = "category of BMI"))+
  theme(plot.title = element_text(hjust = 0.5))+
  xlab("category of BMI")

ggplot(insurance,aes(y=charges,x=obs,fill=obs))+
  geom_boxplot()+
  ggtitle("Boxplot of the charges by the category of BMI")+
  ylab("charges")+
  xlab("category of BMI")+
  theme_solarized()+
  scale_fill_brewer(palette="Set1")+
  theme(legend.position = "none")+
  theme(plot.title = element_text(hjust = 0.5))

Clearly, the data set appears to have more instances of the obese and overweight categories than of the normal weight and underweight category.

Obesity also appears to have a substantial impact on charges: in the boxplot, charges above forty thousand appear almost exclusively in the obese category.

Let’s see the plot of age separated by category of BMI

ggplot(insurance,aes(x=age,y=charges,col=obs))+
  geom_point()+
  theme_solarized()+
  ggtitle("Relationship between charges and age by category of BMI ")+
  xlab("Age")+
  ylab("charges")+
  guides(color = guide_legend(title = "category of BMI"))+
  theme(plot.title = element_text(hjust = 0.5))

The category of BMI clearly separates the charges, and the bands again look parallel, so we don’t need an interaction term between the age and category of BMI variables.

Relationship between charges and age by category of BMI and smoking state

ggplot(insurance,aes(x=age,y=charges,col=obs,shape=smoker))+
    geom_point()+
    theme_solarized()+
    ggtitle("Relationship between charges and age by category of BMI and smoking state")+
    xlab("Age")+
    ylab("charges")+
    guides(color = guide_legend(title = "category of BMI"))+
    theme(plot.title = element_text(hjust = 0.5))

Nearly everyone in the highest charge group appears to be an obese smoker, and charges rise with age.

Let’s see the impact of bmi on charges as a numeric variable and combine it with the other character variables.

ggplot(insurance,aes(x=bmi,y=charges))+
  geom_point(col="#488AC7")+
  theme_solarized()+
  ggtitle("Relationship between charges and BMI")+
  xlab("BMI")+
  ylab("charges")+
  theme(plot.title = element_text(hjust = 0.5))

Separate the plot by the character variables.

ggplot(insurance,aes(x=bmi,y=charges,col=sex))+
  geom_point()+
  theme_solarized()+
  ggtitle("Relationship between charges and BMI by sex")+
  xlab("BMI")+
  ylab("charges")+
  theme(plot.title = element_text(hjust = 0.5))

ggplot(insurance,aes(x=bmi,y=charges,col=smoker))+
  geom_point()+
  theme_solarized()+
  ggtitle("Relationship between charges and BMI by smoker")+
  xlab("BMI")+
  ylab("charges")+
  theme(plot.title = element_text(hjust = 0.5))

ggplot(insurance,aes(x=bmi,y=charges,col=region))+
  geom_point()+
  theme_solarized()+
  ggtitle("Relationship between charges and BMI by region")+
  xlab("BMI")+
  ylab("charges")+
  theme(plot.title = element_text(hjust = 0.5))

ggplot(insurance,aes(x=bmi,y=charges,col=as.factor(children)))+
  geom_point()+
  theme_solarized()+
  ggtitle("Relationship between charges and BMI by number of children ")+
  xlab("BMI")+
  ylab("charges")+
  guides(color = guide_legend(title = "Number of children"))+
  theme(plot.title = element_text(hjust = 0.5))

The plots suggest that a higher BMI is associated with higher charges mainly among smokers; for non-smokers, charges barely change with BMI.
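One way to quantify "the slope depends on the group" is to fit a separate simple regression per group and compare the slopes. The sketch below does this on made-up, noise-free data in which the slopes are known in advance (20 for non-smokers, 1400 for smokers); the data frame `toy` and these numbers are assumptions for illustration, not estimates from the insurance data.

```r
# Made-up data with a known BMI slope in each smoking group.
bmi <- seq(20, 40, by = 2)
toy <- rbind(
  data.frame(bmi = bmi, smoker = "no",  charges = 5000 + 20 * bmi),
  data.frame(bmi = bmi, smoker = "yes", charges = -10000 + 1400 * bmi)
)

# Fit charges ~ bmi separately within each smoking group and keep the slope.
slope_by_group <- sapply(
  split(toy, toy$smoker),
  function(d) coef(lm(charges ~ bmi, data = d))[["bmi"]]
)

slope_by_group
```

Clearly different slopes between groups are the pattern that an interaction term in a single model is meant to capture.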

Model Fitting

Multiple linear regression model using all the data

model1<-lm(charges~.,data=insurance)
summary(model1)
## 
## Call:
## lm(formula = charges ~ ., data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11904.4  -3466.2   -167.2   1577.0  28412.9 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -7373.45    1372.27  -5.373 9.13e-08 ***
## age               256.75      11.81  21.732  < 2e-16 ***
## sexmale          -160.33     330.28  -0.485 0.627436    
## bmi               135.56      54.19   2.501 0.012488 *  
## children          477.67     136.71   3.494 0.000491 ***
## smokeryes       23850.31     409.76  58.205  < 2e-16 ***
## regionnorthwest  -393.34     473.23  -0.831 0.406020    
## regionsoutheast  -882.82     476.02  -1.855 0.063875 .  
## regionsouthwest  -964.23     474.61  -2.032 0.042390 *  
## obsobese         3076.89     815.69   3.772 0.000169 ***
## obsoverweight     170.57     571.87   0.298 0.765542    
## obsunderweight   -579.21    1429.01  -0.405 0.685308    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6009 on 1325 degrees of freedom
## Multiple R-squared:  0.7558, Adjusted R-squared:  0.7538 
## F-statistic: 372.8 on 11 and 1325 DF,  p-value: < 2.2e-16

The coefficients indicate that, holding all other factors constant, each additional year of age raises insurance charges by an average of $256.75. Charges rise by $135.56 for every unit increase in BMI (within the same BMI category) and by $477.67 for each additional child. Smoking status is the primary factor: holding everything else fixed, being a smoker increases charges by $23,850.31. Charges for men are $160.33 lower than for women, although this difference is not statistically significant (p = 0.627).

As we see, the adjusted R-squared is 0.7538 and the largest residual is 28412.9.
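To see how the coefficients combine into a prediction, here is a hand calculation using the estimates from the summary output above. The person described is hypothetical: a 40-year-old woman with a BMI of 22, no children, living in the northeast. It assumes R's default reference levels (female, non-smoker, northeast, normal BMI category), which is consistent with those levels not appearing in the coefficient table.

```r
# Estimates copied from the model1 summary; only the terms that are
# non-zero for this hypothetical person are needed.
coefs <- c(intercept = -7373.45, age = 256.75, bmi = 135.56, smokeryes = 23850.31)

# Hypothetical person (not a row from the dataset).
age <- 40
bmi <- 22

# Reference levels contribute nothing beyond the intercept.
pred_nonsmoker <- coefs[["intercept"]] + coefs[["age"]] * age + coefs[["bmi"]] * bmi

# The same person as a smoker: add the smokeryes coefficient.
pred_smoker <- pred_nonsmoker + coefs[["smokeryes"]]

c(non_smoker = pred_nonsmoker, smoker = pred_smoker)
```

The gap between the two predictions equals the smokeryes coefficient, which is what "holding all other factors constant" means in practice.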

Our goal is to build another model that increases the adjusted R-squared and decreases the error.

Improving the model

We will start by adding a squared age term.

insurance$age2<-insurance$age^2
model2<-lm(charges~.,data=insurance)
summary(model2)
## 
## Call:
## lm(formula = charges ~ ., data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11869.6  -3376.1     78.4   1380.9  29234.2 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -2559.133   1906.963  -1.342 0.179828    
## age               -31.812     80.614  -0.395 0.693187    
## sexmale          -165.246    328.782  -0.503 0.615328    
## bmi               137.286     53.949   2.545 0.011049 *  
## children          631.825    142.605   4.431 1.02e-05 ***
## smokeryes       23861.438    407.917  58.496  < 2e-16 ***
## regionnorthwest  -406.505    471.095  -0.863 0.388352    
## regionsoutheast  -885.337    473.860  -1.868 0.061934 .  
## regionsouthwest  -965.598    472.457  -2.044 0.041173 *  
## obsobese         3004.860    812.236   3.699 0.000225 ***
## obsoverweight     216.739    569.419   0.381 0.703537    
## obsunderweight   -681.194   1422.815  -0.479 0.632185    
## age2                3.637      1.005   3.618 0.000308 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5982 on 1324 degrees of freedom
## Multiple R-squared:  0.7582, Adjusted R-squared:  0.756 
## F-statistic:   346 on 12 and 1324 DF,  p-value: < 2.2e-16

The squared age term is significant, but it barely improves the fit: the adjusted R-squared moves from 0.7538 to 0.756 and the residual standard error from 6009 to 5982.

Add an interaction term between BMI and smoking status.
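As a side note, a squared term can also be specified directly in the formula with I(), without adding a column to the data frame. The sketch below shows this on made-up, noise-free data where the true coefficients are known (2, 3 and 0.5); the variables `x` and `y` are illustrative and unrelated to the insurance data.

```r
# Made-up data generated from an exact quadratic: y = 2 + 3x + 0.5x^2.
x <- 1:10
y <- 2 + 3 * x + 0.5 * x^2
toy <- data.frame(x = x, y = y)

# I() protects x^2 so that ^ is treated as arithmetic, not formula syntax.
fit <- lm(y ~ x + I(x^2), data = toy)

coef(fit)
```

With this form, a later `y ~ .` formula would not silently pick up an extra squared column, which may or may not be what is wanted.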

model3<-lm(
  charges~age+age2+bmi*smoker+children+region+sex+obs,
  data = insurance
          )
summary(model3)
## 
## Call:
## lm(formula = charges ~ age + age2 + bmi * smoker + children + 
##     region + sex + obs, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11999.0  -2204.7  -1115.8   -167.1  29667.0 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      7.049e+03  1.547e+03   4.557 5.67e-06 ***
## age             -6.921e+00  6.379e+01  -0.109  0.91361    
## age2             3.412e+00  7.954e-01   4.289 1.92e-05 ***
## bmi             -1.869e+02  4.421e+01  -4.227 2.54e-05 ***
## smokeryes       -2.054e+04  1.610e+03 -12.754  < 2e-16 ***
## children         6.631e+02  1.128e+02   5.876 5.30e-09 ***
## regionnorthwest -6.352e+02  3.728e+02  -1.704  0.08869 .  
## regionsoutheast -1.048e+03  3.750e+02  -2.796  0.00525 ** 
## regionsouthwest -1.227e+03  3.739e+02  -3.282  0.00106 ** 
## sexmale         -5.355e+02  2.605e+02  -2.056  0.04000 *  
## obsobese         3.131e+03  6.427e+02   4.872 1.24e-06 ***
## obsoverweight    1.417e+02  4.505e+02   0.315  0.75317    
## obsunderweight  -1.964e+02  1.126e+03  -0.174  0.86156    
## bmi:smokeryes    1.447e+03  5.143e+01  28.141  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4733 on 1323 degrees of freedom
## Multiple R-squared:  0.8487, Adjusted R-squared:  0.8473 
## F-statistic:   571 on 13 and 1323 DF,  p-value: < 2.2e-16

The interaction term is significant; it raises the adjusted R-squared from 0.756 to 0.8473 and lowers the residual standard error from 5982 to 4733.
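With an interaction in the model, the effect of BMI is no longer a single number. A short calculation with the estimates from the model3 summary shows the implied BMI slope for each smoking group. This is a reading aid rather than a new result, and it holds the BMI category fixed, which is somewhat artificial because the category is derived from bmi.

```r
# Estimates copied from the model3 summary output.
b_bmi        <- -186.9   # bmi
b_bmi_smoker <- 1447     # bmi:smokeryes

# For non-smokers the interaction term is zero, so the slope is just b_bmi.
slope_nonsmoker <- b_bmi

# For smokers the interaction coefficient is added to the main effect.
slope_smoker <- b_bmi + b_bmi_smoker

c(non_smoker = slope_nonsmoker, smoker = slope_smoker)
```

The large negative smokeryes main effect in the same output should be read together with this slope: it is the smoker effect extrapolated to a BMI of zero, not at typical BMI values.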

Add another interaction term between smokers and BMI categories.

model4<-lm(
  charges~age+age2+(obs+bmi)*smoker+children+region+sex,
  data = insurance
          )
summary(model4)
## 
## Call:
## lm(formula = charges ~ age + age2 + (obs + bmi) * smoker + children + 
##     region + sex, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15053.0  -1528.9  -1306.1   -917.8  24238.6 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               3104.4729  1474.0996   2.106 0.035392 *  
## age                        -23.0772    59.2372  -0.390 0.696915    
## age2                         3.6234     0.7385   4.906 1.04e-06 ***
## obsobese                   143.0791   669.2005   0.214 0.830731    
## obsoverweight               39.4159   470.1204   0.084 0.933195    
## obsunderweight           -1444.5257  1202.5398  -1.201 0.229878    
## bmi                          0.3899    44.2952   0.009 0.992978    
## smokeryes                 -347.6755  2236.8198  -0.155 0.876504    
## children                   660.1357   105.0472   6.284 4.47e-10 ***
## regionnorthwest           -382.4438   346.0855  -1.105 0.269337    
## regionsoutheast           -895.7580   347.6699  -2.576 0.010090 *  
## regionsouthwest          -1253.6016   346.7486  -3.615 0.000311 ***
## sexmale                   -527.8569   241.6990  -2.184 0.029142 *  
## obsobese:smokeryes       14543.5597  1444.7436  10.067  < 2e-16 ***
## obsoverweight:smokeryes    -16.8278  1015.5776  -0.017 0.986782    
## obsunderweight:smokeryes  4265.0972  2431.0152   1.754 0.079585 .  
## bmi:smokeryes              536.6980    93.5057   5.740 1.17e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4387 on 1320 degrees of freedom
## Multiple R-squared:  0.8704, Adjusted R-squared:  0.8688 
## F-statistic: 553.9 on 16 and 1320 DF,  p-value: < 2.2e-16

This is our final model. It explains around 87% of the variation in charges, with an adjusted R-squared of 0.8688 and a residual standard error of 4387.
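The adjusted R-squared values reported for the four models can be reproduced from the multiple R-squared, the number of observations, and the number of estimated slopes, using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The inputs below are copied from the summary outputs; small differences in the last digit come from the R-squared values being rounded.

```r
# Values taken from the four summary() outputs.
n         <- 1337                               # rows after removing the duplicate
r_squared <- c(model1 = 0.7558, model2 = 0.7582,
               model3 = 0.8487, model4 = 0.8704)
p         <- c(model1 = 11, model2 = 12,
               model3 = 13, model4 = 16)        # predictors excluding the intercept

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

round(adj_r_squared, 4)
```

Because the adjustment penalises extra terms, the increase from model1 to model4 is not simply a side effect of adding more predictors.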


  1. Investopedia: Insurance: Definition, How It Works

  2. What Is Kurtosis?

  3. Investopedia: Log-Normal Distribution

  4. ScienceDirect: Log-Normal Distribution

  5. Interactions in Multiple Linear Regression

  6. Wikipedia: Body mass index