Introduction

The dataset comes from Kaggle and contains a sample of US health insurance records, including each person's age and BMI. I am building a linear model to understand how BMI varies with age and to see whether BMI fluctuates substantially across age groups.

The link to the dataset is https://www.kaggle.com/teertha/ushealthinsurancedataset/data#

Load CSV

df <- read.csv("C:/Users/jey19/Desktop/MSDS/Semester 2/DATA 605/Discussion/insurance.csv")
df

Correlation Coefficient

The correlation coefficient measures the strength and direction of the linear relationship between two variables. Values close to 1 or -1 indicate a strong relationship, while values close to 0 indicate a weak relationship or none.

cor_df <- cor(df$age, df$bmi)
cor_df  
## [1] 0.1092719

The correlation coefficient is about 0.11, which indicates only a weak positive linear relationship between age and BMI.
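To make the meaning of the coefficient concrete, here is a minimal sketch that computes Pearson's r by hand and compares it with `cor()`. The vectors `x` and `y` are toy values made up for illustration; they are not taken from the insurance data.

```r
# Toy vectors, made up for illustration (not from the insurance data)
x <- c(19, 25, 33, 41, 52, 60)
y <- c(27.9, 24.1, 30.2, 28.8, 33.1, 29.5)

# Pearson's r: co-variation of x and y, scaled by the variation of each
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# cor() uses the Pearson method by default
r_builtin <- cor(x, y)

# TRUE when the two values agree up to floating-point tolerance
all.equal(r_manual, r_builtin)
```

Because the result is scaled by the variation in both variables, it always lies between -1 and 1, regardless of the units of age or BMI.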

Linear Model

The plot shows how the data is distributed across age and BMI, with the fitted regression line overlaid.

lm_age <- lm(bmi ~ age, data = df)
lm_age
## 
## Call:
## lm(formula = bmi ~ age, data = df)
## 
## Coefficients:
## (Intercept)          age  
##    28.80389      0.04743
# Age goes on the x-axis so the axes match bmi ~ age and abline() draws the fitted line correctly
plot(x = df$age, y = df$bmi)
abline(lm_age)
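To read the fitted line in concrete terms, the sketch below plugs a few ages into the equation using the coefficients printed above. The ages 20, 40, and 60 are arbitrary illustration values, and `intercept`, `slope`, and `predicted_bmi` are names introduced only for this example.

```r
# Coefficients copied from the lm() output above
intercept <- 28.80389
slope <- 0.04743

# Arbitrary ages chosen for illustration
ages <- c(20, 40, 60)

# Fitted line: bmi = intercept + slope * age
predicted_bmi <- intercept + slope * ages
predicted_bmi

# Expected change in BMI across a 40-year age gap
slope * 40
```

With the fitted object available, `predict(lm_age, newdata = data.frame(age = ages))` should give the same values up to rounding of the printed coefficients. A 40-year age gap corresponds to a difference of less than 2 BMI units, which is small next to the residual standard error of about 6.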

summary(lm_age)
## 
## Call:
## lm(formula = bmi ~ age, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.791  -4.359  -0.240   4.127  23.472 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 28.80389    0.49158  58.595  < 2e-16 ***
## age          0.04743    0.01180   4.018 6.19e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.064 on 1336 degrees of freedom
## Multiple R-squared:  0.01194,    Adjusted R-squared:  0.0112 
## F-statistic: 16.15 on 1 and 1336 DF,  p-value: 6.194e-05
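The summary can be cross-checked against the correlation computed earlier. In a simple linear regression with one predictor, the multiple R-squared equals the squared correlation coefficient, and the F-statistic equals the squared t value of the slope. A minimal sketch using only the numbers printed above (the variable names are introduced for this example):

```r
# Values copied from the output above
r <- 0.1092719        # cor(df$age, df$bmi)
t_value <- 4.018      # t value of the age coefficient

# R-squared of a one-predictor model is the squared correlation
r_squared <- r^2
round(r_squared, 5)

# F-statistic of a one-predictor model is the squared t value;
# small differences from the printed 16.15 come from rounding of t
f_stat <- t_value^2
f_stat
```

An R-squared of about 0.012 means age accounts for roughly 1.2% of the variation in BMI, even though the slope is statistically significant.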

Plot Residuals

Plotting the residuals helps check whether they are roughly normally distributed and whether their variance stays constant across ages.

hist(lm_age$residuals)

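The histogram suggests that the residuals are roughly symmetric and approximately normally distributed. A histogram alone cannot show whether the spread of the residuals changes as age grows, so the Q-Q plot and the residual plots that follow are needed as well.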

qqnorm(lm_age$residuals)
qqline(lm_age$residuals)  

plot(lm_age$residuals ~ df$age)

par(mfrow=c(2,2))
plot(lm_age)
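The visual checks above can be complemented with numeric ones. The sketch below is not part of the original analysis: it uses simulated data as a stand-in for the insurance file, and `sim`, `fit`, `res`, `sw`, and `spread` are hypothetical names. It applies a Shapiro-Wilk test for normality of the residuals and compares the residual spread between younger and older observations as a rough check for constant variance.

```r
set.seed(605)

# Simulated stand-in for the insurance data (hypothetical values)
sim <- data.frame(age = sample(18:64, 200, replace = TRUE))
sim$bmi <- 28.8 + 0.047 * sim$age + rnorm(200, sd = 6)

fit <- lm(bmi ~ age, data = sim)
res <- residuals(fit)

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal
sw <- shapiro.test(res)
sw$p.value

# Rough constant-variance check: residual standard deviation
# for ages at or below the median versus above it
spread <- tapply(res, sim$age > median(sim$age), sd)
spread
```

On the real data the same calls would apply to `lm_age$residuals` and `df$age`. Similar standard deviations in the two groups would be consistent with constant variance, though the residuals-versus-fitted panel from `plot(lm_age)` remains the more complete view.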

Conclusion

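The model shows a small but statistically significant positive association between age and BMI: the slope is about 0.047 BMI units per year of age (p < 0.001), yet age explains only about 1.2% of the variance in BMI (R-squared = 0.01194). BMI therefore does not fluctuate much with age in this sample.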