Introduction
The dataset comes from Kaggle and samples people across different age groups together with their BMI. It is a US health insurance dataset, and I am building a linear model to understand how BMI varies with age and to see whether BMI fluctuates strongly as age changes.
The link to the dataset is https://www.kaggle.com/teertha/ushealthinsurancedataset/data#
df <- read.csv("C:/Users/jey19/Desktop/MSDS/Semester 2/DATA 605/Discussion/insurance.csv")
head(df)  # preview the first rows instead of printing the whole data frame
The correlation coefficient measures the strength and direction of the linear relationship between two variables. A value close to 1 or -1 indicates a strong relationship, while a value close to 0 indicates a weak relationship or none.
cor_df <- cor(df$age, df$bmi)
cor_df
## [1] 0.1092719
The correlation coefficient is about 0.11, which indicates only a very weak positive linear relationship between age and BMI.
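A weak correlation can still be statistically distinguishable from zero in a sample of this size, so it helps to look at a test and a confidence interval as well as the point estimate. The following is a minimal sketch using `cor.test()` on simulated stand-in data, not the Kaggle file: the data frame `sim`, the age range 18 to 64, and the parameters used to generate it are assumptions chosen only to resemble this analysis.

```r
# Simulated stand-in data (an assumption, NOT the Kaggle insurance file)
set.seed(605)
n <- 1338
sim <- data.frame(age = sample(18:64, n, replace = TRUE))
sim$bmi <- 28.8 + 0.047 * sim$age + rnorm(n, sd = 6.06)

# Pearson correlation with a t-test of H0: true correlation equals 0
ct <- cor.test(sim$age, sim$bmi)

r_hat <- unname(ct$estimate)  # sample correlation coefficient
ci    <- ct$conf.int          # 95% confidence interval for the correlation
p_val <- ct$p.value           # p-value of the test

print(ct)
```

With the real data, the same call would be `cor.test(df$age, df$bmi)`.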
The plot shows how the data is distributed across age and BMI, with the fitted regression line overlaid.
lm_age <- lm(bmi ~ age, data = df)
lm_age
##
## Call:
## lm(formula = bmi ~ age, data = df)
##
## Coefficients:
## (Intercept)          age
##    28.80389      0.04743
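To see what these coefficients imply in practice, the fitted line can be evaluated at a few ages. This is a small sketch that hard-codes the reported intercept and slope; the ages 20, 40 and 60 are arbitrary illustration points, not values singled out by the dataset.

```r
# Coefficients copied from the lm() output above
intercept <- 28.80389
slope     <- 0.04743

# Illustrative ages (an arbitrary choice for this sketch)
ages <- c(20, 40, 60)
predicted_bmi <- intercept + slope * ages
print(data.frame(age = ages, predicted_bmi = predicted_bmi))

# Expected change in BMI between age 20 and age 60 under the fitted line
bmi_change_20_to_60 <- slope * (60 - 20)
print(bmi_change_20_to_60)
```

Under this model, forty years of age correspond to a change of a little under 2 BMI units, which is small next to the residual standard error of about 6.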
plot(x = df$age, y = df$bmi)  # age on x and BMI on y, matching bmi ~ age so abline() draws the fitted line
abline(lm_age)
summary(lm_age)
##
## Call:
## lm(formula = bmi ~ age, data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -13.791  -4.359  -0.240   4.127  23.472
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.80389    0.49158  58.595  < 2e-16 ***
## age          0.04743    0.01180   4.018 6.19e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.064 on 1336 degrees of freedom
## Multiple R-squared: 0.01194, Adjusted R-squared: 0.0112
## F-statistic: 16.15 on 1 and 1336 DF, p-value: 6.194e-05
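In a simple linear regression with one predictor, R-squared equals the square of the Pearson correlation, and the t value of the slope can be recovered from the correlation and the sample size. The sketch below checks both identities with the numbers reported above; the variable names are only illustrative.

```r
# Values copied from the output above
r <- 0.1092719   # cor(df$age, df$bmi)
n <- 1336 + 2    # residual degrees of freedom + 2 estimated coefficients

# R-squared of a one-predictor regression is the squared correlation
r_squared <- r^2
print(round(r_squared, 5))

# t statistic of the slope: r * sqrt(n - 2) / sqrt(1 - r^2)
t_slope <- r * sqrt(n - 2) / sqrt(1 - r^2)
print(round(t_slope, 3))
```

Both values agree with the summary (Multiple R-squared 0.01194, t value 4.018), so the correlation and the regression tell the same story: the relationship is detectable but explains very little of the variation in BMI.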
It is necessary to plot the residuals to check that they are roughly normally distributed and that their spread does not change with age.
hist(lm_age$residuals)
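The histogram suggests that the residuals are roughly bell-shaped and centered near zero, although the residual summary (minimum -13.8, maximum 23.5) points to a somewhat longer right tail. A histogram alone cannot show whether the variance stays constant as age grows, so the Q-Q plot and the residual plots are checked as well.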
qqnorm(lm_age$residuals)
qqline(lm_age$residuals)
plot(lm_age$residuals ~ df$age)  # residuals against the predictor (age), not the response
par(mfrow=c(2,2))
plot(lm_age)
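Alongside the diagnostic plots, one rough numeric check of constant variance is to compare the spread of the residuals across age bands. The sketch below does this on simulated stand-in data, not the Kaggle file: the data frame `sim`, the age range, and the band boundaries are all assumptions made for illustration.

```r
# Simulated stand-in data (an assumption, NOT the Kaggle insurance file)
set.seed(605)
n <- 1338
sim <- data.frame(age = sample(18:64, n, replace = TRUE))
sim$bmi <- 28.8 + 0.047 * sim$age + rnorm(n, sd = 6.06)

fit <- lm(bmi ~ age, data = sim)

# Group ages into four bands (boundaries are an arbitrary choice)
age_band <- cut(sim$age, breaks = c(17, 29, 39, 49, 64))

# Standard deviation of the residuals within each band;
# similar values across bands are consistent with constant variance
spread_by_band <- tapply(resid(fit), age_band, sd)
print(round(spread_by_band, 2))
```

With the real model, `resid(lm_age)` and `df$age` would take the place of the simulated values. This is a quick descriptive check, not a formal test of heteroscedasticity.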
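Taken together, the plots and the model summary point to a small positive association, not a negative one: BMI rises by about 0.047 per year of age. The slope is statistically significant (p = 6.19e-05), but age explains only about 1.2% of the variance in BMI (R-squared 0.01194), so there is no large fluctuation in BMI across age.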