Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.1 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.1.0
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.4 ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Data Set Summary:
The data set here is from looking at the medical cost per person. The columns are age, sex, body mass index(bmi), number of dependents, if the person is a smoker, residential area in the US by region, and individual medical costs. (https://www.kaggle.com/datasets/mirichoi0218/insurance?ref=hackernoon.com)
#reads a CSV file from a URL and stores it in a data object
data <- read.csv("https://raw.githubusercontent.com/rogercarelli/Medical_Cost_Personal_Dataset/main/raw_dataset.csv")
#summarizes the data
summary(data)
## age sex bmi children
## Min. :18.00 Length:1338 Min. :15.96 Min. :0.000
## 1st Qu.:27.00 Class :character 1st Qu.:26.30 1st Qu.:0.000
## Median :39.00 Mode :character Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## smoker region charges
## Length:1338 Length:1338 Min. : 1122
## Class :character Class :character 1st Qu.: 4740
## Mode :character Mode :character Median : 9382
## Mean :13270
## 3rd Qu.:16640
## Max. :63770
The binary variables are changed to a dummy variable of one or zero for the sex and smoker column. Also the region had dummy variables add to its unique values.
# Create a binary variable based on the values in the "sex" and "smoker" column
data$binarysex <- ifelse(data$sex == "female", 1, 0)
data$binarysmoke <- ifelse(data$smoker == "yes", 1, 0)
# Convert "region" column to a factor with numerical levels
data$region_num <- factor(data$region, levels = unique(data$region))
# Convert factor variable to numeric variable
data$region_num <- as.numeric(data$region_num)
head(data)
## age sex bmi children smoker region charges binarysex binarysmoke
## 1 19 female 27.900 0 yes southwest 16884.924 1 1
## 2 18 male 33.770 1 no southeast 1725.552 0 0
## 3 28 male 33.000 3 no southeast 4449.462 0 0
## 4 33 male 22.705 0 no northwest 21984.471 0 0
## 5 32 male 28.880 0 no northwest 3866.855 0 0
## 6 31 female 25.740 0 no southeast 3756.622 1 0
## region_num
## 1 1
## 2 2
## 3 2
## 4 3
## 5 3
## 6 2
#The quadratic variable
quad<-data$bmi^2
#dichotomous vs. quantitative interaction
divquantsex <- data$binarysex * data$charges
divquantsmoke <- data$binarysmoke * data$charges
#multiple regression model
data_lm <- lm(charges~ bmi+binarysex+binarysmoke+quad+divquantsex+divquantsmoke, data = data)
summary(data_lm)
##
## Call:
## lm(formula = charges ~ bmi + binarysex + binarysmoke + quad +
## divquantsex + divquantsmoke, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9490.5 -3191.8 -277.1 2641.4 22964.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.435e+02 2.620e+03 -0.284 0.776626
## bmi 5.681e+02 1.667e+02 3.408 0.000674 ***
## binarysex -3.934e+03 3.796e+02 -10.365 < 2e-16 ***
## binarysmoke -6.624e+03 9.391e+02 -7.054 2.79e-12 ***
## quad -7.923e+00 2.611e+00 -3.034 0.002460 **
## divquantsex 3.373e-01 2.030e-02 16.615 < 2e-16 ***
## divquantsmoke 8.445e-01 2.839e-02 29.750 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4840 on 1331 degrees of freedom
## Multiple R-squared: 0.841, Adjusted R-squared: 0.8403
## F-statistic: 1173 on 6 and 1331 DF, p-value: < 2.2e-16
Trying to improve on the model by added variables.
#Added Age
data_lm <- lm(charges~ bmi+age+binarysex+binarysmoke+quad+divquantsex+divquantsmoke, data = data)
summary(data_lm)
##
## Call:
## lm(formula = charges ~ bmi + age + binarysex + binarysmoke +
## quad + divquantsex + divquantsmoke, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8877.9 -2001.9 -745.9 327.8 22317.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.962e+03 2.235e+03 -2.220 0.0266 *
## bmi 3.656e+02 1.420e+02 2.574 0.0102 *
## age 1.899e+02 8.400e+00 22.609 < 2e-16 ***
## binarysex -2.784e+03 3.267e+02 -8.520 < 2e-16 ***
## binarysmoke -4.318e+03 8.049e+02 -5.365 9.55e-08 ***
## quad -5.059e+00 2.224e+00 -2.275 0.0231 *
## divquantsex 2.398e-01 1.779e-02 13.481 < 2e-16 ***
## divquantsmoke 8.063e-01 2.420e-02 33.325 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4115 on 1330 degrees of freedom
## Multiple R-squared: 0.8851, Adjusted R-squared: 0.8845
## F-statistic: 1464 on 7 and 1330 DF, p-value: < 2.2e-16
#Added Children
data_lm <- lm(charges~ bmi+age+children+binarysex+binarysmoke+quad+divquantsex+divquantsmoke, data = data)
summary(data_lm)
##
## Call:
## lm(formula = charges ~ bmi + age + children + binarysex + binarysmoke +
## quad + divquantsex + divquantsmoke, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9042.2 -2135.4 -784.4 280.2 22318.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.374e+03 2.221e+03 -2.419 0.01569 *
## bmi 3.654e+02 1.410e+02 2.591 0.00967 **
## age 1.887e+02 8.345e+00 22.615 < 2e-16 ***
## children 4.158e+02 9.285e+01 4.478 8.18e-06 ***
## binarysex -2.734e+03 3.246e+02 -8.422 < 2e-16 ***
## binarysmoke -4.309e+03 7.992e+02 -5.392 8.24e-08 ***
## quad -5.061e+00 2.208e+00 -2.292 0.02204 *
## divquantsex 2.373e-01 1.767e-02 13.429 < 2e-16 ***
## divquantsmoke 8.065e-01 2.402e-02 33.571 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4086 on 1329 degrees of freedom
## Multiple R-squared: 0.8868, Adjusted R-squared: 0.8862
## F-statistic: 1302 on 8 and 1329 DF, p-value: < 2.2e-16
#Adding region
data_lm <- lm(charges~ bmi+region_num+age+binarysex+binarysmoke+quad+divquantsex+divquantsmoke, data = data)
summary(data_lm)
##
## Call:
## lm(formula = charges ~ bmi + region_num + age + binarysex + binarysmoke +
## quad + divquantsex + divquantsmoke, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8773.4 -1926.2 -790.1 356.4 22386.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.005e+03 2.263e+03 -2.653 0.00807 **
## bmi 3.803e+02 1.418e+02 2.683 0.00740 **
## region_num 2.780e+02 1.031e+02 2.697 0.00708 **
## age 1.898e+02 8.380e+00 22.652 < 2e-16 ***
## binarysex -2.750e+03 3.262e+02 -8.431 < 2e-16 ***
## binarysmoke -4.318e+03 8.031e+02 -5.377 8.92e-08 ***
## quad -5.165e+00 2.219e+00 -2.328 0.02009 *
## divquantsex 2.374e-01 1.777e-02 13.360 < 2e-16 ***
## divquantsmoke 8.070e-01 2.414e-02 33.430 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4105 on 1329 degrees of freedom
## Multiple R-squared: 0.8858, Adjusted R-squared: 0.8851
## F-statistic: 1288 on 8 and 1329 DF, p-value: < 2.2e-16
Region lowers the R values so it could be left out of the model.
Overall the model has a very small p-value which leads to the conclusion that the regression model fits the data. Also, the R-squared coefficient indicates the accuracy of the model is approximately 89%.