Week 12 Discussion Topic

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.1     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.4     ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Data Set Summary:

The data set here is from looking at the medical cost per person. The columns are age, sex, body mass index(bmi), number of dependents, if the person is a smoker, residential area in the US by region, and individual medical costs. (https://www.kaggle.com/datasets/mirichoi0218/insurance?ref=hackernoon.com)

#reads a CSV file from a URL and stores it in a data object
data <- read.csv("https://raw.githubusercontent.com/rogercarelli/Medical_Cost_Personal_Dataset/main/raw_dataset.csv")
#summarizes the data
summary(data)
##       age            sex                 bmi           children    
##  Min.   :18.00   Length:1338        Min.   :15.96   Min.   :0.000  
##  1st Qu.:27.00   Class :character   1st Qu.:26.30   1st Qu.:0.000  
##  Median :39.00   Mode  :character   Median :30.40   Median :1.000  
##  Mean   :39.21                      Mean   :30.66   Mean   :1.095  
##  3rd Qu.:51.00                      3rd Qu.:34.69   3rd Qu.:2.000  
##  Max.   :64.00                      Max.   :53.13   Max.   :5.000  
##     smoker             region             charges     
##  Length:1338        Length:1338        Min.   : 1122  
##  Class :character   Class :character   1st Qu.: 4740  
##  Mode  :character   Mode  :character   Median : 9382  
##                                        Mean   :13270  
##                                        3rd Qu.:16640  
##                                        Max.   :63770

The binary variables are changed to a dummy variable of one or zero for the sex and smoker column. Also the region had dummy variables add to its unique values.

# Create a binary variable based on the values in the "sex" and "smoker" column
data$binarysex <- ifelse(data$sex == "female", 1, 0)
data$binarysmoke <- ifelse(data$smoker == "yes", 1, 0)

# Convert "region" column to a factor with numerical levels
data$region_num <- factor(data$region, levels = unique(data$region))

# Convert factor variable to numeric variable
data$region_num <- as.numeric(data$region_num)

head(data)
##   age    sex    bmi children smoker    region   charges binarysex binarysmoke
## 1  19 female 27.900        0    yes southwest 16884.924         1           1
## 2  18   male 33.770        1     no southeast  1725.552         0           0
## 3  28   male 33.000        3     no southeast  4449.462         0           0
## 4  33   male 22.705        0     no northwest 21984.471         0           0
## 5  32   male 28.880        0     no northwest  3866.855         0           0
## 6  31 female 25.740        0     no southeast  3756.622         1           0
##   region_num
## 1          1
## 2          2
## 3          2
## 4          3
## 5          3
## 6          2
#The quadratic variable
quad<-data$bmi^2

#dichotomous vs. quantitative interaction 
divquantsex <- data$binarysex * data$charges
divquantsmoke <- data$binarysmoke * data$charges

#multiple regression model
data_lm <- lm(charges~ bmi+binarysex+binarysmoke+quad+divquantsex+divquantsmoke, data = data)
summary(data_lm)
## 
## Call:
## lm(formula = charges ~ bmi + binarysex + binarysmoke + quad + 
##     divquantsex + divquantsmoke, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9490.5 -3191.8  -277.1  2641.4 22964.4 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -7.435e+02  2.620e+03  -0.284 0.776626    
## bmi            5.681e+02  1.667e+02   3.408 0.000674 ***
## binarysex     -3.934e+03  3.796e+02 -10.365  < 2e-16 ***
## binarysmoke   -6.624e+03  9.391e+02  -7.054 2.79e-12 ***
## quad          -7.923e+00  2.611e+00  -3.034 0.002460 ** 
## divquantsex    3.373e-01  2.030e-02  16.615  < 2e-16 ***
## divquantsmoke  8.445e-01  2.839e-02  29.750  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4840 on 1331 degrees of freedom
## Multiple R-squared:  0.841,  Adjusted R-squared:  0.8403 
## F-statistic:  1173 on 6 and 1331 DF,  p-value: < 2.2e-16

Trying to improve on the model by added variables.

#Added Age 

data_lm <- lm(charges~ bmi+age+binarysex+binarysmoke+quad+divquantsex+divquantsmoke, data = data)
summary(data_lm)
## 
## Call:
## lm(formula = charges ~ bmi + age + binarysex + binarysmoke + 
##     quad + divquantsex + divquantsmoke, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8877.9 -2001.9  -745.9   327.8 22317.8 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -4.962e+03  2.235e+03  -2.220   0.0266 *  
## bmi            3.656e+02  1.420e+02   2.574   0.0102 *  
## age            1.899e+02  8.400e+00  22.609  < 2e-16 ***
## binarysex     -2.784e+03  3.267e+02  -8.520  < 2e-16 ***
## binarysmoke   -4.318e+03  8.049e+02  -5.365 9.55e-08 ***
## quad          -5.059e+00  2.224e+00  -2.275   0.0231 *  
## divquantsex    2.398e-01  1.779e-02  13.481  < 2e-16 ***
## divquantsmoke  8.063e-01  2.420e-02  33.325  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4115 on 1330 degrees of freedom
## Multiple R-squared:  0.8851, Adjusted R-squared:  0.8845 
## F-statistic:  1464 on 7 and 1330 DF,  p-value: < 2.2e-16
#Added Children

data_lm <- lm(charges~ bmi+age+children+binarysex+binarysmoke+quad+divquantsex+divquantsmoke, data = data)
summary(data_lm)
## 
## Call:
## lm(formula = charges ~ bmi + age + children + binarysex + binarysmoke + 
##     quad + divquantsex + divquantsmoke, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9042.2 -2135.4  -784.4   280.2 22318.2 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -5.374e+03  2.221e+03  -2.419  0.01569 *  
## bmi            3.654e+02  1.410e+02   2.591  0.00967 ** 
## age            1.887e+02  8.345e+00  22.615  < 2e-16 ***
## children       4.158e+02  9.285e+01   4.478 8.18e-06 ***
## binarysex     -2.734e+03  3.246e+02  -8.422  < 2e-16 ***
## binarysmoke   -4.309e+03  7.992e+02  -5.392 8.24e-08 ***
## quad          -5.061e+00  2.208e+00  -2.292  0.02204 *  
## divquantsex    2.373e-01  1.767e-02  13.429  < 2e-16 ***
## divquantsmoke  8.065e-01  2.402e-02  33.571  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4086 on 1329 degrees of freedom
## Multiple R-squared:  0.8868, Adjusted R-squared:  0.8862 
## F-statistic:  1302 on 8 and 1329 DF,  p-value: < 2.2e-16
#Adding region 

data_lm <- lm(charges~ bmi+region_num+age+binarysex+binarysmoke+quad+divquantsex+divquantsmoke, data = data)
summary(data_lm)
## 
## Call:
## lm(formula = charges ~ bmi + region_num + age + binarysex + binarysmoke + 
##     quad + divquantsex + divquantsmoke, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8773.4 -1926.2  -790.1   356.4 22386.7 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6.005e+03  2.263e+03  -2.653  0.00807 ** 
## bmi            3.803e+02  1.418e+02   2.683  0.00740 ** 
## region_num     2.780e+02  1.031e+02   2.697  0.00708 ** 
## age            1.898e+02  8.380e+00  22.652  < 2e-16 ***
## binarysex     -2.750e+03  3.262e+02  -8.431  < 2e-16 ***
## binarysmoke   -4.318e+03  8.031e+02  -5.377 8.92e-08 ***
## quad          -5.165e+00  2.219e+00  -2.328  0.02009 *  
## divquantsex    2.374e-01  1.777e-02  13.360  < 2e-16 ***
## divquantsmoke  8.070e-01  2.414e-02  33.430  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4105 on 1329 degrees of freedom
## Multiple R-squared:  0.8858, Adjusted R-squared:  0.8851 
## F-statistic:  1288 on 8 and 1329 DF,  p-value: < 2.2e-16

Region lowers the R values so it could be left out of the model.

Overall the model has a very small p-value which leads to the conclusion that the regression model fits the data. Also, the R-squared coefficient indicates the accuracy of the model is approximately 89%.