Linear regression is a statistical method for modeling the relationship between an outcome and one or more predictor variables when that relationship is approximately linear. It estimates the coefficients of an equation that describes how the outcome changes with each predictor. In marketing science, it’s a valuable tool because we can evaluate the impact of a single variable (simple linear regression) or several variables (multiple regression) on a desired outcome. For example, a software company that wants to predict user engagement with its software could use demographic measures like age and income, psychographic measures like willingness to try new things, or behavioral measures like how often users use other software applications. Let’s use an open source data set to implement regression.
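Before turning to the data, it may help to write the model out. The notation below is introduced here for illustration and is not taken from the original analysis: $y_i$ is the outcome for observation $i$, $x_{i1}, \dots, x_{ip}$ are its $p$ predictors, the $\beta$ terms are the coefficients to be estimated, and $\varepsilon_i$ is the error term. With $p = 1$ this is simple linear regression; with $p > 1$ it is multiple regression. Ordinary least squares, which is what R's `lm()` uses, picks the coefficients that minimize the sum of squared residuals.

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i

\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n}
  \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2
```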
library(tidyverse)  # dplyr, purrr, ggplot2
library(knitr)      # kable()
data <- read.csv("g:/Portfolio Projects/Regression/Data/insurance.csv")
kable(head(data, 5), caption = "Examine first few rows")
| age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|
| 19 | female | 27.900 | 0 | yes | southwest | 16884.924 |
| 18 | male | 33.770 | 1 | no | southeast | 1725.552 |
| 28 | male | 33.000 | 3 | no | southeast | 4449.462 |
| 33 | male | 22.705 | 0 | no | northwest | 21984.471 |
| 32 | male | 28.880 | 0 | no | northwest | 3866.855 |
# check variables for missing values
missing_vals <- data %>% map_lgl(anyNA)
missings <- names(which(missing_vals))
missings
## character(0)
# no missing values
str(data)
## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : chr "female" "male" "male" "male" ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : chr "yes" "no" "no" "no" ...
## $ region : chr "southwest" "southeast" "southeast" "northwest" ...
## $ charges : num 16885 1726 4449 21984 3867 ...
# let's convert the character variables to factor
data <- data %>%
  mutate(across(where(is.character), as.factor))
# make sure the conversions worked as expected
str(data)
## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
## $ region : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
## $ charges : num 16885 1726 4449 21984 3867 ...
# now the character variables are factors, perfect
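Converting to factors matters because `lm()` expands each factor into 0/1 indicator columns, with the first level acting as the reference category. The sketch below uses a small hypothetical data frame (`toy` is not part of the original analysis) to show which indicator columns would be created for `region`.

```r
# Hypothetical four-row data frame; the region labels mirror the insurance data.
toy <- data.frame(
  region = factor(c("southwest", "southeast", "northwest", "northeast"))
)

# factor() sorts levels alphabetically, so "northeast" becomes the reference level.
levels(toy$region)

# model.matrix() shows the indicator columns that lm() would build from the factor.
design <- model.matrix(~ region, data = toy)
colnames(design)
```

Each non-reference level gets its own column, so a coefficient on one of them is read as a difference from the reference level rather than as an absolute amount.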
ggplot(data = data, aes(x = bmi, y = charges, color = age)) +
  geom_point()
From the plot above, it’s fair to hypothesize that age and BMI each play at least a small role in the amount charged for insurance. Let’s test this by fitting a simple linear regression for each variable.
age.lm<- lm(charges ~ age, data = data)
summary(age.lm)
##
## Call:
## lm(formula = charges ~ age, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8059 -6671 -5939 5440 47829
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3165.9 937.1 3.378 0.000751 ***
## age 257.7 22.5 11.453 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11560 on 1336 degrees of freedom
## Multiple R-squared: 0.08941, Adjusted R-squared: 0.08872
## F-statistic: 131.2 on 1 and 1336 DF, p-value: < 0.00000000000000022
bmi.lm<- lm(charges ~ bmi, data = data)
summary(bmi.lm)
##
## Call:
## lm(formula = charges ~ bmi, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20956 -8118 -3757 4722 49442
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1192.94 1664.80 0.717 0.474
## bmi 393.87 53.25 7.397 0.000000000000246 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11870 on 1336 degrees of freedom
## Multiple R-squared: 0.03934, Adjusted R-squared: 0.03862
## F-statistic: 54.71 on 1 and 1336 DF, p-value: 0.0000000000002459
Both predictors are statistically significant, which supports our hypothesis, but the share of variance in insurance charges that each explains on its own is small: about 8.9% for age and 3.9% for BMI (adjusted R-squared of 0.08872 and 0.03862). Let’s take a different approach to explaining the variability in insurance charges.
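For readers wondering how the adjusted figure relates to the plain R-squared, here is a minimal sketch of the standard adjustment formula. The helper `adjusted_r2()` is a hypothetical name introduced for illustration; the inputs are the rounded values printed in the two summaries, so the results should land close to, but not exactly on, the reported numbers.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p.
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 1338  # observations reported by str(data)

adj_age <- adjusted_r2(0.08941, n, p = 1)  # Multiple R-squared of age.lm
adj_bmi <- adjusted_r2(0.03934, n, p = 1)  # Multiple R-squared of bmi.lm

round(c(age = adj_age, bmi = adj_bmi), 4)
```

With a single predictor and more than a thousand observations the penalty is tiny, which is why the adjusted and unadjusted values barely differ here.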
Typically, an outcome such as insurance charges cannot be adequately explained by a single variable. Instead, a linear combination of several variables may provide a more complete explanation.
# multiple regression
multiple.lm<- lm(charges ~ ., data = data)
summary(multiple.lm)
##
## Call:
## lm(formula = charges ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11304.9 -2848.1 -982.1 1393.9 29992.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11938.5 987.8 -12.086 < 0.0000000000000002 ***
## age 256.9 11.9 21.587 < 0.0000000000000002 ***
## sexmale -131.3 332.9 -0.394 0.693348
## bmi 339.2 28.6 11.860 < 0.0000000000000002 ***
## children 475.5 137.8 3.451 0.000577 ***
## smokeryes 23848.5 413.1 57.723 < 0.0000000000000002 ***
## regionnorthwest -353.0 476.3 -0.741 0.458769
## regionsoutheast -1035.0 478.7 -2.162 0.030782 *
## regionsouthwest -960.0 477.9 -2.009 0.044765 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
## F-statistic: 500.8 on 8 and 1329 DF, p-value: < 0.00000000000000022
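Holding the other predictors constant, each additional year of age is associated with an increase of about 257 dollars in charges, and each additional unit of BMI with an increase of about 339 dollars. Each additional child adds a predicted 476 dollars or so. Smokers are charged about 23,849 dollars more than non-smokers, by far the largest effect in the model. Sex is not a significant predictor (p = 0.69), and the regional differences are comparatively small.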
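To make the coefficients concrete, the sketch below plugs them into the regression equation by hand for a hypothetical policyholder. The coefficient values are copied (rounded) from the multiple regression summary; the person, the `coefs` vector, and the `predict_charges()` helper are illustrative assumptions. In practice `predict(multiple.lm, newdata = ...)` would do the same job from the fitted model.

```r
# Coefficients copied (rounded) from the multiple regression summary.
coefs <- c(intercept = -11938.5, age = 256.9, bmi = 339.2,
           children = 475.5, smokeryes = 23848.5)

# Hypothetical policyholder: female and living in the northeast. Both are
# reference levels, so the sex and region terms contribute nothing.
predict_charges <- function(age, bmi, children, smoker) {
  unname(coefs["intercept"] + coefs["age"] * age + coefs["bmi"] * bmi +
    coefs["children"] * children + coefs["smokeryes"] * as.numeric(smoker))
}

non_smoker <- predict_charges(age = 40, bmi = 30, children = 2, smoker = FALSE)
smoker     <- predict_charges(age = 40, bmi = 30, children = 2, smoker = TRUE)

c(non_smoker = non_smoker, smoker = smoker)
```

The two predictions differ by exactly the `smokeryes` coefficient, which is what "holding the other predictors constant" means in practice.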
The overall model is highly significant, with an F-statistic of 500.8 and a p-value below 2.2e-16. The predictors explain about 75% of the variability in insurance charges. Where’s the other 25%? It isn’t explained by any of the variables we used as predictors, so this model can’t tell us what drives the rest of the variance in insurance charges.
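The F-statistic and the R-squared are two views of the same split between explained and unexplained variance. As a rough check, the sketch below rebuilds the F-statistic from the rounded R-squared in the summary; the variable names are introduced here for illustration, and because the input is rounded the result only approximates the reported 500.8.

```r
r2 <- 0.7509   # Multiple R-squared from the multiple regression summary
n  <- 1338     # observations
p  <- 8        # estimated slopes (excluding the intercept)

# F = (explained variance per predictor) / (unexplained variance per residual df)
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

# Upper-tail probability of the F distribution with (p, n - p - 1) degrees of freedom
p_value <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

round(f_stat, 1)
1 - r2  # share of variance left unexplained by the model
```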
Regression is a powerful tool in marketing science: it enables marketers to make data-driven decisions, optimize resource allocation, and enhance overall marketing effectiveness.