R Markdown

Regression is a statistical method for modeling the relationship between an outcome and one or more predictor variables. Linear regression assumes that relationship is linear and estimates the coefficients of an equation that describes it. In marketing science, it’s a valuable tool because we can evaluate the impact of a single variable (simple linear regression) or several variables (multiple regression) on a desired outcome. For example, a software company that wants to predict user engagement with its software could use demographic measures like age and income, psychographic measures like willingness to try new things, or behavioral measures like how often users run other software applications. Let’s use an open source data set to implement regression.
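As a sketch of the equation being estimated, a regression with p predictors is conventionally written as below. The notation (y for the outcome, x_j for the predictors, β_j for the coefficients, ε for the error term) is the usual textbook one and is an assumption here, not something taken from the data set.

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon
```

With p = 1 this is simple linear regression; with p > 1 it is multiple regression. R's lm() estimates the β coefficients by least squares.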

Read in the data

library(tidyverse)  # provides %>%, map(), mutate(), across() and ggplot()
library(knitr)      # provides kable()
data<- read.csv("g:/Portfolio Projects/Regression/Data/insurance.csv")
kable(head(data, 5), caption = "Examine first few rows")

Examine first few rows
age	sex	bmi	children	smoker	region	charges
19	female	27.900	0	yes	southwest	16884.924
18	male	33.770	1	no	southeast	1725.552
28	male	33.000	3	no	southeast	4449.462
33	male	22.705	0	no	northwest	21984.471
32	male	28.880	0	no	northwest	3866.855

Are there missing values in the data?

# check variables for missing values (map_lgl returns a named logical vector)
missing_vals<- data %>% map_lgl(anyNA)
missings<- names(which(missing_vals))
missings

## character(0)

# no missing values

Let’s check out the data

str(data)

## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : chr  "female" "male" "male" "male" ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : chr  "yes" "no" "no" "no" ...
##  $ region  : chr  "southwest" "southeast" "southeast" "northwest" ...
##  $ charges : num  16885 1726 4449 21984 3867 ...

# let's convert the character variables to factor
data<- data %>% 
  mutate(across(where(is.character), as.factor))

# make sure the conversions worked as expected
str(data)

## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
##  $ region  : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
##  $ charges : num  16885 1726 4449 21984 3867 ...

# now the character variables are factors, perfect

Let’s visualize the data to see if there are any linear relationships

ggplot(data = data, aes(x = bmi, y = charges, color = age)) +
  geom_point()

From the plot above, it’s reasonable to hypothesize that age and bmi each play at least a small role in the amount charged for insurance. Let’s test this by fitting a simple linear regression for each variable.

age.lm<- lm(charges ~ age, data = data)
summary(age.lm)

## 
## Call:
## lm(formula = charges ~ age, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8059  -6671  -5939   5440  47829 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   3165.9      937.1   3.378             0.000751 ***
## age            257.7       22.5  11.453 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11560 on 1336 degrees of freedom
## Multiple R-squared:  0.08941,    Adjusted R-squared:  0.08872 
## F-statistic: 131.2 on 1 and 1336 DF,  p-value: < 0.00000000000000022

bmi.lm<- lm(charges ~ bmi, data = data)
summary(bmi.lm)

## 
## Call:
## lm(formula = charges ~ bmi, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -20956  -8118  -3757   4722  49442 
## 
## Coefficients:
##             Estimate Std. Error t value          Pr(>|t|)    
## (Intercept)  1192.94    1664.80   0.717             0.474    
## bmi           393.87      53.25   7.397 0.000000000000246 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11870 on 1336 degrees of freedom
## Multiple R-squared:  0.03934,    Adjusted R-squared:  0.03862 
## F-statistic: 54.71 on 1 and 1336 DF,  p-value: 0.0000000000002459

Both predictors are statistically significant, which supports our hypothesis, but each explains only a small share of the variance in insurance charges on its own: about 8.9% for age and 3.9% for bmi (adjusted R-squared). Let’s take a different approach to explaining the variability in insurance charges.
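For reference, the two R-squared figures reported by summary() are conventionally defined as below, where n is the number of observations and p the number of predictors (these symbols are assumed here and do not appear in the output). Plugging in the age model, with n = 1338 and p = 1, reproduces the reported adjusted value.

```latex
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},
\qquad
R^2_{\mathrm{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}

R^2_{\mathrm{adj}}(\text{age}) = 1 - (1 - 0.08941)\,\frac{1337}{1336} \approx 0.0887
```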

Multiple regression

Typically, outcomes, such as insurance charges, cannot be adequately explained by a single variable. Instead, a linear combination of variables may provide a more comprehensive explanation.

# multiple regression
multiple.lm<- lm(charges ~ ., data = data)
summary(multiple.lm)

## 
## Call:
## lm(formula = charges ~ ., data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11304.9  -2848.1   -982.1   1393.9  29992.8 
## 
## Coefficients:
##                 Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)     -11938.5      987.8 -12.086 < 0.0000000000000002 ***
## age                256.9       11.9  21.587 < 0.0000000000000002 ***
## sexmale           -131.3      332.9  -0.394             0.693348    
## bmi                339.2       28.6  11.860 < 0.0000000000000002 ***
## children           475.5      137.8   3.451             0.000577 ***
## smokeryes        23848.5      413.1  57.723 < 0.0000000000000002 ***
## regionnorthwest   -353.0      476.3  -0.741             0.458769    
## regionsoutheast  -1035.0      478.7  -2.162             0.030782 *  
## regionsouthwest   -960.0      477.9  -2.009             0.044765 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494 
## F-statistic: 500.8 on 8 and 1329 DF,  p-value: < 0.00000000000000022
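The coefficient names such as regionnorthwest and smokeryes come from how lm() encodes factors: each factor level other than the first becomes a 0/1 indicator column, and the first level (northeast, female, no) is the baseline. A minimal sketch of that encoding, using a hypothetical toy data frame rather than the insurance data:

```r
# Toy data frame (hypothetical, not the insurance data) with the same four region labels
toy <- data.frame(
  region = factor(c("northeast", "northwest", "southeast", "southwest"))
)

# model.matrix() shows the design matrix lm() would build from this factor
design <- model.matrix(~ region, data = toy)

# One intercept column plus one indicator per non-baseline level
colnames(design)
```

Because northeast is the baseline, the three region coefficients in the summary are differences in predicted charges relative to the northeast, not absolute amounts.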

Conclusion

Several variables are significantly associated with insurance charges. Let’s examine the most notable…

Holding the other variables constant, each additional year of age is associated with an increase of about 257 dollars in charges, and each additional unit of BMI with an increase of about 339 dollars. Each additional child adds a predicted 476 dollars or so. Smoking has by far the largest effect: smokers are predicted to be charged about 23,849 dollars more than otherwise comparable non-smokers.
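To make the coefficients concrete, here is a rough worked prediction for the first row of the data (age 19, female, bmi 27.9, no children, smoker, southwest), using the rounded estimates from the summary above. The variable names are illustrative only, and the result is approximate because the coefficients are rounded.

```r
# Rounded coefficient estimates copied from the multiple regression summary
intercept   <- -11938.5
b_age       <- 256.9
b_bmi       <- 339.2
b_children  <- 475.5
b_smoker    <- 23848.5
b_southwest <- -960.0

# First row of the data: age 19, female (baseline), bmi 27.9,
# 0 children, smoker, southwest region
predicted <- intercept + b_age * 19 + b_bmi * 27.9 + b_children * 0 +
  b_smoker * 1 + b_southwest * 1
round(predicted, 2)
# 25294.78
```

The observed charge for that row is 16884.92, so the model overpredicts by roughly 8,400 dollars here, which is in line with a residual standard error of 6062. With the fitted model in memory, predict(multiple.lm, newdata = data[1, ]) should give the unrounded equivalent.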

The overall model is highly significant, with an F-statistic of 500.8 and a p-value below 0.00000000000000022. The predictors explain about 75% of the variability in insurance charges (adjusted R-squared of 0.7494). What about the other 25%? It is not explained by any of the variables we used as predictors, so this analysis can’t tell us what drives the rest of the variance in insurance charges.

Regression is a powerful tool used in marketing science and enables marketers to make data-driven decisions, optimize resource allocation, and enhance overall marketing effectiveness.