Agenda

  1. Simple Linear Regression & Multiple Linear Regression

  2. Importing Data

  3. Exploring and Preparing the Data

    1. Exploring relationships among features with correlation

    2. Visualizing relationships among features - scatterplot matrix

  4. Training a Model

  5. Evaluating Model Performance

Linear Regression

  • Linear regression is a statistical technique to represent relationships between two or more variables using a linear equation.

    • Simple linear regression: has only one predictor variable [dependent variable is continuous]

    • Multiple linear regression: has multiple predictor variables [dependent variable is continuous]

    • Logistic regression: used to model a binary categorical outcome

  • You can use linear regression to predict an outcome (the dependent variable, or target variable) given some inputs (the independent variables, or predictors).
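
In its general form, multiple linear regression models the outcome as a weighted sum of the predictors plus an error term:

y = β0 + β1*x1 + β2*x2 + ... + βk*xk + ε

Here y is the dependent variable, x1 through xk are the predictors, β0 is the intercept, β1 through βk are coefficients estimated from the data, and ε is the error term. Simple linear regression is the special case with a single predictor.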

Multiple Linear Regression

Most real-world analyses have more than one independent variable.

Therefore, it is likely that you will be using multiple linear regression for most numeric prediction tasks.

Strengths

  • By far the most common approach for modeling numeric data

  • Can be adapted to almost any modeling task

  • Provides estimates of both the strength and size of the relationships between the features and the outcome

Weaknesses

  • Makes strong assumptions about the data (i.e., linearity, independence, normality, and equal variance)

  • The model’s form must be specified by the user in advance

  • Does not handle missing data

  • Only works with numeric features, so categorical data requires extra processing (see the short example after this list)

  • Requires some knowledge of statistics to understand the model
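
As a quick illustration of that extra processing, here is a minimal sketch using hypothetical toy data: a character column is converted to a factor so the regression function can dummy-code it (R's lm() also performs this conversion automatically for character columns).

# toy data (hypothetical): one numeric outcome and one categorical feature
toy <- data.frame(
  expense = c(100, 250, 180, 90),
  smoker  = c("yes", "no", "yes", "no")    # character column
)
toy$smoker <- factor(toy$smoker)           # convert to a factor explicitly
lm(expense ~ smoker, data = toy)           # smoker is dummy-coded as smokeryes

Converting explicitly also lets you control the reference level with relevel() before fitting.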

Case Study - Predicting Medical Expenses

Health insurance companies need to ensure that the yearly premiums they collect exceed the medical costs they pay out for their beneficiaries. As a result, insurers need to develop models that accurately forecast medical expenses for the insured population.

Although this is a challenging task, some conditions are more prevalent in certain segments of the population. For instance, lung cancer is more likely among smokers than non-smokers, and heart disease may be more likely among the obese.

Objective: To use patient data to estimate the average medical care expenses. These estimates can be used to create actuarial tables that set the price of yearly premiums higher or lower, depending on the expected treatment costs.

Import Data

We will use a simulated dataset containing hypothetical medical expenses for patients in the United States, based on demographic statistics from the US Census Bureau, so it approximately reflects real-world conditions.

First, load the data insurance.csv from the Canvas module Sample Dataset, or click this link: insurance.csv

insurance <- read.csv("insurance.csv")

Explore Dataset

The insurance.csv file includes 1,338 examples of beneficiaries currently enrolled in the insurance plan, with features indicating characteristics of the patient as well as the total medical expenses charged to the plan for the calendar year.

It is important to give some thought to how these variables may be related to billed medical expenses.

str(insurance)
## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : chr  "female" "male" "male" "male" ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 25.7 33.4 27.7 29.8 25.8 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : chr  "yes" "no" "no" "no" ...
##  $ region  : chr  "southwest" "southeast" "southeast" "northwest" ...
##  $ expenses: num  16885 1726 4449 21984 3867 ...

Explore Dataset

Our model’s dependent variable is expenses, which measures the medical costs each person charged to the insurance plan for the year. Prior to building a regression model, it is often helpful to check whether it follows a normal distribution.

# summarize the expenses variable
summary(insurance$expenses)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1122    4740    9382   13270   16640   63770

# histogram of insurance expenses
hist(insurance$expenses)

# table of region
table(insurance$region)
## 
## northeast northwest southeast southwest 
##       324       325       364       325

Exploring Relationships among Features - Correlation Matrix

A correlation matrix provides a quick overview of the relationships between the independent variables and the dependent variable, as well as among the independent variables themselves.

# exploring relationships among features: correlation matrix
cor(insurance[c("age", "bmi", "children", "expenses")])
##                age        bmi   children   expenses
## age      1.0000000 0.10934101 0.04246900 0.29900819
## bmi      0.1093410 1.00000000 0.01264471 0.19857626
## children 0.0424690 0.01264471 1.00000000 0.06799823
## expenses 0.2990082 0.19857626 0.06799823 1.00000000

Visualizing Relationships - Scatterplot Matrix

It can also be helpful to visualize the relationships among numeric features by using a scatterplot matrix, a collection of scatterplots arranged in a grid. It can be used to detect patterns among multiple variables.

# visualizing relationships among features: scatterplot matrix
pairs(insurance[c("age","bmi","children","expenses")])

Do you notice any patterns in these plots? Although some look like random clouds of points, a few seem to display some trends.

More Informative Scatterplot Matrix

# more informative scatterplot matrix
#install.packages("psych")
library(psych)
pairs.panels(insurance[c("age","bmi","children","expenses")])

Interpretation of Visualization

  • Above the diagonal, the scatterplots have been replaced with the correlation coefficients. On the diagonal, histograms depict the distribution of each variable's values.

  • The oval-shaped object on each scatterplot is a correlation ellipse. The more stretched the ellipse, the stronger the correlation.

  • The curve drawn on each scatterplot is called a loess curve. It indicates the general relationship between the x-axis and y-axis variables.
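
To see what a single panel of the matrix conveys, here is a minimal sketch using base R; it plots one pair of variables with a smoothed trend line (lowess() is used here for illustration; pairs.panels() draws its own smoother) and prints the corresponding correlation.

# one scatterplot with a smoothed trend line, similar to a single panel
plot(insurance$age, insurance$expenses, xlab = "age", ylab = "expenses")
lines(lowess(insurance$age, insurance$expenses), col = "red", lwd = 2)
cor(insurance$age, insurance$expenses)   # the correlation shown above the diagonal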

Train a Multiple Regression Model

The following command fits a linear regression model relating the six independent variables to the total medical expenses.

## Training a model on the data ----
ins_model <- lm(expenses ~ age + children + bmi + 
                  sex + smoker + region,
                data = insurance)
# an equivalent way to specify the same model
ins_model <- lm(expenses ~ ., data = insurance)
# see the estimated beta coefficients
ins_model
## 
## Call:
## lm(formula = expenses ~ ., data = insurance)
## 
## Coefficients:
##     (Intercept)              age          sexmale              bmi  
##        -11941.6            256.8           -131.4            339.3  
##        children        smokeryes  regionnorthwest  regionsoutheast  
##           475.7          23847.5           -352.8          -1035.6  
## regionsouthwest  
##          -959.3

Also, the lm() function automatically applied a technique known as dummy coding to each of the factor-type (or character) variables we included in the model.
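
To see the dummy coding explicitly, here is a minimal sketch: model.matrix() builds the same design matrix that lm() uses internally, and relevel() (which requires an explicit factor) changes the reference category. The copy is made only so that the later models in this lecture are unaffected.

# inspect the dummy-coded design matrix lm() builds internally
head(model.matrix(expenses ~ ., data = insurance))

# choose a different reference category on a copy of the data
ins_copy <- insurance
ins_copy$region <- relevel(factor(ins_copy$region), ref = "southeast")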

Interpretation of Model Coefficients

The beta coefficients indicate the estimated change in expenses for a one-unit increase in each feature, assuming all other values are held constant. For instance, for each additional year of age, we would expect $256.80 higher medical expenses on average, assuming everything else is equal.
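
A minimal sketch of working with the coefficients directly (coef() returns the estimates as a named vector):

# extract the estimated beta coefficients as a named vector
coef(ins_model)

# e.g., the expected difference in expenses for someone 10 years older,
# holding all other features constant
coef(ins_model)["age"] * 10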

Questions: How do you interpret other coefficients? Which one is the most important factor?

Evaluate Model Performance

The parameter estimates we obtained by typing ins_model tell us how the independent variables are related to the dependent variable, but we also need to know how well the model fits our data. We can use the summary() function.

summary(ins_model)
## 
## Call:
## lm(formula = expenses ~ ., data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11302.7  -2850.9   -979.6   1383.9  29981.7 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11941.6      987.8 -12.089  < 2e-16 ***
## age                256.8       11.9  21.586  < 2e-16 ***
## sexmale           -131.3      332.9  -0.395 0.693255    
## bmi                339.3       28.6  11.864  < 2e-16 ***
## children           475.7      137.8   3.452 0.000574 ***
## smokeryes        23847.5      413.1  57.723  < 2e-16 ***
## regionnorthwest   -352.8      476.3  -0.741 0.458976    
## regionsoutheast  -1035.6      478.7  -2.163 0.030685 *  
## regionsouthwest   -959.3      477.9  -2.007 0.044921 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494 
## F-statistic: 500.9 on 8 and 1329 DF,  p-value: < 2.2e-16

Interpretation of Model Performance

  • The Residuals section provides summary statistics for the errors in our predictions, where

Residual = true value - predicted value

  • P-value: the probability of obtaining a coefficient estimate at least as extreme as the observed one if the true coefficient were zero. A small p-value suggests the feature is unlikely to have no relationship with the outcome.

  • R-squared (the coefficient of determination) provides a measure of how well our model as a whole explains the values of the dependent variable.

  • Adjusted R-squared: because models with more features always explain more variation, the adjusted R-squared value corrects R-squared by penalizing models with a large number of independent variables. It is useful for comparing models with different numbers of predictors.
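
A minimal sketch that recomputes two of these quantities by hand as a check on the summary() output, using the model and data from this case study:

# residuals: true value minus predicted value
res <- insurance$expenses - predict(ins_model, insurance)
summary(res)   # compare with the Residuals section above

# R-squared: proportion of variance in expenses explained by the model
1 - sum(res^2) / sum((insurance$expenses - mean(insurance$expenses))^2)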

Improve Model Performance (Optional)

We can improve the model’s performance by incorporating factors that the basic model neglects. The relationship between a predictor and the outcome is also not necessarily linear. For example, we can:

  • add a non-linear term for age

  • add an indicator for obesity using BMI

  • specify an interaction term between obesity and smoking

# add a higher-order "age" term
insurance$age2 <- insurance$age^2
# add an indicator for BMI >= 30
insurance$bmi30 <- ifelse(insurance$bmi >= 30, 1, 0)

Create a Final Model

# create final model
ins_model2 <- lm(expenses ~ age + age2 + children + bmi +
                   sex + bmi30*smoker + region,
                 data = insurance)
summary(ins_model2)
## 
## Call:
## lm(formula = expenses ~ age + age2 + children + bmi + sex + bmi30 * 
##     smoker + region, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17297.1  -1656.0  -1262.7   -727.8  24161.6 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       139.0053  1363.1359   0.102 0.918792    
## age               -32.6181    59.8250  -0.545 0.585690    
## age2                3.7307     0.7463   4.999 6.54e-07 ***
## children          678.6017   105.8855   6.409 2.03e-10 ***
## bmi               119.7715    34.2796   3.494 0.000492 ***
## sexmale          -496.7690   244.3713  -2.033 0.042267 *  
## bmi30            -997.9355   422.9607  -2.359 0.018449 *  
## smokeryes       13404.5952   439.9591  30.468  < 2e-16 ***
## regionnorthwest  -279.1661   349.2826  -0.799 0.424285    
## regionsoutheast  -828.0345   351.6484  -2.355 0.018682 *  
## regionsouthwest -1222.1619   350.5314  -3.487 0.000505 ***
## bmi30:smokeryes 19810.1534   604.6769  32.762  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4445 on 1326 degrees of freedom
## Multiple R-squared:  0.8664, Adjusted R-squared:  0.8653 
## F-statistic: 781.7 on 11 and 1326 DF,  p-value: < 2.2e-16

Since R-squared and adjusted R-squared both increase (from about 0.75 to about 0.87), we can conclude that this model fits the data better.

Model Prediction

Now we can plug any values of the predictors (independent variables) into the model to predict the expected medical cost.

predict(ins_model2,
        data.frame(age = 30, age2 = 30^2, children = 2,
                   bmi = 30, sex = "male", bmi30 = 1,
                   smoker = "no", region = "northeast"))
##        1 
## 5973.774
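
As a usage example, here is a minimal sketch that scores the same hypothetical profile as a non-smoker and as a smoker (all other values held fixed), which highlights how large the estimated smoking effect is:

# same 30-year-old male with BMI 30, predicted as a non-smoker and a smoker
newdata <- data.frame(age = 30, age2 = 30^2, children = 2,
                      bmi = 30, sex = "male", bmi30 = 1,
                      region = "northeast",
                      smoker = c("no", "yes"))
predict(ins_model2, newdata)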

Visualize Prediction Performance

# making predictions with the regression model
insurance$pred <- predict(ins_model2, insurance)

plot(insurance$pred, insurance$expenses)
abline(a = 0, b = 1, col = "red", lwd = 3, lty = 2)
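
The dashed red line marks perfect predictions (predicted equals actual). To complement the plot with a single number, a minimal sketch computing the correlation between predicted and actual expenses; values close to 1 indicate predictions that track the true expenses closely.

# correlation between predicted and actual expenses
cor(insurance$pred, insurance$expenses)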