Agenda

  1. Multiple Linear Regression

  2. Logistic Regression

  3. Training a Model

  4. Evaluating Model Performance

Linear Regression

  • Linear regression is a statistical technique to represent relationships between two or more variables using a linear equation.

    • Simple linear regression: only one predictor variable [dependent variable is continuous]

    • Multiple linear regression: multiple predictor variables [dependent variable is continuous]

    • Logistic regression: used to model a binary categorical outcome

  • You can use linear regression to [predict] an outcome (dependent variable, or target variable), given some [input] (independent variables or predictors).
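
As a quick preview of the R syntax used later in this tutorial, here is a minimal sketch of how each flavor is expressed with the formula interface (the data frame my_data and the variables y_num, y_bin, x1, x2 are placeholders for illustration only, not part of the case studies):

# placeholder data purely for illustration
my_data <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
my_data$y_num <- 2 * my_data$x1 + rnorm(100)   # continuous outcome
my_data$y_bin <- factor(my_data$y_num > 0)     # binary outcome

lm(y_num ~ x1, data = my_data)                           # simple linear regression
lm(y_num ~ x1 + x2, data = my_data)                      # multiple linear regression
glm(y_bin ~ x1 + x2, family = binomial, data = my_data)  # logistic regression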

Multiple Linear Regression

Most real-world analyses have more than one independent variable.

Therefore, it is likely that you will be using multiple linear regression for most numeric prediction tasks.

Strengths

  • By far the most common approach for modeling numeric data

  • Can be adapted to almost any modeling task

  • Provides estimates of both the strength and size of the relationships among features and the outcome

Weaknesses

  • Makes strong assumptions about the data (i.e., Linearity, Independence, Normality, Equal variance)

  • The model’s form must be specified by the user in advance

  • Does not handle missing data

  • Only works with numeric features, so categorical data requires extra processing

  • Requires some knowledge of statistics to understand the model

Case Study - Predicting Medical Expenses

Health insurance companies need to ensure that the yearly premiums they collect exceed the medical costs they pay out for their beneficiaries. As a result, insurers need to develop models that accurately forecast medical expenses for the insured population.

Although forecasting medical expenses is a challenging task, some conditions are more prevalent in certain segments of the population. For instance, lung cancer is more likely among smokers than non-smokers, and heart disease may be more likely among the obese.

Objective: To use patient data to estimate the average medical care expenses. These estimates can be used to create actuarial tables that set the price of yearly premiums higher or lower, depending on the expected treatment costs.

Import Data

We will use a simulated dataset containing hypothetical medical expenses for patients in the United States, based on the demographic statistics from the US Census Bureau. Thus, it approximately reflects real-world conditions.

First load the data insurance.csv

insurance <- read.csv("insurance.csv")

Explore Dataset

The insurance.csv file includes 1,338 examples of beneficiaries currently enrolled in the insurance plan, with features indicating characteristics of the patient as well as the total medical expenses charged to the plan for the calendar year.

It is important to give some thought to how these variables may be related to billed medical expenses.

str(insurance)
## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : chr  "female" "male" "male" "male" ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 25.7 33.4 27.7 29.8 25.8 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : chr  "yes" "no" "no" "no" ...
##  $ region  : chr  "southwest" "southeast" "southeast" "northwest" ...
##  $ expenses: num  16885 1726 4449 21984 3867 ...

Explore Dataset

Our model’s dependent variable is expenses, which measures the medical costs each person charged to the insurance plan for the year. Prior to building a regression model, it is often helpful to check the variables for [normality] (i.e., whether they follow a normal distribution).

hist(insurance$bmi)
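
Since the dependent variable is expenses, it is also worth inspecting its distribution directly; a small sketch using the same base-R functions:

summary(insurance$expenses)
hist(insurance$expenses)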

Exploring relationship among features - Correlation Matrix

A correlation matrix provides a quick overview of the relationships between the independent variables and the dependent variable, as well as among the independent variables themselves.

# exploring relationships among features: correlation matrix
cor(insurance[c("age", "bmi", "children", "expenses")])
##                age        bmi   children   expenses
## age      1.0000000 0.10934101 0.04246900 0.29900819
## bmi      0.1093410 1.00000000 0.01264471 0.19857626
## children 0.0424690 0.01264471 1.00000000 0.06799823
## expenses 0.2990082 0.19857626 0.06799823 1.00000000

Visualizing Relationship - Scatterplot Matrix

It can also be helpful to visualize the relationships among numeric features by using a scatterplot matrix, a collection of scatterplots arranged in a grid. It can be used to detect patterns among multiple variables.

# visualizing relationships among features: scatterplot matrix
pairs(insurance[c("age", "bmi", "children", "expenses")])

Do you notice any patterns in these plots? Although some look like random clouds of points, a few seem to display some trends.
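
For a richer view, the pairs.panels() function from the psych package (if installed) adds correlation coefficients, histograms, and smoothed trend lines to the same grid; a sketch:

# install.packages("psych")
library(psych)
pairs.panels(insurance[c("age", "bmi", "children", "expenses")])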

Split the Dataset

RNGversion("3.5.2")
## Warning in RNGkind("Mersenne-Twister", "Inversion", "Rounding"): non-uniform
## 'Rounding' sampler used
set.seed(112) # use set.seed to use the same random number sequence as the tutorial
train_sample <- sample(1338, 1000)

# split the data frames
insurance_train <- insurance[train_sample, ]
insurance_test  <- insurance[-train_sample, ]

Train a Multiple Regression Model

The following command fits a linear regression model relating the six independent variables to the total medical expenses.

## Training a model on the data ----
ins_model <- lm(expenses ~ age + children + bmi + sex + smoker + region,
                data = insurance_train)
# the shorthand below is equivalent to the formula above
ins_model <- lm(expenses ~ ., data = insurance_train)
# see the estimated beta coefficients
ins_model
## 
## Call:
## lm(formula = expenses ~ ., data = insurance_train)
## 
## Coefficients:
##     (Intercept)              age          sexmale              bmi  
##        -11456.5            252.0           -300.3            335.7  
##        children        smokeryes  regionnorthwest  regionsoutheast  
##           377.5          23838.0           -459.3           -877.9  
## regionsouthwest  
##          -724.0

Also, the lm() function automatically applies a technique known as dummy coding to each of the factor-type (categorical) variables included in the model.
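
To see the dummy coding explicitly, model.matrix() shows how a categorical variable such as region is expanded into 0/1 indicator columns (the omitted level, northeast, serves as the reference category); a sketch:

# how region is expanded into indicator columns for the regression
head(model.matrix(~ region, data = insurance_train))

# relevel() can be used to choose a different reference category, e.g.:
# insurance_train$region <- relevel(factor(insurance_train$region), ref = "southeast")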

Interpretation of Model Coefficients

The beta coefficients indicate the estimated increase in expenses for a one-unit increase in each feature, assuming all other values are held constant.

  • For instance, for each additional year of age, we would expect an average of $252.01 higher medical expenses, assuming everything else is equal.

Questions: How do you interpret other coefficients? Which one is the most important factor?
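
One way to start answering these questions is to look at the fitted coefficients directly; keep in mind that the raw size of a coefficient depends on the scale of its feature, so the t values shown in the next section are a better guide to importance. A sketch:

# coefficients sorted by absolute size (scale-dependent, so interpret with care)
sort(abs(coef(ins_model)), decreasing = TRUE)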

Evaluate Model Performance

The parameter estimates we obtained by typing ins_model tell us how the independent variables are related to the dependent variable, but they do not tell us how well the model fits our data. For that, we can use the summary() function.

summary(ins_model)
## 
## Call:
## lm(formula = expenses ~ ., data = insurance_train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11592  -2853  -1080   1404  29844 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11456.49    1135.44 -10.090   <2e-16 ***
## age                252.01      13.70  18.393   <2e-16 ***
## sexmale           -300.28     384.70  -0.781    0.435    
## bmi                335.69      33.04  10.160   <2e-16 ***
## children           377.53     160.73   2.349    0.019 *  
## smokeryes        23838.00     482.49  49.406   <2e-16 ***
## regionnorthwest   -459.34     557.19  -0.824    0.410    
## regionsoutheast   -877.90     559.21  -1.570    0.117    
## regionsouthwest   -724.00     559.25  -1.295    0.196    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6035 on 991 degrees of freedom
## Multiple R-squared:  0.7456, Adjusted R-squared:  0.7435 
## F-statistic:   363 on 8 and 991 DF,  p-value: < 2.2e-16

Interpretation of Model Performance

  • t value: the standardized slope; a larger absolute t value indicates a more important factor

  • Pr(>|t|): the p-value; the more stars, the more significant the predictor

  • R-squared provides a measure of how well our model as a whole explains the values of the dependent variable.

  • RSE (residual standard error): the error of the training stage.
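
The full-precision coefficient table below was presumably produced by extracting the coefficients component of the model summary:

summary(ins_model)$coefficients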

##                    Estimate Std. Error     t value      Pr(>|t|)
## (Intercept)     -11456.4882 1135.44357 -10.0898790  7.438627e-23
## age                252.0080   13.70148  18.3927608  3.162353e-65
## sexmale           -300.2830  384.69525  -0.7805738  4.352396e-01
## bmi                335.6874   33.04086  10.1597677  3.893406e-23
## children           377.5287  160.72904   2.3488520  1.902701e-02
## smokeryes        23838.0023  482.48993  49.4062175 1.479229e-269
## regionnorthwest   -459.3450  557.19386  -0.8243899  4.099164e-01
## regionsoutheast   -877.8975  559.20909  -1.5698913  1.167596e-01
## regionsouthwest   -724.0036  559.25050  -1.2945962  1.957611e-01

Evaluate the Error of Training Stage

The RSE estimate gives a measure of prediction error. The lower the RSE, the more accurate the model (on the data at hand).

The error rate can be estimated by dividing the RSE by the mean of the outcome variable:

#Training stage error rate
sigma(ins_model)/mean(insurance$expenses)
## [1] 0.4548005

Estimates with an RSE of less than 30% meet the standard of reliability and are reported. Estimates with an RSE of 30%–50% meet a lower standard of reliability and are reported, but should be interpreted with caution. Estimates with an RSE of more than 50% are statistically unreliable and are not reported.

Model Prediction

Now we can plug any predictor (independent variable) values into our model and predict the potential medical cost.

predict(ins_model,
        data.frame(age = 30, children = 2,
                   bmi = 30, sex = "male",
                   smoker = "no", region = "northeast"))
##       1 
## 6629.15
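
Because the model is additive, comparing predictions for two profiles that differ in a single feature isolates that feature's contribution. A sketch contrasting an otherwise identical non-smoker and smoker (the difference equals the smokeryes coefficient, roughly $23,838):

new_patients <- data.frame(age = 30, children = 2, bmi = 30,
                           sex = "male", region = "northeast",
                           smoker = c("no", "yes"))
predict(ins_model, new_patients)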

Visualize Prediction Performance

# making predictions with the regression model
insurance_test$pred <- predict(ins_model, insurance_test)

plot(insurance_test$pred, insurance_test$expenses)
abline(a = 0, b = 1, col = "red", lwd = 3, lty = 2)

Evaluate the Error of Testing Stage

# function to calculate the mean absolute error
MAE <- function(actual, predicted) {
  mean(abs(actual - predicted))  
}
# mean absolute error between predicted and actual values
MAE(insurance_test$pred, insurance_test$expenses)
## [1] 4294.953
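
To put this error in context, it can be compared with a naive baseline that predicts the mean training-set expense for every test case (a sketch reusing the MAE() function defined above):

# baseline: always predict the mean expense observed in the training data
MAE(insurance_test$expenses, mean(insurance_train$expenses))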

Logistic Regression - Titanic Survival Prediction

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered “unsinkable,” sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the deaths of 1,502 of the 2,224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

Step 1: Load data

In this case study, we want to build a predictive model that answers the question “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

# You can load the dataset from an online package
## Load the titanic library to get the dataset
#install.packages("titanic")
library(titanic)

## Load the dataset
data("titanic_train")
titanic <- titanic_train

Step 2: Exploring and Preparing Data

We can first check whether there are missing or empty values.

## Let's check for any missing values in the data
colSums(is.na(titanic))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0
## Checking for empty values (Age shows NA here because comparisons with missing values return NA)
colSums(titanic=='')
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0          NA 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2

Step 2: Exploring and Preparing Data

Since there are some categorical variables, we will first check how many categories each has, to decide whether to keep them in the model (sapply() is used to apply a function over a list or vector).

For the variables with only a few values that are relevant to predicting survival, we can keep them and convert them to factors, which will be dummy-coded automatically when the model is fit.

## Check number of unique values for each of the column to find out columns which we can convert to factors
sapply(titanic, function(x) length(unique(x)))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##         891           2           3         891           2          89 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           7           7         681         248         148           4

Dealing with Categorical Variables

## Remove Cabin (very high share of missing values); PassengerId, Ticket and Name are not required
library(dplyr)
titanic <- titanic %>% select(-c(Cabin, PassengerId, Ticket, Name))

## Converting "Survived","Pclass","Sex","Embarked" to factors
for (i in c("Survived","Pclass","Sex","Embarked")){
  titanic[,i]=as.factor(titanic[,i])
}

Impute Missing Values

Then we can impute (fill in) the missing values, as well as discard variables that are not relevant.

## Missing values imputation
# fill empty Embarked entries with "S" (the most common port of embarkation)
titanic$Embarked[titanic$Embarked==""] <- "S"

# fill missing Age values with the median age
titanic$Age[is.na(titanic$Age)] <- median(titanic$Age, na.rm = TRUE)

# Remove missing values:
# titanic <- titanic %>% drop_na(Embarked, Age)

Splitting Data into Training and Testing Datasets

We will split our data into two portions:

  • a training dataset to build the model (500 records)

  • a test dataset to evaluate the model performance (the rest)

train <- titanic[1:500, ]
test  <- titanic[500:891, ]  # note: row 500 appears in both subsets; use 501:891 to avoid the one-row overlap
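
Because this is a sequential split, any ordering in the rows carries over to the subsets; a random split (as done earlier for the insurance data) is often safer. A sketch under that assumption, using different object names so the results below are unaffected:

# alternative: random train/test split
set.seed(112)
train_idx <- sample(nrow(titanic), 500)
train_alt <- titanic[train_idx, ]
test_alt  <- titanic[-train_idx, ]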

Step 3: Training a Logistic Model

We will use the glm() function (generalized linear models) from base R to train our logistic regression model.

## Model Creation
model <- glm(Survived ~.,family=binomial(link='logit'),data=train)
## Model Summary
summary(model)
## 
## Call:
## glm(formula = Survived ~ ., family = binomial(link = "logit"), 
##     data = train)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.7322836  0.6499976   5.742 9.36e-09 ***
## Pclass2     -0.8431614  0.4253430  -1.982  0.04744 *  
## Pclass3     -1.8983414  0.4253657  -4.463 8.09e-06 ***
## Sexmale     -2.7663596  0.2564556 -10.787  < 2e-16 ***
## Age         -0.0386478  0.0105815  -3.652  0.00026 ***
## SibSp       -0.3555439  0.1437543  -2.473  0.01339 *  
## Parch        0.0758196  0.1781682   0.426  0.67044    
## Fare        -0.0005304  0.0036455  -0.145  0.88432    
## EmbarkedQ    0.7123748  0.5095780   1.398  0.16212    
## EmbarkedS   -0.2148003  0.3272835  -0.656  0.51162    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 666.93  on 499  degrees of freedom
## Residual deviance: 444.68  on 490  degrees of freedom
## AIC: 464.68
## 
## Number of Fisher Scoring iterations: 5
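
Unlike the linear regression coefficients above, these estimates are on the log-odds scale. Exponentiating them gives odds ratios, which are often easier to interpret; for example, exp() of the Sexmale coefficient gives the multiplicative change in the odds of survival for males relative to females. A minimal sketch:

# convert log-odds coefficients to odds ratios
exp(coef(model))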

Predict with Logistic Model

## Predicting on the test data: type = 'response' returns predicted probabilities
result <- predict(model, newdata = test, type = 'response')
# convert probabilities to class labels using a 0.5 cutoff
result <- ifelse(result > 0.5, 1, 0)

result <- as.factor(result)
test$Survived <- as.factor(test$Survived)

Evaluate with Confusion Matrix

## Confusion matrix and statistics
#install.packages("caret")
library(caret)
confusionMatrix(data=result, reference=test$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 213  53
##          1  30  96
##                                           
##                Accuracy : 0.7883          
##                  95% CI : (0.7444, 0.8277)
##     No Information Rate : 0.6199          
##     P-Value [Acc > NIR] : 6.084e-13       
##                                           
##                   Kappa : 0.5369          
##                                           
##  Mcnemar's Test P-Value : 0.01574         
##                                           
##             Sensitivity : 0.8765          
##             Specificity : 0.6443          
##          Pos Pred Value : 0.8008          
##          Neg Pred Value : 0.7619          
##              Prevalence : 0.6199          
##          Detection Rate : 0.5434          
##    Detection Prevalence : 0.6786          
##       Balanced Accuracy : 0.7604          
##                                           
##        'Positive' Class : 0               
##
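
The reported accuracy is simply the share of correct predictions on the diagonal of the confusion matrix, which can be verified by hand:

# correct predictions / total test cases
(213 + 96) / (213 + 53 + 30 + 96)
## [1] 0.7882653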