Multiple Linear Regression
Logistic Regression
Training a Model
Evaluating Model Performance
Linear regression is a statistical technique to represent relationships between two or more variables using a linear equation.
Simple linear regression: one predictor variable (the dependent variable is continuous)
Multiple linear regression: multiple predictor variables (the dependent variable is continuous)
Logistic regression: used to model a binary categorical outcome
You can use linear regression to predict an outcome (the dependent variable, or target variable), given some inputs (independent variables, or predictors).
Most real-world analyses have more than one independent variable.
Therefore, it is likely that you will be using multiple linear regression for most numeric prediction tasks.
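As a quick illustration before the insurance case study, here is a minimal sketch of fitting a multiple linear regression and predicting a new observation; it uses R's built-in mtcars data, so the variable names are specific to that example dataset rather than anything used later in this tutorial.

# predict fuel efficiency (mpg) from weight and horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)
coef(fit)                                    # estimated intercept and slopes
predict(fit, data.frame(wt = 3, hp = 120))   # prediction for a new car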
Strengths
By far the most common approach for modeling numeric data
Can be adapted to model almost any modeling task
Provides estimates of both the strength and size of the relationships among features and the outcome
Weaknesses
Makes strong assumptions about the data (i.e., Linearity, Independence, Normality, Equal variance); a sketch of how to check these visually follows this list
The model’s form must be specified by the user in advance
Does not handle missing data
Only works with numeric features, so categorical data requires extra processing
Requires some knowledge of statistics to understand the model
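The assumptions listed above can be inspected visually. Here is a minimal sketch, again using the built-in mtcars data rather than the insurance data introduced later, of R's standard regression diagnostic plots:

fit <- lm(mpg ~ wt + hp, data = mtcars)   # same illustrative model as above

par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a 2 x 2 grid
plot(fit)              # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
par(mfrow = c(1, 1))

Roughly, the Residuals vs Fitted and Scale-Location plots speak to linearity and equal variance, and the Normal Q-Q plot to the normality of the residuals.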
Health insurance companies need to ensure that their yearly premiums exceed the medical costs they pay out for their beneficiaries. As a result, insurers need to develop models that accurately forecast medical expenses for the insured population.
Although it is a challenging task, some conditions are more prevalent for certain segments of the population. For instance, lung cancer is more likely among smokers than non-smokers, and heart disease may be more likely among the obese.
Objective: To use patient data to estimate the average medical care expenses. These estimates can be used to create actuarial tables that set the price of yearly premiums higher or lower, depending on the expected treatment costs.
We will use a simulated dataset containing hypothetical medical expenses for patients in the United States, based on the demographic statistics from the US Census Bureau. Thus, it approximately reflects real-world conditions.
First, load the data from insurance.csv:
insurance <- read.csv("insurance.csv")
The insurance.csv file includes 1,338 examples of beneficiaries currently enrolled in the insurance plan, with features indicating characteristics of the patient as well as the total medical expenses charged to the plan for the calendar year.
It is important to give some thought to how these variables may be related to billed medical expenses.
str(insurance)
## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : chr  "female" "male" "male" "male" ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 25.7 33.4 27.7 29.8 25.8 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : chr  "yes" "no" "no" "no" ...
##  $ region  : chr  "southwest" "southeast" "southeast" "northwest" ...
##  $ expenses: num  16885 1726 4449 21984 3867 ...
Our model’s dependent variable is expenses, which measures the medical costs each person charged to the insurance plan for the year. Prior to building a regression model, it is often helpful to check the continuous variables for normality (i.e., whether they follow a normal distribution).
hist(insurance$bmi)
A correlation matrix provides a quick overview of the relationships between the independent variables and the dependent variable, as well as among the independent variables themselves.
# exploring relationships among features: correlation matrix
cor(insurance[c("age", "bmi", "children", "expenses")])
##                age        bmi   children   expenses
## age      1.0000000 0.10934101 0.04246900 0.29900819
## bmi      0.1093410 1.00000000 0.01264471 0.19857626
## children 0.0424690 0.01264471 1.00000000 0.06799823
## expenses 0.2990082 0.19857626 0.06799823 1.00000000
It can also be helpful to visualize the relationships among numeric features by using a scatterplot matrix, a collection of scatterplots arranged in a grid. It can be used to detect patterns among multiple variables.
# visualizing relationships among features: scatterplot matrix
pairs(insurance[c("age", "bmi", "children", "expenses")])
Do you notice any patterns in these plots? Although some look like random clouds of points, a few seem to display some trends.
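For a richer view, the pairs.panels() function from the psych package (not used elsewhere in this tutorial, so it would need to be installed separately) augments the scatterplot matrix with correlations, histograms, and smoothed trend lines:

# install.packages("psych")   # if not already installed
library(psych)

# correlations above the diagonal, histograms on the diagonal,
# scatterplots with loess trend lines below the diagonal
pairs.panels(insurance[c("age", "bmi", "children", "expenses")])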
RNGversion("3.5.2")
## Warning in RNGkind("Mersenne-Twister", "Inversion", "Rounding"): non-uniform
## 'Rounding' sampler used
set.seed(112) # use set.seed to use the same random number sequence as the tutorial
train_sample <- sample(1338, 1000)

# split the data frames
insurance_train <- insurance[train_sample, ]
insurance_test <- insurance[-train_sample, ]
The following command fits a linear regression model relating the six independent variables to the total medical expenses.
## Training a model on the data ----
ins_model <- lm(expenses ~ age + children + bmi + sex + smoker + region,
data = insurance_train)
## Training a model on the data ----
ins_model <- lm(expenses ~ ., data = insurance_train) # this is equivalent to above

# see the estimated beta coefficients
ins_model
## 
## Call:
## lm(formula = expenses ~ ., data = insurance_train)
## 
## Coefficients:
##     (Intercept)              age          sexmale              bmi  
##        -11456.5            252.0           -300.3            335.7  
##        children        smokeryes  regionnorthwest  regionsoutheast  
##           377.5          23838.0           -459.3           -877.9  
## regionsouthwest  
##          -724.0
Also, the lm() function automatically applied a technique known as dummy coding to each of the factor-type variables we included in the model.
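To see what dummy coding produces, here is a minimal sketch (an illustration, not part of the original analysis) that expands the region variable by hand:

# model.matrix() expands a categorical variable into 0/1 dummy columns;
# the first level in alphabetical order ("northeast") becomes the reference category
head(model.matrix(~ region, data = insurance))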
The beta coefficients indicate the estimated increase in expenses for a one-unit increase in each of the features, assuming all other values are held constant.
- For instance, for each additional year of age, we would expect $252.01 higher medical expenses on average, assuming everything else is equal.
Questions: How do you interpret other coefficients? Which one is the most important factor?
The parameter estimates we obtained by typing ins_model tell us how the independent variables are related to the dependent variable, but we also need to know how well the model fits our data. We can use the summary() function for this.
summary(ins_model)
## 
## Call:
## lm(formula = expenses ~ ., data = insurance_train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11592  -2853  -1080   1404  29844 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11456.49    1135.44 -10.090   <2e-16 ***
## age                252.01      13.70  18.393   <2e-16 ***
## sexmale           -300.28     384.70  -0.781    0.435    
## bmi                335.69      33.04  10.160   <2e-16 ***
## children           377.53     160.73   2.349    0.019 *  
## smokeryes        23838.00     482.49  49.406   <2e-16 ***
## regionnorthwest   -459.34     557.19  -0.824    0.410    
## regionsoutheast   -877.90     559.21  -1.570    0.117    
## regionsouthwest   -724.00     559.25  -1.295    0.196    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6035 on 991 degrees of freedom
## Multiple R-squared:  0.7456, Adjusted R-squared:  0.7435 
## F-statistic:   363 on 8 and 991 DF,  p-value: < 2.2e-16
t value: the standardized slope; a larger absolute t value indicates a more important factor
Pr(>|t|): the p-value; the more stars, the more significant the predictor
R-squared: a measure of how well our model as a whole explains the values of the dependent variable
RSE (residual standard error): the error of the model at the training stage
##                     Estimate  Std. Error     t value      Pr(>|t|)
## (Intercept)     -11456.4882  1135.44357 -10.0898790  7.438627e-23
## age                252.0080    13.70148  18.3927608  3.162353e-65
## sexmale           -300.2830   384.69525  -0.7805738  4.352396e-01
## bmi                335.6874    33.04086  10.1597677  3.893406e-23
## children           377.5287   160.72904   2.3488520  1.902701e-02
## smokeryes        23838.0023   482.48993  49.4062175 1.479229e-269
## regionnorthwest   -459.3450   557.19386  -0.8243899  4.099164e-01
## regionsoutheast   -877.8975   559.20909  -1.5698913  1.167596e-01
## regionsouthwest   -724.0036   559.25050  -1.2945962  1.957611e-01
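Returning to the question of which factor appears most important: one simple heuristic (a sketch, not a formal variable-importance analysis) is to rank the predictors by the absolute value of their t statistics:

# rank predictors by |t value|; a larger |t| suggests a stronger contribution
coefs <- summary(ins_model)$coefficients
coefs[order(abs(coefs[, "t value"]), decreasing = TRUE), ]

By this heuristic, smokeryes dominates, consistent with its very large coefficient and t value in the output above.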
The RSE estimate gives a measure of error of prediction. The lower the RSE, the more accurate the model (on the data in hand).
The error rate can be estimated by dividing the RSE by the mean of the outcome variable:
# Training stage error rate
sigma(ins_model) / mean(insurance$expenses)
## [1] 0.4548005
Estimates with an RSE of < 30% meet the standard of reliability and are displayed. Estimates with an RSE of 30%–50% meet a lower standard of reliability and are displayed but should be interpreted with caution. Estimates with an RSE of > 50% are statistically unreliable and are not displayed.
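As a minimal sketch (not part of the original tutorial), these thresholds can be applied directly to the rate computed above:

# classify the training error rate into the reliability bands described above
rate <- sigma(ins_model) / mean(insurance$expenses)
cut(rate, breaks = c(0, 0.3, 0.5, Inf),
    labels = c("reliable", "interpret with caution", "unreliable"))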
Now we can plug any predictor (independent variable) values into our model and predict the potential medical cost.
predict(ins_model,
        data.frame(age = 30, children = 2,
                   bmi = 30, sex = "male",
                   smoker = "no", region = "northeast"))
##       1 
## 6629.15
# making predictions with the regression model
insurance_test$pred <- predict(ins_model, insurance_test)
plot(insurance_test$pred, insurance_test$expenses)
abline(a = 0, b = 1, col = "red", lwd = 3, lty = 2)
# function to calculate the mean absolute error
MAE <- function(actual, predicted) {
mean(abs(actual - predicted))
}
# mean absolute error between predicted and actual values
MAE(insurance_test$pred, insurance_test$expenses)
## [1] 4294.953
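To put this MAE in context, one option (a sketch, not part of the original tutorial) is to compare it with a naive baseline that always predicts the mean expense from the training set:

# baseline: predict the training-set mean expense for every test case
MAE(insurance_test$expenses, mean(insurance_train$expenses))

The regression model is only useful to the extent that its MAE is substantially lower than this baseline.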
The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered “unsinkable”, sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this case study, we want to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).
# You can load the datasets from online packages
## Load Titanic library to get the dataset
#install.packages("titanic")
library(titanic)
## Load the datasets
data("titanic_train")
titanic <- titanic_train
We can first check whether there are missing or empty values.
## Let's check for any missing values in the data
colSums(is.na(titanic))

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0 

## Checking for empty values
colSums(titanic == '')

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0          NA 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2 
Since there are some categorical variables, we will first check how many categories each has, to decide whether to keep them in the model (sapply() is used to apply a function over a list/vector).
For the variables with only a few categories that are relevant to our prediction of survival, we can keep them and convert them to factors, which will be dummy-coded in the model.
## Check the number of unique values in each column to find out which columns we can convert to factors
sapply(titanic, function(x) length(unique(x)))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##         891           2           3         891           2          89 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           7           7         681         248         148           4 
## Removing Cabin as it has a very high number of missing values; PassengerId, Ticket and Name are not required
library(dplyr)
titanic <- titanic %>% select(-c(Cabin, PassengerId, Ticket, Name))
## Converting "Survived","Pclass","Sex","Embarked" to factors
for (i in c("Survived","Pclass","Sex","Embarked")){
titanic[,i]=as.factor(titanic[,i])
}
Then we can impute (fill in) the missing values; alternatively, we could drop the observations with missing values.
## Missing values imputation
titanic$Embarked[titanic$Embarked == ""] <- "S"
titanic$Age[is.na(titanic$Age)] <- median(titanic$Age, na.rm = T)

# Remove missing values instead:
# titanic <- titanic %>% drop_na(Embarked, Age)
We will split our data into two portions:
a training dataset to build the model (500 records)
a test dataset to evaluate the model performance (the rest)
train <- titanic[1:500,]
test <- titanic[500:891,]
We will use the glm() function (generalized linear models), available in base R's stats package, to train our logistic regression model.
## Model Creation
model <- glm(Survived ~ ., family = binomial(link = 'logit'), data = train)

## Model Summary
summary(model)
## 
## Call:
## glm(formula = Survived ~ ., family = binomial(link = "logit"), 
##     data = train)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.7322836  0.6499976   5.742 9.36e-09 ***
## Pclass2     -0.8431614  0.4253430  -1.982  0.04744 *  
## Pclass3     -1.8983414  0.4253657  -4.463 8.09e-06 ***
## Sexmale     -2.7663596  0.2564556 -10.787  < 2e-16 ***
## Age         -0.0386478  0.0105815  -3.652  0.00026 ***
## SibSp       -0.3555439  0.1437543  -2.473  0.01339 *  
## Parch        0.0758196  0.1781682   0.426  0.67044    
## Fare        -0.0005304  0.0036455  -0.145  0.88432    
## EmbarkedQ    0.7123748  0.5095780   1.398  0.16212    
## EmbarkedS   -0.2148003  0.3272835  -0.656  0.51162    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 666.93  on 499  degrees of freedom
## Residual deviance: 444.68  on 490  degrees of freedom
## AIC: 464.68
## 
## Number of Fisher Scoring iterations: 5
## Predicting Test Data
result <- predict(model, newdata = test, type = 'response')
result <- ifelse(result > 0.5, 1, 0)
result <- as.factor(result)
test$Survived <- as.factor(test$Survived)
## Confusion matrix and statistics
#install.packages("caret")
library(caret)
confusionMatrix(data=result, reference=test$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 213  53
##          1  30  96
## 
##                Accuracy : 0.7883          
##                  95% CI : (0.7444, 0.8277)
##     No Information Rate : 0.6199          
##     P-Value [Acc > NIR] : 6.084e-13       
## 
##                   Kappa : 0.5369          
## 
##  Mcnemar's Test P-Value : 0.01574         
## 
##             Sensitivity : 0.8765          
##             Specificity : 0.6443          
##          Pos Pred Value : 0.8008          
##          Neg Pred Value : 0.7619          
##              Prevalence : 0.6199          
##          Detection Rate : 0.5434          
##    Detection Prevalence : 0.6786          
##        Balanced Accuracy : 0.7604          
## 
##        'Positive' Class : 0
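To connect these statistics back to the table, here is a minimal sketch (illustration only) that recomputes accuracy, sensitivity, and specificity by hand from the same predictions; note that caret treats 0 (did not survive) as the 'positive' class here:

# build the 2x2 table ourselves and recompute the headline statistics
tab <- table(Prediction = result, Reference = test$Survived)

accuracy    <- sum(diag(tab)) / sum(tab)        # (213 + 96) / 392 = 0.7883
sensitivity <- tab["0", "0"] / sum(tab[, "0"])  # 213 / (213 + 30) = 0.8765
specificity <- tab["1", "1"] / sum(tab[, "1"])  #  96 / (53 + 96)  = 0.6443
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)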