Multiple linear regression is linear regression with more than one predictor variable, i.e. a model of the form y = b0 + b1x1 + b2x2 + ... + bkxk. It produces a model that identifies the best weighted combination of the independent variables for predicting the forecast variable. In R, it is used in exactly the same way as simple linear regression.
To show how it works we will use the state.x77 dataset, which ships with R (in the datasets package). To have a look at the dataset, type head(state.x77) and summary(state.x77). The dataset contains the following data for the 50 US states:
| Variable | Description |
|---|---|
| Population | population estimate as of 1 July 1975 (in thousands) |
| Income | per capita income (1974) |
| Illiteracy | illiteracy in 1970 (% of population) |
| Life Exp | life expectancy in years (1969-1971) |
| Murder | murder and non-negligent manslaughter rate per 100,000 population (1976) |
| HS Grad | percent high-school graduates (1970) |
| Frost | mean number of days with minimum temperature below freezing (1931-1960) in capital or large city |
| Area | land area in square miles |
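Before modelling, it is worth taking a quick look at the data. As mentioned above, you can do this with:
head(state.x77)     # first six rows of the matrix
summary(state.x77)  # minimum, quartiles, mean and maximum for each variable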
As an example, we will use the Population, Illiteracy, Income and Frost variables to predict the Murder variable. First, we can see how these variables correlate using:
cor(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")])
(hint: you first need to install and load the corrplot and corrgram packages, as shown below)
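A minimal way to do this (install.packages is only needed the first time on a machine):
install.packages("corrplot")  # install the corrplot package (once)
install.packages("corrgram")  # install the corrgram package (once)
library(corrplot)             # load corrplot
library(corrgram)             # load corrgram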
With method="circle", corrplot will produce something like the plot below:
## corrplot 0.92 loaded
corrplot(cor(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")]), method="circle")
In its simplest form, corrgram will produce the following plot (intense blue indicates a strong positive correlation; intense red a strong negative correlation):
corrgram(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")])
You can also use the following code to produce a matrix of scatterplots:
pairs(state.x77[,c("Murder", "Population","Illiteracy", "Income", "Frost")])
From this you should be able to see some interesting relationships and understand which variables would be good for predicting other variables.
Now we will use the data to fit a multiple linear regression model.
(hint: as state.x77 is stored as a matrix and the lm function expects a data frame, the data should be converted to a data frame using the as.data.frame function)
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=as.data.frame(state.x77))
summary(fit)
fit2 <- lm(Murder ~ Population + Illiteracy, data=as.data.frame(state.x77))
Based on the correlation matrix, we might have expected the relationship of Murder with Frost to be more significant than that of Murder with Population. One explanation for what we observe is that, because Frost is strongly negatively correlated with Illiteracy, all the information in the former variable is subsumed by the latter.
A good model not only needs to fit the data well, it also needs to be parsimonious. That is, a good model should only be as complex as necessary to describe the dataset. If you are choosing between a very simple model with one independent variable and a very complex model with, say, ten independent variables, the more complex model needs to provide a much better fit to the data in order to justify its increased complexity. If it can't, the simpler model is preferable. To compare the fits of two models, you can use the anova() function with the regression objects as its two arguments.
View the results of the new model. What can you say about the goodness of fit of the new model compared to the previous one?
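You can view them with:
summary(fit2)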
Compare the two models and conclude whether the removal of the independent variables has any significant effect on the explanatory power of the model.
(hint: the anova() function takes the model objects as arguments and returns an ANOVA test of whether the more complex model is significantly better at capturing the data than the simpler model. If the resulting p-value is sufficiently low (usually less than 0.05), we conclude that the more complex model is significantly better than the simpler model and favor the more complex model. If the p-value is not sufficiently low (usually greater than 0.05), we should favor the simpler model.)
anova(fit,fit2)
What we just did is essentially stepwise regression (more precisely, backward elimination): the idea is to start with as many predictor variables as possible and then repeatedly remove the least significant variable from the model until all the remaining predictor variables are significant. This can be done automatically in R using the step function. To test this, let's first create a model using the Population, Illiteracy, Income and Frost variables to predict the Murder variable:
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=as.data.frame(state.x77))
fit4 <- step(fit, direction="backward") # backward elimination: repeatedly drops the least useful variable from the previous model
## Start: AIC=97.75
## Murder ~ Population + Illiteracy + Income + Frost
##
## Df Sum of Sq RSS AIC
## - Frost 1 0.021 289.19 95.753
## - Income 1 0.057 289.22 95.759
## <none> 289.17 97.749
## - Population 1 39.238 328.41 102.111
## - Illiteracy 1 144.264 433.43 115.986
##
## Step: AIC=95.75
## Murder ~ Population + Illiteracy + Income
##
## Df Sum of Sq RSS AIC
## - Income 1 0.057 289.25 93.763
## <none> 289.19 95.753
## - Population 1 43.658 332.85 100.783
## - Illiteracy 1 236.196 525.38 123.605
##
## Step: AIC=93.76
## Murder ~ Population + Illiteracy
##
## Df Sum of Sq RSS AIC
## <none> 289.25 93.763
## - Population 1 48.517 337.76 99.516
## - Illiteracy 1 299.646 588.89 127.311
summary(fit4)
##
## Call:
## lm(formula = Murder ~ Population + Illiteracy, data = as.data.frame(state.x77))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7652 -1.6561 -0.0898 1.4570 7.6758
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.652e+00 8.101e-01 2.039 0.04713 *
## Population 2.242e-04 7.984e-05 2.808 0.00724 **
## Illiteracy 4.081e+00 5.848e-01 6.978 8.83e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.481 on 47 degrees of freedom
## Multiple R-squared: 0.5668, Adjusted R-squared: 0.5484
## F-statistic: 30.75 on 2 and 47 DF, p-value: 2.893e-09
As you can see, the conclusion is the same: the procedure keeps only Population and Illiteracy.
Hopefully you have managed to produce the expected results. Now it is time to use this information to make a prediction. As a first step, use only the regression output below to make an informed prediction: what is the estimated murder rate per 100,000 population for a state with a population of 10,000 and a 1.1% illiteracy rate?
|  | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 1.6515497 | 0.8101121 | 2.038668 | 0.0471304 |
| Population | 0.0002242 | 0.0000798 | 2.807789 | 0.0072414 |
| Illiteracy | 4.0807366 | 0.5848156 | 6.977817 | 0.0000000 |
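If you want to check your arithmetic, a minimal sketch of the hand calculation is to plug the coefficient estimates from the table into the fitted equation:
# estimated murder rate = intercept + (Population coefficient * 10000) + (Illiteracy coefficient * 1.1)
1.6515497 + 0.0002242 * 10000 + 4.0807366 * 1.1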
Let's see if your prediction is close to the value returned by the predict function. Run the following code:
predict(fit2, list(Population=10000, Illiteracy=1.1))
As a final step in multiple linear regression analysis, let's see how to model an interaction effect. In some texts you will see the interaction effect referred to as a moderating effect. For example, we have observed that a state's population increases the murder rate. Is the effect the same, however, for states with different income? To examine this, you need to regress Murder on the Population and Income variables, also including the interaction term Population:Income, as in the sketch below.
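A minimal sketch of that model (the object name fit5 is just a suggestion) is:
fit5 <- lm(Murder ~ Population + Income + Population:Income, data=as.data.frame(state.x77))
summary(fit5)  # the coefficient table should look like the one below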
|  | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 6.3119523 | 4.7718006 | 1.3227611 | 0.1924534 |
| Population | 0.0033417 | 0.0012810 | 2.6086702 | 0.0122198 |
| Income | -0.0002390 | 0.0010350 | -0.2309358 | 0.8183887 |
| Population:Income | -0.0000006 | 0.0000003 | -2.3523098 | 0.0229921 |
What is the interpretation of that coefficient?
Logistic regression is used when you are trying to predict a categorical variable and your predictor variables are continuous, categorical or both. It is referred to as binomial logistic regression when your forecast variable is binary (taking a value of either 0 or 1) and as multinomial logistic regression when your forecast variable takes more than two values (for example: low, medium and high). In R you can use the glm function for logistic regression; a generic sketch is shown below. For this section of the seminar we will use the titanic dataset.
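In its simplest form the call looks like the sketch below, where y, x1, x2 and mydata are hypothetical names for a binary outcome, two predictors and a data frame:
model <- glm(y ~ x1 + x2, family=binomial, data=mydata)  # family=binomial requests logistic regression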
First, import the titanic dataset from Minerva (I named the data frame data). You will then need to impute the missing values in the age and fare variables, using simple imputation methods such as those you used in the BADS Seminar.
(hint: you should replace the missing values with the mean of each variable. You can use nrow(data[which(is.na(data$age)),]) to see how many rows have no age data.)
data$age[is.na(data$age)] <- mean(data$age, na.rm=TRUE)  # replace missing ages with the mean age
data <- data[which(!is.na(data$survived)),]              # drop rows where survived itself is missing
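The fare variable also needs imputing; assuming its column is called fare, the same approach works:
data$fare[is.na(data$fare)] <- mean(data$fare, na.rm=TRUE)  # replace missing fares with the mean fare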
If you now type nrow(data) you will see there are 1,309 rows in the data. As this is a large dataset, we can split it into a training set and a testing set, using 70% of the rows for training and the remaining 30% for testing. We can then fit the logistic regression on the training set and evaluate the model on the testing set. To split the data, we can use:
row_number <- sort(sample(nrow(data), nrow(data)*.7)) # randomly sample 70% of the row indices
train<-data[row_number,] # 70% is training data
test<-data[-row_number,] # 30% is testing data
Now let’s use logistic regression on the training set with:
lfit <- glm(survived ~ sex + age + fare, family=binomial, data=train)
summary(lfit) #Take a look at the results
Question: What is the relationship of the variables sex and fare with the probability of having survived?
You will notice that no R2 value is reported. This is because R2, the goodness-of-fit measure produced in linear regression, cannot be calculated for logistic regression; other measures (such as the deviance and AIC shown in the summary) are more suitable in that case.
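If you do want a single goodness-of-fit number, one commonly used option (shown here as an optional sketch) is McFadden's pseudo-R2, computed from the deviances stored in the fitted model object:
1 - lfit$deviance / lfit$null.deviance  # McFadden's pseudo R-squared: closer to 1 means a better fit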
Now we can assess our model with the testing set. First, let’s extract sex, age and fare variables from the testing set using:
test2 <- subset(test, select=c(4,5,9)) # columns 4, 5 and 9 hold sex, age and fare in this dataset
Then we can use our model to make predictions of survival for the test set:
results <- predict(lfit, newdata=test2,type='response')
The values returned are between 0 and 1 because, with type='response', predict returns the predicted probability of survival for each passenger. We need to convert these to outputs of either 0 or 1, indicating whether the passenger is predicted to have survived. We can say that any value below 0.5 is a 0 and any value above 0.5 is a 1; essentially the cut-off point is 0.5. To do this use:
results <- ifelse(results > 0.5,1,0)
Then we can calculate the accuracy of the model using:
mean(results == test$survived) ## the comparison inside the parentheses returns TRUE (1) or FALSE (0), so the mean is the proportion of correct predictions
As you can see, the model achieves an accuracy of roughly 0.78.
Your numbers may be a little different, as the sample function makes a random selection. That means that if you re-run the code a different set of observations will be chosen, unless you use the set.seed() function, as sketched below.
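For example, a sketch (the seed value 123 is arbitrary):
set.seed(123)  # fix the random number generator so the train/test split is reproducible
row_number <- sort(sample(nrow(data), nrow(data)*.7))  # the split will now be the same on every run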
However, an accuracy measure like this doesn't give you the full picture of how good your predictions were, i.e. whether the model did well or badly in any particular area, or where there is room for improvement. If you are interested in a more in-depth evaluation of your classification model, try using a confusion matrix. You can do this with the following lines of code (check the error message if a package is missing):
install.packages("caret") # Install package caret for model validation.
library(caret) # Load package caret
confusionMatrix(as.factor(results), as.factor(test$survived)) # Create a confusion matrix from our predicted results and the real data; both vectors must be converted to factors first, hence the use of as.factor
See if you can find out what each of the reported measures means; some of them are more useful than others. If you have finished early, try to see if you can use sex as the forecast variable and survived, age and fare as the predictor variables (a starting sketch is given below).
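If you attempt that last exercise, one possible starting point (a sketch only, assuming sex is stored as text or a factor with two levels) is:
lfit2 <- glm(as.factor(sex) ~ survived + age + fare, family=binomial, data=train)  # predict sex instead of survived
summary(lfit2)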