Multiple linear regression is linear regression with more than one predictor variable, i.e. a model of the form y = b0 + b1x1 + b2x2 + ... + bkxk. It produces a model that identifies the best weighted combination of the independent variables for predicting the forecast variable. In R, it is used in exactly the same way as simple linear regression.
To show how it works we will use the state.x77 dataset, which ships with R (in the datasets package). To have a look at the dataset, type head(state.x77) and summary(state.x77). The dataset contains the following data for the 50 US states:
| Variable | Description |
|---|---|
| Population | population estimate as of 1 July 1975 (in thousands) |
| Income | per capita income (1974) |
| Illiteracy | illiteracy in 1970 (% of population) |
| Life Exp | life expectancy in years (1969-1971) |
| Murder | murder and non-negligent manslaughter rate per 100,000 population (1976) |
| HS Grad | percent high-school graduates (1970) |
| Frost | mean number of days with minimum temperature below freezing (1931-1960) in capital or large city |
| Area | land area in square miles |
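Before modelling, it is worth taking a quick look at the data. As mentioned above, you can do this with:
head(state.x77)     # first six rows of the matrix
summary(state.x77)  # minimum, quartiles, mean and maximum for each variable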
As an example, we will use the Population, Illiteracy, Income and Frost variables to predict the Murder variable. First, we can see how these variables correlate using:
cor(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")])
(hint: you first need to install and load the corrplot and corrgram packages, as shown below)
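A minimal way to do this (install.packages is only needed the first time on a machine):
install.packages("corrplot")  # install the corrplot package (once)
install.packages("corrgram")  # install the corrgram package (once)
library(corrplot)             # load corrplot
library(corrgram)             # load corrgram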
With method="circle", corrplot will produce something like the plot below:
## corrplot 0.92 loaded
corrplot(cor(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")]), method="circle")
In its simplest form, corrgram will produce the following plot (intense blue indicates a strong positive correlation; intense red a strong negative correlation):
corrgram(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")])
You can also use the following code to produce a matrix of scatterplots:
pairs(state.x77[,c("Murder", "Population","Illiteracy", "Income", "Frost")])
From this you should be able to see some interesting relationships and understand which variables would be good for predicting other variables.
Now we will use the data to fit a multiple linear regression model.
(hint: as state.x77 is stored as a matrix and the lm function expects a data frame, the data should be converted to a data frame using the as.data.frame function)
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=as.data.frame(state.x77))
summary(fit)
fit2 <- lm(Murder ~ Population + Illiteracy, data=as.data.frame(state.x77))
Based on the correlation matrix, we might have expected the relationship of Murder with Frost to be more significant than that of Murder with Population. One explanation for what we observe is that, because Frost is strongly negatively correlated with Illiteracy, all the information in the former variable is subsumed by the latter.
A good model not only needs to fit the data well, it also needs to be parsimonious. That is, a good model should only be as complex as necessary to describe the dataset. If you are choosing between a very simple model with one independent variable and a very complex model with, say, ten independent variables, the more complex model needs to provide a much better fit to the data in order to justify its increased complexity. If it can't, the simpler model is preferable. To compare the fits of two models, you can use the anova() function with the regression objects as its two arguments.
View the results of the new model. What can you say about the goodness of fit of the new model compared to the previous one?
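You can view them with:
summary(fit2)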
Compare the two models and conclude whether the removal of the independent variables has any significant effect on the explanatory power of the model.
(hint: the anova() function takes the model objects as arguments and returns an ANOVA test of whether the more complex model is significantly better at capturing the data than the simpler model. If the resulting p-value is sufficiently low (usually less than 0.05), we conclude that the more complex model is significantly better than the simpler model and favor the more complex model. If the p-value is not sufficiently low (usually greater than 0.05), we should favor the simpler model.)
anova(fit,fit2)
What we just did is essentially stepwise regression (more precisely, backward elimination): the idea is to start with as many predictor variables as possible and then repeatedly remove the least significant variable from the model until all the remaining predictor variables are significant. This can be done automatically in R using the step function. To test this, let's first create a model using the Population, Illiteracy, Income and Frost variables to predict the Murder variable:
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=as.data.frame(state.x77))
fit4 <- step(fit, direction="backward") # backward elimination: repeatedly drops the least useful variable from the previous model
## Start: AIC=97.75
## Murder ~ Population + Illiteracy + Income + Frost
##
## Df Sum of Sq RSS AIC
## - Frost 1 0.021 289.19 95.753
## - Income 1 0.057 289.22 95.759
## <none> 289.17 97.749
## - Population 1 39.238 328.41 102.111
## - Illiteracy 1 144.264 433.43 115.986
##
## Step: AIC=95.75
## Murder ~ Population + Illiteracy + Income
##
## Df Sum of Sq RSS AIC
## - Income 1 0.057 289.25 93.763
## <none> 289.19 95.753
## - Population 1 43.658 332.85 100.783
## - Illiteracy 1 236.196 525.38 123.605
##
## Step: AIC=93.76
## Murder ~ Population + Illiteracy
##
## Df Sum of Sq RSS AIC
## <none> 289.25 93.763
## - Population 1 48.517 337.76 99.516
## - Illiteracy 1 299.646 588.89 127.311
summary(fit4)
##
## Call:
## lm(formula = Murder ~ Population + Illiteracy, data = as.data.frame(state.x77))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7652 -1.6561 -0.0898 1.4570 7.6758
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.652e+00 8.101e-01 2.039 0.04713 *
## Population 2.242e-04 7.984e-05 2.808 0.00724 **
## Illiteracy 4.081e+00 5.848e-01 6.978 8.83e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.481 on 47 degrees of freedom
## Multiple R-squared: 0.5668, Adjusted R-squared: 0.5484
## F-statistic: 30.75 on 2 and 47 DF, p-value: 2.893e-09
As you can see, the conclusion is the same: the procedure keeps only Population and Illiteracy.
Hopefully you have managed to produce the expected results. Now it is time to use this information to make a prediction. As a first step, use only the regression output below to make an informed prediction: what is the estimated murder rate per 100,000 population for a state with a population of 10,000 and a 1.1% illiteracy rate?
|  | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 1.6515497 | 0.8101121 | 2.038668 | 0.0471304 |
| Population | 0.0002242 | 0.0000798 | 2.807789 | 0.0072414 |
| Illiteracy | 4.0807366 | 0.5848156 | 6.977817 | 0.0000000 |
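If you want to check your arithmetic, a minimal sketch of the hand calculation is to plug the coefficient estimates from the table into the fitted equation:
# estimated murder rate = intercept + (Population coefficient * 10000) + (Illiteracy coefficient * 1.1)
1.6515497 + 0.0002242 * 10000 + 4.0807366 * 1.1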
Let's see if your prediction is close to the value returned by the predict function. Run the following code:
predict(fit2, list(Population=10000, Illiteracy=1.1))
As a final step in multiple linear regression analysis, let's see how to model an interaction effect. In some texts you will see the interaction effect referred to as a moderating effect. For example, we have observed that a state's population increases the murder rate. Is the effect the same, however, for states with different income? To examine this, you need to regress Murder on the Population and Income variables, also including the interaction term Population:Income, as in the sketch below.
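A minimal sketch of that model (the object name fit5 is just a suggestion) is:
fit5 <- lm(Murder ~ Population + Income + Population:Income, data=as.data.frame(state.x77))
summary(fit5)  # the coefficient table should look like the one below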
|  | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 6.3119523 | 4.7718006 | 1.3227611 | 0.1924534 |
| Population | 0.0033417 | 0.0012810 | 2.6086702 | 0.0122198 |
| Income | -0.0002390 | 0.0010350 | -0.2309358 | 0.8183887 |
| Population:Income | -0.0000006 | 0.0000003 | -2.3523098 | 0.0229921 |
What is the interpretation of that coefficient?
Logistic regression is used when you are trying to predict a categorical variable and your predictor variables are continuous, categorical or both. It is referred to as binomial logistic regression when your forecast variable is binary (taking a value of either 0 or 1) and as multinomial logistic regression when your forecast variable takes more than two values (for example: low, medium and high). In R you can use the glm function for logistic regression; a generic sketch is shown below. For this section of the seminar we will use the titanic dataset.
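In its simplest form the call looks like the sketch below, where y, x1, x2 and mydata are hypothetical names for a binary outcome, two predictors and a data frame:
model <- glm(y ~ x1 + x2, family=binomial, data=mydata)  # family=binomial requests logistic regression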
First, import the titanic dataset from Minerva (I named the data frame data). You will then need to impute the missing values in the age and fare variables, using simple imputation methods such as those you used in the BADS Seminar.
(hint: you should replace the missing values with the mean of each variable. You can use nrow(data[which(is.na(data$age)),]) to see how many rows have no age data.)
data$age[is.na(data$age)] <- mean(data$age, na.rm=TRUE)  # replace missing ages with the mean age
data <- data[which(!is.na(data$survived)),]              # drop rows where survived itself is missing
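The fare variable also needs imputing; assuming its column is called fare, the same approach works:
data$fare[is.na(data$fare)] <- mean(data$fare, na.rm=TRUE)  # replace missing fares with the mean fare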
If you now type nrow(data) you will see there are 1,309 rows in the data. As this is a large dataset, we can split it into a training set and a testing set, using 70% of the rows for training and the remaining 30% for testing. We can then fit the logistic regression on the training set and evaluate the model on the testing set. To split the data, we can use:
row_number <- sort(sample(nrow(data), nrow(data)*.7)) # randomly sample 70% of the row indices
train<-data[row_number,] # 70% is training data
test<-data[-row_number,] # 30% is testing data
Now let’s use logistic regression on the training set with:
lfit <- glm(survived ~ sex + age + fare, family=binomial, data=train)
summary(lfit) #Take a look at the results
Question: What is the relationship of the variables sex and fare with the probability of having survived?
You will notice that no R2 value is reported. This is because R2, the goodness-of-fit measure produced in linear regression, cannot be calculated for logistic regression; other measures (such as the deviance and AIC shown in the summary) are more suitable in that case.
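If you do want a single goodness-of-fit number, one commonly used option (shown here as an optional sketch) is McFadden's pseudo-R2, computed from the deviances stored in the fitted model object:
1 - lfit$deviance / lfit$null.deviance  # McFadden's pseudo R-squared: closer to 1 means a better fit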
Now we can assess our model with the testing set. First, let’s extract sex, age and fare variables from the testing set using:
test2 <- subset(test, select=c(4,5,9)) # columns 4, 5 and 9 hold sex, age and fare in this dataset
Then we can use our model to make predictions of survival for the test set:
results <- predict(lfit, newdata=test2,type='response')
The values returned are between 0 and 1 because, with type='response', predict returns the predicted probability of survival for each passenger. We need to convert these to outputs of either 0 or 1, indicating whether the passenger is predicted to have survived. We can say that any value below 0.5 is a 0 and any value above 0.5 is a 1; essentially the cut-off point is 0.5. To do this use:
results <- ifelse(results > 0.5,1,0)
Then we can calculate the accuracy of the model using:
mean(results == test$survived) ## the comparison inside the parentheses returns TRUE (1) or FALSE (0), so the mean is the proportion of correct predictions
As you can see, the model achieves an accuracy of roughly 0.78.
Your numbers may be a little different, as the sample function makes a random selection. That means that if you re-run the code a different set of observations will be chosen, unless you use the set.seed() function, as sketched below.
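For example, a sketch (the seed value 123 is arbitrary):
set.seed(123)  # fix the random number generator so the train/test split is reproducible
row_number <- sort(sample(nrow(data), nrow(data)*.7))  # the split will now be the same on every run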
However, an accuracy measure like this doesn't give you the full picture of how good your predictions were, i.e. whether the model did well or badly in any particular area, or where there is room for improvement. If you are interested in a more in-depth evaluation of your classification model, try using a confusion matrix. You can do this with the following lines of code (check the error message if a package is missing):
install.packages("caret") # Install package caret for model validation.
library(caret) # Load package caret
confusionMatrix(as.factor(results), as.factor(test$survived)) # Create a confusion matrix from our predicted results and the real data; both vectors must be converted to factors first, hence the use of as.factor
See if you can find out what each of the reported measures means; some of them are more useful than others. If you have finished early, try to see if you can use sex as the forecast variable and survived, age and fare as the predictor variables (a starting sketch is given below).
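If you attempt that last exercise, one possible starting point (a sketch only, assuming sex is stored as text or a factor with two levels) is:
lfit2 <- glm(as.factor(sex) ~ survived + age + fare, family=binomial, data=train)  # predict sex instead of survived
summary(lfit2)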