library(mi)
library(tidyverse)
library(rpart)
In this discussion I will be looking at the Auto MPG dataset from UCI's Machine Learning Repository (Lichman 2013).

## Dataset importing and data preprocessing

The dataset contains vehicle data from the years 1970 to 1982 and provides 9 attributes, two of which are model and year. Because the same model was often produced in multiple years, I combined model and year into a factor variable called model.year, using the as.factor() and paste() functions, to create a unique label for each row.
# Read the raw UCI auto-mpg file and label the columns
mpg.data <- as.data.frame(read.table("auto-mpg.txt"))
names(mpg.data) <- c("mpg", "cylinders", "displacement", "horsepower", "weight",
                     "acceleration", "year", "origin", "model")
# Combine year and model into a single factor that uniquely labels each row
model.year <- as.factor(paste(mpg.data$year, mpg.data$model, sep = " "))
# Convert horsepower to numeric; as.character() guards against factor level
# codes, and the "?" placeholders in the raw file become NA
mpg.data$horsepower <- as.numeric(as.character(mpg.data$horsepower))
mpg.data$model.year <- model.year
mpg.data$model <- NULL
mpg.data$year <- NULL
# Treat origin and cylinders as categorical predictors
mpg.data$origin <- as.factor(mpg.data$origin)
mpg.data$cylinders <- as.factor(mpg.data$cylinders)
remove(model.year)
# Drop the rows with missing horsepower
mpg.data <- na.omit(mpg.data)
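Before splitting the data, it is worth confirming that the preprocessing produced the column types we expect. A quick check such as the following would do it (output omitted here):

# Sanity check of column types and the cleaned horsepower column
str(mpg.data)
summary(mpg.data$horsepower)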
Next I can split the data into training and test sets using the tried-and-true rand_sample() function that I wrote earlier in this course.
# Return a sorted vector of row indices for a reproducible training split
rand_sample <- function(x, fraction = 0.75, seed = 123){
  set.seed(seed)
  size <- nrow(x)
  # floor() keeps the sample size an integer when fraction * size is not whole
  training_size <- floor(fraction * size)
  results <- sort(sample(size, training_size))
  return(results)
}
train.idx <- rand_sample(mpg.data)
mpg.train <- mpg.data[train.idx, ]
mpg.test <- mpg.data[-train.idx, ]
Let’s start off by performing some standard multiple linear regression on this bad boy.
regressor1 <- lm(formula = mpg ~ cylinders + displacement + horsepower + weight
+ acceleration + origin, data = mpg.train)
summary(regressor1)
Call:
lm(formula = mpg ~ cylinders + displacement + horsepower + weight +
acceleration + origin, data = mpg.train)
Residuals:
Min 1Q Median 3Q Max
-10.8886 -2.2927 -0.3991 2.1572 13.1365
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.6335749 3.9726375 8.466 1.39e-15 ***
cylinders4 9.8726336 2.9051401 3.398 0.000775 ***
cylinders5 12.3660384 4.0704850 3.038 0.002604 **
cylinders6 6.0155001 3.1164031 1.930 0.054572 .
cylinders8 8.5857033 3.4920116 2.459 0.014544 *
displacement 0.0044192 0.0110508 0.400 0.689529
horsepower -0.0667207 0.0189282 -3.525 0.000494 ***
weight -0.0042418 0.0009653 -4.394 1.57e-05 ***
acceleration -0.0433673 0.1361534 -0.319 0.750327
origin2 0.5000341 0.7746055 0.646 0.519104
origin3 3.3162630 0.7702200 4.306 2.30e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.942 on 283 degrees of freedom
Multiple R-squared: 0.7633, Adjusted R-squared: 0.7549
F-statistic: 91.25 on 10 and 283 DF, p-value: < 2.2e-16
If we set our significance threshold at 95% confidence (that is, we require p < 0.05), we can start eliminating predictors one at a time, a process known as backward elimination. The predictor with the highest p-value is acceleration, so we can start by removing that.
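As a cross-check on the manual elimination, base R's drop1() reports an F-test for dropping each term from the current model; the call below is a sketch of that check (its output is not shown here):

# Optional cross-check: F-tests for dropping each term from regressor1
drop1(regressor1, test = "F")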
regressor2 <- lm(formula = mpg ~ cylinders + displacement + horsepower + weight
+ origin, data = mpg.train)
summary(regressor2)
Call:
lm(formula = mpg ~ cylinders + displacement + horsepower + weight +
origin, data = mpg.train)
Residuals:
Min 1Q Median 3Q Max
-11.0329 -2.3205 -0.3886 2.2092 12.9491
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 32.9905500 3.4160264 9.658 < 2e-16 ***
cylinders4 9.7957802 2.8905194 3.389 0.000801 ***
cylinders5 12.2867261 4.0564293 3.029 0.002680 **
cylinders6 5.9532702 3.1053487 1.917 0.056228 .
cylinders8 8.5393912 3.4834595 2.451 0.014832 *
displacement 0.0047530 0.0109836 0.433 0.665536
horsepower -0.0629842 0.0148313 -4.247 2.94e-05 ***
weight -0.0043829 0.0008564 -5.118 5.71e-07 ***
origin2 0.4940332 0.7731504 0.639 0.523346
origin3 3.3167358 0.7689991 4.313 2.22e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.936 on 284 degrees of freedom
Multiple R-squared: 0.7632, Adjusted R-squared: 0.7557
F-statistic: 101.7 on 9 and 284 DF, p-value: < 2.2e-16
Now displacement has the highest p-value, so we can pull that next.
regressor3 <- lm(formula = mpg ~ cylinders + horsepower + weight + origin,
data = mpg.train)
summary(regressor3)
Call:
lm(formula = mpg ~ cylinders + horsepower + weight + origin,
data = mpg.train)
Residuals:
Min 1Q Median 3Q Max
-11.0849 -2.3032 -0.3701 2.1809 12.9751
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 32.7158270 3.3517263 9.761 < 2e-16 ***
cylinders4 10.0120888 2.8429076 3.522 0.000499 ***
cylinders5 12.4948934 4.0220573 3.107 0.002083 **
cylinders6 6.3986156 2.9256755 2.187 0.029551 *
cylinders8 9.2595564 3.0558400 3.030 0.002669 **
horsepower -0.0601953 0.0133388 -4.513 9.37e-06 ***
weight -0.0041990 0.0007425 -5.655 3.78e-08 ***
origin2 0.3909378 0.7344792 0.532 0.594957
origin3 3.2236380 0.7372386 4.373 1.72e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.931 on 285 degrees of freedom
Multiple R-squared: 0.763, Adjusted R-squared: 0.7564
F-statistic: 114.7 on 8 and 285 DF, p-value: < 2.2e-16
Although origin2 is not significant, origin3 is, so I will keep the entire origin factor rather than drop part of a dummy-coded variable.
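If we wanted to test the origin factor as a whole rather than its individual dummy levels, a partial F-test comparing the model with and without origin would be one way to do it. A sketch (output not shown; the regressor3.noorigin name is just for illustration):

# Partial F-test for the origin factor as a whole
regressor3.noorigin <- update(regressor3, . ~ . - origin)
anova(regressor3.noorigin, regressor3)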
Now that we have our regressor, we can generate predictions on the test set.
predictor <- predict(regressor3, newdata = mpg.test)
Now we can create a data frame to look at the results.
result.data <- data.frame(model.year = mpg.test$model.year,
                          prediction = predictor,
                          actual = mpg.test$mpg)
# Absolute percent difference between predicted and actual mpg
percent.diff <- abs(result.data$prediction - result.data$actual) /
  result.data$actual * 100
result.data$percent.diff <- percent.diff
remove(percent.diff)
paste("Percent difference:", round(mean(result.data$percent.diff)))
[1] "Percent difference: 12"
result.data$prediction <- round(result.data$prediction, 2)
result.data$percent.diff <- round(result.data$percent.diff, 2)
print(result.data)
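Mean percent difference is only one way to summarize these errors; a root-mean-squared-error figure on the same predictions would be a common complement (sketch only, output not shown):

# RMSE of the linear regression predictions on the test set
sqrt(mean((predictor - mpg.test$mpg)^2))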
A 12 percent average difference isn’t very good. Is there a better way to do this?
Let’s try a decision tree.
# Fit a regression tree on the test set, using all remaining columns as predictors
regressor4 <- rpart(formula = mpg ~ ., data = mpg.test)
# predict.rpart takes newdata (data is not an argument of predict)
dtpred <- predict(regressor4, newdata = mpg.test)
plot(regressor4, main = "Decision Tree Regression")
text(regressor4) # add split labels to the tree plot
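For a more readable picture of the tree, the rpart.plot package (assuming it is installed; it is not loaded above) offers a one-line alternative:

# Alternative visualization, assuming the rpart.plot package is available
rpart.plot::rpart.plot(regressor4)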
Now let’s evaluate how our model performed:
result.data2 <- data.frame(model.year = mpg.test$model.year,
                           prediction = dtpred,
                           actual = mpg.test$mpg)
percent.diff <- abs(result.data2$prediction - result.data2$actual) /
  result.data2$actual * 100
result.data2$percent.diff <- percent.diff
remove(percent.diff)
paste("Percent difference:", round(mean(result.data2$percent.diff)))
[1] "Percent difference: 7"
result.data2$prediction <- round(result.data2$prediction, 2)
result.data2$percent.diff <- round(result.data2$percent.diff, 2)
print(result.data2)
This 7% result is much better.
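To put the two models side by side, a small summary data frame of their mean percent differences could be built from the results we already have (a sketch; values not reproduced here):

# Side-by-side comparison of mean percent difference for the two models
data.frame(model = c("linear regression", "decision tree"),
           mean.percent.diff = c(mean(result.data$percent.diff),
                                 mean(result.data2$percent.diff)))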
I chose linear regression to start off with because I feel it is one of the most common and basic methods of predictive analysis. “The linear regression model analyzes the relationship between the response or dependent variable and a set of independent or predictor variables. This relationship is expressed as an equation that predicts the response variable as a linear function of the parameters… [which] are adjusted so that a measure of fit is optimized” (Strickland 2014).
Basically, linear regression uses the old geometric formula that we’ve all been taught, \[ y = mx + b \]
to fit a dependent variable (\(y\)) linearly to an independent variable (\(x\)). The measure of fit discussed above is commonly quantified by the \(R^2\) value; the closer it is to 1, the better the fit. During the linear regression portion we had \(R^2 \approx 0.75\), which is not very good - it suggests the relationship is not purely linear and that linear regression was a poor choice for this data.
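For completeness, the model we actually fit generalizes this single-variable form to several predictors at once; schematically (the coefficient symbols below are placeholders, not the exact estimates reported above):

\[ \widehat{\text{mpg}} = \beta_0 + \beta_1 \cdot \text{horsepower} + \beta_2 \cdot \text{weight} + \sum_j \gamma_j \cdot \text{cylinders}_j + \sum_k \delta_k \cdot \text{origin}_k \]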
For a simpler example, we can look at some data that I collected during a freshman physics lab in my undergraduate degree program. Part of the assignment was to perform linear regression on the data by hand, so I thought it would be interesting to load the data here and take another look at it.
dataset <- read_csv("2d.csv")
Parsed with column specification:
cols(
v = col_double(),
m = col_double()
)
y.pred <- lm(m ~ ., data = dataset)   # fit m as a linear function of v
plot(y.pred)                          # standard lm diagnostic plots
plot(x = dataset$v, y = dataset$m)    # scatter plot of the raw data
abline(y.pred)                        # overlay the fitted line
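To tie this back to the earlier discussion of fit, the \(R^2\) for this simple fit can be pulled straight from the model summary (sketch; value not shown here):

# R-squared for the physics lab fit
summary(y.pred)$r.squared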
Decision trees “utilize a tree structure to model the relationships among features and the potential outcomes… this structure earned its name due to the fact that it mirrors how a literal tree begins at a wide trunk, which if followed upward, splits into narrower and narrower branches. In much the same way, a decision tree classifier uses a structure of branching decisions which channel examples into a final predicted class value” (Lantz 2015).
The main reason I picked a decision tree is that I originally wanted to use a random forest model, but my R installation (3.4.3, “Kite-Eating Tree”) will not recognize that I have Java installed, so I cannot use rJava or any of the packages that depend on it. I decided that a decision tree was the next best thing.
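For reference, here is a minimal sketch of what the random-forest version might look like, assuming the randomForest package (which does not depend on rJava) were available on the system; the regressor.rf and rfpred names are just for illustration:

# Hypothetical random forest fit, assuming the randomForest package is installed
library(randomForest)
# model.year is excluded because randomForest cannot handle factors with this many levels
regressor.rf <- randomForest(mpg ~ . - model.year, data = mpg.train, ntree = 500)
rfpred <- predict(regressor.rf, newdata = mpg.test)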
Linear regression is a good model to use when you want fast, simple results, and it works especially well when you have one dependent and one independent variable. There are closely related models, like polynomial regression, that can provide accurate results as well (Brown 2017).
It is not without its weaknesses, though. More complicated datasets, like the mpg dataset we looked at earlier, do not do well with plain linear regression, and noisy data sets can confuse the model as well.
Decision trees are great at describing data you already have. You can take prior data and build a model that captures the structure of that data and provides solid predictions for points that fall within its range. Where the model can fail, however, is when it is used to extrapolate or forecast future results.
Decision trees also take some understanding to tune. The graph above captures the data at only one point: at about 23 years in, the model is perfect, but all other points are effectively ignored. In that situation you would be better off averaging all the data points to get your prediction.
A better result for the same data can be found below.
Using two models, linear regression and decision trees, we were able to predict the mpg of various classic cars. That being said, the linear regression model’s predictions were off by about 12% on average (roughly 88% accuracy by this measure), while the decision tree’s were off by about 7% (roughly 93%). I suspect that a random forest model would improve on these results further.
Brown, Daniel. 2017. “But I Regress: Using R to Model Data in Different Ways.” https://rpubs.com/dbrown/regression.
Lantz, Brett. 2015. Machine Learning with R. Birmingham, United Kingdom: Packt Publishing.
Lichman, M. 2013. “UCI Machine Learning Repository.” University of California, Irvine, School of Information and Computer Sciences. http://archive.ics.uci.edu/ml.
Strickland, Jeffrey S. 2014. Predictive Analytics Using R. Raleigh, NC, United States: Lulu, Inc.