Uploading Cars Data Set
First of all, we uploaded the Cars Data Set, which has 3 variables with 50 observations each. You can import data sets directly with code or through RStudio's import dialog. We used the second option because it is easier, faster, and simpler for configuring the import parameters.
library("readr")
library(ggplot2)
cars <- read.csv("~/Desktop/UBIQUM/Data Analytics 2/Get Started with R/R Tutorial Data Sets/cars.csv")| name.of.car | speed.of.car | distance.of.car |
|---|---|---|
| Ford | 4 | 2 |
| Jeep | 4 | 4 |
| Honda | 7 | 10 |
| KIA | 7 | 10 |
| Toyota | 8 | 14 |
| BMW | 9 | 16 |
| Mercedes | 10 | 17 |
| GM | 10 | 18 |
| Hyundai | 10 | 20 |
| Infiniti | 11 | 20 |
| Land Rover | 11 | 22 |
| Lexus | 12 | 24 |
| Mazda | 12 | 26 |
| Mitsubishi | 12 | 26 |
| Nissan | 12 | 26 |
| GMC | 13 | 26 |
| Fiat | 13 | 28 |
| Chrysler | 13 | 28 |
| Dodge | 13 | 32 |
| Acura | 14 | 32 |
| Audi | 14 | 32 |
| Chevrolet | 14 | 34 |
| Buick | 14 | 34 |
| Ford | 15 | 34 |
| Jeep | 15 | 36 |
| Honda | 15 | 36 |
| KIA | 16 | 40 |
| Toyota | 16 | 40 |
| BMW | 17 | 42 |
| Mercedes | 17 | 46 |
| GM | 17 | 46 |
| Hyundai | 18 | 48 |
| Infiniti | 18 | 50 |
| Land Rover | 18 | 52 |
| Lexus | 18 | 54 |
| Mazda | 19 | 54 |
| Mitsubishi | 19 | 56 |
| Nissan | 19 | 56 |
| GMC | 20 | 60 |
| Fiat | 20 | 64 |
| Chrysler | 20 | 66 |
| Dodge | 20 | 68 |
| Acura | 20 | 70 |
| Audi | 22 | 76 |
| Chevrolet | 23 | 80 |
| Buick | 24 | 84 |
| Jeep | 24 | 85 |
| Honda | 24 | 92 |
| KIA | 24 | 93 |
| Dodge | 25 | 120 |
Preprocessing the data
Before starting the analysis and building the model, we cleaned the data. There were no errors or missing values in the data set. The only thing we did was rename the attributes.
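The renaming code is not echoed in the report; a minimal sketch, assuming base R's names() function, would be:

names(cars) <- c("Brand", "Speed", "Distance")  # short attribute names without dots
names(cars)                                     # print the new names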
## [1] "Brand" "Speed" "Distance"
This may seem pointless, but it will help us avoid coding errors, since we will be typing these words several times in R code, and it helps that they are short words without dots.
Getting to know the data
R is great for graphics, and in this scatter plot (created with the ggplot2 package) we can see that there is a linear correlation between speed and distance. That is why we will later use a linear regression algorithm to build our prediction model.
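The plotting and summary code is not shown in the report; a minimal sketch of the scatter plot and of the summary table below, assuming the renamed cars data frame, would be:

ggplot(cars, aes(x = Speed, y = Distance)) +  # scatter plot of Distance against Speed
  geom_point() +
  ggtitle("Distance vs. Speed")
summary(cars)                                 # summary statistics shown in the table below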
| Brand | Speed | Distance |
|---|---|---|
| Dodge : 3 | Min. : 4.0 | Min. : 2.00 |
| Honda : 3 | 1st Qu.:12.0 | 1st Qu.: 26.00 |
| Jeep : 3 | Median :15.0 | Median : 36.00 |
| KIA : 3 | Mean :15.4 | Mean : 42.98 |
| Acura : 2 | 3rd Qu.:19.0 | 3rd Qu.: 56.00 |
| Audi : 2 | Max. :25.0 | Max. :120.00 |
| (Other):34 | | |
Looking for Outliers
Looking at the scatter plot, we had the intuition that there could be some outliers that would affect our algorithm, so we created boxplots for both attributes, which is a reliable way to identify outliers.
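The boxplot code is not shown in the report; a minimal sketch, assuming base R's boxplot() function and storing the Distance boxplot in an object called box (as the outlier-removal code further below suggests), would be:

boxplot(cars$Speed, main = "Speed Boxplot")               # boxplot of the Speed attribute
box <- boxplot(cars$Distance, main = "Distance Boxplot")  # box$out will contain the outlier value(s)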
Excluding Outliers
We noticed that there is one outlier in our dependent variable, the one we want to predict (Distance). So, before creating our model, we need to remove this outlier from the data set. We did this using the which() function and the out component of the boxplot object.
cars2 <- cars[-which(cars$Distance == box$out) , ]
boxplot(cars2$Distance, main = "Distance Boxplot without Outlier")

We created another data set, cars2, in case we still need the original one for something. We prefer to keep two separate data sets instead of making changes to the original one.
Creating testing and training sets
Now it is time to prepare the data sets to train and test our model. We decided to split the data using about 70% (34 observations) for training and 30% (15 observations) for testing.
Data Set partition
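The code that computed the partition sizes is not shown; a minimal sketch, assuming a plain 70/30 split of the 49 remaining observations, would be:

trainSize <- round(nrow(cars2) * 0.7)  # 70% of 49 observations = 34
testSize  <- nrow(cars2) - trainSize   # remaining 30% = 15
trainSize
testSize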
## [1] 34
## [1] 15
Creating the test set
The Plan of Attack said to use "123" as the seed, but we decided to use "1" because with "123" we got some errors in the prediction.
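The sampling code is not echoed in the report; a minimal sketch, assuming base R's sample() function (the indices printed below come from the original run and may not be exactly reproducible across R versions), would be:

set.seed(1)                                              # seed "1" instead of "123"
trainIndices <- sample(seq_len(nrow(cars2)), trainSize)  # random row indices for the training set
trainSet <- cars2[trainIndices, ]                        # 34 training observations
testSet  <- cars2[-trainIndices, ]                       # 15 testing observations
trainIndices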
## [1] 4 39 1 34 23 43 14 18 33 21 40 10 7 9 15 48 25 44 5 31 2 38 41
## [24] 12 35 47 20 3 6 28 45 46 27 49
Creating Linear Regression Model
As we said before, we are going to use a linear regression algorithm for our prediction model. Training it on the training set, we see that our model, called LRM, gives trustworthy results, with an Adjusted R-squared of about 0.96 (96%).
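The model-fitting code is not echoed in the report, but the Call in the output below shows the formula used; the step would be:

LRM <- lm(Distance ~ Speed, data = trainSet)  # linear regression of Distance on Speed
summary(LRM)                                  # produces the output below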
##
## Call:
## lm(formula = Distance ~ Speed, data = trainSet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.185 -4.282 -1.733 3.177 11.235
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.0423 2.5570 -9.793 3.77e-11 ***
## Speed 4.4517 0.1566 28.421 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.285 on 32 degrees of freedom
## Multiple R-squared: 0.9619, Adjusted R-squared: 0.9607
## F-statistic: 807.7 on 1 and 32 DF, p-value: < 2.2e-16
Prediction with Linear Regression Model
Time to face the real world and see if our model is able to predict the distances in the test data.
To present the results, we decided to build a table where we can easily see the "Distance", the "Prediction", the "AbsError" and the "RelError".
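The prediction step itself is not shown; a minimal sketch, assuming the standard predict() function applied to the LRM model, would be:

PredictionCars <- predict(LRM, newdata = testSet)  # predicted Distance for the 15 test observations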
testSet$Prediction <- round(PredictionCars, digits=2)
testSet$AbsError <- round((testSet$Prediction - testSet$Distance), digits=2)
testSet$RelError <- paste(round(((testSet$AbsError / testSet$Distance)*100), digits=2),"%")
knitr::kable(testSet)

| | Brand | Speed | Distance | Prediction | AbsError | RelError |
|---|---|---|---|---|---|---|
| 8 | GM | 10 | 18 | 19.47 | 1.47 | 8.17 % |
| 11 | Land Rover | 11 | 22 | 23.93 | 1.93 | 8.77 % |
| 13 | Mazda | 12 | 26 | 28.38 | 2.38 | 9.15 % |
| 16 | GMC | 13 | 26 | 32.83 | 6.83 | 26.27 % |
| 17 | Fiat | 13 | 28 | 32.83 | 4.83 | 17.25 % |
| 19 | Dodge | 13 | 32 | 32.83 | 0.83 | 2.59 % |
| 22 | Chevrolet | 14 | 34 | 37.28 | 3.28 | 9.65 % |
| 24 | Ford | 15 | 34 | 41.73 | 7.73 | 22.74 % |
| 26 | Honda | 15 | 36 | 41.73 | 5.73 | 15.92 % |
| 29 | BMW | 17 | 42 | 50.64 | 8.64 | 20.57 % |
| 30 | Mercedes | 17 | 46 | 50.64 | 4.64 | 10.09 % |
| 32 | Hyundai | 18 | 48 | 55.09 | 7.09 | 14.77 % |
| 36 | Mazda | 19 | 54 | 59.54 | 5.54 | 10.26 % |
| 37 | Mitsubishi | 19 | 56 | 59.54 | 3.54 | 6.32 % |
| 42 | Dodge | 20 | 68 | 63.99 | -4.01 | -5.9 % |