Uploading Cars Data Set
First of all, we uploaded the Cars Data Set, which has 3 variables with 50 observations each. You can import data sets directly with code or through RStudio's import dialog. We used the second option because it is easier, faster, and simpler for configuring the import parameters.
library("readr")
library(ggplot2)
cars <- read.csv("~/Desktop/UBIQUM/Data Analytics 2/Get Started with R/R Tutorial Data Sets/cars.csv")| name.of.car | speed.of.car | distance.of.car |
|---|---|---|
| Ford | 4 | 2 |
| Jeep | 4 | 4 |
| Honda | 7 | 10 |
| KIA | 7 | 10 |
| Toyota | 8 | 14 |
| BMW | 9 | 16 |
| Mercedes | 10 | 17 |
| GM | 10 | 18 |
| Hyundai | 10 | 20 |
| Infiniti | 11 | 20 |
| Land Rover | 11 | 22 |
| Lexus | 12 | 24 |
| Mazda | 12 | 26 |
| Mitsubishi | 12 | 26 |
| Nissan | 12 | 26 |
| GMC | 13 | 26 |
| Fiat | 13 | 28 |
| Chrysler | 13 | 28 |
| Dodge | 13 | 32 |
| Acura | 14 | 32 |
| Audi | 14 | 32 |
| Chevrolet | 14 | 34 |
| Buick | 14 | 34 |
| Ford | 15 | 34 |
| Jeep | 15 | 36 |
| Honda | 15 | 36 |
| KIA | 16 | 40 |
| Toyota | 16 | 40 |
| BMW | 17 | 42 |
| Mercedes | 17 | 46 |
| GM | 17 | 46 |
| Hyundai | 18 | 48 |
| Infiniti | 18 | 50 |
| Land Rover | 18 | 52 |
| Lexus | 18 | 54 |
| Mazda | 19 | 54 |
| Mitsubishi | 19 | 56 |
| Nissan | 19 | 56 |
| GMC | 20 | 60 |
| Fiat | 20 | 64 |
| Chrysler | 20 | 66 |
| Dodge | 20 | 68 |
| Acura | 20 | 70 |
| Audi | 22 | 76 |
| Chevrolet | 23 | 80 |
| Buick | 24 | 84 |
| Jeep | 24 | 85 |
| Honda | 24 | 92 |
| KIA | 24 | 93 |
| Dodge | 25 | 120 |
Preprocessing the data
Before starting the analysis and building the model, we cleaned the data. There were no errors or missing values in the data set. The only thing we did was rename the attributes.
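The renaming code is not echoed in the report; a minimal sketch, assuming base R's names() function, would be:

names(cars) <- c("Brand", "Speed", "Distance")  # short attribute names without dots
names(cars)                                     # print the new names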
## [1] "Brand" "Speed" "Distance"
This may seem pointless, but it will help us avoid coding errors, since we will be typing these words several times in R code, and it helps that they are short words without dots.
Getting to know the data
R is great for graphics, and in this scatter plot (created with the ggplot2 package) we can see that there is a linear correlation between speed and distance. That is why we will later use a linear regression algorithm to build our prediction model.
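The plotting and summary code is not shown in the report; a minimal sketch of the scatter plot and of the summary table below, assuming the renamed cars data frame, would be:

ggplot(cars, aes(x = Speed, y = Distance)) +  # scatter plot of Distance against Speed
  geom_point() +
  ggtitle("Distance vs. Speed")
summary(cars)                                 # summary statistics shown in the table below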
| Brand | Speed | Distance |
|---|---|---|
| Dodge : 3 | Min. : 4.0 | Min. : 2.00 |
| Honda : 3 | 1st Qu.:12.0 | 1st Qu.: 26.00 |
| Jeep : 3 | Median :15.0 | Median : 36.00 |
| KIA : 3 | Mean :15.4 | Mean : 42.98 |
| Acura : 2 | 3rd Qu.:19.0 | 3rd Qu.: 56.00 |
| Audi : 2 | Max. :25.0 | Max. :120.00 |
| (Other):34 | | |
Looking for Outliers
Looking at the scatter plot, we had the intuition that there could be some outliers that would affect our algorithm, so we created boxplots for both attributes, which is a reliable way to identify outliers.
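The boxplot code is not shown in the report; a minimal sketch, assuming base R's boxplot() function and storing the Distance boxplot in an object called box (as the outlier-removal code further below suggests), would be:

boxplot(cars$Speed, main = "Speed Boxplot")               # boxplot of the Speed attribute
box <- boxplot(cars$Distance, main = "Distance Boxplot")  # box$out will contain the outlier value(s)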
Excluding Outliers
We noticed that there is one outlier in our dependent variable, the one we want to predict (Distance). So, before creating our model, we need to remove this outlier from the data set. We did this using the which() function and the out component of the boxplot object.
cars2 <- cars[-which(cars$Distance == box$out) , ]
boxplot(cars2$Distance, main = "Distance Boxplot without Outlier")

We created another data set, cars2, in case we still need the original one for something. We prefer to keep two separate data sets instead of making changes to the original one.
Creating testing and training sets
Now it is time to prepare the data sets to train and test our model. We decided to split the data using about 70% (34 observations) for training and 30% (15 observations) for testing.
Data Set partition
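The code that computed the partition sizes is not shown; a minimal sketch, assuming a plain 70/30 split of the 49 remaining observations, would be:

trainSize <- round(nrow(cars2) * 0.7)  # 70% of 49 observations = 34
testSize  <- nrow(cars2) - trainSize   # remaining 30% = 15
trainSize
testSize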
## [1] 34
## [1] 15
Creating the test set
The Plan of Attack said to use "123" as the seed, but we decided to use "1" because with "123" we got some errors in the prediction.
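The sampling code is not echoed in the report; a minimal sketch, assuming base R's sample() function (the indices printed below come from the original run and may not be exactly reproducible across R versions), would be:

set.seed(1)                                              # seed "1" instead of "123"
trainIndices <- sample(seq_len(nrow(cars2)), trainSize)  # random row indices for the training set
trainSet <- cars2[trainIndices, ]                        # 34 training observations
testSet  <- cars2[-trainIndices, ]                       # 15 testing observations
trainIndices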
## [1] 4 39 1 34 23 43 14 18 33 21 40 10 7 9 15 48 25 44 5 31 2 38 41
## [24] 12 35 47 20 3 6 28 45 46 27 49
Creating Linear Regression Model
As we said before, we are going to use a linear regression algorithm for our prediction model. Training it on the training set, we see that our model, called LRM, gives trustworthy results, with an Adjusted R-squared of about 0.96 (96%).
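The model-fitting code is not echoed in the report, but the Call in the output below shows the formula used; the step would be:

LRM <- lm(Distance ~ Speed, data = trainSet)  # linear regression of Distance on Speed
summary(LRM)                                  # produces the output below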
##
## Call:
## lm(formula = Distance ~ Speed, data = trainSet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.185 -4.282 -1.733 3.177 11.235
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.0423 2.5570 -9.793 3.77e-11 ***
## Speed 4.4517 0.1566 28.421 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.285 on 32 degrees of freedom
## Multiple R-squared: 0.9619, Adjusted R-squared: 0.9607
## F-statistic: 807.7 on 1 and 32 DF, p-value: < 2.2e-16
Prediction with Linear Regression Model
Time to face the real world and see if our model is able to predict the distances in the test data.
To present the results, we decided to build a table where we can easily see the "Distance", the "Prediction", the "AbsError" and the "RelError".
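The prediction step itself is not shown; a minimal sketch, assuming the standard predict() function applied to the LRM model, would be:

PredictionCars <- predict(LRM, newdata = testSet)  # predicted Distance for the 15 test observations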
testSet$Prediction <- round(PredictionCars, digits=2)
testSet$AbsError <- round((testSet$Prediction - testSet$Distance), digits=2)
testSet$RelError <- paste(round(((testSet$AbsError / testSet$Distance)*100), digits=2),"%")
knitr::kable(testSet)

| | Brand | Speed | Distance | Prediction | AbsError | RelError |
|---|---|---|---|---|---|---|
| 8 | GM | 10 | 18 | 19.47 | 1.47 | 8.17 % |
| 11 | Land Rover | 11 | 22 | 23.93 | 1.93 | 8.77 % |
| 13 | Mazda | 12 | 26 | 28.38 | 2.38 | 9.15 % |
| 16 | GMC | 13 | 26 | 32.83 | 6.83 | 26.27 % |
| 17 | Fiat | 13 | 28 | 32.83 | 4.83 | 17.25 % |
| 19 | Dodge | 13 | 32 | 32.83 | 0.83 | 2.59 % |
| 22 | Chevrolet | 14 | 34 | 37.28 | 3.28 | 9.65 % |
| 24 | Ford | 15 | 34 | 41.73 | 7.73 | 22.74 % |
| 26 | Honda | 15 | 36 | 41.73 | 5.73 | 15.92 % |
| 29 | BMW | 17 | 42 | 50.64 | 8.64 | 20.57 % |
| 30 | Mercedes | 17 | 46 | 50.64 | 4.64 | 10.09 % |
| 32 | Hyundai | 18 | 48 | 55.09 | 7.09 | 14.77 % |
| 36 | Mazda | 19 | 54 | 59.54 | 5.54 | 10.26 % |
| 37 | Mitsubishi | 19 | 56 | 59.54 | 3.54 | 6.32 % |
| 42 | Dodge | 20 | 68 | 63.99 | -4.01 | -5.9 % |