Getting Started with R (Cars)

Pol Mayolas

2019-10-09

Importing the Cars Data Set

First, we imported the Cars data set, which has 3 variables and 50 observations. Data sets can be imported either directly with code or through the RStudio import dialog. We used the second option because it is easier and faster, and it makes the import parameters simpler to configure.

name.of.car speed.of.car distance.of.car
Ford 4 2
Jeep 4 4
Honda 7 10
KIA 7 10
Toyota 8 14
BMW 9 16
Mercedes 10 17
GM 10 18
Hyundai 10 20
Infiniti 11 20
Land Rover 11 22
Lexus 12 24
Mazda 12 26
Mitsubishi 12 26
Nissan 12 26
GMC 13 26
Fiat 13 28
Chrysler 13 28
Dodge 13 32
Acura 14 32
Audi 14 32
Chevrolet 14 34
Buick 14 34
Ford 15 34
Jeep 15 36
Honda 15 36
KIA 16 40
Toyota 16 40
BMW 17 42
Mercedes 17 46
GM 17 46
Hyundai 18 48
Infiniti 18 50
Land Rover 18 52
Lexus 18 54
Mazda 19 54
Mitsubishi 19 56
Nissan 19 56
GMC 20 60
Fiat 20 64
Chrysler 20 66
Dodge 20 68
Acura 20 70
Audi 22 76
Chevrolet 23 80
Buick 24 84
Jeep 24 85
Honda 24 92
KIA 24 93
Dodge 25 120
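For readers who prefer the code route to the RStudio dialog, here is a minimal sketch using `read.csv`. The real file name and location are not given in this report, so `csv_path` is a hypothetical stand-in: a temporary file written on the fly with the first three rows of the table above.

```r
# Hypothetical stand-in for the real file: first three rows of the data set.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "name.of.car,speed.of.car,distance.of.car",
  "Ford,4,2",
  "Jeep,4,4",
  "Honda,7,10"
), csv_path)

# Read the file; stringsAsFactors = TRUE stores the brand as a factor.
cars <- read.csv(csv_path, header = TRUE, stringsAsFactors = TRUE)

str(cars)  # column names and types
dim(cars)  # number of rows and columns
```

With the full file, `dim(cars)` should report 50 rows and 3 columns.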

Preprocessing the data

Before analysing the data and creating the model, we cleaned the data. There were no errors or missing values in the data set, so the only change we made was to rename the attributes. The new names are shown below.

## [1] "Brand"    "Speed"    "Distance"

This may seem unnecessary, but it helps us avoid coding errors: we will type these names many times in the R code, so it helps that they are short and contain no dots.
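The renaming code is not shown in this report. One common way to do it, sketched here on a three-row stand-in data frame, is to assign a character vector to `names()`:

```r
# Three-row stand-in with the original column names from the report.
cars <- data.frame(
  name.of.car     = c("Ford", "Jeep", "Honda"),
  speed.of.car    = c(4, 4, 7),
  distance.of.car = c(2, 4, 10)
)

# Replace all column names at once; the order must match the columns.
names(cars) <- c("Brand", "Speed", "Distance")

print(names(cars))
```

Assigning the whole vector at once is short, but it relies on column order, so it is worth checking `names(cars)` before and after.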

Getting to know the data

R is great for producing graphs. In this scatter plot (made with the ggplot2 package) we can see a linear relationship between speed and distance. That is why we will later use a linear regression algorithm to create our prediction model.

Brand Speed Distance
Dodge : 3 Min. : 4.0 Min. : 2.00
Honda : 3 1st Qu.:12.0 1st Qu.: 26.00
Jeep : 3 Median :15.0 Median : 36.00
KIA : 3 Mean :15.4 Mean : 42.98
Acura : 2 3rd Qu.:19.0 3rd Qu.: 56.00
Audi : 2 Max. :25.0 Max. :120.00
(Other):34 NA NA
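The plotting code is not included in this report, so the following is only a sketch of how a scatter plot and summary like these could be produced. It rebuilds the Speed and Distance columns from the table at the top of the report (Brand is left out for brevity) and assumes the ggplot2 package is installed; the plot is skipped if it is not.

```r
# Speed and Distance copied from the data table in this report (Brand omitted).
Speed <- rep(c(4, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 22, 23, 24, 25),
             times = c(2, 2, 1, 1, 3, 2, 4, 4, 4, 3, 2, 3, 4, 3, 5, 1, 1, 4, 1))
Distance <- c(2, 4, 10, 10, 14, 16, 17, 18, 20, 20, 22, 24, 26, 26, 26, 26, 28,
              28, 32, 32, 32, 34, 34, 34, 36, 36, 40, 40, 42, 46, 46, 48, 50, 52,
              54, 54, 56, 56, 60, 64, 66, 68, 70, 76, 80, 84, 85, 92, 93, 120)
cars <- data.frame(Speed = Speed, Distance = Distance)

# Numeric summary of each column (min, quartiles, mean, max).
print(summary(cars))

# Pearson correlation as a numeric check of the linear relationship.
speed_distance_cor <- cor(cars$Speed, cars$Distance)
print(speed_distance_cor)

# Scatter plot of Distance against Speed, drawn only if ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  scatter <- ggplot2::ggplot(cars, ggplot2::aes(x = Speed, y = Distance)) +
    ggplot2::geom_point() +
    ggplot2::labs(title = "Distance vs. Speed", x = "Speed", y = "Distance")
  print(scatter)
}
```

A correlation close to 1 supports the visual impression of a roughly straight-line relationship.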

Looking for Outliers

Looking at the scatter plot, we suspected that there could be some outliers that could affect our algorithm. To identify them, we created a boxplot for each of the two numeric attributes, which is a standard way of spotting outliers.
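As a sketch of this step (the original code is not shown), base R's `boxplot` both draws the plot and returns the points it flags as outliers in its `out` element. The data are rebuilt from the table at the top of the report, without the Brand column.

```r
# Speed and Distance copied from the data table in this report (Brand omitted).
Speed <- rep(c(4, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 22, 23, 24, 25),
             times = c(2, 2, 1, 1, 3, 2, 4, 4, 4, 3, 2, 3, 4, 3, 5, 1, 1, 4, 1))
Distance <- c(2, 4, 10, 10, 14, 16, 17, 18, 20, 20, 22, 24, 26, 26, 26, 26, 28,
              28, 32, 32, 32, 34, 34, 34, 36, 36, 40, 40, 42, 46, 46, 48, 50, 52,
              54, 54, 56, 56, 60, 64, 66, 68, 70, 76, 80, 84, 85, 92, 93, 120)
cars <- data.frame(Speed = Speed, Distance = Distance)

# Draw one boxplot per attribute and keep the returned statistics.
speed_box    <- boxplot(cars$Speed, main = "Speed")
distance_box <- boxplot(cars$Distance, main = "Distance")

# Points beyond 1.5 times the box length from the box are reported in $out.
speed_outliers    <- speed_box$out
distance_outliers <- distance_box$out

print(speed_outliers)
print(distance_outliers)
```

On this data, Speed has no flagged points and Distance has a single one, the value 120.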

Excluding Outliers

We noticed that there is 1 outlier in our dependent variable, Distance, which is the one we want to predict. So, before creating the model, we removed this outlier from the data set using the “which” function together with the “out” values returned by the boxplot.

We then stored the result in a new data set, “cars2”, in case we still need the original one later. We prefer to keep 2 separate data sets rather than modify the original.
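The removal could look roughly like the sketch below. It is an assumption about how `which` and the boxplot's `out` values were combined, not the report's actual code, and it uses `boxplot.stats` to get the same `out` values without drawing a plot. The data are rebuilt from the table at the top of the report (Brand omitted).

```r
# Speed and Distance copied from the data table in this report (Brand omitted).
Speed <- rep(c(4, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 22, 23, 24, 25),
             times = c(2, 2, 1, 1, 3, 2, 4, 4, 4, 3, 2, 3, 4, 3, 5, 1, 1, 4, 1))
Distance <- c(2, 4, 10, 10, 14, 16, 17, 18, 20, 20, 22, 24, 26, 26, 26, 26, 28,
              28, 32, 32, 32, 34, 34, 34, 36, 36, 40, 40, 42, 46, 46, 48, 50, 52,
              54, 54, 56, 56, 60, 64, 66, 68, 70, 76, 80, 84, 85, 92, 93, 120)
cars <- data.frame(Speed = Speed, Distance = Distance)

# Values flagged as outliers by the boxplot rule.
outlier_values <- boxplot.stats(cars$Distance)$out

# Row positions of those values in the data set.
outlier_rows <- which(cars$Distance %in% outlier_values)

# Copy without the outlier rows; the original data set stays untouched.
# The length check matters: cars[-integer(0), ] would return zero rows.
cars2 <- if (length(outlier_rows) > 0) cars[-outlier_rows, ] else cars

print(nrow(cars))
print(nrow(cars2))
```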

Creating testing and training sets

Now it is time to prepare the data sets for training and testing our model. We split the data using 70% (34 obs.) for training and 30% (15 obs.) for testing. These counts are based on the 49 observations that remain after removing the outlier.

Creating the test set

The Plan of Attack said to use “123” as the seed. We decided to use “1” instead, because with “123” we got some errors in the prediction. The row indices sampled for the training set are shown below.

##  [1]  4 39  1 34 23 43 14 18 33 21 40 10  7  9 15 48 25 44  5 31  2 38 41
## [24] 12 35 47 20  3  6 28 45 46 27 49
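A split like this can be sketched with `set.seed` and `sample`. This is an assumption about the method, since the original code is not shown, and the indices it draws depend on the R version, so they may differ from the ones printed above.

```r
# Number of observations left after removing the outlier.
n_obs <- 49

# Fix the random seed so the split is reproducible.
set.seed(1)

# 70% of the rows (rounded) go to the training set.
train_size <- round(0.7 * n_obs)
train_idx  <- sample(seq_len(n_obs), size = train_size)

# The remaining rows form the testing set.
test_idx <- setdiff(seq_len(n_obs), train_idx)

print(length(train_idx))
print(length(test_idx))
```

With a data frame `cars2`, the two sets would then be `cars2[train_idx, ]` and `cars2[-train_idx, ]`.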

Creating Linear Regression Model

As mentioned above, we use a linear regression algorithm for our prediction model. Fitted on the training set, our model, called LRM, gives reliable results, with an adjusted R-squared of 96%.

## 
## Call:
## lm(formula = Distance ~ Speed, data = trainSet)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.185 -4.282 -1.733  3.177 11.235 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -25.0423     2.5570  -9.793 3.77e-11 ***
## Speed         4.4517     0.1566  28.421  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.285 on 32 degrees of freedom
## Multiple R-squared:  0.9619, Adjusted R-squared:  0.9607 
## F-statistic: 807.7 on 1 and 32 DF,  p-value: < 2.2e-16
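The fit above can be sketched as follows. The data are rebuilt from the table at the top of the report (Brand omitted), the outlier is dropped, and the training rows are the indices printed in the previous section; the variable names other than `trainSet` and `LRM` are illustrative.

```r
# Speed and Distance copied from the data table in this report (Brand omitted).
Speed <- rep(c(4, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 22, 23, 24, 25),
             times = c(2, 2, 1, 1, 3, 2, 4, 4, 4, 3, 2, 3, 4, 3, 5, 1, 1, 4, 1))
Distance <- c(2, 4, 10, 10, 14, 16, 17, 18, 20, 20, 22, 24, 26, 26, 26, 26, 28,
              28, 32, 32, 32, 34, 34, 34, 36, 36, 40, 40, 42, 46, 46, 48, 50, 52,
              54, 54, 56, 56, 60, 64, 66, 68, 70, 76, 80, 84, 85, 92, 93, 120)
cars <- data.frame(Speed = Speed, Distance = Distance)

# Drop the Distance outlier flagged by the boxplot rule.
cars2 <- cars[!(cars$Distance %in% boxplot.stats(cars$Distance)$out), ]

# Training indices as printed in this report.
train_idx <- c(4, 39, 1, 34, 23, 43, 14, 18, 33, 21, 40, 10, 7, 9, 15, 48, 25,
               44, 5, 31, 2, 38, 41, 12, 35, 47, 20, 3, 6, 28, 45, 46, 27, 49)
trainSet <- cars2[train_idx, ]

# Simple linear regression: Distance explained by Speed.
LRM <- lm(Distance ~ Speed, data = trainSet)

print(summary(LRM))
print(coef(LRM))
```

The slope says that each extra unit of Speed adds about 4.45 units of predicted Distance within the range of the training data.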

Prediction with Linear Regression Model

Time to face the real world and see whether our model can predict the distance for the testing data.

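To present the results, we put them in a table that makes it easy to compare the “Distance”, the “Prediction”, the “AbsError” and the “RelError”. Here “AbsError” is the prediction minus the actual distance, so it can be negative, and “RelError” is that error as a percentage of the actual distance.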

Row  Brand       Speed  Distance  Prediction  AbsError  RelError
8    GM          10     18        19.47        1.47      8.17 %
11   Land Rover  11     22        23.93        1.93      8.77 %
13   Mazda       12     26        28.38        2.38      9.15 %
16   GMC         13     26        32.83        6.83     26.27 %
17   Fiat        13     28        32.83        4.83     17.25 %
19   Dodge       13     32        32.83        0.83      2.59 %
22   Chevrolet   14     34        37.28        3.28      9.65 %
24   Ford        15     34        41.73        7.73     22.74 %
26   Honda       15     36        41.73        5.73     15.92 %
29   BMW         17     42        50.64        8.64     20.57 %
30   Mercedes    17     46        50.64        4.64     10.09 %
32   Hyundai     18     48        55.09        7.09     14.77 %
36   Mazda       19     54        59.54        5.54     10.26 %
37   Mitsubishi  19     56        59.54        3.54      6.32 %
42   Dodge       20     68        63.99       -4.01     -5.9 %
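A table like this one can be sketched with `predict`. The code below rebuilds the data from the table at the top of the report (Brand omitted), refits the model on the training indices printed earlier, and computes the two error columns. The formulas for the error columns are inferred from the numbers in the table, not taken from the original code.

```r
# Speed and Distance copied from the data table in this report (Brand omitted).
Speed <- rep(c(4, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 22, 23, 24, 25),
             times = c(2, 2, 1, 1, 3, 2, 4, 4, 4, 3, 2, 3, 4, 3, 5, 1, 1, 4, 1))
Distance <- c(2, 4, 10, 10, 14, 16, 17, 18, 20, 20, 22, 24, 26, 26, 26, 26, 28,
              28, 32, 32, 32, 34, 34, 34, 36, 36, 40, 40, 42, 46, 46, 48, 50, 52,
              54, 54, 56, 56, 60, 64, 66, 68, 70, 76, 80, 84, 85, 92, 93, 120)
cars <- data.frame(Speed = Speed, Distance = Distance)
cars2 <- cars[!(cars$Distance %in% boxplot.stats(cars$Distance)$out), ]

# Training indices as printed in this report; the rest is the testing set.
train_idx <- c(4, 39, 1, 34, 23, 43, 14, 18, 33, 21, 40, 10, 7, 9, 15, 48, 25,
               44, 5, 31, 2, 38, 41, 12, 35, 47, 20, 3, 6, 28, 45, 46, 27, 49)
trainSet <- cars2[train_idx, ]
testSet  <- cars2[-train_idx, ]

LRM <- lm(Distance ~ Speed, data = trainSet)

# Predicted distance for each test row, rounded for display.
Prediction <- round(unname(predict(LRM, newdata = testSet)), 2)

# Signed error (prediction minus actual) and the same error in percent.
AbsError <- round(Prediction - testSet$Distance, 2)
RelError <- round(100 * AbsError / testSet$Distance, 2)

results <- data.frame(testSet, Prediction, AbsError, RelError)
print(results)

# One-number summary: mean of the unsigned errors on the testing set.
print(mean(abs(AbsError)))
```

The mean of the unsigned errors is an addition to the report's table, included here only as a compact way to compare this model with alternatives.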