After downloading and installing R and RStudio and learning how to use them, I started with Tutorial Part 1.

Tutorial Part 1: Cars

We will work with a data set of 50 cars and predict the distance a car travels from its speed.

Starting a New Project and Getting to Know the Data

I created a new project in a brand new directory and then installed the readr package:

install.packages("readr")

Then I loaded the library:

library(readr)

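I loaded the data with read.csv(). This is a base R function, so it works even without readr, whose own reader is read_csv():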

cars <- read.csv("cars.csv")
carsPred<- read.csv("cars_comparison.csv")
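
As a small illustration of what read.csv() returns, the sketch below parses three rows given inline instead of reading cars.csv. The rows and column names are copied from the output shown in this post; the names sample_csv and sample_cars are only for illustration. In R 4.0.0 and later, text columns are read as character vectors by default, so stringsAsFactors = TRUE is passed here to get a factor column like the one in the summaries of this post.

```r
# Three rows in the same layout as the cars data (values taken from this post).
sample_csv <- "name.of.car,speed.of.car,distance.of.car
Ford,4,2
Jeep,4,4
Honda,7,10"

# text = lets read.csv() parse a string instead of a file on disk.
sample_cars <- read.csv(text = sample_csv, stringsAsFactors = TRUE)

# str() shows the type of every column at a glance.
str(sample_cars)
```

With a real file, the same call would simply take the file name as its first argument.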

After loading the data, I applied a few functions to get to know it better:

attributes(cars)
## $names
## [1] "name.of.car"     "speed.of.car"    "distance.of.car"
## 
## $class
## [1] "data.frame"
## 
## $row.names
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
## [47] 47 48 49 50
summary(cars)
##   name.of.car  speed.of.car  distance.of.car 
##  Dodge  : 3   Min.   : 4.0   Min.   :  2.00  
##  Honda  : 3   1st Qu.:12.0   1st Qu.: 26.00  
##  Jeep   : 3   Median :15.0   Median : 36.00  
##  KIA    : 3   Mean   :15.4   Mean   : 42.98  
##  Acura  : 2   3rd Qu.:19.0   3rd Qu.: 56.00  
##  Audi   : 2   Max.   :25.0   Max.   :120.00  
##  (Other):34

Then I renamed the columns in the data set:

colnames(cars) <- c("Name","Speed","Distance")
names(cars)
## [1] "Name"     "Speed"    "Distance"
cars$Name #print out the instances in the first column
##  [1] Ford       Jeep       Honda      KIA        Toyota     BMW       
##  [7] Mercedes   GM         Hyundai    Infiniti   Land Rover Lexus     
## [13] Mazda      Mitsubishi Nissan     GMC        Fiat       Chrysler  
## [19] Dodge      Acura      Audi       Chevrolet  Buick      Ford      
## [25] Jeep       Honda      KIA        Toyota     BMW        Mercedes  
## [31] GM         Hyundai    Infiniti   Land Rover Lexus      Mazda     
## [37] Mitsubishi Nissan     GMC        Fiat       Chrysler   Dodge     
## [43] Acura      Audi       Chevrolet  Buick      Jeep       Honda     
## [49] KIA        Dodge     
## 23 Levels: Acura Audi BMW Buick Chevrolet Chrysler Dodge Fiat Ford ... Toyota
cars$Speed #print out the instances in the second column
##  [1]  4  4  7  7  8  9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14
## [24] 15 15 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24
## [47] 24 24 24 25
cars$Distance #print out the instances in the third column
##  [1]   2   4  10  10  14  16  17  18  20  20  22  24  26  26  26  26  28
## [18]  28  32  32  32  34  34  34  36  36  40  40  42  46  46  48  50  52
## [35]  54  54  56  56  60  64  66  68  70  76  80  84  85  92  93 120

Preprocessing the Data

First I converted the data to numeric types:

cars$Speed<- as.numeric(cars$Speed)
cars$Distance<- as.numeric(cars$Distance)
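
One caveat with as.numeric(): it is harmless here because Speed and Distance already appear to be numeric (the summary above shows numeric statistics for them), but if a numeric column had been read as a factor, as.numeric() would return the internal level codes rather than the values. A minimal sketch with a made-up vector (speed_factor is hypothetical, not part of the data set):

```r
# A factor whose labels happen to look like numbers.
speed_factor <- factor(c("10", "4", "25"))

# Direct conversion returns the position of each label among the sorted levels.
codes <- as.numeric(speed_factor)

# Converting to character first recovers the actual numbers.
values <- as.numeric(as.character(speed_factor))

codes
values
```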

To detect missing values I ran summary() again. A column with missing values would show an extra NA's row, and none appears, so there are no missing values:

summary(cars)
##       Name        Speed         Distance     
##  Dodge  : 3   Min.   : 4.0   Min.   :  2.00  
##  Honda  : 3   1st Qu.:12.0   1st Qu.: 26.00  
##  Jeep   : 3   Median :15.0   Median : 36.00  
##  KIA    : 3   Mean   :15.4   Mean   : 42.98  
##  Acura  : 2   3rd Qu.:19.0   3rd Qu.: 56.00  
##  Audi   : 2   Max.   :25.0   Max.   :120.00  
##  (Other):34
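
Because the NA's row in summary() is easy to overlook, a more explicit check can help. The sketch below uses a small made-up data frame (toy is hypothetical) so that the effect of one missing value is visible:

```r
# Made-up data with one missing speed.
toy <- data.frame(Speed = c(4, 7, NA), Distance = c(2, 10, 14))

# TRUE if there is at least one missing value anywhere in the data frame.
has_na <- anyNA(toy)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy))

has_na
na_counts
```

Run on the cars data, colSums(is.na(cars)) would be expected to show zero for every column, in line with the summary above.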

I plotted Speed against Distance to see the relationship between them:

plot(cars$Speed, cars$Distance)

Creating Testing and Training Sets

I created the smp_size variable, using floor() to get a whole number equal to 70% of the rows in the data set:

smp_size <- floor(0.7 * nrow(cars))

I set the seed of the random number generator so that the random split is reproducible:

set.seed(123)

I defined train_indices with sample(), which draws smp_size row numbers at random from the sequence generated by seq_len():

train_indices <- sample(seq_len(nrow(cars)), size = smp_size)

Then I assigned the sampled rows to the training set and the remaining rows to the test set:

train<- cars[train_indices, ]
test<- cars[-train_indices, ]
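
The split steps above can be gathered into one small function. This is only a sketch: split_data is a hypothetical helper, and R's built-in cars data set (also 50 rows) stands in for cars.csv, which is an assumption made so the example runs on its own.

```r
# Hypothetical helper: split a data frame into a training and a test part.
split_data <- function(df, train_fraction = 0.7, seed = 123) {
  set.seed(seed)                                    # reproducible sampling
  n_train <- floor(train_fraction * nrow(df))       # size of the training set
  idx <- sample(seq_len(nrow(df)), size = n_train)  # random row numbers
  list(train = df[idx, ], test = df[-idx, ])        # the rest goes to test
}

# Stand-in data: the built-in `cars`, not the cars.csv used in this post.
parts <- split_data(datasets::cars)
nrow(parts$train)   # 35
nrow(parts$test)    # 15

# The same seed gives the same split every time.
identical(parts, split_data(datasets::cars))
```

Keeping the seed as an argument makes it easy to check how much the results depend on one particular random split.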

Linear Regression Model

I created a linear regression model called Fit1 with the lm() function:

Fit1<-lm(Distance~ Speed, train)

Then I ran summary() on the model:

summary(Fit1)
## 
## Call:
## lm(formula = Distance ~ Speed, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.0012 -5.0012 -0.5603  2.1458 28.4109 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -35.2481     4.0712  -8.658 5.25e-10 ***
## Speed         5.0735     0.2519  20.143  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.18 on 33 degrees of freedom
## Multiple R-squared:  0.9248, Adjusted R-squared:  0.9225 
## F-statistic: 405.7 on 1 and 33 DF,  p-value: < 2.2e-16

The Intercept estimate is -35.25 with a standard error of ± 4.07 and a p-value of 5.25e-10. The Speed estimate is 5.07 with a standard error of ± 0.25 and a p-value below 2e-16.

Multiple R-squared is 0.9248 and Adjusted R-squared is 0.9225. Both are close to 1, which means the regression line fits the training data very well.

The overall p-value is below 2.2e-16. Being less than 0.05, it means the relationship between the independent variable (predictor) and the dependent variable (response) is statistically significant.

How far can a car travel based on its speed?

We have a data set of 50 cars. I trained the model on 70% of them (exactly 35 cars) and tested it on the remaining 15. To answer this question I used the predict() function:
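
The two coefficients define a straight line, Distance = Intercept + Slope × Speed. As a minimal sketch, the values from the summary can be plugged in by hand; intercept, slope and predict_distance are illustrative names, not objects from the tutorial.

```r
# Coefficients copied from the summary(Fit1) output.
intercept <- -35.2481
slope     <-  5.0735

# The fitted line as a function of speed.
predict_distance <- function(speed) intercept + slope * speed

predict_distance(14)   # about 35.78
predict_distance(4)    # about -14.95
```

These hand-computed values agree with what predict() reports for test cars with speeds of 14 and 4.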

Pred1<- predict(Fit1, test)
Pred1
##         1         2         6        16        18        20        22 
## -14.95415 -14.95415  10.41329  30.70724  30.70724  35.78073  35.78073 
##        23        34        35        38        39        44        46 
##  35.78073  56.07468  56.07468  61.14817  66.22166  76.36864  86.51561 
##        47 
##  86.51561

Create a Table

Number  Name        Speed  Distance  Distance.Predicted  Difference  Difference (%)
     1  Ford            4         2              -14.95      -16.95        -847.50%
     2  Jeep            4         4              -14.95      -18.95        -473.75%
     3  Honda           7        10                  NA          NA              NA
     4  KIA             7        10                  NA          NA              NA
     5  Toyota          8        14                  NA          NA              NA
     6  BMW             9        16               10.41       -5.59         -34.94%
     7  Mercedes       10        17                  NA          NA              NA
     8  GM             10        18                  NA          NA              NA
     9  Hyundai        10        20                  NA          NA              NA
    10  Infiniti       11        20                  NA          NA              NA
    11  Land Rover     11        22                  NA          NA              NA
    12  Lexus          12        24                  NA          NA              NA
    13  Mazda          12        26                  NA          NA              NA
    14  Mitsubishi     12        26                  NA          NA              NA
    15  Nissan         12        26                  NA          NA              NA
    16  GMC            13        26               30.70        4.70          18.08%
    17  Fiat           13        28                  NA          NA              NA
    18  Chrysler       13        28               30.70        2.70           9.64%
    19  Dodge          13        32                  NA          NA              NA
    20  Acura          14        32               35.78        3.78          11.81%
    21  Audi           14        32                  NA          NA              NA
    22  Chevrolet      14        34               35.78        1.78           5.24%
    23  Buick          14        34               35.78        1.78           5.24%
    24  Ford           15        34                  NA          NA              NA
    25  Jeep           15        36                  NA          NA              NA
    26  Honda          15        36                  NA          NA              NA
    27  KIA            16        40                  NA          NA              NA
    28  Toyota         16        40                  NA          NA              NA
    29  BMW            17        42                  NA          NA              NA
    30  Mercedes       17        46                  NA          NA              NA
    31  GM             17        46                  NA          NA              NA
    32  Hyundai        18        48                  NA          NA              NA
    33  Infiniti       18        50                  NA          NA              NA
    34  Land Rover     18        52               56.07        4.07           7.83%
    35  Lexus          18        54               56.07        2.07           3.83%
    36  Mazda          19        54                  NA          NA              NA
    37  Mitsubishi     19        56                  NA          NA              NA
    38  Nissan         19        56               61.14        5.14           9.18%
    39  GMC            20        60               66.22        6.22          10.37%
    40  Fiat           20        64                  NA          NA              NA
    41  Chrysler       20        66                  NA          NA              NA
    42  Dodge          20        68                  NA          NA              NA
    43  Acura          20        70                  NA          NA              NA
    44  Audi           22        76               76.36        0.36           0.47%
    45  Chevrolet      23        80                  NA          NA              NA
    46  Buick          24        84               86.51        2.51           2.99%
    47  Jeep           24        85               86.51        1.51           1.78%
    48  Honda          24        92                  NA          NA              NA
    49  KIA            24        93                  NA          NA              NA
    50  Dodge          25       120                  NA          NA              NA
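
The post does not show how the table was built, so the following is only one possible way to do it. It uses R's built-in cars data set as a stand-in for cars.csv (an assumption, so the numbers will differ from the table), and the names comparison and Difference.Pct are illustrative.

```r
# Stand-in data: built-in `cars` (columns speed, dist), renamed as in the post.
df <- datasets::cars
names(df) <- c("Speed", "Distance")

# Same 70/30 split and model as in the text.
set.seed(123)
train_idx <- sample(seq_len(nrow(df)), size = floor(0.7 * nrow(df)))
fit <- lm(Distance ~ Speed, data = df[train_idx, ])

# Start with NA everywhere, then fill in predictions for the test rows only.
comparison <- df
comparison$Distance.Predicted <- NA_real_
comparison$Distance.Predicted[-train_idx] <- predict(fit, df[-train_idx, ])

# Absolute and relative difference between prediction and actual distance.
comparison$Difference <- comparison$Distance.Predicted - comparison$Distance
comparison$Difference.Pct <- 100 * comparison$Difference / comparison$Distance

head(round(comparison, 2))
```

Rows used for training keep NA in the last three columns, which mirrors the layout of the table above.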

I compared the predictions for these 15 cars in the table above. The most precise prediction was for car number 44 (Audi), followed by cars number 47 (Jeep) and 46 (Buick), all of them with a speed higher than 20 km/h. Cars 46 and 47 belong to the group of four cars with a speed of 24 km/h, of which the split put two in the training set and two in the test set, which may explain the good predictions. On the other hand, cars number 1 and 2, which have the lowest speed, get a negative predicted distance, which is impossible. They are the only cars with a speed of 4 km/h, and both ended up in the test set and none in the training set, so the model is extrapolating below the speeds it was trained on. With an intercept of -35.25 and a slope of 5.07, the fitted line gives a negative distance for any speed below about 6.95 km/h.
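
Whether a test car lies outside the speeds seen during training can be checked directly. The sketch below retypes the Speed column and the test row numbers from the output earlier in this post; the variable names are illustrative.

```r
# Speed column as printed earlier in the post (50 cars).
speed <- c(4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13,
           14, 14, 14, 14, 15, 15, 15, 16, 16, 17, 17, 17, 18, 18, 18, 18, 19,
           19, 19, 20, 20, 20, 20, 20, 22, 23, 24, 24, 24, 24, 25)

# Row numbers of the 15 test cars, taken from the names of the Pred1 output.
test_rows <- c(1, 2, 6, 16, 18, 20, 22, 23, 34, 35, 38, 39, 44, 46, 47)

# Range of speeds the model actually saw during training.
train_range <- range(speed[-test_rows])

# Test cars whose speed falls outside that range.
test_speed <- speed[test_rows]
outside <- test_rows[test_speed < train_range[1] | test_speed > train_range[2]]

train_range   # 7 25
outside       # 1 2
```

Only cars 1 and 2 fall outside the training range, which matches the two negative predictions.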

Predicting a petal’s length

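Now we are going to work with a data set of 150 iris flowers with 5 attributes each. The goal of the analysis is to predict a petal’s length from the petal’s width.

Creating Testing and Training Sets

I loaded the data into IrisDataset with read.csv("iris.csv") and inspected it with summary() and str(), followed by the column names: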

##        X           Sepal.Length    Sepal.Width     Petal.Length  
##  Min.   :  1.00   Min.   :4.300   Min.   :2.000   Min.   :1.000  
##  1st Qu.: 38.25   1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600  
##  Median : 75.50   Median :5.800   Median :3.000   Median :4.350  
##  Mean   : 75.50   Mean   :5.843   Mean   :3.057   Mean   :3.758  
##  3rd Qu.:112.75   3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100  
##  Max.   :150.00   Max.   :7.900   Max.   :4.400   Max.   :6.900  
##   Petal.Width          Species  
##  Min.   :0.100   setosa    :50  
##  1st Qu.:0.300   versicolor:50  
##  Median :1.300   virginica :50  
##  Mean   :1.199                  
##  3rd Qu.:1.800                  
##  Max.   :2.500
## 'data.frame':    150 obs. of  6 variables:
##  $ X           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
## [1] "X"            "Sepal.Length" "Sepal.Width"  "Petal.Length"
## [5] "Petal.Width"  "Species"

I set the seed of the random number generator so that the random split is reproducible:

set.seed(123)

I created the trainSize variable, using round() to get a whole number equal to 80% of the rows in the data set. Then I defined testSize as the remainder:

trainSize <- round(nrow(IrisDataset) * 0.8)
testSize <- nrow(IrisDataset) - trainSize 
trainSize
## [1] 120
testSize
## [1] 30

This assigns 120 flowers to training (trainSize) and 30 flowers to testing (testSize).

Linear Regression Model

I defined training_indices and assigned the corresponding rows to trainSet and the remaining rows to testSet:

training_indices <- sample(seq_len(nrow(IrisDataset)), size = trainSize)

trainSet <- IrisDataset[training_indices, ]
testSet <- IrisDataset[-training_indices, ]

I created a linear regression model called LinearModel with the lm() function:

LinearModel<- lm(Petal.Length ~ Petal.Width,trainSet) 
summary(LinearModel)
## 
## Call:
## lm(formula = Petal.Length ~ Petal.Width, data = trainSet)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.31533 -0.32661 -0.02686  0.27611  1.40670 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.08246    0.08689   12.46   <2e-16 ***
## Petal.Width  2.22203    0.05994   37.07   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5064 on 118 degrees of freedom
## Multiple R-squared:  0.9209, Adjusted R-squared:  0.9203 
## F-statistic:  1374 on 1 and 118 DF,  p-value: < 2.2e-16

The Intercept estimate is 1.08 with a standard error of ± 0.087 and a p-value below 2e-16. The Petal.Width estimate is 2.22 with a standard error of ± 0.060 and a p-value below 2e-16.

Multiple R-squared is 0.9209 and Adjusted R-squared is 0.9203. Both are close to 1, which means the regression line fits the training data very well.

The overall p-value is below 2.2e-16. Being less than 0.05, it means the relationship between the independent variable (predictor) and the dependent variable (response) is statistically significant.

How long are certain petals, given the petal’s width?

We have a data set of 150 flowers. I trained the model on 80% of them (exactly 120 flowers) and tested it on the remaining 30.

I used the predict() function:

prediction<-predict(LinearModel,testSet)
prediction
##        1        2        3       11       18       19       28       33 
## 1.526862 1.526862 1.526862 1.526862 1.749065 1.749065 1.526862 1.304659 
##       36       48       55       56       57       58       59       61 
## 1.526862 1.526862 4.415504 3.971097 4.637707 3.304488 3.971097 3.304488 
##       62       65       66       70       77       83       84       98 
## 4.415504 3.971097 4.193301 3.526691 4.193301 3.748894 4.637707 3.971097 
##      100      105      113      125      131      141 
## 3.971097 5.970926 5.748723 5.748723 5.304317 6.415333
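
I then drew a scatter plot of the whole data set with the highcharter package:

library(highcharter)  # provides highchart() and the hc_* functions

highchart() %>%
  hc_title(text = "Flowers: Petal Length and Width") %>%
  hc_yAxis(title = list(text = "Petal width")) %>%
  hc_xAxis(title = list(text = "Petal length")) %>%
  hc_add_series(IrisDataset, "scatter", hcaes(x = Petal.Length, y = Petal.Width))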

I used a scatter plot to try to find an explanation for these results. It shows that data are lacking in one range of values, which can affect this model and its predictions.
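
One way to put a number on how far the predictions are from the actual petal lengths is to compute error measures on the test set. This sketch assumes R's built-in iris data set holds the same flowers as iris.csv; the variable names are illustrative and the resulting values are not taken from the tutorial.

```r
# 80/20 split on the built-in iris data (stand-in for IrisDataset).
set.seed(123)
train_idx <- sample(seq_len(nrow(iris)), size = round(0.8 * nrow(iris)))
train_set <- iris[train_idx, ]
test_set  <- iris[-train_idx, ]

# Same model as in the text: petal length explained by petal width.
model <- lm(Petal.Length ~ Petal.Width, data = train_set)

# Prediction errors on the 30 flowers the model has not seen.
predicted <- predict(model, test_set)
errors <- predicted - test_set$Petal.Length

rmse <- sqrt(mean(errors^2))   # root mean squared error
mae  <- mean(abs(errors))      # mean absolute error
c(RMSE = rmse, MAE = mae)
```

Both measures are in the same unit as Petal.Length, so they can be read as a typical prediction error.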

Errors and Warning Messages Found

Here I list the errors I ran into and the correct way to write each line.

Error 1

install.packages(readr)

The package name must be enclosed in quotation marks. The correct way:

install.packages("readr")

Error 2

library("readr") 

library() accepts the package name with or without quotation marks, so this line runs as well. The more common form leaves the quotation marks out:

library(readr)

Error 3

IrisDataset <- read.csv(iris.csv) 

The file name must be enclosed in quotation marks. The correct way:

IrisDataset <- read.csv("iris.csv")

Error 4

summary(risDataset) 

The letter I is missing from the data set name: Error in summary(risDataset) : object ‘risDataset’ not found. The correct way:

summary(IrisDataset)

Error 5

str(IrisDatasets) 

There is an extra letter s: Error in str(IrisDatasets) : object ‘IrisDatasets’ not found. The correct way:

str(IrisDataset)

Error 6

hist(IrisDataset$Species)

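hist() fails here because Species is a factor and hist() only accepts numeric data. The message begins: In hist.default(IrisDataset$Species) : ‘x’ must be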
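
For a categorical column such as Species, a bar chart of the counts is one alternative to hist(), while hist() itself works on any of the numeric columns. A minimal sketch, using R's built-in iris data set as a stand-in for IrisDataset:

```r
# Count how many flowers belong to each species.
species_counts <- table(iris$Species)
species_counts

# Bar chart of the counts: suitable for a factor.
barplot(species_counts, main = "Flowers per species")

# Histogram of a numeric column: this is what hist() expects.
hist(iris$Petal.Length, main = "Petal length")
```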

Error 7

plot(IrisDataset$Sepal.Length 

The closing parenthesis ) is missing. The correct way:

plot(IrisDataset$Sepal.Length)

Error 8

trainSize <- round(nrow(IrisDataset) * 0.2)

Assigning only 20% of the data to trainSize is a mistake; the training set should take at least 70% or 80% of the data. The correct way:

trainSize <- round(nrow(IrisDataset) * 0.8)

Error 9

testSize <- nrow(IrisDataset) - trainSet

Object ‘trainSet’ not found. The size of the test set has to be computed from trainSize, not trainSet. The correct way:

testSize <- nrow(IrisDataset) - trainSize     

Error 10

trainSizes 

Object ‘trainSizes’ not found: there is an extra letter s. The correct way:

trainSize

Error 11

trainSet <- IrisDataset[training_indices, ]

Object ‘training_indices’ not found. It has to be defined before it is used, so this line must be run first:

training_indices <- sample(seq_len(nrow(IrisDataset)), size = trainSize) 

Error 12

LinearModel<- lm(trainSet$Petal.Width ~ testingSet$Petal.Length)

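Replace testingSet with trainSet and swap the variables, so that Petal.Length is the response and Petal.Width the predictor. Naming the columns in the formula and passing trainSet as the data, as in the model above, also lets predict() work on testSet later:

LinearModel<- lm(Petal.Length ~ Petal.Width, trainSet)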
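
Whether the columns are written as trainSet$Petal.Width inside the formula or passed through the data argument matters once predict() is called on new data. The sketch below shows the difference, assuming the built-in iris data set as a stand-in for IrisDataset; good and bad are illustrative names.

```r
# 120 training rows and 30 test rows from the built-in iris data.
set.seed(123)
train_idx <- sample(seq_len(nrow(iris)), size = 120)
trainSet <- iris[train_idx, ]
testSet  <- iris[-train_idx, ]

# Columns named in the formula, data frame passed separately.
good <- lm(Petal.Length ~ Petal.Width, data = trainSet)
length(predict(good, testSet))                    # 30

# Columns hard-wired to trainSet with `$`: predict() cannot swap in testSet,
# so it warns and returns one value per training row instead.
bad <- lm(trainSet$Petal.Length ~ trainSet$Petal.Width)
length(suppressWarnings(predict(bad, testSet)))   # 120
```

The first form is the one that gives one prediction per test flower.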

Error 13

prediction<-predict(LinearModeltestSet)

The comma between LinearModel and testSet is missing, so R looks for a single object called LinearModeltestSet. The correct way:

prediction<-predict(LinearModel, testSet)

Error 14

predictions

There is an extra letter s. The correct way:

prediction

Was it straightforward to install R and RStudio? Yes, it was very easy to do.

Was the tutorial useful? Would you recommend it to others? The tutorial was very useful, and I would recommend it.

What are the main lessons you’ve learned from this experience? I learned the basics of coding in R: how to define a variable, how to create models, how to train and test them, and how to make predictions. I also learned about the RStudio interface and how to create some plots to compare information.