After downloading and installing R and RStudio and learning how to use them, I started with Tutorial Part 1.
We will work with a data set of 50 cars, and we are going to predict the distance from the speed of the cars.
I created a new project in a brand new directory, and then I installed the readr package:
install.packages("readr")
Then I loaded the library:
library(readr)
I loaded the data with:
cars <- read.csv("cars.csv")
carsPred<- read.csv("cars_comparison.csv")
After loading the data, I applied some functions to get to know the data I have to work with:
attributes(cars)
## $names
## [1] "name.of.car" "speed.of.car" "distance.of.car"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
## [47] 47 48 49 50
summary(cars)
## name.of.car speed.of.car distance.of.car
## Dodge : 3 Min. : 4.0 Min. : 2.00
## Honda : 3 1st Qu.:12.0 1st Qu.: 26.00
## Jeep : 3 Median :15.0 Median : 36.00
## KIA : 3 Mean :15.4 Mean : 42.98
## Acura : 2 3rd Qu.:19.0 3rd Qu.: 56.00
## Audi : 2 Max. :25.0 Max. :120.00
## (Other):34
Then I renamed the columns in the data set:
colnames(cars) <- c("Name","Speed","Distance")
names(cars)
## [1] "Name" "Speed" "Distance"
cars$Name #print out the instances in the first column
## [1] Ford Jeep Honda KIA Toyota BMW
## [7] Mercedes GM Hyundai Infiniti Land Rover Lexus
## [13] Mazda Mitsubishi Nissan GMC Fiat Chrysler
## [19] Dodge Acura Audi Chevrolet Buick Ford
## [25] Jeep Honda KIA Toyota BMW Mercedes
## [31] GM Hyundai Infiniti Land Rover Lexus Mazda
## [37] Mitsubishi Nissan GMC Fiat Chrysler Dodge
## [43] Acura Audi Chevrolet Buick Jeep Honda
## [49] KIA Dodge
## 23 Levels: Acura Audi BMW Buick Chevrolet Chrysler Dodge Fiat Ford ... Toyota
cars$Speed #print out the instances in the second column
## [1] 4 4 7 7 8 9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14
## [24] 15 15 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24
## [47] 24 24 24 25
cars$Distance #print out the instances in the third column
## [1] 2 4 10 10 14 16 17 18 20 20 22 24 26 26 26 26 28
## [18] 28 32 32 32 34 34 34 36 36 40 40 42 46 46 48 50 52
## [35] 54 54 56 56 60 64 66 68 70 76 80 84 85 92 93 120
Preprocessing the Data
First, I converted the Speed and Distance columns to numeric types:
cars$Speed<- as.numeric(cars$Speed)
cars$Distance<- as.numeric(cars$Distance)
To detect missing values in the data set, I ran the summary() function again; it shows that there are no missing values:
summary(cars)
## Name Speed Distance
## Dodge : 3 Min. : 4.0 Min. : 2.00
## Honda : 3 1st Qu.:12.0 1st Qu.: 26.00
## Jeep : 3 Median :15.0 Median : 36.00
## KIA : 3 Mean :15.4 Mean : 42.98
## Acura : 2 3rd Qu.:19.0 3rd Qu.: 56.00
## Audi : 2 Max. :25.0 Max. :120.00
## (Other):34
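As an extra check (my addition, not part of the tutorial), the missing values can also be counted explicitly:
colSums(is.na(cars))  # missing values per column; all zeros means no NAs
sum(is.na(cars))      # total number of missing values in the data frame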
To see the relationship between Speed and Distance, I made a scatter plot:
plot(cars$Speed, cars$Distance)
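The base plot() call above uses the column names as axis labels; a slightly more readable version (my variation on the same call) would be:
plot(cars$Speed, cars$Distance,
     xlab = "Speed", ylab = "Distance",
     main = "Distance vs. Speed")  # same scatter plot, with explicit labels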
I created the smp_size variable, using floor() to get a whole number corresponding to 70% of the data set:
smp_size <- floor(0.7 * nrow(cars))
I set the random number generator seed so that the sampling is reproducible:
set.seed(123)
I defined train_indices with the sample() function, using the sequence-generation function seq_len() to enumerate the rows and setting the sample size to smp_size:
train_indices <- sample(seq_len(nrow(cars)), size = smp_size)
Then I split the data into training and test sets:
train<- cars[train_indices, ]
test<- cars[-train_indices, ]
I created a linear regression model called Fit1 with the lm() function:
Fit1<-lm(Distance~ Speed, train)
Then I ran summary() on this model:
summary(Fit1)
##
## Call:
## lm(formula = Distance ~ Speed, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0012 -5.0012 -0.5603 2.1458 28.4109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -35.2481 4.0712 -8.658 5.25e-10 ***
## Speed 5.0735 0.2519 20.143 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.18 on 33 degrees of freedom
## Multiple R-squared: 0.9248, Adjusted R-squared: 0.9225
## F-statistic: 405.7 on 1 and 33 DF, p-value: < 2.2e-16
The result is an intercept estimate of -35.25 with a standard error of ±4.07 and a p-value of 5.25e-10. For Speed, the estimate is 5.07 with a standard error of ±0.25 and a p-value below 2e-16. The Multiple R-squared is 0.9248 and the Adjusted R-squared is 0.9225; in both cases the value is close to 1, which means the regression line fits the data very well. The overall p-value of < 2.2e-16 is below 0.05, which means the relationship between the independent variable/predictor (Speed) and the dependent variable/response (Distance) is statistically significant.
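The quantities discussed above can also be read directly off the fitted model object; a small sketch (my addition):
coef(Fit1)                    # intercept and Speed estimates
summary(Fit1)$coefficients    # estimates, standard errors, t values, p-values
summary(Fit1)$r.squared       # multiple R-squared
summary(Fit1)$adj.r.squared   # adjusted R-squared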
We have a data set of 50 cars; we trained the model on 70% of it (exactly 35 cars) and tested it on the remaining 15 cars. To make predictions for the test cars I used the predict() function:
Pred1<- predict(Fit1, test)
Pred1
## 1 2 6 16 18 20 22
## -14.95415 -14.95415 10.41329 30.70724 30.70724 35.78073 35.78073
## 23 34 35 38 39 44 46
## 35.78073 56.07468 56.07468 61.14817 66.22166 76.36864 86.51561
## 47
## 86.51561
Create a Comparison Table
The table below compares the actual and predicted distances; predictions exist only for the 15 test cars, the training rows are NA. A sketch of how the table could be built in R follows it.
Number | Name | Speed | Distance | Distance.Predicted | Difference | Difference (%) |
---|---|---|---|---|---|---|
1 | Ford | 4 | 2 | -14.95 | -16.95 | -847.50% |
2 | Jeep | 4 | 4 | -14.95 | -18.95 | -473.75% |
3 | Honda | 7 | 10 | NA | NA | |
4 | KIA | 7 | 10 | NA | NA | |
5 | Toyota | 8 | 14 | NA | NA | |
6 | BMW | 9 | 16 | 10.41 | -5.59 | -34.94% |
7 | Mercedes | 10 | 17 | NA | NA | |
8 | GM | 10 | 18 | NA | NA | |
9 | Hyundai | 10 | 20 | NA | NA | |
10 | Infiniti | 11 | 20 | NA | NA | |
11 | Land Rover | 11 | 22 | NA | NA | |
12 | Lexus | 12 | 24 | NA | NA | |
13 | Mazda | 12 | 26 | NA | NA | |
14 | Mitsubishi | 12 | 26 | NA | NA | |
15 | Nissan | 12 | 26 | NA | NA | |
16 | GMC | 13 | 26 | 30.70 | 4.70 | 18.08% |
17 | Fiat | 13 | 28 | NA | NA | |
18 | Chrysler | 13 | 28 | 30.70 | 2.70 | 9.64% |
19 | Dodge | 13 | 32 | NA | NA | |
20 | Acura | 14 | 32 | 35.78 | 3.78 | 11.81% |
21 | Audi | 14 | 32 | NA | NA | |
22 | Chevrolet | 14 | 34 | 35.78 | 1.78 | 5.24% |
23 | Buick | 14 | 34 | 35.78 | 1.78 | 5.24% |
24 | Ford | 15 | 34 | NA | NA | |
25 | Jeep | 15 | 36 | NA | NA | |
26 | Honda | 15 | 36 | NA | NA | |
27 | KIA | 16 | 40 | NA | NA | |
28 | Toyota | 16 | 40 | NA | NA | |
29 | BMW | 17 | 42 | NA | NA | |
30 | Mercedes | 17 | 46 | NA | NA | |
31 | GM | 17 | 46 | NA | NA | |
32 | Hyundai | 18 | 48 | NA | NA | |
33 | Infiniti | 18 | 50 | NA | NA | |
34 | Land Rover | 18 | 52 | 56.07 | 4.07 | 7.83% |
35 | Lexus | 18 | 54 | 56.07 | 2.07 | 3.83% |
36 | Mazda | 19 | 54 | NA | NA | |
37 | Mitsubishi | 19 | 56 | NA | NA | |
38 | Nissan | 19 | 56 | 61.14 | 5.14 | 9.18% |
39 | GMC | 20 | 60 | 66.22 | 6.22 | 10.37% |
40 | Fiat | 20 | 64 | NA | NA | |
41 | Chrysler | 20 | 66 | NA | NA | |
42 | Dodge | 20 | 68 | NA | NA | |
43 | Acura | 20 | 70 | NA | NA | |
44 | Audi | 22 | 76 | 76.36 | 0.36 | 0.47% |
45 | Chevrolet | 23 | 80 | NA | NA | |
46 | Buick | 24 | 84 | 86.51 | 2.51 | 2.99% |
47 | Jeep | 24 | 85 | 86.51 | 1.51 | 1.78% |
48 | Honda | 24 | 92 | NA | NA | |
49 | KIA | 24 | 93 | NA | NA | |
50 | Dodge | 25 | 120 | NA | NA |
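The table above was assembled by hand; a minimal sketch of how the extra columns could be computed in R from the objects already defined (comparison and Difference.Percent are names I introduce here) is:
comparison <- cars                                   # start from the full data set
comparison$Distance.Predicted <- NA                  # training rows get no prediction
comparison$Distance.Predicted[-train_indices] <- round(Pred1, 2)
comparison$Difference <- comparison$Distance.Predicted - comparison$Distance
comparison$Difference.Percent <- round(100 * comparison$Difference / comparison$Distance, 2)
comparison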
I compared the prediction results for these 15 cars in the table above. The most precise prediction was for car number 44 (Audi), followed by cars number 46 (Buick) and 47 (Jeep), all of them with speeds above 20 km/h. Cars 46 and 47 belong to the group of cars with a speed of 24 km/h, and the split placed two cars of that group in the training set and two in the test set, which may explain the good predictions. On the other hand, cars number 1 and 2, which have the lowest speed, get negative predicted distances, which is impossible. They are the only cars with a speed of 4 km/h, and the split put both of them in the test set and none in the training set, which may explain the poor predictions there.
Now we are going to work with a data set of 150 iris flowers, with 5 attributes each. The goal of the analysis is to predict a petal's length from the petal's width.
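The output below comes from loading and inspecting the data set. Assuming the file is named iris.csv (as in the error list further down), the calls would look like this:
IrisDataset <- read.csv("iris.csv")  # load the iris data
summary(IrisDataset)                 # summary statistics for each attribute
str(IrisDataset)                     # structure and data types of the columns
names(IrisDataset)                   # column names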
## X Sepal.Length Sepal.Width Petal.Length
## Min. : 1.00 Min. :4.300 Min. :2.000 Min. :1.000
## 1st Qu.: 38.25 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600
## Median : 75.50 Median :5.800 Median :3.000 Median :4.350
## Mean : 75.50 Mean :5.843 Mean :3.057 Mean :3.758
## 3rd Qu.:112.75 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100
## Max. :150.00 Max. :7.900 Max. :4.400 Max. :6.900
## Petal.Width Species
## Min. :0.100 setosa :50
## 1st Qu.:0.300 versicolor:50
## Median :1.300 virginica :50
## Mean :1.199
## 3rd Qu.:1.800
## Max. :2.500
## 'data.frame': 150 obs. of 6 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
## [1] "X" "Sepal.Length" "Sepal.Width" "Petal.Length"
## [5] "Petal.Width" "Species"
I set the random number generator seed:
set.seed(123)
I created the trainSize variable, using round() to get a whole number corresponding to 80% of the data set, and then I defined testSize:
trainSize <- round(nrow(IrisDataset) * 0.8)
testSize <- nrow(IrisDataset) - trainSize
trainSize
## [1] 120
testSize
## [1] 30
Then I assigned part of the data for training and part for testing: trainSize = 120 flowers and testSize = 30 flowers.
I defined training_indices and used it to assign values to trainSet and testSet:
training_indices <- sample(seq_len(nrow(IrisDataset)), size = trainSize)
trainSet <- IrisDataset[training_indices, ]
testSet <- IrisDataset[-training_indices, ]
I created a linear regression model called LinearModel with the lm() function:
LinearModel<- lm(Petal.Length ~ Petal.Width,trainSet)
summary(LinearModel)
##
## Call:
## lm(formula = Petal.Length ~ Petal.Width, data = trainSet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.31533 -0.32661 -0.02686 0.27611 1.40670
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.08246 0.08689 12.46 <2e-16 ***
## Petal.Width 2.22203 0.05994 37.07 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5064 on 118 degrees of freedom
## Multiple R-squared: 0.9209, Adjusted R-squared: 0.9203
## F-statistic: 1374 on 1 and 118 DF, p-value: < 2.2e-16
The result is an intercept estimate of 1.08 with a standard error of ±0.087 and a p-value below 2e-16. For Petal.Width, the estimate is 2.22 with a standard error of ±0.060 and a p-value below 2e-16.
The Multiple R-squared is 0.9209 and the Adjusted R-squared is 0.9203; in both cases the value is close to 1, which means the regression line fits the data very well.
The overall p-value of < 2.2e-16 is below 0.05, which means the relationship between the independent variable/predictor (Petal.Width) and the dependent variable/response (Petal.Length) is statistically significant.
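As a complementary check (my addition), confidence intervals for the estimated coefficients can be obtained with confint():
confint(LinearModel)  # 95% confidence intervals for the intercept and the Petal.Width slope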
We have a data set of 150 flowers; we trained the model on 80% of it (exactly 120 flowers) and tested it on the remaining 30 flowers.
To make predictions for the test flowers I used the predict() function:
prediction<-predict(LinearModel,testSet)
prediction
## 1 2 3 11 18 19 28 33
## 1.526862 1.526862 1.526862 1.526862 1.749065 1.749065 1.526862 1.304659
## 36 48 55 56 57 58 59 61
## 1.526862 1.526862 4.415504 3.971097 4.637707 3.304488 3.971097 3.304488
## 62 65 66 70 77 83 84 98
## 4.415504 3.971097 4.193301 3.526691 4.193301 3.748894 4.637707 3.971097
## 100 105 113 125 131 141
## 3.971097 5.970926 5.748723 5.748723 5.304317 6.415333
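To judge these predictions against the actual petal lengths of the 30 test flowers, a comparison like the one made for the cars could be built (results is a name I introduce here):
results <- data.frame(Actual    = testSet$Petal.Length,
                      Predicted = round(prediction, 2))
results$Error <- results$Predicted - results$Actual
head(results)                  # first few actual vs. predicted values
sqrt(mean(results$Error^2))    # root mean squared error on the test set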
To explore the relationship between the two petal measurements, I drew an interactive scatter plot with the highcharter package:
library(highcharter)  # interactive charting package
highchart() %>%
  hc_title(text = "Flowers: Petal Length and Width") %>%
  hc_xAxis(title = list(text = "Petal Length")) %>%
  hc_yAxis(title = list(text = "Petal Width")) %>%
  hc_add_series(IrisDataset, "scatter", hcaes(x = Petal.Length, y = Petal.Width))
I used this scatter plot to try to find an explanation for these results. It shows a lack of data in one segment of the measurements: there are no flowers with a petal length between about 2 and 3, and this gap can affect the model and the prediction results.
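A quick count (my addition) confirms the gap in the data:
# number of flowers with a petal length strictly between 2 and 3 (returns 0)
sum(IrisDataset$Petal.Length > 2 & IrisDataset$Petal.Length < 3)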
Here I list the errors found in the provided script and the corrected versions.
Error 1
install.packages(readr)
The package name must be enclosed in double quotation marks. The correct way:
install.packages("readr")
Error 2
library("readr")
With library() the package name does not need double quotation marks (the quoted form also works). The usual way:
library(readr)
Error 3
IrisDataset <- read.csv(iris.csv)
The file name must be enclosed in double quotation marks. The correct way:
IrisDataset <- read.csv("iris.csv")
Error 4
summary(risDataset)
The letter 'I' is missing from the data set name, which produces the error: object 'risDataset' not found. The correct way:
summary(IrisDataset)
Error 5
str(IrisDatasets)
There is an extra letter 's' in the data set name, which produces the error: object 'IrisDatasets' not found. The correct way:
str(IrisDataset)
Error 6
hist(IrisDataset$Species)
This produces the error in hist.default(IrisDataset$Species): 'x' must be numeric. Species is a factor, so it cannot be passed to hist().
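The tutorial does not give a corrected version for this one; a possible replacement (my suggestion) is a bar plot of the species counts:
plot(IrisDataset$Species)  # plotting a factor produces a bar plot of its counts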
Error 7
plot(IrisDataset$Sepal.Length
The closing ')' is missing. The correct way:
plot(IrisDataset$Sepal.Length)
Error 8
trainSize <- round(nrow(IrisDataset) * 0.2)
Assigning only 20% of the data to trainSize is a mistake; at least 70% or 80% of the data should be used for training. The correct way:
trainSize <- round(nrow(IrisDataset) * 0.8)
Error 9
testSize <- nrow(IrisDataset) - trainSet
This produces the error: object 'trainSet' not found; we have to use trainSize. The correct way:
testSize <- nrow(IrisDataset) - trainSize
Error 10
trainSizes
This produces the error: object 'trainSizes' not found; there is an extra letter 's'. The correct way:
trainSize
Error 11
trainSet <- IrisDataset[training_indices, ]
This produces the error: object 'training_indices' not found, because training_indices is used before it is defined. The correct way is to define it first:
training_indices <- sample(seq_len(nrow(IrisDataset)), size = trainSize)
Error 12
LinearModel<- lm(trainSet$Petal.Width ~ testingSet$Petal.Length)
The model must be fitted on the training set only, with Petal.Length as the response (left of ~) and Petal.Width as the predictor. The correct way, matching the form used earlier:
LinearModel <- lm(Petal.Length ~ Petal.Width, data = trainSet)
Error 13
#prediction<-predict(LinearModeltestSet)
A comma is missing between the model and the data set (and the line is commented out). The correct way, matching the call used earlier:
prediction <- predict(LinearModel, testSet)
Error 14
predictions
There is an extra letter 's'; the object is called prediction:
prediction
Was it straightforward to install R and RStudio? Yes, it was very easy to do.
Was the tutorial useful? Would you recommend it to others? The tutorial was very useful, I would recommend it.
What are the main lessons you've learned from this experience? I learned the basics of coding in R: defining variables, creating models, training and testing them, and making predictions. I also learned about the RStudio interface and how to create plots to compare information.