After downloading and installing R and RStudio and learning how to use them, I started with Tutorial Part 1.
We will work with a data set of 50 cars, and we are going to predict the distance from the speed of the cars.
I created a new project in a brand new directory, and then I installed the readr package:
install.packages("readr")
Then I loaded the library:
library(readr)
I loaded the data with:
cars <- read.csv("cars.csv")
carsPred<- read.csv("cars_comparison.csv")
After loading the data, I applied some functions to get to know the data I have to work with:
attributes(cars)
## $names
## [1] "name.of.car" "speed.of.car" "distance.of.car"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
## [47] 47 48 49 50
summary(cars)
## name.of.car speed.of.car distance.of.car
## Dodge : 3 Min. : 4.0 Min. : 2.00
## Honda : 3 1st Qu.:12.0 1st Qu.: 26.00
## Jeep : 3 Median :15.0 Median : 36.00
## KIA : 3 Mean :15.4 Mean : 42.98
## Acura : 2 3rd Qu.:19.0 3rd Qu.: 56.00
## Audi : 2 Max. :25.0 Max. :120.00
## (Other):34
Then I renamed the columns in the data set:
colnames(cars) <- c("Name","Speed","Distance")
names(cars)
## [1] "Name" "Speed" "Distance"
cars$Name #print out the instances in the first column
## [1] Ford Jeep Honda KIA Toyota BMW
## [7] Mercedes GM Hyundai Infiniti Land Rover Lexus
## [13] Mazda Mitsubishi Nissan GMC Fiat Chrysler
## [19] Dodge Acura Audi Chevrolet Buick Ford
## [25] Jeep Honda KIA Toyota BMW Mercedes
## [31] GM Hyundai Infiniti Land Rover Lexus Mazda
## [37] Mitsubishi Nissan GMC Fiat Chrysler Dodge
## [43] Acura Audi Chevrolet Buick Jeep Honda
## [49] KIA Dodge
## 23 Levels: Acura Audi BMW Buick Chevrolet Chrysler Dodge Fiat Ford ... Toyota
cars$Speed #print out the instances in the second column
## [1] 4 4 7 7 8 9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14
## [24] 15 15 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24
## [47] 24 24 24 25
cars$Distance #print out the instances in the third column
## [1] 2 4 10 10 14 16 17 18 20 20 22 24 26 26 26 26 28
## [18] 28 32 32 32 34 34 34 36 36 40 40 42 46 46 48 50 52
## [35] 54 54 56 56 60 64 66 68 70 76 80 84 85 92 93 120
Preprocessing the Data
First, I converted the Speed and Distance columns to numeric types:
cars$Speed<- as.numeric(cars$Speed)
cars$Distance<- as.numeric(cars$Distance)
To detect missing values in the data set, I ran the summary() function again; it shows that there are no missing values:
summary(cars)
## Name Speed Distance
## Dodge : 3 Min. : 4.0 Min. : 2.00
## Honda : 3 1st Qu.:12.0 1st Qu.: 26.00
## Jeep : 3 Median :15.0 Median : 36.00
## KIA : 3 Mean :15.4 Mean : 42.98
## Acura : 2 3rd Qu.:19.0 3rd Qu.: 56.00
## Audi : 2 Max. :25.0 Max. :120.00
## (Other):34
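As an extra check (my addition, not part of the tutorial), the missing values can also be counted explicitly:
colSums(is.na(cars))  # missing values per column; all zeros means no NAs
sum(is.na(cars))      # total number of missing values in the data frame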
To see the relationship between Speed and Distance, I made a scatter plot:
plot(cars$Speed, cars$Distance)
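The base plot() call above uses the column names as axis labels; a slightly more readable version (my variation on the same call) would be:
plot(cars$Speed, cars$Distance,
     xlab = "Speed", ylab = "Distance",
     main = "Distance vs. Speed")  # same scatter plot, with explicit labels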
I created the smp_size variable, using floor() to get a whole number corresponding to 70% of the data set:
smp_size <- floor(0.7 * nrow(cars))
I set the random number generator seed so that the sampling is reproducible:
set.seed(123)
I defined train_indices with the sample() function, using the sequence-generation function seq_len() to enumerate the rows and setting the sample size to smp_size:
train_indices <- sample(seq_len(nrow(cars)), size = smp_size)
Then I split the data into training and test sets:
train<- cars[train_indices, ]
test<- cars[-train_indices, ]
I created a linear regression model called Fit1 with the lm() function:
Fit1<-lm(Distance~ Speed, train)
Then I ran summary() on this model:
summary(Fit1)
##
## Call:
## lm(formula = Distance ~ Speed, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0012 -5.0012 -0.5603 2.1458 28.4109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -35.2481 4.0712 -8.658 5.25e-10 ***
## Speed 5.0735 0.2519 20.143 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.18 on 33 degrees of freedom
## Multiple R-squared: 0.9248, Adjusted R-squared: 0.9225
## F-statistic: 405.7 on 1 and 33 DF, p-value: < 2.2e-16
The result is an intercept estimate of -35.25 with a standard error of ±4.07 and a p-value of 5.25e-10. For Speed, the estimate is 5.07 with a standard error of ±0.25 and a p-value below 2e-16. The Multiple R-squared is 0.9248 and the Adjusted R-squared is 0.9225; in both cases the value is close to 1, which means the regression line fits the data very well. The overall p-value of < 2.2e-16 is below 0.05, which means the relationship between the independent variable/predictor (Speed) and the dependent variable/response (Distance) is statistically significant.
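The quantities discussed above can also be read directly off the fitted model object; a small sketch (my addition):
coef(Fit1)                    # intercept and Speed estimates
summary(Fit1)$coefficients    # estimates, standard errors, t values, p-values
summary(Fit1)$r.squared       # multiple R-squared
summary(Fit1)$adj.r.squared   # adjusted R-squared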
We have a data set of 50 cars; we trained the model on 70% of it (exactly 35 cars) and tested it on the remaining 15 cars. To make predictions for the test cars I used the predict() function:
Pred1<- predict(Fit1, test)
Pred1
## 1 2 6 16 18 20 22
## -14.95415 -14.95415 10.41329 30.70724 30.70724 35.78073 35.78073
## 23 34 35 38 39 44 46
## 35.78073 56.07468 56.07468 61.14817 66.22166 76.36864 86.51561
## 47
## 86.51561
Create a Comparison Table
The table below compares the actual and predicted distances; predictions exist only for the 15 test cars, the training rows are NA. A sketch of how the table could be built in R follows it.
Number | Name | Speed | Distance | Distance.Predicted | Difference | Difference (%) |
---|---|---|---|---|---|---|
1 | Ford | 4 | 2 | -14.95 | -16.95 | -847.50% |
2 | Jeep | 4 | 4 | -14.95 | -18.95 | -473.75% |
3 | Honda | 7 | 10 | NA | NA | |
4 | KIA | 7 | 10 | NA | NA | |
5 | Toyota | 8 | 14 | NA | NA | |
6 | BMW | 9 | 16 | 10.41 | -5.59 | -34.94% |
7 | Mercedes | 10 | 17 | NA | NA | |
8 | GM | 10 | 18 | NA | NA | |
9 | Hyundai | 10 | 20 | NA | NA | |
10 | Infiniti | 11 | 20 | NA | NA | |
11 | Land Rover | 11 | 22 | NA | NA | |
12 | Lexus | 12 | 24 | NA | NA | |
13 | Mazda | 12 | 26 | NA | NA | |
14 | Mitsubishi | 12 | 26 | NA | NA | |
15 | Nissan | 12 | 26 | NA | NA | |
16 | GMC | 13 | 26 | 30.70 | 4.70 | 18.08% |
17 | Fiat | 13 | 28 | NA | NA | |
18 | Chrysler | 13 | 28 | 30.70 | 2.70 | 9.64% |
19 | Dodge | 13 | 32 | NA | NA | |
20 | Acura | 14 | 32 | 35.78 | 3.78 | 11.81% |
21 | Audi | 14 | 32 | NA | NA | |
22 | Chevrolet | 14 | 34 | 35.78 | 1.78 | 5.24% |
23 | Buick | 14 | 34 | 35.78 | 1.78 | 5.24% |
24 | Ford | 15 | 34 | NA | NA | |
25 | Jeep | 15 | 36 | NA | NA | |
26 | Honda | 15 | 36 | NA | NA | |
27 | KIA | 16 | 40 | NA | NA | |
28 | Toyota | 16 | 40 | NA | NA | |
29 | BMW | 17 | 42 | NA | NA | |
30 | Mercedes | 17 | 46 | NA | NA | |
31 | GM | 17 | 46 | NA | NA | |
32 | Hyundai | 18 | 48 | NA | NA | |
33 | Infiniti | 18 | 50 | NA | NA | |
34 | Land Rover | 18 | 52 | 56.07 | 4.07 | 7.83% |
35 | Lexus | 18 | 54 | 56.07 | 2.07 | 3.83% |
36 | Mazda | 19 | 54 | NA | NA | |
37 | Mitsubishi | 19 | 56 | NA | NA | |
38 | Nissan | 19 | 56 | 61.14 | 5.14 | 9.18% |
39 | GMC | 20 | 60 | 66.22 | 6.22 | 10.37% |
40 | Fiat | 20 | 64 | NA | NA | |
41 | Chrysler | 20 | 66 | NA | NA | |
42 | Dodge | 20 | 68 | NA | NA | |
43 | Acura | 20 | 70 | NA | NA | |
44 | Audi | 22 | 76 | 76.36 | 0.36 | 0.47% |
45 | Chevrolet | 23 | 80 | NA | NA | |
46 | Buick | 24 | 84 | 86.51 | 2.51 | 2.99% |
47 | Jeep | 24 | 85 | 86.51 | 1.51 | 1.78% |
48 | Honda | 24 | 92 | NA | NA | |
49 | KIA | 24 | 93 | NA | NA | |
50 | Dodge | 25 | 120 | NA | NA |
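The table above was assembled by hand; a minimal sketch of how the extra columns could be computed in R from the objects already defined (comparison and Difference.Percent are names I introduce here) is:
comparison <- cars                                   # start from the full data set
comparison$Distance.Predicted <- NA                  # training rows get no prediction
comparison$Distance.Predicted[-train_indices] <- round(Pred1, 2)
comparison$Difference <- comparison$Distance.Predicted - comparison$Distance
comparison$Difference.Percent <- round(100 * comparison$Difference / comparison$Distance, 2)
comparison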
I compared the prediction results for these 15 cars in the table above. The most precise prediction was for car number 44 (Audi), followed by cars number 46 (Buick) and 47 (Jeep), all of them with speeds above 20 km/h. Cars 46 and 47 belong to the group of cars with a speed of 24 km/h, and the split placed two cars of that group in the training set and two in the test set, which may explain the good predictions. On the other hand, cars number 1 and 2, which have the lowest speed, get negative predicted distances, which is impossible. They are the only cars with a speed of 4 km/h, and the split put both of them in the test set and none in the training set, which may explain the poor predictions there.
Now we are going to work with a data set of 150 iris flowers, with 5 attributes each. The goal of the analysis is to predict a petal's length from the petal's width.
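The output below comes from loading and inspecting the data set. Assuming the file is named iris.csv (as in the error list further down), the calls would look like this:
IrisDataset <- read.csv("iris.csv")  # load the iris data
summary(IrisDataset)                 # summary statistics for each attribute
str(IrisDataset)                     # structure and data types of the columns
names(IrisDataset)                   # column names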
## X Sepal.Length Sepal.Width Petal.Length
## Min. : 1.00 Min. :4.300 Min. :2.000 Min. :1.000
## 1st Qu.: 38.25 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600
## Median : 75.50 Median :5.800 Median :3.000 Median :4.350
## Mean : 75.50 Mean :5.843 Mean :3.057 Mean :3.758
## 3rd Qu.:112.75 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100
## Max. :150.00 Max. :7.900 Max. :4.400 Max. :6.900
## Petal.Width Species
## Min. :0.100 setosa :50
## 1st Qu.:0.300 versicolor:50
## Median :1.300 virginica :50
## Mean :1.199
## 3rd Qu.:1.800
## Max. :2.500
## 'data.frame': 150 obs. of 6 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
## [1] "X" "Sepal.Length" "Sepal.Width" "Petal.Length"
## [5] "Petal.Width" "Species"
I set the random number generator seed:
set.seed(123)
I created the trainSize variable, using round() to get a whole number corresponding to 80% of the data set, and then I defined testSize:
trainSize <- round(nrow(IrisDataset) * 0.8)
testSize <- nrow(IrisDataset) - trainSize
trainSize
## [1] 120
testSize
## [1] 30
Then I assigned part of the data for training and part for testing: trainSize = 120 flowers and testSize = 30 flowers.
I defined training_indices and used it to assign values to trainSet and testSet:
training_indices <- sample(seq_len(nrow(IrisDataset)), size = trainSize)
trainSet <- IrisDataset[training_indices, ]
testSet <- IrisDataset[-training_indices, ]
I created a linear regression model called LinearModel with the lm() function:
LinearModel<- lm(Petal.Length ~ Petal.Width,trainSet)
summary(LinearModel)
##
## Call:
## lm(formula = Petal.Length ~ Petal.Width, data = trainSet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.31533 -0.32661 -0.02686 0.27611 1.40670
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.08246 0.08689 12.46 <2e-16 ***
## Petal.Width 2.22203 0.05994 37.07 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5064 on 118 degrees of freedom
## Multiple R-squared: 0.9209, Adjusted R-squared: 0.9203
## F-statistic: 1374 on 1 and 118 DF, p-value: < 2.2e-16
The result is an intercept estimate of 1.08 with a standard error of ±0.087 and a p-value below 2e-16. For Petal.Width, the estimate is 2.22 with a standard error of ±0.060 and a p-value below 2e-16.
The Multiple R-squared is 0.9209 and the Adjusted R-squared is 0.9203; in both cases the value is close to 1, which means the regression line fits the data very well.
The overall p-value of < 2.2e-16 is below 0.05, which means the relationship between the independent variable/predictor (Petal.Width) and the dependent variable/response (Petal.Length) is statistically significant.
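As a complementary check (my addition), confidence intervals for the estimated coefficients can be obtained with confint():
confint(LinearModel)  # 95% confidence intervals for the intercept and the Petal.Width slope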
We have a data set of 150 flowers; we trained the model on 80% of it (exactly 120 flowers) and tested it on the remaining 30 flowers.
To make predictions for the test flowers I used the predict() function:
prediction<-predict(LinearModel,testSet)
prediction
## 1 2 3 11 18 19 28 33
## 1.526862 1.526862 1.526862 1.526862 1.749065 1.749065 1.526862 1.304659
## 36 48 55 56 57 58 59 61
## 1.526862 1.526862 4.415504 3.971097 4.637707 3.304488 3.971097 3.304488
## 62 65 66 70 77 83 84 98
## 4.415504 3.971097 4.193301 3.526691 4.193301 3.748894 4.637707 3.971097
## 100 105 113 125 131 141
## 3.971097 5.970926 5.748723 5.748723 5.304317 6.415333
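To judge these predictions against the actual petal lengths of the 30 test flowers, a comparison like the one made for the cars could be built (results is a name I introduce here):
results <- data.frame(Actual    = testSet$Petal.Length,
                      Predicted = round(prediction, 2))
results$Error <- results$Predicted - results$Actual
head(results)                  # first few actual vs. predicted values
sqrt(mean(results$Error^2))    # root mean squared error on the test set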
To explore the relationship between the two petal measurements, I drew an interactive scatter plot with the highcharter package:
library(highcharter)  # interactive charting package
highchart() %>%
  hc_title(text = "Flowers: Petal Length and Width") %>%
  hc_xAxis(title = list(text = "Petal Length")) %>%
  hc_yAxis(title = list(text = "Petal Width")) %>%
  hc_add_series(IrisDataset, "scatter", hcaes(x = Petal.Length, y = Petal.Width))
I used this scatter plot to try to find an explanation for these results. It shows a lack of data in one segment of the measurements: there are no flowers with a petal length between about 2 and 3, and this gap can affect the model and the prediction results.
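A quick count (my addition) confirms the gap in the data:
# number of flowers with a petal length strictly between 2 and 3 (returns 0)
sum(IrisDataset$Petal.Length > 2 & IrisDataset$Petal.Length < 3)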
Here I list the errors found in the provided script and the corrected versions.
Error 1
install.packages(readr)
The package name must be enclosed in double quotation marks. The correct way:
install.packages("readr")
Error 2
library("readr")
With library() the package name does not need double quotation marks (the quoted form also works). The usual way:
library(readr)
Error 3
IrisDataset <- read.csv(iris.csv)
The file name must be enclosed in double quotation marks. The correct way:
IrisDataset <- read.csv("iris.csv")
Error 4
summary(risDataset)
The letter 'I' is missing from the data set name, which produces the error: object 'risDataset' not found. The correct way:
summary(IrisDataset)
Error 5
str(IrisDatasets)
There is an extra letter 's' in the data set name, which produces the error: object 'IrisDatasets' not found. The correct way:
str(IrisDataset)
Error 6
hist(IrisDataset$Species)
This produces the error in hist.default(IrisDataset$Species): 'x' must be numeric. Species is a factor, so it cannot be passed to hist().
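The tutorial does not give a corrected version for this one; a possible replacement (my suggestion) is a bar plot of the species counts:
plot(IrisDataset$Species)  # plotting a factor produces a bar plot of its counts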
Error 7
plot(IrisDataset$Sepal.Length
The closing ')' is missing. The correct way:
plot(IrisDataset$Sepal.Length)
Error 8
trainSize <- round(nrow(IrisDataset) * 0.2)
Assigning only 20% of the data to trainSize is a mistake; at least 70% or 80% of the data should be used for training. The correct way:
trainSize <- round(nrow(IrisDataset) * 0.8)
Error 9
testSize <- nrow(IrisDataset) - trainSet
This produces the error: object 'trainSet' not found; we have to use trainSize. The correct way:
testSize <- nrow(IrisDataset) - trainSize
Error 10
trainSizes
This produces the error: object 'trainSizes' not found; there is an extra letter 's'. The correct way:
trainSize
Error 11
trainSet <- IrisDataset[training_indices, ]
This produces the error: object 'training_indices' not found, because training_indices is used before it is defined. The correct way is to define it first:
training_indices <- sample(seq_len(nrow(IrisDataset)), size = trainSize)
Error 12
LinearModel<- lm(trainSet$Petal.Width ~ testingSet$Petal.Length)
The model must be fitted on the training set only, with Petal.Length as the response (left of ~) and Petal.Width as the predictor. The correct way, matching the form used earlier:
LinearModel <- lm(Petal.Length ~ Petal.Width, data = trainSet)
Error 13
#prediction<-predict(LinearModeltestSet)
A comma is missing between the model and the data set (and the line is commented out). The correct way, matching the call used earlier:
prediction <- predict(LinearModel, testSet)
Error 14
predictions
There is an extra letter 's'; the object is called prediction:
prediction
Was it straightforward to install R and RStudio? Yes, it was very easy to do.
Was the tutorial useful? Would you recommend it to others? The tutorial was very useful, I would recommend it.
What are the main lessons you've learned from this experience? I learned the basics of coding in R: defining variables, creating models, training and testing them, and making predictions. I also learned about the RStudio interface and how to create plots to compare information.