In this part of the course we started working with R. Compared with RapidMiner, R makes it much easier to manipulate data and offers far more options for processing and visualising data and for building models. The trade-off is that it requires more knowledge: R is not point-and-click software, so you have to write code to work with it. The task was split into two parts, as follows:
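We learned to install and load packages, read a CSV file into a data frame and inspect its attributes, summary statistics and structure: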
install.packages("readr", repos = "http://cran.us.r-project.org")
library(readr)
cars <- read.csv("cars.csv")
## $names
## [1] "name.of.car" "speed.of.car" "distance.of.car"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## name.of.car speed.of.car distance.of.car
## Dodge : 3 Min. : 4.0 Min. : 2.00
## Honda : 3 1st Qu.:12.0 1st Qu.: 26.00
## Jeep : 3 Median :15.0 Median : 36.00
## KIA : 3 Mean :15.4 Mean : 42.98
## Acura : 2 3rd Qu.:19.0 3rd Qu.: 56.00
## Audi : 2 Max. :25.0 Max. :120.00
## (Other):34
## 'data.frame': 50 obs. of 3 variables:
## $ name.of.car : Factor w/ 23 levels "Acura","Audi",..: 9 15 12 16 23 3 20 10 13 14 ...
## $ speed.of.car : int 4 4 7 7 8 9 10 10 10 11 ...
## $ distance.of.car: int 2 4 10 10 14 16 17 18 20 20 ...
## [1] "name.of.car" "speed.of.car" "distance.of.car"
## [1] 4 4 7 7 8 9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15
## [26] 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25
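The output above is the kind of result printed by R's basic inspection functions. The calls themselves are not shown in this report, so the following is only a sketch of what they may have looked like. The three-row data frame `cars_demo` and the car names in it are hypothetical stand-ins for the contents of cars.csv.

```r
# Hypothetical stand-in for the data read from cars.csv
cars_demo <- data.frame(
  name.of.car     = factor(c("Ford", "Jeep", "Honda")),
  speed.of.car    = c(4L, 4L, 7L),
  distance.of.car = c(2L, 4L, 10L)
)

attributes(cars_demo)    # names, class and row.names
summary(cars_demo)       # per-column summary statistics
str(cars_demo)           # compact structure: types and first values
names(cars_demo)         # column names only
cars_demo$speed.of.car   # a single column, extracted as a vector
```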
In this part we learned how to change the names and types of the columns and how to deal with missing values.
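A minimal sketch of these three operations follows. The data frame is made up for illustration. The new column names `speed` and `distance` match the model formula used later in this report, while `name` is an assumption. Dropping rows and imputing the mean are two common ways to handle missing values, not necessarily the ones used in the course.

```r
# Hypothetical example data with one missing value
cars_demo <- data.frame(
  name.of.car      = c("Ford", "Jeep", "Honda", "Audi"),
  speed.of.car     = c(4, 7, NA, 10),
  distance.of.car  = c("2", "10", "16", "17"),   # stored as text by mistake
  stringsAsFactors = FALSE
)

# Rename the columns
names(cars_demo) <- c("name", "speed", "distance")

# Change the type of a column from character to numeric
cars_demo$distance <- as.numeric(cars_demo$distance)

# Count the missing values in each column
colSums(is.na(cars_demo))

# Option 1: drop every row that contains a missing value
cars_complete <- na.omit(cars_demo)

# Option 2: replace missing speeds with the mean of the observed speeds
cars_imputed <- cars_demo
cars_imputed$speed[is.na(cars_imputed$speed)] <- mean(cars_imputed$speed, na.rm = TRUE)
```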
First of all we have to set the seed. The seed is a number we choose as the starting point for generating a sequence of random numbers; using the same seed lets others reproduce our results.
The next step is to calculate the sizes of our two data sets: the training set and the testing set.
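To illustrate the effect of the seed, here is a small sketch. The seed value 123 is an arbitrary choice and not necessarily the one used in the course.

```r
set.seed(123)                    # 123 is an arbitrary choice
first_draw <- sample(1:50, 5)

set.seed(123)                    # resetting the same seed...
second_draw <- sample(1:50, 5)   # ...reproduces the same draw

identical(first_draw, second_draw)
```

Because both draws start from the same seed, the comparison on the last line returns TRUE.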
## [1] 35
## [1] 15
The third step is to create the training and testing sets.
Next we create the model, train it on the training data set and check how it performs using the summary function.
Note: the 0 + term was added to force the regression line through the origin (a zero intercept). Without it the model gets a negative intercept, which makes it predict negative distances at low speeds, and that is physically impossible.
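The size calculation and the split can be sketched together as follows. This is not the original code: R's built-in `cars` data set (50 rows, columns `speed` and `dist`) is used as a stand-in for cars.csv, the seed is arbitrary, and the 70/30 ratio is an assumption that happens to agree with the sizes 35 and 15 printed above.

```r
set.seed(123)                                  # arbitrary seed, for reproducibility
cars_demo <- datasets::cars                    # stand-in data: speed, dist

train_size <- round(nrow(cars_demo) * 0.7)     # assumed 70 % for training
test_size  <- nrow(cars_demo) - train_size     # the rest for testing

# Pick the row numbers of the training set at random, without replacement
training_indices <- sample(seq_len(nrow(cars_demo)), size = train_size)

train_set <- cars_demo[training_indices, ]     # rows that were drawn
test_set  <- cars_demo[-training_indices, ]    # all remaining rows
```

Indexing with negative row numbers keeps every row that was not drawn, so the two sets never overlap.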
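The effect of the intercept can be checked with a short sketch. It fits both variants on R's built-in `cars` data set, used here as a stand-in with the column `dist` in place of `distance`, so the coefficients will not match the summary in this report exactly.

```r
cars_demo <- datasets::cars   # stand-in data: speed, dist

fit_with_intercept <- lm(dist ~ speed, data = cars_demo)
fit_through_origin <- lm(dist ~ 0 + speed, data = cars_demo)

coef(fit_with_intercept)   # the intercept is negative on this data
coef(fit_through_origin)   # a single slope, no intercept term

# Predicted distance at a low speed under both models
low_speed <- data.frame(speed = 2)
predict(fit_with_intercept, low_speed)   # negative distance
predict(fit_through_origin, low_speed)   # positive, since the slope is positive
```

One caveat: for a model without an intercept, summary computes R-squared against zero rather than against the mean of the response, so its value is not directly comparable with the R-squared of a model that has an intercept.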
##
## Call:
## lm(formula = distance ~ 0 + speed, data = trainSetCars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.919 -10.407 -8.862 -1.846 45.202
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## speed 2.9919 0.1338 22.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.79 on 34 degrees of freedom
## Multiple R-squared: 0.9363, Adjusted R-squared: 0.9345
## F-statistic: 500.1 on 1 and 34 DF, p-value: < 2.2e-16
The last step of this first task was to predict the distance that the cars travel using their speed. For this we used the predict function and applied it to the test dataset.
To view the values we simply call the name we gave to the prediction:
## 1 2 6 16 18 20 22 23
## 11.96763 11.96763 26.92717 38.89479 38.89479 41.88670 41.88670 41.88670
## 34 35 38 39 44 46 47
## 53.85433 53.85433 56.84624 59.83815 65.82196 71.80577 71.80577
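In the output above, the number printed over each prediction is the row number of the corresponding test observation. The whole step can be sketched as below, again with the built-in `cars` data set as a stand-in and an arbitrary seed, so the selected rows and predicted values will differ from those shown.

```r
set.seed(123)                                   # arbitrary seed
cars_demo <- datasets::cars                     # stand-in data: speed, dist

training_indices <- sample(seq_len(nrow(cars_demo)), size = 35)
train_set <- cars_demo[training_indices, ]
test_set  <- cars_demo[-training_indices, ]

# Train on the training set, then predict for the unseen test rows
model <- lm(dist ~ 0 + speed, data = train_set)
predicted_distance <- predict(model, newdata = test_set)

predicted_distance   # named vector: one prediction per test row
```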
We can also create a plot to compare the real and the predicted values.
In this task we had to find the errors in a given piece of code which, like the previous ones, built a linear regression model, this time predicting the petal length of iris flowers from their petal width.
This task showed me the importance of syntax and of writing code correctly. Errors like a missing comma, parenthesis or letter make the code impossible to run. Some examples are below:
Running summary(risDataset) gives Error in summary(risDataset) : object ‘risDataset’ not found, because the I in IrisDataset is missing. Running qqnorm(IrisDataset) gives Error in FUN(X[[i]], …) : only defined on a data frame with all numeric variables, because we did not specify the column for which the plot should be drawn. The fix is to pass a single numeric column to qqnorm instead of the whole data frame.
As in the previous example with the cars dataset, we have to build the linear regression model and train it. In the example code we were given, the training indices were not set, so we had to add them before we could build the model.
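One possible way to draw such a plot with base R graphics is sketched here. The data, seed and variable names are the same stand-ins as before and are assumptions, not the original code.

```r
set.seed(123)                                   # arbitrary seed
cars_demo <- datasets::cars                     # stand-in data: speed, dist
training_indices <- sample(seq_len(nrow(cars_demo)), size = 35)
train_set <- cars_demo[training_indices, ]
test_set  <- cars_demo[-training_indices, ]

model <- lm(dist ~ 0 + speed, data = train_set)
predicted_distance <- predict(model, newdata = test_set)

# Actual values on the x axis, predictions on the y axis
plot(test_set$dist, predicted_distance,
     xlab = "Actual distance", ylab = "Predicted distance",
     main = "Actual vs. predicted distance")
abline(a = 0, b = 1, lty = 2)   # points on this line are predicted exactly
```

The closer the points lie to the dashed diagonal, the better the predictions match the real values.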
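The corrected call is not reproduced in this report, so the following is only a sketch of what a working version could look like. It assumes that `IrisDataset` has the same columns as R's built-in `iris` data set and that the column of interest is `Petal.Length`; both are assumptions.

```r
iris_demo <- datasets::iris   # stand-in for IrisDataset

# A single numeric column is a valid input for a normal Q-Q plot
qqnorm(iris_demo$Petal.Length)
qqline(iris_demo$Petal.Length)   # reference line for a normal distribution
```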
# We need to set the training indices first
training_indices <- sample(seq_len(nrow(IrisDataset)), size = trainSizeIris)
trainSetIris <- IrisDataset[training_indices, ]   # was fine
testSetIris <- IrisDataset[-training_indices, ]   # was fine
LinearModel <- lm(Petal.Length ~ Petal.Width, data = trainSetIris)
# Previously: LinearModel <- lm(trainSet$Petal.Width ~ testingSet$Petal.Length)
# which gave the error: In eval(predvars, data, env) : object 'testingSet' not found
After building the model we can call the summary function to assess the performance of the model.
##
## Call:
## lm(formula = Petal.Length ~ Petal.Width, data = trainSetIris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.38983 -0.32418 0.01757 0.26909 1.17582
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.02721 0.08938 11.49 <2e-16 ***
## Petal.Width 2.27609 0.06291 36.18 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4885 on 103 degrees of freedom
## Multiple R-squared: 0.9271, Adjusted R-squared: 0.9264
## F-statistic: 1309 on 1 and 103 DF, p-value: < 2.2e-16
A little explanation of the coefficients shown in the summary:
Intercept: the predicted value of y when x = 0.
Estimate: the estimated coefficient. For the slope, it is the change in y for a one-unit increase in x.
Std. Error: the estimated standard deviation of the coefficient estimate.
t value: the estimate divided by its standard error. The larger its absolute value, the stronger the evidence that the coefficient is not zero.
Pr(>|t|): the p-value of the t-test, that is, the probability of seeing a t value at least this extreme if the true coefficient were zero. A p-value below 0.05 is conventionally taken to mean the result is significant.
The stars show the significance level: the more stars, the smaller the p-value.
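The relationship between these columns can be verified by hand. The sketch below fits the model on the full built-in `iris` data set rather than on the training split, which is an assumption made for brevity, so the numbers differ slightly from the summary above.

```r
iris_demo <- datasets::iris
fit <- lm(Petal.Length ~ Petal.Width, data = iris_demo)

# The coefficient table as a numeric matrix
coef_table <- coef(summary(fit))
coef_table

# t value = estimate / standard error
t_by_hand <- coef_table[, "Estimate"] / coef_table[, "Std. Error"]
all.equal(t_by_hand, coef_table[, "t value"])

# Two-sided p-value from the t distribution with the residual degrees of freedom
p_by_hand <- 2 * pt(abs(t_by_hand), df = df.residual(fit), lower.tail = FALSE)
all.equal(p_by_hand, coef_table[, "Pr(>|t|)"])
```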
The last step is to predict the petal length with the predict function and then print the result to see the output values.
prediction <- predict(LinearModel, testSetIris)
# Previously: Error: object 'predictions' not found, because the variable is named prediction, not predictions
prediction
## 1 2 3 5 11 18 19 28
## 1.482431 1.482431 1.482431 1.482431 1.482431 1.710040 1.710040 1.482431
## 29 33 36 45 48 49 55 56
## 1.482431 1.254821 1.482431 1.937649 1.482431 1.482431 4.441349 3.986131
## 57 58 59 61 62 65 66 68
## 4.668958 3.303304 3.986131 3.303304 4.441349 3.986131 4.213740 3.303304
## 70 77 83 84 94 95 98 100
## 3.530913 4.213740 3.758522 4.668958 3.303304 3.986131 3.986131 3.986131
## 101 104 105 111 113 116 125 131
## 6.717440 5.124176 6.034613 5.579395 5.807004 6.262222 5.807004 5.351786
## 133 135 140 141 145
## 6.034613 4.213740 5.807004 6.489831 6.717440
Looking at the raw output of the prediction does not tell us much about how well the prediction worked. We can plot it to interpret it more easily.
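A sketch of one way to do this is given below, with two simple error measures added alongside the plot. It uses the built-in `iris` data set, an arbitrary seed and an assumed 70/30 split (consistent with the 105 training rows implied by the summary above), so the selected rows will differ from those in this report. RMSE and MAE are swapped in here as common error measures; they were not part of the original task.

```r
set.seed(123)                                    # arbitrary seed
iris_demo <- datasets::iris                      # stand-in for IrisDataset

train_size <- round(nrow(iris_demo) * 0.7)       # assumed 70 % for training
idx        <- sample(seq_len(nrow(iris_demo)), size = train_size)
train_set  <- iris_demo[idx, ]
test_set   <- iris_demo[-idx, ]

model      <- lm(Petal.Length ~ Petal.Width, data = train_set)
prediction <- predict(model, newdata = test_set)

# Numeric summary of the prediction errors on the test set
errors <- test_set$Petal.Length - prediction
rmse   <- sqrt(mean(errors^2))                   # root mean squared error
mae    <- mean(abs(errors))                      # mean absolute error

# Visual check: actual vs. predicted petal length
plot(test_set$Petal.Length, prediction,
     xlab = "Actual petal length", ylab = "Predicted petal length")
abline(a = 0, b = 1, lty = 2)                    # perfect predictions lie on this line
```

Both error measures are in the same unit as the petal length, which makes them easier to interpret than the raw list of predictions.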