1 Getting Started with R Report

In this part of the course we started working with R. Compared with RapidMiner, R makes it much easier to manipulate data and offers far more options for processing and visualising data and for building models. The trade-off is that R requires more knowledge: it is not point-and-click software, so you have to write code to work with it. The task was split into two parts, as follows:

  1. In the first part we worked with the cars dataset and had to predict the distance a car travelled based on the car’s speed.
  2. In the second part we had to correct code that built a linear regression model on the iris data to predict petal length from petal width.

1.1 First Part. Basics of R Using the Cars Dataset

We learned to

  • Install packages
  • Use libraries
  • Read data
  • Inspect the data in order to see its attributes, the types of its values and the way it is structured.
## $names
## [1] "name.of.car"     "speed.of.car"    "distance.of.car"
## 
## $class
## [1] "data.frame"
## 
## $row.names
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
##   name.of.car  speed.of.car  distance.of.car 
##  Dodge  : 3   Min.   : 4.0   Min.   :  2.00  
##  Honda  : 3   1st Qu.:12.0   1st Qu.: 26.00  
##  Jeep   : 3   Median :15.0   Median : 36.00  
##  KIA    : 3   Mean   :15.4   Mean   : 42.98  
##  Acura  : 2   3rd Qu.:19.0   3rd Qu.: 56.00  
##  Audi   : 2   Max.   :25.0   Max.   :120.00  
##  (Other):34
## 'data.frame':    50 obs. of  3 variables:
##  $ name.of.car    : Factor w/ 23 levels "Acura","Audi",..: 9 15 12 16 23 3 20 10 13 14 ...
##  $ speed.of.car   : int  4 4 7 7 8 9 10 10 10 11 ...
##  $ distance.of.car: int  2 4 10 10 14 16 17 18 20 20 ...
## [1] "name.of.car"     "speed.of.car"    "distance.of.car"
##  [1]  4  4  7  7  8  9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15
## [26] 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25
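
The inspection steps above can be sketched roughly as follows. This is only an illustration: it uses R's built-in `cars` data frame as a stand-in (an assumption; the course file has an extra name column and different column names), and the name `cars_df` is made up for the example.

```r
# Stand-in data: R's built-in `cars` data frame (50 rows; columns speed, dist).
# The course dataset was read from a file; cars_df is a hypothetical name.
cars_df <- cars

attrs <- attributes(cars_df)   # names, class and row names
print(attrs)

print(summary(cars_df))        # per-column summary statistics
str(cars_df)                   # structure: column types and first values

col_names <- names(cars_df)    # column names only
print(col_names)

speed_values <- cars_df$speed  # pull out a single column with $
print(speed_values)
```
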

1.2 Plots

  • Learned how to visualize the data using plots like:

1.2.1 Histogram Plot

1.2.2 Scatter (Box) Plot

1.2.3 Normal Quantile Plot

A normal quantile plot is a way to see whether our data is normally distributed.
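
A minimal sketch of the plot types listed in this section, using base R graphics on the built-in `cars` data (an assumption; titles and axis labels are chosen only for the example):

```r
speed <- cars$speed
dist  <- cars$dist

# Histogram: distribution of a single numeric variable
h <- hist(speed, main = "Histogram of speed", xlab = "Speed")

# Scatter plot: relationship between two numeric variables
plot(speed, dist, main = "Distance vs speed", xlab = "Speed", ylab = "Distance")

# Box plot: median, quartiles and possible outliers
boxplot(dist, main = "Box plot of distance")

# Normal quantile plot: points close to the reference line suggest normality
qq <- qqnorm(dist)
qqline(dist)
```
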

1.3 Preprocessing the data

In this part we learned how to rename columns, change their data types and deal with missing values.
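
As a rough sketch of these preprocessing steps, again on the built-in `cars` data (an assumption): the missing value is introduced artificially, and dropping rows or imputing the mean are just two common options, not necessarily the ones used in the course script.

```r
cars_df <- cars

# Rename the columns
names(cars_df) <- c("speed", "distance")

# Change a column's type
cars_df$speed <- as.numeric(cars_df$speed)

# Introduce one missing value artificially, for demonstration only
cars_df$distance[3] <- NA
n_missing <- sum(is.na(cars_df$distance))

# Option 1: drop rows that contain missing values
cars_complete <- na.omit(cars_df)

# Option 2: replace missing values with the column mean
cars_imputed <- cars_df
cars_imputed$distance[is.na(cars_imputed$distance)] <-
  mean(cars_imputed$distance, na.rm = TRUE)
```
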

1.4 Creating Testing and Training sets

First of all we have to set the seed. The seed is a number we choose as the starting point for generating a sequence of random numbers; fixing it helps others reproduce our results.
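
A small illustration of why the seed matters. The seed value 123 is arbitrary and only an assumption here; any fixed number works the same way.

```r
# The same seed gives the same "random" draw every time
set.seed(123)
first_draw <- sample(1:50, 5)

set.seed(123)
second_draw <- sample(1:50, 5)

print(identical(first_draw, second_draw))
```
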

The next step is to calculate the sizes of our two data sets: the training set and the testing set.

## [1] 35
## [1] 15
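
The sizes 35 and 15 are consistent with a 70/30 split of 50 rows. A sketch of that calculation, assuming the 70% ratio and using the built-in `cars` data for the row count:

```r
n_rows <- nrow(cars)                  # 50 rows in the built-in dataset

train_size <- round(n_rows * 0.7)     # assumed 70% for training
test_size  <- n_rows - train_size     # the remaining rows for testing

print(train_size)
print(test_size)
```
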

The third step is to create the training and testing sets.
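
One common way to do this is to sample row indices for the training set and use the remaining rows for testing. The sketch below assumes the built-in `cars` data, a seed of 123 and a 70/30 split; the variable names are made up for the example.

```r
cars_df <- cars
names(cars_df) <- c("speed", "distance")

set.seed(123)                                   # arbitrary seed
train_size <- round(nrow(cars_df) * 0.7)

# Randomly pick the row numbers that go into the training set
train_idx <- sample(seq_len(nrow(cars_df)), size = train_size)

train_set <- cars_df[train_idx, ]               # sampled rows
test_set  <- cars_df[-train_idx, ]              # everything else
```
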

1.5 The Linear Regression Model

Next we create and train the model on the training data set and check out how it performs using the summary function.

Note: the 0 + term was added to the formula to force the regression line through the origin (zero intercept). Without it, the fitted intercept is negative, which makes the model predict negative distances at low speeds, and that is physically impossible.

## 
## Call:
## lm(formula = distance ~ 0 + speed, data = trainSetCars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.919 -10.407  -8.862  -1.846  45.202 
## 
## Coefficients:
##       Estimate Std. Error t value Pr(>|t|)    
## speed   2.9919     0.1338   22.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.79 on 34 degrees of freedom
## Multiple R-squared:  0.9363, Adjusted R-squared:  0.9345 
## F-statistic: 500.1 on 1 and 34 DF,  p-value: < 2.2e-16
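
To make the effect of the 0 + term concrete, the sketch below fits the model both with and without an intercept. It assumes the built-in `cars` data, a seed of 123 and a 70/30 split, so its numbers will not match the output above.

```r
cars_df <- cars
names(cars_df) <- c("speed", "distance")

set.seed(123)
train_idx <- sample(seq_len(nrow(cars_df)), size = 35)
train_set <- cars_df[train_idx, ]

# Regression forced through the origin: only a slope is estimated
fit_origin <- lm(distance ~ 0 + speed, data = train_set)

# Default regression: intercept and slope are both estimated
fit_default <- lm(distance ~ speed, data = train_set)

print(coef(fit_origin))
print(coef(fit_default))
print(summary(fit_origin))
```

R computes R-squared differently when the intercept is removed, so the R-squared values of the two models are not directly comparable.
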

1.6 Predictions

The last step of this first task was to predict the distance that the cars travel using their speed. For this we used the predict function and applied it to the test dataset.

To view the predicted values, we just print the object in which we stored the predictions:

##        1        2        6       16       18       20       22       23 
## 11.96763 11.96763 26.92717 38.89479 38.89479 41.88670 41.88670 41.88670 
##       34       35       38       39       44       46       47 
## 53.85433 53.85433 56.84624 59.83815 65.82196 71.80577 71.80577

We can also create a plot to compare the real values with the predicted ones.
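
One possible version of such a plot puts the actual distances on the x axis and the predictions on the y axis; points close to the dashed diagonal are good predictions. The data, seed and names are the same assumptions as in the earlier sketches.

```r
cars_df <- cars
names(cars_df) <- c("speed", "distance")

set.seed(123)
train_idx <- sample(seq_len(nrow(cars_df)), size = 35)
train_set <- cars_df[train_idx, ]
test_set  <- cars_df[-train_idx, ]

model <- lm(distance ~ 0 + speed, data = train_set)

comparison <- data.frame(
  actual    = test_set$distance,
  predicted = unname(predict(model, newdata = test_set))
)

plot(comparison$actual, comparison$predicted,
     xlab = "Actual distance", ylab = "Predicted distance",
     main = "Actual vs predicted distance")
abline(a = 0, b = 1, lty = 2)   # perfect predictions would lie on this line
```
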

2 Find the Errors in an R Script Using the Iris Dataset

In this task we had to find the errors in a given script which, like the previous one, created a linear regression model, this time predicting the length of iris petals from their width.

This task showed me the importance of syntax and of writing code correctly. Errors such as a missing comma, parenthesis or letter make the code unrunnable. Some examples are below:

Running summary(risDataset) gives Error in summary(risDataset) : object ‘risDataset’ not found, because the I in IrisDataset is missing. Running qqnorm(IrisDataset) gives Error in FUN(X[[i]], …) : only defined on a data frame with all numeric variables, because we didn’t specify the column from which the plot should be constructed. The correctly written qqnorm call is:
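
A hedged sketch of the corrected call: the built-in `iris` data stands in for the course file, and Petal.Width is an assumed choice of column, picked because it is the predictor used later in the model.

```r
IrisDataset <- iris   # built-in data as a stand-in for the course file

# qqnorm(IrisDataset) fails because the whole data frame is passed in.
# Passing one numeric column works:
qq <- qqnorm(IrisDataset$Petal.Width)
qqline(IrisDataset$Petal.Width)
```
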

As in the previous example with the cars dataset, we have to build the linear regression model and train it. In the example code we were given, the training indices weren’t set, so we had to add them before we could build the model.
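
The missing step can be sketched as follows. The built-in `iris` data, the seed and the 70/30 ratio are assumptions; the ratio is at least consistent with the 103 residual degrees of freedom in the summary output (105 training rows minus 2 coefficients).

```r
set.seed(123)                                   # arbitrary seed
train_size <- round(nrow(iris) * 0.7)           # 105 of 150 rows

# The step that was missing in the given script: choose the training rows
train_idx <- sample(seq_len(nrow(iris)), size = train_size)

trainSetIris <- iris[train_idx, ]
testSetIris  <- iris[-train_idx, ]

# Only now can the model be trained on the training set
model <- lm(Petal.Length ~ Petal.Width, data = trainSetIris)
```
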

After building the model we can call the summary function to assess the performance of the model.

## 
## Call:
## lm(formula = Petal.Length ~ Petal.Width, data = trainSetIris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.38983 -0.32418  0.01757  0.26909  1.17582 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.02721    0.08938   11.49   <2e-16 ***
## Petal.Width  2.27609    0.06291   36.18   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4885 on 103 degrees of freedom
## Multiple R-squared:  0.9271, Adjusted R-squared:  0.9264 
## F-statistic:  1309 on 1 and 103 DF,  p-value: < 2.2e-16

A little explanation of the coefficients shown in the model summary:

  • Intercept: the average value of y when x = 0.
  • Estimate: the change in y associated with a one-unit increase in x.
  • Std. Error: the standard deviation of the estimate.
  • t value: the estimate divided by its standard error; a large absolute value is evidence that the coefficient differs from zero.
  • Pr(>|t|): the p-value of the t-test, i.e. the probability of seeing a t value at least this extreme through random variation alone if the true coefficient were zero. A p-value below 0.05 is conventionally treated as significant.
  • Stars: mark the significance level; more stars mean a smaller p-value.
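
The relationship between these columns can be checked directly from the coefficient table. In this sketch the model is fitted on the full built-in `iris` data for simplicity (an assumption), so the numbers differ slightly from the output above.

```r
model <- lm(Petal.Length ~ Petal.Width, data = iris)

# Coefficient table as a numeric matrix
coef_table <- coef(summary(model))
print(coef_table)

# t value = estimate / standard error
t_manual <- coef_table[, "Estimate"] / coef_table[, "Std. Error"]

# Two-sided p-value from the t distribution with the residual degrees of freedom
p_manual <- 2 * pt(abs(t_manual), df = df.residual(model), lower.tail = FALSE)
```
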

The last step is predicting the petal length with the predict function and printing the result to see the output values.

##        1        2        3        5       11       18       19       28 
## 1.482431 1.482431 1.482431 1.482431 1.482431 1.710040 1.710040 1.482431 
##       29       33       36       45       48       49       55       56 
## 1.482431 1.254821 1.482431 1.937649 1.482431 1.482431 4.441349 3.986131 
##       57       58       59       61       62       65       66       68 
## 4.668958 3.303304 3.986131 3.303304 4.441349 3.986131 4.213740 3.303304 
##       70       77       83       84       94       95       98      100 
## 3.530913 4.213740 3.758522 4.668958 3.303304 3.986131 3.986131 3.986131 
##      101      104      105      111      113      116      125      131 
## 6.717440 5.124176 6.034613 5.579395 5.807004 6.262222 5.807004 5.351786 
##      133      135      140      141      145 
## 6.034613 4.213740 5.807004 6.489831 6.717440

2.1 Visualise the Prediction

As we can tell, just looking at the output of the prediction doesn’t tell us much about how well the prediction worked. We can plot it in order to interpret it better.
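
One way to draw such a plot is to show the actual test values as points and overlay the fitted regression line, on which all predictions lie. The data, seed, split and colours below are assumptions made for the example.

```r
set.seed(123)
train_idx <- sample(seq_len(nrow(iris)), size = round(nrow(iris) * 0.7))
trainSetIris <- iris[train_idx, ]
testSetIris  <- iris[-train_idx, ]

model <- lm(Petal.Length ~ Petal.Width, data = trainSetIris)
predicted <- predict(model, newdata = testSetIris)

# Actual test values
plot(testSetIris$Petal.Width, testSetIris$Petal.Length,
     xlab = "Petal width", ylab = "Petal length",
     main = "Actual vs predicted petal length")

# Predictions (red points) and the fitted regression line
points(testSetIris$Petal.Width, predicted, col = "red", pch = 19)
abline(model, col = "red")

# The predictions are exactly intercept + slope * width
on_line <- unname(coef(model)[1]) +
  unname(coef(model)[2]) * testSetIris$Petal.Width
```
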