PRELIMINARY

As our Data Mining efforts at Blackwell Electronics continue to show positive results, we have decided to move into R, a programming language for statistical computing and graphics that is now widely considered one of the best tools for these purposes. The present report documents my experience with R on two training projects.
Note: I had already installed R and gone through a few tutorials in the past.

The two training projects I will cover are the following:
PROJECT1: Cars Braking Distance
PROJECT2: Iris Flower Petal Length

OUR OBJECTIVES:
1) Predicting how far a car travels after braking, based on its speed.
2) Predicting a flower's petal length from its petal width.
3) Documenting the errors and warning messages I encountered and how I overcame them.

PROJECT1: Cars Braking Distance

This dataset contains observations of cars of different brands: for each car, the speed it was traveling at and the distance it took to stop once it braked. The objective is to predict how far a new car will travel when it brakes, using its speed as the predictor.

Let’s start by importing the dataset and loading libraries

setwd("C:/Users/S/Documents/Ubiqum/dataanalyticsII_task1")    #folder containing the csv files
cars <- read.csv("cars.csv")
library(readr)    #loaded per the tutorial, although read.csv itself is base R
suppressPackageStartupMessages(library(ggplot2))    #for plotting, loaded quietly

Running a few commands to get a sense of what we’re dealing with

View(cars)
str(cars)
## 'data.frame':    50 obs. of  3 variables:
##  $ name.of.car    : Factor w/ 23 levels "Acura","Audi",..: 9 15 12 16 23 3 20 10 13 14 ...
##  $ speed.of.car   : int  4 4 7 7 8 9 10 10 10 11 ...
##  $ distance.of.car: int  2 4 10 10 14 16 17 18 20 20 ...

Making sure there are no missing values

any(is.na(cars))
## [1] FALSE

Visualizing relationships in data
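
The plot itself did not carry over into this report, so here is a minimal sketch of a command that would produce it (my own reconstruction; the tutorial's exact call may differ):

ggplot(cars, aes(x = speed.of.car, y = distance.of.car)) + geom_point()    #scatter of braking distance against speed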

Looks like we can fit a line quite well.

Let’s start with our regression model.
In order to do this, we first need to split the dataset into two parts: a training set and a testing set, choosing what percentage of the data goes into each. A common split, and the one used here, is 70% for training and 30% for testing:

set.seed(123)   #We set a known random seed so the model can be replicated

trainSize_cars <- nrow(cars)*0.7    #no rounding needed: 50 * 0.7 = 35 exactly
testSize_cars  <- nrow(cars) - trainSize_cars
training_indices_cars <- sample(seq_len(nrow(cars)), size = trainSize_cars)

Defining the training and testing sets we will use for modeling and prediction. The tutorial script drew the test rows with a second independent sample, which can put the same rows in both sets; taking the complement of the training indices avoids that.

trainSet_cars <- cars[training_indices_cars, ]
testSet_cars  <- cars[-training_indices_cars, ]    #complement of the training rows, so the sets never overlap

Creating and saving the Linear Regression Model now

LinearModel_cars <- lm(distance.of.car ~ speed.of.car, data = trainSet_cars)
summary(LinearModel_cars)
## 
## Call:
## lm(formula = distance.of.car ~ speed.of.car, data = trainSet_cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.0012 -5.0012 -0.5603  2.1458 28.4109 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -35.2481     4.0712  -8.658 5.25e-10 ***
## speed.of.car   5.0735     0.2519  20.143  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.18 on 33 degrees of freedom
## Multiple R-squared:  0.9248, Adjusted R-squared:  0.9225 
## F-statistic: 405.7 on 1 and 33 DF,  p-value: < 2.2e-16

So, the model is giving us an R2 of about 0.92, which means roughly 92% of the variance in braking distance is explained by speed. The slope coefficient (about 5.07) tells us that each extra unit of speed adds about five units of braking distance. The p-value is also extremely low, so the relationship is statistically significant.
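
To read those coefficients concretely, we can plug a hypothetical speed into the fitted line (the value 20 below is my own example, not taken from the tutorial):

predict(LinearModel_cars, data.frame(speed.of.car = 20))
#should match computing the line by hand: -35.2481 + 5.0735 * 20 = 66.22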

Let’s apply the Model and see the final predictions.

predictions_Distance <- predict(LinearModel_cars, testSet_cars)
predictions_Distance
##        10        23        27         7        48        32        38 
## 20.560267 35.780731 45.927707 15.486779 86.515612 56.074684 61.148172 
##        25        34        29         5         8        12        13 
## 40.854219 56.074684 51.001195  5.339803 15.486779 25.633755 25.633755 
##        18 
## 30.707243
View(predictions_Distance)

This would be the end of the first tutorial.
But there are a few more things we could do to further improve our model, namely:
1) Check for outliers and remove them if they exist.
2) Try squaring the speed variable: braking distance grows roughly with the square of speed (kinetic energy scales with speed squared), so a squared term should fit the physics better (see the sketch below).
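
As an illustration of point 2, here is a minimal sketch of the squared-speed model (my own, not run for this report):

LinearModel_cars2 <- lm(distance.of.car ~ I(speed.of.car^2), data = trainSet_cars)    #I() keeps the square as a literal arithmetic term
summary(LinearModel_cars2)    #compare R-squared and residuals against the linear fit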

PROJECT2: Iris Flower Petal Length

This dataset contains observations of three different species of the iris flower. Each observation records the length and width of the flower's sepals and petals.
Our goal is to predict Petal Length using Petal Width.
As for the tutorial, its script contained errors that I have fixed throughout the analysis.

The data analysis and prediction process will be the same as the one used for the previous dataset.

Starting by importing the dataset and libraries

setwd("C:/Users/S/Documents/Ubiqum/dataanalyticsII_task1")
iris <- read.csv("iris.csv")
library(readr)
suppressPackageStartupMessages(library(ggplot2))

Data Exploration

View(iris)
summary(iris)
##        X           Sepal.Length    Sepal.Width     Petal.Length  
##  Min.   :  1.00   Min.   :4.300   Min.   :2.000   Min.   :1.000  
##  1st Qu.: 38.25   1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600  
##  Median : 75.50   Median :5.800   Median :3.000   Median :4.350  
##  Mean   : 75.50   Mean   :5.843   Mean   :3.057   Mean   :3.758  
##  3rd Qu.:112.75   3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100  
##  Max.   :150.00   Max.   :7.900   Max.   :4.400   Max.   :6.900  
##   Petal.Width          Species  
##  Min.   :0.100   setosa    :50  
##  1st Qu.:0.300   versicolor:50  
##  Median :1.300   virginica :50  
##  Mean   :1.199                  
##  3rd Qu.:1.800                  
##  Max.   :2.500
str(iris)
## 'data.frame':    150 obs. of  6 variables:
##  $ X           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
names(iris)
## [1] "X"            "Sepal.Length" "Sepal.Width"  "Petal.Length"
## [5] "Petal.Width"  "Species"
any(is.na(iris))
## [1] FALSE

Visualization

hist(iris$Petal.Length)

hist(iris$Petal.Width, breaks = 30)

plot(iris$Sepal.Width, iris$Sepal.Length)

plot(iris$Petal.Length, iris$Petal.Width)

ggplot(iris, aes(x = Petal.Width, y = Petal.Length, color = Species)) + geom_point()

With the plot right above, we can clearly see how the different species differ in size. This suggests our model would predict Petal Length better if we fit each species separately. However, that requires enough observations per species to train and test on, and with only 50 rows per species our dataset is on the small side.
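
If we did want to try it, a per-species fit would look something like this (a sketch using the full data, just for illustration; not run for this report):

LinearModIris_setosa <- lm(Petal.Length ~ Petal.Width, data = subset(iris, Species == "setosa"))    #fit on one species only
summary(LinearModIris_setosa)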

Let’s continue by checking for outliers in the dataset, looking at our relevant variables:

boxplot(iris[,c("Petal.Length","Petal.Width")], main = "Petal Length and Petal Width")
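
The boxplot flags points beyond the whiskers as potential outliers. If we wanted to list those values programmatically, base R can extract them (a sketch):

boxplot.stats(iris$Petal.Length)$out    #values the boxplot would draw as outliers, if any
boxplot.stats(iris$Petal.Width)$out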

Regression model

#Setting a fixed random seed so the results can be reproduced
set.seed(234)

#Defining set sizes for the 70/30 train/test split
trainSize_iris <- round(nrow(iris)*0.7)
testSize_iris <- nrow(iris) - trainSize_iris

#Sampling the training row indices; the test set is the complement
training_indices_iris <- sample(seq_len(nrow(iris)), size = trainSize_iris)

trainSet_iris <- iris[training_indices_iris, ]
testSet_iris  <- iris[-training_indices_iris, ]

LinearModIris <- lm(Petal.Length ~ Petal.Width, data = trainSet_iris)
summary(LinearModIris)
## 
## Call:
## lm(formula = Petal.Length ~ Petal.Width, data = trainSet_iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.33555 -0.31363 -0.02864  0.24944  1.38367 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.10942    0.09415   11.78   <2e-16 ***
## Petal.Width  2.21922    0.06481   34.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5088 on 103 degrees of freedom
## Multiple R-squared:  0.9192, Adjusted R-squared:  0.9185 
## F-statistic:  1173 on 1 and 103 DF,  p-value: < 2.2e-16

This time around, roughly 92% of the variance in Petal Length is explained by Petal Width.

predictions_Length <- predict(LinearModIris, testSet_iris)
predictions_Length
##        2        3        4        6       11       12       13       14 
## 1.553263 1.553263 1.553263 1.997107 1.553263 1.553263 1.331341 1.331341 
##       17       22       27       36       37       38       41       50 
## 1.997107 1.997107 1.997107 1.553263 1.553263 1.331341 1.775185 1.553263 
##       51       52       53       56       59       61       62       64 
## 4.216328 4.438250 4.438250 3.994406 3.994406 3.328640 4.438250 4.216328 
##       67       73       76       80       82       94       96       98 
## 4.438250 4.438250 4.216328 3.328640 3.328640 3.328640 3.772484 3.994406 
##      100      101      102      107      111      113      118      122 
## 3.994406 6.657471 5.325938 4.882094 5.547860 5.769782 5.991704 5.547860 
##      124      131      143      145      147 
## 5.104016 5.325938 5.325938 6.657471 5.325938
View(predictions_Length)

This concludes the second analysis.
The next steps for improving our model include, but are not limited to, the following:
1) Calculating error metrics such as RMSE, MAE and mean relative error on the test set, and looking for new insights (see the sketch below).
2) Finding out more about the species themselves and seeing whether we can adapt our data accordingly.
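
For point 1, here is a minimal sketch of those metrics computed on the held-out test set (the variable names are my own; this was not run for the report):

actual_Length <- testSet_iris$Petal.Length
RMSE_iris <- sqrt(mean((actual_Length - predictions_Length)^2))    #root mean squared error
MAE_iris  <- mean(abs(actual_Length - predictions_Length))    #mean absolute error
MRE_iris  <- mean(abs(actual_Length - predictions_Length) / actual_Length)    #mean relative error
RMSE_iris; MAE_iris; MRE_iris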

CONCLUSION

The tutorial we followed for this analysis explained the basics of predictive analysis very well, and I would recommend it to others.
As for R, it is pretty straightforward to use for basic functions and simple datasets like the ones we worked with here. However, it is easy to run into errors as you try new things; admittedly, some are quick to fix and some take more time.

Overall, even from just this experience, I can tell R is far more versatile than RapidMiner, and because it is open source and supported by a vast community of users, it is easy to find help when you get stuck trying something.

Finally, my recommendations to other employees at Blackwell Electronics are mainly two:
1) You will learn many different uses, functions, tools, etc. every day. Don’t stress over the things that seem difficult at the moment.
2) All things considered, R is a less visual tool than RapidMiner, so it is important to stop and understand what new functions actually do behind the scenes (see the example below).
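
For example, R’s built-in help and inspection functions make this easy (using the model object from the first project as an example):

?lm    #opens the documentation for the lm function
str(LinearModel_cars)    #shows the structure of the fitted model object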

With this, I conclude my informal report.