Case

Blackwell Electronics has gathered experience with RapidMiner in two projects, which addressed customer buying patterns and the prediction of product profitability. Because of industry trends and the wide range of options R offers, the company intends to proceed with R. A pilot project is therefore planned to gain first experience with, and a basic understanding of, the statistical programming language R. For this purpose, two datasets are available, which should be used to transfer the lessons learned from both RapidMiner projects and to implement a base case in R. The ‘Cars’ dataset is used to train a linear regression model that predicts brake distance from a car’s speed, whereas the ‘Iris’ dataset calls for a linear regression model that predicts petal length from petal width.

Executive summary

Procedure

Orientation in R and RStudio

Both datasets were analysed as requested

The analysis followed the common data mining process in an iterative way. This corresponds to the recent projects in RapidMiner, so that existing knowledge, experience and lessons learned could be taken into account.

As suggested by the data mining process, the major steps are data cleansing, exploration, pre-processing, feature engineering, modelling, prediction and analysis of the models’ performance

Experiences

R offers a straightforward entry, so that installation, orientation, loading data, etc. can be accomplished rapidly

In case of errors, the community’s huge knowledge base and the documentation help out

R comes with an immense number of packages and functions for nearly every application and individual setup

Learning R can be managed by almost everyone, as a plethora of videos and learning material is available

Further investments in data mining using R are highly advisable due to the extent of basic and advanced applications combined with a comparatively easy entry

Results

In the analysis, linear regression models were trained for both datasets and achieved a good model fit

In the case of ‘Cars’, it was crucial to identify outliers using a boxplot and to transform the data, as the observations follow a quadratic function, which cannot be captured optimally by a linear model

In the case of ‘Iris’, the data were normally distributed and there were no outliers in the features covered by the analysis

The model for ‘Cars’ achieves a root mean square error (RMSE) of about 2.5 and a mean absolute error (MAE) of about 1.8. Both figures are quite similar, and with an R^2 of 0.989 one can attest a sound performance.

The same applies to the linear model for ‘Iris’, which achieves an R^2 of approximately 0.92, an RMSE of 0.5 and an MAE of 0.4. However, given the wide range of absolute and relative errors, it has to be used with care, as the errors are quite large, especially for the species setosa. A more detailed analysis could lead to a better model fit, as the observations lie close together within each species. It therefore makes sense to train a separate linear model for each species.

Technical documentation

0. Questions

  • Was it straightforward to install R and RStudio? The installation of R did not cause major problems. R is open source and can be downloaded from https://cran.r-project.org/. The same applies to RStudio (https://www.rstudio.com/products/rstudio/download/), an integrated development environment (IDE). Once RStudio is installed, so-called packages need to be installed by typing install.packages("name") into the console.

  • Was the tutorial useful? Would you recommend it to others? I would recommend the tutorial to get a first grasp of how to use R and RStudio. However, section 6 “Finding Errors..” does not add much value, as it largely repeats the tutorial, but it at least gives a basic feeling for frequently occurring errors.

  • What are the main lessons you’ve learned from this experience?
    R and the IDE RStudio offer a straightforward entry and make it possible to obtain results within days. Besides this, R offers plenty of packages and functions, each with many parameters, for almost every application and design requirement. It is therefore highly recommended to plan further steps with R when approaching further data mining applications in eCommerce.

  • What recommendations would you give to other employees who need to get started using R and doing predictive analytics in R instead of RapidMiner? Keep the same underlying methodology in mind, approach the data step by step and, if necessary, iterate several times to achieve the best results. This understanding helps a lot in transferring the procedure from the RapidMiner GUI to R, as all steps still have to be taken into account and only the execution of the tasks looks different.

I. Cars dataset - brake distance

1. Data exploration & pre-processing

Overview of dataset:

## 'data.frame':    50 obs. of  3 variables:
##  $ name.of.car    : Factor w/ 23 levels "Acura","Audi",..: 9 15 12 16 23 3 20 10 13 14 ...
##  $ speed.of.car   : int  4 4 7 7 8 9 10 10 10 11 ...
##  $ distance.of.car: int  2 4 10 10 14 16 17 18 20 20 ...
##   name.of.car  speed.of.car  distance.of.car 
##  Dodge  : 3   Min.   : 4.0   Min.   :  2.00  
##  Honda  : 3   1st Qu.:12.0   1st Qu.: 26.00  
##  Jeep   : 3   Median :15.0   Median : 36.00  
##  KIA    : 3   Mean   :15.4   Mean   : 42.98  
##  Acura  : 2   3rd Qu.:19.0   3rd Qu.: 56.00  
##  Audi   : 2   Max.   :25.0   Max.   :120.00  
##  (Other):34

The dataset contains 0 missing values.

The feature type of name has been changed to factor, as it is categorical.
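The inspection steps above can be sketched roughly as follows. This is a minimal sketch, assuming R's built-in `cars` data as a stand-in for the project file; the built-in set has only `speed` and `dist`, so the `name` column and its values are made-up placeholders.

```r
# Stand-in data: R's built-in `cars` (50 rows: speed, dist).
# The brand names below are placeholders, not the project's real values.
carsData <- data.frame(
  name     = rep(c("BrandA", "BrandB"), length.out = nrow(cars)),
  speed    = cars$speed,
  distance = cars$dist,
  stringsAsFactors = FALSE
)

# Treat the brand name as a categorical feature
carsData$name <- as.factor(carsData$name)

str(carsData)      # structure: types and first values per column
summary(carsData)  # summary statistics per feature

# Count missing values over the whole data frame
nMissing <- sum(is.na(carsData))
cat("The dataset contains", nMissing, "missing values.\n")
```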

Scatterplot of label and independent feature.

One can see that the observations follow a quadratic rather than a linear relation.

Histograms of all features:

Boxplot of features:

Outliers have been excluded.
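One possible way to exclude the points a boxplot flags as outliers is sketched below. It assumes the built-in `cars` data as a stand-in and the default boxplot rule (points more than 1.5 times the interquartile range beyond the hinges); the report does not state which rule was actually applied.

```r
# Stand-in data: R's built-in `cars`, renamed to match the report
carsData <- data.frame(speed = cars$speed, distance = cars$dist)

# boxplot.stats() returns the values beyond 1.5 * IQR from the hinges,
# i.e. the same points boxplot() would draw as outliers
outDistance <- boxplot.stats(carsData$distance)$out
outSpeed    <- boxplot.stats(carsData$speed)$out

# Keep only rows whose values are not flagged in either feature
keep      <- !(carsData$distance %in% outDistance) &
             !(carsData$speed %in% outSpeed)
carsClean <- carsData[keep, ]

nrow(carsData) - nrow(carsClean)  # number of rows removed
```

With the stand-in data, a single row is removed, which is consistent with the 49 objects used for the train/test split later on.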

2. Feature Engineering - Transformation

Transformation of distance scale –> sqrt(distance)

Plots of the transformed features, speed^2 and sqrtdistance:
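The two transformations can be sketched as follows, again assuming the built-in `cars` data as a stand-in. The column names `sqrtdistance` and `speed2` follow the model formulas shown later in this report. The idea is that if distance is roughly proportional to speed^2, then sqrt(distance) is roughly linear in speed, and distance is roughly linear in speed^2.

```r
# Stand-in data: R's built-in `cars`, renamed to match the report
carsData <- data.frame(speed = cars$speed, distance = cars$dist)

# Option 1: transform the label so it becomes linear in speed
carsData$sqrtdistance <- sqrt(carsData$distance)

# Option 2: transform the independent feature instead
carsData$speed2 <- carsData$speed^2

head(carsData)
```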

3. Modelling

Train and test set have been compiled as follows:

  • Set seed to 123
  • Applying a split ratio of 70/30 results in a training set of 34 and a test set of 15 objects.
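A split like the one described above could look as follows. This is a sketch using base R's `sample()`; the report does not say which splitting function was used, so the rows that end up in each set (and therefore the fitted coefficients) may differ from the results shown below.

```r
# Stand-in data: built-in `cars` without the one boxplot outlier (49 rows)
carsData <- data.frame(speed = cars$speed, distance = cars$dist)
carsData <- carsData[carsData$distance < 120, ]

set.seed(123)                                   # reproducible split
nTrain   <- floor(0.7 * nrow(carsData))         # 70 % of 49 rows -> 34
trainIdx <- sample(seq_len(nrow(carsData)), size = nTrain)

trainSet <- carsData[trainIdx, ]                # training rows
testSet  <- carsData[-trainIdx, ]               # remaining rows

c(train = nrow(trainSet), test = nrow(testSet))
```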

Based on this setup, different linear regression models have been trained:

  • Model 1: Original data (Base model)
  • Model 2: sqrtdistance (similar to model 3)
  • Model 3: squared speed, as this corresponds to the physical law relating speed and brake distance
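The three models listed above can be fitted with `lm()` roughly as sketched here. The formulas match the `Call:` lines in the output below; the data preparation is an assumption (built-in `cars` as a stand-in, base R split), so the resulting numbers will not match the report.

```r
# Stand-in data with both transformed features
carsData <- data.frame(speed = cars$speed, distance = cars$dist)
carsData <- carsData[carsData$distance < 120, ]   # drop the boxplot outlier
carsData$sqrtdistance <- sqrt(carsData$distance)
carsData$speed2       <- carsData$speed^2

set.seed(123)
trainIdx <- sample(seq_len(nrow(carsData)),
                   size = floor(0.7 * nrow(carsData)))
trainSet <- carsData[trainIdx, ]

model1 <- lm(distance ~ speed,      data = trainSet)  # base model
model2 <- lm(sqrtdistance ~ speed,  data = trainSet)  # transformed label
model3 <- lm(distance ~ speed2,     data = trainSet)  # transformed feature

# Compare R^2 across models; summary(model1) etc. prints the full report
rSquared <- sapply(list(model1 = model1, model2 = model2, model3 = model3),
                   function(m) summary(m)$r.squared)
rSquared
```

Note that the R^2 of model 2 refers to the square-root scale and is therefore not directly comparable with the other two.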

Model performance:

  1. LR model based on original data:
## 
## Call:
## lm(formula = distance ~ speed, data = trainSet)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.757 -3.612 -1.076  2.928 13.214 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -27.3587     2.6414  -10.36 9.53e-12 ***
## speed         4.5362     0.1611   28.16  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.779 on 32 degrees of freedom
## Multiple R-squared:  0.9612, Adjusted R-squared:   0.96 
## F-statistic:   793 on 1 and 32 DF,  p-value: < 2.2e-16
  2. LR model based on sqrtdistance:
## 
## Call:
## lm(formula = sqrtdistance ~ speed, data = trainSet)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34930 -0.09397 -0.05924  0.09940  0.43827 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.606392   0.093027   6.518 2.44e-07 ***
## speed       0.366097   0.005673  64.531  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1683 on 32 degrees of freedom
## Multiple R-squared:  0.9924, Adjusted R-squared:  0.9921 
## F-statistic:  4164 on 1 and 32 DF,  p-value: < 2.2e-16
  3. LR model based on speed^2:
## 
## Call:
## lm(formula = distance ~ speed2, data = trainSet)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7541 -0.9707 -0.5830  1.1084  7.2639 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.604061   0.856192   4.209 0.000194 ***
## speed2      0.147830   0.002744  53.880  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.534 on 32 degrees of freedom
## Multiple R-squared:  0.9891, Adjusted R-squared:  0.9888 
## F-statistic:  2903 on 1 and 32 DF,  p-value: < 2.2e-16

4. Prediction

Comparison between actual and predicted values and between different models

If the actual values and the values predicted by the base model (B) are compared, one can see that the predictions are more linear than the actual values.
For that reason, models 2 and 3 are computed using transformed features (sqrtdistance and speed^2). In this way, the actual quadratic relation between speed and brake distance is transformed into a linear one, which is more suitable for an LR model. Both models are compared in (C), and one can see that there is almost no difference between the two predictions. Consequently, model 3 is selected, as it comes with some advantages concerning coding effort.
Finally, in (D) the predictions of model 3 are compared with the actual values.
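The comparison of models 2 and 3 needs one extra step, sketched below: model 2 predicts sqrtdistance, so its predictions have to be squared before they can be compared with distances. This may be part of the coding effort mentioned above. The data preparation is again an assumption (built-in `cars` as a stand-in).

```r
# Stand-in data with both transformed features
carsData <- data.frame(speed = cars$speed, distance = cars$dist)
carsData <- carsData[carsData$distance < 120, ]
carsData$sqrtdistance <- sqrt(carsData$distance)
carsData$speed2       <- carsData$speed^2

set.seed(123)
trainIdx <- sample(seq_len(nrow(carsData)),
                   size = floor(0.7 * nrow(carsData)))
trainSet <- carsData[trainIdx, ]
testSet  <- carsData[-trainIdx, ]

model2 <- lm(sqrtdistance ~ speed, data = trainSet)
model3 <- lm(distance ~ speed2,    data = trainSet)

comparison <- data.frame(
  speed  = testSet$speed,
  actual = testSet$distance,
  model2 = predict(model2, newdata = testSet)^2,  # back to distance scale
  model3 = predict(model3, newdata = testSet)     # already on distance scale
)
comparison$difference <- comparison$model2 - comparison$model3

comparison[order(comparison$speed), ]
```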

Comparing the selected LR model with actual values

The selected model 3, based on the squared speed, is plotted against all points in the dataset. This shows which brands are over- or underestimated. However, this comparison includes the training data, so it is not very meaningful as an evaluation, but it helps to grasp general trends.

Analysis of absolute errors and relative errors

Calculation of performance metrics

The selected LR model achieves an RMSE of 2.458 and an MAE of 1.849 when predicting brake distance from squared speed.
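Both metrics are simple to compute by hand. The helper functions `rmse()` and `mae()` below are illustrative names, not functions from a package used in this report, and the example vectors are made up to show the arithmetic.

```r
# Root mean square error: penalises large errors more strongly
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Mean absolute error: average size of the errors
mae <- function(actual, predicted) {
  mean(abs(actual - predicted))
}

# Made-up example values (errors: -2, 2, -3, 0)
actual    <- c(10, 20, 30, 40)
predicted <- c(12, 18, 33, 40)

rmse(actual, predicted)  # sqrt((4 + 4 + 9 + 0) / 4) = sqrt(4.25)
mae(actual, predicted)   # (2 + 2 + 3 + 0) / 4 = 1.75
```

RMSE is never smaller than MAE; a large gap between the two points to a few large errors, whereas similar values indicate errors of comparable size.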

========================================================================================================================================================

II. Iris dataset - Predicting petal length

1. Data exploration & pre-processing

Overview of dataset:

## 'data.frame':    150 obs. of  6 variables:
##  $ X           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
##        X           Sepal.Length    Sepal.Width     Petal.Length  
##  Min.   :  1.00   Min.   :4.300   Min.   :2.000   Min.   :1.000  
##  1st Qu.: 38.25   1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600  
##  Median : 75.50   Median :5.800   Median :3.000   Median :4.350  
##  Mean   : 75.50   Mean   :5.843   Mean   :3.057   Mean   :3.758  
##  3rd Qu.:112.75   3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100  
##  Max.   :150.00   Max.   :7.900   Max.   :4.400   Max.   :6.900  
##   Petal.Width          Species  
##  Min.   :0.100   setosa    :50  
##  1st Qu.:0.300   versicolor:50  
##  Median :1.300   virginica :50  
##  Mean   :1.199                  
##  3rd Qu.:1.800                  
##  Max.   :2.500

The dataset contains 0 missing values.

The feature type of species has been changed from categorical to integer for the histogram only.
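A sketch of this step, assuming R's built-in `iris` data as a stand-in for the project file (which additionally carries an index column `X`). The integer codes are stored in a separate, hypothetically named column so that the factor stays available for later steps.

```r
# Stand-in data: R's built-in `iris`
irisData <- iris

# Count missing values over the whole data frame
nMissing <- sum(is.na(irisData))

# Integer codes of the factor levels (1 = setosa, 2 = versicolor, 3 = virginica)
irisData$SpeciesInt <- as.integer(irisData$Species)

# Cross-check the mapping between labels and codes
table(irisData$Species, irisData$SpeciesInt)

# hist(irisData$SpeciesInt) would then draw the histogram
```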

Histograms of all features:

Boxplot of features:

Outliers need not be excluded, as the features used to predict petal length (pLength) from petal width (pWidth) contain none.

Scatterplot of label and independent feature.

The distributions of pWidth, pLength and pLength after normalization are checked using normal Q-Q plots. As one can see, there is no difference between the second and the third plot, i.e. between the original and the normalized feature. The feature is therefore considered to be normally distributed.
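The Q-Q comparison can be reproduced numerically as sketched below. It assumes that normalization means z-score standardization via `scale()`, which the report does not state, and uses the built-in `iris` data as a stand-in. Since standardization is a linear rescaling, it leaves the shape of a Q-Q plot unchanged; how closely the points follow the reference line (`qqline()`), possibly complemented by a formal test such as `shapiro.test()`, is what indicates normality.

```r
pLength     <- iris$Petal.Length
pLengthNorm <- as.numeric(scale(pLength))  # z-scores (assumed normalization)

# plot.it = FALSE returns the plot coordinates without drawing;
# drop the argument to see the actual Q-Q plots
qqOrig <- qqnorm(pLength,     plot.it = FALSE)
qqNorm <- qqnorm(pLengthNorm, plot.it = FALSE)

# Same theoretical quantiles for both versions
sameTheoretical <- isTRUE(all.equal(qqOrig$x, qqNorm$x))

# Sample quantiles differ only by a linear rescaling
quantileCor <- cor(qqOrig$y, qqNorm$y)

sameTheoretical
quantileCor
```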

2. Modelling

Train and test set have been compiled as follows:

  • Set seed to 123
  • Applying a split ratio of 80/20 results in a training set of 120 and a test set of 30 objects.
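The split and the model fit can be sketched as follows. The names `pLength`, `pWidth` and `Iris_trainSet` follow the model call shown below; `Iris_testSet`, `irisModel` and the use of base R's `sample()` are assumptions, so the fitted coefficients may differ from the reported ones.

```r
# Stand-in data: built-in `iris`, columns renamed as in the report
irisData <- data.frame(
  pLength = iris$Petal.Length,
  pWidth  = iris$Petal.Width,
  Species = iris$Species
)

set.seed(123)                                     # reproducible split
nTrain   <- round(0.8 * nrow(irisData))           # 80 % of 150 rows -> 120
trainIdx <- sample(seq_len(nrow(irisData)), size = nTrain)

Iris_trainSet <- irisData[trainIdx, ]
Iris_testSet  <- irisData[-trainIdx, ]

# Linear regression of petal length on petal width
irisModel <- lm(pLength ~ pWidth, data = Iris_trainSet)
coef(irisModel)
```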
## 
## Call:
## lm(formula = pLength ~ pWidth, data = Iris_trainSet)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.39883 -0.32153 -0.00922  0.30938  1.36918 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.05562    0.08445   12.50   <2e-16 ***
## pWidth       2.26800    0.06136   36.96   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5004 on 118 degrees of freedom
## Multiple R-squared:  0.9205, Adjusted R-squared:  0.9198 
## F-statistic:  1366 on 1 and 118 DF,  p-value: < 2.2e-16

3. Prediction

Comparing the LR model with actual values

Analysis of absolute errors and relative errors
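Absolute and relative errors per species could be computed roughly as sketched here. The data preparation is an assumption (built-in `iris`, base R split), and relative error is taken as the absolute error divided by the actual value. Because setosa petals are short, a given absolute error translates into a larger relative error for that species.

```r
irisData <- data.frame(
  pLength = iris$Petal.Length,
  pWidth  = iris$Petal.Width,
  Species = iris$Species
)

set.seed(123)
trainIdx <- sample(seq_len(nrow(irisData)),
                   size = round(0.8 * nrow(irisData)))
Iris_trainSet <- irisData[trainIdx, ]
Iris_testSet  <- irisData[-trainIdx, ]

irisModel <- lm(pLength ~ pWidth, data = Iris_trainSet)

errors <- data.frame(
  Species   = Iris_testSet$Species,
  actual    = Iris_testSet$pLength,
  predicted = predict(irisModel, newdata = Iris_testSet)
)
errors$absError <- abs(errors$actual - errors$predicted)
errors$relError <- errors$absError / errors$actual

# Mean absolute and relative error for each species in the test set
errorsBySpecies <- aggregate(cbind(absError, relError) ~ Species,
                             data = errors, FUN = mean)
errorsBySpecies
```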

Calculation of performance metrics

The LR model achieves an RMSE of 0.496 and an MAE of 0.377 when predicting pLength from pWidth.
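The executive summary suggests training a separate linear model for each species. A minimal sketch of that idea is shown below, assuming the built-in `iris` data as a stand-in and, for brevity, fitting on all rows rather than on a training split; `speciesModels` is an illustrative name.

```r
irisData <- data.frame(
  pLength = iris$Petal.Length,
  pWidth  = iris$Petal.Width,
  Species = iris$Species
)

# One data frame per species, then one linear model per data frame
speciesModels <- lapply(
  split(irisData, irisData$Species),
  function(d) lm(pLength ~ pWidth, data = d)
)

# Intercept and slope per species
t(sapply(speciesModels, coef))

# Residual standard error per species (same unit as pLength)
sapply(speciesModels, function(m) summary(m)$sigma)
```

The residual standard error of each model can be set against the 0.5004 reported for the single model to judge whether the per-species approach pays off.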