Blackwell Electronics has gathered experience with RapidMiner in two projects that focused on customer buying patterns and the prediction of product profitability. Due to industry trends and the wide range of options it offers, the company intends to proceed with R. A pilot project has therefore been set up to gain first-hand experience and a basic understanding of the statistical programming language R. Two datasets are available for this purpose and should be used to transfer the lessons learned from both RapidMiner projects and implement a base case in R. The ‘Cars’ dataset is used to train a linear regression model that predicts braking distance from a car’s speed, while the ‘Iris’ dataset calls for a linear regression model that predicts petal length from petal width.
Procedure
Orientation in R and RStudio
Both datasets have been analysed as requested
The analysis followed the common data mining process in an iterative way. This corresponds to the recent RapidMiner projects, so existing knowledge, experience and lessons learned could be taken into account.
As suggested by the data mining process, the major steps are data cleansing, exploration, pre-processing, feature engineering, modelling, prediction and analysis of model performance
Experiences
R offers an easy entry, so installation, orientation, loading data, etc. can be accomplished quickly
In case of errors, the extensive knowledge base of the community and the documentation help out
R comes with an immense number of packages and functions for virtually every application and individual setup
Learning R can be managed by almost everyone, as a plethora of videos and learning material is available
Further investments in data mining using R are highly advisable given the range of basic and advanced applications and the comparatively easy entry
Results
Within the analysis, linear regression models were trained for both datasets, achieving a good model fit
In the case of ‘Cars’ it was crucial to identify outliers using a boxplot and to transform the data, as the observations follow a quadratic function, which a linear model cannot capture optimally
In the case of ‘Iris’ the data was normally distributed and contained no outliers within the features covered by the analysis
The model performance for ‘Cars’ is around 2.5 in terms of root mean square error (RMSE) and 1.9 in terms of mean absolute error (MAE). Both figures are quite similar, and with an R^2 of 0.989 the model shows sound performance.
The same applies to the linear model for ‘Iris’, which achieves an R^2 of approximately 0.92, an RMSE of 0.5 and an MAE of 0.4. However, given the wide range of absolute and relative errors, it has to be used with care, as the errors are quite large, especially for the species setosa. A more detailed analysis could lead to a better model fit, since the observations cluster tightly within each species. It therefore makes sense to train a separate linear model for each species, as sketched below.
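A minimal sketch of this idea, assuming the Iris data is loaded as a data frame named iris_data with the standard columns Petal.Length, Petal.Width and Species:
species_models <- lapply(split(iris_data, iris_data$Species), function(d) {
  lm(Petal.Length ~ Petal.Width, data = d)   # one linear model per species
})
summary(species_models[["setosa"]])          # inspect the setosa fit, for example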
Was it straightforward to install R and RStudio? The installation of R did not pose major problems. It is open source and can be downloaded from https://cran.r-project.org/. The same applies to RStudio (https://www.rstudio.com/products/rstudio/download/), an Integrated Development Environment (IDE). Once RStudio is installed, so-called packages need to be installed by typing install.packages("name") into the console.
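A minimal example (ggplot2 is only a stand-in here; any package name works the same way):
install.packages("ggplot2")   # download and install the package from CRAN
library(ggplot2)              # load the installed package into the current session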
Was the tutorial useful? Would you recommend it to others? I would recommend the tutorial to get a first grasp of how to use R and RStudio. However, section 6 “Finding Errors..” does not add that much value, as it largely repeats the rest of the tutorial, though it at least gives a basic feeling for frequently occurring errors.
What are the main lessons you’ve learned from this experience?
R and the IDE RStudio offer an easy entry and make it possible to obtain results within days. Besides this, R offers plenty of packages and functions, each with many parameters, for almost every application and design requirement. It is therefore highly recommended to plan further data mining applications within eCommerce using R.
What recommendations would you give to other employees who need to get started using R and doing predictive analytics in R instead of RapidMiner? Keep the same underlying methodology in mind, approach the data step by step and, if necessary, iterate several times to achieve the best results. This understanding helps a lot when translating the procedure from the RapidMiner GUI into R: all steps still need to be taken into account, only the execution of the tasks looks different.
Overview of the ‘Cars’ dataset:
## 'data.frame': 50 obs. of 3 variables:
## $ name.of.car : Factor w/ 23 levels "Acura","Audi",..: 9 15 12 16 23 3 20 10 13 14 ...
## $ speed.of.car : int 4 4 7 7 8 9 10 10 10 11 ...
## $ distance.of.car: int 2 4 10 10 14 16 17 18 20 20 ...
## name.of.car speed.of.car distance.of.car
## Dodge : 3 Min. : 4.0 Min. : 2.00
## Honda : 3 1st Qu.:12.0 1st Qu.: 26.00
## Jeep : 3 Median :15.0 Median : 36.00
## KIA : 3 Mean :15.4 Mean : 42.98
## Acura : 2 3rd Qu.:19.0 3rd Qu.: 56.00
## Audi : 2 Max. :25.0 Max. :120.00
## (Other):34
The dataset contains 0 missing values.
The type of name.of.car has been changed to factor, as it is a categorical feature.
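A minimal sketch of the import and preparation; the file name cars.csv is an assumption about how the data was provided:
cars <- read.csv("cars.csv")                      # load the 'Cars' dataset
sum(is.na(cars))                                  # count missing values (0 here)
cars$name.of.car <- as.factor(cars$name.of.car)   # categorical feature as factor
str(cars)                                         # structure as shown above
summary(cars)                                     # summary statistics as shown above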
Scatterplot of label and independent feature.
One can see that the observations follow a quadratic rather than a linear relation.
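A sketch of the scatterplot, using the data frame from the import sketch above:
plot(cars$speed.of.car, cars$distance.of.car,
     xlab = "speed", ylab = "braking distance",
     main = "Braking distance vs. speed")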
Histograms of all features:
Boxplot of features:
Outliers have been excluded.
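A sketch of the boxplots and of one possible way to drop the values flagged as outliers by the whiskers:
boxplot(cars$speed.of.car, cars$distance.of.car, names = c("speed", "distance"))
out_vals <- boxplot.stats(cars$distance.of.car)$out    # values beyond the whiskers
cars <- cars[!cars$distance.of.car %in% out_vals, ]    # remove the outlying rows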
Transformation of the distance scale: distance -> sqrt(distance)
Plots of the transformed features speed^2 and sqrtdistance:
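A sketch of this feature engineering step; the shorter column names speed and distance match the model formulas below and are an assumption about how the data was prepared:
cars$speed        <- cars$speed.of.car
cars$distance     <- cars$distance.of.car
cars$sqrtdistance <- sqrt(cars$distance)   # transformed label, linearizes the relation
cars$speed2       <- cars$speed^2          # alternative: squared predictor
hist(cars$speed2)
hist(cars$sqrtdistance)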
Train and test sets have been compiled as follows:
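A sketch of a 70/30 split; the seed and the exact ratio are assumptions, but 34 training rows are consistent with the 32 residual degrees of freedom reported below:
set.seed(123)
train_idx <- sample(seq_len(nrow(cars)), size = floor(0.7 * nrow(cars)))
trainSet  <- cars[train_idx, ]
testSet   <- cars[-train_idx, ]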
Based on this setup, different linear regression models have been trained:
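A sketch of how the three models can be fitted with lm(); the object names are illustrative, the formulas correspond to the Call lines in the output below:
mod_base   <- lm(distance ~ speed, data = trainSet)       # model 1: base model
mod_sqrt   <- lm(sqrtdistance ~ speed, data = trainSet)   # model 2: transformed label
mod_speed2 <- lm(distance ~ speed2, data = trainSet)      # model 3: squared predictor
summary(mod_base); summary(mod_sqrt); summary(mod_speed2)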
Model performance:
##
## Call:
## lm(formula = distance ~ speed, data = trainSet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.757 -3.612 -1.076 2.928 13.214
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -27.3587 2.6414 -10.36 9.53e-12 ***
## speed 4.5362 0.1611 28.16 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.779 on 32 degrees of freedom
## Multiple R-squared: 0.9612, Adjusted R-squared: 0.96
## F-statistic: 793 on 1 and 32 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = sqrtdistance ~ speed, data = trainSet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.34930 -0.09397 -0.05924 0.09940 0.43827
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.606392 0.093027 6.518 2.44e-07 ***
## speed 0.366097 0.005673 64.531 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1683 on 32 degrees of freedom
## Multiple R-squared: 0.9924, Adjusted R-squared: 0.9921
## F-statistic: 4164 on 1 and 32 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = distance ~ speed2, data = trainSet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7541 -0.9707 -0.5830 1.1084 7.2639
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.604061 0.856192 4.209 0.000194 ***
## speed2 0.147830 0.002744 53.880 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.534 on 32 degrees of freedom
## Multiple R-squared: 0.9891, Adjusted R-squared: 0.9888
## F-statistic: 2903 on 1 and 32 DF, p-value: < 2.2e-16
Comparison between actual and predicted values for the different models
If the actual values and the values predicted by the base model (B) are compared, one can see that the prediction is more linear than the actual values.
For that reason, models 2 and 3 are computed using the transformed features (sqrtdistance and speed^2). In doing so, the actual quadratic relation between speed and braking distance is turned into a linear one, which suits a linear regression model better. Both models are compared in (C), where one can see that there is almost no difference between their predictions. Consequently, model 3 is selected, as it involves somewhat less coding effort.
Finally, in (D) the predictions of model 3 are compared with the actual values.
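A sketch of this comparison for the selected model, using the object names from the sketches above:
pred3 <- predict(mod_speed2, newdata = testSet)   # predictions of model 3
plot(testSet$distance, pred3,
     xlab = "actual distance", ylab = "predicted distance")
abline(0, 1)                                      # points on this line are perfect predictions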
Comparing the selected LR model with the actual values
The selected model 3, based on the squared speed, is plotted against all points in the dataset. This shows which brands are over- or underestimated. However, as this compares the model against the training data, it is of limited significance on its own, but it helps to grasp general trends.
Analysis of absolute errors and relative errors
Calculation of performance metrics
The selected LR model achieves an RMSE of approximately 2.46 and an MAE of approximately 1.85 when predicting the braking distance from the squared speed.
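A sketch of how both metrics can be computed by hand from the predictions of the selected model:
rmse <- sqrt(mean((testSet$distance - pred3)^2))   # root mean square error
mae  <- mean(abs(testSet$distance - pred3))        # mean absolute error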
Overview of the ‘Iris’ dataset:
## 'data.frame': 150 obs. of 6 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
## X Sepal.Length Sepal.Width Petal.Length
## Min. : 1.00 Min. :4.300 Min. :2.000 Min. :1.000
## 1st Qu.: 38.25 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600
## Median : 75.50 Median :5.800 Median :3.000 Median :4.350
## Mean : 75.50 Mean :5.843 Mean :3.057 Mean :3.758
## 3rd Qu.:112.75 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100
## Max. :150.00 Max. :7.900 Max. :4.400 Max. :6.900
## Petal.Width Species
## Min. :0.100 setosa :50
## 1st Qu.:0.300 versicolor:50
## Median :1.300 virginica :50
## Mean :1.199
## 3rd Qu.:1.800
## Max. :2.500
The dataset contains 0 missing values.
The type of Species has been changed from factor to integer for the histogram only.
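A sketch of this step, assuming the Iris data frame is named iris_data:
iris_data$Species.num <- as.integer(iris_data$Species)   # numeric coding of the species
hist(iris_data$Species.num)                              # used for the histogram only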
Histograms of all features:
Boxplot of features:
Outliers do not need to be excluded, as we want to predict petal length (pLength) using petal width (pWidth).
Scatterplot of label and independent feature.
The distributions of pWidth, pLength and pLength after normalization are checked using normal Q-Q plots. As one can see, there is no difference between the second and the third plot, i.e. between the original and the normalized pLength, which indicates that it is approximately normally distributed.
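A sketch of the Q-Q plots; the short column names pWidth and pLength follow the model formula below, and scale() stands in for the normalization step:
iris_data$pWidth  <- iris_data$Petal.Width    # shorter names used in the model below
iris_data$pLength <- iris_data$Petal.Length
qqnorm(iris_data$pWidth);  qqline(iris_data$pWidth)
qqnorm(iris_data$pLength); qqline(iris_data$pLength)
pLength_norm <- as.numeric(scale(iris_data$pLength))   # z-score normalization as one possible choice
qqnorm(pLength_norm); qqline(pLength_norm)             # pLength after normalization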
Train and test sets have been compiled as follows:
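A sketch of an 80/20 split and the model fit; the seed is an assumption, but 120 training rows are consistent with the 118 residual degrees of freedom reported below:
set.seed(123)
iris_idx      <- sample(seq_len(nrow(iris_data)), size = floor(0.8 * nrow(iris_data)))
Iris_trainSet <- iris_data[iris_idx, ]
Iris_testSet  <- iris_data[-iris_idx, ]
iris_model    <- lm(pLength ~ pWidth, data = Iris_trainSet)
summary(iris_model)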
##
## Call:
## lm(formula = pLength ~ pWidth, data = Iris_trainSet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.39883 -0.32153 -0.00922 0.30938 1.36918
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.05562 0.08445 12.50 <2e-16 ***
## pWidth 2.26800 0.06136 36.96 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5004 on 118 degrees of freedom
## Multiple R-squared: 0.9205, Adjusted R-squared: 0.9198
## F-statistic: 1366 on 1 and 118 DF, p-value: < 2.2e-16
Comparing the LR model with the actual values
Analysis of absolute errors and relative errors
Calculation of performance metrics
The LR model achieves an RMSE of approximately 0.50 and an MAE of approximately 0.38 when predicting pLength using pWidth.
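A sketch of the metric computation and of the error breakdown by species mentioned in the results; the object names come from the sketches above:
iris_pred <- predict(iris_model, newdata = Iris_testSet)
iris_rmse <- sqrt(mean((Iris_testSet$pLength - iris_pred)^2))       # RMSE
iris_mae  <- mean(abs(Iris_testSet$pLength - iris_pred))            # MAE
abs_err   <- abs(Iris_testSet$pLength - iris_pred)
tapply(abs_err, Iris_testSet$Species, mean)                         # mean absolute error per species
tapply(abs_err / Iris_testSet$pLength, Iris_testSet$Species, mean)  # mean relative error per species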