The purpose of this report is to assess data quality, conduct exploratory data analysis, build models, and compare model results based on the TMS (Two Months' Salary) dataset. The dataset and a description file (including a data dictionary) were provided.
The dataset contains attributes relating to round-cut diamonds at ten brick-and-mortar jewelers and seven online jewelers. The response variable is price, representing the price of a round-cut diamond in U.S. dollars as of April 2001. The goal of the model is to accurately predict the price of a round-cut diamond based on provided or derived attributes in the dataset.
This report contains five sections:

1. Data Quality Check
2. Exploratory Data Analysis
3. Model Build
4. Model Comparison
5. Conclusion

An appendix of relevant R code used in producing the report is included. The code is grouped by the first four sections.
From the TMS metadata file, the response variable is price, which is a quantitative continuous variable. The metadata file also provides definitions and explanations for the variables carat, color, clarity, and cut (the “Four Cs”), as well as channel (the sales channel). The final variable is store, which refers to the company selling the round-cut diamond.
This section contains four parts: Data Summary, Missing Values, Potential Outliers, and Invalid Values.
The dimensions of the TMS dataset indicate there are 425 observations and 7 variables, including the response variable price. The variable price was recoded from integer() to numeric(), while the variables color and clarity were recoded from integer() to factor().
Table 1 below shows the class of each variable after recoding. Brief variable definitions are also included, sourced from the metadata file accompanying the dataset.
\begin{center} Table 1: Variable Dictionary for `TMS` Dataset \end{center}

| Variable | Class | Definition |
|---|---|---|
| carat | numeric | Weight of diamond in carats (1 carat = 200mg) |
| color | factor | GIA color scale, standardized for grading |
| clarity | factor | GIA visibility scale, number and size of inclusions |
| cut | factor | Binary, defined by creator of dataset |
| channel | factor | Medium of purchase |
| store | factor | Name of store (company) selling diamond |
| price | numeric | Cost in U.S. dollars, April 2001 |
Table 2 below shows summary statistics for each variable in the TMS dataset. The Exploratory Data Analysis section expands on this by using quantitative and qualitative methods to look for interesting relationships in the dataset.
\begin{center} Table 2: Summary Statistics for `TMS` Dataset \end{center}

| carat | color | clarity | cut |
|---|---|---|---|
| Min. :0.200 | I :85 | VS2 :108 | Ideal :154 |
| 1st Qu.:0.720 | G :73 | SI1 : 84 | Not_Ideal:271 |
| Median :1.020 | H :73 | VS1 : 76 | |
| Mean :1.041 | E :59 | SI2 : 71 | |
| 3rd Qu.:1.210 | F :58 | VVS2 : 42 | |
| Max. :2.480 | J :37 | I1 : 20 | |
| | (Other):40 | (Other): 24 | |
| channel | store | price |
|---|---|---|
| Independent: 48 | Blue_Nile :211 | Min. : 497 |
| Internet :318 | Ashford :107 | 1st Qu.: 3430 |
| Mall : 59 | Riddles : 16 | Median : 5476 |
| | Fred_Meyer: 15 | Mean : 6356 |
| | Kay : 14 | 3rd Qu.: 7792 |
| | University: 13 | Max. :27575 |
| | (Other) : 49 | |
Another part of the data quality check is to look for missing values and potential outliers. In R, missing values are coded as NA. There is a practical difference between an NA value (recorded but unknown) and a NULL value (not recorded at all); within a data frame, however, R represents both as NA. That said, it is valuable to understand which values are truly missing and which simply do not exist, despite R coding both as NA.
For the most part, R will detect missing values on import and assign them as NA. However, if other characters are used to denote missing values (e.g. the ? character), R may not detect them. Part of the data quality check includes examining the dataset for such occurrences. Counts of NA values, whether identified by R or recoded manually, appear in the output of the summary() function (such as in Table 2 above).

There do not appear to be any NA or NULL values in the TMS dataset.
Detecting potential outliers raises the question of what constitutes an outlier. There is no single definition for an outlier, and at times the term outlier might be substituted for another term altogether (e.g. extreme observation). A simple definition can be found in Introduction to Linear Regression Analysis by Montgomery, Peck, and Vining (5th Edition, p. 43): “Outliers are observations that differ considerably from the rest of the data.” Detecting potential outliers is important because they can exert leverage or influence on a model, affecting results during validation or deployment.
Qualitative methods to detect potential outliers include creating histograms, density plots, and/or boxplots. These plots should not be interpreted blindly! For instance, observations beyond the whiskers of a boxplot are not necessarily outliers - in R, the default setting is to draw the whiskers at 1.5 times the interquartile range (25th to 75th percentiles) from the box. Histograms may be useful, but interpretation can vary depending on the number of bins used. Alternatively, histograms can be plotted using bins of equally spaced probabilities - the widths of the bins vary, but the area represented by each bin is the same.
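To illustrate both points, here is a minimal sketch in base R (assuming the tms data frame from the appendix is loaded):

```r
# Points R draws beyond the whiskers (default coef = 1.5, i.e. 1.5 x IQR)
bp = boxplot.stats(tms$carat)
bp$stats   # lower whisker, 1st quartile, median, 3rd quartile, upper whisker
bp$out     # points beyond the whiskers - flagged, but not automatically outliers

# Histogram with bins of equal probability: widths vary, areas match
hist(tms$carat, breaks = quantile(tms$carat, probs = seq(0, 1, by = 0.1)),
     freq = FALSE, main = "carat: equal-probability bins")
```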
Potential outliers in the TMS dataset are assessed and discussed in the Exploratory Data Analysis section, where multiple plots (including histograms and boxplots) are produced for each variable. Only relevant plots are included in the report, but all plots may be reproduced using the code in the appendix.
The data quality check should also look for invalid values. For numeric variables, this might be negative values in a variable that should contain only positive values. For factor or categorical variables, this could be a miscoded level; examining the frequency of observations at each level can be done quantitatively with counts, or qualitatively with barplots.
Whether checking for missing values, potential outliers, or invalid values, it is important to be mindful of sentinel values, or values that are used as an indicator for some meaning or status. For example: an AGE of -1 could mean Unknown, while a HOME_VALUE of 0 could mean Renter.
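A minimal sketch of such checks in base R; the specific checks shown are illustrative:

```r
# Numeric variables: flag values that should not be negative (or zero)
sum(tms$carat <= 0)
sum(tms$price <= 0)

# Factor variables: frequency of each level, to spot miscoded levels
table(tms$color)
table(tms$clarity)

# Sentinel values: inspect the extremes for suspicious codes (e.g. -1, 0, 9999)
head(sort(unique(tms$price)))
tail(sort(unique(tms$price)))
```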
There do not appear to be any invalid values in the TMS dataset.
After the initial data quality check, data are further examined to identify interesting information or detect interesting relationships. That process is known as Exploratory Data Analysis or EDA.
The type of EDA conducted depends on the statistical problem at hand: is the problem one of regression, or one of classification? The statistical problem faced with the TMS dataset is one of regression.
It is also important to understand what might not be useful. Scatterplots of factor variables against the numeric response variable are not useful for this statistical problem.
To keep content relevant, many of the figures produced are not included here, but can be reproduced using the included code in the Appendix.
This section contains three parts: Traditional EDA - Quantitative, Traditional EDA - Qualitative, and Decision Tree EDA. The Model Build section makes use of Model Based EDA.
Note: At this point, the TMS dataset does not contain any derived variables. Results from EDA help inform possible variable derivations (e.g. trims, transforms).
In the TMS dataset, the only numeric predictor variable is carat. The correlation between the response variable price and carat is positive and quite strong, at 0.8796.
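This can be confirmed directly with base R:

```r
# Pearson correlation between the response and the lone numeric predictor
cor(tms$price, tms$carat)
## [1] 0.8796
```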
Table 2 presented summary statistics for the variables in the TMS dataset. For factor variables, only counts of occurrences by level were provided. In looking for interesting relationships, a custom function was written to view summary statistics for the response variable price by level for each factor variable in the dataset.

Unfortunately, this approach did not uncover many interesting relationships. One explanation is that carat has an outsized effect on price. For example, in Table 3 below, price by color is shown. Although the level D corresponds to ‘perfectly colorless’, the mean and median price of a round-cut diamond in level H are higher.
\begin{center} Table 3: Summary of price by color \end{center}

| Variable | Split On | Levels | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|---|---|
| tms$price | tms$color | D | 1273.00 | 3602.00 | 5694.00 | 7379.00 | 8503.00 | 27580.00 |
| tms$price | tms$color | E | 1243.00 | 2788.00 | 5550.00 | 6390.00 | 7242.00 | 26620.00 |
| tms$price | tms$color | F | 1062.00 | 3233.00 | 5296.00 | 6465.00 | 7624.00 | 22710.00 |
| tms$price | tms$color | G | 986.00 | 3273.00 | 5299.00 | 5910.00 | 7542.00 | 21540.00 |
| tms$price | tms$color | H | 1025.00 | 4299.00 | 6486.00 | 7411.00 | 10130.00 | 23000.00 |
| tms$price | tms$color | I | 497.00 | 3900.00 | 4965.00 | 5907.00 | 6699.00 | 17440.00 |
| tms$price | tms$color | J | 997.00 | 4152.00 | 4871.00 | 5643.00 | 7230.00 | 13680.00 |
| tms$price | tms$color | K | 2900.00 | 2999.00 | 4850.00 | 5585.00 | 6995.00 | 10180.00 |
| tms$price | tms$color | L | 799.00 | 1599.00 | 2999.00 | 3349.00 | 3999.00 | 7350.00 |
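The custom function itself is sourced from GitHub in the appendix; a minimal sketch of the same idea in base R is below (the name by.level.summary is illustrative, not the function used in the report):

```r
# Summary statistics of a numeric variable, split by the levels of a factor
by.level.summary = function(df, num.var, fac.var) {
  t(sapply(split(df[[num.var]], df[[fac.var]]), summary))
}

# price by color, as in Table 3
by.level.summary(tms, "price", "color")
```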
Looking at summary statistics for price and carat by store:
- The store Goodmans had the highest mean price ($8,465), but only the seventh-highest mean carat (0.9662);
- The store Ashford had a lower mean price ($7,751), but the highest mean carat (1.2300).

Though interesting on their face, these tables were not considered relevant enough to include: Goodmans is an Independent store, while Ashford is an Internet store, and it seems reasonable that a brick-and-mortar store would have higher prices on a per-carat basis.
A final observation: the mean values for both price and carat by cut were lower for the level Ideal than Not Ideal. Recall cut is a subjective variable created and populated by the dataset author. Though a different question, it does raise the spectre of subconscious bias. For example: was cut judged in and of itself, or did other data subconsciously influence the chosen value?
\begin{center} Table 5: Summary of price by cut \end{center}

| Variable | Split On | Levels | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|---|---|
| tms$price | tms$cut | Ideal | 878.00 | 2762.00 | 4865.00 | 5767.00 | 8094.00 | 22430.00 |
| tms$price | tms$cut | Not_Ideal | 497.00 | 3996.00 | 5560.00 | 6690.00 | 7464.00 | 27580.00 |
\begin{center} Table 6: Summary of carat by cut \end{center}

| Variable | Split On | Levels | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|---|---|
| tms$carat | tms$cut | Ideal | 0.3500 | 0.6675 | 0.9600 | 0.9282 | 1.1680 | 2.2000 |
| tms$carat | tms$cut | Not_Ideal | 0.2000 | 0.8400 | 1.0220 | 1.1050 | 1.2400 | 2.4800 |
For the numeric variable carat, a panel of four plots was examined. The plots are shown in Figure 1 below:
- The histogram shows carat is roughly normally distributed, with a slight rightward skew for larger values;
- The boxplot shows carat contains a number of observations outside of the upper whisker (in line with the histogram);
- The scatterplot shows a strong positive relationship between price and carat; the megaphone shape is noted, but would be of bigger concern in model residuals;
- The Q-Q plot shows carat mostly follows the theoretical normal distribution, with the most pronounced deviance in the tails.

For the factor variables, boxplots were created of the levels (x-axis) versus the response variable price (y-axis). Of these, channel showed the most distinct levels.
Outliers: Revisited

Qualitative EDA can be a very inexpensive way to better understand the dataset, visually inspect variables (and interactions), and uncover interesting relationships. Boxplots are useful but, as mentioned earlier, raise the question of how an outlier is defined and how the points on the boxplot are calculated.
Variables in the TMS dataset with potentially relevant outliers are described below.

- The variable carat shows potential outliers in the boxplot. However, the Q-Q plot suggests these values are closer to leverage points than influential observations. See Figure 1 above.
- Boxplots of levels of the variable color versus price show many potential outliers. However, many of these are remedied when looking at color versus carat (suggesting that yes, size does matter).
- Boxplots of levels of the variable cut versus price, and cut versus carat, show more potential outliers in the Not_Ideal level than Ideal, confirming the findings in Quantitative EDA. See Table 5 and Table 6 above.
- Boxplots of levels of the variable channel versus price show more potential outliers than channel versus carat, particularly for the level Internet.
- Boxplots of levels of the variable store versus price, and store versus carat, suggest a premium is paid for round-cut diamonds from the store Blue_Nile.
Interestingly, Blue_Nile is an Internet store, which likely influences the findings in the fourth bullet.
A naive decision tree was constructed, fitting the response variable against all other variables in the TMS dataset. Interesting information can still be revealed from this naive model.
In the tree plot below, each rectangle corresponds to either the root node, an internal node, or a terminal (leaf) node. Within each node:

- the first line shows the mean value of the response variable price for the observations in that node;
- the second line shows the count and percentage of observations in that node.
The first node is the root node. The mean value of the response variable price is 6356 (line 1). There are 425 observations in this node, making up 100% of the rows in the dataset (line 2). The tree then splits on the carat variable. Values of carat less than 1.4 branch to the left, while values greater than or equal to 1.4 branch to the right. This continues until arriving at a terminal node.
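The same node information is available from the printed tree object; a minimal sketch mirroring the appendix call:

```r
library(rpart)

# Naive tree: price against all other variables
tree.eda = rpart(price ~ ., data = tms)

# Each printed row shows the split rule, n, deviance, and predicted mean (yval)
print(tree.eda)
```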
Of the six predictor variables fed to the naive tree model, only three are used: carat, clarity, and store. The tree model implies potential variable importance, as these variables (and levels or values) were most useful in predicting the mean value of the response variable price.
Even cursory EDA can provide valuable insight and information on the relationships within the dataset. The relationships identified here can be leveraged in model construction. The branches of the naive tree model illustrate potentially important variables and their associated levels or values.
This section focuses on building models to accurately predict the response variable price in the TMS dataset. There are four parts:

- Train/Test Split of the TMS dataset;
- Model-based EDA;
- Variable Derivations & Manipulations;
- Model Construction.

The Model Comparison section discusses results and predictive accuracy across the modeling suite.
The TMS dataset was split by a random sample into a 70/30 training-test set. The training set was used for model-based EDA, and any subsequent variable derivations or manipulations were applied to both the training and test sets for model construction. The random sample closely approximates the split, with 70.6 percent of rows going into the training set and 29.4 percent into the test set.
Model-based EDA is another way to glean information about the relationships in the dataset. Naive models are used for this purpose, since the goal at this stage is not to build a highly accurate predictive model, but to uncover additional information. All models fit used the training set of the TMS dataset. The fit models used a two-step process, described below.
First, the {caret} package fit models with the train() function's method argument set to forward, backward, stepwise, and/or exhaustive selection; these are automated variable selection (AVS) methods. Minimizing RMSE (Root Mean Square Error) was used as the metric for “best” model selection under each AVS method. Although these are naive models, repeated cross-validation (10 folds, repeated 3 times) was used during AVS. In reviewing the model results, the model with the lowest RMSE was chosen. The model formula (intercept, beta coefficients, variable names) was examined, as was variable importance (via a function within {caret}).
Second, the lm() function in {stats} fit the model using the variables selected by the best model from the train() function in {caret}. While the intercept and beta coefficient values remained the same between the two fits, the purpose was to visually examine model residuals (scatterplot, Q-Q plot), the AIC value, and VIF values. The AIC value was examined because it accounts for parsimony. Measures of predictive accuracy (e.g. MSE, RMSE, MASE) were not computed at this stage, as model-based EDA takes place in-sample on the training set.
Original Variables, Training Set, Backward Selection
The first model built used original variables in the TMS dataset, and was fit to the training set using backward selection.
The model chose variables carat, color_I, and color_J (specific levels are to the right of the underscore). The model resulted in an AIC value of 5399.672, and all VIF values were close to 1.0. The residual plots from this model are shown in Figure 8 below.
The upper left plot in Figure 8 shows a slight bow shape in the residuals, indicating heteroscedasticity. The Q-Q plot shows heavy deviation from the theoretical normal distribution in the tails, particularly the upper tail. A natural log transformation on the response variable may resolve this issue.
Using the same chosen predictor variables, another model was run using a natural log transformation on the response variable price. The model resulted in an AIC value of 125.8216, and all VIF values were close to 1.0. The AIC value is not comparable to the value above. The residual plots from this model are shown in Figure 9 below.
Improvement in fit is seen across all plots. While not wholly remedied, the dispersion of residuals in Figure 9 is much more homoscedastic than in Figure 8. Similarly, while the Q-Q plot is not perfect, the deviance in the tails is largely improved.
Conclusion: The initial results suggest a natural log transformation of the response variable may be beneficial to model fit. Subsequent models for EDA use AVS on a natural log transformation of the response variable.
Two points to keep in mind:

- AIC values are only comparable between models sharing the same response variable; once price is log-transformed, AIC values from models of the untransformed response no longer apply (hence the note above);
- if the response variable price is transformed using natural log, then the predicted values must be exponentiated to return them to the correct units.
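A minimal sketch of the back-transformation, assuming a fitted log-response model fit and a held-out data frame tms.holdout (both names illustrative):

```r
# Predictions from a log-response model arrive on the log scale
pred.log = predict(fit, newdata = tms.holdout)

# Exponentiate to return predictions to U.S. dollars
pred.dollars = exp(pred.log)

# RMSE can be computed on either scale, but the two are not comparable
sqrt(mean((log(tms.holdout$price) - pred.log)^2))   # log scale
sqrt(mean((tms.holdout$price - pred.dollars)^2))    # dollar scale
```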
Original Variables, Training Set, Backward Selection, Natural Log Response

The second model built used original variables in the TMS dataset, and was fit to the training set using backward selection with a natural log transformation on the response variable.
The model chose variables carat, color_I, color_J, and clarity_I1 (specific levels are to the right of the underscore). The model resulted in an AIC value of 105.4516, and all VIF values were close to 1.0. The residual plots from this model are shown in Figure 10 below.
Conclusion: Compared to Model 1 (with the same natural log response variable), the fit appears to marginally improve throughout all four plots. The residuals show slightly better dispersion, and the Q-Q plot shows slightly closer fit to the theoretical normal distribution line. The AIC value is much lower.
Original Variables, Training Set, Forward Selection, Natural Log Response
The third model built used original variables in the TMS dataset, and was fit to the training set using forward selection with a natural log transformation on the response variable.
The model chose variables carat, color_I, color_J, and clarity_I1 (specific levels are to the right of the underscore). These are the exact same variables as Model 2 above. As a result, the AIC and VIF values are identical, as are the resulting residual plots. They are not shown here to avoid redundancy.
Original Variables, Training Set, Stepwise Selection, Natural Log Response
The fourth model built used original variables in the TMS dataset, and was fit to the training set using stepwise (sequential replacement) selection with a natural log transformation on the response variable.
The model chose variables carat, color_I, color_J, and color_L (specific levels are to the right of the underscore). The model resulted in an AIC value of 105.1677, and all VIF values were close to 1.0. The residual plots from this model are shown in Figure 11 below.
Conclusion: Compared to Model 1 (with the same natural log response variable) and Model 2, the fit appears to marginally improve throughout all four plots. The residuals show slightly less dispersion, but are closer to zero. The Q-Q plot shows a slightly closer fit to the theoretical normal distribution line. The AIC value is lower, although not by much.
Original Variables, Training Set, AIC Stepwise Selection, Natural Log Response
The fifth model built used original variables in the TMS dataset, and was fit to the training set using AIC stepwise selection with a natural log transformation on the response variable. Note: this method was used as exhaustive search continually hung in R.
The model shows signs of serious overfitting. The model chose a plethora of variables: carat, color_E, color_F, color_G, color_H, color_I, color_J, color_K, color_L, clarity_VVS1, clarity_VVS2, clarity_VS1, clarity_VS2, clarity_SI1, clarity_SI2, clarity_I1, clarity_I2, cut_Not_Ideal, channel_Internet, channel_Mall, store_Ausmans, store_Fred_Meyer, and store_Goodmans (specific levels are to the right of the underscore).
Not surprisingly, the model had the lowest AIC value, at -2.3131. However, the VIF values show clear multicollinearity issues. Variables with a VIF of 3.0 or higher include: color_G, color_H, color_I, color_K, clarity_VVS2, clarity_VS1, clarity_VS2, clarity_SI1, clarity_SI2, clarity_I1, clarity_I2, channel_Internet, and channel_Mall.
Residual plots from the model are shown in Figure 12 below.
\begin{center} Figure 12: Residual Plots of Model 5, Natural Log Response \end{center}

Conclusion: The residual plots show clear overfitting with this method. In the upper left plot, the residuals take on an inverted bow shape. While the Q-Q plot has the closest fit to the theoretical normal distribution, the model is clearly too complex. This is how woodchipper models are created.
Original Variables, Training Set, LASSO, Natural Log Response
The sixth model built used original variables in the TMS dataset, and was fit to the training set using LASSO with a natural log transformation on the response variable.
The selected model used all variables, with the exception of the following levels: color_D, clarity_IF, cut_Ideal, channel_Internet, store_Danford, store_Ashford, and store_Riddles. For each factor, the excluded level is the baseline left out when creating n-1 dummy variables, with the exception of store, where additional levels were excluded. No variable coefficients were set to zero.
A number of variable derivations and manipulations were tried; almost all showed little promise. For instance, standardizing the variable carat to mean = 0 and standard deviation = 1 had little to no impact on the model. Another thought was to create a variable akin to “price per carat”, but price is the response variable, and deriving a predictor from it would not be correct.
Ultimately, three indicator variables were created based on the frequency of levels:

- IV_Internet: value of 1 if from channel Internet, and 0 if not;
- IV_Blue_Nile: value of 1 if from store Blue_Nile, and 0 if not;
- IV_Color_IJ: value of 1 if of color I or J, and 0 if not.

These variables were tested in model construction.
Four models were built using information gleaned from the previous sections. Each model has a dedicated subsection:

- Model 7: Linear Regression, no interactions;
- Model 8: Linear Regression, some interactions;
- Model 9: Decision Tree, M5 algorithm;
- Model 10: Random Forest, rf algorithm.
Additionally, each model uses the response variable price with a natural log transformation.
The same two-step process described at the beginning of the Model-based EDA subsection was used. However, the data fed to the initial model included the three user-created indicator variables. The model was fit to the training set using stepwise (sequential replacement) selection.
The seventh model chose variables carat, clarity_I1, channel_Mall, and IV_Color_IJ (specific levels are to the right of the underscore). The model resulted in an AIC value of 86.9924, and most VIF values were close to 1.0, with the highest at approximately 1.25. The residual plots from this model are shown in Figure 13 below.
Conclusion: Compared to the previous models (with the same natural log response variable), the plots give mixed results. The residuals are starting to take on an inverted bow shape, but are not quite worrisome (yet). The Q-Q plot shows slightly closer fit to the theoretical normal distribution line. The biggest improvement was to the AIC value, decreasing from 105.4516 in Model 2 to 86.9924 here.
The same two-step process described at the beginning of the Model-based EDA subsection was used, and the data fed to the initial model again included the three user-created indicator variables. For this model, all two-way interactions were also offered as candidate terms. The model was fit to the training set using stepwise (sequential replacement) selection.
The eighth model chose the variable carat, along with the interactions color_H:clarity_VS1, clarity_I1:cut_Not_Ideal, and clarity_SI2:store_Riddles (specific levels are to the right of the underscore). The model resulted in an AIC value of 152.4542. Only two VIF values were greater than 3.0: clarity_I1 and the interaction clarity_I1:cut_Not_Ideal. This is not surprising, as an interaction term is necessarily correlated with its main effects. The residual plots from this model are shown in Figure 14 below.
Conclusion: Compared to Model 7 (with the same natural log response variable), the plots are slightly improved. The residuals appear more homoscedastic with clustering, but no discernible shape. The Q-Q plot shows slightly closer fit to the theoretical normal distribution line. However, the biggest downside was the jump in the AIC value, increasing from 86.9924 in Model 7 to 152.4542 here.
The ninth model continued to use the {caret} package with repeated cross-validation (10 folds, repeated 3 times). The decision tree was fit with the M5 method from the RWeka package. The algorithm is suitable for regression problems. Minimizing RMSE was used as the metric for “best” model selection.
The decision tree with the lowest RMSE value was pruned and smoothed, and did not use rules. According to the RWeka documentation, the M5Rules variant turns the “best” leaf into a rule. Figure 15 below shows rules are not used, as each terminal node contains a separate model.
In Figure 15 above, the root node splits on the variable carat, with values less than or equal to 0.765 going to the left branch, and values greater than 0.765 going to the right. The two terminal nodes each contain a different linear model.
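The pruned/smoothed/rules combination is a tuning choice in {caret}; a minimal sketch of pinning the selected combination explicitly, assuming the fitControl object from the appendix (the grid values reflect my reading of caret's M5 tuning parameters):

```r
# Refit only the pruned, smoothed, no-rules M5 tree selected above
set.seed(123)
m5.grid = expand.grid(pruned = "Yes", smoothed = "Yes", rules = "No")
tms.train.dt.m1.fixed = train(log(price) ~ ., data = tms.mod,
                              subset = tms.train, method = "M5",
                              trControl = fitControl, tuneGrid = m5.grid)
```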
The tenth model continued to use the {caret} package with repeated cross-validation (10 folds, repeated 3 times). The random forest was fit with the rf method from the randomForest package. The algorithm is suitable for regression problems. Minimizing RMSE was used as the metric for “best” model selection.
The random forest with the lowest RMSE value used 500 trees and tried 18 variables at each split (mtry = 18). Figure 16 below shows the decrease in error as the number of trees increases.
This section compares the fit of the four models built in the Model Construction subsection. Both in-sample and out-of-sample fit are assessed. Table 7 below shows a comparison of model fit using RMSE. The last column represents the relative change in RMSE from training to test data, computed as (test - train) / train.
\begin{center} Table 7: Model Comparison by RMSE \end{center}

| Model Name | RMSE, Train | RMSE, Test | Percent Change |
|---|---|---|---|
| M7: Linear Regression, no interactions | 0.293 | 0.3116 | 0.0636 |
| M8: Linear Regression, some interactions | 0.3067 | 0.3329 | 0.08572 |
| M9: Decision Tree, M5 algorithm | 0.2109 | 0.1996 | -0.0538 |
| M10: Random Forest, RF algorithm | 0.2005 | 0.1777 | -0.114 |
By RMSE, the random forest model (Model 10) was the clear winner, with the lowest RMSE values in both the training and test sets. Interestingly, it also had the largest percent change, predicting with a lower RMSE on the test set than on the training set. In second place was the decision tree model (Model 9), which used the M5 algorithm.
While random forest models may be “black boxes”, the predictive accuracy on the TMS dataset is clear. The decision tree model using the M5 algorithm also came quite close in terms of performance to the random forest model. This was a bit surprising.
Prior to this assignment, I did not have an opportunity to use the {caret} package. I presume the performance was enhanced by using repeated cross-validation (10 fold, 3 repeats) on the training set for all models. I also had not used the M5 algorithm from {RWeka} before, and was surprised to learn each terminal node contained a different model.
For next steps, I would like to play with the hyperparameter settings to see how tuning them affects the various models. Although the decision tree model came in second place, the interpretability cannot be overlooked. This goes back to answering what we are solving for, and why. If the goal is to have the most accurate predictive model, then random forest is the clear winner. If the goal is to have the most accurate and interpretable predictive model, the decision tree model would take the trophy.
#==============================================================================
# Libraries
#==============================================================================
# Packages used throughout the appendix (see session info at the end)
library(RCurl)         # getURL(), for sourcing functions from GitHub
library(caret)         # train(), createDataPartition(), varImp()
library(car)           # vif()
library(leaps)         # leapBackward/leapForward/leapSeq (via caret)
library(rpart)         # rpart()
library(rattle)        # fancyRpartPlot()
library(elasticnet)    # predict.enet()
library(RWeka)         # M5 algorithm (via caret)
library(randomForest)  # rf method (via caret)
#==============================================================================
# Functions
#==============================================================================
#--------------------------------------
# GitHub
#--------------------------------------
# Create function to source functions from GitHub
source.GitHub = function(url){
require(RCurl)
sapply(url, function(x){
eval(parse(text = getURL(x, followlocation = T,
cainfo = system.file("CurlSSL",
"cacert.pem", package = "RCurl"))),
envir = .GlobalEnv)
})
}
# Assign URL and source functions
url = "http://bit.ly/1T6LhBJ"
source.GitHub(url); rm(url)
#--------------------------------------
# Percent Change
#--------------------------------------
# Function to calculate percent change
perc.change = function(y1, y2){
return((y2 - y1) / y1)
}
#==============================================================================
# Data Import, Prep, and Staging
#==============================================================================
#--------------------------------------
# Import
#--------------------------------------
# Read data
tms = read.csv("~/Two_Months_Salary.csv", header = T)
# Check dimensions
dim(tms)
# Check variable classes
sapply(tms, class)
# Summary statistics
summary(tms)
# Check for NAs
colSums(is.na(tms))[colSums(is.na(tms)) > 0]
#--------------------------------------
# Prep
#--------------------------------------
# Recode integers to numeric
tms$price = as.numeric(tms$price)
# Recode integers to factor
tms$color = as.factor(tms$color)
tms$clarity = as.factor(tms$clarity)
# Set factor variable levels
levels(tms$color) = c("D", "E", "F", "G", "H", "I", "J", "K", "L")
levels(tms$clarity) = c("IF", "VVS1", "VVS2", "VS1", "VS2", "SI1", "SI2",
"I1", "I2")
# Rename factor variable levels (replace spaces)
levels(tms$store) = c("Ashford", "Ausmans", "Blue_Nile", "Chalmers",
"Danford", "Fred_Meyer", "Goodmans", "Kay",
"R_Holland", "Riddles", "University", "Zales")
levels(tms$cut) = c("Ideal", "Not_Ideal")
#--------------------------------------
# Staging
#--------------------------------------
# Store dataset name for use in titles, etc. later
data.name <- "tms$"
# Set response variable
data.response <- "price"
# Assign column names
tms.cn.all = colnames(tms)
# Assign numeric column names
tms.cn.num = colnames(tms[, !sapply(tms, is.factor)])
# Assign factor column names
tms.cn.fac = colnames(tms[, sapply(tms, is.factor)])
# Drop response variable in tms.cn.num
tms.cn.num = tms.cn.num[!tms.cn.num == data.response]
#--------------------------------------
# Dictionary
#--------------------------------------
# Build table
tms.dict = as.data.frame(sapply(tms, class))
tms.def = c("Weight of diamond in carats (1 carat = 200mg)",
"GIA color scale, standardized for grading",
"GIA visibility scale, number and size of inclusions",
"Binary, defined by creator of dataset",
"Medium of purchase",
"Name of store (company) selling diamond",
"Cost in U.S. dollars, April 2001")
tms.dict = data.frame(rownames(tms.dict), tms.dict, tms.def)
colnames(tms.dict) = c("Variable", "Class", "Definition")
rownames(tms.dict) = 1:nrow(tms.dict)
rm(tms.def)
#==============================================================================
# Exploratory Data Analysis
#==============================================================================
#------------------------------------------------------------------------------
# Traditional EDA - Quantitative
#------------------------------------------------------------------------------
num.freq(tms, tms.cn.fac, "price")
num.freq(tms, tms.cn.fac, "carat")
fac.freq(tms, tms.cn.fac, cat = F)
#------------------------------------------------------------------------------
# Traditional EDA - Qualitative
#------------------------------------------------------------------------------
num.plots(tms, tms.cn.num, "price")
fac.boxplot(tms, tms.cn.fac, "price")
fac.boxplot(tms, tms.cn.fac, "carat")
fac.barplot(tms, tms.cn.fac)
#------------------------------------------------------------------------------
# Decision Tree EDA
#------------------------------------------------------------------------------
# Note: use price ~ . (not tms$price ~ .), otherwise the dot expands to
# include price itself as a predictor
fancyRpartPlot(rpart(price ~ ., data = tms), sub = "", cex = 0.8)
#==============================================================================
# Model Build
#==============================================================================
# Random sample into 70/30 training-test split
set.seed(123)
tms.train = createDataPartition(tms$price, p = 0.70, list = F)
tms.test = as.matrix(as.integer(rownames(tms))[-tms.train])
# Specify fit parameters
set.seed(123)
fitControl = trainControl(method = "repeatedcv", number = 10, repeats = 3)
#------------------------------------------------------------------------------
# Variable Derivations & Manipulations
#------------------------------------------------------------------------------
# Create clone versions of data.frame
tms.og = tms
tms.mod = tms
#--------------------------------------
# Levels
#--------------------------------------
# Create levels
tms = fac.flag(tms, tms.cn.fac)
# Drop original variables
tms = tms[, !names(tms) %in% tms.cn.fac]
# Re-assign factor column names
tms.cn.fac = colnames(tms[, sapply(tms, is.factor)])
#--------------------------------------
# Derivations
#--------------------------------------
# Internet
tms$IV_Internet = as.factor(ifelse(tms.og$channel == "Internet", 1, 0))
tms.mod$IV_Internet = tms$IV_Internet
# Store
tms$IV_Blue_Nile = as.factor(ifelse(tms.og$store == "Blue_Nile", 1, 0))
tms.mod$IV_Blue_Nile = tms$IV_Blue_Nile
# Colors
tms$IV_Color_IJ = as.factor(ifelse(tms.og$color == "I" | tms.og$color == "J",
1, 0))
tms.mod$IV_Color_IJ = tms$IV_Color_IJ
#------------------------------------------------------------------------------
# Model-based EDA
#------------------------------------------------------------------------------
#--------------------------------------
# Model 1
#--------------------------------------
# Original Variables, Training Set, Backward Selection
#------------------
# leapBackward
#------------------
set.seed(123)
tms.train.bwd.m1.ct = train(price ~ ., data = tms.og, subset = tms.train,
method = "leapBackward", trControl = fitControl)
# View summary information
tms.train.bwd.m1.ct
coef(tms.train.bwd.m1.ct$finalModel, tms.train.bwd.m1.ct$bestTune$nvmax)
# View variable importance
varImp(tms.train.bwd.m1.ct)
plot(varImp(tms.train.bwd.m1.ct))
#------------------
# lm | price
#------------------
tms.train.bwd.m1.lm.1 = lm(price ~ carat + color_I + color_J,
data = tms, subset = tms.train)
# View summary information
summary(tms.train.bwd.m1.lm.1)
AIC(tms.train.bwd.m1.lm.1)
vif(tms.train.bwd.m1.lm.1)
# View plots
par(mfrow=c(2, 2))
plot(tms.train.bwd.m1.lm.1, ask = F)
par(mfrow=c(1, 1))
#------------------
# lm | log(price)
#------------------
tms.train.bwd.m1.lm.2 = lm(log(price) ~ carat + color_I + color_J,
data = tms, subset = tms.train)
# View summary information
summary(tms.train.bwd.m1.lm.2)
AIC(tms.train.bwd.m1.lm.2)
vif(tms.train.bwd.m1.lm.2)
# View plots
par(mfrow=c(2, 2))
plot(tms.train.bwd.m1.lm.2, ask = F)
par(mfrow=c(1, 1))
#--------------------------------------
# Model 2
#--------------------------------------
# Original Variables, Training Set, Backward Selection, Natural Log Response
#------------------
# leapBackward
#------------------
set.seed(123)
tms.train.bwd.m2.ct = train(log(price) ~ ., data = tms.og, subset = tms.train,
method = "leapBackward", trControl = fitControl)
# View summary information
tms.train.bwd.m2.ct
coef(tms.train.bwd.m2.ct$finalModel, tms.train.bwd.m2.ct$bestTune$nvmax)
# View variable importance
varImp(tms.train.bwd.m2.ct)
plot(varImp(tms.train.bwd.m2.ct))
#------------------
# lm | log(price)
#------------------
tms.train.bwd.m2.lm = lm(log(price) ~ carat + color_I + color_J + clarity_I1,
data = tms, subset = tms.train)
# View summary information
summary(tms.train.bwd.m2.lm)
AIC(tms.train.bwd.m2.lm)
vif(tms.train.bwd.m2.lm)
# View plots
par(mfrow=c(2, 2))
plot(tms.train.bwd.m2.lm, ask = F)
par(mfrow=c(1, 1))
#--------------------------------------
# Model 3
#--------------------------------------
# Original Variables, Training Set, Forward Selection, Natural Log Response
#------------------
# leapForward
#------------------
set.seed(123)
tms.train.fwd.m1.ct = train(log(price) ~ ., data = tms.og, subset = tms.train,
method = "leapForward", trControl = fitControl)
# View summary information
tms.train.fwd.m1.ct
coef(tms.train.fwd.m1.ct$finalModel, tms.train.fwd.m1.ct$bestTune$nvmax)
# View variable importance
varImp(tms.train.fwd.m1.ct)
plot(varImp(tms.train.fwd.m1.ct))
#------------------
# lm | log(price)
#------------------
tms.train.fwd.m1.lm = lm(log(price) ~ carat + color_I + color_J + clarity_I1,
data = tms, subset = tms.train)
# View summary information
summary(tms.train.fwd.m1.lm)
AIC(tms.train.fwd.m1.lm)
vif(tms.train.fwd.m1.lm)
# View plots
par(mfrow=c(2, 2))
plot(tms.train.fwd.m1.lm, ask = F)
par(mfrow=c(1, 1))
#--------------------------------------
# Model 4
#--------------------------------------
# Original Variables, Training Set, Stepwise Selection, Natural Log Response
#------------------
# leapSeq
#------------------
set.seed(123)
tms.train.seq.m1.ct = train(log(price) ~ ., data = tms.og, subset = tms.train,
method = "leapSeq", trControl = fitControl)
# View summary information
tms.train.seq.m1.ct
coef(tms.train.seq.m1.ct$finalModel, tms.train.seq.m1.ct$bestTune$nvmax)
# View variable importance
varImp(tms.train.seq.m1.ct)
plot(varImp(tms.train.seq.m1.ct))
#------------------
# lm | log(price)
#------------------
tms.train.seq.m1.lm = lm(log(price) ~ carat + color_I + color_J + color_L,
data = tms, subset = tms.train)
# View summary information
summary(tms.train.seq.m1.lm)
AIC(tms.train.seq.m1.lm)
vif(tms.train.seq.m1.lm)
# View plots
par(mfrow=c(2, 2))
plot(tms.train.seq.m1.lm, ask = F)
par(mfrow=c(1, 1))
#--------------------------------------
# Model 5
#--------------------------------------
# Original Variables, Training Set, AIC Stepwise Selection, Natural Log Response
#------------------
# lmStepAIC
#------------------
set.seed(123)
tms.train.aic.m1.ct = train(log(price) ~ ., data = tms.og, subset = tms.train,
method = "lmStepAIC", trControl = fitControl)
# View summary information
tms.train.aic.m1.ct
coef(tms.train.aic.m1.ct$finalModel, tms.train.aic.m1.ct$bestTune$nvmax)
# View variable importance
varImp(tms.train.aic.m1.ct)
plot(varImp(tms.train.aic.m1.ct))
#------------------
# lm | log(price)
#------------------
tms.train.aic.m1.lm = lm(log(price) ~ carat + color_E + color_F + color_G
+ color_H + color_I + color_J + color_K + color_L
+ clarity_VVS1 + clarity_VVS2 + clarity_VS1
+ clarity_VS2 + clarity_SI1 + clarity_SI2 + clarity_I1
+ clarity_I2 + cut_Not_Ideal + channel_Internet
+ channel_Mall + store_Ausmans + store_Fred_Meyer
+ store_Goodmans,
data = tms, subset = tms.train)
# View summary information
summary(tms.train.aic.m1.lm)
AIC(tms.train.aic.m1.lm)
vif(tms.train.aic.m1.lm)
# View plots
par(mfrow=c(2, 2))
plot(tms.train.aic.m1.lm, ask = F)
par(mfrow=c(1, 1))
#--------------------------------------
# Model 6
#--------------------------------------
# Original Variables, Training Set, LASSO, Natural Log Response
#------------------
# LASSO
#------------------
set.seed(123)
tms.train.lso.m1 = train(log(price) ~ ., data = tms.og, subset = tms.train,
method = "lasso", trControl = fitControl)
# View summary information
tms.train.lso.m1
predict.enet(tms.train.lso.m1$finalModel, type = "coefficients",
s = tms.train.lso.m1$bestTune$fraction, mode = "fraction")
# View variable importance
varImp(tms.train.lso.m1)
plot(varImp(tms.train.lso.m1))
#------------------------------------------------------------------------------
# Model Construction
#------------------------------------------------------------------------------
#--------------------------------------
# Model 7
#--------------------------------------
# Linear Regression Model - No Interactions
#------------------
# leapSeq
#------------------
set.seed(123)
tms.train.seq.m2.ct = train(log(price) ~ ., data = tms.mod, subset = tms.train,
method = "leapSeq", trControl = fitControl)
# View summary information
tms.train.seq.m2.ct
coef(tms.train.seq.m2.ct$finalModel, tms.train.seq.m2.ct$bestTune$nvmax)
# View variable importance
varImp(tms.train.seq.m2.ct)
plot(varImp(tms.train.seq.m2.ct))
#------------------
# lm | log(price)
#------------------
tms.train.seq.m2.lm = lm(log(price) ~ carat + clarity_I1 + channel_Mall
+ IV_Color_IJ,
data = tms, subset = tms.train)
# View summary information
summary(tms.train.seq.m2.lm)
AIC(tms.train.seq.m2.lm)
vif(tms.train.seq.m2.lm)
# View plots
par(mfrow=c(2, 2))
plot(tms.train.seq.m2.lm, ask = F)
par(mfrow=c(1, 1))
#--------------------------------------
# Model 8
#--------------------------------------
# Linear Regression Model - Some Interactions
#------------------
# leapSeq
#------------------
set.seed(123)
tms.train.seq.m3.ct = train(log(price) ~ .*.,
data = tms.mod, subset = tms.train,
method = "leapSeq", trControl = fitControl)
# View summary information
tms.train.seq.m3.ct
coef(tms.train.seq.m3.ct$finalModel, tms.train.seq.m3.ct$bestTune$nvmax)
# View variable importance
varImp(tms.train.seq.m3.ct)
plot(varImp(tms.train.seq.m3.ct))
#------------------
# lm | log(price)
#------------------
tms.train.seq.m3.lm = lm(log(price) ~ carat + color_H*clarity_VS1
+ clarity_I1*cut_Not_Ideal
+ clarity_SI2*store_Riddles,
data = tms, subset = tms.train)
# View summary information
summary(tms.train.seq.m3.lm)
AIC(tms.train.seq.m3.lm)
vif(tms.train.seq.m3.lm)
# View plots
par(mfrow=c(2, 2))
plot(tms.train.seq.m3.lm, ask = F)
par(mfrow=c(1, 1))
#--------------------------------------
# Model 9
#--------------------------------------
# Decision Tree Model
#------------------
# M5
#------------------
set.seed(123)
tms.train.dt.m1.ct = train(log(price) ~ .,
data = tms.mod, subset = tms.train,
method = "M5", trControl = fitControl)
# View summary information
tms.train.dt.m1.ct
tms.train.dt.m1.ct$finalModel
plot(tms.train.dt.m1.ct$finalModel)
#--------------------------------------
# Model 10
#--------------------------------------
# Random Forest Model
#------------------
# rf
#------------------
set.seed(123)
tms.train.rf.m1.ct = train(log(price) ~ .,
data = tms.mod, subset = tms.train,
method = "rf", trControl = fitControl)
# View summary information
tms.train.rf.m1.ct
tms.train.rf.m1.ct$finalModel
plot(tms.train.rf.m1.ct$finalModel)
#==============================================================================
# Model Comparison
#==============================================================================
#--------------------------------------
# Model 7
#--------------------------------------
# Predict
tms.train.seq.m2.pred = predict(tms.train.seq.m2.ct,
newdata = tms.mod[tms.test,])
# Store RMSE values
m7.train.rmse = getTrainPerf(tms.train.seq.m2.ct)[1, 1]
m7.test.rmse = as.numeric(postResample(tms.train.seq.m2.pred,
log(tms$price)[tms.test])[1])
#--------------------------------------
# Model 8
#--------------------------------------
# Predict
tms.train.seq.m3.pred = predict(tms.train.seq.m3.ct,
newdata = tms.mod[tms.test,])
# Store RMSE values
m8.train.rmse = getTrainPerf(tms.train.seq.m3.ct)[1, 1]
m8.test.rmse = as.numeric(postResample(tms.train.seq.m3.pred,
log(tms$price)[tms.test])[1])
#--------------------------------------
# Model 9
#--------------------------------------
# Predict
tms.train.dt.m1.pred = predict(tms.train.dt.m1.ct,
newdata = tms.mod[tms.test,])
# Store RMSE values
m9.train.rmse = getTrainPerf(tms.train.dt.m1.ct)[1, 1]
m9.test.rmse = as.numeric(postResample(tms.train.dt.m1.pred,
log(tms$price)[tms.test])[1])
#--------------------------------------
# Model 10
#--------------------------------------
# Predict
tms.train.rf.m1.pred = predict(tms.train.rf.m1.ct,
newdata = tms.mod[tms.test,])
# Store RMSE values
m10.train.rmse = getTrainPerf(tms.train.rf.m1.ct)[1, 1]
m10.test.rmse = as.numeric(postResample(tms.train.rf.m1.pred,
log(tms$price)[tms.test])[1])
#--------------------------------------
# Table Results
#--------------------------------------
# Model Names
model.names = rbind("M7: Linear Regression, no interactions",
"M8: Linear Regression, some interactions",
"M9: Decision Tree, M5 algorithm",
"M10: Random Forest, RF algorithm")
# RMSE, Train
model.train.rmse = rbind(m7.train.rmse, m8.train.rmse, m9.train.rmse,
m10.train.rmse)
# RMSE, Test
model.test.rmse = rbind(m7.test.rmse, m8.test.rmse, m9.test.rmse,
m10.test.rmse)
# RMSE, Percent Change
model.pc = rbind(perc.change(m7.train.rmse, m7.test.rmse),
perc.change(m8.train.rmse, m8.test.rmse),
perc.change(m9.train.rmse, m9.test.rmse),
perc.change(m10.train.rmse, m10.test.rmse))
# Data Frame
model.comp = data.frame(model.names, model.train.rmse, model.test.rmse,
model.pc)
rownames(model.comp) = 1:nrow(model.comp)
colnames(model.comp) = c("Model Name", "RMSE, Train", "RMSE, Test",
"Percent Change")## R version 3.3.1 (2016-06-21)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 10586)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] randomForest_4.6-12 RWeka_0.4-29 elasticnet_1.1
## [4] lars_1.2 MASS_7.3-45 RCurl_1.95-4.8
## [7] bitops_1.0-6 rpart_4.1-10 rattle_4.1.0
## [10] pander_0.6.0 leaps_2.9 car_2.1-2
## [13] caret_6.0-70 ggplot2_2.1.0 lattice_0.20-33
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.5 compiler_3.3.1 RColorBrewer_1.1-2
## [4] formatR_1.4 nloptr_1.0.4 plyr_1.8.4
## [7] RWekajars_3.9.0-1 iterators_1.0.8 tools_3.3.1
## [10] partykit_1.1-0 digest_0.6.9 lme4_1.1-12
## [13] evaluate_0.9 nlme_3.1-128 gtable_0.2.0
## [16] mgcv_1.8-12 Matrix_1.2-6 foreach_1.4.3
## [19] yaml_2.1.13 parallel_3.3.1 SparseM_1.7
## [22] rJava_0.9-8 RGtk2_2.20.31 stringr_1.0.0
## [25] knitr_1.13.1 MatrixModels_0.4-1 stats4_3.3.1
## [28] grid_3.3.1 nnet_7.3-12 survival_2.39-5
## [31] rmarkdown_1.0 Formula_1.2-1 minqa_1.2.4
## [34] reshape2_1.4.1 magrittr_1.5 scales_0.4.0
## [37] codetools_0.2-14 htmltools_0.3.5 splines_3.3.1
## [40] rpart.plot_2.0.1 pbkrtest_0.4-6 colorspace_1.2-6
## [43] quantreg_5.26 stringi_1.1.1 munsell_0.4.3