Initial Analysis

S. Jackson Kelley

A01281942

Reading in the data

Here we read in data exported from the MongoDB database running on the AWS server. outcomes contains only the item ID, category ID and name, and final selling price. forSale contains any features I thought might be of use to a machine learning algorithm, such as the item’s title string and location.

outcomes = read.csv("completedItems_2.csv",header=TRUE)
forSale = read.csv("forSale_2.csv",header=TRUE)
head(outcomes)
##           X_id categoryId        categoryName sellingPrice
## 1 112185507879     139971 Video Game Consoles       349.99
## 2 122202585486     139971 Video Game Consoles       305.00
## 3 291926308477     139971 Video Game Consoles       200.98
## 4 262695233466     139971 Video Game Consoles       280.00
## 5 142163675662     139971 Video Game Consoles       220.00
## 6 262695805629     139971 Video Game Consoles       250.00
head(forSale)
##           X_id
## 1 172388678529
## 2 122194533957
## 3 282229141811
## 4 282233993583
## 5 162252708291
## 6 131981426207
##                                                                           title
## 1                  NEW PLAYSTATION 4 LIMITED EDITION TACO BELL GOLD CONSOLE PS4
## 2 SONY PLAYSTATION 4 (CUH-1215A) 500GB DESTINY THE TAKEN KING LIMITED EDITION  
## 3                                         SONY PLAYSTATION 4 SYSTEM  500 GIG HD
## 4                                   SONY PLAYSTATION 4 500 GB JET BLACK CONSOLE
## 5                    UNCHARTED 4: A THIEF'S END (SONY PLAYSTATION 4 SLIM, 2016)
## 6                                                            SONY PLAYSTATION 4
##   currentPrice shippingCost calculateShipping totalPrice
## 1       650.00        25.00             false     675.00
## 2       250.00        10.00             false     260.00
## 3       239.99        19.99             false     259.98
## 4       200.00        30.00             false     230.00
## 5       237.50        28.00             false     265.50
## 6       132.50        14.00             false     146.50
##                dateQueried                  endDate               location
## 1 2016-10-29T07:30:02.399Z 2016-10-30T02:35:15.000Z       Las Vegas,NV,USA
## 2 2016-10-29T07:30:02.423Z 2016-10-29T15:18:15.000Z      Montgomery,AL,USA
## 3 2016-10-29T07:30:03.615Z 2016-10-29T17:05:05.000Z         Ballwin,MO,USA
## 4 2016-10-29T07:30:03.620Z 2016-10-29T18:57:11.000Z        Trinidad,CO,USA
## 5 2016-10-29T07:30:03.623Z 2016-10-29T16:49:40.000Z West Palm Beach,FL,USA
## 6 2016-10-29T07:30:03.629Z 2016-10-29T21:54:15.000Z       Watertown,CT,USA
##   country bidCount listingType bestOffer buyItNowAvailable conditionId
## 1      US      NaN  FixedPrice     false             false        1000
## 2      US        1     Auction     false             false        3000
## 3      US        0     Auction     false             false        3000
## 4      US        0     Auction     false             false        3000
## 5      US       41     Auction     false             false        1500
## 6      US       20     Auction     false             false        3000
##      conditionDisplayName
## 1                     New
## 2                    Used
## 3                    Used
## 4                    Used
## 5 New other (see details)
## 6                    Used
##                                                                                timeLeft
## 1  {"milliseconds":0,"seconds":32,"minutes":3,"hours":19,"days":0,"months":0,"years":0}
## 2  {"milliseconds":0,"seconds":32,"minutes":46,"hours":7,"days":0,"months":0,"years":0}
## 3  {"milliseconds":0,"seconds":20,"minutes":33,"hours":9,"days":0,"months":0,"years":0}
## 4 {"milliseconds":0,"seconds":26,"minutes":25,"hours":11,"days":0,"months":0,"years":0}
## 5  {"milliseconds":0,"seconds":55,"minutes":17,"hours":9,"days":0,"months":0,"years":0}
## 6 {"milliseconds":0,"seconds":30,"minutes":22,"hours":14,"days":0,"months":0,"years":0}

The output above gives an overview of every feature collected from the eBay API in the forSale data frame. A few confusing ones will be noted for the purposes of this presentation.

Pre-processing

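I will perform the imputation of shipping costs now. First we set the shipping cost to NA for items with calculated shipping (the placeholder -1 values), and then impute the NAs with the average shipping cost of the remaining items. Missing bid counts are imputed with their mean in the same way.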

library(Hmisc)  # provides impute()
# Items with calculated shipping have no fixed cost, so treat it as missing
forSale$shippingCost[forSale$calculateShipping == "true"] = NA
forSale$shippingCost = with(forSale, impute(shippingCost, mean))
forSale$bidCount = with(forSale, impute(bidCount, mean))
forSale$bidCount = as.numeric(forSale$bidCount)
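The imputation above relies on impute() from the Hmisc package. As a minimal base-R sketch of the same idea, assuming a small made-up data frame (toy and its values are illustrative and not taken from the dataset):

```r
# Toy listings; values are made up for illustration only
toy <- data.frame(
  shippingCost      = c(25, 10, -1, 30, -1),
  calculateShipping = c("false", "false", "true", "false", "true"),
  stringsAsFactors  = FALSE
)

# Calculated shipping has no fixed cost, so mark it as missing
toy$shippingCost[toy$calculateShipping == "true"] <- NA

# Replace each NA with the mean of the observed shipping costs
observed_mean <- mean(toy$shippingCost, na.rm = TRUE)
toy$shippingCost[is.na(toy$shippingCost)] <- observed_mean
```

Mean imputation gives every missing item the same value, which is why the shippingCost median in the summaries further down is exactly the imputed 10.05.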

Next we recalculate totalPrice from the imputed shipping cost and remove sold items that aren’t video game consoles.

forSale$totalPrice = forSale$currentPrice + forSale$shippingCost
forSale$totalPrice = as.numeric(forSale$totalPrice)
outcomes = outcomes[outcomes$categoryName == "Video Game Consoles",]

Dates

Here I’m converting the end-date strings to actual R dates and deriving the day of the week on which each listing ends.

dates = as.Date(forSale$endDate)
forSale$dayOfWeek = as.factor(weekdays(dates))
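As a small illustration of this conversion, here is a sketch using two endDate strings copied from the head(forSale) output above. as.Date() reads the leading year-month-day part of each timestamp and drops the time of day; the variable names are illustrative.

```r
# Two endDate values taken from the head(forSale) output
end_date <- c("2016-10-30T02:35:15.000Z", "2016-10-29T15:18:15.000Z")

# as.Date() parses the leading YYYY-MM-DD and ignores the rest of the string
dates <- as.Date(end_date)

# weekdays() returns locale-dependent names;
# format "%u" gives a locale-independent number, 1 (Monday) to 7 (Sunday)
day_name   <- weekdays(dates)
day_number <- as.integer(format(dates, "%u"))
```

Because the timestamps are in UTC and the time is dropped, a listing that ends in the evening in a US time zone can be assigned to the following day.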

Outliers

We can use R’s built-in boxplot statistics to detect and remove outliers from the dataset. Listings with a total price of 100 USD or less are dropped as well.

outliers = boxplot.stats(forSale$totalPrice)$out
forSale = forSale[!forSale$totalPrice %in% outliers,]
forSale = forSale[(forSale$totalPrice > 100),]
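For reference, boxplot.stats() flags values that lie more than 1.5 times the hinge spread beyond the lower or upper hinge. A minimal sketch on a made-up price vector (the numbers are illustrative):

```r
# Made-up total prices with one obvious outlier
prices <- c(200, 210, 220, 230, 240, 250, 900)

# $out holds the values beyond the whiskers (default coef = 1.5)
outliers <- boxplot.stats(prices)$out

# Keep only the prices that were not flagged
kept <- prices[!prices %in% outliers]
```

Here the hinges are 215 and 245, so the upper fence is 245 + 1.5 * 30 = 290 and only the 900 listing is removed.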

Summary

Here we summarize the data to see if anything stands out as interesting for further analysis or visualization.

forSale Summary

Surprisingly, a large number of listings share the exact titles “SONY PLAYSTATION 4 BASIC SET 500 GB BLACK CONSOLE” (212 listings) and “SONY PLAYSTATION 4 (LATEST MODEL)- 500 GB BLACK CONSOLE” (67 listings). This may be because many people are buying and relisting with the same title, similar to my software. We also print the SD of totalPrice for all Playstation 4s in order to get a better idea of how variable the prices are.

summary(forSale)
## 
##  739 values imputed to 10.05239
##       X_id          
##  Min.   :1.120e+11  
##  1st Qu.:1.623e+11  
##  Median :2.223e+11  
##  Mean   :2.271e+11  
##  3rd Qu.:2.822e+11  
##  Max.   :4.012e+11  
##                     
##                                                                               title     
##  SONY PLAYSTATION 4 BASIC SET 500 GB BLACK CONSOLE                               : 212  
##  SONY PLAYSTATION 4 (LATEST MODEL)- 500 GB BLACK CONSOLE                         :  67  
##  SONY PLAYSTATION 4 500 GB BLACK CONSOLE                                         :  58  
##  SONY PLAYSTATION 4 CALL OF DUTY: BLACK OPS III - STANDARD EDITION 500 GB JET... :  42  
##  PLAYSTATION 4                                                                   :  24  
##  SONY PLAYSTATION 4 CALL OF DUTY: BLACK OPS III - STANDARD EDITION 500 GB JET BL…:  20  
##  (Other)                                                                         :1216  
##   currentPrice    shippingCost   calculateShipping   totalPrice   
##  Min.   : 80.0   Min.   : 0.00   false:900         Min.   :102.0  
##  1st Qu.:177.5   1st Qu.: 0.00   true :739         1st Qu.:187.5  
##  Median :202.5   Median :10.05                     Median :215.5  
##  Mean   :225.8   Mean   :10.21                     Mean   :236.0  
##  3rd Qu.:255.0   3rd Qu.:12.95                     3rd Qu.:269.0  
##  Max.   :469.9   Max.   :50.00                     Max.   :469.9  
##                                                                   
##                    dateQueried                      endDate    
##                          :739   2016-11-02T17:06:29.000Z:   2  
##  2016-10-29T07:30:02.423Z:  1   2016-11-04T01:00:08.000Z:   2  
##  2016-10-29T07:30:03.615Z:  1   2016-11-12T03:00:07.000Z:   2  
##  2016-10-29T07:30:03.620Z:  1   2016-11-14T16:27:11.000Z:   2  
##  2016-10-29T07:30:03.623Z:  1   2016-11-15T15:30:23.000Z:   2  
##  2016-10-29T07:30:03.629Z:  1   2016-11-22T00:59:57.000Z:   2  
##  (Other)                 :895   (Other)                 :1627  
##                location       country        bidCount    
##  USA               :  38   US     :1585   Min.   : 0.00  
##  Canada            :  28   CA     :  28   1st Qu.: 0.00  
##  Columbus,OH,USA   :  23   GB     :   5   Median : 8.00  
##  Miami,FL,USA      :  21   KR     :   5   Mean   :11.36  
##  Los Angeles,CA,USA:  20   JP     :   4   3rd Qu.:12.00  
##  New York,NY,USA   :  18   RU     :   3   Max.   :82.00  
##  (Other)           :1491   (Other):   9                  
##          listingType   bestOffer    buyItNowAvailable  conditionId  
##  Auction       :1159   false:1512   false:1480        Min.   :1000  
##  AuctionWithBIN: 159   true : 127   true : 159        1st Qu.:3000  
##  FixedPrice    : 204                                  Median :3000  
##  StoreInventory: 117                                  Mean   :2629  
##                                                       3rd Qu.:3000  
##                                                       Max.   :3000  
##                                                                     
##                conditionDisplayName
##  Used                    :1280     
##  New                     : 219     
##  New other (see details) :  93     
##  Seller refurbished      :  34     
##  Manufacturer refurbished:  13     
##  Brand New               :   0     
##  (Other)                 :   0     
##                                                                                   timeLeft   
##  {"milliseconds":0,"seconds":14,"minutes":28,"hours":18,"days":0,"months":0,"years":0}:   3  
##  {"milliseconds":0,"seconds":0,"minutes":43,"hours":18,"days":0,"months":0,"years":0} :   2  
##  {"milliseconds":0,"seconds":10,"minutes":4,"hours":16,"days":0,"months":0,"years":0} :   2  
##  {"milliseconds":0,"seconds":13,"minutes":28,"hours":18,"days":0,"months":0,"years":0}:   2  
##  {"milliseconds":0,"seconds":16,"minutes":33,"hours":20,"days":0,"months":0,"years":0}:   2  
##  {"milliseconds":0,"seconds":22,"minutes":28,"hours":10,"days":0,"months":0,"years":0}:   2  
##  (Other)                                                                              :1626  
##      dayOfWeek  
##  Friday   :225  
##  Monday   :300  
##  Saturday :223  
##  Sunday   :266  
##  Thursday :203  
##  Tuesday  :248  
##  Wednesday:174
sd(forSale$totalPrice)
## [1] 68.25486

Initial Questions

  • Does a higher bidCount the day before the end date result in a higher price?

  • Does buyItNow affect price?

  • Does the text in the item title affect price? (appears to be significant after totalCost)

  • Does day of the week affect price?

Outcomes Summary

There isn’t much to this data, as it is the outcome we’re trying to predict. We do see that the average selling price of all Playstation 4s is about 265 USD.

The SDs of the two datasets are similar (68.25 for forSale and 70.30 for outcomes), which gives me some comfort in the data gathering methods.

summary(outcomes)
##       X_id             categoryId                            categoryName 
##  Min.   :1.121e+11   Min.   :139971   Video Game Consoles          :7797  
##  1st Qu.:1.623e+11   1st Qu.:139971   Cell Phones & Smartphones    :   0  
##  Median :2.321e+11   Median :139971   Consoles de jeux vidéo       :   0  
##  Mean   :2.297e+11   Mean   :139971   Console Systems              :   0  
##  3rd Qu.:2.822e+11   3rd Qu.:139971   DJ Controllers               :   0  
##  Max.   :4.012e+11   Max.   :139971   Faceplates, Decals & Stickers:   0  
##                                       (Other)                      :   0  
##   sellingPrice  
##  Min.   :120.0  
##  1st Qu.:224.9  
##  Median :247.5  
##  Mean   :265.0  
##  3rd Qu.:285.0  
##  Max.   :592.0  
## 
sd(outcomes$sellingPrice)
## [1] 70.30095

Visualization

I’d first like to see the histogram of these items and see just how variable their prices are.

Histogram of Price Counts for Sold Playstation4s

library(ggplot2)
ggplot(outcomes, aes(sellingPrice, fill = categoryName)) +
  geom_histogram(binwidth = 5)

Histogram of Price Counts for Playstation4s currently for sale

ggplot(forSale, aes(totalPrice, fill = listingType)) +
  geom_histogram(binwidth = 5)

Both distributions appear to be positively skewed.

Plotting the location of sold Playstation4s

The map plot shows that a high number of Playstation 4s are sold in the Northeast of the US and around the Los Angeles area. This is probably simply due to the higher populations in these areas.

library(ggmap)
map1 = ggmap(get_map(location = "United States",zoom=4))+geom_point(data=locations, aes(x=lon,y=lat),color="orange")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=United+States&zoom=4&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=United%20States&sensor=false
map1
## Warning: Removed 16 rows containing missing values (geom_point).

Gluing outcome and forSale dataset

Now we merge the outcomes into the forSale dataset by item ID so that we can train, cross-validate, and test machine learning algorithms against it.

fullDataset <- merge(forSale,outcomes,by="X_id")
summary(fullDataset)
## 
##  516 values imputed to 10.05239
##       X_id          
##  Min.   :1.122e+11  
##  1st Qu.:1.623e+11  
##  Median :2.223e+11  
##  Mean   :2.292e+11  
##  3rd Qu.:2.823e+11  
##  Max.   :4.012e+11  
##                     
##                                                                               title    
##  SONY PLAYSTATION 4 BASIC SET 500 GB BLACK CONSOLE                               :172  
##  SONY PLAYSTATION 4 (LATEST MODEL)- 500 GB BLACK CONSOLE                         : 44  
##  SONY PLAYSTATION 4 500 GB BLACK CONSOLE                                         : 38  
##  SONY PLAYSTATION 4 CALL OF DUTY: BLACK OPS III - STANDARD EDITION 500 GB JET... : 31  
##  SONY PLAYSTATION 4 CALL OF DUTY: BLACK OPS III - STANDARD EDITION 500 GB JET BL…: 17  
##  PLAYSTATION 4                                                                   : 16  
##  (Other)                                                                         :779  
##   currentPrice    shippingCost   calculateShipping   totalPrice   
##  Min.   : 80.0   Min.   : 0.00   false:581         Min.   :102.0  
##  1st Qu.:170.0   1st Qu.:10.05   true :516         1st Qu.:181.1  
##  Median :192.0   Median :10.05                     Median :201.6  
##  Mean   :202.9   Mean   :10.84                     Mean   :213.7  
##  3rd Qu.:219.5   3rd Qu.:14.99                     3rd Qu.:230.0  
##  Max.   :450.0   Max.   :50.00                     Max.   :450.0  
##                                                                   
##                    dateQueried                      endDate    
##                          :516   2016-11-04T01:00:08.000Z:   2  
##  2016-10-29T07:30:02.423Z:  1   2016-11-14T16:27:11.000Z:   2  
##  2016-10-29T07:30:03.620Z:  1   2016-11-15T15:30:23.000Z:   2  
##  2016-10-29T07:30:03.623Z:  1   2016-11-22T00:59:57.000Z:   2  
##  2016-10-29T07:30:03.629Z:  1   2016-10-29T15:18:15.000Z:   1  
##  2016-10-29T07:30:04.831Z:  1   2016-10-29T16:47:03.000Z:   1  
##  (Other)                 :576   (Other)                 :1087  
##                location       country        bidCount    
##  USA               :  23   US     :1071   Min.   : 0.00  
##  Canada            :  17   CA     :  17   1st Qu.: 1.00  
##  Brooklyn,NY,USA   :  14   PR     :   2   Median : 5.00  
##  Boise,ID,USA      :  13   AU     :   1   Mean   :12.89  
##  Columbus,OH,USA   :  13   EE     :   1   3rd Qu.:20.00  
##  Los Angeles,CA,USA:  12   GB     :   1   Max.   :82.00  
##  (Other)           :1005   (Other):   4                  
##          listingType  bestOffer    buyItNowAvailable  conditionId  
##  Auction       :984   false:1080   false:1020        Min.   :1000  
##  AuctionWithBIN: 77   true :  17   true :  77        1st Qu.:3000  
##  FixedPrice    : 22                                  Median :3000  
##  StoreInventory: 14                                  Mean   :2714  
##                                                      3rd Qu.:3000  
##                                                      Max.   :3000  
##                                                                    
##                conditionDisplayName
##  Used                    :912      
##  New                     :109      
##  New other (see details) : 55      
##  Seller refurbished      : 15      
##  Manufacturer refurbished:  6      
##  Brand New               :  0      
##  (Other)                 :  0      
##                                                                                   timeLeft   
##  {"milliseconds":0,"seconds":14,"minutes":28,"hours":18,"days":0,"months":0,"years":0}:   3  
##  {"milliseconds":0,"seconds":13,"minutes":28,"hours":18,"days":0,"months":0,"years":0}:   2  
##  {"milliseconds":0,"seconds":16,"minutes":33,"hours":20,"days":0,"months":0,"years":0}:   2  
##  {"milliseconds":0,"seconds":22,"minutes":43,"hours":16,"days":0,"months":0,"years":0}:   2  
##  {"milliseconds":0,"seconds":29,"minutes":28,"hours":18,"days":0,"months":0,"years":0}:   2  
##  {"milliseconds":0,"seconds":35,"minutes":28,"hours":18,"days":0,"months":0,"years":0}:   2  
##  (Other)                                                                              :1084  
##      dayOfWeek     categoryId                            categoryName 
##  Friday   :157   Min.   :139971   Video Game Consoles          :1097  
##  Monday   :219   1st Qu.:139971   Cell Phones & Smartphones    :   0  
##  Saturday :151   Median :139971   Consoles de jeux vidéo       :   0  
##  Sunday   :190   Mean   :139971   Console Systems              :   0  
##  Thursday :140   3rd Qu.:139971   DJ Controllers               :   0  
##  Tuesday  :139   Max.   :139971   Faceplates, Decals & Stickers:   0  
##  Wednesday:101                    (Other)                      :   0  
##   sellingPrice  
##  Min.   :127.5  
##  1st Qu.:223.5  
##  Median :243.5  
##  Mean   :255.9  
##  3rd Qu.:275.0  
##  Max.   :545.0  
## 

Day Of Week

Is the day of the week statistically significant? Here we compute the average price for each day of the week and plot the distribution of selling prices per day in a boxplot.

averagePricePerDay = with(fullDataset, tapply(sellingPrice, dayOfWeek,mean))
plotDF =aggregate(fullDataset$sellingPrice ~ fullDataset$dayOfWeek, FUN = mean)
plotDF
##   fullDataset$dayOfWeek fullDataset$sellingPrice
## 1                Friday                 251.0188
## 2                Monday                 255.1098
## 3              Saturday                 253.4496
## 4                Sunday                 261.3309
## 5              Thursday                 246.7069
## 6               Tuesday                 266.2805
## 7             Wednesday                 257.3285
ggplot(fullDataset, aes(y = sellingPrice, x = factor(dayOfWeek))) +scale_x_discrete("Day Of The Week") + 
        scale_y_continuous("Final Selling Price") + geom_boxplot(outlier.color="red")

But is this statistically significant? Only marginally. After Holm adjustment, the pairwise t-tests find a single pair below 0.05 (Thursday vs. Tuesday, p = 0.043), and the one-way ANOVA gives p = 0.046. Because my dataset is still relatively small, I would not treat the day of the week as a reliable effect yet.

pairwise.t.test(fullDataset$sellingPrice,
                fullDataset$dayOfWeek, alternative ="two.sided")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  fullDataset$sellingPrice and fullDataset$dayOfWeek 
## 
##           Friday Monday Saturday Sunday Thursday Tuesday
## Monday    1.000  -      -        -      -        -      
## Saturday  1.000  1.000  -        -      -        -      
## Sunday    1.000  1.000  1.000    -      -        -      
## Thursday  1.000  1.000  1.000    0.263  -        -      
## Tuesday   0.263  0.878  0.705    1.000  0.043    -      
## Wednesday 1.000  1.000  1.000    1.000  1.000    1.000  
## 
## P value adjustment method: holm
oneway.test(sellingPrice ~ dayOfWeek, data = fullDataset)
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  sellingPrice and dayOfWeek
## F = 2.1621, num df = 6.00, denom df = 450.09, p-value = 0.04555
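The pairwise table above reports Holm-adjusted p-values, which is why most entries are exactly 1.000. As a sketch of what that adjustment does, using three made-up raw p-values: Holm sorts the p-values in ascending order, multiplies the smallest by the number of tests, the next by one fewer, and so on, then forces the results to be non-decreasing and caps them at 1.

```r
# Three made-up raw p-values from three comparisons
raw_p <- c(0.01, 0.02, 0.03)

# Holm step-down adjustment, the default in pairwise.t.test()
holm_p <- p.adjust(raw_p, method = "holm")

# 0.01 * 3 = 0.03, 0.02 * 2 = 0.04, and 0.03 * 1 = 0.03 is raised to 0.04
# so that the adjusted values never decrease
holm_p
```

With seven days of the week there are 21 pairwise comparisons, so the smallest raw p-value is multiplied by 21 before it is compared with 0.05.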

Bid Count

It doesn’t look as though the number of bids affects price that much. We do see a slightly positive correlation in regard to new items with conditions attached to them, but again, this could be due to lack of data.

p1 <- ggplot(fullDataset, aes(x = bidCount, y = sellingPrice, color = factor(conditionDisplayName))) + geom_point() + geom_smooth(method = "lm", se = TRUE)
p1

lmOut = lm(sellingPrice ~ bidCount, data = fullDataset)
summary(lmOut)
## 
## Call:
## lm(formula = sellingPrice ~ bidCount, data = fullDataset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -127.90  -32.58  -12.40   18.98  289.60 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 255.07652    2.02825 125.762   <2e-16 ***
## bidCount      0.06546    0.09648   0.679    0.498    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 53.06 on 1095 degrees of freedom
## Multiple R-squared:  0.0004202,  Adjusted R-squared:  -0.0004926 
## F-statistic: 0.4604 on 1 and 1095 DF,  p-value: 0.4976

The linear regression agrees: the p-value of the bidCount coefficient is 0.498, far too high for statistical significance.
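To pull these numbers out of a fitted model programmatically rather than reading them off the printout, one option is sketched below on synthetic data. The data frame toy is made up, and its price is unrelated to bidCount by construction. For a regression with a single predictor, the R-squared equals the squared correlation between predictor and response.

```r
set.seed(1)

# Synthetic data: selling price is pure noise around 250, independent of bids
toy <- data.frame(bidCount = 0:49)
toy$sellingPrice <- 250 + rnorm(50, sd = 50)

fit <- lm(sellingPrice ~ bidCount, data = toy)

# Coefficient table as a matrix: estimate, standard error, t value, p-value
coefs   <- coef(summary(fit))
slope_p <- coefs["bidCount", "Pr(>|t|)"]

# With one predictor, R-squared is the squared Pearson correlation
r_squared <- summary(fit)$r.squared
r         <- cor(toy$bidCount, toy$sellingPrice)
```

Applied to the output above, the Multiple R-squared of 0.0004202 corresponds to a correlation of only about 0.02 between bidCount and sellingPrice.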

Buy It Now

Does the ability to “buy it now” affect price in any way?

averagePricePerBuyItNow = with(fullDataset, tapply(sellingPrice, buyItNowAvailable,mean))
plotDF = aggregate(fullDataset$sellingPrice ~ fullDataset$buyItNowAvailable, FUN = mean)
plotDF
##   fullDataset$buyItNowAvailable fullDataset$sellingPrice
## 1                         false                 254.3955
## 2                          true                 276.1234
ggplot(fullDataset, aes(y = sellingPrice, x = factor(buyItNowAvailable))) +scale_x_discrete("Buy It Now Available") + 
        scale_y_continuous("Final Selling Price") + geom_boxplot(outlier.color="red")

But is this statistically significant? Both the t-test on the BuyItNow variable (p = 0.00051) and the one-way ANOVA (p = 0.008) suggest that it may be. Because only 77 of the 1097 listings offer Buy It Now, this result should still be treated with caution.

pairwise.t.test(fullDataset$sellingPrice,
                fullDataset$buyItNowAvailable, alternative ="two.sided")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  fullDataset$sellingPrice and fullDataset$buyItNowAvailable 
## 
##      false  
## true 0.00051
## 
## P value adjustment method: holm
oneway.test(sellingPrice ~ buyItNowAvailable, data = fullDataset)
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  sellingPrice and buyItNowAvailable
## F = 7.388, num df = 1.000, denom df = 82.544, p-value = 0.008

Title Analysis

Here we analyze the titles used in listings. This is done by building a document-term matrix from the cleaned titles, dropping the terms that appear in fewer than about 2% of them, and looking to see if any of the remaining terms stand out in relation to price. Another question could be, does using a similar title to other listings lead to a higher selling price?

library(tm)
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(wordcloud)
## Loading required package: RColorBrewer
text = Corpus(VectorSource(fullDataset$title))
text = tm_map(text,removePunctuation)
text = tm_map(text,content_transformer(tolower))
text = tm_map(text,removeWords,stopwords("english"))
# for(i in seq(text)){
#         text[[i]] = gsub("playstation4","playstation4",text[[i]])
#         text[[i]] = gsub("playstation 4","playstation4",text[[i]])
#         text[[i]] = gsub("playstation","playstation4",text[[i]])
# }
# text <- tm_map(text, stemDocument)
text <- tm_map(text, stripWhitespace)
# text <- tm_map(text,PlainTextDocument)
dtm = DocumentTermMatrix(text)
freq = colSums(as.matrix(dtm))
# length(freq)
ord = order(freq)
# freq[ord]
dtms = removeSparseTerms(dtm,.98)
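To make the document-term matrix and the sparsity filter more concrete, here is a base-R sketch on three titles taken from the summary output earlier in this report. It is only an illustration of the idea: it does not reproduce tm's tokenizer or its defaults, and the names max_sparsity and dtm_small are hypothetical.

```r
# Three titles that appear in the summary(forSale) output
titles <- c("SONY PLAYSTATION 4 500 GB BLACK CONSOLE",
            "SONY PLAYSTATION 4 (LATEST MODEL)- 500 GB BLACK CONSOLE",
            "PLAYSTATION 4")

# Strip punctuation, lower-case, and split on whitespace
tokens <- strsplit(tolower(gsub("[[:punct:]]", "", titles)), "\\s+")
vocab  <- sort(unique(unlist(tokens)))

# One row per title, one column per term, cells hold term counts
counts <- vapply(tokens,
                 function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
                 integer(length(vocab)))
dtm <- t(counts)
colnames(dtm) <- vocab

# Sparsity of a term = share of titles in which it does not occur
sparsity     <- 1 - colSums(dtm > 0) / nrow(dtm)
max_sparsity <- 0.5   # the report uses 0.98 on the full dataset
dtm_small    <- dtm[, sparsity < max_sparsity, drop = FALSE]
```

In this toy example "latest" and "model" occur in only one of three titles and are dropped, while the other seven terms are kept as candidate features.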

A Wordcloud of Playstation4 titles

Here’s a nice visualization of the titles, for use in presentations as well as the poster.

set.seed(142)
wordcloud(names(freq),freq,max.words=100,rot.per=.2,colors=brewer.pal(6,"Dark2"))

Cleaning the Title

Here we add the split up title data frame to the full dataset to add it as a feature for prediction.

dtmsDF = as.data.frame(as.matrix(dtms))
dtmsDF$X_id = fullDataset$X_id
fullDataset = merge(fullDataset,dtmsDF, by = "X_id")
fullDataset$title = NULL
names(fullDataset) = make.names(names(fullDataset))

Machine Learning

Here I’m training multiple different machine learning algorithms in order to compare them against one another using RMSE on the test set.

Cross-Validation of the Full Dataset

Here I’m splitting the data into a 75% training set and a 25% test set held out for the final model comparison, as well as a few small preprocessing steps.

library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:Hmisc':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(caret)
## 
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
## 
##     cluster
library(rpart)
library(mlbench)

set.seed(415)
smp_size <- floor(0.75 * nrow(fullDataset))
train_indices = sample(seq_len(nrow(fullDataset)),size=smp_size)
trainingData = fullDataset[train_indices,]
testData     = fullDataset[-train_indices,]

make.names(levels(trainingData))
## character(0)

RandomForest

Here I’m training two Random Forest models using the caret package, which allows me to do repeated k-fold cross-validation as well as tune parameters during training (see the tunegrid object).

Here I use out-of-bag error as the metric for tuning.
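The chunk that trains modelRF and modelRF_2 is not shown in this report. As a rough sketch of what such calls can look like with caret, the example below runs on synthetic data. The 10-fold, 3-repeat resampling and the mtry grid mirror the modelRF printout later in the report (where mtry runs from 1 to 15). The data, the object names sketchRF and sketchRF_2, and the idea that the second model is the one tuned on out-of-bag error are assumptions for illustration only.

```r
library(caret)  # train() and trainControl(); method = "rf" needs randomForest installed

set.seed(415)

# Synthetic stand-in for trainingData; columns and values are made up
toy <- data.frame(
  currentPrice = runif(80, 80, 450),
  bidCount     = rpois(80, 10),
  conditionId  = sample(c(1000, 1500, 3000), 80, replace = TRUE)
)
toy$sellingPrice <- 50 + 0.9 * toy$currentPrice + rnorm(80, sd = 20)

# Repeated 10-fold cross-validation, tuning mtry over a grid
ctrlCV   <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                         search = "grid")
tunegrid <- expand.grid(mtry = 1:3)
sketchRF <- train(sellingPrice ~ ., data = toy, method = "rf",
                  metric = "RMSE", trControl = ctrlCV, tuneGrid = tunegrid)

# Same model family, tuned on out-of-bag error instead of resampling
ctrlOOB    <- trainControl(method = "oob")
sketchRF_2 <- train(sellingPrice ~ ., data = toy, method = "rf",
                    trControl = ctrlOOB, tuneGrid = tunegrid)
```

Out-of-bag tuning fits one forest per mtry value instead of one per resample, so it is much cheaper, at the cost of an error estimate that is specific to random forests and not directly comparable with the cross-validated RMSE of the other models.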

NeuralNet

Here I’m training a neural net, tuning over the size and decay parameters.

library(nnet)
modelNN = train(trainingData[,c(2,5,10,11,14,15,17,36)],trainingData$sellingPrice, method='nnet', linout=TRUE, trace = FALSE, maxit=1000,
                #Grid of tuning parameters to try:
                tuneGrid=expand.grid(.size=c(1,5,10),.decay=c(0,0.001,0.1)))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.

Linear Regression

Here I’m training two linear regression models, one with glm and one with lm.

modelGLM =  glm(trainingData$sellingPrice ~ trainingData$currentPrice +
                        trainingData$totalPrice +
                        trainingData$bidCount +
                        trainingData$listingType +
                        trainingData$conditionId)
modelLM  = lm(trainingData$sellingPrice ~ trainingData$currentPrice +
                        trainingData$totalPrice +
                        trainingData$bidCount +
                        trainingData$listingType +
                        trainingData$conditionId +
                        trainingData$dayOfWeek +
                        trainingData$edition)
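One caveat with formulas written as trainingData$column: the model stays tied to the training data frame, so predict() cannot substitute the columns of newdata. This is the source of the "'newdata' had 275 rows but variables found have 822 rows" warnings in the RMSE section. The sketch below shows the difference on synthetic data; toy, train_toy, and test_toy are made-up names.

```r
set.seed(1)

# Synthetic prices, split into 30 training rows and 10 test rows
toy <- data.frame(currentPrice = runif(40, 100, 400))
toy$sellingPrice <- 30 + toy$currentPrice + rnorm(40, sd = 10)
train_toy <- toy[1:30, ]
test_toy  <- toy[31:40, ]

# Columns referenced through train_toy$ are fixed to the training rows,
# so newdata is effectively ignored (R warns about the row mismatch)
fit_dollar  <- lm(train_toy$sellingPrice ~ train_toy$currentPrice)
pred_dollar <- suppressWarnings(predict(fit_dollar, newdata = test_toy))

# Bare column names plus data = let predict() look the columns up in newdata
fit_data  <- lm(sellingPrice ~ currentPrice, data = train_toy)
pred_data <- predict(fit_data, newdata = test_toy)

length(pred_dollar)  # 30: one value per training row
length(pred_data)    # 10: one value per test row
```

The same data = form applies to glm(), and it is the form already used in the caret train() calls in the sections that follow.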

GLMNet

Here I’m training an elastic-net regularized GLM (glmnet) with repeated cross-validation, tuning lambda and alpha.

library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-5
library(Matrix)
GLMNetctrl = trainControl(method="repeatedcv", number=10, repeats= 3,search="grid")
tunegrid = expand.grid(.alpha=seq(0,1,.1),.lambda=c(0:15))
modelGLMNet = train(sellingPrice ~
                        currentPrice +
                        totalPrice +
                        bidCount +
                        listingType +
                        conditionId +
                        dayOfWeek +
                        edition,data=trainingData, method="glmnet",trControl=GLMNetctrl, tuneGrid=tunegrid)

SVM

Here I train three SVM variants, each with repeated cross-validation. The code for the linear and radial kernels is shown below.

library(kernlab)
SVMctrl = trainControl(method="repeatedcv",number=10,repeats=3)

modelSVM = train(sellingPrice ~
                        currentPrice +
                        totalPrice +
                        bidCount +
                        listingType +
                        conditionId +
                        dayOfWeek +
                        edition,data=trainingData, method="svmLinear",trControl=SVMctrl)
SVMctrlR = trainControl(method="repeatedcv",number=10,repeats=3)
modelSVMR = train(sellingPrice ~
                        currentPrice +
                        totalPrice +
                        bidCount +
                        listingType +
                        conditionId +
                        dayOfWeek +
                        edition,data=trainingData, method="svmRadial",trControl=SVMctrlR)

Price Prediction

We now have a price prediction! Items that the algorithm predicts to be worth more than they sold for are the items I will target for purchasing. Items whose predicted profit is negative or below a profit threshold will be ignored.

The predictions are made, and RMSE against the test set is taken in order to select the most accurate machine learning algorithms.

options(width = 1000)
##RANDOM FOREST PREDICTION
modelRF
## Random Forest 
## 
## 822 samples
##   7 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 739, 739, 740, 739, 741, 741, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared 
##    1    39.27841  0.6280630
##    2    32.77796  0.6532402
##    3    31.89493  0.6551324
##    4    31.87098  0.6543588
##    5    32.03830  0.6511541
##    6    32.22236  0.6478545
##    7    32.35928  0.6452307
##    8    32.44278  0.6435712
##    9    32.49475  0.6428315
##   10    32.69318  0.6391877
##   11    32.75517  0.6377128
##   12    32.87788  0.6356592
##   13    32.96935  0.6339659
##   14    33.05244  0.6324629
##   15    33.05315  0.6325832
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was mtry = 4.
predictRF <- predict(modelRF, newdata = testData)
predictRF_2 <- predict(modelRF_2, newdata = testData)


# varImp(modelRF_2$finalModel)
# varImp(modelRF$finalModel)
# 
# varImpPlot(modelRF_2$finalModel)
# varImpPlot(modelRF$finalModel)

RMSE

Here are the RMSE values for each algorithm.

RFrmse <- sqrt(mean((predictRF - testData$sellingPrice)^2))
RFrmse
## [1] 30.91988
RFrmse_2 <- sqrt(mean((predictRF_2 - testData$sellingPrice)^2))
RFrmse_2
## [1] 31.45724
##GLM PREDICTION
# summary(modelGLM)
GLMpredict = predict(modelGLM,newdata = testData)
## Warning: 'newdata' had 275 rows but variables found have 822 rows
GLMrmse <- sqrt(mean((GLMpredict - testData$sellingPrice)^2))
## Warning in GLMpredict - testData$sellingPrice: longer object length is not a multiple of shorter object length
GLMrmse
## [1] 67.67035
##LM PREDICTION
# summary(modelLM)
LMpredict = predict(modelLM,newdata = testData)
## Warning: 'newdata' had 275 rows but variables found have 822 rows
LMrmse <- sqrt(mean((LMpredict - testData$sellingPrice)^2))
## Warning in LMpredict - testData$sellingPrice: longer object length is not a multiple of shorter object length
LMrmse
## [1] 68.16249
##GLMNet PREDICTION
GLMNetpredict <- predict(modelGLMNet, newdata = testData)
GLMNetrmse <- sqrt(mean((GLMNetpredict - testData$sellingPrice)^2))
GLMNetrmse
## [1] 31.11125
##NN PREDICTION
NNpredict <- predict(modelNN, newdata = testData)
NNrmse <- sqrt(mean((NNpredict - testData$sellingPrice)^2))
NNrmse
## [1] 32.36508
##SVM PREDICTION
SVMpredict <- predict(modelSVM, newdata = testData)
SVMrmse <- sqrt(mean((SVMpredict - testData$sellingPrice)^2))
SVMrmse
## [1] 30.26452
##SVM3 PREDICTION
SVMpredict3 <- predict(modelSVM3, newdata = testData)
SVMrmse3 <- sqrt(mean((SVMpredict3 - testData$sellingPrice)^2))
SVMrmse3
## [1] 37.36498
##SVMRADIAL PREDICTION
SVMpredictR <- predict(modelSVMR, newdata = testData)
SVMrmseR <- sqrt(mean((SVMpredictR - testData$sellingPrice)^2))
SVMrmseR
## [1] 34.49452
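Each RMSE above repeats the same expression, the square root of the mean squared difference between prediction and observed price. A small helper, sketched here under the hypothetical name rmse, keeps that in one place and stops on a length mismatch instead of silently recycling values, which is what happened in the GLM and LM chunks.

```r
# Root mean squared error with a guard against mismatched lengths
rmse <- function(predicted, observed) {
  stopifnot(length(predicted) == length(observed))
  sqrt(mean((predicted - observed)^2))
}

# Squared errors are 0, 0 and 4, so the result is sqrt(4 / 3)
rmse(c(1, 2, 3), c(1, 2, 5))
```

With this helper a call such as rmse(GLMpredict, testData$sellingPrice) would fail immediately, because the two vectors have 822 and 275 elements.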

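Here we observe that the first Random Forest model (RMSE 30.92) and the first SVM model (RMSE 30.26) perform the best on the test set. These will be used to detect items for sale below our predicted price. The GLM and LM values (67.67 and 68.16) are not valid test errors: as the warnings above show, predict() returned 822 training-row fits rather than 275 test-row predictions, because both formulas reference trainingData$ columns directly.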

Output

The table below lists, for each item in the test set, the price it sold for and the predicted profit (predicted price minus selling price) from the Random Forest and SVM models, along with the average of the two. Rows are sorted from the highest to the lowest average predicted profit.

SVMprofit = round(SVMpredict-testData$sellingPrice,2)
RFprofit = round(predictRF-testData$sellingPrice,2)
printOutRF = data.frame(testData$X_id,testData$sellingPrice,RFprofit,SVMprofit, rowMeans(cbind(SVMprofit,RFprofit)))
colnames(printOutRF) = c("itemId", "priceSold","predictedProfit_RF","predictedProfit_SVM", "avgPredictedProfit")
printOutRF = printOutRF[order(printOutRF$avgPredictedProfit,decreasing=TRUE),]
printOutRF
##            itemId priceSold predictedProfit_RF predictedProfit_SVM avgPredictedProfit
## 988  332000007489    158.29             111.25              114.54            112.895
## 975  322326493360    127.50             109.22               53.72             81.470
## 401  192007168188    255.00              85.87               65.64             75.755
## 655  262613806540    230.00              66.14               46.33             56.235
## 357  182333187247    149.49              54.55               45.32             49.935
## 361  182338982582    242.50              56.56               42.96             49.760
## 810  282247067075    223.00              53.15               44.10             48.625
## 84   122211776615    225.00              57.72               36.18             46.950
## 716  262718314570    175.00              49.52               38.88             44.200
## 180  152292945987    208.00              46.48               40.42             43.450
## 514  222305963744    180.00              42.22               43.84             43.030
## 63   122198104186    162.50              57.06               27.84             42.450
## 362  182339003245    215.00              44.12               40.27             42.195
## 375  182344309342    226.00              44.62               38.41             41.515
## 996  332018981092    175.97              45.29               36.75             41.020
## 119  131981426207    164.50              38.45               37.45             37.950
## 356  182332883754    207.50              40.53               33.07             36.800
## 564  232133287719    167.50              47.51               25.65             36.580
## 1054 371774623724    222.50              34.47               35.21             34.840
## 824  282253629374    195.00              43.12               25.66             34.390
## 242  152325008569    212.50              37.63               29.73             33.680
## 946  322314005074    220.50              33.54               31.99             32.765
## 260  162264301644    207.49              29.60               35.57             32.585
## 115  122226575297    194.00              21.00               43.75             32.375
## 720  262720496308    175.00              36.38               27.27             31.825
## 60   122194533957    260.00              29.71               33.62             31.665
## 108  122223674707    222.50              33.49               29.73             31.610
## 656  262633144612    250.00              43.66               19.15             31.405
## 185  152296012585    214.99              28.79               33.41             31.100
## 363  182339192801    203.00              37.63               24.46             31.045
## 212  152311180850    193.50              37.08               24.70             30.890
## 406  192015286062    215.00              39.54               22.14             30.840
## 929  311733703886    222.00              30.06               30.93             30.495
## 403  192009910452    204.56              36.94               23.66             30.300
## 83   122210382518    215.48              27.31               33.06             30.185
## 1036 351902328990    213.50              30.35               28.31             29.330
## 503  222299824490    247.50              36.62               21.37             28.995
## 37   112199176439    204.00              31.98               25.92             28.950
## 626  252629564240    225.50              26.48               31.35             28.915
## 376  182345540511    270.00              34.41               23.35             28.880
## 1020 332031104385    224.99              24.48               33.20             28.840
## 508  222303644187    202.50              26.05               29.89             27.970
## 545  222314595911    197.50              33.68               21.83             27.755
## 747  272430375793    230.50              25.25               28.56             26.905
## 263  162267095611    225.00              27.73               23.95             25.840
## 568  232138865857    215.45              30.74               20.14             25.440
## 189  152299568284    275.00              30.31               19.71             25.010
## 793  282231578551    355.00              29.52               20.20             24.860
## 1079 391610105250    203.50              21.30               28.03             24.665
## 796  282233993583    230.00              18.02               30.93             24.475
## 69   122202618668    204.00              17.15               30.84             23.995
## 497  222295470518    225.45              21.56               25.83             23.695
## 273  162272916866    215.00              25.62               21.24             23.430
## 1040 361751823013    289.99              36.60               10.22             23.410
## 290  162280893629    232.50              27.70               18.91             23.305
## 256  162262068952    245.00              25.28               19.86             22.570
## 315  172395169410    232.50              30.28               14.40             22.340
## 265  162268955618    205.50              29.88               14.73             22.305
## 438  201701911555    222.49              17.65               26.91             22.280
## 99   122220700134    295.00              21.96               22.24             22.100
## 96   122219950985    232.50              16.55               25.34             20.945
## 899  302131295304    232.50              20.05               21.77             20.910
## 992  332018143330    202.50              19.68               21.19             20.435
## 889  302123067927    222.69              19.30               21.53             20.415
## 700  262714743506    198.50              17.72               22.86             20.290
## 997  332019033707    280.00              30.95                9.50             20.225
## 30   112196373304    225.00              15.37               24.46             19.915
## 426  192026567326    227.00              25.15               14.51             19.830
## 644  252636184342    227.00              14.39               22.96             18.675
## 1004 332022945664    252.50              18.21               18.95             18.580
## 924  311727907930    235.73              12.38               23.75             18.065
## 862  291939593391    231.00              18.89               17.07             17.980
## 270  162271501180    217.50              21.92               13.87             17.895
## 346  172409274771    270.00              20.38               15.27             17.825
## 26   112193803223    238.50              27.41                8.02             17.715
## 209  152310022841    235.00              15.23               19.75             17.490
## 966  322324143451    235.00              15.23               19.75             17.490
## 409  192016969618    217.50              13.77               20.81             17.290
## 764  272444905318    232.50              12.00               22.57             17.285
## 425  192026504076    228.50              16.11               18.32             17.215
## 548  222315086554    269.50              20.38               13.94             17.160
## 874  291943622670    207.50              16.67               17.53             17.100
## 1018 332030753584    230.00              19.65               14.40             17.025
## 802  282241152074    218.84               8.03               25.92             16.975
## 123  131985432900    223.95              20.80               12.85             16.825
## 633  252633322030    233.46              14.85               18.55             16.700
## 921  302138457059    225.00              12.80               19.52             16.160
## 450  201711348804    222.50              18.14               14.17             16.155
## 582  232143708528    213.50              18.50               13.65             16.075
## 456  201712090051    235.50              13.11               18.30             15.705
## 380  182349894389    216.50              11.93               19.26             15.595
## 36   112198598802    242.50              19.64               11.52             15.580
## 230  152317225162    291.50               4.48               26.49             15.485
## 152  142165435173    220.00              18.43               12.52             15.475
## 144  132003738004    229.95              17.97               12.97             15.470
## 322  172398950422    270.00              10.89               19.94             15.415
## 662  262692777542    235.50              22.23                8.57             15.400
## 388  182352778984    220.00              17.71               12.86             15.285
## 909  302133871944    221.49              24.36                5.87             15.115
## 836  291920583024    291.00              13.84               16.19             15.015
## 411  192017898096    303.82               7.80               22.14             14.970
## 789  282229130090    217.50              13.44               16.50             14.970
## 661  262692314038    314.99              21.09                8.56             14.825
## 832  282257251519    272.50              15.73               13.45             14.590
## 157  142171167630    227.50              21.50                7.41             14.455
## 75   122205467239    232.49              11.20               16.72             13.960
## 893  302125554374    221.00               7.52               17.89             12.705
## 895  302128278706    222.95              14.54               10.73             12.635
## 502  222299102749    245.50               9.62               15.55             12.585
## 798  282235472071    217.50              19.96                5.06             12.510
## 980  322328696274    209.50              15.14                9.66             12.400
## 110  122225590377    242.50              20.28                4.40             12.340
## 854  291935791162    212.50              17.09                6.78             11.935
## 98   122220360347    215.00               7.43               15.92             11.675
## 566  232135755969    237.50              13.02                9.91             11.465
## 434  201699978183    325.00              16.36                6.12             11.240
## 527  222309074964    214.25              21.38                0.64             11.010
## 622  252625542628    218.00               9.81               11.94             10.875
## 878  302113466514    319.99              35.99              -14.35             10.820
## 313  172394599965    227.50              11.90                9.44             10.670
## 851  291935564525    273.00               4.06               15.97             10.015
## 150  142163152670    265.00               6.89               12.89              9.890
## 845  291931494579    237.50              16.23                3.42              9.825
## 928  311733175781    237.49              10.27                9.27              9.770
## 491  222293662616    215.00              20.05               -0.71              9.670
## 756  272440439564    220.50              15.82                2.66              9.240
## 427  192027216941    226.50              10.28                8.08              9.180
## 864  291940019872    260.50              14.51                3.79              9.150
## 474  201716429471    227.50              11.28                6.84              9.060
## 772  272446548882    222.80              13.99                3.74              8.865
## 335  172405077698    227.50              10.57                5.47              8.020
## 95   122219792662    259.99               1.25               14.59              7.920
## 297  172363749332    329.99              33.55              -17.91              7.820
## 686  262705467137    233.00               4.45               10.20              7.325
## 16   112188956079    300.00               2.46               11.65              7.055
## 1050 361831172902    222.50               7.45                5.82              6.635
## 619  252624461310    224.45               5.10                6.72              5.910
## 677  262702200415    242.00              10.02                1.16              5.590
## 565  232133681950    227.50               6.06                4.70              5.380
## 822  282253145725    280.00               7.75                2.82              5.285
## 331  172403340389    285.00               1.34                8.35              4.845
## 490  222293359601    233.50               2.35                6.76              4.555
## 687  262706787054    281.01               5.62                3.14              4.380
## 165  142173392649    250.50               0.92                7.79              4.355
## 569  232139416739    222.50               6.08                2.29              4.185
## 23   112193065662    281.20              10.47               -3.23              3.620
## 739  272428720490    250.27               5.24                1.17              3.205
## 338  172405671622    252.50               2.42                2.67              2.545
## 461  201713151325    235.45              15.60              -11.89              1.855
## 788  282227331162    274.99               0.95                1.66              1.305
## 581  232143440302    277.45              -0.90                2.03              0.565
## 624  252628009848    242.50              -1.71                2.01              0.150
## 711  262718124164    246.15               1.78               -1.54              0.120
## 682  262704530361    315.00              -0.57                0.74              0.085
## 1063 371792850079    229.50              11.00              -11.04             -0.020
## 325  172400181709    236.00               3.86               -4.53             -0.335
## 208  152309968648    230.49               0.66               -3.67             -1.505
## 828  282255196560    217.95               3.77               -7.52             -1.875
## 441  201702886518    271.50              -6.28                1.79             -2.245
## 526  222308891649    216.50              -1.13               -3.47             -2.300
## 748  272432349648    247.50              -7.13                1.96             -2.585
## 730  272424230867    200.50              -1.71               -4.16             -2.935
## 360  182337962561    255.00               0.66               -6.63             -2.985
## 902  302131899961    425.00               4.27              -10.35             -3.040
## 97   122220353190    215.00              -1.48               -4.69             -3.085
## 304  172388409180    253.00               3.90              -10.08             -3.090
## 669  262696716080    237.50               1.78               -7.98             -3.100
## 610  252620544400    230.00               5.58              -12.14             -3.280
## 91   122218428289    297.89              -6.35               -0.63             -3.490
## 371  182342930215    254.50              -2.18               -5.76             -3.970
## 741  272428965475    224.50              -5.75               -5.61             -5.680
## 352  182331848292    208.50              -2.50               -9.54             -6.020
## 791  282230414981    214.49              -2.24              -10.53             -6.385
## 838  291921827173    219.88              -3.30              -10.21             -6.755
## 245  162251600220    257.50              -7.19               -7.94             -7.565
## 168  142174994299    245.00              -7.13               -8.60             -7.865
## 188  152299499061    272.50             -14.16               -1.79             -7.975
## 583  232144092063    245.50             -17.37               -0.35             -8.860
## 912  302134617027    320.00             -14.37               -3.49             -8.930
## 465  201714842411    237.50              -2.76              -15.36             -9.060
## 819  282252681085    237.50              -5.62              -13.38             -9.500
## 1027 351890510188    223.50              -8.29              -11.42             -9.855
## 1091 401215833398    244.95              -8.96              -11.69            -10.325
## 839  291921827212    208.00             -10.52              -11.00            -10.760
## 520  222306920822    246.50             -10.59              -11.15            -10.870
## 239  152320764046    405.00               1.84              -24.06            -11.110
## 736  272427386637    254.50              -4.72              -17.98            -11.350
## 378  182347685410    227.50              -2.17              -20.78            -11.475
## 759  272442581356    233.49              -6.28              -17.81            -12.045
## 280  162276436019    230.00             -13.84              -10.88            -12.360
## 470  201715654610    247.50             -10.99              -13.73            -12.360
## 308  172391489021    220.00              -1.06              -24.68            -12.870
## 207  152309483008    268.50             -19.97               -6.26            -13.115
## 944  322310998591    240.50              -4.32              -22.01            -13.165
## 137  131995254892    280.00              -8.51              -18.81            -13.660
## 291  162281064398    450.00             -23.99               -3.33            -13.660
## 896  302128597559    240.50             -18.88               -8.72            -13.800
## 221  152314083719    290.00              -9.82              -17.99            -13.905
## 917  302136559416    245.00             -11.40              -17.32            -14.360
## 225  152315332301    260.92             -15.03              -13.75            -14.390
## 518  222306817360    247.99             -14.01              -14.81            -14.410
## 587  252581045405    374.99               7.56              -36.62            -14.530
## 733  272426274590    256.00             -10.16              -19.75            -14.955
## 643  252635979869    255.00             -17.75              -13.84            -15.795
## 295  162286139145    236.00             -13.84              -20.33            -17.085
## 955  322319441088    267.49             -23.34              -12.40            -17.870
## 1092 401218433185    249.00             -22.01              -14.35            -18.180
## 771  272446408354    245.00             -12.79              -23.75            -18.270
## 493  222293748148    275.00             -18.38              -18.55            -18.465
## 703  262715369148    232.50             -16.78              -20.62            -18.700
## 586  252569958608    380.00               0.08              -37.57            -18.745
## 1026 351886767432    247.50             -15.26              -22.43            -18.845
## 766  272445373895    242.50             -14.56              -23.18            -18.870
## 640  252634992970    251.00              -5.75              -32.23            -18.990
## 994  332018233549    272.50             -26.06              -12.40            -19.230
## 990  332014209803    435.00             -11.87              -26.95            -19.410
## 601  252612570636    290.00             -22.50              -17.10            -19.800
## 522  222307344331    241.50             -12.83              -27.41            -20.120
## 271  162271591894    267.50             -21.38              -19.98            -20.680
## 709  262716844659    253.50             -21.97              -19.41            -20.690
## 852  291935651598    267.50             -19.04              -22.52            -20.780
## 962  322322120916    305.00             -18.19              -24.21            -21.200
## 139  131997839936    250.00             -19.92              -23.58            -21.750
## 451  201711351284    256.00             -17.44              -26.16            -21.800
## 950  322318322555    300.00             -23.70              -19.91            -21.805
## 382  182350283248    247.50             -16.06              -28.28            -22.170
## 249  162254944440    281.00             -18.74              -25.87            -22.305
## 534  222311156565    252.50             -19.44              -25.33            -22.385
## 654  252642788043    406.00             -14.16              -31.70            -22.930
## 182  152294104088    280.00             -19.36              -27.53            -23.445
## 779  272448309245    268.00             -20.16              -28.13            -24.145
## 884  302118338655    250.00             -18.99              -29.65            -24.320
## 104  122222872782    247.50             -24.78              -24.77            -24.775
## 47   112203293150    300.00             -27.10              -22.71            -24.905
## 713  262718194698    246.00             -25.55              -24.81            -25.180
## 933  311740138371    257.00             -23.82              -27.16            -25.490
## 468  201715040236    275.50             -30.30              -22.42            -26.360
## 343  172408138117    257.50             -25.31              -32.60            -28.955
## 769  272445522564    233.00             -14.12              -44.52            -29.320
## 19   112190060053    280.00             -32.17              -29.43            -30.800
## 934  322230786608    425.00             -23.04              -38.67            -30.855
## 135  131993371448    310.00             -34.57              -28.39            -31.480
## 1044 361805397511    288.99             -38.01              -28.90            -33.455
## 285  162278444111    265.50             -28.92              -38.13            -33.525
## 306  172390327293    258.00             -29.32              -40.43            -34.875
## 24   112193210879    271.00             -41.72              -29.85            -35.785
## 510  222304155110    295.00             -34.73              -37.25            -35.990
## 424  192025960273    266.00             -27.62              -48.98            -38.300
## 79   122207043259    257.50             -31.19              -45.60            -38.395
## 485  222287299617    309.00             -36.90              -41.44            -39.170
## 608  252616522689    255.00             -35.41              -43.29            -39.350
## 513  222305020925    414.79             -26.63              -52.65            -39.640
## 179  152291828993    290.00             -34.38              -47.06            -40.720
## 671  262697394178    247.50             -35.55              -45.93            -40.740
## 216  152313115344    285.00             -43.90              -39.51            -41.705
## 200  152305969425    262.50             -34.10              -50.08            -42.090
## 920  302137647405    334.00             -39.98              -45.82            -42.900
## 8    112181900794    276.76             -39.58              -48.77            -44.175
## 516  222306074990    290.99             -43.41              -45.74            -44.575
## 1072 381840957212    251.00             -43.44              -49.30            -46.370
## 665  262695437867    290.50             -47.96              -47.81            -47.885
## 780  272448503561    295.00             -47.02              -57.91            -52.465
## 540  222312257284    311.00             -50.13              -54.90            -52.515
## 176  142179693198    280.00             -56.20              -49.40            -52.800
## 697  262712990594    255.50             -40.69              -66.05            -53.370
## 459  201712517938    290.00             -60.50              -63.43            -61.965
## 673  262701096393    315.00             -68.00              -64.64            -66.320
## 1028 351891415447    285.50             -60.92              -72.49            -66.705
## 164  142172675260    292.00             -70.41              -69.89            -70.150
## 39   112199560370    325.00             -77.09              -89.96            -83.525
## 913  302134880670    330.00             -85.41              -86.97            -86.190
## 161  142172296000    385.00             -90.66              -82.47            -86.565
## 383  182350366307    375.00             -82.20              -91.82            -87.010
## 163  142172642383    365.00             -96.71              -81.02            -88.865
## 722  262722260878    405.00            -143.47             -141.96           -142.715
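
In a few rows of the table the two models disagree on the sign of the profit (items 302113466514 and 172363749332, for example, where Random Forest predicts a gain and SVM a loss). One way to act on the table would be to keep only items where both models clear some margin. The sketch below illustrates that idea on invented toy numbers; the column names, the `min_margin` threshold, and the filtering rule are assumptions, not something the original analysis does.

```r
# Toy predictions and prices, invented for illustration only.
items <- data.frame(
  itemId    = c("A", "B", "C", "D"),
  priceSold = c(150, 230, 320, 405),
  predRF    = c(260, 245, 356, 262),
  predSVM   = c(265, 236, 306, 263)
)

# Predicted profit = model's price estimate minus the observed price.
items$profitRF  <- items$predRF  - items$priceSold
items$profitSVM <- items$predSVM - items$priceSold
items$avgProfit <- (items$profitRF + items$profitSVM) / 2

# Hypothetical margin in dollars; the original analysis sets no threshold.
min_margin <- 20

# Keep rows where both models independently clear the margin,
# then sort the survivors by average predicted profit.
candidates <- subset(items, profitRF >= min_margin & profitSVM >= min_margin)
candidates[order(candidates$avgProfit, decreasing = TRUE), ]
```

Requiring agreement trades recall for precision: item C, where only one model sees a profit, is dropped even though its average is positive.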

Overview

Typically you’d want to minimize error, but in this case I’m trying, in a sense, to exploit the error of a machine learning algorithm. I want my algorithm to be “smarter” than the data: the cases where it is “incorrect” in a certain direction (really, we’re saying the algorithm is correct and the market is not) are the cases where I expect to see profit. This was initially difficult for me to wrap my head around when looking at the outputs from the various algorithms.
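
The point about direction can be made concrete with a small sketch. RMSE squares the residuals, so it cannot distinguish a model that consistently prices items above the market from one that consistently prices them below, while the signed residual can. The numbers and names below are invented for illustration only.

```r
# Toy numbers, invented for illustration only.
sold   <- c(200, 200, 200, 200)
pred_a <- sold + 30   # model prices every item above what it sold for
pred_b <- sold - 30   # model prices every item below what it sold for

rmse <- function(predicted, actual) sqrt(mean((predicted - actual)^2))

rmse(pred_a, sold)  # 30
rmse(pred_b, sold)  # 30, identical to pred_a

# The signed residual (predicted minus sold) separates the two cases.
mean(pred_a - sold)  # 30: items the model considers underpriced
mean(pred_b - sold)  # -30: items the model considers overpriced
```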