Here we read in data exported from the MongoDB database running on the AWS server. outcomes contains only the item ID, category ID/name, and final selling price. forSale contains every feature I thought might be of use to a machine learning algorithm, including things like the item’s title string and its location.
outcomes = read.csv("completedItems_2.csv",header=TRUE)
forSale = read.csv("forSale_2.csv",header=TRUE)
head(outcomes)
## X_id categoryId categoryName sellingPrice
## 1 112185507879 139971 Video Game Consoles 349.99
## 2 122202585486 139971 Video Game Consoles 305.00
## 3 291926308477 139971 Video Game Consoles 200.98
## 4 262695233466 139971 Video Game Consoles 280.00
## 5 142163675662 139971 Video Game Consoles 220.00
## 6 262695805629 139971 Video Game Consoles 250.00
head(forSale)
## X_id
## 1 172388678529
## 2 122194533957
## 3 282229141811
## 4 282233993583
## 5 162252708291
## 6 131981426207
## title
## 1 NEW PLAYSTATION 4 LIMITED EDITION TACO BELL GOLD CONSOLE PS4
## 2 SONY PLAYSTATION 4 (CUH-1215A) 500GB DESTINY THE TAKEN KING LIMITED EDITION
## 3 SONY PLAYSTATION 4 SYSTEM 500 GIG HD
## 4 SONY PLAYSTATION 4 500 GB JET BLACK CONSOLE
## 5 UNCHARTED 4: A THIEF'S END (SONY PLAYSTATION 4 SLIM, 2016)
## 6 SONY PLAYSTATION 4
## currentPrice shippingCost calculateShipping totalPrice
## 1 650.00 25.00 false 675.00
## 2 250.00 10.00 false 260.00
## 3 239.99 19.99 false 259.98
## 4 200.00 30.00 false 230.00
## 5 237.50 28.00 false 265.50
## 6 132.50 14.00 false 146.50
## dateQueried endDate location
## 1 2016-10-29T07:30:02.399Z 2016-10-30T02:35:15.000Z Las Vegas,NV,USA
## 2 2016-10-29T07:30:02.423Z 2016-10-29T15:18:15.000Z Montgomery,AL,USA
## 3 2016-10-29T07:30:03.615Z 2016-10-29T17:05:05.000Z Ballwin,MO,USA
## 4 2016-10-29T07:30:03.620Z 2016-10-29T18:57:11.000Z Trinidad,CO,USA
## 5 2016-10-29T07:30:03.623Z 2016-10-29T16:49:40.000Z West Palm Beach,FL,USA
## 6 2016-10-29T07:30:03.629Z 2016-10-29T21:54:15.000Z Watertown,CT,USA
## country bidCount listingType bestOffer buyItNowAvailable conditionId
## 1 US NaN FixedPrice false false 1000
## 2 US 1 Auction false false 3000
## 3 US 0 Auction false false 3000
## 4 US 0 Auction false false 3000
## 5 US 41 Auction false false 1500
## 6 US 20 Auction false false 3000
## conditionDisplayName
## 1 New
## 2 Used
## 3 Used
## 4 Used
## 5 New other (see details)
## 6 Used
## timeLeft
## 1 {"milliseconds":0,"seconds":32,"minutes":3,"hours":19,"days":0,"months":0,"years":0}
## 2 {"milliseconds":0,"seconds":32,"minutes":46,"hours":7,"days":0,"months":0,"years":0}
## 3 {"milliseconds":0,"seconds":20,"minutes":33,"hours":9,"days":0,"months":0,"years":0}
## 4 {"milliseconds":0,"seconds":26,"minutes":25,"hours":11,"days":0,"months":0,"years":0}
## 5 {"milliseconds":0,"seconds":55,"minutes":17,"hours":9,"days":0,"months":0,"years":0}
## 6 {"milliseconds":0,"seconds":30,"minutes":22,"hours":14,"days":0,"months":0,"years":0}
Shown above is an overview of every feature collected from the eBay API in the forSale data frame. A few of the less obvious ones are worth noting for the purposes of this presentation.
calculateShipping is a boolean that tells me whether the eBay API returned the shipping price for an item, or whether the item needs a separate API call to get its shipping price. Items marked true will be imputed with the average of all known shipping prices.
dateQueried is the date my algorithm found the item on eBay.
endDate is when the item was no longer for sale. (I will turn these into categorical variables to see whether day of the week matters for price prediction.)
bidCount is how many bids had been placed on the item when my algorithm found it on eBay.
listingType describes whether the item is a FixedPrice item or an item for Auction.
conditionId is a code eBay uses to describe an item’s condition: 1000 is new, 3000 is used, and 1500 is new with conditions attached to the item (“New other” in the display names).
timeLeft specifies how much longer the item will be for sale; it arrives as a JSON string (see the parsing sketch below).
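Since timeLeft arrives as a JSON string, it has to be parsed before it can serve as a numeric feature. The original code never converts this field; a minimal sketch, assuming the jsonlite package is available:

library(jsonlite)
# Convert one timeLeft JSON string into total hours remaining
# (months/years are ignored since they are always 0 in this data)
timeLeftHours = function(x) {
  tl = fromJSON(as.character(x))
  tl$days * 24 + tl$hours + tl$minutes / 60 + tl$seconds / 3600
}
timeLeftHours('{"milliseconds":0,"seconds":32,"minutes":3,"hours":19,"days":0,"months":0,"years":0}')
## [1] 19.05889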
I will perform the imputation of shipping costs now. First we set shippingCost to NA for any item whose shipping must be calculated separately, then impute those NA’s with the average shipping cost of the non-NA items; bidCount gets the same treatment.
library(Hmisc)  # provides impute()
forSale$shippingCost[forSale$calculateShipping == "true"] = NA
forSale$shippingCost = with(forSale, impute(shippingCost, mean))
forSale$bidCount = with(forSale, impute(bidCount, mean))
forSale$bidCount = as.numeric(forSale$bidCount)
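For reference, Hmisc’s impute(x, mean) simply replaces each NA with the mean of the observed values; a base-R equivalent (a sketch, not used above) would be:

# Replace NAs in a numeric vector with the mean of the non-missing values
meanImpute = function(x) {
  x[is.na(x)] = mean(x, na.rm = TRUE)
  x
}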
Next we recalculate totalPrice from the imputed shipping costs and remove sold items that aren’t video game consoles.
forSale$totalPrice = forSale$currentPrice + forSale$shippingCost
forSale$totalPrice = as.numeric(forSale$totalPrice)
outcomes = outcomes[outcomes$categoryName == "Video Game Consoles",]
Here I’m converting the endDate strings to actual R dates and extracting the day of the week.
dates = as.Date(forSale$endDate)
forSale$dayOfWeek = as.factor(weekdays(dates))
We can use R’s built-in boxplot statistics to detect and remove outliers from the dataset, and also drop anything at or below $100, which is implausibly cheap for a console.
outliers = boxplot.stats(forSale$totalPrice)$out
forSale = forSale[!forSale$totalPrice %in% outliers,]
forSale = forSale[(forSale$totalPrice > 100),]
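For reference, boxplot.stats flags points more than 1.5 times the interquartile range beyond the hinges (which nearly coincide with the quartiles). Written out by hand, the same rule looks roughly like this (an equivalent sketch, not used above):

# The 1.5 * IQR rule that boxplot.stats applies, written explicitly
q = quantile(forSale$totalPrice, c(0.25, 0.75), na.rm = TRUE)
fence = 1.5 * diff(q)
manualOutliers = forSale$totalPrice[forSale$totalPrice < q[1] - fence |
                                    forSale$totalPrice > q[2] + fence]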
Here we summarize the data to see if anything stands out as interesting for further analysis or visualization.
Surprisingly, a large number of listings share the exact titles “SONY PLAYSTATION 4 BASIC SET 500 GB BLACK CONSOLE” and “SONY PLAYSTATION 4 (LATEST MODEL)- 500 GB BLACK CONSOLE”. This may be because many people are buying and relisting items under the same title, similar to my software. We also print out the standard deviation for all Playstation 4’s in order to get a better idea of how variable the prices are.
summary(forSale)
##
## 739 values imputed to 10.05239
## X_id
## Min. :1.120e+11
## 1st Qu.:1.623e+11
## Median :2.223e+11
## Mean :2.271e+11
## 3rd Qu.:2.822e+11
## Max. :4.012e+11
##
## title
## SONY PLAYSTATION 4 BASIC SET 500 GB BLACK CONSOLE : 212
## SONY PLAYSTATION 4 (LATEST MODEL)- 500 GB BLACK CONSOLE : 67
## SONY PLAYSTATION 4 500 GB BLACK CONSOLE : 58
## SONY PLAYSTATION 4 CALL OF DUTY: BLACK OPS III - STANDARD EDITION 500 GB JET... : 42
## PLAYSTATION 4 : 24
## SONY PLAYSTATION 4 CALL OF DUTY: BLACK OPS III - STANDARD EDITION 500 GB JET BL…: 20
## (Other) :1216
## currentPrice shippingCost calculateShipping totalPrice
## Min. : 80.0 Min. : 0.00 false:900 Min. :102.0
## 1st Qu.:177.5 1st Qu.: 0.00 true :739 1st Qu.:187.5
## Median :202.5 Median :10.05 Median :215.5
## Mean :225.8 Mean :10.21 Mean :236.0
## 3rd Qu.:255.0 3rd Qu.:12.95 3rd Qu.:269.0
## Max. :469.9 Max. :50.00 Max. :469.9
##
## dateQueried endDate
## :739 2016-11-02T17:06:29.000Z: 2
## 2016-10-29T07:30:02.423Z: 1 2016-11-04T01:00:08.000Z: 2
## 2016-10-29T07:30:03.615Z: 1 2016-11-12T03:00:07.000Z: 2
## 2016-10-29T07:30:03.620Z: 1 2016-11-14T16:27:11.000Z: 2
## 2016-10-29T07:30:03.623Z: 1 2016-11-15T15:30:23.000Z: 2
## 2016-10-29T07:30:03.629Z: 1 2016-11-22T00:59:57.000Z: 2
## (Other) :895 (Other) :1627
## location country bidCount
## USA : 38 US :1585 Min. : 0.00
## Canada : 28 CA : 28 1st Qu.: 0.00
## Columbus,OH,USA : 23 GB : 5 Median : 8.00
## Miami,FL,USA : 21 KR : 5 Mean :11.36
## Los Angeles,CA,USA: 20 JP : 4 3rd Qu.:12.00
## New York,NY,USA : 18 RU : 3 Max. :82.00
## (Other) :1491 (Other): 9
## listingType bestOffer buyItNowAvailable conditionId
## Auction :1159 false:1512 false:1480 Min. :1000
## AuctionWithBIN: 159 true : 127 true : 159 1st Qu.:3000
## FixedPrice : 204 Median :3000
## StoreInventory: 117 Mean :2629
## 3rd Qu.:3000
## Max. :3000
##
## conditionDisplayName
## Used :1280
## New : 219
## New other (see details) : 93
## Seller refurbished : 34
## Manufacturer refurbished: 13
## Brand New : 0
## (Other) : 0
## timeLeft
## {"milliseconds":0,"seconds":14,"minutes":28,"hours":18,"days":0,"months":0,"years":0}: 3
## {"milliseconds":0,"seconds":0,"minutes":43,"hours":18,"days":0,"months":0,"years":0} : 2
## {"milliseconds":0,"seconds":10,"minutes":4,"hours":16,"days":0,"months":0,"years":0} : 2
## {"milliseconds":0,"seconds":13,"minutes":28,"hours":18,"days":0,"months":0,"years":0}: 2
## {"milliseconds":0,"seconds":16,"minutes":33,"hours":20,"days":0,"months":0,"years":0}: 2
## {"milliseconds":0,"seconds":22,"minutes":28,"hours":10,"days":0,"months":0,"years":0}: 2
## (Other) :1626
## dayOfWeek
## Friday :225
## Monday :300
## Saturday :223
## Sunday :266
## Thursday :203
## Tuesday :248
## Wednesday:174
sd(forSale$totalPrice)
## [1] 68.25486
Does a higher bidCount the day before the end date result in a higher price?
Does buyItNow affect price?
Does the text in the item title affect price? (appears to be significant after totalPrice)
Does day of the week affect price?
There isn’t much more to say about this data, as it’s the outcome we’re trying to predict. We do see that the average selling price of all Playstation 4’s is about 265 USD.
The SDs of the two datasets are similar (68.25 vs. 70.30), which gives me some comfort in the data gathering methods.
summary(outcomes)
## X_id categoryId categoryName
## Min. :1.121e+11 Min. :139971 Video Game Consoles :7797
## 1st Qu.:1.623e+11 1st Qu.:139971 Cell Phones & Smartphones : 0
## Median :2.321e+11 Median :139971 Consoles de jeux vidéo : 0
## Mean :2.297e+11 Mean :139971 Console Systems : 0
## 3rd Qu.:2.822e+11 3rd Qu.:139971 DJ Controllers : 0
## Max. :4.012e+11 Max. :139971 Faceplates, Decals & Stickers: 0
## (Other) : 0
## sellingPrice
## Min. :120.0
## 1st Qu.:224.9
## Median :247.5
## Mean :265.0
## 3rd Qu.:285.0
## Max. :592.0
##
sd(outcomes$sellingPrice)
## [1] 70.30095
I’d first like to see the histogram of these items to get a sense of just how variable their prices are.
library(ggplot2)
ggplot(outcomes, aes(sellingPrice, fill = categoryName)) +
geom_histogram(binwidth = 5)
### Histogram of Price Counts for Playstation 4s currently for sale
ggplot(forSale, aes(totalPrice, fill = listingType)) +
geom_histogram(binwidth = 5)
These data both appear to be positively skewed.
The map plot shows that a high number of Playstation 4’s are sold in the Northeast of the US and around the Los Angeles area. This is probably simply due to the higher populations in these areas.
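The locations data frame used below holds a longitude/latitude pair for each listing. The geocoding step isn’t shown in this write-up; with ggmap it would look roughly like this (an assumed reconstruction, not the original code):

library(ggmap)
# Geocode each listing's location string to lon/lat coordinates
# (assumption: this is how the 'locations' data frame was built)
locations = geocode(as.character(forSale$location))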
library(ggmap)
map1 = ggmap(get_map(location = "United States",zoom=4))+geom_point(data=locations, aes(x=lon,y=lat),color="orange")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=United+States&zoom=4&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=United%20States&sensor=false
map1
## Warning: Removed 16 rows containing missing values (geom_point).
Now we merge the outcomes into the forSale dataset so we can train, cross-validate, and test machine learning algorithms against it.
fullDataset <- merge(forSale,outcomes,by="X_id")
summary(fullDataset)
##
## 516 values imputed to 10.05239
## X_id
## Min. :1.122e+11
## 1st Qu.:1.623e+11
## Median :2.223e+11
## Mean :2.292e+11
## 3rd Qu.:2.823e+11
## Max. :4.012e+11
##
## title
## SONY PLAYSTATION 4 BASIC SET 500 GB BLACK CONSOLE :172
## SONY PLAYSTATION 4 (LATEST MODEL)- 500 GB BLACK CONSOLE : 44
## SONY PLAYSTATION 4 500 GB BLACK CONSOLE : 38
## SONY PLAYSTATION 4 CALL OF DUTY: BLACK OPS III - STANDARD EDITION 500 GB JET... : 31
## SONY PLAYSTATION 4 CALL OF DUTY: BLACK OPS III - STANDARD EDITION 500 GB JET BL…: 17
## PLAYSTATION 4 : 16
## (Other) :779
## currentPrice shippingCost calculateShipping totalPrice
## Min. : 80.0 Min. : 0.00 false:581 Min. :102.0
## 1st Qu.:170.0 1st Qu.:10.05 true :516 1st Qu.:181.1
## Median :192.0 Median :10.05 Median :201.6
## Mean :202.9 Mean :10.84 Mean :213.7
## 3rd Qu.:219.5 3rd Qu.:14.99 3rd Qu.:230.0
## Max. :450.0 Max. :50.00 Max. :450.0
##
## dateQueried endDate
## :516 2016-11-04T01:00:08.000Z: 2
## 2016-10-29T07:30:02.423Z: 1 2016-11-14T16:27:11.000Z: 2
## 2016-10-29T07:30:03.620Z: 1 2016-11-15T15:30:23.000Z: 2
## 2016-10-29T07:30:03.623Z: 1 2016-11-22T00:59:57.000Z: 2
## 2016-10-29T07:30:03.629Z: 1 2016-10-29T15:18:15.000Z: 1
## 2016-10-29T07:30:04.831Z: 1 2016-10-29T16:47:03.000Z: 1
## (Other) :576 (Other) :1087
## location country bidCount
## USA : 23 US :1071 Min. : 0.00
## Canada : 17 CA : 17 1st Qu.: 1.00
## Brooklyn,NY,USA : 14 PR : 2 Median : 5.00
## Boise,ID,USA : 13 AU : 1 Mean :12.89
## Columbus,OH,USA : 13 EE : 1 3rd Qu.:20.00
## Los Angeles,CA,USA: 12 GB : 1 Max. :82.00
## (Other) :1005 (Other): 4
## listingType bestOffer buyItNowAvailable conditionId
## Auction :984 false:1080 false:1020 Min. :1000
## AuctionWithBIN: 77 true : 17 true : 77 1st Qu.:3000
## FixedPrice : 22 Median :3000
## StoreInventory: 14 Mean :2714
## 3rd Qu.:3000
## Max. :3000
##
## conditionDisplayName
## Used :912
## New :109
## New other (see details) : 55
## Seller refurbished : 15
## Manufacturer refurbished: 6
## Brand New : 0
## (Other) : 0
## timeLeft
## {"milliseconds":0,"seconds":14,"minutes":28,"hours":18,"days":0,"months":0,"years":0}: 3
## {"milliseconds":0,"seconds":13,"minutes":28,"hours":18,"days":0,"months":0,"years":0}: 2
## {"milliseconds":0,"seconds":16,"minutes":33,"hours":20,"days":0,"months":0,"years":0}: 2
## {"milliseconds":0,"seconds":22,"minutes":43,"hours":16,"days":0,"months":0,"years":0}: 2
## {"milliseconds":0,"seconds":29,"minutes":28,"hours":18,"days":0,"months":0,"years":0}: 2
## {"milliseconds":0,"seconds":35,"minutes":28,"hours":18,"days":0,"months":0,"years":0}: 2
## (Other) :1084
## dayOfWeek categoryId categoryName
## Friday :157 Min. :139971 Video Game Consoles :1097
## Monday :219 1st Qu.:139971 Cell Phones & Smartphones : 0
## Saturday :151 Median :139971 Consoles de jeux vidéo : 0
## Sunday :190 Mean :139971 Console Systems : 0
## Thursday :140 3rd Qu.:139971 DJ Controllers : 0
## Tuesday :139 Max. :139971 Faceplates, Decals & Stickers: 0
## Wednesday:101 (Other) : 0
## sellingPrice
## Min. :127.5
## 1st Qu.:223.5
## Median :243.5
## Mean :255.9
## 3rd Qu.:275.0
## Max. :545.0
##
Is the day of the week statistically significant? Here we compute the average price for each day of the week and plot the per-day price distributions in a boxplot.
averagePricePerDay = with(fullDataset, tapply(sellingPrice, dayOfWeek,mean))
plotDF =aggregate(fullDataset$sellingPrice ~ fullDataset$dayOfWeek, FUN = mean)
plotDF
## fullDataset$dayOfWeek fullDataset$sellingPrice
## 1 Friday 251.0188
## 2 Monday 255.1098
## 3 Saturday 253.4496
## 4 Sunday 261.3309
## 5 Thursday 246.7069
## 6 Tuesday 266.2805
## 7 Wednesday 257.3285
ggplot(fullDataset, aes(y = sellingPrice, x = factor(dayOfWeek))) +scale_x_discrete("Day Of The Week") +
scale_y_continuous("Final Selling Price") + geom_boxplot(outlier.color="red")
But is this statistically significant? A pairwise t-test across the days of the week finds only the Tuesday vs. Thursday contrast below 0.05 (Holm-adjusted p = 0.043), and the one-way ANOVA is only marginally significant (p ≈ 0.046), so the evidence for a day-of-week effect is weak. This may be because my dataset is still relatively small.
pairwise.t.test(fullDataset$sellingPrice,
fullDataset$dayOfWeek, alternative ="two.sided")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: fullDataset$sellingPrice and fullDataset$dayOfWeek
##
## Friday Monday Saturday Sunday Thursday Tuesday
## Monday 1.000 - - - - -
## Saturday 1.000 1.000 - - - -
## Sunday 1.000 1.000 1.000 - - -
## Thursday 1.000 1.000 1.000 0.263 - -
## Tuesday 0.263 0.878 0.705 1.000 0.043 -
## Wednesday 1.000 1.000 1.000 1.000 1.000 1.000
##
## P value adjustment method: holm
oneway.test(sellingPrice ~ dayOfWeek, data = fullDataset)
##
## One-way analysis of means (not assuming equal variances)
##
## data: sellingPrice and dayOfWeek
## F = 2.1621, num df = 6.00, denom df = 450.09, p-value = 0.04555
It doesn’t look as though the number of bids affects price much. We do see a slightly positive correlation for new items with conditions attached to them, but again, this could be due to lack of data.
p1 <- ggplot(fullDataset, aes(x = bidCount, y = sellingPrice, color = factor(conditionDisplayName))) + geom_point() + geom_smooth(method = "lm", se = TRUE)
p1
lmOut = lm(sellingPrice ~ bidCount, data = fullDataset)
summary(lmOut)
##
## Call:
## lm(formula = sellingPrice ~ bidCount, data = fullDataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -127.90 -32.58 -12.40 18.98 289.60
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 255.07652 2.02825 125.762 <2e-16 ***
## bidCount 0.06546 0.09648 0.679 0.498
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 53.06 on 1095 degrees of freedom
## Multiple R-squared: 0.0004202, Adjusted R-squared: -0.0004926
## F-statistic: 0.4604 on 1 and 1095 DF, p-value: 0.4976
We also see that the p-value on the bidCount coefficient in the linear regression (0.498) is far too high for statistical significance.
Does the ability to “buy it now” affect price in any way?
averagePricePerBuyItNow = with(fullDataset, tapply(sellingPrice, buyItNowAvailable,mean))
plotDF = aggregate(fullDataset$sellingPrice ~ fullDataset$buyItNowAvailable, FUN = mean)
plotDF
## fullDataset$buyItNowAvailable fullDataset$sellingPrice
## 1 false 254.3955
## 2 true 276.1234
ggplot(fullDataset, aes(y = sellingPrice, x = factor(buyItNowAvailable))) +scale_x_discrete("Buy It Now Available") +
scale_y_continuous("Final Selling Price") + geom_boxplot(outlier.color="red")
But is this statistically significant? A pairwise t-test on the buyItNowAvailable variable as well as a one-way ANOVA suggest that it may be (p = 0.00051 and p = 0.008, respectively). Again, though, this could be an artifact of my still relatively small dataset.
pairwise.t.test(fullDataset$sellingPrice,
fullDataset$buyItNowAvailable, alternative ="two.sided")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: fullDataset$sellingPrice and fullDataset$buyItNowAvailable
##
## false
## true 0.00051
##
## P value adjustment method: holm
oneway.test(sellingPrice ~ buyItNowAvailable, data = fullDataset)
##
## One-way analysis of means (not assuming equal variances)
##
## data: sellingPrice and buyItNowAvailable
## F = 7.388, num df = 1.000, denom df = 82.544, p-value = 0.008
Here we analyze the titles used in listings. This is done by building a document-term matrix over the titles, dropping very sparse terms, and looking to see whether any of the remaining terms stand out in relation to price. Another question could be: does using a title similar to other listings lead to a higher selling price?
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
library(wordcloud)
## Loading required package: RColorBrewer
text = Corpus(VectorSource(fullDataset$title))
text = tm_map(text,removePunctuation)
text = tm_map(text,content_transformer(tolower))
text = tm_map(text,removeWords,stopwords("english"))
# for(i in seq(text)){
# text[[i]] = gsub("playstation4","playstation4",text[[i]])
# text[[i]] = gsub("playstation 4","playstation4",text[[i]])
# text[[i]] = gsub("playstation","playstation4",text[[i]])
# }
# text <- tm_map(text, stemDocument)
text <- tm_map(text, stripWhitespace)
# text <- tm_map(text,PlainTextDocument)
dtm = DocumentTermMatrix(text)
freq = colSums(as.matrix(dtm))
# length(freq)
ord = order(freq)
# freq[ord]
dtms = removeSparseTerms(dtm,.98)
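Note that the document-term matrix above holds raw term counts, and removeSparseTerms simply drops terms appearing in fewer than roughly 2% of titles. True tf-idf weighting, which down-weights terms common to most titles, is available in tm via the weighting control (a sketch of that variant, not what was run above):

dtmTfIdf = DocumentTermMatrix(text, control = list(weighting = weightTfIdf))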
Here’s a nice visualization of the titles, for use in presentations as well as the poster.
set.seed(142)
wordcloud(names(freq),freq,max.words=100,rot.per=.2,colors=brewer.pal(6,"Dark2"))
Here we add the split-up title term matrix, as a data frame, to the full dataset as features for prediction.
dtmsDF = as.data.frame(as.matrix(dtms))
dtmsDF$X_id = fullDataset$X_id
fullDataset = merge(fullDataset,dtmsDF, by = "X_id")
fullDataset$title = NULL
names(fullDataset) = make.names(names(fullDataset))
Here I train multiple machine learning algorithms in order to compare them against one another using RMSE on the test set.
First I split the data into a training and a test set, along with a few small preprocessing steps.
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:Hmisc':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
library(rpart)
library(mlbench)
set.seed(415)
smp_size <- floor(0.75 * nrow(fullDataset))
train_indices = sample(seq_len(nrow(fullDataset)),size=smp_size)
trainingData = fullDataset[train_indices,]
testData = fullDataset[-train_indices,]
make.names(levels(trainingData))  # returns character(0): a data frame has no levels
## character(0)
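As an aside, caret can produce the same kind of split in one call, stratified on the outcome; an alternative sketch (the manual sample above is what was actually used):

# Stratified 75/25 split on sellingPrice (alternative to the manual sample)
idx = createDataPartition(fullDataset$sellingPrice, p = 0.75, list = FALSE)
# trainingData = fullDataset[idx,]
# testData = fullDataset[-idx,]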
Here I’m training two Random Forest models using the caret package, which allows me to do k-fold repeated cross-validation as well as tune parameters across the training set (see the tunegrid object).
For the second model I use out-of-bag error as the tuning metric; a reconstruction of the fitting code follows below.
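The random forest fitting code itself didn’t make it into this write-up; here is a sketch consistent with the model summary printed in the prediction section (10-fold cross-validation repeated 3 times, mtry tuned from 1 to 15, and the same seven predictors used by the other models — all of these details are assumptions):

rfGrid = expand.grid(.mtry = 1:15)
modelRF = train(sellingPrice ~ currentPrice + totalPrice + bidCount +
                listingType + conditionId + dayOfWeek + edition,
                data = trainingData, method = "rf", metric = "RMSE",
                trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3),
                tuneGrid = rfGrid)
# Second model, tuned on out-of-bag error instead of repeated CV
modelRF_2 = train(sellingPrice ~ currentPrice + totalPrice + bidCount +
                  listingType + conditionId + dayOfWeek + edition,
                  data = trainingData, method = "rf", metric = "RMSE",
                  trControl = trainControl(method = "oob"), tuneGrid = rfGrid)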
Here I’m training a neural net, tuning over the size parameter as well as decay.
library(nnet)
# linout = TRUE fits a linear output unit (regression rather than classification)
modelNN = train(trainingData[, c(2,5,10,11,14,15,17,36)], trainingData$sellingPrice,
                method = 'nnet', linout = TRUE, trace = FALSE, maxit = 1000,
                # Grid of tuning parameters to try:
                tuneGrid = expand.grid(.size = c(1,5,10), .decay = c(0,0.001,0.1)))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
Here I’m training a GLM and an ordinary linear regression model.
# Note: referencing trainingData$ columns inside the formula (rather than
# passing data = trainingData) means predict() will ignore newdata later;
# see the warnings in the prediction section below.
modelGLM = glm(trainingData$sellingPrice ~ trainingData$currentPrice +
    trainingData$totalPrice +
    trainingData$bidCount +
    trainingData$listingType +
    trainingData$conditionId)
modelLM = lm(trainingData$sellingPrice ~ trainingData$currentPrice +
trainingData$totalPrice +
trainingData$bidCount +
trainingData$listingType +
trainingData$conditionId +
trainingData$dayOfWeek +
trainingData$edition)
Here I’m training an elastic-net regularized GLM (glmnet), with repeated cross-validation and a tuning grid over lambda and alpha.
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-5
library(Matrix)
GLMNetctrl = trainControl(method="repeatedcv", number=10, repeats= 3,search="grid")
tunegrid = expand.grid(.alpha=seq(0,1,.1),.lambda=c(0:15))
modelGLMNet = train(sellingPrice ~
currentPrice +
totalPrice +
bidCount +
listingType +
conditionId +
dayOfWeek +
edition,data=trainingData, method="glmnet",trControl=GLMNetctrl, tuneGrid=tunegrid)
Here I train three different SVM implementations, each with repeated cross-validation; the linear- and radial-kernel fits are shown below, and a third variant (modelSVM3) is evaluated in the prediction section.
library(kernlab)
SVMctrl = trainControl(method="repeatedcv",number=10,repeats=3)
modelSVM = train(sellingPrice ~
currentPrice +
totalPrice +
bidCount +
listingType +
conditionId +
dayOfWeek +
edition,data=trainingData, method="svmLinear",trControl=SVMctrl)
library(kernlab)
SVMctrlR = trainControl(method="repeatedcv",number=10,repeats=3)
modelSVMR = train(sellingPrice ~
currentPrice +
totalPrice +
bidCount +
listingType +
conditionId +
dayOfWeek +
edition,data=trainingData, method="svmRadial",trControl=SVMctrlR)
We now have price predictions! Items the algorithm predicts should have sold for more than they actually did are items I will target for purchasing; items predicted at or below a profit threshold will be ignored.
The predictions are made, and RMSE against the test set is computed in order to select the most accurate machine learning algorithms.
options(width = 1000)
##RANDOM FOREST PREDICTION
modelRF
## Random Forest
##
## 822 samples
## 7 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 739, 739, 740, 739, 741, 741, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared
## 1 39.27841 0.6280630
## 2 32.77796 0.6532402
## 3 31.89493 0.6551324
## 4 31.87098 0.6543588
## 5 32.03830 0.6511541
## 6 32.22236 0.6478545
## 7 32.35928 0.6452307
## 8 32.44278 0.6435712
## 9 32.49475 0.6428315
## 10 32.69318 0.6391877
## 11 32.75517 0.6377128
## 12 32.87788 0.6356592
## 13 32.96935 0.6339659
## 14 33.05244 0.6324629
## 15 33.05315 0.6325832
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 4.
predictRF <- predict(modelRF, newdata = testData)
predictRF_2 <- predict(modelRF_2, newdata = testData)
# varImp(modelRF_2$finalModel)
# varImp(modelRF$finalModel)
#
# varImpPlot(modelRF_2$finalModel)
# varImpPlot(modelRF$finalModel)
Here are the RMSEs for each algorithm.
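Each block below computes root mean squared error the same way; for readability it could be wrapped in a small helper (a convenience sketch, not used in the original code):

# RMSE between a prediction vector and the observed outcomes
rmse = function(pred, actual) sqrt(mean((pred - actual)^2))
# e.g. RFrmse = rmse(predictRF, testData$sellingPrice)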
RFrmse <- sqrt(mean((predictRF - testData$sellingPrice)^2))
RFrmse
## [1] 30.91988
RFrmse_2 <- sqrt(mean((predictRF_2 - testData$sellingPrice)^2))
RFrmse_2
## [1] 31.45724
##GLM PREDICTION
# summary(modelGLM)
GLMpredict = predict(modelGLM,newdata = testData)
## Warning: 'newdata' had 275 rows but variables found have 822 rows
GLMrmse <- sqrt(mean((GLMpredict - testData$sellingPrice)^2))
## Warning in GLMpredict - testData$sellingPrice: longer object length is not a multiple of shorter object length
GLMrmse
## [1] 67.67035
##LM PREDICTION
# summary(modelLM)
LMpredict = predict(modelLM,newdata = testData)
## Warning: 'newdata' had 275 rows but variables found have 822 rows
LMrmse <- sqrt(mean((LMpredict - testData$sellingPrice)^2))
## Warning in LMpredict - testData$sellingPrice: longer object length is not a multiple of shorter object length
LMrmse
## [1] 68.16249
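The ‘newdata’ warnings above deserve a note: because modelGLM and modelLM were fit with trainingData$ prefixes inside their formulas, predict() ignores newdata and returns the 822 training-set fitted values, which then get recycled against the 275 test outcomes. These two RMSE numbers are therefore not true test-set errors. Fitting with a data argument avoids the problem (a corrected sketch; modelLM2 is a hypothetical name):

# Same model, fit so that predict() respects newdata
modelLM2 = lm(sellingPrice ~ currentPrice + totalPrice + bidCount +
              listingType + conditionId + dayOfWeek + edition,
              data = trainingData)
LMpredict2 = predict(modelLM2, newdata = testData)  # evaluates on the 275 test rows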
##GLMNet PREDICTION
GLMNetpredict <- predict(modelGLMNet, newdata = testData)
GLMNetrmse <- sqrt(mean((GLMNetpredict - testData$sellingPrice)^2))
GLMNetrmse
## [1] 31.11125
##NN PREDICTION
NNpredict <- predict(modelNN, newdata = testData)
NNrmse <- sqrt(mean((NNpredict - testData$sellingPrice)^2))
NNrmse
## [1] 32.36508
##SVM PREDICTION
SVMpredict <- predict(modelSVM, newdata = testData)
SVMrmse <- sqrt(mean((SVMpredict - testData$sellingPrice)^2))
SVMrmse
## [1] 30.26452
##SVM3 PREDICTION
SVMpredict3 <- predict(modelSVM3, newdata = testData)
SVMrmse3 <- sqrt(mean((SVMpredict3 - testData$sellingPrice)^2))
SVMrmse3
## [1] 37.36498
##SVMRADIAL PREDICTION
SVMpredictR <- predict(modelSVMR, newdata = testData)
SVMrmseR <- sqrt(mean((SVMpredictR - testData$sellingPrice)^2))
SVMrmseR
## [1] 34.49452
Here we observe that the first Random Forest model as well as the first (linear) SVM perform best on the test set. These will be used to detect items listed below our predicted price.
Here we compare the actual outcomes against the predictions made by the Random Forest and SVM models, as well as the average of those two predictions, sorted with the most profitable items first.
SVMprofit = round(SVMpredict-testData$sellingPrice,2)
RFprofit = round(predictRF-testData$sellingPrice,2)
printOutRF = data.frame(testData$X_id,testData$sellingPrice,RFprofit,SVMprofit, rowMeans(cbind(SVMprofit,RFprofit)))
colnames(printOutRF) = c("itemId", "priceSold","predictedProfit_RF","predictedProfit_SVM", "avgPredictedProfit")
printOutRF = printOutRF[order(printOutRF$avgPredictedProfit,decreasing=TRUE),]
printOutRF
## itemId priceSold predictedProfit_RF predictedProfit_SVM avgPredictedProfit
## 988 332000007489 158.29 111.25 114.54 112.895
## 975 322326493360 127.50 109.22 53.72 81.470
## 401 192007168188 255.00 85.87 65.64 75.755
## 655 262613806540 230.00 66.14 46.33 56.235
## 357 182333187247 149.49 54.55 45.32 49.935
## 361 182338982582 242.50 56.56 42.96 49.760
## 810 282247067075 223.00 53.15 44.10 48.625
## 84 122211776615 225.00 57.72 36.18 46.950
## 716 262718314570 175.00 49.52 38.88 44.200
## 180 152292945987 208.00 46.48 40.42 43.450
## 514 222305963744 180.00 42.22 43.84 43.030
## 63 122198104186 162.50 57.06 27.84 42.450
## 362 182339003245 215.00 44.12 40.27 42.195
## 375 182344309342 226.00 44.62 38.41 41.515
## 996 332018981092 175.97 45.29 36.75 41.020
## 119 131981426207 164.50 38.45 37.45 37.950
## 356 182332883754 207.50 40.53 33.07 36.800
## 564 232133287719 167.50 47.51 25.65 36.580
## 1054 371774623724 222.50 34.47 35.21 34.840
## 824 282253629374 195.00 43.12 25.66 34.390
## 242 152325008569 212.50 37.63 29.73 33.680
## 946 322314005074 220.50 33.54 31.99 32.765
## 260 162264301644 207.49 29.60 35.57 32.585
## 115 122226575297 194.00 21.00 43.75 32.375
## 720 262720496308 175.00 36.38 27.27 31.825
## 60 122194533957 260.00 29.71 33.62 31.665
## 108 122223674707 222.50 33.49 29.73 31.610
## 656 262633144612 250.00 43.66 19.15 31.405
## 185 152296012585 214.99 28.79 33.41 31.100
## 363 182339192801 203.00 37.63 24.46 31.045
## 212 152311180850 193.50 37.08 24.70 30.890
## 406 192015286062 215.00 39.54 22.14 30.840
## 929 311733703886 222.00 30.06 30.93 30.495
## 403 192009910452 204.56 36.94 23.66 30.300
## 83 122210382518 215.48 27.31 33.06 30.185
## 1036 351902328990 213.50 30.35 28.31 29.330
## 503 222299824490 247.50 36.62 21.37 28.995
## 37 112199176439 204.00 31.98 25.92 28.950
## 626 252629564240 225.50 26.48 31.35 28.915
## 376 182345540511 270.00 34.41 23.35 28.880
## 1020 332031104385 224.99 24.48 33.20 28.840
## 508 222303644187 202.50 26.05 29.89 27.970
## 545 222314595911 197.50 33.68 21.83 27.755
## 747 272430375793 230.50 25.25 28.56 26.905
## 263 162267095611 225.00 27.73 23.95 25.840
## 568 232138865857 215.45 30.74 20.14 25.440
## 189 152299568284 275.00 30.31 19.71 25.010
## 793 282231578551 355.00 29.52 20.20 24.860
## 1079 391610105250 203.50 21.30 28.03 24.665
## 796 282233993583 230.00 18.02 30.93 24.475
## 69 122202618668 204.00 17.15 30.84 23.995
## 497 222295470518 225.45 21.56 25.83 23.695
## 273 162272916866 215.00 25.62 21.24 23.430
## 1040 361751823013 289.99 36.60 10.22 23.410
## 290 162280893629 232.50 27.70 18.91 23.305
## 256 162262068952 245.00 25.28 19.86 22.570
## 315 172395169410 232.50 30.28 14.40 22.340
## 265 162268955618 205.50 29.88 14.73 22.305
## 438 201701911555 222.49 17.65 26.91 22.280
## 99 122220700134 295.00 21.96 22.24 22.100
## 96 122219950985 232.50 16.55 25.34 20.945
## 899 302131295304 232.50 20.05 21.77 20.910
## 992 332018143330 202.50 19.68 21.19 20.435
## 889 302123067927 222.69 19.30 21.53 20.415
## 700 262714743506 198.50 17.72 22.86 20.290
## 997 332019033707 280.00 30.95 9.50 20.225
## 30 112196373304 225.00 15.37 24.46 19.915
## 426 192026567326 227.00 25.15 14.51 19.830
## 644 252636184342 227.00 14.39 22.96 18.675
## 1004 332022945664 252.50 18.21 18.95 18.580
## 924 311727907930 235.73 12.38 23.75 18.065
## 862 291939593391 231.00 18.89 17.07 17.980
## 270 162271501180 217.50 21.92 13.87 17.895
## 346 172409274771 270.00 20.38 15.27 17.825
## 26 112193803223 238.50 27.41 8.02 17.715
## 209 152310022841 235.00 15.23 19.75 17.490
## 966 322324143451 235.00 15.23 19.75 17.490
## 409 192016969618 217.50 13.77 20.81 17.290
## 764 272444905318 232.50 12.00 22.57 17.285
## 425 192026504076 228.50 16.11 18.32 17.215
## 548 222315086554 269.50 20.38 13.94 17.160
## 874 291943622670 207.50 16.67 17.53 17.100
## 1018 332030753584 230.00 19.65 14.40 17.025
## 802 282241152074 218.84 8.03 25.92 16.975
## 123 131985432900 223.95 20.80 12.85 16.825
## 633 252633322030 233.46 14.85 18.55 16.700
## 921 302138457059 225.00 12.80 19.52 16.160
## 450 201711348804 222.50 18.14 14.17 16.155
## 582 232143708528 213.50 18.50 13.65 16.075
## 456 201712090051 235.50 13.11 18.30 15.705
## 380 182349894389 216.50 11.93 19.26 15.595
## 36 112198598802 242.50 19.64 11.52 15.580
## 230 152317225162 291.50 4.48 26.49 15.485
## 152 142165435173 220.00 18.43 12.52 15.475
## 144 132003738004 229.95 17.97 12.97 15.470
## 322 172398950422 270.00 10.89 19.94 15.415
## 662 262692777542 235.50 22.23 8.57 15.400
## 388 182352778984 220.00 17.71 12.86 15.285
## 909 302133871944 221.49 24.36 5.87 15.115
## 836 291920583024 291.00 13.84 16.19 15.015
## 411 192017898096 303.82 7.80 22.14 14.970
## 789 282229130090 217.50 13.44 16.50 14.970
## 661 262692314038 314.99 21.09 8.56 14.825
## 832 282257251519 272.50 15.73 13.45 14.590
## 157 142171167630 227.50 21.50 7.41 14.455
## 75 122205467239 232.49 11.20 16.72 13.960
## 893 302125554374 221.00 7.52 17.89 12.705
## 895 302128278706 222.95 14.54 10.73 12.635
## 502 222299102749 245.50 9.62 15.55 12.585
## 798 282235472071 217.50 19.96 5.06 12.510
## 980 322328696274 209.50 15.14 9.66 12.400
## 110 122225590377 242.50 20.28 4.40 12.340
## 854 291935791162 212.50 17.09 6.78 11.935
## 98 122220360347 215.00 7.43 15.92 11.675
## 566 232135755969 237.50 13.02 9.91 11.465
## 434 201699978183 325.00 16.36 6.12 11.240
## 527 222309074964 214.25 21.38 0.64 11.010
## 622 252625542628 218.00 9.81 11.94 10.875
## 878 302113466514 319.99 35.99 -14.35 10.820
## 313 172394599965 227.50 11.90 9.44 10.670
## 851 291935564525 273.00 4.06 15.97 10.015
## 150 142163152670 265.00 6.89 12.89 9.890
## 845 291931494579 237.50 16.23 3.42 9.825
## 928 311733175781 237.49 10.27 9.27 9.770
## 491 222293662616 215.00 20.05 -0.71 9.670
## 756 272440439564 220.50 15.82 2.66 9.240
## 427 192027216941 226.50 10.28 8.08 9.180
## 864 291940019872 260.50 14.51 3.79 9.150
## 474 201716429471 227.50 11.28 6.84 9.060
## 772 272446548882 222.80 13.99 3.74 8.865
## 335 172405077698 227.50 10.57 5.47 8.020
## 95 122219792662 259.99 1.25 14.59 7.920
## 297 172363749332 329.99 33.55 -17.91 7.820
## 686 262705467137 233.00 4.45 10.20 7.325
## 16 112188956079 300.00 2.46 11.65 7.055
## 1050 361831172902 222.50 7.45 5.82 6.635
## 619 252624461310 224.45 5.10 6.72 5.910
## 677 262702200415 242.00 10.02 1.16 5.590
## 565 232133681950 227.50 6.06 4.70 5.380
## 822 282253145725 280.00 7.75 2.82 5.285
## 331 172403340389 285.00 1.34 8.35 4.845
## 490 222293359601 233.50 2.35 6.76 4.555
## 687 262706787054 281.01 5.62 3.14 4.380
## 165 142173392649 250.50 0.92 7.79 4.355
## 569 232139416739 222.50 6.08 2.29 4.185
## 23 112193065662 281.20 10.47 -3.23 3.620
## 739 272428720490 250.27 5.24 1.17 3.205
## 338 172405671622 252.50 2.42 2.67 2.545
## 461 201713151325 235.45 15.60 -11.89 1.855
## 788 282227331162 274.99 0.95 1.66 1.305
## 581 232143440302 277.45 -0.90 2.03 0.565
## 624 252628009848 242.50 -1.71 2.01 0.150
## 711 262718124164 246.15 1.78 -1.54 0.120
## 682 262704530361 315.00 -0.57 0.74 0.085
## 1063 371792850079 229.50 11.00 -11.04 -0.020
## 325 172400181709 236.00 3.86 -4.53 -0.335
## 208 152309968648 230.49 0.66 -3.67 -1.505
## 828 282255196560 217.95 3.77 -7.52 -1.875
## 441 201702886518 271.50 -6.28 1.79 -2.245
## 526 222308891649 216.50 -1.13 -3.47 -2.300
## 748 272432349648 247.50 -7.13 1.96 -2.585
## 730 272424230867 200.50 -1.71 -4.16 -2.935
## 360 182337962561 255.00 0.66 -6.63 -2.985
## 902 302131899961 425.00 4.27 -10.35 -3.040
## 97 122220353190 215.00 -1.48 -4.69 -3.085
## 304 172388409180 253.00 3.90 -10.08 -3.090
## 669 262696716080 237.50 1.78 -7.98 -3.100
## 610 252620544400 230.00 5.58 -12.14 -3.280
## 91 122218428289 297.89 -6.35 -0.63 -3.490
## 371 182342930215 254.50 -2.18 -5.76 -3.970
## 741 272428965475 224.50 -5.75 -5.61 -5.680
## 352 182331848292 208.50 -2.50 -9.54 -6.020
## 791 282230414981 214.49 -2.24 -10.53 -6.385
## 838 291921827173 219.88 -3.30 -10.21 -6.755
## 245 162251600220 257.50 -7.19 -7.94 -7.565
## 168 142174994299 245.00 -7.13 -8.60 -7.865
## 188 152299499061 272.50 -14.16 -1.79 -7.975
## 583 232144092063 245.50 -17.37 -0.35 -8.860
## 912 302134617027 320.00 -14.37 -3.49 -8.930
## 465 201714842411 237.50 -2.76 -15.36 -9.060
## 819 282252681085 237.50 -5.62 -13.38 -9.500
## 1027 351890510188 223.50 -8.29 -11.42 -9.855
## 1091 401215833398 244.95 -8.96 -11.69 -10.325
## 839 291921827212 208.00 -10.52 -11.00 -10.760
## 520 222306920822 246.50 -10.59 -11.15 -10.870
## 239 152320764046 405.00 1.84 -24.06 -11.110
## 736 272427386637 254.50 -4.72 -17.98 -11.350
## 378 182347685410 227.50 -2.17 -20.78 -11.475
## 759 272442581356 233.49 -6.28 -17.81 -12.045
## 280 162276436019 230.00 -13.84 -10.88 -12.360
## 470 201715654610 247.50 -10.99 -13.73 -12.360
## 308 172391489021 220.00 -1.06 -24.68 -12.870
## 207 152309483008 268.50 -19.97 -6.26 -13.115
## 944 322310998591 240.50 -4.32 -22.01 -13.165
## 137 131995254892 280.00 -8.51 -18.81 -13.660
## 291 162281064398 450.00 -23.99 -3.33 -13.660
## 896 302128597559 240.50 -18.88 -8.72 -13.800
## 221 152314083719 290.00 -9.82 -17.99 -13.905
## 917 302136559416 245.00 -11.40 -17.32 -14.360
## 225 152315332301 260.92 -15.03 -13.75 -14.390
## 518 222306817360 247.99 -14.01 -14.81 -14.410
## 587 252581045405 374.99 7.56 -36.62 -14.530
## 733 272426274590 256.00 -10.16 -19.75 -14.955
## 643 252635979869 255.00 -17.75 -13.84 -15.795
## 295 162286139145 236.00 -13.84 -20.33 -17.085
## 955 322319441088 267.49 -23.34 -12.40 -17.870
## 1092 401218433185 249.00 -22.01 -14.35 -18.180
## 771 272446408354 245.00 -12.79 -23.75 -18.270
## 493 222293748148 275.00 -18.38 -18.55 -18.465
## 703 262715369148 232.50 -16.78 -20.62 -18.700
## 586 252569958608 380.00 0.08 -37.57 -18.745
## 1026 351886767432 247.50 -15.26 -22.43 -18.845
## 766 272445373895 242.50 -14.56 -23.18 -18.870
## 640 252634992970 251.00 -5.75 -32.23 -18.990
## 994 332018233549 272.50 -26.06 -12.40 -19.230
## 990 332014209803 435.00 -11.87 -26.95 -19.410
## 601 252612570636 290.00 -22.50 -17.10 -19.800
## 522 222307344331 241.50 -12.83 -27.41 -20.120
## 271 162271591894 267.50 -21.38 -19.98 -20.680
## 709 262716844659 253.50 -21.97 -19.41 -20.690
## 852 291935651598 267.50 -19.04 -22.52 -20.780
## 962 322322120916 305.00 -18.19 -24.21 -21.200
## 139 131997839936 250.00 -19.92 -23.58 -21.750
## 451 201711351284 256.00 -17.44 -26.16 -21.800
## 950 322318322555 300.00 -23.70 -19.91 -21.805
## 382 182350283248 247.50 -16.06 -28.28 -22.170
## 249 162254944440 281.00 -18.74 -25.87 -22.305
## 534 222311156565 252.50 -19.44 -25.33 -22.385
## 654 252642788043 406.00 -14.16 -31.70 -22.930
## 182 152294104088 280.00 -19.36 -27.53 -23.445
## 779 272448309245 268.00 -20.16 -28.13 -24.145
## 884 302118338655 250.00 -18.99 -29.65 -24.320
## 104 122222872782 247.50 -24.78 -24.77 -24.775
## 47 112203293150 300.00 -27.10 -22.71 -24.905
## 713 262718194698 246.00 -25.55 -24.81 -25.180
## 933 311740138371 257.00 -23.82 -27.16 -25.490
## 468 201715040236 275.50 -30.30 -22.42 -26.360
## 343 172408138117 257.50 -25.31 -32.60 -28.955
## 769 272445522564 233.00 -14.12 -44.52 -29.320
## 19 112190060053 280.00 -32.17 -29.43 -30.800
## 934 322230786608 425.00 -23.04 -38.67 -30.855
## 135 131993371448 310.00 -34.57 -28.39 -31.480
## 1044 361805397511 288.99 -38.01 -28.90 -33.455
## 285 162278444111 265.50 -28.92 -38.13 -33.525
## 306 172390327293 258.00 -29.32 -40.43 -34.875
## 24 112193210879 271.00 -41.72 -29.85 -35.785
## 510 222304155110 295.00 -34.73 -37.25 -35.990
## 424 192025960273 266.00 -27.62 -48.98 -38.300
## 79 122207043259 257.50 -31.19 -45.60 -38.395
## 485 222287299617 309.00 -36.90 -41.44 -39.170
## 608 252616522689 255.00 -35.41 -43.29 -39.350
## 513 222305020925 414.79 -26.63 -52.65 -39.640
## 179 152291828993 290.00 -34.38 -47.06 -40.720
## 671 262697394178 247.50 -35.55 -45.93 -40.740
## 216 152313115344 285.00 -43.90 -39.51 -41.705
## 200 152305969425 262.50 -34.10 -50.08 -42.090
## 920 302137647405 334.00 -39.98 -45.82 -42.900
## 8 112181900794 276.76 -39.58 -48.77 -44.175
## 516 222306074990 290.99 -43.41 -45.74 -44.575
## 1072 381840957212 251.00 -43.44 -49.30 -46.370
## 665 262695437867 290.50 -47.96 -47.81 -47.885
## 780 272448503561 295.00 -47.02 -57.91 -52.465
## 540 222312257284 311.00 -50.13 -54.90 -52.515
## 176 142179693198 280.00 -56.20 -49.40 -52.800
## 697 262712990594 255.50 -40.69 -66.05 -53.370
## 459 201712517938 290.00 -60.50 -63.43 -61.965
## 673 262701096393 315.00 -68.00 -64.64 -66.320
## 1028 351891415447 285.50 -60.92 -72.49 -66.705
## 164 142172675260 292.00 -70.41 -69.89 -70.150
## 39 112199560370 325.00 -77.09 -89.96 -83.525
## 913 302134880670 330.00 -85.41 -86.97 -86.190
## 161 142172296000 385.00 -90.66 -82.47 -86.565
## 383 182350366307 375.00 -82.20 -91.82 -87.010
## 163 142172642383 365.00 -96.71 -81.02 -88.865
## 722 262722260878 405.00 -143.47 -141.96 -142.715
Typically you’d want to minimize error, but in this case I’m really trying to exploit the error of a machine learning algorithm. I want my algorithm to be “smarter” than the data: the cases where it’s “incorrect” in a certain direction (really, the algorithm is correct and the market is not) are the cases where I expect to see profit. This was initially difficult for me to wrap my head around when looking at the outputs from the various algorithms.