Kaggle launched a machine learning challenge based on the Random Acts of Pizza (RAOP) project on Reddit. In RAOP, people can buy a pizza for any random person who asks for one. People have made requests by describing hard times, asking very politely, or even offering to write poems in return.
The purpose of this challenge was to analyze the historical requests and create an algorithm capable of predicting whether a request will be successful and receive a free pizza.
Load the Training Data
You can get the training dataset from the Kaggle website, but in order to download it, Kaggle requires you to sign in to your account and accept the terms of the competition. Unfortunately, they haven’t developed an API to do this with an R script, so I’m going to rely on the reader to download the data themselves. Here is the link to the competition page: Link. The data downloads as a 1.7MB zip file in JSON format. JSON is short for JavaScript Object Notation and is an alternative to XML.
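Once the archive is unzipped, reading the JSON into a data frame could look roughly like the sketch below. It assumes the jsonlite package is installed; the file name train.json and the two field names in the sample are assumptions used for illustration, and a small inline sample stands in for the real file so the snippet runs on its own.

```r
# Sketch only: assumes the jsonlite package is available.
library(jsonlite)

# A tiny inline sample stands in for the downloaded file.
# Field names here are illustrative assumptions.
sample_json <- '[
  {"request_id": "t3_a", "requester_received_pizza": true},
  {"request_id": "t3_b", "requester_received_pizza": false}
]'

# A JSON array of objects is simplified into a data frame.
train <- fromJSON(sample_json)

# With the real data (file name assumed): train <- fromJSON("train.json")
str(train)
```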
NA Values
With the dataset loaded, we’ll start with some exploratory analysis to understand it. We’ll look at the attribute names and then make several plots to determine which of the attributes might add predictive power. First, let’s take a quick look at where the missing values are. The original dataset uses the character string “N/A” to represent missing values, so we’ll have to turn those strings into something R understands as NA. The NA values are in the variables shown below.
Variable NACount
giver_username_if_known 3753
requester_subreddits_at_request 729
requester_user_flair 3046
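A count like the one above can be produced with a few lines of base R. This is a minimal sketch on a made-up data frame, not the code used for the post; it only handles character columns, and list-valued columns (such as a field holding several subreddits per request) would need separate treatment.

```r
# Toy data frame; the values are invented for illustration.
train <- data.frame(
  giver_username_if_known = c("N/A", "some_user", "N/A"),
  requester_user_flair    = c("N/A", "N/A", "N/A"),
  request_title           = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Replace the literal string "N/A" with a real missing value,
# column by column, touching only character columns.
train[] <- lapply(train, function(col) {
  if (is.character(col)) col[which(col == "N/A")] <- NA
  col
})

# Count missing values per column and keep only columns that have any.
na_counts <- colSums(is.na(train))
na_table <- data.frame(Variable = names(na_counts),
                       NACount  = as.vector(na_counts))
na_table[na_table$NACount > 0, ]
```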
It is interesting that the Kaggle website describes the requester_user_flair field as indicating whether the requester received pizza, received pizza and gave pizza, or didn’t receive any pizza. This field seems like it would give away the outcome of the request, so we’ll check to see if it’s in the test dataset. It turns out that this field, as well as a few others, is only in the training dataset: there are only 17 fields in the test set while there are 32 fields in the training set. The training-only fields will be removed to prevent us from building a model on data points that won’t be available later.
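One way to drop the training-only fields is to keep the columns the two sets share plus the outcome column. The sketch below uses tiny invented data frames, and the outcome column name requester_received_pizza is an assumption.

```r
# Toy stand-ins for the real training and test sets.
train <- data.frame(
  request_id               = 1:3,
  requester_user_flair     = c(NA, "x", NA),
  requester_received_pizza = c(FALSE, TRUE, FALSE)
)
test <- data.frame(request_id = 4:5)

target <- "requester_received_pizza"   # assumed name of the outcome column

# Columns present in both sets, and columns that exist only in training.
shared     <- intersect(names(train), names(test))
train_only <- setdiff(names(train), c(shared, target))

# Keep the shared predictors and the outcome; everything else is dropped.
train <- train[, c(shared, target), drop = FALSE]
names(train)
```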
Time Values
Next, we’ll look at the success rates by time period to see if there are any trends over time. We’ll need to convert the date fields to POSIXct so the year can be extracted, and then we can tabulate the success rates. The table below shows that the number of requests that received pizza decreased every year, both in absolute count and as a share of all requests. The 2011 success rate is approximately 30 percent while the 2013 rate is a little more than 20 percent. In other words, in each year a request was less likely to receive a free pizza than in the year before. The difference between years could have predictive power, so we’ll keep it in mind for later.
Source: local data frame [3 x 4]
Year Count Success Percent
1 2011 1234 371 30.1%
2 2012 1493 352 23.6%
3 2013 1313 271 20.6%
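A table of this shape can be built in base R along the following lines. The column names unix_timestamp_of_request_utc and requester_received_pizza are assumptions, the four rows are invented, and the post itself appears to use dplyr rather than tapply.

```r
# Toy data: Unix timestamps (seconds since 1970-01-01 UTC) and outcomes.
train <- data.frame(
  unix_timestamp_of_request_utc = c(1317852607, 1332652424,
                                    1356998400, 1370000000),
  requester_received_pizza      = c(TRUE, FALSE, FALSE, TRUE)
)

# Convert the numeric timestamp to POSIXct and pull out the year.
request_time <- as.POSIXct(train$unix_timestamp_of_request_utc,
                           origin = "1970-01-01", tz = "UTC")
year <- format(request_time, "%Y")

# Requests and successes per year, then the success rate in percent.
count   <- tapply(train$requester_received_pizza, year, length)
success <- tapply(train$requester_received_pizza, year, sum)

by_year <- data.frame(
  Year    = as.integer(names(count)),
  Count   = as.vector(count),
  Success = as.vector(success)
)
by_year$Percent <- round(100 * by_year$Success / by_year$Count, 1)
by_year
```

The same request_time vector can be passed to weekdays() to group by day of the week; note that the day names it returns depend on the session locale.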
Based on the table above, it looks like requests became less successful with each year. Perhaps as word got out about the program there were more requesters than pizza buyers, or requests became disingenuous. We’d need to do some analysis on the text to get those answers, but for now we’ll take a look at the data by day of the week. This will test whether requests are more successful on the weekend, for example. Day of the week doesn’t appear to have very strong predictive power because the success rate does not vary much between days.
Source: local data frame [7 x 4]
weekday Count Success Percent
1 Friday 544 131 24.1%
2 Monday 574 153 26.7%
3 Saturday 522 121 23.2%
4 Sunday 585 135 23.1%
5 Thursday 550 164 29.8%
6 Tuesday 606 134 22.1%
7 Wednesday 659 156 23.7%
We can put these three time-related variables into our model later and let it decide which has more predictive power.
Binary Variables
One of the fields is called giver_username_if_known. This variable is particularly interesting because, logically, in order for a giver to be known, they must first have given a pizza. We’ll convert this variable to a binary value: 0 for an NA value and 1 when the giver username exists. We can create a table showing the success rates by this binary value. The table shows us that whenever giver_username_if_known is something other than NA, the request was successful 100% of the time.
Source: local data frame [2 x 4]
giver_username_if_known Count Success Percent
1 0 3753 707 18.8%
2 1 287 287 100%
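The conversion to a 0/1 flag is a one-liner once the “N/A” strings have been turned into real NA values. A minimal sketch on invented data:

```r
# Toy vectors; the usernames and outcomes are made up.
giver    <- c(NA, "some_user", NA, "another_user")
received <- c(FALSE, TRUE, TRUE, TRUE)

# 1 when a giver username exists, 0 when it is missing.
giver_known <- as.integer(!is.na(giver))

# Success rate within each group.
tapply(received, giver_known, mean)
```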
Next, we’ll look at the impact of attaching an image to the request. According to the research paper referenced on the competition website, attaching an image is supposed to increase the success rate. We’ll see if this holds true in our training dataset. From the table below, it appears that this is correct, but there are very few instances where an image was attached. This variable may not add much to our model.
Source: local data frame [2 x 4]
Image Count Success Percent
1 0 3948 954 24.2%
2 1 92 40 43.5%
The length of time the requester has had an account on Reddit may have an influence on whether the request is successful. This analysis will separate out users who are not likely to be contributors to the community and are therefore less likely to receive a free pizza. The difference in the success rates seems to indicate this variable could have predictive power. Brand-new accounts have a success rate approximately 10 percentage points lower than accounts open for more than a week.
Source: local data frame [2 x 4]
Acct.Age Count Success Percent
1 0 1045 180 17.2%
2 1 2995 814 27.2%
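A flag like Acct.Age can be derived by thresholding the account age. In the sketch below the seven-day cutoff is an assumption inferred from the “more than a week” wording, and the ages are invented.

```r
# Toy account ages in days at the time of the request.
account_age_days <- c(0, 0.5, 3, 7, 30, 400)

threshold_days <- 7   # assumed cutoff, not confirmed by the post

# 1 for accounts older than the cutoff, 0 otherwise.
acct_age <- as.integer(account_age_days > threshold_days)
acct_age
```

The same pattern (compare a numeric column against a cutoff and convert to integer) applies to the comment-count flag in the next paragraph.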
Next, we’ll create a binary variable from the number of requester comments. A request is significantly more successful if the requester has more than 8 comments in the RAOP community. However, like the image variable, this may not add much predictive power because so few requests in the training set come from users with more than 8 comments.
Source: local data frame [2 x 4]
BnRAOP Count Success Percent
1 0 3973 947 23.8%
2 1 67 47 70.1%
Text Analysis
You can imagine that the text of the request is very important to whether someone buys a pizza or not. It has to grab the reader’s attention, present a compelling case, or entertain the reader. The text analysis starts with a simple look at the word count. Each request is split into words, and the number of words is recorded as its length. We’ll start by grouping the requests by percentiles and determining whether the success rates differ enough between groups to give predictive power.
Source: local data frame [2 x 4]
Word.bin Count Success Percent
1 1 2046 407 19.9%
2 2 1994 587 29.4%
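Word counts and a two-way split can be computed with base R as sketched below. Splitting at the median is an assumption about how the two bins were formed, and the request texts are invented.

```r
# Toy request texts.
request_text <- c(
  "Hungry student, will pay it forward.",
  "Pizza please",
  "Lost my job last week and the kids would love a pizza tonight",
  "Broke until payday, anything helps"
)

# Split each request on whitespace and count the resulting words.
words      <- strsplit(trimws(request_text), "\\s+")
word_count <- lengths(words)

# Bin 1 = at or below the median length, bin 2 = above it (assumed rule).
word_bin <- ifelse(word_count > median(word_count), 2L, 1L)

data.frame(word_count, word_bin)
```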
We’ll analyze the requests using term frequency data. The text will be preprocessed by changing all characters to lower case, removing punctuation, removing stop words, and finally stemming the words. Once processed, we’ll look for words that are in at least 90% of the documents and use them to create a logistic regression model.
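The preprocessing steps can be illustrated in base R as follows. This is a simplified sketch, not the pipeline used for the post: the stop-word list is a tiny made-up one, stemming is left out because it needs an add-on package, and the cutoff is lowered to a hypothetical 50% so that the three toy documents leave some terms behind.

```r
# Toy documents.
docs <- c("Please help, I am hungry!",
          "Hungry student needs help.",
          "Will pay it forward.")

# Tiny illustrative stop-word list (a real list would be much longer).
stop_words <- c("i", "am", "it", "will", "the", "a")

# Lower-case, strip punctuation, split into words.
clean  <- tolower(docs)
clean  <- gsub("[[:punct:]]", " ", clean)
tokens <- strsplit(trimws(clean), "\\s+")

# Drop stop words; keep each term once per document.
tokens <- lapply(tokens, function(w) unique(w[!w %in% stop_words]))

# Document frequency: share of documents that contain each term.
doc_freq <- table(unlist(tokens)) / length(docs)

min_share  <- 0.5   # hypothetical cutoff for this toy example
keep_terms <- names(doc_freq)[doc_freq >= min_share]
keep_terms
```

The retained terms would then become columns (for example, 0/1 indicators per request) that can be passed to glm() with family = binomial.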
Baseline Model
The baseline model is a simple average of how many requests were successful. In this simple model, we use this average as the probability that each request was successful. We’d predict that every observation had a 24.6 percent chance of success and therefore would most likely fail.
N Success Percent
1 3030 746 24.6%
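The arithmetic behind the baseline, using the counts in the table above, works out as in this short sketch. The accuracy figure is the share of correct answers obtained by predicting “no pizza” for every request.

```r
n       <- 3030   # requests in the training split (from the table)
success <- 746    # requests that received a pizza

# Baseline probability assigned to every request.
baseline_prob <- success / n

# Always predicting failure is right whenever the request failed.
baseline_accuracy <- 1 - baseline_prob

round(c(prob = baseline_prob, accuracy = baseline_accuracy), 3)
```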
Logistic Regression Model
Classification Model
Random Forest
Warning in names(RFTrain)[9:54] <- names(TextTrain): number of items to
replace is not a multiple of replacement length
Score the Model
Work in Progress
FALSE TRUE
0 761 1
1 181 67
Area under the curve: 0.7442
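The confusion matrix above can be summarized with a few standard metrics. The sketch assumes the rows are the actual outcome (0 = no pizza, 1 = pizza) and the columns are the predicted class at some probability cutoff; the post does not state this layout explicitly.

```r
# Counts copied from the table above; layout is an assumption
# (rows = actual outcome, columns = predicted class).
cm <- matrix(c(761, 181, 1, 67), nrow = 2,
             dimnames = list(actual    = c("0", "1"),
                             predicted = c("FALSE", "TRUE")))

accuracy    <- sum(diag(cm)) / sum(cm)           # correct / all
sensitivity <- cm["1", "TRUE"]  / sum(cm["1", ]) # successes that were caught
specificity <- cm["0", "FALSE"] / sum(cm["0", ]) # failures that were caught
precision   <- cm["1", "TRUE"]  / sum(cm[, "TRUE"])

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 3)
```

Under that reading, the model is right about 82% of the time but flags only about 27% of the requests that actually received a pizza, so most of its accuracy comes from predicting failure.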
Work in Progress
The test set will be prepared using the same functions as were used on the training dataset.