Random Acts of Pizza

The Challenge

Kaggle launched a machine learning challenge based on the Random Acts of Pizza (RAOP) project on Reddit. In RAOP, people can buy a pizza for any random person who asks for it. People have made requests by describing hard times, asking very politely, or even offering to write poems in return.

The purpose of this challenge was to analyze the historical requests and create an algorithm capable of predicting whether a request will be successful and receive a free pizza.

Exploratory Analytics and Feature Engineering

Load the Training Data
You can get the training dataset from the Kaggle website, but in order to download it, Kaggle requires you to sign in to your account and accept the terms of the competition. Unfortunately, they haven’t developed an API to do this with an R script, so I’m going to rely on the reader to download the data themselves. Here is the link to the competition page: Link. The data downloads as a 1.7MB zip file in JSON format. JSON is short for JavaScript Object Notation and is an alternative to XML.
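Once the zip file is downloaded and extracted, reading it into R might look like the sketch below. It assumes the jsonlite package, which the text above does not name, and it writes a two-record toy file so the example runs on its own. The field names request_id and requester_received_pizza are assumptions used for illustration.

```r
# Sketch only: jsonlite is an assumed choice of JSON parser.
library(jsonlite)

# Write a tiny stand-in file so the example is self-contained.
# With the real download, point json_path at the extracted JSON file instead.
json_path <- tempfile(fileext = ".json")
writeLines(
  '[{"request_id": "t3_a", "requester_received_pizza": true},
    {"request_id": "t3_b", "requester_received_pizza": false}]',
  json_path
)

# fromJSON() turns an array of JSON objects into a data frame,
# one row per object and one column per field.
train <- fromJSON(json_path)
str(train)
```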

NA Values
With the dataset loaded, we’ll start with some exploratory analytics to understand the dataset. We’ll look at the attribute names and then make several plots to determine which of the attributes might add predictive power. Let’s take a quick look at the dataset and see where the missing values are. The original dataset used the character string “N/A” to refer to NA values. We’ll have to turn those values into something R understands. The NA values are in the variables shown below.

                        Variable NACount
         giver_username_if_known    3753
 requester_subreddits_at_request     729
            requester_user_flair    3046
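The conversion and the NA count can be sketched in base R as follows. The toy data frame and its values are made up for illustration; only the column names come from the table above.

```r
# Toy stand-in for the training data; values are invented for illustration.
toy <- data.frame(
  giver_username_if_known = c("N/A", "some_user", "N/A"),
  requester_user_flair    = c("N/A", "example_flair", "N/A"),
  request_title           = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Replace the literal string "N/A" with a real NA that R understands.
toy[toy == "N/A"] <- NA

# Count the NA values per column and keep only columns that have any.
na_count <- colSums(is.na(toy))
na_table <- data.frame(
  Variable = names(na_count),
  NACount  = as.integer(na_count),
  stringsAsFactors = FALSE
)
na_table <- na_table[na_table$NACount > 0, ]
print(na_table, row.names = FALSE)
```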

It is interesting that the Kaggle website describes the requester_user_flair field as showing whether the requester received pizza, received pizza and gave pizza, or didn’t receive any pizza. This field seems like it would give away the outcome of the request, so we’ll check to see if it’s in the test dataset. This field, as well as a few others, is only in the training dataset. There are only 17 fields in the test set while there are 32 fields in the training set. The training-only fields will be removed to prevent us from building a model on data points that won’t be available later.
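One way to drop the training-only fields is to keep just the columns the two datasets share, plus the outcome column. This is a sketch on toy data frames; the outcome name requester_received_pizza is an assumption, since the text above does not name the target field.

```r
# Toy stand-ins for the training and test sets.
train_toy <- data.frame(
  request_text             = "x",
  requester_user_flair     = NA,
  requester_received_pizza = TRUE,
  stringsAsFactors = FALSE
)
test_toy <- data.frame(request_text = "y", stringsAsFactors = FALSE)

# Assumed name of the outcome column; it must survive even though
# it is (by design) missing from the test set.
target <- "requester_received_pizza"

# Fields that exist only in the training data.
train_only <- setdiff(names(train_toy), c(names(test_toy), target))
print(train_only)

# Keep the shared fields plus the outcome.
keep <- union(intersect(names(train_toy), names(test_toy)), target)
train_toy <- train_toy[, keep, drop = FALSE]
```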

Time Values
Next, we’ll look at the success rates by time period to see if there are any trends with time. We’ll need to convert the date fields to POSIXct so the year can be extracted, and then we can create a chart of the success rates. Using the table below, we can see that the number of requests that receive pizza has consistently decreased across the years, both proportionally and nominally. The 2011 success rate is approximately 30 percent while the 2013 rate is a little more than 20 percent. This means that a request made in a later year was less likely to receive a free pizza than one made the year before. The difference in years could have predictive power, so we’ll keep it in mind for later.

Source: local data frame [3 x 4]

  Year Count Success Percent
1 2011  1234     371   30.1%
2 2012  1493     352   23.6%
3 2013  1313     271   20.6%
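The date conversion and the grouped success rates can be sketched in base R as below. The column names unix_timestamp and success, the helper success_by, and the toy values are all hypothetical; the sketch assumes the dates are stored as Unix timestamps in seconds. Note that weekdays() returns names in the current locale.

```r
# Toy data: Unix timestamps (seconds since 1970-01-01 UTC) and a 0/1 outcome.
toy <- data.frame(
  unix_timestamp = c(1293840000, 1325376000, 1325462400, 1356998400),
  success        = c(1, 1, 0, 0)
)

# Convert to POSIXct, then pull out the year and the day of the week.
toy$request_time <- as.POSIXct(toy$unix_timestamp, origin = "1970-01-01", tz = "UTC")
toy$Year    <- format(toy$request_time, "%Y")
toy$weekday <- weekdays(toy$request_time)

# Hypothetical helper: count, successes, and success rate per group.
success_by <- function(df, group) {
  groups <- split(df$success, df[[group]])
  count  <- unname(vapply(groups, length, integer(1)))
  wins   <- unname(vapply(groups, sum, numeric(1)))
  data.frame(
    Group   = names(groups),
    Count   = count,
    Success = as.integer(wins),
    Percent = sprintf("%.1f%%", 100 * wins / count),
    stringsAsFactors = FALSE
  )
}

by_year <- success_by(toy, "Year")
print(by_year)
```

The same helper works for the day of the week by calling success_by(toy, "weekday").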

Based on the table above, it looks like requests were getting less successful with each year. Perhaps as word got out about the program there were more requesters than pizza buyers, or requests became disingenuous. We’d need to do some analysis on the text to get those answers, but for now we’ll take a look at the data by day of the week. This will be a test to see if the requests are more successful on the weekend, for example. The day of the week doesn’t appear to have very strong predictive power because the variation between days is small: success rates range from 22.1% (Tuesday) to 29.8% (Thursday).

Source: local data frame [7 x 4]

    weekday Count Success Percent
1    Friday   544     131   24.1%
2    Monday   574     153   26.7%
3  Saturday   522     121   23.2%
4    Sunday   585     135   23.1%
5  Thursday   550     164   29.8%
6   Tuesday   606     134   22.1%
7 Wednesday   659     156   23.7%

We can put these three time-related variables into our model later and let it decide which has more predictive power.

Binary Variables
One of the fields is called giver_username_if_known. This variable is particularly interesting because, logically, in order for a giver to be known, they must first have given a pizza. We’ll convert this variable to a binary value: 0 for an NA value and 1 for when the giver username exists. We can create a table showing the success rates by this binary value. The table shows us that whenever giver_username_if_known is something other than NA, the request was successful: all 287 such requests received a pizza.

Source: local data frame [2 x 4]

  giver_username_if_known Count Success Percent
1                       0  3753     707   18.8%
2                       1   287     287    100%
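A minimal sketch of this conversion, on invented values, assuming the “N/A” strings have already been turned into NA:

```r
# Toy values; NA means no giver username was recorded.
giver    <- c(NA, "some_user", NA, NA, "another_user")
received <- c(0, 1, 1, 0, 1)  # 1 = request received a pizza

# 0 when the giver is unknown (NA), 1 when a username exists.
giver_known <- as.integer(!is.na(giver))

# Cross-tabulate the flag against the outcome, then get the success rate per flag.
print(table(giver_known, received))
rates <- tapply(received, giver_known, mean)
print(rates)
```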

Next, we’ll look at the impact of attaching an image to the request. Per the research paper referenced on the competition website, attaching an image is supposed to increase the success rate. We’ll see if this holds true in our training dataset. From the table below, it appears that this is correct, but there are very few instances where an image was attached. This variable may not add much to our model.

Source: local data frame [2 x 4]

  Image Count Success Percent
1     0  3948     954   24.2%
2     1    92      40   43.5%
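The text above doesn’t say how an attached image is detected. One possible approach, shown here purely as an assumption, is to flag requests whose text contains an image-hosting link or an image file extension. The pattern and the toy requests are hypothetical.

```r
# Toy request texts, invented for illustration.
request_text <- c(
  "I lost my job, proof: http://imgur.com/abc123",
  "Hungry student, will pay it forward",
  "See my empty fridge https://example.com/fridge.JPG"
)

# Assumed heuristic: an imgur link or a common image file extension.
image_pattern <- "imgur\\.com|\\.(jpg|jpeg|png|gif)\\b"

# 1 when the text appears to link to an image, 0 otherwise.
has_image <- as.integer(
  grepl(image_pattern, request_text, ignore.case = TRUE, perl = TRUE)
)
print(has_image)
```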

The length of time the requester has had an account on Reddit may have an influence on whether the request is successful. This analysis will separate out users who are unlikely to be contributors to the community and are therefore less likely to receive a free pizza. The difference in the success rates seems to indicate this variable could have predictive power. Brand new accounts are approximately 10 percentage points less successful (17.2% versus 27.2%) than accounts open for more than a week.

Source: local data frame [2 x 4]

  Acct.Age Count Success Percent
1        0  1045     180   17.2%
2        1  2995     814   27.2%

Next, we’ll create a binary variable from the number of requester comments. The request is significantly more successful if the requester has more than 8 comments in the RAOP community. However, like the image variable, this may not add much predictive power because of the small number of instances with more than 8 comments in the training set.

Source: local data frame [2 x 4]

  BnRAOP Count Success Percent
1      0  3973     947   23.8%
2      1    67      47   70.1%
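Here is a short sketch of the threshold flag, together with a check of how rare the flagged group is using the counts from the table above. The toy comment counts are invented.

```r
# Toy comment counts, invented for illustration.
comments_in_raop <- c(0, 3, 8, 9, 25)

# 1 when the requester has more than 8 RAOP comments, 0 otherwise.
BnRAOP <- as.integer(comments_in_raop > 8)
print(BnRAOP)

# Share of training requests in the flagged group, from the table above.
n_flagged   <- 67
n_unflagged <- 3973
share_flagged <- n_flagged / (n_flagged + n_unflagged)
print(round(100 * share_flagged, 1))  # under 2 percent of requests
```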

Text Analysis
You can imagine that the text of the request is very important to whether someone buys a pizza or not. It has to grab the reader’s attention, present a compelling case, or entertain the reader. The text analysis starts with a simple analysis of the word count. Each request will be split into words, and the number of words will be counted. We’ll start by grouping the requests by percentiles and determining if the success rates have enough variance to give predictive power.

Source: local data frame [2 x 4]

  Word.bin Count Success Percent
1        1  2046     407   19.9%
2        2  1994     587   29.4%
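The word count and the two bins can be sketched as below. Splitting at the median is an assumption based on the roughly equal bin sizes in the table; the toy requests are invented.

```r
# Toy request texts, invented for illustration.
request_text <- c(
  "Please help",
  "I lost my job last week and could really use a pizza",
  "Hungry",
  "Will write a poem in return for dinner tonight"
)

# Split each request on whitespace and count the resulting words.
word_count <- lengths(strsplit(trimws(request_text), "\\s+"))
print(word_count)

# Assumed binning: 1 for requests at or below the median length, 2 for longer ones.
Word.bin <- ifelse(word_count <= median(word_count), 1L, 2L)
print(Word.bin)
```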

We’ll analyze the requests using term frequency data. The text will be preprocessed by changing all characters to lower case, removing punctuation, removing stop words, and finally stemming the words. Once processed, we’ll look for words that are in at least 90% of the documents and use them to create a logistic regression model.
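The preprocessing steps can be sketched in base R as follows. This is not the original pipeline: the stop-word list is a tiny hypothetical one, stemming is left out because base R has no stemmer (a package such as SnowballC would be one option), and the document-frequency threshold min_doc_frac is a parameter set to a toy value rather than the figure quoted above.

```r
# Toy requests standing in for the real request text.
docs <- c(
  "Please help, I lost my job!",
  "Lost my wallet; hungry student.",
  "Will write a poem for pizza, please."
)

# Tiny hypothetical stop-word list; a real analysis would use a fuller one.
stop_words <- c("i", "my", "a", "for", "will")

# Lower-case, strip punctuation, split on whitespace, drop stop words.
# Stemming is skipped in this sketch.
tokenize <- function(text) {
  text  <- gsub("[[:punct:]]", " ", tolower(text))
  words <- strsplit(trimws(text), "\\s+")[[1]]
  words[!words %in% stop_words]
}
tokens <- lapply(docs, tokenize)

# Document-term matrix: one row per request, one column per word, cells are counts.
vocab <- sort(unique(unlist(tokens)))
dtm <- matrix(0L, nrow = length(docs), ncol = length(vocab),
              dimnames = list(NULL, vocab))
for (i in seq_along(tokens)) {
  counts <- table(tokens[[i]])
  if (length(counts) > 0) dtm[i, names(counts)] <- as.integer(counts)
}

# Keep words that appear in at least min_doc_frac of the requests.
min_doc_frac <- 0.5  # illustrative threshold for the toy data
doc_frac <- colMeans(dtm > 0)
frequent_terms <- names(doc_frac)[doc_frac >= min_doc_frac]
print(frequent_terms)
```

The columns of dtm for the retained terms are what would then be passed to a model as predictors.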

Create the Models

Baseline Model
The baseline model is a simple average of how many requests were successful. In this simple model, we would use this average as the probability that each request was successful. We’d predict that each observation had a 24.6 percent chance of success and therefore would most likely fail.

     N Success Percent
1 3030     746   24.6%
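The baseline can be reproduced from the counts in the table above. The sketch below rebuilds a 0/1 outcome vector from those counts; the 0.5 cutoff used to turn the probability into a class is an assumption.

```r
# Counts taken from the table above.
n_requests <- 3030
n_success  <- 746

# Rebuild a 0/1 outcome vector with the same totals.
outcome <- c(rep(1, n_success), rep(0, n_requests - n_success))

# The baseline probability is just the overall success rate.
baseline_prob <- mean(outcome)
print(round(100 * baseline_prob, 1))  # 24.6

# Every request gets the same probability, so with a 0.5 cutoff
# every request is predicted to fail.
baseline_pred   <- rep(baseline_prob, length(outcome))
predicted_class <- as.integer(baseline_pred > 0.5)

# Accuracy of always predicting failure equals the failure rate.
baseline_accuracy <- mean(predicted_class == outcome)
print(round(100 * baseline_accuracy, 1))
```

Any later model has to beat this always-fail accuracy to be worth using.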

Logistic Regression Model

Classification Model

Random Forest

Warning in names(RFTrain)[9:54] <- names(TextTrain): number of items to
replace is not a multiple of replacement length

Score the Model
Work in Progress

   
    FALSE TRUE
  0   761    1
  1   181   67

Area under the curve: 0.7442
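The summary metrics can be worked out from the confusion matrix as sketched below. The sketch assumes the rows (0 and 1) are the actual outcomes and the columns (FALSE and TRUE) are the predictions, which the output above does not label. The rank-based AUC function is a generic illustration on toy scores, not the routine that produced the figure above.

```r
# Confusion matrix from above; rows assumed actual, columns assumed predicted.
cm <- matrix(
  c(761, 181, 1, 67), nrow = 2,
  dimnames = list(actual = c("0", "1"), predicted = c("FALSE", "TRUE"))
)

accuracy    <- (cm["0", "FALSE"] + cm["1", "TRUE"]) / sum(cm)
sensitivity <- cm["1", "TRUE"] / sum(cm["1", ])   # successes correctly flagged
specificity <- cm["0", "FALSE"] / sum(cm["0", ])  # failures correctly flagged
print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity), 3))

# AUC as the probability that a random positive outranks a random negative
# (Mann-Whitney rank formula; ties get average ranks).
auc_from_ranks <- function(scores, labels) {
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(rank(scores)[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy scores and labels, invented for illustration.
auc_toy <- auc_from_ranks(c(0.1, 0.4, 0.35, 0.8), c(0, 0, 1, 1))
print(auc_toy)
```

Under the assumed layout, the model almost never predicts a success that fails, but it misses most of the requests that do succeed.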

Apply the Model to the Test Data

Work in Progress. The test set will be prepared using the same functions as were used on the training dataset.

Ryan Kuhn
