Competition Description

Christopher (me), Joe, and Paul are co-workers who decided to try our hands at this Kaggle competition!

Kaggle awards competition medals that get featured on your Kaggle profile. For the competition we’re working on, there are currently about 900 teams, which means that to get a Bronze Medal we need to finish in the Top 100, to get a Silver Medal we need to finish in the Top 50, and to get a Gold Medal we need to finish in the Top 10 + 0.2% of teams (roughly the top 12 at the current team count). Our first and foremost goal for this project is to learn: to become better modellers, data scientists, and team members. Our secondary goal is to get a Bronze Medal! We’d all be very proud to achieve that.

From Kaggle.com:

Nothing is more comforting than being greeted by your favorite drink just as you walk through the door of the corner café. While a thoughtful barista knows you take a macchiato every Wednesday morning at 8:15, it’s much more difficult in a digital space for your preferred brands to personalize your experience. TalkingData, China’s largest third-party mobile data platform, understands that everyday choices and behaviors paint a picture of who we are and what we value. Currently, TalkingData is seeking to leverage behavioral data from more than 70% of the 500 million mobile devices active daily in China to help its clients better understand and interact with their audiences. In this competition, Kagglers are challenged to build a model predicting users’ demographic characteristics based on their app usage, geolocation, and mobile device properties. Doing so will help millions of developers and brand advertisers around the world pursue data-driven marketing efforts which are relevant to their users and catered to their preferences.

For this competition I’m hoping to follow Kaggle Competition Grandmaster Abhishek Thakur’s guide, Approaching (Almost) Any Machine Learning Problem. The main difference is that we will be writing our code in R instead of Python; this will be the challenge! Another challenge we will face is that this is a fairly large data set, and R traditionally isn’t as good as Python at operating on data that doesn’t fit in RAM.

Data

Overview

For this competition we are trying to predict the age/gender group that a device belongs to from the device’s phone brand and model, app installation and usage patterns, and geolocation information.

We will need to convert the data schema (outlined below) into a useful tabular data format. The labels, listed as group in the data schema below, are multi-column binary data, so this is a classification problem where each sample belongs to exactly one of more than two classes (from Abhishek’s guide). A minimal sketch of that label conversion follows.
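
As a sketch of the conversion, the single group column can be spread into twelve binary indicator columns with data.table’s dcast. This assumes gender_age_train.csv is in the working directory; the names train, value, and labelsWide are our own.

library(data.table)
library(bit64) #device_id values are 64-bit integers
train <- fread("gender_age_train.csv")
train[, value := 1L] #helper column: each device belongs to its group
#One row per device, one 0/1 column per age/gender group
labelsWide <- dcast(train, device_id ~ group, value.var = "value", fill = 0L)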

Data Schema

From Kaggle.com Competition Page

File Descriptions - From Kaggle

  • gender_age_train.csv, gender_age_test.csv - the training and test set
    • group: this is the target variable you are going to predict
  • events.csv, app_events.csv - when a user uses the TalkingData SDK, an event gets logged in this data. Each event has an event_id and a location (lat/long), and corresponds to a list of apps in app_events.
    • timestamp: when the user is using an app with TalkingData SDK
  • app_labels.csv - apps and their labels; the label_id values can be used to join with label_categories (see the sketch after this list)
  • label_categories.csv - apps’ labels and their categories in text
  • phone_brand_device_model.csv - device ids, brand, and models
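
The joins implied by these file descriptions look roughly like this. This is a sketch, assuming the CSVs are in the working directory; the object names are our own.

library(data.table)
library(bit64)
appLabels <- fread("app_labels.csv")
labelCategories <- fread("label_categories.csv")
#label_id links each app's labels to their text categories
appCategories <- merge(appLabels, labelCategories, by = "label_id")

events <- fread("events.csv")
appEvents <- fread("app_events.csv")
#event_id links each logged event to the list of apps in app_events
eventsWithApps <- merge(events, appEvents, by = "event_id")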

Evaluation Metric

Submissions are evaluated using the multi-class logarithmic loss. Each device has been labeled with one true class. For each device, you must submit a set of predicted probabilities (one for each class). The formula is then \[logloss = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log(p_{ij}),\] where \(N\) is the number of devices in the test set, \(M\) is the number of class labels, \(\log\) is the natural logarithm, \(y_{ij}\) is \(1\) if device \(i\) belongs to class \(j\) and \(0\) otherwise, and \(p_{ij}\) is the predicted probability that observation \(i\) belongs to class \(j\).

Submission Scoring Example

library(data.table)
library(bit64) #device_id values overflow 32-bit integers

#Read the first three rows of the sample submission, which gives every device
#a uniform probability of 1/12 (~0.0833) for each of the twelve classes
fakePredictedProbabilities <- fread("sample_submission.csv", showProgress = FALSE, nrows = 3, data.table = FALSE)

#Invent a true class for each device and one-hot encode it
fakeTrueClasses <- fakePredictedProbabilities
fakeTrueClasses[1:3, 2:13] <- 0
fakeTrueClasses[1, 2] <- 1
fakeTrueClasses[2, 3] <- 1
fakeTrueClasses[3, 8] <- 1

fakePredictedProbabilities
##              device_id   F23- F24-26 F27-28 F29-32 F33-42   F43+   M22-
## 1  1002079943728939269 0.0833 0.0833 0.0833 0.0833 0.0833 0.0833 0.0833
## 2 -1547860181818787117 0.0833 0.0833 0.0833 0.0833 0.0833 0.0833 0.0833
## 3  7374582448058474277 0.0833 0.0833 0.0833 0.0833 0.0833 0.0833 0.0833
##   M23-26 M27-28 M29-31 M32-38   M39+
## 1 0.0833 0.0833 0.0833 0.0833 0.0833
## 2 0.0833 0.0833 0.0833 0.0833 0.0833
## 3 0.0833 0.0833 0.0833 0.0833 0.0833
fakeTrueClasses
##              device_id F23- F24-26 F27-28 F29-32 F33-42 F43+ M22- M23-26
## 1  1002079943728939269    1      0      0      0      0    0    0      0
## 2 -1547860181818787117    0      1      0      0      0    0    0      0
## 3  7374582448058474277    0      0      0      0      0    0    1      0
##   M27-28 M29-31 M32-38 M39+
## 1      0      0      0    0
## 2      0      0      0    0
## 3      0      0      0    0
#Take the natural log of every predicted probability (drop the device_id column)
loggedValues <- log(fakePredictedProbabilities[, 2:13])
loggedValues
##        F23-    F24-26    F27-28    F29-32    F33-42      F43+      M22-
## 1 -2.485307 -2.485307 -2.485307 -2.485307 -2.485307 -2.485307 -2.485307
## 2 -2.485307 -2.485307 -2.485307 -2.485307 -2.485307 -2.485307 -2.485307
## 3 -2.485307 -2.485307 -2.485307 -2.485307 -2.485307 -2.485307 -2.485307
##      M23-26    M27-28    M29-31    M32-38      M39+
## 1 -2.485307 -2.485307 -2.485307 -2.485307 -2.485307
## 2 -2.485307 -2.485307 -2.485307 -2.485307 -2.485307
## 3 -2.485307 -2.485307 -2.485307 -2.485307 -2.485307
#Multiply element-wise by the one-hot truth table: only the log probability
#of each device's true class survives
MultipliedValues <- loggedValues * fakeTrueClasses[, 2:13]
MultipliedValues
##        F23-    F24-26 F27-28 F29-32 F33-42 F43+      M22- M23-26 M27-28
## 1 -2.485307  0.000000      0      0      0    0  0.000000      0      0
## 2  0.000000 -2.485307      0      0      0    0  0.000000      0      0
## 3  0.000000  0.000000      0      0      0    0 -2.485307      0      0
##   M29-31 M32-38 M39+
## 1      0      0    0
## 2      0      0    0
## 3      0      0    0
#Running sum of the true-class log probabilities across the three devices
PreAverageLogLoss <- 0

for (i in 1:3) {
  PreAverageLogLoss <- PreAverageLogLoss + sum(MultipliedValues[i, ])
  print(PreAverageLogLoss)
}
## [1] -2.485307
## [1] -4.970613
## [1] -7.45592
#Multiply by -1/N (here N = 3 devices) to get the average log loss
PreAverageLogLoss * (-1/3)
## [1] 2.485307
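
The same calculation can be wrapped up for reuse. Below is a minimal sketch; the name multiClassLogLoss and the eps argument are our own, though Kaggle’s metric similarly replaces each probability with max(min(p, 1 - 10^-15), 10^-15) before taking logs. Applied to the fake data above, it reproduces the 2.485307 we computed step by step.

multiClassLogLoss <- function(actual, predicted, eps = 1e-15) {
  #Clip probabilities away from 0 and 1 so log(0) can never occur
  predicted <- pmax(pmin(as.matrix(predicted), 1 - eps), eps)
  #Negated mean of the true-class log probabilities
  -sum(as.matrix(actual) * log(predicted)) / nrow(predicted)
}
multiClassLogLoss(fakeTrueClasses[, 2:13], fakePredictedProbabilities[, 2:13])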

Libraries Used

We will take advantage of R’s numerous machine learning libraries for this competition:

#Data Manipulation
library(data.table) #Extension of data.frame; makes data manipulation much easier/faster (usually faster than dplyr), though the code is slightly harder to read
library(SOAR) #Possibly required to manage memory for these large datasets
library(snow) #Parallel processing for Windows
library(doSNOW) #Helps with parallel processing
library(bit64) #Gives R the ability to read 64-bit integers

#Data Visualization
library(ggplot2) #Plotting and visualizing data
library(ROCR) #Visualizing the scoring of model performance
#Modeling
library(caret) #Resampling methods and hyperparameter selection
library(gbm) #Gradient Boosting Machine models
library(randomForest) #Random Forest models
library(e1071) #Support Vector Machine models


#http://stackoverflow.com/questions/23926334/how-do-i-parallelize-in-r-on-windows-example
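
Since parallel setup on Windows is easy to get wrong, here is a minimal sketch of the snow/doSNOW pattern from the Stack Overflow link above; the worker count of 4 is a placeholder to adjust to your machine’s core count.

library(doSNOW)
cl <- makeCluster(4, type = "SOCK") #spawn 4 socket-based worker processes
registerDoSNOW(cl) #register the cluster as the foreach parallel backend
#...caret::train() and foreach() %dopar% calls now run in parallel...
stopCluster(cl) #shut the workers down when finished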

Machine Learning Framework

Figure from: A. Thakur and A. Krohn-Grimberghe, AutoCompete: A Framework for Machine Learning Competitions, AutoML Workshop, International Conference on Machine Learning, 2015.