Christopher (me), Joe, and Paul are co-workers who decided to try our hands at this Kaggle competition!
Kaggle awards competition medals that get featured on your Kaggle profile. For the competition we’re working on, there are currently about 900 teams, which means that to get a Bronze Medal we need to finish in the Top 100, to get a Silver Medal we need to finish in the Top 50, and to get a Gold Medal we need to get into the Top 10 + 0.2% of teams. Our goal first and foremost for this project is to learn: to become better modellers, data scientists, and team members. Our secondary goal is to get a Bronze Medal! We’d all be very proud to achieve that secondary goal.
Nothing is more comforting than being greeted by your favorite drink just as you walk through the door of the corner café. While a thoughtful barista knows you take a macchiato every Wednesday morning at 8:15, it’s much more difficult in a digital space for your preferred brands to personalize your experience. TalkingData, China’s largest third-party mobile data platform, understands that everyday choices and behaviors paint a picture of who we are and what we value. Currently, TalkingData is seeking to leverage behavioral data from more than 70% of the 500 million mobile devices active daily in China to help its clients better understand and interact with their audiences. In this competition, Kagglers are challenged to build a model predicting users’ demographic characteristics based on their app usage, geolocation, and mobile device properties. Doing so will help millions of developers and brand advertisers around the world pursue data-driven marketing efforts which are relevant to their users and catered to their preferences.
For this competition I’m hoping to follow Kaggle Competition Grandmaster Abhishek Thakur’s guide Approaching (Almost) Any Machine Learning Problem. The main difference is that we will be writing our code in R instead of Python, which will be the challenge! Another challenge we will face is that this is a fairly large data set, and R traditionally isn’t as good as Python at operating on data that doesn’t fit in RAM.
For this competition we are trying to predict the age/gender group that a device’s user belongs to, based on the device’s phone brand and model, phone usage patterns, app installation and usage, and geolocation information.
We will need to convert the data, whose schema is outlined below, into a useful tabular format. The labels, listed as group in the data schema, are spread across multiple binary columns (one-hot encoded), so this is a classification problem where each sample belongs to exactly one class but there are more than two classes.
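For example, a single group label can be spread into one-hot binary columns with base R’s model.matrix(). A minimal sketch (the three example labels are made up; the 12 levels are the class columns from the sample submission):

# Hypothetical group labels; the 12 levels match the class columns below
groups <- factor(c("F23-", "F24-26", "M22-"),
                 levels = c("F23-","F24-26","F27-28","F29-32","F33-42","F43+",
                            "M22-","M23-26","M27-28","M29-31","M32-38","M39+"))
# One binary column per class: a 1 in each row's true class, 0s elsewhere
model.matrix(~ groups - 1)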
Submissions are evaluated using the multi-class logarithmic loss. Each device has been labeled with one true class. For each device, you must submit a set of predicted probabilities (one for each class). The formula is then, \[logloss = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log(p_{ij})\] where \(N\) is the number of devices in the test set, \(M\) is the number of class labels, \(\log\) is the natural logarithm, \(y_{ij}\) is \(1\) if device \(i\) belongs to class \(j\) and \(0\) otherwise, and \(p_{ij}\) is the predicted probability that observation \(i\) belongs to class \(j\).
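To make the metric concrete, we can compute it by hand on the first three rows of the benchmark sample submission, which assigns every one of the 12 classes a uniform probability of 1/12 ≈ 0.0833, scored against a fake set of true classes: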
library(data.table) # for fread()
library(bit64)      # lets fread() read the 64-bit integer device_id column

# Read three rows of the benchmark submission, in which every class
# receives the same predicted probability
fakePredictedProbabilities <- fread("sample_submission.csv", showProgress=FALSE, nrows=3, data.table=FALSE)

# Build a fake one-hot truth table: a single 1 per device, 0s elsewhere
fakeTrueClasses <- as.data.frame(fakePredictedProbabilities)
fakeTrueClasses[1:3,2:13]=0
fakeTrueClasses[1,2]=1
fakeTrueClasses[2,3]=1
fakeTrueClasses[3,8]=1
fakePredictedProbabilities
## device_id F23- F24-26 F27-28 F29-32 F33-42 F43+ M22-
## 1 1002079943728939269 0.0833 0.0833 0.0833 0.0833 0.0833 0.0833 0.0833
## 2 -1547860181818787117 0.0833 0.0833 0.0833 0.0833 0.0833 0.0833 0.0833
## 3 7374582448058474277 0.0833 0.0833 0.0833 0.0833 0.0833 0.0833 0.0833
## M23-26 M27-28 M29-31 M32-38 M39+
## 1 0.0833 0.0833 0.0833 0.0833 0.0833
## 2 0.0833 0.0833 0.0833 0.0833 0.0833
## 3 0.0833 0.0833 0.0833 0.0833 0.0833
fakeTrueClasses
## device_id F23- F24-26 F27-28 F29-32 F33-42 F43+ M22- M23-26
## 1 1002079943728939269 1 0 0 0 0 0 0 0
## 2 -1547860181818787117 0 1 0 0 0 0 0 0
## 3 7374582448058474277 0 0 0 0 0 0 1 0
## M27-28 M29-31 M32-38 M39+
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
# Take the natural log of every predicted probability
loggedValues <- log(fakePredictedProbabilities[,2:13])
loggedValues
## F23- F24-26 F27-28 F29-32 F33-42 F43+ M22-
## 1 -2.485307 -2.485307 -2.485307 -2.485307 -2.485307 -2.485307 -2.485307
## 2 -2.485307 -2.485307 -2.485307 -2.485307 -2.485307 -2.485307 -2.485307
## 3 -2.485307 -2.485307 -2.485307 -2.485307 -2.485307 -2.485307 -2.485307
## M23-26 M27-28 M29-31 M32-38 M39+
## 1 -2.485307 -2.485307 -2.485307 -2.485307 -2.485307
## 2 -2.485307 -2.485307 -2.485307 -2.485307 -2.485307
## 3 -2.485307 -2.485307 -2.485307 -2.485307 -2.485307
# Multiply elementwise by the one-hot truth table, keeping only the
# log-probability of each device's true class
MultipliedValues <- loggedValues*fakeTrueClasses[,2:13]
MultipliedValues
## F23- F24-26 F27-28 F29-32 F33-42 F43+ M22- M23-26 M27-28
## 1 -2.485307 0.000000 0 0 0 0 0.000000 0 0
## 2 0.000000 -2.485307 0 0 0 0 0.000000 0 0
## 3 0.000000 0.000000 0 0 0 0 -2.485307 0 0
## M29-31 M32-38 M39+
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
# Accumulate the log-probabilities of the true classes, device by device
PreAverageLogLoss <- 0
for(i in 1:3){
  PreAverageLogLoss = PreAverageLogLoss + sum(MultipliedValues[i,])
  print(PreAverageLogLoss)
}
## [1] -2.485307
## [1] -4.970613
## [1] -7.45592
# Negate and average over the N = 3 devices to get the final log loss
PreAverageLogLoss*(-1/3)
## [1] 2.485307
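The loop spells out every step; the same log loss can also be computed in a single vectorized expression. A minimal sketch using only the objects defined above:

# Elementwise product of the one-hot truth table with the log-probabilities,
# summed and then averaged over the N = 3 devices
-sum(fakeTrueClasses[,2:13]*log(fakePredictedProbabilities[,2:13]))/3
# [1] 2.485307 -- matches the step-by-step result above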
We will take advantage of R’s numerous machine learning libraries for this competition:
#Data Manipulation
library(data.table) #Extension of the data.frame; makes data manipulation easier and faster (usually faster than dplyr), at the cost of slightly harder-to-read code
library(SOAR) #Possibly required in order to manage memory for these large datasets
library(snow) #Parallel processing for Windows
library(doSNOW) #Helps with parallel processing
library(bit64) #Gives R a 64-bit integer class, needed to read the device_id column
#Data Visualization
library(ggplot2) #Plotting and visualizing data
library(ROCR) #Visualizing the scoring of model performance
#Modeling
library(caret) #Resampling methods and hyperparameter selection
library(gbm) #Gradient Boosting Machine models
library(randomForest) #Random Forest models
library(e1071) #Support Vector Machine models
#http://stackoverflow.com/questions/23926334/how-do-i-parallelize-in-r-on-windows-example
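As a preview of how snow and doSNOW fit together, here is a minimal sketch following the pattern in the Stack Overflow answer linked above (the worker count of 4 is an assumption, not a tuned value):

# Start a socket cluster, which works on Windows, and register it so that
# foreach/caret resampling loops run on the workers in parallel
cl <- makeCluster(4, type = "SOCK") # assumed worker count
registerDoSNOW(cl)
# ... fit models here with caret::train() or foreach() ...
stopCluster(cl) # release the workers when done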