Bike Sharing Kaggle Competition

Tyler Byers

15 Aug 2015


This project is part of the Coursera Intro to Data Science class from the University of Washington. It is the class's peer-review project: an open-ended assignment in which we get to participate in a Kaggle competition.

I chose the Bike Sharing Demand competition. The website and problem description are at: http://www.kaggle.com/c/bike-sharing-demand

Data Fields

datetime - hourly date + timestamp

season - 1 = spring, 2 = summer, 3 = fall, 4 = winter

holiday - whether the day is considered a holiday

workingday - whether the day is neither a weekend nor holiday

weather - 1: Clear, Few clouds, Partly cloudy; 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist; 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds; 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

temp - temperature in Celsius

atemp - “feels like” temperature in Celsius

humidity - relative humidity

windspeed - wind speed

casual - number of non-registered user rentals initiated

registered - number of registered user rentals initiated

count - number of total rentals

Load data

train <- read.csv('train.csv')
test <- read.csv('test.csv')
summary(train)
##                 datetime         season        holiday      
##  2011-01-01 00:00:00:    1   Min.   :1.00   Min.   :0.0000  
##  2011-01-01 01:00:00:    1   1st Qu.:2.00   1st Qu.:0.0000  
##  2011-01-01 02:00:00:    1   Median :3.00   Median :0.0000  
##  2011-01-01 03:00:00:    1   Mean   :2.51   Mean   :0.0286  
##  2011-01-01 04:00:00:    1   3rd Qu.:4.00   3rd Qu.:0.0000  
##  2011-01-01 05:00:00:    1   Max.   :4.00   Max.   :1.0000  
##  (Other)            :10880                                  
##    workingday       weather          temp           atemp      
##  Min.   :0.000   Min.   :1.00   Min.   : 0.82   Min.   : 0.76  
##  1st Qu.:0.000   1st Qu.:1.00   1st Qu.:13.94   1st Qu.:16.66  
##  Median :1.000   Median :1.00   Median :20.50   Median :24.24  
##  Mean   :0.681   Mean   :1.42   Mean   :20.23   Mean   :23.66  
##  3rd Qu.:1.000   3rd Qu.:2.00   3rd Qu.:26.24   3rd Qu.:31.06  
##  Max.   :1.000   Max.   :4.00   Max.   :41.00   Max.   :45.45  
##                                                                
##     humidity       windspeed        casual      registered      count    
##  Min.   :  0.0   Min.   : 0.0   Min.   :  0   Min.   :  0   Min.   :  1  
##  1st Qu.: 47.0   1st Qu.: 7.0   1st Qu.:  4   1st Qu.: 36   1st Qu.: 42  
##  Median : 62.0   Median :13.0   Median : 17   Median :118   Median :145  
##  Mean   : 61.9   Mean   :12.8   Mean   : 36   Mean   :156   Mean   :192  
##  3rd Qu.: 77.0   3rd Qu.:17.0   3rd Qu.: 49   3rd Qu.:222   3rd Qu.:284  
##  Max.   :100.0   Max.   :57.0   Max.   :367   Max.   :886   Max.   :977  
## 
str(train)
## 'data.frame':    10886 obs. of  12 variables:
##  $ datetime  : Factor w/ 10886 levels "2011-01-01 00:00:00",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weather   : int  1 1 1 1 1 2 1 1 1 1 ...
##  $ temp      : num  9.84 9.02 9.02 9.84 9.84 ...
##  $ atemp     : num  14.4 13.6 13.6 14.4 14.4 ...
##  $ humidity  : int  81 80 80 75 75 75 80 86 75 76 ...
##  $ windspeed : num  0 0 0 0 0 ...
##  $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
##  $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
##  $ count     : int  16 40 32 13 1 1 2 3 8 14 ...
summary(test)
##                 datetime        season        holiday      
##  2011-01-20 00:00:00:   1   Min.   :1.00   Min.   :0.0000  
##  2011-01-20 01:00:00:   1   1st Qu.:2.00   1st Qu.:0.0000  
##  2011-01-20 02:00:00:   1   Median :3.00   Median :0.0000  
##  2011-01-20 03:00:00:   1   Mean   :2.49   Mean   :0.0291  
##  2011-01-20 04:00:00:   1   3rd Qu.:3.00   3rd Qu.:0.0000  
##  2011-01-20 05:00:00:   1   Max.   :4.00   Max.   :1.0000  
##  (Other)            :6487                                  
##    workingday       weather          temp           atemp     
##  Min.   :0.000   Min.   :1.00   Min.   : 0.82   Min.   : 0.0  
##  1st Qu.:0.000   1st Qu.:1.00   1st Qu.:13.94   1st Qu.:16.7  
##  Median :1.000   Median :1.00   Median :21.32   Median :25.0  
##  Mean   :0.686   Mean   :1.44   Mean   :20.62   Mean   :24.0  
##  3rd Qu.:1.000   3rd Qu.:2.00   3rd Qu.:27.06   3rd Qu.:31.1  
##  Max.   :1.000   Max.   :4.00   Max.   :40.18   Max.   :50.0  
##                                                               
##     humidity       windspeed   
##  Min.   : 16.0   Min.   : 0.0  
##  1st Qu.: 49.0   1st Qu.: 7.0  
##  Median : 65.0   Median :11.0  
##  Mean   : 64.1   Mean   :12.6  
##  3rd Qu.: 81.0   3rd Qu.:17.0  
##  Max.   :100.0   Max.   :56.0  
## 
str(test)
## 'data.frame':    6493 obs. of  9 variables:
##  $ datetime  : Factor w/ 6493 levels "2011-01-20 00:00:00",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ workingday: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ weather   : int  1 1 1 1 1 1 1 1 1 2 ...
##  $ temp      : num  10.7 10.7 10.7 10.7 10.7 ...
##  $ atemp     : num  11.4 13.6 13.6 12.9 12.9 ...
##  $ humidity  : int  56 56 56 56 56 60 60 55 55 52 ...
##  $ windspeed : num  26 0 0 11 11 ...

It's a very clean data set. I'm going to change datetime from a factor to a proper date-time, then split it into date and hour components.

test[c('casual', 'registered', 'count')] <- NA
alldata <- rbind(train, test)
summary(alldata)
##                 datetime         season       holiday      
##  2011-01-01 00:00:00:    1   Min.   :1.0   Min.   :0.0000  
##  2011-01-01 01:00:00:    1   1st Qu.:2.0   1st Qu.:0.0000  
##  2011-01-01 02:00:00:    1   Median :3.0   Median :0.0000  
##  2011-01-01 03:00:00:    1   Mean   :2.5   Mean   :0.0288  
##  2011-01-01 04:00:00:    1   3rd Qu.:3.0   3rd Qu.:0.0000  
##  2011-01-01 05:00:00:    1   Max.   :4.0   Max.   :1.0000  
##  (Other)            :17373                                 
##    workingday       weather          temp           atemp     
##  Min.   :0.000   Min.   :1.00   Min.   : 0.82   Min.   : 0.0  
##  1st Qu.:0.000   1st Qu.:1.00   1st Qu.:13.94   1st Qu.:16.7  
##  Median :1.000   Median :1.00   Median :20.50   Median :24.2  
##  Mean   :0.683   Mean   :1.43   Mean   :20.38   Mean   :23.8  
##  3rd Qu.:1.000   3rd Qu.:2.00   3rd Qu.:27.06   3rd Qu.:31.1  
##  Max.   :1.000   Max.   :4.00   Max.   :41.00   Max.   :50.0  
##                                                               
##     humidity       windspeed        casual       registered  
##  Min.   :  0.0   Min.   : 0.0   Min.   :  0    Min.   :  0   
##  1st Qu.: 48.0   1st Qu.: 7.0   1st Qu.:  4    1st Qu.: 36   
##  Median : 63.0   Median :13.0   Median : 17    Median :118   
##  Mean   : 62.7   Mean   :12.7   Mean   : 36    Mean   :156   
##  3rd Qu.: 78.0   3rd Qu.:17.0   3rd Qu.: 49    3rd Qu.:222   
##  Max.   :100.0   Max.   :57.0   Max.   :367    Max.   :886   
##                                 NA's   :6493   NA's   :6493  
##      count     
##  Min.   :  1   
##  1st Qu.: 42   
##  Median :145   
##  Mean   :192   
##  3rd Qu.:284   
##  Max.   :977   
##  NA's   :6493
alldata$datetime <- strftime(as.character(alldata$datetime), "%Y-%m-%d %H:%M:%S")
str(alldata$datetime)
##  chr [1:17379] "2011-01-01 00:00:00" "2011-01-01 01:00:00" ...
alldata$date <- strftime(alldata$datetime, format = '%Y-%m-%d')
alldata$hour <- strftime(alldata$datetime, format = '%H')
train <- alldata[1:10886,]
test <- alldata[10887:17379,]
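
An aside: strftime happens to work on a character vector because R coerces it through as.POSIXlt, but a more explicit route would be to parse the timestamps once and format out the pieces. A sketch only; I didn't re-run the pipeline this way:

# Sketch of a more explicit alternative: parse once, then extract the parts.
# Using tz = 'UTC' sidesteps any daylight-saving gaps in the hourly sequence
# (an assumption about how the data were logged).
dt <- as.POSIXlt(as.character(alldata$datetime), tz = 'UTC')
alldata$date <- format(dt, '%Y-%m-%d')
alldata$hour <- format(dt, '%H')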

Explore the data

library(ggplot2)
library(gridExtra)
## Loading required package: grid
qplot(x=count, data = train, fill = as.factor(season))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk How many riders are we looking at throughout year

p1 <- qplot(x = count, data = train, fill = as.factor(season)) +
    scale_x_sqrt()
p2 <- qplot(x = casual, data = train, fill = as.factor(season)) +
    scale_x_sqrt()
p3 <- qplot(x = registered, data = train, fill = as.factor(season)) +
    scale_x_sqrt()
grid.arrange(p1, p2, p3)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk How many riders are we looking at throughout year

Registered riders are much more consistent; casual riders, as expected, ride more in spring and summer.

qplot(x=count, data = train, fill = as.factor(weather))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk riders based on weather

ggplot(aes(x=temp, y = count), data = train) + geom_jitter() + 
    geom_smooth(method = 'lm', color = 'red')

plot of chunk riders based on weather

cor.test(train$temp, train$count)
## 
##  Pearson's product-moment correlation
## 
## data:  train$temp and train$count
## t = 44.78, df = 10884, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3785 0.4102
## sample estimates:
##    cor 
## 0.3945

A modest correlation of 0.3945 between temp and count. I want to fit a quick model using just temp to see if I can estimate what my Kaggle score will be.

Create Own Training and Test Sets

library(caret)
## Loading required package: lattice
set.seed(815)
trainIndex <- createDataPartition(1:nrow(train), p = 0.5, 
                                  list = FALSE, times = 1)
mytrain <- train[trainIndex,]
mytest <- train[-trainIndex,]
cor.test(mytrain$temp, mytrain$count)
## 
##  Pearson's product-moment correlation
## 
## data:  mytrain$temp and mytrain$count
## t = 30.81, df = 5442, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3625 0.4078
## sample estimates:
##    cor 
## 0.3854

That correlation (0.3854) is roughly the same as on the full training set.
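
One thing worth flagging: I passed the index vector 1:nrow(train) to createDataPartition, which just splits rows at random. Passing the outcome instead would stratify the split on count, which is what caret intends. A sketch, not what I ran above:

# Hypothetical stratified alternative: partition on the outcome so both
# halves get a similar distribution of count.
trainIndex2 <- createDataPartition(train$count, p = 0.5, list = FALSE)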

model <- lm(count ~ temp, data = mytrain)
p <- round(predict(model, newdata = mytest))
head(p)
##  1  3  4  5  6  9 
## 98 91 98 98 98 98

Create pseudo-evaluation function

# RMSLE, the metric Kaggle uses to score this competition
bikeevaluate <- function (data, pred) {
    return(sqrt(1/nrow(data)*sum((log(pred+1)-log(data$count+1))^2)))
}
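
For reference, this is the Root Mean Squared Logarithmic Error (RMSLE): sqrt((1/n) * sum((log(p_i + 1) - log(a_i + 1))^2)). A quick sanity check with made-up numbers:

# Perfect predictions should score 0; predictions off by 2x score ~0.67.
bikeevaluate(data.frame(count = c(10, 100)), c(10, 100))
bikeevaluate(data.frame(count = c(10, 100)), c(20, 200))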

Evaluate our simple model

bikeevaluate(mytest, p)
## [1] 1.449
# Now run the model on the actual test set
p <- round(predict(model, newdata = test))
head(p)
## 10887 10888 10889 10890 10891 10892 
##   106   106   106   106   106    98

Estimated Kaggle score: approximately 1.45.

output <- subset(test, select = 'datetime')
output$count <- p
outname <- paste0('results/out_',format(Sys.time(), '%Y%m%d_%H%M%S'),
                  '.csv')
write.csv(output, outname, quote = FALSE, row.names = FALSE)

Kaggle score: 1.4523. Ok, the local evaluation function is a good estimator.

rpart

I don't want to think very hard right now, so I'm going to fit an rpart model using most of the variables.

library(rpart)
fol <- formula(count ~ season + holiday + workingday + weather + temp + atemp +
                   humidity + windspeed)
mymodel <- rpart(fol, data = mytrain)
print(mymodel)
## n= 5444 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 5444 180300000 192.80  
##    2) atemp< 29.93 3820  91070000 150.10  
##      4) temp< 12.71 1055  12850000  90.58 *
##      5) temp>=12.71 2765  73040000 172.90  
##       10) humidity>=65.5 1527  30350000 131.10 *
##       11) humidity< 65.5 1238  36740000 224.40  
##         22) season< 2.5 746  15840000 184.20 *
##         23) season>=2.5 492  17860000 285.40 *
##    3) atemp>=29.93 1624  65900000 293.30  
##      6) humidity>=64.5 569  17310000 202.70 *
##      7) humidity< 64.5 1055  41400000 342.10  
##       14) workingday>=0.5 745  31030000 315.20 *
##       15) workingday< 0.5 310   8543000 406.60 *
mypred <- round(predict(mymodel, newdata = mytest))
bikeevaluate(mytest, mypred)
## [1] 1.385
# Local score 1.385 (so roughly 1.39 on Kaggle), which is a very slight move up the leaderboard.  Not worth submitting at this time.
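
One knob I didn't touch is rpart's complexity parameter, cp, which controls how far the tree grows. A sketch of growing deep and pruning back to the cross-validated best, just to show the mechanics (not run for this report):

# Grow a deliberately deep tree, then prune back to the cp with the lowest
# cross-validated error in the cp table.
deeptree <- rpart(fol, data = mytrain, control = rpart.control(cp = 0.001))
bestcp <- deeptree$cptable[which.min(deeptree$cptable[, 'xerror']), 'CP']
prunedtree <- prune(deeptree, cp = bestcp)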

Refactor variables

Several variables really should be factors rather than numeric. I also want a variable for the day of the week, rather than just the workingday flag.

names(alldata)
##  [1] "datetime"   "season"     "holiday"    "workingday" "weather"   
##  [6] "temp"       "atemp"      "humidity"   "windspeed"  "casual"    
## [11] "registered" "count"      "date"       "hour"
alldata$season <- factor(alldata$season, labels = c('spring', 'summer',
                                                    'fall', 'winter'))
alldata$weather <- factor(alldata$weather, 
                          labels = c('Clear', 'Mist', 'LightSnow', 'HeavyRain'))
alldata$holiday <- factor(alldata$holiday, labels = c('Not','Holiday'))
alldata$hour <- factor(alldata$hour)
alldata$weekday <- weekdays(as.Date(alldata$date, '%Y-%m-%d'))
alldata$weekday <- factor(alldata$weekday)
str(alldata)
## 'data.frame':    17379 obs. of  15 variables:
##  $ datetime  : chr  "2011-01-01 00:00:00" "2011-01-01 01:00:00" "2011-01-01 02:00:00" "2011-01-01 03:00:00" ...
##  $ season    : Factor w/ 4 levels "spring","summer",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday   : Factor w/ 2 levels "Not","Holiday": 1 1 1 1 1 1 1 1 1 1 ...
##  $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weather   : Factor w/ 4 levels "Clear","Mist",..: 1 1 1 1 1 2 1 1 1 1 ...
##  $ temp      : num  9.84 9.02 9.02 9.84 9.84 ...
##  $ atemp     : num  14.4 13.6 13.6 14.4 14.4 ...
##  $ humidity  : int  81 80 80 75 75 75 80 86 75 76 ...
##  $ windspeed : num  0 0 0 0 0 ...
##  $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
##  $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
##  $ count     : int  16 40 32 13 1 1 2 3 8 14 ...
##  $ date      : chr  "2011-01-01" "2011-01-01" "2011-01-01" "2011-01-01" ...
##  $ hour      : Factor w/ 24 levels "00","01","02",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ weekday   : Factor w/ 7 levels "Friday","Monday",..: 3 3 3 3 3 3 3 3 3 3 ...

Done munging the data for now; re-split it into the test, train, mytest, and mytrain sets.

train <- alldata[1:10886,]
test <- alldata[10887:17379,]
set.seed(820)
trainIndex <- createDataPartition(1:nrow(train), p = 0.5, 
                                  list = FALSE, times = 1)
mytrain <- train[trainIndex,]
mytest <- train[-trainIndex,]
cor.test(mytrain$temp, mytrain$count)
## 
##  Pearson's product-moment correlation
## 
## data:  mytrain$temp and mytrain$count
## t = 32, df = 5442, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3753 0.4201
## sample estimates:
##    cor 
## 0.3979

Try another rpart

Will try rpart again, now with the factor variables (including hour and weekday). I wonder if that will help.

names(mytrain)
##  [1] "datetime"   "season"     "holiday"    "workingday" "weather"   
##  [6] "temp"       "atemp"      "humidity"   "windspeed"  "casual"    
## [11] "registered" "count"      "date"       "hour"       "weekday"
fol <- formula(count ~ season + holiday + weather + temp + atemp +
                   humidity + windspeed + hour + weekday)
mymodel <- rpart(fol, data = mytrain)
print(mymodel)
## n= 5444 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##   1) root 5444 176900000 191.20  
##     2) hour=00,01,02,03,04,05,06,22,23 2065   6675000  50.09  
##       4) hour=00,01,02,03,04,05 1375   1296000  25.56 *
##       5) hour=06,22,23 690   2903000  98.97 *
##     3) hour=07,08,09,10,11,12,13,14,15,16,17,18,19,20,21 3379 104000000 277.50  
##       6) hour=07,09,10,11,12,13,14,15,20,21 2260  39470000 225.30  
##        12) temp< 19.27 989  11060000 165.70 *
##        13) temp>=19.27 1271  22160000 271.70  
##          26) weekday=Friday,Monday,Thursday,Tuesday,Wednesday 920   9038000 238.80 *
##          27) weekday=Saturday,Sunday 351   9506000 358.10  
##            54) hour=07,09,10,20,21 161   1508000 232.50 *
##            55) hour=11,12,13,14,15 190   3309000 464.50 *
##       7) hour=08,16,17,18,19 1119  45920000 382.80  
##        14) temp< 17.63 389  10520000 258.00  
##          28) season=spring,summer 236   4905000 202.70 *
##          29) season=winter 153   3785000 343.20 *
##        15) temp>=17.63 730  26110000 449.30  
##          30) hour=08,16,19 420  10560000 391.00 *
##          31) hour=17,18 310  12180000 528.40  
##            62) weekday=Saturday,Sunday 84   1498000 393.80 *
##            63) weekday=Friday,Monday,Thursday,Tuesday,Wednesday 226   8594000 578.40  
##             126) atemp< 30.68 103   3397000 481.10 *
##             127) atemp>=30.68 123   3406000 659.90 *
mypred <- round(predict(mymodel, newdata = mytest))
bikeevaluate(mytest, mypred)
## [1] 0.8645
# That helped us a lot!  Down to 0.86451

I want to create an output function, though; I'm tired of typing the same lines for every submission.

writeout <- function(testdata, pred){
    # Kaggle expects two columns: datetime and count
    output <- subset(testdata, select = 'datetime')
    output$count <- pred
    # Timestamped filename so submissions don't overwrite each other
    outname <- paste0('results/out_',format(Sys.time(), '%Y%m%d_%H%M%S'),
                  '.csv')
    write.csv(output, outname, quote = FALSE, row.names = FALSE)
}
pred <- predict(mymodel, newdata = test)
writeout(test, pred)

Kaggle score: 0.87592. Moved up 93 positions.

Random Forest

Let's see how a random forest does with the same formula.

library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
rfmodel <- randomForest(fol, data = mytrain)
rfpredict <- round(predict(rfmodel, newdata = mytest))
bikeevaluate(mytest, rfpredict)
## [1] 0.6763
# New score: 0.67634
importance(rfmodel)
##           IncNodePurity
## season          7581689
## holiday          703584
## weather         3107707
## temp           15610820
## atemp          17950570
## humidity       15577794
## windspeed       6298487
## hour           90590889
## weekday        13269494
rfmodel2 <- randomForest(fol, data = train)
importance(rfmodel2)
##           IncNodePurity
## season         15806349
## holiday         1819964
## weather         6994547
## temp           30811097
## atemp          33209366
## humidity       30609371
## windspeed      12093003
## hour          183401318
## weekday        30028723
rfpredict2 <- round(predict(rfmodel2, newdata = test))
writeout(test, rfpredict2)

Actual Kaggle score: 0.66237. Moved up 53 positions.
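
I used randomForest entirely with defaults. An obvious next experiment would be tuning mtry (how many variables are tried at each split) with the package's built-in tuneRF; a sketch, assuming the same predictor set (I haven't run this):

# Search for a good mtry by stepping up/down from the default until the
# out-of-bag error stops improving by at least 1%.
predictors <- train[, c('season', 'holiday', 'weather', 'temp', 'atemp',
                        'humidity', 'windspeed', 'hour', 'weekday')]
tuneRF(predictors, train$count, ntreeTry = 100, stepFactor = 2,
       improve = 0.01)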

Support Vector Machine

Will an SVM work? Let's see.

library(e1071)
svmmodel <- svm(fol, data=mytrain)
svmpredict <- round(predict(svmmodel, newdata = mytest))
# There are some values below zero.  Fix these.
svmpredict[svmpredict < 0] <- 0
bikeevaluate(mytest, svmpredict)
## [1] 0.8087

Our score for the SVM was 0.8087. Not an improvement; will not submit.
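
In fairness to the SVM, I used e1071's defaults, and cost and gamma usually matter a great deal. A sketch of a cross-validated grid search with tune.svm (not run here; it would be slow on 5,000+ rows):

# Coarse grid over cost and gamma; tune.svm cross-validates each combination.
tuned <- tune.svm(fol, data = mytrain, cost = 10^(-1:2), gamma = 10^(-2:0))
summary(tuned)
svmbest <- tuned$best.model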

Random Forest on Casual and Registered

Based on some of our early graphs and intuition, casual and registered riders have different usage patterns. I will model the two variables separately, then sum the predictions. I will use the random forest, since that has given us the best model so far.

folcasual = formula(casual ~ season + holiday + weather + temp + 
                        atemp + humidity +  windspeed + hour + weekday)
folreg = formula(registered ~ season + holiday + weather + temp + 
                        atemp + humidity +  windspeed + hour + weekday)
rfcasmodel <- randomForest(folcasual, data = mytrain)
importance(rfcasmodel)
##           IncNodePurity
## season           554769
## holiday          200394
## weather          180588
## temp            1736250
## atemp           2087236
## humidity        1186250
## windspeed        420962
## hour            4438091
## weekday         2844337
rfregmodel <- randomForest(folreg, data = mytrain)
importance(rfregmodel)
##           IncNodePurity
## season          5069385
## holiday          570335
## weather         2229766
## temp            8576040
## atemp           8608875
## humidity        9330442
## windspeed       4261755
## hour           66744637
## weekday        11982365
rfcaspred <- predict(rfcasmodel, newdata = mytest)
rfregpred <- predict(rfregmodel, newdata = mytest)
rfcombpred <- round(rfcaspred + rfregpred)
bikeevaluate(mytest, rfcombpred)
## [1] 0.6617
# Score 0.66169, a slight improvement

# Will create model for submission.
rfcasmodel <- randomForest(folcasual, data = train)
rfregmodel <- randomForest(folreg, data = train)
rfcaspred <- predict(rfcasmodel, newdata = test)
rfregpred <- predict(rfregmodel, newdata = test)
rfcombpred <- round(rfcaspred + rfregpred)
writeout(test, rfcombpred)

Kaggle score: 0.64381. Moved up Kaggle board by 10 positions.

More visualization with dplyr

I'm not liking how this is going. I would like to visualize the data a bit more.

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
hourriders <- group_by(train, hour)
hrridertrain <- summarise(hourriders, sum = sum(count), mean = mean(count))
barplot(hrridertrain$mean)

plot of chunk using dplyr

# Makes me wonder what happens if we just predict each hour's mean count.
pred <- sapply(mytest$hour, FUN = function(x) 
    { round(hrridertrain[which(hrridertrain$hour == x),3])})
bikeevaluate(mytest, pred)
## [1] 0.8167

Ok that was a step backward, but still not as bad as we started out. Score of 0.81667 would move us back down the board quite a bit though.
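
As an aside, that sapply lookup could be written more idiomatically as a join; a sketch with dplyr that should give the same predictions, assuming hour matches up cleanly as the join key:

# Look up each test row's hourly mean with a join instead of sapply.
hrmeans <- summarise(group_by(train, hour), hourmean = mean(count))
pred2 <- round(left_join(mytest, hrmeans, by = 'hour')$hourmean)
bikeevaluate(mytest, pred2)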

Variable Transformations

The distribution of riders is strongly right-skewed. When plotting, scale_x_sqrt seemed to make the distribution more normal; scale_x_log10 was also good, but not quite as good. I'll transform the response variables and see if that helps.

mytrain$cassqrt = sqrt(mytrain$casual)
mytrain$regsqrt = sqrt(mytrain$registered)
qplot(x = cassqrt + regsqrt, data = mytrain, fill = as.factor(season)) 
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk scale variables

folcasual = formula(cassqrt ~ season + holiday + weather + temp + 
                        atemp + humidity +  windspeed + hour + weekday)
folreg = formula(regsqrt ~ season + holiday + weather + temp + 
                        atemp + humidity +  windspeed + hour + weekday)
rfcasmodel <- randomForest(folcasual, data = mytrain)
importance(rfcasmodel)
##           IncNodePurity
## season           3077.0
## holiday           566.9
## weather           913.4
## temp             9849.0
## atemp           10823.0
## humidity         5809.5
## windspeed        1662.9
## hour            29528.1
## weekday          8942.2
rfregmodel <- randomForest(folreg, data = mytrain)
importance(rfregmodel)
##           IncNodePurity
## season           7614.3
## holiday           642.3
## weather          3108.0
## temp            13199.0
## atemp           12347.1
## humidity        14771.0
## windspeed        5645.8
## hour           121796.8
## weekday         12975.6
rfcaspred <- predict(rfcasmodel, newdata = mytest)
rfregpred <- predict(rfregmodel, newdata = mytest)
rfcombpred <- round((rfcaspred^2 + rfregpred^2))
bikeevaluate(mytest, rfcombpred)
## [1] 0.5126

My score dropped to 0.5126 (the predictions are on the square-root scale, so they are squared before summing). Wow, that was a big improvement. Before submitting, I wonder if a log10 transformation can do even better for us.

mytrain$caslog10 = log10(mytrain$casual + 1)
mytrain$reglog10 = log10(mytrain$registered + 1)
qplot(x = caslog10 + reglog10, data = mytrain, fill = as.factor(season)) 
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk log10 transformation

folcasuallog10 = formula(caslog10 ~ season + holiday + weather + temp + 
                        atemp + humidity +  windspeed + hour + weekday)
folreglog10 = formula(reglog10 ~ season + holiday + weather + temp + 
                        atemp + humidity +  windspeed + hour + weekday)
rfcasmodel <- randomForest(folcasuallog10, data = mytrain)
importance(rfcasmodel)
##           IncNodePurity
## season           94.170
## holiday           7.579
## weather          34.375
## temp            322.099
## atemp           312.489
## humidity        178.826
## windspeed        52.948
## hour           1064.330
## weekday         160.542
rfregmodel <- randomForest(folreglog10, data = mytrain)
importance(rfregmodel)
##           IncNodePurity
## season           58.084
## holiday           4.935
## weather          25.228
## temp            117.456
## atemp           106.116
## humidity        145.553
## windspeed        48.023
## hour           1338.902
## weekday         110.755
rfcaspred <- predict(rfcasmodel, newdata = mytest)
rfregpred <- predict(rfregmodel, newdata = mytest)
rfcombpred <- round((10^rfcaspred + 10^rfregpred - 2))
bikeevaluate(mytest, rfcombpred)
## [1] 0.472

The score on that was 0.472! Cool! (Each model predicts log10(y + 1), so the inverse transform is 10^pred - 1; summing the casual and registered inverses is where the - 2 in the code comes from.) I will fit a full model now and submit that.

train$caslog10 = log10(train$casual + 1)
train$reglog10 = log10(train$registered + 1)
folcasuallog10 = formula(caslog10 ~ season + holiday + weather + temp + 
                        atemp + humidity +  windspeed + hour + weekday)
folreglog10 = formula(reglog10 ~ season + holiday + weather + temp + 
                        atemp + humidity +  windspeed + hour + weekday)
rfcasmodel <- randomForest(folcasuallog10, data = train)
rfregmodel <- randomForest(folreglog10, data = train)
rfcaspred <- predict(rfcasmodel, newdata = test)
rfregpred <- predict(rfregmodel, newdata = test)
rfcombpred <- round(10^rfcaspred + 10^rfregpred - 2)
writeout(test, rfcombpred)

Kaggle score: 0.47005. Moved 201 positions up the leaderboard!

Use gbm

Some people on the forums are reporting good results with the gbm package. I'll admit I don't really understand it yet, but I'll try it anyway.

library(gbm)
## Loading required package: survival
## Loading required package: splines
## 
## Attaching package: 'survival'
## 
## The following object is masked from 'package:caret':
## 
##     cluster
## 
## Loading required package: parallel
## Loaded gbm 2.1
gbmmodcas <- gbm(folcasuallog10, data = mytrain, n.trees = 1000)
## Distribution not specified, assuming gaussian ...
gbmmodreg <- gbm(folreglog10, data = mytrain, n.trees = 1000)
## Distribution not specified, assuming gaussian ...
summary(gbmmodcas)

plot of chunk using gbm

##                 var rel.inf
## hour           hour  83.604
## temp           temp  13.931
## atemp         atemp   2.466
## season       season   0.000
## holiday     holiday   0.000
## weather     weather   0.000
## humidity   humidity   0.000
## windspeed windspeed   0.000
## weekday     weekday   0.000
summary(gbmmodreg)

plot of chunk using gbm

##                 var rel.inf
## hour           hour     100
## season       season       0
## holiday     holiday       0
## weather     weather       0
## temp           temp       0
## atemp         atemp       0
## humidity   humidity       0
## windspeed windspeed       0
## weekday     weekday       0
gbmcaspred <- predict(gbmmodcas, mytest, n.trees = 1000)
gbmregpred <- predict(gbmmodreg, mytest, n.trees = 1000)
gbmcombpred <- round((10^gbmcaspred + 10^gbmregpred - 2))
bikeevaluate(mytest, gbmcombpred)
## [1] 0.9782

That was not an improvement: 0.9782. I'm not sure how to tune these parameters; I might learn later, but not for this project.
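
The relative-influence tables hint at why: with gbm's defaults (depth-1 trees and a tiny shrinkage), 1000 iterations barely learn anything beyond hour. A sketch of the kind of settings forum posters describe; the values here are illustrative guesses, not tuned:

# Deeper trees, more of them, modest shrinkage, and cross-validation to
# pick the number of iterations.
gbmmodcas2 <- gbm(folcasuallog10, data = mytrain, distribution = 'gaussian',
                  n.trees = 5000, interaction.depth = 5, shrinkage = 0.05,
                  cv.folds = 5)
best.iter <- gbm.perf(gbmmodcas2, method = 'cv')
gbmcaspred2 <- predict(gbmmodcas2, mytest, n.trees = best.iter)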

Do a Linear Regression

Now I want to see if I can create a linear regression. I'll use the same variables as before. I'm not really expecting this to work very well for this data set, but I'm curious.

lmodelcaslog10 <- lm(folcasuallog10, data = mytrain)
summary(lmodelcaslog10)
## 
## Call:
## lm(formula = folcasuallog10, data = mytrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3533 -0.1681  0.0231  0.1886  1.1455 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.254332   0.029907    8.50  < 2e-16 ***
## seasonsummer      0.210303   0.013935   15.09  < 2e-16 ***
## seasonfall        0.113507   0.017919    6.33  2.6e-10 ***
## seasonwinter      0.192117   0.011631   16.52  < 2e-16 ***
## holidayHoliday    0.179359   0.024205    7.41  1.5e-13 ***
## weatherMist      -0.031708   0.009338   -3.40  0.00069 ***
## weatherLightSnow -0.236134   0.015642  -15.10  < 2e-16 ***
## temp              0.017561   0.003118    5.63  1.9e-08 ***
## atemp             0.013891   0.002733    5.08  3.9e-07 ***
## humidity         -0.002195   0.000267   -8.21  2.7e-16 ***
## windspeed        -0.001279   0.000511   -2.50  0.01241 *  
## hour01           -0.164002   0.025728   -6.37  2.0e-10 ***
## hour02           -0.276633   0.025528  -10.84  < 2e-16 ***
## hour03           -0.453900   0.026188  -17.33  < 2e-16 ***
## hour04           -0.557376   0.025787  -21.61  < 2e-16 ***
## hour05           -0.483963   0.025988  -18.62  < 2e-16 ***
## hour06           -0.224915   0.025865   -8.70  < 2e-16 ***
## hour07            0.165859   0.026117    6.35  2.3e-10 ***
## hour08            0.420088   0.026303   15.97  < 2e-16 ***
## hour09            0.496089   0.025761   19.26  < 2e-16 ***
## hour10            0.579747   0.026254   22.08  < 2e-16 ***
## hour11            0.633503   0.025988   24.38  < 2e-16 ***
## hour12            0.661421   0.026302   25.15  < 2e-16 ***
## hour13            0.668594   0.026418   25.31  < 2e-16 ***
## hour14            0.658856   0.026443   24.92  < 2e-16 ***
## hour15            0.644751   0.027023   23.86  < 2e-16 ***
## hour16            0.677206   0.026949   25.13  < 2e-16 ***
## hour17            0.714740   0.026315   27.16  < 2e-16 ***
## hour18            0.625423   0.026423   23.67  < 2e-16 ***
## hour19            0.545524   0.025730   21.20  < 2e-16 ***
## hour20            0.443726   0.025944   17.10  < 2e-16 ***
## hour21            0.378373   0.025651   14.75  < 2e-16 ***
## hour22            0.286962   0.025518   11.25  < 2e-16 ***
## hour23            0.184370   0.025949    7.11  1.4e-12 ***
## weekdayMonday    -0.062229   0.014584   -4.27  2.0e-05 ***
## weekdaySaturday   0.257376   0.014098   18.26  < 2e-16 ***
## weekdaySunday     0.227481   0.014136   16.09  < 2e-16 ***
## weekdayThursday  -0.104049   0.014168   -7.34  2.4e-13 ***
## weekdayTuesday   -0.114946   0.014296   -8.04  1.1e-15 ***
## weekdayWednesday -0.130036   0.014333   -9.07  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.278 on 5404 degrees of freedom
## Multiple R-squared:  0.817,  Adjusted R-squared:  0.816 
## F-statistic:  619 on 39 and 5404 DF,  p-value: <2e-16
lmodelreglog10 <- lm(folreglog10, data = mytrain)
summary(lmodelreglog10)
## 
## Call:
## lm(formula = folreglog10, data = mytrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4431 -0.1596  0.0095  0.1742  1.0190 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.323486   0.029947   44.19  < 2e-16 ***
## seasonsummer      0.116332   0.013954    8.34  < 2e-16 ***
## seasonfall        0.094048   0.017943    5.24  1.7e-07 ***
## seasonwinter      0.237493   0.011646   20.39  < 2e-16 ***
## holidayHoliday   -0.040532   0.024238   -1.67  0.09453 .  
## weatherMist      -0.005210   0.009351   -0.56  0.57742    
## weatherLightSnow -0.187161   0.015663  -11.95  < 2e-16 ***
## temp              0.010694   0.003122    3.43  0.00062 ***
## atemp             0.003481   0.002737    1.27  0.20348    
## humidity         -0.001624   0.000268   -6.07  1.4e-09 ***
## windspeed        -0.001125   0.000512   -2.20  0.02799 *  
## hour01           -0.247912   0.025763   -9.62  < 2e-16 ***
## hour02           -0.479307   0.025563  -18.75  < 2e-16 ***
## hour03           -0.655351   0.026224  -24.99  < 2e-16 ***
## hour04           -0.782353   0.025822  -30.30  < 2e-16 ***
## hour05           -0.325404   0.026024  -12.50  < 2e-16 ***
## hour06            0.147894   0.025900    5.71  1.2e-08 ***
## hour07            0.608705   0.026152   23.28  < 2e-16 ***
## hour08            0.873226   0.026339   33.15  < 2e-16 ***
## hour09            0.682984   0.025796   26.48  < 2e-16 ***
## hour10            0.502750   0.026290   19.12  < 2e-16 ***
## hour11            0.538375   0.026023   20.69  < 2e-16 ***
## hour12            0.620234   0.026338   23.55  < 2e-16 ***
## hour13            0.606185   0.026454   22.91  < 2e-16 ***
## hour14            0.543453   0.026479   20.52  < 2e-16 ***
## hour15            0.579041   0.027060   21.40  < 2e-16 ***
## hour16            0.728319   0.026985   26.99  < 2e-16 ***
## hour17            0.920040   0.026350   34.92  < 2e-16 ***
## hour18            0.905684   0.026459   34.23  < 2e-16 ***
## hour19            0.783554   0.025765   30.41  < 2e-16 ***
## hour20            0.645705   0.025980   24.85  < 2e-16 ***
## hour21            0.541596   0.025685   21.09  < 2e-16 ***
## hour22            0.420025   0.025553   16.44  < 2e-16 ***
## hour23            0.257351   0.025984    9.90  < 2e-16 ***
## weekdayMonday    -0.060833   0.014604   -4.17  3.2e-05 ***
## weekdaySaturday  -0.042303   0.014118   -3.00  0.00274 ** 
## weekdaySunday    -0.070909   0.014155   -5.01  5.6e-07 ***
## weekdayThursday  -0.018973   0.014187   -1.34  0.18117    
## weekdayTuesday   -0.070692   0.014316   -4.94  8.1e-07 ***
## weekdayWednesday -0.048199   0.014353   -3.36  0.00079 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.278 on 5404 degrees of freedom
## Multiple R-squared:  0.792,  Adjusted R-squared:  0.79 
## F-statistic:  527 on 39 and 5404 DF,  p-value: <2e-16
predlmcaslog10 <- predict(lmodelcaslog10, newdata = mytest)
## Error: factor weather has new levels HeavyRain
predlmreglog10 <- predict(lmodelreglog10, newdata = mytest)
## Error: factor weather has new levels HeavyRain
predlm <- round(10^predlmcaslog10 + 10^predlmreglog10 -2)
## Error: object 'predlmcaslog10' not found
bikeevaluate(mytest, predlm)
## Error: object 'predlm' not found

The prediction step errors out: my half-split happened to leave mytrain with no HeavyRain hours, so lm never saw that weather level, and predict refuses the HeavyRain rows in mytest. (The decent score of 0.6246 I noted must have come from an earlier run with a luckier split.) Not worth submitting in any case.
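
A sketch of one workaround, assuming it's acceptable to fold the rare HeavyRain level into its nearest neighbor so both halves of the split share the same factor levels (hypothetical; I haven't verified the score):

# Collapse HeavyRain into LightSnow in copies of the data, refit, predict.
mytrain2 <- mytrain
mytest2 <- mytest
levels(mytrain2$weather)[levels(mytrain2$weather) == 'HeavyRain'] <- 'LightSnow'
levels(mytest2$weather)[levels(mytest2$weather) == 'HeavyRain'] <- 'LightSnow'
lmcas2 <- lm(folcasuallog10, data = mytrain2)
lmreg2 <- lm(folreglog10, data = mytrain2)
predlm2 <- round(10^predict(lmcas2, newdata = mytest2) +
                 10^predict(lmreg2, newdata = mytest2) - 2)
bikeevaluate(mytest2, predlm2)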

Model Tree

The final thing I want to try is a model tree using the RWeka package.

library(RWeka)
mtcasmod <- M5P(folcasuallog10, data = mytrain)
mtregmod <- M5P(folreglog10, data = mytrain)
summary(mtcasmod)
## 
## === Summary ===
## 
## Correlation coefficient                  0.9358
## Mean absolute error                      0.1759
## Root mean squared error                  0.2282
## Relative absolute error                 32.3061 %
## Root relative squared error             35.2632 %
## Total Number of Instances             5444
summary(mtregmod)
## 
## === Summary ===
## 
## Correlation coefficient                  0.9506
## Mean absolute error                      0.1447
## Root mean squared error                  0.1888
## Relative absolute error                 29.4251 %
## Root relative squared error             31.0744 %
## Total Number of Instances             5444
mtcaspred <- predict(mtcasmod, newdata = mytest)
mtregpred <- predict(mtregmod, newdata = mytest)
predmt <- round(10^mtcaspred + 10^mtregpred -2)
bikeevaluate(mytest, predmt)
## [1] 0.4678

Another slight improvement, down to 0.4678! Now I will train on the full set and turn this in.

mtcasmod <- M5P(folcasuallog10, data = train)
mtregmod <- M5P(folreglog10, data = train)
summary(mtcasmod)
## 
## === Summary ===
## 
## Correlation coefficient                  0.9362
## Mean absolute error                      0.175 
## Root mean squared error                  0.2276
## Relative absolute error                 32.1536 %
## Root relative squared error             35.1557 %
## Total Number of Instances            10886
summary(mtregmod)
## 
## === Summary ===
## 
## Correlation coefficient                  0.9581
## Mean absolute error                      0.1338
## Root mean squared error                  0.1741
## Relative absolute error                 27.3302 %
## Root relative squared error             28.6499 %
## Total Number of Instances            10886
mtcaspred <- predict(mtcasmod, newdata = test)
mtregpred <- predict(mtregmod, newdata = test)
predmt <- round(10^mtcaspred + 10^mtregpred -2)
writeout(test, predmt)

Kaggle score: 0.45423. Moved up 32 spots!

At the moment, I am ranked #76 out of 537. I want to keep toying with the variables and the models, but this will have to do for the purposes of this assignment, so I can turn in my report. This was fun!

Suggestions for improvement

At this point, that's about all the types of machine learning algorithms I am familiar with or can easily look up. There are likely endless tweaks I could make to improve the model. Here are a few I can think of.