Tyler Byers
15 Aug 2015
This project is being done as part of the Coursera Intro to Data Science class by the University of Washington. This is the class Peer Review project, and is an open-ended assignment where we get to participate in a Kaggle competition.
I chose the Bike Sharing Demand competition. The website and problem description are at: http://www.kaggle.com/c/bike-sharing-demand
datetime - hourly date + timestamp
season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday - whether the day is considered a holiday
workingday - whether the day is neither a weekend nor holiday
weather - 1: Clear, Few clouds, Partly cloudy, Partly cloudy 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
temp - temperature in Celsius
atemp - “feels like” temperature in Celsius
humidity - relative humidity
windspeed - wind speed
casual - number of non-registered user rentals initiated
registered - number of registered user rentals initiated
count - number of total rentals
train <- read.csv('train.csv')
test <- read.csv('test.csv')
summary(train)
## datetime season holiday
## 2011-01-01 00:00:00: 1 Min. :1.00 Min. :0.0000
## 2011-01-01 01:00:00: 1 1st Qu.:2.00 1st Qu.:0.0000
## 2011-01-01 02:00:00: 1 Median :3.00 Median :0.0000
## 2011-01-01 03:00:00: 1 Mean :2.51 Mean :0.0286
## 2011-01-01 04:00:00: 1 3rd Qu.:4.00 3rd Qu.:0.0000
## 2011-01-01 05:00:00: 1 Max. :4.00 Max. :1.0000
## (Other) :10880
## workingday weather temp atemp
## Min. :0.000 Min. :1.00 Min. : 0.82 Min. : 0.76
## 1st Qu.:0.000 1st Qu.:1.00 1st Qu.:13.94 1st Qu.:16.66
## Median :1.000 Median :1.00 Median :20.50 Median :24.24
## Mean :0.681 Mean :1.42 Mean :20.23 Mean :23.66
## 3rd Qu.:1.000 3rd Qu.:2.00 3rd Qu.:26.24 3rd Qu.:31.06
## Max. :1.000 Max. :4.00 Max. :41.00 Max. :45.45
##
## humidity windspeed casual registered count
## Min. : 0.0 Min. : 0.0 Min. : 0 Min. : 0 Min. : 1
## 1st Qu.: 47.0 1st Qu.: 7.0 1st Qu.: 4 1st Qu.: 36 1st Qu.: 42
## Median : 62.0 Median :13.0 Median : 17 Median :118 Median :145
## Mean : 61.9 Mean :12.8 Mean : 36 Mean :156 Mean :192
## 3rd Qu.: 77.0 3rd Qu.:17.0 3rd Qu.: 49 3rd Qu.:222 3rd Qu.:284
## Max. :100.0 Max. :57.0 Max. :367 Max. :886 Max. :977
##
str(train)
## 'data.frame': 10886 obs. of 12 variables:
## $ datetime : Factor w/ 10886 levels "2011-01-01 00:00:00",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
## $ weather : int 1 1 1 1 1 2 1 1 1 1 ...
## $ temp : num 9.84 9.02 9.02 9.84 9.84 ...
## $ atemp : num 14.4 13.6 13.6 14.4 14.4 ...
## $ humidity : int 81 80 80 75 75 75 80 86 75 76 ...
## $ windspeed : num 0 0 0 0 0 ...
## $ casual : int 3 8 5 3 0 0 2 1 1 8 ...
## $ registered: int 13 32 27 10 1 1 0 2 7 6 ...
## $ count : int 16 40 32 13 1 1 2 3 8 14 ...
summary(test)
## datetime season holiday
## 2011-01-20 00:00:00: 1 Min. :1.00 Min. :0.0000
## 2011-01-20 01:00:00: 1 1st Qu.:2.00 1st Qu.:0.0000
## 2011-01-20 02:00:00: 1 Median :3.00 Median :0.0000
## 2011-01-20 03:00:00: 1 Mean :2.49 Mean :0.0291
## 2011-01-20 04:00:00: 1 3rd Qu.:3.00 3rd Qu.:0.0000
## 2011-01-20 05:00:00: 1 Max. :4.00 Max. :1.0000
## (Other) :6487
## workingday weather temp atemp
## Min. :0.000 Min. :1.00 Min. : 0.82 Min. : 0.0
## 1st Qu.:0.000 1st Qu.:1.00 1st Qu.:13.94 1st Qu.:16.7
## Median :1.000 Median :1.00 Median :21.32 Median :25.0
## Mean :0.686 Mean :1.44 Mean :20.62 Mean :24.0
## 3rd Qu.:1.000 3rd Qu.:2.00 3rd Qu.:27.06 3rd Qu.:31.1
## Max. :1.000 Max. :4.00 Max. :40.18 Max. :50.0
##
## humidity windspeed
## Min. : 16.0 Min. : 0.0
## 1st Qu.: 49.0 1st Qu.: 7.0
## Median : 65.0 Median :11.0
## Mean : 64.1 Mean :12.6
## 3rd Qu.: 81.0 3rd Qu.:17.0
## Max. :100.0 Max. :56.0
##
str(test)
## 'data.frame': 6493 obs. of 9 variables:
## $ datetime : Factor w/ 6493 levels "2011-01-20 00:00:00",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ workingday: int 1 1 1 1 1 1 1 1 1 1 ...
## $ weather : int 1 1 1 1 1 1 1 1 1 2 ...
## $ temp : num 10.7 10.7 10.7 10.7 10.7 ...
## $ atemp : num 11.4 13.6 13.6 12.9 12.9 ...
## $ humidity : int 56 56 56 56 56 60 60 55 55 52 ...
## $ windspeed : num 26 0 0 11 11 ...
It's a very clean data set. Going to change datetime from factor to datetime variable, and then split on date and time.
test[c('casual', 'registered', 'count')] <- NA
alldata <- rbind(train, test)
summary(alldata)
## datetime season holiday
## 2011-01-01 00:00:00: 1 Min. :1.0 Min. :0.0000
## 2011-01-01 01:00:00: 1 1st Qu.:2.0 1st Qu.:0.0000
## 2011-01-01 02:00:00: 1 Median :3.0 Median :0.0000
## 2011-01-01 03:00:00: 1 Mean :2.5 Mean :0.0288
## 2011-01-01 04:00:00: 1 3rd Qu.:3.0 3rd Qu.:0.0000
## 2011-01-01 05:00:00: 1 Max. :4.0 Max. :1.0000
## (Other) :17373
## workingday weather temp atemp
## Min. :0.000 Min. :1.00 Min. : 0.82 Min. : 0.0
## 1st Qu.:0.000 1st Qu.:1.00 1st Qu.:13.94 1st Qu.:16.7
## Median :1.000 Median :1.00 Median :20.50 Median :24.2
## Mean :0.683 Mean :1.43 Mean :20.38 Mean :23.8
## 3rd Qu.:1.000 3rd Qu.:2.00 3rd Qu.:27.06 3rd Qu.:31.1
## Max. :1.000 Max. :4.00 Max. :41.00 Max. :50.0
##
## humidity windspeed casual registered
## Min. : 0.0 Min. : 0.0 Min. : 0 Min. : 0
## 1st Qu.: 48.0 1st Qu.: 7.0 1st Qu.: 4 1st Qu.: 36
## Median : 63.0 Median :13.0 Median : 17 Median :118
## Mean : 62.7 Mean :12.7 Mean : 36 Mean :156
## 3rd Qu.: 78.0 3rd Qu.:17.0 3rd Qu.: 49 3rd Qu.:222
## Max. :100.0 Max. :57.0 Max. :367 Max. :886
## NA's :6493 NA's :6493
## count
## Min. : 1
## 1st Qu.: 42
## Median :145
## Mean :192
## 3rd Qu.:284
## Max. :977
## NA's :6493
alldata$datetime <- strftime(as.character(alldata$datetime), "%Y-%m-%d %H:%M:%S")
str(alldata$datetime)
## chr [1:17379] "2011-01-01 00:00:00" "2011-01-01 01:00:00" ...
alldata$date <- strftime(alldata$datetime, format = '%Y-%m-%d')
alldata$hour <- strftime(alldata$datetime, format = '%H')
train <- alldata[1:10886,]
test <- alldata[10887:17379,]
library(ggplot2)
library(gridExtra)
## Loading required package: grid
qplot(x=count, data = train, fill = as.factor(season))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
p1 <- qplot(x = count, data = train, fill = as.factor(season)) +
scale_x_sqrt()
p2 <- qplot(x = casual, data = train, fill = as.factor(season)) +
scale_x_sqrt()
p3 <- qplot(x = registered, data = train, fill = as.factor(season)) +
scale_x_sqrt()
grid.arrange(p1, p2, p3)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Registered riders are much more consistent – casual riders as expected are more likely in spring and summer.
qplot(x=count, data = train, fill = as.factor((weather)))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
ggplot(aes(x=temp, y = count), data = train) + geom_jitter() +
geom_smooth(method = 'lm', color = 'red')
cor.test(train$temp, train$count)
##
## Pearson's product-moment correlation
##
## data: train$temp and train$count
## t = 44.78, df = 10884, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3785 0.4102
## sample estimates:
## cor
## 0.3945
Some slight correlation – 0.3944 between temp and count. I actually want to do a quick model using just temp to see if I can estimate what my Kaggle score will be.
library(caret)
## Loading required package: lattice
set.seed(815)
trainIndex <- createDataPartition(1:nrow(train), p = 0.5,
list = FALSE, times = 1)
mytrain <- train[trainIndex,]
mytest <- train[-trainIndex,]
cor.test(mytrain$temp, mytrain$count)
##
## Pearson's product-moment correlation
##
## data: mytrain$temp and mytrain$count
## t = 30.81, df = 5442, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3625 0.4078
## sample estimates:
## cor
## 0.3854
That correlation was roughly the same as my last one.
model <- lm(count ~ temp, data = mytrain)
p <- round(predict(model, newdata = mytest))
head(p)
## 1 3 4 5 6 9
## 98 91 98 98 98 98
Create psuedo evaluation function
bikeevaluate <- function (data, pred) {
return(sqrt(1/nrow(data)*sum((log(pred+1)-log(data$count+1))^2)))
}
Evaluate our simple model
bikeevaluate(mytest, p)
## [1] 1.449
# Now need to run model using the test
p <- round(predict(model, newdata = test))
head(p)
## 10887 10888 10889 10890 10891 10892
## 106 106 106 106 106 98
Estimating Kaggle score approx 1.45
output <- subset(test, select = 'datetime')
output$count <- p
outname <- paste0('results/out_',format(Sys.time(), '%Y%m%d_%H%M%S'),
'.csv')
write.csv(output, outname, quote = FALSE, row.names = FALSE)
Kaggle output: 1.4523. Ok – have a good estimator.
I don't want to think very hard right now, so I'm going to try to make an rpart model using most of the variables.
library(rpart)
fol <- formula(count ~ season + holiday + workingday + weather + temp + atemp +
humidity + windspeed)
mymodel <- rpart(fol, data = mytrain)
print(mymodel)
## n= 5444
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 5444 180300000 192.80
## 2) atemp< 29.93 3820 91070000 150.10
## 4) temp< 12.71 1055 12850000 90.58 *
## 5) temp>=12.71 2765 73040000 172.90
## 10) humidity>=65.5 1527 30350000 131.10 *
## 11) humidity< 65.5 1238 36740000 224.40
## 22) season< 2.5 746 15840000 184.20 *
## 23) season>=2.5 492 17860000 285.40 *
## 3) atemp>=29.93 1624 65900000 293.30
## 6) humidity>=64.5 569 17310000 202.70 *
## 7) humidity< 64.5 1055 41400000 342.10
## 14) workingday>=0.5 745 31030000 315.20 *
## 15) workingday< 0.5 310 8543000 406.60 *
mypred <- round(predict(mymodel, newdata = mytest))
bikeevaluate(mytest, mypred)
## [1] 1.385
# That would get us to 1.393, which is a very slight move up the leaderboard. Not worth submitting at this time.
There are several variables that really should be factor variables rather than numerical variables. Also I want to create a variable for what day of week it is, rather than just the weekday.
names(alldata)
## [1] "datetime" "season" "holiday" "workingday" "weather"
## [6] "temp" "atemp" "humidity" "windspeed" "casual"
## [11] "registered" "count" "date" "hour"
alldata$season <- factor(alldata$season, labels = c('spring', 'summer',
'fall', 'winter'))
alldata$weather <- factor(alldata$weather,
labels = c('Clear', 'Mist', 'LightSnow', 'HeavyRain'))
alldata$holiday <- factor(alldata$holiday, labels = c('Not','Holiday'))
alldata$hour <- factor(alldata$hour)
alldata$weekday <- weekdays(as.Date(alldata$date, '%Y-%m-%d'))
alldata$weekday <- factor(alldata$weekday)
str(alldata)
## 'data.frame': 17379 obs. of 15 variables:
## $ datetime : chr "2011-01-01 00:00:00" "2011-01-01 01:00:00" "2011-01-01 02:00:00" "2011-01-01 03:00:00" ...
## $ season : Factor w/ 4 levels "spring","summer",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ holiday : Factor w/ 2 levels "Not","Holiday": 1 1 1 1 1 1 1 1 1 1 ...
## $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
## $ weather : Factor w/ 4 levels "Clear","Mist",..: 1 1 1 1 1 2 1 1 1 1 ...
## $ temp : num 9.84 9.02 9.02 9.84 9.84 ...
## $ atemp : num 14.4 13.6 13.6 14.4 14.4 ...
## $ humidity : int 81 80 80 75 75 75 80 86 75 76 ...
## $ windspeed : num 0 0 0 0 0 ...
## $ casual : int 3 8 5 3 0 0 2 1 1 8 ...
## $ registered: int 13 32 27 10 1 1 0 2 7 6 ...
## $ count : int 16 40 32 13 1 1 2 3 8 14 ...
## $ date : chr "2011-01-01" "2011-01-01" "2011-01-01" "2011-01-01" ...
## $ hour : Factor w/ 24 levels "00","01","02",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ weekday : Factor w/ 7 levels "Friday","Monday",..: 3 3 3 3 3 3 3 3 3 3 ...
Done messing with data for now, re-split into test, training, mytest, and mytraining sets.
train <- alldata[1:10886,]
test <- alldata[10887:17379,]
set.seed(820)
trainIndex <- createDataPartition(1:nrow(train), p = 0.5,
list = FALSE, times = 1)
mytrain <- train[trainIndex,]
mytest <- train[-trainIndex,]
cor.test(mytrain$temp, mytrain$count)
##
## Pearson's product-moment correlation
##
## data: mytrain$temp and mytrain$count
## t = 32, df = 5442, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3753 0.4201
## sample estimates:
## cor
## 0.3979
Will try to rpart again now with some factor variables. Wonder if that will help.
names(mytrain)
## [1] "datetime" "season" "holiday" "workingday" "weather"
## [6] "temp" "atemp" "humidity" "windspeed" "casual"
## [11] "registered" "count" "date" "hour" "weekday"
fol <- formula(count ~ season + holiday + weather + temp + atemp +
humidity + windspeed + hour + weekday)
mymodel <- rpart(fol, data = mytrain)
print(mymodel)
## n= 5444
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 5444 176900000 191.20
## 2) hour=00,01,02,03,04,05,06,22,23 2065 6675000 50.09
## 4) hour=00,01,02,03,04,05 1375 1296000 25.56 *
## 5) hour=06,22,23 690 2903000 98.97 *
## 3) hour=07,08,09,10,11,12,13,14,15,16,17,18,19,20,21 3379 104000000 277.50
## 6) hour=07,09,10,11,12,13,14,15,20,21 2260 39470000 225.30
## 12) temp< 19.27 989 11060000 165.70 *
## 13) temp>=19.27 1271 22160000 271.70
## 26) weekday=Friday,Monday,Thursday,Tuesday,Wednesday 920 9038000 238.80 *
## 27) weekday=Saturday,Sunday 351 9506000 358.10
## 54) hour=07,09,10,20,21 161 1508000 232.50 *
## 55) hour=11,12,13,14,15 190 3309000 464.50 *
## 7) hour=08,16,17,18,19 1119 45920000 382.80
## 14) temp< 17.63 389 10520000 258.00
## 28) season=spring,summer 236 4905000 202.70 *
## 29) season=winter 153 3785000 343.20 *
## 15) temp>=17.63 730 26110000 449.30
## 30) hour=08,16,19 420 10560000 391.00 *
## 31) hour=17,18 310 12180000 528.40
## 62) weekday=Saturday,Sunday 84 1498000 393.80 *
## 63) weekday=Friday,Monday,Thursday,Tuesday,Wednesday 226 8594000 578.40
## 126) atemp< 30.68 103 3397000 481.10 *
## 127) atemp>=30.68 123 3406000 659.90 *
mypred <- round(predict(mymodel, newdata = mytest))
bikeevaluate(mytest, mypred)
## [1] 0.8645
# That helped us a lot! Down to 0.86451
I want to create an output function though – tired of always typing it in
writeout <- function(testdata, pred){
output <- subset(testdata, select = 'datetime')
output$count <- pred
outname <- paste0('results/out_',format(Sys.time(), '%Y%m%d_%H%M%S'),
'.csv')
write.csv(output, outname, quote = FALSE, row.names = FALSE)
}
pred <- predict(mymodel, newdata = test)
writeout(test, pred)
Kaggle score: 0.87592. Moved up 93 positions.
Let's see how a Random Forest implementation will work on my model
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
rfmodel <- randomForest(fol, data = mytrain)
rfpredict <- round(predict(rfmodel, newdata = mytest))
bikeevaluate(mytest, rfpredict)
## [1] 0.6763
# New score: 0.67634
importance(rfmodel)
## IncNodePurity
## season 7581689
## holiday 703584
## weather 3107707
## temp 15610820
## atemp 17950570
## humidity 15577794
## windspeed 6298487
## hour 90590889
## weekday 13269494
rfmodel2 <- randomForest(fol, data = train)
importance(rfmodel2)
## IncNodePurity
## season 15806349
## holiday 1819964
## weather 6994547
## temp 30811097
## atemp 33209366
## humidity 30609371
## windspeed 12093003
## hour 183401318
## weekday 30028723
rfpredict2 <- round(predict(rfmodel2, newdata = test))
writeout(test, rfpredict2)
Actual Kaggle score: 0.66237. Moved up 53 positions.
Will an svm work?? Let's see.
library(e1071)
svmmodel <- svm(fol, data=mytrain)
svmpredict <- round(predict(svmmodel, newdata = mytest))
# There are some values below zero. Fix these.
svmpredict[svmpredict < 0] <- 0
bikeevaluate(mytest, svmpredict)
## [1] 0.8087
Our score for SVM was 0.8086. Not an improvement; will not submit.
Based on some of our early graphs and intuition, casual and registered riders have different usage patterns. Will model the two variables separately then combine them. Will use the Random Forest since that has given us the best model so far.
folcasual = formula(casual ~ season + holiday + weather + temp +
atemp + humidity + windspeed + hour + weekday)
folreg = formula(registered ~ season + holiday + weather + temp +
atemp + humidity + windspeed + hour + weekday)
rfcasmodel <- randomForest(folcasual, data = mytrain)
importance(rfcasmodel)
## IncNodePurity
## season 554769
## holiday 200394
## weather 180588
## temp 1736250
## atemp 2087236
## humidity 1186250
## windspeed 420962
## hour 4438091
## weekday 2844337
rfregmodel <- randomForest(folreg, data = mytrain)
importance(rfregmodel)
## IncNodePurity
## season 5069385
## holiday 570335
## weather 2229766
## temp 8576040
## atemp 8608875
## humidity 9330442
## windspeed 4261755
## hour 66744637
## weekday 11982365
rfcaspred <- predict(rfcasmodel, newdata = mytest)
rfregpred <- predict(rfregmodel, newdata = mytest)
rfcombpred <- round(rfcaspred + rfregpred)
bikeevaluate(mytest, rfcombpred)
## [1] 0.6617
# Score 0.66169, a slight improvement
# Will create model for submission.
rfcasmodel <- randomForest(folcasual, data = train)
rfregmodel <- randomForest(folreg, data = train)
rfcaspred <- predict(rfcasmodel, newdata = test)
rfregpred <- predict(rfregmodel, newdata = test)
rfcombpred <- round(rfcaspred + rfregpred)
writeout(test, rfcombpred)
Kaggle score: 0.64381. Moved up Kaggle board by 10 positions.
I'm not liking how this is going. I would like to visualize the data a bit more.
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
hourriders <- group_by(train, hour)
hrridertrain <- summarise(hourriders, sum = sum(count), mean = mean(count))
barplot(hrridertrain$mean)
#Makes me wonder what happens if we just put the mean counts into each hour.
pred <- sapply(mytest$hour, FUN = function(x)
{ round(hrridertrain[which(hrridertrain$hour == x),3])})
bikeevaluate(mytest, pred)
## [1] 0.8167
Ok that was a step backward, but still not as bad as we started out. Score of 0.81667 would move us back down the board quite a bit though.
The distribution of riders was strongly skewed right. I noticed when plotting that using a scale_x_sqrt seemed to make the distribution more normal. The scale_x_log10 was also good, but didn't seem quite as good. I'll scale the variables though and see if that helps.
mytrain$cassqrt = sqrt(mytrain$casual)
mytrain$regsqrt = sqrt(mytrain$registered)
qplot(x = cassqrt + regsqrt, data = mytrain, fill = as.factor(season))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
folcasual = formula(cassqrt ~ season + holiday + weather + temp +
atemp + humidity + windspeed + hour + weekday)
folreg = formula(regsqrt ~ season + holiday + weather + temp +
atemp + humidity + windspeed + hour + weekday)
rfcasmodel <- randomForest(folcasual, data = mytrain)
importance(rfcasmodel)
## IncNodePurity
## season 3077.0
## holiday 566.9
## weather 913.4
## temp 9849.0
## atemp 10823.0
## humidity 5809.5
## windspeed 1662.9
## hour 29528.1
## weekday 8942.2
rfregmodel <- randomForest(folreg, data = mytrain)
importance(rfregmodel)
## IncNodePurity
## season 7614.3
## holiday 642.3
## weather 3108.0
## temp 13199.0
## atemp 12347.1
## humidity 14771.0
## windspeed 5645.8
## hour 121796.8
## weekday 12975.6
rfcaspred <- predict(rfcasmodel, newdata = mytest)
rfregpred <- predict(rfregmodel, newdata = mytest)
rfcombpred <- round((rfcaspred^2 + rfregpred^2))
bikeevaluate(mytest, rfcombpred)
## [1] 0.5126
My score went down to 0.5093. Wow that was a big improvement. Before submitting, I wonder if a log10 transformation can do even better for us.
mytrain$caslog10 = log10(mytrain$casual + 1)
mytrain$reglog10 = log10(mytrain$registered + 1)
qplot(x = caslog10 + reglog10, data = mytrain, fill = as.factor(season))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
folcasuallog10 = formula(caslog10 ~ season + holiday + weather + temp +
atemp + humidity + windspeed + hour + weekday)
folreglog10 = formula(reglog10 ~ season + holiday + weather + temp +
atemp + humidity + windspeed + hour + weekday)
rfcasmodel <- randomForest(folcasuallog10, data = mytrain)
importance(rfcasmodel)
## IncNodePurity
## season 94.170
## holiday 7.579
## weather 34.375
## temp 322.099
## atemp 312.489
## humidity 178.826
## windspeed 52.948
## hour 1064.330
## weekday 160.542
rfregmodel <- randomForest(folreglog10, data = mytrain)
importance(rfregmodel)
## IncNodePurity
## season 58.084
## holiday 4.935
## weather 25.228
## temp 117.456
## atemp 106.116
## humidity 145.553
## windspeed 48.023
## hour 1338.902
## weekday 110.755
rfcaspred <- predict(rfcasmodel, newdata = mytest)
rfregpred <- predict(rfregmodel, newdata = mytest)
rfcombpred <- round((10^rfcaspred + 10^rfregpred - 2))
bikeevaluate(mytest, rfcombpred)
## [1] 0.472
The score on that was 0.471! Cool! I will do a full model now and submit that.
train$caslog10 = log10(train$casual + 1)
train$reglog10 = log10(train$registered + 1)
folcasuallog10 = formula(caslog10 ~ season + holiday + weather + temp +
atemp + humidity + windspeed + hour + weekday)
folreglog10 = formula(reglog10 ~ season + holiday + weather + temp +
atemp + humidity + windspeed + hour + weekday)
rfcasmodel <- randomForest(folcasuallog10, data = train)
rfregmodel <- randomForest(folreglog10, data = train)
rfcaspred <- predict(rfcasmodel, newdata = test)
rfregpred <- predict(rfregmodel, newdata = test)
rfcombpred <- round(10^rfcaspred + 10^rfregpred - 2)
writeout(test, rfcombpred)
Kaggle score: 0.47005. Moved 201 positions up the leaderboard!
Some people on the forums are reporting good results with gbm package. I'll admit I don't really understand it, but I'll try using it anyways.
library(gbm)
## Loading required package: survival
## Loading required package: splines
##
## Attaching package: 'survival'
##
## The following object is masked from 'package:caret':
##
## cluster
##
## Loading required package: parallel
## Loaded gbm 2.1
gbmmodcas <- gbm(folcasuallog10, data = mytrain, n.trees = 1000)
## Distribution not specified, assuming gaussian ...
gbmmodreg <- gbm(folreglog10, data = mytrain, n.trees = 1000)
## Distribution not specified, assuming gaussian ...
summary(gbmmodcas)
## var rel.inf
## hour hour 83.604
## temp temp 13.931
## atemp atemp 2.466
## season season 0.000
## holiday holiday 0.000
## weather weather 0.000
## humidity humidity 0.000
## windspeed windspeed 0.000
## weekday weekday 0.000
summary(gbmmodreg)
## var rel.inf
## hour hour 100
## season season 0
## holiday holiday 0
## weather weather 0
## temp temp 0
## atemp atemp 0
## humidity humidity 0
## windspeed windspeed 0
## weekday weekday 0
gbmcaspred <- predict(gbmmodcas, mytest, n.trees = 1000)
gbmregpred <- predict(gbmmodreg, mytest, n.trees = 1000)
gbmcombpred <- round((10^gbmcaspred + 10^gbmregpred - 2))
bikeevaluate(mytest, gbmcombpred)
## [1] 0.9782
That was not an improvement – 0.9779. I'm not sure how to tweak these parameters. Might learn later, but not for this project.
Now a want to see if I can create a linear regression. I'll use the same variables as before. Not really expecting this to work very well for this data set but I'm curious.
lmodelcaslog10 <- lm(folcasuallog10, data = mytrain)
summary(lmodelcaslog10)
##
## Call:
## lm(formula = folcasuallog10, data = mytrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3533 -0.1681 0.0231 0.1886 1.1455
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.254332 0.029907 8.50 < 2e-16 ***
## seasonsummer 0.210303 0.013935 15.09 < 2e-16 ***
## seasonfall 0.113507 0.017919 6.33 2.6e-10 ***
## seasonwinter 0.192117 0.011631 16.52 < 2e-16 ***
## holidayHoliday 0.179359 0.024205 7.41 1.5e-13 ***
## weatherMist -0.031708 0.009338 -3.40 0.00069 ***
## weatherLightSnow -0.236134 0.015642 -15.10 < 2e-16 ***
## temp 0.017561 0.003118 5.63 1.9e-08 ***
## atemp 0.013891 0.002733 5.08 3.9e-07 ***
## humidity -0.002195 0.000267 -8.21 2.7e-16 ***
## windspeed -0.001279 0.000511 -2.50 0.01241 *
## hour01 -0.164002 0.025728 -6.37 2.0e-10 ***
## hour02 -0.276633 0.025528 -10.84 < 2e-16 ***
## hour03 -0.453900 0.026188 -17.33 < 2e-16 ***
## hour04 -0.557376 0.025787 -21.61 < 2e-16 ***
## hour05 -0.483963 0.025988 -18.62 < 2e-16 ***
## hour06 -0.224915 0.025865 -8.70 < 2e-16 ***
## hour07 0.165859 0.026117 6.35 2.3e-10 ***
## hour08 0.420088 0.026303 15.97 < 2e-16 ***
## hour09 0.496089 0.025761 19.26 < 2e-16 ***
## hour10 0.579747 0.026254 22.08 < 2e-16 ***
## hour11 0.633503 0.025988 24.38 < 2e-16 ***
## hour12 0.661421 0.026302 25.15 < 2e-16 ***
## hour13 0.668594 0.026418 25.31 < 2e-16 ***
## hour14 0.658856 0.026443 24.92 < 2e-16 ***
## hour15 0.644751 0.027023 23.86 < 2e-16 ***
## hour16 0.677206 0.026949 25.13 < 2e-16 ***
## hour17 0.714740 0.026315 27.16 < 2e-16 ***
## hour18 0.625423 0.026423 23.67 < 2e-16 ***
## hour19 0.545524 0.025730 21.20 < 2e-16 ***
## hour20 0.443726 0.025944 17.10 < 2e-16 ***
## hour21 0.378373 0.025651 14.75 < 2e-16 ***
## hour22 0.286962 0.025518 11.25 < 2e-16 ***
## hour23 0.184370 0.025949 7.11 1.4e-12 ***
## weekdayMonday -0.062229 0.014584 -4.27 2.0e-05 ***
## weekdaySaturday 0.257376 0.014098 18.26 < 2e-16 ***
## weekdaySunday 0.227481 0.014136 16.09 < 2e-16 ***
## weekdayThursday -0.104049 0.014168 -7.34 2.4e-13 ***
## weekdayTuesday -0.114946 0.014296 -8.04 1.1e-15 ***
## weekdayWednesday -0.130036 0.014333 -9.07 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.278 on 5404 degrees of freedom
## Multiple R-squared: 0.817, Adjusted R-squared: 0.816
## F-statistic: 619 on 39 and 5404 DF, p-value: <2e-16
lmodelreglog10 <- lm(folreglog10, data = mytrain)
summary(lmodelreglog10)
##
## Call:
## lm(formula = folreglog10, data = mytrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.4431 -0.1596 0.0095 0.1742 1.0190
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.323486 0.029947 44.19 < 2e-16 ***
## seasonsummer 0.116332 0.013954 8.34 < 2e-16 ***
## seasonfall 0.094048 0.017943 5.24 1.7e-07 ***
## seasonwinter 0.237493 0.011646 20.39 < 2e-16 ***
## holidayHoliday -0.040532 0.024238 -1.67 0.09453 .
## weatherMist -0.005210 0.009351 -0.56 0.57742
## weatherLightSnow -0.187161 0.015663 -11.95 < 2e-16 ***
## temp 0.010694 0.003122 3.43 0.00062 ***
## atemp 0.003481 0.002737 1.27 0.20348
## humidity -0.001624 0.000268 -6.07 1.4e-09 ***
## windspeed -0.001125 0.000512 -2.20 0.02799 *
## hour01 -0.247912 0.025763 -9.62 < 2e-16 ***
## hour02 -0.479307 0.025563 -18.75 < 2e-16 ***
## hour03 -0.655351 0.026224 -24.99 < 2e-16 ***
## hour04 -0.782353 0.025822 -30.30 < 2e-16 ***
## hour05 -0.325404 0.026024 -12.50 < 2e-16 ***
## hour06 0.147894 0.025900 5.71 1.2e-08 ***
## hour07 0.608705 0.026152 23.28 < 2e-16 ***
## hour08 0.873226 0.026339 33.15 < 2e-16 ***
## hour09 0.682984 0.025796 26.48 < 2e-16 ***
## hour10 0.502750 0.026290 19.12 < 2e-16 ***
## hour11 0.538375 0.026023 20.69 < 2e-16 ***
## hour12 0.620234 0.026338 23.55 < 2e-16 ***
## hour13 0.606185 0.026454 22.91 < 2e-16 ***
## hour14 0.543453 0.026479 20.52 < 2e-16 ***
## hour15 0.579041 0.027060 21.40 < 2e-16 ***
## hour16 0.728319 0.026985 26.99 < 2e-16 ***
## hour17 0.920040 0.026350 34.92 < 2e-16 ***
## hour18 0.905684 0.026459 34.23 < 2e-16 ***
## hour19 0.783554 0.025765 30.41 < 2e-16 ***
## hour20 0.645705 0.025980 24.85 < 2e-16 ***
## hour21 0.541596 0.025685 21.09 < 2e-16 ***
## hour22 0.420025 0.025553 16.44 < 2e-16 ***
## hour23 0.257351 0.025984 9.90 < 2e-16 ***
## weekdayMonday -0.060833 0.014604 -4.17 3.2e-05 ***
## weekdaySaturday -0.042303 0.014118 -3.00 0.00274 **
## weekdaySunday -0.070909 0.014155 -5.01 5.6e-07 ***
## weekdayThursday -0.018973 0.014187 -1.34 0.18117
## weekdayTuesday -0.070692 0.014316 -4.94 8.1e-07 ***
## weekdayWednesday -0.048199 0.014353 -3.36 0.00079 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.278 on 5404 degrees of freedom
## Multiple R-squared: 0.792, Adjusted R-squared: 0.79
## F-statistic: 527 on 39 and 5404 DF, p-value: <2e-16
predlmcaslog10 <- predict(lmodelcaslog10, newdata = mytest)
## Error: factor weather has new levels HeavyRain
predlmreglog10 <- predict(lmodelreglog10, newdata = mytest)
## Error: factor weather has new levels HeavyRain
predlm <- round(10^predlmcaslog10 + 10^predlmreglog10 -2)
## Error: object 'predlmcaslog10' not found
bikeevaluate(mytest, predlm)
## Error: object 'predlm' not found
Decent score of 0.6246. Not worth submitting though.
The final thing I want to try is a model tree using the RWeka package.
library(RWeka)
mtcasmod <- M5P(folcasuallog10, data = mytrain)
mtregmod <- M5P(folreglog10, data = mytrain)
summary(mtcasmod)
##
## === Summary ===
##
## Correlation coefficient 0.9358
## Mean absolute error 0.1759
## Root mean squared error 0.2282
## Relative absolute error 32.3061 %
## Root relative squared error 35.2632 %
## Total Number of Instances 5444
summary(mtregmod)
##
## === Summary ===
##
## Correlation coefficient 0.9506
## Mean absolute error 0.1447
## Root mean squared error 0.1888
## Relative absolute error 29.4251 %
## Root relative squared error 31.0744 %
## Total Number of Instances 5444
mtcaspred <- predict(mtcasmod, newdata = mytest)
mtregpred <- predict(mtregmod, newdata = mytest)
predmt <- round(10^mtcaspred + 10^mtregpred -2)
bikeevaluate(mytest, predmt)
## [1] 0.4678
Another slight improvement! Up to 0.46581. Now will turn this in.
mtcasmod <- M5P(folcasuallog10, data = train)
mtregmod <- M5P(folreglog10, data = train)
summary(mtcasmod)
##
## === Summary ===
##
## Correlation coefficient 0.9362
## Mean absolute error 0.175
## Root mean squared error 0.2276
## Relative absolute error 32.1536 %
## Root relative squared error 35.1557 %
## Total Number of Instances 10886
summary(mtregmod)
##
## === Summary ===
##
## Correlation coefficient 0.9581
## Mean absolute error 0.1338
## Root mean squared error 0.1741
## Relative absolute error 27.3302 %
## Root relative squared error 28.6499 %
## Total Number of Instances 10886
mtcaspred <- predict(mtcasmod, newdata = test)
mtregpred <- predict(mtregmod, newdata = test)
predmt <- round(10^mtcaspred + 10^mtregpred -2)
writeout(test, predmt)
Kaggle score: 0.45423. Moved up 32 spots!
At this moment in time, I am ranked #76 out of 537. I want to keep toying with my variables and the models, but this will have to do for purposes of this assignment, so I can turn in my report. This was fun!
At this point, that's about all the types of machine learning algorithms that I am familiar with, or can easily look up. There are likely endless tweaks that I could do to make my model better. Here are a few I can think of.