bikeSharing

Kaggle Instruction

datetime - hourly date + timestamp
season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday - whether the day is considered a holiday
workingday - whether the day is neither a weekend nor holiday
weather - 1 Clear, Few clouds, Partly cloudy; 2 Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist; 3 Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds; 4 Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp - temperature in Celsius
atemp - “feels like” temperature in Celsius
humidity - relative humidity
windspeed - wind speed
casual - number of non-registered user rentals initiated
registered - number of registered user rentals initiated
count - number of total rentals

Data loading and pre-processing

The data is downloaded from the Kaggle website and read in as the test and train data frames.

test <- read.csv("test.csv")
train <- read.csv("train.csv")

Now, let’s take a look at the data. There are 10886 observations in the training set and about 6500 in the test set. The predictors contain an hourly time stamp and factors that may affect the number of bikes rented, such as flags indicating whether the day is a holiday or a working day, and weather information for that hour. Note that in the training set the outcome count is also split into casual and registered. The structure of the train dataset is listed below:

str(train)

## 'data.frame':    10886 obs. of  12 variables:
##  $ datetime  : Factor w/ 10886 levels "2011-01-01 00:00:00",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weather   : int  1 1 1 1 1 2 1 1 1 1 ...
##  $ temp      : num  9.84 9.02 9.02 9.84 9.84 ...
##  $ atemp     : num  14.4 13.6 13.6 14.4 14.4 ...
##  $ humidity  : int  81 80 80 75 75 75 80 86 75 76 ...
##  $ windspeed : num  0 0 0 0 0 ...
##  $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
##  $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
##  $ count     : int  16 40 32 13 1 1 2 3 8 14 ...

Most of the variables shown above are integer or numeric, but some of them are really categorical. For example, season can only take one of the values {1 (spring), 2 (summer), 3 (fall), 4 (winter)}, and treating those codes as numbers would impose an ordering and spacing that is hard to justify. We will get back to this later. It is more natural to convert these variables into factors. Before we do that, we combine the test and train sets by rows to avoid repeating the same procedure twice. Also, there are no NAs in the data frames. However, that does not necessarily mean the dataset is complete. We will find out in a moment.

# combine first nine columns of training set with the testing set.
total <- rbind(train[,1:9],test)
# check if there is any NA to impute
sapply(total, function(x) sum(is.na(x)))

##   datetime     season    holiday workingday    weather       temp 
##          0          0          0          0          0          0 
##      atemp   humidity  windspeed 
##          0          0          0

# convert all categorical columns into factor variables.
total$season <- as.factor(total$season)
total$holiday <- as.factor(total$holiday)
total$workingday <- as.factor(total$workingday)
total$weather <- as.factor(total$weather)
train[,1:9] <- total[1:nrow(train),1:9]
test <- total[(nrow(train)+1):nrow(total),]
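
The four conversions above can also be written as one pass over a vector of column names. This is only a minimal sketch on an invented three-row data frame; `toy` and `catCols` are illustrative names that do not appear in the original analysis.

```r
# Toy stand-in for the combined data frame; all values are invented.
toy <- data.frame(season  = c(1, 2, 4),
                  holiday = c(0, 0, 1),
                  temp    = c(9.84, 22.1, 5.3))

# Columns that hold category codes rather than measurements.
catCols <- c("season", "holiday")

# Convert each listed column to a factor and leave the others untouched.
toy[catCols] <- lapply(toy[catCols], as.factor)

str(toy)
```

Keeping the column names in one vector makes it harder to forget a column when the same conversion has to be repeated elsewhere.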

Data imputation

temperature

Weather variables like temp or humidity cannot be recovered accurately without an external source. The intuitive way to fill any blanks is to take the mean of neighboring values. Before we take action, we should check that all existing values are reasonable for their datetime. The boxplots of temp and atemp are shown below. There are a couple of outliers in spring and fall in both plots. Most outliers are very close to the whisker boundary, except for a few atemp points far below the lower boundary in the fall season of the training dataset, which have no counterpart in temp.

par(mfrow = c(2,2))
par(cex = 0.7)
boxplot(temp~season, data=train, notch = FALSE, col = c("gold"), outcol = c("blue"), main = "training dataset", xlab = "season", ylab = expression(paste("temperature (", ~degree~C, ")")))
boxplot(atemp~season, data=train, notch = FALSE, col = c("gold"), outcol = c("blue"), main = "training dataset", xlab = "season", ylab = expression(paste("feel-like temperature (", ~degree~C, ")")))
boxplot(temp~season, data=test, notch = FALSE, col = c("gold"), outcol = c("blue"), main = "testing dataset", xlab = "season", ylab = expression(paste("temperature (", ~degree~C, ")")))
boxplot(atemp~season, data=test, notch = FALSE, col = c("gold"), outcol = c("blue"), main = "testing dataset", xlab = "season", ylab = expression(paste("feel-like temperature (", ~degree~C, ")")))

We can make a scatter plot of atemp against temp to look into that. The plot shows some fall points that break the otherwise linear trend: atemp does not seem to respond to changes in temp.

library(ggplot2)
qplot(x=temp, y=atemp, data=train, geom="point", color=season, alpha=I(0.3))

Let us search for these outliers. These points are the 24 hours of 2012-08-17, all with the same atemp of 12.12. It is very likely that atemp was not recorded on that day.

train[train$temp > 20 & train$atemp < 20, c("datetime", "temp", "atemp")]

##                 datetime  temp atemp
## 8992 2012-08-17 00:00:00 27.88 12.12
## 8993 2012-08-17 01:00:00 27.06 12.12
## 8994 2012-08-17 02:00:00 27.06 12.12
## 8995 2012-08-17 03:00:00 26.24 12.12
## 8996 2012-08-17 04:00:00 26.24 12.12
## 8997 2012-08-17 05:00:00 26.24 12.12
## 8998 2012-08-17 06:00:00 25.42 12.12
## 8999 2012-08-17 07:00:00 26.24 12.12
## 9000 2012-08-17 08:00:00 27.88 12.12
## 9001 2012-08-17 09:00:00 28.70 12.12
## 9002 2012-08-17 10:00:00 30.34 12.12
## 9003 2012-08-17 11:00:00 31.16 12.12
## 9004 2012-08-17 12:00:00 33.62 12.12
## 9005 2012-08-17 13:00:00 34.44 12.12
## 9006 2012-08-17 14:00:00 35.26 12.12
## 9007 2012-08-17 15:00:00 35.26 12.12
## 9008 2012-08-17 16:00:00 34.44 12.12
## 9009 2012-08-17 17:00:00 33.62 12.12
## 9010 2012-08-17 18:00:00 33.62 12.12
## 9011 2012-08-17 19:00:00 30.34 12.12
## 9012 2012-08-17 20:00:00 29.52 12.12
## 9013 2012-08-17 21:00:00 27.88 12.12
## 9014 2012-08-17 22:00:00 27.06 12.12
## 9015 2012-08-17 23:00:00 26.24 12.12
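
A reading that stays identical for a whole day can also be found without guessing thresholds for temp and atemp. The sketch below is an assumption about how one might do that, not the method used in this analysis: it flags days on which atemp takes a single value. The `toy` data frame is invented, apart from the stuck value 12.12 taken from the listing above.

```r
# Two invented days of readings; the second day is stuck at 12.12.
toy <- data.frame(day   = rep(c("2012-08-16", "2012-08-17"), each = 3),
                  atemp = c(30.3, 31.1, 29.5, 12.12, 12.12, 12.12))

# For each day, check whether atemp has exactly one distinct value.
isConstant <- tapply(toy$atemp, toy$day, function(v) length(unique(v)) == 1)

# Days that look like a frozen sensor.
stuckDays <- names(isConstant)[isConstant]
stuckDays
```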

What we can do is fit the relation between temp and atemp with a linear model and estimate atemp from it.

# exclude data on 2012-08-17
dateList <- sapply(as.character(train[, "datetime"]), function(x) strsplit(x, " ")[[1]][1])
withoutOutlier <- train[dateList != "2012-08-17",]
lm(atemp ~ temp, data = withoutOutlier)

## 
## Call:
## lm(formula = atemp ~ temp, data = withoutOutlier)
## 
## Coefficients:
## (Intercept)         temp  
##       1.871        1.079

ggplot(withoutOutlier, aes(x=temp, y=atemp)) +
  geom_point(shape =1) + 
  geom_smooth(method=lm, se=FALSE, size = 1.5, color = "red")

The linear relation is atemp = 1.079*temp + 1.871, using the coefficients printed above. Thus the actual atemp can be estimated.

# overwrite the stuck atemp readings with the fitted values
train$atemp[dateList == "2012-08-17"] <- 1.079*train$temp[dateList == "2012-08-17"] + 1.871
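
Typing the coefficients by hand is easy to get out of sync with the fitted model. As a hedged alternative, the fitted model can be kept in a variable and passed to predict(). The sketch below uses invented data in which atemp follows 1.1 * temp + 2 exactly, so the result can be checked by hand. The names `toy`, `isBad` and `fit` are illustrative only.

```r
# Invented readings: atemp = 1.1 * temp + 2, except on the "bad" day,
# where atemp is frozen at 12.12.
toy <- data.frame(
  day   = c("d1", "d1", "d2", "d2", "bad", "bad"),
  temp  = c(10, 20, 15, 25, 30, 35),
  atemp = c(13, 24, 18.5, 29.5, 12.12, 12.12)
)

isBad <- toy$day == "bad"

# Fit on the trustworthy rows only.
fit <- lm(atemp ~ temp, data = toy[!isBad, ])

# Replace the frozen values with predictions from the fitted line.
toy$atemp[isBad] <- predict(fit, newdata = toy[isBad, ])

toy[isBad, ]
```

Because the imputed values come straight from the model object, refitting on different data updates them automatically.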

humidity

Now let us look at the humidity histogram. In particular, we are interested in the two ends (0 and 100). The little spike at 0 in the training set is suspicious because outdoor relative humidity is practically never zero. Those observations are listed below. Surprisingly, they all fall on 2011-03-10. The humidity information is probably missing for that day.

par(mfrow = c(1,2))
par(cex = 0.7)
hist(train$humidity, breaks=seq(-0.5, 100.5, 1), col = "cyan", xlab="humidity (%)", main = "training dataset")
hist(test$humidity, breaks=seq(-0.5, 100.5, 1), col = "cyan", xlab="humidity (%)", main = "testing dataset")

train[train$humidity == 0, c("datetime", "humidity")]

##                 datetime humidity
## 1092 2011-03-10 00:00:00        0
## 1093 2011-03-10 01:00:00        0
## 1094 2011-03-10 02:00:00        0
## 1095 2011-03-10 05:00:00        0
## 1096 2011-03-10 06:00:00        0
## 1097 2011-03-10 07:00:00        0
## 1098 2011-03-10 08:00:00        0
## 1099 2011-03-10 09:00:00        0
## 1100 2011-03-10 10:00:00        0
## 1101 2011-03-10 11:00:00        0
## 1102 2011-03-10 12:00:00        0
## 1103 2011-03-10 13:00:00        0
## 1104 2011-03-10 14:00:00        0
## 1105 2011-03-10 15:00:00        0
## 1106 2011-03-10 16:00:00        0
## 1107 2011-03-10 17:00:00        0
## 1108 2011-03-10 18:00:00        0
## 1109 2011-03-10 19:00:00        0
## 1110 2011-03-10 20:00:00        0
## 1111 2011-03-10 21:00:00        0
## 1112 2011-03-10 22:00:00        0
## 1113 2011-03-10 23:00:00        0

The fixing strategy is to average the humidity on 2011-03-09 and 2011-03-11 for each hour and fill in the corresponding spot on 2011-03-10. Note that the observations at 3 AM and 4 AM of 2011-03-10 and at 4 AM of 2011-03-11 are missing, so we should skip 3 AM and 4 AM when averaging. The time series imputation will be discussed later. The 100% humidity is mostly due to wet days. We are fine to live with it.

humOn10th <- train$humidity[dateList == "2011-03-10"]
humOn9th <- train$humidity[dateList == "2011-03-09"]
humOn11th <- train$humidity[dateList == "2011-03-11"]
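# 2011-03-09 (assumed complete): drop 3 AM and 4 AM, i.e. positions 4 and 5
# 2011-03-11: 4 AM is already absent, so only 3 AM (position 4) has to be dropped
humOn10th <- (humOn9th[c(1:3, 6:24)] + humOn11th[-4]) / 2.0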
train$humidity[dateList == "2011-03-10"] <- humOn10th
test$humidity <- as.numeric(test$humidity)
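
Positional indexing like the above depends on knowing exactly which hours are absent on each day. A more defensive variant looks up the neighboring days by hour value instead. The sketch below is an assumption about how that could be done, on invented six-hour days. `prev`, `bad` and `nxt` are hypothetical stand-ins for the three days.

```r
# Invented hourly humidity. The middle day has bogus zeros and lacks hour 3;
# the following day lacks hour 4.
prev <- data.frame(hour = 0:5, humidity = c(50, 52, 54, 56, 58, 60))
bad  <- data.frame(hour = c(0, 1, 2, 4, 5), humidity = 0)
nxt  <- data.frame(hour = c(0, 1, 2, 3, 5), humidity = c(60, 62, 64, 66, 70))

# Look up each bad-day hour in the neighboring days by value, not position.
prevHum <- prev$humidity[match(bad$hour, prev$hour)]
nextHum <- nxt$humidity[match(bad$hour, nxt$hour)]

# Average the two neighbors; na.rm = TRUE falls back to the single available
# neighbor when the other day has no record for that hour.
bad$humidity <- rowMeans(cbind(prevHum, nextHum), na.rm = TRUE)

bad
```

Here hour 4 is filled from the previous day alone, because the following day has no 4 AM record.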

windspeed

The last variable we inspect is windspeed. The histograms show a lot of zeros in both the training and testing datasets. Also, there are no values between 0 and 6, probably because the wind speed was too low to be measured. Low wind speed should not be a big factor for bike rental, so for now I would not worry about it too much.

par(mfrow = c(2,1))
par(cex = 0.7)
hist(train$windspeed, breaks=seq(-0.5, 60.5, 1), xlab = "windspeed", main = "training", col = "purple")
hist(test$windspeed, breaks=seq(-0.5, 60.5, 1), xlab = "windspeed", main = "testing", col = "purple")

Feature Engineering

Let us put all other predictors aside for now and focus on datetime. Daily experience tells us people tend to rent more bikes during the daytime than around midnight. Therefore, the hour in the time stamp should play an important role in prediction. It is convenient to extract the hour string from datetime and add it to the data frame as a new factor variable. The number of records for each hour of the day is shown in the bar plots below.

hourList <- sapply(as.character(total[, "datetime"]), function(x) strsplit(x, "[ ,:]")[[1]][2])
total$hour <- as.factor(unname(hourList))
train$hour <- total[1:nrow(train),"hour"]
test$hour <- total[(nrow(train)+1):nrow(total), "hour"]
par(mfrow = c(2,1))
par(cex = 0.7)
barplot(table(train$hour), xlab="hour", ylab="counts", col="lightblue", main="counts over 24 hours in training set")
barplot(table(test$hour), xlab="hour", ylab="counts", col="lightblue", main="counts over 24 hours in testing set")

As you can see, there is a dent around 4 AM in both the training and testing datasets, which indicates that some early-morning records are missing, maybe because both casual and registered were zero for those hourly intervals. So we need to figure out how to impute these missing items.
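
One way to list exactly which hourly records are absent is to build the full hourly sequence and compare it with the observed time stamps. This is a minimal sketch on five invented time stamps, not part of the original analysis. `stamps`, `expected` and `missingStamps` are illustrative names.

```r
# Invented time stamps: hourly records with 03:00 absent.
stamps <- c("2011-03-10 00:00:00", "2011-03-10 01:00:00",
            "2011-03-10 02:00:00", "2011-03-10 04:00:00",
            "2011-03-10 05:00:00")
fmt <- "%Y-%m-%d %H:%M:%S"
observed <- as.POSIXct(stamps, format = fmt, tz = "UTC")

# Every hour between the first and the last observation.
expected <- format(seq(min(observed), max(observed), by = "hour"), fmt)

# Hours that should exist but have no record.
missingStamps <- setdiff(expected, stamps)
missingStamps
```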

The casual and registered bar plots are stacked so we can see the difference between the two. Note that the sum of casual and registered is count. Generally there are more rentals by registered users than by casual ones. The registered bars peak around 8 AM and 5 PM, which suggests that registered users rent bikes mostly for commuting to work.

AggCount <- aggregate(train$count, by=list(train$hour), FUN=sum)
names(AggCount) <- c("hour", "total")
AggCount$registered <- aggregate(train$registered, by=list(train$hour), FUN=sum)[,2]
AggCount$casual <- aggregate(train$casual, by=list(train$hour), FUN=sum)[,2]
Aggmatrix <- matrix(c(AggCount[,"casual"], AggCount[,"registered"]), nrow=24, ncol=2, dimnames=list(AggCount$hour, c("casual","registered")))
Aggmatrix <- t(Aggmatrix)
barplot(Aggmatrix,
        col=c("lightblue","darkblue"),
        main = "Barplot of summed registered and casual counts over 24 hours",
        xlab = "hour", ylab = "counts",
        legend = rownames(Aggmatrix),
        args.legend = list(x = "topleft"))

The time stamp contains the month, which is finer-grained than season. Out of curiosity, we create a new factor variable month to resolve the differences from month to month rather than from season to season, which might improve our prediction. Likewise, two more variables, weekdays (from Monday to Sunday) and dayOfMonth (from the 1st up to the 31st), are created to capture weekly patterns better than workingday, which only separates working days from weekends and holidays. Unfortunately, none of these variables helped to improve the accuracy.

dateList <- sapply(as.character(total[, "datetime"]), function(x) strsplit(x, " ")[[1]][1])
dateList <- as.Date(dateList, "%Y-%m-%d")
total$year <- format(dateList, "%Y")
total$year <- as.factor(total$year)
train$year <-  total[1:nrow(train),"year"]
test$year <- total[(nrow(train)+1):nrow(total),"year"]

dateList <- sapply(as.character(total[, "datetime"]), function(x) strsplit(x, " ")[[1]][1])
dateList <- as.Date(dateList, "%Y-%m-%d")
total$month <- format(dateList, "%m")
total$month <- as.numeric(total$month)
total$month <- ifelse(total$year == "2012", total$month+12, total$month)
total$month <- as.factor(total$month)
train$month <- total[1:nrow(train),"month"]
test$month <- total[(nrow(train)+1):nrow(total), "month"]

dateList <- sapply(as.character(total[, "datetime"]), function(x) strsplit(x, " ")[[1]][1])
dateList <- as.Date(dateList, "%Y-%m-%d")
total$dayOfMonth <- format(dateList, "%d")
total$dayOfMonth <- as.factor(total$dayOfMonth)
train$dayOfMonth <- total[1:nrow(train),"dayOfMonth"]
test$dayOfMonth <- total[(nrow(train)+1):nrow(total), "dayOfMonth"]

dateList <- sapply(as.character(total[, "datetime"]), function(x) strsplit(x, " ")[[1]][1])
dateList <- as.Date(dateList, "%Y-%m-%d")
total$weekdays <- format(dateList, "%a")
total$weekdays <- as.factor(total$weekdays)
train$weekdays <-  total[1:nrow(train),"weekdays"]
test$weekdays <- total[(nrow(train)+1):nrow(total),"weekdays"]

Days from 2011-01-01

dateList <- sapply(as.character(total[, "datetime"]), function(x) strsplit(x, " ")[[1]][1])
dateList <- as.Date(dateList, "%Y-%m-%d")
startdate <- as.Date("2011-01-01","%Y-%m-%d")
numOfDays <- as.numeric(difftime(dateList, startdate, units = "days"))
#hours <- as.numeric(levels(total$hour)[total$hour])
total$whichDay <- numOfDays
train$whichDay <-  total[1:nrow(train),"whichDay"]
test$whichDay <- total[(nrow(train)+1):nrow(total),"whichDay"]

Weeks from 2011-01-01

dateList <- sapply(as.character(total[, "datetime"]), function(x) strsplit(x, " ")[[1]][1])
dateList <- as.Date(dateList, "%Y-%m-%d")
startdate <- as.Date("2011-01-01","%Y-%m-%d")
numOfWeeks <- as.numeric(difftime(dateList, startdate, units = "weeks"))
#hours <- as.numeric(levels(total$hour)[total$hour])
total$whichWeek <- numOfWeeks
train$whichWeek <-  total[1:nrow(train),"whichWeek"]
test$whichWeek <- total[(nrow(train)+1):nrow(total),"whichWeek"]
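
The blocks above parse datetime again for every new feature. As a hedged sketch, the same features can be derived from a single parse with as.POSIXct. The two time stamps are invented, and `when` and `features` are illustrative names, not objects from the original code.

```r
# Two invented time stamps, one per year covered by the data.
stamps <- c("2011-01-01 00:00:00", "2012-03-15 17:00:00")
when <- as.POSIXct(stamps, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

features <- data.frame(
  hour       = factor(format(when, "%H")),
  year       = factor(format(when, "%Y")),
  dayOfMonth = factor(format(when, "%d")),
  # Months numbered 1..24 across the two years, as in the code above.
  month      = factor(as.numeric(format(when, "%m")) +
                        12 * (format(when, "%Y") == "2012")),
  # Whole days elapsed since 2011-01-01.
  whichDay   = as.numeric(as.Date(when) - as.Date("2011-01-01"))
)

features
```

The time zone is fixed to UTC here so the date part does not shift with the local setting of the machine running the code.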

Random Forest

Prediction with random forest (Option 1) (So far, this gives the best result!)

library(caret)
library(lattice)
library(randomForest)
train$registeredlog10 <- log10(sqrt(train$registered) + 1)
train$casuallog10 <- log10(sqrt(train$casual) + 1)
modFitReg <- randomForest(registeredlog10 ~ hour + year + season + weather + workingday + weekdays, data = train, ntree = 1000, mtry = 6, importance = TRUE)
modFitCas <- randomForest(casuallog10 ~ hour + year + season + workingday + weather + atemp + humidity + weekdays, data = train, ntree = 1000, mtry = 6, importance = TRUE)
testPredReg <- predict(modFitReg, test)
testPredCas <- predict(modFitCas, test)
models <- data.frame(datetime = test$datetime, RF = (10^testPredReg-1)^2 + (10^testPredCas-1)^2)
names(models) <- c("datetime", "count")
write.csv(models, file = "result.csv", row.names = FALSE)
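
The response is modeled on the scale log10(sqrt(n) + 1), so predictions have to be mapped back with (10^y - 1)^2 before the two parts are summed. The round trip can be checked on a few invented counts. The helper names `toLogSqrt` and `fromLogSqrt` are illustrative and not part of the original code.

```r
# Forward transform applied to the rental counts before fitting.
toLogSqrt <- function(n) log10(sqrt(n) + 1)

# Inverse transform applied to the predictions.
fromLogSqrt <- function(y) (10^y - 1)^2

rentals <- c(0, 81, 400)           # invented counts
transformed <- toLogSqrt(rentals)
recovered <- fromLogSqrt(transformed)

# TRUE when the inverse undoes the forward transform up to rounding error.
all.equal(recovered, rentals)
```

Whichever transform is used for fitting, the back-transform has to be its exact inverse; mixing log10(n + 1) with (10^y - 1)^2 would bias every prediction.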

Prediction with random forest (Option 2)

library(caret)
library(lattice)
library(randomForest)

total$yearMonth <- format(dateList, "%Y-%m")
total$yearMonth <- as.factor(total$yearMonth)
train$yearMonth <-  total[1:nrow(train),"yearMonth"]
test$yearMonth <- total[(nrow(train)+1):nrow(total),"yearMonth"]   
trainList <- split(train, train$yearMonth)
testList <- split(test, test$yearMonth)

predictingReg <- function(x, y) {
  x$registeredlog10 <- log10(x$registered + 1)
  modFitReg <- randomForest(registeredlog10 ~ hour + season + weekdays + holiday + weather + temp + atemp + humidity + windspeed, data = x, ntree = 500)
  testPredReg <- predict(modFitReg, y)
  return(testPredReg)
}

predictingCas <- function(x, y) {
  x$casuallog10 <- log10(x$casual + 1)
  modFitCas <- randomForest(casuallog10 ~ hour + season + weekdays + holiday + weather + temp + atemp + humidity + windspeed, data = x, ntree = 1000)
  testPredCas <- predict(modFitCas, y)
  return(testPredCas)
}

testRegList <- mapply(predictingReg, trainList, testList)
testCasList <- mapply(predictingCas, trainList, testList)

testReg <- unname(unlist(testRegList))
testCas <- unname(unlist(testCasList))

submit <- data.frame(datetime = test$datetime, count = 10^testReg + 10^testCas - 2)
write.csv(submit, file = "result.csv", row.names = FALSE)
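
The per-month pattern in Option 2 (split both sets by a key, fit on each training piece, predict on the matching test piece) can be illustrated without randomForest. The sketch below swaps in lm on invented data so that it runs with base R only. `trainToy`, `testToy` and `fitAndPredict` are hypothetical names.

```r
# Invented data: y = 2 * x in group "a" and y = 10 * x in group "b".
trainToy <- data.frame(group = rep(c("a", "b"), each = 3),
                       x = c(1, 2, 3, 1, 2, 3),
                       y = c(2, 4, 6, 10, 20, 30))
testToy <- data.frame(group = c("a", "b", "b"), x = c(4, 4, 5))

# Fit on one training piece and predict the matching test piece.
fitAndPredict <- function(tr, te) {
  fit <- lm(y ~ x, data = tr)
  unname(predict(fit, newdata = te))
}

# SIMPLIFY = FALSE always returns a list, even when every piece happens
# to have the same number of rows.
predsByGroup <- mapply(fitAndPredict,
                       split(trainToy, trainToy$group),
                       split(testToy, testToy$group),
                       SIMPLIFY = FALSE)

preds <- unlist(predsByGroup, use.names = FALSE)
preds
```

The flattened predictions come out in group order, so they line up with the test rows only when the test set is already sorted by the split key, as it is for chronologically ordered months.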

Prediction with random forest (Option 3)

library(caret)
library(lattice)
library(randomForest)

total$yearMonth <- format(dateList, "%Y-%m")
total$yearMonth <- as.factor(total$yearMonth)
train$yearMonth <-  total[1:nrow(train),"yearMonth"]
test$yearMonth <- total[(nrow(train)+1):nrow(total),"yearMonth"]   
trainList <- split(train, train$yearMonth)
testList <- split(test, test$yearMonth)

trainCumList <- list()
trainCumFirstList <- list()
trainCumSecondList <- list()
testFirstList <- list()
testSecondList <- list()

for (i in 1:length(trainList)) {
  trainCumList[[i]] <- trainList[[1]]
}
for (i in 2:length(trainList)) {
  for (j in 2:i) {
    trainCumList[[i]] <- rbind(trainCumList[[i]], trainList[[j]])    
  }
}

trainCumFirstList <- trainCumList[1:3]
trainCumSecondList <- trainCumList[4:24]
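# split the test months at the same index as the training lists so that
# mapply pairs each cumulative training set with its own test month
testFirstList <- testList[1:3]
testSecondList <- testList[4:24]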

predictingReg1 <- function(x, y) {
  x$registeredlog10 <- log10(sqrt(x$registered) + 1)
  modFitReg <- randomForest(registeredlog10 ~ hour + workingday + holiday + whichDay, data = x, ntree = 500)
  testPredReg <- predict(modFitReg, y)
  return(testPredReg)
}

predictingCas1 <- function(x, y) {
  x$casuallog10 <- log10(sqrt(x$casual) + 1)
  modFitCas <- randomForest(casuallog10 ~ hour + workingday + holiday + whichDay, data = x, ntree = 500)
  testPredCas <- predict(modFitCas, y)
  return(testPredCas)
}

predictingReg2 <- function(x, y) {
  x$registeredlog10 <- log10(sqrt(x$registered) + 1)
  modFitReg <- randomForest(registeredlog10 ~ hour + season + workingday + holiday + weather + temp + atemp + humidity + windspeed + whichDay, data = x, ntree = 500)
  testPredReg <- predict(modFitReg, y)
  return(testPredReg)
}

predictingCas2 <- function(x, y) {
  x$casuallog10 <- log10(sqrt(x$casual) + 1)
  modFitCas <- randomForest(casuallog10 ~ hour + season + workingday + holiday + weather + temp + atemp + humidity + windspeed + whichDay, data = x, ntree = 500)
  testPredCas <- predict(modFitCas, y)
  return(testPredCas)
}

testRegFirstList <- mapply(predictingReg1, trainCumFirstList, testFirstList)
testCasFirstList <- mapply(predictingCas1, trainCumFirstList, testFirstList)
testRegSecondList <- mapply(predictingReg2, trainCumSecondList, testSecondList)
testCasSecondList <- mapply(predictingCas2, trainCumSecondList, testSecondList)

testRegFirstResult <- unname(unlist(testRegFirstList))
testRegSecondResult <- unname(unlist(testRegSecondList))
# back-transform both parts from the log10(sqrt(n) + 1) scale
testReg <- (10^c(testRegFirstResult, testRegSecondResult) - 1)^2
testCasFirstResult <- unname(unlist(testCasFirstList))
testCasSecondResult <- unname(unlist(testCasSecondList))
testCas <- (10^c(testCasFirstResult, testCasSecondResult) - 1)^2

#submit <- data.frame(datetime = test$datetime, count = (10^testReg-1)^2 + (10^testCas-1)^2)
submit <- data.frame(datetime = test$datetime, count = testReg + testCas)
write.csv(submit, file = "result.csv", row.names = FALSE)
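
The nested loops that build trainCumList in Option 3 can be expressed more compactly with Reduce and accumulate = TRUE. The sketch below shows the idea on three invented one-row "months". `monthly` and `cumulative` are illustrative names.

```r
# Three invented monthly pieces, one row each.
monthly <- list(data.frame(m = 1, n = 10),
                data.frame(m = 2, n = 20),
                data.frame(m = 3, n = 30))

# Element i holds months 1..i bound together by rows.
cumulative <- Reduce(rbind, monthly, accumulate = TRUE)

# Number of rows in each cumulative piece.
cumRows <- sapply(cumulative, nrow)
cumRows
```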

Decision Trees

library(party)
train$registeredlog10 <- log10(sqrt(train$registered) + 1)
train$casuallog10 <- log10(sqrt(train$casual) + 1)
modFitReg <- ctree(registeredlog10 ~ hour + season + weekdays + holiday + weather + temp + atemp + humidity + windspeed + whichDay, data = train)
modFitCas <- ctree(casuallog10 ~ hour + season + weekdays + holiday + weather + temp + atemp + humidity + windspeed + whichDay, data = train)
testPredReglog10 <- predict(modFitReg, test)
testPredCaslog10 <- predict(modFitCas, test)
testPredReg <- (10^testPredReglog10 - 1)^2
testPredCas <- (10^testPredCaslog10 - 1)^2
models$DT <- testPredReg + testPredCas
#submit <- data.frame(datetime = test$datetime, count = testPredReg + testPredCas)
#write.csv(submit, file = "result.csv", row.names = FALSE)

Generalized Boosted Regression

library(gbm) 
train$registeredlog10 <- log10(sqrt(train$registered) + 1)
train$casuallog10 <- log10(sqrt(train$casual) + 1)
# distribution is set explicitly; gbm would otherwise guess gaussian for a continuous response
modFitReg <- gbm(registeredlog10 ~ hour + season + weekdays + holiday + weather + temp + atemp + humidity + windspeed + whichDay, data = train, distribution = "gaussian", n.trees = 500)
modFitCas <- gbm(casuallog10 ~ hour + season + weekdays + holiday + weather + temp + atemp + humidity + windspeed + whichDay, data = train, distribution = "gaussian", n.trees = 500)
testPredReglog10 <- predict(modFitReg, newdata = test, n.trees = 500)
testPredCaslog10 <- predict(modFitCas, newdata = test, n.trees = 500)
testPredReg <- (10^testPredReglog10 - 1)^2
testPredCas <- (10^testPredCaslog10 - 1)^2
models$GBM <- testPredReg + testPredCas
#submit <- data.frame(datetime = test$datetime, count = testPredReg + testPredCas)
#write.csv(submit, file = "result.csv", row.names = FALSE)

Support Vector Machine

library(e1071)
library(rpart)
train$registeredlog10 <- log10(sqrt(train$registered) + 1)
train$casuallog10 <- log10(sqrt(train$casual) + 1)
# note: degree only affects kernel = "polynomial"; with the default radial kernel it is ignored
modFitReg <- svm(registeredlog10 ~ hour + season + weekdays + holiday + weather + temp + atemp + humidity + windspeed + whichDay, data = train, degree = 10)
modFitCas <- svm(casuallog10 ~ hour + season + weekdays + holiday + weather + temp + atemp + humidity + windspeed + whichDay, data = train, degree = 10)
testPredReglog10 <- predict(modFitReg, test)
testPredCaslog10 <- predict(modFitCas, test)
testPredReg <- (10^testPredReglog10 - 1)^2
testPredCas <- (10^testPredCaslog10 - 1)^2
models$SVM <- testPredReg + testPredCas
#submit <- data.frame(datetime = test$datetime, count = testPredReg + testPredCas)
#write.csv(submit, file = "result.csv", row.names = FALSE)

Neural Network

library("neuralnet")
train$registeredlog10 <- log10(sqrt(train$registered) + 1)
train$casuallog10 <- log10(sqrt(train$casual) + 1)
trainNN <- train
testNN <- test
total$hour <- as.integer(levels(total$hour)[total$hour])
total$holiday <- as.integer(levels(total$holiday)[total$holiday])
total$workingday <- as.integer(levels(total$workingday)[total$workingday])
total$season <- as.integer(levels(total$season)[total$season])
total$weather <- as.integer(levels(total$weather)[total$weather])
trainNN[,c("hour", "season", "holiday", "workingday", "weather")] <-  total[1:nrow(train), c("hour", "season", "holiday", "workingday", "weather")]
trainNN <- trainNN[,c("hour", "season", "workingday", "holiday", "weather", "temp", "atemp", "humidity", "windspeed", "whichDay", "registeredlog10")]
testNN[,c("hour", "season", "holiday", "workingday", "weather")] <- total[(nrow(train)+1):nrow(total), c("hour", "season", "holiday", "workingday", "weather")]
#testNN <- testNN[,c("hour", "season", "workingday", "holiday", "weather", "temp", "atemp", "humidity", "windspeed", "whichDay")]
testNN <- testNN[,c("hour", "season", "workingday", "holiday", "weather", "whichDay")]
modFitReg <- neuralnet(registeredlog10 ~ hour + season + workingday + holiday + weather + whichDay, data = trainNN, hidden = 2, threshold = 0.01, stepmax = 1e+05, rep = 10)
#modFitCas <- neuralnet(casuallog10 ~ hour + season + workingday + holiday + weather + temp + atemp + humidity + windspeed + whichDay, data = train, hidden = 3, threshold = 0.01, stepmax = 1e+06)
testPredReglog10 <- compute(modFitReg, testNN)
#testPredCaslog10 <- compute(modFitCas, test[,-1])
testPredReg <- (10^testPredReglog10[[2]] - 1)^2
#testPredCas <- (10^testPredCaslog10[[2]] - 1)^2
#submit <- data.frame(datetime = test$datetime, count = testPredReg + testPredCas)
#write.csv(submit, file = "result.csv", row.names = FALSE)