A random forest is an ensemble method that can be used for both classification and regression. The result from an ensemble model is usually better than the result from any one of its individual models. In a random forest, each decision tree is built on a random subset of the training data, that is, of observations whose predictors and response are both known.
In a decision tree, an input enters at the top and, as it moves down the tree, the data is split into smaller and smaller subsets. The random forest takes the idea of decision trees a step further by combining many trees and aggregating their predictions. In ensemble terms, the individual trees are weak learners and the random forest is the strong learner.
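To make the "random subset" idea concrete, here is a minimal base-R sketch of the bootstrap sample a single tree might be grown on. The row count `n` and all variable names are made up for illustration; randomForest does this sampling internally, so this is only a picture of the idea, not the package's code.

```r
set.seed(415)

n <- 1000                                          # hypothetical number of training rows
in_bag_idx <- sample(n, size = n, replace = TRUE)  # rows drawn (with replacement) for one tree
in_bag <- unique(in_bag_idx)                       # distinct rows this tree actually sees
out_of_bag <- setdiff(seq_len(n), in_bag)          # rows this tree never sees

frac_in_bag <- length(in_bag) / n
cat("share of distinct rows in the sample:", round(frac_in_bag, 3), "\n")
cat("out-of-bag rows:", length(out_of_bag), "\n")
```

The rows a tree never sees are its out-of-bag rows, and they are what the out-of-bag error reported during training is computed on.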
Loading the libraries
library(readr)
library(randomForest)
set.seed(415)
Reading the CSV files to be analyzed
train <- read_csv("C:/Users/6430/Desktop/Project/train.csv/train.csv")
test <- read_csv("C:/Users/6430/Desktop/Project/test.csv/test.csv")
store <- read_csv("C:/Users/6430/Desktop/Project/store.csv/store.csv")
## Merging train and test with the store file: the store-level features live in a separate file and have to be joined in to see the full effect of the features on sales.
train1 <- merge(train,store)
test1 <- merge(test,store)
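A small sketch of what merge() does here may help. The toy data frames below are invented for illustration; the point is that merge() joins on the columns the two inputs share (here Store), moves that key to the front, and by default sorts the result by it, so the row order of the merged data is generally not the row order of the input.

```r
# Toy stand-ins for test and store (values are made up)
test_toy  <- data.frame(Id = 1:3, Store = c(3, 1, 2))
store_toy <- data.frame(Store = c(1, 2, 3), StoreType = c("a", "b", "c"))

merged_toy <- merge(test_toy, store_toy)  # joins on the shared column "Store"
print(merged_toy)

# The key column comes first and the rows are sorted by it,
# so merged_toy$Id is no longer in the order 1, 2, 3.
print(merged_toy$Id)
```

This matters later on: anything taken from the unmerged data (for example an Id column) will not line up row by row with values computed from the merged data.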
Converting all the NA values in the train data to zero. In the test data, store 622 has 11 missing values in the “Open” column; I have decided to fill these with 1 (open), since otherwise the predictions for that store will not be correct. Note that the second line below replaces every NA in test1 with 1, not only the ones in “Open”.
train1[is.na(train1)] <- 0
test1[is.na(test1)] <- 1
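If the intention is to fill only the missing “Open” values, a column-wise replacement is one way to do it. The sketch below uses a made-up toy data frame (the column values are not from the real files) and is an alternative to, not a description of, the blanket replacement used above.

```r
# Toy data with a missing Open flag and a missing distance (values are made up)
toy <- data.frame(Open = c(1, NA, 0),
                  CompetitionDistance = c(500, NA, 1200))

# Fill only the Open column; other columns keep their NA values
toy$Open[is.na(toy$Open)] <- 1
print(toy)
```

With this form the other columns can be imputed separately, with whatever value suits each of them.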
## We will only look at the days on which a store was open
train1 <- train1[which(train1$Open == 1), ]
Both train1 and test1 have a “Date” column. We will separate the date into month, year and day, and also derive the day of the year and the week. These new variables should be easier for the model to use when predicting sales than the raw date.
train1$Date <- as.Date(train1$Date)
test1$Date <- as.Date(test1$Date)
train1$month <- as.integer(format(train1$Date, "%m"))
train1$year <- as.integer(format(train1$Date, "%y"))
train1$day <- as.integer(format(train1$Date, "%d"))
train1$DayOfYear <- as.integer(as.POSIXlt(train1$Date)$yday)
train1$week <- as.integer( format(train1$Date+3, "%U"))
test1$month <- as.integer(format(test1$Date, "%m"))
test1$year <- as.integer(format(test1$Date, "%y"))
test1$day <- as.integer(format(test1$Date, "%d"))
test1$DayOfYear <- as.integer(as.POSIXlt(test1$Date)$yday)
test1$week <- as.integer( format(test1$Date+3, "%U"))
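Since the same five columns are derived for train1 and test1, the steps can be wrapped in one function. This is only a sketch: the name add_date_features is my own, and the body simply repeats the expressions used above.

```r
# Hypothetical helper: derives the same date features as the code above
add_date_features <- function(df) {
  df$Date      <- as.Date(df$Date)
  df$month     <- as.integer(format(df$Date, "%m"))      # 1-12
  df$year      <- as.integer(format(df$Date, "%y"))      # two-digit year, e.g. 15
  df$day       <- as.integer(format(df$Date, "%d"))      # day of month
  df$DayOfYear <- as.integer(as.POSIXlt(df$Date)$yday)   # 0-based day of year
  df$week      <- as.integer(format(df$Date + 3, "%U"))  # week number of the date shifted by 3 days
  df
}

# Two dates that appear as the minimum Date in the summaries
toy <- add_date_features(data.frame(Date = c("2015-08-01", "2013-01-01")))
print(toy)
```

Note that "%y" gives a two-digit year and that yday counts from 0, which is why year runs from 13 to 15 and DayOfYear starts at 0 in the summaries.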
names(train1)
## [1] "Store" "DayOfWeek"
## [3] "Date" "Sales"
## [5] "Customers" "Open"
## [7] "Promo" "StateHoliday"
## [9] "SchoolHoliday" "StoreType"
## [11] "Assortment" "CompetitionDistance"
## [13] "CompetitionOpenSinceMonth" "CompetitionOpenSinceYear"
## [15] "Promo2" "Promo2SinceWeek"
## [17] "Promo2SinceYear" "PromoInterval"
## [19] "month" "year"
## [21] "day" "DayOfYear"
## [23] "week"
summary(train1)
## Store DayOfWeek Date Sales
## Min. : 1.0 Min. :1.00 Min. :2013-01-01 Min. : 0
## 1st Qu.: 280.0 1st Qu.:2.00 1st Qu.:2013-08-16 1st Qu.: 4859
## Median : 558.0 Median :3.00 Median :2014-03-31 Median : 6369
## Mean : 558.4 Mean :3.52 Mean :2014-04-11 Mean : 6956
## 3rd Qu.: 837.0 3rd Qu.:5.00 3rd Qu.:2014-12-10 3rd Qu.: 8360
## Max. :1115.0 Max. :7.00 Max. :2015-07-31 Max. :41551
## Customers Open Promo StateHoliday
## Min. : 0.0 Min. :1 Min. :0.0000 Min. :0
## 1st Qu.: 519.0 1st Qu.:1 1st Qu.:0.0000 1st Qu.:0
## Median : 676.0 Median :1 Median :0.0000 Median :0
## Mean : 762.7 Mean :1 Mean :0.4464 Mean :0
## 3rd Qu.: 893.0 3rd Qu.:1 3rd Qu.:1.0000 3rd Qu.:0
## Max. :7388.0 Max. :1 Max. :1.0000 Max. :0
## SchoolHoliday StoreType Assortment
## Min. :0.0000 Length:844392 Length:844392
## 1st Qu.:0.0000 Class :character Class :character
## Median :0.0000 Mode :character Mode :character
## Mean :0.1936
## 3rd Qu.:0.0000
## Max. :1.0000
## CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear
## Min. : 0 Min. : 0.000 Min. : 0
## 1st Qu.: 700 1st Qu.: 0.000 1st Qu.: 0
## Median : 2320 Median : 4.000 Median :2006
## Mean : 5444 Mean : 4.926 Mean :1370
## 3rd Qu.: 6880 3rd Qu.: 9.000 3rd Qu.:2011
## Max. :75860 Max. :12.000 Max. :2015
## Promo2 Promo2SinceWeek Promo2SinceYear PromoInterval
## Min. :0.0000 Min. : 0.0 Min. : 0 Length:844392
## 1st Qu.:0.0000 1st Qu.: 0.0 1st Qu.: 0 Class :character
## Median :0.0000 Median : 0.0 Median : 0 Mode :character
## Mean :0.4987 Mean :11.6 Mean :1003
## 3rd Qu.:1.0000 3rd Qu.:22.0 3rd Qu.:2012
## Max. :1.0000 Max. :50.0 Max. :2015
## month year day DayOfYear
## Min. : 1.000 Min. :13.00 Min. : 1.00 Min. : 0.0
## 1st Qu.: 3.000 1st Qu.:13.00 1st Qu.: 8.00 1st Qu.: 74.0
## Median : 6.000 Median :14.00 Median :16.00 Median :153.0
## Mean : 5.846 Mean :13.83 Mean :15.84 Mean :161.4
## 3rd Qu.: 8.000 3rd Qu.:14.00 3rd Qu.:23.00 3rd Qu.:240.0
## Max. :12.000 Max. :15.00 Max. :31.00 Max. :364.0
## week
## Min. : 0.00
## 1st Qu.:11.00
## Median :22.00
## Mean :23.14
## 3rd Qu.:34.00
## Max. :52.00
names(test1)
## [1] "Store" "Id"
## [3] "DayOfWeek" "Date"
## [5] "Open" "Promo"
## [7] "StateHoliday" "SchoolHoliday"
## [9] "StoreType" "Assortment"
## [11] "CompetitionDistance" "CompetitionOpenSinceMonth"
## [13] "CompetitionOpenSinceYear" "Promo2"
## [15] "Promo2SinceWeek" "Promo2SinceYear"
## [17] "PromoInterval" "month"
## [19] "year" "day"
## [21] "DayOfYear" "week"
summary(test1)
## Store Id DayOfWeek Date
## Min. : 1.0 Min. : 1 Min. :1.000 Min. :2015-08-01
## 1st Qu.: 279.8 1st Qu.:10273 1st Qu.:2.000 1st Qu.:2015-08-12
## Median : 553.5 Median :20545 Median :4.000 Median :2015-08-24
## Mean : 555.9 Mean :20545 Mean :3.979 Mean :2015-08-24
## 3rd Qu.: 832.2 3rd Qu.:30816 3rd Qu.:6.000 3rd Qu.:2015-09-05
## Max. :1115.0 Max. :41088 Max. :7.000 Max. :2015-09-17
## Open Promo StateHoliday SchoolHoliday
## Min. :0.0000 Min. :0.0000 Min. :0.000000 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.000000 1st Qu.:0.0000
## Median :1.0000 Median :0.0000 Median :0.000000 Median :0.0000
## Mean :0.8544 Mean :0.3958 Mean :0.004381 Mean :0.4435
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.000000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.000000 Max. :1.0000
## StoreType Assortment CompetitionDistance
## Length:41088 Length:41088 Min. : 1
## Class :character Class :character 1st Qu.: 710
## Mode :character Mode :character Median : 2410
## Mean : 5077
## 3rd Qu.: 6435
## Max. :75860
## CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2
## Min. : 1.0 Min. : 1 Min. :0.0000
## 1st Qu.: 1.0 1st Qu.: 1 1st Qu.:0.0000
## Median : 4.0 Median :2005 Median :1.0000
## Mean : 4.8 Mean :1265 Mean :0.5806
## 3rd Qu.: 9.0 3rd Qu.:2011 3rd Qu.:1.0000
## Max. :12.0 Max. :2015 Max. :1.0000
## Promo2SinceWeek Promo2SinceYear PromoInterval month
## Min. : 1.0 Min. : 1 Length:41088 Min. :8.000
## 1st Qu.: 1.0 1st Qu.: 1 Class :character 1st Qu.:8.000
## Median : 9.0 Median :2010 Mode :character Median :8.000
## Mean :14.6 Mean :1168 Mean :8.354
## 3rd Qu.:31.0 3rd Qu.:2012 3rd Qu.:9.000
## Max. :49.0 Max. :2015 Max. :9.000
## year day DayOfYear week
## Min. :15 Min. : 1.00 Min. :212.0 Min. :31.00
## 1st Qu.:15 1st Qu.: 6.75 1st Qu.:223.8 1st Qu.:32.75
## Median :15 Median :12.50 Median :235.5 Median :34.00
## Mean :15 Mean :13.52 Mean :235.5 Mean :34.21
## 3rd Qu.:15 3rd Qu.:19.25 3rd Qu.:247.2 3rd Qu.:36.00
## Max. :15 Max. :31.00 Max. :259.0 Max. :38.00
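Selecting the features relevant to our analysis. The Sales column is left out because it is the value we are going to predict; Date, Customers and CompetitionOpenSinceMonth are left out as well. Character columns are converted to integer codes.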
variable.names <- names(train1)[c(1,2,6,7,8:12,14:23)]
for (f in variable.names) {
  if (is.character(train1[[f]])) {
    # use the same levels for train and test so the integer codes agree
    levels <- unique(c(train1[[f]], test1[[f]]))
    train1[[f]] <- as.integer(factor(train1[[f]], levels=levels))
    test1[[f]] <- as.integer(factor(test1[[f]], levels=levels))
  }
}
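The reason for building the levels from train and test together can be seen on a toy example. The vectors below are made up; they stand in for a character column such as StoreType.

```r
# Toy character columns (values are made up)
train_type <- c("a", "c", "a")
test_type  <- c("c", "d")

# Shared levels, as in the loop above
lv <- unique(c(train_type, test_type))                   # "a" "c" "d"
train_code <- as.integer(factor(train_type, levels = lv))
test_code  <- as.integer(factor(test_type,  levels = lv))

# Encoding the test column on its own gives different codes
test_alone <- as.integer(factor(test_type))

print(train_code)  # codes for train with shared levels
print(test_code)   # codes for test with shared levels
print(test_alone)  # codes for test encoded separately
```

With shared levels "c" gets the same code in both sets; encoded separately, "c" would get a different code in the test set and the model would treat it as a different category.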
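result <- randomForest(train1[,variable.names],   # predictors
                       log(train1$Sales+1),       # response: log of sales
                       mtry=5,                    # variables tried at each split
                       ntree=50,                  # number of trees
                       sampsize=150000,           # rows sampled for each tree
                       do.trace=TRUE)             # print the OOB error per tree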
## | Out-of-bag |
## Tree | MSE %Var(y) |
## 1 | 0.08369 45.07 |
## 2 | 0.06868 36.99 |
## 3 | 0.05841 31.46 |
## 4 | 0.0535 28.81 |
## 5 | 0.04812 25.91 |
## 6 | 0.04416 23.78 |
## 7 | 0.04266 22.97 |
## 8 | 0.04094 22.05 |
## 9 | 0.03996 21.52 |
## 10 | 0.0398 21.43 |
## 11 | 0.03948 21.26 |
## 12 | 0.03889 20.94 |
## 13 | 0.03846 20.71 |
## 14 | 0.03885 20.92 |
## 15 | 0.03882 20.91 |
## 16 | 0.03885 20.92 |
## 17 | 0.03864 20.81 |
## 18 | 0.03866 20.82 |
## 19 | 0.0383 20.63 |
## 20 | 0.03808 20.51 |
## 21 | 0.03784 20.38 |
## 22 | 0.03777 20.34 |
## 23 | 0.03757 20.23 |
## 24 | 0.03732 20.10 |
## 25 | 0.03723 20.05 |
## 26 | 0.03741 20.15 |
## 27 | 0.03709 19.97 |
## 28 | 0.03671 19.77 |
## 29 | 0.03679 19.81 |
## 30 | 0.03672 19.78 |
## 31 | 0.03665 19.74 |
## 32 | 0.03642 19.61 |
## 33 | 0.03655 19.68 |
## 34 | 0.03639 19.60 |
## 35 | 0.03621 19.50 |
## 36 | 0.03617 19.48 |
## 37 | 0.03613 19.45 |
## 38 | 0.03589 19.33 |
## 39 | 0.03598 19.38 |
## 40 | 0.03584 19.30 |
## 41 | 0.03576 19.26 |
## 42 | 0.0356 19.17 |
## 43 | 0.03535 19.04 |
## 44 | 0.0353 19.01 |
## 45 | 0.03526 18.99 |
## 46 | 0.03529 19.01 |
## 47 | 0.03523 18.97 |
## 48 | 0.03514 18.92 |
## 49 | 0.035 18.85 |
## 50 | 0.03503 18.86 |
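The last line of the trace can be turned into more familiar numbers with a little arithmetic. The sketch below assumes that %Var(y) is the out-of-bag MSE expressed as a percentage of the variance of the response, which is how I read the randomForest trace; the two input values are copied from the row for tree 50.

```r
mse_50     <- 0.03503   # OOB MSE after 50 trees (from the trace)
pct_var_50 <- 18.86     # %Var(y) after 50 trees (from the trace)

var_explained <- 1 - pct_var_50 / 100      # share of variance explained (pseudo R-squared)
implied_var_y <- mse_50 / (pct_var_50 / 100)  # variance of log(Sales + 1) implied by the two numbers
rmse_log      <- sqrt(mse_50)              # typical error on the log scale

cat("variance explained:", round(var_explained, 4), "\n")
cat("implied var(y):    ", round(implied_var_y, 4), "\n")
cat("OOB RMSE (log):    ", round(rmse_log, 4), "\n")
```

Under that assumption the forest explains roughly 81% of the variance of log(Sales + 1), and most of the improvement happens within the first 10 to 15 trees.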
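Variable importance. The type = 1 measure (permutation importance) is empty below because the forest was fitted without importance = TRUE; only the type = 2 measure (node purity) was computed.
importance(result, type = 1)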
##
## Store
## DayOfWeek
## Open
## Promo
## StateHoliday
## SchoolHoliday
## StoreType
## Assortment
## CompetitionDistance
## CompetitionOpenSinceYear
## Promo2
## Promo2SinceWeek
## Promo2SinceYear
## PromoInterval
## month
## year
## day
## DayOfYear
## week
importance(result, type = 2)
## IncNodePurity
## Store 3785.7361
## DayOfWeek 1721.9592
## Open 0.0000
## Promo 3767.6600
## StateHoliday 0.0000
## SchoolHoliday 144.1649
## StoreType 1360.8727
## Assortment 691.0718
## CompetitionDistance 4623.0304
## CompetitionOpenSinceYear 2052.2270
## Promo2 111.6698
## Promo2SinceWeek 973.3825
## Promo2SinceYear 938.3932
## PromoInterval 497.4217
## month 363.4467
## year 320.2606
## day 1034.1576
## DayOfYear 996.8762
## week 772.4623
varImpPlot(result)
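For comparison, here is a small sketch of a forest fitted with importance = TRUE, in which case both importance measures are available. It uses the built-in mtcars data rather than the sales data, purely so that it runs quickly; the object name fit is my own.

```r
library(randomForest)
set.seed(415)

# Small regression forest on a built-in data set, with importance switched on
fit <- randomForest(x = mtcars[, -1],   # all columns except mpg as predictors
                    y = mtcars$mpg,     # mpg as response
                    ntree = 50,
                    importance = TRUE)

imp1 <- importance(fit, type = 1)  # permutation importance (%IncMSE)
imp2 <- importance(fit, type = 2)  # node purity (IncNodePurity)
print(imp1)
print(imp2)
```

Computing the permutation measure adds to the fitting time, which is worth keeping in mind on a training set of this size.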
pred <- exp(predict(result, test1)) -1
# Take Id from test1, not test: merge() reorders the rows, so test$Id would not line up with pred
submission <- data.frame(Id=test1$Id, Sales=pred)
write_csv(submission, "C:/Users/6430/Desktop/Project/resultfile.csv")
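The model was fitted on log(Sales + 1), so the predictions have to be transformed back with exp(...) - 1, as done above. A short sketch of that round trip, using a few values from the Sales summary as toy input; log1p() and expm1() are the base-R equivalents of the two expressions.

```r
sales <- c(0, 4859, 6369, 41551)   # a few values from the Sales summary

y    <- log(sales + 1)             # transform used for training
back <- exp(y) - 1                 # back-transform used for the predictions

# Same thing with the dedicated base-R functions
y2    <- log1p(sales)
back2 <- expm1(y2)

print(back)
print(back2)
```

The + 1 keeps the transform defined for days with zero sales, and the back-transform maps a log-scale value of 0 back to 0 sales.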