A Random Forest is an ensemble of decision trees that can be used for both classification and regression. The result from an ensemble model is usually better than the result from any one of its individual models. In a Random Forest, each decision tree is built from a random (bootstrap) sample of the training data, in which both the predictors and the response are known.

In a decision tree, an input enters at the top, and as it travels down the tree the data are partitioned into smaller and smaller subsets. The random forest takes the notion of a decision tree to the next level by combining many trees and aggregating their predictions. In ensemble terms, the individual trees are the weak learners and the random forest is the strong learner.
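
The benefit of combining weak learners can be illustrated with a small base-R sketch. It is not a random forest: the learners are one-split "stumps" with a randomly chosen split point, each fit on a bootstrap sample of simulated data, and every name in it (`fit_stump`, `stumps`, the simulated `x` and `y`) is made up for illustration.

```r
set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.3)

# A deliberately weak learner: one random split, fit on a bootstrap sample
fit_stump <- function(x, y) {
  idx <- sample(length(x), replace = TRUE)
  xb <- x[idx]
  yb <- y[idx]
  split <- sample(xb, 1)
  left <- mean(yb[xb <= split])
  # If nothing falls to the right of the split, reuse the left mean
  right <- if (any(xb > split)) mean(yb[xb > split]) else left
  function(newx) ifelse(newx <= split, left, right)
}

stumps <- replicate(100, fit_stump(x, y), simplify = FALSE)
preds <- sapply(stumps, function(f) f(x))   # n x 100 matrix, one column per stump

mse_each <- colMeans((preds - y)^2)          # error of each stump on its own
mse_ens  <- mean((rowMeans(preds) - y)^2)    # error of the averaged prediction

c(mean_single = mean(mse_each), ensemble = mse_ens)
```

The averaged prediction can never have a larger squared error than the average of the individual errors, which is the basic reason an ensemble tends to beat its members.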

Loading the libraries

library(readr)
library(randomForest)
set.seed(415)

Reading the CSV files to be analyzed

train <- read_csv("C:/Users/6430/Desktop/Project/train.csv/train.csv")
test  <- read_csv("C:/Users/6430/Desktop/Project/test.csv/test.csv")
store <- read_csv("C:/Users/6430/Desktop/Project/store.csv/store.csv")

## Merge each file with the store data: the store-level features are kept in a separate file and have to be joined in to see the full effect of the features on sales.
train1 <- merge(train, store, by = "Store")
test1 <- merge(test, store, by = "Store")
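
One property of `merge()` matters later in this analysis: by default it sorts the result on the key column, so the merged rows are generally not in the order of the original file. A minimal sketch with made-up toy tables (`test_toy` and `store_toy` are hypothetical, not the real data):

```r
test_toy  <- data.frame(Id = 1:4, Store = c(3L, 1L, 2L, 1L))
store_toy <- data.frame(Store = 1:3, StoreType = c("a", "b", "c"))

merged <- merge(test_toy, store_toy, by = "Store")

merged          # rows are ordered by Store, and Store is now the first column
test_toy$Id     # original row order
merged$Id       # row order after the merge
```

Any column that identifies a row, such as `Id`, should therefore be read from the merged table rather than from the original one.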

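All ‘NA’ values in the train data are converted to zeros. In the test data, store 622 has 11 missing values in the “Open” column; I decided to enter “1” for them, because otherwise the sales for store 622 would not be predicted correctly. The replacement below is applied to the whole test table, so every other ‘NA’ in the test data (for example in the competition and Promo2 columns) becomes 1 as well.

train1[is.na(train1)] <- 0
test1[is.na(test1)] <- 1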
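
If only the “Open” column is meant to receive the value 1, the replacement can be restricted to that column. The sketch below contrasts the two approaches on a made-up three-row table (`test_toy` is hypothetical); it is an alternative, not what the analysis here does.

```r
test_toy <- data.frame(
  Store = c(622L, 622L, 1L),
  Open = c(NA, 1, 0),
  CompetitionOpenSinceYear = c(NA, 2008, 2010)
)

# Blanket replacement: every NA in every column becomes 1
blanket <- test_toy
blanket[is.na(blanket)] <- 1

# Targeted replacement: only the Open column is filled
targeted <- test_toy
targeted$Open[is.na(targeted$Open)] <- 1

blanket
targeted
```

With the targeted version the other columns still contain NA, so they would need their own fill value before prediction, for example the 0 used for the train data.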

## We will only look at the rows where the store was open
train1 <- train1[which(train1$Open == 1), ]

Both train1 and test1 have a “Date” column. We will separate the date into month, year and day, and also derive the day of the year and the week. These new variables generated from the “Date” column are easier for the model to use when predicting sales than the raw date.

train1$Date <- as.Date(train1$Date)
test1$Date <- as.Date(test1$Date)

train1$month <- as.integer(format(train1$Date, "%m"))
train1$year <- as.integer(format(train1$Date, "%y"))
train1$day <- as.integer(format(train1$Date, "%d"))
train1$DayOfYear <- as.integer(as.POSIXlt(train1$Date)$yday)
train1$week <- as.integer(format(train1$Date + 3, "%U"))

test1$month <- as.integer(format(test1$Date, "%m"))
test1$year <- as.integer(format(test1$Date, "%y"))
test1$day <- as.integer(format(test1$Date, "%d"))
test1$DayOfYear <- as.integer(as.POSIXlt(test1$Date)$yday)
test1$week <- as.integer(format(test1$Date + 3, "%U"))
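
Because the same five features are derived for both tables, the steps can be wrapped in one helper so that train and test cannot drift apart. This is a sketch of that idea; `add_date_features` is a hypothetical name, and the toy dates are chosen only to show the ranges of the format codes.

```r
# Derive the same date features as above from a column called Date
add_date_features <- function(df) {
  d <- as.Date(df$Date)
  df$month     <- as.integer(format(d, "%m"))        # 1-12
  df$year      <- as.integer(format(d, "%y"))        # two-digit year, e.g. 15
  df$day       <- as.integer(format(d, "%d"))        # day of month
  df$DayOfYear <- as.integer(as.POSIXlt(d)$yday)     # zero-based day of year
  df$week      <- as.integer(format(d + 3, "%U"))    # week number, date shifted by 3 days
  df
}

toy <- data.frame(Date = c("2015-08-01", "2013-01-01"))
toy <- add_date_features(toy)
toy
```

Note that `%y` gives the two-digit year and `yday` starts at 0, which is why the summaries show years 13 to 15 and a minimum DayOfYear of 0.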
names(train1)
##  [1] "Store"                     "DayOfWeek"                
##  [3] "Date"                      "Sales"                    
##  [5] "Customers"                 "Open"                     
##  [7] "Promo"                     "StateHoliday"             
##  [9] "SchoolHoliday"             "StoreType"                
## [11] "Assortment"                "CompetitionDistance"      
## [13] "CompetitionOpenSinceMonth" "CompetitionOpenSinceYear" 
## [15] "Promo2"                    "Promo2SinceWeek"          
## [17] "Promo2SinceYear"           "PromoInterval"            
## [19] "month"                     "year"                     
## [21] "day"                       "DayOfYear"                
## [23] "week"
summary(train1)
##      Store          DayOfWeek         Date                Sales      
##  Min.   :   1.0   Min.   :1.00   Min.   :2013-01-01   Min.   :    0  
##  1st Qu.: 280.0   1st Qu.:2.00   1st Qu.:2013-08-16   1st Qu.: 4859  
##  Median : 558.0   Median :3.00   Median :2014-03-31   Median : 6369  
##  Mean   : 558.4   Mean   :3.52   Mean   :2014-04-11   Mean   : 6956  
##  3rd Qu.: 837.0   3rd Qu.:5.00   3rd Qu.:2014-12-10   3rd Qu.: 8360  
##  Max.   :1115.0   Max.   :7.00   Max.   :2015-07-31   Max.   :41551  
##    Customers           Open       Promo         StateHoliday
##  Min.   :   0.0   Min.   :1   Min.   :0.0000   Min.   :0    
##  1st Qu.: 519.0   1st Qu.:1   1st Qu.:0.0000   1st Qu.:0    
##  Median : 676.0   Median :1   Median :0.0000   Median :0    
##  Mean   : 762.7   Mean   :1   Mean   :0.4464   Mean   :0    
##  3rd Qu.: 893.0   3rd Qu.:1   3rd Qu.:1.0000   3rd Qu.:0    
##  Max.   :7388.0   Max.   :1   Max.   :1.0000   Max.   :0    
##  SchoolHoliday     StoreType          Assortment       
##  Min.   :0.0000   Length:844392      Length:844392     
##  1st Qu.:0.0000   Class :character   Class :character  
##  Median :0.0000   Mode  :character   Mode  :character  
##  Mean   :0.1936                                        
##  3rd Qu.:0.0000                                        
##  Max.   :1.0000                                        
##  CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear
##  Min.   :    0       Min.   : 0.000            Min.   :   0            
##  1st Qu.:  700       1st Qu.: 0.000            1st Qu.:   0            
##  Median : 2320       Median : 4.000            Median :2006            
##  Mean   : 5444       Mean   : 4.926            Mean   :1370            
##  3rd Qu.: 6880       3rd Qu.: 9.000            3rd Qu.:2011            
##  Max.   :75860       Max.   :12.000            Max.   :2015            
##      Promo2       Promo2SinceWeek Promo2SinceYear PromoInterval     
##  Min.   :0.0000   Min.   : 0.0    Min.   :   0    Length:844392     
##  1st Qu.:0.0000   1st Qu.: 0.0    1st Qu.:   0    Class :character  
##  Median :0.0000   Median : 0.0    Median :   0    Mode  :character  
##  Mean   :0.4987   Mean   :11.6    Mean   :1003                      
##  3rd Qu.:1.0000   3rd Qu.:22.0    3rd Qu.:2012                      
##  Max.   :1.0000   Max.   :50.0    Max.   :2015                      
##      month             year            day          DayOfYear    
##  Min.   : 1.000   Min.   :13.00   Min.   : 1.00   Min.   :  0.0  
##  1st Qu.: 3.000   1st Qu.:13.00   1st Qu.: 8.00   1st Qu.: 74.0  
##  Median : 6.000   Median :14.00   Median :16.00   Median :153.0  
##  Mean   : 5.846   Mean   :13.83   Mean   :15.84   Mean   :161.4  
##  3rd Qu.: 8.000   3rd Qu.:14.00   3rd Qu.:23.00   3rd Qu.:240.0  
##  Max.   :12.000   Max.   :15.00   Max.   :31.00   Max.   :364.0  
##       week      
##  Min.   : 0.00  
##  1st Qu.:11.00  
##  Median :22.00  
##  Mean   :23.14  
##  3rd Qu.:34.00  
##  Max.   :52.00
names(test1)
##  [1] "Store"                     "Id"                       
##  [3] "DayOfWeek"                 "Date"                     
##  [5] "Open"                      "Promo"                    
##  [7] "StateHoliday"              "SchoolHoliday"            
##  [9] "StoreType"                 "Assortment"               
## [11] "CompetitionDistance"       "CompetitionOpenSinceMonth"
## [13] "CompetitionOpenSinceYear"  "Promo2"                   
## [15] "Promo2SinceWeek"           "Promo2SinceYear"          
## [17] "PromoInterval"             "month"                    
## [19] "year"                      "day"                      
## [21] "DayOfYear"                 "week"
summary(test1)
##      Store              Id          DayOfWeek          Date           
##  Min.   :   1.0   Min.   :    1   Min.   :1.000   Min.   :2015-08-01  
##  1st Qu.: 279.8   1st Qu.:10273   1st Qu.:2.000   1st Qu.:2015-08-12  
##  Median : 553.5   Median :20545   Median :4.000   Median :2015-08-24  
##  Mean   : 555.9   Mean   :20545   Mean   :3.979   Mean   :2015-08-24  
##  3rd Qu.: 832.2   3rd Qu.:30816   3rd Qu.:6.000   3rd Qu.:2015-09-05  
##  Max.   :1115.0   Max.   :41088   Max.   :7.000   Max.   :2015-09-17  
##       Open            Promo         StateHoliday      SchoolHoliday   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.000000   Min.   :0.0000  
##  1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.000000   1st Qu.:0.0000  
##  Median :1.0000   Median :0.0000   Median :0.000000   Median :0.0000  
##  Mean   :0.8544   Mean   :0.3958   Mean   :0.004381   Mean   :0.4435  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.000000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.000000   Max.   :1.0000  
##   StoreType          Assortment        CompetitionDistance
##  Length:41088       Length:41088       Min.   :    1      
##  Class :character   Class :character   1st Qu.:  710      
##  Mode  :character   Mode  :character   Median : 2410      
##                                        Mean   : 5077      
##                                        3rd Qu.: 6435      
##                                        Max.   :75860      
##  CompetitionOpenSinceMonth CompetitionOpenSinceYear     Promo2      
##  Min.   : 1.0              Min.   :   1             Min.   :0.0000  
##  1st Qu.: 1.0              1st Qu.:   1             1st Qu.:0.0000  
##  Median : 4.0              Median :2005             Median :1.0000  
##  Mean   : 4.8              Mean   :1265             Mean   :0.5806  
##  3rd Qu.: 9.0              3rd Qu.:2011             3rd Qu.:1.0000  
##  Max.   :12.0              Max.   :2015             Max.   :1.0000  
##  Promo2SinceWeek Promo2SinceYear PromoInterval          month      
##  Min.   : 1.0    Min.   :   1    Length:41088       Min.   :8.000  
##  1st Qu.: 1.0    1st Qu.:   1    Class :character   1st Qu.:8.000  
##  Median : 9.0    Median :2010    Mode  :character   Median :8.000  
##  Mean   :14.6    Mean   :1168                       Mean   :8.354  
##  3rd Qu.:31.0    3rd Qu.:2012                       3rd Qu.:9.000  
##  Max.   :49.0    Max.   :2015                       Max.   :9.000  
##       year         day          DayOfYear          week      
##  Min.   :15   Min.   : 1.00   Min.   :212.0   Min.   :31.00  
##  1st Qu.:15   1st Qu.: 6.75   1st Qu.:223.8   1st Qu.:32.75  
##  Median :15   Median :12.50   Median :235.5   Median :34.00  
##  Mean   :15   Mean   :13.52   Mean   :235.5   Mean   :34.21  
##  3rd Qu.:15   3rd Qu.:19.25   3rd Qu.:247.2   3rd Qu.:36.00  
##  Max.   :15   Max.   :31.00   Max.   :259.0   Max.   :38.00

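These are the features relevant to our analysis. The Sales column is left out because it is the value we are going to predict, Customers is left out because it does not exist in the test data, and Date is replaced by the variables derived from it. CompetitionOpenSinceMonth is not used either.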

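# Predictor columns, selected by position in names(train1)
variable.names <- names(train1)[c(1, 2, 6, 7, 8:12, 14:23)]

# Convert character columns to integer codes, using one set of levels
# for train and test so that the same value gets the same code in both
for (f in variable.names) {
  if (is.character(train1[[f]])) {
    levels <- unique(c(train1[[f]], test1[[f]]))
    train1[[f]] <- as.integer(factor(train1[[f]], levels = levels))
    test1[[f]]  <- as.integer(factor(test1[[f]],  levels = levels))
  }
}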
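
The reason for building the levels from train and test together can be seen on two short made-up vectors (`train_col` and `test_col` are hypothetical):

```r
train_col <- c("a", "c", "a")
test_col  <- c("c", "d")

# Shared levels: "c" gets the same code in both vectors
lv <- unique(c(train_col, test_col))
train_shared <- as.integer(factor(train_col, levels = lv))
test_shared  <- as.integer(factor(test_col,  levels = lv))

# Separate levels: "c" is code 2 in train but code 1 in test
train_sep <- as.integer(factor(train_col))
test_sep  <- as.integer(factor(test_col))

rbind(shared = c(train = train_shared[2], test = test_shared[1]),
      separate = c(train = train_sep[2], test = test_sep[1]))
```

With separate encodings the model would be trained on one meaning of a code and asked to predict on another.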
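# Fit on log(Sales + 1) to reduce the skew in Sales.
# randomForest() has no max_depth argument, so it is dropped here;
# tree size is controlled with nodesize and maxnodes.
result <- randomForest(train1[, variable.names],
                       log(train1$Sales + 1),
                       mtry = 5,
                       ntree = 50,
                       sampsize = 150000,
                       do.trace = TRUE)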
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##    1 |  0.08369    45.07 |
##    2 |  0.06868    36.99 |
##    3 |  0.05841    31.46 |
##    4 |   0.0535    28.81 |
##    5 |  0.04812    25.91 |
##    6 |  0.04416    23.78 |
##    7 |  0.04266    22.97 |
##    8 |  0.04094    22.05 |
##    9 |  0.03996    21.52 |
##   10 |   0.0398    21.43 |
##   11 |  0.03948    21.26 |
##   12 |  0.03889    20.94 |
##   13 |  0.03846    20.71 |
##   14 |  0.03885    20.92 |
##   15 |  0.03882    20.91 |
##   16 |  0.03885    20.92 |
##   17 |  0.03864    20.81 |
##   18 |  0.03866    20.82 |
##   19 |   0.0383    20.63 |
##   20 |  0.03808    20.51 |
##   21 |  0.03784    20.38 |
##   22 |  0.03777    20.34 |
##   23 |  0.03757    20.23 |
##   24 |  0.03732    20.10 |
##   25 |  0.03723    20.05 |
##   26 |  0.03741    20.15 |
##   27 |  0.03709    19.97 |
##   28 |  0.03671    19.77 |
##   29 |  0.03679    19.81 |
##   30 |  0.03672    19.78 |
##   31 |  0.03665    19.74 |
##   32 |  0.03642    19.61 |
##   33 |  0.03655    19.68 |
##   34 |  0.03639    19.60 |
##   35 |  0.03621    19.50 |
##   36 |  0.03617    19.48 |
##   37 |  0.03613    19.45 |
##   38 |  0.03589    19.33 |
##   39 |  0.03598    19.38 |
##   40 |  0.03584    19.30 |
##   41 |  0.03576    19.26 |
##   42 |   0.0356    19.17 |
##   43 |  0.03535    19.04 |
##   44 |   0.0353    19.01 |
##   45 |  0.03526    18.99 |
##   46 |  0.03529    19.01 |
##   47 |  0.03523    18.97 |
##   48 |  0.03514    18.92 |
##   49 |    0.035    18.85 |
##   50 |  0.03503    18.86 |
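
The last row of the trace can be turned into more familiar quantities. The sketch below assumes, as the numbers in the table suggest, that the %Var(y) column is the out-of-bag MSE expressed as a percentage of the variance of the response; the variable names are made up for illustration, and the two input values are copied from the row for tree 50.

```r
oob_mse <- 0.03503   # out-of-bag MSE after 50 trees, on the log(Sales + 1) scale
pct_var <- 18.86     # %Var(y) after 50 trees

explained <- 100 - pct_var            # percent of variance explained: 81.14
var_y     <- oob_mse / (pct_var / 100) # implied variance of log(Sales + 1)
oob_rmse  <- sqrt(oob_mse)            # typical out-of-bag error on the log scale

c(explained = explained, var_y = var_y, oob_rmse = oob_rmse)
```

The error is measured on the log scale, so it describes relative rather than absolute deviations in sales.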
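# type = 1 is the permutation measure; it is only computed when randomForest()
# is called with importance = TRUE, so no values appear here
importance(result, type = 1)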
##                         
## Store                   
## DayOfWeek               
## Open                    
## Promo                   
## StateHoliday            
## SchoolHoliday           
## StoreType               
## Assortment              
## CompetitionDistance     
## CompetitionOpenSinceYear
## Promo2                  
## Promo2SinceWeek         
## Promo2SinceYear         
## PromoInterval           
## month                   
## year                    
## day                     
## DayOfYear               
## week
importance(result, type = 2)
##                          IncNodePurity
## Store                        3785.7361
## DayOfWeek                    1721.9592
## Open                            0.0000
## Promo                        3767.6600
## StateHoliday                    0.0000
## SchoolHoliday                 144.1649
## StoreType                    1360.8727
## Assortment                    691.0718
## CompetitionDistance          4623.0304
## CompetitionOpenSinceYear     2052.2270
## Promo2                        111.6698
## Promo2SinceWeek               973.3825
## Promo2SinceYear               938.3932
## PromoInterval                 497.4217
## month                         363.4467
## year                          320.2606
## day                          1034.1576
## DayOfYear                     996.8762
## week                          772.4623
varImpPlot(result)
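
To obtain both importance measures, the forest has to be grown with `importance = TRUE`. The sketch below shows this on the built-in `mtcars` data rather than on the sales data, since refitting the full model is slow; the object names (`fit`, `perm_imp`, `node_imp`) are made up for illustration.

```r
library(randomForest)
set.seed(415)

# Small regression forest with permutation importance switched on
fit <- randomForest(mpg ~ ., data = mtcars, ntree = 100, importance = TRUE)

perm_imp <- importance(fit, type = 1)   # %IncMSE: increase in OOB error after permuting a predictor
node_imp <- importance(fit, type = 2)   # IncNodePurity: total decrease in node impurity

# Predictors ranked by the permutation measure
perm_imp[order(-perm_imp[, 1]), , drop = FALSE]
```

The permutation measure costs extra computation, which is worth keeping in mind with a training set of this size.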

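# Undo the log(Sales + 1) transform to get predictions on the original scale
pred <- exp(predict(result, test1)) - 1
# merge() reordered the rows, so Id must come from test1 (not test) to stay aligned with pred
submission <- data.frame(Id = test1$Id, Sales = pred)
write_csv(submission, "C:/Users/6430/Desktop/Project/resultfile.csv")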
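
Two details of the prediction step can be checked on toy numbers. First, `log(x + 1)` and `exp(x) - 1` are available in base R as `log1p()` and `expm1()`, which are exact inverses of each other. Second, the model was trained on open stores only, so it has no way to learn that a closed store sells nothing; a possible post-processing step, assuming closed stores have zero sales, is to overwrite those predictions. All values and names below are made up for illustration.

```r
# Round trip of the transform used for the response
sales <- c(0, 4859, 41551)
back  <- expm1(log1p(sales))
all.equal(back, sales)

# Hypothetical predictions for three test rows, the second store being closed
pred_toy <- c(5200.4, 6100.9, 4800.2)
open_toy <- c(1, 0, 1)
pred_toy[open_toy == 0] <- 0
pred_toy
```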