PROBLEM STATEMENT

Use the dataset “base_data.csv” to build a model. Variable names are self-explanatory. The task is to build a predictive model for sales figures, given other information related to counterfeit medicine selling operations.

AIM: Use Random Forest and Decision Trees to build models on the train data and compare their performance on the test data. Also produce the variable importance plot for the random forest model.

Initial setup for using the decision tree and random forest models:

library(tree)
## Warning: package 'tree' was built under R version 3.3.3
library(ISLR)
## Warning: package 'ISLR' was built under R version 3.3.3
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.3.3
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
med=read.csv("base_data.csv") # text columns are read in as factors, the rest as num/int (see str() below)
View(med)
str(med)
## 'data.frame':    8523 obs. of  12 variables:
##  $ Medicine_ID        : Factor w/ 1557 levels "AAJ32","AAS12",..: 1228 276 782 318 1088 191 505 1543 402 564 ...
##  $ Counterfeit_Weight : num  10.6 7.22 18.8 20.5 10.23 ...
##  $ DistArea_ID        : Factor w/ 10 levels "Area010","Area013",..: 10 4 10 1 2 4 2 6 8 3 ...
##  $ Active_Since       : int  1997 2007 1997 1996 1985 2007 1985 1983 2000 2005 ...
##  $ Medicine_MRP       : num  260.1 58.6 151.9 192.4 64.2 ...
##  $ Medicine_Type      : Factor w/ 16 levels "Analgesics","Antacids",..: 6 1 5 3 7 11 10 10 13 13 ...
##  $ SidEffect_Level    : Factor w/ 2 levels "critical","mild": 2 1 2 1 2 1 1 2 1 1 ...
##  $ Availability_rating: num  0.029 0.0323 0.0298 0.013 0.013 ...
##  $ Area_Type          : Factor w/ 4 levels "CityLimits","DownTown",..: 2 3 2 4 2 3 2 1 2 2 ...
##  $ Area_City_Type     : Factor w/ 3 levels "Tier 1","Tier 2",..: 1 3 1 3 3 3 3 3 2 2 ...
##  $ Area_dist_level    : Factor w/ 4 levels "High","Medium",..: 2 2 2 4 1 2 1 2 4 4 ...
##  $ Counterfeit_Sales  : num  3848 556 2210 845 1108 ...
glimpse(med)
## Observations: 8,523
## Variables: 12
## $ Medicine_ID         <fctr> UJM11, ERM22, NDK33, FKW44, RZH55, DIA66,...
## $ Counterfeit_Weight  <dbl> 10.600, 7.220, 18.800, 20.500, 10.230, 11....
## $ DistArea_ID         <fctr> Area049, Area018, Area049, Area010, Area0...
## $ Active_Since        <int> 1997, 2007, 1997, 1996, 1985, 2007, 1985, ...
## $ Medicine_MRP        <dbl> 260.1092, 58.5692, 151.9180, 192.3950, 64....
## $ Medicine_Type       <fctr> Antipyretics, Analgesics, Antimalarial, A...
## $ SidEffect_Level     <fctr> mild, critical, mild, critical, mild, cri...
## $ Availability_rating <dbl> 0.02904730, 0.03227822, 0.02976008, 0.0130...
## $ Area_Type           <fctr> DownTown, Industrial, DownTown, MidTownRe...
## $ Area_City_Type      <fctr> Tier 1, Tier 3, Tier 1, Tier 3, Tier 3, T...
## $ Area_dist_level     <fctr> Medium, Medium, Medium, Unknown, High, Me...
## $ Counterfeit_Sales   <dbl> 3848.1380, 556.4228, 2210.2700, 845.3800, ...

Step 1: Data preparation. Drop the rows with missing values; this reduces the data from 8,523 to 7,060 observations.

med=med %>%
  na.omit()
glimpse(med)
## Observations: 7,060
## Variables: 12
## $ Medicine_ID         <fctr> UJM11, ERM22, NDK33, FKW44, RZH55, DIA66,...
## $ Counterfeit_Weight  <dbl> 10.600, 7.220, 18.800, 20.500, 10.230, 11....
## $ DistArea_ID         <fctr> Area049, Area018, Area049, Area010, Area0...
## $ Active_Since        <int> 1997, 2007, 1997, 1996, 1985, 2007, 1985, ...
## $ Medicine_MRP        <dbl> 260.1092, 58.5692, 151.9180, 192.3950, 64....
## $ Medicine_Type       <fctr> Antipyretics, Analgesics, Antimalarial, A...
## $ SidEffect_Level     <fctr> mild, critical, mild, critical, mild, cri...
## $ Availability_rating <dbl> 0.02904730, 0.03227822, 0.02976008, 0.0130...
## $ Area_Type           <fctr> DownTown, Industrial, DownTown, MidTownRe...
## $ Area_City_Type      <fctr> Tier 1, Tier 3, Tier 1, Tier 3, Tier 3, T...
## $ Area_dist_level     <fctr> Medium, Medium, Medium, Unknown, High, Me...
## $ Counterfeit_Sales   <dbl> 3848.1380, 556.4228, 2210.2700, 845.3800, ...
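
Before dropping rows, it can help to see which columns hold the missing values and how many rows na.omit() will keep. A minimal sketch on a small made-up data frame (the `toy` data and its values are hypothetical, not taken from base_data.csv):

```r
# Hypothetical toy data standing in for `med`; all values are made up.
toy <- data.frame(
  Counterfeit_Weight = c(10.6, NA, 18.8, NA, 10.23),
  Medicine_MRP       = c(260.1, 58.6, NA, 192.4, 64.2),
  Counterfeit_Sales  = c(3848, 556, 2210, 845, 1108)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Rows with no NA in any column: exactly the rows na.omit() keeps
rows_kept <- sum(complete.cases(toy))
toy_clean <- na.omit(toy)
print(c(before = nrow(toy), after = nrow(toy_clean)))
```

On the real data the same two calls would be applied to `med` before the na.omit() step.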

Divide the data into train and test sets in a 70:30 ratio.

set.seed(3)
s=sample(1:nrow(med),0.7*nrow(med))
med_train=med[s,]
med_test=med[-s,]
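
A quick sanity check on a split like this is that the two index sets are disjoint and together cover every row. The sketch below works on row indices only, using the 7060-row count reported above; `train_idx` and `test_idx` are illustrative names, and floor() is added here to make the sample size an explicit integer.

```r
set.seed(3)
n <- 7060                          # row count after na.omit(), from the output above
s <- sample(1:n, floor(0.7 * n))   # 70% of the row indices, drawn without replacement

train_idx <- s
test_idx  <- setdiff(1:n, s)       # the rows that a negative index such as med[-s, ] selects

print(c(train = length(train_idx), test = length(test_idx)))

# No row may appear in both sets, and no row may be lost
overlap <- length(intersect(train_idx, test_idx))
covered <- length(union(train_idx, test_idx))
```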

Here the response variable is Counterfeit_Sales, which is a continuous numeric variable. Hence this is a regression problem.

Step 2: Build a tree model on the train data

med.tree=tree(Counterfeit_Sales~.-Medicine_ID,data=med_train,na.action = na.exclude)
summary(med.tree)
## 
## Regression tree:
## tree(formula = Counterfeit_Sales ~ . - Medicine_ID, data = med_train, 
##     na.action = na.exclude)
## Variables actually used in tree construction:
## [1] "Medicine_MRP" "DistArea_ID" 
## Number of terminal nodes:  6 
## Residual mean deviance:  1188000 = 5.866e+09 / 4936 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -3433.00  -614.20   -86.68     0.00   554.80  6308.00

Terminal nodes: 6. Residual mean deviance: 1188000.

plot(med.tree)
text(med.tree,pretty = 0)

Step 3: Predict on the test data and find the error

med.pred=predict(med.tree,newdata = med_test)

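Finding the error. Note that the expression below takes the square root of the sum of squared errors without dividing by the number of test rows, so its result is not yet the RMSE:

sum((med_test$Counterfeit_Sales-med.pred)**2) %>%
  sqrt()
## [1] 49925.79

The root of the summed squared error is 49925.79. The test set has 7060 - 4942 = 2118 rows, so the RMSE is 49925.79 / sqrt(2118) ≈ 1084.8.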
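
The error expression above takes the square root of the summed squared errors, whereas RMSE averages the squared errors first. A minimal sketch of the difference, using two hypothetical helper functions (`rmse` and `root_sse` are names introduced here) and made-up numbers:

```r
# Root mean squared error: average the squared errors, then take the root
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Root of the summed squared error, as in the expression above
root_sse <- function(actual, predicted) sqrt(sum((actual - predicted)^2))

# Hypothetical values, for illustration only
actual    <- c(100, 200, 300, 400)
predicted <- c(110, 190, 310, 390)

print(rmse(actual, predicted))      # every error has magnitude 10, so this is 10
print(root_sse(actual, predicted))  # sqrt(4 * 100) = 20, which equals rmse * sqrt(n)
```

The two quantities always differ by a factor of sqrt(n), so they rank models the same way on a fixed test set, but only the RMSE is on the scale of a typical per-row error.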

Step 4: Cross-validate the tree to see whether pruning would help

set.seed(3)
cv.med=cv.tree(med.tree)

Step 5: Plot the cross-validated deviance against tree size

plot(cv.med$size,cv.med$dev,type="b")

The cross-validated deviance is smallest at 6 terminal nodes, which is the size of the tree already fitted, so pruning is not needed and no new model is built.
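
Had the cross-validation favoured a smaller tree, the next step would be to prune to that size. The sketch below shows one way to do that with cv.tree() and prune.tree() from the tree package; it runs on simulated data (`toy`, `x1`, `x2` and `y` are made up for illustration), not on the medicine data.

```r
library(tree)

set.seed(3)
n <- 300
# Simulated data: y depends on two step functions of x1 and x2 plus noise
toy <- data.frame(x1 = runif(n), x2 = runif(n))
toy$y <- 10 * (toy$x1 > 0.5) + 5 * (toy$x2 > 0.3) + rnorm(n)

toy.tree <- tree(y ~ ., data = toy)
cv.toy   <- cv.tree(toy.tree)

# Smallest tree size that attains the minimum cross-validated deviance
best.size <- min(cv.toy$size[cv.toy$dev == min(cv.toy$dev)])

# Prune only when CV points to a smaller tree with more than one leaf
if (best.size > 1 && best.size < max(cv.toy$size)) {
  toy.pruned <- prune.tree(toy.tree, best = best.size)
} else {
  toy.pruned <- toy.tree
}

# Count terminal nodes of the resulting tree
n.leaves <- sum(toy.pruned$frame$var == "<leaf>")
print(c(best.size = best.size, leaves = n.leaves))
```

For the model in this document the equivalent call would be prune.tree(med.tree, best = k) for a chosen size k, which is unnecessary here because the full 6-node tree already has the lowest deviance.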


LET'S BUILD THE RANDOM FOREST MODEL

med.rf=randomForest(Counterfeit_Sales~.-Medicine_ID,data=med_train,na.action = na.exclude) # add do.trace=TRUE to print progress while fitting
med.rf
## 
## Call:
##  randomForest(formula = Counterfeit_Sales ~ . - Medicine_ID, data = med_train,      na.action = na.exclude) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 1193333
##                     % Var explained: 49.83

Number of trees: 500. Variables tried at each split: 3. Mean of squared residuals: 1193333. % Var explained: 49.83.

Now predict for the test data.
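
The printed summary can be turned into an error on the scale of the response. The sketch below uses only the two figures from the output above and assumes the pseudo R-squared relation 1 - MSE / Var(y) that the randomForest documentation gives for regression forests; the variable names are introduced here for illustration.

```r
oob_mse       <- 1193333   # "Mean of squared residuals" from the output above
pct_explained <- 49.83     # "% Var explained" from the output above

# Out-of-bag RMSE, in the same units as Counterfeit_Sales
oob_rmse <- sqrt(oob_mse)

# Variance of the training response implied by the two figures,
# assuming % Var explained = 100 * (1 - MSE / Var(y))
implied_var_y <- oob_mse / (1 - pct_explained / 100)

print(round(oob_rmse, 1))
print(round(sqrt(implied_var_y), 1))   # implied standard deviation of the response
```

Because the out-of-bag error is computed on rows each tree did not see, it gives a rough preview of the test error without touching med_test.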

med.rf.pred=predict(med.rf,newdata = med_test)

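Find the error with the same expression as for the tree (again the square root of the sum of squared errors, not divided by the number of test rows):

sum((med_test$Counterfeit_Sales-med.rf.pred)**2) %>% 
  sqrt()
## [1] 49108.53

The root of the summed squared error is 49108.53. Over the 2118 test rows (7060 - 4942) this corresponds to an RMSE of 49108.53 / sqrt(2118) ≈ 1067.1.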

importance(med.rf)
##                     IncNodePurity
## Counterfeit_Weight      879577035
## DistArea_ID             744926619
## Active_Since            212592958
## Medicine_MRP           5220119648
## Medicine_Type           825332246
## SidEffect_Level         113069585
## Availability_rating     937651678
## Area_Type               687924258
## Area_City_Type          155704105
## Area_dist_level         187139561
varImpPlot(med.rf)

Medicine_MRP is by far the most important variable, followed by Availability_rating and Counterfeit_Weight, while SidEffect_Level and Area_City_Type contribute the least to the model.
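
The ranking is easier to read when the importance values are sorted. The sketch below re-enters the IncNodePurity figures printed above as a named vector (with the fitted model one would instead take importance(med.rf)[, "IncNodePurity"]); `ranked` and `share` are names introduced here.

```r
# IncNodePurity values copied from the importance(med.rf) output above
inc_node_purity <- c(
  Counterfeit_Weight  = 879577035,
  DistArea_ID         = 744926619,
  Active_Since        = 212592958,
  Medicine_MRP        = 5220119648,
  Medicine_Type       = 825332246,
  SidEffect_Level     = 113069585,
  Availability_rating = 937651678,
  Area_Type           = 687924258,
  Area_City_Type      = 155704105,
  Area_dist_level     = 187139561
)

# Most important variable first
ranked <- sort(inc_node_purity, decreasing = TRUE)

# Each variable's share of the total node purity increase, in percent
share <- round(100 * ranked / sum(ranked), 1)
print(share)
```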

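CONCLUSION: The root of the summed squared test error is 49925.79 for the decision tree and 49108.53 for the random forest. Divided by sqrt(2118), the square root of the number of test rows, these correspond to RMSEs of about 1084.8 and 1067.1. Hence the random forest is a slightly better model.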
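
The two reported error figures can be put on a per-row scale and compared directly. A minimal sketch using only numbers from the outputs above (the variable names are introduced here); it assumes the 70:30 split of the 7060 complete rows described earlier.

```r
# Test rows: total rows minus the 70% sampled for training
n_test <- 7060 - floor(0.7 * 7060)

# Square roots of the summed squared test errors, from the outputs above
root_sse <- c(tree = 49925.79, random_forest = 49108.53)

# Dividing by sqrt(n_test) converts each figure to an RMSE
rmse <- root_sse / sqrt(n_test)
print(round(rmse, 1))

# Relative reduction in RMSE achieved by the random forest, in percent
pct_gain <- 100 * (1 - rmse[["random_forest"]] / rmse[["tree"]])
print(round(pct_gain, 2))
```

The gap is small relative to the size of the errors, which fits the wording "slightly better" rather than a decisive win.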