PROBLEM STATEMENT
Use the dataset "base_data.csv" to build a model. Variable names are self-explanatory. The task is to build a predictive model for sales figures, given other information related to counterfeit medicine selling operations.
AIM: Use random forest and decision tree models built on the train data and compare their performance on the test data. Also produce the variable importance plot for the model.
Initial setup for the decision tree and random forest models:
library(tree)
## Warning: package 'tree' was built under R version 3.3.3
library(ISLR)
## Warning: package 'ISLR' was built under R version 3.3.3
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.3.3
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
med=read.csv("base_data.csv", stringsAsFactors = TRUE) # text columns are read as factors (the default only before R 4.0); the rest are integer or numeric
View(med)
str(med)
## 'data.frame': 8523 obs. of 12 variables:
## $ Medicine_ID : Factor w/ 1557 levels "AAJ32","AAS12",..: 1228 276 782 318 1088 191 505 1543 402 564 ...
## $ Counterfeit_Weight : num 10.6 7.22 18.8 20.5 10.23 ...
## $ DistArea_ID : Factor w/ 10 levels "Area010","Area013",..: 10 4 10 1 2 4 2 6 8 3 ...
## $ Active_Since : int 1997 2007 1997 1996 1985 2007 1985 1983 2000 2005 ...
## $ Medicine_MRP : num 260.1 58.6 151.9 192.4 64.2 ...
## $ Medicine_Type : Factor w/ 16 levels "Analgesics","Antacids",..: 6 1 5 3 7 11 10 10 13 13 ...
## $ SidEffect_Level : Factor w/ 2 levels "critical","mild": 2 1 2 1 2 1 1 2 1 1 ...
## $ Availability_rating: num 0.029 0.0323 0.0298 0.013 0.013 ...
## $ Area_Type : Factor w/ 4 levels "CityLimits","DownTown",..: 2 3 2 4 2 3 2 1 2 2 ...
## $ Area_City_Type : Factor w/ 3 levels "Tier 1","Tier 2",..: 1 3 1 3 3 3 3 3 2 2 ...
## $ Area_dist_level : Factor w/ 4 levels "High","Medium",..: 2 2 2 4 1 2 1 2 4 4 ...
## $ Counterfeit_Sales : num 3848 556 2210 845 1108 ...
glimpse(med)
## Observations: 8,523
## Variables: 12
## $ Medicine_ID <fctr> UJM11, ERM22, NDK33, FKW44, RZH55, DIA66,...
## $ Counterfeit_Weight <dbl> 10.600, 7.220, 18.800, 20.500, 10.230, 11....
## $ DistArea_ID <fctr> Area049, Area018, Area049, Area010, Area0...
## $ Active_Since <int> 1997, 2007, 1997, 1996, 1985, 2007, 1985, ...
## $ Medicine_MRP <dbl> 260.1092, 58.5692, 151.9180, 192.3950, 64....
## $ Medicine_Type <fctr> Antipyretics, Analgesics, Antimalarial, A...
## $ SidEffect_Level <fctr> mild, critical, mild, critical, mild, cri...
## $ Availability_rating <dbl> 0.02904730, 0.03227822, 0.02976008, 0.0130...
## $ Area_Type <fctr> DownTown, Industrial, DownTown, MidTownRe...
## $ Area_City_Type <fctr> Tier 1, Tier 3, Tier 1, Tier 3, Tier 3, T...
## $ Area_dist_level <fctr> Medium, Medium, Medium, Unknown, High, Me...
## $ Counterfeit_Sales <dbl> 3848.1380, 556.4228, 2210.2700, 845.3800, ...
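Step 1: Data preparation
Drop every row that contains a missing value; this leaves 7,060 of the original 8,523 observations.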
med=med %>%
na.omit()
glimpse(med)
## Observations: 7,060
## Variables: 12
## $ Medicine_ID <fctr> UJM11, ERM22, NDK33, FKW44, RZH55, DIA66,...
## $ Counterfeit_Weight <dbl> 10.600, 7.220, 18.800, 20.500, 10.230, 11....
## $ DistArea_ID <fctr> Area049, Area018, Area049, Area010, Area0...
## $ Active_Since <int> 1997, 2007, 1997, 1996, 1985, 2007, 1985, ...
## $ Medicine_MRP <dbl> 260.1092, 58.5692, 151.9180, 192.3950, 64....
## $ Medicine_Type <fctr> Antipyretics, Analgesics, Antimalarial, A...
## $ SidEffect_Level <fctr> mild, critical, mild, critical, mild, cri...
## $ Availability_rating <dbl> 0.02904730, 0.03227822, 0.02976008, 0.0130...
## $ Area_Type <fctr> DownTown, Industrial, DownTown, MidTownRe...
## $ Area_City_Type <fctr> Tier 1, Tier 3, Tier 1, Tier 3, Tier 3, T...
## $ Area_dist_level <fctr> Medium, Medium, Medium, Unknown, High, Me...
## $ Counterfeit_Sales <dbl> 3848.1380, 556.4228, 2210.2700, 845.3800, ...
Divide the data into train and test sets in a 70:30 ratio:
set.seed(3)
s=sample(1:nrow(med),0.7*nrow(med))
med_train=med[s,]
med_test=med[-s,]
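The split above can be illustrated on a small data frame. This is only a sketch: `toy`, `train_rows`, `toy_train` and `toy_test` are hypothetical names that stand in for `med`, `s`, `med_train` and `med_test`, and the values are made up.

```r
# Toy data frame standing in for `med` (hypothetical values).
toy <- data.frame(id = 1:10, y = seq(10, 100, by = 10))

set.seed(3)                                 # fix the random draw so the split is reproducible
n_train <- floor(0.7 * nrow(toy))           # about 70% of the rows go to training
train_rows <- sample(seq_len(nrow(toy)), n_train)

toy_train <- toy[train_rows, ]              # the sampled rows
toy_test  <- toy[-train_rows, ]             # every row that was not sampled

# The two sets are disjoint and together cover every row.
stopifnot(nrow(toy_train) + nrow(toy_test) == nrow(toy))
stopifnot(length(intersect(toy_train$id, toy_test$id)) == 0)
```

Negative indexing (`-train_rows`) is what keeps the test set free of any training rows.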
Here the response variable is Counterfeit_Sales, which is a continuous numeric variable, so this is a regression problem.
Step 2: Let's build a tree model on the train data
med.tree=tree(Counterfeit_Sales~.-Medicine_ID,data=med_train,na.action = na.exclude)
summary(med.tree)
##
## Regression tree:
## tree(formula = Counterfeit_Sales ~ . - Medicine_ID, data = med_train,
## na.action = na.exclude)
## Variables actually used in tree construction:
## [1] "Medicine_MRP" "DistArea_ID"
## Number of terminal nodes: 6
## Residual mean deviance: 1188000 = 5.866e+09 / 4936
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3433.00 -614.20 -86.68 0.00 554.80 6308.00
The tree has 6 terminal nodes and a residual mean deviance of 1188000. Only Medicine_MRP and DistArea_ID were used for splitting.
plot(med.tree)
text(med.tree,pretty = 0)
Step 3: Predict on the test data and measure the error
med.pred=predict(med.tree,newdata = med_test)
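Finding the error (the response is numeric, so a squared-error measure is used). Note that the expression below takes the square root of the summed squared errors. A true RMSE divides by the number of test rows before taking the root, so the value printed is RMSE × sqrt(n).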
sum((med_test$Counterfeit_Sales-med.pred)**2) %>%
sqrt()
## [1] 49925.79
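To make the difference between the two quantities concrete, here is a minimal sketch with made-up numbers. The helper names `rmse` and `root_sse` are hypothetical and are not part of any package used in this analysis.

```r
# Root mean squared error: average the squared errors, then take the root.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Root of the summed squared errors: what sum(...) %>% sqrt() computes.
root_sse <- function(actual, predicted) sqrt(sum((actual - predicted)^2))

# Made-up values in which every prediction is off by exactly 10.
actual    <- c(100, 200, 300, 400)
predicted <- c(110, 190, 310, 390)

rmse(actual, predicted)      # 10
root_sse(actual, predicted)  # 20, i.e. rmse * sqrt(4)
```

Because both models are scored on the same test rows, either quantity ranks them the same way; only `rmse` is on the scale of a typical per-row error.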
Step 4: Cross-validate the tree to check whether pruning would help
set.seed(3)
cv.med=cv.tree(med.tree)
Step 5: Plot the cross-validated deviance against tree size
plot(cv.med$size,cv.med$dev,type="b")
The deviance is smallest at 6 terminal nodes, which is the size of the current tree, so pruning would not improve it and there is no need to build a new model.
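Reading the best size off the plot can also be done programmatically. The sketch below uses a hypothetical list, `cv_toy`, with invented deviance values, in place of the real `cv.med` object; only the `size` and `dev` element names are taken from the plotting call above.

```r
# Hypothetical stand-in for the object returned by cv.tree():
# `size` = number of terminal nodes, `dev` = cross-validated deviance.
cv_toy <- list(
  size = c(6, 5, 4, 3, 2, 1),
  dev  = c(5.9e9, 6.1e9, 6.4e9, 7.0e9, 8.2e9, 1.2e10)  # invented values
)

# Pick the tree size with the smallest cross-validated deviance.
best_size <- cv_toy$size[which.min(cv_toy$dev)]
best_size

# With the real objects from this analysis the same idea would read:
#   best <- cv.med$size[which.min(cv.med$dev)]
#   med.pruned <- prune.tree(med.tree, best = best)
```

If the best size were smaller than the fitted tree, `prune.tree()` from the tree package would return the pruned tree to use for prediction.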
Let's build a random forest model
med.rf=randomForest(Counterfeit_Sales~.-Medicine_ID,data=med_train,na.action = na.exclude) # pass do.trace = TRUE to print progress while the forest is grown
med.rf
##
## Call:
## randomForest(formula = Counterfeit_Sales ~ . - Medicine_ID, data = med_train, na.action = na.exclude)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 1193333
## % Var explained: 49.83
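The "% Var explained" line is described in the randomForest documentation as a pseudo R-squared, 1 - MSE / Var(y), where the MSE comes from out-of-bag predictions. The sketch below reproduces that formula on made-up vectors. The function name `pct_var_explained` is hypothetical, and the use of a variance with divisor n is an assumption about the package internals.

```r
# Pseudo R-squared as a percentage: 100 * (1 - MSE / Var(y)).
pct_var_explained <- function(y, pred) {
  mse   <- mean((y - pred)^2)
  var_y <- mean((y - mean(y))^2)   # variance with divisor n (assumption)
  100 * (1 - mse / var_y)
}

y <- c(10, 20, 30, 40, 50)          # made-up response values

pct_var_explained(y, y)                          # perfect predictions: 100
pct_var_explained(y, rep(mean(y), length(y)))    # always predicting the mean: 0
```

A value near 50%, as in the output above, means the forest's out-of-bag errors have about half the variance of the response itself.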
Number of trees (ntree): 500
Variables tried at each split: 3
Mean of squared residuals: 1193333
% Var explained: 49.83
Let's predict for the test data:
med.rf.pred=predict(med.rf,newdata = med_test)
Let's find the error, computed the same way as for the tree:
sum((med_test$Counterfeit_Sales-med.rf.pred)**2) %>%
sqrt()
## [1] 49108.53
Value for the random forest: 49108.53 (again the square root of the summed squared errors).
importance(med.rf)
## IncNodePurity
## Counterfeit_Weight 879577035
## DistArea_ID 744926619
## Active_Since 212592958
## Medicine_MRP 5220119648
## Medicine_Type 825332246
## SidEffect_Level 113069585
## Availability_rating 937651678
## Area_Type 687924258
## Area_City_Type 155704105
## Area_dist_level 187139561
varImpPlot(med.rf)
Medicine_MRP, Availability_rating and Counterfeit_Weight are the most important variables in building the model, with Medicine_MRP far ahead of the rest, while SidEffect_Level and Area_City_Type are the least important.
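The ranking is easier to read once the importance values are sorted. The sketch below copies the IncNodePurity figures from the output above into a named vector, since the fitted `med.rf` object is not available here; the names `imp`, `imp_sorted` and `imp_share` are hypothetical.

```r
# IncNodePurity values copied from the importance(med.rf) output above.
imp <- c(
  Counterfeit_Weight  = 879577035,
  DistArea_ID         = 744926619,
  Active_Since        = 212592958,
  Medicine_MRP        = 5220119648,
  Medicine_Type       = 825332246,
  SidEffect_Level     = 113069585,
  Availability_rating = 937651678,
  Area_Type           = 687924258,
  Area_City_Type      = 155704105,
  Area_dist_level     = 187139561
)

imp_sorted <- sort(imp, decreasing = TRUE)                 # most important first
imp_share  <- round(100 * imp_sorted / sum(imp_sorted), 1) # share of the total, in percent

names(imp_sorted)[1:3]   # top three variables
imp_share

# With the fitted model, the vector could be taken directly from:
#   imp <- importance(med.rf)[, "IncNodePurity"]
```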
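CONCLUSION: The error on the test data is 49925.79 for the decision tree and 49108.53 for the random forest. Both figures are square roots of summed squared errors over the same test set; dividing by the square root of the 2,118 test rows (7,060 − 4,942) gives per-observation RMSEs of roughly 1,085 and 1,067. The random forest is therefore a slightly better model.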