Decision Tree Regression on Salary

Importing the salary dataset

dataset = read.csv('Position_Salaries.csv')
dataset = dataset[2:3]   # keep only the Level and Salary columns
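
As a quick sanity check (a sketch; its output is not shown in the original), we can confirm that only the Level and Salary columns were kept:

# Inspect the structure of the two remaining columns.
str(dataset)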

Fitting a Simple Decision Tree to the dataset

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.2
library(rpart)
## Warning: package 'rpart' was built under R version 3.4.2
regressor = rpart(formula = Salary ~ ., data = dataset)

Predicting the salary with the Decision Tree

y_tree_pred = predict(regressor, newdata = data.frame(Level = 6.5))
y_tree_pred
## [1] 249500

Visualizing the Simple Decision Tree Regression

g2 = ggplot() +
  geom_point(aes(x = dataset$Level, y = dataset$Salary),
             colour = "purple") +
  geom_line(aes(x = dataset$Level, y = predict(regressor, newdata = dataset)),
            colour = "black") +
  ggtitle("Truth or Bluff (Decision Tree)") +
  xlab("Level") +
  ylab("Salary")
g2

Analysis of result 1

The decision tree overestimates the salary, and the prediction is a single horizontal line across all levels, because with the default settings the tree makes no splits and simply returns the average of the 10 salaries. To fix this, we need to allow the tree to make some splits.
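
As a quick check (a sketch, assuming the regressor fitted above is still in the workspace), printing the tree confirms it consists of a single root node, and the constant prediction is just the mean salary:

# rpart's default minsplit is 20, so with only 10 observations no split
# is possible and the tree collapses to its root node.
print(regressor)        # a single root node
mean(dataset$Salary)    # equals the constant prediction, 249500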

Fitting the Decision Tree with a control parameter

regressor1 = rpart(formula = Salary ~ ., data = dataset,
                   control = rpart.control(minsplit = 1))
y_tree_pred1 = predict(regressor1, newdata = data.frame(Level = 6.5))
y_tree_pred1
##      1 
## 250000
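
The structure of this tree can also be drawn directly. The sketch below assumes the rpart.plot package is installed; it is not loaded anywhere in the original code.

# Optional: plot the tree itself to see the split thresholds on Level.
library(rpart.plot)
rpart.plot(regressor1)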

Visualizing the Decision Tree Regression with splits

g3 = ggplot() +
  geom_point(aes(x = dataset$Level, y = dataset$Salary),
             colour = "pink") +
  geom_line(aes(x = dataset$Level, y = predict(regressor1, newdata = dataset)),
            colour = "green") +
  ggtitle("Truth or Bluff (Decision Tree with splits)") +
  xlab("Level") +
  ylab("Salary")
g3

Analysis of result 2

We can see the model has improved: the tree now makes splits, and there are clearly 4 intervals. However, a decision tree predicts the average salary within each interval, so we should not see slanted lines. The problem is that we evaluate the predictions only at the 10 observed levels and connect them with straight lines, while a decision tree model is non-continuous (piecewise constant), as the quick check below illustrates. We then improve the graph by predicting on a higher-resolution grid of levels.
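
As a sketch (assuming the regressor1 fitted above is still available), several levels inside one interval should all receive the same prediction:

# Levels 6.6, 7.0 and 7.4 should fall into the same terminal node,
# so all three predictions should be identical.
predict(regressor1, newdata = data.frame(Level = c(6.6, 7.0, 7.4)))

Visualizing the Decision Tree Regression with higher resolution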

xgrid = seq(min(dataset$Level), max(dataset$Level), 0.01)
g4 = ggplot() +
  geom_point(aes(x = dataset$Level, y = dataset$Salary),
             colour = "red") +
  geom_line(aes(x = xgrid, y = predict(regressor1, newdata = data.frame(Level = xgrid))),
            colour = "blue") +
  ggtitle("Truth or Bluff (Decision Tree with higher resolution)") +
  xlab("Level") +
  ylab("Salary")
g4

Conclusion

Now we can clearly see the intervals 1-6.5, 6.5-8.5, 8.5-9.5 and 9.5-10; within each interval the prediction is constant, the average salary of the observations that fall in it. Finally, the graph shows what a good decision tree regression model looks like.
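
The interval boundaries quoted above can also be read directly from the fitted model; the sketch below simply prints the tree that was already built.

# Printing an rpart object lists each node with its split condition
# (e.g. Level < 6.5) and the mean Salary predicted in that node.
print(regressor1)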