library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.5.3
m<-read.csv("C:/Users/pradeep/OneDrive/datasets/students_placement_data.csv")
head(m) # Check the first 6 rows.
## Roll.No Gender Section SSC.Percentage inter_Diploma_percentage
## 1 1 M A 87.30 65.3
## 2 2 F B 89.00 92.4
## 3 3 F A 67.00 68.0
## 4 4 M A 71.00 70.4
## 5 5 M A 67.00 65.5
## 6 6 M A 81.26 68.0
## B.Tech_percentage Backlogs registered_for_.Placement_Training
## 1 40.00 18 NO
## 2 71.45 0 yes
## 3 45.26 13 yes
## 4 36.47 17 yes
## 5 42.52 17 yes
## 6 62.20 6 yes
## placement.status
## 1 Not placed
## 2 Placed
## 3 Not placed
## 4 Not placed
## 5 Not placed
## 6 Not placed
str(m) # Check the structure of the dataset
## 'data.frame': 117 obs. of 9 variables:
## $ Roll.No : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Gender : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ Section : Factor w/ 2 levels "A","B": 1 2 1 1 1 1 1 1 1 1 ...
## $ SSC.Percentage : num 87.3 89 67 71 67 ...
## $ inter_Diploma_percentage : num 65.3 92.4 68 70.4 65.5 68 56.5 79.3 89.6 75.5 ...
## $ B.Tech_percentage : num 40 71.5 45.3 36.5 42.5 ...
## $ Backlogs : int 18 0 13 17 17 6 20 3 10 8 ...
## $ registered_for_.Placement_Training: Factor w/ 2 levels "NO","yes": 1 2 2 2 2 2 2 1 2 2 ...
## $ placement.status : Factor w/ 2 levels "Not placed","Placed": 1 2 1 1 1 1 1 1 1 1 ...
n=nrow(m) # n is total number of rows.
set.seed(101)
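# We use the sample function to partition the data. It draws round(0.85 * n) = 99 row indices for the training set.
# Note that since "replace = TRUE", a row may be sampled more than once, so the training set contains duplicates.
# The test set is every row that was never drawn, so it is larger than 15 percent of the data (49 of 117 rows here).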
data_index=sample(1:n, size = round(0.85*n),replace = TRUE)
train_data=m[data_index,]
test_data=m[-data_index,]
str(train_data)
## 'data.frame': 99 obs. of 9 variables:
## $ Roll.No : int 44 6 84 77 30 36 69 40 73 64 ...
## $ Gender : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 1 2 ...
## $ Section : Factor w/ 2 levels "A","B": 2 1 1 1 2 2 1 2 2 1 ...
## $ SSC.Percentage : num 86 81.3 89 78 72 ...
## $ inter_Diploma_percentage : num 92.5 68 88.9 59 88.1 90 61 88.8 83.7 69.2 ...
## $ B.Tech_percentage : num 70.8 62.2 63 51.1 69.6 ...
## $ Backlogs : int 0 6 1 17 0 0 6 0 0 20 ...
## $ registered_for_.Placement_Training: Factor w/ 2 levels "NO","yes": 2 2 1 1 2 2 1 2 2 2 ...
## $ placement.status : Factor w/ 2 levels "Not placed","Placed": 1 1 1 1 2 2 1 1 1 1 ...
str(test_data)
## 'data.frame': 49 obs. of 9 variables:
## $ Roll.No : int 1 4 7 8 11 12 14 15 17 18 ...
## $ Gender : Factor w/ 2 levels "F","M": 2 2 2 1 1 2 1 2 1 1 ...
## $ Section : Factor w/ 2 levels "A","B": 1 1 1 1 2 1 2 1 2 2 ...
## $ SSC.Percentage : num 87.3 71 71 84.8 82.3 ...
## $ inter_Diploma_percentage : num 65.3 70.4 56.5 79.3 76.3 66 88.7 52.2 85 95.1 ...
## $ B.Tech_percentage : num 40 36.5 33.8 61 71.5 ...
## $ Backlogs : int 18 17 20 3 0 16 0 7 0 0 ...
## $ registered_for_.Placement_Training: Factor w/ 2 levels "NO","yes": 1 2 2 1 1 2 2 2 2 2 ...
## $ placement.status : Factor w/ 2 levels "Not placed","Placed": 1 1 1 1 2 1 2 1 1 2 ...
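Because the sample() call above uses replace = TRUE, the result is not a strict 85/15 partition. A minimal sketch of a split without replacement follows. It runs on an invented stand-in data frame with 117 rows, not on the placement data, and the names toy, idx, train_toy and test_toy are hypothetical.

```r
# Stand-in data with the same number of rows as the placement data (117).
# The columns are invented for illustration only.
set.seed(101)
toy <- data.frame(id = 1:117, score = runif(117, min = 30, max = 95))

n_toy <- nrow(toy)

# Draw 85 percent of the row indices WITHOUT replacement (the default),
# so no row can appear twice in the training set.
idx <- sample(seq_len(n_toy), size = round(0.85 * n_toy))

train_toy <- toy[idx, ]
test_toy  <- toy[-idx, ]

# The two sets are disjoint and together cover every row.
nrow(train_toy)
nrow(test_toy)
```

With this approach the training and test sizes always add up to the number of rows in the data. Applying it to the placement data would change every result shown in the rest of this walkthrough.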
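# Fit a classification tree for placement.status, using the Gini index as the splitting criterion.
stu_model <- rpart(formula = placement.status ~ Backlogs + Gender + B.Tech_percentage + SSC.Percentage + inter_Diploma_percentage,
                   data = train_data, method = "class", parms = list(split = "gini"))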
# Print the model.
print(stu_model)
## n= 99
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 99 28 Not placed (0.71717172 0.28282828)
## 2) B.Tech_percentage< 67.135 63 2 Not placed (0.96825397 0.03174603) *
## 3) B.Tech_percentage>=67.135 36 10 Placed (0.27777778 0.72222222)
## 6) SSC.Percentage< 83.58 11 3 Not placed (0.72727273 0.27272727) *
## 7) SSC.Percentage>=83.58 25 2 Placed (0.08000000 0.92000000) *
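Since the model was fitted with split = "gini", the printed counts can be used to work out how much the first split reduces Gini impurity. The sketch below is a hand calculation from the n and loss columns above, not a call into rpart internals. The helper gini() and the variable names are hypothetical.

```r
# Gini impurity of a node, given its class counts.
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Class counts (Not placed, Placed) read from the printed tree.
# Each node lists n and loss, where loss is the number of observations
# that are not in the node's predicted class.
root  <- c(71, 28)  # node 1: n = 99, loss = 28
left  <- c(61, 2)   # node 2: n = 63, loss = 2
right <- c(10, 26)  # node 3: n = 36, loss = 10, predicted class is Placed

# Weighted impurity of the two children.
n_root <- sum(root)
children <- (sum(left) / n_root) * gini(left) +
            (sum(right) / n_root) * gini(right)

# Impurity decrease achieved by splitting on B.Tech_percentage at 67.135.
decrease <- gini(root) - children

gini(root)
children
decrease
```

The left child is almost pure (61 of 63 students are Not placed), which is why this split accounts for most of the impurity reduction in the tree.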
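We plot the tree using rpart.plot() from the rpart.plot package. With type = 5 the split variable name is shown in each interior node, and with extra = 2 each node shows the number of correctly classified observations out of the number of observations in that node.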
rpart.plot(stu_model,type=5,extra = 2 )
In the predict function, pass the fitted model stu_model and the test_data as input, and specify type = "class" because we are doing classification and want predicted class labels.
p<-predict(stu_model,test_data,type="class")
print(p)
## 1 4 7 8 11 12
## Not placed Not placed Not placed Not placed Not placed Not placed
## 14 15 17 18 19 21
## Placed Not placed Placed Placed Not placed Placed
## 23 26 31 32 34 35
## Not placed Not placed Placed Not placed Not placed Not placed
## 37 41 42 43 45 55
## Not placed Placed Not placed Not placed Not placed Not placed
## 56 57 58 59 60 63
## Placed Not placed Not placed Placed Placed Not placed
## 65 66 68 71 74 75
## Placed Placed Placed Placed Not placed Not placed
## 76 85 87 88 89 93
## Not placed Not placed Not placed Not placed Not placed Not placed
## 100 101 102 105 106 114
## Not placed Not placed Not placed Not placed Not placed Not placed
## 116
## Placed
## Levels: Not placed Placed
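The type argument of predict() controls what is returned for an rpart classification tree. As a small illustration, the sketch below fits a tree on the kyphosis dataset that ships with rpart (a stand-in, not the placement data) and compares type = "class" with type = "prob". The names fit, pred_class and pred_prob are hypothetical.

```r
library(rpart)

# kyphosis ships with rpart; it stands in for the placement data here.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# type = "class" returns one predicted label per row (a factor).
pred_class <- predict(fit, kyphosis, type = "class")

# type = "prob" returns a matrix with one column per class level;
# each row holds the class proportions of the leaf the row falls into.
pred_prob <- predict(fit, kyphosis, type = "prob")

head(pred_class)
head(pred_prob)
```

Class probabilities are useful when a cutoff other than the majority class is needed. For a plain confusion matrix, the class labels are enough.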
The table() function is used to build the confusion matrix. test_data[,9] holds the actual class labels (the placement.status column) and p holds the predicted class labels, so rows are actual classes and columns are predicted classes. The confusion matrix shows how many predictions were correct and how many were wrong for each class.
t<-table(test_data[,9],p)
print(t)
## p
## Not placed Placed
## Not placed 29 2
## Placed 6 12
In the table above, the diagonal holds the 29 + 12 = 41 correct predictions and the off-diagonal cells hold the 6 + 2 = 8 wrong predictions.
The accuracy of the model is the number of correct predictions on the test set divided by the total number of samples in the test set, here 41 / 49.
print(sum(diag(t))/sum(t))
## [1] 0.8367347
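Accuracy alone hides how the errors are distributed between the two classes. The sketch below rebuilds the confusion matrix from the counts printed above and derives precision and recall for the Placed class. The names cm, accuracy, recall_placed and precision_placed are hypothetical and are not part of the original analysis.

```r
# Rebuild the confusion matrix printed above (rows = actual, columns = predicted).
# matrix() fills column by column: first column (29, 6), second column (2, 12).
cm <- matrix(c(29, 6, 2, 12), nrow = 2,
             dimnames = list(actual = c("Not placed", "Placed"),
                             predicted = c("Not placed", "Placed")))

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy <- sum(diag(cm)) / sum(cm)

# Recall for "Placed": of the students who were actually placed,
# the share that the model predicted as placed.
recall_placed <- cm["Placed", "Placed"] / sum(cm["Placed", ])

# Precision for "Placed": of the students predicted as placed,
# the share that were actually placed.
precision_placed <- cm["Placed", "Placed"] / sum(cm[, "Placed"])

accuracy
recall_placed
precision_placed
```

Here 6 of the 18 students who were actually placed are predicted as Not placed, so recall for the Placed class is lower than the overall accuracy.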