This is a continuation of the first phase of the project. We start by loading the data set. The first phase can be found at the following link:
http://rpubs.com/dhirajbasnet/808995
Loading the data from the GitHub link.
df <- read.csv("https://raw.githubusercontent.com/nurfnick/Data_Sets_For_Stats/master/CuratedDataSets/DisneyMoviesDataset.csv")
Cleaning our data. We removed 16 of the 32 variables, keeping only the ones that are used; the removed variables also had a large number of empty entries. We are also removing any row whose imdb value is “N/A” or empty, and any row with a missing box office or budget value.
df <- df[c(1:14,22,23)]
df<-df[!(df$imdb=="N/A" | df$imdb ==""),]
df<-df[!(is.na(df$Box.office..float.) | is.na(df$Budget..float.)),]
Creating imdb as a factor. Movies with a rating of 7.0 or higher are labeled Great Movies; all others are labeled Not So Great Movies.
library(dplyr)
df <- df %>% mutate(
  # imdb is read in as text because of the "N/A" entries, so convert it to a number before comparing
  imdb = factor(as.numeric(imdb) >= 7.0, levels = c(TRUE, FALSE),
                labels = c('Great Movies', 'Not So Great Movies'))
)
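One detail worth checking in this step is the type of imdb. Because the raw column contains “N/A” entries, read.csv most likely reads it as text (an assumption about this file), and comparing text against 7.0 is a string comparison rather than a numeric one. A minimal sketch with made-up ratings (the vector below is hypothetical, not taken from the data set):

```r
# Hypothetical ratings stored as text, as they would be when a column also holds "N/A"
ratings <- c("6.4", "7.0", "8.1", "10")

# Comparing text with a number turns the number into "7" and compares strings
text_compare <- ratings >= 7.0

# Converting to numeric first gives the intended comparison
numeric_compare <- as.numeric(ratings) >= 7.0

labels <- factor(numeric_compare, levels = c(TRUE, FALSE),
                 labels = c("Great Movies", "Not So Great Movies"))
data.frame(ratings, text_compare, numeric_compare, labels)
```

For ratings written as a single digit and one decimal the two comparisons agree; they differ for a value such as "10", which sorts before "7" as text.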
Creating our decision tree to predict imdb from budget and box office.
library(rpart)
tree <- rpart(imdb ~ Budget..float. + Box.office..float. , data = df)
Plotting the decision tree.
library(rpart.plot)
rpart.plot(tree, extra = 2)
Making a prediction: we store the predicted classes in pred so that we can compare them against the actual values.
pred <- predict(tree, df, type = "class")
head(pred)
##                   2                   3                   4                   5
##        Great Movies        Great Movies        Great Movies Not So Great Movies
##                   6                   7
## Not So Great Movies        Great Movies
## Levels: Great Movies Not So Great Movies
Probabilities of Classification
predict(tree, df) %>%
head()
##   Great Movies Not So Great Movies
## 2    0.7500000           0.2500000
## 3    0.7307692           0.2692308
## 4    0.7307692           0.2692308
## 5    0.2380952           0.7619048
## 6    0.2380952           0.7619048
## 7    0.7307692           0.2692308
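The class predictions and the probabilities are two views of the same result: with type = "class", the predicted label should be the class with the larger probability. The sketch below rebuilds the six rows printed above by hand (the matrix is typed in from that output, not recomputed from the tree) and recovers the labels from it.

```r
# Probabilities copied from the printed output for rows 2 to 7
prob <- matrix(
  c(0.7500000, 0.7307692, 0.7307692, 0.2380952, 0.2380952, 0.7307692,
    0.2500000, 0.2692308, 0.2692308, 0.7619048, 0.7619048, 0.2692308),
  ncol = 2,
  dimnames = list(c("2", "3", "4", "5", "6", "7"),
                  c("Great Movies", "Not So Great Movies"))
)

# For each row, pick the column name with the largest probability
pred_class <- colnames(prob)[max.col(prob, ties.method = "first")]
names(pred_class) <- rownames(prob)
pred_class
```

The recovered labels match the output of head(pred) for the same six movies.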
Confusion Matrix for the above classification
confusion_table <- with(df, table(imdb, pred))
confusion_table
##                      pred
## imdb                  Great Movies Not So Great Movies
##   Great Movies                  58                  45
##   Not So Great Movies           20                 129
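The accuracy of this model is 74.2%: 187 of the 252 movies (58 + 129) are classified correctly. Note that this is measured on the same data the tree was fit on.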
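The accuracy figure is the sum of the diagonal of the confusion matrix divided by the total count. As a small check, the sketch below re-enters the four counts from the table above by hand (the object name confusion is ours) and repeats the arithmetic.

```r
# Counts copied from the confusion matrix above (rows = actual, columns = predicted)
confusion <- matrix(
  c(58, 20, 45, 129), nrow = 2,
  dimnames = list(imdb = c("Great Movies", "Not So Great Movies"),
                  pred = c("Great Movies", "Not So Great Movies"))
)

# Correct predictions sit on the diagonal
accuracy <- sum(diag(confusion)) / sum(confusion)
round(100 * accuracy, 1)   # 74.2
```

With the objects from the analysis itself, the same value should come from sum(diag(confusion_table)) / sum(confusion_table).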
Cross Validation
For the cross validation, we split the data once into a training set and a testing set of 126 observations each (half of the 252 rows).
Taking a sample of row indices and creating the training and testing data from them.
# Note: sample(150, 126) draws indices only from rows 1 to 150, so every row after 150 ends up in test_data
s <- sample(150, 126)
train_data <- df[s,]
test_data <- df[-s,]
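A split that draws from every row, and that gives the same result each time it is run, could look like the sketch below. It uses a made-up data frame named toy with 252 rows in place of df, and the seed value is arbitrary; both are assumptions for illustration only.

```r
set.seed(123)                                   # arbitrary seed so the split is repeatable
toy <- data.frame(id = 1:252, x = rnorm(252))   # hypothetical stand-in for df

# Sample half of ALL row indices, not just the first 150
train_idx <- sample(nrow(toy), nrow(toy) / 2)
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

dim(toy_train)
dim(toy_test)
range(train_idx)   # indices can now come from anywhere in 1..252
```

Using nrow() instead of a fixed number also keeps the split valid if the cleaning steps change the number of rows.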
Checking the dimensions of the two samples.
dim(train_data)
## [1] 126 16
dim(test_data)
## [1] 126 16
We build the model on the training data first. The testing data is held back until the model has been built.
dtm <- rpart(imdb ~ Budget..float. + Box.office..float., train_data, method = "class")
rpart.plot(dtm, extra = 2)
Now, we will be using the testing data to make predictions and see how well we do.
p <- predict(dtm, test_data, type = "class")
# refer to imdb by name rather than by column position
table(test_data$imdb, p)
##                      p
##                       Great Movies Not So Great Movies
##   Great Movies                  29                  29
##   Not So Great Movies           17                  51
The accuracy of the model on the testing data is about 63.5% (80 of 126 correct), which is noticeably lower than the 74.2% that the first confusion matrix suggested.
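A single split gives only one estimate of the testing accuracy, and that estimate changes with the random sample. A k-fold cross validation averages over several splits. The sketch below shows the idea on kyphosis, a small data set bundled with rpart, used here only as a stand-in; the choice of 5 folds and the seed are arbitrary assumptions.

```r
library(rpart)

set.seed(42)                          # arbitrary seed, only for repeatability
data(kyphosis, package = "rpart")     # stand-in data set, not the Disney data
k <- 5

# Assign every row to one of k folds at random
fold <- sample(rep(1:k, length.out = nrow(kyphosis)))

# For each fold: fit on the other folds, then score on the held-out fold
fold_accuracy <- sapply(1:k, function(i) {
  train <- kyphosis[fold != i, ]
  test  <- kyphosis[fold == i, ]
  fit   <- rpart(Kyphosis ~ Age + Number + Start, data = train, method = "class")
  pred  <- predict(fit, test, type = "class")
  mean(pred == test$Kyphosis)
})

fold_accuracy
mean(fold_accuracy)
```

For our data the same loop would use df and the formula imdb ~ Budget..float. + Box.office..float. in place of the stand-ins.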
Now we will calculate the importance of each feature using the chi-squared statistic.
library(FSelector)
weights <- df %>% chi.squared(imdb ~ ., data = .) %>%
as_tibble(rownames = "feature") %>%
arrange(desc(attr_importance))
weights
## # A tibble: 15 × 2
##    feature                 attr_importance
##    <chr>                             <dbl>
##  1 title                             1
##  2 Release.date                      1
##  3 Release.date..datetime.           0.996
##  4 Box.office                        0.982
##  5 rotten_tomatoes                   0.812
##  6 metascore                         0.748
##  7 Production.company                0.703
##  8 Budget                            0.695
##  9 Running.time                      0.547
## 10 Box.office..float.                0.354
## 11 Country                           0.318
## 12 Language                          0.246
## 13 X                                 0
## 14 Running.time..int.                0
## 15 Budget..float.                    0
Creating a visualization of the importance scores.
library(ggplot2)
ggplot(weights,
       aes(x = attr_importance, y = reorder(feature, attr_importance))) +
  geom_bar(stat = "identity") +
  xlab("Importance score") + ylab("Feature")
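Box.office has one of the highest scores, but so do title and Release.date, which are close to unique for every movie. Text columns with almost as many distinct values as rows tend to score near 1 regardless of any real relationship, so these top scores should be read with caution. The numeric version, Box.office..float., scores 0.354, which is a more modest sign that box office collection is related to a high IMDB rating.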
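One caveat when reading these scores is that a chi-squared based measure rewards columns with many distinct values. The sketch below assumes the score behaves like Cramér's V (our understanding of what chi.squared reports, not something confirmed here) and computes it with base R on made-up data: an id column that is unique per row, and a two-level column that is exactly independent of the label.

```r
# Cramér's V from a chi-squared test on the contingency table of x and y
cramers_v <- function(x, y) {
  tab  <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  unname(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}

# Hypothetical data: 20 great movies and 30 not so great movies
label <- factor(rep(c("Great Movies", "Not So Great Movies"), times = c(20, 30)))
id    <- paste0("movie_", seq_along(label))        # unique for every row, like a title
noise <- factor(rep(c("a", "b"), length.out = 50)) # evenly spread across both labels

v_id    <- cramers_v(id, label)
v_noise <- cramers_v(noise, label)
c(id = v_id, noise = v_noise)
```

The unique id column reaches the maximum score of 1 even though it carries no usable information about the label, while the independent column scores 0.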
Overfitting can be diagnosed by comparing the two accuracies that we have. The accuracy of the first tree, measured against the same data it was fit on (74.2%), is relatively good, while the accuracy measured against the held-out testing set is not as good, so our model appears to be overfitting. However, we need to note that the accuracy on the fitting data was not high in the first place.