Phase II Disney Movies

This report continues the first phase of the project and loads the same data set. The first phase can be found at the following link:

http://rpubs.com/dhirajbasnet/808995

Loading the data from GitHub.

df <- read.csv("https://raw.githubusercontent.com/nurfnick/Data_Sets_For_Stats/master/CuratedDataSets/DisneyMoviesDataset.csv")

Cleaning our data. We keep 16 of the 32 variables and drop the other 16, which are either unused or have a large number of empty entries. We also remove any row whose imdb value is “N/A” or empty, and any row with a missing box office or budget figure.

df <- df[c(1:14, 22, 23)]
df <- df[!(df$imdb == "N/A" | df$imdb == ""), ]
df <- df[!(is.na(df$Box.office..float.) | is.na(df$Budget..float.)), ]

Converting imdb to a factor. Movies with a rating of 7.0 or higher are labeled Great Movies; all others are labeled Not So Great Movies.

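library(dplyr)
# imdb may be stored as text because of the "N/A" entries, so convert it to numeric before comparing
df <- df %>% mutate(
  imdb = factor(as.numeric(as.character(imdb)) >= 7.0, levels = c(TRUE, FALSE),
                labels = c('Great Movies', 'Not So Great Movies'))
)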
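As a small illustration of the labeling step, the sketch below applies the same factor() call to a few made-up ratings stored as text (the vector ratings is hypothetical and not taken from the Disney data). It also shows why the conversion to numeric matters: when a string is compared with a number, R compares them as strings.

```r
# Made-up ratings stored as text, as they could be after reading a CSV
# that contains "N/A" entries (illustrative values only).
ratings <- c("7.0", "6.9", "8.1", "5.4")

# Convert to numeric first so the comparison is numeric, not alphabetical.
label <- factor(as.numeric(ratings) >= 7.0,
                levels = c(TRUE, FALSE),
                labels = c("Great Movies", "Not So Great Movies"))
table(label)

# Comparing text with a number falls back to string comparison:
"10" >= 7.0              # FALSE, because "10" sorts before "7"
as.numeric("10") >= 7.0  # TRUE
```

A rating of exactly 7.0 lands in Great Movies because the comparison is >=.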

Building a decision tree that predicts imdb from the budget and the box office collection.

library(rpart)

tree <- rpart(imdb ~ Budget..float. + Box.office..float., data = df)

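Plotting the decision tree. With extra = 2, each node shows the number of correct classifications out of the number of observations in that node.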

library(rpart.plot)
rpart.plot(tree, extra = 2)

Making a prediction. We store the predicted classes so that we can compare them against the actual values.

pred <- predict(tree, df, type = "class")
head(pred)

##                   2                   3                   4                   5 
##        Great Movies        Great Movies        Great Movies Not So Great Movies 
##                   6                   7 
## Not So Great Movies        Great Movies 
## Levels: Great Movies Not So Great Movies

Probabilities of Classification

predict(tree, df) %>%
head()

##   Great Movies Not So Great Movies
## 2    0.7500000           0.2500000
## 3    0.7307692           0.2692308
## 4    0.7307692           0.2692308
## 5    0.2380952           0.7619048
## 6    0.2380952           0.7619048
## 7    0.7307692           0.2692308
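The class predictions and the probabilities are two views of the same result: the predicted class is the column with the larger probability, and each probability is the share of that class in the leaf the movie falls into. A minimal sketch, re-entering the six rows printed above by hand (the names prob and predicted_class are ours):

```r
# Probabilities copied from the head() output above (rows 2 to 7).
prob <- matrix(
  c(0.7500000, 0.2500000,
    0.7307692, 0.2692308,
    0.7307692, 0.2692308,
    0.2380952, 0.7619048,
    0.2380952, 0.7619048,
    0.7307692, 0.2692308),
  ncol = 2, byrow = TRUE,
  dimnames = list(as.character(2:7),
                  c("Great Movies", "Not So Great Movies"))
)

# For each row, pick the class (column) with the largest probability.
predicted_class <- colnames(prob)[max.col(prob, ties.method = "first")]
predicted_class
```

The result matches the output of head(pred): rows 2, 3, 4 and 7 are Great Movies, rows 5 and 6 are Not So Great Movies.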

Confusion Matrix for the above classification

confusion_table <- with(df, table(imdb, pred))
confusion_table

##                      pred
## imdb                  Great Movies Not So Great Movies
##   Great Movies                  58                  45
##   Not So Great Movies           20                 129

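The model classifies 58 + 129 = 187 of the 252 movies correctly, so its accuracy is 74.2%. Note that this accuracy is measured on the same data that was used to build the tree.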
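The accuracy can be recomputed directly from the confusion matrix. The sketch below re-enters the counts printed above by hand (the name confusion is ours) and also reports the per-class recall, which is an addition of ours and not part of the original analysis.

```r
# Counts copied from the confusion matrix above: rows are actual, columns are predicted.
confusion <- matrix(
  c(58,  45,
    20, 129),
  nrow = 2, byrow = TRUE,
  dimnames = list(imdb = c("Great Movies", "Not So Great Movies"),
                  pred = c("Great Movies", "Not So Great Movies"))
)

# Accuracy: correct predictions (the diagonal) divided by all predictions.
accuracy <- sum(diag(confusion)) / sum(confusion)
round(accuracy, 3)   # 0.742

# Recall per actual class: correct predictions divided by the row total.
recall <- diag(confusion) / rowSums(confusion)
round(recall, 3)     # 0.563 for Great Movies, 0.866 for Not So Great Movies
```

The recall values suggest that the tree recognizes Not So Great Movies much more reliably than Great Movies.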

Cross Validation

For the cross validation, we split the data into two parts of equal size. The cleaned data has 252 rows; we draw 126 of them at random as the training data and use the remaining 126 as the testing data. This is a single hold-out split rather than a k-fold cross validation.

Drawing the sample and creating the training and testing data from it.

s <- sample(nrow(df), 126)  # 126 random row indices drawn from all rows of df
train_data <- df[s, ]
test_data <- df[-s, ]

Checking the dimensions of the two samples.

dim(train_data)

## [1] 126  16

dim(test_data)

## [1] 126  16
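Because sample() is random, every run produces a different split and therefore different results. A minimal sketch of a reproducible split, assuming 252 rows as in the cleaned data; the seed value 42 is arbitrary and the names train_idx and test_idx are ours:

```r
set.seed(42)          # arbitrary seed (an assumption) so the split can be repeated
n <- 252              # number of rows in the cleaned data in this report

s <- sample(n, 126)   # 126 distinct row indices drawn from all n rows
train_idx <- s
test_idx  <- setdiff(seq_len(n), s)   # every row that was not sampled

length(train_idx)                     # 126
length(test_idx)                      # 126
length(intersect(train_idx, test_idx))  # 0, the two sets do not overlap
```

Indexing with df[train_idx, ] and df[test_idx, ] would then give the same two data frames on every run.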

We build the model on the training data first. The testing data is held back until the model has been built.

dtm <- rpart(imdb ~ Budget..float. + Box.office..float., data = train_data, method = "class")
rpart.plot(dtm, extra = 2)

Now, we will be using the testing data to make predictions and see how well we do.

p <- predict(dtm, test_data, type = "class")
table(test_data$imdb, p)

##                      p
##                       Great Movies Not So Great Movies
##   Great Movies                  29                  29
##   Not So Great Movies           17                  51

The model classifies 29 + 51 = 80 of the 126 test movies correctly, an accuracy of about 63%. This is roughly ten percentage points lower than what the first confusion matrix suggested.
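A single split gives only one estimate, and that estimate can change noticeably with the random sample. As a hedged sketch of k-fold cross validation, the code below uses the kyphosis data set that ships with rpart as a stand-in (it is not the Disney data), with an assumed k = 5 and an arbitrary seed. The same loop should carry over to df with the formula used above.

```r
library(rpart)

set.seed(1)                        # arbitrary seed so the folds are repeatable
k <- 5                             # number of folds (an assumed, common choice)
data(kyphosis, package = "rpart")  # stand-in data set shipped with rpart

# Randomly assign every row to one of k folds of near-equal size.
fold <- sample(rep(seq_len(k), length.out = nrow(kyphosis)))

fold_accuracy <- sapply(seq_len(k), function(i) {
  train <- kyphosis[fold != i, ]   # fit on all folds except fold i
  test  <- kyphosis[fold == i, ]   # evaluate on fold i only
  fit   <- rpart(Kyphosis ~ Age + Number + Start, data = train, method = "class")
  pred  <- predict(fit, test, type = "class")
  mean(pred == test$Kyphosis)      # share of correct predictions in this fold
})

fold_accuracy        # one accuracy per fold
mean(fold_accuracy)  # cross-validated accuracy estimate
```

Every row is used for testing exactly once, so the averaged accuracy depends less on one lucky or unlucky split.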

Now we will calculate the importance of each feature using the chi-squared statistic.

library(FSelector)
weights <- df %>% chi.squared(imdb ~ ., data = .) %>%
  as_tibble(rownames = "feature") %>%
  arrange(desc(attr_importance))
weights

## # A tibble: 15 × 2
##    feature                 attr_importance
##    <chr>                             <dbl>
##  1 title                             1    
##  2 Release.date                      1    
##  3 Release.date..datetime.           0.996
##  4 Box.office                        0.982
##  5 rotten_tomatoes                   0.812
##  6 metascore                         0.748
##  7 Production.company                0.703
##  8 Budget                            0.695
##  9 Running.time                      0.547
## 10 Box.office..float.                0.354
## 11 Country                           0.318
## 12 Language                          0.246
## 13 X                                 0    
## 14 Running.time..int.                0    
## 15 Budget..float.                    0

Visualizing the importance scores from the table above.

library(ggplot2)
ggplot(weights,
  aes(x = attr_importance, y = reorder(feature, attr_importance))) +
  geom_bar(stat = "identity") +
  xlab("Importance score") + ylab("Feature")

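The raw Box.office column scores 0.982, which suggests that the box office collection is associated with a high IMDB rating. The scores should be read with caution, though: title and Release.date score 1 most likely because almost every value in those columns is unique, and the numeric Box.office..float. column scores only 0.354.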
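According to the FSelector documentation, the score returned by chi.squared equals Cramér's V between a feature and the target. The sketch below computes Cramér's V with base R on made-up data (the helper cramers_v and all values are hypothetical) to show that a column in which every value is distinct, such as a title, reaches a score of 1 regardless of the labels.

```r
# Cramér's V between a categorical feature x and a class label y, base R only.
cramers_v <- function(x, y) {
  tab  <- table(x, y)
  # The chi-squared approximation warning is expected for sparse tables.
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  unname(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}

set.seed(1)  # arbitrary seed
n <- 200
label <- factor(sample(c("Great", "Not So Great"), n, replace = TRUE))

unique_id <- paste0("movie_", seq_len(n))                      # every value distinct, like a title
noise     <- factor(sample(c("a", "b", "c"), n, replace = TRUE))  # unrelated to the label

cramers_v(unique_id, label)  # 1, up to floating-point error
cramers_v(noise, label)      # close to 0
```

A score of 1 for an identifier-like column therefore says little about whether that column would help predict the rating of a new movie.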

Comparing the two accuracy figures gives a simple check for overfitting. The accuracy measured on the data used to build the tree is relatively good, while the accuracy measured on the held-out testing data is roughly ten percentage points lower, which suggests that our model is overfitting. However, we need to note that the accuracy on the training data was not high in the first place.

Dhiraj Basnet

09/01/2021