Name: Florencia Palacios

Instructions

Load packages (5 pts)

Use this block to load all the packages you will be using. Victory conditions: All the packages needed to run the R markdown and knit to html file are included. Submit a knitted html for full credit.
library(dplyr) #Provides %>%, arrange, group_by, summarise, select, slice and if_else
library(ggplot2) #Provides ggplot and the geom_ functions

Problem 1: Analysis of the Movie dataset

Data

Your data for the midterm exam consists of the 1036 highest rated movies on the Internet Movie Database (IMDB). We will be using this data to predict the IMDB user ratings for movies. In some cases, some of these variables may be missing for a particular movie. For your convenience, here is the codebook for this data set:

Variable name   Description
Title           The movie title
Year            The year the movie was released
RunTime         The movie’s run time in minutes
ContentRating   The content rating of the movie (G, PG, PG-13, R, NC-17, etc.)
UserRating      The average rating of the movie by IMDB users
Metascore       The average rating of the movie by Metacritic users
NumRaters       The number of ratings of the movie
Genre           The movie’s genre
Budget          The movie’s budget (in USD)
GrossUS         The gross revenue of the movie in the US (in USD)
GrossGlobal     The gross revenue of the movie globally (in USD)
Director1       The movie’s director
Director2       The movie’s director
Actor1          The movie’s actor
Actor2          The movie’s actor
Actor3          The movie’s actor
  • Question 1 (10 pts) Read in the data located in the folder and store it in a data frame called movie. What are the dimensions and column names of movie? What are the 10 movies with the highest user ratings? Show only their titles, user ratings and genres. Victory conditions: In addition to the dimensions and column names, you have a data frame with 10 rows and 3 columns, corresponding to the 3 variables of interest.
movie = read.csv("movie.csv")
dim(movie) #Dimensions: number of rows and number of columns
colnames(movie) #Column names
#Top 10 movies by user rating; select() keeps the three columns of interest
movie %>% 
  arrange(desc(UserRating)) %>% 
  select(Title,UserRating, Genre) %>% 
  head(10)
  • Question 2 (10 pts) Now let’s see how a movie’s genre relates to the user rating and a few other variables. Show the average and median UserRating, NumRaters and GrossUS of each Genre. Round your average ratings to 2 decimal points and order the results by highest average ratings. Victory conditions: Your code returns a data frame with dimensions 16 x 7, ordered by the highest average ratings. All columns have short but clear names.
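#na.rm=TRUE skips movies with a missing value instead of returning NA for the whole genre
movie %>% 
  group_by(Genre) %>% 
  summarise(Average_Rating=round(mean(UserRating,na.rm=TRUE),2),
            Median_Rating=round(median(UserRating,na.rm=TRUE),2),
            Average_NumRaters=mean(NumRaters,na.rm=TRUE),
            Median_NumRaters=median(NumRaters,na.rm=TRUE),
            Average_GrossUS=mean(GrossUS,na.rm=TRUE),
            Median_GrossUS=median(GrossUS,na.rm=TRUE)) %>% 
  arrange(desc(Average_Rating))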
  • Question 3 (10 pts) Next, let’s try some data transformations to create new predictors. Create two new columns for the log of NumRaters and the log of GrossUS, called log_NumRaters and log_GrossUS. (a) Visualize the distributions of log_NumRaters and log_GrossUS. (b) Plot the relationship between log_NumRaters and log_GrossUS, and differentiate movies by their ContentRating. Victory conditions: (a) You have two plots showing two different probability distributions, each with clearly labeled axes and reasonable precision. (b) You have one plot showing relationships between all the variables of interest.
  • Hint: log function in R takes the log of a value.
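#Log-transform the two variables
movie$log_NumRaters=log(movie$NumRaters) 
movie$log_GrossUS=log(movie$GrossUS) 

#(a) Distributions of the two new columns
ggplot(data=movie,mapping=aes(x=log_NumRaters))+geom_histogram(binwidth=0.1)+labs(x="log(NumRaters)",y="Number of movies")

ggplot(data=movie,mapping=aes(x=log_GrossUS))+geom_histogram(binwidth=0.1)+labs(x="log(GrossUS)",y="Number of movies")

#(b) Relationship between the two columns, first with one panel per content rating, then colored by content rating
ggplot(data=movie,mapping=aes(x=log_GrossUS, y=log_NumRaters))+geom_point()+facet_wrap(~ContentRating)+labs(x="log(GrossUS)",y="log(NumRaters)")

ggplot(data=movie,mapping=aes(x=log_GrossUS, y=log_NumRaters,color=ContentRating))+geom_point()+labs(x="log(GrossUS)",y="log(NumRaters)")

#There is a positive relationship between the log of gross US revenue and the log of the number of raters when movies are grouped by content rating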
  • Question 4 (10 pts) Consider breaking the year a movie is released into decades. (a) How many movies are in each decade? (b) Plot the distribution of user ratings over different decades. (c) Comment on which decade has the highest average rating and which decade has the lowest among movies in this data set. Victory conditions: Your column for decades is a factor/character column instead of numeric/integer (e.g. “2010s” vs 2010). You have ten subplots showing ten different decades, with clearly labeled axes and reasonable precision.
min(movie$Year) #Minimum year is 1920
## [1] 1920
max(movie$Year) #Max year is 2018
## [1] 2018
movie$Decade <- floor(movie$Year/10)*10
#DecadeLabel is a helper column added here so decades display as text (e.g. "2010s");
#Decade itself stays numeric because the regression in Question 6 uses it as a number
movie$DecadeLabel <- paste0(movie$Decade,"s")

#(a) Number of movies and (c) average user rating in each decade
movie %>% 
  group_by(DecadeLabel) %>% 
  summarise(Count=n(),Average_Rating=round(mean(UserRating),2)) %>% 
  arrange(desc(Count))
#Lowest and highest rated movie in each decade
movie %>% 
  group_by(DecadeLabel) %>% 
  slice(c(which.min(UserRating),which.max(UserRating))) %>% 
  select(DecadeLabel,Title,Year,UserRating)
#(b) Distribution of user ratings, one panel per decade
ggplot(data=movie,mapping=aes(x=UserRating))+geom_histogram(binwidth=0.1)+facet_wrap(~DecadeLabel)+labs(x="User rating",y="Number of movies")

#The lowest rating belongs to a movie from the 2010s called Jump Street, and the highest rating, 9.3, belongs to a movie from the 1990s called The Shawshank Redemption
  • Question 5 (15 pts) Now we take a look at the finances of these movies. (a) Create a new variable BreakEven: if a movie has either its gross US box office or its gross global box office larger than the budget, we will give 1 for BreakEven, otherwise 0. Save the variable BreakEven to movie in the environment. (b) What are the average user ratings for movies that made a profit versus those that didn’t? (c) Calculate the proportion of movies that are profitable in each decade. (d) Comment on how the likelihood of being profitable changes over the decades. Victory conditions: You update the movie variable by creating a new column BreakEven. You have a data frame showing percentages of profitable movies for each decade. Your comment is clear and concise.
 movie$BreakEven<-if_else(movie$GrossUS>movie$Budget|movie$GrossGlobal >movie$Budget,1,0)
movie %>% 
  group_by(BreakEven) %>% 
  summarise(Average_Rating=round(mean(UserRating),2))
#Movies that were profitable have an average rating of 7.85, versus 7.84 for those that were not profitable.
#The mean of a TRUE/FALSE vector is the proportion of TRUE values; na.rm=TRUE skips movies with unknown BreakEven
movie %>% 
  group_by(Decade) %>% 
  summarise(Prop_profitable=round(mean(BreakEven==1,na.rm=TRUE),2))
#To do: make sure this shows only one row per decade
  • Question 6 (15 pts) (a) Set the random seed to be 123. Randomly split the movie dataset to train set and test set, where train set is 70% of the data, and test set is 30% of the data. (b) Fit a linear regression model on the train set predicting user rating from decade, year, movie run time, content rating, log of total number of raters, log of gross US, and movie genre, as well as the interaction between decades and log of total number of raters. (c) Choose one categorical term and one numeric term in the regression output and comment on the meaning of their coefficients. What is the train R square? Victory conditions: You obtain a train set and a test set. You have the estimated coefficients and related summary statistics for your model. Your comment for the coefficients argues cogently from specific results of the code.
library(caTools)
set.seed(123)

samples=sample.split(movie$Metascore, SplitRatio=0.7) 
train=subset(movie, samples==TRUE)
test=subset(movie, samples==FALSE)

model=lm(UserRating~Decade+Year+RunTime+ContentRating+log_GrossUS+log_NumRaters+Genre+Decade:log_NumRaters,data=train)
summary(model)
## 
## Call:
## lm(formula = UserRating ~ Decade + Year + RunTime + ContentRating + 
##     log_GrossUS + log_NumRaters + Genre + Decade:log_NumRaters, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6492 -0.1667 -0.0069  0.1487  1.2032 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)            -6.34e+01   1.81e+01   -3.50  0.00051
## Decade                  4.03e-02   1.00e-02    4.01  6.8e-05
## Year                   -5.77e-03   3.71e-03   -1.55  0.12104
## RunTime                 2.17e-03   4.69e-04    4.64  4.3e-06
## ContentRatingG         -4.12e-01   2.68e-01   -1.54  0.12431
## ContentRatingGP        -4.13e-01   3.69e-01   -1.12  0.26343
## ContentRatingM/PG      -4.71e-01   3.65e-01   -1.29  0.19740
## ContentRatingNC-17     -4.47e-01   2.92e-01   -1.53  0.12692
## ContentRatingNot Rated -2.10e-01   2.64e-01   -0.80  0.42541
## ContentRatingPassed    -2.68e-01   2.86e-01   -0.94  0.34965
## ContentRatingPG        -3.39e-01   2.63e-01   -1.29  0.19851
## ContentRatingPG-13     -3.95e-01   2.64e-01   -1.50  0.13502
## ContentRatingR         -3.51e-01   2.63e-01   -1.33  0.18255
## ContentRatingUnrated   -1.85e-01   3.68e-01   -0.50  0.61400
## log_GrossUS            -1.56e-02   2.99e-03   -5.20  2.7e-07
## log_NumRaters           6.83e+00   1.53e+00    4.47  9.6e-06
## GenreAdventure         -2.40e-02   7.52e-02   -0.32  0.74996
## GenreAnimation          8.14e-02   7.25e-02    1.12  0.26176
## GenreBiography          8.42e-02   5.87e-02    1.43  0.15219
## GenreComedy             5.84e-02   6.46e-02    0.90  0.36657
## GenreCrime              1.78e-02   5.66e-02    0.31  0.75298
## GenreDrama              1.74e-01   5.92e-02    2.94  0.00342
## GenreFantasy           -5.39e-02   6.97e-02   -0.77  0.43942
## GenreFilm-Noir          8.16e-02   1.28e-01    0.64  0.52292
## GenreHorror            -1.65e-01   6.74e-02   -2.45  0.01452
## GenreMusical            9.46e-02   1.01e-01    0.94  0.34927
## GenreMystery            1.14e-01   6.53e-02    1.74  0.08214
## GenreRomance            8.37e-02   5.97e-02    1.40  0.16142
## GenreSci-Fi            -4.48e-02   6.26e-02   -0.72  0.47453
## GenreThriller           2.02e-02   7.86e-02    0.26  0.79702
## Decade:log_NumRaters   -3.31e-03   7.68e-04   -4.32  1.9e-05
## 
## Residual standard error: 0.25 on 596 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.473,  Adjusted R-squared:  0.447 
## F-statistic: 17.8 on 30 and 596 DF,  p-value: <2e-16
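#RunTime has a highly significant coefficient (p = 4.3e-06). The estimate, 0.00217, is small per minute but not zero: holding the other predictors fixed, each extra minute of run time is associated with a 0.00217 higher user rating, or about 0.13 points for an extra hour.
#ContentRating is categorical, so the model includes one dummy variable for each level other than the baseline. For example, R-rated movies are predicted to score 0.351 lower than movies in the baseline content rating, holding the other predictors fixed. None of the ContentRating coefficients is significant at the 5% level.
#Train R square is 0.473 (adjusted R square is 0.447)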
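Because the model includes the Decade:log_NumRaters interaction, the log_NumRaters coefficient on its own is the slope only at Decade = 0, which lies far outside the data. As a rough worked example, the slope for a given decade can be computed from the rounded estimates printed above. The helper name slope_log_numraters is made up for illustration, and the results are approximate because they subtract two rounded numbers of similar size.

```r
# Rounded estimates copied from the summary(model) output above
b_log_numraters <- 6.83       # log_NumRaters
b_interaction   <- -3.31e-03  # Decade:log_NumRaters

# Hypothetical helper: slope of UserRating with respect to log_NumRaters
# for a movie released in the given decade
slope_log_numraters <- function(decade) {
  b_log_numraters + b_interaction * decade
}

slope_1950 <- slope_log_numraters(1950)
slope_2010 <- slope_log_numraters(2010)
print(c(slope_1950, slope_2010))
```

With these rounded inputs the slope is about 0.38 for the 1950s and about 0.18 for the 2010s, so a one-unit increase in log_NumRaters goes with a smaller rating gain for more recent movies.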
  • Question 7 (10 pts) (a) Use your model to predict the user rating on the test set. (b) Compute the test R square. Victory conditions: You used the right R function to calculate the prediction on the test set you created in Question 6, and compute the test R square.
  • Hint: If there are categorical values in the test set but not in the train set, you may see errors in the prediction. If such samples are not many, you can delete them in the test set.
  • Hint: You may need to remove samples with NA values in the prediction. Otherwise, the test R square you compute may be NA.
test <- test[!grepl("TV-PG", test$ContentRating),]
test <-test[!grepl("Other", test$Genre),]
test=test[complete.cases(test),]
pred=predict(model,newdata=test) 
SSE=sum((pred-test$UserRating)^2)
SST=sum((mean(test$UserRating)-test$UserRating)^2)
testR2=1-SSE/SST
testR2  
## [1] 0.28
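#The R squared on the test data is 0.28, well below the train R squared of 0.473, so the model explains less of the variation in user ratings on unseen movies than on the data it was fitted to. A test R squared can even be negative, which would mean the model predicts worse than simply using the mean of the test ratings.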
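The test R square calculation above can be wrapped in a small reusable function. This is a minimal sketch, assuming the same 1 - SSE/SST formula with the test-set mean in SST; the name test_r_squared is hypothetical, and the illustration uses R's built-in mtcars data rather than the movie data.

```r
# Hypothetical helper: out-of-sample R squared, same formula as above
test_r_squared <- function(actual, predicted) {
  keep <- complete.cases(actual, predicted)  # drop pairs with a missing value
  sse <- sum((predicted[keep] - actual[keep])^2)
  sst <- sum((mean(actual[keep]) - actual[keep])^2)
  1 - sse / sst
}

# Illustration on the built-in mtcars data (not the movie data)
set.seed(123)
train_rows <- sample(nrow(mtcars), size = round(0.7 * nrow(mtcars)))
fit <- lm(mpg ~ wt + hp, data = mtcars[train_rows, ])
holdout <- mtcars[-train_rows, ]
r2 <- test_r_squared(holdout$mpg, predict(fit, newdata = holdout))
print(r2)
```

Dropping incomplete pairs inside the function avoids an NA result without deleting every test row that has a missing value in an unrelated column.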
  • Question 8 (15 pts) In this question, we predict whether the movie breaks even. This is a classification problem, since BreakEven takes binary values. We use logistic regression for this question. (a) Build a logistic regression model using variables RunTime, UserRating, Genre, Budget, log_NumRaters, Decade to predict BreakEven on the train set we created in Question 6. (b) Take a look at the summary result of the model. Which factors are significant independent variables in this model? (c) Use the model to predict the probability that a movie in the test set breaks even. (d) Set the cutoff value to be 0.75. Calculate the confusion table and the prediction accuracy. Victory conditions: You created a proper logistic regression model on the train set and looked at the summary information of the model. You used the model to predict the break even probability for the films in the test set. You created a confusion matrix and calculated the prediction accuracy on the test set.
model1=glm(BreakEven~RunTime+UserRating+Genre+Budget+log_NumRaters+Decade, data=train, family="binomial")
summary(model1)
## 
## Call:
## glm(formula = BreakEven ~ RunTime + UserRating + Genre + Budget + 
##     log_NumRaters + Decade, family = "binomial", data = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.485   0.223   0.418   0.625   1.407  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)
## (Intercept)     2.46e+01   1.62e+01    1.52   0.1289
## RunTime        -8.13e-03   5.34e-03   -1.52   0.1277
## UserRating     -3.81e-01   4.90e-01   -0.78   0.4369
## GenreAdventure -1.86e-01   1.02e+00   -0.18   0.8551
## GenreAnimation -1.34e-01   9.36e-01   -0.14   0.8866
## GenreBiography -3.22e-01   7.43e-01   -0.43   0.6649
## GenreComedy    -8.84e-01   7.64e-01   -1.16   0.2470
## GenreCrime     -8.02e-01   7.11e-01   -1.13   0.2595
## GenreDrama     -5.27e-01   7.36e-01   -0.72   0.4738
## GenreFantasy   -1.20e+00   9.37e-01   -1.28   0.2003
## GenreFilm-Noir -1.17e+00   1.35e+00   -0.87   0.3861
## GenreHorror    -1.20e+00   8.07e-01   -1.48   0.1386
## GenreMusical   -6.34e-01   1.07e+00   -0.59   0.5544
## GenreMystery    2.36e-01   9.00e-01    0.26   0.7935
## GenreRomance   -3.49e-01   7.42e-01   -0.47   0.6382
## GenreSci-Fi    -2.31e+00   7.95e-01   -2.90   0.0037
## GenreThriller  -1.55e+00   8.38e-01   -1.85   0.0643
## Budget          7.06e-09   5.48e-09    1.29   0.1977
## log_NumRaters   1.12e+00   1.92e-01    5.82  5.9e-09
## Decade         -1.59e-02   7.65e-03   -2.07   0.0383
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 563.39  on 626  degrees of freedom
## Residual deviance: 485.18  on 607  degrees of freedom
##   (4 observations deleted due to missingness)
## AIC: 525.2
## 
## Number of Fisher Scoring iterations: 5
#The coefficients that are significant at the 5% level correspond to Genre=Sci-Fi, log_NumRaters and Decade
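Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below does this for the three significant terms, using the rounded estimates printed above; the vector name coefs and its element names are made up for illustration.

```r
# Rounded estimates copied from the summary(model1) output above
coefs <- c(log_NumRaters = 1.12, GenreSciFi = -2.31, Decade = -1.59e-02)

# exp() turns a log-odds coefficient into an odds ratio
odds_ratios <- exp(coefs)
print(round(odds_ratios, 2))

# Decade is measured in years, so one decade later is a 10-unit change
odds_ratio_per_decade <- exp(10 * coefs[["Decade"]])
print(round(odds_ratio_per_decade, 2))
```

Read this way, a one-unit increase in log_NumRaters multiplies the odds of breaking even by roughly 3, a Sci-Fi movie has about one tenth the odds of the baseline genre, and each later decade lowers the odds by roughly 15%, in each case holding the other predictors fixed.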
p_break_even=predict(model1,newdata=test,type="response")
overall_prob_breakeven= mean(p_break_even)
overall_prob_breakeven
## [1] 0.92
#The average predicted probability of breaking even across the test set is 92%

cutoff=3/4
as.numeric(p_break_even>=cutoff)
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
tab=table(test$BreakEven, p_break_even>=cutoff)
tab
##    
##     FALSE TRUE
##   1     1   20
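#diag(tab) only works for a square table. Every movie left in the test set broke even, so tab has a single row and diag(tab) picks the wrong cell. Comparing predictions with actual values directly avoids this.
accuracy = mean(as.numeric(p_break_even>=cutoff)==test$BreakEven)
accuracy
## [1] 0.95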
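A confusion table with a single row, like the one above, has no meaningful diagonal. One way to keep the table square is to fix the factor levels before calling table(). The sketch below rebuilds the 21 test cases from the printed output (all 21 broke even, 20 were predicted to break even); the names tab_full and accuracy_full are made up for illustration.

```r
# Values reconstructed from the output above: all 21 test movies broke even,
# and 20 of them were predicted to break even at the 0.75 cutoff
actual    <- rep(1, 21)
predicted <- c(rep(1, 19), 0, 1)

# Fixing the levels keeps the table 2 x 2 even when a class never occurs
tab_full <- table(Actual = factor(actual, levels = c(0, 1)),
                  Predicted = factor(predicted, levels = c(0, 1)))
print(tab_full)

# With a square table the diagonal holds the correct predictions
accuracy_full <- sum(diag(tab_full)) / sum(tab_full)
print(accuracy_full)
```

Here 20 of the 21 predictions are correct, an accuracy of about 0.95.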

Problem 2 Analysis of the car dataset (This is a challenging and optional question)

This is a challenging problem, and you’ll need to learn some new material by yourself. It is meant to test your ability to learn on your own. During this exercise, you are welcome to look up functions and their documentation online. It is important to learn what we teach in the lecture, and it is equally important to pick up the habit of self-learning, since we cannot cover all coding questions you might face in the future.

More specifically, we will use a decision tree to make decisions. Decision tree learning or induction of decision trees is one of the predictive modelling approaches used in statistics, data mining and machine learning. It uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item’s target value (represented in the leaves). Tree models where the target variable can take a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. Decision trees are among the most popular machine learning algorithms given their intelligibility and simplicity.

In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data (but the resulting classification tree can be an input for decision making). This page deals with decision trees in data mining.

There are two major Decision Tree models:

Classification Tree: when the predicted outcome is the class (discrete) to which the data belongs.

Regression Tree: when the predicted outcome can be considered a real number.

Question 1 We will use the libraries “caret”, “rpart”, and “rpart.plot”. If you do not have them, please install them first, and then load these three packages. Victory conditions: Successfully install and import these three packages.

Question 2 Download the car.data dataset from https://archive.ics.uci.edu/ml/machine-learning-databases/car/ to your file system, and then read the data to your environment by the read.csv function after setting up your working directory. The dataset contains information of second-hand cars. We want to evaluate the condition of the car, i.e., unacceptable, acceptable, good, very good, by using six features: buying, maint, doors, persons, lug_boot, safety. More information about this dataset can be found at: https://archive.ics.uci.edu/ml/datasets/car+evaluation. Victory conditions: Store the data into a variable in your environment.
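One detail worth noting when reading this file: car.data has no header row, so read.csv with its default header = TRUE would turn the first car into column names. The sketch below shows one way to read it. So that it runs without the download, it writes a few illustrative rows in the same comma-separated layout to a temporary file; the name class for the outcome column is an assumption, while the six feature names come from the question.

```r
# Illustrative rows in the same comma-separated layout as car.data;
# a temporary file stands in for the downloaded file
sample_lines <- c("vhigh,vhigh,2,2,small,low,unacc",
                  "low,med,4,more,big,high,vgood",
                  "med,low,3,4,med,med,acc")
car_path <- tempfile(fileext = ".data")
writeLines(sample_lines, car_path)

# header = FALSE because the file has no header row; "class" is an assumed
# name for the outcome column, the other six names come from the question
car <- read.csv(car_path, header = FALSE,
                col.names = c("buying", "maint", "doors", "persons",
                              "lug_boot", "safety", "class"),
                colClasses = "factor")
str(car)
```

Reading every column as a factor treats values such as the number of doors as categories, which suits a classification tree on this data.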

Question 3 Similar to linear/logistic regression, we create a random train set with 70% of the data and a test set with 30% of the data with random seed 123. Victory conditions: Randomly create a train set and a test set.
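A 70/30 split with seed 123 can be sketched in base R with sample(). This is an alternative to the caTools::sample.split call used in Problem 1, not the required method, and the data frame dat is a stand-in for the car data.

```r
set.seed(123)

# Stand-in data frame; replace with the car data read in Question 2
dat <- data.frame(id = 1:100)

# Draw 70% of the row numbers at random for the train set
train_rows <- sample(nrow(dat), size = floor(0.7 * nrow(dat)))
train_set <- dat[train_rows, , drop = FALSE]
test_set  <- dat[-train_rows, , drop = FALSE]

print(c(nrow(train_set), nrow(test_set)))
```

Negative indexing with the same row numbers guarantees that no row lands in both sets.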

Question 4 Now it is time to train a decision tree model on the train dataset. We here use all features to predict unacc. Hint: use the rpart function to create a decision tree model. Search online to see the usage of the rpart function. Victory conditions: Create a decision tree model.

Question 5 Next, we get the information about the decision tree model we built and plot it. How to interpret this decision model? Hint: use prp() function to plot decision tree. Search online to see the usage of the prp function. Victory conditions: Visualize the decision tree and interpret the output.

Question 6 Finally, we use the decision tree model to predict on the testing data. What is the predicted probability of unacc on the test set? How to interpret the output? Hint: use predict() function Victory conditions: Make the prediction using the decision tree model you get and interpret the output.
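Questions 4 to 6 can be sketched together as follows. This is a minimal illustration under stated assumptions, not a solution on the real data: it uses a small stand-in data set with two of the car features and a made-up labelling rule, and the names grid, car_train, car_test, tree and probs are hypothetical. Only rpart() and predict() are used in the code.

```r
library(rpart)

# Stand-in data with two of the car features and a made-up labelling rule;
# replace with the train and test sets from Question 3
grid <- expand.grid(safety = c("low", "med", "high"),
                    persons = c("2", "4", "more"),
                    copy = 1:20, stringsAsFactors = TRUE)
grid$class <- factor(ifelse(grid$safety == "low" | grid$persons == "2",
                            "unacc", "acc"))
set.seed(123)
train_rows <- sample(nrow(grid), size = floor(0.7 * nrow(grid)))
car_train <- grid[train_rows, ]
car_test  <- grid[-train_rows, ]

# Classification tree: method = "class" because the outcome is categorical
tree <- rpart(class ~ safety + persons, data = car_train, method = "class")
print(tree)

# type = "prob" returns one column of predicted probabilities per class
probs <- predict(tree, newdata = car_test, type = "prob")
head(probs[, "unacc"])
```

Each row of probs gives the share of training cases of each class in the leaf that the test car falls into, so the unacc column is the predicted probability of an unacceptable car. If rpart.plot is installed, prp(tree) draws the fitted tree.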