Name: Florencia Palacios
The midterm is due Wednesday, April 26th at 23:59.
You must work alone.
You will submit a knitted HTML file to Canvas. Late submissions within 24 hours of the due date and time receive a 20% penalty; anything later receives no credit. Submit a knitted HTML file for full credit.
For full credit on each question, make sure you follow exactly what is asked, and answer each prompt. The total is 100 points (and there are 10 bonus points).
You may only ask clarification questions on Slack; you may not ask for help. This is an exam, not a homework or lab assignment.
In some cases, incomplete code is given to guide you in the right direction. You will need to fill in the blanks and complete the entire code block for it to run. Make sure to fill in all the blanks and complete all the missing parts before you knit your R Markdown file. Otherwise, knitting will fail with errors.
Use this block to load all the packages you will be using. Victory conditions: All the packages needed to run the R Markdown file and knit it to HTML are included. Submit a knitted HTML file for full credit.
Your data for the midterm exam consists of the 1036 highest rated movies on the Internet Movie Database (IMDB). We will be using this data to predict the IMDB user ratings for movies. Some of these variables may be missing for a particular movie. For your convenience, here is the codebook for this data set:
| Variable name | Description |
|---|---|
| Title | The movie title |
| Year | The year the movie was released |
| RunTime | The movie’s run time in minutes |
| ContentRating | Content rating of the movie (G, PG, PG-13, R, NC-17, etc.) |
| UserRating | The average rating of the movie by IMDB users |
| Metascore | The average rating of the movie by Metacritic users |
| NumRaters | The number of ratings of the movie |
| Genre | The movie’s genre |
| Budget | The movie’s budget (in USD) |
| GrossUS | The gross revenue of the movie in the US (in USD) |
| GrossGlobal | The gross revenue of the movie globally (in USD) |
| Director1 | The movie’s director |
| Director2 | The movie’s director |
| Actor1 | The movie’s actor |
| Actor2 | The movie’s actor |
| Actor3 | The movie’s actor |
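Since some variables may be missing for a given movie, it can help to count missing values per column before modelling. The following is a minimal sketch on a hypothetical toy data frame (the values are made up, only the column names come from the codebook):

```r
# Hypothetical toy data standing in for a few codebook columns
toy <- data.frame(
  Title = c("A", "B", "C", "D"),
  Metascore = c(80, NA, 65, NA),
  Budget = c(1e6, 5e6, NA, 2e7)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Rows that have no missing value in any column
complete_rows <- toy[complete.cases(toy), ]
print(nrow(complete_rows))
```

Note that `complete.cases()` applied to the whole data frame drops a row if any column is missing, including columns that a given model does not use.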
movie. What are the dimensions and column names of movie? What are the 10 movies with the highest user ratings? Show only their titles, user ratings and genres. Victory conditions: In addition to the dimensions and column names, you have a data frame with 10 rows and 3 columns, corresponding to the 3 variables of interest.
movie = read.csv("movie.csv")
movie %>%
  arrange(desc(UserRating)) %>%
  # select() keeps columns; summarise() is meant to return one row per group
  select(Title, UserRating, Genre) %>%
  head(10)
UserRatings, NumRaters and GrossUS of each Genre. Round your average ratings to 2 decimal points and order the results by highest average ratings. Victory conditions: Your code returns a data frame with dimensions 16 x 7, ordered by the highest average ratings. All columns have short but clear names.
movie %>%
  group_by(Genre) %>%
  summarise(
    Average_Rating = round(mean(UserRating), 2),
    Median_Rating = round(median(UserRating), 2),
    Average_NumRaters = mean(NumRaters),
    Median_NumRaters = median(NumRaters),
    Average_GrossUS = mean(GrossUS),
    Median_GrossUS = median(GrossUS)
  ) %>%
  arrange(desc(Average_Rating))
NumRaters and the log of GrossUS, called log_NumRaters and log_GrossUS. (a) Visualize the distributions of log_NumRaters and log_GrossUS. (b) Plot the relationship between log_NumRaters and log_GrossUS, and differentiate movies by their ContentRating. Victory conditions: (a) You have two plots showing two different probability distributions, each with clearly labeled axes and reasonable precision. (b) You have one plot showing relationships between all the variables of interest.
The log function in R takes the natural log of a value.
movie$log_NumRaters = log(movie$NumRaters)
movie$log_GrossUS = log(movie$GrossUS)
# (a) Distributions of the two log-transformed variables
ggplot(data = movie, mapping = aes(x = log_NumRaters)) + geom_histogram(binwidth = 0.1)
ggplot(data = movie, mapping = aes(x = log_GrossUS)) + geom_histogram(binwidth = 0.1)
# (b) Relationship between the two variables, by content rating
ggplot(data = movie, mapping = aes(x = log_GrossUS, y = log_NumRaters)) + geom_point() + facet_wrap(~ContentRating)
ggplot(data = movie, mapping = aes(x = log_GrossUS, y = log_NumRaters, color = ContentRating)) + geom_point()
#There is a positive relationship between log gross US revenue and log number of raters when movies are grouped by content rating.
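A small illustration of why the log transform is useful for heavily right-skewed variables such as counts or revenues. This is a sketch on hypothetical numbers, not on the movie data: values that are a constant factor apart end up a constant distance apart on the log scale.

```r
# Hypothetical, heavily right-skewed counts (each ten times the previous one)
raters <- c(1e3, 1e4, 1e5, 1e6)

log_raters <- log(raters)      # log() is the natural log by default

gaps_raw <- diff(raters)       # gaps grow tenfold at each step
gaps_log <- diff(log_raters)   # gaps are all equal to log(10)

print(gaps_raw)
print(gaps_log)
```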
min(movie$Year) #Minimum year is 1920
## [1] 1920
max(movie$Year) #Max year is 2018
## [1] 2018
movie$Decade <- floor(movie$Year/10)*10
movie %>%
group_by(Decade) %>%
summarise(Count=n()) %>%
arrange(desc(Count))
movie %>%
  group_by(Decade) %>%
  slice(c(which.min(UserRating), which.max(UserRating))) %>%
  # select() keeps the lowest- and highest-rated rows per decade without the summarise() deprecation warning
  select(Decade, Title, Year, UserRating)
ggplot(data = movie, mapping = aes(x = UserRating)) + geom_histogram(binwidth = 0.1) + facet_wrap(~Decade)
#The lowest-rated movie is Jump Street, from the 2010s, and the highest-rated movie is The Shawshank Redemption, from the 1990s, with a rating of 9.3.
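The decade variable used above rounds each year down to the nearest multiple of 10. A minimal sketch on a few hypothetical years, showing that `floor(Year/10)*10` and integer division with `%/%` give the same result:

```r
# Hypothetical years spanning the range reported above (1920 to 2018)
years <- c(1920, 1929, 1994, 2010, 2018)

decade_floor <- floor(years / 10) * 10   # approach used in this document
decade_intdiv <- years %/% 10 * 10       # equivalent, using integer division

print(decade_floor)
```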
BreakEven, if a movie has either its gross US box office or its gross global box office larger than the budget, we will give 1 for BreakEven, otherwise 0. Save the variable BreakEven to movie in the environment. (b) What is the average user rating for movies that made a profit versus those that didn’t? (c) Calculate the proportion of movies that are profitable in each decade. (d) Comment on how the likelihood of being profitable changes over the decades. Victory conditions: You update the movie variable by creating a new column BreakEven. You have a data frame showing percentages of profitable movies for each decade. Your comment is clear and concise.
movie$BreakEven <- if_else(movie$GrossUS > movie$Budget | movie$GrossGlobal > movie$Budget, 1, 0)
movie %>%
group_by(BreakEven) %>%
summarise(Average_Rating=round(mean(UserRating),2))
#For movies that were profitable, the average rating is 7.85, versus 7.84 for those that were not profitable.
movie %>%
  group_by(Decade) %>%
  summarise(Count_1 = sum(BreakEven == 1), Count_0 = sum(BreakEven == 0), Prop_profitable = round(Count_1 / (Count_0 + Count_1), 2)) %>%
  select(Decade, Prop_profitable)
#TODO: make sure only one row is shown per decade
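Because the codebook warns that some values may be missing, it is worth knowing how the `|` comparison behind BreakEven treats `NA`. The sketch below uses a hypothetical toy data frame and base `ifelse()` rather than `dplyr::if_else()`, so that it runs without extra packages:

```r
# Hypothetical budgets and revenues, with some revenues missing
toy <- data.frame(
  Budget = c(10, 10, 10, 10),
  GrossUS = c(20, 5, NA, NA),
  GrossGlobal = c(NA, 8, 30, 4)
)

# TRUE | NA is TRUE, but FALSE | NA and NA | FALSE are NA
profit <- toy$GrossUS > toy$Budget | toy$GrossGlobal > toy$Budget
break_even <- ifelse(profit, 1, 0)
print(break_even)

# A proportion computed with mean() is NA unless missing values are dropped
prop_all <- mean(break_even)
prop_known <- mean(break_even, na.rm = TRUE)
```

A movie is only flagged as `NA` when neither comparison can establish a profit, which is the last row in this toy example.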
train set and test set, where the train set is 70% of the data and the test set is 30% of the data.
train set predicting user rating from decade, year, movie run time, content rating, log of total number of raters, log of gross US, and movie genre, as well as the interaction between decades and log of total number of raters. (c) Choose one categorical term and one numeric term in the regression output and comment on the meaning of their coefficients. What is the train R square? Victory conditions: You obtain a train set and a test set. You have the estimated coefficients and related summary statistics for your model. Your comment for the coefficients argues cogently from specific results of the code.
library(caTools)
set.seed(123)
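samples = sample.split(movie$Metascore, SplitRatio = 0.7)
# Use TRUE/FALSE rather than T/F, which are ordinary variables that can be overwritten
train = subset(movie, samples == TRUE)
test = subset(movie, samples == FALSE)
model = lm(UserRating ~ Decade + Year + RunTime + ContentRating + log_GrossUS + log_NumRaters + Genre + Decade:log_NumRaters, data = train)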
summary(model)
##
## Call:
## lm(formula = UserRating ~ Decade + Year + RunTime + ContentRating +
## log_GrossUS + log_NumRaters + Genre + Decade:log_NumRaters,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6492 -0.1667 -0.0069 0.1487 1.2032
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.34e+01 1.81e+01 -3.50 0.00051
## Decade 4.03e-02 1.00e-02 4.01 6.8e-05
## Year -5.77e-03 3.71e-03 -1.55 0.12104
## RunTime 2.17e-03 4.69e-04 4.64 4.3e-06
## ContentRatingG -4.12e-01 2.68e-01 -1.54 0.12431
## ContentRatingGP -4.13e-01 3.69e-01 -1.12 0.26343
## ContentRatingM/PG -4.71e-01 3.65e-01 -1.29 0.19740
## ContentRatingNC-17 -4.47e-01 2.92e-01 -1.53 0.12692
## ContentRatingNot Rated -2.10e-01 2.64e-01 -0.80 0.42541
## ContentRatingPassed -2.68e-01 2.86e-01 -0.94 0.34965
## ContentRatingPG -3.39e-01 2.63e-01 -1.29 0.19851
## ContentRatingPG-13 -3.95e-01 2.64e-01 -1.50 0.13502
## ContentRatingR -3.51e-01 2.63e-01 -1.33 0.18255
## ContentRatingUnrated -1.85e-01 3.68e-01 -0.50 0.61400
## log_GrossUS -1.56e-02 2.99e-03 -5.20 2.7e-07
## log_NumRaters 6.83e+00 1.53e+00 4.47 9.6e-06
## GenreAdventure -2.40e-02 7.52e-02 -0.32 0.74996
## GenreAnimation 8.14e-02 7.25e-02 1.12 0.26176
## GenreBiography 8.42e-02 5.87e-02 1.43 0.15219
## GenreComedy 5.84e-02 6.46e-02 0.90 0.36657
## GenreCrime 1.78e-02 5.66e-02 0.31 0.75298
## GenreDrama 1.74e-01 5.92e-02 2.94 0.00342
## GenreFantasy -5.39e-02 6.97e-02 -0.77 0.43942
## GenreFilm-Noir 8.16e-02 1.28e-01 0.64 0.52292
## GenreHorror -1.65e-01 6.74e-02 -2.45 0.01452
## GenreMusical 9.46e-02 1.01e-01 0.94 0.34927
## GenreMystery 1.14e-01 6.53e-02 1.74 0.08214
## GenreRomance 8.37e-02 5.97e-02 1.40 0.16142
## GenreSci-Fi -4.48e-02 6.26e-02 -0.72 0.47453
## GenreThriller 2.02e-02 7.86e-02 0.26 0.79702
## Decade:log_NumRaters -3.31e-03 7.68e-04 -4.32 1.9e-05
##
## Residual standard error: 0.25 on 596 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.473, Adjusted R-squared: 0.447
## F-statistic: 17.8 on 30 and 596 DF, p-value: <2e-16
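#RunTime has a statistically significant coefficient (p = 4.3e-06). The estimate is 2.17e-03, so each additional minute of run time is associated with an increase of about 0.002 in UserRating (about 0.13 for an extra hour), holding the other variables fixed.
#ContentRating is a categorical variable entered as dummy variables, one for each rating other than the reference level. Each coefficient is the difference in predicted UserRating relative to the reference level; for example, the estimate for R is -0.351. None of them is significant at the 5% level.
#Train R square is 0.473 (adjusted R square is 0.447).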
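The interaction term means that the slope on log_NumRaters is not a single number: it depends on Decade. A minimal worked sketch using the rounded estimates printed in the regression output above (6.83e+00 for log_NumRaters and -3.31e-03 for Decade:log_NumRaters). Because those estimates are rounded to three significant digits, the resulting slopes are only approximate. The function name is hypothetical.

```r
# Rounded estimates copied from the regression output
b_log_raters <- 6.83
b_interaction <- -3.31e-03

# Change in predicted UserRating per unit of log_NumRaters in a given decade
slope_log_raters <- function(decade) b_log_raters + b_interaction * decade

slopes <- sapply(c(1920, 1990, 2010), slope_log_raters)
print(round(slopes, 3))
```

Under these rounded values the slope stays positive in every decade in the data but shrinks over time, so the large main-effect coefficient should not be read on its own.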
test set. (b) Compute the test R square. Victory conditions: You used the right R function to calculate the prediction on the test set you created in Question 6, and compute the test R square.
# Drop factor levels that are not in the fitted model (TV-PG, Other)
test <- test[!grepl("TV-PG", test$ContentRating), ]
test <- test[!grepl("Other", test$Genre), ]
# complete.cases() drops every row with a missing value in any column
test = test[complete.cases(test), ]
pred=predict(model,newdata=test)
SSE=sum((pred-test$UserRating)^2)
SST=sum((mean(test$UserRating)-test$UserRating)^2)
testR2=1-SSE/SST
testR2
## [1] 0.28
#The test R squared is 0.28, lower than the train R squared of 0.473, so the model explains less of the variation in user ratings on unseen data than on the data it was fitted to.
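The test R square computed above is 1 - SSE/SST. As a small sketch on hypothetical ratings (the helper name `r_squared` is not from the exam), the same formula gives 1 for perfect predictions, 0 when every prediction equals the mean of the actual values, and a negative value when predictions are worse than that mean:

```r
# Out-of-sample R squared: 1 - SSE / SST
r_squared <- function(actual, predicted) {
  sse <- sum((actual - predicted)^2)
  sst <- sum((actual - mean(actual))^2)
  1 - sse / sst
}

actual <- c(7.5, 8.0, 8.5, 9.0)   # hypothetical user ratings

r2_perfect <- r_squared(actual, actual)
r2_mean <- r_squared(actual, rep(mean(actual), length(actual)))
r2_bad <- r_squared(actual, rev(actual))

print(c(r2_perfect, r2_mean, r2_bad))
```

Unlike the train R square of a linear model with an intercept, this quantity is not bounded below by 0 on new data.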
BreakEven has binary value. We use logistic regression for this question. (a) Build a logistic regression model using the variables RunTime, UserRating, Genre, Budget, log_NumRaters and Decade to predict BreakEven on the train set we created in Question 6. (b) Look at the summary of the model. Which factors are significant independent variables in this model? (c) Use the model to predict the probability that a movie in the test set breaks even. (d) Set the cutoff value to 0.75. Calculate the confusion table and the prediction accuracy. Victory conditions: You created a proper logistic regression model on the train set and looked at the summary information of the model. You used the model to predict the break-even probability for the films in the test set. You created a confusion matrix and calculated the prediction accuracy on the test set.
model1 = glm(BreakEven ~ RunTime + UserRating + Genre + Budget + log_NumRaters + Decade, data = train, family = "binomial")
summary(model1)
##
## Call:
## glm(formula = BreakEven ~ RunTime + UserRating + Genre + Budget +
## log_NumRaters + Decade, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.485 0.223 0.418 0.625 1.407
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.46e+01 1.62e+01 1.52 0.1289
## RunTime -8.13e-03 5.34e-03 -1.52 0.1277
## UserRating -3.81e-01 4.90e-01 -0.78 0.4369
## GenreAdventure -1.86e-01 1.02e+00 -0.18 0.8551
## GenreAnimation -1.34e-01 9.36e-01 -0.14 0.8866
## GenreBiography -3.22e-01 7.43e-01 -0.43 0.6649
## GenreComedy -8.84e-01 7.64e-01 -1.16 0.2470
## GenreCrime -8.02e-01 7.11e-01 -1.13 0.2595
## GenreDrama -5.27e-01 7.36e-01 -0.72 0.4738
## GenreFantasy -1.20e+00 9.37e-01 -1.28 0.2003
## GenreFilm-Noir -1.17e+00 1.35e+00 -0.87 0.3861
## GenreHorror -1.20e+00 8.07e-01 -1.48 0.1386
## GenreMusical -6.34e-01 1.07e+00 -0.59 0.5544
## GenreMystery 2.36e-01 9.00e-01 0.26 0.7935
## GenreRomance -3.49e-01 7.42e-01 -0.47 0.6382
## GenreSci-Fi -2.31e+00 7.95e-01 -2.90 0.0037
## GenreThriller -1.55e+00 8.38e-01 -1.85 0.0643
## Budget 7.06e-09 5.48e-09 1.29 0.1977
## log_NumRaters 1.12e+00 1.92e-01 5.82 5.9e-09
## Decade -1.59e-02 7.65e-03 -2.07 0.0383
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 563.39 on 626 degrees of freedom
## Residual deviance: 485.18 on 607 degrees of freedom
## (4 observations deleted due to missingness)
## AIC: 525.2
##
## Number of Fisher Scoring iterations: 5
#The coefficients that are significant at the 5% level are Genre = Sci-Fi, log_NumRaters and Decade.
p_break_even=predict(model1,newdata=test,type="response")
overall_prob_breakeven = mean(p_break_even)
overall_prob_breakeven
## [1] 0.92
#The average predicted probability of breaking even on the test set is 92%
cutoff=3/4
as.numeric(p_break_even>=cutoff)
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
tab=table(test$BreakEven, p_break_even>=cutoff)
tab
##
## FALSE TRUE
## 1 1 20
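# The table has a single row (every movie in the test set broke even), so diag() does not pick out the correct predictions; compare the labels directly instead
accuracy = mean(test$BreakEven == as.numeric(p_break_even >= cutoff))
accuracy
## [1] 0.95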
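When one class is absent from the actual labels, `table()` returns a non-square table and `diag()` no longer lines up actual and predicted classes. A minimal sketch on hypothetical vectors that mirror the counts in the table above (21 movies, all of which broke even, 20 predicted above the cutoff): fixing the factor levels on both sides gives a 2 x 2 table whose diagonal holds the correct predictions.

```r
# Hypothetical labels mirroring the counts in the confusion table
actual <- rep(1, 21)                  # every movie broke even
predicted <- c(rep(1, 19), 0, 1)      # one prediction falls below the cutoff

tab_raw <- table(actual, predicted)   # 1 x 2 table: the row for class 0 is missing
acc_raw <- sum(diag(tab_raw)) / sum(tab_raw)

# Fix the levels so both classes appear as rows and columns
tab_full <- table(
  factor(actual, levels = c(0, 1)),
  factor(predicted, levels = c(0, 1))
)
accuracy <- sum(diag(tab_full)) / sum(tab_full)

print(tab_full)
print(c(acc_raw, accuracy))
```

With the full table the accuracy is 20/21, whereas the single-row table yields 1/21 because its only diagonal cell is the one misclassified movie.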
This is a challenging problem, and you will need to learn some new material on your own. It is meant to test your ability to teach yourself. During this exercise, you are welcome to look up functions and their documentation online. It is important to learn what we teach in the lecture, and equally important to build the habit of self-learning, since we cannot cover every coding question you might face in the future.
More specifically, we will use a decision tree to make decisions. Decision tree learning, or induction of decision trees, is one of the predictive modelling approaches used in statistics, data mining and machine learning. It uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item’s target value (represented in the leaves). Tree models where the target variable can take a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. Decision trees are among the most popular machine learning algorithms given their intelligibility and simplicity.
In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data (but the resulting classification tree can be an input for decision making). This page deals with decision trees in data mining.
There are two major Decision Tree models:
Classification Tree: when the predicted outcome is the class (discrete) to which the data belongs.
Regression Tree: when the predicted outcome can be considered a real number.
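To make the classification tree idea concrete, here is a minimal sketch using the `rpart` package on R's built-in `iris` data. This is an illustration only, not the car data set used in the questions below, and it assumes `rpart` is installed (it is normally bundled with R as a recommended package).

```r
library(rpart)

# Classification tree: the target (Species) takes a discrete set of values
fit <- rpart(Species ~ ., data = iris, method = "class")

# Text description of the splits (branches) and the class in each leaf
print(fit)

# Predicted class labels for the rows the tree was fitted on
pred_class <- predict(fit, newdata = iris, type = "class")
train_accuracy <- mean(pred_class == iris$Species)
print(train_accuracy)
```

Using `method = "anova"` with a numeric target would fit a regression tree instead.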
Question 1 We will use the libraries “caret”, “rpart”, and “rpart.plot”. If you do not have them, please install them first, and then load these three packages. Victory conditions: Successfully install and load these three packages.
Question 2 Download the car.data dataset from https://archive.ics.uci.edu/ml/machine-learning-databases/car/
to your file system, and then read the data to your environment by the
read.csv function after setting up your working directory.
The dataset contains information about second-hand cars. We want to
evaluate the condition of the car, i.e., unacceptable, acceptable, good,
very good, by using six features: buying, maint, doors, persons,
lug_boot, safety. More information about this dataset can be found at:
https://archive.ics.uci.edu/ml/datasets/car+evaluation.
Victory conditions: Store the data into a variable in
your environment.
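A hedged sketch of the reading step, assuming the file is comma-separated with no header row. The three rows below are hypothetical lines in that layout, read from a string so the example runs without a download, and the column name `class` for the evaluation is an assumption (the text above names only the six features).

```r
# Hypothetical lines in the assumed layout: six features, then the evaluation
raw_text <- paste(
  "vhigh,vhigh,2,2,small,low,unacc",
  "low,low,4,more,big,high,vgood",
  "med,high,3,4,med,med,acc",
  sep = "\n"
)

col_names <- c("buying", "maint", "doors", "persons", "lug_boot", "safety", "class")

# header = FALSE stops read.csv from using the first data row as column names
car <- read.csv(text = raw_text, header = FALSE, col.names = col_names,
                stringsAsFactors = TRUE)
str(car)
```

With a real file, `text = raw_text` would be replaced by the path to the downloaded file. If the default `header = TRUE` is used instead, the first data row becomes the column names, so the last column would be named `unacc`.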
Question 3 Similar to linear/logistic regression, we
create a random train set with 70% of the data and a
test set with 30% of the data with random seed 123.
Victory conditions: Randomly create a train set and a
test set.
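A minimal sketch of a reproducible 70/30 split, shown on a hypothetical toy data frame. It uses base R's `sample()` rather than `caTools::sample.split()`, which is a substitution made here so the example needs no extra packages.

```r
set.seed(123)  # fixes the random draw so the split is reproducible

# Hypothetical data with 100 rows
toy <- data.frame(id = 1:100, y = rnorm(100))

# Draw 70% of the row indices at random, without replacement
train_idx <- sample(nrow(toy), size = round(0.7 * nrow(toy)))

train <- toy[train_idx, ]
test <- toy[-train_idx, ]

print(c(nrow(train), nrow(test)))
```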
Question 4 Now it is time to train a decision tree
model on the train dataset. Here we use all features to predict
unacc. Hint: use the rpart function to create
a decision tree model. Search online to see the usage of the
rpart function. Victory conditions: Create
a decision tree model.
Question 5 Next, we get the information about the
decision tree model we built and plot it. How do you interpret this decision
tree model? Hint: use the prp() function to plot the decision tree. Search online to
see the usage of the prp function. Victory
conditions: Visualize the decision tree and interpret the
output.
Question 6 Finally, we use the decision tree model
to predict on the testing data. What is the predicted probability of
unacc on the test set? How do you interpret the output? Hint:
use the predict() function. Victory conditions: Make the
prediction using the decision tree model you get and interpret the
output.
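As an illustration of what class probabilities from a tree look like, the sketch below fits a classification tree to the `kyphosis` data that ships with `rpart` (not the car data) and calls `predict()` with `type = "prob"`. Each row of the result holds one probability per class, and the probabilities in a row sum to 1.

```r
library(rpart)

# Classification tree on the kyphosis data bundled with rpart
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# One column per class, one row per observation
prob <- predict(fit, newdata = kyphosis, type = "prob")
print(head(prob))

# Turning probabilities into a decision requires a cutoff (0.5 is used here as an example)
pred_present <- prob[, "present"] >= 0.5
```

All observations that fall into the same leaf share the same predicted probabilities, since these are the class proportions of the training rows in that leaf.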