library(tidyverse)
horror_movies <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-22/horror_movies.csv") %>%
# Cast column has multiple actors: separate them into multiple rows
separate_rows(cast, sep = "\\|") %>%
# Remove white spaces
mutate(cast = str_trim(cast)) %>%
# lump together least common factor levels
mutate(cast = fct_lump(cast, 10)) %>%
# filter out "Other"
filter(cast != "Other")
# Actors appear to make a difference in review ratings. For example, Lloyd Kaufman's movies tend to be rated higher than Eric Roberts' movies. But is that always the case? Note the outlier: one of Roberts' movies is rated as high as any of Kaufman's.
horror_movies %>%
mutate(cast = fct_reorder(cast, review_rating, na.rm = TRUE)) %>%
ggplot(aes(cast, review_rating, fill = cast)) +
geom_boxplot() +
coord_flip() +
theme(legend.position = "none") +
labs(title = "Horror Movie Review Ratings by Top 10 Most Common Actors",
y = "Horror Movie Review Ratings",
x = NULL)
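The outlier question raised with the box plot can be checked by listing each actor's highest-rated movie next to the group median. The sketch below runs on made-up rows, not the real data: `toy_cast` and its ratings are hypothetical, and `title` is assumed to be the column holding the movie name.

```r
library(dplyr)

# Made-up rows for illustration only; `title` is an assumed column name
toy_cast <- data.frame(
  title = c("Movie 1", "Movie 2", "Movie 3", "Movie 4", "Movie 5", "Movie 6"),
  cast = c("Eric Roberts", "Eric Roberts", "Eric Roberts",
           "Lloyd Kaufman", "Lloyd Kaufman", "Lloyd Kaufman"),
  review_rating = c(3.1, 3.6, 7.2, 5.4, 5.9, 6.8)
)

# Highest-rated movie per actor; with_ties = FALSE keeps exactly one row each
top_per_actor <- toy_cast %>%
  group_by(cast) %>%
  slice_max(review_rating, n = 1, with_ties = FALSE) %>%
  ungroup()

top_per_actor

# The median is far less sensitive to a single outlier than the maximum
rating_spread <- toy_cast %>%
  group_by(cast) %>%
  summarise(median_rating = median(review_rating),
            max_rating = max(review_rating))

rating_spread
```

Running the same two steps on `horror_movies` would show which Roberts movie sits above the whiskers and whether it changes the overall comparison.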
# The intercept is the predicted (mean) review rating for movies starring Bill Moseley, the reference level; every other coefficient is a difference from that baseline
horror_movies %>%
lm(review_rating ~ cast, data = .) %>%
summary()
##
## Call:
## lm(formula = review_rating ~ cast, data = .)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.2640 -0.9720 -0.1000  0.8777  5.0000
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)          4.673333   0.364202  12.832   <2e-16 ***
## castBill Oberst Jr.  0.005238   0.524176   0.010   0.9920
## castBrinke Stevens  -0.173333   0.499681  -0.347   0.7291
## castDebbie Rochon    0.057101   0.468134   0.122   0.9031
## castElissa Dowling   0.140000   0.515059   0.272   0.7861
## castEric Roberts    -1.109333   0.460683  -2.408   0.0171 *
## castKane Hodder      0.306667   0.515059   0.595   0.5524
## castLloyd Kaufman    1.090667   0.460683   2.367   0.0190 *
## castMaria Olsen     -0.345556   0.493132  -0.701   0.4844
## castSuzi Lorraine   -0.242083   0.506948  -0.478   0.6336
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.411 on 173 degrees of freedom
## (21 observations deleted due to missingness)
## Multiple R-squared: 0.1603, Adjusted R-squared: 0.1166
## F-statistic: 3.67 on 9 and 173 DF, p-value: 0.0003167
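To see why the intercept is read as the reference actor's mean rating, here is a minimal sketch on made-up numbers (the `toy_lm` data frame and its three actors are hypothetical). With R's default treatment contrasts, `lm()` takes the first factor level as the baseline, so the intercept equals that group's mean and each remaining coefficient is a difference from it.

```r
# Toy data: three hypothetical actors with made-up ratings
toy_lm <- data.frame(
  cast = factor(rep(c("A", "B", "C"), each = 3)),
  review_rating = c(4, 5, 6, 2, 3, 4, 6, 7, 8)
)

fit <- lm(review_rating ~ cast, data = toy_lm)

# Group means, computed directly for comparison
mean_a <- mean(toy_lm$review_rating[toy_lm$cast == "A"])
mean_b <- mean(toy_lm$review_rating[toy_lm$cast == "B"])

# Intercept = mean of the first (reference) level, "A"
coef(fit)[["(Intercept)"]]
mean_a

# Coefficient for "B" = mean of "B" minus mean of the reference level
coef(fit)[["castB"]]
mean_b - mean_a

# relevel() picks a different baseline, which changes what the intercept means
toy_lm$cast <- relevel(toy_lm$cast, ref = "C")
fit_c <- lm(review_rating ~ cast, data = toy_lm)
coef(fit_c)[["(Intercept)"]]
```

In the model above, the same logic means the `castEric Roberts` estimate is the gap between Roberts' mean rating and Bill Moseley's, not Roberts' mean rating itself.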
Hint: You can choose any data you like, but you can’t take a dataset that another group has already taken.
library(tidyverse)
horror_movies <- read_csv("https://github.com/rfordatascience/tidytuesday/raw/master/data/2018/2018-10-23/movie_profit.csv")
Hint: Source and description of data, and definition of variables. The data come from the TidyTuesday repository (2018-10-23, movie_profit.csv). Each row is a movie, with its release date, title, production budget, domestic gross, worldwide gross, distributor, MPAA rating, and genre. The data is used to analyze movie profits.
Hint: Create at least two plots.
ggplot(horror_movies,
aes(x = mpaa_rating,
y = worldwide_gross)) +
geom_boxplot() +
labs(title = "Worldwide gross by MPAA rating")
ggplot(horror_movies,
aes(x = genre,
y = production_budget)) +
geom_point() +
labs(title = "Production budget by genre")
The first plot is a box plot of worldwide gross by MPAA rating. Movies rated PG and PG-13 tend to have the highest worldwide gross of the five rating groups. There are too few G-rated movies, and too few with a missing (NA) rating, to compare them reliably. R-rated movies also earn money, but less than PG and PG-13 movies.
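The reading of the box plot can be backed with a small summary table: how many movies each rating has (which shows whether a group is too small to compare) and the median gross and profit per rating. This is a sketch on made-up rows, not the real data. `toy_movies` is hypothetical, and `profit` is an assumed proxy defined here as worldwide gross minus production budget, which ignores marketing and other costs.

```r
library(dplyr)

# Made-up rows using the same column names as movie_profit.csv
toy_movies <- data.frame(
  mpaa_rating       = c("G", "PG", "PG", "PG-13", "PG-13", "R", "R", NA),
  production_budget = c(10, 50, 80, 100, 150, 20, 40, 5),
  worldwide_gross   = c(30, 200, 160, 400, 300, 60, 50, 6)
)

rating_summary <- toy_movies %>%
  # Assumed profit proxy: gross minus budget
  mutate(profit = worldwide_gross - production_budget) %>%
  group_by(mpaa_rating) %>%
  summarise(
    n_movies      = n(),
    median_gross  = median(worldwide_gross),
    median_profit = median(profit)
  ) %>%
  # Highest-grossing rating first
  arrange(desc(median_gross))

rating_summary
```

Swapping `toy_movies` for the data frame read from movie_profit.csv gives the actual counts, so the claim about too few G-rated movies can be verified rather than judged by eye.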
The second plot is a scatter plot of production budget by genre. Action and adventure movies have the largest budgets of the five genres. Comedy and horror are about even and have the smallest budgets. Drama falls in between.