Data was from: 1. https://grouplens.org/datasets/movielens/ 2. https://datasets.imdbws.com/
First, I had to upload the data:
setwd("/Users/thomgray/Downloads")
library(readr)
movies <- read.csv(file="MOVIESUPLOAD.csv", header = TRUE, sep = ",")
ratings <- read.csv(file="ratings.csv", header = TRUE, sep = ",")
links <- read.csv(file="links.csv", header = TRUE, sep = ",")
imdbratings <- read_tsv(file = "imdbratings.tsv")
## Parsed with column specification:
## cols(
## tconst = col_character(),
## averageRating = col_double(),
## numVotes = col_integer()
## )
Next, I wanted to check the dimensions of my data.
dim(movies)
## [1] 27278 3
dim(ratings)
## [1] 1048575 4
dim(links)
## [1] 27278 3
dim(imdbratings)
## [1] 825271 3
head(movies)
## movieId title
## 1 1 Toy Story (1995)
## 2 2 Jumanji (1995)
## 3 3 Grumpier Old Men (1995)
## 4 4 Waiting to Exhale (1995)
## 5 5 Father of the Bride Part II (1995)
## 6 6 Heat (1995)
## genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2 Adventure|Children|Fantasy
## 3 Comedy|Romance
## 4 Comedy|Drama|Romance
## 5 Comedy
## 6 Action|Crime|Thriller
head(links)
## movieId imdbId tmdbId
## 1 95541 5 16624
## 2 88674 8 105158
## 3 120869 10 774
## 4 98981 12 160
## 5 113048 14 82120
## 6 94431 192 159900
head(ratings)
## userId movieId rating timestamp
## 1 1 2 3.5 1112486027
## 2 1 29 3.5 1112484676
## 3 1 32 3.5 1112484819
## 4 1 47 3.5 1112484727
## 5 1 50 3.5 1112484580
## 6 1 112 3.5 1094785740
head(imdbratings)
## # A tibble: 6 x 3
## tconst averageRating numVotes
## <chr> <dbl> <int>
## 1 tt0000001 5.80 1367
## 2 tt0000002 6.50 160
## 3 tt0000003 6.60 949
## 4 tt0000004 6.40 96
## 5 tt0000005 6.20 1645
## 6 tt0000006 5.60 83
I noticed at this point that I would need to change the imdb movie ID to numeric and remove the first two characters in order to match the links data table.
I’ll first join the movies and links table together with a full join.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
movieslinks <- full_join(movies, links, by = "movieId")
movieslinks$tmdbId <- NULL
Like I mentioned before, I need to fix the IMDB Id before I join that table.
imdbratings$NewID <- substr(imdbratings$tconst, 3, 9)
imdbratings$NewID <- as.numeric(imdbratings$NewID)
imdbratings$tconst <- NULL
imdbratings$imdbId <- imdbratings$NewID
imdbratings$NewID <- NULL
Joining the IMDB data with movielinks
AlmostCompleteData <- left_join(movieslinks, imdbratings, by = "imdbId")
I do not like the naming convention for the IMDB average rating and number of votes, so I’m going to rename the columns.
head(AlmostCompleteData[,4:6])
## imdbId averageRating numVotes
## 1 114709 8.3 728760
## 2 113497 6.9 251035
## 3 113228 6.6 21270
## 4 114885 5.7 8206
## 5 113041 6.0 29500
## 6 113277 8.2 490797
AlmostCompleteData$IMDBAvgRating <- AlmostCompleteData$averageRating
AlmostCompleteData$IMDBNumVotes <- AlmostCompleteData$numVotes
AlmostCompleteData$averageRating <- NULL
AlmostCompleteData$numVotes <- NULL
head(AlmostCompleteData[,4:6])
## imdbId IMDBAvgRating IMDBNumVotes
## 1 114709 8.3 728760
## 2 113497 6.9 251035
## 3 113228 6.6 21270
## 4 114885 5.7 8206
## 5 113041 6.0 29500
## 6 113277 8.2 490797
Now, I will continue my data merging process by combining AlmostCompleteData with the ratings table.
CompleteData <- left_join(ratings, AlmostCompleteData, by = "movieId")
Basically, I want to make sure there are no NAs in my table.
sum(is.na(CompleteData))
## [1] 786
sum(is.na(CompleteData$IMDBAvgRating))
## [1] 393
sum(is.na(CompleteData$IMDBNumVotes))
## [1] 393
There are NAs. I will need to investigate more.
length(unique(NAValues$movieId))
## [1] 21
It is only 21 movies that account for the NAs. If I remove them I will still be comaring 27,257 movies between Movie Lens and IMDB.
## [1] 1048575 9
## [1] 1048182 9
I’m going to add a column for Movie Lens average rating:
CompleteData <- CompleteData %>%
group_by(movieId) %>%
mutate(LensAvgRating = ave(rating))
Also, going to add a new column for the count of unique ratings for Movie Lens:
CompleteData <- CompleteData %>%
group_by(movieId) %>%
mutate(LensNumVotes = length(unique(userId)))
## # A tibble: 6 x 3
## # Groups: movieId [6]
## movieId LensAvgRating LensNumVotes
## <int> <dbl> <int>
## 1 2 3.27 1155
## 2 29 4.00 448
## 3 32 3.90 2312
## 4 47 4.08 2241
## 5 50 4.37 2490
## 6 112 3.37 627
I’ve completed combining and cleaning my data. The 4 Variables that I will do my exploratory analysis on are: - The average rating from IMDB, IMDBAvgRating - The number of votes from IMDB, IMDBNumVotes - The average rating from MovieLens, LensAvgRating - The number of votes from Lens, LensNumVotes
I want to Compare IMDB’s Ratings with Movie Lens:
quantile(CompleteData$IMDBAvgRating)
## 0% 25% 50% 75% 100%
## 1.6 6.6 7.3 7.9 9.5
quantile(CompleteData$LensAvgRatingCOMP)
## 0% 25% 50% 75% 100%
## 1.000000 6.500000 7.218750 7.764268 10.000000
## Loading required package: sm
## Package 'sm', version 2.2-5.4: type help(sm) for summary information
## Warning: weights overwritten by binning
There are a lot of outliers below the 25 quartile. This makes me think that people actually tend to review a movie if they really did not like it. Hypothesis: the more reviews a movie has the worst the avg rating will be.
There are outliers above the 75th quartile in both boxplots. It appears there are some movies that have signifigantly more reviews than others. There are a lot of movies made each year, so these could represent the more well known films.
According to the geom_line on this graph my hypothesis was wrong. There is a positive correlation between the number of movie reviews and the avg rating on Movie Lens. I’m going to confirm with a formal regression:
##
## Call:
## lm(formula = CompleteData$LensAvgRating ~ CompleteData$LensNumVotes)
##
## Coefficients:
## (Intercept) CompleteData$LensNumVotes
## 3.3462998 0.0002574
From the formal regression we can confirm that there is a positive correlation.
##
## Call:
## lm(formula = CompleteData$LensAvgRating ~ CompleteData$LensNumVotes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8473 -0.2601 0.0653 0.3167 1.6534
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.346e+00 6.070e-04 5512.8 <2e-16 ***
## CompleteData$LensNumVotes 2.574e-04 5.926e-07 434.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4469 on 1048180 degrees of freedom
## Multiple R-squared: 0.1525, Adjusted R-squared: 0.1525
## F-statistic: 1.887e+05 on 1 and 1048180 DF, p-value: < 2.2e-16
Based on the P Values in the summary we can see the correlation is statistically signifigant.
My hypothesis was wrong again. There is a positive correlation between the number of movie reviews and the avg rating on IMDB. I’m going to reaffirm this with a formal regression:
##
## Call:
## lm(formula = CompleteData$IMDBAvgRating ~ CompleteData$IMDBNumVotes)
##
## Coefficients:
## (Intercept) CompleteData$IMDBNumVotes
## 6.777e+00 1.784e-06
##
## Call:
## lm(formula = CompleteData$IMDBAvgRating ~ CompleteData$IMDBNumVotes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1786 -0.4123 0.1143 0.5391 2.5960
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.777e+00 9.561e-04 7087.9 <2e-16 ***
## CompleteData$IMDBNumVotes 1.784e-06 2.413e-09 739.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7781 on 1048180 degrees of freedom
## Multiple R-squared: 0.3427, Adjusted R-squared: 0.3427
## F-statistic: 5.466e+05 on 1 and 1048180 DF, p-value: < 2.2e-16
Based on the P Values in the summary we can see the correlation is statistically signifigant.
The graph shows that the IMDB rating and Movie Lens rating are highly correlated.
##
## Call:
## lm(formula = CompleteData$IMDBAvgRating ~ CompleteData$LensAvgRatingCOMP)
##
## Coefficients:
## (Intercept) CompleteData$LensAvgRatingCOMP
## 0.8749 0.8968
There is a positive correlation in the formal regression.
##
## Call:
## lm(formula = CompleteData$IMDBAvgRating ~ CompleteData$LensAvgRatingCOMP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0433 -0.2312 0.0124 0.2305 6.6283
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.874859 0.002893 302.4 <2e-16 ***
## CompleteData$LensAvgRatingCOMP 0.896842 0.000406 2208.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4036 on 1048180 degrees of freedom
## Multiple R-squared: 0.8231, Adjusted R-squared: 0.8231
## F-statistic: 4.879e+06 on 1 and 1048180 DF, p-value: < 2.2e-16
And, the low P Values validate that the regression is signifigant.
Taking a look at the current genre column I realize that there are numerous combinations of genres that a movie can have.
head(CompleteData$genres)
## [1] Adventure|Children|Fantasy
## [2] Adventure|Drama|Fantasy|Mystery|Sci-Fi
## [3] Mystery|Sci-Fi|Thriller
## [4] Mystery|Thriller
## [5] Crime|Mystery|Thriller
## [6] Action|Adventure|Comedy|Crime
## 1342 Levels: (no genres listed) Action ... Western
The exact amount is:
length(unique(CompleteData$genres))
## [1] 1062
1062…
So I need to: 1) Split up these Genres into separate columns 2) Combine them into one again 3) Clean the column 4) Then analyze and find the average rating by genre
First I need to split up the Genre column
CompleteData$genres <- strsplit(CompleteData$genres, split = '|', fixed = TRUE)
## [[1]]
## [1] "Adventure" "Children" "Fantasy"
##
## [[2]]
## [1] "Adventure" "Drama" "Fantasy" "Mystery" "Sci-Fi"
##
## [[3]]
## [1] "Mystery" "Sci-Fi" "Thriller"
##
## [[4]]
## [1] "Mystery" "Thriller"
##
## [[5]]
## [1] "Crime" "Mystery" "Thriller"
##
## [[6]]
## [1] "Action" "Adventure" "Comedy" "Crime"
Then, split the genres into separate columns
## Warning: Expected 10 pieces. Missing pieces filled with `NA` in 1048174
## rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
## 20, ...].
## # A tibble: 6 x 5
## `Genre 1` `Genre 2` `Genre 3` `Genre 4` `Genre 5`
## <chr> <chr> <chr> <chr> <chr>
## 1 "c(\"Adventure\"" " \"Children\"" " \"Fantasy\")" <NA> <NA>
## 2 "c(\"Adventure\"" " \"Drama\"" " \"Fantasy\"" " \"Myste… " \"Sci-…
## 3 "c(\"Mystery\"" " \"Sci-Fi\"" " \"Thriller\")" <NA> <NA>
## 4 "c(\"Mystery\"" " \"Thriller\")" <NA> <NA> <NA>
## 5 "c(\"Crime\"" " \"Mystery\"" " \"Thriller\")" <NA> <NA>
## 6 "c(\"Action\"" " \"Adventure\"" " \"Comedy\"" " \"Crime… <NA>
Next, I’ll use the gather function to collect the various genres and put into one column as individual entries.
CompleteDataMergedGenres <- CompleteDataSeparateGenres %>% gather(GenreLevel, Genre, c(`Genre 1`, `Genre 2`, `Genre 3`, `Genre 4`, `Genre 5`, `Genre 6`, `Genre 7`, `Genre 8`, `Genre 9`, `Genre 10`))
However, my data is extremely messy
head(CompleteDataMergedGenres$Genre)
## [1] "c(\"Adventure\"" "c(\"Adventure\"" "c(\"Mystery\"" "c(\"Mystery\""
## [5] "c(\"Crime\"" "c(\"Action\""
How many genres are there?
length(unique(CompleteDataMergedGenres$Genre))
## [1] 73
There are 73 genres, but I need 19. This means I have to go in and clean the data in that column for my analysis to work.
These are the various Genre names in my messy data:
## [1] "c(\"Adventure\"" "c(\"Mystery\"" "c(\"Crime\""
## [4] "c(\"Action\"" "Comedy" "c(\"Drama\""
## [7] "c(\"Comedy\"" "Drama" "c(\"Children\""
## [10] "c(\"Horror\"" "Horror" "c(\"Animation\""
## [13] "Thriller" "Western" "c(\"Fantasy\""
## [16] "c(\"Sci-Fi\"" "Sci-Fi" "c(\"Documentary\""
## [19] "c(\"Musical\"" "Romance" "Documentary"
## [22] "c(\"Romance\"" "Crime" "Action"
## [25] "c(\"Film-Noir\"" "Mystery" "Adventure"
## [28] "Animation" "War" "Musical"
## [31] "Children" "Film-Noir" "Fantasy"
## [34] "c(\"Thriller\"" "IMAX" "(no genres listed)"
## [37] "c(\"War\"" " \"Children\"" " \"Drama\""
## [40] " \"Sci-Fi\"" " \"Thriller\")" " \"Mystery\""
## [43] " \"Adventure\"" NA " \"Horror\")"
## [46] " \"Crime\"" " \"Drama\")" " \"Comedy\""
## [49] " \"Sci-Fi\")" " \"Horror\"" " \"Crime\")"
## [52] " \"War\")" " \"Adventure\")" " \"Western\")"
## [55] " \"Fantasy\")" " \"Fantasy\"" " \"Mystery\")"
## [58] " \"Animation\"" " \"Comedy\")" " \"Romance\")"
## [61] " \"Musical\")" " \"Romance\"" " \"Film-Noir\""
## [64] " \"Thriller\"" " \"Musical\"" " \"Children\")"
## [67] " \"Animation\")" " \"IMAX\")" " \"Documentary\")"
## [70] " \"Documentary\"" " \"Film-Noir\")" " \"War\""
## [73] " \"Western\""
This is the method that I used to clean my messy data. I did this for all the wrong data entries in the genre column, but will only show 4 examples. I attempted to use OpenRefine to clean, but because of my large dataset it would take over 4 hours for my data to load.
CompleteDataMergedGenres$Genre[CompleteDataMergedGenres$Genre == "c(\"Adventure\""] <- "Adventure"
CompleteDataMergedGenres$Genre[CompleteDataMergedGenres$Genre == " \"Drama\")"] <- "Drama"
CompleteDataMergedGenres$Genre[CompleteDataMergedGenres$Genre == "c(\"Mystery\""] <- "Mystery"
CompleteDataMergedGenres$Genre[CompleteDataMergedGenres$Genre == "c(\"Crime\"" ] <- "Crime"
Moment of truth..
length(unique(CompleteDataMergedGenres$Genre))
## [1] 21
There are 21. At first I was confused but then realized it makes sense because ‘No Genres Listed’ and NA values make the total 21.
I belive this hypothesis to be true, because Best Picture Winners tend to be dramas.
Caculations to find the average rating by genre:
CompleteDataMergedGenres <- CompleteDataMergedGenres %>%
group_by(Genre) %>%
mutate(IMDBGenreAvgRating = ave(IMDBAvgRating))
CompleteDataMergedGenres <- CompleteDataMergedGenres %>%
group_by(Genre) %>%
mutate(MovieLensGenreAvgRating = ave(LensAvgRating))
Here I am compiling a list of the genres.
FinalGenreList <- unique(CompleteDataMergedGenres$Genre)
FinalGenreList <- sort(FinalGenreList)
FinalGenreList
## [1] "(no genres listed)" "Action" "Adventure"
## [4] "Animation" "Children" "Comedy"
## [7] "Crime" "Documentary" "Drama"
## [10] "Fantasy" "Film-Noir" "Horror"
## [13] "IMAX" "Musical" "Mystery"
## [16] "Romance" "Sci-Fi" "Thriller"
## [19] "War" "Western"
Here I am compiling a list of the average IMDB scores by genre in the FinalGenreList order.
IMDBGenreAvgListV2 <- c(7.00, 7.10, 7.23, 7.55, 7.12, 6.97, 7.51, 7.64, 7.47,
7.24, 7.90, 6.90, 7.60, 7.44, 7.46, 7.12, 7.13, 7.18,
7.81, 7.39)
Here I create the data frame that I will use to construct my graph.
IMDBGenreDF <- data.frame(FinalGenreList, IMDBGenreAvgListV2)
colnames(IMDBGenreDF) <- c("Genre", "IMDB Avg Rating")
## Genre IMDB Avg Rating
## 1 (no genres listed) 7.00
## 2 Action 7.10
## 3 Adventure 7.23
## 4 Animation 7.55
## 5 Children 7.12
## 6 Comedy 6.97
I want to change the order of the ratings. It should be an ascedning graph bar graph, so that it is easy to read.
IMDBGenreDF$Genre <- factor(IMDBGenreDF$Genre, levels = IMDBGenreDF$Genre[order(IMDBGenreDF$`IMDB Avg Rating`)])
ggplot(data = IMDBGenreDF, mapping =
aes(x = IMDBGenreDF$Genre, y = IMDBGenreDF$`IMDB Avg Rating`,
label = IMDBGenreDF$`IMDB Avg Rating`)) +
geom_col(fill = "Steel Blue") + xlab("Genre") + ylab("Avg Rating - IMDB") +
ggtitle("Comparing IMDB Ratings by Genre")+
theme(text = element_text(size=20),
axis.text.x = element_text(angle=90, hjust=1)) + geom_text(nudge_y = 1, size = 2)
Interesting, in the IMDB database Film Noir has the highest average rating. Horror is the worst rated genre. Drama is 7th, which I’m very surprised about.
Here I am compiling a list of the average Movie Lens scores by genre in the FinalGenreList order.
MovieLensGenreAvgList <- c(3.07, 3.45, 3.51, 3.61, 3.41, 3.43, 3.68, 3.76,
3.68, 3.51, 3.96, 3.26, 3.66, 3.55, 3.66, 3.55,
3.43, 3.51, 3.82, 3.57)
Realized I want the scores to be on a 10 point scale.
MovieLensGenreAvgList <- MovieLensGenreAvgList * 2
Follow the same steps for creating the Movie Lens Genre Data Frame that I did with IMDB.
MovieLensGenreDF <- data.frame(FinalGenreList, MovieLensGenreAvgList)
colnames(MovieLensGenreDF) <- c("Genre", "Movie Lens Avg Rating")
MovieLensGenreDF$Genre <- factor(MovieLensGenreDF$Genre, levels = MovieLensGenreDF$Genre[order(MovieLensGenreDF$`Movie Lens Avg Rating`)])
head(MovieLensGenreDF)
## Genre Movie Lens Avg Rating
## 1 (no genres listed) 6.14
## 2 Action 6.90
## 3 Adventure 7.02
## 4 Animation 7.22
## 5 Children 6.82
## 6 Comedy 6.86
ggplot(data = MovieLensGenreDF, mapping =
aes(x = MovieLensGenreDF$Genre, y = MovieLensGenreDF$`Movie Lens Avg Rating`,
label = MovieLensGenreDF$`Movie Lens Avg Rating`)) +
geom_col(fill = "Steel Blue") + xlab("Genre") + ylab("Avg Rating - Movie Lens") +
ggtitle("Comparing Movie Lens Ratings by Genre")+
theme(text = element_text(size=20),
axis.text.x = element_text(angle=90, hjust=1)) + geom_text(nudge_y = 1, size = 2)
Again, Film Noir was the highest genre and Horror was the lowest. But in Movie Lens Genre was the 4th highest rated Genre.