ITP Final Project: Movie Lens & IMDB Data Analysis

The Assignment

Combine, restructure and/or augment your database using dplyr()
Conduct an exploratory data analysis, with summary statistics and base plots
Visualize key features and relations using ggplot()
Incorporate into your analysis an R feature or package NOT covered in class
Build one or more supervised learning models and interpret results.

Data was from: 1. https://grouplens.org/datasets/movielens/ 2. https://datasets.imdbws.com/

First, I had to upload the data:

setwd("/Users/thomgray/Downloads")
library(readr)
movies <- read.csv(file="MOVIESUPLOAD.csv", header = TRUE, sep = ",")
ratings <- read.csv(file="ratings.csv", header = TRUE, sep = ",")
links <- read.csv(file="links.csv", header = TRUE, sep = ",")
imdbratings <- read_tsv(file = "imdbratings.tsv")

## Parsed with column specification:
## cols(
##   tconst = col_character(),
##   averageRating = col_double(),
##   numVotes = col_integer()
## )

Next, I wanted to check the dimensions of my data.

dim(movies)

## [1] 27278     3

dim(ratings)

## [1] 1048575       4

dim(links)

## [1] 27278     3

dim(imdbratings)

## [1] 825271      3

Initial Observations:

movies and links have the same number of rows and can easily be matched up.
ratings has multiple rows for each movie lens reviewed film.
imdb appears to have a larger movie data base.

head(movies)

##   movieId                              title
## 1       1                   Toy Story (1995)
## 2       2                     Jumanji (1995)
## 3       3            Grumpier Old Men (1995)
## 4       4           Waiting to Exhale (1995)
## 5       5 Father of the Bride Part II (1995)
## 6       6                        Heat (1995)
##                                        genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2                  Adventure|Children|Fantasy
## 3                              Comedy|Romance
## 4                        Comedy|Drama|Romance
## 5                                      Comedy
## 6                       Action|Crime|Thriller

head(links)

##   movieId imdbId tmdbId
## 1   95541      5  16624
## 2   88674      8 105158
## 3  120869     10    774
## 4   98981     12    160
## 5  113048     14  82120
## 6   94431    192 159900

head(ratings)

##   userId movieId rating  timestamp
## 1      1       2    3.5 1112486027
## 2      1      29    3.5 1112484676
## 3      1      32    3.5 1112484819
## 4      1      47    3.5 1112484727
## 5      1      50    3.5 1112484580
## 6      1     112    3.5 1094785740

head(imdbratings)

## # A tibble: 6 x 3
##   tconst    averageRating numVotes
##   <chr>             <dbl>    <int>
## 1 tt0000001          5.80     1367
## 2 tt0000002          6.50      160
## 3 tt0000003          6.60      949
## 4 tt0000004          6.40       96
## 5 tt0000005          6.20     1645
## 6 tt0000006          5.60       83

I noticed at this point that I would need to change the imdb movie ID to numeric and remove the first two characters in order to match the links data table.

Joining My Data Together

I’ll first join the movies and links table together with a full join.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

movieslinks <- full_join(movies, links, by = "movieId")
movieslinks$tmdbId <- NULL

Like I mentioned before, I need to fix the IMDB Id before I join that table.

imdbratings$NewID <- substr(imdbratings$tconst, 3, 9)
imdbratings$NewID <- as.numeric(imdbratings$NewID)
imdbratings$tconst <- NULL
imdbratings$imdbId <- imdbratings$NewID
imdbratings$NewID <- NULL

Joining the IMDB data with movielinks

AlmostCompleteData <- left_join(movieslinks, imdbratings, by = "imdbId")

I do not like the naming convention for the IMDB average rating and number of votes, so I’m going to rename the columns.

Renaming Columns of Newly Added Data

head(AlmostCompleteData[,4:6])

##   imdbId averageRating numVotes
## 1 114709           8.3   728760
## 2 113497           6.9   251035
## 3 113228           6.6    21270
## 4 114885           5.7     8206
## 5 113041           6.0    29500
## 6 113277           8.2   490797

AlmostCompleteData$IMDBAvgRating <- AlmostCompleteData$averageRating
AlmostCompleteData$IMDBNumVotes <- AlmostCompleteData$numVotes
AlmostCompleteData$averageRating <- NULL
AlmostCompleteData$numVotes <- NULL 
head(AlmostCompleteData[,4:6])

##   imdbId IMDBAvgRating IMDBNumVotes
## 1 114709           8.3       728760
## 2 113497           6.9       251035
## 3 113228           6.6        21270
## 4 114885           5.7         8206
## 5 113041           6.0        29500
## 6 113277           8.2       490797

Now, I will continue my data merging process by combining AlmostCompleteData with the ratings table.

CompleteData <- left_join(ratings, AlmostCompleteData, by = "movieId")

Check for Mistakes

Basically, I want to make sure there are no NAs in my table.

sum(is.na(CompleteData))

## [1] 786

sum(is.na(CompleteData$IMDBAvgRating))

## [1] 393

sum(is.na(CompleteData$IMDBNumVotes))

## [1] 393

There are NAs. I will need to investigate more.

length(unique(NAValues$movieId))

## [1] 21

It is only 21 movies that account for the NAs. If I remove them I will still be comaring 27,257 movies between Movie Lens and IMDB.

## [1] 1048575       9

## [1] 1048182       9

Manipulating the Data

I’m going to add a column for Movie Lens average rating:

CompleteData <- CompleteData %>% 
  group_by(movieId) %>%
  mutate(LensAvgRating = ave(rating))

Also, going to add a new column for the count of unique ratings for Movie Lens:

CompleteData <- CompleteData %>% 
  group_by(movieId) %>%
  mutate(LensNumVotes = length(unique(userId)))

## # A tibble: 6 x 3
## # Groups:   movieId [6]
##   movieId LensAvgRating LensNumVotes
##     <int>         <dbl>        <int>
## 1       2          3.27         1155
## 2      29          4.00          448
## 3      32          3.90         2312
## 4      47          4.08         2241
## 5      50          4.37         2490
## 6     112          3.37          627

Exploratory Data Analysis

I’ve completed combining and cleaning my data. The 4 Variables that I will do my exploratory analysis on are: - The average rating from IMDB, IMDBAvgRating - The number of votes from IMDB, IMDBNumVotes - The average rating from MovieLens, LensAvgRating - The number of votes from Lens, LensNumVotes

I want to Compare IMDB’s Ratings with Movie Lens:

quantile(CompleteData$IMDBAvgRating)

##   0%  25%  50%  75% 100% 
##  1.6  6.6  7.3  7.9  9.5

quantile(CompleteData$LensAvgRatingCOMP)

##        0%       25%       50%       75%      100% 
##  1.000000  6.500000  7.218750  7.764268 10.000000

## Loading required package: sm

## Package 'sm', version 2.2-5.4: type help(sm) for summary information

## Warning: weights overwritten by binning

There are a lot of outliers below the 25 quartile. This makes me think that people actually tend to review a movie if they really did not like it. Hypothesis: the more reviews a movie has the worst the avg rating will be.

IMDB Number of Votes

Movie Lens Number of Votes

Analysis of Number of Votes

There are outliers above the 75th quartile in both boxplots. It appears there are some movies that have signifigantly more reviews than others. There are a lot of movies made each year, so these could represent the more well known films.

Questions I Want to Answer

Does a high number of votes lead to a lower rating? (In both IMDB and Movie Lens)
Is a high rating in Movie Lens correlated with a high rating in IMDB?
Which Genre has the highest and worst rating?

According to the geom_line on this graph my hypothesis was wrong. There is a positive correlation between the number of movie reviews and the avg rating on Movie Lens. I’m going to confirm with a formal regression:

## 
## Call:
## lm(formula = CompleteData$LensAvgRating ~ CompleteData$LensNumVotes)
## 
## Coefficients:
##               (Intercept)  CompleteData$LensNumVotes  
##                 3.3462998                  0.0002574

From the formal regression we can confirm that there is a positive correlation.

## 
## Call:
## lm(formula = CompleteData$LensAvgRating ~ CompleteData$LensNumVotes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8473 -0.2601  0.0653  0.3167  1.6534 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               3.346e+00  6.070e-04  5512.8   <2e-16 ***
## CompleteData$LensNumVotes 2.574e-04  5.926e-07   434.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4469 on 1048180 degrees of freedom
## Multiple R-squared:  0.1525, Adjusted R-squared:  0.1525 
## F-statistic: 1.887e+05 on 1 and 1048180 DF,  p-value: < 2.2e-16

Based on the P Values in the summary we can see the correlation is statistically signifigant.

My hypothesis was wrong again. There is a positive correlation between the number of movie reviews and the avg rating on IMDB. I’m going to reaffirm this with a formal regression:

## 
## Call:
## lm(formula = CompleteData$IMDBAvgRating ~ CompleteData$IMDBNumVotes)
## 
## Coefficients:
##               (Intercept)  CompleteData$IMDBNumVotes  
##                 6.777e+00                  1.784e-06

## 
## Call:
## lm(formula = CompleteData$IMDBAvgRating ~ CompleteData$IMDBNumVotes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1786 -0.4123  0.1143  0.5391  2.5960 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               6.777e+00  9.561e-04  7087.9   <2e-16 ***
## CompleteData$IMDBNumVotes 1.784e-06  2.413e-09   739.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7781 on 1048180 degrees of freedom
## Multiple R-squared:  0.3427, Adjusted R-squared:  0.3427 
## F-statistic: 5.466e+05 on 1 and 1048180 DF,  p-value: < 2.2e-16

Based on the P Values in the summary we can see the correlation is statistically signifigant.

The graph shows that the IMDB rating and Movie Lens rating are highly correlated.

## 
## Call:
## lm(formula = CompleteData$IMDBAvgRating ~ CompleteData$LensAvgRatingCOMP)
## 
## Coefficients:
##                    (Intercept)  CompleteData$LensAvgRatingCOMP  
##                         0.8749                          0.8968

There is a positive correlation in the formal regression.

## 
## Call:
## lm(formula = CompleteData$IMDBAvgRating ~ CompleteData$LensAvgRatingCOMP)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0433 -0.2312  0.0124  0.2305  6.6283 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    0.874859   0.002893   302.4   <2e-16 ***
## CompleteData$LensAvgRatingCOMP 0.896842   0.000406  2208.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4036 on 1048180 degrees of freedom
## Multiple R-squared:  0.8231, Adjusted R-squared:  0.8231 
## F-statistic: 4.879e+06 on 1 and 1048180 DF,  p-value: < 2.2e-16

And, the low P Values validate that the regression is signifigant.

Finding the Average Rating by Movie Genre

Taking a look at the current genre column I realize that there are numerous combinations of genres that a movie can have.

head(CompleteData$genres)

## [1] Adventure|Children|Fantasy            
## [2] Adventure|Drama|Fantasy|Mystery|Sci-Fi
## [3] Mystery|Sci-Fi|Thriller               
## [4] Mystery|Thriller                      
## [5] Crime|Mystery|Thriller                
## [6] Action|Adventure|Comedy|Crime         
## 1342 Levels: (no genres listed) Action ... Western

The exact amount is:

length(unique(CompleteData$genres))

## [1] 1062

1062…

So I need to: 1) Split up these Genres into separate columns 2) Combine them into one again 3) Clean the column 4) Then analyze and find the average rating by genre

First I need to split up the Genre column

CompleteData$genres <- strsplit(CompleteData$genres, split = '|', fixed = TRUE)

## [[1]]
## [1] "Adventure" "Children"  "Fantasy"  
## 
## [[2]]
## [1] "Adventure" "Drama"     "Fantasy"   "Mystery"   "Sci-Fi"   
## 
## [[3]]
## [1] "Mystery"  "Sci-Fi"   "Thriller"
## 
## [[4]]
## [1] "Mystery"  "Thriller"
## 
## [[5]]
## [1] "Crime"    "Mystery"  "Thriller"
## 
## [[6]]
## [1] "Action"    "Adventure" "Comedy"    "Crime"

Then, split the genres into separate columns

## Warning: Expected 10 pieces. Missing pieces filled with `NA` in 1048174
## rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
## 20, ...].

## # A tibble: 6 x 5
##   `Genre 1`         `Genre 2`        `Genre 3`        `Genre 4`  `Genre 5`
##   <chr>             <chr>            <chr>            <chr>      <chr>    
## 1 "c(\"Adventure\"" " \"Children\""  " \"Fantasy\")"  <NA>       <NA>     
## 2 "c(\"Adventure\"" " \"Drama\""     " \"Fantasy\""   " \"Myste… " \"Sci-…
## 3 "c(\"Mystery\""   " \"Sci-Fi\""    " \"Thriller\")" <NA>       <NA>     
## 4 "c(\"Mystery\""   " \"Thriller\")" <NA>             <NA>       <NA>     
## 5 "c(\"Crime\""     " \"Mystery\""   " \"Thriller\")" <NA>       <NA>     
## 6 "c(\"Action\""    " \"Adventure\"" " \"Comedy\""    " \"Crime… <NA>

Next, I’ll use the gather function to collect the various genres and put into one column as individual entries.

CompleteDataMergedGenres <- CompleteDataSeparateGenres %>% gather(GenreLevel, Genre, c(`Genre 1`, `Genre 2`, `Genre 3`, `Genre 4`, `Genre 5`, `Genre 6`, `Genre 7`, `Genre 8`, `Genre 9`, `Genre 10`))

However, my data is extremely messy

head(CompleteDataMergedGenres$Genre)

## [1] "c(\"Adventure\"" "c(\"Adventure\"" "c(\"Mystery\""   "c(\"Mystery\""  
## [5] "c(\"Crime\""     "c(\"Action\""

How many genres are there?

length(unique(CompleteDataMergedGenres$Genre))

## [1] 73

There are 73 genres, but I need 19. This means I have to go in and clean the data in that column for my analysis to work.

These are the various Genre names in my messy data:

##  [1] "c(\"Adventure\""    "c(\"Mystery\""      "c(\"Crime\""       
##  [4] "c(\"Action\""       "Comedy"             "c(\"Drama\""       
##  [7] "c(\"Comedy\""       "Drama"              "c(\"Children\""    
## [10] "c(\"Horror\""       "Horror"             "c(\"Animation\""   
## [13] "Thriller"           "Western"            "c(\"Fantasy\""     
## [16] "c(\"Sci-Fi\""       "Sci-Fi"             "c(\"Documentary\"" 
## [19] "c(\"Musical\""      "Romance"            "Documentary"       
## [22] "c(\"Romance\""      "Crime"              "Action"            
## [25] "c(\"Film-Noir\""    "Mystery"            "Adventure"         
## [28] "Animation"          "War"                "Musical"           
## [31] "Children"           "Film-Noir"          "Fantasy"           
## [34] "c(\"Thriller\""     "IMAX"               "(no genres listed)"
## [37] "c(\"War\""          " \"Children\""      " \"Drama\""        
## [40] " \"Sci-Fi\""        " \"Thriller\")"     " \"Mystery\""      
## [43] " \"Adventure\""     NA                   " \"Horror\")"      
## [46] " \"Crime\""         " \"Drama\")"        " \"Comedy\""       
## [49] " \"Sci-Fi\")"       " \"Horror\""        " \"Crime\")"       
## [52] " \"War\")"          " \"Adventure\")"    " \"Western\")"     
## [55] " \"Fantasy\")"      " \"Fantasy\""       " \"Mystery\")"     
## [58] " \"Animation\""     " \"Comedy\")"       " \"Romance\")"     
## [61] " \"Musical\")"      " \"Romance\""       " \"Film-Noir\""    
## [64] " \"Thriller\""      " \"Musical\""       " \"Children\")"    
## [67] " \"Animation\")"    " \"IMAX\")"         " \"Documentary\")" 
## [70] " \"Documentary\""   " \"Film-Noir\")"    " \"War\""          
## [73] " \"Western\""

This is the method that I used to clean my messy data. I did this for all the wrong data entries in the genre column, but will only show 4 examples. I attempted to use OpenRefine to clean, but because of my large dataset it would take over 4 hours for my data to load.

CompleteDataMergedGenres$Genre[CompleteDataMergedGenres$Genre == "c(\"Adventure\""] <- "Adventure"
CompleteDataMergedGenres$Genre[CompleteDataMergedGenres$Genre == " \"Drama\")"] <- "Drama" 
CompleteDataMergedGenres$Genre[CompleteDataMergedGenres$Genre == "c(\"Mystery\""] <- "Mystery"
CompleteDataMergedGenres$Genre[CompleteDataMergedGenres$Genre == "c(\"Crime\"" ] <- "Crime"

Moment of truth..

length(unique(CompleteDataMergedGenres$Genre))

## [1] 21

There are 21. At first I was confused but then realized it makes sense because ‘No Genres Listed’ and NA values make the total 21.

HYPOTHESIS: Drama films will have the highest average rating.

I belive this hypothesis to be true, because Best Picture Winners tend to be dramas.

Caculations to find the average rating by genre:

CompleteDataMergedGenres <- CompleteDataMergedGenres %>% 
  group_by(Genre) %>%
  mutate(IMDBGenreAvgRating = ave(IMDBAvgRating))

CompleteDataMergedGenres <- CompleteDataMergedGenres %>% 
  group_by(Genre) %>%
  mutate(MovieLensGenreAvgRating = ave(LensAvgRating))

Here I am compiling a list of the genres.

FinalGenreList <- unique(CompleteDataMergedGenres$Genre)
FinalGenreList <- sort(FinalGenreList)
FinalGenreList

##  [1] "(no genres listed)" "Action"             "Adventure"         
##  [4] "Animation"          "Children"           "Comedy"            
##  [7] "Crime"              "Documentary"        "Drama"             
## [10] "Fantasy"            "Film-Noir"          "Horror"            
## [13] "IMAX"               "Musical"            "Mystery"           
## [16] "Romance"            "Sci-Fi"             "Thriller"          
## [19] "War"                "Western"

Here I am compiling a list of the average IMDB scores by genre in the FinalGenreList order.

IMDBGenreAvgListV2 <- c(7.00, 7.10, 7.23, 7.55, 7.12, 6.97, 7.51, 7.64, 7.47, 
                        7.24, 7.90, 6.90, 7.60, 7.44, 7.46, 7.12, 7.13, 7.18, 
                        7.81, 7.39)

Here I create the data frame that I will use to construct my graph.

IMDBGenreDF <- data.frame(FinalGenreList, IMDBGenreAvgListV2)
colnames(IMDBGenreDF) <- c("Genre", "IMDB Avg Rating")

##                Genre IMDB Avg Rating
## 1 (no genres listed)            7.00
## 2             Action            7.10
## 3          Adventure            7.23
## 4          Animation            7.55
## 5           Children            7.12
## 6             Comedy            6.97

I want to change the order of the ratings. It should be an ascedning graph bar graph, so that it is easy to read.

IMDBGenreDF$Genre <- factor(IMDBGenreDF$Genre, levels = IMDBGenreDF$Genre[order(IMDBGenreDF$`IMDB Avg Rating`)])

ggplot(data = IMDBGenreDF, mapping = 
         aes(x = IMDBGenreDF$Genre, y = IMDBGenreDF$`IMDB Avg Rating`, 
             label = IMDBGenreDF$`IMDB Avg Rating`)) +
  geom_col(fill = "Steel Blue") + xlab("Genre") + ylab("Avg Rating - IMDB") +
  ggtitle("Comparing IMDB Ratings by Genre")+
  theme(text = element_text(size=20),
        axis.text.x = element_text(angle=90, hjust=1)) + geom_text(nudge_y = 1, size = 2)

Interesting, in the IMDB database Film Noir has the highest average rating. Horror is the worst rated genre. Drama is 7th, which I’m very surprised about.

Now I’m going to conduct the same analysis for the ratings in Movie Lens

Here I am compiling a list of the average Movie Lens scores by genre in the FinalGenreList order.

MovieLensGenreAvgList <- c(3.07, 3.45, 3.51, 3.61, 3.41, 3.43, 3.68, 3.76, 
                           3.68, 3.51, 3.96, 3.26, 3.66, 3.55, 3.66, 3.55, 
                           3.43, 3.51, 3.82, 3.57)

Realized I want the scores to be on a 10 point scale.

MovieLensGenreAvgList <- MovieLensGenreAvgList * 2

Follow the same steps for creating the Movie Lens Genre Data Frame that I did with IMDB.

MovieLensGenreDF <- data.frame(FinalGenreList, MovieLensGenreAvgList)
colnames(MovieLensGenreDF) <- c("Genre", "Movie Lens Avg Rating")
MovieLensGenreDF$Genre <- factor(MovieLensGenreDF$Genre, levels = MovieLensGenreDF$Genre[order(MovieLensGenreDF$`Movie Lens Avg Rating`)])
head(MovieLensGenreDF)

##                Genre Movie Lens Avg Rating
## 1 (no genres listed)                  6.14
## 2             Action                  6.90
## 3          Adventure                  7.02
## 4          Animation                  7.22
## 5           Children                  6.82
## 6             Comedy                  6.86

ggplot(data = MovieLensGenreDF, mapping = 
         aes(x = MovieLensGenreDF$Genre, y = MovieLensGenreDF$`Movie Lens Avg Rating`, 
             label = MovieLensGenreDF$`Movie Lens Avg Rating`)) +
  geom_col(fill = "Steel Blue") + xlab("Genre") + ylab("Avg Rating - Movie Lens") +
  ggtitle("Comparing Movie Lens Ratings by Genre")+
  theme(text = element_text(size=20),
        axis.text.x = element_text(angle=90, hjust=1)) + geom_text(nudge_y = 1, size = 2)

Analysis

Again, Film Noir was the highest genre and Horror was the lowest. But in Movie Lens Genre was the 4th highest rated Genre.