ITP Final Project: Movie Lens Data

T.J. Gray
04/22/2018

The Assignment

Combine, restructure and/or augment your database using dplyr()
Conduct an exploratory data analysis, with summary statistics and base plots
Visualize key features and relations using ggplot()
Incorporate into your analysis an R feature or package NOT covered in class
Build one or more supervised learning models and interpret results.

Uploading My Data

setwd("/Users/thomgray/Downloads")
library(readr)
movies <- read.csv(file="MOVIESUPLOAD.csv", header = TRUE, sep = ",")
ratings <- read.csv(file="ratings.csv", header = TRUE, sep = ",")
links <- read.csv(file="links.csv", header = TRUE, sep = ",")
imdbratings <- read_tsv(file = "imdbratings.tsv")

Data Dimensions

dim(movies)

[1] 27278     3

dim(ratings)

[1] 1048575       4

dim(links)

[1] 27278     3

dim(imdbratings)

[1] 825271      3

Data Dimensions

movies and links have the same number of rows and can easily be matched up.
ratings has multiple rows for each movie lens reviewed film.
imdb appears to have a larger movie data base.

movies Head

head(movies)

  movieId                              title
1       1                   Toy Story (1995)
2       2                     Jumanji (1995)
3       3            Grumpier Old Men (1995)
4       4           Waiting to Exhale (1995)
5       5 Father of the Bride Part II (1995)
6       6                        Heat (1995)
                                       genres
1 Adventure|Animation|Children|Comedy|Fantasy
2                  Adventure|Children|Fantasy
3                              Comedy|Romance
4                        Comedy|Drama|Romance
5                                      Comedy
6                       Action|Crime|Thriller

links Head

head(links)

  movieId imdbId tmdbId
1   95541      5  16624
2   88674      8 105158
3  120869     10    774
4   98981     12    160
5  113048     14  82120
6   94431    192 159900

ratings Head

head(ratings)

  userId movieId rating  timestamp
1      1       2    3.5 1112486027
2      1      29    3.5 1112484676
3      1      32    3.5 1112484819
4      1      47    3.5 1112484727
5      1      50    3.5 1112484580
6      1     112    3.5 1094785740

imdbratings Head

head(imdbratings)

# A tibble: 6 x 3
  tconst    averageRating numVotes
  <chr>             <dbl>    <int>
1 tt0000001          5.80     1367
2 tt0000002          6.50      160
3 tt0000003          6.60      949
4 tt0000004          6.40       96
5 tt0000005          6.20     1645
6 tt0000006          5.60       83

-Noticed at this point that I would need to change the imdb movie ID to numeric and remove the first two characters in order to match the links data table.

Joining movies with links

library(dplyr)
movieslinks <- full_join(movies, links, by = "movieId")
movieslinks$tmdbId <- NULL
#I do not need tmdbID

dim(movieslinks)

[1] 27278     4

Fix the IMDB ID

imdbratings$NewID <- substr(imdbratings$tconst, 3, 9)
imdbratings$NewID <- as.numeric(imdbratings$NewID)
imdbratings$tconst <- NULL
#change the imdb id name to make the join easier 
imdbratings$imdbId <- imdbratings$NewID
#remove NewId
imdbratings$NewID <- NULL

Joining the IMDB data with movielinks

dim(imdbratings)

[1] 825271      3

dim(movieslinks)

[1] 27278     4

AlmostCompleteData <- left_join(movieslinks, imdbratings, by = "imdbId")

Checking for Mistkaes

sum(is.na(AlmostCompleteData$imdbId))

[1] 0

sum(AlmostCompleteData$imdbId == " ")

[1] 0

Reviewing the Column Names of Newly Added Data

head(AlmostCompleteData[,4:6])

  imdbId averageRating numVotes
1 114709           8.3   728760
2 113497           6.9   251035
3 113228           6.6    21270
4 114885           5.7     8206
5 113041           6.0    29500
6 113277           8.2   490797

-I do not like the naming convention for the IMDB average rating and number of votes

Renaming Columns of Newly Added Data

AlmostCompleteData$IMDBAvgRating <- AlmostCompleteData$averageRating
AlmostCompleteData$IMDBNumVotes <- AlmostCompleteData$numVotes

AlmostCompleteData$averageRating <- NULL
AlmostCompleteData$numVotes <- NULL

Reviewing the Column Names of Newly Added Data

  imdbId IMDBAvgRating IMDBNumVotes
1 114709           8.3       728760
2 113497           6.9       251035
3 113228           6.6        21270
4 114885           5.7         8206
5 113041           6.0        29500
6 113277           8.2       490797

Add Almost Complete Data to ratings

CompleteData <- left_join(ratings, AlmostCompleteData, by = "movieId")

Check for Mistakes in Complete Data

sum(is.na(CompleteData))

[1] 786

sum(is.na(CompleteData$IMDBAvgRating))

[1] 393

sum(is.na(CompleteData$IMDBNumVotes))

[1] 393

Investigate NA Values in Complete Data

NAValues <- CompleteData[is.na(CompleteData$IMDBAvgRating),]
head(NAValues[,c(2,8)])

      movieId IMDBAvgRating
1754     6776            NA
4083     6003            NA
5935     2851            NA
8956    79677            NA
23705    6776            NA
25764    6776            NA

Investigate NA Values in Complete Data

How many movies are in the movie lens data set and not IMDB?

length(unique(NAValues$movieId))

[1] 21

-Only 21 movies. I quickly glance over the movies and none of them appear that important.
-If I remove them I will still be comaring 27,257 movies between Movie Lens and IMDB.

Removing the NAs

dim(CompleteData)

[1] 1048575       9

CompleteData <- na.omit(CompleteData)
dim(CompleteData)

[1] 1048182       9

1048575 - 1048182 == 393

[1] TRUE

sum(is.na(CompleteData))

[1] 0

Adding an Average Ratings Variable for Movie Lens

CompleteData <- CompleteData %>% 
  group_by(movieId) %>%
  mutate(LensAvgRating = ave(rating))

Adding a Count of Unique Ratings for Movie Lens

CompleteData <- CompleteData %>% 
  group_by(movieId) %>%
  mutate(LensNumVotes = length(unique(userId)))

Adding an Average Ratings Variable for Movie Lens

head(CompleteData[,c(2,10)])

# A tibble: 6 x 2
# Groups:   movieId [6]
  movieId LensAvgRating
    <int>         <dbl>
1       2          3.27
2      29          4.00
3      32          3.90
4      47          4.08
5      50          4.37
6     112          3.37

Adding a Count of Unique Ratings for Movie Lens

head(CompleteData[,c(2,11)])

# A tibble: 6 x 2
# Groups:   movieId [6]
  movieId LensNumVotes
    <int>        <int>
1       2         1155
2      29          448
3      32         2312
4      47         2241
5      50         2490
6     112          627

Exploratory Data Analysis

I've completed combining and cleaning my data. The 4 Variables that I will do my exploratory analysis on are:

The average rating from IMDB, IMDBAvgRating
The number of votes from IMDB, IMDBNumVotes
The average rating from MovieLens, LensAvgRating
The number of votes from Lens, LensNumVotes

IMDB Avg Rating

min(CompleteData$IMDBAvgRating)

[1] 1.6

median(CompleteData$IMDBAvgRating)

[1] 7.3

max(CompleteData$IMDBAvgRating)

[1] 9.5

IMDB Avg Rating

Aha Moment

quantile(CompleteData$IMDBAvgRating)

  0%  25%  50%  75% 100% 
 1.6  6.6  7.3  7.9  9.5

-Gives me all the information I was looking for in one line of code.

Comparing IMDB Rating with Movie Lens

quantile(CompleteData$IMDBAvgRating)

  0%  25%  50%  75% 100% 
 1.6  6.6  7.3  7.9  9.5

quantile(CompleteData$LensAvgRating)

      0%      25%      50%      75%     100% 
0.500000 3.250000 3.609375 3.882134 5.000000

-I now realize that I'm comparing a 5 point scale and a 10 point scale. I will create a new variable.

Comparing IMDB Rating with Movie Lens

quantile(CompleteData$IMDBAvgRating)

  0%  25%  50%  75% 100% 
 1.6  6.6  7.3  7.9  9.5

quantile(CompleteData$LensAvgRatingCOMP)

       0%       25%       50%       75%      100% 
 1.000000  6.500000  7.218750  7.764268 10.000000

Comparing IMDB Rating with Movie Lens

plot of chunk unnamed-chunk-31

There are a lot of outliers below the 25 quartiles. This makes me think that people actually tend to review a movie if they really did not like it.
Hypothesis: the more reviews a movie has the worst the avg rating will be.

Comparing IMDB Rating with Movie Lens using vioplot

Warning: weights overwritten by binning

plot of chunk unnamed-chunk-32

Looking at the long tail of the violin reaffirms my theory.

Comparing IMDB Votes with Movie Lens Votes

IMDB plot of chunk unnamed-chunk-33

Movie Lens plot of chunk unnamed-chunk-34

Comparing IMDB Votes with Movie Lens Votes

The previous boxplot makes me think that there are some movies that have signifigantly more reviews than others.
Could be because the movies were well known.
They could have been either extremely good or bad. Will need to do more analysis to find out which.

Questions I Want to Answer

Does a high number of votes lead to a lower rating? (In both IMDB and Movie Lens)
Is a high rating in Movie Lens correlated with a high rating in IMDB?

Movie Lens: Ratings vs Number of Votes

plot of chunk unnamed-chunk-35

Movie Lens: Ratings vs Number of Votes

plot of chunk unnamed-chunk-36

Movie Lens: Ratings vs Number of Votes

plot of chunk unnamed-chunk-37

Movie Lens: Ratings vs Number of Votes Analysis

According to the geom_line on this graph my hypothesis was wrong. There is a positive correlation between the number of movie reviews and the avg rating on Movie Lens. I'm going to confirm with a formal regression:


Call:
lm(formula = CompleteData$LensAvgRating ~ CompleteData$LensNumVotes)

Coefficients:
              (Intercept)  CompleteData$LensNumVotes  
                3.3462998                  0.0002574

Movie Lens: Ratings vs Number of Votes Analysis


Call:
lm(formula = CompleteData$LensAvgRating ~ CompleteData$LensNumVotes)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8473 -0.2601  0.0653  0.3167  1.6534 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)               3.346e+00  6.070e-04  5512.8   <2e-16 ***
CompleteData$LensNumVotes 2.574e-04  5.926e-07   434.3   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4469 on 1048180 degrees of freedom
Multiple R-squared:  0.1525,    Adjusted R-squared:  0.1525 
F-statistic: 1.887e+05 on 1 and 1048180 DF,  p-value: < 2.2e-16

IMDB: Ratings vs Number of Votes Analysis

plot of chunk unnamed-chunk-40

IMDB: Ratings vs Number of Votes Analysis

Again, my hypothesis was wrong. There is a positive correlation between the number of movie reviews and the avg rating on IMDB. I'm going to confirm with a formal regression:


Call:
lm(formula = CompleteData$IMDBAvgRating ~ CompleteData$IMDBNumVotes)

Coefficients:
              (Intercept)  CompleteData$IMDBNumVotes  
                6.777e+00                  1.784e-06

IMDB: Ratings vs Number of Votes Analysis


Call:
lm(formula = CompleteData$IMDBAvgRating ~ CompleteData$IMDBNumVotes)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.1786 -0.4123  0.1143  0.5391  2.5960 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)               6.777e+00  9.561e-04  7087.9   <2e-16 ***
CompleteData$IMDBNumVotes 1.784e-06  2.413e-09   739.3   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7781 on 1048180 degrees of freedom
Multiple R-squared:  0.3427,    Adjusted R-squared:  0.3427 
F-statistic: 5.466e+05 on 1 and 1048180 DF,  p-value: < 2.2e-16

Are IMDB and Movie Lens Ratings Correlated?

plot of chunk unnamed-chunk-43

Formal Regression


Call:
lm(formula = CompleteData$IMDBAvgRating ~ CompleteData$LensAvgRatingCOMP)

Coefficients:
                   (Intercept)  CompleteData$LensAvgRatingCOMP  
                        0.8749                          0.8968

Summary of Regression


Call:
lm(formula = CompleteData$IMDBAvgRating ~ CompleteData$LensAvgRatingCOMP)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.0433 -0.2312  0.0124  0.2305  6.6283 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    0.874859   0.002893   302.4   <2e-16 ***
CompleteData$LensAvgRatingCOMP 0.896842   0.000406  2208.8   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4036 on 1048180 degrees of freedom
Multiple R-squared:  0.8231,    Adjusted R-squared:  0.8231 
F-statistic: 4.879e+06 on 1 and 1048180 DF,  p-value: < 2.2e-16

Problems I Had During This Project

I could not remove the Vioplot warning message from my presentation.
I could not properly analyze which Genre had the best rating.
Movie Lens genre labeling system created thousands of genre possibilities.
Analyzing the time stamp. I was unable to come up with a way to gain information from the time the review was made. I would like to have more help with the lubridate package.