T.J. Gray
04/22/2018
setwd("/Users/thomgray/Downloads")
library(readr)
movies <- read.csv(file="MOVIESUPLOAD.csv", header = TRUE, sep = ",")
ratings <- read.csv(file="ratings.csv", header = TRUE, sep = ",")
links <- read.csv(file="links.csv", header = TRUE, sep = ",")
imdbratings <- read_tsv(file = "imdbratings.tsv")
dim(movies)
[1] 27278 3
dim(ratings)
[1] 1048575 4
dim(links)
[1] 27278 3
dim(imdbratings)
[1] 825271 3
head(movies)
movieId title
1 1 Toy Story (1995)
2 2 Jumanji (1995)
3 3 Grumpier Old Men (1995)
4 4 Waiting to Exhale (1995)
5 5 Father of the Bride Part II (1995)
6 6 Heat (1995)
genres
1 Adventure|Animation|Children|Comedy|Fantasy
2 Adventure|Children|Fantasy
3 Comedy|Romance
4 Comedy|Drama|Romance
5 Comedy
6 Action|Crime|Thriller
head(links)
movieId imdbId tmdbId
1 95541 5 16624
2 88674 8 105158
3 120869 10 774
4 98981 12 160
5 113048 14 82120
6 94431 192 159900
head(ratings)
userId movieId rating timestamp
1 1 2 3.5 1112486027
2 1 29 3.5 1112484676
3 1 32 3.5 1112484819
4 1 47 3.5 1112484727
5 1 50 3.5 1112484580
6 1 112 3.5 1094785740
head(imdbratings)
# A tibble: 6 x 3
tconst averageRating numVotes
<chr> <dbl> <int>
1 tt0000001 5.80 1367
2 tt0000002 6.50 160
3 tt0000003 6.60 949
4 tt0000004 6.40 96
5 tt0000005 6.20 1645
6 tt0000006 5.60 83
-Noticed at this point that I would need to change the imdb movie ID to numeric and remove the first two characters in order to match the links data table.
library(dplyr)
movieslinks <- full_join(movies, links, by = "movieId")
movieslinks$tmdbId <- NULL
#I do not need tmdbID
dim(movieslinks)
[1] 27278 4
imdbratings$NewID <- substr(imdbratings$tconst, 3, 9)
imdbratings$NewID <- as.numeric(imdbratings$NewID)
imdbratings$tconst <- NULL
#change the imdb id name to make the join easier
imdbratings$imdbId <- imdbratings$NewID
#remove NewId
imdbratings$NewID <- NULL
dim(imdbratings)
[1] 825271 3
dim(movieslinks)
[1] 27278 4
AlmostCompleteData <- left_join(movieslinks, imdbratings, by = "imdbId")
sum(is.na(AlmostCompleteData$imdbId))
[1] 0
sum(AlmostCompleteData$imdbId == " ")
[1] 0
head(AlmostCompleteData[,4:6])
imdbId averageRating numVotes
1 114709 8.3 728760
2 113497 6.9 251035
3 113228 6.6 21270
4 114885 5.7 8206
5 113041 6.0 29500
6 113277 8.2 490797
-I do not like the naming convention for the IMDB average rating and number of votes
AlmostCompleteData$IMDBAvgRating <- AlmostCompleteData$averageRating
AlmostCompleteData$IMDBNumVotes <- AlmostCompleteData$numVotes
AlmostCompleteData$averageRating <- NULL
AlmostCompleteData$numVotes <- NULL
imdbId IMDBAvgRating IMDBNumVotes
1 114709 8.3 728760
2 113497 6.9 251035
3 113228 6.6 21270
4 114885 5.7 8206
5 113041 6.0 29500
6 113277 8.2 490797
CompleteData <- left_join(ratings, AlmostCompleteData, by = "movieId")
sum(is.na(CompleteData))
[1] 786
sum(is.na(CompleteData$IMDBAvgRating))
[1] 393
sum(is.na(CompleteData$IMDBNumVotes))
[1] 393
NAValues <- CompleteData[is.na(CompleteData$IMDBAvgRating),]
head(NAValues[,c(2,8)])
movieId IMDBAvgRating
1754 6776 NA
4083 6003 NA
5935 2851 NA
8956 79677 NA
23705 6776 NA
25764 6776 NA
How many movies are in the movie lens data set and not IMDB?
length(unique(NAValues$movieId))
[1] 21
-Only 21 movies. I quickly glance over the movies and none of them appear that important.
-If I remove them I will still be comaring 27,257 movies between Movie Lens and IMDB.
dim(CompleteData)
[1] 1048575 9
CompleteData <- na.omit(CompleteData)
dim(CompleteData)
[1] 1048182 9
1048575 - 1048182 == 393
[1] TRUE
sum(is.na(CompleteData))
[1] 0
CompleteData <- CompleteData %>%
group_by(movieId) %>%
mutate(LensAvgRating = ave(rating))
CompleteData <- CompleteData %>%
group_by(movieId) %>%
mutate(LensNumVotes = length(unique(userId)))
head(CompleteData[,c(2,10)])
# A tibble: 6 x 2
# Groups: movieId [6]
movieId LensAvgRating
<int> <dbl>
1 2 3.27
2 29 4.00
3 32 3.90
4 47 4.08
5 50 4.37
6 112 3.37
head(CompleteData[,c(2,11)])
# A tibble: 6 x 2
# Groups: movieId [6]
movieId LensNumVotes
<int> <int>
1 2 1155
2 29 448
3 32 2312
4 47 2241
5 50 2490
6 112 627
I've completed combining and cleaning my data. The 4 Variables that I will do my exploratory analysis on are:
min(CompleteData$IMDBAvgRating)
[1] 1.6
median(CompleteData$IMDBAvgRating)
[1] 7.3
max(CompleteData$IMDBAvgRating)
[1] 9.5
Aha Moment
quantile(CompleteData$IMDBAvgRating)
0% 25% 50% 75% 100%
1.6 6.6 7.3 7.9 9.5
-Gives me all the information I was looking for in one line of code.
quantile(CompleteData$IMDBAvgRating)
0% 25% 50% 75% 100%
1.6 6.6 7.3 7.9 9.5
quantile(CompleteData$LensAvgRating)
0% 25% 50% 75% 100%
0.500000 3.250000 3.609375 3.882134 5.000000
-I now realize that I'm comparing a 5 point scale and a 10 point scale. I will create a new variable.
quantile(CompleteData$IMDBAvgRating)
0% 25% 50% 75% 100%
1.6 6.6 7.3 7.9 9.5
quantile(CompleteData$LensAvgRatingCOMP)
0% 25% 50% 75% 100%
1.000000 6.500000 7.218750 7.764268 10.000000
Warning: weights overwritten by binning
IMDB
Movie Lens
According to the geom_line on this graph my hypothesis was wrong. There is a positive correlation between the number of movie reviews and the avg rating on Movie Lens. I'm going to confirm with a formal regression:
Call:
lm(formula = CompleteData$LensAvgRating ~ CompleteData$LensNumVotes)
Coefficients:
(Intercept) CompleteData$LensNumVotes
3.3462998 0.0002574
Call:
lm(formula = CompleteData$LensAvgRating ~ CompleteData$LensNumVotes)
Residuals:
Min 1Q Median 3Q Max
-2.8473 -0.2601 0.0653 0.3167 1.6534
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.346e+00 6.070e-04 5512.8 <2e-16 ***
CompleteData$LensNumVotes 2.574e-04 5.926e-07 434.3 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4469 on 1048180 degrees of freedom
Multiple R-squared: 0.1525, Adjusted R-squared: 0.1525
F-statistic: 1.887e+05 on 1 and 1048180 DF, p-value: < 2.2e-16
Again, my hypothesis was wrong. There is a positive correlation between the number of movie reviews and the avg rating on IMDB. I'm going to confirm with a formal regression:
Call:
lm(formula = CompleteData$IMDBAvgRating ~ CompleteData$IMDBNumVotes)
Coefficients:
(Intercept) CompleteData$IMDBNumVotes
6.777e+00 1.784e-06
Call:
lm(formula = CompleteData$IMDBAvgRating ~ CompleteData$IMDBNumVotes)
Residuals:
Min 1Q Median 3Q Max
-5.1786 -0.4123 0.1143 0.5391 2.5960
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.777e+00 9.561e-04 7087.9 <2e-16 ***
CompleteData$IMDBNumVotes 1.784e-06 2.413e-09 739.3 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7781 on 1048180 degrees of freedom
Multiple R-squared: 0.3427, Adjusted R-squared: 0.3427
F-statistic: 5.466e+05 on 1 and 1048180 DF, p-value: < 2.2e-16
Call:
lm(formula = CompleteData$IMDBAvgRating ~ CompleteData$LensAvgRatingCOMP)
Coefficients:
(Intercept) CompleteData$LensAvgRatingCOMP
0.8749 0.8968
Call:
lm(formula = CompleteData$IMDBAvgRating ~ CompleteData$LensAvgRatingCOMP)
Residuals:
Min 1Q Median 3Q Max
-8.0433 -0.2312 0.0124 0.2305 6.6283
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.874859 0.002893 302.4 <2e-16 ***
CompleteData$LensAvgRatingCOMP 0.896842 0.000406 2208.8 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4036 on 1048180 degrees of freedom
Multiple R-squared: 0.8231, Adjusted R-squared: 0.8231
F-statistic: 4.879e+06 on 1 and 1048180 DF, p-value: < 2.2e-16