Today I am going to analyze the MovieLens data sets, which were collected by the GroupLens Research Project at the University of Minnesota through the MovieLens web site.This data set consists of 100,000 ratings from 943 users on 1682 movies, and was released in April 1998. Each user has rated at least 20 movies, and basic demographic information for the users are included in the data set.
The data set has three tables. The table ‘User’ includes basic demographic info such as age and gender; the table ‘Data’ includes ratings; and the table ‘Movies’ includes information for the movies. I merge the three tables together.
users <- read.table('ml-100k/u.user', sep='|')
colnames(users) <- c('user_id', 'age', 'sex', 'occupation', 'zip_code') #Assign column names
ratings <- read.table('ml-100k/u.data', sep='\t')
colnames(ratings) <- c('user_id', 'movie_id', 'rating', 'unix_timestamp')
movies <- fread('ml-100k/u.item', sep='|', select = c(1:5))
colnames(movies) <- c('movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url')
movie_ratings = merge(movies, ratings, by="movie_id")
lens = merge(movie_ratings, users, by="user_id")
Let’s take a look at the distribution of users’ ages.
ggplot(aes(x=age), data=users) + geom_histogram(aes(fill=..count..), binwidth = 5) +
ggtitle("Distribution of users' ages") + xlab('age') +ylab('count of users') +
scale_x_continuous(breaks = seq(0,81,5)) +
scale_fill_gradient("Count", low = "#66CCFF", high = "#003366")
The plot shows that the majority of users are 20 years old to 50 years old. This makes sense because young people are more likely to rate movies than children and older people.
Then I would also like to take a look the distribution of user’s ages by gender.
ggplot(aes(x=age), data=users) + geom_histogram(aes(fill=..count..), binwidth = 5) +
ggtitle("Distribution of users' ages") + xlab('age') +ylab('count of users') +
scale_x_continuous(breaks = seq(0,81,5)) +
scale_fill_gradient("Count", low = "#66CCFF", high = "#003366") +
facet_wrap(~sex, ncol=2)
The graph shows that the number of male users is much higher than that of female users. It may seem strange at first, but it kind of makes sense because this data set was collected in 1998. Back then the Internet was not universal. Some certain groups, such as programmers, might be more active on the Internet than others.
I am interested in comparing ratings by different ages. To easily analyze the data, I put data into eight age groups.
agelabel <- c('0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79')
lens$age_group <- cut(lens$age, seq(0,81,10), include.lowest=TRUE, right=FALSE, labels=agelabel)
rating_by_age_group <- lens %>% group_by(age_group) %>% summarise(mean=mean(rating))
ggplot(data=rating_by_age_group, aes(x=age_group, y=mean)) +
geom_bar(stat="identity", fill="#FF6666") +
scale_y_continuous(limits = c(0, 5)) +
ggtitle("Ratings by Age Group") +
ylab("Movie Ratings") +
xlab("Age Groups")
It does not seem like age plays an important role in ratings. Only ratings by age group 10-19 years old, 20-29 years old are slightly lower than other age groups. However, the difference is not significant.
Then I am also interested in ratings by different age groups and gender.
rating_by_age_group_sex <- lens %>% group_by(sex, age_group) %>% summarise(mean=mean(rating))
ggplot(data=rating_by_age_group_sex, aes(x=age_group, y=mean, fill=sex)) +
geom_bar(stat="identity", position=position_dodge()) +
scale_y_continuous(limits = c(0, 5))+
ggtitle("Ratings by Age Group and Sex") +
ylab("Movie Ratings") +
xlab("Age Groups")
The rating difference caused by gender within every age group is not considerable except age groups 60-69 years old and 70-70 years old. For these two age groups, the ratings by males are obviously higher than that by females.
Then let’s look if occupation matters in terms of movie ratings. First, let’s plot count of ratings by occupation.
total_rating_by_occupation <- lens %>% group_by(occupation) %>%
summarise(mean=mean(rating), N=n()) %>% arrange(desc(mean))
ggplot(aes(x=reorder(occupation, N), y=N, fill=N), data=total_rating_by_occupation) +
geom_bar(stat = "identity") +
scale_fill_gradient("Count", low = "#66CCFF", high = "#003366") +
ylab("Number of Ratings") +
xlab("Occupation") +
ggtitle("Number of Ratings by Occupation") +
coord_flip()
It looks like students rated a lot of movies, which makes sense because students were more likely to have access to the Internet in 1998.
ggplot(aes(x=reorder(occupation, mean), y=mean, fill=mean), data=total_rating_by_occupation) +
geom_bar(stat = "identity") +
scale_fill_gradient("Mean", low = "#66CCFF", high = "#003366") +
ylab("Average Ratings") +
xlab("Occupation") +
ggtitle("Average Ratings by Occupation") +
scale_y_continuous(limits = c(0, 4)) +
coord_flip()
Healthcare workers are likely to give lower ratings compared to other occupation. Otherwise, ratings across occupations look consistent.
My another interest is whether income relates to ratings. Since zip code is included in the data set, it is possible to link ratings to income if I can find income by zip code data. Luckily I found median income by zip code online and merge it to the MovieLens data.
zipincome <- read.csv("zipincome.csv")
zipincome$Zip <- as.factor(zipincome$Zip)
lens3 <- merge(lens, zipincome, by.x="zip_code", by.y="Zip")
rating_by_income <- lens3 %>% group_by(Median) %>% summarise(mean=mean(rating), N=n()) %>% arrange(desc(Median))
ggplot(aes(x=Median, y=mean), data=rating_by_income) + geom_point() +
geom_smooth() +
ggtitle("Ratings by Income") +
xlab("Median Income") +
ylab("Average Ratings")
It does not look like income is related to ratings. We can tell this by looking at the plot. The regression line is pretty flat.
Again, since zip code is included in the data set, I can also look at ratings by state. It is also a great chance to practice the Maps function in ggplot.
zip <- read.csv("us_postal_codes.csv")
zip$Postal.Code <- as.factor(zip$Postal.Code)
lens1 <- merge(lens, zip, by.x="zip_code", by.y="Postal.Code")
lens1$imdb_url <- NULL
lens1$video_release_date <- NULL
rating_by_state <- lens1 %>% group_by(State) %>% summarise(mean=mean(rating), N=n()) %>% arrange(desc(N))
rating_by_state$State <- tolower(rating_by_state$State)
states_map <- map_data("state")
rating_by_state_map <- merge(states_map, rating_by_state, by.x="region", by.y="State")
rating_by_state_map <- rating_by_state_map %>% arrange(group, order)
ggplot(rating_by_state_map, aes(x=long, y=lat, group=group, fill=mean)) +
geom_polygon(colour="black") +
coord_map("polyconic") +
scale_fill_gradient("Rating", low = "#99FFFF", high = "#003366") +
ggtitle("Movie Ratings by State")
ratings1000 <- rating_by_state %>% arrange(desc(mean)) %>% filter(N>1000)
ratings1000df <- data.frame(ratings1000)
ratings1000df
## State mean N
## 1 wisconsin 3.755182 1785
## 2 iowa 3.754883 1587
## 3 tennessee 3.716418 1206
## 4 florida 3.701013 1679
## 5 washington 3.669627 2252
## 6 missouri 3.658802 2204
## 7 indiana 3.639165 1006
## 8 california 3.595506 13842
## 9 pennsylvania 3.565738 3339
## 10 north carolina 3.532371 2008
## 11 district of columbia 3.530120 1411
## 12 ohio 3.526619 3475
## 13 michigan 3.521597 2454
## 14 georgia 3.513595 2317
## 15 minnesota 3.497315 7635
## 16 texas 3.467672 5042
## 17 colorado 3.467560 2389
## 18 idaho 3.462521 1801
## 19 virginia 3.444402 2590
## 20 illinois 3.436063 5740
## 21 oregon 3.432937 1767
## 22 arizona 3.400547 1463
## 23 new york 3.352078 6882
## 24 south carolina 3.339296 1733
## 25 maryland 3.255933 2739
For states with more than 1000 ratings, Wisconsin gives the highest ratings, which is 3.75 out of 5, and Maryland gives the lowest ratings, which is 3.25 out of 5.
If you let me name two states relative to movies, my answer would be New York and California. Therefore, I am interested in taking a further look at movie ratings by these two states. I select the 50 most rated movies, and compare ratings between New York and California on these 50 movies.
most <- lens %>% group_by(movie_id) %>% summarise(N = n()) %>% arrange(desc(N))
most50 <- most[0:50,]
most_50 <- merge(most50, lens1, by="movie_id")
NYCAMoives <- most_50 %>% group_by(title, State) %>%
summarise(mean=mean(rating)) %>%
spread(State,mean) %>% select(title, California, `New York`) %>%
mutate(diff=California-`New York`)
ggplot(aes(x=reorder(title, diff), y=diff, fill=diff), data=NYCAMoives) +
geom_bar(stat = "identity") + coord_flip() +
scale_fill_gradient("Difference", low = "#66CCFF", high = "#003366") +
ylab("New York's Movie Favor vs California's Moive Favor") +
xlab("Movie Titles")
Looks like California users like Star Wars, Pulp Fiction, and Titanic more than New York users, while New York users prefer movies like Dead Man Walking and Mr. Holland’s Opus.
F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872