For this assignment, I achieved the following tasks:
my_db=DBI::dbConnect(RMySQL::MySQL(),host=Sys.getenv("DB_HOST"),
port=as.integer(Sys.getenv("DB_PORT")),
user=Sys.getenv("DB_USERNAME"),
password=Sys.getenv("DB_PASSWORD"),
dbname=Sys.getenv("DB_NAME"))
This code segment scrapes the https://movieweb.com/movies/2022/ website to retrieve the names of movies released in 2022. It then pulls out the title of the movie and the released date, and stores it in a dataframe
html = read_html("https://movieweb.com/movies/2022/")
col_names = c("title", "release_date")
movie_df = data.frame(matrix(nrow=0, ncol=2))
colnames(movie_df) = col_names
db_cards = html %>% html_nodes(".listing-database-card")
count = 1
for(item in db_cards) {
movie_date = item %>% html_node(".database-card-spec") %>% html_text()
movie_title = item %>% html_node(".database-card-title") %>% html_text()
movie_title = str_squish(movie_title)
#print(movie_title)
movie_df[count,] = c(movie_title, movie_date)
count = count+1
}
movie_df = as.tibble(movie_df)
## Warning: `as.tibble()` was deprecated in tibble 2.0.0.
## Please use `as_tibble()` instead.
## The signature and semantics have changed, see `?as_tibble`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
movie_df = movie_df %>% filter(release_date!='2022')
movie_df = movie_df %>% mutate(release_date = as.Date(release_date,format="%B %d, %Y"))
movie_df %>% mutate(year = year(release_date), month=month(release_date))
## # A tibble: 211 × 4
## title release_date year month
## <chr> <date> <dbl> <dbl>
## 1 Death of a Telemarketer 2022-01-24 2022 1
## 2 munich 2022-01-21 2022 1
## 3 The Royal Treatment 2022-01-20 2022 1
## 4 Ray Donovan: The Movie 2022-01-14 2022 1
## 5 Hotel Transylvania: Transformania 2022-01-14 2022 1
## 6 Belle 2022-01-14 2022 1
## 7 Scream (2022) 2022-01-14 2022 1
## 8 The Tender Bar 2022-01-07 2022 1
## 9 The 355 2022-01-07 2022 1
## 10 Boiling Point 2022-01-07 2022 1
## # … with 201 more rows
movie_df = movie_df %>% filter(title != "")
movie_df = movie_df %>% filter(str_detect(title,"\'",negate=TRUE))
This code segment scrapes the https://www.cbssports.com/mlb/news/top-100-mlb-players-for-2022-season-juan-soto-vladimir-guerrero-jr-mike-trout-battle-for-no-1-ranking/ website, and uses the site to generate a list of names that will be used to represent the respondents for the movie ratings. It places the information in a vector of names
names = c()
html = read_html("https://www.cbssports.com/mlb/news/top-100-mlb-players-for-2022-season-juan-soto-vladimir-guerrero-jr-mike-trout-battle-for-no-1-ranking/")
player_cells = html %>% html_nodes(".PlayerObjectCell")
count = 1
for(player in player_cells) {
name = player %>% html_node(".player-name") %>% html_text()
name = str_squish(name)
#print(name)
names[count] = name
count = count + 1
}
names = as.tibble(names) %>% filter(str_detect(value,"\'",negate = TRUE))
This code block is used to create synthetic data that will represent the movies seen by each respondent (if any) and the rating that they give to each movie. The code uses the following logic:
The data is stored in a dataframe with the movie ratings.
col_names = c("respondent", "title", "rating")
ratings_df = data.frame(matrix(nrow=0, ncol=length(col_names)))
colnames(ratings_df) = col_names
count = 1
set.seed(10)
for(respondent in names$value) {
num_movies = ceiling(runif(1,min=1, max=10))
movie_titles = ceiling(runif(num_movies,min=1,max=nrow(movie_df)))
ratings = ceiling(runif(num_movies,min=1, max=5))
for(i in 1:num_movies) {
title_num = movie_titles[i]
rating = ratings[i]
movie_title = movie_df[[title_num,1]]
rating_vector = c(respondent,movie_title,rating)
ratings_df[count,] = rating_vector
count=count+1
}
}
ratings_df = as.tibble(ratings_df)
ratings_df = ratings_df %>% mutate(rating = as.integer(rating))
Use the movies dataframe to populate the movies table in the DB
# creating the query
query <- "INSERT INTO movies(title, release_date) VALUES"
# inserting values in sql query
query <- paste0(query, paste(sprintf("('%s', '%s')",
movie_df$title, movie_df$release_date), collapse = ","))
dbSendQuery(my_db,query)
## <MySQLResult:-1,0,0>
Use the names dataframe to populate the respondents table in the DB
query = "INSERT INTO respondents(name) VALUES"
query = paste0(query, paste(sprintf("('%s')",names$value),collapse=","))
dbSendQuery(my_db,query)
## <MySQLResult:32477552,0,1>
Use data from the ratings_df dataframent to populate the movie_ratings table in the DB
for (i in 1:nrow(ratings_df)) {
respondent = ratings_df[[i,1]]
title = ratings_df[[i,2]]
rating = ratings_df[[i,3]]
#print(respondent)
query = "SELECT movie_id FROM movies WHERE title = "
query = paste0(query,paste(sprintf("'%s'",title)))
movie_id = dbGetQuery(my_db,query)
query = "SELECT respondent_id FROM respondents WHERE name = "
query = paste0(query,paste(sprintf("'%s'",respondent)))
respondent_id = dbGetQuery(my_db,query)
movie_id = movie_id$movie_id
respondent_id = respondent_id$respondent_id
query = "INSERT INTO movie_ratings(respondent_id, movie_id, rating) VALUES"
query = paste0(query, paste(sprintf("(%i, %i, %i)",respondent_id, movie_id, rating),collapse=","))
dbGetQuery(my_db,query)
}
Now that the data is in the DB, we can run queries to extract the table and put it into a data frame. Once each of the tables are in a dataframe, we complete the necessary joins in R (as opposed to using SQL) to create a dataframe that combines the ratings, with the movie titles and respondent names. From there we also generated a ratings summary dataframe that lists the following measurments for each movie title:
query = "SELECT * FROM movies"
movies = dbGetQuery(my_db,query)
query = "SELECT * FROM respondents"
respondents = dbGetQuery(my_db,query)
query = "SELECT * FROM movie_ratings"
ratings = dbGetQuery(my_db,query)
ratings_df = left_join(ratings, movies)
## Joining, by = "movie_id"
ratings_df = left_join(ratings_df,respondents)
## Joining, by = "respondent_id"
ratings_summary = ratings_df %>% group_by(title) %>% summarize(num_reviews = n_distinct(rating_id),
avg_rating = mean(rating),
min_rating = min(rating),
max_rating = max(rating))
df = ratings_summary %>% arrange(desc(num_reviews))
This was a very challenging and fun assignment. I continued to have issues with the code saying it was Updating Loaded Packages and the code stalling. This was something that I was unable to resolve.
However, if I had more time, there are a number of exploratory data analysis steps that I could take to better examine the results and do some associated visualizations. I plan on coming back and doing more with this assignment.