DATA607_Assignment2

Introduction

I chose six recent films and asked five friends to rate each of the movies they had seen from a scale of 1 to 5. This project follows the collection of that information and it’s migration to R for further analysis.

Collecting The Data

The time constraints of the project required collection of the data by text message. I created a database in a MySQL workbench to store the data in individual tables:

Table 1: Friends
Table 2: Movies
Table 3: Ratings

By creating separate individual tables, this allowed me avoid null values for friends who did not watch certain films, as the only observations in my master dataframe would be friends who have watched each film.

The SQL code for database and table creation can be found here.

Inspecting The Data

The data was then read into R using the MariaDB library:

# store password
pw <- pw ## password stored in a hidden cell

moviesDb <- dbConnect(MariaDB(), user='josh1den', password=pw, 
                       dbname='data607_hw2', host='localhost')

# preview the tables
dbListTables(moviesDb)

## [1] "friends" "movies"  "ratings"

As you can see, the three tables are now available in R.

I can inspect each individual table by writing a query:

# store the query
query1 <- "SELECT * FROM friends;"

# store the friends table as a dataframe
rs1 = dbSendQuery(moviesDb,query1)
friends <- dbFetch(rs1)
head(friends)

##   id      name
## 1  1    Bianca
## 2  2     Jason
## 3  3    Henric
## 4  4 Hannelore
## 5  5      Andy

I write a query to join the data from the three tables to a master dataframe for analysis and disconnect from the database:

# performing a join, build a master dataframe from the three tables
query2 <- "SELECT f.name, m.title, m.genre, r.rating
          FROM friends as f
          JOIN ratings as r
          ON f.id = r.friend_id
          JOIN movies as m
          ON m.id = r.movie_id
          ORDER BY 1,2;"

# store the friends table as a dataframe
rs2 = dbSendQuery(moviesDb,query2)

## Warning in result_create(conn@ptr, statement, is_statement): Cancelling previous
## query

df <- dbFetch(rs2)
head(df)

##     name          title  genre rating
## 1   Andy           Nope Horror      4
## 2   Andy           Prey Horror      4
## 3   Andy        Top Gun Action      4
## 4 Bianca   Bullet Train Action      4
## 5 Bianca          Elvis  Drama      3
## 6 Bianca Jurassic World Action      1

I can create an individual tibble for movies watched and average rating by friend:

# compute the number of movies watched and avg rating by friend
by_friend <- df %>%
  group_by(name) %>%
  summarize(movies_watched = n(), avg_rating = mean(rating)) %>%
  arrange(desc(movies_watched))

by_friend

## # A tibble: 5 × 3
##   name      movies_watched avg_rating
##   <chr>              <int>      <dbl>
## 1 Bianca                 6       3.67
## 2 Andy                   3       4   
## 3 Hannelore              3       3.67
## 4 Henric                 3       4.67
## 5 Jason                  3       3.67

Bianca loves movies - which is why I asked her!

It would be great to look at each movie’s ratings, and how many times they were watched:

# compute the number of times watched and avg rating by movie
by_movie <- df %>%
  group_by(title) %>%
  summarize(times_watched = n(), avg_rating = mean(rating)) %>%
  arrange(desc(avg_rating))

by_movie

## # A tibble: 6 × 3
##   title          times_watched avg_rating
##   <chr>                  <int>      <dbl>
## 1 Nope                       3       4.67
## 2 Prey                       4       4.5 
## 3 Top Gun                    5       4.4 
## 4 Bullet Train               2       4   
## 5 Elvis                      2       2.5 
## 6 Jurassic World             2       1.5

We can see Nope had the highest average rating, while Top Gun was watched by the most friends.

Conclusion

The data is ready for further analysis. Some visualizations that would make sense for exploratory data analysis are bar charts for the number of times each movie was watched, a bar chart for the number of movies watched per friend, and a histogram showing the frequency of ratings.

Thank you for reading!

DATA607_Assignment2

Josh Iden

2022-09-10

Introduction

Collecting The Data

Inspecting The Data

Conclusion