DS Labs Homework

Author

Myriam O.

Load Library

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(highcharter)
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
library(dslabs)

Attaching package: 'dslabs'

The following object is masked from 'package:highcharter':

    stars
library(extrafont)
Registering fonts with R

Load data

data("movielens")

Cleaning

group_by

data_summary <- movielens |>
  group_by(movieId, title, genres) |>
  summarise(avg_rating = mean(rating), num_ratings = n(), .groups = "drop")

I use group_by to group the data by movie, then summarise to calculate the average rating and the total number of ratings for each movie.
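As a minimal sketch of what this step does, the example below runs the same group_by and summarise pattern on a small made-up data frame. The `toy` data and its values are assumptions for illustration only, not rows from movielens.

```r
library(dplyr)

# Hypothetical toy ratings: movie 1 has three ratings, movie 2 has two
toy <- data.frame(
  movieId = c(1, 1, 1, 2, 2),
  title   = c("A", "A", "A", "B", "B"),
  rating  = c(4, 5, 3, 2, 4),
  stringsAsFactors = FALSE
)

# One output row per movie, with its mean rating and rating count
toy_summary <- toy |>
  group_by(movieId, title) |>
  summarise(avg_rating = mean(rating), num_ratings = n(), .groups = "drop")

print(toy_summary)
```

Each movie collapses to a single row, which is why the later steps can filter and label movies rather than individual ratings.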

filter

data_cleaning <- data_summary |>
  filter(num_ratings >= 100)

I filter the data to keep only movies with at least 100 ratings, because an average based on many ratings is more reliable than one based on a few.
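A small sketch of how the threshold behaves, using made-up counts (the `toy_counts` data frame is hypothetical). It shows that a movie with exactly 100 ratings is kept, because the condition uses >= rather than >.

```r
library(dplyr)

# Hypothetical per-movie rating counts
toy_counts <- data.frame(
  title       = c("A", "B", "C"),
  num_ratings = c(250, 99, 100),
  stringsAsFactors = FALSE
)

# Keep movies with 100 or more ratings; "B" (99) is dropped
kept <- toy_counts |>
  filter(num_ratings >= 100)

print(kept)
```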

mutate

data_cleaning <- data_cleaning |>
  mutate(rating_group = case_when(avg_rating < 4 ~ "Low",
                                  TRUE ~ "High"))

I create a new variable that divides movies into two groups based on their average rating: "Low" for averages below 4 and "High" for averages of 4 or more.
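To illustrate where the cutoff falls, here is a minimal sketch with made-up averages (the `toy_avg` values are assumptions). It uses the same case_when rule as above, so an average of exactly 4 lands in the "High" group.

```r
library(dplyr)

# Hypothetical average ratings around the cutoff of 4
toy_avg <- data.frame(avg_rating = c(3.2, 3.99, 4.0, 4.4))

# Same rule as in the cleaning step: below 4 is "Low", everything else "High"
toy_avg <- toy_avg |>
  mutate(rating_group = case_when(avg_rating < 4 ~ "Low",
                                  TRUE ~ "High"))

print(toy_avg)
```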

Highcharter

# Series colors, in the order the series are added: Low, then High
cols <- c("steelblue", "darkred")

# One scatter series per rating group; the text aesthetic holds the tooltip HTML
highchart() |>
  hc_add_series(data = data_cleaning[data_cleaning$rating_group == "Low", ],
                type = "scatter",
                hcaes(x = num_ratings,
                      y = avg_rating,
                      text = paste("Title:", title,
                                   "<br>Genre:", genres,
                                   "<br>Ratings:", num_ratings,
                                   "<br>Avg_rating:", round(avg_rating, 2))),
                name = "Low Ratings") |>
  hc_add_series(data = data_cleaning[data_cleaning$rating_group == "High", ],
                type = "scatter",
                hcaes(x = num_ratings,
                      y = avg_rating,
                      text = paste("Title:", title,
                                   "<br>Genre:", genres,
                                   "<br>Ratings:", num_ratings,
                                   "<br>Avg_rating:", round(avg_rating, 2))),
                name = "High Ratings") |>
  hc_title(text = "Popular Movies vs Average Ratings") |>
  hc_xAxis(title = list(text = "Number of ratings")) |>
  hc_yAxis(title = list(text = "Average Rating")) |>
  hc_colors(cols) |>
  # Show the prebuilt text for the hovered point only
  hc_tooltip(shared = FALSE,
             pointFormat = "{point.text}") |>
  hc_plotOptions(scatter = list(marker = list(radius = 7))) |>
  hc_chart(style = list(fontFamily = "Arial",
                        fontWeight = "bold"))

Brief Essay

I used the MovieLens dataset to study the relationship between the number of ratings and the average rating of movies. I created a bubble plot using ggplot, where the x-axis shows the number of ratings and the y-axis shows the average rating. The size of each bubble represents how many ratings a movie has, and the color shows the average rating, with blue for lower ratings and red for higher ratings. I also used the extrafont library to improve the font and make the graph look better.

In addition, I created an interactive scatterplot using Highcharter. In this plot, movies are divided into low rating and high rating groups with different colors. When you hover over a point, you can see more information like the movie title, genre, number of ratings, and average rating.
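The hover text is just a string assembled with paste, with `<br>` marking the line breaks in the tooltip. A minimal sketch for a single made-up movie (the title, genre, and numbers are hypothetical, not taken from movielens):

```r
# Hypothetical values for one movie
title       <- "Example Movie"
genres      <- "Drama"
num_ratings <- 247
avg_rating  <- 3.874

# Same paste pattern as in the chart; paste separates pieces with a space
tooltip_text <- paste("Title:", title,
                      "<br>Genre:", genres,
                      "<br>Ratings:", num_ratings,
                      "<br>Avg_rating:", round(avg_rating, 2))

print(tooltip_text)
```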

At first, I tried to create a bubble plot in Highcharter to match my main plot, but it looked too messy and was harder to read, so I decided to use a scatterplot instead. I also tried to combine all the points in one chart, because I noticed that the low ratings and high ratings were separated. I am not sure if what I did is correct, but I found a way to display them together, as in the visualization.
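One possible alternative for showing both groups on one chart, offered only as a sketch: highcharter's hchart() accepts a group aesthetic inside hcaes(), which splits a data frame into one series per group value. The `toy_movies` data frame below is made up for illustration and stands in for data_cleaning.

```r
library(highcharter)

# Hypothetical stand-in for data_cleaning
toy_movies <- data.frame(
  title        = c("A", "B", "C", "D"),
  num_ratings  = c(120, 310, 150, 275),
  avg_rating   = c(3.4, 4.2, 3.9, 4.1),
  rating_group = c("Low", "High", "Low", "High"),
  stringsAsFactors = FALSE
)

# A single call; the group aesthetic creates one scatter series per rating group
hc <- hchart(toy_movies, "scatter",
             hcaes(x = num_ratings, y = avg_rating, group = rating_group)) |>
  hc_title(text = "Popular Movies vs Average Ratings")

hc
```

With this approach the series order follows the group values rather than the order of separate hc_add_series calls, so any colors passed to hc_colors may need to be reordered to match.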