library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.1.0
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(RMySQL)
## Loading required package: DBI
library(dbplyr)
##
## Attaching package: 'dbplyr'
##
## The following objects are masked from 'package:dplyr':
##
## ident, sql
library(DBI)
library(keyring)
I imported a table called Movie Ratings Survey through the MySQL database Movie Rating. In order to hide the password, I used the Keyring library in order to gain access into the MySQL Database. The dataset consists of ratings from family and friends of movies that were recently released. The rating scale is from 1 to 5, 1 representing the person extremely dislikes the movie and would not recommend the movie, and 5 being extremely likes the movie and would recommend the movie.
#psswd <- .rs.askForPassword("Database Password:")
sql_db_con <- dbConnect(MySQL(), user='root', password=key_get('MySQL', 'root'),
dbname="Movie_Rating", host='localhost')
dbListTables(sql_db_con)
## [1] "movie ratings survey" "movie_ratings" "movie_ratings_survey"
movie_ratings <- tbl(sql_db_con, 'Movie_Ratings_Survey')
movie_ratings
## # Source: table<Movie_Ratings_Survey> [?? x 3]
## # Database: mysql 8.0.32 [@localhost:/Movie_Rating]
## Name Film Rating
## <chr> <chr> <int>
## 1 Tim Women Talking 3
## 2 Tim Top Gun: Maverick 5
## 3 Tim Avatar 4
## 4 Tim Elvis 5
## 5 Tim Wakanda Forever 4
## 6 Tim Smile 0
## 7 Cindy Women Talking 4
## 8 Cindy Top Gun: Maverick 3
## 9 Cindy Avatar 5
## 10 Cindy Elvis 2
## # … with more rows
dbListFields(sql_db_con, 'Movie_Ratings_Survey')
## [1] "Name" "Film" "Rating"
rs = dbSendQuery(sql_db_con, "select * from Movie_Ratings_Survey")
data <- fetch(rs, n=-1)
data
## Name Film Rating
## 1 Tim Women Talking 3
## 2 Tim Top Gun: Maverick 5
## 3 Tim Avatar 4
## 4 Tim Elvis 5
## 5 Tim Wakanda Forever 4
## 6 Tim Smile 0
## 7 Cindy Women Talking 4
## 8 Cindy Top Gun: Maverick 3
## 9 Cindy Avatar 5
## 10 Cindy Elvis 2
## 11 Cindy Wakanda Forever 0
## 12 Cindy Smile 5
## 13 Carolyn Women Talking 4
## 14 Carolyn Top Gun: Maverick 0
## 15 Carolyn Avatar 5
## 16 Carolyn Elvis 5
## 17 Carolyn Wakanda Forever 5
## 18 Carolyn Smile 4
## 19 Tiffany Women Talking 5
## 20 Tiffany Top Gun: Maverick 0
## 21 Tiffany Avatar 5
## 22 Tiffany Elvis 4
## 23 Tiffany Wakanda Forever 4
## 24 Tiffany Smile 5
## 25 Wesley Women Talking 3
## 26 Wesley Top Gun: Maverick 5
## 27 Wesley Avatar 4
## 28 Wesley Elvis 4
## 29 Wesley Wakanda Forever 4
## 30 Wesley Smile 3
## 31 Alex Women Talking 0
## 32 Alex Top Gun: Maverick 4
## 33 Alex Avatar 5
## 34 Alex Elvis 3
## 35 Alex Wakanda Forever 5
## 36 Alex Smile 5
Now that my data is loaded, I explored the dataset further:
glimpse(data)
## Rows: 36
## Columns: 3
## $ Name <chr> "Tim", "Tim", "Tim", "Tim", "Tim", "Tim", "Cindy", "Cindy", "Ci…
## $ Film <chr> "Women Talking", "Top Gun: Maverick", "Avatar", "Elvis", "Wakan…
## $ Rating <int> 3, 5, 4, 5, 4, 0, 4, 3, 5, 2, 0, 5, 4, 0, 5, 5, 5, 4, 5, 0, 5, …
dim(data)
## [1] 36 3
mean(data$Rating)
## [1] 3.638889
median(data$Rating)
## [1] 4
colSums(data==0)
## Name Film Rating
## 0 0 5
There are 5 values in the dataset that have a 0 rating, which indicates that the respondents may not have watched the movie and therefore did not give a rating for the movie. To fill in those values, I subsetted the data looking at the mean and median based on Name and Film, respectively. From the results, I will choose how to fill in the zero values.
new_data <- data %>%
group_by(Name) %>%
summarise(new_rating = mean(Rating))
new_data
## # A tibble: 6 × 2
## Name new_rating
## <chr> <dbl>
## 1 Alex 3.67
## 2 Carolyn 3.83
## 3 Cindy 3.17
## 4 Tiffany 3.83
## 5 Tim 3.5
## 6 Wesley 3.83
new_film_data <- data %>%
group_by(Film) %>%
summarise(new_rating = mean(Rating))
new_film_data
## # A tibble: 6 × 2
## Film new_rating
## <chr> <dbl>
## 1 Avatar 4.67
## 2 Elvis 3.83
## 3 Smile 3.67
## 4 Top Gun: Maverick 2.83
## 5 Wakanda Forever 3.67
## 6 Women Talking 3.17
new_data_median <- data %>%
group_by(Name) %>%
summarise(new_rating = median(Rating))
new_data_median
## # A tibble: 6 × 2
## Name new_rating
## <chr> <dbl>
## 1 Alex 4.5
## 2 Carolyn 4.5
## 3 Cindy 3.5
## 4 Tiffany 4.5
## 5 Tim 4
## 6 Wesley 4
new_film_median <- data %>%
group_by(Film) %>%
summarise(new_rating = median(Rating))
new_film_median
## # A tibble: 6 × 2
## Film new_rating
## <chr> <dbl>
## 1 Avatar 5
## 2 Elvis 4
## 3 Smile 4.5
## 4 Top Gun: Maverick 3.5
## 5 Wakanda Forever 4
## 6 Women Talking 3.5
I transformed the zero values into NA values in order to replace the values with either the mean or median based on Name or Film.
data[data == 0] <- NA
data
## Name Film Rating
## 1 Tim Women Talking 3
## 2 Tim Top Gun: Maverick 5
## 3 Tim Avatar 4
## 4 Tim Elvis 5
## 5 Tim Wakanda Forever 4
## 6 Tim Smile NA
## 7 Cindy Women Talking 4
## 8 Cindy Top Gun: Maverick 3
## 9 Cindy Avatar 5
## 10 Cindy Elvis 2
## 11 Cindy Wakanda Forever NA
## 12 Cindy Smile 5
## 13 Carolyn Women Talking 4
## 14 Carolyn Top Gun: Maverick NA
## 15 Carolyn Avatar 5
## 16 Carolyn Elvis 5
## 17 Carolyn Wakanda Forever 5
## 18 Carolyn Smile 4
## 19 Tiffany Women Talking 5
## 20 Tiffany Top Gun: Maverick NA
## 21 Tiffany Avatar 5
## 22 Tiffany Elvis 4
## 23 Tiffany Wakanda Forever 4
## 24 Tiffany Smile 5
## 25 Wesley Women Talking 3
## 26 Wesley Top Gun: Maverick 5
## 27 Wesley Avatar 4
## 28 Wesley Elvis 4
## 29 Wesley Wakanda Forever 4
## 30 Wesley Smile 3
## 31 Alex Women Talking NA
## 32 Alex Top Gun: Maverick 4
## 33 Alex Avatar 5
## 34 Alex Elvis 3
## 35 Alex Wakanda Forever 5
## 36 Alex Smile 5
For the values that have a zero, I will use the median value of the respondents’ response to the other movies they gave ratings for.
final_dataset <- data %>%
group_by(Name) %>%
mutate_at("Rating", function(x) replace(x, is.na(x), median(x, na.rm = TRUE)))
final_dataset
## # A tibble: 36 × 3
## # Groups: Name [6]
## Name Film Rating
## <chr> <chr> <dbl>
## 1 Tim Women Talking 3
## 2 Tim Top Gun: Maverick 5
## 3 Tim Avatar 4
## 4 Tim Elvis 5
## 5 Tim Wakanda Forever 4
## 6 Tim Smile 4
## 7 Cindy Women Talking 4
## 8 Cindy Top Gun: Maverick 3
## 9 Cindy Avatar 5
## 10 Cindy Elvis 2
## # … with 26 more rows
Checking to see if there’s any NA values in final_dataset created:
sum(is.na(final_dataset))
## [1] 0
new_final_dataset <- final_dataset %>%
group_by(Film) %>%
summarise(new_final_rating = mean(Rating))
new_final_dataset
## # A tibble: 6 × 2
## Film new_final_rating
## <chr> <dbl>
## 1 Avatar 4.67
## 2 Elvis 3.83
## 3 Smile 4.33
## 4 Top Gun: Maverick 4.5
## 5 Wakanda Forever 4.33
## 6 Women Talking 4
I created a barplot using the new_final_dataset:
As the barplot shows, Avatar rated the highest among the respondents, while Elvis rated the lowest.