Let’s install all required packages

install.packages("readr")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("purrr")
install.packages("magrittr")
install.packages("stringr")
print('All packages installed')

Let’s now import our packages

library(dplyr)
library(readr)
library(ggplot2)
library(magrittr)
library(stringr)
print("Imported!")

Fandango has been suspected of releasing inflated ratings to increase ticket sales. After finding that some films with poor ratings elsewhere were rated highly on Fandango, analysts from FiveThirtyEight investigated and published an article about bias in movie ratings.

To conduct the investigation, the team compiled data for 147 films from 2015 with reviews from movie critics and consumers.

In this mission, you’ll use this data and ggplot2 to visualize reviews from Metacritic, Fandango, Rotten Tomatoes, and IMDB to get a sense for differences in the way the four sites compute movie ratings.

We have made a subset of the data available for you to work with in this mission in a file named “movie_reviews.csv”. The file contains three columns: FILM, Rating_Site, and Rating.

# Link to the original FiveThirtyEight data, in case you want to read it directly from GitHub
reviews_link <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/fandango/fandango_scrape.csv"  

reviews <- read_csv("movie_reviews.csv")
# Output:
# Parsed with column specification:
# cols(
#   FILM = col_character(),
#   Rating_Site = col_character(),
#   Rating = col_double()
# )
#View(reviews)
head(reviews)

Let’s start by getting a sense for how reviews reported by the four sites compare.

You can approach this problem by calculating the average ratings for each rating site. To do this, you could group the reviews data frame into one group for each value of Rating_Site and calculate the average of Rating for each group.

Recall from the R Intermediate course that such problems, where you perform summary calculations on grouped data, are known as “split-apply-combine” (SAC) problems.
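The three steps can be spelled out by hand in base R. This is only a sketch on a small, made-up data frame (toy_reviews and its values are hypothetical, not taken from movie_reviews.csv), meant to show what happens conceptually when you group and summarize.

```r
# Hypothetical toy data; the values are made up for illustration only.
toy_reviews <- data.frame(
  Rating_Site = c("SiteA", "SiteA", "SiteB", "SiteB", "SiteB"),
  Rating      = c(4.0, 5.0, 2.0, 3.0, 4.0)
)

# Split: one vector of ratings per site.
ratings_by_site <- split(toy_reviews$Rating, toy_reviews$Rating_Site)

# Apply: compute the mean of each vector.
site_means <- sapply(ratings_by_site, mean)

# Combine: gather the per-site results into a single data frame.
toy_avgs <- data.frame(
  Rating_Site = names(site_means),
  avg         = unname(site_means)
)
toy_avgs
```

With dplyr, group_by() handles the split, and summarize() handles the apply and combine steps in one call.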

Once you’ve calculated average ratings for each site, we’ll introduce you to a new type of graph for visualizing comparisons among groups, or categories, of data.

Let’s calculate average ratings for each movie rating site.

# Instructions
# Use group_by() to group the reviews data frame by Rating_Site.
# Use summarize() to calculate the average Rating for each Rating_Site. Save the summary data frame as review_avgs.
review_avgs <- reviews %>% group_by(Rating_Site) %>% summarise(avg = mean(Rating))
review_avgs

Like line graphs, bar charts depict the relationship between two variables. To create a bar chart to visualize the average ratings for each movie rating site, you would use the review_avgs data frame. The syntax for the data and aesthetics layers you’ll specify when creating a bar chart with ggplot2 is the same as the syntax you learned when creating line graphs. The layer that distinguishes a bar chart from a line graph is the layer in which you’ll specify the geometric shape used to display the data. While before you used geom_line(), now you’ll use geom_bar():

ggplot(data = review_avgs) +
  aes(x = Rating_Site, y = avg) + 
  geom_bar(stat = 'identity')

In the code above, we specify stat = "identity" within the geom_bar() layer. By default, geom_bar() counts the rows for each value of the x-variable and draws bars whose heights equal those counts. Using stat = "identity" overrides the default behavior and creates bars whose heights equal the value of the y-variable, which here is the average.
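To see the difference between the two behaviors, here is a small sketch on made-up data (toy_reviews, SiteA, and SiteB are hypothetical names, not part of the mission data). It builds one chart with the default counting behavior and one with stat = "identity", and also shows geom_col(), which ggplot2 provides as a shorthand for geom_bar(stat = "identity").

```r
library(ggplot2)

# Hypothetical toy data: three raw ratings per site (values are made up).
toy_reviews <- data.frame(
  Rating_Site = rep(c("SiteA", "SiteB"), each = 3),
  Rating      = c(4, 5, 3, 2, 3, 1)
)

# Default stat ("count"): map only x; bar height = number of rows per site.
count_plot <- ggplot(data = toy_reviews) +
  aes(x = Rating_Site) +
  geom_bar()

# Pre-computed summary with one row per site.
toy_avgs <- aggregate(Rating ~ Rating_Site, data = toy_reviews, FUN = mean)

# stat = "identity": bar height = the y value itself (the average).
identity_plot <- ggplot(data = toy_avgs) +
  aes(x = Rating_Site, y = Rating) +
  geom_bar(stat = "identity")

# geom_col() is shorthand for geom_bar(stat = "identity").
col_plot <- ggplot(data = toy_avgs) +
  aes(x = Rating_Site, y = Rating) +
  geom_col()
```

Printing count_plot shows two bars of height 3 (the number of rows per site), while identity_plot and col_plot show bars at the average rating for each site.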

# Instructions
# 
# Use ggplot2 to create a bar chart depicting the average movie ratings for Rotten Tomatoes, IMDB, Metacritic, and Fandango. We have loaded the ggplot2 package for you.
# Use the review_avgs data frame that contains your calculated average ratings for each site.

ggplot(data = review_avgs) + 
  aes(x = Rating_Site, y = avg, color=Rating_Site) +
  geom_bar(stat = 'identity')

As you look at the chart, you can clearly see that Fandango has a higher average movie rating than the other three sites. Does this mean Fandango tends to give higher ratings?

As you consider that question, let’s think about what the bar chart does not show us. It makes sense to wonder if Fandango’s average movie rating is higher than those of the other sites because it tends to give all movies good ratings, or because it gave some movies average ratings and a small number of movies excellent ratings.

However, the bar graph does not provide this information.

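The average of a set of numbers does not tell us anything about the spread of the numbers that were used to calculate the average. For example, consider these two variables. Variable 1 has an average of 5 and Variable 2 has an average of about 7.2, but neither average reveals how differently the underlying values are spread: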

Variable 1: 5 5 5 5 5 4 5 5 6 5
Variable 2: 20 9 1 2 8 4 9 5 7

However, while values of Variable 1 are distributed between 4 and 6, values of Variable 2 are distributed between 1 and 20. The values of Variable 2 are much more spread out than those of Variable 1.
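The two variables listed above can be checked directly in R. In this short sketch, mean() reports the center of each variable, while range() and sd() describe the spread (variable_1 and variable_2 are just names chosen here for the two rows of numbers).

```r
# The two example variables from the text.
variable_1 <- c(5, 5, 5, 5, 5, 4, 5, 5, 6, 5)
variable_2 <- c(20, 9, 1, 2, 8, 4, 9, 5, 7)

# Center: a single number per variable.
mean(variable_1)
mean(variable_2)

# Spread: smallest and largest values, and the standard deviation.
range(variable_1)
range(variable_2)
sd(variable_1)
sd(variable_2)
```

The standard deviation of variable_2 is far larger than that of variable_1, which is exactly the information an average on its own hides.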

This leads us to the next chart, which is called a histogram. A histogram shows the count, or frequency, of values that fall within each interval (bin) of the rating scale.

To visualize data using a histogram, the syntax for the data and aesthetics layers is similar to what you have used to generate line graphs and bar charts:

ggplot(data = reviews) + 
  aes(x = Rating) +
  geom_histogram(binwidth = 1)

Within the aes() layer, you only need to specify the independent variable. Remember that when you create a histogram, the dependent variable, count, is calculated for you.

The geom_histogram() layer specifies creation of a histogram to represent the independent variable. The argument binwidth = 1 specifies the size of the categories used to bin the values of the independent variable.

Within the geom_histogram() layer, you can use two different arguments to specify the number of categories for binning the independent variable.

binwidth = allows you to specify the size of the bins, and is useful for instances, such as this example, where you want categories to span specific intervals.

bins = allows you to specify the number of bins, which can be useful to experiment with when deciding how much detail you want to use to display your data.

If you don’t use either argument within the geom_histogram() layer, ggplot2 will default to 30 bins and print a message suggesting that you choose a better value.
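As a quick comparison of the two arguments, the sketch below draws the same made-up ratings twice (toy_reviews and its values are hypothetical): once with a fixed bin size and once with a fixed number of bins.

```r
library(ggplot2)

# Hypothetical toy ratings on a 0-5 scale (values are made up).
toy_reviews <- data.frame(
  Rating = c(1.0, 2.5, 3.0, 3.5, 3.5, 4.0, 4.0, 4.0, 4.5, 5.0)
)

# binwidth: every bin spans one rating point.
by_width <- ggplot(data = toy_reviews) +
  aes(x = Rating) +
  geom_histogram(binwidth = 1)

# bins: ask for five bins and let ggplot2 work out their width.
by_count <- ggplot(data = toy_reviews) +
  aes(x = Rating) +
  geom_histogram(bins = 5)
```

Both plots account for all ten ratings; only the way the values are grouped into bars changes.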

# Instructions
# 
# Create a histogram to show the distribution of all values of the Rating variable in the reviews data frame.
# Specify 30 bins to categorize values of the independent variable.

ggplot(data = reviews) + 
  aes(x = Rating) + 
  geom_histogram(bins = 30)

Histograms allow you to visualize the shape of a distribution — where values of the data are clustered. Most values of Rating are clustered between 3.5 and 4.5.

This histogram tells us about the distribution of all values of the Rating variable, but what we really want to investigate is how ratings for different rating sites differ.

One way to compare Rating distributions for the four sites is to create a faceted plot, as you learned to do for line graphs.

Recall that to create a faceted plot for categories of a variable, you can add a facet_wrap() layer to your graph and pass it a formula, such as ~Rating_Site, that names the variable to facet by.

# Instructions
# 
# Add a layer to the histogram you created on the last screen to create a faceted graph containing four histograms of the distribution of Rating for each site:
# Rotten Tomatoes
# IMDB
# Metacritic
# Fandango
# Use nrow = 2 within the facet_wrap() layer to specify a two-by-two arrangement of the histograms.

ggplot(data = reviews) +
  aes(x = Rating) + 
  geom_histogram(bins = 30) +
  facet_wrap(~Rating_Site, nrow = 2)

The distributions of Rating for Rotten Tomatoes and Metacritic indicate that those two sites are more likely to give movies poor ratings than Fandango or IMDB, which have most values of Rating clustered over 3.

Comparing these distributions suggests some sites give poor ratings more often than others. For example, the difference between the distributions of Rating for Fandango and Rotten Tomatoes is very clear. However, Fandango and IMDB have distributions that look similar. Is there a better way to visualize differences between them?

Remember that when you created line charts, plotting multiple variables on the same set of axes was useful for creating a more nuanced comparison. Similarly, we can plot histograms for the four rating sites on the same set of axes.

As you did for line graphs, you can distinguish values associated with different variables by mapping them to different colors within the aes() layer:

ggplot(data = reviews) + 
  aes(x = Rating, color = Rating_Site) +
  geom_histogram(bins = 30)

When creating histograms (or bar charts), using the argument color = within aes() maps your specified variable to bar outlines of different colors.

Another option for distinguishing the values of Rating that belong to each Rating_Site is to use the argument fill = instead of color =. Instead of outlines, fill = depicts bars filled in with different colors. Let’s use both options to visualize differences in Rating distributions of the four sites.

# Instructions
# Create a histogram depicting the distribution of Ratings for each site using bars with different color lines.

ggplot(data = reviews) + 
  aes(x = Rating, color = Rating_Site) + 
  geom_histogram(bins = 30) + 
  facet_wrap(~Rating_Site, nrow = 2)

# Instructions
# Create a histogram depicting the distribution of Ratings for each site using bars filled with different colors.

ggplot(data = reviews) + 
  aes(x = Rating, fill = Rating_Site) + 
  geom_histogram(bins = 30) + 
  facet_wrap(~Rating_Site, nrow = 2)

# Plotting all four histograms on one canvas with different filled colors
ggplot(data = reviews) + 
  aes(x = Rating, fill = Rating_Site) + 
  geom_histogram(bins = 30)

# Plotting all four histograms on one canvas with different edge colors
ggplot(data = reviews) + 
  aes(x = Rating, color = Rating_Site) + 
  geom_histogram(bins = 30)
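One detail to keep in mind with the single-panel plots: geom_histogram() stacks groups by default (position = "stack"), so the bars for each site sit on top of one another rather than overlapping. If you would rather overlay the distributions, one option is position = "identity" combined with some transparency. The sketch below assumes made-up data (toy_reviews, SiteA, and SiteB are hypothetical), and the alpha value of 0.5 is an arbitrary choice.

```r
library(ggplot2)

# Hypothetical toy data for two sites (values are made up).
toy_reviews <- data.frame(
  Rating_Site = rep(c("SiteA", "SiteB"), each = 5),
  Rating      = c(3.5, 4.0, 4.0, 4.5, 5.0, 1.0, 2.0, 3.0, 4.0, 4.5)
)

# position = "identity" draws every group's bars from zero instead of stacking;
# alpha makes the bars semi-transparent so overlapping regions stay visible.
overlaid <- ggplot(data = toy_reviews) +
  aes(x = Rating, fill = Rating_Site) +
  geom_histogram(binwidth = 0.5, position = "identity", alpha = 0.5)
```

Overlaid histograms become hard to read with many groups, which is one reason the faceted version is often the clearer choice for four sites.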

Visualizing the distributions of Rating for each rating site makes it clear that Fandango is more likely to rate movies highly than the other sites are, which supports the argument that it is biased toward assigning higher ratings.

You’ll often use histograms in your data science career for initial explorations of your data. Knowing how to visualize and interpret distributions will become increasingly important later on when you learn about statistics and modeling.

Now that you’re able to make bar charts to visualize data summaries and histograms to visualize data distributions, we’ll introduce you to a type of plot for visualizing both the center of and the variation in your data.

Like bar charts, box plots provide a summary of data by group. Like histograms, they provide information about how data are spread.

To create a box plot using ggplot2, the syntax for creating the data layer and mapping data to x and y variables is familiar.

ggplot(data = reviews) +
  aes(x = Rating_Site, y = Rating)

You’ll add a geom_boxplot() layer to specify creation of a box plot.

Let’s create a box plot of ratings for each site in the reviews data frame.

# Instructions
# 
# Create a box plot to visualize summaries of values of the Rating variable for each value of Rating_Site.
# Add a geom_boxplot() layer to specify creation of a box plot.

ggplot(data = reviews) + 
  aes(x = Rating_Site, y = Rating, fill = Rating_Site) + 
  geom_boxplot()

In general, you can see that the box representing Fandango ratings is higher up on the y-axis than those for the other sites. You can also see the Rotten Tomatoes ratings appear to be more spread out, which is consistent with what you saw when we plotted the data using histograms.

While you’ve been able to glean some information from this box plot, let’s dig deeper into the individual components to fully understand all they can tell us about data.

Box plots present what is known to statisticians as a five-number summary. The five numbers are percentiles of the data you’re working with:

The largest value (excluding any outliers, described below): Represented by the top of the black line extending from the top of the box. These lines are also known as “whiskers”.

The third quartile (Q3): Represented by the top of the box. Seventy-five percent of the values are smaller than the third quartile.

The median: Represented by the thick black line. The median is the value that falls in the middle of the data.

The first quartile (Q1): Represented by the bottom of the box. Twenty-five percent of the values are smaller than the first quartile.

The smallest value (excluding any outliers): Represented by the bottom of the black line extending from the bottom of the box.

The box, bounded by Q3 and Q1, spans the interquartile range, or IQR. The IQR encompasses 50 percent of the data, and is calculated by subtracting Q1 from Q3.

In the box plot you created, notice there are some points that fall below the bottom of the black lines that represent the smallest values. These points are referred to as outliers because they are outside the range of what would be expected based on the rest of the data.

When you make a box plot using ggplot2, data points that fall below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are defined as outliers.
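The outlier rule can be worked through by hand. The sketch below uses a small made-up vector (toy_ratings is hypothetical) and base R's quantile() with its default settings; treat it as an illustration of the rule rather than a statement of exactly how ggplot2 computes its quartiles internally.

```r
# Hypothetical toy ratings (values are made up), with one unusually low value.
toy_ratings <- c(1.0, 3.5, 3.8, 4.0, 4.0, 4.2, 4.5, 4.5, 4.8, 5.0)

# First and third quartiles, and the interquartile range.
q1  <- unname(quantile(toy_ratings, 0.25))
q3  <- unname(quantile(toy_ratings, 0.75))
iqr <- q3 - q1

# Fences: anything beyond these limits counts as an outlier.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Keep only the values that fall outside the fences.
outliers <- toy_ratings[toy_ratings < lower_fence | toy_ratings > upper_fence]
outliers
```

Here Q1 is 3.85 and Q3 is 4.5, so the IQR is 0.65 and the fences sit at 2.875 and 5.475. The only value outside them is 1.0, which a box plot would draw as a separate point below the lower whisker.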

Now that you’ve delved into the meaning of the components of a box plot, what can you learn about the movie rating data? Here are some observations:

Values of Rating for Rotten Tomatoes are spread out, indicating they regularly give movies ratings that range from poor to excellent.

The ranges of values of Rating for Fandango and IMDB are both quite narrow. Fandango’s lowest reviews are around 2.5, while outliers indicate that IMDB has some reviews that are between 2 and 2.4.

Fandango’s median for values of Rating is higher than the median of the other sites, indicating Fandango tends to give higher ratings.

Does the box plot you made support the idea that Fandango’s reviews are biased? Which site do you think would provide the most unbiased reviews?

From this exercise, my personal opinion is that Rotten Tomatoes gives the most unbiased reviews.

# Instructions
# 
# In the previous exercise, you created a box plot to visualize summaries of ratings for Fandango, IMDB, Metacritic, and Rotten Tomatoes.
# 
# Add layers to your plot so it fits the following specifications:
# 
# White panel background
# The plot title: "Comparison of Movie Ratings"
ggplot(data = reviews) + 
  aes(x = Rating_Site, y = Rating, fill = Rating_Site) + 
  geom_boxplot() + 
  labs(title = "Comparison of Movie Ratings") +
  theme(panel.background = element_rect(fill = 'white'))

As we discussed earlier in this course, building intuition around when to use different types of visualizations to understand your data is an important skill you will develop.

When should you use the three types of plots you learned about in this mission? You will probably explore different options for visualizing each new data set, and doing so is a good idea. However, here are some general guidelines:

Bar charts may be used for showing a quick summary of your data, such as averages or counts of the number of instances of a value that occur for a given variable.

Histograms are useful for visualizing distributions of data when you want to know the shape of a distribution (in other words, where most values are clustered).

Box plots provide an informative summary of the shape, spread, and center of your data.

