Homework 8

knitr::opts_chunk$set() ## Import Libraries

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## 
## Attaching package: 'plotly'
## 
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## 
## The following object is masked from 'package:graphics':
## 
##     layout

dbListTables(con)

## [1] "actor"         "category"      "customer"      "film"         
## [5] "film_actor"    "film_category" "inventory"     "payment"      
## [9] "rental"

SQL

Question 1

List the top 10 customers who have rented the most films, including their first and last names, email addresses, and the total number of films they have rented.

SELECT
first_name, last_name, email, count(rental.rental_id) as total_rentals
FROM
customer
JOIN
rental USING (customer_id)
GROUP BY
first_name, last_name, email
ORDER BY
total_rentals DESC
LIMIT
10

Displaying records 1 - 10
first_name	last_name	email	total_rentals
ELEANOR	HUNT	ELEANOR.HUNT@sakilacustomer.org	46
KARL	SEAL	KARL.SEAL@sakilacustomer.org	45
CLARA	SHAW	CLARA.SHAW@sakilacustomer.org	42
MARCIA	DEAN	MARCIA.DEAN@sakilacustomer.org	42
TAMMY	SANDERS	TAMMY.SANDERS@sakilacustomer.org	41
SUE	PETERS	SUE.PETERS@sakilacustomer.org	40
WESLEY	BULL	WESLEY.BULL@sakilacustomer.org	40
MARION	SNYDER	MARION.SNYDER@sakilacustomer.org	39
RHONDA	KENNEDY	RHONDA.KENNEDY@sakilacustomer.org	39
TIM	CARY	TIM.CARY@sakilacustomer.org	39

Question 2

Find the titles and descriptions of films that have never been rented, sorted alphabetically by title.

SELECT
title, description
FROM
film
LEFT JOIN
inventory
ON
inventory.film_id = film.film_id
LEFT JOIN
rental
ON 
rental.inventory_id = inventory.inventory_id
WHERE
rental_id IS NULL
ORDER BY
title;

Displaying records 1 - 10
title	description
ACADEMY DINOSAUR	A Epic Drama of a Feminist And a Mad Scientist who must Battle a Teacher in The Canadian Rockies
ALICE FANTASIA	A Emotional Drama of a A Shark And a Database Administrator who must Vanquish a Pioneer in Soviet Georgia
APOLLO TEEN	A Action-Packed Reflection of a Crocodile And a Explorer who must Find a Sumo Wrestler in An Abandoned Mine Shaft
ARGONAUTS TOWN	A Emotional Epistle of a Forensic Psychologist And a Butler who must Challenge a Waitress in An Abandoned Mine Shaft
ARK RIDGEMONT	A Beautiful Yarn of a Pioneer And a Monkey who must Pursue a Explorer in The Sahara Desert
ARSENIC INDEPENDENCE	A Fanciful Documentary of a Mad Cow And a Womanizer who must Find a Dentist in Berlin
BOONDOCK BALLROOM	A Fateful Panorama of a Crocodile And a Boy who must Defeat a Monkey in The Gulf of Mexico
BUTCH PANTHER	A Lacklusture Yarn of a Feminist And a Database Administrator who must Face a Hunter in New Orleans
CATCH AMISTAD	A Boring Reflection of a Lumberjack And a Feminist who must Discover a Woman in Nigeria
CHINATOWN GLADIATOR	A Brilliant Panorama of a Technical Writer And a Lumberjack who must Escape a Butler in Ancient India

Question 3

Retrieve a list of film categories and the average length of films in each category, ordered by the average length in descending order. Include the category name and the average length.

SELECT
name, avg(film.length) AS average_length
FROM
category
LEFT JOIN
film_category ON category.category_id = film_category.category_id
LEFT JOIN
film ON film_category.film_id = film.film_id
GROUP BY
category.name
ORDER BY
average_length DESC;

Displaying records 1 - 10
name	average_length
Sports	128.2027
Games	127.8361
Foreign	121.6986
Drama	120.8387
Comedy	115.8276
Family	114.7826
Music	113.6471
Travel	113.3158
Horror	112.4821
Classics	111.6667

Question 4

Identify the 5 actors who have appeared in the most films, along with the number of films they’ve acted in. Include their first and last names.

SELECT
first_name, last_name, count(film_actor.actor_id) AS number_of_films
FROM
actor
JOIN 
film_actor ON film_actor.actor_id = actor.actor_id
GROUP BY
first_name, last_name
ORDER BY
number_of_films DESC
LIMIT 5;

5 records
first_name	last_name	number_of_films
SUSAN	DAVIS	54
GINA	DEGENERES	42
WALTER	TORN	41
MARY	KEITEL	40
MATTHEW	CARREY	39

Question 5

List the names of customers who have rented at least one film starring the actor ’Johnny Depp.’

SELECT customer.first_name, customer.last_name
FROM customer JOIN rental USING (customer_id)
  JOIN inventory USING (inventory_id)
  JOIN film_actor USING (film_id)
  JOIN actor USING (actor_id)
WHERE actor.first_name = 'Johnny' AND actor.last_name = 'Depp';

0 records
first_name	last_name

Question 6

Find the top 10 earning film titles along with their total revenue (rental fees), sorted by revenue in descending order. Include the film title and total revenue.**

SELECT title, sum(payment.amount)
FROM film JOIN inventory USING (film_id)
  JOIN rental USING (inventory_id)
  JOIN payment USING (rental_id)
GROUP BY film.title
ORDER BY sum(payment.amount) DESC
LIMIT 10;

Displaying records 1 - 10
title	sum(payment.amount)
TELEGRAPH VOYAGE	231.73
WIFE TURN	223.69
ZORRO ARK	214.69
GOODFELLAS SALUTE	209.69
SATURDAY LAMBS	204.72
TITANS JERK	201.71
TORQUE BOUND	198.72
HARRY IDAHO	195.70
INNOCENT USUAL	191.74
HUSTLER PARTY	190.78

Plotly

Question 1

Use the diamonds dataset to create a 2D plotly scatterplot (you can choose the variables). The plot should have a title and subtitle, axis labels, and hover information that is useful.

head(diamonds,2)

## # A tibble: 2 × 10
##   carat cut     color clarity depth table price     x     y     z
##   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal   E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium E     SI1      59.8    61   326  3.89  3.84  2.31

p1 <- plot_ly(diamonds, x = ~carat, y = ~price, 
        mode = "markers", 
        type = "scatter")
p1

p2 <- p1 %>% layout(
  title = "Scatterplot of Carat vs. Price",
  xaxis = list(title = "Carat"),
  yaxis = list(title = "Price"),
  annotations = list(
    list(
      x = 0.5,
      y = 1,
      xref = "paper",
      yref = "paper",
      text = "From the diamonds dataset",
      showarrow = FALSE
    )
))
p2

Question 2

Compare and contrast ggplot2 and plotly. What is an example of a time when you would definitely want to use each one over the other? –> From what I’ve seen, it seems like ggplot2 is faster and has more features. Although they both use the same function of adding on layers, it seems that ggplot2 has more capabilities and can make more plots with various features. For example, if I am making a more complex plot like a violin plot, where I don’t need to hover over information, it seems best to use ggplot2. Even something like a bar plot or histogram that shows a distribution is better with ggplot2. On the other hand, plotly is best for animations, or for when it is best to hover over information. For example, the scatterplots with a large dataset make the points hard to read their exact location. Plotly allows the user to see where exactly those points are.

Question 3

using the gapminder dataset, make a Plotly plot which compares two variables over time. Explain the meaning of this comparison in context and describe what narrative is advanced by this comparison.

head(gapminder, 2)

## # A tibble: 2 × 6
##   country     continent  year lifeExp     pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>   <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8 8425333      779.
## 2 Afghanistan Asia       1957    30.3 9240934      821.

View(gapminder)

us_data <- gapminder[gapminder$country == "United States", ]

p3 <- us_data %>% plot_ly(x = ~year, y = ~pop, type = "scatter",
                    mode = "lines", name = "population")


final_plot <- p3 %>% layout(
  title = "United States Population Over Time (1952-2007)",
  xaxis = list(title="Year"),
  yaxis = list(title="Population"),
  showlegend=TRUE
)

final_plot

Description

The comparison shown in this plotly is the population over time in the United States. I took the gapminder dataset, and used dplyr to select only the rows that observed the United States population. The narrative that is advanced through this graph is that the population in the United States has grown at a relatively steady and linear rate since 1952. By hovering over the line, 1952 shows a US population of approximately 157 Million, and that has increased to 301 million in 2007. The popluation is close to doubling over this time period. The line shows a linear rate of increase.

Homework 8

Sam Davenport

2023-11-08

SQL

Question 1

Question 2

Question 3

Question 4

Question 5

Question 6

Plotly

Question 1

Question 2

Question 3

Description