This document can serve as a reasonable start to your work on HW 8. Note that you need to have all nine csv files in the same folder as this HTML file. If you connected to the Sakila database remotely (the intended way before connection issues), that’s fine.
# library(DBI)
#
# con <- dbConnect(
# RMariaDB::MariaDB(),
# host = "relational.fit.cvut.cz",
# port = 3306,
# username = "guest",
# password = "relational",
# dbname = "sakila"
# )
#
# dbListTables(con)
# dbDisconnect(con)
library(DBI)
library(dbplyr)
con <- dbConnect(drv = RSQLite::SQLite(),
dbname = ":memory:")
customer <- read.csv("customer.csv")
rental <- read.csv("rental.csv")
film <- read.csv("film.csv")
inventory <- read.csv("inventory.csv")
category <- read.csv("category.csv")
film_category <- read.csv("film_category.csv")
actor <- read.csv("actor.csv")
film_actor <- read.csv("film_actor.csv")
payment <- read.csv("payment.csv")
dbWriteTable(conn = con, name = "customer", value = customer)
dbWriteTable(conn = con, name = "rental", value = rental)
dbWriteTable(conn = con, name = "film", value = film)
dbWriteTable(conn = con, name = "inventory", value = inventory)
dbWriteTable(conn = con, name = "category", value = category)
dbWriteTable(conn = con, name = "film_category", value = film_category)
dbWriteTable(conn = con, name = "actor", value = actor)
dbWriteTable(conn = con, name = "film_actor", value = film_actor)
dbWriteTable(conn = con, name = "payment", value = payment)
knitr::opts_chunk$set(connection = "con")
dbListTables(con)
## [1] "actor" "category" "customer" "film"
## [5] "film_actor" "film_category" "inventory" "payment"
## [9] "rental"
SELECT *
FROM actor
| actor_id | first_name | last_name | last_update |
|---|---|---|---|
| 1 | PENELOPE | GUINESS | 2006-02-15 04:34:33 |
| 2 | NICK | WAHLBERG | 2006-02-15 04:34:33 |
| 3 | ED | CHASE | 2006-02-15 04:34:33 |
| 4 | JENNIFER | DAVIS | 2006-02-15 04:34:33 |
| 5 | JOHNNY | LOLLOBRIGIDA | 2006-02-15 04:34:33 |
| 6 | BETTE | NICHOLSON | 2006-02-15 04:34:33 |
| 7 | GRACE | MOSTEL | 2006-02-15 04:34:33 |
| 8 | MATTHEW | JOHANSSON | 2006-02-15 04:34:33 |
| 9 | JOE | SWANK | 2006-02-15 04:34:33 |
| 10 | CHRISTIAN | GABLE | 2006-02-15 04:34:33 |
QUESTION 1:
# dbListFields(con, "film")
# dbListFields(con, "customer")
# dbListFields(con, "rental")
# dbListFields(con, "inventory")
# #dbListFields(con, "category")
# #dbListFields(con, "film_category")
# dbListFields(con, "film_actor")
# dbListFields(con, "payment")
# dbListFields(con, "actor")
SELECT first_name, last_name, email, count(customer_id) AS total_rentals
FROM customer JOIN rental USING (customer_id) GROUP BY first_name, last_name, email ORDER BY total_rentals DESC
LIMIT 10;
| first_name | last_name | total_rentals | |
|---|---|---|---|
| ELEANOR | HUNT | ELEANOR.HUNT@sakilacustomer.org | 46 |
| KARL | SEAL | KARL.SEAL@sakilacustomer.org | 45 |
| CLARA | SHAW | CLARA.SHAW@sakilacustomer.org | 42 |
| MARCIA | DEAN | MARCIA.DEAN@sakilacustomer.org | 42 |
| TAMMY | SANDERS | TAMMY.SANDERS@sakilacustomer.org | 41 |
| SUE | PETERS | SUE.PETERS@sakilacustomer.org | 40 |
| WESLEY | BULL | WESLEY.BULL@sakilacustomer.org | 40 |
| MARION | SNYDER | MARION.SNYDER@sakilacustomer.org | 39 |
| RHONDA | KENNEDY | RHONDA.KENNEDY@sakilacustomer.org | 39 |
| TIM | CARY | TIM.CARY@sakilacustomer.org | 39 |
QUESTION 2
SELECT title, description
FROM film LEFT JOIN inventory USING (film_id) LEFT JOIN rental USING (inventory_id)
WHERE rental_id IS NULL ORDER BY title ASC;
| title | description |
|---|---|
| ACADEMY DINOSAUR | A Epic Drama of a Feminist And a Mad Scientist who must Battle a Teacher in The Canadian Rockies |
| ALICE FANTASIA | A Emotional Drama of a A Shark And a Database Administrator who must Vanquish a Pioneer in Soviet Georgia |
| APOLLO TEEN | A Action-Packed Reflection of a Crocodile And a Explorer who must Find a Sumo Wrestler in An Abandoned Mine Shaft |
| ARGONAUTS TOWN | A Emotional Epistle of a Forensic Psychologist And a Butler who must Challenge a Waitress in An Abandoned Mine Shaft |
| ARK RIDGEMONT | A Beautiful Yarn of a Pioneer And a Monkey who must Pursue a Explorer in The Sahara Desert |
| ARSENIC INDEPENDENCE | A Fanciful Documentary of a Mad Cow And a Womanizer who must Find a Dentist in Berlin |
| BOONDOCK BALLROOM | A Fateful Panorama of a Crocodile And a Boy who must Defeat a Monkey in The Gulf of Mexico |
| BUTCH PANTHER | A Lacklusture Yarn of a Feminist And a Database Administrator who must Face a Hunter in New Orleans |
| CATCH AMISTAD | A Boring Reflection of a Lumberjack And a Feminist who must Discover a Woman in Nigeria |
| CHINATOWN GLADIATOR | A Brilliant Panorama of a Technical Writer And a Lumberjack who must Escape a Butler in Ancient India |
QUESTION 3:
SELECT name as category_name, AVG(length) AS average_length
FROM category JOIN film_category USING (category_id) JOIN film USING (film_id)
GROUP BY name ORDER BY average_length DESC;
| category_name | average_length |
|---|---|
| Sports | 128.2027 |
| Games | 127.8361 |
| Foreign | 121.6986 |
| Drama | 120.8387 |
| Comedy | 115.8276 |
| Family | 114.7826 |
| Music | 113.6471 |
| Travel | 113.3158 |
| Horror | 112.4821 |
| Classics | 111.6667 |
QUESTION 4:
SELECT first_name, last_name, count(film_id) AS film_count
FROM actor JOIN film_actor USING (actor_id) GROUP BY first_name, last_name
ORDER BY film_count DESC LIMIT 5;
| first_name | last_name | film_count |
|---|---|---|
| SUSAN | DAVIS | 54 |
| GINA | DEGENERES | 42 |
| WALTER | TORN | 41 |
| MARY | KEITEL | 40 |
| MATTHEW | CARREY | 39 |
QUESTION 5:
SELECT customer.first_name AS customer_first_name,
customer.last_name AS customer_last_name
FROM customer JOIN rental USING (customer_id)
JOIN inventory USING (inventory_id) JOIN film USING (film_id) JOIN
film_actor USING (film_id) JOIN actor USING (actor_id) WHERE
actor.first_name = 'Johnny' AND actor.last_name = 'Depp';
| customer_first_name | customer_last_name |
|---|
QUESTION 6:
SELECT title as film_title, SUM(amount) AS total_revenue
FROM payment JOIN rental USING(rental_id) JOIN inventory USING (inventory_id)
JOIN film USING(film_id) GROUP BY film_title ORDER BY total_revenue DESC
LIMIT 10;
| film_title | total_revenue |
|---|---|
| TELEGRAPH VOYAGE | 231.73 |
| WIFE TURN | 223.69 |
| ZORRO ARK | 214.69 |
| GOODFELLAS SALUTE | 209.69 |
| SATURDAY LAMBS | 204.72 |
| TITANS JERK | 201.71 |
| TORQUE BOUND | 198.72 |
| HARRY IDAHO | 195.70 |
| INNOCENT USUAL | 191.74 |
| HUSTLER PARTY | 190.78 |
PLOTLY QUESTIONS:
QUESTION 1:
# Load required libraries
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
head(diamonds)
## # A tibble: 6 Ă— 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
#Create a plot using plotly
plot <- plot_ly(
data = diamonds,
x = ~carat, #x axis info
y = ~price, #y axis info
type = 'scatter',
mode = 'markers',
color = ~cut, #Color by cut quality
text = ~paste( #this will be our hover text for each point
"Carat: ", carat,
"<br>Price: $", price, "<br>Cut: ", cut,
"<br>Color: ", color,
"<br>Clarity: ", clarity
),
hoverinfo = 'text'
) %>%
layout( #title and text choices
title = list(
text = "Diamond Price vs. Carat<br><sup>Colored by Cut Quality</sup>",
font = list(size = 16)
),
xaxis = list(title = "Carat"),
yaxis = list(title = "Price (USD)"),
margin = list(t = 80)
)
# Display the plot
plot
QUESTION 2: The primary difference between ggplot2 and plotly is that
ggplot creates static images that are likely better for publications
and/or academic papers, especially when the visualization is meant to be
an image (that can’t be interacted with). This is most notably shown by
the fact that plotly is meant for HTML plots and ggplot is meant for
PDFs, PNGs,etc. Plotly includes the ability to make interactive and
dynamic visualizations, such as animations.
This means that plotly is bound to be less customizable, while ggplot
supports a lot more customization with additional layers that plotly may
lack. Overall, plotly is better for exploring data dynamically, while
ggplot is better for more complex or multi-layered plots. For use,
ggplot2 is better for visuals that are meant to be static, such as for a
publication, while plotly is better for interactive and web-based
visuals, especially where user engagement is emphasized.
QUESTION 3:
library(plotly)
library(gapminder)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:dbplyr':
##
## ident, sql
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#Use a subset of the data that includes avg gdpPerCap and avg lifeExp, which we
#will use below in our graph
gapminder_filtered <- gapminder %>%
group_by(continent, year) %>%
summarize(
avg_gdpPercap = mean(gdpPercap, na.rm = TRUE),
avg_lifeExp = mean(lifeExp, na.rm = TRUE)
)
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.
plot2 <- plot_ly(
data = gapminder_filtered,
x = ~year, #x axis val
y = ~avg_lifeExp, #y axis val
type = 'scatter',
mode = 'lines+markers',
color = ~continent, #color by continent
text = ~paste( #hover text
"Year: ", year,
"<br>Continent: ", continent,
"<br>Avg GDP per Capita: $", round(avg_gdpPercap, 2), #round the hover text
#for easier/less messy visuals
"<br>Avg Life Expectancy: ", round(avg_lifeExp, 2)
),
hoverinfo = 'text'
) %>%
layout(
title = "Life Expectancy vs. Time<br><sup>Grouped by Continent with GDP Context</sup>",
xaxis = list(title = "Year"), yaxis = list(title = "Average Life Expectancy (Years)"),
legend = list(title = list(text = "Continent")),
margin = list(t = 80)
)
plot2
From the above graph created using the gapminder dataset, we compare
years (x) against average live expectancy (y) for each of the continents
(grouped separately as lines). We also include average gdp perCap as
hovertext information, alongside the already mentioned info.
Essentially, we are seeing the change in life expectancy over time
(specifically the years 1950-2010) for each of the continents (excluding
the ones not in the data like Antarctica). The narrative that is
advanced by this comparison of time and average life expectancy for each
of the continents is how economic growth and prosperity drives health
improvements, especially for wealthier countries.
As the years on the x axis grow, we approach more modern contexts, which
have been defined by health and industrial innovations. Overall, the
graph suggests a link between economic development (consider GDP per
capita as well) and health outcomes. Additionally, the regional
disparities are also a narrative advanced by the graph, for continents
like Europe and the Americas tend to show higher average life expectancy
earlier in the timeline (x axis), compared to other continents like
Africa, reflecting earlier access to healthcare and economic growth for
the former two continents. In general, however, we see that the there is
an upward trend in average life expectancy for all continents over time,
reflecting overall advancements in medicine and healthcare access.