Data 205 Homework 6

Author

A Warsaw

Load the Libraries

For this assignment I will be using SQL in R studio to observe two data set and answer a list of questions as required. To start, I will need to load the required libraries to execute the task.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(odbc)
library(DBI)
library(RSQLite)

Load the Data

Next is to upload the csv files into my environment to prepare for use

athlete_events <- read_csv("athlete_events.csv")
Rows: 271116 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): Name, Sex, Team, NOC, Games, Season, City, Sport, Event, Medal
dbl  (5): ID, Age, Height, Weight, Year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
host_city_locations <- read_csv("host_city_locations.csv")
Rows: 52 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Season, City
dbl (3): Year, lon, lat

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Build the Database

Next step is to create a SQLite Database prior to moving forward.

olympics.db <- dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(olympics.db, athlete_events)
copy_to(olympics.db, host_city_locations)

Now that the Database has been created, I will be using SQL coding to answer the following questions. Please note that I cannot make comments inside of chunks when coding with SQL.

Needed Variables for the Queries in Assignment

I will now need to select the variables necessary for my observations, which includes the following:

  • Medal (from athlete_events)
  • City (from athlete_events)
  • Name (from athlete_events)
  • City (from host_city_locations)
  • lat (from host_city_locations)

How Many Records are in Athlete Events Data Set?

For this query, I will observing the medal variable from athlete_events to determine the answer.

SELECT COUNT(medal) AS total_records
FROM athlete_events
WHERE medal IN ('Gold', 'Silver', 'Bronze')
1 records
total_records
39783

According to the query, there are a total of 39,783 records in the data set

How Many Different Athletes Competed in Olympics Held in London?

For this query I will be using the City and Name variables from athlete_events data set. I will have to indicate that I do not want duplicates as well, and als to refer to one entry per year per person.

SELECT COUNT(DISTINCT Name) AS london_athletes
FROM athlete_events
WHERE City = 'London'
1 records
london_athletes
16924

The final answer is there are 16,924 total different athletes that have competed in London Olypmics.

Which Olympics Host City is Furthest North?

For this query, I will be using City and lat variables from host_city_locations data set

SELECT City, lat
FROM host_city_locations
ORDER BY lat DESC
LIMIT 1
1 records
City lat
Lillehammer 61.11527

For this query, since finding the northern most city, all I need to do is find the highest latitude among the list, I grouped the list by latitude in order from highest to lowest, followed by slicing (or limiting) the query to only show the top 1 choice. In this case, the top 1 choice is the furthest north.

To answer the question, Lillehammer is the host city that is the furthest north in the host_city_locations data set.

What is the List of Host Cities?

For this query, I will be using City variable from the host_city_locations data set to answer this.

SELECT DISTINCT City
FROM host_city_locations
Displaying records 1 - 10
City
Athina
Paris
St. Louis
London
Stockholm
Antwerpen
Chamonix
Amsterdam
Sankt Moritz
Los Angeles

Export Previous Query as a .CSV Named host_cities.csv

host_cities <- dbGetQuery(olympics.db, "SELECT DISTINCT City FROM host_city_locations;")

write.csv(host_cities, "host_cities.csv", row.names = FALSE)

Close the Data Base

It is always best practice to ensure you close the data base when finished.

dbDisconnect(olympics.db)