For this assignment I will be using SQL in R studio to observe two data set and answer a list of questions as required. To start, I will need to load the required libraries to execute the task.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(odbc)library(DBI)library(RSQLite)
Load the Data
Next is to upload the csv files into my environment to prepare for use
athlete_events <-read_csv("athlete_events.csv")
Rows: 271116 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): Name, Sex, Team, NOC, Games, Season, City, Sport, Event, Medal
dbl (5): ID, Age, Height, Weight, Year
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 52 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Season, City
dbl (3): Year, lon, lat
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Build the Database
Next step is to create a SQLite Database prior to moving forward.
Now that the Database has been created, I will be using SQL coding to answer the following questions. Please note that I cannot make comments inside of chunks when coding with SQL.
Needed Variables for the Queries in Assignment
I will now need to select the variables necessary for my observations, which includes the following:
Medal (from athlete_events)
City (from athlete_events)
Name (from athlete_events)
City (from host_city_locations)
lat (from host_city_locations)
How Many Records are in Athlete Events Data Set?
For this query, I will observing the medal variable from athlete_events to determine the answer.
SELECTCOUNT(medal) AS total_recordsFROM athlete_eventsWHERE medal IN ('Gold', 'Silver', 'Bronze')
1 records
total_records
39783
According to the query, there are a total of 39,783 records in the data set
How Many Different Athletes Competed in Olympics Held in London?
For this query I will be using the City and Name variables from athlete_events data set. I will have to indicate that I do not want duplicates as well, and als to refer to one entry per year per person.
SELECTCOUNT(DISTINCT Name) AS london_athletesFROM athlete_eventsWHERE City ='London'
1 records
london_athletes
16924
The final answer is there are 16,924 total different athletes that have competed in London Olypmics.
Which Olympics Host City is Furthest North?
For this query, I will be using City and lat variables from host_city_locations data set
SELECT City, latFROM host_city_locationsORDERBY lat DESCLIMIT1
1 records
City
lat
Lillehammer
61.11527
For this query, since finding the northern most city, all I need to do is find the highest latitude among the list, I grouped the list by latitude in order from highest to lowest, followed by slicing (or limiting) the query to only show the top 1 choice. In this case, the top 1 choice is the furthest north.
To answer the question, Lillehammer is the host city that is the furthest north in the host_city_locations data set.
What is the List of Host Cities?
For this query, I will be using City variable from the host_city_locations data set to answer this.
SELECTDISTINCT CityFROM host_city_locations
Displaying records 1 - 10
City
Athina
Paris
St. Louis
London
Stockholm
Antwerpen
Chamonix
Amsterdam
Sankt Moritz
Los Angeles
Export Previous Query as a .CSV Named host_cities.csv
host_cities <-dbGetQuery(olympics.db, "SELECT DISTINCT City FROM host_city_locations;")write.csv(host_cities, "host_cities.csv", row.names =FALSE)
Close the Data Base
It is always best practice to ensure you close the data base when finished.