The data for this problem contains songs from the Billboard Hot 100 list ranging from 1960 to 2015, and it is available at https://raw.githubusercontent.com/aalhamadani/Data_Wrangling_2026/refs/heads/main/practice/bbTop100.csv
Some of the variables included in this dataset are the
year the song came out, the artist_name,
duration (in milliseconds), among others.
| Variable Name | Description |
|---|---|
year |
year |
artist_name |
the artist of the song |
explicit |
if the track is rated as explicit |
track_name |
the name of the track |
danceability |
describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
energy |
is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
key |
The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C#/Db, 2 = D, and so on. |
loudness |
The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db. |
mode |
indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
speechiness |
detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
acousticness |
A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
instrumentalness |
Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
liveness |
Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
valence |
A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
tempo |
The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
duration_ms |
The duration of the track in milliseconds. |
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
link <- "https://raw.githubusercontent.com/aalhamadani/Data_Wrangling_2026/refs/heads/main/practice/bbTop100.csv"
download.file(url=link, destfile = 'bbTop100.csv')
bbTop100 <- read_csv('bbTop100.csv')
## Rows: 5497 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): artist_name, track_name
## dbl (13): year, danceability, energy, key, loudness, mode, speechiness, acou...
## lgl (1): explicit
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
songs_2011 <- bbTop100 %>%
filter(`year` == 2011)
songs_2011
duration_min to convert
from milliseconds to minutessongs_2011 <- songs_2011%>%
mutate(duration_min = duration_ms/60000)
songs_2011
songs_over_10_mins <- bbTop100 %>%
group_by(year) %>%
filter(duration_ms >= 600000)
songs_over_10_mins
This is your chance to be creative and explore the data set in a way we have not done yet.
Think of a question you may be able to answer with the number of observations and variables included in the dataset. Make sure to write down the question and list the steps you might need to execute in order to provide an answer to the question. (For example: how many songs by Alicia Keys have been in the Billboard Top 100 list according to this data set and what do they have in common?)
My question: How was the energy and danceability affected by here relationship with Chris Brown compared to the rest of her career?
library(ggplot2)
rihanna <- bbTop100 %>%
filter(artist_name == 'Rihanna') %>%
group_by(year) %>%
summarise( song_count = n(),
avg_energy = mean(energy),
avg_danceability = mean(danceability),
avg_speech = mean(speechiness),
avg_key = mean(key),
avg_loudness = mean(loudness))
ggplot(rihanna, aes(x = year, y = avg_energy)) +
geom_line(color = "firebrick", size = 1) +
geom_point(size = 3) +
scale_x_continuous(breaks = c(2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015)) + # Ensures integer years on axis
labs(
title = "Rihanna's Average Energy Trend",
x = "Year",
y = "Average Energy"
) +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
rihanna <- bbTop100 %>%
filter(artist_name == 'Rihanna') %>%
group_by(year) %>%
summarise( song_count = n(),
avg_engery = mean(energy),
avg_danceability = mean(danceability),
avg_speech = mean(speechiness),
avg_key = mean(key),
avg_loudness = mean(loudness))
ggplot(rihanna, aes(x = year, y = avg_danceability)) +
geom_line(color = "firebrick", size = 1) +
geom_point(size = 3) +
scale_x_continuous(breaks = c(2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015)) + # Ensures integer years on axis
labs(
title = "Rihanna's Average Danceability Trend",
x = "Year",
y = "Average Danceability"
) +
theme_minimal()
This dataset, sourced from the Department of Justice FBI
(Federal Bureau of Investigation), provides detailed
information on offenses known to law enforcement across various cities
in Florida for the year 2019. The dataset includes key
variables such as City, Population,
Violent Crime,
Murder and Nonnegligent Manslaughter, Rape,
Robbery, Aggravated Assault,
Property Crime, Burglary,
Larceny-Theft, Motor Vehicle Theft, and
Arson.
This comprehensive dataset offers an opportunity to analyze crime patterns, compare crime rates across cities in the state of Florida, and explore factors contributing to public safety. It is available at https://raw.githubusercontent.com/aalhamadani/Data_Wrangling_2026/refs/heads/main/practice/crime_florida_2019.csv
This is your chance to be creative and explore the data set in a way we have not done yet.
Think of a question you may be able to answer with the number of observations and variables included in the dataset. Make sure to write down the question and list the steps you might need to execute in order to provide an answer to the question. (For example: What is the relationship between population density and violent crime rates in Florida cities?)
Is the relationship between population density and property crime linear?
library(tidyverse)
link <- "https://raw.githubusercontent.com/aalhamadani/Data_Wrangling_2026/refs/heads/main/practice/crime_florida_2019.csv"
download.file(url=link, destfile = 'florida_crime.csv')
florida <- read_csv('florida_crime.csv')
## Rows: 289 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): City
## dbl (3): Murder and
## nonnegligent
## manslaughter, Rape, Arson
## num (8): Population, Violent
## crime, Robbery, Aggravated
## assault, Property...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(florida)
## [1] "City"
## [2] "Population"
## [3] "Violent\r\ncrime"
## [4] "Murder and\r\nnonnegligent\r\nmanslaughter"
## [5] "Rape"
## [6] "Robbery"
## [7] "Aggravated\r\nassault"
## [8] "Property\r\ncrime"
## [9] "Burglary"
## [10] "Larceny-\r\ntheft"
## [11] "Motor\r\nvehicle\r\ntheft"
## [12] "Arson"
florida <- florida %>%
rename(
population = 2,
violent = 3,
murder = 4,
assault = 7,
property_crime = 8,
larceny = 10,
motor_theft = 11
) %>%
filter(!is.na(population), !is.na(larceny))
colnames(florida)
## [1] "City" "population" "violent" "murder"
## [5] "Rape" "Robbery" "assault" "property_crime"
## [9] "Burglary" "larceny" "motor_theft" "Arson"
ggplot(florida, aes(x = population, y = property_crime)) +
geom_point(alpha = 0.5, color = "darkgrey") +
# Add a linear trend line (Blue)
geom_smooth(method = "lm", color = "blue", se = FALSE, linetype = "dashed") +
# Add a non-linear smooth trend line (Red)
geom_smooth(method = "loess", color = "red", se = FALSE) +
labs(
title = "Is the Relationship between Population and Larceny Linear?",
subtitle = "Blue dashed line = Linear Model | Red solid line = Actual Trend",
x = "Population",
y = "Number of Larceny-Thefts"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
library(dplyr)
library(ggplot2)
my_cities <- c("Lakeland", "Alachua", "Apopka", "Aventura", "Auburndale", "Atlantis")
comparison_data <- florida %>%
filter(City %in% my_cities)
analysis_table <- comparison_data %>%
select(City, population, property_crime, violent)
print(analysis_table)
## # A tibble: 6 × 4
## City population property_crime violent
## <chr> <dbl> <dbl> <dbl>
## 1 Alachua 10070 230 41
## 2 Apopka 55072 1931 206
## 3 Atlantis 2126 54 6
## 4 Auburndale 16679 715 55
## 5 Aventura 38259 1815 69
## 6 Lakeland 112237 3230 350
Lakeland has the highest crime numbers (350 violent crimes) because it has the largest population (112,237). More people and more buildings simply mean more opportunities for incidents to occur compared to tiny towns like Atlantis.
Apopka has half the people of Lakeland, but its crime isn’t “half”—it’s actually quite high (206).This shows that crime isn’t just about total population; urban density and economic factors can make smaller cities feel “busier” or more active in crime statistics.
Small towns like Atlantis and Auburndale have very low numbers. These dont have major malls or interstates, there are fewer “targets” and fewer non-residents passing through.
Weighted Crime Index for each city using the provided
formula, then Identify and
display the 10 most dangerous cities and the 10 safest
cities according to this index.:\[ \text{Weighted Crime Index} = \left( \frac{\sum_i \left( W_i \times C_i \right)}{\text{Population}} \right) \times 1000 \]
Where:
| Crime | Weight |
|---|---|
Violent Crime |
5 |
Murder and Nonnegligent Manslaughter |
10 |
Rape |
8 |
Robbery |
5 |
Aggravated assault |
4 |
Property crime |
2 |
Burglary |
2 |
larceny-theft |
1 |
Motor Vehicle Theft |
3 |
Arson |
4 |
library(dplyr)
florida_indexed <- florida %>%
mutate(
# Sum of weighted crimes
weighted_sum = (murder * 10) +
(Rape * 8) +
(Robbery * 5) +
(assault * 4) +
(property_crime * 2) +
(Burglary * 2) +
(larceny * 1) +
(motor_theft * 3) +
(Arson * 4),
# Calculate final Index per 1,000 residents
crime_index = (weighted_sum / population) * 1000
) %>%
# Filter out any rows with NAs or where population is 0 to avoid errors
filter(!is.na(crime_index), population > 0)
most_dangerous <- florida_indexed %>%
arrange(desc(crime_index)) %>%
slice_head(n = 10) %>%
select(City, population, crime_index)
safest <- florida_indexed %>%
arrange(crime_index) %>%
slice_head(n = 10) %>%
select(City, population, crime_index)
most_dangerous
safest
Let us use the nycflights13 package for this exercise.
Note: load the nycflighst13 package
Datasets available in this package are:
airlines: Airline names
airports: Airport metadata
flights: Flights data
planes: Plane metadata
weather: Hourly weather data
(2 points) What does anti-join between flights and
airports (in that order) tell you? Briefly
explain.
(2 points) What does anti-join between airports and
flights (in that order) tell you? Briefly explain.
(2 points) In which month did flights tend to have the shortest delays?
(4 points) Consider flights that arrive to Tampa airport
(TPA). Find the average arrival delay per carrier of all
flights that arrive in Tampa, during your birthday in 2013.
Use the CIACountries.csv file available at https://raw.githubusercontent.com/aalhamadani/Data_Wrangling_2026/refs/heads/main/practice/CIACountries.csv
This dataset comes from the (2014) CIA Factbook which has geographic, demographic, and economic data on a country-by-country basis. (In the description of the variables, the 4-digit number indicates the code used to specify that variable on the data and documentation web site)
| Variable | Description |
|---|---|
country |
Name of the country |
pop |
Number of people, 2119 |
area |
Area (sq km), 2147 |
oil_prod |
Crude oil - production (bbl/day), 2241 |
gdp |
Gross Domestic Product per capita ($/person), 2001 |
educ |
Education spending (% of GDP), 2206 |
roadways |
Roadways per unit area (km/sq km), 2085 |
net_users |
Fraction of Internet users (% of population), 2153 |
(6 points) Create two data summaries of your choice, and comment on your results.
(4 points) Select two variables of interest and report back the number of missing entries for those variables/attributes.
library(dplyr)
library(readr)
# Load the dataset
url <- "https://raw.githubusercontent.com/aalhamadani/Data_wrangling_2026/refs/heads/main/practice/CIACountries.csv"
cia_data <- read_csv(url)
## Rows: 236 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): country, net_users
## dbl (6): pop, area, oil_prod, gdp, educ, roadways
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Summary 1: Average GDP and Total Population by sample
gdp_summary <- cia_data %>%
summarise(
avg_gdp = mean(gdp, na.rm = TRUE),
max_gdp = max(gdp, na.rm = TRUE),
total_pop_billions = sum(pop, na.rm = TRUE) / 1000000000
)
# Summary 2: Top 5 Largest Countries by Area
top_5_largest <- cia_data %>%
select(country, area) %>%
arrange(desc(area)) %>%
slice(1:5)
gdp_summary
top_5_largest
# Counting missing values for 'educ' and 'net_users'
missing_report <- cia_data %>%
summarise(
missing_educ = sum(is.na(educ)),
missing_net_users = sum(is.na(net_users))
)
missing_report
Consider the data available at https://raw.githubusercontent.com/aalhamadani/Data_Wrangling_2026/main/practice/world_bank_data.csv. The dataset in this example was downloaded from the World Bank Data Explorer, which has information on a collection of hundreds of indicators (measures) of different economic and social development factors.
Consider educational indicators that capture a relevant
signal of a country’s level of (or investment in) education, such as
government expenditure on education, literacy rates, school enrollment
rates, and others. The file used in this example has several
imperfections: unnecessary rows at the top of the .csv
file, a lot of missing data (NA), long column names. All of
theses are representative challenges involved in working with real
datasets.
Each observation (row) represents an indicator for a country, with features (columns) that are values of the indicator in a given year.
Notice the names assigned to the different columns, and the many
missing values NAs in this dataframe.
(2 points) Read the dataset.
(4 points) The original structure of the data, allows you to make comparisons between years for the indicators by filtering for the indicator of interest. Consider the indicator below:
url <- "https://raw.githubusercontent.com/aalhamadani/Data_Wrangling_2026/main/practice/world_bank_data.csv"
# Read data and skip the first 4 rows
world_bank <- read_csv(url, skip = 4)
## New names:
## Rows: 41712 Columns: 63
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (4): Country Name, Country Code, Indicator Name, Indicator Code dbl (58): 1960,
## 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, ... lgl (1): ...63
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...63`
world_bank <- world_bank %>%
select(-last_col())
indicator <- "Government expenditure on education, total (% of GDP)"
indicator_label <- "Government expenditure on education, total (% of GDP)"
education_delta <- world_bank %>%
filter(`Indicator Name` == indicator_label) %>%
select(`Country Name`, `1990`, `2014`) %>%
mutate(across(c(`1990`, `2014`), as.numeric)) %>%
mutate(delta = `2014` - `1990`) %>%
filter(!is.na(delta))
head(education_delta)
Calculate the change, call it delta, between 2014 and
1990.
ESP - values for some years may be missing)world_bank_long <- world_bank %>%
pivot_longer(
cols = `1960`:`2017`,
names_to = "Year",
values_to = "Value"
) %>%
mutate(Year = as.numeric(Year),
Value = as.numeric(Value))
Afghanistan_edu <- world_bank_long %>%
filter(!is.na(Value)) %>%
filter(`Country Code` == "AFG",
`Indicator Name` == indicator_label)
Afghanistan_edu
ggplot(Afghanistan_edu, aes(x = Year, y = Value)) +
geom_line(color = "steelblue", size = 1) +
geom_point() +
labs(
title = "Education Expenditure in Spain (% of GDP)",
subtitle = "Data source: World Bank",
x = "Year",
y = "Expenditure (% of GDP)"
) +
theme_minimal()