Problem 1 (15 points)

The data for this problem contains songs from the Billboard Hot 100 list ranging from 1960 to 2015, and it is available at https://raw.githubusercontent.com/aalhamadani/Data_Wrangling_2026/refs/heads/main/practice/bbTop100.csv

Some of the variables included in this dataset are the year the song came out, the artist_name, duration (in milliseconds), among others.

Variable Name Description
year year
artist_name the artist of the song
explicit if the track is rated as explicit
track_name the name of the track
danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C#/Db, 2 = D, and so on.
loudness The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db.
mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms The duration of the track in milliseconds.
  1. (3 point) Find all the songs from 2011
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
link <- "https://raw.githubusercontent.com/aalhamadani/Data_Wrangling_2026/refs/heads/main/practice/bbTop100.csv"
download.file(url=link, destfile = 'bbTop100.csv')
bbTop100 <- read_csv('bbTop100.csv')
## Rows: 5497 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): artist_name, track_name
## dbl (13): year, danceability, energy, key, loudness, mode, speechiness, acou...
## lgl  (1): explicit
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
songs_2011 <- bbTop100 %>%
  filter(`year` == 2011)
songs_2011
  1. (3 points) Make a new variable duration_min to convert from milliseconds to minutes
songs_2011 <- songs_2011%>% 
  mutate(duration_min = duration_ms/60000)

songs_2011
  1. (3 points) Are there songs longer than 10 minutes? (consider the entire datasets) If so, what are the years when such songs were part of the Top 100 list?
songs_over_10_mins <- bbTop100 %>% 
  group_by(year) %>%
  filter(duration_ms >= 600000)
songs_over_10_mins
  1. (6 points) Tell us something we don’t know:

My question: How was the energy and danceability affected by here relationship with Chris Brown compared to the rest of her career?

library(ggplot2)
rihanna <- bbTop100 %>% 
  filter(artist_name == 'Rihanna') %>%
  group_by(year) %>%
  summarise( song_count = n(),
             avg_energy = mean(energy),
             avg_danceability = mean(danceability),
             avg_speech = mean(speechiness),
             avg_key = mean(key),
             avg_loudness = mean(loudness))
ggplot(rihanna, aes(x = year, y = avg_energy)) +
  geom_line(color = "firebrick", size = 1) +
  geom_point(size = 3) +
  scale_x_continuous(breaks = c(2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015)) + # Ensures integer years on axis
  labs(
    title = "Rihanna's Average Energy Trend",
    x = "Year",
    y = "Average Energy"
  ) +
  theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

rihanna <- bbTop100 %>% 
  filter(artist_name == 'Rihanna') %>%
  group_by(year) %>%
  summarise( song_count = n(),
             avg_engery = mean(energy),
             avg_danceability = mean(danceability),
             avg_speech = mean(speechiness),
             avg_key = mean(key),
             avg_loudness = mean(loudness))
ggplot(rihanna, aes(x = year, y = avg_danceability)) +
  geom_line(color = "firebrick", size = 1) +
  geom_point(size = 3) +
  scale_x_continuous(breaks = c(2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015)) + # Ensures integer years on axis
  labs(
    title = "Rihanna's Average Danceability Trend",
    x = "Year",
    y = "Average Danceability"
  ) +
  theme_minimal()

Problem 2 (15 points)

This dataset, sourced from the Department of Justice FBI (Federal Bureau of Investigation), provides detailed information on offenses known to law enforcement across various cities in Florida for the year 2019. The dataset includes key variables such as City, Population, Violent Crime, Murder and Nonnegligent Manslaughter, Rape, Robbery, Aggravated Assault, Property Crime, Burglary, Larceny-Theft, Motor Vehicle Theft, and Arson.

This comprehensive dataset offers an opportunity to analyze crime patterns, compare crime rates across cities in the state of Florida, and explore factors contributing to public safety. It is available at https://raw.githubusercontent.com/aalhamadani/Data_Wrangling_2026/refs/heads/main/practice/crime_florida_2019.csv

  1. (4 points) Tell us something we don’t know:

Is the relationship between population density and property crime linear?

library(tidyverse)
link <- "https://raw.githubusercontent.com/aalhamadani/Data_Wrangling_2026/refs/heads/main/practice/crime_florida_2019.csv"
download.file(url=link, destfile = 'florida_crime.csv')
florida <- read_csv('florida_crime.csv')
## Rows: 289 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): City
## dbl (3): Murder and
## nonnegligent
## manslaughter, Rape, Arson
## num (8): Population, Violent
## crime, Robbery, Aggravated
## assault, Property...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(florida)
##  [1] "City"                                      
##  [2] "Population"                                
##  [3] "Violent\r\ncrime"                          
##  [4] "Murder and\r\nnonnegligent\r\nmanslaughter"
##  [5] "Rape"                                      
##  [6] "Robbery"                                   
##  [7] "Aggravated\r\nassault"                     
##  [8] "Property\r\ncrime"                         
##  [9] "Burglary"                                  
## [10] "Larceny-\r\ntheft"                         
## [11] "Motor\r\nvehicle\r\ntheft"                 
## [12] "Arson"
florida <- florida %>%
  rename(
    population = 2,
    violent = 3,
    murder = 4,
    assault = 7,
    property_crime = 8,
    larceny = 10,
    motor_theft = 11
  ) %>%
  filter(!is.na(population), !is.na(larceny))


colnames(florida)
##  [1] "City"           "population"     "violent"        "murder"        
##  [5] "Rape"           "Robbery"        "assault"        "property_crime"
##  [9] "Burglary"       "larceny"        "motor_theft"    "Arson"
ggplot(florida, aes(x = population, y = property_crime)) +
  geom_point(alpha = 0.5, color = "darkgrey") +
  # Add a linear trend line (Blue)
  geom_smooth(method = "lm", color = "blue", se = FALSE, linetype = "dashed") + 
  # Add a non-linear smooth trend line (Red)
  geom_smooth(method = "loess", color = "red", se = FALSE) +
  labs(
    title = "Is the Relationship between Population and Larceny Linear?",
    subtitle = "Blue dashed line = Linear Model | Red solid line = Actual Trend",
    x = "Population",
    y = "Number of Larceny-Thefts"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

  1. (4 points) Compare Lakeland city with five other cities in Florida for all categories of offenses. Then write your analysis such as highlighting the significant differences and suggest possible reasons.
library(dplyr)
library(ggplot2)


my_cities <- c("Lakeland", "Alachua", "Apopka", "Aventura", "Auburndale", "Atlantis")


comparison_data <- florida %>%
  filter(City %in% my_cities)

analysis_table <- comparison_data %>%
  select(City, population, property_crime, violent)

print(analysis_table)
## # A tibble: 6 × 4
##   City       population property_crime violent
##   <chr>           <dbl>          <dbl>   <dbl>
## 1 Alachua         10070            230      41
## 2 Apopka          55072           1931     206
## 3 Atlantis         2126             54       6
## 4 Auburndale      16679            715      55
## 5 Aventura        38259           1815      69
## 6 Lakeland       112237           3230     350
  1. Lakeland has the highest crime numbers (350 violent crimes) because it has the largest population (112,237). More people and more buildings simply mean more opportunities for incidents to occur compared to tiny towns like Atlantis.

  2. Apopka has half the people of Lakeland, but its crime isn’t “half”—it’s actually quite high (206).This shows that crime isn’t just about total population; urban density and economic factors can make smaller cities feel “busier” or more active in crime statistics.

  3. Small towns like Atlantis and Auburndale have very low numbers. These dont have major malls or interstates, there are fewer “targets” and fewer non-residents passing through.

  1. (7 points) Make a new variable that calculates the Weighted Crime Index for each city using the provided formula, then Identify and display the 10 most dangerous cities and the 10 safest cities according to this index.:

\[ \text{Weighted Crime Index} = \left( \frac{\sum_i \left( W_i \times C_i \right)}{\text{Population}} \right) \times 1000 \]

Where:

Crime Weight
Violent Crime 5
Murder and Nonnegligent Manslaughter 10
Rape 8
Robbery 5
Aggravated assault 4
Property crime 2
Burglary 2
larceny-theft 1
Motor Vehicle Theft 3
Arson 4
library(dplyr)




florida_indexed <- florida %>%
  mutate(
    # Sum of weighted crimes
    weighted_sum = (murder * 10) + 
                   (Rape * 8) + 
                   (Robbery * 5) + 
                   (assault * 4) + 
                   (property_crime * 2) + 
                   (Burglary * 2) + 
                   (larceny * 1) + 
                   (motor_theft * 3) + 
                   (Arson * 4),
    
    # Calculate final Index per 1,000 residents
    crime_index = (weighted_sum / population) * 1000
  ) %>%
  # Filter out any rows with NAs or where population is 0 to avoid errors
  filter(!is.na(crime_index), population > 0)


most_dangerous <- florida_indexed %>%
  arrange(desc(crime_index)) %>%
  slice_head(n = 10) %>%
  select(City, population, crime_index)

safest <- florida_indexed %>%
  arrange(crime_index) %>%
  slice_head(n = 10) %>%
  select(City, population, crime_index)



most_dangerous
safest

Choose 2 out of the three following problems

Problem 3 (10 points)

Let us use the nycflights13 package for this exercise. Note: load the nycflighst13 package

Datasets available in this package are:

nycflights13 datasets
nycflights13 datasets
  1. (2 points) What does anti-join between flights and airports (in that order) tell you? Briefly explain.

  2. (2 points) What does anti-join between airports and flights (in that order) tell you? Briefly explain.

  3. (2 points) In which month did flights tend to have the shortest delays?

  4. (4 points) Consider flights that arrive to Tampa airport (TPA). Find the average arrival delay per carrier of all flights that arrive in Tampa, during your birthday in 2013.

Problem 4 (10 points)

Use the CIACountries.csv file available at https://raw.githubusercontent.com/aalhamadani/Data_Wrangling_2026/refs/heads/main/practice/CIACountries.csv

This dataset comes from the (2014) CIA Factbook which has geographic, demographic, and economic data on a country-by-country basis. (In the description of the variables, the 4-digit number indicates the code used to specify that variable on the data and documentation web site)

Variable Description
country Name of the country
pop Number of people, 2119
area Area (sq km), 2147
oil_prod Crude oil - production (bbl/day), 2241
gdp Gross Domestic Product per capita ($/person), 2001
educ Education spending (% of GDP), 2206
roadways Roadways per unit area (km/sq km), 2085
net_users Fraction of Internet users (% of population), 2153
  1. (6 points) Create two data summaries of your choice, and comment on your results.

  2. (4 points) Select two variables of interest and report back the number of missing entries for those variables/attributes.

library(dplyr)
library(readr)

# Load the dataset
url <- "https://raw.githubusercontent.com/aalhamadani/Data_wrangling_2026/refs/heads/main/practice/CIACountries.csv"
cia_data <- read_csv(url)
## Rows: 236 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): country, net_users
## dbl (6): pop, area, oil_prod, gdp, educ, roadways
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Summary 1: Average GDP and Total Population by sample
gdp_summary <- cia_data %>%
  summarise(
    avg_gdp = mean(gdp, na.rm = TRUE),
    max_gdp = max(gdp, na.rm = TRUE),
    total_pop_billions = sum(pop, na.rm = TRUE) / 1000000000
  )

# Summary 2: Top 5 Largest Countries by Area
top_5_largest <- cia_data %>%
  select(country, area) %>%
  arrange(desc(area)) %>%
  slice(1:5)


gdp_summary
top_5_largest
# Counting missing values for 'educ' and 'net_users'
missing_report <- cia_data %>%
  summarise(
    missing_educ = sum(is.na(educ)),
    missing_net_users = sum(is.na(net_users))
  )

missing_report

Problem 5 (10 points)

Consider the data available at https://raw.githubusercontent.com/aalhamadani/Data_Wrangling_2026/main/practice/world_bank_data.csv. The dataset in this example was downloaded from the World Bank Data Explorer, which has information on a collection of hundreds of indicators (measures) of different economic and social development factors.

Consider educational indicators that capture a relevant signal of a country’s level of (or investment in) education, such as government expenditure on education, literacy rates, school enrollment rates, and others. The file used in this example has several imperfections: unnecessary rows at the top of the .csv file, a lot of missing data (NA), long column names. All of theses are representative challenges involved in working with real datasets.

Each observation (row) represents an indicator for a country, with features (columns) that are values of the indicator in a given year.

Notice the names assigned to the different columns, and the many missing values NAs in this dataframe.

  1. (2 points) Read the dataset.

  2. (4 points) The original structure of the data, allows you to make comparisons between years for the indicators by filtering for the indicator of interest. Consider the indicator below:

url <- "https://raw.githubusercontent.com/aalhamadani/Data_Wrangling_2026/main/practice/world_bank_data.csv"

# Read data and skip the first 4 rows
world_bank <- read_csv(url, skip = 4)
## New names:
## Rows: 41712 Columns: 63
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (4): Country Name, Country Code, Indicator Name, Indicator Code dbl (58): 1960,
## 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, ... lgl (1): ...63
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...63`
world_bank <- world_bank %>%
  select(-last_col())

 indicator <- "Government expenditure on education, total (% of GDP)"
 
 indicator_label <- "Government expenditure on education, total (% of GDP)"

education_delta <- world_bank %>%
  filter(`Indicator Name` == indicator_label) %>%
  select(`Country Name`, `1990`, `2014`) %>%
  mutate(across(c(`1990`, `2014`), as.numeric)) %>%
  mutate(delta = `2014` - `1990`) %>%
  filter(!is.na(delta))

head(education_delta)

Calculate the change, call it delta, between 2014 and 1990.

  1. (4 points) Consider the problem of comparing how the expenditure across all years varies for a given country. We would need to reshape this dataframe, so that each observation is an indicator for a country for a year, making the data in long(er) format. Reshape the dataset as described above, and show a table for the indicator in part (b) and a country of your choice (notice that for example, if you want to filter data for Spain, you could use the “country code” ESP - values for some years may be missing)
world_bank_long <- world_bank %>%
  pivot_longer(
    cols = `1960`:`2017`,
    names_to = "Year",
    values_to = "Value"
  ) %>%
  mutate(Year = as.numeric(Year),
         Value = as.numeric(Value))


Afghanistan_edu <- world_bank_long %>%
  filter(!is.na(Value)) %>%
  filter(`Country Code` == "AFG", 
         `Indicator Name` == indicator_label)

Afghanistan_edu
  1. (3 extra-points) Create a data visualization of your choice that relates to the dataset in this exercise.
ggplot(Afghanistan_edu, aes(x = Year, y = Value)) +
  geom_line(color = "steelblue", size = 1) +
  geom_point() +
  labs(
    title = "Education Expenditure in Spain (% of GDP)",
    subtitle = "Data source: World Bank",
    x = "Year",
    y = "Expenditure (% of GDP)"
  ) +
  theme_minimal()