Problem 1 (15 points)

The data for this problem contains songs from the Billboard Hot 100 list ranging from 1960 to 2015, and it is available at https://raw.githubusercontent.com/aalhamadani/Data_Wrangling_2026/refs/heads/main/practice/bbTop100.csv

Some of the variables included in this dataset are the year the song came out, the artist_name, duration (in milliseconds), among others.

Variable Name	Description
`year`	year
`artist_name`	the artist of the song
`explicit`	if the track is rated as explicit
`track_name`	the name of the track
`danceability`	describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
`energy`	is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
`key`	The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C#/Db, 2 = D, and so on.
`loudness`	The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db.
`mode`	indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
`speechiness`	detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
`acousticness`	A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
`instrumentalness`	Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
`liveness`	Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
`valence`	A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
`tempo`	The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
`duration_ms`	The duration of the track in milliseconds.

(3 point) Find all the songs from 2011

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

link <- "https://raw.githubusercontent.com/aalhamadani/Data_Wrangling_2026/refs/heads/main/practice/bbTop100.csv"
download.file(url=link, destfile = 'bbTop100.csv')
bbTop100 <- read_csv('bbTop100.csv')

## Rows: 5497 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): artist_name, track_name
## dbl (13): year, danceability, energy, key, loudness, mode, speechiness, acou...
## lgl  (1): explicit
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

songs_2011 <- bbTop100 %>%
  filter(`year` == 2011)
songs_2011

(3 points) Make a new variable duration_min to convert from milliseconds to minutes

songs_2011 <- songs_2011%>% 
  mutate(duration_min = duration_ms/60000)

songs_2011

(3 points) Are there songs longer than 10 minutes? (consider the entire datasets) If so, what are the years when such songs were part of the Top 100 list?

songs_over_10_mins <- bbTop100 %>% 
  group_by(year) %>%
  filter(duration_ms >= 600000)
songs_over_10_mins

(6 points) Tell us something we don’t know:

This is your chance to be creative and explore the data set in a way we have not done yet.
Think of a question you may be able to answer with the number of observations and variables included in the dataset. Make sure to write down the question and list the steps you might need to execute in order to provide an answer to the question. (For example: how many songs by Alicia Keys have been in the Billboard Top 100 list according to this data set and what do they have in common?)

My question: How was the energy and danceability affected by here relationship with Chris Brown compared to the rest of her career?

library(ggplot2)
rihanna <- bbTop100 %>% 
  filter(artist_name == 'Rihanna') %>%
  group_by(year) %>%
  summarise( song_count = n(),
             avg_energy = mean(energy),
             avg_danceability = mean(danceability),
             avg_speech = mean(speechiness),
             avg_key = mean(key),
             avg_loudness = mean(loudness))
ggplot(rihanna, aes(x = year, y = avg_energy)) +
  geom_line(color = "firebrick", size = 1) +
  geom_point(size = 3) +
  scale_x_continuous(breaks = c(2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015)) + # Ensures integer years on axis
  labs(
    title = "Rihanna's Average Energy Trend",
    x = "Year",
    y = "Average Energy"
  ) +
  theme_minimal()

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

rihanna <- bbTop100 %>% 
  filter(artist_name == 'Rihanna') %>%
  group_by(year) %>%
  summarise( song_count = n(),
             avg_engery = mean(energy),
             avg_danceability = mean(danceability),
             avg_speech = mean(speechiness),
             avg_key = mean(key),
             avg_loudness = mean(loudness))
ggplot(rihanna, aes(x = year, y = avg_danceability)) +
  geom_line(color = "firebrick", size = 1) +
  geom_point(size = 3) +
  scale_x_continuous(breaks = c(2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015)) + # Ensures integer years on axis
  labs(
    title = "Rihanna's Average Danceability Trend",
    x = "Year",
    y = "Average Danceability"
  ) +
  theme_minimal()

Problem 2 (15 points)

This dataset, sourced from the Department of Justice FBI (Federal Bureau of Investigation), provides detailed information on offenses known to law enforcement across various cities in Florida for the year 2019. The dataset includes key variables such as City, Population, Violent Crime, Murder and Nonnegligent Manslaughter, Rape, Robbery, Aggravated Assault, Property Crime, Burglary, Larceny-Theft, Motor Vehicle Theft, and Arson.

This comprehensive dataset offers an opportunity to analyze crime patterns, compare crime rates across cities in the state of Florida, and explore factors contributing to public safety. It is available at https://raw.githubusercontent.com/aalhamadani/Data_Wrangling_2026/refs/heads/main/practice/crime_florida_2019.csv

(4 points) Tell us something we don’t know:

This is your chance to be creative and explore the data set in a way we have not done yet.
Think of a question you may be able to answer with the number of observations and variables included in the dataset. Make sure to write down the question and list the steps you might need to execute in order to provide an answer to the question. (For example: What is the relationship between population density and violent crime rates in Florida cities?)

Is the relationship between population density and property crime linear?

library(tidyverse)
link <- "https://raw.githubusercontent.com/aalhamadani/Data_Wrangling_2026/refs/heads/main/practice/crime_florida_2019.csv"
download.file(url=link, destfile = 'florida_crime.csv')
florida <- read_csv('florida_crime.csv')

## Rows: 289 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): City
## dbl (3): Murder and
## nonnegligent
## manslaughter, Rape, Arson
## num (8): Population, Violent
## crime, Robbery, Aggravated
## assault, Property...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

colnames(florida)

##  [1] "City"                                      
##  [2] "Population"                                
##  [3] "Violent\r\ncrime"                          
##  [4] "Murder and\r\nnonnegligent\r\nmanslaughter"
##  [5] "Rape"                                      
##  [6] "Robbery"                                   
##  [7] "Aggravated\r\nassault"                     
##  [8] "Property\r\ncrime"                         
##  [9] "Burglary"                                  
## [10] "Larceny-\r\ntheft"                         
## [11] "Motor\r\nvehicle\r\ntheft"                 
## [12] "Arson"

florida <- florida %>%
  rename(
    population = 2,
    violent = 3,
    murder = 4,
    assault = 7,
    property_crime = 8,
    larceny = 10,
    motor_theft = 11
  ) %>%
  filter(!is.na(population), !is.na(larceny))


colnames(florida)

##  [1] "City"           "population"     "violent"        "murder"        
##  [5] "Rape"           "Robbery"        "assault"        "property_crime"
##  [9] "Burglary"       "larceny"        "motor_theft"    "Arson"

ggplot(florida, aes(x = population, y = property_crime)) +
  geom_point(alpha = 0.5, color = "darkgrey") +
  # Add a linear trend line (Blue)
  geom_smooth(method = "lm", color = "blue", se = FALSE, linetype = "dashed") + 
  # Add a non-linear smooth trend line (Red)
  geom_smooth(method = "loess", color = "red", se = FALSE) +
  labs(
    title = "Is the Relationship between Population and Larceny Linear?",
    subtitle = "Blue dashed line = Linear Model | Red solid line = Actual Trend",
    x = "Population",
    y = "Number of Larceny-Thefts"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

(4 points) Compare Lakeland city with five other cities in Florida for all categories of offenses. Then write your analysis such as highlighting the significant differences and suggest possible reasons.

library(dplyr)
library(ggplot2)


my_cities <- c("Lakeland", "Alachua", "Apopka", "Aventura", "Auburndale", "Atlantis")


comparison_data <- florida %>%
  filter(City %in% my_cities)

analysis_table <- comparison_data %>%
  select(City, population, property_crime, violent)

print(analysis_table)

## # A tibble: 6 × 4
##   City       population property_crime violent
##   <chr>           <dbl>          <dbl>   <dbl>
## 1 Alachua         10070            230      41
## 2 Apopka          55072           1931     206
## 3 Atlantis         2126             54       6
## 4 Auburndale      16679            715      55
## 5 Aventura        38259           1815      69
## 6 Lakeland       112237           3230     350

Lakeland has the highest crime numbers (350 violent crimes) because it has the largest population (112,237). More people and more buildings simply mean more opportunities for incidents to occur compared to tiny towns like Atlantis.
Apopka has half the people of Lakeland, but its crime isn’t “half”—it’s actually quite high (206).This shows that crime isn’t just about total population; urban density and economic factors can make smaller cities feel “busier” or more active in crime statistics.
Small towns like Atlantis and Auburndale have very low numbers. These dont have major malls or interstates, there are fewer “targets” and fewer non-residents passing through.

(7 points) Make a new variable that calculates the Weighted Crime Index for each city using the provided formula, then Identify and display the 10 most dangerous cities and the 10 safest cities according to this index.:

\[ \text{Weighted Crime Index} = \left( \frac{\sum_i \left( W_i \times C_i \right)}{\text{Population}} \right) \times 1000 \]

Where:

$C_i$ = The number of crimes of type $i$
$W_i$ = Weight assigned to crime type $i$ based on its severity or impact
$\text{Population}$ = Population of the city
Multiply by 1000 to obtain a rate per 1,000 residents.

Crime	Weight
`Violent Crime`	5
`Murder and Nonnegligent Manslaughter`	10
`Rape`	8
`Robbery`	5
`Aggravated assault`	4
`Property crime`	2
`Burglary`	2
`larceny-theft`	1
`Motor Vehicle Theft`	3
`Arson`	4

library(dplyr)




florida_indexed <- florida %>%
  mutate(
    # Sum of weighted crimes
    weighted_sum = (murder * 10) + 
                   (Rape * 8) + 
                   (Robbery * 5) + 
                   (assault * 4) + 
                   (property_crime * 2) + 
                   (Burglary * 2) + 
                   (larceny * 1) + 
                   (motor_theft * 3) + 
                   (Arson * 4),
    
    # Calculate final Index per 1,000 residents
    crime_index = (weighted_sum / population) * 1000
  ) %>%
  # Filter out any rows with NAs or where population is 0 to avoid errors
  filter(!is.na(crime_index), population > 0)


most_dangerous <- florida_indexed %>%
  arrange(desc(crime_index)) %>%
  slice_head(n = 10) %>%
  select(City, population, crime_index)

safest <- florida_indexed %>%
  arrange(crime_index) %>%
  slice_head(n = 10) %>%
  select(City, population, crime_index)



most_dangerous

safest

Choose 2 out of the three following problems

Problem 3 (10 points)

Let us use the nycflights13 package for this exercise. Note: load the nycflighst13 package

Datasets available in this package are:

airlines: Airline names
airports: Airport metadata
flights: Flights data
planes: Plane metadata
weather: Hourly weather data

nycflights13 datasets

(2 points) What does anti-join between flights and airports (in that order) tell you? Briefly explain.
(2 points) What does anti-join between airports and flights (in that order) tell you? Briefly explain.
(2 points) In which month did flights tend to have the shortest delays?
(4 points) Consider flights that arrive to Tampa airport (TPA). Find the average arrival delay per carrier of all flights that arrive in Tampa, during your birthday in 2013.

Problem 4 (10 points)

Use the CIACountries.csv file available at https://raw.githubusercontent.com/aalhamadani/Data_Wrangling_2026/refs/heads/main/practice/CIACountries.csv

This dataset comes from the (2014) CIA Factbook which has geographic, demographic, and economic data on a country-by-country basis. (In the description of the variables, the 4-digit number indicates the code used to specify that variable on the data and documentation web site)

Variable	Description
`country`	Name of the country
`pop`	Number of people, 2119
`area`	Area (sq km), 2147
`oil_prod`	Crude oil - production (bbl/day), 2241
`gdp`	Gross Domestic Product per capita ($/person), 2001
`educ`	Education spending (% of GDP), 2206
`roadways`	Roadways per unit area (km/sq km), 2085
`net_users`	Fraction of Internet users (% of population), 2153

(6 points) Create two data summaries of your choice, and comment on your results.
(4 points) Select two variables of interest and report back the number of missing entries for those variables/attributes.

library(dplyr)
library(readr)

# Load the dataset
url <- "https://raw.githubusercontent.com/aalhamadani/Data_wrangling_2026/refs/heads/main/practice/CIACountries.csv"
cia_data <- read_csv(url)

## Rows: 236 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): country, net_users
## dbl (6): pop, area, oil_prod, gdp, educ, roadways
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Summary 1: Average GDP and Total Population by sample
gdp_summary <- cia_data %>%
  summarise(
    avg_gdp = mean(gdp, na.rm = TRUE),
    max_gdp = max(gdp, na.rm = TRUE),
    total_pop_billions = sum(pop, na.rm = TRUE) / 1000000000
  )

# Summary 2: Top 5 Largest Countries by Area
top_5_largest <- cia_data %>%
  select(country, area) %>%
  arrange(desc(area)) %>%
  slice(1:5)


gdp_summary

top_5_largest

# Counting missing values for 'educ' and 'net_users'
missing_report <- cia_data %>%
  summarise(
    missing_educ = sum(is.na(educ)),
    missing_net_users = sum(is.na(net_users))
  )

missing_report

Problem 5 (10 points)

Consider the data available at https://raw.githubusercontent.com/aalhamadani/Data_Wrangling_2026/main/practice/world_bank_data.csv. The dataset in this example was downloaded from the World Bank Data Explorer, which has information on a collection of hundreds of indicators (measures) of different economic and social development factors.

Consider educational indicators that capture a relevant signal of a country’s level of (or investment in) education, such as government expenditure on education, literacy rates, school enrollment rates, and others. The file used in this example has several imperfections: unnecessary rows at the top of the .csv file, a lot of missing data (NA), long column names. All of theses are representative challenges involved in working with real datasets.

Each observation (row) represents an indicator for a country, with features (columns) that are values of the indicator in a given year.

Notice the names assigned to the different columns, and the many missing values NAs in this dataframe.

(2 points) Read the dataset.
(4 points) The original structure of the data, allows you to make comparisons between years for the indicators by filtering for the indicator of interest. Consider the indicator below:

url <- "https://raw.githubusercontent.com/aalhamadani/Data_Wrangling_2026/main/practice/world_bank_data.csv"

# Read data and skip the first 4 rows
world_bank <- read_csv(url, skip = 4)

## New names:
## Rows: 41712 Columns: 63
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (4): Country Name, Country Code, Indicator Name, Indicator Code dbl (58): 1960,
## 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, ... lgl (1): ...63
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...63`

world_bank <- world_bank %>%
  select(-last_col())

 indicator <- "Government expenditure on education, total (% of GDP)"
 
 indicator_label <- "Government expenditure on education, total (% of GDP)"

education_delta <- world_bank %>%
  filter(`Indicator Name` == indicator_label) %>%
  select(`Country Name`, `1990`, `2014`) %>%
  mutate(across(c(`1990`, `2014`), as.numeric)) %>%
  mutate(delta = `2014` - `1990`) %>%
  filter(!is.na(delta))

head(education_delta)

Calculate the change, call it delta, between 2014 and 1990.

(4 points) Consider the problem of comparing how the expenditure across all years varies for a given country. We would need to reshape this dataframe, so that each observation is an indicator for a country for a year, making the data in long(er) format. Reshape the dataset as described above, and show a table for the indicator in part (b) and a country of your choice (notice that for example, if you want to filter data for Spain, you could use the “country code” ESP - values for some years may be missing)

world_bank_long <- world_bank %>%
  pivot_longer(
    cols = `1960`:`2017`,
    names_to = "Year",
    values_to = "Value"
  ) %>%
  mutate(Year = as.numeric(Year),
         Value = as.numeric(Value))


Afghanistan_edu <- world_bank_long %>%
  filter(!is.na(Value)) %>%
  filter(`Country Code` == "AFG", 
         `Indicator Name` == indicator_label)

Afghanistan_edu

(3 extra-points) Create a data visualization of your choice that relates to the dataset in this exercise.

ggplot(Afghanistan_edu, aes(x = Year, y = Value)) +
  geom_line(color = "steelblue", size = 1) +
  geom_point() +
  labs(
    title = "Education Expenditure in Spain (% of GDP)",
    subtitle = "Data source: World Bank",
    x = "Year",
    y = "Expenditure (% of GDP)"
  ) +
  theme_minimal()

Midterm Exam - Part 2

Makenzie Wiggins