Loading libraries:

library(kableExtra)
library(RSocrata)   
library(tidyverse)
library(viridis)
library(readr)
library(readxl)
library(janitor) 
library(lubridate)
library(ggplot2)
library(scales)
library(stringr)
library(forcats)

Introduction

Project Overview

The goal of this project is to tidy, transform, and analyze three different datasets using R, leveraging the tidyverse, tidyr, and dplyr packages. These datasets, originally in an untidy “wide” format, require cleaning, restructuring, and standardization before analysis can be performed.

By the end of this project, we will:

Convert three untidy datasets into a structured format for analysis.
Perform data wrangling using tidyr and dplyr to clean and reshape the data.
Conduct exploratory data analysis (EDA) to uncover insights and trends.
Document the transformation process and provide meaningful conclusions.

This project is a collaborative effort, and each dataset presents a unique challenge in terms of data cleaning, structuring, and interpretation. The final results will be published as an R Markdown report, demonstrating the power of data transformation techniques.

Overview of Datasets

Each dataset represents a different domain and requires a unique transformation approach. Below is a summary of the datasets used in this project:

Dataset #1: Emissions Data

Description: This dataset provides information on pollutant emissions over multiple years. The data includes various emission sources and their impact over time.
Data Issues: The dataset is in wide format, with emissions spread across multiple columns by year.
Transformation Steps: We will convert it into a long format, making it easier to analyze trends over time.

Dataset #2: New York State Department of Environmental Conservation’s Application Review & Tracking System from 2020-2025 (DART)

Description: This dataset contains public about environment permits issued by New York State’s Department of Environmental Conservation. This report explores water permits, regulated under the National Pollutant Discharge Elimination System (NPDES).
Data Issues: The dataset is not normalized, and some entries are duplicated.
Transformation Steps: We will standardize date formats, remove redundant data, and ensure consistency across permit records.

Dataset #3: Cheese Nutritional Data

Description: This dataset provides nutritional information on various types of cheese.
Data Issues: The dataset is structured as a wide table, making it difficult to compare across different cheese types.
Transformation Steps: We will reshape the data into a long format, making it easier to compare nutritional values across different cheese varieties.

Relevance of These Datasets

Each dataset requires different data transformation techniques, making them ideal for practicing tidyr and dplyr functions. The common themes across these datasets include:

Converting wide-format data into long format.
Standardizing date and time fields.
Handling missing values and duplicates.
Preparing the data for downstream statistical analysis and visualization.

By applying tidy data principles, we ensure that each dataset is structured, organized, and ready for analysis. The insights gained from this project can be used for policy recommendations, water quality analysis, and compliance to the Clean Water Act.

Data Preparation and Cleaning

Emissions Data:

df <- read.csv("https://raw.githubusercontent.com/justin-2028/Total-Emissions-Per-Country-2000-2020/refs/heads/main/Total%20Emissions%20Per%20Country%20(2000-2020).csv")

colnames(df) <- gsub("^X", "", colnames(df))

print(head(df))

##          Area             Item                          Element       Unit
## 1 Afghanistan    Crop Residues           Direct emissions (N2O) kilotonnes
## 2 Afghanistan    Crop Residues         Indirect emissions (N2O) kilotonnes
## 3 Afghanistan    Crop Residues                  Emissions (N2O) kilotonnes
## 4 Afghanistan    Crop Residues Emissions (CO2eq) from N2O (AR5) kilotonnes
## 5 Afghanistan    Crop Residues          Emissions (CO2eq) (AR5) kilotonnes
## 6 Afghanistan Rice Cultivation                  Emissions (CH4) kilotonnes
##      2000     2001     2002     2003     2004     2005     2006     2007
## 1   0.520   0.5267   0.8200   0.9988   0.8225   1.1821   1.0277   1.2426
## 2   0.117   0.1185   0.1845   0.2247   0.1851   0.2660   0.2312   0.2796
## 3   0.637   0.6452   1.0045   1.2235   1.0075   1.4481   1.2589   1.5222
## 4 168.807 170.9884 266.1975 324.2195 266.9995 383.7498 333.6093 403.3749
## 5 168.807 170.9884 266.1975 324.2195 266.9995 383.7498 333.6093 403.3749
## 6  18.200  16.9400  18.9000  20.3000  27.3000  22.4000  22.4000  23.8000
##       2008     2009     2010     2011     2012     2013     2014     2015
## 1   0.8869   1.3920   1.2742   1.0321   1.3726   1.4018   1.4584   1.2424
## 2   0.1996   0.3132   0.2867   0.2322   0.3088   0.3154   0.3281   0.2795
## 3   1.0865   1.7051   1.5609   1.2643   1.6815   1.7173   1.7865   1.5220
## 4 287.9099 451.8647 413.6467 335.0379 445.5958 455.0727 473.4174 403.3181
## 5 287.9099 451.8647 413.6467 335.0379 445.5958 455.0727 473.4174 403.3181
## 6  26.6000  28.0000  29.1200  29.4000  28.7000  28.7000  30.8000  22.9600
##       2016     2017     2018     2019     2020
## 1   1.1940   1.0617   0.8988   1.2176   1.3170
## 2   0.2687   0.2389   0.2022   0.2740   0.2963
## 3   1.4627   1.3005   1.1011   1.4916   1.6133
## 4 387.6130 344.6447 291.7838 395.2689 427.5284
## 5 387.6130 344.6447 291.7838 395.2689 427.5284
## 6  16.6600  15.3233  16.4555  17.8542  20.6577

Make longer

All the year columns were changed to one column under year. The dataset was made longer. This makes it tidy.

df_longer <- df |> 
  pivot_longer(
    cols = starts_with("2"), 
    names_to = "year", 
    values_to = "total emissions",
    values_drop_na = TRUE
  )
print(head(df))

##          Area             Item                          Element       Unit
## 1 Afghanistan    Crop Residues           Direct emissions (N2O) kilotonnes
## 2 Afghanistan    Crop Residues         Indirect emissions (N2O) kilotonnes
## 3 Afghanistan    Crop Residues                  Emissions (N2O) kilotonnes
## 4 Afghanistan    Crop Residues Emissions (CO2eq) from N2O (AR5) kilotonnes
## 5 Afghanistan    Crop Residues          Emissions (CO2eq) (AR5) kilotonnes
## 6 Afghanistan Rice Cultivation                  Emissions (CH4) kilotonnes
##      2000     2001     2002     2003     2004     2005     2006     2007
## 1   0.520   0.5267   0.8200   0.9988   0.8225   1.1821   1.0277   1.2426
## 2   0.117   0.1185   0.1845   0.2247   0.1851   0.2660   0.2312   0.2796
## 3   0.637   0.6452   1.0045   1.2235   1.0075   1.4481   1.2589   1.5222
## 4 168.807 170.9884 266.1975 324.2195 266.9995 383.7498 333.6093 403.3749
## 5 168.807 170.9884 266.1975 324.2195 266.9995 383.7498 333.6093 403.3749
## 6  18.200  16.9400  18.9000  20.3000  27.3000  22.4000  22.4000  23.8000
##       2008     2009     2010     2011     2012     2013     2014     2015
## 1   0.8869   1.3920   1.2742   1.0321   1.3726   1.4018   1.4584   1.2424
## 2   0.1996   0.3132   0.2867   0.2322   0.3088   0.3154   0.3281   0.2795
## 3   1.0865   1.7051   1.5609   1.2643   1.6815   1.7173   1.7865   1.5220
## 4 287.9099 451.8647 413.6467 335.0379 445.5958 455.0727 473.4174 403.3181
## 5 287.9099 451.8647 413.6467 335.0379 445.5958 455.0727 473.4174 403.3181
## 6  26.6000  28.0000  29.1200  29.4000  28.7000  28.7000  30.8000  22.9600
##       2016     2017     2018     2019     2020
## 1   1.1940   1.0617   0.8988   1.2176   1.3170
## 2   0.2687   0.2389   0.2022   0.2740   0.2963
## 3   1.4627   1.3005   1.1011   1.4916   1.6133
## 4 387.6130 344.6447 291.7838 395.2689 427.5284
## 5 387.6130 344.6447 291.7838 395.2689 427.5284
## 6  16.6600  15.3233  16.4555  17.8542  20.6577

Total emissions per country for each year

yearly_emissions_by_area <- aggregate(df_longer$'total emissions', by = list(df_longer$year, df_longer$Area), FUN = sum)

yearly_emissions_by_area

#rename columns
yearly_emissions_by_area <-
yearly_emissions_by_area %>% 
  rename(
    year = Group.1,
    country = Group.2,
    emissions = x
    )

yearly_emissions_by_area

print(head(df))

Analyze overall total emissions per country for each year

Too many different countries

ggplot(yearly_emissions_by_area, aes(x = year, y = emissions, 
                                     fill = country)) +
  geom_tile()

print(head(df))

##          Area             Item                          Element       Unit
## 1 Afghanistan    Crop Residues           Direct emissions (N2O) kilotonnes
## 2 Afghanistan    Crop Residues         Indirect emissions (N2O) kilotonnes
## 3 Afghanistan    Crop Residues                  Emissions (N2O) kilotonnes
## 4 Afghanistan    Crop Residues Emissions (CO2eq) from N2O (AR5) kilotonnes
## 5 Afghanistan    Crop Residues          Emissions (CO2eq) (AR5) kilotonnes
## 6 Afghanistan Rice Cultivation                  Emissions (CH4) kilotonnes
##      2000     2001     2002     2003     2004     2005     2006     2007
## 1   0.520   0.5267   0.8200   0.9988   0.8225   1.1821   1.0277   1.2426
## 2   0.117   0.1185   0.1845   0.2247   0.1851   0.2660   0.2312   0.2796
## 3   0.637   0.6452   1.0045   1.2235   1.0075   1.4481   1.2589   1.5222
## 4 168.807 170.9884 266.1975 324.2195 266.9995 383.7498 333.6093 403.3749
## 5 168.807 170.9884 266.1975 324.2195 266.9995 383.7498 333.6093 403.3749
## 6  18.200  16.9400  18.9000  20.3000  27.3000  22.4000  22.4000  23.8000
##       2008     2009     2010     2011     2012     2013     2014     2015
## 1   0.8869   1.3920   1.2742   1.0321   1.3726   1.4018   1.4584   1.2424
## 2   0.1996   0.3132   0.2867   0.2322   0.3088   0.3154   0.3281   0.2795
## 3   1.0865   1.7051   1.5609   1.2643   1.6815   1.7173   1.7865   1.5220
## 4 287.9099 451.8647 413.6467 335.0379 445.5958 455.0727 473.4174 403.3181
## 5 287.9099 451.8647 413.6467 335.0379 445.5958 455.0727 473.4174 403.3181
## 6  26.6000  28.0000  29.1200  29.4000  28.7000  28.7000  30.8000  22.9600
##       2016     2017     2018     2019     2020
## 1   1.1940   1.0617   0.8988   1.2176   1.3170
## 2   0.2687   0.2389   0.2022   0.2740   0.2963
## 3   1.4627   1.3005   1.1011   1.4916   1.6133
## 4 387.6130 344.6447 291.7838 395.2689 427.5284
## 5 387.6130 344.6447 291.7838 395.2689 427.5284
## 6  16.6600  15.3233  16.4555  17.8542  20.6577

Analyze total emissions over time

As you can see from the graph, total emissions have gone up steadily from 2000 to 2019, but in 2020, it decreased a significant amount. This might be due to more awareness about climate change and global warming.

yearly_emissions <- aggregate(df_longer$'total emissions', by=list(df_longer$year), FUN = sum)
yearly_emissions

##    Group.1          x
## 1     2000 2143119367
## 2     2001 2109760240
## 3     2002 2180860383
## 4     2003 2220985897
## 5     2004 2321819994
## 6     2005 2348671376
## 7     2006 2421011446
## 8     2007 2428782054
## 9     2008 2463686673
## 10    2009 2468939225
## 11    2010 2511864950
## 12    2011 2523633556
## 13    2012 2558572795
## 14    2013 2576071615
## 15    2014 2618685489
## 16    2015 2639203455
## 17    2016 2636367158
## 18    2017 2665248135
## 19    2018 2712258358
## 20    2019 2741323659
## 21    2020 2648131930

#rename columns
yearly_emissions <-
yearly_emissions %>% 
  rename(
    year = Group.1,
    emissions = x
    )

ggplot(yearly_emissions, aes(x = year, y = emissions)) +
  geom_point()

Total emissions per country

Some of the top countries that contributed to emissions are China, USA, Brazil, India, Indonesia, and Democratic Republic of the Congo.

emissions_by_area <- aggregate(df_longer$'total emissions', by = list(df_longer$Area), FUN = sum)

#rename columns
emissions_by_area <-
emissions_by_area %>% 
  rename(
    country = Group.1,
    emissions = x
    )

top <- emissions_by_area[order(-emissions_by_area$emissions),]

print(head(df))

##          Area             Item                          Element       Unit
## 1 Afghanistan    Crop Residues           Direct emissions (N2O) kilotonnes
## 2 Afghanistan    Crop Residues         Indirect emissions (N2O) kilotonnes
## 3 Afghanistan    Crop Residues                  Emissions (N2O) kilotonnes
## 4 Afghanistan    Crop Residues Emissions (CO2eq) from N2O (AR5) kilotonnes
## 5 Afghanistan    Crop Residues          Emissions (CO2eq) (AR5) kilotonnes
## 6 Afghanistan Rice Cultivation                  Emissions (CH4) kilotonnes
##      2000     2001     2002     2003     2004     2005     2006     2007
## 1   0.520   0.5267   0.8200   0.9988   0.8225   1.1821   1.0277   1.2426
## 2   0.117   0.1185   0.1845   0.2247   0.1851   0.2660   0.2312   0.2796
## 3   0.637   0.6452   1.0045   1.2235   1.0075   1.4481   1.2589   1.5222
## 4 168.807 170.9884 266.1975 324.2195 266.9995 383.7498 333.6093 403.3749
## 5 168.807 170.9884 266.1975 324.2195 266.9995 383.7498 333.6093 403.3749
## 6  18.200  16.9400  18.9000  20.3000  27.3000  22.4000  22.4000  23.8000
##       2008     2009     2010     2011     2012     2013     2014     2015
## 1   0.8869   1.3920   1.2742   1.0321   1.3726   1.4018   1.4584   1.2424
## 2   0.1996   0.3132   0.2867   0.2322   0.3088   0.3154   0.3281   0.2795
## 3   1.0865   1.7051   1.5609   1.2643   1.6815   1.7173   1.7865   1.5220
## 4 287.9099 451.8647 413.6467 335.0379 445.5958 455.0727 473.4174 403.3181
## 5 287.9099 451.8647 413.6467 335.0379 445.5958 455.0727 473.4174 403.3181
## 6  26.6000  28.0000  29.1200  29.4000  28.7000  28.7000  30.8000  22.9600
##       2016     2017     2018     2019     2020
## 1   1.1940   1.0617   0.8988   1.2176   1.3170
## 2   0.2687   0.2389   0.2022   0.2740   0.2963
## 3   1.4627   1.3005   1.1011   1.4916   1.6133
## 4 387.6130 344.6447 291.7838 395.2689 427.5284
## 5 387.6130 344.6447 291.7838 395.2689 427.5284
## 6  16.6600  15.3233  16.4555  17.8542  20.6577

Analysis of Total emissions per country

too many countries, cant read

ggplot(emissions_by_area, aes(x = country, y = emissions), label=NA)+ geom_point()

Aanalyze by emission type

Emissions (CO2eq) (AR5) are highest. They are over 200,000 kilotonnes. The second highest place is tied with emissions (N20) and emissions (CO2eq) from N20 (AR5). Lowest emissions are (CO2eq) from F-gases, less than 25,000 kilotonnes.

ggplot(df_longer, aes(y=Element)) +
         geom_bar()

USA emission types distribution

The distribution looks very similar to the distribution with the data from all the regions. For USA, the counts are smaller. Highest are emissions (CO2eq) (AR5), a little less than 1250 kilotonnes. Lowest emissions are (CO2eq) from F-gases, around 125 kilotonnes. The second highest place is from emissions (N20), around 950 kilotonnes.

usa <- df_longer %>%
  filter(Area == "United States of America")

ggplot(usa, aes(y=Element)) +
         geom_bar()

Item Analysis

Highest item is IPCC Agriculture. Second highest is farm gate emissions. Lowest is international bunkers.

ggplot(df_longer, aes(y=Item)) +
         geom_bar()

Item Analysis USA

Highest item is IPCC Agriculture, just like in the overall data. Lowest is drained organic soils (C02) and drained organic soils.

ggplot(usa, aes(y=Item)) +
         geom_bar()

NYSDEC Water Permit Data (DART):

Loading NYSDEC DART Data

Data used in this section comes from New York State Department of Environmental Conservation’s Application Review & Tracking System (DART on the Web).

DART is a web-based application and tracking system that is designed for the general public. DART hosts information about NYSDEC’s processing and issuance of environmental permits under the Uniform Procedures Act. The data is updated daily, and more information about the data can be found in the data dictionary.

In this section, data was previously filtered to only include DART entries from 2020-2025, and will be focused on waste water permits that discharge to surface water.

library(readr)
dart <- read_csv("https://raw.githubusercontent.com/AlinaVikhnevich/data_607/refs/heads/main/Project%202/dart_2020_2025.csv")

Defining Regex Patterns to Detect NPDES IDs

To identify wastewater permits, there are three regex patterns to identify:

NPDES Permit (meaning a regular permit).
General Permit
Individual Permit (these are permits that are processed under general permits).

For more information about permit types please see the question “What are the primary differences between a NODES individual permit and a NPDES general permit” under EPA’s NPDES Permit Basics Site.

# p_type = permit type, in this exercise we are filtering for wastewater permits
p_type <- c("P/C/I SPDES - Surface Discharge",
            "Municipal SPDES - Surface Discharge",
            "Industrial SPDES - Surface Discharge")

# defining the regex patterns for the IDs we want to track
npdes_pattern <- "NY\\d{7}"
gp_pattern <- "GP\\d{7}"
individual_pattern <- "NY[A-Z]\\d{2}[A-Z]\\d{3}"
all_patterns <- paste(npdes_pattern,gp_pattern,individual_pattern, sep="|")

Creating the NPDES Universe

Creating the permit universe pulling from NYSDEC’s DART System and detecting the string patterns within DART to assign permit type: npdes, individual(i.e., a permit covered under a general permit), general, or multi (meaning the DART entry had multiple associated IDs).

universe <- dart |>  
  filter(`permit_type` %in% p_type) |>  
  mutate(
    npdes = str_count(`other_known_ids`, npdes_pattern), # the str_counts are taking count of permit ID type
    individual = str_count(`other_known_ids`, individual_pattern),
    gp =  str_count(`other_known_ids`, gp_pattern),
    sum_ids = rowSums(across(c(`npdes`, `individual`,`gp`))),
    npdes_id = str_extract_all(`other_known_ids`, all_patterns),
    date_received=as.Date(date_received, format = "%d-%m-%Y")
    ) |> 
  mutate(applicant_id =cur_group_id(),.by = applicant) |> # creating applicant id
  mutate(facility_id = cur_group_id(),.by = c(facility,location,town_or_city)) |> # creating facility id
  distinct() |> # removing duplicate rows 
  mutate(
          dart_p_type = case_when(sum_ids  > 1 ~ "multi", # if entry is associated with multiple ids, it is flagged as multi
                               sum_ids & npdes == 1 ~ "npdes",
                               sum_ids & individual == 1 ~ "individual",
                               sum_ids & gp == 1 ~ "gp")) |>  
  unnest_longer(npdes_id, keep_empty = FALSE) |> 
  filter(!is.na(npdes_id))

Note: The code above filters entries that did not have a NPDES ID listed in the “Other Known IDs” column, however, were listed as NPDES permits in the Permit Type Column. However, out of 35,642 entries, only 69 were missing NPDES IDs.

Tidy Data

Table 1: Permit Level Data

This table shows the most recent permit information

tbl1_permit_lvl <- universe  |> 
  group_by(npdes_id) |> 
  slice(which.max(date_received)) |>
  select(npdes_id,facility_id,application_id,applicant,applicant_id,permit_type,
         status,date_received,upa_class,seqr_class,seqr_determination,
         lead_agency,coastal_zone_status, final_disposition,permit_effective_date,
         permit_expration_date,dec_contact,shpa_status,enivronmental_justice)

Permit Data Head
npdes_id	facility_id	application_id	applicant	applicant_id	permit_type	status	date_received	upa_class	seqr_class	seqr_determination	lead_agency	coastal_zone_status	final_disposition	permit_effective_date	permit_expration_date	dec_contact	shpa_status	enivronmental_justice
NY0000044	574	3-3724-00045/00004	WATCHTOWER BIBLE AND TRACT SOCIETY OF NEW YORK INC	158	P/C/I SPDES - Surface Discharge	Issued	2024-02-16	MINOR	Type II Action	Not Applicable	None Designated	This project is not located in a Coastal Management area.	Issued	2024-09-01 04:00:00	2029-08-31 04:00:00	KATHERINE M MURRAY	NA	NA
NY0000078	706	8-5436-00007/00014	GARLOCK SEALING TECHNOLOGIES LLC	593	Industrial SPDES - Surface Discharge	SAPA Extended	2021-02-03	MAJOR	Type II Action	Not Applicable	None Designated	This project is not located in a Coastal Management area.	Issued	2023-07-01 04:00:00	2023-08-31 04:00:00	GUILLERMO R SAAR	NA	The application is subject to the Department Environmental Justice policy (CP-29). Either the permits needed for the project are not subject to the policy or it has been determined that the project would not affect a Potential Environmental Justice Area.
NY0000167	1906	8-2499-00039/00002	RETSOF REALTY LLC	1571	Industrial SPDES - Surface Discharge	Issued	2021-04-15	MINOR	Type II Action	Not Applicable	None Designated	This project is not located in a Coastal Management area.	Issued	2022-01-01 05:00:00	2026-12-31 05:00:00	MICHAEL R SCHAEFER	NA	NA
NY0000247	1556	3-5518-00680/00001	CONSOLIDATED EDISON COMPANY OF NEW YORK, INC.	914	Industrial SPDES - Surface Discharge	Suspended Indefinitely	2024-04-22	MINOR	Type II Action	Not Applicable	None Designated	This project is not located in a Coastal Management area.	NA	NA	NA	KATHERINE M MURRAY	NA	NA
NY0000281	539	9-1464-00117/00013	LINDE INC	455	Industrial SPDES - Surface Discharge	Issued	2021-04-29	MINOR	Type II Action	Not Applicable	None Designated	This project is not located in a Coastal Management area.	Issued	2022-01-11 05:00:00	2026-12-31 05:00:00	MICHAEL R SCHAEFER	NA	NA
NY0000311	1420	8-3224-00108/00031	PACTIV LLC	1186	Industrial SPDES - Surface Discharge	Issued	2021-08-02	MINOR	Type II Action	Not Applicable	None Designated	This project is not located in a Coastal Management area.	Issued	2022-03-01 05:00:00	2027-02-28 05:00:00	MICHAEL R SCHAEFER	NA	NA

Table 2: Permit Action Level Data

This table shows the permit history. each observation in this table represents a permit action.

tbl2_permit_act_lvl <- universe |> 
    mutate(action_id = paste(npdes_id,date_received, sep = "_")) |> 
    distinct() |> 
    mutate(dup_flag = duplicated(action_id),
           transfer_flag=str_detect(toupper(short_description),"TRANSFER")) |> 
  select(action_id,facility,facility_id,npdes_id,application_id,applicant,
         application_type,date_received,status,short_description,
         enb_publication_date,written_comments_due,dup_flag,transfer_flag)

tbl2_permit_act_lvl$short_description <- tolower(tbl2_permit_act_lvl$short_description)

Permit Action Data Head
action_id	facility	facility_id	npdes_id	application_id	applicant	application_type	date_received	status	short_description	enb_publication_date	written_comments_due	dup_flag	transfer_flag
NY0101915_2022-10-05	B O C E S	1	NY0101915	3-1332-00172/00001	DUTCHESS BOCES	Renewal Treat as New	2022-10-05	Issued	spdes fast track renewal for ny0101915	2023-02-15 05:00:00	2023-03-17 04:00:00	FALSE	FALSE
NYR10L635_2023-03-24	NYC EDC-SOUTH BROOKLYN MARINE TERMINAL	2	NYR10L635	2-6102-00120/00032	NYC ECONOMIC DEVELOPMENT CORP	New	2023-03-24	Issued	spdes application for substation test pits and construction	2024-03-20 04:00:00	2024-04-19 04:00:00	FALSE	FALSE
NY0313149_2023-03-24	NYC EDC-SOUTH BROOKLYN MARINE TERMINAL	2	NY0313149	2-6102-00120/00032	NYC ECONOMIC DEVELOPMENT CORP	New	2023-03-24	Issued	spdes application for substation test pits and construction	2024-03-20 04:00:00	2024-04-19 04:00:00	FALSE	FALSE
NY0035441_2021-06-30	CHAUTAUQUA FISH HATCHERY	3	NY0035441	9-0628-00098/00004	NYS Dept of Environmental Conservation	New	2021-06-30	Issued	reduce conc. limit of formalin & change in phos. loading limit	2021-09-29 04:00:00	2021-10-29 04:00:00	FALSE	FALSE
NY0071897_2021-03-30	EFFRON FUEL OIL CORP TERMINAL	4	NY0071897	3-1313-00015/00002	PETRO INC	Minor Modification	2021-03-30	Expired	transfer	NA	NA	FALSE	TRUE
NY0071897_2022-09-08	EFFRON FUEL OIL CORP TERMINAL	4	NY0071897	3-1313-00015/00002	PETRO INC	Renewal Treat as New	2022-09-08	Issued	spdes fast track renewal for ny 0071897	2022-11-02 04:00:00	2022-12-02 05:00:00	FALSE	FALSE

Table 3: Facility Level Data

This table shows the facility information. Each observation in this table represents a facility associated with NPDES permits.

tbl3_facility_lvl <- universe |> 
  select(facility_id,facility,
         location,town_or_city) |> 
  distinct() |> 
  arrange(facility_id)

Facility Data Head
facility_id	facility	location	town_or_city
1	B O C E S	578 SALT POINT TURNPIKE HYDE PARK 12538	HYDE PARK
2	NYC EDC-SOUTH BROOKLYN MARINE TERMINAL	Sunset Park and Greenwood Heights 29th to 39th St\|2nd Ave To Ny Harbor Brooklyn (6102) 11232	BROOKLYN
3	CHAUTAUQUA FISH HATCHERY	5875 PRENDERGAST RD MAYVILLE 14757	CHAUTAUQUA
4	EFFRON FUEL OIL CORP TERMINAL	FOOT OF PROSPECT ST POUGHKEEPSIE 12602	POUGHKEEPSIE
5	POUGHKEEPSIE STP	173 KITTREDGE PL POUGHKEEPSIE 12601	POUGHKEEPSIE
6	AMENIA S & G-LEEDSVILLE PROCESSING PLANT	307 LEEDSVILLE RD AMENIA 12501	AMENIA

Table 4: NPDES Permit Applicant Table

This table shows the applicant information. Each observation in this table represents a permit applicant for NPDES permits.

tbl4_app_lvl <- universe |> 
  group_by(applicant_id) |> 
  slice(which.max(date_received)) |>
  select(applicant_id,applicant,application_id)

NPDES Permit Applicant Data Head
applicant_id	applicant	application_id
1	DUTCHESS BOCES	3-1332-00172/00001
2	NYC ECONOMIC DEVELOPMENT CORP	2-6402-00004/00100
3	NYS Dept of Environmental Conservation	3-4844-00112/00001
4	PETRO INC	3-1334-00136/00001
5	CITY OF POUGHKEEPSIE	3-1346-00364/00003
6	DOLOMITE PRODUCTS COMPANY INC.	8-1836-00001/02002

Data Tables and Structure

(1) Table 1 - permit table: the purpose of this table is to have the most
recent permit information. This will have one row per permit.

(2) Table 2 - permit action table: the purpose of this table is to have a 
table with every permit-action. This means there should be one row per
permit action. 

(3) Table 3 - facility table: the purpose of this table is to have 
information on the facility. 

(4) Table 4 - applicant table: the purpose of this table is to have 
information about the applicant.

Data Considerations:

There was missing data, such as NPDES IDs. This means that some permit information may not be available.
There may be facilities that are listed as different facilities due to address changes. This information should be verified. Databases like EPA’s Enforcement and Compliance History Online (ECHO)) may be helpful for verifying facility information.
For entries that were made on the same day for a particular permit, it is not possible to identify which entry was made first. Permit transfer actions are largely affected by this. Due to this, duplicates and transfers are flagged for manual review.

Analysis

permit_status <- tbl1_permit_lvl |> 
group_by(status) |> 
  summarize(
    Count = n(),
    Proportion = (n()/nrow(tbl1_permit_lvl))*100
  ) |> 
  arrange(desc(Proportion)) |> 
  head(5) |> 
  rename("Status" = "status") 
  
permit_status$Proportion <- paste0(round(permit_status$Proportion, digits=1),"%")

ggplot(permit_status,aes(x = reorder(Status, -Count), y= Count)) +
    geom_bar(stat="identity", fill="lightblue", width=0.5)+
    geom_text(aes(label=Proportion),
              hjust=.35)+
    theme_minimal()+
    labs(title="Permits by Status",x="Status")+
    theme(axis.text.y =element_text(angle = 55,hjust=1))+
    coord_flip()

upa_class <- tbl1_permit_lvl |> 
group_by(upa_class) |> 
  summarize(
    Count = n(),
    Proportion = (n()/nrow(tbl1_permit_lvl))*100
  ) 

upa_class$Proportion <- paste0(round(upa_class$Proportion, digits=1),"%")

ggplot(upa_class,aes(x = reorder(upa_class, -Count), y= Count)) +
    geom_bar(stat="identity", fill="lightblue", width=0.5)+
    geom_text(aes(label=Proportion),
              hjust=.5,
              vjust=0.25)+
    theme_minimal()+
    labs(title="Permits by UPA Class",x="UPA Class")

final_dis <- tbl1_permit_lvl |> 
group_by(final_disposition) |> 
  summarize(
    Count = n(),
    Proportion = (n()/nrow(tbl1_permit_lvl))*100
  ) |> 
  arrange(desc(Count)) |> 
  head(5)

final_dis$Proportion <- paste0(round(final_dis$Proportion, digits=1),"%")
final_dis$Count <- as.numeric(final_dis$Count)

final_dis <- final_dis |> 
  clean_names("title")

ggplot(final_dis,aes(x =reorder(`Final Disposition`,`Count`, .desc = TRUE), y= Count)) +
    geom_bar(stat="identity", fill="lightblue", width=0.5)+
    theme_minimal()+
    labs(title="Permits by Final Disposition",x="Final Disposition")+
    theme(axis.text.y =element_text(angle = 55,hjust=1))+
    coord_flip()

app_type <- tbl2_permit_act_lvl |> 
group_by(application_type) |> 
  summarize(
    Count = n(),
    Proportion = n()/nrow(tbl2_permit_act_lvl)
  ) |> 
  clean_names("title") |> 
  arrange(desc(Count))

knitr::kable(app_type, format ="markdown")

Application Type	Count	Proportion
Renewal Treat as New	2565	0.7753930
Modification Treat as New	274	0.0828295
Minor Modification	185	0.0559250
New	160	0.0483676
Modification	48	0.0145103
DIM Treat as New	47	0.0142080
Department Initiated Modification	29	0.0087666

short_desc <- tbl2_permit_act_lvl |>
  mutate(c_fast_track=coalesce(str_count(short_description,"fast track"),0)) |> 
  summarize(
    "Fast Tracked Renewal Actions" = sum(c_fast_track),
    "Total Actions" = nrow(tbl2_permit_act_lvl),
    Proportion = sum(c_fast_track)/nrow(tbl2_permit_act_lvl)
  ) |> 
  clean_names("title")

knitr::kable(short_desc, format ="markdown")

Fast Tracked Renewal Actions	Total Actions	Proportion
2204	3308	0.6662636

Cheese Dataset:

Import Data

Read in raw csv file as data frame

data <- read.csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/refs/heads/main/data/2024/2024-06-04/cheeses.csv")

View Raw Data

Raw Data Head
cheese	url	milk	country	region	family	type	fat_content	calcium_content	texture	rind	color	flavor	aroma	vegetarian	vegan	synonyms	alt_spellings	producers
Aarewasser	https://www.cheese.com/aarewasser/	cow	Switzerland	NA	NA	semi-soft	NA	NA	buttery	washed	yellow	sweet	buttery	FALSE	FALSE	NA	NA	Jumi
Abbaye de Belloc	https://www.cheese.com/abbaye-de-belloc/	sheep	France	Pays Basque	NA	semi-hard, artisan	NA	NA	creamy, dense, firm	natural	yellow	burnt caramel	lanoline	TRUE	FALSE	Abbaye Notre-Dame de Belloc	NA	NA
Abbaye de Belval	https://www.cheese.com/abbaye-de-belval/	cow	France	NA	NA	semi-hard	40-46%	NA	elastic	washed	ivory	NA	aromatic	FALSE	FALSE	NA	NA	NA
Abbaye de Citeaux	https://www.cheese.com/abbaye-de-citeaux/	cow	France	Burgundy	NA	semi-soft, artisan, brined	NA	NA	creamy, dense, smooth	washed	white	acidic, milky, smooth	barnyardy, earthy	FALSE	FALSE	NA	NA	NA
Abbaye de Tamié	https://www.cheese.com/tamie/	cow	France	Savoie	NA	soft, artisan	NA	NA	creamy, open, smooth	washed	white	fruity, nutty	perfumed, pungent	FALSE	FALSE	NA	Tamié, Trappiste de Tamie, Abbey of Tamie	NA
Abbaye de Timadeuc	https://www.cheese.com/abbaye-de-timadeuc/	cow	France	province of Brittany	NA	semi-hard	NA	NA	soft	washed	pale yellow	salty, smooth	nutty	FALSE	FALSE	NA	NA	Abbaye Cistercienne NOTRE-DAME DE TIMADEUC

Raw Data Stats
Row_Count	Column_Count	Null_Count	None_Str_Count
1187	19	7133	0

Data Handling

select columns needed to tidy and for analysis
fill empty strings and null values with ‘None’ string

fill_empty_str = function(x){if_else(x=="", 'None' ,x)}

df = data |>
  select(cheese, milk, country, texture, aroma, flavor) |>
  mutate_all(fill_empty_str) |>
  mutate_all(replace_na, "None")

Data after Handling Stats
Row_Count	Column_Count	Null_Count	None_Str_Count
1187	6	0	340

Tidy Data

Tidy data by ensuring each value has its own cell

split out each row with listed values (milk, texture, aroma, flavor, country) into individual rows and lengthen the data frame

df = df |>
  mutate(cheese_id = row_number()) |>
  separate_rows(country, sep = ', ') |>
  separate_rows(milk, sep = ', ') |>
  separate_rows(texture, sep = ', ') |>
  separate_rows(aroma, sep = ', ') |>
  separate_rows(flavor, sep = ', ')

View Tidy Data

Tidy Data Head
cheese	milk	country	texture	aroma	flavor	cheese_id
Aarewasser	cow	Switzerland	buttery	buttery	sweet	1
Abbaye de Belloc	sheep	France	creamy	lanoline	burnt caramel	2
Abbaye de Belloc	sheep	France	dense	lanoline	burnt caramel	2
Abbaye de Belloc	sheep	France	firm	lanoline	burnt caramel	2
Abbaye de Belval	cow	France	elastic	aromatic	None	3
Abbaye de Citeaux	cow	France	creamy	barnyardy	acidic	4

Tidy Data Stats
Row_Count	Column_Count	Null_Count	None_Str_Count
14394	7	0	2050

Normalize Data

Normalize data to reduce redundancy and allow for more efficient analysis

create a data frame for each column and create an associated id column for each
replace all column values with respective id value in core data frame

create_id_dfs = function(id_prefix, col, df) {
  id_df = df |>
    select(all_of(col)) |>
    distinct() |>
    arrange(col) |>
    mutate(id = paste0(id_prefix, row_number()))
  return(id_df)
}

cheese_df = df |>
  select(cheese, cheese_id) |>
  distinct()

country_df = create_id_dfs('C', 'country', df)
colnames(country_df) = c('country', 'country_id')

milk_df = create_id_dfs('M', 'milk', df)
colnames(milk_df) = c('milk', 'milk_id')

texture_df = create_id_dfs('T', 'texture', df)
colnames(texture_df) = c('texture', 'texture_id')

aroma_df = create_id_dfs('A', 'aroma', df)
colnames(aroma_df) = c('aroma', 'aroma_id')

flavor_df = create_id_dfs('F', 'flavor', df)
colnames(flavor_df) = c('flavor', 'flavor_id')

df = left_join(df, country_df, by = join_by(country))
df = left_join(df, milk_df, by = join_by(milk))
df = left_join(df, texture_df, by = join_by(texture))
df = left_join(df, aroma_df, by = join_by(aroma))
df = left_join(df, flavor_df, by = join_by(flavor))

df = df |>
  select(cheese_id, country_id, milk_id, texture_id, aroma_id, flavor_id)

View Tidy and Normalized Data

Normalized Tidy Data Head
cheese_id	country_id	milk_id	texture_id	aroma_id	flavor_id
1	C1	M1	T1	A1	F1
2	C2	M2	T2	A2	F2
2	C2	M2	T3	A2	F2
2	C2	M2	T4	A2	F2
3	C2	M1	T5	A3	F3
4	C2	M1	T2	A4	F4

Cheese Table Head
cheese	cheese_id
Aarewasser	1
Abbaye de Belloc	2
Abbaye de Belval	3
Abbaye de Citeaux	4
Abbaye de Tamié	5
Abbaye de Timadeuc	6

Country Table Head
country	country_id
Switzerland	C1
France	C2
England	C3
Great Britain	C4
United Kingdom	C5
Czech Republic	C6

Milk Table Head
milk	milk_id
cow	M1
sheep	M2
goat	M3
buffalo	M4
None	M5
water buffalo	M6

Texture Table Head
texture	texture_id
buttery	T1
creamy	T2
dense	T3
firm	T4
elastic	T5
smooth	T6

Aroma Table Head
aroma	aroma_id
buttery	A1
lanoline	A2
aromatic	A3
barnyardy	A4
earthy	A5
perfumed	A6

Flavor Table Head
flavor	flavor_id
sweet	F1
burnt caramel	F2
None	F3
acidic	F4
milky	F5
smooth	F6

Analysis

Analysis Requested in Discussion Post:

What are the most common milks used?
What are the more common textures associated with cheese?
Is there a country or region that produces more cheese?
Are there common aromas or flavors across cheeses made by different milks?

1. Most Common Milks Used

df |>
  select(cheese_id, milk_id) |>
  left_join(cheese_df, by = join_by(cheese_id)) |>
  left_join(milk_df, by = join_by(milk_id)) |>
  mutate(milk = str_to_title(milk)) |>
  group_by(milk) |>
  summarise(
    Total_Count = n(),
    Unique_Cheese_Count = n_distinct(cheese)) |>
  pivot_longer(cols = c(Total_Count, Unique_Cheese_Count)) |>
  ggplot(aes(x = reorder(milk, -value), y = value, fill = name)) +
  geom_col(position = position_dodge2(width = 0.3, preserve = "single")) +
  labs(
    title = "Most Common Cheese Milk Types",
    x = "Milk",
    y = "Count",
    fill = 'Count Type'
  ) +
  theme_classic() + 
  theme(
    axis.text.x = element_text(size = 11, angle = 45, vjust = 1, hjust=1),
    plot.title = element_text(size = 14, face = "bold"),
    legend.position = "bottom")

2. Most Common Textures Associated with Cheeses

df |>
  select(cheese_id, texture_id) |>
  left_join(cheese_df, by = join_by(cheese_id)) |>
  left_join(texture_df, by = join_by(texture_id)) |>
  mutate(texture = str_to_title(texture)) |>
  group_by(texture) |>
  summarise(
    Total_Count = n(),
    Unique_Cheese_Count = n_distinct(cheese)) |>
  pivot_longer(cols = c(Total_Count, Unique_Cheese_Count)) |>
  ggplot(aes(x = reorder(texture, -value), y = value, fill = name)) +
  geom_col(position = position_dodge2(width = 0.2, preserve = "single")) +
  labs(
    title = "Most Common Textures Associated with Cheeses",
    x = "Texture",
    y = "Count",
    fill = 'Count Type'
  ) +
  theme_classic() + 
  theme(
    axis.text.x = element_text(angle = 45, vjust = 1, hjust=1, size = 8),
    plot.title = element_text(size = 14, face = "bold"),
    legend.position = "bottom")

3. Countries Producing Most Cheese

df |>
  select(cheese_id, country_id) |>
  left_join(cheese_df, by = join_by(cheese_id)) |>
  left_join(country_df, by = join_by(country_id)) |>
  mutate(country = str_to_title(country)) |>
  group_by(country) |>
  summarise(
    Total_Count = n(),
    Unique_Cheese_Count = n_distinct(cheese)) |>
  mutate(country = ifelse(Unique_Cheese_Count <= 5, 'Other', country)) |>
  group_by(country) |>
  summarise(
    Total_Count = sum(Total_Count),
    Unique_Cheese_Count = sum(Unique_Cheese_Count)) |>
  pivot_longer(cols = c(Total_Count, Unique_Cheese_Count)) |>
  ggplot(aes(x = reorder(country, -value), y = value, fill = name)) +
  geom_col(position = position_dodge2(width = 0.2, preserve = "single")) +
  labs(
    title = "Countries Producing Most Cheese",
    x = "Country",
    y = "Count",
    fill = 'Count Type'
  ) +
  theme_classic() + 
  theme(
    axis.text.x = element_text(angle = 45, vjust = 1, hjust=1),
    plot.title = element_text(size = 14, face = "bold"),
    legend.position = "bottom")

4a. Common Flavors Across Cheeses By Different Milks

top_flavor_df = df |>
  left_join(cheese_df, by = join_by(cheese_id)) |>
  left_join(flavor_df, by = join_by(flavor_id)) |>
  group_by(flavor, flavor_id) |>
  summarise(cnt = n(), .groups = 'keep') |>
  arrange(desc(cnt)) |>
  head(12)

milk_cnt_df = df |>
  left_join(cheese_df, by = join_by(cheese_id)) |>
  inner_join(top_flavor_df, by = join_by(flavor_id)) |>
  left_join(milk_df, by = join_by(milk_id)) |>
  group_by(milk, milk_id) |>
  summarise(milk_cnt = n(), .groups = 'keep') |>
  filter(milk_cnt>10)
  
df |>
  select(cheese_id, milk_id, flavor_id) |>
  left_join(cheese_df, by = join_by(cheese_id)) |>
  inner_join(top_flavor_df, by = join_by(flavor_id)) |>
  inner_join(milk_cnt_df, by = join_by(milk_id)) |>
  mutate(
    flavor = str_to_title(flavor),
    milk = str_to_title(milk)
    ) |>
  group_by(milk) |>
  mutate(y= n()) |>
  group_by(milk, flavor) |>
  mutate(x = n()) |>
  mutate(Unique_Cheese_Prct =  (x / y)*100) |>
  select(milk, flavor, Unique_Cheese_Prct, x, milk_cnt, y) |>
  distinct() |>
  ggplot(aes(x = milk, y = Unique_Cheese_Prct, fill = flavor)) +
  geom_col(position = position_dodge2(width = 0.3, preserve = "single")) +
  labs(
    title = "Top Cheese Flavors and Milk Distribution",
    subtitle = 'Top 12 Flavors and Milk Types with a Count of at Least 10',
    x = "Milk",
    y = "Percent of Milk",
    fill = 'Flavor'
  ) +
  theme_classic() +
  theme(
    axis.text.x = element_text(angle = 45, vjust = 1, hjust=1),
    plot.title = element_text(size = 14, face = "bold"))

4b. Common Aromas Across Cheeses By Different Milks

top_aroma_df = df |>
  left_join(cheese_df, by = join_by(cheese_id)) |>
  left_join(aroma_df, by = join_by(aroma_id)) |>
  group_by(aroma, aroma_id) |>
  summarise(cnt = n(), .groups = 'keep') |>
  arrange(desc(cnt)) |>
  head(12)

df |>
  select(cheese_id, milk_id, aroma_id) |>
  left_join(cheese_df, by = join_by(cheese_id)) |>
  inner_join(top_aroma_df, by = join_by(aroma_id)) |>
  inner_join(milk_cnt_df, by = join_by(milk_id)) |>
  mutate(
    aroma = str_to_title(aroma),
    milk = str_to_title(milk)
    ) |>
  group_by(milk) |>
  mutate(y= n()) |>
  group_by(milk, aroma) |>
  mutate(x = n()) |>
  mutate(Unique_Cheese_Prct =  (x / y)*100) |>
  select(milk, aroma, Unique_Cheese_Prct, x, milk_cnt, y) |>
  distinct() |>
  ggplot(aes(x = milk, y = Unique_Cheese_Prct, fill = aroma)) +
  geom_col(position = position_dodge2(width = 0.3, preserve = "single")) +
  labs(
    title = "Top Cheese Aromas and Milk Distribution",
    subtitle = 'Top 12 Aromas and Milk Types with a Count of at Least 10',
    x = "Milk",
    y = "Percent of Milk",
    fill = 'Aroma'
  ) +
  theme_classic() +
  theme(
    axis.text.x = element_text(angle = 45, vjust = 1, hjust=1),
    plot.title = element_text(size = 14, face = "bold"))

Exporting Processed Data

# Export cleaned Emissions dataset
write.csv(yearly_emissions_by_area, "yearly_emissions_by_area_cleaned.csv", row.names = FALSE)
write.csv(yearly_emissions, "yearly_emissions_cleaned.csv", row.names = FALSE)
write.csv(emissions_by_area, "emissions_by_area_cleaned.csv", row.names = FALSE)

# Export cleaned DART water permit dataset
write.csv(tbl1_permit_lvl, "tbl1_permit_lvl_cleaned.csv", row.names = FALSE)
write.csv(tbl2_permit_act_lvl, "tbl2_permit_act_lvl_cleaned.csv", row.names = FALSE)
write.csv(tbl3_facility_lvl, "tbl3_facility_lvl_cleaned.csv", row.names = FALSE)
write.csv(tbl4_app_lvl, "tbl4_app_lvl_cleaned.csv", row.names = FALSE)

# Export cleaned Cheese Quality dataset
write.csv(cheese_df, "cheese_df_cleaned.csv", row.names = FALSE)
write.csv(country_df, "country_df_cleaned.csv", row.names = FALSE)
write.csv(milk_df, "milk_df_cleaned.csv", row.names = FALSE)
write.csv(texture_df, "texture_df_cleaned.csv", row.names = FALSE)
write.csv(aroma_df, "aroma_df_cleaned.csv", row.names = FALSE)
write.csv(flavor_df, "flavor_df_cleaned.csv", row.names = FALSE)
write.csv(df, "final_cheese_data_cleaned.csv", row.names = FALSE)

# Confirm that the files were saved successfully
list.files(pattern = "*.csv")

##  [1] "aroma_df_cleaned.csv"                                                                           
##  [2] "cheese_df_cleaned.csv"                                                                          
##  [3] "CoffeeChain-SimplifiedData.csv"                                                                 
##  [4] "correlation-between-child-mortality-and-mean-years-of-schooling-for-those-aged-15-and-older.csv"
##  [5] "country_df_cleaned.csv"                                                                         
##  [6] "data-1.csv"                                                                                     
##  [7] "emissions_by_area_cleaned.csv"                                                                  
##  [8] "Facts_Python_1 (2).csv"                                                                         
##  [9] "FED-BCABLCCEESO_N_Q.csv"                                                                        
## [10] "final_cheese_data_cleaned.csv"                                                                  
## [11] "flavor_df_cleaned.csv"                                                                          
## [12] "food_coded.csv"                                                                                 
## [13] "FRED-EAAXDBNQ.csv"                                                                              
## [14] "FRED-EPLNQ.csv"                                                                                 
## [15] "laptopData.csv"                                                                                 
## [16] "Master_Company_List.csv"                                                                        
## [17] "mean-years-of-schooling-female.csv"                                                             
## [18] "milk_df_cleaned.csv"                                                                            
## [19] "movies.csv"                                                                                     
## [20] "MULTPL-SP500_BVPS_QUARTER.csv"                                                                  
## [21] "New_York_City_Leading_Causes_of_Death.csv"                                                      
## [22] "Requested_Facts_Projecy_2.csv (1).zip"                                                          
## [23] "Requested_Facts_Projecy_2.csv.zip"                                                              
## [24] "Sample - Superstore - Returns.csv"                                                              
## [25] "tb(1)(2).csv"                                                                                   
## [26] "tbl1_permit_lvl_cleaned.csv"                                                                    
## [27] "tbl2_permit_act_lvl_cleaned.csv"                                                                
## [28] "tbl3_facility_lvl_cleaned.csv"                                                                  
## [29] "tbl4_app_lvl_cleaned.csv"                                                                       
## [30] "texture_df_cleaned.csv"                                                                         
## [31] "Total Emissions Per Country (2000-2020).csv"                                                    
## [32] "yearly_emissions_by_area_cleaned.csv"                                                           
## [33] "yearly_emissions_cleaned.csv"

Why Exporting Matters?

Preserving Cleaning Efforts: Once data transformation is complete, saving the cleaned versions prevents the need to redo preprocessing each time.
Improving Reproducibility: The structured datasets can be shared with other analysts or data scientists for further analysis.
Facilitating Advanced Analytics: The exported .csv files are now ready for machine learning models, visualization dashboards, and predictive analytics.

Conclusion

This project focused on transforming and analyzing three diverse datasets, demonstrating the importance of data wrangling techniques in preparing raw information for meaningful insights. By leveraging tidyr and dplyr, we efficiently cleaned, structured, and transformed the datasets into a tidy format, making them suitable for downstream analysis.

Key Takeaways from Each Dataset:

Emissions Data:
- The data was reshaped to a long format, making it easier to analyze changes over time.
- Trends in emissions were identified, providing insights into pollution levels and their environmental implications.
DART Water Permits Data:
- The dataset was transformed to facilitate trend analysis in water permit issuance from 2020 to 2025.
- Cleaning and standardization helped address inconsistencies, ensuring accurate comparisons across years.
Cheese Quality Data:
- The dataset was normalized to separate variables, improving its usability.
- Various transformations allowed us to explore relationships between cheese characteristics, quality ratings, and production factors.

Overall Insights and Future Applications:

Data Preparation Matters:

The process of tidying and transforming data is crucial for accurate and meaningful analysis. Well-structured data improves efficiency in both visualization and modeling.

Standardization & Normalization:

Converting datasets to tidy formats ensured that values were easily accessible for statistical computation.

Potential for Further Analysis:

These datasets can now be used for deeper predictive modeling, trend forecasting, and policy recommendations based on their respective domains.

Through this project, we reinforced the significance of data wrangling techniques in real-world data science applications. The ability to tidy, transform, and analyze raw data is a fundamental skill that enhances decision-making and unlocks valuable insights across different industries.

cheese_id	country_id	milk_id	texture_id	aroma_id	flavor_id
1	C1	M1	T1	A1	F1
2	C2	M2	T2	A2	F2
2	C2	M2	T3	A2	F2
2	C2	M2	T4	A2	F2
3	C2	M1	T5	A3	F3
4	C2	M1	T2	A4	F4

cheese_id	country_id	milk_id	texture_id	aroma_id	flavor_id
1	C1	M1	T1	A1	F1
2	C2	M2	T2	A2	F2
2	C2	M2	T3	A2	F2
2	C2	M2	T4	A2	F2
3	C2	M1	T5	A3	F3
4	C2	M1	T2	A4	F4

Project 2: Data Transformation

Group Members: Alina Vikhnevich, Olivia Azevedo, Alyssa Gurkas, Musrat Jahan

2025-03-11

Loading libraries:

Introduction

Project Overview

Overview of Datasets

Relevance of These Datasets

Data Preparation and Cleaning

Emissions Data:

Make longer

Total emissions per country for each year

Analyze overall total emissions per country for each year

Analyze total emissions over time

Total emissions per country

Analysis of Total emissions per country

Aanalyze by emission type

USA emission types distribution

Item Analysis

Item Analysis USA

NYSDEC Water Permit Data (DART):

Loading NYSDEC DART Data

Defining Regex Patterns to Detect NPDES IDs

Creating the NPDES Universe

Tidy Data

Table 1: Permit Level Data

Table 2: Permit Action Level Data

Table 3: Facility Level Data

Table 4: NPDES Permit Applicant Table

Data Tables and Structure

Data Considerations:

Analysis

Cheese Dataset:

Import Data

View Raw Data

Data Handling

Tidy Data

View Tidy Data

Normalize Data

View Tidy and Normalized Data

Analysis

1. Most Common Milks Used

2. Most Common Textures Associated with Cheeses

3. Countries Producing Most Cheese

4a. Common Flavors Across Cheeses By Different Milks

4b. Common Aromas Across Cheeses By Different Milks

Exporting Processed Data

Why Exporting Matters?

Conclusion

Key Takeaways from Each Dataset:

Overall Insights and Future Applications:

cheese_id	country_id	milk_id	texture_id	aroma_id	flavor_id
1	C1	M1	T1	A1	F1
2	C2	M2	T2	A2	F2
2	C2	M2	T3	A2	F2
2	C2	M2	T4	A2	F2
3	C2	M1	T5	A3	F3
4	C2	M1	T2	A4	F4