library(kableExtra)
library(RSocrata)
library(tidyverse)
library(viridis)
library(readr)
library(readxl)
library(janitor)
library(lubridate)
library(ggplot2)
library(scales)
library(stringr)
library(forcats)
The goal of this project is to tidy, transform, and analyze three
different datasets using R, leveraging the tidyverse
,
tidyr
, and dplyr
packages. These datasets,
originally in an untidy “wide” format, require cleaning, restructuring,
and standardization before analysis can be performed.
By the end of this project, we will:
tidyr
and
dplyr
to clean and reshape the data.This project is a collaborative effort, and each dataset presents a unique challenge in terms of data cleaning, structuring, and interpretation. The final results will be published as an R Markdown report, demonstrating the power of data transformation techniques.
Each dataset represents a different domain and requires a unique transformation approach. Below is a summary of the datasets used in this project:
Each dataset requires different data transformation techniques,
making them ideal for practicing tidyr
and
dplyr
functions. The common themes across these datasets
include:
By applying tidy data principles, we ensure that each dataset is structured, organized, and ready for analysis. The insights gained from this project can be used for policy recommendations, water quality analysis, and compliance to the Clean Water Act.
df <- read.csv("https://raw.githubusercontent.com/justin-2028/Total-Emissions-Per-Country-2000-2020/refs/heads/main/Total%20Emissions%20Per%20Country%20(2000-2020).csv")
colnames(df) <- gsub("^X", "", colnames(df))
print(head(df))
## Area Item Element Unit
## 1 Afghanistan Crop Residues Direct emissions (N2O) kilotonnes
## 2 Afghanistan Crop Residues Indirect emissions (N2O) kilotonnes
## 3 Afghanistan Crop Residues Emissions (N2O) kilotonnes
## 4 Afghanistan Crop Residues Emissions (CO2eq) from N2O (AR5) kilotonnes
## 5 Afghanistan Crop Residues Emissions (CO2eq) (AR5) kilotonnes
## 6 Afghanistan Rice Cultivation Emissions (CH4) kilotonnes
## 2000 2001 2002 2003 2004 2005 2006 2007
## 1 0.520 0.5267 0.8200 0.9988 0.8225 1.1821 1.0277 1.2426
## 2 0.117 0.1185 0.1845 0.2247 0.1851 0.2660 0.2312 0.2796
## 3 0.637 0.6452 1.0045 1.2235 1.0075 1.4481 1.2589 1.5222
## 4 168.807 170.9884 266.1975 324.2195 266.9995 383.7498 333.6093 403.3749
## 5 168.807 170.9884 266.1975 324.2195 266.9995 383.7498 333.6093 403.3749
## 6 18.200 16.9400 18.9000 20.3000 27.3000 22.4000 22.4000 23.8000
## 2008 2009 2010 2011 2012 2013 2014 2015
## 1 0.8869 1.3920 1.2742 1.0321 1.3726 1.4018 1.4584 1.2424
## 2 0.1996 0.3132 0.2867 0.2322 0.3088 0.3154 0.3281 0.2795
## 3 1.0865 1.7051 1.5609 1.2643 1.6815 1.7173 1.7865 1.5220
## 4 287.9099 451.8647 413.6467 335.0379 445.5958 455.0727 473.4174 403.3181
## 5 287.9099 451.8647 413.6467 335.0379 445.5958 455.0727 473.4174 403.3181
## 6 26.6000 28.0000 29.1200 29.4000 28.7000 28.7000 30.8000 22.9600
## 2016 2017 2018 2019 2020
## 1 1.1940 1.0617 0.8988 1.2176 1.3170
## 2 0.2687 0.2389 0.2022 0.2740 0.2963
## 3 1.4627 1.3005 1.1011 1.4916 1.6133
## 4 387.6130 344.6447 291.7838 395.2689 427.5284
## 5 387.6130 344.6447 291.7838 395.2689 427.5284
## 6 16.6600 15.3233 16.4555 17.8542 20.6577
All the year columns were changed to one column under year. The dataset was made longer. This makes it tidy.
df_longer <- df |>
pivot_longer(
cols = starts_with("2"),
names_to = "year",
values_to = "total emissions",
values_drop_na = TRUE
)
print(head(df))
## Area Item Element Unit
## 1 Afghanistan Crop Residues Direct emissions (N2O) kilotonnes
## 2 Afghanistan Crop Residues Indirect emissions (N2O) kilotonnes
## 3 Afghanistan Crop Residues Emissions (N2O) kilotonnes
## 4 Afghanistan Crop Residues Emissions (CO2eq) from N2O (AR5) kilotonnes
## 5 Afghanistan Crop Residues Emissions (CO2eq) (AR5) kilotonnes
## 6 Afghanistan Rice Cultivation Emissions (CH4) kilotonnes
## 2000 2001 2002 2003 2004 2005 2006 2007
## 1 0.520 0.5267 0.8200 0.9988 0.8225 1.1821 1.0277 1.2426
## 2 0.117 0.1185 0.1845 0.2247 0.1851 0.2660 0.2312 0.2796
## 3 0.637 0.6452 1.0045 1.2235 1.0075 1.4481 1.2589 1.5222
## 4 168.807 170.9884 266.1975 324.2195 266.9995 383.7498 333.6093 403.3749
## 5 168.807 170.9884 266.1975 324.2195 266.9995 383.7498 333.6093 403.3749
## 6 18.200 16.9400 18.9000 20.3000 27.3000 22.4000 22.4000 23.8000
## 2008 2009 2010 2011 2012 2013 2014 2015
## 1 0.8869 1.3920 1.2742 1.0321 1.3726 1.4018 1.4584 1.2424
## 2 0.1996 0.3132 0.2867 0.2322 0.3088 0.3154 0.3281 0.2795
## 3 1.0865 1.7051 1.5609 1.2643 1.6815 1.7173 1.7865 1.5220
## 4 287.9099 451.8647 413.6467 335.0379 445.5958 455.0727 473.4174 403.3181
## 5 287.9099 451.8647 413.6467 335.0379 445.5958 455.0727 473.4174 403.3181
## 6 26.6000 28.0000 29.1200 29.4000 28.7000 28.7000 30.8000 22.9600
## 2016 2017 2018 2019 2020
## 1 1.1940 1.0617 0.8988 1.2176 1.3170
## 2 0.2687 0.2389 0.2022 0.2740 0.2963
## 3 1.4627 1.3005 1.1011 1.4916 1.6133
## 4 387.6130 344.6447 291.7838 395.2689 427.5284
## 5 387.6130 344.6447 291.7838 395.2689 427.5284
## 6 16.6600 15.3233 16.4555 17.8542 20.6577
yearly_emissions_by_area <- aggregate(df_longer$'total emissions', by = list(df_longer$year, df_longer$Area), FUN = sum)
yearly_emissions_by_area
#rename columns
yearly_emissions_by_area <-
yearly_emissions_by_area %>%
rename(
year = Group.1,
country = Group.2,
emissions = x
)
yearly_emissions_by_area
print(head(df))
Too many different countries
ggplot(yearly_emissions_by_area, aes(x = year, y = emissions,
fill = country)) +
geom_tile()
print(head(df))
## Area Item Element Unit
## 1 Afghanistan Crop Residues Direct emissions (N2O) kilotonnes
## 2 Afghanistan Crop Residues Indirect emissions (N2O) kilotonnes
## 3 Afghanistan Crop Residues Emissions (N2O) kilotonnes
## 4 Afghanistan Crop Residues Emissions (CO2eq) from N2O (AR5) kilotonnes
## 5 Afghanistan Crop Residues Emissions (CO2eq) (AR5) kilotonnes
## 6 Afghanistan Rice Cultivation Emissions (CH4) kilotonnes
## 2000 2001 2002 2003 2004 2005 2006 2007
## 1 0.520 0.5267 0.8200 0.9988 0.8225 1.1821 1.0277 1.2426
## 2 0.117 0.1185 0.1845 0.2247 0.1851 0.2660 0.2312 0.2796
## 3 0.637 0.6452 1.0045 1.2235 1.0075 1.4481 1.2589 1.5222
## 4 168.807 170.9884 266.1975 324.2195 266.9995 383.7498 333.6093 403.3749
## 5 168.807 170.9884 266.1975 324.2195 266.9995 383.7498 333.6093 403.3749
## 6 18.200 16.9400 18.9000 20.3000 27.3000 22.4000 22.4000 23.8000
## 2008 2009 2010 2011 2012 2013 2014 2015
## 1 0.8869 1.3920 1.2742 1.0321 1.3726 1.4018 1.4584 1.2424
## 2 0.1996 0.3132 0.2867 0.2322 0.3088 0.3154 0.3281 0.2795
## 3 1.0865 1.7051 1.5609 1.2643 1.6815 1.7173 1.7865 1.5220
## 4 287.9099 451.8647 413.6467 335.0379 445.5958 455.0727 473.4174 403.3181
## 5 287.9099 451.8647 413.6467 335.0379 445.5958 455.0727 473.4174 403.3181
## 6 26.6000 28.0000 29.1200 29.4000 28.7000 28.7000 30.8000 22.9600
## 2016 2017 2018 2019 2020
## 1 1.1940 1.0617 0.8988 1.2176 1.3170
## 2 0.2687 0.2389 0.2022 0.2740 0.2963
## 3 1.4627 1.3005 1.1011 1.4916 1.6133
## 4 387.6130 344.6447 291.7838 395.2689 427.5284
## 5 387.6130 344.6447 291.7838 395.2689 427.5284
## 6 16.6600 15.3233 16.4555 17.8542 20.6577
As you can see from the graph, total emissions have gone up steadily from 2000 to 2019, but in 2020, it decreased a significant amount. This might be due to more awareness about climate change and global warming.
yearly_emissions <- aggregate(df_longer$'total emissions', by=list(df_longer$year), FUN = sum)
yearly_emissions
## Group.1 x
## 1 2000 2143119367
## 2 2001 2109760240
## 3 2002 2180860383
## 4 2003 2220985897
## 5 2004 2321819994
## 6 2005 2348671376
## 7 2006 2421011446
## 8 2007 2428782054
## 9 2008 2463686673
## 10 2009 2468939225
## 11 2010 2511864950
## 12 2011 2523633556
## 13 2012 2558572795
## 14 2013 2576071615
## 15 2014 2618685489
## 16 2015 2639203455
## 17 2016 2636367158
## 18 2017 2665248135
## 19 2018 2712258358
## 20 2019 2741323659
## 21 2020 2648131930
#rename columns
yearly_emissions <-
yearly_emissions %>%
rename(
year = Group.1,
emissions = x
)
ggplot(yearly_emissions, aes(x = year, y = emissions)) +
geom_point()
Some of the top countries that contributed to emissions are China, USA, Brazil, India, Indonesia, and Democratic Republic of the Congo.
emissions_by_area <- aggregate(df_longer$'total emissions', by = list(df_longer$Area), FUN = sum)
#rename columns
emissions_by_area <-
emissions_by_area %>%
rename(
country = Group.1,
emissions = x
)
top <- emissions_by_area[order(-emissions_by_area$emissions),]
print(head(df))
## Area Item Element Unit
## 1 Afghanistan Crop Residues Direct emissions (N2O) kilotonnes
## 2 Afghanistan Crop Residues Indirect emissions (N2O) kilotonnes
## 3 Afghanistan Crop Residues Emissions (N2O) kilotonnes
## 4 Afghanistan Crop Residues Emissions (CO2eq) from N2O (AR5) kilotonnes
## 5 Afghanistan Crop Residues Emissions (CO2eq) (AR5) kilotonnes
## 6 Afghanistan Rice Cultivation Emissions (CH4) kilotonnes
## 2000 2001 2002 2003 2004 2005 2006 2007
## 1 0.520 0.5267 0.8200 0.9988 0.8225 1.1821 1.0277 1.2426
## 2 0.117 0.1185 0.1845 0.2247 0.1851 0.2660 0.2312 0.2796
## 3 0.637 0.6452 1.0045 1.2235 1.0075 1.4481 1.2589 1.5222
## 4 168.807 170.9884 266.1975 324.2195 266.9995 383.7498 333.6093 403.3749
## 5 168.807 170.9884 266.1975 324.2195 266.9995 383.7498 333.6093 403.3749
## 6 18.200 16.9400 18.9000 20.3000 27.3000 22.4000 22.4000 23.8000
## 2008 2009 2010 2011 2012 2013 2014 2015
## 1 0.8869 1.3920 1.2742 1.0321 1.3726 1.4018 1.4584 1.2424
## 2 0.1996 0.3132 0.2867 0.2322 0.3088 0.3154 0.3281 0.2795
## 3 1.0865 1.7051 1.5609 1.2643 1.6815 1.7173 1.7865 1.5220
## 4 287.9099 451.8647 413.6467 335.0379 445.5958 455.0727 473.4174 403.3181
## 5 287.9099 451.8647 413.6467 335.0379 445.5958 455.0727 473.4174 403.3181
## 6 26.6000 28.0000 29.1200 29.4000 28.7000 28.7000 30.8000 22.9600
## 2016 2017 2018 2019 2020
## 1 1.1940 1.0617 0.8988 1.2176 1.3170
## 2 0.2687 0.2389 0.2022 0.2740 0.2963
## 3 1.4627 1.3005 1.1011 1.4916 1.6133
## 4 387.6130 344.6447 291.7838 395.2689 427.5284
## 5 387.6130 344.6447 291.7838 395.2689 427.5284
## 6 16.6600 15.3233 16.4555 17.8542 20.6577
too many countries, cant read
ggplot(emissions_by_area, aes(x = country, y = emissions), label=NA)+ geom_point()
Emissions (CO2eq) (AR5) are highest. They are over 200,000 kilotonnes. The second highest place is tied with emissions (N20) and emissions (CO2eq) from N20 (AR5). Lowest emissions are (CO2eq) from F-gases, less than 25,000 kilotonnes.
ggplot(df_longer, aes(y=Element)) +
geom_bar()
The distribution looks very similar to the distribution with the data from all the regions. For USA, the counts are smaller. Highest are emissions (CO2eq) (AR5), a little less than 1250 kilotonnes. Lowest emissions are (CO2eq) from F-gases, around 125 kilotonnes. The second highest place is from emissions (N20), around 950 kilotonnes.
usa <- df_longer %>%
filter(Area == "United States of America")
ggplot(usa, aes(y=Element)) +
geom_bar()
Highest item is IPCC Agriculture. Second highest is farm gate emissions. Lowest is international bunkers.
ggplot(df_longer, aes(y=Item)) +
geom_bar()
Highest item is IPCC Agriculture, just like in the overall data. Lowest is drained organic soils (C02) and drained organic soils.
ggplot(usa, aes(y=Item)) +
geom_bar()
Data used in this section comes from New York State Department of Environmental Conservation’s Application Review & Tracking System (DART on the Web).
DART is a web-based application and tracking system that is designed for the general public. DART hosts information about NYSDEC’s processing and issuance of environmental permits under the Uniform Procedures Act. The data is updated daily, and more information about the data can be found in the data dictionary.
In this section, data was previously filtered to only include
DART entries from 2020-2025, and will be focused on waste water permits
that discharge to surface water.
library(readr)
dart <- read_csv("https://raw.githubusercontent.com/AlinaVikhnevich/data_607/refs/heads/main/Project%202/dart_2020_2025.csv")
To identify wastewater permits, there are three regex patterns to identify:
For more information about permit types please see the question “What are the primary differences between a NODES individual permit and a NPDES general permit” under EPA’s NPDES Permit Basics Site.
# p_type = permit type, in this exercise we are filtering for wastewater permits
p_type <- c("P/C/I SPDES - Surface Discharge",
"Municipal SPDES - Surface Discharge",
"Industrial SPDES - Surface Discharge")
# defining the regex patterns for the IDs we want to track
npdes_pattern <- "NY\\d{7}"
gp_pattern <- "GP\\d{7}"
individual_pattern <- "NY[A-Z]\\d{2}[A-Z]\\d{3}"
all_patterns <- paste(npdes_pattern,gp_pattern,individual_pattern, sep="|")
Creating the permit universe pulling from NYSDEC’s DART System and detecting the string patterns within DART to assign permit type: npdes, individual(i.e., a permit covered under a general permit), general, or multi (meaning the DART entry had multiple associated IDs).
universe <- dart |>
filter(`permit_type` %in% p_type) |>
mutate(
npdes = str_count(`other_known_ids`, npdes_pattern), # the str_counts are taking count of permit ID type
individual = str_count(`other_known_ids`, individual_pattern),
gp = str_count(`other_known_ids`, gp_pattern),
sum_ids = rowSums(across(c(`npdes`, `individual`,`gp`))),
npdes_id = str_extract_all(`other_known_ids`, all_patterns),
date_received=as.Date(date_received, format = "%d-%m-%Y")
) |>
mutate(applicant_id =cur_group_id(),.by = applicant) |> # creating applicant id
mutate(facility_id = cur_group_id(),.by = c(facility,location,town_or_city)) |> # creating facility id
distinct() |> # removing duplicate rows
mutate(
dart_p_type = case_when(sum_ids > 1 ~ "multi", # if entry is associated with multiple ids, it is flagged as multi
sum_ids & npdes == 1 ~ "npdes",
sum_ids & individual == 1 ~ "individual",
sum_ids & gp == 1 ~ "gp")) |>
unnest_longer(npdes_id, keep_empty = FALSE) |>
filter(!is.na(npdes_id))
Note: The code above filters entries that did not have a NPDES ID listed in the “Other Known IDs” column, however, were listed as NPDES permits in the Permit Type Column. However, out of 35,642 entries, only 69 were missing NPDES IDs.
This table shows the most recent permit information
tbl1_permit_lvl <- universe |>
group_by(npdes_id) |>
slice(which.max(date_received)) |>
select(npdes_id,facility_id,application_id,applicant,applicant_id,permit_type,
status,date_received,upa_class,seqr_class,seqr_determination,
lead_agency,coastal_zone_status, final_disposition,permit_effective_date,
permit_expration_date,dec_contact,shpa_status,enivronmental_justice)
npdes_id | facility_id | application_id | applicant | applicant_id | permit_type | status | date_received | upa_class | seqr_class | seqr_determination | lead_agency | coastal_zone_status | final_disposition | permit_effective_date | permit_expration_date | dec_contact | shpa_status | enivronmental_justice |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
NY0000044 | 574 | 3-3724-00045/00004 | WATCHTOWER BIBLE AND TRACT SOCIETY OF NEW YORK INC | 158 | P/C/I SPDES - Surface Discharge | Issued | 2024-02-16 | MINOR | Type II Action | Not Applicable | None Designated | This project is not located in a Coastal Management area. | Issued | 2024-09-01 04:00:00 | 2029-08-31 04:00:00 | KATHERINE M MURRAY | NA | NA |
NY0000078 | 706 | 8-5436-00007/00014 | GARLOCK SEALING TECHNOLOGIES LLC | 593 | Industrial SPDES - Surface Discharge | SAPA Extended | 2021-02-03 | MAJOR | Type II Action | Not Applicable | None Designated | This project is not located in a Coastal Management area. | Issued | 2023-07-01 04:00:00 | 2023-08-31 04:00:00 | GUILLERMO R SAAR | NA | The application is subject to the Department Environmental Justice policy (CP-29). Either the permits needed for the project are not subject to the policy or it has been determined that the project would not affect a Potential Environmental Justice Area. |
NY0000167 | 1906 | 8-2499-00039/00002 | RETSOF REALTY LLC | 1571 | Industrial SPDES - Surface Discharge | Issued | 2021-04-15 | MINOR | Type II Action | Not Applicable | None Designated | This project is not located in a Coastal Management area. | Issued | 2022-01-01 05:00:00 | 2026-12-31 05:00:00 | MICHAEL R SCHAEFER | NA | NA |
NY0000247 | 1556 | 3-5518-00680/00001 | CONSOLIDATED EDISON COMPANY OF NEW YORK, INC. | 914 | Industrial SPDES - Surface Discharge | Suspended Indefinitely | 2024-04-22 | MINOR | Type II Action | Not Applicable | None Designated | This project is not located in a Coastal Management area. | NA | NA | NA | KATHERINE M MURRAY | NA | NA |
NY0000281 | 539 | 9-1464-00117/00013 | LINDE INC | 455 | Industrial SPDES - Surface Discharge | Issued | 2021-04-29 | MINOR | Type II Action | Not Applicable | None Designated | This project is not located in a Coastal Management area. | Issued | 2022-01-11 05:00:00 | 2026-12-31 05:00:00 | MICHAEL R SCHAEFER | NA | NA |
NY0000311 | 1420 | 8-3224-00108/00031 | PACTIV LLC | 1186 | Industrial SPDES - Surface Discharge | Issued | 2021-08-02 | MINOR | Type II Action | Not Applicable | None Designated | This project is not located in a Coastal Management area. | Issued | 2022-03-01 05:00:00 | 2027-02-28 05:00:00 | MICHAEL R SCHAEFER | NA | NA |
This table shows the permit history. each observation in this table represents a permit action.
tbl2_permit_act_lvl <- universe |>
mutate(action_id = paste(npdes_id,date_received, sep = "_")) |>
distinct() |>
mutate(dup_flag = duplicated(action_id),
transfer_flag=str_detect(toupper(short_description),"TRANSFER")) |>
select(action_id,facility,facility_id,npdes_id,application_id,applicant,
application_type,date_received,status,short_description,
enb_publication_date,written_comments_due,dup_flag,transfer_flag)
tbl2_permit_act_lvl$short_description <- tolower(tbl2_permit_act_lvl$short_description)
action_id | facility | facility_id | npdes_id | application_id | applicant | application_type | date_received | status | short_description | enb_publication_date | written_comments_due | dup_flag | transfer_flag |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
NY0101915_2022-10-05 | B O C E S | 1 | NY0101915 | 3-1332-00172/00001 | DUTCHESS BOCES | Renewal Treat as New | 2022-10-05 | Issued | spdes fast track renewal for ny0101915 | 2023-02-15 05:00:00 | 2023-03-17 04:00:00 | FALSE | FALSE |
NYR10L635_2023-03-24 | NYC EDC-SOUTH BROOKLYN MARINE TERMINAL | 2 | NYR10L635 | 2-6102-00120/00032 | NYC ECONOMIC DEVELOPMENT CORP | New | 2023-03-24 | Issued | spdes application for substation test pits and construction | 2024-03-20 04:00:00 | 2024-04-19 04:00:00 | FALSE | FALSE |
NY0313149_2023-03-24 | NYC EDC-SOUTH BROOKLYN MARINE TERMINAL | 2 | NY0313149 | 2-6102-00120/00032 | NYC ECONOMIC DEVELOPMENT CORP | New | 2023-03-24 | Issued | spdes application for substation test pits and construction | 2024-03-20 04:00:00 | 2024-04-19 04:00:00 | FALSE | FALSE |
NY0035441_2021-06-30 | CHAUTAUQUA FISH HATCHERY | 3 | NY0035441 | 9-0628-00098/00004 | NYS Dept of Environmental Conservation | New | 2021-06-30 | Issued | reduce conc. limit of formalin & change in phos. loading limit | 2021-09-29 04:00:00 | 2021-10-29 04:00:00 | FALSE | FALSE |
NY0071897_2021-03-30 | EFFRON FUEL OIL CORP TERMINAL | 4 | NY0071897 | 3-1313-00015/00002 | PETRO INC | Minor Modification | 2021-03-30 | Expired | transfer | NA | NA | FALSE | TRUE |
NY0071897_2022-09-08 | EFFRON FUEL OIL CORP TERMINAL | 4 | NY0071897 | 3-1313-00015/00002 | PETRO INC | Renewal Treat as New | 2022-09-08 | Issued | spdes fast track renewal for ny 0071897 | 2022-11-02 04:00:00 | 2022-12-02 05:00:00 | FALSE | FALSE |
This table shows the facility information. Each observation in this table represents a facility associated with NPDES permits.
tbl3_facility_lvl <- universe |>
select(facility_id,facility,
location,town_or_city) |>
distinct() |>
arrange(facility_id)
facility_id | facility | location | town_or_city |
---|---|---|---|
1 | B O C E S | 578 SALT POINT TURNPIKE HYDE PARK 12538 | HYDE PARK |
2 | NYC EDC-SOUTH BROOKLYN MARINE TERMINAL | Sunset Park and Greenwood Heights 29th to 39th St|2nd Ave To Ny Harbor Brooklyn (6102) 11232 | BROOKLYN |
3 | CHAUTAUQUA FISH HATCHERY | 5875 PRENDERGAST RD MAYVILLE 14757 | CHAUTAUQUA |
4 | EFFRON FUEL OIL CORP TERMINAL | FOOT OF PROSPECT ST POUGHKEEPSIE 12602 | POUGHKEEPSIE |
5 | POUGHKEEPSIE STP | 173 KITTREDGE PL POUGHKEEPSIE 12601 | POUGHKEEPSIE |
6 | AMENIA S & G-LEEDSVILLE PROCESSING PLANT | 307 LEEDSVILLE RD AMENIA 12501 | AMENIA |
This table shows the applicant information. Each observation in this table represents a permit applicant for NPDES permits.
tbl4_app_lvl <- universe |>
group_by(applicant_id) |>
slice(which.max(date_received)) |>
select(applicant_id,applicant,application_id)
applicant_id | applicant | application_id |
---|---|---|
1 | DUTCHESS BOCES | 3-1332-00172/00001 |
2 | NYC ECONOMIC DEVELOPMENT CORP | 2-6402-00004/00100 |
3 | NYS Dept of Environmental Conservation | 3-4844-00112/00001 |
4 | PETRO INC | 3-1334-00136/00001 |
5 | CITY OF POUGHKEEPSIE | 3-1346-00364/00003 |
6 | DOLOMITE PRODUCTS COMPANY INC. | 8-1836-00001/02002 |
(1) Table 1 - permit table: the purpose of this table is to have the most
recent permit information. This will have one row per permit.
(2) Table 2 - permit action table: the purpose of this table is to have a
table with every permit-action. This means there should be one row per
permit action.
(3) Table 3 - facility table: the purpose of this table is to have
information on the facility.
(4) Table 4 - applicant table: the purpose of this table is to have
information about the applicant.
permit_status <- tbl1_permit_lvl |>
group_by(status) |>
summarize(
Count = n(),
Proportion = (n()/nrow(tbl1_permit_lvl))*100
) |>
arrange(desc(Proportion)) |>
head(5) |>
rename("Status" = "status")
permit_status$Proportion <- paste0(round(permit_status$Proportion, digits=1),"%")
ggplot(permit_status,aes(x = reorder(Status, -Count), y= Count)) +
geom_bar(stat="identity", fill="lightblue", width=0.5)+
geom_text(aes(label=Proportion),
hjust=.35)+
theme_minimal()+
labs(title="Permits by Status",x="Status")+
theme(axis.text.y =element_text(angle = 55,hjust=1))+
coord_flip()
upa_class <- tbl1_permit_lvl |>
group_by(upa_class) |>
summarize(
Count = n(),
Proportion = (n()/nrow(tbl1_permit_lvl))*100
)
upa_class$Proportion <- paste0(round(upa_class$Proportion, digits=1),"%")
ggplot(upa_class,aes(x = reorder(upa_class, -Count), y= Count)) +
geom_bar(stat="identity", fill="lightblue", width=0.5)+
geom_text(aes(label=Proportion),
hjust=.5,
vjust=0.25)+
theme_minimal()+
labs(title="Permits by UPA Class",x="UPA Class")
final_dis <- tbl1_permit_lvl |>
group_by(final_disposition) |>
summarize(
Count = n(),
Proportion = (n()/nrow(tbl1_permit_lvl))*100
) |>
arrange(desc(Count)) |>
head(5)
final_dis$Proportion <- paste0(round(final_dis$Proportion, digits=1),"%")
final_dis$Count <- as.numeric(final_dis$Count)
final_dis <- final_dis |>
clean_names("title")
ggplot(final_dis,aes(x =reorder(`Final Disposition`,`Count`, .desc = TRUE), y= Count)) +
geom_bar(stat="identity", fill="lightblue", width=0.5)+
theme_minimal()+
labs(title="Permits by Final Disposition",x="Final Disposition")+
theme(axis.text.y =element_text(angle = 55,hjust=1))+
coord_flip()
app_type <- tbl2_permit_act_lvl |>
group_by(application_type) |>
summarize(
Count = n(),
Proportion = n()/nrow(tbl2_permit_act_lvl)
) |>
clean_names("title") |>
arrange(desc(Count))
knitr::kable(app_type, format ="markdown")
Application Type | Count | Proportion |
---|---|---|
Renewal Treat as New | 2565 | 0.7753930 |
Modification Treat as New | 274 | 0.0828295 |
Minor Modification | 185 | 0.0559250 |
New | 160 | 0.0483676 |
Modification | 48 | 0.0145103 |
DIM Treat as New | 47 | 0.0142080 |
Department Initiated Modification | 29 | 0.0087666 |
short_desc <- tbl2_permit_act_lvl |>
mutate(c_fast_track=coalesce(str_count(short_description,"fast track"),0)) |>
summarize(
"Fast Tracked Renewal Actions" = sum(c_fast_track),
"Total Actions" = nrow(tbl2_permit_act_lvl),
Proportion = sum(c_fast_track)/nrow(tbl2_permit_act_lvl)
) |>
clean_names("title")
knitr::kable(short_desc, format ="markdown")
Fast Tracked Renewal Actions | Total Actions | Proportion |
---|---|---|
2204 | 3308 | 0.6662636 |
Read in raw csv file as data frame
data <- read.csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/refs/heads/main/data/2024/2024-06-04/cheeses.csv")
cheese | url | milk | country | region | family | type | fat_content | calcium_content | texture | rind | color | flavor | aroma | vegetarian | vegan | synonyms | alt_spellings | producers |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Aarewasser | https://www.cheese.com/aarewasser/ | cow | Switzerland | NA | NA | semi-soft | NA | NA | buttery | washed | yellow | sweet | buttery | FALSE | FALSE | NA | NA | Jumi |
Abbaye de Belloc | https://www.cheese.com/abbaye-de-belloc/ | sheep | France | Pays Basque | NA | semi-hard, artisan | NA | NA | creamy, dense, firm | natural | yellow | burnt caramel | lanoline | TRUE | FALSE | Abbaye Notre-Dame de Belloc | NA | NA |
Abbaye de Belval | https://www.cheese.com/abbaye-de-belval/ | cow | France | NA | NA | semi-hard | 40-46% | NA | elastic | washed | ivory | NA | aromatic | FALSE | FALSE | NA | NA | NA |
Abbaye de Citeaux | https://www.cheese.com/abbaye-de-citeaux/ | cow | France | Burgundy | NA | semi-soft, artisan, brined | NA | NA | creamy, dense, smooth | washed | white | acidic, milky, smooth | barnyardy, earthy | FALSE | FALSE | NA | NA | NA |
Abbaye de Tamié | https://www.cheese.com/tamie/ | cow | France | Savoie | NA | soft, artisan | NA | NA | creamy, open, smooth | washed | white | fruity, nutty | perfumed, pungent | FALSE | FALSE | NA | Tamié, Trappiste de Tamie, Abbey of Tamie | NA |
Abbaye de Timadeuc | https://www.cheese.com/abbaye-de-timadeuc/ | cow | France | province of Brittany | NA | semi-hard | NA | NA | soft | washed | pale yellow | salty, smooth | nutty | FALSE | FALSE | NA | NA | Abbaye Cistercienne NOTRE-DAME DE TIMADEUC |
Row_Count | Column_Count | Null_Count | None_Str_Count |
---|---|---|---|
1187 | 19 | 7133 | 0 |
fill_empty_str = function(x){if_else(x=="", 'None' ,x)}
df = data |>
select(cheese, milk, country, texture, aroma, flavor) |>
mutate_all(fill_empty_str) |>
mutate_all(replace_na, "None")
Row_Count | Column_Count | Null_Count | None_Str_Count |
---|---|---|---|
1187 | 6 | 0 | 340 |
Tidy data by ensuring each value has its own cell
df = df |>
mutate(cheese_id = row_number()) |>
separate_rows(country, sep = ', ') |>
separate_rows(milk, sep = ', ') |>
separate_rows(texture, sep = ', ') |>
separate_rows(aroma, sep = ', ') |>
separate_rows(flavor, sep = ', ')
cheese | milk | country | texture | aroma | flavor | cheese_id |
---|---|---|---|---|---|---|
Aarewasser | cow | Switzerland | buttery | buttery | sweet | 1 |
Abbaye de Belloc | sheep | France | creamy | lanoline | burnt caramel | 2 |
Abbaye de Belloc | sheep | France | dense | lanoline | burnt caramel | 2 |
Abbaye de Belloc | sheep | France | firm | lanoline | burnt caramel | 2 |
Abbaye de Belval | cow | France | elastic | aromatic | None | 3 |
Abbaye de Citeaux | cow | France | creamy | barnyardy | acidic | 4 |
Row_Count | Column_Count | Null_Count | None_Str_Count |
---|---|---|---|
14394 | 7 | 0 | 2050 |
Normalize data to reduce redundancy and allow for more efficient analysis
create_id_dfs = function(id_prefix, col, df) {
id_df = df |>
select(all_of(col)) |>
distinct() |>
arrange(col) |>
mutate(id = paste0(id_prefix, row_number()))
return(id_df)
}
cheese_df = df |>
select(cheese, cheese_id) |>
distinct()
country_df = create_id_dfs('C', 'country', df)
colnames(country_df) = c('country', 'country_id')
milk_df = create_id_dfs('M', 'milk', df)
colnames(milk_df) = c('milk', 'milk_id')
texture_df = create_id_dfs('T', 'texture', df)
colnames(texture_df) = c('texture', 'texture_id')
aroma_df = create_id_dfs('A', 'aroma', df)
colnames(aroma_df) = c('aroma', 'aroma_id')
flavor_df = create_id_dfs('F', 'flavor', df)
colnames(flavor_df) = c('flavor', 'flavor_id')
df = left_join(df, country_df, by = join_by(country))
df = left_join(df, milk_df, by = join_by(milk))
df = left_join(df, texture_df, by = join_by(texture))
df = left_join(df, aroma_df, by = join_by(aroma))
df = left_join(df, flavor_df, by = join_by(flavor))
df = df |>
select(cheese_id, country_id, milk_id, texture_id, aroma_id, flavor_id)
cheese_id | country_id | milk_id | texture_id | aroma_id | flavor_id |
---|---|---|---|---|---|
1 | C1 | M1 | T1 | A1 | F1 |
2 | C2 | M2 | T2 | A2 | F2 |
2 | C2 | M2 | T3 | A2 | F2 |
2 | C2 | M2 | T4 | A2 | F2 |
3 | C2 | M1 | T5 | A3 | F3 |
4 | C2 | M1 | T2 | A4 | F4 |
cheese | cheese_id |
---|---|
Aarewasser | 1 |
Abbaye de Belloc | 2 |
Abbaye de Belval | 3 |
Abbaye de Citeaux | 4 |
Abbaye de Tamié | 5 |
Abbaye de Timadeuc | 6 |
country | country_id |
---|---|
Switzerland | C1 |
France | C2 |
England | C3 |
Great Britain | C4 |
United Kingdom | C5 |
Czech Republic | C6 |
milk | milk_id |
---|---|
cow | M1 |
sheep | M2 |
goat | M3 |
buffalo | M4 |
None | M5 |
water buffalo | M6 |
texture | texture_id |
---|---|
buttery | T1 |
creamy | T2 |
dense | T3 |
firm | T4 |
elastic | T5 |
smooth | T6 |
aroma | aroma_id |
---|---|
buttery | A1 |
lanoline | A2 |
aromatic | A3 |
barnyardy | A4 |
earthy | A5 |
perfumed | A6 |
flavor | flavor_id |
---|---|
sweet | F1 |
burnt caramel | F2 |
None | F3 |
acidic | F4 |
milky | F5 |
smooth | F6 |
Analysis Requested in Discussion Post:
df |>
select(cheese_id, milk_id) |>
left_join(cheese_df, by = join_by(cheese_id)) |>
left_join(milk_df, by = join_by(milk_id)) |>
mutate(milk = str_to_title(milk)) |>
group_by(milk) |>
summarise(
Total_Count = n(),
Unique_Cheese_Count = n_distinct(cheese)) |>
pivot_longer(cols = c(Total_Count, Unique_Cheese_Count)) |>
ggplot(aes(x = reorder(milk, -value), y = value, fill = name)) +
geom_col(position = position_dodge2(width = 0.3, preserve = "single")) +
labs(
title = "Most Common Cheese Milk Types",
x = "Milk",
y = "Count",
fill = 'Count Type'
) +
theme_classic() +
theme(
axis.text.x = element_text(size = 11, angle = 45, vjust = 1, hjust=1),
plot.title = element_text(size = 14, face = "bold"),
legend.position = "bottom")
df |>
select(cheese_id, texture_id) |>
left_join(cheese_df, by = join_by(cheese_id)) |>
left_join(texture_df, by = join_by(texture_id)) |>
mutate(texture = str_to_title(texture)) |>
group_by(texture) |>
summarise(
Total_Count = n(),
Unique_Cheese_Count = n_distinct(cheese)) |>
pivot_longer(cols = c(Total_Count, Unique_Cheese_Count)) |>
ggplot(aes(x = reorder(texture, -value), y = value, fill = name)) +
geom_col(position = position_dodge2(width = 0.2, preserve = "single")) +
labs(
title = "Most Common Textures Associated with Cheeses",
x = "Texture",
y = "Count",
fill = 'Count Type'
) +
theme_classic() +
theme(
axis.text.x = element_text(angle = 45, vjust = 1, hjust=1, size = 8),
plot.title = element_text(size = 14, face = "bold"),
legend.position = "bottom")
df |>
select(cheese_id, country_id) |>
left_join(cheese_df, by = join_by(cheese_id)) |>
left_join(country_df, by = join_by(country_id)) |>
mutate(country = str_to_title(country)) |>
group_by(country) |>
summarise(
Total_Count = n(),
Unique_Cheese_Count = n_distinct(cheese)) |>
mutate(country = ifelse(Unique_Cheese_Count <= 5, 'Other', country)) |>
group_by(country) |>
summarise(
Total_Count = sum(Total_Count),
Unique_Cheese_Count = sum(Unique_Cheese_Count)) |>
pivot_longer(cols = c(Total_Count, Unique_Cheese_Count)) |>
ggplot(aes(x = reorder(country, -value), y = value, fill = name)) +
geom_col(position = position_dodge2(width = 0.2, preserve = "single")) +
labs(
title = "Countries Producing Most Cheese",
x = "Country",
y = "Count",
fill = 'Count Type'
) +
theme_classic() +
theme(
axis.text.x = element_text(angle = 45, vjust = 1, hjust=1),
plot.title = element_text(size = 14, face = "bold"),
legend.position = "bottom")
top_flavor_df = df |>
left_join(cheese_df, by = join_by(cheese_id)) |>
left_join(flavor_df, by = join_by(flavor_id)) |>
group_by(flavor, flavor_id) |>
summarise(cnt = n(), .groups = 'keep') |>
arrange(desc(cnt)) |>
head(12)
milk_cnt_df = df |>
left_join(cheese_df, by = join_by(cheese_id)) |>
inner_join(top_flavor_df, by = join_by(flavor_id)) |>
left_join(milk_df, by = join_by(milk_id)) |>
group_by(milk, milk_id) |>
summarise(milk_cnt = n(), .groups = 'keep') |>
filter(milk_cnt>10)
df |>
select(cheese_id, milk_id, flavor_id) |>
left_join(cheese_df, by = join_by(cheese_id)) |>
inner_join(top_flavor_df, by = join_by(flavor_id)) |>
inner_join(milk_cnt_df, by = join_by(milk_id)) |>
mutate(
flavor = str_to_title(flavor),
milk = str_to_title(milk)
) |>
group_by(milk) |>
mutate(y= n()) |>
group_by(milk, flavor) |>
mutate(x = n()) |>
mutate(Unique_Cheese_Prct = (x / y)*100) |>
select(milk, flavor, Unique_Cheese_Prct, x, milk_cnt, y) |>
distinct() |>
ggplot(aes(x = milk, y = Unique_Cheese_Prct, fill = flavor)) +
geom_col(position = position_dodge2(width = 0.3, preserve = "single")) +
labs(
title = "Top Cheese Flavors and Milk Distribution",
subtitle = 'Top 12 Flavors and Milk Types with a Count of at Least 10',
x = "Milk",
y = "Percent of Milk",
fill = 'Flavor'
) +
theme_classic() +
theme(
axis.text.x = element_text(angle = 45, vjust = 1, hjust=1),
plot.title = element_text(size = 14, face = "bold"))
top_aroma_df = df |>
left_join(cheese_df, by = join_by(cheese_id)) |>
left_join(aroma_df, by = join_by(aroma_id)) |>
group_by(aroma, aroma_id) |>
summarise(cnt = n(), .groups = 'keep') |>
arrange(desc(cnt)) |>
head(12)
df |>
select(cheese_id, milk_id, aroma_id) |>
left_join(cheese_df, by = join_by(cheese_id)) |>
inner_join(top_aroma_df, by = join_by(aroma_id)) |>
inner_join(milk_cnt_df, by = join_by(milk_id)) |>
mutate(
aroma = str_to_title(aroma),
milk = str_to_title(milk)
) |>
group_by(milk) |>
mutate(y= n()) |>
group_by(milk, aroma) |>
mutate(x = n()) |>
mutate(Unique_Cheese_Prct = (x / y)*100) |>
select(milk, aroma, Unique_Cheese_Prct, x, milk_cnt, y) |>
distinct() |>
ggplot(aes(x = milk, y = Unique_Cheese_Prct, fill = aroma)) +
geom_col(position = position_dodge2(width = 0.3, preserve = "single")) +
labs(
title = "Top Cheese Aromas and Milk Distribution",
subtitle = 'Top 12 Aromas and Milk Types with a Count of at Least 10',
x = "Milk",
y = "Percent of Milk",
fill = 'Aroma'
) +
theme_classic() +
theme(
axis.text.x = element_text(angle = 45, vjust = 1, hjust=1),
plot.title = element_text(size = 14, face = "bold"))
# Export cleaned Emissions dataset
write.csv(yearly_emissions_by_area, "yearly_emissions_by_area_cleaned.csv", row.names = FALSE)
write.csv(yearly_emissions, "yearly_emissions_cleaned.csv", row.names = FALSE)
write.csv(emissions_by_area, "emissions_by_area_cleaned.csv", row.names = FALSE)
# Export cleaned DART water permit dataset
write.csv(tbl1_permit_lvl, "tbl1_permit_lvl_cleaned.csv", row.names = FALSE)
write.csv(tbl2_permit_act_lvl, "tbl2_permit_act_lvl_cleaned.csv", row.names = FALSE)
write.csv(tbl3_facility_lvl, "tbl3_facility_lvl_cleaned.csv", row.names = FALSE)
write.csv(tbl4_app_lvl, "tbl4_app_lvl_cleaned.csv", row.names = FALSE)
# Export cleaned Cheese Quality dataset
write.csv(cheese_df, "cheese_df_cleaned.csv", row.names = FALSE)
write.csv(country_df, "country_df_cleaned.csv", row.names = FALSE)
write.csv(milk_df, "milk_df_cleaned.csv", row.names = FALSE)
write.csv(texture_df, "texture_df_cleaned.csv", row.names = FALSE)
write.csv(aroma_df, "aroma_df_cleaned.csv", row.names = FALSE)
write.csv(flavor_df, "flavor_df_cleaned.csv", row.names = FALSE)
write.csv(df, "final_cheese_data_cleaned.csv", row.names = FALSE)
# Confirm that the files were saved successfully
list.files(pattern = "*.csv")
## [1] "aroma_df_cleaned.csv"
## [2] "cheese_df_cleaned.csv"
## [3] "CoffeeChain-SimplifiedData.csv"
## [4] "correlation-between-child-mortality-and-mean-years-of-schooling-for-those-aged-15-and-older.csv"
## [5] "country_df_cleaned.csv"
## [6] "data-1.csv"
## [7] "emissions_by_area_cleaned.csv"
## [8] "Facts_Python_1 (2).csv"
## [9] "FED-BCABLCCEESO_N_Q.csv"
## [10] "final_cheese_data_cleaned.csv"
## [11] "flavor_df_cleaned.csv"
## [12] "food_coded.csv"
## [13] "FRED-EAAXDBNQ.csv"
## [14] "FRED-EPLNQ.csv"
## [15] "laptopData.csv"
## [16] "Master_Company_List.csv"
## [17] "mean-years-of-schooling-female.csv"
## [18] "milk_df_cleaned.csv"
## [19] "movies.csv"
## [20] "MULTPL-SP500_BVPS_QUARTER.csv"
## [21] "New_York_City_Leading_Causes_of_Death.csv"
## [22] "Requested_Facts_Projecy_2.csv (1).zip"
## [23] "Requested_Facts_Projecy_2.csv.zip"
## [24] "Sample - Superstore - Returns.csv"
## [25] "tb(1)(2).csv"
## [26] "tbl1_permit_lvl_cleaned.csv"
## [27] "tbl2_permit_act_lvl_cleaned.csv"
## [28] "tbl3_facility_lvl_cleaned.csv"
## [29] "tbl4_app_lvl_cleaned.csv"
## [30] "texture_df_cleaned.csv"
## [31] "Total Emissions Per Country (2000-2020).csv"
## [32] "yearly_emissions_by_area_cleaned.csv"
## [33] "yearly_emissions_cleaned.csv"
.csv
files are now ready for machine learning models,
visualization dashboards, and predictive analytics.This project focused on transforming and analyzing three diverse
datasets, demonstrating the importance of data wrangling techniques in
preparing raw information for meaningful insights. By leveraging
tidyr
and dplyr
, we efficiently cleaned,
structured, and transformed the datasets into a tidy format, making them
suitable for downstream analysis.
Emissions Data:
DART Water Permits Data:
Cheese Quality Data:
The process of tidying and transforming data is crucial for accurate and meaningful analysis. Well-structured data improves efficiency in both visualization and modeling.
Converting datasets to tidy formats ensured that values were easily accessible for statistical computation.
These datasets can now be used for deeper predictive modeling, trend forecasting, and policy recommendations based on their respective domains.
Through this project, we reinforced the significance of data wrangling techniques in real-world data science applications. The ability to tidy, transform, and analyze raw data is a fundamental skill that enhances decision-making and unlocks valuable insights across different industries.