Project 2 B

Author

Michael Mayne

Project 2 Continuation:

Code Base Part #2 :

For this part of the project, I will continue by using the remaining two datasets to run an analysis of the information listed in Discussion 5A earlier this semester. As in part one, the focus is on cleaning the data effectively, changing its format, and graphing the results.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.0     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading Data Sets (Waste Management & Disney Parks)

#Disney Parks

disneyParks<- read_csv("https://raw.githubusercontent.com/Mayneman000/DATA607Assignment/refs/heads/DATA/disney_parks_monthly_attendance.csv")

Rows: 4 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (12): Park Name, Jan 2025, March, May, Jun, July, Aug, Sept, October 202...
dbl  (1): Apr_2025
num  (1): Feb-2025

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#Waste Management

Waste_records <- read_csv("https://raw.githubusercontent.com/Mayneman000/DATA607Assignment/refs/heads/DATA/Solid_Waste_Management_Facilities.csv")

Rows: 2774 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (18): Facility Name, Location Address, Location Address2, City, State, Z...
dbl  (4): Region, Phone Number, East Coordinate, North Coordinate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Disney Parks: Cleaning

We can start by cleaning the Disney parks dataset, which looks small but has a lot of issues for such a simple dataset. I will first take a look at the data, then break down the corrections to be made.

glimpse(disneyParks)

Rows: 4
Columns: 14
$ `Park Name`    <chr> "Magic Kingdom", "EPCOT", "Disneyland Park", "Hollywood…
$ `Jan 2025`     <chr> "1.5M", "1.2M", "1.1M", "950K"
$ `Feb-2025`     <dbl> 1600000, 1300000, 1200000, 980000
$ March          <chr> "2.1 million", "1.7 million", "1.6 million", "1.3 milli…
$ Apr_2025       <dbl> 2300000, 1800000, 1700000, 1400000
$ May            <chr> "2.5M", "2.0M", "1.9M", NA
$ Jun            <chr> "2.8M", "2.2M", "2.1M", "1.6M"
$ July           <chr> "3.0M", "2.4M", "2.3M", "1.8M"
$ Aug            <chr> "2.9M", "2.3M", "2.2M", "1.7M"
$ Sept           <chr> "2.4M", "1.9M", "1.8M", "1.4M"
$ `October 2025` <chr> "2.2M", "1.8M", "1.7M", "1.3M"
$ Nov            <chr> "1.9M", "1.6M", "1.5M", "1.1M"
$ Dec            <chr> "2.6M", "2.1M", "2.0M", "1.5M"
$ Region         <chr> "Florida", "Florida", "California", "Florida"

Unify Columns and Roles:

Many of the month columns have mismatched names, which makes functions harder to write and read. Since all of the Disney parks information is from 2025 only, we will simply rename the columns to their months.

disneyParks<- disneyParks %>% 
  rename(ParkName= `Park Name` , Jan = `Jan 2025` , Feb= `Feb-2025`, April = Apr_2025, Oct = `October 2025`)

Organizing Dataset from wide to long format:

By pivoting the data frame we can see how each park performs by month. We do not need the months as separate columns, so the only columns to keep are ParkName and Region, with the month names collected in a month column and the monthly values in a column named Revenue (the values are attendance counts).

ParkName and Region are excluded from the pivot so that only the month columns are reshaped:

disneyParks_Long <- disneyParks %>%
  pivot_longer(
    cols = -c(ParkName, Region),
    names_to = "month",
    values_to = "Revenue",
    values_transform = list(Revenue = as.character)
  )

print(disneyParks_Long)

# A tibble: 48 × 4
   ParkName      Region  month Revenue    
   <chr>         <chr>   <chr> <chr>      
 1 Magic Kingdom Florida Jan   1.5M       
 2 Magic Kingdom Florida Feb   1600000    
 3 Magic Kingdom Florida March 2.1 million
 4 Magic Kingdom Florida April 2300000    
 5 Magic Kingdom Florida May   2.5M       
 6 Magic Kingdom Florida Jun   2.8M       
 7 Magic Kingdom Florida July  3.0M       
 8 Magic Kingdom Florida Aug   2.9M       
 9 Magic Kingdom Florida Sept  2.4M       
10 Magic Kingdom Florida Oct   2.2M       
# ℹ 38 more rows
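The reshaping step above can be tried on a small table first. The sketch below is only an illustration: `toy_wide` and its values are made up, and it assumes tidyr 1.1.0 or later for `values_transform`. It shows why the character conversion is needed when some month columns are text and others are numeric.

```r
library(tidyr)
library(tibble)

# Made-up wide table: one row per park, one column per month
toy_wide <- tibble(
  ParkName = c("Park A", "Park B"),
  Jan = c("1.5M", "950K"),     # stored as character
  Feb = c(1600000, 980000)     # stored as numeric
)

# Character and numeric columns cannot share one value column,
# so every value is converted to character while pivoting
toy_long <- pivot_longer(
  toy_wide,
  cols = -ParkName,
  names_to = "month",
  values_to = "Revenue",
  values_transform = list(Revenue = as.character)
)

print(toy_long)
```

Each park now has one row per month, which is the shape the later group and summarize steps expect.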

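The values cannot be used for calculations yet because they are stored as characters in inconsistent formats ("1.5M", "2.1 million", "1600000") rather than as numbers. We can use a regular expression to extract the numeric part and then scale it by its unit to convert each value to a number.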

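# Pull out the numeric part of each value, e.g. "1.5M" becomes 1.5
a <- disneyParks_Long %>%
  mutate(numbers = as.numeric(str_extract(Revenue, "[0-9.]+")))

# Scale by the unit: M or million = 1,000,000; K = 1,000; plain numbers stay as they are
b <- a %>%
  mutate(Revenue_Clean = case_when(
    str_detect(Revenue, "M|million") ~ numbers * 1000000,
    str_detect(Revenue, "K") ~ numbers * 1000,
    TRUE ~ numbers))

# Keep only the columns needed for the analysis
disneyParks_Clean <- b %>%
  select(ParkName, Region, month, Revenue_Clean)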
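The same conversion logic could also be wrapped in a small function so it can be checked on a handful of values before running it on the whole column. This is a minimal sketch, not the method used in this report: `parse_count` is a hypothetical helper name, and it uses `ifelse()` in place of `case_when()` only to keep the example short.

```r
library(stringr)

# Hypothetical helper: turn labels such as "1.5M", "2.1 million",
# "950K" or "1600000" into plain numbers
parse_count <- function(x) {
  x <- as.character(x)
  # Numeric part of the label, e.g. "1.5M" gives 1.5
  value <- as.numeric(str_extract(x, "[0-9.]+"))
  # Multiplier implied by the unit; 1 when no unit is present
  multiplier <- ifelse(
    str_detect(x, "M|million"), 1e6,
    ifelse(str_detect(x, "K"), 1e3, 1)
  )
  value * multiplier
}

parsed <- parse_count(c("1.5M", "2.1 million", "950K", "1600000", NA))
print(parsed)
```

Missing values stay missing, which matches how the NA in the May column is handled later with `na.rm = TRUE`.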

Visualization and Analysis

So now we need to ask, “Which month has the highest attendance across all Disney parks?” This can be answered with the group_by() and summarize() functions provided by dplyr.

glimpse(disneyParks_Clean)

Rows: 48
Columns: 4
$ ParkName      <chr> "Magic Kingdom", "Magic Kingdom", "Magic Kingdom", "Magi…
$ Region        <chr> "Florida", "Florida", "Florida", "Florida", "Florida", "…
$ month         <chr> "Jan", "Feb", "March", "April", "May", "Jun", "July", "A…
$ Revenue_Clean <dbl> 1500000, 1600000, 2100000, 2300000, 2500000, 2800000, 30…

disneyParks_Clean%>%
  group_by(month)%>%  #Grouping Data by Month
  summarize(
    Monthy_Revenue = sum(Revenue_Clean, na.rm = TRUE)) %>%   #Getting revenue per month & removing the NA
  arrange(desc(Monthy_Revenue))  # organize from highest to lowest

# A tibble: 12 × 2
   month Monthy_Revenue
   <chr>          <dbl>
 1 July         9500000
 2 Aug          9100000
 3 Jun          8700000
 4 Dec          8200000
 5 Sept         7500000
 6 April        7200000
 7 Oct          7000000
 8 March        6700000
 9 May          6400000
10 Nov          6100000
11 Feb          5080000
12 Jan          4750000

This table shows that July is the busiest month across all the parks, while January is when the parks collectively have the fewest visitors. Note that the May total is missing the fourth park's value, which is NA in the source data. We can use the same approach to check whether this pattern also holds for California.

disneyParks_Clean%>%
  filter (Region == "California") %>%
  group_by(month)%>%  
  summarize(
    Monthy_Revenue = sum(Revenue_Clean, na.rm = TRUE)) %>%  
  arrange(desc(Monthy_Revenue))

# A tibble: 12 × 2
   month Monthy_Revenue
   <chr>          <dbl>
 1 July         2300000
 2 Aug          2200000
 3 Jun          2100000
 4 Dec          2000000
 5 May          1900000
 6 Sept         1800000
 7 April        1700000
 8 Oct          1700000
 9 March        1600000
10 Nov          1500000
11 Feb          1200000
12 Jan          1100000
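Because month is stored as plain text, any table or plot that is not sorted by value will order the months alphabetically. One possible way to get calendar order, sketched here under the assumption that the month labels are exactly the twelve shown in the tables above, is to convert the column to a factor with explicit levels. The names `month_levels` and `months_seen` are made up for this example.

```r
# Month labels exactly as they appear in the cleaned data, in calendar order
month_levels <- c("Jan", "Feb", "March", "April", "May", "Jun",
                  "July", "Aug", "Sept", "Oct", "Nov", "Dec")

# A few labels in alphabetical order, as a text sort would arrange them
months_seen <- c("April", "Aug", "Dec", "Feb", "Jan", "July")

# A factor with explicit levels sorts by calendar position instead
months_ordered <- sort(factor(months_seen, levels = month_levels))
print(as.character(months_ordered))
```

In the pipeline this would correspond to something like `mutate(month = factor(month, levels = month_levels))` before grouping or plotting.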

Graphing Data

We can compare the total annual attendance of the parks.

Group by Park

Park_Totals <- disneyParks_Clean %>%
  group_by(ParkName)%>%
  summarize(
    Total_Att = sum(Revenue_Clean, na.rm = TRUE)
  )

ggplot(Park_Totals, aes(x= ParkName, y= Total_Att)) +
  geom_col()+
  labs(title = "2025 Annual Park Attendance for Each Disney Park", 
       x = "Disney Park Location",
       y = "Total Attendance")

Final Analysis - Waste Management

Introduction to data: This is an untidy dataset provided by one of my classmates that covers the distribution of waste management facilities. The purpose of this analysis is to see how waste facilities are distributed by county.

glimpse(Waste_records)

Rows: 2,774
Columns: 22
$ `Facility Name`            <chr> "Eversharp Recycling Inc", "Kept Companies;…
$ `Location Address`         <chr> "10A Morris Ave", "42 Cherry Ln", "478 Gran…
$ `Location Address2`        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ City                       <chr> "Glen Cove", "Floral Park", "Westbury", "Gr…
$ State                      <chr> "NY", "NY", "NY", "NY", "NY", "NY", "NY", "…
$ `Zip Code`                 <chr> "11542", "11001", "11590", "11944", "10474"…
$ County                     <chr> "Nassau", "Nassau", "Nassau", "Suffolk", "B…
$ Region                     <dbl> 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4…
$ `Phone Number`             <dbl> 5169030406, 5167790108, 5163347625, NA, 718…
$ `Owner Name`               <chr> "Barbara Piliero", NA, "Christopher Stasi",…
$ `Owner Type`               <chr> "Private", NA, "Private", NA, "Private", "P…
$ `Activity Desc`            <chr> "C&D processing - registration", "Transfer …
$ `Activity Number`          <chr> "[30W47R]", "[30XP0108]", "[30CP0180]", "[5…
$ Active                     <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "…
$ `East Coordinate`          <dbl> 614932, 610197, 620845, 735302, 594480, 591…
$ `North Coordinate`         <dbl> 4524181, 4509811, 4512544, 4561644, 4517942…
$ `Accuracy Code`            <chr> NA, NA, NA, "4.3 - Utilization of Digital O…
$ `Waste Types`              <chr> "Tree Debris;Concrete;Soil (Clean)", NA, "C…
$ `Authorization Number`     <chr> "30W47R", "1-2822-01754/00001", "30W48R", "…
$ `Authorization Issue Date` <chr> "06/28/2010", "10/22/2025", NA, "01/13/2023…
$ `Expiration Date`          <chr> NA, "10/21/2035", NA, "10/24/2032", "03/24/…
$ Georeference               <chr> "10A Morris Ave\nGlen Cove, NY 11542\n(40.8…

Reducing Columns

This is the largest of the three datasets, and although it is definitely untidy and has a lot of data, not all of it is needed.

To begin, State is deemed unnecessary because all the facilities in this dataset are in NY, which makes the column redundant. So I decided to focus on the columns that I do need, which are:

Facility Name
Location Address & Location Address2
City, Zip Code, County, Region
Active & Waste Types

Cleaning Columns

Waste_records_reduce <- Waste_records %>%
  select(`Facility Name`, `Location Address`, `Location Address2`, City, `Zip Code`, County, Region, Active, `Waste Types`) %>%
  rename(FacilityName = `Facility Name`, Address = `Location Address`, Address_2 = `Location Address2`, Zip_Code = `Zip Code`, Waste_Types = `Waste Types`)

# Join the two address columns into one

Waste_Clean <- Waste_records_reduce %>%
  unite("Full_Address", Address, Address_2, sep = ", ", na.rm = TRUE)
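The `na.rm = TRUE` argument matters here because most second address lines are missing. The following sketch uses made-up addresses (`toy_addresses` is a hypothetical table) to compare the result with and without it.

```r
library(tidyr)
library(tibble)

# Made-up addresses; the second line is NA when a facility has none
toy_addresses <- tibble(
  Address   = c("10 Main St", "42 Cherry Ln"),
  Address_2 = c("Suite 2", NA)
)

# With na.rm = TRUE, missing parts are dropped before joining
with_na_rm <- unite(toy_addresses, "Full_Address", Address, Address_2,
                    sep = ", ", na.rm = TRUE)

# Without it, the missing part is pasted in as the text "NA"
without_na_rm <- unite(toy_addresses, "Full_Address", Address, Address_2,
                       sep = ", ")

print(with_na_rm$Full_Address)
print(without_na_rm$Full_Address)
```

The second address comes out as "42 Cherry Ln" in the first version and "42 Cherry Ln, NA" in the second.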

Now with the full address column in place, we can take on one final task: lengthening the dataset.

With our reduced data, there does not seem to be a big issue working with what we have, but it would help to at least understand the types of waste the facilities deal with.

Waste_Long <- Waste_Clean %>%
  mutate(across(c(Region, Active, Zip_Code), as.character)) %>% 
  pivot_longer(
    cols = c(Region, Active), 
    names_to = "Status", 
    values_to = "Value"
  )


print(Waste_Long)

# A tibble: 5,548 × 8
   FacilityName      Full_Address City  Zip_Code County Waste_Types Status Value
   <chr>             <chr>        <chr> <chr>    <chr>  <chr>       <chr>  <chr>
 1 Eversharp Recycl… 10A Morris … Glen… 11542    Nassau Tree Debri… Region 1    
 2 Eversharp Recycl… 10A Morris … Glen… 11542    Nassau Tree Debri… Active Yes  
 3 Kept Companies; … 42 Cherry Ln Flor… 11001    Nassau <NA>        Region 1    
 4 Kept Companies; … 42 Cherry Ln Flor… 11001    Nassau <NA>        Active Yes  
 5 Rock Crush Recyc… 478 Grand B… West… 11590    Nassau Concrete;A… Region 1    
 6 Rock Crush Recyc… 478 Grand B… West… 11590    Nassau Concrete;A… Active Yes  
 7 USDHS - PIADC Bu… P O Box 848  Gree… 11944    Suffo… <NA>        Region 1    
 8 USDHS - PIADC Bu… P O Box 848  Gree… 11944    Suffo… <NA>        Active Yes  
 9 Luciano Auto Wre… 275 Halleck… Bronx 10474    Bronx  End of Lif… Region 2    
10 Luciano Auto Wre… 275 Halleck… Bronx 10474    Bronx  End of Lif… Active Yes  
# ℹ 5,538 more rows
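Since the Waste_Types column stores several types in one cell separated by semicolons, another way to lengthen the data would be to split that column so each row holds a single waste type. This is only a sketch of that idea on made-up rows, not a step taken in this report. It assumes tidyr 1.3.0 or later for `separate_longer_delim()`, and `toy_waste` is a hypothetical table.

```r
library(dplyr)
library(tidyr)

# Made-up facilities with semicolon-separated waste types
toy_waste <- tibble(
  FacilityName = c("Facility A", "Facility B", "Facility C"),
  Waste_Types  = c("Tree Debris;Concrete;Soil (Clean)", NA, "Concrete;Glass")
)

waste_by_type <- toy_waste %>%
  filter(!is.na(Waste_Types)) %>%                      # drop facilities with no listed types
  separate_longer_delim(Waste_Types, delim = ";") %>%  # one row per facility and type
  mutate(Waste_Types = trimws(Waste_Types)) %>%        # remove stray spaces around a type
  count(Waste_Types, sort = TRUE, name = "Facilities") # facilities handling each type

print(waste_by_type)
```

On the real data, the same steps applied to Waste_Clean would give a count of facilities per waste type.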

We can then use the Clean dataset (it is the most effective, since it still has one row per facility record) to count the total number of facilities per county.

Waste_Total <- Waste_Clean %>%
  group_by(County) %>%
  summarize(
    Total_Count = n()) %>%
  arrange(desc(Total_Count))

print(Waste_Total)

# A tibble: 62 × 2
   County      Total_Count
   <chr>             <int>
 1 Suffolk             317
 2 Erie                162
 3 Nassau              101
 4 Monroe               95
 5 Albany               92
 6 Onondaga             90
 7 Jefferson            78
 8 Oneida               73
 9 Westchester          69
10 Chautauqua           66
# ℹ 52 more rows
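The group, summarize, and arrange steps above can also be written with dplyr's `count()` shortcut. The sketch below compares the two forms on a made-up list of counties (`toy_facilities` is hypothetical); both should give the same table.

```r
library(dplyr)

# Made-up facility list with one row per facility
toy_facilities <- tibble(
  County = c("Suffolk", "Erie", "Suffolk", "Nassau", "Suffolk", "Erie")
)

# Long form, as used in the report
long_form <- toy_facilities %>%
  group_by(County) %>%
  summarize(Total_Count = n()) %>%
  arrange(desc(Total_Count))

# Short form: count() groups, tallies, and sorts in one call
short_form <- toy_facilities %>%
  count(County, sort = TRUE, name = "Total_Count")

print(short_form)
```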

According to the table, Suffolk has the highest number of waste facilities. We do not see any of the main NYC counties until we reach Kings County (Brooklyn) with 49. The lowest count is New York County with 7 facilities, which is understandable considering how coveted space is in Manhattan.

Graphing the amount of waste facilities by County

ggplot(Waste_Total, aes(x = reorder(County, Total_Count), y = Total_Count)) +
  geom_col() +
  coord_flip() +
  labs(title = "Waste Facilities by County", x = "County", y = "# of Facilities")

Plotting the data shows that the facilities are surprisingly spread out. The distribution is not even, but the number of facilities varies from county to county rather than being concentrated in only a few.
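With 62 counties on one axis the labels can get crowded, so it may help to plot only the counties with the most facilities. This is a sketch of that option, assuming dplyr 1.0.0 or later for `slice_max()`. The table `toy_totals` copies the first five rows of the county table printed earlier, and the cutoff of three is an arbitrary choice for the example.

```r
library(dplyr)
library(ggplot2)

# First five rows of the county totals printed earlier in the report
toy_totals <- tibble(
  County = c("Suffolk", "Erie", "Nassau", "Monroe", "Albany"),
  Total_Count = c(317, 162, 101, 95, 92)
)

# Keep only the counties with the largest counts
top_counties <- toy_totals %>%
  slice_max(Total_Count, n = 3)

# Same horizontal bar chart as before, limited to the top counties
top_plot <- ggplot(top_counties,
                   aes(x = reorder(County, Total_Count), y = Total_Count)) +
  geom_col() +
  coord_flip() +
  labs(title = "Counties with the Most Waste Facilities",
       x = "County", y = "# of Facilities")

print(top_counties)
```

Running `print(top_plot)` draws the chart. On the full Waste_Total table a larger cutoff such as `n = 15` might be more useful.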

End of Report