SnowfallNYC

Author

Fraz Aslam

Approach

The dataset I selected for the first assignment is daily NYC snowfall in Central Park from 1869 to 2022. It comes from the Kaggle dataset titled “New York City Weather: A 154-Year Retrospective”. I was inspired to choose this data by the snowstorm in NYC this past weekend. I wanted to see what the worst occurrences of snowfall have been in NYC history and compare them with our most recent experience.

Loading the libraries I need to read the data and display it in a presentable fashion

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.2.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(gt)

Here I load the dataset from a raw GitHub file and display its structure, including the number of rows and columns

url <- "https://raw.githubusercontent.com/AslamF/DATA607-Assignment-1/refs/heads/main/NYC_Central_Park_weather_1869-2022.csv"

df <- read_csv(
  file = url,
  show_col_types = FALSE,
  progress = FALSE
)

glimpse(df)
Rows: 56,245
Columns: 6
$ DATE <date> 1869-01-01, 1869-01-02, 1869-01-03, 1869-01-04, 1869-01-05, 1869…
$ PRCP <dbl> 0.75, 0.03, 0.00, 0.18, 0.05, 0.00, 0.00, 0.00, 0.00, 0.01, 0.00,…
$ SNOW <dbl> 9.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
$ SNWD <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ TMIN <dbl> 19, 21, 27, 34, 37, 34, 35, 40, 38, 33, 30, 29, 28, 32, 39, 32, 2…
$ TMAX <dbl> 29, 27, 35, 37, 43, 38, 48, 54, 48, 44, 33, 37, 38, 42, 42, 38, 3…
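
The glimpse output shows SNWD as NA in the earliest rows, so it may be worth counting missing values per column before relying on any of them. The following is a minimal sketch, not part of the original analysis. It uses a small stand-in tibble (sample_df, a name I made up) built from the first five rows shown in this report so that it runs without downloading the file; the same pipeline could be applied to df.

```r
library(dplyr)
library(tibble)

# Stand-in for df: the first five rows shown in this report
sample_df <- tibble(
  DATE = as.Date(c("1869-01-01", "1869-01-02", "1869-01-03",
                   "1869-01-04", "1869-01-05")),
  PRCP = c(0.75, 0.03, 0.00, 0.18, 0.05),
  SNOW = c(9.0, 0.0, 0.0, 0.0, 0.0),
  SNWD = NA_real_,  # recycled to all five rows
  TMIN = c(19, 21, 27, 34, 37),
  TMAX = c(29, 27, 35, 37, 43)
)

# Count NA values in every column
na_counts <- sample_df |>
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```

On the full dataset the counts would show how much of each column is actually usable.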

Here I display the first 100 rows of the dataset using the gt library

df |>
  head(100) |>
  gt()
DATE PRCP SNOW SNWD TMIN TMAX
1869-01-01 0.75 9.0 NA 19 29
1869-01-02 0.03 0.0 NA 21 27
1869-01-03 0.00 0.0 NA 27 35
1869-01-04 0.18 0.0 NA 34 37
1869-01-05 0.05 0.0 NA 37 43
1869-01-06 0.00 0.0 NA 34 38
1869-01-07 0.00 0.0 NA 35 48
1869-01-08 0.00 0.0 NA 40 54
1869-01-09 0.00 0.0 NA 38 48
1869-01-10 0.01 0.0 NA 33 44
1869-01-11 0.00 0.0 NA 30 33
1869-01-12 0.85 0.0 NA 29 37
1869-01-13 0.00 0.0 NA 28 38
1869-01-14 0.00 0.0 NA 32 42
1869-01-15 0.04 0.0 NA 39 42
1869-01-16 0.00 0.0 NA 32 38
1869-01-17 0.00 0.1 NA 29 35
1869-01-18 0.00 0.0 NA 26 29
1869-01-19 0.15 6.0 NA 27 34
1869-01-20 0.00 0.0 NA 30 37
1869-01-21 0.00 0.0 NA 29 42
1869-01-22 0.00 0.0 NA 17 29
1869-01-23 0.00 0.0 NA 20 39
1869-01-24 0.00 0.0 NA 35 48
1869-01-25 0.00 0.0 NA 22 38
1869-01-26 0.00 0.0 NA 18 31
1869-01-27 0.00 0.0 NA 23 40
1869-01-28 0.00 0.0 NA 35 47
1869-01-29 0.00 0.0 NA 30 48
1869-01-30 0.47 0.0 NA 41 54
1869-01-31 0.00 0.0 NA 30 41
1869-02-01 0.00 0.0 NA 25 33
1869-02-02 0.00 0.0 NA 22 34
1869-02-03 0.00 0.0 NA 35 36
1869-02-04 1.55 1.1 NA 21 38
1869-02-05 0.00 0.0 NA 22 26
1869-02-06 0.00 0.0 NA 26 40
1869-02-07 0.00 0.0 NA 23 36
1869-02-08 0.00 0.0 NA 25 40
1869-02-09 0.00 0.0 NA 34 38
1869-02-10 0.85 0.0 NA 34 39
1869-02-11 0.05 0.0 NA 38 46
1869-02-12 0.00 0.0 NA 36 48
1869-02-13 0.00 0.0 NA 40 61
1869-02-14 0.00 0.0 NA 35 42
1869-02-15 2.60 0.0 NA 32 43
1869-02-16 0.00 0.0 NA 36 44
1869-02-17 0.00 0.0 NA 34 46
1869-02-18 0.54 2.0 NA 30 41
1869-02-19 0.06 0.5 NA 30 42
1869-02-20 0.00 0.0 NA 32 39
1869-02-21 0.00 0.0 NA 33 40
1869-02-22 0.00 0.0 NA 38 50
1869-02-23 0.56 0.0 NA 33 44
1869-02-24 0.00 0.0 NA 27 38
1869-02-25 0.00 0.0 NA 26 38
1869-02-26 0.66 6.0 NA 31 34
1869-02-27 0.00 0.0 NA 22 31
1869-02-28 0.00 0.0 NA 17 25
1869-03-01 0.00 0.0 NA 4 26
1869-03-02 0.02 0.0 NA 19 36
1869-03-03 0.00 0.0 NA 28 43
1869-03-04 0.10 0.0 NA 19 44
1869-03-05 0.00 0.0 NA 11 35
1869-03-06 0.03 0.8 NA 20 36
1869-03-07 0.00 0.0 NA 20 26
1869-03-08 0.00 0.0 NA 26 38
1869-03-09 0.00 0.0 NA 32 46
1869-03-10 1.06 0.0 NA 38 47
1869-03-11 0.00 0.0 NA 28 37
1869-03-12 0.00 0.0 NA 27 38
1869-03-13 0.00 0.0 NA 33 52
1869-03-14 0.02 0.0 NA 34 48
1869-03-15 0.18 0.0 NA 25 34
1869-03-16 0.00 0.0 NA 26 39
1869-03-17 0.00 0.0 NA 31 42
1869-03-18 0.00 0.0 NA 24 36
1869-03-19 0.00 0.0 NA 34 42
1869-03-20 0.28 0.0 NA 32 48
1869-03-21 0.00 0.0 NA 21 32
1869-03-22 0.00 0.0 NA 21 35
1869-03-23 0.95 0.0 NA 32 40
1869-03-24 0.00 0.0 NA 35 50
1869-03-25 0.00 0.0 NA 33 46
1869-03-26 0.00 0.0 NA 39 47
1869-03-27 0.78 0.0 NA 48 60
1869-03-28 0.00 0.0 NA 41 55
1869-03-29 0.00 0.0 NA 39 40
1869-03-30 1.15 0.0 NA 40 53
1869-03-31 0.04 0.0 NA 38 46
1869-04-01 0.01 0.0 NA 36 48
1869-04-02 0.43 0.0 NA 38 43
1869-04-03 0.00 0.0 NA 38 41
1869-04-04 0.00 0.0 NA 30 38
1869-04-05 0.00 0.0 NA 36 50
1869-04-06 0.00 0.0 NA 42 53
1869-04-07 0.30 0.0 NA 37 44
1869-04-08 0.00 0.0 NA 36 52
1869-04-09 0.00 0.0 NA 33 47
1869-04-10 0.00 0.0 NA 34 47

Here I demonstrate basic subsetting by keeping only the first 3 columns and the first 5 rows of the dataset. I save the result in a new variable, df1, and display it using the gt library

df1 <- df[1:5, c(1,2,3)]

df1 |> 
  gt()
DATE PRCP SNOW
1869-01-01 0.75 9
1869-01-02 0.03 0
1869-01-03 0.00 0
1869-01-04 0.18 0
1869-01-05 0.05 0
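
The same subset can also be written with dplyr verbs, selecting columns by name rather than by position, which keeps working if the column order in the file changes. This is only a sketch of an alternative; sample_df and df1_by_name are hypothetical names, and the six rows are copied from the table above so the example runs on its own.

```r
library(dplyr)
library(tibble)

# Stand-in for df: the first six rows of the columns used here
sample_df <- tibble(
  DATE = as.Date(c("1869-01-01", "1869-01-02", "1869-01-03",
                   "1869-01-04", "1869-01-05", "1869-01-06")),
  PRCP = c(0.75, 0.03, 0.00, 0.18, 0.05, 0.00),
  SNOW = c(9.0, 0.0, 0.0, 0.0, 0.0, 0.0),
  TMIN = c(19, 21, 27, 34, 37, 34)
)

# Keep three columns by name, then the first five rows
df1_by_name <- sample_df |>
  select(DATE, PRCP, SNOW) |>
  slice_head(n = 5)

df1_by_name
```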

Subset data focusing on SNOW

I created a subset of the data focused on SNOW (daily snowfall, in inches) and sorted it in descending order. I display only the top 50 results because the dataset is very large.

df |>
  arrange(desc(SNOW)) |>
  head(50) |>
  select(DATE, SNOW) |>
  gt() |>
  tab_header(
    title = "Top 50 Snowiest Days in NYC Central Park",
    subtitle = "1869-2022"
  )
Top 50 Snowiest Days in NYC Central Park
1869-2022
DATE SNOW
2016-01-23 27.3
1947-12-26 26.1
2006-02-12 24.1
1872-12-26 18.0
1888-03-12 16.5
2003-02-17 16.3
1948-12-19 15.8
1941-03-08 15.7
1978-02-06 15.5
2021-02-01 14.8
1969-02-09 14.0
1996-01-07 13.6
1914-03-01 13.5
1879-01-16 13.0
1935-01-23 12.8
1994-02-11 12.8
1916-12-15 12.7
1979-02-19 12.7
1921-02-20 12.5
1960-03-03 12.5
1967-02-07 12.5
1983-02-11 12.5
2011-01-26 12.3
2010-12-26 12.2
2000-12-30 12.0
1960-12-12 11.6
1925-01-02 11.5
1964-01-13 11.5
2010-02-26 11.5
1912-12-24 11.4
1961-02-04 11.4
1933-12-26 11.2
1978-01-20 11.1
1876-02-04 11.0
2014-01-21 11.0
1995-02-04 10.8
1926-02-10 10.4
1959-12-22 10.3
1993-03-13 10.2
1874-12-20 10.0
1896-03-02 10.0
1902-02-17 10.0
1905-01-25 10.0
1915-04-03 10.0
1933-02-11 10.0
2010-02-10 10.0
1996-02-16 9.9
1934-02-01 9.8
2018-01-04 9.8
1982-04-06 9.6
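
An alternative to arrange() followed by head() is dplyr's slice_max(), which also makes the handling of ties explicit. The sketch below is an illustration rather than the method used above. The names sample_days, top_ties and top_exact are hypothetical, and the five rows are snowy days copied from the first 100 rows of the dataset.

```r
library(dplyr)
library(tibble)

# Five snowy days taken from the first 100 rows shown earlier
sample_days <- tibble(
  DATE = as.Date(c("1869-01-01", "1869-01-19", "1869-02-04",
                   "1869-02-18", "1869-02-26")),
  SNOW = c(9.0, 6.0, 1.1, 2.0, 6.0)
)

# Default behaviour keeps ties, so asking for 2 rows can return more
top_ties <- sample_days |>
  slice_max(SNOW, n = 2)

# with_ties = FALSE returns exactly n rows
top_exact <- sample_days |>
  slice_max(SNOW, n = 2, with_ties = FALSE)

nrow(top_ties)
nrow(top_exact)
```

Because two days in the sample share 6.0 inches, the first call returns three rows and the second returns two. Whether ties should be kept in a "top 50" table is a design choice.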

Conclusion

In this first assignment I identified a dataset with information I wanted to parse and display. I was able to upload and access that dataset and then display the specific information I wanted. In particular, I was able to see the highest single-day snowfall totals, in inches, over the 154 years of NYC records in the data. The results show that January 23, 2016 had the largest single-day snowfall with a shocking 27.3 inches, and right below that is 26.1 inches on December 26, 1947! There are many more ways I can explore this data, such as comparing the amount of snow per year and creating subsets of the data for that.
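
The per-year comparison mentioned above could start from something like the following sketch. It groups by calendar year using lubridate's year(), which is an assumption on my part; a winter-season grouping that spans December to March might be more meaningful. The names sample_days and yearly_snow are hypothetical, and the six rows are taken from the top 50 table, so the sums are totals for this sample only and not true annual snowfall.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Six days copied from the top 50 table (a sample, not full years)
sample_days <- tibble(
  DATE = as.Date(c("1996-01-07", "1996-02-16", "2010-02-10",
                   "2010-02-26", "2010-12-26", "2016-01-23")),
  SNOW = c(13.6, 9.9, 10.0, 11.5, 12.2, 27.3)
)

# Sum snowfall per calendar year and sort from most to least
yearly_snow <- sample_days |>
  mutate(YEAR = year(DATE)) |>
  group_by(YEAR) |>
  summarise(TOTAL_SNOW = sum(SNOW, na.rm = TRUE), .groups = "drop") |>
  arrange(desc(TOTAL_SNOW))

yearly_snow
```

Running the same pipeline on df instead of sample_days would give one row per year from 1869 to 2022.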