NYC_WeatherP2

Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.

Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.

Perform the analysis requested in the discussion item. Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

Data Loading

Dataset Preparation

For This dataset, I am reading the information from a web table located in wikipedia. Using the html_node function for reading the tables and then choosing the 3rd table which is the one we are interested in. Finally, since the headers where presented as a row in the web table, the setNames is setting the second row as a header and then slice a row that was displayed as a footer for the table.

nyc_climate <- read_html("https://en.wikipedia.org/wiki/Climate_of_New_York",encoding = "UTF-8") %>% 

html_nodes("table") %>% 
  .[[3]] %>% 
  html_table(header = FALSE) %>%
  setNames(.[2,]) %>% 
  slice(3:(n()-1))

Original Dataframe Display

Climate data for Albany International Airport, New York (1981–2010 normals, extremes 1874–present)
Month	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec	Year
Record high °F (°C)	71(22)	74(23)	89(32)	93(34)	97(36)	100(38)	104(40)	102(39)	100(38)	91(33)	82(28)	72(22)	104(40)
Mean maximum °F (°C)	52.8(11.6)	53.5(11.9)	68.2(20.1)	81.6(27.6)	86.7(30.4)	91.1(32.8)	92.3(33.5)	91.2(32.9)	86.0(30.0)	76.8(24.9)	68.1(20.1)	54.8(12.7)	94.3(34.6)
Average high °F (°C)	30.6(−0.8)	34.6(1.4)	44.4(6.9)	58.3(14.6)	69.4(20.8)	77.9(25.5)	82.3(27.9)	80.4(26.9)	72.2(22.3)	59.8(15.4)	47.9(8.8)	35.8(2.1)	57.9(14.4)
Daily mean °F (°C)	22.6(−5.2)	25.9(−3.4)	35.0(1.7)	47.8(8.8)	58.3(14.6)	67.2(19.6)	71.8(22.1)	70.1(21.2)	61.9(16.6)	49.7(9.8)	39.7(4.3)	28.5(−1.9)	48.2(9.0)
Average low °F (°C)	14.5(−9.7)	17.3(−8.2)	25.7(−3.5)	37.3(2.9)	47.1(8.4)	56.5(13.6)	61.4(16.3)	59.9(15.5)	51.6(10.9)	39.6(4.2)	31.5(−0.3)	21.2(−6.0)	38.7(3.7)
Mean minimum °F (°C)	−7.1(−21.7)	−2.8(−19.3)	6.5(−14.2)	23.2(−4.9)	32.7(0.4)	42.0(5.6)	49.5(9.7)	46.5(8.1)	36.1(2.3)	25.8(−3.4)	15.9(−8.9)	1.6(−16.9)	−9.6(−23.1)
Record low °F (°C)	−28(−33)	−22(−30)	−21(−29)	9(−13)	26(−3)	35(2)	40(4)	34(1)	24(−4)	16(−9)	−11(−24)	−22(−30)	−28(−33)
Average precipitation inches (mm)	2.59(66)	2.20(56)	3.21(82)	3.17(81)	3.61(92)	3.79(96)	4.12(105)	3.46(88)	3.30(84)	3.68(93)	3.29(84)	2.93(74)	39.35(999)
Average snowfall inches (cm)	17.9(45)	12.2(31)	11.0(28)	2.3(5.8)	0.1(0.25)	0(0)	0(0)	0(0)	0(0)	0.3(0.76)	2.8(7.1)	13.7(35)	60.3(153)
Average precipitation days (≥ 0.01 in)	12.8	10.4	12.1	11.9	13.1	12.2	10.8	10.7	9.8	10.4	11.7	11.9	137.8
Average snowy days (≥ 0.1 in)	10.3	7.6	5.6	1.2	0.1	0	0	0	0	0.1	2.5	7.4	34.8
Average relative humidity (%)	71.1	68.5	64.8	61.2	65.5	69.5	70.5	74.1	75.7	72.4	73.1	73.9	70.0
Mean monthly sunshine hours	141.1	158.5	200.3	218.9	248.9	262.2	289.2	253.2	210.5	168.8	100.7	108.3	2,360.6
Percent possible sunshine	48	54	54	54	55	57	62	59	56	49	34	38	53

Original Dataset

Post Processed Dataset

The first step to process the dataset is to convert all the months presented in the column as a row. Then the temperature field will store the corresponding values for each month. Then the months are grouped according to their respective characteristics (variables like Record High Temperature, Mean Minimum, etc.. ) and then a row that works as an index for the new dataframe is added because it is needed for spreading the data again and show each variable according to their month.Finally, all the rows are selected except the id generated for grouping purposes.

nyc_climate <- nyc_climate %>% 
gather("Months","Temperature",2:14) %>% 
  group_by(Month) %>% 
  mutate(id=row_number()) %>% 
  spread(Month,Temperature) %>% 
  arrange(id) %>% 
  select(-id)

In this part the code iterates over the columns of the dataframe in order to separate the Fahrenheit and Celsius values (in parenthesis)from each field and rows that contains such values.Then, it generates the column headers for the new columns according the name extracted from the original column and name the accordingly.Finally, the code slices the dataframe ignoring the last row since it is not needed for calculations

headers <- nyc_climate %>%  names()
nyc_clFinal <- nyc_climate 
fields_to_rm <- c()

for ( i in 1:length(names(nyc_climate))) {
  
  if (str_detect(headers[i],"(.*)(?=[^a-zA-Z][F])")) {
    
    fields_to_rm <- c(fields_to_rm,i)
    
    celsius <- nyc_climate[[i]] %>% str_extract("(?<=\\()(.*)(?=\\))") %>% 
      str_replace(".*\\−","-")
    fahrenh <- nyc_climate[[i]] %>%  str_extract(".*(?=\\()") %>% 
      str_replace(".*\\−","-")
    
    title <- str_extract(headers[i],"(.*)(?=[^a-zA-Z][F])")

    nyc_clFinal <- nyc_clFinal %>% 
      mutate(celsius_f=celsius, fahrenh_f=fahrenh)
    
    leng <- length(names(nyc_clFinal))
    
    names(nyc_clFinal)[c(leng-1,leng)] <-  c(paste(title,"Celsius"),paste(title,"Fahrenh"))
    
  }  

}
nyc_clFinal <- nyc_clFinal %>% slice(1:n()-1) %>% 
  select(-fields_to_rm)

Final Dataset Display

In this dataset, temperature related variable are split from its original two in one field format. However, variables like “Average Precipitation” where left as is since will not be used for the analysis.

Climate data for Albany International Airport, New York (1981–2010 normals, extremes 1874–present)
Months	Average precipitation days (≥ 0.01 in)	Average precipitation inches (mm)	Average relative humidity (%)	Average snowfall inches (cm)	Average snowy days (≥ 0.1 in)	Mean monthly sunshine hours	Percent possible sunshine	Average high Celsius	Average high Fahrenh	Average low Celsius	Average low Fahrenh	Daily mean Celsius	Daily mean Fahrenh	Mean maximum Celsius	Mean maximum Fahrenh	Mean minimum Celsius	Mean minimum Fahrenh	Record high Celsius	Record high Fahrenh	Record low Celsius	Record low Fahrenh
Jan	12.8	2.59(66)	71.1	17.9(45)	10.3	141.1	48	-0.8	30.6	-9.7	14.5	-5.2	22.6	11.6	52.8	-21.7	-7.1	22	71	-33	-28
Feb	10.4	2.20(56)	68.5	12.2(31)	7.6	158.5	54	1.4	34.6	-8.2	17.3	-3.4	25.9	11.9	53.5	-19.3	-2.8	23	74	-30	-22
Mar	12.1	3.21(82)	64.8	11.0(28)	5.6	200.3	54	6.9	44.4	-3.5	25.7	1.7	35.0	20.1	68.2	-14.2	6.5	32	89	-29	-21
Apr	11.9	3.17(81)	61.2	2.3(5.8)	1.2	218.9	54	14.6	58.3	2.9	37.3	8.8	47.8	27.6	81.6	-4.9	23.2	34	93	-13	9
May	13.1	3.61(92)	65.5	0.1(0.25)	0.1	248.9	55	20.8	69.4	8.4	47.1	14.6	58.3	30.4	86.7	0.4	32.7	36	97	-3	26
Jun	12.2	3.79(96)	69.5	0(0)	0	262.2	57	25.5	77.9	13.6	56.5	19.6	67.2	32.8	91.1	5.6	42.0	38	100	2	35
Jul	10.8	4.12(105)	70.5	0(0)	0	289.2	62	27.9	82.3	16.3	61.4	22.1	71.8	33.5	92.3	9.7	49.5	40	104	4	40
Aug	10.7	3.46(88)	74.1	0(0)	0	253.2	59	26.9	80.4	15.5	59.9	21.2	70.1	32.9	91.2	8.1	46.5	39	102	1	34
Sep	9.8	3.30(84)	75.7	0(0)	0	210.5	56	22.3	72.2	10.9	51.6	16.6	61.9	30.0	86.0	2.3	36.1	38	100	-4	24
Oct	10.4	3.68(93)	72.4	0.3(0.76)	0.1	168.8	49	15.4	59.8	4.2	39.6	9.8	49.7	24.9	76.8	-3.4	25.8	33	91	-9	16
Nov	11.7	3.29(84)	73.1	2.8(7.1)	2.5	100.7	34	8.8	47.9	-0.3	31.5	4.3	39.7	20.1	68.1	-8.9	15.9	28	82	-24	-11
Dec	11.9	2.93(74)	73.9	13.7(35)	7.4	108.3	38	2.1	35.8	-6.0	21.2	-1.9	28.5	12.7	54.8	-16.9	1.6	22	72	-30	-22

Save as a CSV File

write.csv(nyc_clFinal,file = "climate_data.csv")

Analysis

This sections shows the average celsius temperature.

According to the graph, during the last years, January has been the coldest month with temperatures under 0 whereas July has been the hottest month with temperature above 27 grades celsius.

pl <- nyc_clFinal %>% 
ggplot(aes(y = as.numeric(nyc_clFinal$`Average high  Celsius`),x = .$Months))+geom_bar(stat = "identity")+coord_flip()+labs(title = "Average High Monthly Celsius Temperatures 1.1")+xlab(label = "Months")+ylab(label = "Temperature in celsius")
pl+scale_y_continuous(limits = c(-1,30))

Since temperature decreases during the night, lets check how many hours of light a particular month had. Assuming that less light means more lower temperatures.

As we can compare with the graph 1.1, months like November, December and January which that are supposed to receive less sun light are the more cold. However, thats not always the case. The next graph shows the average probability of sunshine for arch month during the last years.

On average, the average probability of possible sunshine was almost uniform across the years.

On the other hand, according to the probabilities, more hours were expected for each month. Lets calculate those hours.

sun_per <- as.numeric(nyc_clFinal$`Percent possible sunshine`)/100

days <- c(31,28,31,30,31,30,31,31,30,31,30,31) # days for months
 
#multiply the number of day for each month with its corresponding probability
nyc_clFinal <- nyc_clFinal %>% mutate(days,sun_per) %>% 
                mutate(expected_hours=(.$sun_per*.$days)*24)

nyc_clFinal %>% 
ggplot(aes(y = as.numeric(.$expected_hours),x = .$Months))+geom_bar(stat = "identity")+labs(title = "Expected Sun hours by month")+xlab(label = "Months")+ylab(label = "Sun Hours Probability")+coord_flip()

As Compared to graph 1.3, months like november that had about 37 hours of ligth were expecting about 230 hours of light as would expected from its initial probability and months like July that had about 290 hours of light was expecting about 450 hours of light.

Finally, lets calculate the proportion of expected hours of sun for those years.

(sum(as.numeric(nyc_clFinal$`Mean monthly sunshine hours`)))/(sum(as.numeric(nyc_clFinal$expected_hours)))*100

## [1] 52.16012

As shown in the calculations, about the doble of sun light of what we received was expected.

Conclusion

In conclusion, we can visually reaffirm that months that receive less light tent to be coldest whereas the . months that receive more light are the hotter. On the other hand, the probabilities of having or not having long hours of light accross the years where uniform; however, as compared to the expected hours of light, we were supposed to receive at least the double of sun light. One idea the comes to my mind is that one of the reasons for this beyond simple chance, is the effect of globlal warming as product of gasses released in the atmosphere which blocks the percentage of sun light we are supposed to receive.