Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
Perform the analysis requested in the discussion item. Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
For This dataset, I am reading the information from a web table located in wikipedia. Using the html_node function for reading the tables and then choosing the 3rd table which is the one we are interested in. Finally, since the headers where presented as a row in the web table, the setNames is setting the second row as a header and then slice a row that was displayed as a footer for the table.
nyc_climate <- read_html("https://en.wikipedia.org/wiki/Climate_of_New_York",encoding = "UTF-8") %>%
html_nodes("table") %>%
.[[3]] %>%
html_table(header = FALSE) %>%
setNames(.[2,]) %>%
slice(3:(n()-1))
Original Dataframe Display
| Month | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Record high °F (°C) | 71(22) | 74(23) | 89(32) | 93(34) | 97(36) | 100(38) | 104(40) | 102(39) | 100(38) | 91(33) | 82(28) | 72(22) | 104(40) |
| Mean maximum °F (°C) | 52.8(11.6) | 53.5(11.9) | 68.2(20.1) | 81.6(27.6) | 86.7(30.4) | 91.1(32.8) | 92.3(33.5) | 91.2(32.9) | 86.0(30.0) | 76.8(24.9) | 68.1(20.1) | 54.8(12.7) | 94.3(34.6) |
| Average high °F (°C) | 30.6(−0.8) | 34.6(1.4) | 44.4(6.9) | 58.3(14.6) | 69.4(20.8) | 77.9(25.5) | 82.3(27.9) | 80.4(26.9) | 72.2(22.3) | 59.8(15.4) | 47.9(8.8) | 35.8(2.1) | 57.9(14.4) |
| Daily mean °F (°C) | 22.6(−5.2) | 25.9(−3.4) | 35.0(1.7) | 47.8(8.8) | 58.3(14.6) | 67.2(19.6) | 71.8(22.1) | 70.1(21.2) | 61.9(16.6) | 49.7(9.8) | 39.7(4.3) | 28.5(−1.9) | 48.2(9.0) |
| Average low °F (°C) | 14.5(−9.7) | 17.3(−8.2) | 25.7(−3.5) | 37.3(2.9) | 47.1(8.4) | 56.5(13.6) | 61.4(16.3) | 59.9(15.5) | 51.6(10.9) | 39.6(4.2) | 31.5(−0.3) | 21.2(−6.0) | 38.7(3.7) |
| Mean minimum °F (°C) | −7.1(−21.7) | −2.8(−19.3) | 6.5(−14.2) | 23.2(−4.9) | 32.7(0.4) | 42.0(5.6) | 49.5(9.7) | 46.5(8.1) | 36.1(2.3) | 25.8(−3.4) | 15.9(−8.9) | 1.6(−16.9) | −9.6(−23.1) |
| Record low °F (°C) | −28(−33) | −22(−30) | −21(−29) | 9(−13) | 26(−3) | 35(2) | 40(4) | 34(1) | 24(−4) | 16(−9) | −11(−24) | −22(−30) | −28(−33) |
| Average precipitation inches (mm) | 2.59(66) | 2.20(56) | 3.21(82) | 3.17(81) | 3.61(92) | 3.79(96) | 4.12(105) | 3.46(88) | 3.30(84) | 3.68(93) | 3.29(84) | 2.93(74) | 39.35(999) |
| Average snowfall inches (cm) | 17.9(45) | 12.2(31) | 11.0(28) | 2.3(5.8) | 0.1(0.25) | 0(0) | 0(0) | 0(0) | 0(0) | 0.3(0.76) | 2.8(7.1) | 13.7(35) | 60.3(153) |
| Average precipitation days (≥ 0.01 in) | 12.8 | 10.4 | 12.1 | 11.9 | 13.1 | 12.2 | 10.8 | 10.7 | 9.8 | 10.4 | 11.7 | 11.9 | 137.8 |
| Average snowy days (≥ 0.1 in) | 10.3 | 7.6 | 5.6 | 1.2 | 0.1 | 0 | 0 | 0 | 0 | 0.1 | 2.5 | 7.4 | 34.8 |
| Average relative humidity (%) | 71.1 | 68.5 | 64.8 | 61.2 | 65.5 | 69.5 | 70.5 | 74.1 | 75.7 | 72.4 | 73.1 | 73.9 | 70.0 |
| Mean monthly sunshine hours | 141.1 | 158.5 | 200.3 | 218.9 | 248.9 | 262.2 | 289.2 | 253.2 | 210.5 | 168.8 | 100.7 | 108.3 | 2,360.6 |
| Percent possible sunshine | 48 | 54 | 54 | 54 | 55 | 57 | 62 | 59 | 56 | 49 | 34 | 38 | 53 |
The first step to process the dataset is to convert all the months presented in the column as a row. Then the temperature field will store the corresponding values for each month. Then the months are grouped according to their respective characteristics (variables like Record High Temperature, Mean Minimum, etc.. ) and then a row that works as an index for the new dataframe is added because it is needed for spreading the data again and show each variable according to their month.Finally, all the rows are selected except the id generated for grouping purposes.
nyc_climate <- nyc_climate %>%
gather("Months","Temperature",2:14) %>%
group_by(Month) %>%
mutate(id=row_number()) %>%
spread(Month,Temperature) %>%
arrange(id) %>%
select(-id)
In this part the code iterates over the columns of the dataframe in order to separate the Fahrenheit and Celsius values (in parenthesis)from each field and rows that contains such values.Then, it generates the column headers for the new columns according the name extracted from the original column and name the accordingly.Finally, the code slices the dataframe ignoring the last row since it is not needed for calculations
headers <- nyc_climate %>% names()
nyc_clFinal <- nyc_climate
fields_to_rm <- c()
for ( i in 1:length(names(nyc_climate))) {
if (str_detect(headers[i],"(.*)(?=[^a-zA-Z][F])")) {
fields_to_rm <- c(fields_to_rm,i)
celsius <- nyc_climate[[i]] %>% str_extract("(?<=\\()(.*)(?=\\))") %>%
str_replace(".*\\−","-")
fahrenh <- nyc_climate[[i]] %>% str_extract(".*(?=\\()") %>%
str_replace(".*\\−","-")
title <- str_extract(headers[i],"(.*)(?=[^a-zA-Z][F])")
nyc_clFinal <- nyc_clFinal %>%
mutate(celsius_f=celsius, fahrenh_f=fahrenh)
leng <- length(names(nyc_clFinal))
names(nyc_clFinal)[c(leng-1,leng)] <- c(paste(title,"Celsius"),paste(title,"Fahrenh"))
}
}
nyc_clFinal <- nyc_clFinal %>% slice(1:n()-1) %>%
select(-fields_to_rm)
| Months | Average precipitation days (≥ 0.01 in) | Average precipitation inches (mm) | Average relative humidity (%) | Average snowfall inches (cm) | Average snowy days (≥ 0.1 in) | Mean monthly sunshine hours | Percent possible sunshine | Average high Celsius | Average high Fahrenh | Average low Celsius | Average low Fahrenh | Daily mean Celsius | Daily mean Fahrenh | Mean maximum Celsius | Mean maximum Fahrenh | Mean minimum Celsius | Mean minimum Fahrenh | Record high Celsius | Record high Fahrenh | Record low Celsius | Record low Fahrenh |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Jan | 12.8 | 2.59(66) | 71.1 | 17.9(45) | 10.3 | 141.1 | 48 | -0.8 | 30.6 | -9.7 | 14.5 | -5.2 | 22.6 | 11.6 | 52.8 | -21.7 | -7.1 | 22 | 71 | -33 | -28 |
| Feb | 10.4 | 2.20(56) | 68.5 | 12.2(31) | 7.6 | 158.5 | 54 | 1.4 | 34.6 | -8.2 | 17.3 | -3.4 | 25.9 | 11.9 | 53.5 | -19.3 | -2.8 | 23 | 74 | -30 | -22 |
| Mar | 12.1 | 3.21(82) | 64.8 | 11.0(28) | 5.6 | 200.3 | 54 | 6.9 | 44.4 | -3.5 | 25.7 | 1.7 | 35.0 | 20.1 | 68.2 | -14.2 | 6.5 | 32 | 89 | -29 | -21 |
| Apr | 11.9 | 3.17(81) | 61.2 | 2.3(5.8) | 1.2 | 218.9 | 54 | 14.6 | 58.3 | 2.9 | 37.3 | 8.8 | 47.8 | 27.6 | 81.6 | -4.9 | 23.2 | 34 | 93 | -13 | 9 |
| May | 13.1 | 3.61(92) | 65.5 | 0.1(0.25) | 0.1 | 248.9 | 55 | 20.8 | 69.4 | 8.4 | 47.1 | 14.6 | 58.3 | 30.4 | 86.7 | 0.4 | 32.7 | 36 | 97 | -3 | 26 |
| Jun | 12.2 | 3.79(96) | 69.5 | 0(0) | 0 | 262.2 | 57 | 25.5 | 77.9 | 13.6 | 56.5 | 19.6 | 67.2 | 32.8 | 91.1 | 5.6 | 42.0 | 38 | 100 | 2 | 35 |
| Jul | 10.8 | 4.12(105) | 70.5 | 0(0) | 0 | 289.2 | 62 | 27.9 | 82.3 | 16.3 | 61.4 | 22.1 | 71.8 | 33.5 | 92.3 | 9.7 | 49.5 | 40 | 104 | 4 | 40 |
| Aug | 10.7 | 3.46(88) | 74.1 | 0(0) | 0 | 253.2 | 59 | 26.9 | 80.4 | 15.5 | 59.9 | 21.2 | 70.1 | 32.9 | 91.2 | 8.1 | 46.5 | 39 | 102 | 1 | 34 |
| Sep | 9.8 | 3.30(84) | 75.7 | 0(0) | 0 | 210.5 | 56 | 22.3 | 72.2 | 10.9 | 51.6 | 16.6 | 61.9 | 30.0 | 86.0 | 2.3 | 36.1 | 38 | 100 | -4 | 24 |
| Oct | 10.4 | 3.68(93) | 72.4 | 0.3(0.76) | 0.1 | 168.8 | 49 | 15.4 | 59.8 | 4.2 | 39.6 | 9.8 | 49.7 | 24.9 | 76.8 | -3.4 | 25.8 | 33 | 91 | -9 | 16 |
| Nov | 11.7 | 3.29(84) | 73.1 | 2.8(7.1) | 2.5 | 100.7 | 34 | 8.8 | 47.9 | -0.3 | 31.5 | 4.3 | 39.7 | 20.1 | 68.1 | -8.9 | 15.9 | 28 | 82 | -24 | -11 |
| Dec | 11.9 | 2.93(74) | 73.9 | 13.7(35) | 7.4 | 108.3 | 38 | 2.1 | 35.8 | -6.0 | 21.2 | -1.9 | 28.5 | 12.7 | 54.8 | -16.9 | 1.6 | 22 | 72 | -30 | -22 |
Save as a CSV File
write.csv(nyc_clFinal,file = "climate_data.csv")
This sections shows the average celsius temperature.
According to the graph, during the last years, January has been the coldest month with temperatures under 0 whereas July has been the hottest month with temperature above 27 grades celsius.
pl <- nyc_clFinal %>%
ggplot(aes(y = as.numeric(nyc_clFinal$`Average high Celsius`),x = .$Months))+geom_bar(stat = "identity")+coord_flip()+labs(title = "Average High Monthly Celsius Temperatures 1.1")+xlab(label = "Months")+ylab(label = "Temperature in celsius")
pl+scale_y_continuous(limits = c(-1,30))
Since temperature decreases during the night, lets check how many hours of light a particular month had. Assuming that less light means more lower temperatures.
As we can compare with the graph 1.1, months like November, December and January which that are supposed to receive less sun light are the more cold. However, thats not always the case. The next graph shows the average probability of sunshine for arch month during the last years.
On average, the average probability of possible sunshine was almost uniform across the years.
On the other hand, according to the probabilities, more hours were expected for each month. Lets calculate those hours.
sun_per <- as.numeric(nyc_clFinal$`Percent possible sunshine`)/100
days <- c(31,28,31,30,31,30,31,31,30,31,30,31) # days for months
#multiply the number of day for each month with its corresponding probability
nyc_clFinal <- nyc_clFinal %>% mutate(days,sun_per) %>%
mutate(expected_hours=(.$sun_per*.$days)*24)
nyc_clFinal %>%
ggplot(aes(y = as.numeric(.$expected_hours),x = .$Months))+geom_bar(stat = "identity")+labs(title = "Expected Sun hours by month")+xlab(label = "Months")+ylab(label = "Sun Hours Probability")+coord_flip()
As Compared to graph 1.3, months like november that had about 37 hours of ligth were expecting about 230 hours of light as would expected from its initial probability and months like July that had about 290 hours of light was expecting about 450 hours of light.
Finally, lets calculate the proportion of expected hours of sun for those years.
(sum(as.numeric(nyc_clFinal$`Mean monthly sunshine hours`)))/(sum(as.numeric(nyc_clFinal$expected_hours)))*100
## [1] 52.16012
As shown in the calculations, about the doble of sun light of what we received was expected.
In conclusion, we can visually reaffirm that months that receive less light tent to be coldest whereas the . months that receive more light are the hotter. On the other hand, the probabilities of having or not having long hours of light accross the years where uniform; however, as compared to the expected hours of light, we were supposed to receive at least the double of sun light. One idea the comes to my mind is that one of the reasons for this beyond simple chance, is the effect of globlal warming as product of gasses released in the atmosphere which blocks the percentage of sun light we are supposed to receive.