In this assignment, the information was downloaded in its naive form from the the NOAA website, after wrangling the data, the information demonstrated that tornadoes has had the great health and economic impact across 1950-2010.
library(tidyverse)
library(readxl)
library(data.table)
library(sqldf)
options(scipen = 10)
Weather_Url <-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
temp <-tempfile()
download.file(Weather_Url, destfile = temp, mode="wb")
Weather_data <- read.csv(temp)
Unfortunately, most data is not clean enough for data analysis after upload , and the NOAA data is no different.The column names were not always discernible from their names, recoding was neccessary. There are inconsistencies in the The Evnt_Type inputs and the Crop/Property Damage_Exp(onents). These has to be addressed because approximately 400,000 records were assigned a blank for the Exp variable.
colnames(Weather_data)<-c(
"State_ID_Number",
"Begin_Date",
"Begin_Time",
"Time_Zone",
"County",
"County_Name",
"State",
"Event_Type",
"Begin_Range",
"Begin_Azimuth",
"Begin_Location",
"End_Date",
"End_Time",
"County_end",
"County_Ending",
"End_Range",
"End_Azimuth",
"End_Location",
"Length",
"Width",
"Fujita_Scale(WindSpeed)",
"Magnitude",
"Fatalities",
"Injuries",
"Property_Damage",
"Property_Damage_Exp",
"Crop_Damage",
"Crop_Damage_Exp",
"WFO",
"State_Office",
"Zone_Names",
"Latitute",
"Longitue",
"Latitude_e",
"Longitude_e",
"Remarks",
"Reference_Number"
)
Exp_Idenitifiers <-unique(Weather_data$Property_Damage_Exp)
Exp_Idenitifiers <-as.data.frame(Exp_Idenitifiers)
Exp_number <-c(1e3,1e6, 1e0,1e9,1e6, NA,0, 1e5, 1e6,NA,1e4, 1e2, 1e3,1e2, 1e7, 1e2,NA,1e1,1e8)
Exp_Idenitifiers<-cbind(Exp_Idenitifiers, Exp_number)
colnames(Exp_Idenitifiers) <-c("Property_Damage_Exp", "Exp_number")
Some assumptions had to be made with these observations. First, there are both character entries and numerical entries. The assumption is that the number represents the 1eX, with the number entry being equal to X. Second, the H, K, M and B represent hundred(1e2), thousand(1e3), million(1e6), and billion (1e9) respectively. Last, the blank or absence of an entry was assumed to mean 1e0, for it were left out it would be a lost of nearly 49.8024486 % of the records; however, with the ? and + are unknowable but represented only 0.0013299 % lost of records.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.6 v dplyr 1.0.4
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Weather_data_2 <-Weather_data%>%
mutate_all(., toupper)
types_used <-unique(Weather_data_2$Event_Type)%>%
sort()
The Ent_Type which was converted to the name Event_Type for better clarity needed to be cleaned and all entries were converted to uppercase. There were 898 unique entries but the guide only listed 44 listed in the Storm documentation.
Cleaning data can be and in this instance a lengthy process with the cleaning of about 97% of the records. The cleaning process included everything from removal of digits form the tornados/hurricanes/thunderstorms, correcting spelling, and collapsing events to a single entry. Because this is such a lengthy process, I will only show an example of each in this document; however, I am creating a part II, if you are interested in seeing it all.
library(tidyr)
Weather_data_2$Event_Type <-gsub('[[:digit:]]+','',Weather_data_2$Event_Type)
Weather_data_2$Event_Type[grepl("AVALANCE", Weather_data_2$Event_Type, ignore.case=FALSE)] <- "AVALANCHE"
Weather_data_2$Event_Type[grepl("URBAN", Weather_data_2$Event_Type, ignore.case=FALSE)] <- "URBAN FLOODING"
Weather_data_2$Event_Type[grepl("COLD AND SNOW", Weather_data_2$Event_Type, ignore.case=FALSE)] <- "WINTER WEATHER"
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
Weather_data_2<-data.table(Weather_data_2)
Next, an event classifier was added to dataframe; this portion of code will not be shown as it is not part of this assignment.The reasoning behind this will be included in Part II of this document.
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
Event_Classifier_DT<- data.table(Flooding, #and call it DT
Cold_Related,
Heat_Related,
Seismic_Activity,
Marine_Atmospheric_Activity,
Non_Marine_Atmospheric_Activity,
Other)
library(tidyr)
Event_Classifier<- pivot_longer(Event_Classifier_DT,cols = 1:7, names_to = "Event_Classifier", values_to = "Event_Type")
Event_Classifier<-Event_Classifier%>%
unique()
Event_Classifier<-Event_Classifier%>%
mutate_all(.,toupper )
Event_Classifier <-data.table(Event_Classifier)
Weather_data <-sqldf("SELECT *
FROM Weather_data_2
JOIN Event_Classifier USING(Event_Type)")
Next, the Begin_Dates were split into Day, Month, Year, and Decade. This was acheived by using r separate and r lubridate.
Weather_data_2 <-Weather_data_2%>%
separate(Begin_Date, c("Begin_Month", "Begin_Day", "Begin_Year", "Zero_Time"))
Weather_data_2 <-Weather_data_2%>%
separate(End_Date, c("End_Month", "End_Day", "End_Year", "Zero_End_Time"))
Updated_Weather_Data <-Weather_data_2%>%
select(-Begin_Range, -Begin_Azimuth,-End_Azimuth,-WFO,-State_Office, -Zone_Names,
-Latitude_e, -Longitude_e, -Remarks)
Decade_Summary_Weather_data<-Updated_Weather_Data%>%
mutate(decade = floor(as.numeric(Begin_Year)/10)*10)
Decade_Fatalities<-Decade_Summary_Weather_data%>%
group_by(decade, Event_Type)%>%
select(decade, Event_Type, Fatalities) %>%
summarize(Total_Fatalities=sum(as.numeric(Fatalities)))%>%
arrange(desc(Total_Fatalities))
Decade_Injuries<-Decade_Summary_Weather_data%>%
group_by(decade, Event_Type)%>%
select( decade, Event_Type, Injuries) %>%
summarize(Total_Injuries=sum(as.numeric(Injuries)))%>%
arrange(desc(Total_Injuries))
Weather_Impact <-sqldf("Select *
FROM Decade_Fatalities
JOIN Decade_Injuries USING(decade,Event_Type)")
Weather_Impact_longer <-Weather_Impact%>%
pivot_longer(c("Total_Fatalities","Total_Injuries"), names_to="Health_Impact",values_to="Total_Number_of_Impacts")
Weather_Impact_longer <-Weather_Impact_longer%>%
filter(Total_Number_of_Impacts!=0)
Now Select from the top 10 Injuries and Fatalities based on the Decade and the year.
Top_10_Fatalities <- Weather_Impact_longer%>%
group_by(decade)%>%
filter(Health_Impact =="Total_Fatalities")%>%
arrange(desc(Total_Number_of_Impacts))%>%
slice(1:10)
Top_10_Injuries <- Weather_Impact_longer%>%
group_by(decade)%>%
filter(Health_Impact =="Total_Injuries")%>%
arrange(desc(Total_Number_of_Impacts))%>%
slice(1:10)
Top_10 <-rbind(Top_10_Fatalities, Top_10_Injuries)
Top_10 <-Top_10%>%
arrange(desc(Total_Number_of_Impacts))%>%
slice(1:10)
Top_10
## # A tibble: 41 x 4
## # Groups: decade [7]
## decade Event_Type Health_Impact Total_Number_of_Impacts
## <dbl> <chr> <chr> <dbl>
## 1 1950 TORNADO Total_Injuries 14470
## 2 1950 TORNADO Total_Fatalities 1419
## 3 1960 TORNADO Total_Injuries 17265
## 4 1960 TORNADO Total_Fatalities 942
## 5 1970 TORNADO Total_Injuries 21640
## 6 1970 TORNADO Total_Fatalities 998
## 7 1980 TORNADO Total_Injuries 11297
## 8 1980 THUNDERSTORM Total_Injuries 1935
## 9 1980 TORNADO Total_Fatalities 522
## 10 1980 HAIL Total_Injuries 222
## # ... with 31 more rows
Now, comes the fun part plotting the data. After the preprocessing, the data was Event Types were plotted using the fill as the decades. The reasoning behind this is that while one single even may have had more impact at that time point, how does the Weather Event affect public health and impact the economy/ agriculture in the long term.
library(data.table)
library(sqldf)
Decade_Property_damage <-Decade_Summary_Weather_data%>%
select(decade, Property_Damage, Property_Damage_Exp, Event_Type)
Decade_Property_damage <-data.table(Decade_Property_damage)
Exp_Idenitifiers<-data.table(Exp_Idenitifiers)
Decade_Property_damage_1 <- sqldf("SELECT *
FROM Decade_Property_damage
JOIN Exp_Idenitifiers USING(Property_Damage_Exp)")
Decade_Property_damage_2 <-Decade_Property_damage_1%>%
filter(Property_Damage!=0)
Decade_Property_damage_3 <-Decade_Property_damage_2%>%
mutate(New_Property_Damage=as.numeric(Property_Damage)*as.numeric(Exp_number))
Decade_Property_damage_3 <-Decade_Property_damage_3%>%
group_by(decade, Event_Type)%>%
summarise(Total_Property_Damage=sum(as.numeric(New_Property_Damage)))%>%
arrange(desc(Total_Property_Damage))
Decade_Property_damage_4 <-Decade_Property_damage_3%>%
slice(1:10)%>%
arrange(desc(Total_Property_Damage))
Decade_Property_damage_5 <-Decade_Property_damage_4%>%
slice(1)
Decade_Crop_damage<-Decade_Summary_Weather_data%>%
select(decade, Crop_Damage, Crop_Damage_Exp, Event_Type)%>%
filter(Crop_Damage !=0)
Decade_Crop_damage <-data.table(Decade_Crop_damage)
Crop_Exp_Idenitifiers <-unique(Decade_Crop_damage$Crop_Damage_Exp)
Crop_Idenitifiers <-as.data.frame(Crop_Exp_Idenitifiers)
Crop_Exp_number <-c(1e6, 1e3,1e9,1e0, NA)
Crop_Exp_Idenitifiers<-cbind(Crop_Exp_Idenitifiers, Crop_Exp_number)
colnames(Crop_Exp_Idenitifiers) <-c("Crop_Damage_Exp","Crop_Exp_Number")
Crop_Exp_Idenitifiers <-data.table(Crop_Exp_Idenitifiers)
Decade_Crop_damage_1 <- sqldf("SELECT *
FROM Decade_Crop_damage
JOIN Crop_Exp_Idenitifiers
USING(Crop_Damage_Exp)")
Decade_Crop_damage_2<-Decade_Crop_damage_1%>%
mutate(Crop_Damage=as.numeric(Crop_Damage)*as.numeric(Crop_Exp_Number))%>%
group_by( decade, Event_Type)%>%
summarize(Total_Crop_Damage=sum(as.numeric(Crop_Damage)))%>%
arrange(desc(Total_Crop_Damage))%>%
slice(1:5)
## `summarise()` has grouped output by 'decade'. You can override using the `.groups` argument.
Decade_Crop_damage_2
## # A tibble: 15 x 3
## # Groups: decade [3]
## decade Event_Type Total_Crop_Damage
## <dbl> <chr> <dbl>
## 1 1990 RIVER FLOOD 5029459000
## 2 1990 ICE STORM 5013448500
## 3 1990 DROUGHT 4836981000
## 4 1990 HURRICANE 2458440000
## 5 1990 FLOOD 1491979650
## 6 2000 DROUGHT 9098504000
## 7 2000 HURRICANE 3046352800
## 8 2000 FLOOD 2962475400
## 9 2000 HAIL 1600522200
## 10 2000 FROST/FREEZE 726436000
## 11 2010 FLOOD 1259581000
## 12 2010 FROST/FREEZE 367650000
## 13 2010 FLASH FLOOD 241127000
## 14 2010 HAIL 184314000
## 15 2010 THUNDERSTORM WIND 150644000
The analysis shows that agriculture is mostly impacted by Crop_damage
library(RColorBrewer)
library(tidyverse)
fig3 <-ggplot(Decade_Property_damage_4, aes(x=Event_Type, y=Total_Property_Damage, fill=decade))+
geom_bar(stat="identity", color="light green")+theme_classic()
fig3 <-fig3+ theme(axis.text = element_text( size=4))+
scale_y_log10()
fig3 +ggtitle("Top 10 Weather Events Resulting Most Property Damage from 1950-2011") +
xlab("Weather Event") + ylab("Total Dollars Lost")
fig5 <-ggplot(Decade_Crop_damage_2, aes(x=Event_Type, y=Total_Crop_Damage, fill=decade))+
geom_bar(stat="identity", color="light green")+theme_classic()
fig5<-fig5+ theme(axis.text = element_text( size=4))+
scale_y_log10()
fig5+ggtitle("Top 10 Weather Events Resulting Most Crop Damage from 1950-2011") +
xlab("Weather Event") + ylab("Total Dollars Lost")
fig <- ggplot(data=Top_10, aes(x=Event_Type,y=Total_Number_of_Impacts, fill=Health_Impact))+
geom_bar(stat="identity") +
scale_fill_brewer(palette="Dark1")+
theme_classic()+
ggtitle("Top 10 Weather Events Resulting in Fatalities from 1950-2011") +
xlab("Weather Event") + ylab("Number of Fatalities")
fig+ theme(axis.text = element_text(size=4))
Finally, after all the analysis it can be established the greatest property damage was most attributable to flooding from 1950-2011, causing about 123 trillion dollars of damage. With Crop damage Whether this is due to Global Climate change, more data is needed. As for the greatest impact on health torndaoes cause more lives lost, injuries, and property damage than any other weather event.