Introduction:

In this assignment, the information was downloaded in its naive form from the the NOAA website, after wrangling the data, the information demonstrated that tornadoes has had the great health and economic impact across 1950-2010.

Downloading the data:

library(tidyverse)
library(readxl)
library(data.table)
library(sqldf)  
options(scipen = 10)


Weather_Url <-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
temp <-tempfile()
download.file(Weather_Url, destfile = temp, mode="wb")
Weather_data <- read.csv(temp)

Cleaning Data and Data Transformation

Unfortunately, most data is not clean enough for data analysis after upload , and the NOAA data is no different.The column names were not always discernible from their names, recoding was neccessary. There are inconsistencies in the The Evnt_Type inputs and the Crop/Property Damage_Exp(onents). These has to be addressed because approximately 400,000 records were assigned a blank for the Exp variable.

colnames(Weather_data)<-c(
        "State_ID_Number",
        "Begin_Date",
        "Begin_Time",
        "Time_Zone",
        "County",
        "County_Name",
        "State",     
        "Event_Type",
        "Begin_Range",
        "Begin_Azimuth",
        "Begin_Location",
        "End_Date",
        "End_Time",
        "County_end",
        "County_Ending",
        "End_Range",
        "End_Azimuth",
        "End_Location",
        "Length",
        "Width",
        "Fujita_Scale(WindSpeed)",         
        "Magnitude",
        "Fatalities",
        "Injuries",
        "Property_Damage",
        "Property_Damage_Exp",
        "Crop_Damage",
        "Crop_Damage_Exp",
        "WFO",
        "State_Office",
        "Zone_Names",
        "Latitute",
        "Longitue",
        "Latitude_e",
        "Longitude_e",
        "Remarks",
        "Reference_Number"
        )
Exp_Idenitifiers <-unique(Weather_data$Property_Damage_Exp)
Exp_Idenitifiers <-as.data.frame(Exp_Idenitifiers)
Exp_number <-c(1e3,1e6, 1e0,1e9,1e6, NA,0, 1e5, 1e6,NA,1e4, 1e2, 1e3,1e2, 1e7, 1e2,NA,1e1,1e8)
Exp_Idenitifiers<-cbind(Exp_Idenitifiers, Exp_number)
colnames(Exp_Idenitifiers) <-c("Property_Damage_Exp", "Exp_number")

Some assumptions had to be made with these observations. First, there are both character entries and numerical entries. The assumption is that the number represents the 1eX, with the number entry being equal to X. Second, the H, K, M and B represent hundred(1e2), thousand(1e3), million(1e6), and billion (1e9) respectively. Last, the blank or absence of an entry was assumed to mean 1e0, for it were left out it would be a lost of nearly 49.8024486 % of the records; however, with the ? and + are unknowable but represented only 0.0013299 % lost of records.

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.6     v dplyr   1.0.4
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
Weather_data_2 <-Weather_data%>%
  mutate_all(., toupper)

types_used <-unique(Weather_data_2$Event_Type)%>%
  sort()

The Ent_Type which was converted to the name Event_Type for better clarity needed to be cleaned and all entries were converted to uppercase. There were 898 unique entries but the guide only listed 44 listed in the Storm documentation.

Cleaning data can be and in this instance a lengthy process with the cleaning of about 97% of the records. The cleaning process included everything from removal of digits form the tornados/hurricanes/thunderstorms, correcting spelling, and collapsing events to a single entry. Because this is such a lengthy process, I will only show an example of each in this document; however, I am creating a part II, if you are interested in seeing it all.

library(tidyr)
Weather_data_2$Event_Type <-gsub('[[:digit:]]+','',Weather_data_2$Event_Type)
Weather_data_2$Event_Type[grepl("AVALANCE", Weather_data_2$Event_Type, ignore.case=FALSE)] <- "AVALANCHE"
Weather_data_2$Event_Type[grepl("URBAN", Weather_data_2$Event_Type, ignore.case=FALSE)] <- "URBAN FLOODING"
Weather_data_2$Event_Type[grepl("COLD AND SNOW", Weather_data_2$Event_Type, ignore.case=FALSE)] <- "WINTER WEATHER"
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
Weather_data_2<-data.table(Weather_data_2)

Next, an event classifier was added to dataframe; this portion of code will not be shown as it is not part of this assignment.The reasoning behind this will be included in Part II of this document.

library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
Event_Classifier_DT<- data.table(Flooding, #and call it DT
                 Cold_Related,
                 Heat_Related,
                 Seismic_Activity,
                 Marine_Atmospheric_Activity,
                 Non_Marine_Atmospheric_Activity,
                 Other)
library(tidyr)
Event_Classifier<- pivot_longer(Event_Classifier_DT,cols = 1:7, names_to = "Event_Classifier", values_to = "Event_Type")

Event_Classifier<-Event_Classifier%>%
  unique()
  
Event_Classifier<-Event_Classifier%>% 
  mutate_all(.,toupper )


Event_Classifier <-data.table(Event_Classifier)

Weather_data <-sqldf("SELECT *
                      FROM Weather_data_2
                      JOIN Event_Classifier USING(Event_Type)")

Next, the Begin_Dates were split into Day, Month, Year, and Decade. This was acheived by using r separate and r lubridate.

Weather_data_2 <-Weather_data_2%>%
  separate(Begin_Date, c("Begin_Month", "Begin_Day", "Begin_Year", "Zero_Time"))

Weather_data_2 <-Weather_data_2%>%
  separate(End_Date, c("End_Month", "End_Day", "End_Year", "Zero_End_Time"))
Updated_Weather_Data <-Weather_data_2%>%
  select(-Begin_Range, -Begin_Azimuth,-End_Azimuth,-WFO,-State_Office, -Zone_Names,
         -Latitude_e, -Longitude_e, -Remarks)


Decade_Summary_Weather_data<-Updated_Weather_Data%>% 
  mutate(decade = floor(as.numeric(Begin_Year)/10)*10) 
Decade_Fatalities<-Decade_Summary_Weather_data%>%
  group_by(decade, Event_Type)%>%
  select(decade, Event_Type, Fatalities) %>%
  summarize(Total_Fatalities=sum(as.numeric(Fatalities)))%>%
  arrange(desc(Total_Fatalities))

Decade_Injuries<-Decade_Summary_Weather_data%>%
  group_by(decade, Event_Type)%>%
  select( decade, Event_Type, Injuries) %>%
  summarize(Total_Injuries=sum(as.numeric(Injuries)))%>%
   arrange(desc(Total_Injuries))
  Weather_Impact <-sqldf("Select *
                       FROM Decade_Fatalities
                        JOIN  Decade_Injuries USING(decade,Event_Type)")

Weather_Impact_longer <-Weather_Impact%>%
  pivot_longer(c("Total_Fatalities","Total_Injuries"), names_to="Health_Impact",values_to="Total_Number_of_Impacts")
  
Weather_Impact_longer <-Weather_Impact_longer%>%
  filter(Total_Number_of_Impacts!=0)

Now Select from the top 10 Injuries and Fatalities based on the Decade and the year.

Top_10_Fatalities <- Weather_Impact_longer%>%
  group_by(decade)%>%
  filter(Health_Impact =="Total_Fatalities")%>%
  arrange(desc(Total_Number_of_Impacts))%>%
  slice(1:10)
Top_10_Injuries <- Weather_Impact_longer%>%
  group_by(decade)%>%
  filter(Health_Impact =="Total_Injuries")%>%
  arrange(desc(Total_Number_of_Impacts))%>%
  slice(1:10)
Top_10 <-rbind(Top_10_Fatalities, Top_10_Injuries)
  
Top_10 <-Top_10%>%
  arrange(desc(Total_Number_of_Impacts))%>%
  slice(1:10)
  
Top_10
## # A tibble: 41 x 4
## # Groups:   decade [7]
##    decade Event_Type   Health_Impact    Total_Number_of_Impacts
##     <dbl> <chr>        <chr>                              <dbl>
##  1   1950 TORNADO      Total_Injuries                     14470
##  2   1950 TORNADO      Total_Fatalities                    1419
##  3   1960 TORNADO      Total_Injuries                     17265
##  4   1960 TORNADO      Total_Fatalities                     942
##  5   1970 TORNADO      Total_Injuries                     21640
##  6   1970 TORNADO      Total_Fatalities                     998
##  7   1980 TORNADO      Total_Injuries                     11297
##  8   1980 THUNDERSTORM Total_Injuries                      1935
##  9   1980 TORNADO      Total_Fatalities                     522
## 10   1980 HAIL         Total_Injuries                       222
## # ... with 31 more rows

Results:

Now, comes the fun part plotting the data. After the preprocessing, the data was Event Types were plotted using the fill as the decades. The reasoning behind this is that while one single even may have had more impact at that time point, how does the Weather Event affect public health and impact the economy/ agriculture in the long term.

library(data.table)
library(sqldf)

Decade_Property_damage <-Decade_Summary_Weather_data%>%
select(decade, Property_Damage, Property_Damage_Exp, Event_Type)
Decade_Property_damage <-data.table(Decade_Property_damage)
Exp_Idenitifiers<-data.table(Exp_Idenitifiers)

Decade_Property_damage_1 <- sqldf("SELECT *
                      FROM Decade_Property_damage
                      JOIN Exp_Idenitifiers USING(Property_Damage_Exp)")
Decade_Property_damage_2 <-Decade_Property_damage_1%>%
  filter(Property_Damage!=0)
Decade_Property_damage_3 <-Decade_Property_damage_2%>%
  mutate(New_Property_Damage=as.numeric(Property_Damage)*as.numeric(Exp_number))
Decade_Property_damage_3 <-Decade_Property_damage_3%>%
  group_by(decade, Event_Type)%>%
  summarise(Total_Property_Damage=sum(as.numeric(New_Property_Damage)))%>%
  arrange(desc(Total_Property_Damage))

Decade_Property_damage_4 <-Decade_Property_damage_3%>%
  slice(1:10)%>%
  arrange(desc(Total_Property_Damage))
Decade_Property_damage_5 <-Decade_Property_damage_4%>%
  slice(1)
Decade_Crop_damage<-Decade_Summary_Weather_data%>%
select(decade, Crop_Damage, Crop_Damage_Exp, Event_Type)%>%
filter(Crop_Damage !=0)
Decade_Crop_damage <-data.table(Decade_Crop_damage)
Crop_Exp_Idenitifiers <-unique(Decade_Crop_damage$Crop_Damage_Exp)
Crop_Idenitifiers <-as.data.frame(Crop_Exp_Idenitifiers)
Crop_Exp_number <-c(1e6, 1e3,1e9,1e0, NA)
Crop_Exp_Idenitifiers<-cbind(Crop_Exp_Idenitifiers, Crop_Exp_number)
colnames(Crop_Exp_Idenitifiers) <-c("Crop_Damage_Exp","Crop_Exp_Number")

Crop_Exp_Idenitifiers <-data.table(Crop_Exp_Idenitifiers)
Decade_Crop_damage_1 <- sqldf("SELECT *
                      FROM Decade_Crop_damage
                      JOIN Crop_Exp_Idenitifiers   
                      USING(Crop_Damage_Exp)")

Decade_Crop_damage_2<-Decade_Crop_damage_1%>%
 mutate(Crop_Damage=as.numeric(Crop_Damage)*as.numeric(Crop_Exp_Number))%>%
  group_by( decade, Event_Type)%>%
  summarize(Total_Crop_Damage=sum(as.numeric(Crop_Damage)))%>%
  arrange(desc(Total_Crop_Damage))%>%
  slice(1:5)
## `summarise()` has grouped output by 'decade'. You can override using the `.groups` argument.
Decade_Crop_damage_2
## # A tibble: 15 x 3
## # Groups:   decade [3]
##    decade Event_Type        Total_Crop_Damage
##     <dbl> <chr>                         <dbl>
##  1   1990 RIVER FLOOD              5029459000
##  2   1990 ICE STORM                5013448500
##  3   1990 DROUGHT                  4836981000
##  4   1990 HURRICANE                2458440000
##  5   1990 FLOOD                    1491979650
##  6   2000 DROUGHT                  9098504000
##  7   2000 HURRICANE                3046352800
##  8   2000 FLOOD                    2962475400
##  9   2000 HAIL                     1600522200
## 10   2000 FROST/FREEZE              726436000
## 11   2010 FLOOD                    1259581000
## 12   2010 FROST/FREEZE              367650000
## 13   2010 FLASH FLOOD               241127000
## 14   2010 HAIL                      184314000
## 15   2010 THUNDERSTORM WIND         150644000

The analysis shows that agriculture is mostly impacted by Crop_damage

library(RColorBrewer)
 library(tidyverse)
fig3 <-ggplot(Decade_Property_damage_4, aes(x=Event_Type, y=Total_Property_Damage, fill=decade))+
  geom_bar(stat="identity", color="light green")+theme_classic()
fig3 <-fig3+ theme(axis.text = element_text( size=4))+
  scale_y_log10()
 
fig3 +ggtitle("Top 10 Weather Events Resulting Most Property Damage from 1950-2011") +
  xlab("Weather Event") + ylab("Total Dollars Lost")

fig5 <-ggplot(Decade_Crop_damage_2, aes(x=Event_Type, y=Total_Crop_Damage, fill=decade))+
  geom_bar(stat="identity", color="light green")+theme_classic()
fig5<-fig5+ theme(axis.text = element_text( size=4))+
  scale_y_log10()
fig5+ggtitle("Top 10 Weather Events Resulting Most Crop Damage from 1950-2011") +
  xlab("Weather Event") + ylab("Total Dollars Lost")

 fig <- ggplot(data=Top_10, aes(x=Event_Type,y=Total_Number_of_Impacts, fill=Health_Impact))+
        geom_bar(stat="identity") +
      scale_fill_brewer(palette="Dark1")+
  theme_classic()+
  ggtitle("Top 10 Weather Events Resulting in Fatalities from 1950-2011") +
  xlab("Weather Event") + ylab("Number of Fatalities")
 fig+ theme(axis.text = element_text(size=4))

Conclusion:

Finally, after all the analysis it can be established the greatest property damage was most attributable to flooding from 1950-2011, causing about 123 trillion dollars of damage. With Crop damage Whether this is due to Global Climate change, more data is needed. As for the greatest impact on health torndaoes cause more lives lost, injuries, and property damage than any other weather event.