This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States. The events in the database start in the year 1950 and end in November 2011.
The data is available to download from the NOAA’s web site: https://www.ncdc.noaa.gov/cdo-web/datasets. The NOAA’s data becomes through the National Climatic Data Center (NCDC) that regularly receives Storm Data from the National Weather Service (NWS) approximately 60-90 days after the end of the data month.
Generally, the data comes from different sources such as county, state and federal emergency management officials, local law enforcement officials, sky-warn spotters, NWS damage surveys, newspapers services, insurance industry and general public. It is important stating that some information appearing in the Storm Data may be provided by or gathered from sources outside the National Weather Service (NWS).
An effort is made to use the best available information but because of time and resource constraints, information from these sources may be unverified by the NWS; therefore customers should be cautious as the NWS does not guarantee the accuracy or validity of the information.
The next libraries should be used in order to run the R code effectively:
library(dplyr) #Used for doing transformations of the data (in a tidy way)
library(tidyr) #A package that is also used for doing transformations
## Warning: package 'tidyr' was built under R version 3.4.4
library(lubridate) #Package needed to operate dates in R
library(ggplot2) #For doing graphics
library(lattice) #Base package for doing graphics
library(gridExtra) #For joining plots
Before loading the data, a zipped file must be downloaded and extracted in the same working directory used in R. The Zip file could be download from: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2. If everything went all right, there must be a new Excel file in CSV format called: repdata_data_StormData.csv.
We are assigning the results to a variable called “NOAA”:
NOAA <- read.csv("repdata_data_StormData.csv", header = TRUE)
str(NOAA) #describing the variable formats
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436774 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
head(NOAA) #describing the fist records to the dataset
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
Raw data is well constructed with rows and columns, so we then proceed in analyzing the research questions that have been asked.
In order to find an answer to this research question, it is important to select only the variables of interest in our data set. Since we are interested in studying in counting the number of human harms we need to select the next variables: FATALITIES and INJURIES. The variable EVTYPE contains the type of event registered in the NOAA’s dataset.
Once we have selected the interested variables, the first step is checking if there are any NAs values in our dataset:
NAS <- NOAA %>%
summarise(event=mean(is.na(EVTYPE)),
fatalities = mean(is.na(FATALITIES)),
injuries=mean(is.na(INJURIES))) %>%
select(event, fatalities, injuries)
print(NAS)
## event fatalities injuries
## 1 0 0 0
No NAs appear to be present in our selected variables. Our data has records starting from the year 1950 and ending in the year 2011. In earlier years technology wasn’t improved as in recent years and so we may find lack of useful data in order to be compared to recent years. For that, we need to select a sample from our dataset in which we could find it considerable to be used in our research questions.
NOAA.harmful <- NOAA %>%
select(BGN_DATE, EVTYPE, FATALITIES, INJURIES) %>% #selecting variables of interest
mutate(years = year( #mutating a new column which contains the 'year'
as.Date(sapply(
strsplit( #using strsplit function to separate
#and extracting only the 'years'
as.character(NOAA$BGN_DATE) ,
" ",fixed = TRUE),
"[[",1),
"%m/%d/%Y")))
head(NOAA.harmful, 3)
## BGN_DATE EVTYPE FATALITIES INJURIES years
## 1 4/18/1950 0:00:00 TORNADO 0 15 1950
## 2 4/18/1950 0:00:00 TORNADO 0 0 1950
## 3 2/20/1951 0:00:00 TORNADO 0 2 1951
We proceed to do a small transformation in our new variable of the dataset. It should be useful to converting the columns FATALITIES and INJURIES into rows, so that could be easily to plotting our results:
NOAA.harmful.tendency <- NOAA.harmful %>%
select(years, FATALITIES, INJURIES) %>%
gather("Type", "total",2:3 ) %>% #function gather to create new columns and merging
group_by(years, Type) %>% #the values of FATALITIES and INJURIES into them.
summarise(totals = sum(total)) #Summarises total of counts.
head(NOAA.harmful.tendency,3)
## # A tibble: 3 x 3
## # Groups: years [2]
## years Type totals
## <dbl> <chr> <dbl>
## 1 1950 FATALITIES 70
## 2 1950 INJURIES 659
## 3 1951 FATALITIES 34
#plotting our new dataset
qplot(years, totals, data = NOAA.harmful.tendency,
geom ="line", color = Type,
xlab = "Year", ylab = "Total Harms",
main = "NOAA's Registered Harmful Weather Events"
) + theme(plot.title = element_text(hjust=0.5))
From the 50’s to the earlier 90’s there appears to be slight changes in respect to the total of fatalities. Data seems starting growing nearly at the beginning of 90’s. An interesting observation is representative from that year compared to the total of injuries as well; there appears to be a slight relationship in the behavior of both line graphs. For that reason, in order to answer the research questions, we are going to filter out the original dataset starting from the year 1990.
NOAA.year.1990.1 <- NOAA.harmful %>%
select(years, FATALITIES, INJURIES, EVTYPE) %>%
filter(as.integer(years) >= 1990)
harmful.total <- NOAA.year.1990.1 %>%
select(EVTYPE, FATALITIES, INJURIES) %>%
group_by(EVTYPE) %>%
summarise(fatalities =sum(FATALITIES), injuries = sum(INJURIES))
head(harmful.total)
## # A tibble: 6 x 3
## EVTYPE fatalities injuries
## <fct> <dbl> <dbl>
## 1 " HIGH SURF ADVISORY" 0 0
## 2 " COASTAL FLOOD" 0 0
## 3 " FLASH FLOOD" 0 0
## 4 " LIGHTNING" 0 0
## 5 " TSTM WIND" 0 0
## 6 " TSTM WIND (G45)" 0 0
total.fatalities <- harmful.total %>%
select(EVTYPE, fatalities) %>%
arrange(desc(fatalities)) %>%
head(5)
total.injuries <- harmful.total %>%
select(EVTYPE, injuries) %>%
arrange(desc(injuries)) %>%
head(5)
fatalities.sum <- barchart(fatalities~EVTYPE, data = total.fatalities,
col=rgb(0.5,0.4,0.6,0.6),
xlab = "Event Type", ylab="Total Fatalities", main = "NOAA's Harmful Events (1990 - 2011)")
injuries.sum <- barchart(injuries~EVTYPE, data = total.injuries,
col=rgb(0.1,0.4,0.2,0.6),
xlab = "Event Type", ylab="Total Injuries", main = "NOAA's Harmful Events (1990 - 2011)")
grid.arrange(fatalities.sum, injuries.sum)
The results are showing that the reported NOAA’s events from the year 1990 to 2011 involve EXCESSIVE HEAT and TORNADOS considering that they are the most critical in respect to the population health of the U.S., since they are represented as fatalities. The reported cases show that excessive heat is the most representative reaching out 1,903 cases and 1,752 as for TORNADO events.
As for injury damage on people’s health, the most representative one is the TORNADO event as well, reaching out 26,674 cases reported.
Just as we did with the first research question, we are going to filter the data starting from the year 1990. In order to answer this question, we are going to select only the next variables from interest: EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP.
EVTYPE = Contains information about the type of the registered event PROPDMG = Involves property damage (\() that has a representative impact in the economy CROPDMG = Involves crops damage (\)) that has a representative impact in the economy as well PROPDMGEXP/CROPDMGEXP = Prefixes used to denote the multiplier of the PROPDMG and CROPDMG respectively (i.e. kilo - 1,000, M = 1,000,000)
NOAA.economic <- NOAA %>%
select(BGN_DATE,EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>% #variables
mutate(years = year( #mutating a new column which contains the 'year'
as.Date(sapply(
strsplit( #using strsplit function to separate
#and extracting only the 'years'
as.character(NOAA$BGN_DATE) ,
" ",fixed = TRUE),
"[[",1),
"%m/%d/%Y"))) %>%
filter(as.integer(years) >= 1990) %>%
select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
head(NOAA.economic, 3)
## EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 HAIL 0 0
## 2 TSTM WIND 0 0
## 3 TSTM WIND 0 0
We now proceed to explore the variables and checking for any NAs:
str(NOAA.economic)
## 'data.frame': 751740 obs. of 5 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 244 856 856 856 834 856 856 856 834 834 ...
## $ PROPDMG : num 0 0 0 0 2.5 0 0 0 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 1 1 1 1 19 1 1 1 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
NAS.economic <- NOAA.economic %>%
summarise(propdmg=mean(is.na(PROPDMG)),
cropdmg = mean(is.na(CROPDMG))) %>%
select(propdmg, cropdmg)
print(NAS.economic)
## propdmg cropdmg
## 1 0 0
tapply(NOAA.economic$PROPDMG, NOAA.economic$PROPDMGEXP, sum) # count of prefixes in property
## - ? + 0 1
## 527.41 15.00 0.00 117.00 7108.30 0.00
## 2 3 4 5 6 7
## 12.00 20.00 14.50 210.50 65.00 82.00
## 8 B h H K m
## 0.00 275.85 2.00 25.00 9136646.93 38.90
## M
## 115814.45
tapply(NOAA.economic$CROPDMG, NOAA.economic$CROPDMGEXP, sum) # count of prefixes in crops
## ? 0 2 B k
## 11.00 0.00 260.00 0.00 13.61 436.00
## K m M
## 1342955.91 10.00 34140.80
Important assumptions regarding the prefixes:
Calculating a new variable called damage.dollars (for both Property and Crops cases) that contains the multiplication of the prefixes by the PROPDMG and CROPDMG variables respectively.
# Property Damage
NOAA.economic.prop <- NOAA.economic %>%
select(EVTYPE, PROPDMG, PROPDMGEXP) %>%
filter(PROPDMG!=0) %>%
mutate(damage.dollars=PROPDMG*(ifelse(PROPDMGEXP=="", 1,
ifelse(PROPDMGEXP=="k" | PROPDMGEXP=="K", 1000,
ifelse(PROPDMGEXP=="m" | PROPDMGEXP=="M", 1000000,
ifelse(PROPDMGEXP=="h" | PROPDMGEXP=="H", 100,
ifelse(PROPDMGEXP=="B", 1000000000,
ifelse(PROPDMGEXP=="0", 1,0)
))))))) %>%
mutate(type = "PROPERTY DAMAGE") %>%
select(EVTYPE,damage.dollars, type)
# Crops Damage
NOAA.economic.crops <- NOAA.economic %>%
select(EVTYPE, CROPDMG, CROPDMGEXP) %>%
filter(CROPDMG!=0) %>%
mutate(damage.dollars=CROPDMG*(ifelse(CROPDMGEXP=="", 1,
ifelse(CROPDMGEXP=="k" | CROPDMGEXP=="K", 1000,
ifelse(CROPDMGEXP=="m" | CROPDMGEXP=="M", 1000000,
ifelse(CROPDMGEXP=="B", 1000000000,
ifelse(CROPDMGEXP=="0", 1,0)
)))))) %>%
mutate(type = "CROPS DAMAGE") %>%
select(EVTYPE, damage.dollars, type)
Mutating and summarizing the results into a new table called ‘NOAA.economic.total.damage’
NOAA.economic.total.damage <- rbind(NOAA.economic.prop, NOAA.economic.crops)
NOAA.economic.total.damage <- NOAA.economic.total.damage %>%
group_by(EVTYPE, type) %>%
summarise(total.damage.dollars = sum(damage.dollars)) %>%
select(EVTYPE, type, total.damage.dollars) %>%
arrange(desc(type),desc(total.damage.dollars,desc))
Selecting only the top 5 events:
NOAA.economic.total.damage <- rbind(NOAA.economic.prop, NOAA.economic.crops)
NOAA.economic.total.damage <- NOAA.economic.total.damage %>%
group_by(EVTYPE, type) %>%
summarise(total.damage.dollars = sum(damage.dollars)) %>%
select(EVTYPE, type, total.damage.dollars) %>%
arrange(desc(total.damage.dollars)) %>%
group_by(type) %>%
slice(1:5)
Plotting the final result:
ggplot(NOAA.economic.total.damage, aes(x = EVTYPE, y = total.damage.dollars/1000000)) +
geom_bar(aes(fill = type), position = "dodge", stat = "identity") + coord_flip() +
xlab("Reported Event Type") + ylab("Damage Amount (In Million Dollars)") +
labs(title = "Economic Loss by Climate Events in the U.S.",
subtitle = "Reported from year 1990 - 2011") + scale_fill_discrete(name = "Economic Variable") +
theme(plot.title = element_text(hjust=0.5, size = 15, face = "italic")) + theme(plot.subtitle = element_text(hjust=0.5)) +
theme(plot.subtitle = element_text(hjust=0.5, size = 10, face = "italic")) + theme(plot.subtitle = element_text(hjust=0.5))
FLOOD property damage clearly represents the most outstanding economic loss in the country, reaching out 144.65 Billion of dollars in losses reported. The second representative event is due to HURRICANE/TYPHOON events reaching out 69.30 Billion in dollars in losses.
Interestingly enough is that CROPS DAMAGES play also some contribution to the total economic loss, being DROUGHT the most outstanding event with 13.97 Billion of dollars in economic losses.