This project is the final assignment before completing the “Data Science: Foundations using R Specialization” course.
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
Using the data, we need to answer two questions:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
The data for this assignment come in the form of a comma-separated-value file, Storm Data.
Documentation on how some of the variables in the database are constructed and defined is available from the links below:
- National Weather Service Storm Data Documentation
- National Climatic Data Center Storm Events FAQ
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
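The point about sparser early records can be checked by counting events per year. The following is a minimal sketch on a small made-up data frame (toy_storm and its rows are hypothetical, not taken from the database); it assumes BGN_DATE uses the month/day/year format that appears in the str() output.

```r
# Toy stand-in for the storm data; the rows are invented for illustration.
toy_storm <- data.frame(
  BGN_DATE = c("4/18/1950 0:00:00", "2/20/1951 0:00:00",
               "6/8/1951 0:00:00", "11/15/2011 0:00:00"),
  stringsAsFactors = FALSE
)

# Parse the date part (trailing time text is ignored) and extract the year.
bgn_date <- as.Date(toy_storm$BGN_DATE, format = "%m/%d/%Y")
toy_storm$year <- as.integer(format(bgn_date, "%Y"))

# Number of recorded events per year.
events_per_year <- table(toy_storm$year)
print(events_per_year)
```

On the full data set, the same steps would be applied to storm$BGN_DATE once the file has been read.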
First, load the required packages using the library() function:
library(tidyverse)
Read the data:
storm<-read.csv("repdata_data_StormData.csv")
List the variable names:
names(storm)
[1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY" "COUNTYNAME"
[7] "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI" "BGN_LOCATI" "END_DATE"
[13] "END_TIME" "COUNTY_END" "COUNTYENDN" "END_RANGE" "END_AZI" "END_LOCATI"
[19] "LENGTH" "WIDTH" "F" "MAG" "FATALITIES" "INJURIES"
[25] "PROPDMG" "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
[31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_" "REMARKS"
[37] "REFNUM"
Inspect the structure of the data:
str(storm)
'data.frame': 902297 obs. of 37 variables:
$ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
$ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
$ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
$ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
$ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
$ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
$ STATE : chr "AL" "AL" "AL" "AL" ...
$ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
$ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
$ BGN_AZI : chr "" "" "" "" ...
$ BGN_LOCATI: chr "" "" "" "" ...
$ END_DATE : chr "" "" "" "" ...
$ END_TIME : chr "" "" "" "" ...
$ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
$ COUNTYENDN: logi NA NA NA NA NA NA ...
$ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
$ END_AZI : chr "" "" "" "" ...
$ END_LOCATI: chr "" "" "" "" ...
$ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
$ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
$ F : int 3 2 2 2 2 2 2 1 3 3 ...
$ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
$ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
$ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
$ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
$ PROPDMGEXP: chr "K" "K" "K" "K" ...
$ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
$ CROPDMGEXP: chr "" "" "" "" ...
$ WFO : chr "" "" "" "" ...
$ STATEOFFIC: chr "" "" "" "" ...
$ ZONENAMES : chr "" "" "" "" ...
$ LATITUDE : num 3040 3042 3340 3458 3412 ...
$ LONGITUDE : num 8812 8755 8742 8626 8642 ...
$ LATITUDE_E: num 3051 0 0 0 0 ...
$ LONGITUDE_: num 8806 0 0 0 0 ...
$ REMARKS : chr "" "" "" "" ...
$ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
Summarize the data:
summary(storm)
STATE__ BGN_DATE BGN_TIME TIME_ZONE
Min. : 1.0 Length:902297 Length:902297 Length:902297
1st Qu.:19.0 Class :character Class :character Class :character
Median :30.0 Mode :character Mode :character Mode :character
Mean :31.2
3rd Qu.:45.0
Max. :95.0
COUNTY COUNTYNAME STATE EVTYPE
Min. : 0.0 Length:902297 Length:902297 Length:902297
1st Qu.: 31.0 Class :character Class :character Class :character
Median : 75.0 Mode :character Mode :character Mode :character
Mean :100.6
3rd Qu.:131.0
Max. :873.0
BGN_RANGE BGN_AZI BGN_LOCATI END_DATE
Min. : 0.000 Length:902297 Length:902297 Length:902297
1st Qu.: 0.000 Class :character Class :character Class :character
Median : 0.000 Mode :character Mode :character Mode :character
Mean : 1.484
3rd Qu.: 1.000
Max. :3749.000
END_TIME COUNTY_END COUNTYENDN END_RANGE
Length:902297 Min. :0 Mode:logical Min. : 0.0000
Class :character 1st Qu.:0 NA's:902297 1st Qu.: 0.0000
Mode :character Median :0 Median : 0.0000
Mean :0 Mean : 0.9862
3rd Qu.:0 3rd Qu.: 0.0000
Max. :0 Max. :925.0000
END_AZI END_LOCATI LENGTH WIDTH
Length:902297 Length:902297 Min. : 0.0000 Min. : 0.000
Class :character Class :character 1st Qu.: 0.0000 1st Qu.: 0.000
Mode :character Mode :character Median : 0.0000 Median : 0.000
Mean : 0.2301 Mean : 7.503
3rd Qu.: 0.0000 3rd Qu.: 0.000
Max. :2315.0000 Max. :4400.000
F MAG FATALITIES INJURIES
Min. :0.0 Min. : 0.0 Min. : 0.0000 Min. : 0.0000
1st Qu.:0.0 1st Qu.: 0.0 1st Qu.: 0.0000 1st Qu.: 0.0000
Median :1.0 Median : 50.0 Median : 0.0000 Median : 0.0000
Mean :0.9 Mean : 46.9 Mean : 0.0168 Mean : 0.1557
3rd Qu.:1.0 3rd Qu.: 75.0 3rd Qu.: 0.0000 3rd Qu.: 0.0000
Max. :5.0 Max. :22000.0 Max. :583.0000 Max. :1700.0000
NA's :843563
PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
Min. : 0.00 Length:902297 Min. : 0.000 Length:902297
1st Qu.: 0.00 Class :character 1st Qu.: 0.000 Class :character
Median : 0.00 Mode :character Median : 0.000 Mode :character
Mean : 12.06 Mean : 1.527
3rd Qu.: 0.50 3rd Qu.: 0.000
Max. :5000.00 Max. :990.000
WFO STATEOFFIC ZONENAMES LATITUDE
Length:902297 Length:902297 Length:902297 Min. : 0
Class :character Class :character Class :character 1st Qu.:2802
Mode :character Mode :character Mode :character Median :3540
Mean :2875
3rd Qu.:4019
Max. :9706
NA's :47
LONGITUDE LATITUDE_E LONGITUDE_ REMARKS
Min. :-14451 Min. : 0 Min. :-14455 Length:902297
1st Qu.: 7247 1st Qu.: 0 1st Qu.: 0 Class :character
Median : 8707 Median : 0 Median : 0 Mode :character
Mean : 6940 Mean :1452 Mean : 3509
3rd Qu.: 9605 3rd Qu.:3549 3rd Qu.: 8735
Max. : 17124 Max. :9706 Max. :106220
NA's :40
REFNUM
Min. : 1
1st Qu.:225575
Median :451149
Mean :451149
3rd Qu.:676723
Max. :902297
Look for NAs in each column:
colSums(is.na(storm))
STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
0 0 0 0 0 0 0
EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
0 0 0 0 0 0 0
COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F
902297 0 0 0 0 0 843563
MAG FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
0 0 0 0 0 0 0
WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE LATITUDE_E LONGITUDE_
0 0 0 47 0 40 0
REMARKS REFNUM
0 0
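The NAs are confined to COUNTYENDN, F, LATITUDE, and LATITUDE_E, none of which are used in this analysis, so they will not affect our work.
Rename all the variables to lower case: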
storm<-storm %>% rename_all(tolower)
Our questions focus on health and economic consequences, so we don’t need to process all the fields.
To answer the first question, we need to focus on the variables “fatalities” and “injuries”.
Make a new table with the health-related data:
storm_health<-storm %>% select(evtype,fatalities,injuries)
Sum the fatalities and injuries per evtype:
storm_health_type<-storm_health %>% group_by(evtype) %>%
summarize(sum_of_fatalities=sum(fatalities),sum_of_injuries=sum(injuries))%>%
arrange(desc(sum_of_fatalities),desc(sum_of_injuries))
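To illustrate what this pipeline produces, here is a minimal sketch on a made-up table (toy_health and its numbers are hypothetical). Rows are ordered by total fatalities first, with total injuries used only to break ties, and head() then keeps the top rows.

```r
library(dplyr)

# Toy per-event records; the numbers are invented for illustration.
toy_health <- data.frame(
  evtype = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  fatalities = c(2, 1, 3, 4, 0),
  injuries = c(10, 0, 5, 1, 2)
)

# Same grouping and ordering as the pipeline above, applied to the toy data.
toy_health_type <- toy_health %>%
  group_by(evtype) %>%
  summarize(sum_of_fatalities = sum(fatalities),
            sum_of_injuries = sum(injuries)) %>%
  arrange(desc(sum_of_fatalities), desc(sum_of_injuries))

# Keep only the two most harmful event types.
print(head(toy_health_type, 2))
```

On the real data, head(storm_health_type, 10) would list the ten event types with the most fatalities.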
To answer the second question, we need to focus on the variables “propdmg”, “propdmgexp”, “cropdmg” and “cropdmgexp”.
Each variable is described below:
propdmg : property damage amount.
propdmgexp: exponent code (multiplier) for the property damage amount.
cropdmg : crop damage amount.
cropdmgexp: exponent code (multiplier) for the crop damage amount.
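As a worked example of how an amount and its exponent code combine, the first record in the str() output has PROPDMG 25 and PROPDMGEXP "K". Assuming "K" stands for thousands (the same assumption the conversion step in this report makes), the damage works out as follows; the variable names here are for illustration only.

```r
# Amount and exponent code from the first record shown by str(storm).
propdmg <- 25
propdmgexp <- "K"

# Assumed meaning of the code: "K" = thousands.
multiplier <- if (propdmgexp == "K") 10^3 else NA_real_

# Damage in dollars is the amount times the multiplier.
damage_usd <- propdmg * multiplier
print(damage_usd)
```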
Make a new table with the data that affect the economy:
storm_economy<-storm %>% select(evtype,propdmg ,propdmgexp,cropdmg,cropdmgexp)
The damage amounts are scaled by the exponent codes stored in propdmgexp and cropdmgexp, so we first have to decode these codes. Let’s look at the unique values in each exponent column:
unique(storm_economy$propdmgexp)
[1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
unique(storm_economy$cropdmgexp)
[1] "" "M" "K" "m" "B" "?" "0" "k" "2"
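Besides listing the distinct codes, it can help to see how often each one occurs before deciding how to treat the unusual ones. A minimal sketch, using a hypothetical vector toy_exp rather than the real column:

```r
# Toy exponent codes; invented for illustration.
toy_exp <- c("K", "K", "M", "", "", "", "B", "?", "k")

# Count each code and list the most frequent first.
code_counts <- sort(table(toy_exp, useNA = "ifany"), decreasing = TRUE)
print(code_counts)
```

On the real data, the same call would be made on storm_economy$propdmgexp and storm_economy$cropdmgexp.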
Convert each exponent code to its corresponding multiplier:
# Look up every code in one pass, so that each original value is mapped exactly once.
exp_multiplier<-c("K"=10^3,"k"=10^3,"M"=10^6,"m"=10^6,"B"=10^9,"H"=10^2,"h"=10^2,
"+"=10^0,"-"=10^0,"?"=10^0,"0"=10^0,"1"=10^1,"2"=10^2,"3"=10^3,"4"=10^4,
"5"=10^5,"6"=10^6,"7"=10^7,"8"=10^8)
# An empty code cannot be used as a name, so it is handled separately (multiplier 1).
decode_exp<-function(x) ifelse(x=="",10^0,unname(exp_multiplier[x]))
storm_economy$propdmgexp<-decode_exp(storm_economy$propdmgexp)
storm_economy$cropdmgexp<-decode_exp(storm_economy$cropdmgexp)
Make a new variable that holds the total damage (property plus crop) for each event:
storm_economy$damages<-((as.numeric(storm_economy$propdmg)*as.numeric(storm_economy$propdmgexp))+
(as.numeric(storm_economy$cropdmg)*as.numeric(storm_economy$cropdmgexp)))
Sum the damages per evtype:
storm_economy_type<-storm_economy %>%
group_by(evtype)%>%
summarize(all_damages=sum(damages))%>%
arrange(desc(all_damages))
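Totals of this size are easier to read in billions of dollars. The sketch below shows one way to rescale and keep the top rows; toy_economy_type, its numbers, and the column damages_billion_usd are hypothetical and stand in for the real summary table.

```r
library(dplyr)

# Toy per-type totals in dollars; invented for illustration.
toy_economy_type <- data.frame(
  evtype = c("FLOOD", "HAIL", "TORNADO"),
  all_damages = c(2.5e9, 4.0e8, 1.2e9)
)

# Order by total damage, express it in billions, and keep the top two types.
toy_top <- toy_economy_type %>%
  arrange(desc(all_damages)) %>%
  mutate(damages_billion_usd = all_damages / 10^9) %>%
  head(2)
print(toy_top)
```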
Based on this analysis, most of the resources for dealing with storms and other severe weather events in the United States should be directed towards tornadoes (which cause the most harm to population health, in both fatalities and injuries) and floods (which cause the greatest economic loss).