This report analyzes the NOAA storm database for the years 1950 to 2011 and addresses two questions: which types of weather events are most harmful to population health across the United States, and which have the greatest economic consequences.
We found that tornadoes have by far the greatest impact on population health in terms of fatalities and injuries, followed by thunderstorm wind and excessive heat. We also found that floods cause the greatest economic damage, followed by hurricanes/typhoons and tornadoes.
From the course website, we obtained the NOAA storm database in the form of a comma-separated-value file compressed with the bzip2 algorithm. The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
We first read the CSV file directly from the bzip2 archive. Missing values are blank in the database, and the file has a header line.
filename <- "repdata-data-StormData.csv.bz2"
rawdata <- read.csv(filename,
header = TRUE,
sep = ",",
na.strings= "")
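As a minimal sketch of the reading step above, the following uses a tiny made-up file (the path `toy_path` and its three columns are illustrative assumptions, not part of the storm data) to show that `read.csv()` reads a bzip2-compressed file directly and that `na.strings = ""` turns blank fields into `NA`.

```r
# Toy stand-in for the storm file: a blank field represents a missing value.
toy_path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(toy_path, open = "wt")
writeLines(c("EVTYPE,PROPDMG,PROPDMGEXP",
             "TORNADO,25,K",
             "HAIL,0,"), con)
close(con)

# read.csv() detects the bzip2 compression when given the file path.
toy <- read.csv(toy_path, header = TRUE, sep = ",", na.strings = "",
                stringsAsFactors = FALSE)

# The blank exponent on the HAIL row is read as NA.
is.na(toy$PROPDMGEXP)
```

Setting `stringsAsFactors = FALSE` here is a choice made only for the sketch; the report itself works with factor columns.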
This is a large dataset, so we check its structure.
str(rawdata)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29600 levels "5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13512 1872 4597 10591 4371 10093 1972 23872 24417 4597 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 34 levels " N"," NW","E",..: NA NA NA NA NA NA NA NA NA NA ...
## $ BGN_LOCATI: Factor w/ 54428 levels "- 1 N Albion",..: NA NA NA NA NA NA NA NA NA NA ...
## $ END_DATE : Factor w/ 6662 levels "1/1/1993 0:00:00",..: NA NA NA NA NA NA NA NA NA NA ...
## $ END_TIME : Factor w/ 3646 levels " 0900CST"," 200CST",..: NA NA NA NA NA NA NA NA NA NA ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 23 levels "E","ENE","ESE",..: NA NA NA NA NA NA NA NA NA NA ...
## $ END_LOCATI: Factor w/ 34505 levels "- .5 NNW","- 11 ESE Jay",..: NA NA NA NA NA NA NA NA NA NA ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 18 levels "-","?","+","0",..: 16 16 16 16 16 16 16 16 16 16 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 8 levels "?","0","2","B",..: NA NA NA NA NA NA NA NA NA NA ...
## $ WFO : Factor w/ 541 levels " CI","$AC","$AG",..: NA NA NA NA NA NA NA NA NA NA ...
## $ STATEOFFIC: Factor w/ 249 levels "ALABAMA, Central",..: NA NA NA NA NA NA NA NA NA NA ...
## $ ZONENAMES : Factor w/ 25111 levels " "| __truncated__,..: NA NA NA NA NA NA NA NA NA NA ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436780 levels "-2 at Deer Park\n",..: NA NA NA NA NA NA NA NA NA NA ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
In this report we are only interested in an analysis across the United States, and only in population health and economic consequences. Hence all location columns, and any other columns not directly relevant to these questions, can be removed.
First we look at all the available columns.
names(rawdata)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
Now select only the columns that are relevant to our report:
- EVTYPE: the weather event
- FATALITIES: fatalities due to EVTYPE
- INJURIES: injuries due to EVTYPE
- PROPDMG: property damage value
- PROPDMGEXP: the unit of property damage
- CROPDMG: crop damage value
- CROPDMGEXP: the unit of crop damage
- REFNUM: reference number, retained in case we need to refer back to the raw data
tidydata <- rawdata[ , c("EVTYPE",
"FATALITIES",
"INJURIES",
"PROPDMG",
"PROPDMGEXP",
"CROPDMG",
"CROPDMGEXP",
"REFNUM")]
Free up memory by removing rawdata. As a bonus, the data size drops from 500 MB of raw data to 50 MB of tidy data.
rm(rawdata)
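One way to check this kind of size reduction is `object.size()`. The sketch below uses a small made-up data frame (`toy_raw`, with a long free-text column standing in for REMARKS) rather than the real data, so the numbers it prints are only illustrative.

```r
# Toy frame with one bulky free-text column, standing in for the raw data.
toy_raw <- data.frame(EVTYPE     = c("TORNADO", "HAIL"),
                      FATALITIES = c(0, 1),
                      REMARKS    = c(strrep("x", 1000), strrep("y", 1000)),
                      stringsAsFactors = FALSE)

# Keep only the columns needed for the analysis.
keep <- c("EVTYPE", "FATALITIES")
toy_tidy <- toy_raw[, keep]

# Compare the in-memory footprint before and after the selection.
format(object.size(toy_raw), units = "Kb")
format(object.size(toy_tidy), units = "Kb")
```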
Let us look at the structure of this dataset.
str(tidydata)
## 'data.frame': 902297 obs. of 8 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 18 levels "-","?","+","0",..: 16 16 16 16 16 16 16 16 16 16 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 8 levels "?","0","2","B",..: NA NA NA NA NA NA NA NA NA NA ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
PROPDMGEXP needs cleaning because it has 18 levels. First we look at its unique values.
unique(tidydata$PROPDMGEXP)
## [1] K M <NA> B m + 0 5 6 ? 4 2 3 h
## [15] 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
Counting NA, the column takes 19 distinct values across its 18 levels.
For this analysis we treat only the following codes as valid multipliers; every other code is given a multiplier of 0. Since the goal is to find the events with the greatest economic damage, this appears to be a reasonable assumption.
- H: hundreds
- K: thousands
- M: millions
- B: billions
- h, k, m, b: assumed to be the same as ‘H’, ‘K’, ‘M’ and ‘B’ respectively
exp <- c("H", "h", "K", "k", "M", "m", "B", "b", "0")
value <- c(100L, 100L,
1000L, 1000L,
1000000L, 1000000L,
1000000000L, 1000000000L,
0L)
df <- data.frame(exp, value)
When assessing the greatest damage, we add the property damage and the crop damage, since an event can have one without the other. If NA or other invalid exponent codes were left in place, the total would be incorrect. Hence all NAs and invalid codes are set to “0”, which maps to a multiplier of 0. Their effect should be marginal because the data contains damage recorded in the billions.
tidydata_eco <- tidydata
tidydata_eco[is.na(tidydata_eco$PROPDMGEXP), ]$PROPDMGEXP <- "0"
tidydata_eco[!(tidydata_eco$PROPDMGEXP %in% exp), ]$PROPDMGEXP <- "0"
Similarly, CROPDMGEXP needs cleaning because it has 8 levels. First we look at its unique values.
unique(tidydata$CROPDMGEXP)
## [1] <NA> M K m B ? 0 k 2
## Levels: ? 0 2 B k K m M
We apply the same NA and “exp” treatment as for PROPDMGEXP.
tidydata_eco[is.na(tidydata_eco$CROPDMGEXP), ]$CROPDMGEXP <- "0"
tidydata_eco[!(tidydata_eco$CROPDMGEXP %in% exp), ]$CROPDMGEXP <- "0"
Now let us calculate the economic damage for each event. It is the property damage times its multiplier plus the crop damage times its multiplier.
tidydata_eco$ECODMG <-
value[match(tidydata_eco$PROPDMGEXP, df$exp)]*tidydata_eco$PROPDMG +
value[match(tidydata_eco$CROPDMGEXP, df$exp)]*tidydata_eco$CROPDMG
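To make the arithmetic concrete, here is a small worked sketch on made-up rows. It uses a named vector (`multiplier`) and a helper (`to_multiplier`) that are illustrative names, not part of the report's code, but it follows the same rules: lower-case codes count as upper case, and NA or any other code contributes 0.

```r
# Multipliers for the codes treated as valid; everything else maps to 0.
multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

to_multiplier <- function(code) {
  m <- multiplier[toupper(code)]  # lookup by name; unknown codes give NA
  m[is.na(m)] <- 0                # NA and invalid codes contribute nothing
  unname(m)
}

toy <- data.frame(PROPDMG    = c(25, 2.5, 1, 10),
                  PROPDMGEXP = c("K", "m", NA, "?"),
                  CROPDMG    = c(0, 1, 5, 0),
                  CROPDMGEXP = c(NA, "K", "B", "0"),
                  stringsAsFactors = FALSE)

toy$ECODMG <- to_multiplier(toy$PROPDMGEXP) * toy$PROPDMG +
              to_multiplier(toy$CROPDMGEXP) * toy$CROPDMG

toy$ECODMG
# 25 K property             -> 25000
# 2.5 m property + 1 K crop -> 2501000
# NA property + 5 B crop    -> 5e+09
# "?" property + "0" crop   -> 0
```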
We also calculate the impact on population health. For this analysis we assume that fatalities and injuries both affect population health and carry equal weight.
tidydata_eco$HLTHDMG <- tidydata_eco$FATALITIES +
tidydata_eco$INJURIES
So far we have cleaned up the columns. Now we can clean up the rows.
Let us first eliminate those rows that have 0 economic damage and 0 health damage
tidydata_eco <- tidydata_eco[!(tidydata_eco$ECODMG == 0 &
tidydata_eco$HLTHDMG == 0), ]
Then convert the event types to upper case and remove the leading and trailing blanks.
library(stringr)
tidydata_eco$EVTYPE <- str_trim(toupper(tidydata_eco$EVTYPE))
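The effect of this normalization can be seen on a few made-up labels. The sketch below uses base R's `trimws()` as a stand-in for `str_trim()`; that substitution is an assumption made to keep the example free of package dependencies.

```r
# Made-up labels that differ only in case and surrounding blanks.
raw_labels <- factor(c(" High Surf Advisory", "tstm wind",
                       "TSTM WIND ", "Tornado"))

# Upper-case first, then strip leading and trailing whitespace.
clean_labels <- trimws(toupper(as.character(raw_labels)))

# The two TSTM WIND variants collapse into one value.
unique(clean_labels)
```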
We now have 254331 observations of 10 variables, and EVTYPE has 440 distinct values.
We clean up EVTYPE further by mapping its values to the valid event types listed in section 2.1.1 of the code book.
## TSTM is assumed to be thunderstorm
tidydata_eco[grepl("^TSTM", tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "THUNDERSTORM WIND"
## Thunderstorm variants are collapsed into a single label
tidydata_eco[grepl("^THUNDERSTORM", tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "THUNDERSTORM WIND"
## non TSTM winds are classified as HIGH WIND
tidydata_eco[grepl("^NON.TSTM", tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "HIGH WIND"
tidydata_eco[grepl("^HIGH WIND", tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "HIGH WIND"
## Hail is cleaned up
tidydata_eco[grepl("^HAIL", tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "HAIL"
## Hurricane names are removed
tidydata_eco[grepl("HURRICANE", tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "HURRICANE/TYPHOON"
## If it begins with Waterspout, it is given priority. There are a few events
## with both Waterspout and Tornado
tidydata_eco[grepl("^WATERSPOUT", tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "WATERSPOUT"
## Event that has Tornado is given priority over others
tidydata_eco[grepl("TORNADO", tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "TORNADO"
tidydata_eco[grepl("^FLASH FLOOD", tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "FLASH FLOOD"
tidydata_eco[grepl("ICE STORM", tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "ICE STORM"
## If there is flash anywhere, it is flash flood
tidydata_eco[grepl("FLASH", tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "FLASH FLOOD"
## Now if it begins with flood, make it flood
tidydata_eco[grepl("^FLOOD", tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "FLOOD"
## Coastal flood
tidydata_eco[grepl("COAST", tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "COASTAL FLOOD"
## If it does not start with EXTREME COLD but contains COLD anywhere, it is marked as COLD
tidydata_eco[grepl("(^(?!EXTREME COLD))(?=.*COLD)",
tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "COLD"
tidydata_eco[grepl("FROST|FREEZE", tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "FROST/FREEZE"
tidydata_eco[grepl("(^(?!FREEZING FOG))(?=.*FREEZING)",
tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "FROST/FREEZE"
tidydata_eco[grepl("(^(?!EXCESSIVE HEAT))(^(?!RECORD))(^(?!DROUGHT))(?=.*HEAT)",
tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "HEAT"
tidydata_eco[grepl("(LAKE)(?=.*SNOW)",
tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "LAKE-EFFECT SNOW"
tidydata_eco[grepl("(^(?!LAKE-EFFECT))(?=.*SNOW)",
tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "HEAVY SNOW"
tidydata_eco[grepl("RAIN",
tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "HEAVY RAIN"
tidydata_eco[grepl("SURF",
tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "HEAVY SURF"
tidydata_eco[grepl("(^THU|TUN)",
tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "THUNDERSTORM WIND"
tidydata_eco[grepl("(^(?!MARINE))(?=.*THUNDER)",
tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "THUNDERSTORM WIND"
tidydata_eco[grepl("(^(?!THUNDER|MARINE|DUST|EXTREME))(?=.*WIND)",
tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "HIGH WIND"
tidydata_eco[grepl("LAKE ",
tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "LAKESHORE FLOOD"
tidydata_eco[grepl("CSTL",
tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "COASTAL FLOOD"
tidydata_eco[grepl("(^(?!COASTAL|FLASH|LAKE))(?=.*FLOOD)",
tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "FLOOD"
tidydata_eco[grepl("(^ICE|ICY)",
tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "ICE STORM"
tidydata_eco[grepl("LIG",
tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "LIGHTNING"
tidydata_eco[grepl("RECORD",
tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "EXCESSIVE HEAT"
tidydata_eco[grepl("RIP CURRENT",
tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "RIP CURRENT"
tidydata_eco[grepl("TROPICAL STORM",
tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "TROPICAL STORM"
tidydata_eco[grepl("^WINTER WEATHER",
tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "WINTER WEATHER"
tidydata_eco[grepl("AVALA",
tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "AVALANCHE"
tidydata_eco[grepl("BLIZZARD",
tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "BLIZZARD"
tidydata_eco[grepl("DROUGHT",
tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "DROUGHT"
tidydata_eco[grepl("TORN",
tidydata_eco$EVTYPE, perl = TRUE), ]$EVTYPE <- "TORNADO"
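The same kind of ordered recoding can also be expressed as a table of rules applied in sequence. The sketch below is one possible arrangement, not the method used in this report: the helper `recode_events`, the `rules` table, and the toy labels are all illustrative assumptions, and only four of the patterns above are included. Working on a plain character vector means an assignment is simply a no-op when no element matches a pattern.

```r
# Apply each (pattern, label) rule in order; later rules see earlier results.
recode_events <- function(evtype, rules) {
  for (i in seq_len(nrow(rules))) {
    hit <- grepl(rules$pattern[i], evtype, perl = TRUE)
    evtype[hit] <- rules$label[i]  # does nothing when no element matches
  }
  evtype
}

rules <- data.frame(
  pattern = c("^TSTM", "^NON.TSTM", "HURRICANE",
              "(^(?!EXTREME COLD))(?=.*COLD)"),
  label   = c("THUNDERSTORM WIND", "HIGH WIND", "HURRICANE/TYPHOON",
              "COLD"),
  stringsAsFactors = FALSE)

toy_events <- c("TSTM WIND/HAIL", "NON-TSTM WIND", "HURRICANE OPAL",
                "EXTREME COLD", "COLD WAVE", "SLEET")

recode_events(toy_events, rules)
# EXTREME COLD is protected by the negative lookahead; SLEET matches no rule.
```

Keeping the rules in a table makes their order explicit, which matters because an early rule can change what a later pattern sees.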
There are now 116 unique event types.
pending_events <- sort(unique(tidydata_eco$EVTYPE))
The master set has the 48 official event types plus one catch-all (OTHER). The list is taken from section 2.1.1 of the code book PDF.
valid_event <-
c("Astronomical Low Tide",
"Avalanche",
"Blizzard",
"Coastal Flood",
"Cold/Wind Chill",
"Debris Flow",
"Dense Fog",
"Dense Smoke",
"Drought",
"Dust Devil",
"Dust Storm",
"Excessive Heat",
"Extreme Cold/Wind Chill",
"Flash Flood",
"Flood",
"Frost/Freeze",
"Funnel Cloud",
"Freezing Fog",
"Hail",
"Heat",
"Heavy Rain",
"Heavy Snow",
"High Surf",
"High Wind",
"Hurricane/Typhoon",
"Ice Storm",
"Lake-Effect Snow",
"Lakeshore Flood",
"Lightning",
"Marine Hail",
"Marine High Wind",
"Marine Strong Wind",
"Marine Thunderstorm Wind",
"Rip Current",
"Seiche",
"Sleet",
"Storm Surge/Tide",
"Strong Wind",
"Thunderstorm Wind",
"Tornado",
"Tropical Depression",
"Tropical Storm",
"Tsunami",
"Volcanic Ash",
"Waterspout",
"Wildfire",
"Winter Storm",
"Winter Weather")
valid_event <- toupper(valid_event)
Now we match against the valid events; where there is no match, we fill in “OTHER”. We then store the result in a new column of the dataset, so that both the cleaned event type and its official match remain available for future reference.
event_match <- valid_event[pmatch(pending_events, valid_event,
duplicates.ok = TRUE )]
event_match <- ifelse(is.na(event_match), "OTHER", event_match)
tidydata_eco$EVTYPE1 <-
event_match[match(tidydata_eco$EVTYPE, pending_events)]
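It helps to see how `pmatch()` behaves here: it accepts an exact match or a unique prefix match, and returns `NA` when a label is ambiguous or is not a prefix of any official name. The sketch below uses a shortened, illustrative list (`valid`) and hypothetical labels (`labels`) rather than the full 48 event types.

```r
# Shortened stand-in for the official event list.
valid  <- c("COLD/WIND CHILL", "HIGH SURF", "HIGH WIND", "TORNADO")

# Hypothetical cleaned labels to be matched.
labels <- c("COLD", "TORNADO", "HIGH", "HEAVY SURF")

idx     <- pmatch(labels, valid, duplicates.ok = TRUE)
matched <- ifelse(is.na(idx), "OTHER", valid[idx])

data.frame(labels, matched)
# COLD       -> COLD/WIND CHILL  (unique prefix)
# TORNADO    -> TORNADO          (exact)
# HIGH       -> OTHER            (ambiguous: HIGH SURF and HIGH WIND)
# HEAVY SURF -> OTHER            (not a prefix of any official name)
```

A consequence worth keeping in mind is that any recoded label that is neither an official name nor a unique prefix of one ends up in OTHER.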
Finally we have a clean data set. We take a look at the first 6 rows.
head(tidydata_eco)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP REFNUM
## 1 TORNADO 0 15 25.0 K 0 0 1
## 2 TORNADO 0 0 2.5 K 0 0 2
## 3 TORNADO 0 2 25.0 K 0 0 3
## 4 TORNADO 0 2 2.5 K 0 0 4
## 5 TORNADO 0 2 2.5 K 0 0 5
## 6 TORNADO 0 6 2.5 K 0 0 6
## ECODMG HLTHDMG EVTYPE1
## 1 25000 15 TORNADO
## 2 2500 0 TORNADO
## 3 25000 2 TORNADO
## 4 2500 2 TORNADO
## 5 2500 2 TORNADO
## 6 2500 6 TORNADO
We also look at the last 6 rows, as the data collection is much better towards the end.
tail(tidydata_eco)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 902249 WINTER STORM 0 0 2.0 K 0
## 902250 WINTER STORM 0 0 5.0 K 0
## 902255 HIGH WIND 0 0 0.6 K 0
## 902257 HIGH WIND 0 0 1.0 K 0
## 902259 DROUGHT 0 0 2.0 K 0
## 902260 HIGH WIND 0 0 7.5 K 0
## CROPDMGEXP REFNUM ECODMG HLTHDMG EVTYPE1
## 902249 K 902249 2000 0 WINTER STORM
## 902250 K 902250 5000 0 WINTER STORM
## 902255 K 902255 600 0 HIGH WIND
## 902257 K 902257 1000 0 HIGH WIND
## 902259 K 902259 2000 0 DROUGHT
## 902260 K 902260 7500 0 HIGH WIND
Now we summarize the data for population health and economic damage.
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
event_gp <- group_by(tidydata_eco, EVTYPE1)
event_smry <- summarize(event_gp,
hlthsum = sum(HLTHDMG),
hlthmn = mean(HLTHDMG),
hlthmx = max(HLTHDMG),
ecosum = sum(ECODMG),
ecomn = mean(ECODMG),
ecomx = max(ECODMG))
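For readers who want to check the grouped totals without dplyr, the sums can be sketched in base R with `aggregate()`. The toy data frame below is made up for illustration; only the column names follow the report.

```r
# Made-up rows using the report's column names.
toy <- data.frame(EVTYPE1 = c("TORNADO", "TORNADO", "FLOOD", "HAIL"),
                  HLTHDMG = c(15, 2, 1, 0),
                  ECODMG  = c(25000, 2500, 1e6, 500),
                  stringsAsFactors = FALSE)

# Total health and economic damage per event type.
toy_smry <- aggregate(cbind(HLTHDMG, ECODMG) ~ EVTYPE1, data = toy, FUN = sum)

# Rank by health impact, largest first.
toy_smry[order(-toy_smry$HLTHDMG), ]
```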
Now let us look at the top 10 weather events with the greatest total impact on population health across the US over all years.
arrange(event_smry, desc(hlthsum))[1:10, c("EVTYPE1", "hlthsum")]
## Source: local data frame [10 x 2]
##
## EVTYPE1 hlthsum
## 1 TORNADO 97022
## 2 THUNDERSTORM WIND 10220
## 3 EXCESSIVE HEAT 8497
## 4 FLOOD 7279
## 5 LIGHTNING 6048
## 6 HEAT 3863
## 7 FLASH FLOOD 2835
## 8 OTHER 2725
## 9 HIGH WIND 2331
## 10 ICE STORM 2262
We also plot it for easy visualization. Note that the y axis shows the natural log of the totals, which keeps the smaller totals readable next to the much larger tornado total.
library(ggplot2)
qplot(EVTYPE1,
log(hlthsum),
data= arrange(event_smry, desc(hlthsum))[1:10, ],
xlab = "Top 10 Weather Events",
ylab = "Log of sum of fatalities and injuries") +
labs(title = "Figure 1: Consequences to population health") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
By far the greatest impact on population health comes from tornadoes, followed by thunderstorm wind and excessive heat.
Note that the catch-all OTHER ranks 8th, which suggests that the event-type matching could be refined further.
Now let us look at the top 10 weather events that cause the greatest total economic damage across the US over all years.
arrange(event_smry,desc(ecosum))[1:10, c("EVTYPE1", "ecosum")]
## Source: local data frame [10 x 2]
##
## EVTYPE1 ecosum
## 1 FLOOD 161000844600
## 2 HURRICANE/TYPHOON 90271472810
## 3 TORNADO 58959393590
## 4 STORM SURGE/TIDE 47965579000
## 5 HAIL 19000564320
## 6 FLASH FLOOD 18440124760
## 7 DROUGHT 15018677780
## 8 THUNDERSTORM WIND 12242125730
## 9 ICE STORM 8981254510
## 10 TROPICAL STORM 8409286550
We will also plot it for easy visualization. (Note the log scale used in y axis)
qplot(EVTYPE1,
log(ecosum),
data= arrange(event_smry, desc(ecosum))[1:10, ],
xlab = "Top 10 Weather Events",
ylab = "Log of sum of property and crop damage") +
labs(title = "Figure 2: Economic Consequences") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
As noted above, floods cause the greatest economic damage, followed by hurricanes/typhoons and tornadoes.
The event types in the data are clearly not clean. They need a lot of tidying to match the 48 event types in the code book; this requires further expert knowledge of the weather data and is left for future analysis.
However, the top 10 tables and figures consist almost entirely of official event types from the code book, so the analysis is assumed to be reasonable.