North America has the most tornadoes per year of any continent, and the United States has the most tornadoes per year of any country. Tornadoes are violent rotating pillars of air extending from clouds to the ground. According to NOAA, approximately 1,200 tornadoes occur annually across the United States, with Maryland averaging roughly 15-20 per year, mostly weak EF0-EF1 events. Survey teams rank the magnitude of a tornado based on wind speed, observed damage and other related factors. The Enhanced Fujita (EF) Scale rates tornadoes from EF0 to EF5 based on estimated wind speeds derived from observed damage, and has been used since 2007. The standardized ratings provide crucial data for emergency response, insurance claims, and tracking storm event impacts through NOAA’s Storm Events. This dataset is sourced from NOAA, containing information from 1950 to 2025, and I am interested in looking into the impacts across Maryland. I hope to better understand how the state has been affected and to see the ways the factors impact a tornado’s rating or “magnitude”
Citations:
Did You Know? | National Centers for Environmental Information (NCEI), www.ncei.noaa.gov/access/monitoring/dyk/tornadocount. Accessed 3 July 2026.
Monthly and Annual Numbers of Tornadoes - Graphs and Maps | NOAA Climate.Gov, www.climate.gov/maps-data/dataset/monthly-and-annual-numbers-tornadoes-graphs-and-maps. Accessed 3 July 2026.
I install the packages that will help me clean and organize the data (“tidyverse”, “dplyr”) as well as the packages to help me visualize the data (“ggplot2”, “plotly”).
install.packages("tidyverse")
Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.6'
(as 'lib' is unspecified)
install.packages("dplyr")
Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.6'
(as 'lib' is unspecified)
install.packages("ggplot2")
Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.6'
(as 'lib' is unspecified)
install.packages("plotly")
Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.6'
(as 'lib' is unspecified)
I load the libraries, including the library that contains the read_csv function to help import the dataset.
#Import librarieslibrary(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(tidyr)library(readr)#Import the datasettornadoes1950_2025 <-read_csv("1950-2025_all_tornadoes.csv")
Rows: 74956 Columns: 31
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): om, mo, dy, st, stf
dbl (22): yr, tz, stn, mag, inj, fat, loss, closs, slat, slon, elat, elon, ...
date (2): date, edat
time (2): time, etime
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
I get an overview of the dataset. I use colnames to see the variables, str to see the the classes of all the variables, and head to get a snapshot of a few observations.
# A tibble: 6 × 31
om yr mo dy date time tz st stf stn mag inj
<chr> <dbl> <chr> <chr> <date> <time> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 192 1950 10 01 1950-10-01 21:00 3 OK 40 23 1 0
2 193 1950 10 09 1950-10-09 02:15 3 NC 37 9 3 3
3 195 1950 11 20 1950-11-20 02:20 3 KY 21 1 2 0
4 196 1950 11 20 1950-11-20 04:00 3 KY 21 2 1 0
5 197 1950 11 20 1950-11-20 07:30 3 MS 28 14 1 3
6 194 1950 11 04 1950-11-04 17:00 3 PA 42 5 3 1
# ℹ 19 more variables: fat <dbl>, loss <dbl>, closs <dbl>, slat <dbl>,
# slon <dbl>, elat <dbl>, elon <dbl>, len <dbl>, wid <dbl>, ns <dbl>,
# sn <dbl>, sg <dbl>, f1 <dbl>, f2 <dbl>, f3 <dbl>, f4 <dbl>, fc <dbl>,
# edat <date>, etime <time>
Next, I create cleaned datasets to work with. We use filter to create a dataset containing the tornado information for Maryland. We use select, and negate the selection using !, to make a dataset. We use select again and create a new dataset with the variables we removed from marylandtornadoes.
#Filter to keep only Maryland observations. Keep in mind some tornadoes do cross multiple states, some of these observations will have information for the maryland segment of a multi-state tornado.maryland <- tornadoes1950_2025 %>%filter(tornadoes1950_2025$st =="MD")#Select, and the ! to keep all variables other than those listed. This is the main tornado data.marylandtornadoes <- maryland %>%select(!c(stn,stf,ns,sn,sg,f1,f2,f3,f4))#Select, and the variables that help determine whether the observation is of a single, double, or multi-state tornado path.marylandpaths <- maryland %>%select(c(om,stn,stf,ns,sn,sg,f1,f2,f3,f4))
For the paths dataset, NOAA uses number codes to help identify which tornado observations in the datasheet are of a tornado segment, or the whole record of a tornado. This is important because some tornadoes may have crossed multiple states and event done different amounts of property damage or killed different numbers of people across different states. We are going to use this information when we create our Tableau visualization.
#Clean the paths dataset, use rowwise to make sure the Number of States(ns), State Number (sn), and Segment Number (sg) are all kept in line. Use mutate to help rename observations to make their meaning more clear.marylandpaths <- marylandpaths %>%rowwise() |>mutate(Tornado_Segment =list(c_across(ns:sg)))marylandpaths$Tornado_Segment <-as.character(marylandpaths$Tornado_Segment)marylandpaths <- marylandpaths %>%mutate(Tornado_Info = Tornado_Segment)marylandpaths <- marylandpaths %>%mutate(Tornado_Info =recode(Tornado_Info,"c(3, 1, 2)"="First Segment for Three-State Tornado","c(2, 1, 2)"="First Segment for Two-State Tornado","c(2, 0, 1)"="Two-State Tornado","c(1, 1, 1)"="Entire Record [Unless All 4 FIPS zero]"))
For the dataset containing the other variables for the tornadoes, rename the columns so it’s more clear what variable is being recorded.
I’m going to check and see if I have any NA in the observations for my variables. I see that End Date and End Time has multiple NAs, so it’s probably not ideal data to analyze. Also, I see the magnitude is listed as a number, but it’s not a quantitative variable in reality. Instead, it’s categorical, following the Enhanced Fujita scale. (-9 for EFU which is unknown speed and no surveyable damage, 0 for EF0 winds up to 85mph, 1 for EF1 winds up to 110mph, etc). I’m going to adjust the dataset accordingly.
#Check for NA in observationscolSums(is.na(marylandpaths))
om stn stf ns sn
0 0 0 0 0
sg f1 f2 f3 f4
0 0 0 0 0
Tornado_Segment Tornado_Info
0 0
colSums(is.na(marylandtornadoes))
Number_of_Tornadoes Year Month Day
0 0 0 0
Date Time Time Zone State
0 0 0 0
Magnitude Injuries Fatalities Property_Loss
0 0 0 0
Crop_Loss Start_Latitude Start_Longitude End_Latitude
0 0 0 0
End_Longitude Path_Length Path_Width Rated_or_Unknown
0 0 0 0
End_Date End_Time
277 277
#For fot the classses of all variablesstr(marylandpaths)
rowws_df [430 × 12] (S3: rowwise_df/tbl_df/tbl/data.frame)
$ om : chr [1:430] "183" "93" "96" "97" ...
$ stn : num [1:430] 1 1 2 3 4 1 2 1 2 1 ...
$ stf : chr [1:430] "24" "24" "24" "24" ...
$ ns : num [1:430] 1 1 1 1 1 1 1 2 1 1 ...
$ sn : num [1:430] 1 1 1 1 1 1 1 1 1 1 ...
$ sg : num [1:430] 1 1 1 1 1 1 1 2 1 1 ...
$ f1 : num [1:430] 29 21 21 11 31 33 3 21 23 47 ...
$ f2 : num [1:430] 0 0 0 0 0 0 0 0 0 0 ...
$ f3 : num [1:430] 0 0 0 0 0 0 0 0 0 0 ...
$ f4 : num [1:430] 0 0 0 0 0 0 0 0 0 0 ...
$ Tornado_Segment: chr [1:430] "c(1, 1, 1)" "c(1, 1, 1)" "c(1, 1, 1)" "c(1, 1, 1)" ...
$ Tornado_Info : chr [1:430] "Entire Record [Unless All 4 FIPS zero]" "Entire Record [Unless All 4 FIPS zero]" "Entire Record [Unless All 4 FIPS zero]" "Entire Record [Unless All 4 FIPS zero]" ...
- attr(*, "groups")= tibble [430 × 1] (S3: tbl_df/tbl/data.frame)
..$ .rows: list<int> [1:430]
.. ..$ : int 1
.. ..$ : int 2
.. ..$ : int 3
.. ..$ : int 4
.. ..$ : int 5
.. ..$ : int 6
.. ..$ : int 7
.. ..$ : int 8
.. ..$ : int 9
.. ..$ : int 10
.. ..$ : int 11
.. ..$ : int 12
.. ..$ : int 13
.. ..$ : int 14
.. ..$ : int 15
.. ..$ : int 16
.. ..$ : int 17
.. ..$ : int 18
.. ..$ : int 19
.. ..$ : int 20
.. ..$ : int 21
.. ..$ : int 22
.. ..$ : int 23
.. ..$ : int 24
.. ..$ : int 25
.. ..$ : int 26
.. ..$ : int 27
.. ..$ : int 28
.. ..$ : int 29
.. ..$ : int 30
.. ..$ : int 31
.. ..$ : int 32
.. ..$ : int 33
.. ..$ : int 34
.. ..$ : int 35
.. ..$ : int 36
.. ..$ : int 37
.. ..$ : int 38
.. ..$ : int 39
.. ..$ : int 40
.. ..$ : int 41
.. ..$ : int 42
.. ..$ : int 43
.. ..$ : int 44
.. ..$ : int 45
.. ..$ : int 46
.. ..$ : int 47
.. ..$ : int 48
.. ..$ : int 49
.. ..$ : int 50
.. ..$ : int 51
.. ..$ : int 52
.. ..$ : int 53
.. ..$ : int 54
.. ..$ : int 55
.. ..$ : int 56
.. ..$ : int 57
.. ..$ : int 58
.. ..$ : int 59
.. ..$ : int 60
.. ..$ : int 61
.. ..$ : int 62
.. ..$ : int 63
.. ..$ : int 64
.. ..$ : int 65
.. ..$ : int 66
.. ..$ : int 67
.. ..$ : int 68
.. ..$ : int 69
.. ..$ : int 70
.. ..$ : int 71
.. ..$ : int 72
.. ..$ : int 73
.. ..$ : int 74
.. ..$ : int 75
.. ..$ : int 76
.. ..$ : int 77
.. ..$ : int 78
.. ..$ : int 79
.. ..$ : int 80
.. ..$ : int 81
.. ..$ : int 82
.. ..$ : int 83
.. ..$ : int 84
.. ..$ : int 85
.. ..$ : int 86
.. ..$ : int 87
.. ..$ : int 88
.. ..$ : int 89
.. ..$ : int 90
.. ..$ : int 91
.. ..$ : int 92
.. ..$ : int 93
.. ..$ : int 94
.. ..$ : int 95
.. ..$ : int 96
.. ..$ : int 97
.. ..$ : int 98
.. ..$ : int 99
.. .. [list output truncated]
.. ..@ ptype: int(0)
#Convert the column data to chr using as.character.marylandtornadoes$Magnitude <-as.character(marylandtornadoes$Magnitude)
Correlation
Following some of the methods demonstrated in the class tutorial that Dr. Saidi has created, I’m going to do a correlation matrix, as well as multiple linear regression, to analyze the data further.
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
Creating a correlation matrix to see which variables are have high, low, or no correlation. The red colors show that Injuries and Fatalities are variables that are highly (more than 0.5) correlated. The length of the tornado’s path is also highly (about or more than 0.5) correlated with fatalities and injuries.
plot_correlation(marylandtornadoes)
6 features with more than 20 categories ignored!
Number_of_Tornadoes: 397 categories
Day: 31 categories
Date: 238 categories
Time: 275 categories
End_Date: 75 categories
End_Time: 141 categories
Warning in cor(x = structure(list(Year = c(1950, 1952, 1952, 1952, 1952, : the
standard deviation is zero
I am using a linear regression (formula y~x) to look for a relationship between two variables of note from above, Path_Length and Property_Loss. Based on the graph, this doesn’t look like there is a strong relationship. I’m going to move onto other ways of looking for insights.
I’m going to creature mutiple linear models, and pick some of the variables I’m interested in. Fatalities, Injuries, and Magnitude are all dependent variables (in fact, scientists determine a Tornado’s magnitude based on some of those other factors, and Fatalities/Injuries are usually the result of a tornado that has a severe presentation in Path_length and path_width) so I’m going to run a fit with theses variables. The correlation matrix helped me highlight some variables of note.
My result for fit1 has highlights at least two variables that are significant, Injuries and Path_Length, in relation with Fatalities. The R-squared value is also higher than 0.5 which means this is somewhat trustworthy data.
Fit2 shows a relationship where injuries and Path_Length continue to be significantly related to Fatalities, even though I removed some other variables in this model.
Call:
lm(formula = Fatalities ~ Injuries + Crop_Loss + Path_Length +
Path_Width, data = marylandtornadoes)
Residuals:
Min 1Q Median 3Q Max
-1.21114 -0.01564 0.00123 0.00566 1.97558
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.072e-03 9.602e-03 -0.320 0.749162
Injuries 2.074e-02 1.097e-03 18.906 < 2e-16 ***
Crop_Loss -9.461e-06 8.370e-06 -1.130 0.258940
Path_Length 8.014e-03 2.253e-03 3.556 0.000418 ***
Path_Width -1.003e-04 7.251e-05 -1.383 0.167303
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1416 on 425 degrees of freedom
Multiple R-squared: 0.5691, Adjusted R-squared: 0.565
F-statistic: 140.3 on 4 and 425 DF, p-value: < 2.2e-16
I noticed in my correlation matrix that of the Magnitude variable, while EF0, EF1, and EF2 were not highly correlated with other variables, the EF3 and EF4 were. I’ve created another model where now I’m focused more on the EF3 and EF4 tornadoes. This model shoes that EF3 tornadoes are signficantly related to the Fatalities variable. I would expect then, to see observations where EF3 tornadoes in Maryland might have fatalities whereas other types of tornadoes in Maryland do not.
ANOVA tests help determine that my models where I changed the variables for analysis (for example, removing Magnitude, only selecting certain magnitudes, yielded significant results. According to my p-value of 0.0007675, The p-value is small, showing that the Property_Loss variable contributes significantly to our model.
Based on this, I’m going to focus on highlighting Property_Loss in my visualizations.
Analysis of Variance Table
Model 1: Fatalities ~ Injuries + Property_Loss + Crop_Loss + Magnitude +
Path_Length + Path_Width
Model 2: Fatalities ~ Injuries + Crop_Loss + (Magnitude == "3") + (Magnitude ==
"4") + Path_Length + Path_Width
Res.Df RSS Df Sum of Sq F Pr(>F)
1 419 8.0632
2 423 8.1679 -4 -0.10473 1.3606 0.2468
R -Visualization
This bubble plot graph, with interactivity thanks to plotly, shows that the path and width of a tornado do not have to be large in order to do damage. There are many large “bubbles” in the lower left quadrant of the graph, and in fact many of those tornadoes were of a low magnitude. Because tornadoes are rated based on multiple factors, I suspect that some of those low magnitude, short-path tornadoes might have caused a lot of damage due to an unrepresented variable. I wonder if some of those tornadoes were in more developed areas?
scatterplot <-ggplot(marylandtornadoes, aes(x= Path_Length,y = Path_Width,size = Property_Loss,color = Magnitude))+geom_point(alpha=0.5)+labs(title ="Effect of Tornado Path and Width on Property Loss",subtitle ="Faster tornadoes are not responsible for the most damage",caption ="Source: NOAA")+xlab("Length of Tornado Path (mi)")+ylab("Width of Tornado Path (yd)")+scale_color_manual(values =c("black","violet","blue","cyan","yellow","red"),labels =c("-9"="Unknown", "0"="EF0", "1"="EF1", "2"="EF2","3"="EF3", "4"="EF4"))ggplotly(scatterplot)
Tableau Visualization
The follow Tableau Public link brings us to a map of Maryland, with occurences of tornadoes that ONLY occurred in Maryland (none of these tornadoes crossed the border to do damage in other states).This is to ensure we only look at Property losses to places in Maryland. We used our marylandpaths dataset and the marylandtornadoes dataset to build this map. Upon examination, we can see some of the more devastating tornadoes took place in populated, densely built areas such as Baltimore, Columbia and Elicott City–and in fact, those tornadoes all took place in June 2024!