In this project I explore an earthquake dataset from the USGS. The copy used here has 1,000 rows and 19 columns, so there are many variables to work with. I focus on four of them: title (the name of the earthquake), location (where the earthquake took place), magnitude (the magnitude of the earthquake), and sig (the significance of the earthquake). I plan to use these variables to see which earthquakes, by name, occurred in which areas, so that I can identify the regions of the planet with the most frequent earthquakes. I will also look at the magnitude and significance of these earthquakes to judge how impactful they were. The source of the data is: https://earthquake.usgs.gov/earthquakes/search/
Earthquake Map (Where Earthquakes Are Most Likely To Occur)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(scales)
Attaching package: 'scales'
The following object is masked from 'package:purrr':
discard
The following object is masked from 'package:readr':
col_factor
Rows: 1000 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): title, date_time, alert, net, magType, location, continent, country
dbl (11): magnitude, cdi, mmi, tsunami, sig, nst, dmin, gap, depth, latitude...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Filter the Top 5 Earthquakes by Significance in Descending Order
# Creates subset named topEarthquakes that takes data from earthquakes and sorts the significance from high to low
topEarthquakes <- earthquakes |>
  arrange(desc(sig)) |>
  # After sorting from high to low, this takes only the top 5 rows
  head(n = 5)

# Displays the top 5 earthquakes with the highest significance
head(topEarthquakes)
# A tibble: 5 × 19
title magnitude date_time cdi mmi alert tsunami sig net nst dmin
<chr> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 M 7.8 -… 7.8 06-02-20… 9 9 red 0 2910 us 118 1.92
2 M 8.2 -… 8.2 08-09-20… 9 7 red 1 2910 us 0 0.944
3 M 7.2 -… 7.2 04-04-20… 9 9 red 0 2910 ci 10 0.514
4 M 6.6 -… 6.6 30-10-20… 9 8 red 0 2840 us 0 0.174
5 M 7.8 -… 7.8 25-04-20… 8 9 red 0 2820 us 0 1.86
# ℹ 8 more variables: gap <dbl>, magType <chr>, depth <dbl>, latitude <dbl>,
# longitude <dbl>, location <chr>, continent <chr>, country <chr>
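Sorting and then taking the first rows works, but dplyr also has slice_max(), which picks the rows with the largest values of a column in one step. The sketch below compares the two approaches on a small made-up data frame (`toy`, with hypothetical titles and sig values) rather than on the real dataset. `with_ties = FALSE` is set so that exactly n rows come back even if some values are tied.

```r
library(dplyr)

# Hypothetical toy data standing in for the earthquakes tibble
toy <- tibble(
  title = c("A", "B", "C", "D", "E", "F"),
  sig   = c(650, 2910, 1200, 2840, 900, 2820)
)

# Approach used in this project: sort descending, then keep the first rows
top_sorted <- toy |>
  arrange(desc(sig)) |>
  head(n = 3)

# One-step alternative: slice_max() selects the n largest values of sig
top_sliced <- toy |>
  slice_max(sig, n = 3, with_ties = FALSE)

print(top_sorted)
print(top_sliced)
```

Both versions should return the same three rows, so which one to use is mostly a matter of readability.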
Create the Linear Regression Graph with Labeled X and Y Axis, Title, Legend, Color Palette, and Linear Regression Line
# Creates subset named graph1 that takes data from topEarthquakes
graph1 <- topEarthquakes |>
  # Creates graph with x axis, y axis, and the legend
  ggplot(aes(x = magnitude, y = sig, color = location)) +
  # Changes the color scheme used in the graph
  scale_color_brewer(palette = "Set2") +
  # Renames the y axis, x axis, and legend
  labs(y = "Significance of the Earthquake",
       x = "Magnitude of the Earthquake",
       color = "Location") +
  # Changes the theme to minimal, eliminating any background annotations
  theme_minimal(base_size = 12) +
  geom_point() +
  geom_line() +
  # Creates the linear regression line and changes the color of it to purple
  geom_smooth(method = "lm", formula = y ~ x, color = "purple") +
  # Title of the graph
  ggtitle("Earthquake's Significance Based On Magnitude")

# Calls the graph
graph1
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
Correlation Between The Magnitude and Significance of the Earthquakes
cor(topEarthquakes$magnitude, topEarthquakes$sig)
[1] 0.3526549
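The correlation can be checked without the full dataset by retyping the five magnitude and sig values from the top-5 table. The sketch below does that in a hypothetical data frame named `top5`. It also squares the correlation, since for a straight-line model with one predictor the squared correlation is the share of the variation in sig that magnitude accounts for.

```r
# Magnitude and sig values copied from the top-5 table printed earlier
top5 <- data.frame(
  magnitude = c(7.8, 8.2, 7.2, 6.6, 7.8),
  sig       = c(2910, 2910, 2910, 2840, 2820)
)

# Pearson correlation between magnitude and significance
r <- cor(top5$magnitude, top5$sig)

# Squared correlation: share of variation in sig explained by magnitude
r_squared <- r^2

print(c(r = r, r_squared = r_squared))
```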
Getting the Linear Regression Equation
fit1 <- lm(sig ~ magnitude, data = topEarthquakes)
summary(fit1)
Call:
lm(formula = sig ~ magnitude, data = topEarthquakes)
Residuals:
1 2 3 4 5
25 15 40 -15 -65
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2690.0 288.8 9.314 0.00262 **
magnitude 25.0 38.3 0.653 0.56047
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 47.96 on 3 degrees of freedom
Multiple R-squared: 0.1244, Adjusted R-squared: -0.1675
F-statistic: 0.4261 on 1 and 3 DF, p-value: 0.5605
Linear Regression Equation
sig = 25.0(magnitude) + 2690.0
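To see what the equation means in practice, it can be applied by hand to the five earthquakes. The sketch below retypes the magnitude and sig values from the top-5 table into a hypothetical data frame (`top5`), refits the model, and compares the hand-computed predictions with the fitted values from lm().

```r
# Magnitude and sig values copied from the top-5 table printed earlier
top5 <- data.frame(
  magnitude = c(7.8, 8.2, 7.2, 6.6, 7.8),
  sig       = c(2910, 2910, 2910, 2840, 2820)
)

fit <- lm(sig ~ magnitude, data = top5)

# Apply the equation by hand: sig = 25.0 * magnitude + 2690.0
by_hand <- 25.0 * top5$magnitude + 2690.0

# Residual = observed significance minus predicted significance
comparison <- data.frame(
  magnitude = top5$magnitude,
  observed  = top5$sig,
  predicted = by_hand,
  residual  = top5$sig - by_hand
)
print(comparison)

# Check that the hand calculation agrees with the fitted model
print(all.equal(unname(fitted(fit)), by_hand))
```

For example, a magnitude of 7.8 gives a predicted significance of 25.0 × 7.8 + 2690.0 = 2885, so the first earthquake (observed sig 2910) has a residual of 25, which matches the first residual in the summary output.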
Diagnosis Based on P-Value, R^2, and Plots
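Based on the p-value, the R^2 value, and the plot, magnitude and significance are only weakly related among these five earthquakes. The correlation is about 0.35, and the R^2 of 0.1244 means that magnitude explains only about 12% of the variation in significance. The p-value for the slope is 0.56, far above the usual 0.05 cutoff, so the slope is not statistically significant and could easily be zero. The plot agrees: the points do not sit close to the linear regression line. Because the model uses only the five most significant earthquakes, this result does not say much about the relationship across the whole dataset.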
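Another way to look at the same diagnosis is a confidence interval for the slope. The sketch below refits the model on the five magnitude and sig values from the top-5 table (retyped into a hypothetical data frame, `top5`) and uses confint() from base R. If the 95% interval for the magnitude coefficient contains zero, the data are consistent with no linear relationship at all.

```r
# Magnitude and sig values copied from the top-5 table printed earlier
top5 <- data.frame(
  magnitude = c(7.8, 8.2, 7.2, 6.6, 7.8),
  sig       = c(2910, 2910, 2910, 2840, 2820)
)

fit <- lm(sig ~ magnitude, data = top5)

# 95% confidence intervals for the intercept and the slope
ci <- confint(fit, level = 0.95)
print(ci)

# TRUE when the interval for the slope includes zero
slope_includes_zero <- ci["magnitude", 1] < 0 && ci["magnitude", 2] > 0
print(slope_includes_zero)
```

With only five points and three residual degrees of freedom, the interval is very wide, which is another sign that this sample is too small to support a firm conclusion.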
Filtering For The Top 8 Magnitudes
# Creates subset named topMags that takes data from earthquakes and sorts magnitude from high to low
topMags <- earthquakes |>
  arrange(desc(magnitude)) |>
  # After sorting from high to low, this takes only the top 8 rows
  head(n = 8)

# Displays the filtered data (head() shows the first 6 rows by default)
head(topMags)
# A tibble: 6 × 19
title magnitude date_time cdi mmi alert tsunami sig net nst dmin
<chr> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 M 9.1 -… 9.1 11-03-20… 9 8 <NA> 0 2184 offi… 541 0
2 M 9.1 -… 9.1 26-12-20… 0 8 <NA> 0 1274 offi… 601 0
3 M 8.8 -… 8.8 27-02-20… 8 8 <NA> 0 1991 offi… 454 0
4 M 8.6 -… 8.6 11-04-20… 9 7 yell… 0 2048 offi… 499 0
5 M 8.6 -… 8.6 28-03-20… 0 8 <NA> 0 1138 offi… 510 0
6 M 8.4 -… 8.4 12-09-20… 0 6 <NA> 0 1086 offi… 411 0
# ℹ 8 more variables: gap <dbl>, magType <chr>, depth <dbl>, latitude <dbl>,
# longitude <dbl>, location <chr>, continent <chr>, country <chr>
Creating A Bar Chart With the Locations of the Earthquakes with the Top 8 Magnitudes
# Creates a subset named graph2 that takes data from topMags
graph2 <- topMags |>
  # Creates bar graph with x axis, y axis, and legend
  ggplot() +
  geom_bar(aes(x = title, y = magnitude, fill = location),
           # Shows the bars as is in the dataset
           position = "dodge", stat = "identity") +
  # Renames the y axis, x axis, legend, title, and caption
  labs(y = "Magnitude of the Earthquake",
       x = "Name of the Earthquake",
       fill = "Location of the Earthquake",
       title = "Bar Chart of the Names and Locations of the Top 8 Magnitude Earthquakes",
       caption = "Earthquake Dataset from USGS") +
  # Changes the theme to minimal, eliminating background annotations and making the text smaller
  theme_minimal(base_size = 8) +
  # Tilts the x axis text so they do not overlap
  theme(axis.text.x = element_text(angle = 28)) +
  # Changes the color scheme used in the graph
  scale_fill_brewer(palette = "Dark2")

# Calls the graph so it can display
graph2
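Tilting the x axis text is one way to keep long earthquake names from overlapping. A possible alternative, sketched below, is to order the bars by magnitude with reorder() and flip the chart with coord_flip() so the names are read horizontally. The data frame `toy` and all of its titles, magnitudes, and locations are made up for illustration and are not taken from the dataset.

```r
library(ggplot2)

# Hypothetical stand-in for topMags; all values are made up
toy <- data.frame(
  title     = c("Quake A", "Quake B", "Quake C", "Quake D"),
  magnitude = c(8.6, 9.1, 8.4, 8.8),
  location  = c("Region 1", "Region 2", "Region 3", "Region 4")
)

bar_sketch <- ggplot(toy, aes(x = reorder(title, magnitude),
                              y = magnitude,
                              fill = location)) +
  # geom_col() draws bars at the values given, like stat = "identity"
  geom_col() +
  # Flips the axes so the earthquake names are read horizontally
  coord_flip() +
  labs(x = "Name of the Earthquake",
       y = "Magnitude of the Earthquake",
       fill = "Location of the Earthquake") +
  theme_minimal(base_size = 8) +
  scale_fill_brewer(palette = "Dark2")

print(bar_sketch)
```

Ordering the bars makes the ranking visible at a glance, at the cost of no longer showing the earthquakes in alphabetical order by name.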
End of Project Essay
In this project I prepared my dataset in a couple of ways. First, I sorted the dataset and kept the top 5 most significant earthquakes. For this, I used arrange(desc(sig)), which sorts the rows from the highest significance to the lowest, and then the head function, which keeps only the top “x” rows. Together these gave me the specific piece of data I needed to create my first graph. For the second graph I again used arrange(desc()) and head, but this time I sorted the magnitudes in descending order to get the data for my second graph. My second graph speaks to me the most. It shows the top 8 earthquakes with the highest magnitudes along with their names and locations. A pattern that I picked up from this visualization is that most of the earthquakes with the highest magnitudes are located in places surrounded by water. Almost all of these earthquakes occurred on islands or on land that is mostly surrounded by water. Finally, what I wish I could improve on in this project is the choice of dataset. When I first found this dataset I thought it would be interesting to research earthquakes, and I was quite fascinated with it at first. However, as I got into working with the data, I started to see many variables that were not very relevant to my research, such as whether the earthquake created a tsunami or not. I still thought the dataset was interesting, but by the time I noticed there were not many variables to work with, I did not want to find a different dataset.