DATA110 Project 1

Introduction

In this project I explore an earthquake dataset from the USGS. The dataset contains 1,000 earthquakes and 19 variables, and I focus on four of them: title (the name of the earthquake), location (where the earthquake took place), magnitude (the magnitude of the earthquake), and sig (the significance of the earthquake). I plan to use these variables to see which earthquakes, by name, occurred in which areas, and from that to identify which regions of the planet have the most frequent earthquakes. I will also use the magnitude and significance of these earthquakes to judge how impactful they were. The source of the data is: https://earthquake.usgs.gov/earthquakes/search/

Earthquake Map (Where Earthquakes Are Most Likely To Occur)

https://www.usgs.gov/news/national-news-release/new-usgs-map-shows-where-damaging-earthquakes-are-most-likely-occur-us

Load the Libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(scales)

Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor
library(RColorBrewer)

Load the dataset

setwd("/Users/natty/Downloads/DATA110")
earthquakes <- read_csv("earthquake_1995-2023.csv")
Rows: 1000 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): title, date_time, alert, net, magType, location, continent, country
dbl (11): magnitude, cdi, mmi, tsunami, sig, nst, dmin, gap, depth, latitude...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
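
The column-type message above can be avoided by declaring the types up front. The sketch below is one possible way to do that while reading only the columns the analysis needs. It parses a tiny inline CSV in place of the real file so that it runs on its own. The titles quake_a and quake_b are placeholders, not values from the dataset, and the location column is left out only because its values are not visible in this report; it could be added the same way with col_character().

```r
library(readr)

# Tiny inline CSV standing in for earthquake_1995-2023.csv.
# The titles are placeholders; the numbers echo the first two rows of the
# top-5 table printed later in this report.
csv_text <- "title,magnitude,sig,tsunami
quake_a,7.8,2910,0
quake_b,8.2,2910,1
"

# cols_only() keeps just the listed columns and fixes their types,
# so read_csv() does not have to guess or print the column specification.
quakes_small <- read_csv(
  I(csv_text),
  col_types = cols_only(
    title = col_character(),
    magnitude = col_double(),
    sig = col_double()
  )
)

quakes_small
```

For the real file, the first argument would be the file name instead of I(csv_text).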

Filter the Top 5 Earthquakes by Significance in Descending Order

#Creates a data frame named topEarthquakes by sorting earthquakes by significance from high to low
topEarthquakes <- earthquakes |> arrange(desc(sig)) |>
#After sorting from high to low, this keeps only the top 5 rows
head(n = 5)
#Displays the top 5 earthquakes with the highest significance
head(topEarthquakes)
# A tibble: 5 × 19
  title    magnitude date_time   cdi   mmi alert tsunami   sig net     nst  dmin
  <chr>        <dbl> <chr>     <dbl> <dbl> <chr>   <dbl> <dbl> <chr> <dbl> <dbl>
1 M 7.8 -…       7.8 06-02-20…     9     9 red         0  2910 us      118 1.92 
2 M 8.2 -…       8.2 08-09-20…     9     7 red         1  2910 us        0 0.944
3 M 7.2 -…       7.2 04-04-20…     9     9 red         0  2910 ci       10 0.514
4 M 6.6 -…       6.6 30-10-20…     9     8 red         0  2840 us        0 0.174
5 M 7.8 -…       7.8 25-04-20…     8     9 red         0  2820 us        0 1.86 
# ℹ 8 more variables: gap <dbl>, magType <chr>, depth <dbl>, latitude <dbl>,
#   longitude <dbl>, location <chr>, continent <chr>, country <chr>
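
The sort-then-head pattern can also be written as a single step with dplyr's slice_max(). The sketch below compares the two on a small stand-in data frame built from the magnitude and sig values printed in this report; it is not the full earthquakes data, and the names quakes, top_sorted, and top_sliced are only for illustration.

```r
library(dplyr)

# Magnitude and sig values copied from the rows printed in this report;
# this is a small stand-in for the full earthquakes data frame.
quakes <- tibble(
  magnitude = c(7.8, 8.2, 7.2, 6.6, 7.8, 9.1, 9.1, 8.8, 8.6, 8.6, 8.4),
  sig       = c(2910, 2910, 2910, 2840, 2820, 2184, 1274, 1991, 2048, 1138, 1086)
)

# Approach used in this report: sort high to low, then keep the first 5 rows
top_sorted <- quakes |> arrange(desc(sig)) |> head(n = 5)

# One-step alternative; with_ties = FALSE guarantees exactly 5 rows
# even when several earthquakes share the same sig value
top_sliced <- quakes |> slice_max(sig, n = 5, with_ties = FALSE)

top_sliced
```

By default slice_max() keeps ties, so it can return more than n rows. That matters here because three earthquakes share the highest sig value.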

Create the Linear Regression Graph with Labeled X and Y Axis, Title, Legend, Color Palette, and Linear Regression Line

#Creates a plot object named graph1 from the topEarthquakes data
graph1 <- topEarthquakes |>
#Maps magnitude to the x axis, significance to the y axis, and location to the color legend
ggplot(aes(x = magnitude, y = sig, color = location)) +
#Changes the color scheme used in the graph
scale_color_brewer(palette = "Set2") +
#Renames the y axis, x axis, and legend
labs(y = "Significance of the Earthquake",
    x = "Magnitude of the Earthquake",
    color = "Location") +
#Changes the theme to minimal, removing the gray background
theme_minimal(base_size = 12) +
#Plots one point per earthquake
geom_point() +
#Connects points that share a color group (each location has a single point here, so no segments are drawn)
geom_line() +
#Adds the linear regression line and colors it purple
geom_smooth(method = "lm", formula = y~x, color = "purple") +
#Title of the graph
ggtitle("Earthquake Significance Based on Magnitude")
#Displays the graph
graph1
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
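
The warning above comes from geom_line(). Because color = location is mapped in ggplot(), each location becomes its own group, and a group with a single point has nothing to connect. If a connecting line is wanted, one option is to put every point into a single group for that layer. This is a minimal sketch using the five magnitude and sig values from the top-5 table; the labels loc_a to loc_e are placeholders, since the real location names are not visible in the printed output.

```r
library(ggplot2)

top5 <- data.frame(
  magnitude = c(7.8, 8.2, 7.2, 6.6, 7.8),
  sig       = c(2910, 2910, 2910, 2840, 2820),
  location  = c("loc_a", "loc_b", "loc_c", "loc_d", "loc_e")  # placeholders
)

p <- ggplot(top5, aes(x = magnitude, y = sig, color = location)) +
  geom_point() +
  # group = 1 overrides the per-location grouping for this layer only;
  # the fixed color keeps the line from using the location colors
  geom_line(aes(group = 1), color = "grey60") +
  geom_smooth(method = "lm", formula = y ~ x, color = "purple")

p
```

Dropping geom_line() entirely is another reasonable choice, since a line joining five unrelated earthquakes adds little to a scatter plot.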

Correlation Between The Magnitude and Significance of the Earthquakes

cor(topEarthquakes$magnitude, topEarthquakes$sig)
[1] 0.3526549
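
cor() returns only the coefficient. As a hedged sketch, base R's cor.test() can show how much uncertainty surrounds a correlation computed from five points. The vectors below are copied from the magnitude and sig columns of the top-5 table so that the example runs on its own.

```r
# Values copied from the top-5 table printed in this report
magnitude <- c(7.8, 8.2, 7.2, 6.6, 7.8)
sig       <- c(2910, 2910, 2910, 2840, 2820)

# cor.test() reports the Pearson correlation together with a p-value
# and a confidence interval
result <- cor.test(magnitude, sig)

result$estimate   # the correlation coefficient
result$p.value    # p-value for the null hypothesis of zero correlation
result$conf.int   # 95% confidence interval for the correlation
```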

Getting the Linear Regression Equation

fit1 <- lm(sig ~ magnitude, data = topEarthquakes)
summary(fit1)

Call:
lm(formula = sig ~ magnitude, data = topEarthquakes)

Residuals:
  1   2   3   4   5 
 25  15  40 -15 -65 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   2690.0      288.8   9.314  0.00262 **
magnitude       25.0       38.3   0.653  0.56047   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 47.96 on 3 degrees of freedom
Multiple R-squared:  0.1244,    Adjusted R-squared:  -0.1675 
F-statistic: 0.4261 on 1 and 3 DF,  p-value: 0.5605

Linear Regression Equation

sig = 25.0(magnitude) + 2690.0
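
The equation can be read straight from the fitted model and used for prediction. The sketch below refits the model on the five values from the top-5 table; the magnitude 8.0 input is a hypothetical value chosen only to illustrate the arithmetic.

```r
# Values copied from the top-5 table printed in this report
top5 <- data.frame(
  magnitude = c(7.8, 8.2, 7.2, 6.6, 7.8),
  sig       = c(2910, 2910, 2910, 2840, 2820)
)
fit <- lm(sig ~ magnitude, data = top5)

# Intercept and slope, in that order
coef(fit)

# Predicted significance for a hypothetical magnitude 8.0 earthquake,
# i.e. 25.0 * 8.0 + 2690.0
predicted <- predict(fit, newdata = data.frame(magnitude = 8.0))
predicted
```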

Diagnosis Based on P-Value, R^2, and Plots

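With a p-value of 0.5605, far above the usual 0.05 cutoff, the slope for magnitude is not statistically significant, so I cannot reject the null hypothesis that the slope is zero. The R^2 of 0.1244 means magnitude explains only about 12% of the variation in significance, and the correlation of 0.35 is weak. The plot agrees: the points do not sit close to the linear regression line. Because this fit uses only the five most significant earthquakes, it says little about the dataset as a whole.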
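
The numbers behind this diagnosis can be pulled out of the model summary instead of being read off the printout. This is a small sketch on the same five values from the top-5 table. It also checks a general fact about one-predictor regressions, namely that R^2 equals the squared correlation.

```r
# Values copied from the top-5 table printed in this report
top5 <- data.frame(
  magnitude = c(7.8, 8.2, 7.2, 6.6, 7.8),
  sig       = c(2910, 2910, 2910, 2840, 2820)
)
fit <- lm(sig ~ magnitude, data = top5)
fit_summary <- summary(fit)

# Share of the variation in sig explained by magnitude
r_squared <- fit_summary$r.squared

# With a single predictor, R^2 is the square of the correlation
r <- cor(top5$magnitude, top5$sig)
all.equal(r_squared, r^2)

# p-value for the magnitude slope, taken from the coefficient table
slope_p <- coef(fit_summary)["magnitude", "Pr(>|t|)"]
slope_p
```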

Filtering For The Top 8 Magnitudes

#Creates a data frame named topMags by sorting earthquakes by magnitude from high to low
topMags <- earthquakes |> arrange(desc(magnitude)) |>
#After sorting from high to low, this keeps only the top 8 rows
head(n = 8)
#Displays the first 6 rows of the filtered data (the default for head)
head(topMags)
# A tibble: 6 × 19
  title    magnitude date_time   cdi   mmi alert tsunami   sig net     nst  dmin
  <chr>        <dbl> <chr>     <dbl> <dbl> <chr>   <dbl> <dbl> <chr> <dbl> <dbl>
1 M 9.1 -…       9.1 11-03-20…     9     8 <NA>        0  2184 offi…   541     0
2 M 9.1 -…       9.1 26-12-20…     0     8 <NA>        0  1274 offi…   601     0
3 M 8.8 -…       8.8 27-02-20…     8     8 <NA>        0  1991 offi…   454     0
4 M 8.6 -…       8.6 11-04-20…     9     7 yell…       0  2048 offi…   499     0
5 M 8.6 -…       8.6 28-03-20…     0     8 <NA>        0  1138 offi…   510     0
6 M 8.4 -…       8.4 12-09-20…     0     6 <NA>        0  1086 offi…   411     0
# ℹ 8 more variables: gap <dbl>, magType <chr>, depth <dbl>, latitude <dbl>,
#   longitude <dbl>, location <chr>, continent <chr>, country <chr>
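
The printout above shows 6 rows even though topMags holds 8, because head() returns 6 rows unless n is given. The sketch below demonstrates this on a stand-in tibble built from magnitudes printed in this report; quakes and top_mags are illustrative names, not the real data frames.

```r
library(dplyr)

# Magnitudes copied from rows printed in this report (a stand-in, not the full data)
quakes <- tibble(
  magnitude = c(7.8, 8.2, 7.2, 6.6, 7.8, 9.1, 9.1, 8.8, 8.6, 8.6, 8.4)
)
top_mags <- quakes |> arrange(desc(magnitude)) |> head(n = 8)

nrow(top_mags)         # all 8 rows are stored
nrow(head(top_mags))   # head() with no n returns only the first 6

# Either of these shows every row
head(top_mags, n = 8)
print(top_mags, n = Inf)
```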

Creating A Bar Chart With the Locations of the Earthquakes with the Top 8 Magnitudes

#Creates a plot object named graph2 from the topMags data
graph2 <- topMags |>
#Creates bar graph with x axis, y axis, and legend
  ggplot() +
  geom_bar(aes(x = title, y = magnitude, fill = location),
#Uses the magnitude values as the bar heights instead of counting rows
      position = "dodge", stat = "identity") +
#Renames the y axis, x axis, legend, title, and caption
  labs(y = "Magnitude of the Earthquake",
       x = "Name of the Earthquake",
       fill = "Location of the Earthquake",
       title = "Bar Chart of the Names and Locations of the Top 8 Magnitude Earthquakes",
       caption = "Earthquake Dataset from USGS") +
#Changes the theme to minimal, eliminating background annotations and making the text smaller
  theme_minimal(base_size = 8) +
#Tilts the x axis text so they do not overlap
  theme(axis.text.x = element_text(angle = 28)) +
#Changes the color scheme used in the graph
  scale_fill_brewer(palette = "Dark2")
#Calls the graph so it can display
graph2
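
As a possible refinement, geom_col() is ggplot2's shorthand for geom_bar(stat = "identity"), and reorder() can sort the bars by magnitude instead of leaving them in the default alphabetical order of the titles. The sketch below uses the six magnitudes visible in the topMags printout; the titles quake_1 to quake_6 are placeholders because the real titles are cut off in the output.

```r
library(ggplot2)

# Magnitudes copied from the printed topMags rows; the titles are placeholders
top_mags <- data.frame(
  title     = c("quake_1", "quake_2", "quake_3", "quake_4", "quake_5", "quake_6"),
  magnitude = c(9.1, 9.1, 8.8, 8.6, 8.6, 8.4)
)

p <- ggplot(top_mags) +
  # geom_col() uses the y values as bar heights, like geom_bar(stat = "identity");
  # reorder() sorts the bars from highest to lowest magnitude
  geom_col(aes(x = reorder(title, -magnitude), y = magnitude)) +
  labs(x = "Name of the Earthquake", y = "Magnitude of the Earthquake")

p
```

The fill, theme, and palette settings from graph2 could be added to this plot unchanged.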

End of Project Essay

In this project I prepared my dataset in a few ways. First, I sorted the dataset and kept the top 5 most significant earthquakes. For this, I used arrange() with desc(), which sorts the rows from highest to lowest, and then head() to keep only the top “x” rows, which gave me the data for my first graph. For the second graph I used arrange(), desc(), and head() again, this time sorting by magnitude in descending order and keeping the top 8 rows.

My second graph speaks to me the most. It shows the 8 earthquakes with the highest magnitudes along with their names and locations. A pattern I picked up from this visualization is that most of the earthquakes with the highest magnitudes are located in places surrounded by water. Almost all of them occurred on islands or on land that is mostly bordered by water.

Finally, if I could improve one thing in this project, it would be choosing a better dataset. When I picked this dataset I thought earthquakes would be interesting to research, and I was quite fascinated at first. However, as I worked with the data, I found that many variables were not very relevant to my research, such as whether the earthquake created a tsunami. I still found the dataset interesting, but by the time I noticed there were not many variables to work with, I did not want to switch to a different one.