DATA 606 Data Project Proposal

Data Preparation

# load packages and data
library(dplyr)
library(stringr)
library(ggplot2)
earthquakes <- read.csv("all_month.csv")

# Extract country, US state, or region name from earthquakes$place and create earthquakes$location
earthquakes$location <- sub('.*,\\s*', '', earthquakes$place)

Here we load all_month.csv, a file containing the locations and magnitudes of earthquakes recorded over the past 30 days (in this case, February 29 to March 29, 2020).
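To make the extraction step concrete, here is a small sketch of what the regular expression does. The place strings below are made up to resemble the feed's format; they are not rows from all_month.csv.

```r
# Illustrative place strings (hypothetical, not taken from the data file)
place <- c(
  "10km NE of Ridgecrest, CA",
  "45 km SW of Puerto Madero, Mexico",
  "South of the Fiji Islands"
)

# '.*,' greedily matches everything up to the LAST comma and '\\s*' removes
# the whitespace after it, so only the trailing region name remains.
location <- sub('.*,\\s*', '', place)
location
# "CA" "Mexico" "South of the Fiji Islands"
```

Strings without a comma pass through unchanged, so the resulting location column can mix US state abbreviations, country names, and free-form region descriptions. Checking table(earthquakes$location) before grouping is probably worthwhile.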

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

Did certain locations (countries, regions, or US states) experience a disproportionate share of powerful earthquakes in the past 30 days? Which locations experienced the most powerful earthquakes? Are the depths of these earthquakes normally distributed?
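One possible way to make "disproportionate share" measurable is to compare each location's share of powerful earthquakes with its share of all earthquakes. The sketch below uses a toy data frame with hypothetical values, and the cutoff of 4.5 for "powerful" is an assumption chosen only for illustration, not a threshold fixed by this proposal.

```r
# Hypothetical toy data standing in for the earthquakes data frame
quakes <- data.frame(
  location = c("Alaska", "Alaska", "Alaska", "CA", "CA", "CA", "CA",
               "Japan", "Japan", "Chile"),
  mag      = c(1.2, 2.5, 5.1, 0.8, 1.1, 1.9, 4.6, 5.3, 4.9, 6.0)
)
strong_cutoff <- 4.5  # assumed definition of a "powerful" earthquake

# Share of all recorded earthquakes per location
all_share <- prop.table(table(quakes$location))

# Share of powerful earthquakes per location (same levels, so zeros are kept)
strong <- quakes[!is.na(quakes$mag) & quakes$mag >= strong_cutoff, ]
strong_share <- prop.table(
  table(factor(strong$location, levels = names(all_share)))
)

comparison <- data.frame(
  location        = names(all_share),
  share_of_all    = as.numeric(all_share),
  share_of_strong = as.numeric(strong_share)
)

# A ratio above 1 means the location has more than its proportional share
comparison$ratio <- comparison$share_of_strong / comparison$share_of_all
comparison[order(-comparison$ratio), ]
```

A ratio well above 1 would suggest a location is over-represented among powerful earthquakes relative to how often earthquakes are recorded there at all.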

Cases

What are the cases, and how many are there?

nrow(earthquakes)

## [1] 12353

Each case is an earthquake that was detected and recorded by the USGS during the past 30 days.

Data collection

Describe the method of data collection.
The United States Geological Survey (USGS) is an agency that tracks and collects data on earthquakes globally. The data used here were downloaded as a CSV file from its public earthquake feed.

Type of study

What type of study is this (observational/experiment)?
This is an observational study because we neither attempt nor are able to intervene in the strength, frequency, or location of earthquakes.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.
https://earthquake.usgs.gov/earthquakes/feed/v1.0/csv.php

Dependent Variable

What is the response variable? Is it quantitative or qualitative?
The magnitude of the earthquake is the response variable, and it is quantitative (numerical).

Independent Variable

You should have two independent variables, one quantitative and one qualitative.
Location and depth will be my two independent variables. Location is qualitative (country, region, or US state) while depth is a quantitative variable as it is a number measured in kilometers.
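As a rough sketch of how these two predictors could be examined against magnitude, a scatter plot suits the quantitative predictor (depth) and a boxplot suits the qualitative one (location). The data frame quakes below is a toy example with hypothetical values; only the column names mag, depth, and location follow this proposal.

```r
# Hypothetical toy data; not rows from all_month.csv
quakes <- data.frame(
  location = c("Alaska", "Alaska", "CA", "CA", "Japan", "Japan"),
  depth    = c(10, 35, 5, 8, 40, 120),
  mag      = c(1.5, 2.8, 0.9, 1.4, 4.7, 5.2)
)

# Quantitative predictor: magnitude against depth
plot(quakes$depth, quakes$mag,
     xlab = "Depth (km)", ylab = "Magnitude",
     main = "Magnitude vs. depth")

# Qualitative predictor: distribution of magnitude within each location
boxplot(mag ~ location, data = quakes,
        xlab = "Location", ylab = "Magnitude",
        main = "Magnitude by location")

# Numeric counterpart of the boxplot: median magnitude per location
median_mag <- tapply(quakes$mag, quakes$location, median)
median_mag
```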

Relevant summary statistics

Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc.). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

summary(earthquakes$mag)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -1.390   0.660   1.200   1.395   1.950   7.500       7

summary(earthquakes$depth)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -3.41    3.17    7.79   17.54   12.72  634.87

location_freq <- as.data.frame(table(earthquakes$location))
colnames(location_freq) <- c("location_name","frequency")

# histogram for earthquake magnitude
hist(earthquakes$mag, breaks = 100)

# histogram for earthquake depth
hist(earthquakes$depth, breaks = 100)

# barplot for frequency of earthquakes by location
location_plot <- ggplot(data = location_freq, aes(location_name, frequency)) +
  geom_col() +
  theme(
    axis.text.x = element_text(size = 4, angle = 90, hjust = 1, vjust = 0),
    plot.title = element_text(hjust = 0.5)
  ) +
  ggtitle("Earthquake Frequency by Location")
location_plot
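For the question of whether depth is normally distributed, the summary above already hints at the answer: the mean depth (17.54) is well above the median (7.79) and the maximum is 634.87, which points to a strong right skew. A normal Q-Q plot is one way to check this visually. The sketch below uses a small vector of hypothetical depths, not the real data, to show the idea.

```r
# Hypothetical depths in km, chosen to be right-skewed like the real summary
depth_km <- c(1, 2, 3, 5, 8, 10, 12, 35, 150, 600)

# For a roughly symmetric distribution the mean and median would be close;
# a large positive gap suggests right skew.
skew_gap <- mean(depth_km) - median(depth_km)
skew_gap

# Points that bend away from the reference line indicate non-normality
qqnorm(depth_km, main = "Normal Q-Q plot of depth")
qqline(depth_km)
```

With the real data, the same calls would take earthquakes$depth. A formal test is also possible, but shapiro.test() only accepts between 3 and 5000 values, so with 12,353 cases it would have to be run on a random sample.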