For this project, you will look at the relationship between changes in temperature and oceanic storms. You will need to load the following libraries to complete the assignment.
library(dplyr)
library(ggplot2)
library(tidyr)
Two data sets are used. The data set containing temperatures is named
“temperature.csv” and can be found in Canvas inside of the Data folder
in Files. Download the data set, then load it using the
read.csv function.
(Hint: You can use the “Import Dataset > From Text (base)” command in the “File” menu to search for the downloaded file. Then, copy and paste the command from the console below.)
The data set contains 142 rows, one for each year from 1880 to 2021. It contains 13 columns: the year of the observation and one column for each of the 12 months. Each value represents the temperature anomaly for that month in the Northern Hemisphere.
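A minimal sketch of the loading step, for illustration only. The real file lives in Canvas, so this example writes a tiny stand-in file first; the stand-in values, the month column names, and the `csv_path` variable are assumptions, not the actual data.

```r
# Stand-in for temperature.csv: two made-up rows with the documented layout
# (Year plus 12 month columns). The month column names are an assumption.
csv_path <- tempfile(fileext = ".csv")
demo <- data.frame(Year = c(1880, 1881))
for (m in month.abb) demo[[m]] <- 0   # placeholder values, not real anomalies
write.csv(demo, csv_path, row.names = FALSE)

# In the assignment, replace csv_path with the path to the downloaded file
Temperature <- read.csv(csv_path)

# Sanity checks: the real file should report 142 rows and 13 columns
dim(Temperature)
names(Temperature)
```

Checking `dim()` right after loading is a quick way to confirm that the file was read with the expected shape before any reshaping is attempted.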
The data set containing information on each storm can be found in the
dplyr library. It will be used for questions 4 and 5. Load
it using the following command.
data(storms)
storms contains 11,859 observations of oceanic storms in
the Atlantic. Each observation is a measurement of the storm, taken
every 6 hours. The data set contains the name of the storm, the date
& time, longitude & latitude, category, wind speed &
pressure, and diameter.
Use the pivot_longer(...) command from the
tidyr package to shrink the number of columns in the
temperature data.frame to 3 – the Year, Month, and
Temperature.
(See Section 4.3)
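As a rough illustration of what `pivot_longer()` does, the sketch below reshapes a small made-up table. The values and the `wide` and `long` names are hypothetical; only the `Year`, `Month`, and `Temperature` column names come from the instructions.

```r
library(tidyr)

# Toy wide table (made-up values; only three month columns for brevity)
wide <- data.frame(
  Year = c(1880, 1881),
  Jan  = c(-0.3, -0.2),
  Feb  = c(-0.5, -0.1),
  Mar  = c(-0.2,  0.0)
)

# Every column except Year is stacked into Month / Temperature pairs
long <- pivot_longer(wide,
                     cols = -Year,
                     names_to = "Month",
                     values_to = "Temperature")

dim(long)   # 2 years x 3 months = 6 rows, 3 columns
```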
## Summarizing Temperature Anomalies
# Reshape the temperature data frame from 13 columns to 3 (Year, Month, Temperature)
# This chunk would not knit, so the code is left commented out
# (likely cause: Temperature was never loaded with read.csv earlier in this document)
#TemperatureLong <- pivot_longer(Temperature,
#                                cols = -Year,
#                                names_to = "Month",
#                                values_to = "Temperature")
# Check the results
#TemperatureLong
Calculate the average temperature anomaly for each year. Then, plot
the temperature anomaly across time (line plot) using
ggplot, including a smooth trend line.
Discuss what is happening to average temperatures across time.
(See Section 4.2 for summarizing data & 3. Application for an
example line plot. For the trend line, use geom_smooth().
Don’t worry about adding method="lm".)
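The summarize-then-plot step could look roughly like the sketch below. It runs on made-up data, so the `long`, `yearly`, and `p` names and all values are assumptions; the real analysis would use the reshaped temperature data instead.

```r
library(dplyr)
library(ggplot2)

# Made-up long-format data: 30 years x 12 months with a mild upward drift
set.seed(1)
long <- data.frame(
  Year        = rep(1991:2020, each = 12),
  Month       = rep(month.abb, times = 30),
  Temperature = rep(seq(0, 1, length.out = 30), each = 12) + rnorm(360, sd = 0.2)
)

# One row per year: the mean of that year's 12 monthly anomalies
yearly <- long %>%
  group_by(Year) %>%
  summarize(AvgTemperature = mean(Temperature))

# Line plot with the default smoother (no method argument needed)
p <- ggplot(yearly, aes(x = Year, y = AvgTemperature)) +
  geom_line() +
  geom_smooth() +
  labs(x = "Year", y = "Average Temperature Anomaly")
p
```

Leaving `method` unset lets `geom_smooth()` pick a flexible smoother, which can follow bends in the trend that a straight line would hide.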
## Summary Code
# Calculate the average temperature anomaly for each year
# TemperatureLong is not recognized when knitting (see above), so this is commented out too
#AverageAnomaly <- TemperatureLong %>%
#  group_by(Year) %>%
#  summarize(AvgTemperature = mean(Temperature))
#AverageAnomaly
## Plot Code
# Line plot of the yearly average anomaly with a trend line
# AverageAnomaly is not recognized when knitting, so this is commented out
#ggplot(AverageAnomaly, aes(x = Year, y = AvgTemperature)) +
#  geom_line() +
#  geom_smooth(method = "lm", se = FALSE) +
#  labs(title = "Average Temperature Anomaly Over Time",
#       x = "Year",
#       y = "Average Temperature Anomaly")
Overall, the average temperature anomaly increases over time, but not smoothly. There are many year-to-year ups and downs: the trend peaks around 1940, dips slightly around 1970, and then rises slowly (and still unevenly) after 1970.
Perform a linear regression with Year as the independent
\((x)\) variable and
Temperature as the dependent \((y)\) variable. Show the summary and
discuss the relationship.
Plot the residuals from the regression. Do you observe any patterns?
(See Section 4.4-5 for linear regression examples and plotting
residuals. If you use ggplot for the residual plot, you
will want to place the residuals & fitted values into a new
data.frame first.)
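One way to follow the hint about a separate data.frame is sketched below. The data are made up and deliberately curved so that the residual pattern is visible; `toy`, `fit`, `res_df`, and `res_plot` are hypothetical names.

```r
library(ggplot2)

# Made-up curved data so the residual pattern is easy to see
toy <- data.frame(Year = 1:50)
toy$Temperature <- 0.001 * toy$Year^2

fit <- lm(Temperature ~ Year, data = toy)

# Collect fitted values and residuals in their own data frame for ggplot
res_df <- data.frame(
  Fitted    = fitted(fit),
  Residuals = residuals(fit)
)

res_plot <- ggplot(res_df, aes(x = Fitted, y = Residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")
res_plot
```

In this toy case the residuals are positive at both ends and negative in the middle, the kind of systematic pattern that signals a straight line is missing some curvature.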
## Regression Code
# Perform the linear regression
# Commented out for the same reason: TemperatureLong is not found when knitting
#LinearRegression <- lm(Temperature ~ Year, data = TemperatureLong)
#summary(LinearRegression)
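The regression residuals range from a minimum of -1.07 to a maximum of 1.267, and the R-squared is 0.625, meaning Year explains about 62.5% of the variation in the temperature anomaly. The summary output was somewhat hard to interpret because the values ran together and were difficult to read.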
## Residual Plot Code
# Plot the residuals against Year
#ResidualsPlot <- ggplot(TemperatureLong, aes(x = Year, y = residuals(LinearRegression))) +
#  geom_point() +
#  geom_hline(yintercept = 0, linetype = "dashed", color = "blue") +
#  labs(title = "Residuals Plot",
#       x = "Year",
#       y = "Residuals")
#print(ResidualsPlot)
The dashed line in the residuals plot is horizontal because it is drawn at zero. Around it, the residuals follow a wave-like pattern: they trend downward, then upward, then downward, and then upward again, which suggests that a straight line does not fully capture the trend. The points are fairly close together and follow this pattern consistently. The residuals do not fall below -1.0 after about 1900, and they do not rise above 1.0 until just before 2020.
Restrict the number of variables in the storms data set
to the following: name, year,
category, and wind. Rename the
year column to Year (capitalize) so
that it is consistent with the Temperature data set.
Convert the “category” column to a number using the
as.numeric function.
Finally, group the data set by the name of the storm and summarize
the data. Let Year be the starting year of the storm,
category is the highest category, wind is the
average wind speed, and duration is the number of
observations associated with each storm.
What happened to the “category” column when it was converted to a number? How can you correct it?
(See Section 4.1 for cleaning the data & 4.2 for summarizing
it. You can use levels(x) to see the different categories
of categorical variable x.)
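The sketch below illustrates a common pitfall when converting a factor with `as.numeric`. It assumes the category column is stored as a factor whose levels look like -1, 0, 1, ..., 5, which may not hold in every version of dplyr; the toy `category` vector is made up.

```r
# Toy factor mimicking a storm category column (levels assumed: -1, 0, 1, ..., 5)
category <- factor(c("-1", "0", "3", "5"),
                   levels = c("-1", "0", "1", "2", "3", "4", "5"),
                   ordered = TRUE)

levels(category)   # the labels the factor can take

# as.numeric() on a factor returns the position of each level, not its label
codes <- as.numeric(category)                 # 1 2 5 7

# Converting to character first recovers the labels as numbers
values <- as.numeric(as.character(category))  # -1 0 3 5
```

Under that assumption, the direct conversion shifts every category up by two, and going through `as.character()` is one way to keep the original values.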
## Cleaning Storms
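# Keep four variables and rename year to Year
#StormsRestricted <- storms %>%
#  select(name, Year = year, category, wind)
# Convert category to a number; going through as.character() keeps the original
# labels if category is a factor (plain as.numeric() would return level codes)
#StormsRestricted$category <- as.numeric(as.character(StormsRestricted$category))
# Group by name and summarize
#StormSummarized <- StormsRestricted %>%
#  group_by(name) %>%
#  summarize(
#    Year = min(Year),          # starting year of the storm
#    category = max(category),  # highest category reached
#    wind = mean(wind),         # average wind speed
#    duration = n()             # number of observations for the storm
#  )
#print(StormSummarized)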
DISCUSS HERE
Trim the temperature data set created in question 2, then join it to
the storms data set created in question 4 by the year. Perform 2 linear
regressions. The first regression should have category as
the \((y)\) variable and
Temperature as the \((x)\)
variable. The second regression should also include
wind and duration as \((x)\) variables.
Discuss the relationships.
(See Section 4.3 on joining data sets & Sections 4.4-5 for linear regression.)
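A minimal sketch of the trim, join, and regression steps on made-up data. The `yearly_temp`, `storm_summary`, `trimmed`, `joined`, `fit1`, and `fit2` names and all values are hypothetical; only the column names follow the instructions.

```r
library(dplyr)

# Made-up yearly anomalies and per-storm summaries (illustrative values only)
yearly_temp <- data.frame(
  Year        = 2015:2020,
  Temperature = c(0.9, 1.0, 0.9, 0.8, 1.0, 1.0)
)
storm_summary <- data.frame(
  name     = c("A", "B", "C", "D", "E", "F", "G", "H"),
  Year     = c(2016, 2016, 2017, 2018, 2018, 2019, 2020, 2020),
  category = c(0, 2, 4, 1, 3, 5, 0, 4),
  wind     = c(40, 60, 85, 50, 70, 95, 35, 80),
  duration = c(10, 22, 40, 15, 30, 45, 8, 33)
)

# Keep only the years that appear in the storm data, then join by Year
trimmed <- yearly_temp %>% filter(Year %in% storm_summary$Year)
joined  <- inner_join(storm_summary, trimmed, by = "Year")

# Regression 1: category on temperature; Regression 2: add wind and duration
fit1 <- lm(category ~ Temperature, data = joined)
fit2 <- lm(category ~ Temperature + wind + duration, data = joined)
summary(fit1)
summary(fit2)
```

Joining one temperature row per year onto one row per storm keeps each storm in the result exactly once, so no storm is counted more often than another in the regressions.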
## Filtering and Joining Code
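# Trim Dataset: keep the yearly averages from question 2 for the years covered by the storms
#TrimmedTemperature <- AverageAnomaly %>%
#  filter(Year >= min(StormSummarized$Year), Year <= max(StormSummarized$Year)) %>%
#  select(Year, Temperature = AvgTemperature)
# Join temperature and storms data sets (one row per storm)
#JoinedData <- inner_join(StormSummarized, TrimmedTemperature, by = "Year")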
## Regression 1 Code
# Category as Y and Temperature as X (the column is named Temperature, with a capital T)
#RegressionCategoryTemp <- lm(category ~ Temperature, data = JoinedData)
#summary(RegressionCategoryTemp)
The p-value for Temperature indicates whether the relationship is statistically significant. The coefficient represents the expected change in category for a one-unit change in the temperature anomaly.
## Regression 2 Code
# Also include wind and duration as X variables
#RegressionCategoryMultiple <- lm(category ~ Temperature + wind + duration, data = JoinedData)
#summary(RegressionCategoryMultiple)
The second regression adds wind and duration as predictors. Each coefficient (for Temperature, wind, and duration) represents the expected change in category for a one-unit change in that variable, holding the other two constant.