Project 4

Introduction

For this project, you will look at the relationship between changes in temperature and oceanic storms. You will need to load the following libraries to complete the assignment.

library(dplyr)
library(ggplot2)
library(tidyr)

Two data sets are used. The data set containing temperatures is named “temperature.csv” and can be found in Canvas in the Data folder under Files. Download the data set, then load it using the read.csv function.

(Hint: You can use the “Import Dataset > From Text (base)” command in the “File” menu to search for the downloaded file. Then, copy and paste the command from the console below.)
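As a rough illustration of the loading step, the sketch below writes a tiny stand-in file and reads it back with read.csv. The file location, the month column names (Jan, Feb, Mar), and the values are all made up for the example; the real temperature.csv may be laid out differently.

```r
# Hypothetical stand-in for the downloaded file: a tiny CSV written to a
# temporary folder so the example runs anywhere. Values are made up.
csv_path <- file.path(tempdir(), "temperature.csv")
writeLines(c("Year,Jan,Feb,Mar",
             "1880,-0.36,-0.51,-0.23",
             "1881,-0.31,-0.22,-0.04"),
           csv_path)

# read.csv returns a data.frame; the header row supplies the column names.
Temperature <- read.csv(csv_path)

# Inspect the column names and types that were read in.
str(Temperature)
```

For the real assignment, csv_path would be replaced by the path to wherever the downloaded file was saved.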

The data set contains 142 rows, one for each year from 1880 to 2021. It contains 13 columns: the year of the observation and one column for each of the 12 months. Each value is the temperature anomaly for that month in the Northern Hemisphere.

The data set containing information on each storm can be found in the dplyr library. It will be used for questions 4 and 5. Load it using the following command.

data(storms)

storms contains 11,859 observations of oceanic storms in the Atlantic. Each observation is a measurement of the storm, taken every 6 hours. The data set contains the name of the storm, the date & time, longitude & latitude, category, wind speed & pressure, and diameter.

1. Summarizing Temperature Anomalies

Use the pivot_longer(...) command from the tidyr package to shrink the number of columns in the temperature data.frame to 3 – the Year, Month, and Temperature.

(See Section 4.3)
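A minimal sketch of the reshaping step on made-up data may help show what the result should look like. The data frame name wide, the three month columns, and the values are assumptions for illustration only.

```r
library(tidyr)

# Made-up wide data: one row per year, one column per month.
wide <- data.frame(Year = c(1880, 1881),
                   Jan = c(-0.36, -0.31),
                   Feb = c(-0.51, -0.22),
                   Mar = c(-0.23, -0.04))

# Every column except Year is stacked: the old column name goes into Month
# and the cell value goes into Temperature.
long <- pivot_longer(wide,
                     cols = -Year,
                     names_to = "Month",
                     values_to = "Temperature")

print(long)
```

Two years by three months gives six rows and three columns; the full data set would give one row per year and month.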

## Summarizing Temperature Anomalies

# shrinking the number of columns in the temperature data frame

# Commented out because the chunk would not knit (Temperature must be loaded with read.csv first)
#TemperatureLong <- pivot_longer(Temperature,
#                                cols = -Year,
#                                names_to = "Month",
#                                values_to = "Temperature")


# checking the results
# TemperatureLong

2. Average Temperature

Calculate the average temperature anomaly for each year. Then, plot the temperature anomaly across time (line plot) using ggplot, including a smooth trend line.

Discuss what is happening to average temperatures across time.

(See Section 4.2 for summarizing data & 3. Application for an example line plot. For the trend line, use geom_smooth(). Don’t worry about adding method="lm".)
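The yearly averaging and the line plot can be sketched on simulated data as follows. The data frame long, its upward drift, and the noise level are invented for the example and say nothing about the real anomalies.

```r
library(dplyr)
library(ggplot2)

# Simulated long-format data: 30 years x 12 months with a small upward drift.
set.seed(1)
long <- expand.grid(Year = 1990:2019, Month = month.abb)
long$Temperature <- 0.02 * (long$Year - 1990) + rnorm(nrow(long), sd = 0.1)

# One row per year, holding the mean of that year's 12 monthly values.
yearly <- long %>%
  group_by(Year) %>%
  summarize(AvgTemperature = mean(Temperature, na.rm = TRUE))

# Line plot of the yearly averages with a smooth trend line on top.
# With no method argument, geom_smooth() picks a smoother automatically.
p <- ggplot(yearly, aes(x = Year, y = AvgTemperature)) +
  geom_line() +
  geom_smooth() +
  labs(x = "Year", y = "Average Temperature Anomaly")

print(p)
```

Setting na.rm = TRUE is a defensive choice in case some months are missing; without it, a single missing month would make that year's average NA.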

## Summary Code

# Calculate the average temperature for each year
# Commented out because TemperatureLong is not found when knitting

#AverageAnomaly <- TemperatureLong %>%
#  group_by(Year) %>%
#  summarize(AvgTemperature = mean(Temperature))

# AverageAnomaly

## Plot Code

# Using ggplot2 to Plot
# Commented out because AverageAnomaly is not found when knitting

#ggplot(AverageAnomaly, aes(x = Year, y = AvgTemperature)) +
#  geom_line() +
#  geom_smooth(method = "lm", se = FALSE) +
#  labs(title = "Average Temperature Anomaly Over Time",
#       x = "Year",
#       y = "Average Temperature Anomaly")

The average temperature anomaly increases over time overall, but not smoothly. It fluctuates from year to year, with the trend peaking around 1940 and dipping slightly around 1970. After 1970 the average rises slowly, still with year-to-year variation.

3. Linear Regression

Perform a linear regression with Year as the independent \((x)\) variable and Temperature as the dependent \((y)\) variable. Show the summary and discuss the relationship.

Plot the residuals from the regression. Do you observe any patterns?

(See Section 4.4-5 for linear regression examples and plotting residuals. If you use ggplot for the residual plot, you will want to place the residuals & fitted values into a new data.frame first.)
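The hint about placing residuals and fitted values into a new data.frame can be sketched like this. The data frame toy and its simulated values are assumptions for illustration, not the assignment data.

```r
library(ggplot2)

# Simulated data with a linear trend plus noise.
set.seed(2)
toy <- data.frame(Year = 1950:1999)
toy$Temperature <- 0.01 * (toy$Year - 1950) + rnorm(nrow(toy), sd = 0.1)

# Fit Temperature as a linear function of Year and show the summary.
fit <- lm(Temperature ~ Year, data = toy)
print(summary(fit))

# Collect fitted values and residuals in their own data.frame for ggplot.
resid_df <- data.frame(Fitted = fitted(fit),
                       Residual = residuals(fit))

# Residuals vs fitted values; a dashed line marks zero.
p <- ggplot(resid_df, aes(x = Fitted, y = Residual)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals")

print(p)
```

If the linear model is adequate, the points should scatter around the zero line with no visible curve; a wave or U shape would suggest the relationship is not purely linear.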

## Regression Code

# performing linear regression
# Commented out for the same reason: TemperatureLong is not found when knitting

#LinearRegression <- lm(Temperature ~ Year, data = TemperatureLong)

#summary(LinearRegression)

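The regression residuals range from a minimum of -1.07 to a maximum of 1.267, and the R-squared is 0.625, so Year explains about 62.5% of the variation in the temperature anomaly. The summary output was somewhat hard to interpret because the printed results ran together and were difficult to read.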

## Residual Plot Code

# plotting the residuals against Year
#ResidualsPlot <- ggplot(TemperatureLong, aes(x = Year, y = residuals(LinearRegression))) +
#  geom_point() +
#  geom_hline(yintercept = 0, linetype = "dashed", color = "blue") +
#  labs(title = "Residuals Plot",
#       x = "Year",
#       y = "Residuals")


# print(ResidualsPlot)

The residuals are centered on the horizontal zero line, but they are not randomly scattered. Across the years the points trend downward, then upward, then downward, and then upward again, which suggests that a straight line does not fully capture the relationship. The points are fairly close together and follow this wave pattern consistently. No residual falls below -1.0 after about 1900, and none rises above 1.0 until shortly before 2020.

4. Cleaning Storms

Restrict the number of variables in the storms data set to the following: name, year, category, and wind. Rename the year column to Year (capitalize) so that it is consistent with the Temperature data set. Convert the “category” column to a number using the as.numeric function.

Finally, group the data set by the name of the storm and summarize the data. Let Year be the starting year of the storm, category is the highest category, wind is the average wind speed, and duration is the number of observations associated with each storm.

What happened to the “category” column when it was converted to a number? How can you correct it?

(See Section 4.1 for cleaning the data & 4.2 for summarizing it. You can use levels(x) to see the different categories of categorical variable x.)
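The cleaning steps can be tried on a small made-up table first. The sketch assumes category is stored as a factor, which depends on the installed dplyr version; the storm names and values are invented.

```r
library(dplyr)

# Made-up observations shaped like storms. Storing category as a factor is
# an assumption about how the real column is stored.
toy_storms <- data.frame(
  name = c("Amy", "Amy", "Amy", "Bob", "Bob"),
  year = c(1975, 1975, 1975, 1979, 1979),
  category = factor(c("-1", "0", "1", "0", "3"),
                    levels = c("-1", "0", "1", "2", "3", "4", "5")),
  wind = c(25, 35, 65, 40, 100)
)

# levels() lists the labels of the categorical variable.
print(levels(toy_storms$category))

# as.numeric() on a factor returns the position of each label in levels(),
# not the label itself, so "-1" becomes 1, "0" becomes 2, and so on.
codes <- as.numeric(toy_storms$category)
print(codes)

# Converting to character first recovers the labeled values.
cleaned <- toy_storms %>%
  select(name, Year = year, category, wind) %>%
  mutate(category = as.numeric(as.character(category)))

# One row per storm: first year, highest category, mean wind, and the
# number of observations as the duration.
summary_by_storm <- cleaned %>%
  group_by(name) %>%
  summarize(Year = min(Year),
            category = max(category),
            wind = mean(wind),
            duration = n())

print(summary_by_storm)
```

Comparing codes with cleaned$category shows the shift: the codes start at 1, while the labeled categories start at -1.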

## Cleaning Storms


# keeping name, year, category, and wind, and renaming year to Year
#StormsRestricted <- storms %>%
#  select(name, Year = year, category, wind)


# converting category to a number 
#StormsRestricted$category <- as.numeric(StormsRestricted$category)


# group by name and summarize
#StormSummarized <- StormsRestricted %>%
#  group_by(name) %>%
#  summarize(
#    Year = min(Year),
#    category = max(category),
#    wind = mean(wind),
#    duration = n()
#  )


# print(StormSummarized)

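If category is stored as a factor, as.numeric returns each value's internal level code (1, 2, 3, ...) rather than the category shown in its label, so the converted numbers no longer match the original categories. Converting to character first, with as.numeric(as.character(category)), recovers the labeled values. Running levels(category) before converting shows the labels to check against.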

5. Combining Data

Trim the temperature data set created in question 2, then join it to the storms data set created in question 4 by the year. Perform 2 linear regressions. The first regression should have category as the \((y)\) variable and Temperature as the \((x)\) variable. The second regression should also include wind and duration as \((x)\) variables.

Discuss the relationships.

(See Section 4.3 on joining data sets & Sections 4.4-5 for linear regression.)
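A sketch of the join and the two regressions on simulated data is shown below. All names and values (yearly_temp, toy_storms, the year ranges) are invented, and inner_join is just one possible choice: it keeps only the years present in both tables, which also takes care of trimming.

```r
library(dplyr)

# Made-up yearly temperature anomalies.
yearly_temp <- data.frame(
  Year = 2000:2009,
  Temperature = c(0.4, 0.5, 0.6, 0.6, 0.5, 0.7, 0.6, 0.7, 0.5, 0.6)
)

# Made-up storm summaries: three storms per year from 2003 to 2012.
set.seed(3)
toy_storms <- data.frame(
  name = paste0("Storm", 1:30),
  Year = rep(2003:2012, each = 3),
  category = sample(-1:5, 30, replace = TRUE),
  duration = sample(5:40, 30, replace = TRUE)
)
toy_storms$wind <- 40 + 15 * toy_storms$category + rnorm(30, sd = 10)

# Keep only storms whose Year also appears in the temperature table.
joined <- inner_join(toy_storms, yearly_temp, by = "Year")

# Regression 1: category explained by temperature alone.
fit1 <- lm(category ~ Temperature, data = joined)
print(summary(fit1))

# Regression 2: wind and duration added as further predictors.
fit2 <- lm(category ~ Temperature + wind + duration, data = joined)
print(summary(fit2))
```

In the second model, each coefficient is read as the change in category for a one-unit change in that predictor with the other predictors held fixed.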

## Filtering and Joining Code

# Trim Dataset
#TrimmedTemperature <- TemperatureLong %>%
#  select(Year, Month, Temperature)

# Join temperature and storms data sets
#JoinedData <- left_join(TrimmedTemperature, StormSummarized, by = "Year")

## Regression 1 Code

# Category as Y and Temperature as X
#RegressionCategoryTemp <- lm(category ~ Temperature, data = JoinedData)

The p-value for Temperature indicates whether the relationship is statistically significant. The coefficient is the estimated change in category associated with a one-unit change in the temperature anomaly.

## Regression 2 Code

# also including wind and duration as X
#RegressionCategoryMultiple <- lm(category ~ Temperature + wind + duration, data = JoinedData)

The second regression adds wind and duration as predictors. Each coefficient for Temperature, wind, and duration is the estimated change in category for a one-unit increase in that predictor, holding the other two constant.