STA 209 Homework 1

#importing required libraries
library(astsa)

## Warning: package 'astsa' was built under R version 4.5.2

library(readxl)
library(tidyverse)

## Warning: package 'ggplot2' was built under R version 4.5.2

## Warning: package 'tibble' was built under R version 4.5.2

## Warning: package 'tidyr' was built under R version 4.5.2

## Warning: package 'readr' was built under R version 4.5.2

## Warning: package 'purrr' was built under R version 4.5.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(plotly)

## Warning: package 'plotly' was built under R version 4.5.2

## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

library(dplyr)
library(htmlwidgets)

Problem 1

#loading in the data
data(soi)

#making a time series plot
tsplot(soi,main="Southern Oscillation Index Data", xlab="Time (years)", ylab="Southern Oscillation Index",col="darkblue")

This time series plot displays the Southern Oscillation Index (SOI) from 1950 to 1987. The index ranges from -1 to 1, and there is a very slight decreasing trend with the average SOI looking to be a bit above 0.0 in 1950 and a bit below 0.0 in 1987. The variance does not look to change over time because the SOI looks to range from -0.5 to 1.0 during 1950 to 1960 and range from -.75 to 0.75 between 1975 to 1985, which is a consistent range of around 1.5. There are also no signs of volatility. There does look to be some seasonality in the data because of the repetitive patterns happening over time.

#showing decomposition of the time series
decomposition_add <- decompose(soi, type = "additive")
plot(decomposition_add)

The additive decomposition model is better than the multiplicative decomposition model in this scenario because the seasonal variation is relatively constant over time. The trend section of the decomposition confirms that there is an overall decrease of the SOI from 1950 to 1987, but the trend from 1960 to 1978 looks to stay consistent, so most of the decrease happen from 1950 to 1960 and 1978 to 1987. The seasonal section displays that there is strong seasonality, with the same pattern repeating every year. The random section shows that there is random noise that cannot be explained by the long-term trend of the seasonal patterns. The observed section is the observed data.

#fitting three different models to the SOI data
Y<-soi
t<-time(soi)
t2=(t^2)/2
t3=(t^3)/6

m1=lm(Y~t)
m2=lm(Y~t+t2)
m3=lm(Y~t+t2+t3)

summary(m1)

## 
## Call:
## lm(formula = Y ~ t)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.04140 -0.24183  0.01935  0.27727  0.83866 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.70367    3.18873   4.298 2.12e-05 ***
## t           -0.00692    0.00162  -4.272 2.36e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3756 on 451 degrees of freedom
## Multiple R-squared:  0.0389, Adjusted R-squared:  0.03677 
## F-statistic: 18.25 on 1 and 451 DF,  p-value: 2.359e-05

summary(m2)

## 
## Call:
## lm(formula = Y ~ t + t2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0529 -0.2443  0.0129  0.2658  0.8412 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.968e+02  6.444e+02  -0.771    0.441
## t            5.116e-01  6.546e-01   0.782    0.435
## t2          -2.634e-04  3.325e-04  -0.792    0.429
## 
## Residual standard error: 0.3758 on 450 degrees of freedom
## Multiple R-squared:  0.04024,    Adjusted R-squared:  0.03597 
## F-statistic: 9.433 on 2 and 450 DF,  p-value: 9.698e-05

summary(m3)

## 
## Call:
## lm(formula = Y ~ t + t2 + t3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.07047 -0.23026  0.01631  0.27232  0.85650 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.308e+05  1.325e+05   0.987    0.324
## t           -1.995e+02  2.020e+02  -0.988    0.324
## t2           2.029e-01  2.052e-01   0.989    0.323
## t3          -1.032e-04  1.042e-04  -0.990    0.323
## 
## Residual standard error: 0.3758 on 449 degrees of freedom
## Multiple R-squared:  0.04233,    Adjusted R-squared:  0.03593 
## F-statistic: 6.616 on 3 and 449 DF,  p-value: 0.0002215

Model 1: \(SOI_t\) = 13.70367 - 0.00692(t)

Model 2: \(SOI_t\) = -496.8 + 0.511(t) - 0.0002(\(\frac{t^2}{2}\))

Model 3: \(SOI_t\) = 130,800 - 199.5(t) + 0.2029(\(\frac{t^2}{2}\)) - 0.0001032(\(\frac{t^3}{6}\))

Model 1 has two regression parameters, \(B_0\) and \(B_1\). Model 2 has three regression parameters, \(B_0\), \(B_1\), and \(B_2\). Model 3 has four regression parameters, \(B_0\), \(B_1\), \(B_2\), and \(B_3\).

#reporting ANOVA for the three models
anova(m1)

anova(m2)

anova(m3)

The ANOVA output from Model 1 has a significant p-value, in this case below 0.001, meaning that time is a significant predictor of SOI. Model 2 has t (linear time term) also being a significant predictor, but t2 (quadratic time term) is not a significant predictor of SOI. Model 3 also only has the linear time time being a significant predictor, but not the quadratic or cubic terms of time.

#creating the anova tables for each model
anovaSummary1 <- data.frame(Df=c("1","451","452"), SS=c("2.576","63.640","66.216"),MS=c("2.576","0.1411086","0.1464956"),Fstat=c("18.25544","",""),Pvalue=c("~0","",""))
rownames(anovaSummary1)<-c("Regression","Error","Total")
print(anovaSummary1)

##             Df     SS        MS    Fstat Pvalue
## Regression   1  2.576     2.576 18.25544     ~0
## Error      451 63.640 0.1411086                
## Total      452 66.216 0.1464956

anovaSummary2 <- data.frame(Df=c("2","450","452"), SS=c("2.665","63.551","66.216"),MS=c("1.3325","0.1412244","0.1464956"),Fstat=c("9.435338","",""),Pvalue=c("~0","",""))
rownames(anovaSummary2)<-c("Regression","Error","Total")
print(anovaSummary2)

##             Df     SS        MS    Fstat Pvalue
## Regression   2  2.665    1.3325 9.435338     ~0
## Error      450 63.551 0.1412244                
## Total      452 66.216 0.1464956

anovaSummary3 <- data.frame(Df=c("3","449","452"), SS=c("2.804","63.413","66.217"),MS=c("0.9346667","0.1412316","0.1464978"),Fstat=c("6.617971","",""),Pvalue=c("~0","",""))
rownames(anovaSummary3)<-c("Regression","Error","Total")
print(anovaSummary3)

##             Df     SS        MS    Fstat Pvalue
## Regression   3  2.804 0.9346667 6.617971     ~0
## Error      449 63.413 0.1412316                
## Total      452 66.217 0.1464978

The f-statistic is the largest in model 1, meaning that the linear time term explains a greater proportion of the variability in SOI than model 2 and 3. Model 1 also has the highest mean squared error which indicates greater unexplained variability in the residuals, which makes sense because model 1 is linear and naturally does not capture the complexities of time series. This can be possible because since none of the models fit the data near perfect, there can be predictors that are significant but because the data has a lot of noise it could also lead to overall large prediction errors.

#adding the three fits (linear, quadratic, cubic) to the plot
trend(soi, order=1,col=c("black","red"))
par(new=TRUE)
trend(soi, order=2,col=c("black","blue"))
par(new=TRUE)
trend(soi, order=3,col=c("black","darkgreen"))
legend("bottomright",legend=c("Linear","Quadratic","Cubic"),lwd=c(2,2),col=c("red","blue","darkgreen"))
title("Plot including Three Model Fits")

I think the linear fit fits the SOI data the best because the data seems to be decreasing in a linear pattern from 1950 to 1987. The cubic fit may fit the SOI data ok because there is a decrease in the data that happens around 1950-1960, but there is not much of a decrease that happens from 1984-1987 so I don’t think it fits the data the best. I don’t think the quadratic fit fits the SOI data well because there isn’t a u-shaped curve to the data.

#plotting the detrended data
resid_ts <- ts(m1$residuals,start=1950,frequency=12)
tsplot(resid_ts,main="Detrended SOI Data (Model 1)",ylab="Residuals")

resid_ts2 <- ts(m2$residuals,start=1950,frequency=12)
tsplot(resid_ts2,main="Detrended SOI Data (Model 2)",ylab="Residuals")

resid_ts3 <- ts(m3$residuals,start=1950,frequency=12)
tsplot(resid_ts3,main="Detrended SOI Data (Model 3)",ylab="Residuals")

All three of these detrended plots show that the residuals are very similar. While it is hard to tell which fits the SOI data the best, based on these plots all three models fits the SOI data similarly. The residuals shows that the trend has been extracted because there does not look to be a trend in the residuals.

Problem 2

#copying over the function to set up the temp data for climate series from class
climatets <- function(filename,num,plot=TRUE,s=2) {
  dataexcel <- read_excel(filename,sheet=num,range="D7:H77")
  dataexcel <- dataexcel[-1,-2]
  dataexcel <- filter(dataexcel, !is.na(dataexcel$Avg))
  data  <- dataexcel %>% mutate(Month=num)
  if(plot==TRUE){
    g <- data%>% ggplot(aes(Year, Avg))+ geom_line(size=s)
    return(list(data,g))
  }else{
    return(data)
  }
}

#copying over the function to set up monthly data for climate data
seasonalts <- function(filename,nums) {
  data.seasonal <- matrix(0,nrow=1,ncol=5) 
  for(i in 1:length(nums)){
    dataexcel<- read_excel(filename,sheet=nums[i],range="D7:H77")
    dataexcel <- dataexcel[-1,-2]
    dataexcel <- filter(dataexcel, !is.na(dataexcel$Avg))
    data  <- dataexcel %>% mutate(Month=nums[i])
    colnames(data.seasonal) <- colnames(data)
    data.seasonal <- rbind(data.seasonal,data)
  }
  data.seasonal <- data.seasonal[-1,]
  data.seasonal <- arrange(data.seasonal,Year)
  return(data.seasonal)
}

#loading in the data
newhavenData <- seasonalts("Climate_Northeast_NewHavenCT.xlsx",nums=1:4)

## New names:
## New names:
## New names:
## New names:
## • `` -> `...2`

warwickData <- seasonalts("Climate_Northeast_WarwickRI.xlsx",nums=1:4)

## New names:
## New names:
## New names:
## New names:
## • `` -> `...2`

worcestorData <- seasonalts("Climate_Northeast_WorcesterMA.xlsx",nums=1:4)

## New names:
## New names:
## New names:
## New names:
## • `` -> `...2`

#removing Na's
newHavenNoNa <- newhavenData %>%  filter(!is.na(Avg)) 
warwickNoNa <- warwickData %>%  filter(!is.na(Avg)) 
worcestorNoNa <- worcestorData %>%  filter(!is.na(Avg))

#making three plots for each city of min,max, and avg temperature
tsplot(cbind(newHavenNoNa$Min,newHavenNoNa$Avg,newHavenNoNa$Max),col = astsa.col(c(1, 4, 7),.7),lwd = 2,type = "l",spaghetti = TRUE,x = newHavenNoNa$Year,main="Minimum, Maximum, and Average Temperature in New Haven, CT",xlab="Year",ylab="Temperature (Celsius)")

## Warning in xy.coords(x, y, xlabel, ylabel, log): NAs introduced by coercion
## Warning in xy.coords(x, y, xlabel, ylabel, log): NAs introduced by coercion

## Warning in xy.coords(x, y): NAs introduced by coercion
## Warning in xy.coords(x, y): NAs introduced by coercion
## Warning in xy.coords(x, y): NAs introduced by coercion

legend("bottomright",legend = c("Min", "Avg", "Max"),col = astsa.col(c(1, 4, 7), 0.7),lwd = 2,pch = 1)

tsplot(cbind(warwickNoNa$Min,warwickNoNa$Avg,warwickNoNa$Max),col = astsa.col(c(1, 4, 7), .7),
  lwd = 2,type = "l",spaghetti = TRUE,x = warwickNoNa$Year,main="Minimum, Maximum, and Average Temperature in Warwick, RI",xlab="Year",ylab="Temperature (Celsius)")

legend("bottomright",legend = c("Min", "Avg", "Max"),col = astsa.col(c(1, 4, 7), 0.7),lwd = 2,pch = 1)

tsplot(cbind(worcestorNoNa$Min,worcestorNoNa$Avg,worcestorNoNa$Max),col = astsa.col(c(1, 4, 7), .7),
  lwd = 2,type = "l",spaghetti = TRUE,x=worcestorNoNa$Year,main="Minimum, Maximum, and Average Temperature in Worcestor, MA",xlab="Year",ylab="Temperature (Celsius)")

legend("bottomright",legend = c("Min", "Avg", "Max"),col = astsa.col(c(1, 4, 7), 0.7),lwd = 2,pch = 1)

The plots of New Haven and Warwick temperatures shows that the minimum temperatures go as low as -15 degrees celsius, the maximum temperature goes as high as 30 degrees celsius, and the average temperature is usually around 0-10 degrees celsius. The plot of Worcestor temperature shows that the minimum temperature stays closer to around -10 degrees celsius, the maximum temperature stays closer to 25 degrees celsius, and the average temperature is between -5 and 5 degrees celsius.

#reporting the time series plots for New Haven
p1<-subplot(plot_ly(newHavenNoNa, x = ~Year, y = ~Max, name = "Max"),
  plot_ly(newHavenNoNa, x = ~Year, y = ~Min,name="Min"),
  plot_ly(newHavenNoNa, x = ~Year, y = ~Avg,name="Avg"),shareX=TRUE) %>%
  layout(title="Minimum, Maximum, and Average Temperature in New Haven, CT", xaxis=list(title="Year"), yaxis=list(title="Temperature (Celsius)"))

## No trace type specified:
##   Based on info supplied, a 'bar' trace seems appropriate.
##   Read more about this trace type -> https://plotly.com/r/reference/#bar
## No trace type specified:
##   Based on info supplied, a 'bar' trace seems appropriate.
##   Read more about this trace type -> https://plotly.com/r/reference/#bar

## Warning: Ignoring 5 observations

## No trace type specified:
##   Based on info supplied, a 'bar' trace seems appropriate.
##   Read more about this trace type -> https://plotly.com/r/reference/#bar

p1

This interactive plot highlights that a few years around 1980 had some of the most consistent cold temperatures, where the temperature never got as high as it had been in years prior.

#reporting the time series plots for Warwick
p2<-subplot(
  plot_ly(warwickNoNa, x = ~Year, y = ~Max, name = "Max"),
  plot_ly(warwickNoNa, x = ~Year, y = ~Min,name="Min"),
  plot_ly(warwickNoNa, x = ~Year, y = ~Avg,name="Avg"),shareX=TRUE) %>%
  layout(title="Minimum, Maximum, and Average Temperature in Warwick, RI", xaxis=list(title="Year"), yaxis=list(title="Temperature (Celsius)"))

## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plotly.com/r/reference/#scatter

## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode

## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plotly.com/r/reference/#scatter

## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode

## Warning: Ignoring 13 observations

## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plotly.com/r/reference/#scatter
## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode

p2

This interactive plot of Warwicks temperature shows that the most common average temperatures are either around -2, 10, or 22 degrees, with almost no average temperatures around 3 degrees and 17 degrees.

#reporting the time series plots for Worcestor
p3<-subplot(
  plot_ly(worcestorNoNa, x = ~Year, y = ~Max, name = "Max"),
  plot_ly(worcestorNoNa, x = ~Year, y = ~Min,name="Min"),
  plot_ly(worcestorNoNa, x = ~Year, y = ~Avg,name="Avg"),shareX=TRUE) %>%
  layout(title="Minimum, Maximum, and Average Temperature in Worcestor, MA", xaxis=list(title="Year"), yaxis=list(title="Temperature (Celsius)"))

## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plotly.com/r/reference/#scatter

## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode

## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plotly.com/r/reference/#scatter

## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode

## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plotly.com/r/reference/#scatter

## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode

p3

This interactive plot of Worcestor MA temperature shows that the average temperatures is most common around -5, 9, and 21 degrees, with very few average temperatures around 2 and 16 degrees.

#defining some specific months
newHavenNoNa$Month[newHavenNoNa$Month == 1] <- "January"
newHavenNoNa$Month[newHavenNoNa$Month == 2] <- "April"
newHavenNoNa$Month[newHavenNoNa$Month  == 3] <- "July"
newHavenNoNa$Month[newHavenNoNa$Month == 4] <- "October"

warwickNoNa$Month[warwickNoNa$Month == 1] <- "January"
warwickNoNa$Month[warwickNoNa$Month == 2] <- "April"
warwickNoNa$Month[warwickNoNa$Month  == 3] <- "July"
warwickNoNa$Month[warwickNoNa$Month == 4] <- "October"

worcestorNoNa$Month[worcestorNoNa$Month == 1] <- "January"
worcestorNoNa$Month[worcestorNoNa$Month == 2] <- "April"
worcestorNoNa$Month[worcestorNoNa$Month  == 3] <- "July"
worcestorNoNa$Month[worcestorNoNa$Month == 4] <- "October"

#changing month to be a factor
newHavenNoNa$Month  <- as.factor(newHavenNoNa$Month)
warwickNoNa$Month  <- as.factor(warwickNoNa$Month)
worcestorNoNa$Month  <- as.factor(worcestorNoNa$Month)

#plotting average temperature but showing only 4 months
p = newHavenNoNa %>% filter(!is.na(Avg)) %>% 
  ggplot(aes(Year,Avg)) + 
  geom_line(size=1) + geom_point(aes(color = Month),size = 3) +theme_classic() + 
  geom_line(size=1, aes(color = Month)) + 
  labs(x="Year", y="Average Temperature") +
  ggtitle("New Haven Average Temperature")

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?

This plot of New Haven average temperature in Aprils, Januarys, Julys, and Octobers, shows that January is always colder than the other months and July is usually hotter than the other months but there are a few low outliers in years where October is actually hotter than July.

#plotting warwick average temp by months
p = warwickNoNa %>% filter(!is.na(Avg)) %>% 
  ggplot(aes(Year,Avg)) + 
  geom_line(size=1) + geom_point(aes(color = Month),size = 3) +theme_classic() + 
  geom_line(size=1, aes(color = Month)) + 
  labs(x="Year", y="Average Temperature") +
  ggtitle("Warwick Average Temperature")
p

This plot shows that in Warwick, Januarys have always been colder than Aprils, Julys, and Octobers and Julys are always hotter than Aprils, Octobers, and Januarys.

#plotting worcestor average temp by month
p = worcestorNoNa %>% filter(!is.na(Avg)) %>% 
  ggplot(aes(Year,Avg)) + 
  geom_line(size=1) + geom_point(aes(color = Month),size = 3) +theme_classic() + 
  geom_line(size=1, aes(color = Month)) + 
  labs(x="Year", y="Average Temperature") +
  ggtitle("Worcestor Average Temperature")
p

This plot shows that Januarys are consistently cooler months than April, January, and July, and Julys are usually much higher than the other three months except in one year when April had a high temperature almost as high as Julys.

From looking at all the plots overall, all of the cities looked to have average temperatures that stayed relatively consistent over time, with no dramatic increasing or decreasing temperatures from 1950 to 2018. There are major seasonality patterns, with some months having much higher temperatures than others. The average temperature of the cities is around 5 degrees celsius and the temperatures ranged from -15 to 30 degrees celsius. There were some years that were outliers with either very high or very low temperatures compared to the usual temperatures of that month in past years.

Project Problem

Possible time series topics I am interested in are all health related, such as determining if time can predict disease or if we can forecast potential illness. I am interested in epidemiology, such as how disease spreads through a population over time, and in wellness, such as how modifiable risk factors (exercise, sleep, nutrition) can have a positive affect on your body. This is a fascinating time series topic because getting enough exercise, sleep, and good nutrition does not improve your health right away, it takes time-so how much time does it take to notice significant health impacts? It is also common that something is good for you when you have it occassionally, but then if you have it too much it is actually bad for you, there has also been some nutrition research on this with coffee, black tea, alcohol, and chocolate, so how long and how often can people drink and eat these things until it turns out it’s not good for them anymore?

Potential inquiry questions could be:

Does time one’s COVID symptoms over time affect the severity of their case?
Is the consistency of cardiovascular activity over time significant in predicting yours hearts health?
How does your bodys sleep phases over time affect your health?
How does drinking caffeine over time affect your health?

Potential datasets that could be used for the project are:

Covid cases dataset includes cases, deaths, and recoveries of covid over time. https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data
Diabetes dataset includes time series data on patients with diabetes that includes insulin and lifestyle metrics https://archive.ics.uci.edu/dataset/34/diabetes
Air quality dataset includes hourly data from an entire year on various air quality metrics. https://archive.ics.uci.edu/dataset/360/air+quality
Daily and sports activities dataset includes data from sensors that were placed on subjects body to track signals from 19 different activities. This time series could look into how diffreent activities affect the subjects body over a period of time. https://archive.ics.uci.edu/dataset/256/daily+and+sports+activities

STA 209 Homework 1

2026-02-04

Problem 1

Problem 2

Project Problem