Problem #1 (i)

library(astsa)
library(readxl)
library(tidyverse)
library(fs)
?soi
tsplot(soi,main = 'Southern Oscillation Index',xlab='Year',ylab = 'SOI')

We see a data set that appears to have little to no linear trend, this is an indicator that a linear model will fit the data poorly. The large spikes in temperature show the seasonality in the data, the temperature rises and falls due to the seasons changing. Finally, there doesn’t appear to be heteroskedasticity in the data, the large spikes and drops in temperature are pretty constant through the period. (ii)

decomposition_add <- decompose(soi, type = "additive")
plot(decomposition_add)

The second panel represents the trend of the data, this does not show a linear relationship. We can see how the data rises and falls over time but has no clear direction. The data is in the same spot at the end of the period as it was at the beginning. In the third plot we can see a clear seasonal trend, within each period (months) the data repeats the same movement. Finally the fourth panel represents the data if we removed the second and third panel aka trend and seasonality. There is no clear pattern in this data which is what we expected. (iii)

Y<-soi
t<-time(soi)
Model1 <- lm(Y~t)
summary(Model1)
## 
## Call:
## lm(formula = Y ~ t)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.04140 -0.24183  0.01935  0.27727  0.83866 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.70367    3.18873   4.298 2.12e-05 ***
## t           -0.00692    0.00162  -4.272 2.36e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3756 on 451 degrees of freedom
## Multiple R-squared:  0.0389, Adjusted R-squared:  0.03677 
## F-statistic: 18.25 on 1 and 451 DF,  p-value: 2.359e-05
tsqf <- (t^2)/factorial(2)
Model2 <- lm(Y~t+tsqf,soi)
summary(Model2)
## 
## Call:
## lm(formula = Y ~ t + tsqf, data = soi)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0529 -0.2443  0.0129  0.2658  0.8412 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.968e+02  6.444e+02  -0.771    0.441
## t            5.116e-01  6.546e-01   0.782    0.435
## tsqf        -2.634e-04  3.325e-04  -0.792    0.429
## 
## Residual standard error: 0.3758 on 450 degrees of freedom
## Multiple R-squared:  0.04024,    Adjusted R-squared:  0.03597 
## F-statistic: 9.433 on 2 and 450 DF,  p-value: 9.698e-05
tcubf <- (t^3)/factorial(3)
Model3 <- lm(Y~t+tsqf+tcubf,soi)
summary(Model3)
## 
## Call:
## lm(formula = Y ~ t + tsqf + tcubf, data = soi)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.07047 -0.23026  0.01631  0.27232  0.85650 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.308e+05  1.325e+05   0.987    0.324
## t           -1.995e+02  2.020e+02  -0.988    0.324
## tsqf         2.029e-01  2.052e-01   0.989    0.323
## tcubf       -1.032e-04  1.042e-04  -0.990    0.323
## 
## Residual standard error: 0.3758 on 449 degrees of freedom
## Multiple R-squared:  0.04233,    Adjusted R-squared:  0.03593 
## F-statistic: 6.616 on 3 and 449 DF,  p-value: 0.0002215
anova1 <- anova(Model1)
anova2 <- anova(Model2)
anova3 <- anova(Model3)
rbind(anova1,anova2,anova3)
## Analysis of Variance Table
## 
## Response: Y
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## t            1  2.576 2.57583 18.2542 2.359e-05 ***
## Residuals  451 63.640 0.14111                      
## t1           1  2.576 2.57583 18.2392 2.378e-05 ***
## tsqf         1  0.089 0.08863  0.6276    0.4286    
## Residuals1 450 63.551 0.14122                      
## t2           1  2.576 2.57583 18.2384 2.380e-05 ***
## tsqf1        1  0.089 0.08863  0.6276    0.4287    
## tcubf        1  0.139 0.13853  0.9809    0.3225    
## Residuals2 449 63.413 0.14123                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model 1:There are two regression parameters in this example, beta 0 and beta 1. Model 2:There are three regression parameters in this example, beta 0,1,2. Model 3:There are four regression parameters, beta 0,1,2,3.

Anova:The P values of model 1 is very close to the p value of model 3 indicating the models don’t improve very much when the terms are added. Based on the initial ts plot it’s clear that added terms to the model doesn’t help us understand the data any more than the first model. Also when the quadratic terms are added their p values are insignificant. The mean squared across all models is the same backing up my point that adding quadratic terms to the model does not increase its power of explaining the variability.

tsplot(soi, main="Southern Oscillation Index", xlab="Time (years)", ylab="SOI", col="blue", lwd=2)
abline(Model1, col="red", lwd=2)
points(t, Model2$fitted.values, col="green", lwd=2, type="l")
points(t, Model3$fitted.values, col="black", lwd=2, type="l")

The first model fits the data the best, the other two models are victim to over fitting. Both model 2 and 3 have slopes that are getting too extreme in the negative direction in the rightmost part of the graph. We have seen that this data has little to no linearity in it. Model 1 is almost a horizontal line with a slight negative slope. A model that has a slope increasing in the negative direction could misrepresent the data.

plot(time(soi),Model1$residuals,type="l",main="Detrended Data",ylab="residual",col="blue")

plot(time(soi),Model2$residuals,type="l",main="Detrended Data 2",ylab="residual",col="blue")

plot(time(soi),Model3$residuals,type="l",main="Detrended Data 3",ylab="residual",col="magenta")

All of the residual plots of the models create almost the same exact graph meaning that none of them are any better than the others. Looking back on the anova the p values are the same or have a very small difference. This means that all the models are pretty much the same. If I were to choose a model I would choose the first one, overfitting is possible when quardatic terms are added and choosing the linear model is the best way to avoid that.

library(readxl)
library(tidyverse)
library(fs)
Climate_Northeast_WarwickRI <- read_excel("~/Desktop/STA209/Data/Climate_Northeast_WarwickRI.xlsx")

climatets <- function(filename,num,plot=TRUE,s=2) {
  ##############################################
  # function to set up  temp data for climate series
  # inputs: filename and sheet number
  # output: returns cleaned data and plot
  # written by: P.Kohli
  # date: 4/13/24
  ##############################################
  dataexcel <- read_excel(filename,sheet=num,range="D7:H77")
  dataexcel <- dataexcel[-1,-2]
  dataexcel <- filter(dataexcel, !is.na(dataexcel$Avg))
  data  <- dataexcel %>% mutate(Month=num)
  if(plot==TRUE){
    g <- data%>% ggplot(aes(Year, Avg))+ geom_line(size=s)
    return(list(data,g))
  }else{
    return(data)
  }
  
}
climatets("~/Desktop/STA209/Data/Climate_Northeast_WarwickRI.xlsx",1,TRUE,1)

Jan.Warwick <- climatets("~/Desktop/STA209/Data/Climate_Northeast_WarwickRI.xlsx",1,TRUE,1)[[1]]

Warwick <- arrange(bind_rows(climatets("~/Desktop/STA209/Data/Climate_Northeast_WarwickRI.xlsx",1,FALSE),
                                climatets("~/Desktop/STA209/Data/Climate_Northeast_WarwickRI.xlsx",2,FALSE),
                                climatets("~/Desktop/STA209/Data/Climate_Northeast_WarwickRI.xlsx",3,FALSE),
                                climatets("~/Desktop/STA209/Data/Climate_Northeast_WarwickRI.xlsx",4,FALSE)),Year)


Warwick_ts <- ts(Warwick,start=1950, frequency=4)

Warwick_ts_Max <- Warwick_ts[, "Max"]

Warwick_ts_Min <- Warwick_ts[, "Min"]

Warwick_ts_Avg <- Warwick_ts[, "Avg"]
par(mfrow=c(3,1))
tsplot(Warwick_ts_Max, main="Max Temperature", xlab="Time (months)", ylab="Temperature", col="blue", lwd=2)
tsplot(Warwick_ts_Min, main="Min Temperature", xlab="Time (months)", ylab="Temperature", col="green", lwd=2)
tsplot(Warwick_ts_Avg, main="Avg Temperature", xlab="Time (months)", ylab="Temperature", col="yellow", lwd=2)

par(mfrow=c(1,1))
Climate_Northeast_WorcesterMA_6_16_19 <- read_excel("~/Desktop/STA209/Data/Climate_Northeast_WorcesterMA 6-16-19.xlsx")
climatets("~/Desktop/STA209/Data/Climate_Northeast_WorcesterMA 6-16-19.xlsx",1,TRUE,1)

Jan.Worcester <- climatets("~/Desktop/STA209/Data/Climate_Northeast_WorcesterMA 6-16-19.xlsx",1,TRUE,1)[[1]]

Worcester <- arrange(bind_rows(climatets("~/Desktop/STA209/Data/Climate_Northeast_WorcesterMA 6-16-19.xlsx",1,FALSE),
                                climatets("~/Desktop/STA209/Data/Climate_Northeast_WorcesterMA 6-16-19.xlsx",2,FALSE),
                                climatets("~/Desktop/STA209/Data/Climate_Northeast_WorcesterMA 6-16-19.xlsx",3,FALSE),
                                climatets("~/Desktop/STA209/Data/Climate_Northeast_WorcesterMA 6-16-19.xlsx",4,FALSE)),Year)


Worcester_ts <- ts(Worcester,start=1950, frequency=4)

Worcester_ts_Max <- Worcester_ts[, "Max"]

Worcester_ts_Min <- Worcester_ts[, "Min"]

Worcester_ts_Avg <- Worcester_ts[, "Avg"]
par(mfrow=c(3,1))
tsplot(Worcester_ts_Max, main="Max Temperature", xlab="Time (months)", ylab="Temperature", col="blue", lwd=2)
tsplot(Worcester_ts_Min, main="Min Temperature", xlab="Time (months)", ylab="Temperature", col="green", lwd=2)
tsplot(Worcester_ts_Avg, main="Avg Temperature", xlab="Time (months)", ylab="Temperature", col="yellow", lwd=2)

par(mfrow=c(1,1))
Climate_Northeast_NewHavenCT <- read_excel('~/Desktop/STA209/Data/Climate_Northeast_NewHavenCT.xlsx')
climatets("~/Desktop/STA209/Data/Climate_Northeast_NewHavenCT.xlsx",1,TRUE,1)

Jan.New_Haven <- climatets("~/Desktop/STA209/Data/Climate_Northeast_NewHavenCT.xlsx",1,TRUE,1)[[1]]

New_Haven <- arrange(bind_rows(
  climatets("~/Desktop/STA209/Data/Climate_Northeast_NewHavenCT.xlsx", 1, FALSE) %>% mutate(Year = as.numeric(Year)),
  climatets("~/Desktop/STA209/Data/Climate_Northeast_NewHavenCT.xlsx", 2, FALSE) %>% mutate(Year = as.numeric(Year)),
  climatets("~/Desktop/STA209/Data/Climate_Northeast_NewHavenCT.xlsx", 3, FALSE) %>% mutate(Year = as.numeric(Year)),
  climatets("~/Desktop/STA209/Data/Climate_Northeast_NewHavenCT.xlsx", 4, FALSE) %>% mutate(Year = as.numeric(Year))
), Year)

New_Haven_ts <- ts(New_Haven,start=1950, frequency=4)

New_Haven_ts_Max <- New_Haven_ts[, "Max"]

New_Haven_ts_Min <- New_Haven_ts[, "Min"]

New_Haven_ts_Avg <- New_Haven_ts[, "Avg"]
par(mfrow=c(3,1))
tsplot(New_Haven_ts_Max, main="Max Temperature", xlab="Time (months)", ylab="Temperature", col="blue", lwd=2)
tsplot(New_Haven_ts_Min, main="Min Temperature", xlab="Time (months)", ylab="Temperature", col="green", lwd=2)
tsplot(New_Haven_ts_Avg, main="Avg Temperature", xlab="Time (months)", ylab="Temperature", col="yellow", lwd=2)

par(mfrow=c(1,1))
#Warwick
library(plotly)
seasonalts <- function(filename,nums) {
  ##############################################
  # function to set up monthly data for climate series
  # inputs: filename, sheet number vector
  # output: returns seasonal data
  # written by: P.Kohli
  #Date: 4/13/24
  ##############################################
  data.seasonal <- matrix(0,nrow=1,ncol=5) #can be asked as input
  for(i in 1:length(nums)){ #loops in R to do a task iteratively
    dataexcel<- read_excel(filename,sheet=nums[i],range="D7:H77")
    dataexcel <- dataexcel[-1,-2]
    dataexcel <- filter(dataexcel, !is.na(dataexcel$Avg))
    data  <- dataexcel %>% mutate(Month=nums[i])
    colnames(data.seasonal) <- colnames(data)
    data.seasonal <- rbind(data.seasonal,data)
  }
  data.seasonal <- data.seasonal[-1,]
  data.seasonal <- arrange(data.seasonal,Year)
  return(data.seasonal)
} 
Warwick2 <- seasonalts("~/Desktop/STA209/Data/Climate_Northeast_WarwickRI.xlsx",nums=1:4)

Warwick2$Month[Warwick2$Month == 1] <- "January"
Warwick2$Month[Warwick2$Month == 2] <- "April"
Warwick2$Month[Warwick2$Month  == 3] <- "July"
Warwick2$Month[Warwick2$Month == 4] <- "October"

Warwick2$Month  <- as.factor(Warwick2$Month)

Jan.Warwick.nna <- Jan.Warwick
p <- plot_ly(Jan.Warwick.nna , x = ~Year, y = ~Avg) %>%
  add_lines() %>%
  layout(yaxis = list(title = ""))
p

subplot(
  plot_ly(Jan.Warwick.nna, x = ~Year, y = ~Max, name = "Max"),
  plot_ly(Jan.Warwick.nna, x = ~Year, y = ~Min,name="Min"),
  plot_ly(Jan.Warwick.nna, x = ~Year, y = ~Avg,name="Avg")
) 

subplot(
  plot_ly(Jan.Warwick.nna, x = ~Year, y = ~Max, name = "Max"),
  plot_ly(Jan.Warwick.nna, x = ~Year, y = ~Min,name="Min"),
  plot_ly(Jan.Warwick.nna, x = ~Year, y = ~Avg,name="Avg")%>% 
    add_markers(alpha = 0.3, name = "Avg")
)
subplot(
  plot_ly(Jan.Warwick.nna, x = ~Year, y = ~Max, name = "Max")%>%add_lines(alpha=0.5),
  plot_ly(Jan.Warwick.nna, x = ~Year, y = ~Min,name="Min")%>%add_lines(alpha=0.5),
  plot_ly(Jan.Warwick.nna, x = ~Year, y = ~Avg,name="Avg")%>% 
    add_lines(alpha=0.5)
)
seasonalts <- function(filename,nums) {
  ##############################################
  # function to set up monthly data for climate series
  # inputs: filename, sheet number vector
  # output: returns seasonal data
  # written by: P.Kohli
  #Date: 4/13/24
  ##############################################
  data.seasonal <- matrix(0,nrow=1,ncol=5) #can be asked as input
  for(i in 1:length(nums)){ #loops in R to do a task iteratively
    dataexcel<- read_excel(filename,sheet=nums[i],range="D7:H77")
    dataexcel <- dataexcel[-1,-2]
    dataexcel <- filter(dataexcel, !is.na(dataexcel$Avg))
    data  <- dataexcel %>% mutate(Month=nums[i])
    colnames(data.seasonal) <- colnames(data)
    data.seasonal <- rbind(data.seasonal,data)
  }
  data.seasonal <- data.seasonal[-1,]
  data.seasonal <- arrange(data.seasonal,Year)
  return(data.seasonal)
} 
Worcester2 <- seasonalts("~/Desktop/STA209/Data/Climate_Northeast_WorcesterMA 6-16-19.xlsx",nums=1:4)

Worcester2$Month[Worcester2$Month == 1] <- "January"
Worcester2$Month[Worcester2$Month == 2] <- "April"
Worcester2$Month[Worcester2$Month  == 3] <- "July"
Worcester2$Month[Worcester2$Month == 4] <- "October"

Worcester2$Month  <- as.factor(Worcester2$Month)

Jan.Worcester.nna <- Jan.Worcester
p <- plot_ly(Jan.Worcester.nna , x = ~Year, y = ~Avg) %>%
  add_lines() %>%
  layout(yaxis = list(title = ""))
p

subplot(
  plot_ly(Jan.Worcester.nna, x = ~Year, y = ~Max, name = "Max"),
  plot_ly(Jan.Worcester.nna, x = ~Year, y = ~Min,name="Min"),
  plot_ly(Jan.Worcester.nna, x = ~Year, y = ~Avg,name="Avg")
) 

subplot(
  plot_ly(Jan.Worcester.nna, x = ~Year, y = ~Max, name = "Max"),
  plot_ly(Jan.Worcester.nna, x = ~Year, y = ~Min,name="Min"),
  plot_ly(Jan.Worcester.nna, x = ~Year, y = ~Avg,name="Avg")%>% 
    add_markers(alpha = 0.3, name = "Avg")
)
subplot(
  plot_ly(Jan.Worcester.nna, x = ~Year, y = ~Max, name = "Max")%>%add_lines(alpha=0.5),
  plot_ly(Jan.Worcester.nna, x = ~Year, y = ~Min,name="Min")%>%add_lines(alpha=0.5),
  plot_ly(Jan.Worcester.nna, x = ~Year, y = ~Avg,name="Avg")%>% 
    add_lines(alpha=0.5)
)
New_Haven2 <- seasonalts("~/Desktop/STA209/Data/Climate_Northeast_NewHavenCT.xlsx",nums=1:4)

New_Haven2$Month[New_Haven2$Month == 1] <- "January"
New_Haven2$Month[New_Haven2$Month == 2] <- "April"
New_Haven2$Month[New_Haven2$Month  == 3] <- "July"
New_Haven2$Month[New_Haven2$Month == 4] <- "October"

New_Haven2$Month  <- as.factor(New_Haven2$Month)

Jan.New_Haven.nna <- Jan.New_Haven
p <- plot_ly(Jan.New_Haven.nna , x = ~Year, y = ~Avg) %>%
  add_lines() %>%
  layout(yaxis = list(title = ""))
p

subplot(
  plot_ly(Jan.New_Haven.nna, x = ~Year, y = ~Max, name = "Max"),
  plot_ly(Jan.New_Haven.nna, x = ~Year, y = ~Min,name="Min"),
  plot_ly(Jan.New_Haven.nna, x = ~Year, y = ~Avg,name="Avg")
) 

subplot(
  plot_ly(Jan.New_Haven.nna, x = ~Year, y = ~Max, name = "Max"),
  plot_ly(Jan.New_Haven.nna, x = ~Year, y = ~Min,name="Min"),
  plot_ly(Jan.New_Haven.nna, x = ~Year, y = ~Avg,name="Avg")%>% 
    add_markers(alpha = 0.3, name = "Avg")
)
subplot(
  plot_ly(Jan.New_Haven.nna, x = ~Year, y = ~Max, name = "Max")%>%add_lines(alpha=0.5),
  plot_ly(Jan.New_Haven.nna, x = ~Year, y = ~Min,name="Min")%>%add_lines(alpha=0.5),
  plot_ly(Jan.New_Haven.nna, x = ~Year, y = ~Avg,name="Avg")%>% 
    add_lines(alpha=0.5)
)

The minimum for Worcester is consistently lower than Warwick or New Haven over the period. Warwick based on the graph is second when it comes to lowest minimum temperature with New Haven being 3rd. The spread on Warwick’s minimum temperature is much larger than the other two, from year to year the minimum temperature is much more volatile than the other two. New Haven appears to consistently have the highest maximum temperature, however Worcester and Warwick both boast higher all time maximum temperatures than New Haven. Worcester has the lowest average temperature with it rarely above 0 degrees Celsius. Warwick and New Haven are very similar when it comes to average temperature.

#Project Problem (i).One potential interesting time series project I have is projecting currency. I don’t know what currency I would like to focus on but I’m definitely interested in the idea. Another idea I have is focusing on a specific stock, preferably one with tons of public information like Google or Microsoft. (ii).I could look at reoccurring seasonal trends in a specific stock or the potential trend and increase or decrease in heteroskedasticity over time. For a currency data set I could look at the same sort of trend/seasonality questions but I could also connect it with current events. Possibly some sort of monetrary policy. (iii).For the two ideas I have I’ve already found data sets on Kaggle. I also know of R packages which allow you to pull current data such as stock prices or currency values.