1. Select a column of our data that encodes time (e.g., “date”, “timestamp”, “year”, etc.). Convert this into a Date in R. Note: you may need to use some combination of as.Date or as_datetime (from lubridate). You may even need to paste year, month, day, hour, etc. together using paste (even if you need to make up a month and day, like “__/01/01”).
2. If you do not have a time-based column of data: find a Wikipedia page that is related to our dataset. Then, extract a time series of page views for that page using the Wikipedia page views website or the R package used in this week’s lab. If you choose this option, find ways to tie our results from the analysis below into what you’re seeing with our own data!
3. Choose a column of data to analyze over time. This should be a “response-like” variable that is of particular interest.
4. Create a tsibble object of just the date and response variable. Then, plot our data over time. Consider different windows of time.
5. Use linear regression to detect any upwards or downwards trends.
6. Use smoothing to detect at least one season in our data, and interpret our results.

Introduction

This is an analysis of weather data, focusing on temperature trends and seasonality.

#Load the data
weather_data <- read.csv("C:\\Users\\singh\\Documents\\StatsR\\dataset\\Final\\weather_repo.csv")

head(weather_data)
##       country location_name latitude longitude   timezone last_updated_epoch
## 1 Afghanistan         Kabul    34.52     69.18 Asia/Kabul         1693301400
## 2 Afghanistan         Kabul    34.52     69.18 Asia/Kabul         1693364400
## 3 Afghanistan         Kabul    34.52     69.18 Asia/Kabul         1693439100
## 4 Afghanistan         Kabul    34.52     69.18 Asia/Kabul         1693525500
## 5 Afghanistan         Kabul    34.52     69.18 Asia/Kabul         1693611000
## 6 Afghanistan         Kabul    34.52     69.18 Asia/Kabul         1693698300
##      last_updated temperature_celsius temperature_fahrenheit
## 1 8/29/2023 14:00                28.8                   83.8
## 2  8/30/2023 7:30                21.3                   70.3
## 3  8/31/2023 4:15                18.1                   64.6
## 4   9/1/2023 4:15                19.2                   66.6
## 5   9/2/2023 4:00                18.5                   65.3
## 6   9/3/2023 4:15                17.0                   62.6
##           condition_text wind_mph wind_kph wind_degree wind_direction
## 1                  Sunny      7.2     11.5          74            ENE
## 2                  Sunny      2.2      3.6         199            SSW
## 3                  Clear      2.2      3.6         256            WSW
## 4                  Clear      2.2      3.6         282            WNW
## 5 Moderate rain at times      2.2      3.6         262              W
## 6                  Clear      2.2      3.6         237            WSW
##   pressure_mb pressure_in precip_mm precip_in humidity cloud feels_like_celsius
## 1        1004       29.64       0.0      0.00       19     0               26.7
## 2        1011       29.84       0.0      0.00       54     4               21.3
## 3        1010       29.83       0.0      0.00       40     0               18.1
## 4        1010       29.83       0.0      0.00       49     5               19.2
## 5        1010       29.82       0.5      0.02       40    87               18.6
## 6        1009       29.79       0.0      0.00       27     0               17.0
##   feels_like_fahrenheit visibility_km visibility_miles uv_index gust_mph
## 1                  80.1            10                6        7      8.3
## 2                  70.3            10                6        6      2.5
## 3                  64.6            10                6        1      3.4
## 4                  66.6            10                6        1      3.1
## 5                  65.5            10                6        1      2.7
## 6                  62.6            10                6        1      2.9
##   gust_kph air_quality_Carbon_Monoxide air_quality_Ozone
## 1     13.3                       647.5             130.2
## 2      4.0                      2964.0              57.2
## 3      5.4                       754.4              46.5
## 4      5.0                      1228.3              45.4
## 5      4.3                       454.0              52.9
## 6      4.7                       701.0              64.4
##   air_quality_Nitrogen_dioxide air_quality_Sulphur_dioxide air_quality_PM2.5
## 1                          1.2                         0.4               7.9
## 2                         20.9                         0.8              31.7
## 3                          6.4                         0.4               7.7
## 4                         12.7                         0.7              20.9
## 5                          4.7                         0.4              10.8
## 6                          6.8                         0.6              12.2
##   air_quality_PM10 air_quality_us_epa_index air_quality_gb_defra_index sunrise
## 1             11.1                        1                          1 5:24 AM
## 2             39.3                        2                          3 5:25 AM
## 3             12.8                        1                          1 5:25 AM
## 4             52.4                        2                          2 5:26 AM
## 5             24.3                        1                          1 5:26 AM
## 6             25.9                        1                          2 5:27 AM
##    sunset moonrise moonset     moon_phase moon_illumination
## 1 6:24 PM  5:39 PM 2:48 AM Waxing Gibbous                93
## 2 6:23 PM  6:18 PM 4:05 AM      Full Moon                98
## 3 6:23 PM  6:18 PM 4:05 AM      Full Moon                98
## 4 6:21 PM  6:52 PM 5:22 AM Waning Gibbous               100
## 5 6:20 PM  7:23 PM 6:36 AM Waning Gibbous                99
## 6 6:19 PM  7:53 PM 7:48 AM Waning Gibbous                94
# Converting 'last_updated' column to POSIXct datetime format
weather_data$last_updated <- as.POSIXct(weather_data$last_updated, format = "%m/%d/%Y %H:%M")

# Checking the structure of the 'last_updated' column after conversion
str(weather_data$last_updated)
##  POSIXct[1:2534], format: "2023-08-29 14:00:00" "2023-08-30 07:30:00" "2023-08-31 04:15:00" ...
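
The conversion above can be sanity-checked with a small sketch. The sample strings below are made up in the same layout as last_updated, and tz = "UTC" is an assumption added so the result does not depend on the local machine; the original call uses the session's default time zone. Strings that do not match the format come back as NA, so counting NA values is a quick way to catch parse failures.

```r
# Hypothetical sample strings in the same "%m/%d/%Y %H:%M" layout as last_updated
raw_times <- c("8/29/2023 14:00", "8/30/2023 7:30", "not a date")

# tz = "UTC" is an assumption; it is not part of the original conversion
parsed <- as.POSIXct(raw_times, format = "%m/%d/%Y %H:%M", tz = "UTC")

# Strings that do not match the format are returned as NA
n_failed <- sum(is.na(parsed))

# Drop the time of day when only the calendar date is needed
dates_only <- as.Date(parsed, tz = "UTC")

# Build a Date from separate parts, inventing a month and day when only a year exists
year_only <- 2023
made_up_date <- as.Date(paste(year_only, "01", "01", sep = "-"))
```

Running the same NA count on weather_data$last_updated would show whether any rows failed to parse before they are silently dropped or grouped together later.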

Inspect Column Names

names(weather_data)
##  [1] "country"                      "location_name"               
##  [3] "latitude"                     "longitude"                   
##  [5] "timezone"                     "last_updated_epoch"          
##  [7] "last_updated"                 "temperature_celsius"         
##  [9] "temperature_fahrenheit"       "condition_text"              
## [11] "wind_mph"                     "wind_kph"                    
## [13] "wind_degree"                  "wind_direction"              
## [15] "pressure_mb"                  "pressure_in"                 
## [17] "precip_mm"                    "precip_in"                   
## [19] "humidity"                     "cloud"                       
## [21] "feels_like_celsius"           "feels_like_fahrenheit"       
## [23] "visibility_km"                "visibility_miles"            
## [25] "uv_index"                     "gust_mph"                    
## [27] "gust_kph"                     "air_quality_Carbon_Monoxide" 
## [29] "air_quality_Ozone"            "air_quality_Nitrogen_dioxide"
## [31] "air_quality_Sulphur_dioxide"  "air_quality_PM2.5"           
## [33] "air_quality_PM10"             "air_quality_us_epa_index"    
## [35] "air_quality_gb_defra_index"   "sunrise"                     
## [37] "sunset"                       "moonrise"                    
## [39] "moonset"                      "moon_phase"                  
## [41] "moon_illumination"

Aggregate Data to Hourly Averages

library(dplyr)      # %>%, mutate(), group_by(), summarise()
library(lubridate)  # floor_date()

weather_data <- weather_data %>%
  mutate(hourly_time = floor_date(last_updated, "hour")) %>%
  group_by(hourly_time) %>%
  summarise(temperature_celsius = mean(temperature_celsius, na.rm = TRUE))

Create tsibble Object

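library(tsibble)  # as_tsibble()
library(ggplot2)  # ggplot(), geom_line(), labs()

# Keep only the time index and the response variable
ts_data <- weather_data %>%
  select(hourly_time, temperature_celsius) %>%
  as_tsibble(index = hourly_time)

# Plot the hourly averages over time
ts_data %>%
  ggplot(aes(x = hourly_time, y = temperature_celsius)) +
  geom_line() +
  labs(title = "Time Series Plot of Hourly Temperature", x = "Time", y = "Temperature (Celsius)")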

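The graph shows how the average temperature changes hour by hour. If the file covers more than one location, as the country and location_name columns suggest, each point is the mean across every location that reported in that hour. The series therefore moves up and down for two reasons: the usual cycle of warmer days and cooler nights, and changes in which locations happen to report in a given hour. The sharp drops toward the bottom of the graph are more likely hours in which only a few colder locations reported than hours with no data at all, because an hour without readings produces no row in the aggregated table and so appears as a gap in the index rather than as a low value.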
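
Whether hours are actually missing can be checked directly from the time index. The sketch below uses base R and a made-up vector of hourly timestamps (not the real data) to count the hours absent from an otherwise regular sequence; the tsibble package also offers has_gaps(), count_gaps(), and fill_gaps() for the same purpose.

```r
# Hypothetical hourly timestamps with 03:00 and 04:00 missing
hours <- as.POSIXct(
  c("2023-08-29 00:00", "2023-08-29 01:00", "2023-08-29 02:00",
    "2023-08-29 05:00", "2023-08-29 06:00"),
  tz = "UTC"
)

# Spacing between consecutive observations, expressed in hours
gap_hours <- as.numeric(diff(hours), units = "hours")

# A complete hourly sequence over the same span, for comparison
full_hours <- seq(min(hours), max(hours), by = "hour")

# Number of hours that have no observation
n_missing <- length(full_hours) - length(hours)
```

Applied to ts_data$hourly_time, a non-zero n_missing would mean the series is irregular, which matters for any method that assumes one observation per hour.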

Linear Regression for Trend Analysis

trend_model <- lm(temperature_celsius ~ hourly_time, data = ts_data)
summary(trend_model)
## 
## Call:
## lm(formula = temperature_celsius ~ hourly_time, data = ts_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.7587  -2.2873   0.3728   3.0029   8.0834 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.516e+02  1.285e+03   0.118    0.906
## hourly_time -7.547e-08  7.588e-07  -0.099    0.921
## 
## Residual standard error: 3.881 on 257 degrees of freedom
## Multiple R-squared:  3.85e-05,   Adjusted R-squared:  -0.003852 
## F-statistic: 0.009894 on 1 and 257 DF,  p-value: 0.9208

Coefficients: Because lm() treats a POSIXct predictor as the number of seconds since 1970-01-01, the estimate for hourly_time (about -7.55e-08) is in degrees Celsius per second, which is a negligible change. Its p-value is 0.921, far above the conventional 0.05 cutoff, so this model provides no statistical evidence that temperature is increasing or decreasing over the period.
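
To put the slope on a more readable scale, it can be converted from seconds to hours and days. This is plain arithmetic on the estimate printed in the summary() output; the variable names are illustrative only.

```r
# Slope copied from the summary() output; POSIXct is measured in seconds
slope_per_second <- -7.547e-08

# Rescale to degrees Celsius per hour and per day
slope_per_hour <- slope_per_second * 3600
slope_per_day  <- slope_per_second * 86400

round(slope_per_day, 4)
```

The result is a fitted change of well under one hundredth of a degree Celsius per day, and the standard error on the same scale is about ten times larger than the estimate itself.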

Residuals: The residuals, which are the differences between the observed temperatures and those predicted by the model, range from about -17.76 to 8.08. The median (0.37) is close to zero, but the spread is wide and lopsided: the largest negative residual is more than twice the size of the largest positive one, so a few hourly averages sit far below the fitted line and are not captured by the model.

R-squared values: The Multiple R-squared value is extremely low (approximately 0.0000385), indicating that time explains essentially none of the variation in temperature.

F-statistic and p-value: The overall p-value from the F-statistic is 0.9208. With a single predictor this is the same test as the t-test on hourly_time, so it confirms that the model is not statistically significant.

In summary, based on the linear regression analysis, there is no significant linear trend in temperature over the hours in this dataset. Temperature does not appear to consistently go up or down as time passes.

Smoothing and Seasonality Detection

library(forecast)

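# ts() assumes equally spaced observations: frequency = 24 treats each row as the
# next consecutive hour, so any gaps in hourly_time will shift the daily cycle
regular_ts <- ts(ts_data$temperature_celsius, frequency = 24)

# Apply STL decomposition and forecast
stlf_result <- stlf(regular_ts, method = "arima", h = 24 * 90) # Forecasting 90 days (about 3 months) ahead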

autoplot(stlf_result)

The forecast suggests that the temperature will continue to fluctuate within a similar range as the historical data, which makes sense if we assume that the underlying patterns of temperature change continue into the future. It does not indicate a clear trend of increasing or decreasing temperatures; instead, it shows variability around what appears to be a relatively stable mean temperature. The horizon deserves caution: the forecast covers 2,160 hours while the series has only 259 hourly observations, so the far end of the forecast is largely the estimated daily pattern repeated around the mean.
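
To look at the season itself rather than the forecast, the seasonal component of an STL decomposition can be extracted directly. The sketch below uses base R's stl() on simulated hourly temperatures; the 20 degree mean, 5 degree daily swing, and noise level are all made-up values, not estimates from the weather data.

```r
set.seed(1)

# Ten simulated days of hourly readings (all values hypothetical)
n_days <- 10
hour_of_day <- rep(0:23, times = n_days)
temp <- 20 + 5 * sin(2 * pi * hour_of_day / 24) +
  rnorm(length(hour_of_day), sd = 0.5)

# frequency = 24 declares one season per day of hourly data
temp_ts <- ts(temp, frequency = 24)

# s.window = "periodic" forces the same daily shape in every cycle
fit <- stl(temp_ts, s.window = "periodic")
seasonal <- fit$time.series[, "seasonal"]

# One full cycle of the seasonal component, and its peak-to-trough size
daily_profile <- as.numeric(seasonal[1:24])
daily_swing <- diff(range(daily_profile))
peak_hour <- which.max(daily_profile) - 1
```

Here daily_swing measures how many degrees separate the warmest and coolest hours of the estimated daily cycle, which is the kind of quantity the interpretation of a season can be built on. Running the same steps on regular_ts is only meaningful if the series really has one observation per hour.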

Model Diagnostics

# Checking the residuals of the ARIMA model
checkresiduals(stlf_result)

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  ARIMA(1,0,2) with non-zero mean
## Q* = 223.75, df = 45, p-value < 2.2e-16
## 
## Model df: 3.   Total lags used: 48

The Ljung-Box test does not support the conclusion that the model is adequate. Its null hypothesis is that the residuals are uncorrelated, and with Q* = 223.75 on 45 degrees of freedom (p-value < 2.2e-16) that hypothesis is firmly rejected. Even if the residual plot shows no obvious pattern around zero, significant autocorrelation remains, so the STL + ARIMA(1,0,2) model has not captured all of the structure in the data and its forecasts should be read with caution.
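
How to read a Ljung-Box result can be illustrated with base R's Box.test() on two simulated series: white noise, which is what residuals from an adequate model should resemble, and an AR(1) series that still contains structure. Both series are simulated for illustration and are not the model's residuals. The last line recomputes the tail probability for the statistic reported in the output above.

```r
set.seed(42)

# White noise: no autocorrelation by construction
white_noise <- rnorm(500)

# AR(1) series with coefficient 0.8: strongly autocorrelated
autocorrelated <- as.numeric(arima.sim(model = list(ar = 0.8), n = 500))

# Ljung-Box test over the first 48 lags for each series
lb_noise <- Box.test(white_noise, lag = 48, type = "Ljung-Box")
lb_ar    <- Box.test(autocorrelated, lag = 48, type = "Ljung-Box")

# Upper-tail chi-squared probability for the reported Q* = 223.75 on 45 df
p_reported <- pchisq(223.75, df = 45, lower.tail = FALSE)
```

A small p-value, as for the autocorrelated series and for the reported statistic, means the hypothesis of uncorrelated residuals is rejected.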

# ACF plot
acf(regular_ts, main="ACF for Temperature Data")

# PACF plot
pacf(regular_ts, main="PACF for Temperature Data")

Both the ACF and PACF plots show only a limited number of significant points, which suggests that the series does not have strong autoregressive or moving average components at higher lags, and the lack of significant points in the PACF beyond the first couple of lags is consistent with a low-order autoregressive part. These plots are computed on the raw series rather than on the model residuals, however, so they cannot confirm that ARIMA(1,0,2) is sufficient; the Ljung-Box test above indicates that autocorrelation remains in the residuals.
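
Counting "significant points" does not have to be done by eye. The sketch below computes the autocorrelations without plotting and compares them with the approximate 95% bounds of plus or minus 1.96 / sqrt(n), which is where acf() draws its dashed lines by default. The series is simulated, so the resulting lags are illustrative only.

```r
set.seed(7)

# Simulated AR(1) series standing in for a real one
x <- as.numeric(arima.sim(model = list(ar = 0.8), n = 300))

# Autocorrelations up to lag 48, without drawing the plot
acf_out <- acf(x, lag.max = 48, plot = FALSE)

# Approximate 95% bound under the white-noise hypothesis
bound <- qnorm(0.975) / sqrt(length(x))

# Drop lag 0 (always 1) and list the lags outside the bound
r <- acf_out$acf[-1]
significant_lags <- which(abs(r) > bound)
```

The same check on residuals(stlf_result), rather than on the raw series, would show at which lags the fitted model leaves autocorrelation behind.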

Conclusion

Trend Analysis

The linear regression analysis did not reveal any significant long-term trends in the temperature data. This suggests that over the period analyzed, the temperature did not consistently increase or decrease.

Seasonality and Cyclic Behavior

The STL + ARIMA model describes the temperature data as a stable pattern of fluctuation around a roughly constant mean, consistent with what would be expected from hourly temperature readings. The decomposition assumed a 24-hour season (frequency = 24); no seasonal pattern beyond this daily cycle was examined.

Model Adequacy

The Ljung-Box test on the residuals from the STL + ARIMA model (Q* = 223.75, p-value < 2.2e-16) rejects the hypothesis that they are white noise. The model captured part of the pattern in the data, but significant autocorrelation remains in the residuals, so the fit is not fully adequate.

ACF and PACF Analysis

Both the ACF and PACF plots of the raw series showed limited significant correlations, suggesting no strong autoregressive or moving average components at higher lags. Because they were not computed on the model residuals, they do not by themselves rule out remaining structure.

Data Characterization

The data is best characterized by hour-to-hour variability around a stable mean, with no significant linear trend. The daily component likely reflects natural diurnal changes in temperature, although the STL + ARIMA model leaves some autocorrelation unexplained.