This is an analysis of weather data, focusing on temperature trends and seasonality.
# Load the data
weather_data <- read.csv("C:\\Users\\singh\\Documents\\StatsR\\dataset\\Final\\weather_repo.csv")
head(weather_data)
## country location_name latitude longitude timezone last_updated_epoch
## 1 Afghanistan Kabul 34.52 69.18 Asia/Kabul 1693301400
## 2 Afghanistan Kabul 34.52 69.18 Asia/Kabul 1693364400
## 3 Afghanistan Kabul 34.52 69.18 Asia/Kabul 1693439100
## 4 Afghanistan Kabul 34.52 69.18 Asia/Kabul 1693525500
## 5 Afghanistan Kabul 34.52 69.18 Asia/Kabul 1693611000
## 6 Afghanistan Kabul 34.52 69.18 Asia/Kabul 1693698300
## last_updated temperature_celsius temperature_fahrenheit
## 1 8/29/2023 14:00 28.8 83.8
## 2 8/30/2023 7:30 21.3 70.3
## 3 8/31/2023 4:15 18.1 64.6
## 4 9/1/2023 4:15 19.2 66.6
## 5 9/2/2023 4:00 18.5 65.3
## 6 9/3/2023 4:15 17.0 62.6
## condition_text wind_mph wind_kph wind_degree wind_direction
## 1 Sunny 7.2 11.5 74 ENE
## 2 Sunny 2.2 3.6 199 SSW
## 3 Clear 2.2 3.6 256 WSW
## 4 Clear 2.2 3.6 282 WNW
## 5 Moderate rain at times 2.2 3.6 262 W
## 6 Clear 2.2 3.6 237 WSW
## pressure_mb pressure_in precip_mm precip_in humidity cloud feels_like_celsius
## 1 1004 29.64 0.0 0.00 19 0 26.7
## 2 1011 29.84 0.0 0.00 54 4 21.3
## 3 1010 29.83 0.0 0.00 40 0 18.1
## 4 1010 29.83 0.0 0.00 49 5 19.2
## 5 1010 29.82 0.5 0.02 40 87 18.6
## 6 1009 29.79 0.0 0.00 27 0 17.0
## feels_like_fahrenheit visibility_km visibility_miles uv_index gust_mph
## 1 80.1 10 6 7 8.3
## 2 70.3 10 6 6 2.5
## 3 64.6 10 6 1 3.4
## 4 66.6 10 6 1 3.1
## 5 65.5 10 6 1 2.7
## 6 62.6 10 6 1 2.9
## gust_kph air_quality_Carbon_Monoxide air_quality_Ozone
## 1 13.3 647.5 130.2
## 2 4.0 2964.0 57.2
## 3 5.4 754.4 46.5
## 4 5.0 1228.3 45.4
## 5 4.3 454.0 52.9
## 6 4.7 701.0 64.4
## air_quality_Nitrogen_dioxide air_quality_Sulphur_dioxide air_quality_PM2.5
## 1 1.2 0.4 7.9
## 2 20.9 0.8 31.7
## 3 6.4 0.4 7.7
## 4 12.7 0.7 20.9
## 5 4.7 0.4 10.8
## 6 6.8 0.6 12.2
## air_quality_PM10 air_quality_us_epa_index air_quality_gb_defra_index sunrise
## 1 11.1 1 1 5:24 AM
## 2 39.3 2 3 5:25 AM
## 3 12.8 1 1 5:25 AM
## 4 52.4 2 2 5:26 AM
## 5 24.3 1 1 5:26 AM
## 6 25.9 1 2 5:27 AM
## sunset moonrise moonset moon_phase moon_illumination
## 1 6:24 PM 5:39 PM 2:48 AM Waxing Gibbous 93
## 2 6:23 PM 6:18 PM 4:05 AM Full Moon 98
## 3 6:23 PM 6:18 PM 4:05 AM Full Moon 98
## 4 6:21 PM 6:52 PM 5:22 AM Waning Gibbous 100
## 5 6:20 PM 7:23 PM 6:36 AM Waning Gibbous 99
## 6 6:19 PM 7:53 PM 7:48 AM Waning Gibbous 94
# Converting 'last_updated' column to POSIXct datetime format
weather_data$last_updated <- as.POSIXct(weather_data$last_updated, format = "%m/%d/%Y %H:%M")
# Checking the structure of the 'last_updated' column after conversion
str(weather_data$last_updated)
## POSIXct[1:2534], format: "2023-08-29 14:00:00" "2023-08-30 07:30:00" "2023-08-31 04:15:00" ...
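A quick sanity check after this kind of conversion is to count the strings that failed to parse, since as.POSIXct returns NA for any value that does not match the format. The sketch below uses a small hypothetical vector rather than the weather file, and it sets tz = "UTC" purely as an assumption for reproducibility (the call above relies on the session's local time zone).

```r
# Hypothetical sample in the same "%m/%d/%Y %H:%M" layout as last_updated,
# plus one string in a different layout and one genuinely missing value.
raw_times <- c("8/29/2023 14:00", "8/30/2023 7:30", "2023-08-31 04:15", NA)

# tz = "UTC" is an assumption made here, not something the original code sets.
parsed <- as.POSIXct(raw_times, format = "%m/%d/%Y %H:%M", tz = "UTC")

# Strings that were present but did not match the format become NA.
failed <- is.na(parsed) & !is.na(raw_times)
n_failed <- sum(failed)

# Show the offending input strings so they can be inspected.
raw_times[failed]
```

If n_failed is zero on the real column, every non-missing timestamp was converted.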
names(weather_data)
## [1] "country" "location_name"
## [3] "latitude" "longitude"
## [5] "timezone" "last_updated_epoch"
## [7] "last_updated" "temperature_celsius"
## [9] "temperature_fahrenheit" "condition_text"
## [11] "wind_mph" "wind_kph"
## [13] "wind_degree" "wind_direction"
## [15] "pressure_mb" "pressure_in"
## [17] "precip_mm" "precip_in"
## [19] "humidity" "cloud"
## [21] "feels_like_celsius" "feels_like_fahrenheit"
## [23] "visibility_km" "visibility_miles"
## [25] "uv_index" "gust_mph"
## [27] "gust_kph" "air_quality_Carbon_Monoxide"
## [29] "air_quality_Ozone" "air_quality_Nitrogen_dioxide"
## [31] "air_quality_Sulphur_dioxide" "air_quality_PM2.5"
## [33] "air_quality_PM10" "air_quality_us_epa_index"
## [35] "air_quality_gb_defra_index" "sunrise"
## [37] "sunset" "moonrise"
## [39] "moonset" "moon_phase"
## [41] "moon_illumination"
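# Packages used below: dplyr (pipes, mutate, summarise), lubridate (floor_date),
# tsibble (as_tsibble), ggplot2 (plots)
library(dplyr)
library(lubridate)
library(tsibble)
library(ggplot2)
# Average all readings that fall in the same clock hour
# (note: this overwrites weather_data with the hourly summary)
weather_data <- weather_data %>%
  mutate(hourly_time = floor_date(last_updated, "hour")) %>%
  group_by(hourly_time) %>%
  summarise(temperature_celsius = mean(temperature_celsius, na.rm = TRUE))
# Convert the hourly summary to a tsibble indexed by time
ts_data <- weather_data %>%
  select(hourly_time, temperature_celsius) %>%
  as_tsibble(index = hourly_time)
# Plot the hourly mean temperature
ts_data %>%
  ggplot(aes(x = hourly_time, y = temperature_celsius)) +
  geom_line() +
  labs(title = "Time Series Plot of Hourly Temperature", x = "Time", y = "Temperature (Celsius)")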
The graph shows how the temperature changes hour by hour. It looks like
the temperature goes up and down quite a bit, which is normal since it
gets warmer during the day and cooler at night. There are some sharp
drops to the bottom of the graph, which might mean there are some hours
where the temperature wasn’t recorded.
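Whether hours are actually missing can be checked directly from the timestamps instead of being read off the plot. The following is a minimal base R sketch on a hypothetical vector of hourly timestamps; with the real data, the hours vector would presumably be ts_data$hourly_time.

```r
# Hypothetical hourly timestamps with two hours missing (03:00 and 04:00).
hours <- as.POSIXct("2023-08-29 00:00", tz = "UTC") + 3600 * c(0, 1, 2, 5, 6)

# Time difference between consecutive observations, in hours.
step_hours <- as.numeric(diff(hours), units = "hours")

# Any step longer than one hour marks a gap; the step length minus one
# is the number of missing hourly readings inside that gap.
gap_starts <- hours[-length(hours)][step_hours > 1]
n_missing <- sum(step_hours[step_hours > 1] - 1)
```

The tsibble package also provides has_gaps() and fill_gaps(), which report and fill implicit gaps in a tsibble such as ts_data.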
trend_model <- lm(temperature_celsius ~ hourly_time, data = ts_data)
summary(trend_model)
##
## Call:
## lm(formula = temperature_celsius ~ hourly_time, data = ts_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.7587 -2.2873 0.3728 3.0029 8.0834
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.516e+02 1.285e+03 0.118 0.906
## hourly_time -7.547e-08 7.588e-07 -0.099 0.921
##
## Residual standard error: 3.881 on 257 degrees of freedom
## Multiple R-squared: 3.85e-05, Adjusted R-squared: -0.003852
## F-statistic: 0.009894 on 1 and 257 DF, p-value: 0.9208
Coefficients: The estimate for hourly_time is about -7.55e-08. Because hourly_time is a POSIXct value measured in seconds, this is the change in temperature per second, equivalent to roughly -0.0065 °C per day. The p-value for hourly_time is 0.921, far above the usual 0.05 cutoff, so this analysis provides no statistical evidence that temperature is increasing or decreasing over time.
Residuals: The residuals, which are the differences between the observed temperatures and those predicted by the model, range from about -17.76 to 8.08. The median (0.37) is close to zero, but the largest negative residual is more than twice the size of the largest positive one, so the residuals are skewed toward low values and the model leaves substantial variability unexplained.
R-squared values: The Multiple R-squared is extremely low (approximately 0.0000385) and the Adjusted R-squared is slightly negative (-0.0039), indicating that time alone explains essentially none of the variation in temperature.
F-statistic and p-value: The overall p-value from the F-statistic is 0.9208. With a single predictor this matches the test on the hourly_time coefficient and again shows that the model is not statistically significant.
In summary, based on the linear regression analysis, there is no significant linear trend in temperature over the hours in this dataset. Temperature does not appear to consistently go up or down as time passes.
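Because the predictor is a POSIXct timestamp, the fitted slope is expressed per second, which makes it hard to read. The short sketch below rescales the reported estimate and standard error to a per-day basis; the two input numbers are copied from the summary output above, and the variable names are only illustrative.

```r
slope_per_second <- -7.547e-08   # Estimate for hourly_time from summary(trend_model)
se_per_second    <- 7.588e-07    # its Std. Error

seconds_per_day <- 60 * 60 * 24

# Rescale both quantities from "per second" to "per day".
slope_per_day <- slope_per_second * seconds_per_day
se_per_day    <- se_per_second * seconds_per_day

# The t value does not depend on the time unit.
t_value <- slope_per_day / se_per_day

round(c(slope_per_day = slope_per_day, se_per_day = se_per_day, t_value = t_value), 4)
```

The rescaled slope is about -0.0065 °C per day with a standard error roughly ten times larger, which is another way of seeing that the trend is indistinguishable from zero.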
library(forecast)
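# ts() assumes evenly spaced observations; frequency = 24 treats every 24 rows as one day
regular_ts <- ts(ts_data$temperature_celsius, frequency = 24)
# Apply STL decomposition and forecast
stlf_result <- stlf(regular_ts, method = "arima", h = 24 * 90) # 2160 hourly steps, about 3 months ahead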
autoplot(stlf_result)
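The forecast suggests that the temperature will continue to fluctuate within a similar range as the historical data, which makes sense if we assume that the underlying patterns of temperature change will continue into the future. The forecast does not indicate a clear trend of increasing or decreasing temperatures; instead, it shows variability around what appears to be a relatively stable mean temperature. Note that the horizon (2,160 hourly steps) is far longer than the 259 hourly observations the model was fitted on, so the later part of the forecast should be read as a projection of the average pattern rather than a precise prediction.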
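The "stable mean" behaviour is what a stationary ARIMA model produces at long horizons: its point forecasts decay toward the fitted mean. The sketch below illustrates this with base R on a simulated AR(1) series; the data are synthetic and the model order is an assumption chosen for simplicity, not the STL + ARIMA fit used above.

```r
set.seed(1)

# Simulated stationary AR(1) series around a mean of 20 (synthetic data).
x <- 20 + arima.sim(model = list(ar = 0.7), n = 300)

# Fit an AR(1) with a mean term; arima() labels the mean "intercept".
fit <- arima(x, order = c(1, 0, 0))

# Forecast 200 steps ahead.
fc <- predict(fit, n.ahead = 200)

fitted_mean   <- unname(coef(fit)["intercept"])
last_forecast <- as.numeric(tail(fc$pred, 1))

# Forecast standard errors grow with the horizon and then level off.
se_first <- fc$se[1]
se_last  <- fc$se[200]
```

At the end of the horizon, last_forecast is numerically indistinguishable from fitted_mean, while se_last is larger than se_first.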
# Checking the residuals of the ARIMA model
checkresiduals(stlf_result)
##
## Ljung-Box test
##
## data: Residuals from STL + ARIMA(1,0,2) with non-zero mean
## Q* = 223.75, df = 45, p-value < 2.2e-16
##
## Model df: 3. Total lags used: 48
The Ljung-Box test rejects the null hypothesis of uncorrelated residuals (Q* = 223.75, df = 45, p-value < 2.2e-16), so significant autocorrelation remains in the residuals of the STL + ARIMA(1,0,2) model. Even if the residual plots show no obvious pattern around zero, the model has not captured all of the structure in the data, and its forecasts and prediction intervals should be treated with caution.
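To make the reading of the Ljung-Box p-value concrete, the sketch below applies Box.test from base R to two simulated series: pure white noise and an autocorrelated AR(1) series. Both series are synthetic and the lag of 20 is an arbitrary choice for illustration.

```r
set.seed(42)

# Series 1: independent noise. Series 2: strongly autocorrelated AR(1).
white_noise <- rnorm(500)
ar_series   <- as.numeric(arima.sim(model = list(ar = 0.8), n = 500))

# Null hypothesis of the test: no autocorrelation up to the given lag.
lb_noise <- Box.test(white_noise, lag = 20, type = "Ljung-Box")
lb_ar    <- Box.test(ar_series,   lag = 20, type = "Ljung-Box")

c(noise = lb_noise$p.value, ar = lb_ar$p.value)
```

A very small p-value, as for the AR(1) series here and for the model residuals above, means the series is not white noise. When the test is applied to residuals of a fitted model, the fitdf argument of Box.test reduces the degrees of freedom by the number of estimated parameters, which is why the output above reports df = 45 for 48 lags and 3 model parameters.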
# ACF plot
acf(regular_ts, main="ACF for Temperature Data")
# PACF plot
pacf(regular_ts, main="PACF for Temperature Data")
The ACF and PACF plots show only a limited number of significant lags, which suggests that the data does not have strong autoregressive or moving average components at higher lags. The lack of significant points in the PACF plot beyond the first couple of lags supports the use of a low-order autoregressive part in the ARIMA model. These plots describe the raw series, however; they do not by themselves show that the ARIMA(1,0,2) model is sufficient, and the Ljung-Box test on the model residuals (p-value < 2.2e-16) indicates that some autocorrelation is left unexplained.
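"Significant points" in an ACF plot are the lags whose autocorrelation falls outside the approximate 95% bound of ±1.96/sqrt(n) drawn by acf(). The sketch below counts them numerically on a simulated AR(1) series; the series is synthetic and the variable names are illustrative.

```r
set.seed(7)

# Synthetic AR(1) series standing in for a temperature series.
x <- as.numeric(arima.sim(model = list(ar = 0.6), n = 400))

# Compute the autocorrelations without drawing the plot.
acf_vals <- acf(x, lag.max = 24, plot = FALSE)

# Approximate 95% bound used for the dashed lines in the ACF plot.
bound <- qnorm(0.975) / sqrt(length(x))

# Drop lag 0 (always equal to 1) before comparing against the bound.
r <- as.numeric(acf_vals$acf)[-1]
significant_lags <- which(abs(r) > bound)
```

The same comparison can be run on pacf(x, plot = FALSE), whose result starts at lag 1 and therefore needs no lag to be dropped.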
Trend Analysis
The linear regression analysis did not reveal any significant long-term trends in the temperature data. This suggests that over the period analyzed, the temperature did not consistently increase or decrease.
Seasonality and Cyclic Behavior
The STL + ARIMA model, fitted with a period of 24 observations, indicated a stable pattern of fluctuation consistent with what would be expected from hourly temperature readings. Apart from this daily cycle, no strong longer-period seasonal or cyclic pattern was detected in the data.
Model Adequacy
The residual plots from the STL + ARIMA model showed no obvious pattern, but the Ljung-Box test (Q* = 223.75, p-value < 2.2e-16) rejects the hypothesis that the residuals are white noise. The model therefore captured the main patterns in the data but left some autocorrelation unexplained.
ACF and PACF Analysis
Both the ACF and PACF plots showed limited significant correlations, suggesting that there is no strong evidence of higher-order autoregressive or moving average components that would point to further seasonal or cyclic patterns in the data.
Data Characterization
The data is best characterized by variability that the STL + ARIMA model captured without needing to account for strong trends or seasonality. This might reflect the daily fluctuations in temperature due to natural diurnal changes.
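As an illustration of how a diurnal cycle is separated from the rest of the signal, the sketch below runs base R's stl() on a synthetic hourly series built from a 24-hour sine wave plus noise. The data, the amplitude of 5 °C, and the variable names are all assumptions for the example; this is not the decomposition of the weather file.

```r
set.seed(123)

# Ten days of synthetic hourly temperatures: mean 20, daily swing of +/- 5, noise sd 1.
n_days <- 10
hour_index <- 0:(24 * n_days - 1)
temp <- 20 + 5 * sin(2 * pi * hour_index / 24) + rnorm(length(hour_index), sd = 1)

# frequency = 24 tells stl() that one seasonal cycle spans 24 observations.
temp_ts <- ts(temp, frequency = 24)
fit <- stl(temp_ts, s.window = "periodic")

# The decomposition has three columns: seasonal, trend, remainder.
seasonal  <- fit$time.series[, "seasonal"]
remainder <- fit$time.series[, "remainder"]

seasonal_range <- diff(range(seasonal))   # size of the recovered daily swing
remainder_sd   <- sd(remainder)           # what is left after trend and cycle
```

In this toy case the seasonal component recovers most of the daily swing and the remainder is close to the injected noise. Calling plot(fit) shows the three components stacked under the original series.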