The USA Rainfall Prediction Dataset (2024-2025) provides data for a broad view of the weather and for developing predictive models of climate variables and rainfall patterns across 20 cities in the United States.
This dataset summarizes climate factors such as temperature, humidity, wind speed, and precipitation for cities across the USA. From these statistics, our group analyzes the trends of these atmospheric factors and their impact on the accuracy of forecasting rain tomorrow. In addition, an ARIMA time series model will be used to forecast the temperature in New York.
Introduction to the Variables: Each row represents daily weather data, and the key variables included are:
1. Date: The specific date of the observation.
2. Location: The city or geographical region where the weather data was recorded.
3. Temperature: The recorded temperature (likely in degrees Fahrenheit or Celsius, though the exact unit needs to be confirmed).
4. Humidity: The percentage of humidity present in the atmosphere on the given day.
5. WindSpeed: The speed of the wind, measured in an appropriate unit (likely miles per hour or kilometers per hour).
6. Precipitation: The amount of rainfall recorded on the given day, measured in millimeters or inches.
7. CloudCover: The percentage of cloud cover on that day, representing how much of the sky was obscured by clouds.
8. Pressure: The atmospheric pressure, typically measured in millibars (hPa).
9. RainTomorrow: A binary variable (0 or 1), where “1” indicates rain is expected the next day, and “0” indicates no rain.
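The variable list above can be turned into a quick sanity check before any analysis. The following is a minimal sketch on a made-up two-row data frame: the helper name `check_weather_columns` and the toy values are hypothetical, and the 0 to 100 range check assumes that Humidity and CloudCover are percentages, as described.

```r
# Column names as listed in the variable description above
expected_cols <- c("Date", "Location", "Temperature", "Humidity", "WindSpeed",
                   "Precipitation", "CloudCover", "Pressure", "RainTomorrow")

# Hypothetical helper: returns a character vector of problems (empty if none)
check_weather_columns <- function(df) {
  problems <- character(0)
  missing_cols <- setdiff(expected_cols, names(df))
  if (length(missing_cols) > 0) {
    # Stop early: the remaining checks need all columns to be present
    return(paste("missing column:", missing_cols))
  }
  if (!all(df$RainTomorrow %in% c(0, 1))) {
    problems <- c(problems, "RainTomorrow is not binary (0/1)")
  }
  for (col in c("Humidity", "CloudCover")) {
    if (any(df[[col]] < 0 | df[[col]] > 100, na.rm = TRUE)) {
      problems <- c(problems, paste(col, "outside 0-100"))
    }
  }
  problems
}

# Toy data (illustrative values only, not taken from the real file)
toy <- data.frame(
  Date = as.Date(c("2024-01-01", "2024-01-02")),
  Location = c("Austin", "Denver"),
  Temperature = c(64.6, 58.4), Humidity = c(46.1, 49.0),
  WindSpeed = c(16.8, 17.9), Precipitation = c(0.47, 0.20),
  CloudCover = c(57.6, 70.2), Pressure = c(1005, 1010),
  RainTomorrow = c(0, 1)
)
print(check_weather_columns(toy))
```

An empty result means the frame has the expected columns and plausible ranges; units (for example °F versus °C) still have to be confirmed separately.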
library(readxl)
library(readr)
library(dplyr)
library(rmarkdown)
library(tinytex)
library(utils)
USArain <- read_csv("C:/Users/manhphi2811/Downloads/USArain.csv")
USArain$Date <- as.Date(USArain$Date)
# Average each variable per day and city; RainTomorrow is 1 if any record that day is 1
USAraind <- USArain |>
  group_by(Date, Location) |>
  summarise(
    Temperature = mean(Temperature, na.rm = TRUE),
    Humidity = mean(Humidity, na.rm = TRUE),
    WindSpeed = mean(WindSpeed, na.rm = TRUE),
    Precipitation = mean(Precipitation, na.rm = TRUE),
    CloudCover = mean(CloudCover, na.rm = TRUE),
    Pressure = mean(Pressure, na.rm = TRUE),
    RainTomorrow = max(RainTomorrow)
  )
head(USAraind) # Display the first few rows of the dataset
## # A tibble: 6 × 9
## # Groups: Date [1]
## Date Location Temperature Humidity WindSpeed Precipitation CloudCover
## <date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2024-01-01 Austin 64.6 46.1 16.8 0.469 57.6
## 2 2024-01-01 Charlotte 70.9 55.2 15.3 0.364 38.7
## 3 2024-01-01 Chicago 82.1 60.1 11.1 0.385 63.4
## 4 2024-01-01 Columbus 67.4 53.4 14.3 0.531 43.4
## 5 2024-01-01 Dallas 62.7 65.0 12.4 0.570 40.6
## 6 2024-01-01 Denver 58.4 49.0 17.9 0.197 70.2
## # ℹ 2 more variables: Pressure <dbl>, RainTomorrow <dbl>
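The aggregation step above collapses the raw records to one row per date and city. The sketch below illustrates the same idea on a four-row toy data frame, using base R `aggregate()` and `merge()` instead of the dplyr pipeline so that it runs without extra packages. The toy values are invented. It also shows a side effect of taking `max()` of a 0/1 flag: a day counts as rainy if any of its records is 1.

```r
# Toy raw data: two records for Austin on the same day (invented values)
toy <- data.frame(
  Date = as.Date(c("2024-01-01", "2024-01-01", "2024-01-01", "2024-01-02")),
  Location = c("Austin", "Austin", "Denver", "Austin"),
  Temperature = c(60, 70, 50, 65),
  RainTomorrow = c(0, 1, 0, 0)
)

# Mean of the continuous variable per (Date, Location)
means <- aggregate(Temperature ~ Date + Location, data = toy, FUN = mean)

# Max of the binary flag per (Date, Location): 1 if any record is 1
flags <- aggregate(RainTomorrow ~ Date + Location, data = toy, FUN = max)

# Combine both summaries into one daily table
daily <- merge(means, flags, by = c("Date", "Location"))
daily <- daily[order(daily$Date, daily$Location), ]
print(daily)
```

Because `max()` is used, the share of rainy rows after aggregation can be noticeably higher than in the raw records, which is worth keeping in mind when reading the RainTomorrow mean in the summary output.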
## [1] "Date" "Location" "Temperature" "Humidity"
## [5] "WindSpeed" "Precipitation" "CloudCover" "Pressure"
## [9] "RainTomorrow"
## [1] 14620 9
## Date Location Temperature Humidity
## Min. :2024-01-01 Length:14620 Min. :34.30 Min. :25.31
## 1st Qu.:2024-07-01 Class :character 1st Qu.:58.92 1st Qu.:52.60
## Median :2024-12-31 Mode :character Median :65.23 Median :59.83
## Mean :2024-12-31 Mean :65.18 Mean :59.88
## 3rd Qu.:2025-07-02 3rd Qu.:71.42 3rd Qu.:67.08
## Max. :2025-12-31 Max. :94.69 Max. :93.93
## WindSpeed Precipitation CloudCover Pressure
## Min. : 1.818 Min. :0.0000 Min. :14.14 Min. : 973.4
## 1st Qu.:12.337 1st Qu.:0.2346 1st Qu.:46.80 1st Qu.: 998.8
## Median :15.008 Median :0.3727 Median :54.80 Median :1005.2
## Mean :15.018 Mean :0.3906 Mean :54.94 Mean :1005.2
## 3rd Qu.:17.706 3rd Qu.:0.5262 3rd Qu.:63.06 3rd Qu.:1011.5
## Max. :26.990 Max. :1.4231 Max. :96.23 Max. :1036.8
## RainTomorrow
## Min. :0.0000
## 1st Qu.:0.0000
## Median :1.0000
## Mean :0.7072
## 3rd Qu.:1.0000
## Max. :1.0000
## Date Location Temperature Humidity WindSpeed
## 0 0 0 0 0
## Precipitation CloudCover Pressure RainTomorrow
## 0 0 0 0
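The calls that produced the column names, dimensions, summary statistics, and missing-value counts above are not shown in the report. The following is a minimal sketch, assuming they were the standard base R inspection functions; it runs on a small invented data frame rather than on `USAraind`.

```r
# Toy data frame with one missing value (invented values, for illustration only)
toy <- data.frame(
  Location = c("Austin", "Denver", "Chicago", "Dallas"),
  Temperature = c(64.6, 58.4, NA, 62.7),
  RainTomorrow = c(0, 1, 1, 0)
)

print(names(toy))           # column names
print(dim(toy))             # number of rows and columns
print(summary(toy))         # per-column summary statistics
print(colSums(is.na(toy)))  # count of missing values per column
```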
clean_data <- na.omit(USAraind) # Remove rows with any missing values
colSums(is.na(clean_data)) # Verify that missing values have been handled
## Date Location Temperature Humidity WindSpeed
## 0 0 0 0 0
## Precipitation CloudCover Pressure RainTomorrow
## 0 0 0 0
##
## Grubbs test for one outlier
##
## data: USAraind$Precipitation
## G = 4.88127, U = 0.99837, p-value = 0.007631
## alternative hypothesis: highest value 1.42305098684524 is an outlier
##
## Grubbs test for one outlier
##
## data: USAraind$Temperature
## G = 3.4206, U = 0.9992, p-value = 1
## alternative hypothesis: lowest value 34.3000451411701 is an outlier
##
## Grubbs test for one outlier
##
## data: USAraind$Humidity
## G = 3.32381, U = 0.99924, p-value = 1
## alternative hypothesis: lowest value 25.3105072675499 is an outlier
##
## Grubbs test for one outlier
##
## data: USAraind$WindSpeed
## G = 3.40376, U = 0.99921, p-value = 1
## alternative hypothesis: lowest value 1.81816641848697 is an outlier
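The code behind the Grubbs output above is not shown. Its format resembles `grubbs.test()` from the `outliers` package, but that is an assumption. To make the statistic concrete, the sketch below computes G by hand in base R, as the largest absolute deviation from the mean divided by the standard deviation, and compares it with the commonly cited two-sided critical value based on the t distribution. The function names and the toy vector are hypothetical.

```r
# G statistic: largest absolute deviation from the mean, in standard deviations
grubbs_stat <- function(x) {
  x <- x[!is.na(x)]
  dev <- abs(x - mean(x))
  i <- which.max(dev)
  list(G = dev[i] / sd(x), suspect = x[i], n = length(x))
}

# Two-sided critical value at significance level alpha
grubbs_critical <- function(n, alpha = 0.05) {
  t_crit <- qt(1 - alpha / (2 * n), df = n - 2)
  (n - 1) / sqrt(n) * sqrt(t_crit^2 / (n - 2 + t_crit^2))
}

# Toy vector with one obviously extreme value (invented data)
x <- c(10, 11, 9, 10, 12, 10, 11, 30)
res <- grubbs_stat(x)
crit <- grubbs_critical(res$n)

cat("Suspect value:", res$suspect, "\n")
cat("G =", round(res$G, 3), " critical value =", round(crit, 3), "\n")
cat("Flag as outlier:", res$G > crit, "\n")
```

The test assumes approximately normal data and checks only one extreme value at a time, so a skewed variable such as precipitation can be flagged even when the value is physically plausible.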
Temperature (°F):
Temperature ranges from approximately 30°F to 100°F.
Most temperatures are concentrated between 50°F and 80°F, with a median around 65°F.
There are some lower temperature values, but no noticeable outliers.
Humidity (%):
Humidity ranges from about 20% to 100%.
Most humidity values are concentrated between 40% and 80%, with a median around 60%.
There are some lower humidity values near 20%, but no clear outliers.
Wind Speed (km/h):
Wind speed ranges from 0 km/h to around 30 km/h.
Wind speed is mostly concentrated between 10 km/h and 20 km/h, with a median around 15 km/h.
Some values near 0 km/h appear, but there are no significant outliers.
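The descriptions above read like boxplot summaries, although the plots themselves are not shown. The following is a minimal sketch, assuming the usual boxplot convention that points more than 1.5 times the hinge spread beyond the hinges count as outliers. It uses base R `boxplot.stats()` on an invented vector.

```r
# Toy wind speed values with one extreme reading (invented data)
x <- c(12, 14, 15, 15, 16, 17, 18, 19, 45)

stats <- boxplot.stats(x, coef = 1.5)

# Five values: lower whisker, lower hinge, median, upper hinge, upper whisker
print(stats$stats)

# Points lying beyond the whiskers, i.e. the values a boxplot draws individually
print(stats$out)
```

An empty `stats$out` for a variable corresponds to the statement that it has no noticeable outliers.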
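Conclusion:
Temperature, humidity, and wind speed show no significant outliers. Only the highest precipitation value (about 1.42) is flagged by the Grubbs test (p-value = 0.007631).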
par(mfrow = c(1, 3))
hist(USAraind$WindSpeed,
     main = "Histogram of Wind Speed",
     xlab = "Wind Speed",
     col = "slategray",
     border = "black")
hist(USAraind$Precipitation,
     main = "Histogram of Precipitation",
     xlab = "Precipitation",
     col = "slategray",
     border = "black")
hist(USAraind$Temperature,
     main = "Histogram of Temperature",
     xlab = "Temperature",
     col = "slategray",
     border = "black")
par(mfrow = c(1, 3))
hist(USAraind$Humidity,
     main = "Histogram of Humidity",
     xlab = "Humidity",
     col = "slategray",
     border = "black")
hist(USAraind$CloudCover,
     main = "Histogram of Cloud Cover",
     xlab = "Cloud Cover",
     col = "slategray",
     border = "black")
hist(USAraind$Pressure,
     main = "Histogram of Pressure",
     xlab = "Pressure",
     col = "slategray",
     border = "black")
The distribution of humidity is fairly wide, with most values between about 50% and 70%. This indicates diverse humidity levels over the study period and area.
Cloud cover also has a relatively wide distribution, with most values between about 40% and 60%. This suggests significant variation in cloud presence in the sky.
Pressure has a narrower distribution, concentrated around 1005 hPa. This indicates that pressure is relatively stable over the study period and area.
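Whether a histogram looks wide or narrow depends partly on the axis scale, so it can help to compare spread relative to the mean. The sketch below computes the coefficient of variation (standard deviation divided by mean) for two invented vectors. The helper name `cv` and the values are hypothetical and only illustrate the comparison.

```r
# Coefficient of variation: spread relative to the mean (unit-free)
cv <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

# Invented values on roughly the scales of the two variables
humidity <- c(45, 52, 60, 67, 75)          # percent
pressure <- c(995, 1000, 1005, 1010, 1015) # hPa

spread <- c(Humidity = cv(humidity), Pressure = cv(pressure))
print(round(spread, 4))
```

On the real data the same function could be applied to each numeric column, for example with `sapply()`, to check whether pressure is indeed the most stable variable in relative terms.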
par(mfrow = c(1, 1))
USAraindd <- USAraind[, -9] # Drop RainTomorrow (column 9)
numeric_data <- USAraindd[, sapply(USAraindd, is.numeric)]
cor_matrix <- cor(numeric_data, use = "complete.obs")
library(ggplot2)
library(reshape2)
library(pheatmap)
melted_cor_matrix <- melt(cor_matrix)
ggplot(data = melted_cor_matrix, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "darkcyan", high = "darkgray", mid = "white",
                       midpoint = 0, limit = c(-1, 1), space = "Lab",
                       name = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1)) +
  coord_fixed()
# Draw a heatmap from the correlation matrix
pheatmap(cor_matrix, display_numbers = TRUE, color = colorRampPalette(c("lightgray", "white", "darkcyan"))(50))
The correlation coefficients between these variables range from -0.01 to 0.02, indicating no clear linear relationship among them.
The coefficients close to 0 suggest that these variables are not linearly related, or that any relationship between them is nonlinear and therefore not captured by linear correlation. In this dataset, changes in one factor are not accompanied by systematic linear changes in the other factors.
Conclusion:
There is no strong or significant correlation between the variables in this dataset.
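A correlation near zero rules out a linear relationship, not every relationship. The following sketch illustrates that point with invented data in which `y` is fully determined by `x`, yet the Pearson coefficient is essentially zero because the dependence is symmetric rather than linear.

```r
# x is symmetric around zero; y depends on x exactly, but not linearly
x <- seq(-3, 3, by = 0.1)
y <- x^2

pearson_raw <- cor(x, y)            # linear correlation between x and y
pearson_abs <- cor(abs(x), y)       # correlation after a simple transformation

cat("cor(x, y)      =", round(pearson_raw, 4), "\n")
cat("cor(abs(x), y) =", round(pearson_abs, 4), "\n")
```

For the weather variables, a scatter plot matrix or a rank-based measure would be a reasonable next step before concluding that the factors are unrelated.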
library(reshape2)
# Reshape to long format: one row per observation and weather variable
USArain_long <- melt(USArain, id.vars = "RainTomorrow",
                     measure.vars = c("Temperature", "Humidity", "WindSpeed",
                                      "Precipitation", "CloudCover", "Pressure"))
ggplot(USArain_long, aes(x = value, y = RainTomorrow, color = factor(RainTomorrow))) +
  geom_point(position = position_jitter(height = 0.1), size = 2, alpha = 0.7) +
  facet_wrap(~ variable, scales = "free_x") + # Separate plots for each factor
  labs(title = "Effect of Multiple Factors on RainTomorrow",
       x = "Value",
       y = "Rain Tomorrow (0 = No, 1 = Yes)",
       color = "Rain Tomorrow") +
  theme_minimal()
The chart shows the relationship between the weather factors (Temperature, Humidity, Wind Speed, Precipitation, Cloud Cover, and Pressure) and whether it rains tomorrow (RainTomorrow). The points are color-coded, with red indicating no rain (RainTomorrow = 0) and blue indicating rain (RainTomorrow = 1).
Humidity: There is a clear distinction. When humidity is high, especially between 80% and 100%, the probability of rain tomorrow increases.
Precipitation: As today’s precipitation increases, the likelihood of rain tomorrow also rises. Blue points become more frequent as precipitation values go up.
Cloud Cover: The likelihood of rain also increases with higher cloud cover.
Temperature, Wind Speed, and Pressure: These factors do not show a clear relationship with the probability of rain. The values are evenly distributed across both outcomes (rain and no rain).
Conclusion:
Humidity, Precipitation, and Cloud Cover are the most influential factors in determining the likelihood of rain tomorrow, with a clear trend: when these factors are higher, the chance of rain increases. Temperature, Wind Speed, and Pressure do not have a clear impact on rain prediction, as their values are distributed relatively equally between the two outcomes.
The dominance of red points in the chart suggests that the dataset has a higher proportion of days without rain, which may reflect the nature of the data or the regions studied, where non-rainy days are more common.
In summary, predicting rain tomorrow depends primarily on humidity, precipitation, and cloud cover. The other factors require further analysis to determine their impact.
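One way to carry out the further analysis mentioned above is a logistic regression of RainTomorrow on the weather factors, which puts a number on each factor's effect instead of reading it off a jittered scatter plot. The sketch below is a minimal example on simulated data: the data-generating rule (rain probability rising with humidity only) is an assumption made for illustration, so the fitted coefficients say nothing about the real dataset.

```r
set.seed(42)
n <- 2000

# Simulated predictors (invented, roughly on the scales of the real variables)
sim <- data.frame(
  Humidity = runif(n, min = 20, max = 100),
  Temperature = rnorm(n, mean = 65, sd = 10)
)

# Assumed rule: the log-odds of rain rise with humidity; temperature has no effect
sim$RainTomorrow <- rbinom(n, size = 1, prob = plogis(-6 + 0.08 * sim$Humidity))

# Logistic regression: binomial family with the default logit link
fit <- glm(RainTomorrow ~ Humidity + Temperature, data = sim, family = binomial)

# Coefficient table: estimate, standard error, z value, p-value
print(round(coef(summary(fit)), 3))
```

On the real data the same call could be run with `data = USArain` and all six factors as predictors. A positive, significant coefficient would support the visual impression for humidity, precipitation, and cloud cover.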