Overview and EDA

PART 1: Overview

The USA Rainfall Prediction Dataset (2024-2025) provides data for a broad view of the weather and for developing predictive models of climate variables and rainfall patterns across 20 cities in the United States.
This report introduces and briefly summarizes climate factors such as temperature, humidity, wind speed, and precipitation in these cities. From these summary statistics, our group analyzes the trends of the atmospheric factors and their impact on the accuracy of forecasting rain tomorrow. In particular, an ARIMA time series model will be used to forecast the temperature in New York.

Introduction to the Variables: Each row represents daily weather data, and the key variables included are:

1. Date: The specific date of the observation.
2. Location: The city or geographical region where the weather data was recorded.
3. Temperature: The recorded temperature (treated as degrees Fahrenheit in the analysis below, though the exact unit needs to be confirmed).
4. Humidity: The percentage of humidity present in the atmosphere on the given day.
5. WindSpeed: The speed of the wind (treated as kilometers per hour in the analysis below, though the exact unit needs to be confirmed).
6. Precipitation: The amount of rainfall recorded on the given day (the unit, millimeters or inches, needs to be confirmed).
7. CloudCover: The percentage of cloud cover on that day, representing how much of the sky was obscured by clouds.
8. Pressure: The atmospheric pressure, typically measured in millibars (hPa).
9. RainTomorrow: A binary variable (0 or 1), where “1” indicates rain is expected the next day, and “0” indicates no rain.
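
The variable descriptions above can be turned into a few sanity checks before any analysis. The following is a minimal sketch, assuming the column names listed above; `check_weather_columns()` is a hypothetical helper that is not part of the original analysis, and the toy rows are made up for illustration.

```r
# Hypothetical helper (not part of the original analysis): checks that a
# data frame matches the variable descriptions listed above.
check_weather_columns <- function(df) {
  expected <- c("Date", "Location", "Temperature", "Humidity", "WindSpeed",
                "Precipitation", "CloudCover", "Pressure", "RainTomorrow")
  missing_cols <- setdiff(expected, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  c(
    humidity_is_percent   = all(df$Humidity >= 0 & df$Humidity <= 100, na.rm = TRUE),
    cloudcover_is_percent = all(df$CloudCover >= 0 & df$CloudCover <= 100, na.rm = TRUE),
    precipitation_nonneg  = all(df$Precipitation >= 0, na.rm = TRUE),
    raintomorrow_binary   = all(df$RainTomorrow %in% c(0, 1))
  )
}

# Toy rows for illustration only; these are not values from the dataset.
toy <- data.frame(
  Date = as.Date(c("2024-01-01", "2024-01-02")),
  Location = c("New York", "New York"),
  Temperature = c(60, 65), Humidity = c(55, 70), WindSpeed = c(10, 12),
  Precipitation = c(0, 0.4), CloudCover = c(40, 80), Pressure = c(1005, 1010),
  RainTomorrow = c(0, 1)
)
checks <- check_weather_columns(toy)
print(checks)
```

On the real data, the same helper could be called on the data frame right after it is loaded; any `FALSE` entry would point to a column that does not match its description.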

library(readr)      # read_csv()
library(dplyr)      # group_by(), summarise()
library(readxl)
library(rmarkdown)
library(tinytex)
library(utils)

USArain <- read_csv("C:/Users/manhphi2811/Downloads/USArain.csv")

USArain$Date <- as.Date(USArain$Date)
# Aggregate to one row per Date and Location: numeric variables are averaged,
# and RainTomorrow is 1 if any record for that day and city is 1
USAraind <- USArain |>
  group_by(Date, Location) |>
  summarise(
    Temperature = mean(Temperature, na.rm = TRUE),
    Humidity = mean(Humidity, na.rm = TRUE),
    WindSpeed = mean(WindSpeed, na.rm = TRUE),
    Precipitation = mean(Precipitation, na.rm = TRUE),
    CloudCover = mean(CloudCover, na.rm = TRUE),
    Pressure = mean(Pressure, na.rm = TRUE),
    RainTomorrow = max(RainTomorrow)
  )

head(USAraind)        #Display the first few rows of the dataset

## # A tibble: 6 × 9
## # Groups:   Date [1]
##   Date       Location  Temperature Humidity WindSpeed Precipitation CloudCover
##   <date>     <chr>           <dbl>    <dbl>     <dbl>         <dbl>      <dbl>
## 1 2024-01-01 Austin           64.6     46.1      16.8         0.469       57.6
## 2 2024-01-01 Charlotte        70.9     55.2      15.3         0.364       38.7
## 3 2024-01-01 Chicago          82.1     60.1      11.1         0.385       63.4
## 4 2024-01-01 Columbus         67.4     53.4      14.3         0.531       43.4
## 5 2024-01-01 Dallas           62.7     65.0      12.4         0.570       40.6
## 6 2024-01-01 Denver           58.4     49.0      17.9         0.197       70.2
## # ℹ 2 more variables: Pressure <dbl>, RainTomorrow <dbl>

colnames(USAraind)    #List the column names of the dataset

## [1] "Date"          "Location"      "Temperature"   "Humidity"     
## [5] "WindSpeed"     "Precipitation" "CloudCover"    "Pressure"     
## [9] "RainTomorrow"

dim(USAraind)         #Get dimensions (number of rows and columns)

## [1] 14620     9
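
The row count above can be cross-checked with a little arithmetic. The sketch below assumes one aggregated row per city per day, the date range shown by `summary()` (2024-01-01 to 2025-12-31), and the 20 cities mentioned in the overview.

```r
# One row per city per day is expected after grouping by Date and Location.
days <- seq(as.Date("2024-01-01"), as.Date("2025-12-31"), by = "day")
n_days <- length(days)      # 366 days in 2024 (leap year) + 365 in 2025
n_cities <- 20              # number of cities stated in the overview
expected_rows <- n_days * n_cities
print(c(n_days = n_days, expected_rows = expected_rows))
```

The product, 731 × 20 = 14620, matches the number of rows reported by `dim(USAraind)`, which is consistent with complete daily coverage for every city.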

summary(USAraind)     #Display descriptive statistics for each variable

##       Date              Location          Temperature       Humidity    
##  Min.   :2024-01-01   Length:14620       Min.   :34.30   Min.   :25.31  
##  1st Qu.:2024-07-01   Class :character   1st Qu.:58.92   1st Qu.:52.60  
##  Median :2024-12-31   Mode  :character   Median :65.23   Median :59.83  
##  Mean   :2024-12-31                      Mean   :65.18   Mean   :59.88  
##  3rd Qu.:2025-07-02                      3rd Qu.:71.42   3rd Qu.:67.08  
##  Max.   :2025-12-31                      Max.   :94.69   Max.   :93.93  
##    WindSpeed      Precipitation      CloudCover       Pressure     
##  Min.   : 1.818   Min.   :0.0000   Min.   :14.14   Min.   : 973.4  
##  1st Qu.:12.337   1st Qu.:0.2346   1st Qu.:46.80   1st Qu.: 998.8  
##  Median :15.008   Median :0.3727   Median :54.80   Median :1005.2  
##  Mean   :15.018   Mean   :0.3906   Mean   :54.94   Mean   :1005.2  
##  3rd Qu.:17.706   3rd Qu.:0.5262   3rd Qu.:63.06   3rd Qu.:1011.5  
##  Max.   :26.990   Max.   :1.4231   Max.   :96.23   Max.   :1036.8  
##   RainTomorrow   
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :1.0000  
##  Mean   :0.7072  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

colSums(is.na(USAraind))    #Check for missing values in each column

##          Date      Location   Temperature      Humidity     WindSpeed 
##             0             0             0             0             0 
## Precipitation    CloudCover      Pressure  RainTomorrow 
##             0             0             0             0

clean_data <- na.omit(USAraind)  #Remove rows with any missing values
colSums(is.na(clean_data))      # Verify that missing values have been handled

##          Date      Location   Temperature      Humidity     WindSpeed 
##             0             0             0             0             0 
## Precipitation    CloudCover      Pressure  RainTomorrow 
##             0             0             0             0

PART 2: Exploratory data analysis

1. Boxplot:

## 
##  Grubbs test for one outlier
## 
## data:  USAraind$Precipitation
## G = 4.88127, U = 0.99837, p-value = 0.007631
## alternative hypothesis: highest value 1.42305098684524 is an outlier

## 
##  Grubbs test for one outlier
## 
## data:  USAraind$Temperature
## G = 3.4206, U = 0.9992, p-value = 1
## alternative hypothesis: lowest value 34.3000451411701 is an outlier

## 
##  Grubbs test for one outlier
## 
## data:  USAraind$Humidity
## G = 3.32381, U = 0.99924, p-value = 1
## alternative hypothesis: lowest value 25.3105072675499 is an outlier

## 
##  Grubbs test for one outlier
## 
## data:  USAraind$WindSpeed
## G = 3.40376, U = 0.99921, p-value = 1
## alternative hypothesis: lowest value 1.81816641848697 is an outlier
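
The code that produced the boxplots and the Grubbs outputs above is not shown in this report. The output format resembles that of `grubbs.test()` from the `outliers` package, but that is an assumption. As a rough illustration of what the test measures, the sketch below computes the G and U statistics in base R on simulated data with one planted extreme value; `grubbs_stat()` is a hypothetical helper, not the function used in the report.

```r
# Base-R sketch of the one-sided Grubbs statistic for the most extreme value.
# grubbs_stat() is a hypothetical helper, not the function used in the report.
grubbs_stat <- function(x, alpha = 0.05) {
  x <- x[!is.na(x)]
  n <- length(x)
  dev <- abs(x - mean(x))
  idx <- which.max(dev)                  # most extreme observation
  G <- dev[idx] / sd(x)                  # distance from the mean, in SDs
  U <- 1 - G^2 * n / (n - 1)^2           # share of the sum of squares left without it
  # One-sided critical value at level alpha (standard Grubbs formula)
  t_crit <- qt(1 - alpha / n, df = n - 2)
  G_crit <- (n - 1) / sqrt(n) * sqrt(t_crit^2 / (n - 2 + t_crit^2))
  list(value = x[idx], G = G, U = U, G_crit = G_crit, flagged = G > G_crit)
}

set.seed(1)
# Simulated values with one planted extreme point; not the report's data
x <- c(rnorm(200, mean = 65, sd = 9), 140)
res <- grubbs_stat(x)
print(res)

# Points that boxplot(x) would draw individually (beyond the 1.5 * IQR fences)
print(boxplot.stats(x)$out)
```

In the outputs above, G and U are linked in the same way: for Precipitation, 1 - 4.88127² × 14620 / 14619² is approximately 0.99837.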

Temperature (°F):

Temperature ranges from about 34°F to 95°F.
Half of the values lie between about 59°F and 71°F, with a median around 65°F.
The Grubbs test does not flag the lowest value (34.3°F) as an outlier (p-value = 1).

Humidity (%):

Humidity ranges from about 25% to 94%.
Half of the values lie between about 53% and 67%, with a median around 60%.
The Grubbs test does not flag the lowest value (25.3%) as an outlier (p-value = 1).

Wind Speed (km/h):

Wind speed ranges from about 2 km/h to 27 km/h.
Half of the values lie between about 12 km/h and 18 km/h, with a median around 15 km/h.
The Grubbs test does not flag the lowest value (1.8 km/h) as an outlier (p-value = 1).

Conclusion:

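Temperature, humidity, and wind speed in this dataset are roughly symmetric, with median values close to their means and near the middle of their respective ranges.
The Grubbs test finds no significant outlier in any of these three variables.
Precipitation is the exception: the test flags its highest value (about 1.42) as an outlier (p-value = 0.0076).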
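
As a complement to the Grubbs test, the usual boxplot rule (points beyond 1.5 × IQR from the quartiles are drawn individually) can be worked out from the quartiles in the `summary()` output. This is a sketch of that arithmetic only; the quartile values are copied from the output earlier in this report.

```r
# Quartiles and extremes copied from the summary() output earlier in the report.
q <- data.frame(
  variable = c("Temperature", "Humidity", "WindSpeed"),
  min = c(34.30, 25.31, 1.818),
  q1  = c(58.92, 52.60, 12.337),
  q3  = c(71.42, 67.08, 17.706),
  max = c(94.69, 93.93, 26.990)
)
q$iqr   <- q$q3 - q$q1
q$lower <- q$q1 - 1.5 * q$iqr   # lower boxplot fence
q$upper <- q$q3 + 1.5 * q$iqr   # upper boxplot fence
q$min_below_fence <- q$min < q$lower
q$max_above_fence <- q$max > q$upper
print(q)
```

For all three variables the minimum and maximum fall slightly outside the fences, so a boxplot would draw at least a few individual points. With 14,620 rows this is expected: for normally distributed data roughly 0.7% of points lie beyond these fences. It does not contradict the Grubbs results, which test whether the single most extreme value is unusually far from the rest.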

2. Histogram

par(mfrow=c(1, 3))
hist(USAraind$WindSpeed, 
     main = "Histogram of Wind Speed", 
     xlab = "Wind Speed", 
     col = "slategray", 
     border = "black")
hist(USAraind$Precipitation, 
     main = "Histogram of Precipitation", 
     xlab = "Precipitation", 
     col = "slategray", 
     border = "black")
hist(USAraind$Temperature, 
     main = "Histogram of Temperature", 
     xlab = "Temperature", 
     col = "slategray", 
     border = "black")

Wind Speed Chart:

Distribution: Wind speed is concentrated around its mean of about 15, forming a shape similar to a bell curve. This indicates that most days have similar wind speeds.
Extreme Values: The upper tail reaches about 27, well above the mean. Such days could be due to the influence of storms, monsoons, or other extreme weather events.

Rainfall Chart:

Distribution: Rainfall is unevenly distributed and skewed to the right: many days have little or no rainfall (median about 0.37), while a few days have much higher values.
Extreme Values: The maximum is about 1.42, far above the mean of about 0.39, and it is the value flagged by the Grubbs test above. This indicates the potential for heavy rain events, which could lead to flooding.

Temperature Chart:

Distribution: Temperature is also concentrated around its mean (about 65°F). Compared with wind speed, however, the temperature distribution appears broader, indicating greater temperature fluctuations during the observed period.
Extreme Values: There are days with very high and very low temperatures (from about 34°F to 95°F), reflecting seasonal changes or the impact of other weather phenomena.
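
The difference between a bell-shaped and a right-skewed histogram can be quantified with the sample skewness. The sketch below is illustrative only: `sample_skewness()` is a hypothetical helper, and the two vectors are simulated stand-ins for a symmetric variable and a rainfall-like variable, not the report's data.

```r
# Hypothetical helper: moment-based sample skewness (close to 0 for a
# symmetric distribution, positive for a long right tail).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(42)
# Simulated stand-ins, not the report's data
symmetric_like <- rnorm(5000, mean = 15, sd = 4)      # bell-shaped
skewed_like    <- rgamma(5000, shape = 2, rate = 5)   # right-skewed, non-negative

skew_values <- c(symmetric = sample_skewness(symmetric_like),
                 skewed    = sample_skewness(skewed_like))
print(skew_values)
```

Applied to the real columns (for example `sample_skewness(USAraind$Precipitation)`), a clearly positive value would support the reading of the rainfall histogram above.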

par(mfrow=c(1, 3))
hist(USAraind$Humidity, 
     main = "Histogram of Humidity", 
     xlab = "Humidity", 
     col = "slategray", 
     border = "black")
hist(USAraind$CloudCover, 
     main = "Histogram of Cloud Cover", 
     xlab = "Cloud Cover", 
     col = "slategray", 
     border = "black")
hist(USAraind$Pressure, 
     main = "Histogram of Pressure", 
     xlab = "Pressure", 
     col = "slategray", 
     border = "black")

Humidity

The distribution of humidity is quite wide, with most values between 50% and 70% and a mean of about 60%. This indicates a diversity in humidity levels over the study period and area.

Cloud Cover

Cloud cover also has a relatively wide distribution, with most values between 40% and 60% and a mean of about 55%. This suggests significant variations in cloud presence in the sky.

Pressure

Pressure has a narrower distribution, concentrated around its mean of about 1005 hPa. This indicates that pressure is relatively stable over the study period and area.
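
One way to make "wide" and "narrow" comparable across variables with different units is to express the interquartile range as a percentage of the median. The sketch below only redoes arithmetic on the quartiles printed by `summary()` earlier in this report; the choice of this particular measure is an assumption, not the authors' method.

```r
# Quartiles and medians copied from the summary() output earlier in the report.
spread <- data.frame(
  variable = c("Humidity", "CloudCover", "Pressure"),
  q1     = c(52.60, 46.80, 998.8),
  median = c(59.83, 54.80, 1005.2),
  q3     = c(67.08, 63.06, 1011.5)
)
spread$iqr <- spread$q3 - spread$q1
spread$iqr_pct_of_median <- round(100 * spread$iqr / spread$median, 1)
print(spread)
```

The interquartile range is roughly a quarter to a third of the median for humidity and cloud cover, against about 1% for pressure. This relative measure depends on where zero sits on each scale, so it is a rough guide rather than a formal comparison.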

3. Heat map from the correlation

par(mfrow=c(1, 1))
# Drop RainTomorrow (column 9) and keep only the numeric variables
USAraindd <- USAraind[,-9]
numeric_data <- USAraindd[, sapply(USAraindd, is.numeric)]
cor_matrix <- cor(numeric_data, use = "complete.obs")

library(ggplot2)
library(reshape2)
library(pheatmap)

# Reshape the correlation matrix to long format for ggplot2
melted_cor_matrix <- melt(cor_matrix)

ggplot(data = melted_cor_matrix, aes(x=Var1, y=Var2, fill=value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "darkcyan", high = "darkgray", mid = "white", 
                       midpoint = 0, limits = c(-1,1), space = "Lab", 
                       name="Correlation") +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1)) +
  coord_fixed()

# Draw a heatmap from the correlation matrix
pheatmap(cor_matrix, display_numbers = TRUE, color = colorRampPalette(c("lightgray", "white", "darkcyan"))(50))

Based on the correlation matrix, we can draw the following conclusions:

The off-diagonal correlation coefficients range from -0.01 to 0.02, indicating no clear linear relationship among the variables.
Coefficients this close to 0 suggest that the variables are not linearly related; a nonlinear relationship, which linear correlation does not capture, cannot be ruled out. In this dataset, a change in one factor is therefore not associated with a systematic linear change in the other factors.

Conclusion:

There is no strong or practically meaningful linear correlation between the variables in this dataset.
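
Two points from this section can be illustrated with a short sketch on simulated data (not the report's data): how to read off the range of the off-diagonal coefficients from a correlation matrix, and why a coefficient near zero does not rule out a nonlinear relationship. The column names and distribution parameters below are assumptions chosen only to resemble the weather variables.

```r
set.seed(7)
n <- 2000
# Simulated, mutually independent columns standing in for the weather variables
sim <- data.frame(
  Temperature = rnorm(n, 65, 9),
  Humidity    = rnorm(n, 60, 11),
  WindSpeed   = rnorm(n, 15, 4)
)
cm <- cor(sim)
off_diag <- cm[upper.tri(cm)]        # each pair once, diagonal excluded
print(range(off_diag))

# A purely nonlinear dependence can also give a Pearson correlation near zero
x <- rnorm(n)
y <- x^2                             # y is fully determined by x
print(cor(x, y))
```

On the real matrix, `range(cor_matrix[upper.tri(cor_matrix)])` would give the interval quoted above directly.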

4. Effect of Multiple Factors on RainTomorrow

library(reshape2)
# Reshape the raw USArain data to long format: one row per observation and factor
USArain_long <- melt(USArain, id.vars = "RainTomorrow", 
                     measure.vars = c("Temperature", "Humidity", "WindSpeed", 
                                      "Precipitation", "CloudCover", "Pressure"))
ggplot(USArain_long, aes(x = value, y = RainTomorrow, color = factor(RainTomorrow))) +
  geom_point(position = position_jitter(height = 0.1), size = 2, alpha = 0.7) +  
  facet_wrap(~ variable, scales = "free_x") +  # Separate plots for each factor
  labs(title = "Effect of Multiple Factors on RainTomorrow",
       x = "Value", 
       y = "Rain Tomorrow (0 = No, 1 = Yes)",
       color = "Rain Tomorrow") +
  theme_minimal()

Analysis

The chart above shows the relationship between the weather factors (Temperature, Humidity, Wind Speed, Precipitation, Cloud Cover, and Pressure) and whether it rains tomorrow (Rain Tomorrow). The points are color-coded, with red indicating no rain (Rain Tomorrow = 0) and blue indicating rain (Rain Tomorrow = 1).

Analysis of Each Factor:

Humidity: There is a clear distinction. When humidity is high, especially between 80% and 100%, the probability of rain tomorrow increases.
Precipitation: As today’s precipitation increases, the likelihood of rain tomorrow also rises. Blue points become more frequent as precipitation values go up.
Cloud Cover: The likelihood of rain also increases with higher cloud cover.
Temperature, Wind Speed, and Pressure: These factors do not show a clear relationship with the probability of rain. The values are evenly distributed across both outcomes (rain and no rain).
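
The visual comparison above can be backed by a simple numeric summary: the mean of each factor within the rain and no-rain groups. The sketch below uses simulated stand-in data in which the dependence on Humidity is built in by assumption, so it shows the mechanics only, not results for the real dataset.

```r
set.seed(123)
n <- 1000
# Simulated stand-in data; the dependence on Humidity is built in for illustration
sim <- data.frame(
  Humidity    = runif(n, 20, 100),
  Temperature = rnorm(n, 65, 9)
)
p_rain <- plogis(-6 + 0.07 * sim$Humidity)      # assumed rain probability
sim$RainTomorrow <- rbinom(n, size = 1, prob = p_rain)

# Mean of each factor within the rain / no-rain groups
group_means <- aggregate(cbind(Humidity, Temperature) ~ RainTomorrow,
                         data = sim, FUN = mean)
print(group_means)
```

A factor that matters shows clearly different group means (Humidity here), while a factor unrelated to the outcome shows nearly identical ones (Temperature here). On the real data the same `aggregate()` call could be applied to all six factors.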

Conclusion

Humidity, Precipitation, and Cloud Cover are the most influential factors in determining the likelihood of rain tomorrow, with a clear trend: when these factors are higher, the chance of rain increases. Temperature, Wind Speed, and Pressure do not have a clear impact on rain prediction, as their values are distributed relatively equally between the two outcomes, and they require further analysis to determine their impact. The dominance of red points in the chart suggests that most records have no rain the next day, reflecting the nature of the data or the region of study. Note that the chart uses the raw USArain records; in the aggregated USAraind data, RainTomorrow has a mean of about 0.71 because it takes the maximum over each day's records for a city.
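
One possible way to carry out the further analysis mentioned above is a logistic regression, which gives one coefficient per factor on the log-odds scale, together with a class-balance table. The sketch below is not the authors' method and does not use the report's data: the data are simulated, and the coefficients in `eta` are assumptions chosen so that Humidity, Precipitation, and CloudCover matter while Pressure does not.

```r
set.seed(2024)
n <- 2000
# Simulated stand-in data; coefficients below are assumptions for illustration
sim <- data.frame(
  Humidity      = runif(n, 20, 100),
  Precipitation = rexp(n, rate = 2.5),
  CloudCover    = runif(n, 10, 100),
  Pressure      = rnorm(n, 1005, 9)
)
eta <- -9 + 0.06 * sim$Humidity + 1.5 * sim$Precipitation + 0.04 * sim$CloudCover
sim$RainTomorrow <- rbinom(n, 1, plogis(eta))

# Share of non-rainy vs. rainy rows (class balance)
print(prop.table(table(sim$RainTomorrow)))

# Logistic regression: one coefficient per factor, on the log-odds scale
fit <- glm(RainTomorrow ~ Humidity + Precipitation + CloudCover + Pressure,
           data = sim, family = binomial)
print(round(coef(summary(fit)), 3))
```

In such a fit, a positive and clearly non-zero coefficient corresponds to the pattern described for Humidity, Precipitation, and Cloud Cover, while a coefficient near zero corresponds to the pattern described for Pressure. An imbalanced outcome, as suggested by the dominance of red points, is worth keeping in mind when judging prediction accuracy.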