PART 1: Overview

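Introduction to the Variables: Each row of the aggregated table represents daily weather data for one location. The key variables are Date, Location, Temperature, Humidity, WindSpeed, Precipitation, CloudCover, Pressure, and RainTomorrow (0 = no rain, 1 = rain).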

library(readxl)
library(readr)    
library(dplyr)
library(rmarkdown)
library(tinytex)
library(utils)

USArain <- read_csv("C:/Users/manhphi2811/Downloads/USArain.csv")

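# Parse dates, then collapse repeated readings to one row per Date and Location
USArain$Date <- as.Date(USArain$Date)
USAraind <- USArain |>
  group_by(Date, Location) |>
  summarise(
    # Continuous measurements: daily mean, ignoring missing readings
    Temperature = mean(Temperature, na.rm = TRUE),
    Humidity = mean(Humidity, na.rm = TRUE),
    WindSpeed = mean(WindSpeed, na.rm = TRUE),
    Precipitation = mean(Precipitation, na.rm = TRUE),
    CloudCover = mean(CloudCover, na.rm = TRUE),
    Pressure = mean(Pressure, na.rm = TRUE),
    # Binary flag: 1 if any reading for that day and location has RainTomorrow = 1
    RainTomorrow = max(RainTomorrow)
  )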

head(USAraind)        #Display the first few rows of the dataset
## # A tibble: 6 × 9
## # Groups:   Date [1]
##   Date       Location  Temperature Humidity WindSpeed Precipitation CloudCover
##   <date>     <chr>           <dbl>    <dbl>     <dbl>         <dbl>      <dbl>
## 1 2024-01-01 Austin           64.6     46.1      16.8         0.469       57.6
## 2 2024-01-01 Charlotte        70.9     55.2      15.3         0.364       38.7
## 3 2024-01-01 Chicago          82.1     60.1      11.1         0.385       63.4
## 4 2024-01-01 Columbus         67.4     53.4      14.3         0.531       43.4
## 5 2024-01-01 Dallas           62.7     65.0      12.4         0.570       40.6
## 6 2024-01-01 Denver           58.4     49.0      17.9         0.197       70.2
## # ℹ 2 more variables: Pressure <dbl>, RainTomorrow <dbl>
colnames(USAraind)    #List the column names of the dataset
## [1] "Date"          "Location"      "Temperature"   "Humidity"     
## [5] "WindSpeed"     "Precipitation" "CloudCover"    "Pressure"     
## [9] "RainTomorrow"
dim(USAraind)         #Get dimensions (number of rows and columns)
## [1] 14620     9
summary(USAraind)     #Display descriptive statistics for each variable
##       Date              Location          Temperature       Humidity    
##  Min.   :2024-01-01   Length:14620       Min.   :34.30   Min.   :25.31  
##  1st Qu.:2024-07-01   Class :character   1st Qu.:58.92   1st Qu.:52.60  
##  Median :2024-12-31   Mode  :character   Median :65.23   Median :59.83  
##  Mean   :2024-12-31                      Mean   :65.18   Mean   :59.88  
##  3rd Qu.:2025-07-02                      3rd Qu.:71.42   3rd Qu.:67.08  
##  Max.   :2025-12-31                      Max.   :94.69   Max.   :93.93  
##    WindSpeed      Precipitation      CloudCover       Pressure     
##  Min.   : 1.818   Min.   :0.0000   Min.   :14.14   Min.   : 973.4  
##  1st Qu.:12.337   1st Qu.:0.2346   1st Qu.:46.80   1st Qu.: 998.8  
##  Median :15.008   Median :0.3727   Median :54.80   Median :1005.2  
##  Mean   :15.018   Mean   :0.3906   Mean   :54.94   Mean   :1005.2  
##  3rd Qu.:17.706   3rd Qu.:0.5262   3rd Qu.:63.06   3rd Qu.:1011.5  
##  Max.   :26.990   Max.   :1.4231   Max.   :96.23   Max.   :1036.8  
##   RainTomorrow   
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :1.0000  
##  Mean   :0.7072  
##  3rd Qu.:1.0000  
##  Max.   :1.0000
colSums(is.na(USAraind))    #Check for missing values in each column   
##          Date      Location   Temperature      Humidity     WindSpeed 
##             0             0             0             0             0 
## Precipitation    CloudCover      Pressure  RainTomorrow 
##             0             0             0             0
clean_data <- na.omit(USAraind)  #Remove rows with any missing values
colSums(is.na(clean_data))      # Verify that missing values have been handled
##          Date      Location   Temperature      Humidity     WindSpeed 
##             0             0             0             0             0 
## Precipitation    CloudCover      Pressure  RainTomorrow 
##             0             0             0             0
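
To make the aggregation step concrete, the sketch below applies the same group_by/summarise pattern to a small made-up table. The values and the object names toy and toy_daily are illustrative assumptions, not taken from USArain.csv. It shows that mean() collapses repeated readings into one row per Date and Location, while max() marks the day as rainy if any reading has RainTomorrow = 1. Passing .groups = "drop" is an optional choice that returns an ungrouped tibble; the report's own code keeps the default, which is why its output shows a "Groups: Date" line.

```r
library(dplyr)

# Toy data (made up for illustration): two readings per city on one day
toy <- data.frame(
  Date         = as.Date(rep("2024-01-01", 4)),
  Location     = c("Austin", "Austin", "Chicago", "Chicago"),
  Temperature  = c(60, 70, 80, 84),
  RainTomorrow = c(0, 1, 0, 0)
)

toy_daily <- toy |>
  group_by(Date, Location) |>
  summarise(
    Temperature  = mean(Temperature, na.rm = TRUE),  # daily mean
    RainTomorrow = max(RainTomorrow),                # 1 if any reading is 1
    .groups = "drop"                                 # return an ungrouped tibble
  )

print(toy_daily)
```

Austin ends up with a mean temperature of 65 and RainTomorrow = 1 because one of its two readings is rainy; Chicago gets 82 and 0.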

PART 2: Exploratory data analysis

1. Boxplot:

## 
##  Grubbs test for one outlier
## 
## data:  USAraind$Precipitation
## G = 4.88127, U = 0.99837, p-value = 0.007631
## alternative hypothesis: highest value 1.42305098684524 is an outlier
## 
##  Grubbs test for one outlier
## 
## data:  USAraind$Temperature
## G = 3.4206, U = 0.9992, p-value = 1
## alternative hypothesis: lowest value 34.3000451411701 is an outlier
## 
##  Grubbs test for one outlier
## 
## data:  USAraind$Humidity
## G = 3.32381, U = 0.99924, p-value = 1
## alternative hypothesis: lowest value 25.3105072675499 is an outlier
## 
##  Grubbs test for one outlier
## 
## data:  USAraind$WindSpeed
## G = 3.40376, U = 0.99921, p-value = 1
## alternative hypothesis: lowest value 1.81816641848697 is an outlier

Temperature (°F):

  • Temperature ranges from approximately 34°F to 95°F.

  • Most temperatures are concentrated between 50°F and 80°F, with a median around 65°F.

  • There are some lower temperature values, but no noticeable outliers.

Humidity (%):

  • Humidity ranges from about 25% to 94%.

  • Most humidity values are concentrated between 40% and 80%, with a median around 60%.

  • There are some lower humidity values near 25%, but no clear outliers.

Wind Speed (km/h):

  • Wind speed ranges from about 2 km/h to 27 km/h.

  • Wind speed is mostly concentrated between 10 km/h and 20 km/h, with a median around 15 km/h.

  • Some low values near 2 km/h appear, but there are no significant outliers.

Conclusion:

  • Temperature, humidity, and wind speed in this dataset are roughly symmetric, with median values near the middle of their respective ranges.
  • There are no significant outliers in any of the three variables. The Grubbs tests agree: the most extreme value of each variable is not flagged (p-value = 1 in each case).
  • Precipitation is the exception: the Grubbs test flags its highest value (1.42) as an outlier (p-value = 0.0076).
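
The code behind the boxplots and Grubbs tests is not shown in this section. The printed output appears to match the format of grubbs.test() from the outliers package, but that is an assumption. As a minimal sketch of the two checks, the base R code below uses simulated values (not the real data) with one planted extreme value. It lists the points a boxplot would draw as outliers and computes the Grubbs statistic G by hand.

```r
set.seed(1)
# Simulated stand-in for one numeric column (not the real data);
# 120 is a deliberately planted extreme value
x <- c(rnorm(200, mean = 65, sd = 9), 120)

# Boxplot rule: points more than 1.5 * IQR beyond the hinges are outliers
box_out <- boxplot.stats(x)$out

# Grubbs statistic for the single most extreme value:
# G = max |x_i - mean(x)| / sd(x)
dev     <- abs(x - mean(x))
G       <- max(dev) / sd(x)
suspect <- x[which.max(dev)]

boxplot(x, horizontal = TRUE, main = "Simulated temperature-like values")
print(box_out)
print(c(G = G, suspect = suspect))
```

The sketch stops at the statistic itself. Turning G into a p-value requires the t-distribution based critical value, which a packaged Grubbs test handles.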

2. Histogram

par(mfrow=c(1, 3))
hist(USAraind$WindSpeed, 
     main = "Histogram of Wind Speed", 
     xlab = "Wind Speed", 
     col = "slategray", 
     border = "black")
hist(USAraind$Precipitation, 
     main = "Histogram of Precipitation", 
     xlab = "Precipitation", 
     col = "slategray", 
     border = "black")
hist(USAraind$Temperature, 
     main = "Histogram of Temperature", 
     xlab = "Temperature", 
     col = "slategray", 
     border = "black")

Wind Speed Chart:

Rainfall Chart:

Temperature Chart:

par(mfrow=c(1, 3))
hist(USAraind$Humidity, 
     main = "Histogram of Humidity", 
     xlab = "Humidity", 
     col = "slategray", 
     border = "black")
hist(USAraind$CloudCover, 
     main = "Histogram of Cloud Cover", 
     xlab = "Cloud Cover", 
     col = "slategray", 
     border = "black")
hist(USAraind$Pressure, 
     main = "Histogram of Pressure", 
     xlab = "Pressure", 
     col = "slategray", 
     border = "black")

Humidity

Humidity has a fairly wide distribution, with most values falling between about 50% and 70% (median 59.8%). This indicates varied humidity levels across the study period and locations.

Cloud Cover

Cloud cover also has a relatively wide distribution, with most values falling between about 45% and 65% (median 54.8%). This suggests significant variation in cloud presence.

Pressure

Pressure has a narrower distribution, concentrated around 1005 (the mean and median are both 1005.2). This indicates that pressure is relatively stable across the study period and locations.
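
The "wide" versus "narrow" comparison can be made quantitative by relating each variable's interquartile range (IQR) to its median. The sketch below uses the quartiles printed by summary(USAraind) in Part 1. Using the relative IQR as the measure is a choice made here for illustration, not something the report computes, and the table name q is hypothetical.

```r
# Quartiles copied from the summary(USAraind) output in Part 1
q <- data.frame(
  variable = c("Humidity", "CloudCover", "Pressure"),
  q1       = c(52.60, 46.80, 998.8),
  median   = c(59.83, 54.80, 1005.2),
  q3       = c(67.08, 63.06, 1011.5)
)

# Interquartile range, and the IQR as a percentage of the median
q$iqr     <- q$q3 - q$q1
q$iqr_pct <- round(100 * q$iqr / q$median, 1)

print(q)
```

Relative to its typical level, pressure varies by about 1.3%, against roughly 24.2% for humidity and 29.7% for cloud cover.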

3. Heat map from the correlation

par(mfrow=c(1, 1))
# Drop column 9 (RainTomorrow) and keep only the numeric variables
USAraindd <- USAraind[,-9]
numeric_data <- USAraindd[, sapply(USAraindd, is.numeric)]
cor_matrix <- cor(numeric_data, use = "complete.obs")

library(ggplot2)
library(reshape2)
library(pheatmap)

# Reshape the matrix to long format (Var1, Var2, value) for ggplot2
melted_cor_matrix <- melt(cor_matrix)

ggplot(data = melted_cor_matrix, aes(x=Var1, y=Var2, fill=value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "darkcyan", high = "darkgray", mid = "white", 
                       midpoint = 0, limit = c(-1,1), space = "Lab", 
                       name="Correlation") +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1)) +
  coord_fixed()

# Draw a heatmap from the correlation matrix
pheatmap(cor_matrix, display_numbers = TRUE, color = colorRampPalette(c("lightgray", "white", "darkcyan"))(50))

Based on the correlation matrix, we can draw the following conclusions:
  • The correlation coefficients between these variables range from -0.01 to 0.02, indicating no clear linear relationship among them.

  • Coefficients this close to 0 mean the variables are either unrelated or related in a nonlinear way that linear correlation does not capture. In this dataset, changes in one variable are not linearly associated with changes in the others.

Conclusion:

There is no strong linear correlation between the weather variables in this dataset. RainTomorrow was excluded from the matrix, so this result says nothing about how the variables relate to rain.
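
The range of coefficients quoted above can also be read directly from the matrix instead of from the heat map. The following is a minimal sketch on simulated, independent columns (not the real data; the names sim, cor_sim and off_diag are hypothetical). It takes the upper triangle so each pair is counted once, and adds an optional cor.test() for a single pair.

```r
set.seed(42)
# Simulated stand-in: three independent numeric columns (not the real data)
sim <- data.frame(
  Temperature = rnorm(5000, mean = 65, sd = 9),
  Humidity    = rnorm(5000, mean = 60, sd = 10),
  WindSpeed   = rnorm(5000, mean = 15, sd = 4)
)

cor_sim <- cor(sim, use = "complete.obs")

# Upper triangle without the diagonal: one coefficient per variable pair
off_diag <- cor_sim[upper.tri(cor_sim)]
print(range(off_diag))

# Optional: test whether one pairwise correlation differs from zero
ct <- cor.test(sim$Temperature, sim$Humidity)
print(ct$p.value)
```

Applied to the report's cor_matrix, the same upper.tri() call would return the pairwise coefficients. With 14,620 rows, even a coefficient near 0.02 can reach statistical significance, so the size of the coefficient is more informative here than a p-value.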

4. Effect of Multiple Factors on RainTomorrow

library(reshape2)
USArain_long <- melt(USArain, id.vars = "RainTomorrow", 
                     measure.vars = c("Temperature", "Humidity", "WindSpeed", 
                                      "Precipitation", "CloudCover", "Pressure"))
ggplot(USArain_long, aes(x = value, y = RainTomorrow, color = factor(RainTomorrow))) +
  geom_point(position = position_jitter(height = 0.1), size = 2, alpha = 0.7) +  
  facet_wrap(~ variable, scales = "free_x") +  # Separate plots for each factor
  labs(title = "Effect of Multiple Factors on RainTomorrow",
       x = "Value", 
       y = "Rain Tomorrow (0 = No, 1 = Yes)",
       color = "Rain Tomorrow") +
  theme_minimal()

Analysis

The chart shows the relationship between weather factors (Temperature, Humidity, Wind Speed, Precipitation, Cloud Cover, and Pressure) and whether it rains tomorrow (Rain Tomorrow). The points are color-coded, with red indicating no rain (Rain Tomorrow = 0) and blue indicating rain (Rain Tomorrow = 1).

Analysis of Each Factor:

  • Humidity: There is a clear distinction. When humidity is high, especially between 80% and 100%, the probability of rain tomorrow increases.

  • Precipitation: As today's precipitation increases, the likelihood of rain tomorrow also rises. Blue points become more frequent as precipitation values go up.

  • Cloud Cover: The likelihood of rain also increases with higher cloud cover.

  • Temperature, Wind Speed, and Pressure: These factors do not show a clear relationship with the probability of rain. The values are evenly distributed across both outcomes (rain and no rain).

Conclusion

Humidity, Precipitation, and Cloud Cover appear to be the most influential factors in determining the likelihood of rain tomorrow, with a clear trend: when these factors are higher, the chance of rain increases. Temperature, Wind Speed, and Pressure do not have a clear impact on rain prediction, as their values are distributed relatively equally between the two outcomes.

The dominance of red points in the chart suggests that the raw dataset (USArain) has a higher proportion of records without rain, reflecting the nature of the data or the region of study, where non-rainy days are more common. This applies to the raw rows used in the plot. In the aggregated table (USAraind), where RainTomorrow is the daily maximum per location, the mean of RainTomorrow is 0.7072, so the class balance depends on which table is used.

In summary, predicting rain tomorrow primarily depends on humidity, precipitation, and cloud cover. Other factors require further analysis to determine their impact.
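
The visual reading above could be checked numerically. The sketch below is one possible way to do that, run on simulated data in which rain depends on humidity only. That dependence is an assumption built into the simulation, not a finding about USArain, and the names sim, p_rain, grp_means and fit are hypothetical. It reports the class balance, compares group means, and fits a logistic regression with glm().

```r
set.seed(7)
n <- 2000
# Simulated stand-in data: rain probability rises with Humidity only
sim <- data.frame(
  Humidity    = runif(n, min = 20, max = 100),
  Temperature = rnorm(n, mean = 65, sd = 9)
)
p_rain <- plogis(-8 + 0.1 * sim$Humidity)
sim$RainTomorrow <- rbinom(n, size = 1, prob = p_rain)

# Class balance: share of rows without (0) and with (1) rain
print(prop.table(table(sim$RainTomorrow)))

# Mean of each predictor within each outcome group
grp_means <- aggregate(cbind(Humidity, Temperature) ~ RainTomorrow,
                       data = sim, FUN = mean)
print(grp_means)

# Logistic regression: a positive coefficient means that higher values
# of the predictor go with a higher probability of rain
fit <- glm(RainTomorrow ~ Humidity + Temperature,
           data = sim, family = binomial)
print(round(coef(summary(fit)), 4))
```

On the real data, the same three steps could be run with all six predictors in the formula. A predictor whose group means barely differ and whose coefficient is close to zero would match the "no clear relationship" reading given for Temperature, Wind Speed, and Pressure.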