Data Cleaning
Ordering the data frame
To make the analysis easier, we will add the following columns (variables):
- obs_count: An observation count that identifies each row and allows a more orderly view of the table. It is generated with seq_along() from base R (row_number() from the dplyr package would work equally well).
- datetime: Each Unix timestamp in the dt column converted to a POSIXct date-time object, which is easier to read and work with for date-time operations in R.
- year: The year extracted from the datetime column, to analyze patterns by year.
- month: The month extracted from the datetime column, to analyze patterns by month. It is stored as a two-digit character string so it can be treated as a categorical value.
- day: The day of the month extracted from the datetime column, to analyze patterns by day.
- season: A categorical variable holding the season of the year (Winter, Spring, Summer or Autumn).
houston_weather_mdf$obs_count <- seq_along(houston_weather_mdf$dt) # Add observation "count" variable
houston_weather_mdf$datetime <- as.POSIXct(houston_weather_mdf$dt, origin="1970-01-01", tz="UTC") # Add "datetime" variable by converting Unix timestamp to POSIXct date-time
houston_weather_mdf$year <- year(houston_weather_mdf$datetime) # Add "year" variable by extracting year from datetime.
houston_weather_mdf$month <- format(houston_weather_mdf$datetime, "%m") # Add "month" variable by extracting month from datetime.
houston_weather_mdf$day <- day(houston_weather_mdf$datetime) # Add "day" variable by extracting day from datetime.
# Add a new column for the "season"
houston_weather_mdf <- houston_weather_mdf %>%
  mutate(
    season = case_when(
      month(datetime) %in% c(12, 1, 2) ~ "Winter", # Define Winter
      month(datetime) %in% c(3, 4, 5) ~ "Spring", # Define Spring
      month(datetime) %in% c(6, 7, 8) ~ "Summer", # Define Summer
      month(datetime) %in% c(9, 10, 11) ~ "Autumn", # Define Autumn
      # Default case
      TRUE ~ NA_character_
    )
  )
# Rearrange the columns to a specified order.
houston_weather_mdf <- houston_weather_mdf %>%
  dplyr::select(obs_count, city_id, dt, datetime, year, month, day, season,
                temp, feels_like, pressure, humidity, temp_min, temp_max,
                wind_speed, wind_deg, wind_gust, clouds_all,
                calctime, cod, cnt, message)
# View the modified data frame to confirm changes
head(houston_weather_mdf)
## # A tibble: 6 × 22
## obs_count city_id dt datetime year month day season temp
## <int> <int> <int> <dttm> <dbl> <chr> <int> <chr> <dbl>
## 1 1 1 1.68e9 2023-02-13 00:00:00 2023 02 13 Winter 289.
## 2 2 1 1.68e9 2023-02-13 01:00:00 2023 02 13 Winter 287.
## 3 3 1 1.68e9 2023-02-13 02:00:00 2023 02 13 Winter 286.
## 4 4 1 1.68e9 2023-02-13 03:00:00 2023 02 13 Winter 286.
## 5 5 1 1.68e9 2023-02-13 04:00:00 2023 02 13 Winter 285.
## 6 6 1 1.68e9 2023-02-13 05:00:00 2023 02 13 Winter 284.
## # ℹ 13 more variables: feels_like <dbl>, pressure <int>, humidity <int>,
## # temp_min <dbl>, temp_max <dbl>, wind_speed <dbl>, wind_deg <int>,
## # wind_gust <dbl>, clouds_all <int>, calctime <dbl>, cod <chr>, cnt <int>,
## # message <chr>
Understanding the data frame
dim(houston_weather_mdf)
## [1] 8688 22
The data frame contains 22 variables (described in Table 1) and 8688
observations.
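The row count is consistent with a complete hourly series: 362 days × 24 hours = 8688, which matches the span from 2023-02-13 00:00 to 2024-02-09 23:00 shown in the output. A minimal sketch of how gaps in an hourly series could be checked is shown below; the toy timestamps are hypothetical and are not taken from the data set.

```r
# Hypothetical toy data: hourly Unix timestamps with one hour (03:00) missing
start <- as.numeric(as.POSIXct("2023-02-13 00:00:00", tz = "UTC"))
dt <- start + 3600 * c(0, 1, 2, 4, 5)

# Seconds between consecutive observations; a complete hourly series gives 3600 everywhere
gaps <- diff(dt)

# Number of hourly observations skipped between consecutive rows
n_missing <- sum(gaps / 3600 - 1)
n_missing
```

Applied to the dt column of the real data frame, a result of 0 would indicate that no hourly observation is missing between the first and last timestamp.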
Now we will review the structure of the data frame and the valid range
of each variable. For this we will create a table using the kable
function from the knitr package. The table contains one row for each
variable, with its data type, its valid range, and a description.
Creating a table to describe variables
# Define the data frame for variable descriptions
variables_description <- data.frame(
Variable = c("obs_count", "city_id", "dt", "datetime", "year", "month", "day", "season", "temp", "feels_like", "pressure", "humidity", "temp_min", "temp_max", "wind_speed", "wind_deg", "wind_gust", "clouds_all", "calctime", "cod", "cnt", "message"),
`Data Type` = c("Integer", "Integer", "Integer", "POSIXct", "Integer", "Character", "Integer", "Character", "Numeric", "Numeric", "Integer", "Integer", "Numeric", "Numeric", "Numeric", "Integer", "Numeric", "Integer", "Numeric", "Character", "Integer", "Character"),
`Valid Range` = c("1-N", "Fixed value", "Unix timestamp", "2023-Feb-09 to 2024-Feb-09", "2023-2024", "Jan-Dec", "1-31", "Season names", "183.85ºK - 329.85ºK", "183.85ºK - 329.85ºK", "980 - 1030 hPa", "0-100%", "183.85ºK - 329.85ºK", "183.85ºK - 329.85ºK", "0 - 113.3 m/s", "0-360º", "0 - 113.3 m/s", "0-100%", "Calculation time range", "Response codes", "Observation count range", "API response messages"),
Description = c("Observation count", "City identifier for Houston", "Unix timestamp", "Readable date and time", "Year of the weather observation, indicating when the data was recorded.", "Month of the year", "Day of the month for the weather observation, ranging from 1 to 31.", "Meteorological season of the year when the weather observation was made, categorized as Winter, Spring, Summer, or Autumn.", "Temperature in Kelvin (ºK)", "Feels like temperature in Kelvin (ºK)", "Atmospheric pressure in hectopascals (hPa)", "Relative humidity in %", "Minimum recorded temperature in Kelvin (ºK)", "Maximum recorded temperature in Kelvin (ºK)", "Wind speed in m/s", "Wind direction in meteorological degrees (º)", "Wind gust speed in m/s", "Cloud cover percentage (%)", "Time taken for API to calculate response", "API response code", "Number of observations included in the API response", "Message returned by the API")
)
# Generate markdown table with kable and style it with kableExtra
kable_styled <- kable(variables_description,
format = "html",
caption = "Table 1. Description of Variables in the Houston Weather Dataset",
align = c('l', 'l', 'l', 'l')) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE,
position = "left") %>%
column_spec(1, bold = TRUE) %>%
column_spec(2, width = "100px") %>%
column_spec(3, width = "150px") %>%
column_spec(4, width = "300px")
# Print the styled table
kable_styled
Table 1. Description of Variables in the Houston Weather Dataset
| Variable | Data.Type | Valid.Range | Description |
|---|---|---|---|
| obs_count | Integer | 1-N | Observation count |
| city_id | Integer | Fixed value | City identifier for Houston |
| dt | Integer | Unix timestamp | Unix timestamp |
| datetime | POSIXct | 2023-Feb-09 to 2024-Feb-09 | Readable date and time |
| year | Integer | 2023-2024 | Year of the weather observation, indicating when the data was recorded. |
| month | Character | Jan-Dec | Month of the year |
| day | Integer | 1-31 | Day of the month for the weather observation, ranging from 1 to 31. |
| season | Character | Season names | Meteorological season of the year when the weather observation was made, categorized as Winter, Spring, Summer, or Autumn. |
| temp | Numeric | 183.85ºK - 329.85ºK | Temperature in Kelvin (ºK) |
| feels_like | Numeric | 183.85ºK - 329.85ºK | Feels like temperature in Kelvin (ºK) |
| pressure | Integer | 980 - 1030 hPa | Atmospheric pressure in hectopascals (hPa) |
| humidity | Integer | 0-100% | Relative humidity in % |
| temp_min | Numeric | 183.85ºK - 329.85ºK | Minimum recorded temperature in Kelvin (ºK) |
| temp_max | Numeric | 183.85ºK - 329.85ºK | Maximum recorded temperature in Kelvin (ºK) |
| wind_speed | Numeric | 0 - 113.3 m/s | Wind speed in m/s |
| wind_deg | Integer | 0-360º | Wind direction in meteorological degrees (º) |
| wind_gust | Numeric | 0 - 113.3 m/s | Wind gust speed in m/s |
| clouds_all | Integer | 0-100% | Cloud cover percentage (%) |
| calctime | Numeric | Calculation time range | Time taken for API to calculate response |
| cod | Character | Response codes | API response code |
| cnt | Integer | Observation count range | Number of observations included in the API response |
| message | Character | API response messages | Message returned by the API |
Table 1 summarizes the variables in the Houston Weather Dataset. They
range from temperature and humidity measurements to wind and cloud
cover, together with metadata returned by the API. Understanding these
variables is the basis for examining weather patterns in Houston.
Values out of range
No values are immediately identifiable as out of range. The exact
thresholds for an out-of-range value depend on the weather conditions
expected for Houston and on the measurement standards for each
variable. Based on the minimum and maximum of each variable:
- Temperature (temp, feels_like, temp_min, temp_max): The minimum and maximum temperatures are within a plausible range for Houston’s climate.
- Pressure: Values range from 994 to 1034 hPa, which is typical for atmospheric pressure, although the maximum is slightly above the 1030 hPa upper limit listed in Table 1.
- Humidity: Values range from 20% to 100%, which is typical.
- Wind speed and wind gust: The maximum wind speed is 34.47 m/s, which is high but not impossible during storm events. The maximum wind gust is 24.18 m/s.
- clouds_all: Values range from 0% to 100%, from clear skies to completely overcast, which is typical.
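The comparison against valid ranges can also be done programmatically. The sketch below is only an illustration: the toy data frame is hypothetical, and the valid_ranges list is an assumed encoding of three of the ranges from Table 1.

```r
# Hypothetical toy data; column names follow the report
toy <- data.frame(
  humidity   = c(20, 75, 100, 104),  # 104 is deliberately out of range
  clouds_all = c(0, 40, 100, 100),
  wind_deg   = c(0, 160, 360, 210)
)

# Assumed valid ranges (lower, upper), based on Table 1
valid_ranges <- list(
  humidity   = c(0, 100),
  clouds_all = c(0, 100),
  wind_deg   = c(0, 360)
)

# Count the values outside the valid range for each variable
out_of_range <- sapply(names(valid_ranges), function(v) {
  r <- valid_ranges[[v]]
  sum(toy[[v]] < r[1] | toy[[v]] > r[2], na.rm = TRUE)
})
out_of_range
```

A count of zero for every variable would support the conclusion that no value is out of range.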
Missing values
To detect missing values, we will apply the summary() function to the
whole data frame. It reports the number of NA values for each variable
that contains them.
summary(houston_weather_mdf)
## obs_count city_id dt datetime
## Min. : 1 Min. :1 Min. :1.676e+09 Min. :2023-02-13 00:00:00
## 1st Qu.:2173 1st Qu.:1 1st Qu.:1.684e+09 1st Qu.:2023-05-14 11:45:00
## Median :4344 Median :1 Median :1.692e+09 Median :2023-08-12 23:30:00
## Mean :4344 Mean :1 Mean :1.692e+09 Mean :2023-08-12 23:30:00
## 3rd Qu.:6516 3rd Qu.:1 3rd Qu.:1.700e+09 3rd Qu.:2023-11-11 11:15:00
## Max. :8688 Max. :1 Max. :1.708e+09 Max. :2024-02-09 23:00:00
##
## year month day season
## Min. :2023 Length:8688 Min. : 1.00 Length:8688
## 1st Qu.:2023 Class :character 1st Qu.: 8.00 Class :character
## Median :2023 Mode :character Median :16.00 Mode :character
## Mean :2023 Mean :15.76
## 3rd Qu.:2023 3rd Qu.:23.00
## Max. :2024 Max. :31.00
##
## temp feels_like pressure humidity
## Min. :266.1 Min. :259.1 Min. : 994 Min. : 20.00
## 1st Qu.:290.0 1st Qu.:289.8 1st Qu.:1011 1st Qu.: 60.00
## Median :296.6 Median :297.0 Median :1014 Median : 75.00
## Mean :295.8 Mean :297.2 Mean :1014 Mean : 71.95
## 3rd Qu.:301.6 3rd Qu.:305.3 3rd Qu.:1017 3rd Qu.: 85.00
## Max. :315.0 Max. :321.3 Max. :1034 Max. :100.00
##
## temp_min temp_max wind_speed wind_deg
## Min. :264.5 Min. :267.2 Min. : 0.000 Min. : 0.0
## 1st Qu.:288.2 1st Qu.:291.5 1st Qu.: 3.090 1st Qu.:100.0
## Median :294.9 Median :298.1 Median : 4.630 Median :160.0
## Mean :293.9 Mean :297.4 Mean : 4.867 Mean :161.7
## 3rd Qu.:299.8 3rd Qu.:303.1 3rd Qu.: 6.170 3rd Qu.:210.0
## Max. :313.3 Max. :317.1 Max. :34.470 Max. :360.0
##
## wind_gust clouds_all calctime cod
## Min. : 0.420 Min. : 0.00 Min. :0.001696 Length:8688
## 1st Qu.: 7.200 1st Qu.: 0.00 1st Qu.:0.003853 Class :character
## Median : 9.390 Median : 40.00 Median :0.004107 Mode :character
## Mean : 9.343 Mean : 48.63 Mean :0.004472
## 3rd Qu.:11.830 3rd Qu.:100.00 3rd Qu.:0.004486
## Max. :24.180 Max. :100.00 Max. :0.038480
## NA's :5582
## cnt message
## Min. :24 Length:8688
## 1st Qu.:24 Class :character
## Median :24 Mode :character
## Mean :24
## 3rd Qu.:24
## Max. :24
##
The variable “wind_gust” contains 5582 missing values (NAs), about 64%
of the 8688 observations. This is a significant portion of the data,
which could affect analyses related to wind gusts. A plausible
explanation is that a missing value represents the absence of a wind
gust rather than an error in the observation, since wind is not
constant. Under that assumption, we will replace all NAs in this
variable with 0 (zero).
houston_weather_mdf$wind_gust[is.na(houston_weather_mdf$wind_gust)] <- 0
summary(houston_weather_mdf$wind_gust)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 3.34 7.72 24.18
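The replacement changes the distribution noticeably: the median of wind_gust drops from 9.39 to 0 and the mean from 9.343 to 3.34. One way to keep that effect visible is to record which values were originally missing before overwriting them. The sketch below uses hypothetical toy data, and the gust_reported column is an illustrative name that is not part of the original data frame.

```r
# Hypothetical toy data with missing gust values
toy <- data.frame(
  wind_speed = c(3.1, 4.6, 6.2, 5.0),
  wind_gust  = c(NA, 7.2, NA, 9.4)
)

# Count missing values per column
na_counts <- colSums(is.na(toy))
na_counts

# Keep a flag for observations where a gust was actually reported
toy$gust_reported <- !is.na(toy$wind_gust)

# Replace missing gusts with 0, as done in the report
toy$wind_gust[is.na(toy$wind_gust)] <- 0

# Mean over all rows versus mean over reported gusts only
mean_all      <- mean(toy$wind_gust)
mean_reported <- mean(toy$wind_gust[toy$gust_reported])
```

With the flag in place, later summaries can be computed either over all rows or only over the rows where a gust was reported.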
Outliers
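To identify outliers in our core variables, we will apply the
interquartile range (IQR) method: a value is flagged as an outlier when
it lies more than 1.5 × IQR below the first quartile or above the third
quartile.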
# Create a function that computes the IQR bounds and counts the outliers
identify_outliers <- function(data, variable_name) {
  values <- data[[variable_name]]
  Q1 <- quantile(values, 0.25, na.rm = TRUE) # First quartile
  Q3 <- quantile(values, 0.75, na.rm = TRUE) # Third quartile
  iqr_value <- Q3 - Q1 # Interquartile range (named iqr_value to avoid masking stats::IQR)
  lower_bound <- Q1 - 1.5 * iqr_value # Lower bound calculation
  upper_bound <- Q3 + 1.5 * iqr_value # Upper bound calculation
  outliers <- values < lower_bound | values > upper_bound # Flag values outside the bounds
  # Return the bounds and the number of outliers
  list(lower_bound = lower_bound, upper_bound = upper_bound,
       outliers = sum(outliers, na.rm = TRUE))
}
# Apply the function to each core variable and print the results
temp_outliers <- identify_outliers(houston_weather_mdf, "temp")
feels_like_outliers <- identify_outliers(houston_weather_mdf, "feels_like")
pressure_outliers <- identify_outliers(houston_weather_mdf, "pressure")
humidity_outliers <- identify_outliers(houston_weather_mdf, "humidity")
temp_min_outliers <- identify_outliers(houston_weather_mdf, "temp_min")
temp_max_outliers <- identify_outliers(houston_weather_mdf, "temp_max")
wind_speed_outliers <- identify_outliers(houston_weather_mdf, "wind_speed")
wind_deg_outliers <- identify_outliers(houston_weather_mdf, "wind_deg")
wind_gust_outliers <- identify_outliers(houston_weather_mdf, "wind_gust")
clouds_all_outliers <- identify_outliers(houston_weather_mdf, "clouds_all")
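As an alternative to one call per variable, the same bounds could be collected in a single table. The following is a minimal sketch under stated assumptions: iqr_outlier_summary is a hypothetical helper written for this illustration, and the toy data frame is not taken from the data set.

```r
# Hypothetical helper: IQR bounds and outlier count for one numeric vector
iqr_outlier_summary <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr_value <- q[2] - q[1]
  lower <- q[1] - 1.5 * iqr_value
  upper <- q[2] + 1.5 * iqr_value
  c(lower_bound = lower,
    upper_bound = upper,
    n_outliers  = sum(x < lower | x > upper, na.rm = TRUE))
}

# Hypothetical toy data; 330 is deliberately extreme
toy <- data.frame(
  temp     = c(290, 291, 292, 293, 294, 330),
  humidity = c(60, 65, 70, 75, 80, 85)
)

# One row per variable, one column per statistic
outlier_table <- t(sapply(toy, iqr_outlier_summary))
outlier_table
```

The resulting matrix has one row per variable, which makes it easier to compare the reported counts with the printed output.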
Identification of “temp” outliers
print(temp_outliers)
## $lower_bound
## 25%
## 272.7
##
## $upper_bound
## 75%
## 318.86
##
## $outliers
## [1] 53
There are 53 outliers in “temp”. Since they are all in the
valid range, we will leave them.
Identification of “feels_like” outliers
print(feels_like_outliers)
## $lower_bound
## 25%
## 266.51
##
## $upper_bound
## 75%
## 328.59
##
## $outliers
## [1] 51
There are 51 outliers in “feels_like”. Since they are all in
the valid range, we will leave them.
Identification of “pressure” outliers
print(pressure_outliers)
## $lower_bound
## 25%
## 1002
##
## $upper_bound
## 75%
## 1026
##
## $outliers
## [1] 345
There are 345 outliers in “pressure”. Since they are all plausible
atmospheric pressure values (994 to 1034 hPa), we will leave them.
Identification of “humidity” outliers
print(humidity_outliers)
## $lower_bound
## 25%
## 22.5
##
## $upper_bound
## 75%
## 122.5
##
## $outliers
## [1] 10
There are 10 outliers in “humidity”. Since they are all in
the valid range, we will leave them.
Identification of “temp_min” outliers
print(temp_min_outliers)
## $lower_bound
## 25%
## 270.87
##
## $upper_bound
## 75%
## 317.11
##
## $outliers
## [1] 54
There are 54 outliers in “temp_min”. Since they are all in
the valid range, we will leave them.
Identification of “temp_max” outliers
print(temp_max_outliers)
## $lower_bound
## 25%
## 274.145
##
## $upper_bound
## 75%
## 320.505
##
## $outliers
## [1] 46
There are 46 outliers in “temp_max”. Since they are all in
the valid range, we will leave them.
Identification of “wind_speed” outliers
print(wind_speed_outliers)
## $lower_bound
## 25%
## -1.53
##
## $upper_bound
## 75%
## 10.79
##
## $outliers
## [1] 164
There are 164 outliers in “wind_speed”. Since they are all
in the valid range, we will leave them.
Identification of “wind_deg” outliers
print(wind_deg_outliers)
## $lower_bound
## 25%
## -65
##
## $upper_bound
## 75%
## 375
##
## $outliers
## [1] 0
There are no outliers in “wind_deg”.
Identification of “wind_gust” outliers
print(wind_gust_outliers)
## $lower_bound
## 25%
## -11.58
##
## $upper_bound
## 75%
## 19.3
##
## $outliers
## [1] 19
There are 19 outliers in “wind_gust”. The negative lower bound is not
physically meaningful, because a gust speed cannot be below zero, so it
is necessary to explore the distribution further.
hist(houston_weather_mdf$wind_gust)
The histogram shows no negative or otherwise implausible values, so we
conclude that the flagged observations are not errors.
Identification of “clouds_all” outliers
print(clouds_all_outliers)
## $lower_bound
## 25%
## -150
##
## $upper_bound
## 75%
## 250
##
## $outliers
## [1] 0
There are no outliers in “clouds_all”.
Our data set is now clean: outliers have been identified and
understood, and missing values have been identified, explained and
replaced. The data frame is ready for further analysis.
Data Analysis
In this section we will use graphs to look for patterns in each
variable and for possible relationships between variables.
Temperature
# Histogram for Temperature with Mean and Median
hist(houston_weather_mdf$temp, 20, main = "Histogram of Temperature", xlab = "Temperature", col = "blue")
abline(v = mean(houston_weather_mdf$temp, na.rm = TRUE), col = "red", lwd = 2) # Mean line
median_temp <- median(houston_weather_mdf$temp, na.rm = TRUE)
abline(v = median_temp, col = "darkgreen", lwd = 2) # Median line
The histogram shows the distribution of temperature in Houston over
one year, using approximately 20 bins that span 266.1ºK to 315.0ºK. The
distribution is slightly left-skewed and peaks at around 301ºK, which
suggests that temperatures near this value occurred most frequently
within the observed period.
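The skew read from the histogram can be cross-checked numerically. In the summary output the mean of temp (295.8) is below the median (296.6), which is consistent with a left skew. The sketch below computes a moment-based skewness coefficient; sample_skewness is a hypothetical helper and the toy vector is not taken from the data set.

```r
# Hypothetical helper: moment-based sample skewness (negative = left-skewed)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Hypothetical toy temperatures with a long tail towards low values
toy_temp <- c(280, 290, 295, 298, 300, 301, 302, 303)

skew <- sample_skewness(toy_temp)
skew                               # negative for this toy vector
mean(toy_temp) < median(toy_temp)  # mean below median, as expected for a left skew
```

The same helper could be applied to any of the numeric variables whose histogram is described in this section.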
houston_weather_mdf$season <- factor(houston_weather_mdf$season, levels = c("Winter", "Spring", "Summer", "Autumn"))
# Box plot of temperature by season
ggplot(houston_weather_mdf, aes(x=season, y=temp, fill=season)) +
geom_boxplot() +
theme_bw() +
labs(title="Temperature Distribution by Season", x="Season", y="Temperature(ºK)") +
theme(legend.position="none")
The graph is a boxplot illustrating the distribution of temperatures
across four seasons: Winter, Spring, Summer and Autumn. Each boxplot
represents the interquartile range (IQR) of temperatures for a season,
with the box’s bottom and top edges indicating the first (Q1) and third
(Q3) quartiles, respectively. The line inside each box denotes the
median temperature. The “whiskers” extend from the boxes to show the
range of the data, while points below or above the whiskers indicate
potential outliers.
Winter has the lowest median temperature, around 288ºK, with an IQR
similar to Autumn, indicating consistent cooler temperatures. Spring
shows a higher median temperature close to 295ºK, with a slightly wider
IQR. Summer has the highest median temperature, above 300ºK, indicating
it’s the warmest season, and also displays a broad IQR, suggesting
significant variation in temperatures. Autumn has a median temperature
just above 290ºK, with a relatively narrow IQR suggesting less
variability in temperatures. There are a few outliers in the Winter and
Spring seasons, indicating occasional temperatures that fall well below
the typical range.
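The medians and IQRs read from the boxplot can also be computed directly per season. The following is a minimal sketch with hypothetical toy data; only the column names season and temp follow the report.

```r
# Hypothetical toy data with two seasons
toy <- data.frame(
  season = c("Winter", "Winter", "Winter", "Summer", "Summer", "Summer"),
  temp   = c(284, 288, 292, 299, 302, 305)
)

# Median and interquartile range of temperature for each season
season_stats <- sapply(
  split(toy$temp, toy$season),
  function(x) c(median = median(x), IQR = IQR(x))
)
season_stats
```

The result is a small matrix with one column per season, which can be compared with the values estimated from the plot.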
# Convert the two-digit month string to an ordered factor of month names
houston_weather_mdf$month_name <- factor(month.name[as.integer(houston_weather_mdf$month)],
                                         levels = month.name)
# Create the boxplot
ggplot(houston_weather_mdf, aes(x=month_name, y=temp, fill=month_name)) +
geom_boxplot() +
theme_bw() +
xlab("Month") + ylab("Temperature (°K)") +
ggtitle("Monthly Temperature Distribution in Houston") +
theme(axis.text.x = element_text(angle=25, hjust=1)) # Rotate x labels for better readability
The graph is a boxplot illustrating the distribution of temperatures
across the months of the year. Each boxplot represents the interquartile
range (IQR) of temperatures for a month, with the box’s bottom and top
edges indicating the first (Q1) and third (Q3) quartiles, respectively.
The line inside each box denotes the median temperature. The “whiskers”
extend from the boxes to show the range of the data, while points below
or above the whiskers indicate potential outliers.
August has the highest median temperature, around 304ºK. January has
the lowest median temperature, around 285ºK. May has the smallest
variance in temperature, indicating the most stable temperatures.
January and February have the highest variance, indicating more
variability in temperature. There are a few outliers in January,
March and October, indicating occasional temperatures that fall well
below the typical range.
Apparent temperature
# Histogram for Feels Like Temperature with Mean
hist(houston_weather_mdf$feels_like, 30, main = "Histogram of Feels Like Temperature", xlab = "Feels Like Temperature", col = "lightgreen")
abline(v = mean(houston_weather_mdf$feels_like, na.rm = TRUE), col = "red", lwd = 2)
median_temp <- median(houston_weather_mdf$feels_like, na.rm = TRUE)
abline(v = median_temp, col = "darkgreen", lwd = 2) # Median line

The histogram shows the distribution of apparent temperature in
Houston over one year, using approximately 30 bins that span 259.1ºK to
321.3ºK. The distribution is slightly left-skewed and peaks at around
299ºK, which suggests that apparent temperatures near this value
occurred most frequently within the observed period.
Pressure
# Histogram for Pressure with Mean and Median
hist(houston_weather_mdf$pressure, 40 ,main = "Histogram of Pressure", xlab = "Pressure", col = "aquamarine")
abline(v = mean(houston_weather_mdf$pressure, na.rm = TRUE), col = "red", lwd = 2) # Mean line
median_temp <- median(houston_weather_mdf$pressure, na.rm = TRUE)
abline(v = median_temp, col = "darkgreen", lwd = 2) # Median line
The histogram shows the distribution of pressure in Houston over
one year, using approximately 40 bins that span 994 hPa to 1034 hPa. The
distribution is slightly right-skewed and peaks at around 1013 hPa,
which suggests that pressure near this value occurred most frequently
within the observed period.
# Box plot of pressure by season
ggplot(houston_weather_mdf, aes(x=season, y=pressure, fill=season)) +
geom_boxplot() +
theme_bw() +
labs(title="Pressure Distribution by Season", x="Season", y="Pressure (hPa)") +
theme(axis.text.x = element_text(angle=0, hjust=0.5)) # Keep x labels horizontal and centered
The graph is a boxplot illustrating the distribution of pressure across
four seasons: Winter, Spring, Summer and Autumn. Each boxplot represents
the interquartile range (IQR) of pressure for a season, with the box’s
bottom and top edges indicating the first (Q1) and third (Q3) quartiles,
respectively. The line inside each box denotes the median pressure. The
“whiskers” extend from the boxes to show the range of the data, while
points below or above the whiskers indicate potential outliers.
Winter has the highest median pressure, around 1018 hPa, and also
the highest variance in pressure. Summer has the lowest median
pressure, around 1012 hPa, and the narrowest IQR, suggesting the least
variation in pressure. There are notable outliers in Spring, indicating
occasional pressure measurements that fall well below or above the
typical range.
# Box plot of pressure by month
ggplot(houston_weather_mdf, aes(x=month_name, y=pressure, fill=month_name)) +
geom_boxplot() +
theme_bw() +
xlab("Month") + ylab("Pressure (hPa)") +
ggtitle("Monthly Pressure Distribution in Houston") +
theme(axis.text.x = element_text(angle=25, hjust=1)) # Rotate x labels for better readability
The graph is a boxplot illustrating the distribution of pressure across
each month of the year. Each boxplot represents the interquartile range
(IQR) of pressure for a month, with the box’s bottom and top edges
indicating the first (Q1) and third (Q3) quartiles, respectively. The
line inside each box denotes the median pressure. The “whiskers” extend
from the boxes to show the range of the data, while points below or
above the whiskers indicate potential outliers.
December has the highest median pressure, around 1020 hPa. June has
the lowest median pressure, around 1010 hPa. There are outliers in
January, March, May, June, August, October and December, indicating
occasional pressure measurements that fall well below and above the
typical range.
Humidity
# Histogram for Humidity with Mean and Median
hist(houston_weather_mdf$humidity, 40 ,main = "Histogram of Humidity", xlab = "Humidity", col = "yellow")
abline(v = mean(houston_weather_mdf$humidity, na.rm = TRUE), col = "red", lwd = 2) # Mean line
median_temp <- median(houston_weather_mdf$humidity, na.rm = TRUE)
abline(v = median_temp, col = "darkgreen", lwd = 2) # Median line
The histogram shows the distribution of humidity in Houston over
one year, using approximately 40 bins that span 20% to 100%. The
distribution is left-skewed and peaks at around 85%, which suggests that
humidity near this value occurred most frequently within the observed
period.
# Box plot of humidity by season
ggplot(houston_weather_mdf, aes(x=season, y=humidity, fill=season)) +
geom_boxplot() +
theme_bw() +
labs(title="Humidity Distribution by Season", x="Season", y="Humidity") +
theme(axis.text.x = element_text(angle=0, hjust=0.5)) # Keep x labels horizontal and centered
The graph is a boxplot illustrating the distribution of humidity across
four seasons: Winter, Spring, Summer and Autumn. Each boxplot represents
the interquartile range (IQR) of humidity for a season, with the box’s
bottom and top edges indicating the first (Q1) and third (Q3) quartiles,
respectively. The line inside each box denotes the median humidity. The
“whiskers” extend from the boxes to show the range of the data, while
points below or above the whiskers indicate potential outliers.
Summer has the lowest median humidity, around 73%. Spring has the
highest median humidity, around 79%. Summer has the widest IQR,
indicating the most variance in humidity, ranging from around 55% to
83%. There are notable outliers in Spring, indicating occasional
humidity measurements that fall well below the typical range.
Minimum Temperature
# Histogram for Minimum Temperature with Mean and Median
hist(houston_weather_mdf$temp_min, 30 ,main = "Histogram of Minimum Temperature", xlab = "Minimum Temperature", col = "orange")
abline(v = mean(houston_weather_mdf$temp_min, na.rm = TRUE), col = "red", lwd = 2) # Mean line
median_temp <- median(houston_weather_mdf$temp_min, na.rm = TRUE)
abline(v = median_temp, col = "darkgreen", lwd = 2) # Median line
The histogram shows the distribution of minimum temperature in
Houston over one year, using approximately 30 bins that span 264.5ºK to
313.3ºK. The distribution is slightly left-skewed and peaks at around
299ºK, which suggests that minimum temperatures near this value
occurred most frequently within the observed period.
Maximum Temperature
# Histogram for Maximum Temperature with Mean and Median
hist(houston_weather_mdf$temp_max,40, main = "Histogram of Maximum Temperature", xlab = "Maximum Temperature", col = "purple")
abline(v = mean(houston_weather_mdf$temp_max, na.rm = TRUE), col = "red", lwd = 2) # Mean line
median_temp <- median(houston_weather_mdf$temp_max, na.rm = TRUE)
abline(v = median_temp, col = "darkgreen", lwd = 2) # Median line
The histogram shows the distribution of maximum temperature in
Houston over one year, using approximately 40 bins that span 267.2ºK to
317.1ºK. The distribution is roughly bell-shaped and peaks at around
303ºK, which suggests that maximum temperatures near this value
occurred most frequently within the observed period.
Wind Speed
# Histogram for Wind Speed with Mean and Median
hist(houston_weather_mdf$wind_speed, 40 ,main = "Histogram of Wind Speed", xlab = "Wind Speed", col = "brown")
abline(v = mean(houston_weather_mdf$wind_speed, na.rm = TRUE), col = "red", lwd = 2) # Mean line
median_temp <- median(houston_weather_mdf$wind_speed, na.rm = TRUE)
abline(v = median_temp, col = "darkgreen", lwd = 2) # Median line
The histogram shows the distribution of wind speed in Houston over
one year, using approximately 40 bins that span 0 m/s to 34.47 m/s. The
distribution is right-skewed and peaks at around 4.9 m/s, which
suggests that wind speeds near this value occurred most frequently
within the observed period.
# Box plot of wind speed by month
ggplot(houston_weather_mdf, aes(x=month_name, y=wind_speed, fill=month_name)) +
geom_boxplot() +
theme_bw() +
xlab("Month") + ylab("Wind Speed") +
ggtitle("Monthly Wind Speed Distribution in Houston") +
theme(axis.text.x = element_text(angle=25, hjust=1)) # Rotate x labels for better readability
The graph is a boxplot illustrating the distribution of wind speed
across each month of the year. Each boxplot represents the interquartile
range (IQR) of wind speed for a month, with the box’s bottom and top
edges indicating the first (Q1) and third (Q3) quartiles, respectively.
The line inside each box denotes the median wind speed. The “whiskers”
extend from the boxes to show the range of the data, while points below
or above the whiskers indicate potential outliers.
March and April have the highest median wind speed, around 6 m/s.
September has the lowest median wind speed, around 5 m/s. There are
outliers in every month, indicating occasional wind speed measurements
that fall well above the typical range. There is one particular outlier
in June indicating a highly irregular wind speed of 33 m/s.
Wind Direction
# Histogram for Wind Direction (Degrees) with Mean and Median
hist(houston_weather_mdf$wind_deg, 30, main = "Histogram of Wind Direction", xlab = "Wind Direction (Degrees)", col = "pink")
abline(v = mean(houston_weather_mdf$wind_deg, na.rm = TRUE), col = "red", lwd = 2) # Mean line
median_temp <- median(houston_weather_mdf$wind_deg, na.rm = TRUE)
abline(v = median_temp, col = "darkgreen", lwd = 2) # Median line
The histogram shows the distribution of wind direction in Houston
over one year, using approximately 30 bins that span 0º to 360º. The
distribution is multimodal, with its highest peak at 0º, which
corresponds to a north wind. This suggests that wind directions near
this value occurred most frequently within the observed period.
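Because wind direction is a circular variable (0º and 360º both mean north), the arithmetic mean and median drawn on the histogram can be misleading. A common alternative is the circular mean, which averages the sine and cosine components of the angles. The sketch below is an illustration only: circular_mean_deg is a hypothetical helper and the toy directions are not taken from the data set.

```r
# Hypothetical helper: circular mean of directions given in degrees
circular_mean_deg <- function(deg) {
  rad <- deg[!is.na(deg)] * pi / 180
  mean_angle <- atan2(mean(sin(rad)), mean(cos(rad)))
  (mean_angle * 180 / pi) %% 360   # map the result back to [0, 360)
}

# Two hypothetical northerly winds, one slightly west and one slightly east of north
toy_deg <- c(350, 10)

arithmetic_mean <- mean(toy_deg)             # 180, i.e. south, which is misleading
circular_mean   <- circular_mean_deg(toy_deg) # very close to 0º/360º, i.e. north
```

For the toy vector the arithmetic mean points south even though both winds blow from the north, while the circular mean stays at north up to floating-point error.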
# Box plot of wind_deg by season
ggplot(houston_weather_mdf, aes(x=season, y=wind_deg, fill=season)) +
geom_boxplot() +
theme_bw() +
labs(title="Wind Direction Distribution by Season", x="Season", y="Wind Direction") +
theme(axis.text.x = element_text(angle=0, hjust=0.5)) # Keep x labels horizontal and centered
The graph is a boxplot illustrating the distribution of wind direction
across each season of the year: Winter, Spring, Summer and Autumn. Each
boxplot represents the interquartile range (IQR) of wind direction for a
season, with the box’s bottom and top edges indicating the first (Q1)
and third (Q3) quartiles, respectively. The line inside each box denotes
the median wind direction. The “whiskers” extend from the boxes to show
the range of the data, while points below or above the whiskers indicate
potential outliers.
Winter has the widest IQR, indicating that winds come from the
broadest range of directions, mainly between 110º (East-South-East) and
300º (West-North-West). Summer has the narrowest IQR, indicating that
winds come from fewer directions, mainly between 110º
(East-South-East) and 210º (South-South-West). This means that Summer
winds come mainly from the South, although they can still come from
any direction.
Wind Gust
# Histogram for Wind Gust with Mean and Median
hist(houston_weather_mdf$wind_gust,30, main = "Histogram of Wind Gust", xlab = "Wind Gust", col = "grey")
abline(v = mean(houston_weather_mdf$wind_gust), col = "red", lwd = 2) # Mean line
median_temp <- median(houston_weather_mdf$wind_gust)
abline(v = median_temp, col = "darkgreen", lwd = 2) # Median line
The histogram shows the distribution of wind gust in Houston over
one year, using approximately 30 bins that span 0 to 24.18 m/s. The
distribution is multimodal and peaks at 0 m/s, so the most frequent
value is 0, or the absence of a wind gust, over the observed period.
This peak largely reflects the 5582 missing values that were replaced
with 0 during data cleaning.
Cloudiness
# Histogram for Cloudiness with Mean and Median
hist(houston_weather_mdf$clouds_all, 15, main = "Histogram of Cloudiness", xlab = "Cloudiness", col = "cyan")
abline(v = mean(houston_weather_mdf$clouds_all, na.rm = TRUE), col = "red", lwd = 2)
median_temp <- median(houston_weather_mdf$clouds_all, na.rm = TRUE)
abline(v = median_temp, col = "darkgreen", lwd = 2) # Median line
The histogram shows the distribution of cloudiness in Houston over
one year, using approximately 15 bins that span 0% to 100%. The
distribution is bimodal, with peaks at 0% and 100%, indicating that the
most frequent conditions are a completely clear sky and a completely
overcast sky.
# Box plot of clouds_all by season
ggplot(houston_weather_mdf, aes(x=season, y=clouds_all, fill=season)) +
geom_boxplot() +
theme_bw() +
labs(title="Clouds by Season", x="Season", y="Clouds covering sky percentage") +
theme(axis.text.x = element_text(angle=0, hjust=0.5)) # Keep x labels horizontal and centered
The graph is a boxplot illustrating the distribution of cloudiness
across each season of the year: Winter, Spring, Summer and Autumn. Each
boxplot represents the interquartile range (IQR) of cloudiness for a
season, with the box’s bottom and top edges indicating the first (Q1)
and third (Q3) quartiles, respectively. The line inside each box denotes
the median cloudiness. The “whiskers” extend from the boxes to show the
range of the data, while points below or above the whiskers indicate
potential outliers.
Winter and Spring have the highest median cloudiness, at 75%.
Summer has the lowest median cloudiness, below 25%, which means that
Summer has clearer skies than Winter and Spring.
Correlations
Now we will use the ggpairs function from GGally package to obtain
multiple graphs and correlations.
selected_variables <- houston_weather_mdf[c("season","temp", "feels_like", "pressure", "humidity","temp_min","temp_max","wind_speed","wind_deg","clouds_all")] # Create a subset of selected variables.
ggpairs(selected_variables,
mapping = ggplot2::aes(alpha = 0.5), # Add transparency
lower = list(continuous = wrap("points", size = 0.5, position = "jitter")), # Jittering and smaller points
diag = list(continuous = wrap("densityDiag")) # Use density plots on the diagonal
)
Correlations among temperature, apparent temperature, and the minimum
and maximum temperatures are the strongest, all above 0.98, since they
are different measures of the same phenomenon. Beyond these, there is a
relevant negative correlation of -0.568 between temperature and
pressure. This means that higher temperatures tend to coincide with
lower pressure; the correlation alone does not establish that one
causes the other.
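The coefficients shown in the ggpairs plot can also be obtained as a plain matrix with cor(), and a single pair can be tested with cor.test(). The sketch below uses hypothetical toy values, so the numbers it produces do not correspond to those of the data set.

```r
# Hypothetical toy data; column names follow the report
toy <- data.frame(
  temp     = c(285, 290, 295, 300, 305),
  pressure = c(1022, 1018, 1016, 1012, 1010),
  humidity = c(70, 85, 60, 80, 65)
)

# Pearson correlation matrix, rounded for readability
cor_matrix <- round(cor(toy, use = "pairwise.complete.obs"), 3)
cor_matrix

# Significance test for one pair of variables
temp_pressure_test <- cor.test(toy$temp, toy$pressure)
temp_pressure_test$estimate  # correlation coefficient
temp_pressure_test$p.value   # p-value of the test
```

A matrix like this makes it easier to quote exact coefficients in the text than reading them from the plot panels.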