1 Abstract

The purpose of this report is to extract and analyze weather data, find relationships between climate variables, and understand weather forecasts. The data is extracted with an API call and flattened into a data frame. We then analyze the weather variables and their relationships to obtain a better understanding of weather patterns.

2 Objective

To analyze one year of weather conditions for a chosen location. For this we extract the data from Open Weather Map (OWM) using its API.

3 Content

Our document is divided into 4 main sections:
- Extraction: Focuses on the process of obtaining data using an API.
- Flattening: Discusses manipulating the dataset into a two-dimensional format suitable for analysis.
- Data Correction: Addresses identifying and correcting issues such as missing values and outliers.
- Data Analysis: Explores primary analysis, visual aids, and identifying relationships between variables.

4 Extraction

4.1 Installing packages

We need to install and load several R packages to facilitate data extraction and manipulation:

library(httr) #For making HTTP requests to the API.
library(jsonlite) #For parsing JSON data returned by the API. 
library(dplyr) #For data manipulation and transformation.
library(lubridate) #For easy handling of dates and times.
library(purrr) #For functional programming tools and to simplify repetitive tasks.
library(knitr) #To generate markdown tables from R.
library(kableExtra) #Enhances "knitr" tables for HTML and PDF.
library(GGally) # Creation of complex plots from data in a data frame.
library(ggplot2) # A system for declaratively creating graphics.
library(zoo) # Provides an S3 class with methods for totally ordered indexed observations, used for irregular time series.

4.2 API extraction

The API format that we will need is as follows:

https://history.openweathermap.org/data/2.5/history/city?lat={lat}&lon={lon}&type=hour&start={start}&end={end}&appid={APIkey}

To extract weather data from Open Weather Map (OWM), we construct an API call. Each parameter in the API URL serves a specific purpose:

# Construct the query's parameters
url <- "https://history.openweathermap.org/data/2.5/history/city?" #API for OWM.
lat <- 29.76328   # The latitude of the location (e.g., Houston).
lon <- -95.36327  # The longitude of the location (e.g., Houston).
type <- "hour"    # Data type fixed in hours.
start <- 1672560000 # UNIX timestamp defining the start --> Sun Jan 01 2023 02:00:00 GMT-0600 (Central Standard Time), Houston local time.
end <- 1704096000 # UNIX timestamp defining the end --> Mon Jan 01 2024 02:00:00 GMT-0600 (Central Standard Time), Houston local time.
api_key <- "74fd009a6ac3b901f4211706ad3e11be" # The unique API key for authentication provided by OWM.
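
The two UNIX timestamps above can also be derived from calendar dates instead of being typed by hand. The following is a minimal sketch, assuming that Houston local time corresponds to the "America/Chicago" time zone name (an assumption, it is not stated in the original); the names start_local, end_local, start_ts and end_ts are illustrative only.

```r
# Convert Houston local times to UNIX timestamps (seconds since 1970-01-01 UTC).
# Assumption: "America/Chicago" is the tz database name that covers Houston.
start_local <- as.POSIXct("2023-01-01 02:00:00", tz = "America/Chicago")
end_local   <- as.POSIXct("2024-01-01 02:00:00", tz = "America/Chicago")

# as.numeric() on a POSIXct value returns the number of seconds since the epoch.
start_ts <- as.numeric(start_local)
end_ts   <- as.numeric(end_local)

print(c(start = start_ts, end = end_ts))
```

Both values should match the start and end parameters used in the query.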

Our extraction query will therefore be:

query <- list(lat = lat, lon = lon, type = "hour", start = start, end = end, appid = api_key)
response <- GET(url = url, query = query)
print(response)

## Response [https://history.openweathermap.org/data/2.5/history/city?lat=29.76328&lon=-95.36327&type=hour&start=1672560000&end=1704096000&appid=74fd009a6ac3b901f4211706ad3e11be]
##   Date: 2024-02-12 08:38
##   Status: 400
##   Content-Type: application/json; charset=UTF-8
##   Size: 67 B
## {"code":400000,"message":"requested data is out of allowed range"}
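
The error body shown above is JSON, so its fields can be read programmatically rather than by eye. A minimal sketch using jsonlite on a copy of that body follows; with a live response, the text would typically come from httr's content(response, "text"). The names error_body and error_info are illustrative.

```r
library(jsonlite)

# Error body copied from the response printed above.
error_body <- '{"code":400000,"message":"requested data is out of allowed range"}'

# fromJSON() turns the JSON object into a named list.
error_info <- fromJSON(error_body)

print(error_info$code)
print(error_info$message)
```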

Once executed, the response message is: “requested data is out of allowed range”. This means that the API will not let us extract the whole period in a single call, so we must split the request into shorter calls. For this we choose an interval for each call. In this case we use an interval of one day, equivalent to 86400 seconds, which lets us extract the information on a daily basis.
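
The daily splitting described above can be sketched without making any API call. The example below builds the start and end timestamps of each one-day chunk for a short three-day range; the range and the names range_start, range_end and chunks are chosen only for illustration.

```r
interval_seconds <- 86400  # one day in seconds

# Illustrative three-day range (not the range used for the real extraction).
range_start <- as.numeric(as.POSIXct("2023-01-01 00:00:00", tz = "UTC"))
range_end   <- as.numeric(as.POSIXct("2023-01-04 00:00:00", tz = "UTC"))

# One start per day; each chunk ends one second before the next one begins.
chunk_starts <- seq(range_start, range_end - 1, by = interval_seconds)
chunks <- data.frame(
  start = chunk_starts,
  end   = chunk_starts + interval_seconds - 1
)

print(chunks)
```

Each row of chunks corresponds to the start and end parameters of one API call.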

# Construct the query's parameters
url <- "https://history.openweathermap.org/data/2.5/history/city?" #API for OWM.
lat <- 29.76328   # Latitude for Houston.
lon <- -95.36327  # Longitude for Houston.
type <- "hour"    # Data type fixed in hours.
start <- as.numeric(as.POSIXct("2024-01-01 00:00:00", tz = "UTC"))
end <- as.numeric(as.POSIXct("2024-02-09 00:00:00", tz = "UTC"))
api_key <- "74fd009a6ac3b901f4211706ad3e11be" #API_Key provided by OWM.
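
The key is hard-coded in the parameters above. A common alternative is to read it from an environment variable so that it does not appear in the rendered report. This is only a sketch: the variable name OWM_API_KEY is hypothetical and would have to be defined outside the script (for example in an .Renviron file).

```r
# Hypothetical environment variable name; it is an assumption, not part of the OWM API.
api_key <- Sys.getenv("OWM_API_KEY")

# Sys.getenv() returns an empty string when the variable is not set.
if (!nzchar(api_key)) {
  message("OWM_API_KEY is not set; the appid parameter would be empty.")
}
```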

4.3 Construct the query and response

query <- list(lat = lat, lon = lon, type = type, start = start, end = end, appid = api_key)
response <- GET(url = url, query = query)
print(response)

## Response [https://history.openweathermap.org/data/2.5/history/city?lat=29.76328&lon=-95.36327&type=hour&start=1704067200&end=1707436800&appid=74fd009a6ac3b901f4211706ad3e11be]
##   Date: 2024-02-12 08:38
##   Status: 200
##   Content-Type: application/json; charset=UTF-8
##   Size: 45.3 kB
## {"message":"Count: 169","cod":"200","city_id":1,"calctime":0.007886055,"cnt":...

As we can see, this single API call is successful. Note that the response reports only 169 hourly records although the requested range spans 39 days, which is a further reason to request short intervals. We may now create a loop for our extraction from the OWM API over the period between start_date and end_date: we iterate through each interval, update the start and end parameters for each API call, and collect the responses, using one day (86400 seconds) as the interval.

We create the loop and iterate over each day, making an API call for each interval. We observed that data is missing for more than the 31 days of January and 9 days of February. This suggests that the API does not return data older than one year, so we redefine the start and end dates relative to the date of extraction. We will therefore request all of 2023, January 2024, and the first 9 days of February 2024.

# Define the Start and End Dates for the Entire Period
start_date <- as.POSIXct("2023-01-01 00:00:00", tz = "UTC") # First day of 2023
end_date <- as.POSIXct("2024-02-09 23:59:59", tz = "UTC") # Current date of extraction

# Initialize Variables for the Loop
interval_seconds <- 86400  # One day in seconds
api_url <- "https://history.openweathermap.org/data/2.5/history/city?" #Given URL
lat <- 29.76328  # Latitude for Houston
lon <- -95.36327 # Longitude for Houston
type <- "hour"   # Data type fixed in hours
api_key <- "74fd009a6ac3b901f4211706ad3e11be" # API Key

# Initialize an empty list to store the data
weather_data <- list()

# Create the Loop
current_start <- start_date
while (current_start < end_date) {
    current_end <- current_start + interval_seconds - 1
    start <- as.numeric(current_start)
    end <- as.numeric(current_end)

    query <- list(lat = lat, lon = lon, type = type, start = start, end = end, appid = api_key)
    response <- GET(url = api_url, query = query)

    if (status_code(response) == 200) {
        data <- content(response, "text")
        parsed_data <- fromJSON(data)
        weather_data[[length(weather_data) + 1]] <- parsed_data
    } else {
        print(paste("Error for interval starting on", current_start, ":", status_code(response)))
    }
    current_start <- current_start + interval_seconds
}

## [1] "Error for interval starting on 2023-01-01 : 400"
## [1] "Error for interval starting on 2023-01-02 : 400"
## [1] "Error for interval starting on 2023-01-03 : 400"
## [1] "Error for interval starting on 2023-01-04 : 400"
## [1] "Error for interval starting on 2023-01-05 : 400"
## [1] "Error for interval starting on 2023-01-06 : 400"
## [1] "Error for interval starting on 2023-01-07 : 400"
## [1] "Error for interval starting on 2023-01-08 : 400"
## [1] "Error for interval starting on 2023-01-09 : 400"
## [1] "Error for interval starting on 2023-01-10 : 400"
## [1] "Error for interval starting on 2023-01-11 : 400"
## [1] "Error for interval starting on 2023-01-12 : 400"
## [1] "Error for interval starting on 2023-01-13 : 400"
## [1] "Error for interval starting on 2023-01-14 : 400"
## [1] "Error for interval starting on 2023-01-15 : 400"
## [1] "Error for interval starting on 2023-01-16 : 400"
## [1] "Error for interval starting on 2023-01-17 : 400"
## [1] "Error for interval starting on 2023-01-18 : 400"
## [1] "Error for interval starting on 2023-01-19 : 400"
## [1] "Error for interval starting on 2023-01-20 : 400"
## [1] "Error for interval starting on 2023-01-21 : 400"
## [1] "Error for interval starting on 2023-01-22 : 400"
## [1] "Error for interval starting on 2023-01-23 : 400"
## [1] "Error for interval starting on 2023-01-24 : 400"
## [1] "Error for interval starting on 2023-01-25 : 400"
## [1] "Error for interval starting on 2023-01-26 : 400"
## [1] "Error for interval starting on 2023-01-27 : 400"
## [1] "Error for interval starting on 2023-01-28 : 400"
## [1] "Error for interval starting on 2023-01-29 : 400"
## [1] "Error for interval starting on 2023-01-30 : 400"
## [1] "Error for interval starting on 2023-01-31 : 400"
## [1] "Error for interval starting on 2023-02-01 : 400"
## [1] "Error for interval starting on 2023-02-02 : 400"
## [1] "Error for interval starting on 2023-02-03 : 400"
## [1] "Error for interval starting on 2023-02-04 : 400"
## [1] "Error for interval starting on 2023-02-05 : 400"
## [1] "Error for interval starting on 2023-02-06 : 400"
## [1] "Error for interval starting on 2023-02-07 : 400"
## [1] "Error for interval starting on 2023-02-08 : 400"
## [1] "Error for interval starting on 2023-02-09 : 400"
## [1] "Error for interval starting on 2023-02-10 : 400"
## [1] "Error for interval starting on 2023-02-11 : 400"
## [1] "Error for interval starting on 2023-02-12 : 400"

# After the loop, weather_data will contain the weather data for each interval corresponding to Houston.

The error responses end almost exactly one year before the date of extraction, so our assumption was correct: the API will not return data older than one year. However, we now have almost one year of data for Houston (362 days, from 2023-02-13 to 2024-02-09) stored in a list, with each element representing one day’s data. The extraction process is now complete.
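
The size of the extracted period can be checked with simple date arithmetic. The sketch below assumes that the first successful day is 2023-02-13 (the day after the last 400 error printed above) and that every successful day returns 24 hourly records; the names first_day, last_day, n_days and n_hours are illustrative.

```r
first_day <- as.Date("2023-02-13")  # assumed first day with data
last_day  <- as.Date("2024-02-09")  # end_date used in the loop

# Inclusive number of days, and the expected number of hourly records.
n_days  <- as.numeric(last_day - first_day) + 1
n_hours <- n_days * 24

print(n_days)
print(n_hours)
```

Under these assumptions the expected number of hourly records agrees with the 8,688 rows reported for the flattened data.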

5 Flattening

5.1 Analysis of extracted data set

We first want to aggregate the extracted data into a more manageable format, such as a data frame. This step requires understanding the data structure in order to determine the flattening procedure.

# Inspect the structure of the first element
str(weather_data[[1]])

## List of 6
##  $ message : chr "Count: 24"
##  $ cod     : chr "200"
##  $ city_id : int 1
##  $ calctime: num 0.00387
##  $ cnt     : int 24
##  $ list    :'data.frame':    24 obs. of  5 variables:
##   ..$ dt     : int [1:24] 1676246400 1676250000 1676253600 1676257200 1676260800 1676264400 1676268000 1676271600 1676275200 1676278800 ...
##   ..$ main   :'data.frame':  24 obs. of  6 variables:
##   .. ..$ temp      : num [1:24] 289 287 286 286 285 ...
##   .. ..$ feels_like: num [1:24] 288 286 285 285 284 ...
##   .. ..$ pressure  : int [1:24] 1018 1018 1018 1019 1018 1018 1018 1018 1018 1018 ...
##   .. ..$ humidity  : int [1:24] 48 61 67 69 72 74 76 85 87 88 ...
##   .. ..$ temp_min  : num [1:24] 286 284 284 283 282 ...
##   .. ..$ temp_max  : num [1:24] 291 289 287 287 286 ...
##   ..$ wind   :'data.frame':  24 obs. of  3 variables:
##   .. ..$ speed: num [1:24] 5.14 5.66 5.66 4.63 3.09 3.09 1.34 0.45 0.45 0 ...
##   .. ..$ deg  : int [1:24] 180 170 180 210 0 0 225 248 192 0 ...
##   .. ..$ gust : num [1:24] NA 8.23 NA NA NA NA 1.34 0.89 1.34 NA ...
##   ..$ clouds :'data.frame':  24 obs. of  1 variable:
##   .. ..$ all: int [1:24] 0 0 0 20 75 20 40 20 20 75 ...
##   ..$ weather:List of 24
##   .. ..$ :'data.frame':  1 obs. of  4 variables:
##   .. .. ..$ id         : int 800
##   .. .. ..$ main       : chr "Clear"
##   .. .. ..$ description: chr "clear sky"
##   .. .. ..$ icon       : chr "01d"
##   .. ..$ :'data.frame':  1 obs. of  4 variables:
##   .. .. ..$ id         : int 800
##   .. .. ..$ main       : chr "Clear"
##   .. .. ..$ description: chr "clear sky"
##   .. .. ..$ icon       : chr "01n"
##   .. ..$ :'data.frame':  1 obs. of  4 variables:
##   .. .. ..$ id         : int 800
##   .. .. ..$ main       : chr "Clear"
##   .. .. ..$ description: chr "clear sky"
##   .. .. ..$ icon       : chr "01n"
##   .. ..$ :'data.frame':  1 obs. of  4 variables:
##   .. .. ..$ id         : int 801
##   .. .. ..$ main       : chr "Clouds"
##   .. .. ..$ description: chr "few clouds"
##   .. .. ..$ icon       : chr "02n"
##   .. ..$ :'data.frame':  1 obs. of  4 variables:
##   .. .. ..$ id         : int 803
##   .. .. ..$ main       : chr "Clouds"
##   .. .. ..$ description: chr "broken clouds"
##   .. .. ..$ icon       : chr "04n"
##   .. ..$ :'data.frame':  1 obs. of  4 variables:
##   .. .. ..$ id         : int 801
##   .. .. ..$ main       : chr "Clouds"
##   .. .. ..$ description: chr "few clouds"
##   .. .. ..$ icon       : chr "02n"
##   .. ..$ :'data.frame':  1 obs. of  4 variables:
##   .. .. ..$ id         : int 802
##   .. .. ..$ main       : chr "Clouds"
##   .. .. ..$ description: chr "scattered clouds"
##   .. .. ..$ icon       : chr "03n"
##   .. ..$ :'data.frame':  1 obs. of  4 variables:
##   .. .. ..$ id         : int 801
##   .. .. ..$ main       : chr "Clouds"
##   .. .. ..$ description: chr "few clouds"
##   .. .. ..$ icon       : chr "02n"
##   .. ..$ :'data.frame':  1 obs. of  4 variables:
##   .. .. ..$ id         : int 801
##   .. .. ..$ main       : chr "Clouds"
##   .. .. ..$ description: chr "few clouds"
##   .. .. ..$ icon       : chr "02n"
##   .. ..$ :'data.frame':  1 obs. of  4 variables:
##   .. .. ..$ id         : int 803
##   .. .. ..$ main       : chr "Clouds"
##   .. .. ..$ description: chr "broken clouds"
##   .. .. ..$ icon       : chr "04n"
##   .. ..$ :'data.frame':  1 obs. of  4 variables:
##   .. .. ..$ id         : int 802
##   .. .. ..$ main       : chr "Clouds"
##   .. .. ..$ description: chr "scattered clouds"
##   .. .. ..$ icon       : chr "03n"
##   .. ..$ :'data.frame':  1 obs. of  4 variables:
##   .. .. ..$ id         : int 802
##   .. .. ..$ main       : chr "Clouds"
##   .. .. ..$ description: chr "scattered clouds"
##   .. .. ..$ icon       : chr "03n"
##   .. ..$ :'data.frame':  1 obs. of  4 variables:
##   .. .. ..$ id         : int 802
##   .. .. ..$ main       : chr "Clouds"
##   .. .. ..$ description: chr "scattered clouds"
##   .. .. ..$ icon       : chr "03n"
##   .. ..$ :'data.frame':  1 obs. of  4 variables:
##   .. .. ..$ id         : int 802
##   .. .. ..$ main       : chr "Clouds"
##   .. .. ..$ description: chr "scattered clouds"
##   .. .. ..$ icon       : chr "03n"
##   .. ..$ :'data.frame':  1 obs. of  4 variables:
##   .. .. ..$ id         : int 741
##   .. .. ..$ main       : chr "Fog"
##   .. .. ..$ description: chr "fog"
##   .. .. ..$ icon       : chr "50d"
##   .. ..$ :'data.frame':  1 obs. of  4 variables:
##   .. .. ..$ id         : int 800
##   .. .. ..$ main       : chr "Clear"
##   .. .. ..$ description: chr "clear sky"
##   .. .. ..$ icon       : chr "01d"
##   .. ..$ :'data.frame':  1 obs. of  4 variables:
##   .. .. ..$ id         : int 800
##   .. .. ..$ main       : chr "Clear"
##   .. .. ..$ description: chr "clear sky"
##   .. .. ..$ icon       : chr "01d"
##   .. ..$ :'data.frame':  1 obs. of  4 variables:
##   .. .. ..$ id         : int 800
##   .. .. ..$ main       : chr "Clear"
##   .. .. ..$ description: chr "clear sky"
##   .. .. ..$ icon       : chr "01d"
##   .. ..$ :'data.frame':  1 obs. of  4 variables:
##   .. .. ..$ id         : int 800
##   .. .. ..$ main       : chr "Clear"
##   .. .. ..$ description: chr "clear sky"
##   .. .. ..$ icon       : chr "01d"
##   .. ..$ :'data.frame':  1 obs. of  4 variables:
##   .. .. ..$ id         : int 800
##   .. .. ..$ main       : chr "Clear"
##   .. .. ..$ description: chr "clear sky"
##   .. .. ..$ icon       : chr "01d"
##   .. ..$ :'data.frame':  1 obs. of  4 variables:
##   .. .. ..$ id         : int 800
##   .. .. ..$ main       : chr "Clear"
##   .. .. ..$ description: chr "clear sky"
##   .. .. ..$ icon       : chr "01d"
##   .. ..$ :'data.frame':  1 obs. of  4 variables:
##   .. .. ..$ id         : int 803
##   .. .. ..$ main       : chr "Clouds"
##   .. .. ..$ description: chr "broken clouds"
##   .. .. ..$ icon       : chr "04d"
##   .. ..$ :'data.frame':  1 obs. of  4 variables:
##   .. .. ..$ id         : int 800
##   .. .. ..$ main       : chr "Clear"
##   .. .. ..$ description: chr "clear sky"
##   .. .. ..$ icon       : chr "01d"
##   .. ..$ :'data.frame':  1 obs. of  4 variables:
##   .. .. ..$ id         : int 802
##   .. .. ..$ main       : chr "Clouds"
##   .. .. ..$ description: chr "scattered clouds"
##   .. .. ..$ icon       : chr "03d"

Based on the structure of the JSON data, the weather data set is organized into a list with 6 top-level elements:

message: A string providing information about the data count.
cod: A string indicating the status code of the API response.
city_id: An integer representing the ID of the city for which the data is fetched.
calctime: A numeric value representing the calculation time for the API response.
cnt: An integer indicating the count of data points (24 observations represent 24 hours of a day).
list: The main element of interest, containing the actual weather data in a data frame format with 24 observations and multiple variables.

The structure of the list is:

dt: UNIX time stamp for each observation, representing the date and time.
main: A nested data frame containing key weather measurements such as temperature (temp), sensation of temperature (feels_like), the pressure (pressure), the humidity (humidity), the minimum temperature (temp_min), and maximum temperature (temp_max).
wind: A nested data frame providing wind-related data, including speed (speed), direction in degrees (deg), and gusts, or brief increase in speed of the wind (gust).
clouds: A nested data frame with information on cloud coverage (all, representing the percentage of cloud cover).
weather: A list of data frames, each with a single observation containing weather condition codes (id), the main weather condition (main), a textual description of the conditions (description), and an icon code (icon).

If we access each nested structure individually, we obtain a more detailed understanding of the structure. We start with the first nested level, “main”:

str(weather_data[[1]]$list$main)

## 'data.frame':    24 obs. of  6 variables:
##  $ temp      : num  289 287 286 286 285 ...
##  $ feels_like: num  288 286 285 285 284 ...
##  $ pressure  : int  1018 1018 1018 1019 1018 1018 1018 1018 1018 1018 ...
##  $ humidity  : int  48 61 67 69 72 74 76 85 87 88 ...
##  $ temp_min  : num  286 284 284 283 282 ...
##  $ temp_max  : num  291 289 287 287 286 ...

We conclude that this is a very straightforward data frame, as described previously. Now we will take a look at the “wind” nesting:

str(weather_data[[1]]$list$wind)

## 'data.frame':    24 obs. of  3 variables:
##  $ speed: num  5.14 5.66 5.66 4.63 3.09 3.09 1.34 0.45 0.45 0 ...
##  $ deg  : int  180 170 180 210 0 0 225 248 192 0 ...
##  $ gust : num  NA 8.23 NA NA NA NA 1.34 0.89 1.34 NA ...

As we can observe, there are a few missing values in the “gust” variable; they could also be understood as an absence of gusts. Now we will take a look at the “clouds” nesting:
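
The two readings of the missing gust values can be illustrated on the ten values printed above. This is only a sketch: treating NA as "no gust" is an interpretation, not something the API output confirms, and the names n_missing and gust_zero_filled are illustrative.

```r
# First ten gust values copied from the str() output above.
gust <- c(NA, 8.23, NA, NA, NA, NA, 1.34, 0.89, 1.34, NA)

# Reading 1: count the values as missing.
n_missing <- sum(is.na(gust))

# Reading 2 (an assumption): interpret NA as "no gust" and replace it with 0.
gust_zero_filled <- ifelse(is.na(gust), 0, gust)

print(n_missing)
print(gust_zero_filled)
```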

str(weather_data[[1]]$list$clouds)

## 'data.frame':    24 obs. of  1 variable:
##  $ all: int  0 0 0 20 75 20 40 20 20 75 ...

There is only one variable so there is no problem here. Now we will take a look at the “weather” nesting:

str(weather_data[[1]]$list$weather)

## List of 24
##  $ :'data.frame':    1 obs. of  4 variables:
##   ..$ id         : int 800
##   ..$ main       : chr "Clear"
##   ..$ description: chr "clear sky"
##   ..$ icon       : chr "01d"
##  $ :'data.frame':    1 obs. of  4 variables:
##   ..$ id         : int 800
##   ..$ main       : chr "Clear"
##   ..$ description: chr "clear sky"
##   ..$ icon       : chr "01n"
##  $ :'data.frame':    1 obs. of  4 variables:
##   ..$ id         : int 800
##   ..$ main       : chr "Clear"
##   ..$ description: chr "clear sky"
##   ..$ icon       : chr "01n"
##  $ :'data.frame':    1 obs. of  4 variables:
##   ..$ id         : int 801
##   ..$ main       : chr "Clouds"
##   ..$ description: chr "few clouds"
##   ..$ icon       : chr "02n"
##  $ :'data.frame':    1 obs. of  4 variables:
##   ..$ id         : int 803
##   ..$ main       : chr "Clouds"
##   ..$ description: chr "broken clouds"
##   ..$ icon       : chr "04n"
##  $ :'data.frame':    1 obs. of  4 variables:
##   ..$ id         : int 801
##   ..$ main       : chr "Clouds"
##   ..$ description: chr "few clouds"
##   ..$ icon       : chr "02n"
##  $ :'data.frame':    1 obs. of  4 variables:
##   ..$ id         : int 802
##   ..$ main       : chr "Clouds"
##   ..$ description: chr "scattered clouds"
##   ..$ icon       : chr "03n"
##  $ :'data.frame':    1 obs. of  4 variables:
##   ..$ id         : int 801
##   ..$ main       : chr "Clouds"
##   ..$ description: chr "few clouds"
##   ..$ icon       : chr "02n"
##  $ :'data.frame':    1 obs. of  4 variables:
##   ..$ id         : int 801
##   ..$ main       : chr "Clouds"
##   ..$ description: chr "few clouds"
##   ..$ icon       : chr "02n"
##  $ :'data.frame':    1 obs. of  4 variables:
##   ..$ id         : int 803
##   ..$ main       : chr "Clouds"
##   ..$ description: chr "broken clouds"
##   ..$ icon       : chr "04n"
##  $ :'data.frame':    1 obs. of  4 variables:
##   ..$ id         : int 802
##   ..$ main       : chr "Clouds"
##   ..$ description: chr "scattered clouds"
##   ..$ icon       : chr "03n"
##  $ :'data.frame':    1 obs. of  4 variables:
##   ..$ id         : int 802
##   ..$ main       : chr "Clouds"
##   ..$ description: chr "scattered clouds"
##   ..$ icon       : chr "03n"
##  $ :'data.frame':    1 obs. of  4 variables:
##   ..$ id         : int 802
##   ..$ main       : chr "Clouds"
##   ..$ description: chr "scattered clouds"
##   ..$ icon       : chr "03n"
##  $ :'data.frame':    1 obs. of  4 variables:
##   ..$ id         : int 802
##   ..$ main       : chr "Clouds"
##   ..$ description: chr "scattered clouds"
##   ..$ icon       : chr "03n"
##  $ :'data.frame':    1 obs. of  4 variables:
##   ..$ id         : int 741
##   ..$ main       : chr "Fog"
##   ..$ description: chr "fog"
##   ..$ icon       : chr "50d"
##  $ :'data.frame':    1 obs. of  4 variables:
##   ..$ id         : int 800
##   ..$ main       : chr "Clear"
##   ..$ description: chr "clear sky"
##   ..$ icon       : chr "01d"
##  $ :'data.frame':    1 obs. of  4 variables:
##   ..$ id         : int 800
##   ..$ main       : chr "Clear"
##   ..$ description: chr "clear sky"
##   ..$ icon       : chr "01d"
##  $ :'data.frame':    1 obs. of  4 variables:
##   ..$ id         : int 800
##   ..$ main       : chr "Clear"
##   ..$ description: chr "clear sky"
##   ..$ icon       : chr "01d"
##  $ :'data.frame':    1 obs. of  4 variables:
##   ..$ id         : int 800
##   ..$ main       : chr "Clear"
##   ..$ description: chr "clear sky"
##   ..$ icon       : chr "01d"
##  $ :'data.frame':    1 obs. of  4 variables:
##   ..$ id         : int 800
##   ..$ main       : chr "Clear"
##   ..$ description: chr "clear sky"
##   ..$ icon       : chr "01d"
##  $ :'data.frame':    1 obs. of  4 variables:
##   ..$ id         : int 800
##   ..$ main       : chr "Clear"
##   ..$ description: chr "clear sky"
##   ..$ icon       : chr "01d"
##  $ :'data.frame':    1 obs. of  4 variables:
##   ..$ id         : int 803
##   ..$ main       : chr "Clouds"
##   ..$ description: chr "broken clouds"
##   ..$ icon       : chr "04d"
##  $ :'data.frame':    1 obs. of  4 variables:
##   ..$ id         : int 800
##   ..$ main       : chr "Clear"
##   ..$ description: chr "clear sky"
##   ..$ icon       : chr "01d"
##  $ :'data.frame':    1 obs. of  4 variables:
##   ..$ id         : int 802
##   ..$ main       : chr "Clouds"
##   ..$ description: chr "scattered clouds"
##   ..$ icon       : chr "03d"

This nested list holds 24 elements, one per hourly observation, each of them a one-row data frame. Since the other variables are already reported on an hourly basis, this extra level of nesting is redundant and complicates the flattening. The content is a textual description of the weather (description); since we are more interested in numeric values, we can exclude these variables and use the clouds variable instead.
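
If the weather condition were needed after all, one value per hour could be pulled out of the list of one-row data frames. The sketch below works on a small hand-built list that mimics the structure shown above; it is not the extracted data, and the names weather_main and weather_id are illustrative.

```r
# Hand-built stand-in for weather_data[[1]]$list$weather (three hours only).
weather <- list(
  data.frame(id = 800L, main = "Clear",  description = "clear sky",  icon = "01d"),
  data.frame(id = 801L, main = "Clouds", description = "few clouds", icon = "02n"),
  data.frame(id = 741L, main = "Fog",    description = "fog",        icon = "50d")
)

# Take the first row of each element so that every hour yields exactly one value.
weather_main <- vapply(weather, function(w) as.character(w$main[1]), character(1))
weather_id   <- vapply(weather, function(w) as.integer(w$id[1]), integer(1))

print(weather_main)
print(weather_id)
```

The resulting vectors have one entry per hour and could be added as ordinary columns.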

5.2 Flattening the data set into a two-dimensional data frame

We will use an extraction method for each variable according to its corresponding nested table:
- From the top level (not nested): message, cod, city_id, calctime, cnt.
- From “list”: dt.
- From “main”: temp, feels_like, pressure, humidity, temp_min, temp_max.
- From “wind”: speed, deg, gust.
- From “clouds”: all.
- From “weather”: id, main, description, icon (these are not extracted by the function that follows, for the reasons discussed in the previous section).

# Function to extract and flatten data for each entry
extract_and_flatten <- function(entry) {
  # Check for wind_gust and handle missing data
  wind_gust <- if("gust" %in% names(entry$list$wind)) {
    entry$list$wind$gust
  } else {
    rep(NA_real_, length(entry$list$dt))
  }
  
  # Ensure wind_gust has the same length as other columns
  if(length(wind_gust) != length(entry$list$dt)) {
    wind_gust <- rep(NA_real_, length(entry$list$dt))
  }
  
  # Extracting fields and combining into a data frame
  data <- tibble(
    message = entry$message,
    cod = entry$cod,
    city_id = entry$city_id,
    calctime = entry$calctime,
    cnt = entry$cnt,
    dt = entry$list$dt,
    temp = entry$list$main$temp,
    feels_like = entry$list$main$feels_like,
    pressure = entry$list$main$pressure,
    humidity = entry$list$main$humidity,
    temp_min = entry$list$main$temp_min,
    temp_max = entry$list$main$temp_max,
    wind_speed = entry$list$wind$speed,
    wind_deg = entry$list$wind$deg,
    wind_gust = wind_gust,
    clouds_all = entry$list$clouds$all
  )
  return(data)
}

# Applying the function to each item in weather_data and combining into a single data frame
flattened_data <- map_df(weather_data, extract_and_flatten)

# Viewing the structure of the flattened data
str(flattened_data)

## tibble [8,688 × 16] (S3: tbl_df/tbl/data.frame)
##  $ message   : chr [1:8688] "Count: 24" "Count: 24" "Count: 24" "Count: 24" ...
##  $ cod       : chr [1:8688] "200" "200" "200" "200" ...
##  $ city_id   : int [1:8688] 1 1 1 1 1 1 1 1 1 1 ...
##  $ calctime  : num [1:8688] 0.00387 0.00387 0.00387 0.00387 0.00387 ...
##  $ cnt       : int [1:8688] 24 24 24 24 24 24 24 24 24 24 ...
##  $ dt        : int [1:8688] 1676246400 1676250000 1676253600 1676257200 1676260800 1676264400 1676268000 1676271600 1676275200 1676278800 ...
##  $ temp      : num [1:8688] 289 287 286 286 285 ...
##  $ feels_like: num [1:8688] 288 286 285 285 284 ...
##  $ pressure  : int [1:8688] 1018 1018 1018 1019 1018 1018 1018 1018 1018 1018 ...
##  $ humidity  : int [1:8688] 48 61 67 69 72 74 76 85 87 88 ...
##  $ temp_min  : num [1:8688] 286 284 284 283 282 ...
##  $ temp_max  : num [1:8688] 291 289 287 287 286 ...
##  $ wind_speed: num [1:8688] 5.14 5.66 5.66 4.63 3.09 3.09 1.34 0.45 0.45 0 ...
##  $ wind_deg  : int [1:8688] 180 170 180 210 0 0 225 248 192 0 ...
##  $ wind_gust : num [1:8688] NA 8.23 NA NA NA NA 1.34 0.89 1.34 NA ...
##  $ clouds_all: int [1:8688] 0 0 0 20 75 20 40 20 20 75 ...

We now have a flat data set with one row per hourly observation and 16 columns.
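
As an alternative to extracting each field by hand, jsonlite provides a flatten() function that expands nested data frame columns into regular columns. The sketch below applies it to a small hand-written JSON sample that imitates the API response; the sample values and the names sample_json, parsed and flat are illustrative. The call is written as jsonlite::flatten() because purrr also exports a function named flatten.

```r
library(jsonlite)

# Hand-written sample imitating two hourly records; the first has no gust field.
sample_json <- '{"cnt":2,"list":[
  {"dt":1676246400,"main":{"temp":289.1,"humidity":48},
   "wind":{"speed":5.14,"deg":180},"clouds":{"all":0},
   "weather":[{"id":800,"main":"Clear"}]},
  {"dt":1676250000,"main":{"temp":287.3,"humidity":61},
   "wind":{"speed":5.66,"deg":170,"gust":8.23},"clouds":{"all":0},
   "weather":[{"id":800,"main":"Clear"}]}
]}'

parsed <- fromJSON(sample_json)

# Nested data frames (main, wind, clouds) become columns such as main.temp.
flat <- jsonlite::flatten(parsed$list)

print(names(flat))
```

The weather column remains a list column after this step, so it would still need separate handling or removal.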

We will rename the data set as: “houston_weather_mdf”

houston_weather_mdf <- flattened_data

We are ready to export the data to a CSV file.

5.2.1 Export CSV

write.csv(houston_weather_mdf, "houston_weather_data.csv", row.names = FALSE) # Export houston_weather_mdf to a CSV file.

6 Data Cleaning

6.1 Ordering the data frame

Now, to make the analysis easier, we will add the following columns (variables):
- obs_count: An observation counter that allows a more orderly visualization of the table. We generate it with seq_along() over the dt column.
- datetime: Each Unix timestamp in the dt column converted to a POSIXct date-time object, which is easier to read and work with for date-time operations in R.
- year: The year extracted from the datetime column, to analyze patterns by year.
- month: The month extracted from the datetime column, to analyze patterns by month. It is stored as a two-digit character string, which gives us a categorical value.
- day: The day extracted from the datetime column, to analyze patterns by day.
- season: A categorical variable with the season of the year (Winter, Spring, Summer, Autumn).

houston_weather_mdf$obs_count <- seq_along(houston_weather_mdf$dt) # Add observation "count" variable
houston_weather_mdf$datetime <- as.POSIXct(houston_weather_mdf$dt, origin="1970-01-01", tz="UTC") # Add "datetime" variable by converting Unix timestamp to POSIXct date-time 
houston_weather_mdf$year <- year(houston_weather_mdf$datetime) # Add "year" variable by extracting year from datetime.
houston_weather_mdf$month <- format(houston_weather_mdf$datetime, "%m") # Add "month" variable by extracting month from datetime.
houston_weather_mdf$day <- day(houston_weather_mdf$datetime) # Add "day" variable by extracting day from datetime.

# Add a new column for the "season"
houston_weather_mdf <- houston_weather_mdf %>%
  mutate(
    season = case_when(
      month(datetime) %in% c(12, 1, 2) ~ "Winter",   # Define Winter
      month(datetime) %in% c(3, 4, 5) ~ "Spring",  # Define Spring
      month(datetime) %in% c(6, 7, 8) ~ "Summer",  # Define Summer
      month(datetime) %in% c(9, 10, 11) ~ "Autumn", # Define Autumn
    
      # Default case
      TRUE ~ NA_character_
    )
  )
houston_weather_mdf <- houston_weather_mdf %>%
  dplyr::select(obs_count, city_id, dt, datetime, year, month, day, season, temp, feels_like, pressure, humidity, temp_min, temp_max, wind_speed, wind_deg, wind_gust, clouds_all, calctime, cod, cnt, message) # Rearrange the columns to a specified order. 

# View the modified data frame to confirm changes
head(houston_weather_mdf)

## # A tibble: 6 × 22
##   obs_count city_id        dt datetime             year month   day season  temp
##       <int>   <int>     <int> <dttm>              <dbl> <chr> <int> <chr>  <dbl>
## 1         1       1    1.68e9 2023-02-13 00:00:00  2023 02       13 Winter  289.
## 2         2       1    1.68e9 2023-02-13 01:00:00  2023 02       13 Winter  287.
## 3         3       1    1.68e9 2023-02-13 02:00:00  2023 02       13 Winter  286.
## 4         4       1    1.68e9 2023-02-13 03:00:00  2023 02       13 Winter  286.
## 5         5       1    1.68e9 2023-02-13 04:00:00  2023 02       13 Winter  285.
## 6         6       1    1.68e9 2023-02-13 05:00:00  2023 02       13 Winter  284.
## # ℹ 13 more variables: feels_like <dbl>, pressure <int>, humidity <int>,
## #   temp_min <dbl>, temp_max <dbl>, wind_speed <dbl>, wind_deg <int>,
## #   wind_gust <dbl>, clouds_all <int>, calctime <dbl>, cod <chr>, cnt <int>,
## #   message <chr>
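
The datetime column above is expressed in UTC. The sketch below shows how the first timestamp of the table reads in UTC and in Houston local time, assuming the "America/Chicago" time zone name for Houston (an assumption, it is not stated in the original). The names utc_label and local_label are illustrative.

```r
# First dt value of the flattened data.
dt <- 1676246400

# Same conversion as in the document, with UTC as the time zone.
datetime_utc <- as.POSIXct(dt, origin = "1970-01-01", tz = "UTC")

# format() can render the same instant in another time zone.
utc_label   <- format(datetime_utc, "%Y-%m-%d %H:%M:%S")
local_label <- format(datetime_utc, "%Y-%m-%d %H:%M:%S", tz = "America/Chicago")

print(utc_label)
print(local_label)
```

Because the local time is several hours behind UTC, variables derived from datetime, such as day, refer to the UTC calendar day rather than the local one.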

6.2 Understanding the data frame

dim(houston_weather_mdf)

## [1] 8688   22

The data frame contains 22 variables (described in Table 1) and 8688 observations.

Now we will review the structure of the data frame and the valid ranges of its variables. For this we will create a table using the kable function from the knitr package. This table will contain a column for each variable name, a description of the variable, the data type, and the valid range.

Creating a table to describe variables

# Define the data frame for variable descriptions
variables_description <- data.frame(
  Variable = c("obs_count", "city_id", "dt", "datetime", "year", "month", "day", "season", "temp", "feels_like", "pressure", "humidity", "temp_min", "temp_max", "wind_speed", "wind_deg", "wind_gust", "clouds_all", "calctime", "cod", "cnt", "message"),
  `Data Type` = c("Integer", "Integer", "Integer", "POSIXct", "Integer", "Character", "Integer", "Character", "Numeric", "Numeric", "Integer", "Integer", "Numeric", "Numeric", "Numeric", "Integer", "Numeric", "Integer", "Numeric", "Character", "Integer", "Character"),
  `Valid Range` = c("1-N", "Fixed value", "Unix timestamp", "2023-Feb-09 to 2024-Feb-09", "2023-2024", "Jan-Dec", "1-31", "Season names", "183.85ºK - 329.85ºK", "183.85ºK - 329.85ºK", "980 - 1030 hPa", "0-100%", "183.85ºK - 329.85ºK", "183.85ºK - 329.85ºK", "0 - 113.3 m/s", "0-360º", "0 - 113.3 m/s", "0-100%", "Calculation time range", "Response codes", "Observation count range", "API response messages"),
  Description = c("Observation count", "City identifier for Houston", "Unix timestamp", "Readable date and time", "Year of the weather observation, indicating when the data was recorded.", "Month of the year", "Day of the month for the weather observation, ranging from 1 to 31.", "Meteorological season of the year when the weather observation was made, categorized as Winter, Spring, Summer, or Autumn.", "Temperature in Kelvin (ºK)", "Feels like temperature in Kelvin (ºK)", "Atmospheric pressure in hectopascals (hPa)", "Relative humidity in %", "Minimum recorded temperature in Kelvin (ºK)", "Maximum recorded temperature in Kelvin (ºK)", "Wind speed in m/s", "Wind direction in meteorological degrees (º)", "Wind gust speed in m/s", "Cloud cover percentage (%)", "Time taken for API to calculate response", "API response code", "Number of observations included in the API response", "Message returned by the API")
)
# Generate markdown table with kable and style it with kableExtra
kable_styled <- kable(variables_description, 
                      format = "html", 
                      caption = "Table 1. Description of Variables in the Houston Weather Dataset",
                      align = c('l', 'l', 'l', 'l')) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), 
                full_width = FALSE, 
                position = "left") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2, width = "100px") %>%
  column_spec(3, width = "150px") %>%
  column_spec(4, width = "300px")

# Print the styled table
kable_styled

Table 1. Description of Variables in the Houston Weather Dataset
Variable	Data.Type	Valid.Range	Description
obs_count	Integer	1-N	Observation count
city_id	Integer	Fixed value	City identifier for Houston
dt	Integer	Unix timestamp	Unix timestamp
datetime	POSIXct	2023-Feb-09 to 2024-Feb-09	Readable date and time
year	Integer	2023-2024	Year of the weather observation, indicating when the data was recorded.
month	Character	Jan-Dec	Month of the year
day	Integer	1-31	Day of the month for the weather observation, ranging from 1 to 31.
season	Character	Season names	Meteorological season of the year when the weather observation was made, categorized as Winter, Spring, Summer, or Autumn.
temp	Numeric	183.85ºK - 329.85ºK	Temperature in Kelvin (ºK)
feels_like	Numeric	183.85ºK - 329.85ºK	Feels like temperature in Kelvin (ºK)
pressure	Integer	980 - 1030 hPa	Atmospheric pressure in hectopascals (hPa)
humidity	Integer	0-100%	Relative humidity in %
temp_min	Numeric	183.85ºK - 329.85ºK	Minimum recorded temperature in Kelvin (ºK)
temp_max	Numeric	183.85ºK - 329.85ºK	Maximum recorded temperature in Kelvin (ºK)
wind_speed	Numeric	0 - 113.3 m/s	Wind speed in m/s
wind_deg	Integer	0-360º	Wind direction in meteorological degrees (º)
wind_gust	Numeric	0 - 113.3 m/s	Wind gust speed in m/s
clouds_all	Integer	0-100%	Cloud cover percentage (%)
calctime	Numeric	Calculation time range	Time taken for API to calculate response
cod	Character	Response codes	API response code
cnt	Integer	Observation count range	Number of observations included in the API response
message	Character	API response messages	Message returned by the API

The table summarizes the weather measurements in the Houston Weather Dataset, from temperature metrics to wind dynamics and cloud cover. Understanding these parameters allows for an examination of weather patterns in Houston.
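Because all temperatures are reported in Kelvin, it can help to translate the limits in Table 1 into degrees Celsius when judging whether they are plausible. The helper below is a minimal sketch; the name `kelvin_to_celsius` is hypothetical and is not part of the original analysis.

```r
# Hypothetical helper: convert Kelvin to degrees Celsius.
kelvin_to_celsius <- function(kelvin) {
  kelvin - 273.15
}

# Valid temperature range from Table 1, expressed in Celsius
valid_range_celsius <- kelvin_to_celsius(c(183.85, 329.85))
valid_range_celsius
```

The valid range in Table 1 therefore corresponds to approximately -89.3 ºC to 56.7 ºC.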

6.3 Values out of range

No values are immediately identifiable as out of range without more context. Specific thresholds for what constitutes an out-of-range value depend on the expected weather conditions for Houston and the measurement standards for each variable.

- Temperature (temp, feels_like, temp_min, temp_max): The minimum and maximum temperatures are within a plausible range for Houston’s climate.
- Pressure: Values range from 994 to 1034 hPa, which is typical for atmospheric pressure.
- Humidity: Ranges from 20% to 100%, which is typical.
- Wind speed and wind gust: The maximum wind speed is 34.47 m/s, which is high but not impossible during storm events. The maximum wind gust is 24.18 m/s.
- Clouds_all: Ranges from 0% to 100%, from clear skies to completely overcast, which is typical.
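The range check described above can also be done programmatically by counting the values that fall outside an assumed valid interval. This is a minimal sketch on made-up data; `count_out_of_range` and `toy_humidity` are hypothetical names, and the limits passed to the function are the ones listed in Table 1.

```r
# Hypothetical helper: count values outside an assumed valid range.
# Missing values are ignored rather than counted as out of range.
count_out_of_range <- function(x, lower, upper) {
  sum(x < lower | x > upper, na.rm = TRUE)
}

# Toy humidity readings (%); 105 and -3 are deliberately invalid.
toy_humidity <- c(20, 75, 105, 100, NA, -3)
n_invalid <- count_out_of_range(toy_humidity, lower = 0, upper = 100)
n_invalid
```

For the toy vector the function returns 2. Applied to a real column, a result of 0 would support the conclusion that no value is out of range.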

6.4 Missing values

To detect missing values we will apply the summary() function to the data frame. For each variable that contains missing values, the output reports the number of NA’s.

summary(houston_weather_mdf)

##    obs_count       city_id        dt               datetime                  
##  Min.   :   1   Min.   :1   Min.   :1.676e+09   Min.   :2023-02-13 00:00:00  
##  1st Qu.:2173   1st Qu.:1   1st Qu.:1.684e+09   1st Qu.:2023-05-14 11:45:00  
##  Median :4344   Median :1   Median :1.692e+09   Median :2023-08-12 23:30:00  
##  Mean   :4344   Mean   :1   Mean   :1.692e+09   Mean   :2023-08-12 23:30:00  
##  3rd Qu.:6516   3rd Qu.:1   3rd Qu.:1.700e+09   3rd Qu.:2023-11-11 11:15:00  
##  Max.   :8688   Max.   :1   Max.   :1.708e+09   Max.   :2024-02-09 23:00:00  
##                                                                              
##       year         month                day           season         
##  Min.   :2023   Length:8688        Min.   : 1.00   Length:8688       
##  1st Qu.:2023   Class :character   1st Qu.: 8.00   Class :character  
##  Median :2023   Mode  :character   Median :16.00   Mode  :character  
##  Mean   :2023                      Mean   :15.76                     
##  3rd Qu.:2023                      3rd Qu.:23.00                     
##  Max.   :2024                      Max.   :31.00                     
##                                                                      
##       temp         feels_like       pressure       humidity     
##  Min.   :266.1   Min.   :259.1   Min.   : 994   Min.   : 20.00  
##  1st Qu.:290.0   1st Qu.:289.8   1st Qu.:1011   1st Qu.: 60.00  
##  Median :296.6   Median :297.0   Median :1014   Median : 75.00  
##  Mean   :295.8   Mean   :297.2   Mean   :1014   Mean   : 71.95  
##  3rd Qu.:301.6   3rd Qu.:305.3   3rd Qu.:1017   3rd Qu.: 85.00  
##  Max.   :315.0   Max.   :321.3   Max.   :1034   Max.   :100.00  
##                                                                 
##     temp_min        temp_max       wind_speed        wind_deg    
##  Min.   :264.5   Min.   :267.2   Min.   : 0.000   Min.   :  0.0  
##  1st Qu.:288.2   1st Qu.:291.5   1st Qu.: 3.090   1st Qu.:100.0  
##  Median :294.9   Median :298.1   Median : 4.630   Median :160.0  
##  Mean   :293.9   Mean   :297.4   Mean   : 4.867   Mean   :161.7  
##  3rd Qu.:299.8   3rd Qu.:303.1   3rd Qu.: 6.170   3rd Qu.:210.0  
##  Max.   :313.3   Max.   :317.1   Max.   :34.470   Max.   :360.0  
##                                                                  
##    wind_gust        clouds_all        calctime            cod           
##  Min.   : 0.420   Min.   :  0.00   Min.   :0.001696   Length:8688       
##  1st Qu.: 7.200   1st Qu.:  0.00   1st Qu.:0.003853   Class :character  
##  Median : 9.390   Median : 40.00   Median :0.004107   Mode  :character  
##  Mean   : 9.343   Mean   : 48.63   Mean   :0.004472                     
##  3rd Qu.:11.830   3rd Qu.:100.00   3rd Qu.:0.004486                     
##  Max.   :24.180   Max.   :100.00   Max.   :0.038480                     
##  NA's   :5582                                                           
##       cnt       message         
##  Min.   :24   Length:8688       
##  1st Qu.:24   Class :character  
##  Median :24   Mode  :character  
##  Mean   :24                     
##  3rd Qu.:24                     
##  Max.   :24                     
##

Variable “wind_gust” contains 5582 missing values (NA’s) out of 8688 observations. A significant portion of this data is missing, which could impact analyses related to wind gusts. We assume that these missing values represent the absence of a wind gust rather than an error in the observation, since wind is not necessarily constant. Therefore, we will replace all NAs in this specific variable with 0 (zeros).

houston_weather_mdf$wind_gust[is.na(houston_weather_mdf$wind_gust)] <- 0
summary(houston_weather_mdf$wind_gust)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00    3.34    7.72   24.18
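A more compact way to see which variables contain missing values is to count the NAs per column. The sketch below uses a small made-up data frame (`toy_weather`) in place of the real data.

```r
# Toy data frame standing in for the weather data (values are made up).
toy_weather <- data.frame(
  wind_speed = c(3.1, 4.6, 5.0, 2.2),
  wind_gust  = c(NA, 7.2, NA, 9.4)
)

# Number of missing values per column
na_counts <- colSums(is.na(toy_weather))
na_counts

# Share of missing values per column
na_share <- colMeans(is.na(toy_weather))
na_share
```

In the toy data, wind_gust has 2 missing values (a share of 0.5) and wind_speed has none.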

6.5 Outliers

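To identify outliers in our core variables we will apply the interquartile range (IQR) method. A value is flagged as an outlier if it lies more than 1.5 times the IQR below the first quartile or above the third quartile.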

# Create a function that applies the IQR rule and counts outliers
identify_outliers <- function(data, variable_name) {
  values <- data[[variable_name]]
  q1 <- quantile(values, 0.25, na.rm = TRUE) # First quartile
  q3 <- quantile(values, 0.75, na.rm = TRUE) # Third quartile
  iqr <- q3 - q1 # Interquartile range (lower case to avoid masking stats::IQR)

  lower_bound <- q1 - 1.5 * iqr # Lower bound calculation
  upper_bound <- q3 + 1.5 * iqr # Upper bound calculation

  outliers <- values < lower_bound | values > upper_bound # Flag values outside the bounds

  # Return the bounds and the number of flagged observations
  list(lower_bound = lower_bound, upper_bound = upper_bound, outliers = sum(outliers, na.rm = TRUE))
}

# Apply the function to each core variable and print the results
temp_outliers <- identify_outliers(houston_weather_mdf, "temp")
feels_like_outliers <- identify_outliers(houston_weather_mdf, "feels_like")
pressure_outliers <- identify_outliers(houston_weather_mdf, "pressure")
humidity_outliers <- identify_outliers(houston_weather_mdf, "humidity")
temp_min_outliers <- identify_outliers(houston_weather_mdf, "temp_min")
temp_max_outliers <- identify_outliers(houston_weather_mdf, "temp_max")
wind_speed_outliers <- identify_outliers(houston_weather_mdf, "wind_speed")
wind_deg_outliers <- identify_outliers(houston_weather_mdf, "wind_deg")
wind_gust_outliers <- identify_outliers(houston_weather_mdf, "wind_gust")
clouds_all_outliers <- identify_outliers(houston_weather_mdf, "clouds_all")
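The repeated assignments above could also be written as a single call that loops over the columns. The following is a minimal sketch on made-up data, not the code used for the results below; `count_iqr_outliers` and `toy_weather` are hypothetical names, and the helper takes a vector rather than a data frame and a column name.

```r
# Toy data (made up) with one obvious extreme value in each column.
toy_weather <- data.frame(
  temp     = c(290, 291, 292, 293, 294, 295, 296, 340),
  humidity = c(60, 62, 64, 66, 68, 70, 72, 5)
)

# Compact restatement of the IQR rule for a single numeric vector.
count_iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  sum(x < q[[1]] - 1.5 * iqr | x > q[[2]] + 1.5 * iqr, na.rm = TRUE)
}

# One call for all columns instead of one assignment per variable
outlier_counts <- sapply(toy_weather, count_iqr_outliers)
outlier_counts
```

In the toy data each column has exactly one flagged value (340 for temp and 5 for humidity).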

Identification of “temp” outliers

print(temp_outliers)

## $lower_bound
##   25% 
## 272.7 
## 
## $upper_bound
##    75% 
## 318.86 
## 
## $outliers
## [1] 53

There are 53 outliers in “temp”. Since they are all within the valid range, we will leave them.

Identification of “feels_like” outliers

print(feels_like_outliers)

## $lower_bound
##    25% 
## 266.51 
## 
## $upper_bound
##    75% 
## 328.59 
## 
## $outliers
## [1] 51

There are 51 outliers in “feels_like”. Since they are all within the valid range, we will leave them.

Identification of “pressure” outliers

print(pressure_outliers)

## $lower_bound
##  25% 
## 1002 
## 
## $upper_bound
##  75% 
## 1026 
## 
## $outliers
## [1] 345

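There are 345 outliers in “pressure”. The observed values (994 to 1034 hPa) are plausible for atmospheric pressure, so we will leave them, although the maximum is slightly above the 1030 hPa upper limit listed in Table 1.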

Identification of “humidity” outliers

print(humidity_outliers)

## $lower_bound
##  25% 
## 22.5 
## 
## $upper_bound
##   75% 
## 122.5 
## 
## $outliers
## [1] 10

There are 10 outliers in “humidity”. Since they are all within the valid range, we will leave them.

Identification of “temp_min” outliers

print(temp_min_outliers)

## $lower_bound
##    25% 
## 270.87 
## 
## $upper_bound
##    75% 
## 317.11 
## 
## $outliers
## [1] 54

There are 54 outliers in “temp_min”. Since they are all within the valid range, we will leave them.

Identification of “temp_max” outliers

print(temp_max_outliers)

## $lower_bound
##     25% 
## 274.145 
## 
## $upper_bound
##     75% 
## 320.505 
## 
## $outliers
## [1] 46

There are 46 outliers in “temp_max”. Since they are all within the valid range, we will leave them.

Identification of “wind_speed” outliers

print(wind_speed_outliers)

## $lower_bound
##   25% 
## -1.53 
## 
## $upper_bound
##   75% 
## 10.79 
## 
## $outliers
## [1] 164

There are 164 outliers in “wind_speed”. Since they are all within the valid range, we will leave them.

Identification of “wind_deg” outliers

print(wind_deg_outliers)

## $lower_bound
## 25% 
## -65 
## 
## $upper_bound
## 75% 
## 375 
## 
## $outliers
## [1] 0

There are no outliers in “wind_deg”.

Identification of “wind_gust” outliers

print(wind_gust_outliers)

## $lower_bound
##    25% 
## -11.58 
## 
## $upper_bound
##  75% 
## 19.3 
## 
## $outliers
## [1] 19

There are 19 outliers in “wind_gust”. Since a wind gust cannot be negative, the negative lower bound is not meaningful, so it is necessary to explore further.

hist(houston_weather_mdf$wind_gust)

The histogram provides a visual aid that allows us to safely say that there are no incorrect values.

Identification of “clouds_all” outliers

print(clouds_all_outliers)

## $lower_bound
##  25% 
## -150 
## 
## $upper_bound
## 75% 
## 250 
## 
## $outliers
## [1] 0

There are no outliers in “clouds_all”.

Our data set is now clean. Outliers have been identified and understood, and missing values have been identified, explained and replaced. The data frame is ready to be analysed further.

7 Data Analysis

In this section we will try to understand patterns using graphs and search for possible relationships between variables.

7.1 Temperature

# Histogram for Temperature with Mean and Median
hist(houston_weather_mdf$temp, 20, main = "Histogram of Temperature", xlab = "Temperature", col = "blue")
abline(v = mean(houston_weather_mdf$temp, na.rm = TRUE), col = "red", lwd = 2)  # Mean line
median_temp <- median(houston_weather_mdf$temp, na.rm = TRUE)
abline(v = median_temp, col = "darkgreen", lwd = 2)  # Median line

The histogram shows the distribution of temperature in Houston over one year, using roughly 20 bins spanning 266.1ºK to 315.0ºK. The red and dark green vertical lines mark the mean and the median, respectively. The distribution is slightly left-skewed with its peak at around 301ºK, which suggests that temperatures near this value occurred most frequently during the observed period.
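A rough numerical cross-check of the skew read from a histogram is to compare the mean with the median: a mean below the median points to a longer tail on the left. In the summary of section 6.4, the mean temperature (295.8) is below the median (296.6), which is consistent with a slight left skew. The helper below is a rule-of-thumb sketch on made-up data, not a formal skewness test; `skew_direction` and `toy_temp` are hypothetical names.

```r
# Hypothetical helper: rough skew direction from the mean-median gap.
skew_direction <- function(x) {
  gap <- mean(x, na.rm = TRUE) - median(x, na.rm = TRUE)
  if (gap < 0) "left" else if (gap > 0) "right" else "symmetric"
}

# Toy sample (made up) with a long tail of low values
toy_temp <- c(270, 285, 295, 298, 300, 301, 302)
toy_skew <- skew_direction(toy_temp)
toy_skew
```

For the toy sample the mean (293) is below the median (298), so the helper returns "left".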

houston_weather_mdf$season <- factor(houston_weather_mdf$season, levels = c("Winter", "Spring", "Summer", "Autumn"))
# Box plot of temperature by season
ggplot(houston_weather_mdf, aes(x=season, y=temp, fill=season)) +
  geom_boxplot() +
  theme_bw() +
  labs(title="Temperature Distribution by Season", x="Season", y="Temperature(ºK)") +
  theme(legend.position="none")

The graph is a boxplot illustrating the distribution of temperatures across four seasons: Winter, Spring, Summer and Autumn. Each boxplot represents the interquartile range (IQR) of temperatures for a season, with the box’s bottom and top edges indicating the first (Q1) and third (Q3) quartiles, respectively. The line inside each box denotes the median temperature. The “whiskers” extend from the boxes to show the range of the data, while points below or above the whiskers indicate potential outliers.

Winter has the lowest median temperature, around 288ºK, with an IQR similar to Autumn, indicating consistent cooler temperatures. Spring shows a higher median temperature close to 295ºK, with a slightly wider IQR. Summer has the highest median temperature, above 300ºK, indicating it’s the warmest season, and also displays a broad IQR, suggesting significant variation in temperatures. Autumn has a median temperature just above 290ºK, with a relatively narrow IQR suggesting less variability in temperatures. There are a few outliers in the Winter and Spring seasons, indicating occasional temperatures that fall well below the typical range.

# Convert the two-digit month ("01" to "12") to an ordered factor of month names
houston_weather_mdf$month_name <- factor(month.name[as.integer(houston_weather_mdf$month)],
                                         levels = month.name)

# Create the boxplot
ggplot(houston_weather_mdf, aes(x=month_name, y=temp, fill=month_name)) +
  geom_boxplot() +
  theme_bw() +
  xlab("Month") + ylab("Temperature (°K)") +
  ggtitle("Monthly Temperature Distribution in Houston") +
  theme(axis.text.x = element_text(angle=25, hjust=1)) # Rotate x labels for better readability

The graph is a boxplot illustrating the distribution of temperatures across the months of the year. Each boxplot represents the interquartile range (IQR) of temperatures for a month, with the box’s bottom and top edges indicating the first (Q1) and third (Q3) quartiles, respectively. The line inside each box denotes the median temperature. The “whiskers” extend from the boxes to show the range of the data, while points below or above the whiskers indicate potential outliers.

August has the highest median temperature, around 304ºK. January has the lowest median temperature, around 285ºK. May has the narrowest spread in temperature, indicating the most stable temperatures. January and February have the widest spread, indicating more variability in temperature. There are a few outliers in January, March and October, indicating occasional temperatures that fall well below the typical range.

7.2 Apparent temperature

# Histogram for Feels Like Temperature with Mean
hist(houston_weather_mdf$feels_like, 30, main = "Histogram of Feels Like Temperature", xlab = "Feels Like Temperature", col = "lightgreen")
abline(v = mean(houston_weather_mdf$feels_like, na.rm = TRUE), col = "red", lwd = 2)
median_temp <- median(houston_weather_mdf$feels_like, na.rm = TRUE)
abline(v = median_temp, col = "darkgreen", lwd = 2)  # Median line

The histogram shows the distribution of apparent temperature in Houston over one year, using roughly 30 bins spanning 259.1ºK to 321.3ºK. The distribution is slightly left-skewed with its peak at around 299ºK, which suggests that apparent temperatures near this value occurred most frequently during the observed period.

7.3 Pressure

# Histogram for Pressure with Mean and Median
hist(houston_weather_mdf$pressure, 40 ,main = "Histogram of Pressure", xlab = "Pressure", col = "aquamarine") 
abline(v = mean(houston_weather_mdf$pressure, na.rm = TRUE), col = "red", lwd = 2) # Mean line
median_temp <- median(houston_weather_mdf$pressure, na.rm = TRUE)
abline(v = median_temp, col = "darkgreen", lwd = 2)  # Median line

The histogram shows the distribution of pressure in Houston over one year, using roughly 40 bins spanning 994 hPa to 1034 hPa. The distribution is slightly right-skewed with its peak at around 1013 hPa, which suggests that pressure near this value occurred most frequently during the observed period.

# Box plot of pressure by season
ggplot(houston_weather_mdf, aes(x=season, y=pressure, fill=season)) +
  geom_boxplot() +
  theme_bw() +
  labs(title="Pressure Distribution by Season", x="Season", y="Pressure (hPa)") +
  theme(axis.text.x = element_text(angle=0, hjust=1)) # Keep x labels horizontal (angle=0), right-aligned (hjust=1)

The graph is a boxplot illustrating the distribution of pressure across four seasons: Winter, Spring, Summer and Autumn. Each boxplot represents the interquartile range (IQR) of pressure for a season, with the box’s bottom and top edges indicating the first (Q1) and third (Q3) quartiles, respectively. The line inside each box denotes the median pressure. The “whiskers” extend from the boxes to show the range of the data, while points below or above the whiskers indicate potential outliers.

Winter has the highest median pressure, around 1018 hPa, and also the highest variability in pressure. Summer has the lowest median pressure, around 1012 hPa, and the narrowest IQR, suggesting the least variation in pressure. There are numerous outliers in Spring, indicating occasional pressure measurements that fall well below or above the typical range.

# Box plot of pressure by month
ggplot(houston_weather_mdf, aes(x=month_name, y=pressure, fill=month_name)) +
  geom_boxplot() +
  theme_bw() +
  xlab("Month") + ylab("Pressure (hPa)") +
  ggtitle("Monthly Pressure Distribution in Houston") +
  theme(axis.text.x = element_text(angle=25, hjust=1)) # Rotate x labels for better readability

The graph is a boxplot illustrating the distribution of pressure across each month of the year. Each boxplot represents the interquartile range (IQR) of pressure for a month, with the box’s bottom and top edges indicating the first (Q1) and third (Q3) quartiles, respectively. The line inside each box denotes the median pressure. The “whiskers” extend from the boxes to show the range of the data, while points below or above the whiskers indicate potential outliers.

December has the highest median pressure, around 1020 hPa. June has the lowest median pressure, around 1010 hPa. There are outliers in January, March, May, June, August, October and December, indicating occasional pressure measurements that fall well below or above the typical range.

7.4 Humidity

# Histogram for Humidity with Mean and Median
hist(houston_weather_mdf$humidity, 40 ,main = "Histogram of Humidity", xlab = "Humidity", col = "yellow")
abline(v = mean(houston_weather_mdf$humidity, na.rm = TRUE), col = "red", lwd = 2) # Mean line
median_temp <- median(houston_weather_mdf$humidity, na.rm = TRUE)
abline(v = median_temp, col = "darkgreen", lwd = 2)  # Median line

The histogram shows the distribution of humidity in Houston over one year, using roughly 40 bins spanning 20% to 100%. The distribution is left-skewed with its peak at around 85%, which suggests that humidity near this value occurred most frequently during the observed period.

# Box plot of humidity by season
ggplot(houston_weather_mdf, aes(x=season, y=humidity, fill=season)) +
  geom_boxplot() +
  theme_bw() +
  labs(title="Humidity Distribution by Season", x="Season", y="Humidity") +
  theme(axis.text.x = element_text(angle=0, hjust=1)) # Keep x labels horizontal (angle=0), right-aligned (hjust=1)

The graph is a boxplot illustrating the distribution of humidity across four seasons: Winter, Spring, Summer and Autumn. Each boxplot represents the interquartile range (IQR) of humidity for a season, with the box’s bottom and top edges indicating the first (Q1) and third (Q3) quartiles, respectively. The line inside each box denotes the median humidity. The “whiskers” extend from the boxes to show the range of the data, while points below or above the whiskers indicate potential outliers.

Summer has the lowest median humidity, around 73%. Spring has the highest median humidity, around 79%. Summer has the widest IQR, indicating the most variability in humidity, ranging from around 55% to 83%. There are notable outliers in Spring, indicating occasional humidity measurements that fall well below the typical range.

7.5 Minimum Temperature

# Histogram for Minimum Temperature with Mean and Median
hist(houston_weather_mdf$temp_min, 30 ,main = "Histogram of Minimum Temperature", xlab = "Minimum Temperature", col = "orange")
abline(v = mean(houston_weather_mdf$temp_min, na.rm = TRUE), col = "red", lwd = 2) # Mean line
median_temp <- median(houston_weather_mdf$temp_min, na.rm = TRUE)
abline(v = median_temp, col = "darkgreen", lwd = 2)  # Median line

The histogram shows the distribution of minimum temperature in Houston over one year, using roughly 30 bins spanning 264.5ºK to 313.3ºK. The distribution is slightly left-skewed with its peak at around 299ºK, which suggests that minimum temperatures near this value occurred most frequently during the observed period.

7.6 Maximum Temperature

# Histogram for Maximum Temperature with Mean and Median
hist(houston_weather_mdf$temp_max,40, main = "Histogram of Maximum Temperature", xlab = "Maximum Temperature", col = "purple")
abline(v = mean(houston_weather_mdf$temp_max, na.rm = TRUE), col = "red", lwd = 2) # Mean line
median_temp <- median(houston_weather_mdf$temp_max, na.rm = TRUE)
abline(v = median_temp, col = "darkgreen", lwd = 2)  # Median line

The histogram shows the distribution of maximum temperature in Houston over one year, using roughly 40 bins spanning 267.2ºK to 317.1ºK. The distribution is roughly bell-shaped with its peak at around 303ºK, which suggests that maximum temperatures near this value occurred most frequently during the observed period.

7.7 Wind Speed

# Histogram for Wind Speed with Mean and Median
hist(houston_weather_mdf$wind_speed, 40 ,main = "Histogram of Wind Speed", xlab = "Wind Speed", col = "brown")
abline(v = mean(houston_weather_mdf$wind_speed, na.rm = TRUE), col = "red", lwd = 2) # Mean line
median_temp <- median(houston_weather_mdf$wind_speed, na.rm = TRUE)
abline(v = median_temp, col = "darkgreen", lwd = 2)  # Median line

The histogram shows the distribution of wind speed in Houston over one year, using roughly 40 bins spanning 0 m/s to 34.47 m/s. The distribution is right-skewed with its peak at around 4.9 m/s, which suggests that wind speeds near this value occurred most frequently during the observed period.

# Box plot of wind speed by month
ggplot(houston_weather_mdf, aes(x=month_name, y=wind_speed, fill=month_name)) +
  geom_boxplot() +
  theme_bw() +
  xlab("Month") + ylab("Wind Speed") +
  ggtitle("Monthly Wind Speed Distribution in Houston") +
  theme(axis.text.x = element_text(angle=25, hjust=1)) # Rotate x labels for better readability

The graph is a boxplot illustrating the distribution of wind speed across each month of the year. Each boxplot represents the interquartile range (IQR) of wind speed for a month, with the box’s bottom and top edges indicating the first (Q1) and third (Q3) quartiles, respectively. The line inside each box denotes the median wind speed. The “whiskers” extend from the boxes to show the range of the data, while points below or above the whiskers indicate potential outliers.

March and April have the highest median wind speed, around 6 m/s. September has the lowest median wind speed, around 5 m/s. There are outliers in every month, indicating occasional wind speed measurements that fall well above the typical range. There is one particular outlier in June indicating a highly irregular wind speed of 33 m/s.

7.8 Wind Direction

# Histogram for Wind Direction (Degrees) with Mean and Median
hist(houston_weather_mdf$wind_deg, 30, main = "Histogram of Wind Direction", xlab = "Wind Direction (Degrees)", col = "pink")
abline(v = mean(houston_weather_mdf$wind_deg, na.rm = TRUE), col = "red", lwd = 2) # Mean line
median_temp <- median(houston_weather_mdf$wind_deg, na.rm = TRUE)
abline(v = median_temp, col = "darkgreen", lwd = 2)  # Median line

The histogram shows the distribution of wind direction in Houston over one year, using roughly 30 bins spanning 0º to 360º. The distribution is multimodal with its highest peak at 0º, which corresponds to a north wind and suggests that winds from around this direction occurred most frequently during the observed period.

# Box plot of wind_deg by season
ggplot(houston_weather_mdf, aes(x=season, y=wind_deg, fill=season)) +
  geom_boxplot() +
  theme_bw() +
  labs(title="Wind Direction Distribution by Season", x="Season", y="Wind Direction") +
  theme(axis.text.x = element_text(angle=0, hjust=1)) # Keep x labels horizontal (angle=0), right-aligned (hjust=1)

The graph is a boxplot illustrating the distribution of wind direction across each season of the year: Winter, Spring, Summer and Autumn. Each boxplot represents the interquartile range (IQR) of wind direction for a season, with the box’s bottom and top edges indicating the first (Q1) and third (Q3) quartiles, respectively. The line inside each box denotes the median wind direction. The “whiskers” extend from the boxes to show the range of the data, while points below or above the whiskers indicate potential outliers.

Winter has the widest IQR, indicating that winds come from the broadest range of directions, mainly between 110º (East-South East) and 300º (West-North West). Summer has the narrowest IQR, indicating that winds come from fewer directions, mainly between 110º (East-South East) and 210º (South-South West). This means that Summer winds come mainly from a southerly direction, although winds from all other directions also occur in the Summer.
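The compass labels quoted for these angles can be derived by dividing the circle into 16 sectors of 22.5º, each centred on its compass point. The helper below is a minimal sketch; `deg_to_compass` is a hypothetical name and is not part of the original analysis.

```r
# Hypothetical helper: map meteorological degrees to a 16-point compass label.
deg_to_compass <- function(deg) {
  compass_points <- c("N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
                      "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW")
  # Each sector spans 22.5 degrees and is centred on its compass point,
  # so the angle is shifted by half a sector (11.25) before dividing.
  index <- floor(((deg %% 360) + 11.25) / 22.5) %% 16 + 1
  compass_points[index]
}

wind_labels <- deg_to_compass(c(0, 110, 210, 300, 360))
wind_labels
```

With this convention, 110º maps to ESE, 210º to SSW and 300º to WNW, while both 0º and 360º map to N.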

7.9 Wind Gust

# Histogram for Wind Gust with Mean and Median
hist(houston_weather_mdf$wind_gust,30, main = "Histogram of Wind Gust", xlab = "Wind Gust", col = "grey")
abline(v = mean(houston_weather_mdf$wind_gust), col = "red", lwd = 2) # Mean line
median_temp <- median(houston_weather_mdf$wind_gust)
abline(v = median_temp, col = "darkgreen", lwd = 2)  # Median line

The histogram shows the distribution of wind gust in Houston over one year, using roughly 30 bins spanning 0 to 24.18 m/s. The distribution is multimodal with its peak at 0 m/s. This peak is largely a consequence of replacing the 5582 missing values with zeros in section 6.4, so the most frequent value, 0, is interpreted as the absence of wind gust over the observed period.

7.10 Cloudiness

# Histogram for Cloudiness with Mean and Median
hist(houston_weather_mdf$clouds_all, 15, main = "Histogram of Cloudiness", xlab = "Cloudiness", col = "cyan")
abline(v = mean(houston_weather_mdf$clouds_all, na.rm = TRUE), col = "red", lwd = 2)
median_temp <- median(houston_weather_mdf$clouds_all, na.rm = TRUE)
abline(v = median_temp, col = "darkgreen", lwd = 2)  # Median line

The histogram shows the distribution of cloudiness in Houston over one year, using roughly 15 bins spanning 0% to 100%. The distribution is multimodal with peaks at 0% and 100%, indicating that the most frequent conditions are a completely clear sky and a completely overcast sky.

# Box plot of clouds_all by season
ggplot(houston_weather_mdf, aes(x=season, y=clouds_all, fill=season)) +
  geom_boxplot() +
  theme_bw() +
  labs(title="Clouds by Season", x="Season", y="Clouds covering sky percentage") +
  theme(axis.text.x = element_text(angle=0, hjust=1)) # Keep x labels horizontal (angle=0), right-aligned (hjust=1)

The graph is a boxplot illustrating the distribution of cloudiness across each season of the year: Winter, Spring, Summer and Autumn. Each boxplot represents the interquartile range (IQR) of cloudiness for a season, with the box’s bottom and top edges indicating the first (Q1) and third (Q3) quartiles, respectively. The line inside each box denotes the median cloudiness. The “whiskers” extend from the boxes to show the range of the data, while points below or above the whiskers indicate potential outliers.

Winter and Spring have the highest median cloudiness, at 75%. Summer has the lowest median cloudiness, below 25%, which means that Summer has clearer skies than Winter and Spring.

7.11 Correlations

Now we will use the ggpairs function from GGally package to obtain multiple graphs and correlations.

selected_variables <- houston_weather_mdf[c("season","temp", "feels_like", "pressure", "humidity","temp_min","temp_max","wind_speed","wind_deg","clouds_all")] # Create a subset of selected variables.
ggpairs(selected_variables,
        mapping = ggplot2::aes(alpha = 0.5), # Add transparency
        lower = list(continuous = wrap("points", size = 0.5, position = "jitter")), # Jittering and smaller points
        diag = list(continuous = wrap("densityDiag")) # Use density plots on the diagonal
)

Correlations between temperature, apparent temperature, and maximum and minimum temperatures are the strongest, all above 0.98, since they are all different ways of measuring the same phenomenon. Other than that, there is a relevant negative correlation of -0.568 between temperature and pressure. This means that, ceteris paribus, higher temperatures tend to coincide with lower pressure. A correlation alone does not establish that one causes the other.
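The coefficients shown in the ggpairs panels can also be obtained as a plain matrix with the base R cor() function, which may be easier to inspect or export. This is a minimal sketch on made-up values, assuming Pearson correlation (the cor() default); `toy_weather` and `cor_matrix` are hypothetical names.

```r
# Toy data (made up): pressure falls as temperature rises.
toy_weather <- data.frame(
  temp     = c(285, 290, 295, 300, 305),
  pressure = c(1022, 1019, 1015, 1013, 1009)
)

# Pearson correlation matrix; "complete.obs" drops rows containing NA
cor_matrix <- cor(toy_weather, use = "complete.obs")
round(cor_matrix, 3)
```

The off-diagonal entry is negative for the toy data, mirroring the direction of the temperature and pressure relationship described above.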

8 Conclusion

Our project extracted, flattened, cleaned, and analyzed a year’s worth of weather data from Houston using the Open Weather Map API. The seasonal analysis helped characterize Houston’s climate, with the highest temperatures observed in the Summer months and the lowest in Winter. Wind speed and cloudiness showed considerable variation throughout the year, with the Summer months tending to be less cloudy. We also observed a strong correlation among the recorded temperatures (temperature, apparent temperature, and maximum/minimum temperatures) and a negative correlation between temperature and atmospheric pressure.

The study encountered challenges such as handling missing data and identifying outliers. The replacement of missing values for wind gusts and the treatment of outliers allowed for an accurate representation of Houston’s weather conditions.

The visualizations created as part of the analysis provided an understanding of the data and supported the numerical findings. The histograms and boxplots were particularly useful in identifying the distribution and range of the weather variables, thus offering valuable insights for meteorological assessments or urban planning.

9 References

OpenAI. (2024). Cover Illustration of Hyperrealistic Weather in Texas [Image]. Generated by ChatGPT.

OpenWeather Ltd. (2024). OpenWeatherMap. Retrieved from https://openweathermap.org/

Weather data extraction, flattening, cleaning and analysis: Houston Case Study

Diego Ernesto Díaz Iturbe

2024-02-12