This R Markdown will be using the dataset provided by Daniel Hanasab from the discussion board. This dataset covers three major cities (New York, Los Angeles, and Chicago) and their corresponding temperatures/humidities through the first three months of the year (January, February, and March). The following code is used to load and observe the dataset:
dan_cities <- read_csv(
"https://raw.githubusercontent.com/GullitNa/DATA607-Project2/main/DanielHanasabCities.csv",
col_types = cols(
Temp_Jan = col_character(),
Temp_Feb = col_character(),
Temp_Mar = col_character(),
Humid_Jan = col_character(),
Humid_Feb = col_character(),
Humid_Mar = col_character()
)
)
dan_cities
## # A tibble: 3 × 7
## City Temp_Jan Temp_Feb Temp_Mar Humid_Jan Humid_Feb Humid_Mar
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 New York "32\xb0F" "35\xb0F" "42\xb0F" 75% 72% 68%
## 2 Los Angeles "58\xb0F" "60\xb0F" "65\xb0F" 65% 63% 60%
## 3 Chicago "28\xb0F" "30\xb0F" "40\xb0F" 80% 78% 75%
This dataset may seem simple to read on the surface, but requires additional coding to be properly read in the first place. col_types are used to encode text that contain special characters such as the degrees symbol ° by forcing R to read them as character strings, and even then, the rest of the .csv doesn’t show only the 3 cities of New York and the rest of the untidy columns such as different temperatures (unreadable text next to the numbers) and humidity columns for the first 3 months of the year. The foremost issue at the moment is Rstudio’s inability to read the characters despite the dataset being read into a dataframe. I aim to fix this using code to specifically make sure the numbers/integers are displayed rather than just read by extracting those numbers/integers away from the special characters/symbols.
dan_cities_tid <- dan_cities %>%
mutate(
across(starts_with("Temp"), ~ parse_number(.)),
across(starts_with("Humid"), ~ parse_number(.))
)
The goal of this cleanup is to tidy the data for analysis, so that involves turning this into a long format with combined columns to reduce redundancy and then go back to its wide structure with “Measure” separated into the 2 column types that use measurements to increase clarity/visibility of all data in comparison to its original format.
dan_cities_tidy <- dan_cities_tid %>%
pivot_longer(
cols = -City,
names_to = c("Measure", "Month"),
names_sep = "_",
values_to = "Value"
) %>%
pivot_wider(
names_from = Measure,
values_from = Value
) %>%
mutate(
Month = factor(Month,
levels = c("Jan", "Feb", "Mar"),
labels = c("January", "February", "March"))
) %>%
arrange(City, Month)
dan_cities_tidy
## # A tibble: 9 × 4
## City Month Temp Humid
## <chr> <fct> <dbl> <dbl>
## 1 Chicago January 28 80
## 2 Chicago February 30 78
## 3 Chicago March 40 75
## 4 Los Angeles January 58 65
## 5 Los Angeles February 60 63
## 6 Los Angeles March 65 60
## 7 New York January 32 75
## 8 New York February 35 72
## 9 New York March 42 68
Shown is the summary for the average temperature and average humidity in each cities in the first 3 months of the year. Where Chicago tends to be the coldest with 32.6°F and by correlation, has the highest average humidity of 77.6%. Whereas Los Angeles is the warmest with 61°F with the lowest humidity compared to the other 2 cities of 62.6%.
city_summary <- dan_cities_tidy %>%
group_by(City) %>%
summarize(
Avg_Temperature = mean(Temp),
Avg_Humidity = mean(Humid)
)
city_summary
## # A tibble: 3 × 3
## City Avg_Temperature Avg_Humidity
## <chr> <dbl> <dbl>
## 1 Chicago 32.7 77.7
## 2 Los Angeles 61 62.7
## 3 New York 36.3 71.7
Another way to analyze this data would be through visualization; and also by comparing climate differences between New York, Los Angeles, and Chicago. First, I’ll be plotting the “Temp” or temperature of the data corresponding to each city:
temperature_plot <- dan_cities_tidy %>%
ggplot(aes(x = Month, y = Temp, color = City, group = City)) +
geom_line() +
geom_point() +
labs(
title = "Temperature Trends by City",
x = "Month",
y = "Temperature (F)"
)
temperature_plot
Now here is the “Humid” or humidity of the corresponding cities for comparison, and additionally I’ll include the summary statistics to calculate the average temperature and humidity for each city in the following code block:
humidity_plot <- dan_cities_tidy %>%
ggplot(aes(x = Month, y = Humid, color = City, group = City)) +
geom_line() +
geom_point() +
labs(
title = "Humidity Trends by City",
x = "Month",
y = "Humidity (%)"
)
humidity_plot
As This correlation between the temperature and humidity is visible also within the line graphs from early which I combine using the “patchwork” package. Additionally, it tells us that as the months go by from January to March, the temperatures increase and the humidity decreases.
city_summary
## # A tibble: 3 × 3
## City Avg_Temperature Avg_Humidity
## <chr> <dbl> <dbl>
## 1 Chicago 32.7 77.7
## 2 Los Angeles 61 62.7
## 3 New York 36.3 71.7
temperature_plot + humidity_plot
In my example of analysis and transformation, I started out with an unreadable dataset (mainly due to the °F symbol) about a wide untidy dataset of 3 cities and their temperatures and humidities and read that data from the csv as character strings, later parsing out the numeric portion of these strings by removing the symbols outright. Addition cleanup measures including transformation which would include using pivot commands to make sure each row represents a single city and corresponding month observations for temperature and humidity in a long format. These factors allowed me to observe and compare 2 graphs on the manner as well as generate the summary for a more detailed view on the dataset.