This R Markdown document uses the dataset that Daniel Hanasab shared on the discussion board. The following code loads the dataset and displays it:
library(tidyverse)  # loads readr, dplyr, tidyr, and ggplot2, all used below
dan_cities <- read_csv(
  "https://raw.githubusercontent.com/GullitNa/DATA607-Project2/main/DanielHanasabCities.csv",
  col_types = cols(  # read every measurement as text because of the unit symbols
    Temp_Jan = col_character(),
    Temp_Feb = col_character(),
    Temp_Mar = col_character(),
    Humid_Jan = col_character(),
    Humid_Feb = col_character(),
    Humid_Mar = col_character()
  )
)
dan_cities
## # A tibble: 3 × 7
## City Temp_Jan Temp_Feb Temp_Mar Humid_Jan Humid_Feb Humid_Mar
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 New York "32\xb0F" "35\xb0F" "42\xb0F" 75% 72% 68%
## 2 Los Angeles "58\xb0F" "60\xb0F" "65\xb0F" 65% 63% 60%
## 3 Chicago "28\xb0F" "30\xb0F" "40\xb0F" 80% 78% 75%
This dataset looks simple on the surface, but it needs extra handling before it can be read properly. The col_types argument forces R to read every temperature and humidity column as a character string, because each value carries a unit symbol (the degree sign ° or the percent sign %) that keeps it from being read directly as a number. Even then, the temperatures print as "32\xb0F" rather than "32°F": the degree sign appears to be stored as the single byte 0xB0, which is not valid UTF-8, so R shows the raw byte as an escape sequence instead of the symbol. The table is also untidy: New York, Los Angeles, and Chicago each occupy one row, with separate temperature and humidity columns for each of the first 3 months of the year. The most pressing issue is the unreadable characters, so I start by extracting the numeric part of each value and discarding the symbols.
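An alternative to working around the escaped byte is to declare the file's encoding when reading it. The sketch below assumes the file is Latin-1 encoded, which the \xb0 byte suggests but does not confirm. The temporary file and the name cities_latin1 are made up for illustration and stand in for the real CSV.

```r
library(readr)

# Hypothetical stand-in for the real file: a one-row CSV written as raw
# Latin-1 bytes, where the degree sign is the single byte 0xB0.
tmp <- tempfile(fileext = ".csv")
writeBin(
  c(charToRaw("City,Temp_Jan\nNew York,32"), as.raw(0xB0), charToRaw("F\n")),
  tmp
)

# Declaring the encoding lets readr convert the byte to a proper UTF-8 degree sign.
cities_latin1 <- read_csv(
  tmp,
  locale = locale(encoding = "latin1"),
  col_types = cols(.default = col_character())
)

cities_latin1$Temp_Jan
```

With the encoding declared, the column should display with the degree symbol instead of the escape sequence. The values still need to be parsed into numbers afterwards.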
dan_cities_tidy <- dan_cities %>%
  mutate(
    across(starts_with("Temp"), ~ parse_number(.)),
    across(starts_with("Humid"), ~ parse_number(.))
  )
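To see what parse_number() does in isolation, here is a minimal sketch on a few strings written in the same style as the dataset. The vectors temps and humids are illustrative values typed by hand, not read from the file.

```r
library(readr)

# Illustrative values in the same style as the dataset (not read from the file)
temps  <- c("32\u00b0F", "58\u00b0F")
humids <- c("75%", "65%")

# parse_number() keeps the first number it finds and drops the surrounding text
temp_values  <- parse_number(temps)
humid_values <- parse_number(humids)

temp_values
humid_values
```

Both results are plain numeric vectors, which is why the cleaned columns later show up as <dbl> in the tibble.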
The goal of this cleanup is to tidy the data for analysis. That means first pivoting to a long format, where each column name is split into a “Measure” (Temp or Humid) and a “Month”, and then pivoting “Measure” back out into two columns. The result has one row per city and month, with temperature and humidity side by side, which is easier to read and analyze than the original layout.
dan_cities_tidy <- dan_cities %>%
  mutate(
    across(starts_with("Temp"), ~ parse_number(.)),
    across(starts_with("Humid"), ~ parse_number(.))
  ) %>%
  pivot_longer(
    cols = -City,
    names_to = c("Measure", "Month"),
    names_sep = "_",
    values_to = "Value"
  ) %>%
  pivot_wider(
    names_from = Measure,
    values_from = Value
  ) %>%
  mutate(
    Month = factor(Month,
                   levels = c("Jan", "Feb", "Mar"),
                   labels = c("January", "February", "March"))
  ) %>%
  arrange(City, Month)
dan_cities_tidy
## # A tibble: 9 × 4
## City Month Temp Humid
## <chr> <fct> <dbl> <dbl>
## 1 Chicago January 28 80
## 2 Chicago February 30 78
## 3 Chicago March 40 75
## 4 Los Angeles January 58 65
## 5 Los Angeles February 60 63
## 6 Los Angeles March 65 60
## 7 New York January 32 75
## 8 New York February 35 72
## 9 New York March 42 68
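The long-then-wide reshape can be easier to follow on a toy table. The sketch below uses a made-up two-city tibble named toy to show the intermediate long form. It also shows a one-step alternative that uses tidyr's ".value" sentinel in names_to, which is a different route to the same shape and not what the code above does.

```r
library(tibble)
library(tidyr)

# Made-up miniature of the wide layout: one row per city, one column per measure and month
toy <- tibble(
  City      = c("A", "B"),
  Temp_Jan  = c(30, 60),
  Humid_Jan = c(80, 65)
)

# Step 1: long format, with the column name split into Measure and Month
toy_long <- pivot_longer(
  toy,
  cols = -City,
  names_to = c("Measure", "Month"),
  names_sep = "_",
  values_to = "Value"
)

# Step 2: spread Measure back out so Temp and Humid become columns
toy_wide <- pivot_wider(toy_long, names_from = Measure, values_from = Value)

# Alternative: ".value" tells pivot_longer() to turn the first name piece into columns directly
toy_onestep <- pivot_longer(
  toy,
  cols = -City,
  names_to = c(".value", "Month"),
  names_sep = "_"
)

toy_long
toy_wide
toy_onestep
```

The two-step version makes the intermediate long table visible, which helps when checking the split of the column names. The one-step version is shorter once the pattern is familiar.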
I plan to analyze this data by visualizing it and comparing the climates of New York, Los Angeles, and Chicago. First, I plot the temperature (“Temp”) of each city by month:
temperature_plot <- dan_cities_tidy %>%
  ggplot(aes(x = Month, y = Temp, color = City, group = City)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Temperature Trends by City",
    x = "Month",
    y = "Temperature (°F)"
  )
temperature_plot
Next is the humidity (“Humid”) of the same cities for comparison. The summary statistics for the average temperature and humidity of each city come after the plot, in the Summary section:
humidity_plot <- dan_cities_tidy %>%
  ggplot(aes(x = Month, y = Humid, color = City, group = City)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Humidity Trends by City",
    x = "Month",
    y = "Humidity (%)"
  )
humidity_plot
## Summary

Shown below is the average temperature and average humidity of each city over the first 3 months of the year. Chicago is the coldest at 32.7°F and also has the highest average humidity at 77.7%, whereas Los Angeles is the warmest at 61°F and has the lowest humidity of the three cities at 62.7%. This inverse relationship between temperature and humidity is also visible in the line graphs from earlier, which I combine using the “patchwork” package. The graphs also show that from January to March, temperatures rise and humidity falls in every city.
city_summary <- dan_cities_tidy %>%
  group_by(City) %>%
  summarize(
    Avg_Temperature = mean(Temp),
    Avg_Humidity = mean(Humid)
  )
city_summary
## # A tibble: 3 × 3
## City Avg_Temperature Avg_Humidity
## <chr> <dbl> <dbl>
## 1 Chicago 32.7 77.7
## 2 Los Angeles 61 62.7
## 3 New York 36.3 71.7
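The relationship between temperature and humidity can also be checked numerically rather than read off the plots. The sketch below retypes the nine rows of the tidy table into a base R data frame (the name climate is made up here) and computes the Pearson correlation. With only nine observations from three cities, this describes the sample rather than any general climate pattern.

```r
# Values copied by hand from the tidy table printed earlier
climate <- data.frame(
  City  = rep(c("Chicago", "Los Angeles", "New York"), each = 3),
  Temp  = c(28, 30, 40, 58, 60, 65, 32, 35, 42),
  Humid = c(80, 78, 75, 65, 63, 60, 75, 72, 68)
)

# Pearson correlation between temperature and humidity across all nine rows
temp_humid_cor <- cor(climate$Temp, climate$Humid)
temp_humid_cor
```

A value close to -1 would agree with the pattern in the summary table, where the colder cities are the more humid ones.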
library(patchwork)  # lets ggplot objects be combined with the + operator
temperature_plot + humidity_plot
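Because both plots use the same City legend, the combined figure repeats it twice. One possible refinement, sketched here on a made-up subset of the data, is patchwork's plot_layout(guides = "collect"), which merges identical legends into one. The names demo, p_temp, p_humid, and combined are illustrative only.

```r
library(ggplot2)
library(patchwork)

# Small subset retyped from the tidy table, just enough to draw two lines per plot
demo <- data.frame(
  City  = rep(c("Chicago", "Los Angeles"), each = 2),
  Month = factor(rep(c("January", "February"), times = 2),
                 levels = c("January", "February")),
  Temp  = c(28, 30, 58, 60),
  Humid = c(80, 78, 65, 63)
)

p_temp <- ggplot(demo, aes(x = Month, y = Temp, color = City, group = City)) +
  geom_line() +
  geom_point()

p_humid <- ggplot(demo, aes(x = Month, y = Humid, color = City, group = City)) +
  geom_line() +
  geom_point()

# Place the plots side by side and share a single legend between them
combined <- (p_temp + p_humid) + plot_layout(guides = "collect")
combined
```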
In this example of analysis and transformation, I started with a wide, untidy dataset of 3 cities and their temperatures and humidities that was hard to read, mainly because of the °F symbol. I read the data from the CSV as character strings and then parsed out the numeric portion of each string, dropping the symbols. As additional cleanup, I used the pivot functions to reshape the data so that each row represents a single city and month, with temperature and humidity in their own columns. With the data in that shape, I could compare the two graphs and generate the summary table for a more detailed view of the dataset.