Introduction

Data visualization is an important part of analysis because it lets users shape information into a form that supports understanding and decision making. Large datasets can be daunting, and users can struggle to tell which data is relevant to what they are trying to achieve. Creating visualizations is like solving a puzzle, except that users must make their own puzzle pieces before fitting them together. Various tables and charts must be carefully dissected and assembled before the data is ready to be visualized. Data does not become useful information until users can understand and conceptualize it; otherwise it is just numbers and statistics without context.

Dataset

This dataset contains information about 344 metropolitan areas in the United States, with 18 columns and 344 rows. Some columns describe the physical attributes of each city, such as its geographical location, size, population, and air quality. Other attributes relate more to the socioeconomic climate of these areas, including the average income of individuals in the city, average rent, cost of living, unemployment, price parity, walking score, and transit score.

This dataset can be very useful for people who want to learn more about the different cities in the country. Perhaps a recent college graduate is looking to move out and find somewhere new to live. A task like that would normally require immense energy and research; with a dataset and visualizations, however, the valuable information that influences the decision is much more accessible and understandable.

# Library #

library(dplyr)
library(ggplot2)
library(ggthemes)
library(RColorBrewer)
library(scales)
# plyr is attached after dplyr, so plyr::count() and plyr::summarise() mask the dplyr versions
library(plyr)
library(data.table)


# https://www.kaggle.com/datasets/denissad/us-cities?select=us_cities.csv #

setwd("//Users//jamesdileo//Desktop//DS736//R_datafiles")

df <- fread("us_cities.csv")

Findings

States with the Most Cities

# Visualization 1: Bar Graph #

# Count cities per state; plyr::count() returns columns `x` (state) and `freq`
statecount <- data.frame(plyr::count(df$State))

# Sort states from most to fewest cities
statecount <- statecount[order(statecount$freq, decreasing = TRUE), ]

statecount$n <- as.numeric(statecount$freq)

# Plot the ten states with the most cities, bars ordered by count
ggplot(head(statecount, 10), aes(x = reorder(x, -n), y = n)) +
  geom_bar(colour = "black", fill = "darkgreen", stat = "identity") +
  labs(title = "States with Most Cities", x = "State", y = "Number of Cities") +
  theme(plot.title = element_text(hjust = 0.5))

For the first visualization, I thought it would be useful to find out which states have the greatest number of cities. This could be relevant because somebody my age may not want to live in a state with fewer cities and therefore less activity. To construct the visualization, I first had to create a variable that listed every state in the dataset along with the number of cities in that state. From that, I kept only the ten states with the most cities. The two states with the most cities were California and Texas, each with twenty-five, followed by Florida with twenty. This visualization is relevant because a recent college graduate might conclude that California is a place with a lot of activity and may be interested in doing additional research about the state and its cities.
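The counting step described above can be sketched in base R. This is only an illustration under stated assumptions: the `toy` data frame is a hypothetical stand-in for us_cities.csv, and `table()` is swapped in for `plyr::count()` so the snippet runs without any packages.

```r
# Hypothetical stand-in for us_cities.csv; these rows are illustrative, not real data.
toy <- data.frame(
  City  = c("A", "B", "C", "D", "E", "F"),
  State = c("California", "California", "California", "Texas", "Texas", "Florida"),
  stringsAsFactors = FALSE
)

# table() tallies rows per state; sort() orders the tallies from most to fewest.
state_counts <- sort(table(toy$State), decreasing = TRUE)

# Convert to a data frame using the same column names plyr::count() produces.
top_states <- data.frame(
  x    = names(state_counts),
  freq = as.integer(state_counts),
  stringsAsFactors = FALSE
)

# Keep only the top two states here (the report keeps the top ten).
top_states <- head(top_states, 2)
print(top_states)
```

Because the result has the columns `x` and `freq`, it could be passed to the same ggplot call used for the bar graph.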

City Size Distribution in California

# Visualization 2: Pie Chart #

testcombo <- data.frame(df$City, df$State, df$Size)

# Filter on the data frame's own column (df.State) rather than the outer df
california_cities <- subset(testcombo, df.State == "California")

# plyr::count() returns columns `x` (size class) and `freq`
count_sizes <- plyr::count(california_cities$df.Size)

ggplot(data = count_sizes, aes(x = "", y = freq, fill = x)) +
  geom_bar(stat = "identity", position = "fill") +
  coord_polar(theta = "y", start = 0) +
  labs(fill = "City Size", x = NULL, y = NULL, title = "City Size Distribution in California") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank()) +
  # Label each slice with its city count, centered within the slice
  geom_text(aes(x = 1, label = freq),
            size = 4,
            position = position_fill(vjust = 0.5))

For the second visualization, I felt that it was important to show how each of the cities in California was classified in terms of size. The dataset classifies each city as large, mid-sized, or small. When constructing the pie chart, I had to filter the data so that it only included cities in California. The pie chart reveals that the largest share of cities in California were classified as mid-sized, with a total of twelve mid-sized cities, followed by seven large cities and six small cities. This is useful information because someone can better understand the proportion of city sizes in California.
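To make the proportions concrete, the counts reported above can be converted into percentage shares. The sketch below uses only the three counts from the text; the size labels used as names are assumptions and may not match the exact spelling in the dataset.

```r
# Counts reported in the text for California (label spelling is an assumption).
size_counts <- c(Large = 7, `Mid-sized` = 12, Small = 6)

# prop.table() divides each count by the total, giving each slice's share.
size_share <- prop.table(size_counts)

# Format the shares as whole-number percentages, suitable for slice labels.
size_labels <- paste0(names(size_counts), ": ", round(100 * size_share), "%")
print(size_labels)
# "Large: 28%" "Mid-sized: 48%" "Small: 24%"
```

Mid-sized cities make up just under half of the twenty-five California cities, so they are the largest group without being a strict majority.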

Walkability of Large California Cities

# Visualization 3: Scatter Plot #

# Carry State and Size along so the filter uses this data frame's own columns
city_walk <- data.frame(df$City, df$State, df$Size, df$WalkScore, df$Population)

cali_walk <- subset(city_walk, df.State == "California" & df.Size == "Large")

ggplot(cali_walk, aes(x=df.Population, y=df.WalkScore, group=df.City)) +
  geom_point(shape = 20, aes(color=df.City), size=4) + 
  labs(title = "Walkability in Large California Cities", 
       x = "Population",
       y = "Walking Score") +
  theme_linedraw() + 
  theme(plot.title=element_text(hjust=0.5)) +
  scale_color_brewer(palette = "Paired", name = "City") +
  scale_x_continuous(labels = comma,
                     breaks = seq(1000000, 14000000, by = 2000000),
                     limits = c(1000000,14000000))

For the third visualization, I wanted to expand on the city sizes from the pie chart so that a user could interpret California’s cities in a different way. Someone may not know exactly what a large, mid-sized, or small city entails physically. The classification shows the size, but it doesn’t offer much else: it doesn’t say how many people live in the city, and it doesn’t tell how easy it is to travel from point A to point B. For a recent graduate who may not have a car, it would be important to know how easy it is to walk throughout the city, for example from home to work. The third visualization is a scatter plot in which the large California cities are placed on the x axis by population, while the y axis shows the walking score each city was given.

The plot showed interesting statistics about the large cities in California, especially Los Angeles and San Francisco. What stood out to me is that both are outliers in the large city group. Los Angeles has a substantially larger population than the rest of the cities, with over thirteen million people. The next most populated large city in California is San Francisco, with just under five million people, less than half the population of LA. Despite this, the plot shows that San Francisco has a much higher walkability score, at about 87. The closer a score is to 100, the less dependent the city is on cars. LA has a walking score of 67, meaning that a car isn’t totally necessary but would be useful. Other cities like Sacramento and San Diego are smaller in population than both LA and San Francisco, and they are more dependent on cars. Someone may find it attractive that San Francisco has a high walkability score and is not as populated as Los Angeles. On the other hand, the high population of Los Angeles may be exciting because there may be a lot of activity and work there.
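One way to make the walking scores easier to read is to bin them into named bands. The sketch below is an assumption-heavy illustration: the cut points follow the bands commonly published for Walk Score, which the dataset itself does not document, and the two scores are the approximate values quoted above.

```r
# Approximate scores quoted in the text above.
walk <- data.frame(
  City      = c("Los Angeles", "San Francisco"),
  WalkScore = c(67, 87)
)

# Assumed band boundaries; right = FALSE makes each band include its lower
# bound, and include.lowest = TRUE lets the top band include a score of 100.
walk$Band <- cut(
  walk$WalkScore,
  breaks = c(0, 25, 50, 70, 90, 100),
  labels = c("Almost all errands need a car",
             "Most errands need a car",
             "Somewhat walkable",
             "Very walkable",
             "Walker's paradise"),
  right = FALSE,
  include.lowest = TRUE
)
print(walk)
```

Under these assumed bands, Los Angeles falls in the "Somewhat walkable" band and San Francisco in "Very walkable", which matches the reading of the scatter plot.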

Air Quality around the Country

# Visualization 5: Heatmap #

airquality <- data.frame(df$City, df$State, df$Region, df$Size, df$MedianAQI)

# Rank cities from highest (worst) to lowest median AQI
ranked_aqi <- airquality[order(airquality$df.MedianAQI, decreasing = TRUE), ]

# Drop cities with no AQI reading, testing the ranked frame's own column
# so the NA check lines up with the reordered rows
filtered_aqi <- ranked_aqi %>% filter(!is.na(df.MedianAQI))

# Overall mean AQI across cities with a reading
mean_aqi <- mean(filtered_aqi$df.MedianAQI)

testaqi <- data.frame(df$Region, df$Size, df$MedianAQI)

testaqiNA <- na.omit(testaqi)

grouped_aqi <- testaqiNA %>%
  group_by(df.Region, df.Size) %>%
  dplyr::summarise(df.MedianAQI = mean(df.MedianAQI, na.rm = TRUE), .groups='keep') %>%
  data.frame()

ggplot(grouped_aqi, aes(x= df.Size, y =df.Region, fill=df.MedianAQI)) +
  geom_tile(color="black") +
  geom_text(aes(label = comma(df.MedianAQI, accuracy = 0.01))) +
  coord_equal(ratio=1) +
  labs(title="Air Quality Across the Country",
       x = "City Size",
       y = "Region in the Country",
       fill = "Average City Air Quality") +
  theme_minimal() + 
  theme (plot.title=element_text(hjust=0.5)) +
  scale_fill_continuous(low="white", high="orange")

The last visualization again looks at the country overall to assess a factor that not many people take into consideration: air quality. We think about things like income, rent, population, and size because these are all tangible things that you can see. You can look at a city and see how large it is or how many people live there. You can look at someone’s apartment and infer how much their rent is or what their income is. Air quality, on the other hand, is something you can’t see, so not many people think about it. A heatmap is a good way to compare air quality in cities across the country. For the heatmap, I found the average AQI for each combination of region and city size. The higher the AQI, the worse the air quality.

According to the heatmap, large cities in the West have the worst air quality, with an average AQI of 54.00. This may affect the decision of someone who was looking at Los Angeles or San Francisco. California accounted for six of the ten cities with the highest AQI, including Riverside, CA, with an AQI of 84. This is extremely alarming and should signal that living in Riverside may not be the best idea. Los Angeles ranks third with an AQI of 70, which is substantially lower than Riverside but still high. One interesting thing to note is that although the West has the highest average AQI in the country among large cities, its small cities have the lowest, with an average of 35.25. The Midwest, Northeast, and South all have similar AQI values across their different city sizes, ranging in total from 38 to 45. Although air quality is not something people think about often when they are searching for a new city, it is interesting to see which parts of the country fare better than others.
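The two calculations behind this section, ranking cities by AQI and averaging AQI per region and city size, can be sketched in base R. This is a minimal illustration, not the code used for the heatmap: the `toy` rows are hypothetical values, and `aggregate()` is swapped in for the dplyr `group_by()` and `summarise()` pipeline.

```r
# Hypothetical rows standing in for us_cities.csv; values are illustrative only.
toy <- data.frame(
  Region    = c("West", "West", "West", "South", "South", "South"),
  Size      = c("Large", "Large", "Small", "Large", "Small", "Small"),
  MedianAQI = c(80, 60, 30, 44, NA, 40)
)

# Drop rows with no AQI reading before ranking or averaging.
toy_complete <- toy[!is.na(toy$MedianAQI), ]

# Rank cities from highest (worst) to lowest AQI.
ranked <- toy_complete[order(toy_complete$MedianAQI, decreasing = TRUE), ]

# Average AQI for each combination of region and city size.
grouped <- aggregate(MedianAQI ~ Region + Size, data = toy_complete, FUN = mean)
print(grouped)
```

Filtering the missing values on the same data frame that is being ranked keeps the NA check aligned with the rows, which matters once the rows have been reordered.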

Conclusion

These visualizations shed light on various aspects that play a crucial role in decision making, such as the number of cities in a state, the sizes of cities, population, walkability, income, rent, and air quality. We often hear things about these places, for example that New York City is massive and heavily populated, yet without anything to compare them to, these numbers and ideas remain abstract. We also often hear about economic activity and how large cities on the East Coast boast strong economies, so it is useful to see the actual numbers involved, specifically when it comes to average income, unemployment, or rent.

These visualizations allow individuals to make more informed choices as they make important decisions about where they may want to start their life in the country. Without the ability to manipulate data in a way that communicates a message, useful information remains out of reach to a user.