Ethical Web Scraping Assignment: Wiki Cities

Web Scraping: Wikipedia’s List of Largest Cities

(Using an HTML website table)

Part 1: My interest

I have always found city data interesting because I think it's fascinating how so many people can live and work together in one area. Each city has its own distribution of people, with population density varying across the city proper (downtown area), the urban area, and the greater metropolitan area. This Wikipedia table lists some of the largest cities in the world. Each row is a city, with its country, definition, total population, and other fields giving the size and population of each specific area.

Part 2: My inquiry

For this assignment, I would like to use the largest cities data frame that I scraped from Wikipedia to find out which country has the most highly populated cities, and which area of the city those populations tend to concentrate in (city proper, urban area, or metropolitan area).

Part 3: The process

First, I scraped data from https://en.wikipedia.org/wiki/List_of_largest_cities using the structured table method. I was able to successfully grab the 13 columns. Next, I removed the first row of data, since it only repeated the column names. Because the scraped column names were much longer strings (a Wiki issue), I renamed them using the dplyr rename() function. I then removed the commas from the number columns and converted those columns from chr types to numeric types.
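The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the assignment's actual script: the `raw` data frame is a toy stand-in for the scraped table, its long header names and values are placeholders, and the short names `city` and `population` are hypothetical.

```r
library(dplyr)

# Toy stand-in for the scraped table (placeholder names and values, not real figures).
# As in the scraped data, the first row just repeats the column labels.
raw <- data.frame(
  `City (long scraped header)` = c("City", "City A", "City B"),
  `Population (long scraped header)` = c("Population", "1,234,567", "890,123"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

cities_clean <- raw %>%
  slice(-1) %>%                          # drop the row that repeats the column labels
  rename(                                # replace the long names with short ones
    city = `City (long scraped header)`,
    population = `Population (long scraped header)`
  ) %>%
  mutate(population = as.numeric(gsub(",", "", population)))  # strip commas, then convert

str(cities_clean)
```

The commas have to be removed before the conversion, because as.numeric("1,234,567") returns NA with a warning.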

To answer my questions, I first wanted to test my data mutation and visualization code in an R script. For the first visualization, I grouped the data by country and found the average city population using the summarise() function. Then I put this aggregated table into a column chart.

For the second visualization, I kept only the 3 countries with the highest average city populations (Iran, Japan, and Bangladesh). I also kept only the columns for the individual population types, along with the city and country. I then passed this set to a tidyr function called pivot_longer(), which a chat assistant helped me with, to put the population types into one nominal column of data. This allowed me to easily create a column chart showing each city's population by city proper, urban area, and metropolitan area.
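To make the reshaping step concrete, here is a small sketch of what pivot_longer() does to a wide table. The two cities and their numbers are placeholders, not values from the Wikipedia table; only the column names are taken from this write-up.

```r
library(dplyr)
library(tidyr)

# Toy wide table (placeholder values): one row per city, one column per population type.
wide <- data.frame(
  City.a. = c("City A", "City B"),
  cp_population = c(100, 200),
  ua_population = c(150, 250),
  ma_population = c(180, 300)
)

# Stack the three population columns into a name column and a value column.
long <- wide %>%
  pivot_longer(
    cols = c(cp_population, ua_population, ma_population),
    names_to = "population_type",
    values_to = "population"
  )

print(long)  # 2 cities x 3 population types = 6 rows
```

Each city now appears once per population type, which is the shape ggplot needs in order to map population_type to the fill color.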

Part 4: CSV file load
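The loading code for this part is not shown here. As a minimal sketch of a CSV round trip, assuming the cleaned data frame was saved with write.csv(): the temporary file path and the toy rows below are placeholders, not the assignment's real file or data.

```r
# Toy cleaned table (placeholder values) using column names from this write-up.
cities_out <- data.frame(
  City.a. = c("City A", "City B"),
  Country = c("Country X", "Country Y"),
  `2018_population` = c(1234567, 890123),
  check.names = FALSE
)

csv_path <- tempfile(fileext = ".csv")   # placeholder path; the real file name is not given
write.csv(cities_out, csv_path, row.names = FALSE)

# check.names = FALSE keeps names such as 2018_population as written;
# the default would change it to X2018_population.
cities_df <- read.csv(csv_path, check.names = FALSE)

str(cities_df)
```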

Part 5: Inquiry Visualizations

Viz 1: In which countries are the cities with the largest average population size?

library(dplyr)
library(ggplot2)

# Average 2018 population of the listed cities in each country
country_pop <-
  cities_df %>%
  group_by(Country) %>%
  summarise(country_avg_pop = mean(`2018_population`, na.rm = TRUE))

# Column chart of the averages, with country labels angled for readability
country_pop %>%
  ggplot(aes(x = Country, y = country_avg_pop)) +
  geom_col() +
  scale_y_continuous(labels = scales::number) +
  labs(title = "Average City Population by Country",
       y = "Population",
       x = "Country") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
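The top countries can also be read off the summary table instead of the chart. The sketch below uses dplyr's slice_max() for that; it is a suggested check rather than part of the original script, and the `country_pop` table here is a toy version with placeholder countries and values.

```r
library(dplyr)

# Toy summary table (placeholder values) with the same columns as country_pop.
country_pop <- data.frame(
  Country = c("Country W", "Country X", "Country Y", "Country Z"),
  country_avg_pop = c(5e6, 12e6, 9e6, 15e6)
)

# Keep the three rows with the largest average; slice_max() returns them in descending order.
top_countries <- country_pop %>%
  slice_max(country_avg_pop, n = 3)

print(top_countries)
```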

Viz 2: From the graph above, you can gather that Bangladesh, Japan, and Iran have the largest average city populations. Between the city proper, urban area, and metropolitan area, which area has the highest concentration of people in these cities?

library(tidyr)

# Keep the three countries from Viz 1 and only the columns needed for the chart
cities_df_filtered <-
  cities_df %>%
  filter(Country %in% c("Bangladesh", "Iran", "Japan")) %>%
  select(cp_population, ua_population, ma_population, City.a., Country)

cities_long <-
  cities_df_filtered %>%
  pivot_longer(
    cols = c(cp_population, ua_population, ma_population),
    names_to = "population_type",
    values_to = "population"
  )

cities_long %>%
  ggplot(aes(x = City.a., y = population, fill = population_type)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::number) +
  labs(
    title = "City Population Saturation by Area Type",
    x = "City",
    y = "Population",
    fill = "Population Type"
  )

Across all of these highly populated cities, the urban areas tend to have the highest concentration of people!
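One way to back up this reading of the chart with a count is to find, for each city, which population type is largest and then tally the results. The sketch below is only an illustration of that check: the `cities_long` table is a toy version with placeholder values, so its output says nothing about the real cities.

```r
library(dplyr)

# Toy long table (placeholder values) shaped like cities_long.
cities_long <- data.frame(
  City.a. = rep(c("City A", "City B"), each = 3),
  population_type = rep(c("cp_population", "ua_population", "ma_population"), times = 2),
  population = c(100, 180, 150, 200, 300, NA)
)

# For each city, keep the population type with the largest value, then tally the types.
largest_type <- cities_long %>%
  filter(!is.na(population)) %>%        # ignore population types with no figure
  group_by(City.a.) %>%
  slice_max(population, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  count(population_type)

print(largest_type)
```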