In this tutorial we are going to practice:
We’ll be focusing on Tanzania, my own home-away-from-home, but you should be able to modify this code and produce similar results for the country of your choosing.
The UN Department of Economic and Social Affairs publishes the World Urbanization Prospects, a report including many indicators of urban growth and scale. They provide free access to some of their historical population data and projections of population into the future, based on demographic models. One of the data sets they host is a listing, by country, of all cities with a population greater than 300K in the year 2018. For each of those cities, they provide historical populations in 5-year increments, along with future projections of those city populations.
After poking around their webpage for a bit, I found some good Excel formatted data. The direct URL that points to this dataset is listed in the chunk below.
# A data file that lists all cities with population greater than 300 K by country.
UN_URL <- "https://population.un.org/wup/Download/Files/WUP2018-F12-Cities_Over_300K.xls"
Next, we are going to download that file into the week_4 folder of our home directory, and rename it un_cities.xls.
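# download.file() is a base-R function in the utils package.
# It saves the file into your current working directory, so set that to the "week_4" folder you created first.
# mode = "wb" writes the file in binary mode, which keeps the .xls file from being corrupted on Windows.
download.file(url = UN_URL, destfile = "un_cities.xls", mode = "wb")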
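If you re-run this document often, you may not want to download the file every time. Here is a minimal sketch, assuming a hypothetical helper called download_once (it is not part of the original tutorial), that skips the download when the file is already on disk:

```r
# Hypothetical helper (not from the original tutorial): download a file
# only if it is not already present, and return the local path.
download_once <- function(url, dest_file) {
  if (!file.exists(dest_file)) {
    # mode = "wb" writes binary files such as .xls without corruption
    download.file(url = url, destfile = dest_file, mode = "wb")
  }
  dest_file
}

# Usage (commented out because it needs network access):
# UN_URL <- "https://population.un.org/wup/Download/Files/WUP2018-F12-Cities_Over_300K.xls"
# download_once(UN_URL, "un_cities.xls")
```

The function returns the path either way, so the result can be passed straight to whatever reads the file next.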
Now let’s first try to read the file using the read_xls function from the package readxl.
library(readxl)  # provides read_xls
un_cities <- read_xls("un_cities.xls")
## New names:
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * `` -> ...6
## * ...
The messages you might see about “New names:” indicate that something is a bit odd about the first row of the data. If the data were formatted nicely, the column names would be on the first row. With these data, R cannot find column names on that first row, and so it assigns some default column names instead.
Now let’s look at the dataframe, as we normally do, using head:
head(un_cities)
The call to head gives us a column named United Nations whose rows contain metadata (data about data) rather than the actual data. Metadata is important, but it should not be mixed up with the actual data in the same worksheet or file. Shame on the UN! To make these data usable, we are going to need to trim off the first 16 rows of metadata, and then see if that gives us a nice clean dataframe with column names on the first line. We do this using the argument skip = 16 in the function read_xls.
un_cities <- read_xls("un_cities.xls", skip = 16)
head(un_cities)
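The same skipping idea can be tried on a small scale without the UN file. The sketch below uses a toy CSV with invented contents and base R's read.csv rather than read_xls; it finds the line where the real header starts and skips everything before it:

```r
# Toy file (invented contents): two lines of metadata, then a header row.
toy_file <- tempfile(fileext = ".csv")
writeLines(c(
  "Some Organisation",
  "Notes about the file",
  "city,pop",
  "A,1",
  "B,2"
), toy_file)

# Look at the raw lines to find where the header row starts.
raw_lines <- readLines(toy_file)
header_line <- which(startsWith(raw_lines, "city,"))[1]

# Skip every line before the header row.
clean <- read.csv(toy_file, skip = header_line - 1)
names(clean)
```

For the UN spreadsheet the number 16 was found by eye, which is usually good enough for a one-off file.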
Now that is better. I will now take a look at the biggest cities in my home-away-from-home country, Tanzania. To do this, I am going to use the values in the rows to select a subset of the total data set. The code below uses the function filter, from the dplyr package, to select cities in Tanzania. Selecting rows from a dataframe based on the values they contain is a very important tool to practice in your data-wrangling adventures.
library(dplyr)  # provides filter, arrange, and desc
tz_cities <- filter(un_cities, `Country or area` == "United Republic of Tanzania")
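To see what this row selection is doing under the hood, here is a sketch of the same operation in base R on a tiny stand-in data frame. The data frame toy and its "Example Country" row are placeholders invented for illustration; only the column names mirror the UN file.

```r
# Stand-in data frame; "Example Country" / "Example City" are placeholders.
toy <- data.frame(
  `Country or area` = c("United Republic of Tanzania", "Example Country",
                        "United Republic of Tanzania"),
  `Urban Agglomeration` = c("Dar es Salaam", "Example City", "Mwanza"),
  check.names = FALSE,       # keep the spaces in the column names
  stringsAsFactors = FALSE
)

# A logical vector with one TRUE/FALSE per row ...
is_tz <- toy$`Country or area` == "United Republic of Tanzania"

# ... used to keep only the rows where the condition is TRUE.
toy_tz <- toy[is_tz, ]
toy_tz$`Urban Agglomeration`
```

The comparison builds one TRUE or FALSE per row, and only the TRUE rows are kept.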
Now I’d like to see what columns are provided in this UN data set.
colnames(tz_cities)
## [1] "Index" "Country Code" "Country or area"
## [4] "City Code" "Urban Agglomeration" "Note"
## [7] "Latitude" "Longitude" "1950"
## [10] "1955" "1960" "1965"
## [13] "1970" "1975" "1980"
## [16] "1985" "1990" "1995"
## [19] "2000" "2005" "2010"
## [22] "2015" "2020" "2025"
## [25] "2030" "2035"
As you can see, the column names refer to some basic info: latitude and longitude as well as the population in a given year. This data set was released in 2018, and the most recent actual data are from 2015, while those past 2015 are projections into the future. Let’s make a sorted bar chart of these data from 2015 for the top 5 cities in Tanzania, and save this as a PNG file on your machine.
# sort the data by 2015 population in descending order
tz_cities <- arrange(tz_cities, desc(`2015`))
# select the top 5 cases. The code below is saying "give me rows 1 through 5"
tz_cities_5 <- tz_cities[1:5,]
barplot(height=tz_cities_5$`2015`, names.arg=tz_cities_5$`Urban Agglomeration`, cex.names=0.75, ylab = "2015 Population (Thousands)", col="#5BD16B")
Please modify the code above and choose a color for the col parameter that is appropriate for your country. The Tanzanian flag happens to have lots of green, which is why I chose that color. You should specify the colors for the barplot using what are called hexadecimal codes: a hash sign (#) followed by 6 hexadecimal digits (0-9 and A-F), two each for red, green, and blue. You can find hexadecimal codes for any color using this webpage.
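If you are curious what a hexadecimal code means, base R can translate it for you. This small sketch uses col2rgb and rgb from the grDevices package (loaded by default) to split the green used above into its red, green, and blue parts and then rebuild the code; the variable names are just for illustration.

```r
flag_green <- "#5BD16B"

# Split the code into red, green, and blue values on a 0-255 scale.
channels <- col2rgb(flag_green)
channels

# Go the other way: build a hexadecimal code from three 0-255 values.
rebuilt <- rgb(91, 209, 107, maxColorValue = 255)
rebuilt
```

Each pair of digits is one channel: 5B is 91 for red, D1 is 209 for green, and 6B is 107 for blue.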
Let’s now save the plot you’ve made as a PNG file on your machine. Make sure the name is changed to be appropriate to your country.
png(filename="population_tz_top_5_ua.png")
barplot(height=tz_cities_5$`2015`, names.arg=tz_cities_5$`Urban Agglomeration`, cex.names=0.75, ylab = "2015 Population (Thousands)", col="#5BD16B")
dev.off()
## png
## 2
One of the good features of writing with RMarkdown is the ability to insert code into your text, and thus, include the numbers and words that result from your analyses in line with the rest of your text. We’ll practice that now. First of all, I am going to add a new block of code that will simply save the names of the top five urban areas in Tanzania into variables. I will also store the populations of each of these locales into variables. I will then reference these variables in text using the conventions of RMarkdown, so that the variables will be rendered as text when this document is previewed or knit into an .html report.
Here I identify the names of the five most populous cities and their populations. Since population is stored in thousands in this UN data set, I multiply it by 1000 and then round the value to the nearest whole number. Finally, I wrap that rounding in a call to get_nice_number in order to make sure the numbers are displayed with properly formatted commas.
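The function get_nice_number is not defined anywhere in this tutorial, so you will need to supply it before running the code. Below is a minimal sketch of what such a helper might look like, assuming all it has to do is add comma separators; the body is my guess, not necessarily the original definition.

```r
# Assumed definition (not shown in the original tutorial): format a number
# with commas as thousands separators and no scientific notation.
get_nice_number <- function(x) {
  format(x, big.mark = ",", scientific = FALSE, trim = TRUE)
}

# Example: a population stored in thousands, expanded and rounded.
get_nice_number(round(5115.698 * 1000, 0))
```

The result is a character string rather than a number, so do the arithmetic before calling it.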
ua_1 <- tz_cities$`Urban Agglomeration`[1]
ua_1_pop <- get_nice_number(round(tz_cities$`2015`[1]*1000,0))
ua_2 <- tz_cities$`Urban Agglomeration`[2]
ua_2_pop <- get_nice_number(round(tz_cities$`2015`[2]*1000,0))
ua_3 <- tz_cities$`Urban Agglomeration`[3]
ua_3_pop <- get_nice_number(round(tz_cities$`2015`[3]*1000,0))
ua_4 <- tz_cities$`Urban Agglomeration`[4]
ua_4_pop <- get_nice_number(round(tz_cities$`2015`[4]*1000,0))
ua_5 <- tz_cities$`Urban Agglomeration`[5]
ua_5_pop <- get_nice_number(round(tz_cities$`2015`[5]*1000,0))
Next, I use the variables that were created in the code chunk above to produce text. When using RMarkdown, in order to insert the value of an R variable into text, you have to follow a special syntax convention (please see this page for a detailed treatment of inline code in RMarkdown). You must first insert a single back quote, the letter ‘r’, a single space, and then your code. You end a section of inline code with a single back quote.
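Inline code is not limited to single variables; any R expression that returns text will work. As a sketch of an alternative to the ten separate variables, the code below builds both lists from two vectors. The values are copied from the rendered sentence in this tutorial (converted back to thousands), and join_with_and is a hypothetical helper introduced only for this example.

```r
# Values copied from the rendered sentence in this tutorial.
top_names <- c("Dar es Salaam", "Mwanza", "Zanzibar", "Mbeya", "Arusha")
top_pops_thousands <- c(5115.698, 838.019, 568.912, 444.275, 443.282)

# Expand to whole people and add comma separators.
pops_text <- format(round(top_pops_thousands * 1000), big.mark = ",",
                    trim = TRUE, scientific = FALSE)

# Hypothetical helper: "a, b, and c" style list (assumes at least 2 items).
join_with_and <- function(x) {
  n <- length(x)
  paste0(paste(x[-n], collapse = ", "), ", and ", x[n])
}

names_text <- join_with_and(top_names)
pops_list_text <- join_with_and(pops_text)
names_text
pops_list_text
```

In the .Rmd source, names_text and pops_list_text could then each be placed between the opening back quote plus r and the closing back quote.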
The top five most populous urban agglomerations in Tanzania are: Dar es Salaam, Mwanza, Zanzibar, Mbeya, and Arusha. The populations of these urban agglomerations are 5,115,698, 838,019, 568,912, 444,275, and 443,282, respectively.
United Nations, Department of Economic and Social Affairs, Population Division (2018). World Urbanization Prospects: The 2018 Revision, Online Edition. https://population.un.org/wup/Download/