When it comes to coding, I may say I am not the best, but I think teaching will make me better at it. Therefore, let’s start GIS coding using R. Why R? Because it’s the first coding language I got to learn and master successfully. I may not be the guru who can handle the most complex of tasks in it, but let’s just say I can use it to do basic functions. The complexity of the task is debatable. Let’s start anyway, and see where this journey leads us.
Let’s start by loading some 2019 census data, which I proudly participated in.
Let’s start simple and download the Kenya population data by sex and county for the year 2019.
It is available from here: https://open.africa/dataset/9b94fe50-9d75-4b92-be00-6354c6e6cc88/resource/384b93cf-ede0-4e05-9f36-a8e8e09b21a7/download/kenya-population-by-sex-and-county.csv
Once downloaded, let’s take a look at it using Microsoft Excel.
It looks like this.
the census data
The first five rows of the data, circled in the image above, contain some metadata, that is, the source and the fields therein, but that’s unnecessary for us. In fact, it will give us problems when we load the file into R, making us spend mental energy working out which code deletes these rows.
I will try the easy way this first time round.
Delete the first 6 rows circled above in Excel until the first remaining row is the one containing the column names: Name, Male, Female, Intersex and Total. Remember to save your updated sheet. Now let’s start coding. We have been itching to do this for long. Why not start with a simple exercise: loading the sheet containing our census data by sex and county into R?
The good thing is that our data is in .csv format, which makes it easier to load into R. This does not mean that R can’t load Excel workbooks, but doing so involves installing and calling extra packages, which might be a traumatising event for the beginner.
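For the curious, here is a rough sketch of how the same clean-up could be done in R itself, without touching Excel. It assumes the metadata sits in the first lines of the file and uses the skip argument of read.csv to jump over them. The tiny file below is made up for illustration (it is not the census file), and the number of lines to skip in the real file is something you would have to check for yourself.

```r
# A made-up stand-in for a CSV that has metadata lines above the header
toy_file <- tempfile(fileext = ".csv")
writeLines(c(
  "Source: made-up example",
  "Fields: name, Male, Female",
  "name,Male,Female",
  "CountyA,10,20",
  "CountyB,30,40"
), toy_file)

# skip = 2 tells read.csv to ignore the first two lines of the file,
# so the third line is treated as the header
toy <- read.csv(toy_file, header = TRUE, skip = 2)
toy
```

We will still take the Excel route this time, but it is good to know the option exists.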
Open RStudio. It should look like this:
maximize icon in red
There is a maximize icon at the top (see the area bounded in red). Click on it to maximize the text editor canvas.
Now the RStudio canvas has 4 windows. Going through the details of each one of them is beyond the scope of this book, as I want you to learn how to code and I hope to become better at it as I teach. Therefore, I took the easier route: I label the various windows in a graphic and point you to others who did the hard work of explaining what they all do.
the RStudio canvas
Knowing the names is a first step in understanding the functionalities of R. The Grunwald page has done a good job of summarizing what each RStudio window and tab does. See the details here: https://grunwaldlab.github.io/analysis_of_microbiome_community_data_in_r/00--intro_to_rstudio.html
We had left off trying to load some Excel data into R.
When we pressed the maximize ribbon, we opened our text editor. This is where we will be writing our code.
An R script file is opened for us by default, and it goes by the name ‘untitled’. We want to give it a name, and a project folder to live in. What on earth is a project, and an R script? Again, I leave this to the people who did the hard work of explaining these terms. Kindly see these links respectively:
https://support.rstudio.com/hc/en-us/articles/200526207-Using-RStudio-Projects and https://fileinfo.com/extension/r#:~:text=An%20R%20file%20is%20a,visualizations%20of%20the%20computed%20data.
To create a new R project where we will store our data, kindly follow the steps below.
Go to File>New project. The following prompt appears.
create project
Click on New Directory>New project. The following prompt appears.
create new project
Give your project a name. In this case, it is named gis_coding. Browse to the directory you would like to save your project. Once done click Create project.
Back to the R script. It is all empty and we have to wrench out some words, nay, code to call our excel file.
Here is the code to call our excel file into this intimidating blank screen of R
county_sex <- read.csv('E:\Documents\R Studio is here\programming teacher\Data\kenya-population-by-sex-and-county.csv', header = T)
Press Ctrl + Enter. Alternatively, you can select the entire code and click Run at the top right of your text editor.
It gives us an error. Don’t worry, this was intentional. There is something I want you to learn. When typing file paths, avoid the single backslash (\) since it will usually result in an error or a wrong path. A backslash in R is treated as an escape character, meaning R reads it together with the character after it as a special instruction rather than as plain text. It’s a bit complicated to explain fully, and it is not needful to do so here. Instead, use the forward slash (/) in file paths. The alternative is using two backslashes (\\), but we want our work to be neat and avoid silly mistakes.
Replace all the backslashes with forward slashes. Then run (Ctrl + Enter). From now henceforth, when running code, unless otherwise indicated, you will be using Ctrl + Enter.
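To see the backslash business in action, here is a small sketch. The path in it is hypothetical and nothing is read from disk; we only look at the text of the path itself. The functions used (nchar, gsub and file.path) all come with R.

```r
# Two backslashes inside quotes stand for ONE real backslash
win_path <- 'E:\\Documents\\Data\\my_file.csv'   # hypothetical path
nchar('\\')                                      # counts a single character

# Swap every backslash for a forward slash
# (fixed = TRUE treats the pattern as plain text, not a regular expression)
fixed_path <- gsub('\\', '/', win_path, fixed = TRUE)
fixed_path

# file.path() glues the pieces together with forward slashes for you
built_path <- file.path('E:', 'Documents', 'Data', 'my_file.csv')
built_path
```

Both fixed_path and built_path end up as the same forward-slash path, which is the form we will stick to.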
county_sex <- read.csv('E:/Documents/R Studio is here/programming teacher/Data/kenya-population-by-sex-and-county.csv', header = T)
Now let’s go through that entire chunk of code slowly.
county_sex: this is the object. An object is what we use to store a value. We normally assign values to objects using the <- in R. See this: https://www.datamentor.io/r-programming/object-class-introduction/#:~:text=In%20fact%2C%20everything%20in%20R,(prototype)%20of%20a%20house.
<- this is known as an operator. There are various operators in R. The back arrow (<-) operator assigns whatever is on the right to the object on the left, i.e. whatever value is given, or results from the operation on the right, will be stored in the object named on the left. The equals (=) sign can also be used, but good convention in R prefers the <- sign. For simple explanations of other operators, see https://libraryguides.mcgill.ca/c.php?g=699776&p=4968546
read.csv: this is a function used to read .csv files into R. Note that R is case sensitive, so it must be typed in lowercase. What is a function? See - https://www.tutorialspoint.com/r/r_functions.htm . read.csv is a built-in R function.
'E:/Documents/R Studio is here/programming teacher/Data/kenya-population-by-sex-and-county.csv': this is the file we are calling. It can be just the file name if the file is stored within the project folder, or the full address if we are calling it from somewhere else on our computer. It can also be a URL! Note that the file name includes its extension (.csv). R looks for a file with exactly the name we give it, so you will get an error if the extension is omitted.
header = T: we indicate that the first row contains our column names (T is short for TRUE).
For now, these explanations suffice. For more information on the read.csv built-in function, type ‘read.csv’ in the help menu of the file browser tab. It contains more details on how this function works, including the terms explained above.
Also note the single quote marks ('…') enclosing the file address. Quotation marks are used to enclose ‘strings’. A string is a collection of characters that make up one element of a vector. Most coders prefer double quotation marks ("…") but as mentioned earlier, I prefer the easy way and only use double quotes when an apostrophe (') is part of the characters involved, and so far the occurrence of such has been rarer than diamonds.
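A small sketch may help the header and quote-mark explanations sink in. Instead of a file, it feeds read.csv a made-up piece of text through its text argument, so you can run it anywhere. The county names and numbers are invented.

```r
# text = lets read.csv read from a string instead of a file (handy for demos)
csv_text <- 'name,Male,Female\nCountyA,10,20\nCountyB,30,40'

with_header <- read.csv(text = csv_text, header = TRUE)
names(with_header)      # the first row became the column names

no_header <- read.csv(text = csv_text, header = FALSE)
names(no_header)        # R invents the names V1, V2, V3
nrow(no_header)         # the old header row is now counted as data

# Single and double quotes make the same string
identical('Kwale', "Kwale")

# Double quotes are handy when the text itself holds an apostrophe
county <- "Murang'a"
county
```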
It is also helpful to do a read regarding the following as they are very common in most programming languages.
String - https://faculty.nps.edu/sebuttre/home/R/text.html Vector - https://www.javatpoint.com/r-vector#:~:text=A%20vector%20is%20a%20basic,complex%2C%20or%20raw%20data%20type.
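If you would rather poke at strings and vectors than read about them, here is a minimal sketch. It uses the first three county names and their Male and Female counts from the census table, typed in by hand.

```r
# A string is text inside quotes
one_string <- 'Mombasa'
class(one_string)            # R calls this type "character"

# c() combines values into a vector; here, a vector of three strings
counties <- c('Mombasa', 'Kwale', 'Kilifi')
length(counties)             # number of elements in the vector
counties[2]                  # pick the second element

# Vectors can hold numbers too, and R works on all elements at once
male   <- c(610257, 425121, 704089)
female <- c(598046, 441681, 749673)
male + female                # adds element by element
```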
Once you run the county_sex line, you will see the object appear in the environment window, under ‘Data’.
Click on it.
A table appears containing the same exact data as in our excel sheet.
the census table in r
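Clicking is not the only way to look at a table. The sketch below shows a few built-in functions that do the same job from code. It uses a small made-up data frame so that it runs on its own; with our data you would put county_sex in place of toy.

```r
# A made-up data frame standing in for county_sex
toy <- data.frame(
  name   = c('CountyA', 'CountyB', 'CountyC'),
  Male   = c(10, 30, 50),
  Female = c(20, 40, 60)
)

dim(toy)       # number of rows, then number of columns
names(toy)     # the column names
head(toy, 2)   # the first two rows
str(toy)       # the type of each column at a glance
```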
Let’s save our work so far.
Go back to your R Script and click on save.
Save it within your project folder. Name the R script ‘county_sex.R’. The .R extension indicates that this is an R script file.
Close RStudio through File>Quit session. Enough for today. If a pop-up appears asking if you would like to save the project, click Yes.
Part 2
In the last chapter, we had created a table by the name ‘county_sex’ in our R environment but we had not saved it anywhere into our computer directory. To do this and to at least have some evidence for your boss who may inquire of it quite unexpectedly, do the following:
write.csv(county_sex, file = 'county_pop_sex_test.csv')
saved csv in computer
What did we just do? And didn’t the argument ‘file’ need a directory path? Let’s answer those questions.
write.csv is a built-in function that writes a table to the file or connection we give it (it is a convenience version of the more general write.table). A connection here is R’s term for a destination it can write to, such as an open file. As can be seen in the help tab, ‘file’ is the character string naming the file where the table will be stored. Remember strings from our previous chapter? When writing the file location, kindly put it in quotation marks.
The reason we didn’t put a directory name like ‘E:/documents/…’ is that we intend the file to be placed in our project folder, which is where R saves files by default while the project is open. If we wanted to create the file in a different directory, we would have specified the file path, e.g. ‘E:/documents/get/some/direction….’
write.csv(county_sex, file = 'E:/county_pop_sex2.csv')
This duplicate actually got deleted.
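If you are ever unsure where a file went, the sketch below may help. getwd() reports the folder R is currently working in, which is where a bare file name ends up. To keep the example harmless it writes a made-up table into R’s temporary folder (tempdir()), which is only a stand-in for a folder of your choice.

```r
# The folder where files named without a path are read from and written to
getwd()

# A made-up table to save
toy <- data.frame(name = c('CountyA', 'CountyB'), Total = c(30, 70))

# Build a full path: folder first, then file name
out_dir  <- tempdir()                       # stand-in for your own folder
out_file <- file.path(out_dir, 'toy_pop.csv')

write.csv(toy, file = out_file)
file.exists(out_file)                       # TRUE if the file was created
```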
R can also work out which value is the file path even if you do not explicitly write the ‘file =’ argument.
For example
write.csv(county_sex, 'E:/county_pop_sex3.csv')
Try it out. In this case, the file path comes immediately after the object name. In the help of the RStudio browser tab you will see that ‘file’ is the second argument of write.csv, right after ‘x’, which stands for the object in our environment that will be written to a file on our computer. When we leave out the argument names, R matches the values to the arguments by position, so it assumes the second value is the file path.
Now let’s dig deeper.
Our ‘county_sex’ data in our R environment contains the countrywide totals for ‘Male’, ‘Female’, ‘Intersex’ and ‘Total’ as the first row after the column labels. Since this row would mess up our data during visualization, let’s delete it.
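Here is a small sketch of matching by name versus by position, using a made-up table and temporary files. It also shows a side effect worth knowing: by default write.csv saves the row numbers as an extra unnamed first column, and the row.names = FALSE argument switches that off.

```r
toy <- data.frame(name = c('CountyA', 'CountyB'), Total = c(30, 70))

f1 <- tempfile(fileext = '.csv')
f2 <- tempfile(fileext = '.csv')
write.csv(x = toy, file = f1)   # arguments given by name
write.csv(toy, f2)              # same arguments given by position
identical(readLines(f1), readLines(f2))   # the two files hold the same text

# The row numbers were saved too; read.csv labels that column X
names(read.csv(f1))

# row.names = FALSE leaves the row numbers out
f3 <- tempfile(fileext = '.csv')
write.csv(toy, f3, row.names = FALSE)
names(read.csv(f3))
```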
#delete row containing countrywide totals. It is numbered 1 in c() because the row above it contains column names, and is not treated as a row but rather as a header
county_sex_new <- county_sex[-c(1), ]
Please note the comma (,) that comes after the 1 in brackets. It signifies that we want to delete the row, not the column. Inside the square brackets, whatever comes before the comma refers to rows and whatever comes after it refers to columns. Now you know why I insist on the comma. Of course I first did this without it, only to end up with a table that had lost its ‘name’ column instead…
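The comma rule is easier to believe once you see it on a small table. The sketch below uses a made-up data frame whose first row plays the part of the countrywide totals.

```r
toy <- data.frame(
  name   = c('Total', 'CountyA', 'CountyB'),
  Male   = c(40, 10, 30),
  Female = c(60, 20, 40)
)

# With the comma: the 1 is in the ROW position, so row 1 is dropped
no_first_row <- toy[-c(1), ]
no_first_row

# Without the comma: R takes the 1 to mean the first COLUMN
no_first_col <- toy[-c(1)]
no_first_col

# Rows before the comma, columns after it
toy[2, 'Male']
```

Notice that no_first_row keeps the old row labels 2 and 3, which is the same thing that happens to our census table.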
Our data is all clean
county_sex_new
## name Male Female Intersex Total
## 2 Mombasa 610257 598046 30 1208333
## 3 Kwale 425121 441681 18 866820
## 4 Kilifi 704089 749673 25 1453787
## 5 Tana River 158550 157391 2 315943
## 6 Lamu 76103 67813 4 143920
## 7 Taita-Taveta 173337 167327 7 340671
## 8 Garissa 458975 382344 34 841353
## 9 Wajir 415374 365840 49 781263
## 10 Mandera 434976 432444 37 867457
## 11 Marsabit 243548 216219 18 459785
## 12 Isiolo 139510 128483 9 268002
## 13 Meru 767698 777975 41 1545714
## 14 Tharaka-Nithi 193764 199406 7 393177
## 15 Embu 304208 304367 24 608599
## 16 Kitui 549003 587151 33 1136187
## 17 Machakos 710707 711191 34 1421932
## 18 Makueni 489691 497942 20 987653
## 19 Nyandarua 315022 323247 20 638289
## 20 Nyeri 374288 384845 31 759164
## 21 Kirinyaga 302011 308369 31 610411
## 22 Murang'a 523940 532669 31 1056640
## 23 Kiambu 1187146 1230454 135 2417735
## 24 Turkana 478087 448868 21 926976
## 25 West Pokot 307013 314213 15 621241
## 26 Samburu 156774 153546 7 310327
## 27 Trans Nzoia 489107 501206 28 990341
## 28 Uasin Gishu 580269 582889 28 1163186
## 29 Elgeyo-Marakwet 227317 227151 12 454480
## 30 Nandi 441259 444430 22 885711
## 31 Baringo 336322 330428 13 666763
## 32 Laikipia 259440 259102 18 518560
## 33 Nakuru 1077272 1084835 95 2162202
## 34 Narok 579042 578805 26 1157873
## 35 Kajiado 557098 560704 38 1117840
## 36 Kericho 450741 451008 28 901777
## 37 Bomet 434287 441379 23 875689
## 38 Kakamega 897133 970406 40 1867579
## 39 Vihiga 283678 306323 12 590013
## 40 Bungoma 812146 858389 35 1670570
## 41 Busia 426252 467401 28 893681
## 42 Siaya 471669 521496 18 993183
## 43 Kisumu 560942 594609 23 1155574
## 44 Homa Bay 539560 592367 23 1131950
## 45 Migori 536187 580214 35 1116436
## 46 Kisii 605784 661038 38 1266860
## 47 Nyamira 290907 314656 13 605576
## 48 Nairobi 2192452 2204376 245 4397073
However, you may have noted that the first row now begins from index 2, whereas we are perfectionists and our eyes, and perhaps minds, like to see things flowing in a well-ordered sequence.
#for peace of mind
rownames(county_sex_new) <- seq(length = nrow(county_sex_new))
county_sex_new
## name Male Female Intersex Total
## 1 Mombasa 610257 598046 30 1208333
## 2 Kwale 425121 441681 18 866820
## 3 Kilifi 704089 749673 25 1453787
## 4 Tana River 158550 157391 2 315943
## 5 Lamu 76103 67813 4 143920
## 6 Taita-Taveta 173337 167327 7 340671
## 7 Garissa 458975 382344 34 841353
## 8 Wajir 415374 365840 49 781263
## 9 Mandera 434976 432444 37 867457
## 10 Marsabit 243548 216219 18 459785
## 11 Isiolo 139510 128483 9 268002
## 12 Meru 767698 777975 41 1545714
## 13 Tharaka-Nithi 193764 199406 7 393177
## 14 Embu 304208 304367 24 608599
## 15 Kitui 549003 587151 33 1136187
## 16 Machakos 710707 711191 34 1421932
## 17 Makueni 489691 497942 20 987653
## 18 Nyandarua 315022 323247 20 638289
## 19 Nyeri 374288 384845 31 759164
## 20 Kirinyaga 302011 308369 31 610411
## 21 Murang'a 523940 532669 31 1056640
## 22 Kiambu 1187146 1230454 135 2417735
## 23 Turkana 478087 448868 21 926976
## 24 West Pokot 307013 314213 15 621241
## 25 Samburu 156774 153546 7 310327
## 26 Trans Nzoia 489107 501206 28 990341
## 27 Uasin Gishu 580269 582889 28 1163186
## 28 Elgeyo-Marakwet 227317 227151 12 454480
## 29 Nandi 441259 444430 22 885711
## 30 Baringo 336322 330428 13 666763
## 31 Laikipia 259440 259102 18 518560
## 32 Nakuru 1077272 1084835 95 2162202
## 33 Narok 579042 578805 26 1157873
## 34 Kajiado 557098 560704 38 1117840
## 35 Kericho 450741 451008 28 901777
## 36 Bomet 434287 441379 23 875689
## 37 Kakamega 897133 970406 40 1867579
## 38 Vihiga 283678 306323 12 590013
## 39 Bungoma 812146 858389 35 1670570
## 40 Busia 426252 467401 28 893681
## 41 Siaya 471669 521496 18 993183
## 42 Kisumu 560942 594609 23 1155574
## 43 Homa Bay 539560 592367 23 1131950
## 44 Migori 536187 580214 35 1116436
## 45 Kisii 605784 661038 38 1266860
## 46 Nyamira 290907 314656 13 605576
## 47 Nairobi 2192452 2204376 245 4397073
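For what it is worth, there is more than one way to renumber rows. The sketch below, on a made-up table, shows two alternatives that I believe give the same result as the line used above: seq_len(), which counts from 1 up to the number you give it, and assigning NULL, which asks R to fall back to its default numbering.

```r
toy <- data.frame(name = c('Total', 'CountyA', 'CountyB'),
                  Total = c(100, 30, 70))

trimmed <- toy[-1, ]
rownames(trimmed)              # still labelled 2 and 3

# Option 1: number the rows from 1 to the number of rows
by_seq <- trimmed
rownames(by_seq) <- seq_len(nrow(by_seq))
rownames(by_seq)

# Option 2: throw the labels away and let R renumber
by_null <- trimmed
rownames(by_null) <- NULL
rownames(by_null)
```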
One tip I may give you that I learned the hard way but which eased things a bit: when you run into a problem, try googling it. Thereafter, if the code you find seems friendly, type it out yourself, much as you would when doing something new for the first time. If it gives results similar to what you wanted, follow the same procedure with your data.
Let’s do a bit of visualization of our data.
#first run install.packages('ggplot2') once, thereafter this:
library(ggplot2)
#attach() is a function that puts the columns of our data on R's search path so that they
#can be referred to by name. ggplot() does not strictly need it, since we hand it the data frame
attach(county_sex_new)
#type 'ggplot' and 'geom_bar' in the help menu to understand the code below
ggplot(county_sex_new, aes(name, Total)) + geom_bar(aes(fill = name), stat = 'identity',
show.legend = F)
However, our x-axis text labels look all crammed up. To make them look a bit better, we will tilt the x-axis text labels so that they read vertically, from bottom to top.
#introducing theme and element_text allows us to alter the x-axis labels
ggplot(county_sex_new, aes(name, Total)) + geom_bar(aes(fill = name), stat = 'identity',
show.legend = F) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
There we go. We have a cool bar graphic with something legible.
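As a side note, ggplot2 also has geom_col(), which is its shorthand for geom_bar(stat = 'identity'). The sketch below redraws the same kind of chart with it on a made-up table, so it runs without the census file. The extra vjust = 0.5 is my own addition; it is meant to centre each tilted label on its tick mark.

```r
library(ggplot2)

# A made-up table standing in for county_sex_new
toy <- data.frame(name  = c('CountyA', 'CountyB', 'CountyC'),
                  Total = c(30, 70, 50))

# geom_col() draws bars whose heights are the values in the data
p <- ggplot(toy, aes(name, Total)) +
  geom_col(aes(fill = name), show.legend = FALSE) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
p
```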
But then, we have been bitten by the restlessness bug and we would like to see a bar graph that has one column for each variable, say one each for male and female, or one each for male and intersex.
It’s quite a stretch, but we are too intoxicated with the restlessness bug to understand what is going on.
For this crazy mission, we will use a data frame that has far fewer values. We will use a dataset that I created that focuses on the sexes in the 5 most populous counties.
Let’s call it from the darkness.
highest_five <- read.csv('highest_five.csv', header = T)
highest_five
## ï.. name Male Female Intersex Total
## 1 47 Nairobi 2192452 2204376 245 4397073
## 2 22 Kiambu 1187146 1230454 135 2417735
## 3 32 Nakuru 1077272 1084835 95 2162202
## 4 37 Kakamega 897133 970406 40 1867579
## 5 39 Bungoma 812146 858389 35 1670570
There you go.
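You may have noticed the odd first column named ‘ï..’ in the output above. My guess, and it is only a guess, is that the file was saved with an invisible marker at its very start (a byte order mark, or BOM) which got mixed into the first column name, and that the numbers under it are the old row numbers saved along with the table. The sketch below builds a small made-up file that starts with such a marker and reads it with the fileEncoding argument of read.csv, which asks R to drop the marker.

```r
# Build a tiny made-up CSV that starts with a byte order mark (BOM)
bom_file <- tempfile(fileext = '.csv')
bom      <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw('name,Male,Female\nCountyA,10,20\n')), bom_file)

# 'UTF-8-BOM' tells R to strip the marker while reading
clean <- read.csv(bom_file, header = TRUE, fileEncoding = 'UTF-8-BOM')
names(clean)
```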
We would like to create a bar graph that has two columns per county, one for the Male variable and another for the Female variable. Makes sense? In other words, for every county name on the x-axis, there will be two columns, one showing the value of ‘Male’ and the other the value of ‘Female’.
Let’s wallow in it. First we will load the ‘reshape2’ package. The reshape2 package transforms our data frame from ‘wide’ format to ‘long’ format and vice versa. Wide format is where a data frame has repeated measurements in separate columns of the same row. Long format is where repeated measurements sit in separate rows. Which format is our ‘highest_five’ data frame in? Wide format. The measurements for each county are spread across separate columns of one row, i.e. the Male, Female and Intersex columns. Hope at least this makes sense.
library(reshape2)
#melt is a function enabled by the reshape2 package. it converts the data frame from wide to long
#format. we used measure.vars to convert the male and female variables to a long format
#the variables will be repeated within the same columns but in different rows to enable
#plotting when the variable column is called.
new_highest <- melt(highest_five, measure.vars = c('Male', 'Female'))
new_highest
## ï.. name Intersex Total variable value
## 1 47 Nairobi 245 4397073 Male 2192452
## 2 22 Kiambu 135 2417735 Male 1187146
## 3 32 Nakuru 95 2162202 Male 1077272
## 4 37 Kakamega 40 1867579 Male 897133
## 5 39 Bungoma 35 1670570 Male 812146
## 6 47 Nairobi 245 4397073 Female 2204376
## 7 22 Kiambu 135 2417735 Female 1230454
## 8 32 Nakuru 95 2162202 Female 1084835
## 9 37 Kakamega 40 1867579 Female 970406
## 10 39 Bungoma 35 1670570 Female 858389
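If melt feels like magic, here is a sketch of roughly what it does, written by hand in base R on a made-up table. This is not how reshape2 works internally, just the same idea spelled out: take the Male values, take the Female values, label each set, and pile one on top of the other.

```r
# Wide format: one row per county, one column per sex
wide <- data.frame(name   = c('CountyA', 'CountyB'),
                   Male   = c(10, 30),
                   Female = c(20, 40))

# Long format: one row per county AND sex
long <- rbind(
  data.frame(name = wide$name, variable = 'Male',   value = wide$Male),
  data.frame(name = wide$name, variable = 'Female', value = wide$Female)
)
long
```

Two counties times two sexes gives four rows, just as our five counties gave ten rows above.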
#use ggplot to plot the dataframe. 'value' will call the values of male and female variables and
#since they are aggregated by county in dataframe, R will intuitively devise a mechanism to
#distinguish them, this time in form of colours.
ggplot(new_highest, aes(name, value)) + geom_bar(aes(fill = variable), position = 'dodge',
stat = 'identity')
Fine, the words after # mean that R will not run anything on that line beyond the # sign. Think of it like a stop sign, and this is why the # is used for comments and annotations. What about the position = 'dodge' argument? The position argument specifies how we would like to place our columns. The default is ‘stack’ whereby the columns would be placed on top of each other per county, which is not what we want. ‘Dodge’ preserves the vertical position of the geom while shifting it sideways, so the columns sit next to each other.
The effect of the restlessness bug has not worn off, so we will flow along with it. Let’s do a teeny-weeny exercise of putting small gaps between the columns of ‘Male’ and ‘Female’ for social distancing’s sake.
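To see the difference between ‘stack’ and ‘dodge’ in numbers rather than pictures, here is a sketch on a made-up long table. It uses layer_data(), a ggplot2 helper that returns the values ggplot worked out for drawing a layer, including the top of each column (ymax).

```r
library(ggplot2)

# Made-up long-format data: two counties, two sexes
long <- data.frame(
  name     = rep(c('CountyA', 'CountyB'), times = 2),
  variable = rep(c('Male', 'Female'), each = 2),
  value    = c(10, 30, 20, 40)
)

# Anything after a # on a line is a comment and is never run
p_stack <- ggplot(long, aes(name, value)) +
  geom_bar(aes(fill = variable), stat = 'identity', position = 'stack')
p_dodge <- ggplot(long, aes(name, value)) +
  geom_bar(aes(fill = variable), stat = 'identity', position = 'dodge')

# Stacked: the tallest column is Male + Female for one county
max(layer_data(p_stack)$ymax)
# Dodged: the tallest column is just the single largest value
max(layer_data(p_dodge)$ymax)
```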
ggplot(new_highest, aes(name, value)) + geom_bar(aes(fill = variable),
stat = 'identity', width = 0.5, position = position_dodge(0.7))
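The function ‘position_dodge()’ adjusts position by dodging overlaps to the side. Here the pair of columns takes up a width of 0.5 but is spread over a dodging width of 0.7, and the difference shows up as the gap between them.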
Once again the restlessness bug strikes just before we awaken from our hypercoding drugged state. The colors seem a bit too pale, and we would like to spruce them up a bit. We also want to add a title and change the legend title as well. Blame it on the restlessness bug.
library(ggplot2)
#change color of the bar columns
ggplot(new_highest, aes(name, value)) +
  geom_bar(aes(fill = variable), stat = 'identity', width = 0.5,
           position = position_dodge(0.7)) +
  scale_fill_manual(values = c('turquoise', 'magenta'))
#add title and legend title
ggplot(new_highest, aes(name, value)) +
  geom_bar(aes(fill = variable), stat = 'identity', width = 0.5,
           position = position_dodge(0.7)) +
  scale_fill_manual(values = c('turquoise', 'magenta')) +
  labs(title = 'Population by sex for the 5 most populous counties') +
  xlab('County') + ylab('Population') +
  guides(fill = guide_legend(title = 'Sex'))
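A possible shortcut, in case the chain of functions above feels long: labs() can take the title, both axis labels and the legend title in one go. The sketch below tries it on a made-up long table, so the counties, numbers and title are invented.

```r
library(ggplot2)

# Made-up long-format data: two counties, two sexes
long <- data.frame(
  name     = rep(c('CountyA', 'CountyB'), times = 2),
  variable = rep(c('Male', 'Female'), each = 2),
  value    = c(10, 30, 20, 40)
)

p <- ggplot(long, aes(name, value)) +
  geom_bar(aes(fill = variable), stat = 'identity', width = 0.5,
           position = position_dodge(0.7)) +
  scale_fill_manual(values = c('turquoise', 'magenta')) +
  # fill = names the legend, because the legend belongs to the fill colours
  labs(title = 'Toy population by sex', x = 'County', y = 'Population',
       fill = 'Sex')
p
```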
You may have wondered how we came to know some of the additional code for the color and legend functions. Truth of the matter is, Google has brought all the best teachers to my fingertips. Therefore, looking at the trouble others underwent and the solutions they came up with, I can follow what they did and customize it to my case, minus the trouble, which is actually what was done here.
To show that I am actually not joking about learning from the experiences of others, see this question here: https://community.rstudio.com/t/a-single-code-to-create-a-grouped-bar-chart-using-ggplot/128573 . It’s good to take a taste of your own medicine at times.
Time to rest in peace. Read more on ggplot2.