Tables in R

When it comes to coding, I am not the best, but I think teaching will make me better at it. So let’s start GIS coding using R. Why R? Because it’s the first coding language I got to learn and master successfully. I may not be the guru who can handle the most complex of tasks in it, but let’s just say I can use it for the basics. How complex a task is remains debatable. Let’s start anyway, and see where this journey leads us.

Let’s start simple by loading some data from the 2019 census, which I proudly participated in. We will download the Kenya population data by sex and county for the year 2019.

It is available from here: https://open.africa/dataset/9b94fe50-9d75-4b92-be00-6354c6e6cc88/resource/384b93cf-ede0-4e05-9f36-a8e8e09b21a7/download/kenya-population-by-sex-and-county.csv

Once downloaded, let’s take a look at it using Microsoft Excel.

It looks like this.

the census data

The first five rows of the data, circled in the image above, contain some metadata, that is, the source and the fields therein, but we do not need them. In fact, they will give us problems when we try to load the file into R, because we would have to spend unnecessary mental energy finding the code to delete these rows.

I will try the easy way this first time round.

Delete the metadata rows circled above in Excel until the first remaining row is the one containing the column names: Name, Male, Female, Intersex and Total. Remember to save your updated sheet. Now let’s start coding. We have been itching to do this for long. Why not start with a simple exercise: load the sheet containing our census data by sex and county into R?
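For readers who would rather leave the downloaded file untouched, read.csv can also skip leading lines by itself through its skip argument. The sketch below is only an illustration: it builds a tiny stand-in file with two made-up metadata lines (the real file has more, so the number given to skip is an assumption you would adjust after looking at your own file) and then reads it while skipping them.

```r
# Build a small stand-in file: two metadata lines, then the real header.
# The metadata text is invented for illustration; the figures come from the census table.
demo_file <- tempfile(fileext = '.csv')
writeLines(c(
  'Source: example metadata line',
  'Fields: example metadata line',
  'name,Male,Female,Intersex,Total',
  'Mombasa,610257,598046,30,1208333',
  'Kwale,425121,441681,18,866820'
), demo_file)

# skip = 2 tells read.csv to ignore the first two lines before looking for the header
demo <- read.csv(demo_file, skip = 2, header = TRUE)
print(demo)
```

Deleting the rows in Excel, as we do here, gets to the same place; skip simply keeps the original download intact.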

The good thing is that our data is in .csv format, which makes it easier to load into R. This does not mean that R can’t load Excel workbooks, but doing so involves installing and calling extra packages, which might be a traumatising event for the beginner.

Open RStudio. It should look like this:

maximize icon in red

There is a maximize icon at the top (see the area bounded in red). Click on it to maximize the text editor canvas.

Now the RStudio canvas has 4 windows. Going through the details of each one of them is beyond the scope of this book, as I want you to learn how to code and hopefully become better at it as I teach. Therefore, I decided to take the easier route: label the various windows in a graphic, and point you to others who did the hard work of explaining all that they do.

the RStudio canvas

Knowing the names is a first step in understanding how RStudio works. The Grunwald page has done a good job of summarizing what each RStudio window and tab is for. See the details here: https://grunwaldlab.github.io/analysis_of_microbiome_community_data_in_r/00--intro_to_rstudio.html

We had left off while trying to load our census data into R.

When we pressed the maximize icon, we opened our text editor. This is where we will be writing our code.

An R script file is opened for us by default, and it goes by the name ‘untitled’. We want to give it a name, as well as a project folder to live in. What on earth is a project, and what is an R script? Again, I leave this to the people who did the hard work of explaining these terms. Kindly see these links respectively:

https://support.rstudio.com/hc/en-us/articles/200526207-Using-RStudio-Projects and https://fileinfo.com/extension/r#:~:text=An%20R%20file%20is%20a,visualizations%20of%20the%20computed%20data.

To create a new R project where we will store our data, kindly follow the steps below.

Go to File>New project. The following prompt appears.

create project

Click on New Directory>New project. The following prompt appears.

create new project

Give your project a name. In this case, it is named gis_coding. Browse to the directory where you would like to save your project. Once done, click Create project.

Back to the R script. It is all empty and we have to wrench out some words, nay, code to call our excel file.

Here is the code to call our file into this intimidating blank screen of R:

county_sex <- read.csv('E:\Documents\R Studio is here\programming teacher\Data\kenya-population-by-sex-and-county.csv', header = T)

Press Ctrl + Enter. Alternatively, you can select the entire code and click Run at the top right of your text editor.

It gives us an error. Don’t worry, this was intentional. There is something I want you to learn. When calling files, avoid using the single backslash (\), since R will not read it as part of the path. A backslash in R is treated as an escape character. It’s a bit complicated to explain, and it is not necessary to do so here. Instead, the forward slash (/) is used in file paths. The alternative is using two backslashes (\\), but we want our work to be neat and to avoid silly mistakes.

Replace all the backslashes with forward slashes. Then run the code (Ctrl + Enter). From now on, when running code, unless otherwise indicated, you will be using Ctrl + Enter.

county_sex <- read.csv('E:/Documents/R Studio is here/programming teacher/Data/kenya-population-by-sex-and-county.csv', header = T)
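If you are curious about what the escape character business looks like in practice, here is a small sketch. The path in it is a shortened, hypothetical one used purely for illustration. It shows that two backslashes typed in a string are stored as one, and one way of swapping backslashes for forward slashes with gsub.

```r
# Inside an R string, two typed backslashes stand for ONE literal backslash
win_path <- 'E:\\Documents\\Data\\census.csv'   # hypothetical path, for illustration only
print(nchar('\\'))     # counts a single character
cat(win_path, '\n')    # cat() shows the string as R actually stores it

# Swap every backslash for a forward slash; fixed = TRUE switches off regular expressions
clean_path <- gsub('\\', '/', win_path, fixed = TRUE)
print(clean_path)
```

This is handy when a path has been copied straight from Windows Explorer, which uses backslashes.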

Now let’s go through that entire chunk of code slowly.

county_sex – this is the object. An object is what we use to store a value. In R we normally assign values to objects using <-. See this: https://www.datamentor.io/r-programming/object-class-introduction/#:~:text=In%20fact%2C%20everything%20in%20R,(prototype)%20of%20a%20house.

<- – this is known as an operator. There are various operators in R. The assignment arrow (<-) assigns whatever is on its right to the object on its left, i.e. whatever value is given, or results from the operation on the right, is stored in the object on the left. The equals sign (=) can also be used, but R convention prefers <-. For simple explanations of other operators, see https://libraryguides.mcgill.ca/c.php?g=699776&p=4968546
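A tiny sketch of assignment, using made-up object names and figures taken from the census table, may make this concrete. It stores the same value with <- and with =, and shows that the right-hand side is worked out before it is stored.

```r
# Both lines store the value 47 in an object (the object names are arbitrary)
counties_arrow <- 47
counties_equals = 47
print(identical(counties_arrow, counties_equals))

# The right-hand side is evaluated first, then the result is stored on the left
mombasa_total <- 610257 + 598046 + 30   # Male + Female + Intersex for Mombasa
print(mombasa_total)
```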

read.csv – this is a function used to read .csv files into R. What is a function? See https://www.tutorialspoint.com/r/r_functions.htm . read.csv is a built-in R function.

‘E:/Documents/R Studio is here/programming teacher/Data/kenya-population-by-sex-and-county.csv’ – this is the file we are calling. It can be just the file name if the file is stored within the project folder, or the full address if we are calling it from somewhere else on our computer. It can also be a URL! Note that the file name includes its extension (.csv). The extension is part of the file’s name, so if you omit it R will not find the file and you will get an error.

header = T – we indicate that the first row contains our column names. For now, these explanations suffice. For more information on the read.csv built-in function, type ‘read.csv’ in the search box of the Help tab, found in the file browser window. It contains more details on how this function works, including the arguments explained above.

Also note the single quote marks (‘…’) enclosing the file address. Quotation marks are used to enclose ‘strings’. A string is a collection of characters that make up one element of a vector. Most coders prefer double quotation marks (“…”) but, as mentioned earlier, I prefer the easy way and only use double quotes when an apostrophe (’) is part of the characters involved, and so far such cases have been rarer than diamonds.

It is also helpful to read up on the following, as they are very common in most programming languages.

String - https://faculty.nps.edu/sebuttre/home/R/text.html

Vector - https://www.javatpoint.com/r-vector#:~:text=A%20vector%20is%20a%20basic,complex%2C%20or%20raw%20data%20type.
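As a quick taste of strings and vectors before you follow those links, here is a minimal sketch using three county names from our data. The object name counties is my own choice. Note the one name that needs double quotes because it contains an apostrophe.

```r
# A character vector with three elements; each element is one string
counties <- c('Mombasa', 'Kwale', "Murang'a")   # double quotes because of the apostrophe
print(class(counties))    # the type of the vector
print(length(counties))   # how many elements it holds
print(nchar(counties))    # number of characters in each string
```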

Once you run the county_sex code, you will see the object appear in the environment window, under ‘Data’.

Click on it.

A table appears containing the same exact data as in our excel sheet.

the census table in r
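Clicking is not the only way to look at a table. The sketch below uses a three-row stand-in typed in from the census figures (county_demo is a made-up name, so as not to disturb your real county_sex object) and a few built-in functions that describe a data frame from the script itself.

```r
# A three-row stand-in for county_sex, typed in from the census table
county_demo <- data.frame(
  name = c('Mombasa', 'Kwale', 'Kilifi'),
  Male = c(610257, 425121, 704089),
  Female = c(598046, 441681, 749673)
)
print(dim(county_demo))       # number of rows, then number of columns
print(names(county_demo))     # the column names
str(county_demo)              # one line per column: its type and first values
print(head(county_demo, 2))   # the first two rows
```

The same calls should work on county_sex once it is loaded.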

Let’s save our work so far.

Go back to your R Script and click on save.

Save it within your project folder. Name the R script ‘county_sex.R’. The .R extension indicates that this is an R script file.

Close RStudio through File>Quit session. Enough for today. If a pop-up appears asking if you would like to save the project, click Yes.

Part 2

In the last chapter, we created a table named ‘county_sex’ in our R environment, but we did not save it anywhere on our computer. To do this, and to at least have some evidence for a boss who may ask for it quite unexpectedly, do the following:

write.csv(county_sex, file = 'county_pop_sex_test.csv')

saved csv in computer

What did we just do? And didn’t the ‘file’ argument need a directory path? Let’s answer those questions.

write.csv is a built-in function that writes a table to the given file or connection. It is a convenience wrapper around write.table, which is the help page you will land on. As can be seen in the Help tab, ‘file’ is either a character string naming the file in which the table will be stored, or a connection open for writing. Remember strings from our previous chapter? When writing the file location, kindly put it in quotation marks.

The reason we didn’t put a directory name like ‘E:/documents/…’ is that we want the file to be placed in our project folder. If we wanted to create the file in a different directory, we would have specified the file path, e.g. ‘E:/documents/get/some/direction….’
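A file name with no folder in front of it ends up in R’s working directory, which inside an RStudio project is normally the project folder. The sketch below is a hedged illustration of that: it works inside a temporary scratch folder (so nothing is left lying around in your project), and the data frame and file name in it are invented for the demonstration.

```r
# A file name with no folder in front of it is written to the working directory
old_dir <- setwd(tempdir())   # temporarily work inside a scratch folder
print(getwd())                # where relative file names will end up

demo <- data.frame(name = c('Lamu', 'Isiolo'), Total = c(143920, 268002))
write.csv(demo, file = 'demo_relative.csv')

# Build the full path from its pieces and confirm the file is there
full_path <- file.path(getwd(), 'demo_relative.csv')
written <- file.exists(full_path)
print(written)

setwd(old_dir)                # go back to where we started
```

Running getwd() in your own session tells you where county_pop_sex_test.csv was saved.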

write.csv(county_sex, file = 'E:/county_pop_sex2.csv')

This duplicate actually got deleted.

R can also work out which argument is the file path even if you do not explicitly write ‘file =’.

For example

write.csv(county_sex, 'E:/county_pop_sex3.csv')

Try it out. In this case R assumes that the second thing we typed is the file path, because in the definition of write.csv the ‘file’ argument comes immediately after the ‘x’ argument (see the Help tab). The ‘x’ argument stands for the object in our environment that will be written to a file on our computer.

Now let’s dig deeper.

Our ‘county_sex’ data in our R environment contains the countrywide totals for ‘males’, ‘females’, ‘intersex’ and the overall total as the first row after the column labels. Since this row would mess up our data during visualization, let’s delete it.
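To convince yourself that naming the argument and relying on its position give the same result, here is a minimal sketch. It writes an invented two-row data frame to temporary files (all object names here are made up) and compares the results. As a bonus it shows the row.names argument of write.csv, which controls whether the row numbers are written as an extra first column.

```r
demo <- data.frame(name = c('Lamu', 'Isiolo'), Total = c(143920, 268002))
named_file <- tempfile(fileext = '.csv')
positional_file <- tempfile(fileext = '.csv')

write.csv(demo, file = named_file)   # 'file' named explicitly
write.csv(demo, positional_file)     # second position is matched to 'file'

# Both files contain exactly the same lines
same <- identical(readLines(named_file), readLines(positional_file))
print(same)

# By default write.csv also writes the row numbers as an unnamed first column;
# row.names = FALSE leaves them out
no_rownames_file <- tempfile(fileext = '.csv')
write.csv(demo, no_rownames_file, row.names = FALSE)
print(readLines(no_rownames_file)[1])
```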

#delete the row containing countrywide totals. It is row 1 because the line above it holds the
#column names, which R treats as a header rather than as a row
county_sex_new <- county_sex[-c(1), ]

Please note the comma (,) that comes after the 1 in brackets. It signifies that we want to delete a row, not a column. In R’s square brackets, whatever comes before the comma refers to rows and whatever comes after it refers to columns. Now you know why I insist on the comma: of course I first did this without it, only to end up with a table that had lost a column instead of the totals row…
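The difference the comma makes is easiest to see on a tiny table. The sketch below uses an invented three-row data frame whose first row plays the role of the totals row (its figures are simply the sums of the two counties beneath it) and tries all three spellings.

```r
# First row plays the role of the countrywide total (sum of the two counties below it)
demo <- data.frame(
  name = c('Total', 'Lamu', 'Isiolo'),
  Male = c(215613, 76103, 139510),
  Female = c(196296, 67813, 128483)
)

rows_dropped <- demo[-c(1), ]   # before the comma: drop row 1
cols_dropped <- demo[, -c(1)]   # after the comma: drop column 1
no_comma <- demo[-c(1)]         # no comma: R treats the index as a column index

print(rows_dropped)
print(names(cols_dropped))
print(names(no_comma))
```

Only the first spelling removes the totals row; the other two remove the first column.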

Our data is all clean

county_sex_new

##               name    Male  Female Intersex   Total
## 2          Mombasa  610257  598046       30 1208333
## 3            Kwale  425121  441681       18  866820
## 4           Kilifi  704089  749673       25 1453787
## 5       Tana River  158550  157391        2  315943
## 6             Lamu   76103   67813        4  143920
## 7     Taita-Taveta  173337  167327        7  340671
## 8          Garissa  458975  382344       34  841353
## 9            Wajir  415374  365840       49  781263
## 10         Mandera  434976  432444       37  867457
## 11        Marsabit  243548  216219       18  459785
## 12          Isiolo  139510  128483        9  268002
## 13            Meru  767698  777975       41 1545714
## 14   Tharaka-Nithi  193764  199406        7  393177
## 15            Embu  304208  304367       24  608599
## 16           Kitui  549003  587151       33 1136187
## 17        Machakos  710707  711191       34 1421932
## 18         Makueni  489691  497942       20  987653
## 19       Nyandarua  315022  323247       20  638289
## 20           Nyeri  374288  384845       31  759164
## 21       Kirinyaga  302011  308369       31  610411
## 22        Murang'a  523940  532669       31 1056640
## 23          Kiambu 1187146 1230454      135 2417735
## 24         Turkana  478087  448868       21  926976
## 25      West Pokot  307013  314213       15  621241
## 26         Samburu  156774  153546        7  310327
## 27     Trans Nzoia  489107  501206       28  990341
## 28     Uasin Gishu  580269  582889       28 1163186
## 29 Elgeyo-Marakwet  227317  227151       12  454480
## 30           Nandi  441259  444430       22  885711
## 31         Baringo  336322  330428       13  666763
## 32        Laikipia  259440  259102       18  518560
## 33          Nakuru 1077272 1084835       95 2162202
## 34           Narok  579042  578805       26 1157873
## 35         Kajiado  557098  560704       38 1117840
## 36         Kericho  450741  451008       28  901777
## 37           Bomet  434287  441379       23  875689
## 38        Kakamega  897133  970406       40 1867579
## 39          Vihiga  283678  306323       12  590013
## 40         Bungoma  812146  858389       35 1670570
## 41           Busia  426252  467401       28  893681
## 42           Siaya  471669  521496       18  993183
## 43          Kisumu  560942  594609       23 1155574
## 44        Homa Bay  539560  592367       23 1131950
## 45          Migori  536187  580214       35 1116436
## 46           Kisii  605784  661038       38 1266860
## 47         Nyamira  290907  314656       13  605576
## 48         Nairobi 2192452 2204376      245 4397073

However, you may have noted that the first row now begins from index 2, whereas we are perfectionists and our eyes, and perhaps minds, like to see things flowing in a well-ordered sequence.

#for peace of mind: renumber the rows from 1 up to the number of rows
rownames(county_sex_new) <- seq_len(nrow(county_sex_new))
county_sex_new

##               name    Male  Female Intersex   Total
## 1          Mombasa  610257  598046       30 1208333
## 2            Kwale  425121  441681       18  866820
## 3           Kilifi  704089  749673       25 1453787
## 4       Tana River  158550  157391        2  315943
## 5             Lamu   76103   67813        4  143920
## 6     Taita-Taveta  173337  167327        7  340671
## 7          Garissa  458975  382344       34  841353
## 8            Wajir  415374  365840       49  781263
## 9          Mandera  434976  432444       37  867457
## 10        Marsabit  243548  216219       18  459785
## 11          Isiolo  139510  128483        9  268002
## 12            Meru  767698  777975       41 1545714
## 13   Tharaka-Nithi  193764  199406        7  393177
## 14            Embu  304208  304367       24  608599
## 15           Kitui  549003  587151       33 1136187
## 16        Machakos  710707  711191       34 1421932
## 17         Makueni  489691  497942       20  987653
## 18       Nyandarua  315022  323247       20  638289
## 19           Nyeri  374288  384845       31  759164
## 20       Kirinyaga  302011  308369       31  610411
## 21        Murang'a  523940  532669       31 1056640
## 22          Kiambu 1187146 1230454      135 2417735
## 23         Turkana  478087  448868       21  926976
## 24      West Pokot  307013  314213       15  621241
## 25         Samburu  156774  153546        7  310327
## 26     Trans Nzoia  489107  501206       28  990341
## 27     Uasin Gishu  580269  582889       28 1163186
## 28 Elgeyo-Marakwet  227317  227151       12  454480
## 29           Nandi  441259  444430       22  885711
## 30         Baringo  336322  330428       13  666763
## 31        Laikipia  259440  259102       18  518560
## 32          Nakuru 1077272 1084835       95 2162202
## 33           Narok  579042  578805       26 1157873
## 34         Kajiado  557098  560704       38 1117840
## 35         Kericho  450741  451008       28  901777
## 36           Bomet  434287  441379       23  875689
## 37        Kakamega  897133  970406       40 1867579
## 38          Vihiga  283678  306323       12  590013
## 39         Bungoma  812146  858389       35 1670570
## 40           Busia  426252  467401       28  893681
## 41           Siaya  471669  521496       18  993183
## 42          Kisumu  560942  594609       23 1155574
## 43        Homa Bay  539560  592367       23 1131950
## 44          Migori  536187  580214       35 1116436
## 45           Kisii  605784  661038       38 1266860
## 46         Nyamira  290907  314656       13  605576
## 47         Nairobi 2192452 2204376      245 4397073
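There is also a shorter spelling that should give the same renumbering: setting the row names to NULL makes R fall back to plain 1, 2, 3 and so on. A minimal sketch on an invented three-row table (the first row again stands in for the totals row):

```r
demo <- data.frame(
  name = c('Total', 'Lamu', 'Isiolo'),
  Total = c(411922, 143920, 268002)
)
demo_new <- demo[-1, ]
print(rownames(demo_new))    # still labelled from 2

rownames(demo_new) <- NULL   # drop the old labels; R renumbers from 1
print(rownames(demo_new))
```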

One tip I learned the hard way, and which made things a bit easier: when you run into a problem, try googling it. If the code you find seems friendly, type it out and run it, much like doing something new for the first time. If it gives results similar to what you wanted, follow the same procedure with your own data.

Let’s do a bit of visualization for our data.

#first run install.packages('ggplot2') once, thereafter this:
library(ggplot2)

#attach() puts the columns of our data on R's search path so that they can be referred to by
#name. ggplot() does not strictly need it, since we pass the data frame in directly
attach(county_sex_new)

#search for 'ggplot' and 'geom_bar' in the Help tab to understand the code below
ggplot(county_sex_new, aes(name, Total)) +
  geom_bar(aes(fill = name), stat = 'identity', show.legend = F)

However, our x-axis text labels look all crammed up. To make them look a bit better, we will tilt the x-axis text labels so that they can be read vertically, from bottom to top.

#theme() and element_text() allow us to alter the x-axis labels
ggplot(county_sex_new, aes(name, Total)) +
  geom_bar(aes(fill = name), stat = 'identity', show.legend = F) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

There we go. We have a cool bar graphic with something legible.

But then, we have been bitten by the restlessness bug and we would like to see a bar graph that has one column for each variable, say one each for male and female, or one each for male and intersex.

It’s quite a stretch, but we are too intoxicated by the restlessness bug to understand what’s going on.

For this crazy mission, we will use a data frame that has far fewer values. We will use a dataset that I created that focuses on the sexes in the 5 most populous counties.

Let’s call it from the darkness.

highest_five <- read.csv('highest_five.csv', header = T)
highest_five

##   ï..     name    Male  Female Intersex   Total
## 1  47  Nairobi 2192452 2204376      245 4397073
## 2  22   Kiambu 1187146 1230454      135 2417735
## 3  32   Nakuru 1077272 1084835       95 2162202
## 4  37 Kakamega  897133  970406       40 1867579
## 5  39  Bungoma  812146  858389       35 1670570

There you go.
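You may have noticed the odd column name ï.. in the output above. One common cause, and I am assuming rather than asserting that it is what happened here, is an invisible byte-order mark (BOM) that some programs put at the very start of a CSV file. The sketch below builds a tiny file with such a mark and reads it with the fileEncoding argument of read.csv. The column name id and the file itself are invented for the demonstration.

```r
# Write a tiny CSV that starts with the three BOM bytes (EF BB BF)
bom_file <- tempfile(fileext = '.csv')
writeBin(
  c(as.raw(c(0xEF, 0xBB, 0xBF)),
    charToRaw('id,name,Total\n47,Nairobi,4397073\n22,Kiambu,2417735\n')),
  bom_file
)

# fileEncoding = 'UTF-8-BOM' tells R to drop the mark while reading
clean <- read.csv(bom_file, header = TRUE, fileEncoding = 'UTF-8-BOM')
print(names(clean))
```

The stray name does no harm to the plots that follow, so we will carry on with the data as loaded.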

We would like to create a bar graph that has two columns per county, one for the Male variable and another for the Female variable. Makes sense? In other words, for every county name on the x-axis, there will be two columns, one showing the value of the ‘Male’ variable and the other the value of the ‘Female’ variable.

Let’s wallow in it. First we will load the ‘reshape2’ package. The reshape2 package transforms our data frame from ‘wide’ format to ‘long’ format and vice versa. Wide format is where a data frame has repeated measurements in separate columns of the same row. Long format is where repeated measurements sit in separate rows. Which format is our ‘highest_five’ data frame in? Wide format. The measurements for each county are spread across separate columns, i.e. the Male, Female and Intersex columns, all on that county’s single row. Hope at least this makes sense.

library(reshape2)

#melt is a function provided by the reshape2 package. It converts the data frame from wide to
#long format. We use measure.vars to pick the Male and Female columns to convert:
#their names go into a new 'variable' column and their figures into a new 'value' column,
#on separate rows, which makes plotting possible when the 'variable' column is called.
new_highest <- melt(highest_five, measure.vars = c('Male', 'Female'))
new_highest

##    ï..     name Intersex   Total variable   value
## 1   47  Nairobi      245 4397073     Male 2192452
## 2   22   Kiambu      135 2417735     Male 1187146
## 3   32   Nakuru       95 2162202     Male 1077272
## 4   37 Kakamega       40 1867579     Male  897133
## 5   39  Bungoma       35 1670570     Male  812146
## 6   47  Nairobi      245 4397073   Female 2204376
## 7   22   Kiambu      135 2417735   Female 1230454
## 8   32   Nakuru       95 2162202   Female 1084835
## 9   37 Kakamega       40 1867579   Female  970406
## 10  39  Bungoma       35 1670570   Female  858389
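To demystify what melt just did, here is the same idea built by hand in base R for two of the counties. It is a sketch for understanding rather than a replacement for melt: the object names wide and long are my own, and unlike melt it leaves the ‘variable’ column as plain text.

```r
wide <- data.frame(
  name = c('Nairobi', 'Kiambu'),
  Male = c(2192452, 1187146),
  Female = c(2204376, 1230454)
)

# Stack the Male and Female columns on top of each other, and record in
# 'variable' which column each value came from
long <- data.frame(
  name = rep(wide$name, times = 2),
  variable = rep(c('Male', 'Female'), each = nrow(wide)),
  value = c(wide$Male, wide$Female)
)
print(long)
```

Two rows and two measurement columns become four rows, one per county and sex, which is the shape ggplot wants for grouped bars.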

#use ggplot to plot the data frame. 'value' holds the figures for the Male and Female variables,
#and since 'variable' records which is which for every county, fill = variable makes R
#distinguish them, this time in the form of colours.
ggplot(new_highest, aes(name, value)) +
  geom_bar(aes(fill = variable), position = 'dodge', stat = 'identity')

Fine, the # sign means that R will not run anything after it on that line. Think of it as a stop sign, and this is why the # is used for comments and annotations. What about the position = 'dodge' argument? The position argument specifies how we would like to place our columns. The default is ‘stack’, whereby the columns would be placed on top of each other per county, which is not what we want. ‘Dodge’ preserves the vertical position of each column while shifting it sideways, so the columns stand next to each other.

The effect of the restlessness bug has not worn off, so we will flow along with it. Let’s do a teeny-weeny exercise of putting small gaps between the ‘Male’ and ‘Female’ columns, for social distancing’s sake.
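If you would like to see stack and dodge side by side, the sketch below draws both from a small long-format table for two counties. It assumes ggplot2 is installed, and the object names demo, base_plot, p_stack and p_dodge are my own.

```r
library(ggplot2)

# Long-format figures for two counties, taken from the census table
demo <- data.frame(
  name = rep(c('Nairobi', 'Kiambu'), times = 2),
  variable = rep(c('Male', 'Female'), each = 2),
  value = c(2192452, 1187146, 2204376, 1230454)
)

base_plot <- ggplot(demo, aes(name, value, fill = variable))

# Stack: Male and Female are piled into one tall column per county
p_stack <- base_plot + geom_bar(stat = 'identity', position = 'stack')

# Dodge: Male and Female stand side by side, each keeping its own height
p_dodge <- base_plot + geom_bar(stat = 'identity', position = 'dodge')

print(p_stack)
print(p_dodge)
```

Flip between the two plots in the Plots tab and the difference should be obvious.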

ggplot(new_highest, aes(name, value)) +
  geom_bar(aes(fill = variable), stat = 'identity', width = 0.5,
           position = position_dodge(0.7))

The argument ‘position_dodge()’ adjusts position by dodging overlaps to the side.

Once again the restlessness bug strikes just before we awaken from our hypercoding drugged state. The colors seem a bit too pale, and we would like to spruce them up a bit. We also want to add a title and change the legend title as well. Blame it on the restlessness bug.

library(ggplot2)
#change color of the bar columns
ggplot(new_highest, aes(name, value)) +
  geom_bar(aes(fill = variable), stat = 'identity', width = 0.5,
           position = position_dodge(0.7)) +
  scale_fill_manual(values = c('turquoise', 'magenta'))

#add title, axis labels and legend title
ggplot(new_highest, aes(name, value)) +
  geom_bar(aes(fill = variable), stat = 'identity', width = 0.5,
           position = position_dodge(0.7)) +
  scale_fill_manual(values = c('turquoise', 'magenta')) +
  labs(title = 'Population by sex for the 5 most populous counties') + xlab('County') +
  ylab('Population') + guides(fill = guide_legend(title = 'Sex'))

You may have wondered how we came to know some of the additional code for the color and legend functions. Truth of the matter is, Google has brought all the best teachers to my fingertips. By looking at the trouble others went through and the solutions they came up with, I can follow what they did and customize it to my case, minus the trouble, which is actually what was done here.

To show that I am not joking about looking at the experiences of others, see this question here: https://community.rstudio.com/t/a-single-code-to-create-a-grouped-bar-chart-using-ggplot/128573 . It’s good to take a taste of your own medicine at times.

Time to rest in peace. Read more on ggplot2.

sam

3/15/2022