In the last script, you learned the basic operations and functions in R. In this script, you will learn how to work with datasets in R. The learning objectives of this script are:
Let’s work with the file “Demographics_Uzbekistan_Worldbank.csv”. The data was originally obtained from here: https://data.worldbank.org/country/uzbekistan This file gives an overview on the age structure of the Uzbek population.
For convenience, you should best store this csv-file in the folder where this script you are reading right now is stored, and all other files that you will export should also be saved in that folder.
Now let us load the file. As it is a CSV-file, you can use the read.csv function, and immediately assign a name to your table so it is imported as a data frame in R. If the file was for example a TXT file, you would have to use the function read.table instead. Try to avoid importing Excel-Files, but rather convert them to CSV or TXT before! Once you have typed read.csv(", you can press the TAB key, and a list with all files in the currently set working directory will appear and you can choose your file from that list - that is how you do not have to type the entire file name yourself! With the argument sep, you specify which character is used in the CSV file to separate columns. You may also want to specify that the file has a header with the argument header, to be sure that the first line is read as columns names and not as data. For more arguments that can be specified in the read.csv or read.table function, please consult the R help section.
You can either write the whole file path in the read.csv function, or you define the working directory before with the setwd function:
data <- read.csv("F:/IAMO_B/2022_DSIK/eLearning/Module_0_R-Basics/data/Demographics_Uzbekistan_Worldbank.csv", sep=";", header=T)
setwd("F:/IAMO_B/2022_DSIK/eLearning/Module_0_R-Basics/data/")
data <- read.csv("Demographics_Uzbekistan_Worldbank.csv", sep=";", header=T)
You can have a look at our data with the head function. By default, you will see the first six rows of the data frame, but you can also change that number, or view the last entries of the data frame with the tail function instead:
head(data)
| Country.Code | X2011 | X2012 | X2013 | X2014 | X2015 | X2016 |
|---|---|---|---|---|---|---|
| Population ages 40-44, male (% of male population) | 5.664594 | 5.755805 | 5.848225 | 5.922188 | 5.988523 | 6.044512 |
| Population ages 40-44, female (% of female population) | 5.917532 | 5.989325 | 6.063903 | 6.120755 | 6.172542 | 6.213516 |
| Population ages 35-39, male (% of male population) | 6.642547 | 6.697085 | 6.768352 | 6.864493 | 6.989650 | 7.130977 |
| Population ages 35-39, female (% of female population) | 6.749273 | 6.792030 | 6.854483 | 6.945114 | 7.069720 | 7.212531 |
| Population ages 30-34, male (% of male population) | 7.802630 | 7.964620 | 8.140442 | 8.324214 | 8.526776 | 8.732077 |
| Population ages 30-34, female (% of female population) | 7.815668 | 7.981538 | 8.156203 | 8.330237 | 8.516460 | 8.698230 |
head(data, n=2)
| Country.Code | X2011 | X2012 | X2013 | X2014 | X2015 | X2016 |
|---|---|---|---|---|---|---|
| Population ages 40-44, male (% of male population) | 5.664594 | 5.755805 | 5.848225 | 5.922188 | 5.988523 | 6.044512 |
| Population ages 40-44, female (% of female population) | 5.917532 | 5.989325 | 6.063903 | 6.120755 | 6.172542 | 6.213516 |
tail(data)
| Country.Code | X2011 | X2012 | X2013 | X2014 | X2015 | X2016 | |
|---|---|---|---|---|---|---|---|
| 25 | Population ages 0-14, male (% of total) | 2.942955e+01 | 2.917704e+01 | 2.901647e+01 | 2.895058e+01 | 2.889303e+01 | 2.891223e+01 |
| 26 | Population ages 0-14, male | 4.361333e+06 | 4.393616e+06 | 4.445335e+06 | 4.514702e+06 | 4.585999e+06 | 4.667807e+06 |
| 27 | Population ages 0-14, female (% of total) | 2.771110e+01 | 2.743799e+01 | 2.725245e+01 | 2.716502e+01 | 2.706964e+01 | 2.706237e+01 |
| 28 | Population ages 0-14, female | 4.144176e+06 | 4.166385e+06 | 4.207140e+06 | 4.266101e+06 | 4.324528e+06 | 4.395593e+06 |
| 29 | Population ages 0-4, male (% of male population) | 1.105702e+01 | 1.097338e+01 | 1.083421e+01 | 1.071369e+01 | 1.052543e+01 | 1.036063e+01 |
| 30 | Population ages 0-4, female (% of female population) | 1.033352e+01 | 1.022124e+01 | 1.006158e+01 | 9.932025e+00 | 9.728738e+00 | 9.573716e+00 |
If you want to see the whole data frame, you can execute the View function, which will open a new tab in R Studio:
View(data)
Let us look at the structure and summary of our data frame:
str(data)
## 'data.frame': 30 obs. of 7 variables:
## $ Country.Code: chr "Population ages 40-44, male (% of male population)" "Population ages 40-44, female (% of female population)" "Population ages 35-39, male (% of male population)" "Population ages 35-39, female (% of female population)" ...
## $ X2011 : num 5.66 5.92 6.64 6.75 7.8 ...
## $ X2012 : num 5.76 5.99 6.7 6.79 7.96 ...
## $ X2013 : num 5.85 6.06 6.77 6.85 8.14 ...
## $ X2014 : num 5.92 6.12 6.86 6.95 8.32 ...
## $ X2015 : num 5.99 6.17 6.99 7.07 8.53 ...
## $ X2016 : num 6.04 6.21 7.13 7.21 8.73 ...
summary(data)
## Country.Code X2011 X2012 X2013
## Length:30 Min. : 6 Min. : 6 Min. : 6
## Class :character 1st Qu.: 9 1st Qu.: 9 1st Qu.: 9
## Mode :character Median : 11 Median : 10 Median : 10
## Mean : 1899114 Mean : 1930550 Mean : 1964132
## 3rd Qu.: 67 3rd Qu.: 67 3rd Qu.: 68
## Max. :19980985 Max. :20398036 Max. :20809283
## X2014 X2015 X2016
## Min. : 6 Min. : 6 Min. : 6
## 1st Qu.: 9 1st Qu.: 9 1st Qu.: 9
## Median : 10 Median : 10 Median : 10
## Mean : 1998314 Mean : 2031271 Mean : 2062671
## 3rd Qu.: 68 3rd Qu.: 68 3rd Qu.: 67
## Max. :21193681 Max. :21558308 Max. :21876451
names(data)
## [1] "Country.Code" "X2011" "X2012" "X2013" "X2014"
## [6] "X2015" "X2016"
dim(data)
## [1] 30 7
You can have a look at certain rows or columns by indexing their positions, or by providing column names:
data[1,]
| Country.Code | X2011 | X2012 | X2013 | X2014 | X2015 | X2016 |
|---|---|---|---|---|---|---|
| Population ages 40-44, male (% of male population) | 5.664594 | 5.755805 | 5.848225 | 5.922188 | 5.988523 | 6.044512 |
data[10:12,]
| Country.Code | X2011 | X2012 | X2013 | X2014 | X2015 | X2016 | |
|---|---|---|---|---|---|---|---|
| 10 | Population ages 20-24, female (% of female population) | 1.057314e+01 | 1.047198e+01 | 1.029709e+01 | 1.004466e+01 | 9.742627e+00 | 9.376233e+00 |
| 11 | Population ages 15-64 (% of total) | 6.710771e+01 | 6.744669e+01 | 6.765552e+01 | 6.771382e+01 | 6.769146e+01 | 6.754660e+01 |
| 12 | Population ages 15-64, total | 1.998099e+07 | 2.039804e+07 | 2.080928e+07 | 2.119368e+07 | 2.155831e+07 | 2.187645e+07 |
data[,2:3]
| X2011 | X2012 |
|---|---|
| 5.664594e+00 | 5.755805e+00 |
| 5.917532e+00 | 5.989325e+00 |
| 6.642547e+00 | 6.697085e+00 |
| 6.749273e+00 | 6.792030e+00 |
| 7.802630e+00 | 7.964620e+00 |
| 7.815668e+00 | 7.981538e+00 |
| 9.530070e+00 | 9.728642e+00 |
| 9.415427e+00 | 9.585648e+00 |
| 1.101036e+01 | 1.096574e+01 |
| 1.057314e+01 | 1.047198e+01 |
| 6.710771e+01 | 6.744669e+01 |
| 1.998099e+07 | 2.039804e+07 |
| 6.685251e+01 | 6.718297e+01 |
| 9.907256e+06 | 1.011673e+07 |
| 6.736057e+01 | 6.770817e+01 |
| 1.007373e+07 | 1.028130e+07 |
| 1.074811e+01 | 1.025028e+01 |
| 1.013665e+01 | 9.695437e+00 |
| 9.233177e+00 | 8.876088e+00 |
| 8.753458e+00 | 8.417185e+00 |
| 9.139354e+00 | 9.327570e+00 |
| 8.624124e+00 | 8.799563e+00 |
| 2.856638e+01 | 2.830386e+01 |
| 8.505498e+06 | 8.559994e+06 |
| 2.942955e+01 | 2.917704e+01 |
| 4.361333e+06 | 4.393616e+06 |
| 2.771110e+01 | 2.743799e+01 |
| 4.144176e+06 | 4.166385e+06 |
| 1.105702e+01 | 1.097338e+01 |
| 1.033352e+01 | 1.022124e+01 |
data$X2012
## [1] 5.755805e+00 5.989325e+00 6.697085e+00 6.792030e+00 7.964619e+00
## [6] 7.981538e+00 9.728642e+00 9.585648e+00 1.096574e+01 1.047198e+01
## [11] 6.744669e+01 2.039804e+07 6.718298e+01 1.011673e+07 6.770817e+01
## [16] 1.028130e+07 1.025028e+01 9.695437e+00 8.876088e+00 8.417185e+00
## [21] 9.327570e+00 8.799563e+00 2.830386e+01 8.559994e+06 2.917704e+01
## [26] 4.393616e+06 2.743799e+01 4.166385e+06 1.097338e+01 1.022124e+01
You can index specific entries by providing both a row and a column index, or by indexing the element in a column vector:
data[2,2]
## [1] 5.917532
data$X2012[2]
## [1] 5.989325
Let us convert the first column of the data frame to row names and check if it worked:
rownames(data) <- data[,1]
data <- data[,-1] # or: data[,1] <- NULL
head(data)
| X2011 | X2012 | X2013 | X2014 | X2015 | X2016 | |
|---|---|---|---|---|---|---|
| Population ages 40-44, male (% of male population) | 5.664594 | 5.755805 | 5.848225 | 5.922188 | 5.988523 | 6.044512 |
| Population ages 40-44, female (% of female population) | 5.917532 | 5.989325 | 6.063903 | 6.120755 | 6.172542 | 6.213516 |
| Population ages 35-39, male (% of male population) | 6.642547 | 6.697085 | 6.768352 | 6.864493 | 6.989650 | 7.130977 |
| Population ages 35-39, female (% of female population) | 6.749273 | 6.792030 | 6.854483 | 6.945114 | 7.069720 | 7.212531 |
| Population ages 30-34, male (% of male population) | 7.802630 | 7.964620 | 8.140442 | 8.324214 | 8.526776 | 8.732077 |
| Population ages 30-34, female (% of female population) | 7.815668 | 7.981538 | 8.156203 | 8.330237 | 8.516460 | 8.698230 |
As you can see, the entries in our data frame are not all comparable to each other, because some are absolute counts, and some are percentages:
rownames(data)
## [1] "Population ages 40-44, male (% of male population)"
## [2] "Population ages 40-44, female (% of female population)"
## [3] "Population ages 35-39, male (% of male population)"
## [4] "Population ages 35-39, female (% of female population)"
## [5] "Population ages 30-34, male (% of male population)"
## [6] "Population ages 30-34, female (% of female population)"
## [7] "Population ages 25-29, male (% of male population)"
## [8] "Population ages 25-29, female (% of female population)"
## [9] "Population ages 20-24, male (% of male population)"
## [10] "Population ages 20-24, female (% of female population)"
## [11] "Population ages 15-64 (% of total)"
## [12] "Population ages 15-64, total"
## [13] "Population ages 15-64, male (% of total)"
## [14] "Population ages 15-64, male"
## [15] "Population ages 15-64, female (% of total)"
## [16] "Population ages 15-64, female"
## [17] "Population ages 15-19, male (% of male population)"
## [18] "Population ages 15-19, female (% of female population)"
## [19] "Population ages 10-14, male (% of male population)"
## [20] "Population ages 10-14, female (% of female population)"
## [21] "Population ages 5-9, male (% of male population)"
## [22] "Population ages 5-9, female (% of female population)"
## [23] "Population ages 0-14 (% of total)"
## [24] "Population ages 0-14, total"
## [25] "Population ages 0-14, male (% of total)"
## [26] "Population ages 0-14, male"
## [27] "Population ages 0-14, female (% of total)"
## [28] "Population ages 0-14, female"
## [29] "Population ages 0-4, male (% of male population)"
## [30] "Population ages 0-4, female (% of female population)"
So let us create two new data frames - one for percentages, and one for absolute counts, and check if it worked:
data_perc <- data[c(1:11,13,15,17:23,25,27,29,30),]
data_count <- data[-c(1:11,13,15,17:23,25,27,29,30),]
rownames(data_perc)
## [1] "Population ages 40-44, male (% of male population)"
## [2] "Population ages 40-44, female (% of female population)"
## [3] "Population ages 35-39, male (% of male population)"
## [4] "Population ages 35-39, female (% of female population)"
## [5] "Population ages 30-34, male (% of male population)"
## [6] "Population ages 30-34, female (% of female population)"
## [7] "Population ages 25-29, male (% of male population)"
## [8] "Population ages 25-29, female (% of female population)"
## [9] "Population ages 20-24, male (% of male population)"
## [10] "Population ages 20-24, female (% of female population)"
## [11] "Population ages 15-64 (% of total)"
## [12] "Population ages 15-64, male (% of total)"
## [13] "Population ages 15-64, female (% of total)"
## [14] "Population ages 15-19, male (% of male population)"
## [15] "Population ages 15-19, female (% of female population)"
## [16] "Population ages 10-14, male (% of male population)"
## [17] "Population ages 10-14, female (% of female population)"
## [18] "Population ages 5-9, male (% of male population)"
## [19] "Population ages 5-9, female (% of female population)"
## [20] "Population ages 0-14 (% of total)"
## [21] "Population ages 0-14, male (% of total)"
## [22] "Population ages 0-14, female (% of total)"
## [23] "Population ages 0-4, male (% of male population)"
## [24] "Population ages 0-4, female (% of female population)"
rownames(data_count)
## [1] "Population ages 15-64, total" "Population ages 15-64, male"
## [3] "Population ages 15-64, female" "Population ages 0-14, total"
## [5] "Population ages 0-14, male" "Population ages 0-14, female"
Let us proceed with the data frame data_count for now. The data might still not be in a convenient format, for example you may want to have the years as rows instead of columns, change the new row names to proper numbers, and shorten the column names:
data_count_t <- as.data.frame(t(data_count)) # transpose the dataframe
rownames(data_count_t) <- c(2011:2016)
names(data_count_t) <- c("total_15_64", "male_15_64", "female_15_64", "total_0_14", "male_0_14", "female_0_14")
head(data_count_t)
| total_15_64 | male_15_64 | female_15_64 | total_0_14 | male_0_14 | female_0_14 | |
|---|---|---|---|---|---|---|
| 2011 | 19980985 | 9907256 | 10073725 | 8505498 | 4361333 | 4144176 |
| 2012 | 20398036 | 10116728 | 10281303 | 8559994 | 4393616 | 4166385 |
| 2013 | 20809283 | 10323217 | 10486061 | 8652474 | 4445335 | 4207140 |
| 2014 | 21193681 | 10516527 | 10677153 | 8780801 | 4514702 | 4266101 |
| 2015 | 21558308 | 10697841 | 10860468 | 8910534 | 4585999 | 4324528 |
| 2016 | 21876451 | 10857091 | 11019354 | 9063398 | 4667807 | 4395593 |
You may want to further split the data, either according to certain time spans, or with respect to gender:
data_2011_2013 <- data_count_t[1:3,]
data_2014_2016 <- data_count_t[4:6,]
data_male <- data_count_t[,c(2,5)]
data_female <- data_count_t[,c(3,6)]
data_total <- data_count_t[,c(1,4)]
If you want to join the tables again, you can use the rbind (bind rows) or cbind (bind columns) function:
data_restored1 <- rbind(data_2011_2013, data_2014_2016)
data_restored2 <- cbind(data_male, data_female, data_total)
(By the way, you can also bind matrices to an array using the abind function from the abind-package.)
Above, you selected the entries of the table by indexing the rows and columns with numbers. However, there might be cases when you need to do a more sophisticated query, or when there are too many columns or rows than you could select manually. You can do that using either the subset or which function. This will be very important during the whole course, so make sure you well understand the logic behind these queries! You may also want to check this site.
Let’s start with the subset function:
# select all rows where the column "total_15_64" is less than 10.000.000
subset(data_count_t, total_15_64 < 21000000)
| total_15_64 | male_15_64 | female_15_64 | total_0_14 | male_0_14 | female_0_14 | |
|---|---|---|---|---|---|---|
| 2011 | 19980985 | 9907256 | 10073725 | 8505498 | 4361333 | 4144176 |
| 2012 | 20398036 | 10116728 | 10281303 | 8559994 | 4393616 | 4166385 |
| 2013 | 20809283 | 10323217 | 10486061 | 8652474 | 4445335 | 4207140 |
# include a second condition with an OR expression: select all rows where the column "total_15_64" is either less than 10.000.000 or greater than 20.000.000
test1 <- subset(data_count_t, total_15_64 < 20000000 | total_15_64 > 21000000)
test1
| total_15_64 | male_15_64 | female_15_64 | total_0_14 | male_0_14 | female_0_14 | |
|---|---|---|---|---|---|---|
| 2011 | 19980985 | 9907256 | 10073725 | 8505498 | 4361333 | 4144176 |
| 2014 | 21193681 | 10516527 | 10677153 | 8780801 | 4514702 | 4266101 |
| 2015 | 21558308 | 10697841 | 10860468 | 8910534 | 4585999 | 4324528 |
| 2016 | 21876451 | 10857091 | 11019354 | 9063398 | 4667807 | 4395593 |
# expand the previous example and only keep the columns "male_15_64" and "female_15_64"
test2 <- subset(data_count_t, total_15_64 < 20000000 | total_15_64 > 21000000, select=c(male_15_64, female_15_64))
test2
| male_15_64 | female_15_64 | |
|---|---|---|
| 2011 | 9907256 | 10073725 |
| 2014 | 10516527 | 10677153 |
| 2015 | 10697841 | 10860468 |
| 2016 | 10857091 | 11019354 |
Another way of subsetting is to use the which function. which produces a boolean vector, i.e. it compares a vector against a value and for every element in the vector, it returns TRUE or FALSE. When you use the which function in squared brackets to index elements, only those elements will be chosen whose position in the vector is the same as the positions of the TRUE values in the boolean vector.
boolean.vector <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
numeric.vector <- 1:5
numeric.vector[boolean.vector]
## [1] 1 3
which(numeric.vector <= 3)
## [1] 1 2 3
# define a vector of all past leap-years
leap_years <- seq(from=1960, to=2016, by=4)
# select only those entries where the row name is a leap-year
data_count_t[which(row.names(data_count_t) %in% leap_years), ]
| total_15_64 | male_15_64 | female_15_64 | total_0_14 | male_0_14 | female_0_14 | |
|---|---|---|---|---|---|---|
| 2012 | 20398036 | 10116728 | 10281303 | 8559994 | 4393616 | 4166385 |
| 2016 | 21876451 | 10857091 | 11019354 | 9063398 | 4667807 | 4395593 |
# select only those columns whose column names start with "male"
data_count_t[, which(substr(names(data_count_t), 1, 4) == "male")]
| male_15_64 | male_0_14 | |
|---|---|---|
| 2011 | 9907256 | 4361333 |
| 2012 | 10116728 | 4393616 |
| 2013 | 10323217 | 4445335 |
| 2014 | 10516527 | 4514702 |
| 2015 | 10697841 | 4585999 |
| 2016 | 10857091 | 4667807 |
The last line of code might seem quite confusing because there are a lot of brackets. In such cases, it helps to read the code from inside to outside. So, let’s decompose line 191:
names(data_count_t)
## [1] "total_15_64" "male_15_64" "female_15_64" "total_0_14" "male_0_14"
## [6] "female_0_14"
substr(names(data_count_t), 1, 4)
## [1] "tota" "male" "fema" "tota" "male" "fema"
which(substr(names(data_count_t), 1, 4) == "male")
## [1] 2 5
data_count_t[, which(substr(names(data_count_t), 1, 4) == "male")]
| male_15_64 | male_0_14 | |
|---|---|---|
| 2011 | 9907256 | 4361333 |
| 2012 | 10116728 | 4393616 |
| 2013 | 10323217 | 4445335 |
| 2014 | 10516527 | 4514702 |
| 2015 | 10697841 | 4585999 |
| 2016 | 10857091 | 4667807 |
Now let us do some calculations. As you learned in the previous script, columns and rows behave like vectors, so you can perform mathematical operations on them. Assume you want to calculate the sum of both age classes in data_total. You can do this the following way:
data_total[,3] <- data_total[,1] + data_total[,2]
names(data_total)[3] <- "total_0_64"
But, let’s undo that.
data_total[,3] <- NULL
However, imagine you have hundreds of columns that you want to sum up - then you would need some other code to simplify the expression. What you need is the apply function. The second argument in the apply function can take the value 1 or 2, depending on whether you want to apply your calculation to rows (1) or columns (2):
data_total[,3] <- apply(data_total, 1, sum) # calculate for each row (1) the sum of both age classes
data_total[7,] <- apply(data_total, 2, mean) # calculate for each column (2) the mean over all years
names(data_total)[3] <- "sum of both age classes"
rownames(data_total)[7] <- "mean of all years"
head(data_total)
| total_15_64 | total_0_14 | sum of both age classes | |
|---|---|---|---|
| 2011 | 19980985 | 8505498 | 28486483 |
| 2012 | 20398036 | 8559994 | 28958030 |
| 2013 | 20809283 | 8652474 | 29461757 |
| 2014 | 21193681 | 8780801 | 29974482 |
| 2015 | 21558308 | 8910534 | 30468842 |
| 2016 | 21876451 | 9063398 | 30939849 |
tail(data_total)
| total_15_64 | total_0_14 | sum of both age classes | |
|---|---|---|---|
| 2012 | 20398036 | 8559994 | 28958030 |
| 2013 | 20809283 | 8652474 | 29461757 |
| 2014 | 21193681 | 8780801 | 29974482 |
| 2015 | 21558308 | 8910534 | 30468842 |
| 2016 | 21876451 | 9063398 | 30939849 |
| mean of all years | 20969457 | 8745450 | 29714907 |
For more variations of the apply function, please have a look here.
Exporting files is analogous to importing them - your working directory defines where the file will be stored by default. Specify the desired folder path in the write function. When you export a CSV, you should use write.csv; for text files, use write.table. In these functions, you first have to specify the name of the object that you want to export, then the name of the new file, and then it is often very convenient to set rownames to FALSE so there will be no default numbering added to the rows. As with the read-functions, in the help-section you can find more arguments that you might need to specify.
You can either write the whole file path in the write.csv function, or you define the working directory before with the setwd function:
write.csv(data_male, "F:/IAMO_B/2022_DSIK/eLearning/Module_0_R-Basics/data/data_male.csv", row.names=T)
write.table(data_female, "F:/IAMO_B/2022_DSIK/eLearning/Module_0_R-Basics/data/data_female.txt", row.names=F)
help(write.table)
## starting httpd help server ... done
setwd("F:/IAMO_B/2022_DSIK/eLearning/Module_0_R-Basics/data/")
write.csv(data_male, "data_male.csv", row.names=T)
write.table(data_female, "data_female.txt", row.names=F)
That’s it! In the following scripts, you will always have to load and export files, and manipulate data in one or the other way, so please make sure that you well understand the tools presented to you in this script - you will need them every time that you work in R!