In the last script, you learned the basic operations and functions in R. In this script, you will learn how to work with datasets in R. The learning objectives of this script are:

  • Know how to read (import) and write (export) files
  • Know how to explore data
  • Know how to manipulate data

1. Import files

Let’s work with the file “Demographics_Uzbekistan_Worldbank.csv”. The data was originally obtained from here: https://data.worldbank.org/country/uzbekistan This file gives an overview on the age structure of the Uzbek population.

For convenience, you should best store this csv-file in the folder where this script you are reading right now is stored, and all other files that you will export should also be saved in that folder.

Now let us load the file. As it is a CSV-file, you can use the read.csv function, and immediately assign a name to your table so it is imported as a data frame in R. If the file was for example a TXT file, you would have to use the function read.table instead. Try to avoid importing Excel-Files, but rather convert them to CSV or TXT before! Once you have typed read.csv(", you can press the TAB key, and a list with all files in the currently set working directory will appear and you can choose your file from that list - that is how you do not have to type the entire file name yourself! With the argument sep, you specify which character is used in the CSV file to separate columns. You may also want to specify that the file has a header with the argument header, to be sure that the first line is read as columns names and not as data. For more arguments that can be specified in the read.csv or read.table function, please consult the R help section.

You can either write the whole file path in the read.csv function, or you define the working directory before with the setwd function:

data <- read.csv("F:/IAMO_B/2022_DSIK/eLearning/Module_0_R-Basics/data/Demographics_Uzbekistan_Worldbank.csv", sep=";", header=T)

setwd("F:/IAMO_B/2022_DSIK/eLearning/Module_0_R-Basics/data/")
data <- read.csv("Demographics_Uzbekistan_Worldbank.csv", sep=";", header=T)

2. Explore data

You can have a look at our data with the head function. By default, you will see the first six rows of the data frame, but you can also change that number, or view the last entries of the data frame with the tail function instead:

head(data)
Country.Code X2011 X2012 X2013 X2014 X2015 X2016
Population ages 40-44, male (% of male population) 5.664594 5.755805 5.848225 5.922188 5.988523 6.044512
Population ages 40-44, female (% of female population) 5.917532 5.989325 6.063903 6.120755 6.172542 6.213516
Population ages 35-39, male (% of male population) 6.642547 6.697085 6.768352 6.864493 6.989650 7.130977
Population ages 35-39, female (% of female population) 6.749273 6.792030 6.854483 6.945114 7.069720 7.212531
Population ages 30-34, male (% of male population) 7.802630 7.964620 8.140442 8.324214 8.526776 8.732077
Population ages 30-34, female (% of female population) 7.815668 7.981538 8.156203 8.330237 8.516460 8.698230
head(data, n=2)
Country.Code X2011 X2012 X2013 X2014 X2015 X2016
Population ages 40-44, male (% of male population) 5.664594 5.755805 5.848225 5.922188 5.988523 6.044512
Population ages 40-44, female (% of female population) 5.917532 5.989325 6.063903 6.120755 6.172542 6.213516
tail(data)
Country.Code X2011 X2012 X2013 X2014 X2015 X2016
25 Population ages 0-14, male (% of total) 2.942955e+01 2.917704e+01 2.901647e+01 2.895058e+01 2.889303e+01 2.891223e+01
26 Population ages 0-14, male 4.361333e+06 4.393616e+06 4.445335e+06 4.514702e+06 4.585999e+06 4.667807e+06
27 Population ages 0-14, female (% of total) 2.771110e+01 2.743799e+01 2.725245e+01 2.716502e+01 2.706964e+01 2.706237e+01
28 Population ages 0-14, female 4.144176e+06 4.166385e+06 4.207140e+06 4.266101e+06 4.324528e+06 4.395593e+06
29 Population ages 0-4, male (% of male population) 1.105702e+01 1.097338e+01 1.083421e+01 1.071369e+01 1.052543e+01 1.036063e+01
30 Population ages 0-4, female (% of female population) 1.033352e+01 1.022124e+01 1.006158e+01 9.932025e+00 9.728738e+00 9.573716e+00

If you want to see the whole data frame, you can execute the View function, which will open a new tab in R Studio:

View(data)

Let us look at the structure and summary of our data frame:

str(data)
## 'data.frame':    30 obs. of  7 variables:
##  $ Country.Code: chr  "Population ages 40-44, male (% of male population)" "Population ages 40-44, female (% of female population)" "Population ages 35-39, male (% of male population)" "Population ages 35-39, female (% of female population)" ...
##  $ X2011       : num  5.66 5.92 6.64 6.75 7.8 ...
##  $ X2012       : num  5.76 5.99 6.7 6.79 7.96 ...
##  $ X2013       : num  5.85 6.06 6.77 6.85 8.14 ...
##  $ X2014       : num  5.92 6.12 6.86 6.95 8.32 ...
##  $ X2015       : num  5.99 6.17 6.99 7.07 8.53 ...
##  $ X2016       : num  6.04 6.21 7.13 7.21 8.73 ...
summary(data)
##  Country.Code           X2011              X2012              X2013         
##  Length:30          Min.   :       6   Min.   :       6   Min.   :       6  
##  Class :character   1st Qu.:       9   1st Qu.:       9   1st Qu.:       9  
##  Mode  :character   Median :      11   Median :      10   Median :      10  
##                     Mean   : 1899114   Mean   : 1930550   Mean   : 1964132  
##                     3rd Qu.:      67   3rd Qu.:      67   3rd Qu.:      68  
##                     Max.   :19980985   Max.   :20398036   Max.   :20809283  
##      X2014              X2015              X2016         
##  Min.   :       6   Min.   :       6   Min.   :       6  
##  1st Qu.:       9   1st Qu.:       9   1st Qu.:       9  
##  Median :      10   Median :      10   Median :      10  
##  Mean   : 1998314   Mean   : 2031271   Mean   : 2062671  
##  3rd Qu.:      68   3rd Qu.:      68   3rd Qu.:      67  
##  Max.   :21193681   Max.   :21558308   Max.   :21876451
names(data)
## [1] "Country.Code" "X2011"        "X2012"        "X2013"        "X2014"       
## [6] "X2015"        "X2016"
dim(data)
## [1] 30  7

You can have a look at certain rows or columns by indexing their positions, or by providing column names:

data[1,]
Country.Code X2011 X2012 X2013 X2014 X2015 X2016
Population ages 40-44, male (% of male population) 5.664594 5.755805 5.848225 5.922188 5.988523 6.044512
data[10:12,]
Country.Code X2011 X2012 X2013 X2014 X2015 X2016
10 Population ages 20-24, female (% of female population) 1.057314e+01 1.047198e+01 1.029709e+01 1.004466e+01 9.742627e+00 9.376233e+00
11 Population ages 15-64 (% of total) 6.710771e+01 6.744669e+01 6.765552e+01 6.771382e+01 6.769146e+01 6.754660e+01
12 Population ages 15-64, total 1.998099e+07 2.039804e+07 2.080928e+07 2.119368e+07 2.155831e+07 2.187645e+07
data[,2:3]
X2011 X2012
5.664594e+00 5.755805e+00
5.917532e+00 5.989325e+00
6.642547e+00 6.697085e+00
6.749273e+00 6.792030e+00
7.802630e+00 7.964620e+00
7.815668e+00 7.981538e+00
9.530070e+00 9.728642e+00
9.415427e+00 9.585648e+00
1.101036e+01 1.096574e+01
1.057314e+01 1.047198e+01
6.710771e+01 6.744669e+01
1.998099e+07 2.039804e+07
6.685251e+01 6.718297e+01
9.907256e+06 1.011673e+07
6.736057e+01 6.770817e+01
1.007373e+07 1.028130e+07
1.074811e+01 1.025028e+01
1.013665e+01 9.695437e+00
9.233177e+00 8.876088e+00
8.753458e+00 8.417185e+00
9.139354e+00 9.327570e+00
8.624124e+00 8.799563e+00
2.856638e+01 2.830386e+01
8.505498e+06 8.559994e+06
2.942955e+01 2.917704e+01
4.361333e+06 4.393616e+06
2.771110e+01 2.743799e+01
4.144176e+06 4.166385e+06
1.105702e+01 1.097338e+01
1.033352e+01 1.022124e+01
data$X2012
##  [1] 5.755805e+00 5.989325e+00 6.697085e+00 6.792030e+00 7.964619e+00
##  [6] 7.981538e+00 9.728642e+00 9.585648e+00 1.096574e+01 1.047198e+01
## [11] 6.744669e+01 2.039804e+07 6.718298e+01 1.011673e+07 6.770817e+01
## [16] 1.028130e+07 1.025028e+01 9.695437e+00 8.876088e+00 8.417185e+00
## [21] 9.327570e+00 8.799563e+00 2.830386e+01 8.559994e+06 2.917704e+01
## [26] 4.393616e+06 2.743799e+01 4.166385e+06 1.097338e+01 1.022124e+01

You can index specific entries by providing both a row and a column index, or by indexing the element in a column vector:

data[2,2]
## [1] 5.917532
data$X2012[2]
## [1] 5.989325

3. Manipulate data

Let us convert the first column of the data frame to row names and check if it worked:

rownames(data) <- data[,1]
data <- data[,-1]  # or: data[,1] <- NULL
head(data)
X2011 X2012 X2013 X2014 X2015 X2016
Population ages 40-44, male (% of male population) 5.664594 5.755805 5.848225 5.922188 5.988523 6.044512
Population ages 40-44, female (% of female population) 5.917532 5.989325 6.063903 6.120755 6.172542 6.213516
Population ages 35-39, male (% of male population) 6.642547 6.697085 6.768352 6.864493 6.989650 7.130977
Population ages 35-39, female (% of female population) 6.749273 6.792030 6.854483 6.945114 7.069720 7.212531
Population ages 30-34, male (% of male population) 7.802630 7.964620 8.140442 8.324214 8.526776 8.732077
Population ages 30-34, female (% of female population) 7.815668 7.981538 8.156203 8.330237 8.516460 8.698230

As you can see, the entries in our data frame are not all comparable to each other, because some are absolute counts, and some are percentages:

rownames(data)
##  [1] "Population ages 40-44, male (% of male population)"    
##  [2] "Population ages 40-44, female (% of female population)"
##  [3] "Population ages 35-39, male (% of male population)"    
##  [4] "Population ages 35-39, female (% of female population)"
##  [5] "Population ages 30-34, male (% of male population)"    
##  [6] "Population ages 30-34, female (% of female population)"
##  [7] "Population ages 25-29, male (% of male population)"    
##  [8] "Population ages 25-29, female (% of female population)"
##  [9] "Population ages 20-24, male (% of male population)"    
## [10] "Population ages 20-24, female (% of female population)"
## [11] "Population ages 15-64 (% of total)"                    
## [12] "Population ages 15-64, total"                          
## [13] "Population ages 15-64, male (% of total)"              
## [14] "Population ages 15-64, male"                           
## [15] "Population ages 15-64, female (% of total)"            
## [16] "Population ages 15-64, female"                         
## [17] "Population ages 15-19, male (% of male population)"    
## [18] "Population ages 15-19, female (% of female population)"
## [19] "Population ages 10-14, male (% of male population)"    
## [20] "Population ages 10-14, female (% of female population)"
## [21] "Population ages 5-9, male (% of male population)"      
## [22] "Population ages 5-9, female (% of female population)"  
## [23] "Population ages 0-14 (% of total)"                     
## [24] "Population ages 0-14, total"                           
## [25] "Population ages 0-14, male (% of total)"               
## [26] "Population ages 0-14, male"                            
## [27] "Population ages 0-14, female (% of total)"             
## [28] "Population ages 0-14, female"                          
## [29] "Population ages 0-4, male (% of male population)"      
## [30] "Population ages 0-4, female (% of female population)"

So let us create two new data frames - one for percentages, and one for absolute counts, and check if it worked:

data_perc  <- data[c(1:11,13,15,17:23,25,27,29,30),]
data_count <- data[-c(1:11,13,15,17:23,25,27,29,30),]
rownames(data_perc)
##  [1] "Population ages 40-44, male (% of male population)"    
##  [2] "Population ages 40-44, female (% of female population)"
##  [3] "Population ages 35-39, male (% of male population)"    
##  [4] "Population ages 35-39, female (% of female population)"
##  [5] "Population ages 30-34, male (% of male population)"    
##  [6] "Population ages 30-34, female (% of female population)"
##  [7] "Population ages 25-29, male (% of male population)"    
##  [8] "Population ages 25-29, female (% of female population)"
##  [9] "Population ages 20-24, male (% of male population)"    
## [10] "Population ages 20-24, female (% of female population)"
## [11] "Population ages 15-64 (% of total)"                    
## [12] "Population ages 15-64, male (% of total)"              
## [13] "Population ages 15-64, female (% of total)"            
## [14] "Population ages 15-19, male (% of male population)"    
## [15] "Population ages 15-19, female (% of female population)"
## [16] "Population ages 10-14, male (% of male population)"    
## [17] "Population ages 10-14, female (% of female population)"
## [18] "Population ages 5-9, male (% of male population)"      
## [19] "Population ages 5-9, female (% of female population)"  
## [20] "Population ages 0-14 (% of total)"                     
## [21] "Population ages 0-14, male (% of total)"               
## [22] "Population ages 0-14, female (% of total)"             
## [23] "Population ages 0-4, male (% of male population)"      
## [24] "Population ages 0-4, female (% of female population)"
rownames(data_count)
## [1] "Population ages 15-64, total"  "Population ages 15-64, male"  
## [3] "Population ages 15-64, female" "Population ages 0-14, total"  
## [5] "Population ages 0-14, male"    "Population ages 0-14, female"

Let us proceed with the data frame data_count for now. The data might still not be in a convenient format, for example you may want to have the years as rows instead of columns, change the new row names to proper numbers, and shorten the column names:

data_count_t <- as.data.frame(t(data_count))    # transpose the dataframe
rownames(data_count_t) <- c(2011:2016)
names(data_count_t) <- c("total_15_64", "male_15_64", "female_15_64", "total_0_14", "male_0_14", "female_0_14")

head(data_count_t)
total_15_64 male_15_64 female_15_64 total_0_14 male_0_14 female_0_14
2011 19980985 9907256 10073725 8505498 4361333 4144176
2012 20398036 10116728 10281303 8559994 4393616 4166385
2013 20809283 10323217 10486061 8652474 4445335 4207140
2014 21193681 10516527 10677153 8780801 4514702 4266101
2015 21558308 10697841 10860468 8910534 4585999 4324528
2016 21876451 10857091 11019354 9063398 4667807 4395593

You may want to further split the data, either according to certain time spans, or with respect to gender:

data_2011_2013 <- data_count_t[1:3,]
data_2014_2016 <- data_count_t[4:6,]

data_male   <- data_count_t[,c(2,5)]
data_female <- data_count_t[,c(3,6)]
data_total  <- data_count_t[,c(1,4)]

If you want to join the tables again, you can use the rbind (bind rows) or cbind (bind columns) function:

data_restored1 <- rbind(data_2011_2013, data_2014_2016)
data_restored2 <- cbind(data_male, data_female, data_total)

(By the way, you can also bind matrices to an array using the abind function from the abind-package.)

Above, you selected the entries of the table by indexing the rows and columns with numbers. However, there might be cases when you need to do a more sophisticated query, or when there are too many columns or rows than you could select manually. You can do that using either the subset or which function. This will be very important during the whole course, so make sure you well understand the logic behind these queries! You may also want to check this site.

Let’s start with the subset function:

# select all rows where the column "total_15_64" is less than 10.000.000
subset(data_count_t, total_15_64 < 21000000)
total_15_64 male_15_64 female_15_64 total_0_14 male_0_14 female_0_14
2011 19980985 9907256 10073725 8505498 4361333 4144176
2012 20398036 10116728 10281303 8559994 4393616 4166385
2013 20809283 10323217 10486061 8652474 4445335 4207140

# include a second condition with an OR expression: select all rows where the column "total_15_64" is either less than 10.000.000 or greater than 20.000.000
test1 <- subset(data_count_t, total_15_64 < 20000000 | total_15_64 > 21000000)  
test1
total_15_64 male_15_64 female_15_64 total_0_14 male_0_14 female_0_14
2011 19980985 9907256 10073725 8505498 4361333 4144176
2014 21193681 10516527 10677153 8780801 4514702 4266101
2015 21558308 10697841 10860468 8910534 4585999 4324528
2016 21876451 10857091 11019354 9063398 4667807 4395593

# expand the previous example and only keep the columns "male_15_64" and "female_15_64"
test2 <- subset(data_count_t, total_15_64 < 20000000 | total_15_64 > 21000000, select=c(male_15_64, female_15_64))  
test2
male_15_64 female_15_64
2011 9907256 10073725
2014 10516527 10677153
2015 10697841 10860468
2016 10857091 11019354

Another way of subsetting is to use the which function. which produces a boolean vector, i.e. it compares a vector against a value and for every element in the vector, it returns TRUE or FALSE. When you use the which function in squared brackets to index elements, only those elements will be chosen whose position in the vector is the same as the positions of the TRUE values in the boolean vector.

boolean.vector <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
numeric.vector <- 1:5
numeric.vector[boolean.vector]
## [1] 1 3
which(numeric.vector <= 3)
## [1] 1 2 3

# define a vector of all past leap-years
leap_years <- seq(from=1960, to=2016, by=4)

# select only those entries where the row name is a leap-year
data_count_t[which(row.names(data_count_t) %in% leap_years), ]    
total_15_64 male_15_64 female_15_64 total_0_14 male_0_14 female_0_14
2012 20398036 10116728 10281303 8559994 4393616 4166385
2016 21876451 10857091 11019354 9063398 4667807 4395593

# select only those columns whose column names start with "male"
data_count_t[, which(substr(names(data_count_t), 1, 4) == "male")]  
male_15_64 male_0_14
2011 9907256 4361333
2012 10116728 4393616
2013 10323217 4445335
2014 10516527 4514702
2015 10697841 4585999
2016 10857091 4667807

The last line of code might seem quite confusing because there are a lot of brackets. In such cases, it helps to read the code from inside to outside. So, let’s decompose line 191:

names(data_count_t)
## [1] "total_15_64"  "male_15_64"   "female_15_64" "total_0_14"   "male_0_14"   
## [6] "female_0_14"
substr(names(data_count_t), 1, 4)
## [1] "tota" "male" "fema" "tota" "male" "fema"
which(substr(names(data_count_t), 1, 4) == "male")
## [1] 2 5
data_count_t[, which(substr(names(data_count_t), 1, 4) == "male")]
male_15_64 male_0_14
2011 9907256 4361333
2012 10116728 4393616
2013 10323217 4445335
2014 10516527 4514702
2015 10697841 4585999
2016 10857091 4667807

Now let us do some calculations. As you learned in the previous script, columns and rows behave like vectors, so you can perform mathematical operations on them. Assume you want to calculate the sum of both age classes in data_total. You can do this the following way:

data_total[,3] <- data_total[,1] + data_total[,2]
names(data_total)[3] <- "total_0_64"

But, let’s undo that.

data_total[,3] <- NULL

However, imagine you have hundreds of columns that you want to sum up - then you would need some other code to simplify the expression. What you need is the apply function. The second argument in the apply function can take the value 1 or 2, depending on whether you want to apply your calculation to rows (1) or columns (2):

data_total[,3]  <- apply(data_total, 1, sum)   # calculate for each row (1) the sum of both age classes
data_total[7,] <- apply(data_total, 2, mean)  # calculate for each column (2) the mean over all years

names(data_total)[3] <- "sum of both age classes"
rownames(data_total)[7] <- "mean of all years"

head(data_total)
total_15_64 total_0_14 sum of both age classes
2011 19980985 8505498 28486483
2012 20398036 8559994 28958030
2013 20809283 8652474 29461757
2014 21193681 8780801 29974482
2015 21558308 8910534 30468842
2016 21876451 9063398 30939849
tail(data_total)
total_15_64 total_0_14 sum of both age classes
2012 20398036 8559994 28958030
2013 20809283 8652474 29461757
2014 21193681 8780801 29974482
2015 21558308 8910534 30468842
2016 21876451 9063398 30939849
mean of all years 20969457 8745450 29714907

For more variations of the apply function, please have a look here.


4. Export files

Exporting files is analogous to importing them - your working directory defines where the file will be stored by default. Specify the desired folder path in the write function. When you export a CSV, you should use write.csv; for text files, use write.table. In these functions, you first have to specify the name of the object that you want to export, then the name of the new file, and then it is often very convenient to set rownames to FALSE so there will be no default numbering added to the rows. As with the read-functions, in the help-section you can find more arguments that you might need to specify.

You can either write the whole file path in the write.csv function, or you define the working directory before with the setwd function:

write.csv(data_male, "F:/IAMO_B/2022_DSIK/eLearning/Module_0_R-Basics/data/data_male.csv", row.names=T)
write.table(data_female, "F:/IAMO_B/2022_DSIK/eLearning/Module_0_R-Basics/data/data_female.txt", row.names=F)

help(write.table)
## starting httpd help server ... done

setwd("F:/IAMO_B/2022_DSIK/eLearning/Module_0_R-Basics/data/")
write.csv(data_male, "data_male.csv", row.names=T)
write.table(data_female, "data_female.txt", row.names=F)

That’s it! In the following scripts, you will always have to load and export files, and manipulate data in one or the other way, so please make sure that you well understand the tools presented to you in this script - you will need them every time that you work in R!