In the last script, you learned the basic operations and functions in R. In this script, you will learn how to work with datasets in R. The learning objectives of this script are:

Know how to read (import) and write (export) files
Know how to explore data
Know how to manipulate data

1. Import files

Let’s work with the file “Demographics_Uzbekistan_Worldbank.csv”. The data was originally obtained from here: https://data.worldbank.org/country/uzbekistan This file gives an overview on the age structure of the Uzbek population.

For convenience, you should best store this csv-file in the folder where this script you are reading right now is stored, and all other files that you will export should also be saved in that folder.

Now let us load the file. As it is a CSV-file, you can use the read.csv function, and immediately assign a name to your table so it is imported as a data frame in R. If the file was for example a TXT file, you would have to use the function read.table instead. Try to avoid importing Excel-Files, but rather convert them to CSV or TXT before! Once you have typed read.csv(", you can press the TAB key, and a list with all files in the currently set working directory will appear and you can choose your file from that list - that is how you do not have to type the entire file name yourself! With the argument sep, you specify which character is used in the CSV file to separate columns. You may also want to specify that the file has a header with the argument header, to be sure that the first line is read as columns names and not as data. For more arguments that can be specified in the read.csv or read.table function, please consult the R help section.

You can either write the whole file path in the read.csv function, or you define the working directory before with the setwd function:

data <- read.csv("F:/IAMO_B/2022_DSIK/eLearning/Module_0_R-Basics/data/Demographics_Uzbekistan_Worldbank.csv", sep=";", header=T)

setwd("F:/IAMO_B/2022_DSIK/eLearning/Module_0_R-Basics/data/")
data <- read.csv("Demographics_Uzbekistan_Worldbank.csv", sep=";", header=T)

2. Explore data

You can have a look at our data with the head function. By default, you will see the first six rows of the data frame, but you can also change that number, or view the last entries of the data frame with the tail function instead:

head(data)

Country.Code	X2011	X2012	X2013	X2014	X2015	X2016
Population ages 40-44, male (% of male population)	5.664594	5.755805	5.848225	5.922188	5.988523	6.044512
Population ages 40-44, female (% of female population)	5.917532	5.989325	6.063903	6.120755	6.172542	6.213516
Population ages 35-39, male (% of male population)	6.642547	6.697085	6.768352	6.864493	6.989650	7.130977
Population ages 35-39, female (% of female population)	6.749273	6.792030	6.854483	6.945114	7.069720	7.212531
Population ages 30-34, male (% of male population)	7.802630	7.964620	8.140442	8.324214	8.526776	8.732077
Population ages 30-34, female (% of female population)	7.815668	7.981538	8.156203	8.330237	8.516460	8.698230

head(data, n=2)

Country.Code	X2011	X2012	X2013	X2014	X2015	X2016
Population ages 40-44, male (% of male population)	5.664594	5.755805	5.848225	5.922188	5.988523	6.044512
Population ages 40-44, female (% of female population)	5.917532	5.989325	6.063903	6.120755	6.172542	6.213516

tail(data)

	Country.Code	X2011	X2012	X2013	X2014	X2015	X2016
25	Population ages 0-14, male (% of total)	2.942955e+01	2.917704e+01	2.901647e+01	2.895058e+01	2.889303e+01	2.891223e+01
26	Population ages 0-14, male	4.361333e+06	4.393616e+06	4.445335e+06	4.514702e+06	4.585999e+06	4.667807e+06
27	Population ages 0-14, female (% of total)	2.771110e+01	2.743799e+01	2.725245e+01	2.716502e+01	2.706964e+01	2.706237e+01
28	Population ages 0-14, female	4.144176e+06	4.166385e+06	4.207140e+06	4.266101e+06	4.324528e+06	4.395593e+06
29	Population ages 0-4, male (% of male population)	1.105702e+01	1.097338e+01	1.083421e+01	1.071369e+01	1.052543e+01	1.036063e+01
30	Population ages 0-4, female (% of female population)	1.033352e+01	1.022124e+01	1.006158e+01	9.932025e+00	9.728738e+00	9.573716e+00

If you want to see the whole data frame, you can execute the View function, which will open a new tab in R Studio:

View(data)

Let us look at the structure and summary of our data frame:

str(data)
## 'data.frame':    30 obs. of  7 variables:
##  $ Country.Code: chr  "Population ages 40-44, male (% of male population)" "Population ages 40-44, female (% of female population)" "Population ages 35-39, male (% of male population)" "Population ages 35-39, female (% of female population)" ...
##  $ X2011       : num  5.66 5.92 6.64 6.75 7.8 ...
##  $ X2012       : num  5.76 5.99 6.7 6.79 7.96 ...
##  $ X2013       : num  5.85 6.06 6.77 6.85 8.14 ...
##  $ X2014       : num  5.92 6.12 6.86 6.95 8.32 ...
##  $ X2015       : num  5.99 6.17 6.99 7.07 8.53 ...
##  $ X2016       : num  6.04 6.21 7.13 7.21 8.73 ...
summary(data)
##  Country.Code           X2011              X2012              X2013         
##  Length:30          Min.   :       6   Min.   :       6   Min.   :       6  
##  Class :character   1st Qu.:       9   1st Qu.:       9   1st Qu.:       9  
##  Mode  :character   Median :      11   Median :      10   Median :      10  
##                     Mean   : 1899114   Mean   : 1930550   Mean   : 1964132  
##                     3rd Qu.:      67   3rd Qu.:      67   3rd Qu.:      68  
##                     Max.   :19980985   Max.   :20398036   Max.   :20809283  
##      X2014              X2015              X2016         
##  Min.   :       6   Min.   :       6   Min.   :       6  
##  1st Qu.:       9   1st Qu.:       9   1st Qu.:       9  
##  Median :      10   Median :      10   Median :      10  
##  Mean   : 1998314   Mean   : 2031271   Mean   : 2062671  
##  3rd Qu.:      68   3rd Qu.:      68   3rd Qu.:      67  
##  Max.   :21193681   Max.   :21558308   Max.   :21876451
names(data)
## [1] "Country.Code" "X2011"        "X2012"        "X2013"        "X2014"       
## [6] "X2015"        "X2016"
dim(data)
## [1] 30  7

You can have a look at certain rows or columns by indexing their positions, or by providing column names:

data[1,]

Country.Code	X2011	X2012	X2013	X2014	X2015	X2016
Population ages 40-44, male (% of male population)	5.664594	5.755805	5.848225	5.922188	5.988523	6.044512

data[10:12,]

	Country.Code	X2011	X2012	X2013	X2014	X2015	X2016
10	Population ages 20-24, female (% of female population)	1.057314e+01	1.047198e+01	1.029709e+01	1.004466e+01	9.742627e+00	9.376233e+00
11	Population ages 15-64 (% of total)	6.710771e+01	6.744669e+01	6.765552e+01	6.771382e+01	6.769146e+01	6.754660e+01
12	Population ages 15-64, total	1.998099e+07	2.039804e+07	2.080928e+07	2.119368e+07	2.155831e+07	2.187645e+07

data[,2:3]

X2011	X2012
5.664594e+00	5.755805e+00
5.917532e+00	5.989325e+00
6.642547e+00	6.697085e+00
6.749273e+00	6.792030e+00
7.802630e+00	7.964620e+00
7.815668e+00	7.981538e+00
9.530070e+00	9.728642e+00
9.415427e+00	9.585648e+00
1.101036e+01	1.096574e+01
1.057314e+01	1.047198e+01
6.710771e+01	6.744669e+01
1.998099e+07	2.039804e+07
6.685251e+01	6.718297e+01
9.907256e+06	1.011673e+07
6.736057e+01	6.770817e+01
1.007373e+07	1.028130e+07
1.074811e+01	1.025028e+01
1.013665e+01	9.695437e+00
9.233177e+00	8.876088e+00
8.753458e+00	8.417185e+00
9.139354e+00	9.327570e+00
8.624124e+00	8.799563e+00
2.856638e+01	2.830386e+01
8.505498e+06	8.559994e+06
2.942955e+01	2.917704e+01
4.361333e+06	4.393616e+06
2.771110e+01	2.743799e+01
4.144176e+06	4.166385e+06
1.105702e+01	1.097338e+01
1.033352e+01	1.022124e+01

data$X2012
##  [1] 5.755805e+00 5.989325e+00 6.697085e+00 6.792030e+00 7.964619e+00
##  [6] 7.981538e+00 9.728642e+00 9.585648e+00 1.096574e+01 1.047198e+01
## [11] 6.744669e+01 2.039804e+07 6.718298e+01 1.011673e+07 6.770817e+01
## [16] 1.028130e+07 1.025028e+01 9.695437e+00 8.876088e+00 8.417185e+00
## [21] 9.327570e+00 8.799563e+00 2.830386e+01 8.559994e+06 2.917704e+01
## [26] 4.393616e+06 2.743799e+01 4.166385e+06 1.097338e+01 1.022124e+01

You can index specific entries by providing both a row and a column index, or by indexing the element in a column vector:

data[2,2]
## [1] 5.917532
data$X2012[2]
## [1] 5.989325

3. Manipulate data

Let us convert the first column of the data frame to row names and check if it worked:

rownames(data) <- data[,1]
data <- data[,-1]  # or: data[,1] <- NULL
head(data)

	X2011	X2012	X2013	X2014	X2015	X2016
Population ages 40-44, male (% of male population)	5.664594	5.755805	5.848225	5.922188	5.988523	6.044512
Population ages 40-44, female (% of female population)	5.917532	5.989325	6.063903	6.120755	6.172542	6.213516
Population ages 35-39, male (% of male population)	6.642547	6.697085	6.768352	6.864493	6.989650	7.130977
Population ages 35-39, female (% of female population)	6.749273	6.792030	6.854483	6.945114	7.069720	7.212531
Population ages 30-34, male (% of male population)	7.802630	7.964620	8.140442	8.324214	8.526776	8.732077
Population ages 30-34, female (% of female population)	7.815668	7.981538	8.156203	8.330237	8.516460	8.698230

As you can see, the entries in our data frame are not all comparable to each other, because some are absolute counts, and some are percentages:

rownames(data)
##  [1] "Population ages 40-44, male (% of male population)"    
##  [2] "Population ages 40-44, female (% of female population)"
##  [3] "Population ages 35-39, male (% of male population)"    
##  [4] "Population ages 35-39, female (% of female population)"
##  [5] "Population ages 30-34, male (% of male population)"    
##  [6] "Population ages 30-34, female (% of female population)"
##  [7] "Population ages 25-29, male (% of male population)"    
##  [8] "Population ages 25-29, female (% of female population)"
##  [9] "Population ages 20-24, male (% of male population)"    
## [10] "Population ages 20-24, female (% of female population)"
## [11] "Population ages 15-64 (% of total)"                    
## [12] "Population ages 15-64, total"                          
## [13] "Population ages 15-64, male (% of total)"              
## [14] "Population ages 15-64, male"                           
## [15] "Population ages 15-64, female (% of total)"            
## [16] "Population ages 15-64, female"                         
## [17] "Population ages 15-19, male (% of male population)"    
## [18] "Population ages 15-19, female (% of female population)"
## [19] "Population ages 10-14, male (% of male population)"    
## [20] "Population ages 10-14, female (% of female population)"
## [21] "Population ages 5-9, male (% of male population)"      
## [22] "Population ages 5-9, female (% of female population)"  
## [23] "Population ages 0-14 (% of total)"                     
## [24] "Population ages 0-14, total"                           
## [25] "Population ages 0-14, male (% of total)"               
## [26] "Population ages 0-14, male"                            
## [27] "Population ages 0-14, female (% of total)"             
## [28] "Population ages 0-14, female"                          
## [29] "Population ages 0-4, male (% of male population)"      
## [30] "Population ages 0-4, female (% of female population)"

So let us create two new data frames - one for percentages, and one for absolute counts, and check if it worked:

data_perc  <- data[c(1:11,13,15,17:23,25,27,29,30),]
data_count <- data[-c(1:11,13,15,17:23,25,27,29,30),]
rownames(data_perc)
##  [1] "Population ages 40-44, male (% of male population)"    
##  [2] "Population ages 40-44, female (% of female population)"
##  [3] "Population ages 35-39, male (% of male population)"    
##  [4] "Population ages 35-39, female (% of female population)"
##  [5] "Population ages 30-34, male (% of male population)"    
##  [6] "Population ages 30-34, female (% of female population)"
##  [7] "Population ages 25-29, male (% of male population)"    
##  [8] "Population ages 25-29, female (% of female population)"
##  [9] "Population ages 20-24, male (% of male population)"    
## [10] "Population ages 20-24, female (% of female population)"
## [11] "Population ages 15-64 (% of total)"                    
## [12] "Population ages 15-64, male (% of total)"              
## [13] "Population ages 15-64, female (% of total)"            
## [14] "Population ages 15-19, male (% of male population)"    
## [15] "Population ages 15-19, female (% of female population)"
## [16] "Population ages 10-14, male (% of male population)"    
## [17] "Population ages 10-14, female (% of female population)"
## [18] "Population ages 5-9, male (% of male population)"      
## [19] "Population ages 5-9, female (% of female population)"  
## [20] "Population ages 0-14 (% of total)"                     
## [21] "Population ages 0-14, male (% of total)"               
## [22] "Population ages 0-14, female (% of total)"             
## [23] "Population ages 0-4, male (% of male population)"      
## [24] "Population ages 0-4, female (% of female population)"
rownames(data_count)
## [1] "Population ages 15-64, total"  "Population ages 15-64, male"  
## [3] "Population ages 15-64, female" "Population ages 0-14, total"  
## [5] "Population ages 0-14, male"    "Population ages 0-14, female"

Let us proceed with the data frame data_count for now. The data might still not be in a convenient format, for example you may want to have the years as rows instead of columns, change the new row names to proper numbers, and shorten the column names:

data_count_t <- as.data.frame(t(data_count))    # transpose the dataframe
rownames(data_count_t) <- c(2011:2016)
names(data_count_t) <- c("total_15_64", "male_15_64", "female_15_64", "total_0_14", "male_0_14", "female_0_14")

head(data_count_t)

	total_15_64	male_15_64	female_15_64	total_0_14	male_0_14	female_0_14
2011	19980985	9907256	10073725	8505498	4361333	4144176
2012	20398036	10116728	10281303	8559994	4393616	4166385
2013	20809283	10323217	10486061	8652474	4445335	4207140
2014	21193681	10516527	10677153	8780801	4514702	4266101
2015	21558308	10697841	10860468	8910534	4585999	4324528
2016	21876451	10857091	11019354	9063398	4667807	4395593

You may want to further split the data, either according to certain time spans, or with respect to gender:

data_2011_2013 <- data_count_t[1:3,]
data_2014_2016 <- data_count_t[4:6,]

data_male   <- data_count_t[,c(2,5)]
data_female <- data_count_t[,c(3,6)]
data_total  <- data_count_t[,c(1,4)]

If you want to join the tables again, you can use the rbind (bind rows) or cbind (bind columns) function:

data_restored1 <- rbind(data_2011_2013, data_2014_2016)
data_restored2 <- cbind(data_male, data_female, data_total)

(By the way, you can also bind matrices to an array using the abind function from the abind-package.)

Above, you selected the entries of the table by indexing the rows and columns with numbers. However, there might be cases when you need to do a more sophisticated query, or when there are too many columns or rows than you could select manually. You can do that using either the subset or which function. This will be very important during the whole course, so make sure you well understand the logic behind these queries! You may also want to check this site.

Let’s start with the subset function:

# select all rows where the column "total_15_64" is less than 10.000.000
subset(data_count_t, total_15_64 < 21000000)

	total_15_64	male_15_64	female_15_64	total_0_14	male_0_14	female_0_14
2011	19980985	9907256	10073725	8505498	4361333	4144176
2012	20398036	10116728	10281303	8559994	4393616	4166385
2013	20809283	10323217	10486061	8652474	4445335	4207140


# include a second condition with an OR expression: select all rows where the column "total_15_64" is either less than 10.000.000 or greater than 20.000.000
test1 <- subset(data_count_t, total_15_64 < 20000000 | total_15_64 > 21000000)  
test1

	total_15_64	male_15_64	female_15_64	total_0_14	male_0_14	female_0_14
2011	19980985	9907256	10073725	8505498	4361333	4144176
2014	21193681	10516527	10677153	8780801	4514702	4266101
2015	21558308	10697841	10860468	8910534	4585999	4324528
2016	21876451	10857091	11019354	9063398	4667807	4395593


# expand the previous example and only keep the columns "male_15_64" and "female_15_64"
test2 <- subset(data_count_t, total_15_64 < 20000000 | total_15_64 > 21000000, select=c(male_15_64, female_15_64))  
test2

	male_15_64	female_15_64
2011	9907256	10073725
2014	10516527	10677153
2015	10697841	10860468
2016	10857091	11019354

Another way of subsetting is to use the which function. which produces a boolean vector, i.e. it compares a vector against a value and for every element in the vector, it returns TRUE or FALSE. When you use the which function in squared brackets to index elements, only those elements will be chosen whose position in the vector is the same as the positions of the TRUE values in the boolean vector.

boolean.vector <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
numeric.vector <- 1:5
numeric.vector[boolean.vector]
## [1] 1 3
which(numeric.vector <= 3)
## [1] 1 2 3

# define a vector of all past leap-years
leap_years <- seq(from=1960, to=2016, by=4)

# select only those entries where the row name is a leap-year
data_count_t[which(row.names(data_count_t) %in% leap_years), ]

	total_15_64	male_15_64	female_15_64	total_0_14	male_0_14	female_0_14
2012	20398036	10116728	10281303	8559994	4393616	4166385
2016	21876451	10857091	11019354	9063398	4667807	4395593


# select only those columns whose column names start with "male"
data_count_t[, which(substr(names(data_count_t), 1, 4) == "male")]

	male_15_64	male_0_14
2011	9907256	4361333
2012	10116728	4393616
2013	10323217	4445335
2014	10516527	4514702
2015	10697841	4585999
2016	10857091	4667807

The last line of code might seem quite confusing because there are a lot of brackets. In such cases, it helps to read the code from inside to outside. So, let’s decompose line 191:

names(data_count_t)
## [1] "total_15_64"  "male_15_64"   "female_15_64" "total_0_14"   "male_0_14"   
## [6] "female_0_14"
substr(names(data_count_t), 1, 4)
## [1] "tota" "male" "fema" "tota" "male" "fema"
which(substr(names(data_count_t), 1, 4) == "male")
## [1] 2 5
data_count_t[, which(substr(names(data_count_t), 1, 4) == "male")]

	male_15_64	male_0_14
2011	9907256	4361333
2012	10116728	4393616
2013	10323217	4445335
2014	10516527	4514702
2015	10697841	4585999
2016	10857091	4667807

Now let us do some calculations. As you learned in the previous script, columns and rows behave like vectors, so you can perform mathematical operations on them. Assume you want to calculate the sum of both age classes in data_total. You can do this the following way:

data_total[,3] <- data_total[,1] + data_total[,2]
names(data_total)[3] <- "total_0_64"

But, let’s undo that.

data_total[,3] <- NULL

However, imagine you have hundreds of columns that you want to sum up - then you would need some other code to simplify the expression. What you need is the apply function. The second argument in the apply function can take the value 1 or 2, depending on whether you want to apply your calculation to rows (1) or columns (2):

data_total[,3]  <- apply(data_total, 1, sum)   # calculate for each row (1) the sum of both age classes
data_total[7,] <- apply(data_total, 2, mean)  # calculate for each column (2) the mean over all years

names(data_total)[3] <- "sum of both age classes"
rownames(data_total)[7] <- "mean of all years"

head(data_total)

	total_15_64	total_0_14	sum of both age classes
2011	19980985	8505498	28486483
2012	20398036	8559994	28958030
2013	20809283	8652474	29461757
2014	21193681	8780801	29974482
2015	21558308	8910534	30468842
2016	21876451	9063398	30939849

tail(data_total)

	total_15_64	total_0_14	sum of both age classes
2012	20398036	8559994	28958030
2013	20809283	8652474	29461757
2014	21193681	8780801	29974482
2015	21558308	8910534	30468842
2016	21876451	9063398	30939849
mean of all years	20969457	8745450	29714907

For more variations of the apply function, please have a look here.

4. Export files

Exporting files is analogous to importing them - your working directory defines where the file will be stored by default. Specify the desired folder path in the write function. When you export a CSV, you should use write.csv; for text files, use write.table. In these functions, you first have to specify the name of the object that you want to export, then the name of the new file, and then it is often very convenient to set rownames to FALSE so there will be no default numbering added to the rows. As with the read-functions, in the help-section you can find more arguments that you might need to specify.

You can either write the whole file path in the write.csv function, or you define the working directory before with the setwd function:

write.csv(data_male, "F:/IAMO_B/2022_DSIK/eLearning/Module_0_R-Basics/data/data_male.csv", row.names=T)
write.table(data_female, "F:/IAMO_B/2022_DSIK/eLearning/Module_0_R-Basics/data/data_female.txt", row.names=F)

help(write.table)
## starting httpd help server ... done

setwd("F:/IAMO_B/2022_DSIK/eLearning/Module_0_R-Basics/data/")
write.csv(data_male, "data_male.csv", row.names=T)
write.table(data_female, "data_female.txt", row.names=F)

That’s it! In the following scripts, you will always have to load and export files, and manipulate data in one or the other way, so please make sure that you well understand the tools presented to you in this script - you will need them every time that you work in R!

Module 0: Introduction to R and R Studio
Script 02: File and data exploration

Max Hofmann

07 June 2022

1. Import files

2. Explore data

3. Manipulate data

4. Export files

Module 0: Introduction to R and R Studio Script 02: File and data exploration

Max Hofmann

07 June 2022

1. Import files

2. Explore data

3. Manipulate data

4. Export files

Module 0: Introduction to R and R Studio
Script 02: File and data exploration