Sponsors: SPS Department

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Content and Organization of the workshop

Introduction to R computing
- taught by Sphend Kursani: Shpend.Kursani@eui.eu
- April 9th, 9:00 - 13:00, Seminar Room 2 at Badia Fiesolana

Data management I: Shaping and creating different datasets and variables
- taught by Adri?n del Rio: Adrian.delRioRodriguez@eui.eu
- April 11th, 14:00 - 18:00, Seminar Room at Villa Paola

Data management II: Relational Data and regular expressions
- taught by Adri?n del Rio: Adrian.delRioRodriguez@eui.eu
- April 12th, 14:00 - 18:00, Seminar Room at Villa Paola

Tutorial days: contact us via email at any time!

Credits: Nope, but fun and interesting ideas will come up!

The script (= slides):
- …is html (links!)
- use ↑ to get back to the top

General goals:
- being able to read, think and write following the R logic
- tips to work with data in a transparent and reproducible manner
- techniques to get your data into a tidy form
- …Save time, less stress!
- for the future…case studies?

Content of Data management I and II

Debate in data managment:

Procedure in data managment:
- The script: ethics and practice
- The data: ethics and practice

A grammar for data managment:
- Programming using DPLYR and pipes
- Separate and Unite
- Spread and Gather

Tips for a tidy dataset:
- The time dimension of datasets and the date
- Missing values: identification and possible correction
- Unique ids and its use Anonymization
- Duplicated observations

A toolkit for working with strings:
- Regular expressions and functions

Relational data:
- The logic data joins in R
- Inner and outer joins

Anything else?:

Goals in Data managment 1

Introduction to R studio connect: save and share scripts online

+ ....to avoid [this](https://bookdown.org/connect/#/apps/819/access), but rather...

Apply techniques to manipulate and re-shape datasets in a consistent, complete and informative manner
- …tidyverse package

Identify and change the time dimension of datasets
- Introduction to time data-types and its use to expand datasets or create variables (e.g. age)

Evaluate and correct errors in datasets
- Missing values, creation of ids and the problem of duplicated observations

##Recap from yesterday’s session

Quiz: the basic grammar of data managment

+ ```filter()```: ?
+ ```arrange()```: ?
+ ```select()```: ?
+ ```distinct()```: ?
+ ```mutate()```: Add new columns to existing columns 
+ ```summarise()```: Collapses a data frame to a single row (e.g. aggregation)
+ ```group_by()```:  ?
+ ```na.omit()```: ?

Workflow example 1: getting tidy dataset

Import the dataset ‘gdp.csv’ in R.
Change the layout of the data in a way that one column refers to country, year, gdp
Create a new column named ‘label’. This column should contain the country abbreviation for each observation.
Move forward the first value of each country-group and delate the first row of each country-group
save the dataset

library(tidyverse)
library(stringr)

#0
gdp<- read.csv("./data/gdp/gdp.csv", stringsAsFactors =  FALSE)[,-1]


#1
gdp <- gdp %>% gather(2:170, key = country, value = GDP, na.rm= TRUE, convert =  TRUE)

#2 
gdp$label<- gdp$GDP
gdp$label <- str_replace_all(gdp$label, "[0-9]{1,10}", "") # creates implict missing values! 
gdp$label[gdp$label == ""]<- NA

# 3

gdp <-fill(gdp, label) # to move forward values, and fill NA in this way 

gdp <- gdp %>% 
    group_by(country) %>% 
    slice(-1) # filter by row position OR

gdp_useless <- gdp %>% 
    group_by(country) %>% 
    slice(1) # 

gdp <- gdp %>% 
    group_by(country) %>% 
    filter(cgdppc != "year") # filter by row position OR


# extras: nice presentation
names(gdp)[1:3] <- c("year", "country", "gdp")

# 4 
write.csv(gdp, "gdpnew.csv")

Goals in Data managment 2

Develop strategies to manipulate characther vectors
- understand and translate word-patterns into regular expressions
- … functions from stringr package
- apply in the context of text as data (online html tables and pdfs)

Apply strategies to merge datasets
- understand the methods of data joins in R and work with relational data
- strategies to control for a number of data errors when merging datasets

Packages that we will use today!

install.packages(c("tidyverse", "foreign", "stringr", "haven", "openxlsx", "readxl", "eeptools", "tm", "quanteda", "pdftools", "rvest", "xml2"))

1. Debate about data managment ↑

From excel and codebook to Data Managment Plan ‘era’
- For funding and publication (1, 2), organizations ask researchers to demostrate their organizational, procedural and technical trustworthiness

Move beyond DMP as a short document…

+ How data is generated and/or sourced
+ __How data is used, elaborated and organised__
+ How data, and data subjects, are protected
+ How data, tools and ancillary elements are described and documented
+ How data is stored and secured, and how long it will be retained
+ How data authorship and credit are assigned
+ How data is preserved
+ How, whether, and under what terms, research data outputs can be shared

but, show these competences in the every-day-tasks when manipulating data

2. Procedure in data managment ↑

Focus:
- The art of getting your data into R in a useful form for visualisation and modelling

The process:
- Import data (not covered): in this workshop plain-text rectangular formats
- Tidy: consistent way of storing your data in a clean way
- Transform: create and shape the data for the analysis

2.1. The script: ethics and practice

Our motto: Will you understand how the data was organized, elaborated and used for your research in one year?

Our practice:

<p line-height: 1.5em;>1. Structure your mind properly

  + ...Figure out the logical structure of what to do (write it in a piece of paper)
  + ...Describe those tasks in the form of a computer package (dplyr in our case)
  + ...Execute codes (first in a sample, if not sure about the outcome)

<p line-height: 1.5em;> 2. Structure your code properly

   + Headlines
  + [Indentation](http://adv-r.had.co.nz/Style.html)
   + Write robust syntax (e.g. relative paths)

<p line-height: 1.5em;> 3. Comment your code

    + Data sources, data manipulations, steps of analysis
      + ...commenting increases readability and reproducability
    + but, you can't comment too much
    + ... __RMarkdown__ really helpful

What is Markdown? a notebook interface to weave together codes and text
- Useful? reproduction, automated reports in any format, control (analysis<-> report)
- Free? please…click here
- My experience? collection of data on elite defection, change layout of a panel data to a survival analysis, and visualization
  - … and share online my scripts and results: R studio connect

Today
- …We learn basic use.
- …And how to uploaded in RStudio Connect, others use Github.

Another day
- Encoding, github, Yarl, css…potential topics

# Get Rmarkdown installed
install.packages("rmarkdown")

# Generate a new file, check names and format of the file
    # White part = the space of text --> Information of the steps of your analysis
    # Grey part = R code as a separate paragraph--> analysis and specific comments about a command

#How does it work?
    # treat it as a normal script, at the end of your work, you can click the knit button and get the file you want. 

# get Rstudio Connect 
 # 1) go to the eye looking symbol right to the "run" button 
 # 2)click: "manage account" and add this to the server: https://bookdown.org
 # 3) follow the instructions to have access to Rs connect 
 # 4) log in your account

How can I learn more? here some links for Latex-Rmarkdown, 1,2 and html-Rmarkdown3

2.2. Data: ethics and practice

Our motto:
- Data should be consistent, complete and informative
- There is beauty in simplicity

Our practice:
- Simple codes, always better
- Give to the vector its suitable class and labels
- Separate data entry from coding as much as possible
- Divide complex tasks into several small steps; and if possible, perfom tasks automatically
- Create vectors and routines that enables the detection of errors
- Perform double-entry of the data to avoid unsolvable mistakes
- all in all, today’s tricks to save time when preparing your data for analysis

Install packages for today’s session (if you don’t have them!)

install.packages(c("tidyverse", "foreign", "stringr", "haven", "openxlsx", "readxl", "eeptools", "tm", "quanteda", "pdftools", "naniar", "rvest", "xml2"))

Import the data of today’s session

The breakdown of authoritarian regimes
- The number of countries named as authoritarian and their duration
- Types of autocracies according to the organization from which the ruler belong
- Types of regime failure
- see page 15 of the codebook for further information

# packages
library(tidyverse)
library(readxl)
library(openxlsx)
library(haven)
library(stringr)
 
# import daset 

gwf <- read.csv("./data/regime type/autocracydata.csv" , stringsAsFactors = FALSE) # change working directory
View(gwf)

Question: how do we know that this dataset is informative and complete? any thoughts?

3. A grammar for data managment ↑

A “new” package dplyr written by Hadley Wickham/Romain Francois replaces many old functions for data management

Functions in dplyr are highly performant (big data!) and consistent

See this page for an excellent overview and the Data Wrangling Cheat Sheet
- Functions
  - filter(): Select a subset of the rows of a data frame
  - arrange(): Reorders the rows of a data frame
  - select(): Selects columns
  - distinct(): Returns unique values in a table
  - mutate(): Add new columns to existing columns
  - summarise(): Collapses a data frame to a single row (e.g. aggregation)
  - group_by(): Break data set into groups (of rows), to apply above functions to each group

Programming
- %>% this is a pipe operator which means “nested in”, so you can encadenate functions
- quite useful for summary statistics and group comparisons

Our goal: obtain a short list of regime cases and their duration

Select and filter rows by position

select(): select columns while filter() select rows (e.g. countries, individuals etc.)

The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame

# Task: select all the columns, excepting the first one that refers to the number of rows

shortgwf <- select(gwf, -c(X))  # conversely select(gwf, cowcode:gwf_monarch)

shortgwf <- gwf %>% select(-X) %>% filter(year == 1995)


# select(dataset, columns) in columns you can add the name of the vector or its position. Positive values means the column will be considered, while negative means the column would not be considered

# sometimes, writting the column names do not work. To solve the problem write the name of the columns = `name`.

# Task: filter the dataset by those regimes that are monarchies

shortgwf <- filter(shortgwf, gwf_monarch != 1) # != not equal to

# Question: can I combine both functions? Yes, pipe them

shortgwf <- gwf %>% filter(gwf_monarch != 1) %>%  select(-X)

# how is the old-school way to do this? 


# shortgwf<- gwf[, -1 ]
# shortgwf<- gwf[gwf$gwf_monarch != 1, ]

Arrange

arrange(): reorder rows

# task: common practice...datasets in an alphabetical order 

colnames(shortgwf)
shortgwf <- gwf %>% arrange(gwf_country)# write the name order you want wihtin arrange() or their position 

# what if I want to organize the dataset by the regime type first?
shortgwf <- gwf %>% arrange(desc(gwf_country))

# Q: is equivalent the following function to the sum of previous command?

shortgwf <- gwf %>% arrange(gwf_country, desc(year))

# how is the old-school way to do this? 
#shortgwf <- shortgwf[order(shortgwf$gwf_country, desc(shortgwf$gwf_regimetype)), ]

Add new (transformed) columns

mutate(): transformation of columns executed line-by-line

mutate_all(): transformation of all the columns of a dataset. Any idea about its application?

mutate_at(): transformation of multiple columns selected with a character vector or vars()

mutate_if(): transformation of multiple columns if condition satisfied

# how can we enumerate rows?
shortgwf <- gwf %>% mutate( rows = row_number()) # kind useful to write the number of rows, which should be updated after deleating a row


# which was the classic approach? 
#shortgwf$rows <- 1:nrow(shortgwf)

# you can add them each column 

shortgwf <- shortgwf %>% mutate_at(c("rows", "year"), .funs = lag) # problem?

shortgwf <- shortgwf %>% mutate_at(c("rows", "year"), .funs = list(new = lag)) 

shortgwf <- shortgwf %>% mutate(new = lag(year, new2 = lag(rows)))
# which is the the difference with respect to the previous code?


# let's say that you want to change all the integer vectors to numeric vectors to numeric

shortgwf <- shortgwf %>% mutate_if(is.integer, .funs = as.numeric)

# only those columns with the name rows in it?

shortgwf <- shortgwf %>% mutate_at(c("rows", "rows_new"), .funs = as.integer) # OR

shortgwf <- shortgwf %>% mutate_at(vars(contains('new')), .funs = as.integer)

Group comparison and aggregation

We can apply above functions to subgroups within the dataset

group_by() function to describe how to break a dataset down into groups of rows

# total duration of every regime in a given country

shortgwf <- gwf %>% group_by(gwf_casename) %>% mutate( total = n()) 

# Question: what if we change ```n()``` for nrow()? what can these function be useful for?

shortgwf <- shortgwf %>% group_by(gwf_casename) %>% mutate( total2 = row_number())

# OR shortgwf <- shortgwf %>% group_by(gwf_casename) %>% mutate( total = n(), total2 = row_number()) 


# also more than one variable can be generated 

shortgwf <- shortgwf %>% group_by(gwf_casename, year) %>% mutate( total_months = total*12)

Create summaries of the data frame

summarise() function will create summary statistics for a given column in the data frame

think about the mean, sd, max and min values, lag values…

summarise_all() apply the same summary statistic to all the columns

summarise_at() apply the same summary statistic to more than one column

# create a list of regime-cases


list_gwf <- shortgwf %>% select(gwf_casename, total,total_months)  %>%  summarise(sum = sum(total))  

list_gwf <- select (list_gwf, -year) %>% distinct() # what's going with Argentina 43-46?

# we want to calculate the avarage lenght of every regime-case in years and month

regime_mean <- shortgwf %>% select(gwf_regimetype, total, total_months) %>% group_by(gwf_regimetype) %>% summarise_all(funs(mean = mean, sd = sd))

regime_mean2 <- shortgwf %>% select(gwf_regimetype, total, total_months) %>% group_by(gwf_regimetype) %>% summarise_at(vars(total:total_months), funs(mean = mean, sd = sd))

# Q: compare the outcome ob both functions, which are the advantages/disavantages of using one function or the other?


# old_style <- shortgwf %>% select(gwf_casename, total, total_months)  %>% group_by(gwf_casename) %>% summarise(mean1 = mean(total), mean2 = mean(total_months) and the sd...)

How can I learn more? check this blog to find some ideas about how to use summrise() and group_by. Here another blog in case you need to remmember the theory again

Exercise: basic grammar ↑

using the dataset ‘gwf’…
In one line of code, we need a dataset with the columns: gwf_country till gwf_enddate and gwf_regimetype and… remove observations that fall into the regime type ‘oligarchy’. Check you did it right
Create a new vector that contains information about the duration of every authoritarian regime type in a given year (tip: you might need row_number) in one line of code.
distinct() is a function that returns unique values. a<- shortgwf %>% group_by(gwf_casename) %>% distinct() why is this function not working? what shall we do to obtain a list of regime-case with their starting and ending date?

Solution

# Exercise 1
exercise1 <- gwf %>% select(gwf_country:gwf_enddate, gwf_regimetype ) %>% filter(gwf_regimetype != "oligarchy")

# Exercise 2
exercise2 <- exercise1 %>% group_by(gwf_casename) %>% mutate(duration = row_number()) # duration 

# Exercise 3 
exercise3 <- exercise2 %>% select(-duration) %>% distinct()

Tidyr to reshape your data

Tidyr package is the most intutive package to reshape your data

Basic logic
- identify clearly which are: your observations (rows) and variables (columns)

Functions
- separate(): separate one column into several
- unite(): unite in one column multiple columns
- spread(): spread rows into columns
- gather(): gather columns into rows

Separate and Unite

Sometimes… a column contain several values –> separate()

Sometimes… a single variable is spread into multiple columns –> unite()

# THE DATASET TO WORK NOW
tidy_data <- exercise3

# create three columns from the column 'gwf_startdate'

tidy_data1 <- tidy_data %>% separate(gwf_startdate , into = c("year", "month", "day"), sep = "-", remove = TRUE, convert = TRUE) 

# It works as follow separete(the column to separate, the name of the new columns, the symbol to separate the values)


# additional useful argument --> 
#'remove = FALSE' do not delete the original variable 
# convert = TRUE to give it a proper data structure

# list <- list %>% separate(gwf_startdate , into = c("year", "month", "day"),  sep = "-", remove = FALSE, convert = TRUE ) 


# also, you do not need 'symbols to separate '

tidy_data1 <- tidy_data1 %>% separate(year , into = c("century", "years"), sep = 2) 
# in sep, + values far-left of the strings, - values far-right of the strings


# Oh, god! that was not the right variable. How can I get my column back!!??

tidy_data1 <- tidy_data1 %>% unite(year, century, years) 

# how can I remove the '_'? adding "sep = "" "

tidy_data1 <- tidy_data %>% separate(gwf_startdate , into = c("year", "month", "day"), sep = "-", remove = TRUE, convert = TRUE) 

tidy_data1 <- tidy_data1 %>% separate(year , into = c("century", "years"), sep = 2) 

tidy_data1 <- tidy_data1 %>% unite(year, century, years, sep= "") 

# _ value is a default value when uniting between the values from different columns. we can specify different "sep" condition

Spread and Gather

Sometimes we encounter the problem that observations are scattered across multiple rows –> spread()
- time serie example in the file ‘economic data.xsls’

Sometimes…some of the column names are not names of variables, but values of a variable –> gather()

# create a dataset in which one column correspond to years and the value = regime type

spread_data <- tidy_data1 %>% select(gwf_casename, gwf_regimetype, year) %>%  spread(key = gwf_regimetype, value = year) # horizontally

# this can be useful to estimate some descriptive statistics for each regime type

spread_data <- spread_data %>% summarise_all(min , na.rm = TRUE ) # no possible two functions :'(


############# spread_data is a very ugly dataset, let's put it in its original form. in gather we trust #############

# We need the set of columns that represent values, not variables. In this example, those are the columns after total.

# We need the key: the name of the variable whose values form the column names. Here it is year.

# We need the value: the name of the variable whose values are spread over the cells. Here it's regime type.

united_data <- spread_data %>% gather(2:10, key = gwf_regimetype, value = year) %>% drop_na() # why so ugly?

spread() makes long tables shorter and wider

gather() makes wide tables narrower and longer

4. Tips for a tidy dataset ↑

4.1. Time is the essence: Introduction to time data-types

Time data vectors

Sometimes datasets contain information representing calendar dates and times

The problem: time-data vector is stored in different formats
- as.characther()
- as.POSIXct()
  - Internally, R save the number of seconds since the start of January 1, 1970. Negative values, seconds before this date.
  - Outcome: print a value. useful format to save in data.frames
- as.POSIXlt()
  - save a list with different type of entries (eg. sec., min., hours, years from 1900)

# data for the section
tidy_data <- exercise3


# save time data with POSIXct
value <- as.POSIXct(unlist(tidy_data[2,3]))
print(value)
typeof(value) # double? integer value with double precisioin floating points useful for mathematical purposes
cat(value,"\n") # how it is stored in R internally

# save time data with POSIXlt
list_date <- as.POSIXlt(Sys.time(), format = "%y-%m-%d")
print(list_date)
typeof(list_date)
cat(list_date,"\n") 
names(list_date)
list_date[[1]] # seconds
list_date[[2]] # minutes
list_date[[3]] # hours
list_date[[4]] # day of month (1-31)
list_date[[5]] # month of the year (0-11)

Tip: save time-data type with the function as.Date():
- easy manipulation, less risk it gets corrupted, keeps track of numbers of days
- Q: check help(as.Date) adn observe the commands: what else you can specify with this function?

# save time data with as.Date
value2 <- as.Date(unlist(tidy_data[1,3]))
print(value2)
typeof(value2)
cat(value2,"\n")

value2 <- as.Date.character(tidy_data[1,3], format= "%Y/%m/%d")
typeof(value2)

Exercise: obtaining time vectors ↑

Using the tidy_data… change the character vectors “gwf_startdate” and “gwf_startdate” into as.Date format. Tips: you need to write the code in one line and use mutate_at()

Solution

Time data operators

Date is a suitable R’s internal format to run a numer of basic operations
- Create a new vector that contains time differences
- Expand dataset conditioned on time/date

library(eeptools)

# create time differences 
tidy_data$duration_month <- age_calc(tidy_data$gwf_startdate, enddate = tidy_data$gwf_enddate, units = "months") # default is in months

# sensible to omited values

tidy_data <- na.omit(tidy_data)

tidy_data$duration_year <- age_calc(tidy_data$gwf_startdate, tidy_data$gwf_enddate, units = "months") #

 # old school
# Lenght of by months
elapsed_months <- function(end_date, start_date) {
    ed <- as.POSIXlt(end_date)
    sd <- as.POSIXlt(start_date)
    12 * (ed$year - sd$year) + (ed$mon - sd$mon)
}

#######################################################################################################

######### ########## Expand dataset condioned on time differences ######### ######### 

# daydly basis.
lst <- Map(`:`, tidy_data$gwf_startdate, tidy_data$gwf_enddate) # Map is a function from purr to extract a lsit values

spread_vertical_data <- cbind(tidy_data[rep(seq_len(nrow(tidy_data)), lengths(lst)), ], date = unlist(lst)) # why negative values?

spread_vertical_data$first <- as.Date(spread_vertical_data$date, origin = "1970-01-01") # date format!

library(lubridate)

# monthly basis

spread_vertical_data$second <- format(as.Date( spread_vertical_data$first), "%Y-%m") 
# it shows only month, still the data is stored on dayly basis!

spread_vertical_data <- spread_vertical_data %>% select(-c(first,date)) %>% distinct()

How can I learn more?: use lubridate for shaping datasets for even history dataset and here or reshape for panel data

4.2. Missing values

Theory

two types of :
- Explicit , i.e. flagged with NA, 99, -99 etc
- Implicit , i.e. simply not present in the data.

We have always to make the NA explicit! and be consistent with its coding

Practice

Identification either numerical or visualization
- install.packages("naniar") (today)

Procedure to deal with missingness
- Omit the missing values (today)
- Imputation (another day)
  - useful reading: King’s paper and Graham’s paper
  - useful packages: Amelia and bucky install.packages("bucky", repos=c("http://r.tahk.us/",getOption("repos")))

1. Identification

We have always to make the NA explicit! and be consistent with its coding
- complete()turns implicit missing values into explicit NA
- apply functions to omit data na_if()

library(naniar)
# generating a dataset
df <- data.frame(replicate(6, sample(c(1:10, -99), 6, rep= T)))

df <- mutate_all(df, funs(na_if(.,-99)))
 
# sometimes white spaces need to be converted in NA

df <- data.frame(replicate(6, sample(c(1:10, ""), 6, rep= T)))

mutate_all(df, funs(na_if(., "")))
 

####### Old school way a function to solve missings ####### 
fix_missing <- function(x){x[x == -99]<- NA  
x} 
# change the -99 in case the value is different.

df<-lapply(df, fix_missing) # to applied it for the whole dataset, of course, also you can use the functio to fix one particular column

df$X1 <- fix_missing(df$X1)

Numerical summaries of missing data in variables and cases
- miss_var_summary(): summary for each variable
- miss_case_summary() :summary for each case
- …or old school functions!

Visual summary of missing data in variables and cases, check this

# dataset of the section
gwf_2 <- gwf

library(naniar)
report1 <- miss_var_summary(gwf_2, order = TRUE, add_cumsum = TRUE)
report2 <- miss_case_summary(gwf_2, order = TRUE, add_cumsum = TRUE)

# old school, I prefer it

colSums(is.na(gwf_2))
apply(is.na(gwf_2), 2, which) # shows the rows with missing values and columns 

# subset 
no_missing <- gwf_2 %>% filter(is.na(gwf_regimetype)) # or

no_missing <- gwf_2 %>% complete.obs() # or

Q: Why did the data get missing? Implications? Solution?

2. Solution 1: Omit the missing values

dplyr solves it by using na.rm = TRUE

df <- data.frame(replicate(6, sample(c(1:10, -99), 6, rep= T)))

df <- mutate_all(df, funs(na_if(.,-99)))

# way to correct the previous mess
df2 <- df %>% gather(5:6, key = beer, value = cost, na.rm= TRUE) # exclude missing values, UPDATE positions based on the lcoation of the mssing values!

# several ways in the old.fashion... one of them list <- na.omit(list) (https://stackoverflow.com/questions/4862178/remove-rows-with-nas-missing-values-in-data-frame)

df %>% summarise(mean_cost = mean(X4, na.rm= TRUE), sd_cost = sd(X4, na.rm= TRUE))

# also na.rm when running regressions!

2. Solution 2: fill missing values

fill() function from tidy package
- It takes a set of columns where you want missing values to be replaced by the most recent non-missing value
- for example…

no_missing <- gwf_2 %>% filter(is.na(gwf_regimetype))

gwf_2 <- gwf_2 %>% group_by(gwf_casename) %>% fill(gwf_regimetype)  

# check missings 

no_missing2 <- gwf_2 %>% filter(is.na(gwf_regimetype)) 

gwf_2 <- gwf_2 %>% group_by(gwf_casename) %>% fill(gwf_regimetype, .direction = c("up"))  

# add manual entry for Iran and later apply fill with direction = down

2. Solution 3: Imputation

See the info at the beggining of this section

4.3. Duplications

Two identical observations or very similar observations are in the dataset
- Cause by: insert data manually (e.g.misspells and additional white space) or functions(e.g. expanding datasets…). Any thoughts?
- Consequence ?

distinct() is a function integrated in tidy package to deal with duplicates

gwf_2 <- gwf

data <- gwf_2 %>% select(cowcode, gwf_casename, gwf_regimetype) # all repeated

data <- gwf_2 %>% select(cowcode, gwf_casename, gwf_regimetype) %>% distinct() # tadaaaa

sometimes we need more in deep-work…

Process:
1. Search or create unique obs. : unit of analysis + a col with a unique value per obs. (e.g. time = year)
2. Count how many times these unique observation appears
3. Evaluate the source of the error (e.g. misspell = idiosyncratic?)
4. Deal with the error: deleate or recode

# If the problems are misspells...

# the process of creating ids

  gwf_2$id_duplica <- paste0(gwf_2$cowcode, sep = "." , gwf_2$year)
  gwf_2$id_duplica <- tolower(gwf_2$id_duplica) #everything in lower case
  gwf_2$id_duplica <- gsub('\\s{1,4}','', gwf_2$id_duplica )# remove empty spaces in the cell
  gwf_2 <- arrange(gwf_2, id_duplica) # to put some alphabetic order here...
  gwf_2 <- transform(gwf_2, id_duplica = as.numeric(factor(id_duplica)))
  
 
  sum(table(gwf_2$id_duplica)-1) # 42 observations are duplicates, but where?

  gwf_2 <- gwf_2 %>% group_by(id_duplica) %>% mutate(n=n())
  
  a <- gwf_2 %>% filter(n != 2)

Q :In the case of ‘data’ (see chunck above), what shall we do to remove duplicates? is it possible?

Workflow example 1: getting tidy dataset

Import the dataset ‘gdp.csv’ in R.
Change the layout of the data in a way that one column refers to country, year, gdp
Create a new column named ‘label’. This column should contain the country abbreviation for each observation.
Move forward the first value of each country-group and delate the first row of each country-group
save the dataset

5. A toolkit for working with strings

Challenge:
- Datasets and/or vectors contains information (e.g. words) presented in a semi-structured and/or unstructure manner.
- We want to exctract and manipulate the relevant bits of information as strings for analysis

Goal: introduce the logic of regular expressions (regexps) and a handful of functions
- regexps: a concise language for describing patterns in strings
- functions: stringr package

Why is this important? useful for
- … data cleaning (see below)
  - … data collection: example
    - … but check the project in the folder ‘stringr_casestudy’ for a complex case’)

Example data cleaning

library(stringr)

# change working directory
gwf <- read.csv("./data/regime type/autocracydata.csv", stringsAsFactors = FALSE) # change working directory


names(gwf) <- names(gwf) %>% str_replace_all("gwf_", "")

# old school:  names(gwf)[1:15] <- c("write here 15 names")

Learning the logic is important and only the very beggining

6.1. Regular expressions

Regexp generalizable text patterns for searching and manipulating data

We focus on exact matching and generalizable expressions

Exact matching

Identify a characther vector with the exact same values

example.string <- "</P> Dip. Rodirgo Gromenawer, Perez  Elected: PRIS -Parlamentary group:PIRS 2010</P>"

# A string matching itself
str_extract(example.string, "PRI")

# multiple matches
str_extract_all(example.string, "PRI")

# case sensitivity
str_extract(example.string, "pri")
str_extract_all(example.string, regex("pri", ignore_case =TRUE))

Generalizable regexps

The power of these regexps relies on formulas to write more flexible and generalized queries

Some regular expressions

+ ```[:digit:]```: Digits: 0 1 2 3 4 5 6 7 8 9
+ ```[:lower:]```: Lower-case characters: a-z 
+ ```[:upper:]```: Upper-case characters: A-Z
+ ```[:alpha:]```: Alphabetic characters: a-z and A-Z
+ ```[:alnum:]```: Digits and alphabetic characters
+ ```[:punct:]```: Punctuation characters: '.', ',', ';', etc.
+ ```[:graph:]```: Graphical characters [:alnum:] and [:punct:] 
+ ```[:blank:]```: Blank characters: Space and tab
+ ```[:space:]```: Space characters: Space, tab, newline, and other space characters

Additional shortcuts
- \d: matches digits = [0-9]
- \D: matches non-digits = [^0-9]
- \w: matches any characther = [a-z] + [A-Z]
- \s: matches whitespace characther
- \S: matches non-whitespace characther

but also a set of operators
- ^: match the beginning of a string
- $: match the end of a string +|: match more than one element +.+: match everything +[ ]: match a range of words or numbers included within the bracket

example.string <- "</P> Dip. Rodirgo Gromenawer, Perez  Elected: PRIS -Parlamentary group:PIRS</P>"

# match the beginning of a string
str_extract_all(example.string, "^</P>")

# match the end of a string
str_extract_all(example.string, "PRI$")
str_extract_all(example.string, "</P>")

# pipe operator
unlist(str_extract_all(example.string, "Dip|PRI"))

# wildcard
str_extract(example.string, "R.+go") 

# character class (range)
str_extract(example.string, "P[RSTUV][IJKLM]S")
str_extract(example.string, "Ro[:alpha:][:alpha:][:alpha:]go")
str_extract(example.string, "Ro[:alpha:]{4,}go")

# quite annoying repeating several times the same  right?

Quantifiers: to avoid repeating same expression:

+ ``` ? ```: The preceding item is optional and will be matched at most once
+ ``` +```: The preceding item will be matched one or more times
+ ```{n} ```: The preceding item is matched exactly _n_ times
+ ```{n,}```: The preceding item is matched _n_ or more times
+ ```{n,m}```: The preceding item is matched between _n_ and _m_

# greedy quantification
str_extract(example.string, "R[:alpha:]{1,6}o")

# the preceding item is the wildcard
str_extract(example.string, "Dip.+Elected") 

# the preceding item is the wildcard
str_extract_all(example.string, ".+?PRI") 

# meta characters
unlist(str_extract_all(example.string, "\\.")) # only want the dot. The dot has a meaning in regular expression that you want to isolate

# assertions
unlist(str_extract_all(example.string, "(?<=</P> ).+")) # positive lookbehind: (?<=...)
unlist(str_extract_all(example.string, ".+(?=</P>)")) # positive lookahead (?=...)

# to create strings, both single and double quotes work
string1 <- "This is a string"
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'

# to include a literal single or double quote in a string you can use \ to escape it:

double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"

# to extract info between brackets or quotations: two strategies

str_extract(string2, '(\").*?(\")')
str_extract(string2, "(?<=\")(.*?)(?=\")") # 

#  Also: negative lookbehind: (?<!...) and  negative lookahead (?!...), never used

Once again, it is matter of try and error. here a useful link to proof whether or not your regexps work before running the code

Exercise: first contact with regexps ↑

describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.
- "[[:digit:]]{4,}"
- ".*?\\.txt$"
- "<(.+?)>.+?</\\1>" # example str_extract_all("Phone 150$, PC 690$", "[0-9]+\\$") # example to extract every number that precedes $

Solution

6.2. Strings manipulation

Some functions to use regexps

 + ```str_extract()```:  Extracts first string that matches pattern Character vector
 + ```str_extract_all()``` : Extracts all strings that match pattern List of character vectors
 + ```str_locate() ``` : Return position first pattern that match Matrix of start/end positions
 + ```str_locate_all()``` : Return positions of all pattern matches List of matrices 
 + ```str_replace()``` : Replaces first pattern match Character vector
 + ```str_split()```: Split string at pattern List of character vectors 
 + ```str_split_fixed()```: Split string at pattern into fixed number of pieces Matrix of character vector

library(stringr)


# joining
str_c("text", "manipulation", sep = " ")
str_c("text", c("manipulation", "basics"), sep = " ")


# working example 
example.string <- "</P> Dip. Rodirgo Gromenawer, Perez  Elected: PRI -Parlamentary group:PRI</P>"

# locate
str_locate(example.string, "Rodirgo Gromenawer, Perez")

# substring extraction
str_sub(example.string, start = 11, end = 35)

# replacement
str_replace(example.string, pattern = "</P>", replacement = "")
str_replace_all(example.string, "</P>",  "")

# splitting
str_split(example.string, ":") %>% unlist

Exercise: string manipulation ↑

The following code hides a secret message. Crack it with regular expressions

secret <- "clI.0pow1zLstc0d87wnkig7Ovd000dgdfgdfsdfeeeggvhryn92V55juwczi8hqrfpxs5j5dwpn0Enwo. Cwisdij76j8kpf03AT5dr3coc0bt7yczjataootj55t3j3ne6c4Sfek.W1w1IwwojigT d6vrf5656Hrbz2.2bkTnbhzgv4I9i05zEcropwSgnb.YqoO65fUa1otfb7wm24k6t3s3339zqe5 fy89n6?d5t9kc4fe905gmr"

Solution

Workflow example 2: working with PDFs

Goal with want to construct a dataset out of the PDF data (see legislatura 27.pdf)
- cols: name of mps, party affiliation, Estate-district where the mp was elected
- we need to install tm package by Ken Benoit.

Process
1. Is there any pattern in the text that we can use to our favor? Planning!
2. set working directory
3. set the corpus: an object that represents a collection of documents
4. Perform the strategy to extract information in a characther vector
5. Check if problems in the data extraction
6. Clean the vectors
7. Combine the vectors into a data.frame. Further noise that needs to be cleaned?

# import the pdf in R == setting a corpus. we need tm package

library(tm)
library(pdftools)
library(stringr)

# 1. my workign directory:  D:\Biblioteca D\courses material\data managment in R\data

cname <- file.path("D:", "Biblioteca D", "courses material", "data managment in R", "data") # I am telling to R where the pdf files are. Cchange \ for ,

# 2. Corpus

# Rpdf <- readPDF(control = list(text = "-layout")) # layout control in order to keep the original format as much as possible 

Rpdf <- readPDF(engine = "xpdf", control = list(text = "-layout")) # xpdf engine, similar as the default. but needs to installed: http://www.xpdfreader.com/

docs <- Corpus(DirSource(cname), readerControl=list(reader=Rpdf)) # upload documents

writeLines(as.character(docs[1]))

## problems with encoding? Functions below change the encoding of R 

# Spanish\LATIN: Sys.setlocale("LC_CTYPE", "") 
# Cyrillic: Sys.setlocale(category = "LC_ALL", locale = "Russian")

# Specify the doc we want
content_doc <- docs[[1]]$content 

################################################################################## In case this code does not work for you: use


doc <- pdf_text("./data/Legislatura_36.pdf") # can upload only one doc!

#################################################################################################


# 3. extract the info that we need
# name
mps.regex <- "Dip.+[[:alpha:].,]{1,5}"
name_dip<- unlist(str_extract_all(doc, mps.regex)) #172?

# party
party <- unlist(str_extract_all(name_dip, ":(.+)"))
party <- unlist(str_replace_all(party, ":", "")) # 171?

# state
state.regex <- "Es[tad][tad].+[[:alpha:]:.+[0-9]]{1,8}"
state_dip<- unlist(str_extract_all(doc, state.regex)) # 171?

# 4 and 5. correct errors: error in the string 8, which one shall we used

name_dip <- unlist(str_remove(name_dip, "Diputado Favio Altamirano Manlio fallecio."))
name_dip <- name_dip[-8]

# 6. put together in data.frame when we feel comfortable with the lenght of vectors
data<- data.frame(names = name_dip, party.affiliation = party, state = state_dip, stringsAsFactors =  FALSE)


# 5. clean dataset!
data$names <- unlist(str_replace_all(data$names, "Partido:.+" , "")) # remove party in names col
data$names <- unlist(str_replace_all(data$names, "Dip." , "")) # remove dip in names col

data$names_ordered <- str_replace(data$names , "(.+), (.+)", "\\2 \\1") # change first name, second surname. Does it work?

data$names_ordered <- tolower(data$names_ordered) # lower case 

## other tasks: create a id per

How can I learn more? Text as data is a growing topic.
- Click here, here, or even here for a short introduction about all the techniques that can be used with the tm package or here for the quanteda package

Workflow example 3: working with websites

Goal with want to construct a dataset out of the wikipedia tables.
- we need columns with info about candidates name, outcome of election (“winner”, “loser”), party affiliation and district

Process: the same as in the previous example!

# packages

library(xml2)
library(rvest)
library(tidyverse)
library(stringr)

#######  data collection stage of the research  #######  

# 1. specify URL
url <- "https://en.wikipedia.org/wiki/Results_of_the_Malaysian_general_election,_2018_by_parliamentary_constituency"
browseURL(url)

# 2. download static HTML behind the URL and parse it into an XML file

web<- read_html(url) %>% html_table( fill = TRUE)

candidates <- rbind(web[[3]], web[[4]], web[[5]], web[[6]], web[[7]], web[[8]], 
                    web[[9]], web[[10]], web[[11]], web[[12]], web[[13]], web[[14]], web[[15]], web[[16]], web[[17]], web[[18]])

#######  data managment stage of the research  #######  


# 3. setting the dataset
# columns to deleate (10:13) 
candidates <- candidates[,1:9]


# changign labes to facilitate identification
names(candidates)[1]<- "Constituency.code"
names(candidates)[2]<- "Constituency"
names(candidates)[3]<- "Winner"
names(candidates)[6]<- "Loser"
names(candidates)[4]<- "Votes.total.wn"
names(candidates)[5]<- "Majority"
names(candidates)[7]<- "Votes.total.ls"
names(candidates)[8]<- "Incumbent"
names(candidates)[9]<- "Incumbent.Majority"


# create outcome of election column and candiate.names

# gathering columns 
candidates <- candidates %>% gather("Winner", "Loser", key = Outcome, value = Candidate.name, na.rm= TRUE, convert =  TRUE) # candidates and outcome 
candidates <- candidates %>% gather("Votes.total.wn", "Votes.total.ls", key = b, value = Votes.number, na.rm= TRUE, convert =  TRUE) # fix N of votes

# remove columns and empty rows, and repeated observations. 
candidates <- candidates %>% select(-b) %>% distinct()

# remove the first obs. 
candidates$flag <- candidates$Constituency.code # create a flag

candidates$flag <- str_replace_all(candidates$flag, "P[0-9]{1,3}", "1") 
candidates <- candidates %>% filter(flag == 1) %>% select(-flag)

candidates<- arrange(candidates, Constituency.code)


# correct problem: winner has most of the votes, loser has what is left 

#convert chrt vectors to numeric 
candidates$Votes.number <- str_replace_all(candidates$Votes.number, ",", "")
candidates$Votes.number <- as.numeric(as.character(candidates$Votes.number))

# listing winners, keep first obs. which the one with largest percentage
list.winners <- candidates %>% group_by(Candidate.name) %>% filter(Votes.number == max(Votes.number) & Outcome == "Winner")


# listing losers, deleate the first obs. which the one with largest percentage

list.losers <- candidates %>% group_by(Candidate.name) %>% filter(Votes.number != max(Votes.number) & Outcome == "Loser")

# Logic of the collection: the first row correspond to the winners value, the second row correspond to members' value in the table, the rest correspond the values of other candidates

# acombine both list
candidates2 <- rbind(list.winners, list.losers)
candidates2<- arrange(candidates2, desc(Constituency.code))


# adding column party affiliaition 
candidates2$Party.affiliation <- str_extract_all(candidates2$Candidate.name, "\\(.*\\)" )
candidates2$Party.affiliation <- str_replace_all(candidates2$Party.affiliation, "\\(", "")
candidates2$Party.affiliation <- str_replace_all(candidates2$Party.affiliation, "\\)", "")

candidates2$Candidate.name <- str_replace_all(candidates2$Candidate.name, "\\(.*\\)", "" )

# id = name without empty spaces
candidates2$id <- tolower(candidates2$Candidate.name)
candidates2$id <- gsub('\\s{1,4}','', candidates2$id)# remove empty spaces in the cell

# what shall I do with those districts that have changed the district name? 


### ### ### ### ### other advanced and interesting features of webscrapping and regular expressions ### ### ### ### ### 

# adding a column for wikipedia links

# extract url
a <- read_html(url) %>% html_nodes("a") %>% html_attr("href")
a<- unlist(str_extract_all(a, ".wiki.[[:alpha:]]{1,10}\\_+[[:alpha:]]{1,10}\\_.+")) # inspect and select only those links with candidate information 
a<- a[40:304] # simplicity is beauty, remmber 

# create full real links 
url.candidate2 <- paste0("https://en.wikipedia.org", a)

wiki.links2 <- data.frame(url.candidate = url.candidate2)
wiki.links2$Candidate.name <- str_replace_all(wiki.links2$url.candidate, ".+.wiki.", "")
wiki.links2$Candidate.name <- str_replace_all(wiki.links2$Candidate.name, "_", " ")
wiki.links2$Candidate.name <- str_replace_all(wiki.links2$Candidate.name, "\\(.*\\)", "" )


# id = name without empty spaces
wiki.links2$id <- tolower(wiki.links2$Candidate.name)
wiki.links2$id <- gsub('\\s{1,4}','', wiki.links2$id)# remove empty spaces in the cell

# remove repeated and non informative rows 
wiki.links2 <- distinct(wiki.links2)

# 6. merge both info
candidates2018 <- left_join(candidates2,wiki.links2, by = "id")
candidates2018<- candidates2018 %>% select(-14) %>% distinct()

### ### ### ### ###### ### ### ### ###### ### ### ### ###### ### ### ### ###### ### ### ### ###### ### ### ### ###### ### ### ### ###### ### ### ### ###### ### ### ### ###### ### ### ### ###### ### ### ### ###### ### ### ### ###


# WRITE AND SAVE 
write.csv(candidates2018, "./candidateMSY2018.csv")
candidates2 <- read.csv("./candidateMSY2018.csv", na.strings="", stringsAsFactors = FALSE) # sexy stuff

# ;)

How can I learn more? best book ever, although a bit outdate. There are tutorials, but either you have to pay for them or they are very partial!

6. Relational data ↑

A set of datasets are connected by some key aspects: merge them!

Why is this important?
- Rarely researchers work with one dataset!
- Relate your datasets: increase efficiency!
- Functions we learn today: work faster, clearer and does not mess with rows order and labels

Relations are always defined between a pair of tables
- Think this pair as X (the dataset with larger info) and Y (smaller dataset)
- The strategy is to find the key that relates the pair.
  - The key should have same col.name in the pair

Types of joins
- ‘mutating’ joins: copy and paste columns and/rows + left_join() keeps all observations in x + right_join() keeps all observations in y + inner_join() keeps only matched observations in X and Y + full_join() keeps all observations in x and y even if not match
- ‘Filtering’ joins: keep observations in data frames if they match + semi_join() keeps all observations in x that have a match in y + anti_join() drops all observations in x that have a match in y

library(foreign)
library(tidyverse)
library(readxl)


# data sets

# Authoritarian regimes dataset 

gwf <- read.csv("./data/regime type/autocracydata.csv", stringsAsFactors = TRUE)[,-1] # change working directory

# dataset about the failure of authoritarian regimes
failure_regime <- read_excel("data/data/failure_regime.xlsx")
View(failure_regime)

Q How is it possible that both datasets have different number of observations?

Left versus Right joins

left_data <- left_join(gwf, failure_regime, by= c("cowcode", "year", "gwf_country"))

left_data <- left_join(gwf, failure_regime)

right_data <- right_join(gwf, failure_regime, by= c("cowcode", "year" , "gwf_country", "gwf_casename")) # why one obs. more than failure?




# what I use:  an updated versiobn of a function posted in stack overflow https://stackoverflow.com/questions/28992362/dplyr-join-define-na-values

left_join_NA <- function(x, y, ...) {
  left_join(x = x, y = y, by = ...) %>% 
    mutate_all(funs(replace(., which(is.na(.)), NA))) ## replace the implicit NA into a pre-defined valu, 0 in this case
} 

# you can change the 'type of join' with other type of joins and the default value (99 instead of 0)

complete_data <- left_join_NA(gwf, failure_regime, by= c("cowcode", "year", "gwf_country"))

## classical approaches 
# complete_data[is.na(complete_data)] <- 0 # to change all NA in 0 in a classic manner, better specify each column 

# complete_data$gwf_fail[is.na(complete_data$gwf_fail)] <- 0
# complete_data$gwf_fail_type[is.na(complete_data$gwf_fail_type)] <- 0

Full versus Inner joins

full_data<- full_join(gwf, failure_regime, by= c("cowcode", "year", "gwf_country")) # keep all observations.


inner_data <- inner_join(gwf, failure_regime, by= c("cowcode", "year", "gwf_country")) # keep only matched observations

# why inner seems to be better than the rest of datasets? 


# why?

Q Why do we observe one that inner_data and right_data has one more observation than the failure_data?

Anti versus Semi joins

anti_data <- anti_join(gwf, failure_regime, by= c("cowcode", "year", "gwf_country", "gwf_casename"))

# Anti-joins are useful for diagnosing join mismatches (e.g. countries have slightly different name!), but also as an opposite function of semi_join 

the_data <- anti_join(right_data, anti_data, by= c("cowcode", "year", "gwf_country"))

write.csv(the_data, "the_data.csv")

semi_data <- semi_join(failure_regime, inner_data, by= c("cowcode", "year", "gwf_country")) # Why, one obs. less?


# Semi-joins and anti joins are useful for matching filtered summary tables back to the original rows.

top_autocra <-  the_data %>%  group_by(gwf_regimetype) %>% filter(gwf_spell == max(gwf_spell)) # what is this?

semi_data <- the_data %>% semi_join(top_autocra) 

anti_data <- the_data %>% anti_join(top_autocra) # dataset that does not include the most long lasting authoritarian regimes in the world

Exercise: merging datasets! ↑

Use the gdp data in the previous exercise to build a more complete ‘the_data’ dataset. Identify why merging cannot be possible and which changes in the datasets need to be done in order to merge them. (tip: same colum name?)

Data management in R tricks to shape your data efficiently

Adrian del Rio and Sphend Kursani, fourth year SPS

April 9, 11 and 12 2019

Content and Organization of the workshop

Content of Data management I and II

Goals in Data managment 1

Workflow example 1: getting tidy dataset

Goals in Data managment 2

1. Debate about data managment ↑

2. Procedure in data managment ↑

2.1. The script: ethics and practice

2.2. Data: ethics and practice

Install packages for today’s session (if you don’t have them!)

Import the data of today’s session

3. A grammar for data managment ↑

Select and filter rows by position

Arrange

Add new (transformed) columns

Group comparison and aggregation

Create summaries of the data frame

Exercise: basic grammar ↑

Solution

Tidyr to reshape your data

Separate and Unite

Spread and Gather

4. Tips for a tidy dataset ↑

4.1. Time is the essence: Introduction to time data-types

Time data vectors

Exercise: obtaining time vectors ↑

Solution

Time data operators

4.2. Missing values

Theory

Practice

1. Identification

2. Solution 1: Omit the missing values

2. Solution 2: fill missing values

2. Solution 3: Imputation

4.3. Duplications

Workflow example 1: getting tidy dataset

5. A toolkit for working with strings

6.1. Regular expressions

Exact matching

Generalizable regexps

Exercise: first contact with regexps ↑

Solution

6.2. Strings manipulation

Exercise: string manipulation ↑

Solution

Workflow example 2: working with PDFs

Workflow example 3: working with websites

6. Relational data ↑

Left versus Right joins

Full versus Inner joins

Anti versus Semi joins

Exercise: merging datasets! ↑

Solution

Data management in R

tricks to shape your data efficiently