Practical 2
INTRODUCTION
The team has received a theoretical and practical underpinning of strings from the R for Data Science book. In this practical, we will extend this knowledge using an example from the South African context. We aim to reinforce our skills in R by relying on the following: 1) Regular Expressions, 2) tidyverse packages for cleaning text data, and 3) exporting text data to csv or, via arrow::write_parquet, to parquet.
STATE CAPTURE COMMISSION
On 14 October 2016, the previous Public Protector, Adv Thuli Madonsela, published the State of Capture Report. In subsequent years, many judicial reviews ensued, ultimately resulting in the establishment of the State Capture Commission. The commission has several public artefacts that lend themselves well to several data science tasks.
Useful libraries for this practical include stringr, arrow, pdftools and stringi. Remember, the above libraries are only suggestions. To practice Regular Expressions, you can use RegExr, an online tool to learn, build and test Regular Expressions.
EXERCISE
Here, we will focus on one such task: extracting meaningful data from a transcript. The transcript of interest is from the State Capture Commission Hearing that sat on 2022/12/10. Your task in this practical is to convert the text into a tidy data.frame. In other words, each column is a variable and each row is an observation. The data.frame must contain the following:
- Extracted date of the hearing
- Speaker
- Dialogue
- Page Number
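The four columns above could be extracted along the following lines. This is only a minimal sketch, not a model answer: the sample lines in raw_lines are invented, and the three regular expressions assume a layout (a date header such as "10 DECEMBER 2022", a "Page N of M" footer, and speaker names in capitals followed by a colon) that the real transcript may not follow. Adjust the patterns after inspecting the actual text.

```r
library(stringr)

# Hypothetical transcript lines, invented for illustration only.
raw_lines <- c(
  "10 DECEMBER 2022",
  "CHAIRPERSON: Good morning, everybody.",
  "WITNESS: Good morning, Chair.",
  "Page 1 of 2",
  "10 DECEMBER 2022",
  "CHAIRPERSON: Please state your name",
  "for the record.",
  "Page 2 of 2"
)

# Assumed patterns; the real layout must be checked first.
date_pattern    <- "^(\\d{1,2}) ([A-Z]+) (\\d{4})$"
page_pattern    <- "^Page (\\d+) of \\d+$"
speaker_pattern <- "^([A-Z][A-Z .]+):\\s*(.*)$"

is_date   <- str_detect(raw_lines, date_pattern)
is_footer <- str_detect(raw_lines, page_pattern)

# Hearing date: take the first date header and build an ISO date.
# month.name is a base R constant, so this does not depend on the locale.
date_parts   <- str_match(raw_lines[is_date][1], date_pattern)
month_number <- match(date_parts[, 3], toupper(month.name))
hearing_date <- as.Date(sprintf("%s-%02d-%02d", date_parts[, 4],
                                month_number, as.integer(date_parts[, 2])))

# Page number: each line belongs to the next footer at or below it.
footer_pages <- as.integer(str_match(raw_lines[is_footer], page_pattern)[, 2])
page_of_line <- footer_pages[cumsum(is_footer) - is_footer + 1]

# Drop headers and footers, keeping only spoken content.
keep    <- !is_date & !is_footer
content <- raw_lines[keep]
page    <- page_of_line[keep]

# A new turn starts wherever a line opens with a speaker label;
# other lines are treated as continuations of the previous turn.
m       <- str_match(content, speaker_pattern)
starts  <- !is.na(m[, 1])
turn_id <- cumsum(starts)
text    <- ifelse(starts, m[, 3], content)
in_turn <- turn_id > 0  # ignore any text before the first speaker

transcript <- data.frame(
  date     = hearing_date,
  speaker  = m[starts, 2],
  dialogue = str_squish(as.vector(
    tapply(text[in_turn], turn_id[in_turn], paste, collapse = " ")
  )),
  page     = as.integer(as.vector(
    tapply(page[in_turn], turn_id[in_turn], min)
  ))
)

print(transcript)
```

Each row is one speaker turn, and page records the page on which the turn starts. If the text is read with pdftools::pdf_text(), which returns one string per page, the page number could instead be taken from the position of each string in that vector.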
You must export the final output to a csv or a parquet file. We will do a worked example of this practical on 2022/08/05.
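The export step might look like the sketch below. The small transcript data.frame and the file names are placeholders invented for illustration, and the files are written to a temporary directory so that nothing in your project is overwritten.

```r
library(arrow)

# Placeholder data, invented for illustration; use your own tidy data.frame.
transcript <- data.frame(
  date     = as.Date("2022-12-10"),
  speaker  = c("CHAIRPERSON", "WITNESS"),
  dialogue = c("Good morning.", "Good morning, Chair."),
  page     = c(1L, 1L)
)

# Hypothetical output locations.
csv_path     <- file.path(tempdir(), "transcript.csv")
parquet_path <- file.path(tempdir(), "transcript.parquet")

# Base R csv export; readr::write_csv() is the tidyverse alternative.
write.csv(transcript, csv_path, row.names = FALSE)

# Parquet export keeps column types such as Date.
arrow::write_parquet(transcript, parquet_path)

# Read the parquet file back to check the round trip.
from_parquet <- arrow::read_parquet(parquet_path)
print(from_parquet)
```

A csv file stores every value as text, so the date column has to be parsed again on import, whereas parquet stores the column types alongside the data.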