Practical 2

Author

Sivuyile Nzimeni

Published

July 29, 2022

A logo of the Commission of Inquiry into State Capture. The image includes the South African flag and a balanced scale along with the words Commission of Inquiry into State Capture.

State Capture Commission

INTRODUCTION

The team has received a theoretical and practical grounding in strings from the R for Data Science book. In this practical, we will extend this knowledge by using an example from the South African context. We aim to reinforce our skills in R by relying on the following: 1) regular expressions, 2) tidyverse packages for cleaning text data, and 3) exporting text data to CSV or to Parquet (for example with arrow::write_parquet).

STATE CAPTURE COMMISSION

On 14 October 2016, the former Public Protector, Adv Thuli Madonsela, published the State of Capture Report. In subsequent years, several judicial reviews ensued, ultimately resulting in the establishment of the State Capture Commission. The commission has produced several public artefacts that lend themselves well to data science tasks.

Packages you may need

stringr, arrow, pdftools and stringi. Remember, these libraries are only suggestions. To practice regular expressions, you can use RegExr, an online tool to learn, build and test regular expressions.
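As a warm-up, here is a minimal sketch of the kind of patterns a transcript might call for. The sample lines, the date layout and the speaker-label layout are all made up for illustration; the real transcript may be formatted differently. The sketch uses base R so that it runs without extra packages; stringr::str_detect and stringr::str_extract play the same roles as grepl and regmatches here.

```r
# Made-up lines for illustration only; the real transcript may be laid out differently.
lines <- c(
  "01 JANUARY 2000 - DAY 1",
  "CHAIRPERSON:   Good morning, everybody.",
  "ADV EXAMPLE SC:   Good morning, Chair.",
  "Page 2 of 10"
)

# Assumed layouts (hypothetical):
# a date written as day, month name, four-digit year
date_pattern <- "[0-9]{1,2} [A-Za-z]+ [0-9]{4}"
# a speaker label in capitals at the start of a line, ending in a colon
speaker_pattern <- "^[A-Z][A-Z .]*:"
# a page marker such as "Page 2 of 10"
page_pattern <- "^Page [0-9]+ of [0-9]+$"

# grepl() answers "does this line match?" for every line.
has_speaker <- grepl(speaker_pattern, lines, perl = TRUE)
is_page <- grepl(page_pattern, lines, perl = TRUE)

# regmatches() with regexpr() returns the first match from each line that has one.
dates <- regmatches(lines, regexpr(date_pattern, lines, perl = TRUE))
speakers <- regmatches(lines, regexpr(speaker_pattern, lines, perl = TRUE))
speakers <- sub(":$", "", speakers)  # drop the trailing colon

print(dates)
print(speakers)
```

Patterns like these are easy to try out in RegExr first and then paste into R, remembering that backslashes must be doubled inside R strings.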

EXERCISE

Here, we will focus on one such task: extracting meaningful data from a transcript. The transcript of interest is from the State Capture Commission hearing that sat on 2022/12/10. Your task in this practical is to convert the text into a tidy data.frame; in other words, each column is a variable and each row is an observation. The data.frame must contain the following:

  1. Extracted date of the hearing
  2. Speaker
  3. Dialogue
  4. Page Number
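One possible shape for the result can be sketched on a toy input. Everything below is an assumption made for illustration: `pages` stands in for the one-string-per-page character vector that pdftools::pdf_text returns, and its contents, the date layout and the speaker-label layout are invented. It is a starting point under those assumptions, not the worked solution.

```r
# Hypothetical stand-in for pdftools::pdf_text(): one string per page.
pages <- c(
  "01 JANUARY 2000 - DAY 1\nCHAIRPERSON:   Good morning,\neverybody.\nADV EXAMPLE SC:   Good morning, Chair.",
  "CHAIRPERSON:   Thank you.\nADV EXAMPLE SC:   We call the first\nwitness."
)

# One row per line of text, keeping the page each line came from.
page_lines <- strsplit(pages, "\n", fixed = TRUE)
lines_df <- data.frame(
  page = rep(seq_along(page_lines), lengths(page_lines)),
  text = trimws(unlist(page_lines)),
  stringsAsFactors = FALSE
)

# Take the first date-like string in the text as the hearing date (kept as text).
date_pattern <- "[0-9]{1,2} [A-Za-z]+ [0-9]{4}"
all_text <- paste(lines_df$text, collapse = " ")
hearing_date <- regmatches(all_text, regexpr(date_pattern, all_text, perl = TRUE))

# A line that starts with a capitalised label and a colon opens a new turn;
# lines before the first speaker get turn 0 and are dropped.
speaker_pattern <- "^[A-Z][A-Z .]*:"
is_speaker <- grepl(speaker_pattern, lines_df$text, perl = TRUE)
lines_df$turn <- cumsum(is_speaker)
turn_lines <- lines_df[lines_df$turn > 0, ]

# Collapse each turn into a single row: one observation per spoken turn.
turns <- lapply(split(turn_lines, turn_lines$turn), function(d) {
  label <- regmatches(d$text[1], regexpr(speaker_pattern, d$text[1], perl = TRUE))
  d$text[1] <- sub(speaker_pattern, "", d$text[1], perl = TRUE)
  dialogue <- paste(d$text, collapse = " ")
  data.frame(
    date = hearing_date,
    speaker = sub(":$", "", label),
    dialogue = trimws(gsub("[[:space:]]+", " ", dialogue)),
    page = d$page[1],  # page on which the turn starts
    stringsAsFactors = FALSE
  )
})
transcript <- do.call(rbind, turns)
rownames(transcript) <- NULL

print(transcript)
```

In this sketch a turn that runs over a page break is recorded against the page where it starts. Whether that is the right choice, or whether such turns should be split per page, is a design decision for your own solution.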

You must export the final output to a CSV or a Parquet file. We will do a worked example of this practical on 2022/08/05.
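The export step might look like the sketch below. The `transcript` data.frame and the file names are placeholders invented for illustration. The CSV is written with base R, and the Parquet file is only written if the arrow package happens to be installed.

```r
# Placeholder data standing in for the cleaned transcript.
transcript <- data.frame(
  date = "01 JANUARY 2000",
  speaker = c("CHAIRPERSON", "ADV EXAMPLE SC"),
  dialogue = c("Good morning, everybody.", "Good morning, Chair."),
  page = c(1L, 1L),
  stringsAsFactors = FALSE
)

# Write to a temporary folder so the sketch does not clutter the project.
csv_path <- file.path(tempdir(), "transcript.csv")
write.csv(transcript, csv_path, row.names = FALSE, fileEncoding = "UTF-8")

# Read the file back as a quick check that nothing was lost on the way out.
csv_check <- read.csv(csv_path, stringsAsFactors = FALSE, encoding = "UTF-8")
print(csv_check)

# Parquet export is optional here: it needs the arrow package.
if (requireNamespace("arrow", quietly = TRUE)) {
  parquet_path <- file.path(tempdir(), "transcript.parquet")
  arrow::write_parquet(transcript, parquet_path)
}
```

CSV is easy to open anywhere, while Parquet stores column types alongside the data, which can matter once the date and page columns are converted to proper date and integer types.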