PDM Problem Set 2 - Instructions - BETA VERSION

Spring 2023

THIS PROBLEM SET IS UNDER DEVELOPMENT. I WILL NOTIFY YOU WHEN IT IS FINALIZED.

In this .html file, each question has a “Solution Output” code block, followed by the model output that the code produces.

Your job is to take the .qmd file and fill in the code where indicated, so the Solution Output code blocks work and replicate the output in this .html document.

The grading rubric for this problem set appears after the last question. If your output doesn’t quite match the solution output, you may still receive partial credit.

Question 1

Read a file containing the names and dates of birth of United States presidents.

Print a table showing how many presidents were born in the first quarter, second quarter, etc.

The file to be used for this question: presidents_birth_dates

All dates in the input file are legitimate dates, but some are in a non-standard format. Your code will need to handle these dates.
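
One possible way to handle mixed date layouts is sketched below on a made-up vector of date strings. The strings and the names raw_dates, parsed, birth_quarter, and quarter_counts are illustrative assumptions, not the contents of the real file. The sketch relies on lubridate's parse_date_time(), which accepts several candidate orders, and quarter(), which maps each date to a value from 1 to 4.

```r
library(lubridate)

# Made-up date strings in three different layouts (assumption: the real
# file may use other layouts, so the `orders` argument may need adjusting).
raw_dates <- c("1732-02-22", "October 30, 1735", "13 April 1743")

# Candidate orders: year-month-day, month-day-year, day-month-year.
parsed <- parse_date_time(raw_dates, orders = c("ymd", "mdy", "dmy"))

# quarter() returns 1-4; table() counts how many dates fall in each quarter.
birth_quarter <- quarter(parsed)
quarter_counts <- table(birth_quarter)
print(quarter_counts)
```

If any value comes back as NA, that is usually a sign that one more layout needs to be added to the candidate orders.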

Code:

### Your code here

library(tidyverse)
library(lubridate)

url <- "https://raw.githubusercontent.com/jhudap/pdm_data/main/presidents_birth_dates.csv"

president_name_birth_date_tib <- read_csv(url, show_col_types = FALSE)

Solution Output:

print(birth_date_by_qtr)


1 2 3 4 
3 4 8 6

Question 2

Read three files, performing joins as needed to identify the five Malaysian states with the highest rate of COVID-19 new cases in January 2023.

The files needed for this question can be accessed through the following links:
cases_state: this file contains COVID-19 information by day and by state
state_code: this file contains the names of Malaysian states and state codes
state_population: this file contains the population of Malaysian states (units are millions)

COVID-19 data came from the covid19-public repo on the Ministry of Health GitHub account named MoH-Malaysia. Look in the epidemic folder. I used the cases_state.csv file.

I used this web page to find the Malaysian 2022 population by state and then I created a csv file.

I used this web page to find the names and state codes for Malaysian states and then created a csv file.
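
The general shape of this kind of pipeline (filter to a month, total the cases per state, join in codes and populations, compute a rate, keep the top rows) can be sketched on toy data. Every table and column name below is an assumption made for illustration; check the real files for their actual column names before adapting anything.

```r
library(dplyr)

# Toy stand-ins for the three files; all column names are assumptions.
cases <- tibble(
  date = as.Date(c("2023-01-01", "2023-01-02", "2023-01-01",
                   "2023-02-01", "2023-01-15")),
  state = c("Alpha", "Alpha", "Beta", "Beta", "Gamma"),
  cases_new = c(10, 20, 10, 99, 4)
)
codes <- tibble(state = c("Alpha", "Beta", "Gamma"),
                state_code = c("AL", "BE", "GA"))
population <- tibble(state_code = c("AL", "BE", "GA"),
                     pop_millions = c(2, 0.5, 1))

new_case_rates <- cases %>%
  # Keep January 2023 only.
  filter(date >= as.Date("2023-01-01"), date <= as.Date("2023-01-31")) %>%
  group_by(state) %>%
  summarise(new_cases = sum(cases_new)) %>%
  # Names -> codes -> populations.
  left_join(codes, by = "state") %>%
  left_join(population, by = "state_code") %>%
  mutate(new_cases_per_million = new_cases / pop_millions) %>%
  # Keep the two highest rates (the question asks for five).
  slice_max(new_cases_per_million, n = 2) %>%
  select(state, new_cases_per_million)

print(new_case_rates)
```

Because the population is already in millions, dividing total cases by it gives cases per million directly.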

Code:

### Your code here

url_covid_state <- "https://raw.githubusercontent.com/jhudap/pdm_data/main/cases_state.csv"
covid_state <- read_csv(url_covid_state, show_col_types = FALSE)

Solution Output:

print(top_5_new_case_rates)

# A tibble: 5 × 2
  state             new_cases_per_million
  <chr>                             <dbl>
1 W.P. Putrajaya                    2650 
2 W.P. Kuala Lumpur                  832.
3 Melaka                             588 
4 Selangor                           433.
5 W.P. Labuan                        360

Question 3

The file needed for this question can be accessed through the following link:
cases_state

Read a file that contains COVID-19 data and create a bar chart showing the number of new cases by day of the week.

For this question, consider data for dates between 3-Oct-2022 and 25-Dec-2022. Saturday and Sunday new case counts must be combined into a day named “S_S”.

Your solution must demonstrate your ability to work with factors.

Part of this question will demonstrate your ability to learn a new R skill on the fly - in this case, making a bar plot. You can find a very handy reference for making bar plots here: https://r-graph-gallery.com/barplot.html.
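
As a hedged illustration of the factor work involved, the sketch below builds a day-of-week factor on two weeks of made-up counts, collapses Saturday and Sunday into one level with forcats::fct_collapse(), and draws a bar chart with ggplot2. The data, the column names, and the names day_names, by_day, and toy_fig are assumptions for this example only.

```r
library(dplyr)
library(lubridate)
library(forcats)
library(ggplot2)

# Two weeks of made-up daily counts; 2022-10-03 is a Monday.
daily <- tibble(
  date = as.Date("2022-10-03") + 0:13,
  cases_new = rep(c(10, 20, 30, 40, 50, 5, 5), times = 2)
)

day_names <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")

by_day <- daily %>%
  mutate(
    # wday(..., week_start = 1) returns 1 for Monday through 7 for Sunday.
    day = factor(day_names[wday(date, week_start = 1)], levels = day_names),
    # Merge the two weekend levels into a single level named S_S.
    day = fct_collapse(day, S_S = c("Sat", "Sun"))
  ) %>%
  group_by(day) %>%
  summarise(total = sum(cases_new))

# Bars follow the factor level order, not alphabetical order.
toy_fig <- ggplot(by_day, aes(x = day, y = total)) +
  geom_col() +
  labs(x = "Day of week", y = "New cases")
```

Setting the levels explicitly is what keeps the bars in calendar order.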

Code:

### Your code here
url_covid_data <- "https://raw.githubusercontent.com/jhudap/pdm_data/main/cases_state.csv"
covid_data_1 <- read_csv(url_covid_data, show_col_types = FALSE)

Solution Output:

fig

Question 4

The file needed for this question can be accessed through the following link:
cases_ages

The input file contains data showing counts for people who have contracted a virus. The file contains data for multiple years. Each observation is a case reporting day. The case counts are organized by age bracket.

When processing the data, ignore observations where there is missing data.

The multiple age brackets make this data set a good example of wide data.

Your processing must convert the wide data into long format. Then create summary counts and, in doing so, reclassify the age brackets into a simpler three-class structure, as shown below:

Ages 0 - 17 must be classified as youth
Ages 18 - 49 must be classified as adult
Ages 50 and up must be classified as mature

Print your newly classified and summarized case counts to the console.
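
The wide-to-long-to-summary flow described above might look roughly like the following on a toy table. The bracket column names (age_0_11 and so on) and the object names are hypothetical; the real file's brackets will differ and need their own mapping into youth, adult, and mature.

```r
library(dplyr)
library(tidyr)

# Toy wide data: one row per reporting day, one column per age bracket.
# Column names are assumptions made for this example.
wide <- tibble(
  date = as.Date(c("2021-06-01", "2021-06-02", "2022-06-01", "2022-06-02")),
  age_0_11 = c(1, 2, NA, 4),
  age_12_17 = c(1, 1, 1, 1),
  age_18_49 = c(10, 10, 10, 10),
  age_50_plus = c(5, 5, 5, 5)
)

toy_summary <- wide %>%
  drop_na() %>%                                   # ignore incomplete rows
  pivot_longer(starts_with("age_"),
               names_to = "bracket", values_to = "cases") %>%
  mutate(
    case_year = format(date, "%Y"),
    age_class = case_when(
      bracket %in% c("age_0_11", "age_12_17") ~ "youth",
      bracket == "age_18_49" ~ "adult",
      TRUE ~ "mature"
    )
  ) %>%
  group_by(case_year, age_class) %>%
  summarise(cases = sum(cases), .groups = "drop") %>%
  # Back to one row per year for printing.
  pivot_wider(names_from = age_class, values_from = cases)

print(toy_summary)
```

Note that drop_na() removes the whole reporting day when any bracket is missing, which is one reading of "ignore observations where there is missing data".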

Code:

### Your code here
url_cases_data <- "https://raw.githubusercontent.com/jhudap/pdm_data/main/cases_ages.csv"
cases_ages <- read_csv(url_cases_data, show_col_types = FALSE)

Solution Output:

print(summarised_cases_wide)

# A tibble: 4 × 4
# Groups:   case_year [4]
  case_year youth_cases adult_cases mature_cases
  <chr>           <dbl>       <dbl>        <dbl>
1 2020              533        3370         1179
2 2021            43970      152374        37620
3 2022            28318       92945        31467
4 2023               53         271          128

Question 5

In this problem, we will take a text file and transform it into a tibble that we can analyze.

There are multiple steps to this problem. In this .html file, you’ll see the output you should generate at each step. In the assignment worksheet, you’ll have to insert the code where indicated.

First import the text file, which I have done for you in this code chunk. You can view the text file here.

text <- read_file(url("https://raw.githubusercontent.com/jhudap/pdm_data/main/msnbc_transcripts.txt"))

Now, text is a variable that contains a single, very long string of characters, as you can see here.

typeof(text)

[1] "character"

nchar(text)

[1] 7059967

If you open up the link to the .txt file, you’ll see that it contains 131 different news program transcripts. Between documents, there is a unique chunk of text that appears only between transcripts of episodes: "Copyright 2015 CQ". This means that you can split the character string at this chunk of text and generate a vector with the text of each individual episode as an element of the vector.

Use that text string to split the text character string into its constituent transcripts, and assign the output to the variable episodes. It might be helpful to refer to the stringr cheat sheet to identify the best function to use.

After you’ve created the episodes variable, the solution output uses R functions that show the type, test whether it is a vector, and show its length.
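
To see how splitting on a marker behaves, here is a small sketch on a made-up string with a made-up marker ("==END=="). It uses stringr's str_split(), which is one candidate function and not necessarily the one used in the solution; toy_text and pieces are illustrative names.

```r
library(stringr)

# A made-up string in which the marker appears before every document.
toy_text <- "==END==first transcript==END==second transcript"

# str_split() returns a list with one entry per input string,
# so [[1]] pulls out the character vector of pieces.
pieces <- str_split(toy_text, fixed("==END=="))[[1]]

print(length(pieces))
print(pieces)
```

Because the marker sits at the very start of the string, the first piece is an empty string, so two documents produce three elements.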

Code:

### Your code here

Solution Output:

typeof(episodes)

[1] "character"

is.vector(episodes)

[1] TRUE

length(episodes)

[1] 132

You’ll note that the vector has 132 elements, one more than the 131 episodes. That’s because the first element is empty, just a byproduct of how this function works. So, we’ll just remove the first element of the vector.

episodes[1]

[1] ""

episodes <- episodes[-1]

Now, we want to know the show title for each episode. If you look at the text document, you’ll see that there is a line in each episode transcript that identifies the show title.

As you think about it, you realize that there is a pattern to these lines. They all start with SHOW: and end with EST. If only you could extract that text automatically and put it into a tibble, rather than having to copy and paste 131 times…you would look like such a technical whiz!

Out of desperation, you type into Google “extract a string between two words r”.

The first search result is some random person’s post on Stack Overflow suggesting you could use something called a “regular expression” and str_extract() to extract strings of text that start and end with certain words.

You’re still a little confused, so you send a Teams message to your senior programming colleague to ask how to write a regular expression. He is busy binge-watching The Last of Us, but he sends you the following message on Teams: “use as the pattern argument pattern = "SHOW\\s*(.*?)\\s*EST".”

You figure out how to use str_extract and that regular expression to extract the show titles from each element of your character vector. You print the first 10 elements of shows to the console.
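
If the pattern looks cryptic, the sketch below runs it on two made-up strings so you can see what each piece does. The toy vector and the names toy, pattern, and titles are assumptions for illustration.

```r
library(stringr)

toy <- c(
  "header text SHOW: HARDBALL 5:00 PM EST body text, later EST again",
  "an element with no title line"
)

# SHOW     literal start word
# \\s*     optional whitespace (this also matches line breaks)
# (.*?)    as few characters as possible (lazy), so the match stops at
#          the FIRST "EST"; by default "." does not match a line break
# \\s*EST  optional whitespace, then the literal end word
pattern <- "SHOW\\s*(.*?)\\s*EST"

# str_extract() returns the whole match (both end words included),
# or NA when an element contains no match.
titles <- str_extract(toy, pattern)
print(titles)
```

Swapping the two end words for a different pair is all it takes to reuse the same idea elsewhere.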

Code:

### Your code here

Solution Output:

shows[1:10]

 [1] "SHOW: UP with STEVE KORNACKI 8:00 AM EST"                
 [2] "SHOW: MELISSA HARRIS-PERRY 10:00 AM EST"                 
 [3] "SHOW: THE ED SHOW 5:00 PM EST"                           
 [4] "SHOW: HARDBALL 5:00 PM EST"                              
 [5] "SHOW: POLITICS NATION 6:00 PM EST"                       
 [6] "SHOW: ALL IN with CHRIS HAYES 8:00 PM EST"               
 [7] "SHOW: THE RACHEL MADDOW SHOW 9:00 PM EST"                
 [8] "SHOW: THE LAST WORD WITH LAWRENCE O`DONNELL 10:00 PM EST"
 [9] "SHOW: HARDBALL 5:00 PM EST"                              
[10] "SHOW: THE ED SHOW 5:00 PM EST"

You also notice that there is a pattern of dates in each document. At the start of the transcript for each episode, the date of the episode lies between MSNBC and SHOW.

By analogy, figure out how to use str_extract and modify the regular expression pattern to extract the show dates, assign them to the variable dates, and print the first 10 elements of dates to the console.

Code:

### Your code here

Solution Output:

dates[1:10]

 [1] "MSNBC\r\n\r\n                            February 1, 2015 Sunday\r\n\r\n                    SHOW"        
 [2] "MSNBC\r\n\r\n                            February 1, 2015 Sunday\r\n\r\n                    SHOW"        
 [3] "MSNBC\r\n\r\n                            February 2, 2015 Monday\r\n\r\n                         SHOW"   
 [4] "MSNBC\r\n\r\n                            February 2, 2015 Monday\r\n\r\n                           SHOW" 
 [5] "MSNBC\r\n\r\n                            February 2, 2015 Monday\r\n\r\n                       SHOW"     
 [6] "MSNBC\r\n\r\n                            February 2, 2015 Monday\r\n\r\n                   SHOW"         
 [7] "MSNBC\r\n\r\n                            February 2, 2015 Monday\r\n\r\n                    SHOW"        
 [8] "MSNBC\r\n\r\n                            February 2, 2015 Monday\r\n\r\n            SHOW"                
 [9] "MSNBC\r\n\r\n                            February 3, 2015 Tuesday\r\n\r\n                           SHOW"
[10] "MSNBC\r\n\r\n                            February 3, 2015 Tuesday\r\n\r\n                         SHOW"

Now again, you perceptively notice that every element of the dates variable contains a date that starts with “Fe” and ends with “015”. You want to take the vector of dates you just created and apply the same function to get clean dates in a format like “February 16, 2015”, assigned to the variable dates_fixed. You think you can use the str_extract function and modify that regular expression somehow to fit this problem.

Code:

### Your code here

Solution Output:

dates_fixed[1:10]

 [1] "February 1, 2015" "February 1, 2015" "February 2, 2015" "February 2, 2015"
 [5] "February 2, 2015" "February 2, 2015" "February 2, 2015" "February 2, 2015"
 [9] "February 3, 2015" "February 3, 2015"

Now, finally, you have been directed to assess the reading level of each of these transcripts. Your boss tells you there is something called a “Flesch-Kincaid” readability score that estimates the grade level/reading difficulty of a text from its sentence lengths and word lengths (in syllables), and that it is available in the quanteda.textstats package. He doesn’t remember the name of the function within that package, so you have to look at the documentation.

You figure it out and assign the vector of Flesch-Kincaid scores to a vector called read_level.
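
For intuition about what such a score measures, here is a hand-rolled sketch of the standard Flesch Reading Ease formula. This is an assumption about which member of the Flesch-Kincaid family is in play (the scores in the solution output look consistent with its scale), and it is not the package function you are asked to find. The function name and the word, sentence, and syllable counts are made up.

```r
# Flesch Reading Ease:
#   206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
# Higher scores mean easier text; longer sentences and longer words
# both pull the score down.
flesch_reading_ease <- function(n_words, n_sentences, n_syllables) {
  206.835 -
    1.015 * (n_words / n_sentences) -
    84.6 * (n_syllables / n_words)
}

# Made-up counts: short sentences and short words.
easy_score <- flesch_reading_ease(n_words = 100, n_sentences = 10,
                                  n_syllables = 130)

# Made-up counts: longer sentences and longer words.
hard_score <- flesch_reading_ease(n_words = 100, n_sentences = 5,
                                  n_syllables = 150)

print(c(easy = easy_score, hard = hard_score))
```

The package does the word, sentence, and syllable counting for you; the sketch only shows how those counts turn into a single number.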

Code:

### Your code here

Solution Output:

read_level[1:10]

 [1] 75.16509 70.83982 75.52904 71.14188 72.67580 66.04337 64.89148 67.36121
 [9] 71.42190 67.74345

You decide to put all of this data you’ve created into a tibble.

Code:

### Your code here

Solution Output:

text_tibble

# A tibble: 131 × 3
   show                                                     dates        read_…¹
   <chr>                                                    <chr>          <dbl>
 1 SHOW: UP with STEVE KORNACKI 8:00 AM EST                 February 1,…    75.2
 2 SHOW: MELISSA HARRIS-PERRY 10:00 AM EST                  February 1,…    70.8
 3 SHOW: THE ED SHOW 5:00 PM EST                            February 2,…    75.5
 4 SHOW: HARDBALL 5:00 PM EST                               February 2,…    71.1
 5 SHOW: POLITICS NATION 6:00 PM EST                        February 2,…    72.7
 6 SHOW: ALL IN with CHRIS HAYES 8:00 PM EST                February 2,…    66.0
 7 SHOW: THE RACHEL MADDOW SHOW 9:00 PM EST                 February 2,…    64.9
 8 SHOW: THE LAST WORD WITH LAWRENCE O`DONNELL 10:00 PM EST February 2,…    67.4
 9 SHOW: HARDBALL 5:00 PM EST                               February 3,…    71.4
10 SHOW: THE ED SHOW 5:00 PM EST                            February 3,…    67.7
# … with 121 more rows, and abbreviated variable name ¹read_level

You sort these data so that you can figure out which episode of which show, on which date, had the lowest Flesch-Kincaid score (where lower scores actually indicate more difficult reading content).

That score corresponds to roughly a 9th/10th grade reading level, per Wikipedia.
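
A minimal sketch of the last two steps (combining equal-length vectors into a tibble, then sorting by score) on made-up values is shown here. The vectors and the names toy_tibble and toy_sorted are illustrative assumptions; the column names mirror the ones in the solution output.

```r
library(dplyr)

# Three made-up episodes; each vector must have the same length.
toy_shows  <- c("SHOW: A", "SHOW: B", "SHOW: C")
toy_dates  <- c("February 1, 2015", "February 2, 2015", "February 3, 2015")
toy_scores <- c(70.2, 61.5, 66.0)

# One column per vector, one row per episode.
toy_tibble <- tibble(show = toy_shows, dates = toy_dates,
                     read_level = toy_scores)

# arrange() sorts ascending by default, so the lowest score comes first.
toy_sorted <- arrange(toy_tibble, read_level)
print(toy_sorted)
```

Wrapping the column in desc() would flip the order if you wanted the highest score on top instead.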

Code:

### Your code here

Solution Output:

sorted_tibble

# A tibble: 131 × 3
   show                                                     dates        read_…¹
   <chr>                                                    <chr>          <dbl>
 1 SHOW: MELISSA HARRIS-PERRY 10:00 AM EST                  February 7,…    61.1
 2 SHOW: ALL IN with CHRIS HAYES 8:00 PM EST                February 18…    61.3
 3 SHOW: THE LAST WORD WITH LAWRENCE O`DONNELL 10:00 PM EST February 12…    61.4
 4 SHOW: THE RACHEL MADDOW SHOW 9:00 PM EST                 February 3,…    61.7
 5 SHOW: THE RACHEL MADDOW SHOW 9:00 PM EST                 February 4,…    61.8
 6 SHOW: ALL IN with CHRIS HAYES 8:00 PM EST                February 5,…    61.8
 7 SHOW: THE RACHEL MADDOW SHOW 9:00 PM EST                 February 23…    61.9
 8 SHOW: ALL IN with CHRIS HAYES 8:00 PM EST                February 4,…    62.4
 9 SHOW: THE LAST WORD WITH LAWRENCE O`DONNELL 10:00 PM EST February 25…    63.0
10 SHOW: MELISSA HARRIS-PERRY 10:00 AM EST                  February 8,…    63.1
# … with 121 more rows, and abbreviated variable name ¹read_level

Grading Rubric

A: 100% - perfect or very nearly perfect execution. Task completion aligns with expected results with only trivial exceptions. The student has excellent command of the substance of the material.

A-: 92% - There are more than trivial errors in the assignment, but the assignment is mostly successful. The student understands the main concepts and problem-solving techniques, but has some gaps in their understanding or execution.

B+: 89% - The assignment is partially successful, but there are multiple substantive errors in task completion spread throughout the assignment. The student is not completely lost, but they require clarification of concepts or methods.

B: 86% - The assignment is not successful. The student exhibits poor understanding of the methods being used. The student made an effort, but the work is flawed and it is clear the student should seek additional support.

B-: 82% - The assignment is seriously flawed, suggesting significant need for remediation. The student has not gone in an entirely wrong or unproductive direction, but the student needs to change their study habits or seek further support to clarify key concepts and techniques.

C: 75% - Very little if anything is correct, suggesting insufficient effort or failure to develop fundamental proficiency with the subject matter.