Requirements (Please Read Carefully):

  1. Submit your report as a PDF knitted from R Markdown, along with the .Rmd file.

  2. Organize your report clearly by task and question, using different levels of headers. Please refer to the example uploaded to Canvas for the last lab homework.

  3. For each question, include the question itself, the code, results, and graphs used to answer it, and your answer in plain language.

  4. Polish the details of your graphs so that they are comfortable to read.

  5. The use of AI is not allowed in any form.


1. Regular expressions (30 bonus points. You need to self-study the lecture notes.)


a) Using the words data set, find all the words that match each of the following patterns:

  • are exactly four letters long
  • are either four or five letters long
  • have “s” or “t” as the second letter
  • contain a pattern like “oxx”, where “o” is one letter and “x” is a different letter that appears twice in a row
  • contain “a”, “e” and “o” at the same time
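
The building blocks behind patterns of this kind (anchors, counted quantifiers, character classes, backreferences, and combined tests) can be sketched on a small made-up vector. This is only an illustration under stated assumptions: `toy` is a hypothetical vector, not the words data set, and base R's `grepl()` is used so the sketch runs without extra packages; stringr's `str_detect()` plays the same role in the tidyverse.

```r
# Toy vector (hypothetical; not the `words` data set from the assignment)
toy <- c("cat", "tree", "apple", "book", "idea", "upon", "banana")

# Exactly three letters: anchors plus a counted quantifier
three <- toy[grepl("^.{3}$", toy)]

# Three or four letters: a quantifier range
three_or_four <- toy[grepl("^.{3,4}$", toy)]

# First letter is a vowel: a character class right after the start anchor
vowel_first <- toy[grepl("^[aeiou]", toy)]

# Any doubled letter: capture one character, then repeat it with a backreference
doubled <- toy[grepl("(.)\\1", toy)]

# Contains both "a" and "e": combine two simple tests with &
a_and_e <- toy[grepl("a", toy) & grepl("e", toy)]
```

Combining several simple tests with `&` is often easier to read than a single regular expression that has to allow the letters in every possible order.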


b) Using the sentences data set, make the following plots:

  • a bar plot counting sentences with and without “the” (or “The”).
  • a scatterplot with \(x\) being the average length of words in a sentence, and \(y\) being the number of words starting with “a” or “e” or “i” or “o” or “u” in the sentence.
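
Per-sentence quantities like these can be computed before any plotting. The following is a minimal sketch on three made-up sentences, assuming that a "word" is a run of ASCII letters; `toy_sentences` and the chosen letters are hypothetical and not part of the assignment. It uses base R only.

```r
# Toy sentences (hypothetical; not the `sentences` data set)
toy_sentences <- c(
  "The cat sat on a mat.",
  "An owl is up early.",
  "Dogs bark."
)

# Flag sentences containing the standalone word "cat" (\\b is a word boundary)
has_cat <- grepl("\\bcat\\b", toy_sentences, perl = TRUE)

# Split each sentence into words: one character vector per sentence
word_list <- regmatches(toy_sentences, gregexpr("[A-Za-z]+", toy_sentences))

# Average word length per sentence
avg_len <- sapply(word_list, function(w) mean(nchar(w)))

# Number of words per sentence starting with "s" or "t", ignoring case
n_st <- sapply(word_list, function(w) sum(grepl("^[st]", w, ignore.case = TRUE)))

# Collect everything in one data frame, ready to hand to a plotting function
summary_df <- data.frame(has_cat = has_cat, avg_len = avg_len, n_st = n_st)
```

A logical column such as `has_cat` suits a bar plot of counts, while the two numeric columns can serve as the axes of a scatterplot.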


c) Application

  1. Download the Oxford English Dictionary as a “.txt” file from https://drive.google.com/file/d/1r0MrJDUGVv_Xh1I1cZ4ldMJedYOsiIAH/view?usp=sharing

  2. Read it into RStudio with the read_lines() function (look up how to use it yourself).

  3. Turn the dictionary into a tibble and remove all blank lines

  4. Use regular expressions to extract the words for each entry into a separate column named “words”

  5. Find all words in the dictionary that contain “a”, “e”, “i”, “o”, “u” and “y” at the same time.
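
The general shape of such a pipeline (drop blank lines, extract a leading word, filter on letters) can be sketched as follows. Everything specific here is an assumption: the toy lines and their "headword, two spaces, definition" layout are invented for illustration and may not match the real dictionary file, and a base R data frame stands in for a tibble.

```r
# Toy lines in an ASSUMED layout; the real file may look different
raw_lines <- c(
  "Abacus  n. a frame with beads for counting.",
  "",
  "Facetious  adj. treating serious issues with humour.",
  "   ",
  "Zebra  n. an African wild horse with stripes."
)

entries <- data.frame(text = raw_lines, stringsAsFactors = FALSE)

# Drop lines that are empty or contain only whitespace
entries <- entries[!grepl("^[[:space:]]*$", entries$text), , drop = FALSE]

# Extract the leading run of letters as the headword.
# Note: regmatches() silently skips non-matching lines, so this step
# assumes every remaining line starts with a letter.
entries$words <- regmatches(entries$text, regexpr("^[A-Za-z]+", entries$text))

# Keep headwords containing "a", "e" and "i" in any order (case-insensitive)
lower <- tolower(entries$words)
hits <- entries$words[grepl("a", lower) & grepl("e", lower) & grepl("i", lower)]
```

Lower-casing before the letter tests avoids missing a word only because its first letter is capitalized.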


2. Factors


a) Use BankChurners.csv to answer the following questions:

  • Which features can be regarded as a factor?
  • Which features can be regarded as an ordered factor (ordinal)?
  • Read BankChurners.csv into RStudio, then change the columns that you answered above into factors or ordered factors.
  • Visualize the effect of education level on average utilization ratio.
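
The difference between a plain factor and an ordered factor, and a per-level summary that respects the level order, can be sketched on a made-up data frame. The column names `size`, `color`, and `score` are hypothetical and unrelated to the columns of BankChurners.csv.

```r
# Toy data (hypothetical; not BankChurners.csv)
toy <- data.frame(
  size  = c("small", "large", "medium", "small", "large", "medium"),
  color = c("red", "blue", "red", "green", "blue", "red"),
  score = c(1, 5, 3, 2, 6, 4),
  stringsAsFactors = FALSE
)

# Nominal variable: a plain (unordered) factor
toy$color <- factor(toy$color)

# Ordinal variable: list the levels explicitly from lowest to highest
toy$size <- factor(toy$size,
                   levels = c("small", "medium", "large"),
                   ordered = TRUE)

# Mean score per level; the result follows the level order, not the alphabet
mean_by_size <- tapply(toy$score, toy$size, mean)

# Uncomment to draw a quick bar chart of the group means:
# barplot(mean_by_size, xlab = "size", ylab = "mean score")
```

Stating the levels explicitly matters because `factor()` otherwise sorts them alphabetically, which would place "large" before "small" on a plot axis.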


b) Use the gss_cat data set to answer the following questions:

  • What are the levels of the marital variable?
  • Combine “Separated”, “Divorced”, “Widowed” into a new category “Once Married”.
  • Using the new levels, explore whether there is an effect of marital status on tvhours.
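
Merging several factor levels into one can be sketched in base R on a made-up factor; `status` and its labels are hypothetical and unrelated to gss_cat. In the tidyverse, forcats' `fct_collapse()` serves the same purpose.

```r
# Toy factor (hypothetical; not the `marital` variable)
status <- factor(c("new", "open", "closed", "resolved", "open", "archived"))

# Inspect the current levels (alphabetical by default)
levels(status)

# Assigning the same label to several levels merges them into one level
merged <- status
levels(merged)[levels(merged) %in% c("closed", "resolved", "archived")] <- "done"

# Counts per level after merging
counts <- table(merged)
```

Working on a copy (`merged`) keeps the original factor available for comparison.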


3. Date and Time - nycflights13 data set