W7-1: Supervised Learning and Data Manipulation

Learning Objectives

Understand the task of supervised machine learning, and learn about feature representation.
Learn how textual data are fed into machine learning algorithms.
Introduce tidy data principles and see how to make data tidy with functions from the magrittr and dplyr packages.
See how the tidytext package applies tidy data principles to text via the unnest_tokens() function.

What is supervised machine learning?

The task of supervised machine learning consists of using an automatic system (aka algorithm) to learn from a history of occurrences of a certain “event” and consequently make predictions about future occurrences of that event.

Supervised machine learning has a counterpart, unsupervised machine learning. Later in this course we will learn about topic modeling, one of the most popular unsupervised machine learning algorithms.

Today, let’s focus on supervised machine learning. Basically, supervised machine learning predicts an outcome for each case based on its features. The algorithm learns how much each feature matters for the prediction by examining how the features relate to the already known outcomes of other cases.

Let me give you an example of what supervised machine learning does with text data. Consider the task of predicting whether a given email is spam. Imagine a set of emails that are already known to be spam or not spam. When we feed these cases to a learning algorithm as input data, it learns to classify a new email as spam or not spam based on a set of features that characterize spam mails. For example, words like money, finance, earning, event, and chance could be textual features that are likely to characterize spam mails. Using these textual features, the supervised learning algorithm classifies a given email into the spam group or the non-spam group.

Case	money	finance	earning	event	chance	Spam Mail?
Email 1	0	0	0	1	0	No
Email 2	2	3	1	2	1	Yes
Email 3	1	1	0	1	0	Yes
Email 4	0	0	1	0	1	No
Email 5	2	1	0	2	2	?

The table above presents, for each email, the frequency of each word used as a feature for predicting whether the email is spam. Based on the word frequencies and known labels of Emails 1 to 4, a supervised machine learning algorithm uses the word frequencies in Email 5 to classify it as Yes or No.
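As a rough illustration of how such a table could be used, the sketch below stores Emails 1 to 4 as a small data frame and labels Email 5 with the class of its closest labelled email (a one-nearest-neighbour rule). This rule is only an assumption chosen for its simplicity, not the specific algorithm used in this course, and the object names (emails, email5, prediction) are hypothetical.

```r
# Word frequencies for the labelled emails (values copied from the table)
emails <- data.frame(
  money   = c(0, 2, 1, 0),
  finance = c(0, 3, 1, 0),
  earning = c(0, 1, 0, 1),
  event   = c(1, 2, 1, 0),
  chance  = c(0, 1, 0, 1),
  spam    = c("No", "Yes", "Yes", "No"),
  row.names = c("Email 1", "Email 2", "Email 3", "Email 4")
)

# Feature vector of the unlabelled case, Email 5
email5 <- c(money = 2, finance = 1, earning = 0, event = 2, chance = 2)

# Squared Euclidean distance from Email 5 to every labelled email
features  <- as.matrix(emails[, names(email5)])
distances <- rowSums(sweep(features, 2, email5)^2)

# Assumed toy rule: copy the label of the nearest labelled email
prediction <- emails$spam[which.min(distances)]
prediction
```

In this toy setting, Emails 2 and 3 are equally close to Email 5 and both are spam, so the sketch labels Email 5 as "Yes". Real classifiers are trained on far more cases and features.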

So, to classify textual documents such as emails or news articles into two possible classes (spam versus non-spam mail, or fake versus genuine news), the text data must be structured into a document-feature matrix like the one shown above. This process can begin with tokenization under tidy data principles.

Tidy data Principles

So far, we have worked with text data extracted from a Wikipedia page on the COVID-19 vaccine. The text was stored as a character vector containing a single string. But we sometimes need to work with multiple strings of text, each annotated with information about that text. Let’s say we compare different news articles about a COVID-19 vaccine. In this case, we want to analyze not only the text of each article but also information about its source, reporter, publication date, and so on. Or we may want to know what people think and say about the vaccine. To find out, we can retrieve Twitter data by extracting all tweets that mention COVID-19 vaccines. Unlike a single string of text extracted from a Wikipedia page, the Twitter data contain many tweets, and each tweet must be distinguishable from the others.

Tweet	User ID	Date	Text	Like	Retweet
Tweet 1	abc	2021-04-02	“It seems increasingly likely that having a Vaccination Passport…”	10	5
Tweet 2	def	2021-04-02	“Private hospitals volunteer to assist public in registering for #Covid19 vaccine…”	23	3
Tweet 3	ghi	2021-04-03	“A total of 67,808 Covid-19 vaccine doses have been dispensed…”	7	1
Tweet 4	jkl	2021-04-03	“Vaccine appointments are available at…”	1	0
Tweet 5	mno	2021-04-04	“The figures for deaths available today are:…”	20	9

The format above lets the data hold multiple tweets individually, with each tweet characterized not only by its textual content but also by other information such as user ID, posting date, and the number of likes and retweets. This is possible because the Twitter data are structured as a data frame, where each row is one observation (a tweet) and each column is one feature of the tweets. Generally speaking, when we refer to data, we mean data structured in this data frame format.
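For illustration, the example table can be written as an R data frame with base R's data.frame(). This is a minimal sketch: the column names are my own choice, and the tweet texts are kept truncated exactly as they appear in the table.

```r
# One row per tweet, one column per feature
tweets <- data.frame(
  user_id = c("abc", "def", "ghi", "jkl", "mno"),
  date    = as.Date(c("2021-04-02", "2021-04-02", "2021-04-03",
                      "2021-04-03", "2021-04-04")),
  text    = c(
    "It seems increasingly likely that having a Vaccination Passport…",
    "Private hospitals volunteer to assist public in registering for #Covid19 vaccine…",
    "A total of 67,808 Covid-19 vaccine doses have been dispensed…",
    "Vaccine appointments are available at…",
    "The figures for deaths available today are:…"
  ),
  like    = c(10, 23, 7, 1, 20),
  retweet = c(5, 3, 1, 0, 9),
  stringsAsFactors = FALSE
)

dim(tweets)      # number of rows (tweets) and columns (features)
tweets$text[3]   # the text of the third tweet
str(tweets)      # the type of each column
```

Because each tweet sits in its own row, its text stays linked to its user ID, date, and counts of likes and retweets.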

From now on, we are going to work with text data and learn how to (pre-)process text in the data frame format. To do so, we are going to install and load several packages: magrittr, dplyr, and tidytext, as well as stringr.

Let’s take a look at what functions these packages provide and how the functions are used for text data processing.

First of all, the magrittr package provides the pipe operator, %>%, which is very useful in data processing.

Background of the pipe Operator in R

Let’s say we have two functions, str_squish(A) => B and str_split(B, pattern=" ") => C. The function str_squish() takes the input A, removes any redundant whitespace, and returns the outcome B. The function str_split() takes the input B, splits the string into tokens at each blank, and returns the outcome C.

So far, we have run the two functions step by step. But using the pipe operator, we can chain these two functions together by taking the output of one function and inserting it into the next. In short, “chaining” means that we pass an intermediate result on to the next function. Here, “str_split follows str_squish”, which without the pipe is written as the nested call str_split(str_squish(x), pattern=" ") and with the pipe as x %>% str_squish() %>% str_split(pattern=" ").
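A small sketch of this idea, using a made-up input string x (an assumption for illustration only), shows that the nested call and the piped call give the same result:

```r
library(magrittr)  # provides the pipe operator %>%
library(stringr)   # provides str_squish() and str_split()

# Hypothetical input with redundant whitespace
x <- "  tidy   data    principles  "

# Nested call: read from the inside out
nested <- str_split(str_squish(x), pattern = " ")

# Piped call: read from left to right
piped <- x %>% str_squish() %>% str_split(pattern = " ")

identical(nested, piped)  # both hold the same list of tokens
nested[[1]]               # the tokens themselves
```

The pipe simply places whatever is on its left-hand side into the first argument of the function on its right-hand side.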

In R, we can pass the result of one command to the next with the pipe operator. As we’ve seen, R code often contains lots of parentheses, ( and ), especially when the code is complex: functions are nested in another function, which is nested in yet another function, and so on. This makes R code hard to read and understand. Here’s where %>% comes to the rescue.

Here’s an example:

library(magrittr)
library(stringr)

covid_sent <- "A COVID-19 vaccine is a vaccine intended to provide acquired immunity\r\nagainst severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the\r\nvirus causing coronavirus disease 2019 (COVID-19)."

# To 1) normalize to lower case, 2) remove punctuation characters except the hyphen, 3) trim whitespace, 4) split the string into words, 5) convert the resulting list into a vector of words, 6) count the occurrences of each word, and 7) sort the frequencies in descending order, we used the following code:

sort(table(unlist(str_split(str_squish(str_replace_all(tolower(covid_sent), "[^[:alnum:][:space:]-]", " ")), pattern=" "))), decreasing = TRUE)

## 
##           a coronavirus    covid-19     vaccine           2        2019 
##           2           2           2           2           1           1 
##    acquired       acute     against     causing     disease    immunity 
##           1           1           1           1           1           1 
##    intended          is     provide respiratory  sars-cov-2      severe 
##           1           1           1           1           1           1 
##    syndrome         the          to       virus 
##           1           1           1           1

# It looks very complicated and is hard to read what each function does.

But with the help of %>%, we can rewrite the above code step by step as follows:

covid_sent %>% 
        tolower() %>% 
        str_replace_all("[^[:alnum:][:space:]-]", " ") %>% 
        str_squish() %>% 
        str_split(pattern=" ") %>% 
        unlist() %>% 
        table() %>% 
        sort(decreasing = TRUE)

## .
##           a coronavirus    covid-19     vaccine           2        2019 
##           2           2           2           2           1           1 
##    acquired       acute     against     causing     disease    immunity 
##           1           1           1           1           1           1 
##    intended          is     provide respiratory  sars-cov-2      severe 
##           1           1           1           1           1           1 
##    syndrome         the          to       virus 
##           1           1           1           1

Using the pipe operator, we can write R code in an intuitive way, chaining a sequence of functions together in the order in which they run.

Tidy data

Tidy data is basically just a way of consistently organizing your data that often makes subsequent analysis easier.

Tidy data has three requirements:

Each variable has its own column.
Each observation has its own row.
Each value has its own cell.

In our example of Twitter data, the variables are User ID, Date, Text, Like, and Retweet; each observation is a tweet; and the values are the contents of those variables for each tweet.
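To make the three requirements concrete, here is a hedged sketch with made-up numbers. It uses pivot_longer() from the tidyr package, which this lecture does not otherwise cover, so treat it as one possible way to tidy a table rather than the method used in this course. In the untidy version the variable "date" is hidden in the column names; in the tidy version every variable has its own column.

```r
library(tidyr)

# Hypothetical, untidy layout: the day is spread across column names
likes_wide <- data.frame(
  user_id      = c("abc", "def"),
  `2021-04-02` = c(10, 23),
  `2021-04-03` = c(7, 1),
  check.names  = FALSE
)

# Tidy layout: one column per variable, one row per observation
likes_tidy <- pivot_longer(
  likes_wide,
  cols      = -user_id,   # reshape every column except user_id
  names_to  = "date",     # old column names become values of `date`
  values_to = "like"      # old cell values become values of `like`
)
likes_tidy
```

The tidy version has four rows, one for each combination of user and date.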

At this point, I am going to give you an actual dataset that consists of 114,299 tweets mentioning COVID-19 vaccine(s). All tweets are written in English and were posted worldwide during the week of March 29th to April 4th, 2021. This dataset, named cv_tweets, is available under Lecture Resources on our E-class page.

Once you download the file, you may need to move the file into your R working directory and load it into the current R session.

load("cv_tweets.RData")
dim(cv_tweets)

## [1] 114299     91

# This `cv_tweets` dataset contains 114,299 tweets (rows) and 91 variables (columns). The tweets were posted on Twitter during the week from March 29th to April 4th, 2021. The data were collected through the Twitter API.

cv_tweets[1:10,1:5]

##    user_id           status_id          created_at screen_name
## 1  1652541 1378336545934032905 2021-04-03 13:20:15     Reuters
## 2  1652541 1378823457031471114 2021-04-04 21:35:04     Reuters
## 3  1652541 1378848625384636418 2021-04-04 23:15:04     Reuters
## 4  1652541 1378700128320638976 2021-04-04 13:25:00     Reuters
## 5  1652541 1378776935413743618 2021-04-04 18:30:12     Reuters
## 6  1652541 1378388094429372417 2021-04-03 16:45:05     Reuters
## 7  1652541 1378802069625274370 2021-04-04 20:10:04     Reuters
## 8  1652541 1378651093085851650 2021-04-04 10:10:09     Reuters
## 9  1652541 1378335309176438786 2021-04-03 13:15:20     Reuters
## 10 1652541 1378386857973665800 2021-04-03 16:40:10     Reuters
##                                                                                                                                                                                                                                                                 text
## 1                                                                                                                                             Ukraine approves Chinese vaccine as COVID-19 cases hit new record high https://t.co/zaICefv3vE https://t.co/lAk26WHMbQ
## 2                                                                                                                                           U.S. says 165 million doses of COVID-19 vaccine been administered so far https://t.co/dWTpvWtRp8 https://t.co/PTDcMbT4c6
## 3                                                                                                                                           U.S. says 165 million doses of COVID-19 vaccine been administered so far https://t.co/Kzmd2Z9lKC https://t.co/nyVuZWKSkD
## 4  The U.S. has put Johnson &amp; Johnson in charge of a plant that ruined 15 million doses of its COVID-19 vaccine and stopped British drugmaker AstraZeneca from using the facility, a senior health official said https://t.co/7h2482LLZv https://t.co/z3zdbUkzXB
## 5                                                                                                                                           U.S. says 165 million doses of COVID-19 vaccine been administered so far https://t.co/c6ooyBq4oN https://t.co/Sh57ZTQEVr
## 6                                                                                                                                             Ukraine approves Chinese vaccine as COVID-19 cases hit new record high https://t.co/Dml91tR06f https://t.co/GDN2rO39ST
## 7                                                                                                                                           U.S. says 165 million doses of COVID-19 vaccine been administered so far https://t.co/NYyKqyU8Mq https://t.co/klWmEnVkrK
## 8                                                                                                                                               China administered 136.68 million COVID-19 vaccine doses by Saturday https://t.co/B8JukRYu68 https://t.co/w1zzLJkJqj
## 9                                                                                                                                             Ukraine approves Chinese vaccine as COVID-19 cases hit new record high https://t.co/eUSARWBeGf https://t.co/KV4vO9am6c
## 10                                                                                                                                            Ukraine approves Chinese vaccine as COVID-19 cases hit new record high https://t.co/RiFottetq4 https://t.co/M3xhhx8qft

Data processing (transformation/cleaning) with `dplyr` package

From now on, we are going to work with the cv_tweets dataset in a data frame format and manipulate its observations and variables. Before processing texts for tokenization, we may need to manipulate the data. Let’s say we want to remove any tweet whose message duplicates that of another tweet, or drop a variable that is not of interest to our analysis. The dplyr package provides useful functions for such tasks.

Today, I am going to introduce you to its basic set of functions and show you how to apply them to the cv_tweets data frame.

`dplyr` functions

The dplyr package provides useful functions for data manipulation:

Name	Task
`select()`	to select variables based on their names
`filter()`	to select cases (observations) based on their values
`mutate()`	to add new variables that are functions of existing variables
`arrange()`	to reorder the cases
`count()`	to count the number of observations based on their values

`select()`: Reduce data frame size to only desired variables for current task

The select() function selects a subset of columns (variables) in a data frame. We often work with large datasets with many columns, of which only a few are actually of interest to us. select() allows us to rapidly zoom in on a useful subset by naming the variables we want to keep.

library(dplyr)

# We take the `cv_tweets` dataset as the input and send it directly to the select function using the pipe operator.

cv_tweets[1:10,]

## # A tibble: 10 x 91
##    user_id status_id   created_at          screen_name text             source  
##    <chr>   <chr>       <dttm>              <chr>       <chr>            <chr>   
##  1 1652541 1378336545~ 2021-04-03 13:20:15 Reuters     Ukraine approve~ True An~
##  2 1652541 1378823457~ 2021-04-04 21:35:04 Reuters     U.S. says 165 m~ True An~
##  3 1652541 1378848625~ 2021-04-04 23:15:04 Reuters     U.S. says 165 m~ True An~
##  4 1652541 1378700128~ 2021-04-04 13:25:00 Reuters     The U.S. has pu~ Twitter~
##  5 1652541 1378776935~ 2021-04-04 18:30:12 Reuters     U.S. says 165 m~ True An~
##  6 1652541 1378388094~ 2021-04-03 16:45:05 Reuters     Ukraine approve~ True An~
##  7 1652541 1378802069~ 2021-04-04 20:10:04 Reuters     U.S. says 165 m~ True An~
##  8 1652541 1378651093~ 2021-04-04 10:10:09 Reuters     China administe~ True An~
##  9 1652541 1378335309~ 2021-04-03 13:15:20 Reuters     Ukraine approve~ True An~
## 10 1652541 1378386857~ 2021-04-03 16:40:10 Reuters     Ukraine approve~ True An~
## # ... with 85 more variables: display_text_width <dbl>,
## #   reply_to_status_id <chr>, reply_to_user_id <chr>,
## #   reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
## #   favorite_count <int>, retweet_count <int>, quote_count <int>,
## #   reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
## #   urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
## #   media_t.co <list>, media_expanded_url <list>, media_type <list>,
## #   ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
## #   ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
## #   lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
## #   quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
## #   quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
## #   quoted_name <chr>, quoted_followers_count <int>,
## #   quoted_friends_count <int>, quoted_statuses_count <int>,
## #   quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
## #   retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
## #   retweet_source <chr>, retweet_favorite_count <int>,
## #   retweet_retweet_count <int>, retweet_user_id <chr>,
## #   retweet_screen_name <chr>, retweet_name <chr>,
## #   retweet_followers_count <int>, retweet_friends_count <int>,
## #   retweet_statuses_count <int>, retweet_location <chr>,
## #   retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
## #   place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
## #   country_code <chr>, geo_coords <list>, coords_coords <list>,
## #   bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
## #   description <chr>, url <chr>, protected <lgl>, followers_count <int>,
## #   friends_count <int>, listed_count <int>, statuses_count <int>,
## #   favourites_count <int>, account_created_at <dttm>, verified <lgl>,
## #   profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
## #   profile_banner_url <chr>, profile_background_url <chr>,
## #   profile_image_url <chr>, date <dttm>

cv_tweets %>% select(user_id, created_at, text)

## # A tibble: 114,299 x 3
##    user_id created_at          text                                             
##    <chr>   <dttm>              <chr>                                            
##  1 1652541 2021-04-03 13:20:15 Ukraine approves Chinese vaccine as COVID-19 cas~
##  2 1652541 2021-04-04 21:35:04 U.S. says 165 million doses of COVID-19 vaccine ~
##  3 1652541 2021-04-04 23:15:04 U.S. says 165 million doses of COVID-19 vaccine ~
##  4 1652541 2021-04-04 13:25:00 The U.S. has put Johnson &amp; Johnson in charge~
##  5 1652541 2021-04-04 18:30:12 U.S. says 165 million doses of COVID-19 vaccine ~
##  6 1652541 2021-04-03 16:45:05 Ukraine approves Chinese vaccine as COVID-19 cas~
##  7 1652541 2021-04-04 20:10:04 U.S. says 165 million doses of COVID-19 vaccine ~
##  8 1652541 2021-04-04 10:10:09 China administered 136.68 million COVID-19 vacci~
##  9 1652541 2021-04-03 13:15:20 Ukraine approves Chinese vaccine as COVID-19 cas~
## 10 1652541 2021-04-03 16:40:10 Ukraine approves Chinese vaccine as COVID-19 cas~
## # ... with 114,289 more rows

# select() keeps only the columns whose names are mentioned.

`filter()`: Reduce rows/observations with matching conditions

Filtering is a common task: it identifies and keeps the observations in which a particular variable matches a specific value or condition. filter() therefore takes an expression that refers to variables within the data frame and keeps the rows where the expression is TRUE. For instance, we can filter on the variable text to find tweets whose messages are duplicates. The base function duplicated() determines which elements of a vector (or of a variable in a data frame) repeat an element that appeared earlier, and returns a logical vector that is TRUE for those repeats and FALSE for first occurrences and unique elements.
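A tiny example on a made-up vector (the values are placeholders, not real tweets) shows how duplicated() treats first occurrences and repeats:

```r
# Hypothetical vector standing in for tweet texts
texts <- c("a", "b", "a", "c", "b", "a")

# TRUE marks an element that repeats an earlier one;
# the first occurrence of each value is FALSE
dup <- duplicated(texts)
dup
# [1] FALSE FALSE  TRUE FALSE  TRUE  TRUE

# Negating the vector keeps one copy of every distinct value
unique_texts <- texts[!dup]
unique_texts
# [1] "a" "b" "c"
```

Note that the first copy of a repeated message is kept, so filtering with !duplicated() removes the repeats rather than every tweet that has a copy somewhere else.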

# We can chain the current filter() function to the previous function select()

cv_tweets %>% select(user_id, created_at, text) %>% filter(duplicated(text))

## # A tibble: 1,042 x 3
##    user_id        created_at          text                                      
##    <chr>          <dttm>              <chr>                                     
##  1 1369465601333~ 2021-04-03 13:49:16 "Vaccine appointments available at Walgre~
##  2 1369465601333~ 2021-04-04 04:27:06 "Vaccine appointments available at Walgre~
##  3 1369465601333~ 2021-04-04 08:13:07 "Vaccine appointments available at Walgre~
##  4 1369465601333~ 2021-04-04 04:12:05 "Vaccine appointments available at Walgre~
##  5 1369465601333~ 2021-04-03 18:37:05 "Vaccine appointments available at Walgre~
##  6 1366795232491~ 2021-04-04 21:34:04 "MO: Vaccine appointments available at CV~
##  7 1366795232491~ 2021-04-03 20:44:32 "MO: Vaccine appointments available at CV~
##  8 1366795232491~ 2021-04-03 19:58:32 "MO: Vaccine appointments available at CV~
##  9 1366795232491~ 2021-04-03 17:59:31 "KS: Vaccine appointments available at CV~
## 10 1366795232491~ 2021-04-03 19:58:32 "MO: Vaccine appointments available at CV~
## # ... with 1,032 more rows

# 1,042 tweets repeat the message of an earlier tweet

# Putting the ! operator in front of duplicated() reverses its logical vector, so we keep the tweets that are NOT duplicates of an earlier tweet.
cv_tweets %>% select(user_id, created_at, text) %>% filter(!duplicated(text))

## # A tibble: 113,257 x 3
##    user_id created_at          text                                             
##    <chr>   <dttm>              <chr>                                            
##  1 1652541 2021-04-03 13:20:15 Ukraine approves Chinese vaccine as COVID-19 cas~
##  2 1652541 2021-04-04 21:35:04 U.S. says 165 million doses of COVID-19 vaccine ~
##  3 1652541 2021-04-04 23:15:04 U.S. says 165 million doses of COVID-19 vaccine ~
##  4 1652541 2021-04-04 13:25:00 The U.S. has put Johnson &amp; Johnson in charge~
##  5 1652541 2021-04-04 18:30:12 U.S. says 165 million doses of COVID-19 vaccine ~
##  6 1652541 2021-04-03 16:45:05 Ukraine approves Chinese vaccine as COVID-19 cas~
##  7 1652541 2021-04-04 20:10:04 U.S. says 165 million doses of COVID-19 vaccine ~
##  8 1652541 2021-04-04 10:10:09 China administered 136.68 million COVID-19 vacci~
##  9 1652541 2021-04-03 13:15:20 Ukraine approves Chinese vaccine as COVID-19 cas~
## 10 1652541 2021-04-03 16:40:10 Ukraine approves Chinese vaccine as COVID-19 cas~
## # ... with 113,247 more rows

`mutate()`: Add new columns to data frame

Besides selecting sets of existing columns with select(), it’s often useful to add (create) new columns that are functions of existing columns. This is the job of mutate(). For example, we can add a new column, time, that rounds the date-time values of created_at down to the hour. The current created_at variable records time to the second, which is too fine-grained to show the trend of tweet posting over time. With the time variable, however, we can count the number of tweets about COVID-19 vaccines posted in each hour. To do so, we need to install and load another package called lubridate, which provides the floor_date() function. This function takes a date-time object like our created_at variable and rounds it down to the nearest boundary of the specified time unit. Using mutate(), we then create a new column (variable) that stores the outcome of floor_date().
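Before applying it to the whole data frame, it may help to try floor_date() on a single timestamp. The sketch below uses the posting time of the second tweet shown earlier and, for contrast, lubridate's round_date(), which rounds to the nearest boundary instead of rounding down. The variable names are my own.

```r
library(lubridate)

# Posting time of one tweet (parsed as UTC by default)
created <- ymd_hms("2021-04-04 21:35:04")

floored_hour <- floor_date(created, unit = "hour")  # rounds down to 21:00:00
floored_day  <- floor_date(created, unit = "day")   # rounds down to midnight
rounded_hour <- round_date(created, unit = "hour")  # nearest hour is 22:00:00

floored_hour
floored_day
rounded_hour
```

Rounding down keeps every tweet inside the hour in which it was actually posted, which is what we want when counting tweets per hour.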

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

# mutate() adds a new variable while preserving the existing ones. Inside the call, we give the name of the new variable and the function of an existing variable whose outcome it stores.
# floor_date() takes a date-time object and rounds it down to the boundary of the specified time unit, such as "day" or "hour".
cv_tweets %>% select(user_id, created_at, text) %>% filter(!duplicated(text)) %>% mutate(time = floor_date(created_at, unit="hour"))

## # A tibble: 113,257 x 4
##    user_id created_at          text                          time               
##    <chr>   <dttm>              <chr>                         <dttm>             
##  1 1652541 2021-04-03 13:20:15 Ukraine approves Chinese vac~ 2021-04-03 13:00:00
##  2 1652541 2021-04-04 21:35:04 U.S. says 165 million doses ~ 2021-04-04 21:00:00
##  3 1652541 2021-04-04 23:15:04 U.S. says 165 million doses ~ 2021-04-04 23:00:00
##  4 1652541 2021-04-04 13:25:00 The U.S. has put Johnson &am~ 2021-04-04 13:00:00
##  5 1652541 2021-04-04 18:30:12 U.S. says 165 million doses ~ 2021-04-04 18:00:00
##  6 1652541 2021-04-03 16:45:05 Ukraine approves Chinese vac~ 2021-04-03 16:00:00
##  7 1652541 2021-04-04 20:10:04 U.S. says 165 million doses ~ 2021-04-04 20:00:00
##  8 1652541 2021-04-04 10:10:09 China administered 136.68 mi~ 2021-04-04 10:00:00
##  9 1652541 2021-04-03 13:15:20 Ukraine approves Chinese vac~ 2021-04-03 13:00:00
## 10 1652541 2021-04-03 16:40:10 Ukraine approves Chinese vac~ 2021-04-03 16:00:00
## # ... with 113,247 more rows

`arrange()`: Arrange (or re-order) rows of a data frame by an expression involving its variables

Using the arrange() function, we can sort the rows of cv_tweets by the time variable, either in ascending order (the default) or, together with the function desc(), in descending order.

# The oldest tweet comes first
cv_tweets %>% select(user_id, created_at, text) %>% filter(!duplicated(text)) %>% mutate(time = floor_date(created_at, unit="hour")) %>% arrange(time)

## # A tibble: 113,257 x 4
##    user_id     created_at          text                      time               
##    <chr>       <dttm>              <chr>                     <dttm>             
##  1 1364920154~ 2021-03-29 10:25:43 "NEW: CVS #9791 on 2021-~ 2021-03-29 10:00:00
##  2 1364920154~ 2021-03-29 10:57:04 "NEW: CVS #16049 on 2021~ 2021-03-29 10:00:00
##  3 1364920154~ 2021-03-29 10:55:00 "NEW: CVS #4019 on 2021-~ 2021-03-29 10:00:00
##  4 1364920154~ 2021-03-29 10:40:20 "NEW: CVS #9477 on 2021-~ 2021-03-29 10:00:00
##  5 1364920154~ 2021-03-29 10:09:47 "NEW: CVS #9680 on 2021-~ 2021-03-29 10:00:00
##  6 1364920154~ 2021-03-29 10:23:39 "NEW: CVS #9566 on 2021-~ 2021-03-29 10:00:00
##  7 1364920154~ 2021-03-29 10:07:39 "NEW: CVS #8878 on 2021-~ 2021-03-29 10:00:00
##  8 1364920154~ 2021-03-29 10:07:36 "NEW: CVS #8879 on 2021-~ 2021-03-29 10:00:00
##  9 1364920154~ 2021-03-29 10:21:35 "NEW: CVS #9686 on 2021-~ 2021-03-29 10:00:00
## 10 1364920154~ 2021-03-29 10:21:32 "NEW: CVS #9714 on 2021-~ 2021-03-29 10:00:00
## # ... with 113,247 more rows

# The latest tweet comes first
cv_tweets %>% select(user_id, created_at, text) %>% filter(!duplicated(text)) %>% mutate(time = floor_date(created_at, unit="hour")) %>% arrange(desc(time))

## # A tibble: 113,257 x 4
##    user_id     created_at          text                      time               
##    <chr>       <dttm>              <chr>                     <dttm>             
##  1 1652541     2021-04-04 23:15:04 "U.S. says 165 million d~ 2021-04-04 23:00:00
##  2 1369465601~ 2021-04-04 23:51:58 "Vaccine appointments av~ 2021-04-04 23:00:00
##  3 1369465601~ 2021-04-04 23:51:59 "Vaccine appointments av~ 2021-04-04 23:00:00
##  4 1369465601~ 2021-04-04 23:49:57 "Vaccine appointments av~ 2021-04-04 23:00:00
##  5 1369465601~ 2021-04-04 23:51:59 "Vaccine appointments av~ 2021-04-04 23:00:00
##  6 1369465601~ 2021-04-04 23:52:58 "Vaccine appointments av~ 2021-04-04 23:00:00
##  7 1369465601~ 2021-04-04 23:51:58 "Vaccine appointments av~ 2021-04-04 23:00:00
##  8 1369465601~ 2021-04-04 23:59:58 "Vaccine appointments av~ 2021-04-04 23:00:00
##  9 1369465601~ 2021-04-04 23:00:58 "Vaccine appointments av~ 2021-04-04 23:00:00
## 10 1369465601~ 2021-04-04 23:59:58 "Vaccine appointments av~ 2021-04-04 23:00:00
## # ... with 113,247 more rows

`count()`: count discrete values in a specified variable

When working with data, we often want to know the number of observations found for each value in a variable. For example, if we want to count the number of rows of data for each hour in the time variable, we can do:

cv_tweets %>% select(user_id, created_at, text) %>% filter(!duplicated(text)) %>% mutate(time = floor_date(created_at, unit="hour")) %>% arrange(desc(time)) %>% count(time)

## # A tibble: 158 x 2
##    time                    n
##  * <dttm>              <int>
##  1 2021-03-29 10:00:00   464
##  2 2021-03-29 11:00:00   683
##  3 2021-03-29 12:00:00   919
##  4 2021-03-29 13:00:00  1013
##  5 2021-03-29 14:00:00  1008
##  6 2021-03-29 15:00:00  1081
##  7 2021-03-29 16:00:00  1119
##  8 2021-03-29 17:00:00  1532
##  9 2021-03-29 18:00:00  1327
## 10 2021-03-29 19:00:00  1343
## # ... with 148 more rows

# For convenience, `count()` provides the `sort` argument:
cv_tweets %>% select(user_id, created_at, text) %>% filter(!duplicated(text)) %>% mutate(time = floor_date(created_at, unit="hour")) %>% arrange(desc(time)) %>% count(time, sort=TRUE)

## # A tibble: 158 x 2
##    time                    n
##    <dttm>              <int>
##  1 2021-03-31 11:00:00  1898
##  2 2021-03-31 15:00:00  1619
##  3 2021-03-31 14:00:00  1587
##  4 2021-03-29 17:00:00  1532
##  5 2021-03-31 13:00:00  1480
##  6 2021-03-31 16:00:00  1434
##  7 2021-03-31 12:00:00  1404
##  8 2021-04-01 15:00:00  1379
##  9 2021-03-29 19:00:00  1343
## 10 2021-04-01 14:00:00  1342
## # ... with 148 more rows

# The result is sorted in descending order of n
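As a side note, count() can be thought of as a shortcut for grouping and then summarising. The sketch below checks this on a made-up tibble (the values are placeholders, not taken from cv_tweets).

```r
library(dplyr)

# Hypothetical mini data set: one row per tweet, with the hour it was posted
toy <- tibble(
  time = c("10:00", "11:00", "10:00", "12:00", "10:00", "11:00")
)

# count() tallies the rows for each distinct value of time
by_count <- toy %>% count(time)

# The same tally written out with group_by() and summarise()
by_summary <- toy %>% group_by(time) %>% summarise(n = n())

by_count
by_summary

# sort = TRUE puts the most frequent value first
toy %>% count(time, sort = TRUE)
```

Because count() builds its own ordering of the result, any arrange() step placed before it does not affect the order of the counted rows.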

Visualization using `ggplot2`

The ggplot2 package provides very useful and accessible tools for creating graphics in R. Of course, there are lots of things to cover before we can fully visualize the results of our data processing. But today let’s just have a taste by creating a bar chart that visualizes the trend of tweeting about COVID-19 vaccines from March 29th to April 4th. In the bar chart, the day-unit observations in the time variable are placed along the x-axis, and the number of observations, n, is represented by the height of each bar. So, this chart gives a simple visual comparison of how the volume of tweets about COVID-19 vaccines changed over the period.

library(ggplot2)

# time: x-axis, n: y-axis
# The day-unit observations are arranged in ascending order along the x-axis

cv_tweets %>% select(user_id, created_at, text) %>% filter(!duplicated(text)) %>% mutate(time = floor_date(created_at, unit="day")) %>% arrange(desc(time)) %>% count(time) %>% ggplot(aes(x=time, y=n)) + geom_col()

This visualization shows how our Twitter data are distributed in terms of when the tweets were posted. During the week, tweeting about COVID-19 vaccines peaked on March 31st. It then decreased as the weekend approached, and the number of tweets was lowest on April 4th (Sunday).
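If you want to experiment without the full dataset, the sketch below draws the same kind of bar chart from a small data frame and adds axis labels with labs(). The daily counts are made up purely for illustration and are not the real cv_tweets numbers; the object names are also my own.

```r
library(ggplot2)

# Hypothetical daily counts, for illustration only (NOT taken from cv_tweets)
daily <- data.frame(
  time = as.Date("2021-03-29") + 0:6,   # March 29th to April 4th
  n    = c(30, 40, 55, 45, 35, 25, 20)
)

# time goes on the x-axis, n sets the height of each bar
p <- ggplot(daily, aes(x = time, y = n)) +
  geom_col() +
  labs(x = "Day", y = "Number of tweets",
       title = "Tweets per day (made-up counts)")
p
```

geom_col() uses the values of n directly as bar heights, which is why the data are counted first and plotted second.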

* `stringr`, `magrittr`, `dplyr`, `ggplot2` packages are all included in the `tidyverse` package.

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## √ tibble  3.1.0     √ purrr   0.3.4
## √ tidyr   1.1.2     √ forcats 0.5.1
## √ readr   1.3.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x lubridate::as.difftime() masks base::as.difftime()
## x lubridate::date()        masks base::date()
## x tidyr::extract()         masks magrittr::extract()
## x dplyr::filter()          masks stats::filter()
## x lubridate::intersect()   masks base::intersect()
## x dplyr::lag()             masks stats::lag()
## x purrr::set_names()       masks magrittr::set_names()
## x lubridate::setdiff()     masks base::setdiff()
## x lubridate::union()       masks base::union()


Shin Lee

4/13/2021
