- Understand the task of supervised machine learning, and learn about feature representation.
- Learn about the way in which textual data are applied to machine learning algorithms.
- Introduce tidy data principles and see how to make data tidy with functions from the `magrittr` and `dplyr` packages.
- See how the `tidytext` package applies tidy data principles to text via the `unnest_tokens()` function.
The task of supervised machine learning consists of using an automatic system (aka algorithm) to learn from a history of occurrences of a certain “event” and consequently make predictions about future occurrences of that event.
The term supervised machine learning implies that there is also unsupervised machine learning. We will learn about one of the most popular unsupervised machine learning algorithms, called topic modeling, later in this course.
For today, let's focus on supervised machine learning. Basically, supervised machine learning predicts an outcome for each case based on its features. The algorithm learns what the features of each case mean and how they should be weighted in predicting its outcome by examining how those features relate to the already known outcomes of other cases.
Let me give you an example of what supervised machine learning does with text data. Consider the task of predicting whether a given email is spam or not. Imagine a set of already known cases of spam emails and non-spam emails. When we supply these cases as input data, the learning algorithm learns to classify a new email as spam or non-spam, based on a set of features that characterize spam emails. For example, words like money, finance, earning, event, chance, etc. could be textual features that are likely to characterize spam emails. Using these textual features, the supervised learning algorithm classifies a given email into the spam-mail group or the non-spam-mail group.
Case | money | finance | earning | event | chance | Spam Mail? |
---|---|---|---|---|---|---|
Email 1 | 0 | 0 | 0 | 1 | 0 | No |
Email 2 | 2 | 3 | 1 | 2 | 1 | Yes |
Email 3 | 1 | 1 | 0 | 1 | 0 | Yes |
Email 4 | 0 | 0 | 1 | 0 | 1 | No |
Email 5 | 2 | 1 | 0 | 2 | 2 | ? |
The above table presents the frequencies of the feature words that appear in each email case and that are used to predict whether it is spam or not. Based on the occurrence frequencies of the words in Emails 1 through 4 as features, the supervised machine learning algorithm uses the frequencies of the words in Email 5 to classify it as Yes or No.
So, to classify textual documents like emails or news articles into two possible classes, such as spam vs. non-spam or fake news vs. non-fake news, the text data need to be structured into a document-feature matrix format as demonstrated above. And this process can begin with tokenization under tidy data principles.
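As a preview, here is a minimal sketch of how such a document-feature table could be built under tidy data principles, using made-up email texts and the `tidytext` and `tidyr` packages (the `unnest_tokens()` function is introduced properly later in this course):

library(dplyr)
library(tidytext)   # provides unnest_tokens() for tokenization
library(tidyr)      # provides pivot_wider() for reshaping

# Hypothetical toy corpus of two emails
emails <- tibble(case = c("Email 1", "Email 2"),
                 text = c("Join our event this weekend",
                          "Earn money fast with this finance chance"))

emails %>%
  unnest_tokens(word, text) %>%                   # one row per word per email
  count(case, word) %>%                           # word frequency per email
  pivot_wider(names_from = word, values_from = n,
              values_fill = list(n = 0))          # words become feature columns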
We’ve so far worked with text data extracted from a Wikipedia page on COVID-19. It was formatted as a character vector, where the object contained a single string of text only. But we sometimes need to work with multiple strings of text, each annotated with information about that text. Let's say we compare different news articles about COVID-19. In this case, we want to analyze not only the text of each article but also information about its source, reporter, publication date, and so on. Or we may want to know what people think and say about Coronavirus disease 2019. For that purpose, Twitter data were retrieved by extracting all tweets mentioning COVID-19. Unlike the single string of text extracted from a Wikipedia page, the Twitter data contain many tweets, each of which must be distinguished from the others.
Tweet | User ID | Date | Text | Like | Retweet |
---|---|---|---|---|---|
Tweet 1 | abc | 2020-03-27 | “Fascinating news in England:…” | 10 | 5 |
Tweet 2 | def | 2020-03-27 | “If our Governor were to open schools back up by Easter…” | 23 | 3 |
Tweet 3 | ghi | 2020-03-27 | “happening as we speak…” | 7 | 1 |
Tweet 4 | jkl | 2020-03-28 | “What’s even sicker than COVID-19?” | 1 | 0 |
Tweet 5 | mno | 2020-03-27 | “Myth and Fact about Corona” | 20 | 9 |
The above format allows the data to contain multiple tweets individually, where each tweet is characterized not only by its textual content but also by other information such as user ID, posting date, and the numbers of likes and retweets. This is possible insofar as the Twitter data are structured in a data frame format, where each row represents one observation (a tweet) and each column represents one of its features. And generally speaking, when we refer to data, we mean data structured in this data frame format.
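As a minimal sketch (with values abbreviated from the table above), such tweets could be stored in R like this:

# A hypothetical, hand-built version of the tweet table above
tweets_df <- data.frame(
  user_id = c("abc", "def", "mno"),
  date    = c("2020-03-27", "2020-03-27", "2020-03-27"),
  text    = c("Fascinating news in England:...",
              "If our Governor were to open schools back up by Easter...",
              "Myth and Fact about Corona"),
  like    = c(10, 23, 20),
  retweet = c(5, 3, 9)
)
tweets_df  # each row is one observation (a tweet); each column is one feature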
From now on, we are going to work with text data and learn how to (pre-)process text in the data frame format. To do so, we are going to install and load several packages, such as the `magrittr`, `dplyr`, and `tidytext` packages, as well as the `stringr` package.
Let’s take a look at what functions such packages provide and how the functions are used for text data processing.
First of all, the `magrittr` package provides the pipe operator `%>%`, which is very useful in data processing.
Let's say we have two functions: str_squish(A) => B and str_split(B, pattern=" ") => C. The function `str_squish()` processes the input, A, to remove any redundant whitespace and returns the outcome, B. And the function `str_split()` processes the input, B, to split the string into tokens at blanks and returns the outcome, C.
So far, we have run the two functions step by step. But using the pipe operator, we can chain these two functions together by taking the output of one function and inserting it into the next. In short, “chaining” means that we pass an intermediate result on to the next function. Here, `str_split()` follows `str_squish()`: `str_split(str_squish(x), pattern=" ")`
In R, we can pass output from one command to the next with the pipe operator. As we’ve seen, our R code often contains lots of parentheses, `(` and `)`, especially when code is complex: functions are nested inside functions that are nested inside yet other functions, and so on. This makes R code hard to read and understand. Here’s where `%>%` comes to the rescue.
Here’s an example:
library(magrittr)
library(stringr)
covid_sent <- "Coronavirus disease 2019\nCoronavirus disease 2019 (COVID-19) is an infectious\ndisease caused by severe acute respiratory syndrome\n Coronavirus disease 2019 (COVID-\ncoronavirus 2 (SARS-CoV-2)."
# To 1) normalize into lower-case letters, 2) remove punctuation characters except the hyphen,
# 3) trim whitespace, 4) split the string into words, 5) convert the resulting list into a vector of words,
# 6) count the occurrence of each word, and 7) sort the frequencies in descending order,
# we can nest all of the functions in a single line:
sort(table(unlist(str_split(str_squish(str_replace_all(tolower(covid_sent), "[^[:alnum:][:space:]-]", " ")), pattern=" "))), decreasing = TRUE)
##
## coronavirus disease 2019 2 acute an
## 4 4 3 1 1 1
## by caused covid- covid-19 infectious is
## 1 1 1 1 1 1
## respiratory sars-cov-2 severe syndrome
## 1 1 1 1
# The nested call looks very complicated, and it is hard to tell what each function does.
But with the help of `%>%`, we can rewrite the above code step by step as follows (the `.` in the printed table header is simply how `table()` labels its input when it is piped in):
covid_sent %>%
tolower() %>%
str_replace_all("[^[:alnum:][:space:]-]", " ") %>%
str_squish() %>%
str_split(pattern=" ") %>%
table() %>%
sort(decreasing = T)
## .
## coronavirus disease 2019 2 acute an
## 4 4 3 1 1 1
## by caused covid- covid-19 infectious is
## 1 1 1 1 1 1
## respiratory sars-cov-2 severe syndrome
## 1 1 1 1
Using the pipe operator, we can write R code in an intuitively simple way while chaining a sequence of multiple functions together to be run in order.
Tidy data is basically just a way of consistently organizing your data that often makes subsequent analysis easier.
Tidy data has three requirements:
- Each variable has its own column.
- Each observation has its own row.
- Each value has its own cell.
Given our example of Twitter data, the variables are User ID, Date, Text, Like, and Retweet; the observations are the individual tweets; and the values are the contents of those variables for each tweet.
At this point, I am going to give you an actual dataset that consists of around one million tweets mentioning COVID-19. All tweets are written in English and were posted worldwide between March 26 and March 28, 2020. This dataset, named `covid19_tweets_df`, is available in Lecture Resources on our E-class page.
Once you download the file, you may need to move the file into your R working directory and load it into the current R session.
load("covid19_tweets_df.RData")
dim(covid19_tweets_df)
## [1] 1012305 6
covid19_tweets_df[1:10,]
## user_id status_id created_at screen_name
## 1 408707568 1243394454418886663 2020-03-27 04:28:33 KathleenBurge
## 2 145492546 1243394454112759809 2020-03-27 04:28:33 PetuniaV
## 3 399450399 1243394453974290432 2020-03-27 04:28:33 meghanttucker
## 4 863432135738130432 1243394453928210435 2020-03-27 04:28:33 drchristinal
## 5 863432135738130432 1243392682287697920 2020-03-27 04:21:31 drchristinal
## 6 863432135738130432 1243394067783770112 2020-03-27 04:27:01 drchristinal
## 7 81368775 1243394453722656769 2020-03-27 04:28:33 Zachsnapwell
## 8 1241597278919376896 1243394452212731904 2020-03-27 04:28:33 CustomizeMedia
## 9 229334332 1243394452179152899 2020-03-27 04:28:33 acentodiario
## 10 956452855 1243394451172519937 2020-03-27 04:28:32 AirSwerve
## text
## 1 Fascinating news in England: "UK firms and academics have also developed self-test kits for Covid-19 that are expected to be available to buy in the coming weeks or months." https://t.co/hZOxGc3Kv7
## 2 https://t.co/6ZHx6M6Rwx\nCorona Virus Rhapsody \U0001f3a7\U0001f3b8\U0001f3bc\U0001f3a4 #Covid19
## 3 If our Governor were to open schools back up by Easter, I won't be sending my kids. Our county is hard hit by #COVID19. \n\nI'll either finish homeschooling or fucking hold them back a year. Kids AREN'T immune like people are saying.
## 4 @jmj https://t.co/vkwfur9s7B but they are going further than fb... there is a need, 18,000 physicians are not in a private group to shoot the shit.... they are creating guidelines then going to openxmed to levrerage their platform and create technology they want and need.
## 5 happening as we speak. n95 reuse and prolonged use... setting the standard for PPE for physicians. >17,000 physicians have spoken. #forphysiciansbyphysicians @openxmed #medtwitter #MedEd #IDtwitter #physician #COVID19 https://t.co/gSxYklpofE
## 6 @stevejang fb groups of 18,000+ physicians are talking 24/7.... creating guidelines, swapping support... then going to things like openxmed to create the standard.... https://t.co/vkwfur9s7B
## 7 What's even sicker than COVID-19? \n\nMy COVID-20 playlist. \n\nIf anyone is looking for some new music to listen to, take a listen. It's mostly new music I've been listening to lately with a few classics thrown in. \n\U0001f918\U0001f3fb\U0001f918\U0001f3fb\U0001f918\U0001f3fb\n#CoronaVirusUpdates\n#CoronaJams\n\nhttps://t.co/Q20YaltuEe
## 8 Myth and Fact about Corona\n\n#Lockdown21 #COVID2019 #QuarantineLife https://t.co/VJU777adnR
## 9 Post-COVID-19 en RD https://t.co/KQsQDSaInA
## 10 she, along with dr. fauci and all the other infectious disease experts, are still trying to understand COVID-19. but they know what is needed to stop the spread and that distancing works.
## name
## 1 Kathleen Burge
## 2 PurplePetunia
## 3 Meghan - The Crippled Feminist
## 4 Christina Lang
## 5 Christina Lang
## 6 Christina Lang
## 7 Blackwell
## 8 Customize Media Venture
## 9 acento.com.do
## 10 AIR SWERVE
# This `covid19_tweets_df` dataset contains all 1,012,305 tweets that were posted on Twitter from March 26 to March 28, 2020. The data were collected via the Twitter API.
### The `dplyr` package

From now on, we are going to work with the `covid19_tweets_df` dataset in a data frame format to manipulate its observations and variables. Before processing texts for tokenization, we may need to manipulate the data. Let's say we want to remove any tweet whose message is a duplicate of another tweet, or drop a variable that is not of interest to our analysis. The `dplyr` package provides useful functions for such tasks.
Today, I am going to introduce you to its basic set of functions and show you how to apply them to the `covid19_tweets_df` data frame.
#### `dplyr` functions

The `dplyr` package provides useful functions for data manipulation:
Name | Task |
---|---|
`select()` | to select variables based on their names |
`filter()` | to select cases (observations) based on their values |
`mutate()` | to add new variables that are functions of existing variables |
`arrange()` | to reorder the cases |
`count()` | to count the number of observations based on their values |
#### `select()`: Reduce data frame size to only the desired variables for the current task

The `select()` function selects a subset of columns (variables) in a data frame. We often work with large datasets with many columns, but only a few are actually of interest to us. `select()` allows us to rapidly zoom in on a useful subset by naming the variables we want to keep.
library(dplyr)
# We take the covid19_tweets_df dataset as the input and send it directly to the select function using the pipe operator.
covid19_tweets_df[1:10,]
## # A tibble: 10 x 6
## user_id status_id created_at screen_name text name
## <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 408707568 1243394454… 2020-03-27 04:28:33 KathleenBu… "Fascinating … Kathlee…
## 2 145492546 1243394454… 2020-03-27 04:28:33 PetuniaV "https://t.co… PurpleP…
## 3 399450399 1243394453… 2020-03-27 04:28:33 meghanttuc… "If our Gover… Meghan …
## 4 86343213… 1243394453… 2020-03-27 04:28:33 drchristin… "@jmj https:/… Christi…
## 5 86343213… 1243392682… 2020-03-27 04:21:31 drchristin… "happening as… Christi…
## 6 86343213… 1243394067… 2020-03-27 04:27:01 drchristin… "@stevejang f… Christi…
## 7 81368775 1243394453… 2020-03-27 04:28:33 Zachsnapwe… "What's even … Blackwe…
## 8 12415972… 1243394452… 2020-03-27 04:28:33 CustomizeM… "Myth and Fac… Customi…
## 9 229334332 1243394452… 2020-03-27 04:28:33 acentodiar… "Post-COVID-1… acento.…
## 10 956452855 1243394451… 2020-03-27 04:28:32 AirSwerve "she, along w… AIR SWE…
covid19_tweets_df %>% select(user_id, created_at, text)
## # A tibble: 1,012,305 x 3
## user_id created_at text
## <chr> <dttm> <chr>
## 1 408707568 2020-03-27 04:28:33 "Fascinating news in England: \"UK firms …
## 2 145492546 2020-03-27 04:28:33 "https://t.co/6ZHx6M6Rwx\nCorona Virus Rh…
## 3 399450399 2020-03-27 04:28:33 "If our Governor were to open schools bac…
## 4 8634321357381… 2020-03-27 04:28:33 "@jmj https://t.co/vkwfur9s7B but they ar…
## 5 8634321357381… 2020-03-27 04:21:31 "happening as we speak. n95 reuse and pro…
## 6 8634321357381… 2020-03-27 04:27:01 "@stevejang fb groups of 18,000+ physicia…
## 7 81368775 2020-03-27 04:28:33 "What's even sicker than COVID-19? \n\nMy…
## 8 1241597278919… 2020-03-27 04:28:33 "Myth and Fact about Corona\n\n#Lockdown2…
## 9 229334332 2020-03-27 04:28:33 "Post-COVID-19 en RD https://t.co/KQsQDSa…
## 10 956452855 2020-03-27 04:28:32 "she, along with dr. fauci and all the ot…
## # … with 1,012,295 more rows
# The select() function keeps only the columns whose names are mentioned.
#### `filter()`: Reduce rows/observations with matching conditions

Filtering data is a common task used to identify and keep observations in which a particular variable matches a specific value/condition. So, it requires an argument that refers to a variable within the data frame, and it selects the rows where that expression is `TRUE`. For instance, we can filter the rows whose `text` values are duplicates. The base function `duplicated()` determines which elements of a vector (or of a variable in a data frame) are duplicates of earlier elements, and returns a logical vector indicating which elements (rows) are duplicates (`TRUE`) or unique (`FALSE`).
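To see how `duplicated()` behaves on its own, here is a quick sketch on a made-up character vector:

x <- c("stay home", "wash hands", "stay home", "wear a mask")
duplicated(x)   # the second "stay home" is flagged as a duplicate
## [1] FALSE FALSE  TRUE FALSE
!duplicated(x)  # negating the vector marks the non-duplicated elements instead
## [1]  TRUE  TRUE FALSE  TRUE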
# We can chain the filter() function after the previous select() call
covid19_tweets_df %>% select(user_id, created_at, text) %>% filter(duplicated(text))
## # A tibble: 7,665 x 3
## user_id created_at text
## <chr> <dttm> <chr>
## 1 1243385540889… 2020-03-27 04:26:03 "@NamoApp Please create nation Relief acc…
## 2 474483144 2020-03-27 04:25:43 "We stand by Italy during these trying ti…
## 3 1438242728 2020-03-27 04:24:24 "We stand by Italy during these trying ti…
## 4 1198352885358… 2020-03-27 04:23:58 "We stand by Italy during these trying ti…
## 5 15872418 2020-03-27 04:07:09 "@chiarazambrano Lorenzana: Good morning,…
## 6 150220875 2020-03-27 04:23:05 "Delhi CM @ArvindKejriwal to address peop…
## 7 1211132314278… 2020-03-27 04:22:18 "2 more #Coronavirus positive cases found…
## 8 1214333761 2020-03-27 04:19:27 "@WhiteHouse @realDonaldTrump Oh my God y…
## 9 1214333761 2020-03-27 04:20:09 "@realDonaldTrump Oh my God you have just…
## 10 8047200151822… 2020-03-27 04:15:34 "@narendramodi Honorable Prime Minister S…
## # … with 7,655 more rows
# 7,665 tweets' messages are duplicated (replicated)
# By adding the ! operator, we can negate the logical vector returned by duplicated(). This means we keep the tweets that are NOT duplicated.
covid19_tweets_df %>% select(user_id, created_at, text) %>% filter(!duplicated(text))
## # A tibble: 1,004,640 x 3
## user_id created_at text
## <chr> <dttm> <chr>
## 1 408707568 2020-03-27 04:28:33 "Fascinating news in England: \"UK firms …
## 2 145492546 2020-03-27 04:28:33 "https://t.co/6ZHx6M6Rwx\nCorona Virus Rh…
## 3 399450399 2020-03-27 04:28:33 "If our Governor were to open schools bac…
## 4 8634321357381… 2020-03-27 04:28:33 "@jmj https://t.co/vkwfur9s7B but they ar…
## 5 8634321357381… 2020-03-27 04:21:31 "happening as we speak. n95 reuse and pro…
## 6 8634321357381… 2020-03-27 04:27:01 "@stevejang fb groups of 18,000+ physicia…
## 7 81368775 2020-03-27 04:28:33 "What's even sicker than COVID-19? \n\nMy…
## 8 1241597278919… 2020-03-27 04:28:33 "Myth and Fact about Corona\n\n#Lockdown2…
## 9 229334332 2020-03-27 04:28:33 "Post-COVID-19 en RD https://t.co/KQsQDSa…
## 10 956452855 2020-03-27 04:28:32 "she, along with dr. fauci and all the ot…
## # … with 1,004,630 more rows
#### `mutate()`: Add new columns to the data frame

Besides selecting sets of existing columns through `select()`, it is often useful to add (create) new columns that are functions of existing columns. This is the job of `mutate()`. For example, we can add a new column `time` that rounds the date-time object in `created_at` down to the nearest hour. The current `created_at` variable provides time information down to the second, so it is not useful for seeing the trend of tweet posting over time. But using the `time` variable, we can count the number of tweets about COVID-19 posted per hour. For this, we need to install and load another package called `lubridate`, which provides the `floor_date()` function: it takes a date-time object like our `created_at` variable and rounds it down to the nearest boundary of the specified time unit. And using the `mutate()` function, we create a new column (variable) that stores the outcome of `floor_date()`.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
# mutate() adds new variables while preserving existing ones. Within the call, a new variable name is assigned the outcome of a function of an existing variable.
# floor_date() takes a date-time object and rounds it down to the boundary of the specified time unit, like "day" or "hour".
covid19_tweets_df %>% select(user_id, created_at, text) %>% filter(!duplicated(text)) %>% mutate(time = floor_date(created_at, unit="hour"))
## # A tibble: 1,004,640 x 4
## user_id created_at text time
## <chr> <dttm> <chr> <dttm>
## 1 408707568 2020-03-27 04:28:33 "Fascinating news in Eng… 2020-03-27 04:00:00
## 2 145492546 2020-03-27 04:28:33 "https://t.co/6ZHx6M6Rwx… 2020-03-27 04:00:00
## 3 399450399 2020-03-27 04:28:33 "If our Governor were to… 2020-03-27 04:00:00
## 4 8634321357… 2020-03-27 04:28:33 "@jmj https://t.co/vkwfu… 2020-03-27 04:00:00
## 5 8634321357… 2020-03-27 04:21:31 "happening as we speak. … 2020-03-27 04:00:00
## 6 8634321357… 2020-03-27 04:27:01 "@stevejang fb groups of… 2020-03-27 04:00:00
## 7 81368775 2020-03-27 04:28:33 "What's even sicker than… 2020-03-27 04:00:00
## 8 1241597278… 2020-03-27 04:28:33 "Myth and Fact about Cor… 2020-03-27 04:00:00
## 9 229334332 2020-03-27 04:28:33 "Post-COVID-19 en RD htt… 2020-03-27 04:00:00
## 10 956452855 2020-03-27 04:28:32 "she, along with dr. fau… 2020-03-27 04:00:00
## # … with 1,004,630 more rows
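To see what `floor_date()` does in isolation, here is a minimal sketch with a hypothetical timestamp:

# floor_date() rounds a date-time down to the start of the given unit
floor_date(as.POSIXct("2020-03-27 04:28:33", tz = "UTC"), unit = "hour")
## [1] "2020-03-27 04:00:00 UTC"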
#### `arrange()`: Arrange (or re-order) rows of a data frame by an expression involving its variables

Using the `arrange()` function, we can arrange the rows in `covid19_tweets_df` by the `time` variable in ascending order of time, or in descending order with the help of the `desc()` function.
# The oldest tweet comes first
covid19_tweets_df %>% select(user_id, created_at, text) %>% filter(!duplicated(text)) %>% mutate(time = floor_date(created_at, unit="hour")) %>% arrange(time)
## # A tibble: 1,004,640 x 4
## user_id created_at text time
## <chr> <dttm> <chr> <dttm>
## 1 1229826044… 2020-03-26 17:56:29 "@a1tradesfx Yer great i… 2020-03-26 17:00:00
## 2 1229826044… 2020-03-26 17:53:16 "@Punish4Q @SpiritFoxxx … 2020-03-26 17:00:00
## 3 192357257 2020-03-26 17:56:22 "#covid-19 #coronavirus … 2020-03-26 17:00:00
## 4 1107368128… 2020-03-26 17:53:11 "Israeli siege imposed o… 2020-03-26 17:00:00
## 5 1107368128… 2020-03-26 17:56:52 "The worst of fates, the… 2020-03-26 17:00:00
## 6 1107368128… 2020-03-26 17:51:03 "After 10 years of the I… 2020-03-26 17:00:00
## 7 1107368128… 2020-03-26 17:49:16 "Wars. Food insecurity. … 2020-03-26 17:00:00
## 8 7456565775… 2020-03-26 17:51:31 "Dear God, is this chick… 2020-03-26 17:00:00
## 9 7456565775… 2020-03-26 17:55:39 "You can just see the me… 2020-03-26 17:00:00
## 10 22592372 2020-03-26 17:57:04 "@StatistaCharts @NPR Ge… 2020-03-26 17:00:00
## # … with 1,004,630 more rows
# The latest tweet comes first
covid19_tweets_df %>% select(user_id, created_at, text) %>% filter(!duplicated(text)) %>% mutate(time = floor_date(created_at, unit="hour")) %>% arrange(desc(time))
## # A tibble: 1,004,640 x 4
## user_id created_at text time
## <chr> <dttm> <chr> <dttm>
## 1 1243563608… 2020-03-28 09:23:24 "@SwanseaCouncil @Swanse… 2020-03-28 09:00:00
## 2 174036091 2020-03-28 09:23:24 "Who do you wanna be dur… 2020-03-28 09:00:00
## 3 1891221151 2020-03-28 09:05:24 "Symptoms of #Coronaviru… 2020-03-28 09:00:00
## 4 1891221151 2020-03-28 09:01:02 "“Tracking the spread of… 2020-03-28 09:00:00
## 5 1891221151 2020-03-28 09:23:24 "Everyone In #Iceland Ca… 2020-03-28 09:00:00
## 6 1891221151 2020-03-28 09:18:58 "“Here’s which #Michigan… 2020-03-28 09:00:00
## 7 1891221151 2020-03-28 09:16:58 "#Brazil and #Mexico Are… 2020-03-28 09:00:00
## 8 1074699178… 2020-03-28 09:23:24 "#BREAKING\n\n7 more pos… 2020-03-28 09:00:00
## 9 9333341395… 2020-03-28 09:23:24 "@kansalrohit69 @diprjk … 2020-03-28 09:00:00
## 10 89149731 2020-03-28 09:23:24 "@IvankaTrump @realDonal… 2020-03-28 09:00:00
## # … with 1,004,630 more rows
#### `count()`: Count discrete values in a specified variable

When working with data, we often want to know the number of observations found for each value of a variable. For example, if we want to count the number of rows of data for each hour in the `time` variable, we can do:
covid19_tweets_df %>% select(user_id, created_at, text) %>% filter(!duplicated(text)) %>% mutate(time = floor_date(created_at, unit="hour")) %>% arrange(desc(time)) %>% count(time)
## # A tibble: 31 x 2
## time n
## <dttm> <int>
## 1 2020-03-26 17:00:00 40524
## 2 2020-03-26 18:00:00 50510
## 3 2020-03-26 19:00:00 45999
## 4 2020-03-26 20:00:00 50027
## 5 2020-03-26 21:00:00 44974
## 6 2020-03-26 22:00:00 41240
## 7 2020-03-26 23:00:00 33823
## 8 2020-03-27 00:00:00 31465
## 9 2020-03-27 01:00:00 28757
## 10 2020-03-27 02:00:00 25913
## # … with 21 more rows
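For reference, `count(time)` is essentially shorthand for grouping and summarising; the following sketch with dplyr's `group_by()` and `summarise()` should yield the same table:

# count(time) is roughly group_by(time) %>% summarise(n = n())
covid19_tweets_df %>% select(user_id, created_at, text) %>% filter(!duplicated(text)) %>% mutate(time = floor_date(created_at, unit="hour")) %>% group_by(time) %>% summarise(n = n())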
# For convenience, `count()` provides the `sort` argument:
covid19_tweets_df %>% select(user_id, created_at, text) %>% filter(!duplicated(text)) %>% mutate(time = floor_date(created_at, unit="hour")) %>% arrange(desc(time)) %>% count(time, sort=TRUE)
## # A tibble: 31 x 2
## time n
## <dttm> <int>
## 1 2020-03-27 16:00:00 55086
## 2 2020-03-27 17:00:00 53989
## 3 2020-03-27 18:00:00 51812
## 4 2020-03-26 18:00:00 50510
## 5 2020-03-26 20:00:00 50027
## 6 2020-03-27 19:00:00 47173
## 7 2020-03-26 19:00:00 45999
## 8 2020-03-27 20:00:00 45038
## 9 2020-03-26 21:00:00 44974
## 10 2020-03-26 22:00:00 41240
## # … with 21 more rows
# The result is sorted in descending order of n
### The `ggplot2` package

The `ggplot2` package provides very useful and accessible tools for creating graphics in R. Of course, there are lots of things to cover before we can fully visualize the results of our data processing. But today, let's have just a taste of creating a bar chart that visualizes the trend of tweeting about COVID-19 from March 26 to 28. In the bar chart, the hour-unit observations in the `time` variable are ordered along the x-axis, and the number of observations, `n`, is represented as the height of each bar. So, this chart offers a simple visual comparison of how the volume of tweeting about COVID-19 changed during this period of time.
library(ggplot2)
# time: x-axis, n: y-axis
# The hour-unit observations can be arranged in ascending order
covid19_tweets_df %>% select(user_id, created_at, text) %>% filter(!duplicated(text)) %>% mutate(time = floor_date(created_at, unit="hour")) %>% arrange(desc(time)) %>% count(time) %>% ggplot(aes(x=time, y=n)) + geom_col()
This visualization shows how our Twitter data are composed in terms of when tweets were posted. Our data contain no tweets posted between 5 a.m. and 2 p.m. on March 27, for some reason. And the number of tweets posted on the evening of March 27 was greater than the number posted on the evening of March 26. On both dates, tweeting about COVID-19 peaked around 6 p.m. and decreased as night approached.
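As a small optional refinement (a sketch, not shown above), ggplot2's `labs()` function can add more readable axis labels and a title to the same chart:

# labs() attaches human-readable labels to the plot's aesthetics
covid19_tweets_df %>% select(user_id, created_at, text) %>% filter(!duplicated(text)) %>% mutate(time = floor_date(created_at, unit="hour")) %>% count(time) %>% ggplot(aes(x=time, y=n)) + geom_col() + labs(x = "Hour", y = "Number of tweets", title = "Tweets mentioning COVID-19, March 26-28, 2020")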
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────── tidyverse 1.2.1 ──
## ✓ tibble 2.1.3 ✓ readr 1.3.1
## ✓ tidyr 1.0.0 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────── tidyverse_conflicts() ──
## x lubridate::as.difftime() masks base::as.difftime()
## x lubridate::date() masks base::date()
## x tidyr::extract() masks magrittr::extract()
## x dplyr::filter() masks stats::filter()
## x lubridate::intersect() masks base::intersect()
## x dplyr::lag() masks stats::lag()
## x purrr::set_names() masks magrittr::set_names()
## x lubridate::setdiff() masks base::setdiff()
## x lubridate::union() masks base::union()