Collecting Twitter Data with rtweet, Web Scraping with rvest, and Input Methods in MongoDB Atlas


This is the material for Practicum 14 of the course STA562-Manajemen Data Statistika (Statistical Data Management), for students of the Master's Program in Statistics and Data Science specializing in Big Data Analytics.

Collecting Twitter Data with rtweet

This section shows how to use rtweet to collect data from Twitter. Running it requires a token from a Twitter Connected App. rtweet already provides one.

rstats2twitter

In this practicum, however, we use a token from Sedotan, a Twitter Connected App.

Library

#install.packages("rtweet")
library(rtweet)

Using the Token

consumer_key <- "4pHN****"
consumer_secret <- "VF3z****"
access_token <- "7357****"
access_secret <- "cK2t****"
token <- create_token(
  app = "Sedotan",
  consumer_key = consumer_key,
  consumer_secret = consumer_secret,
  access_token = access_token,
  access_secret = access_secret)
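
Hard-coding keys in a script makes them easy to leak. A minimal sketch, assuming the four credentials are stored in environment variables (the variable names below are hypothetical, not something rtweet requires), reads them at run time and stops early if one is missing:

```r
# Hypothetical environment variable names; set your own, e.g. in ~/.Renviron.
read_twitter_credentials <- function() {
  vars <- c(
    consumer_key    = "TWITTER_CONSUMER_KEY",
    consumer_secret = "TWITTER_CONSUMER_SECRET",
    access_token    = "TWITTER_ACCESS_TOKEN",
    access_secret   = "TWITTER_ACCESS_SECRET"
  )
  # Unset variables come back as NA so they can be reported.
  values <- Sys.getenv(vars, unset = NA)
  names(values) <- names(vars)
  missing <- names(values)[is.na(values) | values == ""]
  if (length(missing) > 0) {
    stop("Missing credentials: ", paste(missing, collapse = ", "))
  }
  as.list(values)
}
```

The elements of the returned list can then be passed to the matching arguments of create_token().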

Examples of Using rtweet

Searching for Tweets by Keyword

rt <- search_tweets("indonesia",
                    n = 1800,
                    include_rts = FALSE
                    )

The dimensions of the collected data are shown below.

dim(rt)
## [1] 1800   90

Data Structure

dplyr::glimpse(rt)
## Rows: 1,800
## Columns: 90
## $ user_id                 <chr> "1439772405828755458", "990788287680819200", "…
## $ status_id               <chr> "1465147121011593221", "1465147108244148231", …
## $ created_at              <dttm> 2021-11-29 02:34:29, 2021-11-29 02:34:26, 202…
## $ screen_name             <chr> "BUMNJakTimSiap", "Caveman9494", "TonuKumer", …
## $ text                    <chr> "AYO kita turut andil dalam Gerakan Kolaborasi…
## $ source                  <chr> "MagellanTweets", "Twitter for Android", "Twit…
## $ display_text_width      <dbl> 140, 138, 277, 277, 277, 140, 83, 100, 108, 66…
## $ reply_to_status_id      <chr> NA, "1465105612920930304", "146465565687133388…
## $ reply_to_user_id        <chr> NA, "1300483034739691520", "100447975756663193…
## $ reply_to_screen_name    <chr> NA, "thevibesnews", "CryptoFamilyVN", "DeFiDis…
## $ is_quote                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ is_retweet              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ favorite_count          <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0…
## $ retweet_count           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ quote_count             <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ reply_count             <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ hashtags                <list> "BUMNHijaukanIndonesia", NA, NA, NA, NA, "BUM…
## $ symbols                 <list> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ urls_url                <list> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ urls_t.co               <list> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ urls_expanded_url       <list> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ media_url               <list> "http://pbs.twimg.com/media/FFU_iprVIAEiXYL.j…
## $ media_t.co              <list> "https://t.co/bLOlSqzAVm", NA, NA, NA, NA, "h…
## $ media_expanded_url      <list> "https://twitter.com/BUMNJakTimSiap/status/14…
## $ media_type              <list> "photo", NA, NA, NA, NA, "photo", NA, "photo"…
## $ ext_media_url           <list> "http://pbs.twimg.com/media/FFU_iprVIAEiXYL.j…
## $ ext_media_t.co          <list> "https://t.co/bLOlSqzAVm", NA, NA, NA, NA, "h…
## $ ext_media_expanded_url  <list> "https://twitter.com/BUMNJakTimSiap/status/14…
## $ ext_media_type          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ mentions_user_id        <list> NA, "1300483034739691520", <"1004479757566631…
## $ mentions_screen_name    <list> NA, "thevibesnews", <"CryptoFamilyVN", "a2dao…
## $ lang                    <chr> "in", "en", "en", "en", "en", "in", "in", "in"…
## $ quoted_status_id        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ quoted_text             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ quoted_created_at       <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ quoted_source           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ quoted_favorite_count   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ quoted_retweet_count    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ quoted_user_id          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ quoted_screen_name      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ quoted_name             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ quoted_followers_count  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ quoted_friends_count    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ quoted_statuses_count   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ quoted_location         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ quoted_description      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ quoted_verified         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ retweet_status_id       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ retweet_text            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ retweet_created_at      <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ retweet_source          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ retweet_favorite_count  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ retweet_retweet_count   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ retweet_user_id         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ retweet_screen_name     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ retweet_name            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ retweet_followers_count <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ retweet_friends_count   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ retweet_statuses_count  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ retweet_location        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ retweet_description     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ retweet_verified        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ place_url               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ place_name              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ place_full_name         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ place_type              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ country                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ country_code            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ geo_coords              <list> <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, …
## $ coords_coords           <list> <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, …
## $ bbox_coords             <list> <NA, NA, NA, NA, NA, NA, NA, NA>, <NA, NA, NA…
## $ status_url              <chr> "https://twitter.com/BUMNJakTimSiap/status/146…
## $ name                    <chr> "BUMN_JakTim_SIAP", "Giri Ram", "sojib 10", "s…
## $ location                <chr> "", "", "", "", "", "", "", "Sulawesi Tengah, …
## $ description             <chr> "Hobby Olahraga & Travelling", "", "", "", "",…
## $ url                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ protected               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ followers_count         <int> 19, 9, 13, 13, 13, 39, 2, 14, 14, 14, 14, 14, …
## $ friends_count           <int> 133, 33, 630, 630, 630, 10, 15, 2, 2, 2, 2, 2,…
## $ listed_count            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ statuses_count          <int> 101, 371, 2985, 2985, 2985, 96, 17, 628, 628, …
## $ favourites_count        <int> 18, 3209, 608, 608, 608, 0, 247, 613, 613, 613…
## $ account_created_at      <dttm> 2021-09-20 02:04:38, 2018-04-30 03:01:50, 202…
## $ verified                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ profile_url             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ profile_expanded_url    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ account_lang            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ profile_banner_url      <chr> "https://pbs.twimg.com/profile_banners/1439772…
## $ profile_background_url  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ profile_image_url       <chr> "http://pbs.twimg.com/profile_images/143984444…
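
Several columns in this output, such as hashtags, are list-columns: each row holds a vector of zero or more values. A small sketch on made-up data (the three rows below are illustrative, not taken from the collected tweets) shows one way to count hashtag frequencies from such a column:

```r
# Toy stand-in for the `hashtags` list-column returned by search_tweets().
toy <- data.frame(status_id = c("1", "2", "3"), stringsAsFactors = FALSE)
toy$hashtags <- list(c("indonesia", "data"), NA_character_, "indonesia")

# Flatten the list, drop missing values, and count each hashtag.
all_tags <- unlist(toy$hashtags)
tag_counts <- sort(table(all_tags[!is.na(all_tags)]), decreasing = TRUE)
tag_counts
```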

Searching for more tweets than Twitter's limit allows.

Limit

Twitter limits the number of tweets (18,000) that can be retrieved within a given time window (15 minutes). If you want to collect a large number of tweets (for example, to monitor a particular keyword), you can use the option retryonratelimit = TRUE.

rt <- search_tweets("data",
                    n = 250000,
                    retryonratelimit = TRUE
                    )
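
To get a feel for how long such a request runs, the limits quoted above can be turned into a rough lower bound. This is only back-of-the-envelope arithmetic under those stated limits; it ignores download time and any other throttling:

```r
# Assumed limits, as stated above: 18,000 tweets per 15-minute window.
tweets_wanted <- 250000
tweets_per_window <- 18000
window_minutes <- 15

# Number of rate-limit windows needed to cover the request.
windows_needed <- ceiling(tweets_wanted / tweets_per_window)

# The first window starts immediately; each later one waits for a reset.
min_wait_minutes <- (windows_needed - 1) * window_minutes

windows_needed     # 14
min_wait_minutes   # 195
```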

Tweet Location

## search for 10,000 tweets sent from the US
rt <- search_tweets("lang:en",
                    geocode = lookup_coords("usa"),
                    n = 10000
                    )
## Warning: Rate limit exceeded - 88
## Warning: Rate limit exceeded
## create lat/lng variables using all available tweet and profile geo-location data
rt <- lat_lng(rt)

## plot state boundaries
par(mar = c(0, 0, 0, 0))
maps::map("state", lwd = .25)

## plot lat and lng points onto state map
with(rt, points(lng, lat, pch = 20, cex = .75, col = rgb(0, .3, .7, .75)))

Accounts Followed by an Account

## get user IDs of accounts followed by ipbofficial 
following <- get_friends("ipbofficial")

head(following)
## # A tibble: 6 × 2
##   user        user_id            
##   <chr>       <chr>              
## 1 ipbofficial 151261945          
## 2 ipbofficial 22878447           
## 3 ipbofficial 74646907           
## 4 ipbofficial 51010742           
## 5 ipbofficial 2421551478         
## 6 ipbofficial 1219523581140332544
lookup_users(following$user_id)[1:10,4]
## # A tibble: 10 × 1
##    screen_name   
##    <chr>         
##  1 UGMYogyakarta 
##  2 itbofficial   
##  3 univ_indonesia
##  4 unpad         
##  5 AusIndCentre  
##  6 ltmptofficial 
##  7 KSPgoid       
##  8 pddikti       
##  9 SekreSNMPTN   
## 10 arif_satria

Followers of an Account

## get user IDs of accounts following ipbofficial
followers <- get_followers("ipbofficial", n = 500)
head(followers)
## # A tibble: 6 × 1
##   user_id            
##   <chr>              
## 1 1465140187764195332
## 2 17824995           
## 3 977473478323351552 
## 4 1465117404166443009
## 5 1465104855718977536
## 6 1145552705928093697
lookup_users(followers$user_id)[1:10,4]
## # A tibble: 10 × 1
##    screen_name    
##    <chr>          
##  1 lattepyong     
##  2 widhyaksara    
##  3 iamfayzasevanaa
##  4 Fazririsiregar 
##  5 AzkarBadri1    
##  6 AmandaFebby7   
##  7 NMI72729256    
##  8 HAmyrullah     
##  9 ardellreynold20
## 10 irateniaua
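
Once both ID lists are available, base R set operations are enough to compare them. The sketch below uses made-up IDs in place of following$user_id and followers$user_id:

```r
# Made-up user IDs standing in for the two `user_id` columns above.
following_ids <- c("101", "102", "103")
follower_ids  <- c("201", "102", "101")

# Accounts that are both followed by the account and following it.
mutual_ids <- intersect(following_ids, follower_ids)

# Accounts that are followed but do not follow back.
not_following_back <- setdiff(following_ids, follower_ids)
```

Note that get_followers() was called with n = 500 above, so on real data a comparison like this only covers the followers that were actually retrieved.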

Tweets from an Account

tmls <- get_timelines(c("republikaonline", "kompascom", "detikcom"), n = 3200)

## plot the frequency of tweets for each user over time
tmls %>%
  dplyr::filter(created_at > "2021-10-29") %>%
  dplyr::group_by(screen_name) %>%
  ts_plot("days", trim = 1L) +
  ggplot2::geom_point() +
  ggplot2::theme_minimal() +
  ggplot2::theme(
    legend.title = ggplot2::element_blank(),
    legend.position = "bottom",
    plot.title = ggplot2::element_text(face = "bold")) +
  ggplot2::labs(
    x = NULL, y = NULL,
    title = "Frequency of Twitter statuses posted by news organization",
    subtitle = "Twitter status (tweet) counts aggregated by day",
    caption = "\nSource: Data collected from Twitter's REST API via rtweet"
  )

Easily Harvest (Scrape) Web Pages with rvest

In this section, we will scrape data from itch.io with R.

About itch.io

itch.io is an open marketplace for independent digital creators with a focus on independent video games. It’s a platform that enables anyone to sell the content they’ve created. As a seller you’re in charge of how it’s done: you set the price, you run sales, and you design your pages. It’s never necessary to get votes, likes, or follows to get your content approved, and you can make changes to how you distribute your work as frequently as you like.

The following steps scrape the Top Rated Free Games for Windows.

Targeted Data:

  • Game Title
  • Developer
  • Rating Count
  • Rating Score
  • Story/Description
  • Size

Library

library(rvest)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Initialization

# prepare the URL to scrape
url <- "https://itch.io/games/top-rated/free/platform-windows"
# getting the html codes
itchio <- read_html(url)
itchio
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body data-page_name="browse" data-host="itch.io" class="locale_en main_l ...

Game Title

#get the product title codes
title_html <- html_nodes(itchio, ".game_title")
title_html[[2]] # the game we hovered over earlier
## {html_node}
## <div class="game_title">
## [1] <a class="title game_link" href="https://brianna-lei.itch.io/butterfly-so ...
#convert codes to text
title_text <- html_text(title_html)
title_text
##  [1] "Friday Night Funkin'"                    
##  [2] "Butterfly Soup"                          
##  [3] "​Our Life: Beginnings & Always"           
##  [4] "Project Kat"                             
##  [5] "Andromeda Six"                           
##  [6] "Doki Doki Literature Club!"              
##  [7] "Sort the Court!"                         
##  [8] "Cinderella Phenomenon"                   
##  [9] "Six Cats Under"                          
## [10] "Mindustry"                               
## [11] "Ebon Light"                              
## [12] "missed messages."                        
## [13] "Blooming Panic"                          
## [14] "Ravenfield (Beta 5)"                     
## [15] "Scout: An Apocalypse Story"              
## [16] "CHAINSAW DANCE DEMO"                     
## [17] "Vincent: The Secret of Myers"            
## [18] "one night, hot springs [jam ver.]-50%"   
## [19] "her tears were my light"                 
## [20] "Desktop Goose"                           
## [21] "Therapy with Dr. Albert Krueger"         
## [22] "Syrup and the Ultimate Sweet"            
## [23] "Juice Galaxy (formerly Juice World)"     
## [24] "Baldi's Basics in Education and Learning"
## [25] "Lonely Wolf Treat"                       
## [26] "Raft"                                    
## [27] "Devil Express"                           
## [28] "Lost Constellation"                      
## [29] "Perfumare"                               
## [30] "ULTRAKILL Prelude"
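
The selector step can be tried offline on a tiny page. This is a sketch on made-up HTML that only imitates the game_title class used above; the real itch.io markup is more complex and may change:

```r
library(rvest)

# Made-up HTML that imitates two game entries.
html <- '<html><body>
  <div class="game_title"><a href="#">Game A</a></div>
  <div class="game_title"><a href="#">Game B</a></div>
</body></html>'
page <- read_html(html)

# ".game_title" is a CSS class selector: it matches every element
# whose class attribute contains game_title.
titles <- page %>%
  html_nodes(".game_title") %>%
  html_text(trim = TRUE)
titles
```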

Game Author / Developer

# get author/developer
author_html <- html_nodes(itchio, ".game_author")
author_text <- html_text(author_html)
author_text
##  [1] "ninjamuffin99"       "Brianna Lei"         "GBPatch"            
##  [4] "Leef 6010"           "Wanderlust Games"    "Team Salvato"       
##  [7] "Graeme Borland"      "Dicesuki"            "Team Bean Loop"     
## [10] "Anuke"               "Underbliss"          "angela he"          
## [13] "robobarbie"          "SteelRaven7"         "Anya"               
## [16] "Benedique"           "dino999z"            "npckc"              
## [19] "NomnomNami"          "samperson"           "dino999z"           
## [22] "NomnomNami"          "fishlicka"           "Basically Games"    
## [25] "NomnomNami"          "Redbeet Interactive" "Bad Pet"            
## [28] "Finji"               "PDRRook"             "Hakita"

Rating Count

library(stringr) # since we're going to deal with character
rating_html <- html_nodes(itchio, ".game_rating")
rating_html
## {xml_nodeset (30)}
##  [1] <div class="game_rating">\n<div class="star_value">\n<div style="width:  ...
##  [2] <div class="game_rating">\n<div class="star_value">\n<div style="width:  ...
##  [3] <div class="game_rating">\n<div class="star_value">\n<div style="width:  ...
##  [4] <div class="game_rating">\n<div class="star_value">\n<div style="width:  ...
##  [5] <div class="game_rating">\n<div class="star_value">\n<div style="width:  ...
##  [6] <div class="game_rating">\n<div class="star_value">\n<div style="width:  ...
##  [7] <div class="game_rating">\n<div class="star_value">\n<div style="width:  ...
##  [8] <div class="game_rating">\n<div class="star_value">\n<div style="width:  ...
##  [9] <div class="game_rating">\n<div class="star_value">\n<div style="width:  ...
## [10] <div class="game_rating">\n<div class="star_value">\n<div style="width:  ...
## [11] <div class="game_rating">\n<div class="star_value">\n<div style="width:  ...
## [12] <div class="game_rating">\n<div class="star_value">\n<div style="width:  ...
## [13] <div class="game_rating">\n<div class="star_value">\n<div style="width:  ...
## [14] <div class="game_rating">\n<div class="star_value">\n<div style="width:  ...
## [15] <div class="game_rating">\n<div class="star_value">\n<div style="width:  ...
## [16] <div class="game_rating">\n<div class="star_value">\n<div style="width:  ...
## [17] <div class="game_rating">\n<div class="star_value">\n<div style="width:  ...
## [18] <div class="game_rating">\n<div class="star_value">\n<div style="width:  ...
## [19] <div class="game_rating">\n<div class="star_value">\n<div style="width:  ...
## [20] <div class="game_rating">\n<div class="star_value">\n<div style="width:  ...
## ...
string<- c("[[:punct:]]") # prepare string to remove

rating_count <- html_text(rating_html) %>% 
  str_remove_all(pattern = string) %>% 
  str_squish() %>% as.numeric()
rating_count
##  [1] 8961 2945 2157 3116 1911 3126 4377 1510 1604 1303 1036 1285  591 1670  678
## [16]  939  443  872  785 1065  701  866  546 1460  938 2304  416  697  393  464
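
The cleaning chain can be checked on a couple of strings. The sketch assumes the raw text looks like "(8,961)" surrounded by whitespace; the two inputs are made up:

```r
library(stringr)

# Made-up raw strings in the assumed "(8,961)" style.
raw_counts <- c("\n(8,961)\n", " (591) ")

clean_counts <- raw_counts %>%
  str_remove_all("[[:punct:]]") %>%  # drop parentheses and thousands separators
  str_squish() %>%                   # trim and collapse whitespace
  as.numeric()
clean_counts
```

Because every punctuation character is removed, this approach suits integer counts only; applied to a value such as "4.75" it would also strip the decimal point.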

Rating Score

rating_score <- itchio %>%
  html_nodes("div") %>%
  html_nodes(".game_rating") %>%
  html_nodes("span") %>% 
  html_nodes(xpath = '//*[@class="rating_count"]') %>% 
  html_attr("title") %>% 
  as.numeric()
rating_score
##  [1] 4.75 4.90 4.95 4.84 4.89 4.80 4.69 4.83 4.78 4.82 4.85 4.80 4.95 4.75 4.90
## [16] 4.81 4.95 4.82 4.83 4.77 4.83 4.78 4.86 4.68 4.75 4.60 4.89 4.79 4.90 4.87
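
Here the score is read from an attribute rather than from the visible text. A compact sketch on made-up markup, assuming the score sits in the title attribute of a rating_count element inside game_rating, expresses the same idea with one CSS descendant selector:

```r
library(rvest)

# Made-up markup; the structure is an assumption based on the selectors above.
html <- '<html><body>
  <div class="game_rating"><span class="rating_count" title="4.75">(8,961)</span></div>
  <div class="game_rating"><span class="rating_count" title="4.90">(2,945)</span></div>
</body></html>'

scores <- read_html(html) %>%
  html_nodes(".game_rating .rating_count") %>%  # descendant selector
  html_attr("title") %>%                        # read the attribute, not the text
  as.numeric()
scores
```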

Wrap All

itchio_wrap <- data.frame(title_text, author_text, rating_score, rating_count)
head(itchio_wrap)
##                      title_text      author_text rating_score rating_count
## 1          Friday Night Funkin'    ninjamuffin99         4.75         8961
## 2                Butterfly Soup      Brianna Lei         4.90         2945
## 3 ​Our Life: Beginnings & Always          GBPatch         4.95         2157
## 4                   Project Kat        Leef 6010         4.84         3116
## 5                 Andromeda Six Wanderlust Games         4.89         1911
## 6    Doki Doki Literature Club!     Team Salvato         4.80         3126
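
Combining separately scraped vectors works only while every game has every field; a missing rating would make the vectors differ in length or shift the rows. One possible safeguard, sketched on made-up HTML (the game_cell wrapper class is a hypothetical name), is to scrape field by field within each game entry:

```r
library(rvest)

# Made-up page: the second game has no rating block.
html <- '<html><body>
  <div class="game_cell">
    <div class="game_title">Game A</div>
    <span class="rating_count" title="4.75">(10)</span>
  </div>
  <div class="game_cell"><div class="game_title">Game B</div></div>
</body></html>'
cells <- read_html(html) %>% html_nodes(".game_cell")

# html_node() (singular) returns exactly one result per cell,
# with a missing node where nothing matches, so rows stay aligned.
games <- data.frame(
  title = cells %>% html_node(".game_title") %>% html_text(trim = TRUE),
  score = cells %>% html_node(".rating_count") %>% html_attr("title") %>% as.numeric(),
  stringsAsFactors = FALSE
)
games
```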

Visualization

# filter and sort: by score first, then by rating count to break ties
itchio_arr <- itchio_wrap %>% 
  filter(rating_count >= 10, rating_score >= 4.5) %>% 
  arrange(desc(rating_score), desc(rating_count))
itchio_arr
##                                  title_text         author_text rating_score
## 1             ​Our Life: Beginnings & Always             GBPatch         4.95
## 2                            Blooming Panic          robobarbie         4.95
## 3              Vincent: The Secret of Myers            dino999z         4.95
## 4                            Butterfly Soup         Brianna Lei         4.90
## 5                Scout: An Apocalypse Story                Anya         4.90
## 6                                 Perfumare             PDRRook         4.90
## 7                             Andromeda Six    Wanderlust Games         4.89
## 8                             Devil Express             Bad Pet         4.89
## 9                         ULTRAKILL Prelude              Hakita         4.87
## 10      Juice Galaxy (formerly Juice World)           fishlicka         4.86
## 11                               Ebon Light          Underbliss         4.85
## 12                              Project Kat           Leef 6010         4.84
## 13                    Cinderella Phenomenon            Dicesuki         4.83
## 14                  her tears were my light          NomnomNami         4.83
## 15          Therapy with Dr. Albert Krueger            dino999z         4.83
## 16                                Mindustry               Anuke         4.82
## 17    one night, hot springs [jam ver.]-50%               npckc         4.82
## 18                      CHAINSAW DANCE DEMO           Benedique         4.81
## 19               Doki Doki Literature Club!        Team Salvato         4.80
## 20                         missed messages.           angela he         4.80
## 21                       Lost Constellation               Finji         4.79
## 22                           Six Cats Under      Team Bean Loop         4.78
## 23             Syrup and the Ultimate Sweet          NomnomNami         4.78
## 24                            Desktop Goose           samperson         4.77
## 25                     Friday Night Funkin'       ninjamuffin99         4.75
## 26                      Ravenfield (Beta 5)         SteelRaven7         4.75
## 27                        Lonely Wolf Treat          NomnomNami         4.75
## 28                          Sort the Court!      Graeme Borland         4.69
## 29 Baldi's Basics in Education and Learning     Basically Games         4.68
## 30                                     Raft Redbeet Interactive         4.60
##    rating_count
## 1          2157
## 2           591
## 3           443
## 4          2945
## 5           678
## 6           393
## 7          1911
## 8           416
## 9           464
## 10          546
## 11         1036
## 12         3116
## 13         1510
## 14          785
## 15          701
## 16         1303
## 17          872
## 18          939
## 19         3126
## 20         1285
## 21          697
## 22         1604
## 23          866
## 24         1065
## 25         8961
## 26         1670
## 27          938
## 28         4377
## 29         1460
## 30         2304
# visualization
library(ggplot2)

plot <- ggplot(itchio_arr, aes(x=reorder(title_text, rating_score), y=rating_score)) +
  geom_point(aes(size = rating_count, color = rating_score)) + coord_flip() +
  labs(x = "",
       y = "Score",
       title = "Top Rated Free Games on itch.io",
       subtitle = "Filtered for Windows Platform",
       size = "Rating Count") +
  scale_color_continuous(low = "pink", high = "maroon") + 
  scale_size_continuous(breaks = c(25,50,100,200)) + 
  guides(color = "none") +
  theme_minimal()
plot
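
The filter-and-sort step that feeds the plot can be checked on a small made-up data frame. desc() takes a single vector, so sorting by score and then breaking ties by count needs one desc() call per column:

```r
library(dplyr)

# Made-up scores with a tie at 4.95.
toy <- data.frame(
  title = c("A", "B", "C"),
  rating_score = c(4.95, 4.80, 4.95),
  rating_count = c(400, 3000, 2000),
  stringsAsFactors = FALSE
)

# Highest score first; among equal scores, the larger count first.
toy_sorted <- toy %>%
  filter(rating_count >= 10, rating_score >= 4.5) %>%
  arrange(desc(rating_score), desc(rating_count))
toy_sorted$title
```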

Create Methods in MongoDB Atlas

In previous sessions we covered Read Methods. In this session, we will insert data into MongoDB Atlas.

From the two sections above, we already have two data frames:

dplyr::glimpse(rt)
## Rows: 4,400
## Columns: 92
## $ user_id                 <chr> "77944562", "888185595654062080", "83590831075…
## $ status_id               <chr> "1465147173310545923", "1465147173125820416", …
## $ created_at              <dttm> 2021-11-29 02:34:41, 2021-11-29 02:34:41, 202…
## $ screen_name             <chr> "L_Skrubby05", "Abhai_BTTG", "WHATSPUPDAWG", "…
## $ text                    <chr> "@RNicKL5 STFU ARE YOU KIDDING!?", "@JeffEisen…
## $ source                  <chr> "Twitter for iPhone", "Twitter Web App", "Twit…
## $ display_text_width      <dbl> 22, 7, 45, 110, 40, 223, 94, 75, 33, 246, 67, …
## $ reply_to_status_id      <chr> "1465146944683229193", "1465146999184011268", …
## $ reply_to_user_id        <chr> "37363167", "239575120", "865590564070227969",…
## $ reply_to_screen_name    <chr> "RNicKL5", "JeffEisenband", "HaveYouMetTomu", …
## $ is_quote                <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE…
## $ is_retweet              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ favorite_count          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ retweet_count           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ quote_count             <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ reply_count             <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ hashtags                <list> NA, NA, NA, NA, NA, NA, NA, NA, NA, <"BlackFr…
## $ symbols                 <list> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ urls_url                <list> NA, NA, NA, NA, "twitter.com/booksarts1/sta…"…
## $ urls_t.co               <list> NA, NA, NA, NA, "https://t.co/st24lfLiXU", NA…
## $ urls_expanded_url       <list> NA, NA, NA, NA, "https://twitter.com/booksart…
## $ media_url               <list> NA, NA, NA, NA, NA, NA, NA, NA, NA, "http://p…
## $ media_t.co              <list> NA, NA, NA, NA, NA, NA, NA, NA, NA, "https://…
## $ media_expanded_url      <list> NA, NA, NA, NA, NA, NA, NA, NA, NA, "https://…
## $ media_type              <list> NA, NA, NA, NA, NA, NA, NA, NA, NA, "photo", …
## $ ext_media_url           <list> NA, NA, NA, NA, NA, NA, NA, NA, NA, "http://p…
## $ ext_media_t.co          <list> NA, NA, NA, NA, NA, NA, NA, NA, NA, "https://…
## $ ext_media_expanded_url  <list> NA, NA, NA, NA, NA, NA, NA, NA, NA, "https://…
## $ ext_media_type          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ mentions_user_id        <list> "37363167", "239575120", "865590564070227969"…
## $ mentions_screen_name    <list> "RNicKL5", "JeffEisenband", "HaveYouMetTomu",…
## $ lang                    <chr> "en", "en", "en", "en", "en", "en", "en", "en"…
## $ quoted_status_id        <chr> NA, NA, NA, NA, "1464617086139969541", NA, NA,…
## $ quoted_text             <chr> NA, NA, NA, NA, "https://t.co/k5yEIryFng", NA,…
## $ quoted_created_at       <dttm> NA, NA, NA, NA, 2021-11-27 15:28:19, NA, NA, …
## $ quoted_source           <chr> NA, NA, NA, NA, "Twitter for iPhone", NA, NA, …
## $ quoted_favorite_count   <int> NA, NA, NA, NA, 207, NA, NA, NA, NA, NA, 1797,…
## $ quoted_retweet_count    <int> NA, NA, NA, NA, 48, NA, NA, NA, NA, NA, 588, N…
## $ quoted_user_id          <chr> NA, NA, NA, NA, "1241404370484441090", NA, NA,…
## $ quoted_screen_name      <chr> NA, NA, NA, NA, "BooksArts1", NA, NA, NA, NA, …
## $ quoted_name             <chr> NA, NA, NA, NA, "Books & Arts", NA, NA, NA, NA…
## $ quoted_followers_count  <int> NA, NA, NA, NA, 1892, NA, NA, NA, NA, NA, 5108…
## $ quoted_friends_count    <int> NA, NA, NA, NA, 2117, NA, NA, NA, NA, NA, 1496…
## $ quoted_statuses_count   <int> NA, NA, NA, NA, 9639, NA, NA, NA, NA, NA, 4062…
## $ quoted_location         <chr> NA, NA, NA, NA, "", NA, NA, NA, NA, NA, "Washi…
## $ quoted_description      <chr> NA, NA, NA, NA, "Voracious reader, bibliophile…
## $ quoted_verified         <lgl> NA, NA, NA, NA, FALSE, NA, NA, NA, NA, NA, TRU…
## $ retweet_status_id       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ retweet_text            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ retweet_created_at      <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ retweet_source          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ retweet_favorite_count  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ retweet_retweet_count   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ retweet_user_id         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ retweet_screen_name     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ retweet_name            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ retweet_followers_count <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ retweet_friends_count   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ retweet_statuses_count  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ retweet_location        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ retweet_description     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ retweet_verified        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ place_url               <chr> "https://api.twitter.com/1.1/geo/id/bbb67f6528…
## $ place_name              <chr> "Lido Beach", NA, NA, NA, "Richmond", NA, NA, …
## $ place_full_name         <chr> "Lido Beach, NY", NA, NA, NA, "Richmond, VA", …
## $ place_type              <chr> "city", NA, NA, NA, "city", NA, NA, NA, NA, NA…
## $ country                 <chr> "United States", NA, NA, NA, "United States", …
## $ country_code            <chr> "US", NA, NA, NA, "US", NA, NA, NA, NA, NA, NA…
## $ geo_coords              <list> <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, …
## $ coords_coords           <list> <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, …
## $ bbox_coords             <list> <-73.63760, -73.58373, -73.58373, -73.63760, …
## $ status_url              <chr> "https://twitter.com/L_Skrubby05/status/146514…
## $ name                    <chr> "Leo Skorupski", "Абхай Савкар || Bowled Throu…
## $ location                <chr> "Lido Beach, New York", "Santa Cruz, CA", "Tam…
## $ description             <chr> "Started a twitter to publicly yell at profess…
## $ url                     <chr> NA, "https://t.co/0JAHSJ9Ywa", "https://t.co/R…
## $ protected               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ followers_count         <int> 343, 435, 549, 12, 421, 5311, 30, 3418, 653, 6…
## $ friends_count           <int> 1492, 1952, 336, 47, 205, 5685, 11, 792, 360, …
## $ listed_count            <int> 5, 3, 2, 0, 15, 6, 1, 106, 5, 0, 2, 0, 0, 2, 1…
## $ statuses_count          <int> 28257, 5157, 5483, 196, 22985, 118640, 13302, …
## $ favourites_count        <int> 76858, 5581, 167475, 639, 19770, 266056, 4779,…
## $ account_created_at      <dttm> 2009-09-28 06:35:28, 2017-07-20 23:55:22, 201…
## $ verified                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ profile_url             <chr> NA, "https://t.co/0JAHSJ9Ywa", "https://t.co/R…
## $ profile_expanded_url    <chr> NA, "http://bowledthroughthegate.weebly.com", …
## $ account_lang            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ profile_banner_url      <chr> "https://pbs.twimg.com/profile_banners/7794456…
## $ profile_background_url  <chr> "http://abs.twimg.com/images/themes/theme1/bg.…
## $ profile_image_url       <chr> "http://pbs.twimg.com/profile_images/146475876…
## $ lat                     <dbl> 40.59001, NA, NA, NA, 37.52988, NA, NA, NA, NA…
## $ lng                     <dbl> -73.61067, NA, NA, NA, -77.49317, NA, NA, NA, …
dplyr::glimpse(itchio_wrap)
## Rows: 30
## Columns: 4
## $ title_text   <chr> "Friday Night Funkin'", "Butterfly Soup", "​Our Life: Begi…
## $ author_text  <chr> "ninjamuffin99", "Brianna Lei", "GBPatch", "Leef 6010", "…
## $ rating_score <dbl> 4.75, 4.90, 4.95, 4.84, 4.89, 4.80, 4.69, 4.83, 4.78, 4.8…
## $ rating_count <dbl> 8961, 2945, 2157, 3116, 1911, 3126, 4377, 1510, 1604, 130…

We will insert these two data frames into MongoDB Atlas.

Creating the Connection

library(mongolite)
# This is the connection_string. You can get the exact url from your MongoDB cluster screen
connection_string = 'mongodb+srv://<username>:<password>@<cluster-name>.<code>.mongodb.net/admin?retryWrites=true&w=majority'

Creating the Database and Collections

twitter_collection <- mongo(collection = "twitter", # Creating collection
                         db = "sample_dataset_R", # Creating DataBase
                         url = connection_string, 
                         verbose = TRUE)
itchio_collection <- mongo(collection = "itchio", # Creating collection
                         db = "sample_dataset_R", # Creating DataBase
                         url = connection_string, 
                         verbose = TRUE)
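
The connection string contains a username and password, so it is safer not to type them into the script. A minimal sketch, assuming the values are kept in environment variables (the variable names and the example host are hypothetical), builds the string and percent-encodes characters that would otherwise break the URI:

```r
# Hypothetical environment variable names; the host comes from your Atlas cluster page.
build_atlas_uri <- function(user = Sys.getenv("MONGO_USER"),
                            password = Sys.getenv("MONGO_PASSWORD"),
                            host = Sys.getenv("MONGO_HOST")) {
  # Percent-encode characters such as "@" or "/" in the credentials.
  sprintf(
    "mongodb+srv://%s:%s@%s/admin?retryWrites=true&w=majority",
    utils::URLencode(user, reserved = TRUE),
    utils::URLencode(password, reserved = TRUE),
    host
  )
}

# Example with placeholder values.
example_uri <- build_atlas_uri("student", "p@ss", "cluster0.example.mongodb.net")
example_uri
```

The result can be passed as the url argument of mongo() in place of connection_string.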

Insert Process

twitter_collection$insert(rt)
## Processed 1000 rows...
## Processed 2000 rows...
## Processed 3000 rows...
## Processed 4000 rows...
## Complete! Processed total of 4400 rows.
## List of 5
##  $ nInserted  : num 4400
##  $ nMatched   : num 0
##  $ nRemoved   : num 0
##  $ nUpserted  : num 0
##  $ writeErrors: list()
itchio_collection$insert(itchio_wrap)
## Complete! Processed total of 30 rows.
## List of 5
##  $ nInserted  : num 30
##  $ nMatched   : num 0
##  $ nRemoved   : num 0
##  $ nUpserted  : num 0
##  $ writeErrors: list()
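
Each inserted row becomes one document in the collection. As a rough illustration only (this calls jsonlite directly on made-up data and is not a trace of what mongolite does internally), a row shaped like those in itchio_wrap maps to a JSON object with one field per column:

```r
library(jsonlite)

# One made-up row shaped like itchio_wrap.
toy <- data.frame(
  title_text = "Game A",
  author_text = "Dev A",
  rating_score = 4.75,
  rating_count = 10,
  stringsAsFactors = FALSE
)

# Row-wise JSON: one object per data frame row.
doc_json <- toJSON(toy, dataframe = "rows", auto_unbox = TRUE)
doc_json
```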

Result

The result of the insert (Create Methods) into MongoDB Atlas is shown below.


References

Fauziyyah, NA. 2019. Web Scraping in R using rvest [Accessed online: 29 November 2021]. https://rpubs.com/nabiilahardini/itchio

