Social Media Data Analysis

Valentina Valmacco

2021-04-10

Introduction

The following document contains notes and exercises from the DataCamp course “Analyzing Social Media Data in R”.

Twitter data and the rtweet package

Given the volume and rate of tweets posted worldwide every second, Twitter data are an enormous source of information, both in the tweet text and in its metadata, and offer many opportunities to derive social and marketing insights. The rtweet package is an R interface to the Twitter API and provides functions for extracting Twitter data related to a specific topic or user. With the stream_tweets() function we can collect a random sample of roughly 1% of all tweets posted during a given number of seconds and save it in a data frame.

## [1] 429  90
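
A minimal sketch of how such a sample might be collected. It assumes the rtweet version current when these notes were written (0.7.x), valid Twitter API credentials, and a 30-second window; the wrapper name sample_stream is hypothetical.

```r
# Hypothetical wrapper: calling it requires rtweet (0.7.x assumed) and API credentials.
sample_stream <- function(seconds = 30) {
  # q = "" requests the random sample stream; timeout is the duration in seconds
  rtweet::stream_tweets(q = "", timeout = seconds)
}

# Usage (not run here because it contacts the Twitter API):
# live_tweets <- sample_stream(30)
# dim(live_tweets)   # number of tweets and number of columns
```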

Search and extract tweets

The search_tweets() function

search_tweets() is an rtweet function that extracts tweets matching a search query. It returns at most 18,000 tweets per request, drawn from tweets posted in the last 6-9 days. In this exercise, search_tweets() is used to extract tweets on the Emmy Awards by searching for tweets that contain the Emmy Awards hashtag.

## # A tibble: 5 x 5
##   user_id    status_id    created_at          screen_name  text                 
##   <chr>      <chr>        <dttm>              <chr>        <chr>                
## 1 107225075… 13807534734… 2021-04-10 05:24:15 LMcBee4Dall… "Legendary natural h…
## 2 4120556734 13806253931… 2021-04-09 20:55:19 NBCDFWCommu… "Legendary natural h…
## 3 235598077  13806251676… 2021-04-09 20:54:25 earthxorg    "Legendary natural h…
## 4 822493154… 13806123892… 2021-04-09 20:03:38 EarthxFilm   "Legendary natural h…
## 5 29852972   13805661470… 2021-04-09 16:59:53 RobertaRT    "Flashback 11 years …
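
A sketch of what the search call might look like, assuming rtweet 0.7.x and API credentials. The hashtag spelling "#Emmyawards", the sample size of 2000, and the wrapper name fetch_hashtag_tweets are assumptions, not values taken from these notes.

```r
# Hypothetical wrapper around search_tweets(); hashtag and n are assumed values.
fetch_hashtag_tweets <- function(hashtag = "#Emmyawards", n = 2000) {
  # include_rts = TRUE keeps retweets in the result
  rtweet::search_tweets(q = hashtag, n = n, include_rts = TRUE)
}

# Usage (not run here because it contacts the Twitter API):
# tweets <- fetch_hashtag_tweets()
# head(tweets[, c("user_id", "status_id", "created_at", "screen_name", "text")], 5)
```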

The get_timeline() function

get_timeline() is another rtweet function for extracting tweets: it returns the tweets posted by a given user to their timeline, up to 3,200 tweets at a time.

In this exercise, tweets posted by Cristiano Ronaldo (Twitter handle @Cristiano) are extracted.

## # A tibble: 5 x 10
##   user_id  status_id   created_at          screen_name text              source 
##   <chr>    <chr>       <dttm>              <chr>       <chr>             <chr>  
## 1 1556592… 1379903300… 2021-04-07 21:05:58 Cristiano   "Grandi ragazzi,… Twitte…
## 2 1556592… 1379446925… 2021-04-06 14:52:30 Cristiano   "🏳️🏴👏🏽 #juventus… Twitte…
## 3 1556592… 1377019810… 2021-03-30 22:08:01 Cristiano   "Vitória importa… Twitte…
## 4 1556592… 1376596593… 2021-03-29 18:06:18 Cristiano   "🇵🇹❤️ https://t.… Twitte…
## 5 1556592… 1375064846… 2021-03-25 12:39:41 Cristiano   "Muito important… Twitte…
## # … with 4 more variables: display_text_width <dbl>, reply_to_status_id <lgl>,
## #   reply_to_user_id <lgl>, reply_to_screen_name <lgl>
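
A sketch of the timeline request, assuming rtweet 0.7.x and API credentials. The wrapper name fetch_timeline is hypothetical; n = 3200 is the per-request ceiling mentioned above.

```r
# Hypothetical wrapper around get_timeline().
fetch_timeline <- function(handle = "@Cristiano", n = 3200) {
  rtweet::get_timeline(user = handle, n = n)
}

# Usage (not run here because it contacts the Twitter API):
# timeline <- fetch_timeline()
# head(timeline, 5)
```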

Tweets metadata

Retweet counts

The number of times a tweet is retweeted indicates how much attention it is receiving, and therefore what is trending.

In this exercise, the tweets on “Artificial Intelligence” that have been retweeted the most are extracted.

## # A tibble: 3 x 2
##   text                                                             retweet_count
##   <chr>                                                                    <int>
## 1 "We are thrilled to announce that @PacktPub is sponsoring @clou…           885
## 2 "Toyota Is Using Artificial Intelligence To Build A New City - …           674
## 3 "IAB AI Working Group to Establish Artificial Intelligence Stan…           669
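
The ranking step can be sketched offline in base R. The data frame below is an invented stand-in for the result of search_tweets(); only the column names text and retweet_count are taken from the output above.

```r
# Toy stand-in for a search_tweets() result; all values are invented.
tweets <- data.frame(
  text = c("AI builds a city", "AI standards group", "AI builds a city", "AI in retail"),
  retweet_count = c(674L, 669L, 674L, 12L),
  stringsAsFactors = FALSE
)

# Sort by retweet count (highest first) and keep the two columns of interest.
ranked <- tweets[order(-tweets$retweet_count), c("text", "retweet_count")]

# Retweets repeat the original text, so drop duplicated texts, then keep the top 3.
top_retweeted <- head(ranked[!duplicated(ranked$text), ], 3)
print(top_retweeted)
```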

Filtering tweets

Filtering for original tweets

An original tweet is an original posting by a Twitter user and is not a retweet, quote, or reply. The -filter operator can be combined with a search query to exclude retweets, quotes, and replies during tweet extraction. In this exercise, tweets on “Superbowl” that are original posts and not retweets, quotes, or replies are extracted.

##    x freq
## 1 NA  100
##       x freq
## 1 FALSE  100
##       x freq
## 1 FALSE  100
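
A sketch of how the filtered query might be built and checked. The query string uses the -filter operators described above. The wrapper name fetch_original is hypothetical and needs rtweet 0.7.x plus API credentials. The check runs on an invented toy result and assumes the rtweet columns reply_to_screen_name, is_quote and is_retweet.

```r
# Search operators that exclude retweets, quotes and replies at extraction time.
original_query <- paste("Superbowl", "-filter:retweets", "-filter:quote", "-filter:replies")

# Hypothetical wrapper; not called here because it contacts the Twitter API.
fetch_original <- function(query = original_query, n = 100) {
  rtweet::search_tweets(q = query, n = n)
}

# Offline check on a toy result (invented values): an original tweet has no
# reply target and is neither a quote nor a retweet.
toy_result <- data.frame(
  reply_to_screen_name = c(NA, NA, NA),
  is_quote = c(FALSE, FALSE, FALSE),
  is_retweet = c(FALSE, FALSE, FALSE)
)
all_original <- all(is.na(toy_result$reply_to_screen_name)) &&
  !any(toy_result$is_quote) &&
  !any(toy_result$is_retweet)
```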

Filtering on tweet language

The language filter can be used to extract tweets written in a specific language. Example: tweets posted in French on the topic “Apple iphone”.
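
A sketch of the language filter, assuming rtweet 0.7.x, where search_tweets() forwards a lang argument (an ISO 639-1 code such as "fr") to the search API. The wrapper name fetch_in_language is hypothetical. The offline part uses invented data and assumes the result has a lang column.

```r
# Hypothetical wrapper; not called here because it contacts the Twitter API.
fetch_in_language <- function(query = "Apple iphone", language = "fr", n = 100) {
  rtweet::search_tweets(q = query, n = n, lang = language)
}

# Offline equivalent on a toy result (invented values): keep French tweets only.
toy_tweets <- data.frame(
  text = c("Nouvel iphone", "New iphone", "iphone pas cher"),
  lang = c("fr", "en", "fr"),
  stringsAsFactors = FALSE
)
french_tweets <- toy_tweets[toy_tweets$lang == "fr", ]
```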

Filter based on tweet popularity

Popular tweets are tweets that have been retweeted and favorited many times. We can restrict the extraction to tweets that have been retweeted and/or favorited at least a given number of times. In this exercise, tweets on “Chelsea” that have been retweeted at least 100 times and favorited by at least 100 users are extracted.
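
A sketch of the popularity filter. The min_retweets and min_faves search operators are assumed to be accepted by the search endpoint, as in the course exercise. The wrapper name fetch_popular is hypothetical and needs rtweet 0.7.x plus API credentials. The offline part applies the same thresholds to invented data.

```r
# Query combining the topic with minimum retweet and favorite counts.
popular_query <- paste("Chelsea", "min_retweets:100", "AND", "min_faves:100")

# Hypothetical wrapper; not called here because it contacts the Twitter API.
fetch_popular <- function(query = popular_query, n = 100) {
  rtweet::search_tweets(q = query, n = n)
}

# Offline equivalent on a toy result (invented values).
toy_tweets <- data.frame(
  retweet_count = c(250L, 90L, 400L),
  favorite_count = c(120L, 300L, 80L)
)
popular <- toy_tweets[toy_tweets$retweet_count >= 100 & toy_tweets$favorite_count >= 100, ]
```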

User information

Extract user information

User information includes the number of followers and friends of a Twitter user. It can be extracted from a set of tweets with the users_data() function. The same user may appear several times in the result, because a user may have tweeted more than once on a given subject. In this exercise, the number of friends and followers of users who tweet on #cosmetics is identified.

##   follower   friend
## 1 2850.622 1536.955
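
Because the same user can appear more than once, the counts are usually aggregated per user. A base R sketch on invented data follows; the column names followers_count and friends_count are assumed to match those returned by users_data().

```r
# Toy stand-in for users_data(tweets); "alice" appears twice because she
# tweeted twice on the topic. All values are invented.
users <- data.frame(
  screen_name = c("alice", "alice", "bob", "carol"),
  followers_count = c(1200, 1200, 300, 50),
  friends_count = c(100, 100, 600, 50),
  stringsAsFactors = FALSE
)

# One row per user, with the mean follower and friend counts.
counts_df <- aggregate(
  cbind(follower = followers_count, friend = friends_count) ~ screen_name,
  data = users,
  FUN = mean
)
print(counts_df)
```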

Explore users based on the golden ratio

The ratio of the number of followers to the number of friends a user has is called the golden ratio, which is a useful metric for marketers planning promotions. Users with a high ratio can be used to promote a product.

## [1] follower friend   ratio   
## <0 rows> (or 0-length row.names)
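
A worked sketch of the ratio on invented data. The follower threshold of 1000 is a hypothetical cut-off chosen only for illustration.

```r
# Toy per-user counts; all values are invented.
counts_df <- data.frame(
  screen_name = c("alice", "bob", "carol"),
  follower = c(1200, 300, 50),
  friend = c(100, 600, 50),
  stringsAsFactors = FALSE
)

# Golden ratio = followers / friends (Inf if a user has zero friends).
counts_df$ratio <- counts_df$follower / counts_df$friend

# Sort by ratio, highest first.
sorted <- counts_df[order(-counts_df$ratio), ]

# Keep users above a hypothetical follower threshold.
high_ratio <- sorted[sorted$follower > 1000, ]
print(high_ratio)
```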

Subscribers to twitter lists

A Twitter list is a curated group of Twitter accounts centered around a topic of interest. In this exercise, the lists of the “NBA” Twitter account are extracted.

## # A tibble: 4 x 3
##   list_id  name               uri                           
##   <chr>    <chr>              <chr>                         
## 1 18013707 NBA G League Teams /NBA/lists/nba-g-league-teams1
## 2 18013538 WNBA Teams         /NBA/lists/wnba-teams         
## 3 17852612 NBA Players        /NBA/lists/nba-players        
## 4 3738526  NBA Teams          /NBA/lists/nba-teams
## NULL
## <0 rows> (or 0-length row.names)
## # A tibble: 4 x 3
##   user_id             status_id           created_at         
##   <chr>               <chr>               <dttm>             
## 1 91031258            1380224790527545347 2021-04-08 18:23:28
## 2 1100210166727729152 1375079166849081345 2021-03-25 13:36:35
## 3 2150289313          1375150800176087048 2021-03-25 18:21:14
## 4 30284789            1373844979869741069 2021-03-22 03:52:22
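
A sketch of how the lists and their subscribers might be requested, assuming rtweet 0.7.x and API credentials. The slug "nba-players" is read from the uri column above. The wrapper name fetch_list_subscribers and the subscriber count n = 100 are assumptions.

```r
# Hypothetical wrapper; not called here because it contacts the Twitter API.
fetch_list_subscribers <- function(owner = "NBA", slug = "nba-players", n = 100) {
  # Lists associated with the account
  lists <- rtweet::lists_users(owner)
  # Subscribers of one list, identified by its slug and owner
  subscribers <- rtweet::lists_subscribers(slug = slug, owner_user = owner, n = n)
  # Full user records for those subscribers
  users <- rtweet::lookup_users(subscribers$user_id)
  list(lists = lists, subscribers = subscribers, users = users)
}

# Usage:
# nba <- fetch_list_subscribers()
# nba$lists[, c("list_id", "name", "uri")]
```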

Tweet frequency

Visualizing frequency of tweets

Visualizing the frequency of tweets over time helps to understand the level of interest in a topic or a product. In this exercise, tweets on “#walmart” are extracted and a time series plot is created to visualize the interest level.
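
A sketch of the plotting step, assuming rtweet 0.7.x (whose ts_plot() draws a ggplot2 line chart of tweet counts per interval) and API credentials. The wrapper name plot_tweet_frequency, the hourly interval and the line color are assumptions.

```r
# Hypothetical wrapper; not called here because it contacts the Twitter API.
plot_tweet_frequency <- function(query = "#walmart", n = 18000, by = "hours") {
  tweets <- rtweet::search_tweets(q = query, n = n, include_rts = FALSE)
  # Aggregate tweets per interval and draw the time series
  rtweet::ts_plot(tweets, by = by, color = "blue")
}

# Usage:
# plot_tweet_frequency("#walmart")
```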

Create time series objects

A time series object contains the aggregated frequency of tweets over a specified time interval. It allows comparison between topics/products.

In this exercise, time series objects for the sportswear brands Puma and Nike are created.

##                  time puma_n nike_n
## 1 2021-04-08 12:00:00      4     NA
## 2 2021-04-08 13:00:00     36     NA
## 3 2021-04-08 14:00:00     53     NA
## 4 2021-04-08 15:00:00     45     NA
## 5 2021-04-08 16:00:00     33     NA
## 6 2021-04-08 17:00:00     23     NA
##                  time variable value
## 1 2021-04-08 12:00:00   puma_n     4
## 2 2021-04-08 13:00:00   puma_n    36
## 3 2021-04-08 14:00:00   puma_n    53
## 4 2021-04-08 15:00:00   puma_n    45
## 5 2021-04-08 16:00:00   puma_n    33
## 6 2021-04-08 17:00:00   puma_n    23
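
The aggregate, merge and reshape steps behind tables like these can be sketched offline in base R. The timestamps are invented. rtweet's ts_data() performs the aggregation on real tweets; the base R helper hourly_counts below is a hypothetical stand-in so the example runs without API access.

```r
# Toy created_at timestamps for two brands; values are invented.
puma_times <- as.POSIXct(c("2021-04-08 12:10:00", "2021-04-08 12:40:00",
                           "2021-04-08 13:05:00"), tz = "UTC")
nike_times <- as.POSIXct(c("2021-04-08 13:15:00", "2021-04-08 13:50:00",
                           "2021-04-08 14:20:00"), tz = "UTC")

# Count tweets per hour; returns a data frame with columns `time` and `name`.
hourly_counts <- function(times, name) {
  hour <- format(times, "%Y-%m-%d %H:00:00", tz = "UTC")
  as.data.frame(table(time = hour), responseName = name, stringsAsFactors = FALSE)
}

# Merge the two series on time; all = TRUE keeps hours present in only one series.
merged <- merge(hourly_counts(puma_times, "puma_n"),
                hourly_counts(nike_times, "nike_n"),
                by = "time", all = TRUE)

# Reshape to long format (one row per time and brand) for plotting.
long <- data.frame(
  time = rep(merged$time, times = 2),
  variable = rep(c("puma_n", "nike_n"), each = nrow(merged)),
  value = c(merged$puma_n, merged$nike_n),
  stringsAsFactors = FALSE
)
print(merged)
```

Hours in which only one brand was mentioned appear as NA in the merged table, which is why the wide table above shows NA for nike_n.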

Analyze Tweet texts

Create a topic model

Topic modeling is the task of automatically discovering topics from a vast amount of text. You can create topic models from the tweet text to quickly summarize the vast information available into distinct topics and gain insights.

##      Topic 1         Topic 2          Topic 3         Topic 4        
## [1,] "climatechange" "climatechange"  "climatechange" "climatechange"
## [2,] "climatecrisis" "sustainability" "climate"       "years"        
## [3,] "amp"           "innovation"     "amp"           "will"         
## [4,] "human"         "climate"        "new"           "seen"         
## [5,] "can"           "amp"            "energy"        "today"        
##      Topic 5        
## [1,] "climatechange"
## [2,] "climate"      
## [3,] "climatecrisis"
## [4,] "environment"  
## [5,] "amp"
##      Topic 1            Topic 2         Topic 3         Topic 4        
## [1,] "climatechange"    "climatechange" "climatechange" "climatechange"
## [2,] "climatecrisis"    "years"         "climate"       "climatecrisis"
## [3,] "climate"          "seen"          "amp"           "climate"      
## [4,] "sustainability"   "will"          "energy"        "amp"          
## [5,] "amp"              "well"          "climatecrisis" "seals"        
## [6,] "fridaysforfuture" "die"           "need"          "change"
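
A minimal sketch of fitting a topic model, assuming the tm and topicmodels packages are installed. The six short documents are invented stand-ins for cleaned tweet text, and the choice of two topics and three terms per topic is arbitrary. With so little text the topics themselves are not meaningful; the sketch only shows the sequence of calls.

```r
library(tm)
library(topicmodels)

# Invented stand-ins for cleaned tweet text.
docs <- c(
  "climate change energy policy",
  "renewable energy solar wind",
  "climate crisis action today",
  "solar wind power innovation",
  "climate action policy change",
  "energy innovation power grid"
)

# Build a corpus and a document-term matrix (documents x terms).
corpus <- VCorpus(VectorSource(docs))
dtm <- DocumentTermMatrix(corpus)

# LDA cannot handle documents with no terms, so drop empty rows.
dtm <- dtm[rowSums(as.matrix(dtm)) > 0, ]

# Fit a model with k topics; the seed makes the fit reproducible.
lda_model <- LDA(dtm, k = 2, control = list(seed = 1234))

# Top 3 terms per topic, returned as a matrix with one column per topic.
top_terms <- terms(lda_model, 3)
print(top_terms)
```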

Network analysis

Create a retweet network

A retweet network can be built using the igraph package. Understanding the position of potential customers in a retweet network allows a brand to identify key players who are likely to retweet posts and spread brand messaging.

## IGRAPH 5b00d36 DN-- 80 61 -- 
## + attr: name (v/c)
## + edges from 5b00d36 (vertex names):
##  [1] ReparSandra    ->505Nomad        tdg_trekking   ->Dariozogbi     
##  [3] Domainbot1     ->HWdomains       CapelliLaVita1 ->VIParis        
##  [5] carissahadid   ->SecretFlying    KingOfPentacl  ->org_scp        
##  [7] texastwins2004 ->birdwriter7     TriadTravelogs ->Dastylishfoodie
##  [9] ExploreAmadeus ->DeniseSanger    ExploreAmadeus ->HulloSafaris   
## [11] ExploreAmadeus ->CraftingDir     ExploreAmadeus ->logomaco       
## [13] ExploreAmadeus ->TravelYesPlease ExploreAmadeus ->rv_our         
## [15] travel_biz_news->CraftingDir     RubyPerry11    ->birdwriter7    
## + ... omitted several edges
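
A sketch of how such a network might be built with igraph from a two-column edge list. The data are invented. The column names screen_name (the retweeting user) and retweet_screen_name (the author being retweeted) are assumed to match the rtweet result.

```r
library(igraph)

# Toy tweets: NA in retweet_screen_name means the tweet is not a retweet.
rt <- data.frame(
  screen_name = c("ann", "bob", "ann", "cat", "dan"),
  retweet_screen_name = c("bob", "cat", NA, "bob", "cat"),
  stringsAsFactors = FALSE
)

# Keep only rows that are retweets.
edges <- rt[complete.cases(rt), ]

# Directed network: each edge points from the retweeting user to the original author.
nw <- graph_from_edgelist(as.matrix(edges), directed = TRUE)
print(nw)
```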

Calculate out-degree scores

In a retweet network, the out-degree of a user is the number of times the user retweets posts, while the in-degree of a user is the number of times the user’s posts are retweeted. Users with high out-degree scores are key players who can be used as a medium to retweet promotional posts. Users with high in-degree scores are influential, as their tweets are retweeted many times.

## myfoodfantasy69  ExploreAmadeus      sadytushar     ReparSandra    tdg_trekking 
##              15               6               2               1               1
##  Charlesfrize       VIParis   birdwriter7   CraftingDir 2WheelersLife 
##            15             2             2             2             2
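
Both scores can be computed with igraph's degree() function. A sketch on an invented five-edge network, where each edge points from the retweeting user to the original author:

```r
library(igraph)

# Toy retweet edges (retweeter -> original author); values are invented.
el <- matrix(c("ann", "bob",
               "ann", "cat",
               "ann", "dan",
               "bob", "cat",
               "dan", "cat"), ncol = 2, byrow = TRUE)
nw <- graph_from_edgelist(el, directed = TRUE)

# Out-degree: how often each user retweets others.
out_deg <- sort(degree(nw, mode = "out"), decreasing = TRUE)

# In-degree: how often each user's posts are retweeted.
in_deg <- sort(degree(nw, mode = "in"), decreasing = TRUE)

print(out_deg)
print(in_deg)
```

Here ann retweets the most (out-degree 3), while cat is retweeted the most (in-degree 3).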

Calculate the betweenness scores

Betweenness centrality represents the degree to which nodes stand between each other. In a retweet network, a user with a high betweenness centrality score can have more control over a network because more information will pass through the user.

##  ReparSandra     505Nomad tdg_trekking   Dariozogbi   Domainbot1 
##            0            0            0            0            0
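
A sketch of the betweenness calculation with igraph on an invented network. A user scores above zero only when they lie on a shortest path between two other users, so accounts that only retweet, or are only retweeted, score 0.

```r
library(igraph)

# Toy retweet edges (retweeter -> original author); values are invented.
el <- matrix(c("ann", "bob",
               "eve", "bob",
               "bob", "cat",
               "cat", "dan"), ncol = 2, byrow = TRUE)
nw <- graph_from_edgelist(el, directed = TRUE)

# Number of shortest directed paths between other users that pass through each user.
btw <- betweenness(nw, directed = TRUE)
print(sort(btw, decreasing = TRUE))
```

In this toy network bob lies on four shortest paths (ann and eve to cat and dan) and cat on three, while ann, eve and dan lie on none.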

Follower count to enhance the network plot

The users who retweet most will add more value if they have a high follower count as their retweets will reach a wider audience. We can add this information to the plot by setting the vertex color to indicate the follower count.

## $name
##  [1] "ReparSandra"     "505Nomad"        "tdg_trekking"    "Dariozogbi"     
##  [5] "Domainbot1"      "HWdomains"       "CapelliLaVita1"  "VIParis"        
##  [9] "carissahadid"    "SecretFlying"    "KingOfPentacl"   "org_scp"        
## [13] "texastwins2004"  "birdwriter7"     "TriadTravelogs"  "Dastylishfoodie"
## [17] "ExploreAmadeus"  "DeniseSanger"    "HulloSafaris"    "CraftingDir"    
## [21] "logomaco"        "TravelYesPlease" "rv_our"          "travel_biz_news"
## [25] "RubyPerry11"     "_DesertX"        "Snezny1"         "TamurilMinyatur"
## [29] "KastKe"          "t_dalmar"        "Ghereandthere"   "Fabriziobustama"
## [33] "TriptiCharan"    "Soofuro"         "Apple_505050"    "PenguinSix"     
## [37] "Kabirkhan547680" "GulzarMustafak1" "ACAroundTown"    "aaroundtown"    
## [41] "sadytushar"      "2WheelersLife"   "SolespireDerek"  "TRAVOH_travel"  
## [45] "lovelaughterlug" "RoadTripsCoffee" "YouggyS"         "bcgoodsintl"    
## [49] "NewsBizLizzy"    "jlessuck"        "GloballyKenyan"  "PrinceTrails"   
## [53] "katr_elena"      "SunnyHolidays4u" "myfoodfantasy69" "Charlesfrize"   
## [57] "mwangiedwin504"  "185a29a356b6406" "CompassAcademic" "travelmail"     
## [61] "CarmanK1"        "ArteLeonida"     "PIPIENPierre"    "CelinePivoine"  
## [65] "IslamabadScene"  "UMountaineer"    "loriiejaane"     "AllThingTravel" 
## [69] "C_Two_Eagle"     "wallpaperable"   "Maitedalmau56"   "Havenlust"      
## [73] "dipali_atul"     "anuradhagoyal"   "precruise"       "T1Texas"        
## [77] "Jeanyor"         "JimByersTravel"  "Y4794"           "simplyart4794"  
## 
## $followers
##  [1] "1" "1" "0" "0" "1" "0" "1" "1" "0" "1" "0" "1" "0" "1" "1" "1" "1" "1" "1"
## [20] "1" "0" "1" "1" "0" "1" "1" "1" "1" "1" "1" "0" "1" "1" "0" "0" "1" "1" "1"
## [39] "1" "1" "0" "1" "0" "1" "0" "1" "0" "0" "0" "0" "0" "1" "1" "0" "0" "1" "1"
## [58] "0" "1" "0" "0" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## [77] "1" "1" "0" "1"
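
A sketch of how a follower flag might be attached to the vertices and mapped to a color. The follower counts, the cut-off of 500 and the two colors are assumptions made for illustration. The "1"/"0" coding mirrors the $followers attribute shown above.

```r
library(igraph)

# Toy retweet network; values are invented.
el <- matrix(c("ann", "bob",
               "ann", "cat",
               "dan", "cat"), ncol = 2, byrow = TRUE)
nw <- graph_from_edgelist(el, directed = TRUE)

# Invented follower counts, keyed by screen name.
followers <- c(ann = 1500, bob = 40, cat = 800, dan = 10)

# Flag users above a hypothetical cut-off of 500 followers as "1", others as "0".
flag <- ifelse(followers[V(nw)$name] > 500, "1", "0")
V(nw)$followers <- unname(flag)

# Map the flag to a vertex color.
colour_map <- c("0" = "lightgreen", "1" = "tomato")
V(nw)$color <- unname(colour_map[V(nw)$followers])

# plot(nw, vertex.color = V(nw)$color)   # draws the colored network
print(vertex_attr(nw))
```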

Tweets geolocation

It is possible to extract the geolocation of tweets to gain insight into the popularity of a topic or product across a region.
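
A sketch of the coordinate extraction, assuming rtweet 0.7.x, whose lat_lng() function appends lat and lng columns to a tweets data frame. The wrapper name extract_coordinates is hypothetical. The offline part uses invented coordinates to show the clean-up step, since most tweets carry no location.

```r
# Hypothetical wrapper; needs a tweets data frame returned by rtweet.
extract_coordinates <- function(tweets) {
  geo <- rtweet::lat_lng(tweets)       # adds lat and lng columns
  na.omit(geo[, c("lat", "lng")])      # keep only tweets with coordinates
}

# Offline illustration of the clean-up step on invented coordinates.
toy_geo <- data.frame(
  lat = c(40.7, NA, 34.1),
  lng = c(-74.0, NA, -118.2)
)
located <- na.omit(toy_geo)
```

The remaining rows can then be drawn as points on a map of the region of interest.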