Introduction

We will be working with a sample of data obtained via the Twitter API. Some tweets included exact coordinates, while the majority provided only place names, which were then geolocated using their respective bounding boxes. Sentiment analysis using the VADER algorithm was performed on this data to obtain the compound scores. Additionally, we will incorporate a regional unemployment rate data set into the Geographically Weighted Regression (GWR) analysis.

The goal of this project is to investigate the spatial patterns and relationships between sentiment expressed in tweets and regional unemployment rates across the UK. By applying spatial autocorrelation and Geographically Weighted Regression (GWR) analysis, the project aims to uncover how sentiment varies geographically and how it correlates with economic indicators such as unemployment rate. This analysis will provide insights into the spatial dynamics of social media activity and its potential linkages to socio-economic factors.

First, we start by loading the data sets.

# read twitter data
tweet <- read.csv("uk-sentiment-data.csv")
# preview the first rows
head(tweet)
##       tweet_id           created_at      place_name          full_place_name
## 1 1.080252e+18 2019-01-01T23:57:21Z       Islington        Islington, London
## 2 1.080243e+18 2019-01-01T23:24:20Z      Haslingden      Haslingden, England
## 3 1.080238e+18 2019-01-01T23:03:35Z        Scotland Scotland, United Kingdom
## 4 1.080238e+18 2019-01-01T23:02:35Z    Loughborough    Loughborough, England
## 5 1.080231e+18 2019-01-01T22:34:35Z        West End        West End, England
## 6 1.080227e+18 2019-01-01T22:19:47Z Southend-on-Sea    Southend-on-Sea, East
##         long      lat exact_coords place_type country_code      username
## 1 -0.1091814 51.54693        FALSE       city           GB      jdportes
## 2 -2.3255037 53.69489        FALSE       city           GB  stevegtweets
## 3 -4.2004410 57.73945        FALSE      admin           GB verafinlayson
## 4 -1.2239521 52.76671        FALSE       city           GB       luffdee
## 5 -1.3357180 50.92779        FALSE       city           GB Andrews47Andy
## 6  0.7212505 51.54944        FALSE       city           GB    1940MadMag
##                                                                                                                                                                                                                                                                                                 text
## 1 @ArronDavid12 @AnitaBellows12 @SwotTyler @DrLeeJones Sigh. People who talk about "supply and demand" like that almost invariably don't understand the basic economics of immigration. And *real* wage growth peaked in 2015-16, when EU migration was at highest ever level. Do a little homework.
## 2                                 @nilayspatelmd @jholtwriter @rleskew @WhiteHouse @realDonaldTrump You are the problem.\nWanting to stop illegal immigration and trafficking of people and drugs is not a racist or white nationalist viewpoint! The fact that you call it that is the real issue!!
## 3                                                                                                                                                                                                 @jessphillips @SoniaGallegoAJE And Labour's policy on immigration? EU citizens and Brexit? Please!
## 4                                                                                                                                                 @u2rshite @ReubenH @moas_eu @RevRichardColes @EvaShamouel @sobanoodle @seawatch_intl This is about police funding.  Nothing to do with immigration
## 5                                                                           @Kevin_Maguire He was caught on camera shouting pro Isis slogans and Allahu Akbar, so if he's also a Muslim immigrant or son of one it's of no consequence if he's declared mad or not, he's still an Islamic terrorist.
## 6                                                                                                                                       @petercwest @JuliaHB1 Under the UN Migration Compact it will become illegal to use the term illegal  for immigrants. They will be called irregular migrants.
##                                                                                                                               word_scores
## 1 {0, 0, 0, 0, 0.1, 0, 0, 0, 0, 0, 0, -0.5, 1.5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
## 2       {0, 0, 0, 0, 0, 0, 0, 0, -1.7, 0, 0, -1.2, -2.6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2.22, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
## 3                                                                                                  {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.3}
## 4                                                                                     {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
## 5          {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1.2, 0, 0, 0, 0, -2.2, 0, 0, 0, 0, 0, 0, -3.7}
## 6                                                          {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -2.6, 0, 0, 0, 0, -2.6, 0, 0, 0, 0, 0, 0, 0, 0}
##   compound   pos   neu   neg but_count
## 1    0.572 0.136 0.832 0.033         0
## 2   -0.732 0.068 0.735 0.197         0
## 3    0.450 0.212 0.788 0.000         0
## 4    0.000 0.000 1.000 0.000         0
## 5   -0.878 0.000 0.781 0.219         0
## 6   -0.802 0.000 0.753 0.247         0
str(tweet)
## 'data.frame':    3919 obs. of  17 variables:
##  $ tweet_id       : num  1.08e+18 1.08e+18 1.08e+18 1.08e+18 1.08e+18 ...
##  $ created_at     : chr  "2019-01-01T23:57:21Z" "2019-01-01T23:24:20Z" "2019-01-01T23:03:35Z" "2019-01-01T23:02:35Z" ...
##  $ place_name     : chr  "Islington" "Haslingden" "Scotland" "Loughborough" ...
##  $ full_place_name: chr  "Islington, London" "Haslingden, England" "Scotland, United Kingdom" "Loughborough, England" ...
##  $ long           : num  -0.109 -2.326 -4.2 -1.224 -1.336 ...
##  $ lat            : num  51.5 53.7 57.7 52.8 50.9 ...
##  $ exact_coords   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ place_type     : chr  "city" "city" "admin" "city" ...
##  $ country_code   : chr  "GB" "GB" "GB" "GB" ...
##  $ username       : chr  "jdportes" "stevegtweets" "verafinlayson" "luffdee" ...
##  $ text           : chr  "@ArronDavid12 @AnitaBellows12 @SwotTyler @DrLeeJones Sigh. People who talk about \"supply and demand\" like tha"| __truncated__ "@nilayspatelmd @jholtwriter @rleskew @WhiteHouse @realDonaldTrump You are the problem.\nWanting to stop illegal"| __truncated__ "@jessphillips @SoniaGallegoAJE And Labour's policy on immigration? EU citizens and Brexit? Please!" "@u2rshite @ReubenH @moas_eu @RevRichardColes @EvaShamouel @sobanoodle @seawatch_intl This is about police fundi"| __truncated__ ...
##  $ word_scores    : chr  "{0, 0, 0, 0, 0.1, 0, 0, 0, 0, 0, 0, -0.5, 1.5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.6, 0, 0, 0, 0, 0, 0, 0,"| __truncated__ "{0, 0, 0, 0, 0, 0, 0, 0, -1.7, 0, 0, -1.2, -2.6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2.22, 0, 0, 0, 0, 0, 0, 0, 0, 0,"| __truncated__ "{0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.3}" "{0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}" ...
##  $ compound       : num  0.572 -0.732 0.45 0 -0.878 -0.802 -0.762 0.262 0.44 -0.34 ...
##  $ pos            : num  0.136 0.068 0.212 0 0 0 0 0.257 0.225 0 ...
##  $ neu            : num  0.832 0.735 0.788 1 0.781 0.753 0.743 0.522 0.775 0.821 ...
##  $ neg            : num  0.033 0.197 0 0 0.219 0.247 0.257 0.221 0 0.179 ...
##  $ but_count      : int  0 0 0 0 0 0 0 1 0 0 ...

This data is a data frame of tweets with 3919 observations and 17 variables. Each variable provides specific details about the tweets, including the tweet ID and timestamp, the place name and its geolocated coordinates, the place type and country code, the username and tweet text, and the VADER sentiment outputs (per-word scores, compound score, positive/neutral/negative proportions, and a "but" count).
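
For reference, the sentiment columns (word_scores, compound, pos, neu, neg, but_count) mirror the output of the VADER algorithm. Below is a minimal, illustrative sketch of how a single tweet could be scored, assuming the vader R package (the scores in this data set were supplied with it, so this is not the exact pipeline used):

# illustrative only: score one tweet with VADER (assumes the `vader` package)
library(vader)
get_vader("This is about police funding. Nothing to do with immigration")
# returns a named vector with word_scores, compound, pos, neu, neg and but_count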

# read shapefile
UK_shp <- st_read("Local_Authority_Districts_(May_2021)_UK_BFE_V3/LAD_MAY_2021_UK_BFE_V2.shp")
## Reading layer `LAD_MAY_2021_UK_BFE_V2' from data source 
##   `C:\E\4th_Semester\Spatial_eco\Project\Spatial_Eco\data\Local_Authority_Districts_(May_2021)_UK_BFE_V3\LAD_MAY_2021_UK_BFE_V2.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 374 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -116.1928 ymin: 5333.81 xmax: 655989 ymax: 1220310
## Projected CRS: OSGB36 / British National Grid
head(UK_shp)
## Simple feature collection with 6 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 344666.1 ymin: 378867 xmax: 478453.7 ymax: 537152
## Projected CRS: OSGB36 / British National Grid
##   OBJECTID   LAD21CD              LAD21NM  BNG_E  BNG_N     LONG      LAT
## 1        1 E06000001           Hartlepool 447160 531474 -1.27018 54.67614
## 2        2 E06000002        Middlesbrough 451141 516887 -1.21099 54.54467
## 3        3 E06000003 Redcar and Cleveland 464361 519597 -1.00608 54.56752
## 4        4 E06000004     Stockton-on-Tees 444940 518183 -1.30664 54.55691
## 5        5 E06000005           Darlington 428029 515648 -1.56835 54.53534
## 6        6 E06000006               Halton 354246 382146 -2.68853 53.33424
##   SHAPE_Leng SHAPE_Area                       geometry
## 1   66110.01   98351073 MULTIPOLYGON (((447213.9 53...
## 2   41055.79   54553586 MULTIPOLYGON (((448489.9 52...
## 3  105292.10  253785360 MULTIPOLYGON (((455525.9 52...
## 4  108085.19  209730809 MULTIPOLYGON (((444157 5279...
## 5  107203.15  197477768 MULTIPOLYGON (((423496.6 52...
## 6   60716.84   90321522 MULTIPOLYGON (((351539.9 38...

The UK_shp data is a simple feature collection of geographic areas, specifically local authority districts (LADs) in the United Kingdom. It contains 374 features with 9 fields describing each district; head() shows the first six.

# Read unemployment data
unemployment_rate <- read.csv("unemployment-region.csv")
head(unemployment_rate)
##    Quarter.ending Apr.09 Apr.10 Apr.11 May.11 Jun.11 Jul.11 Aug.11 Sep.11
## 1 Numbers ('000s)                                                        
## 2          London    352    389    403    421    429    436    447    432
## 3  United Kingdom  2,296  2,510  2,462  2,500  2,540  2,556  2,612  2,664
## 4         England  1,951  2,106  2,072  2,111  2,135  2,162  2,201  2,248
## 5           Wales    112    125    116    117    125    125    133    139
## 6        Scotland    182    219    211    210    215    206    213    216
##   Oct.11 Nov.11 Dec.11 Jan.12 Feb.12 Mar.12 Apr.12 May.12 Jun.12 Jul.12 Aug.12
## 1                                                                             
## 2    440    443    443    442    441    432    423    395    390    402    406
## 3  2,680  2,708  2,684  2,670  2,653  2,633  2,624  2,605  2,582  2,607  2,553
## 4  2,248  2,284  2,257  2,245  2,237  2,213  2,206  2,192  2,170  2,179  2,135
## 5    137    133    134    134    133    136    132    134    127    134    124
## 6    235    234    233    236    224    225    225    218    218    223    223
##   Sep.12 Oct.12 Nov.12 Dec.12 Jan.13 Feb.13 Mar.13 Apr.13 May.13 Jun.13 Jul.13
## 1                                                                             
## 2    397    403    382    388    387    412    394    391    394    400    385
## 3  2,542  2,539  2,526  2,529  2,533  2,582  2,541  2,527  2,524  2,527  2,506
## 4  2,133  2,145  2,125  2,129  2,133  2,189  2,145  2,138  2,129  2,141  2,122
## 5    123    120    125    128    126    123    124    126    123    122    118
## 6    220    207    208    205    202    200    202    196    205    200    206
##   Aug.13 Sep.13 Oct.13 Nov.13 Dec.13 Jan.14 Feb.14 Mar.14 Apr.14 May.14 Jun.14
## 1                                                                             
## 2    398    404    396    379    372    378    366    357    348    344    334
## 3  2,510  2,488  2,412  2,332  2,348  2,335  2,254  2,212  2,162  2,126  2,074
## 4  2,121  2,105  2,035  1,982  1,982  1,980  1,902  1,870  1,821  1,779  1,743
## 5    121    118    113    109    106    100    103    101     98     97     98
## 6    204    202    200    179    197    190    181    179    183    191    174
##   Jul.14 Aug.14 Sep.14 Oct.14 Nov.14 Dec.14 Jan.15 Feb.15 Mar.15 Apr.15 May.15
## 1                                                                             
## 2    318    301    287    296    298    295    283    287    285    286    308
## 3  2,021  1,972  1,959  1,958  1,914  1,862  1,856  1,838  1,827  1,813  1,853
## 4  1,699  1,673  1,645  1,642  1,602  1,564  1,550  1,525  1,504  1,501  1,546
## 5     97     94     98    105    103     99     92     92     99     95    100
## 6    167    151    164    156    158    149    162    167    168    163    152
##    X X.1 X.2 X.3 X.4 X.5
## 1 NA  NA  NA  NA  NA  NA
## 2 NA  NA  NA  NA  NA  NA
## 3 NA  NA  NA  NA  NA  NA
## 4 NA  NA  NA  NA  NA  NA
## 5 NA  NA  NA  NA  NA  NA
## 6 NA  NA  NA  NA  NA  NA

The unemployment data set contains regional unemployment figures for the UK, organized by region and quarter (from 2009 to 2015). The first block of rows holds the numbers of unemployed (in thousands); the unemployment rates, which we will use for 2014 and 2015, appear further down the file.

Data Preprocessing

The tweet dataset was processed to extract regional information and convert it into a spatial format suitable for further geographical analysis. Initially, the region was extracted from the full_place_name column by isolating the last part of each observation, which typically contains the region name. This was achieved by using string manipulation functions to extract and clean the region names. Next, the dataset was transformed into a spatial points (sp) object by defining the longitude and latitude columns as spatial coordinates. The coordinate reference system was set to “EPSG:4326” to ensure proper geographical referencing. Subsequently, the sp object was converted into a simple features (sf) object, allowing for advanced spatial operations and visualizations.

# Extract the last part of each observation in the full_place_name column
tweet <- tweet %>%
  mutate(region = str_extract(full_place_name, ",\\s*[^,]+$") %>% str_replace_all("^,\\s*", ""))


# changing point data into sp class 
tweet.sp<-tweet     # copy of the data frame (contents unchanged)
coordinates(tweet.sp)<-c("long", "lat") # change into sp class
proj4string(tweet.sp)<-as(st_crs("EPSG:4326"), "CRS")

# conversion from sp to sf
tweet_coord<-st_as_sf(tweet.sp)
# View the updated dataset
head(tweet_coord)
## Simple feature collection with 6 features and 16 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -4.200441 ymin: 50.92779 xmax: 0.7212505 ymax: 57.73945
## Geodetic CRS:  WGS 84
##       tweet_id           created_at      place_name          full_place_name
## 1 1.080252e+18 2019-01-01T23:57:21Z       Islington        Islington, London
## 2 1.080243e+18 2019-01-01T23:24:20Z      Haslingden      Haslingden, England
## 3 1.080238e+18 2019-01-01T23:03:35Z        Scotland Scotland, United Kingdom
## 4 1.080238e+18 2019-01-01T23:02:35Z    Loughborough    Loughborough, England
## 5 1.080231e+18 2019-01-01T22:34:35Z        West End        West End, England
## 6 1.080227e+18 2019-01-01T22:19:47Z Southend-on-Sea    Southend-on-Sea, East
##   exact_coords place_type country_code      username
## 1        FALSE       city           GB      jdportes
## 2        FALSE       city           GB  stevegtweets
## 3        FALSE      admin           GB verafinlayson
## 4        FALSE       city           GB       luffdee
## 5        FALSE       city           GB Andrews47Andy
## 6        FALSE       city           GB    1940MadMag
##                                                                                                                                                                                                                                                                                                 text
## 1 @ArronDavid12 @AnitaBellows12 @SwotTyler @DrLeeJones Sigh. People who talk about "supply and demand" like that almost invariably don't understand the basic economics of immigration. And *real* wage growth peaked in 2015-16, when EU migration was at highest ever level. Do a little homework.
## 2                                 @nilayspatelmd @jholtwriter @rleskew @WhiteHouse @realDonaldTrump You are the problem.\nWanting to stop illegal immigration and trafficking of people and drugs is not a racist or white nationalist viewpoint! The fact that you call it that is the real issue!!
## 3                                                                                                                                                                                                 @jessphillips @SoniaGallegoAJE And Labour's policy on immigration? EU citizens and Brexit? Please!
## 4                                                                                                                                                 @u2rshite @ReubenH @moas_eu @RevRichardColes @EvaShamouel @sobanoodle @seawatch_intl This is about police funding.  Nothing to do with immigration
## 5                                                                           @Kevin_Maguire He was caught on camera shouting pro Isis slogans and Allahu Akbar, so if he's also a Muslim immigrant or son of one it's of no consequence if he's declared mad or not, he's still an Islamic terrorist.
## 6                                                                                                                                       @petercwest @JuliaHB1 Under the UN Migration Compact it will become illegal to use the term illegal  for immigrants. They will be called irregular migrants.
##                                                                                                                               word_scores
## 1 {0, 0, 0, 0, 0.1, 0, 0, 0, 0, 0, 0, -0.5, 1.5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
## 2       {0, 0, 0, 0, 0, 0, 0, 0, -1.7, 0, 0, -1.2, -2.6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2.22, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
## 3                                                                                                  {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.3}
## 4                                                                                     {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
## 5          {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1.2, 0, 0, 0, 0, -2.2, 0, 0, 0, 0, 0, 0, -3.7}
## 6                                                          {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -2.6, 0, 0, 0, 0, -2.6, 0, 0, 0, 0, 0, 0, 0, 0}
##   compound   pos   neu   neg but_count         region
## 1    0.572 0.136 0.832 0.033         0         London
## 2   -0.732 0.068 0.735 0.197         0        England
## 3    0.450 0.212 0.788 0.000         0 United Kingdom
## 4    0.000 0.000 1.000 0.000         0        England
## 5   -0.878 0.000 0.781 0.219         0        England
## 6   -0.802 0.000 0.753 0.247         0           East
##                      geometry
## 1 POINT (-0.1091814 51.54693)
## 2  POINT (-2.325504 53.69489)
## 3  POINT (-4.200441 57.73945)
## 4  POINT (-1.223952 52.76671)
## 5  POINT (-1.335718 50.92779)
## 6  POINT (0.7212505 51.54944)

The unemployment dataset was cleaned and prepared for merging with the tweet data. First, the initial 17 rows, which contain the counts ('000s) block, were removed so that only the rates remain. The column Quarter.ending was then renamed to Region for clarity, and the last 6 columns, which are empty, were dropped. The January 2014 and January 2015 rates were selected and merged with the tweet_coord dataset on the region column, aligning the unemployment rates with the corresponding tweet regions. Tweets whose region (e.g. "East") has no match in the unemployment table receive NA values; the rate columns are then converted to numeric, and the NA rows are removed later, before modelling.

# In the unemployment dataset: remove the first 17 rows and rename the column
unemployment_rate_cleaned <- unemployment_rate %>%
  slice(-(1:17)) %>%       # Remove the first 17 rows
  rename(Region = Quarter.ending)  # Rename the column

# Remove the last 6 columns
n <- ncol(unemployment_rate_cleaned)

unemployment_rate_cleaned <- unemployment_rate_cleaned %>%
  select(1:(n-6))

# Select the columns of interest
unemployment_selected <- unemployment_rate_cleaned %>%
  select(Region, Jan.14, Jan.15)

# View the cleaned data
head(unemployment_selected)
##             Region Jan.14 Jan.15
## 1           London    8.3    6.2
## 2   United Kingdom    7.2    5.7
## 3          England    7.2    5.6
## 4            Wales    6.7    6.2
## 5         Scotland    6.9    5.9
## 6 Northern Ireland    7.5    6.0
# Merge with tweet_coord on region
tweet_coord <- tweet_coord %>%
  left_join(unemployment_selected, by = c("region" = "Region"))

head(tweet_coord)
## Simple feature collection with 6 features and 18 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -4.200441 ymin: 50.92779 xmax: 0.7212505 ymax: 57.73945
## Geodetic CRS:  WGS 84
##       tweet_id           created_at      place_name          full_place_name
## 1 1.080252e+18 2019-01-01T23:57:21Z       Islington        Islington, London
## 2 1.080243e+18 2019-01-01T23:24:20Z      Haslingden      Haslingden, England
## 3 1.080238e+18 2019-01-01T23:03:35Z        Scotland Scotland, United Kingdom
## 4 1.080238e+18 2019-01-01T23:02:35Z    Loughborough    Loughborough, England
## 5 1.080231e+18 2019-01-01T22:34:35Z        West End        West End, England
## 6 1.080227e+18 2019-01-01T22:19:47Z Southend-on-Sea    Southend-on-Sea, East
##   exact_coords place_type country_code      username
## 1        FALSE       city           GB      jdportes
## 2        FALSE       city           GB  stevegtweets
## 3        FALSE      admin           GB verafinlayson
## 4        FALSE       city           GB       luffdee
## 5        FALSE       city           GB Andrews47Andy
## 6        FALSE       city           GB    1940MadMag
##                                                                                                                                                                                                                                                                                                 text
## 1 @ArronDavid12 @AnitaBellows12 @SwotTyler @DrLeeJones Sigh. People who talk about "supply and demand" like that almost invariably don't understand the basic economics of immigration. And *real* wage growth peaked in 2015-16, when EU migration was at highest ever level. Do a little homework.
## 2                                 @nilayspatelmd @jholtwriter @rleskew @WhiteHouse @realDonaldTrump You are the problem.\nWanting to stop illegal immigration and trafficking of people and drugs is not a racist or white nationalist viewpoint! The fact that you call it that is the real issue!!
## 3                                                                                                                                                                                                 @jessphillips @SoniaGallegoAJE And Labour's policy on immigration? EU citizens and Brexit? Please!
## 4                                                                                                                                                 @u2rshite @ReubenH @moas_eu @RevRichardColes @EvaShamouel @sobanoodle @seawatch_intl This is about police funding.  Nothing to do with immigration
## 5                                                                           @Kevin_Maguire He was caught on camera shouting pro Isis slogans and Allahu Akbar, so if he's also a Muslim immigrant or son of one it's of no consequence if he's declared mad or not, he's still an Islamic terrorist.
## 6                                                                                                                                       @petercwest @JuliaHB1 Under the UN Migration Compact it will become illegal to use the term illegal  for immigrants. They will be called irregular migrants.
##                                                                                                                               word_scores
## 1 {0, 0, 0, 0, 0.1, 0, 0, 0, 0, 0, 0, -0.5, 1.5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
## 2       {0, 0, 0, 0, 0, 0, 0, 0, -1.7, 0, 0, -1.2, -2.6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2.22, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
## 3                                                                                                  {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.3}
## 4                                                                                     {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
## 5          {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1.2, 0, 0, 0, 0, -2.2, 0, 0, 0, 0, 0, 0, -3.7}
## 6                                                          {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -2.6, 0, 0, 0, 0, -2.6, 0, 0, 0, 0, 0, 0, 0, 0}
##   compound   pos   neu   neg but_count         region Jan.14 Jan.15
## 1    0.572 0.136 0.832 0.033         0         London    8.3    6.2
## 2   -0.732 0.068 0.735 0.197         0        England    7.2    5.6
## 3    0.450 0.212 0.788 0.000         0 United Kingdom    7.2    5.7
## 4    0.000 0.000 1.000 0.000         0        England    7.2    5.6
## 5   -0.878 0.000 0.781 0.219         0        England    7.2    5.6
## 6   -0.802 0.000 0.753 0.247         0           East   <NA>   <NA>
##                      geometry
## 1 POINT (-0.1091814 51.54693)
## 2  POINT (-2.325504 53.69489)
## 3  POINT (-4.200441 57.73945)
## 4  POINT (-1.223952 52.76671)
## 5  POINT (-1.335718 50.92779)
## 6  POINT (0.7212505 51.54944)
# convert the joined unemployment rate columns to numeric
tweet_coord$Jan.14 = as.numeric(tweet_coord$Jan.14)
tweet_coord$Jan.15 = as.numeric(tweet_coord$Jan.15)

The shapefile of the UK boundaries was processed to simplify the geometries, followed by conversions between spatial formats. First, the boundaries were simplified with a tolerance of 1 km while preserving topology; this reduces the computational load for further processing and visualization. The simplified shapefile was then converted from a simple features (sf) object to a Spatial object (sp class) using the local authority district codes (LAD21CD) as identifiers, coordinates were extracted from this Spatial object, and the result was converted back to an sf object for plotting.

### process shape file

# simplify boundaries
UK_shp_simple <- st_simplify(UK_shp, 
                             preserveTopology =T,
                             dTolerance = 1000) # 1km


UK_shp_simple.sp<-as_Spatial(UK_shp_simple, cast=TRUE, IDs="LAD21CD")
crds.UK_shp_simple<-coordinates(UK_shp_simple.sp)
UK_shp_simple.sf<-st_as_sf(UK_shp_simple.sp)

Exploratory Data Analysis (EDA)

In this project, the next critical step is performing EDA to understand the relationships, distributions, and patterns within the datasets. EDA will help uncover insights and inform further analysis and visualization strategies.

Now, we will plot the tweets according to their regions using geographical visualizations. This will help us understand the spatial distribution of tweets and identify any regional patterns or trends.

# Plot tweets according to the region

# Summarize the number of tweets per place_name
points_summary <- tweet_coord %>%
  group_by(place_name) %>%
  summarize(
    geometry = st_union(geometry),
    tweet_count = n(),
    .groups = 'drop'
  )

# Plot the map with point data
ggplot() +
  geom_sf(data = UK_shp_simple.sf, fill = "lightgray") +
  geom_sf(data = points_summary, aes(color = tweet_count)) +
  labs(title = "Spatial Distribution of Tweets by Region", x = "Longitude", y = "Latitude") +
  theme_minimal() +
  theme(legend.position = "right")

There are significant clusters of tweets in major cities and urban areas, particularly in London, the Midlands, and parts of Northern England. This indicates higher social media activity in these regions. Rural and less populated areas, such as parts of Scotland, Wales, and South West England, show fewer tweets, as indicated by the sparse and lighter colored points. The tweets cover the entire UK, showing that social media activity is widespread, although concentrated in urban centers.

To visualize the sentiment of tweets across different regions in the UK, we use a geographic plot of the compound score. This allows us to see how sentiments vary spatially, providing insights into regional mood and opinions.

# plot map with compound gradient
ggplot() +
  geom_sf(data = UK_shp_simple.sf, fill = "lightgray") +
  geom_sf(data = tweet_coord, aes(color = compound)) +
  labs(title = "Regional compound analysis", x = "Longitude", y = "Latitude") +
  theme_minimal()

The map reveals that sentiment expressed in tweets varies across different regions of the UK, without a clear regional bias toward positivity or negativity. Urban areas have a higher density of tweets, indicating higher social media activity, which may lead to a more accurate representation of sentiment in those areas. Rural areas, while less dense in tweet activity, still provide valuable sentiment data.

To further understand the sentiment of tweets across different regions in the UK, additional maps were created focusing on negative and positive sentiment components separately.

# Plot the map with negative compound
ggplot() +
  geom_sf(data = UK_shp_simple.sf, fill = "lightgray") +
  geom_sf(data = tweet_coord, aes(color = neg)) +
  scale_color_gradient(name = "Negative Score", low = "red", high = "blue") +
  labs(title = "Regional Negative Compound Analysis", x = "Longitude", y = "Latitude") +
  theme_minimal() +
  theme(legend.position = "right")

# Plot the map with positive compound
ggplot() +
  geom_sf(data = UK_shp_simple.sf, fill = "lightgray") +
  geom_sf(data = tweet_coord, aes(color = pos)) +
  scale_color_gradient(name = "Positive Score", low = "orange", high = "blue") +
  labs(title = "Regional Positive Compound Analysis", x = "Longitude", y = "Latitude") +
  theme_minimal() +
  theme(legend.position = "right")

The first map visualizes the negative sentiment scores of tweets across the UK. This map reveals that negative sentiments are distributed across various regions, with noticeable clusters in urban areas such as London, the Midlands, and parts of Northern England. The second map focuses on the positive sentiment scores of tweets. The map shows that positive sentiments are also widely distributed across the UK, with significant clusters in urban areas. Similar to the negative sentiment map, rural areas have fewer tweets, but they still contribute to the overall positive sentiment landscape.

The next plot visualizes the regional unemployment rates (the Jan.15 column) across the UK.

ggplot() +
  geom_sf(data = UK_shp_simple.sf, fill = "lightgray") +
  geom_sf(data = tweet_coord, aes(color = Jan.15)) +
  scale_color_gradient(name = "Unemployment Rate", low = "red", high = "blue") +
  labs(title = "Regional Unemployment Rate", x = "Longitude", y = "Latitude") +
  theme_minimal() +
  theme(legend.position = "right")

Insights:

  • Clusters of High Unemployment: There are visible clusters of high unemployment in certain urban areas, particularly in the South.
  • Clusters of Low Unemployment: Predominantly in rural areas and some regions in the North and Scotland.

Creating a spatial weight matrix

In this section, several steps were undertaken to prepare the dataset for spatial analysis and to incorporate spatial lag variables. The compound sentiment score was assigned as the dependent variable y, the unemployment rate for 2015 was assigned as the independent variable x1, and the unemployment rate for 2014 was assigned as the temporal lag variable x1.t. This setup allows for the examination of relationships between sentiment and unemployment rates over time.

Buffer polygons were created around the points in the dataset with a specified buffer distance of 0.01 degrees. These buffer polygons represent the spatial area around each point and facilitate the identification of neighboring points. Using these buffer polygons, a neighbor object was created with the poly2nb function, identifying neighboring polygons based on spatial proximity. The resulting neighbor object was converted into a spatial weights list object using nb2listw, which is essential for conducting spatial statistical analyses.

# dependent variable `y` and explanatory variables `x1`, `x1.t`
tweet_coord$y <- tweet_coord$compound  # For example, using compound sentiment score as y
tweet_coord$x1 <- tweet_coord$Jan.15  # unemployment rate in 2015
tweet_coord$x1.t <- tweet_coord$Jan.14  # temporal lag of unemployment rate/ Ur in 2014

# Check for NA values and remove rows with NA
tweet_coord <- na.omit(tweet_coord)
tweet <- na.omit(tweet)

# Coordinates
crds <- st_coordinates(tweet_coord)

Spatial lag

To measure spatial dependence and explore it further, we need to create a spatial lag. A spatial lag is the product of a spatial weight matrix and a given variable. The spatial lag of a variable is the average value of that variable in the neighborhood, i.e. computed over all the areas defined as neighbors; hence, the concept of a spatial lag is inherently tied to the concept of a spatial weight matrix.
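
Formally, with a row-standardized spatial weight matrix $W$, the spatial lag of a variable $y$ at location $i$ is

$$[W y]_i = \sum_{j} w_{ij}\, y_j ,$$

i.e. the weighted average of $y$ over the neighbors of $i$ (non-neighbors have $w_{ij} = 0$).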

A spatial lag of the negative sentiment variable was computed using the lag.listw function. This spatial lag represents the influence of neighboring observations’ negative sentiment on each observation. These steps collectively enhance the dataset’s suitability for spatial analysis and provide insights into spatial dependencies and relationships.

# Create buffer polygons around the points (adjust buffer distance as needed)
buffer_distance <- 0.01  # Adjust buffer distance in desired units (e.g., degrees)
tweet_polygons <- st_buffer(tweet_coord, dist = buffer_distance)

# Now you can use tweet_polygons with poly2nb
nb <- poly2nb(tweet_polygons, queen = TRUE)
listw <- nb2listw(nb, style = "W", zero.policy = TRUE)


# Add spatial lag to your data:
neg_lag <- lag.listw(listw, tweet_coord$neg, zero.policy = TRUE)
tweet_coord$neg_lag <- neg_lag
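
As a quick sanity check (a minimal sketch using the nb and listw objects created above), the value produced by lag.listw can be reproduced by hand for a single observation from its neighbor indices and row-standardized weights:

# sketch: reproduce the spatial lag of `neg` by hand for one observation
i <- which(card(nb) > 0)[1]       # pick an observation that has neighbors
nb_i <- listw$neighbours[[i]]     # indices of its neighbors
w_i  <- listw$weights[[i]]        # their row-standardized weights
sum(w_i * tweet_coord$neg[nb_i])  # should equal neg_lag[i]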

GWR estimation

In this section, the Geographically Weighted Regression (GWR) model was formulated and executed to analyze the spatial relationships within the dataset, specifically incorporating spatial lag.

First, the formula for the GWR model was defined, where y is the dependent variable (compound sentiment score) and x1, x1.t, and neg_lag are the explanatory variables. The model examines how the sentiment score is influenced by the unemployment rate in 2015 (x1), its temporal lag, the unemployment rate in 2014 (x1.t), and the spatial lag of negative sentiment (neg_lag).
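
Formally, GWR estimates a separate set of coefficients at every location $(u_i, v_i)$, so the model fitted below can be written as

$$y_i = \beta_0(u_i, v_i) + \beta_1(u_i, v_i)\,\mathrm{x1}_i + \beta_2(u_i, v_i)\,\mathrm{x1.t}_i + \beta_3(u_i, v_i)\,\mathrm{neg\_lag}_i + \varepsilon_i ,$$

where each coefficient is estimated by weighted least squares, with observation weights that decay with distance from $(u_i, v_i)$ according to the Gaussian kernel and the adaptive bandwidth selected by gwr.sel.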

# Formula for the GWR model including spatial lag
eq <- y ~ x1 + x1.t + neg_lag

# Optimum bandwidth
bw <- gwr.sel(eq, data = tweet_coord, coords = crds, adapt = TRUE)
## Adaptive q: 0.381966 CV score: 1104.416 
## Adaptive q: 0.618034 CV score: 1104.072 
## Adaptive q: 0.763932 CV score: 1104.062 
## Adaptive q: 0.7001615 CV score: 1104.066 
## Adaptive q: 0.854102 CV score: 1104.066 
## Adaptive q: 0.7758565 CV score: 1104.066 
## Adaptive q: 0.7638913 CV score: 1104.062 
## Adaptive q: 0.7684868 CV score: 1104.063 
## Adaptive q: 0.7656278 CV score: 1104.063 
## Adaptive q: 0.7639727 CV score: 1104.062 
## Adaptive q: 0.7640134 CV score: 1104.062 
## Adaptive q: 0.7646301 CV score: 1104.063 
## Adaptive q: 0.764249 CV score: 1104.062 
## Adaptive q: 0.7641034 CV score: 1104.062 
## Adaptive q: 0.764159 CV score: 1104.062 
## Adaptive q: 0.7641997 CV score: 1104.062 
## Adaptive q: 0.764159 CV score: 1104.062
# GWR model
model_gwr <- gwr(eq, data = tweet_coord, coords = crds, adapt = bw)
model_gwr
## Call:
## gwr(formula = eq, data = tweet_coord, coords = crds, adapt = bw)
## Kernel function: gwr.Gauss 
## Adaptive quantile: 0.764159 (about 2886 of 3778 data points)
## Summary of GWR coefficient estimates at data points:
##                    Min.    1st Qu.     Median    3rd Qu.       Max.  Global
## X.Intercept. -0.0634382 -0.0629331 -0.0587078 -0.0545579 -0.0470006 -0.0521
## x1            0.0022720  0.0046497  0.0108676  0.0132474  0.0165642  0.0058
## x1.t          0.0038872  0.0060322  0.0072692  0.0108965  0.0125207  0.0105
## neg_lag      -0.6418354 -0.5711451 -0.5604900 -0.5339049 -0.4597040 -0.5810

The Geographically Weighted Regression (GWR) model was successfully fitted to the data. The Gaussian kernel function was used, with an adaptive bandwidth quantile of approximately 0.764, meaning the bandwidth adapts to cover around 2886 of the 3778 data points.

The summary of the GWR coefficient estimates at data points provides the minimum, first quartile (1st Qu.), median, third quartile (3rd Qu.), and maximum values of the coefficient estimates, along with the global coefficients, which represent the average effects across all locations.

The global coefficient for x1 is 0.0058, suggesting a positive relationship between the unemployment rate in 2015 and the compound sentiment score: higher unemployment rates are associated with higher (less negative) sentiment scores.

The global coefficient for x1.t is 0.0105, indicating a positive relationship between the unemployment rate in 2014 and the compound sentiment score: higher past unemployment rates are associated with higher current sentiment scores.

The global coefficient for neg_lag is -0.5810, showing a strong negative relationship between the spatial lag of negative sentiment and the compound sentiment score: higher negative sentiment in neighboring areas is associated with lower (more negative) sentiment scores.

The inclusion of spatial lag helps capture the influence of neighboring areas, while the unemployment rates help understand the temporal dynamics. The results show significant spatial dependence and provide a nuanced understanding of the factors influencing sentiment across the UK regions.

# Visualize the GWR coefficients for neg_lag
tweet_coord$GWR_neg_lag <- as.numeric(model_gwr$SDF$neg_lag)
# Create the base map
p <- ggplot(data = UK_shp_simple) + 
  geom_sf(color = "gray60", size = 0.1) +
  theme_void()


# Add the points with GWR coefficients for neg_lag
p + geom_sf(data = tweet_coord, aes(color = GWR_neg_lag), size = 2) +
  scale_color_viridis_c(option = "C") +
  labs(color = "GWR Coefficient for neg_lag") +
  theme_minimal() +
  ggtitle("GWR Coefficient for neg_lag")

Insights

The plot effectively illustrates the spatial variability of the impact of neighboring negative sentiment on sentiment scores across the UK. The color gradient highlights regions where negative sentiment from neighbors has a stronger or weaker impact.

  • Stronger Negative Impact: Clusters of more negative coefficients (purple) are seen in Southern England, suggesting higher sensitivity to neighboring negative sentiments in these areas.

  • Weaker Negative Impact: Clusters of less negative coefficients (yellow to orange) are observed in Northern England and Scotland, indicating these regions are less affected by neighboring negative sentiments.

Clustering of GWR coefficients

# Extract the GWR coefficients
gwr_coefficients <- as.data.frame(model_gwr$SDF)

# Select relevant columns (x1, x1.t, neg_lag)
gwr_data <- gwr_coefficients %>%
  select(x1, x1.t, neg_lag)

# Calculate the optimal number of clusters
fviz_nbclust(gwr_data, kmeans)

The silhouette analysis plot indicates that the optimal number of clusters for this dataset is three. The average silhouette width is highest when the number of clusters (k) is three, suggesting that this number provides the best separation of the data into distinct groups.

# Perform clustering with eclust
set.seed(123)  # For reproducibility
klastry2 <- eclust(gwr_data, "kmeans", k = 3)

# Assign clusters to spatial data frame
tweet_coord$clust5 <- klastry2$cluster

# Visualize clusters on the map
ggplot(data = UK_shp) + 
  geom_sf(color = "gray60", size = 0.1) +
  geom_sf(data = tweet_coord, aes(color = as.factor(clust5)), size = 2) +
  scale_color_viridis_d(option = "C") +
  labs(color = "Cluster") +
  theme_minimal() +
  ggtitle("Clustering of GWR Coefficients")

The clustering analysis reveals significant regional variations in the relationships modeled by the GWR between compound sentiment scores and unemployment rates. These clusters highlight areas with similar underlying patterns in their data, suggesting that local factors influence the relationships between sentiment and unemployment. Regionally, we observe that the southern UK, including the Midlands and South West, tends to fall into Cluster 1, suggesting these areas have a specific dynamic between sentiment and unemployment rates. Central England forms Cluster 2, indicating a different pattern, while the northern regions, including Scotland and Wales, are primarily in Cluster 3, suggesting yet another unique relationship. These regional patterns reveal how the unemployment rate and neighboring negative sentiment may influence sentiment differently across regions.

Spatial Autocorrelation

A Moran plot (also known as a Moran scatter plot) is a graphical representation used to visualize and assess the spatial autocorrelation of a variable. It plots the values of a variable against the spatial lag of that variable, helping to identify the nature and strength of spatial autocorrelation.

# Moran Plot
ggplot(tweet_coord, aes(x = neg, y = neg_lag)) + 
  geom_point() + 
  geom_smooth(method = "lm") +
  ylab("Negative sentiment lag") + 
  xlab("Negative sentiment") +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

tweet_coord <- tweet_coord %>% 
  mutate(
    st_neg = ( neg - mean(neg)) / sd(neg),
    st_neg_lag = ( neg_lag - mean(neg_lag)) / sd(neg_lag)
  )

ggplot(tweet_coord, aes(x = st_neg, y = st_neg_lag)) + 
  geom_point() + 
  geom_smooth(method = "lm") +
  geom_hline(yintercept = 0, color = "grey", alpha =.5) +
  geom_vline(xintercept = 0, color = "grey", alpha =.5) +
  ylab("Negative sentiment lag \n (standardised)") + 
  xlab("Negative sentiment \n (standardised)") +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

Quadrants Interpretation:
  • High-High (HH): Upper-right quadrant, indicating locations where high negative sentiment values are surrounded by high negative sentiment values in neighboring areas.
  • Low-Low (LL): Lower-left quadrant, indicating locations where low negative sentiment values are surrounded by low negative sentiment values in neighboring areas.
  • High-Low (HL): Lower-right quadrant, indicating locations where high negative sentiment values are surrounded by low negative sentiment values in neighboring areas.
  • Low-High (LH): Upper-left quadrant, indicating locations where low negative sentiment values are surrounded by high negative sentiment values in neighboring areas.
Insights:
  • The slight upward slope of the trend line in both plots suggests a weak positive spatial autocorrelation.
  • This indicates that locations with high (or low) negative sentiment tend to have neighboring areas with similarly high (or low) negative sentiment, albeit the relationship is not very strong.

Moran’s I

To measure global spatial autocorrelation, we can use Moran's I. Moran's I and the Moran Plot are intrinsically related: the value of Moran's I corresponds to the slope of the linear fit on the Moran Plot. We can compute it by running:

moran.test(tweet_coord$neg, listw = listw, zero.policy = TRUE, na.action = na.omit)
## 
##  Moran I test under randomisation
## 
## data:  tweet_coord$neg  
## weights: listw  
## n reduced by no-neighbour observations  
## 
## Moran I statistic standard deviate = 3.6629, p-value = 0.0001247
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.0366234370     -0.0002995806      0.0001016107
Moran I Statistic:
  1. Moran I statistic: 0.0366234370
  • This value indicates the degree of spatial autocorrelation: positive values suggest positive spatial autocorrelation (similar values are clustered), negative values suggest negative spatial autocorrelation (dissimilar values are clustered), and values close to zero suggest no spatial autocorrelation.
  • In this case, the Moran I statistic is positive (0.0366), indicating a slight positive spatial autocorrelation.
  2. Test statistics:
  • The p-value is the probability of observing a Moran's I as extreme as the one calculated under the null hypothesis of no spatial autocorrelation. A very small p-value (typically less than 0.05) indicates strong evidence against the null hypothesis.
  • In this case, the p-value is very small (0.0001247), suggesting that the observed spatial autocorrelation is statistically significant.
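
As a consistency check (a sketch using the neg and neg_lag columns created earlier), the slope of a simple regression of the spatial lag on the variable should be close to the Moran's I statistic above, because the weights are row-standardized; the match is only approximate here, since moran.test drops observations without neighbors:

# sketch: the Moran plot slope approximates Moran's I under row-standardized weights
coef(lm(neg_lag ~ neg, data = st_drop_geometry(tweet_coord)))["neg"]
# expected to be close to the Moran I statistic (about 0.037)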

Conclusion

This project utilized a sample of geolocated tweet data from the UK, coupled with regional unemployment data, to perform spatial analysis and Geographically Weighted Regression (GWR).

Exploratory Data Analysis (EDA) revealed significant clusters of tweets in urban areas, highlighting higher social media activity in cities such as London, the Midlands, and parts of Northern England. Sentiment analysis showed varied regional sentiment, with urban areas having a higher density of tweets, thereby providing a more accurate representation of sentiment in those areas. Rural areas, while having fewer tweets, still contributed valuable sentiment data.

The analysis also visualized unemployment rates across different regions, revealing significant clusters of higher unemployment rates in parts of southern England and Wales. These areas showed a correlation with clusters of negative sentiment tweets.

The GWR model was successfully fitted, revealing how the relationships between sentiment and unemployment rates, as well as the spatial lag of negative sentiment, varied across different geographical locations. The model showed a positive relationship between unemployment rates and sentiment scores and a strong negative relationship between the spatial lag of negative sentiment and the compound sentiment score. Clustering analysis of GWR coefficients further identified regional patterns, with the southern UK, Midlands, and South West forming distinct clusters, central England forming another, and northern regions, including Scotland and Wales, forming yet another. These clusters indicated unique relationships between sentiment and unemployment across different regions.

In conclusion, the project demonstrated significant spatial heterogeneity in sentiment and its relationship with unemployment rates, providing valuable insights into the spatial dynamics of social media activity and socio-economic factors in the UK.