We use a sample of tweets collected from Twitter. Some tweets included an exact location, while the majority provided only a place name, which was geolocated using the bounding box of that place. Sentiment analysis with the VADER algorithm was performed on this data to obtain compound scores. Additionally, we apply Geographically Weighted Regression (GWR) together with a data set of regional unemployment rates.
The goal of this project is to investigate the spatial patterns and relationships between sentiment expressed in tweets and regional unemployment rates across the UK. By applying spatial autocorrelation and Geographically Weighted Regression (GWR) analysis, the project aims to uncover how sentiment varies geographically and how it correlates with economic indicators such as unemployment rate. This analysis will provide insights into the spatial dynamics of social media activity and its potential linkages to socio-economic factors.
First, we load the data sets.
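The code in this report calls functions from several R packages but does not show them being attached. The list below is inferred from the function names used (for example st_read from sf, gwr from spgwr, eclust from factoextra), so it is an assumption rather than the original setup.

```r
# Packages inferred from the functions used in this report (an assumption,
# not taken from the original setup chunk).
pkgs <- c("sf", "sp", "dplyr", "stringr", "ggplot2",
          "spdep", "spgwr", "factoextra")

# Report packages that are not installed instead of failing half-way through.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}

# Attach the packages that are available.
invisible(lapply(pkgs[installed], library, character.only = TRUE))
```

Checking availability first keeps the failure in one place; otherwise a missing package would only surface at the first call that needs it.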
# read twitter data
tweet <- read.csv("uk-sentiment-data.csv")
# preview the first rows
head(tweet)
## tweet_id created_at place_name full_place_name
## 1 1.080252e+18 2019-01-01T23:57:21Z Islington Islington, London
## 2 1.080243e+18 2019-01-01T23:24:20Z Haslingden Haslingden, England
## 3 1.080238e+18 2019-01-01T23:03:35Z Scotland Scotland, United Kingdom
## 4 1.080238e+18 2019-01-01T23:02:35Z Loughborough Loughborough, England
## 5 1.080231e+18 2019-01-01T22:34:35Z West End West End, England
## 6 1.080227e+18 2019-01-01T22:19:47Z Southend-on-Sea Southend-on-Sea, East
## long lat exact_coords place_type country_code username
## 1 -0.1091814 51.54693 FALSE city GB jdportes
## 2 -2.3255037 53.69489 FALSE city GB stevegtweets
## 3 -4.2004410 57.73945 FALSE admin GB verafinlayson
## 4 -1.2239521 52.76671 FALSE city GB luffdee
## 5 -1.3357180 50.92779 FALSE city GB Andrews47Andy
## 6 0.7212505 51.54944 FALSE city GB 1940MadMag
## text
## 1 @ArronDavid12 @AnitaBellows12 @SwotTyler @DrLeeJones Sigh. People who talk about "supply and demand" like that almost invariably don't understand the basic economics of immigration. And *real* wage growth peaked in 2015-16, when EU migration was at highest ever level. Do a little homework.
## 2 @nilayspatelmd @jholtwriter @rleskew @WhiteHouse @realDonaldTrump You are the problem.\nWanting to stop illegal immigration and trafficking of people and drugs is not a racist or white nationalist viewpoint! The fact that you call it that is the real issue!!
## 3 @jessphillips @SoniaGallegoAJE And Labour's policy on immigration? EU citizens and Brexit? Please!
## 4 @u2rshite @ReubenH @moas_eu @RevRichardColes @EvaShamouel @sobanoodle @seawatch_intl This is about police funding. Nothing to do with immigration
## 5 @Kevin_Maguire He was caught on camera shouting pro Isis slogans and Allahu Akbar, so if he's also a Muslim immigrant or son of one it's of no consequence if he's declared mad or not, he's still an Islamic terrorist.
## 6 @petercwest @JuliaHB1 Under the UN Migration Compact it will become illegal to use the term illegal for immigrants. They will be called irregular migrants.
## word_scores
## 1 {0, 0, 0, 0, 0.1, 0, 0, 0, 0, 0, 0, -0.5, 1.5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
## 2 {0, 0, 0, 0, 0, 0, 0, 0, -1.7, 0, 0, -1.2, -2.6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2.22, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
## 3 {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.3}
## 4 {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
## 5 {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1.2, 0, 0, 0, 0, -2.2, 0, 0, 0, 0, 0, 0, -3.7}
## 6 {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -2.6, 0, 0, 0, 0, -2.6, 0, 0, 0, 0, 0, 0, 0, 0}
## compound pos neu neg but_count
## 1 0.572 0.136 0.832 0.033 0
## 2 -0.732 0.068 0.735 0.197 0
## 3 0.450 0.212 0.788 0.000 0
## 4 0.000 0.000 1.000 0.000 0
## 5 -0.878 0.000 0.781 0.219 0
## 6 -0.802 0.000 0.753 0.247 0
str(tweet)
## 'data.frame': 3919 obs. of 17 variables:
## $ tweet_id : num 1.08e+18 1.08e+18 1.08e+18 1.08e+18 1.08e+18 ...
## $ created_at : chr "2019-01-01T23:57:21Z" "2019-01-01T23:24:20Z" "2019-01-01T23:03:35Z" "2019-01-01T23:02:35Z" ...
## $ place_name : chr "Islington" "Haslingden" "Scotland" "Loughborough" ...
## $ full_place_name: chr "Islington, London" "Haslingden, England" "Scotland, United Kingdom" "Loughborough, England" ...
## $ long : num -0.109 -2.326 -4.2 -1.224 -1.336 ...
## $ lat : num 51.5 53.7 57.7 52.8 50.9 ...
## $ exact_coords : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ place_type : chr "city" "city" "admin" "city" ...
## $ country_code : chr "GB" "GB" "GB" "GB" ...
## $ username : chr "jdportes" "stevegtweets" "verafinlayson" "luffdee" ...
## $ text : chr "@ArronDavid12 @AnitaBellows12 @SwotTyler @DrLeeJones Sigh. People who talk about \"supply and demand\" like tha"| __truncated__ "@nilayspatelmd @jholtwriter @rleskew @WhiteHouse @realDonaldTrump You are the problem.\nWanting to stop illegal"| __truncated__ "@jessphillips @SoniaGallegoAJE And Labour's policy on immigration? EU citizens and Brexit? Please!" "@u2rshite @ReubenH @moas_eu @RevRichardColes @EvaShamouel @sobanoodle @seawatch_intl This is about police fundi"| __truncated__ ...
## $ word_scores : chr "{0, 0, 0, 0, 0.1, 0, 0, 0, 0, 0, 0, -0.5, 1.5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.6, 0, 0, 0, 0, 0, 0, 0,"| __truncated__ "{0, 0, 0, 0, 0, 0, 0, 0, -1.7, 0, 0, -1.2, -2.6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2.22, 0, 0, 0, 0, 0, 0, 0, 0, 0,"| __truncated__ "{0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.3}" "{0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}" ...
## $ compound : num 0.572 -0.732 0.45 0 -0.878 -0.802 -0.762 0.262 0.44 -0.34 ...
## $ pos : num 0.136 0.068 0.212 0 0 0 0 0.257 0.225 0 ...
## $ neu : num 0.832 0.735 0.788 1 0.781 0.753 0.743 0.522 0.775 0.821 ...
## $ neg : num 0.033 0.197 0 0 0.219 0.247 0.257 0.221 0 0.179 ...
## $ but_count : int 0 0 0 0 0 0 0 1 0 0 ...
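The tweet data is a data frame with 3919 observations and 17 variables. The variables cover the tweet identifier and timestamp (tweet_id, created_at), the place (place_name, full_place_name, place_type, country_code), the coordinates (long, lat) with a flag for whether they are exact (exact_coords), the author and text (username, text), and the VADER output (word_scores, compound, pos, neu, neg, but_count).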
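To show how the compound column relates to word_scores, the sketch below sums the per-word scores and applies the normalisation used by the reference VADER implementation, x / sqrt(x^2 + 15). The helper names are hypothetical, and the sketch ignores the punctuation emphasis that VADER adds before normalising, so it only reproduces rows without "!" or "?" emphasis.

```r
# Parse a word_scores string such as "{0, 0, -2.6}" into a numeric vector.
parse_word_scores <- function(s) {
  s <- gsub("{", "", s, fixed = TRUE)
  s <- gsub("}", "", s, fixed = TRUE)
  as.numeric(strsplit(s, ",\\s*")[[1]])
}

# VADER-style normalisation of the summed word scores into (-1, 1).
# alpha = 15 is the constant used by the reference VADER implementation.
vader_compound <- function(scores, alpha = 15) {
  total <- sum(scores)
  total / sqrt(total^2 + alpha)
}

# word_scores of tweet 6 in the preview above (reported compound: -0.802)
row6 <- "{0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -2.6, 0, 0, 0, 0, -2.6, 0, 0, 0, 0, 0, 0, 0, 0}"
compound6 <- vader_compound(parse_word_scores(row6))
round(compound6, 3)  # -0.802

# word_scores of tweet 3 (reported compound: 0.450); the tweet contains "?"
# and "!", so the plain sum underestimates the reported value.
row3 <- "{0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.3}"
compound3 <- vader_compound(parse_word_scores(row3))
```

Tweet 6 reproduces the reported score exactly, while tweet 3 comes out lower than 0.450, presumably because of the punctuation emphasis that this sketch leaves out.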
# read shapefile
UK_shp <- st_read("Local_Authority_Districts_(May_2021)_UK_BFE_V3/LAD_MAY_2021_UK_BFE_V2.shp")
## Reading layer `LAD_MAY_2021_UK_BFE_V2' from data source
## `C:\E\4th_Semester\Spatial_eco\Project\Spatial_Eco\data\Local_Authority_Districts_(May_2021)_UK_BFE_V3\LAD_MAY_2021_UK_BFE_V2.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 374 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -116.1928 ymin: 5333.81 xmax: 655989 ymax: 1220310
## Projected CRS: OSGB36 / British National Grid
head(UK_shp)
## Simple feature collection with 6 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 344666.1 ymin: 378867 xmax: 478453.7 ymax: 537152
## Projected CRS: OSGB36 / British National Grid
## OBJECTID LAD21CD LAD21NM BNG_E BNG_N LONG LAT
## 1 1 E06000001 Hartlepool 447160 531474 -1.27018 54.67614
## 2 2 E06000002 Middlesbrough 451141 516887 -1.21099 54.54467
## 3 3 E06000003 Redcar and Cleveland 464361 519597 -1.00608 54.56752
## 4 4 E06000004 Stockton-on-Tees 444940 518183 -1.30664 54.55691
## 5 5 E06000005 Darlington 428029 515648 -1.56835 54.53534
## 6 6 E06000006 Halton 354246 382146 -2.68853 53.33424
## SHAPE_Leng SHAPE_Area geometry
## 1 66110.01 98351073 MULTIPOLYGON (((447213.9 53...
## 2 41055.79 54553586 MULTIPOLYGON (((448489.9 52...
## 3 105292.10 253785360 MULTIPOLYGON (((455525.9 52...
## 4 108085.19 209730809 MULTIPOLYGON (((444157 5279...
## 5 107203.15 197477768 MULTIPOLYGON (((423496.6 52...
## 6 60716.84 90321522 MULTIPOLYGON (((351539.9 38...
The UK_shp object is a simple feature collection of local authority districts (LADs) in the United Kingdom. It contains 374 features (LADs) and 9 fields describing each district; the first 6 features are shown above. The geometries are stored in the projected OSGB36 / British National Grid CRS.
# Read unemployment data
unemployment_rate <- read.csv("unemployment-region.csv")
head(unemployment_rate)
## Quarter.ending Apr.09 Apr.10 Apr.11 May.11 Jun.11 Jul.11 Aug.11 Sep.11
## 1 Numbers ('000s)
## 2 London 352 389 403 421 429 436 447 432
## 3 United Kingdom 2,296 2,510 2,462 2,500 2,540 2,556 2,612 2,664
## 4 England 1,951 2,106 2,072 2,111 2,135 2,162 2,201 2,248
## 5 Wales 112 125 116 117 125 125 133 139
## 6 Scotland 182 219 211 210 215 206 213 216
## Oct.11 Nov.11 Dec.11 Jan.12 Feb.12 Mar.12 Apr.12 May.12 Jun.12 Jul.12 Aug.12
## 1
## 2 440 443 443 442 441 432 423 395 390 402 406
## 3 2,680 2,708 2,684 2,670 2,653 2,633 2,624 2,605 2,582 2,607 2,553
## 4 2,248 2,284 2,257 2,245 2,237 2,213 2,206 2,192 2,170 2,179 2,135
## 5 137 133 134 134 133 136 132 134 127 134 124
## 6 235 234 233 236 224 225 225 218 218 223 223
## Sep.12 Oct.12 Nov.12 Dec.12 Jan.13 Feb.13 Mar.13 Apr.13 May.13 Jun.13 Jul.13
## 1
## 2 397 403 382 388 387 412 394 391 394 400 385
## 3 2,542 2,539 2,526 2,529 2,533 2,582 2,541 2,527 2,524 2,527 2,506
## 4 2,133 2,145 2,125 2,129 2,133 2,189 2,145 2,138 2,129 2,141 2,122
## 5 123 120 125 128 126 123 124 126 123 122 118
## 6 220 207 208 205 202 200 202 196 205 200 206
## Aug.13 Sep.13 Oct.13 Nov.13 Dec.13 Jan.14 Feb.14 Mar.14 Apr.14 May.14 Jun.14
## 1
## 2 398 404 396 379 372 378 366 357 348 344 334
## 3 2,510 2,488 2,412 2,332 2,348 2,335 2,254 2,212 2,162 2,126 2,074
## 4 2,121 2,105 2,035 1,982 1,982 1,980 1,902 1,870 1,821 1,779 1,743
## 5 121 118 113 109 106 100 103 101 98 97 98
## 6 204 202 200 179 197 190 181 179 183 191 174
## Jul.14 Aug.14 Sep.14 Oct.14 Nov.14 Dec.14 Jan.15 Feb.15 Mar.15 Apr.15 May.15
## 1
## 2 318 301 287 296 298 295 283 287 285 286 308
## 3 2,021 1,972 1,959 1,958 1,914 1,862 1,856 1,838 1,827 1,813 1,853
## 4 1,699 1,673 1,645 1,642 1,602 1,564 1,550 1,525 1,504 1,501 1,546
## 5 97 94 98 105 103 99 92 92 99 95 100
## 6 167 151 164 156 158 149 162 167 168 163 152
## X X.1 X.2 X.3 X.4 X.5
## 1 NA NA NA NA NA NA
## 2 NA NA NA NA NA NA
## 3 NA NA NA NA NA NA
## 4 NA NA NA NA NA NA
## 5 NA NA NA NA NA NA
## 6 NA NA NA NA NA NA
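The unemployment data set contains regional unemployment figures for the UK, organized by region and by quarter, with quarters ending between April 2009 and May 2015. The first block of rows gives the number of unemployed people in thousands; the unemployment rates used in this analysis appear further down the table. The trailing columns X to X.5 are empty.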
The tweet dataset was processed to extract regional information and convert it into a spatial format suitable for further geographical analysis. Initially, the region was extracted from the full_place_name column by isolating the last part of each observation, which typically contains the region name. This was achieved by using string manipulation functions to extract and clean the region names. Next, the dataset was transformed into a spatial points (sp) object by defining the longitude and latitude columns as spatial coordinates. The coordinate reference system was set to “EPSG:4326” to ensure proper geographical referencing. Subsequently, the sp object was converted into a simple features (sf) object, allowing for advanced spatial operations and visualizations.
# Extract the last part of each observation in the full_place_name column
tweet <- tweet %>%
mutate(region = str_extract(full_place_name, ",\\s*[^,]+$") %>% str_replace_all("^,\\s*", ""))
# changing point data into sp class
tweet.sp <- tweet # work on a copy so the original data frame stays unchanged
coordinates(tweet.sp)<-c("long", "lat") # change into sp class
proj4string(tweet.sp)<-as(st_crs("EPSG:4326"), "CRS")
# conversion from sp to sf
tweet_coord<-st_as_sf(tweet.sp)
# View the updated dataset
head(tweet_coord)
## Simple feature collection with 6 features and 16 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -4.200441 ymin: 50.92779 xmax: 0.7212505 ymax: 57.73945
## Geodetic CRS: WGS 84
## tweet_id created_at place_name full_place_name
## 1 1.080252e+18 2019-01-01T23:57:21Z Islington Islington, London
## 2 1.080243e+18 2019-01-01T23:24:20Z Haslingden Haslingden, England
## 3 1.080238e+18 2019-01-01T23:03:35Z Scotland Scotland, United Kingdom
## 4 1.080238e+18 2019-01-01T23:02:35Z Loughborough Loughborough, England
## 5 1.080231e+18 2019-01-01T22:34:35Z West End West End, England
## 6 1.080227e+18 2019-01-01T22:19:47Z Southend-on-Sea Southend-on-Sea, East
## exact_coords place_type country_code username
## 1 FALSE city GB jdportes
## 2 FALSE city GB stevegtweets
## 3 FALSE admin GB verafinlayson
## 4 FALSE city GB luffdee
## 5 FALSE city GB Andrews47Andy
## 6 FALSE city GB 1940MadMag
## text
## 1 @ArronDavid12 @AnitaBellows12 @SwotTyler @DrLeeJones Sigh. People who talk about "supply and demand" like that almost invariably don't understand the basic economics of immigration. And *real* wage growth peaked in 2015-16, when EU migration was at highest ever level. Do a little homework.
## 2 @nilayspatelmd @jholtwriter @rleskew @WhiteHouse @realDonaldTrump You are the problem.\nWanting to stop illegal immigration and trafficking of people and drugs is not a racist or white nationalist viewpoint! The fact that you call it that is the real issue!!
## 3 @jessphillips @SoniaGallegoAJE And Labour's policy on immigration? EU citizens and Brexit? Please!
## 4 @u2rshite @ReubenH @moas_eu @RevRichardColes @EvaShamouel @sobanoodle @seawatch_intl This is about police funding. Nothing to do with immigration
## 5 @Kevin_Maguire He was caught on camera shouting pro Isis slogans and Allahu Akbar, so if he's also a Muslim immigrant or son of one it's of no consequence if he's declared mad or not, he's still an Islamic terrorist.
## 6 @petercwest @JuliaHB1 Under the UN Migration Compact it will become illegal to use the term illegal for immigrants. They will be called irregular migrants.
## word_scores
## 1 {0, 0, 0, 0, 0.1, 0, 0, 0, 0, 0, 0, -0.5, 1.5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
## 2 {0, 0, 0, 0, 0, 0, 0, 0, -1.7, 0, 0, -1.2, -2.6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2.22, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
## 3 {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.3}
## 4 {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
## 5 {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1.2, 0, 0, 0, 0, -2.2, 0, 0, 0, 0, 0, 0, -3.7}
## 6 {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -2.6, 0, 0, 0, 0, -2.6, 0, 0, 0, 0, 0, 0, 0, 0}
## compound pos neu neg but_count region
## 1 0.572 0.136 0.832 0.033 0 London
## 2 -0.732 0.068 0.735 0.197 0 England
## 3 0.450 0.212 0.788 0.000 0 United Kingdom
## 4 0.000 0.000 1.000 0.000 0 England
## 5 -0.878 0.000 0.781 0.219 0 England
## 6 -0.802 0.000 0.753 0.247 0 East
## geometry
## 1 POINT (-0.1091814 51.54693)
## 2 POINT (-2.325504 53.69489)
## 3 POINT (-4.200441 57.73945)
## 4 POINT (-1.223952 52.76671)
## 5 POINT (-1.335718 50.92779)
## 6 POINT (0.7212505 51.54944)
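The region extraction above can be checked on a handful of place names. The sketch below is a base R equivalent of the stringr expression (the function name extract_region is hypothetical); like str_extract, it returns NA when a place name contains no comma.

```r
# Keep the text after the last comma; NA if there is no comma at all.
extract_region <- function(x) {
  has_comma <- grepl(",", x, fixed = TRUE)
  out <- rep(NA_character_, length(x))
  # the greedy ".*," removes everything up to and including the last comma
  out[has_comma] <- trimws(sub("^.*,", "", x[has_comma]))
  out
}

places <- c("Islington, London",
            "Scotland, United Kingdom",
            "Southend-on-Sea, East",
            "London")
regions <- extract_region(places)
regions  # "London" "United Kingdom" "East" NA
```

The third and fourth cases matter for the later merge: a truncated label such as "East" or a missing region will not match any row of the unemployment table.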
The unemployment data set was cleaned and prepared for merging with the tweet data. First, the initial 17 rows, which contain header lines and the counts in thousands, were removed so that only the rates remain. The column Quarter.ending was renamed to Region for clarity, and the last 6 columns, which are empty, were dropped. The columns for the quarters ending January 2014 and January 2015 were then joined to tweet_coord on the region name. Region labels with no counterpart in the unemployment table (for example "East" in the output below) receive NA.
# Remove the first 17 rows of the unemployment data and rename the first column
unemployment_rate_cleaned <- unemployment_rate %>%
slice(-(1:17)) %>% # Remove the first 17 rows
rename(Region = Quarter.ending) # Rename the column
# Remove the last 6 columns
n <- ncol(unemployment_rate_cleaned)
unemployment_rate_cleaned <- unemployment_rate_cleaned %>%
select(1:(n-6))
# Select the columns of interest
unemployment_selected <- unemployment_rate_cleaned %>%
select(Region, Jan.14, Jan.15)
# View the cleaned data
head(unemployment_selected)
## Region Jan.14 Jan.15
## 1 London 8.3 6.2
## 2 United Kingdom 7.2 5.7
## 3 England 7.2 5.6
## 4 Wales 6.7 6.2
## 5 Scotland 6.9 5.9
## 6 Northern Ireland 7.5 6.0
# Merge with tweet_coord on region
tweet_coord <- tweet_coord %>%
left_join(unemployment_selected, by = c("region" = "Region"))
head(tweet_coord)
## Simple feature collection with 6 features and 18 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -4.200441 ymin: 50.92779 xmax: 0.7212505 ymax: 57.73945
## Geodetic CRS: WGS 84
## tweet_id created_at place_name full_place_name
## 1 1.080252e+18 2019-01-01T23:57:21Z Islington Islington, London
## 2 1.080243e+18 2019-01-01T23:24:20Z Haslingden Haslingden, England
## 3 1.080238e+18 2019-01-01T23:03:35Z Scotland Scotland, United Kingdom
## 4 1.080238e+18 2019-01-01T23:02:35Z Loughborough Loughborough, England
## 5 1.080231e+18 2019-01-01T22:34:35Z West End West End, England
## 6 1.080227e+18 2019-01-01T22:19:47Z Southend-on-Sea Southend-on-Sea, East
## exact_coords place_type country_code username
## 1 FALSE city GB jdportes
## 2 FALSE city GB stevegtweets
## 3 FALSE admin GB verafinlayson
## 4 FALSE city GB luffdee
## 5 FALSE city GB Andrews47Andy
## 6 FALSE city GB 1940MadMag
## text
## 1 @ArronDavid12 @AnitaBellows12 @SwotTyler @DrLeeJones Sigh. People who talk about "supply and demand" like that almost invariably don't understand the basic economics of immigration. And *real* wage growth peaked in 2015-16, when EU migration was at highest ever level. Do a little homework.
## 2 @nilayspatelmd @jholtwriter @rleskew @WhiteHouse @realDonaldTrump You are the problem.\nWanting to stop illegal immigration and trafficking of people and drugs is not a racist or white nationalist viewpoint! The fact that you call it that is the real issue!!
## 3 @jessphillips @SoniaGallegoAJE And Labour's policy on immigration? EU citizens and Brexit? Please!
## 4 @u2rshite @ReubenH @moas_eu @RevRichardColes @EvaShamouel @sobanoodle @seawatch_intl This is about police funding. Nothing to do with immigration
## 5 @Kevin_Maguire He was caught on camera shouting pro Isis slogans and Allahu Akbar, so if he's also a Muslim immigrant or son of one it's of no consequence if he's declared mad or not, he's still an Islamic terrorist.
## 6 @petercwest @JuliaHB1 Under the UN Migration Compact it will become illegal to use the term illegal for immigrants. They will be called irregular migrants.
## word_scores
## 1 {0, 0, 0, 0, 0.1, 0, 0, 0, 0, 0, 0, -0.5, 1.5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
## 2 {0, 0, 0, 0, 0, 0, 0, 0, -1.7, 0, 0, -1.2, -2.6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2.22, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
## 3 {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.3}
## 4 {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
## 5 {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1.2, 0, 0, 0, 0, -2.2, 0, 0, 0, 0, 0, 0, -3.7}
## 6 {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -2.6, 0, 0, 0, 0, -2.6, 0, 0, 0, 0, 0, 0, 0, 0}
## compound pos neu neg but_count region Jan.14 Jan.15
## 1 0.572 0.136 0.832 0.033 0 London 8.3 6.2
## 2 -0.732 0.068 0.735 0.197 0 England 7.2 5.6
## 3 0.450 0.212 0.788 0.000 0 United Kingdom 7.2 5.7
## 4 0.000 0.000 1.000 0.000 0 England 7.2 5.6
## 5 -0.878 0.000 0.781 0.219 0 England 7.2 5.6
## 6 -0.802 0.000 0.753 0.247 0 East <NA> <NA>
## geometry
## 1 POINT (-0.1091814 51.54693)
## 2 POINT (-2.325504 53.69489)
## 3 POINT (-4.200441 57.73945)
## 4 POINT (-1.223952 52.76671)
## 5 POINT (-1.335718 50.92779)
## 6 POINT (0.7212505 51.54944)
# Convert the rate columns from character to numeric
tweet_coord$Jan.14 <- as.numeric(tweet_coord$Jan.14)
tweet_coord$Jan.15 <- as.numeric(tweet_coord$Jan.15)
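Because the join leaves NA wherever a region label has no match, it can help to list the unmatched labels before modelling. The sketch below does this in base R on four region labels and three rates copied from the output above; the object names tweets_toy and rates_toy are hypothetical.

```r
# Toy versions of the two tables, using values shown in the output above.
tweets_toy <- data.frame(
  region = c("London", "England", "United Kingdom", "East"),
  stringsAsFactors = FALSE
)
rates_toy <- data.frame(
  Region = c("London", "United Kingdom", "England"),
  Jan.15 = c("6.2", "5.7", "5.6"),   # stored as text, as in the raw file
  stringsAsFactors = FALSE
)

# Region labels in the tweets that have no row in the rate table.
unmatched <- setdiff(unique(tweets_toy$region), rates_toy$Region)

# Left join by lookup, then convert the rate to numeric.
idx <- match(tweets_toy$region, rates_toy$Region)
tweets_toy$Jan.15 <- as.numeric(rates_toy$Jan.15[idx])
tweets_toy
```

Here unmatched contains "East", and the corresponding row keeps NA for Jan.15, mirroring the sixth row of the merged output.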
The shapefile of the UK boundaries was simplified and converted between spatial formats. The boundaries were simplified with a tolerance of 1000 m (the layer uses British National Grid, whose units are metres) while preserving topology, which reduces the computational load for later processing and plotting. No explicit validity repair is applied in the code below, so the geometries are assumed to remain valid after simplification. The simplified layer was then converted from a simple features (sf) object to a Spatial object (sp class) using the local authority district codes (LAD21CD) as identifiers, the label-point (centroid) coordinates of the polygons were extracted with coordinates(), and the object was converted back to sf for plotting.
### process shape file
# simplify boundaries
UK_shp_simple <- st_simplify(UK_shp,
                             preserveTopology = TRUE,
                             dTolerance = 1000) # 1 km (CRS units are metres)
UK_shp_simple.sp<-as_Spatial(UK_shp_simple, cast=TRUE, IDs="LAD21CD")
crds.UK_shp_simple<-coordinates(UK_shp_simple.sp)
UK_shp_simple.sf<-st_as_sf(UK_shp_simple.sp)
The next step is exploratory data analysis (EDA), which examines the relationships, distributions, and patterns within the data sets and informs the later modelling and visualization.
We start by plotting the number of tweets per place on a map. This shows the spatial distribution of the tweets and helps identify regional patterns.
# Plot tweets according to the region
# Summarize the number of tweets per place_name
points_summary <- tweet_coord %>%
group_by(place_name) %>%
summarize(
geometry = st_union(geometry),
tweet_count = n(),
.groups = 'drop'
)
# Plot the map with point data
ggplot() +
geom_sf(data = UK_shp_simple.sf, fill = "lightgray") +
geom_sf(data = points_summary, aes(color = tweet_count)) +
labs(title = "Spatial Distribution of Tweets by Region", x = "Longitude", y = "Latitude") +
theme_minimal() +
theme(legend.position = "right")
There are significant clusters of tweets in major cities and urban areas, particularly in London, the Midlands, and parts of Northern England. This indicates higher social media activity in these regions. Rural and less populated areas, such as parts of Scotland, Wales, and South West England, show fewer tweets, as indicated by the sparse and lighter colored points. The tweets cover the entire UK, showing that social media activity is widespread, although concentrated in urban centers.
To visualize the sentiment of tweets across different regions in the UK, we plot each tweet on a map, colored by its compound score. This shows how sentiment varies spatially and gives a first impression of regional mood and opinions.
# plot map with compound gradient
ggplot() +
geom_sf(data = UK_shp_simple.sf, fill = "lightgray") +
geom_sf(data = tweet_coord, aes(color = compound)) +
labs(title = "Regional compound analysis", x = "Longitude", y = "Latitude") +
theme_minimal()
The map reveals that sentiment expressed in tweets varies across different regions of the UK, without a clear regional bias toward positivity or negativity. Urban areas have a higher density of tweets, indicating higher social media activity, which may lead to a more accurate representation of sentiment in those areas. Rural areas, while less dense in tweet activity, still provide valuable sentiment data.
To further understand the sentiment of tweets across different regions in the UK, additional maps were created focusing on negative and positive sentiment components separately.
# Plot the map with the negative score (neg)
ggplot() +
geom_sf(data = UK_shp_simple.sf, fill = "lightgray") +
geom_sf(data = tweet_coord, aes(color = neg)) +
scale_color_gradient(name = "Negative Score", low = "red", high = "blue") +
labs(title = "Regional Negative Compound Analysis", x = "Longitude", y = "Latitude") +
theme_minimal() +
theme(legend.position = "right")
# Plot the map with the positive score (pos)
ggplot() +
geom_sf(data = UK_shp_simple.sf, fill = "lightgray") +
geom_sf(data = tweet_coord, aes(color = pos)) +
scale_color_gradient(name = "Positive Score", low = "orange", high = "blue") +
labs(title = "Regional Positive Compound Analysis", x = "Longitude", y = "Latitude") +
theme_minimal() +
theme(legend.position = "right")
The first map visualizes the negative sentiment scores of tweets across the UK. This map reveals that negative sentiments are distributed across various regions, with noticeable clusters in urban areas such as London, the Midlands, and parts of Northern England. The second map focuses on the positive sentiment scores of tweets. The map shows that positive sentiments are also widely distributed across the UK, with significant clusters in urban areas. Similar to the negative sentiment map, rural areas have fewer tweets, but they still contribute to the overall positive sentiment landscape.
The next plot visualizes the unemployment rate (quarter ending January 2015) across the UK, with each tweet colored by the rate of its region.
ggplot() +
geom_sf(data = UK_shp_simple.sf, fill = "lightgray") +
geom_sf(data = tweet_coord, aes(color = Jan.15)) +
scale_color_gradient(name = "Unemployment Rate", low = "red", high = "blue") +
labs(title = "Regional Unemployment Rate", x = "Longitude", y = "Latitude") +
theme_minimal() +
theme(legend.position = "right")
In this section, the data set is prepared for spatial analysis and a spatial lag variable is added. The compound sentiment score is assigned as the dependent variable y, the unemployment rate for 2015 as the independent variable x1, and the unemployment rate for 2014 as its temporal lag x1.t. This setup allows the relationship between sentiment and unemployment to be examined over time. Rows with missing values are then dropped, including tweets whose region has no unemployment rate; the model output further below reports 3778 data points, compared with 3919 in the raw tweet data.
Buffer polygons were created around the points with a buffer distance of 0.01, intended as degrees. Because the points are stored in a geographic CRS (WGS 84), the unit that st_buffer actually applies depends on the sf configuration: with the s2 engine enabled, which is the default in recent sf versions, the distance is interpreted in metres, so the value should be checked against the intended neighbourhood size. These buffer polygons represent the area around each point and are used to identify neighbouring points. A neighbour object was created from them with the poly2nb function, and the result was converted into a spatial weights list with nb2listw (row-standardised, style "W"), which is required for the spatial lag computed below.
# dependent variable `y` and independent variables `x1`, `x2`, etc.
tweet_coord$y <- tweet_coord$compound # For example, using compound sentiment score as y
tweet_coord$x1 <- tweet_coord$Jan.15 # unemployment rate in 2015
tweet_coord$x1.t <- tweet_coord$Jan.14 # temporal lag of unemployment rate/ Ur in 2014
# Check for NA values and remove rows with NA
tweet_coord <- na.omit(tweet_coord)
tweet <- na.omit(tweet)
# Coordinates
crds <- st_coordinates(tweet_coord)
To measure and explore spatial dependence, we need a spatial lag. A spatial lag is the product of a spatial weights matrix and a given variable. With row-standardised weights, as used here, the spatial lag of a variable at a location is the average value of that variable over the location's neighbours; the concept of a spatial lag is therefore tied to the definition of the spatial weights matrix.
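The definition can be written out by hand on a small example. The sketch below uses four made-up observations and a made-up neighbour structure (not the tweet data) to show that, with row-standardised weights, the matrix product returns the neighbourhood average.

```r
# Hypothetical negative-sentiment scores for four observations.
neg <- c(0.0, 0.2, 0.4, 0.1)

# Binary neighbour matrix: entry [i, j] is 1 if j is a neighbour of i.
B <- matrix(c(0, 1, 1, 0,
              1, 0, 1, 0,
              1, 1, 0, 1,
              0, 0, 1, 0), nrow = 4, byrow = TRUE)

# Row-standardise so that each row sums to 1 (the "W" style in nb2listw).
W <- B / rowSums(B)

# Spatial lag = weights matrix times the variable.
neg_lag_toy <- as.vector(W %*% neg)
neg_lag_toy  # 0.3 0.2 0.1 0.4
```

Observation 1 has neighbours 2 and 3, so its lag is (0.2 + 0.4) / 2 = 0.3. An observation without neighbours would have a row sum of zero, which is the case that the zero.policy argument handles in spdep.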
A spatial lag of the negative sentiment variable was computed using the lag.listw function. This spatial lag represents the influence of neighboring observations’ negative sentiment on each observation. These steps collectively enhance the dataset’s suitability for spatial analysis and provide insights into spatial dependencies and relationships.
# Create buffer polygons around the points (adjust buffer distance as needed)
buffer_distance <- 0.01 # Adjust buffer distance in desired units (e.g., degrees)
tweet_polygons <- st_buffer(tweet_coord, dist = buffer_distance)
# Now you can use tweet_polygons with poly2nb
nb <- poly2nb(tweet_polygons, queen = TRUE)
listw <- nb2listw(nb, style = "W", zero.policy = TRUE)
# Add spatial lag to your data:
neg_lag <- lag.listw(listw, tweet_coord$neg, zero.policy = TRUE)
tweet_coord$neg_lag <- neg_lag
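The weights list built above is also the ingredient for a global measure of spatial autocorrelation such as Moran's I, which the introduction names as part of the approach but which is not computed in the code shown. The sketch below works it out by hand in base R on a four-point toy example (made-up values and neighbour structure, not the tweet data); spdep provides moran.test for the same purpose on a real listw object.

```r
# Hypothetical values and neighbour structure for four observations.
x <- c(0.0, 0.2, 0.4, 0.1)
B <- matrix(c(0, 1, 1, 0,
              1, 0, 1, 0,
              1, 1, 0, 1,
              0, 0, 1, 0), nrow = 4, byrow = TRUE)
W <- B / rowSums(B)          # row-standardised weights

# Moran's I = (n / S0) * (z' W z) / (z' z), with z the centred variable
# and S0 the sum of all weights.
n  <- length(x)
z  <- x - mean(x)
S0 <- sum(W)
moran_I <- (n / S0) * as.numeric(t(z) %*% W %*% z) / sum(z^2)
moran_I
```

Values near zero indicate no spatial autocorrelation, positive values indicate that similar values cluster together, and negative values indicate that neighbours tend to differ, as in this toy example.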
In this section, the Geographically Weighted Regression (GWR) model was formulated and executed to analyze the spatial relationships within the dataset, specifically incorporating spatial lag.
First, the formula for the GWR model was defined, where y is the dependent variable (compound sentiment score), and x1, x1.t, and neg_lag are the independent variables. This model aims to examine how the sentiment score is influenced by the unemployment rate in 2015 (x1), its temporal lag which is unemployment rate in 2014 (x1.t), and the spatial lag of negative sentiment (neg_lag).
# Formula for the GWR model including spatial lag
eq <- y ~ x1 + x1.t + neg_lag
# Optimum bandwidth
bw <- gwr.sel(eq, data = tweet_coord, coords = crds, adapt = TRUE)
## Adaptive q: 0.381966 CV score: 1104.416
## Adaptive q: 0.618034 CV score: 1104.072
## Adaptive q: 0.763932 CV score: 1104.062
## Adaptive q: 0.7001615 CV score: 1104.066
## Adaptive q: 0.854102 CV score: 1104.066
## Adaptive q: 0.7758565 CV score: 1104.066
## Adaptive q: 0.7638913 CV score: 1104.062
## Adaptive q: 0.7684868 CV score: 1104.063
## Adaptive q: 0.7656278 CV score: 1104.063
## Adaptive q: 0.7639727 CV score: 1104.062
## Adaptive q: 0.7640134 CV score: 1104.062
## Adaptive q: 0.7646301 CV score: 1104.063
## Adaptive q: 0.764249 CV score: 1104.062
## Adaptive q: 0.7641034 CV score: 1104.062
## Adaptive q: 0.764159 CV score: 1104.062
## Adaptive q: 0.7641997 CV score: 1104.062
## Adaptive q: 0.764159 CV score: 1104.062
# GWR model
model_gwr <- gwr(eq, data = tweet_coord, coords = crds, adapt = bw)
model_gwr
## Call:
## gwr(formula = eq, data = tweet_coord, coords = crds, adapt = bw)
## Kernel function: gwr.Gauss
## Adaptive quantile: 0.764159 (about 2886 of 3778 data points)
## Summary of GWR coefficient estimates at data points:
## Min. 1st Qu. Median 3rd Qu. Max. Global
## X.Intercept. -0.0634382 -0.0629331 -0.0587078 -0.0545579 -0.0470006 -0.0521
## x1 0.0022720 0.0046497 0.0108676 0.0132474 0.0165642 0.0058
## x1.t 0.0038872 0.0060322 0.0072692 0.0108965 0.0125207 0.0105
## neg_lag -0.6418354 -0.5711451 -0.5604900 -0.5339049 -0.4597040 -0.5810
The Geographically Weighted Regression (GWR) model was fitted with a Gaussian kernel and an adaptive bandwidth. The selected adaptive quantile is approximately 0.764, meaning that the bandwidth at each location adapts to cover about 2886 of the 3778 data points.
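To make the kernel and the adaptive quantile concrete, the sketch below fits a local weighted regression at every point of a simulated data set in base R. It is a simplified stand-in for what gwr does, not the spgwr implementation: the bandwidth at each point is taken as the q-quantile of the distances to all points, and the weights follow the Gaussian form exp(-0.5 * (d / h)^2). All data and names are made up.

```r
set.seed(1)
n <- 200
coords <- cbind(runif(n), runif(n))        # simulated point locations
x <- rnorm(n)
true_slope <- 1 + coords[, 1]              # slope increases from west to east
y <- 0.5 + true_slope * x + rnorm(n, sd = 0.1)

# Local fit at point i: Gaussian weights with an adaptive bandwidth h,
# chosen as the q-quantile of the distances from point i.
local_fit <- function(i, q = 0.3) {
  d <- sqrt((coords[, 1] - coords[i, 1])^2 + (coords[, 2] - coords[i, 2])^2)
  h <- quantile(d, q)
  w <- exp(-0.5 * (d / h)^2)
  coef(lm(y ~ x, weights = w))
}

# One row of coefficients per location.
local_coefs <- t(sapply(seq_len(n), local_fit))
summary(local_coefs[, "x"])
```

The local slopes rise with the first coordinate, which is the kind of spatial variation that the quartiles in a GWR summary describe. A large quantile, such as the 0.764 selected here, makes each local fit use most of the data and therefore pulls the local estimates towards the global ones.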
The summary of the GWR coefficient estimates at the data points gives the minimum, first quartile (1st Qu.), median, third quartile (3rd Qu.), and maximum of the local coefficient estimates, along with the global coefficients, which come from a single regression fitted to all locations at once.
The global coefficient for x1 is 0.0058, suggesting a positive relationship between the unemployment rate in 2015 and the compound sentiment score: higher unemployment rates are associated with slightly higher (more positive) sentiment scores. The local estimates range from 0.0023 to 0.0166.
The global coefficient for x1.t is 0.0105, indicating a positive relationship between the unemployment rate in 2014 and the compound sentiment score: higher past unemployment rates are associated with higher current sentiment scores. The local estimates range from 0.0039 to 0.0125.
The global coefficient for neg_lag is -0.5810, showing a strong negative relationship between the spatial lag of negative sentiment and the compound sentiment score: higher negative sentiment in neighbouring observations is associated with lower (more negative) sentiment scores. The local estimates range from -0.642 to -0.460.
The spatial lag captures the influence of neighbouring observations, while the two unemployment rates capture current and previous-year labour market conditions. The local coefficients keep the same sign everywhere and vary only moderately around the global estimates. The output shown here does not include standard errors or significance tests, so the strength of the spatial dependence should be read as indicative.
# Visualize the GWR coefficients for neg_lag
tweet_coord$GWR_neg_lag <- as.numeric(model_gwr$SDF$neg_lag)
# Create the base map
p <- ggplot(data = UK_shp_simple) +
geom_sf(color = "gray60", size = 0.1) +
theme_void()
# Add the points with GWR coefficients for neg_lag
p + geom_sf(data = tweet_coord, aes(color = GWR_neg_lag), size = 2) +
scale_color_viridis_c(option = "C") +
labs(color = "GWR Coefficient for neg_lag") +
theme_minimal() +
ggtitle("GWR Coefficient for neg_lag")
The plot effectively illustrates the spatial variability of the impact of neighboring negative sentiment on sentiment scores across the UK. The color gradient highlights regions where negative sentiment from neighbors has a stronger or weaker impact.
Stronger Negative Impact: Clusters of more negative coefficients (purple) are seen in Southern England, suggesting higher sensitivity to neighboring negative sentiments in these areas.
Weaker Negative Impact: Clusters of less negative coefficients (yellow to orange) are observed in Northern England and Scotland, indicating these regions are less affected by neighboring negative sentiments.
# Extract the GWR coefficients
gwr_coefficients <- as.data.frame(model_gwr$SDF)
# Select relevant columns (x1, x1.t, neg_lag)
gwr_data <- gwr_coefficients %>%
  select(x1, x1.t, neg_lag)
# Calculate the optimal number of clusters (average silhouette width, the default method)
fviz_nbclust(gwr_data, kmeans, method = "silhouette")
The silhouette analysis plot indicates that the optimal number of clusters for this dataset is three. The average silhouette width is highest when the number of clusters (k) is three, suggesting that this number provides the best separation of the data into distinct groups.
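As a rough illustration of what the silhouette criterion measures, the sketch below computes the average silhouette width for several values of k on simulated data. It assumes the `cluster` package is available (it is a recommended package bundled with most R installations). The object `toy` is hypothetical data with three well-separated groups and does not reproduce the GWR coefficients.

```r
library(cluster)  # provides silhouette()

set.seed(123)
# Three well-separated groups of simulated two-dimensional points (hypothetical values)
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2),
  cbind(rnorm(50, mean = 0, sd = 0.5), rnorm(50, mean = 5, sd = 0.5))
)
d <- dist(toy)  # pairwise Euclidean distances

# Average silhouette width for k = 2, ..., 6
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- 2:6

print(round(avg_sil, 3))
best_k <- as.integer(names(avg_sil)[which.max(avg_sil)])
print(best_k)
```

The value of k with the largest average width is the one for which points sit closest to their own cluster relative to the nearest other cluster.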
# Perform clustering with eclust
set.seed(123) # For reproducibility
klastry2 <- eclust(gwr_data, "kmeans", k = 3)
# Assign clusters to spatial data frame
tweet_coord$clust5 <- klastry2$cluster
# Visualize clusters on the map
ggplot(data = UK_shp) +
  geom_sf(color = "gray60", size = 0.1) +
  geom_sf(data = tweet_coord, aes(color = as.factor(clust5)), size = 2) +
  scale_color_viridis_d(option = "C") +
  labs(color = "Cluster") +
  theme_minimal() +
  ggtitle("Clustering of GWR Coefficients")
The clustering analysis reveals regional variation in the relationships modeled by the GWR between compound sentiment scores and unemployment rates. Each cluster groups locations with similar local coefficients, suggesting that local factors shape the relationship between sentiment and unemployment. Regionally, the southern UK, including the Midlands and South West, tends to fall into Cluster 1, suggesting that these areas share a specific dynamic between sentiment and unemployment rates. Central England forms Cluster 2, indicating a different pattern, while the northern regions, including Scotland and Wales, are primarily in Cluster 3, suggesting yet another relationship. These regional patterns show how the unemployment rate and neighboring negative sentiment may influence sentiment differently from region to region.
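One way to see what distinguishes the clusters is to compare the mean coefficient in each of them. The sketch below does this on simulated values; `toy_coef` is a hypothetical stand-in for `gwr_data`, and the group means are invented for illustration. It also standardises the columns with `scale()` before clustering. That step is an assumption and is not part of the workflow above, but it may be worth considering because the coefficients sit on very different scales (roughly 0.006 and 0.011 for the unemployment terms against -0.58 for neg_lag), so Euclidean distances on the raw values are likely to be dominated by neg_lag.

```r
set.seed(123)
# Simulated local coefficients in three groups of 50 (hypothetical values)
toy_coef <- data.frame(
  x1      = c(rnorm(50, 0.002, 0.001), rnorm(50, 0.006, 0.001), rnorm(50, 0.010, 0.001)),
  x1.t    = rnorm(150, mean = 0.010, sd = 0.003),
  neg_lag = c(rnorm(50, -0.9, 0.05), rnorm(50, -0.6, 0.05), rnorm(50, -0.3, 0.05))
)

# Standardise each column so that no single coefficient dominates the distance
toy_scaled <- scale(toy_coef)
km <- kmeans(toy_scaled, centers = 3, nstart = 25)

# Mean coefficient per cluster, reported on the original scale
centers <- aggregate(toy_coef, by = list(cluster = km$cluster), FUN = mean)
print(centers)
print(table(km$cluster))
```

Reading the rows of `centers` side by side shows which coefficient drives the separation. Cluster numbers themselves are arbitrary labels and carry no ordering.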
A Moran plot (also known as a Moran scatter plot) is a graphical representation used to visualize and assess the spatial autocorrelation of a variable. It plots the values of a variable against the spatial lag of that variable, helping to identify the nature and strength of spatial autocorrelation.
# Moran Plot
ggplot(tweet_coord, aes(x = neg, y = neg_lag)) +
  geom_point() +
  geom_smooth(method = "lm") +
  ylab("Negative sentiment lag") +
  xlab("Negative sentiment") +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
# Standardise the variable and its spatial lag
tweet_coord <- tweet_coord %>%
  mutate(
    st_neg = (neg - mean(neg)) / sd(neg),
    st_neg_lag = (neg_lag - mean(neg_lag)) / sd(neg_lag)
  )
# Standardised Moran plot with reference lines at zero
ggplot(tweet_coord, aes(x = st_neg, y = st_neg_lag)) +
  geom_point() +
  geom_smooth(method = "lm") +
  geom_hline(yintercept = 0, color = "grey", alpha = 0.5) +
  geom_vline(xintercept = 0, color = "grey", alpha = 0.5) +
  ylab("Negative sentiment lag \n (standardised)") +
  xlab("Negative sentiment \n (standardised)") +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
To measure global spatial autocorrelation, we can use Moran’s I. The Moran plot and Moran’s I are intrinsically related: with row-standardised spatial weights, the value of Moran’s I corresponds to the slope of the linear fit of the spatial lag on the variable in the Moran plot. We can compute it by running:
moran.test(tweet_coord$neg, listw = listw, zero.policy = TRUE, na.action = na.omit)
##
## Moran I test under randomisation
##
## data: tweet_coord$neg
## weights: listw
## n reduced by no-neighbour observations
##
## Moran I statistic standard deviate = 3.6629, p-value = 0.0001247
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.0366234370 -0.0002995806 0.0001016107
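Two points about this output can be checked with a few lines of base R. The sketch below first uses simulated data on a hypothetical 6 x 6 grid, with rook neighbours and row-standardised weights, to illustrate that Moran’s I equals the slope of the regression of the spatial lag on the centred variable. It then recomputes the standard deviate and one-sided p-value from the three sample estimates printed above. The grid, the variable `v` and all object names are assumptions made for illustration; only the three numeric estimates come from the output.

```r
# Part 1: Moran's I as the slope of the Moran plot (simulated data)
set.seed(1)
grid <- expand.grid(x = 1:6, y = 1:6)      # hypothetical 6 x 6 grid of locations
D <- as.matrix(dist(grid))
W <- (abs(D - 1) < 1e-9) * 1               # rook neighbours: cells at distance 1
W <- W / rowSums(W)                        # row-standardise the weights

v <- grid$x + rnorm(nrow(grid))            # variable with an east-west trend
z <- v - mean(v)                           # centred variable
lag_z <- as.vector(W %*% z)                # spatial lag: mean of the neighbours

n <- length(z)
moran_I <- (n / sum(W)) * sum(z * lag_z) / sum(z^2)
slope <- coef(lm(lag_z ~ z))[["z"]]
print(c(moran_I = moran_I, slope = slope)) # the two values coincide

# Part 2: standard deviate and p-value from the estimates reported above
I_obs <- 0.0366234370
E_I   <- -0.0002995806                     # expectation under no autocorrelation, -1 / (n - 1)
V_I   <- 0.0001016107

z_score <- (I_obs - E_I) / sqrt(V_I)       # about 3.663, the reported standard deviate
p_value <- pnorm(z_score, lower.tail = FALSE)  # one-sided test, alternative "greater"
n_used  <- round(1 - 1 / E_I)              # observations with at least one neighbour

print(c(z_score = z_score, p_value = p_value, n_used = n_used))
```

Read this way, the statistic is small in magnitude (about 0.037) but, given the number of observations implied by the expectation, clearly distinguishable from the value expected under spatial randomness. Note that the slope equality holds for the lag of the variable itself; rescaling the lag by its own standard deviation, as in the standardised plot, changes the slope.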
This project utilized a sample of geolocated tweet data from the UK, coupled with regional unemployment data, to perform spatial analysis and Geographically Weighted Regression (GWR).
Exploratory Data Analysis (EDA) revealed significant clusters of tweets in urban areas, highlighting higher social media activity in areas such as London, the Midlands, and parts of Northern England. Sentiment analysis showed that sentiment varies by region. Urban areas had a higher density of tweets and therefore give a more reliable picture of local sentiment, while rural areas, despite having fewer tweets, still contributed valuable sentiment data.
The analysis also visualized unemployment rates across different regions, revealing significant clusters of higher unemployment rates in parts of southern England and Wales. These areas showed a correlation with clusters of negative sentiment tweets.
The GWR model was successfully fitted, revealing how the relationships between sentiment and unemployment rates, as well as the spatial lag of negative sentiment, varied across different geographical locations. The model showed a positive relationship between unemployment rates and sentiment scores and a strong negative relationship between the spatial lag of negative sentiment and the compound sentiment score. Clustering analysis of the GWR coefficients further identified regional patterns, with the southern UK, including the Midlands and South West, forming one cluster, central England forming another, and the northern regions, including Scotland and Wales, forming a third. These clusters indicate that the relationship between sentiment and unemployment differs from region to region.
In conclusion, the project demonstrated significant spatial heterogeneity in sentiment and its relationship with unemployment rates, providing valuable insights into the spatial dynamics of social media activity and socio-economic factors in the UK.