First, we establish a connection with an already created PostgreSQL database. The goal is to load the CSV, with more than 23 million rows (about 6 GB), into PostgreSQL, so the data can be manipulated easily.
library(RPostgres)
library(dplyr)
fun_connect<-function(){dbConnect(RPostgres::Postgres(),
                                  dbname='insertyourdatabase',
                                  host='insertyourhost',
                                  port=5432,
                                  user='insertyouruser',
                                  password='insertyourpass',
                                  options= '-c search_path=insertyourschema')}
conn<-fun_connect() # This function establishes the connection; run it every time you want to work with SQL through R.
Now we need to create the table in PostgreSQL. For this it is essential to declare the correct attributes: the class of each column and its length. The openColumn function below retrieves the column classes from a small sample of rows, without having to load the whole file of interest. Finally, we create our data table manually.
#Return the class of each column, reading only a small sample of rows
openColumn<-function(x){
  sample_rows<-read.csv(x, nrows = 1000)
  tidyr::spread(as.data.frame(summary.default(sample_rows)), key = Var2, value = Freq)
}
names_columnas<-openColumn("path/where/your/6gb/data/is/hosted.csv")
#CREATING TABLE -----------
dbSendQuery(conn, "CREATE TABLE census (id SERIAL PRIMARY KEY,CENSUS_YEAR NUMERIC (20),DGUID VARCHAR (17),ALT_GEO_CODE NUMERIC (20),GEO_LEVEL VARCHAR (30),GEO_NAME VARCHAR (136),TNR_SF DOUBLE PRECISION,TNR_LF DOUBLE PRECISION,DATA_QUALITY_FLAG NUMERIC (20),CHARACTERISTIC_ID NUMERIC (20),CHARACTERISTIC_NAME VARCHAR (136),CHARACTERISTIC_NOTE NUMERIC (20),C1_COUNT_TOTAL REAL,
SYMBOL VARCHAR(5),
\"C2_COUNT_MEN+\" REAL,
SYMBOL1 VARCHAR(5),
\"C3_COUNT_WOMEN+\" REAL,
SYMBOL2 VARCHAR(5),
C4_COUNT_LOW_CI_TOTAL REAL,
SYMBOL3 VARCHAR(5),
\"C5_COUNT_LOW_CI_MEN+\" REAL,
SYMBOL4 VARCHAR(5),
\"C6_COUNT_LOW_CI_WOMEN+\" REAL,
SYMBOL5 VARCHAR(5),
C7_COUNT_HI_CI_TOTAL REAL,
SYMBOL6 VARCHAR(5),
\"C8_COUNT_HI_CI_MEN+\" REAL,
SYMBOL7 VARCHAR(5),
\"C9_COUNT_HI_CI_WOMEN+\"  REAL,
SYMBOL8 VARCHAR(5),
C10_RATE_TOTAL REAL,
SYMBOL9 VARCHAR(5),
\"C11_RATE_MEN+\" REAL,
SYMBOL10 VARCHAR(5),
\"C12_RATE_WOMEN+\" REAL,
SYMBOL11 VARCHAR(5),
C13_RATE_LOW_CI_TOTAL REAL,
SYMBOL12 VARCHAR(5),
\"C14_RATE_LOW_CI_MEN+\" REAL,
SYMBOL13 VARCHAR(5),
\"C15_RATE_LOW_CI_WOMEN+\"  REAL,
SYMBOL14 VARCHAR(5),
C16_RATE_HI_CI_TOTAL REAL,
SYMBOL15 VARCHAR(5),
\"C17_RATE_HI_CI_MEN+\" REAL,
SYMBOL16  VARCHAR(5),
\"C18_RATE_HI_CI_WOMEN+\" REAL,
SYMBOL17  VARCHAR(5))")
This is an interesting step: it lets us load a big dataset into PostgreSQL even though a CSV file larger than 4 GB cannot be loaded directly. The workaround is to compress the file in 7z format and COPY the dataset into the table created above, streaming the decompressed output straight into PostgreSQL.
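If the file is not already compressed, it can be packed with the 7-Zip command line before running the COPY (hypothetical file names; 7-Zip must be installed and on the PATH):
#Compress the raw CSV into 7z format once, from R or directly in a terminal
system('7z a 98-401-X2021006_English_CSV_data.7z 98-401-X2021006_English_CSV_data.csv')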
#IMPORTING DATA ----------------------------
dbSendQuery(conn, "copy censos.census (census_year, dguid, alt_geo_code, geo_level, geo_name, tnr_sf, tnr_lf, data_quality_flag, characteristic_id, characteristic_name, characteristic_note, C1_COUNT_TOTAL,SYMBOL,
\"C2_COUNT_MEN+\",
SYMBOL1,
\"C3_COUNT_WOMEN+\",
SYMBOL2,
C4_COUNT_LOW_CI_TOTAL,
SYMBOL3,
\"C5_COUNT_LOW_CI_MEN+\",
SYMBOL4,
\"C6_COUNT_LOW_CI_WOMEN+\",
SYMBOL5,
C7_COUNT_HI_CI_TOTAL,
SYMBOL6,
\"C8_COUNT_HI_CI_MEN+\",
SYMBOL7,
\"C9_COUNT_HI_CI_WOMEN+\",
SYMBOL8,
C10_RATE_TOTAL,
SYMBOL9,
\"C11_RATE_MEN+\",
SYMBOL10,
\"C12_RATE_WOMEN+\",
SYMBOL11,
C13_RATE_LOW_CI_TOTAL,
SYMBOL12,
\"C14_RATE_LOW_CI_MEN+\",
SYMBOL13, 
\"C15_RATE_LOW_CI_WOMEN+\",
SYMBOL14,
C16_RATE_HI_CI_TOTAL,
SYMBOL15,
\"C17_RATE_HI_CI_MEN+\",
SYMBOL16,
\"C18_RATE_HI_CI_WOMEN+\",
SYMBOL17) FROM PROGRAM '7z e -so C:/CEDEUS/2022/july05_canadaCensus/input/98-401-X2021006_eng_CSV/98-401-X2021006_English_CSV_data.7z' DELIMITER ',' CSV HEADER encoding 'windows-1251';")
Awesome! Our data is in PostgreSQL; it is time to work with it.
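As a quick sanity check (this step is not in the original workflow), we can ask PostgreSQL how many rows were imported; the count should be above 23 million. A minimal sketch, assuming the census table is reachable through the search_path set in the connection:
#Illustrative check of the number of imported rows
DBI::dbGetQuery(conn, "SELECT count(*) FROM census;")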
We start by testing the recently loaded database. For this we use the dbSendQuery function to run queries; a good first option is to ask for a single administrative subdivision, the City of Thompson itself.
#Create Thompson  ------------------------------------
dbSendQuery(conn, "CREATE TABLE thompson_city AS
SELECT *
            FROM census
            WHERE (DGUID='2021A00054622026');")
Now we send a query to the table created in the last step and ask PostgreSQL to return the data through the dbFetch function.
#Recover data frame info from Thompson --------------
thompson_city<-dbSendQuery(conn,"SELECT *
  FROM thompson_city;")
thompson_city<-dbFetch(thompson_city)
View(thompson_city)
Cool, we can see that our information is correct. Now we should take advantage of the database and ask for all the Dissemination Areas of our study area. To make the process easier we can build the DGUID codes from the pattern in their last three digits, generating the sequence with the sprintf function.
query<-paste0("DGUID='2021S051246220",sprintf('%0.3d', 91:117),"'")
da_thompson<-glue::glue_collapse(query, sep=" OR " )
dbSendQuery(conn,
paste("CREATE TABLE thompson_da AS SELECT * FROM census WHERE (", da_thompson,") ORDER BY id;"))
Again, we retrieve the information from PostgreSQL using the dbFetch function. We now have a new dataset with 9936 rows, that is, 27 Dissemination Areas with 368 variables each. Curiously, the Census dataset is a kind of cross table: the variables appear both as rows and as columns. So it is convenient to split the dataset into separate data frames; that way we can explore each Dissemination Area in more depth later, and we can use the apply family of functions to repeat the same process on each dataset.
With the split function we get a list that stores 27 datasets, each one representing one Dissemination Area.
tableThompson_da<-dbSendQuery(conn,"SELECT *
  FROM thompson_da;")
tableThompson_da<-dbFetch(tableThompson_da)
tableThompson_da<-split(tableThompson_da,tableThompson_da$dguid)
After some reflection, and some experience working on topics related to the elderly, I opted to choose two variables of interest. These variables are:
“Percentage of private households that are one-person households (e.g., persons living alone) Living alone” (id= 97)
“Percentage of low income based on the Low-income measure, after tax (LIM-AT) (%) 65 years and over (%)” (id= 335)
I think it would be interesting to find some relation between areas where people live alone and areas with a high prevalence of low income; in some way this should work as a proxy for vulnerability, an important aspect to consider when working with flood risk.
# Selecting variables ------------------
#97 & 335
selectionVar<-lapply(tableThompson_da, function(x){x[x$characteristic_id==97|x$characteristic_id==335,]}) %>% bind_rows()
To complete Task 1 of the assignment we proceed to save these two variables in two different CSV files.
#SAVING VARIABLES ---------------------------------------------------
var97<-selectionVar[selectionVar$characteristic_id==97,]
var335<-selectionVar[selectionVar$characteristic_id==335,]
write.csv(var97,"output/var97.csv")
write.csv(var335,"output/var335.csv")
Now it is time to open the dataset containing every Dissemination Area across Canada. At the same time, it is a good decision to reduce this new shapefile to the Thompson City area only.
library(sf)
boun_da<-st_read("input/lda_000b21a_e/lda_000b21a_e.shp")
boun_da<-boun_da[boun_da$DGUID%in%unique(selectionVar$dguid),]
For the future Bivariate LISA analysis, we proceed to join our data tables to the spatial layer. At the same time we take a glance at each dataset to check that it was created correctly.
library(ggplot2)
thompson_var97<-left_join(boun_da,var97, by=c("DGUID"="dguid"))
var97<-ggplot(data=thompson_var97)+
  geom_sf(aes(fill=c1_count_total))
var97
thompson_var335<-left_join(boun_da,var335, by=c("DGUID"="dguid"))
var335<-ggplot(data=thompson_var335)+
  geom_sf(aes(fill=c1_count_total))
var335
In this step the two datasets will be joined into a single one for the future Bivariate LISA analysis. The process consists of taking the information from our data frame list, keeping only the necessary columns, applying pivot_wider, joining it to the spatial layer, and we are done.
#Select variables
thompson_da<-lapply(tableThompson_da, function(x){x[x$characteristic_id==97|x$characteristic_id==335,]}) %>% bind_rows()
#I choose interest variables
colnames(thompson_da)
thompson_da<-select(thompson_da,c(3,11,13))
#Pivot wider to split the variable like columns
thompson_da<- thompson_da %>%
  tidyr::pivot_wider(names_from = characteristic_name, values_from = c1_count_total)
#Change name to do easy to work
colnames(thompson_da)[2:3]<-c("livingAlone","elderlie")
#Join with the spatial data
thompson_da<-left_join(boun_da,thompson_da, by=c("DGUID"="dguid"))
With this, Task 2 of the assignment is finished.
It is time to do the BiLISA spatial analysis. In this specific case we first remove the NA values and create a Queen contiguity weights matrix, which will be used as input when performing the spatial analysis.
#DOING BILISA -----------------
library(sf)
library(rgeoda)
thompson_daNoNa<-thompson_da[!is.na(thompson_da$livingAlone)&!is.na(thompson_da$elderlie),]
thompson_daNA<-thompson_da[is.na(thompson_da$livingAlone)|is.na(thompson_da$elderlie),]
#Creating Queen Weight Matrix
wij <- queen_weights(thompson_daNoNa)
To do this analysis we will use the Bivariate Local Moran statistic, a technique used in the flood risk literature; see, for example, Chakraborty et al. 2022 and Tate et al. 2021. We provide the Queen weights matrix as input and choose as variable x the people who live alone (id=97) and as variable y the percentage of the elderly (>65) with low income after tax (id=335). After that we just save some values provided by the model.
library(rgeoda)
qsa <- local_bimoran(wij,thompson_daNoNa[c('livingAlone','elderlie')])
lisa_colors <- lisa_colors(qsa)
lisa_labels <- lisa_labels(qsa)
lisa_clusters <- lisa_clusters(qsa)
plot(st_geometry(thompson_da))
plot(st_geometry(thompson_daNoNa), col=sapply(lisa_clusters, 
function(x){return(lisa_colors[[x+1]])}), 
     border = "#333333", lwd=0.2,add=TRUE)
title(main = "Bivariate LISA Local Moran \n Var 97 & Var 335", font=1)
legend('bottomleft', legend = lisa_labels, fill = lisa_colors, 
       border = "#eeeeee",pch=1,
   cex = 0.7) 
Although we have some findings, it should be possible to get better results with a variable that has more data, because variable 335 only has 8 valid rows.
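To make this sparsity visible, a quick check (a minimal sketch using the thompson_da object joined above; this check is not part of the original workflow) could count the non-missing values of each variable:
#Counting how many Dissemination Areas have a valid (non-NA) value for each variable
sum(!is.na(thompson_da$livingAlone))
sum(!is.na(thompson_da$elderlie)) # variable 335 has only a handful of valid rows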
It is time to explore the dataset in more depth; in fact there are many variables related to the proposed pool. For this we create a kind of dictionary that tells us how many times each variable is reported across the Dissemination Areas.
# Keep the characteristic ids whose population count is non-zero and not NA
popVar<-lapply(tableThompson_da,function(x){x$characteristic_id[x$c1_count_total!=0&!is.na(x$c1_count_total)]} %>% as.data.frame(.))
# Joining the values in a data frame
popVar<-popVar %>% plyr::rbind.fill(.) %>% table(.) %>% as.data.frame(.,col.names=c("var","Freq"))
View(popVar)
# Setting up class variables to future join
thompson_city$characteristic_id<-as.factor(thompson_city$characteristic_id)
#Doing a join with another attributes
popVar<-left_join(popVar,thompson_city[,c("id","characteristic_id","characteristic_name")], by=c("."="characteristic_id"))
Now that I have a dictionary that tells me how many times each variable is filled across the 27 Dissemination Areas, it is necessary to choose among them. I selected 22 variables that make sense for the 5 variable sets proposed in the task. It will be necessary to categorize this pool of 22 variables into 5 categories so as not to relate similar variables to each other.
#A pool of variables based on 5 categories according to the proposed questions
candidates<-data.frame(
name=popVar$characteristic_name[popVar$.%in%c(47,97,110,120,121,141,144,229:232,287:288,303,304,325,326,328,330,331,333,335)],
candidates=c(47,97,110,120,121,141,144,229:232,287:288,303,304,325,326,328,330,331,333,335),
ranking=popVar$Freq[popVar$.%in%c(47,97,110,120,121,141,144,229:232,287:288,303,304,325,326,328,330,331,333,335)]
)
#Create a category variable on the five questions
candidates$classVar<-case_when(candidates$candidates%in%c(325,326,328,330,331,333,335)~"lowIncome",
          candidates$candidates%in%c(229:232,287:288,303,304)~"medianIncome",
          candidates$candidates%in%c(120,121,141,144)~"goverTransfer",
          candidates$candidates%in%c(97,110)~"liveAlone",
          candidates$candidates%in%c(47)~"storeys")
Now we reduce our pool of selected variables to 9. These variables have the following numerical ids: 97, 110, 120, 230, 232, 288, 304, 325, 326. This time we will put all these variables together in the same data frame, because we know this is a requirement for performing a new Bivariate LISA.
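The exact shortlist is a judgment call, but a sketch of the kind of screening behind it, assuming we favour the best-filled candidates within each category, could look like this (illustrative helper code, not in the original):
#Inspect the best-filled candidates per category before picking the final nine
candidates %>%
  group_by(classVar) %>%
  slice_max(ranking, n = 2, with_ties = FALSE) %>%
  arrange(classVar, desc(ranking))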
#Extracting variables
thompson_da<-lapply(tableThompson_da, function(x){x[c(325,326,288,304,232,230,120,110,97),]}) %>% bind_rows() # select the variables of interest by row position within each Dissemination Area
colnames(thompson_da)
thompson_da<-select(thompson_da,c(3,10,13))
#Splitting variables like columns
thompson_da<- thompson_da %>%
  tidyr::pivot_wider(names_from = characteristic_id, values_from = c1_count_total)
#Choosing legible names
colnames(thompson_da)[2:10]<-c("limatElderTtl325","limatTotal326","medianInFam288","mdnTtlIncm304","mdnInc1prs232","mdnAThosehld230","numGovTrans120","onePer110","livAlone97")
#Doing spatial
thompson_da<-left_join(boun_da,thompson_da, by=c("DGUID"="dguid"))
Last time we did not pay much attention to the map of variable 335. That map did not give us much information, but we persisted in using the variable and did not get satisfactory results. This time we will carry out a more complete visual exploration.
#Creating a function to repeat over the different variables
vermapa<-function(input,x,titulo){ggplot(data=input)+ geom_sf(aes(fill=.data[[x]]))+ ggtitle(titulo)}
variables<-c("limatElderTtl325","limatTotal326","medianInFam288","mdnTtlIncm304","mdnInc1prs232","mdnAThosehld230","numGovTrans120","onePer110","livAlone97")
map1<-vermapa(thompson_da,variables[1], "LIM-AT 65 years and over Var.325")
map2<-vermapa(thompson_da,variables[2], "LIM-AT Total % Var.326")
map3<-vermapa(thompson_da,variables[3], "Median Income Family $ Var.288")
map4<-vermapa(thompson_da,variables[4], "Median Total Income 2020 $ Var.304")
map5<-vermapa(thompson_da,variables[5], "Median Income 1 Person Household 2020 $ Var.232")
map6<-vermapa(thompson_da,variables[6], "Median Total Income 2020 $ Var.230")
map7<-vermapa(thompson_da,variables[7], "% Num of government transfers >15 Var.120")
map8<-vermapa(thompson_da,variables[8], "% One-person households Var.110")
map9<-vermapa(thompson_da,variables[9], "Live alone Var.97")
library(ggpubr)
pubr<-ggpubr::font("title", size=7)+
  ggpubr::font("x.text",size=3)+
  ggpubr::font("y.text",size=3)+
  ggpubr::font("legend.text",size=3)+
  ggpubr::font("legend.title",size=5)
mapasExplora<-ggpubr::ggarrange(map1+pubr,map2+pubr,map3+pubr,map4+pubr,map5+pubr,map6+pubr,map7+pubr,map8+pubr,map9+pubr,ncol=3,nrow=3)
mapasExplora
It is time to perform our new LISA with the new data, in this case with variable 97 (x) and variable 325 (y); this last variable is the total of the 65-years-and-over population with a prevalence of low income in 2020 (LIM-AT).
thompson_da_na<-thompson_da[1,] # Save the Dissemination Area with NA values
thompson_da<-thompson_da[2:nrow(thompson_da),] # Remove the NA Dissemination Area
lisa<-local_bimoran(queen_weights(thompson_da), thompson_da[c("livAlone97","limatElderTtl325")])
lisa_colors <- lisa_colors(lisa)
lisa_labels <- lisa_labels(lisa)
lisa_clusters <- lisa_clusters(lisa)
Mapping our new BiLISA in the classic way
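The plotting code is analogous to the first BiLISA map; a minimal sketch, assuming the same base-R approach and the objects created just above:
#Classic base-R map of the new Bivariate LISA (sketch based on the earlier plotting code)
plot(st_geometry(thompson_da), col=sapply(lisa_clusters,
                                          function(x){return(lisa_colors[[x+1]])}),
     border = "#333333", lwd=0.2)
plot(st_geometry(thompson_da_na), col="#347C2C", add=TRUE) # the Dissemination Area with NA values, in green
title(main = "Bivariate LISA Local Moran \n Var 97 & Var 325", font=1)
legend('bottomleft', legend = lisa_labels, fill = lisa_colors,
       border = "#eeeeee", cex = 0.7)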
Excited about the results, we proceed to corroborate them in GeoDa:
BiLISA map in the GeoDa software
Moran scatter plot generated in GeoDa
Good, new findings!
What about printing it with ggplot? Just to have more options for presentations, and to explore how to plot this less common map format, we will reproduce it with the ggplot library.
But first, it is necessary to set up our dataset for plotting.
thompsonLisa<-thompson_da %>% 
  mutate(cluster_num = lisa_clusters(lisa) + 1, # add 1 bc clusters are zero-indexed
         cluster = factor(lisa_labels[cluster_num], levels = lisa_labels)) 
thompsonLisa<-thompsonLisa %>% st_set_crs(.,3347) %>% 
  st_transform(.,4326)
test<-setNames(lisa_colors(lisa), lisa_labels) %>% data.frame(cluster=names(.),color=.)
thompsonLisa<-left_join(thompsonLisa,test, by="cluster")
thompsonLisa$cluster<-as.factor(thompsonLisa$cluster)
Now we set up the code to see the new map:
lisaggplot<-ggplot(data=thompsonLisa, aes(fill = cluster)) +
  geom_sf(color = "white", size = 0) +
  scale_fill_manual(values = setNames(lisa_colors(lisa), lisa_labels), na.value = "green")
lisaggplot
Finally, the most interesting map should be interactive, making it easy to explore the geographic context of Thompson City.
library(leaflet)
mapleaflet<-leaflet(data=thompsonLisa) %>% 
              addTiles() %>% 
              addPolygons(data=st_transform(st_set_crs(thompson_daNA,3347),4326), color="green",fill=1,fillColor="#347C2C", popup="Area NA value") %>% 
              addPolygons(color =thompsonLisa$color, popup = thompsonLisa$cluster,fillOpacity = 0.8,stroke=T ) %>% 
              addLegend("bottomright", 
                      colors = c("#eeeeee",  "#FF0000", "#0000FF",  "#a7adf9","#f4ada8","#464646","#999999","#347C2C"),
                      labels = c("Not significant", "High-High", "Low-Low", "Low-High", "High-Low","Undefined","Isolated","Area NA value"),
                      title = "Bi Variate LISA",
                      opacity = 1) 
mapleaflet