Disclaimer: light purple boxes display the R code, the light blue ones the output
Discovering the tidyverse
The first part of the code is dedicated to preliminary operation: setting the working directory for the project, check what is inside the folder, load and inspect the data with the basic functions provided by R.
list.files() return all the files with extension in the specified directory. Same results is obtained with dir(), on the other if you want to obtain all the directory in a specific path you might want to use list.dir
#return the entire list of files in the folder specified "." refers to the working directory, you can
list.files(".") [1] "01_Introduction_to_R.html" "01_Introduction_to_R.Rmd"
[3] "01_Introduction_to_R_cache" "01_Introduction_to_R_files"
[5] "02_Introduction_to_QGIS.html" "02_Introduction_to_QGIS.md"
[7] "02_Introduction_to_QGIS.Rmd" "02_Introduction_to_QGIS_cache"
[9] "02_Introduction_to_QGIS_files" "cayman.css"
[11] "Code History" "Exercise"
[13] "GIS" "handouts"
[15] "logo-iuss.jpg" "raster venezia 90m.tif"
[17] "rsconnect" "Script_03_10.pdf"
[19] "Script_03_10.R" "style.css"
[21] "Tutorial-packages.html" "Tutorial-packages.Rmd"
[23] "Tutorial-packages_cache" "Tutorial-packages_files"
[25] "Tutorial packages.Rmd"
The unzip function is not relevant to the success of this script but it might still be useful to know. The function will extract the files from a zip archive and put it into a folder of your choice.
The data are loaded with the function provided by the base package of R.
#load csv
ny_airbnb <- read.csv(".\\Exercise\\Data\\AB_NYC_2019.csv")
#explore dataset
dim(ny_airbnb)[1] 48895 16
id name host_id
1 2539 Clean & quiet apt home by the park 2787
2 2595 Skylit Midtown Castle 2845
3 3647 THE VILLAGE OF HARLEM....NEW YORK ! 4632
4 3831 Cozy Entire Floor of Brownstone 4869
5 5022 Entire Apt: Spacious Studio/Loft by central park 7192
6 5099 Large Cozy 1 BR Apartment In Midtown East 7322
host_name neighbourhood_group neighbourhood latitude longitude
1 John Brooklyn Kensington 40.64749 -73.97237
2 Jennifer Manhattan Midtown 40.75362 -73.98377
3 Elisabeth Manhattan Harlem 40.80902 -73.94190
4 LisaRoxanne Brooklyn Clinton Hill 40.68514 -73.95976
5 Laura Manhattan East Harlem 40.79851 -73.94399
6 Chris Manhattan Murray Hill 40.74767 -73.97500
room_type price minimum_nights number_of_reviews last_review
1 Private room 149 1 9 2018-10-19
2 Entire home/apt 225 1 45 2019-05-21
3 Private room 150 3 0
4 Entire home/apt 89 1 270 2019-07-05
5 Entire home/apt 80 10 9 2018-11-19
6 Entire home/apt 200 3 74 2019-06-22
reviews_per_month calculated_host_listings_count availability_365
1 0.21 6 365
2 0.38 2 355
3 NA 1 365
4 4.64 1 194
5 0.10 1 0
6 0.59 1 129
'data.frame': 48895 obs. of 16 variables:
$ id : int 2539 2595 3647 3831 5022 5099 5121 5178 5203 5238 ...
$ name : Factor w/ 47906 levels "","'Fan'tastic",..: 12661 38172 45171 15702 19366 25001 8337 25048 15597 17682 ...
$ host_id : int 2787 2845 4632 4869 7192 7322 7356 8967 7490 7549 ...
$ host_name : Factor w/ 11453 levels "","'Cil","-TheQueensCornerLot",..: 5051 4846 2962 6264 5982 1970 3601 9699 6935 1264 ...
$ neighbourhood_group : Factor w/ 5 levels "Bronx","Brooklyn",..: 2 3 3 2 3 3 2 3 3 3 ...
$ neighbourhood : Factor w/ 221 levels "Allerton","Arden Heights",..: 109 128 95 42 62 138 14 96 203 36 ...
$ latitude : num 40.6 40.8 40.8 40.7 40.8 ...
$ longitude : num -74 -74 -73.9 -74 -73.9 ...
$ room_type : Factor w/ 3 levels "Entire home/apt",..: 2 1 2 1 1 1 2 2 2 1 ...
$ price : int 149 225 150 89 80 200 60 79 79 150 ...
$ minimum_nights : int 1 1 3 1 10 3 45 2 2 1 ...
$ number_of_reviews : int 9 45 0 270 9 74 49 430 118 160 ...
$ last_review : Factor w/ 1765 levels "","2011-03-28",..: 1503 1717 1 1762 1534 1749 1124 1751 1048 1736 ...
$ reviews_per_month : num 0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
$ calculated_host_listings_count: int 6 2 1 1 1 1 1 1 1 4 ...
$ availability_365 : int 365 355 365 194 0 129 0 220 0 188 ...
#write a simple function that return information on dataframe. Using the syntax needed for fucntions, firts we check is the argument passed through is a dataframe. Then, we bind the output that we want the function to return into a list
info <- function(x){
if (is.data.frame(x) == FALSE) {
print("Error: you are not using a data.frame")
}else{
info_df <- list(dimension=dim(x),top=head(x),bottom=tail(x))
}
return(info_df)
}
info(ny_airbnb)$dimension
[1] 48895 16
$top
id name host_id
1 2539 Clean & quiet apt home by the park 2787
2 2595 Skylit Midtown Castle 2845
3 3647 THE VILLAGE OF HARLEM....NEW YORK ! 4632
4 3831 Cozy Entire Floor of Brownstone 4869
5 5022 Entire Apt: Spacious Studio/Loft by central park 7192
6 5099 Large Cozy 1 BR Apartment In Midtown East 7322
host_name neighbourhood_group neighbourhood latitude longitude
1 John Brooklyn Kensington 40.64749 -73.97237
2 Jennifer Manhattan Midtown 40.75362 -73.98377
3 Elisabeth Manhattan Harlem 40.80902 -73.94190
4 LisaRoxanne Brooklyn Clinton Hill 40.68514 -73.95976
5 Laura Manhattan East Harlem 40.79851 -73.94399
6 Chris Manhattan Murray Hill 40.74767 -73.97500
room_type price minimum_nights number_of_reviews last_review
1 Private room 149 1 9 2018-10-19
2 Entire home/apt 225 1 45 2019-05-21
3 Private room 150 3 0
4 Entire home/apt 89 1 270 2019-07-05
5 Entire home/apt 80 10 9 2018-11-19
6 Entire home/apt 200 3 74 2019-06-22
reviews_per_month calculated_host_listings_count availability_365
1 0.21 6 365
2 0.38 2 355
3 NA 1 365
4 4.64 1 194
5 0.10 1 0
6 0.59 1 129
$bottom
id name host_id
48890 36484363 QUIT PRIVATE HOUSE 107716952
48891 36484665 Charming one bedroom - newly renovated rowhouse 8232441
48892 36485057 Affordable room in Bushwick/East Williamsburg 6570630
48893 36485431 Sunny Studio at Historical Neighborhood 23492952
48894 36485609 43rd St. Time Square-cozy single bed 30985759
48895 36487245 Trendy duplex in the very heart of Hell's Kitchen 68119814
host_name neighbourhood_group neighbourhood latitude
48890 Michael Queens Jamaica 40.69137
48891 Sabrina Brooklyn Bedford-Stuyvesant 40.67853
48892 Marisol Brooklyn Bushwick 40.70184
48893 Ilgar & Aysel Manhattan Harlem 40.81475
48894 Taz Manhattan Hell's Kitchen 40.75751
48895 Christophe Manhattan Hell's Kitchen 40.76404
longitude room_type price minimum_nights number_of_reviews
48890 -73.80844 Private room 65 1 0
48891 -73.94995 Private room 70 2 0
48892 -73.93317 Private room 40 4 0
48893 -73.94867 Entire home/apt 115 10 0
48894 -73.99112 Shared room 55 1 0
48895 -73.98933 Private room 90 7 0
last_review reviews_per_month calculated_host_listings_count
48890 NA 2
48891 NA 2
48892 NA 2
48893 NA 1
48894 NA 6
48895 NA 1
availability_365
48890 163
48891 9
48892 36
48893 27
48894 2
48895 23
The dataset used in the first part of this tutorial is Airbnb listings and metrics in NYC, NY, USA (2019) More information here
dplyr
This package provides great tools for data manipulation, especially when dealing with data frames. Its use spreaded thanks to a specific and easy to understand grammar (verbs) aimed at data manipulation and to the fast performance due to the exploitation of C++.
In the following chunk I’ll introduce the main verbs of dplyr and comparing its grammar with the normal syntax that one would use with basic indexing.
(The dplyr:: is used due to conflict with function name used in multiple packages).
#dplyr----
library(dplyr)
# select() select columns
# filter() filter rows
# arrange() re-order or arrange rows
# mutate() create new columns
# summarise() summarise values
# group_by() allows for group operations in the “split-apply-combine” concept
# ends_with() = Select columns that end with a character string
# contains() = Select columns that contain a character string
# matches() = Select columns that match a regular expression
# one_of() = Select columns names that are from a group of names
# everything() = Select all column
# last_col() = Select last column also with offsetThe pipe operator %>%
The pipe operator is used to emphasise a sequence of actions you want to apply to one primary object. This operator make it possible to keep the code well-ordered and to pass the object to several functions without creating “nested” code (more on this later in the code).
The pipe operator basically takes the object at the beginning of the line, usually a dataframe, and pass the object through the function at the right the operator, creating this way a literal pipeline of code.
The first functions we’re going to look at are select(), slice() and filter(). The first function subset the data based on the column you specify. The columns can be chosen the same way as with base subsetting plus some interesting way provided from the package tidyselect. slice() is the equivalent of select but for rows. filter() instead, subset the data row-wise where the conditions given for the filtering are true, that means it subset evaluating the result of a condition.
#Select
ny <- ny_airbnb[,1:10] #normal indexing
ny <- dplyr::select(ny_airbnb,1:10) #dplyr
ny <- ny_airbnb %>% dplyr::select(.,1:10) #dplyr#slice
ny <- ny_airbnb[1:5,] #normal indexing
ny <- slice(ny_airbnb,1:5) #dplyr
ny <- ny_airbnb %>% slice(1:5) #dplyr#filter
ny <- ny_airbnb[ny_airbnb$host_name == "Jessica",] #normal indexing
ny <- filter(ny_airbnb,host_name=="Jessica") #dplyr
ny <- ny_airbnb %>% filter(host_name=="Jessica") #dplyrgroup_by and summarise go hand-in-hand. The former takes a dataframe (or table) and creates a new table based on groups defined by variables. Let’s take our dataset, we have variables such as neighbourhood and host_name that repeat some values several time. Let’s say I want to perform an analysis per neighbourhood rather than per house, group_by allows us to create a new dataframe grouped by neighbourhood.
summarise adds another layer. It collapses each group into a single row summary, applying an aggregating or summary function to each group. Let’s assume I want to obtain how many people, amid the host of New York, share the same name. Without using dplyr I’d do something like this:
#group_by & summarise
name <- unique(ny_airbnb$host_name) #Find the unique names in the column
x <- data.frame(name=numeric(),number=numeric()) # Initialize dataframe that will store the information
for (i in 1:length(name)) { #Loop where at each iteration I compute how many i-th name are in the dataset
x[i,"name"] <- as.character(name[i])
x[i,"number"] <- length(ny_airbnb$host_name[ny_airbnb$host_name == name[i]])
}
head(x[order(x$number,decreasing = T),],2) name number
131 Michael 417
216 David 403
With dplyr would look something like this:
ny <- ny_airbnb %>%
group_by(host_name) %>%
dplyr::summarise(n=n())
head(ny[order(ny$n,decreasing = T),],2)# A tibble: 2 x 2
host_name n
<fct> <int>
1 Michael 417
2 David 403
At the end the result is the same but besides the aesthetics of the code it’s easy to verify the difference in computing speed of the two methods (check system.time() function or rbenchmark package) .
Now, I want to find the count of the names for each neighbourhood and filter out those name that appear less than 10 times to cancel the noise. Here is the “nested” effect previoulsy mentioned neglecting the use of the pipe operator.
# A tibble: 2 x 3
# Groups: host_name [1]
host_name neighbourhood n
<fct> <fct> <int>
1 Adam East Village 11
2 Adam Murray Hill 24
Using the Pipe operator:
ny <- ny_airbnb %>%
group_by(host_name,neighbourhood) %>%
dplyr::summarise(n=n()) %>%
filter(n>10)
head(ny,2)# A tibble: 2 x 3
# Groups: host_name [1]
host_name neighbourhood n
<fct> <fct> <int>
1 Adam East Village 11
2 Adam Murray Hill 24
A final example comprehensive of all the functions seen and also those provided by tidyselect. Let’s find out which are the more expensive neighbourhood in NY. I want the max,min and avg price the mean number of reviews and finally I want to add a column (mutate) assigning the ranking of each district and finally put them in order (arrange) based on the costliness of the hosting facilities.
ny <- ny_airbnb %>%
dplyr::select(.,starts_with("n"),starts_with("p"),starts_with("calcula")) %>% filter(price > 100) %>%
group_by(neighbourhood) %>%
summarise(expensive=max(price),cheap=min(price),avg_price=mean(price),
num_rev=mean(number_of_reviews)) %>%
arrange(desc(expensive)) %>%
mutate(ranking=seq(1:length(expensive)))
head(ny,2)# A tibble: 2 x 6
neighbourhood expensive cheap avg_price num_rev ranking
<fct> <int> <int> <dbl> <dbl> <int>
1 Astoria 10000 101 213. 19.8 1
2 Greenpoint 10000 102 200. 17.6 2
ggplot2
ggplot2 is a package used for data visualization based on The Grammar of Graphics. Every plot starts with ggplot() to which at least a dataframe is passed and creates a ggplot object. You then add on layers, scales, faceting specifications and much more through a +.
Either in ggplot() or the layers (e.g. geom_point())one must specify the aesthetics, aes(), which are the things that you can see.
As for dplyr a comparison between ggplot2 and basic plotting tools is carried out to highlight the differences and the advantages.
In this part of the tutorial will be used also the cars dataset provided by R which contains two columns: speed of cars and distance needed to stop. (Just for completeness: data are from the 1920s)
#ggplot2----
library(ggplot2)
#ggplot2 starts with the same syntax that creates a ggplot object and then you add elements to fine tune the plot
#Used with other packages can also create interactive plotI want to see the two variables in a scatterplot and I’d also expect a decent correlation between the two variables, therefore I want ot fit also a linear regression through the data.
lm fit a linear model to the data given a formula. In this case we want the model to predict the distance given the speed.
#basic plot
plot(cars$speed,cars$dist, col="red",pch=19,cex=2,type="p", main = "This is a scatterplot",
xlim = c(0,30))
points(10,60,col="blue",pch=19,cex=5) #this function adds points to wherever I want it
# abline(h=60)
# abline(v=10)
abline(a = lm(formula = dist ~ speed, data = cars)$coeff[1],
b = lm(formula = dist ~ speed, data = cars)$coeff[2], col ="orange") Fig.1 Scatterplot with R basic function
lm(formula = dist ~ speed, data = cars)$coeff[1] this is equivalent to: 1. fit the linear model 2. extract one-by-one the coefficients
(Intercept) speed
-17.579095 3.932409
(Intercept)
-17.57909
ggplot2 allows to save the plot into a variable that you can later use to add more layer
#ggplot
p1 <- ggplot(data = cars)+
theme_bw()+
theme(axis.title = element_text(size = 16, face = "bold"), #this are customization parameter that we could easily neglect
axis.text = element_text(size = 14),
text= element_text(face="bold"),
plot.title = element_text(color="Dodgerblue",size=17),
strip.text = element_text(size = 14, colour ="darkgreen",hjust = 0),
strip.background = element_rect(fill = "wheat") )
#add to the base ggplot object the points of the two variables with geom_point and the linear regression with geom_smooth, which will display also the confidence interval
p1+geom_point(aes(x=speed,y=dist))+geom_smooth(aes(x=speed,y=dist),method = "lm")Fig.2 Scatterplot with ggplot2
Going back to the NY airbnb dataset, let’s make some histogram and boxplot. I’m curious about the distribution of the price for the three types of room listed:
- Private Room
- Shared Room
- Entire home/apartment
#Filter data holding host with at least a review and with a price per house of more than 150$
df_ny <- ny_airbnb %>% filter(price >150,number_of_reviews >= 1)
#basic plotting
par(mfrow=c(2,2)) #This is needed to plot the 3 histograms on the same"page". It divides the plotting device into a grid with 2 rows and 2 columns
hist(df_ny$price[df_ny$room_type == "Private room"],xlab="Price in dollar",breaks = seq(0,max(df_ny$price),100),main = "This is an histogram",col = "red")
hist(df_ny$price[df_ny$room_type == "Shared room"], xlim = c(0,6000),breaks = seq(0,max(df_ny$price),100),main = "This is an histogram",col = "blue")
hist(df_ny$price[df_ny$room_type == "Entire home/apt"],breaks = seq(0,max(df_ny$price),100),main = "This is an histogram",col = "green")
#dev.off() #this is used to return to default plotting configuration that was changed previously with par()Fig.3 Histogram with R basic function
Using ggplot, I start a ggplot object providing the dataset and then I can add as much layer and customization as I want.
#This is the most basic histogram, with default binwidth = 30 and no customization whatsoever
ggplot(data = df_ny)+geom_histogram(aes(x=price))Fig.4 Basic Histogram with ggplot2
#Add some features to make it nice. theme_bw is one of the option for theme provided by ggplot, in theme you can customize each little piece of the plot
p1 <- ggplot(data = df_ny)+
theme_bw()+
theme(axis.title = element_text(size = 16, face = "bold"), #decide size and makes axis name bold
axis.text = element_text(size = 14), #decide size axis text
text= element_text(face="bold"),
plot.title = element_text(color="Dodgerblue",size=17), #Customize title of the plot
strip.text = element_text(size = 14, colour ="darkgreen",hjust = 0), #Customize the title of each panel containing the plot
strip.background = element_rect(fill = "wheat")) #Customize color of the panel containing title for each plot
p1+geom_histogram(aes(x=price,fill=room_type),binwidth = 10)+
facet_wrap(facets = vars(room_type),nrow = 2,scales = "free")+
labs(title = "This is a histogram",subtitle = "Analysis on Airbnb data in NY",
x="Price per night in $", y="N° of rented house", fill="Type of room/apt")Fig.5 Faceted histogram with further customization
Boxplots behave the same way
#basic boxplot
boxplot(df_ny$price[df_ny$room_type == "Private room"],main = "This is a boxplot",colour = "red", width=0.1)Fig.6 boxplot with R basic function
#ggplot boxplot
p1+geom_boxplot(aes(y=price,col=room_type),outlier.shape = "H",outlier.size = 3,notch = FALSE,notchwidth = 0.5 )+
# facet_wrap(facets = vars(room_type),nrow = 2,scales = "free")+
labs(title = "This is a boxplot",subtitle = "Analysis on Airbnb data in NY",
x="", y="Price per night in $", col="Type of room/apt")+
theme(axis.text.x = element_blank())Fig.7 boxplot with ggplot2
sf
This pakcage offers the possibility to combine geometry of simple feature object with dasets containing attributes in a single object. Basically with sf we can manipulates spatial data that have attached data with them. It is the evolution of the sp package with improved speed and the ability to use the pipe operator.
Methods for sf object are:
I’m going to illustrate some of this function. Mainly creating a sf object from matrix of coordinates, change CRS to an sf object, read external vector data, and some interaction between sf object.
The NY Airbnb dataset has two columns with latitude and longitude let’s try to create a simple feature object with attached the data. The reference system of the coordinates is WGS84.
#sf----
library(sf)
#Creates an sf object with attached the data about Airbnb, using the coordinates provided in the columns and specifyin the CRS
sf_ny <- st_as_sf(df_ny,crs=4326,coords = c("longitude","latitude"))
head(sf_ny,3)Simple feature collection with 3 features and 14 fields
geometry type: POINT
dimension: XY
bbox: xmin: -73.98377 ymin: 40.69169 xmax: -73.97185 ymax: 40.75362
epsg (SRID): 4326
proj4string: +proj=longlat +datum=WGS84 +no_defs
id name host_id host_name
1 2595 Skylit Midtown Castle 2845 Jennifer
2 5099 Large Cozy 1 BR Apartment In Midtown East 7322 Chris
3 7097 Perfect for Your Parents + Garden 17571 Jane
neighbourhood_group neighbourhood room_type price minimum_nights
1 Manhattan Midtown Entire home/apt 225 1
2 Manhattan Murray Hill Entire home/apt 200 3
3 Brooklyn Fort Greene Entire home/apt 215 2
number_of_reviews last_review reviews_per_month
1 45 2019-05-21 0.38
2 74 2019-06-22 0.59
3 198 2019-06-28 1.72
calculated_host_listings_count availability_365
1 2 355
2 1 129
3 1 321
geometry
1 POINT (-73.98377 40.75362)
2 POINT (-73.975 40.74767)
3 POINT (-73.97185 40.69169)
#st_transfomr allows us to change the reference system of the object
sf_ny_2 <- st_transform(sf_ny,3857)
head(sf_ny_2,3)Simple feature collection with 3 features and 14 fields
geometry type: POINT
dimension: XY
bbox: xmin: -8235836 ymin: 4966972 xmax: -8234509 ymax: 4976068
epsg (SRID): 3857
proj4string: +proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 +x_0=0.0 +y_0=0 +k=1.0 +units=m +nadgrids=@null +wktext +no_defs
id name host_id host_name
1 2595 Skylit Midtown Castle 2845 Jennifer
2 5099 Large Cozy 1 BR Apartment In Midtown East 7322 Chris
3 7097 Perfect for Your Parents + Garden 17571 Jane
neighbourhood_group neighbourhood room_type price minimum_nights
1 Manhattan Midtown Entire home/apt 225 1
2 Manhattan Murray Hill Entire home/apt 200 3
3 Brooklyn Fort Greene Entire home/apt 215 2
number_of_reviews last_review reviews_per_month
1 45 2019-05-21 0.38
2 74 2019-06-22 0.59
3 198 2019-06-28 1.72
calculated_host_listings_count availability_365 geometry
1 2 355 POINT (-8235836 4976068)
2 1 129 POINT (-8234859 4975194)
3 1 321 POINT (-8234509 4966972)
#Trying to perform an intersection between two layer an error is returned since the 2 crs are different
st_intersection(sf_ny,sf_ny_2)Error in geos_op2_geom("intersection", x, y): st_crs(x) == st_crs(y) is not TRUE
External data are loaded with the function st_read() specifying the name of the layer you want to load and the folder that contains it. When plotting sf objects if you don’t specify the geometry column you will get a plot for each attribute in the dataframe, i.e. 1 plot for each column. Try the command plot(usa).
Now will load the shapefile of the United States with the border of the 50 states and the shapefile of Italy at the region detail (both polygon simple feature). Also, will load the shapefile of the capital of each province in Italy and extract the centroid from the municipality of each city in order to get a point object.
Reading layer `gadm36_USA_1' from data source `G:\Il mio Drive\Lecture Hydrological risk\Exercise\Shp' using driver `ESRI Shapefile'
Simple feature collection with 51 features and 10 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: -179.1506 ymin: 18.90986 xmax: 179.7734 ymax: 72.6875
epsg (SRID): 4326
proj4string: +proj=longlat +datum=WGS84 +no_defs
Fig.8 Map of the United States at the federal state level
Reading layer `gadm36_ITA_1' from data source `G:\Il mio Drive\Lecture Hydrological risk\Exercise\Shp\Italy_shapefile\gadm36_ITA_1.shp' using driver `ESRI Shapefile'
Simple feature collection with 20 features and 10 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: 6.630879 ymin: 35.49292 xmax: 18.52069 ymax: 47.09096
epsg (SRID): 4326
proj4string: +proj=longlat +datum=WGS84 +no_defs
#Plot of Lombardia and Veneto
plot((italy %>% filter(NAME_1 %in% c("Lombardia","Veneto")))$geometry)Fig.9 Map of Lombardia and Veneto
#st_union merges two geometries. Here I merge Lombardia and Veneto
vl <- st_union(italy %>% filter(NAME_1 %in% c("Lombardia","Veneto")))
plot(vl)Fig.10 Example of st_union() merging the border between Lomabardia and Veneto
#load capoluoghi
cap <- st_read(".\\Exercise\\Shp\\Italy_shapefile\\blog_post_capoluoghi\\capoluoghi_provincia_2019.shp")Reading layer `capoluoghi_provincia_2019' from data source `G:\Il mio Drive\Lecture Hydrological risk\Exercise\Shp\Italy_shapefile\blog_post_capoluoghi\capoluoghi_provincia_2019.shp' using driver `ESRI Shapefile'
Simple feature collection with 109 features and 11 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: 363053.9 ymin: 4084271 xmax: 1290591 ymax: 5155671
epsg (SRID): 32632
proj4string: +proj=utm +zone=32 +datum=WGS84 +units=m +no_defs
Fig.11 Municipality extension of the provincial capitals
Fig.12 Map of the etracted centroids
tmap
It is a map-making package, it has a concise syntax and is based on the Grammar of graphics just like ggplot2. It is capable to generate static and interactive map and can be fed with several type of spatial data. Just like ggplot2 there is a basic building block, tm_shape(), which creates a tmap-element that defines input data. This basic block is followed by one or more layer used to customize the spatial data object provided in the initial block. These layers are add with a +. Next, I’m going to plot some feature of the Airbnb dataset.
#tmap----
library(tmap)
tmap_mode("plot") #this command dictates if you want a static map "plot" or a interactive map ("view")
#This is a basic plot I provide the spatial data to tm_shape() and I tell tmap to plot the houses as points
tm_shape(sf_ny_2)+
tm_dots()Fig. 13 Basic static map
#This is a basic plot I provide the spatial data to tm_shape() and I tell tmap to plot the houses as points
tm_shape(sf_ny_2)+
tm_symbols(shape = 25)Fig. 14 Basic static map with different symbols for each point
In the interactive mode tm_basemap() allows us to plot our spatial data on tiles obtained from a provider.
#The tmap package works a little bit as ggplot ypu build your map piece by piece starting from a base"tmap_element"
tmap_mode("view")
tm_basemap(server = leaflet::providers$OpenStreetMap)+
tm_shape(sf_ny_2)+
tm_dots(col="room_type",alpha = 0.6, popup.vars = c("price"))+
tm_layout(legend.show = FALSE)Some options are not available in the static mode. For example, popup.vars gives us the chance to select which variables are shown in the clickable popup.
tmap_mode("plot")
tm_shape(shp = (italy %>% mutate(.,population = rnorm(20,mean = 3e6,sd = 2e6))) )+
tm_polygons(alpha=0.5,col="population", popup.vars=c("NAME_1","population"),palette="viridis", title = "Popolazione")+
tm_layout(legend.outside = TRUE, legend.title.color = "Dodgerblue",
title = "Region breakdown of Italy", title.position = c("LEFT","TOP"))+
tm_shape(st_transform(cap,4326))+
tm_markers(clustering = FALSE, popup.vars="comune") Fig. 16 Basic static map of NY Airbnb hosting facilities with markers
raster
This package provides tool for creating, manipulating, reading and writing raster spatial data. In this last part of the tutorial I’ll be workink with raster layer of Italy. This are data from the Shuttle Radar Topography Mission sampled at 3 arc-seconds that means with a resolution of about 90m (each pixel is 90 meters by 90 meters). raster() reads the raster layer provided
#raster----
library(raster)
#load the 4 block
r1 <- raster(".\\Exercise\\Dem\\srtm_38_03.tif")
r2 <- raster(".\\Exercise\\Dem\\srtm_38_04.tif")
r3 <- raster(".\\Exercise\\Dem\\srtm_39_03.tif")
r4 <- raster(".\\Exercise\\Dem\\srtm_39_04.tif")
#merge the blocks
r_tot <- merge(r1,r2,r3,r4)
plot(r_tot)Fig.17 Raster map of Italy
Fig.18 Plot of the extent, frame, that contains the raster
The clip function saw in QGIS is equivalent to crop() in the raster package. This command returns a subset of the raster by a specified extent. So to speak, it takes the extent of th object you want ot use to clip the raster and execute the crop() command.
mask() instead, specifically create a new raster object carved out from the border of the ‘mask’ layer.
#crop,mask the raster with shape file
veneto <- italy %>% filter(.,NAME_1 == "Veneto")
r_crop <- crop(r_tot,veneto)
r_mask <- raster::mask(r_crop,veneto)
par(mfrow=c(1,2))
plot(r_crop)
plot(r_mask)Fig.19 Difference between cropped and masked maps
Raster object are grids with values in their cells, thus, they can be seen as large matrix and as such they can be manipulated like you would with normal matrix, by indexing.
Let’s assume I want to filter some of the data in the raster. I would like to replace depression points of the map (elevation < 0 m a.s.l.) with normally randomly distributed points with mean -500 and standard deviation 2.
#filter values of the data
r_fill <- r_mask
r_fill[ r_fill < 0] <- rnorm(n = length(r_fill[ r_fill < 0]),mean = -500,sd = 2)
plot(r_mask, col=terrain.colors(5))Fig.20 Comparison between unfiltered map and map with altered pixels
Fig.21 Zoomed in raster map playing with the extent of the map
Fig.20 Comparison between unfiltered map and map with altered pixels
Here I crop the map to a fixed extent rather than to a layer data.
r_ven <- crop(r_fill,extent(c(12,12.5,45.15,45.5)))
plot(r_ven)
points(12.34629,45.46773,pch=19,col="blue",cex = 1)Fig.22 Map cropped to a given extent with the addition of a customized point
Below some useful functions usable to retrieve informations about the raster:
click(): allows you to obtain information from raster by clicking directly on the map,n()is the number of click you might want to usecellFromXY,xyFromCell: this family of functions makes it possible to information from a cell specifying the raster and the coordinates or viceversa expliciting the number of the cell getting back the coordinates and the cell values
#click
click(r_ven,n=1,xy=TRUE,cell=TRUE)
cellFromXY(r_ven,c(12.22,45.36667))
xyFromCell(r_ven,95665)extract comes very handy wen you wanto to obtain values from a raster at the location of other spatial data. You can fed to the function coordinates (points), lines, polygons or an Extent (rectangle) object.
#extract value of a pixel
raster::extract(r_fill,y=cap %>% st_transform(.,4326) %>% filter(.,cod_reg == 5) %>% st_centroid() %>% st_coordinates())[1] 19 111 371 NA 12 33 6
The function stack() returns a collection of raster layer with same spatial extent and resolution. For example can be used to put several flood hazard map, with different return period in the same object. Here, I’m going to stack together the original map of Veneto with the map of Veneto where the depression point are replaced by randomly distributed points.
Fig.23 Rasterstack object with 2 layers
Final example plotting and hillshade raster in tmap. This might look a little bit cumbersome with due respect to QGIS.
#terrain compute slope, aspect and other characteristics of the dem they are both output that are fed to hillShade()
r_slope <- terrain(r_crop, opt=c("slope"))
r_Aspect <- terrain(r_crop, opt=c("aspect"))
#Creates hill shade layer
r_hill <- hillShade(slope = r_slope,aspect = r_Aspect,angle = 40,direction = 270)
#Basic plot
plot(r_hill, col=grey(0:100/100), legend=FALSE, main='venezia')Fig.24 Hillshade on plot()
#Using tmap
tm_shape(r_hill) +
tm_raster(palette = gray(0:100 / 100), n = 100, legend.show = FALSE) +
tm_shape(r_crop) +
tm_raster(alpha = 0.6, palette = terrain.colors(30),
legend.show = FALSE)Fig.25 Hillshade and DEM on tmap
Like any other type of data, raster can be written to local memory with writeRaster() passing as necessary arguments the name of the object and the name of the file with the extension.
#writeraster
writeRaster(r_ven,"raster venezia 90m.tif",overwrite=TRUE)
writeRaster(r_tot,".\\Exercise\\raster 90m.tif",overwrite=TRUE)#References