In this R tutorial, I will show you how to conduct some simple analyses on your data, including:

Please note, you might need to install some packages

Part One: Data cleaning and summary

First things first, let’s get organized and either use your existing RStudio project from a previous week or set-up a new RStudio project following the steps below.

Click the “File” menu button (top left corner), then “New Project” Click “New Directory” Click “New Project” In “Directory name”, type the name of your project, for example “EVB203” Browse and select a folder where to locate your project (~ is your home directory). For example, a folder called “R_tutorials” in “EVB203”: ~/EVB203/R_tutorials. Click the “Create Project” button Create three sub-folders/directories in your project. Run:

dir.create("scripts")
dir.create("data")
dir.create("plots")

Or click on the New Folder button in the Files pane in the lower right quadrant of RStudio and type in the folder names.

# move data to your data folder or you can use "setwd" to tell R where your data is stored
setwd("C:/Users/N11094427/Downloads") #change this to your directory

# import the class data
mydata <- read.csv("StudentDataSet_2023.csv")

# usually, I like to check my data before I dive into analysis
nrow(mydata)
## [1] 516
summary(mydata)  ### have a look on your data
##    wkt_geom             Time               Obs                Light       
##  Length:516         Length:516         Length:516         Min.   :     8  
##  Class :character   Class :character   Class :character   1st Qu.:  2260  
##  Mode  :character   Mode  :character   Mode  :character   Median : 13527  
##                                                           Mean   : 34012  
##                                                           3rd Qu.: 50706  
##                                                           Max.   :692141  
##      name               date              author             Cover          
##  Length:516         Length:516         Length:516         Length:516        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     Humidity          Long          Lat                 Temp      
##  Min.   :21.43   Min.   :153   Min.   :-27470.00   Min.   :21.08  
##  1st Qu.:56.17   1st Qu.:153   1st Qu.:   -27.47   1st Qu.:23.18  
##  Median :65.16   Median :153   Median :   -27.47   Median :24.86  
##  Mean   :62.60   Mean   :153   Mean   :   -80.65   Mean   :29.73  
##  3rd Qu.:72.26   3rd Qu.:153   3rd Qu.:   -27.47   3rd Qu.:27.38  
##  Max.   :87.17   Max.   :153   Max.   :   -27.46   Max.   :75.90

We can find some problems here, for example, there are some points located far away from Australia, whose Lat is -27470. We can delete those points by using their Lats.

mydata <- subset(mydata, mydata$Lat > -30.0) # by using subset, we only keep data that lattitude is greater than -30
nrow(mydata)
## [1] 515

We can create histograms to see the distributions of temperature, light, and humidity

hist(mydata$Temp)

hist(mydata$Light)

hist(mydata$Humidity)

There are some temperatures ranging from 55 to 80 celsius, this is abnormal, check the data and think why is that and what should we do? Remove them or find a way to fix it?

Now, let’s create scatterplots to compare the relationship between humidity vs temperature, also we can test their covariances

plot(x = mydata$Humidity,y = mydata$Temp,
     xlab = "Humidity",
     ylab = "Temperature",
     main = "Humidity vs Temperature"
)

We can also test the linerity between Humidity and Temperature

Hum_Temp <- lm(Temp ~ Humidity, data = mydata)
summary(Hum_Temp) # negatively related
## 
## Call:
## lm(formula = Temp ~ Humidity, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.446  -6.343  -1.431   1.489  38.179 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 68.18351    2.12898   32.03   <2e-16 ***
## Humidity    -0.61436    0.03319  -18.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.62 on 513 degrees of freedom
## Multiple R-squared:  0.4004, Adjusted R-squared:  0.3992 
## F-statistic: 342.6 on 1 and 513 DF,  p-value: < 2.2e-16

Let’s visualize it

library(ggplot2)
library(ggpmisc)
## Loading required package: ggpp
## 
## Attaching package: 'ggpp'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
ggplot(data = mydata, aes(x = Humidity, y = Temp)) +
  stat_poly_line() +
  stat_poly_eq() +
  geom_point()

If we can remove those abnormal temperture, it seems that we can get more accurate regression.

We can also take the four types of cover into consideration

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mydata %>%
  ggplot( aes(x=Humidity, y=Temp, group=Cover, color=Cover)) +
  geom_point() +
  ggtitle("Humidity - Temperature") +
  ylab("Temerature")

The result shows that the column of cover needs to be unified, for example, we have five groups of concrete: Concete, concrete, Concrete, Concrete(canopy), Concrete(shade)

mydata$Cover[grepl("onc", mydata$Cover, ignore.case=FALSE)] <- "Concrete"
# I used function "grepl" to replace all strings that contain "onc" to Concrete

mydata %>%
  ggplot( aes(x=Humidity, y=Temp, group=Cover, color=Cover)) +
  geom_point() +
  ggtitle("Humidity - Temperature") +
  ylab("Temerature")

Thank god, we now only have one group of concrete, you can merge other groups in your way.

ggplot(mydata, aes(x=Cover, y=Humidity)) + 
  geom_boxplot(color="red", fill="orange", alpha=0.2)

After all cover types been unified, your box-plot will be cleaner

Part two: Spatial mapping

# load the spatial libraries
library(sp)
library(rgdal)
library(rgeos)
library(tmap)
library(sf)
library(spatstat)  
library(maptools)
library(tidyverse)
library(gstat)

# we need to use the latitude and longtitude information
mydata_WSG84 <- st_as_sf(mydata, coords = c("Long", "Lat"), crs = 4326) # give the location to your points

# to make your map more accurate, we reproject your points to projection:WGS 84 / UTM zone 56S
# in r code , it is "+proj=utm +zone=56 +south +datum=WGS84 +units=m +no_defs +type=crs"
mycrs <- paste0("+proj=utm +zone=56 +south +datum=WGS84 +units=m +no_defs +type=crs")
mydata_zone56 <- st_transform(mydata_WSG84, crs=mycrs) # give the location to your points

have a look with a openstreet map as background

tmap_mode("view")
## tmap mode set to interactive viewing
points_map <- tm_shape(mydata_zone56) + tm_dots(col = "Humidity") 
points_map #looks fine

We need a boundary to mask our study area. Let’s make one.

x_coord <- c(153.015,  153.02,  153.025, 153.029, 153.035, 153.033, 153.031, 153.027, 153.025, 153.020, 153.015)
y_coord <- c(-27.460, -27.470,-27.475, -27.480, -27.480, -27.475, -27.469, -27.465, -27.465, -27.46, -27.460)
xym <- cbind(x_coord, y_coord)

boundary = st_polygon(list(xym))
boundary_WSG84 = st_sfc(boundary, crs= 4326)
boundary_zone56 <- st_transform(boundary_WSG84, crs=mycrs)

### add boundary to our map
points_map_with_boundary <- points_map + tm_shape(boundary_zone56)+ tm_borders(lwd = 2)
points_map_with_boundary

Part Three: Spatial Interpolation

here we can use the concept of voronoi diagram: this is type of the nearest neighbor analysis # more info: https://mathworld.wolfram.com/VoronoiDiagram.html#:~:text=points%20into%20convex%20polygons%20such,known%20as%20a%20Dirichlet%20tessellation.

library(terra)
mydata_zone56_V <- vect(mydata_zone56) # we’ll need to convert mydata_zone56 to S4 objects using vect()
boundary_zone56_V <- vect(boundary_zone56) # vectlise our boundary too
v <- voronoi(mydata_zone56_V)
plot(v)
points(mydata_zone56_V)

mydata_Voronoi <- crop(v, boundary_zone56_V) #we only care about the area within our boundary
plot(mydata_Voronoi, "Humidity")

# rasterize
r <- rast(mydata_Voronoi, res=10)  # Builds a blank raster of given dimensions and resolution  
# the unit of our projection is meter, so res = 10, means each grid is 100 squre metre big.

vr <- rasterize(mydata_Voronoi, r, "Humidity") # we want to rasterize our result to small pixels, so that our analysis can be more accurate
plot(vr)

let’s do some interpolation

library(gstat)

d <- data.frame(geom(mydata_zone56_V)[,c("x", "y")], as.data.frame(mydata_zone56_V)) #we transfer mydata_zone56_V to dataframe

gs <- gstat(formula=Humidity~1, locations=~x+y, data=d, nmax=Inf, set=list(idp=2))

idw <- interpolate(r, gs, debug.level=0) # r is the blank raster we just built

plot(idw)

idwr <- mask(idw, vr) # crop the result with the rasterised boundary vr
plot(idwr, 1)

It would be interesting to quantify some other factors that could influnce your results, for example, the relationship of humidity and the proximity to water (river or pond). Don’t be limited to our tutorials