GESC 258- Geographical Research Methods
Access the online version of this document at https://rpubs.com/majidhojati/gesc258lab1
RStudio is one of the most widely used software packages in data science, statistics and data analysis more generally. Many employers in the environmental field now greatly value skills in R, so even though the learning curve can be a little steeper, it is well worth the effort to learn this approach to data analysis.
Attend your first lab session to briefly meet your TA instructor and to get the course software installed on your computer. We will be using an open source software package called R, together with RStudio, for most of the work in this course.
In this lab you will:
Familiarize yourself with the RStudio interface
Learn the basics of R programming
Create simple graphs
As a supplement to the lab materials, I highly recommend [YaRrr! The Pirate’s Guide to R](https://bookdown.org/ndphillips/YaRrr/), which has lots of great information and resources for learning R. You can read Chapter 2 for how to get R installed on your machine. The TA will verify in lab that everyone has a working installation of R on their machine.
Part I: Getting started with R
To use R and RStudio, follow one of the two methods below.
1- Using Virtual Remote Desktop Labs (Recommended)
If you prefer to use RStudio on the [virtual remote desktop](https://students.wlu.ca/services-and-spaces/tech-services/assets/resources/virtual-computer-labs.html), log in to your online account and search for RStudio under Start -> All Programs.
“Make sure to set up OneDrive on your remote desktop machine to avoid losing your data. Ask your lab instructor if you need help setting up OneDrive.”
2- Installing R and RStudio on your machine
Please follow the steps explained in Chapter 2 of YaRrr to install R and RStudio on your machine: https://bookdown.org/ndphillips/YaRrr/installing-base-r-and-rstudio.html
Working With RStudio
Credit: This section is based on https://bookdown.org/ndphillips/YaRrr/
When you open RStudio, you’ll see four windows (also called panes). The four main sections are as follows:
Source - Your notepad for code (Top Left)
The Source pane is where you create and edit “R Scripts”, your collections of code. When you open RStudio, it will automatically start a new Untitled script. Before you start typing in an untitled R script, you should always save the file under a new file name.
“NOTE: If a * is shown beside your file name, it means you have unsaved changes in your source.”
You’ll notice that when you’re typing code in a script in the Source pane, R won’t actually evaluate the code as you type. To have R evaluate your code, you first need to send the code to the Console. There are many ways to send your code from the Source to the Console. The slowest way is to copy and paste. A faster way is to highlight the code you wish to evaluate and click the Run button at the top right of the Source pane.
Console: R’s Heart (Bottom Left)
The console is the heart of R. Here is where R actually evaluates code. At the beginning of the console you’ll see the character >. This is a prompt that tells you that R is ready for new code. For example, if you type 1+1 into the console and press Enter, you’ll see that R immediately gives an output of 2.
1+1
## [1] 2
Tip: Try to write most of your code in a document in the Source. Only type directly into the Console to debug or do quick analyses.
Tip: If you put # at the beginning of a line of R code, that line will not be executed. We usually use # at the beginning of lines to write comments in the code.
HINT: Throughout this document, a line or section that starts with ## is the output of the code and is shown only for your information.
Environment / History (Top Right)
The Environment tab of this panel shows you the names of all the data objects (like vectors, variables, and dataframes) that you’ve defined in your current R session. You can also see information like the number of observations (rows) and variables (columns) in data objects.
Files / Plots / Packages / Help (Bottom Right)
The Files / Plots / Packages / Help panel shows you lots of helpful information. Let’s go through each tab in detail:
Files - The files panel gives you access to the file directory on your hard drive. One nice feature of the “Files” panel is that you can use it to set your working directory - once you navigate to a folder you want to read and save files to, click “More” and then “Set As Working Directory.”
Plots - The Plots panel (no big surprise) shows all your plots. There are buttons for opening the plot in a separate window and exporting the plot as a pdf or jpeg (though you can also do this with code using the pdf() or jpeg() functions). You can use the Export button in this tab to save plots on your machine.
Packages - Shows a list of all the R packages installed on your hard drive and indicates whether or not they are currently loaded. Packages that are loaded in the current session are checked while those that are installed but not yet loaded are unchecked.
Help - Help menu for R functions. You can either type the name of a function in the search window, or type ? followed by the function name in the console (for example, ?hist) to open its help page.
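The Plots entry above mentions that plots can also be exported with code. A minimal sketch of that approach is shown here, assuming you want a PDF file. The output location (a temporary file) and the plotted values are illustrative choices, not part of the lab.

```r
# Hypothetical output location: a temporary file, so nothing in your project is overwritten
out_file <- tempfile(fileext = ".pdf")

pdf(out_file, width = 6, height = 4)   # open a PDF graphics device (size in inches)
plot(1:10, (1:10)^2,
     main = "Example plot",
     xlab = "x", ylab = "x squared")   # draw into the open device
dev.off()                              # close the device so the file is written to disk

file.exists(out_file)                  # TRUE if the export succeeded
```

Replacing pdf() with jpeg() should work in a similar way, although jpeg() measures width and height in pixels by default. Type ?pdf or ?jpeg in the console for the details.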
R as a simple calculator
You can do simple arithmetic in the R console. For example, type the following code and check its result:
100+23
## [1] 123
The following basic arithmetic operators can be used in R:
Exponents: ^ or **
Multiply: *
Divide: /
Add: +
Subtract: -
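As a quick illustration of these operators and the order in which R evaluates them, here is a small sketch. The numbers and the variable names (a, b, c1, c2, d) are arbitrary choices made for this example.

```r
a  <- 2 + 3 * 4     # multiplication is evaluated before addition
b  <- (2 + 3) * 4   # parentheses force the addition to happen first
c1 <- 2^3           # exponent written with ^
c2 <- 2**3          # ** is accepted as an alternative to ^
d  <- 7 / 2         # division returns a decimal value

a    # 14
b    # 20
c1   # 8
c2   # 8
d    # 3.5
```

When in doubt about the order of evaluation, adding parentheses is the safest option.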
You can also use other math functions as follows
sin(1) + log(3) * exp(1)
## [1] 3.827809
We can also do comparisons in R:
1 == 1 # equality (note two equals signs, read as "is equal to")
## [1] TRUE
Type the following commands and see their results
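For instance, a few more comparisons you could try are sketched below. These particular examples are illustrative suggestions, and the variable names r1 to r6 are arbitrary.

```r
r1 <- 1 != 2        # inequality, read as "is not equal to"
r2 <- 1 < 2         # less than
r3 <- 2 <= 2        # less than or equal to
r4 <- 1 > 2         # greater than

# Decimal numbers are stored with limited precision, so an exact
# comparison with == can give a surprising answer.
r5 <- (0.1 + 0.2) == 0.3                  # exact comparison
r6 <- isTRUE(all.equal(0.1 + 0.2, 0.3))   # comparison with a small tolerance

r1   # TRUE
r2   # TRUE
r3   # TRUE
r4   # FALSE
r5   # FALSE
r6   # TRUE
```

The last two lines are the reason all.equal() is often a better choice than == when you compare the results of calculations.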
Variables and assignment
We can store values in variables using the assignment operator <-, like this:
x <- 1/40
Type the following commands in the R console and see the outputs:
x <- 12
y <- log(x)
x
## [1] 12
y
## [1] 2.484907
then
x <- y # see how R replaces the value of x with the value of y
x - y
## [1] 0
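One point that often surprises new users is that a variable keeps the value it was given at the time of assignment. The short sketch below, which uses arbitrary numbers, shows that changing x later does not change a y that was computed from it.

```r
x <- 12
y <- log(x)   # y is computed from the current value of x
x <- 100      # reassigning x afterwards does not update y

x             # 100
y             # still log(12), about 2.484907
```

If you want y to reflect the new value of x, you need to run the line y <- log(x) again.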
Part II: Statistical Analysis in R
In this section, we’ll analyze a dataset called cities. The dataset contains data from 1748 population centers across Canada.
Run the following lines to load this dataset into R.
“Hint: type ?read.csv into the R console to learn how this function works.”
cities <- read.csv("https://raw.githubusercontent.com/am2222/GESC258/master/Lab1/data/canadian_population_centers.csv") # load the dataset into R as a dataframe
Once you run the code above, you will see that a new item called cities has been added to the Environment panel in RStudio. A very common data structure is an array. In different domains other names are used synonymously for this type of data, such as matrix in mathematics, table in databases, spreadsheet, and data frame, which is a fundamental R object. Click on cities in the Environment tab to open it in a new window. You can also type the following command in the R console to get the same result.
View(cities)
Explore the data. As you can see, there are different types of columns in this dataset. Some of the columns are in text format, such as city or province_id. In R we call this type of column characters or strings. Some of the columns are numerical, like population, which represents the population of each region. Scroll down the table and you will see that some of the values in the overall_csi_index column are shown as NA. This means that the value for this cell is missing. Notice that a missing value is different from a zero value. During your calculations, you should always be aware of NA values. Now, let’s take a look at the first few rows of the dataset using the head() function, which by default shows the first six rows of the dataframe.
head(cities) # Look at the first few rows of the data
X | city | city_ascii | province_id | province_name | lat | lng | population | density | overall_csi_index | overall_csi_rank | violent_csi_index | violent_csi_rank | nonviolent_csi_index | nonviolent_csi_rank |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Toronto | Toronto | ON | Ontario | 43.7417 | -79.3733 | 5429524 | 4334.4 | 57.84 | 168 | 90.41 | 95 | 45.99 | 196 |
2 | Montréal | Montreal | QC | Quebec | 45.5089 | -73.5617 | 3519595 | 3889.0 | 67.29 | 135 | 92.11 | 90 | 58.20 | 157 |
3 | Vancouver | Vancouver | BC | British Columbia | 49.2500 | -123.1000 | 2264823 | 5492.6 | 104.67 | 58 | 99.81 | 74 | 106.18 | 55 |
4 | Calgary | Calgary | AB | Alberta | 51.0500 | -114.0667 | 1239220 | 1501.1 | 79.96 | 101 | 78.26 | 118 | 80.38 | 95 |
5 | Edmonton | Edmonton | AB | Alberta | 53.5344 | -113.4903 | 1062643 | 1360.9 | 115.55 | 40 | 127.42 | 43 | 111.00 | 51 |
6 | Ottawa | Ottawa | ON | Ontario | 45.4247 | -75.6950 | 989567 | 334.0 | 49.05 | 197 | 56.38 | 194 | 46.30 | 192 |
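To make the earlier point about NA values concrete, here is a small sketch of how a missing value behaves in a calculation. The vector csi below contains made-up numbers and is not taken from the cities dataset.

```r
# Hypothetical values, including one missing value and one genuine zero
csi <- c(57.84, NA, 104.67, 0, 79.96)

is.na(csi)                # FALSE TRUE FALSE FALSE FALSE: only the NA is missing, 0 is a real value
sum(is.na(csi))           # 1, the number of missing values
mean(csi)                 # NA, because one value is unknown
mean(csi, na.rm = TRUE)   # 60.6175, the mean of the four known values
```

Many summary functions in R, such as mean(), sum(), min() and max(), accept the na.rm argument in this way.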
The following command shows the last few rows of a dataframe.
tail(cities)
You can look at the names of the columns in the dataset with the names() function
# What are the names of the columns?
names(cities)
## [1] "X" "city" "city_ascii"
## [4] "province_id" "province_name" "lat"
## [7] "lng" "population" "density"
## [10] "overall_csi_index" "overall_csi_rank" "violent_csi_index"
## [13] "violent_csi_rank" "nonviolent_csi_index" "nonviolent_csi_rank"
You can access dataframe columns with the $ operator (as in cities$population, used below) and rows and columns with square brackets (as in cities[row, column]).
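As a small illustration of accessing columns and rows, the sketch below builds a tiny dataframe and selects parts of it. The dataframe df and its values are hypothetical. It only borrows a few column names from the cities dataset so that the same patterns can be tried on cities afterwards.

```r
# Hypothetical three-row dataframe reusing some of the cities column names
df <- data.frame(city = c("A", "B", "C"),
                 province_id = c("ON", "QC", "ON"),
                 population = c(500, 1200, 800))

df$population                   # one column as a vector, selected by name with $
df[["population"]]              # the same column, selected with [[ ]]
df[2, ]                         # the second row, all columns
df[2, "city"]                   # a single cell: row 2 of the column city
df[, c("city", "population")]   # several columns, all rows
df[df$province_id == "ON", ]    # only the rows that satisfy a condition
```

Inside the square brackets, the part before the comma selects rows and the part after the comma selects columns. Leaving one side empty means "all".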
Measures of Central Tendency
Mean
There are three main measures of central tendency: the mean, the median, and the mode.
Let’s start with the mean:
\[ \bar x = \frac{1}{n}\sum_{i=1}^n x_i \]
c.population <- cities$population # extract the population column and store it as c.population; the dot (.) in the name has no special meaning
c.population.n <- length(c.population) # get the length of the population vector, which is the number of observations
c.population.sum <- sum(c.population) # sum up the population vector
c.population.xbar <- c.population.sum / c.population.n # divide the sum by the number of observations
sprintf("The arithmetic mean populations in the canadian cities dataset is: %s", round(c.population.xbar,1))
## [1] "The arithmetic mean populations in the canadian cities dataset is: 23100.2"
“Note that we used the length, sum and round functions in this code snippet. You can learn about each function by typing ? followed by the function name (for example, ?round) in the R console.”
Of course, there is also a built-in function called mean().
mean(c.population)
## [1] 23100.24
c.population.mean <- mean(c.population) # we save the mean for future use
You can check our result against the output of the mean function.
all.equal(mean(c.population),c.population.xbar )
## [1] TRUE
Median
The calculation of the median consists of the following two steps:
1. Rank the data set in increasing order.
2. Find the middle term. The value of this term is the median. If the number of observations is even, the median is the average of the two middle terms.
median(c.population) # returns the median of the c.population
## [1] 2944
c.population.median <- median(c.population) # we save the median for future use
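The two steps of the median calculation can also be written out by hand, which is a useful way to check your understanding. The function manual_median and the two example vectors below are hypothetical helpers written for this illustration. In practice you would simply call median().

```r
# Hypothetical values chosen for illustration
v_odd  <- c(7, 1, 5, 3, 9)
v_even <- c(7, 1, 5, 3, 9, 11)

manual_median <- function(v) {
  s <- sort(v)                      # step 1: rank the data in increasing order
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                  # step 2, odd n: the single middle term
  } else {
    mean(s[c(n / 2, n / 2 + 1)])    # step 2, even n: average of the two middle terms
  }
}

manual_median(v_odd)    # 5
manual_median(v_even)   # 6
```

Both results agree with median(v_odd) and median(v_even). Note that this sketch does not handle NA values.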
Other useful functions to explore a dataset:
c.population.min <- min(c.population) # returns the min value of c.population
c.population.max <- max(c.population) # returns the max value of c.population
city_min <- cities[cities$population==c.population.min,] # returns the row that has the min population
city_max <- cities[cities$population==c.population.max[1],] # returns the row that has the max population in the cities dataframe
sprintf("%s has max number of population and %s has min number of population", city_max$city,city_min$city)
## [1] "Toronto has max number of population and Oyen has min number of population"
The Mode
In statistics, the mode represents the most common value in a data set. Therefore, the mode is the value that occurs with the highest frequency in a data set (Mann 2012). You can calculate the mode of a column using the following method:
uniqv <- unique(c.population) # find all the unique values in c.population
tabulated <- tabulate(match(c.population, uniqv)) # count the number of times each unique value occurs in c.population
uniqv[which.max(tabulated)] # find the value with the maximum number of occurrences
## [1] 2011
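Because base R has no built-in function for the statistical mode (the existing mode() function reports how an object is stored, not its most frequent value), it can be convenient to wrap the steps above in a small function. The name stat_mode and the example vectors are hypothetical choices for this sketch.

```r
# Hypothetical helper that wraps the unique / match / tabulate steps
stat_mode <- function(v) {
  uniqv <- unique(v)                    # all distinct values
  counts <- tabulate(match(v, uniqv))   # how often each distinct value occurs
  uniqv[which.max(counts)]              # the first value that reaches the highest count
}

stat_mode(c(3, 5, 3, 8, 5, 3))    # 3
stat_mode(c("ON", "QC", "ON"))    # "ON"
```

If two values are tied for the highest frequency, this sketch returns only the one that appears first in the data.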
Let’s plot the population of Canadian cities together with their mean and median values.
plot(c.population, ylim = c(min(c.population),max(c.population))) #plot figure
abline(h = mean(c.population),
col='red',
lwd = 3) # add horizontal line
abline(h = median(c.population),
col='green',
lwd = 3) # add horizontal line
legend('topright',
legend = c("Median","Arithmetic mean"),
col = c("green","red"),
lty = "solid") # add legend
“The median is not influenced by outliers. As a result, the median is preferred over the mean as a measure of central tendency for data sets that contain outliers.”
As you can see, the outliers make it difficult to visualize our data.
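The effect of an outlier on the mean and the median can be seen in a very small example. The population values below are made up for illustration and are not taken from the cities dataset.

```r
# Hypothetical populations of five small towns
small_towns <- c(2000, 2500, 3000, 3500, 4000)

# The same towns plus one very large city (an outlier)
with_metropolis <- c(small_towns, 5000000)

mean(small_towns)         # 3000
median(small_towns)       # 3000
mean(with_metropolis)     # about 835833.3, pulled far upward by one value
median(with_metropolis)   # 3250, barely changed
```

This is the same pattern as in the cities data, where a few very large cities pull the mean (23100.24) far above the median (2944).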
Measures of Dispersion
The measures of central tendency, such as the mean, median, and mode, do not reveal the whole picture of the distribution of a data set. Two data sets with the same mean may have completely different spreads (Hartmann et al. 2018). The measures of central tendency and dispersion taken together give a better picture of a data set than the measures of central tendency alone (Mann 2012).
We will use the Crime Severity Index (CSI) for 2020 in this section. The principle behind the Crime Severity Index (CSI), published by Statistics Canada, is to measure the seriousness of crime reported to the police from year to year. As you remember, there are NA values in the overall_csi_index field. Let’s first clean the dataset and remove them.
overall_csi <- cities[!is.na(cities$overall_csi_index),'overall_csi_index'] # keep the rows whose overall_csi_index is not NA and save their overall_csi_index values in a new variable named overall_csi
overall_csi.mean <- mean(overall_csi) # calculate the mean of overall_csi
Range
The range as a measure of dispersion is simple to calculate. It is obtained by taking the difference between the largest and the smallest values in a data set.
\[\text{Range} = \text{Largest value} - \text{Smallest value}\]
overall_csi.range <- max(overall_csi) - min(overall_csi)
Standard Deviation
By using the mean and standard deviation, we can find the proportion or percentage of the total observations that fall within a given interval about the mean.
The variance is the average of the squared deviations from the mean. For sample data, the sum of squared deviations is divided by \(n-1\). The variance for population data is denoted by \(σ^2\) (read as sigma squared), and the variance calculated for sample data is denoted by \(s^2\). Here is the equation to calculate \(s^2\):
\[s^2 = \frac{\sum_{i=1}^n (x_i-\bar x)^2}{n-1}\]
The standard deviation is obtained by taking the square root of the variance. Consequently, the standard deviation calculated for population data is denoted by \(σ\) and the standard deviation calculated for sample data is denoted by \(s\).
\[s = \sqrt{\frac{\sum_{i=1}^n (x_i-\bar x)^2}{n-1}}\]
overall_csi.var <- var(overall_csi) # variance
overall_csi.sd <- sd(overall_csi) # standard deviation
# concatenate the vectors and round to 2 digits
overall_csi.stats <- round(cbind(overall_csi.mean,
                                 overall_csi.var,
                                 overall_csi.sd,
                                 overall_csi.range),2)
colnames(overall_csi.stats) <- c('mean', 'variance', 'standard deviation','range') # rename column names
Type the following to see the result of overall_csi.stats:
overall_csi.stats
mean | variance | standard deviation | range |
---|---|---|---|
82.29 | 3463.1 | 58.85 | 434.88 |
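To connect the formulas for \(s^2\) and \(s\) with the var() and sd() functions, the sketch below computes both by hand on a small vector and compares the results. The values in x are made up for illustration.

```r
# Hypothetical sample of five values
x <- c(50, 60, 70, 80, 90)

n    <- length(x)                     # number of observations
xbar <- mean(x)                       # sample mean, 70
s2   <- sum((x - xbar)^2) / (n - 1)   # sample variance: 1000 / 4 = 250
s    <- sqrt(s2)                      # sample standard deviation

s2                       # 250
all.equal(s2, var(x))    # TRUE: var() uses the same n - 1 formula
all.equal(s, sd(x))      # TRUE
```

Note that var() and sd() always use the sample formula with \(n-1\) in the denominator.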
Quartiles and Interquartile Range
Quartiles divide a ranked data set into four equal parts. The three dividing values are the first quartile (denoted by \(Q1\)), the second quartile (denoted by \(Q2\)), and the third quartile (denoted by \(Q3\)). The second quartile is the same as the median of a data set. The first quartile is the value of the middle term among the observations that are less than the median, and the third quartile is the value of the middle term among the observations that are greater than the median (Mann 2012).
The difference between the third quartile and the first quartile for a data set is called the interquartile range (IQR) (Mann 2012).
\[IQR = Q3-Q1\]
To calculate the quartiles for a variable, we apply the function quantile(). If you call the help() function on quantile(), you see that the default values for the argument probs are 0, 0.25, 0.5, 0.75 and 1. Thus, in order to calculate the quartiles for our variable we just write:
overall_csi.quantile <- quantile(overall_csi)
In order to calculate the \(IQR\) for the overall_csi variable we either subtract overall_csi.quantile[4] - overall_csi.quantile[2] (the 75% and 25% elements of the quantile() output) or use the IQR() function:
IQR(overall_csi)
## [1] 59.14
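To see which elements of the quantile() output hold \(Q1\) and \(Q3\), it helps to look at the names of the result. The sketch below uses a made-up vector x, and the names q and iqr_manual are arbitrary.

```r
# Hypothetical values chosen for illustration
x <- c(10, 20, 30, 40, 50, 60, 70, 80, 90)

q <- quantile(x)   # a named vector with five elements
names(q)           # "0%" "25%" "50%" "75%" "100%"

# Q1 is the second element and Q3 is the fourth element.
# Selecting by name avoids having to remember the positions.
iqr_manual <- unname(q["75%"] - q["25%"])

iqr_manual   # 40
IQR(x)       # 40
```

Selecting by name, as in q["75%"], gives the same value as q[4] but is harder to get wrong.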
Let’s plot our quartiles.
“Don’t forget to update Your Name/Date before saving your plot.”
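h <- hist(overall_csi, breaks = 100, plot = FALSE) # compute the histogram without drawing it
cuts <- cut(h$breaks, c(0, quantile(overall_csi))) # assign each histogram break to a quantile interval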
plot(h,
col = rep(c("4","4","3","2","1"))[cuts],
main = 'Quartiles',
xlab = 'Crime Severity Index')
text(400,10,"Your Name/Date")
# add legend
legend('topright',
legend = c("1st","2nd","3rd","4th"),
col = c(4,3,2,1),
pch = 15)
Extra Note
You can run summary(cities) to get most of the descriptive statistics for your dataframe ;).
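As a rough sketch of what summary() reports, the example below applies it to a tiny dataframe. The dataframe df and its values are hypothetical. It borrows two column names from the cities dataset only to keep the example familiar.

```r
# Hypothetical dataframe with one complete column and one column containing an NA
df <- data.frame(population = c(500, 1200, 800, 2000),
                 overall_csi_index = c(57.8, NA, 104.7, 80.0))

summary(df)                    # min, quartiles, median, mean and max for every column,
                               # plus a count of NA values where there are any

s <- summary(df$population)    # summary of a single column
s[["Median"]]                  # 1000
s[["Mean"]]                    # 1125
```

For numeric columns, summary() reports the same quantities covered in this lab: the minimum, first quartile, median, mean, third quartile and maximum.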
Hand in
Please submit the final plot you have made to the MLS system. To export a plot, use the Export menu among the icons at the top of the Plots tab.
References
https://bookdown.org/ndphillips/YaRrr/
Hartmann, K., Krois, J., Waske, B. (2018): E-Learning Project SOGA: Statistics and Geospatial Data Analysis. Department of Earth Sciences, Freie Universitaet Berlin.
Mann, P. S. Introductory Statistics, 8th Edition; John Wiley and Sons, Incorporated, 2012.