### Author and date need to be changed: Name = student name and date; done when a new markdown is started.
# Students only need to submit in their HTML the answers and code for the questions they need to complete/finish.
---
title: "GGR276 Lab 1 Part 3 Understanding the GEO in Geostatistics"
author: "Areeba Noor"
date: "2024-07-12"
output: html_document
---
R Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.
If your R Markdown file is NOT in the same folder as your data, first set your working directory using setwd(). Here is an example:
setwd("\\\\medusa\\StudentWork\\(Your UTOR ID)\\GGR276\\Lab1")
Backslashes inside an R string must be doubled (or replaced with forward slashes). You will need to change the path to reflect your personal directory.
Otherwise, you may skip this step and continue to import the data by reading your WaterRetainingFacilities.csv file. You may view the data by clicking WaterRetainingFacilities in the Environment window or by typing View(WaterRetainingFacilities). Click the little green triangle on the right to run the current chunk.
# You may need to adjust the CSV file name: if your file uses a different spelling or capitalization, it will not match the name in the code below.
WaterRetainingFacilities <- read.csv("WaterRetainingFacilities.csv", sep = ',', header = TRUE)
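As a quick check that an import worked, the sketch below writes a tiny stand-in CSV to a temporary file and reads it back with the same read.csv() arguments. The file contents and the object name `toy` are hypothetical and are not the lab data.

```r
# Hypothetical stand-in for WaterRetainingFacilities.csv (values are made up).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("DamType,DamHeight",
             "Gravity,12.5",
             "Arch,80",
             "Gravity,30"), csv_path)

# Same arguments as the lab's import line.
toy <- read.csv(csv_path, sep = ",", header = TRUE)

str(toy)           # column names and types
print(nrow(toy))   # number of data rows (the header row is not counted)
print(anyNA(toy))  # TRUE if any value is missing
```

If the column names or the row count look wrong, the file name, separator, or header setting is the first thing to check.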
Now that we have the data imported, we are ready to calculate the median, mean, range and quartiles of the Dam Height in meters (m).
summary(WaterRetainingFacilities$DamHeight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 14.70 22.88 41.06 45.00 285.00
Next, we will calculate the standard deviation of Dam Height in meters (m). If there are missing values in your dataset, simply add na.rm = TRUE to your code to tell R to remove NAs from the calculation, like this: sd(WaterRetainingFacilities$DamHeight, na.rm=TRUE). Since DamHeight has no missing values in this dataset, the argument is not necessary here.
sd(WaterRetainingFacilities$DamHeight)
## [1] 46.14621
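The effect of na.rm can be checked on a small vector. This is a minimal sketch; the `heights` values are hypothetical and only illustrate how a missing value changes the result of sd().

```r
# Hypothetical heights in metres, with one missing value.
heights <- c(2, 4, NA)

sd_default <- sd(heights)                # NA: the missing value propagates
sd_clean   <- sd(heights, na.rm = TRUE)  # standard deviation of 2 and 4 only

print(sd_default)
print(sd_clean)
```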
Question 6: Now it is your turn to write the code to calculate the median, mean, range, interquartile range, and standard deviation of the impoundment volume; review the variable information to find out the units. (2 marks)
#TODO
Type your response here for the summary descriptive statistics and standard deviation (include units, round to 1 decimal place):
median = 1.0 mm^3
mean = 17.9 mm^3
range = 384.1 mm^3
interquartile range = 9.0 mm^3
standard deviation = 44.7 mm^3
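The base R functions that cover the statistics named in Question 6 can be sketched as follows. The `volume` vector is hypothetical; in the lab, the impoundment volume column of the imported data frame would take its place.

```r
# Hypothetical values standing in for an impoundment volume column.
volume <- c(1, 2, 4, 7, 11)

stats <- c(
  median = median(volume),
  mean   = mean(volume),
  range  = diff(range(volume)),  # maximum minus minimum
  iqr    = IQR(volume),          # 3rd quartile minus 1st quartile
  sd     = sd(volume)            # sample standard deviation
)

print(round(stats, 1))
```

Note that range() returns the minimum and maximum as a pair; diff() turns that pair into a single width.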
Visualize the spread of the Dam Height dataset, and answer Question 7: According to the boxplot, would you use the median or the mean as your central tendency measure? Explain and justify your choice. (2 marks)
Type your response here:
I would use the median as my measure of central tendency because the boxplot
shows many outliers, meaning the distribution is skewed. The mean is pulled
toward those outliers, so it would not give the best measure of central tendency.
boxplot(WaterRetainingFacilities$DamHeight)
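The points a boxplot draws as outliers can also be listed numerically. The sketch below uses boxplot.stats() on a hypothetical `heights` vector (not the lab data) and compares the mean with the median to show how a single extreme value pulls the mean upward.

```r
# Hypothetical dam heights in metres, with one very tall dam.
heights <- c(2, 15, 20, 23, 30, 45, 285)

height_mean   <- mean(heights)
height_median <- median(heights)

# Values that lie more than 1.5 times the box length beyond the box.
outliers <- boxplot.stats(heights)$out

print(c(mean = height_mean, median = height_median))
print(outliers)
```

A mean well above the median, together with outliers on the high side only, is the usual numeric signature of a right-skewed variable.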
Visualize the locations of the water retaining facilities by Dam Height and the mean and weighted mean centres.
plot(WaterRetainingFacilities$easting_m, WaterRetainingFacilities$northing_m, xlab="Easting (m)", ylab="Northing (m)", main = "Switzerland's Water Retaining Facilities", xlim=c(2488000, 2840000), ylim=c(1085000, 1285000))
n<-nrow(WaterRetainingFacilities[1])
mc_x<-sum(WaterRetainingFacilities$easting_m)/n
mc_y<-sum(WaterRetainingFacilities$northing_m)/n
points(mc_x,mc_y,'p',pch=15,cex=2,col="blue")
wmc_x<-sum(as.numeric(WaterRetainingFacilities$DamHeight*WaterRetainingFacilities$easting_m))/sum(WaterRetainingFacilities$DamHeight)
wmc_y<-sum(as.numeric(WaterRetainingFacilities$DamHeight*WaterRetainingFacilities$northing_m))/sum(WaterRetainingFacilities$DamHeight)
points(wmc_x, wmc_y,'p',pch=15,cex=2,col='red')
legend("topright", legend = c("Mean centre", "Weighted mean centre"), pch = c(15,15), col = c("blue","red"))
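The two centres computed above can be checked on a toy example. This is a sketch under stated assumptions: the coordinates `x`, `y` and weights `w` are hypothetical, and weighted.mean() is used as a shorthand that should give the same result as the sum(w * x) / sum(w) form in the lab code.

```r
# Hypothetical coordinates (m) and weights (e.g. dam heights in m).
x <- c(0, 10, 0, 10)
y <- c(0, 0, 10, 10)
w <- c(1, 1, 1, 5)   # the point at (10, 10) carries most of the weight

# Mean centre: the plain average of each coordinate.
mc <- c(x = mean(x), y = mean(y))

# Weighted mean centre, written out as in the lab code.
wmc_manual <- c(x = sum(w * x) / sum(w), y = sum(w * y) / sum(w))

# The same quantity using the built-in helper.
wmc <- c(x = weighted.mean(x, w), y = weighted.mean(y, w))

print(mc)
print(wmc)
```

Here the weighted mean centre moves from the middle of the square toward the heavily weighted corner, which is the behaviour to look for when comparing the blue and red symbols on the map.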
Notice that this script plots every dam, regardless of dam type. Let’s filter the data to keep only the rows whose DamType is “Gravity”.
filtered_WaterRetaining <- subset(WaterRetainingFacilities, (DamType == "Gravity"))
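How subset() filters rows can be seen on a small data frame. The `dams` data below are hypothetical; the bracket form is shown only as an equivalent way of writing the same filter.

```r
# Hypothetical data frame with the same column names as the lab data.
dams <- data.frame(
  DamType   = c("Gravity", "Arch", "Gravity", "Embankment"),
  DamHeight = c(12.5, 80, 30, 9)
)

# Keep only the rows where DamType equals "Gravity".
gravity <- subset(dams, DamType == "Gravity")

# Equivalent filter using logical row indexing.
gravity_alt <- dams[dams$DamType == "Gravity", ]

print(table(dams$DamType))  # how many dams of each type before filtering
print(nrow(gravity))        # how many rows survive the filter
```

Checking table() before filtering is a cheap way to catch spelling or capitalization differences in the category labels.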
Visualize the locations of the Water Retaining Facilities whose primary dam type is a gravity dam, and include the mean and weighted mean centres of the filtered data. Use dam height to determine the weighted mean centre. Please adjust the range of the x and y axes as well as the symbols to best represent the data. Make sure to change the colours and symbols.
#TODO
Question 8: Produce comments (i.e., a detailed explanation for each parameter in the code) that describe the execution of the statements given to you in the WaterRetaining example. Make sure you provide a description for each set of statements and organize your answer in a manner similar to the following example. (16 marks)
Example:
setwd("\\\\medusa\\StudentWork\\(Your UTOR ID)\\GGR276\\Lab1")
The setwd() command tells R which folder to look in when searching for data, saving outputs, etc.
Explain what is occurring in each of the following lines of code (2 marks each):
WaterRetaining <- read.csv("WaterRetainingFacilities.csv", sep = ',', header = TRUE)
plot(WaterRetaining$easting_m, WaterRetaining$northing_m, xlab="Easting (m)", ylab="Northing (m)", main = "Switzerland's Water Retaining Facilities", xlim=c(2488000, 2840000), ylim=c(1085000, 1285000))
n<-nrow(WaterRetaining[1])
mc_x<-sum(WaterRetaining$easting_m)/n
mc_y<-sum(WaterRetaining$northing_m)/n
points(mc_x,mc_y,'p',pch=16,cex=2,col="blue")
wmc_x<-sum(as.numeric(WaterRetaining$DamHeight*WaterRetaining$easting_m))/sum(WaterRetaining$DamHeight)
wmc_y<-sum(as.numeric(WaterRetaining$DamHeight*WaterRetaining$northing_m))/sum(WaterRetaining$DamHeight)
legend("topleft", legend = c("Mean centre", "Weighted mean centre"), pch = c(16,15), col = c("blue","red"))
install.packages("rmarkdown")
library(rmarkdown)
Type your response here:
a) WaterRetaining is the variable that stores the contents of the WaterRetainingFacilities.csv file as read by the read.csv command. The values in the file are separated by commas, which is why sep = ',' is given, and header = TRUE says that the first row holds the column names.
b) The plot() code draws a scatter plot in which the x values are the easting_m column of the WaterRetaining data frame and the y values are its northing_m column. The $ sign in WaterRetaining$easting_m extracts the values in the easting_m column from the WaterRetaining data frame. xlab = "" sets the label for the x-axis, ylab = "" sets the label for the y-axis, and main = "" sets the title. xlim = c() limits the x-axis to the range between the two numbers given, and ylim = c() does the same for the y-axis. In this case they restrict the plot to a certain area.
c) The nrow(WaterRetaining[1]) code returns the number of rows in the first column of the data frame WaterRetaining, which is the number of facilities, and stores it in n.
d) This code gives the mean of the easting_m column: sum() adds up all the values in the easting_m column of the WaterRetaining data frame and / divides the total by n, which is the number of rows (i.e. the number of values). The result, mc_x, is the mean center of the x values (easting coordinates); similarly, mc_y is the mean center of the y values (northing coordinates).
e) The points() code adds a point to the existing scatter plot at the previously calculated mean center, using the values stored in mc_x and mc_y. 'p' sets the type to points, pch chooses the symbol, cex sets its size, and col sets its colour.
f) The sum() function adds up the values inside the brackets. as.numeric() makes sure that the values inside the brackets are numeric so that operations can be done with them without errors. Each value in the DamHeight column is multiplied by the corresponding easting_m value, the products are summed, and the total is divided by the sum of the DamHeight column. Together, wmc_x and wmc_y give the mean center of the x values (easting_m) and the y values (northing_m) weighted by DamHeight.
g) The legend() code puts a legend for the mean center and weighted mean center, with their respective symbols and colours, on the top left side of the graph.
h) The install.packages code installs the rmarkdown package on the system, and library(rmarkdown) loads the installed package so that it is available in your script.
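Why nrow(WaterRetaining[1]) counts rows rather than anything about "the first row" can be seen on a toy data frame. This is a minimal sketch; the `toy` object and its values are hypothetical.

```r
# Hypothetical two-column data frame with three rows.
toy <- data.frame(easting_m = c(1, 2, 3), northing_m = c(4, 5, 6))

# Single-bracket indexing with one number keeps a data frame
# that contains only the first COLUMN.
first_col <- toy[1]
print(class(first_col))

# Both calls count the rows, so they agree.
n_from_col <- nrow(toy[1])
n_direct   <- nrow(toy)
print(c(n_from_col, n_direct))
```

Because every column of a data frame has the same number of rows, nrow(toy) is the shorter way to get the same count.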
Question 9: What do the mean and weighted mean center tell you about the distribution of the filtered Water Retaining Facility locations (e.g. Gravity Dams) and their dam heights? Please explain. (3 marks)
Type your response here: I don’t know. I kept trying, but I was unable to export the WaterRetainingFacilities file with the easting_m and northing_m features added to it, so the calculations didn’t go through. I am unable to do any questions regarding the mean center and weighted mean center. It kept saying that the file was unsupported. When I exported the file, I was able to put in the input features but could not find the file.
The Gravity Dams have different dam heights. Do they have the same mean centre? Imagine we are interested in how the distribution of Gravity Dams in sm_DamHeights (dam heights less than the median) differs from that of the dams in lrg_DamHeights (dam heights greater than or equal to the median). To do this, we will split the filtered_WaterRetaining dataset into two subsets: sm_DamHeights will contain only the dams below the median height, and lrg_DamHeights will contain only those at or above it. The threshold is the median dam height in the filtered_WaterRetaining dataset.
#TODO:find the median using the summary function
sm_DamHeights <- subset(filtered_WaterRetaining, (filtered_WaterRetaining$DamHeight < 22.88))
#TODO lrg_DamHeights
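A median split can be sketched on a toy data frame as follows. The `gravity` data and the object names `small` and `large` are hypothetical; the point is that computing the threshold with median() on the data being split avoids typing a number taken from a different dataset.

```r
# Hypothetical filtered data with dam heights in metres.
gravity <- data.frame(DamHeight = c(5, 10, 15, 20, 40))

# Threshold taken from the same data that is being split.
med <- median(gravity$DamHeight)

small <- subset(gravity, DamHeight < med)   # strictly below the median
large <- subset(gravity, DamHeight >= med)  # at or above the median

print(med)
print(c(small = nrow(small), large = nrow(large)))
```

Using < for one subset and >= for the other means every row lands in exactly one subset, so the two row counts add up to the original count.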
Once created, a subset can be viewed by calling View() on the object, e.g. View(sm_DamHeights). If you want to know the number of small dam heights, you can run: nrow(sm_DamHeights[1]).
Question 10: Show the Switzerland Water Retaining Facilities scatter plot using the subsetted data that you created (small and large dam heights). Remember to overlay the subsets on the original data (WaterRetaining) to see the distribution of the subset. Use a different color for the subsets and the original dataset. Also plot the mean centres for the entire dataset and the subsets. Be sure to include a legend, axis titles (with units), and a main title. Include your name and student number in brackets at the end of your main title. Please do not show irrelevant information on the graph. Please also type your code in the code chunk below. (10 marks)
So far, we have measured the central tendency of the spatial data. How about dispersion? Standard deviation is a measure of dispersion that can be used to assess the distribution of spatial data. To calculate the orthogonal dispersion (east-west, north-south) of the filtered_WaterRetaining dataset, we will apply the sd() command to the easting and northing columns, respectively. Please do the same for the subsets in Question 11.
sd(filtered_WaterRetaining$easting_m)
## [1] NA
sd(filtered_WaterRetaining$northing_m)
## [1] NA
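An NA result from sd() often means the column contains at least one missing value, although that is an assumption here and not a confirmed diagnosis of the output above. The sketch below, on a hypothetical `coords` data frame, counts missing values per column and then computes the east-west and north-south standard deviations with and without na.rm = TRUE.

```r
# Hypothetical coordinates in metres; one easting is missing.
coords <- data.frame(
  easting_m  = c(2600000, 2610000, NA, 2650000),
  northing_m = c(1100000, 1120000, 1150000, 1200000)
)

# Number of missing values in each column.
print(colSums(is.na(coords)))

# Without na.rm, any column containing an NA returns NA.
sd_raw <- sapply(coords, sd)

# With na.rm = TRUE, missing values are dropped column by column.
sd_ok <- sapply(coords, sd, na.rm = TRUE)

print(sd_raw)
print(round(sd_ok, 1))
```

Because missing values are dropped separately for each column, the two standard deviations may be based on different numbers of rows; removing incomplete rows first is an alternative when that matters.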
Question 11: Please show your code as well as the calculated standard deviations in your R Markdown. Provide a concise conclusion regarding the orthogonal dispersion for both the Water Retaining Facilities dataset and all subsets. These conclusions should include a short description of the dispersion and a comparison (i.e. Water Retaining Gravity Dam Facilities (filtered) vs. Small Dam Heights vs. Large Dam Heights). Remember to include units of measurement in your response and round to one decimal place. Finally, why was the median used to subset the data rather than the mean? (7 marks)
#TODO
Type your response here: