### Author and date need to be changed: use the student's name and the date. This is done when a new markdown is started.
# Students only need to submit in their HTML the answers and code for the questions they need to complete.

---
title: "GGR276 Lab 1 Part 3 Understanding the GEO in Geostatistics"
author: "Areeba Noor"
date: "2024-07-12"
output: html_document
---

Info about R Markdown

R Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.

Part 3A. Non-spatial Statistics

Import Data

If your R Markdown file is NOT in the same folder as your data, set your working directory with setwd() first. Here is an example: setwd("//medusa/StudentWork/(Your UTOR ID)/GGR276/Lab1"). The path is written with forward slashes because R treats a single backslash inside a string as an escape character. You will need to change the code to reflect your personal directory. Otherwise, you may skip this step and import the data by reading your WaterRetainingFacilities.csv file. You may view the data by clicking WaterRetainingFacilities in the Environment window or by typing View(WaterRetainingFacilities). Click the little green triangle on the right to run the current chunk.

  • To test whether there is an error in any line of code, run the script line by line and check carefully how each line works.
    • Any errors in your code will be displayed in the Console window at the bottom left of RStudio, which will help you pinpoint where the error lies.
    • If there is an error, select each line of code individually and run the lines one by one. This will tell you which line is causing the error.

# You may need to adjust the csv file name: if your file uses a different spelling or different capitalization, it will not match the name listed in the code below.

WaterRetainingFacilities <- read.csv("WaterRetainingFacilities.csv", sep = ',', header = TRUE)
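A minimal sketch of how the import step could be checked before moving on. It assumes a small made-up file written to a temporary folder (toy_facilities.csv, with hypothetical values) rather than the real WaterRetainingFacilities.csv, so it can run anywhere.

```r
# Hypothetical toy file standing in for WaterRetainingFacilities.csv
csv_path <- file.path(tempdir(), "toy_facilities.csv")
toy <- data.frame(
  DamHeight = c(10, 25, 40),
  DamType   = c("Gravity", "Arch", "Gravity")
)
write.csv(toy, csv_path, row.names = FALSE)

# Stop early with a readable message if the file name or folder is wrong
if (!file.exists(csv_path)) {
  stop("File not found: ", csv_path, " (working directory: ", getwd(), ")")
}

facilities <- read.csv(csv_path, sep = ",", header = TRUE)

# Quick structural checks: column names, column types and the first rows
str(facilities)
head(facilities)
```

With the real data, replacing csv_path by "WaterRetainingFacilities.csv" and skipping the toy-file lines should give the same kind of check.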

Descriptive Statistics of DamHeight

Now that we have imported the data, we are ready to calculate the median, mean, range and quartiles of Dam Height in meters (m).

summary(WaterRetainingFacilities$DamHeight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   14.70   22.88   41.06   45.00  285.00

Next, we will calculate the standard deviation of Dam Height in meters (m). If there are missing values in your dataset, simply add na.rm = TRUE to your code to tell R to remove NAs from the calculation, like this: sd(WaterRetainingFacilities$DamHeight, na.rm = TRUE). Since there are no missing values in this column, the extra argument is not necessary here.

sd(WaterRetainingFacilities$DamHeight)
## [1] 46.14621
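The effect of na.rm can be seen in a small sketch. The vector below is made up for illustration and is not taken from the lab data.

```r
# Made-up dam heights in meters, with one missing value
heights <- c(12.5, 30, NA, 45)

# Without na.rm, a single NA makes the result NA
sd_default <- sd(heights)

# With na.rm = TRUE, the NA is dropped before the calculation
sd_no_na <- sd(heights, na.rm = TRUE)

sd_default
sd_no_na
```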

Descriptive Statistics of ImpoundmentVolume

Question 6: Now it is your turn to write the code to calculate the median, mean, range, interquartile range, and standard deviation of the impoundment volume; review the variable information to find out the units. (2 marks)

#TODO

Type your response here for the summary descriptive statistics and standard deviation (include units, round to 1 decimal place): median = 1.0 mm^3; mean = 17.9 mm^3; range = 384.1 mm^3; interquartile range = 9.0 mm^3; standard deviation = 44.7 mm^3
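As a rough sketch of the functions involved in Question 6, the statistics can be computed on a made-up vector. The values and the name volume are hypothetical; with the lab data, the impoundment volume column would take their place.

```r
# Made-up impoundment volumes (hypothetical values, not from the lab data)
volume <- c(0.2, 0.5, 1.0, 4.0, 50.0)

vol_median <- median(volume)
vol_mean   <- mean(volume)
vol_range  <- diff(range(volume))  # range() returns c(min, max); diff() gives max - min
vol_iqr    <- IQR(volume)          # third quartile minus first quartile
vol_sd     <- sd(volume)

# Round everything to one decimal place for reporting
round(c(median = vol_median, mean = vol_mean, range = vol_range,
        IQR = vol_iqr, sd = vol_sd), 1)
```

Note that range() on its own returns the minimum and maximum rather than a single number, which is why diff() is applied here.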

Visualize

Visualize the spread of the Dam Height dataset, and answer Question 7: According to the boxplot, would you use the median or the mean as your central tendency measure? Explain and justify your choice. (2 marks)

Type your response here:
I would use the median as my central tendency measure because the boxplot shows many outliers, which indicates that the distribution is skewed. The mean would be pulled toward those outliers and would not give the best measure of central tendency.

boxplot(WaterRetainingFacilities$DamHeight)
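The reasoning behind choosing the median for skewed data can be illustrated with a short sketch on made-up heights. One large value pulls the mean upward while the median barely moves, and boxplot.stats() lists the values a boxplot would draw as outliers.

```r
# Made-up heights in meters: seven moderate values and one extreme value
heights <- c(10, 12, 15, 18, 20, 22, 25, 200)

h_mean   <- mean(heights)    # pulled upward by the extreme value
h_median <- median(heights)  # stays near the bulk of the data

# Values beyond 1.5 times the hinge spread, i.e. the points a boxplot draws individually
outliers <- boxplot.stats(heights)$out

h_mean
h_median
outliers
```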

Part 3B. Mapping Mean and Weighted Mean Centre

Visualize the locations of the water retaining facilities by Dam Height and the mean and weighted mean centres.

plot(WaterRetainingFacilities$easting_m, WaterRetainingFacilities$northing_m, xlab="Easting (m)", ylab="Northing (m)", main = "Switzerland's Water Retaining Facilities", xlim=c(2488000, 2840000), ylim=c(1085000, 1285000))
n<-nrow(WaterRetainingFacilities[1])
mc_x<-sum(WaterRetainingFacilities$easting_m)/n
mc_y<-sum(WaterRetainingFacilities$northing_m)/n
points(mc_x,mc_y,'p',pch=15,cex=2,col="blue")
wmc_x<-sum(as.numeric(WaterRetainingFacilities$DamHeight*WaterRetainingFacilities$easting_m))/sum(WaterRetainingFacilities$DamHeight)
wmc_y<-sum(as.numeric(WaterRetainingFacilities$DamHeight*WaterRetainingFacilities$northing_m))/sum(WaterRetainingFacilities$DamHeight)
points(wmc_x, wmc_y,'p',pch=15,cex=2,col='red')
legend("topright", legend = c("Mean centre", "Weighted mean centre"), pch = c(15,15), col = c("blue","red"))
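The two centres computed above can be checked on a tiny made-up example. This sketch uses mean() and weighted.mean() as an alternative to the explicit sums in the script; the data frame pts and its values are hypothetical.

```r
# Three made-up facilities; the third is much taller than the others
pts <- data.frame(
  easting_m  = c(0, 10, 20),
  northing_m = c(0, 0, 30),
  DamHeight  = c(10, 10, 80)
)

# Mean centre: the plain average of each coordinate
mc <- c(x = mean(pts$easting_m), y = mean(pts$northing_m))

# Weighted mean centre: each coordinate is weighted by dam height
wmc <- c(
  x = weighted.mean(pts$easting_m,  w = pts$DamHeight),
  y = weighted.mean(pts$northing_m, w = pts$DamHeight)
)

# Same result as the explicit formula sum(w * x) / sum(w)
wmc_x_manual <- sum(pts$DamHeight * pts$easting_m) / sum(pts$DamHeight)

mc
wmc
```

In this toy case the weighted mean centre sits closer to the tall third facility than the mean centre does, which is the kind of shift the map is meant to reveal.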

Notice that the plot produced by this script shows all of the dams, regardless of dam type. Let’s filter the data to keep only the facilities whose DamType is “Gravity”.

filtered_WaterRetaining <- subset(WaterRetainingFacilities, (DamType == "Gravity"))
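A small sketch of how the result of a filter like this could be verified, using a made-up data frame (toy) with hypothetical dam types and heights.

```r
# Made-up facilities with mixed dam types
toy <- data.frame(
  DamType   = c("Gravity", "Arch", "Gravity", "Embankment"),
  DamHeight = c(20, 150, 35, 12)
)

# Keep only the rows where DamType is exactly "Gravity" (the match is case-sensitive)
gravity <- subset(toy, DamType == "Gravity")

# Compare counts before and after filtering
table(toy$DamType)
nrow(gravity)
```

If the filtered data frame has zero rows, checking the spelling and capitalization of the category with table() or unique() is a reasonable first step.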

Visualize the locations of the Water Retaining Facilities whose primary dam type is a gravity dam, and include the mean and weighted mean centres of the filtered data. Use dam height to determine the weighted mean centre. Please adjust the range of the x and y axes as well as the symbols to best represent the data. Make sure to change the colours and symbols.

#TODO

Question 8: Produce comments (i.e., a detailed explanation of each parameter in the code) that describe the execution of the statements given to you in the WaterRetaining example. Make sure you provide a description for each set of statements and organize your answer in a manner similar to the following example. (16 marks)

Example: setwd("//medusa/StudentWork/(Your UTOR ID)/GGR276/Lab1")

The setwd() command tells R the current folder to look into when searching for data, saving outputs etc.

Explain what is occurring in each of the following lines of code (2 marks each):

  1. WaterRetaining <- read.csv("WaterRetainingFacilities.csv", sep = ',', header = TRUE)

  2. plot(WaterRetaining$easting_m, WaterRetaining$northing_m, xlab="Easting (m)", ylab="Northing (m)", main = "Switzerland's Water Retaining Facilities", xlim=c(2488000, 2840000), ylim=c(1085000, 1285000))

  3. n<-nrow(WaterRetaining[1])

  4. mc_x<-sum(WaterRetaining$easting_m)/n; mc_y<-sum(WaterRetaining$northing_m)/n

  5. points(mc_x,mc_y,'p',pch=16,cex=2,col="blue")

  6. wmc_x<-sum(as.numeric(WaterRetaining$DamHeight*WaterRetaining$easting_m))/sum(WaterRetaining$DamHeight); wmc_y<-sum(as.numeric(WaterRetaining$DamHeight*WaterRetaining$northing_m))/sum(WaterRetaining$DamHeight)

  7. legend("topleft", legend = c("Mean centre", "Weighted mean centre"), pch = c(16,15), col = c("blue","red"))

  8. install.packages("rmarkdown"); library(rmarkdown)

Type your response here:

  1. WaterRetaining is the variable that stores the contents of the WaterRetainingFacilities.csv file, which is read by the read.csv() command. The values in the file are separated by commas, hence sep = ','. header = TRUE tells R that the first line of the file contains the column names.

  2. The plot() command draws a scatter plot in which the x values come from the easting_m column of the WaterRetaining data frame and the y values come from its northing_m column. The $ sign in WaterRetaining$easting_m extracts the values of the easting_m column from the WaterRetaining data frame. xlab = "" sets the label of the x-axis, ylab = "" sets the label of the y-axis, and main = "" sets the title of the plot. xlim = c() limits the x-axis to the range between the two numbers given, and ylim = c() does the same for the y-axis. In this case the limits restrict the plot to a certain location.

  3. nrow(WaterRetaining[1]) returns the number of rows in the first column of the WaterRetaining data frame, i.e. the number of facilities, and stores it in n.

  4. This code gives the mean of the easting_m column: sum() adds up all the values in the easting_m column of the WaterRetaining data frame, and / divides the total by n, the number of rows (i.e. the number of values). The result, mc_x, is the x coordinate (easting) of the mean centre; similarly, mc_y is the y coordinate (northing) of the mean centre.

  5. The points() command adds points to the existing scatter plot. Here it adds the previously calculated mean centre, stored in mc_x and mc_y. 'p' sets the type to points, pch sets the symbol to be drawn, cex sets its size, and col sets its colour.

  6. The sum() function adds up the values inside the brackets. as.numeric() makes sure that those values are numeric, so that operations can be done with them without errors. The values in the DamHeight column are multiplied by the corresponding easting_m values, summed, and then divided by the sum of the values in the DamHeight column. This calculates the x coordinate (easting_m) of the weighted mean centre, with DamHeight as the weight; wmc_y does the same for the y coordinate (northing_m).

  7. The legend() command places a legend for the mean centre and weighted mean centre, with their respective symbols and colours, at the top left of the graph.

  8. install.packages("rmarkdown") installs the rmarkdown package on the system, and library(rmarkdown) loads the installed package so that its functions are available in the current R session.

Question 9: What do the mean and weighted mean center tell you about the distribution of the filtered Water Retaining Facility locations (e.g. Gravity Dams) and their dam heights? Please explain. (3 marks)

Type your response here: I don’t know. I kept trying, but I was unable to export the WaterRetainingFacilities file with the easting_m and northing_m features added to it, so the calculations did not go through. I am unable to answer any questions regarding the mean centre and weighted mean centre. The software kept saying that the file was unsupported. When I exported the file, I was able to put in the input features but could not find the exported file.

Part 3C. Create and Visualize Subsets

Create Subsets

The Gravity Dams have different dam heights. Do they share the same mean centre? Imagine we are interested in how the distribution of Gravity Dams with small dam heights (sm_DamHeights, Dam Heights less than the median) differs from that of Gravity Dams with large dam heights (lrg_DamHeights, Dam Heights greater than or equal to the median). To find out, we will split the filtered_WaterRetaining dataset into two subsets: one containing only the dams in sm_DamHeights and the other containing only the dams in lrg_DamHeights. The split is based on the median dam height in the filtered_WaterRetaining dataset.

#TODO:find the median using the summary function

sm_DamHeights <- subset(filtered_WaterRetaining, (filtered_WaterRetaining$DamHeight < 22.88))
#TODO lrg_DamHeights

Once created, the subset can be inspected by calling the object (sm_DamHeights) in the Console or by opening it with View(sm_DamHeights). If you want to know the number of Small Dam Heights, you can run: nrow(sm_DamHeights[1]).
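The median split described in this part can be sketched on made-up data. The object names gravity, small and large are hypothetical stand-ins for the lab objects, and the median is computed with median() instead of being typed in by hand.

```r
# Made-up gravity dams with hypothetical heights in meters
gravity <- data.frame(DamHeight = c(8, 15, 22, 40, 90))

# Compute the median once so both subsets use the same threshold
med_height <- median(gravity$DamHeight)

small <- subset(gravity, DamHeight <  med_height)  # strictly below the median
large <- subset(gravity, DamHeight >= med_height)  # at or above the median

# Every row should land in exactly one subset
nrow(small) + nrow(large) == nrow(gravity)
```

Storing the threshold in a variable avoids the mismatch that can occur when a median from a different dataset is typed in as a literal number.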

Visualize Subsets

Question 10: Show the Switzerland Water Retaining Facilities scatter plot using the subsetted data that you created (small and large dam heights). Remember to overlay the subsets on the original data (WaterRetaining) to see the distribution of the subset. Use a different color for the subsets and the original dataset. Also plot the mean centres for the entire dataset and the subsets. Be sure to include a legend, axis titles (with units), and a main title. Include your name and student number in brackets at the end of your main title. Please do not show irrelevant information on the graph. Please also type your code in the code chunk below. (10 marks)

Part 3D. Dispersion of Subsets

Standard Deviation of Subsets

So far, we have measured the central tendency of the spatial data. How about dispersion? Standard deviation is a measure of dispersion that can be used to assess the distribution of spatial data. To calculate the orthogonal dispersion (east-west, north-south) of the filtered_WaterRetaining dataset, we will apply the sd() command to the easting and the northing, respectively. Please do the same for the subsets in Question 11.

sd(filtered_WaterRetaining$easting_m)
## [1] NA
sd(filtered_WaterRetaining$northing_m)
## [1] NA
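An NA result from sd() often means the column contains missing values; a column that does not exist in the data frame is another possibility, since the $ operator then returns NULL. The sketch below, on made-up coordinates (the data frame coords is hypothetical), shows one way to check for both situations and to compute the dispersion of each axis once missing values are excluded.

```r
# Made-up coordinates in meters with one missing value per column
coords <- data.frame(
  easting_m  = c(2600000, 2650000, NA, 2700000),
  northing_m = c(1100000, 1150000, 1200000, NA)
)

# Check that the expected columns exist at all
cols_present <- all(c("easting_m", "northing_m") %in% names(coords))

# Count missing values per column
colSums(is.na(coords))

# Standard deviation per axis, with and without the missing values
sd_raw   <- sapply(coords, sd)                # NA for any column containing NA
sd_clean <- sapply(coords, sd, na.rm = TRUE)  # NAs dropped first

round(sd_clean, 1)
```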

Question 11: Please show your code as well as the calculated standard deviations in your R Markdown. Provide a concise conclusion regarding the orthogonal dispersion for both the Water Retaining Facilities dataset and all subsets. These conclusions should include a short description of the dispersion and a comparison (i.e. Water Retaining Gravity Dam Facilities (filtered) vs. Small Dam Heights vs. Large Dam Heights). Remember to include units of measurement in your response and round to one decimal place. Finally, why was the median used to subset the data rather than the mean? (7 marks)

#TODO

Type your response here: