Info about R Markdown

R Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button, a document will be generated that includes both the content and the output of any embedded R code chunks within the document.

Part 3A. Non-spatial Statistics

Import Data

If your R Markdown file is NOT in the same folder as your data, set your working directory with setwd() first. Here is an example: setwd("//medusa/StudentWork/(Your UTOR ID)/GGR276/Lab1"). In an R string, write the path with forward slashes or doubled backslashes, because a single backslash starts an escape sequence. You will need to change the path to reflect your personal directory. Otherwise, you may skip this step and go straight to importing the data by reading your stations.csv file. You can view the data by clicking stations in the Environment window or by running View(DataSet), replacing DataSet with the name of your data frame (here, stations). Click the little green triangle on the right of a chunk to run that chunk.

To check whether any line of code contains an error, run the script line by line and check carefully how each line works.
- Any errors in your code will be displayed in the bottom left Console window of RStudio, which will help you pinpoint where the error lies.
- If there is an error, select each line of code individually and run the lines one at a time. This will tell you which line of code is causing the error.

setwd("/Users/amroopbains/Downloads") 
stations <- read.csv("stations.csv", sep = ',', header = TRUE)
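Before computing any statistics, it can help to confirm that the import worked. The sketch below is only an illustration: it writes a tiny, made-up CSV to a temporary folder (the values and the names `demo`, `demo_path`, and `stations_demo` are assumptions, not the lab's data), reads it back with the same read.csv() arguments, and inspects the result.

```r
# Hypothetical mini dataset standing in for stations.csv (values are made up).
demo <- data.frame(
  Fuel_Type = c("CNG", "ELEC", "LPG"),
  Latitude  = c(43.7, 34.1, 40.7),
  Longitude = c(-79.4, -118.2, -74.0)
)

# Write it to a temporary file so the example does not depend on setwd().
demo_path <- file.path(tempdir(), "stations_demo.csv")
write.csv(demo, demo_path, row.names = FALSE)

# file.exists() is a quick check that the path is right before importing.
stopifnot(file.exists(demo_path))

# Same call pattern as in the lab.
stations_demo <- read.csv(demo_path, sep = ",", header = TRUE)

str(stations_demo)               # column names and types
head(stations_demo)              # first few rows
dim_demo <- dim(stations_demo)   # number of rows and columns
print(dim_demo)
```

If file.exists() returns FALSE for your own path, the working directory or file name is the first thing to check.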

Descriptive Statistics of CNG_Compression

Now that we have data imported, we are ready to calculate median, mean, range and quantiles of the compression.

summary(stations$CNG_Compression)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     2.0   250.0   672.0   912.4  1182.8 10620.0  108521

Next, we will calculate the standard deviation of CNG_Compression. If there are missing values in your dataset, add na.rm = TRUE to your code to tell R to remove the NAs before the calculation, like this: sd(stations$CNG_Compression, na.rm=TRUE).

sd(stations$CNG_Compression, na.rm = TRUE)

## [1] 1067.735
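To see what na.rm = TRUE changes, here is a minimal sketch on a toy vector (the numbers are made up and are not the lab data). Without na.rm, a single missing value makes the whole result NA.

```r
# Toy vector with one missing value (made-up numbers, not the lab data).
x <- c(2, 4, 4, 6, NA)

sd_with_na    <- sd(x)                # NA: one missing value makes the result missing
sd_without_na <- sd(x, na.rm = TRUE)  # NAs are dropped first; sd of 2, 4, 4, 6

# Count the missing values so you know how much data was dropped.
n_missing <- sum(is.na(x))

print(sd_with_na)
print(sd_without_na)
print(n_missing)
```

Counting the NAs alongside the statistic is useful here because the summary() output above shows that most CNG_Compression values are missing.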

Descriptive Statistics of CNG_Dispensers

Question 6: Now it is your turn to write the code to calculate the median, mean, range, interquartile range, and standard deviation of the CNG_Dispensers. (2 marks)

summary(stations$CNG_Dispensers)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0       1       2       3       2      82  107907

sd(stations$CNG_Dispensers, na.rm = TRUE)

## [1] 7.15083
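summary() reports the minimum, maximum, and quartiles, but not the range or the interquartile range as single values. One possible way to get them is sketched below on a toy vector; the vector `d` is a made-up stand-in for stations$CNG_Dispensers, not the real column.

```r
# Toy stand-in for stations$CNG_Dispensers (made-up values, one NA).
d <- c(0, 1, 1, 2, 2, 2, 3, 8, NA)

rng    <- range(d, na.rm = TRUE)   # minimum and maximum
spread <- diff(rng)                # range as a single number (max - min)
q      <- quantile(d, c(0.25, 0.75), na.rm = TRUE)  # 1st and 3rd quartiles
iqr    <- IQR(d, na.rm = TRUE)     # interquartile range (3rd - 1st quartile)

print(rng)
print(spread)
print(q)
print(iqr)
```

IQR() uses the same default quantile definition as quantile(), so it matches the difference between the two quartiles printed above it.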

#TODO

Visualize

Visualize the spread of CNG_Compression, and answer Question 7: According to the boxplot, would you use the median or the mean as your central tendency measure? Explain and justify your choice. (2 marks)

Type your response here:

boxplot(stations$CNG_Compression)
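As a rough illustration of what to look for in the boxplot, the sketch below uses a made-up, right-skewed toy vector (`comp` is not the lab data). It compares the mean and median and uses boxplot(..., plot = FALSE) to return the values that a boxplot would flag as outliers, without drawing anything.

```r
# Made-up, right-skewed values standing in for CNG_Compression.
comp <- c(200, 250, 300, 400, 600, 700, 900, 1200, 10000)

m_mean   <- mean(comp)     # pulled upward by the single very large value
m_median <- median(comp)   # middle value, not affected by the extreme

# plot = FALSE returns the boxplot statistics instead of drawing the plot.
b <- boxplot(comp, plot = FALSE)
whisker_stats <- b$stats[, 1]  # lower whisker, lower hinge, median, upper hinge, upper whisker
outliers      <- b$out         # points beyond 1.5 x IQR from the hinges (the default rule)

print(m_mean)
print(m_median)
print(whisker_stats)
print(outliers)
```

In this toy case one extreme value drags the mean well above the median, which is the kind of pattern the boxplot question asks you to assess in the real data.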

Part 3B. Mapping Mean and Weighted Mean Centre

Visualize the locations of the CNG stations and the mean and weighted mean centres.

CNG_stations <- stations[stations$Fuel_Type == "CNG", ]
plot(CNG_stations$Longitude, CNG_stations$Latitude, xlab="Longitude", ylab="Latitude", main = "CNG Stations in US and Canada", xlim=c(-125, -63), ylim=c(25, 62))
n<-nrow(CNG_stations[1])
mc_x<-sum(CNG_stations$Longitude)/n
mc_y<-sum(CNG_stations$Latitude)/n
points(mc_x,mc_y,'p',pch=15,cex=2,col="blue")
wmc_x <- sum(as.numeric(CNG_stations$CNG_Compression * CNG_stations$Longitude), na.rm = TRUE) / sum(CNG_stations$CNG_Compression, na.rm = TRUE)
wmc_y <- sum(as.numeric(CNG_stations$CNG_Compression * CNG_stations$Latitude), na.rm = TRUE) / sum(CNG_stations$CNG_Compression, na.rm = TRUE)
points(wmc_x, wmc_y,'p',pch=15,cex=2,col='red')
legend("topright", legend = c("Mean centre", "Weighted mean centre"), pch = c(15,15), col = c("blue","red"))

Code for different colours in R can be found here: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
Symbol code in R can be found here: http://www.statmethods.net/advgraphs/parameters.html
Use xlim=c(,) or ylim=c(,) in plot() to change the axis ranges of the plot. Make sure that there isn’t much white space and that the legend does not cover the data points. You can also change the position of the legend.
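One possible way to choose axis limits is to derive them from the data and add a small margin. The sketch below is an assumption-laden illustration: the coordinates are made up, `pad_range()` is a hypothetical helper written for this example, and the plot is saved to a temporary PDF so the code runs outside RStudio as well.

```r
# Made-up coordinates (not the lab data).
lon <- c(-122.4, -118.2, -95.4, -79.4, -74.0)
lat <- c(37.8, 34.1, 29.8, 43.7, 40.7)

# Hypothetical helper: data range widened by a fraction on each side.
pad_range <- function(v, frac = 0.05) {
  r <- range(v, na.rm = TRUE)
  r + c(-1, 1) * frac * diff(r)
}

x_lim <- pad_range(lon)
y_lim <- pad_range(lat)

# Draw to a temporary PDF file; in RStudio you can skip pdf() and dev.off().
out_file <- file.path(tempdir(), "limits_demo.pdf")
pdf(out_file)
plot(lon, lat, xlab = "Longitude", ylab = "Latitude",
     main = "Toy stations", xlim = x_lim, ylim = y_lim, pch = 16)
points(mean(lon), mean(lat), pch = 15, cex = 2, col = "blue")
# Other legend positions include "topleft", "bottomright", and "bottomleft".
legend("bottomleft", legend = c("Station", "Mean centre"),
       pch = c(16, 15), col = c("black", "blue"))
dev.off()
```

Moving the legend to a corner with no points, as done here with "bottomleft", is usually simpler than enlarging the axes to make room for it.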

Question 8: Produce comments (i.e., a detailed explanation for each parameter in the code) that describe the execution of the statements given to you in the stations example. Make sure you provide a description for each set of statements and organize your answer in a manner similar to the following example. (16 marks)

Example: setwd("//medusa/StudentWork/(Your UTOR ID)/GGR276/Lab1")

The setwd() command tells R the current folder to look into when searching for data, saving outputs etc.

Explain what is occurring in each of the following lines of code:

stations <- read.csv("stations.csv", sep = ',', header = TRUE)
plot(CNG_stations$Longitude, CNG_stations$Latitude, xlab="Longitude", ylab="Latitude", main = "CNG Stations in US and Canada", xlim=c(-125, -63), ylim=c(25, 62))
n<-nrow(CNG_stations[1])
mc_x<-sum(CNG_stations$Longitude)/n
mc_y<-sum(CNG_stations$Latitude)/n
points(mc_x,mc_y,'p',pch=15,cex=2,col="blue")
wmc_x<-sum(as.numeric(CNG_stations$CNG_Compression*CNG_stations$Longitude))/sum(CNG_stations$CNG_Compression)
wmc_y<-sum(as.numeric(CNG_stations$CNG_Compression*CNG_stations$Latitude))/sum(CNG_stations$CNG_Compression)
points(wmc_x, wmc_y,'p',pch=15,cex=2,col='red')
legend("topright", legend = c("Mean centre", "Weighted mean centre"), pch = c(15,15), col = c("blue","red"))

Type your response here:

a. Reads the stations.csv file and assigns it to a data frame in R called stations; sep = ',' indicates that the data is comma separated, and header = TRUE indicates that the first row contains the column names.

b. Creates scatterplot for the longitude and latitude for the CNG stations, in which the longitude is on the x axis, and the latitude is on the y axis. The x axis label is Longitude, and the y axis label is Latitude. The title for the plot is CNG Stations in the US and Canada, and the x axis is limited between -125 and -63, the y axis is limited between 25 and 62.

c. The number of rows in the first column of CNG_stations, that is, the number of CNG stations, is assigned to the variable n.

d. The mean longitude is calculated by summing all longitudes and dividing by the number of stations, and is assigned to mc_x. The mean latitude is calculated the same way from the latitudes and is assigned to mc_y.

e. Adds the mean centre to the plot as a point: 'p' sets the type to points, pch=15 selects a square symbol, cex=2 doubles the default point size, and col="blue" makes the point blue.

f. The weighted mean centre is calculated for the longitude and latitude of the CNG stations, weighted by each station's CNG compression value. Each station's longitude and latitude are multiplied by its CNG compression, sum() totals these weighted values, and the total is divided by the sum of the compression values. The results are assigned to wmc_x and wmc_y.

g. Adds the weighted mean centre to the plot as a red square (pch=15) with point size cex=2. The blue square for the mean centre was added in step e.

h. Adds a legend in the top-right corner, in which a blue square represents the mean centre and a red square represents the weighted mean centre; pch = c(15,15) sets both legend symbols to squares.

Question 9: What do the mean and weighted mean center tell you about the distribution of the CNG station locations and their compression? Please explain. (3 marks)

The mean centre is the average location of all CNG stations, with every station counted equally. The weighted mean centre also accounts for each station's compression value, so it is pulled toward the stations with high compression values. Comparing the two shows whether compression capacity is concentrated in a different part of the map than the stations themselves.
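The pull toward heavily weighted stations can be sketched with three made-up stations (the coordinates and weights below are illustrative assumptions, not the lab data). The example uses base R's weighted.mean() and checks it against the sum-based formula used earlier in the lab.

```r
# Three made-up stations; the easternmost one has a much larger weight.
lon <- c(-120, -100, -80)
lat <- c(35, 40, 45)
w   <- c(100, 100, 1000)   # stand-in for CNG_Compression

# Unweighted mean centre: every station counts equally.
mc <- c(x = mean(lon), y = mean(lat))

# Weighted mean centre: each coordinate is weighted by w.
wmc <- c(x = weighted.mean(lon, w), y = weighted.mean(lat, w))

# Same result with the explicit formula sum(w * coordinate) / sum(w).
wmc_manual <- c(x = sum(w * lon) / sum(w), y = sum(w * lat) / sum(w))

print(mc)
print(wmc)
print(wmc_manual)
```

Here the weighted mean centre shifts east and north of the mean centre, toward the one station that carries most of the weight.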

Part 3C. Create and Visualize Subsets

Create Subsets

There are different types of fuel. Do they have the same mean centre? Imagine we are interested in how the distribution of ELEC stations differs from that of LPG stations. To find out, we will split the stations dataset into two subsets. One subset will contain only stations using ELEC as fuel and the other will contain only those using LPG as fuel.

ELEC_stations <- subset(stations, (stations$Fuel_Type == "ELEC"))
LPG_stations <- subset(stations, stations$Fuel_Type == "LPG")
#TODO

Once created, a subset can be opened in the data viewer with View(ELEC_stations), or printed in the Console by calling the object name (ELEC_stations). If you want to know the number of ELEC stations, you can run: nrow(ELEC_stations[1]).
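A quick way to sanity-check subsets is to compare their row counts against a frequency table of the fuel types. The sketch below uses a made-up data frame (`toy` and its values are assumptions for illustration) and shows both subset() and bracket indexing.

```r
# Made-up stations table (not the lab data).
toy <- data.frame(
  Fuel_Type = c("ELEC", "ELEC", "LPG", "CNG", "ELEC", "LPG"),
  Latitude  = c(34.1, 40.7, 29.8, 43.7, 37.8, 41.9),
  Longitude = c(-118.2, -74.0, -95.4, -79.4, -122.4, -87.6)
)

# Frequency of each fuel type in the full table.
counts <- table(toy$Fuel_Type)

# Two equivalent ways to keep only the rows for one fuel type.
elec <- subset(toy, Fuel_Type == "ELEC")
lpg  <- toy[toy$Fuel_Type == "LPG", ]

print(counts)
print(nrow(elec))
print(nrow(lpg))
```

If the row count of a subset does not match the corresponding entry in the table, the filter condition (for example, the spelling or case of the fuel code) is the first thing to check.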

Visualize Subsets

Question 10: Show the stations scatter plot using the subset data that you created. Use a different color for the subsets. Also plot the mean centres for the subsets. Be sure to include a legend, axis titles (with units), and a main title. Include your name and student number in brackets at the end of your main title. Do not show irrelevant information on the graph. Please also type your code in the code chunk below. (10 marks)

plot(ELEC_stations$Longitude, ELEC_stations$Latitude, 
     xlab="Longitude (°)", ylab="Latitude (°)", 
     main="Stations: ELEC vs LPG (Amroop Bains, 1008063863)", 
     col="red", pch=16, cex=0.5, xlim=c(-125, -63), ylim=c(25, 62))

# Add the LPG stations to the plot (using blue color)
points(LPG_stations$Longitude, LPG_stations$Latitude, 
       col="blue", pch=16, cex=0.5)

# Calculate and plot the mean center for ELEC stations
mc_ELEC_x <- mean(ELEC_stations$Longitude)
mc_ELEC_y <- mean(ELEC_stations$Latitude)
points(mc_ELEC_x, mc_ELEC_y, pch=15, cex=2, col="red")  # Square marker for mean center

# Calculate and plot the mean center for LPG stations
mc_LPG_x <- mean(LPG_stations$Longitude)
mc_LPG_y <- mean(LPG_stations$Latitude)
points(mc_LPG_x, mc_LPG_y, pch=15, cex=2, col="blue")  # Square marker for mean center

# Add a legend to the plot
legend("topright", 
       legend = c("ELEC Stations", "LPG Stations", "Mean Centre ELEC", "Mean Centre LPG"), 
       pch = c(16, 16, 15, 15),  # Use pch=15 (square) for mean centers
       col = c("red", "blue", "red", "blue"), 
       cex = 0.8)

#TODO

Part 3D. Dispersion of Subsets

Standard Deviation of Subsets

So far, we have measured the central tendency of the spatial data. How about dispersion? Standard deviation is a measure of dispersion that can be used to assess the distribution of spatial data. To calculate the orthogonal dispersion (east-west, north-south) of the CNG_stations dataset, we will apply the sd() function to Longitude and Latitude, respectively. Please do the same for the subsets in Question 11.

sd(CNG_stations$Longitude)

## [1] 16.70479

sd(CNG_stations$Latitude)

## [1] 5.167414
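The same sd() calculation can also be done for several fuel types in one pass. The sketch below is one possible approach using base R's aggregate() on a made-up data frame (`toy` and its values are illustrative assumptions, not the lab data); calling sd() on each subset separately gives the same numbers.

```r
# Made-up coordinates for two fuel types (not the lab data).
toy <- data.frame(
  Fuel_Type = c("ELEC", "ELEC", "ELEC", "LPG", "LPG", "LPG"),
  Longitude = c(-120, -100, -80, -95, -90, -85),
  Latitude  = c(30, 40, 50, 38, 40, 42)
)

# Standard deviation of longitude and latitude within each fuel type.
disp <- aggregate(cbind(Longitude, Latitude) ~ Fuel_Type, data = toy, FUN = sd)
print(disp)

# Cross-check one group against a direct sd() call on the subset.
elec_lon_sd <- sd(toy$Longitude[toy$Fuel_Type == "ELEC"])
print(elec_lon_sd)
```

The result is in the same units as the coordinates (degrees), with one row per fuel type, which makes the east-west and north-south dispersions easy to compare side by side.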

Question 11: Please show your code as well as the calculated standard deviation in your R Markdown. Provide a concise conclusion regarding the orthogonal dispersion for the stations dataset and subsets. These conclusions should include a short description of the dispersion and a comparison (i.e. stations vs. ELEC_stations vs. LPG_stations). Remember to include units of measurement in your response. (6 marks)

sd(ELEC_stations$Longitude)

## [1] 19.80361

sd(ELEC_stations$Latitude)

## [1] 5.722009

sd(LPG_stations$Longitude)

## [1] 15.63939

sd(LPG_stations$Latitude)

## [1] 6.650902

#TODO

Type your response here: For the CNG stations, the standard deviation of longitude was 16.70 degrees, so the stations are widely spread east-west, while the standard deviation of latitude was 5.17 degrees, a much narrower north-south dispersion. The ELEC stations had the widest east-west dispersion, with a longitude standard deviation of 19.80 degrees, but a narrow latitude standard deviation of 5.72 degrees. The LPG stations had a longitude standard deviation of 15.64 degrees, the smallest east-west dispersion of the three, and a latitude standard deviation of 6.65 degrees, the largest north-south dispersion of the three. In every category, dispersion is much greater in longitude than in latitude: ELEC stations had the highest east-west dispersion, and LPG stations had the highest north-south dispersion.

GGR276 Lab 1 Part 3 Understanding the GEO in Geostatistics

Amroop Bains

2024-12-23