rm(list=ls()); gc()
##          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 528671 28.3    1175765 62.8         NA   669445 35.8
## Vcells 975057  7.5    8388608 64.0      16384  1851708 14.2
stations <- read.csv("station.csv", sep = ',', header = TRUE)
Now that the data are imported, we are ready to calculate the median, mean, range, and quartiles of the compression.
summary(stations$CNG_Compression)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##     2.0   250.0   672.0   912.4  1182.8 10620.0  108907
Next, we will calculate the standard deviation of
CNG_Compression. If there are missing values in your dataset, add
na.rm = TRUE to your code to tell R to remove the NAs before
the calculation, like this:
sd(stations$CNG_Compression, na.rm = TRUE).
sd(stations$CNG_Compression, na.rm = TRUE)
## [1] 1067.735
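To illustrate what na.rm changes, here is a minimal sketch on a made-up vector (`x` is hypothetical and not taken from the stations data). It compares the default behaviour with na.rm = TRUE and with a manual calculation on the non-missing values.

```r
# Hypothetical toy vector with one missing value (not from the stations data)
x <- c(2, 4, 4, 6, NA)

sd_default <- sd(x)                # the NA propagates, so the result is NA
sd_clean   <- sd(x, na.rm = TRUE)  # the NA is dropped before the calculation

# Equivalent manual calculation of the sample standard deviation
x_ok <- x[!is.na(x)]
sd_manual <- sqrt(sum((x_ok - mean(x_ok))^2) / (length(x_ok) - 1))

print(sd_default)
print(sd_clean)
print(sd_manual)
```

Note that na.rm = TRUE silently shrinks the sample, so it is worth checking how many values are missing (for example with sum(is.na(x))) before interpreting the result.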
Question 6: Now it is your turn to write the code to calculate the median, mean, range, interquartile range, and standard deviation of the CNG_Dispensers. (2 marks)
summary(stations$CNG_Dispensers)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##       0       1       2       3       2      82  108292
sd(stations$CNG_Dispensers, na.rm = TRUE)
## [1] 7.148115
range(stations$CNG_Dispensers, na.rm = TRUE)
## [1] 0 82
IQR(stations$CNG_Dispensers, na.rm = TRUE)
## [1] 1
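As a cross-check on how these functions relate, the sketch below uses a made-up vector (`d` is hypothetical, standing in for a column such as CNG_Dispensers) to show that IQR() is the difference between the third and first quartiles, and that range() returns the minimum and maximum rather than a single width.

```r
# Hypothetical counts (not the real CNG_Dispensers values)
d <- c(0, 1, 1, 2, 2, 2, 3, 82, NA)

# First and third quartiles, ignoring the missing value
q <- quantile(d, probs = c(0.25, 0.75), na.rm = TRUE)

iqr_manual  <- unname(q[2] - q[1])   # Q3 - Q1
iqr_builtin <- IQR(d, na.rm = TRUE)  # same quantity from the built-in

rng   <- range(d, na.rm = TRUE)      # c(minimum, maximum)
width <- diff(rng)                   # maximum - minimum

print(c(iqr_manual = iqr_manual, iqr_builtin = iqr_builtin, width = width))
```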
Visualize the spread of CNG_Compression, and answer Question 7: According to the boxplot, would you use the median or the mean as your measure of central tendency? Explain and justify your choice. (2 marks)
Type your response here:
I would use the median as the measure of central tendency. The boxplot
shows a positively skewed distribution with many outliers above the
upper whisker, and the summary agrees: the mean (912.4) is well above
the median (672.0) because extreme values of up to 10620 pull it upward.
The median is resistant to such extreme values, so it better represents
a typical station. The mode, on the other hand, is better suited to
bimodal or multimodal data than to a skewed distribution.
boxplot(stations$CNG_Compression)
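The effect of a single extreme value on the mean and the median can be sketched on made-up numbers (`comp` is a hypothetical sample, not the real CNG_Compression column). boxplot.stats() returns the values that boxplot() would draw as individual outlier points.

```r
# Hypothetical right-skewed sample with one extreme value
comp <- c(200, 250, 300, 600, 700, 900, 1200, 10000)

m  <- mean(comp)     # pulled upward by the extreme value
md <- median(comp)   # depends only on the middle of the ordered data

# Values lying more than 1.5 times the hinge spread beyond the hinges
out <- boxplot.stats(comp)$out

# Recompute both measures without the flagged values
trimmed   <- comp[!(comp %in% out)]
m_trimmed  <- mean(trimmed)
md_trimmed <- median(trimmed)

print(c(mean = m, median = md))
print(out)
print(c(mean = m_trimmed, median = md_trimmed))
```

In this toy sample, removing the flagged value shifts the mean far more than the median, which is the behaviour that motivates preferring the median for skewed data.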
Visualize the locations of the CNG stations and the mean and weighted mean centres.
CNG_stations <- stations[stations$Fuel_Type == "CNG", ]
plot(CNG_stations$Longitude, CNG_stations$Latitude, xlab="Longitude", ylab="Latitude", main = "CNG Stations in US and Canada", xlim=c(-125, -63), ylim=c(25, 62))
n<-nrow(CNG_stations[1])
mc_x<-sum(CNG_stations$Longitude)/n
mc_y<-sum(CNG_stations$Latitude)/n
points(mc_x,mc_y,'p',pch=15,cex=2,col="blue")
wmc_x <- sum(as.numeric(CNG_stations$CNG_Compression * CNG_stations$Longitude), na.rm = TRUE) / sum(CNG_stations$CNG_Compression, na.rm = TRUE)
wmc_y <- sum(as.numeric(CNG_stations$CNG_Compression * CNG_stations$Latitude), na.rm = TRUE) / sum(CNG_stations$CNG_Compression, na.rm = TRUE)
points(wmc_x, wmc_y,'p',pch=15,cex=2,col='red')
legend("topright", legend = c("Mean centre", "Weighted mean centre"), pch = c(15,15), col = c("blue","red"))
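The same two centres can also be computed with mean() and weighted.mean(). The sketch below is a minimal example on a made-up data frame (`toy` and its values are hypothetical; only the column names mirror the lab). One detail worth knowing is that weighted.mean() with na.rm = TRUE drops missing values in the data but not in the weights, so rows with a missing weight are filtered out explicitly here.

```r
# Hypothetical mini data frame (values are made up)
toy <- data.frame(
  Longitude       = c(-120, -100, -80, -90),
  Latitude        = c(35, 40, 45, 30),
  CNG_Compression = c(100, 100, 800, NA)
)

# Unweighted mean centre: every station counts equally
mc <- c(x = mean(toy$Longitude), y = mean(toy$Latitude))

# Keep only rows where the weight is present
ok <- !is.na(toy$CNG_Compression)

# Weighted mean centre: stations with more compression count for more
wmc <- c(
  x = weighted.mean(toy$Longitude[ok], toy$CNG_Compression[ok]),
  y = weighted.mean(toy$Latitude[ok], toy$CNG_Compression[ok])
)

print(mc)
print(wmc)
```

In this toy case the heavily weighted station at (-80, 45) pulls the weighted mean centre toward the north-east of the unweighted one.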
Question 8: Produce comments (i.e., a detailed explanation for each parameter in the code) that describe the execution of the statements given to you in the stations example. Make sure you provide a description for each set of statements and organize your answer in a manner similar to the following example. (16 marks)
Example: setwd (“\medusaUTOR ID) The setwd() command tells R the current folder to look into when searching for data, saving outputs etc.
Explain what is occurring in each of the following lines of code:
stations <- read.csv("stations.csv", sep = ',', header = TRUE)
plot(CNG_stations$Longitude, CNG_stations$Latitude, xlab="Longitude", ylab="Latitude", main = "CNG Stations in US and Canada", xlim=c(-125, -63), ylim=c(25, 62))
n<-nrow(CNG_stations[1])
mc_x<-sum(CNG_stations$Longitude)/n
mc_y<-sum(CNG_stations$Latitude)/n
points(mc_x,mc_y,'p',pch=15,cex=2,col="blue")
wmc_x<-sum(as.numeric(CNG_stations$CNG_Compression*CNG_stations$Longitude))/sum(CNG_stations$CNG_Compression)
wmc_y<-sum(as.numeric(CNG_stations$CNG_Compression*CNG_stations$Latitude))/sum(CNG_stations$CNG_Compression)
points(wmc_x, wmc_y,'p',pch=15,cex=2,col='red')
legend("topright", legend = c("Mean centre", "Weighted mean centre"), pch = c(15,15), col = c("blue","red"))
Type your response here:
a) read.csv() reads the file “stations.csv” and stores it as the data frame stations; sep = ',' sets the comma as the value separator, and header = TRUE treats the first row as column names.
b) plot() draws a scatter plot with Longitude on the x-axis and Latitude on the y-axis; xlab and ylab label the axes, main sets the title, and xlim and ylim limit the range of the map.
c) nrow() returns the number of rows of CNG_stations (here applied to its first column), which is stored as n.
d) mc_x and mc_y are the coordinates of the mean centre: the sum of the longitudes and the sum of the latitudes of CNG_stations, each divided by n.
e) points() adds the mean centre to the plot; 'p' draws it as a point, pch = 15 makes it a filled square, cex = 2 doubles its size, and col = "blue" colours it blue.
f) wmc_x and wmc_y are the coordinates of the weighted mean centre: each longitude and latitude is multiplied by the station's CNG_Compression, the products are converted with as.numeric() and summed, and the sum is divided by the total CNG_Compression.
g) points() adds the weighted mean centre to the plot as a red filled square at double size.
h) legend() places a legend in the top-right corner of the plot; legend gives the labels “Mean centre” and “Weighted mean centre”, pch = c(15,15) gives both a square symbol, and col colours them blue and red.
Question 9: What do the mean and weighted mean center tell you about the distribution of the CNG station locations and their compression? Please explain. (3 marks)
Type your response here: The mean centre is the average of all CNG station coordinates, so it marks the geographic centre of the stations regardless of any weighting factor. The weighted mean centre weights each station by its CNG_Compression capacity, so stations with higher capacity have a larger influence. Both are measures of central tendency for spatial data and help us understand the overall distribution of CNG stations. If the two centres are close, compression capacity is distributed across space in roughly the same way as the stations themselves. If the two centres are far apart, high-capacity stations are concentrated in the direction in which the weighted mean centre is displaced from the mean centre.
There are different types of fuels. Do they have the same mean centre? Imagine we are interested in how the distribution of stations with ELEC differs from that of stations with LPG. To do this, we will split the stations dataset into two subsets. One subset will contain only stations using ELEC as fuel and the other will contain only those using LPG.
ELEC_stations <- subset(stations, (stations$Fuel_Type == "ELEC"))
LPG_stations <- subset(stations, (stations$Fuel_Type == "LPG"))
Once created, a subset can be inspected by calling
View(ELEC_stations), which opens it in the data viewer. If you
want to know the number of ELEC stations, you can run:
nrow(ELEC_stations).
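The lab uses both bracket indexing and subset() to filter rows. The two behave differently when the filter column contains missing values, which the sketch below shows on a made-up data frame (`toy` is hypothetical; whether the real Fuel_Type column contains NAs is not shown in the lab).

```r
# Hypothetical stand-in for the stations data frame
toy <- data.frame(
  Fuel_Type = c("ELEC", "LPG", "ELEC", "CNG", NA),
  Latitude  = c(43.7, 45.4, 49.3, 40.7, 51.0)
)

# subset() evaluates the condition inside the data frame and drops rows
# where the condition is NA
elec <- subset(toy, Fuel_Type == "ELEC")

# Bracket indexing returns an all-NA row for every NA in the condition
elec_brackets <- toy[toy$Fuel_Type == "ELEC", ]

n_subset   <- nrow(elec)
n_brackets <- nrow(elec_brackets)

# table() counts the stations per fuel type in one call (NAs excluded by default)
counts <- table(toy$Fuel_Type)

print(n_subset)
print(n_brackets)
print(counts)
```

If NA rows do slip in through bracket indexing, they inflate nrow() and turn sums without na.rm = TRUE into NA, so comparing the two row counts is a cheap sanity check.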
Question 10: Show the stations scatter plot using the subset data that you created. Use a different color for the subsets. Also plot the mean centres for the subsets. Be sure to include a legend, axis titles (with units), and a main title. Include your name and student number in brackets at the end of your main title. Do not show irrelevant information on the graph. Please also type your code in the code chunk below. (10 marks)
nrow(ELEC_stations)
## [1] 96202
nrow(LPG_stations)
## [1] 3549
# Scatter plot of both subsets, each in its own colour
plot(ELEC_stations$Longitude, ELEC_stations$Latitude,
xlab="Longitude (°)", ylab="Latitude (°)", main = "ELEC and LPG Stations in US and Canada (MAN LAI MANNIE SUM 1008964952)",
xlim=c(-125, -63), ylim=c(25, 62), pch=1, col="lightblue")
points(LPG_stations$Longitude, LPG_stations$Latitude, pch=1, col="orange")
# Mean centre of each subset
mELEC_x<-sum(ELEC_stations$Longitude)/nrow(ELEC_stations)
mELEC_y<-sum(ELEC_stations$Latitude)/nrow(ELEC_stations)
mLPG_x<-sum(LPG_stations$Longitude)/nrow(LPG_stations)
mLPG_y<-sum(LPG_stations$Latitude)/nrow(LPG_stations)
points(mELEC_x,mELEC_y,'p',pch=15,cex=2,col="blue")
points(mLPG_x, mLPG_y,'p',pch=15,cex=2,col='red')
# Legend distinguishes the station points from the mean centres
legend("topright", legend = c("ELEC stations", "LPG stations", "ELEC mean centre", "LPG mean centre"), pch = c(1,1,15,15), col = c("lightblue","orange","blue","red"))
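When more than two fuel types are compared, the per-subset mean centres can be computed in one step instead of one pair of variables per subset. This is a minimal sketch using aggregate() on a made-up data frame (`toy` and its values are hypothetical; only the column names mirror the lab).

```r
# Hypothetical mini data set (values are made up)
toy <- data.frame(
  Fuel_Type = c("ELEC", "ELEC", "LPG", "LPG"),
  Longitude = c(-80, -120, -100, -90),
  Latitude  = c(44, 36, 30, 50)
)

# One row per fuel type, holding the mean of each coordinate
centres <- aggregate(cbind(Longitude, Latitude) ~ Fuel_Type, data = toy, FUN = mean)

print(centres)
```

The resulting data frame can then be passed to points(), for example points(centres$Longitude, centres$Latitude, pch = 15, cex = 2), to draw all mean centres on an existing plot in a single call.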
So far, we have measured the central tendency of the spatial data. How about dispersion? Standard deviation is a measure of dispersion that can be used to assess the distribution of spatial data. To calculate the orthogonal dispersion (east-west and north-south) of the CNG_stations dataset, we will apply the sd() function to Longitude and Latitude, respectively. Please do the same for the subsets in Question 11.
sd(CNG_stations$Longitude)
## [1] 16.70479
sd(CNG_stations$Latitude)
## [1] 5.167413
Question 11: Please show your code as well as the calculated standard deviation in your R Markdown. Provide a concise conclusion regarding the orthogonal dispersion for the stations dataset and subsets. These conclusions should include a short description of the dispersion and a comparison (i.e. stations vs. ELEC_stations vs. LPG_stations). Remember to include units of measurement in your response. (6 marks)
sd(ELEC_stations$Longitude)
## [1] 19.80548
sd(ELEC_stations$Latitude)
## [1] 5.720748
sd(LPG_stations$Longitude)
## [1] 15.63939
sd(LPG_stations$Latitude)
## [1] 6.650902
Type your response here: The longitude standard deviation of CNG_stations is 16.70479°, while the latitude standard deviation of CNG_stations is 5.167413°. The longitude standard deviation of ELEC_stations is 19.80548°, while the latitude standard deviation of ELEC_stations is 5.720748°. The longitude standard deviation of LPG_stations is 15.63939°, while the latitude standard deviation of LPG_stations is 6.650902°. Standard deviation measures the spread of stations around their mean centre, not their position, so these values describe how widely each group is scattered rather than where it lies. East-west, ELEC_stations are the most dispersed, exceeding CNG_stations by 3.10069° and LPG_stations by 4.16609°, while LPG_stations are the least dispersed. North-south, LPG_stations are the most dispersed, exceeding CNG_stations by 1.483489°, while ELEC_stations are in the middle. Hence, all three groups are spread much more widely east-west than north-south. Compared with CNG_stations, ELEC_stations are more dispersed in both directions, and LPG_stations are slightly more compact east-west but more dispersed north-south.
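The comparison above can be laid out as a small table of standard deviations by fuel type. The sketch below does this with split() and sapply() on a made-up data frame (`toy` and its values are hypothetical, chosen so the results are round numbers); it also shows that a group can have a large spread regardless of where its centre lies.

```r
# Hypothetical mini data set (values are made up)
toy <- data.frame(
  Fuel_Type = c("ELEC", "ELEC", "ELEC", "LPG", "LPG", "LPG"),
  Longitude = c(-120, -100, -80, -95, -90, -85),
  Latitude  = c(30, 40, 50, 38, 40, 42)
)

# Standard deviation of each coordinate within each fuel type
sd_lon <- sapply(split(toy$Longitude, toy$Fuel_Type), sd)
sd_lat <- sapply(split(toy$Latitude, toy$Fuel_Type), sd)

# Rows are directions, columns are fuel types; units are degrees
spread <- rbind(Longitude = sd_lon, Latitude = sd_lat)

print(spread)
```

Reading across a row compares the fuel types in one direction; reading down a column compares east-west with north-south dispersion for one fuel type.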