All data for this assignment is downloaded from the WVU Geog 693c Open-Source Spatial Analytics (R) course website.
setwd("C:/SW/Grad_WVU/693c/A3_A4/Summary_and_Stats")
matts_movies <- read.csv("matts_movies.csv", header = TRUE, stringsAsFactors = TRUE)
mine_class <- read.csv("mine_classification_with_lidar.csv", header = TRUE, stringsAsFactors = TRUE)
wetland_bn <- read.csv("wetland_binary.csv", header = TRUE, stringsAsFactors = TRUE)
t1 <- mine_class %>%
group_by(class) %>%
count() %>%
nrow()
t1
[1] 5
There are 5 different land cover types.
To obtain the result of 5 different land cover types, I used the group_by function to group by “class”, then count to get the total number for each class, then counted the number of rows as each row represent a different land cover type.
t2 <- mine_class %>%
group_by(class) %>%
summarise(ndvi_avg = mean(ndvi))
t2
# A tibble: 5 x 2
class ndvi_avg
* <fct> <dbl>
1 barren -0.156
2 forest 0.0573
3 herb 0.0709
4 shrub 0.301
5 water -0.365
The shrub class has the highest average NDVI.
I used the group_by function to group by “class” then the summarise function to find the mean of each class’s NDVI.
t3 <- mine_class %>%
group_by(class) %>%
summarise(mean_height = mean(diff))
t3
# A tibble: 5 x 2
class mean_height
* <fct> <dbl>
1 barren 0.0895
2 forest 11.1
3 herb 0.0405
4 shrub 3.31
5 water 0.767
The forest class has the highest average height.
I used the group_by function to group by “class” then the summarise function to find the mean of each class’s highest mean height “diff”.
t4 <- mine_class %>%
filter(class == "forest") %>%
summarise(stan_dev = sd(ndvi))
t4
stan_dev
1 0.04083046
The standard deviation of NDVI for the forest class is 0.04083046.
I used the filter function to select the forest class then the summarise function to get the standard deviation of NDVI for the forest class.
t5 <- matts_movies %>%
filter(Genre== "Drama") %>%
summarise(mean_rate = mean(My.Rating))
t5
mean_rate
1 7.140561
The mean rating for the genre Drama is 7.140561.
I used the filter function to select the Drama class then the summarise function to get the mean of My.Rating for the genre class.
t6 <- matts_movies %>%
group_by(Genre) %>%
summarise(range_rating= max(My.Rating) - min(My.Rating))
t6
# A tibble: 16 x 2
Genre range_rating
* <fct> <dbl>
1 Action 9.2
2 Classic 4.3
3 Comedy 8.2
4 Documentary 3.44
5 Drama 9.11
6 Family 8.85
7 Fantasy 3.64
8 Foreign 5.87
9 Horror 7.95
10 Independent 8.11
11 Romance 6.39
12 Sci-Fi 8.03
13 Sports 5.84
14 Thriller 6.8
15 War 4.3
16 Western 4.12
Action has the largest range of ratings.
I used the group_by function to group by Genre then the summarise function to get the range of ratings by subtracting the max rating and min rating for each each genre.
t7 <- cor(wetland_bn$slp_d, wetland_bn$diss_a, method="spearman")
t7
[1] 0.5006004
The correlation coefficient is 0.5006004.
I used the correlation function with the method set to spearman to evaluate the correlation coefficient between slope in degrees and topographic dissection. The Spearman’s Rank Correlation Coefficient is useful to evaluate the relationship between two groups. An explanation and example of Spearman’s method can be found here, provided by the Barcelona Field Studies Centre.
t.test(slp_d ~ class, data = wetland_bn)
Welch Two Sample t-test
data: slp_d by class
t = 48.453, df = 1277.9, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
12.76741 13.84493
sample estimates:
mean in group not mean in group wet
15.478290 2.172118
The average slope between wetlands (class “wet”) and not wetlands (class “not”) is different. With an extremely small p-value, the results suggests statistical difference.
I used the t.test() function to evaluate the average slope between wetlands and not wetlands. T-tests are parametric tests that assumes variable independence, normal distribution, and is used to evaluate if means of two groups are different. When the p-value is lower than 0.05, in this case, it indicates failure to reject the null hypothesis.
wilcox.test(slp_d ~ class, data = wetland_bn)
Wilcoxon rank sum test with continuity correction
data: slp_d by class
W = 958189, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
The Mann-Whitney U test result produces a statistically significant p-value, which suggests that the average slope between wetlands and not wetlands is statstically different.
I applied the wilcox.test() function to compare the average slope between wetlands and not wetlands classes. The Mann-Whitney U test, also known as the Wilcoxon Rank Sum Test, is a nonparametric alternative to the t-test but it still assumes independence between variables, groups, and/or samples. As with the t-test, it is used to evaluate if means of two groups are different.
More information about the Mann-Whitney U Test
comedy_drama <- matts_movies %>%
filter(Genre == "Comedy" | Genre == "Drama")
t.test(My.Rating ~ Genre, data = comedy_drama)
Welch Two Sample t-test
data: My.Rating by Genre
t = -3.5081, df = 547.72, p-value = 0.0004884
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.6311229 -0.1780421
sample estimates:
mean in group Comedy mean in group Drama
6.735978 7.140561
The small p-value result suggest that there is statistically significant difference between the average rating of Drama and Comedy movies.
I selected the Comedy and Drama genres using the filter() function then applied the t.test() function to the filtered data to evaluate the average rating between the Drama and Comedy genres.
wilcox.test(My.Rating ~ Genre, data = comedy_drama)
Wilcoxon rank sum test with continuity correction
data: My.Rating by Genre
W = 36163, p-value = 0.0001081
alternative hypothesis: true location shift is not equal to 0
The Mann-Whitney U test result suggest that average rating is statistically different between Drama and Comedy movies.
I applied the wilcox.test() function to compare the average rating between Drama and Comedy genres.
t12 <- aov(ndvi ~ class, data = mine_class)
summary(t12)
Df Sum Sq Mean Sq F value Pr(>F)
class 4 190.59 47.65 11309 <2e-16 ***
Residuals 4512 19.01 0.00
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA result produces a statistically significant p-value, which suggests that at least one pair of land cover types have statistically different mean NDVI.
Analysis of variance, ANOVA, has the ability to compare variables and groups than are greater than two, as compared to the previous t-test and Mann-Whitney U test. Using the one-way ANOVA method, I apply the aov() function to examine if mean NDVI is different between land cover types.
TukeyHSD(t12)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = ndvi ~ class, data = mine_class)
$class
diff lwr upr p adj
forest-barren 0.21346127 0.205539885 0.22138265 0.00e+00
herb-barren 0.22709823 0.219176844 0.23501961 0.00e+00
shrub-barren 0.45712849 0.449207110 0.46504987 0.00e+00
water-barren -0.20884992 -0.218444669 -0.19925518 0.00e+00
herb-forest 0.01363696 0.005715578 0.02155834 2.67e-05
shrub-forest 0.24366723 0.235745844 0.25158861 0.00e+00
water-forest -0.42231119 -0.431905935 -0.41271645 0.00e+00
shrub-herb 0.23003027 0.222108885 0.23795165 0.00e+00
water-herb -0.43594815 -0.445542895 -0.42635341 0.00e+00
water-shrub -0.66597842 -0.675573161 -0.65638367 0.00e+00
The p-value of 0 for the land cover pair shrub-forest suggests that there is statistical significant difference between the NDVI of forest and shrub classes.
To compare each land cover pair combination, I applied the TukeyHSD function to run a Tukey’s Honest Significant Difference pair-wise test
Further explanation of the TukeyHSD function
qqPlot(lm(ndvi ~ class, data = mine_class), simulate = TRUE, main="QQ Plot", labels = FALSE)
[1] 1213 1214
This plot suggests that the residuals are not normally distributed, challenging the assumption that they are normally distributed and other ANOVA assumptions.
Q-Q Plots are used to assess the normality of data through comparison of data quantiles to theoretical quantiles. The diverging points from the line suggest that the data is not normally distribution. I used the qqPlot() function to to represent the distribution of the model residuals.
bartlett.test(ndvi ~ class, data = mine_class)
Bartlett test of homogeneity of variances
data: ndvi by class
Bartlett's K-squared = 2017.8, df = 4, p-value < 2.2e-16
The Bartlett Test result returns a statistically significant p-value which suggests that there is not consistent and/or equal variance between the classes.
The Bartlett Test of Homogeneity of Variance is useful to assess if variance is constant between different groups. I used the bartlett.test() function to assess the degree of variance between classes for NDVI.
outlierTest(t12)
rstudent unadjusted p-value Bonferroni p
1213 7.686756 1.8401e-14 8.3115e-11
1214 7.384167 1.8173e-13 8.2088e-10
3962 6.612750 4.2114e-11 1.9023e-07
1233 6.044188 1.6227e-09 7.3299e-06
1194 5.858383 5.0063e-09 2.2614e-05
1228 5.348877 9.2881e-08 4.1954e-04
2836 5.169711 2.4460e-07 1.1048e-03
780 -5.037194 4.9085e-07 2.2172e-03
4347 -4.782641 1.7852e-06 8.0639e-03
4432 4.748768 2.1101e-06 9.5312e-03
The Bonferroni Outlier Test does suggest the presence of outliers.
The Bonferroni Outlier Test is used to determine if outliers exist as the ANOVA method is sensitive to outliers. I used the outlierTest() function on the ANOVA fit data from the t12 variable.
kruskal.test(ndvi ~ class, data = mine_class)
Kruskal-Wallis rank sum test
data: ndvi by class
Kruskal-Wallis chi-squared = 3974.3, df = 4, p-value < 2.2e-16
The statistically significant p-value suggests that at least two class types, one pair of groups, have different mean NDVI.
The Kruskal-Wallis Rank Sum Test evaluates difference between groups, like ANOVA does, but it instead using a nonparametric approach. Its results provide information on if a pair exhibits difference but does not identify which pair. The kruskal.test() function is applied to evaluate the mean NDVI between land cover type pairs.
pairw.kw(mine_class$ndvi, mine_class$class, conf = 0.95)
95% Confidence intervals for Kruskal-Wallis comparisons
Diff Lower Upper Decision Adj. P-value
barren-forest -1451.9575 -1615.6656 -1288.2494 Reject H0 0
barren-herb -1597.733 -1761.4411 -1434.0249 Reject H0 0
forest-herb -145.7755 -309.4836 17.9326 FTR H0 0.124351
barren-shrub -2993.107 -3156.8151 -2829.3989 Reject H0 0
forest-shrub -1541.1495 -1704.8576 -1377.4414 Reject H0 0
herb-shrub -1395.374 -1559.0821 -1231.6659 Reject H0 0
barren-water 715.12336 516.8325 913.41422 Reject H0 0
forest-water 2167.08086 1968.79 2365.37172 Reject H0 0
herb-water 2312.85636 2114.5655 2511.14722 Reject H0 0
shrub-water 3708.23036 3509.9395 3906.52122 Reject H0 0
The pairwise test produces a non-significant p-value for the forest-herb class pair, which fails to reject the null hypothesis and suggests that there is not difference in mean NDVI at a 95% confidence interval.
The Kruskal-Wallis Rank Sum Test identifies if a pair of groups are different but does not identify which pair. A p-value lower than 0.05 and a confidence interval that does not include zero does suggest difference between the groups. I applied the pairwkw() function to produce the pair comparisons for NDVI and evaluate if forest and herb class pair differ.