All data for this assignment is downloaded from the WVU Geog 693c Open-Source Spatial Analytics (R) course website.

setwd("C:/SW/Grad_WVU/693c/A3_A4/Summary_and_Stats")
matts_movies <- read.csv("matts_movies.csv", header = TRUE, stringsAsFactors = TRUE)
mine_class <- read.csv("mine_classification_with_lidar.csv", header = TRUE, stringsAsFactors = TRUE)
wetland_bn <- read.csv("wetland_binary.csv", header = TRUE, stringsAsFactors = TRUE)

How many different land cover types (“class”) are differentiated in the mine classification data set?

t1 <- mine_class %>%
  group_by(class) %>%
  count() %>%
  nrow()
t1
[1] 5

There are 5 different land cover types.
To obtain the result of 5 different land cover types, I used the group_by function to group by “class”, then count to get the total number for each class, then counted the number of rows as each row represent a different land cover type.

Which land cover type (“class”) has the highest mean normalized difference vegetation index (“ndvi”)?

t2 <- mine_class %>%
  group_by(class) %>%
  summarise(ndvi_avg = mean(ndvi))
t2
# A tibble: 5 x 2
  class  ndvi_avg
* <fct>     <dbl>
1 barren  -0.156 
2 forest   0.0573
3 herb     0.0709
4 shrub    0.301 
5 water   -0.365 

The shrub class has the highest average NDVI.
I used the group_by function to group by “class” then the summarise function to find the mean of each class’s NDVI.

Which land cover type (“class”) has the highest mean height (“diff”)?

t3 <- mine_class %>%
  group_by(class) %>%
  summarise(mean_height = mean(diff))
t3
# A tibble: 5 x 2
  class  mean_height
* <fct>        <dbl>
1 barren      0.0895
2 forest     11.1   
3 herb        0.0405
4 shrub       3.31  
5 water       0.767 

The forest class has the highest average height.
I used the group_by function to group by “class” then the summarise function to find the mean of each class’s highest mean height “diff”.

What is the standard deviation of NDVI (“ndvi”) for the forest class (“class” forest)?

t4 <- mine_class %>%
  filter(class == "forest") %>%
  summarise(stan_dev = sd(ndvi))
t4
    stan_dev
1 0.04083046

The standard deviation of NDVI for the forest class is 0.04083046.
I used the filter function to select the forest class then the summarise function to get the standard deviation of NDVI for the forest class.

What is the mean rating (“My.Rating”) for the dramas (“Genre” “Drama”) in the movie data set?

t5 <- matts_movies %>%
  filter(Genre== "Drama") %>%
  summarise(mean_rate = mean(My.Rating))
t5
  mean_rate
1  7.140561

The mean rating for the genre Drama is 7.140561.
I used the filter function to select the Drama class then the summarise function to get the mean of My.Rating for the genre class.

Which genre (“Genre”) has the largest range of ratings (“My.Rating”) in the movies data set?

t6 <- matts_movies %>%
  group_by(Genre) %>%
  summarise(range_rating= max(My.Rating) - min(My.Rating))
t6
# A tibble: 16 x 2
   Genre       range_rating
 * <fct>              <dbl>
 1 Action              9.2 
 2 Classic             4.3 
 3 Comedy              8.2 
 4 Documentary         3.44
 5 Drama               9.11
 6 Family              8.85
 7 Fantasy             3.64
 8 Foreign             5.87
 9 Horror              7.95
10 Independent         8.11
11 Romance             6.39
12 Sci-Fi              8.03
13 Sports              5.84
14 Thriller            6.8 
15 War                 4.3 
16 Western             4.12

Action has the largest range of ratings.
I used the group_by function to group by Genre then the summarise function to get the range of ratings by subtracting the max rating and min rating for each each genre.

Using the Spearman method, what is the correlation coefficient between slope in degrees (“slp_d”) and topographic dissection (“diss_a”) in the wetland data set?

t7 <- cor(wetland_bn$slp_d, wetland_bn$diss_a, method="spearman")
t7
[1] 0.5006004

The correlation coefficient is 0.5006004.
I used the correlation function with the method set to spearman to evaluate the correlation coefficient between slope in degrees and topographic dissection. The Spearman’s Rank Correlation Coefficient is useful to evaluate the relationship between two groups. An explanation and example of Spearman’s method can be found here, provided by the Barcelona Field Studies Centre.

Use a t-test to assess if average slope (“slp_d”) is different between wetlands (“class” wet) and not wetlands (“class” not). What does the result suggest?

t.test(slp_d ~ class, data = wetland_bn)

    Welch Two Sample t-test

data:  slp_d by class
t = 48.453, df = 1277.9, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 12.76741 13.84493
sample estimates:
mean in group not mean in group wet 
        15.478290          2.172118 

The average slope between wetlands (class “wet”) and not wetlands (class “not”) is different. With an extremely small p-value, the results suggests statistical difference.
I used the t.test() function to evaluate the average slope between wetlands and not wetlands. T-tests are parametric tests that assumes variable independence, normal distribution, and is used to evaluate if means of two groups are different. When the p-value is lower than 0.05, in this case, it indicates failure to reject the null hypothesis.

Use a Mann-Whitney U test to assess if average slope (“slp_d”) is different between wetlands (“class” wet) and not wetlands (“class” not). What does the result suggest?

wilcox.test(slp_d ~ class, data = wetland_bn)

    Wilcoxon rank sum test with continuity correction

data:  slp_d by class
W = 958189, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0

The Mann-Whitney U test result produces a statistically significant p-value, which suggests that the average slope between wetlands and not wetlands is statstically different.
I applied the wilcox.test() function to compare the average slope between wetlands and not wetlands classes. The Mann-Whitney U test, also known as the Wilcoxon Rank Sum Test, is a nonparametric alternative to the t-test but it still assumes independence between variables, groups, and/or samples. As with the t-test, it is used to evaluate if means of two groups are different.

More information about the Mann-Whitney U Test

Use a t-test to assess if the average rating (“My.Rating”) is different between dramas (“Genre” Drama) and comedies (“Genre” Comedy). What does the result suggest?

comedy_drama <- matts_movies %>%
  filter(Genre == "Comedy" | Genre == "Drama")

t.test(My.Rating ~ Genre, data = comedy_drama)

    Welch Two Sample t-test

data:  My.Rating by Genre
t = -3.5081, df = 547.72, p-value = 0.0004884
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.6311229 -0.1780421
sample estimates:
mean in group Comedy  mean in group Drama 
            6.735978             7.140561 

The small p-value result suggest that there is statistically significant difference between the average rating of Drama and Comedy movies.
I selected the Comedy and Drama genres using the filter() function then applied the t.test() function to the filtered data to evaluate the average rating between the Drama and Comedy genres.

Use a Mann-Whitney U test to assess if the average rating (“My.Rating”) is different between dramas (“Genre” Drama) and comedies (“Genre” Comedy). What does the result suggest?

wilcox.test(My.Rating ~ Genre, data = comedy_drama)

    Wilcoxon rank sum test with continuity correction

data:  My.Rating by Genre
W = 36163, p-value = 0.0001081
alternative hypothesis: true location shift is not equal to 0

The Mann-Whitney U test result suggest that average rating is statistically different between Drama and Comedy movies.
I applied the wilcox.test() function to compare the average rating between Drama and Comedy genres.

Use ANOVA to assess whether there is statistical difference in mean NDVI (“ndvi”) between at least two land cover type (“class”). What does the result suggest?

t12 <- aov(ndvi ~ class, data = mine_class)
summary(t12)
              Df Sum Sq Mean Sq F value Pr(>F)    
class          4 190.59   47.65   11309 <2e-16 ***
Residuals   4512  19.01    0.00                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA result produces a statistically significant p-value, which suggests that at least one pair of land cover types have statistically different mean NDVI.
Analysis of variance, ANOVA, has the ability to compare variables and groups than are greater than two, as compared to the previous t-test and Mann-Whitney U test. Using the one-way ANOVA method, I apply the aov() function to examine if mean NDVI is different between land cover types.

Use Tukey’s Honest Significant Difference test to assess whether there is a difference between the forest (“class” forest) and shrub (“class” shrub) land cover types specifically. What does the result suggest?

TukeyHSD(t12)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = ndvi ~ class, data = mine_class)

$class
                     diff          lwr         upr    p adj
forest-barren  0.21346127  0.205539885  0.22138265 0.00e+00
herb-barren    0.22709823  0.219176844  0.23501961 0.00e+00
shrub-barren   0.45712849  0.449207110  0.46504987 0.00e+00
water-barren  -0.20884992 -0.218444669 -0.19925518 0.00e+00
herb-forest    0.01363696  0.005715578  0.02155834 2.67e-05
shrub-forest   0.24366723  0.235745844  0.25158861 0.00e+00
water-forest  -0.42231119 -0.431905935 -0.41271645 0.00e+00
shrub-herb     0.23003027  0.222108885  0.23795165 0.00e+00
water-herb    -0.43594815 -0.445542895 -0.42635341 0.00e+00
water-shrub   -0.66597842 -0.675573161 -0.65638367 0.00e+00

The p-value of 0 for the land cover pair shrub-forest suggests that there is statistical significant difference between the NDVI of forest and shrub classes.
To compare each land cover pair combination, I applied the TukeyHSD function to run a Tukey’s Honest Significant Difference pair-wise test
Further explanation of the TukeyHSD function

Create a QQ plot for the model residuals to assess whether the residuals are normally distributed. Does the plot suggest that there is an issue with this assumption?

qqPlot(lm(ndvi ~ class, data = mine_class), simulate = TRUE, main="QQ Plot", labels = FALSE)

[1] 1213 1214

This plot suggests that the residuals are not normally distributed, challenging the assumption that they are normally distributed and other ANOVA assumptions.
Q-Q Plots are used to assess the normality of data through comparison of data quantiles to theoretical quantiles. The diverging points from the line suggest that the data is not normally distribution. I used the qqPlot() function to to represent the distribution of the model residuals.

Use the Bartlett Test of Homogeneity of Variance to assess whether there is consistent variance between the classes. What does this test suggest?

bartlett.test(ndvi ~ class, data = mine_class)

    Bartlett test of homogeneity of variances

data:  ndvi by class
Bartlett's K-squared = 2017.8, df = 4, p-value < 2.2e-16

The Bartlett Test result returns a statistically significant p-value which suggests that there is not consistent and/or equal variance between the classes.
The Bartlett Test of Homogeneity of Variance is useful to assess if variance is constant between different groups. I used the bartlett.test() function to assess the degree of variance between classes for NDVI.

Use the Bonferroni Outlier Test to test for outliers. Does this test suggest the presence of outliers?

outlierTest(t12)
      rstudent unadjusted p-value Bonferroni p
1213  7.686756         1.8401e-14   8.3115e-11
1214  7.384167         1.8173e-13   8.2088e-10
3962  6.612750         4.2114e-11   1.9023e-07
1233  6.044188         1.6227e-09   7.3299e-06
1194  5.858383         5.0063e-09   2.2614e-05
1228  5.348877         9.2881e-08   4.1954e-04
2836  5.169711         2.4460e-07   1.1048e-03
780  -5.037194         4.9085e-07   2.2172e-03
4347 -4.782641         1.7852e-06   8.0639e-03
4432  4.748768         2.1101e-06   9.5312e-03

The Bonferroni Outlier Test does suggest the presence of outliers.
The Bonferroni Outlier Test is used to determine if outliers exist as the ANOVA method is sensitive to outliers. I used the outlierTest() function on the ANOVA fit data from the t12 variable.

Use the Kruskal-Wallis Rank Sum Test to assess whether at least two cover types (“class”) have a different mean NDVI (“ndvi”). What does this test suggest?

kruskal.test(ndvi ~ class, data = mine_class)

    Kruskal-Wallis rank sum test

data:  ndvi by class
Kruskal-Wallis chi-squared = 3974.3, df = 4, p-value < 2.2e-16

The statistically significant p-value suggests that at least two class types, one pair of groups, have different mean NDVI.
The Kruskal-Wallis Rank Sum Test evaluates difference between groups, like ANOVA does, but it instead using a nonparametric approach. Its results provide information on if a pair exhibits difference but does not identify which pair. The kruskal.test() function is applied to evaluate the mean NDVI between land cover type pairs.

Use a pairwise Kruskal Wallace test to assess whether the herbaceous (“class” herb) and forest (“class” forest) types have different mean NDVI at a 95% confidence interval. What does this test suggest?

pairw.kw(mine_class$ndvi, mine_class$class, conf = 0.95)

95% Confidence intervals for Kruskal-Wallis comparisons 

                    Diff      Lower      Upper  Decision Adj. P-value
barren-forest -1451.9575 -1615.6656 -1288.2494 Reject H0            0
barren-herb    -1597.733 -1761.4411 -1434.0249 Reject H0            0
forest-herb    -145.7755  -309.4836    17.9326    FTR H0     0.124351
barren-shrub   -2993.107 -3156.8151 -2829.3989 Reject H0            0
forest-shrub  -1541.1495 -1704.8576 -1377.4414 Reject H0            0
herb-shrub     -1395.374 -1559.0821 -1231.6659 Reject H0            0
barren-water   715.12336   516.8325  913.41422 Reject H0            0
forest-water  2167.08086    1968.79 2365.37172 Reject H0            0
herb-water    2312.85636  2114.5655 2511.14722 Reject H0            0
shrub-water   3708.23036  3509.9395 3906.52122 Reject H0            0

The pairwise test produces a non-significant p-value for the forest-herb class pair, which fails to reject the null hypothesis and suggests that there is not difference in mean NDVI at a 95% confidence interval.
The Kruskal-Wallis Rank Sum Test identifies if a pair of groups are different but does not identify which pair. A p-value lower than 0.05 and a confidence interval that does not include zero does suggest difference between the groups. I applied the pairwkw() function to produce the pair comparisons for NDVI and evaluate if forest and herb class pair differ.