Biological Stream Survey

Initial Setup

library(dplyr)
library(plyr)
## [1] "FinalProject_JordanWinemiller.html"       
## [2] "FinalProject_JordanWinemiller.Rmd"        
## [3] "Jordan Winemiller_Project 2_original.docx"
## [4] "LongnoseDace.txt"                         
## [5] "Project 2_Jordan Winemiller_original.R"   
## [6] "rsconnect"

Note: This project was done off of a Linear Regression Course in 2015 from UMKC; and the original file, and report are attached in the .zip file.
The repurposing of this report is to see the difference in original presentation in R console with Microsoft Word presentation to using packages in R Studio with R markdown.

Analysis Report

Abstract

According to the Maryland Biological Stream Survey, the number of longnose dace per 75 meter section of stream can be predicted using the area (in acres) drained by the stream, the dissolved oxygen (in mg/liter), the maximum depth (in cm) of the 75 meter segment, the nitrate concentration (in mg/liter), the sulfate concentration (in mg/liter), and the water temperature on the sampling date (in degrees C). The data are available in the file “LongnoseDace.txt.”

Overview

In this report we will be testing the amount longnose dace (Fish) in the stream mention in the abstract from the problem. In order to accomplish this we will be testing a segments of stream broken into 75 meter sections. We will be using six non-categorical predictors in our initial model for regression. These predictors are as follows.

degree.symbol <- paste0(intToUtf8(176),'C')
names <- c('Fish','Area','Oxygen','Depth','Nitrate','Sulfate','Temperature') 
text <- c('Number of longnose dace per 75 meter section',
          'Area of drained by the stream (acres)','Dissolved oxygen (mg/liter)',
          'Maximum Depth (cm)','Nitrate concentration (mg/liter)',
          'Sulfate concentration (mg/liter)','Water temperature on sampling date (.)')
descriptions <- gsub('[.]', degree.symbol, text)
variables <- c('Y for the response variable','x1 first predictor variable',
               'x2 second predictor variable',
               'x3 third predictor variable','x4 fourth predictor variable',
               'x5 fifth predictor variable','x6 sixth predictor variable')
columns <- c('Name', 'Description', 'Variable Name')
table.description <- cbind(names, descriptions, variables)
knitr::kable(table.description, col.names = columns, align = 'c',
             caption = 'Variable Discription Table')
Variable Discription Table
Name Description Variable Name
Fish Number of longnose dace per 75 meter section Y for the response variable
Area Area of drained by the stream (acres) x1 first predictor variable
Oxygen Dissolved oxygen (mg/liter) x2 second predictor variable
Depth Maximum Depth (cm) x3 third predictor variable
Nitrate Nitrate concentration (mg/liter) x4 fourth predictor variable
Sulfate Sulfate concentration (mg/liter) x5 fifth predictor variable
Temperature Water temperature on sampling date (°C) x6 sixth predictor variable

The degree symbol was brought in from UTF8 standards with \(\hat{A}\) as a consequence.

Importing Data

raw.data <- as.data.frame(read.table('LongnoseDace.txt'))
raw.remove <- dplyr::select(raw.data, 2:8)
data <- plyr::rename(raw.remove, c('V2' = 'Fish', 'V3' = 'Area', 'V4' = 'Oxygen',
                                   'V5' = 'Depth', 'V6' = 'Nitrate', 'V7' = 'Sulfate',
                                   'V8' = 'Temp'))
data
##    Fish  Area Oxygen Depth Nitrate Sulfate Temp
## 1    13  2528    9.6    80    2.28   16.75 15.3
## 2    12  3333    8.5    83    5.34    7.74 19.4
## 3    54 19611    8.3    96    0.99   10.92 19.5
## 4    19  3570    9.2    56    5.44   16.53 17.0
## 5    37  1722    8.1    43    5.66    5.91 19.3
## 6     2   583    9.2    51    2.26    8.81 12.9
## 7    72  4790    9.4    91    4.10    5.65 16.7
## 8   164 35971   10.2    81    3.20   17.53 13.8
## 9    18 25440    7.5   120    3.53    8.20 13.7
## 10    1  2217    8.5    46    1.20   10.85 14.3
## 11   53  1971   11.9    56    3.25   11.12 22.2
## 12   16 12620    8.3    37    0.61   18.87 16.8
## 13   32 19046    8.3   120    2.93   11.31 18.0
## 14   21  8612    8.2   103    1.57   16.09 15.0
## 15   23  3896   10.4   105    2.77   12.79 18.4
## 16   18  6298    8.6    42    0.26   17.63 18.2
## 17  112 27350    8.5    65    6.95   14.94 24.1
## 18   25  4145    8.7    51    0.34   44.93 23.0
## 19    5  1175    7.7    57    1.30   21.68 21.8
## 20   26  8297    9.9    60    5.26    6.36 19.1
## 21    8  7814    6.8   160    0.44   20.24 22.6
## 22   15  1745    9.4    48    2.19   10.27 14.3
## 23   11  5046    7.6   109    0.73    7.10 19.0
## 24   11 18943    9.2    50    0.25   14.21 18.5
## 25   87  8624    8.6    78    3.37    7.51 21.3
## 26   33  2225    9.1    41    2.30    9.72 20.5
## 27   22 12659    9.7    65    3.30    5.98 18.0
## 28   98  1967    8.6    50    7.71   26.44 16.8
## 29    1  1172    8.3    73    2.62    4.64 20.5
## 30    5   639    9.5    26    3.53    4.46 20.1
## 31    1  7056    6.4    60    0.25    9.82 24.5
## 32   38  1934   10.5    85    2.34   11.44 12.0
## 33   30  6260    9.5   133    2.41   13.77 21.0
## 34   12   424    8.3    62    3.49    5.82 20.2
## 35   24  3488    9.3    44    2.11   13.37 24.0
## 36    6  3330    9.1    67    0.81    8.16 14.9
## 37   15  2227    6.8    54    0.33    7.60 24.0
## 38   38  8115    9.6   110    3.40    9.22 20.5
## 39   84  1600   10.2    56    3.54    5.69 19.5
## 40    3 15305    9.7    85    2.60    6.96 17.5
## 41   18  7121    9.5    58    0.51    7.41 16.0
## 42   63  5794    9.4    34    1.19   12.27 17.5
## 43  239  8636    8.4   150    3.31    5.95 18.1
## 44  234  4803    8.5    93    5.01   10.98 24.3
## 45    6  1097    8.3    53    1.71   15.77 13.1
## 46   76  9765    9.3   130    4.38    5.74 16.9
## 47   25  4266    8.9    68    2.05   12.77 17.0
## 48    8  1507    7.4    51    0.84   16.30 21.0
## 49   23  3836    8.3   121    1.32    7.36 18.5
## 50   16 17419    7.4    48    0.29    2.50 18.0
## 51    6  8735    8.2    63    1.56   13.22 20.8
## 52  100 22550    8.4   107    1.41   14.45 23.0
## 53   80  9961    8.6    79    1.02    9.07 21.8
## 54   28  4706    8.9    61    4.06    9.90 19.7
## 55   48  4011    8.3    52    4.70    5.38 18.9
## 56   18  6949    9.3   100    4.57   17.84 18.6
## 57   36 11405    9.2    70    2.17   10.17 23.6
## 58   19   904    9.8    39    6.81    9.20 19.2
## 59   32  3332    8.4    73    2.09    5.50 17.7
## 60    3   575    6.8    33    2.47    7.61 18.0
## 61  106 29708    7.7    73    0.63   12.28 21.4
## 62   62  2511   10.2    60    4.17   10.75 17.7
## 63   23 18422    9.9    45    1.58    8.37 20.1
## 64    2  6311    7.6    46    0.64   21.16 18.5
## 65   26  1450    7.9    60    2.96    8.84 18.6
## 66   20  4106   10.0    96    2.62    5.45 15.4
## 67   38 10274    9.3    90    5.45   24.76 15.0
## 68   19   510    6.7    82    5.25   14.19 26.5

Exploratory Data Analysis

This matrix enables you to tell whether the response variable appears to have any association with any of the predictor variables, and if any two of the predictor variables appear to be correlated.

pairs(data, main="Scatterplot Matrix")

In the scatter plot matrix, we see a strong linear relationship between our response variable (Fish) and predictors for area, oxygen, depth, nitrate, and temperature. Opposed to the weaker relations for sulfate. In addition, where the plots line up between predictor will tell us about the correlation between those predictors. Most have an evenly distributed scatter about the plot, which lets us know that the correlation between those predictors is low which is preferable. In the column (or row) for sulfate there seems to be a pattern in which there is a heavy grouping on the left (or bottom) of the plots, which could mean a correlation to the other predictors.

Correlation Equation

\[ \varphi_{X,Y} = corr(X,Y) = \frac{cov(X,Y)}{\sigma_X\sigma_Y}=\frac{E(X - \mu_X)(Y-\mu_Y)}{\sigma_X\sigma_Y}\] Where \(\varphi\) is the correlation, and \(X\) would stand for one of the predictors, and \(Y\) would stand for another predictor. Example \(XY\), in these case would stand for \(X_1X_2\) or \(X_3X_6\). \(COV\) is the covariance between the two predictors. The symbol \(\sigma\), stands for the standard deviation for the mean for that predictor. \(E\), is the expected value of the top equation, where \(\mu\) is the mean or average of that predictor. So, the correlation is the expectation of first predictor minus its mean multiplied by the second predictor minus its mean all divided by the standard deviations of the first and second predictors multiplied together. In the table below these operations have been carried out in R, and have been put into a correlation matrix. From looking at the table it seems that there will be a problem with multicollinearity in the cells that have high values except for the diagonal of ones in the table, other issues may arise where the correlation is negative.

Correlation Matrix

This matrix has been edited to show only correlation between predictors

options(width = 100)
data.without.y <- dplyr::select(data,2:7)
cor(data.without.y)
##                 Area      Oxygen        Depth      Nitrate     Sulfate         Temp
## Area     1.000000000 -0.02243326  0.258624140 -0.099527887  0.04877639  0.003541386
## Oxygen  -0.022433261  1.00000000 -0.057570044  0.273426337 -0.07241127 -0.318864885
## Depth    0.258624140 -0.05757004  1.000000000  0.036268615 -0.04987166 -0.004895135
## Nitrate -0.099527887  0.27342634  0.036268615  1.000000000 -0.08713012 -0.001596462
## Sulfate  0.048776389 -0.07241127 -0.049871660 -0.087130116  1.00000000  0.079791845
## Temp     0.003541386 -0.31886489 -0.004895135 -0.001596462  0.07979184  1.000000000

Inference and Estimation

Model

The model we will use to estimate will contain the six predictors using there variable name. \[Y_i = \beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\beta_3x_{i3}+\beta_4x_{i4}+\beta_5x_{i5}+\beta_6x_{i6}+\epsilon_i~;~\epsilon_i \sim _{iid} N(0,\sigma^2)\] New notation \(\beta=\) is the coefficient of regression, the subscript indicates which predictor it is associated with. In the case of \(\beta_0\), this coefficient in which the response plane (line) cross the axis for the response variable. Other new notation, \(x_{i1}\) or any of the \(X_i\) is the predictor for that data point which we have 68 in this data set. The second subscript number is the predictor, where area is \(X_1\) and Temperature is \(X_6\). The notation \(\epsilon_i\) is for the error terms that are in the model. These error terms are independent of each other and identical distributed on a normal bell curve with a mean of 0 and variance \(\sigma^2\) or \(\sim N(0, \sigma^2)\). It is also notable that these error terms sum to zero and will not appear in the fitted model.

Regression Coefficients

In order to estimate the regression coefficients we will have to use matrix methods. We will be using matrix methods to calculate the total sum of squares, the regression sum of squares, and the error sum of squares, and then compiling them into a table for Analysis of Variance (ANOVA).

So, sum of squares, and what are they. Once a regression line is fitted to the data set, the line is spaced between the data points, but most of the data points are not on the line. So, my calculating the distance between the data point and regression line. Then we square these numbers since some the data points will fall below the regression line, and other will lay above. The sum of squares are broken into two main categories depending on whether the distance is a tribute to regression sum of squares for a lack of fit for the model, or part of the error sum of squares. These two groups can be further broken down which we will see later. Also these two main categories add up to the total sum of squares.
Estimated Coefficients: Intercept:\(\hat\beta_0\), Area:\(\hat\beta_1\), Oxygen:\(\hat\beta_2\), Depth:\(\hat\beta_3\), Nitrate:\(\hat\beta_4\), Sulfate:\(\hat\beta_5\), Temp:\(\hat\beta_6\)

attach(data)
Intercept <- rep(1, length(Fish))

X <- cbind(Intercept,Area,Oxygen,Depth,Nitrate,Sulfate,Temp)
Y <- data$Fish
beta.hat <- as.data.frame(solve(t(X)%*%X)%*%t(X)%*%Y)
column.names <- c('Coefficient Value')
knitr::kable(beta.hat,col.names = column.names, align = 'c', caption = 'Regression Coefficients')
Regression Coefficients
Coefficient Value
Intercept -127.5865278
Area 0.0019625
Oxygen 6.1035851
Depth 0.3541803
Nitrate 7.7133007
Sulfate -0.0086050
Temp 2.7483294

New notation \(\hat\beta\), the \(\hat0\) denotes beta hat, were the hat notation is universal through the document for an estimated value. The number attached is the same as the subscript number from the general model above.

ANOVA Table

Analysis of Variance
The degrees of freedom for regression are equal to the number of predictors, because each predictor use one predictor. The degrees of freedom for the residuals are computed by the taking the amount of observations minus the number of predictors in the model, and then minus one. So, the degrees of freedom for the total equals the number of observations minus one. Since matrix methods were used to calculate the sum of squares the process can be found in the appendix, which is in R code, and also attached in the set of files. The mean sum of squares equals the sum of squares divided by the degrees of freedom for the respective type. In the table the total mean sum of squares is blacked out, because the mean squares for the total do not exist.

Table

x.variables <- as.matrix(X)
aov.out <- aov(Y~X,data = data)
summary(aov.out)
##             Df Sum Sq Mean Sq F value  Pr(>F)    
## X            6  47044    7841   4.653 0.00059 ***
## Residuals   61 102787    1685                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Multiple Determination

The coefficient of determination is the sum of squares for regression divided by the total sum of squares. This number is used to see how well the model fits the data. Another value in the table include \(r\) the linear correlation coefficient which is the \(\sqrt{R^2}\), use to measure the strength and the direction of a linear relationship. Since we have multiple predictors in the model we have to adjust \(R^2\) which is called \(R^2a\) for adjust \(R^2\), this is calculated by 1 minus (sum of squares for residuals/degrees of freedom for the residuals)/ (total sum of squares/total degrees of freedom)

J<- matrix(1, length(Fish), length(Fish),length(Fish))
SSTO <- t(Y)%*%Y - (1/length(Fish))*t(Y)%*%J%*%Y
SSE <- t(Y)%*%Y-t(beta.hat)%*%t(X)%*%Y
SSR <- SSTO-SSE
df.R <- ncol(X) -1
df.E <- nrow(X) - df.R - 1
MSR <- SSR/df.R
MSE <- SSE/df.E
R2 <- SSR/SSTO
R <- sqrt(R2)
R2a <- 1-((length(Fish)-1)/df.E)*SSE/SSTO
detem <- cbind(R2,R,R2a)
names.detem <- c('R squared', 'r', 'R squared adjusted')
knitr::kable(detem, col.names = names.detem, row.names = FALSE, align = 'c', caption = 'Determination Table')
Determination Table
R squared r R squared adjusted
0.313981 0.5603401 0.2465037

So, there is a strong linear positive relationship, but the model is not a great fit for the data.

Testing at a 5% significance level for linear relationship

f.star <- MSR/MSE
f.crit <- qf(0.95, df.R, df.E)
if(f.star >= f.crit){
result <- "Reject H0 at 5% Significance Level"
}
p.val <- (1-pf(f.star,df.R,df.E))

\(F*:\) 4.653137
\(F critical:\)2.2514183
\(p-value:\)5.90450810^{-4}

The null hypothesis is \(H_0: \beta_0 = \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = \beta_6 = 0\), and our alternate is \(H_a:\) one of the predictors is not equal to zero. In order to do this we will use an \(F-statistic\) to test the fit of the model. In doing this we will name an \(F*\) that is equal to the mean sum of square for the regression divided by mean sum of square for the residuals which is equal to 4.653137, and use an \(F_{0.95, 6, 60}\) distribution, where the first number is the upper quantile at the 5% significance level and the next two numbers are the respective degrees of freedom. The \(F_{0.95, 6, 60}\) distribution is equal to 2.2514183. We will reject \(H_0\) if \(F*\) is greater than or equal to the \(F_{0.95, 6, 60}\) distribution. We will reject \(H_0\) at the 5% significance level, meaning that null hypothesis is rejected and that at least one of the regression coefficients are not equal to zero. The probability of the null hypothesis being correct is 5.90450810^{-4}, this is calculated by taking 1 minus the upper quantile of a \(F\) distribution with the same degrees of freedom as before, but with a critical region of the value of \(F*\).
\(Result:\) Reject H0 at 5% Significance Level

Stepwise ANOVA Table

Here we will be partition the sum of squares for regression to see which predictors add value to our model.
Scheffe Method

## New Method
aov.full <- aov(Fish~Area+Oxygen+Depth+Nitrate+Sulfate+Temp)
summary(aov.full)
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## Area         1  17990   17990  10.676 0.00178 **
## Oxygen       1   3105    3105   1.843 0.17961   
## Depth        1   7995    7995   4.745 0.03326 * 
## Nitrate      1  13506   13506   8.015 0.00627 **
## Sulfate      1     15      15   0.009 0.92506   
## Temp         1   4433    4433   2.631 0.10997   
## Residuals   61 102787    1685                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Old Method
#m<-6
#S<-sqrt(m*qf(0.95,6,df.E))
#S
#anx1<-anova(lm(y~x1))
#f.s1<-anx1[["Sum Sq"]][1]
#anx1x2<-anova(lm(y~x1+x2))
#mse2<-anx1x2[["Mean Sq"]][2]
#extra.ss2<-anx1x2[["Sum Sq"]][2]
#anx1x2x3<-anova(lm(y~x1+x2+x3))
#mse3<-anx1x2x3[["Mean Sq"]][2]
#extra.ss3<-anx1x2x3[["Sum Sq"]][3]
#anx1x2x3x4<-anova(lm(y~x1+x2+x3+x4))
#mse4<-anx1x2x3x4[["Mean Sq"]][2]
#extra.ss4<-anx1x2x3x4[["Sum Sq"]][4]
#anx1x2x3x4x5<-anova(lm(y~x1+x2+x3+x4+x5))
#mse5<-anx1x2x3x4x5[["Mean Sq"]][2]
#extra.ss5<-anx1x2x3x4x5[["Sum Sq"]][5]
#anx1x2x3x4x5x6<-anova(lm(y~x1+x2+x3+x4+x5+x6))
#mse6<-anx1x2x3x4x5x6[["Mean Sq"]][2]

Reduced Model ANOVA Table

So, as seen before in the scatter plot matrix the predictor for sulfate adds little value to the model but the predictor for temperature does add value to the model, but if you look at the model only the first four predictors our ANOVA table becomes.

aov.reduced <- aov(Fish~Area+Oxygen+Depth+Nitrate)
summary(aov.reduced)
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## Area         1  17990   17990  10.569 0.00185 **
## Oxygen       1   3105    3105   1.824 0.18162   
## Depth        1   7995    7995   4.697 0.03400 * 
## Nitrate      1  13506   13506   7.935 0.00647 **
## Residuals   63 107234    1702                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusion

In these reduction of the model variables where drop that still impact in the full model regression.

Graphs

Graphs with ggplot2 graphics package

Residual Plots, Normal Probability Plot of the Residuals, and Histogram
The color scheme on the Histogram is not intended to add value other than color.

library(ggplot2)
library(tidyr)
regression<-lm(Fish~Area+Oxygen+Depth+Nitrate+Sulfate+Temp)
fitted<-regression$fitted
res<-regression$resid
graph.data <- cbind(res, fitted, data[2:7])
graph.data %>%
  gather(-res,key="var",value="value") %>%
  ggplot(aes(x = value, y = res)) +
  geom_point(color = "blue") +
  xlab("") +
  ylab("Residuals") +
  ggtitle("Residual Plots by Alphabetic Order") +
  facet_wrap(~ var, scales = "free")

qplot(res, geom='histogram', xlab = 'Residuals', ylab='Frequency', main = 'Histogram of Residuals', color=factor(Fish))

qqnorm(res, main="Normal Probability Plot of Residuals")
qqline(res)

Graphs with standard graphics

Residual Plots, Normal Probability Plot of the Residuals, and Histogram

regression<-lm(Fish~Area+Oxygen+Depth+Nitrate+Sulfate+Temp)
fitted<-regression$fitted
res<-regression$resid
par(mfrow = c(3,3))
plot(fitted, res, main="Residuals vs Fitted Values", xlab="Fitted Values", ylab="Residuals")
plot(Area, res, main="Residuals vs Area", xlab = "Area", ylab = "Residuals")
plot(Oxygen, res, main="Residuals vs Oxygen", xlab = "Oxygen", ylab = "Residuals")
plot(Depth, res, main="Residuals vs Depth", xlab = "Depth", ylab = "Residuals")
plot(Nitrate, res, main="Residuals vs Nitrate", xlab = "Nitrate", ylab = "Residuals")
plot(Sulfate, res, main="Residuals vs Sulfate", xlab = "Sulfate", ylab = "Residuals")
plot(Temp, res, main="Residuals vs Temperature", xlab = "Temperature", ylab = "Residuals")
qqnorm(res, main="Normal Probability Plot of Residuals")
hist(res, main="Histogram of Residuals", xlab="Residuals")