Water Quality Report 4: Random Sampling and Confidence Intervals of New York Harbor Dissolved Oxygen Levels

Abstract

The purpose of this report is to analyze a new data set, the New York Harbor data set, and to build confidence intervals around the sample means in order to understand where the true population means may lie. Measurements from the top and bottom of the harbor are compared, and the comparison is extended across a five-year gap by contrasting the intervals from 2012 and 2017. The key findings are that the top of the harbor contains more dissolved oxygen (mg/L) than the bottom, that the new samples contain more dissolved oxygen (mg/L) than the old samples, and that confidence intervals were successfully constructed at the .95 confidence level.

Introduction

While this report focuses on the use and application of statistical methods, it is important to understand what the methods are being applied to. Dissolved oxygen is a property of all water. In order to sustain life, water must contain dissolved oxygen [2]. This is analogous to the air needing enough oxygen for humans to breathe: aquatic life requires oxygen in the water that it breathes. If fish or other aquatic life require a certain amount of oxygen in the water, it is imperative that the levels are high enough. It also implies that if different depths of water contain different amounts of oxygen, this would help determine where aquatic life forms its habitat. Insight into these questions is gained through the following statistical analysis of the New York Harbor data.

Methods

Before any statistical methods could be carried out, the data had to be downloaded [1]. This was more difficult than expected because the data contained many missing values, labeled “NS”. I found it best to work in Google Sheets for the time being because the original data set was essentially unusable in R. After downloading both data sets (2012 and 2017), I created a new spreadsheet into which I copied and pasted the columns labeled “Top” and “Bottom” from each data set. To remove the missing values without replacing them with a value that could distort the data, I filtered all columns to exclude “NS” and “0”. This introduces one foreseeable minor flaw: more data was deleted than necessary. Because the filter removes whole rows of the new spreadsheet, a row dropped for a missing 2012 value may have held perfectly usable data in the 2017 columns, even though each row is only meaningful within a single year and does not pair observations across years. I call this a minor flaw because the number of observations used still exceeded 800 for 2012 and 1000 for 2017, but more data is always preferable.
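The filtering step could also be applied one column at a time, so that a missing value in one column does not discard usable readings in another. The following is a minimal Python sketch of that idea, not the procedure used for this report (which was done in Google Sheets). The function name `clean_column` and the sample readings are hypothetical.

```python
import math


def clean_column(raw_values):
    """Return the usable readings from one column as floats.

    Entries marked "NS", blank entries, and zeros are dropped,
    mirroring the filter described above.
    """
    cleaned = []
    for value in raw_values:
        text = str(value).strip()
        if text == "" or text.upper() == "NS":
            continue  # missing value
        try:
            number = float(text)
        except ValueError:
            continue  # any other non-numeric marker is skipped as well
        if not math.isfinite(number) or number == 0:
            continue  # zeros were treated as missing in this report
        cleaned.append(number)
    return cleaned


# Hypothetical readings (mg/L), for illustration only.
top_raw = ["7.2", "NS", "6.9", "0", "8.1"]
bottom_raw = ["6.5", "6.1", "NS", "5.8", "NS"]

# Each column is cleaned independently, so neither loses rows
# because of gaps in the other.
top = clean_column(top_raw)
bottom = clean_column(bottom_raw)
```

Cleaning each column separately keeps every usable reading, at the cost of the columns no longer having matching row counts. That is acceptable here because each interval is built from a single column.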

When finding confidence intervals, it is best to first describe the data being examined because these summary values will be required to perform further analysis. Table 6 below contains descriptive statistics of the new data:

Immediately we can see that the “top” data have larger means than the “bottom” data, and that the new (2017) data have larger means than the old (2012) data. The standard deviations and numbers of observations presented in Table 6 are used to build confidence intervals at the .95 level. The limits of each interval are found using the following formula: \[ \bar{x} \pm 1.96\left(\frac{\sigma}{\sqrt{n}}\right) \] where x-bar is the sample mean, 1.96 is the z-value for two-tailed .95 confidence, σ is the standard deviation, and n is the number of observations. Subtracting gives the lower limit of the .95 confidence interval, while adding gives the upper limit. The formula can also be simplified. The standard error, s, is found by: \[ s = \frac{\sigma}{\sqrt{n}} \] which appears in the formula for the confidence interval, so we can substitute s into the original formula: \[ \bar{x} \pm 1.96\,s \] Using this formula for a .95 confidence interval, we can make a table showing the results.
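As a sketch of the calculation, the formula can be written as a short Python function. The summary values passed in below are hypothetical placeholders, not the figures from Table 6, and the function name `confidence_interval` is illustrative.

```python
import math


def confidence_interval(mean, std_dev, n, z=1.96):
    """Return the (lower, upper) limits of a z-based confidence interval."""
    if n <= 0:
        raise ValueError("n must be positive")
    standard_error = std_dev / math.sqrt(n)  # s = sigma / sqrt(n)
    margin = z * standard_error              # 1.96 * s for the .95 level
    return mean - margin, mean + margin


# Hypothetical summary statistics, for illustration only.
lower, upper = confidence_interval(mean=7.0, std_dev=2.0, n=100)
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```

With these placeholder values the standard error is 0.2, so the limits are 7.0 ± 0.392. Applying the same function to each of the four columns would produce the four intervals.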

Results and Discussion

Table 7 presents the desired confidence intervals for all 4 variables. It is worth highlighting what such an interval means. It does not mean that there is a 95% chance that the true population mean, μ, is contained in this particular interval, which is a very common misinterpretation. Rather, it means that if we drew 100 more random samples and built an interval from each in the same way, we could expect about 95 of those intervals to contain μ. It does not imply more than this.
Mechanically speaking, we can infer from the equation for the standard error, s, that because the old (2012) data have fewer observations (808 compared to 1114), the standard errors for the old dissolved oxygen levels will be larger, all else being equal. As the number of observations increases, the standard error shrinks toward its lower limit of 0. Reaching that limit would imply having every observation in the population, and therefore x-bar = μ, which is not the case here.
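The effect of sample size alone can be checked with a short Python sketch. It assumes, purely for illustration, that both years share the same standard deviation (the placeholder `sigma` below is hypothetical), so that only the observation counts 808 and 1114 differ.

```python
import math


def standard_error(std_dev, n):
    """Standard error of the mean: sigma / sqrt(n)."""
    return std_dev / math.sqrt(n)


sigma = 2.0  # hypothetical standard deviation, assumed equal for both years

se_2012 = standard_error(sigma, 808)
se_2017 = standard_error(sigma, 1114)

# With equal sigma the ratio reduces to sqrt(1114 / 808),
# so it does not depend on the placeholder value chosen above.
ratio = se_2012 / se_2017
print(f"{ratio:.3f}")
```

Under that assumption the 2012 standard error is roughly 17% larger than the 2017 one. The actual difference also depends on the standard deviations reported in Table 6.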

Conclusion

The goal of this report was to estimate the population mean by constructing .95 confidence intervals. This was successfully performed, and the results indicate that the intervals for the old data are wider than those for the new data, which is likely due to the old data containing fewer observations. Questions regarding the differences between the harbor's top and bottom measurements of dissolved oxygen have also been answered. The main finding is that water measured at the bottom of the harbor contains less dissolved oxygen than water measured at the top. As someone who previously knew nothing about this property of water, I can now see why some fish and other aquatic life exist at certain depths. To improve this analysis, it would be possible to extract more data from the original data set than this report did. Despite the cost of cleaning very messy data, it may be worth it for the larger sample and, by extension, a tighter confidence interval. This report is also a preliminary step toward comparing the means of different populations and measuring the statistical significance of their differences.

References

[1] “Harbor Water Sampling Data.” Gowanus Canal – History, www.nyc.gov/html/dep/html/harborwater/harbor_water_sampling_results.shtml.
[2] “What Is Dissolved Oxygen in Water?” Environmental Measurement Systems, 4 Feb. 2014, www.fondriest.com/news/whatisdissolvedoxygen.htm.