Water quality is a pressing issue in modern society. Even in first-world countries, there are still people without access to drinking water. The United Nations data set provides the percent of each country's population with access to drinking water from 1990 to 2015. This week, the goal is to investigate the data and to work together as a class to develop insightful questions about it.
To view and use the data, I opened the file in Google Sheets and downloaded a copy of the data set as a csv file. The class will be using RStudio for future statistical analysis and report writing. I created a directory, “StatsProject”, saved the data there, and permanently changed the working directory of RStudio to this folder.
After viewing the data and forming my first impressions, I came up with these questions: Of the countries that have shown improvement, what is the average length of time it has taken to improve by 10 points, and does this depend greatly on the starting point of that country? Can we estimate the amount of money it costs to make this jump? During the years of large improvements for specific countries, do neighboring countries receive a “second-hand” improvement, and what variables would you need to control to test this? In Bermuda, households supply their own water by collecting the rainwater that lands on their roofs and storing it in a tank under the house, where it is purified. Is being individually responsible for drinking water the reason Bermuda has missing values? Are the countries with the largest economies necessarily those with the highest percent of their population with access to clean water?
The topic is of great importance. Now that the data have been collected, analyzed, and discussed, we are ready to move on to answering our own questions to the best of our abilities.
This report continues the research from the previous report. Now that a data set has been cleaned, the data of the United States and China are extracted and summarized using descriptive statistics and graphics. The results of this report indicate that the United States has higher water accessibility than China, and that the data are approximately normally distributed.
This week, I will be looking to answer one of the questions I came up with in the last report: “Are the countries with the largest economies necessarily those with the highest percent of their population with access to clean water?” The two largest economies in the world are those of the United States and China. The standard of living of a country is most commonly measured by GDP per capita. This value differs widely from GDP because it divides output by population, which implicitly treats wealth as evenly distributed across people (in reality it is not). Median household income is a better estimator because it accounts for an unequal distribution of wealth. According to a 2013 Gallup poll, these values for China and the United States are $6,180 and $43,585, respectively[1]. Based on these numbers, I would expect the United States to have a higher percent of its population with access to drinking water than China does. This report attempts to test that expectation.
To compare the United States and China, it is first necessary to extract these countries’ data from the parent data set[2]. Because the observation count is very small, 26, I decided to do this in Google Sheets. The original data set was downloaded previously, and from it I created another spreadsheet of 3 variables: “Year”, “US”, and “China”. I saved this as a csv file and imported it into RStudio using the following command:
Using this data set, I will examine the ability of the populations of both China and the United States to access drinking water. The examination will include multiple tables and graphics showing descriptive statistics, including means, medians, minimums, maximums, quartiles, standard deviations, kurtosis, skewness, scatter plots, histograms, and Q-Q plots.
In order to better visualize the data, I will use a scatter plot that directly reflects the complete refined data set in Table 1.
In Figure 1, we can see the progression of both countries over the years 1990 to 2015. The United States has shown little improvement, but this is because its starting point was very high; given that the upper limit is 100, a low rate of change is expected. China, on the other hand, started significantly lower and made great progress over the years, nearly reaching the values of the United States.
Next, I will look at the basic descriptive statistics of both countries:
Table 2 shows the mean, median, minimum, maximum, and the 1st and 3rd quartiles of both countries. We can see that the United States’ population has significantly greater access to drinking water than China’s population. There could be many explanations for this. One is that the United States has been a superpower economy for over a century, giving it ample time to improve living conditions, while China saw its rapid increase in production far more recently and has not had equal time to reach the highest standards. The mean value of the United States is 98.85, which means that, averaged over 1990 to 2015, 98.85% of the population of the United States had access to drinking water. While this is a high value, it is also surprising that the United States, the largest economy in the world, had not reached a value of 100 by 2015. The mean value of China is 82.63. Given the history of China and its minimum value of 66.9, it is clear that the country has made strides in improving the quality of life of its population: by 2015 it had improved to 95.5, nearly mirroring the level of the United States. More generally, this might suggest that improvements in water access lag behind the growth of a country’s economy.
Further, the distributional statistics of both countries are found below in Table 3:
These statistics tell us more about how the data are shaped. The standard deviation measures the dispersion of the data about the mean. A high value indicates that a significant portion of the data is far from the mean, so that visually the distribution appears “wide”. A low value indicates that the data are very close to the mean, and visually the distribution appears “narrow”. We can see in Table 3 that the standard deviation of the United States is very low, 0.25. This means that over the 26 years of data, the share of the country’s population with access to drinking water has been consistent and has not deviated much from the mean value. In China, however, we have a high standard deviation. This reflects the serious improvement China has made in 26 years: because of the steady progression, the data contain values much lower and much higher than the average. In the calculation of the standard deviation, the differences between each data point and the mean are squared to obtain positive values, summed, divided by the number of observations minus one, and square rooted. The high value therefore shows that much of the data is far from the mean.
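The calculation described above can be sketched in a few lines. This is a minimal illustration, assuming the sample form of the standard deviation (divisor n - 1); the two short series are hypothetical stand-ins for a flat and a steadily rising country, not values from the UN data set.

```python
import math
import statistics


def sample_std(values):
    """Sample standard deviation using the n - 1 divisor."""
    n = len(values)
    mean = sum(values) / n
    # Squaring makes every deviation from the mean positive.
    squared_deviations = [(v - mean) ** 2 for v in values]
    return math.sqrt(sum(squared_deviations) / (n - 1))


# Hypothetical series (assumptions for illustration only).
flat_series = [98.4, 98.6, 98.8, 99.0, 99.2]
rising_series = [67.0, 74.0, 81.0, 88.0, 95.0]

print("flat series:  ", sample_std(flat_series))
print("rising series:", sample_std(rising_series))
```

A series that climbs steadily necessarily has many values far from its own mean, which is why a strong trend shows up as a large standard deviation.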
The skewness and kurtosis tell us more about the symmetry about the mean and the thickness of the tails, respectively. In Table 3, the skewness values of the US and China are -.32 and -.21, respectively. These values are close to the skewness of 0 of a normal distribution, but they do indicate a slight left skew. In theory this implies that the median of the data is slightly greater than the mean, meaning that more of the years were above the mean than below it. The kurtosis values of both countries are also very similar, and are less than the normal distribution's value of 3. This means that less data exists in the tails of each distribution, or fewer outliers.
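For reference, here is a small sketch of how skewness and kurtosis can be computed. It assumes the moment-based definitions, under which a normal distribution has skewness 0 and kurtosis 3; the software used for Table 3 may apply a slightly different small-sample correction. Both example lists are hypothetical.

```python
def central_moment(values, k):
    """k-th central moment: the average of (x - mean)^k."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** k for v in values) / n


def skewness(values):
    """Third standardized moment; negative values indicate a left skew."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Fourth standardized moment (not excess); a normal distribution gives 3."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


# Hypothetical data (assumptions for illustration only).
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
left_skewed = [1.0, 6.0, 7.0, 8.0, 8.0, 9.0, 9.0, 9.0]

print("symmetric:  ", skewness(symmetric), kurtosis(symmetric))
print("left skewed:", skewness(left_skewed), kurtosis(left_skewed))
```

The evenly spaced list has no skew and thin tails (kurtosis below 3), while the list with one low outlier has negative skewness.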
The distributions of each country compared with the normal distribution curve are below in Figure 2 and Figure 3:
These distributions give us an idea about how normal the data is. Normal data is easy to manipulate for further statistical analysis and therefore it is ideal to use data that is close to being normally distributed. The data in this paper is not perfectly normal, however, it is close enough to carry out further analysis.
Another useful tool for assessing the normality of the data is the Q-Q plot. A Q-Q plot graphs the ordered data values against the quantiles we would expect from a normal distribution, so the question becomes how close the points are to a straight line. This tool can be more visually indicative of normality than a histogram because it does not depend on the number of bins or the bin size. Theoretically, if the points form a perfectly straight line, then the data are normal. In Figure 4 and Figure 5 below, we can see how the data values of each country compare to a straight line:
The data here again is not perfectly normal, but it does follow the general trend of a straight line which indicates that the data is close to being normal. This allows for continued analysis with statistical meaning and impact.
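To make the construction of a Q-Q plot concrete, the sketch below computes the points that would be plotted. It assumes the common plotting positions (i - 0.5)/n for the theoretical quantiles; statistical packages differ slightly in this choice, and the sample values are hypothetical.

```python
from statistics import NormalDist


def qq_points(values):
    """Pair each ordered data value with its theoretical standard normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions (i - 0.5) / n are one common convention (an assumption here).
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Hypothetical sample (assumption for illustration only).
sample = [82.1, 79.5, 88.0, 75.3, 84.6, 90.2, 86.4]

for theoretical_q, observed in qq_points(sample):
    print(round(theoretical_q, 3), observed)
```

Plotting the first coordinate on the horizontal axis and the second on the vertical axis gives the Q-Q plot; the closer the points lie to a straight line, the closer the sample is to normal.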
The purpose of this report was to get a better idea of the key descriptive statistics and shape of the data. After performing the necessary analysis, we find that the United States has a higher percent of its population with access to clean water than China has. Despite this, China has made far greater improvement, rising from a low of 66.9% to 95.5% in 2015, a gain of nearly 30 percentage points over the 26 years of data. This shows us that China is continuing to improve and is quickly approaching the level of the United States. Given the histograms and Q-Q plots, we can also conclude that the data are approximately normal, which suggests that they can reasonably be treated as normal in later analysis. Continued research on this topic will attempt to model the data based on the year and test the significance of these models.
[1] Phelps, G., & Crabtree, S. (2013, December 16). Worldwide, Median Household Income About $10,000. http://news.gallup.com/poll/166211/worldwide-median-household-income-000.aspx
[2] UNdata | record view | Proportion of the population using improved drinking water sources, total. http://data.un.org/Data.aspx?q=water&d=MDG&f=seriesRowID:665
The purpose of this report is to build on the descriptive results of the previous report. Linear regression is used to model the percent of the populations of the United States and China with access to drinking water as a function of time from 1990 to 2015. The key results of this paper are that statistically significant models were estimated as a function of time, with strong evidence that water access in these two countries improves over time.
Once data have been analyzed and summarized, it is important to continue further evaluation. Linear regression is among the most common methods of modelling data. Even though it is a simple approach, it still produces key findings that are integral to understanding the implications of the data. Linear regression provides the analyst with a line fitted to the data that is hopefully more accurate than simply using the mean to estimate output values. The fit is found by minimizing the sum of the squared vertical differences between the line and the actual data points, hence its proper name “Ordinary Least Squares”, or “OLS” for short. The line that minimizes this total is the output of the model and takes the form \[y = mx + b\] where m is the change in y for a one-unit increase in the independent variable x, and b is the value of the output when the independent variable is 0.
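For a single regressor, the least-squares slope and intercept have a closed form, which the sketch below implements. It is an illustration of the technique rather than the code used for this report, and the example data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = m*x + b for one regressor."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical data: years since a base year and a percent value.
years_since_base = [0, 1, 2, 3, 4]
percent_values = [68.0, 69.5, 70.2, 71.6, 72.7]

m, b = fit_line(years_since_base, percent_values)
print("slope:", m, "intercept:", b)
```

When the points lie exactly on a line, for example x = 0, 1, 2, 3 with y = 1, 3, 5, 7, the function recovers that line's slope and intercept.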
In this report, I will estimate two linear models using the UN Water Quality Data Set[1]. The first is the percent of China’s population with access to drinking water: \[ ChinaValue = \beta_1 \times Year + \beta_0 \] and the second is the percent of the United States’ population with access to drinking water: \[ USValue = \beta_1 \times Year + \beta_0 \] In both of these models, it is vital to mention that the variable Year has been transformed to measure the number of years after 1990, the base year. For example, 1999 takes on the value of 9. This way the intercept corresponds to the first year of our data set rather than to the year 0, which produces more meaningful outputs. These equations will be estimated, presented in table format, analyzed for significance, and shown graphically.
The estimates of each equation are presented below: \[ ChinaValue = 1.16 \times Year + 68.16 \] \[ USValue = 0.03 \times Year + 98.45 \] We can see that the beta coefficients for the two models are very different. This is expected, however, because the US has a very small range between its minimum and maximum values: the dependent variable has an upper limit of 100 and the starting point for the US is 98.4. Because China had a significantly larger range, its much larger beta coefficient is expected. This is not to say that the US parameter is insignificant. These equations are derived from Table 4 below, which also contains values indicative of statistical significance:
Table 4 tells us several things about using year to predict water access that the above equations do not. It tells us that in both models the estimated beta parameters are unlikely to be the result of chance. The extremely high t-values indicate that, if year truly had no effect, the probability of estimating coefficients this far from zero would be less than .001. For that reason we can reject the null hypothesis that year has no impact on water access. Table 4 also presents the F-statistics and R^2 values. The F-statistic tells the analyst whether the model as a whole explains the dependent variable better than a model that uses only the mean. The values for the US and China models are 885.07 and 3578.93, respectively. These values are remarkably high, with p-values below .001, so we reject the hypothesis that the models do no better than the mean. The R^2 terms tell us what share of the variation in the data is explained by the fitted line. The R^2 terms of the US and China models are .9736 and .9933, respectively. These values are very high and indicate a close fit. The fit can be seen visually by plotting the linear equations along with the original scatter plots of the data. This is presented in Figure 6 below:
In Figure 6, it is evident that the two models are very close fits to the data points and almost perfectly connect the individual data points for both the US and China.
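The R^2 and F values reported for the two models are linked. For a model with one regressor and an intercept, F = R^2 / (1 - R^2) × (n - 2). The sketch below assumes n = 26 yearly observations (1990 to 2015) and uses the rounded R^2 values quoted in the text, so it reproduces the reported F-statistics only approximately.

```python
def r_squared(observed, fitted):
    """Share of the variation in the observed values explained by the fitted values."""
    mean = sum(observed) / len(observed)
    ss_residual = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_total = sum((o - mean) ** 2 for o in observed)
    return 1 - ss_residual / ss_total


def f_statistic(r2, n):
    """F-statistic for a model with one regressor and an intercept."""
    return (r2 / (1 - r2)) * (n - 2)


N_YEARS = 26  # assumption: one observation per year, 1990 to 2015

us_f = f_statistic(0.9736, N_YEARS)
china_f = f_statistic(0.9933, N_YEARS)
print("US F:", us_f)        # close to the reported 885.07
print("China F:", china_f)  # near the reported 3578.93; R^2 rounding matters here
```

Because 1 - R^2 is so small for the China model, rounding R^2 to four decimals shifts the computed F noticeably, which is why that value matches the reported one less closely.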
These results are reasonable. With improvements in both countries’ economies, we can typically expect the living conditions of their people to improve. Given that both countries experience economic growth in most years, it is no surprise that year is a statistically significant regressor in these models.
Based on the results of this report, we can estimate the impact the year of measurement has on these two countries. It is important for analysts to perform this kind of analysis in order to both measure the past and predict the future. This is especially useful when looking at China. Because China has more room to grow, we can use the China model to predict the values for 2016 and perhaps even later years. This is done very simply by substituting 26 for Year, as 2016 is 26 years after 1990: \[ ChinaValue = 1.16 \times (26) + 68.16 \] \[ = 98.32 \] This suggests that if China improves its water access in 2016 at the same rate as over the years 1990-2015, we can expect 98.32% of the Chinese population to have access to drinking water in 2016. While these models provide a good estimate of water access, they are very general, and there are always more variables that could be included when making such predictions. Future research should examine specific areas of these massive nations, and perhaps include more variables for a multivariate analysis, such as “neighboring country value” or “welfare spending”.
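The 2016 prediction can be checked with a few lines of code. The coefficients are the China estimates reported above; the function names are hypothetical. The second function is an added check, not part of the original analysis: it shows how soon the straight line passes the 100% ceiling, which limits how far ahead the model can sensibly be used.

```python
import math

CHINA_SLOPE = 1.16       # estimated change per year, from the fitted model
CHINA_INTERCEPT = 68.16  # estimated value in the base year
BASE_YEAR = 1990


def predict_china(year):
    """Predicted percent of China's population with access in a given year."""
    return CHINA_SLOPE * (year - BASE_YEAR) + CHINA_INTERCEPT


def first_year_above(limit):
    """First calendar year in which the linear prediction is strictly above `limit`."""
    years_needed = (limit - CHINA_INTERCEPT) / CHINA_SLOPE
    return BASE_YEAR + math.floor(years_needed) + 1


print(round(predict_china(2016), 2))
print(first_year_above(100))
```

Since the percentage cannot exceed 100, the linear model can only be a reasonable approximation for a year or two beyond the data.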
[1] UNdata | record view | Proportion of the population using improved drinking water sources, total. http://data.un.org/Data.aspx?q=water&d=MDG&f=seriesRowID:665
The purpose of this report is to analyze a new data set, the New York Harbor Data Set, and build confidence intervals about the sample mean to get an understanding of where the true population mean may lie. The comparisons being made are between the top and bottom of the harbor. This is taken a step further by comparing the intervals from 2012 and 2017, a five-year gap. The key findings are that the top of the harbor contains more oxygen (mg/L) than the bottom, the new samples contain more oxygen (mg/L) than the old samples, and confidence intervals are successfully built at the .95 confidence level.
While this report focuses on the use and application of statistical methods, it is important to understand what exactly the methods are being applied to. Dissolved oxygen is a property of all water. In order to sustain life, water must contain dissolved oxygen [2]. This is analogous to there being enough oxygen in the air that humans breathe: aquatic life requires oxygen in the water that it breathes. If fish or other aquatic life require a certain amount of oxygen in the water, it is imperative that the levels are high enough. It would also imply that if different depths of water contain different amounts, this helps determine where aquatic life forms its habitat. Insight into these questions is gained through the following statistical analysis of the New York Harbor data.
In order to carry out any statistical methods, the data had to be downloaded [1]. This was more difficult than expected, as the data contained many missing values, labeled as “NS”. I found it best to work in Google Sheets for the time being because the original data set was essentially unusable in R. After downloading both data sets (2012 and 2017), I created a new spreadsheet into which I copied and pasted the columns labeled “Top” and “Bottom” from each data set. In order to remove the missing values without replacing them with a value that could throw off the data, I filtered all columns to exclude “NS” and “0”. This is one foreseeable minor flaw of this report, because more data was deleted than necessary. For example, a row with a missing value in the 2012 columns may have had perfectly usable data in the 2017 columns, but filtering removed the whole row; the rows of the new spreadsheet only line up measurements within the same year, not across years. I call this a minor flaw because the number of observations used still exceeded 800 and 1000 for 2012 and 2017, respectively, but more data is always preferable.
When finding confidence intervals, it is best to first describe the data being examined because these summary values will be required to perform further analysis. Table 6 below contains descriptive statistics of the new data:
Immediately we can see that the “top” data have larger means than the “bottom” data, and that the new (2017) data have larger means than the old (2012) data. The standard deviations and observation counts in Table 6 will be used to build confidence intervals at the .95 level. The intervals will be found using the following formula: \[ \bar{x} \pm 1.96\left(\frac{\sigma}{\sqrt{n}}\right) \] where x-bar is the sample mean, 1.96 is the z-value for two-tailed .95 confidence, σ is the standard deviation, and n is the number of observations. Subtracting gives the lower limit of the .95 confidence interval, while adding gives the upper limit. The formula can also be simplified. The standard error, s, is found by: \[ s = \frac{\sigma}{\sqrt{n}} \] which appears in the formula for the confidence interval, so we can substitute s into the original formula: \[ \bar{x} \pm 1.96(s) \] Using this formula for building a .95 confidence interval, we can make a table showing the results.
Table 7 presents the desired confidence intervals for all 4 variables. It is worth highlighting what such an interval means. Strictly speaking, it does not mean that there is a 95% chance that the true population mean, μ, lies in this particular interval, since μ is fixed and the computed interval either contains it or does not. It also does not mean that 95 of 100 new sample means would fall inside this interval, which is a very common misinterpretation. The .95 level describes the procedure: if we drew many random samples and built an interval from each in the same way, we would expect about 95% of those intervals to contain μ.
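The interval formula used in this report can be sketched as a short function. The mean and standard deviation passed in below are hypothetical placeholders, not the figures from Table 6; only the observation count of 808 is taken from the 2012 data.

```python
import math
from statistics import NormalDist


def confidence_interval(sample_mean, std_dev, n, level=0.95):
    """Two-sided z-based confidence interval for a mean."""
    # For level = 0.95 this gives the familiar z-value of about 1.96.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    standard_error = std_dev / math.sqrt(n)
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin


# Hypothetical mean and standard deviation (assumptions), n from the 2012 data.
lower, upper = confidence_interval(sample_mean=7.0, std_dev=2.0, n=808)
print("lower:", lower, "upper:", upper)
```

Subtracting the margin gives the lower limit and adding it gives the upper limit, so the interval is always centered on the sample mean.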
Mechanically speaking, we can infer from the equation for the standard error, s, that because the old (2012) data have fewer observations (808 compared to 1114), the standard errors for the old dissolved oxygen levels are larger, all else being equal. This is because the standard error shrinks as the number of observations increases, approaching its lower limit of 0. A standard error of 0 would imply having every observation in the population, which would further imply that x-bar = μ, which is not the case here.
The goal of this report was to estimate the population mean by building .95 confidence intervals. This was done successfully, and the results indicate that the intervals are wider for the old data than for the new data, which is likely due to the old data containing fewer observations. Questions regarding the differences between the harbor’s top and bottom measurements of dissolved oxygen have also been answered. The main finding is that the deeper the water is measured, the less oxygen is dissolved in it. As someone who knew nothing about this property of water, I can now see why some fish and other aquatic life live at certain depths. To improve this analysis, it would be possible to extract more data from the original data set than this report did. Although this means cleaning very messy data, it may be worth it for the larger sample and, by extension, a tighter confidence interval. This report is also preliminary to comparing the means of different populations and measuring their statistical significance.
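The effect of sample size on the standard error can be seen directly by holding the standard deviation fixed. The common standard deviation below is a hypothetical value chosen for illustration; the two observation counts are the ones quoted for 2012 and 2017.

```python
import math


def standard_error(std_dev, n):
    """Standard error of the mean: standard deviation over the square root of n."""
    return std_dev / math.sqrt(n)


COMMON_STD_DEV = 2.0  # hypothetical value, held equal for both years

se_2012 = standard_error(COMMON_STD_DEV, 808)
se_2017 = standard_error(COMMON_STD_DEV, 1114)
ratio = se_2012 / se_2017  # equals sqrt(1114 / 808), whatever the standard deviation

print("2012 standard error:", se_2012)
print("2017 standard error:", se_2017)
print("ratio:", ratio)
```

With equal standard deviations, the smaller 2012 sample gives a standard error about 17% larger than the 2017 sample. Differences in the actual standard deviations would add to or offset this.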
[1] “Harbor Water Sampling Data.” Gowanus Canal – History, www.nyc.gov/html/dep/html/harborwater/harbor_water_sampling_results.shtml.
[2] “What Is Dissolved Oxygen in Water?” Environmental Measurement Systems, 4 Feb. 2014, www.fondriest.com/news/whatisdissolvedoxygen.htm.
The intent of this report is to compare the means of two different variables in order to test whether or not there is a significant difference between them. T-tests are performed in Stata to obtain the results. The main results of this report are that the differences in the means between 2012 and 2017 are statistically significant, as are the differences between the means of dissolved oxygen at the top and bottom of the testing locations.
While the previous report focused on creating confidence intervals for the population means of the variables, this report goes further in estimating the differences in those means and confidence intervals in order to better understand the New York Harbor Data Set, and water quality in general [1]. This report will also be examining the amount of dissolved oxygen (mg/L) at the tops and bottoms of test locations during 2012 and 2017. This is an important variable to be consistently measuring in order to ensure the sustainability of aquatic life [2].
In order to use the paired t-test method of comparing sample means, we must first make sure that the data we are analyzing are close to being normally distributed. I have elected to use histograms of the data graphed against the theoretical normal distribution. These figures are presented in the following Results and Discussion section. Because of the large number of observations in this data set, the central limit theorem also tells us that the sample means will be approximately normally distributed even if the data themselves are not perfectly normal. After analyzing the data, paired t-tests were performed on the 2012 top and bottom data, the 2017 top and bottom data, the 2012 top and 2017 top data, and finally the 2012 bottom and 2017 bottom data. These are good choices because they allow the reader to see the difference in dissolved oxygen depending on the depth of the water, and also the change in dissolved oxygen at the same depth over 5 years.
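The tests in this report were run in Stata, but the paired t-statistic itself is simple to compute by hand. The sketch below is an illustration with hypothetical top and bottom readings, not harbor data. Note that a paired test requires matched observations, so both lists must have the same length.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(first, second):
    """Return the paired t-statistic and its degrees of freedom."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    # The test works on the within-pair differences.
    differences = [a - b for a, b in zip(first, second)]
    n = len(differences)
    standard_error = stdev(differences) / math.sqrt(n)
    return mean(differences) / standard_error, n - 1


# Hypothetical dissolved oxygen readings in mg/L (assumptions for illustration).
top = [8.1, 7.9, 8.4, 7.6, 8.0, 8.3]
bottom = [7.2, 7.1, 7.9, 6.8, 7.4, 7.5]

t_value, degrees_of_freedom = paired_t_statistic(top, bottom)
print("t:", t_value, "df:", degrees_of_freedom)
```

A large positive t-value here means the top readings are consistently higher than the matched bottom readings, relative to how much the differences vary from pair to pair.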
The distributions of the data are presented below:
The distributions are mostly normal with some minor flaws, which should not affect the validity of using the paired t-test to compare means.
The t-tests were performed and produced the following 4 results:
Based on Table 8, we can see that all four tests display extremely high t-values, which indicate that we can reject the null hypothesis that the difference in means is 0. This implies a few different things. The first is that the dissolved oxygen measurements at the top and bottom are significantly different. This could mean that aquatic life that requires more oxygen is likely to live closer to the surface of the water, while life that does not require as much oxygen may be found towards the floor of the body of water. The second implication is that over the past 5 years, there has been a significant increase in dissolved oxygen levels at both the top and bottom of the harbor. Generally, this is a good thing, as dissolved oxygen helps sustain life and promotes the decay of organic materials[2]. This could also change where certain aquatic life forms its habitats: if lower depths of water now have enough oxygen to support life that they once could not, this could lead to changes in the overall ecology of the harbor.
Based on the results of this report, there is strong evidence of a difference in the mean dissolved oxygen between the top and bottom depths of the test locations, and that these values have increased significantly over time. While the data do not perfectly fit the assumptions of the t-test, they are close, and no data set will ever be perfect. If continuing this research, I would try to obtain data from a major city in China and compare it with the data used here. This would tie in nicely with the previous work in this overall report and allow for a more scientific comparison between the two countries, rather than simply relying on year as the only regressor.
[1] “Harbor Water Sampling Data.” Gowanus Canal – History, www.nyc.gov/html/dep/html/harborwater/harbor_water_sampling_results.shtml.
[2] “What Is Dissolved Oxygen in Water?” Environmental Measurement Systems, 4 Feb. 2014, www.fondriest.com/news/whatisdissolvedoxygen.htm.