Water Quality Report 3: Using Linear Regression to Model Water Quality of the United States and China as a Function of Time

Abstract

This report builds on the descriptive results that concluded the previous report. Linear regression is used to model the percentage of the populations of the United States and China with access to drinking water as a function of time from 1990 to 2015. The key result is that both estimated models are statistically significant, with strong evidence that the water quality of these two countries improved over the period.

Introduction

Once data have been analyzed and summarized, it is important to continue with further evaluation. Linear regression is among the most common methods of modelling data. Even though it is a simple approach, it still produces key findings that are integral to understanding the implications of the data. Linear regression provides the analyst with a line fitted to the data that is hopefully more accurate than simply using the mean to estimate output values. The line is found by minimizing the sum of squared vertical differences between the line and the observed data points, hence the method's proper name “Ordinary Least Squares”, or “OLS” for short. The line that minimizes this total squared distance is the output of the model and takes the form \[y = mx + b\] where m is the change in y for a one-unit increase in the independent variable x, and b is the intercept, the value of the output when the independent variable is 0.
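As a minimal sketch of how the least-squares line can be computed, the slope and intercept of a one-variable model have a closed form. The function name `ols_fit` and the sample points below are made up for illustration and are not the data used in this report.

```python
def ols_fit(xs, ys):
    """Return (slope, intercept) of the least-squares line through (xs, ys)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx                       # m in y = mx + b
    intercept = y_mean - slope * x_mean     # b in y = mx + b
    return slope, intercept


# Hypothetical points lying exactly on y = 2x + 1 (illustration only)
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
m, b = ols_fit(xs, ys)
print(m, b)
```

Because the made-up points lie exactly on a line, the fit recovers that line; with real data the fitted line would instead balance the squared errors across all points.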

Methods

In this report, I will estimate two linear models. The first models the percentage of China’s population with access to drinking water: \[ ChinaValue = β_1*Year + β_0 \] and the second models the percentage of the United States’ population with access to drinking water: \[ USValue = β_1*Year + β_0 \] Each model has its own pair of coefficients. In both models, the original year variable has been transformed to measure the number of years after 1990, the base year. For example, 1999 takes on the value of 9. This way, the intercept β_0 represents the estimated value in 1990 rather than in year 0, which makes the output more meaningful. These equations will be estimated, presented in table format, analyzed for significance, and shown graphically.
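The effect of the year transformation can be sketched as follows. The slope and intercept here are hypothetical values chosen only for illustration, not estimates from this report. The sketch shows that shifting the year leaves predictions unchanged but moves the intercept from "value in year 0" to "value in 1990".

```python
BASE_YEAR = 1990


def to_years_since_base(year, base_year=BASE_YEAR):
    """Convert a calendar year to years elapsed since the base year."""
    return year - base_year


# HYPOTHETICAL coefficients for a model fitted on the shifted year
slope, intercept = 1.0, 70.0

# The same line written in terms of the raw calendar year has a different
# intercept: the extrapolated value at year 0, which has no practical meaning.
raw_intercept = intercept - slope * BASE_YEAR


def predict_shifted(year):
    return slope * to_years_since_base(year) + intercept


def predict_raw(year):
    return slope * year + raw_intercept


print(to_years_since_base(1999))                 # 1999 is 9 years after 1990
print(predict_shifted(2000), predict_raw(2000))  # same prediction either way
print(raw_intercept)                             # far outside the 0-100 range
```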

Results and Discussion

The estimates for each equation are presented below: \[ ChinaValue = 1.16*Year + 68.16 \] \[ USValue = .03*Year + 98.45\] We can see that the beta coefficients of the two models are very different. This is expected, however, because the US has a very small range between its minimum and maximum values: the dependent variable has an upper limit of 100 and the starting point for the US is 98.4. Because China had a much larger range, its much larger beta coefficient is also expected. This is not to say that the US parameter is insignificant. These equations are derived from Table 4 below, which also contains values indicative of statistical significance:

Table 4 tells us several things about using year to predict water quality value that the above equations do not. It tells us that in both models, the estimated beta parameters are unlikely to have arisen by chance. The extremely high t-values correspond to p-values below .001, meaning that estimates this large would be very unlikely if year had no effect. For that reason we can reject the null hypothesis that year has no impact on water quality. Table 4 also presents the F-statistics and R^2 values. The F-statistics tell the analyst whether the model as a whole has good predictive power for the dependent variable. The values for the US and China models are 885.07 and 3578.93, respectively. These values are remarkably high; with p-values below .001, they indicate that the models predict water quality significantly better than simply using the mean. The R^2 terms give the proportion of the variation in the data that the fitted line explains, and so describe how well the line fits the scatter plot. The R^2 terms of the US and China models are .9736 and .9933, respectively. These values are very high and indicate a closely matching fit. The fit can be seen visually by plotting the linear equations along with the original scatter plots of the data. This is presented in Figure 6 below:
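The F-statistic and R^2 of a regression are linked by a standard identity, which allows a rough consistency check of the reported values. The sketch below assumes 26 observations per country (one per year from 1990 through 2015); that sample size is my assumption and is not stated in Table 4. The function name `f_from_r_squared` is hypothetical.

```python
def f_from_r_squared(r_squared, n_obs, n_predictors=1):
    """F-statistic implied by R^2 for a linear model with an intercept."""
    df_model = n_predictors
    df_resid = n_obs - n_predictors - 1
    return (r_squared / df_model) / ((1 - r_squared) / df_resid)


# ASSUMPTION: one observation per year, 1990 through 2015 (26 points)
N_OBS = 26

f_us = f_from_r_squared(0.9736, N_OBS)      # reported F: 885.07
f_china = f_from_r_squared(0.9933, N_OBS)   # reported F: 3578.93
print(round(f_us, 2), round(f_china, 2))
```

Under that assumption the implied US value is about 885 and the implied China value is about 3558. The small gap from the reported 3578.93 is what one would expect from R^2 having been rounded to four decimal places, since the identity is very sensitive to R^2 when it is close to 1.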

In Figure 6, it is evident that the two models fit the data very closely, with the lines passing almost exactly through the individual data points for both the US and China.
These results are reasonable. As both countries’ economies improve, we can typically expect the living conditions of their people to improve as well. Given that both countries experienced economic growth over most of this period, it is no surprise that year is a statistically significant regressor in these models.

Conclusion

Based on the results of this report, we can estimate the effect that the year of measurement has on water quality in these two countries. It is important for analysts to perform this kind of analysis in order both to measure the past and to predict the future. This is especially useful when looking at China. Because China has more room to grow, we can use the China model to predict the value for 2016 and perhaps even the years after. This is done very simply by substituting 26 for year, as 2016 is 26 years after 1990: \[ ChinaValue = 1.16*(26) + 68.16\] \[ = 98.32\] This suggests that if China's water quality improves in 2016 as it did over the years 1990-2015, we can expect 98.32% of the Chinese population to have access to drinking water in 2016. While these models provide a good estimate of water quality, they are very general, and there are always more variables that could be included in such predictions. Future research should examine specific areas of these massive nations, and perhaps include more variables for a multivariate analysis, such as “neighboring country value” or “welfare spending”.
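The prediction above can be sketched in a few lines using the China coefficients reported in this paper. The function name `predict_china` and the cap at 100 are my additions: since the dependent variable is a percentage, a straight line extrapolated far enough will exceed 100, so the sketch clips the output as a simple guard.

```python
def predict_china(year, base_year=1990, slope=1.16, intercept=68.16, cap=100.0):
    """Predicted % of China's population with access to drinking water.

    Uses the coefficients reported in this paper. The cap is an added
    assumption: a percentage cannot exceed 100, but a linear trend can.
    """
    years_since_base = year - base_year
    raw = slope * years_since_base + intercept
    return min(raw, cap)


print(round(predict_china(2016), 2))  # the 2016 prediction worked out above
print(round(predict_china(2018), 2))  # the raw line would pass 100 here
```

The cap also illustrates a limit of the linear model: it can only be trusted for a year or two beyond 2015, after which the fitted line no longer describes a feasible percentage.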