Polynomial Regression
- 1 Biodiversity 2020 Indicators from England’s wildlife and ecosystem services
- 2 Sanitation and Drinking Water Conditions
- 2.1 Run a linear regression: Sanitation ~ Year
- 2.2 Run a Polynomial regression: Sanitation ~ Year+I(Year^2)
- 2.3 Scatterplot(Sanitation ~ Year) + a Linear Regression + a Polynomial Regression
- 2.4 Run a linear regression: Water ~ Year
- 2.5 Run a Polynomial regression: Water ~ Year+I(Year^2)
- 2.6 Scatterplot(Water ~ Year) + a Linear Regression + a Polynomial Regression
- 2.7 Conclusion
1 Biodiversity 2020 Indicators from England’s wildlife and ecosystem services
- Dataset: 1970-2016, ENV09 - England biodiversity indicators (sheet 4b)
- https://www.gov.uk/government/statistical-data-sets/env09-england-biodiversity-indicators
- load dat.csv
mydataindex = read.table("./dat.csv", header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)
mydataindex
## Year Hymenoptera Moths
## 1 1970 NA 100.00000
## 2 1971 NA 97.40308
## 3 1972 NA 93.46015
## 4 1973 NA 90.20746
## 5 1974 NA 88.21933
## 6 1975 NA 89.98069
## 7 1976 NA 91.79424
## 8 1977 NA 87.66070
## 9 1978 NA 86.17904
## 10 1979 NA 85.04052
## 11 1980 100.00000 83.04061
## 12 1981 96.06412 81.25337
## 13 1982 94.50884 83.94502
## 14 1983 97.25261 85.60653
## 15 1984 99.61304 85.25475
## 16 1985 100.68708 81.65969
## 17 1986 98.32698 79.08266
## 18 1987 96.20649 77.54394
## 19 1988 94.51686 75.99746
## 20 1989 93.38365 76.72467
## 21 1990 95.23808 77.15220
## 22 1991 95.65064 75.93391
## 23 1992 98.79097 74.25514
## 24 1993 100.70310 71.61099
## 25 1994 101.82695 70.87709
## 26 1995 102.83995 73.25683
## 27 1996 105.26216 72.55038
## 28 1997 104.48161 72.08166
## 29 1998 98.12524 66.48195
## 30 1999 95.70980 64.81656
## 31 2000 95.08747 63.32317
## 32 2001 94.44489 64.34150
## 33 2002 90.79075 63.31262
## 34 2003 88.91001 64.03736
## 35 2004 87.97619 62.93375
## [ reached 'max' / getOption("max.print") -- omitted 12 rows ]
1.1 Run a linear regression: Hymenoptera~Year
RegModel.1 <- lm(Hymenoptera ~ Year, data=mydataindex)
summary(RegModel.1)
##
## Call:
## lm(formula = Hymenoptera ~ Year, data = mydataindex)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.113 -3.839 -1.201 3.006 11.060
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1152.05540 145.93978 7.894 2.79e-09 ***
## Year -0.52999 0.07304 -7.256 1.79e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.744 on 35 degrees of freedom
## (10 observations deleted due to missingness)
## Multiple R-squared: 0.6007, Adjusted R-squared: 0.5893
## F-statistic: 52.65 on 1 and 35 DF, p-value: 1.791e-08
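The note "(10 observations deleted due to missingness)" reflects how lm() handles missing values by default: rows where Hymenoptera is NA (1970 to 1979) are dropped before fitting, which leaves 47 - 10 = 37 observations and 35 residual degrees of freedom. The sketch below illustrates that behaviour using only the 35 rows printed above, since the 12 omitted rows are not available here. Its coefficients will therefore differ from the summary, and the names `shown` and `fit` are illustrative assumptions.

```r
# Only the 35 rows printed in the report (1970-2004); the 12 omitted rows
# are not reproduced, so this fit is NOT the report's RegModel.1.
shown <- data.frame(
  Year = 1970:2004,
  Hymenoptera = c(rep(NA, 10),
    100.00000, 96.06412, 94.50884, 97.25261, 99.61304, 100.68708,
    98.32698, 96.20649, 94.51686, 93.38365, 95.23808, 95.65064,
    98.79097, 100.70310, 101.82695, 102.83995, 105.26216, 104.48161,
    98.12524, 95.70980, 95.08747, 94.44489, 90.79075, 88.91001, 87.97619)
)

# With the default option na.action = "na.omit", lm() drops every row
# that has an NA in a variable used by the formula.
fit <- lm(Hymenoptera ~ Year, data = shown)

n_dropped <- length(na.action(fit))  # rows removed because of NA
n_used <- nobs(fit)                  # rows actually used in the fit
cat("dropped:", n_dropped, " used:", n_used, "\n")
```

Calling summary(fit) on this object prints the same kind of missingness note as the report's output.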
par(mfrow=c(2,2))
plot(RegModel.1)
mtext("Hymenoptera ~ Year", side = 3, line = -2, outer = TRUE, cex=1.2, font=2, col = "red")
1.2 Run a Polynomial regression: Hymenoptera~ Year + I(Year^2)
RegModel.2 <- lm(Hymenoptera ~ Year + I(Year^2), data=mydataindex)
summary(RegModel.2)
##
## Call:
## lm(formula = Hymenoptera ~ Year + I(Year^2), data = mydataindex)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6122 -2.5352 -0.1891 1.7453 7.4466
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.300e+05 2.135e+04 -6.088 6.62e-07 ***
## Year 1.307e+02 2.137e+01 6.117 6.07e-07 ***
## I(Year^2) -3.285e-02 5.349e-03 -6.142 5.64e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.314 on 34 degrees of freedom
## (10 observations deleted due to missingness)
## Multiple R-squared: 0.8107, Adjusted R-squared: 0.7996
## F-statistic: 72.81 on 2 and 34 DF, p-value: 5.143e-13
par(mfrow=c(2,2))
plot(RegModel.2, col="blue", lwd = 2.5, lty= 6)
mtext("Hymenoptera ~ Year + I(Year^2)", side = 3, line = -2, outer = TRUE, cex=1.2, font=2, col = "red")
1.3 Scatterplot(Hymenoptera~Year) + a Linear Regression + a Polynomial Regression
library(car)  # provides scatterplot()
scatterplot(Hymenoptera~Year, regLine=FALSE, smooth=FALSE, boxplots=FALSE, xlim=c(1980, 2020), ylim=c(70,110), ylab="Hymenoptera, index value",cex=1.3, pch=16, data=mydataindex, col="green")
xx <- seq(1980,2020, length=10)
lines(xx, predict(RegModel.1, data.frame(Year=xx)), col="blue", lwd = 2.5, lty= 6)
lines(xx, predict(RegModel.2, data.frame(Year=xx)), col="red", lwd = 2.5, lty= 6)
legend("bottomleft", legend=c( "Linear Regression: Hymenoptera ~ Year", "Polynomial regression: Hymenoptera ~ Year + I(Year^2)" ),col=c( "blue", "red"), lty= c(6,6), cex=0.9)
1.4 Conclusion
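Neither the linear regression nor the second-order polynomial regression models the relationship between Hymenoptera and Year adequately. The quadratic term raises the adjusted R-squared from 0.5893 to 0.7996, but the residual standard error is still 3.314 index points.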
2 Sanitation and Drinking Water Conditions
- Dataset downloaded from environment for development (http://geodata.grid.unep.ch/)
- Variable definitions, for the total population of 114 countries from 1990 to 2012:
- Sanitation := mean improved sanitation conditions
- Water := mean improved drinking water conditions in % o. Data from United Nations Environment
mydatasant = read.table("./dat02.csv", header=TRUE, sep="\t", na.strings="NA", dec=".", strip.white=TRUE)
mydatasant[2] = (as.numeric(gsub(",", ".", mydatasant[,2])))
mydatasant[3] = (as.numeric(gsub(",", ".", mydatasant[,3])))
mydatasant
## Year Sanitation Water
## 1 1990 44.81 73.06
## 2 1991 46.22 74.49
## 3 1992 47.39 75.32
## 4 1993 48.42 76.12
## 5 1994 50.24 78.16
## 6 1995 51.13 78.89
## 7 1996 52.65 79.62
## 8 1997 53.58 80.39
## 9 1998 54.47 81.10
## 10 1999 55.59 81.81
## 11 2000 56.38 82.52
## 12 2001 57.25 83.23
## 13 2002 58.18 83.82
## 14 2003 59.06 84.52
## 15 2004 59.91 85.19
## 16 2005 60.80 85.86
## 17 2006 61.60 86.51
## 18 2007 62.45 87.14
## 19 2008 62.79 87.43
## 20 2009 62.65 87.62
## 21 2010 63.43 88.22
## 22 2011 63.81 88.62
## 23 2012 64.15 88.96
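The gsub() calls above suggest that dat02.csv stores its numbers with decimal commas. Assuming that is the case (the file itself is not shown), a minimal sketch of an alternative is to let read.table() do the conversion through its dec argument. The inline sample below stands in for the first three rows of the file, and the names `sample_text` and `sant` are illustrative.

```r
# Inline stand-in for the first rows of dat02.csv; the tab-separated,
# decimal-comma layout is an assumption inferred from the gsub() calls.
sample_text <- "Year\tSanitation\tWater\n1990\t44,81\t73,06\n1991\t46,22\t74,49\n1992\t47,39\t75,32"

# dec = "," makes read.table() parse "44,81" as the number 44.81,
# so no gsub()/as.numeric() post-processing is needed.
sant <- read.table(text = sample_text, header = TRUE, sep = "\t",
                   na.strings = "NA", dec = ",", strip.white = TRUE)
str(sant)
```

With a real file, the text argument would be replaced by the file path, as in the original call.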
2.1 Run a linear regression: Sanitation ~ Year
RegModel.21A <- lm(Sanitation ~ Year, data=mydatasant)
summary(RegModel.21A)
##
## Call:
## lm(formula = Sanitation ~ Year, data = mydatasant)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0363 -0.9198 0.7135 0.8483 0.9817
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.726e+03 6.554e+01 -26.33 <2e-16 ***
## Year 8.906e-01 3.275e-02 27.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.042 on 21 degrees of freedom
## Multiple R-squared: 0.9724, Adjusted R-squared: 0.9711
## F-statistic: 739.4 on 1 and 21 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(RegModel.21A)
mtext("Sanitation ~ Year ", side = 3, line = -2, outer = TRUE, cex=1.2, font=2, col = "red")
2.2 Run a Polynomial regression: Sanitation ~ Year+I(Year^2)
RegModel.21B <- lm(Sanitation ~ Year + I(Year^2), data=mydatasant)
summary(RegModel.21B)
##
## Call:
## lm(formula = Sanitation ~ Year + I(Year^2), data = mydatasant)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3688 -0.1215 -0.0580 0.1398 0.5185
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.010e+05 4.850e+03 -20.82 5.01e-15 ***
## Year 1.001e+02 4.847e+00 20.65 5.87e-15 ***
## I(Year^2) -2.479e-02 1.211e-03 -20.46 6.97e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2279 on 20 degrees of freedom
## Multiple R-squared: 0.9987, Adjusted R-squared: 0.9986
## F-statistic: 7934 on 2 and 20 DF, p-value: < 2.2e-16
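The very large intercept and Year coefficients above arise because Year (around 2000) and Year^2 (around 4 million) are almost collinear over 1990 to 2012. As a sketch, centering Year before squaring gives the same fitted curve with coefficients that are easier to read. The data are the Sanitation values printed above; the names `sant`, `Yc`, `raw` and `centered` are illustrative assumptions, not part of the original analysis.

```r
# Sanitation values as printed in the report.
sant <- data.frame(
  Year = 1990:2012,
  Sanitation = c(44.81, 46.22, 47.39, 48.42, 50.24, 51.13, 52.65, 53.58,
                 54.47, 55.59, 56.38, 57.25, 58.18, 59.06, 59.91, 60.80,
                 61.60, 62.45, 62.79, 62.65, 63.43, 63.81, 64.15)
)

# Centered predictor: years relative to the mean year of the sample.
sant$Yc <- sant$Year - mean(sant$Year)

raw      <- lm(Sanitation ~ Year + I(Year^2), data = sant)  # as in the report
centered <- lm(Sanitation ~ Yc + I(Yc^2), data = sant)      # same curve, reparameterized

print(coef(centered))

# Both parameterizations describe the same quadratic, so fitted values agree
# up to floating-point error.
max_diff <- max(abs(fitted(raw) - fitted(centered)))
```

In the centered fit, the I(Yc^2) coefficient is the same curvature as the report's I(Year^2) coefficient, the Yc coefficient is the slope of the curve at the mean year, and the intercept is the fitted value at that year.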
par(mfrow=c(2,2))
plot(RegModel.21B, col="blue", lwd = 2.5, lty= 6)
mtext("Sanitation ~ Year + I(Year^2)", side = 3, line = -2, outer = TRUE, cex=1.2, font=2, col = "red")
2.3 Scatterplot(Sanitation ~ Year) + a Linear Regression + a Polynomial Regression
scatterplot(Sanitation ~ Year, regLine=FALSE, smooth=FALSE, boxplots=FALSE, xlim=c(1990, 2020), ylim=c(30,80), ylab="Sanitation mean value",cex=1.3, pch=16, data= mydatasant, col="green")
xx <- seq(1990,2020, length=10)
lines(xx, predict(RegModel.21A, data.frame(Year=xx)), col="blue", lwd = 2.5, lty= 6)
lines(xx, predict(RegModel.21B, data.frame(Year=xx)), col="red", lwd = 2.5, lty= 6)
legend("bottomleft", legend=c( "Linear Regression: Sanitation ~ Year", "Polynomial regression: Sanitation ~ Year + I(Year^2)"),col=c( "blue", "red"), lty= c(6,6), cex=0.9)
2.4 Run a linear regression: Water ~ Year
RegModel.22A <- lm(Water ~ Year, data=mydatasant)
summary(RegModel.22A)
##
## Call:
## lm(formula = Water ~ Year, data = mydatasant)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7268 -0.7306 0.4900 0.5877 0.6822
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.329e+03 5.077e+01 -26.18 <2e-16 ***
## Year 7.055e-01 2.537e-02 27.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8071 on 21 degrees of freedom
## Multiple R-squared: 0.9736, Adjusted R-squared: 0.9723
## F-statistic: 773.4 on 1 and 21 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(RegModel.22A)
mtext("Water ~ Year ", side = 3, line = -2, outer = TRUE, cex=1.2, font=2, col = "red")
2.5 Run a Polynomial regression: Water ~ Year+I(Year^2)
RegModel.22B <- lm(Water ~ Year + I(Year^2), data=mydatasant)
summary(RegModel.22B)
##
## Call:
## lm(formula = Water ~ Year + I(Year^2), data = mydatasant)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40834 -0.16353 0.01231 0.08678 0.64476
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.643e+04 5.248e+03 -14.56 4.13e-12 ***
## Year 7.577e+01 5.245e+00 14.45 4.80e-12 ***
## I(Year^2) -1.876e-02 1.311e-03 -14.31 5.70e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2467 on 20 degrees of freedom
## Multiple R-squared: 0.9976, Adjusted R-squared: 0.9974
## F-statistic: 4243 on 2 and 20 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(RegModel.22B, col="blue", lwd = 2.5, lty= 6)
mtext("Water ~ Year + I(Year^2)", side = 3, line = -2, outer = TRUE, cex=1.2, font=2, col = "red")
2.6 Scatterplot(Water ~ Year) + a Linear Regression + a Polynomial Regression
scatterplot(Water ~ Year, regLine=FALSE, smooth=FALSE, boxplots=FALSE, xlim=c(1990, 2020), ylim=c(65,100), ylab="Water mean value",cex=1.3, pch=16, data= mydatasant, col="green")
xx <- seq(1990,2020, length=10)
lines(xx, predict(RegModel.22A, data.frame(Year=xx)), col="blue", lwd = 2.5, lty= 6)
lines(xx, predict(RegModel.22B, data.frame(Year=xx)), col="red", lwd = 2.5, lty= 6)
legend("bottomleft", legend=c( "Linear Regression: Water ~ Year", "Polynomial regression: Water ~ Year + I(Year^2)"),col=c( "blue", "red"), lty= c(6,6), cex=0.9)
2.7 Conclusion
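The second-order polynomial regression models both relationships, Sanitation ~ Year and Water ~ Year, better than the linear regression. Adjusted R-squared rises from 0.9711 to 0.9986 for Sanitation and from 0.9723 to 0.9974 for Water, and the residual standard error falls from 1.042 to 0.2279 and from 0.8071 to 0.2467, respectively.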
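One way to back this comparison with numbers is to set the adjusted R-squared values side by side and to run an F-test on the two nested models with anova(). The sketch below does this for Sanitation, using the values printed in section 2. The object names are illustrative assumptions, and the same pattern would apply to Water.

```r
# Sanitation values as printed in the report.
sant <- data.frame(
  Year = 1990:2012,
  Sanitation = c(44.81, 46.22, 47.39, 48.42, 50.24, 51.13, 52.65, 53.58,
                 54.47, 55.59, 56.38, 57.25, 58.18, 59.06, 59.91, 60.80,
                 61.60, 62.45, 62.79, 62.65, 63.43, 63.81, 64.15)
)

linear    <- lm(Sanitation ~ Year, data = sant)
quadratic <- lm(Sanitation ~ Year + I(Year^2), data = sant)

# Adjusted R-squared penalizes the extra parameter of the quadratic model.
adj_r2 <- c(linear    = summary(linear)$adj.r.squared,
            quadratic = summary(quadratic)$adj.r.squared)
print(round(adj_r2, 4))

# F-test of the nested models: does adding I(Year^2) reduce the residual
# sum of squares by more than chance alone would explain?
cmp <- anova(linear, quadratic)
print(cmp)
p_value <- cmp[["Pr(>F)"]][2]
```

A small p-value in the second row of the anova table indicates that the quadratic term improves the fit beyond what the linear model captures.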
2020-01-22