Guthub Repo: https://github.com/asmozo24/DATA605_Regression_Analysis web:
The dataset used on this assignment comes from a real-world data-2008. The variables description are shown below.
variable_names1 <- read.csv("https://raw.githubusercontent.com/asmozo24/DATA605_Regression_Analysis/main/who.csv_variables_description", stringsAsFactors=FALSE, header = FALSE)
variable_names1
## V1
## 1 The attached who.csv dataset contains real-world data from 2008. The variables included follow.
## 2 Country: name of the country
## 3 LifeExp: average life expectancy for the country in years
## 4 InfantSurvival: proportion of those surviving to one year or more
## 5 Under5Survival: proportion of those surviving to five years or more
## 6 TBFree: proportion of the population without TB.
## 7 PropMD: proportion of the population who are MDs
## 8 PropRN: proportion of the population who are RNs
## 9 PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
## 10 GovtExp: mean government expenditures per capita on healthcare
## 11 US dollars at average exchange rate
## 12 TotExp: sum of personal and government expenditures.
who_df <- read.csv("https://raw.githubusercontent.com/asmozo24/DATA605_Regression_Analysis/main/who.csv", stringsAsFactors=FALSE)
str(who_df)
## 'data.frame': 190 obs. of 10 variables:
## $ Country : chr "Afghanistan" "Albania" "Algeria" "Andorra" ...
## $ LifeExp : int 42 71 71 82 41 73 75 69 82 80 ...
## $ InfantSurvival: num 0.835 0.985 0.967 0.997 0.846 0.99 0.986 0.979 0.995 0.996 ...
## $ Under5Survival: num 0.743 0.983 0.962 0.996 0.74 0.989 0.983 0.976 0.994 0.996 ...
## $ TBFree : num 0.998 1 0.999 1 0.997 ...
## $ PropMD : num 2.29e-04 1.14e-03 1.06e-03 3.30e-03 7.04e-05 ...
## $ PropRN : num 0.000572 0.004614 0.002091 0.0035 0.001146 ...
## $ PersExp : int 20 169 108 2589 36 503 484 88 3181 3788 ...
## $ GovtExp : int 92 3128 5184 169725 1620 12543 19170 1856 187616 189354 ...
## $ TotExp : int 112 3297 5292 172314 1656 13046 19654 1944 190797 193142 ...
the ‘who’ dataframe has 10 variables with 190 records. Only one variable(Country) has character data type, the rest of variables are numericals(double and integer).
#scatterplot of LifeExp~TotExp
plot(LifeExp~TotExp, data = who_df, main = "Life Expectancy vs.Government Expenditures ", xlab = "sum of personal and government expenditures", ylab = "average life expectancy for the country in years", pch = 19)
#simple linear regression
who_lm <- lm(LifeExp~TotExp, data = who_df)
summary(who_lm)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
Interpretation of summary (who_df) F statistics: how good is the relationship between the predictor (government expenditure) and response (life expectancy)? Since we a small dataset (190 observations), we need a large value of F-statistic to confirm a strong bond between predictor and response. In this case, F-statistic: 65.26 which indicates there is some relationship between the two variables, how strong we cannot confirm it.
R^2: 0.2577 or 25.77% is weak (scale from 0 to 1, R^2 = 1 being best fit), which indicates there are some outliers or large errors. This means there is no good fit to the dataset.
Standard error: errors in the models or a measure of the quality of the simple linear regression fit. We expect all regression model to carry in some errors. The actual life expectancy can deviate from the true regression line by 9.371 on government expenditure)
p-values: Based on the 95% confidence interval, 7.714e-14 <0.05 which indicatess that changes in the predictor (government expenditure) are associated with changes in the response variable(life expectancy)
We need to transform the two variables in the dataset on power factor.
who_df$LifeExp2 <- (who_df$LifeExp)^4.6
who_df$TotExp2 <- (who_df$TotExp)^0.06
plot((LifeExp2)~(TotExp2), data = who_df, main = "Life Expectancy vs.Government Expenditures ", xlab = "sum of personal and government expenditures", ylab = "average life expectancy for the country in years", pch = 19)
#simple linear regression
who_lm2 <- lm(LifeExp2~TotExp2, data = who_df)
summary(who_lm2)
##
## Call:
## lm(formula = LifeExp2 ~ TotExp2, data = who_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExp2 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
The model 1 Residual standard error: 9.371 Multiple R-squared: 0.2577 F-statistic: 65.26 p-value: 7.714e-14
Model 2 Residual standard error: 90490000 Multiple R-squared: 0.7298 F-statistic: 507.7 p-value: < 2.2e-16
When we compared model 1 and 2, we definitely see a higher value of F-statistic (), pvalue is still less than 0.05, a better Rsquare value 0.7298-0.2577. we conclude model 2 is better than model 1.
#install.packages('forecast')
library(forecast)
## Warning: package 'forecast' was built under R version 4.0.5
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
fcast <- function(x){
who_df$TotExp3 <- x
who_df$LifeExp3 = -736527910 + 620060216*who_df$TotExp3
#fit <- lm(who_df$LifeExp3 ~ trend + season)
# something about a = b^n
a <-forecast((who_df$LifeExp3)^{1/4.6}, h=20)
b <- plot(forecast(who_df$LifeExp3, h=20))
return (list(a, b))
}
fcast(1.5)
## [[1]]
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 191 63.31153 63.31153 63.31153 63.31153 63.31153
## 192 63.31153 63.31153 63.31153 63.31153 63.31153
## 193 63.31153 63.31153 63.31153 63.31153 63.31153
## 194 63.31153 63.31153 63.31153 63.31153 63.31153
## 195 63.31153 63.31153 63.31153 63.31153 63.31153
## 196 63.31153 63.31153 63.31153 63.31153 63.31153
## 197 63.31153 63.31153 63.31153 63.31153 63.31153
## 198 63.31153 63.31153 63.31153 63.31153 63.31153
## 199 63.31153 63.31153 63.31153 63.31153 63.31153
## 200 63.31153 63.31153 63.31153 63.31153 63.31153
## 201 63.31153 63.31153 63.31153 63.31153 63.31153
## 202 63.31153 63.31153 63.31153 63.31153 63.31153
## 203 63.31153 63.31153 63.31153 63.31153 63.31153
## 204 63.31153 63.31153 63.31153 63.31153 63.31153
## 205 63.31153 63.31153 63.31153 63.31153 63.31153
## 206 63.31153 63.31153 63.31153 63.31153 63.31153
## 207 63.31153 63.31153 63.31153 63.31153 63.31153
## 208 63.31153 63.31153 63.31153 63.31153 63.31153
## 209 63.31153 63.31153 63.31153 63.31153 63.31153
## 210 63.31153 63.31153 63.31153 63.31153 63.31153
##
## [[2]]
## [[2]]$mean
## Time Series:
## Start = 191
## End = 210
## Frequency = 1
## [1] 193562414 193562414 193562414 193562414 193562414 193562414 193562414
## [8] 193562414 193562414 193562414 193562414 193562414 193562414 193562414
## [15] 193562414 193562414 193562414 193562414 193562414 193562414
##
## [[2]]$lower
## Time Series:
## Start = 191
## End = 210
## Frequency = 1
## 80% 95%
## 191 193562414 193562414
## 192 193562414 193562414
## 193 193562414 193562414
## 194 193562414 193562414
## 195 193562414 193562414
## 196 193562414 193562414
## 197 193562414 193562414
## 198 193562414 193562414
## 199 193562414 193562414
## 200 193562414 193562414
## 201 193562414 193562414
## 202 193562414 193562414
## 203 193562414 193562414
## 204 193562414 193562414
## 205 193562414 193562414
## 206 193562414 193562414
## 207 193562414 193562414
## 208 193562414 193562414
## 209 193562414 193562414
## 210 193562414 193562414
##
## [[2]]$upper
## Time Series:
## Start = 191
## End = 210
## Frequency = 1
## 80% 95%
## 191 193562414 193562414
## 192 193562414 193562414
## 193 193562414 193562414
## 194 193562414 193562414
## 195 193562414 193562414
## 196 193562414 193562414
## 197 193562414 193562414
## 198 193562414 193562414
## 199 193562414 193562414
## 200 193562414 193562414
## 201 193562414 193562414
## 202 193562414 193562414
## 203 193562414 193562414
## 204 193562414 193562414
## 205 193562414 193562414
## 206 193562414 193562414
## 207 193562414 193562414
## 208 193562414 193562414
## 209 193562414 193562414
## 210 193562414 193562414
fcast(2.5)
## [[1]]
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 191 86.50645 86.50645 86.50645 86.50645 86.50645
## 192 86.50645 86.50645 86.50645 86.50645 86.50645
## 193 86.50645 86.50645 86.50645 86.50645 86.50645
## 194 86.50645 86.50645 86.50645 86.50645 86.50645
## 195 86.50645 86.50645 86.50645 86.50645 86.50645
## 196 86.50645 86.50645 86.50645 86.50645 86.50645
## 197 86.50645 86.50645 86.50645 86.50645 86.50645
## 198 86.50645 86.50645 86.50645 86.50645 86.50645
## 199 86.50645 86.50645 86.50645 86.50645 86.50645
## 200 86.50645 86.50645 86.50645 86.50645 86.50645
## 201 86.50645 86.50645 86.50645 86.50645 86.50645
## 202 86.50645 86.50645 86.50645 86.50645 86.50645
## 203 86.50645 86.50645 86.50645 86.50645 86.50645
## 204 86.50645 86.50645 86.50645 86.50645 86.50645
## 205 86.50645 86.50645 86.50645 86.50645 86.50645
## 206 86.50645 86.50645 86.50645 86.50645 86.50645
## 207 86.50645 86.50645 86.50645 86.50645 86.50645
## 208 86.50645 86.50645 86.50645 86.50645 86.50645
## 209 86.50645 86.50645 86.50645 86.50645 86.50645
## 210 86.50645 86.50645 86.50645 86.50645 86.50645
##
## [[2]]
## [[2]]$mean
## Time Series:
## Start = 191
## End = 210
## Frequency = 1
## [1] 813622630 813622630 813622630 813622630 813622630 813622630 813622630
## [8] 813622630 813622630 813622630 813622630 813622630 813622630 813622630
## [15] 813622630 813622630 813622630 813622630 813622630 813622630
##
## [[2]]$lower
## Time Series:
## Start = 191
## End = 210
## Frequency = 1
## 80% 95%
## 191 813622630 813622630
## 192 813622630 813622630
## 193 813622630 813622630
## 194 813622630 813622630
## 195 813622630 813622630
## 196 813622630 813622630
## 197 813622630 813622630
## 198 813622630 813622630
## 199 813622630 813622630
## 200 813622630 813622630
## 201 813622630 813622630
## 202 813622630 813622630
## 203 813622630 813622630
## 204 813622630 813622630
## 205 813622630 813622630
## 206 813622630 813622630
## 207 813622630 813622630
## 208 813622630 813622630
## 209 813622630 813622630
## 210 813622630 813622630
##
## [[2]]$upper
## Time Series:
## Start = 191
## End = 210
## Frequency = 1
## 80% 95%
## 191 813622630 813622630
## 192 813622630 813622630
## 193 813622630 813622630
## 194 813622630 813622630
## 195 813622630 813622630
## 196 813622630 813622630
## 197 813622630 813622630
## 198 813622630 813622630
## 199 813622630 813622630
## 200 813622630 813622630
## 201 813622630 813622630
## 202 813622630 813622630
## 203 813622630 813622630
## 204 813622630 813622630
## 205 813622630 813622630
## 206 813622630 813622630
## 207 813622630 813622630
## 208 813622630 813622630
## 209 813622630 813622630
## 210 813622630 813622630
who_lm3 <- lm(LifeExp ~ PropMD + TotExp + (PropMD*TotExp), data=who_df)
summary(who_lm3)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + (PropMD * TotExp), data = who_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
Interpretation of summary (who_lm3) F-statistics: F-statistic: 34.49 which indicates there is some relationship between the two variables, how strong we cannot confirm it.
R^2: 0.3574 or 35.74% is weak, which indicates there are some outliers or large errors. This means there is no good fit to the dataset.
Standard error: The actual life expectancy can deviate from the true regression line by 8.765 on each predictor)
p-values: Based on the 95% confidence interval, 2.2e-16 <0.05 which indicatess that changes in the predictors are associated with changes in the response variable(life expectancy)
Based on the summary of lm, the model is not good due to weak R^2 , F-statistics . This is based of model 2.
fcast1 <- function(x1, x2){
PropMD4 <- x1
TotExp4 <- x2
y = 62.77 + 7.233e-05*TotExp4 +1497*PropMD4 -6.026*10^{-3}*(PropMD4*TotExp4)
#fit <- lm(who_df$LifeExp3 ~ trend + season)
# something about a = b^n
# a <-forecast(y, h=20)
# b <- plot(a)
return (y)# (list(a, b))
}
fcast1(0.03, 14)
## [1] 107.6785
#let's find out what the life expectancy min and max, average
min(who_df$LifeExp)
## [1] 40
mean(who_df$LifeExp)
## [1] 67.37895
max(who_df$LifeExp)
## [1] 83
The forecast output 107.6785 while the life expectancy (average life is 67.37). This is unrealistic. Perhaps, a life expectancy for one individual only because the minimum is 40 and the maximum is 83.