Guthub Repo: https://github.com/asmozo24/DATA605_Regression_Analysis web:

Homework 12: Multiple Linear Regression

Loading the data from github

The dataset used on this assignment comes from a real-world data-2008. The variables description are shown below.

variable_names1 <- read.csv("https://raw.githubusercontent.com/asmozo24/DATA605_Regression_Analysis/main/who.csv_variables_description", stringsAsFactors=FALSE, header = FALSE)

variable_names1
##                                                                                                 V1
## 1  The attached who.csv dataset contains real-world data from 2008. The variables included follow.
## 2                                                                     Country: name of the country
## 3                                        LifeExp: average life expectancy for the country in years
## 4                                InfantSurvival: proportion of those surviving to one year or more
## 5                              Under5Survival: proportion of those surviving to five years or more
## 6                                                 TBFree: proportion of the population without TB.
## 7                                                 PropMD: proportion of the population who are MDs
## 8                                                 PropRN: proportion of the population who are RNs
## 9         PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
## 10                                  GovtExp: mean government expenditures per capita on healthcare
## 11                                                             US dollars at average exchange rate
## 12                                            TotExp: sum of personal and government expenditures.
who_df <- read.csv("https://raw.githubusercontent.com/asmozo24/DATA605_Regression_Analysis/main/who.csv", stringsAsFactors=FALSE)
str(who_df)
## 'data.frame':    190 obs. of  10 variables:
##  $ Country       : chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
##  $ LifeExp       : int  42 71 71 82 41 73 75 69 82 80 ...
##  $ InfantSurvival: num  0.835 0.985 0.967 0.997 0.846 0.99 0.986 0.979 0.995 0.996 ...
##  $ Under5Survival: num  0.743 0.983 0.962 0.996 0.74 0.989 0.983 0.976 0.994 0.996 ...
##  $ TBFree        : num  0.998 1 0.999 1 0.997 ...
##  $ PropMD        : num  2.29e-04 1.14e-03 1.06e-03 3.30e-03 7.04e-05 ...
##  $ PropRN        : num  0.000572 0.004614 0.002091 0.0035 0.001146 ...
##  $ PersExp       : int  20 169 108 2589 36 503 484 88 3181 3788 ...
##  $ GovtExp       : int  92 3128 5184 169725 1620 12543 19170 1856 187616 189354 ...
##  $ TotExp        : int  112 3297 5292 172314 1656 13046 19654 1944 190797 193142 ...

the ‘who’ dataframe has 10 variables with 190 records. Only one variable(Country) has character data type, the rest of variables are numericals(double and integer).

1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.

#scatterplot of LifeExp~TotExp
plot(LifeExp~TotExp, data = who_df, main = "Life Expectancy vs.Government Expenditures ", xlab = "sum of personal and government expenditures", ylab = "average life expectancy for the country in years", pch = 19)

#simple linear regression
who_lm <- lm(LifeExp~TotExp, data = who_df)
summary(who_lm)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

Interpretation of summary (who_df) F statistics: how good is the relationship between the predictor (government expenditure) and response (life expectancy)? Since we a small dataset (190 observations), we need a large value of F-statistic to confirm a strong bond between predictor and response. In this case, F-statistic: 65.26 which indicates there is some relationship between the two variables, how strong we cannot confirm it.

R^2: 0.2577 or 25.77% is weak (scale from 0 to 1, R^2 = 1 being best fit), which indicates there are some outliers or large errors. This means there is no good fit to the dataset.

Standard error: errors in the models or a measure of the quality of the simple linear regression fit. We expect all regression model to carry in some errors. The actual life expectancy can deviate from the true regression line by 9.371 on government expenditure)

p-values: Based on the 95% confidence interval, 7.714e-14 <0.05 which indicatess that changes in the predictor (government expenditure) are associated with changes in the response variable(life expectancy)

2. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and r re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

We need to transform the two variables in the dataset on power factor.

who_df$LifeExp2 <- (who_df$LifeExp)^4.6
who_df$TotExp2 <- (who_df$TotExp)^0.06
plot((LifeExp2)~(TotExp2), data = who_df, main = "Life Expectancy vs.Government Expenditures ", xlab = "sum of personal and government expenditures", ylab = "average life expectancy for the country in years", pch = 19)

#simple linear regression
who_lm2 <- lm(LifeExp2~TotExp2, data = who_df)
summary(who_lm2)
## 
## Call:
## lm(formula = LifeExp2 ~ TotExp2, data = who_df)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp2      620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

Interpreting the summary lm output

The model 1 Residual standard error: 9.371 Multiple R-squared: 0.2577 F-statistic: 65.26 p-value: 7.714e-14

Model 2 Residual standard error: 90490000 Multiple R-squared: 0.7298 F-statistic: 507.7 p-value: < 2.2e-16

When we compared model 1 and 2, we definitely see a higher value of F-statistic (), pvalue is still less than 0.05, a better Rsquare value 0.7298-0.2577. we conclude model 2 is better than model 1.

3. Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5. I don’t think forecast function can apply to this dataset as it appears not to be a time serie, perhaps it was more to day predict y based on x.

#install.packages('forecast')
library(forecast)
## Warning: package 'forecast' was built under R version 4.0.5
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
fcast <- function(x){
who_df$TotExp3 <- x
who_df$LifeExp3 = -736527910 + 620060216*who_df$TotExp3 
#fit <- lm(who_df$LifeExp3 ~ trend + season)
# something about a = b^n
a <-forecast((who_df$LifeExp3)^{1/4.6}, h=20)
b <- plot(forecast(who_df$LifeExp3, h=20))
return (list(a, b))

}
fcast(1.5)

## [[1]]
##     Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 191       63.31153 63.31153 63.31153 63.31153 63.31153
## 192       63.31153 63.31153 63.31153 63.31153 63.31153
## 193       63.31153 63.31153 63.31153 63.31153 63.31153
## 194       63.31153 63.31153 63.31153 63.31153 63.31153
## 195       63.31153 63.31153 63.31153 63.31153 63.31153
## 196       63.31153 63.31153 63.31153 63.31153 63.31153
## 197       63.31153 63.31153 63.31153 63.31153 63.31153
## 198       63.31153 63.31153 63.31153 63.31153 63.31153
## 199       63.31153 63.31153 63.31153 63.31153 63.31153
## 200       63.31153 63.31153 63.31153 63.31153 63.31153
## 201       63.31153 63.31153 63.31153 63.31153 63.31153
## 202       63.31153 63.31153 63.31153 63.31153 63.31153
## 203       63.31153 63.31153 63.31153 63.31153 63.31153
## 204       63.31153 63.31153 63.31153 63.31153 63.31153
## 205       63.31153 63.31153 63.31153 63.31153 63.31153
## 206       63.31153 63.31153 63.31153 63.31153 63.31153
## 207       63.31153 63.31153 63.31153 63.31153 63.31153
## 208       63.31153 63.31153 63.31153 63.31153 63.31153
## 209       63.31153 63.31153 63.31153 63.31153 63.31153
## 210       63.31153 63.31153 63.31153 63.31153 63.31153
## 
## [[2]]
## [[2]]$mean
## Time Series:
## Start = 191 
## End = 210 
## Frequency = 1 
##  [1] 193562414 193562414 193562414 193562414 193562414 193562414 193562414
##  [8] 193562414 193562414 193562414 193562414 193562414 193562414 193562414
## [15] 193562414 193562414 193562414 193562414 193562414 193562414
## 
## [[2]]$lower
## Time Series:
## Start = 191 
## End = 210 
## Frequency = 1 
##           80%       95%
## 191 193562414 193562414
## 192 193562414 193562414
## 193 193562414 193562414
## 194 193562414 193562414
## 195 193562414 193562414
## 196 193562414 193562414
## 197 193562414 193562414
## 198 193562414 193562414
## 199 193562414 193562414
## 200 193562414 193562414
## 201 193562414 193562414
## 202 193562414 193562414
## 203 193562414 193562414
## 204 193562414 193562414
## 205 193562414 193562414
## 206 193562414 193562414
## 207 193562414 193562414
## 208 193562414 193562414
## 209 193562414 193562414
## 210 193562414 193562414
## 
## [[2]]$upper
## Time Series:
## Start = 191 
## End = 210 
## Frequency = 1 
##           80%       95%
## 191 193562414 193562414
## 192 193562414 193562414
## 193 193562414 193562414
## 194 193562414 193562414
## 195 193562414 193562414
## 196 193562414 193562414
## 197 193562414 193562414
## 198 193562414 193562414
## 199 193562414 193562414
## 200 193562414 193562414
## 201 193562414 193562414
## 202 193562414 193562414
## 203 193562414 193562414
## 204 193562414 193562414
## 205 193562414 193562414
## 206 193562414 193562414
## 207 193562414 193562414
## 208 193562414 193562414
## 209 193562414 193562414
## 210 193562414 193562414
fcast(2.5)

## [[1]]
##     Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 191       86.50645 86.50645 86.50645 86.50645 86.50645
## 192       86.50645 86.50645 86.50645 86.50645 86.50645
## 193       86.50645 86.50645 86.50645 86.50645 86.50645
## 194       86.50645 86.50645 86.50645 86.50645 86.50645
## 195       86.50645 86.50645 86.50645 86.50645 86.50645
## 196       86.50645 86.50645 86.50645 86.50645 86.50645
## 197       86.50645 86.50645 86.50645 86.50645 86.50645
## 198       86.50645 86.50645 86.50645 86.50645 86.50645
## 199       86.50645 86.50645 86.50645 86.50645 86.50645
## 200       86.50645 86.50645 86.50645 86.50645 86.50645
## 201       86.50645 86.50645 86.50645 86.50645 86.50645
## 202       86.50645 86.50645 86.50645 86.50645 86.50645
## 203       86.50645 86.50645 86.50645 86.50645 86.50645
## 204       86.50645 86.50645 86.50645 86.50645 86.50645
## 205       86.50645 86.50645 86.50645 86.50645 86.50645
## 206       86.50645 86.50645 86.50645 86.50645 86.50645
## 207       86.50645 86.50645 86.50645 86.50645 86.50645
## 208       86.50645 86.50645 86.50645 86.50645 86.50645
## 209       86.50645 86.50645 86.50645 86.50645 86.50645
## 210       86.50645 86.50645 86.50645 86.50645 86.50645
## 
## [[2]]
## [[2]]$mean
## Time Series:
## Start = 191 
## End = 210 
## Frequency = 1 
##  [1] 813622630 813622630 813622630 813622630 813622630 813622630 813622630
##  [8] 813622630 813622630 813622630 813622630 813622630 813622630 813622630
## [15] 813622630 813622630 813622630 813622630 813622630 813622630
## 
## [[2]]$lower
## Time Series:
## Start = 191 
## End = 210 
## Frequency = 1 
##           80%       95%
## 191 813622630 813622630
## 192 813622630 813622630
## 193 813622630 813622630
## 194 813622630 813622630
## 195 813622630 813622630
## 196 813622630 813622630
## 197 813622630 813622630
## 198 813622630 813622630
## 199 813622630 813622630
## 200 813622630 813622630
## 201 813622630 813622630
## 202 813622630 813622630
## 203 813622630 813622630
## 204 813622630 813622630
## 205 813622630 813622630
## 206 813622630 813622630
## 207 813622630 813622630
## 208 813622630 813622630
## 209 813622630 813622630
## 210 813622630 813622630
## 
## [[2]]$upper
## Time Series:
## Start = 191 
## End = 210 
## Frequency = 1 
##           80%       95%
## 191 813622630 813622630
## 192 813622630 813622630
## 193 813622630 813622630
## 194 813622630 813622630
## 195 813622630 813622630
## 196 813622630 813622630
## 197 813622630 813622630
## 198 813622630 813622630
## 199 813622630 813622630
## 200 813622630 813622630
## 201 813622630 813622630
## 202 813622630 813622630
## 203 813622630 813622630
## 204 813622630 813622630
## 205 813622630 813622630
## 206 813622630 813622630
## 207 813622630 813622630
## 208 813622630 813622630
## 209 813622630 813622630
## 210 813622630 813622630

4. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model? LifeExp = b0+b1 x PropMd + b2 x TotExp +b3 x PropMD x TotExp

who_lm3 <- lm(LifeExp ~ PropMD + TotExp + (PropMD*TotExp), data=who_df)
summary(who_lm3)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + (PropMD * TotExp), data = who_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

Interpretation of summary (who_lm3) F-statistics: F-statistic: 34.49 which indicates there is some relationship between the two variables, how strong we cannot confirm it.

R^2: 0.3574 or 35.74% is weak, which indicates there are some outliers or large errors. This means there is no good fit to the dataset.

Standard error: The actual life expectancy can deviate from the true regression line by 8.765 on each predictor)

p-values: Based on the 95% confidence interval, 2.2e-16 <0.05 which indicatess that changes in the predictors are associated with changes in the response variable(life expectancy)

Based on the summary of lm, the model is not good due to weak R^2 , F-statistics . This is based of model 2.

5. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

fcast1 <- function(x1, x2){
PropMD4 <- x1
TotExp4 <- x2
y = 62.77 + 7.233e-05*TotExp4 +1497*PropMD4 -6.026*10^{-3}*(PropMD4*TotExp4)
#fit <- lm(who_df$LifeExp3 ~ trend + season)
# something about a = b^n
# a <-forecast(y, h=20)
# b <- plot(a)
    return (y)# (list(a, b))
}
fcast1(0.03, 14)
## [1] 107.6785
#let's find out what the life expectancy min and max, average
min(who_df$LifeExp)
## [1] 40
mean(who_df$LifeExp)
## [1] 67.37895
max(who_df$LifeExp)
## [1] 83

The forecast output 107.6785 while the life expectancy (average life is 67.37). This is unrealistic. Perhaps, a life expectancy for one individual only because the minimum is 40 and the maximum is 83.