HW12

1

data <- read.csv(file = "who.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
colnames(data)

##  [1] "Country"        "LifeExp"        "InfantSurvival" "Under5Survival"
##  [5] "TBFree"         "PropMD"         "PropRN"         "PersExp"       
##  [9] "GovtExp"        "TotExp"

plot(LifeExp ~ TotExp, data = data)
m1 <- lm(LifeExp ~ TotExp, data = data)
abline(m1)

summary(m1)

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

plot(m1)

The F-statistic tests whether any coefficient in the model differs from zero. With a single predictor it is equivalent to the t-test on the slope (F = t^2), so it adds nothing beyond the slope's p-value here. The p-value is 7.714e-14, well below 0.05, so the relationship between TotExp and LifeExp is statistically significant. R-squared gives an overall measure of how well the model fits the data: at 0.2577, the model explains only 25.77% of the variation in life expectancy. The residuals-vs-fitted plot shows a clear pattern in the residuals, and the Q-Q plot shows that the residuals are not normally distributed. Both plots indicate that the linear regression assumptions are not met, so this model fits poorly.
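As a quick check of the F-test remark above: with one predictor, the overall F-statistic equals the square of the slope's t-statistic. The sketch below illustrates this on R's built-in `cars` data, which is only a stand-in chosen because it needs no external file; the variable names are illustrative.

```r
# Stand-in data: the built-in `cars` data set (NOT the WHO data used above).
fit <- lm(dist ~ speed, data = cars)
s <- summary(fit)

t_slope <- s$coefficients["speed", "t value"]  # t-statistic of the slope
f_stat  <- unname(s$fstatistic["value"])        # overall F-statistic

# The two numbers agree up to floating-point error.
c(t_squared = t_slope^2, F = f_stat)
```

For m1 the same relation holds up to rounding: 8.079^2 is about 65.27, against the reported F-statistic of 65.26.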

2

data2 <- data
data2$LifeExp <- data2$LifeExp^4.6
data2$TotExp  <- data2$TotExp^0.06
m2 <- lm(LifeExp ~ TotExp, data2)
plot(LifeExp ~ TotExp, data2)
abline(m2)

summary(m2)

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = data2)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp       620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

plot(m2)

The F-statistic is 507.7 with a p-value below 2.2e-16, so we reject the hypothesis that the coefficient of TotExp^0.06 is zero; the model is statistically significant. R-squared is 0.7298, so the model explains 72.98% of the variation in LifeExp^4.6. R-squared is much higher than in model 1, and the residual and Q-Q plots suggest that the linear regression assumptions are better satisfied. Model 2 is therefore the better model.
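The power transformations in this section can also be written inside the model formula with `I()` instead of overwriting columns in a copy of the data. The sketch below checks that both routes give the same coefficients. It runs on simulated stand-in data (hypothetical values, not the WHO file), and `sim`, `fit_copy`, and `fit_inline` are illustrative names.

```r
set.seed(1)

# Simulated stand-in for the WHO data (hypothetical values, not the real file).
sim <- data.frame(TotExp = runif(50, 10, 5000))
sim$LifeExp <- (2e8 * sim$TotExp^0.06 + rnorm(50, sd = 1e7))^(1 / 4.6)

# Route used above: transform copies of the columns, then fit.
sim2 <- transform(sim, LifeExp = LifeExp^4.6, TotExp = TotExp^0.06)
fit_copy <- lm(LifeExp ~ TotExp, data = sim2)

# Alternative: apply the transformations inside the formula.
fit_inline <- lm(I(LifeExp^4.6) ~ I(TotExp^0.06), data = sim)

# Both fits estimate the same intercept and slope.
all.equal(unname(coef(fit_copy)), unname(coef(fit_inline)))
```

With the inline form, `predict()` accepts TotExp on its original scale in `newdata`; the prediction is still on the LifeExp^4.6 scale and has to be raised to the power 1/4.6.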

3

# Predict LifeExp^4.6 from the m2 coefficients, then undo the power transform
results <- c((-736527909 + (620060216 * 1.5))^(1/4.6),
             (-736527909 + (620060216 * 2.5))^(1/4.6))

Estimated life expectancy is 63.31 years when TotExp^0.06 is 1.5.

Estimated life expectancy is 86.51 years when TotExp^0.06 is 2.5.
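The forecast in this section has two steps: evaluate the fitted line on the transformed scale, then invert the 4.6 power on the response. A minimal sketch, using the coefficient values from the code above; `predict_life_exp` is a hypothetical helper, not part of the original analysis.

```r
# Coefficients of m2, as used in the forecast above.
b0 <- -736527909
b1 <- 620060216

# Hypothetical helper: takes TotExp^0.06 and returns LifeExp in years.
predict_life_exp <- function(tot_exp_006) {
  life_exp_46 <- b0 + b1 * tot_exp_006  # prediction on the LifeExp^4.6 scale
  life_exp_46^(1 / 4.6)                 # invert the power transform
}

round(predict_life_exp(c(1.5, 2.5)), 2)
```

This should reproduce the two estimates reported above, 63.31 and 86.51 years.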

4

m3 <- lm(LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = data)
summary(m3)

## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

plot(m3)

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

$$\text{Average Life Expectancy} = 62.77270326 + 1497.49395252 \times \text{PropMD} + 0.00007233 \times \text{TotExp} - 0.00602569 \times \text{PropMD} \times \text{TotExp}$$

The F-statistic is 34.49 with a p-value below 2.2e-16, so at least one term in the model is a significant predictor. The p-values for PropMD, TotExp, and the PropMD:TotExp interaction are all below 0.05, so all three terms are significant. R-squared is 0.3574, so the model explains 35.74% of the variation in life expectancy. However, the residual plot shows that the variance of the residuals is not constant, so the linear regression assumptions are not met.

5

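# Forecast LifeExp at PropMD = 0.03 and TotExp = 14 from the m3 coefficients
# (each coefficient is rounded to 4 decimal places before use)
y <- round(m3$coefficients[1], 4) + (round(m3$coefficients[2], 4) * 0.03) +
    (round(m3$coefficients[3], 4) * 14) + (round(m3$coefficients[4], 4) * 14 * 0.03)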
print(y)

## (Intercept) 
##    107.6964

The forecast life expectancy is about 107.7 years, which is unrealistic: the highest life expectancy in the data set is only around 80.
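To see where the 107-year forecast comes from, it helps to split it into its four terms. The sketch below uses the coefficient values printed in the fitted equation in section 4; the vector and variable names are illustrative, and the result differs from the output above in the fourth decimal place because the coefficients are not rounded to 4 decimals first.

```r
# Coefficients of m3 as printed in the fitted equation in section 4.
b <- c(intercept   = 62.77270326,
       prop_md     = 1497.49395252,
       tot_exp     = 0.00007233,
       interaction = -0.00602569)

prop_md <- 0.03
tot_exp <- 14

# Contribution of each term to the forecast, in years.
contrib <- c(intercept   = b[["intercept"]],
             prop_md     = b[["prop_md"]] * prop_md,
             tot_exp     = b[["tot_exp"]] * tot_exp,
             interaction = b[["interaction"]] * prop_md * tot_exp)

round(contrib, 4)
sum(contrib)
```

The PropMD term alone adds roughly 45 years to the intercept, while the TotExp and interaction terms are negligible at TotExp = 14, so the implausible forecast is driven almost entirely by the PropMD coefficient.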