hw 9 data 101

Author

Kenneth

setwd("C:/Users/kenne/Downloads")
Countries <- read.csv("AllCountries.csv")

Question 1

simple_model <- lm(LifeExpectancy ~ GDP, data = Countries)

# View the model summary
summary(simple_model)


Call:
lm(formula = LifeExpectancy ~ GDP, data = Countries)

Residuals:
    Min      1Q  Median      3Q     Max 
-16.352  -3.882   1.550   4.458   9.330 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.901 on 177 degrees of freedom
  (38 observations deleted due to missingness)
Multiple R-squared:  0.4304,    Adjusted R-squared:  0.4272 
F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

The summary shows an intercept of 6.842e+01, so the model predicts a life expectancy of about 68.42 years when GDP is 0. The GDP slope is 2.476e-04, so each one-dollar increase in GDP is associated with a predicted increase of 0.0002476 years in life expectancy. The p-value for GDP is < 2e-16, so it is very unlikely that a relationship this strong would appear by chance alone, and GDP is a statistically significant predictor. The adjusted R-squared is 0.4272, meaning that about 42.72% of the variance in life expectancy is explained by GDP alone.
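To make the size of the slope more concrete, here is a small sketch that plugs GDP values into the fitted line by hand. It uses the rounded coefficients printed in the summary above; the function name `predict_life` and the example GDP value of 10000 are made up for illustration.

```r
# Rounded coefficients copied from the summary output above
intercept <- 68.42      # predicted life expectancy when GDP is 0
slope_gdp <- 2.476e-04  # predicted change in years per one-dollar increase in GDP

# Hypothetical helper: fitted line from the simple model
predict_life <- function(gdp) intercept + slope_gdp * gdp

pred_zero <- predict_life(0)      # equals the intercept
pred_10k  <- predict_life(10000)  # hypothetical GDP value of 10000

# A 10000-dollar difference in GDP corresponds to about 2.5 predicted years
pred_10k - pred_zero
```

With the real model object, `predict(simple_model, newdata = data.frame(GDP = 10000))` should give nearly the same number, differing only by rounding of the coefficients.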

Question 2

multiple_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = Countries)

summary(multiple_model)


Call:
lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = Countries)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.5662  -1.8227   0.4108   2.5422   9.4161 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
GDP         2.367e-05  2.287e-05   1.035 0.302025    
Health      2.479e-01  6.619e-02   3.745 0.000247 ***
Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.104 on 169 degrees of freedom
  (44 observations deleted due to missingness)
Multiple R-squared:  0.7213,    Adjusted R-squared:  0.7164 
F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

In the multiple model, each one-dollar increase in GDP is associated with an increase of only 2.367e-05 years in life expectancy, holding Health and Internet constant, and with a p-value of 0.302 GDP is no longer statistically significant. Health and Internet both have positive coefficients: a one-unit increase in Health is associated with about 0.248 more years of life expectancy, and a one-unit increase in Internet with about 0.190 more years. The p-value for Health is 0.000247 and the p-value for Internet is < 2e-16, so both are significant. The adjusted R-squared is 0.7164, so about 71.64% of the variance in life expectancy is explained by GDP, Health, and Internet together. This is better than the model from Question 1, which had an adjusted R-squared of 0.4272.
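As a rough check on the two adjusted R-squared values, they can be recomputed from the Multiple R-squared and the degrees of freedom printed in each summary. This is only a sketch using the rounded numbers from the output; `adj_r2` is a helper name made up for illustration, and the sample sizes are worked out as residual degrees of freedom plus the number of coefficients (177 + 2 = 179 and 169 + 4 = 173).

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Simple model: R-squared 0.4304, 1 predictor, n = 179
adj_simple <- adj_r2(0.4304, n = 179, p = 1)

# Multiple model: R-squared 0.7213, 3 predictors, n = 173
adj_multiple <- adj_r2(0.7213, n = 173, p = 3)

# Both should agree with the printed values (0.4272 and 0.7164) up to rounding
round(c(simple = adj_simple, multiple = adj_multiple), 4)
```

The adjustment penalizes extra predictors, so the jump from 0.4272 to 0.7164 is not just a side effect of adding more variables.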

Question 3

For homoscedasticity, the Residuals vs. Fitted plot should show a roughly flat red line with points scattered randomly around it, indicating constant variance. The Scale-Location plot should also show a roughly horizontal line with points spread evenly. For normality, the points in the Q-Q Residuals plot should lie close to the dashed reference line, with no noticeable patterns or outliers.

par(mfrow=c(2,2)); plot(simple_model); par(mfrow=c(1,1))

Reflection

The Residuals vs. Fitted plot shows a curved pattern rather than a flat line, so it does not support homoscedasticity and suggests the relationship is not fully linear. In the Scale-Location plot below it, the points sit mostly toward the right instead of being evenly spread, which is not a good sign. The Q-Q Residuals plot looks mostly good, with the exception of a few points, so there are a few outliers.
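When the 2x2 grid is hard to read, each diagnostic plot can also be drawn on its own with the `which` argument of `plot()` for `lm` objects. The sketch below uses simulated stand-in data with a deliberately curved relationship, not the AllCountries data; `x`, `y`, and `demo_model` are hypothetical names.

```r
# Simulated stand-in data (hypothetical), NOT the AllCountries data
set.seed(1)
x <- runif(100, min = 0, max = 10)
y <- 2 + 3 * log1p(x) + rnorm(100, sd = 0.5)  # curved relationship on purpose

# Fit a straight line to curved data so the diagnostics have something to show
demo_model <- lm(y ~ x)

plot(demo_model, which = 1)  # Residuals vs Fitted: look for a flat red line
plot(demo_model, which = 2)  # Q-Q Residuals: look for points near the reference line
plot(demo_model, which = 3)  # Scale-Location: look for an even horizontal band
```

The same calls should work on the real model by replacing `demo_model` with `simple_model`.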

Question 4

residuals_multiple <- resid(multiple_model)

rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple

[1] 4.056417

The RMSE is 4.056417 years of life expectancy, meaning our predictions are typically off by about 4 years. Large residuals mean the model misses badly for some countries, which makes its predictions less reliable. To improve the model, I would add other relevant variables such as healthcare quality, death rates, and education.
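The RMSE here is slightly smaller than the residual standard error of 4.104 in the Question 2 summary because the RMSE divides the squared residuals by the number of observations, while the residual standard error divides by the residual degrees of freedom. The sketch below recovers the RMSE from the printed summary numbers; the sample size of 173 is worked out as 169 residual degrees of freedom plus 4 coefficients, and the result matches only up to rounding of 4.104.

```r
# Values copied from the multiple model summary above
rse <- 4.104        # residual standard error (rounded in the printout)
df_resid <- 169     # residual degrees of freedom
n_obs <- 169 + 4    # observations used = residual df + 4 estimated coefficients

# RSE = sqrt(RSS / df_resid) and RMSE = sqrt(RSS / n_obs),
# so RMSE = RSE * sqrt(df_resid / n_obs)
rmse_from_rse <- rse * sqrt(df_resid / n_obs)
rmse_from_rse
```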

Question 5

It could make the model less useful and reliable. Because energy and electricity are highly correlated, multicollinearity makes it difficult to isolate the influence of either variable. Even a small change to the data or to the model can cause the coefficients to swing widely or take on misleading values.
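The effect can be illustrated with a small simulation. This is a sketch on made-up data, not the AllCountries data: `energy`, `electricity`, and `outcome` are simulated variables, and the variance inflation factor (VIF) is computed by hand as 1 / (1 - R-squared) from regressing one predictor on the other.

```r
# Simulated data (hypothetical): two predictors that are almost copies of each other
set.seed(42)
n <- 200
energy <- rnorm(n)
electricity <- energy + rnorm(n, sd = 0.1)  # highly correlated with energy
outcome <- 1 + 2 * energy + rnorm(n)

# How strongly the two predictors move together
predictor_cor <- cor(energy, electricity)

# VIF by hand: regress one predictor on the other, then 1 / (1 - R^2)
r2_aux <- summary(lm(electricity ~ energy))$r.squared
vif_electricity <- 1 / (1 - r2_aux)

# Standard error of the energy coefficient, alone vs. alongside electricity
se_alone <- summary(lm(outcome ~ energy))$coefficients["energy", "Std. Error"]
se_both  <- summary(lm(outcome ~ energy + electricity))$coefficients["energy", "Std. Error"]

c(correlation = predictor_cor, vif = vif_electricity,
  se_alone = se_alone, se_both = se_both)
```

The standard error of the energy coefficient should be much larger once electricity is added, which is why the estimates become unstable and hard to interpret.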