1 Load Data

setwd("C:/Users/wesle/Downloads/Data 101")
AllCountries <- read.csv("AllCountries.csv")

2 Simple Linear Regression Model

slrm <- lm(LifeExpectancy ~ GDP, data = AllCountries)

summary(slrm)

## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = AllCountries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

The intercept is 68.42, which means that a country with a GDP of zero has a predicted life expectancy of 68.42 years. The slope is 0.0002476, which means that each one-unit increase in GDP is associated with an increase of 0.0002476 years in predicted life expectancy. The R-squared value is 0.43, which means that GDP explains about 43% of the variance in life expectancy across the countries in the dataset.
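To make the intercept and slope concrete, here is a minimal worked prediction using the coefficients printed in the summary output. The GDP value of 10,000 is a hypothetical input chosen only for illustration; it is not taken from the dataset.

```r
# Coefficients copied from the summary(slrm) output
intercept <- 68.42
slope <- 0.0002476

# Hypothetical GDP value, chosen only for illustration
gdp_example <- 10000

# Predicted life expectancy = intercept + slope * GDP
predicted_life <- intercept + slope * gdp_example
predicted_life

# Change in predicted life expectancy for a 1,000-unit increase in GDP
per_thousand <- slope * 1000
per_thousand
```

Because the slope is so small per single unit of GDP, it is often easier to read it per 1,000 units: roughly a quarter of a year of predicted life expectancy.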

3 Multiple Linear Regression

mlrm <- lm(LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)

summary(mlrm)

## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

The coefficient for Health is 0.2479, which means that when Health increases by one unit, predicted life expectancy increases by 0.2479 years, holding GDP and Internet constant. The adjusted R-squared is 0.7164, which is almost 0.3 higher than the simple linear regression model's 0.4272. This means that GDP, together with Health and Internet, explains about 71.64% of the variance in life expectancy across the countries, after adjusting for the number of predictors.
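As a check on the comparison, the adjusted R-squared values can be rebuilt from the numbers in the two summary outputs. This is a sketch using the standard formula; `adj_r2` is a helper name introduced here, and the sample sizes are inferred from the residual degrees of freedom rather than read from the data.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# n = residual degrees of freedom + number of estimated coefficients
n_slrm <- 177 + 2  # intercept + GDP
n_mlrm <- 169 + 4  # intercept + GDP + Health + Internet

adj_slrm <- adj_r2(0.4304, n_slrm, p = 1)
adj_mlrm <- adj_r2(0.7213, n_mlrm, p = 3)

round(c(slrm = adj_slrm, mlrm = adj_mlrm), 4)

# Improvement of the multiple regression over the simple regression
adj_mlrm - adj_slrm
```

The results should agree with the reported 0.4272 and 0.7164 up to rounding. Note that the two models were fit on slightly different sets of countries (179 versus 173 complete observations), so the comparison is not strictly like for like.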

4 Checking Assumptions (Simple Linear Regression Model)

Homoscedasticity and normality can be checked with a residuals vs fitted plot and a Q-Q residuals plot, respectively. To meet the homoscedasticity assumption, the points in the residuals vs fitted plot should be scattered randomly around zero with roughly constant spread. To meet the normality assumption, the points on the Q-Q residuals plot should follow the line and not deviate too far from it.

par(mfrow=c(2,2)); plot(slrm); par(mfrow=c(1,1))

The residuals vs fitted plot shows a curve at the beginning of the plot, which afterwards becomes a line that trends downwards. Furthermore, many of the points are clustered at the curve of the line and then thin out as you move across the plot. The distance of the points from zero also varies a lot throughout the plot. This suggests that the homoscedasticity assumption isn’t met.

On the Q-Q plot, the points at both the left and right ends deviate downwards and away from the line, and several of these deviations appear severe. This suggests that the normality assumption isn’t met.
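The two plots discussed above can also be drawn on their own, which makes them larger and easier to read than the 2x2 grid. The sketch below uses R's built-in `cars` data as a stand-in so that it runs without AllCountries.csv; for this assignment, `fit` would be replaced with `slrm`. The spread comparison at the end is only a rough numeric companion to the plot, not a formal test.

```r
# Stand-in model on the built-in `cars` data so the sketch runs on its own.
# For the assignment, use slrm in place of fit.
fit <- lm(dist ~ speed, data = cars)

# Draw only the two diagnostic plots used for these assumptions
par(mfrow = c(1, 2))
plot(fit, which = 1)  # residuals vs fitted: spread and curvature
plot(fit, which = 2)  # normal Q-Q plot of the standardized residuals
par(mfrow = c(1, 1))

# Rough numeric check: residual spread in the lower vs upper half of fitted values
res <- resid(fit)
upper_half <- fitted(fit) > median(fitted(fit))
spread_by_half <- tapply(res, upper_half, sd)
spread_by_half
```

If the two standard deviations differ a lot, that is consistent with the fanning or thinning pattern that signals unequal variance in the residuals vs fitted plot.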

5 Diagnosing Model Fit (Multiple Linear Regression) (RMSE)

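# Residuals of the multiple regression (named res_mlrm to avoid masking base::rm)
res_mlrm <- resid(mlrm)

# Root mean squared error: square root of the mean squared residual
rmsem <- sqrt(mean(res_mlrm^2))
rmsem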

## [1] 4.056417

The RMSE is 4.056417, which means that the multiple linear regression model's predictions of life expectancy typically miss by about 4.06 years. Large residuals could reduce confidence in this model's predictions: because residuals are squared before averaging, a few very large misses (positive or negative) can pull the RMSE up sharply. To investigate what may be causing this, you could look at how missing values (NAs) are handled, check each of the predictors separately, or see whether any of the predictors aren’t independent of one another.
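The RMSE is closely related to the residual standard error reported by `summary(mlrm)`: both start from the same sum of squared residuals, but the RMSE divides by the number of observations while the residual standard error divides by the residual degrees of freedom. A minimal sketch of that arithmetic, using only numbers from the summary output (the sample size is inferred from the degrees of freedom):

```r
rse <- 4.104        # residual standard error from summary(mlrm)
df_resid <- 169     # residual degrees of freedom
n <- df_resid + 4   # 3 predictors + intercept

# Recover the sum of squared residuals, then divide by n instead of df
sse <- rse^2 * df_resid
rmse_from_rse <- sqrt(sse / n)
rmse_from_rse
```

The result should land very close to the 4.056417 computed from the residuals directly, with the small gap coming from the rounding of 4.104 in the printed summary.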

6

When two predictors are highly correlated with one another, the resulting multicollinearity can severely affect the model and its reliability. Multicollinearity makes it harder for the model to separate the effects of each predictor, and it inflates the standard errors of the coefficient estimates. As a result, coefficients may be unstable, have misleading magnitudes or signs, or appear non-significant when multicollinearity is present.
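One common way to quantify this is the variance inflation factor (VIF): regress each predictor on the other predictors and compute 1 / (1 - R²). The sketch below does this in base R on the built-in `mtcars` data as a stand-in, so it runs without AllCountries.csv. `vif_manual` is a helper name introduced here, not a function from any package.

```r
# VIF for each predictor: 1 / (1 - R^2) from regressing it on the other predictors
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

# Stand-in example; for the assignment the predictors would be
# c("GDP", "Health", "Internet") with data = AllCountries
vifs <- vif_manual(mtcars, c("disp", "hp", "wt"))
round(vifs, 2)
```

A VIF of 1 means a predictor is uncorrelated with the others, and values above roughly 5 to 10 are often treated as a warning sign. In this report, the GDP coefficient falls from 2.476e-04 to 2.367e-05 and loses significance once Health and Internet are added. That would be consistent with GDP being correlated with those predictors, although the output alone does not confirm it.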

Assignment 9

Wesley Samimi
