The data set you are working on is called AllCountries. Here is some information about it:
Description: Data on the countries of the world
Format: A data frame with 217 observations on the following 26 variables.
Country: Country name
Code: Three-letter code for country
LandArea: Size in 1000 sq. km.
Population: Population in millions
Density: Number of people per square kilometer
GDP: Gross Domestic Product (in $US) per capita
Rural: Percentage of population living in rural areas
CO2: CO2 emissions (metric tons per capita)
PumpPrice: Price for a liter of gasoline ($US)
Military: Percentage of government expenditures directed toward the military
Health: Percentage of government expenditures directed towards healthcare
ArmedForces: Number of active duty military personnel (in 1,000’s)
Internet: Percentage of the population with access to the internet
Cell: Cell phone subscriptions (per 100 people)
HIV: Percentage of the population with HIV
Hunger: Percent of the population considered undernourished
Diabetes: Percent of the population diagnosed with diabetes
BirthRate: Births per 1000 people
DeathRate: Deaths per 1000 people
ElderlyPop: Percentage of the population at least 65 years old
LifeExpectancy: Average life expectancy (years)
FemaleLabor: Percent of females 15 - 64 in the labor force
Unemployment: Percent of labor force unemployed
Energy: Kilotons of oil equivalent
Electricity: Electric power consumption (kWh per capita)
Developed: Categories for kilowatt hours per capita, 1= under 2500, 2=2500 to 5000,
3=over 5000
Source:
The data was gathered online from data.worldbank.org. Accessed June 2019.
You need to create a new Markdown and perform the following:
Loading the Data
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.5.2
Warning: package 'ggplot2' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 4.0.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 217 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Country, Code
dbl (24): LandArea, Population, Density, GDP, Rural, CO2, PumpPrice, Militar...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Simple Linear Regression
Simple Linear Regression (Fitting and Interpretation): Using the AllCountries
dataset, fit a simple linear regression model to predict LifeExpectancy (average life
expectancy in years) based on GDP (gross domestic product per capita in $US). Report
the intercept and slope coefficients and interpret their meaning in the context of the
dataset. What does the R² value tell you about how well GDP explains variation in life
expectancy across countries?
# Fit simple linear regression: LifeExpectancy ~ GDP
simple_model <- lm(LifeExpectancy ~ GDP, data = allcountries)
# View the model summary
summary(simple_model)
Call:
lm(formula = LifeExpectancy ~ GDP, data = allcountries)
Residuals:
Min 1Q Median 3Q Max
-16.352 -3.882 1.550 4.458 9.330
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.842e+01 5.415e-01 126.36 <2e-16 ***
GDP 2.476e-04 2.141e-05 11.56 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.901 on 177 degrees of freedom
(38 observations deleted due to missingness)
Multiple R-squared: 0.4304, Adjusted R-squared: 0.4272
F-statistic: 133.7 on 1 and 177 DF, p-value: < 2.2e-16
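The intercept coefficient is 68.42, meaning the model predicts a life expectancy of about 68.4 years for a country with a GDP per capita of $0. No country has a GDP of zero, so this is an extrapolation that mainly serves as the baseline of the fitted line.
The slope coefficient is 0.0002476, which means that predicted life expectancy increases by about 0.00025 years for every $1 increase in GDP per capita, or roughly 0.25 years for every $1,000 increase.
The R squared value of 0.4304 shows that about 43% of the variation in life expectancy across countries is explained by GDP. The other 57% is left unexplained by this model and may reflect variables that aren’t included, as well as random variation.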
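The slope is easier to read on a larger scale. The short sketch below only does arithmetic on the coefficients printed in the summary output above; the GDP value of $10,000 is a hypothetical example chosen for illustration, not a value taken from the data.

```r
# Coefficients copied from the summary(simple_model) output above
intercept <- 68.42      # predicted life expectancy (years) at GDP = 0
slope     <- 2.476e-04  # change in years per $1 of GDP per capita

# Rescale the slope to a $1,000 change in GDP per capita
slope_per_1000 <- slope * 1000   # about 0.2476 years

# Predicted life expectancy for a hypothetical country with GDP = $10,000
gdp_example <- 10000
predicted   <- intercept + slope * gdp_example   # about 70.896 years

slope_per_1000
predicted
```

With the fitted model object available, `predict(simple_model, newdata = data.frame(GDP = 10000))` should give essentially the same prediction, up to rounding of the printed coefficients.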
Multiple Linear Regression
Multiple Linear Regression (Fitting and Interpretation) Fit a multiple linear regression
model to predict LifeExpectancy using GDP, Health (percentage of government
expenditures on healthcare), and Internet (percentage of population with internet access)
as predictors. Interpret the coefficient for Health, explaining what it means in terms of
life expectancy while controlling for GDP and Internet. How does the adjusted R²
compare to the simple regression model from Question 1, and what does this suggest
about the additional predictors?
# Fit multiple linear regression: LifeExpectancy ~ GDP + Health + Internet
multiple_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = allcountries)
# View the model summary
summary(multiple_model)
Call:
lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = allcountries)
Residuals:
Min 1Q Median 3Q Max
-14.5662 -1.8227 0.4108 2.5422 9.4161
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.908e+01 8.149e-01 72.499 < 2e-16 ***
GDP 2.367e-05 2.287e-05 1.035 0.302025
Health 2.479e-01 6.619e-02 3.745 0.000247 ***
Internet 1.903e-01 1.656e-02 11.490 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.104 on 169 degrees of freedom
(44 observations deleted due to missingness)
Multiple R-squared: 0.7213, Adjusted R-squared: 0.7164
F-statistic: 145.8 on 3 and 169 DF, p-value: < 2.2e-16
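The Health coefficient is 0.2479, which means that, holding GDP and Internet constant, each additional percentage point of government expenditure directed toward healthcare is associated with an increase of about 0.25 years in predicted life expectancy. The coefficient is statistically significant (p = 0.000247).
Here the adjusted R squared (0.7164) is much larger than in the simple linear regression (0.4272): these variables account for about 72% of the variation in life expectancy, rather than just 43%. This suggests the additional predictors add real explanatory power. Internet has by far the largest t value, and GDP is no longer significant (p = 0.302) once Health and Internet are in the model.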
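As a check on the comparison, the adjusted R squared values can be reproduced from the multiple R squared, the number of observations used, and the number of predictors. This is a minimal sketch: the helper name `adj_r2` is made up for illustration, and the sample sizes are inferred from the residual degrees of freedom in the two summaries (177 + 2 = 179 and 169 + 4 = 173).

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Simple model: R-squared 0.4304, 179 observations used, 1 predictor
adj_simple <- adj_r2(0.4304, n = 179, p = 1)

# Multiple model: R-squared 0.7213, 173 observations used, 3 predictors
adj_multiple <- adj_r2(0.7213, n = 173, p = 3)

round(c(simple = adj_simple, multiple = adj_multiple), 4)
```

Both results agree with the printed values (0.4272 and 0.7164) to four decimals. Note that the two models were fitted on slightly different sets of countries because of missing values, so the comparison is approximate.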
Checking Assumptions
Checking Assumptions (Homoscedasticity and Normality) For the simple linear
regression model from Question 1 (LifeExpectancy ~ GDP), describe how you would
check the assumptions of homoscedasticity and normality of residuals. For each
assumption, explain what an ideal outcome would look like and what a violation might
indicate about the model’s reliability for predicting life expectancy. Afterward, code
your answer and reflect if it matched the ideal outcome.
To check homoscedasticity, we need to confirm that the residuals have constant variance. We examine a plot of residuals vs. fitted values: if the assumption holds, the residuals are scattered evenly around zero in a band of roughly constant width, with no funnel shape or curve. A violation (heteroscedasticity) would not bias the coefficient estimates, but it would make the standard errors, p values, and prediction intervals unreliable, so predictions of life expectancy would be less trustworthy where the spread of the residuals is larger.
To check normality, we need to confirm that the residuals are approximately normally distributed, which we can check with a Q-Q plot. Ideally the points fall along a straight diagonal line. If this assumption is violated, the p values and confidence intervals are less accurate, especially in small samples.
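The two checks described above could be coded along the following lines. This is a sketch under stated assumptions: the data frame `demo` is simulated stand-in data (hypothetical, not the AllCountries values) so that the block runs on its own, and `demo_model` takes the place of `simple_model`.

```r
# Simulated stand-in data (hypothetical) so the sketch runs on its own;
# in the actual analysis, use simple_model fitted on allcountries instead.
set.seed(1)
demo <- data.frame(GDP = runif(100, min = 500, max = 60000))
demo$LifeExpectancy <- 68 + 0.00025 * demo$GDP + rnorm(100, sd = 5)
demo_model <- lm(LifeExpectancy ~ GDP, data = demo)

# Homoscedasticity: residuals vs. fitted values.
# Ideal: an even band around the dashed zero line, no funnel or curve.
plot(fitted(demo_model), resid(demo_model),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)

# Normality: Q-Q plot of the residuals.
# Ideal: points close to the reference line.
qqnorm(resid(demo_model))
qqline(resid(demo_model))
```

For the real model, `plot(simple_model, which = 1:2)` draws the same two diagnostics using base R's built-in method for `lm` objects. The residual summary printed for `simple_model` (minimum -16.352, median 1.550, maximum 9.330) hints at a longer lower tail, so the Q-Q plot is worth inspecting for departures at the low end.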
Diagnosing Model Fit
Diagnosing Model Fit (RMSE and Residuals) For the multiple regression model from
Question 2 (LifeExpectancy ~ GDP + Health + Internet), calculate the RMSE and explain
what it represents in the context of predicting life expectancy. How would large residuals
for certain countries (e.g., those with unusually high or low life expectancy) affect your
confidence in the model’s predictions, and what might you investigate further?
# Calculate residuals
residuals_multiple <- resid(multiple_model)
# Calculate RMSE for multiple model
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
[1] 4.056417
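This RMSE means that the model’s predictions of life expectancy are off by about 4.1 years in a typical case (it is a root-mean-square error, so large misses count more than small ones). Large residuals for certain countries (a big difference between actual and predicted values) would mean the model fits those countries poorly, so I would be less confident in its predictions for them. I would check which countries have the largest residuals, look for outliers or data errors, and consider adding other variables to the multiple linear model to account for the remaining variation.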
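The RMSE is slightly smaller than the residual standard error of 4.104 reported by summary(multiple_model). The two use the same sum of squared residuals but divide by different quantities: the RMSE divides by the number of observations, while the residual standard error divides by the residual degrees of freedom. The sketch below checks that relationship using only numbers from the output above; the sample size of 173 is inferred as 169 residual degrees of freedom plus 4 estimated coefficients.

```r
# Values copied from the summary(multiple_model) output above
rse <- 4.104   # residual standard error
df  <- 169     # residual degrees of freedom
n   <- df + 4  # observations used: 3 slopes + 1 intercept were estimated

# RMSE = sqrt(SSE / n) and RSE = sqrt(SSE / df), so RMSE = RSE * sqrt(df / n)
rmse_from_rse <- rse * sqrt(df / n)
rmse_from_rse
```

The result is close to the RMSE of 4.056417 computed directly from the residuals; the small difference comes from the residual standard error being rounded to three decimals in the printed summary.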
Hypothetical Example
Hypothetical Example (Multicollinearity in Multiple Regression) Suppose you are
analyzing the AllCountries dataset and fit a multiple linear regression model to predict
CO2 emissions (metric tons per capita) using Energy (kilotons of oil equivalent) and
Electricity (kWh per capita) as predictors. You notice that Energy and Electricity are
highly correlated. Explain how this multicollinearity might affect the interpretation of the
regression coefficients and the reliability of the model.
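Multicollinearity makes the coefficient estimates less precise and less reliable because Energy and Electricity carry largely the same information, so the model cannot cleanly separate the effect of one from the other. The standard errors are inflated, and the coefficients can change a lot (even flip sign) with small changes in the data, so interpreting either coefficient as the effect of that variable while holding the other constant becomes unreliable. The overall predictions may still be reasonable. To address this, I could either remove one of the variables or fit two models with one of the variables in each and look at the results separately.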
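The effect on the coefficients can be illustrated with a small simulation. Everything below is hypothetical: the data are generated so that Energy and Electricity are strongly correlated, and none of the numbers come from AllCountries. The variance inflation factor (VIF) is computed by hand as 1 / (1 - R²), where R² comes from regressing one predictor on the other.

```r
# Simulated, hypothetical data with two highly correlated predictors
set.seed(42)
n <- 200
electricity <- rnorm(n, mean = 4000, sd = 1500)
energy      <- 2 * electricity + rnorm(n, sd = 500)  # tied closely to electricity
co2         <- 1 + 0.0005 * energy + 0.0005 * electricity + rnorm(n, sd = 1)
demo <- data.frame(CO2 = co2, Energy = energy, Electricity = electricity)

# Correlation between the predictors and the VIF for Energy
r          <- cor(demo$Energy, demo$Electricity)
r2_energy  <- summary(lm(Energy ~ Electricity, data = demo))$r.squared
vif_energy <- 1 / (1 - r2_energy)

# Standard error of the Energy coefficient with and without Electricity
se_both  <- summary(lm(CO2 ~ Energy + Electricity, data = demo))$coefficients["Energy", "Std. Error"]
se_alone <- summary(lm(CO2 ~ Energy, data = demo))$coefficients["Energy", "Std. Error"]

c(correlation = r, VIF = vif_energy, SE_both = se_both, SE_alone = se_alone)
```

In this simulation the standard error of the Energy coefficient is noticeably larger when Electricity is also in the model, which is the loss of precision described above. A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity.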