Homework 9 - DATA 101

Author

Kalina P

Homework Information

The data set you are working on is called AllCountries. Here is some information about it:
Description: Data on the countries of the world
Format: A data frame with 217 observations on the following 26 variables.
 Country: Country name
 Code: Three-letter code for country
 LandArea: Size in 1000 sq. km.
 Population: Population in millions
 Density: Number of people per square kilometer
 GDP: Gross Domestic Product (in $US) per capita
 Rural: Percentage of population living in rural areas
 CO2: CO2 emissions (metric tons per capita)
 PumpPrice: Price for a liter of gasoline ($US)
 Military: Percentage of government expenditures directed toward the military
 Health: Percentage of government expenditures directed towards healthcare
 ArmedForces: Number of active duty military personnel (in 1,000’s)
 Internet: Percentage of the population with access to the internet
 Cell: Cell phone subscriptions (per 100 people)
 HIV: Percentage of the population with HIV
 Hunger: Percent of the population considered undernourished
 Diabetes: Percent of the population diagnosed with diabetes
 BirthRate: Births per 1000 people
 DeathRate: Deaths per 1000 people
 ElderlyPop: Percentage of the population at least 65 years old
 LifeExpectancy: Average life expectancy (years)
 FemaleLabor: Percent of females 15 - 64 in the labor force
 Unemployment: Percent of labor force unemployed
 Energy: Kilotons of oil equivalent
 Electricity: Electric power consumption (kWh per capita)
 Developed: Categories for kilowatt hours per capita, 1 = under 2500, 2 = 2500 to 5000, 3 = over 5000
Source:
The data was gathered online from data.worldbank.org. Accessed June 2019.

You need to create a new Markdown and perform the following:

Loading the Data

library(tidyverse)

Warning: package 'tidyverse' was built under R version 4.5.2

Warning: package 'ggplot2' was built under R version 4.5.2

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

setwd("C:/Users/kpeter81/OneDrive - montgomerycollege.edu/Datasets")
allcountries <- read_csv("AllCountries.csv")

Rows: 217 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): Country, Code
dbl (24): LandArea, Population, Density, GDP, Rural, CO2, PumpPrice, Militar...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Simple Linear Regression

Simple Linear Regression (Fitting and Interpretation): Using the AllCountries
dataset, fit a simple linear regression model to predict LifeExpectancy (average life
expectancy in years) based on GDP (gross domestic product per capita in $US). Report
the intercept and slope coefficients and interpret their meaning in the context of the
dataset. What does the R² value tell you about how well GDP explains variation in life
expectancy across countries?

# Fit simple linear regression: LifeExpectancy ~ GDP
simple_model <- lm(LifeExpectancy ~ GDP, data = allcountries)

# View the model summary
summary(simple_model)


Call:
lm(formula = LifeExpectancy ~ GDP, data = allcountries)

Residuals:
    Min      1Q  Median      3Q     Max 
-16.352  -3.882   1.550   4.458   9.330 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.901 on 177 degrees of freedom
  (38 observations deleted due to missingness)
Multiple R-squared:  0.4304,    Adjusted R-squared:  0.4272 
F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

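The intercept coefficient is 68.42, meaning the model predicts a life expectancy of about 68.4 years for a country with a GDP per capita of $0. No country in the data has a GDP of zero, so this is an extrapolation rather than a realistic prediction.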

The slope coefficient is 0.0002476, which means that predicted life expectancy increases by about 0.00025 years for every $1 increase in GDP per capita, or roughly 0.25 years for every additional $1,000.

The R² value of 0.4304 shows that about 43% of the variation in life expectancy can be explained by GDP. The other 57% is left unexplained by this model and may be due to variables that aren’t included in it.
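As a quick check on the slope interpretation, the arithmetic can be sketched directly from the coefficients printed in the summary output. The GDP value of $10,000 is a hypothetical example chosen for illustration, not a country from the dataset.

```r
# Coefficients copied from the summary(simple_model) output
intercept <- 68.42
slope <- 2.476e-04  # years of life expectancy per $1 of GDP per capita

# Rescale the slope to a more readable unit: years per $1,000
slope_per_1000 <- slope * 1000
slope_per_1000

# Predicted life expectancy at a hypothetical GDP per capita of $10,000
gdp_example <- 10000
predicted <- intercept + slope * gdp_example
predicted
```

With these rounded coefficients the slope works out to 0.2476 years per $1,000 and the prediction to 70.896 years.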

Multiple Linear Regression

Multiple Linear Regression (Fitting and Interpretation) Fit a multiple linear regression
model to predict LifeExpectancy using GDP, Health (percentage of government
expenditures on healthcare), and Internet (percentage of population with internet access)
as predictors. Interpret the coefficient for Health, explaining what it means in terms of
life expectancy while controlling for GDP and Internet. How does the adjusted R²
compare to the simple regression model from Question 1, and what does this suggest
about the additional predictors?

# Fit multiple linear regression: LifeExpectancy ~ GDP + Health + Internet
multiple_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = allcountries)

# View the model summary
summary(multiple_model)


Call:
lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = allcountries)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.5662  -1.8227   0.4108   2.5422   9.4161 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
GDP         2.367e-05  2.287e-05   1.035 0.302025    
Health      2.479e-01  6.619e-02   3.745 0.000247 ***
Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.104 on 169 degrees of freedom
  (44 observations deleted due to missingness)
Multiple R-squared:  0.7213,    Adjusted R-squared:  0.7164 
F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

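The Health coefficient is 0.2479, which means that, holding GDP and Internet constant, each additional percentage point of government expenditures directed toward healthcare is associated with an increase of about 0.25 years in predicted life expectancy. This coefficient is statistically significant (p = 0.000247).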

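Here the adjusted R² (0.7164) is much larger than in the simple linear regression (0.4272). The three predictors together account for about 72% of the variation in life expectancy, rather than just 43%. This suggests that the additional predictors improve the model. Most of the improvement comes from Internet and Health, since GDP is no longer statistically significant (p = 0.302) once they are included.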
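The adjusted R² values in both summaries can be reproduced from the reported R², the number of observations used, and the number of predictors. The sketch below uses the standard formula; the helper name `adj_r2` is made up for illustration. The sample sizes come from the residual degrees of freedom in the output (177 + 2 = 179 for the simple model, 169 + 4 = 173 for the multiple model), which also shows that the two models were fit on slightly different sets of countries.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Simple model: R^2 = 0.4304, n = 179, 1 predictor
adj_simple <- adj_r2(0.4304, n = 179, p = 1)

# Multiple model: R^2 = 0.7213, n = 173, 3 predictors
adj_multiple <- adj_r2(0.7213, n = 173, p = 3)

round(c(simple = adj_simple, multiple = adj_multiple), 4)
```

Both results round to the values printed by `summary()`: 0.4272 and 0.7164.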

Checking Assumptions

Checking Assumptions (Homoscedasticity and Normality) For the simple linear
regression model from Question 1 (LifeExpectancy ~ GDP), describe how you would
check the assumptions of homoscedasticity and normality of residuals. For each
assumption, explain what an ideal outcome would look like and what a violation might
indicate about the model’s reliability for predicting life expectancy. Afterward, code
your answer and reflect if it matched the ideal outcome.

To check homoscedasticity, we need to confirm that the residuals have constant variance. We examine the residuals vs. fitted plot. If the assumption holds, the residuals should be scattered evenly around zero with roughly the same spread across all fitted values, with no funnel shape or curve. A violation would make the standard errors unreliable, and it would mean that predictions of life expectancy are less precise for some ranges of GDP than for others.

To check normality, we need to confirm that the residuals are approximately normally distributed, which we can check with a Q-Q plot. Ideally the points fall along the straight diagonal line. If this assumption is violated, the p-values and confidence intervals are less accurate, especially in small samples.

par(mfrow=c(2,2)); plot(simple_model); par(mfrow=c(1,1))
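The two diagnostic plots can also be drawn one at a time with the `which` argument of `plot()`, and `shapiro.test()` gives a numeric check on normality of the residuals. The sketch below uses simulated data (the names `x`, `y`, and `toy_model` are made up) so that it runs on its own; the same calls would apply to `simple_model`.

```r
set.seed(101)  # simulated data only, not the AllCountries values
x <- runif(150, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(150, sd = 1)  # constant-variance, normal errors
toy_model <- lm(y ~ x)

# Residuals vs fitted: look for an even band around zero
plot(toy_model, which = 1)

# Normal Q-Q: look for points close to the diagonal line
plot(toy_model, which = 2)

# Shapiro-Wilk test: a small p-value is evidence against normal residuals
shapiro_result <- shapiro.test(resid(toy_model))
shapiro_result$p.value
```

Because the simulated errors are normal with constant variance, these plots show roughly what the ideal outcome looks like.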

Diagnosing Model Fit

Diagnosing Model Fit (RMSE and Residuals) For the multiple regression model from
Question 2 (LifeExpectancy ~ GDP + Health + Internet), calculate the RMSE and explain
what it represents in the context of predicting life expectancy. How would large residuals
for certain countries (e.g., those with unusually high or low life expectancy) affect your
confidence in the model’s predictions, and what might you investigate further?

# Calculate residuals
residuals_multiple <- resid(multiple_model)

# Calculate RMSE for multiple model
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple

[1] 4.056417

This RMSE means that the model’s predictions of life expectancy are typically off by about 4.1 years. Large residuals for certain countries (a big difference between the actual and predicted values) would mean the model fits those countries poorly, so I would be less confident in its predictions for them. I would investigate whether those countries are outliers or influential points, and I would consider adding other variables to the multiple linear model to account for the remaining variation.
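The RMSE of 4.056 is slightly smaller than the residual standard error of 4.104 reported by `summary(multiple_model)`. The RMSE divides the sum of squared residuals by the number of observations, while the residual standard error divides by the residual degrees of freedom. A short sketch of that relationship, using only numbers from the output:

```r
rmse <- 4.056417   # value of rmse_multiple
n_used <- 173      # 169 residual df + 4 estimated coefficients
df_resid <- 169    # residual degrees of freedom from the summary

# Residual standard error implied by the RMSE
rse <- rmse * sqrt(n_used / df_resid)
round(rse, 3)
```

The result rounds to 4.104, matching the summary output.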

Hypothetical Example

Hypothetical Example (Multicollinearity in Multiple Regression) Suppose you are
analyzing the AllCountries dataset and fit a multiple linear regression model to predict
CO2 emissions (metric tons per capita) using Energy (kilotons of oil equivalent) and
Electricity (kWh per capita) as predictors. You notice that Energy and Electricity are
highly correlated. Explain how this multicollinearity might affect the interpretation of the
regression coefficients and the reliability of the model.

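Multicollinearity makes the coefficient estimates less precise and less reliable because Energy and Electricity carry largely the same information. The model cannot cleanly separate their individual effects, so the standard errors are inflated and the coefficients can change a lot, or even switch sign, with small changes in the data. Interpreting one coefficient while "holding the other constant" also becomes hard to justify. To deal with this, I could either remove one of the variables or fit two models with one of the variables in each and look at the results separately.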
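One common way to quantify this is the variance inflation factor (VIF), which is 1 / (1 - R²) from regressing one predictor on the others. The sketch below uses simulated data, with lowercase `energy`, `electricity`, and `co2` as made-up stand-ins for the real columns, and computes the VIF by hand in base R. Packages such as car also offer a `vif()` function, but it is not needed here.

```r
set.seed(42)  # simulated data only, not the AllCountries values
energy <- rnorm(100)
electricity <- energy + rnorm(100, sd = 0.1)  # nearly a copy of energy
co2 <- 1 + 2 * energy + rnorm(100)

# The two predictors are highly correlated
cor(energy, electricity)

# VIF for energy: regress it on the other predictor
r2_energy <- summary(lm(energy ~ electricity))$r.squared
vif_energy <- 1 / (1 - r2_energy)
vif_energy

# Standard error of the energy coefficient with and without electricity
se_both <- summary(lm(co2 ~ energy + electricity))$coefficients["energy", "Std. Error"]
se_alone <- summary(lm(co2 ~ energy))$coefficients["energy", "Std. Error"]
c(both = se_both, alone = se_alone)
```

A VIF above roughly 5 to 10 is a common rule-of-thumb warning sign. In this simulation, the standard error for `energy` is much larger when `electricity` is also in the model.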