I used this data for a project in the past. For that project, I ended up using a polynomial model to get the best results. First, though, I tried a linear model of the log-transformed data points. I didn’t end up using it because it didn’t fit the data well. Below, I illustrate some of the thought process behind that decision.

library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.2
## Warning: package 'stringr' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
url <- 'https://raw.githubusercontent.com/Shayaeng/Data606/main/Final%20Project/malnutrition-death-rate-vs-gdp-per-capita.csv'
gdp_malnutrition <- read.csv(url)
colnames(gdp_malnutrition)[colnames(gdp_malnutrition) 
                             == "GDP_per_capita_PPP_.2017."] <- "GDP"
colnames(gdp_malnutrition)[colnames(gdp_malnutrition) 
                             == "Deaths_Protein.energy.malnutrition"] <- "Deaths"

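gdp_malnutrition <- gdp_malnutrition[complete.cases(gdp_malnutrition), ]

# Assumed reconstruction of the fitting step, based on the lm() call and the
# object name used later; a natural log of Deaths is an assumption.
gdp_malnutrition$log_Deaths <- log(gdp_malnutrition$Deaths)
linear_model_log <- lm(log_Deaths ~ GDP, data = gdp_malnutrition)
summary(linear_model_log)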
## 
## Call:
## lm(formula = log_Deaths ~ GDP, data = gdp_malnutrition)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.868 -1.126  0.469  1.324  5.391 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.676e+00  3.325e-02   50.39   <2e-16 ***
## GDP         -6.228e-05  1.285e-06  -48.47   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.834 on 5389 degrees of freedom
## Multiple R-squared:  0.3036, Adjusted R-squared:  0.3034 
## F-statistic:  2349 on 1 and 5389 DF,  p-value: < 2.2e-16
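
To make the GDP coefficient easier to read, it can be converted into a multiplicative effect on the original Deaths scale. This is only a sketch of the arithmetic: it assumes `log_Deaths` is the natural log of `Deaths`, and the variable names below (`slope_gdp`, `delta_gdp`, `multiplier`, `percent_change`) are introduced here for illustration.

```r
# Coefficient copied from the summary output above.
slope_gdp <- -6.228e-05

# Assumption: log_Deaths is the natural log of Deaths, so a change of
# delta_gdp in GDP multiplies the predicted Deaths by exp(slope * delta).
delta_gdp <- 1000
multiplier <- exp(slope_gdp * delta_gdp)
percent_change <- (multiplier - 1) * 100

cat(sprintf("Multiplier per %.0f GDP units: %.4f\n", delta_gdp, multiplier))
cat(sprintf("Percent change per %.0f GDP units: %.2f%%\n", delta_gdp, percent_change))
```

Under that assumption, each additional 1,000 units of GDP corresponds to roughly a 6% lower predicted death rate. With an R-squared of about 0.30, though, the model leaves most of the variation in `log_Deaths` unexplained.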

Residual Analysis

par(mfrow=c(2, 2))
plot(linear_model_log)

The diagnostic plots do not support using a linear model here. The Residuals vs Fitted plot shows extreme curvature, which indicates non-linearity. Additionally, the points on the Q-Q plot do not follow a straight line, and the Scale-Location plot does not support homoscedasticity.
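
The curvature seen in a Residuals vs Fitted plot can also be checked numerically. One option, sketched here, is to add a squared term and compare the two nested models with an F test. The example uses simulated stand-in data rather than the dataset above, and the names `sim`, `gdp_k`, `fit_linear`, `fit_quad` and `curvature_test` are hypothetical.

```r
# Simulated stand-in data (hypothetical, not the GDP dataset): the response
# is curved in the predictor, so a straight line is misspecified.
set.seed(42)
n <- 500
gdp_k <- runif(n, 0.5, 100)  # predictor on a "thousands" scale
log_deaths <- 4 - 1.2 * log(gdp_k) + rnorm(n, sd = 0.5)
sim <- data.frame(gdp_k = gdp_k, log_deaths = log_deaths)

# Straight-line fit and the same fit with an added squared term.
fit_linear <- lm(log_deaths ~ gdp_k, data = sim)
fit_quad <- lm(log_deaths ~ gdp_k + I(gdp_k^2), data = sim)

# Nested-model F test: a small p-value means the squared term explains
# variation that the straight line leaves behind in its residuals.
curvature_test <- anova(fit_linear, fit_quad)
print(curvature_test)
```

A small p-value in the second row of the table points the same way as a curved Residuals vs Fitted plot: the straight-line fit is missing structure in the data.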

What is interesting to note, though, is that the linear model might work if we narrow the range to a subset of the data and do not try to predict outside that range.
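
A minimal sketch of that idea follows, again on simulated stand-in data rather than the dataset above. The cutoffs `lower` and `upper`, the helper `predict_in_range()`, and all other names are hypothetical. The helper refuses to predict outside the range the model was fitted on.

```r
# Simulated stand-in data (hypothetical): curved over the full range,
# but close to linear over a narrower window.
set.seed(1)
n <- 2000
gdp_k <- runif(n, 0.5, 100)
log_deaths <- 4 - 1.2 * log(gdp_k) + rnorm(n, sd = 0.5)
sim <- data.frame(gdp_k = gdp_k, log_deaths = log_deaths)

# Hypothetical cutoffs defining the narrowed range.
lower <- 20
upper <- 60
sim_subset <- subset(sim, gdp_k >= lower & gdp_k <= upper)

fit_full <- lm(log_deaths ~ gdp_k, data = sim)
fit_subset <- lm(log_deaths ~ gdp_k, data = sim_subset)

# Residual standard error of each fit; in this simulation the subset fit
# has less unexplained scatter around the line.
cat("Full range:", sigma(fit_full), "\n")
cat("Subset:    ", sigma(fit_subset), "\n")

# Predict only inside the fitted range; stop with an error otherwise.
predict_in_range <- function(model, new_gdp_k, lower, upper) {
  stopifnot(all(new_gdp_k >= lower & new_gdp_k <= upper))
  predict(model, newdata = data.frame(gdp_k = new_gdp_k))
}

in_range_pred <- predict_in_range(fit_subset, c(25, 40, 55), lower, upper)
print(in_range_pred)
```

On the real data, the same diagnostic plots would still need to be checked on the subset before trusting the narrower model.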