I used this data in a past project, where a polynomial model ended up giving the best results. First, though, I tried a linear model of the log-transformed death rate against GDP. I did not end up using it because it fit poorly. Here I will illustrate some of that thought process.
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.2
## Warning: package 'stringr' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load the malnutrition death rate vs. GDP per capita data
url <- 'https://raw.githubusercontent.com/Shayaeng/Data606/main/Final%20Project/malnutrition-death-rate-vs-gdp-per-capita.csv'
gdp_malnutrition <- read.csv(url)
# Shorten the names of the two columns used in the model
colnames(gdp_malnutrition)[colnames(gdp_malnutrition) == "GDP_per_capita_PPP_.2017."] <- "GDP"
colnames(gdp_malnutrition)[colnames(gdp_malnutrition) == "Deaths_Protein.energy.malnutrition"] <- "Deaths"
# Keep only rows with no missing values
gdp_malnutrition <- gdp_malnutrition[complete.cases(gdp_malnutrition), ]
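The summary that follows comes from a model named `linear_model_log` with the formula `log_Deaths ~ GDP`, but the fitting step itself is not shown. A minimal sketch of how it could be reproduced, assuming `log_Deaths` is the natural log of `Deaths` and that non-positive death rates are dropped first (both are assumptions, and `fit_log_model` is a hypothetical helper name):

```r
# Hypothetical helper: fit log(Deaths) ~ GDP on a data frame that has
# numeric columns `GDP` and `Deaths`.
fit_log_model <- function(df) {
  # Assumption: drop non-positive death rates, since log(0) is -Inf
  # and lm() cannot handle infinite responses.
  df <- df[df$Deaths > 0, ]
  # Assumption: natural log; the original transform is not shown.
  df$log_Deaths <- log(df$Deaths)
  lm(log_Deaths ~ GDP, data = df)
}
```

Under those assumptions, `linear_model_log <- fit_log_model(gdp_malnutrition)` followed by `summary(linear_model_log)` should give output of the same form as the summary here, though not necessarily the same numbers.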
##
## Call:
## lm(formula = log_Deaths ~ GDP, data = gdp_malnutrition)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.868 -1.126 0.469 1.324 5.391
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.676e+00 3.325e-02 50.39 <2e-16 ***
## GDP -6.228e-05 1.285e-06 -48.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.834 on 5389 degrees of freedom
## Multiple R-squared: 0.3036, Adjusted R-squared: 0.3034
## F-statistic: 2349 on 1 and 5389 DF, p-value: < 2.2e-16
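Because the response is on a log scale, the GDP coefficient is easier to read as a multiplicative effect. Assuming `log_Deaths` is a natural log (the transform is not shown), here is a quick sketch of the arithmetic for a 1,000-unit increase in GDP per capita, using the estimate from the summary above:

```r
# Estimated GDP slope, copied from the summary output above
gdp_slope <- -6.228e-05

# Multiplicative change in the death rate per 1,000 units of GDP per capita
# (assumes the response is a natural log)
ratio_per_1000 <- exp(gdp_slope * 1000)

# The same effect expressed as a percentage change
pct_change_per_1000 <- (ratio_per_1000 - 1) * 100
```

This works out to roughly a 6% lower predicted death rate per additional 1,000 in GDP per capita. With an R-squared of about 0.30, that number should be read loosely.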
par(mfrow=c(2, 2))
plot(linear_model_log)
The diagnostic plots do not support using a linear model here. The
Residuals vs Fitted plot shows clear curvature, indicating
non-linearity. Additionally, the points in the Normal Q-Q plot do not
follow a straight line, so the residuals do not look normally
distributed. The Scale-Location plot also does not support
homoscedasticity.
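The curvature seen in the Residuals vs Fitted plot can also be checked numerically. One option, sketched here on simulated data rather than the real dataset, is to add a squared GDP term and compare the two nested models with an F-test via `anova()`. The data frame `sim` and its coefficients are made up purely for illustration:

```r
set.seed(42)
# Simulated stand-in for the real data: the log death rate falls with GDP,
# but along a curve rather than a straight line (made-up coefficients)
sim <- data.frame(GDP = runif(500, 1000, 60000))
sim$log_Deaths <- 3 - 1.5e-04 * sim$GDP + 1.5e-09 * sim$GDP^2 +
  rnorm(500, sd = 0.5)

# Straight-line model versus the same model with a quadratic term
fit_linear <- lm(log_Deaths ~ GDP, data = sim)
fit_quadratic <- lm(log_Deaths ~ poly(GDP, 2), data = sim)

# F-test for whether the quadratic term improves the fit
curvature_test <- anova(fit_linear, fit_quadratic)
print(curvature_test)
```

On the real data, the same comparison could be run by swapping `sim` for `gdp_malnutrition`. A small p-value would agree with what the plot shows, and would be consistent with a polynomial model having worked better in the original project.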
What is interesting to note, though, is that the linear model might work if we narrow our range to a subset of the data and do not try to predict outside that range.
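As a rough sketch of that idea, the model could be refit on a restricted GDP range and then used only for predictions inside it. The helper names `fit_in_range` and `predict_in_range` are hypothetical, and the bounds are left to the caller because the text does not say which range works:

```r
# Hypothetical helper: fit log_Deaths ~ GDP using only rows whose GDP
# lies in [gdp_min, gdp_max]; the bounds are stored alongside the model.
fit_in_range <- function(df, gdp_min, gdp_max) {
  in_range <- df$GDP >= gdp_min & df$GDP <= gdp_max
  model <- lm(log_Deaths ~ GDP, data = df[in_range, ])
  list(model = model, gdp_min = gdp_min, gdp_max = gdp_max)
}

# Hypothetical helper: predict only inside the fitted range and stop
# with an error instead of extrapolating.
predict_in_range <- function(fit, new_gdp) {
  if (any(new_gdp < fit$gdp_min | new_gdp > fit$gdp_max)) {
    stop("GDP value outside the range the model was fitted on")
  }
  unname(predict(fit$model, newdata = data.frame(GDP = new_gdp)))
}
```

After fitting on a chosen range, rerunning `plot()` on the `model` element would show whether the diagnostics actually improve on that subset.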