Final Project - Karthik Balasubramanian

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.3.2

## Warning: package 'ggplot2' was built under R version 4.3.2

## Warning: package 'tibble' was built under R version 4.3.2

## Warning: package 'tidyr' was built under R version 4.3.2

## Warning: package 'readr' was built under R version 4.3.2

## Warning: package 'purrr' was built under R version 4.3.2

## Warning: package 'dplyr' was built under R version 4.3.2

## Warning: package 'stringr' was built under R version 4.3.2

## Warning: package 'forcats' was built under R version 4.3.2

## Warning: package 'lubridate' was built under R version 4.3.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)

## Warning: package 'ggthemes' was built under R version 4.3.2

library(ggrepel)

## Warning: package 'ggrepel' was built under R version 4.3.2

library(AmesHousing)

## Warning: package 'AmesHousing' was built under R version 4.3.2

library(boot)
library(broom)

## Warning: package 'broom' was built under R version 4.3.2

library(lindia)

## Warning: package 'lindia' was built under R version 4.3.2

datas <- read.csv("C:\\Users\\karth\\Downloads\\Child Growth and Malnutrition.csv")
view(datas)

datas$Wasting = as.numeric(datas$Wasting)

## Warning: NAs introduced by coercion

datas$Overweight = as.numeric(datas$Overweight)

## Warning: NAs introduced by coercion

summary(datas)

##  Country.ISO.3.Code Country.Short.Name Year.period        Median.Year       
##  Length:39519       Length:39519       Length:39519       Length:39519      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Start.Month         End.Month         WHO.Reference.number
##  Length:39519       Length:39519       Length:39519        
##  Class :character   Class :character   Class :character    
##  Mode  :character   Mode  :character   Mode  :character    
##                                                            
##                                                            
##                                                            
##                                                            
##  Representation.level    Region          Other.group.Zone       Age           
##  Length:39519         Length:39519       Length:39519       Length:39519      
##  Class :character     Class :character   Class :character   Class :character  
##  Mode  :character     Mode  :character   Mode  :character   Mode  :character  
##                                                                               
##                                                                               
##                                                                               
##                                                                               
##      Sex            Mother.education   Wealth.quintile    Urban.Rural       
##  Length:39519       Length:39519       Length:39519       Length:39519      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Sample.size        Severe.wasting        Wasting         Overweight    
##  Length:39519       Length:39519       Min.   : 0.000   Min.   : 0.000  
##  Class :character   Class :character   1st Qu.: 2.535   1st Qu.: 2.116  
##  Mode  :character   Mode  :character   Median : 5.723   Median : 4.600  
##                                        Mean   : 7.044   Mean   : 5.905  
##                                        3rd Qu.:10.000   3rd Qu.: 8.290  
##                                        Max.   :58.736   Max.   :72.000  
##                                        NA's   :1528     NA's   :2147    
##     Stunting      Underweight    Short.Source.Code  Reference.Title   
##  Min.   : 0.00   Min.   : 0.00   Length:39519       Length:39519      
##  1st Qu.:14.27   1st Qu.: 5.27   Class :character   Class :character  
##  Median :26.40   Median :13.43   Mode  :character   Mode  :character  
##  Mean   :27.82   Mean   :15.81                                        
##  3rd Qu.:38.93   3rd Qu.:22.89                                        
##  Max.   :95.90   Max.   :78.70                                        
##  NA's   :1462    NA's   :1083                                         
##     Author             Notes            JME..Y.N.        
##  Length:39519       Length:39519       Length:39519      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##

a <- datas |>
  group_by(Country.Short.Name, Median.Year) |>
  filter(mean(Wasting) > 20.000) |>
  mutate(average.of.Wasting = mean(Wasting))
unique(a$Country.Short.Name)

## [1] "Djibouti"                             
## [2] "India"                                
## [3] "Maldives"                             
## [4] "Mali"                                 
## [5] "Niger"                                
## [6] "Nigeria"                              
## [7] "Pakistan"                             
## [8] "Democratic People's Republic of Korea"
## [9] "South Sudan"

a |> ggplot() +
  geom_histogram(mapping = aes(x = average.of.Wasting), binwidth = 0.5)

b <- datas |>
  group_by(Country.Short.Name, Median.Year) |>
  filter(mean(Wasting) < 0.500) |>
  mutate(average.of.Wasting = mean(Wasting))
unique(b$Country.Short.Name)

## [1] "Australia"                                           
## [2] "Chile"                                               
## [3] "United Kingdom of Great Britain and Northern Ireland"
## [4] "Guatemala"                                           
## [5] "Nicaragua"                                           
## [6] "Peru"                                                
## [7] "Portugal"                                            
## [8] "United States of America"

b |> ggplot() +
  geom_histogram(mapping = aes(x = average.of.Wasting), binwidth = 0.5)

The above 2 graphs represent the countries with the highest and lowest average Wasting. here, Wasting represents the number of deviations from world mean for that particular metric. Higher deviation means more Wasting among the children, and vice-versa. So from this, we have a basic idea of which countries need to improve their standards.

countries <- c('India', 'Maldives', 'Djibouti', 'Mali', 'Niger', 'Nigeria', 'Pakistan', "Democratic People's Republic of Korea", 'South Sudan')

c <- datas |>
  filter(Country.Short.Name == countries) |>
  ggplot() +
  geom_boxplot(mapping = aes(x = Country.Short.Name, y = Stunting))
c

d <- datas |>
  filter(Country.Short.Name == countries) |>
  ggplot() +
  geom_boxplot(mapping = aes(x = Country.Short.Name, y = Underweight))
d

## Warning: Removed 4 rows containing non-finite values (`stat_boxplot()`).

e <- datas |>
  filter(Country.Short.Name == countries) |>
  ggplot() +
  geom_boxplot(mapping = aes(x = Country.Short.Name, y = Overweight))
e

## Warning: Removed 8 rows containing non-finite values (`stat_boxplot()`).

The above graphs show the progression of the key statistics Underweight, Overweight and Stunting in the countries where there was high wasting average. We can see that these countries have improved, but maybe not as much as the low Wasting average countries. We’ll see these graphs below.

countries_1 <- c('Australia', 'Chile', 'United Kingdom of Great Britain and Northern Ireland', 'Guatemala', 'Nicaragua', 'Peru', 'Portugal', 'United States of America')

f <- datas |>
  filter(Country.Short.Name == countries_1) |>
  ggplot() +
  geom_boxplot(mapping = aes(x = Country.Short.Name, y = Overweight))

## Warning: There was 1 warning in `filter()`.
## ℹ In argument: `Country.Short.Name == countries_1`.
## Caused by warning in `Country.Short.Name == countries_1`:
## ! longer object length is not a multiple of shorter object length

## Warning: Removed 11 rows containing non-finite values (`stat_boxplot()`).

g <- datas |>
  filter(Country.Short.Name == countries_1) |>
  ggplot() +
  geom_boxplot(mapping = aes(x = Country.Short.Name, y = Underweight))

## Warning: There was 1 warning in `filter()`.
## ℹ In argument: `Country.Short.Name == countries_1`.
## Caused by warning in `Country.Short.Name == countries_1`:
## ! longer object length is not a multiple of shorter object length

## Warning: Removed 14 rows containing non-finite values (`stat_boxplot()`).

h <- datas |>
  filter(Country.Short.Name == countries_1) |>
  ggplot() +
  geom_boxplot(mapping = aes(x = Country.Short.Name, y = Stunting))

## Warning: There was 1 warning in `filter()`.
## ℹ In argument: `Country.Short.Name == countries_1`.
## Caused by warning in `Country.Short.Name == countries_1`:
## ! longer object length is not a multiple of shorter object length

## Warning: Removed 11 rows containing non-finite values (`stat_boxplot()`).

Now that we have looked at different graphs, let’s go for some Hypothesis testing.

1st Null Hypothesis - The value of Severe Wasting is always more than the corresponding value of Wasting

datas$Severe.wasting = as.numeric(datas$Severe.wasting)

## Warning: NAs introduced by coercion

datas$difference <- datas$Wasting - datas$Severe.wasting
x <- mean(datas$Wasting, na.rm = TRUE)
y <- mean(datas$Severe.wasting, na.rm = TRUE)
z <- sd(datas$Wasting, na.rm = TRUE)
i <- sd(datas$Severe.wasting, na.rm = TRUE)
x - y

## [1] 4.881972

Choosing an alpha-value of 0.10 Choosing power-level as 0.95

x1 <- length(datas$Wasting)
y1 <- length(datas$Severe.wasting)
s_pooled <- sqrt(
  (((x1-1)*z^2) + ((y1-1)*i^2))/
    ((x1-1)+(y1-1)))
cohen_d <- (x - y)/s_pooled
cohen_d

## [1] 1.099829

The above value represents the minimum effect size for the chosen Null Hypothesis

observed_diff <- x - y
paste("Observed Difference: ", observed_diff)

## [1] "Observed Difference:  4.88197248917647"

bootstrap <- function (x, func=mean, n_iter=10^4) {
  # empty vector to be filled with values from each iteration
  func_values <- c(NULL)
  
  # we simulate sampling `n_iter` times
  for (i in 1:n_iter) {
    # pull the sample (e.g., a vector or data frame)
    x_sample <- sample_n(x, size = length(x), replace = TRUE)
    
    # add on this iteration's value to the collection
    func_values <- c(func_values, func(x_sample))
  }
  
  return(func_values)
}

diff_in_avg <- function (x_data) {
  diff <- mean(x_data$Wasting, na.rm = TRUE) - mean(x_data$Severe.wasting, na.rm = TRUE)
  
  return(diff)
}

diffs_in_avgs <- bootstrap(datas, diff_in_avg, n_iter = 100)

ggplot() +
  geom_function(xlim = c(-5, 5), 
                fun = function(x) dnorm(x, mean = 0, 
                                        sd = sd(diffs_in_avgs))) +
  geom_vline(mapping = aes(xintercept = x - y,
                           color = paste("observed: ",
                                         round(x - y)))) +
  labs(title = "Bootstrapped Sampling Distribution of Value Differences",
       x = "Difference in Value",
       y = "Probability Density",
       color = "") +
  scale_x_continuous(breaks = seq(-5, 5, 1)) +
  theme_minimal()

Alternative Hypothesis - The mean of Wasting > The mean of Severe.wasting

critical_value <- 2.5
delta <- 2

f_0 <- function(x) dnorm(x, mean = 0)
f_a <- function(x) dnorm(x, mean = delta)

ggplot() +
  stat_function(mapping = aes(fill = 'power'),
                fun = f_a, 
                xlim = c(critical_value, 4),
                geom = "area") +
    stat_function(mapping = aes(fill = 'alpha'),
                fun = f_0, 
                xlim = c(critical_value, 4),
                geom = "area") +
  geom_function(mapping = aes(color = 'Null Hypothesis'),
                xlim = c(-4, 4), fun = f_0) +
  geom_function(mapping = aes(color = 'Alternative Hypothesis'),
                xlim = c(-4, 4), fun = f_a) +
  geom_vline(mapping = aes(xintercept = critical_value,
                           color = "Critical Value")) +
  geom_vline(mapping = aes(xintercept = delta,
                           color = "Delta")) +
  geom_vline(mapping = aes(xintercept = 0),
             color = 'gray', linetype=2) +
  labs(title = "One-Tailed Test",
       x = "Mean Value",
       y = "Probability Density",
       color = "",
       fill = "") +
  scale_x_continuous(breaks = seq(-4, 4, 1)) +
  scale_fill_manual(values = c('lightblue', 'pink')) +
  scale_color_manual(values = c('darkred', 'darkorange', 'darkblue', 
                                'darkgreen')) +
  theme_minimal()

f_sampling <- function(x) dnorm(x, mean = 0, 
                                sd = sd(diffs_in_avgs))

ggplot() +
  stat_function(mapping = aes(fill = 'more extreme samples'),
                fun = f_sampling, 
                xlim = c(observed_diff, 5),
                geom = "area") +
  geom_function(xlim = c(-5, 5), 
                fun = f_sampling) +
  geom_vline(mapping = aes(xintercept = observed_diff,
                           color = paste("observed: ",
                                         round(observed_diff, 1)))) +
  labs(title = "Bootstrapped Sampling Distribution of Value Difference",
       x = "Difference in Value",
       y = "Probability Density",
       color = "",
       fill = "") +
  scale_x_continuous(breaks = seq(-5, 5, 1)) +
  scale_fill_manual(values = 'lightblue') +
  theme_minimal()

diffs_in_avgs_d <- diffs_in_avgs - mean(diffs_in_avgs)

paste("p-value ", 
      sum(observed_diff < diffs_in_avgs_d) /
        length(diffs_in_avgs_d))

## [1] "p-value  0"

A p-value of 0 tells us that the Alternate Hypothesis is correct, and that we can safely reject our NULL hypothesis.

Now, we move on to fitting our data to a regression model. Independent Variables - Sex, Urban/Rural, Wasting, Stunting, Overweight, Underweight The target/dependent variable - Selected for JME (Y/N) The chosen regression model will be a poisson regression model, as we have our target variable as Selected for JME or not - a binary variable.

unique(datas$JME..Y.N.)

## [1] "Selected for JME"     "Not selected for JME" ""

datas <- datas |>
  mutate_at("JME..Y.N.", ~na_if(., ''))
datas <- datas |>
  drop_na(JME..Y.N.)

unique(datas$JME..Y.N.)

## [1] "Selected for JME"     "Not selected for JME"

datas$JME <- ifelse(datas$JME..Y.N. == "Selected for JME",1,0)
model <- glm(JME ~ Sex + Urban.Rural + Wasting + Stunting + Overweight + Underweight, data = datas,
             family = binomial(link = "logit"))

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

model$coefficients

##              (Intercept)      SexNUTRITION_FEMALE        SexNUTRITION_MALE 
##             -2.854402241            -17.495768402            -17.463854946 
## Urban.RuralNUTRITION_RUR Urban.RuralNUTRITION_URB                  Wasting 
##            -17.081336018            -17.193047054             -0.036317700 
##                 Stunting               Overweight              Underweight 
##             -0.011183001              0.008985224              0.021429593

This means that for every unit change in each of the variables, the odds of being selected for JME goes up in different amounts:

Wasting - \(e^{-0.036317700} = 0.9643\)

Overweight - \(e^{0.008985224} = 1.01\)

Stunting - \(e^{-0.011183001} = 0.989\)

Underweight - \(e^{0.021429593} = 1.022\)

For the Categorical Variables, there is no unit change, but the intercepts are:

Female Sex - \(-2.854402241\)

Male Sex - \(-17.495768402\)

Urban Environment - \(-17.463854946\)

plot(model)

The above plots show the graphs for the Logistic model, if it were treated as a plain linear model

**The below model will be a linear regression model, showing the correlation between the Weights and Heights.

lr <- lm(Stunting ~ Underweight + Overweight, data = datas)
summary(lr)

## 
## Call:
## lm(formula = Stunting ~ Underweight + Overweight, data = datas)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.395  -6.832  -1.398   5.460  57.625 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.556591   0.124701   60.60   <2e-16 ***
## Underweight 1.140329   0.004391  259.68   <2e-16 ***
## Overweight  0.403705   0.011068   36.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.499 on 37020 degrees of freedom
##   (2305 observations deleted due to missingness)
## Multiple R-squared:  0.675,  Adjusted R-squared:  0.675 
## F-statistic: 3.845e+04 on 2 and 37020 DF,  p-value: < 2.2e-16

The above model summary shows that the relation between Height and weight is that, for an unit change in Underweight, Stunting goes up by 1.14. But for an unit change in Overweight, Stunting goes up by only 0.403.

lr$coefficients

## (Intercept) Underweight  Overweight 
##   7.5565915   1.1403292   0.4037054

Final Project - Karthik Balasubramanian

2023-11-28

R Markdown