This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.2
## Warning: package 'ggplot2' was built under R version 4.3.2
## Warning: package 'tibble' was built under R version 4.3.2
## Warning: package 'tidyr' was built under R version 4.3.2
## Warning: package 'readr' was built under R version 4.3.2
## Warning: package 'purrr' was built under R version 4.3.2
## Warning: package 'dplyr' was built under R version 4.3.2
## Warning: package 'stringr' was built under R version 4.3.2
## Warning: package 'forcats' was built under R version 4.3.2
## Warning: package 'lubridate' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.3.2
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 4.3.2
library(AmesHousing)
## Warning: package 'AmesHousing' was built under R version 4.3.2
library(boot)
library(broom)
## Warning: package 'broom' was built under R version 4.3.2
library(lindia)
## Warning: package 'lindia' was built under R version 4.3.2
datas <- read.csv("C:\\Users\\karth\\Downloads\\Child Growth and Malnutrition.csv")
view(datas)
datas$Wasting = as.numeric(datas$Wasting)
## Warning: NAs introduced by coercion
datas$Overweight = as.numeric(datas$Overweight)
## Warning: NAs introduced by coercion
summary(datas)
## Country.ISO.3.Code Country.Short.Name Year.period Median.Year
## Length:39519 Length:39519 Length:39519 Length:39519
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Start.Month End.Month WHO.Reference.number
## Length:39519 Length:39519 Length:39519
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Representation.level Region Other.group.Zone Age
## Length:39519 Length:39519 Length:39519 Length:39519
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Sex Mother.education Wealth.quintile Urban.Rural
## Length:39519 Length:39519 Length:39519 Length:39519
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Sample.size Severe.wasting Wasting Overweight
## Length:39519 Length:39519 Min. : 0.000 Min. : 0.000
## Class :character Class :character 1st Qu.: 2.535 1st Qu.: 2.116
## Mode :character Mode :character Median : 5.723 Median : 4.600
## Mean : 7.044 Mean : 5.905
## 3rd Qu.:10.000 3rd Qu.: 8.290
## Max. :58.736 Max. :72.000
## NA's :1528 NA's :2147
## Stunting Underweight Short.Source.Code Reference.Title
## Min. : 0.00 Min. : 0.00 Length:39519 Length:39519
## 1st Qu.:14.27 1st Qu.: 5.27 Class :character Class :character
## Median :26.40 Median :13.43 Mode :character Mode :character
## Mean :27.82 Mean :15.81
## 3rd Qu.:38.93 3rd Qu.:22.89
## Max. :95.90 Max. :78.70
## NA's :1462 NA's :1083
## Author Notes JME..Y.N.
## Length:39519 Length:39519 Length:39519
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
a <- datas |>
group_by(Country.Short.Name, Median.Year) |>
filter(mean(Wasting) > 20.000) |>
mutate(average.of.Wasting = mean(Wasting))
unique(a$Country.Short.Name)
## [1] "Djibouti"
## [2] "India"
## [3] "Maldives"
## [4] "Mali"
## [5] "Niger"
## [6] "Nigeria"
## [7] "Pakistan"
## [8] "Democratic People's Republic of Korea"
## [9] "South Sudan"
a |> ggplot() +
geom_histogram(mapping = aes(x = average.of.Wasting), binwidth = 0.5)
b <- datas |>
group_by(Country.Short.Name, Median.Year) |>
filter(mean(Wasting) < 0.500) |>
mutate(average.of.Wasting = mean(Wasting))
unique(b$Country.Short.Name)
## [1] "Australia"
## [2] "Chile"
## [3] "United Kingdom of Great Britain and Northern Ireland"
## [4] "Guatemala"
## [5] "Nicaragua"
## [6] "Peru"
## [7] "Portugal"
## [8] "United States of America"
b |> ggplot() +
geom_histogram(mapping = aes(x = average.of.Wasting), binwidth = 0.5)
The above 2 graphs represent the countries with the highest and
lowest average Wasting. here, Wasting represents the number of
deviations from world mean for that particular metric. Higher deviation
means more Wasting among the children, and vice-versa. So from this, we
have a basic idea of which countries need to improve their
standards.
countries <- c('India', 'Maldives', 'Djibouti', 'Mali', 'Niger', 'Nigeria', 'Pakistan', "Democratic People's Republic of Korea", 'South Sudan')
c <- datas |>
filter(Country.Short.Name == countries) |>
ggplot() +
geom_boxplot(mapping = aes(x = Country.Short.Name, y = Stunting))
c
d <- datas |>
filter(Country.Short.Name == countries) |>
ggplot() +
geom_boxplot(mapping = aes(x = Country.Short.Name, y = Underweight))
d
## Warning: Removed 4 rows containing non-finite values (`stat_boxplot()`).
e <- datas |>
filter(Country.Short.Name == countries) |>
ggplot() +
geom_boxplot(mapping = aes(x = Country.Short.Name, y = Overweight))
e
## Warning: Removed 8 rows containing non-finite values (`stat_boxplot()`).
The above graphs show the progression of the key statistics
Underweight, Overweight and Stunting in the countries where there was
high wasting average. We can see that these countries have improved, but
maybe not as much as the low Wasting average countries. We’ll see these
graphs below.
countries_1 <- c('Australia', 'Chile', 'United Kingdom of Great Britain and Northern Ireland', 'Guatemala', 'Nicaragua', 'Peru', 'Portugal', 'United States of America')
f <- datas |>
filter(Country.Short.Name == countries_1) |>
ggplot() +
geom_boxplot(mapping = aes(x = Country.Short.Name, y = Overweight))
## Warning: There was 1 warning in `filter()`.
## ℹ In argument: `Country.Short.Name == countries_1`.
## Caused by warning in `Country.Short.Name == countries_1`:
## ! longer object length is not a multiple of shorter object length
f
## Warning: Removed 11 rows containing non-finite values (`stat_boxplot()`).
g <- datas |>
filter(Country.Short.Name == countries_1) |>
ggplot() +
geom_boxplot(mapping = aes(x = Country.Short.Name, y = Underweight))
## Warning: There was 1 warning in `filter()`.
## ℹ In argument: `Country.Short.Name == countries_1`.
## Caused by warning in `Country.Short.Name == countries_1`:
## ! longer object length is not a multiple of shorter object length
g
## Warning: Removed 14 rows containing non-finite values (`stat_boxplot()`).
h <- datas |>
filter(Country.Short.Name == countries_1) |>
ggplot() +
geom_boxplot(mapping = aes(x = Country.Short.Name, y = Stunting))
## Warning: There was 1 warning in `filter()`.
## ℹ In argument: `Country.Short.Name == countries_1`.
## Caused by warning in `Country.Short.Name == countries_1`:
## ! longer object length is not a multiple of shorter object length
h
## Warning: Removed 11 rows containing non-finite values (`stat_boxplot()`).
Now that we have looked at different graphs, let’s go for some
Hypothesis testing.
1st Null Hypothesis - The value of Severe Wasting is always more than the corresponding value of Wasting
datas$Severe.wasting = as.numeric(datas$Severe.wasting)
## Warning: NAs introduced by coercion
datas$difference <- datas$Wasting - datas$Severe.wasting
x <- mean(datas$Wasting, na.rm = TRUE)
y <- mean(datas$Severe.wasting, na.rm = TRUE)
z <- sd(datas$Wasting, na.rm = TRUE)
i <- sd(datas$Severe.wasting, na.rm = TRUE)
x - y
## [1] 4.881972
Choosing an alpha-value of 0.10 Choosing power-level as 0.95
x1 <- length(datas$Wasting)
y1 <- length(datas$Severe.wasting)
s_pooled <- sqrt(
(((x1-1)*z^2) + ((y1-1)*i^2))/
((x1-1)+(y1-1)))
cohen_d <- (x - y)/s_pooled
cohen_d
## [1] 1.099829
The above value represents the minimum effect size for the chosen Null Hypothesis
observed_diff <- x - y
paste("Observed Difference: ", observed_diff)
## [1] "Observed Difference: 4.88197248917647"
bootstrap <- function (x, func=mean, n_iter=10^4) {
# empty vector to be filled with values from each iteration
func_values <- c(NULL)
# we simulate sampling `n_iter` times
for (i in 1:n_iter) {
# pull the sample (e.g., a vector or data frame)
x_sample <- sample_n(x, size = length(x), replace = TRUE)
# add on this iteration's value to the collection
func_values <- c(func_values, func(x_sample))
}
return(func_values)
}
diff_in_avg <- function (x_data) {
diff <- mean(x_data$Wasting, na.rm = TRUE) - mean(x_data$Severe.wasting, na.rm = TRUE)
return(diff)
}
diffs_in_avgs <- bootstrap(datas, diff_in_avg, n_iter = 100)
ggplot() +
geom_function(xlim = c(-5, 5),
fun = function(x) dnorm(x, mean = 0,
sd = sd(diffs_in_avgs))) +
geom_vline(mapping = aes(xintercept = x - y,
color = paste("observed: ",
round(x - y)))) +
labs(title = "Bootstrapped Sampling Distribution of Value Differences",
x = "Difference in Value",
y = "Probability Density",
color = "") +
scale_x_continuous(breaks = seq(-5, 5, 1)) +
theme_minimal()
Alternative Hypothesis - The mean of Wasting > The mean of
Severe.wasting
critical_value <- 2.5
delta <- 2
f_0 <- function(x) dnorm(x, mean = 0)
f_a <- function(x) dnorm(x, mean = delta)
ggplot() +
stat_function(mapping = aes(fill = 'power'),
fun = f_a,
xlim = c(critical_value, 4),
geom = "area") +
stat_function(mapping = aes(fill = 'alpha'),
fun = f_0,
xlim = c(critical_value, 4),
geom = "area") +
geom_function(mapping = aes(color = 'Null Hypothesis'),
xlim = c(-4, 4), fun = f_0) +
geom_function(mapping = aes(color = 'Alternative Hypothesis'),
xlim = c(-4, 4), fun = f_a) +
geom_vline(mapping = aes(xintercept = critical_value,
color = "Critical Value")) +
geom_vline(mapping = aes(xintercept = delta,
color = "Delta")) +
geom_vline(mapping = aes(xintercept = 0),
color = 'gray', linetype=2) +
labs(title = "One-Tailed Test",
x = "Mean Value",
y = "Probability Density",
color = "",
fill = "") +
scale_x_continuous(breaks = seq(-4, 4, 1)) +
scale_fill_manual(values = c('lightblue', 'pink')) +
scale_color_manual(values = c('darkred', 'darkorange', 'darkblue',
'darkgreen')) +
theme_minimal()
f_sampling <- function(x) dnorm(x, mean = 0,
sd = sd(diffs_in_avgs))
ggplot() +
stat_function(mapping = aes(fill = 'more extreme samples'),
fun = f_sampling,
xlim = c(observed_diff, 5),
geom = "area") +
geom_function(xlim = c(-5, 5),
fun = f_sampling) +
geom_vline(mapping = aes(xintercept = observed_diff,
color = paste("observed: ",
round(observed_diff, 1)))) +
labs(title = "Bootstrapped Sampling Distribution of Value Difference",
x = "Difference in Value",
y = "Probability Density",
color = "",
fill = "") +
scale_x_continuous(breaks = seq(-5, 5, 1)) +
scale_fill_manual(values = 'lightblue') +
theme_minimal()
diffs_in_avgs_d <- diffs_in_avgs - mean(diffs_in_avgs)
paste("p-value ",
sum(observed_diff < diffs_in_avgs_d) /
length(diffs_in_avgs_d))
## [1] "p-value 0"
A p-value of 0 tells us that the Alternate Hypothesis is correct, and that we can safely reject our NULL hypothesis.
Now, we move on to fitting our data to a regression model. Independent Variables - Sex, Urban/Rural, Wasting, Stunting, Overweight, Underweight The target/dependent variable - Selected for JME (Y/N) The chosen regression model will be a poisson regression model, as we have our target variable as Selected for JME or not - a binary variable.
unique(datas$JME..Y.N.)
## [1] "Selected for JME" "Not selected for JME" ""
datas <- datas |>
mutate_at("JME..Y.N.", ~na_if(., ''))
datas <- datas |>
drop_na(JME..Y.N.)
unique(datas$JME..Y.N.)
## [1] "Selected for JME" "Not selected for JME"
datas$JME <- ifelse(datas$JME..Y.N. == "Selected for JME",1,0)
model <- glm(JME ~ Sex + Urban.Rural + Wasting + Stunting + Overweight + Underweight, data = datas,
family = binomial(link = "logit"))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
model$coefficients
## (Intercept) SexNUTRITION_FEMALE SexNUTRITION_MALE
## -2.854402241 -17.495768402 -17.463854946
## Urban.RuralNUTRITION_RUR Urban.RuralNUTRITION_URB Wasting
## -17.081336018 -17.193047054 -0.036317700
## Stunting Overweight Underweight
## -0.011183001 0.008985224 0.021429593
This means that for every unit change in each of the variables, the odds of being selected for JME goes up in different amounts:
Wasting - \(e^{-0.036317700} = 0.9643\)
Overweight - \(e^{0.008985224} = 1.01\)
Stunting - \(e^{-0.011183001} = 0.989\)
Underweight - \(e^{0.021429593} = 1.022\)
For the Categorical Variables, there is no unit change, but the intercepts are:
Female Sex - \(-2.854402241\)
Male Sex - \(-17.495768402\)
Urban Environment - \(-17.463854946\)
plot(model)
The above plots show the graphs for the Logistic model, if it were treated as a plain linear model
**The below model will be a linear regression model, showing the correlation between the Weights and Heights.
lr <- lm(Stunting ~ Underweight + Overweight, data = datas)
summary(lr)
##
## Call:
## lm(formula = Stunting ~ Underweight + Overweight, data = datas)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.395 -6.832 -1.398 5.460 57.625
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.556591 0.124701 60.60 <2e-16 ***
## Underweight 1.140329 0.004391 259.68 <2e-16 ***
## Overweight 0.403705 0.011068 36.48 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.499 on 37020 degrees of freedom
## (2305 observations deleted due to missingness)
## Multiple R-squared: 0.675, Adjusted R-squared: 0.675
## F-statistic: 3.845e+04 on 2 and 37020 DF, p-value: < 2.2e-16
The above model summary shows that the relation between Height and weight is that, for an unit change in Underweight, Stunting goes up by 1.14. But for an unit change in Overweight, Stunting goes up by only 0.403.
lr$coefficients
## (Intercept) Underweight Overweight
## 7.5565915 1.1403292 0.4037054