library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(boot)
library(broom)
library(lindia)
datas <- read.csv("C:\\Users\\karth\\Downloads\\Child Growth and Malnutrition.csv")
view(datas)
Response variable: Stunting. Categorical explanatory variables: WHO.Reference.number, Urban.Rural, Sex and Age.
The null hypothesis: for each explanatory variable (WHO.Reference.number, Urban.Rural, Sex and Age), the mean of Stunting is the same across all of its categories.
m <- aov(Stunting ~ WHO.Reference.number + Urban.Rural + Sex + Age, data = datas)
summary(m)
## Df Sum Sq Mean Sq F value Pr(>F)
## WHO.Reference.number 1174 7946179 6768 150.5 <2e-16 ***
## Urban.Rural 2 31591 15796 351.2 <2e-16 ***
## Sex 2 47542 23771 528.5 <2e-16 ***
## Age 37 796770 21534 478.8 <2e-16 ***
## Residuals 36841 1656905 45
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1462 observations deleted due to missingness
We see that the p-value is very small (< 2e-16) for every explanatory variable, so we can conclude that mean Stunting differs significantly between the categories of each categorical variable. This argument is further strengthened by the F-values, which are very large for all explanatory variables, in the hundreds (150.5 to 528.5). All of this points to a disparity in means.
With ANOVA, we are able to reject the null hypothesis.
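To make the F-value in the table above less of a black box, here is a minimal sketch that computes a one-way ANOVA F statistic by hand and compares it with `aov()`. It uses simulated data, not the malnutrition dataset; the names `sim`, `group` and `y` are made up for illustration.

```r
set.seed(1)

# Simulated data (hypothetical): three groups with different true means
sim <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 30)),
  y     = c(rnorm(30, mean = 10), rnorm(30, mean = 12), rnorm(30, mean = 15))
)

fit <- aov(y ~ group, data = sim)

# F statistic by hand: between-group mean square / within-group mean square
grand_mean  <- mean(sim$y)
group_means <- tapply(sim$y, sim$group, mean)
group_n     <- tapply(sim$y, sim$group, length)

ss_between <- sum(group_n * (group_means - grand_mean)^2)
ss_within  <- sum((sim$y - group_means[as.character(sim$group)])^2)

df_between <- nlevels(sim$group) - 1
df_within  <- nrow(sim) - nlevels(sim$group)

f_by_hand  <- (ss_between / df_between) / (ss_within / df_within)

# F statistic as reported by summary(aov)
f_from_aov <- summary(fit)[[1]][["F value"]][1]

c(by_hand = f_by_hand, from_aov = f_from_aov)
```

A large F means the variation between group means is large compared with the variation inside the groups, which is the pattern seen for every term in the table above.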
The continuous explanatory variable: Underweight
lr <- lm(Stunting ~ Underweight, data = datas)
summary(lr)
##
## Call:
## lm(formula = Stunting ~ Underweight, data = datas)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.744 -6.878 -1.448 5.612 70.972
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.150128 0.078836 141.4 <2e-16 ***
## Underweight 1.062900 0.003907 272.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.644 on 37788 degrees of freedom
## (1729 observations deleted due to missingness)
## Multiple R-squared: 0.662, Adjusted R-squared: 0.662
## F-statistic: 7.401e+04 on 1 and 37788 DF, p-value: < 2.2e-16
summary(lr)$r.squared
## [1] 0.6620034
The R-squared of about 0.66 tells us that Underweight explains roughly 66% of the variance in Stunting, so the model does a reasonably good job of fitting one variable to the other.
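As a rough illustration of where this number comes from, the sketch below recomputes R-squared by hand as one minus the ratio of residual to total sum of squares. The data is simulated (the coefficients 11 and 1.06 and the noise level are assumptions chosen only to resemble the fit above), so the resulting value will not match 0.66 exactly.

```r
set.seed(42)

# Simulated predictor and response (hypothetical values)
x <- runif(200, min = 0, max = 50)
y <- 11 + 1.06 * x + rnorm(200, sd = 9)

fit <- lm(y ~ x)

# R-squared = 1 - SS_residual / SS_total
ss_res     <- sum(residuals(fit)^2)
ss_tot     <- sum((y - mean(y))^2)
r2_by_hand <- 1 - ss_res / ss_tot

# Value reported by summary()
r2_from_lm <- summary(fit)$r.squared

# With a single predictor, R-squared is also the squared correlation
r2_from_cor <- cor(x, y)^2

c(by_hand = r2_by_hand, from_lm = r2_from_lm, from_cor = r2_from_cor)
```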
lr$coefficients
## (Intercept) Underweight
## 11.15013 1.06290
beta_1 <- 1.062900
beta_0 <- 11.150128
datas_1 <- datas |> select(Stunting, Underweight)
datas_2 <- sample_n(datas_1, 20)
lm_ <- \(x) beta_1 * x + beta_0
datas_2 |>
  ggplot() +
  geom_point(mapping = aes(x = Underweight, y = Stunting), size = 2) +
  geom_smooth(mapping = aes(x = Underweight, y = Stunting),
              method = "lm", se = FALSE, color = 'red', linewidth = 0.5) +
  geom_rect(mapping = aes(xmin = Underweight,
                          xmax = Underweight + abs(Stunting - lm_(Underweight)),
                          ymin = lm_(Underweight),
                          ymax = Stunting),
            fill = NA, color = 'darkblue') +
  labs(color = '') +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 1 rows containing missing values (`geom_point()`).
## Warning: Removed 1 rows containing missing values (`geom_rect()`).
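The plot above shows the squared error for 20 points sampled from the original dataset: each rectangle's side length is the residual for that point, measured against the line fitted on the full dataset. The R-squared of the model on the full dataset was 0.66.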
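The rectangles in the plot can also be expressed as numbers. The following is a minimal sketch on simulated data (the data frame `sim` and its columns `x` and `y` are hypothetical) that fits a line on the full data, draws 20 points, and computes each point's residual and squared error against that line. It reads the coefficients with `coef()` rather than typing them in by hand, which avoids rounding.

```r
set.seed(10)

# Simulated stand-in for the full dataset (hypothetical values)
sim   <- data.frame(x = runif(500, min = 0, max = 50))
sim$y <- 11 + 1.06 * sim$x + rnorm(500, sd = 9)

# Fit on the full data and pull the coefficients programmatically
fit     <- lm(y ~ x, data = sim)
b       <- coef(fit)
line_fn <- function(x) b[["(Intercept)"]] + b[["x"]] * x

# Take 20 points and measure them against the full-data line
pts               <- sim[sample(nrow(sim), 20), ]
pts$residual      <- pts$y - line_fn(pts$x)
pts$squared_error <- pts$residual^2

# Total squared error for the 20 sampled points
sum(pts$squared_error)
```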
The coefficients of the linear regression model tell us that Stunting moves in the same direction as Underweight, at approximately the same rate (slope of about 1.06). There is also an intercept of about 11.15, which is the value of Stunting the model predicts when Underweight is 0. This tells us that height is generally more than weight for any child, not that a person with 0 weight will have some height.
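To see what the slope and intercept mean in practice, here is a small worked sketch that plugs a few values into the fitted line. The coefficients are the ones reported above; the Underweight values 0, 10 and 20 are arbitrary inputs chosen for illustration.

```r
# Coefficients as reported by summary(lr) above
beta_0 <- 11.150128
beta_1 <- 1.062900

# Fitted line: predicted Stunting for a given Underweight value
predict_stunting <- function(underweight) beta_0 + beta_1 * underweight

# Arbitrary example inputs (assumed, not taken from the data)
underweight_values <- c(0, 10, 20)
predicted <- predict_stunting(underweight_values)

data.frame(Underweight = underweight_values, Predicted_Stunting = predicted)

# Each 10-unit step in Underweight adds 10 * beta_1 to the prediction
diff(predicted)
```

At an input of 0 the prediction is just the intercept, and every additional unit of Underweight adds the slope to the predicted Stunting.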
We are now going to add the Sex variable to the regression model.
lr1 <- lm(Stunting ~ Underweight + Sex, data = datas)
summary(lr1)
##
## Call:
## lm(formula = Stunting ~ Underweight + Sex, data = datas)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.953 -6.883 -1.444 5.610 70.693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.445430 0.088715 129.014 <2e-16 ***
## Underweight 1.060845 0.003906 271.602 <2e-16 ***
## SexNUTRITION_FEMALE -1.361237 0.128887 -10.562 <2e-16 ***
## SexNUTRITION_MALE 0.015715 0.128709 0.122 0.903
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.63 on 37786 degrees of freedom
## (1729 observations deleted due to missingness)
## Multiple R-squared: 0.6631, Adjusted R-squared: 0.663
## F-statistic: 2.479e+04 on 3 and 37786 DF, p-value: < 2.2e-16
The p-value is very low for two of the terms (Underweight and SexNUTRITION_FEMALE), but for SexNUTRITION_MALE it is very high (0.903). This means we cannot conclude how this particular term affects Stunting.
Taken at face value, this is hard to interpret: we would expect the sex of a person to play some part in their height, yet according to this data only the female category shows a significant effect. Note that R measures each Sex coefficient relative to the reference category of Sex (the level that does not appear in the table), so the high p-value only says that males cannot be distinguished from that reference category, not that males can have any height at any weight. It may also be that the recorded data is not proper, or that a linear regression does not work for this combination of variables, in which case we need to test some other models for the fit.
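Because coefficients for a categorical variable are always read against a reference level, changing that level changes which comparisons the table shows. The sketch below illustrates this with `relevel()` on simulated data. The level names `BOTH`, `FEMALE` and `MALE`, the effect sizes, and the data frame `sim` are all assumptions for illustration and are not taken from the dataset.

```r
set.seed(7)

# Simulated data with a three-level factor (hypothetical level names)
sim <- data.frame(
  sex = factor(sample(c("BOTH", "FEMALE", "MALE"), 300, replace = TRUE)),
  x   = runif(300, min = 0, max = 50)
)
# Assumed effect: only FEMALE differs from the other two levels
sim$y <- 11 + 1.06 * sim$x - 1.4 * (sim$sex == "FEMALE") + rnorm(300, sd = 3)

# Default: the first level alphabetically ("BOTH") is the reference
fit_default <- lm(y ~ x + sex, data = sim)

# Same model, but with FEMALE as the reference level
sim$sex_f     <- relevel(sim$sex, ref = "FEMALE")
fit_releveled <- lm(y ~ x + sex_f, data = sim)

coef(fit_default)    # sexFEMALE and sexMALE are both measured against BOTH
coef(fit_releveled)  # sex_fBOTH and sex_fMALE are measured against FEMALE
```

The two fits describe the same model and give the same fitted values; only the comparisons reported in the coefficient table change. Here the MALE coefficient in the releveled fit equals the difference between the MALE and FEMALE coefficients of the default fit.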
summary(lr1)$r.squared
## [1] 0.6630662
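Not a bad fit, considering one term has no significant effect. But the R-squared only rises from 0.6620 to 0.6631 compared with the model without Sex, so adding Sex explains almost no additional variance in Stunting.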
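One common way to judge whether an added variable is worth keeping is to compare the two nested models directly. The following is a minimal sketch on simulated data, where the extra factor `g` is constructed to have no real effect; the names `sim`, `small` and `large` are hypothetical. The same pattern, `anova(lr, lr1)`, could be tried on the two models above, assuming both were fitted on the same rows.

```r
set.seed(3)
n <- 300

# Simulated data: y depends on x only; g is pure noise (assumption)
sim <- data.frame(
  x = runif(n, min = 0, max = 50),
  g = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
sim$y <- 11 + 1.06 * sim$x + rnorm(n, sd = 9)

small <- lm(y ~ x, data = sim)      # model without the factor
large <- lm(y ~ x + g, data = sim)  # model with the factor added

# R-squared never decreases when a term is added, so look at the size of the gain
r2_gain <- summary(large)$r.squared - summary(small)$r.squared

# Adjusted R-squared penalises the extra parameters
adj_r2 <- c(small = summary(small)$adj.r.squared,
            large = summary(large)$adj.r.squared)

# F-test for the nested models: does g reduce the residual sum of squares
# by more than chance would suggest?
cmp <- anova(small, large)

r2_gain
adj_r2
cmp
```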
lr1$coefficients
## (Intercept) Underweight SexNUTRITION_FEMALE SexNUTRITION_MALE
## 11.44542962 1.06084538 -1.36123677 0.01571495