Week 8 Assignment - Karthik Balasubramanian


library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)
library(ggrepel)
library(boot)
library(broom)
library(lindia)

datas <- read.csv("C:\\Users\\karth\\Downloads\\Child Growth and Malnutrition.csv")
view(datas)

Response variable: Stunting

Categorical explanatory variables: WHO.Reference.number, Urban.Rural, Sex, and Age

Null hypothesis: for each explanatory variable (WHO.Reference.number, Urban.Rural, Sex, and Age), the mean of Stunting is the same across that variable's categories.

m <- aov(Stunting ~ WHO.Reference.number + Urban.Rural + Sex + Age, data = datas)
summary(m)

##                         Df  Sum Sq Mean Sq F value Pr(>F)    
## WHO.Reference.number  1174 7946179    6768   150.5 <2e-16 ***
## Urban.Rural              2   31591   15796   351.2 <2e-16 ***
## Sex                      2   47542   23771   528.5 <2e-16 ***
## Age                     37  796770   21534   478.8 <2e-16 ***
## Residuals            36841 1656905      45                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1462 observations deleted due to missingness

The p-value is below 2e-16 for every explanatory variable, so we conclude that the mean of Stunting differs significantly between the categories of each categorical variable. The F-values are correspondingly large (roughly 150 to 530), which points to the same disparity in means.

With ANOVA, we are able to reject the null hypothesis.
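
The link between the F-values and p-values in the ANOVA table can be checked by hand: each F-value is that row's mean square divided by the residual mean square (for WHO.Reference.number, 6768 / 45 is roughly 150), and the p-value is the upper tail of the F distribution at that ratio. The sketch below illustrates this on R's built-in PlantGrowth data, which stands in for the assignment data here; the names fit, tab, f_manual, and p_manual are only illustrative.

```r
# Illustration on R's built-in PlantGrowth data (not the assignment data).
fit <- aov(weight ~ group, data = PlantGrowth)
tab <- summary(fit)[[1]]   # ANOVA table as a data frame

# F is the between-group mean square divided by the residual mean square.
ms_group    <- tab[["Mean Sq"]][1]
ms_residual <- tab[["Mean Sq"]][2]
f_manual    <- ms_group / ms_residual

# The p-value is the upper tail of the F distribution at that ratio.
p_manual <- pf(f_manual,
               df1 = tab[["Df"]][1],
               df2 = tab[["Df"]][2],
               lower.tail = FALSE)

# Compare the hand-computed values with the ones aov() reports
c(manual = f_manual, reported = tab[["F value"]][1])
c(manual = p_manual, reported = tab[["Pr(>F)"]][1])
```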

Continuous explanatory variable: Underweight

lr <- lm(Stunting ~ Underweight, data = datas)
summary(lr)

## 
## Call:
## lm(formula = Stunting ~ Underweight, data = datas)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.744  -6.878  -1.448   5.612  70.972 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.150128   0.078836   141.4   <2e-16 ***
## Underweight  1.062900   0.003907   272.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.644 on 37788 degrees of freedom
##   (1729 observations deleted due to missingness)
## Multiple R-squared:  0.662,  Adjusted R-squared:  0.662 
## F-statistic: 7.401e+04 on 1 and 37788 DF,  p-value: < 2.2e-16

summary(lr)$r.squared

## [1] 0.6620034

An R-squared of about 0.66 means the model explains roughly 66% of the variance in Stunting, which is a reasonably good fit for a single predictor.
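
As a rough illustration of what this number measures, R-squared can be reproduced as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses R's built-in cars data as a stand-in, since the assignment CSV is not bundled here; fit_cars, ss_res, ss_tot, and r2_manual are illustrative names.

```r
# Stand-in example on the built-in cars data (dist ~ speed).
fit_cars <- lm(dist ~ speed, data = cars)

ss_res <- sum(residuals(fit_cars)^2)              # variation left unexplained
ss_tot <- sum((cars$dist - mean(cars$dist))^2)    # total variation in the response
r2_manual <- 1 - ss_res / ss_tot

# Compare the hand-computed value with the one summary() reports
c(manual = r2_manual, reported = summary(fit_cars)$r.squared)
```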

lr$coefficients

## (Intercept) Underweight 
##    11.15013     1.06290

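# Pull the fitted coefficients from the model instead of retyping them
beta_1 <- coef(lr)[["Underweight"]]   # slope
beta_0 <- coef(lr)[["(Intercept)"]]   # intercept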

datas_1 <- datas |> select(Stunting, Underweight)
datas_2 <- sample_n(datas_1, 20)

lm_ <- \(x) beta_1 * x + beta_0

datas_2 |>
  ggplot() +
  geom_point(mapping = aes(x = Underweight, y = Stunting), size = 2) +
  geom_smooth(mapping = aes(x = Underweight, y = Stunting), method = "lm", se = FALSE, color = 'red', linewidth = 0.5) + 
  geom_rect(mapping = aes(xmin = Underweight, 
                          xmax = Underweight + abs(Stunting - lm_(Underweight)),
                          ymin = lm_(Underweight), 
                          ymax = Stunting), 
            fill = NA, color = 'darkblue') +
  labs(color = '') +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 1 rows containing non-finite values (`stat_smooth()`).

## Warning: Removed 1 rows containing missing values (`geom_point()`).

## Warning: Removed 1 rows containing missing values (`geom_rect()`).

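The plot above visualizes the squared error for a random sample of 20 points from the original dataset: each square has a side equal to that point's residual from the full-data regression line. One sampled row contained a missing value and was dropped, as the warnings indicate. Note that the red line is fit by geom_smooth() to the sampled points only, so it can differ slightly from the full-data line used to draw the squares. The R-squared of 0.66 reported earlier is computed from all observations, not from this sample.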

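The coefficients of the linear regression model tell us that Stunting moves in the same direction as Underweight and at approximately the same rate: each one-unit increase in Underweight is associated with an increase of about 1.06 in Stunting. There is also an intercept of about 11.15, which is the predicted Stunting when Underweight is 0. In other words, Stunting values generally sit about 11 units above the corresponding Underweight values; the intercept should be read as this offset, not as a literal statement that a child with zero weight will have some height.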
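
To make the slope and intercept concrete, the fitted line can be evaluated at a few Underweight values. This is a minimal sketch that reuses the coefficients printed above; b0, b1, and predict_stunting are hypothetical names, not part of the original analysis.

```r
b0 <- 11.150128   # intercept reported by summary(lr)
b1 <- 1.062900    # slope reported by summary(lr)

# Predicted Stunting for a given Underweight value (hypothetical helper)
predict_stunting <- function(underweight) b0 + b1 * underweight

# Each extra 10 units of Underweight adds about 10.6 to predicted Stunting
predict_stunting(c(0, 10, 20))
```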

We are now going to add the Sex variable to the regression model.

lr1 <- lm(Stunting ~ Underweight + Sex, data = datas)
summary(lr1)

## 
## Call:
## lm(formula = Stunting ~ Underweight + Sex, data = datas)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.953  -6.883  -1.444   5.610  70.693 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         11.445430   0.088715 129.014   <2e-16 ***
## Underweight          1.060845   0.003906 271.602   <2e-16 ***
## SexNUTRITION_FEMALE -1.361237   0.128887 -10.562   <2e-16 ***
## SexNUTRITION_MALE    0.015715   0.128709   0.122    0.903    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.63 on 37786 degrees of freedom
##   (1729 observations deleted due to missingness)
## Multiple R-squared:  0.6631, Adjusted R-squared:  0.663 
## F-statistic: 2.479e+04 on 3 and 37786 DF,  p-value: < 2.2e-16

The p-values for Underweight and SexNUTRITION_FEMALE are very low, but the p-value for SexNUTRITION_MALE is high (0.903). Each Sex coefficient is measured relative to a reference level of Sex that R omits from the coefficient table, so this means we cannot conclude that the male level differs from that reference level in its effect on Stunting.

This result is hard to interpret at face value, since we would expect the sex of a child to play some part in height. According to this model, only the female level shows a significant shift (about 1.36 lower than the reference level at the same Underweight value), while the male level is indistinguishable from the reference. That does not by itself mean that sex has no effect for males. It may instead indicate that the data are not recorded consistently across the Sex categories, or that a linear regression does not suit this combination of variables, so we need to test other models for fit.
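
Because R encodes a factor with treatment contrasts by default, each level's coefficient is an offset from a reference level that does not appear in the output. The sketch below shows this on the built-in iris data (a stand-in, not the assignment data) and uses relevel() to change the reference; fit_a, fit_b, and iris2 are illustrative names.

```r
# Default: the first factor level ("setosa") is the reference and gets no row.
fit_a <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)
names(coef(fit_a))

# Make "versicolor" the reference instead; now "setosa" gets a coefficient.
iris2 <- iris
iris2$Species <- relevel(iris2$Species, ref = "versicolor")
fit_b <- lm(Sepal.Length ~ Petal.Length + Species, data = iris2)
names(coef(fit_b))

# The same contrast seen from the other side: only the sign flips.
coef(fit_a)[["Speciesversicolor"]]
coef(fit_b)[["Speciessetosa"]]
```

Applied to the model above, the same idea would mean converting Sex to a factor and calling relevel() on it so that a chosen level becomes the baseline; the exact level names in the CSV would need to be checked first.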

summary(lr1)$r.squared

## [1] 0.6630662

This is not a bad fit, but R-squared rose only from 0.6620 to 0.6631 after adding Sex, so the extra variable explains almost no additional variance in Stunting.
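
One way to judge whether an added variable is worth keeping is a nested-model F-test with anova(), which compares the smaller and larger fits directly. The sketch below uses the built-in iris data as a stand-in; with the assignment data, the analogous call would be anova(lr, lr1), assuming both models were fit on the same rows. The names fit_small, fit_big, r2_gain, and cmp are illustrative.

```r
# Smaller model, then the same model with a factor added
fit_small <- lm(Sepal.Length ~ Petal.Length, data = iris)
fit_big   <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)

# How much did R-squared change when the factor was added?
r2_gain <- summary(fit_big)$r.squared - summary(fit_small)$r.squared
r2_gain

# F-test for whether the extra terms reduce the residual sum of squares
cmp <- anova(fit_small, fit_big)
cmp
```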

lr1$coefficients

##         (Intercept)         Underweight SexNUTRITION_FEMALE   SexNUTRITION_MALE 
##         11.44542962          1.06084538         -1.36123677          0.01571495

2023-10-21