Karthik Balasubramaian

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

datas <- read.csv("C:\\Users\\karth\\Downloads\\Child Growth and Malnutrition.csv")

view(datas)

The table below shows the number of records in each age group, by country.

datas_age = datas |>
            group_by(Country.ISO.3.Code, Age) |>
            summarise(Freq = n())

## `summarise()` has grouped output by 'Country.ISO.3.Code'. You can override
## using the `.groups` argument.

datas_age

## # A tibble: 1,614 × 3
## # Groups:   Country.ISO.3.Code [178]
##    Country.ISO.3.Code Age          Freq
##    <chr>              <chr>       <int>
##  1 AFG                0.   - 0.49     6
##  2 AFG                0.   - 1.99     6
##  3 AFG                0.   - 4.99    85
##  4 AFG                0.50 - 0.99     9
##  5 AFG                0.50 - 4.99     1
##  6 AFG                0.50 - 5.00     3
##  7 AFG                1.   - 1.99     9
##  8 AFG                2.   - 2.99     9
##  9 AFG                2.   - 4.99     6
## 10 AFG                3.   - 3.99     9
## # ℹ 1,604 more rows

datas_age_country <- function(Country_code){
  # Keep only the rows for the requested country
  datas_age_c <- datas_age |>
                  filter(Country.ISO.3.Code == Country_code)
  total <- sum(datas_age_c$Freq)
  # Sort by ascending frequency and add each age group's share of the records
  datas_age_c <- datas_age_c[order(datas_age_c$Freq),]
  datas_age_c <- datas_age_c |>
                  mutate(probability = Freq/total)
  # Flag the first (least frequent) row; also safe when only one row exists
  datas_age_c$anomaly <- ifelse(seq_len(nrow(datas_age_c)) == 1, "Y", "N")
  return(datas_age_c)
}

datas_age_country("AFG")

## # A tibble: 12 × 5
## # Groups:   Country.ISO.3.Code [1]
##    Country.ISO.3.Code Age          Freq probability anomaly
##    <chr>              <chr>       <int>       <dbl> <chr>  
##  1 AFG                0.50 - 4.99     1     0.00658 Y      
##  2 AFG                0.50 - 5.00     3     0.0197  N      
##  3 AFG                4.   - 5.00     3     0.0197  N      
##  4 AFG                0.   - 0.49     6     0.0395  N      
##  5 AFG                0.   - 1.99     6     0.0395  N      
##  6 AFG                2.   - 4.99     6     0.0395  N      
##  7 AFG                4.   - 4.99     6     0.0395  N      
##  8 AFG                0.50 - 0.99     9     0.0592  N      
##  9 AFG                1.   - 1.99     9     0.0592  N      
## 10 AFG                2.   - 2.99     9     0.0592  N      
## 11 AFG                3.   - 3.99     9     0.0592  N      
## 12 AFG                0.   - 4.99    85     0.559   N

The table above shows, for Afghanistan, the share of records in each age group of the malnutrition dataset. Passing a different ISO code to the function shows another country. Most records (about 56%) fall in the broad age group 0 - 4.99, while the rest are reported for narrower age bands. The least frequent group is the unusual “0.50 - 4.99” (a single record), which is almost the same as 0 - 4.99. Checks like this reveal the quality of the data being collected, and accounting for it should improve the later analysis.
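As a minimal sketch of the same calculation on made-up data (the `toy` data frame, its country code, and its values are hypothetical, not taken from the dataset), the share of each age group can be computed with `count()` and `mutate()`. Unlike the function above, this version flags every group tied for the lowest count rather than only the first row after sorting:

```r
library(dplyr)

# Hypothetical toy records; values are illustrative only
toy <- data.frame(
  Country = c("AAA", "AAA", "AAA", "AAA", "AAA", "AAA"),
  Age     = c("0-4.99", "0-4.99", "0-4.99", "0-0.49", "1-1.99", "1-1.99")
)

age_share <- toy |>
  count(Country, Age, name = "Freq") |>   # records per age group
  group_by(Country) |>
  mutate(
    probability = Freq / sum(Freq),       # share within the country
    anomaly     = if_else(Freq == min(Freq), "Y", "N")  # all ties for the minimum
  ) |>
  ungroup() |>
  arrange(Country, Freq)

age_share
```

Flagging ties explicitly avoids marking one of several equally rare groups as the only anomaly.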

The tables below show the number of records in each sex category, by country.

datas_sex = datas |>
            group_by(Country.ISO.3.Code, Sex) |>
            summarise(Freq = n())

## `summarise()` has grouped output by 'Country.ISO.3.Code'. You can override
## using the `.groups` argument.

datas_sex

## # A tibble: 517 × 3
## # Groups:   Country.ISO.3.Code [178]
##    Country.ISO.3.Code Sex               Freq
##    <chr>              <chr>            <int>
##  1 AFG                BTSX               104
##  2 AFG                NUTRITION_FEMALE    24
##  3 AFG                NUTRITION_MALE      24
##  4 AGO                BTSX               126
##  5 AGO                NUTRITION_FEMALE    26
##  6 AGO                NUTRITION_MALE      26
##  7 ALB                BTSX                81
##  8 ALB                NUTRITION_FEMALE    27
##  9 ALB                NUTRITION_MALE      27
## 10 ARG                BTSX                68
## # ℹ 507 more rows

datas_sex_country <- function(Country_code){
  # Keep only the rows for the requested country
  datas_age_s <- datas_sex |>
                  filter(Country.ISO.3.Code == Country_code)
  total <- sum(datas_age_s$Freq)
  # Sort by ascending frequency and add each sex category's share of the records
  datas_age_s <- datas_age_s[order(datas_age_s$Freq),]
  datas_age_s <- datas_age_s |>
                  mutate(probability = Freq/total)
  # Flag the first (least frequent) row; also safe when only one row exists
  datas_age_s$anomaly <- ifelse(seq_len(nrow(datas_age_s)) == 1, "Y", "N")
  return(datas_age_s)
}

datas_sex_country("JOR")

## # A tibble: 3 × 5
## # Groups:   Country.ISO.3.Code [1]
##   Country.ISO.3.Code Sex               Freq probability anomaly
##   <chr>              <chr>            <int>       <dbl> <chr>  
## 1 JOR                NUTRITION_FEMALE    46       0.201 Y      
## 2 JOR                NUTRITION_MALE      46       0.201 N      
## 3 JOR                BTSX               137       0.598 N

datas_sex_country("KWT")

## # A tibble: 3 × 5
## # Groups:   Country.ISO.3.Code [1]
##   Country.ISO.3.Code Sex               Freq probability anomaly
##   <chr>              <chr>            <int>       <dbl> <chr>  
## 1 KWT                NUTRITION_FEMALE   152       0.300 Y      
## 2 KWT                NUTRITION_MALE     152       0.300 N      
## 3 KWT                BTSX               203       0.400 N

datas_sex_country("KHM")

## # A tibble: 3 × 5
## # Groups:   Country.ISO.3.Code [1]
##   Country.ISO.3.Code Sex               Freq probability anomaly
##   <chr>              <chr>            <int>       <dbl> <chr>  
## 1 KHM                NUTRITION_FEMALE    63       0.155 Y      
## 2 KHM                NUTRITION_MALE      63       0.155 N      
## 3 KHM                BTSX               281       0.690 N

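The three tables above show the sex category under which each record was reported. Most countries have equal counts for NUTRITION_MALE and NUTRITION_FEMALE, while BTSX (both sexes combined) stands out with a much higher count in each country. Note that the anomaly column only flags the first row after sorting by frequency, so it marks NUTRITION_FEMALE here even though it is tied with NUTRITION_MALE.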
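The claim that male and female counts match in most countries can be checked directly rather than by inspecting a few tables. The sketch below assumes a summary in the same shape as `datas_sex`; the `toy_sex` data frame and its counts are hypothetical:

```r
library(dplyr)
library(tidyr)

# Hypothetical per-country counts, same shape as datas_sex
toy_sex <- data.frame(
  Country = c("AAA", "AAA", "AAA", "BBB", "BBB", "BBB"),
  Sex     = c("BTSX", "NUTRITION_FEMALE", "NUTRITION_MALE",
              "BTSX", "NUTRITION_FEMALE", "NUTRITION_MALE"),
  Freq    = c(100L, 20L, 20L, 50L, 12L, 10L)
)

sex_wide <- toy_sex |>
  # One row per country, one column per sex category
  pivot_wider(names_from = Sex, values_from = Freq, values_fill = 0L) |>
  mutate(equal_mf = NUTRITION_MALE == NUTRITION_FEMALE)

# Share of countries where male and female counts match
mean(sex_wide$equal_mf)
```

Applied to the real summary, the same pattern would give the exact proportion of countries with matching counts.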

The tables below show the number of records in each area category (urban, rural, or both), by country.

datas_area = datas |>
            group_by(Country.ISO.3.Code, Urban.Rural) |>
            summarise(Freq = n())

## `summarise()` has grouped output by 'Country.ISO.3.Code'. You can override
## using the `.groups` argument.

datas_area

## # A tibble: 465 × 3
## # Groups:   Country.ISO.3.Code [178]
##    Country.ISO.3.Code Urban.Rural    Freq
##    <chr>              <chr>         <int>
##  1 AFG                BOTH            148
##  2 AFG                NUTRITION_RUR     2
##  3 AFG                NUTRITION_URB     2
##  4 AGO                BOTH            169
##  5 AGO                NUTRITION_RUR     4
##  6 AGO                NUTRITION_URB     5
##  7 ALB                BOTH            129
##  8 ALB                NUTRITION_RUR     3
##  9 ALB                NUTRITION_URB     3
## 10 ARG                BOTH            126
## # ℹ 455 more rows

datas_area_country <- function(Country_code){
  # Keep only the rows for the requested country
  datas_age_a <- datas_area |>
                  filter(Country.ISO.3.Code == Country_code)
  total <- sum(datas_age_a$Freq)
  # Sort by ascending frequency and add each area category's share of the records
  datas_age_a <- datas_age_a[order(datas_age_a$Freq),]
  datas_age_a <- datas_age_a |>
                  mutate(probability = Freq/total)
  # Flag the first (least frequent) row; also safe when only one row exists
  datas_age_a$anomaly <- ifelse(seq_len(nrow(datas_age_a)) == 1, "Y", "N")
  return(datas_age_a)
}

datas_area_country("BOL")

## # A tibble: 3 × 5
## # Groups:   Country.ISO.3.Code [1]
##   Country.ISO.3.Code Urban.Rural    Freq probability anomaly
##   <chr>              <chr>         <int>       <dbl> <chr>  
## 1 BOL                NUTRITION_RUR     7      0.0239 Y      
## 2 BOL                NUTRITION_URB     7      0.0239 N      
## 3 BOL                BOTH            279      0.952  N

datas_area_country("MKD")

## # A tibble: 3 × 5
## # Groups:   Country.ISO.3.Code [1]
##   Country.ISO.3.Code Urban.Rural    Freq probability anomaly
##   <chr>              <chr>         <int>       <dbl> <chr>  
## 1 MKD                NUTRITION_RUR     4      0.0248 Y      
## 2 MKD                NUTRITION_URB     4      0.0248 N      
## 3 MKD                BOTH            153      0.950  N

From this, we see that the majority of records (about 95% for both BOL and MKD) are reported for rural and urban areas combined (BOTH), while a small number are reported for rural areas alone or for urban areas alone. This suggests that the data is broadly representative of each country as a whole, which speaks well for its quality.
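The age, sex, and area helpers above differ only in the column they summarise, so they could be folded into one function. The following is a sketch under assumptions, not the approach used in this report: `split_by_country` and its arguments are hypothetical names, and the `toy` data is made up.

```r
library(dplyr)

# Hypothetical generic helper: `data` has one row per record;
# `country_col` and `group_col` are column names passed as strings
split_by_country <- function(data, country_col, group_col, country_code) {
  data |>
    filter(.data[[country_col]] == country_code) |>
    group_by(across(all_of(group_col))) |>
    summarise(Freq = n(), .groups = "drop") |>   # drop grouping, no message
    arrange(Freq) |>
    mutate(probability = Freq / sum(Freq))
}

# Made-up records for illustration
toy <- data.frame(
  Country     = c("AAA", "AAA", "AAA", "AAA", "BBB"),
  Urban.Rural = c("BOTH", "BOTH", "BOTH", "NUTRITION_URB", "BOTH")
)

res <- split_by_country(toy, "Country", "Urban.Rural", "AAA")
res
```

Setting `.groups = "drop"` also avoids the grouped-output message printed by `summarise()`.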

datas <- transform(datas, Sample.size = as.numeric(Sample.size))

## Warning in eval(substitute(list(...)), `_data`, parent.frame()): NAs introduced
## by coercion

summary(datas$Sample.size)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0     342     737    1738    1572  237205    5616
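The coercion warning means that some Sample.size entries could not be read as plain numbers. Before overwriting the column, it can help to list the raw values that fail to convert. The sketch below uses a hypothetical `raw_size` vector standing in for the original character column; its values are made up.

```r
# Hypothetical raw values, standing in for the unconverted character column
raw_size <- c("342", "1,572", "", "737", "n/a")

num_size <- suppressWarnings(as.numeric(raw_size))

# Entries that became NA only because of the conversion
failed <- raw_size[is.na(num_size) & !is.na(raw_size)]

# Frequency of each problematic raw value
table(failed, useNA = "ifany")
```

Seeing the failing values makes it possible to tell genuinely missing entries from formatting problems such as thousands separators.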

The tables and graphs below summarise Sample.size by country.

datas_size = datas |>
            group_by(Country.ISO.3.Code) |>
            summarise(samplesize = mean(Sample.size, na.rm = TRUE),
                      number_of_na = sum(is.na(Sample.size)))
datas_size

## # A tibble: 178 × 3
##    Country.ISO.3.Code samplesize number_of_na
##    <chr>                   <dbl>        <int>
##  1 AFG                     2930.           19
##  2 AGO                     1082.           35
##  3 ALB                      408.            0
##  4 ARG                     2563.           38
##  5 ARM                      374.            0
##  6 ATG                      NaN            21
##  7 AUS                      445.            4
##  8 AZE                      402.            3
##  9 BDI                     1949.            8
## 10 BEL                      172.            0
## # ℹ 168 more rows

datas_size = datas_size[order(datas_size$number_of_na, -datas_size$samplesize), ]
datas_size

## # A tibble: 178 × 3
##    Country.ISO.3.Code samplesize number_of_na
##    <chr>                   <dbl>        <int>
##  1 LTU                    27553.            0
##  2 CAF                     2057.            0
##  3 BEN                     1759.            0
##  4 PSE                     1450.            0
##  5 GNB                     1427.            0
##  6 SLV                     1143.            0
##  7 NLD                     1118.            0
##  8 UKR                      929.            0
##  9 KHM                      852.            0
## 10 NAM                      780.            0
## # ℹ 168 more rows

ss <- datas_size |>
      filter(samplesize>3800.00000) |>
      select(Country.ISO.3.Code) |>
      as_vector()
ss

## Country.ISO.3.Code1 Country.ISO.3.Code2 Country.ISO.3.Code3 Country.ISO.3.Code4 
##               "LTU"               "CPV"               "IRQ"               "IND" 
## Country.ISO.3.Code5 Country.ISO.3.Code6 
##               "VNM"               "IDN"
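The vector above carries generated names such as Country.ISO.3.Code1 because of the `select()` plus `as_vector()` combination. If a plain, unnamed character vector is preferred, `dplyr::pull()` is one alternative. A small sketch on hypothetical data (`toy_size` and its values are made up):

```r
library(dplyr)

# Made-up per-country means for illustration
toy_size <- data.frame(
  Country    = c("AAA", "BBB", "CCC"),
  samplesize = c(4200, 310, 5100)
)

big <- toy_size |>
  filter(samplesize > 3800) |>
  pull(Country)   # extracts the column as an unnamed vector

big
```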

a <- datas_size |>
      filter(Country.ISO.3.Code %in% ss) |>
      ggplot()+
      geom_point(mapping = aes(x = number_of_na, y = samplesize, color = Country.ISO.3.Code))
      
a

ss1 <- datas_size |>
        filter(number_of_na>150) |>
        select(Country.ISO.3.Code) |>
        as_vector()
ss1

## Country.ISO.3.Code1 Country.ISO.3.Code2 Country.ISO.3.Code3 Country.ISO.3.Code4 
##               "PHL"               "CHL"               "CHN"               "KWT" 
## Country.ISO.3.Code5 Country.ISO.3.Code6 Country.ISO.3.Code7 
##               "VNM"               "IDN"               "BGD"

b <- datas_size |>
      filter(Country.ISO.3.Code %in% ss1) |>
      ggplot()+
      geom_point(mapping = aes(x = number_of_na, y = samplesize, color = Country.ISO.3.Code))
      
b

## Warning: Removed 2 rows containing missing values (`geom_point()`).

The two graphs above suggest that countries with a high number of missing values also tend to have a somewhat higher mean sample size, which can distort the available data. We therefore need a sufficient number of Sample.size values from each country, recorded across multiple surveys, or else we will tend to arrive at the wrong conclusions.
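The relationship read off the graphs can also be quantified. As a sketch only, a rank correlation between the number of missing values and the mean sample size could be computed on a summary shaped like `datas_size`. The `toy_size` data frame below is hypothetical, and Spearman correlation is one possible choice here, not a method used in this report.

```r
# Hypothetical per-country summary, same shape as datas_size
toy_size <- data.frame(
  Country      = c("AAA", "BBB", "CCC", "DDD", "EEE"),
  samplesize   = c(400, 900, NaN, 2500, 3100),
  number_of_na = c(0L, 10L, 21L, 40L, 160L)
)

# Spearman rank correlation, skipping countries with no usable mean
rho <- cor(toy_size$number_of_na, toy_size$samplesize,
           method = "spearman", use = "complete.obs")
rho
```

A rank-based measure is less sensitive to extreme means, such as the very large value for LTU, than a Pearson correlation would be.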

Karthik Balasubramaian - Week 3

2023-09-11
