This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
datas <- read.csv("C:\\Users\\karth\\Downloads\\Child Growth and Malnutrition.csv")
view(datas)
The table below shows the number of records in each age group, by country.
datas_age <- datas |>
  group_by(Country.ISO.3.Code, Age) |>
  summarise(Freq = n())
## `summarise()` has grouped output by 'Country.ISO.3.Code'. You can override
## using the `.groups` argument.
datas_age
## # A tibble: 1,614 × 3
## # Groups: Country.ISO.3.Code [178]
## Country.ISO.3.Code Age Freq
## <chr> <chr> <int>
## 1 AFG 0. - 0.49 6
## 2 AFG 0. - 1.99 6
## 3 AFG 0. - 4.99 85
## 4 AFG 0.50 - 0.99 9
## 5 AFG 0.50 - 4.99 1
## 6 AFG 0.50 - 5.00 3
## 7 AFG 1. - 1.99 9
## 8 AFG 2. - 2.99 9
## 9 AFG 2. - 4.99 6
## 10 AFG 3. - 3.99 9
## # ℹ 1,604 more rows
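The message about grouped output appears because `summarise()` drops only the last grouping level. As a minimal sketch on a small made-up data frame (the `toy` object and its values are hypothetical, not taken from the data set), `dplyr::count()` should give the same frequencies while returning an ungrouped result:

```r
library(dplyr)

# Hypothetical toy data standing in for the real CSV
toy <- data.frame(
  Country.ISO.3.Code = c("AAA", "AAA", "AAA", "BBB"),
  Age = c("0. - 4.99", "0. - 4.99", "1. - 1.99", "0. - 4.99")
)

# group_by() + summarise(), with .groups = "drop" to return an ungrouped table
by_summarise <- toy |>
  group_by(Country.ISO.3.Code, Age) |>
  summarise(Freq = n(), .groups = "drop")

# count() groups, counts and ungroups in a single call
by_count <- toy |>
  count(Country.ISO.3.Code, Age, name = "Freq")

print(by_count)
```

Either form could replace the grouped version, provided later steps filter by country explicitly, as the helper functions in this report do.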
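# Age-group frequencies and empirical probabilities for one country.
# The least frequent age group (first row after sorting) is flagged "Y".
datas_age_country <- function(Country_code){
  datas_age_c <- datas_age |>
    filter(Country.ISO.3.Code == Country_code)
  total <- sum(datas_age_c$Freq)
  # Sort ascending so the rarest age group comes first
  datas_age_c <- datas_age_c[order(datas_age_c$Freq),]
  datas_age_c <- datas_age_c |>
    mutate(probability = Freq/total)
  # seq_len() keeps this correct even when a country has only one row
  datas_age_c$anomaly <- ifelse(seq_len(nrow(datas_age_c)) == 1, "Y", "N")
  return(datas_age_c)
}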
datas_age_country("AFG")
## # A tibble: 12 × 5
## # Groups: Country.ISO.3.Code [1]
## Country.ISO.3.Code Age Freq probability anomaly
## <chr> <chr> <int> <dbl> <chr>
## 1 AFG 0.50 - 4.99 1 0.00658 Y
## 2 AFG 0.50 - 5.00 3 0.0197 N
## 3 AFG 4. - 5.00 3 0.0197 N
## 4 AFG 0. - 0.49 6 0.0395 N
## 5 AFG 0. - 1.99 6 0.0395 N
## 6 AFG 2. - 4.99 6 0.0395 N
## 7 AFG 4. - 4.99 6 0.0395 N
## 8 AFG 0.50 - 0.99 9 0.0592 N
## 9 AFG 1. - 1.99 9 0.0592 N
## 10 AFG 2. - 2.99 9 0.0592 N
## 11 AFG 3. - 3.99 9 0.0592 N
## 12 AFG 0. - 4.99 85 0.559 N
The table above shows, for Afghanistan, the share of records that fall in each age group of the malnutrition data set. Passing a different ISO 3 code to the function gives the same breakdown for another country. Most records (85 of 152, about 56%) use the broad age group "0. - 4.99", while the rest use narrower, partly overlapping groups. The least frequent group is the unusual "0.50 - 4.99" (1 record), which is almost the same as "0. - 4.99". Spotting rare, overlapping categories like this tells us about the quality of the data being collected, and accounting for them should improve the analysis.
The tables below show the number of records in each sex category, by country.
datas_sex <- datas |>
  group_by(Country.ISO.3.Code, Sex) |>
  summarise(Freq = n())
## `summarise()` has grouped output by 'Country.ISO.3.Code'. You can override
## using the `.groups` argument.
datas_sex
## # A tibble: 517 × 3
## # Groups: Country.ISO.3.Code [178]
## Country.ISO.3.Code Sex Freq
## <chr> <chr> <int>
## 1 AFG BTSX 104
## 2 AFG NUTRITION_FEMALE 24
## 3 AFG NUTRITION_MALE 24
## 4 AGO BTSX 126
## 5 AGO NUTRITION_FEMALE 26
## 6 AGO NUTRITION_MALE 26
## 7 ALB BTSX 81
## 8 ALB NUTRITION_FEMALE 27
## 9 ALB NUTRITION_MALE 27
## 10 ARG BTSX 68
## # ℹ 507 more rows
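# Sex-category frequencies and empirical probabilities for one country.
# The least frequent category (first row after sorting) is flagged "Y".
datas_sex_country <- function(Country_code){
  datas_sex_c <- datas_sex |>
    filter(Country.ISO.3.Code == Country_code)
  total <- sum(datas_sex_c$Freq)
  # Sort ascending so the rarest category comes first
  datas_sex_c <- datas_sex_c[order(datas_sex_c$Freq),]
  datas_sex_c <- datas_sex_c |>
    mutate(probability = Freq/total)
  # seq_len() keeps this correct even when a country has only one row
  datas_sex_c$anomaly <- ifelse(seq_len(nrow(datas_sex_c)) == 1, "Y", "N")
  return(datas_sex_c)
}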
datas_sex_country("JOR")
## # A tibble: 3 × 5
## # Groups: Country.ISO.3.Code [1]
## Country.ISO.3.Code Sex Freq probability anomaly
## <chr> <chr> <int> <dbl> <chr>
## 1 JOR NUTRITION_FEMALE 46 0.201 Y
## 2 JOR NUTRITION_MALE 46 0.201 N
## 3 JOR BTSX 137 0.598 N
datas_sex_country("KWT")
## # A tibble: 3 × 5
## # Groups: Country.ISO.3.Code [1]
## Country.ISO.3.Code Sex Freq probability anomaly
## <chr> <chr> <int> <dbl> <chr>
## 1 KWT NUTRITION_FEMALE 152 0.300 Y
## 2 KWT NUTRITION_MALE 152 0.300 N
## 3 KWT BTSX 203 0.400 N
datas_sex_country("KHM")
## # A tibble: 3 × 5
## # Groups: Country.ISO.3.Code [1]
## Country.ISO.3.Code Sex Freq probability anomaly
## <chr> <chr> <int> <dbl> <chr>
## 1 KHM NUTRITION_FEMALE 63 0.155 Y
## 2 KHM NUTRITION_MALE 63 0.155 N
## 3 KHM BTSX 281 0.690 N
The three tables above show how the records are split by the sex category of the children surveyed. Most countries have equal numbers of NUTRITION_MALE and NUTRITION_FEMALE records, and BTSX (both sexes combined) is the outlier, with a much higher count in each country.
The tables below show the number of records in each area category (urban, rural, or both), by country.
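In the sex tables, NUTRITION_FEMALE and NUTRITION_MALE are tied for the lowest count, so which of the two receives the "Y" depends only on sort order. One possible alternative, sketched here with the JOR counts and not part of the original analysis, is to flag every category that shares the minimum frequency:

```r
# Counts copied from the JOR table in this report
jor <- data.frame(
  Sex = c("NUTRITION_FEMALE", "NUTRITION_MALE", "BTSX"),
  Freq = c(46, 46, 137)
)

jor$probability <- jor$Freq / sum(jor$Freq)

# Assumption: every category tied for the lowest count is treated as an anomaly
jor$anomaly <- ifelse(jor$Freq == min(jor$Freq), "Y", "N")

print(jor)
```

Whether the rarest or the most common category should count as the anomaly is a modelling choice; this sketch only removes the dependence on row order.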
datas_area <- datas |>
  group_by(Country.ISO.3.Code, Urban.Rural) |>
  summarise(Freq = n())
## `summarise()` has grouped output by 'Country.ISO.3.Code'. You can override
## using the `.groups` argument.
datas_area
## # A tibble: 465 × 3
## # Groups: Country.ISO.3.Code [178]
## Country.ISO.3.Code Urban.Rural Freq
## <chr> <chr> <int>
## 1 AFG BOTH 148
## 2 AFG NUTRITION_RUR 2
## 3 AFG NUTRITION_URB 2
## 4 AGO BOTH 169
## 5 AGO NUTRITION_RUR 4
## 6 AGO NUTRITION_URB 5
## 7 ALB BOTH 129
## 8 ALB NUTRITION_RUR 3
## 9 ALB NUTRITION_URB 3
## 10 ARG BOTH 126
## # ℹ 455 more rows
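# Area-category frequencies and empirical probabilities for one country.
# The least frequent category (first row after sorting) is flagged "Y".
datas_area_country <- function(Country_code){
  datas_area_c <- datas_area |>
    filter(Country.ISO.3.Code == Country_code)
  total <- sum(datas_area_c$Freq)
  # Sort ascending so the rarest category comes first
  datas_area_c <- datas_area_c[order(datas_area_c$Freq),]
  datas_area_c <- datas_area_c |>
    mutate(probability = Freq/total)
  # seq_len() keeps this correct even when a country has only one row
  datas_area_c$anomaly <- ifelse(seq_len(nrow(datas_area_c)) == 1, "Y", "N")
  return(datas_area_c)
}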
datas_area_country("BOL")
## # A tibble: 3 × 5
## # Groups: Country.ISO.3.Code [1]
## Country.ISO.3.Code Urban.Rural Freq probability anomaly
## <chr> <chr> <int> <dbl> <chr>
## 1 BOL NUTRITION_RUR 7 0.0239 Y
## 2 BOL NUTRITION_URB 7 0.0239 N
## 3 BOL BOTH 279 0.952 N
datas_area_country("MKD")
## # A tibble: 3 × 5
## # Groups: Country.ISO.3.Code [1]
## Country.ISO.3.Code Urban.Rural Freq probability anomaly
## <chr> <chr> <int> <dbl> <chr>
## 1 MKD NUTRITION_RUR 4 0.0248 Y
## 2 MKD NUTRITION_URB 4 0.0248 N
## 3 MKD BOTH 153 0.950 N
From this, we see that the majority of records (about 95% for both BOL and MKD) are labelled BOTH, covering rural and urban areas together. Some anomalies exist: a few records come from rural areas alone, and a few from urban areas alone. This suggests that the data collected are broadly representative of each country as a whole, which speaks well for their quality.
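The age, sex, and area helpers differ only in the frequency table they read. A minimal sketch of a single generic helper follows; `freq_for_country()` and `toy_area` are hypothetical names introduced for illustration, and the toy counts are the BOL and MKD values from the tables above:

```r
library(dplyr)

# Hypothetical generic helper: `data` is a frequency table with the columns
# Country.ISO.3.Code, one category column, and Freq
freq_for_country <- function(data, country_code) {
  data |>
    filter(Country.ISO.3.Code == country_code) |>
    arrange(Freq) |>
    mutate(
      probability = Freq / sum(Freq),
      # As in the helpers above, only the lowest-frequency row is flagged
      anomaly = if_else(row_number() == 1, "Y", "N")
    )
}

# Toy frequency table built from the BOL and MKD counts
toy_area <- data.frame(
  Country.ISO.3.Code = c("BOL", "BOL", "BOL", "MKD", "MKD", "MKD"),
  Urban.Rural = rep(c("BOTH", "NUTRITION_RUR", "NUTRITION_URB"), 2),
  Freq = c(279, 7, 7, 153, 4, 4)
)

result <- freq_for_country(toy_area, "BOL")
print(result)
```

Under this assumption, calls such as `freq_for_country(datas_sex, "JOR")` would stand in for the per-variable functions.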
datas <- transform(datas, Sample.size = as.numeric(Sample.size))
## Warning in eval(substitute(list(...)), `_data`, parent.frame()): NAs introduced
## by coercion
summary(datas$Sample.size)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 342 737 1738 1572 237205 5616
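The coercion warning means that some Sample.size entries could not be read as numbers. The sketch below shows one way to list such entries before deciding how to treat them; the `raw` vector is hypothetical, and the guess that thousands separators are involved is an assumption, not something confirmed by the data:

```r
# Hypothetical raw values; the real Sample.size column may look different
raw <- c("342", "1,572", "", "737", "n/a")

parsed <- suppressWarnings(as.numeric(raw))

# Entries that contain text but could not be parsed as numbers
failed <- raw[is.na(parsed) & !is.na(raw) & raw != ""]
print(failed)

# Assumption: some failures are caused by thousands separators, so strip commas
cleaned <- suppressWarnings(as.numeric(gsub(",", "", raw)))
print(cleaned)
```

Listing the failed entries first makes it clear how many of the NA values are truly missing and how many are only formatted differently.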
The tables and graphs below summarise Sample.size by country.
datas_size <- datas |>
  group_by(Country.ISO.3.Code) |>
  summarise(samplesize = mean(Sample.size, na.rm = TRUE),
            number_of_na = sum(is.na(Sample.size)))
datas_size
## # A tibble: 178 × 3
## Country.ISO.3.Code samplesize number_of_na
## <chr> <dbl> <int>
## 1 AFG 2930. 19
## 2 AGO 1082. 35
## 3 ALB 408. 0
## 4 ARG 2563. 38
## 5 ARM 374. 0
## 6 ATG NaN 21
## 7 AUS 445. 4
## 8 AZE 402. 3
## 9 BDI 1949. 8
## 10 BEL 172. 0
## # ℹ 168 more rows
datas_size = datas_size[order(datas_size$number_of_na, -datas_size$samplesize), ]
datas_size
## # A tibble: 178 × 3
## Country.ISO.3.Code samplesize number_of_na
## <chr> <dbl> <int>
## 1 LTU 27553. 0
## 2 CAF 2057. 0
## 3 BEN 1759. 0
## 4 PSE 1450. 0
## 5 GNB 1427. 0
## 6 SLV 1143. 0
## 7 NLD 1118. 0
## 8 UKR 929. 0
## 9 KHM 852. 0
## 10 NAM 780. 0
## # ℹ 168 more rows
ss <- datas_size |>
  filter(samplesize > 3800) |>
  select(Country.ISO.3.Code) |>
  as_vector()
ss
## Country.ISO.3.Code1 Country.ISO.3.Code2 Country.ISO.3.Code3 Country.ISO.3.Code4
## "LTU" "CPV" "IRQ" "IND"
## Country.ISO.3.Code5 Country.ISO.3.Code6
## "VNM" "IDN"
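The names such as `Country.ISO.3.Code1` in the output come from `as_vector()`. If a plain character vector is preferred, `dplyr::pull()` is one alternative. A minimal sketch, using three rows copied from the sorted table above (`toy_size` and `big` are hypothetical names):

```r
library(dplyr)

# Toy rows copied from the sorted table in this report
toy_size <- data.frame(
  Country.ISO.3.Code = c("LTU", "CAF", "BEN"),
  samplesize = c(27553, 2057, 1759)
)

# pull() extracts one column as an unnamed vector
big <- toy_size |>
  filter(samplesize > 3800) |>
  pull(Country.ISO.3.Code)

print(big)
```

The result works with `%in%` in the same way as the named vector.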
a <- datas_size |>
  filter(Country.ISO.3.Code %in% ss) |>
  ggplot() +
  geom_point(mapping = aes(x = number_of_na, y = samplesize, color = Country.ISO.3.Code))
a
ss1 <- datas_size |>
  filter(number_of_na > 150) |>
  select(Country.ISO.3.Code) |>
  as_vector()
ss1
## Country.ISO.3.Code1 Country.ISO.3.Code2 Country.ISO.3.Code3 Country.ISO.3.Code4
## "PHL" "CHL" "CHN" "KWT"
## Country.ISO.3.Code5 Country.ISO.3.Code6 Country.ISO.3.Code7
## "VNM" "IDN" "BGD"
b <- datas_size |>
  filter(Country.ISO.3.Code %in% ss1) |>
  ggplot() +
  geom_point(mapping = aes(x = number_of_na, y = samplesize, color = Country.ISO.3.Code))
b
## Warning: Removed 2 rows containing missing values (`geom_point()`).
The two graphs above suggest that countries with a high number of missing values also tend to have a somewhat high mean sample size, which can distort the available data. This tells us that we need a good number of Sample.size values from each country, recorded across multiple surveys, or else we will tend to arrive at the wrong conclusions.
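The relationship read off the graphs can also be checked numerically. As a rough sketch, the code below computes a Spearman rank correlation on the ten rows of `datas_size` printed earlier (values rounded as shown there); this is an illustration on a small subset, not a result for the full data set, and `first_rows` is a hypothetical name:

```r
# First ten rows of datas_size as printed in this report (ATG has no sample size)
first_rows <- data.frame(
  Country.ISO.3.Code = c("AFG", "AGO", "ALB", "ARG", "ARM",
                         "ATG", "AUS", "AZE", "BDI", "BEL"),
  samplesize = c(2930, 1082, 408, 2563, 374, NaN, 445, 402, 1949, 172),
  number_of_na = c(19, 35, 0, 38, 0, 21, 4, 3, 8, 0)
)

# Drop countries whose mean sample size is undefined
ok <- !is.na(first_rows$samplesize)

# Rank correlation is less sensitive to extreme sample sizes than Pearson's r
rho <- cor(first_rows$number_of_na[ok], first_rows$samplesize[ok],
           method = "spearman")
print(rho)
```

Running the same calculation on all 178 countries would show whether the pattern holds beyond the countries selected for the plots.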