200709_Final

Introduction:

This study is based on the use of the Rasch model to generate an interval measure of a latent construct. The latent construct in this study is child nutrition. The Rasch model generates two measures. One is the person measure, which is essentially a row summary. In this study, dealing as it does with foods given to children, for each record (row) in the database,the program sums the individual ordinal “scores” for each of the foods given and from the total generates a linear “measure.” This is the “person measure,” and each person (record) in the database has a person measure. The person measure shows the level of the latent construct each person has. In this study, the latent construct is “child nutrition” because I study the foods given to children. These person measures may be aggregated by categorical variables. In the cleaned dataset, the categorical variables include the birth year of the child, the survey year, the county in which the child lives, an urban-rural indicator, the mother’s educational level and the family’s wealth level. The major importance of this study is the comparison of the aggregate person measures between the two survey years, given Liberia’s nearly 25% GDP/capita increase between those two years.

The program also generates a measure for each of the items under study. In this case, the items are the individual food groups. A low item measure, in this study, means the food group was “easily endorsed,” meaning many children were given this food. Foods that were given less often, and were thus “harder to endorse,” had higher item measures.(Note the inverse relationship.) As will be seen in the scatterplot below, foods higher in nutritional value had higher item measures and thus were given less frequently to the children.

The specific data sets chosen for analysis were the Demographic and Health Surveys (DHS) for Liberia for the years 2007 and 2013. DHS surveys are conducted at various times by contractors hired by the United States Agency for International Development (USAID). In Liberia, DHS surveys were conducted in 2007 and 2013. 2007 was just after the end of a protracted Civil War (it began in 1980) and 2013 was the next survey. In that time period Liberia’s GDP/capita increased by ~ 25%. The surveys in these two years provided an excellent opportunity to ascertain whether the level of child nutrition rose in parallel with the rise in GDP/capita.

Cleaning. The DHS survey is very complex, with many data files for each survey. I used the breastfeeding file from each of those two surveys.The breastfeeding files each conained over 900 variables, including basic demographic data, foods eaten by mothers and foods given to children, data on health habits such as smoking and drinking, data on sexual activity and contraception, etc. I selected the variables relating to foods given to children the day prior to the survey. The 2007 file contained over 22,000 observations; the 2013 file contained > 30,000 records. I eliminated unneeded variables and then cleaned the child nutrition variables to remove the don’t knows and missing data. I also had to harmonize the county name coding and the food groups between the two years. After cleaning, the 2007 file had 9,322 records and the 2013 file had 10,400 records, with 25 matching variables in each file. The files included categorical data as well as the data for foods given to children (1= yes, 0 = no). Then, using the Winsteps program developed by J. Michael Linacre, PhD, I generated person measures for each record in the file, as well as item measures for each food group. The measures are continuous interval variables and thus may be used for parametric statistical analysis.

Basic Set-up for R analysis.

getwd()

## [1] "C:/Users/Jerome/Documents/Data_Science_110/Datasets"

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.0.2

## -- Attaching packages ----------------------------------------------------------------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.2     v dplyr   1.0.0
## v tidyr   1.1.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0

## Warning: package 'dplyr' was built under R version 4.0.2

## -- Conflicts -------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(dplyr)
library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(ggplot2)
library(RColorBrewer)

Explore the data. The categorical variables in the data set included the birth year of the children, the year of the survey, the county in which the family lived, an indicator for urban or rural location, the family’s wealth classification, and the mother’s educataional level. By plotting the demographic variables against the birth year and nutritional measure, a comparison of the years by the demographic variables is possible. Following are two plots showing the patterns of child nutritonal levels by survey year and mother’s education and family wealth.

As will be seen in the two plots, overall nutritional levels were lower in 2013 than they were in 2007 across all educational levels and across all wealth levels.

lcn_fpf_co_names <-read.csv("lcn_fpf_co_names.csv")
lcn_fpf_co_names <-group_by(lcn_fpf_co_names, Svy_Year, Educ)
lcn_fpf_co_names_educ_summ <- summarize(lcn_fpf_co_names, MEASURE=mean(MEASURE))

## `summarise()` regrouping output by 'Svy_Year' (override with `.groups` argument)

the_measure <-  "Comparison of Mean Person Measures by Mother's Education/Survey Year"
lcn_fpf_co_names_educ_summ$Educ<-as.factor(lcn_fpf_co_names_educ_summ$Educ)
levels(lcn_fpf_co_names_educ_summ$Educ)<- c("No Education", "Incomplete Primary", "Primary", "Incomplete Secondary", "Secondary", "Higher Education")
plot<-ggplot(lcn_fpf_co_names_educ_summ, aes(fill=as.factor(Educ), y=MEASURE, x=as.factor(Svy_Year))) + 
  geom_bar(position="dodge", stat="identity")+
labs(y="Mean Person Measure",x= "Survey Year") +
  ggtitle(the_measure)
print(plot)

lcn_fpf_co_names <- ungroup(lcn_fpf_co_names)

lcn_fpf_co_names <-group_by(lcn_fpf_co_names, Svy_Year, Wealth)
lcn_fpf_co_names_wealth_summ <- summarize(lcn_fpf_co_names, MEASURE=mean(MEASURE))

## `summarise()` regrouping output by 'Svy_Year' (override with `.groups` argument)

the_measure <-  "Comparison of Mean Person Measures by Wealth Category/Survey Year" 
lcn_fpf_co_names_wealth_summ$Wealth<-as.factor(lcn_fpf_co_names_wealth_summ$Wealth)
levels(lcn_fpf_co_names_wealth_summ$Wealth)<- c("Poorest", "Poorer", "Middle", "Richer", "Richest")

plot<-ggplot(lcn_fpf_co_names_wealth_summ, aes(fill=as.factor(Wealth), y=MEASURE, x=as.factor(Svy_Year))) + 
  geom_bar(position="dodge", stat="identity") +

  labs(y="Mean Person Measure",x= "Survey Year") +
  ggtitle(the_measure)
print(plot)

## Another way to see the change in nutrition levels is to plot the measure by County to see how the measures changed in each County over the two years for which data are available. The chart indicates nutrition levels rose in only three counties, stayed constant in eight counties, and dropped in four counties. Given the increase in per capita GDP over the time period, this seems to indicate a concern.

lcn_fpf_co_names <-read.csv("lcn_fpf_co_names.csv")
data("lcn_fpf_co_names")

## Warning in data("lcn_fpf_co_names"): data set 'lcn_fpf_co_names' not found

the_measure <- "Measures of Child Nutrition in Liberian Counties, 2007 & 2013"
lcn_fpf_co_names %>%
ggplot(aes(Svy_Year, Co_Name, fill = MEASURE)) +
geom_tile(color = "grey50") +
scale_x_continuous(breaks=c(2007,  2013)) +
scale_fill_gradientn(colors = brewer.pal(9, "Reds"), trans = "sqrt") +
geom_vline(xintercept=2007, col = "blue") +
  geom_vline(xintercept=2013, col = "blue") +
theme_minimal() + theme(panel.grid = element_blank()) +
ggtitle(the_measure) +
ylab("") +
xlab("")

The charts above show the changes in person measures. But what about the item measures? Were Liberian children given healthy foods or foods with lower nutritional value? A plot of the item measures suggests Liberian children were not, overall, given healthy foods. As reported in one study, The data confirm the findings of one recent report on Liberian agricultural value chains, which noted, “children continue to eat mostly rice-laden meals with palm oil.” Rutherford D, Burke H, Cheung K, and S Field. Impact of an Agricultural Value Chain Project on Smallholder Farmers, Households and Children in Liberia. World Development 2016; 83: 70 – 83.

library(ggrepel)
Item_Measures <-read.csv("Item_Measures.csv")
p1 <- ggplot(Item_Measures, aes(x= Item_Measures_2007, y = Item_Measures_2013, label = Food_Group)) 
  p1 + geom_point()+ geom_smooth(method = "lm", formula =y~x)+
  ggtitle ("Patterns of Foods Fed to Children in Liberia, 2007 and 2013") +
geom_text_repel (nudge_x = .005)+
  xlim (0,100) +ylim (0,100)

The visuals presented above show a relationship – over time, child nutritional levels have decreased. But is that decrease statistically significant? A two-sample t-test can show statistical significance. A regression analysis may also show significance. Both tests are shown below.

t.test(lcn_fpf_co_names$MEASURE~lcn_fpf_co_names$Svy_Year)

## 
##  Welch Two Sample t-test
## 
## data:  lcn_fpf_co_names$MEASURE by lcn_fpf_co_names$Svy_Year
## t = 36.309, df = 19492, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  10.84161 12.07895
## sample estimates:
## mean in group 2007 mean in group 2013 
##           49.40959           37.94930

fit4 <- lm(MEASURE ~ as.factor(UrbRur) +as.factor(Educ) +as.factor(Wealth), data = lcn_fpf_co_names)
summary(fit4)

## 
## Call:
## lm(formula = MEASURE ~ as.factor(UrbRur) + as.factor(Educ) + 
##     as.factor(Wealth), data = lcn_fpf_co_names)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.614 -20.592   7.934  15.424  49.724 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        41.66065    0.50126  83.112  < 2e-16 ***
## as.factor(UrbRur)2  0.06121    0.41911   0.146   0.8839    
## as.factor(Educ)1   -0.31553    0.37293  -0.846   0.3975    
## as.factor(Educ)2    4.38239    0.84769   5.170 2.37e-07 ***
## as.factor(Educ)3    0.32681    0.55807   0.586   0.5581    
## as.factor(Educ)4    1.21807    1.23126   0.989   0.3225    
## as.factor(Educ)5    4.33243    2.16209   2.004   0.0451 *  
## as.factor(Wealth)2  0.12967    0.42736   0.303   0.7616    
## as.factor(Wealth)3  2.10815    0.47627   4.426 9.63e-06 ***
## as.factor(Wealth)4  3.72918    0.58120   6.416 1.43e-10 ***
## as.factor(Wealth)5  7.58124    0.80801   9.383  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.73 on 19711 degrees of freedom
## Multiple R-squared:  0.01143,    Adjusted R-squared:  0.01093 
## F-statistic: 22.79 on 10 and 19711 DF,  p-value: < 2.2e-16

print(fit4)

## 
## Call:
## lm(formula = MEASURE ~ as.factor(UrbRur) + as.factor(Educ) + 
##     as.factor(Wealth), data = lcn_fpf_co_names)
## 
## Coefficients:
##        (Intercept)  as.factor(UrbRur)2    as.factor(Educ)1    as.factor(Educ)2  
##           41.66065             0.06121            -0.31553             4.38239  
##   as.factor(Educ)3    as.factor(Educ)4    as.factor(Educ)5  as.factor(Wealth)2  
##            0.32681             1.21807             4.33243             0.12967  
## as.factor(Wealth)3  as.factor(Wealth)4  as.factor(Wealth)5  
##            2.10815             3.72918             7.58124

But given the size of the database, almost any difference, no matter how small, will show as statistically significant. The t-test results showed that. But an examination of the R-squared value shows the model itself was not significant at all, even though some of the higher wealth categories and one of the educational categories were highly significant. In order to get a better picture of possible statistical significance, a random sample of 176 cases was drawn and the tests run on that smaller data set. The number 176 was found, in the original study on which this exercise is based, to give a statistical power of 0.95 and an effect size (Cohen’s d) of 0.5.

library(mosaic)

## Warning: package 'mosaic' was built under R version 4.0.2

## Loading required package: lattice

## Loading required package: ggformula

## Warning: package 'ggformula' was built under R version 4.0.2

## Loading required package: ggstance

## Warning: package 'ggstance' was built under R version 4.0.2

## 
## Attaching package: 'ggstance'

## The following objects are masked from 'package:ggplot2':
## 
##     geom_errorbarh, GeomErrorbarh

## 
## New to ggformula?  Try the tutorials: 
##  learnr::run_tutorial("introduction", package = "ggformula")
##  learnr::run_tutorial("refining", package = "ggformula")

## Loading required package: mosaicData

## Warning: package 'mosaicData' was built under R version 4.0.2

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

## Registered S3 method overwritten by 'mosaic':
##   method                           from   
##   fortify.SpatialPolygonsDataFrame ggplot2

## 
## The 'mosaic' package masks several functions from core packages in order to add 
## additional features.  The original behavior of these functions should not be affected by this.
## 
## Note: If you use the Matrix package, be sure to load it BEFORE loading mosaic.
## 
## Have you tried the ggformula package for your plots?

## 
## Attaching package: 'mosaic'

## The following object is masked from 'package:Matrix':
## 
##     mean

## The following object is masked from 'package:plotly':
## 
##     do

## The following objects are masked from 'package:dplyr':
## 
##     count, do, tally

## The following object is masked from 'package:purrr':
## 
##     cross

## The following object is masked from 'package:ggplot2':
## 
##     stat

## The following objects are masked from 'package:stats':
## 
##     binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
##     quantile, sd, t.test, var

## The following objects are masked from 'package:base':
## 
##     max, mean, min, prod, range, sample, sum

lcn_sample_176 = resample(lcn_fpf_co_names, size = 176, replace = FALSE)
write.csv(lcn_sample_176, file = "lcn_sample_176.csv")

Now re-run the t-test and regression on the smaller data set.

lcn_sample_176 <-read.csv("lcn_sample_176.csv", header = TRUE)
t.test(lcn_sample_176$MEASURE~lcn_sample_176$Svy_Year)

## 
##  Welch Two Sample t-test
## 
## data:  lcn_sample_176$MEASURE by lcn_sample_176$Svy_Year
## t = 3.9157, df = 165.52, p-value = 0.0001316
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   6.128047 18.593391
## sample estimates:
## mean in group 2007 mean in group 2013 
##           49.02544           36.66472

fit5 <- lm(MEASURE ~ as.factor(UrbRur) +as.factor(Educ) +as.factor(Wealth), data = lcn_sample_176)
summary(fit5)

## 
## Call:
## lm(formula = MEASURE ~ as.factor(UrbRur) + as.factor(Educ) + 
##     as.factor(Wealth), data = lcn_sample_176)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.282 -17.936   7.113  17.423  40.981 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         45.6242     4.8447   9.417   <2e-16 ***
## as.factor(UrbRur)2  -4.5954     4.3203  -1.064    0.289    
## as.factor(Educ)1    -0.4001     3.9471  -0.101    0.919    
## as.factor(Educ)2     7.6427     9.1522   0.835    0.405    
## as.factor(Educ)3    -0.8092     6.1809  -0.131    0.896    
## as.factor(Educ)4   -16.7027    16.7523  -0.997    0.320    
## as.factor(Educ)5     3.7422    17.5044   0.214    0.831    
## as.factor(Wealth)2  -2.8362     4.4019  -0.644    0.520    
## as.factor(Wealth)3  -1.9745     5.0325  -0.392    0.695    
## as.factor(Wealth)4  -1.9263     6.6434  -0.290    0.772    
## as.factor(Wealth)5   5.6487     7.9581   0.710    0.479    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.58 on 165 degrees of freedom
## Multiple R-squared:  0.03313,    Adjusted R-squared:  -0.02547 
## F-statistic: 0.5653 on 10 and 165 DF,  p-value: 0.8405

print(fit5)

## 
## Call:
## lm(formula = MEASURE ~ as.factor(UrbRur) + as.factor(Educ) + 
##     as.factor(Wealth), data = lcn_sample_176)
## 
## Coefficients:
##        (Intercept)  as.factor(UrbRur)2    as.factor(Educ)1    as.factor(Educ)2  
##            45.6242             -4.5954             -0.4001              7.6427  
##   as.factor(Educ)3    as.factor(Educ)4    as.factor(Educ)5  as.factor(Wealth)2  
##            -0.8092            -16.7027              3.7422             -2.8362  
## as.factor(Wealth)3  as.factor(Wealth)4  as.factor(Wealth)5  
##            -1.9745             -1.9263              5.6487

The regression model run against the random sample still provided little assurance the differences were statistically significant. None of the variables showed significance, and the R-squared value was, as before, negligible. However, the t-test showed statistical significance, and Cohen’s d was actually 0.66, which is fairly robust. I would not want to claim the difference is highly statistically significant, but certainly the t-test suggests, along with the visuals, that there was a somewhat significant lowering of nutritional values among Liberia’s children between the two years. Given the large increase in GDP/capita, this is in fact a troubling development. Even if results are not statistically significant, the fact there was no increase is in itself a matter of concern. Given six years of relative stabilty, there should have been at least a constant level of nutrition, if not even a slight increase. Something is going on that merits further investigation.

200709_Final_Project

Jerome

7/9/2020

Introduction:

Basic Set-up for R analysis.

As will be seen in the two plots, overall nutritional levels were lower in 2013 than they were in 2007 across all educational levels and across all wealth levels.

The visuals presented above show a relationship – over time, child nutritional levels have decreased. But is that decrease statistically significant? A two-sample t-test can show statistical significance. A regression analysis may also show significance. Both tests are shown below.

Now re-run the t-test and regression on the smaller data set.

See the accompanying Word document for answers to the three questions.