Project 2 Deliverable

My selected household survey was conducted in Gabon. The original survey consists of two tables, a ‘household’ table and an ‘individual’ table. Once these tables were joined, the Gabon survey contained 6183 observations (each representing one interviewee) and 4531 variables (most of which were repeated). The variables covered topics ranging from family life and pregnancy to the square footage of each home. I selected the variables I believed would be useful for predicting level of education and for estimating the number and location of households. The main selected variables were education, sex, age, household size, weights, province, unit #, and department. Of these, education, sex, age, and household size were used to estimate level of education.

The spatial distribution of households at the adm0 level of Gabon spans the entire country. This plot was created using our point process model with an estimated number of households, derived by dividing the population of Gabon by the mean household size. From these estimated household locations, I create a table with two columns and expand households to persons by building pivot tables from three variables in the ‘individual’ table: sex, age, and education. These pivot tables are then bound into a new dataframe containing an observation for each household at the adm0 level (the estimated total number of households, not only those reached by the survey administrators). This process is computationally demanding for Gabon, which has a population nearing 3 million people. Increasing the RAM available to R with ‘memory.limit(size = #####)’ resolved the high memory usage that had caused R to exit with an error. My error in estimating the number of households in all of Gabon was only 0.002%.
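The two steps described above (estimating the household count, then expanding households to persons) can be sketched as follows. This is a minimal illustration in standard-library Python, not the R code used for the project: the function names, the optional use of survey weights in the mean, and the toy values are all assumptions made for the example.

```python
def estimate_household_count(population, household_sizes, weights=None):
    """Estimate the number of households as population / mean household size.

    Using survey weights in the mean is an assumption; with weights=None
    this is the plain unweighted mean.
    """
    if weights is None:
        weights = [1.0] * len(household_sizes)
    mean_size = sum(s * w for s, w in zip(household_sizes, weights)) / sum(weights)
    return round(population / mean_size)


def expand_to_persons(households):
    """Turn one record per household into one record per household member."""
    persons = []
    for hh_id, hh in enumerate(households):
        for sex, age, education in zip(hh["sex"], hh["age"], hh["education"]):
            persons.append(
                {"hh_id": hh_id, "sex": sex, "age": age, "education": education}
            )
    return persons


# Hypothetical toy values, not taken from the Gabon survey.
n_households = estimate_household_count(1200, [2, 4, 6])

households = [
    {"sex": [1, 2], "age": [34, 31], "education": [2, 3]},
    {"sex": [2], "age": [58], "education": [1]},
]
persons = expand_to_persons(households)
```

The person table grows with the total population, not with the number of surveyed households, which is why the adm0 run is memory-intensive.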

After estimating household locations and simulating demographic data at the adm0 level, I repeated the process at the adm2 level for my selected area of interest: Woleu. This was much less computationally demanding than the adm0 analysis because the population being estimated is far smaller. I use the same process to estimate household locations, creating sex, age, and education pivots. I gather these pivots into a single dataframe and calculate my error by dividing the number of rows in the newly created dataframe by the sum of the weights in the same table. After subtracting 1 and multiplying by 100, the resulting error in my estimate of the number of households is 26.64%.
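The error calculation described above can be written as a short function. This is a sketch in Python for illustration only; the function name and the toy numbers are assumptions, not values from the Woleu data.

```python
def percent_error(n_simulated_rows, survey_weights):
    """Percent difference between the simulated row count and the weight total.

    Follows the steps in the text: divide the number of rows by the sum of
    the weights, subtract 1, then multiply by 100.
    """
    return (n_simulated_rows / sum(survey_weights) - 1) * 100


# Hypothetical toy values: 1100 simulated rows against weights summing to 1000.
error = percent_error(1100, [250.0, 250.0, 500.0])
```

With these toy values the result is roughly 10, meaning the simulation produced about 10% more rows than the weights imply. A negative result would indicate an undercount.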

Here is a plot of the distribution of household sizes in Woleu, with an accompanying table showing the wider distribution. The green line represents the household sizes drawn from the entire dataset, while the gold line represents household sizes within Woleu only. Although reduced, the range of household sizes within Woleu is still noticeably wide.

There is a definite decrease in accuracy in our estimate of the number of households. Using mean household size as the main input to that estimate may be less reliable at this administrative level. The difference in population between the whole of Gabon and Woleu is massive, and the number of residents per household varies widely in our survey data, so a single mean describes a small area poorly. One solution would be to use the distribution of household sizes to estimate the number of households per square kilometer, on the assumption that denser regions are likely to have larger households. Taking the population density (per km²) of each area into account in our formula could deliver more accurate household counts within that area, which in turn would produce a more accurate estimate of the total number of households in the given administrative boundary.
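One possible reading of this proposal is sketched below: instead of dividing an area's population by a single mean, household sizes are drawn from the observed size distribution until the area's population is filled. This is an illustrative Python sketch, not something implemented in the project. The function name, the per-cell populations, and the idea of passing a different size distribution for each density class are all assumptions.

```python
import random


def draw_households_for_cell(cell_population, observed_sizes, weights=None, seed=0):
    """Draw household sizes from the observed distribution until the cell is full.

    observed_sizes could be restricted to survey households from areas of
    similar density (an assumption); weights are optional survey weights.
    """
    if cell_population <= 0:
        return []
    if min(observed_sizes) < 1:
        raise ValueError("household sizes must be at least 1")
    rng = random.Random(seed)
    sizes = []
    total = 0
    while total < cell_population:
        size = rng.choices(observed_sizes, weights=weights, k=1)[0]
        sizes.append(size)
        total += size
    return sizes


# Hypothetical 1 km² cells with toy populations and toy size distributions.
dense_cell = draw_households_for_cell(50, [4, 5, 6, 8], seed=1)
sparse_cell = draw_households_for_cell(12, [1, 2, 3], seed=1)
household_counts = {"dense": len(dense_cell), "sparse": len(sparse_cell)}
```

The number of households per cell then comes from the draw itself, so areas with larger typical households receive fewer households for the same population. The last household drawn can overshoot the cell population slightly, which a fuller implementation would need to handle.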

Given the calculated error, I believe this simulated population is less accurate than the estimate at the adm0 level. Even so, the distribution of points follows a shape similar to that of the figure plotting the actual population, so I believe the simulated population is accurate enough for our purposes.

Estimated Locations:

Actual Locations:

A randomly generated synthetic population describing the demographic characteristics of households and persons could not approximate reality as closely as this model. The foundation of our estimate is real data representing the people of Gabon. Firstly, this model uses variables that are relevant to the population (has clean water, etc.), whereas a randomly generated synthetic population is filled with variables selected by the programmer. Secondly, the distribution of each variable in our model is simulated from the real distributions of age, household size, and education level among households in Gabon, whereas in a randomly generated population these distributions would also have to be guessed by the programmer. In general, a synthetic population created with complete spatial randomness cannot match the accuracy of a synthetic population generated from a legitimate data source.

Our final predictions for level of education are displayed here using two variables, age and household size. It is interesting to see a linearly decreasing slope. This suggests that as households grow, it becomes increasingly difficult for their members to obtain an education, likely because resources are spread more thinly.