---
title: "Predicting Poverty with Machine Learning"
output: 
  html_notebook:
    theme: united
    toc: yes
---
<style>

body{
 text-align: justify;
}

</style>




### Summary

1) Three models were used: xgboost, logistic regression, and k-nearest neighbors (*knn*).

2) The xgboost model had the best performance, with a **ROC AUC** of 0.8918.

3) The most important variable for predicting whether a household lives in poverty is the rent price in the neighborhood where the family resides.

4) Work-related variables are crucial, specifically the number of gainfully employed and not gainfully employed people in the household.

5) Certain characteristics of the head of the household are significant factors: age, years of schooling, whether they work in the formal or informal sector, and whether they received medical treatment in the last 12 months.

6) The tenure status of the home is important, in particular whether the family owns, rents, or illegally occupies the house.

7) The composition of the household is fundamental: the number of people living in the house and the presence of people over 60 or of minors.


### Intro

![](http://observatorio.ministeriodesarrollosocial.gob.cl/images/casen_2020.svg)


It is often difficult to learn people's real income from surveys, even more so under complex circumstances such as the 2020 health crisis. Without good income data, it is very difficult to know the real poverty level of a community. This is why I used the CASEN 2020 survey. The National Socioeconomic Characterization Survey (CASEN) is carried out by Chile's Ministry of Social Development. One of its objectives is to provide periodic information on the situation of households and the population, especially those living in poverty and the groups that social policy defines as priorities. The poverty line varies with the number of people in a household: if a family's monthly income is below the line for its size, the household is considered poor. The table below shows the poverty line for each household size.



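```{r,echo=FALSE,warning=FALSE,message=FALSE}
# Packages used by the chunks in this notebook.
library(tidyverse)   # tibble, dplyr, tidyr, forcats, ggplot2 and the %>% pipe
library(tidymodels)  # collect_metrics(), last_fit(), conf_mat()
library(vip)         # vip() variable importance plots
library(plotly)      # plot_ly() interactive charts
library(kableExtra)
library(knitr)

# Poverty line for each household size.
x <- tibble(
  "Number Of People by Household" = 1:10,
  "Poverty Line (US Dollars)" = c(218, 355, 475, 578, 675, 767, 855, 939, 1020, 1097)
)

x %>%
  kbl(align = "cc") %>%
  kable_paper("hover", full_width = FALSE)
```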
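
As a rough illustration of how the table is applied, the sketch below flags a household as poor when its monthly income is below the line for its size. The helper name `is_poor`, the example households, and the choice to return `NA` for households with more than 10 members (the table stops at 10) are assumptions made for this example; they are not part of the CASEN methodology.

```r
# Poverty line in US dollars by number of household members (values from the table above).
poverty_line <- c(218, 355, 475, 578, 675, 767, 855, 939, 1020, 1097)

# Hypothetical helper: TRUE when the monthly household income is below the
# line for that household size. Sizes above 10 return NA because the table
# does not cover them.
is_poor <- function(monthly_income, n_people) {
  line <- poverty_line[n_people]   # an index beyond 10 yields NA
  monthly_income < line
}

# Made-up households, for illustration only.
households <- data.frame(
  n_people       = c(1, 4, 4, 12),
  monthly_income = c(200, 578, 600, 900)
)
households$poor <- is_poor(households$monthly_income, households$n_people)
print(households)
```

Note that an income exactly equal to the line (578 for a household of 4) is not flagged, because the rule is "below this value".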

My goal was to predict, with classification models, whether a family lives in poverty without looking at income data. To do so, I removed every income-related question from the database and used the remaining variables (health, housing, work, education, etc.) to predict whether a family is poor with several methods (xgboost, logistic regression, and knn). The objective was to identify the most important variables for predicting a household's poverty status when its income data is not available.

### Description of the data

The CASEN survey covers 7 topics. The first, called "resident registration", records people's demographic information, such as gender, age, and marital status. The second is education; it includes indicators such as the population's level of education and the proportion of people who are not part of the educational system. The third is work; its questions make it possible to estimate indicators of the population's occupational situation (unemployment and employment rates) and to characterize the employed labor force: the economic sector in which people work, their type of employment contract, their type of occupation, etc.

The fourth is income; its questions collect information about the different income streams that individuals and households receive. The fifth is health; it covers the coverage of the health care systems, effective access to health care, and the health status of the respondents. The sixth is *identities*; it includes questions about belonging to an indigenous group and about international migration. It also incorporates a set of questions that measure food insecurity according to the international scale recommended by the Food and Agriculture Organization.

The last topic is housing; its questions make it possible to estimate indicators of the basic characteristics of the country's homes and of their habitability conditions, such as sanitation, doubling up (households sharing a dwelling), and overcrowding. The complete list of variables can be reviewed [here](http://observatorio.ministeriodesarrollosocial.gob.cl/storage/docs/casen/2020/Libro_de_codigos_Base_de_Datos_Casen_en_Pandemia_2020.pdf).



### Treatment of variables

First, all income variables and all categorical variables with more than 30 levels were removed. Then only the observations belonging to heads of households were kept, both to have one observation per household and because the full database was so large that the models took a long time to train. For categorical variables with NAs, an *unknown* category was created, and all categorical variables were then converted into *dummy* variables. Missing values in the numerical variables were imputed with the median. Finally, the numerical variables were log-transformed and then normalized.
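
The steps above can be sketched in base R on a toy data frame. This is a minimal illustration, not the code used for the project: the column names and values are made up, and `log1p()` is an assumption (the text only says the variables were log-transformed; `log1p()` keeps zeros finite).

```r
# Toy data standing in for the survey (all names and values are made up).
toy <- data.frame(
  tenure = factor(c("owner", "renter", NA, "owner", "renter")),
  age    = c(25, 40, NA, 60, 33),
  rent   = c(300, NA, 150, 500, 250)
)

# 1) Categorical NA -> explicit "unknown" level.
tenure <- as.character(toy$tenure)
tenure[is.na(tenure)] <- "unknown"
toy$tenure <- factor(tenure)

# 2) Numeric NA -> median of the observed values.
num_cols <- c("age", "rent")
for (col in num_cols) {
  med <- median(toy[[col]], na.rm = TRUE)
  toy[[col]][is.na(toy[[col]])] <- med
}

# 3) Log transform (log1p is an assumption), then normalize each column
#    to mean 0 and standard deviation 1.
toy[num_cols] <- lapply(toy[num_cols], function(x) as.numeric(scale(log1p(x))))

# 4) One dummy (0/1) column per non-reference factor level.
dummies   <- model.matrix(~ tenure, data = toy)[, -1, drop = FALSE]
processed <- cbind(toy[num_cols], dummies)
print(processed)
```

In a tidymodels workflow, the same ideas would typically be expressed as `recipes` steps such as `step_unknown()`, `step_dummy()`, `step_log()`, and `step_normalize()`.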

### Exploratory Analysis


```{r,echo = FALSE, warning=FALSE, message=FALSE}
# Poverty rate (share of households with pobreza == "1") by sex of the head.
sexo <- train %>%
  mutate(pobreza = pobreza == "1") %>%
  group_by(sexo) %>%
  summarise(pobreza = mean(pobreza)) %>%
  rename(Variable = sexo) %>%
  mutate(Variable = case_when(Variable == "Hombre" ~ "Man",
                              Variable == "Mujer" ~ "Woman"))

# By age group. right = FALSE gives the bins [18, 30), [30, 45), [45, 60) and
# [60, Inf), which is what the labels describe.
edad <- train %>%
  mutate(edad = cut(edad, breaks = c(18, 30, 45, 60, Inf), right = FALSE,
                    labels = c("Age 18-29", "Age 30-44", "Age 45-59", "Age +60"))) %>%
  mutate(pobreza = pobreza == "1") %>%
  group_by(edad) %>%
  summarise(pobreza = mean(pobreza)) %>%
  rename(Variable = edad)

# By indigenous status.
etnia <- train %>%
  mutate(pobreza = pobreza == "1") %>%
  group_by(etnia) %>%
  summarise(pobreza = mean(pobreza)) %>%
  rename(Variable = etnia) %>%
  mutate(Variable = case_when(Variable == "No pertenece a ninguno pueblo indígena" ~ "Non-Indigenous",
                              Variable == "Pertenece a pueblos indígenas" ~ "Indigenous"))

# By rural or urban area.
zona <- train %>%
  mutate(pobreza = pobreza == "1") %>%
  group_by(zona) %>%
  summarise(pobreza = mean(pobreza)) %>%
  rename(Variable = zona) %>%
  mutate(Variable = case_when(Variable == "Rural" ~ "Rural",
                              Variable == "Urbano" ~ "Urban"))

# By immigration status.
inmigrante <- train %>%
  mutate(pobreza = pobreza == "1") %>%
  group_by(inmigrante) %>%
  summarise(pobreza = mean(pobreza)) %>%
  rename(Variable = inmigrante) %>%
  mutate(Variable = case_when(Variable == "No inmigrante" ~ "Non-Immigrant",
                              Variable == "Inmigrante" ~ "Immigrant"))

# Stack the five summaries into one table for plotting.
juntas <- rbind(sexo, edad, etnia, zona, inmigrante)
```

```{r,echo = FALSE, warning=FALSE, message=FALSE}
juntas %>%
  # Drop groups whose label is NA instead of relying on row positions.
  filter(!is.na(Variable)) %>%
  mutate(Variable = fct_reorder(Variable, pobreza)) %>%
  ggplot(aes(pobreza, Variable)) +
  geom_segment(aes(x = 0, y = Variable, xend = pobreza, yend = Variable), color = "grey50") +
  geom_point() +
  ggtitle("Poverty Rate by Head of Household Demographic Characteristics") +
  labs(x = "Proportion of Poor Households", y = NULL) +
  theme_light()
```


The first thing I did was analyze the demographic variables of the heads of households to detect any relationship with poverty. The plot shows that poverty is more frequent in certain groups. Households whose head is young, immigrant, indigenous, or female tend to be poorer than their counterparts. Furthermore, rural households have a higher proportion of poverty than urban ones. It is striking that more than 20% of the households headed by an immigrant are in poverty. The data was collected during the pandemic, and much of the aid that the government gave families to alleviate its burden did not reach immigrants. It is therefore not possible to know whether this group would also have the highest proportion of poor households under normal conditions.

It seems that the younger the head of the household, the greater the probability that the household lives below the poverty line. It is difficult to find a clear reason; it may be that the pandemic hit the employment of young people the hardest. The database has very few numerical variables; some of them are explored below.



```{r,echo = FALSE, warning=FALSE, message=FALSE}
# Boxplots of four numerical variables, split by poverty status.
train %>%
  pivot_longer(c(edad, esc, n_ocupados, numper), names_to = "stat", values_to = "value") %>%
  mutate(stat = case_when(stat == "edad" ~ "Head of House Age",
                          stat == "esc" ~ "Head of House Schooling Years",
                          stat == "n_ocupados" ~ "Number of Employed People by Household",
                          stat == "numper" ~ "Number of People by Household"),
         pobreza = ifelse(pobreza == "1", "Poor", "Not Poor")) %>%
  ggplot(aes(pobreza, value, fill = pobreza, color = pobreza)) +
  geom_boxplot(alpha = 0.4) +
  facet_wrap(~stat, scales = "free_y", nrow = 2) +
  labs(y = NULL, x = "Poverty", color = NULL, fill = NULL) +
  theme_light()
```

There are noticeable differences in 3 of the 4 variables in the graph. The age of the household head (as noted previously), their years of schooling, and the number of people working in the household all appear to be related to poverty. The same cannot be said of the number of people living in the household.

Let's now look at other variables that might be interesting.

```{r,echo=FALSE, message=FALSE,fig.width=7.5, fig.height=4.2, warning=FALSE}
# Share of poor and non-poor households within each tenure category (v13).
train %>%
  mutate(v13 = case_when(v13 == "Propia" ~ "Owner",
                         v13 == "Arrendada" ~ "Renting",
                         v13 == "Cedida" ~ "Granted Free of Charge",
                         v13 == "Usufructo (sólo uso y goce)" ~ "Other",
                         v13 == "Ocupación irregular (de hecho)" | v13 == "Poseedor irregular" ~ "Illegal Occupation"),
         pobreza = ifelse(pobreza == "1", "Poor", "Not Poor")) %>%
  count(pobreza, v13) %>%
  group_by(v13) %>%
  mutate(prop = n / sum(n)) %>%
  plot_ly(x = ~prop, y = ~v13, color = ~pobreza) %>%
  add_bars() %>%
  layout(barmode = "stack", title = "Poverty by Tenure Status of Households",
         yaxis = list(title = ""))
```


Looking at poverty by type of tenure, households that rent and those that illegally occupy their home show a higher proportion of poverty. For renters, part of their income goes to rent and therefore cannot be spent on basic necessities. Households that illegally occupy a property are probably in an even more dire economic situation than renters.

```{r,echo=FALSE,message=FALSE,warning=FALSE,fig.width=7.5, fig.height=4.2}
# Share of poor and non-poor households by the medical treatment answer (s28).
train %>%
  mutate(s28 = case_when(s28 == "No ha estado en tratamiento por ninguna condición de salud anterior" ~ "No",
                         s28 == "No sabe/No recuerda" ~ "Don't Know/Don't Remember",
                         # Any other non-missing answer names a condition that was treated.
                         !is.na(s28) ~ "Yes"),
         pobreza = ifelse(pobreza == "1", "Poor", "Not Poor")) %>%
  count(pobreza, s28) %>%
  group_by(s28) %>%
  mutate(prop = n / sum(n)) %>%
  plot_ly(x = ~prop, y = ~s28, color = ~pobreza) %>%
  add_bars() %>%
  layout(barmode = "stack", title = "During the past 12 months, have you been in medical treatment?",
         yaxis = list(title = ""))
```

Another important variable is medical treatment, in particular when the head of the household answers that they do not know or do not remember whether they received medical treatment during the last 12 months. People who suffer from a disease with a high social stigma may prefer to give this type of answer. Alternatively, not knowing or not remembering may itself be a sign of a health condition that affects a person's ability to earn an income.

Another interesting variable is the rent price in the neighborhood where the family resides. This value is obtained by asking people directly: how much do you pay for rent in this sector? The graph below shows the distribution of this value for poor and non-poor households. Non-poor households clearly live in neighborhoods where rent is more expensive than in the neighborhoods where poor families live.


```{r,echo=FALSE, warning=FALSE, message=FALSE,fig.width=7.5, fig.height=4.2}
# Households with a reported neighborhood rent (v19), split by poverty status.
d1 <- filter(train, pobreza == "1" & !is.na(v19))
d2 <- filter(train, pobreza == "0" & !is.na(v19))

# Kernel density estimate of the rent price for each group.
density1 <- density(d1$v19)
density2 <- density(d2$v19)

# Both axes are plotted as log(value + 1).
plot_ly(opacity = 0.8) %>%
  add_lines(x = ~log(density1$x + 1), y = ~log(density1$y + 1), name = "poor") %>%
  add_lines(x = ~log(density2$x + 1), y = ~log(density2$y + 1), name = "not poor") %>%
  layout(xaxis = list(title = 'Log(Rent Price + 1)'),
         yaxis = list(title = 'Log(Density + 1)'),
         title = "Neighborhood Rent Price Distribution")
```



### Results




```{r,echo=FALSE}
collect_metrics(final_res) 

```



```{r,echo=FALSE,warning=FALSE,message=FALSE}
final_xgb %>%
    fit(data = train) %>%
    pull_workflow_fit() %>%
    vip(geom = "point", num_features = 20)+
  theme_light()
```

I tested three models: xgboost, logistic regression, and k-nearest neighbors (*knn*). The best performer was the xgboost model, with a **ROC AUC** of 0.8918. This model was validated with *k-fold cross-validation*, with k = 5. The number of trees was fixed at 1000. The hyperparameters that gave the best results were the following:


```{r,echo=FALSE,warning=FALSE}
library(kableExtra)

library(knitr)

x<-tibble(
  "mtry" = 67, 
  "min_n" = 14,
  "tree_depth" = 6,
  "learn_rate" = 0.01408179,
  "loss_reduction" = "1.906337e-10",
  "sample_size" = 0.6928669
)

x %>% kbl(align="cc") %>%
  kable_paper("hover", full_width = F)

```
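
To make the validation procedure concrete, here is a minimal base R sketch of how a 5-fold cross-validated ROC AUC can be computed. It runs on simulated data, and logistic regression stands in for xgboost so that the example needs no extra packages; the predictors `x1` and `x2`, the helper `roc_auc`, and the sample size are assumptions made for illustration. It does not reproduce the 0.8918 reported above.

```r
set.seed(2020)

# Simulated binary outcome driven by two made-up predictors.
n   <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$poor <- rbinom(n, 1, plogis(-1 + 1.5 * dat$x1 - 1 * dat$x2))

# ROC AUC via the rank (Mann-Whitney) formulation: the probability that a
# randomly chosen positive case gets a higher score than a random negative one.
roc_auc <- function(score, truth) {
  r     <- rank(score)              # average ranks handle ties
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Assign each row to one of k = 5 folds at random.
k     <- 5
folds <- sample(rep(seq_len(k), length.out = n))

# Fit on k - 1 folds, score the held-out fold, and repeat for every fold.
fold_auc <- sapply(seq_len(k), function(i) {
  fit  <- glm(poor ~ x1 + x2, data = dat[folds != i, ], family = binomial())
  pred <- predict(fit, newdata = dat[folds == i, ], type = "response")
  roc_auc(pred, dat$poor[folds == i])
})

print(round(fold_auc, 3))
print(mean(fold_auc))   # cross-validated estimate of the ROC AUC
```

Averaging over folds gives a more stable estimate than a single split, because every observation is used for evaluation exactly once.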

These are the most important variables of the model:

![](C:/Users/Cayoyo/Desktop/R/xg.boost.PNG)

In general terms, one group of variables relates to the tenure of the home: whether the house is owned, rented, or illegally occupied. This group contains the variables **v13** and **ten_viv**. A second group relates to the employment situation of the household; examples are **n_ocupados** (number of employed people) and **n_inactivos** (number of not gainfully employed people). Another work-related variable is **ocup_inf**, which indicates whether the head of the household works in the formal or informal sector.

A third group is linked to characteristics of the head of the household: years of schooling (**esc2** and **esc**), age (**edad**), and whether the head of household received medical treatment in the last 12 months (**s28**). A fourth group is linked to the composition of the household: the presence of minors (**men18c**), the presence of people over 60 years old (**may60c**), and the number of people living in the household (**numviv**, **tot_per**, **numper**, and **p6**). Finally, the most important variable, which does not clearly belong to any of the previous groups, is **v19**: the value that people report when they are asked "How much is the rent in this neighborhood?"
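
The ranking discussed above is the one plotted with `vip()` for the fitted xgboost model. A different, model-agnostic way to build a comparable ranking is permutation importance: shuffle one predictor at a time and measure how much the model's accuracy drops. The sketch below is only an illustration of that idea under several assumptions: the data is simulated, the predictor names `rent`, `n_employed`, and `noise` are made up, logistic regression replaces xgboost, and the importance is computed on the training data for brevity.

```r
set.seed(2020)

# Simulated data: two informative predictors and one pure-noise predictor.
n   <- 1000
dat <- data.frame(
  rent       = rnorm(n),
  n_employed = rnorm(n),
  noise      = rnorm(n)
)
dat$poor <- rbinom(n, 1, plogis(-2.5 * dat$rent - 1.5 * dat$n_employed))

fit <- glm(poor ~ rent + n_employed + noise, data = dat, family = binomial())

# Share of correct class predictions at a 0.5 probability threshold.
accuracy <- function(model, data) {
  pred <- predict(model, newdata = data, type = "response") > 0.5
  mean(pred == (data$poor == 1))
}

baseline <- accuracy(fit, dat)

# Permutation importance: shuffling a predictor breaks its link with the
# outcome; the average drop in accuracy measures how much the model used it.
perm_importance <- function(var, n_rep = 20) {
  drops <- replicate(n_rep, {
    shuffled        <- dat
    shuffled[[var]] <- sample(shuffled[[var]])
    baseline - accuracy(fit, shuffled)
  })
  mean(drops)
}

predictors <- c("rent", "n_employed", "noise")
importance <- sapply(predictors, perm_importance)
print(sort(importance, decreasing = TRUE))
```

A predictor the model does not rely on, such as `noise` here, should show a drop close to zero.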


```{r,echo=FALSE, warning=FALSE}
xg_conf <- final_res %>%
  unnest(.predictions) %>%
  conf_mat(pobreza, .pred_class)

xg_conf


```


```{r,echo=FALSE,warning=FALSE}
final_lr %>%
  fit(data = train) %>%
  pull_workflow_fit() %>%
  vip(geom = "point", num_features = 20)+
  theme_light()
```


```{r,echo=FALSE}
final_res.ln <- last_fit(final_lr, spl)

collect_metrics(final_res.ln)
```


### Final Thoughts

First of all, this data was collected during the pandemic, so the conclusions may not hold under "normal" conditions. With that said, I was surprised that, contrary to what I initially believed, many of the demographic variables, such as ethnicity or gender, were not as important as expected.

Public policy can plausibly play an important role in reducing poverty. Elements such as health, education, and housing have strong predictive power for poverty. If public policies that facilitate access to these 3 elements are strengthened, I believe poverty could be reduced further.

This project was extremely complex to carry out, in large part because of the size of the database: tuning the hyperparameters of the models took a long time, which made the workflow very slow. I therefore think there is a lot of room to improve the models with access to servers that speed up the process. Finally, I would like to stress again that one has to be very careful about generalizing from the results of this project, since the data was collected in the midst of a situation as unusual as the 2020 health crisis.












