This study examines the extent and determinants of poverty in Somaliland using the Wealth Index as the dependent variable. Independent variables include individual factors (age, sex, education, number of household members, living children, and household head’s marital status), community-level factors (residence and region), and household-level factors (access to safe drinking water and sanitation facilities). These variables collectively provide a comprehensive framework for exploring poverty drivers.
library(haven)
EData<-read_dta("~/SLDHS.dta")
library(dplyr)
relevant <- EData %>%
select("V190", "V024", "V025", "V106", "V152", "V151", "V136", "V201", "V501", "V113", "V116")
str(relevant)
## tibble [17,686 × 11] (S3: tbl_df/tbl/data.frame)
## $ V190: dbl+lbl [1:17686] 5, 5, 5, 5, 5, 3, 3, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3...
## ..@ label : chr "Wealth index combined"
## ..@ format.stata: chr "%1.0f"
## ..@ labels : Named num [1:5] 1 2 3 4 5
## .. ..- attr(*, "names")= chr [1:5] "Lowest" "Second" "Middle" "Fourth" ...
## $ V024: dbl+lbl [1:17686] 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, ...
## ..@ label : chr "Region"
## ..@ format.stata: chr "%2.0f"
## ..@ labels : Named num [1:6] 11 12 13 14 15 16
## .. ..- attr(*, "names")= chr [1:6] "Awdal" " Marodijeh" "Sahil" "Togdheer" ...
## $ V025: dbl+lbl [1:17686] 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## ..@ label : chr "Type of place of residence"
## ..@ format.stata: chr "%1.0f"
## ..@ labels : Named num [1:6] 1 2 3 4 5 6
## .. ..- attr(*, "names")= chr [1:6] "Rural" "Urban" "Nomadic" "Rural IDP" ...
## $ V106: dbl+lbl [1:17686] 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## ..@ label : chr "Highest educational level"
## ..@ format.stata: chr "%1.0f"
## ..@ labels : Named num [1:4] 0 1 2 3
## .. ..- attr(*, "names")= chr [1:4] "No Education" "Primary" "Secondary" "Higher"
## $ V152: num [1:17686] 23 23 23 23 23 61 61 23 23 23 ...
## ..- attr(*, "label")= chr "Age of household head"
## ..- attr(*, "format.stata")= chr "%2.0f"
## $ V151: num [1:17686] 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "label")= chr "Sex of household head"
## ..- attr(*, "format.stata")= chr "%1.0f"
## $ V136: num [1:17686] 6 6 6 6 6 4 4 6 6 6 ...
## ..- attr(*, "label")= chr "Number of household members (listed)"
## ..- attr(*, "format.stata")= chr "%1.0f"
## $ V201: num [1:17686] 5 5 5 5 5 2 2 4 4 4 ...
## ..- attr(*, "label")= chr "Total children ever born"
## ..- attr(*, "format.stata")= chr "%2.0f"
## $ V501: dbl+lbl [1:17686] 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## ..@ label : chr "Current marital status"
## ..@ format.stata: chr "%1.0f"
## ..@ labels : Named num [1:4] 0 1 2 3
## .. ..- attr(*, "names")= chr [1:4] "Never Married" "Married" "Divorced" "Widowed"
## $ V113: num [1:17686] 11 11 11 11 11 13 13 13 13 13 ...
## ..- attr(*, "label")= chr "Source of drinking water"
## ..- attr(*, "format.stata")= chr "%2.0f"
## $ V116: num [1:17686] 23 23 23 23 23 61 61 23 23 23 ...
## ..- attr(*, "label")= chr "Type of toilet facility"
## ..- attr(*, "format.stata")= chr "%2.0f"
variable_names<- colnames(relevant)
print(variable_names)
## [1] "V190" "V024" "V025" "V106" "V152" "V151" "V136" "V201" "V501" "V113"
## [11] "V116"
variable_types <- sapply(relevant, class)
print(variable_types)
## $V190
## [1] "haven_labelled" "vctrs_vctr" "double"
##
## $V024
## [1] "haven_labelled" "vctrs_vctr" "double"
##
## $V025
## [1] "haven_labelled" "vctrs_vctr" "double"
##
## $V106
## [1] "haven_labelled" "vctrs_vctr" "double"
##
## $V152
## [1] "numeric"
##
## $V151
## [1] "numeric"
##
## $V136
## [1] "numeric"
##
## $V201
## [1] "numeric"
##
## $V501
## [1] "haven_labelled" "vctrs_vctr" "double"
##
## $V113
## [1] "numeric"
##
## $V116
## [1] "numeric"
Numeric variables are suited to quantitative analysis, such as calculating averages or performing statistical tests, while factors are suited to categorical analysis, such as creating frequency tables or visualizing data by group.
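The distinction can be illustrated with a minimal sketch. The data frame `toy` and its values are hypothetical and are not taken from the SLDHS file; the education labels follow the V106 labels shown above.

```r
# Toy data: hypothetical values, not taken from the SLDHS file
toy <- data.frame(
  members   = c(6, 4, 6, 3, 8),  # numeric: household size
  education = c(0, 1, 0, 2, 0)   # coded categories stored as numbers
)

# Numeric variable: arithmetic summaries are meaningful
mean_members <- mean(toy$members)

# Coded categorical variable: convert to a factor before tabulating,
# because the mean of category codes has no useful interpretation
toy$education <- factor(toy$education,
                        levels = c(0, 1, 2, 3),
                        labels = c("No Education", "Primary", "Secondary", "Higher"))
edu_counts <- table(toy$education)

print(mean_members)
print(edu_counts)
```

Declaring all four levels keeps empty categories (here "Higher") in the table with a count of zero, which makes tables comparable across subsets.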
mean(EData$V136, na.rm= TRUE)
## [1] 5.107781
median(EData$V136, na.rm= TRUE)
## [1] 5
sd(EData$V136, na.rm= TRUE)
## [1] 2.475511
freq_table <- table(EData$V106)
education_labels <- c("No Education", "Primary", "Secondary", "Higher")
names(freq_table) <- education_labels
print(freq_table)
## No Education Primary Secondary Higher
## 15287 1991 311 97
3. Calculate the proportion of households in each wealth quintile (V190)
proportions_v190 <- prop.table(table(EData$V190))
# Display the proportions
proportions_v190
##
## 1 2 3 4 5
## 0.3575144 0.1705304 0.1161371 0.1643673 0.1914509
# Assign descriptive labels to the wealth quintiles
EData$V190 <- factor(EData$V190,
levels = c(1, 2, 3, 4, 5),
labels = c("Lowest", "Second", "Middle", "Fourth", "Highest"))
# Recalculate proportions with labels
labeled_proportions <- prop.table(table(EData$V190))
# Display labeled proportions
labeled_proportions
##
## Lowest Second Middle Fourth Highest
## 0.3575144 0.1705304 0.1161371 0.1643673 0.1914509
# V152 is the age of the household head (V151 is the sex of the household head)
age_household_head <- EData$V152
# V201 is labelled "Total children ever born"
children_ever_born <- EData$V201
correlation <- cor(age_household_head, children_ever_born, use = "complete.obs")
cat("Correlation Coefficient between age of household head and number of children ever born:", correlation, "\n")
The correlation between the household head’s age (V152) and the number of children ever born (V201) is computed with cor(), which returns Pearson’s coefficient by default: a value between -1 and 1 that gives the direction and strength of the linear relationship. For instance, cor(EData$V152, EData$V201, use = "complete.obs") computes it from the rows where both values are present. The value originally reported here, 0.0005634619, was obtained with V151, which the str() output labels "Sex of household head", so it does not describe the relationship with age. In the str() output above, V152 also shows the same leading values as V116 (type of toilet facility), so the age column should be checked before the result is interpreted.
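As an illustration of what `use = "complete.obs"` does, the sketch below computes Pearson's coefficient on hypothetical vectors with missing entries, first with cor() and then written out by hand for the complete pairs. The values are invented for the example.

```r
# Hypothetical values with some missing entries
age      <- c(25, 40, NA, 55, 33, 61)
children <- c( 1,  4,  3, NA,  2,  7)

# use = "complete.obs" keeps only the pairs where both values are present
r_builtin <- cor(age, children, use = "complete.obs")

# The same computation written out for the complete pairs
ok <- complete.cases(age, children)
x <- age[ok]
y <- children[ok]
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

print(c(builtin = r_builtin, manual = r_manual))
```

The two values agree, which confirms that pairs with a missing value on either side are dropped before the coefficient is calculated.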
relevant$V190 <- as.numeric(as.character(relevant$V190))
# Create the poverty_status variable
relevant$poverty_status <- ifelse(relevant$V190 <= 3, 1, 2)
# Label the poverty_status variable
relevant$poverty_status <- factor(relevant$poverty_status,
levels = c(1, 2),
labels = c("Poor", "Non-Poor"))
# Verify the new variable
table(relevant$poverty_status)
##
## Poor Non-Poor
## 11393 6293
summary(relevant)
## V190 V024 V025 V106
## Min. :1.000 Min. :11.00 Min. :1.000 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:13.00 1st Qu.:1.000 1st Qu.:0.0000
## Median :2.000 Median :14.00 Median :1.000 Median :0.0000
## Mean :2.662 Mean :13.95 Mean :1.498 Mean :0.1642
## 3rd Qu.:4.000 3rd Qu.:15.00 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :5.000 Max. :16.00 Max. :2.000 Max. :3.0000
##
## V152 V151 V136 V201
## Min. :11.0 Min. :1.000 Min. :0.000 Min. : 0.000
## 1st Qu.:13.0 1st Qu.:1.000 1st Qu.:3.000 1st Qu.: 4.000
## Median :22.0 Median :1.000 Median :5.000 Median : 6.000
## Mean :26.2 Mean :1.376 Mean :5.108 Mean : 6.274
## 3rd Qu.:23.0 3rd Qu.:2.000 3rd Qu.:7.000 3rd Qu.: 8.000
## Max. :96.0 Max. :2.000 Max. :9.000 Max. :19.000
## NA's :862 NA's :858 NA's :2303 NA's :13
## V501 V113 V116 poverty_status
## Min. :1.000 Min. :11.00 Min. :11.0 Poor :11393
## 1st Qu.:1.000 1st Qu.:12.00 1st Qu.:13.0 Non-Poor: 6293
## Median :1.000 Median :31.00 Median :22.0
## Mean :1.138 Mean :33.44 Mean :26.2
## 3rd Qu.:1.000 3rd Qu.:61.00 3rd Qu.:23.0
## Max. :3.000 Max. :96.00 Max. :96.0
## NA's :862 NA's :862
# Drop rows with missing values in specific variables
data_cleaned <- relevant %>%
filter(
!is.na(V190), !is.na(V024), !is.na(V025),
!is.na(V106), !is.na(V152), !is.na(V151),
!is.na(V136), !is.na(V201), !is.na(V501),
!is.na(V113), !is.na(V116)
)
# Check dimensions of the cleaned data
dim(data_cleaned)
## [1] 14514 12
# Verify no missing values remain in specific variables
sapply(data_cleaned[, c("V190", "V024", "V025", "V106", "V152", "V151",
"V136", "V201", "V501", "V113", "V116")], function(x) sum(is.na(x)))
## V190 V024 V025 V106 V152 V151 V136 V201 V501 V113 V116
## 0 0 0 0 0 0 0 0 0 0 0
# Classify Source of Drinking Water (V113)
data_cleaned$V113_recode <- ifelse(data_cleaned$V113 %in% c(11, 12, 13, 21, 31, 41),
                                   "Improved", "Unimproved")
# Verify the classification
table(data_cleaned$V113_recode)
##
## Improved Unimproved
## 7474 7040
# Classify Type of Toilet Facility (V116)
data_cleaned$V116_recode <- ifelse(data_cleaned$V116 %in% c(11, 12, 13, 14, 15, 21, 41),
"Improved", "Unimproved")
# Verify the classification
table(data_cleaned$V116_recode)
##
## Improved Unimproved
## 6249 8265
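Missing values are handled here by listwise deletion: filter() with !is.na() removes every row that has a missing value in any of the specified columns, which reduces the data from 17,686 to 14,514 rows. Base R's complete.cases() gives the same result. The alternatives to removing rows are imputing the missing values or flagging them.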
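A minimal sketch of the complete.cases() approach, together with flagging as an alternative to deletion, is shown below. The data frame `toy` and the column `other` are hypothetical; only the variable names V136, V201 and V113 are borrowed from the analysis.

```r
# Hypothetical data frame with missing values
toy <- data.frame(
  V136  = c(6, NA, 4, 5),
  V201  = c(5, 2, NA, 4),
  V113  = c(11, 13, 31, 61),
  other = c(NA, 1, 2, 3)   # not among the checked columns
)
check_cols <- c("V136", "V201", "V113")

# Removing: keep rows with no missing value in the checked columns
cleaned <- toy[complete.cases(toy[, check_cols]), ]

# Flagging: mark incomplete rows instead of dropping them
toy$incomplete <- !complete.cases(toy[, check_cols])

print(dim(cleaned))
print(colSums(is.na(toy[, check_cols])))
```

Restricting complete.cases() to the checked columns matters: a missing value in an unrelated column (here `other`) does not cause a row to be dropped.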
hist(EData$V136,
main = "Distribution of Number of Household Members",
xlab = "Number of Household Members",
col = "blue",
breaks = 10)
The histogram shows the distribution of the number of household members (V136), which ranges from 0 to 9 with a mean of about 5.1. Most households have between 3 and 7 members; households with fewer than 2 or more than 7 members are less common.
library(ggplot2)
barplot(prop.table(table(data_cleaned$poverty_status)),
main = "Proportion of Households by Poverty Status",
xlab = "Poverty Status",
ylab = "Proportion",
col = c("red", "green"),
names.arg = c("Poor", "Non-Poor"))
The bar chart shows that a larger proportion of households fall under the “Poor” category compared to the “Non-Poor” category, with the proportion of poor households exceeding 60%. This indicates that poverty is more prevalent among the households analyzed.
# V201 is labelled "Total children ever born" in the data file
boxplot(data_cleaned$V201 ~ data_cleaned$poverty_status,
        main = "Number of Children Ever Born by Poverty Status",
        xlab = "Poverty Status",
        ylab = "Number of Children Ever Born",
        col = c("red", "green"))
The boxplot shows that poor and non-poor households have similar median numbers of children ever born (V201), but poor households exhibit slightly greater variability. Poor households also have more high-end outliers, suggesting a slight tendency toward larger family sizes among the poor.
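A visual reading of a boxplot can be backed up with group summaries. The sketch below computes the median, interquartile range, and number of boxplot outliers per group on hypothetical data; the values are illustrative only and do not reproduce the survey results.

```r
# Hypothetical data; values are illustrative only
toy <- data.frame(
  children = c(4, 6, 8, 5, 12, 6, 5, 7, 6, 6),
  status   = factor(rep(c("Poor", "Non-Poor"), each = 5),
                    levels = c("Poor", "Non-Poor"))
)

# Centre and spread of each group
medians <- tapply(toy$children, toy$status, median)
iqrs    <- tapply(toy$children, toy$status, IQR)

# Number of points that boxplot() would draw as outliers in each group
outliers <- tapply(toy$children, toy$status,
                   function(v) length(boxplot.stats(v)$out))

print(rbind(median = medians, IQR = iqrs, outliers = outliers))
```

Reporting these numbers alongside the plot makes statements such as "similar medians" or "greater variability" checkable.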
Choosing a visualization that matches the variable type is important:
* A histogram gives a clear graphical representation of the distribution of a continuous variable.
* A bar plot suits categorical data, showing the count or proportion in each category.
* A boxplot summarizes and compares the distributions of a continuous variable across groups.
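The choice described above can be sketched as a small helper that suggests a base-graphics function from the class of the variable. The function `suggest_plot` is hypothetical and written only for illustration; it covers just the three plot types used in this analysis.

```r
# Hypothetical helper: suggests a base-graphics plot from the variable's class
suggest_plot <- function(x, by = NULL) {
  if (is.numeric(x) && is.factor(by)) return("boxplot")  # continuous by group
  if (is.numeric(x)) return("hist")                      # one continuous variable
  if (is.factor(x) || is.character(x)) return("barplot") # categorical variable
  stop("no suggestion for this variable type")
}

# Hypothetical example data
members <- c(6, 4, 6, 3, 8, 5)
status  <- factor(c("Poor", "Poor", "Non-Poor", "Poor", "Non-Poor", "Poor"))

print(suggest_plot(members))
print(suggest_plot(status))
print(suggest_plot(members, by = status))
```

Variables read with haven that are still `haven_labelled` would need converting to factors first, since they are stored as numbers and would otherwise be treated as continuous.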