Introduction

This study examines the extent and determinants of poverty in Somaliland using the Wealth Index as the dependent variable. Independent variables include individual factors (age, sex, education, number of household members, living children, and household head’s marital status), community-level factors (residence and region), and household-level factors (access to safe drinking water and sanitation facilities). These variables collectively provide a comprehensive framework for exploring poverty drivers.

Question 1: Data Structure in R

  1. Import SDHS into R

R Code

library(haven)
EData<-read_dta("~/SLDHS.dta")

Load the dplyr package

library(dplyr)
relevant <- EData %>%
  select("V190", "V024", "V025", "V106", "V152", "V151", "V136", "V201", "V501", "V113", "V116")
  2. Display the structure of the data set using the str() function

R Code

str(relevant)
## tibble [17,686 × 11] (S3: tbl_df/tbl/data.frame)
##  $ V190: dbl+lbl [1:17686] 5, 5, 5, 5, 5, 3, 3, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3...
##    ..@ label       : chr "Wealth index combined"
##    ..@ format.stata: chr "%1.0f"
##    ..@ labels      : Named num [1:5] 1 2 3 4 5
##    .. ..- attr(*, "names")= chr [1:5] "Lowest" "Second" "Middle" "Fourth" ...
##  $ V024: dbl+lbl [1:17686] 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, ...
##    ..@ label       : chr "Region"
##    ..@ format.stata: chr "%2.0f"
##    ..@ labels      : Named num [1:6] 11 12 13 14 15 16
##    .. ..- attr(*, "names")= chr [1:6] "Awdal" " Marodijeh" "Sahil" "Togdheer" ...
##  $ V025: dbl+lbl [1:17686] 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
##    ..@ label       : chr "Type of place of residence"
##    ..@ format.stata: chr "%1.0f"
##    ..@ labels      : Named num [1:6] 1 2 3 4 5 6
##    .. ..- attr(*, "names")= chr [1:6] "Rural" "Urban" "Nomadic" "Rural IDP" ...
##  $ V106: dbl+lbl [1:17686] 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
##    ..@ label       : chr "Highest educational level"
##    ..@ format.stata: chr "%1.0f"
##    ..@ labels      : Named num [1:4] 0 1 2 3
##    .. ..- attr(*, "names")= chr [1:4] "No Education" "Primary" "Secondary" "Higher"
##  $ V152: num [1:17686] 23 23 23 23 23 61 61 23 23 23 ...
##   ..- attr(*, "label")= chr "Age of household head"
##   ..- attr(*, "format.stata")= chr "%2.0f"
##  $ V151: num [1:17686] 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "Sex of household head"
##   ..- attr(*, "format.stata")= chr "%1.0f"
##  $ V136: num [1:17686] 6 6 6 6 6 4 4 6 6 6 ...
##   ..- attr(*, "label")= chr "Number of household members (listed)"
##   ..- attr(*, "format.stata")= chr "%1.0f"
##  $ V201: num [1:17686] 5 5 5 5 5 2 2 4 4 4 ...
##   ..- attr(*, "label")= chr "Total children ever born"
##   ..- attr(*, "format.stata")= chr "%2.0f"
##  $ V501: dbl+lbl [1:17686] 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
##    ..@ label       : chr "Current marital status"
##    ..@ format.stata: chr "%1.0f"
##    ..@ labels      : Named num [1:4] 0 1 2 3
##    .. ..- attr(*, "names")= chr [1:4] "Never Married" "Married" "Divorced" "Widowed"
##  $ V113: num [1:17686] 11 11 11 11 11 13 13 13 13 13 ...
##   ..- attr(*, "label")= chr "Source of drinking water"
##   ..- attr(*, "format.stata")= chr "%2.0f"
##  $ V116: num [1:17686] 23 23 23 23 23 61 61 23 23 23 ...
##   ..- attr(*, "label")= chr "Type of toilet facility"
##   ..- attr(*, "format.stata")= chr "%2.0f"
  3. Identify and list all variables provided in the variable code list

R Code

variable_names<- colnames(relevant)
print(variable_names)
##  [1] "V190" "V024" "V025" "V106" "V152" "V151" "V136" "V201" "V501" "V113"
## [11] "V116"
  4. Determine the data types of each variable.

R Code

variable_types <- sapply(relevant, class)
print(variable_types)
## $V190
## [1] "haven_labelled" "vctrs_vctr"     "double"        
## 
## $V024
## [1] "haven_labelled" "vctrs_vctr"     "double"        
## 
## $V025
## [1] "haven_labelled" "vctrs_vctr"     "double"        
## 
## $V106
## [1] "haven_labelled" "vctrs_vctr"     "double"        
## 
## $V152
## [1] "numeric"
## 
## $V151
## [1] "numeric"
## 
## $V136
## [1] "numeric"
## 
## $V201
## [1] "numeric"
## 
## $V501
## [1] "haven_labelled" "vctrs_vctr"     "double"        
## 
## $V113
## [1] "numeric"
## 
## $V116
## [1] "numeric"
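  5. Briefly explain the difference between numeric and factor variables

Numeric variables store quantities on which arithmetic is meaningful, so they suit quantitative analysis such as calculating averages or running statistical tests. Factor variables store categories as a fixed set of levels, so they suit categorical analysis such as building frequency tables or grouping data in plots. In this dataset, V136 (number of household members) is numeric, while coded variables such as V106 (education level) are categorical and are best treated as factors.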
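
The distinction can be illustrated with a small made-up example. The vectors household_size and education_code below are hypothetical and are not taken from the survey file; this is only a sketch of how R treats the two types.

```r
# Hypothetical toy data, not drawn from the SLDHS file
household_size <- c(6, 4, 6, 3, 5)   # numeric: arithmetic is meaningful
education_code <- c(0, 1, 0, 2, 0)   # numeric codes that stand for categories

# Convert the codes to a factor so R treats them as categories
education <- factor(education_code,
                    levels = c(0, 1, 2, 3),
                    labels = c("No Education", "Primary", "Secondary", "Higher"))

mean_size  <- mean(household_size)   # an average makes sense for numeric data
edu_counts <- table(education)       # counts per category make sense for factors

print(mean_size)
print(edu_counts)
print(levels(education))
```

Calling mean(education_code) would also return a number, but it would have no useful interpretation because the codes are only labels for categories.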

Question 2: Descriptive Statistics in R

  1. Calculate the mean, median and Standard deviation for the variable “V136” (Number of household members).

R Code

mean(EData$V136, na.rm= TRUE)
## [1] 5.107781
median(EData$V136, na.rm= TRUE)
## [1] 5
sd(EData$V136, na.rm= TRUE)
## [1] 2.475511
  2. Create a frequency table for variable “V106” (Education Level) using the table() function

R Code

freq_table <- table(EData$V106)
education_labels <- c("No Education", "Primary", "Secondary", "Higher")
names(freq_table) <- education_labels
print(freq_table)
## No Education      Primary    Secondary       Higher 
##        15287         1991          311           97

  3. Calculate the proportion of households in each wealth quintile (V190)

R Code

proportions_v190 <- prop.table(table(EData$V190))
# Display the proportions
proportions_v190
## 
##         1         2         3         4         5 
## 0.3575144 0.1705304 0.1161371 0.1643673 0.1914509
# Assign descriptive labels to the wealth quintiles
EData$V190 <- factor(EData$V190, 
                             levels = c(1, 2, 3, 4, 5), 
                             labels = c("Lowest", "Second", "Middle", "Fourth", "Highest"))
# Recalculate proportions with labels
labeled_proportions <- prop.table(table(EData$V190))

# Display labeled proportions
labeled_proportions
## 
##    Lowest    Second    Middle    Fourth   Highest 
## 0.3575144 0.1705304 0.1161371 0.1643673 0.1914509
  4. Explain how you would use R to calculate the correlation coefficient between age of household head (“V151”) and the number of living children (“V201”)

R Code

age_household_head <- EData$V151
num_living_children <- EData$V201
correlation <- cor(age_household_head, num_living_children, use = "complete.obs")
cat("Correlation Coefficient between age of household head and number of living children:", correlation, "\n")
## Correlation Coefficient between age of household head and number of living children: 0.0005634619

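To calculate the correlation coefficient in R, use the cor() function. Its default method is Pearson’s correlation, which measures the strength and direction of a linear relationship on a scale from -1 to 1. For instance, cor(EData$V151, EData$V201, use = "complete.obs") drops every row with a missing value in either variable before computing the coefficient. The result here, about 0.0006, indicates essentially no linear relationship between the two variables. Note that the str() output above labels V151 as “Sex of household head”, V152 as “Age of household head”, and V201 as “Total children ever born”, so the coefficient reported here reflects the variable codes named in the question rather than the household head’s age itself.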
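
The effect of the use argument can be seen in a short sketch on made-up data. The vectors x and y are hypothetical and are not survey variables; cor.test() is shown only as one possible follow-up for judging whether a coefficient differs from zero.

```r
# Hypothetical toy vectors, not drawn from the SLDHS file
x <- c(1, 2, 3, 4, 5, NA)
y <- c(2, 1, 4, 3, 5, 7)

# With the default use = "everything", a single NA makes the result NA
r_default <- cor(x, y)

# "complete.obs" keeps only rows where both x and y are observed
r_complete <- cor(x, y, use = "complete.obs")

# cor.test() drops incomplete pairs itself and adds a p-value
# and a confidence interval for Pearson's r
test_result <- cor.test(x, y)

print(r_default)
print(r_complete)
print(test_result$p.value)
```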

Question 3: Data Cleaning in R

  1. Create a new variable called “Poverty Status” based on the “V190” variable (wealth quintile) and categorize households into two groups:

R Code

relevant$V190 <- as.numeric(as.character(relevant$V190))

# Create the poverty_status variable
relevant$poverty_status <- ifelse(relevant$V190 <= 3, 1, 2)

# Label the poverty_status variable
relevant$poverty_status <- factor(relevant$poverty_status, 
                                       levels = c(1, 2), 
                                       labels = c("Poor", "Non-Poor"))

# Verify the new variable
table(relevant$poverty_status)
## 
##     Poor Non-Poor 
##    11393     6293
  2. Check for missing values in all variables using the summary() function.

R Code

summary(relevant)
##       V190            V024            V025            V106       
##  Min.   :1.000   Min.   :11.00   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.:13.00   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :2.000   Median :14.00   Median :1.000   Median :0.0000  
##  Mean   :2.662   Mean   :13.95   Mean   :1.498   Mean   :0.1642  
##  3rd Qu.:4.000   3rd Qu.:15.00   3rd Qu.:2.000   3rd Qu.:0.0000  
##  Max.   :5.000   Max.   :16.00   Max.   :2.000   Max.   :3.0000  
##                                                                  
##       V152           V151            V136            V201       
##  Min.   :11.0   Min.   :1.000   Min.   :0.000   Min.   : 0.000  
##  1st Qu.:13.0   1st Qu.:1.000   1st Qu.:3.000   1st Qu.: 4.000  
##  Median :22.0   Median :1.000   Median :5.000   Median : 6.000  
##  Mean   :26.2   Mean   :1.376   Mean   :5.108   Mean   : 6.274  
##  3rd Qu.:23.0   3rd Qu.:2.000   3rd Qu.:7.000   3rd Qu.: 8.000  
##  Max.   :96.0   Max.   :2.000   Max.   :9.000   Max.   :19.000  
##  NA's   :862    NA's   :858     NA's   :2303    NA's   :13      
##       V501            V113            V116       poverty_status 
##  Min.   :1.000   Min.   :11.00   Min.   :11.0   Poor    :11393  
##  1st Qu.:1.000   1st Qu.:12.00   1st Qu.:13.0   Non-Poor: 6293  
##  Median :1.000   Median :31.00   Median :22.0                   
##  Mean   :1.138   Mean   :33.44   Mean   :26.2                   
##  3rd Qu.:1.000   3rd Qu.:61.00   3rd Qu.:23.0                   
##  Max.   :3.000   Max.   :96.00   Max.   :96.0                   
##                  NA's   :862     NA's   :862
  • Clean the data by removing rows with missing values

R Code

# Load dplyr (already attached earlier; repeated so this step runs on its own)
library(dplyr)

# Drop rows with missing values in specific variables
data_cleaned <- relevant %>%
  filter(
    !is.na(V190), !is.na(V024), !is.na(V025), 
    !is.na(V106), !is.na(V152), !is.na(V151), 
    !is.na(V136), !is.na(V201), !is.na(V501), 
    !is.na(V113), !is.na(V116)
  )

# Check dimensions of the cleaned data
dim(data_cleaned)
## [1] 14514    12
# Verify no missing values remain in specific variables
sapply(data_cleaned[, c("V190", "V024", "V025", "V106", "V152", "V151", 
                        "V136", "V201", "V501", "V113", "V116")], function(x) sum(is.na(x)))
## V190 V024 V025 V106 V152 V151 V136 V201 V501 V113 V116 
##    0    0    0    0    0    0    0    0    0    0    0
  3. Recode the variables “V113” (source of drinking water) and “V116” (toilet facility) into “Improved” and “Unimproved” categories based on the definitions provided in the EData.

R Code

# Classify Source of Drinking Water (V113)
data_cleaned$V113_recode <- ifelse(data_cleaned$V113 %in% c(11, 12, 13, 21, 31, 41), 
                                    "Improved", "Unimproved")

# Verify the classification
table(data_cleaned$V113_recode)
## 
##   Improved Unimproved 
##       7474       7040
# Classify Type of Toilet Facility (V116)
data_cleaned$V116_recode <- ifelse(data_cleaned$V116 %in% c(11, 12, 13, 14, 15, 21, 41), 
                                    "Improved", "Unimproved")

# Verify the classification
table(data_cleaned$V116_recode)
## 
##   Improved Unimproved 
##       6249       8265
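  4. Explain how you would handle missing values in your dataset.

Missing values can be handled in three main ways: removing incomplete rows, imputing a replacement value, or flagging missingness with an indicator variable. In this analysis, rows with a missing value in any of the selected variables were removed using filter() with !is.na(), which is equivalent to keeping the rows for which complete.cases() is TRUE. This reduced the data from 17,686 to 14,514 rows.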
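
The three approaches can be sketched on a small made-up data frame. The data frame toy and its columns members and children are hypothetical, and median imputation is only one possible choice of replacement value, not the method used in this analysis.

```r
# Hypothetical toy data frame, not drawn from the SLDHS file
toy <- data.frame(
  members  = c(6, NA, 4, 5, NA),
  children = c(5, 2, NA, 4, 3)
)

# 1. Removing: keep only rows with no missing value in the listed columns
removed <- toy[complete.cases(toy[, c("members", "children")]), ]

# 2. Flagging: record which values were missing before they are changed
flagged <- toy
flagged$members_missing <- is.na(flagged$members)

# 3. Imputing: replace missing values with the median of the observed values
imputed <- flagged
imputed$members[is.na(imputed$members)] <- median(toy$members, na.rm = TRUE)

print(nrow(removed))
print(imputed)
```

Keeping the flag column alongside the imputed values makes it possible to check later whether results change when the imputed rows are excluded.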

Question 4: Data Visualization in R

  1. Create a histogram to show the distribution of the variable “V136” (Number of household members)

R Code

hist(EData$V136,
     main = "Distribution of Number of Household Members",
     xlab = "Number of Household Members",
     col = "blue",
     breaks = 10)

The histogram shows that the number of household members (V136) is approximately uniformly distributed, with most households having between 3 and 7 members. Households with fewer than 2 or more than 7 members are less common.

  2. Create a bar chart to visualize the proportion of households in each poverty status category (“poverty_status”)

R Code

barplot(prop.table(table(data_cleaned$poverty_status)),
        main = "Proportion of Households by Poverty Status",
        xlab = "Poverty Status",
        ylab = "Proportion",
        col = c("red", "green"),
        names.arg = c("Poor", "Non-Poor"))

The bar chart shows that a larger proportion of households fall under the “Poor” category compared to the “Non-Poor” category, with the proportion of poor households exceeding 60%. This indicates that poverty is more prevalent among the households analyzed.

  3. Create a boxplot to compare the number of living children (“V201”) between poor and non-poor households (“poverty_status”)

R Code

boxplot(data_cleaned$V201 ~ data_cleaned$poverty_status,
        main = "Number of Living Children by Poverty Status",
        xlab = "Poverty Status",
        ylab = "Number of Living Children",
        col = c("red", "green"))

The boxplot shows that poor and non-poor households have similar median numbers of living children, but poor households exhibit slightly greater variability. Additionally, poor households tend to have more outliers with a higher number of children, suggesting a slight tendency toward larger family sizes among the poor.
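
A visual reading of a boxplot can be checked against the numbers it is drawn from. The sketch below uses made-up values (the vectors children and status are hypothetical, not survey data) to compute the group medians, the interquartile ranges, and the points that boxplot() would draw as outliers.

```r
# Hypothetical toy data, not drawn from the SLDHS file
children <- c(4, 6, 8, 6, 12, 5, 6, 7)
status   <- factor(c("Poor", "Poor", "Poor", "Poor", "Poor",
                     "Non-Poor", "Non-Poor", "Non-Poor"),
                   levels = c("Poor", "Non-Poor"))

# Median by group: the line drawn inside each box
group_medians <- tapply(children, status, median)

# Interquartile range by group: a measure of the spread shown by each box
group_iqr <- tapply(children, status, IQR)

# Values beyond 1.5 times the box length, which boxplot() plots as points
poor_outliers <- boxplot.stats(children[status == "Poor"])$out

print(group_medians)
print(group_iqr)
print(poor_outliers)
```

Applied to data_cleaned$V201 and data_cleaned$poverty_status, the same calls would give figures to set beside the description of the plot.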

  4. Briefly explain the importance of choosing appropriate visualization techniques for different types of data.

Choosing an appropriate visualization matters because each chart type suits a particular kind of data, and a mismatched chart can hide or distort the pattern of interest:

  • A histogram gives a clear graphical representation of the distribution of a continuous variable.
  • A bar plot suits categorical data, showing the count or proportion in each category.
  • A boxplot summarizes the distribution of a continuous variable and makes it easy to compare that distribution across groups.