Question 1: Data Structures in R Software (10 points)

Load and Clean the Data

  1. Import the SDHS dataset into R. Install the haven package if you haven’t already, then load it.

R Code

# Load the haven package (run install.packages("haven") once if it is not installed)
library(haven)

# Load the dataset (update the path to your dataset file)
SLDHS <- read_dta("~/SLDHS.dta")


# Drop rows of SLDHS that have missing values in any of the analysis variables
# Load dplyr
library(dplyr)
data_cleaned <- SLDHS %>%
  filter(!is.na(V190), !is.na(V024), !is.na(V025), 
         !is.na(V106), !is.na(V152), !is.na(V151), 
         !is.na(V136), !is.na(V201), !is.na(V501), 
         !is.na(V113), !is.na(V116))

# Check dimensions of the cleaned data
dim(data_cleaned)
## [1] 14514   563
# Verify no missing values remain in specific variables
sapply(data_cleaned[, c("V190", "V024", "V025", "V106", "V152", "V151", 
                        "V136", "V201", "V501", "V113", "V116")], function(x) sum(is.na(x)))
## V190 V024 V025 V106 V152 V151 V136 V201 V501 V113 V116 
##    0    0    0    0    0    0    0    0    0    0    0
# Select only the relevant variables (dplyr is already loaded above)
data_relevant <- data_cleaned %>%
  select("V190", "V024", "V025", "V106", "V152", "V151", "V136", "V201", "V501", "V113", "V116")

  2. Display the structure of the dataset using the str() function. First select only the relevant variables (the variables in the code list); load the dplyr package if it is not already loaded.

R Code

#1. List All Variables in the Dataset

# Ensure data_relevant exists before listing variables
if (exists("data_relevant")) {
  # Display the structure of the dataset
  str(data_relevant)
} else {
  stop("The variable 'data_relevant' does not exist. Please create it first.")
}
## tibble [14,514 × 11] (S3: tbl_df/tbl/data.frame)
##  $ V190: dbl+lbl [1:14514] 5, 5, 5, 5, 5, 3, 3, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3...
##    ..@ label       : chr "Wealth index combined"
##    ..@ format.stata: chr "%1.0f"
##    ..@ labels      : Named num [1:5] 1 2 3 4 5
##    .. ..- attr(*, "names")= chr [1:5] "Lowest" "Second" "Middle" "Fourth" ...
##  $ V024: dbl+lbl [1:14514] 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, ...
##    ..@ label       : chr "Region"
##    ..@ format.stata: chr "%2.0f"
##    ..@ labels      : Named num [1:6] 11 12 13 14 15 16
##    .. ..- attr(*, "names")= chr [1:6] "Awdal" " Marodijeh" "Sahil" "Togdheer" ...
##  $ V025: dbl+lbl [1:14514] 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
##    ..@ label       : chr "Type of place of residence"
##    ..@ format.stata: chr "%1.0f"
##    ..@ labels      : Named num [1:6] 1 2 3 4 5 6
##    .. ..- attr(*, "names")= chr [1:6] "Rural" "Urban" "Nomadic" "Rural IDP" ...
##  $ V106: dbl+lbl [1:14514] 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
##    ..@ label       : chr "Highest educational level"
##    ..@ format.stata: chr "%1.0f"
##    ..@ labels      : Named num [1:4] 0 1 2 3
##    .. ..- attr(*, "names")= chr [1:4] "No Education" "Primary" "Secondary" "Higher"
##  $ V152: num [1:14514] 23 23 23 23 23 61 61 23 23 23 ...
##   ..- attr(*, "label")= chr "Age of household head"
##   ..- attr(*, "format.stata")= chr "%2.0f"
##  $ V151: num [1:14514] 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "Sex of household head"
##   ..- attr(*, "format.stata")= chr "%1.0f"
##  $ V136: num [1:14514] 6 6 6 6 6 4 4 6 6 6 ...
##   ..- attr(*, "label")= chr "Number of household members (listed)"
##   ..- attr(*, "format.stata")= chr "%1.0f"
##  $ V201: num [1:14514] 5 5 5 5 5 2 2 4 4 4 ...
##   ..- attr(*, "label")= chr "Total children ever born"
##   ..- attr(*, "format.stata")= chr "%2.0f"
##  $ V501: dbl+lbl [1:14514] 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
##    ..@ label       : chr "Current marital status"
##    ..@ format.stata: chr "%1.0f"
##    ..@ labels      : Named num [1:4] 0 1 2 3
##    .. ..- attr(*, "names")= chr [1:4] "Never Married" "Married" "Divorced" "Widowed"
##  $ V113: num [1:14514] 11 11 11 11 11 13 13 13 13 13 ...
##   ..- attr(*, "label")= chr "Source of drinking water"
##   ..- attr(*, "format.stata")= chr "%2.0f"
##  $ V116: num [1:14514] 23 23 23 23 23 61 61 23 23 23 ...
##   ..- attr(*, "label")= chr "Type of toilet facility"
##   ..- attr(*, "format.stata")= chr "%2.0f"
  3. Briefly explain the difference between numeric and factor variables

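Numeric variables: These contain numerical data and support mathematical operations. They represent continuous or discrete quantities (e.g., age, income).

Factor variables: These are categorical variables that represent distinct groups or categories. They are stored internally as integer codes, each mapped to a label (e.g., gender, marital status).

In the str() output above, V152 and V136 are plain numeric (num), while V190 and V501 are labelled numeric vectors (dbl+lbl) imported by haven; these hold category codes and can be converted to factors with factor() or haven::as_factor().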
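The distinction can be seen in a small base R sketch. The vectors below are toy values made up for illustration (they are not taken from the survey), and the level labels simply mirror the V501 labels shown in the str() output.

```r
# Toy data; the values are illustrative and are not taken from the survey.
household_size <- c(6, 4, 6, 5)   # numeric: arithmetic is meaningful
marital_code   <- c(1, 0, 2, 1)   # numeric codes that stand for categories

# Convert the codes to a factor with explicit levels and labels
marital_status <- factor(marital_code,
                         levels = c(0, 1, 2, 3),
                         labels = c("Never Married", "Married", "Divorced", "Widowed"))

mean(household_size)         # arithmetic works on numeric data
is.numeric(marital_status)   # FALSE: a factor is not treated as numeric
levels(marital_status)       # the category labels
as.integer(marital_status)   # internal codes: positions within levels(), starting at 1
table(marital_status)        # counts per category, including categories with zero cases
```

Note that the internal integer codes of a factor are positions in the level list, not the original survey codes, which is why as.integer() on a factor should not be used to recover the raw values.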

Question 2: Descriptive Statistics in R (10 points)

  1. Calculate the mean, median and standard deviation for the variable “V136” (Number of household members).

R Code

mean(data_cleaned$V136, na.rm= TRUE)
## [1] 5.407744
median(data_cleaned$V136, na.rm= TRUE)
## [1] 6
sd(data_cleaned$V136, na.rm= TRUE)
## [1] 2.203568
  2. Create a frequency table for the variable “V106” (Education Level) using the table() function

R Code

freq_table <- table(data_cleaned$V106)
# Label the codes 0-3 in order (this assumes all four codes occur in the data)
education_labels <- c("No Education", "Primary", "Secondary", "Higher")
names(freq_table) <- education_labels
print(freq_table)
## No Education      Primary    Secondary       Higher 
##        12595         1557          278           84
  3. Calculate the proportion of households in each wealth quintile (“V190”):
proportions_v190 <- prop.table(table(data_cleaned$V190))

# Display the proportions
proportions_v190
## 
##         1         2         3         4         5 
## 0.3690919 0.1726609 0.1164393 0.1605347 0.1812733
# Assign descriptive labels to the wealth quintiles
data_cleaned$V190 <- factor(data_cleaned$V190, 
                             levels = c(1, 2, 3, 4, 5), 
                             labels = c("Lowest", "Second", "Middle", "Fourth", "Highest"))

# Recalculate proportions with labels
labeled_proportions <- prop.table(table(data_cleaned$V190))
# Display labeled proportions
labeled_proportions
## 
##    Lowest    Second    Middle    Fourth   Highest 
## 0.3690919 0.1726609 0.1164393 0.1605347 0.1812733

  4. Explain how you would use R to calculate the correlation coefficient between age of household head (“V151”) and the number of living children (“V201”)

R Code

age_household_head <- data_cleaned$V151
num_living_children <- data_cleaned$V201
correlation <- cor(age_household_head, num_living_children, use = "complete.obs")
cat("Correlation Coefficient between age of household head and number of living children:", correlation, "\n")
## Correlation Coefficient between age of household head and number of living children: 0.005892517

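To calculate the correlation coefficient between the household head’s age (V151) and the number of living children (V201) in R, use the cor() function. Its default method, Pearson’s correlation, measures the strength of the linear relationship. For instance, cor(data_relevant$V151, data_relevant$V201, use = "complete.obs") computes the correlation from the rows where both values are present and returns a value between -1 and 1 that indicates the strength and direction of the relationship. The value obtained here, about 0.006, indicates essentially no linear association. Note that in the str() output above V151 is labelled “Sex of household head”, V152 is labelled “Age of household head”, and V201 is labelled “Total children ever born”, so the variable labels should be checked before this result is read as an age correlation.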
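As a minimal sketch of the same approach on simulated data, the example below computes Pearson’s correlation with cor() and then uses cor.test() to obtain a confidence interval and p-value. The variable names (age_head, children) and all values are assumptions made up for illustration, not survey data.

```r
# Simulated data; the names and values are assumptions for illustration only.
set.seed(1)
age_head <- sample(20:80, 200, replace = TRUE)
children <- rpois(200, lambda = 2 + age_head / 40)
children[c(5, 17)] <- NA   # introduce missing values on purpose

# Pearson correlation, using only rows where both values are present
r <- cor(age_head, children, use = "complete.obs", method = "pearson")

# cor.test() drops incomplete pairs itself and adds a confidence interval and p-value
test <- cor.test(age_head, children, method = "pearson")

r               # the correlation coefficient
test$conf.int   # 95% confidence interval for the coefficient
test$p.value    # p-value for the null hypothesis of zero correlation
```

For skewed count data, method = "spearman" is a rank-based alternative that may be worth comparing against the Pearson result.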

Question 3: Data Cleaning in R (10 points)

  1. Create a new variable called “Poverty Status” based on the “V190” variable (wealth quintile) and categorize households into two groups: Poor (quintiles 1 to 3) and Non-Poor (quintiles 4 and 5).

R Code

#1. Create a new variable called poverty_status based on the V190 variable (wealth quintile):

# Step 1: Ensure V190 is correctly converted to numeric
# Check the structure of V190
str(data_relevant$V190)
##  dbl+lbl [1:14514] 5, 5, 5, 5, 5, 3, 3, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,...
##  @ label       : chr "Wealth index combined"
##  @ format.stata: chr "%1.0f"
##  @ labels      : Named num [1:5] 1 2 3 4 5
##   ..- attr(*, "names")= chr [1:5] "Lowest" "Second" "Middle" "Fourth" ...
# Convert V190 from a labelled vector (dbl+lbl) to plain numeric codes
data_relevant$V190 <- as.numeric(as.character(data_relevant$V190))

# Verify that V190 has valid numeric values
summary(data_relevant$V190)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   2.612   4.000   5.000
# Step 2: Check for missing values in V190 and remove them
data_relevant <- data_relevant[!is.na(data_relevant$V190), ]

# Step 3: Create the poverty_status variable
data_relevant$poverty_status <- ifelse(data_relevant$V190 <= 3, 1, 2)

# Step 4: Label the poverty_status variable
data_relevant$poverty_status <- factor(data_relevant$poverty_status, 
                                      levels = c(1, 2), 
                                      labels = c("Poor", "Non-Poor"))
# Step 5: Verify the new variable
table(data_relevant$poverty_status)
## 
##     Poor Non-Poor 
##     9553     4961
  2. Check for missing values in all variables using the summary() function.

R Code

summary(data_relevant)
##       V190            V024            V025            V106       
##  Min.   :1.000   Min.   :11.00   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.:13.00   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :2.000   Median :14.00   Median :1.000   Median :0.0000  
##  Mean   :2.612   Mean   :14.02   Mean   :1.473   Mean   :0.1629  
##  3rd Qu.:4.000   3rd Qu.:15.00   3rd Qu.:2.000   3rd Qu.:0.0000  
##  Max.   :5.000   Max.   :16.00   Max.   :2.000   Max.   :3.0000  
##       V152            V151            V136            V201       
##  Min.   :11.00   Min.   :1.000   Min.   :1.000   Min.   : 0.000  
##  1st Qu.:13.00   1st Qu.:1.000   1st Qu.:4.000   1st Qu.: 4.000  
##  Median :22.00   Median :1.000   Median :6.000   Median : 6.000  
##  Mean   :26.79   Mean   :1.397   Mean   :5.408   Mean   : 6.253  
##  3rd Qu.:23.00   3rd Qu.:2.000   3rd Qu.:7.000   3rd Qu.: 8.000  
##  Max.   :96.00   Max.   :2.000   Max.   :9.000   Max.   :19.000  
##       V501           V113            V116        poverty_status
##  Min.   :1.00   Min.   :11.00   Min.   :11.00   Poor    :9553  
##  1st Qu.:1.00   1st Qu.:12.00   1st Qu.:13.00   Non-Poor:4961  
##  Median :1.00   Median :31.00   Median :22.00                  
##  Mean   :1.14   Mean   :34.44   Mean   :26.79                  
##  3rd Qu.:1.00   3rd Qu.:61.00   3rd Qu.:23.00                  
##  Max.   :3.00   Max.   :96.00   Max.   :96.00
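No “NA's” row appears in the summary above, which is consistent with the earlier filtering step. As a hedged illustration of what summary() shows when values are missing, the sketch below uses a toy data frame (made-up values that only borrow two column names from the dataset) and adds an explicit per-column count with colSums(is.na()).

```r
# Toy data frame; the values are illustrative and are not taken from the survey.
toy <- data.frame(
  V136 = c(6, 4, NA, 5),
  V201 = c(5, NA, NA, 2)
)

summary(toy)                       # columns with missing values gain an extra "NA's" row

na_counts <- colSums(is.na(toy))   # explicit count of missing values per column
na_counts

anyNA(toy)                         # TRUE if at least one value is missing anywhere
```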
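  3. Clean the data by recoding source of drinking water (“V113”) and type of toilet facility (“V116”) into improved and unimproved categories

R Code

# Classify Source of Drinking Water (V113)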
data_cleaned$V113_recode <- ifelse(data_cleaned$V113 %in% c(11, 12, 13, 21, 31, 41), 
                                    "Improved", "Unimproved")

# Verify the classification
table(data_cleaned$V113_recode)
## 
##   Improved Unimproved 
##       7474       7040
# Classify Type of Toilet Facility (V116)
data_cleaned$V116_recode <- ifelse(data_cleaned$V116 %in% c(11, 12, 13, 14, 15, 21, 41), 
                                    "Improved", "Unimproved")

# Verify the classification
table(data_cleaned$V116_recode)
## 
##   Improved Unimproved 
##       6249       8265
  4. Explain how you would handle missing values in your dataset.

Remove missing values: If the number of missing values is small, remove the rows that contain them using na.omit() or a filter such as filter(!is.na(x)).

Impute missing values: For numerical variables, replace missing values with the mean, median, or another statistical estimate. For categorical variables, use the mode or a value predicted from other variables.

Flag missing values: Create an indicator variable that marks which observations were missing, so they can be reported or analysed separately.
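The three options can be sketched in base R as follows. The data frame, its column names (size, region) and its values are hypothetical and exist only for illustration; median and mode imputation are shown as simple examples, not as a recommendation for this survey.

```r
# Toy data frame; the names and values are hypothetical, not taken from the survey.
toy <- data.frame(
  size   = c(6, 4, NA, 5, 9),
  region = c("Awdal", NA, "Sahil", "Awdal", "Togdheer")
)

# Remove: keep only the rows with no missing values
complete_rows <- na.omit(toy)

# Flag: record which values were missing before they are imputed
toy$size_missing <- is.na(toy$size)

# Impute: median for a numeric variable, most frequent value for a categorical one
toy$size[is.na(toy$size)] <- median(toy$size, na.rm = TRUE)
modal_region <- names(which.max(table(toy$region)))
toy$region[is.na(toy$region)] <- modal_region

complete_rows
toy
```

Creating the flag before imputing matters: once the missing values are filled in, is.na() can no longer tell which observations were originally missing.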

Question 4: Data Visualization in R (10 points)

  1. Create a histogram to show the distribution of the variable “V136” (Number of household members)

R Code

Histogram for V136

# Create a histogram for V136
hist(data_relevant$V136,
     main = "Distribution of Number of Household Members",
     xlab = "Number of Household Members",
     ylab = "Frequency",
     col = "blue",
     breaks = 10)

  2. Create a bar chart to visualize the proportion of households in each poverty status category (“poverty_status”)

R Code

# Create a bar chart for poverty_status
# Step 1: Calculate proportions for poverty_status
poverty_status_proportions <- prop.table(table(data_relevant$poverty_status))
# Step 2: Create the bar chart
barplot(
  poverty_status_proportions,
  main = "Proportion of Households by Poverty Status",
  xlab = "Poverty Status",
  ylab = "Proportion",
  col = c("yellow", "blue"), # Colors for the bars
  names.arg = c("Poor", "Non-Poor") # Label categories
)

  3. Create a boxplot to compare the number of living children (“V201”) between poor and non-poor households (“poverty_status”)

R Code

# Base R boxplot (no additional packages are required)
boxplot(V201 ~ poverty_status, data = data_relevant,
        main = "Number of Living Children by Poverty Status",
        xlab = "Poverty Status",
        ylab = "Number of Living Children",
        col = c("blue","cyan"))

  4. Briefly explain the importance of choosing appropriate visualization techniques for different types of data.

Explanation: Appropriate visualization techniques communicate the insights in the data effectively. For example:

  • Histograms show the distribution of numeric data (like V136).
  • Bar charts show proportions or counts for categorical variables (like poverty_status).
  • Boxplots compare distributions between groups and reveal outliers (like V201 by poverty_status).
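One way to see why each chart suits its data type is to look at the numbers each one draws. The sketch below uses simulated data (all names and values are assumptions for illustration) and calls hist() and boxplot() with plot = FALSE so that they return their underlying summaries instead of drawing.

```r
# Simulated data; the names and values are assumptions for illustration only.
set.seed(42)
n <- 300
household_size <- sample(1:9, n, replace = TRUE)                 # discrete numeric
poverty <- factor(sample(c("Poor", "Non-Poor"), n, replace = TRUE),
                  levels = c("Poor", "Non-Poor"))                # categorical
children <- rpois(n, lambda = ifelse(poverty == "Poor", 6, 5))   # numeric, by group

# plot = FALSE returns the numbers behind each chart without drawing it
h <- hist(household_size, breaks = seq(0.5, 9.5, by = 1), plot = FALSE)  # one bin per value
p <- prop.table(table(poverty))                                          # bar heights
b <- boxplot(children ~ poverty, plot = FALSE)                           # summaries per group

h$counts   # histogram: how many observations fall in each bin
p          # bar chart: share of observations in each category
b$stats    # boxplot: lower whisker, Q1, median, Q3, upper whisker (one column per group)
```

A histogram needs an ordered numeric axis to bin, a bar chart needs only category counts, and a boxplot needs a numeric variable split by a grouping factor, which is why the three are not interchangeable.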