Question ONE : Data Structures in R

* Instructions:

  1. Import the SDHS data set into R.
# Setup
## Load the packages needed to reproduce the report (install them first with install.packages() if missing)
library(haven) # Imports SPSS, SAS, and Stata data files
library(readr) # Imports delimited text data such as CSV
library(foreign) # Alternative importer for SPSS, SAS, and older Stata data files

Import the Stata file into R
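The import call itself is not shown in this report, so the following is only a minimal sketch of how the Stata file might be read with `haven::read_dta()`. The path `data/SLDHS.dta` is a hypothetical placeholder, not the real file location; the object name `SLDHS` matches the one used later in the report.

```r
# Hypothetical path: replace with the real location of the SDHS Stata file
sdhs_path <- "data/SLDHS.dta"

if (file.exists(sdhs_path)) {
  SLDHS <- haven::read_dta(sdhs_path)  # returns a tibble and keeps Stata value labels
  dim(SLDHS)                           # number of rows and columns
} else {
  message("File not found: ", sdhs_path)
}
```

The `file.exists()` guard is only there so the sketch runs without the data file; in the actual report the `read_dta()` call would be used directly.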

  2. Display the structure of the data set using the str() function.

  3. Identify and list all the variables provided in the variable code list.

Select only the variables in the variable code list

Load the dplyr package

library(dplyr)
# Keep only the variables named in the variable code list
sLDHS_DATA1 <- SLDHS %>%
  select("V190", "V024", "V025", "V106", "V152", "V151", "V136", "V201", "V501", "V113", "V116")
  4. Determine the data types of each variable.
str(sLDHS_DATA1)
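As a complement to `str()`, a short sketch of listing one class per column with `sapply()`. The data frame `df` and its values are made up for illustration. Columns imported with haven may carry more than one class, in which case `sapply()` would return a list rather than a character vector.

```r
# Toy data frame (hypothetical values) mixing common column types
df <- data.frame(
  members = c(4, 6, 3),
  region  = factor(c("North", "South", "North")),
  urban   = c(TRUE, FALSE, TRUE),
  id      = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

str(df)                          # compact overview: type and first values of every column
col_classes <- sapply(df, class) # one class name per column
col_classes
```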
  5. Briefly explain the difference between numeric and factor variables.

Answer

Numeric variables

Numeric variables represent quantities, and can be used in arithmetic operations (e.g., age, height, temperature).

Factor variables

Factor variables represent categories or groups and are identified by labels, called levels in R (e.g., gender, color, species). R stores each level as an integer code internally, but those codes have no inherent mathematical meaning; they are just labels.
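The distinction can be illustrated with a small sketch. The vectors below are invented for the example and are not taken from the SDHS data.

```r
# Toy vectors (made up for illustration)
household_size <- c(4, 6, 3, 8)   # numeric: arithmetic is meaningful
education <- factor(c("none", "primary", "none", "secondary"),
                    levels = c("none", "primary", "secondary"))

mean(household_size)    # arithmetic works on numeric data
levels(education)       # the category labels
table(education)        # counting categories is the natural summary for a factor
as.integer(education)   # internal codes: positions in levels(), not quantities
```

Taking the mean of the integer codes would run without error but would not describe education in any meaningful way, which is why factors are summarized with counts and proportions instead.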

Question TWO : Descriptive Statistics in R

* Instructions:

  1. Calculate the mean, median and standard deviation for the variable “V136” (number of household members).
mean(sLDHS_DATA1$V136, na.rm = TRUE)
median(sLDHS_DATA1$V136, na.rm = TRUE)
sd(sLDHS_DATA1$V136, na.rm = TRUE)
  2. Create a frequency table for the variable “V106” (education level) using the table() function.
# Create a frequency table for V106
table(sLDHS_DATA1$V106)
  3. Calculate the proportion of households in each wealth quintile (“V190”).
# Proportion of households in each wealth quintile (V190)
prop.table(table(sLDHS_DATA1$V190))
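A small sketch of how counts, proportions, and percentages relate, using invented quintile codes rather than the survey data:

```r
# Hypothetical wealth quintile codes for ten households, one of them missing
quintile <- c(1, 1, 2, 3, 3, 3, 4, 5, 5, NA)

counts <- table(quintile)      # NA is excluded by default
props  <- prop.table(counts)   # each count divided by the total, so the values sum to 1
round(100 * props, 1)          # the same proportions expressed as percentages
```

Because `table()` drops missing values by default, the proportions are computed over the nine observed households only; `table(quintile, useNA = "ifany")` would include the missing one as its own category.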
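  4. Explain how you would use R to calculate the correlation coefficient between age of household head (“V151”) and the number of living children (“V201”)
# Age of household head is V152 in the standard DHS recode (V151 is sex of household head), so V152 is used here
# Count the missing values in each variable
sum(is.na(sLDHS_DATA1$V152))
sum(is.na(sLDHS_DATA1$V201))
# Keep only the rows where both variables are observed
data_clean <- na.omit(sLDHS_DATA1[, c("V152", "V201")])
# Pearson correlation coefficient
correlation <- cor(data_clean$V152, data_clean$V201, method = "pearson")
print(correlation)

# Interpretation: the correlation between V152 and V201 is weak and positive, so households with older heads tend to have only slightly more living children.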
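As a hedged alternative to removing missing rows by hand, `cor()` can drop incomplete pairs itself and `cor.test()` adds a confidence interval. The data below are simulated for illustration and say nothing about the SDHS results; `age_head` and `children` are hypothetical names.

```r
# Simulated data (hypothetical values, not SDHS results)
set.seed(1)
age_head <- round(runif(200, 20, 80))
children <- rpois(200, lambda = 2 + age_head / 40)
children[sample(200, 10)] <- NA   # inject some missing values

# use = "complete.obs" drops incomplete pairs, like na.omit() on the two columns
r <- cor(age_head, children, use = "complete.obs", method = "pearson")

# cor.test() reports the same coefficient with a confidence interval and p-value
ct <- cor.test(age_head, children, method = "pearson")
r
ct$conf.int
```

A confidence interval that stays close to zero supports describing the relationship as weak, which is more informative than the point estimate alone.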

Question THREE : Data Cleaning in R

* Instructions:

  1. Create a new variable called “poverty_status” based on the “V190” variable (wealth quintile) and categorize households into two groups: “Poor” (Poorest, Poorer, and Middle quintiles) and “Non-Poor” (Richer and Richest quintiles).
# Load the dplyr package
library(dplyr)

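# Codes 1-3 (Poorest, Poorer, Middle) become "Poor"; codes 4-5 (Richer, Richest) become "Non-Poor"; anything else stays NA
sLDHS_DATA1$poverty_status <- ifelse(sLDHS_DATA1$V190 %in% c(1, 2, 3), "Poor",
                               ifelse(sLDHS_DATA1$V190 %in% c(4, 5), "Non-Poor", NA))
table(sLDHS_DATA1$poverty_status)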
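One way to check a recode like this is to cross-tabulate the original codes against the new variable. The sketch below uses invented quintile codes and also stores the result as a factor with an explicit level order, which is an optional design choice rather than part of the original answer.

```r
# Hypothetical wealth quintile codes (1 = Poorest ... 5 = Richest), one missing
wealth_quintile <- c(1, 2, 3, 4, 5, 3, NA)

poverty_status <- ifelse(wealth_quintile %in% 1:3, "Poor",
                  ifelse(wealth_quintile %in% 4:5, "Non-Poor", NA))

# A factor with explicit levels keeps "Poor" before "Non-Poor" in tables and plots
poverty_status <- factor(poverty_status, levels = c("Poor", "Non-Poor"))

# Each original code should fall into exactly one new category
table(wealth_quintile, poverty_status, useNA = "ifany")
```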
  2. Check for missing values in all variables using the summary() function.
summary(sLDHS_DATA1)
  3. Recode the variables “V113” (source of drinking water) and “V116” (toilet facility) into “Improved” and “Unimproved” categories based on the definitions provided in the DHS data.
# Recode V113 (source of drinking water) into a new variable 'Source_of_drinking_water' using base R
# Initialize the new variable as NA so that codes in neither list stay missing
sLDHS_DATA1$Source_of_drinking_water <- NA
sLDHS_DATA1$Source_of_drinking_water[sLDHS_DATA1$V113 %in% c(11, 12, 13, 14, 21, 51, 61, 71, 72)] <- "improved"
sLDHS_DATA1$Source_of_drinking_water[sLDHS_DATA1$V113 %in% c(32, 42, 81, 96)] <- "unimproved"
table(sLDHS_DATA1$Source_of_drinking_water)
# Recode V116 (toilet facility) into a new variable 'toilet_facility' in the same way
sLDHS_DATA1$toilet_facility <- NA
sLDHS_DATA1$toilet_facility[sLDHS_DATA1$V116 %in% c(11, 12, 13, 21, 22, 31)] <- "improved"
sLDHS_DATA1$toilet_facility[sLDHS_DATA1$V116 %in% c(14, 15, 23, 41, 51, 61, 96)] <- "unimproved"
table(sLDHS_DATA1$toilet_facility)
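Since the same pattern is applied to both variables, it could be wrapped in a small function. `recode_improved()` is a hypothetical helper written for this sketch, not a function from any package; the code lists are the V113 lists used above, and the input vector is invented.

```r
# Hypothetical helper: returns "improved", "unimproved", or NA for unlisted codes
recode_improved <- function(x, improved_codes, unimproved_codes) {
  out <- rep(NA_character_, length(x))
  out[x %in% improved_codes] <- "improved"
  out[x %in% unimproved_codes] <- "unimproved"
  out
}

# Toy V113-style codes; 99 is deliberately absent from both lists
water_codes <- c(11, 32, 21, 99, 81, NA)
water <- recode_improved(water_codes,
                         improved_codes   = c(11, 12, 13, 14, 21, 51, 61, 71, 72),
                         unimproved_codes = c(32, 42, 81, 96))

table(water, useNA = "ifany")

# Codes that were observed but not classified by either list
unclassified <- unique(water_codes[is.na(water) & !is.na(water_codes)])
unclassified
```

Listing the unclassified codes makes it easy to spot categories in the data that the two code lists do not cover.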
  4. Explain how you would handle missing values in your dataset.
# First detect the missing values, then decide whether to drop or impute them
colSums(is.na(sLDHS_DATA1))  # number of missing values in each variable
any(is.na(sLDHS_DATA1))      # TRUE if the dataset contains any missing value
sum(is.na(sLDHS_DATA1))      # total number of missing values
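Detection is only the first step. The sketch below shows two common follow-ups, listwise deletion and simple median imputation, on an invented data frame. Which one is appropriate depends on how much data is missing and why, so this is an illustration rather than a recommendation.

```r
# Toy data frame (hypothetical values)
df <- data.frame(members  = c(4, NA, 6, 3, NA, 8),
                 children = c(2, 1, NA, 0, 3, 4))

colSums(is.na(df))   # missing values per variable

# Option 1: listwise deletion keeps only fully observed rows
df_complete <- na.omit(df)

# Option 2: simple median imputation for one numeric variable
df_imputed <- df
df_imputed$members[is.na(df_imputed$members)] <- median(df$members, na.rm = TRUE)
```

Deletion is simple but shrinks the sample, while median imputation keeps every row at the cost of understating the variable's spread.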

Question FOUR : Data Visualization in R

* Instructions:

  1. Create a histogram to show the distribution of the variable “V136” (number of household members).
library(ggplot2)
hist(sLDHS_DATA1$V136, breaks = 5,
     main = "Household size (V136)", xlab = "Number of household members")
  2. Create a bar chart to visualize the proportion of households in each poverty status category (“poverty_status”).
barplot(prop.table(table(sLDHS_DATA1$poverty_status)),
        main = "Households by poverty status", ylab = "Proportion of households")
  3. Create a boxplot to compare the number of living children (“V201”) between poor and non-poor households (“poverty_status”)
boxplot(V201 ~ poverty_status, data = sLDHS_DATA1, col = "blue",
        main = "Living children (V201) by poverty status")
  4. Briefly explain the importance of choosing appropriate visualization techniques for different types of data.

Choosing the right visualization in R (or any data visualization tool) is crucial for effective communication. Different visualization techniques highlight different aspects of the data, and using the wrong one can mislead the audience or obscure important information.

Histograms:

Best for showing the distribution of a single numeric variable. They display the frequency of data points falling within specific ranges (bins). Useful for identifying patterns like skewness, modality (number of peaks), and outliers.

Bar charts:

Best for comparing frequencies or values across the categories of a categorical (factor) variable. Each bar represents a category, and its height shows the corresponding value. Avoid using bar charts for continuous numeric data.

Boxplots:

Excellent for comparing the distribution of a numeric variable across different categorical (factor) groups. They show the median, quartiles, and potential outliers for each group, providing a concise summary of the central tendency, spread, and potential extreme values.
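The three chart types can be matched to variable types in a short base R sketch. The data are simulated and the variable names are hypothetical; nothing here reflects the SDHS results.

```r
# Simulated data (hypothetical, not SDHS results)
set.seed(42)
members  <- rpois(300, lambda = 6)                               # numeric
status   <- factor(sample(c("Poor", "Non-Poor"), 300, replace = TRUE),
                   levels = c("Poor", "Non-Poor"))               # categorical
children <- rpois(300, lambda = ifelse(status == "Poor", 4, 3))  # numeric

pdf(NULL)  # draw to a null device so the script also runs without a display
h <- hist(members, main = "Household size", xlab = "Members")    # one numeric variable
barplot(prop.table(table(status)), ylab = "Proportion")          # one categorical variable
b <- boxplot(children ~ status, ylab = "Living children")        # numeric variable by group
invisible(dev.off())
```

In an interactive session the `pdf(NULL)` and `dev.off()` lines can be dropped so that the plots appear in the graphics window.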