Report of Exam

Hussein Mohamed Abdillahi

2024-12-02

Firstly, I loaded all the necessary libraries

library(haven)
## Warning: package 'haven' was built under R version 4.4.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.2
library(scales)
library(knitr)
## Warning: package 'knitr' was built under R version 4.4.2

QUESTION 1: Data Structure in R

Question 1.1: I imported the SHDS dataset into R

data1<-read_dta("D:/APPLIED STATISTICS/DATA SCIENCE POST GRADUATE/Course1/EXAM FINAL/SLDHS.dta") 

Question 1.2: Displaying the structure of the dataset using the str() function

str(data1)

This gave a clear overview of the structure of the SHDS dataset.

Question 1.3: In order to identify and list all variables provided in the variable code list, I used the list and code below

The variables to select are: Wealth index (V190), Region (V024), Place of residence (V025), Education level (V106), Sex of household head (V151), Number of household members (V136), Number of living children (V201), Marital status (V501), Source of drinking water (V113), and Toilet facilities or sanitation access (V116).

selected_data <- data1 %>%
select(V190, V024, V025, V106, V151, V136, V201, V501, V113, V116)

I confirmed that the data were loaded correctly and that the column names match the ones I selected

names(data1)

names(selected_data)
##  [1] "V190" "V024" "V025" "V106" "V151" "V136" "V201" "V501" "V113" "V116"
sapply(selected_data, class)
## $V190
## [1] "haven_labelled" "vctrs_vctr"     "double"        
## 
## $V024
## [1] "haven_labelled" "vctrs_vctr"     "double"        
## 
## $V025
## [1] "haven_labelled" "vctrs_vctr"     "double"        
## 
## $V106
## [1] "haven_labelled" "vctrs_vctr"     "double"        
## 
## $V151
## [1] "numeric"
## 
## $V136
## [1] "numeric"
## 
## $V201
## [1] "numeric"
## 
## $V501
## [1] "haven_labelled" "vctrs_vctr"     "double"        
## 
## $V113
## [1] "numeric"
## 
## $V116
## [1] "numeric"
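
Before selecting, it can help to check that every requested column actually exists, since select() stops with an error on a missing name. The following is a minimal sketch on a small made-up data frame; `toy`, `wanted` and `V999` are hypothetical and are not part of the exam data.

```r
# Hypothetical toy data standing in for data1
toy <- data.frame(V190 = c(1, 5), V024 = c(11, 12), V025 = c(1, 2))

# Variables we intend to select (V999 is deliberately absent)
wanted <- c("V190", "V024", "V999")

# setdiff() returns the requested names that are not columns of toy
missing_vars <- setdiff(wanted, names(toy))
missing_vars

# Keep only the requested columns that actually exist
toy_selected <- toy[, intersect(wanted, names(toy)), drop = FALSE]
names(toy_selected)
```

An empty `missing_vars` means all requested variables are present.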

QUESTION 2: DESCRIPTIVE STATISTICS IN R

Question 2.1: Calculated the mean, median and standard deviation for the variable “V136” (Number of household members)

Mean

mean(selected_data$V136, na.rm = TRUE) 
## [1] 5.107781

Median

median(selected_data$V136, na.rm = TRUE)
## [1] 5

Standard deviation

sd(selected_data$V136, na.rm = TRUE)
## [1] 2.475511

I also applied a quick summary to view all the statistics at once

summary(selected_data$V136)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   3.000   5.000   5.108   7.000   9.000    2303

Question 2.2: Created a frequency table for the variable “V106” (Educational level) using the table() function

table(selected_data$V106) 
## 
##     0     1     2     3 
## 15287  1991   311    97

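I also tried to enhance the table by converting V106 to a factor with labelled categories. Note that the levels used below start at 1, so the 15287 cases coded 0 are turned into NA and the fourth level (“College”) stays empty.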

selected_data$V106 <- factor(selected_data$V106,levels = c(1, 2, 3, 4),labels = c("Primary", "Secondary", "University", "College"))

Create the frequency table with labelled categories

table(selected_data$V106)
## 
##    Primary  Secondary University    College 
##       1991        311         97          0
unique(selected_data$V106)
## [1] <NA>       Secondary  Primary    University
## Levels: Primary Secondary University College
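
Because table() above shows the codes 0 to 3, a factor whose levels cover 0 to 3 would keep every case. The sketch below uses made-up codes, and the four labels are an assumption based on the usual DHS coding of education level; they should be checked against the SHDS codebook before use.

```r
# Hypothetical education codes; 0..3 are the values seen in table() above
edu_codes <- c(0, 0, 1, 2, 3, 0, 1)

# Assumed labels (usual DHS coding), to be verified against the codebook
edu <- factor(edu_codes,
              levels = 0:3,
              labels = c("No education", "Primary", "Secondary", "Higher"))

# useNA = "ifany" would reveal any code that failed to match a level
table(edu, useNA = "ifany")
```

With levels that match the observed codes, no value is silently converted to NA.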

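To convert the labels back to numeric codes, the following line could be used (it was not run here):

selected_data$V106 <- factor(selected_data$V106, levels = c("Primary", "Secondary", "University", "College"), labels = c(1, 2, 3, 4))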

Question 2.3: Calculated the proportion of households in each wealth quintile (“V190”)

Firstly, I inspected the unique values in V190 to understand its range

unique(selected_data$V190) 
## <labelled<double>[5]>: Wealth index combined
## [1] 5 3 2 4 1
## 
## Labels:
##  value   label
##      1  Lowest
##      2  Second
##      3  Middle
##      4  Fourth
##      5 Highest

The value labels are 1 Lowest, 2 Second, 3 Middle, 4 Fourth and 5 Highest. I counted the number of households in each wealth quintile using the table() function:

freq_table <- table(selected_data$V190)
print(freq_table) # the observed result became (Lowest 6323,  Second 3016, Middle  2054, Fourth 2907 and Highest 3386) 
## 
##    1    2    3    4    5 
## 6323 3016 2054 2907 3386

Next, I calculated the proportion of households in each quintile by dividing the frequency of each category by the total number of observations:

prop_table <- prop.table(freq_table)
print(prop_table)
## 
##         1         2         3         4         5 
## 0.3575144 0.1705304 0.1161371 0.1643673 0.1914509

The proportion of households in each quintile is: Lowest (1) 0.3575144, Second (2) 0.1705304, Middle (3) 0.1161371, Fourth (4) 0.1643673, Highest (5) 0.1914509.
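
The same proportions can be expressed as rounded percentages. This sketch re-enters the counts reported by table() above as a named vector; `counts` and `pct` are illustrative names.

```r
# Quintile counts copied from the frequency table above
counts <- c(Lowest = 6323, Second = 3016, Middle = 2054,
            Fourth = 2907, Highest = 3386)

# prop.table() divides each count by the total; scale to percent and round
pct <- round(100 * prop.table(counts), 1)
pct
```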

Question 2.4: Explanation of how I would use R to calculate the correlation coefficient between age of household head (“V151”) and the number of living children (“V201”)

As a first step, I inspected the data for missing values in both variables, to make sure that V151 and V201 are numeric and to see how many values are missing. I used the following code:

sum(is.na(selected_data$V151))
## [1] 858

858 cases have a missing value (NA)

sum(is.na(selected_data$V201))
## [1] 13

13 cases have a missing value (NA). Missing values need to be handled, because cor() returns NA by default when either variable contains NA.

Therefore I handled the missing values with na.omit(), which excludes the rows that contain a missing value:

clean_data <- na.omit(selected_data[, c("V151", "V201")])
cor(clean_data$V151, clean_data$V201) 
## [1] 0.0005634619

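The result is 0.000563, which is essentially zero: there is no linear relationship between “V151” and the number of living children (“V201”). Note that in this dataset V151 takes only the values 1 and 2 and is listed in Question 1.3 as the sex of the household head, so the coefficient does not describe age.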
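
As an alternative to calling na.omit() first, cor() can drop incomplete pairs itself through its `use` argument. A minimal sketch on made-up vectors (`x` and `y` are hypothetical) showing that both routes give the same coefficient:

```r
# Hypothetical toy vectors with missing values in different positions
x <- c(1, 2, NA, 4, 5, 6)
y <- c(2, 1, 3, NA, 6, 5)

# Option 1: drop incomplete rows first, as done above
clean <- na.omit(data.frame(x, y))
r1 <- cor(clean$x, clean$y)

# Option 2: let cor() keep only the complete pairs
r2 <- cor(x, y, use = "complete.obs")

all.equal(r1, r2)
```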

QUESTION 3: DATA CLEANING IN R

Question 3.1: Create a new variable called “Poverty_status” based on the “V190” variable (Wealth quintile) and categorize households into groups.

“Poor”: Poorest, Poorer, and Middle Quintiles
“Non-Poor”: Richer and Richest Quintiles

table(selected_data$V190)
## 
##    1    2    3    4    5 
## 6323 3016 2054 2907 3386

To create the new variable, I used the ifelse() function to assign “Poor” or “Non-Poor” based on the values in V190

selected_data$poverty_status <- ifelse(selected_data$V190 %in% c(1, 2, 3), "Poor", "Non-Poor")

I then converted the poverty_status variable into a factor

selected_data$poverty_status <- factor(selected_data$poverty_status, levels = c("Poor", "Non-Poor"))

I checked that it was done correctly

table(selected_data$poverty_status) 
## 
##     Poor Non-Poor 
##    11393     6293

Question 3.2 Checking for missing values in all variables using summary() function.

summary(selected_data[c("V190", "V024", "V025", "V106", "V151", "V136", "V201", "V501", "V113", "V116")])
##       V190            V024            V025               V106      
##  Min.   :1.000   Min.   :11.00   Min.   :1.000   Primary   : 1991  
##  1st Qu.:1.000   1st Qu.:13.00   1st Qu.:1.000   Secondary :  311  
##  Median :2.000   Median :14.00   Median :1.000   University:   97  
##  Mean   :2.662   Mean   :13.95   Mean   :1.498   College   :    0  
##  3rd Qu.:4.000   3rd Qu.:15.00   3rd Qu.:2.000   NA's      :15287  
##  Max.   :5.000   Max.   :16.00   Max.   :2.000                     
##                                                                    
##       V151            V136            V201             V501      
##  Min.   :1.000   Min.   :0.000   Min.   : 0.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:3.000   1st Qu.: 4.000   1st Qu.:1.000  
##  Median :1.000   Median :5.000   Median : 6.000   Median :1.000  
##  Mean   :1.376   Mean   :5.108   Mean   : 6.274   Mean   :1.138  
##  3rd Qu.:2.000   3rd Qu.:7.000   3rd Qu.: 8.000   3rd Qu.:1.000  
##  Max.   :2.000   Max.   :9.000   Max.   :19.000   Max.   :3.000  
##  NA's   :858     NA's   :2303    NA's   :13                      
##       V113            V116     
##  Min.   :11.00   Min.   :11.0  
##  1st Qu.:12.00   1st Qu.:13.0  
##  Median :31.00   Median :22.0  
##  Mean   :33.44   Mean   :26.2  
##  3rd Qu.:61.00   3rd Qu.:23.0  
##  Max.   :96.00   Max.   :96.0  
##  NA's   :862     NA's   :862

Question 3.3 Recoded the variables “V113” (source of drinking water) and “V116” (Toilet facility) into “Improved” and “Unimproved” categories based on the definition provided in the DHS data.

First, I inspected both variables

table(selected_data$V113)
## 
##   11   12   13   14   21   31   32   41   42   51   61   71   72   81   91   96 
## 3762 2075  760  582  388 1616 1494  438  393  466 3835  363  124  319  104  105

These two variables have no labels, so they need to be mapped by their values.

table(selected_data$V116)
## 
##   11   12   13   14   15   21   22   23   31   41   51   61   96 
##  743  725 3428  166   23 1949 2554 4167  121  236   29 2537  146

Recode V113 (source of drinking water) into “Improved” and “Unimproved”

selected_data$V113_recoded <- ifelse(
  selected_data$V113 %in% c(11, 13, 72, 41, 32, 31, 42, 81), "Improved",
  ifelse(selected_data$V113 %in% c(61, 51, 71, 14, 12, 21), "Unimproved", NA)
)
# Codes 91 and 96, and missing values, fall through to NA


head(selected_data$V113_recoded)
## [1] "Improved" "Improved" "Improved" "Improved" "Improved" "Improved"
table(selected_data$V113_recoded)
## 
##   Improved Unimproved 
##       8906       7709
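
The same mapping can be written as a named lookup vector, which keeps the code lists in one place. This is a sketch using the code lists from the recode above on a few made-up values; `lookup` and `water` are hypothetical names.

```r
# Code lists taken from the ifelse() recode above
improved   <- c(11, 13, 31, 32, 41, 42, 72, 81)
unimproved <- c(12, 14, 21, 51, 61, 71)

# Named lookup vector: names are the codes, values are the categories
lookup <- c(setNames(rep("Improved", length(improved)), improved),
            setNames(rep("Unimproved", length(unimproved)), unimproved))

# Hypothetical values, including an unmapped code (96) and an NA
water <- c(11, 61, 96, NA, 32)

# Codes absent from the lookup, and NA, come back as NA
water_recoded <- unname(lookup[as.character(water)])
water_recoded
```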

Recode V116 (Toilet facility) into “Improved” and “Unimproved” based on the provided mapping

selected_data <- selected_data %>%
  mutate(V116_recoded = case_when(
    V116 %in% c(11, 21, 22, 12, 13, 41, 15) ~ "Improved", 
    V116 %in% c(14, 31, 23, 61, 51) ~ "Unimproved",  
    V116 %in% c(96, NA) ~ NA_character_  
  ))

View the updated dataset with recoded variable

head(selected_data)
## # A tibble: 6 × 13
##   V190        V024       V025      V106   V151  V136  V201 V501       V113  V116
##   <dbl+lbl>   <dbl+lbl>  <dbl+lbl> <fct> <dbl> <dbl> <dbl> <dbl+lbl> <dbl> <dbl>
## 1 5 [Highest] 11 [Awdal] 2 [Urban] <NA>      1     6     5 1 [Marri…    11    23
## 2 5 [Highest] 11 [Awdal] 2 [Urban] <NA>      1     6     5 1 [Marri…    11    23
## 3 5 [Highest] 11 [Awdal] 2 [Urban] <NA>      1     6     5 1 [Marri…    11    23
## 4 5 [Highest] 11 [Awdal] 2 [Urban] <NA>      1     6     5 1 [Marri…    11    23
## 5 5 [Highest] 11 [Awdal] 2 [Urban] <NA>      1     6     5 1 [Marri…    11    23
## 6 3 [Middle]  11 [Awdal] 2 [Urban] <NA>      1     4     2 1 [Marri…    13    61
## # ℹ 3 more variables: poverty_status <fct>, V113_recoded <chr>,
## #   V116_recoded <chr>

Reference: Guide to DHS Statistics (https://dhsprogram.com/data/Guide-to-DHS-Statistics/Type_of_Sanitation_Facility.htm)

Question 3.4 Explained how I handle missing values in my dataset.

Firstly, I identified whether there are missing values in my dataset (this answer is based on selected_data)

The summary() function helps me to identify the existing missing values.

summary(selected_data)  
##       V190            V024            V025               V106      
##  Min.   :1.000   Min.   :11.00   Min.   :1.000   Primary   : 1991  
##  1st Qu.:1.000   1st Qu.:13.00   1st Qu.:1.000   Secondary :  311  
##  Median :2.000   Median :14.00   Median :1.000   University:   97  
##  Mean   :2.662   Mean   :13.95   Mean   :1.498   College   :    0  
##  3rd Qu.:4.000   3rd Qu.:15.00   3rd Qu.:2.000   NA's      :15287  
##  Max.   :5.000   Max.   :16.00   Max.   :2.000                     
##                                                                    
##       V151            V136            V201             V501      
##  Min.   :1.000   Min.   :0.000   Min.   : 0.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:3.000   1st Qu.: 4.000   1st Qu.:1.000  
##  Median :1.000   Median :5.000   Median : 6.000   Median :1.000  
##  Mean   :1.376   Mean   :5.108   Mean   : 6.274   Mean   :1.138  
##  3rd Qu.:2.000   3rd Qu.:7.000   3rd Qu.: 8.000   3rd Qu.:1.000  
##  Max.   :2.000   Max.   :9.000   Max.   :19.000   Max.   :3.000  
##  NA's   :858     NA's   :2303    NA's   :13                      
##       V113            V116       poverty_status  V113_recoded      
##  Min.   :11.00   Min.   :11.0   Poor    :11393   Length:17686      
##  1st Qu.:12.00   1st Qu.:13.0   Non-Poor: 6293   Class :character  
##  Median :31.00   Median :22.0                    Mode  :character  
##  Mean   :33.44   Mean   :26.2                                      
##  3rd Qu.:61.00   3rd Qu.:23.0                                      
##  Max.   :96.00   Max.   :96.0                                      
##  NA's   :862     NA's   :862                                       
##  V116_recoded      
##  Length:17686      
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

For example, V113 and V116 each have 862 missing cases.

To see the number of missing values for each variable in my dataset, I used the function below

colSums(is.na(selected_data))
##           V190           V024           V025           V106           V151 
##              0              0              0          15287            858 
##           V136           V201           V501           V113           V116 
##           2303             13              0            862            862 
## poverty_status   V113_recoded   V116_recoded 
##              0           1071           1008

To deal with these missing values, I have two options: (1) remove the rows with missing values, or (2) impute, replacing each missing value with the mean or median.

Removing missing values:

selected_data_clean <- na.omit(selected_data)

Replace missing values in a specific variable (e.g., V190) with the mean

selected_data$V190[is.na(selected_data$V190)] <- mean(selected_data$V190, na.rm = TRUE)

Replace missing values with the median

selected_data$V190[is.na(selected_data$V190)] <- median(selected_data$V190, na.rm = TRUE)
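
Writing the imputed values to a new column, instead of overwriting the original, keeps the raw data available for comparison. This is a sketch on made-up values; `hh`, `size_imputed` and `was_imputed` are hypothetical names.

```r
# Hypothetical household sizes with two missing values
hh <- data.frame(size = c(3, NA, 5, 9, NA))

# Median of the observed values (3, 5, 9)
med <- median(hh$size, na.rm = TRUE)

# Impute into a new column and flag which rows were filled in
hh$size_imputed <- ifelse(is.na(hh$size), med, hh$size)
hh$was_imputed  <- is.na(hh$size)
hh
```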

QUESTION 4: DATA VISUALIZATION IN R

Question 4.1: Created a histogram to show the distribution of the variable “V136” (number of household members)

I used the ggplot2 package for a more polished plot (it can be installed with install.packages("ggplot2") if it is not already available).

I can create the histogram now:

ggplot(selected_data, aes(x = V136)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Household Members", 
       x = "Number of Household Members", 
       y = "Frequency") +
  theme_minimal()
## Warning: Removed 2303 rows containing non-finite outside the scale range
## (`stat_bin()`).

Question 4.2: Created a bar chart to visualize the proportion of households in each poverty status category (“poverty_status”)

Firstly, for this question I used the scales package (installed with install.packages("scales") and loaded with library(scales))

ggplot(selected_data, aes(x = poverty_status)) +
  geom_bar(aes(y = after_stat(count / sum(count))), fill = "skyblue", color = "black") +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Proportion of Households by Poverty Status", 
       x = "Poverty Status", 
       y = "Proportion (%)") +
  theme_minimal()
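
The bar heights can be checked numerically before plotting. This sketch re-enters the counts reported for poverty_status in Question 3.1; `counts` and `props` are illustrative names.

```r
# Counts copied from table(selected_data$poverty_status) in Question 3.1
counts <- c(Poor = 11393, `Non-Poor` = 6293)

# Proportion of households in each category
props <- counts / sum(counts)
round(100 * props, 1)
```

These are the values the two bars should reach on the percent axis.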

Question 4.3: Created a boxplot to compare the number of living children (“V201”) between poor and non-poor households (“poverty_status”).
I called windows() to open a larger graphics window so that the chart is displayed well.

ggplot(selected_data, aes(x = poverty_status, y = V201, fill = poverty_status)) +
  geom_boxplot() +
  labs(title = "Comparison of Number of Living Children by Poverty Status", 
       x = "Poverty Status", 
       y = "Number of Living Children") +
  scale_fill_brewer(palette = "Pastel1") + # Optional: Adds color for better visualization
  theme_minimal()
## Warning: Removed 13 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
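
The medians drawn in a boxplot can also be computed per group with tapply(). A minimal sketch on made-up values; `children` and `status` are hypothetical stand-ins for V201 and poverty_status.

```r
# Hypothetical numbers of living children, with one missing value
children <- c(4, 6, NA, 8, 2, 3)
status <- factor(c("Poor", "Poor", "Poor", "Non-Poor", "Non-Poor", "Non-Poor"),
                 levels = c("Poor", "Non-Poor"))

# Median per group, ignoring NA (the same rows the boxplot drops)
med_by_group <- tapply(children, status, median, na.rm = TRUE)
med_by_group
```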