2024-12-02
First, I loaded all the necessary libraries.
library(haven)
## Warning: package 'haven' was built under R version 4.4.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.2
library(scales)
library(knitr)
## Warning: package 'knitr' was built under R version 4.4.2
data1<-read_dta("D:/APPLIED STATISTICS/DATA SCIENCE POST GRADUATE/Course1/EXAM FINAL/SLDHS.dta")
str(data1)
This gave me an overview of the structure of the SHDS dataset.
Next, I selected the variables needed: wealth index (V190), region (V024), place of residence (V025), education level (V106), sex of household head (V151), number of household members (V136), number of living children (V201), marital status (V501), source of drinking water (V113), and toilet facility or sanitation access (V116).
selected_data <- data1 %>%
select(V190, V024, V025, V106, V151, V136, V201, V501, V113, V116)
I confirmed that the data loaded correctly and that the column names match the ones selected.
names(data1)
names(selected_data)
## [1] "V190" "V024" "V025" "V106" "V151" "V136" "V201" "V501" "V113" "V116"
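A quick check that every requested column exists can catch typos before select() fails. The following is a minimal sketch, assuming a small made-up data frame; the names `toy`, `wanted`, and the absent column `V999` are illustrative and not part of the analysis.

```r
# Toy stand-in for data1; only the column names matter here (hypothetical data)
toy <- data.frame(V190 = 1:3, V024 = 11:13, V025 = c(1, 2, 1))

# Variables we intend to select (V999 is deliberately absent)
wanted <- c("V190", "V024", "V999")

# setdiff() returns the requested names that are not columns of toy
missing_cols <- setdiff(wanted, names(toy))
print(missing_cols)

# Keep only the requested names that are actually present
present <- toy[, intersect(wanted, names(toy)), drop = FALSE]
print(names(present))
```

If `missing_cols` is empty, all requested variables are available.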
sapply(selected_data, class)
## $V190
## [1] "haven_labelled" "vctrs_vctr" "double"
##
## $V024
## [1] "haven_labelled" "vctrs_vctr" "double"
##
## $V025
## [1] "haven_labelled" "vctrs_vctr" "double"
##
## $V106
## [1] "haven_labelled" "vctrs_vctr" "double"
##
## $V151
## [1] "numeric"
##
## $V136
## [1] "numeric"
##
## $V201
## [1] "numeric"
##
## $V501
## [1] "haven_labelled" "vctrs_vctr" "double"
##
## $V113
## [1] "numeric"
##
## $V116
## [1] "numeric"
Mean
mean(selected_data$V136, na.rm = TRUE)
## [1] 5.107781
Median
median(selected_data$V136, na.rm = TRUE)
## [1] 5
Standard deviation
sd(selected_data$V136, na.rm = TRUE)
## [1] 2.475511
I also used summary() to view these statistics at once.
summary(selected_data$V136)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 3.000 5.000 5.108 7.000 9.000 2303
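The three statistics above can also be collected in one named vector. This is a small sketch on made-up values; the vector `x` and the object `stats` are illustrative only.

```r
# Made-up household sizes with one missing value (not the survey data)
x <- c(2, 4, 4, 6, NA, 9)

# Apply each summary function with na.rm = TRUE and keep the names
stats <- sapply(
  list(mean = mean, median = median, sd = sd),
  function(f) f(x, na.rm = TRUE)
)
print(stats)
```

The same pattern should work on a column such as selected_data$V136, since each function accepts na.rm.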
Question 2.2: Created a frequency table for the variable “V106” (educational level) using the table() function
table(selected_data$V106)
##
## 0 1 2 3
## 15287 1991 311 97
I also tried to enhance the table by converting V106 to a labelled factor. Note that the levels below start at 1, so the 15287 observations coded 0 are not matched and become NA.
selected_data$V106 <- factor(selected_data$V106,levels = c(1, 2, 3, 4),labels = c("Primary", "Secondary", "University", "College"))
Create the frequency table with labelled categories
table(selected_data$V106)
##
## Primary Secondary University College
## 1991 311 97 0
unique(selected_data$V106)
## [1] <NA> Secondary Primary University
## Levels: Primary Secondary University College
To convert the labels back to numeric codes (not run here):
selected_data$V106 <- factor(selected_data$V106, levels = c("Primary", "Secondary", "University", "College"), labels = c(1, 2, 3, 4))
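The effect of the chosen levels on the factor conversion above can be illustrated on a toy vector. This is a sketch only: the vector `x` is made up, and the labels "No education" and "Higher" in the second version are assumptions for illustration, not labels confirmed from the dataset.

```r
# Made-up education codes, including the code 0 seen in the frequency table
x <- c(0, 1, 2, 3, 0)

# Levels 1-4: any value not listed in levels (here 0) is turned into NA
f_bad <- factor(x, levels = c(1, 2, 3, 4),
                labels = c("Primary", "Secondary", "University", "College"))
print(sum(is.na(f_bad)))

# Levels 0-3: every observed code gets a label (labels here are assumed)
f_all <- factor(x, levels = 0:3,
                labels = c("No education", "Primary", "Secondary", "Higher"))
print(table(f_all))
```

When a variable is a haven_labelled vector, its stored value labels are a safer source for the label text than typing them by hand.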
First, I inspected V190 to check its unique values and understand its range
unique(selected_data$V190)
## <labelled<double>[5]>: Wealth index combined
## [1] 5 3 2 4 1
##
## Labels:
## value label
## 1 Lowest
## 2 Second
## 3 Middle
## 4 Fourth
## 5 Highest
The value labels are 1 = Lowest, 2 = Second, 3 = Middle, 4 = Fourth, 5 = Highest. I counted the number of households in each wealth quintile using the table() function:
freq_table <- table(selected_data$V190)
print(freq_table) # the observed result became (Lowest 6323, Second 3016, Middle 2054, Fourth 2907 and Highest 3386)
##
## 1 2 3 4 5
## 6323 3016 2054 2907 3386
Next, I calculated the proportion of households in each quintile by dividing the frequency of each category by the total number of observations:
prop_table <- prop.table(freq_table)
print(prop_table)
##
## 1 2 3 4 5
## 0.3575144 0.1705304 0.1161371 0.1643673 0.1914509
The proportions of households in each quintile are: 1 = 0.3575144, 2 = 0.1705304, 3 = 0.1161371, 4 = 0.1643673, 5 = 0.1914509.
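The arithmetic behind these proportions can be reproduced directly from the counts reported above. This sketch types the counts in by hand; the object names `counts`, `props`, and `pct` are illustrative.

```r
# Counts per wealth quintile, copied from the frequency table above
counts <- c(Lowest = 6323, Second = 3016, Middle = 2054,
            Fourth = 2907, Highest = 3386)

# Each proportion is the count divided by the total number of observations
props <- prop.table(counts)

# The same values expressed as percentages, rounded to one decimal place
pct <- round(100 * props, 1)
print(sum(counts))
print(pct)
```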
First, I inspected both variables for missing values, since V151 and V201 must be numeric and free of missing values before computing a correlation. I used the following code:
sum(is.na(selected_data$V151))
## [1] 858
858 cases are missing (NA).
sum(is.na(selected_data$V201))
## [1] 13
13 cases are missing (NA). Missing values must be handled because correlation calculations cannot include NA.
Therefore I used na.omit() to exclude the rows with missing values:
clean_data <- na.omit(selected_data[, c("V151", "V201")])
cor(clean_data$V151, clean_data$V201)
## [1] 0.0005634619
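The result is 0.000563, which means there is essentially no linear correlation between V151 and the number of living children (“V201”). Note that V151 is listed above as the sex of the household head and takes only the values 1 and 2 in this dataset, so this coefficient does not describe the age of the household head.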
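As an alternative to building a cleaned data frame first, cor() can drop incomplete pairs itself through its `use` argument. The sketch below uses made-up vectors `x` and `y`, not the survey variables.

```r
# Made-up paired values with missing entries in different positions
x <- c(1, 2, NA, 4, 5)
y <- c(2, 4, 6, NA, 10)

# With the default settings, any NA makes the result NA
r_default <- cor(x, y)
print(r_default)

# use = "complete.obs" keeps only the pairs where both values are present
r_complete <- cor(x, y, use = "complete.obs")
print(r_complete)
```

Here the three complete pairs lie exactly on a line, so the second coefficient is 1.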
The new variable groups the wealth quintiles as follows:
“Poor”: Poorest, Poorer, and Middle quintiles
“Non-Poor”: Richer and Richest quintiles
table(selected_data$V190)
##
## 1 2 3 4 5
## 6323 3016 2054 2907 3386
To create the new variable, I used the ifelse() function to assign “Poor” or “Non-Poor” based on the values in V190:
selected_data$poverty_status <- ifelse(selected_data$V190 %in% c(1, 2, 3), "Poor", "Non-Poor")
I then converted the poverty_status variable into a factor:
selected_data$poverty_status <- factor(selected_data$poverty_status, levels = c("Poor", "Non-Poor"))
I checked that it was done correctly:
table(selected_data$poverty_status)
##
## Poor Non-Poor
## 11393 6293
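One detail of the ifelse() recode is worth knowing: `%in%` returns FALSE for NA, so a missing quintile would be labelled “Non-Poor”. V190 has no missing values in this dataset, so the counts above are unaffected. The sketch below shows the behaviour on a made-up vector `q`; the names `naive` and `guarded` are illustrative.

```r
# Made-up quintile values, with one missing entry
q <- c(1, 3, 5, NA)

# Same pattern as above: NA %in% c(1, 2, 3) is FALSE, so NA becomes "Non-Poor"
naive <- ifelse(q %in% c(1, 2, 3), "Poor", "Non-Poor")
print(naive)

# Guarded version: keep missing input as missing output
guarded <- ifelse(is.na(q), NA_character_,
                  ifelse(q %in% c(1, 2, 3), "Poor", "Non-Poor"))
print(guarded)
```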
summary(selected_data[c("V190", "V024", "V025", "V106", "V151", "V136", "V201", "V501", "V113", "V116")])
## V190 V024 V025 V106
## Min. :1.000 Min. :11.00 Min. :1.000 Primary : 1991
## 1st Qu.:1.000 1st Qu.:13.00 1st Qu.:1.000 Secondary : 311
## Median :2.000 Median :14.00 Median :1.000 University: 97
## Mean :2.662 Mean :13.95 Mean :1.498 College : 0
## 3rd Qu.:4.000 3rd Qu.:15.00 3rd Qu.:2.000 NA's :15287
## Max. :5.000 Max. :16.00 Max. :2.000
##
## V151 V136 V201 V501
## Min. :1.000 Min. :0.000 Min. : 0.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:3.000 1st Qu.: 4.000 1st Qu.:1.000
## Median :1.000 Median :5.000 Median : 6.000 Median :1.000
## Mean :1.376 Mean :5.108 Mean : 6.274 Mean :1.138
## 3rd Qu.:2.000 3rd Qu.:7.000 3rd Qu.: 8.000 3rd Qu.:1.000
## Max. :2.000 Max. :9.000 Max. :19.000 Max. :3.000
## NA's :858 NA's :2303 NA's :13
## V113 V116
## Min. :11.00 Min. :11.0
## 1st Qu.:12.00 1st Qu.:13.0
## Median :31.00 Median :22.0
## Mean :33.44 Mean :26.2
## 3rd Qu.:61.00 3rd Qu.:23.0
## Max. :96.00 Max. :96.0
## NA's :862 NA's :862
First, I inspected both variables:
table(selected_data$V113)
##
## 11 12 13 14 21 31 32 41 42 51 61 71 72 81 91 96
## 3762 2075 760 582 388 1616 1494 438 393 466 3835 363 124 319 104 105
These two variables have no value labels, so they need to be mapped by their numeric codes.
table(selected_data$V116)
##
## 11 12 13 14 15 21 22 23 31 41 51 61 96
## 743 725 3428 166 23 1949 2554 4167 121 236 29 2537 146
Recode V113 (source of drinking water) into “Improved” and “Unimproved”
# Codes 91 and 96, and missing values, fall through to NA
selected_data$V113_recoded <- ifelse(
  selected_data$V113 %in% c(11, 13, 72, 41, 32, 31, 42, 81), "Improved",
  ifelse(selected_data$V113 %in% c(61, 51, 71, 14, 12, 21), "Unimproved", NA)
)
head(selected_data$V113_recoded)
## [1] "Improved" "Improved" "Improved" "Improved" "Improved" "Improved"
table(selected_data$V113_recoded)
##
## Improved Unimproved
## 8906 7709
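The same mapping can be written as a named lookup vector, which keeps the code lists in one place. This is a sketch of an alternative, not the method used above; the code groups are copied from the recode, while `lookup`, `codes`, and `recoded` are illustrative names.

```r
# Code groups copied from the recode above
improved   <- c(11, 13, 31, 32, 41, 42, 72, 81)
unimproved <- c(12, 14, 21, 51, 61, 71)

# Named character vector: names are the codes, values are the categories
lookup <- c(
  setNames(rep("Improved", length(improved)), improved),
  setNames(rep("Unimproved", length(unimproved)), unimproved)
)

# Made-up input; 96 and NA have no entry in the lookup, so they become NA
codes <- c(11, 61, 96, NA, 32)
recoded <- unname(lookup[as.character(codes)])
print(recoded)
```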
Recode V116 (toilet facility) into “Improved” and “Unimproved”, based on the provided mapping:
selected_data <- selected_data %>%
  mutate(V116_recoded = case_when(
    V116 %in% c(11, 21, 22, 12, 13, 41, 15) ~ "Improved",
    V116 %in% c(14, 31, 23, 61, 51) ~ "Unimproved",
    V116 %in% c(96, NA) ~ NA_character_
  ))
View the updated dataset with the recoded variables
head(selected_data)
## # A tibble: 6 × 13
## V190 V024 V025 V106 V151 V136 V201 V501 V113 V116
## <dbl+lbl> <dbl+lbl> <dbl+lbl> <fct> <dbl> <dbl> <dbl> <dbl+lbl> <dbl> <dbl>
## 1 5 [Highest] 11 [Awdal] 2 [Urban] <NA> 1 6 5 1 [Marri… 11 23
## 2 5 [Highest] 11 [Awdal] 2 [Urban] <NA> 1 6 5 1 [Marri… 11 23
## 3 5 [Highest] 11 [Awdal] 2 [Urban] <NA> 1 6 5 1 [Marri… 11 23
## 4 5 [Highest] 11 [Awdal] 2 [Urban] <NA> 1 6 5 1 [Marri… 11 23
## 5 5 [Highest] 11 [Awdal] 2 [Urban] <NA> 1 6 5 1 [Marri… 11 23
## 6 3 [Middle] 11 [Awdal] 2 [Urban] <NA> 1 4 2 1 [Marri… 13 61
## # ℹ 3 more variables: poverty_status <fct>, V113_recoded <chr>,
## # V116_recoded <chr>
Reference: Guide to DHS Statistics, Type of Sanitation Facility (https://dhsprogram.com/data/Guide-to-DHS-Statistics/Type_of_Sanitation_Facility.htm)
First, I checked whether there are missing values in my dataset (this answer is based on selected_data).
The summary() function helps identify the existing missing values.
summary(selected_data)
## V190 V024 V025 V106
## Min. :1.000 Min. :11.00 Min. :1.000 Primary : 1991
## 1st Qu.:1.000 1st Qu.:13.00 1st Qu.:1.000 Secondary : 311
## Median :2.000 Median :14.00 Median :1.000 University: 97
## Mean :2.662 Mean :13.95 Mean :1.498 College : 0
## 3rd Qu.:4.000 3rd Qu.:15.00 3rd Qu.:2.000 NA's :15287
## Max. :5.000 Max. :16.00 Max. :2.000
##
## V151 V136 V201 V501
## Min. :1.000 Min. :0.000 Min. : 0.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:3.000 1st Qu.: 4.000 1st Qu.:1.000
## Median :1.000 Median :5.000 Median : 6.000 Median :1.000
## Mean :1.376 Mean :5.108 Mean : 6.274 Mean :1.138
## 3rd Qu.:2.000 3rd Qu.:7.000 3rd Qu.: 8.000 3rd Qu.:1.000
## Max. :2.000 Max. :9.000 Max. :19.000 Max. :3.000
## NA's :858 NA's :2303 NA's :13
## V113 V116 poverty_status V113_recoded
## Min. :11.00 Min. :11.0 Poor :11393 Length:17686
## 1st Qu.:12.00 1st Qu.:13.0 Non-Poor: 6293 Class :character
## Median :31.00 Median :22.0 Mode :character
## Mean :33.44 Mean :26.2
## 3rd Qu.:61.00 3rd Qu.:23.0
## Max. :96.00 Max. :96.0
## NA's :862 NA's :862
## V116_recoded
## Length:17686
## Class :character
## Mode :character
##
##
##
##
The summary shows missing values in several variables, for example 862 cases each in V113 and V116.
To count the missing values in each variable of the dataset, I used the function below:
colSums(is.na(selected_data))
## V190 V024 V025 V106 V151
## 0 0 0 15287 858
## V136 V201 V501 V113 V116
## 2303 13 0 862 862
## poverty_status V113_recoded V116_recoded
## 0 1071 1008
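Counts of missing values are easier to compare when shown as a share of the rows. A minimal sketch on a made-up data frame follows; `toy`, `n_missing`, and `pct_missing` are illustrative names.

```r
# Made-up data frame with a different number of NAs per column
toy <- data.frame(a = c(1, NA, 3, 4),
                  b = c(NA, NA, 1, 2),
                  c = 1:4)

# Number of missing values per column, as in the step above
n_missing <- colSums(is.na(toy))

# Percentage of rows that are missing in each column
pct_missing <- 100 * colMeans(is.na(toy))
print(n_missing)
print(pct_missing)
```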
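To deal with these missing values, I have two options: (1) remove the rows with missing values, or (2) impute, replacing each missing value with the mean or median.
Option 1, removing missing values. Note that na.omit() drops every row with an NA in any column, so with 15287 missing values in V106 most of the 17686 rows are removed:
selected_data_clean <- na.omit(selected_data)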
Option 2, imputation. Replace missing values in a specific variable (e.g., V190) with the mean. V190 itself has no missing values here, so this line only illustrates the pattern:
selected_data$V190[is.na(selected_data$V190)] <- mean(selected_data$V190, na.rm = TRUE)
Alternatively, replace missing values with the median:
selected_data$V190[is.na(selected_data$V190)] <- median(selected_data$V190, na.rm = TRUE)
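The imputation pattern is easier to follow on a variable that actually contains NAs. The sketch below applies median imputation to a made-up vector and writes the result to a copy, so the original values stay available; `x`, `x_imputed`, and `med` are illustrative names.

```r
# Made-up household sizes with two missing values
x <- c(3, NA, 5, 7, NA)

# Median of the observed values only
med <- median(x, na.rm = TRUE)

# Impute into a copy so the original vector is left unchanged
x_imputed <- x
x_imputed[is.na(x_imputed)] <- med
print(x_imputed)
```

In this dataset the same pattern could be applied to a column with missing values such as V136, if imputation is judged appropriate for it.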
Created a histogram to show the distribution of the variable “V136” (number of household members).
I used the ggplot2 package for a more polished plot; it can be installed with install.packages("ggplot2").
I can now create the histogram:
ggplot(selected_data, aes(x = V136)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Household Members",
       x = "Number of Household Members",
       y = "Frequency") +
  theme_minimal()
## Warning: Removed 2303 rows containing non-finite outside the scale range
## (`stat_bin()`).
Question 4.2: Created a bar chart to visualize the proportion of households in each poverty status category (“poverty_status”)
First, for this question I installed and loaded the scales package with install.packages("scales") and library(scales).
ggplot(selected_data, aes(x = poverty_status)) +
  # after_stat() replaces the ..count.. notation, deprecated since ggplot2 3.4.0
  geom_bar(aes(y = after_stat(count / sum(count))), fill = "skyblue", color = "black") +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Proportion of Households by Poverty Status",
       x = "Poverty Status",
       y = "Proportion (%)") +
  theme_minimal()
Question 4.3: Created a boxplot to compare the number of living children (“V201”) between poor and non-poor households (“poverty_status”).
windows()
Calling windows() opens a separate, larger graphics window (on Windows) so that the chart is displayed clearly.
ggplot(selected_data, aes(x = poverty_status, y = V201, fill = poverty_status)) +
  geom_boxplot() +
  labs(title = "Comparison of Number of Living Children by Poverty Status",
       x = "Poverty Status",
       y = "Number of Living Children") +
  scale_fill_brewer(palette = "Pastel1") + # Optional: adds color for better visualization
  theme_minimal()
## Warning: Removed 13 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
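The comparison in the boxplot can be backed with numbers by computing the median per group. This sketch uses made-up values rather than the survey data; `children`, `status`, and `group_medians` are illustrative names.

```r
# Made-up numbers of living children and poverty status (not the survey data)
children <- c(2, 4, 6, 8, NA, 1, 3, 5)
status <- factor(c("Poor", "Poor", "Poor", "Poor", "Poor",
                   "Non-Poor", "Non-Poor", "Non-Poor"),
                 levels = c("Poor", "Non-Poor"))

# Median number of children within each group, ignoring missing values
group_medians <- tapply(children, status, median, na.rm = TRUE)
print(group_medians)
```

On the real data the equivalent call would use selected_data$V201 and selected_data$poverty_status.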