# Setup
## Install and load the necessary packages to reproduce the report
library(haven) # Useful for importing SPSS, SAS, STATA etc. data files
library(readr) # Useful for importing data
library(foreign) # Useful for importing SPSS, SAS, STATA etc. data files
Display the structure of the data set using the str() function.
Identify and list all the variables provided in the variable code list.
library(dplyr)
sLDHS_DATA1 <- SLDHS %>%
  select("V190", "V024", "V025", "V106", "V152", "V151", "V136", "V201", "V501", "V113", "V116")
str(sLDHS_DATA1)
Numeric variables represent quantities and can be used in arithmetic operations (e.g., age, height, temperature).
Factor variables represent categories or groups and are stored as a fixed set of levels (e.g., gender, color, species). A factor may use numbers as category codes, but those numbers have no inherent mathematical meaning; they are only labels.
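The distinction can be seen on a small toy data frame. The sketch below is illustrative only: the column names, values, and level labels are hypothetical and are not taken from the SLDHS data set.

```r
# Toy data (hypothetical values, not taken from the SLDHS data set)
toy <- data.frame(
  household_size = c(4, 6, 3, 8),
  education      = c(0, 1, 1, 3)
)

# household_size is a quantity, so arithmetic is meaningful
mean_size <- mean(toy$household_size)

# education holds category codes; converting it to a factor attaches labels
# (the labels below are assumed for illustration)
toy$education <- factor(
  toy$education,
  levels = c(0, 1, 2, 3),
  labels = c("No education", "Primary", "Secondary", "Higher")
)

str(toy)              # household_size is num, education is a Factor with 4 levels
table(toy$education)  # counts per category, including empty levels

# Arithmetic on a factor is not meaningful: mean() warns and returns NA
mean_edu <- suppressWarnings(mean(toy$education))
```

Converting coded categories to factors before summarising them helps avoid accidentally averaging category codes.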
mean(sLDHS_DATA1$V136, na.rm = TRUE)
median(sLDHS_DATA1$V136, na.rm = TRUE)
sd(sLDHS_DATA1$V136, na.rm = TRUE)
# Create a frequency table for V106
table(sLDHS_DATA1$V106)
# Compute the proportion of households in each V106 category
prop.table(table(sLDHS_DATA1$V106))
sum(is.na(sLDHS_DATA1$V152))
sum(is.na(sLDHS_DATA1$V201))
data_clean <- na.omit(sLDHS_DATA1[, c("V152", "V201")])
correlation <- cor(data_clean$V152, data_clean$V201, method = "pearson")
print(correlation)
# There is a weak positive correlation between V152 and V201: as V152 increases, V201 tends to increase only slightly.
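As an alternative to removing missing values by hand, `cor()` can drop incomplete pairs itself, and `cor.test()` reports a confidence interval and p-value alongside the coefficient. The sketch below uses made-up vectors (`age` and `children` are hypothetical names and values), so its result says nothing about the SLDHS data.

```r
# Hypothetical values for illustration only
age      <- c(25, 32, NA, 41, 50, 38)
children <- c(1, 2, 3, NA, 5, 3)

# use = "complete.obs" drops any pair in which either value is missing
r <- cor(age, children, use = "complete.obs", method = "pearson")

# cor.test() also works on the complete pairs and adds inference
result <- cor.test(age, children, method = "pearson")
result$estimate   # the same coefficient as r
result$conf.int   # 95% confidence interval for the correlation
result$p.value    # p-value for the null hypothesis of zero correlation
```

Reporting the confidence interval together with the coefficient makes it easier to judge whether a weak correlation is distinguishable from zero.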
sLDHS_DATA1$poverty_status <- ifelse(
  sLDHS_DATA1$V190 %in% c(1, 2, 3), "Poor",
  ifelse(sLDHS_DATA1$V190 %in% c(4, 5), "Non-Poor", NA)
)
table(sLDHS_DATA1$poverty_status)
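The same two-group recode can also be written with `cut()`, which avoids nested `ifelse()` calls when the codes are ordered. This is a minimal sketch on a hypothetical vector of wealth-index codes, assuming the codes are plain numbers from 1 to 5; a labelled column imported with haven may need `as.numeric()` first.

```r
# Hypothetical wealth-index codes (1 = lowest, 5 = highest), with one missing value
wealth <- c(1, 2, 3, 4, 5, NA, 3)

# Intervals are right-closed by default: (0, 3] -> "Poor", (3, 5] -> "Non-Poor"
poverty_status <- cut(
  wealth,
  breaks = c(0, 3, 5),
  labels = c("Poor", "Non-Poor")
)

# useNA = "ifany" keeps missing values visible in the table
table(poverty_status, useNA = "ifany")
```

Unlike the `ifelse()` version, `cut()` returns a factor, which is usually the more convenient type for tabulating and plotting.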
summary(sLDHS_DATA1)
# Recode V113 into a new variable 'Source_of_drinking_water' using base R
# Recode V113 into "improved" and "unimproved"
# Initialize the 'Source_of_drinking_water' variable as NA
sLDHS_DATA1$Source_of_drinking_water <- NA
sLDHS_DATA1$Source_of_drinking_water[sLDHS_DATA1$V113 %in% c(11, 12, 13, 14, 21, 51, 61, 71, 72)] <- "improved"
sLDHS_DATA1$Source_of_drinking_water[sLDHS_DATA1$V113 %in% c(32, 42, 81, 96)] <- "unimproved"
table(sLDHS_DATA1$Source_of_drinking_water)
## Recode V116 into new variable 'toilet_facility' using base R
## Recode V116 into "improved" and "unimproved"
sLDHS_DATA1$toilet_facility <- NA
sLDHS_DATA1$toilet_facility[sLDHS_DATA1$V116 %in% c(11, 12, 13, 21, 22, 31)] <- "improved"
sLDHS_DATA1$toilet_facility[sLDHS_DATA1$V116 %in% c(14, 15, 23, 41, 51, 61, 96)] <- "unimproved"
table(sLDHS_DATA1$toilet_facility)
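Because the water and toilet recodes follow the same pattern, the logic could be wrapped in a small helper. The function name `recode_two_groups` and the test vector below are hypothetical; the code lists are the ones used for V116 above.

```r
# Hypothetical helper: recode a vector of codes into two labelled groups.
# Codes that appear in neither list (and missing values) stay NA.
recode_two_groups <- function(codes, improved_codes, unimproved_codes) {
  out <- rep(NA_character_, length(codes))
  out[codes %in% improved_codes]   <- "improved"
  out[codes %in% unimproved_codes] <- "unimproved"
  factor(out, levels = c("improved", "unimproved"))
}

# Made-up example codes; 99 is in neither list and therefore remains NA
toilet_codes <- c(11, 23, 96, 31, 99, NA)

toilet <- recode_two_groups(
  toilet_codes,
  improved_codes   = c(11, 12, 13, 21, 22, 31),
  unimproved_codes = c(14, 15, 23, 41, 51, 61, 96)
)

table(toilet, useNA = "ifany")
```

Tabulating with `useNA = "ifany"` makes it easy to spot codes that were not assigned to either group.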
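# To check for missing values in the data set, use the following steps:
head(is.na(sLDHS_DATA1))      # TRUE/FALSE flags for the first few rows
colSums(is.na(sLDHS_DATA1))   # number of missing values per variable
any(is.na(sLDHS_DATA1))       # TRUE if at least one value is missing
sum(is.na(sLDHS_DATA1))       # total number of missing values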
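Once missing values have been located, two common options are to drop incomplete rows or to keep every row and skip missing values in each calculation. The sketch below shows both on a toy data frame; the values are invented and only the column names mirror the report.

```r
# Toy data with missing values (hypothetical values)
toy <- data.frame(
  V152 = c(34, NA, 51, 45),
  V201 = c(2, 3, NA, 4),
  V136 = c(5, 6, 4, 7)
)

# Count missing values in each column
missing_per_column <- colSums(is.na(toy))

# Option 1: keep only rows with no missing values in any column
complete_rows <- toy[complete.cases(toy), ]

# Option 2: keep all rows and ignore missing values per calculation
mean_age <- mean(toy$V152, na.rm = TRUE)
```

Dropping rows is simple but discards information from the other columns, so the per-calculation approach is often preferable when only a few variables are analysed at a time.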
library(ggplot2)
hist(sLDHS_DATA1$V136, breaks = 5)
barplot(table(sLDHS_DATA1$poverty_status))
boxplot(sLDHS_DATA1$V201, main = "Distribution of V201", col = "blue")
Choosing the right visualization in R (or any data visualization tool) is crucial for effective communication. Different visualization techniques highlight different aspects of the data, and using the wrong one can mislead the audience or obscure important information.
Histograms are best for showing the distribution of a single numeric variable. They display the frequency of data points falling within specific ranges (bins) and are useful for identifying skewness, modality (number of peaks), and outliers.
Bar charts are best for comparing frequencies or values across the categories of a categorical (factor) variable. Each bar represents a category, and its height shows the corresponding value. Avoid using bar charts for continuous numeric data.
Box plots are excellent for comparing the distribution of a numeric variable across different categorical (factor) groups. They show the median, quartiles, and potential outliers for each group, providing a concise summary of the central tendency, spread, and potential extreme values.
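The three plot types can be compared side by side on simulated data. This is a minimal base-graphics sketch; the data frame, its column names, and the simulated values are assumptions made for illustration and do not come from the SLDHS data.

```r
set.seed(1)  # make the simulated toy data reproducible

# Hypothetical data: one numeric variable and one categorical variable
toy <- data.frame(
  household_size = rpois(200, lambda = 5),
  poverty_status = sample(c("Poor", "Non-Poor"), 200, replace = TRUE)
)

# Histogram: distribution of a single numeric variable
hist(toy$household_size,
     main = "Household size", xlab = "Number of members")

# Bar chart: frequency of each category
barplot(table(toy$poverty_status),
        main = "Poverty status", ylab = "Number of households")

# Box plot: a numeric variable compared across groups (formula interface)
boxplot(household_size ~ poverty_status, data = toy,
        main = "Household size by poverty status",
        xlab = "Poverty status", ylab = "Number of members")
```

The formula interface `numeric ~ group` is what produces one box per category, which is the comparison across groups that box plots are suited to.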