install.packages("haven")
library(haven)
DATA_SLDHS <- read_dta("~/SLDHS.dta")
colnames(DATA_SLDHS)
library(dplyr)
data_SLDHS1 <- DATA_SLDHS %>% select("V190", "V024", "V025", "V106", "V152", "V151", "V136", "V201", "V501", "V113", "V116")
str(data_SLDHS1)
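# code_list is assumed to hold the variable codes selected above
code_list <- c("V190", "V024", "V025", "V106", "V152", "V151", "V136", "V201", "V501", "V113", "V116")
code_list %in% colnames(data_SLDHS1)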
sapply(data_SLDHS1, class)
Numeric variables represent quantities, and can be used in arithmetic operations (e.g., age, height, temperature).
Factor variables represent categories or groups, and are typically represented by labels or levels (e.g., gender, color, species). While a factor might use numbers to represent categories, those numbers don’t have inherent mathematical meaning; they’re just labels.
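The distinction can be illustrated with a minimal sketch on toy data. The values and the names household_size and education below are hypothetical and are not taken from the SLDHS file. Note also that columns read with read_dta() usually arrive as labelled vectors rather than factors; haven's as_factor() can convert them.

```r
# Toy data (hypothetical values, not taken from the SLDHS file)
household_size <- c(4, 6, 3, 8)                       # numeric: a quantity
education <- factor(c("Primary", "None", "Secondary", "Primary"),
                    levels = c("None", "Primary", "Secondary"))

mean_size  <- mean(household_size)    # arithmetic is meaningful for numeric data
edu_counts <- table(education)        # counting is the natural summary for a factor
edu_codes  <- as.integer(education)   # internal codes are only positions in levels()

print(mean_size)
print(edu_counts)
print(edu_codes)
```

The integer codes depend only on the order of the levels, so averaging them would not describe anything about education.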
mean(data_SLDHS1$V136, na.rm = TRUE)
median(data_SLDHS1$V136, na.rm = TRUE)
sd(data_SLDHS1$V136, na.rm = TRUE)
table(data_SLDHS1$V106)
prop.table(table(data_SLDHS1$V190))
cor(data_SLDHS1$V152, data_SLDHS1$V201, use = "complete.obs")
cor() computes the correlation coefficient between two numeric variables. Here it measures the strength and direction of the linear relationship between the age of the household head (V152) and the number of living children (V201). Setting use = "complete.obs" drops every observation with a missing value in either variable before the coefficient is computed.
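A small sketch on made-up vectors shows what the "complete.obs" option does. The names head_age and children and their values are hypothetical stand-ins, not survey data.

```r
# Hypothetical toy vectors with missing values in different positions
head_age <- c(25, 40, NA, 55, 33, 61)
children <- c(1, 3, 2, NA, 2, 5)

# Without a `use` argument, any NA makes the result NA
r_default <- cor(head_age, children)

# "complete.obs" keeps only the pairs where both values are present
r_complete <- cor(head_age, children, use = "complete.obs")

# The same result, computed by filtering the complete pairs by hand
keep <- complete.cases(head_age, children)
r_manual <- cor(head_age[keep], children[keep])

print(r_default)
print(r_complete)
print(r_manual)
```

Only four of the six pairs are complete here, so the coefficient is based on four observations. It is worth checking how many rows survive before interpreting the result.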
data_SLDHS1$poverty_status <- ifelse(data_SLDHS1$V190 <= 3, "Poor", "Non-Poor")
summary(data_SLDHS1)
data_SLDHS1$V113_recode <- ifelse(data_SLDHS1$V113 %in% c("Improved categories…"), "Improved", "Unimproved")
data_SLDHS1$V116_recode <- ifelse(data_SLDHS1$V116 %in% c("Improved categories…"), "Improved", "Unimproved")
table(data_SLDHS1$V113_recode)
table(data_SLDHS1$V116_recode)
Remove Missing Values: If the number of missing values is small, you can remove the rows containing missing data using na.omit() or filtering techniques.
Impute Missing Values: For numeric variables, missing values can be replaced with the mean, median, or other statistical estimates. For categorical variables, the mode or a predicted value based on other variables can be used.
Flag Missing Values: Create a flag variable to indicate missing data for further analysis or reporting.
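The three approaches above can be sketched on a small made-up data frame. The data frame df, its columns members and region, and all values are hypothetical; the median and the mode are used here as one possible choice of imputation, not as a recommendation for the survey data.

```r
# Hypothetical toy data with missing values in a numeric and a categorical column
df <- data.frame(
  members = c(4, NA, 6, 3, NA, 7),
  region  = c("East", "West", NA, "East", "North", "East"),
  stringsAsFactors = FALSE
)

# Remove: keep only the rows with no missing value in any column
df_complete <- na.omit(df)

# Flag: record which rows were missing before anything is filled in
df$members_missing <- is.na(df$members)

# Impute (numeric): replace missing values with the median of the observed ones
members_median <- median(df$members, na.rm = TRUE)
df$members_imputed <- ifelse(is.na(df$members), members_median, df$members)

# Impute (categorical): replace missing values with the most frequent category
region_mode <- names(which.max(table(df$region)))
df$region_imputed <- ifelse(is.na(df$region), region_mode, df$region)

print(df_complete)
print(df)
```

Creating the flag before imputing keeps a record of which values were filled in, so later analyses can be rerun with and without the imputed rows.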
hist(data_SLDHS1$V136, main = "Distribution of Number of Household Members", xlab = "Number of Household Members", col = "blue", breaks = 20)
# table() orders the categories alphabetically (Non-Poor, Poor), so the bar labels are taken from the table itself
barplot(prop.table(table(data_SLDHS1$poverty_status)), main = "Proportion of Households by Poverty Status", xlab = "Poverty Status", ylab = "Proportion", col = c("green", "red"))
# Groups are drawn in alphabetical order (Non-Poor, Poor), so the colours are listed in that order
boxplot(data_SLDHS1$V201 ~ data_SLDHS1$poverty_status, main = "Number of Living Children by Poverty Status", xlab = "Poverty Status", ylab = "Number of Living Children", col = c("green", "red"))
Choosing the right visualization in R (or any data visualization tool) is crucial for effective communication. Different visualization techniques highlight different aspects of the data, and using the wrong one can mislead the audience or obscure important information.
Histograms: Best for showing the distribution of a single numeric variable. They display the frequency of data points falling within specific ranges (bins). Useful for identifying patterns like skewness, modality (number of peaks), and outliers.
Bar charts: Best for comparing frequencies or values across the categories of a categorical (factor) variable. Each bar represents a category, and its height shows the corresponding value. Avoid using bar charts for continuous numeric data.
Box plots: Excellent for comparing the distribution of a numeric variable across different categorical (factor) groups. They show the median, quartiles, and potential outliers for each group, providing a concise summary of the central tendency, spread, and potential extreme values.
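As a rough sketch, the three chart types can be tried side by side on simulated data. The vectors members and status are hypothetical and generated at random, and the call to pdf(NULL) is only there so the example also runs without a screen; in an interactive session it can be left out along with dev.off().

```r
set.seed(1)  # reproducible simulated data (hypothetical, not from the survey)
members <- rpois(200, lambda = 5)                                  # numeric variable
status  <- factor(sample(c("Poor", "Non-Poor"), 200, replace = TRUE))  # categorical variable

pdf(NULL)  # null graphics device: nothing is written to disk

# Histogram: distribution of one numeric variable
h <- hist(members, main = "Household Members", xlab = "Members")

# Bar chart: proportion of observations in each category
bar_heights <- prop.table(table(status))
barplot(bar_heights, main = "Share by Status", ylab = "Proportion")

# Box plot: a numeric variable compared across categories
b <- boxplot(members ~ status, main = "Members by Status",
             xlab = "Status", ylab = "Members")

dev.off()

print(h$counts)      # bin counts behind the histogram
print(bar_heights)   # bar heights
print(b$stats)       # per-group whiskers, quartiles, and median
```

Each plotting call also returns the numbers it draws, which is a convenient way to check that a chart shows what it is supposed to show.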