Question ONE : Data Structures in R

* Instructions:

Import the SDHS data set into R.

Install haven package if you haven’t already

install.packages("haven")

Run the haven package

library(haven)

Import the STATA file into R

DATA_SLDHS <- read_dta("~/SLDHS.dta")

Display the structure of the data set using the str() function.

List all variables

Identify and list all the variables provided in the variable code list.

Identify the variables in the dataset

colnames(DATA_SLDHS)

Select only the relevant variables

Loading the dplyr package

library(dplyr)

data_SLDHS1 <- DATA_SLDHS %>% select("V190", "V024", "V025", "V106", "V152", "V151", "V136", "V201", "V501", "V113", "V116")

Check the structure of the new dataset

str(data_SLDHS1)

Check if the provided variables are in the dataset

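code_list <- c("V190", "V024", "V025", "V106", "V152", "V151", "V136", "V201", "V501", "V113", "V116")

code_list %in% colnames(data_SLDHS1)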
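A membership check like this returns one TRUE/FALSE per name, which can be hard to read for long lists. As a minimal sketch on made-up names (the vectors wanted and available are hypothetical, and V999 is an invented name used only to force a mismatch), setdiff() lists exactly which requested variables are absent:

```r
# Hypothetical variable names for illustration; V999 is deliberately invalid
wanted    <- c("V190", "V024", "V999")
available <- c("V190", "V024", "V025")

present <- wanted %in% available            # one logical value per requested name
all(present)                                # TRUE only if every name was found

missing_vars <- setdiff(wanted, available)  # names requested but not available
missing_vars
```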

Determine the data types of each variable.

Check data types of each variable

sapply(data_SLDHS1, class)

Briefly explain the difference between numeric and factor variables.

Answer

Numeric variables

Numeric variables represent quantities, and can be used in arithmetic operations (e.g., age, height, temperature).

Factor variables

Factor variables represent categories or groups, and are typically represented by labels or levels (e.g., gender, color, species). While a factor might use numbers to represent categories, those numbers don’t have inherent mathematical meaning; they’re just labels.
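The distinction can be seen on a small example. The values below are made up for illustration and are not taken from the SLDHS data; the variable names household_size and education are likewise hypothetical.

```r
# Illustrative values only; not taken from the SLDHS data
household_size <- c(4, 7, 5, 9)                                 # numeric: a quantity
education <- factor(c("None", "Primary", "None", "Secondary"))  # factor: categories

class(household_size)   # "numeric"
class(education)        # "factor"

mean(household_size)    # arithmetic is meaningful for a numeric variable
levels(education)       # the category labels, sorted alphabetically by default
as.integer(education)   # internal integer codes; these are labels, not quantities
```

Taking the mean of as.integer(education) would run without error, but the result would not describe anything real, which is why categorical variables are usually summarised with table() instead.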

Question TWO : Descriptive Statistics in R

* Instructions:

Calculate the mean, median and standard deviation for the variable “V136” (number of household members).

Mean

mean(data_SLDHS1$V136, na.rm = TRUE)

Median

median(data_SLDHS1$V136, na.rm = TRUE)

Standard deviation

sd(data_SLDHS1$V136, na.rm = TRUE)

Create a frequency table for the variable “V106” (education level) using the table() function.

Create a frequency table for V106

table(data_SLDHS1$V106)

Calculate the proportion of households in each wealth quintile (“V190”).

Create a proportion table for V190

prop.table(table(data_SLDHS1$V190))

Explain how you would use R to calculate the correlation coefficient between age of household head (“V151”) and the number of living children (“V201”)

Calculate the correlation coefficient

cor(data_SLDHS1$V151, data_SLDHS1$V201, use = "complete.obs")

Explanation: The cor() function in R computes the correlation coefficient (Pearson by default) between two numeric variables. Here it measures the strength and direction of the linear relationship between the age of the household head (V151) and the number of living children (V201). Setting use = "complete.obs" restricts the calculation to observations where both variables are non-missing.
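The effect of the use argument can be checked on a toy example. The vectors age_head and children below hold made-up values, not SLDHS data; the sketch only shows that "complete.obs" gives the same result as dropping incomplete pairs by hand.

```r
# Made-up values for illustration; NA marks a missing observation
age_head <- c(30, 45, NA, 52, 38, 60)
children <- c(2, 4, 3, NA, 3, 6)

# Without `use`, a single NA makes the whole result NA
cor(age_head, children)

# "complete.obs" keeps only the pairs where both values are present
r_complete <- cor(age_head, children, use = "complete.obs")

# Equivalent manual calculation on the complete pairs
keep <- complete.cases(age_head, children)
r_manual <- cor(age_head[keep], children[keep])

all.equal(r_complete, r_manual)
```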

Question THREE : Data Cleaning in R

* Instructions:

Create a new variable called “poverty_status” based on the “V190” variable (wealth quintile) and categorize households into two groups: “Poor”: Poorest, Poorer, and Middle quintiles; “Non-Poor”: Richer and Richest quintiles.

Create a new variable poverty_status

data_SLDHS1$poverty_status <- ifelse(data_SLDHS1$V190 <= 3, "Poor", "Non-Poor")
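Because ifelse() returns a character vector here, later tables and plots will order the categories alphabetically ("Non-Poor" before "Poor"). The sketch below shows one way to control that order by converting the result to a factor. It uses a hypothetical vector wealth and assumes the quintiles are coded 1 (Poorest) to 5 (Richest), which should be confirmed against the value labels in the data.

```r
# Hypothetical quintile codes, assuming 1 = Poorest ... 5 = Richest
wealth <- c(1, 2, 3, 4, 5, NA, 3)

status <- ifelse(wealth <= 3, "Poor", "Non-Poor")   # a missing quintile stays NA

# Fix the level order so tables and plots show "Poor" first
status <- factor(status, levels = c("Poor", "Non-Poor"))

table(status, useNA = "ifany")
```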

Check for missing values in all variables using the summary() function.

Check for missing values in the dataset

summary(data_SLDHS1)
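Note that summary() reports an NA count for numeric columns but not for character columns such as poverty_status. A compact alternative is to count missing values directly. The sketch below uses a made-up data frame called toy in place of data_SLDHS1.

```r
# Toy data frame standing in for data_SLDHS1 (values are made up)
toy <- data.frame(
  V136 = c(5, NA, 8, 3),
  V201 = c(2, 4, NA, NA)
)

colSums(is.na(toy))    # number of missing values in each column
colMeans(is.na(toy))   # share of missing values in each column
```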

Recode the variables “V113” (source of drinking water) and “V116” (toilet facility) into “Improved” and “Unimproved” categories based on the definitions provided in the DHS data.

Recode V113 (source of drinking water)

data_SLDHS1$V113_recode <- ifelse(data_SLDHS1$V113 %in% c("Improved categories…"), "Improved", "Unimproved")

Recode V116 (toilet facility)

data_SLDHS1$V116_recode <- ifelse(data_SLDHS1$V116 %in% c("Improved categories…"), "Improved", "Unimproved")

Verify recoded variables

table(data_SLDHS1$V113_recode)

table(data_SLDHS1$V116_recode)
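One caveat with recoding through %in% is that a missing value is never "in" the list, so it would be labelled "Unimproved" rather than staying missing. The sketch below shows one way to guard against that. The codes 11 and 12 and the vector water are hypothetical and chosen purely for illustration; the actual list of improved categories must come from the DHS definitions.

```r
# Hypothetical codes: 11 and 12 are treated as "improved" for illustration only;
# the real list must come from the DHS definitions
improved_codes <- c(11, 12)
water <- c(11, 32, NA, 12, 43)

# NA %in% improved_codes is FALSE, so a plain ifelse() would label the
# missing value "Unimproved"; handle NA explicitly to keep it missing
water_recode <- ifelse(is.na(water), NA_character_,
                       ifelse(water %in% improved_codes, "Improved", "Unimproved"))

table(water_recode, useNA = "ifany")
```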

Explain how you would handle missing values in your dataset.

Remove Missing Values: If the number of missing values is small, you can remove the rows containing missing data using na.omit() or filtering techniques.

Impute Missing Values: For numeric variables, missing values can be replaced with the mean, median, or other statistical estimates. For categorical variables, the mode or a predicted value based on other variables can be used.

Flag Missing Values: Create a flag variable to indicate missing data for further analysis or reporting.
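These options can be sketched on a small made-up data frame. The names toy, members and children are hypothetical, and median imputation is shown only as one simple choice, not as a recommendation for the SLDHS data.

```r
# Toy data frame with made-up values
toy <- data.frame(
  members  = c(5, NA, 8, 3, 6),
  children = c(2, 4, NA, 1, 3)
)

# 1. Remove rows that contain any missing value
toy_complete <- na.omit(toy)
nrow(toy_complete)

# 2. Flag first, so the information about what was missing is not lost
toy$members_missing <- is.na(toy$members)

# 3. Impute the missing values with the median of the observed values
toy$members_imputed <- ifelse(is.na(toy$members),
                              median(toy$members, na.rm = TRUE),
                              toy$members)
toy
```

Creating the flag before imputing keeps a record of which values were filled in, so analyses can later be repeated with and without the imputed rows.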

Question FOUR : Data Visualization in R

* Instructions:

Create a histogram to show the distribution of the variable “V136” (number of household members).

Create a histogram for the number of household members

hist(data_SLDHS1$V136, main = "Distribution of Number of Household Members", xlab = "Number of Household Members", col = "blue", breaks = 20)

Create a bar chart to visualize the proportion of households in each poverty status category (“poverty_status”).

Create a bar chart for poverty_status

barplot(prop.table(table(data_SLDHS1$poverty_status)), main = "Proportion of Households by Poverty Status", xlab = "Poverty Status", ylab = "Proportion", col = c("red", "green"))

Create a box plot to compare the number of living children (“V201”) between poor and non-poor households (“poverty_status”).

Create a boxplot for the number of living children by poverty_status

boxplot(data_SLDHS1$V201 ~ data_SLDHS1$poverty_status, main = "Number of Living Children by Poverty Status", xlab = "Poverty Status", ylab = "Number of Living Children", col = c("red", "green"))

Briefly explain the importance of choosing appropriate visualization techniques for different types of data.

Choosing the right visualization in R (or any data visualization tool) is crucial for effective communication. Different visualization techniques highlight different aspects of the data, and using the wrong one can mislead the audience or obscure important information.

Histograms:

Best for showing the distribution of a single numeric variable. They display the frequency of data points falling within specific ranges (bins). Useful for identifying patterns like skewness, modality (number of peaks), and outliers.

Bar charts:

Best for comparing frequencies or values across the categories of a categorical (factor) variable. Each bar represents a category, and its height shows the corresponding value. Avoid using bar charts for continuous numeric data.

Boxplots:

Excellent for comparing the distribution of a numeric variable across different categorical (factor) groups. They show the median, quartiles, and potential outliers for each group, providing a concise summary of the central tendency, spread, and potential extreme values.

Project: Final Exam

Mustafa Hassan Dahir Ali

2024-11-25
