ASSET Project_Data Cleaning
PART 1: DATA PRE-CLEANING
1.1. Set up the R environment
First, we need to clean the R environment and install some packages and libraries.
# clean R environment
rm(list = ls())
# Install some packages and call libraries
#install.packages('dplyr')
#install.packages("stringi")
#install.packages("janitor")
#install.packages("stargazer")
library(dplyr)
library(readxl)
library(tidyverse)
library(stringi)
#library(janitor)
#library(stargazer)
#library(ggplot2)1.2. Import data from Excel file
This step, we should pay attention to the path of the Excel file.
data_1 <- read_excel("D:/R/R_2023/ASSET_PCA analysis/data/asset_hhsurvey_v1.xlsx", sheet = 1) # it takes several minutes
# check if the data set is data frame
is.data.frame(data_1) [1] TRUE
# drop the time/date variables
data_1 %>%
select(-c("start_time", # Just to memorize the select function!!!
"end_time",
"today",
"o1",
"_submission_time")) -> data_2
# drop the variables with the conditions: NA/ blank/ 0
data_2 %>%
select_if(~!(all(is.na(.)) | all(. == "") | all(. == "0"))) -> data_3 # it takes several minutes1.3. Export the data into CSV
Next, we export the data into the csv file.
write.csv(data_3, # the data
"D:\\R\\R_2023\\ASSET_PCA analysis\\data\\output\\asset_hhsurvey_v3_1.csv", # working directory and file name
row.names = FALSE, # without row name because we can use the household ID
na='', # with blank for NA
) # it takes several minutes
# now we can open the .csv file to check the initial resultPART 2: DETERMINING THE VARIABLES FOR ANALYSIS
2.1. Import data from CSV file
Now we again import data from the csv file by the function fread in the library data.table.
2.2. Extract the list of coding questions
This step, we extract the coding questions in the Excel file through several sub-steps. The main point is that we have to convert the the coding questions in a data frame to a list, and then convert into numeric format. The coding questions can be downloaded at here. Pay attention to the path of the Excel file when using read_excel function.
coding_q <- read_excel("D:/R/R_2023/ASSET_PCA analysis/coding_questionnaire.xlsx", sheet = "EN") # import the sheet of coding questions
coding_q %>%
select(5) -> list_q # the questions in column 5th
colnames(list_q)[1] -> "column_q" # rename the name of column 1st
str(list_q) # check the structure. the list_q is data frame formattibble [184 × 1] (S3: tbl_df/tbl/data.frame)
$ R1C1_Style: num [1:184] 3 4 5 6 7 8 9 10 27 28 ...
as.list(list_q) -> list_q1 # convert to list format
str(list_q1) # check the structure. the list_q1 is list formatList of 1
$ R1C1_Style: num [1:184] 3 4 5 6 7 8 9 10 27 28 ...
as.numeric(unlist(list_q1)) -> list_q2 # convert to numeric format
str(list_q2) # check the structure. the list_q2 is numeric format num [1:184] 3 4 5 6 7 8 9 10 27 28 ...
2.3. Select the columns in accordance with list_q2
Now we can select all the variables/ columns needed in the analysis.
2.4. Rename the names of all columns in data_5 corresponding to the file coding_questionnaire.xlsx
This step is similar to Extract the list of coding question.
coding_q %>%
select(3) -> question_coding # in column 3th
str(question_coding) # check the structure. question_coding is data frame formattibble [184 × 1] (S3: tbl_df/tbl/data.frame)
$ Question_coding: chr [1:184] "o2" "o3" "o4" "o5" ...
2.5. Save data_5 in csv file
We now save the data into csv file for a more convenient look-up.
write.csv(data_5, # the data
"D:\\R\\R_2023\\ASSET_PCA analysis\\data\\output\\asset_hhsurvey_v3_5.csv", # working directory and file name
row.names = TRUE, # with row name
na='', # with blank for NA
) Now we can open the csv file to check the data.