ASSET Project_Data Cleaning

PART 1: DATA PRE-CLEANING

1.1. Set up the R environment

First, we need to clean the R environment and install some packages and libraries.

# clean R environment

rm(list = ls())

# Install some packages and call libraries

#install.packages('dplyr')

#install.packages("stringi")

#install.packages("janitor")

#install.packages("stargazer")

library(dplyr)

library(readxl)

library(tidyverse)

library(stringi)

#library(janitor)

#library(stargazer)

#library(ggplot2)

1.2. Import data from Excel file

This step, we should pay attention to the path of the Excel file.

data_1 <- read_excel("D:/R/R_2023/ASSET_PCA analysis/data/asset_hhsurvey_v1.xlsx", sheet = 1) # it takes several minutes

# check if the data set is data frame

is.data.frame(data_1) 
[1] TRUE
# drop the time/date variables

data_1 %>%
  select(-c("start_time",                     # Just to memorize the select function!!!
            "end_time",
            "today",
            "o1",
            "_submission_time")) -> data_2

# drop the variables with the conditions: NA/ blank/ 0

data_2 %>%                                      
  select_if(~!(all(is.na(.)) | all(. == "") | all(. == "0"))) -> data_3  # it takes several minutes

1.3. Export the data into CSV

Next, we export the data into the csv file.

write.csv(data_3,                                                                      # the data
          "D:\\R\\R_2023\\ASSET_PCA analysis\\data\\output\\asset_hhsurvey_v3_1.csv",  # working directory and file name
          row.names = FALSE,                                                           # without row name because we can use the household ID
          na='',                                                                       # with blank for NA
          )                                                                            # it takes several minutes

# now we can open the .csv file to check the initial result

PART 2: DETERMINING THE VARIABLES FOR ANALYSIS

2.1. Import data from CSV file

Now we again import data from the csv file by the function fread in the library data.table.

library(data.table)

data_4 <- fread("D:/R/R_2023/ASSET_PCA analysis/data/output/asset_hhsurvey_v3_1.csv")  # fread function is the quickest one compared to read.csv and read_csv  

2.2. Extract the list of coding questions

This step, we extract the coding questions in the Excel file through several sub-steps. The main point is that we have to convert the the coding questions in a data frame to a list, and then convert into numeric format. The coding questions can be downloaded at here. Pay attention to the path of the Excel file when using read_excel function.

coding_q <- read_excel("D:/R/R_2023/ASSET_PCA analysis/coding_questionnaire.xlsx", sheet = "EN") # import the sheet of coding questions

coding_q %>% 
  select(5) -> list_q                    # the questions in column 5th

colnames(list_q)[1] -> "column_q"        # rename the name of column 1st

str(list_q)                              # check the structure. the list_q is data frame format
tibble [184 × 1] (S3: tbl_df/tbl/data.frame)
 $ R1C1_Style: num [1:184] 3 4 5 6 7 8 9 10 27 28 ...
as.list(list_q) -> list_q1               # convert to list format

str(list_q1)                             # check the structure. the list_q1 is list format
List of 1
 $ R1C1_Style: num [1:184] 3 4 5 6 7 8 9 10 27 28 ...
as.numeric(unlist(list_q1)) -> list_q2   # convert to numeric format

str(list_q2)                             # check the structure. the list_q2 is numeric format
 num [1:184] 3 4 5 6 7 8 9 10 27 28 ...

2.3. Select the columns in accordance with list_q2

Now we can select all the variables/ columns needed in the analysis.

data_4 %>% 
  select(all_of(list_q2)) -> data_5   # list_q2 contains 184 numbers corresponding to 184 columns to select

2.4. Rename the names of all columns in data_5 corresponding to the file coding_questionnaire.xlsx

This step is similar to Extract the list of coding question.

coding_q %>% 
  select(3) -> question_coding                                  # in column 3th

str(question_coding)                                            # check the structure. question_coding is data frame format
tibble [184 × 1] (S3: tbl_df/tbl/data.frame)
 $ Question_coding: chr [1:184] "o2" "o3" "o4" "o5" ...
as.list(question_coding) -> question_coding_2                   # convert to list format

as.character(unlist(question_coding_2)) -> question_coding_3    # convert to character format

setnames(data_5, question_coding_3)                             # rename the columns corresponding to question_coding_3

2.5. Save data_5 in csv file

We now save the data into csv file for a more convenient look-up.

write.csv(data_5,                                                                      # the data                                                                  
          "D:\\R\\R_2023\\ASSET_PCA analysis\\data\\output\\asset_hhsurvey_v3_5.csv",  # working directory and file name
          row.names = TRUE,                                                            # with row name
          na='',                                                                       # with blank for NA
          )      

Now we can open the csv file to check the data.