Assignment overview

In this assignment, we will work on some simple “cleaning’’ functions for the baseline data. We will use the cleaned variables to verify balance in the next assignment.

Cleaning the data is a big component of the work associated with survey data. The initial dataset is typically referred as “raw”, and once the dataset is cleaned it’s called “cleaned”. Cleaning involves looking around at summary statistics and making sense of your data.

We will work on the baseline only (bline_df).

Preliminaries: load up dataset, packages

  1. Load the baseline data frame, the original data frame, and the package pacman. Use p_load to install packages haven, tidyverse, dplyr and skimr.
library(pacman)
p_load(haven, tidyverse, dplyr, skimr)

VSLA_PS_1_3 = read_dta("VSLA_PS_1-3.dta")
bline_df = select(VSLA_PS_1_3, ends_with("base"))
  1. From the original dataset, add the identifying variables group_ID and member_ID to the baseline.
bline_df$group_ID <- VSLA_PS_1_3$group_ID
bline_df$member_ID <- VSLA_PS_1_3$member_ID

Part 1: Managing outliers in variables

Sometimes, variables have a few very extreme values, called outliers. Outliers can be problematic. For example, they create a wedge between mean and median values. In regressions, these outliers can have a large impact on the estimated coefficients. Outlier values are often found in financial records, i.e., amounts saved in savings accounts, loan sizes, wealth levels. That is because some people can be significantly richer than the rest.

When working with survey data, outliers are sometimes the result of mistakes. A typical error happens when the data collector (called “enumerator”) adds too many zeros as an answer to a question like “amount of wealth”. Instead of writing 10000 (10,000) they write 100000.

We begin by identifying some variables with outliers.

Identifying variables with outliers

Consider the variables: valuehhdurables_base (value of durable goods in the household, measured in Ugandan Shillings), valuelivestock_base (value of livestock), childnum_base (number of children in the household) bank_savings_base, and loanamt_formal_base (amount saved in bank, amount borrowed from bank in previous year). Do they have outliers?

Use the command skim (from the skimr package). HINT: For me, I started with the baseline, then I used the pipe operator %>% to select the three variables, then, I used the pipe operator again to apply the skim command on these selected variables.

skim(bline_df, "valuehhdurables_base")
Data summary
Name bline_df
Number of rows 1359
Number of columns 38
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
valuehhdurables_base 138 0.9 2453554 19845641 3000 235000 662000 2010000 670904960 ▇▁▁▁▁
skim(bline_df, "valuelivestock_base")
Data summary
Name bline_df
Number of rows 1359
Number of columns 38
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
valuelivestock_base 480 0.65 10894886 185609935 160 224500 7e+05 2147500 5251359744 ▇▁▁▁▁
skim(bline_df, "childnum_base")
Data summary
Name bline_df
Number of rows 1359
Number of columns 38
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
childnum_base 0 1 2.73 2.15 0 1 3 4 11 ▇▅▂▁▁
skim(bline_df, "bank_savings_base")
Data summary
Name bline_df
Number of rows 1359
Number of columns 38
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bank_savings_base 0 1 11526.05 120831.6 0 0 0 0 2375000 ▇▁▁▁▁
skim(bline_df, "loanamt_formal_base")
Data summary
Name bline_df
Number of rows 1359
Number of columns 38
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
loanamt_formal_base 8 0.99 300244.3 9272493 0 0 0 0 3.4e+08 ▇▁▁▁▁
bline_df %>% select("valuehhdurables_base", "valuelivestock_base", "childnum_base", "bank_savings_base", "loanamt_formal_base") %>% skim() %>% as.data.frame() %>% format(scientific=FALSE)
##   skim_type        skim_variable n_missing complete_rate    numeric.mean
## 1   numeric valuehhdurables_base       138     0.8984547  2453553.511876
## 2   numeric  valuelivestock_base       480     0.6467991 10894886.118316
## 3   numeric        childnum_base         0     1.0000000        2.729213
## 4   numeric    bank_savings_base         0     1.0000000    11526.048565
## 5   numeric  loanamt_formal_base         8     0.9941133   300244.263509
##         numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100
## 1  19845640.940489       3000      235000      662000     2010000    670904960
## 2 185609935.144849        160      224500      700000     2147500   5251359744
## 3         2.147429          0           1           3           4           11
## 4    120831.565657          0           0           0           0      2375000
## 5   9272492.589869          0           0           0           0    340000000
##   numeric.hist
## 1        ▇▁▁▁▁
## 2        ▇▁▁▁▁
## 3        ▇▅▂▁▁
## 4        ▇▁▁▁▁
## 5        ▇▁▁▁▁
## Both of these ways of showing the information work, but I prefer the first option as I can look at the tibble data without having to convert scientific notation into something more understandable.

Comment on the output you see. Which variables have outliers? What pieces of evidence do you have to support your claim?

## Outliers: Bank savings and formal loan amounts have the most significant outliers.I'm not exactly sure what the 'p' in "p0" through "p100" means, but I assume it records the highest number in that percentage of the data. For bank savings and formal loan amount, this number is 0 until p100 is reached, which indicates that there are very few observations that have this value at all, let alone the very high numbers displayed in the p100 section. One can make an argument that value of hhdurables, value of livestock, and number of children also have outliers, since the using the interquartile range method to calculate outliers finds that all of these categories of interest contain both upper and lower range outliers (except for number of children, which only has upper range outliers). The mini histograms for all categories, except number of children, are also indicative of there being a few very large values skewing the data.

Dealing with outliers: winsorizing

The way to deal with outliers is to winsorize: you take the top 5% of the values and replace their value with the value associated with the 95th percentile. For example, suppose you have 100 observations. 95 of them have a value of 10 or less. The remaining 5 have all values between 150 and 1,000. With winsorize, you replace these 5 observations with the value 10. We will use the comamnd “winsor” for this, which is part of the package psych.

# First, install the package psych.

install.packages("psych") 
## package 'psych' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\suzye\AppData\Local\Temp\Rtmp6tTjSH\downloaded_packages
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
# Next, this line of code will make sure that all variables are numeric; run it first
bline_df$valuehhdurables_base <- as.numeric(bline_df$valuehhdurables_base) 
bline_df$valuelivestock_base <- as.numeric(bline_df$valuelivestock_base)
bline_df$childnum_base <- as.numeric(bline_df$childnum_base)
bline_df$bank_savings_base <- as.numeric(bline_df$bank_savings_base)
bline_df$loanamt_formal_base <- as.numeric(bline_df$loanamt_formal_base)

# Now write your own code. Winsorize the variables by running the psych::winsor() command. Specify the trim level to be 0.05. The variable name should have affixed a _w, so you know it's the winsorized variable. 
    
bline_df <- bline_df %>% mutate(
  valuehhdurables_base_w = winsor(bline_df$valuehhdurables_base, trim = 0.05),
  valuelivestock_base_w = winsor(bline_df$valuelivestock_base, trim = 0.05),
  childnum_base_w = winsor(bline_df$childnum_base, trim = 0.05),
  bank_savings_base_w = winsor(bline_df$bank_savings_base, trim = 0.05),
  loanamt_formal_base_w = winsor(bline_df$loanamt_formal_base, trim = 0.05)
  )

Finally, create a table of these winsorized and original variables with skim().

bline_df %>% select("valuehhdurables_base", "valuehhdurables_base_w") %>% skim() %>% as.data.frame() %>% format(scientific=FALSE)
##   skim_type          skim_variable n_missing complete_rate numeric.mean
## 1   numeric   valuehhdurables_base       138     0.8984547      2453554
## 2   numeric valuehhdurables_base_w       138     0.8984547      1457910
##   numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100
## 1   19845641       3000      235000      662000     2010000    670904960
## 2    1784577      53000      235000      662000     2010000      6496000
##   numeric.hist
## 1        ▇▁▁▁▁
## 2        ▇▂▁▁▁
bline_df %>% select("valuelivestock_base", "valuelivestock_base_w") %>% skim() %>% as.data.frame() %>% format(scientific=FALSE)
##   skim_type         skim_variable n_missing complete_rate numeric.mean
## 1   numeric   valuelivestock_base       480     0.6467991     10894886
## 2   numeric valuelivestock_base_w       480     0.6467991      1650605
##   numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100
## 1  185609935        160      224500      700000     2147500   5251359744
## 2    2186659      47700      224500      700000     2147500      8402000
##   numeric.hist
## 1        ▇▁▁▁▁
## 2        ▇▂▁▁▁
bline_df %>% select("childnum_base", "childnum_base_w") %>% skim() %>% as.data.frame() %>% format(scientific=FALSE)
##   skim_type   skim_variable n_missing complete_rate numeric.mean numeric.sd
## 1   numeric   childnum_base         0             1     2.729213   2.147429
## 2   numeric childnum_base_w         0             1     2.685798   2.035914
##   numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
## 1          0           1           3           4           11        ▇▅▂▁▁
## 2          0           1           3           4            7        ▇▅▇▂▂
bline_df %>% select("bank_savings_base", "bank_savings_base_w" ) %>% skim() %>% as.data.frame() %>% format(scientific=FALSE)
##   skim_type       skim_variable n_missing complete_rate numeric.mean numeric.sd
## 1   numeric   bank_savings_base         0             1     11526.05   120831.6
## 2   numeric bank_savings_base_w         0             1         0.00        0.0
##   numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
## 1          0           0           0           0      2375000        ▇▁▁▁▁
## 2          0           0           0           0            0        ▁▁▇▁▁
bline_df %>% select("loanamt_formal_base", "loanamt_formal_base_w") %>% skim() %>% as.data.frame() %>% format(scientific=FALSE)
##   skim_type         skim_variable n_missing complete_rate numeric.mean
## 1   numeric   loanamt_formal_base         8     0.9941133     300244.3
## 2   numeric loanamt_formal_base_w         8     0.9941133          0.0
##   numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100
## 1    9272493          0           0           0           0    340000000
## 2          0          0           0           0           0            0
##   numeric.hist
## 1        ▇▁▁▁▁
## 2        ▁▁▇▁▁

Compare the two types of variables. What do you notice? How did winsorizing change the summary statistics? How?

## By winsorizing, we can see quite a few changes to the data as a result of trimming the data from the bottom and top 5% of the data from the variables of interest.

## Changes to value of hhdurables & value of livestock: The mean values and standard deviations of both of these variables have decreased significantly, indicating that a few very large outliers were biasing up the mean, and now the data covers a smaller range of values. Not only have the very largest outliers been trimmed from the data, but so have the very lowest ones. As a result, the histogram also looks a little better distributed. Additionally, only the values in the p0 and p100 sections of the data have been altered (which makes sense considering we've only trimmed the very edges of the data).

## Changes to number of children: The mean and standard deviation of this variable has only very slightly decreased, indicating that the outliers in the unwinsorized version of the data were not very impactful on those measures. The p0 through p75 sections of the tibble are largely unchanged, barring p100.

## Changes to bank savings and loanamt formal: Both of these variables have changed significantly. It's clear that values in the greatest 5% of observations were what made the mean and standard deviation the size they were, as after winsorizing, for both variables, the value of these measures are 0. Contrary to what the original data seemed to say, very few people have any bank savings or money in a formal loan at all.

Summary statistics of baseline variables

Consider now the following baseline variables:

age_resp_base female_resp_base schooling_resp_base married_resp_base formalloan_base childnum_base valuelivestock_base valuehhdurables_base has_formal_account_base has_mobile_money_base num_vslas_base electricity_base VSLA_savings_base bank_savings_base loanamt_formal_base asset_index_base total_savings_base

Using skim, create a table of all baseline variables. If a variable was winsorized, use the winsorized variable only, not the original variable.

 bline_df %>% select("age_resp_base", "female_resp_base", "schooling_resp_base", "married_resp_base", "formalloan_base", "childnum_base_w", "valuelivestock_base_w", "valuehhdurables_base_w", "has_formal_account_base", "has_mobile_money_base", "num_vslas_base", "electricity_base", "VSLA_savings_base", "bank_savings_base_w", "loanamt_formal_base_w", "asset_index_base", "total_savings_base") %>% skim() %>% as.data.frame() %>% format(scientific=FALSE)
##    skim_type           skim_variable n_missing complete_rate     numeric.mean
## 1    numeric           age_resp_base         0     1.0000000      39.59087564
## 2    numeric        female_resp_base         0     1.0000000       0.77777778
## 3    numeric     schooling_resp_base         0     1.0000000       2.84768212
## 4    numeric       married_resp_base         0     1.0000000       0.68947756
## 5    numeric         formalloan_base         2     0.9985283  320593.35003685
## 6    numeric         childnum_base_w         0     1.0000000       2.68579838
## 7    numeric   valuelivestock_base_w       480     0.6467991 1650604.99886234
## 8    numeric  valuehhdurables_base_w       138     0.8984547 1457910.05569206
## 9    numeric has_formal_account_base         0     1.0000000       0.07505519
## 10   numeric   has_mobile_money_base         0     1.0000000       0.14716703
## 11   numeric          num_vslas_base         3     0.9977925       1.46460177
## 12   numeric        electricity_base         0     1.0000000       0.18543046
## 13   numeric       VSLA_savings_base         0     1.0000000  159666.29506990
## 14   numeric     bank_savings_base_w         0     1.0000000       0.00000000
## 15   numeric   loanamt_formal_base_w         8     0.9941133       0.00000000
## 16   numeric        asset_index_base         0     1.0000000       0.04481834
## 17   numeric      total_savings_base         0     1.0000000  201155.84621045
##         numeric.sd   numeric.p0   numeric.p25     numeric.p50    numeric.p75
## 1       13.6290139    18.000000     29.000000     37.00000000      48.000000
## 2        0.4158928     0.000000      1.000000      1.00000000       1.000000
## 3        1.3408365     1.000000      2.000000      2.00000000       4.000000
## 4        0.4628779     0.000000      0.000000      1.00000000       1.000000
## 5  9552018.9904126     0.000000      0.000000      0.00000000       0.000000
## 6        2.0359139     0.000000      1.000000      3.00000000       4.000000
## 7  2186658.5137422 47700.000000 224500.000000 700000.00000000 2147500.000000
## 8  1784576.7679134 53000.000000 235000.000000 662000.00000000 2010000.000000
## 9        0.2635774     0.000000      0.000000      0.00000000       0.000000
## 10       0.3544028     0.000000      0.000000      0.00000000       0.000000
## 11       0.9368633     1.000000      1.000000      1.00000000       2.000000
## 12       0.3887894     0.000000      0.000000      0.00000000       0.000000
## 13  230123.4154785     0.000000      0.000000  80000.00000000  200000.000000
## 14       0.0000000     0.000000      0.000000      0.00000000       0.000000
## 15       0.0000000     0.000000      0.000000      0.00000000       0.000000
## 16       1.7237489    -3.145163     -1.126157     -0.06067866       1.132419
## 17  394066.8596736     0.000000      0.000000  85000.00000000  220000.000000
##        numeric.p100 numeric.hist
## 1         90.000000        ▇▇▅▂▁
## 2          1.000000        ▂▁▁▁▇
## 3          8.000000        ▇▂▅▁▁
## 4          1.000000        ▃▁▁▁▇
## 5  350000000.000000        ▇▁▁▁▁
## 6          7.000000        ▇▅▇▂▂
## 7    8402000.000000        ▇▂▁▁▁
## 8    6496000.000000        ▇▂▁▁▁
## 9          1.000000        ▇▁▁▁▁
## 10         1.000000        ▇▁▁▁▂
## 11        16.000000        ▇▁▁▁▁
## 12         1.000000        ▇▁▁▁▂
## 13   1500000.000000        ▇▁▁▁▁
## 14         0.000000        ▁▁▇▁▁
## 15         0.000000        ▁▁▇▁▁
## 16         4.861075        ▃▇▆▃▁
## 17   3000000.000000        ▇▁▁▁▁

To the best of your knowledge, and using educated guesses, use the summary statistics table you created to describe the sample, one variable at a time. For example: “Average years of schooling in the sample is 2.9 years” “77% of respondents are women”. Rely on averages.

## The average age of respondents in the sample is 39.6 years. 77.8% of respondents are female. The average years of schooling for respondents is 2.8. 68.9% of respondents are married. The mean of formal loans taken out by respondents is 320,593 Ugandan shillings. The average number of children respondents have is 2.7. The average value of livestock for respondents 1,650,605.0 Ugandan shillings, and the average value of household durables is 1,457,910.1 Ugandan shillings. 7.5% of respondents have a formal savings account, and 14.7% have "mobile money". The mean number of VSLAs in the household for respondents is 1.5. 18.5% of respondents have electricity. The average of VSLA savings for respondents is 159,666.3 Ugandan shillings. The mean of both bank savings and formal loan amounts for respondents is 0 Ugandan shillings.  Respondents' asset index is, on average, 0.04. The mean total savings held by respondents is 201,155.8 Ugandan shillings.

How would you describe the sample, in just a couple of sentences? No wrong answers here, but your statement should be related to the description above. (For example, it wouldn’t be correct to state “participants are highly educated and mostly male”!)

## According to the data, most respondents in the sample are female and married and seem to have at least one or two children, but they are not very educated. Most respondents are also relatively poor and located in rural areas, as most do not have electricity and the most amount of valuable assets people hold are either in their household durables or their livestock. Formal money saving/loaning methods are also not very common among the sample; most respondents don't have a formal savings account, nor do they use mobile money, and the mean of bank savings and formal loan amounts are 0 as a result.