In this assignment, we will work on some simple “cleaning’’ functions for the baseline data. We will use the cleaned variables to verify balance in the next assignment.
Cleaning the data is a big component of the work associated with survey data. The initial dataset is typically referred as “raw”, and once the dataset is cleaned it’s called “cleaned”. Cleaning involves looking around at summary statistics and making sense of your data.
We will work on the baseline only (bline_df).
library(pacman)
p_load(haven, tidyverse, dplyr, skimr)
VSLA_PS_1_3 = read_dta("VSLA_PS_1-3.dta")
bline_df = select(VSLA_PS_1_3, ends_with("base"))
bline_df$group_ID <- VSLA_PS_1_3$group_ID
bline_df$member_ID <- VSLA_PS_1_3$member_ID
Sometimes, variables have a few very extreme values, called outliers. Outliers can be problematic. For example, they create a wedge between mean and median values. In regressions, these outliers can have a large impact on the estimated coefficients. Outlier values are often found in financial records, i.e., amounts saved in savings accounts, loan sizes, wealth levels. That is because some people can be significantly richer than the rest.
When working with survey data, outliers are sometimes the result of mistakes. A typical error happens when the data collector (called “enumerator”) adds too many zeros as an answer to a question like “amount of wealth”. Instead of writing 10000 (10,000) they write 100000.
We begin by identifying some variables with outliers.
Consider the variables: valuehhdurables_base (value of durable goods in the household, measured in Ugandan Shillings), valuelivestock_base (value of livestock), childnum_base (number of children in the household) bank_savings_base, and loanamt_formal_base (amount saved in bank, amount borrowed from bank in previous year). Do they have outliers?
Use the command skim (from the skimr package). HINT: For me, I started with the baseline, then I used the pipe operator %>% to select the three variables, then, I used the pipe operator again to apply the skim command on these selected variables.
skim(bline_df, "valuehhdurables_base")
| Name | bline_df |
| Number of rows | 1359 |
| Number of columns | 38 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| valuehhdurables_base | 138 | 0.9 | 2453554 | 19845641 | 3000 | 235000 | 662000 | 2010000 | 670904960 | ▇▁▁▁▁ |
skim(bline_df, "valuelivestock_base")
| Name | bline_df |
| Number of rows | 1359 |
| Number of columns | 38 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| valuelivestock_base | 480 | 0.65 | 10894886 | 185609935 | 160 | 224500 | 7e+05 | 2147500 | 5251359744 | ▇▁▁▁▁ |
skim(bline_df, "childnum_base")
| Name | bline_df |
| Number of rows | 1359 |
| Number of columns | 38 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| childnum_base | 0 | 1 | 2.73 | 2.15 | 0 | 1 | 3 | 4 | 11 | ▇▅▂▁▁ |
skim(bline_df, "bank_savings_base")
| Name | bline_df |
| Number of rows | 1359 |
| Number of columns | 38 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| bank_savings_base | 0 | 1 | 11526.05 | 120831.6 | 0 | 0 | 0 | 0 | 2375000 | ▇▁▁▁▁ |
skim(bline_df, "loanamt_formal_base")
| Name | bline_df |
| Number of rows | 1359 |
| Number of columns | 38 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| loanamt_formal_base | 8 | 0.99 | 300244.3 | 9272493 | 0 | 0 | 0 | 0 | 3.4e+08 | ▇▁▁▁▁ |
bline_df %>% select("valuehhdurables_base", "valuelivestock_base", "childnum_base", "bank_savings_base", "loanamt_formal_base") %>% skim() %>% as.data.frame() %>% format(scientific=FALSE)
## skim_type skim_variable n_missing complete_rate numeric.mean
## 1 numeric valuehhdurables_base 138 0.8984547 2453553.511876
## 2 numeric valuelivestock_base 480 0.6467991 10894886.118316
## 3 numeric childnum_base 0 1.0000000 2.729213
## 4 numeric bank_savings_base 0 1.0000000 11526.048565
## 5 numeric loanamt_formal_base 8 0.9941133 300244.263509
## numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100
## 1 19845640.940489 3000 235000 662000 2010000 670904960
## 2 185609935.144849 160 224500 700000 2147500 5251359744
## 3 2.147429 0 1 3 4 11
## 4 120831.565657 0 0 0 0 2375000
## 5 9272492.589869 0 0 0 0 340000000
## numeric.hist
## 1 ▇▁▁▁▁
## 2 ▇▁▁▁▁
## 3 ▇▅▂▁▁
## 4 ▇▁▁▁▁
## 5 ▇▁▁▁▁
## Both of these ways of showing the information work, but I prefer the first option as I can look at the tibble data without having to convert scientific notation into something more understandable.
Comment on the output you see. Which variables have outliers? What pieces of evidence do you have to support your claim?
## Outliers: Bank savings and formal loan amounts have the most significant outliers.I'm not exactly sure what the 'p' in "p0" through "p100" means, but I assume it records the highest number in that percentage of the data. For bank savings and formal loan amount, this number is 0 until p100 is reached, which indicates that there are very few observations that have this value at all, let alone the very high numbers displayed in the p100 section. One can make an argument that value of hhdurables, value of livestock, and number of children also have outliers, since the using the interquartile range method to calculate outliers finds that all of these categories of interest contain both upper and lower range outliers (except for number of children, which only has upper range outliers). The mini histograms for all categories, except number of children, are also indicative of there being a few very large values skewing the data.
The way to deal with outliers is to winsorize: you take the top 5% of the values and replace their value with the value associated with the 95th percentile. For example, suppose you have 100 observations. 95 of them have a value of 10 or less. The remaining 5 have all values between 150 and 1,000. With winsorize, you replace these 5 observations with the value 10. We will use the comamnd “winsor” for this, which is part of the package psych.
# First, install the package psych.
install.packages("psych")
## package 'psych' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\suzye\AppData\Local\Temp\Rtmp6tTjSH\downloaded_packages
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
# Next, this line of code will make sure that all variables are numeric; run it first
bline_df$valuehhdurables_base <- as.numeric(bline_df$valuehhdurables_base)
bline_df$valuelivestock_base <- as.numeric(bline_df$valuelivestock_base)
bline_df$childnum_base <- as.numeric(bline_df$childnum_base)
bline_df$bank_savings_base <- as.numeric(bline_df$bank_savings_base)
bline_df$loanamt_formal_base <- as.numeric(bline_df$loanamt_formal_base)
# Now write your own code. Winsorize the variables by running the psych::winsor() command. Specify the trim level to be 0.05. The variable name should have affixed a _w, so you know it's the winsorized variable.
bline_df <- bline_df %>% mutate(
valuehhdurables_base_w = winsor(bline_df$valuehhdurables_base, trim = 0.05),
valuelivestock_base_w = winsor(bline_df$valuelivestock_base, trim = 0.05),
childnum_base_w = winsor(bline_df$childnum_base, trim = 0.05),
bank_savings_base_w = winsor(bline_df$bank_savings_base, trim = 0.05),
loanamt_formal_base_w = winsor(bline_df$loanamt_formal_base, trim = 0.05)
)
Finally, create a table of these winsorized and original variables with skim().
bline_df %>% select("valuehhdurables_base", "valuehhdurables_base_w") %>% skim() %>% as.data.frame() %>% format(scientific=FALSE)
## skim_type skim_variable n_missing complete_rate numeric.mean
## 1 numeric valuehhdurables_base 138 0.8984547 2453554
## 2 numeric valuehhdurables_base_w 138 0.8984547 1457910
## numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100
## 1 19845641 3000 235000 662000 2010000 670904960
## 2 1784577 53000 235000 662000 2010000 6496000
## numeric.hist
## 1 ▇▁▁▁▁
## 2 ▇▂▁▁▁
bline_df %>% select("valuelivestock_base", "valuelivestock_base_w") %>% skim() %>% as.data.frame() %>% format(scientific=FALSE)
## skim_type skim_variable n_missing complete_rate numeric.mean
## 1 numeric valuelivestock_base 480 0.6467991 10894886
## 2 numeric valuelivestock_base_w 480 0.6467991 1650605
## numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100
## 1 185609935 160 224500 700000 2147500 5251359744
## 2 2186659 47700 224500 700000 2147500 8402000
## numeric.hist
## 1 ▇▁▁▁▁
## 2 ▇▂▁▁▁
bline_df %>% select("childnum_base", "childnum_base_w") %>% skim() %>% as.data.frame() %>% format(scientific=FALSE)
## skim_type skim_variable n_missing complete_rate numeric.mean numeric.sd
## 1 numeric childnum_base 0 1 2.729213 2.147429
## 2 numeric childnum_base_w 0 1 2.685798 2.035914
## numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
## 1 0 1 3 4 11 ▇▅▂▁▁
## 2 0 1 3 4 7 ▇▅▇▂▂
bline_df %>% select("bank_savings_base", "bank_savings_base_w" ) %>% skim() %>% as.data.frame() %>% format(scientific=FALSE)
## skim_type skim_variable n_missing complete_rate numeric.mean numeric.sd
## 1 numeric bank_savings_base 0 1 11526.05 120831.6
## 2 numeric bank_savings_base_w 0 1 0.00 0.0
## numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
## 1 0 0 0 0 2375000 ▇▁▁▁▁
## 2 0 0 0 0 0 ▁▁▇▁▁
bline_df %>% select("loanamt_formal_base", "loanamt_formal_base_w") %>% skim() %>% as.data.frame() %>% format(scientific=FALSE)
## skim_type skim_variable n_missing complete_rate numeric.mean
## 1 numeric loanamt_formal_base 8 0.9941133 300244.3
## 2 numeric loanamt_formal_base_w 8 0.9941133 0.0
## numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100
## 1 9272493 0 0 0 0 340000000
## 2 0 0 0 0 0 0
## numeric.hist
## 1 ▇▁▁▁▁
## 2 ▁▁▇▁▁
Compare the two types of variables. What do you notice? How did winsorizing change the summary statistics? How?
## By winsorizing, we can see quite a few changes to the data as a result of trimming the data from the bottom and top 5% of the data from the variables of interest.
## Changes to value of hhdurables & value of livestock: The mean values and standard deviations of both of these variables have decreased significantly, indicating that a few very large outliers were biasing up the mean, and now the data covers a smaller range of values. Not only have the very largest outliers been trimmed from the data, but so have the very lowest ones. As a result, the histogram also looks a little better distributed. Additionally, only the values in the p0 and p100 sections of the data have been altered (which makes sense considering we've only trimmed the very edges of the data).
## Changes to number of children: The mean and standard deviation of this variable has only very slightly decreased, indicating that the outliers in the unwinsorized version of the data were not very impactful on those measures. The p0 through p75 sections of the tibble are largely unchanged, barring p100.
## Changes to bank savings and loanamt formal: Both of these variables have changed significantly. It's clear that values in the greatest 5% of observations were what made the mean and standard deviation the size they were, as after winsorizing, for both variables, the value of these measures are 0. Contrary to what the original data seemed to say, very few people have any bank savings or money in a formal loan at all.
Consider now the following baseline variables:
age_resp_base female_resp_base schooling_resp_base married_resp_base formalloan_base childnum_base valuelivestock_base valuehhdurables_base has_formal_account_base has_mobile_money_base num_vslas_base electricity_base VSLA_savings_base bank_savings_base loanamt_formal_base asset_index_base total_savings_base
Using skim, create a table of all baseline variables. If a variable was winsorized, use the winsorized variable only, not the original variable.
bline_df %>% select("age_resp_base", "female_resp_base", "schooling_resp_base", "married_resp_base", "formalloan_base", "childnum_base_w", "valuelivestock_base_w", "valuehhdurables_base_w", "has_formal_account_base", "has_mobile_money_base", "num_vslas_base", "electricity_base", "VSLA_savings_base", "bank_savings_base_w", "loanamt_formal_base_w", "asset_index_base", "total_savings_base") %>% skim() %>% as.data.frame() %>% format(scientific=FALSE)
## skim_type skim_variable n_missing complete_rate numeric.mean
## 1 numeric age_resp_base 0 1.0000000 39.59087564
## 2 numeric female_resp_base 0 1.0000000 0.77777778
## 3 numeric schooling_resp_base 0 1.0000000 2.84768212
## 4 numeric married_resp_base 0 1.0000000 0.68947756
## 5 numeric formalloan_base 2 0.9985283 320593.35003685
## 6 numeric childnum_base_w 0 1.0000000 2.68579838
## 7 numeric valuelivestock_base_w 480 0.6467991 1650604.99886234
## 8 numeric valuehhdurables_base_w 138 0.8984547 1457910.05569206
## 9 numeric has_formal_account_base 0 1.0000000 0.07505519
## 10 numeric has_mobile_money_base 0 1.0000000 0.14716703
## 11 numeric num_vslas_base 3 0.9977925 1.46460177
## 12 numeric electricity_base 0 1.0000000 0.18543046
## 13 numeric VSLA_savings_base 0 1.0000000 159666.29506990
## 14 numeric bank_savings_base_w 0 1.0000000 0.00000000
## 15 numeric loanamt_formal_base_w 8 0.9941133 0.00000000
## 16 numeric asset_index_base 0 1.0000000 0.04481834
## 17 numeric total_savings_base 0 1.0000000 201155.84621045
## numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75
## 1 13.6290139 18.000000 29.000000 37.00000000 48.000000
## 2 0.4158928 0.000000 1.000000 1.00000000 1.000000
## 3 1.3408365 1.000000 2.000000 2.00000000 4.000000
## 4 0.4628779 0.000000 0.000000 1.00000000 1.000000
## 5 9552018.9904126 0.000000 0.000000 0.00000000 0.000000
## 6 2.0359139 0.000000 1.000000 3.00000000 4.000000
## 7 2186658.5137422 47700.000000 224500.000000 700000.00000000 2147500.000000
## 8 1784576.7679134 53000.000000 235000.000000 662000.00000000 2010000.000000
## 9 0.2635774 0.000000 0.000000 0.00000000 0.000000
## 10 0.3544028 0.000000 0.000000 0.00000000 0.000000
## 11 0.9368633 1.000000 1.000000 1.00000000 2.000000
## 12 0.3887894 0.000000 0.000000 0.00000000 0.000000
## 13 230123.4154785 0.000000 0.000000 80000.00000000 200000.000000
## 14 0.0000000 0.000000 0.000000 0.00000000 0.000000
## 15 0.0000000 0.000000 0.000000 0.00000000 0.000000
## 16 1.7237489 -3.145163 -1.126157 -0.06067866 1.132419
## 17 394066.8596736 0.000000 0.000000 85000.00000000 220000.000000
## numeric.p100 numeric.hist
## 1 90.000000 ▇▇▅▂▁
## 2 1.000000 ▂▁▁▁▇
## 3 8.000000 ▇▂▅▁▁
## 4 1.000000 ▃▁▁▁▇
## 5 350000000.000000 ▇▁▁▁▁
## 6 7.000000 ▇▅▇▂▂
## 7 8402000.000000 ▇▂▁▁▁
## 8 6496000.000000 ▇▂▁▁▁
## 9 1.000000 ▇▁▁▁▁
## 10 1.000000 ▇▁▁▁▂
## 11 16.000000 ▇▁▁▁▁
## 12 1.000000 ▇▁▁▁▂
## 13 1500000.000000 ▇▁▁▁▁
## 14 0.000000 ▁▁▇▁▁
## 15 0.000000 ▁▁▇▁▁
## 16 4.861075 ▃▇▆▃▁
## 17 3000000.000000 ▇▁▁▁▁
To the best of your knowledge, and using educated guesses, use the summary statistics table you created to describe the sample, one variable at a time. For example: “Average years of schooling in the sample is 2.9 years” “77% of respondents are women”. Rely on averages.
## The average age of respondents in the sample is 39.6 years. 77.8% of respondents are female. The average years of schooling for respondents is 2.8. 68.9% of respondents are married. The mean of formal loans taken out by respondents is 320,593 Ugandan shillings. The average number of children respondents have is 2.7. The average value of livestock for respondents 1,650,605.0 Ugandan shillings, and the average value of household durables is 1,457,910.1 Ugandan shillings. 7.5% of respondents have a formal savings account, and 14.7% have "mobile money". The mean number of VSLAs in the household for respondents is 1.5. 18.5% of respondents have electricity. The average of VSLA savings for respondents is 159,666.3 Ugandan shillings. The mean of both bank savings and formal loan amounts for respondents is 0 Ugandan shillings. Respondents' asset index is, on average, 0.04. The mean total savings held by respondents is 201,155.8 Ugandan shillings.
How would you describe the sample, in just a couple of sentences? No wrong answers here, but your statement should be related to the description above. (For example, it wouldn’t be correct to state “participants are highly educated and mostly male”!)
## According to the data, most respondents in the sample are female and married and seem to have at least one or two children, but they are not very educated. Most respondents are also relatively poor and located in rural areas, as most do not have electricity and the most amount of valuable assets people hold are either in their household durables or their livestock. Formal money saving/loaning methods are also not very common among the sample; most respondents don't have a formal savings account, nor do they use mobile money, and the mean of bank savings and formal loan amounts are 0 as a result.