Why data management is important?

Often, you will want to change how your variable is coded so that you can answer your research question more efficiently.

  • say you have a variable that describes a person’s age and their spouses age and you are interested in the differences in ages between spouses.

Data management is especially important when you only have access to messy data. Data are typically messy when you see one or both of these things occur:

  1. Column headers are values, not variable names.

  1. Multiple variables are stored in one column.

Before the analysis, such as calculating correlations and doing data visualization, your data should be tidy: each variable is a column and each observation is a row.

Frequently used data management operations include selecting columns (variables), filtering rows (cases), and creating new columns. These are the operations we will cover here.

dplyr package

Data management in R can be done without using any additional packages but the code is not very intuitive. Instead, we will use the dplyr package. Make sure to install it and load it into R Studio.

We will focus on the following three sets of functions:

mutate() adds new variables that are functions of existing variables.

select() picks variables based on their names.

filter() picks cases based on their values.

Selecting columns (select)

Often, your dataset will have many variables that you will not need for you analysis. You want to keep only the variables your need.

  • say we want to create a new dataset based on gss that includes only the variables age, marital, and lockedup.

but first, let’s see what variables are included in the dataset that we just loaded into R (gss). We can do that by running the command head and applying it to the gss dataset.

##   age  race    sex educ prestg10 sppres10 arrest       marital lockedup
## 1  22 white   male   16       38       NA     no never married       no
## 2  21 white   male   12       43       NA     no never married       no
## 3  42 other   male   12       75       61    yes       married       no
## 4  49 white female   13       60       38     no       married       no
## 5  70 black female   16       40       NA    yes     separated       no
## 6  50 white female   19       73       NA    yes       widowed      yes

It looks like there are 9 variables in the dataset.

Now, let’s use the function select from the dplyr package to create a dataset that has fewer variables: only the variables age, marital, and lockedup.

##   age       marital lockedup
## 1  22 never married       no
## 2  21 never married       no
## 3  42       married       no
## 4  49       married       no
## 5  70     separated       no
## 6  50       widowed      yes

We created a new dataset that contains only those three variables and we named the new dataset gss_reduced_1. You can name the dataset any way you want. Or you can overwrite the existing dataset.

You can also select which variables you want to remove from the dataset. Let’s remove variables prestg10 and marital. We remove variables by adding a - sign before the variables we want to remove.

##   age  race    sex educ sppres10 arrest lockedup
## 1  22 white   male   16       NA     no       no
## 2  21 white   male   12       NA     no       no
## 3  42 other   male   12       61    yes       no
## 4  49 white female   13       38     no       no
## 5  70 black female   16       NA    yes       no
## 6  50 white female   19       NA    yes      yes

As you can see above, this new dataset (gss_reduced_2) no longer contains variables prestg10 and marital. In general, removing or selecting variables to keep in a dataset is most useful when you have a dataset with many variables, for example hundreds.

Exercise

Create a dataset that includes only the variables arrest and educ. Name the new dataset.

Filtering rows (filter)

Sometimes you want to change the dataset so that you analyze only a subset. For instance, you could only be interested in men in the sample, or you can only be interested in men who are older than 30 years.

  • here’s an example of creating a subset of the gss dataset (try to figure out what part of the sample we included in the new dataset).
##       age           race         sex           educ          prestg10    
##  Min.   :22.00   iap  :  0   male  :410   Min.   : 0.00   Min.   :21.00  
##  1st Qu.:38.25   white:321   female:  0   1st Qu.:12.00   1st Qu.:35.00  
##  Median :50.00   black: 40                Median :13.50   Median :45.00  
##  Mean   :51.58   other: 49                Mean   :13.65   Mean   :47.01  
##  3rd Qu.:63.00                            3rd Qu.:16.00   3rd Qu.:60.00  
##  Max.   :89.00                            Max.   :20.00   Max.   :80.00  
##                                                           NA's   :4      
##     sppres10         arrest             marital          lockedup  
##  Min.   :17.00   iap    :  0   married      :410   iap       :  0  
##  1st Qu.:35.00   yes    : 91   widowed      :  0   yes       : 59  
##  Median :47.00   no     :276   divorced     :  0   no        :308  
##  Mean   :46.58   refused:  0   separated    :  0   don't know:  0  
##  3rd Qu.:56.25   dk     :  0   never married:  0   no answer :  0  
##  Max.   :80.00   na     :  0   na           :  0   NA's      : 43  
##  NA's   :58      NA's   : 43

If you guessed “married men”, you were right!

Exercise

Subset and store the gss dataset so that it only includes women older than 35 years.

Creating columns (mutate)

When we want to create a new variable that is a function of other variables, we should use the mutate command.

  • say you would like to create a variable that contains the difference in occupational prestige score between participants and their spouses.
##       age           race          sex            educ          prestg10    
##  Min.   :18.00   iap  :   0   male  : 885   Min.   : 0.00   Min.   :16.00  
##  1st Qu.:33.00   white:1477   female:1089   1st Qu.:12.00   1st Qu.:34.00  
##  Median :47.00   black: 301                 Median :13.00   Median :44.00  
##  Mean   :48.19   other: 196                 Mean   :13.53   Mean   :44.33  
##  3rd Qu.:61.00                              3rd Qu.:16.00   3rd Qu.:53.00  
##  Max.   :89.00                              Max.   :20.00   Max.   :80.00  
##  NA's   :5                                  NA's   :2       NA's   :103    
##     sppres10         arrest              marital          lockedup   
##  Min.   :17.00   iap    :   0   married      :900   iap       :   0  
##  1st Qu.:35.00   yes    : 373   widowed      :163   yes       : 259  
##  Median :46.00   no     :1375   divorced     :317   no        :1489  
##  Mean   :46.82   refused:   0   separated    : 68   don't know:   0  
##  3rd Qu.:59.00   dk     :   0   never married:526   no answer :   0  
##  Max.   :80.00   na     :   0   na           :  0   NA's      : 226  
##  NA's   :1149    NA's   : 226                                        
##     presdiff       
##  Min.   :-53.0000  
##  1st Qu.:-11.0000  
##  Median :  0.0000  
##  Mean   :  0.0289  
##  3rd Qu.: 11.0000  
##  Max.   : 53.0000  
##  NA's   :1178

Exercise

Create a new variable that is equal to squared age.

Putting it all together

First, create a new dataset that only has variables marital, age, and sex. Next, restrict the dataset to include only divorced men who are older than 45 years. Finally, show the summary information for all the variables in this new dataset.