Often, you will want to change how your variable is coded so that you can answer your research question more efficiently.
Data management is especially important when you only have access to messy data. Data are typically messy when you see one or both of these things occur:
Before the analysis, such as calculating correlations and doing data visualization, your data should be tidy: each variable is a column and each observation is a row.
Frequently used data management operations include selecting columns (variables), filtering rows (cases), and creating new columns. These are the operations we will cover here.
Data management in R can be done without using any additional packages but the code is not very intuitive. Instead, we will use the dplyr package. Make sure to install it and load it into R Studio.
We will focus on the following three sets of functions:
mutate() adds new variables that are functions of existing variables.
select() picks variables based on their names.
filter() picks cases based on their values.
select)Often, your dataset will have many variables that you will not need for you analysis. You want to keep only the variables your need.
gss that includes only the variables age, marital, and lockedup.but first, let’s see what variables are included in the dataset that we just loaded into R (gss). We can do that by running the command head and applying it to the gss dataset.
## age race sex educ prestg10 sppres10 arrest marital lockedup
## 1 22 white male 16 38 NA no never married no
## 2 21 white male 12 43 NA no never married no
## 3 42 other male 12 75 61 yes married no
## 4 49 white female 13 60 38 no married no
## 5 70 black female 16 40 NA yes separated no
## 6 50 white female 19 73 NA yes widowed yes
It looks like there are 9 variables in the dataset.
Now, let’s use the function select from the dplyr package to create a dataset that has fewer variables: only the variables age, marital, and lockedup.
## age marital lockedup
## 1 22 never married no
## 2 21 never married no
## 3 42 married no
## 4 49 married no
## 5 70 separated no
## 6 50 widowed yes
We created a new dataset that contains only those three variables and we named the new dataset gss_reduced_1. You can name the dataset any way you want. Or you can overwrite the existing dataset.
You can also select which variables you want to remove from the dataset. Let’s remove variables prestg10 and marital. We remove variables by adding a - sign before the variables we want to remove.
## age race sex educ sppres10 arrest lockedup
## 1 22 white male 16 NA no no
## 2 21 white male 12 NA no no
## 3 42 other male 12 61 yes no
## 4 49 white female 13 38 no no
## 5 70 black female 16 NA yes no
## 6 50 white female 19 NA yes yes
As you can see above, this new dataset (gss_reduced_2) no longer contains variables prestg10 and marital. In general, removing or selecting variables to keep in a dataset is most useful when you have a dataset with many variables, for example hundreds.
Create a dataset that includes only the variables arrest and educ. Name the new dataset.
filter)Sometimes you want to change the dataset so that you analyze only a subset. For instance, you could only be interested in men in the sample, or you can only be interested in men who are older than 30 years.
gss dataset (try to figure out what part of the sample we included in the new dataset).## age race sex educ prestg10
## Min. :22.00 iap : 0 male :410 Min. : 0.00 Min. :21.00
## 1st Qu.:38.25 white:321 female: 0 1st Qu.:12.00 1st Qu.:35.00
## Median :50.00 black: 40 Median :13.50 Median :45.00
## Mean :51.58 other: 49 Mean :13.65 Mean :47.01
## 3rd Qu.:63.00 3rd Qu.:16.00 3rd Qu.:60.00
## Max. :89.00 Max. :20.00 Max. :80.00
## NA's :4
## sppres10 arrest marital lockedup
## Min. :17.00 iap : 0 married :410 iap : 0
## 1st Qu.:35.00 yes : 91 widowed : 0 yes : 59
## Median :47.00 no :276 divorced : 0 no :308
## Mean :46.58 refused: 0 separated : 0 don't know: 0
## 3rd Qu.:56.25 dk : 0 never married: 0 no answer : 0
## Max. :80.00 na : 0 na : 0 NA's : 43
## NA's :58 NA's : 43
If you guessed “married men”, you were right!
Subset and store the gss dataset so that it only includes women older than 35 years.
mutate)When we want to create a new variable that is a function of other variables, we should use the mutate command.
## age race sex educ prestg10
## Min. :18.00 iap : 0 male : 885 Min. : 0.00 Min. :16.00
## 1st Qu.:33.00 white:1477 female:1089 1st Qu.:12.00 1st Qu.:34.00
## Median :47.00 black: 301 Median :13.00 Median :44.00
## Mean :48.19 other: 196 Mean :13.53 Mean :44.33
## 3rd Qu.:61.00 3rd Qu.:16.00 3rd Qu.:53.00
## Max. :89.00 Max. :20.00 Max. :80.00
## NA's :5 NA's :2 NA's :103
## sppres10 arrest marital lockedup
## Min. :17.00 iap : 0 married :900 iap : 0
## 1st Qu.:35.00 yes : 373 widowed :163 yes : 259
## Median :46.00 no :1375 divorced :317 no :1489
## Mean :46.82 refused: 0 separated : 68 don't know: 0
## 3rd Qu.:59.00 dk : 0 never married:526 no answer : 0
## Max. :80.00 na : 0 na : 0 NA's : 226
## NA's :1149 NA's : 226
## presdiff
## Min. :-53.0000
## 1st Qu.:-11.0000
## Median : 0.0000
## Mean : 0.0289
## 3rd Qu.: 11.0000
## Max. : 53.0000
## NA's :1178
Create a new variable that is equal to squared age.
First, create a new dataset that only has variables marital, age, and sex. Next, restrict the dataset to include only divorced men who are older than 45 years. Finally, show the summary information for all the variables in this new dataset.