R Workshop Day 1

Reading Data In

You’ve got to get your data into R before you can do anything else, and I will show you a very simple way to do that. We are going to use a dataset collected by the Paul Simon Public Policy Institute: a phone survey of 1,000 respondents in Illinois, conducted in March of 2016. Here’s the codebook, which describes each variable in the data.

You need to have downloaded and installed R and RStudio before you begin.

A link to R is here

A link to RStudio is here

Open up RStudio and paste the following command into the console at the bottom left.

simon <- read.csv(url("http://goo.gl/exQA14"))

If that worked, you have loaded the dataset and named it simon. To check, type the following:

head(simon)

##   ï..id     date int_lang area type cell safe youngest reg gender rwdirusa
## 1 60433 20160219        1    3    1   NA   NA        1   1      2        2
## 2  3405 20160215        1    2    1   NA   NA        1   1      2        2
## 3 11628 20160217        1    3    2    1    1       NA   1      1        2
## 4  1144 20160214        1    3    2    1    1       NA   1      2        1
## 5  2004 20160215        1    2    1   NA   NA        1   1      1        1
## 6  2773 20160216        1    3    2    1    1       NA   1      2        1
##   rwdiril rwdirlocal qlife app_gov_rau app_ussen_kirk app_ussen_dur
## 1       2          2     3           5              4             2
## 2       2          2     3           5              4             5
## 3       2          2     3           2              2             2
## 4       2          2     5           5              2             1
## 5       1          1     2           3              9             3
## 6       2          1     2           5              9             2
##   prim16_party prim16_pres_dem prim16_pres_dem_other prim16_sen_dem
## 1            3              NA                    NA             NA
## 2            4              NA                    NA             NA
## 3            4              NA                    NA             NA
## 4            4              NA                    NA             NA
## 5            1               1                    NA              1
## 6            3              NA                    NA             NA
##   prim16_sen_dem_other prim16_pres_rep prim16_pres_rep_other
## 1                                   NA                      
## 2                                   NA                      
## 3                                   NA                      
## 4                                   NA                      
## 5                                   NA                      
## 6                                   NA                      
##   prim16_sen_rep prim16_sen_rep_other fixdeficit_il cuts_K12 cuts_he
## 1             NA                   NA             5        2       2
## 2             NA                   NA             5        2       2
## 3             NA                   NA             3        1       1
## 4             NA                   NA             3        2       2
## 5             NA                   NA             3        2       3
## 6             NA                   NA             3        2       2
##   cuts_cops cuts_parks cuts_poor cuts_dis pen_cuts tax_services gamble_exp
## 1         2          1         2        2        1            2          1
## 2         2          2         3        2        2            2          4
## 3         1          1         1        1        1            3          4
## 4         1          2         2        2        1            2          4
## 5         2          2         2        2        1            1          1
## 6         1          2         2        2        2            2          1
##   tax_ret_inc tax_ret_50K tax_ret_supp tax_inc_temp tax_inc_mil
## 1           2           1            1            4           1
## 2           2           2            2            2           1
## 3           1          NA            1            2           4
## 4           1          NA            1            1           1
## 5           2           3            2            3           1
## 6           2           2            2            2           2
##   tax_inc_grad tax_gas redist_neutral redist_commis cf_judge cf_judge_pub
## 1            1       4              1             4        1            4
## 2            3       3              1             1        1            1
## 3            4       4              1             1        1            4
## 4            1       1              1             1        1            1
## 5            1       1              1             2        1            2
## 6            3       2              2             2        2            3
##   term_lim right2work budg_affect budg_imp1 budg_imp2 budge_imp3 budg_imp4
## 1        1          1           1         9        NA         NA        NA
## 2        3          4           1         9        NA         NA        NA
## 3        1          1           1         9        NA         NA        NA
## 4        4          4           1         5        NA         NA        NA
## 5        1          4           2        NA        NA         NA        NA
## 6        2          1           1         3        NA         NA        NA
##   budg_imp5 budg_imp6 budg_imp7            budg_imp_other budg_child
## 1        NA        NA        NA             The fifth law          0
## 2        NA        NA        NA            Less security.          0
## 3        NA        NA        NA Bills are not being paid.          0
## 4        NA        NA        NA                                    0
## 5        NA        NA        NA                                    0
## 6        NA        NA        NA                                    0
##   budg_citycuts budg_he budg_infradec budg_jobloss budg_locecon
## 1             0       0             0            0            0
## 2             0       0             0            0            0
## 3             0       0             0            0            0
## 4             0       0             0            1            0
## 5             0       0             0            0            0
## 6             0       1             0            0            0
##   budg_realest budg_socserv budg_other abort_leg lg_rights med_mari
## 1            0            0          1         1         1        1
## 2            0            0          1         2         4        1
## 3            0            0          1         1         2        1
## 4            0            0          0         1         1        1
## 5            0            0          0         3         1        1
## 6            0            0          0         1         1        1
##   mari_rec vet_courts miss_diabet miss_depress miss_heart miss_anxiety
## 1        1          1           1           NA         NA            2
## 2        2          2           5           NA          5           NA
## 3        1          2           1           NA          2           NA
## 4        2          1           1           NA         NA            1
## 5        1          1          NA            2         NA            3
## 6        2          2          NA            2         NA            2
##   education ideo_leans ideology party_leans party_aff employ_time
## 1         1          1        1           2         1           3
## 2         5          3        2           4         2           1
## 3         6          3        2           9         9           3
## 4         6          1        1           3         1           1
## 5         2          3        2           1         1           3
## 6         6          2        1           2         1           2
##   employ_type retired_type union relig_att_often relig_type religion
## 1          NA            1     2               3          1       NA
## 2           4           NA     1               1          1       NA
## 3          NA            3     2               4         NA        8
## 4           3           NA     2               2          1       NA
## 5          NA            1     2               5         NA        3
## 6           2           NA     2               4         NA        1
##   born_again vet vet_fam   zip age_cat birth_yr rac_eth rac_eth_other
## 1          2   2       1 99999      11     1942       1              
## 2          2   5       1 99999       6     1970       7              
## 3         NA   5       2 99999      14     1900       7              
## 4          2   5       1 99999       9     1954       1              
## 5         NA   5       2 99999      12     1938       1              
## 6          2   5       2 99999      10     1950       1              
##   hhinc
## 1     1
## 2     5
## 3     9
## 4     6
## 5     2
## 6     5

You should see a bunch of columns with the first six rows listed.
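The odd first column name in the output above (ï..id) is most likely a byte-order mark at the start of the CSV file being read as part of the name. Here is a minimal sketch of renaming that column and taking a quick look at the structure of a data frame. The toy data frame is a made-up stand-in so the example runs without downloading anything.

```r
# Toy stand-in for the first rows of the survey data (not the real file).
# "\u00ef..id" reproduces the garbled first column name seen in the output above.
toy <- data.frame(c(60433, 3405, 11628), c(20160219, 20160215, 20160217))
names(toy) <- c("\u00ef..id", "date")

names(toy)[1] <- "id"   # rename the first column by position
dim(toy)                # number of rows and columns
str(toy)                # type and first few values of each column
```

On the real data, the same rename would be names(simon)[1] <- "id".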

Creating Variables and Recoding Variables

Let’s look at an individual variable in the data set: the race of the respondent. It’s called rac_eth.

table(simon$rac_eth)

## 
##   1   2   3   4   5   6   7 
## 724 128  23  40  21   6  58

We see that it has a bunch of different values: 1 = white, 2 = black, 3 = Asian, and so on. (I got that by looking at the codebook I linked to above. Nothing in the R output will tell you what 1 or 2 or 3 corresponds to; you have to use the codebook.)
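If you want readable labels in your tables, you can attach the codebook values yourself. This is only a sketch: the vector of codes is made up, and only the three labels mentioned above are filled in, so every other code is lumped into a placeholder until you look it up in the codebook.

```r
# Made-up race codes standing in for simon$rac_eth
rac_eth <- c(1, 2, 3, 1, 7, 2)

# Labels taken from the codebook values mentioned above; other codes are not filled in
known_labels <- c("1" = "White", "2" = "Black", "3" = "Asian")

# Look up each code by name; codes without a label come back as NA
race_label <- unname(known_labels[as.character(rac_eth)])
race_label[is.na(race_label)] <- "Other (see codebook)"

table(race_label)
```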

So, let’s do some basic manipulation. We will create a new variable for black respondents and then recode it into a dichotomous variable where black = 1 and everything else = 0. Here’s how we do that. First we need to install and load a package that will allow us to do recodes easily. Here’s the syntax for loading the “car” package. (The install line is commented out; remove the ## and run it once if you don’t have the package yet.)

##install.packages("car")
library(car)

Then let’s create a new variable called black and then recode it.

simon$black <- simon$rac_eth
simon$black <- recode(simon$black, "2=1; else=0")
table(simon$black)

## 
##   0   1 
## 872 128

As you can see, now we have a variable that is just two values, 1 if the respondent is black and zero if they are anything else.
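For a simple two-category recode like this one, base R can do the same job without an extra package. A small sketch, using a made-up vector in place of simon$rac_eth:

```r
# Made-up race codes standing in for simon$rac_eth (2 = black in the codebook)
rac_eth <- c(1, 2, 3, 2, 7, NA)

# The comparison gives TRUE/FALSE; as.integer() turns that into 1/0
black <- as.integer(rac_eth == 2)

# useNA = "ifany" shows missing values in the table instead of hiding them
table(black, useNA = "ifany")
```

On the real data this would be simon$black <- as.integer(simon$rac_eth == 2). Keep in mind that a missing rac_eth stays NA in this version.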

Now do the same thing yourself: create an asian variable and recode it to a dichotomous variable where Asian = 1 and everything else = 0.

Let’s take a slightly more advanced look at recoding. There’s a variable in the dataset called ideo_leans. Here’s what it looks like:

table(simon$ideo_leans)

## 
##   1   2   3   4   5   6 
## 121 214 279 222 107  57

1 means a strong Democrat, 2 is a lean Democrat, 3 is moderate, 4 is lean Republican, and 5 is a strong Republican. In addition, 6 is “Don’t Know.” Let’s turn that into a much simpler variable: 1 if you are a Democrat, 2 if you are moderate, and 3 if you are a Republican.

simon$newideo <- simon$ideo_leans
simon$newideo <- recode(simon$newideo, "2=1; 3=2; 4=3; 5=3; 6=9")
table(simon$newideo)

## 
##   1   2   3   9 
## 335 279 329  57

So we can see that we have collapsed that into fewer categories. But I really don’t like the “Don’t Know” being a 9. Let’s just set it to missing.

simon$newideo[simon$newideo == 9] <- NA
table(simon$newideo)

## 
##   1   2   3 
## 335 279 329
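By default table() hides missing values, so the 57 “Don’t Know” responses simply drop out of the output. Here is a hedged sketch, on a made-up vector, of doing the collapse and the set-to-missing step in one go with a base R lookup vector, then asking table() to show the NA count as well.

```r
# Made-up ideology codes standing in for simon$ideo_leans (1 to 6)
ideo_leans <- c(1, 2, 3, 4, 5, 6, 3, 2)

# Position i holds the new code for old code i: 1,2 -> 1; 3 -> 2; 4,5 -> 3; 6 -> NA
collapse <- c(1, 1, 2, 3, 3, NA)
newideo <- collapse[ideo_leans]

# useNA = "ifany" adds a column for missing values when there are any
table(newideo, useNA = "ifany")
```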

Much better. Now let’s look at that result visually.

Basic Graphs

ggplot2 is one of the most popular and customizable graphing packages in R. Let’s load it up.

##install.packages("ggplot2")
library(ggplot2)

Now let’s make a simple histogram (bar plot) of our new ideology variable.

qplot(newideo, data=simon, geom="histogram")
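Recent releases of ggplot2 mark qplot() as deprecated, so you may see a warning when you run the line above. A sketch of the same chart with the full ggplot() syntax follows. The toy data frame and the axis labels are my own placeholders, not part of the survey data.

```r
library(ggplot2)

# Made-up values standing in for simon$newideo
toy <- data.frame(newideo = c(1, 1, 2, 3, 3, NA, 2, 1))

# Drop missing values, treat the codes as categories, and draw one bar per category
p <- ggplot(subset(toy, !is.na(newideo)), aes(x = factor(newideo))) +
  geom_bar() +
  labs(x = "Ideology (collapsed)", y = "Number of respondents")

print(p)
```

To use it on the real data, swap toy for simon.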

Let’s make that graph interactive by using the plotly package.

##install.packages("plotly")
library(plotly)
plot_ly(x = simon$newideo, type = "histogram")

If you mouse over this graphic, it should be interactive. All you have to do is change the “x =” part of the syntax and you can create an interactive plotly chart of any of the variables in our dataset.

Crosstabs

Crosstabs are great because they can give you a little peek at the relationships between different variables. We just created a new variable about political ideology above. Let’s use that in our crosstab. First, let’s install and load a package with a great crosstab command.

##install.packages("descr")
library(descr)

Now that it’s loaded, let’s do a crosstab of the age variable and our political ideology variable.

crosstab(simon$newideo, simon$age_cat)

##    Cell Contents 
## |-------------------------|
## |                   Count | 
## |-------------------------|
## 
## ==============================================================================================
##        simon$age_cat
## sm$      1     2     3     4     5     6     7     8     9    10    11    12    13    14   Ttl
## ----------------------------------------------------------------------------------------------
## 1       16    19    24    18    26    26    28    29    34    28    22    21    26    18   335
## ----------------------------------------------------------------------------------------------
## 2        8     6     7     8    10    21    22    38    34    37    21    14    35    18   279
## ----------------------------------------------------------------------------------------------
## 3        6    10     9     8    16    25    30    42    38    35    25    28    34    23   329
## ----------------------------------------------------------------------------------------------
## Ttl     30    35    40    34    52    72    80   109   106   100    68    63    95    59   943
## ==============================================================================================

In the “newideo” variable, 1 is liberal and 3 is conservative. The age variable runs from the youngest bracket on the left to the oldest on the right. As you can see, younger respondents lean more liberal and older respondents lean more conservative: 16 of the 30 people in the youngest bracket are in category 1, compared with 18 of 59 in the oldest. You can also see that the sample has more people in brackets 8, 9, and 10 than in the other age brackets.
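Raw counts are hard to compare when the age brackets have different sizes, so it helps to turn each column into percentages. Here is a small sketch using the counts for the youngest and oldest brackets copied from the crosstab above.

```r
# Counts copied from the crosstab above: youngest (1) and oldest (14) age brackets
counts <- matrix(c(16, 8, 6,     # age bracket 1: newideo = 1, 2, 3
                   18, 18, 23),  # age bracket 14: newideo = 1, 2, 3
                 nrow = 3,
                 dimnames = list(newideo = c("1", "2", "3"),
                                 age_cat = c("1", "14")))

# margin = 2 makes each column (age bracket) sum to 100 percent
pct <- round(100 * prop.table(counts, margin = 2), 1)
pct
```

Here 53.3% of the youngest bracket is in category 1, versus 30.5% of the oldest. On the real data the same idea would be prop.table(table(simon$newideo, simon$age_cat), margin = 2).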

What does this crosstab tell you?

crosstab(simon$newideo, simon$black)

##    Cell Contents 
## |-------------------------|
## |                   Count | 
## |-------------------------|
## 
## ==================================
##                  simon$black
## simon$newideo      0     1   Total
## ----------------------------------
## 1                283    52     335
## ----------------------------------
## 2                240    39     279
## ----------------------------------
## 3                305    24     329
## ----------------------------------
## Total            828   115     943
## ==================================

Concluding Thoughts

This is just a small taste of what R can do. The best way to learn is to find a dataset and start messing around with it. You can find some good ones on Kaggle, or you can create your own using tweets, tables on Wikipedia, or a bunch of other things.