You’ve got to get your data into R, and I will show you a very simple way to do that. We are going to use a dataset collected by the Paul Simon Public Policy Institute. It comes from a phone survey of 1,000 respondents in Illinois that was conducted in March of 2016. The codebook explains what each variable and value in the data means.
You need to have downloaded and installed R and RStudio before you begin.
A link to R is here
A link to RStudio is here
Open up RStudio and paste the following command into the console at the bottom left.
simon <- read.csv(url("http://goo.gl/exQA14"))
If you’ve done it correctly you should have loaded in a dataset and labeled it simon. To check if you have done this correctly type the following:
head(simon)
## ï..id date int_lang area type cell safe youngest reg gender rwdirusa
## 1 60433 20160219 1 3 1 NA NA 1 1 2 2
## 2 3405 20160215 1 2 1 NA NA 1 1 2 2
## 3 11628 20160217 1 3 2 1 1 NA 1 1 2
## 4 1144 20160214 1 3 2 1 1 NA 1 2 1
## 5 2004 20160215 1 2 1 NA NA 1 1 1 1
## 6 2773 20160216 1 3 2 1 1 NA 1 2 1
## rwdiril rwdirlocal qlife app_gov_rau app_ussen_kirk app_ussen_dur
## 1 2 2 3 5 4 2
## 2 2 2 3 5 4 5
## 3 2 2 3 2 2 2
## 4 2 2 5 5 2 1
## 5 1 1 2 3 9 3
## 6 2 1 2 5 9 2
## prim16_party prim16_pres_dem prim16_pres_dem_other prim16_sen_dem
## 1 3 NA NA NA
## 2 4 NA NA NA
## 3 4 NA NA NA
## 4 4 NA NA NA
## 5 1 1 NA 1
## 6 3 NA NA NA
## prim16_sen_dem_other prim16_pres_rep prim16_pres_rep_other
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
## prim16_sen_rep prim16_sen_rep_other fixdeficit_il cuts_K12 cuts_he
## 1 NA NA 5 2 2
## 2 NA NA 5 2 2
## 3 NA NA 3 1 1
## 4 NA NA 3 2 2
## 5 NA NA 3 2 3
## 6 NA NA 3 2 2
## cuts_cops cuts_parks cuts_poor cuts_dis pen_cuts tax_services gamble_exp
## 1 2 1 2 2 1 2 1
## 2 2 2 3 2 2 2 4
## 3 1 1 1 1 1 3 4
## 4 1 2 2 2 1 2 4
## 5 2 2 2 2 1 1 1
## 6 1 2 2 2 2 2 1
## tax_ret_inc tax_ret_50K tax_ret_supp tax_inc_temp tax_inc_mil
## 1 2 1 1 4 1
## 2 2 2 2 2 1
## 3 1 NA 1 2 4
## 4 1 NA 1 1 1
## 5 2 3 2 3 1
## 6 2 2 2 2 2
## tax_inc_grad tax_gas redist_neutral redist_commis cf_judge cf_judge_pub
## 1 1 4 1 4 1 4
## 2 3 3 1 1 1 1
## 3 4 4 1 1 1 4
## 4 1 1 1 1 1 1
## 5 1 1 1 2 1 2
## 6 3 2 2 2 2 3
## term_lim right2work budg_affect budg_imp1 budg_imp2 budge_imp3 budg_imp4
## 1 1 1 1 9 NA NA NA
## 2 3 4 1 9 NA NA NA
## 3 1 1 1 9 NA NA NA
## 4 4 4 1 5 NA NA NA
## 5 1 4 2 NA NA NA NA
## 6 2 1 1 3 NA NA NA
## budg_imp5 budg_imp6 budg_imp7 budg_imp_other budg_child
## 1 NA NA NA The fifth law 0
## 2 NA NA NA Less security. 0
## 3 NA NA NA Bills are not being paid. 0
## 4 NA NA NA 0
## 5 NA NA NA 0
## 6 NA NA NA 0
## budg_citycuts budg_he budg_infradec budg_jobloss budg_locecon
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 1 0
## 5 0 0 0 0 0
## 6 0 1 0 0 0
## budg_realest budg_socserv budg_other abort_leg lg_rights med_mari
## 1 0 0 1 1 1 1
## 2 0 0 1 2 4 1
## 3 0 0 1 1 2 1
## 4 0 0 0 1 1 1
## 5 0 0 0 3 1 1
## 6 0 0 0 1 1 1
## mari_rec vet_courts miss_diabet miss_depress miss_heart miss_anxiety
## 1 1 1 1 NA NA 2
## 2 2 2 5 NA 5 NA
## 3 1 2 1 NA 2 NA
## 4 2 1 1 NA NA 1
## 5 1 1 NA 2 NA 3
## 6 2 2 NA 2 NA 2
## education ideo_leans ideology party_leans party_aff employ_time
## 1 1 1 1 2 1 3
## 2 5 3 2 4 2 1
## 3 6 3 2 9 9 3
## 4 6 1 1 3 1 1
## 5 2 3 2 1 1 3
## 6 6 2 1 2 1 2
## employ_type retired_type union relig_att_often relig_type religion
## 1 NA 1 2 3 1 NA
## 2 4 NA 1 1 1 NA
## 3 NA 3 2 4 NA 8
## 4 3 NA 2 2 1 NA
## 5 NA 1 2 5 NA 3
## 6 2 NA 2 4 NA 1
## born_again vet vet_fam zip age_cat birth_yr rac_eth rac_eth_other
## 1 2 2 1 99999 11 1942 1
## 2 2 5 1 99999 6 1970 7
## 3 NA 5 2 99999 14 1900 7
## 4 2 5 1 99999 9 1954 1
## 5 NA 5 2 99999 12 1938 1
## 6 2 5 2 99999 10 1950 1
## hhinc
## 1 1
## 2 5
## 3 9
## 4 6
## 5 2
## 6 5
You should see a bunch of columns with the first six rows listed.
Let’s look at an individual variable in the dataset: the race of the respondent. It’s called rac_eth.
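Besides head(), a few other base R functions give a quick overview of a data frame. Here is a minimal sketch on a small stand-in data frame called toy (a name made up for this example), built from five of the columns in the output above. With the real data you would pass simon instead.

```r
# Toy stand-in for the survey data: the first six rows of five columns,
# copied from the head(simon) output above.
toy <- data.frame(
  id      = c(60433, 3405, 11628, 1144, 2004, 2773),
  date    = c(20160219, 20160215, 20160217, 20160214, 20160215, 20160216),
  area    = c(3, 2, 3, 3, 2, 3),
  rac_eth = c(1, 7, 7, 1, 1, 1),
  hhinc   = c(1, 5, 9, 6, 2, 5)
)

dim(toy)          # number of rows, then number of columns
names(toy)        # just the column names
str(toy)          # type and first few values of each column
head(toy, n = 3)  # head() shows six rows by default; n changes that
```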
table(simon$rac_eth)
##
## 1 2 3 4 5 6 7
## 724 128 23 40 21 6 58
We see that it has a bunch of different values: 1 = white, 2 = black, 3 = Asian, and so on. (I got that from the codebook mentioned above. Nothing in the R output will tell you what 1 or 2 or 3 corresponds to, so you have to use the codebook.)
So, let’s do some basic manipulation. We will create a new variable for black respondents and recode it into a dichotomous variable where black = 1 and everything else = 0. First we need to install and load a package that will allow us to do recodes easily. Here’s the syntax for installing (commented out below) and loading the “car” package.
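If flipping back to the codebook gets tedious, one option is to attach labels to the codes with factor(). This is only a sketch: the vector below is made up for illustration, and it uses just the three labels given above (1 = white, 2 = black, 3 = Asian). The labels for the remaining codes would have to come from the codebook.

```r
# Small made-up vector of race codes; only codes 1 to 3 appear because
# those are the only labels spelled out in the text above.
rac_eth <- c(1, 2, 1, 3, 1, 2)

# Attach labels so tables are readable without the codebook at hand.
rac_eth_labeled <- factor(rac_eth,
                          levels = c(1, 2, 3),
                          labels = c("White", "Black", "Asian"))

table(rac_eth_labeled)
```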
##install.packages("car")
library(car)
Then let’s create a new variable called black and then recode it.
simon$black <- simon$rac_eth
simon$black <- recode(simon$black, "2=1; else=0")
table(simon$black)
##
## 0 1
## 872 128
As you can see, now we have a variable that is just two values, 1 if the respondent is black and zero if they are anything else.
Now, do the same thing but instead create an Asian variable and recode it to a dichotomous variable.
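If you get stuck, here is one possible approach that uses base R’s ifelse() rather than the car package. It is a sketch on a made-up vector, assuming 3 = Asian as stated above. With the real data you would use simon$rac_eth and store the result in simon$asian.

```r
# Made-up race codes, including one missing value.
rac_eth <- c(1, 2, 3, 7, 3, NA)

# 1 if the code is 3 (Asian), 0 for any other code.
# ifelse() leaves a missing input as NA instead of guessing.
asian <- ifelse(rac_eth == 3, 1, 0)

# useNA = "ifany" makes table() report missing values too.
table(asian, useNA = "ifany")
```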
Let’s take a little bit of an advanced look at recoding. There’s a variable in the dataset called ideo_leans. Here’s what it looks like:
table(simon$ideo_leans)
##
## 1 2 3 4 5 6
## 121 214 279 222 107 57
1 means a strong Democrat, 2 is a lean Democrat, 3 is moderate, 4 is lean Republican, and 5 is a strong Republican. In addition, 6 is “Don’t Know.” Let’s turn that into a much simpler variable: 1 if you are a Democrat, 2 if you are moderate, and 3 if you are a Republican. For now, “Don’t Know” gets recoded to 9.
simon$newideo <- simon$ideo_leans
simon$newideo <- recode(simon$newideo, "2=1; 3=2; 4=3; 5=3; 6=9")
table(simon$newideo)
##
## 1 2 3 9
## 335 279 329 57
So we can see that we have collapsed that into fewer categories. But I really don’t like the “Don’t Know” being a 9. Let’s just set it to missing.
simon$newideo[simon$newideo == 9] <- NA
table(simon$newideo)
##
## 1 2 3
## 335 279 329
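By default table() silently drops missing values, so the 9s seem to simply vanish. To confirm they really became NA, you can ask table() to show missing values or count them directly. Here is a small sketch on a made-up vector. With the real data, the NA count should match the 57 “Don’t Know” responses from the earlier table.

```r
# Made-up recoded values, with 9 standing in for "Don't Know".
newideo <- c(1, 1, 2, 3, 3, 9, 2, 9)

# Same step as above: turn the 9s into missing values.
newideo[newideo == 9] <- NA

table(newideo)                   # NA is left out by default
table(newideo, useNA = "ifany")  # adds a column counting the NAs
sum(is.na(newideo))              # number of missing values
```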
Much better. Now let’s look at that result visually.
ggplot2 is one of the most popular and customizable graphing packages in R. Let’s load it up.
##install.packages("ggplot2")
library(ggplot2)
Then make a simple histogram (bar plot) of our new ideology variable.
qplot(newideo, data=simon, geom="histogram")
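A note if you are on a recent version of ggplot2: qplot() was deprecated in version 3.4.0, so the line above may print a warning. Below is a sketch of a roughly equivalent plot using the full ggplot() syntax. It uses a made-up stand-in data frame called toy, and it swaps the histogram for geom_bar(), since newideo only has three categories. With the real data you would use simon in place of toy.

```r
library(ggplot2)

# Made-up stand-in for the newideo column of the survey data.
toy <- data.frame(newideo = c(1, 1, 2, 3, 3, 3, NA, 2))

# Drop missing values, then treat the codes as categories so that
# each code gets exactly one bar.
toy_complete <- subset(toy, !is.na(newideo))

p <- ggplot(toy_complete, aes(x = factor(newideo))) +
  geom_bar() +
  labs(x = "newideo", y = "Count")

p  # printing the object draws the plot
```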
Let’s make that graph interactive by using the plotly package.
##install.packages("plotly")
library(plotly)
plot_ly(x = simon$newideo, type = "histogram")
If you mouse over this graphic it should be interactive. All you have to do is change the “x =” part of the syntax and you can create an interactive plot of any of the variables in our dataset.
Crosstabs are great because they can give you a little peek at the relationships between different variables. We just created a new variable about political ideology above. Let’s use that in our crosstab. First, let’s load a package with a great crosstab command (if you don’t have it yet, run install.packages("descr") once beforehand).
library(descr)
Now that it’s loaded, let’s do a crosstab of the age variable and our political ideology variable.
crosstab(simon$newideo, simon$age_cat)
## Cell Contents
## |-------------------------|
## | Count |
## |-------------------------|
##
## ==============================================================================================
## simon$age_cat
## sm$ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Ttl
## ----------------------------------------------------------------------------------------------
## 1 16 19 24 18 26 26 28 29 34 28 22 21 26 18 335
## ----------------------------------------------------------------------------------------------
## 2 8 6 7 8 10 21 22 38 34 37 21 14 35 18 279
## ----------------------------------------------------------------------------------------------
## 3 6 10 9 8 16 25 30 42 38 35 25 28 34 23 329
## ----------------------------------------------------------------------------------------------
## Ttl 30 35 40 34 52 72 80 109 106 100 68 63 95 59 943
## ==============================================================================================
In the “newideo” variable, 1 is liberal and 3 is conservative. The age variable runs from the youngest respondents on the left to the oldest on the right. As you can see, younger people lean more liberal and older people lean more conservative. You can also see that the sample has more people in age categories 8, 9, and 10 than in the other brackets.
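Raw counts are hard to compare when the age groups differ so much in size, so it can help to turn each column into percentages. Here is a minimal base R sketch using prop.table() on just two columns of the crosstab above (age categories 1 and 8). The object names counts and col_pct are made up for this example.

```r
# Counts copied from the crosstab above for age categories 1 and 8.
# matrix() fills column by column, so the first three numbers are
# age category 1 and the last three are age category 8.
counts <- matrix(c(16, 8, 6,
                   29, 38, 42),
                 nrow = 3,
                 dimnames = list(newideo = c("1", "2", "3"),
                                 age_cat = c("1", "8")))

# margin = 2 computes shares within each column, so each age
# category sums to 100 percent.
col_pct <- round(100 * prop.table(counts, margin = 2), 1)
col_pct
```

In this slice, about 53 percent of age category 1 falls in ideology group 1, compared with about 27 percent of age category 8.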
What does this crosstab tell you?
crosstab(simon$newideo, simon$black)
## Cell Contents
## |-------------------------|
## | Count |
## |-------------------------|
##
## ==================================
## simon$black
## simon$newideo 0 1 Total
## ----------------------------------
## 1 283 52 335
## ----------------------------------
## 2 240 39 279
## ----------------------------------
## 3 305 24 329
## ----------------------------------
## Total 828 115 943
## ==================================
This is just a tiny bit of all that R can do. The best way to learn is to find a dataset and start messing around with it. You can find some good ones on Kaggle, or you can create your own using tweets, tables on Wikipedia, or a bunch of other things.