PDM Week 3 Supplement

Author

Collin Paschall

Lists, Dataframes, and PS 1

We’re through the first two weeks of PDM! So far, we’ve been covering very basic programming concepts in R, and right now we are in the middle of data types. The lectures and code for this week cover lists and dataframes. Of the two, dataframes are the more important. Most of the data that you work with in R will be in a dataframe. When you import data into R from an outside source, like a comma-separated values (CSV) file (similar to a spreadsheet) or even JSON documents (more on this later), you are going to be getting these data into the dataframe format to use them.

Lists are less important for practical use in R, though they will come up occasionally. We’re covering lists this week because dataframes are specialized lists, so we’re trying to be thorough with building the foundation. Occasionally, you’ll find a package that wants data in a list or generates data that is in a list, so in those cases, the familiarity can be helpful.
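To make the "dataframes are specialized lists" point concrete, here is a minimal sketch. The names my_list and small_df are made up for illustration and are not part of the lecture code.

```r
# A list can hold elements of different types and lengths.
my_list <- list(
  numbers = c(1, 2, 3),
  label   = "a string",
  flag    = TRUE
)

# Extract a list element by name with $ or [[ ]].
my_list$numbers
my_list[["label"]]

# A dataframe is a list whose elements (the columns) all have the same length.
small_df <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"))
is.list(small_df)   # TRUE
typeof(small_df)    # "list"
length(small_df)    # 2, one element per column
```

This is why the $ operator works the same way on lists and on dataframes.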

As a brief outline, here’s the scoop on dataframes.

Make a Dataframe

You can make a dataframe with any number of rows and columns (well, it might not be a good idea to make one with millions of rows or columns, but let’s put aside really big data).

Every cell in the dataframe has to have a value, and all values in a column should be of the same type. So, in the data frame I create here, values_2 has to have an NA in the last row. Without it, the columns would have different lengths and data.frame() would stop with an error.

If the values in a column aren’t the same type, R will coerce them to a common type, as in the “mixed” column here, where the numbers and TRUE are all converted to character strings.

my_df <- data.frame(
            values_1=c(345,234,7456,234,57856,23234),
            values_2=c(456,234,274,273,2462,NA),
            strings=c("werwer","werwer","oinoh","qwewr","bnoi","miwue"),
            mixed=c(1,"o;oih",TRUE,2234,"oihm",235)
)

my_df

  values_1 values_2 strings mixed
1      345      456  werwer     1
2      234      234  werwer o;oih
3     7456      274   oinoh  TRUE
4      234      273   qwewr  2234
5    57856     2462    bnoi  oihm
6    23234       NA   miwue   235
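If you want to confirm what R did with each column, the sketch below checks the column classes and shows what happens when the columns have different lengths. It rebuilds my_df so it can run on its own; col_classes and result are illustrative names, and the "character" results assume R 4.0 or later, where strings are not converted to factors by default.

```r
my_df <- data.frame(
  values_1 = c(345, 234, 7456, 234, 57856, 23234),
  values_2 = c(456, 234, 274, 273, 2462, NA),
  strings  = c("werwer", "werwer", "oinoh", "qwewr", "bnoi", "miwue"),
  mixed    = c(1, "o;oih", TRUE, 2234, "oihm", 235)
)

# Check the class of each column.
# In R 4.0 or later, the "mixed" column comes back as character.
col_classes <- sapply(my_df, class)
col_classes

# Columns of different lengths (6 values vs. 5 values) produce an error.
# tryCatch() captures the error message instead of stopping the script.
result <- tryCatch(
  data.frame(a = c(1, 2, 3, 4, 5, 6), b = c(1, 2, 3, 4, 5)),
  error = function(e) conditionMessage(e)
)
result
```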

Using Dataframes

Just being able to make a dataframe isn’t that interesting. What’s more important is the ability to use dataframes to access data. This is pretty straightforward.

You can extract a column from a data frame using the $ operator. So, to get the values_1 column from my_df, just do this:

my_df$values_1

[1]   345   234  7456   234 57856 23234

You can assign the values_1 column to a variable in your environment, and it is just a vector with double/numeric type data.

the_data <- my_df$values_1

typeof(the_data)

[1] "double"

the_data

[1]   345   234  7456   234 57856 23234

You won’t need to do this often, but if you want to extract specific cells, you can do that using subsetting brackets. It’s important to note here that there are numerous ways to get the same output using different ways of subsetting the data. Just stick with this for now.

my_df$values_1[4]

[1] 234

my_df$values_1[2:5]

[1]   234  7456   234 57856
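For reference only, here is a small sketch of a few of those equivalent ways to reach the same cell. It rebuilds part of my_df so it runs on its own, and the variable names a, b, and c_val are made up for this example.

```r
my_df <- data.frame(
  values_1 = c(345, 234, 7456, 234, 57856, 23234),
  values_2 = c(456, 234, 274, 273, 2462, NA)
)

# Three ways to get the fourth value of the values_1 column.
a     <- my_df$values_1[4]        # extract the column, then subset the vector
b     <- my_df[["values_1"]][4]   # same idea, using [[ ]] with the column name
c_val <- my_df[4, "values_1"]     # [row, column] indexing on the dataframe

# All three give the same value.
identical(a, b)
identical(a, c_val)
```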

You can also do logical tests to extract data or subset data.

my_df$values_1[my_df$values_1>234]

[1]   345  7456 57856 23234
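It can help to look at the logical test by itself before using it inside the brackets. The sketch below uses a standalone copy of the values_1 numbers, and keep is just an illustrative name.

```r
values_1 <- c(345, 234, 7456, 234, 57856, 23234)

# The comparison produces one TRUE/FALSE per element.
keep <- values_1 > 234
keep

# Subsetting with the logical vector keeps only the TRUE positions.
values_1[keep]

# sum() counts the TRUE values, i.e. how many elements passed the test.
sum(keep)
```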

A powerful application of this is extracting subsets of rows of an entire data frame based on the criteria of a single column. Here, I get all rows of the dataframe where the values in the values_1 column are greater than 234. Note the comma inside the brackets. Leaving the space after the comma empty says that I want all columns of the dataset, and the condition before the comma selects the rows where values_1 is more than 234. Subsetting a dataframe with brackets takes the form [row,column].

my_df[my_df$values_1>234,]

  values_1 values_2 strings mixed
1      345      456  werwer     1
3     7456      274   oinoh  TRUE
5    57856     2462    bnoi  oihm
6    23234       NA   miwue   235
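The [row,column] form also lets you pick columns at the same time. The sketch below shows that, plus one thing to watch for when the column you test contains NA. It rebuilds part of my_df so it runs on its own; subset_cols, with_na, and without_na are illustrative names.

```r
my_df <- data.frame(
  values_1 = c(345, 234, 7456, 234, 57856, 23234),
  values_2 = c(456, 234, 274, 273, 2462, NA),
  strings  = c("werwer", "werwer", "oinoh", "qwewr", "bnoi", "miwue")
)

# Rows where values_1 > 234, keeping only two named columns.
subset_cols <- my_df[my_df$values_1 > 234, c("values_1", "strings")]
subset_cols

# A test on a column containing NA returns NA for that row,
# and the result then includes an extra row filled with NA.
with_na <- my_df[my_df$values_2 > 270, ]
nrow(with_na)

# which() returns only the positions where the test is TRUE,
# so the NA row is left out.
without_na <- my_df[which(my_df$values_2 > 270), ]
nrow(without_na)
```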

This can be a little mind numbing to learn ahead of time. You kind of just have to struggle your way through trying to select the particular data you want from a dataframe, and eventually you will get the hang of it. It doesn’t come easily!

Let me give you a quick, more extended example of how you can manage data with dataframes and do something useful with it. Some of this is beyond the scope of the problem set and lectures for the week. It’s just an example.

cces <- read.csv(url("https://raw.githubusercontent.com/collinpaschall/teaching_repo/main/cces_sample.csv")) #import some example survey data, here I am grabbing a .csv from a GitHub page

class(cces) # it's a data frame

[1] "data.frame"

names(cces) # look at the column names

 [1] "caseid"       "region"       "gender"       "educ"         "edloan"      
 [6] "race"         "hispanic"     "employ"       "marstat"      "pid7"        
[11] "ideo5"        "pew_religimp" "newsint"      "faminc_new"   "union"       
[16] "investor"     "CC18_308a"    "CC18_310a"    "CC18_310b"    "CC18_310c"   
[21] "CC18_310d"    "CC18_325a"    "CC18_325b"    "CC18_325c"    "CC18_325d"

# Let's focus on just political ideology and education.

# Political ideology is a seven-point scale (note: in the CCES codebook, pid7 is the party identification item; ideo5 is the ideology item)
summary(cces$pid7)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.000   3.000   3.624   6.000   7.000

# Education is a 6 point scale
summary(cces$educ)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   2.000   4.000   3.761   5.000   6.000

# Assess the mean political ideology (pid7) of respondents at the lowest (educ == 1) and highest (educ == 6) education levels. Do you understand how these two lines compare the means of political ideology among high and low education respondents? If so, you are on your way.

mean(cces[cces$educ==1,]$pid7)

[1] 3.441176

mean(cces[cces$educ==6,]$pid7)

[1] 3.175758
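If those two lines are hard to read, it can help to break one of them into steps. The sketch below does that on a tiny made-up dataframe (toy, with invented values) so it runs without downloading anything. Only the column names mirror the cces example.

```r
# Hypothetical toy data; the values are invented for illustration.
toy <- data.frame(
  educ = c(1, 1, 6, 6, 6),
  pid7 = c(2, 4, 1, 3, 2)
)

# Step 1: the logical test marks the rows where educ equals 1.
is_low <- toy$educ == 1

# Step 2: keep those rows and all columns.
low_rows <- toy[is_low, ]

# Step 3: pull out the pid7 column and take its mean.
mean_low <- mean(low_rows$pid7)
mean_low

# The one-line version from the text does the same three steps at once.
mean_high <- mean(toy[toy$educ == 6, ]$pid7)
mean_high

# An equivalent form: subset the pid7 vector directly.
mean_high_alt <- mean(toy$pid7[toy$educ == 6])
```

If the column being averaged contained missing values, mean() would return NA unless you add na.rm = TRUE.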

With the data conveniently stored in a dataframe, you can also do fun things like generating plots. Let’s calculate the mean political ideology for each level of education, and plot them! Note: there are more streamlined ways of doing this that we will learn later, but this is the kind of brute force approach you might use when you are first learning to code. This is not good coding practice, but it works!

mean_pid <- 
  c(mean(cces[cces$educ==1,]$pid7),
  mean(cces[cces$educ==2,]$pid7),
  mean(cces[cces$educ==3,]$pid7),
  mean(cces[cces$educ==4,]$pid7),
  mean(cces[cces$educ==5,]$pid7),
  mean(cces[cces$educ==6,]$pid7))

educ_levels <- seq(1,6)

plot_data <- data.frame(
  mean_ideology=mean_pid,
  education=educ_levels
)

plot(plot_data$education,plot_data$mean_ideology)

This plot is a little funky because it doesn’t show a super clear pattern, but these are the kinds of mysteries you confront when you start working with data in data frames!

# Or alternatively, make a boxplot

boxplot(cces[cces$educ==1,]$pid7,
        cces[cces$educ==2,]$pid7,
        cces[cces$educ==3,]$pid7,
        cces[cces$educ==4,]$pid7,
        cces[cces$educ==5,]$pid7,
        cces[cces$educ==6,]$pid7,
        xlab="Education Level",
        ylab="Ideology")
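As a preview of the more streamlined approaches mentioned above, here is one possible sketch using base R's tapply() and the formula interface of boxplot(). It is not necessarily the method the course will teach later. It uses a tiny made-up dataframe (toy, with invented values) so it runs without downloading anything.

```r
# Hypothetical toy data; the values are invented for illustration.
toy <- data.frame(
  educ = c(1, 1, 2, 2, 3, 3),
  pid7 = c(2, 4, 3, 5, 1, 7)
)

# tapply() splits pid7 by educ and applies mean() to each group,
# replacing the six separate mean() calls.
mean_pid <- tapply(toy$pid7, toy$educ, mean)
mean_pid

# The names of the result are the education levels.
plot_data <- data.frame(
  education     = as.numeric(names(mean_pid)),
  mean_ideology = as.vector(mean_pid)
)

# The formula pid7 ~ educ draws one box per education level.
bp <- boxplot(pid7 ~ educ, data = toy,
              xlab = "Education Level",
              ylab = "Ideology")
```

With the real data, you would swap toy for cces in each call.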