Inputting and Managing Data in R

In this recitation, we (1) import a csv dataset, (2) explore its properties and the properties of its variables, and (3) create a subset of the dataset and save it in .RData format.

Relevant functions: read.csv(), is.data.frame(), dim(), colnames(), head(), class(), levels(), nlevels(), summary(), length(), which(), min(), max(), which.min(), which.max(), subset(), save().

1. Importing the Data Using `read.csv()`

We begin by loading the dataset, which is available on Canvas. Don’t forget to include the correct path to your file within the parentheses! Also, please don’t use the “import dataset” function in the environment window. The script alone should be sufficient to replicate your work.

# Don't forget to set your working directory using setwd()

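# Loading the dataset
# stringsAsFactors = TRUE reads text columns as factors (this is no longer the default since R 4.0.0)
PollData <- read.csv("/Users/evelynebrie/Dropbox/TA/PSCI_107_Fall2018/Recitation/Week3/Data_Poll.csv", stringsAsFactors = TRUE)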
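
A wrong path is the most common reason read.csv() fails. The following is a minimal sketch of how to check a path before reading, not part of the original script. It writes a small stand-in file to a temporary folder so that it runs on any computer; the file name "Data_Poll_demo.csv" and the object names csv_path and DemoData are hypothetical.

```r
# Build a path to a stand-in csv file in R's temporary folder (hypothetical file name)
csv_path <- file.path(tempdir(), "Data_Poll_demo.csv")

# Write a few rows in the same format as the poll data
writeLines(c("voteChoice,age,female",
             "red,28,1",
             "blue,18,0",
             "blue,65,0"), csv_path)

# file.exists() returns TRUE only if the path points to an existing file
file.exists(csv_path)

# Read the file once we know the path is correct
DemoData <- read.csv(csv_path, stringsAsFactors = TRUE)
DemoData
```

With your own file, replace csv_path with the path to Data_Poll.csv on your computer.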

2. Exploring the Data

The following recitation material contains examples inspired by Yannick Dufresne’s “POL-7004 Analyse Quantitative” seminar.

This dataset contains the results of a poll conducted among residents of a fictional country. It contains the following variables: vote choice of respondent, age of respondent, sex of respondent, and three indicator variables for the respondent’s education level.

2.1 Properties of the Dataset

The first things we typically want to know about a dataset are: (1) how many rows and columns the data frame has, (2) what its columns are named, and (3) what the data itself looks like.

# Confirming that "PollData" is a data frame using is.data.frame()
is.data.frame(PollData)

## [1] TRUE

# Looking at the dimensions of the dataset using dim()
dim(PollData)

## [1] 11  6

# Looking at the variable names using colnames()
colnames(PollData)

## [1] "voteChoice"  "age"         "female"      "educHS"      "educCollege"
## [6] "educGrad"

# Displaying the content of the first 5 rows using head()
head(PollData,5)

##   voteChoice age female educHS educCollege educGrad
## 1        red  28      1      0           1        0
## 2       blue  18      0      1           0        0
## 3       blue  65      0      0           1        0
## 4     yellow  40      1      0           0        1
## 5        red  44      1      0           0        1
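
The same information can be obtained with a few other base R functions. The sketch below is an illustration only: it rebuilds by hand the five rows printed by head() above into a hypothetical object called FirstRows, so it runs without the csv file.

```r
# The five rows printed by head() above, rebuilt by hand for illustration
FirstRows <- data.frame(
  voteChoice  = factor(c("red", "blue", "blue", "yellow", "red")),
  age         = c(28, 18, 65, 40, 44),
  female      = c(1, 0, 0, 1, 1),
  educHS      = c(0, 1, 0, 0, 0),
  educCollege = c(1, 0, 1, 0, 0),
  educGrad    = c(0, 0, 0, 1, 1)
)

# Number of rows, same as dim(FirstRows)[1]
nrow(FirstRows)

# Number of columns, same as dim(FirstRows)[2]
ncol(FirstRows)

# Compact overview: the class and the first values of every column
str(FirstRows)
```

str() is convenient because it shows the dimensions, the column names and the class of each variable in a single call.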

2.2 Properties of the Variables

Let’s focus on the variable that might be the most interesting to our pollster: vote intentions. In our dataset, this variable is called “voteChoice”. This variable is a factor variable, which means it’s categorical (we’ll discuss this notion in recitation).

# Viewing the class attribute of the "voteChoice" variable using class()
class(PollData$voteChoice)

## [1] "factor"

# Check what the levels of this factor are using levels()
levels(PollData$voteChoice)

## [1] "blue"   "red"    "yellow"

# Check how many levels there are within this factor using nlevels()
nlevels(PollData$voteChoice)

## [1] 3

# Check how many respondents there are per level (i.e. political party) using summary()
summary(PollData$voteChoice)

##   blue    red yellow 
##      4      4      3
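
If class() returns "character" instead of "factor" on your computer, the variable can be converted with factor(). The sketch below shows this on a hypothetical vector called votes (the vote choices of the first five respondents only), and also shows table(), which counts the observations per category much like summary() does for a factor.

```r
# Vote choices of the first five respondents, stored as plain text
votes <- c("red", "blue", "blue", "yellow", "red")
class(votes)

# Converting the text vector into a factor
votes <- factor(votes)
levels(votes)

# Counting observations per level using table()
counts <- table(votes)
counts

# Turning the counts into proportions using prop.table()
shares <- prop.table(counts)
shares
```

Here counts holds 2 for blue, 2 for red and 1 for yellow, and the proportions in shares add up to 1.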

3. Managing Data

Before making any prediction about the outcome of the forthcoming election using this data, we might want to exclude all underage respondents. In this section, we will take an in-depth look at the age (PollData$age) variable and create a subset containing only the survey responses of respondents who are 21 or older.

3.1 Understanding our Data

# Summarizing the content of the "age" variable using summary()
summary(PollData$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   28.50   37.00   39.91   48.00   72.00

# Counting how many elements of the "age" vector have a value smaller than 21 using length() and which().
# In words, this code might read something like: "give me the length (so, the number of elements)
# of the set of positions within the "age" vector whose value is smaller than 21."
length(which(PollData$age < 21))

## [1] 1

# Displaying the content of the "age" vector one last time 
PollData$age

##  [1] 28 18 65 40 44 21 52 33 29 37 72
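
The comparison PollData$age < 21 produces a vector of TRUE and FALSE values, and R treats TRUE as 1 and FALSE as 0 when summing. The counting step above can therefore also be sketched as follows. The age values are copied from the output just above into a stand-alone vector so the example runs on its own; the name under21 is hypothetical.

```r
# The "age" vector, copied from the output above
age <- c(28, 18, 65, 40, 44, 21, 52, 33, 29, 37, 72)

# Logical vector: TRUE where the respondent is under 21
under21 <- age < 21

# Positions of the TRUE values
which(under21)            # 2

# Two equivalent ways of counting them
length(which(under21))    # 1
sum(under21)              # 1

# The ages of the respondents who are under 21
age[under21]              # 18
```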

######### FIRST TECHNIQUE TO IDENTIFY MINIMUM VALUE ######### 

# Finding the position of the minimum element within the "age" vector and storing it in an object called idx
idx <- which(PollData$age==min(PollData$age))

# Printing out idx, which is the position (row number) of the minimum value within the vector (here, the second element of the "age" vector)
idx

## [1] 2

# Displaying the second element within the "age" vector
PollData$age[idx]

## [1] 18

######### SECOND TECHNIQUE TO IDENTIFY MINIMUM VALUE #########

# Finding the position of the minimum element within the "age" vector using which.min() and storing it in idx
idx <- which.min(PollData$age)

# Printing out the value of this element
idx

## [1] 2

# Displaying the second element within the "age" vector
PollData$age[idx]

## [1] 18
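
Both techniques give the same answer here because only one respondent has the minimum age. They differ when the minimum value appears more than once. The sketch below illustrates this on a hypothetical vector called age_ties, which is not part of the poll data.

```r
# Hypothetical ages in which the minimum (18) appears twice
age_ties <- c(30, 18, 45, 18)

# First technique: returns every position holding the minimum value
all_min <- which(age_ties == min(age_ties))
all_min      # 2 4

# Second technique: returns only the first position holding the minimum value
first_min <- which.min(age_ties)
first_min    # 2
```

If you need every respondent with the minimum value, use the first technique; if any one of them will do, which.min() is shorter.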

3.2 Subsetting our Data

We now know that only one respondent is under 21, and that this respondent is 18 years old. Here, we create an indicator variable called “canVote” to identify all respondents who are 21 or older. We then use this new indicator variable to subset the dataset and create a new dataset without any underage respondents.

# Creating a new indicator variable called "canVote"
PollData$canVote <- NA # Creating an empty variable
PollData$canVote[PollData$age < 21] <- 0 # All underage respondents are attributed a value of 0 in this vector
PollData$canVote[PollData$age >= 21] <- 1 # All respondents aged 21 or older are attributed a value of 1 in this vector

# Creating a new dataset called "Voters" using subset()
Voters <- subset(PollData, canVote==1)

# Alternatively: creating a new dataset called "Voters" using square brackets and %in% (same output as using subset())
Voters <- PollData[PollData$canVote %in% 1,]

# Looking at the dimensions of the dataset using dim()
dim(Voters)

## [1] 10  7

# How many rows were deleted from the original dataset?
dim(PollData)[1] - dim(Voters)[1]

## [1] 1
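
The indicator can also be created in one step with ifelse(). The sketch below does so on a small hypothetical data frame called Demo, and adds a missing age (the poll data above has none) to show why subset() and %in% are the safer ways to filter: a plain == comparison inside square brackets keeps a row of NA values for every missing value.

```r
# Hypothetical data with one missing age
Demo <- data.frame(age = c(28, 18, NA, 40))

# ifelse(test, value if TRUE, value if FALSE); a missing age gives a missing canVote
Demo$canVote <- ifelse(Demo$age >= 21, 1, 0)
Demo$canVote                             # 1 0 NA 1

# subset() drops the row where canVote is missing
nrow(subset(Demo, canVote == 1))         # 2

# %in% also treats the missing value as "not a match"
nrow(Demo[Demo$canVote %in% 1, ])        # 2

# == inside square brackets returns an extra row filled with NA
nrow(Demo[Demo$canVote == 1, ])          # 3
```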

Please find the mistake in each of the following subsetting attempts.

# BlueVoters <- subset(PollData, PollData$voteChoice = "blue")

# YoungVoters <- subset(PollData, PollData$age =< 25)

# Female_and_Grad_Voters <- subset(PollData, PollData$female == 1 | PollData$educGrad == 1)

# NoYellowVoters <- subset(PollData, PollData$voteChoice != yellow)

4. Saving our New Dataset as an .RData File Using `save()`

In this section, we save our new “Voters” dataset as an .RData file, a format designed for use with R.

save(Voters, file="Voters.RData")
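
A saved .RData file is read back with load(), which restores the object under the name it had when it was saved. The sketch below is an illustration only: it uses a small hypothetical data frame called VotersDemo and a temporary file, so that it does not overwrite anything in your working directory.

```r
# Small stand-in dataset (hypothetical values)
VotersDemo <- data.frame(age = c(28, 65, 40), canVote = c(1, 1, 1))

# Saving it to a temporary .RData file
rdata_path <- file.path(tempdir(), "VotersDemo.RData")
save(VotersDemo, file = rdata_path)

# Removing the object from the environment, as if we had restarted R
rm(VotersDemo)

# load() restores the object and returns the names of what it restored
loaded_names <- load(rdata_path)
loaded_names       # "VotersDemo"
dim(VotersDemo)    # 3 2
```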

Exercises

Conduct a similar analysis on the “Data_Income” dataset, available on Canvas.

Step 1

Import the dataset.

Relevant function: read.csv().

Step 2

How many rows and columns does the dataset have?

Relevant function: dim().

Step 3

What type (or class) of object is the dataset you’ve imported?

Relevant function: class().

Step 4

How many respondents have an annual income lower than 40k? What is the income of these respondents?

Relevant functions: length(), which().

Step 5

What is the most common astrological sign for respondents within this dataset?

Relevant functions: summary() (basic) OR which.max() and table() (more advanced: printing out only the name of the most common astrological sign).
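
As a hint for the more advanced approach, here is a minimal sketch of how table() and which.max() combine. It uses a hypothetical vector of colors rather than the “Data_Income” dataset; the names colors, counts and most_common are made up for this example.

```r
# Hypothetical categorical data
colors <- c("red", "blue", "blue", "yellow", "red", "blue")

# table() counts the observations in each category
counts <- table(colors)
counts

# which.max() returns the position of the largest count, labelled with the category name
which.max(counts)

# names() keeps only that label
most_common <- names(which.max(counts))
most_common    # "blue"
```

If two categories are tied for the largest count, which.max() reports only the first one.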