POLISCI 3325G - Data Science for Political Science

Evelyne Brie

Winter 2023

Inputting and Manipulating Data

In this lab, we (1) discuss the basics of how to use R, (2) input datasets into R, (3) explore types and classes of objects.

Relevant functions: install.packages(), library(), read.csv(), dim(), colnames(), head(), class()

1. Understanding the Basics

 

1.1 Running your code using command + enter

 

If you type code directly into the console, it is executed as soon as you press enter. However, code is typically written in a script and sent to the console by placing your cursor on the line you want to run and pressing command + enter (Ctrl + Enter on Windows and Linux). You can also click the Run icon in RStudio, but this is less efficient.

 

answer <- 1+1 # Creating an object called answer 

answer # Displaying the value of our "answer" object
## [1] 2

 

The two lines of code above are a typical example of what can be written within a script. The line beginning with ##, here ## [1] 2, is the output that appears in your console when you run them.

 

 

1.2 Writing a comment using #

 

Writing comments is an essential part of being a good data scientist. Not only does it make it easier for others to understand your code, but it also allows you to remember why you used specific functions. Characters following a # in your script are considered to be comments, and won’t be executed by R when you run your code.

 

# This is a comment

## This is also a comment

print("Hello World") 
## [1] "Hello World"

 

You can use as many # as you’d like to make your script more readable. Please note that in RStudio, a comment line that ends with four or more #, - or = characters creates a chunk (or code subsection) that you can collapse and jump to. This technique can be useful when working with longer scripts.
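As an illustration, here is a minimal sketch of a short script organized with comments and section headers. The section titles are invented for this example, and the collapsible sections only appear when the script is opened in RStudio.

```r
##### Section 1: Creating objects #####

answer <- 1 + 1 # A comment can also follow code on the same line

##### Section 2: Displaying objects #####

answer
## [1] 2
```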

 

 

1.3 Installing a package using install.packages()

 

R is an open-source programming language, which means that anyone with sufficient coding abilities can contribute to the R community by creating their own package. These packages work like “extension packs” for the basic functions that are already included when you first download R. install.packages() is one of these basic functions: it allows you to download any package listed on CRAN, the Comprehensive R Archive Network (https://cran.r-project.org/).

 

# install.packages("tidyverse") 

# Tidyverse is a collection of R packages designed for data science. 
# All packages share an underlying design philosophy, grammar, and data structure. 
# Please install this package, as it will be useful in this class. 
# For more information, see: https://www.tidyverse.org/packages/

 

Please note that R is sensitive to a variety of syntax elements. For instance, if you tried to run install.packages(tidyverse) (i.e. without quotation marks around the package name), an error message would appear, because R would look for an object called tidyverse. Likewise, R is case sensitive: if you create an object called “myName”, R won’t recognize “myname”.
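The case-sensitivity rule can be checked with a small sketch. The value stored in myName is a placeholder chosen for this example, and exists() is a base R function that reports whether an object with a given name has been created.

```r
myName <- "Alex" # Creating an object called myName (placeholder value)

exists("myName") # The object exists under this exact spelling
## [1] TRUE

exists("myname") # Different capitalization, so R finds nothing (in a fresh session)
## [1] FALSE
```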

 

 

1.4 Loading a package using library()

 

Once you install a package, it remains dormant on your computer. To activate the functions it contains, you need to load it into your current R session using the library() function. Installation is only needed once, but library() must be run again each time you start a new R session.

 

library(tidyverse) 
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
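If you are unsure whether a package is already installed, one possible check is sketched below. The helper name isInstalled and the made-up package name are assumptions for this example; requireNamespace() is a base R function that returns TRUE when the package can be found. The stats package ships with R, so it is used here as a package that is always present.

```r
# Hypothetical helper: TRUE if the package is installed, FALSE otherwise
isInstalled <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

isInstalled("stats") # Included with every R installation
## [1] TRUE

isInstalled("notARealPackage123") # Made-up name, so nothing is found
## [1] FALSE
```

The same check with "tidyverse" in place of "stats" tells you whether install.packages("tidyverse") still needs to be run.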

 

2. Importing Data

 

We begin by loading the dataset, which is available on OWL (Resources -> Exercises -> Week 1 -> Data_Poll.csv). Don’t forget to include the correct path to your file within the parentheses. Also, please don’t use the “import dataset” function in the environment window. The script alone should be sufficient to replicate your work.

We input the CSV file using the read.csv() command. Please note that R can also import a variety of other files (.dta, .sav, etc.) using other functions. We’ll discuss these additional functions later in this class.

# Loading the dataset
PollData <- read.csv("/Users/evelynebrie/Dropbox/Teaching/3325G_DataScience/Lab/Week 1/Data_Poll.csv")

The data frame should now appear within your environment in RStudio (upper right window).
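To see what read.csv() does without depending on a particular folder, here is a small self-contained sketch. It writes a two-row toy file to a temporary location and reads it back. The file name toy_poll.csv and the object names are invented for this example; it is not the course dataset.

```r
# Building a path to a temporary toy file (hypothetical file name)
toyPath <- file.path(tempdir(), "toy_poll.csv")

# Writing a header line and two rows of data to that file
writeLines(c("voteChoice,age", "red,28", "blue,18"), toyPath)

# Checking that the path is correct before loading the file
file.exists(toyPath)
## [1] TRUE

# Loading the toy file as a data frame
toyData <- read.csv(toyPath)
toyData
##   voteChoice age
## 1        red  28
## 2       blue  18
```

Running file.exists() on your own path is a quick way to diagnose a "cannot open file" error before calling read.csv().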

This dataset represents the results of a poll of residents of a fictional country. It contains the following variables: vote choice, age, sex, and three indicator variables (i.e. binary/dummy variables that take a value of 0 or 1) for the respondent’s educational level.

The first things we typically want to know about a dataset are: (1) how many rows and columns the data frame has, (2) what its columns are called and (3) what the data itself looks like.

# Looking at the dimensions of the dataset using dim()
dim(PollData)
## [1] 11  6
# Looking at the variable names using colnames()
colnames(PollData)
## [1] "voteChoice"  "age"         "female"      "educHS"      "educCollege"
## [6] "educGrad"
# Displaying the content of the first 5 rows using head()
head(PollData,5)
##   voteChoice age female educHS educCollege educGrad
## 1        red  28      1      0           1        0
## 2       blue  18      0      1           0        0
## 3       blue  65      0      0           1        0
## 4     yellow  40      1      0           0        1
## 5        red  44      1      0           0        1
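The functions above have close relatives that can be tried on a small stand-in. The sketch below rebuilds only the five rows displayed by head() as a data frame called toyPoll (a name invented here), so its dimensions differ from those of the full dataset.

```r
# Recreating the five displayed rows by hand (not the full dataset)
toyPoll <- data.frame(
  voteChoice  = c("red", "blue", "blue", "yellow", "red"),
  age         = c(28, 18, 65, 40, 44),
  female      = c(1, 0, 0, 1, 1),
  educHS      = c(0, 1, 0, 0, 0),
  educCollege = c(1, 0, 1, 0, 0),
  educGrad    = c(0, 0, 0, 1, 1)
)

dim(toyPoll) # Rows first, then columns
## [1] 5 6
nrow(toyPoll) # Number of rows only
## [1] 5
ncol(toyPoll) # Number of columns only
## [1] 6
toyPoll$age # Extracting a single column with $
## [1] 28 18 65 40 44
```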

 

3. Managing Data

 

We’ll see next week how to subset, merge and manipulate data. For now, let’s make sure that we understand the nature of our data: what types and classes of objects are we dealing with?

 

3.1 Types of Objects

 

Objects are entities with different attributes. A single value is called an element, a sequence of elements is called a vector, and a set of vectors of equal length, arranged as columns, is called a data frame.

 

(1) Element

 

(2) Vector

 

(3) Data frame
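The three types can be built by hand, as in the sketch below. The object names are invented for this example, and the values are taken from the first rows of the poll data shown earlier.

```r
myElement <- 28 # (1) A single element
myVector <- c(28, 18, 65) # (2) A vector combining several elements with c()
myDataFrame <- data.frame(age = myVector, female = c(1, 0, 0)) # (3) A data frame made of two vectors

length(myVector) # Number of elements in the vector
## [1] 3
myVector[2] # Extracting the second element of the vector
## [1] 18
myDataFrame$age # Extracting one column (a vector) from the data frame
## [1] 28 18 65
is.vector(myElement) # In R, a single element is stored as a vector of length 1
## [1] TRUE
```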

 

 

3.2 Classes of Objects

 

Objects can have a variety of classes in R. Today, we will focus on the three main classes: character, numeric and logical.

 

 

3.2.1 Logical object (= TRUE or FALSE)

 

myLogObject <- TRUE # Creating a logical object 
myLogObject # Displaying the object 
## [1] TRUE
class(myLogObject) # Displaying the class of the "myLogObject" object
## [1] "logical"

 

3.2.2 Numeric object (= numbers)

 

myNumObject <- 1 # Creating a numeric object 
myNumObject # Displaying the object 
## [1] 1
class(myNumObject) # Displaying the class of the "myNumObject" object
## [1] "numeric"

 

3.2.3 Character object (= anything within quotation marks)

 

myCharObject <- "1" # Creating a character object
myCharObject # Displaying the object 
## [1] "1"
class(myCharObject) # Displaying the class of the "myCharObject" object
## [1] "character"
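Classes matter because they determine what you can do with an object. The sketch below shows two consequences: a number stored as a character must be converted before doing arithmetic, and a vector that mixes classes is stored as character. The object names myConverted and mixedVector are invented for this example; as.numeric() is a base R function.

```r
myCharObject <- "1" # A number stored as text
myConverted <- as.numeric(myCharObject) # Converting the text to a number

class(myConverted)
## [1] "numeric"
myConverted + 1 # Arithmetic now works
## [1] 2

mixedVector <- c(1, "a", TRUE) # Mixing classes within one vector
class(mixedVector) # Every element has been converted to character
## [1] "character"
```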

 

 

Exercises

Step 1

Load in the Data_Poll.csv dataset available on OWL (Resources -> Exercises -> Week 1 -> Data_Poll.csv).

Relevant function: read.csv().

 

Step 2

How many rows and columns does the dataset have?

Relevant function: dim().

 

Step 3

What class is the age variable in the dataset you loaded?

Relevant function: class().

 

Step 4 (extra challenge)

Using code only: print out how many respondents are younger than 40 years old.

Relevant functions: length(), which().
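As a hint, here is a sketch of how which() and length() behave on an unrelated vector. The vector scores and its values are invented for this example and have nothing to do with the poll data.

```r
scores <- c(55, 72, 90, 48, 81) # Hypothetical vector of test scores

which(scores > 70) # Positions of the elements that satisfy the condition
## [1] 2 3 5

length(which(scores > 70)) # Counting how many positions were returned
## [1] 3
```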