Inputting and Manipulating Data
In this lab, we (1) discuss the basics of how to use R, (2) input datasets into R, and (3) explore types and classes of objects.
Relevant functions: install.packages(), library(), read.csv(), dim(), colnames(), head(), class()
1. Understanding the Basics
1.1 Running your code using command + enter
If you type code directly into the console, it is executed as soon as you press enter. However, code is typically written in your script and sent to the console by pressing command + enter (Ctrl + Enter on Windows) with the cursor placed on the line you want to run. You can also use the run icon in the RStudio environment, but this is less efficient.
answer <- 1+1 # Creating an object called answer
answer # Displaying the value of our "answer" object
## [1] 2
The first two lines above are a typical example of what can be
written within a script. The line ## [1] 2 is the output that appears
in your console when you run them; the console itself prints [1] 2,
and the leading ## only marks output in this handout.
1.2 Writing a comment using #
Writing comments is an essential part of being a good data scientist.
Not only does it make it easier for others to understand your code, but
it also allows you to remember why you used specific functions.
Characters following a # in your script are considered to
be comments, and won’t be executed by R when you run your code.
# This is a comment
## This is also a comment
print("Hello World")
## [1] "Hello World"
You can use as many # as you’d like to make your script
more readable. Please note that in RStudio, a comment line that ends
with four or more # (for example, # Data cleaning ####) creates a
code section that you can fold and jump to. This
technique can be useful when working with longer scripts.
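As an illustration, the sketch below splits a short script into two sections. It assumes RStudio's convention that a comment line ending in several # characters starts a code section; the section titles and object names are made up for this example.

```r
# Setup ####
x <- c(2, 4, 6)      # a small made-up numeric vector

# Analysis ####
mean_x <- mean(x)    # average of the three values
mean_x               # displays 4
```

R itself treats both headers as ordinary comments, so the script runs the same way with or without them.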
1.3 Installing a package using install.packages()
R is an open-source programming language, which means that anyone
with sufficient coding abilities can contribute to the R community by
creating their own package. These packages work like “extension packs”
for the basic functions which are already included when you first
download R. install.packages() is one of these basic
functions: it allows you to download any package listed on CRAN, the
Comprehensive R Archive Network (https://cran.r-project.org/).
# install.packages("tidyverse")
# Tidyverse is a collection of R packages designed for data science.
# All packages share an underlying design philosophy, grammar, and data structure.
# Please install this package, as it will be useful in this class.
# For more information, see: https://www.tidyverse.org/packages/
Please note that R is sensitive to a variety of syntax elements. For
instance, if you tried to run install.packages(tidyverse)
(i.e. without quotation marks), an error message would appear.
Likewise, R is case sensitive: if you create an object called
“myName”, R won’t recognize “myname”.
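The following minimal sketch shows case sensitivity in practice. The object name and its value are made up; exists() and tryCatch() are base R functions used here only to check names without stopping the script.

```r
myName <- "Alex"       # made-up value for illustration

exists("myName")       # TRUE: this exact name was created above
exists("myname")       # FALSE: the lowercase "n" makes it a different name

# Asking for the misspelled object raises an error; tryCatch() captures
# the error message so that the rest of the script keeps running
tryCatch(myname, error = function(e) conditionMessage(e))
```

When an "object not found" error appears, checking the capitalization of the name is a good first step.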
1.4 Loading a package using library()
Once you install a package, it remains dormant until you load it. In
order to activate the functions it contains, you
need to load the package with the library()
function at the start of each R session.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
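To see what loading changes, here is a small sketch using tools, a package that ships with R but is not loaded automatically. The choice of tools and of its file_ext() function is ours, for illustration only; the same pattern applies to any installed package.

```r
# tools comes with R, so nothing needs to be installed for this example
tools::file_ext("Data_Poll.csv")   # package::function form works without library()

library(tools)                     # load the package for this session
file_ext("Data_Poll.csv")          # the short name now works too: "csv"

"package:tools" %in% search()      # TRUE once the package is loaded
```

The package::function form is also one way to choose between two functions that share a name, such as dplyr::filter() and stats::filter() in the conflicts listed above.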
2. Importing Data
We begin by loading the dataset, which is available on OWL (Resources -> Exercises -> Week 1 -> Data_Poll.csv). Don’t forget to include the correct path to your file within the parentheses. Also, please don’t use the “import dataset” function in the environment window. The script alone should be sufficient to replicate your work.
We input the CSV file using the read.csv() command.
Please note that R can also import a variety of other files (.dta, .sav,
etc.) using other functions. We’ll discuss these additional functions
later in this class.
# Loading the dataset
PollData <- read.csv("/Users/evelynebrie/Dropbox/Teaching/3325G_DataScience/Lab/Week 1/Data_Poll.csv")
The data frame should now appear within your environment in RStudio (upper right window).
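Because the path above only exists on one computer, here is a self-contained sketch of the same step. It writes a tiny made-up CSV to a temporary file and reads it back; the file contents and the object names tmpPath and toyPoll are hypothetical stand-ins for Data_Poll.csv and PollData.

```r
# Create a small made-up CSV file in a temporary location
tmpPath <- tempfile(fileext = ".csv")
writeLines(c("voteChoice,age",
             "red,28",
             "blue,18"), tmpPath)

file.exists(tmpPath)           # TRUE: worth checking before reading a file

toyPoll <- read.csv(tmpPath)   # same function as above, with our temporary path
toyPoll                        # 2 rows, with columns voteChoice and age
```

If file.exists() returns FALSE for your own path, the path is misspelled or points to the wrong folder. The rest of this lab continues with the full PollData object.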
This dataset represents the results of a poll conducted on residents of a fictional country. It contains the following variables: vote choice, age, sex (coded as a female indicator), and three indicator variables (i.e. binary/dummy variables that take a value of 0 or 1) for the respondent’s educational level.
The first things we typically want to know about a dataset are: (1) how many rows and columns the data frame has, (2) what its columns are called, and (3) what the data itself looks like.
# Looking at the dimensions of the dataset using dim()
dim(PollData)
## [1] 11 6
# Looking at the variable names using colnames()
colnames(PollData)
## [1] "voteChoice" "age" "female" "educHS" "educCollege"
## [6] "educGrad"
# Displaying the content of the first 5 rows using head()
head(PollData,5)
##   voteChoice age female educHS educCollege educGrad
## 1        red  28      1      0           1        0
## 2       blue  18      0      1           0        0
## 3       blue  65      0      0           1        0
## 4     yellow  40      1      0           0        1
## 5        red  44      1      0           0        1
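The same three checks can be tried on a small data frame built by hand. In this sketch, toyData and its values are made up; nrow() and ncol() are base R functions that return the two numbers reported by dim() one at a time.

```r
# A made-up data frame with 3 rows and 2 columns
toyData <- data.frame(
  voteChoice = c("red", "blue", "blue"),
  age        = c(28, 18, 65)
)

dim(toyData)        # 3 2: rows first, then columns
nrow(toyData)       # 3
ncol(toyData)       # 2
colnames(toyData)   # "voteChoice" "age"
head(toyData, 2)    # displays only the first 2 rows
```

Reading dim() as "rows, then columns" helps when interpreting output such as 11 6 above.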
3. Managing Data
We’ll see next week how to subset, merge and manipulate data. For now, let’s make sure that we understand the nature of our data—in other words, what types and classes of objects are we dealing with?
3.1 Types of Objects
Objects are entities with different attributes. A single value is called an element, a sequence of elements is called a vector, and a set of equal-length vectors arranged as columns is called a data frame.
(1) Element
(2) Vector
(3) Data frame
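The three types can be sketched in a few lines. All object names and values below are made up for illustration; c() and data.frame() are the base R functions that build vectors and data frames.

```r
myElement   <- 28                    # (1) a single element
myVector    <- c(28, 18, 65)         # (2) a vector: several elements combined with c()
myDataFrame <- data.frame(           # (3) a data frame: vectors stored as columns
  age    = myVector,
  female = c(1, 0, 0)
)

length(myElement)   # 1
length(myVector)    # 3
myVector[2]         # 18: the second element of the vector
myDataFrame$age     # 28 18 65: a column of a data frame is itself a vector
```

The $ operator used in the last line extracts one column from a data frame by name.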
3.2 Classes of Objects
Objects can have a variety of classes in R. Today, we will focus on the three main classes: character, numeric and logical.
3.2.1 Logical object (= TRUE or FALSE)
myLogObject <- TRUE # Creating a logical object
myLogObject # Displaying the object
## [1] TRUE
class(myLogObject) # Displaying the class of the "myLogObject" object
## [1] "logical"
3.2.2 Numeric object (= numbers)
myNumObject <- 1 # Creating a numeric object
myNumObject # Displaying the object
## [1] 1
class(myNumObject) # Displaying the class of the "myNumObject" object
## [1] "numeric"
3.2.3 Character object (= anything within quotation marks)
myCharObject <- "1" # Creating a character object
myCharObject # Displaying the object
## [1] "1"
class(myCharObject) # Displaying the class of the "myCharObject" object
## [1] "character"
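Classes matter because they determine what R can do with an object. The sketch below shows that the character "1" must be converted before it can be used in arithmetic, and that a data frame and its columns have classes of their own. The names myConverted and toyData are made up; as.numeric() is the base R conversion function.

```r
myCharObject <- "1"                        # text, not a number
# myCharObject + 1 would fail: arithmetic needs numbers, not text

myConverted <- as.numeric(myCharObject)    # convert the text "1" to the number 1
class(myConverted)                         # "numeric"
myConverted + 1                            # 2

class(1 == 2)                              # "logical": comparisons return TRUE or FALSE

toyData <- data.frame(voteChoice = c("red", "blue"),
                      stringsAsFactors = FALSE)   # keep text as character
class(toyData)                             # "data.frame"
class(toyData$voteChoice)                  # "character"
```

Checking class() on each column right after importing a dataset is a quick way to confirm that numbers were read as numbers.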
Exercises
Step 1
Load in the Data_Poll.csv dataset available on OWL (Resources -> Exercises -> Week 1 -> Data_Poll.csv).
Relevant function: read.csv().
Step 2
How many rows and columns does the data set have?
Relevant function: dim().
Step 3
What class is the age variable in the data set you’ve inputted?
Relevant function: class().
Step 4 (extra challenge)
Using code only: print out how many respondents are younger than 40 years old.
Relevant functions: length(), which().