## Run this code to manage packages and install them as needed.
# Install pacman
if (!require("pacman")) install.packages("pacman", repos = "http://cran.us.r-project.org")
# p_load() loads packages that are already installed; otherwise it installs them first, then loads them
pacman::p_load(tidyverse, haven)
You’ll download all of the lab files for this class as
.zip files from Canvas. To use them, you’ll have to unzip
them into their own directory (not a temporary directory that lets you
explore the files inside), and open the project file.
You should click on lab2-read-modify-export-data.Rproj
when you open this lab project. R project files automatically give you
access to the files and directories (folders) in the project directory.
This makes it much easier for us to load multiple data sets.
Apart from data that ships inside a package, the easiest way to work with external data is to store it in a delimited text file, e.g. comma-separated values (.csv) or tab-separated values (.tsv).
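To make the idea of a delimiter concrete, here is a minimal sketch. It assumes the readr package (part of the tidyverse) is installed, and the two tiny files and their values are made up for illustration; they are not part of the lab data.

```r
library(readr)

# Write the same made-up table twice: once comma-separated, once tab-separated
csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("id,price", "1,80", "2,125"), csv_path)
writeLines(c("id\tprice", "1\t80", "2\t125"), tsv_path)

# read_csv() splits each line on commas; read_tsv() splits on tabs
from_csv <- read_csv(csv_path, show_col_types = FALSE)
from_tsv <- read_tsv(tsv_path, show_col_types = FALSE)

# Both files describe the same table, so the columns hold the same values
identical(from_csv$price, from_tsv$price)
```

The only difference between the two files is the character that separates the values, which is why each format has its own reading function.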
In the same directory as this .Rmd file and the project
file, there’s a directory called data. Inside, there’s a
csv file called seattle_airbnb.csv. This contains data
about 100 Airbnb listings from Seattle.
These data come from Inside Airbnb, http://insideairbnb.com/. Go to the website and have a look at the About, Behind, and Get the Data pages. Use what you read to answer the questions below. Just type your answers below the questions.
Question 2.1: What was the context of this data’s production, i.e.:
The purpose of collecting the data is consistent with the goal of increasing public awareness and informed discussion of the consequences of short-term rentals. By analyzing and visualizing data from Airbnb listings, Inside Airbnb seeks to empower individuals, policymakers, and communities to make more informed decisions and engage in discussions about the impact of short-term rentals on housing markets and communities.
The package we’ll use today is called “tidyverse.” It’s a collection of packages for data manipulation, exploration, and visualization.
You can read more about the tidyverse here: https://www.tidyverse.org/
Following the instructions we learned in lab 1, install the tidyverse package, then load it.
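# Install tidyverse from a CRAN mirror if it is missing (tidyverse.org is a website, not a package repository)
if (!require("tidyverse")) install.packages("tidyverse", repos = "https://cloud.r-project.org")
# Load the package
library(tidyverse)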
To use data inside R, we first have to import, or read, that data into our environment. The chunk below reads the example data we’ll use for this module.
airbnb_data <- read_csv("data/seattle_airbnb.csv")
## Rows: 100 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): name, neighbourhood_group, neighbourhood
## dbl (3): id, price, number_of_reviews
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Notice that here we create a data frame object called “airbnb_data”.
Question 3.1: What is the role of each component in the above line of code?
<- : the operator used to assign a value to a
variable. It assigns the value on the right to the variable on
the left.
Let’s take a look at the data.
Question 4.1 Go ahead and type that into the console.
You can also look at the entire data set using RStudio’s built-in viewer. To do that, we use the function View(). We can run that command from the console, or from a code chunk:
#View(airbnb_data)
Question 4.2 Follow the instructions in the code block and run it.
# "un-comment" the line below this one, by removing the '#' and the space
# View(airbnb_data)
The head() function shows you the first six rows of a
data frame.
Question 4.3 Use the head function in the code chunk below to show the first rows of the airbnb_data.
head(airbnb_data)
## # A tibble: 6 × 6
## id name neighbourhood_group neighbourhood price number_of_reviews
## <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 2318 Casa Madrona … Central Area Madrona 296 16
## 2 4291 Sunrise in Se… Other neighborhoods Roosevelt 82 54
## 3 5682 Cozy Studio, … Delridge South Delrid… 48 428
## 4 6606 Fab, private … Other neighborhoods Wallingford 90 110
## 5 9419 Glorious sun … Other neighborhoods Georgetown 70 120
## 6 9460 Downtown/Conv… Downtown First Hill 80 366
Question 4.4: head shows the first 6 rows by
default. Change the following code to show the first 10
rows.
head(airbnb_data, n = 10)
## # A tibble: 10 × 6
## id name neighbourhood_group neighbourhood price number_of_reviews
## <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 2318 Casa Madrona… Central Area Madrona 296 16
## 2 4291 Sunrise in S… Other neighborhoods Roosevelt 82 54
## 3 5682 Cozy Studio,… Delridge South Delrid… 48 428
## 4 6606 Fab, private… Other neighborhoods Wallingford 90 110
## 5 9419 Glorious sun… Other neighborhoods Georgetown 70 120
## 6 9460 Downtown/Con… Downtown First Hill 80 366
## 7 9531 The Adorable… West Seattle Fairmount Pa… 165 34
## 8 9534 The Coolest … West Seattle Fairmount Pa… 125 32
## 9 9596 the down hom… Other neighborhoods Wallingford 120 61
## 10 9909 Luna Park Lo… West Seattle Fairmount Pa… 125 48
What if you want to look at the last several rows of a data frame instead of the first several rows?
Let’s read the documentation for head by typing
?head into the console.
Question 4.5 Based on what you found out, show the last 5 rows of airbnb_data
tail(airbnb_data, n = 5)
## # A tibble: 5 × 6
## id name neighbourhood_group neighbourhood price number_of_reviews
## <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 224763 Location! Sl… Downtown Belltown 149 72
## 2 225820 Family Frien… Other neighborhoods Phinney Ridge 90 69
## 3 226495 Fun apartmen… Ballard Whittier Hei… 170 72
## 4 226536 Serene Room … Magnolia Lawton Park 46 116
## 5 226677 Sunny Parisi… Other neighborhoods Georgetown 55 101
You can extract a single column by name using $. Type
the name of the data frame (airbnb_data) first, then $ and
finally the name of the column.
Question 4.6 Use $ to display the ‘price’
column
airbnb_data$price
## [1] 296 82 48 90 70 80 165 125 120 125 48 60 109 299 60
## [16] 40 60 91 40 105 85 145 165 199 89 79 99 189 107 157
## [31] 75 259 185 75 85 225 95 60 110 180 50 70 96 147 76
## [46] 50 50 70 46 110 47 75 157 150 250 120 130 135 110 79
## [61] 110 150 170 65 125 75 89 92 180 9300 55 110 650 80 75
## [76] 88 105 275 125 250 69 80 59 89 125 275 99 99 212 80
## [91] 84 200 90 285 75 149 90 170 46 55
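As a small sketch of what $ gives you back, the example below uses a toy data frame. The name listings and all of its values are hypothetical, not taken from seattle_airbnb.csv.

```r
# Toy data frame with made-up values
listings <- data.frame(
  name  = c("Studio", "Loft", "Cottage"),
  price = c(80, 125, 60)
)

# `$` returns the column as a plain vector
prices <- listings$price

# `[[ ]]` with the column name in quotes extracts the same vector
same_prices <- listings[["price"]]
identical(prices, same_prices)

# Because the result is a vector, it can go straight into summary functions
mean(prices)
max(prices)
```

The same pattern should work on the lab data, for example max(airbnb_data$price), which is one quick way to spot unusually large values in a column.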
Sometimes we have a large dataset but we only need to work with a subset of it. There are several ways to modify and subset a dataframe. Here, we will learn to subset with indices.
Question 5.1: Change the code below so that we load the haven
package and make a new object called twitter_survey by
using the read_sav() function to read the file
“data/Pew_Twitter_Study_for_release.sav”.
library(haven)
twitter_survey <- read_sav("data/Pew_Twitter_Study_for_release.sav")
To get a quick look at the names of the variables in this data frame, run the code below:
names(twitter_survey)
## [1] "CaseID" "tm_start" "tm_finish"
## [4] "duration" "qflag" "base_weight"
## [7] "weight" "TWITTER" "TWITTER_USE"
## [10] "TWITTER_HANDLE_Refused" "COMATTACH" "SOCTRUST2"
## [13] "SOCTRUST2_order_1" "SOCTRUST2_order_2" "VOTED"
## [16] "CONGPOST" "CONGPOST_order_1" "CONGPOST_order_2"
## [19] "CONGPOST_order_3" "TRUSTCONGa" "TRUSTCONGb"
## [22] "TRUSTCONGc" "TRUSTCONGd" "TRUSTCONGe"
## [25] "TRUSTCONGa_order" "TRUSTCONGb_order" "TRUSTCONGc_order"
## [28] "TRUSTCONGd_order" "TRUSTCONGe_order" "GSSTRUST2"
## [31] "GSSTRUST2_order_1" "GSSTRUST2_order_2" "GSSTRUST3"
## [34] "GSSTRUST3_order_1" "GSSTRUST3_order_2" "POL1DT"
## [37] "POL1DTSTR" "FRIENDT" "FRIENDT_order_1"
## [40] "FRIENDT_order_2" "FRIENDT_order_3" "FRIENDT_order_4"
## [43] "FRIENDT_order_5" "NEWSIMPT" "SNSSKEP"
## [46] "SNSSKEP_order_1" "SNSSKEP_order_2" "QBELIEF3"
## [49] "QBELIEF4" "QBELIEF3_order_1" "QBELIEF3_order_2"
## [52] "QBELIEF3_order_3" "QBELIEF3_order_6" "QBELIEF4_order_1"
## [55] "QBELIEF4_order_2" "QBELIEF4_order_3" "QBELIEF4_order_6"
## [58] "TWKNOW" "TWKNOW_order_1" "TWKNOW_order_2"
## [61] "TWKNOW_order_3" "TWAUTO" "TWAUTO_order_1"
## [64] "TWAUTO_order_2" "JOKE1_order" "CHOICE1_order"
## [67] "JOKE1" "CHOICE1" "JOKE1_order_1"
## [70] "JOKE1_order_2" "CHOICE1_order_1" "CHOICE1_order_2"
## [73] "THERMOa" "THERMOb" "THERMOc"
## [76] "THERMOd" "THERMOe" "THERMOf"
## [79] "THERMOg" "THERMOh" "THERMOa_order"
## [82] "THERMOb_order" "THERMOc_order" "THERMOd_order"
## [85] "THERMOe_order" "THERMOf_order" "THERMOg_order"
## [88] "THERMOh_order" "NATPROBSa" "NATPROBSb"
## [91] "NATPROBSc" "NATPROBSd" "NATPROBSe"
## [94] "NATPROBSf" "NATPROBSg" "NATPROBSh"
## [97] "NATPROBSi" "NATPROBSj" "NATPROBSa_order"
## [100] "NATPROBSb_order" "NATPROBSc_order" "NATPROBSd_order"
## [103] "NATPROBSe_order" "NATPROBSf_order" "NATPROBSg_order"
## [106] "NATPROBSh_order" "NATPROBSi_order" "NATPROBSj_order"
## [109] "FAIRTRT" "FAIRTRT_order_1" "FAIRTRT_order_2"
## [112] "FAIRTRT_order_3" "WOMENOPPS" "WOMENOPPS_order_1"
## [115] "WOMENOPPS_order_2" "IMMCULT2" "IMMCULT2_order_1"
## [118] "IMMCULT2_order_2" "ECONFAIR2" "ECONFAIR2_order_1"
## [121] "ECONFAIR2_order_2" "POLCRCT" "POLCRCT_order_1"
## [124] "POLCRCT_order_2" "PARTY" "PARTYLN"
## [127] "REPANTIP" "REPANTIP_order_1" "REPANTIP_order_2"
## [130] "DEMANTIP" "DEMANTIP_order_1" "DEMANTIP_order_2"
## [133] "CIVIC_ENG_ACTYRa" "CIVIC_ENG_ACTYRb" "CIVIC_ENG_ACTYRc"
## [136] "CIVIC_ENG_ACTYRa_order" "CIVIC_ENG_ACTYRb_order" "CIVIC_ENG_ACTYRc_order"
## [139] "DOV_IDEO" "DOV_ASSIGN" "IDEODEM"
## [142] "IDEOREP" "IDEOSELF" "VOL1"
## [145] "VOL2" "RELIG" "RELIG_text"
## [148] "CHR" "BORN" "RELIMP"
## [151] "TALKREL" "TALKREL_order_1" "TALKREL_order_2"
## [154] "TALKREL_order_3" "TALKREL_order_4" "TALKREL_order_5"
## [157] "REG" "POLTWEET" "PPAGE"
## [160] "ppagecat" "ppagect4" "PPEDUC"
## [163] "PPEDUCAT" "PPETHM" "PPGENDER"
## [166] "PPHHHEAD" "PPHHSIZE" "PPHOUSE"
## [169] "PPINCIMP" "PPMARIT" "PPMSACAT"
## [172] "PPREG4" "ppreg9" "PPRENT"
## [175] "PPSTATEN" "PPT01" "PPT25"
## [178] "PPT612" "PPT1317" "PPT18OV"
## [181] "PPWORK"
There are many variables but we only need three of them: CaseID,
TWITTER_USE, VOTED. To manipulate data frames in R, we can use the
[] notation to access the indices for the observations and
the variables. It is easiest to think of the data frame as a rectangle
of data where the rows are the observations and the columns are the
variables. The indices for a rectangle of data follow the RxC principle;
in other words, the first index is for Rows and the second index is for
Columns [R, C]. When we only want to subset variables (or columns) we
use the second index and leave the first index blank. Leaving an index
blank indicates that you want to keep all the elements in that
dimension.
twitter_subset <- twitter_survey[,c(1,9,15)]
If the variables we want are in consecutive columns, we can use the colon notation rather than list them using the c function.
twitter_subset2 <- twitter_survey[,1:4]
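The [R, C] rules can be tried out on a toy data frame. This is only a sketch: the object survey and its values are made up for illustration. It also shows selecting columns by name, which is an alternative to counting column positions.

```r
# Toy data frame with made-up values
survey <- data.frame(
  id    = 1:5,
  age   = c(34, 51, 27, 45, 62),
  voted = c("yes", "no", "yes", "yes", "no"),
  party = c("A", "B", "A", "C", "B")
)

# Blank row index: keep all rows, columns 1 and 3
cols_by_position <- survey[, c(1, 3)]

# Columns can also be picked by name, which keeps working if the column order changes
cols_by_name <- survey[, c("id", "voted")]

# Both indices: first 2 rows, columns 2 through 4
first_rows <- survey[1:2, 2:4]

# Blank column index: rows 4 and 5, all columns
last_rows <- survey[4:5, ]

# dim() reports the number of rows, then the number of columns
dim(first_rows)
```

Selecting by name is often easier to read back later, since c("id", "voted") says what is being kept while c(1, 3) requires looking up the column order.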
Question 5.2: Create a subset of the data frame twitter_survey, taking the first 100 rows and the variables PARTY and PPWORK. Assign this object the name “my_twitter_survey”.
my_twitter_survey <- twitter_survey[1:100,c("PARTY","PPWORK")]
Getting data out of R into a delimited file is very similar to getting it into R:
write_csv(twitter_subset, file = "twitter_subset.csv")
This saved the data we just modified into a file called
twitter_subset.csv in your working directory.
Exporting to a .csv drops R metadata, such as whether a
variable is a character or a factor (which we will learn about in later labs).
You can save objects (data frames, lists, etc.) in R formats to preserve
this.
.Rds format:
write_rds(old_object_name, "path.Rds")
.Rdata or .Rda format:
save(object1, object2, ..., file = "path.Rdata")
load("path.Rdata") without assignment operator
Question 6.1. Save object my_twitter_survey into an
.Rda file.
save(my_twitter_survey,file = "my_twitter_survey.Rda")
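To see the difference between these formats, here is a minimal round-trip sketch. It assumes the readr package is installed; the data frame pets and the temporary file paths are made up for illustration.

```r
library(readr)

# Toy data frame with a factor column (made-up values)
pets <- data.frame(
  id      = 1:3,
  species = factor(c("cat", "dog", "cat"))
)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".Rds")
rda_path <- tempfile(fileext = ".Rda")

# CSV round trip: the factor comes back as a character column
write_csv(pets, csv_path)
from_csv <- read_csv(csv_path, show_col_types = FALSE)
class(from_csv$species)

# Rds round trip: read_rds() returns the object, so assign it to a name
write_rds(pets, rds_path)
from_rds <- read_rds(rds_path)
class(from_rds$species)

# Rda round trip: load() restores the object under its original name,
# which is why no assignment operator is needed
save(pets, file = rda_path)
rm(pets)
load(rda_path)
class(pets$species)
```

An .Rds file holds a single object that you name when you read it back, while an .Rda file can hold several objects that return under the names they were saved with.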