## Run this code to manage packages and install them as needed.
# Install pacman
if (!require("pacman")) install.packages("pacman", repos = "http://cran.us.r-project.org")
# p_load() loads each package if it is installed; otherwise it installs the package and then loads it
pacman::p_load(tidyverse, haven)

Agenda

Open a project file in R
Describe the sources of data
Read data into R
Look at data
Subset a dataframe
Export data

1: Opening this file

You’ll download all of the lab files for this class as .zip files from Canvas. To use them, you’ll have to unzip them into their own directory (not a temporary directory that only lets you explore the files inside), and then open the project file.

You should click on lab2-read-modify-export-data.Rproj when you open this lab project. R project files automatically give you access to the files and directories (folders) in the project directory. This makes it much easier for us to load multiple data sets.

2: Describe data sources

Besides loading data that ships with a package, the easiest way to work with external data is to store it in a delimited text file, e.g. comma-separated values (.csv) or tab-separated values (.tsv).
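As a minimal sketch of what “delimited” means, the chunk below reads the same two rows once as comma-separated text and once as tab-separated text. The listing names and prices are made up for illustration and are not taken from the lab files. Wrapping a string in I() tells readr to treat it as literal data rather than as a file path.

```r
library(readr)

# Made-up example data (not from the lab files)
csv_text <- "name,price\nCozy Studio,48\nSunny Room,55\n"
tsv_text <- "name\tprice\nCozy Studio\t48\nSunny Room\t55\n"

# I() marks the string as literal data instead of a file path
from_csv <- read_csv(I(csv_text), show_col_types = FALSE)
from_tsv <- read_tsv(I(tsv_text), show_col_types = FALSE)

# Only the delimiter differs; the columns hold the same values
from_csv
from_tsv
identical(from_csv$price, from_tsv$price)
```

The only difference between the two formats is the character that separates the values, which is why readr offers a matching reader for each one.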

In the same directory as this .Rmd file and the project file, there’s a directory called data. Inside, there’s a csv file called seattle_airbnb.csv. This contains data about 100 Airbnb listings from Seattle.

These data come from Inside Airbnb, http://insideairbnb.com/. Go to the website and have a look at the About, Behind, and Get the Data pages. Use what you read to answer the questions below. Just type your answers below the questions.

Question 2.1: What was the context of this data’s production, i.e.:

Who created this data set? The Inside Airbnb project, by Murray Cox.
How did they do it? Inside Airbnb collects and analyzes data from the Airbnb platform to provide insights into the short-term rental market in various cities.
Where did the data come from? The data used by Inside Airbnb comes from scraping publicly available information on the Airbnb platform. Airbnb listings, including property details, pricing, availability, and host information, are typically accessible to the public on the Airbnb website. Inside Airbnb uses automated scripts to collect this data in a systematic way.
Is this active or passive data collection? The data collection method used by Inside Airbnb can be considered a form of passive data collection.

Question 2.2: What was the original purpose of this data? Is that the same as the purpose of the person who collected it? As our purposes?

The original purpose of the data collected by Inside Airbnb was to provide transparency and insight into the impact of short-term rentals on various cities, particularly those offered by Airbnb. The project aims to shed light on issues such as housing affordability, gentrification and the widespread impact of short-term rental activity on local communities.

The purpose of collecting the data is consistent with the goal of increasing public awareness and informed discussion of the consequences of short-term rentals. By analyzing and visualizing data from Airbnb listings, Inside Airbnb seeks to empower individuals, policymakers, and communities to make more informed decisions and engage in discussions about the impact of short-term rentals on housing markets and communities.

3: Set up and read data into R

The package we’ll use today is called “tidyverse.” It’s a collection of packages for data manipulation, exploration, and visualization.

You can read more about the tidyverse here: https://www.tidyverse.org/

Following the instructions we learned in lab 1, install the tidyverse package and then load it.

# repos must point to a CRAN mirror, not to the tidyverse website
if (!require("tidyverse")) install.packages("tidyverse", repos = "https://cloud.r-project.org")
library(tidyverse)

To use data inside R, we first have to import, or read, that data into our environment. The chunk below reads the example data we’ll use for this module.

airbnb_data <- read_csv("data/seattle_airbnb.csv")

## Rows: 100 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): name, neighbourhood_group, neighbourhood
## dbl (3): id, price, number_of_reviews
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Notice that here we create a data frame object called “airbnb_data”.
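The message printed by read_csv() suggests two options: silence it, or spell out the column types yourself. The chunk below is a hedged sketch of both. So that it runs on its own, it writes a tiny temporary CSV that stands in for data/seattle_airbnb.csv; the two rows use id, neighbourhood, and price values that appear in this lab’s listings, and the temporary file itself is an assumption made for illustration.

```r
library(readr)

# Write a tiny stand-in file to a temporary location (assumption: this
# replaces data/seattle_airbnb.csv so the chunk runs anywhere)
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,neighbourhood,price",
             "2318,Madrona,296",
             "4291,Roosevelt,82"), tmp)

# Option 1: let readr guess the types, but silence the message
quiet <- read_csv(tmp, show_col_types = FALSE)

# Option 2: state the type of every column explicitly
typed <- read_csv(tmp, col_types = cols(
  id = col_character(),
  neighbourhood = col_character(),
  price = col_double()
))

# id is stored as a number in the first version and as text in the second
class(quiet$id)
class(typed$id)
```

Reading id as text can be a reasonable choice because it is an identifier rather than a quantity, but either version works for this lab.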

Question 3.1: What is the role of each component in the above line of code?

airbnb_data: the variable name we used for the data frame
<- : the operator used to assign a value to a variable. It is a way of assigning a value on the right to a variable on the left
read_csv(): reads a data file in CSV (comma-separated values) format. This function provides a convenient and flexible way to import data frames
“data/seattle_airbnb.csv”: the path to the file to read, here the CSV file named seattle_airbnb.csv inside the data directory

4: Looking at the data

Let’s take a look at the data.

Question 4.1 Go ahead and type the name of the data frame, airbnb_data, into the console.

You can also look at the entire data set using RStudio’s built-in viewer. To use that, we use the function ‘View().’ We can run that command from the console, or from a code chunk:

#View(airbnb_data)

Question 4.2 Follow the instructions in the code block and run it.

# "un-comment" the line below this one, by removing the '#' and the space
# View(airbnb_data)

The head() function shows you the first six rows of a data frame.

Question 4.3 Use the head function in the code chunk below to show the first rows of the airbnb_data.

head(airbnb_data)

## # A tibble: 6 × 6
##      id name           neighbourhood_group neighbourhood price number_of_reviews
##   <dbl> <chr>          <chr>               <chr>         <dbl>             <dbl>
## 1  2318 Casa Madrona … Central Area        Madrona         296                16
## 2  4291 Sunrise in Se… Other neighborhoods Roosevelt        82                54
## 3  5682 Cozy Studio, … Delridge            South Delrid…    48               428
## 4  6606 Fab, private … Other neighborhoods Wallingford      90               110
## 5  9419 Glorious sun … Other neighborhoods Georgetown       70               120
## 6  9460 Downtown/Conv… Downtown            First Hill       80               366

Question 4.4: head shows the first 6 rows by default. Change the following code to show the first 10 rows.

head(airbnb_data, n = 10)

## # A tibble: 10 × 6
##       id name          neighbourhood_group neighbourhood price number_of_reviews
##    <dbl> <chr>         <chr>               <chr>         <dbl>             <dbl>
##  1  2318 Casa Madrona… Central Area        Madrona         296                16
##  2  4291 Sunrise in S… Other neighborhoods Roosevelt        82                54
##  3  5682 Cozy Studio,… Delridge            South Delrid…    48               428
##  4  6606 Fab, private… Other neighborhoods Wallingford      90               110
##  5  9419 Glorious sun… Other neighborhoods Georgetown       70               120
##  6  9460 Downtown/Con… Downtown            First Hill       80               366
##  7  9531 The Adorable… West Seattle        Fairmount Pa…   165                34
##  8  9534 The Coolest … West Seattle        Fairmount Pa…   125                32
##  9  9596 the down hom… Other neighborhoods Wallingford     120                61
## 10  9909 Luna Park Lo… West Seattle        Fairmount Pa…   125                48

What if you want to look at the last several rows of a data frame instead of the first several rows?

Let’s read the documentation for head by typing ?head into the console.

Question 4.5 Based on what you found out, show the last 5 rows of airbnb_data

tail(airbnb_data, n = 5)

## # A tibble: 5 × 6
##       id name          neighbourhood_group neighbourhood price number_of_reviews
##    <dbl> <chr>         <chr>               <chr>         <dbl>             <dbl>
## 1 224763 Location! Sl… Downtown            Belltown        149                72
## 2 225820 Family Frien… Other neighborhoods Phinney Ridge    90                69
## 3 226495 Fun apartmen… Ballard             Whittier Hei…   170                72
## 4 226536 Serene Room … Magnolia            Lawton Park      46               116
## 5 226677 Sunny Parisi… Other neighborhoods Georgetown       55               101
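head() and tail() share the same n argument, and the documentation also describes a negative n, which means “everything except”. The chunk below is a small sketch on a made-up data frame (the values are hypothetical, not the lab data).

```r
# A small made-up data frame with 8 rows (hypothetical values)
listings <- data.frame(id = 1:8,
                       price = c(48, 55, 60, 70, 80, 90, 125, 296))

head(listings, n = 3)    # first 3 rows
tail(listings, n = 2)    # last 2 rows
head(listings, n = -2)   # everything except the last 2 rows
```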

You can extract a single column by name using $. Type the name of the dataframe (airbnb_data) first, then $ and finally the name of the column.

Question 4.6 Use $ to display the ‘price’ column

airbnb_data$price

##   [1]  296   82   48   90   70   80  165  125  120  125   48   60  109  299   60
##  [16]   40   60   91   40  105   85  145  165  199   89   79   99  189  107  157
##  [31]   75  259  185   75   85  225   95   60  110  180   50   70   96  147   76
##  [46]   50   50   70   46  110   47   75  157  150  250  120  130  135  110   79
##  [61]  110  150  170   65  125   75   89   92  180 9300   55  110  650   80   75
##  [76]   88  105  275  125  250   69   80   59   89  125  275   99   99  212   80
##  [91]   84  200   90  285   75  149   90  170   46   55
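Because $ returns a plain vector, you can pass the result straight to summary functions. The printed prices include one very large value (9300), so the sketch below uses a handful of prices copied from that output to show how such a value can be located. The six-row data frame is a stand-in built for illustration, not the full data set.

```r
# A few prices copied from the printed column, including the 9300
listings <- data.frame(price = c(92, 180, 9300, 55, 110, 650))

prices <- listings$price                 # $ returns a plain numeric vector
identical(prices, listings[["price"]])   # [[ ]] is an equivalent way to extract

max(prices)         # the largest value
which.max(prices)   # its position within this small vector
median(prices)      # less affected by one extreme value than the mean
mean(prices)
```

Comparing the median with the mean is a quick way to notice that a single listing is pulling the average upward.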

5: Subsetting data

Sometimes we have a large dataset but we only need to work with a subset of it. There are several ways to modify and subset a dataframe. Here, we will learn to subset with indices.

Question 5.1: Change the code below so that we load the haven package and make a new object called twitter_survey by using the read_sav() function to read the file “data/Pew_Twitter_Study_for_release.sav”.

library(haven)
twitter_survey <- read_sav("data/Pew_Twitter_Study_for_release.sav")

To get a quick look at the names of the variables in this data frame, run the code below:

names(twitter_survey)

##   [1] "CaseID"                 "tm_start"               "tm_finish"             
##   [4] "duration"               "qflag"                  "base_weight"           
##   [7] "weight"                 "TWITTER"                "TWITTER_USE"           
##  [10] "TWITTER_HANDLE_Refused" "COMATTACH"              "SOCTRUST2"             
##  [13] "SOCTRUST2_order_1"      "SOCTRUST2_order_2"      "VOTED"                 
##  [16] "CONGPOST"               "CONGPOST_order_1"       "CONGPOST_order_2"      
##  [19] "CONGPOST_order_3"       "TRUSTCONGa"             "TRUSTCONGb"            
##  [22] "TRUSTCONGc"             "TRUSTCONGd"             "TRUSTCONGe"            
##  [25] "TRUSTCONGa_order"       "TRUSTCONGb_order"       "TRUSTCONGc_order"      
##  [28] "TRUSTCONGd_order"       "TRUSTCONGe_order"       "GSSTRUST2"             
##  [31] "GSSTRUST2_order_1"      "GSSTRUST2_order_2"      "GSSTRUST3"             
##  [34] "GSSTRUST3_order_1"      "GSSTRUST3_order_2"      "POL1DT"                
##  [37] "POL1DTSTR"              "FRIENDT"                "FRIENDT_order_1"       
##  [40] "FRIENDT_order_2"        "FRIENDT_order_3"        "FRIENDT_order_4"       
##  [43] "FRIENDT_order_5"        "NEWSIMPT"               "SNSSKEP"               
##  [46] "SNSSKEP_order_1"        "SNSSKEP_order_2"        "QBELIEF3"              
##  [49] "QBELIEF4"               "QBELIEF3_order_1"       "QBELIEF3_order_2"      
##  [52] "QBELIEF3_order_3"       "QBELIEF3_order_6"       "QBELIEF4_order_1"      
##  [55] "QBELIEF4_order_2"       "QBELIEF4_order_3"       "QBELIEF4_order_6"      
##  [58] "TWKNOW"                 "TWKNOW_order_1"         "TWKNOW_order_2"        
##  [61] "TWKNOW_order_3"         "TWAUTO"                 "TWAUTO_order_1"        
##  [64] "TWAUTO_order_2"         "JOKE1_order"            "CHOICE1_order"         
##  [67] "JOKE1"                  "CHOICE1"                "JOKE1_order_1"         
##  [70] "JOKE1_order_2"          "CHOICE1_order_1"        "CHOICE1_order_2"       
##  [73] "THERMOa"                "THERMOb"                "THERMOc"               
##  [76] "THERMOd"                "THERMOe"                "THERMOf"               
##  [79] "THERMOg"                "THERMOh"                "THERMOa_order"         
##  [82] "THERMOb_order"          "THERMOc_order"          "THERMOd_order"         
##  [85] "THERMOe_order"          "THERMOf_order"          "THERMOg_order"         
##  [88] "THERMOh_order"          "NATPROBSa"              "NATPROBSb"             
##  [91] "NATPROBSc"              "NATPROBSd"              "NATPROBSe"             
##  [94] "NATPROBSf"              "NATPROBSg"              "NATPROBSh"             
##  [97] "NATPROBSi"              "NATPROBSj"              "NATPROBSa_order"       
## [100] "NATPROBSb_order"        "NATPROBSc_order"        "NATPROBSd_order"       
## [103] "NATPROBSe_order"        "NATPROBSf_order"        "NATPROBSg_order"       
## [106] "NATPROBSh_order"        "NATPROBSi_order"        "NATPROBSj_order"       
## [109] "FAIRTRT"                "FAIRTRT_order_1"        "FAIRTRT_order_2"       
## [112] "FAIRTRT_order_3"        "WOMENOPPS"              "WOMENOPPS_order_1"     
## [115] "WOMENOPPS_order_2"      "IMMCULT2"               "IMMCULT2_order_1"      
## [118] "IMMCULT2_order_2"       "ECONFAIR2"              "ECONFAIR2_order_1"     
## [121] "ECONFAIR2_order_2"      "POLCRCT"                "POLCRCT_order_1"       
## [124] "POLCRCT_order_2"        "PARTY"                  "PARTYLN"               
## [127] "REPANTIP"               "REPANTIP_order_1"       "REPANTIP_order_2"      
## [130] "DEMANTIP"               "DEMANTIP_order_1"       "DEMANTIP_order_2"      
## [133] "CIVIC_ENG_ACTYRa"       "CIVIC_ENG_ACTYRb"       "CIVIC_ENG_ACTYRc"      
## [136] "CIVIC_ENG_ACTYRa_order" "CIVIC_ENG_ACTYRb_order" "CIVIC_ENG_ACTYRc_order"
## [139] "DOV_IDEO"               "DOV_ASSIGN"             "IDEODEM"               
## [142] "IDEOREP"                "IDEOSELF"               "VOL1"                  
## [145] "VOL2"                   "RELIG"                  "RELIG_text"            
## [148] "CHR"                    "BORN"                   "RELIMP"                
## [151] "TALKREL"                "TALKREL_order_1"        "TALKREL_order_2"       
## [154] "TALKREL_order_3"        "TALKREL_order_4"        "TALKREL_order_5"       
## [157] "REG"                    "POLTWEET"               "PPAGE"                 
## [160] "ppagecat"               "ppagect4"               "PPEDUC"                
## [163] "PPEDUCAT"               "PPETHM"                 "PPGENDER"              
## [166] "PPHHHEAD"               "PPHHSIZE"               "PPHOUSE"               
## [169] "PPINCIMP"               "PPMARIT"                "PPMSACAT"              
## [172] "PPREG4"                 "ppreg9"                 "PPRENT"                
## [175] "PPSTATEN"               "PPT01"                  "PPT25"                 
## [178] "PPT612"                 "PPT1317"                "PPT18OV"               
## [181] "PPWORK"

There are many variables but we only need three of them: CaseID, TWITTER_USE, VOTED. To manipulate data frames in R, we can use the [] notation to access the indices for the observations and the variables. It is easiest to think of the data frame as a rectangle of data where the rows are the observations and the columns are the variables. The indices for a rectangle of data follow the RxC principle; in other words, the first index is for Rows and the second index is for Columns [R, C]. When we only want to subset variables (or columns) we use the second index and leave the first index blank. Leaving an index blank indicates that you want to keep all the elements in that dimension.

twitter_subset <- twitter_survey[,c(1,9,15)]

If the variables we want are in consecutive columns, we can use the colon notation rather than list them using the c function.

twitter_subset2 <- twitter_survey[,1:4]
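The [rows, columns] pattern can be sketched on a much smaller data frame. The one below is made up to stand in for twitter_survey: it borrows four of its column names, but the values are hypothetical.

```r
# Made-up data frame standing in for twitter_survey (hypothetical values)
survey <- data.frame(
  CaseID = 1:5,
  duration = c(10, 12, 9, 15, 11),
  TWITTER_USE = c(1, 2, 1, 3, 2),
  VOTED = c(1, 1, 2, 1, 2)
)

survey[, c(1, 3, 4)]                            # all rows, columns by position
survey[, c("CaseID", "VOTED")]                  # all rows, columns by name
survey[1:2, ]                                   # first two rows, all columns
survey[survey$VOTED == 1, c("CaseID", "VOTED")] # rows that meet a condition

# Look up column positions from names instead of counting by hand
match(c("CaseID", "TWITTER_USE", "VOTED"), names(survey))
```

Selecting by name is usually safer than selecting by position, because the code keeps working if the column order changes.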

Question 5.2: Create a subset of the dataframe twitter_survey, taking the first 100 rows and the variables PARTY and PPWORK. Assign this object the name “my_twitter_survey”.

my_twitter_survey <- twitter_survey[1:100,c("PARTY","PPWORK")]

6: Export data

Getting data out of R into a delimited file is very similar to getting it into R:

write_csv(twitter_subset, file = "twitter_subset.csv")

This saved the data we just modified into a file called twitter_subset.csv in your working directory.
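One way to check an export is to read the file straight back in. The sketch below does this with a made-up data frame (hypothetical values) and writes to a temporary folder, so it leaves the project directory untouched.

```r
library(readr)

# Made-up data frame (hypothetical values)
small <- data.frame(CaseID = 1:3, VOTED = c(1, 2, 1))

# Write to a temporary folder so the sketch leaves the project untouched
out_path <- file.path(tempdir(), "small_subset.csv")
write_csv(small, file = out_path)

file.exists(out_path)   # confirm the file was created
getwd()                 # the folder a bare file name would be written to

# Read the file back in to check the contents
check <- read_csv(out_path, show_col_types = FALSE)
check
```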

Exporting to a .csv drops R metadata, such as whether a variable is a character or a factor (which we will learn about in later labs). You can save objects (data frames, lists, etc.) in R formats to preserve this.

.Rds format:
- Used for single objects, doesn’t save the original object name
- Save: write_rds(old_object_name, "path.Rds")
- Load: new_object_name <- read_rds("path.Rds")
.Rdata or .Rda format:
- Used for saving multiple objects where the original object names are preserved
- Save: save(object1, object2, ..., file = "path.Rdata")
- Load: load("path.Rdata") without assignment operator
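The difference between the formats can be sketched with a small data frame that contains a factor. The data and the temporary file paths below are made up for illustration.

```r
library(readr)

# Made-up data with a factor column (hypothetical values)
votes <- data.frame(CaseID = 1:3,
                    VOTED = factor(c("yes", "no", "yes")))

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".Rds")
rda_path <- tempfile(fileext = ".Rdata")

# CSV: the factor comes back as plain character
write_csv(votes, csv_path)
from_csv <- read_csv(csv_path, show_col_types = FALSE)
class(from_csv$VOTED)

# Rds: one object, and you choose its name when loading
write_rds(votes, rds_path)
votes_again <- read_rds(rds_path)
class(votes_again$VOTED)

# Rdata: load() restores the object under its original name
save(votes, file = rda_path)
rm(votes)
load(rda_path)
exists("votes")
```

Because load() recreates objects under their saved names, it can silently overwrite an object of the same name in your environment, which is one reason to prefer .Rds for single objects.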

Question 6.1. Save object my_twitter_survey into an .Rda file.

save(my_twitter_survey,file = "my_twitter_survey.Rda")

References

Charles Lanfear, Introduction to R for Social Scientists

Subsetting data

lab2: Read, modify and export data

Soc 225: Data & Society

[Xiang Li]

2024-01-23