Goals
0: Opening this file
1: Rectangular data
2: Setup
3: Importing data
4: Looking at the data
5: Hierarchical data
6. Knitting your output and saving your work
Challenge: Excel worksheets
Just for fun: Image data
Hints

Goals

Open a project file in R
Read data of various types into R
Describe the sources of datasets

0: Opening this file

You’ll download all of the lab files for this class as .zip files from Canvas. To use them, unzip them into their own directory (not the temporary preview that only lets you browse the files inside the archive), then open the project file.

To open this lab project, click on 01-reading-data.Rproj. R project files automatically give you access to the files and directories (folders) in the project directory. This makes it much easier for us to load multiple data sets.

If you get stuck on one of the questions below, first check the bottom of this document for hints. Then try googling your issue or talking to somebody sitting near you. Finally, raise your hand and I’ll come help you out.

1: Rectangular data

The most common type of data we’ll see is rectangular data, which is organized into a table of rows and columns. If data come to us in another form we’ll often want to make them rectangular.

In the same directory as this .Rmd file and the project file, there’s a directory called data. Inside, there’s a comma-separated values (CSV) file called seattle_airbnb.csv. This contains data about 100 Airbnb listings from Seattle.
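Because the project file sets the working directory, the file can be referred to with a path relative to the project directory. The short sketch below only checks that R can see the file before we try to read it; the variable name csv_path is just an illustration.

```r
# Build the path relative to the project directory.
csv_path <- file.path("data", "seattle_airbnb.csv")

# Where R is currently looking for files. With the project open,
# this should be the project directory.
getwd()

# TRUE if the file is where we expect it, FALSE otherwise.
file.exists(csv_path)
```

If file.exists() returns FALSE, the most likely cause is that the project file was not opened, so R is looking in a different directory.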

These data come from Inside Airbnb, http://insideairbnb.com/. Go to the website and have a look at the About, Behind, and Get the Data pages. Use what you read to answer the questions below. Just type your answers below the questions.

Question 1.1: Who created these data sets?
Airbnb

Question 1.2: Why did they do it?
They want to analyze room type, activities, availabilities, and listings per host.

Question 1.3: Where were the data sourced from?
It is based on data of users.

2: Setup

R doesn’t come with everything we need loaded by default. Before we do anything else, we need to load a package. Packages contain specialized functions and data that we can use to do nifty things. The package we’ll use is called “tidyverse.” It’s a collection of packages for data manipulation, exploration, and visualization.

You can read more about the tidyverse here: https://www.tidyverse.org/

If a package hasn’t been installed on your machine, you’ll need to install it. For instance, you’d type install.packages("tidyverse") into the console. You only need to install a package once for each R installation, but you have to load it with library() in every new R session before you can use it.
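One way to combine the two steps is to install only when the package is missing and then load it. This is a sketch for use in the console, not something the lab requires; requireNamespace() is a base R function that reports whether a package is available without loading it.

```r
# Install the tidyverse only if it is not already available.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Loading has to happen in every new R session.
library(tidyverse)
```

On a machine where the tidyverse is already installed, the install step is skipped and only library() does any work.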

Code chunks let us incorporate code into text documents like this one. They’re very useful for creating interactive documents and for testing out new code. The output from a code chunk shows up right below it (and often also in the console). You can run the code in a chunk by pressing the green arrow in the top right of the chunk. You can also run just one line of code by placing your cursor on that line and pressing CTRL plus Return.

Question 2.1 Once you’ve completed installation, run this chunk of code to load the tidyverse:

library(tidyverse)

## ── Attaching packages ───────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.0     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## ── Conflicts ──────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
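The Conflicts message above is not an error. It says that two packages define a function with the same name, and that the dplyr version now wins when we type the bare name. The sketch below illustrates this with the built-in mtcars data set, which is not part of this lab.

```r
library(dplyr)

# After loading dplyr, a bare filter() refers to dplyr's version,
# which keeps the rows of a data frame that match a condition.
four_cyl <- filter(mtcars, cyl == 4)
nrow(four_cyl)

# The masked function is still available with the package prefix.
# stats::filter() computes a moving average here, not a row subset.
smoothed <- stats::filter(1:5, rep(1 / 3, 3))
smoothed
```

Writing the package name and two colons in front of a function is a general way to say exactly which version you mean.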

3: Importing data

To use data inside R, we first have to import, or read, that data into our environment.

airbnb_data <- read_csv("data/seattle_airbnb.csv")

## Parsed with column specification:
## cols(
##   id = col_double(),
##   name = col_character(),
##   neighbourhood_group = col_character(),
##   neighbourhood = col_character(),
##   price = col_double(),
##   number_of_reviews = col_double()
## )

When we do this, we assign the values of the data to a variable, airbnb_data, using the arrow (<-). When you’re typing inside of a code chunk (that’s where the code goes), you can use alt plus the minus key to quickly type the arrow. The object we create is called a data frame.

Question 3.1: What is the role of each component in the above line of code?
- airbnb_data: the name of the variable that holds the data frame
- <-: assigns the value on the right to the variable on the left
- read_csv(): reads a CSV file and returns its contents as a data frame
- "data/seattle_airbnb.csv": the path to the file to read, relative to the project directory
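The "Parsed with column specification" message shows the column types that read_csv() guessed. As a sketch, the same types can be stated explicitly with the col_types argument. The tiny file and the two listings below are made up so that the example runs without the lab data.

```r
library(readr)

# Write a tiny CSV to a temporary file so the example is self-contained.
# The two listings are made up for illustration.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,name,price",
             "1,Example loft,120",
             "2,Example studio,85"), tmp)

# Spell out the column types instead of letting read_csv() guess them.
listings <- read_csv(tmp, col_types = cols(
  id = col_double(),
  name = col_character(),
  price = col_double()
))

listings
```

Stating the types is optional for this lab, but it makes the import fail loudly if a column ever contains something unexpected.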

4: Looking at the data

You can print the data frame to the console by typing the name of the object. Our object was called ‘airbnb_data.’

Question 4.1 Go ahead and type that into the console. I did!

You can also look at the entire data set using RStudio’s built-in viewer. To open it, we use the function ‘View().’ We can run that command from the console, or from a code chunk like the one below:

Question 4.2 Follow the instructions in the code block and run it.

Note: I got an error here when I knit the document (Error in .External2(C_dataviewer, x, title) : unable to start data viewer).

# "un-comment" the line below this one, by removing the '#' and the space
#View(airbnb_data)
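The data viewer error mentioned in Question 4.2 comes up because View() needs an interactive session. One possible workaround, offered here as a sketch rather than the lab's intended answer, is to guard the call with interactive(). The data frame example_data is made up and stands in for airbnb_data.

```r
# A small made-up data frame standing in for airbnb_data.
example_data <- data.frame(id = 1:3, price = c(296, 82, 48))

# interactive() is TRUE in the console and FALSE when the document is
# knit with the Knit button, so the viewer only opens when it can.
if (interactive()) {
  View(example_data)
}
```

Commenting the line out, as the hints suggest, works just as well; the guard simply saves you from having to remember.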

The head() function shows you the first six rows of a data frame.

Question 4.3 Use the head function in the code chunk below to show the first rows of the airbnb_data.

head(airbnb_data)

## # A tibble: 6 x 6
##      id name              neighbourhood_gr… neighbourhood price number_of_revie…
##   <dbl> <chr>             <chr>             <chr>         <dbl>            <dbl>
## 1  2318 Casa Madrona - U… Central Area      Madrona         296               16
## 2  4291 Sunrise in Seatt… Other neighborho… Roosevelt        82               54
## 3  5682 Cozy Studio, min… Delridge          South Delrid…    48              428
## 4  6606 Fab, private sea… Other neighborho… Wallingford      90              110
## 5  9419 Glorious sun roo… Other neighborho… Georgetown       70              120
## 6  9460 Downtown/Convent… Downtown          First Hill       80              366

Question 4.4: head shows the first 6 rows by default. Change the following code to show the first 10 rows:

head(airbnb_data, n = 10)

## # A tibble: 10 x 6
##       id name             neighbourhood_gr… neighbourhood price number_of_revie…
##    <dbl> <chr>            <chr>             <chr>         <dbl>            <dbl>
##  1  2318 Casa Madrona - … Central Area      Madrona         296               16
##  2  4291 Sunrise in Seat… Other neighborho… Roosevelt        82               54
##  3  5682 Cozy Studio, mi… Delridge          South Delrid…    48              428
##  4  6606 Fab, private se… Other neighborho… Wallingford      90              110
##  5  9419 Glorious sun ro… Other neighborho… Georgetown       70              120
##  6  9460 Downtown/Conven… Downtown          First Hill       80              366
##  7  9531 The Adorable Sw… West Seattle      Fairmount Pa…   165               34
##  8  9534 The Coolest Tan… West Seattle      Fairmount Pa…   125               32
##  9  9596 the down home ,… Other neighborho… Wallingford     120               61
## 10  9909 Luna Park Lower… West Seattle      Fairmount Pa…   125               48

What if you want to look at the last several rows of a data frame instead of the first several rows?

Let’s read the documentation for head by typing ?head into the console.

Question 4.5 Based on what you found out, show the last 5 rows of airbnb_data

tail(airbnb_data, n = 5)

## # A tibble: 5 x 6
##       id name             neighbourhood_gr… neighbourhood price number_of_revie…
##    <dbl> <chr>            <chr>             <chr>         <dbl>            <dbl>
## 1 224763 Location! Sleep… Downtown          Belltown        149               72
## 2 225820 Family Friendly… Other neighborho… Phinney Ridge    90               69
## 3 226495 Fun apartment i… Ballard           Whittier Hei…   170               72
## 4 226536 Serene Room in … Magnolia          Lawton Park      46              116
## 5 226677 Sunny Parisian … Other neighborho… Georgetown       55              101
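The help page for head also documents tail, and both accept a negative n. The sketch below uses a made-up ten-row data frame, example_data, to show the difference.

```r
# A made-up data frame with ten rows.
example_data <- data.frame(id = 1:10, price = seq(50, 140, by = 10))

# The last three rows.
last_rows <- tail(example_data, n = 3)
last_rows

# A negative n means "all but": everything except the first eight rows.
all_but_first_eight <- tail(example_data, n = -8)
all_but_first_eight
```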

You can extract a single column by name using $. Type the name of the data frame (airbnb_data) first, then $ and finally the name of the column.

Question 4.6 Use $ to display the ‘price’ column

airbnb_data$price

##   [1]  296   82   48   90   70   80  165  125  120  125   48   60  109  299   60
##  [16]   40   60   91   40  105   85  145  165  199   89   79   99  189  107  157
##  [31]   75  259  185   75   85  225   95   60  110  180   50   70   96  147   76
##  [46]   50   50   70   46  110   47   75  157  150  250  120  130  135  110   79
##  [61]  110  150  170   65  125   75   89   92  180 9300   55  110  650   80   75
##  [76]   88  105  275  125  250   69   80   59   89  125  275   99   99  212   80
##  [91]   84  200   90  285   75  149   90  170   46   55
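A column extracted with $ is an ordinary vector, so it can be passed to other functions. The price output above contains one value of 9300, far above the rest, and the sketch below shows how such a value could be located. The three listings are made up for illustration.

```r
# A made-up data frame with one unusually high price.
example_data <- data.frame(
  name = c("Loft", "Studio", "Cabin"),
  price = c(120, 85, 9300)
)

# $ and [[ ]] extract the same vector.
prices <- example_data$price
identical(prices, example_data[["price"]])

# Summaries of the extracted vector.
highest <- max(prices)
middle <- median(prices)

# The row that holds the highest price.
example_data[which.max(prices), ]
```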

5: Hierarchical data

Data isn’t always a single, flat table. Sometimes it’s nested or hierarchical.

colors.json is a file in JSON format. This is a common format for web data. We’ll need to load another package, ‘jsonlite’, in order to read it. Type install.packages("jsonlite") in the console to install it.

Question 5.1 Change the code below so that we load the jsonlite package and make a new object called json_data by using the read_json() function to read the file “data/colors.json”.

library(jsonlite)

## 
## Attaching package: 'jsonlite'

## The following object is masked from 'package:purrr':
## 
##     flatten

json_data <- read_json("data/colors.json")

When you read the file this way, the result is a different type of R object, a list:

str(json_data)

## List of 1
##  $ colors:List of 6
##   ..$ :List of 4
##   .. ..$ color   : chr "black"
##   .. ..$ category: chr "hue"
##   .. ..$ type    : chr "primary"
##   .. ..$ code    :List of 2
##   .. .. ..$ rgba:List of 4
##   .. .. .. ..$ : int 255
##   .. .. .. ..$ : int 255
##   .. .. .. ..$ : int 255
##   .. .. .. ..$ : int 1
##   .. .. ..$ hex : chr "#000"
##   ..$ :List of 3
##   .. ..$ color   : chr "white"
##   .. ..$ category: chr "value"
##   .. ..$ code    :List of 2
##   .. .. ..$ rgba:List of 4
##   .. .. .. ..$ : int 0
##   .. .. .. ..$ : int 0
##   .. .. .. ..$ : int 0
##   .. .. .. ..$ : int 1
##   .. .. ..$ hex : chr "#FFF"
##   ..$ :List of 4
##   .. ..$ color   : chr "red"
##   .. ..$ category: chr "hue"
##   .. ..$ type    : chr "primary"
##   .. ..$ code    :List of 2
##   .. .. ..$ rgba:List of 4
##   .. .. .. ..$ : int 255
##   .. .. .. ..$ : int 0
##   .. .. .. ..$ : int 0
##   .. .. .. ..$ : int 1
##   .. .. ..$ hex : chr "#FF0"
##   ..$ :List of 4
##   .. ..$ color   : chr "blue"
##   .. ..$ category: chr "hue"
##   .. ..$ type    : chr "primary"
##   .. ..$ code    :List of 2
##   .. .. ..$ rgba:List of 4
##   .. .. .. ..$ : int 0
##   .. .. .. ..$ : int 0
##   .. .. .. ..$ : int 255
##   .. .. .. ..$ : int 1
##   .. .. ..$ hex : chr "#00F"
##   ..$ :List of 4
##   .. ..$ color   : chr "yellow"
##   .. ..$ category: chr "hue"
##   .. ..$ type    : chr "primary"
##   .. ..$ code    :List of 2
##   .. .. ..$ rgba:List of 4
##   .. .. .. ..$ : int 255
##   .. .. .. ..$ : int 255
##   .. .. .. ..$ : int 0
##   .. .. .. ..$ : int 1
##   .. .. ..$ hex : chr "#FF0"
##   ..$ :List of 4
##   .. ..$ color   : chr "green"
##   .. ..$ category: chr "hue"
##   .. ..$ type    : chr "secondary"
##   .. ..$ code    :List of 2
##   .. .. ..$ rgba:List of 4
##   .. .. .. ..$ : int 0
##   .. .. .. ..$ : int 255
##   .. .. .. ..$ : int 0
##   .. .. .. ..$ : int 1
##   .. .. ..$ hex : chr "#0F0"

It’s a little trickier, but you can use [[]] and $ to extract pieces of this data. For example, the 5th color:

json_data$colors[[5]]

## $color
## [1] "yellow"
## 
## $category
## [1] "hue"
## 
## $type
## [1] "primary"
## 
## $code
## $code$rgba
## $code$rgba[[1]]
## [1] 255
## 
## $code$rgba[[2]]
## [1] 255
## 
## $code$rgba[[3]]
## [1] 0
## 
## $code$rgba[[4]]
## [1] 1
## 
## 
## $code$hex
## [1] "#FF0"
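The $ and [[]] operators can be chained to reach a single value deep inside the list. The sketch below uses a cut-down, made-up version of the colors data so that it runs without the lab file; fromJSON() with simplifyVector = FALSE is used on the assumption that it gives the same nested lists as read_json() does by default.

```r
library(jsonlite)

# A cut-down, made-up version of the colors file with the same nesting.
json_text <- '{"colors": [
  {"color": "yellow", "code": {"rgba": [255, 255, 0, 1], "hex": "#FF0"}},
  {"color": "green",  "code": {"rgba": [0, 255, 0, 1],   "hex": "#0F0"}}
]}'

# Keep everything as nested lists instead of simplifying to vectors.
example_json <- fromJSON(json_text, simplifyVector = FALSE)

# Walk down the nesting one level at a time.
first_name <- example_json$colors[[1]]$color
first_hex <- example_json$colors[[1]]$code$hex
second_rgba_value <- example_json$colors[[1]]$code$rgba[[2]]

first_name
first_hex
second_rgba_value
```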

Question 5.2: Display the information for the color red.

json_data$colors[[3]]

## $color
## [1] "red"
## 
## $category
## [1] "hue"
## 
## $type
## [1] "primary"
## 
## $code
## $code$rgba
## $code$rgba[[1]]
## [1] 255
## 
## $code$rgba[[2]]
## [1] 0
## 
## $code$rgba[[3]]
## [1] 0
## 
## $code$rgba[[4]]
## [1] 1
## 
## 
## $code$hex
## [1] "#FF0"
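Instead of trying positions until the right color appears, an entry can be looked up by name, and the nested list can be turned into rectangular data. This is one possible approach, sketched on made-up JSON with the same shape as the lab file; the names color_names and color_table are illustrative.

```r
library(jsonlite)

# Made-up JSON with the same nesting as the colors file.
json_text <- '{"colors": [
  {"color": "black", "code": {"hex": "#000"}},
  {"color": "white", "code": {"hex": "#FFF"}},
  {"color": "red",   "code": {"hex": "#F00"}}
]}'
example_json <- fromJSON(json_text, simplifyVector = FALSE)

# Pull the name out of every entry, then find the position that matches.
color_names <- vapply(example_json$colors, function(x) x$color, character(1))
red_entry <- example_json$colors[[which(color_names == "red")]]
red_entry$code$hex

# One row per color: a rectangular view of the same list.
color_table <- data.frame(
  color = color_names,
  hex = vapply(example_json$colors, function(x) x$code$hex, character(1))
)
color_table
```

Fields that are missing from some entries (such as type for white in the lab file) would need extra handling before they could become a column.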

6. Knitting your output and saving your work

RMarkdown documents can be “knit” to produce different kinds of output. The simplest kind is an HTML file, like a web page. Knitting output is a good way to see the results of your work. It also helps you check for errors. To knit, press the “Knit” button just below the name of this document.

Question 6.1: Knit your output to an HTML file. Try opening the new soc225_reading_data.html file in a web browser.

You should save your work somewhere you can easily access it again, such as your UDrive.

Challenge: Excel worksheets

In the data folder, there is an Excel spreadsheet, airbnb.xlsx. It contains data for three cities (Seattle, Boston, Chicago) in separate sheets. Use the internet to find a package that will allow you to read in all of this data. If you read in each sheet separately, combine them into one data frame.
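One package that can do this is readxl. The sketch below is a possible approach, not the only one. It uses the example workbook that ships with readxl in place of airbnb.xlsx, so the sheet names and columns differ from the lab data.

```r
library(readxl)
library(dplyr)

# readxl ships with a multi-sheet example workbook. It stands in for
# data/airbnb.xlsx here so the sketch runs anywhere.
xlsx_path <- readxl_example("datasets.xlsx")

# Names of all sheets in the workbook.
sheet_names <- excel_sheets(xlsx_path)

# Read each sheet into its own data frame, named after its sheet.
sheets <- lapply(sheet_names, function(s) read_excel(xlsx_path, sheet = s))
names(sheets) <- sheet_names

# Stack them. The .id column records which sheet each row came from.
combined <- bind_rows(sheets, .id = "sheet")
count(combined, sheet)
```

For the lab file, the path would be "data/airbnb.xlsx" and the .id column could be called city instead of sheet.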

Just for fun: Image data

Images are data too, and R can import them as well. If you have extra time, check out the vignette for the magick package:

https://cran.r-project.org/web/packages/magick/vignettes/intro.html

# install.packages("magick")
# library(magick)
# cat_image <- image_read("data/Black_white_cat_on_fence.jpg")
# cat_image
# image_flip(cat_image)

Hints

4.1 The console is the same place where we typed ‘install.packages’ before. It should have a > and a cursor waiting for your input.

4.4 You’ll need to change the value of n from 6 to 10.

5.1 Make sure you’ve installed the jsonlite package. Then your code should look like this:

library(jsonlite)
json_data <- read_json("data/colors.json")

5.2 json_data$colors[[5]] gave us the info for ‘yellow’, so let’s try changing the 5 to other numbers to see if we can find ‘red’.

6.1 To avoid common errors while knitting output, make sure that you’ve removed or commented out any line of code with View() or install.packages() in it.

Reading data

Soc 225: Data & Society

[PUT YOUR NAME HERE]

2020-04-18