You’ll download all of the lab files for this class as .zip files from canvas. To use them, you’ll have to unzip them into their own directory (not a temporary directory that lets you explore the files inside), and open the project file.
You should click on 01-reading-data.Rproj when you open this lab project. R project files automatically give you access to the files and directories (folders) in the project directory. This makes it much easier for use to load multiple data sets.
If you get stuck on one of the questions below, first check the bottom of this document for hints. Then try googling your issue or talking to somebody sitting near you. Finally, raise your hand and I’ll come help you out.
The most common type of data we’ll see is rectangular data, which is organized into a table of rows and columns. If data come to us in another form we’ll often want to make them rectangular.
In the same directory as this .Rmd file and the project file, there’s a directory called data. Inside, there’s a comma-separated values (CSV) file called seattle_airbnb.csv. This contains data about 100 Airbnb listings from Seattle.
These data come from Inside Airbnb, http://insideairbnb.com/. Go to the website and have a look at the About, Behind, and Get the Data pages. Use what you read to answer the questions below. Just type your answers below the questions.
Question 1.1: Who created these data sets?
Airbnb Question 1.2: Why did they do it?
They want to analyze room type, activities, availiablities, and listings per host. Question 1.3: Where were the data sourced from?
It is based on data of users. ## 2: Setup
R doesn’t come with everything we need loaded by default. Before we do anything else, we need to load a package. Packages contain specialized functions and data that we can use to do nifty things. The package we’ll use is called “tidyverse.” It’s a collection of packages for data manipulation, exploration, and visualization.
You can read more about the tidyverse here: https://www.tidyverse.org/
If a package hasn’t been installed on your machine, you’ll need to install it. For instance, you’d type install.packages("tidyverse") into the console. You only need to do this once for each R installation, but in order to use a package, you have to load it.
Code chunks let us incorporate code into text documents like this one. They’re very useful for creating interactive documents and for testing out new code. The output from a code chunk shows up right below it (and often also in the console). You can run the code in a chunk by pressing the green arrow int he top right of the chunk. You can also run just one line of code by placing your cursor on that line and pressing CTRL plus Return.
Question 2.1 Once you’ve completed installation, run this chunk of code to load the tidyverse:
library(tidyverse)
## ── Attaching packages ───────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.5
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
To use data inside R, we first have to import, or read, that data into our environment.
airbnb_data <- read_csv("data/seattle_airbnb.csv")
## Parsed with column specification:
## cols(
## id = col_double(),
## name = col_character(),
## neighbourhood_group = col_character(),
## neighbourhood = col_character(),
## price = col_double(),
## number_of_reviews = col_double()
## )
When we do this, we assign the values of the data to a variable, airbnb_data, using the arrow (<-). When you’re typing inside of a code chunk (that’s where the code goes), you can use alt plus the minus key to quickly type the arrow. The object we create is called a data frame.
Question 3.1: What is the role of each component in the above line of code? - airbnb_data: the variable name we used for the data frame - <- : assign values of the data to a variable. - read_csv() : import data to environment - “data/seattle_airbnb.csv” : file of data
You can print the data frame to the console by typing the name of the object. Our object was called ‘airbnb_data.’
Question 4.1 Go ahead and type that into the console. I did!
You can also look at the entire data set using RStudio’s built-in viewer. To use that, we use the function ‘view().’ We can run that command from the console, or from a code chunk like the one below:
Question 4.2 Follow the instructions in the code block and run it # I got error on here when I knit it (Error in .External2(C_dataviewer, x, title) : unable to start data viewer)
# "un-comment" the line below this one, by removing the '#' and the space
#View(airbnb_data)
The head() function shows you the first six rows of a data frame.
Question 4.3 Use the head function in the code chunk below to show the first rows of the airbnb_data.
head(airbnb_data)
## # A tibble: 6 x 6
## id name neighbourhood_gr… neighbourhood price number_of_revie…
## <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 2318 Casa Madrona - U… Central Area Madrona 296 16
## 2 4291 Sunrise in Seatt… Other neighborho… Roosevelt 82 54
## 3 5682 Cozy Studio, min… Delridge South Delrid… 48 428
## 4 6606 Fab, private sea… Other neighborho… Wallingford 90 110
## 5 9419 Glorious sun roo… Other neighborho… Georgetown 70 120
## 6 9460 Downtown/Convent… Downtown First Hill 80 366
Question 4.4: head shows the first 6 rows by default. Change the following code to show the first 10 rows:
head(airbnb_data, n = 10)
## # A tibble: 10 x 6
## id name neighbourhood_gr… neighbourhood price number_of_revie…
## <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 2318 Casa Madrona - … Central Area Madrona 296 16
## 2 4291 Sunrise in Seat… Other neighborho… Roosevelt 82 54
## 3 5682 Cozy Studio, mi… Delridge South Delrid… 48 428
## 4 6606 Fab, private se… Other neighborho… Wallingford 90 110
## 5 9419 Glorious sun ro… Other neighborho… Georgetown 70 120
## 6 9460 Downtown/Conven… Downtown First Hill 80 366
## 7 9531 The Adorable Sw… West Seattle Fairmount Pa… 165 34
## 8 9534 The Coolest Tan… West Seattle Fairmount Pa… 125 32
## 9 9596 the down home ,… Other neighborho… Wallingford 120 61
## 10 9909 Luna Park Lower… West Seattle Fairmount Pa… 125 48
What if you want to look at the last several rows of a data frame instead of the first several rows?
Let’s read the documentation for head by typing ?head into the console.
Question 4.5 Based on what you found out, show the last 5 rows of airbnb_data
tail(airbnb_data, n = 5)
## # A tibble: 5 x 6
## id name neighbourhood_gr… neighbourhood price number_of_revie…
## <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 224763 Location! Sleep… Downtown Belltown 149 72
## 2 225820 Family Friendly… Other neighborho… Phinney Ridge 90 69
## 3 226495 Fun apartment i… Ballard Whittier Hei… 170 72
## 4 226536 Serene Room in … Magnolia Lawton Park 46 116
## 5 226677 Sunny Parisian … Other neighborho… Georgetown 55 101
You can extract a single column by name using $. Type the name of the dataframe (airbnb) first, then $ and finally the name of the column.
Question 4.6 Use $ to display the ‘price’ column
airbnb_data$price
## [1] 296 82 48 90 70 80 165 125 120 125 48 60 109 299 60
## [16] 40 60 91 40 105 85 145 165 199 89 79 99 189 107 157
## [31] 75 259 185 75 85 225 95 60 110 180 50 70 96 147 76
## [46] 50 50 70 46 110 47 75 157 150 250 120 130 135 110 79
## [61] 110 150 170 65 125 75 89 92 180 9300 55 110 650 80 75
## [76] 88 105 275 125 250 69 80 59 89 125 275 99 99 212 80
## [91] 84 200 90 285 75 149 90 170 46 55
Data isn’t always a single, flat table. Sometimes it’s nested or hierarchical.
colors.json is a file in JSON format. This is a common format for web data. We’ll need to load another package, ‘jsonlite’, in order to read it. Type install.packages("jsonlite") in the console to install it.
Question 5.1 Change the code below so that we load the jsonlite package and make a new object called json_data by using the read_json() function to read the file “data/colors.json”.
library(jsonlite)
##
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
##
## flatten
json_data <- read_json("data/colors.json")
When you do, it becomes a different type of R object, a list:
str(json_data)
## List of 1
## $ colors:List of 6
## ..$ :List of 4
## .. ..$ color : chr "black"
## .. ..$ category: chr "hue"
## .. ..$ type : chr "primary"
## .. ..$ code :List of 2
## .. .. ..$ rgba:List of 4
## .. .. .. ..$ : int 255
## .. .. .. ..$ : int 255
## .. .. .. ..$ : int 255
## .. .. .. ..$ : int 1
## .. .. ..$ hex : chr "#000"
## ..$ :List of 3
## .. ..$ color : chr "white"
## .. ..$ category: chr "value"
## .. ..$ code :List of 2
## .. .. ..$ rgba:List of 4
## .. .. .. ..$ : int 0
## .. .. .. ..$ : int 0
## .. .. .. ..$ : int 0
## .. .. .. ..$ : int 1
## .. .. ..$ hex : chr "#FFF"
## ..$ :List of 4
## .. ..$ color : chr "red"
## .. ..$ category: chr "hue"
## .. ..$ type : chr "primary"
## .. ..$ code :List of 2
## .. .. ..$ rgba:List of 4
## .. .. .. ..$ : int 255
## .. .. .. ..$ : int 0
## .. .. .. ..$ : int 0
## .. .. .. ..$ : int 1
## .. .. ..$ hex : chr "#FF0"
## ..$ :List of 4
## .. ..$ color : chr "blue"
## .. ..$ category: chr "hue"
## .. ..$ type : chr "primary"
## .. ..$ code :List of 2
## .. .. ..$ rgba:List of 4
## .. .. .. ..$ : int 0
## .. .. .. ..$ : int 0
## .. .. .. ..$ : int 255
## .. .. .. ..$ : int 1
## .. .. ..$ hex : chr "#00F"
## ..$ :List of 4
## .. ..$ color : chr "yellow"
## .. ..$ category: chr "hue"
## .. ..$ type : chr "primary"
## .. ..$ code :List of 2
## .. .. ..$ rgba:List of 4
## .. .. .. ..$ : int 255
## .. .. .. ..$ : int 255
## .. .. .. ..$ : int 0
## .. .. .. ..$ : int 1
## .. .. ..$ hex : chr "#FF0"
## ..$ :List of 4
## .. ..$ color : chr "green"
## .. ..$ category: chr "hue"
## .. ..$ type : chr "secondary"
## .. ..$ code :List of 2
## .. .. ..$ rgba:List of 4
## .. .. .. ..$ : int 0
## .. .. .. ..$ : int 255
## .. .. .. ..$ : int 0
## .. .. .. ..$ : int 1
## .. .. ..$ hex : chr "#0F0"
It’s a little trickier, but you can use [[]] and $ to extract pieces of this data. For example, the 5th color:
json_data$colors[[5]]
## $color
## [1] "yellow"
##
## $category
## [1] "hue"
##
## $type
## [1] "primary"
##
## $code
## $code$rgba
## $code$rgba[[1]]
## [1] 255
##
## $code$rgba[[2]]
## [1] 255
##
## $code$rgba[[3]]
## [1] 0
##
## $code$rgba[[4]]
## [1] 1
##
##
## $code$hex
## [1] "#FF0"
Question 5.2: Display the information for the color red.
json_data$colors[[3]]
## $color
## [1] "red"
##
## $category
## [1] "hue"
##
## $type
## [1] "primary"
##
## $code
## $code$rgba
## $code$rgba[[1]]
## [1] 255
##
## $code$rgba[[2]]
## [1] 0
##
## $code$rgba[[3]]
## [1] 0
##
## $code$rgba[[4]]
## [1] 1
##
##
## $code$hex
## [1] "#FF0"
RMarkdown documents can be “knit” to produce different kinds of output. The simplest kind is an HTML file, like a web page. Knitting output is a good way to see the results of your work. It also helps you check for errors. To knit, press the “Knit” button just below the name of this document.
Question 6.1: Knit your output to an HTML file. Try opening the new soc225_reading_data.html file in a web browser.
You should save your work somewhere you can easily access it again, such as your UDrive.
In the data folder, there is an Excel spreadsheet, airbnb.xlsx. It contains data for three cities (Seattle, Boston, Chicago) in separate sheets. Use the internet to find a package that will allow you to read in all of this data. If you read in each sheet separately, combine them into one data frame.
Images are data too, and R can import them as well. If you have extra time, check out the vignette for the magick package:
https://cran.r-project.org/web/packages/magick/vignettes/intro.html
# install.packages("magick")
# library(magick)
# cat <- image_read("data/Black_white_cat_on_fence.jpg")
# cat
# image_flip(cat)
4.1 The console is the same place where we typed ‘install.packages’ before. It should have a > and a cursor waiting for your input.
4.4 You’ll need to change the value of n from 6 to 10.
5.1 Make sure you’ve installed the jsonlite package. Then your code should look like this:
library(jsonlite)
json_data <- read_json("data/colors.json")
5.2 json_data$colors[[5]] gave us the info for ‘yellow’, so let’s try changing the 5 to other numbers to see if we can find ‘red’.
6.1 To avoid common errors while knitting output, make sure that you’ve removed or commented out any line of code with View() or install.packages() in it.