Here are some great resources:
US Government open data: https://www.data.gov. This site has over 302,800 datasets on topics such as student loans, crime and precipitation. It includes data from NOAA, NASA, the Department of Justice…
You can see that scientists and scholars interested in policy will find great data here. A student showed me this data set about terrorism and responses to terrorism: http://www.start.umd.edu/profiles-individual-radicalization-united-states-pirus-keshif.
Here is a list of public data sets by topic, with links: https://github.com/awesomedata/awesome-public-datasets
Kaggle also hosts contests and provides data sets: https://www.kaggle.com/datasets. Our ultimate goal after this three-part workshop series is to work in teams on a Kaggle competition during Spring 2019. Generally, Kaggle data is already pretty “clean,” so our work will be minimal in this area. Often the challenges in Kaggle data will be filling in or dropping missing values or formatting dates for consistency.
But be aware that many data sets are not clean (particularly if you are web scraping).
Task 1: spend 5 minutes glancing through these data sets. Does anything appeal to you? Can you think of something you’d like to answer with the data?
Computers in HSU labs should already have R and RStudio. If you want to install these on your laptop, first install R (https://cran.r-project.org), then RStudio (https://www.rstudio.com). Think of these like a car engine and its dashboard: R is the programming language that runs the computations, while RStudio provides an interface and many convenient features and tools. When you are set, open RStudio (the icon is a blue circle with a white letter R in it).
We will add packages as we need them in this tutorial, but briefly: a package extends the functionality of R by providing additional functions, data sets, and more. Packages can be downloaded for free and are written by the worldwide community of R users. Over our workshops, we will make use of ggplot2 (for data visualization) and dplyr (for data wrangling). We need to install a package once, with the install.packages() command, before we can use it, and then we have to load the packages we want during each R session with the library() command.
An analogy: R is your smartphone, which has some functionality just as it is; packages are apps you download to your phone, and they are somewhat personal and specific to how you want to use your phone. You have to install an app once, but you load (or open) it each time you want to use it.
We will install 4 packages today:
# install packages -- don't forget the quotes!
install.packages("dplyr") # this package streamlines data manipulation
install.packages("nycflights13") # this contains a data set we will use
install.packages("ggplot2") # this package is great for visualization.
install.packages("tidyverse") # this package is actually a bundle of packages for importing, tidying, transforming and visualizing data.
# load packages
library(dplyr)
library(nycflights13)
library(ggplot2)
library(tidyverse)
Note: you have to reload packages in each R session. Forgetting to do so produces one of the most common error messages I see (usually R will say “Error: could not find function.”).
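For example, here is what happens in a brand-new session if dplyr is installed but not yet loaded (an illustrative session):
glimpse(flights) # Error: could not find function "glimpse"
# load the packages, then try again
library(dplyr)
library(nycflights13)
glimpse(flights) # now it works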
Next, set your working directory. You can do this at the console with
"setwd("~/Desktop")
Alternatively, you can use a menu (on my Mac it is Session > Set Working Directory; on Windows use File > Change Directory). Setting your directory is important whenever you ask R to find a particular file: it has to know where to look. You can give relative or absolute file paths.
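For example (these paths are hypothetical; adjust them for your own machine):
# absolute path: the full location, starting from the top of the file system
setwd("C:/Users/yourname/Desktop") # Windows (note that R uses forward slashes)
setwd("~/Desktop") # Mac (~ is shorthand for your home folder)
# relative path: interpreted starting from the current working directory
setwd("data") # move into a subfolder named "data"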
For now, just set your working directory to the Desktop (we will save a downloaded file to the Desktop in Part 5).
We discovered that some lab computers won’t allow you to set the directory to the Desktop. You can use the command
getwd()
to see where you are currently working, and make a note of it so that when you download the file in Part 5 you can put it in the same directory.
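You can also list the files R can see from your current location, which is a quick way to check that a downloaded file landed in the right place:
# list the files in the current working directory
list.files()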
For this part of the tutorial, we will need the dplyr package. This package is great for data transformation and is intuitive to write. We will start by looking at data in the nycflights13 package, which contains five data sets with information about all domestic flights departing NYC in 2013. Let’s start by exploring the flights data frame.
flights
Let’s unpack this: a tibble is a type of data frame. We can see it has 336,776 rows (corresponding to observations) and 19 columns, which correspond to different variables (year, month, day, dep_time, and so on). Only the first 10 rows are printed; otherwise our screen would be completely filled.
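You can confirm these dimensions directly with a few base R commands:
dim(flights) # 336776 19 (number of rows, then number of columns)
nrow(flights) # number of rows (observations)
ncol(flights) # number of columns (variables)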
This is a good start, but we’d like some better ways to explore this data frame. Luckily, there are specific commands for this.
View(flights) # note that R is case sensitive.
# this is nice because it results in a pop-up view.
Here we can see there are different types of variables: some are quantitative, some are categorical. Each row corresponds to one flight. The glimpse() function in the dplyr package gives a similar overview, and you’ll see each variable’s type, too (like int or chr, for integer and character, respectively).
glimpse(flights)
We’ve looked at the data; now let’s play with it a little. We’ll start by assigning it to a local variable, myflights, for this R session.
# save the data into a local data frame
myflights <- as_tibble(flights) # as_tibble() is the current replacement for the deprecated tbl_df()
# look at first 20 rows
print(myflights, n=20)
We will filter the data by keeping only the rows that match certain criteria (for this example, we want flights that left on January 1). We will compare and contrast the base R and dplyr approaches. Base R filtering forces us to repeat the data frame’s name. Note that these commands don’t change the variable myflights; they just pull out the information requested. We could always assign the result to a new variable if desired.
# base R approach to view all flights on January 1
myflights[myflights$month==1 & myflights$day==1, ] # the comma means all the columns
# dplyr approach
# note: you can use comma or ampersand to represent AND condition
filter(myflights, month==1, day==1)
Or we can filter based on airline. Let’s keep only flights on United Airlines (UA) or American Airlines (AA):
filter(myflights, carrier %in% c("AA", "UA")) # c() is vector notation in R
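Equivalently, for just two airlines we could write the OR condition out with the | operator:
# same result with an explicit OR condition
filter(myflights, carrier == "AA" | carrier == "UA")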
We can also pick out particular columns with select(). Here we keep just the departure time, arrival time, and tail number:
# dplyr approach to selecting columns
select(myflights, dep_time, arr_time, tailnum)
We can perform multiple operations in one line with the pipe operator %>%. The code below takes the local data frame myflights, selects only the two columns carrier and dep_delay, and then keeps only the rows where dep_delay is larger than 60 minutes. The advantage is that there is less typing and the code is highly readable! Be aware that this displays the results but has not changed the variable myflights; if you want to save this filtered and reduced data set, set it to a new variable, as shown after the code.
myflights %>%
select(carrier, dep_delay) %>%
filter(dep_delay > 60)
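For example, to save the result above (the variable name big_delays is just an illustration):
# store the filtered, reduced data set in a new variable
big_delays <- myflights %>%
  select(carrier, dep_delay) %>%
  filter(dep_delay > 60)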
As another small example, look at the difference between base R and chaining when we create two vectors and calculate the Euclidean distance between them. The first method uses nesting, so it is a little harder to tell what is happening.
# create two vectors and calculate Euclidean distance between them with base R
x1 <- 1:5; x2 <- 2:6
sqrt(sum((x1-x2)^2)) # this is how we would write something mathematically, but it is hard to read with all the parentheses.
# chaining method
(x1-x2)^2 %>% sum() %>% sqrt()
We can reorder rows with either base R or dplyr. We will select the columns carrier and dep_delay and then sort by dep_delay (so results will be displayed with the smallest delay first, then the next smallest, etc.)
# base R approach to select carrier and dep_delay columns and sort by dep_delay
myflights[order(myflights$dep_delay), c("carrier", "dep_delay")]
# dplyr approach
myflights %>%
select(carrier, dep_delay) %>%
arrange(dep_delay)
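To see the largest delays first instead, wrap the column in desc():
# dplyr approach, sorted in descending order
myflights %>%
  select(carrier, dep_delay) %>%
  arrange(desc(dep_delay))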
Sometimes we want to create a new variable that is a function of existing variables. For example, we could calculate flight speed in miles per hour since there are variables for distance and time in the data set.
# base R approach to create a new variable speed (in mph)
myflights$speed <- myflights$distance / myflights$air_time*60 # create new column
myflights[, c("distance", "air_time", "speed")] # return all rows (see comma) and only these 3 columns
# dplyr approach (prints the new variable but does not store it)
myflights %>%
select(distance, air_time) %>%
mutate(speed = distance/air_time*60)
If we want to store the variable, we can.
# store the new variable
myflights2 <- myflights %>% mutate(speed = distance/air_time*60)
# if we look at the structure, there is one more column called speed
str(myflights2)
We can also compute summaries by group: group_by() splits the rows into groups (here, one per destination), and summarise() collapses each group into a single row.
# dplyr approach: group the flights by dest, then summarize each group by taking the mean of arr_delay
myflights %>%
group_by(dest) %>%
summarise(avg_delay = mean(arr_delay, na.rm=TRUE))
# note: na.rm = TRUE tells mean() to ignore the NA values rather than return NA. More below on missing values!
Or we can calculate the minimum and maximum arrival and departure delays for each carrier:
myflights %>%
group_by(carrier) %>%
summarise(across(ends_with("delay"),
          list(min = ~ min(.x, na.rm = TRUE),
               max = ~ max(.x, na.rm = TRUE))))
# note: older tutorials use summarise_each() and funs() here, but those are now deprecated
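Grouped summaries are not limited to means and extremes. As one more small example (not in the original tutorial), the n() function counts the rows in each group, so we can tally the number of flights per carrier:
# count the number of flights for each carrier
myflights %>%
  group_by(carrier) %>%
  summarise(n_flights = n())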
Many data sets have missing values, and we have to choose how to deal with them. In R these show up as NA (for “not available”).
The easiest (and most naive) way is to filter them out. You saw a quick example of this under the summarise section (with the na.rm = TRUE option).
For example, we can see that there are NA values in some of the last rows by using tail() to look.
# look at last rows
tail(myflights)
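Before dropping anything, it is worth checking how many missing values each column actually has. One quick base R way:
# count the NA values in each column
colSums(is.na(myflights))
If we decide simply to drop the incomplete rows, na.omit() removes every row that contains at least one NA: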
myflights_clean <- na.omit(myflights)
By looking at the structure of each (use str()), we can see that na.omit() took our data frame of 336,776 observations and reduced it to 327,346 observations. It simply removes any row in which one or more NAs occur. To see a very simple example of this, we can build a very small data frame and watch it in action:
# make a data frame DF with NA in row 3, column 2.
DF <- data.frame(x = c(1, 2, 3), y = c(0, 10, NA))
# omit any rows with NA
DF2 <- na.omit(DF)
# look at the new data frame-- it has one less row.
DF2
There are other functions, such as na.exclude(), which drops rows with missing values but keeps track of where they were (this can be important for prediction, the topic of Workshop 3). There are also a number of R packages that help replace, or impute, values for the NAs. For example, it might make sense to impute the mean of the non-missing entries, though this can be undesirable because it decreases the variance.
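As a rough sketch of what mean imputation looks like (illustrative only; dep_delay is just an example column):
# naive mean imputation: replace missing departure delays with the column mean
mean_delay <- mean(myflights$dep_delay, na.rm = TRUE)
myflights_imputed <- myflights %>%
  mutate(dep_delay = ifelse(is.na(dep_delay), mean_delay, dep_delay))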
If the amount of missing data is very small relative to the size of the data set, leaving out a few samples may be the best way to go, but be aware that you are losing information when you do so. One package to explore during your Kaggle competition is the mice package, which can take care of the imputation process.
We will leave that topic for another day, but here is a website to check out: https://datascienceplus.com/imputing-missing-data-with-r-mice-package/.
For the tutorial above, we used a data set that was part of an R package (flights in the nycflights13 package). But most data sets come from outside of R. Let’s make sure we are able to load Kaggle data sets into R. You will have to do this for the contest this spring, and this is also usually a prerequisite to any data science project you might like to do.
When we read data into R, we need to tell R two things: what type of data structure to store the data in, and where to find the file. We will use a data frame/tibble structure, which is why we wanted the packages above. We will read in a chocolate bar ratings file found here: https://www.kaggle.com/rtatman/chocolate-bar-ratings#flavors_of_cacao.csv. To download the file, you will need to quickly create a free Kaggle account. I suggest you use your real name, or something you don’t mind an employer seeing, because if you do well in the contest in the spring it is something you can point to on your resume.
Download the file and put it on your Desktop. (Recall that we set the working directory to the Desktop earlier in the tutorial. If your working directory is somewhere else, put the file there.)
The file we want is a .csv (“comma separated values”) file, so we will use read_csv(), a function from the readr package (loaded as part of the tidyverse) that is designed specifically for reading this file type.
# save the information to a local variable
chocolateData <- read_csv("flavors_of_cacao.csv")
# look at the first few lines
head(chocolateData)
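If R complains that it cannot find the file, your working directory is probably not where the file lives; you can either reset the working directory or hand read_csv() the full path to the file (the path below is hypothetical):
# alternative: give the full path instead of relying on the working directory
chocolateData <- read_csv("~/Desktop/flavors_of_cacao.csv")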
Data science is an exciting and evolving field, and there are many free resources available to help you learn. I think the best way to learn is to do a project and to ask for help when you run into challenges.
If I get a coding error, I often google the exact error message R gives.
If I have a question about how to do something with a data set, I often google that too. Nine times out of ten, someone else has asked the same question on Stack Overflow.
If not, you can post the question yourself on an online forum such as Stack Overflow. If you do, remember to include a small reproducible example so that others can run your code.
This journal article by Hadley Wickham (chief scientist at RStudio and author of many great R packages) covers the idea of “tidy data” and works through a great example: https://www.jstatsoft.org/article/view/v059i10/v59i10.pdf
I found this tutorial very helpful for learning dplyr, and I reused some of its examples above: http://rpubs.com/justmarkham/dplyr-tutorial.