packages = c(
  "dplyr","ggplot2","stringr", "dslabs", "readr", "tidyr", "purrr",
  "lubridate", "readxl"
  )
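# Install any packages from the list that are not installed yet.
existing = as.character(installed.packages()[,1])
for(pkg in packages[!(packages %in% existing)]) install.packages(pkg)
# Clear the workspace, then set the locale and the numeric print options.
rm(list=ls(all=TRUE))
Sys.setlocale("LC_ALL","C")
options(digits=4, scipen=12)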
library(readr)
library(readxl)
library(dplyr)
library(ggplot2)
library(stringr)
library(lubridate)
library(tidyr)
library(dslabs)

A. Data Import

A1. Import Spreadsheets

Q1: Which of the following is NOT part of the data wrangling process?

  • Checking correlations between your variables

Q2: Which files could be opened in a basic text editor?

  • data.txt
  • data.csv
  • data.tsv

Q3: You open the file in a basic text editor and see lines that look like the following:

read.table("data/initials.txt", header=T, sep=",")
  initials state age time
1      vib    MA  61 6:01
2      adc    TX  45 5:45
3      kme    CT  50 4:19

What type of file is this?

  • A comma-delimited file with a header

A2. Paths and the Working Directory

Q1: Assume the following is the full path to the directory that a student wants to use as their working directory in R: “/Users/student/Documents/projects/”

Which of the following lines of code CANNOT set the working directory to the desired “projects” directory?

setwd(/Users/student/Documents/projects/)
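The line fails because setwd() expects the path as a character string, and without quotes R cannot even parse it. The following is a minimal sketch using only base R. tempdir() is a stand-in for the student's "projects" folder, which may not exist on your machine.

```r
# The unquoted call from the answer is not valid R syntax.
bad_call <- "setwd(/Users/student/Documents/projects/)"
parse_result <- tryCatch(
  { parse(text = bad_call); "parsed" },
  error = function(e) "syntax error"
)
print(parse_result)

# A quoted path works; tempdir() is a stand-in for the projects folder.
old_wd <- setwd(tempdir())   # setwd() invisibly returns the previous directory
new_wd <- getwd()
setwd(old_wd)                # restore the original working directory
```

With the real path, the working call would be setwd("/Users/student/Documents/projects/").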

Q2: We want to copy the “murders.csv” file from the dslabs package into an existing folder “data”, which is located in our HarvardX-Wrangling projects folder. We first enter the code below into our RStudio console.

library(dslabs)
wd = getwd()
filename = "murders.csv"
path = system.file("extdata", package = "dslabs")

Which of the following commands would NOT successfully copy “murders.csv” into the folder “data”?

file.copy(file.path(path, "murders.csv"), getwd())
[1] FALSE

This command copies the file into the working directory, not into the “data” folder.
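To land the file in “data”, the destination has to name that folder. The sketch below uses temporary stand-in folders and a made-up two-line file instead of the dslabs "extdata" folder, so it runs without touching a real project. It also shows that file.copy() returns FALSE when the target file already exists and overwrite is left at its default of FALSE.

```r
# Stand-in source folder and file (hypothetical; replaces the dslabs "extdata" folder).
path <- tempfile("extdata_demo")
dir.create(path)
filename <- "murders.csv"
writeLines(c("a,b", "1,2"), file.path(path, filename))

# Stand-in project folder that already contains a "data" folder.
project <- tempfile("project_demo")
dir.create(file.path(project, "data"), recursive = TRUE)

# Copy into the "data" folder by naming it as the destination.
first_copy <- file.copy(file.path(path, filename), file.path(project, "data"))

# A second copy returns FALSE: the file exists and overwrite defaults to FALSE.
second_copy <- file.copy(file.path(path, filename), file.path(project, "data"))

copied <- file.exists(file.path(project, "data", filename))
```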

A3. The readr and readxl Packages

Q1: You are not sure whether the murders.csv file has a header row. How could you check this?

  • Open the file in a basic text editor.
  • In the RStudio “Files” pane, click on your file, then select “View File”.
  • Use the command read_lines (remembering to specify the number of rows with the n_max argument).
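The third option can be sketched as follows. The file here is a temporary stand-in with made-up contents, not the real murders.csv. read_lines() returns raw text lines, so a header row, if present, shows up as the first element.

```r
library(readr)

# Stand-in file (hypothetical contents) so the sketch runs anywhere.
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,value", "a,1", "b,2"), tmp)

# Read only the first two lines as plain text, without parsing columns.
first_lines <- read_lines(tmp, n_max = 2)
print(first_lines)
```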

Q2: What is one difference between read_excel and read_xlsx?

  • read_excel reads both .xls and .xlsx files by detecting the file format from its extension, while read_xlsx only reads .xlsx files.
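The difference can be checked with the example workbooks that ship with readxl, located through readxl_example(). This is a sketch under that assumption; the example files are not part of the course material.

```r
library(readxl)

# Example workbooks bundled with readxl (not course data).
xlsx_path <- readxl_example("datasets.xlsx")
xls_path  <- readxl_example("datasets.xls")

# read_excel() works out the format for each file.
from_xlsx <- read_excel(xlsx_path)
from_xls  <- read_excel(xls_path)

# read_xlsx() assumes the .xlsx format, so an .xls file makes it fail.
xls_via_xlsx <- tryCatch(read_xlsx(xls_path), error = function(e) NULL)

# excel_format() reports the format readxl detects for each file.
formats <- c(excel_format(xlsx_path), excel_format(xls_path))
```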

Q3: You have a file called “times.txt” that contains race finish times for a marathon. The first four lines of the file look like this:

initials,state,age,time
vib,MA,61,6:01
adc,TX,45,5:45
kme,CT,50,4:19

Which line of code will NOT produce a tibble with column names “initials”, “state”, “age”, and “time”?

race_times <- read.csv("data/times.txt")
race_times
  initials state age time
1      vib    MA  61 6:01
2      adc    TX  45 5:45
3      kme    CT  50 4:19
class(race_times)
[1] "data.frame"

This call returns a data frame rather than a tibble.
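For comparison, the sketch below reads the same data with both functions. The file contents are reconstructed from the printed output above, which is an assumption, and are written to a temporary file instead of "data/times.txt".

```r
library(readr)

# Temporary stand-in for data/times.txt (contents assumed from the printed output).
tmp <- tempfile(fileext = ".txt")
writeLines(c("initials,state,age,time",
             "vib,MA,61,6:01",
             "adc,TX,45,5:45",
             "kme,CT,50,4:19"), tmp)

base_df <- read.csv(tmp)                    # base R: plain data frame
tbl     <- suppressMessages(read_csv(tmp))  # readr: tibble

print(class(base_df))
print(class(tbl))
```

Both objects carry the same four column names; only the readr version has the tibble classes.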

Q4: You also have access to marathon finish times in the form of an Excel document named “times.xlsx”. In the Excel document, different sheets contain race information for different years. The first sheet is named “2015”, the second is named “2016”, and the third is named “2017”.

Which line of code will NOT import the data contained in the “2016” tab of this Excel sheet?

library(readxl)
times_2016 <- read_xlsx("times.xlsx", sheet = "2")
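The reason is that a quoted value is treated as a sheet name, while an unquoted number is treated as a position. The sketch below illustrates this with the datasets.xlsx example workbook that ships with readxl, used here as a stand-in because times.xlsx is not available.

```r
library(readxl)

# Example workbook bundled with readxl (stand-in for times.xlsx).
path   <- readxl_example("datasets.xlsx")
sheets <- excel_sheets(path)   # sheet names in workbook order

by_position <- read_xlsx(path, sheet = 2)          # second sheet by position
by_name     <- read_xlsx(path, sheet = sheets[2])  # same sheet by name

# "2" is looked up as a sheet name; no sheet has that name, so this fails.
by_quoted_number <- tryCatch(read_xlsx(path, sheet = "2"),
                             error = function(e) NULL)
```

For times.xlsx, the equivalent working calls would use sheet = 2 or sheet = "2016".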

A4. Importing Data Using R-base Functions

Q1: You have a comma-separated values file that contains the initials, home states, ages, and race finish times for marathon runners. The runners’ initials contain three characters for the runners’ first, middle, and last names (for example, “KME”).

You read in the file using the following code.

race_times <- read.csv("data/times.txt")

What is the data type of the initials in the object race_times?

str(race_times)
'data.frame':   3 obs. of  4 variables:
 $ initials: Factor w/ 3 levels "adc","kme","vib": 3 1 2
 $ state   : Factor w/ 3 levels "CT","MA","TX": 2 3 1
 $ age     : int  61 45 50
 $ time    : Factor w/ 3 levels "4:19","5:45",..: 3 2 1
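The factor columns above depend on the stringsAsFactors setting. Since R 4.0.0 the default in read.csv() is FALSE, so a current R installation would show chr instead of Factor unless the argument is set explicitly. A small sketch, using a temporary file whose contents are assumed from the printed output:

```r
# Temporary stand-in for data/times.txt (contents assumed from the printed output).
tmp <- tempfile(fileext = ".txt")
writeLines(c("initials,state,age,time",
             "vib,MA,61,6:01",
             "adc,TX,45,5:45",
             "kme,CT,50,4:19"), tmp)

# Explicit settings give the same result on any R version.
with_factors <- read.csv(tmp, stringsAsFactors = TRUE)
with_strings <- read.csv(tmp, stringsAsFactors = FALSE)

print(class(with_factors$initials))
print(class(with_strings$initials))
```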

Q2: Which of the following is NOT a real difference between the readr import functions and the base R import functions?

  • The base R import functions can read .csv files, but cannot read files with other delimiters, such as .tsv files, or fixed-width files.
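That statement is false because base R does handle other layouts. The sketch below reads a tab-separated file with read.delim() and a fixed-width file with read.fwf(). Both files are temporary stand-ins with made-up layouts.

```r
# Tab-separated stand-in file.
tsv <- tempfile(fileext = ".tsv")
writeLines(c("initials\tstate\tage", "vib\tMA\t61", "adc\tTX\t45"), tsv)
tab_data <- read.delim(tsv)   # read.delim() defaults to sep = "\t"

# Fixed-width stand-in file: 3 characters, then 2, then 2, with no header.
fwf <- tempfile(fileext = ".txt")
writeLines(c("vibMA61", "adcTX45"), fwf)
fixed_data <- read.fwf(fwf, widths = c(3, 2, 2),
                       col.names = c("initials", "state", "age"))
```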

Q3: You read in a file containing runner information and marathon finish times using the following code.

race_times <- read.csv("data/times.txt", stringsAsFactors = F)
race_times
  initials state age time
1      vib    MA  61 6:01
2      adc    TX  45 5:45
3      kme    CT  50 4:19

What is the class of the object race_times?

class(race_times)
[1] "data.frame"

A5. Downloading Files from the Internet

Q1: Select the answer choice that summarizes all of the actions that the following lines of code can perform.

url <- "https://raw.githubusercontent.com/MyUserName/MyProject/master/MyData.csv"
dat <- read_csv(url)
download.file(url, "MyData.csv")

  • Create a tibble in R called dat that contains the data from the csv file stored on GitHub, then download the csv file to the working directory and name the downloaded file “MyData.csv”.
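The two steps can be tried without a network connection by pointing a file:// URL at a local file. This is only a sketch: the file, its contents, and the destination in tempdir() are stand-ins, base read.csv() is used in place of read_csv(), and the URL construction assumes a Unix-like path.

```r
# Stand-in for the remote file: a local csv addressed through a file:// URL.
src <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,2"), src)
url <- paste0("file://", normalizePath(src, winslash = "/"))

# Step 1: read directly from the URL (read_csv(url) would return a tibble instead).
dat <- read.csv(url)

# Step 2: save a copy to disk; tempdir() replaces the working directory here.
dest <- file.path(tempdir(), "MyData.csv")
status <- download.file(url, dest, quiet = TRUE)   # 0 indicates success
```

The two steps are independent: reading from the URL does not save a file, and downloading does not create an object in R.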






