Course 1: The R Development Environment and Basics

Learning with swirl
Week 1: Basic R Language
- Reading data: CSV files
- Reading data from a web API
Week 2: Data manipulation
- 2-1 Basic Data Manipulation
- 2-2 Working with Dates, Times, Time Zones
Week 3: Text Processing, Regular Expression, & Physical Memory
Week 4: large datasets
- 4-1 Working with large datasets
Quiz

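About this document:

This file lives in the Coursera folder; the course materials folder can hold pages that need to be reviewed repeatedly.
Week 2 covers quite important material, but I did not study it carefully.

Help: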

If you’re trying to learn a new R package, Google “[package name] vignette” and “[package name] tutorial”.
If you are struggling with how to write the code for a plot, try using Google Images. Google “r [name or description of plot]” (e.g., “r pareto plot”) and then choose the “Images” tab in the results. Scroll through to find something that looks like the plot you want to create, and then check the image’s website. It will often include the R code used to create the image.

Course Learning Objectives:

Modify object attributes and metadata
Describe differences in different R classes and data types. Manipulate and transform a variety of data types, including dates, times, and text data
Read tabular data into R and read in web data via web scraping tools and APIs
Define tidy data and transform non-tidy data into tidy data
Describe how memory is used in R sessions to store R objects
Describe how to diagnose programming problems and to look up answers from the web or forums

Learning with swirl

These lessons are basically simplified versions of the examples in the course!

library(swirl)
install_course("The R Programming Environment")

R Workspace and Files

R provides a common API (a common set of commands) for interacting with files, that way your code will work across different kinds of computers.

File-manipulation commands: getwd(), ls(), dir(), list.files(), dir.create("testdir"), file.create("mytest.R"), file.exists("mytest.R"), file.rename("mytest.R", "mytest2.R"), file.copy(), unlink("testdir", recursive = TRUE)
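As a minimal sketch of how these commands fit together, the following runs them inside a temporary directory so that nothing in the real working directory is touched. The use of tempdir() and the file name mytest3.R are illustrative choices, not something taken from the notes above.

```r
# Work inside a scratch directory so the real working directory is untouched.
scratch <- file.path(tempdir(), "testdir")
dir.create(scratch)

original <- file.path(scratch, "mytest.R")
renamed  <- file.path(scratch, "mytest2.R")
copied   <- file.path(scratch, "mytest3.R")   # hypothetical name for the copy

file.create(original)             # create an empty file
file.exists(original)             # check that it is there
file.rename(original, renamed)    # rename mytest.R to mytest2.R
file.copy(renamed, copied)        # copy mytest2.R to mytest3.R
files_now <- list.files(scratch)  # names of the files in the directory

unlink(scratch, recursive = TRUE) # delete the directory and its contents
still_there <- dir.exists(scratch)
```

Building paths with file.path() rather than pasting strings keeps the code portable across operating systems, which is the point of the common file API.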

Manipulate data

library(titanic)
library(faraway)

Looking at data

Whenever you’re working with a new dataset, the first thing you should do is look at it! What is the format of the data? What are the dimensions? What are the variable names? How are the variables stored? Are there missing data? Are there any flaws in the data?

Common commands

object.size(plants)
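The questions about a new dataset (format, dimensions, variable names, storage, missing data) map fairly directly onto base R commands. The sketch below uses the built-in airquality data frame as a stand-in, since the plants data from the swirl lesson is not available outside swirl; that substitution is an assumption made only for illustration.

```r
# airquality ships with R (package datasets); it stands in for a new dataset.
dat <- airquality

class(dat)            # what kind of object is it?
dim(dat)              # number of rows and columns
names(dat)            # variable names
str(dat)              # how each variable is stored
head(dat); tail(dat)  # first and last rows
summary(dat)          # ranges and NA counts per variable
n_missing <- colSums(is.na(dat))  # missing values per column
object.size(dat)      # memory used by the object
```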

Week 1: Basic R Language

A tidy dataset has the following properties:

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.
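To make the three properties concrete, here is a small sketch that reshapes a made-up non-tidy table into a tidy one. The table, its column names, and the choice of tidyr::pivot_longer() are illustrative assumptions; the course may use a different function for the same step.

```r
library(tidyr)

# Hypothetical non-tidy table: one column per year instead of one row per observation.
wide <- data.frame(
  country = c("A", "B"),
  `1999`  = c(10, 20),
  `2000`  = c(11, 21),
  check.names = FALSE
)

# Tidy version: each variable (country, year, cases) is a column,
# and each country-year observation is a row.
long <- pivot_longer(wide, cols = c(`1999`, `2000`),
                     names_to = "year", values_to = "cases")
long
```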

Reading data: CSV files

library(readr)
getwd()
teams <- read_csv("data/team_standings.csv")
teams
teams <- read_csv("data/team_standings.csv", col_types = "cc") # Here "cc" indicates that the first column is character and the second column is character (there are only two columns).
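A self-contained way to see what col_types changes is to write a tiny file and read it twice. The file contents below are a made-up stand-in for data/team_standings.csv, written to a temporary path.

```r
library(readr)

# Write a tiny stand-in file (hypothetical contents) to a temporary location.
path <- tempfile(fileext = ".csv")
writeLines(c("Standing,Team", "1,Spain", "2,Netherlands", "3,Germany"), path)

guessed  <- read_csv(path)                    # Standing is guessed as a number
explicit <- read_csv(path, col_types = "cc")  # both columns forced to character

sapply(guessed, class)
sapply(explicit, class)
```

Specifying col_types also silences the column-specification message that read_csv() prints when it has to guess.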

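Reading data from a web API

This part is a bit hard; I need to find an example with complete code. https://www.coursera.org/learn/r-programming-environment/supplement/VFcjM/requesting-data-through-a-web-api

The course also introduces packages for reading HTML, XML, and JSON formats.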
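A typical API workflow has two steps: send an HTTP request (for example with httr::GET() and httr::content()), then parse the response body. The sketch below covers only the parsing step, on a hard-coded JSON string, so that it runs offline. The payload and its field names are hypothetical, and jsonlite is one reasonable choice for the parser rather than necessarily the package the course uses.

```r
library(jsonlite)

# Hypothetical response body; a real one would come from an HTTP request.
body <- '[{"site": "A", "value": 1.5}, {"site": "B", "value": 2.0}]'

# fromJSON() turns a JSON array of records into a data frame.
readings <- fromJSON(body)
readings
mean(readings$value)
```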

Week 2: Data manipulation

This chapter covers quite important material, but I did not study it carefully.

2-1 Basic Data Manipulation

2-2 Working with Dates, Times, Time Zones

Week 3: Text Processing, Regular Expression, & Physical Memory

You can specify sets of characters with regular expressions, some of which come built in, but you can build your own character sets too. First we’ll discuss the built-in character sets: words (“\w”), digits (“\d”), and whitespace characters (“\s”). Words specify any letter, digit, or an underscore, digits specify the digits 0 through 9, and whitespace specifies line breaks, tabs, or spaces. Each of these character sets has its own complement: not words (“\W”), not digits (“\D”), and not whitespace characters (“\S”). Each complement specifies all of the characters not included in the corresponding character set.
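A short sketch of these character sets in base R, on made-up strings. Note that inside an R string literal the backslash itself has to be escaped, so the pattern \d is written "\\d".

```r
x <- c("abc_123", "hello world", "42", "tab\there")

has_digit <- grepl("\\d", x)   # does the string contain a digit?
has_space <- grepl("\\s", x)   # does it contain whitespace (space, tab, line break)?
only_word <- gsub("\\W", "", x) # drop everything that is not a letter, digit, or underscore
only_digs <- gsub("\\D", "", x) # keep only the digits
no_vowels <- gsub("[aeiou]", "", x) # a custom character set: remove lowercase vowels
```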

The stringr package has many functions, including str_extract, str_order, str_pad, str_to_title, str_trim, str_wrap, and word.

Some of the memory-management functions are quite interesting, for example mem_used.
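A quick sketch of several stringr functions on made-up strings, to show what each one returns.

```r
library(stringr)

first_num <- str_extract("order 66 shipped", "\\d+")   # first match of the pattern
ord       <- str_order(c("pear", "apple", "fig"))      # indices that would sort the vector
padded    <- str_pad("7", width = 3, pad = "0")        # pads on the left by default
titled    <- str_to_title("the r programming environment")
trimmed   <- str_trim("   padded text  ")              # strip leading/trailing whitespace
wrapped   <- str_wrap("a fairly long sentence that needs wrapping", width = 20)
second    <- word("tidy data in R", 2)                 # second word of the sentence
```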

library(pryr)      # provides object_size(), mem_used(), and mem_change()
library(magrittr)  # provides %>%
sapply(ls(), function(x) object_size(get(x))) %>% sort %>% tail(5)
mem_change(rm(check_tracks, denver, b))

Week 4: large datasets

4-1 Working with large datasets

The data.table package is very fast. Which is faster, its functions or those of the readr package?

One strategy for speeding up R code is to write some of the code in C++ and connect it to R using the Rcpp package.
Parallel strategies may be worth pursuing if you are working with very large datasets and the coding tasks can be split to run in parallel.
Out-of-memory strategies: there are several R packages that allow you to connect your R session to a database.
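As a minimal sketch of the parallel strategy, the example below splits an artificial task across two worker processes using the parallel package that ships with R. The function slow_square and the choice of two workers are illustrative assumptions, not something from the course.

```r
library(parallel)

# A deliberately simple task that can be split into independent pieces.
slow_square <- function(x) {
  Sys.sleep(0.1)  # stand-in for an expensive computation
  x^2
}

cl  <- makeCluster(2)                    # start two worker processes
res <- parLapply(cl, 1:4, slow_square)   # run the pieces on the workers
stopCluster(cl)                          # always release the workers

unlist(res)
```

Starting workers has a fixed cost, so this only pays off when each piece of work is much slower than the toy function here.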

library(data.table)
brazil_zika <- fread("data/COES_Microcephaly-2016-06-25.csv")
fread("data/COES_Microcephaly-2016-06-25.csv",
      select = c("location", "value", "unit")) %>%
  dplyr::slice(1:3)

Quiz

Answers: 1. 0.003960 2. OC CSN Unadjusted PM2.5 LC TOT 3. State 39 County 081 Site 0017 4. 0.018567 5. 0.4300 6. 3527

First read in the data and look at its basic properties.

library(readr)
library(dplyr)  # provides %>%, select(), filter(), group_by(), summarise(), arrange()
dat <- read_csv("/media/ghy/36D2072ED206F243/coursera/R_programing_development/course1/data/data/daily_SPEC_2014.csv.bz2")
dat1 <- dat
## Read with read.csv, the data take 280 MB and the variable names differ; read with read_csv, they take 447 MB
# dat <- read.csv("/media/ghy/36D2072ED206F243/coursera/R_programing_development/course1/data/data/daily_SPEC_2014.csv.bz2")
str(dat)
head(dat)
dim(dat)
names(dat)
object.size(dat)

## fread fails here with the error: mmap'd region has EOF at the end
# library(data.table)
# dat <- fread("/media/ghy/36D2072ED206F243/coursera/R_programing_development/course1/data/data/daily_SPEC_2014.csv.bz2")

dat1 <- dat %>%
  (function(dat) {names(dat) = gsub(" ", ".", names(dat)); dat}) %>% 
  select(Parameter.Name, Arithmetic.Mean, State.Name, State.Code, County.Code, Site.Num)

Examine each variable

table(dat$`Parameter Name`) %>% head
table(dat$`State Code`) %>% head
head(dat); tail(dat)

class(dat)
sapply(dat, class)
names(dat)

What is average Sample.Value for “Bromine PM2.5 LC” in the state of Wisconsin in this dataset?

data <- dat %>%
  (function(dat) {names(dat) = gsub(" ", ".", names(dat)); dat}) %>%
  filter(Parameter.Name == "Bromine PM2.5 LC")  %>%
  filter(State.Name == "Wisconsin") %>%
  (function(dat) dat[, sapply(dat, class) == "numeric"]) %>%
  select(Arithmetic.Mean) %>%
  as.data.frame() 

mean(data$Arithmetic.Mean)
sum(is.na(data))

Calculate the average of each chemical constituent across all states, monitoring sites and all time points. Which constituent Parameter.Name has the highest average level?

Which variables correspond to time points, monitoring sites, and states?

data <- dat %>%
  (function(dat) {names(dat) = gsub(" ", ".", names(dat)); dat}) %>%
  select(Arithmetic.Mean, State.Name, Site.Num, Parameter.Name) %>%
  group_by(Parameter.Name) %>%
  summarise(Means = mean(Arithmetic.Mean)) %>%
  arrange(desc(Means))  # highest average first

Which monitoring site has the highest average level of “Sulfate PM2.5 LC” across all time? Indicate the state code, county code, and site number.

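data <- dat %>%
  (function(dat) {names(dat) = gsub(" ", ".", names(dat)); dat}) %>%
  filter(Parameter.Name == "Sulfate PM2.5 LC")  %>% 
  select(Arithmetic.Mean, State.Code, Site.Num, Parameter.Name, County.Code) %>%
  group_by(State.Code, County.Code, Site.Num) %>%  # one group per monitoring site
  summarise(Means = mean(Arithmetic.Mean)) %>%
  ungroup() %>%
  arrange(desc(Means))  # site with the highest average first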

What is the absolute difference in the average levels of “EC PM2.5 LC TOR” between the states California and Arizona, across all time and all monitoring sites?

data <- dat %>%
  (function(dat) {names(dat) = gsub(" ", ".", names(dat)); dat}) %>%
  filter(Parameter.Name == "EC PM2.5 LC TOR")  %>% 
  select(Arithmetic.Mean, State.Name) %>%
  group_by(State.Name) %>%
  summarise(Means = mean(Arithmetic.Mean)) %>%
  filter(State.Name == "California"| State.Name =="Arizona") %>%
  as.data.frame()

abs(diff(data$Means))  # the question asks for the absolute difference

What is the median level of “OC PM2.5 LC TOR” in the western United States, across all time? Define western as any monitoring location that has a Longitude LESS THAN -100.

data <- dat %>% 
  (function(dat) {names(dat) = gsub(" ", ".", names(dat)); dat}) %>%
  filter(Parameter.Name == "OC PM2.5 LC TOR")  %>% 
  filter(Longitude < -100) %>%
  summarise(Median = median(Arithmetic.Mean)) %>% 
  as.data.frame()

The second dataset: read it in first.

library(readxl)
dat <- read_excel("/media/ghy/36D2072ED206F243/coursera/R_programing_development/course1/data/data/aqs_sites.xlsx")

names(dat)
head(dat)
sapply(dat, class)
unique(dat$"Local Site Name") %>% length
sapply(dat, function(x) length(unique(x)))
names(dat)[grepl("Site", names(dat))]

intersect(names(dat), names(dat1))

How many monitoring sites are labelled as both RESIDENTIAL for “Land Use” and SUBURBAN for “Location Setting”?

data <- dat %>%
    (function(dat) {names(dat) = gsub(" ", ".", names(dat)); dat}) %>%
    select(Site.Number, Land.Use,  Location.Setting) %>%
    filter(Land.Use == "RESIDENTIAL", Location.Setting == "SUBURBAN") %>%
    nrow()

What is the median level of “EC PM2.5 LC TOR” amongst monitoring sites that are labelled as both “RESIDENTIAL” and “SUBURBAN” in the eastern U.S., where eastern is defined as Longitude greater than or equal to -100?

Only the first dataset contains “EC PM2.5 LC TOR”, so the two datasets probably need to be merged here.

data <- dat %>%
  (function(dat) {names(dat) = gsub(" ", ".", names(dat)); dat}) %>%
  filter(Longitude >= -100) %>%
  filter(Land.Use == "RESIDENTIAL", Location.Setting == "SUBURBAN")
#  filter(Local.Site.Name == "EC PM2.5 LC TOR")

# NOTE: dat1 as built earlier keeps only six columns; Latitude and Longitude
# must be retained there for the select() below to work.
data1 <- dat1 %>%
  (function(dat) {names(dat) = gsub(" ", ".", names(dat)); dat}) %>%
  filter(Parameter.Name== "EC PM2.5 LC TOR") %>%
  select(Latitude, Longitude, Arithmetic.Mean)
  
merge.Data <- merge(data, data1)
median(merge.Data$Arithmetic.Mean)  # the question asks for the median, not the mean
  
sapply(dat, class)    
sapply(dat, function(x) sum(grepl("PM2.5", x)))
(dat$`Local Site Name` == "EC PM2.5 LC TOR") %>% na.omit() %>% sum
unique(data1$Site.Num)

That is not right; try another approach.

First merge the data.

names(dat)
dat1.temp <- dat1 %>%
  (function(dat) {names(dat) = gsub(" ", ".", names(dat)); dat})
dat.temp <- dat %>%
  (function(dat) {names(dat) = gsub(" ", ".", names(dat)); dat})
dat.temp <- rename(dat.temp, Site.Num = Site.Number)
dat2 <- merge(dat.temp, dat1.temp)
rm(dat1.temp, dat.temp)

Then go back to the question.

data <- dat2 %>%
  (function(dat) {names(dat) = gsub(" ", ".", names(dat)); dat}) %>%
  filter(Longitude >= -100) %>%
  filter(Parameter.Name== "EC PM2.5 LC TOR") %>%
  filter(Land.Use == "RESIDENTIAL", Location.Setting == "SUBURBAN")

OK, the answer to this question is still wrong, but I will skip past it for now.
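One thing that may be worth checking in the attempts above is which columns merge() matches on: with no by argument it uses every column name the two tables share, and the key columns need the same type in both tables. The sketch below joins on the three site identifiers only, using made-up data. The toy values, and the guess that the merge keys are what goes wrong, are assumptions rather than a confirmed diagnosis.

```r
# Made-up stand-ins for the two datasets; values are illustrative only.
sites <- data.frame(
  State.Code  = c("01", "01", "39"),
  County.Code = c("001", "003", "081"),
  Site.Num    = c("0001", "0002", "0017"),
  Land.Use    = c("RESIDENTIAL", "COMMERCIAL", "RESIDENTIAL"),
  Location.Setting = c("SUBURBAN", "URBAN", "SUBURBAN"),
  Longitude   = c(-86.5, -87.9, -80.6)
)
measurements <- data.frame(
  State.Code  = c("01", "01", "39", "39"),
  County.Code = c("001", "003", "081", "081"),
  Site.Num    = c("0001", "0002", "0017", "0017"),
  Parameter.Name  = "EC PM2.5 LC TOR",
  Arithmetic.Mean = c(0.2, 0.9, 0.4, 0.6)
)

# Join on the three identifiers only; they are character in both tables.
keys   <- c("State.Code", "County.Code", "Site.Num")
joined <- merge(measurements, sites, by = keys)

# Apply the filters from the question, then take the median.
east_res_sub <- subset(joined,
  Longitude >= -100 & Land.Use == "RESIDENTIAL" & Location.Setting == "SUBURBAN")
toy_median <- median(east_res_sub$Arithmetic.Mean)
```

On the real data, comparing sapply(dat.temp[keys], class) with sapply(dat1.temp[keys], class) before merging would show whether the identifier types agree.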