typeof(2)
typeof("micro")
typeof(1+1i)
typeof(TRUE)
The c() function creates a vector, in which all elements have the same type. In the first example below the elements are numeric, and in the second they are characters. If numbers and character strings are mixed in one call, the result is a character vector: the numeric values are “coerced” to characters.
c(1,2,3)
## [1] 1 2 3
c("a","b","c")
## [1] "a" "b" "c"
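The coercion rule described above can be checked with a mixed vector. This is a minimal sketch; the names mixed and logical_numeric are made up for illustration.

```r
# Mixing numbers and strings: every element is coerced to character
mixed <- c(1, "b", 3)
mixed
typeof(mixed)            # "character"

# Mixing logicals and numbers: TRUE and FALSE are coerced to 1 and 0
logical_numeric <- c(TRUE, 2, FALSE)
logical_numeric
typeof(logical_numeric)  # "double"
```

Coercion always moves towards the more general type, roughly logical, then numeric, then character.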
vector_example <- c(10, 11, 15, 19) # concatenate command c()
vector_example
## [1] 10 11 15 19
class(vector_example)
## [1] "numeric"
vector_example[1]
## [1] 10
vector_example[-1]
## [1] 11 15 19
vector_example[2:4]
## [1] 11 15 19
matrix_example <- matrix(1:12, nrow=3, ncol=4) # define a matrix with 3 rows and 4 columns
matrix_example <- matrix(0, ncol=6, nrow=3) # overwrite it with a 3 x 6 matrix of zeros
matrix_example
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    0    0    0    0    0    0
## [2,]    0    0    0    0    0    0
## [3,]    0    0    0    0    0    0
class(matrix_example)
typeof(matrix_example)
str(matrix_example)
dim(matrix_example)
nrow(matrix_example)
ncol(matrix_example)
matrix_example[2, ]
## [1] 0 0 0 0 0 0
matrix_example[,2]
## [1] 0 0 0
matrix_example[2,3]
## [1] 0
t(matrix_example)
##      [,1] [,2] [,3]
## [1,]    0    0    0
## [2,]    0    0    0
## [3,]    0    0    0
## [4,]    0    0    0
## [5,]    0    0    0
## [6,]    0    0    0
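Because matrix_example holds only zeros, the row, column, and element lookups above all look alike. A small sketch with distinct values makes the indexing easier to follow; filled_matrix is a name chosen here for illustration.

```r
# matrix() fills column by column: 1, 2, 3 go into the first column
filled_matrix <- matrix(1:12, nrow = 3, ncol = 4)
filled_matrix

filled_matrix[2, ]     # second row: 2 5 8 11
filled_matrix[, 2]     # second column: 4 5 6
filled_matrix[2, 3]    # row 2, column 3: 8
dim(t(filled_matrix))  # transposing swaps the dimensions: 4 3
```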
Another data structure you’ll want in your bag of tricks is the list, an ordered data structure. A list is simpler in some ways than the other types, because you can put anything you want in it:
list_example <- list(1, "a", TRUE, 1+4i)
list_example[1:2]
## [[1]]
## [1] 1
##
## [[2]]
## [1] "a"
list_example[[2]]
## [1] "a"
list_example[2]
## [[1]]
## [1] "a"
class(list_example[2])
## [1] "list"
another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE )
another_list
## $title
## [1] "Numbers"
##
## $numbers
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $data
## [1] TRUE
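When a list has names, its elements can also be reached by name. The sketch below rebuilds another_list so that it runs on its own and contrasts the access forms.

```r
another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE)

names(another_list)             # "title" "numbers" "data"
another_list$title              # the element itself: "Numbers"
another_list[["numbers"]][3]    # third entry of the numbers vector: 3
class(another_list["numbers"])  # single brackets keep the list wrapper: "list"
```

As with the unnamed list above, double brackets (or $) return the element, while single brackets return a shorter list.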
Arrays are similar to matrices, but they can have more than two dimensions.
z <- array(1:24, dim=c(2,3,4))
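An array is indexed with one subscript per dimension. The following sketch, assuming the same z as above, shows how a single element and a whole two-dimensional slice are extracted.

```r
z <- array(1:24, dim = c(2, 3, 4))

dim(z)       # 2 3 4
z[1, 2, 3]   # row 1, column 2, third slice: 15
slice <- z[, , 1]  # leaving two subscripts empty returns a 2 x 3 slice
is.matrix(slice)   # TRUE
slice
```

Like a matrix, the array is filled with the first index varying fastest, which is why z[1, 2, 3] is element 1 + 1*2 + 2*6 = 15 of 1:24.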
We said that the columns of a data frame are vectors. A data frame is used for storing data tables: it is a list of vectors of equal length.
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))
write.csv(x = cats, file = "/home/ubuntu/r-tutorial/gapminder/data/feline-data.csv", row.names = FALSE)
cats <- read.csv(file = "/home/ubuntu/r-tutorial/gapminder/data/feline-data.csv")
cats
##     coat weight likes_string
## 1 calico    2.1            1
## 2  black    5.0            0
## 3  tabby    3.2            1
str(cats$weight)
str(cats$likes_string)
str(cats$coat)
class(cats)
## [1] "data.frame"
cats$weight
cats$likes_string
cats$coat
cats[["coat"]] # double brackets extract a single column as a vector
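The statement that a data frame is a list of equal-length vectors can be checked directly. This sketch rebuilds cats in memory rather than reading the CSV file, so it does not depend on the file path used above.

```r
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))

is.list(cats)         # TRUE: a data frame is a list underneath
length(cats)          # 3: one list element per column
sapply(cats, length)  # every column holds 3 values

cats[["weight"]]      # double brackets: the column as a numeric vector
cats["weight"]        # single brackets: a one-column data frame
```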
We already learned that the columns of a data frame are vectors, so that our data are consistent in type throughout the columns. As such, if we want to add a new column, we can start by making a new vector:
age <- c(2, 3, 5)
cats
##     coat weight likes_string
## 1 calico    2.1            1
## 2  black    5.0            0
## 3  tabby    3.2            1
cbind(cats, age)
##     coat weight likes_string age
## 1 calico    2.1            1   2
## 2  black    5.0            0   3
## 3  tabby    3.2            1   5
Note that if we tried to add a vector of ages with a different number of entries than the number of rows in the dataframe, it would fail:
age <- c(2, 3, 5, 12)
cbind(cats, age) # fails: arguments imply differing number of rows: 3, 4
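In a script, the failure can be caught and avoided by comparing lengths before binding. This is only a sketch: the tryCatch() wrapper and the length check are one possible way to guard the call, and result is a name made up here.

```r
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))
age <- c(2, 3, 5, 12)  # four values for a three-row data frame

# Capture the error message instead of stopping the script
result <- tryCatch(cbind(cats, age),
                   error = function(e) conditionMessage(e))
result

# Guard: only bind the column when the lengths agree
if (length(age) == nrow(cats)) {
  cats <- cbind(cats, age)
}
ncol(cats)  # still 3, because the guard skipped the cbind
```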
Now how about adding rows? We already know that the rows of a data frame are lists:
newRow <- list("tortoiseshell", 3.3, TRUE) # one value per column of cats
cats <- rbind(cats, newRow)
## Warning in `[<-.factor`(`*tmp*`, ri, value = "tortoiseshell"): invalid
## factor level, NA generated
cats <- rbind(cats, cats)
cats
##     coat weight likes_string
## 1 calico    2.1            1
## 2  black    5.0            0
## 3  tabby    3.2            1
## 4   <NA>    3.3            1
## 5 calico    2.1            1
## 6  black    5.0            0
## 7  tabby    3.2            1
## 8   <NA>    3.3            1
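The <NA> in the coat column follows from the warning above: coat is stored as a factor, and "tortoiseshell" is not one of its levels. This presumably happens because read.csv() converted strings to factors, which was the default before R 4.0.0. The sketch below reproduces that situation with stringsAsFactors = TRUE and shows two possible fixes; cats_levels and cats_chr are names made up here.

```r
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1),
                   stringsAsFactors = TRUE)  # mimic the pre-4.0 default

# Fix 1: add the new level to the factor before adding the row
cats_levels <- cats
levels(cats_levels$coat) <- c(levels(cats_levels$coat), "tortoiseshell")
cats_levels <- rbind(cats_levels, list("tortoiseshell", 3.3, TRUE))

# Fix 2: store the column as plain character strings instead
cats_chr <- cats
cats_chr$coat <- as.character(cats_chr$coat)
cats_chr <- rbind(cats_chr, list("tortoiseshell", 3.3, TRUE))

cats_levels
cats_chr
```

Converting to character is simpler when the column is free text; keeping the factor makes sense when the set of allowed values should stay fixed.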
gapminder <- read.csv("/home/ubuntu/r-tutorial/gapminder/data/gapminder_data.csv")
mean(gapminder[gapminder$continent == "Africa", "gdpPercap"])
## [1] 2193.755
mean(gapminder[gapminder$continent == "Americas", "gdpPercap"])
## [1] 7136.11
Luckily, the dplyr package provides a number of very useful functions for manipulating dataframes in a way that will reduce the above repetition, reduce the probability of making errors, and probably even save you some typing. As an added bonus, you might even find the dplyr grammar easier to read.
Here we’re going to cover five of the most commonly used functions, select(), filter(), group_by(), summarize(), and mutate(), as well as using pipes (%>%) to combine them.
install.packages('dplyr')
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
If, for example, we wanted to move forward with only a few of the variables in our dataframe we could use the select() function. This will keep only the variables you select.
year_country_gdp <- select(gapminder,year,country,gdpPercap)
year_country_gdp <- gapminder %>% select(year,country,gdpPercap)
year_country_gdp_euro <- gapminder %>%
    filter(continent=="Europe") %>%
    select(year,country,gdpPercap)
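A pipe passes the result on its left as the first argument of the function on its right, so a piped chain is equivalent to nested calls. The sketch below checks this on a tiny data frame; toy and its values are made up for illustration and are not gapminder data.

```r
library(dplyr)

# Hypothetical stand-in for gapminder with made-up numbers
toy <- data.frame(country   = c("A", "A", "B", "B"),
                  continent = c("Europe", "Europe", "Africa", "Africa"),
                  year      = c(2002, 2007, 2002, 2007),
                  gdpPercap = c(100, 110, 50, 60))

# Nested calls: read from the inside out
nested <- select(filter(toy, continent == "Europe"), year, country, gdpPercap)

# Piped calls: read from top to bottom
piped <- toy %>%
    filter(continent == "Europe") %>%
    select(year, country, gdpPercap)

all.equal(nested, piped)  # both forms give the same two-row result
```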
The above was a bit on the uneventful side but group_by() is much more exciting in conjunction with summarize(). This will allow us to create new variable(s) by using functions that repeat for each of the continent-specific data frames. That is to say, using the group_by() function, we split our original dataframe into multiple pieces, then we can run functions (e.g. mean() or sd()) within summarize().
gdp_bycontinents <- gapminder %>%
    group_by(continent) %>%
    summarize(mean_gdpPercap=mean(gdpPercap))
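To see that group_by() followed by summarize() does the same work as repeating a base R subset for each continent, the two can be compared on a small example. The data frame toy and its numbers are made up for illustration.

```r
library(dplyr)

# Hypothetical data: two rows per continent, made-up values
toy <- data.frame(continent = c("Africa", "Africa", "Europe", "Europe"),
                  gdpPercap = c(1000, 3000, 20000, 30000))

# One call produces one mean per group
by_continent <- toy %>%
    group_by(continent) %>%
    summarize(mean_gdpPercap = mean(gdpPercap))
by_continent

# The repetitive base R version, one line per continent
mean(toy[toy$continent == "Africa", "gdpPercap"])  # 2000
mean(toy[toy$continent == "Europe", "gdpPercap"])  # 25000
```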
We can also create new variables prior to (or even after) summarizing information using mutate().
gdp_pop_bycontinents_byyear <- gapminder %>%
    mutate(gdp_billion=gdpPercap*pop/10^9) %>%
    group_by(continent,year) %>%
    summarize(mean_gdpPercap=mean(gdpPercap),
              sd_gdpPercap=sd(gdpPercap),
              mean_pop=mean(pop),
              sd_pop=sd(pop),
              mean_gdp_billion=mean(gdp_billion),
              sd_gdp_billion=sd(gdp_billion))
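The effect of mutate() is easier to follow on a few rows where the arithmetic can be done by hand. In this sketch toy and its values are made up, and the .groups = "drop" argument assumes dplyr 1.0.0 or later; it only controls whether the result stays grouped.

```r
library(dplyr)

# Hypothetical data with made-up values, one year only
toy <- data.frame(continent = c("Africa", "Africa", "Europe", "Europe"),
                  year      = c(2007, 2007, 2007, 2007),
                  gdpPercap = c(1000, 3000, 20000, 30000),
                  pop       = c(1e6, 2e6, 5e6, 1e7))

toy_summary <- toy %>%
    mutate(gdp_billion = gdpPercap * pop / 10^9) %>%  # new column: 1, 6, 100, 300
    group_by(continent, year) %>%
    summarize(mean_gdp_billion = mean(gdp_billion),
              .groups = "drop")
toy_summary
```

The new gdp_billion column exists only inside the pipeline; toy itself is left unchanged unless the result is assigned back to it.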