Rongen Zhang
variable_name <- some_value
target <- data
A vector is a sequence of values, all of the same type
## [1] 1 3 7 15
## [1] TRUE
## [1] 4
## [1] 1 2 3 4 5 6 7 8 9 10
## [1] 1 2 3 4 5 6 7 8 9 10
## [1] 1 3 5 7 9
## [1] 7 7 7
## [1] 10 20 7 13
## value1 value2 value3 value4
## 10 20 7 13
## [1] 1 3 5 11 13 15 21 23 25
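The calls that produced the outputs above did not survive in this text. The following is a minimal sketch of calls that would give the same results; the variable names (`v`, `odds`, `sevens`, `x`, `gaps`) are assumptions chosen for illustration, not necessarily those used originally.

```r
# Hypothetical reconstruction: variable names are illustrative only
v <- c(1, 3, 7, 15)            # c() combines values into a vector
is.vector(v)                   # TRUE
length(v)                      # 4

one_to_ten <- 1:10             # colon operator: 1 2 ... 10
same_seq   <- seq(1, 10)       # the same sequence via seq()
odds       <- seq(1, 9, by = 2)    # 1 3 5 7 9
sevens     <- rep(7, times = 3)    # 7 7 7

x <- c(10, 20, 7, 13)
names(x) <- c("value1", "value2", "value3", "value4")  # name each element
x

# Several sequences can be combined into one vector with c()
gaps <- c(seq(1, 5, by = 2), seq(11, 15, by = 2), seq(21, 25, by = 2))
gaps                           # 1 3 5 11 13 15 21 23 25
```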
Try to find the most efficient way (fewest characters) to create the following vectors:
## [1] 10 11 12 13 14 15 16 17 18 19 20
## [1] -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30
## apple banana cherry
## 4 5 3
Vector computations are performed element-wise
## [1] 50 100 150 200
## [1] 5 -5 5 30
## [1] 10 40 90 160
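As a small sketch of element-wise computation, the vectors `a` and `b` below are assumed values picked so that the results match the outputs above; the original inputs are not shown in this text.

```r
# Assumed example vectors (not taken from the original slides)
a <- c(10, 20, 30, 40)
b <- c(5, 25, 25, 10)

scaled  <- a * 5               # each element times 5: 50 100 150 200
diffs   <- a - b               # element-by-element difference: 5 -5 5 30
product <- a * c(1, 2, 3, 4)   # element-by-element product: 10 40 90 160

scaled
diffs
product
```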
Is there a value of 20 in x?
## [1] TRUE
Is there a value of 40 in x?
## [1] FALSE
Does basket have apple?
## [1] TRUE
Does basket have cheese?
## [1] FALSE
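One way to answer membership questions like these is the `%in%` operator. The sketch below assumes `x` holds the values 10, 20, 7, 13 and that `basket` is a character vector of fruit names; both definitions are assumptions, since the original code is not shown.

```r
# Assumed definitions for illustration
x <- c(10, 20, 7, 13)
basket <- c("apple", "banana", "cherry")

has_20     <- 20 %in% x             # TRUE
has_40     <- 40 %in% x             # FALSE
has_apple  <- "apple" %in% basket   # TRUE
has_cheese <- "cheese" %in% basket  # FALSE

# An equivalent check built from an element-wise comparison
any(x == 20)                        # TRUE
```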
In the real world, your data may contain missing values. In R, we use NA (upper case) to represent a missing value.
## [1] 1 4 NA 2
## [1] NA
## [1] NA
NA creates problems for most numerical functions.
For example, we cannot add NA to other numbers.
To apply these numerical functions to data that contains NAs, exclude the NAs from the calculation, typically by passing the argument na.rm = TRUE. For example:
## [1] 7
## [1] 4
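A minimal sketch of how NA propagates and how `na.rm = TRUE` avoids it. The vector name `scores` is an assumption; the values match the vector printed above.

```r
scores <- c(1, 4, NA, 2)       # hypothetical name for the vector shown above

is.na(scores)                  # FALSE FALSE TRUE FALSE: locates the missing value

sum(scores)                    # NA, because one element is missing
max(scores)                    # NA

total   <- sum(scores, na.rm = TRUE)   # 7: NA is dropped before summing
largest <- max(scores, na.rm = TRUE)   # 4
```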
Recycling repeats elements in the shorter vector until its length matches the longer vector
u <- c(10, 20)
v <- c(1, 2, 3, 4, 5)
u + v # the shorter vector will be recycled to match the longer vector
## [1] 11 22 13 24 15
Under the hood:
u + v
= c(10, 20) + c(1, 2, 3, 4, 5)
= c(10, 20, 10, 20, 10) + c(1, 2, 3, 4, 5) # recycling
= c(10+1, 20+2, 10+3, 20+4, 10+5) # element-wise operation
= c(11, 22, 13, 24, 15)
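One detail worth knowing: when the longer length is not a multiple of the shorter one (as with lengths 5 and 2 here), R still recycles but also emits a warning. The sketch below contrasts the two cases; the vector `w` is a hypothetical addition for illustration.

```r
u <- c(10, 20)
v <- c(1, 2, 3, 4, 5)
w <- c(1, 2, 3, 4)             # hypothetical: length is a multiple of length(u)

clean <- u + w                 # 11 22 13 24, recycled twice, no warning

# Lengths 2 and 5: the result is still computed, but R warns that the
# longer object length is not a multiple of the shorter object length
partial <- suppressWarnings(u + v)     # 11 22 13 24 15

# A single number is a length-1 vector, so it is recycled across all elements
plus_one <- v + 1              # 2 3 4 5 6
```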
Without typing the following into R, guess what the result of the last line would be:
Another more challenging one:
You can retrieve elements from a vector by specifying the indexes of the elements. This operation is also known as subsetting.
## value1
## 10
## value3
## 30
You can provide more than just one index.
## value1 value2 value3
## 10 20 30
## value3 value2 value1 value4
## 30 20 10 40
## value4 value4
## 40 40
## value2 value3 value4
## 20 30 40
## value3 value4
## 30 40
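The subsetting calls behind these outputs are not shown here. The following sketch uses a named vector matching the printed values and indexes that would produce the same results; the exact calls in the original may differ (for instance, the last output could come from negative indexes or from a logical condition).

```r
x <- c(value1 = 10, value2 = 20, value3 = 30, value4 = 40)

x[1]                 # first element
x[3]                 # third element
x[1:3]               # elements 1 to 3
x[c(3, 2, 1, 4)]     # any order is allowed
x[c(4, 4)]           # the same index can be repeated
x[-1]                # a negative index drops that element
x[c(-1, -2)]         # drop the first two elements
x[x > 20]            # logical condition: keep elements greater than 20
x["value3"]          # named elements can also be selected by name
```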
A list is also a container for values, but it can hold items of different data types.
## [[1]]
## [1] "Bob"
##
## [[2]]
## [1] 100 80 90
Just like vectors, you can give each element a name:
## $name
## [1] "Bob"
##
## $grades
## [1] 100 80 90
## $grades
## [1] 100 80 90
## $grades
## [1] 100 80 90
## [1] "list"
## [1] 100 80 90
## [1] "numeric"
## [1] 100 80 90
## [1] 100 80 90
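A sketch of how a named list like this can be built and accessed. The variable name `student` is an assumption. The key distinction is that single brackets return a sub-list, while double brackets and `$` return the element itself.

```r
# Hypothetical variable name; element names match the output above
student <- list(name = "Bob", grades = c(100, 80, 90))

student["grades"]              # single brackets: a list containing grades
class(student["grades"])       # "list"

student[["grades"]]            # double brackets: the numeric vector itself
class(student[["grades"]])     # "numeric"

student$grades                 # $ is shorthand for [["grades"]]
student[[2]]                   # elements can also be selected by position
```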
Create the following list and use the function mean() to get the GPA (grade point average) from her grade points:
## $name
## [1] "Anna"
##
## $is_female
## [1] TRUE
##
## $age
## [1] 22
##
## $enrollment
## [1] "CIS4710" "CIS4720" "CIS4730"
##
## $grade_point
## [1] 4 3 4
A matrix is a collection of data elements arranged in a two-dimensional rectangular layout.
A <- matrix(
  1:6,           # the data elements
  nrow = 2,      # number of rows
  ncol = 3,      # number of columns
  byrow = TRUE)  # fill matrix by rows
A
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
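For comparison, a short sketch of what happens without `byrow = TRUE`: by default `matrix()` fills column by column. The matrix `B` is a hypothetical addition for illustration.

```r
A <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)  # filled row by row
B <- matrix(1:6, nrow = 2, ncol = 3)                # default: filled column by column

B            # first row is 1 3 5, second row is 2 4 6
dim(A)       # 2 3: number of rows, then number of columns

A[1, 2]      # row 1, column 2 of A: 2
B[1, 2]      # row 1, column 2 of B: 3
```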
A data frame is a set of vectors of equal length. Think of a data frame as an Excel sheet or a database table.
Column names are preserved or guessed if not explicitly set
course <- c("CIS4730", "CIS4710", "CIS4950", "CIS1234")
num_of_students <- c(20, 10, 40, 30)
data_analytics_minor <- c(TRUE, TRUE, TRUE, FALSE)
df <- data.frame(course, n_students = num_of_students, data_analytics_minor,
                 stringsAsFactors = FALSE)
df # notice the column names and row names
##    course n_students data_analytics_minor
## 1 CIS4730         20                 TRUE
## 2 CIS4710         10                 TRUE
## 3 CIS4950         40                 TRUE
## 4 CIS1234         30                FALSE
## [1] 3
## [1] 4
## [1] "course" "n_students" "data_analytics_minor"
## [1] "1" "2" "3" "4"
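The four outputs above are consistent with the standard functions for inspecting a data frame's shape and labels. A minimal sketch, rebuilding the same `df` so the block runs on its own:

```r
df <- data.frame(
  course = c("CIS4730", "CIS4710", "CIS4950", "CIS1234"),
  n_students = c(20, 10, 40, 30),
  data_analytics_minor = c(TRUE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

ncol(df)        # 3: number of columns
nrow(df)        # 4: number of rows
colnames(df)    # "course" "n_students" "data_analytics_minor"
rownames(df)    # "1" "2" "3" "4"
dim(df)         # 4 3: rows and columns together
```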
You can change column and row names:
df2 <- df # create a copy of df, and name it as "df2"
colnames(df2) <- c("col1", "col2", "col3") # assign column names
colnames(df2) # they were "course", "n_students", "data_analytics_minor"
## [1] "col1" "col2" "col3"
rownames(df2) <- c("row1", "row2", "row3", "row4") # assign row names
rownames(df2) # they were "1", "2", "3", "4"
## [1] "row1" "row2" "row3" "row4"
There are many ways you can get values out of a column:
dataframe_name$column_name
## [1] "CIS4730" "CIS4710" "CIS4950" "CIS1234"
## [1] 20 10 40 30
##    course n_students data_analytics_minor
## 1 CIS4730         20                 TRUE
## 2 CIS4710         10                 TRUE
## 3 CIS4950         40                 TRUE
## 4 CIS1234         30                FALSE
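A sketch of several equivalent ways to pull out a column, using the same `df` (rebuilt here so the block is self-contained). Which of these forms the original slides used is not shown, so treat the selection as illustrative.

```r
df <- data.frame(
  course = c("CIS4730", "CIS4710", "CIS4950", "CIS1234"),
  n_students = c(20, 10, 40, 30),
  data_analytics_minor = c(TRUE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

df$course               # by name with $: a character vector
df[["n_students"]]      # by name with double brackets: a numeric vector
df[, "n_students"]      # by name in the column position
df[, 2]                 # by column number

df["n_students"]        # single brackets keep the data frame structure
df[, ]                  # leaving both positions empty returns everything
```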
All of a row
##    course n_students data_analytics_minor
## 2 CIS4710         10                 TRUE
##    course n_students data_analytics_minor
## 1 CIS4730         20                 TRUE
## 2 CIS4710         10                 TRUE
## 3 CIS4950         40                 TRUE
## 4 CIS1234         30                FALSE
Multiple rows
##    course n_students data_analytics_minor
## 1 CIS4730         20                 TRUE
## 3 CIS4950         40                 TRUE
## [1] "CIS4710"
##    course n_students
## 3 CIS4950         40
## 4 CIS1234         30
## [1] 10
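Row and cell selection uses the form `df[rows, columns]`. The sketch below shows index combinations that would give outputs like those above; the exact calls used originally are not shown, so these are assumptions.

```r
df <- data.frame(
  course = c("CIS4730", "CIS4710", "CIS4950", "CIS1234"),
  n_students = c(20, 10, 40, 30),
  data_analytics_minor = c(TRUE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

df[2, ]                               # all of row 2 (empty column position)
df[c(1, 3), ]                         # rows 1 and 3
df[2, 1]                              # row 2, column 1: "CIS4710"
df[3:4, c("course", "n_students")]    # rows 3 to 4, two named columns
df[2, "n_students"]                   # row 2 of n_students: 10
```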
Rows matching a condition:
##    course n_students data_analytics_minor
## 1 CIS4730         20                 TRUE
## 2 CIS4710         10                 TRUE
## 3 CIS4950         40                 TRUE
## 4 CIS1234         30                FALSE
##    course n_students data_analytics_minor
## 1 CIS4730         20                 TRUE
##    course n_students data_analytics_minor
## 1 CIS4730         20                 TRUE
## 2 CIS4710         10                 TRUE
## 3 CIS4950         40                 TRUE
## 4 CIS1234         30                FALSE
## [1] TRUE FALSE FALSE FALSE
df[df$course == 'CIS4730', ] # is interpreted by R as the following
df[c(TRUE, FALSE, FALSE, FALSE), ]
The result is that you get the row(s) where the condition is TRUE.
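The same logical-vector mechanism works for any condition. A short sketch with a numeric condition and a combined condition; the thresholds are arbitrary values chosen for illustration.

```r
df <- data.frame(
  course = c("CIS4730", "CIS4710", "CIS4950", "CIS1234"),
  n_students = c(20, 10, 40, 30),
  data_analytics_minor = c(TRUE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

keep <- df$n_students > 15     # FALSE for row 2 only
big  <- df[keep, ]             # rows 1, 3 and 4

# Combine conditions element-wise with & (and) or | (or)
minor_big <- df[df$data_analytics_minor & df$n_students >= 20, ]  # rows 1 and 3

# which() converts the logical vector into row positions
which(df$course == "CIS4730")  # 1
```

Note the trailing comma inside the brackets: the condition goes in the row position, and the empty column position means "all columns".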
This lab assignment involves 2 tasks (see the next 2 slides). Once you finish the following tasks, please put everything in one single R file with the file name assignment1.R (.R is the file extension) and upload it to iCollege (Lab Assignment 1).
You will lose 50% of the points if you use a different file name or put your code in multiple files.
In addition, lab assignments will be graded based on:
## $name
## [1] "Alex" "Bob" "Claire" "Denise"
##
## $female
## [1] FALSE FALSE TRUE TRUE
##
## $age
## [1] 20 25 30 35
## [1] "Bob"
##         name female age
## row_1   Alex  FALSE  20
## row_2    Bob  FALSE  25
## row_3 Claire   TRUE  30
## row_4 Denise   TRUE  35
## [1] 27.5
## [1] 30