Data frame = a table that stores multiple vectors (variables) as columns.
data.frame() - combine multiple vectors/variables into a data frame.
data.frame ( vector1, vector2, vector3, ... )
General Rules
# generate 4 vectors for the 4 columns
id <- 1:5
id
## [1] 1 2 3 4 5
student <- c( "Mavis" , "Lucy" , "Patrick" , "Greg" , "Dean")
student
## [1] "Mavis" "Lucy" "Patrick" "Greg" "Dean"
gender <- c("Female" , "Female" , "Male" , "Male" , "Male")
gender
## [1] "Female" "Female" "Male" "Male" "Male"
gpa <- c(3.82 , 3.90 , 4.0 , 3.7 , 4.0)
gpa
## [1] 3.82 3.90 4.00 3.70 4.00
# check for the length of these vectors
length(id)
## [1] 5
length(student)
## [1] 5
length(gender)
## [1] 5
length(gpa)
## [1] 5
# generate data frame
student_data <- data.frame(id, student, gender, gpa)
student_data
##   id student gender  gpa
## 1  1   Mavis Female 3.82
## 2  2    Lucy Female 3.90
## 3  3 Patrick   Male 4.00
## 4  4    Greg   Male 3.70
## 5  5    Dean   Male 4.00
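The length checks above matter because data.frame() refuses to combine vectors whose lengths cannot be recycled to a common number of rows. A minimal sketch of that behavior, using made-up vectors `a` and `b` and a made-up data frame `scores` (none of them part of the student example):

```r
# hypothetical vectors of different lengths (5 and 3)
a <- 1:5
b <- c("x", "y", "z")

# data.frame() raises an error here; tryCatch() captures the message as text
result <- tryCatch(
  data.frame(a, b),
  error = function(e) conditionMessage(e)
)
result

# naming the arguments sets the column names explicitly
scores <- data.frame(id = 1:3, score = c(90, 85, 77))
names(scores)
```

Naming the arguments is optional; when unnamed vectors are passed, as in student_data, the column names are taken from the vector names.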
Based on the data frame, gender should be converted into a
factor.
Before converting gender into a factor, we need to access
the column.
data-frame-name $ column-name
# select/access the column.
student_data$gender
## [1] "Female" "Female" "Male" "Male" "Male"
# convert the column gender into a factor and store it back into the same column
student_data$gender <- factor( student_data$gender )
student_data
##   id student gender  gpa
## 1  1   Mavis Female 3.82
## 2  2    Lucy Female 3.90
## 3  3 Patrick   Male 4.00
## 4  4    Greg   Male 3.70
## 5  5    Dean   Male 4.00
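Besides the $ operator, base R offers two other ways to pull a single column out of a data frame. A small sketch on a cut-down data frame (the name `demo` is only for illustration):

```r
demo <- data.frame(
  student = c("Mavis", "Lucy"),
  gender  = c("Female", "Female")
)

demo$gender        # dollar-sign access
demo[["gender"]]   # double-bracket access, handy when the name is stored in a variable
demo[, "gender"]   # row/column indexing with the row index left empty
```

All three return the same vector; $ is the shortest to type, which is probably why these notes use it.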
To understand the difference between the vector gender
and the column gender, we can look at the data type of
each.
# check for the data type for vector gender
typeof(gender)
## [1] "character"
# check for the data type for column gender in the dataframe student_data
typeof(student_data$gender)
## [1] "integer"
# a factor's data type is integer because factors use integer codes to represent the categories
# in this case Female = 1 and Male = 2
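The integer codes behind a factor can be inspected directly. The sketch below rebuilds the gender vector on its own, so it runs without the data frame, and uses the base R functions levels(), as.integer(), and class(). The name `gender_factor` is introduced only for this example:

```r
gender <- c("Female", "Female", "Male", "Male", "Male")
gender_factor <- factor(gender)

levels(gender_factor)      # the categories, sorted alphabetically by default
as.integer(gender_factor)  # the underlying codes: 1 = Female, 2 = Male
class(gender_factor)       # "factor", even though typeof() reports "integer"
```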
To add a new column, follow these steps:
generate a vector of data with the same length as the other columns.
attach this vector to the data frame as a new column.
data-frame-name $ new-column-name <- c( value1, value2, … )
student_data$year <- c("Freshman" , "Sophomore" , "Freshman" , "Senior" , "Junior")
student_data
##   id student gender  gpa      year
## 1  1   Mavis Female 3.82  Freshman
## 2  2    Lucy Female 3.90 Sophomore
## 3  3 Patrick   Male 4.00  Freshman
## 4  4    Greg   Male 3.70    Senior
## 5  5    Dean   Male 4.00    Junior
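A new column can also be computed from an existing one, in which case it automatically has the right length. A sketch on a small copy of the data, where the data frame name `demo`, the column name `honors`, and the 3.9 cutoff are all assumptions made up for this example:

```r
demo <- data.frame(
  student = c("Mavis", "Lucy", "Patrick", "Greg", "Dean"),
  gpa     = c(3.82, 3.90, 4.0, 3.7, 4.0)
)

# logical column: TRUE when gpa is at least 3.9 (hypothetical cutoff)
demo$honors <- demo$gpa >= 3.9
demo$honors

# the data frame now has one more column
ncol(demo)
```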
str() - check for structure of the dataframe/column
# structure for the data frame
str(student_data)
## 'data.frame':    5 obs. of  5 variables:
##  $ id     : int  1 2 3 4 5
##  $ student: chr  "Mavis" "Lucy" "Patrick" "Greg" ...
##  $ gender : Factor w/ 2 levels "Female","Male": 1 1 2 2 2
##  $ gpa    : num  3.82 3.9 4 3.7 4
##  $ year   : chr  "Freshman" "Sophomore" "Freshman" "Senior" ...
# structure for the column gpa
str(student_data$gpa)
## num [1:5] 3.82 3.9 4 3.7 4
table() - generate a frequency (count) table for a factor (categorical vector)
table( student_data$gender )
##
## Female   Male
##      2      3
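Counts can be turned into proportions with prop.table(), another base R function. A short sketch on a standalone copy of the gender factor (the name `counts` is introduced here for illustration):

```r
gender <- factor(c("Female", "Female", "Male", "Male", "Male"))

counts <- table(gender)   # frequency counts per category
counts
prop.table(counts)        # each count divided by the total
```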
mean() - generate the mean for a numeric variable
mean( student_data$gpa )
## [1] 3.884
round() - round to a specified number of decimal places
ceiling() - round UP to the closest whole number
floor() - round DOWN to the closest whole number
round( 3.884 , digits = 2)
## [1] 3.88
round(3.884 , 2)
## [1] 3.88
round( mean(student_data$gpa) , 2)
## [1] 3.88
ceiling(mean(student_data$gpa))
## [1] 4
floor(mean(student_data$gpa))
## [1] 3
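The three functions differ most visibly on negative numbers and when digits is left out. A brief sketch, with `x` as a stand-in value:

```r
x <- 3.884

round(x, 1)   # one decimal place
round(x)      # digits defaults to 0, so this rounds to a whole number
ceiling(-x)   # rounds up, toward positive infinity
floor(-x)     # rounds down, toward negative infinity
```

For a negative value, "up" means closer to zero, so ceiling() and floor() swap the roles they appear to have for positive numbers.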
max() - highest numeric value
max(student_data$gpa)
## [1] 4
min() - lowest numeric value
min(student_data$gpa)
## [1] 3.7
median() - median
median(student_data$gpa)
## [1] 3.9
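Several of these statistics can be requested in one call. A sketch using summary() and range() from base R on a standalone copy of the gpa vector:

```r
gpa <- c(3.82, 3.90, 4.0, 3.7, 4.0)

summary(gpa)   # minimum, quartiles, median, mean, and maximum in one call
range(gpa)     # minimum and maximum as a length-2 vector
```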
quantile() - percentiles for numeric data; works best with a large set of scores.
quantile(student_data$gpa, type = 6)
## 0% 25% 50% 75% 100%
## 3.70 3.76 3.90 4.00 4.00
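quantile() also accepts a probs argument for custom percentiles, and type selects the interpolation method. A sketch with a made-up vector `scores` holding the values 1 to 99 (not data from these notes):

```r
scores <- 1:99

# 10th and 90th percentiles with type 6
q6 <- quantile(scores, probs = c(0.10, 0.90), type = 6)
q6

# same percentiles with the default method (type = 7); the results differ slightly
q7 <- quantile(scores, probs = c(0.10, 0.90))
q7
```

The gap between methods shrinks as the data set grows, which is one reason percentiles are more meaningful on large sets of scores.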
guest <- c("Jennifer" , "David" , "Jack" , "Joanna" , "Victoria")
age <- c(19, 20, 21, 25, 23)
friend_of <- c(1,2,1,2,2)
length(guest)
## [1] 5
length(age)
## [1] 5
length(friend_of)
## [1] 5
wedding <- data.frame(guest , age, friend_of )
wedding
##      guest age friend_of
## 1 Jennifer  19         1
## 2    David  20         2
## 3     Jack  21         1
## 4   Joanna  25         2
## 5 Victoria  23         2
# check for datatype of column age
typeof(wedding$age)
## [1] "double"
# check for datatype of column friend_of
typeof(wedding$friend_of)
## [1] "double"
# factor and label friend_of
wedding$friend_of <- factor(wedding$friend_of, labels = c("Groom", "Bride"))
wedding
##      guest age friend_of
## 1 Jennifer  19     Groom
## 2    David  20     Bride
## 3     Jack  21     Groom
## 4   Joanna  25     Bride
## 5 Victoria  23     Bride
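factor() matches labels to the sorted unique values, which is why 1 became "Groom" and 2 became "Bride". Passing levels explicitly makes that mapping visible. A sketch on a standalone copy of the vector (the name `friend_factor` is introduced only here):

```r
friend_of <- c(1, 2, 1, 2, 2)

friend_factor <- factor(
  friend_of,
  levels = c(1, 2),               # the original codes
  labels = c("Groom", "Bride")    # the label for each code, in the same order
)
friend_factor
table(friend_factor)              # counts per label
```

Spelling out levels guards against a surprise if the codes were ever not in the expected sorted order.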