Data Frame

A data frame is a table that stores multiple vectors (variables) as its columns.

data.frame() - combine multiple vectors/variables into a data frame.

data.frame( vector1, vector2, vector3, ... )

General Rules

  1. Every cell should hold a value; no vector should contain blanks. Represent a missing value with NA (or an agreed placeholder symbol/number).
  2. All column/variable names should be unique.
  3. Each column/variable should carry values of a single data type.
  4. All columns/variables should have the SAME length (i.e., the same number of elements).
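
Rules 1 and 4 can be checked directly in the console. The sketch below uses made-up vectors (`score`, `short_id`) that are not part of the student example. It shows NA standing in for a missing value, and data.frame() rejecting vectors whose lengths cannot be matched.

```r
# hypothetical vectors, for illustration only
score <- c(3.5, NA, 4.0)      # NA marks the missing value
mean(score)                   # NA: missing values propagate
mean(score, na.rm = TRUE)     # 3.75: drop NA before averaging

short_id <- 1:2               # length 2, cannot be matched to length 3
result <- tryCatch(
  data.frame(short_id, score),
  error = function(e) "error: differing number of rows"
)
result                        # the error branch was taken
```

Note that data.frame() recycles a shorter vector when the longer length is an exact multiple of it, so checking lengths yourself is the safer habit.
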
# generate 4 vectors for the 4 columns
id <- 1:5
id
## [1] 1 2 3 4 5
student <- c( "Mavis" , "Lucy" , "Patrick" , "Greg" , "Dean")
student
## [1] "Mavis"   "Lucy"    "Patrick" "Greg"    "Dean"
gender <- c("Female" , "Female" , "Male" , "Male" , "Male")
gender
## [1] "Female" "Female" "Male"   "Male"   "Male"
gpa <- c(3.82 , 3.90 , 4.0 , 3.7 , 4.0)
gpa
## [1] 3.82 3.90 4.00 3.70 4.00
# check for the length of these vectors
length(id)
## [1] 5
length(student)
## [1] 5
length(gender)
## [1] 5
length(gpa)
## [1] 5
# generate data frame
student_data <- data.frame(id, student, gender, gpa)
student_data
##   id student gender  gpa
## 1  1   Mavis Female 3.82
## 2  2    Lucy Female 3.90
## 3  3 Patrick   Male 4.00
## 4  4    Greg   Male 3.70
## 5  5    Dean   Male 4.00

Based on the data frame, gender should be converted into a factor.

Before converting gender into a factor, we need to access the column.

data-frame-name $ column-name
# select/access the column. 
student_data$gender
## [1] "Female" "Female" "Male"   "Male"   "Male"
# convert the column gender into factor and store it back into the same column
student_data$gender <- factor( student_data$gender )

student_data
##   id student gender  gpa
## 1  1   Mavis Female 3.82
## 2  2    Lucy Female 3.90
## 3  3 Patrick   Male 4.00
## 4  4    Greg   Male 3.70
## 5  5    Dean   Male 4.00

To understand the difference between the vector gender and the column gender, we can look at the data type of each.

# check for the data type for vector gender
typeof(gender)
## [1] "character"
# check for the data type for column gender in the dataframe student_data
typeof(student_data$gender)
## [1] "integer"
# a factor's data type is integer because factors use integer codes to represent the categories
# in this case Female = 1 and Male = 2
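
To see the integer codes and the category labels side by side, the following minimal sketch rebuilds the gender factor on its own (so it runs without the data frame) and inspects it with base R functions.

```r
gender <- factor(c("Female", "Female", "Male", "Male", "Male"))

typeof(gender)      # "integer": how the values are stored
class(gender)       # "factor": how R treats the object
levels(gender)      # "Female" "Male": the category labels, in sorted order
as.integer(gender)  # 1 1 2 2 2: the underlying codes
```

typeof() reports the storage type, while class() is usually the more useful check when deciding whether a column is categorical.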

To add a new column, assign a vector of the same length to a new column name:

student_data$year <-  c("Freshman" , "Sophomore" , "Freshman" , "Senior" , "Junior")

student_data
##   id student gender  gpa      year
## 1  1   Mavis Female 3.82  Freshman
## 2  2    Lucy Female 3.90 Sophomore
## 3  3 Patrick   Male 4.00  Freshman
## 4  4    Greg   Male 3.70    Senior
## 5  5    Dean   Male 4.00    Junior
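
A new column can also be computed from an existing one. The sketch below uses a trimmed copy of the student data and a hypothetical column name, `high_gpa`, chosen only for illustration. It also shows that assigning a vector of the wrong length fails.

```r
# trimmed copy of the student data, so the example is self-contained
student_data <- data.frame(
  id  = 1:5,
  gpa = c(3.82, 3.90, 4.0, 3.7, 4.0)
)

# hypothetical derived column: TRUE when gpa is at least 3.9
student_data$high_gpa <- student_data$gpa >= 3.9
student_data$high_gpa   # FALSE TRUE TRUE FALSE TRUE
ncol(student_data)      # 3

# a vector with 2 values cannot fill 5 rows
bad <- tryCatch(
  { student_data$year <- c("Freshman", "Senior"); "assigned" },
  error = function(e) "error: wrong length"
)
bad                     # the error branch was taken
```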

DESCRIPTIVE STATISTICS FUNCTIONS

str() - check the structure of a data frame or column

# structure for the data frame
str(student_data)
## 'data.frame':    5 obs. of  5 variables:
##  $ id     : int  1 2 3 4 5
##  $ student: chr  "Mavis" "Lucy" "Patrick" "Greg" ...
##  $ gender : Factor w/ 2 levels "Female","Male": 1 1 2 2 2
##  $ gpa    : num  3.82 3.9 4 3.7 4
##  $ year   : chr  "Freshman" "Sophomore" "Freshman" "Senior" ...
# structure for the column gpa
str(student_data$gpa)
##  num [1:5] 3.82 3.9 4 3.7 4

table() - generate frequency counts for a factor (categorical vector)

table( student_data$gender )
## 
## Female   Male 
##      2      3
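
Counts are often easier to read as proportions. As a small sketch, base R's prop.table() converts the result of table() into shares of the total. The gender factor is rebuilt here so the example runs on its own.

```r
gender <- factor(c("Female", "Female", "Male", "Male", "Male"))

counts <- table(gender)
counts                        # Female 2, Male 3
prop.table(counts)            # Female 0.4, Male 0.6
prop.table(counts) * 100      # the same shares as percentages
```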

mean() - generate the mean for a numeric variable

mean( student_data$gpa )
## [1] 3.884

round() - round to a specified number of decimal places
ceiling() - round UP to the nearest whole number
floor() - round DOWN to the nearest whole number

round( 3.884 , digits = 2)
## [1] 3.88
round(3.884 , 2)
## [1] 3.88
round( mean(student_data$gpa) , 2)
## [1] 3.88
ceiling(mean(student_data$gpa))  
## [1] 4
floor(mean(student_data$gpa))
## [1] 3
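
Two edge cases are worth trying by hand: values exactly halfway between two whole numbers, and negative values. The sketch below uses arbitrary numbers that are not from the student data.

```r
# halves are rounded to the nearest EVEN whole number
round(2.5)      # 2, not 3
round(3.5)      # 4
round(-1.5)     # -2

# "up" and "down" refer to the number line, not to magnitude
ceiling(-3.2)   # -3
floor(-3.2)     # -4
```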

max() - highest numeric value

max(student_data$gpa)
## [1] 4

min() - lowest numeric value

min(student_data$gpa)
## [1] 3.7

median() - median

median(student_data$gpa)
## [1] 3.9

quantile() - percentiles (quantiles) of numeric data; most informative on larger data sets.

quantile(student_data$gpa, type = 6)
##   0%  25%  50%  75% 100% 
## 3.70 3.76 3.90 4.00 4.00
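
quantile() takes a probs argument to request specific percentiles and a type argument that selects the interpolation rule. With only five values the choice of type changes the answer, as this sketch on a standalone copy of the GPA values suggests.

```r
gpa <- c(3.82, 3.90, 4.0, 3.7, 4.0)

q7 <- quantile(gpa, probs = 0.25)            # default rule, type = 7
q6 <- quantile(gpa, probs = 0.25, type = 6)  # rule used in these notes
q7   # 25%: 3.82
q6   # 25%: 3.76
```

If quantile() returns NA for every percentile, a likely cause is a misspelled column name: `$` returns NULL for a column that does not exist.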

Practice

guest <- c("Jennifer" , "David" , "Jack" , "Joanna" , "Victoria")
age <- c(19, 20, 21, 25, 23) 
friend_of <- c(1,2,1,2,2)

length(guest)
## [1] 5
length(age)
## [1] 5
length(friend_of)
## [1] 5
wedding <- data.frame(guest , age, friend_of )
wedding
##      guest age friend_of
## 1 Jennifer  19         1
## 2    David  20         2
## 3     Jack  21         1
## 4   Joanna  25         2
## 5 Victoria  23         2
# check for datatype of column age
typeof(wedding$age)
## [1] "double"
# check for datatype of column friend_of
typeof(wedding$friend_of)
## [1] "double"
# convert friend_of into a factor and label its codes: 1 = Groom, 2 = Bride
wedding$friend_of <- factor(wedding$friend_of, labels = c("Groom", "Bride"))

wedding
##      guest age friend_of
## 1 Jennifer  19     Groom
## 2    David  20     Bride
## 3     Jack  21     Groom
## 4   Joanna  25     Bride
## 5 Victoria  23     Bride
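
As a possible next step for the practice data, the descriptive functions from earlier can be applied per group. This sketch rebuilds the wedding data frame in one call and uses base R's tapply() to average age within each level of friend_of.

```r
wedding <- data.frame(
  guest     = c("Jennifer", "David", "Jack", "Joanna", "Victoria"),
  age       = c(19, 20, 21, 25, 23),
  friend_of = factor(c(1, 2, 1, 2, 2), labels = c("Groom", "Bride"))
)

table(wedding$friend_of)                        # Groom 2, Bride 3
group_mean <- tapply(wedding$age, wedding$friend_of, mean)
round(group_mean, 2)                            # Groom 20, Bride 22.67
```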