Data Frame

A data frame is a table that stores multiple vectors (variables) as columns.

data.frame() - combine multiple vectors/variables into a data frame.

data.frame ( vector1, vector2, vector3, ... )

General Rules

Every cell should hold a value, and no vector should contain blanks. Represent missing values with NA (or another agreed-upon symbol or number).
All column/variable names should be unique.
Each column/variable should hold values of a single data type.
All columns/variables should have the SAME length (that is, the same number of elements).
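
The rules above can be illustrated with a minimal sketch. The vectors `name` and `score` are hypothetical and not part of the student example; the error text returned by the handler is also made up for illustration.

```r
# Hypothetical vectors, not part of the student example
name  <- c("Ana", "Ben", "Cara")
score <- c(90, NA, 85)          # NA marks the missing score

scores <- data.frame(name, score)
nrow(scores)                    # one row per element
is.na(scores$score)             # TRUE only where the value is missing

# 3 names and 2 scores cannot be combined: data.frame() stops with an error,
# which tryCatch() turns into a short message here
result <- tryCatch(
  data.frame(name, score = c(90, 85)),
  error = function(e) "error: differing number of rows"
)
result
```

Checking length() on every vector first, as done below, avoids this error.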

# generate 4 vectors for the 4 columns
id <- 1:5
id

## [1] 1 2 3 4 5

student <- c( "Mavis" , "Lucy" , "Patrick" , "Greg" , "Dean")
student

## [1] "Mavis"   "Lucy"    "Patrick" "Greg"    "Dean"

gender <- c("Female" , "Female" , "Male" , "Male" , "Male")
gender

## [1] "Female" "Female" "Male"   "Male"   "Male"

gpa <- c(3.82 , 3.90 , 4.0 , 3.7 , 4.0)
gpa

## [1] 3.82 3.90 4.00 3.70 4.00

# check for the length of these vectors
length(id)

## [1] 5

length(student)

## [1] 5

length(gender)

## [1] 5

length(gpa)

## [1] 5

# generate data frame
student_data <- data.frame(id, student, gender, gpa)
student_data

##   id student gender  gpa
## 1  1   Mavis Female 3.82
## 2  2    Lucy Female 3.90
## 3  3 Patrick   Male 4.00
## 4  4    Greg   Male 3.70
## 5  5    Dean   Male 4.00

Based on the data frame, gender should be converted into a factor.

Before converting gender into a factor, we need to access the column.

data-frame-name $ column-name

# select/access the column. 
student_data$gender

## [1] "Female" "Female" "Male"   "Male"   "Male"

# convert the column gender into a factor and store it back into the same column
student_data$gender <- factor( student_data$gender )

student_data

##   id student gender  gpa
## 1  1   Mavis Female 3.82
## 2  2    Lucy Female 3.90
## 3  3 Patrick   Male 4.00
## 4  4    Greg   Male 3.70
## 5  5    Dean   Male 4.00

To understand the difference between the vector gender and the column gender, we can look at the data type of each.

# check for the data type for vector gender
typeof(gender)

## [1] "character"

# check for the data type for column gender in the dataframe student_data
typeof(student_data$gender)

## [1] "integer"

# a factor's typeof() is integer because a factor stores each category as an integer code
# in this case Female = 1 and Male = 2
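
To see the integer codes behind a factor, a short sketch may help. It rebuilds the gender vector so it runs on its own; `gender_factor` is a name introduced here for illustration.

```r
gender <- c("Female", "Female", "Male", "Male", "Male")
gender_factor <- factor(gender)   # illustrative name, not used elsewhere

class(gender_factor)       # "factor": how R treats the object
typeof(gender_factor)      # "integer": how the values are stored
levels(gender_factor)      # "Female" "Male", sorted alphabetically by default
as.integer(gender_factor)  # 1 1 2 2 2, the codes behind the labels
```

class() reports what the object is, while typeof() reports how it is stored, which is why the two answers differ for a factor.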

To add a new column, follow these steps:

Generate a vector of data with the same length as the other columns.
Attach this vector to the data frame as a new column.

data-frame-name $ new-column-name <- c( value1, value2, … )

student_data$year <-  c("Freshman" , "Sophomore" , "Freshman" , "Senior" , "Junior")

student_data

##   id student gender  gpa      year
## 1  1   Mavis Female 3.82  Freshman
## 2  2    Lucy Female 3.90 Sophomore
## 3  3 Patrick   Male 4.00  Freshman
## 4  4    Greg   Male 3.70    Senior
## 5  5    Dean   Male 4.00    Junior
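
As a hedged sketch of the same steps, the new vector's length can be compared with the number of rows before it is attached. The data frame here is a trimmed-down copy with only two columns so the block runs on its own.

```r
# Trimmed-down copy of the student data (two columns only, for illustration)
student_data <- data.frame(id = 1:5, gpa = c(3.82, 3.90, 4.0, 3.7, 4.0))
year <- c("Freshman", "Sophomore", "Freshman", "Senior", "Junior")

# Confirm the new vector has one value per row before attaching it
length(year) == nrow(student_data)

student_data$year <- year
names(student_data)   # "id" "gpa" "year"
```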

DESCRIPTIVE STATISTICS FUNCTIONS

str() - check for structure of the dataframe/column

# structure for the data frame
str(student_data)

## 'data.frame':    5 obs. of  5 variables:
##  $ id     : int  1 2 3 4 5
##  $ student: chr  "Mavis" "Lucy" "Patrick" "Greg" ...
##  $ gender : Factor w/ 2 levels "Female","Male": 1 1 2 2 2
##  $ gpa    : num  3.82 3.9 4 3.7 4
##  $ year   : chr  "Freshman" "Sophomore" "Freshman" "Senior" ...

# structure for the column gpa
str(student_data$gpa)

##  num [1:5] 3.82 3.9 4 3.7 4

table() - generate frequency counts for a factor (categorical vector)

table( student_data$gender )

## 
## Female   Male 
##      2      3

mean() - generate the mean for a numeric variable

mean( student_data$gpa )

## [1] 3.884

round() - round to a specified number of decimal places
ceiling() - round UP to the closest whole number
floor() - round DOWN to the closest whole number

round( 3.884 , digits = 2)

## [1] 3.88

round(3.884 , 2)

## [1] 3.88

round( mean(student_data$gpa) , 2)

## [1] 3.88

ceiling(mean(student_data$gpa))

## [1] 4

floor(mean(student_data$gpa))

## [1] 3

max() - highest numeric value

max(student_data$gpa)

## [1] 4

min() - lowest numeric value

min(student_data$gpa)

## [1] 3.7

median() - median

median(student_data$gpa)

## [1] 3.9

quantile() - percentiles of numeric data; most informative with a large set of scores.

quantile(student_data$gpa, type = 6)

##   0%  25%  50%  75% 100% 
## 3.70 3.76 3.90 4.00 4.00
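
As a small sketch, quantile() can also return specific percentiles through its probs argument. The gpa vector is rebuilt here so the block runs on its own; `q50` is a name introduced for illustration.

```r
gpa <- c(3.82, 3.90, 4.0, 3.7, 4.0)

# Default percentiles (0%, 25%, 50%, 75%, 100%) using the type 6 method
quantile(gpa, type = 6)

# Request specific percentiles with probs
quantile(gpa, probs = c(0.10, 0.90), type = 6)

# The 50th percentile agrees with median()
q50 <- quantile(gpa, probs = 0.5, type = 6)
unname(q50) == median(gpa)
```

The column passed to quantile() must exist in the data frame; a misspelled column name returns NULL from `$`, and quantile() then reports NA for every percentile.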

Practice

guest <- c("Jennifer" , "David" , "Jack" , "Joanna" , "Victoria")
age <- c(19, 20, 21, 25, 23) 
friend_of <- c(1,2,1,2,2)

length(guest)

## [1] 5

length(age)

## [1] 5

length(friend_of)

## [1] 5

wedding <- data.frame(guest , age, friend_of )
wedding

##      guest age friend_of
## 1 Jennifer  19         1
## 2    David  20         2
## 3     Jack  21         1
## 4   Joanna  25         2
## 5 Victoria  23         2

# check for datatype of column age
typeof(wedding$age)

## [1] "double"

# check for datatype of column friend_of
typeof(wedding$friend_of)

## [1] "double"

# factor and label friend_of
wedding$friend_of <- factor(wedding$friend_of, labels = c("Groom", "Bride"))

wedding

##      guest age friend_of
## 1 Jennifer  19     Groom
## 2    David  20     Bride
## 3     Jack  21     Groom
## 4   Joanna  25     Bride
## 5 Victoria  23     Bride
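
In the call above, labels are matched to the sorted unique values, so 1 becomes Groom and 2 becomes Bride. A sketch that spells out the mapping with the levels argument is shown here; `friend_factor` is a name introduced for illustration.

```r
friend_of <- c(1, 2, 1, 2, 2)

# Spelling out levels makes the code-to-label mapping explicit:
# 1 becomes "Groom", 2 becomes "Bride"
friend_factor <- factor(friend_of, levels = c(1, 2), labels = c("Groom", "Bride"))

as.character(friend_factor)  # "Groom" "Bride" "Groom" "Bride" "Bride"
table(friend_factor)         # counts per label
```

Writing levels explicitly guards against labels being attached to the wrong codes if the data ever arrive in a different form.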

Data Frame and Descriptive Statistics

Hanh

2023-02-15
