STEM Research

1.3 R DataFrames

We will learn the following:

1.3.1. What is an R DataFrame?
1.3.2. Creating R DataFrames
1.3.3. Common DataFrame methods

1.3.1 What is an R DataFrame?

An R DataFrame is a two-dimensional, tabular data structure in the R programming language. It resembles a spreadsheet or a SQL table: rows represent observations and columns represent variables or attributes. Each column holds values of a single type (for example numeric, character, or factor), but different columns can hold different types. This structure is widely used for organizing and manipulating structured data in R, and for data analysis and statistical operations.
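To make the point about column types concrete, here is a minimal sketch. The data frame `people` and all of its columns and values are made up for illustration; they are not part of the dataset used later in this section.

```r
# Hypothetical three-row data frame; names and values are illustrative only.
people <- data.frame(
    name = c("Ann", "Ben", "Cara"),         # character column
    age = c(32L, 46L, 43L),                 # integer column
    height = c(1.62, 1.80, 1.71),           # numeric (double) column
    smoker = factor(c("No", "Yes", "No")),  # factor column
    stringsAsFactors = FALSE                # keep `name` as character in older R versions
)

# Each column has exactly one class, but classes differ across columns.
col_classes <- sapply(people, class)
print(col_classes)
```

Each entry of `col_classes` is the class of one column, which is a quick way to confirm that types vary between columns but never within one.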

Features of a DataFrame

Homogeneous Columns
Homogeneous or Heterogeneous Rows
Cells
Labeled Axes
- Row Index
- Column Index

This is illustrated in the figure below.
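The same features can also be checked directly in code. The following is a minimal sketch on a hypothetical data frame called `scores`; its columns and values are assumptions made only for this example.

```r
# Hypothetical data frame used only to illustrate the features listed above.
scores <- data.frame(
    student = c("Ann", "Ben", "Cara"),
    score = c(81.5, 74.0, 90.2),
    stringsAsFactors = FALSE
)

# Labeled axes: a row index and a column index.
row_labels <- rownames(scores)
col_labels <- colnames(scores)

# A cell is addressed by its row and its column.
cell <- scores[2, "score"]

# A column is homogeneous: every value in it shares one type.
column_type <- class(scores$score)

# A row can be heterogeneous: it spans columns of different types.
row_types <- sapply(scores[1, ], class)
print(row_types)
```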

1.3.2 Creating DataFrames

Creating DataFrames is a foundational skill in R. You can create a DataFrame directly with the data.frame() function, which combines vectors of different types into a single table, one vector per column.

We will learn the following:

1.3.2.1 From vectors
1.3.2.2 Using R editor

1.3.2.1 Create DataFrames from vectors

STEP I: Create vectors for each of the columns in the DataFrame.

id <- 101:110
gender <- c(
    "Female", "Male", "Female", "Female", "Male", 
    "Female", "Male", "Male", "Female", "Male"
)
education <- c(
    "Masters", "Bachelors", "Doctoral", "Masters", "Bachelors",
    "Doctoral", "Bachelors", "Bachelors", "Masters", "Doctoral"
)
age <- c(32, 46, 43, 35, 36, 44, 30, 42, 30, 28)
income <- c(
    528.87, 422.02, 466.81, 598.32, 510.16, 
    517.48, 399.02, 546.79, 467.69, 551.13
)

STEP II: Use the data.frame() function to create a DataFrame from the above vectors.

df <- data.frame(
    id, gender, education, age, income
)
View(df)
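Two details of data.frame() are worth a short sketch: columns can be given explicit names with `name = vector` pairs, and vectors whose lengths do not match raise an error. The names `df_small` and `respondent_id` below are hypothetical and chosen only for this example.

```r
# Shortened versions of the vectors above, three values each.
id <- 101:103
gender <- c("Female", "Male", "Female")
age <- c(32, 46, 43)

# Columns can be named at creation time with name = vector pairs.
df_small <- data.frame(
    respondent_id = id,
    gender = gender,
    age = age,
    stringsAsFactors = FALSE
)

# Vectors with incompatible lengths (3 and 2 here) raise an error,
# which tryCatch() converts into a short message for this sketch.
result <- tryCatch(
    data.frame(id = 101:103, age = c(32, 46)),
    error = function(e) "length mismatch"
)
print(result)
```

Note that View() opens a viewer window and is intended for interactive sessions; in a script, print(df_small) or head(df_small) shows the same data in the console.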

1.3.2.2 Create DataFrames using R editor

A DataFrame can also be built by typing values into a spreadsheet-like editor in R. First, use the data.frame() function to initialize an empty DataFrame with the required columns and data types. Then call the fix() function to open the editor on that DataFrame.

STEP I: Initialize the DataFrame columns

df <- data.frame(
    id = integer(),
    gender = character(),
    education = character(),
    age = integer(),
    income = double()
)

STEP II: Use fix() to open the R editor

fix(df)

STEP III: Enter the values

Enter the values in the spreadsheet that pops up. To save the data, simply close the window.
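The editor requires an interactive session. Where that is not available, for example in a script, one alternative is to append rows to the initialized DataFrame with rbind(). This is a sketch of that swapped-in technique, not the editor workflow itself; the names `df_manual` and `new_row` are hypothetical.

```r
# Initialize an empty data frame with the same columns as above.
df_manual <- data.frame(
    id = integer(),
    gender = character(),
    education = character(),
    age = integer(),
    income = double(),
    stringsAsFactors = FALSE
)

# One row of values, with matching column names and types.
new_row <- data.frame(
    id = 101L,
    gender = "Female",
    education = "Masters",
    age = 32L,
    income = 528.87,
    stringsAsFactors = FALSE
)

# Append the row; repeat for each additional observation.
df_manual <- rbind(df_manual, new_row)
print(df_manual)
```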

1.3.3 Common DataFrame methods

In this section we look at functions that quickly extract essential information from a DataFrame.

Load dataset

df <- readRDS(file = "~/STEMResearch/datasets/hh_income.RDS")
View(df)
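If the hh_income.RDS file is not at hand, the way readRDS() works can still be tried with a round trip through a temporary file. This is a minimal sketch: the data frame `households` and the path `rds_path` are hypothetical, and only the income values are borrowed from the vectors created earlier.

```r
# Small stand-in data frame; not the actual hh_income dataset.
households <- data.frame(
    id = 1:3,
    income = c(528.87, 422.02, 466.81)
)

# Write the object to a temporary .RDS file, then read it back.
rds_path <- tempfile(fileext = ".RDS")
saveRDS(households, file = rds_path)
restored <- readRDS(file = rds_path)

# The restored object should match the original.
same <- isTRUE(all.equal(households, restored))
print(same)
```

Unlike formats such as CSV, an RDS file stores a single R object together with its attributes, so column types survive the round trip.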

We will explore the following aspects of the DataFrame:

Dimensions

dim(df)

## [1] 10  9

# number of rows
nrow(df) # equivalent to dim(df)[1]

## [1] 10

# number of columns
ncol(df) # equivalent to dim(df)[2]

## [1] 9

Attributes

attributes(df)

## $names
## [1] "id"          "race"        "gender"      "married"     "education"  
## [6] "age"         "income"      "expenditure" "ses"        
## 
## $row.names
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $class
## [1] "tbl_df"     "tbl"        "data.frame"

Column names

colnames(df)

## [1] "id"          "race"        "gender"      "married"     "education"  
## [6] "age"         "income"      "expenditure" "ses"

Row names

rownames(df)

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

row.names(df)

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

Adding a comment

comment(df) <- "Household income for civil servants"
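A comment is stored as an attribute and is not shown when the DataFrame is printed, so it has to be retrieved explicitly. The sketch below uses a hypothetical stand-in data frame, `notes_df`, rather than the loaded dataset.

```r
# Stand-in data frame; values are borrowed from the vectors created earlier.
notes_df <- data.frame(id = 1:2, income = c(528.87, 422.02))

# Attach the comment.
comment(notes_df) <- "Household income for civil servants"

# Retrieve it with comment(), or through the generic attribute accessor.
stored <- comment(notes_df)
same_attr <- attr(notes_df, "comment")
print(stored)
```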

First n observations

head(df, n = 2)

##   id  race gender married education age income expenditure  ses
## 1  1 White Female     Yes   Masters  32 528.87      417.77 High
## 2  2 Other   Male     Yes Bachelors  46 422.02      501.42 High

Last n observations

tail(df, n = 3)

##    id  race gender married education age income expenditure  ses
## 8   8 Black   Male      No Bachelors  42 546.79      393.82  Low
## 9   9 Other Female     Yes   Masters  30 467.69      460.62  Low
## 10 10 Black   Male     Yes  Doctoral  28 551.13      353.95 High

Data types

str(df)

## Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  9 variables:
##  $ id         : int  1 2 3 4 5 6 7 8 9 10
##  $ race       : Ord.factor w/ 3 levels "White"<"Black"<..: 1 3 1 2 2 3 1 2 3 2
##  $ gender     : chr  "Female" "Male" "Female" "Female" ...
##  $ married    : chr  "Yes" "Yes" "No" "No" ...
##  $ education  : Ord.factor w/ 3 levels "Bachelors"<"Masters"<..: 2 1 3 2 1 3 1 1 2 3
##  $ age        : int  32 46 43 35 36 44 30 42 30 28
##  $ income     : num  529 422 467 598 510 ...
##  $ expenditure: num  418 501 336 454 340 ...
##  $ ses        : Ord.factor w/ 3 levels "Low"<"Middle"<..: 3 3 2 1 2 2 1 1 1 3
##  - attr(*, "comment")= chr "Household income for civil servants"

library(dplyr)

glimpse(df)

## Rows: 10
## Columns: 9
## $ id          <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
## $ race        <ord> White, Other, White, Black, Black, Other, White, Black, Ot…
## $ gender      <chr> "Female", "Male", "Female", "Female", "Male", "Female", "M…
## $ married     <chr> "Yes", "Yes", "No", "No", "Yes", "Yes", "Yes", "No", "Yes"…
## $ education   <ord> Masters, Bachelors, Doctoral, Masters, Bachelors, Doctoral…
## $ age         <int> 32, 46, 43, 35, 36, 44, 30, 42, 30, 28
## $ income      <dbl> 528.87, 422.02, 466.81, 598.32, 510.16, 517.48, 399.02, 54…
## $ expenditure <dbl> 417.77, 501.42, 336.07, 454.08, 340.33, 372.84, 348.82, 39…
## $ ses         <ord> High, High, Middle, Low, Middle, Middle, Low, Low, Low, Hi…

Summarize

summary(df[, 6:9])

##       age            income       expenditure        ses   
##  Min.   :28.00   Min.   :399.0   Min.   :336.1   Low   :4  
##  1st Qu.:30.50   1st Qu.:467.0   1st Qu.:350.1   Middle:3  
##  Median :35.50   Median :513.8   Median :383.3   High  :3  
##  Mean   :36.60   Mean   :500.8   Mean   :398.0             
##  3rd Qu.:42.75   3rd Qu.:542.3   3rd Qu.:445.0             
##  Max.   :46.00   Max.   :598.3   Max.   :501.4
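Selecting columns by position, as in df[, 6:9], depends on the column order of the file. A minimal sketch of selecting by name instead is shown below. It rebuilds only the age and income columns from the vectors created in 1.3.2.1; the name `df_num` is hypothetical.

```r
# Age and income values as defined in section 1.3.2.1.
age <- c(32, 46, 43, 35, 36, 44, 30, 42, 30, 28)
income <- c(
    528.87, 422.02, 466.81, 598.32, 510.16,
    517.48, 399.02, 546.79, 467.69, 551.13
)
df_num <- data.frame(age, income)

# Select columns by name rather than by position.
print(summary(df_num[, c("age", "income")]))

# Individual statistics can also be computed column by column.
age_stats <- c(mean = mean(df_num$age), median = median(df_num$age))
print(age_stats)  # mean 36.6, median 35.5, matching the summary above
```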