Understand

Understanding the data structures and variable types in a data set is also crucial for data preprocessing.

We shouldn’t perform any data preprocessing without first understanding what we have in hand.

Types of Variables

Nominal Variable

A nominal variable has a scale in which the numbers or letters assigned to objects serve only as labels for identification or classification.

Examples include binary variables (e.g., yes/no, male/female) and multinomial variables (e.g., religious affiliation, eye colour, ethnicity, suburb).

Ordinal Variable

An ordinal variable has a scale that arranges objects or alternatives according to their ranking. Examples include exam grades (e.g., HD, DI, Credit, Pass, Fail) and disease severity (e.g., severe, moderate, mild).

Quantitative Variable

Quantitative variables are numerical data that we can either measure or count. They can be either discrete or continuous.

Data Structures in R

Logical

This class consists of TRUE or FALSE (binary) values. A logical value is often created via comparison between variables.
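
As a minimal sketch of how a comparison yields logical values, consider the snippet below. The names `ages` and `is_adult` and the numbers are made up for illustration.

```r
# Hypothetical values, used only for illustration
ages <- c(15, 22, 37)

# A comparison returns one TRUE/FALSE value per element
is_adult <- ages >= 18

class(is_adult)  # "logical"
sum(is_adult)    # TRUE counts as 1, so this counts the TRUE values
```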

Numeric (integer or double)

Quantitative values are called numerics in R, and numeric is the default computational data type. A numeric can be an integer or a double: integers hold discrete values (e.g., 2), whereas doubles hold floating-point numbers (e.g., 2.16).
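
The difference between the two numeric types can be checked directly. This is a small sketch; the variable names `x`, `y` and `z` are arbitrary.

```r
x <- 2      # a bare number is stored as a double by default
y <- 2L     # the L suffix requests an integer
z <- 2.16   # a floating-point value

typeof(x)      # "double"
typeof(y)      # "integer"
class(x)       # "numeric"
class(y)       # "integer"
is.numeric(y)  # TRUE: integers also count as numeric
```

Note that typing `2` without the `L` suffix gives a double, even though the value is a whole number.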

Character

A character class is used to represent string values in R. The most basic way to generate a character object is to use quotation marks " " and assign a string/text to an object.

Factor

Factor class is used to represent qualitative data in R. Factors can be ordered or unordered. Factors store the nominal values as a vector of integers in the range [1…k] (where k is the number of unique values in the nominal variable), and an internal vector of character strings (the original values) mapped to these integers.
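
The integer codes and the character levels can be inspected separately, as in the sketch below. The variable `eye_colour` and its values are hypothetical.

```r
# Hypothetical nominal variable
eye_colour <- factor(c("brown", "blue", "brown", "green"))

levels(eye_colour)      # unique values, sorted alphabetically by default
as.integer(eye_colour)  # integer codes that point into levels()
nlevels(eye_colour)     # k, the number of unique values
```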

Exploring Dataframes

Check the head of the dataframe. In the calls below, df is a placeholder for your own data frame; if no such object exists, R resolves df to the built-in F-distribution density function and prints its source instead.

# head(df)

Select a column

# df$col

Check levels of a factored variable

# levels(df$col)

Check the type of a column

# class(df$col)

Check the structure of a dataframe

# str(df)

Check the attributes of a dataframe

# attributes(df)

Check dimensions of a dataframe

# dim(df)
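
To see these functions on real input, here is a minimal sketch using a small made-up data frame. The object name `transactions` and all values are assumptions for illustration; the column names are borrowed from the naming example later in these notes.

```r
# Made-up data for illustration only
transactions <- data.frame(
  number      = 1:3,
  card_type   = factor(c("credit", "debit", "credit")),
  fraud       = c(FALSE, TRUE, FALSE),
  transaction = c(25.5, 310.0, 12.75),
  state       = c("VIC", "NSW", "VIC"),
  stringsAsFactors = FALSE
)

head(transactions, 2)            # first two rows
transactions$card_type           # select one column
levels(transactions$card_type)   # "credit" "debit"
class(transactions$transaction)  # "numeric"
str(transactions)                # compact summary of every column
attributes(transactions)         # names, class and row.names
dim(transactions)                # 3 rows, 5 columns
```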

Converting Data Types

The is. functions test for the given data type and return a logical value (TRUE or FALSE).

# is.character(char_vec)

The as. functions convert an object to the given type (whenever possible).

# as.numeric(char_vec)
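
The sketch below shows a test followed by a conversion, including what "whenever possible" means in practice. The contents of `char_vec` are made up, and `num_vec` is a name introduced here for illustration.

```r
char_vec <- c("1", "2.5", "abc")   # hypothetical character vector

is.character(char_vec)  # TRUE
is.numeric(char_vec)    # FALSE

# Conversion succeeds where the text parses as a number;
# "abc" cannot be parsed, so it becomes NA (R also raises a warning,
# which is suppressed here to keep the output clean)
num_vec <- suppressWarnings(as.numeric(char_vec))
num_vec
is.na(num_vec)          # FALSE FALSE TRUE
```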

Convert a column to a factored variable.

# df$col <- as.factor(df$col)

Create factored variable

# vect_fact3<-factor(c("very low","low","medium","high","very high"),
#                  levels=c("very low","low","medium","high","very high"))

Create ordered factored variable

# vect_fact3<-factor(c("very low","low","medium","high","very high"),
#                  levels=c("very low","low","medium","high","very high"),
#                  ordered=TRUE)

Create alternative labels for ordered factored variable

# vect_fact3<-factor(c("very low","low","medium","high","very high"),
#                  levels=c("very low","low","medium","high","very high"),
#                  labels=c("Very Low","Low","Medium","High","Very High"),
#                  ordered=TRUE)
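
One reason to order a factor is that its values can then be compared by rank. A minimal sketch, assuming a hypothetical vector `ratings` with the same levels and labels as above:

```r
ratings <- factor(c("low", "very high", "medium"),
                  levels  = c("very low", "low", "medium", "high", "very high"),
                  labels  = c("Very Low", "Low", "Medium", "High", "Very High"),
                  ordered = TRUE)

levels(ratings)          # the labels, in the order given by levels
ratings[1] < ratings[2]  # TRUE: "Low" ranks below "Very High"
max(ratings)             # Very High
is.ordered(ratings)      # TRUE
```

With an unordered factor, the `<` comparison is not meaningful and returns NA with a warning.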

Combining Rows and Columns

You can only combine rows if they have the same number of columns (and, for data frames, the same column names).

# combined_rows <- rbind(row1, row2)

You can only combine columns if they have the same number of rows.

# combined_cols <- cbind(col1, col2)
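
Both operations can be sketched with small data frames. The objects `a`, `b`, `extra` and `mismatch` and their values are hypothetical.

```r
# Two made-up data frames with identical columns
a <- data.frame(id = 1:2, score = c(90, 85))
b <- data.frame(id = 3:4, score = c(70, 60))

combined_rows <- rbind(a, b)   # stacks to 4 rows, 2 columns

# A made-up extra column with one value per row of combined_rows
extra <- data.frame(passed = c(TRUE, TRUE, TRUE, FALSE))

combined_cols <- cbind(combined_rows, extra)  # 4 rows, 3 columns

# Binding 4 rows to 3 rows fails; tryCatch captures the error
mismatch <- tryCatch(cbind(combined_rows, data.frame(x = 1:3)),
                     error = function(e) "error")
mismatch
```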

Naming Rows and Columns

Adding row names

# rownames(df) <- c("subj1", "subj2", "subj3")

Adding column names

# colnames(df) <- c("number", "card_type", "fraud", "transaction", "state")
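
A minimal sketch of both assignments on a small made-up data frame (the name `df_example` is an assumption). The number of names supplied has to match the number of rows or columns.

```r
# Made-up 3 x 2 data frame; its columns are named X1 and X2 by default
df_example <- data.frame(matrix(1:6, nrow = 3))

rownames(df_example) <- c("subj1", "subj2", "subj3")
colnames(df_example) <- c("number", "transaction")

df_example
```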

Subsetting a Dataframe

Data frames possess the characteristics of both lists and matrices.

Therefore, if you subset with a single vector, they behave like lists and will return the selected columns with all rows.

If you subset with two vectors, they behave like matrices and can be subset by row and column.
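
The two behaviours can be compared side by side. In this sketch the data frame `demo` and its values are made up for illustration.

```r
# Made-up data frame
demo <- data.frame(number      = 1:3,
                   fraud       = c(FALSE, TRUE, FALSE),
                   transaction = c(25.5, 310, 12.75))

one_index <- demo[c(1, 3)]       # list-style: columns 1 and 3, all rows
two_index <- demo[2:3, c(1, 3)]  # matrix-style: rows 2 to 3, columns 1 and 3

dim(one_index)  # 3 2
dim(two_index)  # 2 2
```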

Subsetting Examples

Subset using $ to get the column ‘fraud’.

# df2$fraud

Take the second element in the fraud column.

# df2$fraud[2]

Subset by row numbers: take rows 2 and 3 only.

# df2[2:3, ]

Subset by row names: take rows with names “subj2” and “subj3”.

# df2[c("subj2", "subj3"), ]

Subset with column numbers, take columns 1 and 4 only.

# df2[, c(1,4)]

Subset with column names “number”, “transaction”.

# df2[, c("number", "transaction")]

Subset by rows 2 and 3, and columns 1 and 4.

# df2[2:3, c(1, 4)]

Subset using row and column names.

# df2[c("subj2", "subj3"), c("number", "transaction")]
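
To try the examples above, one needs a data frame with matching row and column names. The sketch below builds a hypothetical `df2`; every value in it is made up, and `by_name` and `by_pos` are names introduced here for illustration.

```r
# Made-up data frame with the row and column names used in the examples
df2 <- data.frame(
  number      = c(101, 102, 103),
  card_type   = c("credit", "debit", "credit"),
  fraud       = c(FALSE, TRUE, FALSE),
  transaction = c(25.5, 310.0, 12.75),
  state       = c("VIC", "NSW", "VIC"),
  row.names   = c("subj1", "subj2", "subj3")
)

df2$fraud                    # FALSE TRUE FALSE
df2$fraud[2]                 # TRUE
df2[2:3, ]                   # rows 2 and 3, all columns
df2[c("subj2", "subj3"), ]   # the same rows, selected by name
df2[, c(1, 4)]               # columns number and transaction

# Selecting by position and by name picks out the same cells
by_pos  <- df2[2:3, c(1, 4)]
by_name <- df2[c("subj2", "subj3"), c("number", "transaction")]
```

Selecting by name is usually more robust than selecting by position, because it keeps working if columns are later reordered.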