Understanding the data structures and variable types in a data set is crucial for data preprocessing.
We should not perform any preprocessing without first understanding what we have in hand.
Nominal variables have a scale in which the numbers or letters assigned to objects serve only as labels for identification or classification.
Examples include binary variables (e.g., yes/no, male/female) and multinomial variables (e.g., religious affiliation, eye colour, ethnicity, suburb).
Ordinal variables have a scale that arranges objects or alternatives according to their rank. Examples include exam grades (e.g., HD, DI, Credit, Pass, Fail) and disease severity (e.g., severe, moderate, mild).
Quantitative variables are numerical data that we can either measure or count. They can be either discrete or continuous.
The logical class consists of TRUE or FALSE (binary) values. A logical value is often created by comparing variables.
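As a minimal sketch of how a comparison produces logical values, the snippet below uses a hypothetical `age` vector (the name and values are assumptions for illustration only).

```r
# Hypothetical values, chosen only for illustration
age <- c(15, 22, 37)

is_adult <- age >= 18   # element-wise comparison yields a logical vector
is_adult                # FALSE TRUE TRUE
class(is_adult)         # "logical"
sum(is_adult)           # TRUE counts as 1, so this is the number of adults: 2
```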
Quantitative values are called numerics in R, and numeric is the default computational data type. A numeric value can be stored as an integer or a double. Integers represent discrete values (e.g., 2L; a bare 2 is stored as a double), whereas doubles hold floating-point numbers (e.g., 2.16).
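The difference between integer and double storage can be checked directly. This is a small sketch; the variable names `x`, `y` and `z` are placeholders.

```r
x <- 2      # a plain number literal is stored as a double
y <- 2L     # the L suffix requests an integer
z <- 2.16

class(x)        # "numeric"
typeof(x)       # "double"
class(y)        # "integer"
typeof(y)       # "integer"
is.numeric(y)   # TRUE: integers also count as numeric
typeof(z)       # "double"
```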
The character class is used to represent string values in R. The most basic way to create a character object is to enclose text in quotation marks (" ") and assign it to an object.
The factor class is used to represent qualitative data in R. Factors can be ordered or unordered. A factor stores the nominal values as a vector of integers in the range [1…k] (where k is the number of unique values in the nominal variable), together with an internal vector of character strings (the original values) mapped to these integers.
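To make the integer-plus-labels storage of a factor concrete, here is a short sketch. The `eye_colour` vector is hypothetical data invented for the example.

```r
eye_colour <- c("brown", "blue", "brown", "green")   # hypothetical data
eye_fact <- factor(eye_colour)

class(eye_colour)      # "character"
class(eye_fact)        # "factor"
levels(eye_fact)       # "blue" "brown" "green" (sorted alphabetically by default)
as.integer(eye_fact)   # 2 1 2 3: the integer codes that index into the levels
nlevels(eye_fact)      # 3, the k in the range [1…k]
```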
Check the head of the dataframe
# head(df)
Note that df must already hold your data frame. Otherwise the name resolves to R's built-in df() function (the density of the F distribution), and head(df) prints the first lines of that function's source instead of rows of data.
Select a column
# df$col
Check levels of a factored variable
# levels(df$col)
Check the type of a column
# class(df$col)
Check the structure of a dataframe
# str(df)
Check the attributes of a dataframe
# attributes(df)
Check dimensions of a dataframe
# dim(df)
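The inspection functions above can be tried on a small data frame. The sketch below builds a hypothetical `df`; the values are invented, and the column names are borrowed from the naming example elsewhere in this section.

```r
# Hypothetical data; assigning to df shadows the built-in stats::df function
df <- data.frame(
  number      = c(1, 2, 3),
  card_type   = c("visa", "master", "visa"),
  fraud       = c(FALSE, TRUE, FALSE),
  transaction = c(25.5, 310.0, 12.9),
  state       = c("VIC", "NSW", "VIC"),
  stringsAsFactors = TRUE   # store text columns as factors
)

head(df, 2)             # first two rows
df$card_type            # one column, returned as a factor
levels(df$card_type)    # "master" "visa"
class(df$fraud)         # "logical"
str(df)                 # compact summary of every column's type
names(attributes(df))   # includes "names", "class" and "row.names"
dim(df)                 # 3 5 (rows, columns)
```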
The is.* functions test for the given data type and return a logical value (TRUE or FALSE).
# is.character(char_vec)
The as.* functions convert an object to the given type (whenever possible).
# as.numeric(char_vec)
Convert a column to a factor.
# df$col <- as.factor(df$col)
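The following sketch shows the is.* and as.* families side by side, including what happens when a conversion is not possible. The vector `char_vec` holds made-up values, and `num_vec` is a name introduced only for this example.

```r
char_vec <- c("1", "2.5", "abc")   # hypothetical character vector

is.character(char_vec)   # TRUE
is.numeric(char_vec)     # FALSE

# "abc" cannot be parsed as a number, so it becomes NA (R also emits a warning,
# which is suppressed here to keep the output clean)
num_vec <- suppressWarnings(as.numeric(char_vec))
num_vec                  # 1.0 2.5 NA
is.numeric(num_vec)      # TRUE

as.character(c(1, 2.5))                 # "1" "2.5"
as.logical(c("TRUE", "false", "yes"))   # TRUE FALSE NA
```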
Create a factor
# vect_fact3 <- factor(c("very low", "low", "medium", "high", "very high"),
#                      levels = c("very low", "low", "medium", "high", "very high"))
Create an ordered factor
# vect_fact3 <- factor(c("very low", "low", "medium", "high", "very high"),
#                      levels = c("very low", "low", "medium", "high", "very high"),
#                      ordered = TRUE)
Create alternative labels for an ordered factor
# vect_fact3 <- factor(c("very low", "low", "medium", "high", "very high"),
#                      levels = c("very low", "low", "medium", "high", "very high"),
#                      labels = c("Very Low", "Low", "Medium", "High", "Very High"),
#                      ordered = TRUE)
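What ordering adds in practice is the ability to compare values by rank. The sketch below contrasts an unordered and an ordered factor built from the same levels; the `ratings` observations and the object names are hypothetical.

```r
lvls <- c("very low", "low", "medium", "high", "very high")
ratings <- c("low", "very high", "medium", "low")   # hypothetical observations

unordered <- factor(ratings, levels = lvls)
ordered_f <- factor(ratings, levels = lvls, ordered = TRUE)

is.ordered(unordered)   # FALSE
is.ordered(ordered_f)   # TRUE
ordered_f < "high"      # TRUE FALSE TRUE TRUE: comparisons follow the level order
max(ordered_f)          # very high
table(ordered_f)        # counts in level order, including levels with zero counts
```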
You can only combine rows if they have the same number of columns.
# combined_rows <- rbind(row1, row2)
You can only combine columns if they have the same number of rows.
# combined_cols <- cbind(col1, col2)
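A brief sketch of both functions follows, with made-up vectors standing in for `row1`, `row2`, `col1` and `col2`. The last part illustrates the column-count rule for data frames; `a`, `b` and `result` are names assumed for the example.

```r
row1 <- c(1, 2, 3)
row2 <- c(4, 5, 6)
combined_rows <- rbind(row1, row2)   # stacks the vectors as rows
dim(combined_rows)                   # 2 3

col1 <- c(10, 20)
col2 <- c(30, 40)
combined_cols <- cbind(col1, col2)   # places the vectors side by side
dim(combined_cols)                   # 2 2

# For data frames, mismatched column counts raise an error
a <- data.frame(x = 1:2, y = 3:4)
b <- data.frame(x = 5)
result <- tryCatch(rbind(a, b), error = function(e) "error")
result                               # "error"
```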
Adding row names
# rownames(df) <- c("subj1", "subj2", "subj3")
Adding column names
# colnames(df) <- c("number", "card_type", "fraud", "transaction", "state")
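As an illustration, the sketch below names the rows and columns of a small three-column data frame and then uses those names for lookup. The values are hypothetical, and only three of the column names above are used to keep the example short.

```r
df <- data.frame(
  c(1, 2, 3),
  c("visa", "master", "visa"),
  c(FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE   # keep text as character
)

rownames(df)   # "1" "2" "3": the default row names

rownames(df) <- c("subj1", "subj2", "subj3")
colnames(df) <- c("number", "card_type", "fraud")

rownames(df)               # "subj1" "subj2" "subj3"
colnames(df)               # "number" "card_type" "fraud"
df["subj2", "card_type"]   # "master"
```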
Data frames possess the characteristics of both lists and matrices.
Therefore, if you subset with a single vector, they behave like lists and will return the selected columns with all rows.
If you subset with two vectors, they behave like matrices and can be subset by row and column.
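The list-like and matrix-like behaviours can be seen on a two-column example. This is a sketch with a hypothetical `df2`; `one_vec` and `two_vec` are names chosen for illustration.

```r
df2 <- data.frame(number = 1:3, fraud = c(FALSE, TRUE, FALSE))  # hypothetical

one_vec <- df2["fraud"]        # list-style: one index selects columns, keeps all rows
class(one_vec)                 # "data.frame"
dim(one_vec)                   # 3 1

two_vec <- df2[2:3, "fraud"]   # matrix-style: rows first, then columns
two_vec                        # TRUE FALSE (a single column is simplified to a vector)

# drop = FALSE keeps the result as a data frame
class(df2[2:3, "fraud", drop = FALSE])   # "data.frame"
```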
Subset using $ to get the column ‘fraud’.
# df2$fraud
Take the second element in the fraud column.
# df2$fraud[2]
Subset by row numbers, take rows 2 and 3 only.
# df2[2:3, ]
Subset by row names, take rows with names “subj2”, “subj3”.
# df2[c("subj2", "subj3"), ]
Subset with column numbers, take columns 1 and 4 only.
# df2[, c(1,4)]
Subset with column names “number”, “transaction”.
# df2[, c("number", "transaction")]
Subset by rows 2 and 3, and columns 1 and 4.
# df2[2:3, c(1, 4)]
Subset using row and column names.
# df2[c("subj2", "subj3"), c("number", "transaction")]
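The subsetting calls above can be run end to end on a small data frame. The sketch below assumes a `df2` with the row and column names used in this section; all values are invented, and `by_pos` and `by_name` are names introduced for the example.

```r
# Hypothetical data using the row and column names from this section
df2 <- data.frame(
  number      = c(101, 102, 103),
  card_type   = c("visa", "master", "visa"),
  fraud       = c(FALSE, TRUE, FALSE),
  transaction = c(25.5, 310.0, 12.9),
  state       = c("VIC", "NSW", "VIC")
)
rownames(df2) <- c("subj1", "subj2", "subj3")

df2$fraud[2]   # TRUE: second element of the fraud column

by_pos  <- df2[2:3, c(1, 4)]                                     # by position
by_name <- df2[c("subj2", "subj3"), c("number", "transaction")]  # by name

by_pos                       # rows subj2 and subj3, columns number and transaction
identical(by_pos, by_name)   # TRUE: both forms select the same cells
```

Selecting by name is usually more robust than selecting by position, because it keeps working if rows or columns are reordered.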