Some useful tricks

Create some fake data

data <- data.frame(x1 = c(1,2,5,"--", 5),
                   x2 = c(2,4,"--", 7,6), x3 = 1:5)
data

##   x1 x2 x3
## 1  1  2  1
## 2  2  4  2
## 3  5 --  3
## 4 --  7  4
## 5  5  6  5

Check the class of each column (each one should be numeric or integer)

sapply(data, class)

##        x1        x2        x3 
##  "factor"  "factor" "integer"

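x1 and x2 are recognized as factors because of the double dashes: mixing "--" with numbers makes each column a character vector, which data.frame() converted to a factor by default in R versions before 4.0. We can try to coerce them to numeric. Let’s see how that works.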

cdata = data # saving a copy of the original 
cdata # taking a look to make sure it's identical to the original

##   x1 x2 x3
## 1  1  2  1
## 2  2  4  2
## 3  5 --  3
## 4 --  7  4
## 5  5  6  5

cdata$x1 <- as.numeric(cdata$x1) 
cdata$x2 <- as.numeric(cdata$x2)
cdata

##   x1 x2 x3
## 1  2  2  1
## 2  3  3  2
## 3  4  1  3
## 4  1  5  4
## 5  4  4  5

Notice that the values have changed and the dashes have become numbers, which corrupts the data: as.numeric() applied to a factor returns its internal integer codes, not its labels. We can avoid that by converting with as.character() first. Let’s see how that works, using the original (since the copy is now corrupted).

data # looking to make sure the original did not change.

##   x1 x2 x3
## 1  1  2  1
## 2  2  4  2
## 3  5 --  3
## 4 --  7  4
## 5  5  6  5

data$x1 <- as.numeric(as.character(data$x1))

## Warning: NAs introduced by coercion

data$x2 <- as.numeric(as.character(data$x2))

## Warning: NAs introduced by coercion

data # look at the result

##   x1 x2 x3
## 1  1  2  1
## 2  2  4  2
## 3  5 NA  3
## 4 NA  7  4
## 5  5  6  5
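To see why the inner as.character() matters, here is a minimal sketch on a stand-alone factor. The vector f is hypothetical and built with factor() directly, so the result does not depend on the stringsAsFactors default of data.frame(). The order of the levels depends on the locale's sort order.

```r
# Hypothetical stand-alone factor mirroring column x1.
f <- factor(c("1", "2", "5", "--", "5"))

levels(f)        # the distinct labels, in sorted order
as.integer(f)    # internal codes: each value's position in levels(f)

# Going through the labels recovers the numbers; "--" cannot be parsed and
# becomes NA (the coercion warning is silenced here).
via.labels <- suppressWarnings(as.numeric(as.character(f)))
via.labels

# An equivalent form mentioned in ?factor: convert the levels once, then index.
via.levels <- suppressWarnings(as.numeric(levels(f))[f])
identical(via.labels, via.levels)
```

Calling as.numeric() on the factor itself would return the codes shown by as.integer(f), which is what happened to the copy cdata.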

Further verifications

sapply(data, class)

##        x1        x2        x3 
## "numeric" "numeric" "integer"

Next, make sure that the NAs are recognized as missing data. We expect 2, so let’s verify that.

sum(is.na(data))

## [1] 2

Problem solved. Note that if the data contain many columns with unusual characters, it would be tedious to write code for each column. It is then useful to automate the process with a single piece of code. Here is an example.

new.data <- data.frame(x1 = c(1,2,5,"--", 5),
                  x2 = c(2,4,"--", 7,6), x3 = 1:5, x4=c(1,10,"##",77, 5),
                  x5 =c(3,7,5,"&$", 9))
new.data

##   x1 x2 x3 x4 x5
## 1  1  2  1  1  3
## 2  2  4  2 10  7
## 3  5 --  3 ##  5
## 4 --  7  4 77 &$
## 5  5  6  5  5  9
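Before coercing anything, it can help to list which entries would fail to parse as numbers, so that nothing unexpected is silently turned into NA. The helper non.numeric.values below is a hypothetical name, not part of the original post, and demo is a small stand-in for new.data so that the snippet runs on its own.

```r
# Hypothetical helper: return the distinct entries of a column that are not
# missing but cannot be parsed as numbers.
non.numeric.values <- function(x) {
  x <- as.character(x)                       # works for factor or character
  parsed <- suppressWarnings(as.numeric(x))  # unparseable entries become NA
  unique(x[is.na(parsed) & !is.na(x)])
}

# Stand-in for new.data with three of its columns.
demo <- data.frame(x1 = c(1, 2, 5, "--", 5),
                   x4 = c(1, 10, "##", 77, 5),
                   x5 = c(3, 7, 5, "&$", 9))

bad <- lapply(demo, non.numeric.values)
bad
```

Each element of bad holds the offending strings for one column, which makes it easy to confirm they are all placeholders for missing values.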

Write a function that takes a column and performs an operation on it. Then apply that function to all the columns. Let’s call that function t.to.numeric (the name can be anything).

The function below takes any vector and coerces it to numeric.

t.to.numeric <- function(x){
  # convert to character first so factor labels, not codes, are parsed
  as.numeric(as.character(x))
}
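A quick check of the helper on a single vector, as a sketch. The definition is repeated so that the snippet runs on its own, and the input values are made up for illustration.

```r
# Same idea as t.to.numeric above, repeated here for a self-contained check.
t.to.numeric <- function(x) {
  as.numeric(as.character(x))
}

# suppressWarnings() only hides the expected "NAs introduced by coercion".
from.chr <- suppressWarnings(t.to.numeric(c("3", "--", "10")))
from.fct <- suppressWarnings(t.to.numeric(factor(c("3", "--", "10"))))

from.chr
from.fct
```

Both calls should give the same numeric vector, with NA in place of the dashes, whether the input is a character vector or a factor.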

Now the code below applies the function created above to all columns of the new data.

new.data <- sapply(new.data, t.to.numeric)

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

class(new.data) # it's preferable for the class of the data to be data.frame

## [1] "matrix"

new.data <- as.data.frame(new.data)
class(new.data)

## [1] "data.frame"
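sapply() simplifies its result to a matrix, which is why the class had to be restored with as.data.frame(). One alternative, sketched here on a hypothetical copy named keep.df, is to use lapply() and assign the result back with empty brackets, which keeps the object a data frame throughout.

```r
t.to.numeric <- function(x) as.numeric(as.character(x))

# Hypothetical copy of the first three columns of new.data.
keep.df <- data.frame(x1 = c(1, 2, 5, "--", 5),
                      x2 = c(2, 4, "--", 7, 6),
                      x3 = 1:5)

# keep.df[] <- ... replaces the columns in place and preserves the class and
# the column names; lapply() returns one converted vector per column.
keep.df[] <- suppressWarnings(lapply(keep.df, t.to.numeric))

class(keep.df)
sapply(keep.df, class)
```

As with the sapply() version, the integer column x3 comes back as numeric, because as.numeric() returns doubles.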

sapply(new.data, class) # verification

##        x1        x2        x3        x4        x5 
## "numeric" "numeric" "numeric" "numeric" "numeric"

sum(is.na(new.data)) # verify that there are 4 missing values

## [1] 4
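The total alone does not say where the missing values are. A small sketch, using a hypothetical data frame named checked that already contains NAs, shows two base R ways to locate them.

```r
# Hypothetical, already-converted data frame with two missing values.
checked <- data.frame(x1 = c(1, 2, 5, NA, 5),
                      x2 = c(2, 4, NA, 7, 6),
                      x3 = 1:5)

# Number of missing values in each column.
na.per.column <- colSums(is.na(checked))
na.per.column

# Row and column position of every missing value.
na.positions <- which(is.na(checked), arr.ind = TRUE)
na.positions
```

Comparing the per-column counts with the number of placeholder strings seen before conversion is a simple way to confirm that no valid entry was lost.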

class(new.data)

## [1] "data.frame"
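When the data come from a file rather than from data.frame(), the placeholders can often be handled at import time instead. The sketch below assumes a comma-separated source, represented here by the hypothetical string csv.text, and uses the na.strings argument of read.csv() to declare which strings mean "missing".

```r
# Hypothetical CSV content with the same values as new.data.
csv.text <- "x1,x2,x3,x4,x5
1,2,1,1,3
2,4,2,10,7
5,--,3,##,5
--,7,4,77,&$
5,6,5,5,9"

# Every string listed in na.strings is read as NA, so the columns are parsed
# as numbers directly and no later coercion is needed.
from.csv <- read.csv(text = csv.text,
                     na.strings = c("NA", "--", "##", "&$"))

sapply(from.csv, class)
sum(is.na(from.csv))
```

This only works when the placeholder strings are known in advance; otherwise the coercion approach above is the more general one.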

Sena

4/7/2019