In our previous file, we talked about the beginning of R, data loading and use of some functions which are used by data analyst for writing reseatch papers, doing statistical analysis. Now we are in part II.

For the squid data, you may want to sort the GSI data from low to high values of the variable month, even if only for a quick observation. The following code can be used

Ord1 <- order(Squid$Month)
head(Squid[Ord1, ]) # First few observations of sorted data
tail(Squid[Ord1, ]) # Last Few observations of sorted data

Combining Data Set Using Merge Function

Squid1 <- read.table(file = "/Users/mominul/OneDrive - South Dakota State University - SDSU/Data Science/Books/R Books/A Beginners Guide to R/RBook/squid1.txt", header = TRUE)
Squid2 <- read.table(file = "/Users/mominul/OneDrive - South Dakota State University - SDSU/Data Science/Books/R Books/A Beginners Guide to R/RBook/squid2.txt", header = TRUE)
SquidMerged <- merge(Squid1, Squid2, by = "Sample")
head(SquidMerged)
dim(SquidMerged)
[1] 2643    6

The merge command takes as argument the two data frames Squid1 and Squid2 and combines the two datasets using as a common identifier the variable Sample. A useful option within the merge function is all. By default it is set to FALSE, which means that rows in which either Squid1 or Squid2 has missing values are omitted. When set to TRUE, NAs are filled in if Squid1 has no data for a sample that is present in Squid2, and vice versa.

SquidMerged <- merge(Squid1, Squid2, by = "Sample",
                     all = TRUE)

Dealing with Missing Values

Suppose you want a function that takes as input a data frame that contains the data and calculates the number of missing values in each variable. The syntax of such a function is

NAPerVariable <- function(X1) {
  D1 <- is.na(X1)  #creates a Boolean matrix of the same dimension as X1
  colSums(D1)
}

Let’s use the function above to detect missing values in R.

#csv data set reading
water <- read.csv('https://umich.instructure.com/files/399172/download?download_frd=1', header=T)
#use the existing colnames function to add column names to water data
colnames(water) <- c("Year", "Region", "Country","Residence","DrinikingWater","Sanitation")
NAPerVariable(water) # Missing value detection
          Year         Region        Country 
             0              0              0 
     Residence DrinikingWater     Sanitation 
             0             32            135 

From the above code, we are able to determine number of missing values in the water data set. There are 32 missing values for Drinking water column and 135 missing values in Sanitation column.

Sometimes in a given data set, you may see zeros in every column. If you want to count zeros in a specific dataset. You can use:

##Counting Number of zeros
ZerosPerVariable <- function(X1) {
  D1 = (X1 == 0)
  colSums(D1)
}

Other Ways To Deal With Missing Values

# is.na() function provides tests for missing values

x <- c(1, 2, 3, NA, 4)
is.na(x) #used to check for missing values in an object
[1] FALSE FALSE FALSE  TRUE FALSE
# arithmetic functions, such as mean(), applied to data containing missing values 

is.na can gives results in a boolean form. If there is any missing values you can see TRUE statement.

mean(x, na.rm=TRUE)
[1] 2.5
mean(x, na.rm=FALSE)
[1] NA

If you choose na.rm=FALSE then you will have NA in your result. But if you choose na.rm =TRUE, R will compute without considering the missing value.

DF <- data.frame(x = c(1, 2, 3), y = c(10, 20, NA))
DF
DF[!complete.cases(DF),] # list rows of data that have missing values
na.omit(DF)  # Delete the NA values from our dataset 
DF1 <- na.exclude(DF) # Delete the NA values from our dataset 
DF1  # na.exclude() function returns the object with incomplete cases removed.

We created a data frame with missing values in it. Using command na.omit or na.exclude, you can remove the rows with missing values in it.

sum(is.na(DF)) # identify count of NAs in data frame
[1] 1

We can have the total number of missing values using sum command.For data frames, a convenient shortcut to compute the total missing values in each column is to use colSums().

colSums(is.na(DF))
x y 
0 1 

There are no missing values in X but there is only one missing values in Y.

To determine the position of missing values we can use

xal <- c(1,2,3,4,5,6,NA,8,NA,9)
which(is.na(xal)) #position of missing values
[1] 7 9

At 7th and 9th position of our data, the values are missing.

LS0tCnRpdGxlOiAiUiBOb3RlYm9vayIKb3V0cHV0OiBodG1sX25vdGVib29rCi0tLQoKSW4gb3VyIHByZXZpb3VzIGZpbGUsIHdlIHRhbGtlZCBhYm91dCB0aGUgYmVnaW5uaW5nIG9mIFIsIGRhdGEgbG9hZGluZyBhbmQgdXNlIG9mIHNvbWUgZnVuY3Rpb25zIHdoaWNoIGFyZSB1c2VkIGJ5IGRhdGEgYW5hbHlzdCBmb3Igd3JpdGluZyByZXNlYXRjaCBwYXBlcnMsIGRvaW5nIHN0YXRpc3RpY2FsIGFuYWx5c2lzLiBOb3cgd2UgYXJlIGluIHBhcnQgSUkuCgpgYGB7ciBkYXRhIExvYWRpbmcsIGVjaG8gPSBGQUxTRX0KU3F1aWQgPC0gcmVhZC50YWJsZShmaWxlID0gIi9Vc2Vycy9tb21pbnVsL09uZURyaXZlIC0gU291dGggRGFrb3RhIFN0YXRlIFVuaXZlcnNpdHkgLSBTRFNVL0RhdGEgU2NpZW5jZS9Cb29rcy9SIEJvb2tzL0EgQmVnaW5uZXJzIEd1aWRlIHRvIFIvUkJvb2svc3F1aWQudHh0IixoZWFkZXIgPSBUUlVFKQpgYGAKCkZvciB0aGUgc3F1aWQgZGF0YSwgeW91IG1heSB3YW50IHRvIHNvcnQgdGhlIEdTSSBkYXRhIGZyb20gbG93IHRvIGhpZ2ggdmFsdWVzIG9mIHRoZSB2YXJpYWJsZSAqKm1vbnRoKiosIGV2ZW4gaWYgb25seSBmb3IgYSBxdWljayBvYnNlcnZhdGlvbi4gVGhlIGZvbGxvd2luZyBjb2RlIGNhbiBiZSB1c2VkCgpgYGB7cn0KT3JkMSA8LSBvcmRlcihTcXVpZCRNb250aCkKaGVhZChTcXVpZFtPcmQxLCBdKSAjIEZpcnN0IGZldyBvYnNlcnZhdGlvbnMgb2Ygc29ydGVkIGRhdGEKdGFpbChTcXVpZFtPcmQxLCBdKSAjIExhc3QgRmV3IG9ic2VydmF0aW9ucyBvZiBzb3J0ZWQgZGF0YQpgYGAKCiMjIyBDb21iaW5pbmcgRGF0YSBTZXQgVXNpbmcgTWVyZ2UgRnVuY3Rpb24KCmBgYHtyfQpTcXVpZDEgPC0gcmVhZC50YWJsZShmaWxlID0gIi9Vc2Vycy9tb21pbnVsL09uZURyaXZlIC0gU291dGggRGFrb3RhIFN0YXRlIFVuaXZlcnNpdHkgLSBTRFNVL0RhdGEgU2NpZW5jZS9Cb29rcy9SIEJvb2tzL0EgQmVnaW5uZXJzIEd1aWRlIHRvIFIvUkJvb2svc3F1aWQxLnR4dCIsIGhlYWRlciA9IFRSVUUpClNxdWlkMiA8LSByZWFkLnRhYmxlKGZpbGUgPSAiL1VzZXJzL21vbWludWwvT25lRHJpdmUgLSBTb3V0aCBEYWtvdGEgU3RhdGUgVW5pdmVyc2l0eSAtIFNEU1UvRGF0YSBTY2llbmNlL0Jvb2tzL1IgQm9va3MvQSBCZWdpbm5lcnMgR3VpZGUgdG8gUi9SQm9vay9zcXVpZDIudHh0IiwgaGVhZGVyID0gVFJVRSkKU3F1aWRNZXJnZWQgPC0gbWVyZ2UoU3F1aWQxLCBTcXVpZDIsIGJ5ID0gIlNhbXBsZSIpCmhlYWQoU3F1aWRNZXJnZWQpCmRpbShTcXVpZE1lcmdlZCkKYGBgCgpUaGUgbWVyZ2UgY29tbWFuZCB0YWtlcyBhcyBhcmd1bWVudCB0aGUgdHdvIGRhdGEgZnJhbWVzIFNxdWlkMSBhbmQgU3F1aWQyIGFuZCBjb21iaW5lcyB0aGUgdHdvIGRhdGFzZXRzIHVzaW5nIGFzIGEgY29tbW9uIGlkZW50aWZpZXIgdGhlIHZhcmlhYmxlIFNhbXBsZS4gQSB1c2VmdWwgb3B0aW9uIHdpdGhpbiB0aGUgbWVyZ2UgZnVuY3Rpb24gaXMgKiphbGwqKi4gQnkgZGVmYXVsdCBpdCBpcyBzZXQgdG8gRkFMU0UsIHdoaWNoIG1lYW5zIHRoYXQgcm93cyBpbiB3aGljaCBlaXRoZXIgU3F1aWQxIG9yIFNxdWlkMiBoYXMgbWlzc2luZyB2YWx1ZXMgYXJlIG9taXR0ZWQuIFdoZW4gc2V0IHRvIFRSVUUsIE5BcyBhcmUgZmlsbGVkIGluIGlmIFNxdWlkMSBoYXMgbm8gZGF0YSBmb3IgYQpzYW1wbGUgdGhhdCBpcyBwcmVzZW50IGluIFNxdWlkMiwgYW5kIHZpY2UgdmVyc2EuIAoKYGBge3J9ClNxdWlkTWVyZ2VkIDwtIG1lcmdlKFNxdWlkMSwgU3F1aWQyLCBieSA9ICJTYW1wbGUiLAogICAgICAgICAgICAgICAgICAgICBhbGwgPSBUUlVFKQpgYGAKCiMjIyBEZWFsaW5nIHdpdGggTWlzc2luZyBWYWx1ZXMKClN1cHBvc2UgeW91IHdhbnQgYSBmdW5jdGlvbiB0aGF0IHRha2VzIGFzIGlucHV0IGEgZGF0YSBmcmFtZSB0aGF0IGNvbnRhaW5zIHRoZSBkYXRhIGFuZCBjYWxjdWxhdGVzIHRoZSBudW1iZXIgb2YgbWlzc2luZyB2YWx1ZXMgaW4gZWFjaCB2YXJpYWJsZS4gVGhlIHN5bnRheCBvZiBzdWNoIGEgZnVuY3Rpb24gaXMKCmBgYHtyfQpOQVBlclZhcmlhYmxlIDwtIGZ1bmN0aW9uKFgxKSB7CiAgRDEgPC0gaXMubmEoWDEpICAjY3JlYXRlcyBhIEJvb2xlYW4gbWF0cml4IG9mIHRoZSBzYW1lIGRpbWVuc2lvbiBhcyBYMQogIGNvbFN1bXMoRDEpCn0KCmBgYAoKTGV0J3MgdXNlIHRoZSBmdW5jdGlvbiBhYm92ZSB0byBkZXRlY3QgbWlzc2luZyB2YWx1ZXMgaW4gUi4KCmBgYHtyfQojY3N2IGRhdGEgc2V0IHJlYWRpbmcKd2F0ZXIgPC0gcmVhZC5jc3YoJ2h0dHBzOi8vdW1pY2guaW5zdHJ1Y3R1cmUuY29tL2ZpbGVzLzM5OTE3Mi9kb3dubG9hZD9kb3dubG9hZF9mcmQ9MScsIGhlYWRlcj1UKQojdXNlIHRoZSBleGlzdGluZyBjb2xuYW1lcyBmdW5jdGlvbiB0byBhZGQgY29sdW1uIG5hbWVzIHRvIHdhdGVyIGRhdGEKY29sbmFtZXMod2F0ZXIpIDwtIGMoIlllYXIiLCAiUmVnaW9uIiwgIkNvdW50cnkiLCJSZXNpZGVuY2UiLCJEcmluaWtpbmdXYXRlciIsIlNhbml0YXRpb24iKQpOQVBlclZhcmlhYmxlKHdhdGVyKSAjIE1pc3NpbmcgdmFsdWUgZGV0ZWN0aW9uCmBgYAoKRnJvbSB0aGUgYWJvdmUgY29kZSwgd2UgYXJlIGFibGUgdG8gZGV0ZXJtaW5lIG51bWJlciBvZiBtaXNzaW5nIHZhbHVlcyBpbiB0aGUgd2F0ZXIgZGF0YSBzZXQuIFRoZXJlIGFyZSAzMiBtaXNzaW5nIHZhbHVlcyBmb3IgRHJpbmtpbmcgd2F0ZXIgY29sdW1uIGFuZCAxMzUgbWlzc2luZyB2YWx1ZXMgaW4gU2FuaXRhdGlvbiBjb2x1bW4uCgpTb21ldGltZXMgaW4gYSBnaXZlbiBkYXRhIHNldCwgeW91IG1heSBzZWUgemVyb3MgaW4gZXZlcnkgY29sdW1uLiBJZiB5b3Ugd2FudCB0byBjb3VudCB6ZXJvcyBpbiBhIHNwZWNpZmljIGRhdGFzZXQuIFlvdSBjYW4gdXNlOgoKYGBge3J9CiMjQ291bnRpbmcgTnVtYmVyIG9mIHplcm9zClplcm9zUGVyVmFyaWFibGUgPC0gZnVuY3Rpb24oWDEpIHsKICBEMSA9IChYMSA9PSAwKQogIGNvbFN1bXMoRDEpCn0KYGBgCgoKIyMjIyBPdGhlciBXYXlzIFRvIERlYWwgV2l0aCBNaXNzaW5nIFZhbHVlcwoKYGBge3J9CiMgaXMubmEoKSBmdW5jdGlvbiBwcm92aWRlcyB0ZXN0cyBmb3IgbWlzc2luZyB2YWx1ZXMKCnggPC0gYygxLCAyLCAzLCBOQSwgNCkKaXMubmEoeCkgI3VzZWQgdG8gY2hlY2sgZm9yIG1pc3NpbmcgdmFsdWVzIGluIGFuIG9iamVjdAojIGFyaXRobWV0aWMgZnVuY3Rpb25zLCBzdWNoIGFzIG1lYW4oKSwgYXBwbGllZCB0byBkYXRhIGNvbnRhaW5pbmcgbWlzc2luZyB2YWx1ZXMgCmBgYAoKaXMubmEgY2FuIGdpdmVzIHJlc3VsdHMgaW4gYSBib29sZWFuIGZvcm0uIElmIHRoZXJlIGlzIGFueSBtaXNzaW5nIHZhbHVlcyB5b3UgY2FuIHNlZSBUUlVFIHN0YXRlbWVudC4KCmBgYHtyfQptZWFuKHgsIG5hLnJtPVRSVUUpCm1lYW4oeCwgbmEucm09RkFMU0UpCmBgYAoKSWYgeW91IGNob29zZSBuYS5ybT1GQUxTRSB0aGVuIHlvdSB3aWxsIGhhdmUgTkEgaW4geW91ciByZXN1bHQuIEJ1dCBpZiB5b3UgY2hvb3NlIG5hLnJtID1UUlVFLCBSIHdpbGwgY29tcHV0ZSB3aXRob3V0IGNvbnNpZGVyaW5nIHRoZSBtaXNzaW5nIHZhbHVlLgoKYGBge3J9CkRGIDwtIGRhdGEuZnJhbWUoeCA9IGMoMSwgMiwgMyksIHkgPSBjKDEwLCAyMCwgTkEpKQpERgpERlshY29tcGxldGUuY2FzZXMoREYpLF0gIyBsaXN0IHJvd3Mgb2YgZGF0YSB0aGF0IGhhdmUgbWlzc2luZyB2YWx1ZXMKbmEub21pdChERikgICMgRGVsZXRlIHRoZSBOQSB2YWx1ZXMgZnJvbSBvdXIgZGF0YXNldCAKREYxIDwtIG5hLmV4Y2x1ZGUoREYpICMgRGVsZXRlIHRoZSBOQSB2YWx1ZXMgZnJvbSBvdXIgZGF0YXNldCAKREYxICAjIG5hLmV4Y2x1ZGUoKSBmdW5jdGlvbiByZXR1cm5zIHRoZSBvYmplY3Qgd2l0aCBpbmNvbXBsZXRlIGNhc2VzIHJlbW92ZWQuCmBgYAoKV2UgY3JlYXRlZCBhIGRhdGEgZnJhbWUgd2l0aCBtaXNzaW5nIHZhbHVlcyBpbiBpdC4gVXNpbmcgY29tbWFuZCAqKm5hLm9taXQqKiBvciAqKm5hLmV4Y2x1ZGUqKiwgeW91IGNhbiByZW1vdmUgdGhlIHJvd3Mgd2l0aCBtaXNzaW5nIHZhbHVlcyBpbiBpdC4KCmBgYHtyfQpzdW0oaXMubmEoREYpKSAjIGlkZW50aWZ5IGNvdW50IG9mIE5BcyBpbiBkYXRhIGZyYW1lCmBgYAoKV2UgY2FuIGhhdmUgdGhlIHRvdGFsIG51bWJlciBvZiBtaXNzaW5nIHZhbHVlcyB1c2luZyBzdW0gY29tbWFuZC5Gb3IgZGF0YSBmcmFtZXMsIGEgY29udmVuaWVudCBzaG9ydGN1dCB0byBjb21wdXRlIHRoZSB0b3RhbCBtaXNzaW5nIHZhbHVlcyAgaW4gZWFjaCBjb2x1bW4gaXMgdG8gdXNlIGNvbFN1bXMoKS4KCmBgYHtyfQpjb2xTdW1zKGlzLm5hKERGKSkKYGBgCgpUaGVyZSBhcmUgbm8gbWlzc2luZyB2YWx1ZXMgaW4gWCBidXQgdGhlcmUgaXMgb25seSBvbmUgbWlzc2luZyB2YWx1ZXMgaW4gWS4KClRvIGRldGVybWluZSB0aGUgcG9zaXRpb24gb2YgbWlzc2luZyB2YWx1ZXMgd2UgY2FuIHVzZQoKYGBge3J9CnhhbCA8LSBjKDEsMiwzLDQsNSw2LE5BLDgsTkEsOSkKd2hpY2goaXMubmEoeGFsKSkgI3Bvc2l0aW9uIG9mIG1pc3NpbmcgdmFsdWVzCmBgYAoKQXQgN3RoIGFuZCA5dGggcG9zaXRpb24gb2Ygb3VyIGRhdGEsIHRoZSB2YWx1ZXMgYXJlIG1pc3NpbmcuCgoKCgoKCgoKCgoKCg==