Data Analytics- Application of Sapply and other functions

R Markdown

Take a list L generated as follows:

set.seed(4321)
L <- list(u = sample(c(rep(NA, 4), runif(96))), n = rnorm(200), t = sample(c(rep(NA, 10), rt(290, df = 3))))

Derive a matrix in which each column contains the mean, variance, number of observations, and standard error of the mean (estimated by sd/sqrt(n)) of one list element. Missing values must not count toward the sample size: length(x[!is.na(x)]) counts the non-missing observations. The columns should carry the names of the list elements. Use of any loop is strictly forbidden.

# Creation of list
L<-list(u=sample(c(rep(NA,4),runif(96))),n=rnorm(200),t=sample(c(rep(NA,10),rt(290,df=3))))

# Compute the mean, variance, standard error, and number of non-missing observations
# of each list element with sapply, then stack the results as rows with rbind
mytable <- rbind(sapply(L, mean, na.rm = TRUE),
                 sapply(L, var, na.rm = TRUE),
                 sapply(L, sd, na.rm = TRUE) / sqrt(sapply(L, function(x) length(x[!is.na(x)]))),
                 sapply(L, function(x) length(x[!is.na(x)])))

# Checking the output mytable
mytable

##                u            n            t
## [1,]  0.48994450  -0.03591338  -0.07279339
## [2,]  0.07203766   1.00220674   2.50666720
## [3,]  0.02739329   0.07078865   0.09297139
## [4,] 96.00000000 200.00000000 290.00000000

# Adding names 
dimnames(mytable)<-list(c("mean","variance","standard error","observation"),names(L))
mytable

##                          u            n            t
## mean            0.48994450  -0.03591338  -0.07279339
## variance        0.07203766   1.00220674   2.50666720
## standard error  0.02739329   0.07078865   0.09297139
## observation    96.00000000 200.00000000 290.00000000

str(mytable)

##  num [1:4, 1:3] 0.4899 0.072 0.0274 96 -0.0359 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:4] "mean" "variance" "standard error" "observation"
##   ..$ : chr [1:3] "u" "n" "t"

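Note: lapply is short for "list apply". It applies a function to each element of its input and always returns a list of the same length as that input; the elements of the result need not share a type. sapply calls lapply and then tries to simplify the result: to a vector when every result has length one, or to a matrix when every result has the same length greater than one.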
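
The difference between the two is easiest to see side by side. The following is a minimal sketch on a hypothetical toy list (the name `toy` and its values are made up for illustration and are not part of the assignment data):

```r
# Hypothetical toy list, used only to illustrate the return types
toy <- list(a = c(1, 2, NA), b = c(10, 20, 30, 40))

# lapply always returns a list with one element per input element
as_list <- lapply(toy, mean, na.rm = TRUE)

# sapply simplifies length-one results to a named numeric vector
as_vector <- sapply(toy, mean, na.rm = TRUE)

# sapply simplifies equal-length results (here length two) to a matrix
# with one column per list element
as_matrix <- sapply(toy, range, na.rm = TRUE)

str(as_list)
str(as_vector)
str(as_matrix)
```

Because each `sapply` call on L above returns a named vector of length three, `rbind` can stack those vectors directly into a matrix whose columns are already named u, n, and t.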

Write a function whose input is the list L and whose output is the matrix above. Write comments explaining what you are doing.

Approach: I wrote a function that takes a list L as input and returns a matrix in which each column contains the mean, variance, standard error, and number of observations of one list element. First, I defined the function as convert_matrix. Next, I used sapply to calculate the mean, variance, standard error, and number of observations of the list elements. To ignore missing values in the mean, variance, and standard deviation, I passed na.rm=TRUE; to count the observations without NAs, I used length(x[!is.na(x)]). To build the matrix, I used rbind to stack the sapply results row-wise. Finally, I added row and column names with dimnames, rounded the values to two decimal places, and used return to get the table as output from the function.

I called the function convert_matrix with data input = L and got the matrix as output.

# Defining the function convert_matrix
convert_matrix <- function(L) {
  # Compute each statistic per list element with sapply, ignoring NAs,
  # and stack the results as rows with rbind
  mytable <- rbind(sapply(L, mean, na.rm = TRUE),
                   sapply(L, var, na.rm = TRUE),
                   sapply(L, sd, na.rm = TRUE) / sqrt(sapply(L, function(x) length(x[!is.na(x)]))),
                   sapply(L, function(x) length(x[!is.na(x)])))

  # Adding row and column names to mytable
  dimnames(mytable) <- list(c("mean", "variance", "standard error", "observations"), names(L))

  # Rounding to two decimal places and returning the result
  return(round(mytable, 2))
}
# Calling the function
convert_matrix(L)

##                    u      n      t
## mean            0.49  -0.04  -0.07
## variance        0.07   1.00   2.51
## standard error  0.03   0.07   0.09
## observations   96.00 200.00 290.00
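
As an alternative to one `sapply` call per statistic, the same kind of matrix can be built in a single pass by letting the anonymous function return a named vector; `sapply` then simplifies those vectors to a matrix on its own. This is only a sketch of that idea, not the approach used above: the helper name `summarise_list` and the toy list are hypothetical.

```r
# Hypothetical single-pass variant: one sapply call, no explicit loop
summarise_list <- function(L) {
  sapply(L, function(x) {
    x <- x[!is.na(x)]   # drop missing values once
    n <- length(x)      # sample size without NAs
    c(mean             = mean(x),
      variance         = var(x),
      "standard error" = sd(x) / sqrt(n),
      observations     = n)
  })
}

# Hypothetical toy list, used only for illustration
toy <- list(a = c(1, 2, 3, NA), b = c(2, 4, 6, 8))
res <- summarise_list(toy)
res
```

The names of the returned vector become the row names and the names of the list become the column names, so no separate `dimnames` step is needed in this variant.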

2.a) Write a function that calculates the location (MidIQR) as the midpoint of the 25th and 75th percentiles of the data.

The function needs to eliminate all missing values. Apply it to data1, generated as follows. Summarize the data first using a histogram and summary.

set.seed(123); data1 = c(rep(NA,10), rcauchy(20), rnorm(970))

set.seed(123);
data1=c(rep(NA,10),rcauchy(20),rnorm(970))

# Removing missing values
data2=data1[!is.na(data1)]

# Creating histogram
hist(data2, col = "lightblue", xlab = "Numbers", main = "Histogram of Data without NA")

# Summary of data2
summary(data2)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -11.30000  -0.62920   0.01519   0.01040   0.66500   7.29100

MidIQR <- function(x)
{
  # Removing missing values
  data2 <- x[!is.na(x)]
  # 25th and 75th percentiles
  q1 <- quantile(data2, 0.25)
  q3 <- quantile(data2, 0.75)
  # Midpoint of the two quartiles, with the quantile name dropped
  midiqr <- (q1 + q3) / 2
  names(midiqr) <- NULL
  return(midiqr)
}

MidIQR(data1)

## [1] 0.01790665
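
The same midpoint can also be written more compactly by asking `quantile` for both quartiles at once and letting it drop the NAs and the names. This is a sketch of an equivalent formulation, assuming R's default quantile definition (type 7); the name `mid_iqr_check` is hypothetical.

```r
# Hypothetical compact variant of the midpoint of the quartiles
mid_iqr_check <- function(x) {
  # both quartiles in one call; NAs removed and names dropped by quantile itself
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  mean(q)
}

# Toy check: the quartiles of 1..9 are 3 and 7, so the midpoint is 5
mid_iqr_check(c(1:9, NA))
```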

b) The interquartile range (IQR) is defined as the 75th percentile minus the 25th percentile. A robust standard deviation (RSD) can be obtained as IQR / (IQR of the standard normal distribution), where the IQR of the standard normal distribution is qnorm(.75) - qnorm(.25) = 1.34898.

Write a function that calculates RSD and apply it to data1.

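rsd <- function(x, na.rm = TRUE) {
  # Eliminating missing values when na.rm is TRUE
  y <- if (na.rm) x[!is.na(x)] else x
  # Interquartile range of the data
  iqr <- quantile(y, 0.75) - quantile(y, 0.25)
  # Interquartile range of the standard normal distribution
  iqr_norm <- qnorm(0.75) - qnorm(0.25)
  # Robust standard deviation (RSD) is the IQR divided by the IQR of the standard normal distribution
  RSD <- iqr / iqr_norm
  return(RSD)
}
rsd(data1)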

##       75% 
## 0.9593433
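
Base R's `IQR` function (in the stats package) computes the same interquartile range and accepts `na.rm`, so the RSD can be cross-checked against it. The sketch below is one way to do that; the name `rsd_builtin` and the toy vector are hypothetical.

```r
# Hypothetical cross-check built on stats::IQR
rsd_builtin <- function(x, na.rm = TRUE) {
  # IQR of the data divided by the IQR of the standard normal distribution
  IQR(x, na.rm = na.rm) / (qnorm(0.75) - qnorm(0.25))
}

# Toy check: the IQR of 1..9 is 4, so this equals 4 / 1.34898
toy <- c(1:9, NA)
rsd_builtin(toy)
```

Unlike the hand-written version, `IQR` returns an unnamed value, so the result does not carry the "75%" label seen in the output above.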

c) The median absolute deviation is defined as the median of the absolute deviations from the median, multiplied by 1.4826: median(abs(x - median(x)))*1.4826. Write a function MAD that calculates the median absolute deviation of x and eliminates missing values if the parameter na.rm is TRUE. Apply it to data1.

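MAD <- function(x, na.rm = TRUE) {
  # Eliminating missing values when na.rm is TRUE
  y <- if (na.rm) x[!is.na(x)] else x
  # Median of the absolute deviations from the median, scaled by 1.4826
  z <- median(abs(y - median(y))) * 1.4826
  return(z)
}
MAD(data1)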

## [1] 0.9630704
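
R is case-sensitive, so the user-defined `MAD` does not clash with the built-in `mad` from the stats package, which uses the same scaling constant 1.4826 by default. A small sketch of a cross-check on a hypothetical toy vector with one outlier and one NA:

```r
# Hypothetical toy vector: one outlier and one missing value
toy <- c(1, 2, 3, 4, 100, NA)

# Manual computation following the definition in the task
clean  <- toy[!is.na(toy)]
manual <- median(abs(clean - median(clean))) * 1.4826

# Built-in equivalent; the default constant is 1.4826
builtin <- mad(toy, na.rm = TRUE)

manual   # 1.4826: the median is 3 and the median absolute deviation from it is 1
builtin  # same value
```

The outlier 100 barely affects this estimate, which is the reason a robust scale measure is used for data that contain Cauchy draws.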

Amit

January 19, 2017
