Exploratory Statistics using the R language

With so many new packages and libraries available for OpenSource R language , a section on libraries and functions available will be listed and shown in this paper. We will try to keep up to date with any new libraries that will make it much easier than using core R utilities.

Useful Libraries we will use for this report.

There are so many new and up and coming libraries that we will attempt to use the latest packages.

Once you have installed the packages onto your computer you just run the library(Package) command to load it into memory. To install packages or library’s use the following commands:

install.packages("dplyr")
library("dplyr")

Libraries used are listed below.

dplyr - data manipulation for viewing and manipulating data
reshape2 - data manipulation for switching between “wide” and “long” formats plus much more
rio - input and output of data in various formats
magrittr - pipe’s and data wrangling
xts - eXtensible Time Series data
zoo -

A full list of currently loaded packages can be had with the following commands:

(.packages())

## [1] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"  
## [7] "base"

Calculating BMI or Body Mass Index

We create vectors with random heights and weights. There are many ways to create vectors and a simple way is with the concatenate function written as c() , the other way is specifying the data type with the vector() function. Other simples ways are the use of the colon operator “:” . Another is seq.int() , seq_len() gives you the list, while length() gives you the number of items in the vector.

aa <- 3:12
seq_len(aa)
a <- seq.int(3, 12)     #same as 3:12
seq(from=1,to=100,by=2)  
b <- c(1,2,3,4,5)
c <- c(1:8)
d <- c("M","F")     # character list
v1 <- vector("numeric", 5)      # vector creates empty vectors with n number of items and type
v2 <- vector("complex", 5)
v3 <- vector("logical", 5)
v4 <- vector("character", 5)
v5 <- vector("list", 5)

For our initial R programming we will create vectors with concatenate statement. Our job is to perform Body Mass Index (BMI) calculations using metric system. BMI is calculated with kilos/meters

# weight <- c(60,72,57,90,95,72,88,75,62,98)  # kilograms
weight <- sample(60:98,10,replace = FALSE)

# height <- c(1.75,1.80,1.65,1.90,1.74,1.91,1.73,1.8,1.43,1.50) # meters
height <- round(rnorm(10,mean = 1.72,sd = 0.156))
length(weight)   # number of values

## [1] 10

bmi <- weight/height^2    # create a new variable called bmi
bmi

##  [1] 21.75 22.75 18.50 20.25 78.00 23.00 17.75 20.00 21.50 17.50

Basic Stats on weight and height

summary(weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   70.00   75.00   80.50   81.00   86.75   92.00

sd(weight)

## [1] 7.90218

summary(height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     2.0     2.0     1.9     2.0     2.0

sd(height)

## [1] 0.3162278

You can apply equations to complete vectors as shown below. We convert from Metric to Standard measurements.

wt <- weight * 2.2     # convert kilos to pounds
ht <- height * 39.37   # convert meters to inches
wt   # in pounds

##  [1] 191.4 200.2 162.8 178.2 171.6 202.4 156.2 176.0 189.2 154.0

ht   # in inches

##  [1] 78.74 78.74 78.74 78.74 39.37 78.74 78.74 78.74 78.74 78.74

We can also create character and data vectors all by code and build data.frame’s. Most importantly we can use regular expressions regexpr() and subset() from the base R language and “grep” or grab a subset of the data as shown below.

mydf = data.frame(a = c("Paul", "Kim", "Nora", "Sue", "Paul", "Kim"),
                  b = c("A", "A", "B", "B", "B", "C"),
                  c = rnorm(6))
mydf

##      a b          c
## 1 Paul A -0.1297188
## 2  Kim A  0.4617215
## 3 Nora B -0.4584925
## 4  Sue B -0.9395656
## 5 Paul B -0.4053395
## 6  Kim C  0.3494869

sapply(mydf, class)   # class function will output the type of data you are working with

##         a         b         c 
##  "factor"  "factor" "numeric"

A very important command to keep in memory is the subset command with regexpr or regular expressions, this works very well as “grep” does in Linux. You can extract just the data you need from a data.frame.

subset(mydf,regexpr("Paul",mydf$a) >0)  # used as grep to pull the actual rows matching "Paul"

##      a b          c
## 1 Paul A -0.1297188
## 5 Paul B -0.4053395

Creating a data.frame

We now have 2 vectors which we can column wise or row wise bind. cbind() and rbind() functions will create table structures but not data.frames which are much more useful. Random numbers will be used to create 100 patients.We will use the mean and sd from the original vectors. We also will show two different ways to create a data.frame , one from vectors and the other from a table .

weight <- rnorm(100,mean = 74.3,sd=15.4)
height <- rnorm(100,mean = 1.79,sd=0.20)
c_wh <- cbind(weight,height)      # create a column wise table
df1_wh <- data.frame(weight,height)  # create a data.frame with vectors
df2_wh <- data.frame(c_wh)      # create a data.frame from a table

A few commands to view what type of data you have are to use the str() and head() or tail() functions.

str(df1_wh)    # str() or structure shows you what type of data is inside the data.frame

## 'data.frame':    100 obs. of  2 variables:
##  $ weight: num  75.4 67.5 77.8 72.2 53.9 ...
##  $ height: num  1.82 1.8 1.36 1.79 1.75 ...

head(df1_wh)

##     weight   height
## 1 75.36985 1.816580
## 2 67.48871 1.804614
## 3 77.84542 1.360170
## 4 72.19008 1.793828
## 5 53.85101 1.746134
## 6 64.36698 2.074804

tail(df1_wh)

##       weight   height
## 95  72.35477 1.794170
## 96  98.97260 2.051765
## 97  61.48127 1.855896
## 98  60.86468 2.110587
## 99  99.24557 2.018285
## 100 64.22106 1.689397

class(df1_wh)  # shows you what class of data type you have.

## [1] "data.frame"

Manipulate data with dplyr

dplyr is a package for data manipulation, written and maintained by Hadley Wickham. The package makes it simple to work with your data First lets display the data. You can simply view all the data at once , which in our case would be very long, or choose to display the first few lines from the top and bottom .

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

##     weight   height
## 1 75.36985 1.816580
## 2 67.48871 1.804614
## 3 77.84542 1.360170
## 4 72.19008 1.793828
## 5 53.85101 1.746134
## 6 64.36698 2.074804

Mutate is used to add new variables to the data. For example lets adds a new column that displays the BMI. Notice that to keep the row we need to add it to a new variable.

bmi_tab01 <- mutate(df1_wh, BMI = (weight / height^2))
head(bmi_tab01)

##     weight   height      BMI
## 1 75.36985 1.816580 22.83961
## 2 67.48871 1.804614 20.72348
## 3 77.84542 1.360170 42.07721
## 4 72.19008 1.793828 22.43449
## 5 53.85101 1.746134 17.66195
## 6 64.36698 2.074804 14.95233

We now will change the names of the columns to be more specific and to add column’s with standard measurements. Vectorized Calculations are a great feature.

colnames(df1_wh) <- c("wtKilo", "htMeter")    # change column names
df1_wh <- mutate(df1_wh, wtPounds = wtKilo * 2.2)  # create a new column wtPounds conversion
df1_wh <- mutate(df1_wh, htInches = htMeter * 39.37) # create a new column htInches conversion
df1_wh <- mutate(df1_wh, BMI = (wtKilo / htMeter^2))  # create a new column with BMI 
head(df1_wh)    # view the first few rows

##     wtKilo  htMeter wtPounds htInches      BMI
## 1 75.36985 1.816580 165.8137 71.51875 22.83961
## 2 67.48871 1.804614 148.4752 71.04764 20.72348
## 3 77.84542 1.360170 171.2599 53.54988 42.07721
## 4 72.19008 1.793828 158.8182 70.62299 22.43449
## 5 53.85101 1.746134 118.4722 68.74530 17.66195
## 6 64.36698 2.074804 141.6074 81.68504 14.95233

Now to change the name of just one column from BMI to bmi on the saved data.frame df1_wh .

colnames(df1_wh)[which(names(df1_wh) == "BMI")] <- "bmi"
head(df1_wh)

##     wtKilo  htMeter wtPounds htInches      bmi
## 1 75.36985 1.816580 165.8137 71.51875 22.83961
## 2 67.48871 1.804614 148.4752 71.04764 20.72348
## 3 77.84542 1.360170 171.2599 53.54988 42.07721
## 4 72.19008 1.793828 158.8182 70.62299 22.43449
## 5 53.85101 1.746134 118.4722 68.74530 17.66195
## 6 64.36698 2.074804 141.6074 81.68504 14.95233

To work with factors, we will divide the data into males and females and analyze the data. To do that we will use the gl() or generate levels function. For help on the command line you can use the

help(gl)

Usage gl(n, k, length = n*k, labels = seq_len(n), ordered = FALSE)

sexMF <- gl(2,50,labels = c("M","F"),ordered = FALSE) # 2 * 50 = 100 total values, no order

Adding a column to a data.frame

Now add the column of sex to the data.frame .

df1_wh$sex <- sexMF
head(df1_wh)

##     wtKilo  htMeter wtPounds htInches      bmi sex
## 1 75.36985 1.816580 165.8137 71.51875 22.83961   M
## 2 67.48871 1.804614 148.4752 71.04764 20.72348   M
## 3 77.84542 1.360170 171.2599 53.54988 42.07721   M
## 4 72.19008 1.793828 158.8182 70.62299 22.43449   M
## 5 53.85101 1.746134 118.4722 68.74530 17.66195   M
## 6 64.36698 2.074804 141.6074 81.68504 14.95233   M

Basic statistics

Using library dplyr there are multiple ways to view your data and run basic stats by sorting and data manipulation.

##     wtKilo  htMeter wtPounds htInches      bmi sex
## 1 75.36985 1.816580 165.8137 71.51875 22.83961   M
## 2 67.48871 1.804614 148.4752 71.04764 20.72348   M
## 3 77.84542 1.360170 171.2599 53.54988 42.07721   M
## 4 72.19008 1.793828 158.8182 70.62299 22.43449   M
## 5 53.85101 1.746134 118.4722 68.74530 17.66195   M
## 6 64.36698 2.074804 141.6074 81.68504 14.95233   M

#sort df1_wh by sex and bmi
df1_wh <- arrange(df1_wh,sex,bmi)   # arranged by sex , bmi
df1_M <- subset(df1_wh,regexpr("M",df1_wh$sex) >0)  # grep just the "M" to calculate basic stats
df1_F <- subset(df1_wh,regexpr("F",df1_wh$sex) >0)  # grep just the "F" to calculate basic stats
summary(df1_M)    # complete stats per column for all table

##      wtKilo          htMeter         wtPounds         htInches    
##  Min.   : 39.97   Min.   :1.360   Min.   : 87.94   Min.   :53.55  
##  1st Qu.: 64.25   1st Qu.:1.599   1st Qu.:141.34   1st Qu.:62.96  
##  Median : 72.41   Median :1.746   Median :159.31   Median :68.72  
##  Mean   : 73.25   Mean   :1.722   Mean   :161.16   Mean   :67.80  
##  3rd Qu.: 81.92   3rd Qu.:1.826   3rd Qu.:180.23   3rd Qu.:71.89  
##  Max.   :111.39   Max.   :2.153   Max.   :245.05   Max.   :84.75  
##       bmi        sex   
##  Min.   :14.34   M:50  
##  1st Qu.:18.89   F: 0  
##  Median :24.75         
##  Mean   :25.44         
##  3rd Qu.:31.50         
##  Max.   :42.08

summary(df1_M$bmi) # just bmi stats

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.34   18.89   24.75   25.44   31.50   42.08

index_x <- 1:50                    # or length(df1_M$bmi)
#plot(index_x ~ df1_M$bmi)  # plot male bmi       
p <- plot(index_x, df1_M$bmi,pch=19)  # plot male bmi
fit <- lm(df1_M$bmi~index_x)
print(p)

## NULL

lines(index_x,predict(fit,data.frame(index_x=index_x)),col="red")       # first order equation curve fit

Input and Output to files

We now export out this data into different formats. Most are well known file extensions , such as :

extension	filetype
csv	Comma delimited
dta	Stata (c)
sas7bdat	SAS (c)
html	Web format

plus many more data import and exports types.

library("rio")
export(df1_M,"df1_M.csv",format = "csv")  # just male data
export(df1_wh, "df1_wh.csv",format = "csv")   # comma de-limited format"M" and "F" data
export(df1_wh, "df1_wh.dta")    # stata format

Now we will read data back into a data.frame and compare to original data:

df1M  <- import("df1_M.csv",stringsAsFactors = TRUE)    # read in just "M" male data
df1wh <- import("df1_wh.csv",stringsAsFactors = TRUE)   # read in the file , sex read as factor

A method to compare data.

all.equal(df1wh, df1_wh, check.attributes = FALSE)

## [1] TRUE

all.equal(df1M, df1_M, check.attributes = FALSE)

## [1] TRUE

More dplr functions

The filter function returns all the rows that satisfy a specific condition. For example , We will return all the rows where weight is larger than 90.

filter(bmi_tab01, weight > 90)  # greater than 90 kilograms

##       weight   height      BMI
## 1   99.26731 1.700194 34.34071
## 2   97.52160 1.826164 29.24297
## 3  100.70626 1.749616 32.89813
## 4   91.64154 1.872735 26.13003
## 5  111.38632 1.856897 32.30398
## 6  100.86397 1.836291 29.91252
## 7   98.10093 1.647469 36.14421
## 8   98.11402 1.830275 29.28859
## 9   96.06827 1.725218 32.27692
## 10  93.24085 1.834633 27.70178
## 11  92.94206 1.548377 38.76669
## 12  94.65366 1.573781 38.21630
## 13  96.00999 1.985412 24.35652
## 14 100.14895 2.128135 22.11301
## 15  98.97260 2.051765 23.51038
## 16  99.24557 2.018285 24.36386

Here we filter with two variables, weight >70 and height < 1.75

filter(bmi_tab01, weight > 70 & height < 1.75 )

##       weight   height      BMI
## 1   77.84542 1.360170 42.07721
## 2   75.36789 1.476742 34.56025
## 3   80.38828 1.638319 29.94995
## 4   84.52765 1.501525 37.49158
## 5   79.20496 1.572908 32.01441
## 6   72.47684 1.460973 33.95589
## 7   99.26731 1.700194 34.34071
## 8   71.40892 1.653312 26.12421
## 9   88.19961 1.649247 32.42616
## 10  71.96943 1.438291 34.79002
## 11 100.70626 1.749616 32.89813
## 12  82.43310 1.665639 29.71256
## 13  75.18110 1.617153 28.74791
## 14  82.49791 1.522543 35.58804
## 15  72.90619 1.632828 27.34534
## 16  71.48135 1.361715 38.54962
## 17  98.10093 1.647469 36.14421
## 18  72.35293 1.593263 28.50238
## 19  96.06827 1.725218 32.27692
## 20  71.15133 1.536602 30.13425
## 21  82.02188 1.686371 28.84186
## 22  92.94206 1.548377 38.76669
## 23  94.65366 1.573781 38.21630
## 24  75.69435 1.676842 26.92025
## 25  77.45263 1.667038 27.87054

Pipes

Pipes is a simple way to pass data from one function to another. Using two libraries - dylyr and magrittr - can simplyfy running multiple commands to export a file. The aggregate command is used as follows: Where “.” implies the complete data set. The tilde “~” says to aggregate the next column(s) . FUN = aggregate by mean function. You can place any function you want here. The command export is used as part of an R pipeline using library magrittr.

The following code uses pipes to export data to a file and multiple pipes to a file:

library("magrittr")
filter(df1_wh, weight > 90)

##      wtKilo  htMeter wtPounds htInches      bmi sex
## 1  68.12129 1.757475 149.8668 69.19181 22.05486   M
## 2  54.85431 1.557075 120.6795 61.30204 22.62516   M
## 3  72.35293 1.593263 159.1765 62.72677 28.50238   M
## 4  88.19961 1.649247 194.0392 64.93084 32.42616   M
## 5  99.26731 1.700194 218.3881 66.93664 34.34071   M
## 6  98.10093 1.647469 215.8220 64.86085 36.14421   M
## 7  71.48135 1.361715 157.2590 53.61073 38.54962   M
## 8  61.22848 1.983475 134.7027 78.08940 15.56324   F
## 9  47.53150 1.662924 104.5693 65.46930 17.18846   F
## 10 71.60606 1.989361 157.5333 78.32114 18.09350   F
## 11 60.91937 1.691445 134.0226 66.59218 21.29314   F
## 12 62.82219 1.661613 138.2088 65.41769 22.75378   F
## 13 82.02188 1.686371 180.4481 66.39244 28.84186   F
## 14 98.11402 1.830275 215.8508 72.05791 29.28859   F
## 15 71.15133 1.536602 156.5329 60.49601 30.13425   F
## 16 94.65366 1.573781 208.2381 61.95977 38.21630   F

filter(df1_wh, weight > 90) %>% export("weight90.csv")   # to a csv file
# 
head(df1_wh %>% subset(weight > 90 & height < 1.75) %>%  
       aggregate(. ~ bmi + wtKilo, data = ., FUN = mean))

##        bmi   wtKilo  htMeter wtPounds htInches sex
## 1 17.18846 47.53150 1.662924 104.5693 65.46930   2
## 2 21.29314 60.91937 1.691445 134.0226 66.59218   2
## 3 22.75378 62.82219 1.661613 138.2088 65.41769   2
## 4 22.05486 68.12129 1.757475 149.8668 69.19181   1
## 5 38.54962 71.48135 1.361715 157.2590 53.61073   1
## 6 28.50238 72.35293 1.593263 159.1765 62.72677   1

df1_wh %>% subset(weight > 90 & height < 1.75) %>%  
       aggregate(. ~ bmi + wtKilo, data = ., FUN = mean) %>% export(file = "df1_wh-01.csv") 
       # to a csv file

Basic Statistics

Our first statistical test is to see if our data is right or near the normal range , since we are trying to create a dataset with normal bmi’s.

The US National Average for Median Body Mass Index BMI for Men

Age	20-29	30-39	40-49	50-59	60-69
BMI	24.5	25.9	26.8	27.2	27.0

BMI Chart

BMI for normal weight , randomly acquired samples is between 20 - 29, or mean of 24.5. We will use the t-test to check the data. mu is used for theoretical mean of the sample in question. Notice how we reference the BMI column using the operator “$” .

t.test(df1_M$bmi,mu=24.5)

## 
##  One Sample t-test
## 
## data:  df1_M$bmi
## t = 0.91717, df = 49, p-value = 0.3635
## alternative hypothesis: true mean is not equal to 24.5
## 95 percent confidence interval:
##  23.37823 27.50544
## sample estimates:
## mean of x 
##  25.44183

Null Hypothesis

The one sample t-test is used for single vector randomly selected data that is assumed to come from a normal distribution. We use it to test for the NULL hypothesis. If you select BMI from a random group of individuals you would expect a normal distribution as show by the equation

\[N\left(\mu,\sigma^2\right)\]

where $\sigma$ = mean and $\sigma^2$ = variance.

Now compare Males to Females using the filter function on df1_wh We will do a curve fitting using a first Order Equation to see if there is any relationship

male <- filter(df1_wh,sex=="M")
x <- male$bmi
female <- filter(df1_wh,sex=="F")
y <- female$bmi
p <- plot(x,y,pch=19)
fit <- lm(y~x)
p

## NULL

lines(x,predict(fit,data.frame(x=x)),col="red")

Embedding Plots

You can also embed plots, for example:

plot(df1M$bmi,type="o")
title(main="Male BMI values", col.main="red", font.main=4)

Exploratoring Statistics with the R language

Carlos Martinez

November 10, 2016