Statistics Fundamentals

It is fair to say that statistics begins with an interest in quantifying the centre and spread of the distribution of a variable.

So, let’s load the “Lung Capacity Dataset” to produce numerical summaries of its categorical and numerical variables.

LungCapData = read.table(file = "../Dataset/LungCapData.txt", header = TRUE, sep = "\t")
head(LungCapData)

##   LungCap Age Height Smoke Gender Caesarean
## 1   6.475   6   62.1    no   male        no
## 2  10.125  18   74.7   yes female        no
## 3   9.550  16   69.7    no female       yes
## 4  11.125  14   71.0    no   male        no
## 5   4.800   5   56.9    no   male        no
## 6   6.225  11   58.7    no female        no

attach(LungCapData)

The above table gives us a brief look at the data and at the kind of values each variable holds.

Now, let’s look at the various options for summarizing categorical and numerical variables.

Summarizing Categorical Variables

Categorical variables can be summarized using a frequency or proportion table. To produce a frequency table, we can use either the table() function or the summary() function (summary() returns counts only when the variable is stored as a factor) :

table(Smoke)

## Smoke
##  no yes 
## 648  77

Or,

summary(Smoke)

##  no yes 
## 648  77

So, we have \(648\) observations for no and \(77\) observations for yes in the categorical variable Smoke.

To get the total number of observations in a variable, we use the length() command :

length(Smoke)

## [1] 725

Now, we can produce a proportion table for the Smoke variable by dividing the frequency table by the number of observations :

table(Smoke)/length(Smoke)

## Smoke
##        no       yes 
## 0.8937931 0.1062069
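As a hedged aside, base R's prop.table() function performs the same division. The sketch below uses a small made-up vector (smoke_toy is hypothetical and is not the Lung Capacity data), so that it runs on its own :

```r
# Hypothetical toy data standing in for the Smoke variable
smoke_toy <- c("no", "no", "yes", "no", "yes", "no", "no", "no")

# Frequency table of the toy variable
freq <- table(smoke_toy)

# Proportions obtained by dividing by the number of observations
prop_manual <- freq / length(smoke_toy)

# The same proportions obtained with prop.table()
prop_builtin <- prop.table(freq)

print(prop_manual)
print(prop_builtin)
```

Both approaches should give identical proportions, and the proportions in each table sum to 1.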

To produce a two-way table, or contingency table, between Gender and Smoke, we can use the table() function as follows :

table(Smoke, Gender)

##      Gender
## Smoke female male
##   no     314  334
##   yes     44   33

Note : The first variable passed in the table() function is taken into the rows.
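To illustrate the note above, here is a minimal sketch on made-up data (smoke_toy and gender_toy are hypothetical vectors, not the Lung Capacity data). It shows that swapping the arguments transposes the table, and that prop.table() with margin = 1 gives proportions within each row :

```r
# Hypothetical toy data; not the Lung Capacity data
smoke_toy  <- c("no", "no", "yes", "no", "yes", "no")
gender_toy <- c("female", "male", "female", "male", "male", "female")

# The first argument goes to the rows, the second to the columns
tab <- table(smoke_toy, gender_toy)
print(tab)

# Swapping the arguments transposes the table
tab_swapped <- table(gender_toy, smoke_toy)
print(tab_swapped)

# Proportions within each row (margin = 1); each row sums to 1
row_props <- prop.table(tab, margin = 1)
print(row_props)
```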

Summarizing Numeric Variables

We can get the mean of our numeric variable LungCap with the mean() function :

mean(LungCap)

## [1] 7.863148

To get a trimmed mean :

mean(LungCap, trim = 0.10)

## [1] 7.938081

Note : With the argument trim = 0.10, the top and bottom \(10\%\) of the observations are removed before the mean is calculated.
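The effect of trimming can be checked by hand. The following is a small sketch on a made-up vector (x_toy is hypothetical) with one extreme value; it sorts the data, drops the lowest and highest \(10\%\) of observations, and compares the result with mean(..., trim = 0.10) :

```r
# Hypothetical toy data with one extreme value
x_toy <- c(1:9, 100)

# Ordinary mean, pulled upwards by the extreme value
plain_mean <- mean(x_toy)

# Trimmed mean computed by mean()
trimmed_builtin <- mean(x_toy, trim = 0.10)

# Trimmed mean computed by hand:
# drop k observations from each end of the sorted data
k <- floor(length(x_toy) * 0.10)
x_sorted <- sort(x_toy)
trimmed_manual <- mean(x_sorted[(k + 1):(length(x_sorted) - k)])

print(plain_mean)
print(trimmed_builtin)
print(trimmed_manual)
```

With ten observations, one value is dropped from each end, so the extreme value no longer affects the result.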

To calculate the median, we can use the median() command :

median(LungCap)

## [1] 8

To calculate the variance, we use the var() command :

var(LungCap)

## [1] 7.086288

To calculate the standard deviation, we can use the sd() command or simply pass the result of var() to the sqrt() function :

sd(LungCap)

## [1] 2.662008

Or,

sqrt(var(LungCap))

## [1] 2.662008

Similarly, an alternative way to calculate the variance is to square the standard deviation :

Variance = sd(LungCap) ^ 2
Variance

## [1] 7.086288
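For reference, var() computes the sample variance, which divides the sum of squared deviations from the mean by \(n - 1\) rather than \(n\). A minimal sketch on a made-up vector (x_toy is hypothetical) that checks this by hand :

```r
# Hypothetical toy data
x_toy <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x_toy)

# Sample variance by hand: squared deviations divided by n - 1
var_manual <- sum((x_toy - mean(x_toy))^2) / (n - 1)

# Standard deviation by hand: square root of the variance
sd_manual <- sqrt(var_manual)

# Compare with the built-in functions
print(c(manual = var_manual, builtin = var(x_toy)))
print(c(manual = sd_manual, builtin = sd(x_toy)))
```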

To get the minimum and maximum values present in the LungCap variable :

#For Minimum
min(LungCap)

## [1] 0.507

and

#For Maximum
max(LungCap)

## [1] 14.675

We can get both the minimum and the maximum of the variable in one call by using the range() command :

range(LungCap)

## [1]  0.507 14.675

Specific quantiles/percentiles can be calculated for the variable, using the quantile() function :

To get the 90th percentile :

quantile(LungCap, probs = 0.90)

##    90% 
## 11.205

To obtain multiple percentiles, we can pass a vector of probabilities to the probs argument :

quantile(LungCap, probs = c(0.2, 0.5, 0.7, 1))

##    20%    50%    70%   100% 
##  5.645  8.000  9.375 14.675
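As the 100% entry above suggests, a probability of 1 returns the maximum; likewise, a probability of 0 returns the minimum. A small sketch on a made-up vector (x_toy is hypothetical) showing the minimum, the three quartiles and the maximum in a single call :

```r
# Hypothetical toy data
x_toy <- 1:9

# Minimum, quartiles and maximum in one call
q <- quantile(x_toy, probs = c(0, 0.25, 0.5, 0.75, 1))
print(q)

# The 0% and 100% quantiles match min() and max()
print(c(min(x_toy), max(x_toy)))
```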

To calculate the sum of all the entries of a variable :

sum(LungCap)

## [1] 5700.782

To calculate the Pearson’s correlation between Age and LungCap, we can use the cor() function:

cor(LungCap, Age)

## [1] 0.8196749

Note : By default, cor() computes Pearson’s correlation. For any other type of correlation, we pass the additional argument method.

To get Spearman’s correlation between Age and LungCap :

cor(LungCap, Age, method = "spearman")

## [1] 0.8172464
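Spearman’s correlation is Pearson’s correlation computed on the ranks of the observations. The following sketch checks this on made-up data (age_toy and lung_toy are hypothetical vectors, not the Lung Capacity data) :

```r
# Hypothetical toy data; not the Lung Capacity data
age_toy  <- c(5, 7, 9, 12, 15, 18)
lung_toy <- c(3.1, 5.2, 4.0, 8.9, 9.3, 9.1)

# Spearman's correlation via the method argument
spearman_builtin <- cor(lung_toy, age_toy, method = "spearman")

# Pearson's correlation applied to the ranks
spearman_manual <- cor(rank(lung_toy), rank(age_toy))

print(spearman_builtin)
print(spearman_manual)
```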

The covariance between Age and LungCap can be calculated using either the cov() command or the var() command :

cov(LungCap, Age)

## [1] 8.738289

Or,

var(LungCap, Age)

## [1] 8.738289
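Covariance and Pearson’s correlation are closely linked: the correlation is the covariance divided by the product of the two standard deviations. A minimal sketch on made-up data (age_toy and lung_toy are hypothetical vectors) that verifies this relationship :

```r
# Hypothetical toy data; not the Lung Capacity data
age_toy  <- c(5, 7, 9, 12, 15, 18)
lung_toy <- c(3.1, 5.2, 4.0, 8.9, 9.3, 9.1)

# Covariance computed in two equivalent ways
cov_xy <- cov(lung_toy, age_toy)
var_xy <- var(lung_toy, age_toy)

# Correlation rebuilt from the covariance and the standard deviations
cor_manual <- cov_xy / (sd(lung_toy) * sd(age_toy))

print(c(cov = cov_xy, var = var_xy))
print(c(manual = cor_manual, builtin = cor(lung_toy, age_toy)))
```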

We can use the summary() command to get most of these summaries for a variable at once :

summary(LungCap)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.507   6.150   8.000   7.863   9.800  14.675

As we can see, the output of the summary() function gives us the following summary information for the variable :

Minimum
1st quartile (25th percentile)
Median (50th Percentile/2nd quartile)
Mean
3rd quartile (75th percentile)
Maximum

To get help on any of these summary functions, pass its name to the help() command or type it after ?, as follows :

help("mean") #or,
?mean

We can also pass the entire dataset inside the summary() command to get a cursory look over the summary of each variable :

summary(LungCapData)

##     LungCap            Age            Height      Smoke        Gender   
##  Min.   : 0.507   Min.   : 3.00   Min.   :45.30   no :648   female:358  
##  1st Qu.: 6.150   1st Qu.: 9.00   1st Qu.:59.90   yes: 77   male  :367  
##  Median : 8.000   Median :13.00   Median :65.40                         
##  Mean   : 7.863   Mean   :12.33   Mean   :64.84                         
##  3rd Qu.: 9.800   3rd Qu.:15.00   3rd Qu.:70.30                         
##  Max.   :14.675   Max.   :19.00   Max.   :81.80                         
##  Caesarean
##  no :561  
##  yes:164  
##           
##           
##           
##