Descriptive statistics starts with quantifying the centre and spread of a variable's distribution.
So, let’s load the “Lung Capacity Dataset” and produce numerical summaries of its categorical and numerical variables.
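# stringsAsFactors = TRUE keeps Smoke, Gender and Caesarean as factors (the default became FALSE in R 4.0.0)
LungCapData = read.table(file = "../Dataset/LungCapData.txt", header = TRUE, sep = "\t", stringsAsFactors = TRUE)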
head(LungCapData)
## LungCap Age Height Smoke Gender Caesarean
## 1 6.475 6 62.1 no male no
## 2 10.125 18 74.7 yes female no
## 3 9.550 16 69.7 no female yes
## 4 11.125 14 71.0 no male no
## 5 4.800 5 56.9 no male no
## 6 6.225 11 58.7 no female no
attach(LungCapData)
The output above shows the first six rows of the data and the kind of values each variable holds.
Now, let’s look at the options for summarizing categorical and numerical variables.
Categorical variables can be summarized with a frequency or proportion table. To produce a frequency table, we can use either the table() function or the summary() function :
table(Smoke)
## Smoke
## no yes
## 648 77
Or,
summary(Smoke)
## no yes
## 648 77
So, the categorical variable Smoke has \(648\) observations for no and \(77\) observations for yes.
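One caveat is worth a quick sketch: summary() returns counts only when the variable is a factor. In R 4.0.0 and later, read.table() keeps text columns as character unless stringsAsFactors = TRUE is passed, so the output can look different. The vector below is made-up toy data, not the LungCap data.

```r
# Toy vector standing in for a column such as Smoke (values are made up)
smoke_chr <- c("no", "yes", "no", "no")

# On a character vector, summary() reports only length, class and mode
summary(smoke_chr)

# Converting to a factor gives the per-level counts
smoke_fct <- factor(smoke_chr)
summary(smoke_fct)

# table() counts the values in either case
table(smoke_chr)
```

If summary() does not show counts for a column, wrapping the column in factor() or using table() should give the frequency table.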
To get the total number of observations in a variable, we use the length() command :
length(Smoke)
## [1] 725
Now, we can produce a proportion table for the Smoke variable as follows :
table(Smoke)/length(Smoke)
## Smoke
## no yes
## 0.8937931 0.1062069
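As an alternative to dividing by length(), base R also provides prop.table(), which divides a table by its total. The sketch below uses a made-up toy vector, not the LungCap data.

```r
# Toy data (made-up values, not the LungCap data)
smoke <- c("no", "no", "yes", "no", "yes")

counts <- table(smoke)        # frequency table
props  <- prop.table(counts)  # same as counts / sum(counts)

props
round(100 * props, 1)         # proportions expressed as percentages
```

Unlike length(), sum(counts) ignores missing values that table() dropped, so the two approaches can differ when the variable contains NA.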
To produce a two-way table, or contingency table, between Gender and Smoke, we can use table() as follows :
table(Smoke, Gender)
## Gender
## Smoke female male
## no 314 334
## yes 44 33
Note : The first variable passed to table() forms the rows and the second forms the columns.
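A two-way table is often easier to read with totals or proportions attached. The sketch below uses the base R functions addmargins() and prop.table() on made-up toy vectors; the variable names are only for illustration.

```r
# Toy data (made-up values, not the LungCap data)
smoke  <- c("no", "no", "yes", "no", "yes", "no")
gender <- c("male", "female", "male", "female", "female", "male")

tab <- table(smoke, gender)   # smoke in rows, gender in columns

addmargins(tab)               # append row and column totals
prop.table(tab, margin = 1)   # proportions within each row (smoke level)
prop.table(tab, margin = 2)   # proportions within each column (gender)
```

With margin = 1 each row sums to 1, and with margin = 2 each column sums to 1.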
We can get the mean of our numeric variable LungCap with the mean() function :
mean(LungCap)
## [1] 7.863148
To get a trimmed mean :
mean(LungCap, trim = 0.10)
## [1] 7.938081
Note : With the argument trim = 0.10, the lowest \(10\%\) and the highest \(10\%\) of the observations are dropped before the mean is calculated.
To calculate the median, we can use the median() command :
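To see what trimming does, here is a minimal sketch that reproduces the trimmed mean by hand on a made-up toy vector with one extreme value. It assumes floor(n * trim) observations are dropped from each end, which is how base R's mean() behaves for trim values below 0.5.

```r
# Toy data with one extreme value (made-up numbers)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
trim <- 0.10

n <- length(x)
k <- floor(n * trim)               # observations dropped from EACH end
kept <- sort(x)[(k + 1):(n - k)]   # the middle values that remain

mean(x)                # 14.5, pulled up by the extreme value
mean(kept)             # 5.5
mean(x, trim = trim)   # 5.5, matches the manual calculation
```

The trimmed mean is less sensitive to extreme values than the ordinary mean, which is why the two can differ noticeably.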
median(LungCap)
## [1] 8
To calculate the variance, we use the var() command :
var(LungCap)
## [1] 7.086288
To calculate the standard deviation, we can use the sd() command or simply pass the result of var() to the sqrt() function :
sd(LungCap)
## [1] 2.662008
Or,
sqrt(var(LungCap))
## [1] 2.662008
Similarly, an alternative way to calculate the variance is to square the standard deviation :
Variance = sd(LungCap) ^ 2
Variance
## [1] 7.086288
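Both var() and sd() compute the sample versions, which divide by \(n - 1\) rather than \(n\). A short sketch on a made-up toy vector shows the calculation by hand.

```r
# Toy data (made-up values, not the LungCap data)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (n - 1)

manual_var         # same value as var(x)
var(x)
sqrt(manual_var)   # same value as sd(x)
sd(x)
```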
To get the minimum and maximum values present in the LungCap variable :
#For Minimum
min(LungCap)
## [1] 0.507
and
#For Maximum
max(LungCap)
## [1] 14.675
We can get the range (the minimum and the maximum together) of the variable by using the range() command :
range(LungCap)
## [1] 0.507 14.675
Specific quantiles/percentiles of a variable can be calculated using the quantile() function.
To get the 90th percentile :
quantile(LungCap, probs = 0.90)
## 90%
## 11.205
To obtain multiple percentiles, we can pass a numeric vector to the probs argument :
quantile(LungCap, probs = c(0.2, 0.5, 0.7, 1))
## 20% 50% 70% 100%
## 5.645 8.000 9.375 14.675
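The quartiles are a common special case, and their difference is the interquartile range, which base R also exposes as IQR(). The sketch below uses a made-up toy vector and R's default quantile method.

```r
# Toy data (made-up values, not the LungCap data)
x <- 1:9

# First quartile, median and third quartile in one call
q <- quantile(x, probs = c(0.25, 0.50, 0.75))
q

# Interquartile range: third quartile minus first quartile
q[["75%"]] - q[["25%"]]
IQR(x)   # same value
```

The elements of the result are named after the requested percentiles, so they can be picked out by name as shown.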
To calculate the sum of all the entries of a variable :
sum(LungCap)
## [1] 5700.782
To calculate Pearson’s correlation between Age and LungCap, we can use the cor() function:
cor(LungCap, Age)
## [1] 0.8196749
Note : cor() computes Pearson’s correlation by default. For another type of correlation, we pass the additional argument method, which accepts "pearson", "kendall" or "spearman".
To get Spearman’s correlation between Age and LungCap :
cor(LungCap, Age, method = "spearman")
## [1] 0.8172464
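Spearman’s correlation is Pearson’s correlation computed on the ranks of the data, so it measures monotone rather than strictly linear association. The sketch below illustrates this on made-up toy vectors.

```r
# Toy data: y increases with x, but not linearly (made-up values)
x <- c(1, 2, 3, 4, 5, 6)
y <- x^3

cor(x, y)                        # Pearson: below 1, the relation is not linear
cor(x, y, method = "spearman")   # Spearman: 1, the relation is perfectly monotone
cor(rank(x), rank(y))            # Pearson on the ranks gives the same value
```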
The covariance between Age and LungCap can be calculated using the cov() command or, equivalently, by passing both variables to var().
cov(LungCap, Age)
## [1] 8.738289
Or,
var(LungCap, Age)
## [1] 8.738289
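Covariance and Pearson’s correlation are linked: dividing the covariance by the product of the two standard deviations gives the correlation. A quick check on made-up toy vectors:

```r
# Toy data (made-up values, not the LungCap data)
x <- c(3, 5, 8, 10, 12)
y <- c(2.1, 3.9, 6.2, 7.8, 9.1)

# Correlation is the covariance rescaled by both standard deviations
manual_cor <- cov(x, y) / (sd(x) * sd(y))

manual_cor
cor(x, y)   # same value
```

This is why the correlation always lies between -1 and 1, while the covariance depends on the units of the variables.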
We can use the summary() command to get most of these summaries for a variable at once :
summary(LungCap)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.507 6.150 8.000 7.863 9.800 14.675
As we can see, the output of the summary() function gives us the minimum, first quartile, median, mean, third quartile and maximum of the variable.
To get help on any of these functions, just pass its name to the help() command or type it after ?, as follows :
help("mean") #or,
?mean
We can also pass the entire dataset inside the summary() command to get a cursory look over the summary of each variable :
summary(LungCapData)
## LungCap Age Height Smoke Gender
## Min. : 0.507 Min. : 3.00 Min. :45.30 no :648 female:358
## 1st Qu.: 6.150 1st Qu.: 9.00 1st Qu.:59.90 yes: 77 male :367
## Median : 8.000 Median :13.00 Median :65.40
## Mean : 7.863 Mean :12.33 Mean :64.84
## 3rd Qu.: 9.800 3rd Qu.:15.00 3rd Qu.:70.30
## Max. :14.675 Max. :19.00 Max. :81.80
## Caesarean
## no :561
## yes:164
##
##
##
##
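summary() does not report the standard deviation or the class of each column. As a rough sketch, str() shows the classes, and sapply() can apply a function such as sd() to the numeric columns only. The data frame below is a made-up stand-in that borrows a few column names from the dataset; its values are not the real data.

```r
# Made-up stand-in for a data frame such as LungCapData
df <- data.frame(
  LungCap = c(6.5, 10.1, 9.6, 11.1),
  Age     = c(6, 18, 16, 14),
  Smoke   = factor(c("no", "yes", "no", "no"))
)

str(df)                           # class and first values of each column

is_num <- sapply(df, is.numeric)  # TRUE for the numeric columns
col_sd <- sapply(df[is_num], sd)  # standard deviation of each numeric column
col_sd
```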