Learning Objectives

Load Libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openxlsx)

Data Types

In biostatistics, variables fall into two main groups: categorical (nominal or ordinal) and numerical (discrete or continuous).

R has five essential data types: character, numeric (double), integer, logical, and complex. These data types are used to create data structures such as vectors, lists, and data frames.

num_var <- 156.8
num_var
## [1] 156.8
typeof(num_var)
## [1] "double"
char_var <- 'SSPPS'
char_var
## [1] "SSPPS"
typeof(char_var)
## [1] "character"
log_var <- TRUE
log_var
## [1] TRUE
typeof(log_var)
## [1] "logical"
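The chunk above covers double, character, and logical values. A minimal sketch of two further basic types, integer and complex, is shown here; the variable names `int_var` and `cplx_var` are illustrative and not part of the original lecture code.

```r
int_var <- 42L            # the L suffix stores the value as an integer, not a double
typeof(int_var)
## [1] "integer"
cplx_var <- 3 + 2i        # complex number with real part 3 and imaginary part 2
typeof(cplx_var)
## [1] "complex"
is.numeric(int_var)       # integers and doubles both count as numeric
## [1] TRUE
```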

An atomic vector is a fundamental data structure in R. It contains elements of the same type.

names_vec_char <- c('Zaid','Mark','SSPPS')
names_vec_char
## [1] "Zaid"  "Mark"  "SSPPS"
typeof(names_vec_char)
## [1] "character"
numbers_vec <- c(1,2,3,4,5,6,7,8,9,10)
numbers_vec
##  [1]  1  2  3  4  5  6  7  8  9 10
numbers_vec+2 #add 2 to each element
##  [1]  3  4  5  6  7  8  9 10 11 12
numbers_vec*2 #multiply each element by 2
##  [1]  2  4  6  8 10 12 14 16 18 20
numbers_vec/2 #divide each element by 2
##  [1] 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
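Because an atomic vector can hold only one type, R silently coerces mixed input to the most flexible type present. The sketch below illustrates this with two made-up vectors (`mixed_vec` and `num_log_vec` are illustrative names, not from the lecture data).

```r
mixed_vec <- c(1, "two", TRUE)      # a single character value forces everything to character
mixed_vec
## [1] "1"    "two"  "TRUE"
typeof(mixed_vec)
## [1] "character"
num_log_vec <- c(1.5, TRUE, FALSE)  # logicals become 1 and 0 when mixed with numbers
num_log_vec
## [1] 1.5 1.0 0.0
typeof(num_log_vec)
## [1] "double"
```

This is worth checking after importing data, since one stray text entry in a numeric column turns the whole column into character.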

A data frame is the most common data structure for storing data for analysis in R. Essentially, a data frame is a list of equal-length vectors.

setwd("E:/Biostat and Study Design/204/Lectures/Data") #set working directory
NHANES_df <- openxlsx::read.xlsx('NHEFS.xlsx') #load data
#get the type of variables
typeof(NHANES_df$age)
## [1] "double"
typeof(NHANES_df$sex)
## [1] "character"
typeof(NHANES_df$cholesterol)
## [1] "double"
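To see that a data frame really is a list of equal-length vectors, the following sketch builds a small one by hand. The data frame `toy_df` and its values are hypothetical and only serve as an illustration; they are not taken from the NHEFS file.

```r
# Hypothetical three-row data frame built from three equal-length vectors
toy_df <- data.frame(
  id  = c(1, 2, 3),
  sex = c("Female", "Male", "Female"),
  age = c(34, 51, 46),
  stringsAsFactors = FALSE   # keep sex as character in older R versions too
)
typeof(toy_df)       # the underlying type is a list
## [1] "list"
length(toy_df)       # the list has one element per column
## [1] 3
nrow(toy_df)         # every column has the same number of elements
## [1] 3
typeof(toy_df$sex)   # each column keeps its own type
## [1] "character"
```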

Descriptive Analytics

Descriptive statistics are used to describe and summarize data in a meaningful way. Paired with data visualization, they form the basis of quantitative data analysis.

Continuous Variables

Mean is the average value in the data set. It is calculated by summing all observations and dividing by the number of observations. It has the following formula:

\[\bar{x} = \frac{\sum_{i=1}^{n} x_{i}}{n}\]

where \(x_{i}\) is the \(i^{\text{th}}\) observation of the group, and \(n\) is the number of observations.

Median is the middle value of the data set when it is sorted in ascending order. When the number of observations is even, it is the average of the two middle values.

Variance measures the variability of observations around their arithmetic mean. The sample variance is the sum of squared deviations from the mean divided by \(n - 1\), which is approximately the average squared deviation. It has the following formula:

\[s^{2} = \frac{\sum_{i=1}^{n} (x_{i} - \bar{x})^{2}}{n - 1}\]

where \(x_{i}\) is the \(i^{\text{th}}\) observation of the group, \(\bar{x}\) is the mean of the group, and \(n\) is the number of observations.

Standard deviation measures the typical departure of observations from the mean. It is the square root of the variance.

\[s = \sqrt{s^{2}} = \sqrt{\frac{\sum_{i=1}^{n} (x_{i} - \bar{x})^{2}}{n - 1}}\]

where \(x_{i}\) is the \(i^{\text{th}}\) observation of the group, \(\bar{x}\) is the mean of the group, and \(n\) is the number of observations. These are the quantities that R's var() and sd() functions compute.

Min and Max are the minimum and maximum values of the data set.

1st quartile is the value below which 25% of the observations in the data set fall. 3rd quartile is the value below which 75% of the observations in the data set fall.

# Calculate mean, standard deviation, median, min, max, and quantiles
mean(NHANES_df$age,na.rm = TRUE) #mean
## [1] 43.91529
sd(NHANES_df$age,na.rm = TRUE) # standard deviation
## [1] 12.17043
median(NHANES_df$age,na.rm = TRUE) #median
## [1] 44
min(NHANES_df$age,na.rm = TRUE) #min
## [1] 25
max(NHANES_df$age,na.rm = TRUE) #max
## [1] 74
quantile(NHANES_df$age,na.rm = TRUE) #quantiles
##   0%  25%  50%  75% 100% 
##   25   33   44   53   74
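As a check on the definitions above, the mean, variance, and standard deviation can be computed step by step and compared with R's built-in functions. This is a minimal sketch on a small made-up vector `x` (not the NHEFS data). Note that R's var() and sd() divide by \(n - 1\), not \(n\).

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # hypothetical sample of 8 observations
n <- length(x)
x_bar <- sum(x) / n              # mean: sum of observations divided by n
x_bar
## [1] 5
ss <- sum((x - x_bar)^2)         # sum of squared deviations from the mean
ss / (n - 1)                     # sample variance by hand
## [1] 4.571429
var(x)                           # built-in variance gives the same value
## [1] 4.571429
sqrt(ss / (n - 1))               # sample standard deviation by hand
## [1] 2.13809
sd(x)                            # built-in standard deviation gives the same value
## [1] 2.13809
```

Dividing by \(n\) instead would give 4 for the variance and 2 for the standard deviation, which is why hand calculations sometimes disagree slightly with R's output.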

A frequency distribution histogram allows us to see the distribution of many values by dividing the range of values into a set of smaller ranges (bins) and then graphing the number of values in each bin. Histograms can be used to inspect the shape of the distribution of the data, identify the location of the center of the data, evaluate the spread of the data, and identify outliers.

There are multiple methods to approximate the “best” number of bins. The easiest method is to use the rounded square root of the number of observations. These methods are often sensitive to outliers. Therefore, it is best to only use them to guide the selection of the number of bins.

# generate a random sample of 1,000 values from a normal distribution with a mean of 10 and a standard deviation of 2.5
normal_distribution_df <- data.frame(nd_values = rnorm(1000, mean = 10, sd = 2.5))

sqrt(1000) # calculate the number of bins as the square root of the number of observations
## [1] 31.62278
# basic histogram
normal_distribution_df %>% ggplot(aes(x = nd_values)) +
  geom_histogram(bins = 32, color = "darkblue", fill = "lightblue") + theme_bw()
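The square-root rule is only one of several rules of thumb. As a rough comparison, base R ships three others in the grDevices package: nclass.Sturges(), nclass.scott(), and nclass.FD() (Freedman-Diaconis). The sketch below applies them to a freshly simulated sample; the seed value and the vector name `x` are arbitrary choices made for this illustration.

```r
set.seed(204)                           # arbitrary seed so the sample is reproducible
x <- rnorm(1000, mean = 10, sd = 2.5)   # same distribution as the example above

ceiling(sqrt(length(x)))   # square-root rule
## [1] 32
nclass.Sturges(x)          # Sturges' rule: depends only on the number of observations
## [1] 11
nclass.scott(x)            # Scott's rule: based on the standard deviation
nclass.FD(x)               # Freedman-Diaconis rule: based on the interquartile range
```

The rules can disagree substantially, which supports treating any of them as a starting point and then adjusting the number of bins by eye.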

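A normally distributed data set is symmetric and bell-shaped, so its mean and median are approximately equal. Note that mean = median indicates symmetry and is not by itself enough to conclude that the data are normally distributed. A distribution of data is skewed if it is not symmetric and extends more to one side than to the other. Data skewed to the right have a longer right tail, which typically pulls the mean above the median. Data skewed to the left have a longer left tail, which typically pulls the mean below the median.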
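The relationship between skewness and the mean and median can be illustrated with simulated data. This is a sketch under assumed settings: the seed, the sample size, and the choice of an exponential distribution as the right-skewed example are all illustrative.

```r
set.seed(204)                                    # arbitrary seed for reproducibility

symmetric    <- rnorm(10000, mean = 10, sd = 2.5)  # symmetric, bell-shaped sample
right_skewed <- rexp(10000, rate = 1)              # long right tail

# In the symmetric sample the mean and median nearly coincide
c(mean = mean(symmetric), median = median(symmetric))

# In the right-skewed sample the long tail pulls the mean above the median
c(mean = mean(right_skewed), median = median(right_skewed))
mean(right_skewed) > median(right_skewed)
## [1] TRUE
```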


When a data set is normally distributed, about 68% of all values fall within 1 standard deviation of the mean, about 95% of all values fall within 2 standard deviations of the mean, and about 99.7% of all values fall within 3 standard deviations of the mean.
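These percentages can be verified from the standard normal cumulative distribution function, pnorm(), by subtracting the area below \(-k\) standard deviations from the area below \(+k\) standard deviations:

```r
pnorm(1) - pnorm(-1)   # proportion within 1 standard deviation of the mean
## [1] 0.6826895
pnorm(2) - pnorm(-2)   # proportion within 2 standard deviations of the mean
## [1] 0.9544997
pnorm(3) - pnorm(-3)   # proportion within 3 standard deviations of the mean
## [1] 0.9973002
```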


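A boxplot with whiskers summarizes the distribution of the data without showing every value. The box spans the 1st to the 3rd quartile, the line inside the box marks the median, and the whiskers extend to the most extreme values that lie within 1.5 times the interquartile range of the box. Points beyond the whiskers are drawn individually as potential outliers.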


sqrt(nrow(NHANES_df)) #calculate the number of bins
## [1] 40.36087
NHANES_df %>% ggplot(aes(x=age)) +
    geom_histogram(bins=40, fill="deepskyblue", color="black") + 
    theme_light() #generate histogram

summary(NHANES_df$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   25.00   33.00   44.00   43.92   53.00   74.00
NHANES_df %>%  ggplot(aes(x=age)) +
    stat_boxplot(geom = 'errorbar',  width = 0.2) + 
    geom_boxplot(fill='deepskyblue',outlier.colour="red", outlier.size=4) +
    theme_light()
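The points that a boxplot flags as outliers are commonly defined as values lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The following sketch applies that rule by hand to a small made-up vector; `x`, `lower_fence`, and `upper_fence` are illustrative names, and the data are hypothetical.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)    # hypothetical data with one extreme value

q1 <- unname(quantile(x, 0.25))          # 1st quartile
q3 <- unname(quantile(x, 0.75))          # 3rd quartile
iqr <- q3 - q1                           # interquartile range, same as IQR(x)

lower_fence <- q1 - 1.5 * iqr            # values below this are flagged
upper_fence <- q3 + 1.5 * iqr            # values above this are flagged
c(lower_fence, upper_fence)
## [1] -3.5 14.5

outliers <- x[x < lower_fence | x > upper_fence]
outliers
## [1] 30
```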

Categorical Variables

Frequency is the number of observations in each category of the data set.

Proportion is the fraction (or percentage) of the data set that each category accounts for.

NHANES_df$sex <- as.factor(NHANES_df$sex) #convert to factor
table(NHANES_df$sex)
## 
## Female   Male 
##    830    799
prop.table(table(NHANES_df$sex))
## 
##   Female     Male 
## 0.509515 0.490485
NHANES_table <- table(sex=NHANES_df$sex,asthma=NHANES_df$asthma)
round(prop.table(NHANES_table,margin=1),2) #by rows
##         asthma
## sex         0    1
##   Female 0.94 0.06
##   Male   0.96 0.04
round(prop.table(NHANES_table,margin=2),2) # by columns
##         asthma
## sex         0    1
##   Female 0.50 0.62
##   Male   0.50 0.38
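The `margin` argument of prop.table() decides what the proportions add up to: with `margin=1` each row sums to 1, and with `margin=2` each column sums to 1. A small sketch on hypothetical data (the `group` and `outcome` vectors are made up for illustration) makes this explicit.

```r
# Hypothetical two-way table: 3 observations in group A, 2 in group B
toy_table <- table(
  group   = c("A", "A", "A", "B", "B"),
  outcome = c("no", "yes", "no", "no", "yes")
)

row_props <- prop.table(toy_table, margin = 1)  # proportions within each group
rowSums(row_props)                              # every row sums to 1

col_props <- prop.table(toy_table, margin = 2)  # proportions within each outcome
colSums(col_props)                              # every column sums to 1
```

Row proportions answer "what share of each group has the outcome?", while column proportions answer "what share of each outcome belongs to each group?".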

A bar graph uses bars of equal width to show the frequencies of the categories of a categorical variable. It shows the relative distribution of the data, which makes it easier to compare the different categories.

NHANES_df %>%  ggplot(aes(x=race)) +
    geom_bar(fill="deepskyblue", color="black") +
    theme_light()

A pie chart is a very common graph that depicts categorical data as slices of a circle, in which the size of each slice is proportional to the frequency count for the category.

NHANES_df %>%  ggplot(aes(x="", fill=race)) +
    geom_bar(width = 1) +
    coord_polar("y", start=0) +
    labs(fill = "Patient race") +
    theme_void()