Data Exploration

Learning Objectives

Develop an understanding of basic data types and data structures
Be able to calculate mean, median, variance, standard deviation, proportions, and frequencies
Learn how to plot frequency histograms, box plots, bar graphs, and pie charts

Lecture Link

https://rpubs.com/zaidyousif/1351996

Load Libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openxlsx)

Data Types

In biostatistics, these are the main data types:

Numerical: values on the number line such as age (years), heart rate (bpm), serum creatinine (mg/dL).
Categorical: observations take one of a limited set of values (categories):
- Binary: observations that take one of two values such as smoking status (smoker vs. non-smoker).
- Ordinal: categories (levels) can be ranked such as cancer stage (I, II, III, IV).
- Nominal: categories (levels) cannot be ranked such as blood type (A, B, AB or O).
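
As a rough illustration of how these statistical types can be represented in R, the sketch below uses a logical vector for a binary variable, an unordered factor for a nominal variable, and an ordered factor for an ordinal variable. The values and the variable names (smoker, blood_type, cancer_stage) are hypothetical and are not taken from the course data set.

```r
# Hypothetical example values, not taken from the course data set
smoker <- c(TRUE, FALSE, FALSE, TRUE)                        # binary

blood_type <- factor(c("A", "O", "AB", "O"),
                     levels = c("A", "B", "AB", "O"))       # nominal

cancer_stage <- factor(c("II", "I", "IV", "II"),
                       levels = c("I", "II", "III", "IV"),
                       ordered = TRUE)                       # ordinal

levels(blood_type)        # categories with no inherent order
is.ordered(blood_type)    # FALSE
is.ordered(cancer_stage)  # TRUE
cancer_stage < "III"      # ranking comparisons work for ordered factors
```

Declaring the levels explicitly keeps categories that happen to be absent from the sample (blood type B here) and fixes the ranking used for ordinal comparisons.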

R has five essential data types. These data types are used to create data structures such as vectors, lists, and data frames.

integer: 1L, 2L, 6L (the L suffix marks an integer; a bare 1 is stored as a double)
double: 1.0, 2.3, 6.8
character: “Zaid”, “Mark”, “SSPPS”
logical: TRUE, FALSE
complex: 1+4i
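
The integer and complex types are easy to overlook, so here is a small sketch that checks them with typeof(). The variable names are illustrative only.

```r
int_var <- 6L        # the L suffix requests an integer
dbl_var <- 6         # without the suffix, R stores a double
cplx_var <- 1 + 4i   # complex number with real part 1 and imaginary part 4

typeof(int_var)      # "integer"
typeof(dbl_var)      # "double"
typeof(cplx_var)     # "complex"
is.integer(dbl_var)  # FALSE, even though the value looks like a whole number
```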

num_var <- 156.8
num_var

## [1] 156.8

typeof(num_var)

## [1] "double"

char_var <- 'SSPPS'
char_var

## [1] "SSPPS"

typeof(char_var)

## [1] "character"

log_var <- TRUE
log_var

## [1] TRUE

typeof(log_var)

## [1] "logical"

An atomic vector is a fundamental data structure in R. It contains elements of the same type.
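
Because every element must share one type, R silently coerces mixed input to the most flexible type present. A minimal sketch with made-up values:

```r
mixed_vec <- c(1, "a", TRUE)
mixed_vec               # "1" "a" "TRUE": everything becomes character
typeof(mixed_vec)       # "character"

typeof(c(1L, 2.5))      # "double": integer is promoted to double
typeof(c(TRUE, 2L))     # "integer": logical is promoted to integer
```

The coercion order is logical, then integer, then double, then character, which is why a single stray text value turns a numeric column into character.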

names_vec_char <- c('Zaid','Mark','SSPPS')
names_vec_char

## [1] "Zaid"  "Mark"  "SSPPS"

typeof(names_vec_char)

## [1] "character"

numbers_vdc <- c(1,2,3,4,5,6,7,8,9,10)
numbers_vdc
numbers_vdc+2 #add 2 to each element
numbers_vdc*2 #multiply each element by 2
numbers_vdc/2 #divide each element by 2

##  [1]  1  2  3  4  5  6  7  8  9 10
##  [1]  3  4  5  6  7  8  9 10 11 12
##  [1]  2  4  6  8 10 12 14 16 18 20
##  [1] 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

A data frame is the most common data structure for storing data for analysis in R. Essentially, a data frame is a list of equal-length vectors, one per column.
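
To make the "list of equal-length vectors" idea concrete, the sketch below builds a tiny data frame by hand. The data frame patients_df and its values are hypothetical.

```r
# Hypothetical three-patient data frame
patients_df <- data.frame(
  id  = c(1L, 2L, 3L),
  age = c(34, 51, 47),
  sex = c("Female", "Male", "Female")
)

is.list(patients_df)         # TRUE: a data frame is a list underneath
lengths(patients_df)         # every column has length 3
sapply(patients_df, typeof)  # each column keeps its own type
```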

setwd("E:/Biostat and Study Design/204/Lectures/Data") #set working directory
NHANES_df <- openxlsx::read.xlsx('NHEFS.xlsx') #load data

#get the type of variables
typeof(NHANES_df$age)

## [1] "double"

typeof(NHANES_df$sex)

## [1] "character"

typeof(NHANES_df$cholesterol)

## [1] "double"

Descriptive Analytics

Descriptive statistics are used to describe and summarize data in a meaningful way. Paired with data visualization, they form the basis of quantitative data analysis.

Continuous Variables

Mean is the average value in the data set. It is calculated by summing all observations and dividing by the number of observations. It has the following formula:

\[\bar{x} = \frac{\sum_{i=1}^{n} x_{i}}{n}\]

where \(x_{i}\) is the \(i^{\text{th}}\) observation of the group, and \(n\) is the number of observations.

Median is the middle value of the data set after it is sorted in ascending order. When the number of observations is even, it is the average of the two middle values.
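
The mean and median definitions can be checked by hand on a small vector. The values below are made up for illustration.

```r
x <- c(2, 4, 4, 5, 10)             # hypothetical observations

manual_mean <- sum(x) / length(x)  # 25 / 5
manual_mean                        # 5
mean(x)                            # 5

sort(x)[3]                         # middle of 5 sorted values: 4
median(x)                          # 4

y <- c(2, 4, 5, 10)                # even number of observations
median(y)                          # average of 4 and 5: 4.5
```

The single large value (10) pulls the mean above the median, which is why the median is often preferred for skewed data.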

Variance measures the variability of observations around their arithmetic mean. The sample variance is the sum of squared deviations divided by \(n - 1\), which is also what R's var() computes. It has the following formula:

\[s^{2} = \frac{\sum_{i=1}^{n} (x_{i} - \bar{x})^{2}}{n - 1}\] where \(x_{i}\) is the \(i^{\text{th}}\) observation of the group, \(\bar{x}\) is the mean of the group, and \(n\) is the number of observations.

Standard deviation measures the typical departure of observations from the mean. It is the square root of the variance.

\[s = \sqrt{s^{2}} = \sqrt{\frac{\sum_{i=1}^{n} (x_{i} - \bar{x})^{2}}{n - 1}}\] where \(x_{i}\) is the \(i^{\text{th}}\) observation of the group, \(\bar{x}\) is the mean of the group, and \(n\) is the number of observations.
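
A short hand calculation can be compared against R's built-in functions. Note that var() and sd() use the \(n - 1\) (sample) denominator; dividing by \(n\) instead gives the population form, which is smaller. The vector is hypothetical.

```r
x <- c(2, 4, 4, 5, 10)          # hypothetical observations, mean = 5
n <- length(x)

sq_dev <- (x - mean(x))^2       # squared deviations: 9 1 1 0 25
sample_var <- sum(sq_dev) / (n - 1)
sample_sd  <- sqrt(sample_var)

sample_var                      # 9
sample_sd                       # 3
all.equal(sample_var, var(x))   # TRUE: var() divides by n - 1
all.equal(sample_sd, sd(x))     # TRUE

sum(sq_dev) / n                 # 7.2: population form, dividing by n
```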

Min and Max are the minimum and maximum values of the data set.

The 1st quartile is the value below which 25% of the values in the data set fall. The 3rd quartile is the value below which 75% of the values in the data set fall.

## Calculate mean,median,min,max,and quantiles
mean(NHANES_df$age,na.rm = TRUE) #mean

## [1] 43.91529

sd(NHANES_df$age,na.rm = TRUE) # standard deviation

## [1] 12.17043

median(NHANES_df$age,na.rm = TRUE) #median

## [1] 44

min(NHANES_df$age,na.rm = TRUE) #min

## [1] 25

max(NHANES_df$age,na.rm = TRUE) #max

## [1] 74

quantile(NHANES_df$age,na.rm = TRUE) #quantiles

##   0%  25%  50%  75% 100% 
##   25   33   44   53   74

A frequency distribution histogram allows us to see the distribution of many values by dividing the range of values into a set of smaller ranges (bins) and then graphing the number of values in each bin. Histograms can be used to inspect the shape of the distribution, locate the center of the data, evaluate the spread of the data, and identify outliers.

There are multiple methods to approximate the “best” number of bins. The easiest method is to use the rounded square root of the number of observations. These methods are often sensitive to outliers, so it is best to use them only as a guide when selecting the number of bins.
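
For comparison, the sketch below applies the square-root rule alongside three rules that ship with base R (nclass.Sturges, nclass.scott, and nclass.FD from grDevices). The seed and the simulated sample are arbitrary choices for illustration.

```r
set.seed(204)                              # arbitrary seed for reproducibility
x <- rnorm(1000, mean = 10, sd = 2.5)      # simulated sample

ceiling(sqrt(length(x)))   # square-root rule: 32
nclass.Sturges(x)          # Sturges' rule: ceiling(log2(n) + 1) = 11
nclass.scott(x)            # Scott's rule, based on the standard deviation
nclass.FD(x)               # Freedman-Diaconis rule, based on the IQR
```

The rules can disagree considerably, which supports treating any of them as a starting point and then adjusting the bin count by eye.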

# generate a random sample from a normal distribution with a mean of 10 and a standard deviation of 2.5
normal_distribution_df <- data.frame(nd_values = rnorm(1000, mean = 10, sd = 2.5))

sqrt(1000)# Calculate number of bins as sqrt of 1000

## [1] 31.62278

# basic histogram
normal_distribution_df %>%  ggplot(aes(x=nd_values)) + 
  geom_histogram(bins=32,color="darkblue", fill="lightblue") + theme_bw()

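A normally distributed data set is symmetric and bell-shaped, so its mean and median are approximately equal. Mean = median alone does not guarantee normality, since any symmetric distribution has this property. A distribution of data is skewed if it is not symmetric and extends more to one side than to the other. Data skewed to the right have a longer right tail. Data skewed to the left have a longer left tail.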
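
One quick, informal check for skewness is to compare the mean with the median: a long right tail pulls the mean above the median. The sketch below uses simulated data (an exponential sample as a right-skewed example and a normal sample as a symmetric one); the seed and parameters are arbitrary.

```r
set.seed(204)                                   # arbitrary seed
right_skewed <- rexp(1000, rate = 1)            # long right tail
symmetric    <- rnorm(1000, mean = 10, sd = 2.5)

mean(right_skewed) - median(right_skewed)       # clearly positive
mean(symmetric) - median(symmetric)             # close to zero
```

This comparison is only a heuristic; a histogram remains the more informative check.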


When a data set is normally distributed, about 68% of all values fall within 1 standard deviation of the mean, about 95% of all values fall within 2 standard deviations of the mean, and about 99.7% of all values fall within 3 standard deviations of the mean.
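
These percentages come directly from the normal cumulative distribution function and can be reproduced with pnorm():

```r
k <- 1:3                            # number of standard deviations
coverage <- pnorm(k) - pnorm(-k)    # probability within k SDs of the mean
round(coverage, 4)                  # 0.6827 0.9545 0.9973
```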


A box plot with whiskers summarizes the distribution of the data (median, quartiles, range, and potential outliers) without showing every value.
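
The numbers behind a box plot can be inspected without drawing it. The sketch below uses base R's boxplot.stats() on a hypothetical vector of ages with one extreme value; by default, points more than 1.5 times the box length beyond the box are flagged as outliers.

```r
x <- c(25, 30, 33, 40, 44, 48, 53, 60, 110)  # hypothetical ages, 110 is extreme

box <- boxplot.stats(x)
box$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker:
            # 25 33 44 53 60
box$out     # values drawn as individual outlier points: 110
```

The upper whisker stops at 60, the largest value within 1.5 box lengths of the upper hinge, rather than at the maximum.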


sqrt(nrow(NHANES_df)) #calculate the number of bins

## [1] 40.36087

NHANES_df %>% ggplot(aes(x=age)) +
    geom_histogram(bins=40, fill="deepskyblue", color="black") + 
    theme_light() #generate histogram

summary(NHANES_df$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   25.00   33.00   44.00   43.92   53.00   74.00

NHANES_df %>%  ggplot(aes(x=age)) +
    stat_boxplot(geom = 'errorbar',  width = 0.2) + 
    geom_boxplot(fill='deepskyblue',outlier.colour="red", outlier.size=4) +
    theme_light()

Categorical Variables

Frequency is the number of observations in each category of the data set.

Proportion is the fraction (or percentage) of the data set that each category accounts for.

NHANES_df$sex <- as.factor(NHANES_df$sex) #convert to factor
table(NHANES_df$sex)

## 
## Female   Male 
##    830    799

prop.table(table(NHANES_df$sex))

## 
##   Female     Male 
## 0.509515 0.490485

NHANES_table <- table(sex=NHANES_df$sex,asthma=NHANES_df$asthma)
round(prop.table(NHANES_table,margin=1),2) #by rows

##         asthma
## sex         0    1
##   Female 0.94 0.06
##   Male   0.96 0.04

round(prop.table(NHANES_table,margin=2),2) # by columns

##         asthma
## sex         0    1
##   Female 0.50 0.62
##   Male   0.50 0.38
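
The margin argument controls which totals the proportions are computed against. A minimal sketch on a hypothetical five-patient table shows that margin = 1 makes each row sum to 1, while margin = 2 makes each column sum to 1.

```r
# Hypothetical data, not taken from the course data set
small_table <- table(
  sex    = c("Female", "Female", "Female", "Male", "Male"),
  asthma = c(0, 1, 0, 0, 0)
)

row_props <- prop.table(small_table, margin = 1)  # proportions within each sex
col_props <- prop.table(small_table, margin = 2)  # proportions within each asthma group

rowSums(row_props)   # Female 1, Male 1
colSums(col_props)   # 0: 1, 1: 1
```

Row proportions answer "what share of females have asthma?", whereas column proportions answer "what share of asthma patients are female?".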

A bar graph uses bars of equal width to show the frequencies of the categories of a categorical variable. It shows the relative distribution of the data, which makes the categories easier to compare.

NHANES_df %>%  ggplot(aes(x=race)) +
    geom_bar(fill="deepskyblue", color="black") +
    theme_light()

A pie chart is a very common graph that depicts categorical data as slices of a circle, in which the size of each slice is proportional to the frequency count for the category.

NHANES_df %>%  ggplot(aes(x="", fill=race)) +
    geom_bar(width = 1) +
    coord_polar("y", start=0) +
    labs(fill = "Patient race") +
    theme_void()