First steps: summarizing continuous data in R

Authors: Domenico Digel, Sebastian Endres, Sören Schwabbauer
Date: 02.07.2024
Class: Professor: Stefan Lang

# load libraries
library(tidyverse)
library(highcharter)
library(purrr)

Load Data

R ships with several ready-to-use datasets. They are often used in forums (such as Stack Overflow), so it helps to be familiar with them. A popular example is the iris data frame:

iris <- datasets::iris
head(iris, n = 15)

##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa
## 11          5.4         3.7          1.5         0.2  setosa
## 12          4.8         3.4          1.6         0.2  setosa
## 13          4.8         3.0          1.4         0.1  setosa
## 14          4.3         3.0          1.1         0.1  setosa
## 15          5.8         4.0          1.2         0.2  setosa

By default, the head() function returns the first 6 observations of a dataframe. We can change the number of displayed rows with the n argument, as in head(x, n = ...). In this case we get the first 15 observations of the dataframe.
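A short sketch of how the n argument of head() behaves on the built-in iris data. It only uses base R; the negative value in the last call is an illustrative choice.

```r
# Without n, head() returns the first 6 rows
nrow(head(iris))

# n sets the number of rows explicitly
nrow(head(iris, n = 15))

# A negative n keeps everything except the last |n| rows
nrow(head(iris, n = -140))
```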

Background Information

The iris flower

We can obtain more information on the dataframe by running ?iris in the console. This gives us:

Description: This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

[Figure: description of an iris, followed by photos of Iris setosa, Iris versicolor, and Iris virginica. Source: Wikipedia]

As we can see, the three species look fairly similar. We want to explore the dataframe and find out whether the measurements reveal characteristics that set the species apart.

Base R & Tidyverse (& Highcharter)

Base R and the Tidyverse represent two different approaches to working with data in R, each with its own strengths and characteristics.

Base R refers to the core set of functions and packages that come pre-installed with R. It provides a wide range of functions for data manipulation, statistical analysis, and visualization. Base R functions are typically efficient and well-established, making them suitable for basic data tasks and scripting.

On the other hand, the Tidyverse is a collection of R packages designed to streamline the process of data analysis and manipulation. It follows the principles of tidy data, emphasizing the use of consistent data structures and function names for seamless workflow. The Tidyverse includes packages such as dplyr, tidyr, ggplot2, and purrr, among others, which offer powerful tools for data manipulation, visualization, and modeling. The Tidyverse is particularly popular for its intuitive syntax, which simplifies complex data tasks and promotes reproducible research practices.

Ultimately, the choice between Base R and the Tidyverse depends on factors such as personal preference, familiarity with each approach, and the specific requirements of the data analysis task at hand. Some users prefer the simplicity and efficiency of Base R functions, while others appreciate the consistency and flexibility provided by the Tidyverse packages. Both approaches have their place in the R ecosystem, and many users find value in leveraging both depending on the context.

We believe that the tidyverse notation and functions provide easier access to coding in R. We therefore show both options and let you decide which one you want to use.

Highcharter and ggplot2 (from the tidyverse) represent two powerful tools for data visualization in R, each with its own distinct strengths. Highcharter shines in its ability to create interactive and dynamic visualizations, making it ideal for exploring data in depth and engaging users with features like zooming and panning. Its wide range of chart types and extensive customization options offer flexibility in crafting visually appealing plots tailored to specific requirements. On the other hand, ggplot2 is renowned for its elegant syntax and ease of use, making it a preferred choice for many R users, particularly those within the tidyverse ecosystem. While ggplot2 may not offer the same level of interactivity as Highcharter out of the box, it excels in producing static, publication-quality plots with minimal effort. Additionally, ggplot2’s grammar of graphics framework provides a conceptual foundation for understanding and constructing a wide range of visualizations. Ultimately, the choice between Highcharter and ggplot2 depends on the specific needs of the analysis, with Highcharter offering interactivity and flexibility, and ggplot2 providing simplicity and elegance.

We thought it might be nice to visualize the data interactively, which is why occasionally, highchart plots will be provided in addition.

Long vs. Wide dataframes

Long and wide dataframes represent different ways of organizing data, each with its advantages depending on the analysis and visualization tasks at hand. In a wide dataframe, each row typically represents an observation, while each column represents a variable. This format is suitable for storing data where each variable has its own column, making it intuitive for human interpretation.

On the other hand, in a long dataframe, multiple rows may represent different measurements or observations for the same entity, with additional columns used to distinguish between different variables or categories. Long format is often preferred for analyses involving multiple related variables or repeated measures, as it allows for easier manipulation and plotting of data. Additionally, long format is conducive to many tidy data principles, facilitating seamless integration with tidyverse functions and packages for data manipulation and analysis. Ultimately, the choice between long and wide dataframes depends on the specific analytical needs and the most effective way to represent and work with the data.

Our iris dataframe is currently in a wide dataframe format. When working with packages in the tidyverse, it can be helpful to use the long dataframe structure.

Transform dataframes from wide to long format

Base R

iris_long <- data.frame(
  Species = rep(iris$Species, times = 4),
  Measurement = rep(c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"), each = nrow(iris)),
  Value = c(iris$Sepal.Length, iris$Sepal.Width, iris$Petal.Length, iris$Petal.Width)
)

Tidyverse

iris_long <- iris %>% pivot_longer(cols = starts_with("Sepal.") | starts_with("Petal."), 
                                   names_to = "Measurement",
                                   values_to = "Value")

Transform dataframes from long to wide format

Base R

# ???
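The Base R counterpart is left open here. One possible sketch uses unstack(), under the assumption that the values of each Measurement appear in the same flower order within the long dataframe (true for both constructions of iris_long shown earlier). The name iris_wide is introduced only for this illustration, so the original iris is not overwritten.

```r
# Rebuild the long dataframe from the Base R example so the sketch is self-contained
iris_long <- data.frame(
  Species = rep(iris$Species, times = 4),
  Measurement = rep(c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
                    each = nrow(iris)),
  Value = c(iris$Sepal.Length, iris$Sepal.Width, iris$Petal.Length, iris$Petal.Width)
)

# unstack() splits Value by Measurement and returns one column per measurement
# Assumption: each measurement has the same number of rows, in the same flower order
iris_wide <- unstack(iris_long, Value ~ Measurement)

# Species is repeated for every measurement, so take it from one measurement only
iris_wide$Species <- iris_long$Species[iris_long$Measurement == "Sepal.Length"]

head(iris_wide)
```

Note that the columns come back in alphabetical order of the measurement names, not in the original column order.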

Tidyverse

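# Species does not identify single flowers, so the values are collected in list-columns and then unnested
iris <- iris_long %>%
  pivot_wider(names_from = Measurement, values_from = Value, values_fn = list) %>%
  unnest(cols = c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width))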

Ordinal, Nominal and Metric variables

Ordinal variables represent categories with a natural order or ranking, but the intervals between the categories may not be equal. Examples include education level (e.g., high school, bachelor’s, master’s) and rating scales (e.g., strongly disagree to strongly agree).
Nominal variables represent categories with no inherent order or ranking. They simply classify data into distinct groups. Examples include gender, race, and types of fruit.
Metric variables, also known as interval or ratio variables, represent data with meaningful numerical values where both the order and the exact differences between categories are meaningful. Examples include temperature, weight, and age.

From the head() function we know that we have a dataframe with 5 columns. Thanks to the documentation we also know that the first 4 columns (.Width & .Length) are numeric measurements in centimeters. Because both the order of the values and the differences between them are meaningful (5 cm is 1 cm more than 4 cm), we speak of metric scaled variables. The 5th column contains information on the iris species. It splits the flowers into groups that we cannot rank, so the Species variable is a nominal scaled variable. An ordinal scaled variable is not in the data, but the perceived beauty of an iris would be an example: we can rank flowers by beauty, but we cannot state that one flower is twice as nice as another one.
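The three scale levels map onto R classes roughly as follows. This is a minimal sketch; the beauty ratings are hypothetical values made up for illustration and are not part of the iris data.

```r
# Nominal: Species is an unordered factor, so its groups cannot be ranked
is.ordered(iris$Species)

# Ordinal: a hypothetical beauty rating, stored as an ordered factor
beauty <- factor(c("low", "high", "medium", "high"),
                 levels = c("low", "medium", "high"),
                 ordered = TRUE)

# Comparisons are defined for ordered factors, arithmetic is not
beauty < "high"

# Metric: arithmetic on the values is meaningful
mean(iris$Sepal.Length)
```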

# Sidenote: R does not care much about the unit, but about the class of the columns in a dataframe. Numbers are sometimes stored as characters. In that case a 1 turns into a "1", and computations on it fail (give it a try in the console)

# ! always check the classes of the variables in your dataframe
lapply(iris, class)

## $Species
## [1] "factor"
## 
## $Sepal.Length
## [1] "numeric"
## 
## $Sepal.Width
## [1] "numeric"
## 
## $Petal.Length
## [1] "numeric"
## 
## $Petal.Width
## [1] "numeric"

Now we know that our metric scaled variables are numeric and our nominal scaled variable is a factor. No further changes are required. Let’s move on.
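If a column does turn out to have the wrong class, it can be converted. A small sketch with made-up values (x_chr and x_num are illustrative names, not part of the iris data):

```r
# Numbers that were read in as text
x_chr <- c("1", "2", "3")
class(x_chr)

# sum(x_chr) would stop with an error, because the values are characters

# Convert to numeric before computing
x_num <- as.numeric(x_chr)
class(x_num)
sum(x_num)
```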

further definitions

Descriptive Analysis of Continuous Data

Descriptive analysis of continuous variables involves summarizing the distribution and central tendencies of numerical data. This typically includes measures such as the mean, median, mode, range, variance, and standard deviation. These statistics offer insights into the central tendency (where the data tends to cluster) and the spread or variability of the data points. Histograms and density plots are commonly used to visually represent the distribution of continuous variables, showing the frequency or density of values across their range. Box plots provide a concise summary of key statistics such as quartiles and outliers, aiding in the identification of patterns and discrepancies within the dataset. Through descriptive analysis, analysts gain a comprehensive understanding of the shape, spread, and central tendency of continuous variables, laying the groundwork for further inferential and exploratory analysis.
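The measures named above can be computed directly in base R. The sketch below uses Sepal.Length as an example. R has no built-in function for the statistical mode, so counting the values with table() is one possible workaround, not a standard function.

```r
x <- iris$Sepal.Length

# Central tendency and spread in one named vector
stats <- c(mean = mean(x),
           median = median(x),
           sd = sd(x),
           variance = var(x),
           min = min(x),
           max = max(x))
round(stats, 3)

# Mode: the most frequent value (the first one, if several values tie)
mode_value <- as.numeric(names(which.max(table(x))))
mode_value
```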

Analyzing a single, continuous variable

The iris dataframe provides 4 continuous variables: the width and length of the sepal, and the width and length of the petal. In R, a simple approach is to apply the summary function to each column that is numeric.

Summary statistics

Summary statistics can be helpful to …

Base R

# Get the indices of columns containing "Sepal" or "Petal"
continuousvariables <- grep("Sepal|Petal", names(iris))

# Subset the dataframe using the obtained indices
iris_continuousvariables <- iris[, continuousvariables]

# Apply the summary function to each continuous variable and print the result
summary_continuousvariables <- lapply(iris_continuousvariables, summary)
summary_continuousvariables

## $Sepal.Length
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.300   5.100   5.800   5.843   6.400   7.900 
## 
## $Sepal.Width
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.800   3.000   3.057   3.300   4.400 
## 
## $Petal.Length
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.600   4.350   3.758   5.100   6.900 
## 
## $Petal.Width
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.100   0.300   1.300   1.199   1.800   2.500

Tidyverse

# summarise the continuous measurement variables
summary_continuousvariables <- iris_long %>%
  group_by(Measurement) %>%
  summarise(min = min(Value),
            quartile_25 = quantile(Value, 0.25),
            median = median(Value),
            mean = mean(Value),
            quartile_75 = quantile(Value, 0.75),
            max = max(Value))

summary_continuousvariables

## # A tibble: 4 × 7
##   Measurement    min quartile_25 median  mean quartile_75   max
##   <chr>        <dbl>       <dbl>  <dbl> <dbl>       <dbl> <dbl>
## 1 Petal.Length   1           1.6   4.35  3.76         5.1   6.9
## 2 Petal.Width    0.1         0.3   1.3   1.20         1.8   2.5
## 3 Sepal.Length   4.3         5.1   5.8   5.84         6.4   7.9
## 4 Sepal.Width    2           2.8   3     3.06         3.3   4.4

Visualization (Boxplot)

For visualizing a single continuous variable, a boxplot is a compact option: it shows the median, the quartiles and potential outliers at a glance.

Base R

# Subset the iris dataset to include only numeric variables
iris_continuousvariables <- iris[, sapply(iris, is.numeric)]

boxplot(iris_continuousvariables)

Tidyverse

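# Boxplot per measurement with the individual observations drawn on top
iris_boxplot <- iris_long %>% ggplot(aes(y = Value, x = Measurement)) +
  # hide the boxplot outliers, because geom_jitter() draws every point anyway
  geom_boxplot(outlier.shape = NA) +
  # jitter horizontally only, so the measured values stay unchanged
  geom_jitter(width = 0.2, height = 0) +
  labs(title = "Distribution of data in continuous variables", x = "", y = "measured size (cm)")  +
  theme_bw()

iris_boxplot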

Adding the geom_jitter() layer allows us to see that there may be certain groups within the Measurements. Simply looking at the averages of the Measurements may not be helpful, or may even be misleading, as we are not able to differentiate between the different Species.

Highchart

iris_boxplot <- hcboxplot(x = iris_long$Value, var = iris_long$Measurement) %>%
    
    hc_chart(type = "column")

iris_boxplot

Visualization (Histogram)

A histogram can be helpful, if …

Base R

# Subset the iris dataset to include only numeric variables
iris_continuousvariables <- iris[, sapply(iris, is.numeric)]

# Create one histogram per numeric variable, using a for loop
par(mfrow = c(1, 4))

for (i in names(iris_continuousvariables)) {
  hist(iris_continuousvariables[[i]], breaks = 15, main = paste0("Histogram for ", i), xlab = "", ylab = "Frequency")
}

# Reset the plotting layout
par(mfrow = c(1, 1))

In the hist() function, the breaks = ... argument suggests the number of bins. R treats a single number as a suggestion and picks "pretty" break points near it. The main = argument specifies the title; xlab = and ylab = set the axis labels.

Tidyverse

# Histogram of the values, one panel per measurement
iris_histogram <- iris_long %>% ggplot(aes(x = Value)) +
  geom_histogram(binwidth = 0.1, color = "orange", fill = "grey") +
  labs(x = "", y = "Absolute Frequency", title = "Histogram of the different continuous variables", caption = "Data: iris") +
  facet_grid(~Measurement) +
  theme_bw()

iris_histogram

Notice how the geom_histogram() function allows us to specify the width of the bins. As the Values are recorded in steps of 0.1, it makes sense to choose that width for the bins.

It also stands out that using the ggplot package is much more intuitive than looping over the different numeric variables.

Highchart

listofplots <- map(unique(iris_long$Measurement), function(x){
  
  iris_subset <- iris_long %>% 
    filter(Measurement == x)
  
  hchart(
    iris_subset$Value, 
    type = "area", name = x
  ) %>%
      hc_legend(enabled = FALSE) %>%
    hc_subtitle(text = x)
  
})

grouped_histogram <- htmltools::tagList(
  htmltools::h4("Histogram of the different continuous variables"),
  hw_grid(listofplots, rowheight = 300, ncol = 4))

Histogram of the different continuous variables

Highchart again shows its strength in its interactivity.

Visualization (Density)

A density function can be helpful, if … How does it work?
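As a rough illustration of how a kernel density estimate is constructed: every observation contributes a small bell curve (the kernel), and the estimate at a point is the average of all those curves, scaled by the bandwidth h. The sketch below builds a Gaussian kernel estimate by hand and overlays the result of density() for comparison. The choice of Petal.Length, the grid of 200 points and the names grid and kde_manual are assumptions made for this example.

```r
x <- iris$Petal.Length

# Bandwidth h: bw.nrd0() is the rule of thumb that density() uses by default
h <- bw.nrd0(x)

# Evaluation grid, extended 3 bandwidths beyond the data range
grid <- seq(min(x) - 3 * h, max(x) + 3 * h, length.out = 200)

# At each grid point, average the Gaussian kernels centred on the observations
kde_manual <- sapply(grid, function(g) mean(dnorm((g - x) / h)) / h)

# Built-in estimator on the same grid, for comparison
kde_builtin <- density(x, bw = h, from = min(grid), to = max(grid), n = 200)

plot(grid, kde_manual, type = "l",
     xlab = "Petal length (cm)", ylab = "Density",
     main = "Kernel density estimate by hand")
lines(kde_builtin, col = "red", lty = 2)
```

A smaller h gives a more wiggly curve, a larger h a smoother one.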

Base R

# to be added
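Until the Base R version is filled in, one possible sketch mirrors the histogram loop: plot(density(...)) draws the kernel density estimate of each numeric variable. The names iris_numeric and old_par are chosen only for this illustration.

```r
# Keep only the numeric columns
iris_numeric <- iris[, sapply(iris, is.numeric)]

# Four panels side by side; par() returns the old settings so they can be restored
old_par <- par(mfrow = c(1, 4))

for (i in names(iris_numeric)) {
  # density() computes the kernel density estimate, plot() draws it
  plot(density(iris_numeric[[i]]), main = paste0("Density of ", i), xlab = "")
}

# Restore the previous plotting layout
par(old_par)
```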

Tidyverse

# Density of the values, one panel per measurement
iris_density <- iris_long %>% ggplot(aes(x = Value)) +
  # after_stat(density) replaces the deprecated ..density.. notation
  geom_density(aes(y = after_stat(density)), color = "red") +
  labs(title = "Density function of the different continuous variables") +
  facet_grid(~Measurement) +
  theme_bw()

iris_density

It also stands out that using the ggplot package is much more intuitive than looping over the different numeric variables.

Highchart

listofplots <- map(unique(iris_long$Measurement), function(x){
  
  iris_subset <- iris_long %>% 
    filter(Measurement == x)
  
  # density() computes the kernel density estimate that hchart() then draws
  hchart(
    density(iris_subset$Value), 
    type = "area", name = x
  ) %>%
    hc_legend(enabled = FALSE) %>%
    hc_subtitle(text = x)
  
})

grouped_density <- htmltools::tagList(
  htmltools::h4("Density function of the different continuous variables"),
  hw_grid(listofplots, rowheight = 300, ncol = 4))

Density function of the different continuous variables

Highchart again shows its strength in its interactivity.

From looking at the summary statistics of a single continuous variable, as well as its visualizations, we can see that…

Comparing a Discrete and a Continuous Variable

Knowing more about the individual continuous variables is helpful, but it may be useful to add the discrete “Species” variable in order to further analyze the data. From the histograms, but also from the added points in the boxplot, we could already suspect that there are groups in the data, which highlights the importance of “controlling” for the discrete Species variable.

This leaves us with different options. We can look at summary tables or use Boxplots (again) and further group our density plots.

Summary statistics

Base R

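# summary per species
# Note: iris[, -5] drops the fifth column. After the pivot_wider() round trip above, iris has Species
# as its first column, so Petal.Width is dropped and Species is kept (see the output below)
iris_groupedsummary <- lapply(split(iris[, -5], iris$Species), function(x) summary(x))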

## $setosa
##        Species    Sepal.Length    Sepal.Width     Petal.Length  
##  setosa    :50   Min.   :4.300   Min.   :2.300   Min.   :1.000  
##  versicolor: 0   1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400  
##  virginica : 0   Median :5.000   Median :3.400   Median :1.500  
##                  Mean   :5.006   Mean   :3.428   Mean   :1.462  
##                  3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575  
##                  Max.   :5.800   Max.   :4.400   Max.   :1.900  
## 
## $versicolor
##        Species    Sepal.Length    Sepal.Width     Petal.Length 
##  setosa    : 0   Min.   :4.900   Min.   :2.000   Min.   :3.00  
##  versicolor:50   1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00  
##  virginica : 0   Median :5.900   Median :2.800   Median :4.35  
##                  Mean   :5.936   Mean   :2.770   Mean   :4.26  
##                  3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60  
##                  Max.   :7.000   Max.   :3.400   Max.   :5.10  
## 
## $virginica
##        Species    Sepal.Length    Sepal.Width     Petal.Length  
##  setosa    : 0   Min.   :4.900   Min.   :2.200   Min.   :4.500  
##  versicolor: 0   1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100  
##  virginica :50   Median :6.500   Median :3.000   Median :5.550  
##                  Mean   :6.588   Mean   :2.974   Mean   :5.552  
##                  3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875  
##                  Max.   :7.900   Max.   :3.800   Max.   :6.900

Tidyverse

iris_groupedsummary <- iris_long %>%
  # define grouping variables
  group_by(Species, Measurement) %>%
  # compute the grouped summary statistics
  summarise(min = min(Value),
            quartile_25 = quantile(Value, 0.25),
            median = median(Value),
            quartile_75 = quantile(Value, 0.75),
            max = max(Value)) %>%
  arrange(Measurement) %>%
  # display all 12 rows
  head(12)

## # A tibble: 12 × 7
## # Groups:   Species [3]
##    Species    Measurement    min quartile_25 median quartile_75   max
##    <fct>      <chr>        <dbl>       <dbl>  <dbl>       <dbl> <dbl>
##  1 setosa     Petal.Length   1          1.4    1.5         1.58   1.9
##  2 versicolor Petal.Length   3          4      4.35        4.6    5.1
##  3 virginica  Petal.Length   4.5        5.1    5.55        5.88   6.9
##  4 setosa     Petal.Width    0.1        0.2    0.2         0.3    0.6
##  5 versicolor Petal.Width    1          1.2    1.3         1.5    1.8
##  6 virginica  Petal.Width    1.4        1.8    2           2.3    2.5
##  7 setosa     Sepal.Length   4.3        4.8    5           5.2    5.8
##  8 versicolor Sepal.Length   4.9        5.6    5.9         6.3    7  
##  9 virginica  Sepal.Length   4.9        6.22   6.5         6.9    7.9
## 10 setosa     Sepal.Width    2.3        3.2    3.4         3.68   4.4
## 11 versicolor Sepal.Width    2          2.52   2.8         3      3.4
## 12 virginica  Sepal.Width    2.2        2.8    3           3.18   3.8

This helps us to compare the different species of iris flowers, based on what we measured. However, a plain table is hard to read at a glance. Here is a visualization of the distribution as boxplots:

grouped_summary_plot <- iris_long %>%
  ggplot(aes(x = Species, y = Value, color = Species)) +
  geom_boxplot() +
  geom_point(alpha = 0.1) +
  facet_grid(~Measurement) +
  labs(y = "measured values (cm)", x = "", title = "Boxplots for different measurements", subtitle = "Distribution by Species", caption = "Data: iris") +
  theme_bw()

grouped_summary_plot

Visualization (Histogram)

Base R

# to be added

In Base R this takes considerably more code.
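To give an idea of the extra work in Base R, here is a possible sketch for a single variable (Petal.Length): one histogram per species, drawn with shared breaks so that the panels are comparable. The break points and the names species, breaks and old_par are assumptions for this example.

```r
species <- levels(iris$Species)

# Shared break points that cover the full range of Petal.Length (1.0 to 6.9)
breaks <- seq(0.5, 7.5, by = 0.5)

# One panel per species
old_par <- par(mfrow = c(1, 3))

for (s in species) {
  hist(iris$Petal.Length[iris$Species == s],
       breaks = breaks,
       main = s, xlab = "Petal length (cm)", ylab = "Frequency")
}

# Restore the previous plotting layout
par(old_par)
```

Repeating this for all four measurements would need a second loop, which is what the facets in the Tidyverse version take care of.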

Tidyverse

# Histogram of the values, coloured by species, one panel per measurement
iris_histogram <- iris_long %>% ggplot(aes(x = Value, color = Species, fill = Species)) +
  geom_histogram(binwidth = 0.1, alpha = 0.3) +
  labs(x = "",
       y = "Absolute Frequency",
       title = "Histogram of the different continuous variables", 
       caption = "Data: iris") +
  facet_grid(~Measurement) +
  theme_bw()

It also stands out that using the ggplot package is much more intuitive than looping over the different numeric variables.

Highchart

# to be added

Highchart again shows its strength in its interactivity.

Visualization (Density)

Base R

# to be added

In Base R this takes considerably more code.
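A possible Base R sketch for one variable (Petal.Length): estimate the density separately for each species and overlay the three curves in one plot. The axis padding of 1 cm and the names x and dens are assumptions for this example.

```r
x <- iris$Petal.Length

# One kernel density estimate per species
dens <- lapply(split(x, iris$Species), density)

# Empty plot that is large enough for all three curves
plot(NULL,
     xlim = range(x) + c(-1, 1),
     ylim = c(0, max(sapply(dens, function(d) max(d$y)))),
     xlab = "Petal length (cm)", ylab = "Density",
     main = "Density of Petal.Length by Species")

# Add one line per species, using the colours 1, 2 and 3
for (i in seq_along(dens)) {
  lines(dens[[i]], col = i)
}

legend("topright", legend = names(dens), col = seq_along(dens), lty = 1)
```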

Tidyverse

# Density of the values, coloured by species, one panel per measurement
iris_density <- iris_long %>% ggplot(aes(x = Value, color = Species, fill = Species)) +
  # geom_density() has no binwidth argument; after_stat(density) replaces ..density..
  geom_density(aes(y = after_stat(density)), alpha = 0.3) +
  labs(x = "",
       y = "Density",
       title = "Density of the different continuous variables", 
       caption = "Data: iris") +
  facet_grid(~Measurement) +
  theme_bw()

It also stands out that using the ggplot package is much more intuitive than looping over the different numeric variables.

Highchart

# to be added

Highchart again shows its strength in its interactivity.

From looking at the summary statistics of a single continuous variable, as well as its visualizations, we can see that…

Comparing two Continuous Variables

Correlation

Base R

# an lm() will be added here
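Until this part is filled in, the following sketch shows one way to quantify the relationship between sepal width and sepal length in Base R: cor() for the Pearson correlation (overall and per species) and lm() for a linear model. The model formula matches the one used for the Highcharter scatter plot further down; the object names are chosen for this illustration.

```r
# Pearson correlation across all 150 flowers
cor_all <- cor(iris$Sepal.Width, iris$Sepal.Length)

# Pearson correlation within each species
cor_by_species <- sapply(split(iris, iris$Species),
                         function(d) cor(d$Sepal.Width, d$Sepal.Length))

# Linear model with a common slope and species-specific intercepts
fit <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)

cor_all
cor_by_species
coef(fit)
```

On the iris data the overall correlation comes out slightly negative, while the correlation within each species is positive. This is another reason to take the Species variable into account.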

Tidyverse

# an lm() will be added here

Scatter Plot

Base R

# Scatter plot
plot(iris$Sepal.Width, iris$Sepal.Length, 
     col = iris$Species, pch = as.numeric(iris$Species), 
     xlab = "Sepal Width", ylab = "Sepal Length",
     main = "Scatter Plot: Width vs. Length",
     xlim = c(2, 4.5), ylim = c(4, 8.5))

Tidyverse

scatterplot_width_length <- ggplot(data=iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species, shape = Species)) + 
  geom_point() +
  geom_smooth(se=FALSE, method = "lm") +
  labs(title = "Scatter Plot", subtitle = "Width vs. Length", color = "Species", 
       x = "Sepal Width", y = "Sepal Length") +
  theme_bw()

Highcharter

lm.model <- broom::augment(lm(Sepal.Length ~ Sepal.Width + Species, data = iris))

scatterplot_width_length <- hchart(lm.model, 'scatter',
                                   hcaes(x = Sepal.Width, y = Sepal.Length, group = Species)) %>%
  hc_colors(c("#f8766d", "#00ba38", "#619cff")) %>%
  
  hc_add_series(lm.model, "line", hcaes(x = Sepal.Width, y = .fitted, group = Species)) %>%
  
   hc_legend(enabled = TRUE)

You can see that, …

From looking at the summary statistics of a single continuous variable, as well as its visualizations, we can see that…

Sources

https://discdown.org/rprogramming/dekriptiv.pdf (in German)
google “summary statistics in R” and “histogram in R”
kernel density estimates: https://mathisonian.github.io/kde/ and/or google “kernel density estimates” and find relevant resources.
Regression. Models, methods and applications chapter 1.2

Possible Examples

Explain nominal, ordinal and metric scale.
Use and explain e.g. the summary() and sd() function of R (other functions or packages are welcome) for relevant summary statistics (mean, std dev, min, max).
Use e.g. the hist() function of R (other functions or packages are ok) for plotting histograms.
Use e.g the density() function of R (other functions or packages are ok) for plotting kernel density estimators. Explain how these estimators are constructed.
Use the data sets of the lecture (available in OLAT) for demonstration, other data sets are welcome.