# load libraries
library(tidyverse)
library(highcharter)
library(purrr)
R ships with a number of well-documented, ready-to-use datasets. They are often used in forums (such as Stack Overflow), so it helps to be familiar with them. A popular example is the iris data frame:
iris <- datasets::iris
head(iris, n = 15)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## 11 5.4 3.7 1.5 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
## 13 4.8 3.0 1.4 0.1 setosa
## 14 4.3 3.0 1.1 0.1 setosa
## 15 5.8 4.0 1.2 0.2 setosa
By default, the head() function returns the first 6
observations of a dataframe. We can change the number of
displayed rows by setting the n argument, as in head(x, n = ...).
In this case we get the first 15 observations of the
dataframe.
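As a quick check of this behavior, here is a minimal sketch that uses only base R. The object names are chosen for illustration.

```r
# head() returns 6 rows unless n is set explicitly
default_rows <- nrow(head(datasets::iris))
custom_rows <- nrow(head(datasets::iris, n = 15))

# a negative n drops rows from the end instead: all but the last 145 rows
negative_rows <- nrow(head(datasets::iris, n = -145))

# tail() works the same way from the other end of the dataframe
last_rows <- tail(datasets::iris, n = 3)
```

The companion function tail() is handy for checking that a file was read in completely.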
We can obtain more information on the dataframe by running
?iris in the console. This gives us:
Description: This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
[Images: Iris setosa, Iris versicolor, and Iris virginica. Source: Wikipedia]
As we can see, the three species look fairly similar. Our goal is to analyze the dataframe and, ideally, find characteristic patterns in the data.
Base R and the Tidyverse represent two different approaches to working with data in R, each with its own strengths and characteristics.
Base R refers to the core set of functions and packages that come pre-installed with R. It provides a wide range of functions for data manipulation, statistical analysis, and visualization. Base R functions are typically efficient and well-established, making them suitable for basic data tasks and scripting.
On the other hand, the Tidyverse is a collection of R packages designed to streamline the process of data analysis and manipulation. It follows the principles of tidy data, emphasizing the use of consistent data structures and function names for seamless workflow. The Tidyverse includes packages such as dplyr, tidyr, ggplot2, and purrr, among others, which offer powerful tools for data manipulation, visualization, and modeling. The Tidyverse is particularly popular for its intuitive syntax, which simplifies complex data tasks and promotes reproducible research practices.
Ultimately, the choice between Base R and the Tidyverse depends on factors such as personal preference, familiarity with each approach, and the specific requirements of the data analysis task at hand. Some users prefer the simplicity and efficiency of Base R functions, while others appreciate the consistency and flexibility provided by the Tidyverse packages. Both approaches have their place in the R ecosystem, and many users find value in leveraging both depending on the context.
We believe that the tidyverse notation and functions provide easier access to coding in R. We therefore provide both options and let you decide which one you want to use.
Highcharter and ggplot2 (from the tidyverse) represent two powerful tools for data visualization in R, each with its own distinct strengths. Highcharter shines in its ability to create interactive and dynamic visualizations, making it ideal for exploring data in depth and engaging users with features like zooming and panning. Its wide range of chart types and extensive customization options offer flexibility in crafting visually appealing plots tailored to specific requirements. On the other hand, ggplot2 is renowned for its elegant syntax and ease of use, making it a preferred choice for many R users, particularly those within the tidyverse ecosystem. While ggplot2 may not offer the same level of interactivity as Highcharter out of the box, it excels in producing static, publication-quality plots with minimal effort. Additionally, ggplot2’s grammar of graphics framework provides a conceptual foundation for understanding and constructing a wide range of visualizations. Ultimately, the choice between Highcharter and ggplot2 depends on the specific needs of the analysis, with Highcharter offering interactivity and flexibility, and ggplot2 providing simplicity and elegance.
We thought it might be nice to visualize the data interactively, which is why occasionally, highchart plots will be provided in addition.
Long and wide dataframes represent different ways of organizing data, each with its advantages depending on the analysis and visualization tasks at hand. In a wide dataframe, each row typically represents an observation, while each column represents a variable. This format is suitable for storing data where each variable has its own column, making it intuitive for human interpretation. On the other hand, in a long dataframe, multiple rows may represent different measurements or observations for the same entity, with additional columns used to distinguish between different variables or categories. Long format is often preferred for analyses involving multiple related variables or repeated measures, as it allows for easier manipulation and plotting of data. Additionally, long format is conducive to many tidy data principles, facilitating seamless integration with tidyverse functions and packages for data manipulation and analysis. Ultimately, the choice between long and wide dataframes depends on the specific analytical needs and the most effective way to represent and work with the data. Our iris dataframe is currently in a wide dataframe format. When working with packages in the tidyverse, it can be helpful to use the long dataframe structure.
transform dataframes from wide to long format
# Base R: stack the four measurement columns on top of each other
iris_long <- data.frame(
Species = rep(iris$Species, times = 4),
Measurement = rep(c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"), each = nrow(iris)),
Value = c(iris$Sepal.Length, iris$Sepal.Width, iris$Petal.Length, iris$Petal.Width)
)
# Tidyverse: pivot_longer() gives the same information in long format (the row order differs)
iris_long <- iris %>% pivot_longer(cols = starts_with("Sepal.") | starts_with("Petal."),
names_to = "Measurement",
values_to = "Value")
transform dataframes from long to wide format
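# Species alone does not identify a single flower, so the values are first
# collected into list-columns per species and then unnested row by row.
# Note: this overwrites iris, and Species becomes the first column.
iris <- iris_long %>% pivot_wider(names_from = Measurement,
values_from = Value,
values_fn = list) %>%
unnest(cols = -Species)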
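The long format no longer records which flower a value belongs to, which makes the way back to the wide format ambiguous. One possible way around this, sketched here under the assumption that a helper column is acceptable, is to carry a row identifier through the reshaping. The column name `flower_id` and the object names are hypothetical and not part of the original data.

```r
library(dplyr)
library(tidyr)

# flower_id is a hypothetical helper column, not part of the original data
iris_id <- datasets::iris %>%
  mutate(flower_id = row_number())

# wide -> long: one row per flower and measurement
iris_id_long <- iris_id %>%
  pivot_longer(cols = c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width),
               names_to = "Measurement",
               values_to = "Value")

# long -> wide: flower_id and Species together identify each original row
iris_id_wide <- iris_id_long %>%
  pivot_wider(names_from = Measurement, values_from = Value)

dim(iris_id_long)  # 600 rows, 4 columns
dim(iris_id_wide)  # 150 rows, 6 columns
```

With the identifier in place, every long row maps back to exactly one flower, so the wide table has one row per flower again.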
From the head() function we know that we have a
dataframe with 5 columns. Thanks to the documentation we also know that
four of these columns (.Width & .Length) are numeric measurements in
centimeters. As differences between the values are meaningful (5 cm is
more than 4 cm, and by exactly 1 cm), we speak of
metric scaled variables. The Species column contains
information on the type of iris species we have. As we have different
groups within the variable, but cannot rank them, the species
variable is a nominal scaled variable. An ordinal scaled
variable is not in the data, but the perceived beauty of an iris would
be an example: we can rank the beauty of flowers, but we cannot state
that one flower is twice as nice as another one.
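The three scales map onto different R classes. The following is a minimal sketch of that mapping. The beauty ratings are made up for illustration and are not part of the iris data.

```r
# nominal: categories without an order
species <- factor(c("setosa", "virginica", "versicolor"))

# ordinal: categories with an order but no meaningful distances
# (the beauty ratings are made up for illustration)
beauty <- factor(c("low", "high", "medium"),
                 levels = c("low", "medium", "high"),
                 ordered = TRUE)

# metric: numbers with meaningful differences
petal_length <- c(1.4, 5.1, 4.5)

is.ordered(species)      # FALSE
is.ordered(beauty)       # TRUE
beauty[1] < beauty[2]    # TRUE, "low" ranks below "high"
mean(petal_length)       # a mean only makes sense for the metric variable
```

Comparisons such as `<` work for ordered factors, while arithmetic such as mean() is reserved for numeric vectors.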
# Sidenote: R does not care so much about the unit, but about the class of the columns in a dataframe. It can happen that numbers are stored as characters. In that case a 1 turns into a "1", which makes it impossible to compute anything with it (give it a try in the console)
# ! always check the classes of the variables in your dataframe
lapply(iris, class)
## $Species
## [1] "factor"
##
## $Sepal.Length
## [1] "numeric"
##
## $Sepal.Width
## [1] "numeric"
##
## $Petal.Length
## [1] "numeric"
##
## $Petal.Width
## [1] "numeric"
Now we know that our metric scaled variables are numeric and our nominal scaled variable is a factor. No further changes are required. Let’s move on.
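If a column of numbers does turn out to be stored as text, it has to be converted before any computation. A small sketch of the problem and the fix, with made-up values:

```r
# numbers read in as text are of class "character"
x_chr <- c("1", "2", "3")
class(x_chr)

# arithmetic on character values fails, so the error is caught here
result <- tryCatch(x_chr + 1, error = function(e) "error")

# converting to numeric makes computation possible again
x_num <- as.numeric(x_chr)
x_num + 1
```

Note that as.numeric() returns NA (with a warning) for text that cannot be read as a number, so it is worth checking for NA values after the conversion.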
Descriptive analysis of continuous variables involves summarizing the distribution and central tendencies of numerical data. This typically includes measures such as the mean, median, mode, range, variance, and standard deviation. These statistics offer insights into the central tendency (where the data tends to cluster) and the spread or variability of the data points. Histograms and density plots are commonly used to visually represent the distribution of continuous variables, showing the frequency or density of values across their range. Box plots provide a concise summary of key statistics such as quartiles and outliers, aiding in the identification of patterns and discrepancies within the dataset. Through descriptive analysis, analysts gain a comprehensive understanding of the shape, spread, and central tendency of continuous variables, laying the groundwork for further inferential and exploratory analysis.
The iris dataframe provides 4 continuous variables: the length and width of the
sepal and the length and width of the petal. In R, a
simple approach is to apply the summary() function to each column
that is numeric.
Summary statistics can be helpful to …
# Get the indices of columns containing "Sepal" or "Petal"
continuousvariables <- grep("Sepal|Petal", names(iris))
# Subset the dataframe using the obtained indices
iris_continuousvariables <- iris[, continuousvariables]
# apply the summary function on the continuous variables
summary_continuousvariables <- lapply(iris_continuousvariables, summary)
# print the result
summary_continuousvariables
## $Sepal.Length
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.300 5.100 5.800 5.843 6.400 7.900
##
## $Sepal.Width
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.800 3.000 3.057 3.300 4.400
##
## $Petal.Length
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.600 4.350 3.758 5.100 6.900
##
## $Petal.Width
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.300 1.300 1.199 1.800 2.500
# summarise the continuous measurement variables
summary_continuousvariables <- iris_long %>% group_by(Measurement) %>%
summarise(min = min(Value),
quartile_25 = quantile(Value, 0.25),
median = median(Value),
mean = mean(Value),
quartile_75 = quantile(Value, 0.75),
max = max(Value))
## # A tibble: 4 × 7
## Measurement min quartile_25 median mean quartile_75 max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Petal.Length 1 1.6 4.35 3.76 5.1 6.9
## 2 Petal.Width 0.1 0.3 1.3 1.20 1.8 2.5
## 3 Sepal.Length 4.3 5.1 5.8 5.84 6.4 7.9
## 4 Sepal.Width 2 2.8 3 3.06 3.3 4.4
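The summaries above cover location and quartiles, but not the standard deviation, the variance, or the mode. The following base R sketch adds them. `stat_mode` is a hypothetical helper name, since base R has no built-in function for the statistical mode.

```r
iris_num <- datasets::iris[, sapply(datasets::iris, is.numeric)]

# standard deviation, variance and range are not part of summary()
spread <- data.frame(
  sd    = sapply(iris_num, sd),
  var   = sapply(iris_num, var),
  range = sapply(iris_num, function(x) diff(range(x)))
)
round(spread, 3)

# most frequent value; with ties, the smallest of the tied values is returned
# (stat_mode is a hypothetical helper, not a base R function)
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}
sapply(iris_num, stat_mode)
```

The standard deviation is the square root of the variance, so the `sd` column is easier to interpret: it is in the same unit (cm) as the measurements.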
For visualizing the distribution of a single continuous variable, a boxplot is a compact and readable option.
# Subset the iris dataset to include only numeric variables
iris_continuousvariables <- iris[, sapply(iris, is.numeric)]
boxplot(iris_continuousvariables)
# Boxplot per measurement, with the individual observations added as jittered points
iris_boxplot <- iris_long %>% ggplot(aes(y = Value, x = Measurement)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter() +
labs(title = "Distribution of data in continuous variables", x = "", y = "measured size (cm)") +
theme_bw()
iris_boxplot
Adding geom_jitter() allows us to see that
there may be certain groups within the Measurements. Simply looking
at the averages of the Measurements may not be helpful, or may even be
misleading, as we are not able to differentiate between the different
Species.
iris_boxplot <- hcboxplot(x = iris_long$Value, var = iris_long$Measurement) %>%
hc_chart(type = "column")
iris_boxplot
A histogram can be helpful, if …
# Subset the iris dataset to include only numeric variables
iris_continuousvariables <- iris[, sapply(iris, is.numeric)]
# Create histograms for each numeric variable, using a for loop
par(mfrow = c(1, 4))
for(i in names(iris_continuousvariables)){
hist(iris_continuousvariables[[i]], breaks = 15, main = paste0("Histogram for ", i), xlab = "", ylab = "Frequency")
}
In the hist() function, we can suggest the number of
bins with the argument breaks = .... R treats a single number as a
suggestion and may pick a slightly different number of bins. The
argument main = specifies the title; xlab =
and ylab = are self-explanatory.
# Histogram of each measurement, one panel per variable
iris_histogram <- iris_long %>% ggplot(aes(x = Value)) +
geom_histogram(binwidth = 0.1, color = "orange", fill = "grey") +
labs(x = "", y = "Absolute Frequency", title = "Histogram of the different continuous variables", caption = "Data: iris") +
facet_grid(~Measurement) +
theme_bw()
Notice how the geom_histogram() function allows us to
specify the width of the bins. As the values are recorded in steps
of 0.1, it makes sense to choose that width for the
bins.
It also stands out that using the ggplot2 package is much more intuitive than looping over the different numeric variables.
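A histogram is nothing more than a count of observations per interval. As a rough sketch of what happens behind the plot, the counts can be extracted without drawing anything. The bin width of 0.5 cm and the object names are arbitrary choices for this illustration.

```r
x <- datasets::iris$Petal.Length

# bin edges in steps of 0.5 cm covering the whole range of the data
# (the bin width of 0.5 is an arbitrary choice for this illustration)
edges <- seq(floor(min(x)), ceiling(max(x)), by = 0.5)

# plot = FALSE returns the binning instead of drawing it
bins <- hist(x, breaks = edges, plot = FALSE)

# one row per bin: lower edge, upper edge and number of observations
data.frame(from = head(bins$breaks, -1),
           to = tail(bins$breaks, -1),
           count = bins$counts)
```

Changing `by` in seq() shows how strongly the impression of a histogram depends on the chosen bin width.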
listofplots <- map(unique(iris_long$Measurement), function(x){
iris_subset <- iris_long %>%
filter(Measurement == x)
hchart(
iris_subset$Value,
type = "area", name = x
) %>%
hc_legend(enabled = FALSE) %>%
hc_subtitle(text = x)
})
grouped_histogram <- htmltools::tagList(
htmltools::h4("Histogram of the different continuous variables"),
hw_grid(listofplots, rowheight = 300, ncol = 4))
Highchart again shows its strength in its interactivity.
A density function can be helpful, if … How does it work?
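One way to see how a kernel density estimate is constructed is to compute it by hand. With a Gaussian kernel, the estimate at a point x0 is the average of normal densities centered on the observations: f(x0) = (1/n) Σ (1/h) φ((x0 − x_i)/h), where h is the bandwidth and φ the standard normal density. The sketch below compares this formula with the density() function from base R. The object names are chosen for illustration.

```r
x <- datasets::iris$Petal.Length

# reference: base R estimate with a Gaussian kernel and the default bandwidth
kde <- density(x, kernel = "gaussian")
h <- kde$bw

# by hand: average of normal densities centered on each observation,
# evaluated at the same grid points that density() used
kde_manual <- sapply(kde$x, function(x0) mean(dnorm(x0, mean = x, sd = h)))

# density() uses a fast binned approximation, so expect small differences
max(abs(kde_manual - kde$y))
```

A larger bandwidth h gives a smoother curve, a smaller one follows the individual observations more closely. The bw argument of density() can be used to try this out.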
# to be added
# to be added
# Density plot of each measurement, one panel per variable
iris_density <- iris_long %>% ggplot(aes(x = Value)) +
geom_density(aes(y = after_stat(density)), color = "red") +
labs(title = "Density function of the different continuous variables") +
facet_grid(~Measurement) +
theme_bw()
listofplots <- map(unique(iris_long$Measurement), function(x){
iris_subset <- iris_long %>%
filter(Measurement == x)
hchart(
density(iris_subset$Value),
type = "area", name = x
) %>%
hc_legend(enabled = FALSE) %>%
hc_subtitle(text = x)
})
grouped_density <- htmltools::tagList(
htmltools::h4("Density function of the different continuous variables"),
hw_grid(listofplots, rowheight = 300, ncol = 4))
Highchart again shows its strength in its interactivity.
From looking at the summary statistics of a single continuous variable, as well as its distribution, we can see that…
Knowing more about the individual continuous variables is helpful, but it may be useful to add the discrete “Species” variable in order to analyze the data further. From the histograms, but also from the added points in the boxplot, we could foresee that there might be groups in the data, which highlights the importance of “controlling for” the discrete Species variable.
This leaves us with different options. We can look at summary tables or use Boxplots (again) and further group our density plots.
# grouped summary: select the numeric columns by type instead of by position,
# because the position of Species changed when iris was reshaped above
iris_groupedsummary <- lapply(split(iris[, sapply(iris, is.numeric)], iris$Species), function(x) summary(x))
iris_groupedsummary
## $setosa
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.300 Min. :1.000 Min. :0.100
## 1st Qu.:4.800 1st Qu.:3.200 1st Qu.:1.400 1st Qu.:0.200
## Median :5.000 Median :3.400 Median :1.500 Median :0.200
## Mean :5.006 Mean :3.428 Mean :1.462 Mean :0.246
## 3rd Qu.:5.200 3rd Qu.:3.675 3rd Qu.:1.575 3rd Qu.:0.300
## Max. :5.800 Max. :4.400 Max. :1.900 Max. :0.600
##
## $versicolor
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.900 Min. :2.000 Min. :3.00 Min. :1.000
## 1st Qu.:5.600 1st Qu.:2.525 1st Qu.:4.00 1st Qu.:1.200
## Median :5.900 Median :2.800 Median :4.35 Median :1.300
## Mean :5.936 Mean :2.770 Mean :4.26 Mean :1.326
## 3rd Qu.:6.300 3rd Qu.:3.000 3rd Qu.:4.60 3rd Qu.:1.500
## Max. :7.000 Max. :3.400 Max. :5.10 Max. :1.800
##
## $virginica
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.900 Min. :2.200 Min. :4.500 Min. :1.400
## 1st Qu.:6.225 1st Qu.:2.800 1st Qu.:5.100 1st Qu.:1.800
## Median :6.500 Median :3.000 Median :5.550 Median :2.000
## Mean :6.588 Mean :2.974 Mean :5.552 Mean :2.026
## 3rd Qu.:6.900 3rd Qu.:3.175 3rd Qu.:5.875 3rd Qu.:2.300
## Max. :7.900 Max. :3.800 Max. :6.900 Max. :2.500
# define grouping values
iris_groupedsummary <- iris_long %>% group_by(Species, Measurement) %>%
# make a grouped summary statistics
summarise(min = min(Value),
quartile_25 = quantile(Value, 0.25),
median = median(Value),
quartile_75 = quantile(Value, 0.75),
max = max(Value)) %>%
arrange(Measurement) %>%
# display all values
head(12)
## # A tibble: 12 × 7
## # Groups: Species [3]
## Species Measurement min quartile_25 median quartile_75 max
## <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 setosa Petal.Length 1 1.4 1.5 1.58 1.9
## 2 versicolor Petal.Length 3 4 4.35 4.6 5.1
## 3 virginica Petal.Length 4.5 5.1 5.55 5.88 6.9
## 4 setosa Petal.Width 0.1 0.2 0.2 0.3 0.6
## 5 versicolor Petal.Width 1 1.2 1.3 1.5 1.8
## 6 virginica Petal.Width 1.4 1.8 2 2.3 2.5
## 7 setosa Sepal.Length 4.3 4.8 5 5.2 5.8
## 8 versicolor Sepal.Length 4.9 5.6 5.9 6.3 7
## 9 virginica Sepal.Length 4.9 6.22 6.5 6.9 7.9
## 10 setosa Sepal.Width 2.3 3.2 3.4 3.68 4.4
## 11 versicolor Sepal.Width 2 2.52 2.8 3 3.4
## 12 virginica Sepal.Width 2.2 2.8 3 3.18 3.8
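For a compact comparison of means and standard deviations per species, base R's aggregate() is one more option. This is a small sketch on the original dataset. The object names are chosen for illustration.

```r
# mean of every numeric column per species
# (the dot in the formula stands for "all other columns")
species_means <- aggregate(. ~ Species, data = datasets::iris, FUN = mean)
species_means

# the same for the standard deviation
species_sds <- aggregate(. ~ Species, data = datasets::iris, FUN = sd)
species_sds
```

The result is an ordinary dataframe with one row per species, which makes it easy to pass on to a plotting function.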
This helps us to compare the different species of iris flowers, based on what we measured. However, a plain table is not very helpful. Here is a visualization of the distribution as boxplots:
grouped_summary_plot <- iris_long %>%
ggplot(aes(x = Species, y = Value, color = Species)) +
geom_boxplot() +
geom_point(alpha = 0.1) +
facet_grid(~Measurement) +
labs(y = "measured values (cm)", x = "", title = "Boxplots for different measurements", subtitle = "Distribution by Species", caption = "Data: iris") +
theme_bw()
# to be added
# to be added
In Base R this is a pain
# Histogram of each measurement, colored by species
iris_histogram <- iris_long %>% ggplot(aes(x = Value, color = Species, fill = Species)) +
geom_histogram(binwidth = 0.1, alpha = 0.3) +
labs(x = "",
y = "Absolute Frequency",
title = "Histogram of the different continuous variables",
caption = "Data: iris") +
facet_grid(~Measurement) +
theme_bw()
Notice how the geom_histogram() function allows us to
specify the width of the bins. As the values are recorded in steps
of 0.1, it makes sense to choose that width for the
bins.
It also stands out that using the ggplot2 package is much more intuitive than looping over the different numeric variables.
# to be added
Highchart again shows its strength in its interactivity.
# to be added
# to be added
In Base R this is a pain
# Density plot of each measurement, colored by species
iris_density <- iris_long %>% ggplot(aes(x = Value, color = Species, fill = Species)) +
geom_density(aes(y = after_stat(density)), alpha = 0.3) +
labs(x = "",
y = "Density",
title = "Density of the different continuous variables",
caption = "Data: iris") +
facet_grid(~Measurement) +
theme_bw()
# to be added
Highchart again shows its strength in its interactivity.
From looking at the summary statistics of a single continuous variable, as well as its distribution, we can see that…
# an lm() will be added here
# an lm() will be added here
# Scatter plot
plot(iris$Sepal.Width, iris$Sepal.Length,
col = iris$Species, pch = as.numeric(iris$Species),
xlab = "Sepal Width", ylab = "Sepal Length",
main = "Scatter Plot: Width vs. Length",
xlim = c(2, 4.5), ylim = c(4, 8.5))
scatterplot_width_length <- ggplot(data=iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species, shape = Species)) +
geom_point() +
geom_smooth(se=FALSE, method = "lm") +
labs(title = "Scatter Plot", subtitle = "Width vs. Length", color = "Species",
x = "Sepal Width", y = "Sepal Length") +
theme_bw()
lm.model <- broom::augment(lm(Sepal.Length ~ Sepal.Width + Species, data = iris))
scatterplot_width_length <- hchart(lm.model, 'scatter',
hcaes(x = Sepal.Width, y = Sepal.Length, group = Species)) %>%
hc_colors(c("#f8766d", "#00ba38", "#619cff")) %>%
hc_add_series(lm.model, "line", hcaes(x = Sepal.Width, y = .fitted, group = Species)) %>%
hc_legend(enabled = TRUE)
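To look at the numbers behind the fitted lines, the linear model can be inspected directly. The sketch below fits two models on the original dataset: one that ignores Species and one that uses the same formula as the highcharter example above. The object names are chosen for illustration.

```r
iris_df <- datasets::iris

# pooled model: ignores Species
fit_pooled <- lm(Sepal.Length ~ Sepal.Width, data = iris_df)

# model with Species as an additional predictor (same formula as above)
fit_species <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris_df)

# slope of Sepal.Width in both models
coef(fit_pooled)["Sepal.Width"]
coef(fit_species)["Sepal.Width"]
```

In the iris data the pooled slope comes out negative, while the slope is positive once Species is included. This is a concrete case of why controlling for the discrete variable matters.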
You can see that, …
From looking at the summary statistics of a single continuous variable, as well as its distribution, we can see that…
https://discdown.org/rprogramming/dekriptiv.pdf (in German)
google “summary statistics in R” and “histogram in R”
kernel density estimates: https://mathisonian.github.io/kde/ and/or google “kernel density estimates” and find relevant resources.
Regression. Models, methods and applications chapter 1.2
Explain nominal, ordinal and metric scale.
Use and explain e.g. the summary() and sd() function of R (other functions or packages are welcome) for relevant summary statistics (mean, std dev, min, max).
Use e.g. the hist() function of R (other functions or packages are ok) for plotting histograms.
Use e.g. the density() function of R (other functions or packages are ok) for plotting kernel density estimators. Explain how these estimators are constructed.
Use the data sets of the lecture (available in OLAT) for demonstration, other data sets are welcome.