Lecture 2 - Basics of Descriptive Statistics

Data Frames

In this course, we will mostly handle rectangular (tabular) data: data arranged as in a spreadsheet, with rows being samples and columns being features. Here is an example:

my_data <- data.frame(name = c('James', 'Alice', 'Lucy'), Math_grade = c(80, 90, 100), English_grade = c(100, 90, 80))
my_data

##    name Math_grade English_grade
## 1 James         80           100
## 2 Alice         90            90
## 3  Lucy        100            80

In this simple example, we have three students and their scores in two subjects. Each row is a sample (or observation), and each column is an attribute of the samples (usually called a feature or variable).

Data Frames Are Special Lists

In R, data frames and tibbles (which we will learn about later) are built on top of lists. You can think of a data frame as a list with the following properties:

Each element must be an atomic vector with an assigned label.
Each element must be of the same length.

Again, we can use the two basic functions length() and names() to retrieve this information for a data frame.

length(my_data)

## [1] 3

names(my_data)

## [1] "name"          "Math_grade"    "English_grade"

Note that here length() returns the number of columns (features). To get the number of rows, we need the function nrow():

nrow(my_data)

## [1] 3
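Because a data frame is a list of equal-length columns, the usual list tools work on it. The following is a minimal sketch that rebuilds my_data from above and checks these properties; the names n_cols and n_rows are only illustrative.

```r
# Rebuild the example data frame from this section
my_data <- data.frame(
  name = c("James", "Alice", "Lucy"),
  Math_grade = c(80, 90, 100),
  English_grade = c(100, 90, 80)
)

is.list(my_data)          # a data frame is a list under the hood
my_data[["Math_grade"]]   # list-style access returns one column as a vector

n_cols <- ncol(my_data)   # same value as length(my_data)
n_rows <- nrow(my_data)
dim(my_data)              # rows and columns together

# Every column (list element) has the same length, equal to the number of rows
sapply(my_data, length)
```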

R Basics - Basic Operations of Data Frames

Lab Exercise - Basic Operations of Data Frames

Try the following commands in the RStudio console. Before you execute each one, think about what you expect the result to be:

View(my_data)
my_data$name
my_data$Math_grade
my_data$English_grade
my_data[1, ]
my_data[2, ]
my_data[3, 3]

Create a data frame with the first column being years from 1999 to 2002, and the second column being the day of the week (Monday, Tuesday, etc.) for January 1st of that year.
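The indexing forms used in this exercise can be sketched on a separate data frame, so the exercise itself is left for you to work out. The data frame fruit and all of its values are hypothetical and serve only as an illustration.

```r
# Hypothetical data frame, used only to illustrate indexing
fruit <- data.frame(
  item = c("apple", "banana", "cherry"),
  price = c(1.2, 0.5, 3.0),
  stock = c(10, 25, 4)
)

price_col <- fruit$price        # $ extracts one column as a vector
second_row <- fruit[2, ]        # [row, ] keeps all columns of one row (still a data frame)
one_cell <- fruit[3, 2]         # [row, column] picks a single value
stock_col <- fruit[, "stock"]   # columns can also be selected by name
```

Note that selecting a whole row returns a data frame, while selecting a single column or cell returns a plain vector.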

R Basics - install and load packages

In this course, we are going to analyze many data sets from the real world. Just as Python uses import, R needs packages to be loaded before we can access their data sets and functions.

One of the most popular R package collections for data science is tidyverse. It includes packages such as ggplot2, dplyr, forcats, readr, tidyr, and stringr that handle different jobs in data science, including data import, tidying, transformation, and visualization.

For more details, please refer to https://www.tidyverse.org/. To install the package, simply run the following command:

install.packages("tidyverse")

Once you have installed a package, you can load it with the library() function. You will not be able to use the functions, objects, and help files in a package until you load it:

library(tidyverse)

The message printed when tidyverse loads tells you which packages it attaches, including ggplot2, tibble, tidyr, readr, purrr, and dplyr. These are considered to be the core of the tidyverse because you’ll use them in almost every analysis.

Packages in the tidyverse change fairly frequently. You can see if updates are available, and optionally install them, by running tidyverse_update().
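Installing is needed only once per machine, while loading is needed in every new R session. As a minimal sketch, the check below tests whether a package is already installed before suggesting an install; the helper name is_installed is hypothetical and not part of any package.

```r
# Hypothetical helper: TRUE if the package is installed, FALSE otherwise.
# It only checks; it never installs anything.
is_installed <- function(pkg) {
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

is_installed("stats")   # stats ships with R, so this is TRUE

if (!is_installed("tidyverse")) {
  message('tidyverse is not installed; run install.packages("tidyverse") once.')
}
```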

Access a data set from packages

After we load tidyverse, many data sets become accessible, and we can access each one simply by its name. For example, today we will take a look at the data set named mpg.

mpg is a data set of fuel economy from 1999 to 2008 for 38 popular models of cars. More details about the data set can be found at https://ggplot2.tidyverse.org/reference/mpg.html.

We can also open the same help page from within R with the following command:

?mpg
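Besides the help page, a few base R functions give a quick first look at a data set. This is a minimal sketch, assuming ggplot2 (the tidyverse package that provides mpg) is installed; the name mpg_copy is only illustrative.

```r
library(ggplot2)   # mpg ships with ggplot2, which tidyverse attaches

dim(mpg)           # number of rows and columns
names(mpg)         # column names
head(mpg, 3)       # first three rows

# The data set can also be reached without attaching the package
mpg_copy <- ggplot2::mpg
```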

The `mpg` Data set

Now let’s have a complete view of the data set by using the command:

View(mpg)

Or you may use the following commands as well:

mpg
glimpse(mpg)

The good thing about glimpse(mpg) is that it lists all columns. If you only type mpg, some columns are hidden when there are too many to fit on the screen.

A glimpse of `mpg`

glimpse(mpg)

## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…

Lab Questions:

How many columns (features/variables) are there? How many rows (samples) are there?
What is the meaning of each variable?
What is the average mpg in city for all car models in the data set?

Example of Data Visualization and Exploration

In the next few slides, we won’t learn any new commands. Instead, I will use mpg as an example to show you what we will learn to do in this course.

Please don’t worry about how the following figures are generated; we will start learning that in the next class.

Data exploration refers to the practice of finding useful patterns or trends in a given data set, and plotting graphs to visualize data is one of the most important ways to explore data.

Example of Data Visualization and Exploration

For example, given that cty is the miles per gallon in the city and hwy is the miles per gallon on the highway, what does the following plot indicate? Why does it look like an increasing line?

Example of Data Visualization and Exploration

Now let’s explore the effect of drive train type (f = front-wheel drive, r = rear-wheel drive, 4 = 4-wheel drive) on fuel efficiency. What conclusion can you draw from the following graph? Which type is the most fuel-efficient?

Example of Data Visualization and Exploration

Now let’s explore the effect of number of cylinders on fuel efficiency. What conclusion can you draw from the following graph?

Basic Concepts in Statistics

When we use graphs to summarize and visualize data, we are doing descriptive statistics. To understand what it is, we need to go through some basic concepts of statistics.

Let’s start our study by looking at a statement of a kind commonly seen in the news media:

A poll of 1,247 voters shows that 48% of US voters support the Republican candidate, with a 3% margin of error.

Clearly, this is a statement based on statistics. Here are the key points for understanding it:

A “poll” means that the data were collected from a small sample (1,247) of US voters, not all US voters.
“48% of US voters support the Republican candidate” is an estimate for all US voters, based on the responses of that small group.
“3% margin of error” measures how accurate the estimate (48%) is: the true percentage is likely to be within 3 percentage points of 48%.
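The 3% figure can be reproduced approximately with a standard formula for the margin of error of a proportion. The statement does not say how the pollster computed it, so the sketch below rests on assumptions: a simple random sample, a 95% confidence level, and the normal approximation. Under those assumptions the result comes out a little under 3 percentage points.

```r
n <- 1247        # sample size from the statement
p_hat <- 0.48    # sample proportion from the statement

# Assumption: 95% confidence level and a simple random sample
z <- qnorm(0.975)                                      # about 1.96
margin_of_error <- z * sqrt(p_hat * (1 - p_hat) / n)   # normal approximation

round(100 * margin_of_error, 1)                        # in percentage points
```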

Population and Sample

In statistics, usually we study a specific collection of objects (people, companies, cars etc.). In this example, this collection of objects is all US voters. It is called a population.
Usually we are interested in a specific attribute or characteristic of the population. In this example, we are interested in whether a voter supports a candidate. This is called a random variable. It is data of this variable that are collected.
Usually it is too costly or infeasible to study the entire population. Therefore we only collect data for a subset of population. In this example, it is the 1,247 voters that answer the question. It is called a sample.
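The relationship between a population and a sample can be sketched with simulated data. Everything here is hypothetical: the population size of 100,000 and the 48% support rate are made up so that the population proportion, which is unknown in a real study, can be compared with the sample proportion.

```r
set.seed(1)  # make the simulated draw reproducible

# Hypothetical population: 100,000 voters, TRUE = supports the candidate
population <- rbinom(100000, size = 1, prob = 0.48) == 1

# The sample: 1,247 voters chosen at random without replacement
poll <- sample(population, size = 1247)

mean(population)  # population proportion (unknown in a real study)
mean(poll)        # sample proportion, our estimate of it
```

The two proportions are close but not identical, which is why a poll is reported together with a margin of error.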

Lab Exercise

Statement: “A study of survival of 1,225 newly diagnosed breast cancer cases finds that the average seven-year survival rate for Stage I breast cancer was 92%.”

seven-year survival rates: the percentage of patients that survive seven years after diagnosis.

What is the population of this study?
What is the sample of this study?
What is the random variable of this study?

Descriptive vs Inferential Statistics

Inferential Statistics: Using properties of a sample to draw conclusions about the population; this requires probability theory.
Descriptive Statistics: Only stating facts about the sample itself.
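The difference can be sketched on one small data set. The vector sleep_hours and its values are hypothetical, and the confidence interval assumes the ten students are a random sample from a larger population.

```r
# Hypothetical sample: hours of sleep reported by 10 students
sleep_hours <- c(6.5, 7.0, 8.0, 5.5, 7.5, 6.0, 7.0, 8.5, 6.5, 7.5)

# Descriptive: facts about these 10 students only
sample_mean <- mean(sleep_hours)
sample_sd <- sd(sleep_hours)

# Inferential: a 95% confidence interval for the mean of the population
# these students were drawn from (assumes a random sample)
ci <- t.test(sleep_hours)$conf.int
ci
```

The mean and standard deviation describe only the data in hand, while the interval is a claim about students who were never measured.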

Lab Exercises

A study shows that 71.6% of US adults are overweight. Answer the following questions:

Under what condition is the study descriptive?
Under what condition is the study inferential?
Which one is more likely to be the case?

Types of random variables

Categorical (or qualitative) variable: takes values that are not numerical (not numbers)
- Ordinal variable: a categorical variable whose categories have a natural order.
Numeric (or quantitative) variable: takes values that are numeric (numbers)
- Discrete variable: A numeric variable whose possible values can be listed.
- Continuous variable: A numeric variable whose possible values form an interval of real numbers.
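In R, these variable types usually map onto different vector types. The sketch below uses hypothetical values chosen only to illustrate each type; storing a discrete variable as an integer vector is a common convention rather than a rule.

```r
# Hypothetical values, chosen only to illustrate each type
fuel <- factor(c("petrol", "diesel", "petrol"))       # categorical, no order
size <- factor(c("small", "large", "medium"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)                        # ordinal
cylinders <- c(4L, 6L, 8L)                            # discrete numeric
displacement <- c(1.8, 2.0, 2.8)                      # continuous numeric

is.ordered(fuel)      # FALSE
is.ordered(size)      # TRUE
size < "large"        # comparisons make sense only for ordered factors
class(cylinders)      # "integer"
class(displacement)   # "numeric"
```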

Lab Exercise

Give a real example of each variable type:

- categorical but not ordinal
- ordinal
- discrete numeric
- continuous numeric

Introduction to data plots

Before we start to plot graphs, we need to review the basic types of data plots. There are many of them, and below are a few examples, including some of the most commonly used ones:

Plot types depend on data types

Why do we have this many plot types? One reason is that we need different plots to best illustrate the relationship between (usually one or two) variables of different types.

Bar plots: (usually) for one categorical variable.
Histograms: for one numeric variable.
Box plots: for one continuous variable.
Scatter plots: (usually) for two numeric variables.
Multiple box plots: for one continuous variable and one categorical/discrete variable.
Stacked bar plots: for two categorical variables.

Example of a bar plot

Next, we will use the mpg data set to give an example of each plot type and explain its meaning.

A bar plot shows the distribution of one categorical variable.

ggplot(mpg) + 
  geom_bar(aes(x = drv))
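The height of each bar is simply the number of rows in each category. As a minimal sketch, assuming ggplot2 is installed, the same counts can be computed with base R's table(); the name drv_counts is only illustrative.

```r
library(ggplot2)  # provides the mpg data set

# One count per category of drv; these are the bar heights
drv_counts <- table(mpg$drv)
drv_counts
```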

Example of a histogram

A histogram shows the distribution of one numeric variable (discrete or continuous).

ggplot(mpg) + 
  # color sets the bar outline; geom_histogram() has no `border` argument
  geom_histogram(aes(x = hwy), color = "white", binwidth = 5)

Example of a box plot

A box plot shows a five-number summary of a numeric variable (discrete or continuous).

ggplot(mpg) + 
  geom_boxplot(aes(x = hwy))
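The five numbers behind a box plot are the minimum, first quartile, median, third quartile, and maximum. A minimal sketch, assuming ggplot2 is installed, computes them with base R; the name five_numbers is only illustrative. Note that geom_boxplot() draws whiskers only up to 1.5 times the interquartile range from the box and shows points beyond that individually, so the whisker ends need not equal the minimum and maximum.

```r
library(ggplot2)  # provides the mpg data set

# Minimum, first quartile, median, third quartile, maximum of hwy
five_numbers <- quantile(mpg$hwy)
five_numbers

# Interquartile range: the width of the box
IQR(mpg$hwy)
```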

Example of a scatter plot

A scatter plot usually shows the relationship between two numeric variables.

ggplot(mpg) + 
  geom_point(aes(x = hwy, y = cty))

Example of a multiple box plot

A multiple box plot usually shows the relationship between one categorical variable and one numeric variable.

ggplot(mpg) + 
  geom_boxplot(aes(x = drv, y = cty))

Example of a stacked bar plot

A stacked bar plot usually shows the relationship between two categorical variables.

ggplot(data = mpg) + 
  geom_bar(mapping = aes(x = drv, fill = class))

Miao Yu

2024-01-17
