2023-10-31

Introduction

Descriptive statistics is the branch of statistics that deals with organizing, summarizing, and presenting data. Its primary purpose is to describe the main features of a data set, providing a clear and concise overview of the data’s characteristics.

Types of Data

There are two types of data, categorical and numerical. Within numerical data, a further distinction can be made between discrete and continuous data.

  • Categorical data represents categories or groups.

  • Numerical data consists of measurable quantities.

  • Continuous data involves measurements that can take any of infinitely many values within a range, like height or temperature.

  • Discrete data deals with distinct, countable values, often representing things we can count, like the number of students in a class. Essentially, continuous data can be any value, while discrete data is about counting specific, separate things.

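As shown below, in the “airquality” data set the only categorical variable is “Month”, although it is stored as a number (5 through 9 in this data set). “Day” is a discrete day-of-month value, and the remaining variables are numerical measurements of continuous quantities, although some are recorded as whole numbers.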

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
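One way to check how R stores each of these variables is sketched below. Converting Month to a factor is an assumption about how one might make its categorical role explicit, not a step taken from the slides, and the names `column_classes` and `month_factor` are introduced only for illustration.

```r
# Inspect the storage class of each column in the built-in airquality data set
column_classes <- sapply(airquality, class)
print(column_classes)

# Month is stored as a number; a factor makes its categorical role explicit.
# month_factor is a hypothetical name used only in this sketch.
month_factor <- factor(airquality$Month, levels = 5:9, labels = month.name[5:9])

# Count the observations in each month
print(table(month_factor))
```

Once Month is a factor, functions such as summary() report counts per category instead of a mean and quartiles.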

Measures of Central Tendency

A measure of central tendency is a typical or central value of a data set or probability distribution. These measures are often called averages. The most common examples in statistics are the mean, the median, and the mode. The easiest way to view the mean and median in R is the summary() function, which also reports the minimum, maximum, and quartiles of each variable:

summary(airquality)
##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
## 

Measures of Central Tendency (Cont.)

The arithmetic mean can also be calculated directly by adding all the values and then dividing by the total number of values. This can be written as the following formula, where X is the arithmetic mean, x_1 through x_n are the individual values, and n is the total number of values:

\[ X = \frac{1}{n} \sum_{i=1}^n x_i = \frac{x_1 + x_2 + \ldots + x_n}{n} \]

In R, the mean() function performs this calculation. The statistical mode, however, has no built-in function in R (the built-in mode() returns an object's storage type instead), so one must write a function to find the most commonly occurring value.
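A minimal sketch of these calculations follows. The helper name `stat_mode` is hypothetical, and returning every tied value is one possible design choice among several, not something taken from the slides.

```r
# Mean and median of Ozone; na.rm = TRUE drops the missing values first
ozone_mean <- mean(airquality$Ozone, na.rm = TRUE)
ozone_median <- median(airquality$Ozone, na.rm = TRUE)

# stat_mode is a hypothetical helper: it returns the most frequent value(s)
# of x, ignoring NAs. If several values tie, all of them are returned.
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  values <- unique(x)
  # match() maps each element to its position in values; tabulate() counts them
  counts <- tabulate(match(x, values))
  values[counts == max(counts)]
}

# Most frequently recorded temperature(s)
temp_mode <- stat_mode(airquality$Temp)
print(temp_mode)
```

Because the helper works on unique values rather than on table() names, it keeps the type of its input, so it can also be applied to character vectors.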

Measures of Dispersion

Measures of dispersion describe the spread of the data. They include the range, interquartile range, standard deviation and variance. The following functions can be used to calculate measures of dispersion:

range(airquality$Ozone, na.rm = TRUE)   # Range (minimum and maximum)
## [1]   1 168
sd(airquality$Ozone, na.rm = TRUE)      # Standard deviation
## [1] 32.98788
var(airquality$Ozone, na.rm = TRUE)     # Variance
## [1] 1088.201
IQR(airquality$Ozone, na.rm = TRUE)     # Interquartile range
## [1] 45.25
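These measures are related to one another, which the following sketch checks on the same Ozone column. The variable names are introduced here only for illustration.

```r
ozone <- airquality$Ozone

# range() returns the minimum and maximum; diff() turns them into one width
ozone_width <- diff(range(ozone, na.rm = TRUE))

# The variance is the square of the standard deviation
ozone_var <- sd(ozone, na.rm = TRUE)^2

# The interquartile range is the third quartile minus the first quartile
ozone_quartiles <- quantile(ozone, probs = c(0.25, 0.75), na.rm = TRUE)
ozone_iqr <- unname(diff(ozone_quartiles))

# Compare each hand-computed value with the built-in function
print(all.equal(ozone_var, var(ozone, na.rm = TRUE)))
print(all.equal(ozone_iqr, IQR(ozone, na.rm = TRUE)))
```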

Histograms

One way to visualize data is a histogram. A histogram groups data points into user-specified ranges, called bins. The horizontal axis shows the values of the variable (for example, minutes, years, or ages), and the vertical axis shows how many observations fall into each bin. For example, the next slide shows a histogram of the temperatures from the “airquality” data set, produced with the following R code.

library(ggplot2)

# Histogram of Temp with bins 5 degrees wide
ggplot(airquality, aes(x = Temp)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Temperature in Airquality Dataset",
       x = "Temperature (Fahrenheit)",
       y = "Frequency") +
  theme_minimal()
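To see the counts behind the bars, the binning can be sketched in base R. The bin edges below are an assumption chosen for illustration and need not match the edges ggplot2 picks for binwidth = 5, so individual counts may differ from the plotted bars.

```r
# Bin edges every 5 degrees; these edges are an illustrative assumption
bin_edges <- seq(55, 100, by = 5)

# cut() assigns each temperature to a bin such as (55,60]
temp_bins <- cut(airquality$Temp, breaks = bin_edges)

# table() counts how many observations fall into each bin
bin_counts <- table(temp_bins)
print(bin_counts)
```

Each count corresponds to the height of one bar in a histogram drawn with the same bin edges.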

Histograms (Cont.)

Box Plot

Another type of graph is a box plot. In descriptive statistics, a box plot (or boxplot) is a method for graphically showing the locality, spread, and skewness of groups of numerical data through their quartiles. The following plot shows the temperatures by month.
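The code behind this plot is not shown on the slide. A minimal sketch, assuming ggplot2 and the same styling as the histogram, could look like the following. The title, the labels, and the names `temp_by_month` and `temp_quartiles` are assumptions.

```r
library(ggplot2)

# Treat Month as a factor so that each month gets its own box
temp_by_month <- ggplot(airquality, aes(x = factor(Month), y = Temp)) +
  geom_boxplot(fill = "skyblue", color = "black") +
  labs(title = "Temperature by Month in Airquality Dataset",
       x = "Month",
       y = "Temperature (Fahrenheit)") +
  theme_minimal()
print(temp_by_month)

# Minimum, quartiles, and maximum of Temp for each month (one column per month)
temp_quartiles <- sapply(split(airquality$Temp, airquality$Month), quantile)
print(temp_quartiles)
```

The 25%, 50%, and 75% rows of the quartile table correspond to the bottom edge, middle line, and top edge of each box.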

Scatter Plot

A scatter plot is a type of graph in which the values of two variables are plotted along the x and y axes. Patterns in the points can reveal a correlation between the variables.

## Warning: Ignoring 42 observations
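The plotting code and the variables used are not shown on the slide. The sketch below assumes Ozone plotted against Solar.R with ggplot2, a choice that appears consistent with the warning above because it counts the rows in which either variable is missing. All names introduced here are for illustration only.

```r
library(ggplot2)

# Rows where Ozone or Solar.R is missing cannot be drawn as points
is_complete <- complete.cases(airquality[, c("Ozone", "Solar.R")])
incomplete_rows <- sum(!is_complete)
print(incomplete_rows)

# Keep only complete rows so the plot and the correlation use the same data
complete_air <- airquality[is_complete, ]

ozone_vs_solar <- ggplot(complete_air, aes(x = Solar.R, y = Ozone)) +
  geom_point(color = "black") +
  labs(title = "Ozone versus Solar Radiation in Airquality Dataset",
       x = "Solar.R",
       y = "Ozone") +
  theme_minimal()
print(ozone_vs_solar)

# Pearson correlation coefficient between the two variables
ozone_solar_cor <- cor(complete_air$Solar.R, complete_air$Ozone)
print(ozone_solar_cor)
```

A coefficient near 1 or -1 would indicate a strong linear relationship, while a value near 0 would indicate a weak one.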

Conclusion

In summary, understanding descriptive statistics is essential to anyone wanting to become a data scientist, since it is a building block for valuable analysis and meaningful interpretation of data. Through data visualization, patterns between multiple variables become apparent. Additionally, measures of central tendency and dispersion help us identify typical ranges and recognize outliers, giving us more insight into the data.