library(dslabs)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggplot2)
library(psych)
##
## Attaching package: 'psych'
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(stats)
• A way to summarize or describe a set of data.
• It isn’t easy to make sense of data by looking at sets of numbers, but with descriptive statistics and graphs we can get a “sense” of the data.

Measures of central tendency:

• The mean
• The median
• Trimmed mean
• Mode

Measures of spread:

• Range
• IQR
• MAD (mean or median absolute deviation)
• Variance
• Standard deviation
data(murders)
head(murders)
## state abb region population total
## 1 Alabama AL South 4779736 135
## 2 Alaska AK West 710231 19
## 3 Arizona AZ West 6392017 232
## 4 Arkansas AR South 2915918 93
## 5 California CA West 37253956 1257
## 6 Colorado CO West 5029196 65
?murders
## starting httpd help server ... done
The murder rate is usually measured as murders per 100,000 people. Store it in the dataframe in a column named rate. The pipe operator (%>%) comes from the tidyverse.
murders_df <- murders %>%
mutate(rate = (total/population)*100000)
head(murders_df)
## state abb region population total rate
## 1 Alabama AL South 4779736 135 2.824424
## 2 Alaska AK West 710231 19 2.675186
## 3 Arizona AZ West 6392017 232 3.629527
## 4 Arkansas AR South 2915918 93 3.189390
## 5 California CA West 37253956 1257 3.374138
## 6 Colorado CO West 5029196 65 1.292453
Let’s calculate the mean and median of rate:
murders_rate_mean <- mean(murders_df$rate)
murders_rate_mean
## [1] 2.779125
median(murders_df$rate)
## [1] 2.687123
The $ means “of that”: use df$column to pull a column out of a dataframe. class() and str() tell you the type of data. str() is really useful: it stands for “structure” and gives the length and other information.
class(murders_df)
## [1] "data.frame"
str(murders_df)
## 'data.frame': 51 obs. of 6 variables:
## $ state : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ abb : chr "AL" "AK" "AZ" "AR" ...
## $ region : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
## $ population: num 4779736 710231 6392017 2915918 37253956 ...
## $ total : num 135 19 232 93 1257 ...
## $ rate : num 2.82 2.68 3.63 3.19 3.37 ...
If there’s a missing value, mean() returns NA. For example, using the na_example vector:
data(na_example)
mean(na_example)
## [1] NA
mean(na_example, na.rm=TRUE)
## [1] 2.301754
na.rm = TRUE removes the NA values so the function can calculate a result.
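If you want to see how many values are missing before removing them, a quick sketch using is.na():

sum(is.na(na_example)) # counts the NA values in the vector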
The summary() function provides useful information for the entire dataframe:
summary(murders_df)
## state abb region population
## Length:51 Length:51 Northeast : 9 Min. : 563626
## Class :character Class :character South :17 1st Qu.: 1696962
## Mode :character Mode :character North Central:12 Median : 4339367
## West :13 Mean : 6075769
## 3rd Qu.: 6636084
## Max. :37253956
## total rate
## Min. : 2.0 Min. : 0.3196
## 1st Qu.: 24.5 1st Qu.: 1.2526
## Median : 97.0 Median : 2.6871
## Mean : 184.4 Mean : 2.7791
## 3rd Qu.: 268.0 3rd Qu.: 3.3861
## Max. :1257.0 Max. :16.4528
The median may be more descriptive for skewed data or outliers.
You can “trim” the data if you are concerned about outliers. Of course, you need a really good reason for doing so, but the code is here; trim = 0.1 drops the highest and lowest 10% of values before averaging:
mean(murders_df$rate, trim = 0.1)
## [1] 2.442242
Rarely do you need to report the mode, but you may be interested in knowing about the frequency of different values in our categorical variables.
table(murders$region)
##
## Northeast South North Central West
## 9 17 12 13
### Range

The range is the smallest score subtracted from the largest. It is very biased by outliers.
range(murders_df$rate)
## [1] 0.3196211 16.4527532
max(murders_df$rate) - min(murders_df$rate)
## [1] 16.13313
This, too, is biased by outliers.
• Quartiles: the three values that split the sorted data into four equal parts.
• Second quartile = median.
• Lower quartile = median of the lower half of the data.
• Upper quartile = median of the upper half of the data.
The quantile() function (from the stats package) provides percentiles for a dataset. For instance, if you ask for the 10th percentile, you get the value x such that 10% of the data is less than x.
quantile(murders_df$rate, probs = 0.10)
## 10%
## 0.7655102
For the IQR, you can use the 25th and 75th percentiles
quantile(murders_df$rate, probs = c(0.25, 0.75))
## 25% 75%
## 1.252645 3.386104
If you want to know the difference between them, you can subtract them.
interquartile_range_rate <- quantile(murders_df$rate, probs = 0.75) - quantile(murders_df$rate, probs = 0.25)
interquartile_range_rate
## 75%
## 2.133458
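As an aside, the stats package also has a built-in IQR() function that computes this same difference directly:

IQR(murders_df$rate) # should match the subtraction above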
Mean (or average) absolute deviation and median absolute deviation: these are rarely used.
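Even so, stats provides a built-in mad() function. Note that by default it scales the median absolute deviation by a constant (about 1.4826) so it is comparable to the standard deviation for normal data; a minimal sketch:

mad(murders_df$rate) # scaled median absolute deviation (default)
mad(murders_df$rate, constant = 1) # raw median absolute deviation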
We can calculate the spread of scores by looking at how different each score is from the center of the distribution, e.g., the mean.
deviance <- function(x, y) {
deviance <- x - y
return(deviance)
}
deviance(murders_df$rate, murders_rate_mean)
## [1] 0.04529833 -0.10393949 0.85040182 0.41026465 0.59501286 -1.48667234
## [7] -0.06515322 1.45281142 13.67362773 0.61894338 1.01119713 -2.26453346
## [13] -2.01361526 0.05783535 -0.58905240 -2.08977703 -0.57101489 -0.10592450
## [19] 4.96345557 -1.95103730 2.29574007 -0.97694637 1.39949700 -1.77986547
## [25] 1.26495912 2.58076623 -1.56628756 -1.02698825 0.33135089 -2.39932189
## [31] 0.01890646 0.47459848 -0.11116550 0.22019823 -2.18441039 -0.09200290
## [37] 0.17980854 -1.83944117 0.81862581 -1.25903219 1.69619800 -1.79654175
## [43] 0.67181020 0.42223482 -1.98314443 -2.45950439 0.34547460 -1.39613122
## [49] -1.32202413 -1.07347680 -1.89201237
total_deviance <- function(x, y) {
deviance <- x - y
total <- sum(deviance)
return(total)
}
total_deviance(murders_df$rate, murders_rate_mean)
## [1] -3.996803e-15
class_sleep_hrs <- c(6, 7.5, 7, 3, 8.5, 8)
class_sleep_mean <- mean(class_sleep_hrs)
class_sleep_mean
## [1] 6.666667
total_deviance(class_sleep_hrs, class_sleep_mean)
## [1] -1.776357e-15
Notice that the total deviance is really small. This is because some values are positive and some are negative; apart from floating-point rounding error, the total is 0. This is why we square the deviations: everything becomes positive, and because the numbers get big, larger deviations from the mean are penalized more heavily.
Indicates the total dispersion, or total deviance of scores from the mean:
sum_of_squared_errors <- function(x, y) {
deviance <- x - y
deviance_squared <- (deviance^2)
total <- sum(deviance_squared)
return(total)
}
sum_of_squared_errors(murders_df$rate, murders_rate_mean)
## [1] 301.6259
sum_of_squared_errors(class_sleep_hrs, class_sleep_mean)
## [1] 19.83333
Its size depends on the number of scores in the data.
More useful to work with the average dispersion, known as the variance:
# N - 1 (degrees of freedom)
(length(class_sleep_hrs)-1)
## [1] 5
variance <- function(vector, mean) {
deviance <- vector - mean
deviance_squared <- (deviance^2)
total <- sum(deviance_squared)
variance <- total/(length(vector) - 1)
return(variance)
}
variance(murders_df$rate, murders_rate_mean)
## [1] 6.032518
variance(class_sleep_hrs, class_sleep_mean)
## [1] 3.966667
We need to control for the number of observations, or values, we have in the dataset. This is why we divide by N - 1, where N equals the length of the vector. Why N - 1 and not just N? When you try to estimate the variance of a population, the population variance is larger than the sample’s: the sample underestimates the variance of the population. Dividing by N - 1 makes the estimate a little bit bigger, which is the method statisticians settled on to remove this sample bias.
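As a quick numeric check of the N versus N - 1 choice, here is a sketch using the class_sleep_hrs vector from above (its sum of squares, 19.83333, was computed earlier):

n <- length(class_sleep_hrs) # 6 observations
ss <- sum((class_sleep_hrs - mean(class_sleep_hrs))^2) # 19.83333, as above
ss/n # dividing by N gives a smaller, biased estimate
ss/(n - 1) # dividing by N - 1 matches var(class_sleep_hrs)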
• The variance gives us a measure in units squared. • This problem is solved by taking the square root of the variance, which is known as the standard deviation:
sd_function <- function(vector, mean) {
deviance <- vector - mean
deviance_squared <- (deviance^2)
total <- sum(deviance_squared)
variance <- total/(length(vector) - 1)
std_dev <- sqrt(variance)
return(std_dev)
}
sd_function(murders_df$rate, murders_rate_mean)
## [1] 2.456118
sd_function(class_sleep_hrs, class_sleep_mean)
## [1] 1.991649
Or, you can just use the built-in functions
var(murders_df$rate)
## [1] 6.032518
sd(murders_df$rate)
## [1] 2.456118
var(class_sleep_hrs)
## [1] 3.966667
sd(class_sleep_hrs)
## [1] 1.991649
Watch out for operator precedence when taking the square root by hand: variance^1/2 is parsed as (variance^1)/2, which is half the variance, not the square root; that is why a hand-rolled function can fail to match sd(). Use sqrt(variance) or variance^(1/2).

If you have a missing value, you’ll need to use the na.rm = TRUE argument:
var(na_example, na.rm=TRUE)
## [1] 1.49666
sd(na_example, na.rm=TRUE)
## [1] 1.22338

### Analyzing Data: Histograms

A frequency distribution (aka histogram) is a graph plotting values of observations on the horizontal axis, with a bar showing how many times each value occurred in the data set.

#### The ‘Normal’ Distribution

Notice that it is bell-shaped and symmetrical around the center. Models are based on a normal distribution.
• Skew
  • Positive (or right)
  • Negative (or left)
• Kurtosis
  • Leptokurtic
  • Platykurtic
hist(murders_df$rate)
#Or using ggplot2
murders_df %>% ggplot(aes(rate)) +
geom_histogram(binwidth= 1)
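The skew and kurtosis just described can also be quantified numerically; a minimal sketch using the skew() and kurtosi() functions from the psych package loaded above:

skew(murders_df$rate) # positive means right (positive) skew
kurtosi(murders_df$rate) # excess kurtosis: large positive values suggest a leptokurtic shape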
The psych package’s describeBy(x, group) gives descriptive statistics split by a grouping variable.
# install.packages("psych")
library(psych)
?describeBy
# descriptive statistics for the whole murders dataframe, by region
describeBy(murders, group = murders$region)
##
## Descriptive statistics by group
## group: Northeast
## vars n mean sd median trimmed mad min
## state* 1 9 5.00 2.74 5 5.00 2.97 1
## abb* 2 9 5.00 2.74 5 5.00 2.97 1
## region* 3 9 1.00 0.00 1 1.00 0.00 1
## population 4 9 6146360.00 6469174.00 3574097 6146360.00 4371232.61 625741
## total 5 9 163.22 200.19 97 163.22 136.40 2
## max range skew kurtosis se
## state* 9 8 0.00 -1.60 0.91
## abb* 9 8 0.00 -1.60 0.91
## region* 1 0 NaN NaN 0.00
## population 19378102 18752361 0.85 -0.76 2156391.33
## total 517 515 0.76 -1.24 66.73
## ------------------------------------------------------------
## group: South
## vars n mean sd median trimmed mad min
## state* 1 17 9.00 5.05 9 9.0 5.93 1
## abb* 2 17 9.00 5.05 9 9.0 5.93 1
## region* 3 17 2.00 0.00 2 2.0 0.00 2
## population 4 17 6804378.47 6516176.53 4625364 5995143.3 2551170.61 601723
## total 5 17 246.76 212.86 207 224.2 142.33 27
## max range skew kurtosis se
## state* 17 16 0.00 -1.41 1.22
## abb* 17 16 0.00 -1.41 1.22
## region* 2 0 NaN NaN 0.00
## population 25145561 24543838 1.62 1.72 1580404.95
## total 805 778 1.33 0.89 51.63
## ------------------------------------------------------------
## group: North Central
## vars n mean sd median trimmed mad min
## state* 1 12 6.50 3.61 6.5 6.5 4.45 1
## abb* 2 12 6.50 3.61 6.5 6.5 4.45 1
## region* 3 12 3.00 0.00 3.0 3.0 0.00 3
## population 4 12 5577250.08 4071915.73 5495455.5 5342377.8 4678679.37 672591
## total 5 12 152.33 154.22 80.0 141.1 99.33 4
## max range skew kurtosis se
## state* 12 11 0.00 -1.50 1.04
## abb* 12 11 0.00 -1.50 1.04
## region* 3 0 NaN NaN 0.00
## population 12830632 12158041 0.44 -1.25 1175460.82
## total 413 409 0.54 -1.58 44.52
## ------------------------------------------------------------
## group: West
## vars n mean sd median trimmed mad min
## state* 1 13 7 3.89 7 7 4.45 1
## abb* 2 13 7 3.89 7 7 4.45 1
## region* 3 13 4 0.00 4 4 0.00 4
## population 4 13 5534273 9751150.25 2700551 3102543 2536930.23 563626
## total 5 13 147 339.10 36 59 43.00 5
## max range skew kurtosis se
## state* 13 12 0.00 -1.48 1.08
## abb* 13 12 0.00 -1.48 1.08
## region* 4 0 NaN NaN 0.00
## population 37253956 36690330 2.60 5.63 2704482.48
## total 1257 1252 2.66 5.84 94.05
# descriptive statistics for total only, by region
describeBy(murders$total, group = murders$region)
##
## Descriptive statistics by group
## group: Northeast
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 9 163.22 200.19 97 163.22 136.4 2 517 515 0.76 -1.24 66.73
## ------------------------------------------------------------
## group: South
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 17 246.76 212.86 207 224.2 142.33 27 805 778 1.33 0.89
## se
## X1 51.63
## ------------------------------------------------------------
## group: North Central
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 12 152.33 154.22 80 141.1 99.33 4 413 409 0.54 -1.58 44.52
## ------------------------------------------------------------
## group: West
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 13 147 339.1 36 59 43 5 1257 1252 2.66 5.84 94.05
The by() function is more general. First you tell it what the data is, then the INDICES (the grouping variable), then the FUN (function) to apply; here the function is describe. It doesn’t work the way tidyverse functions do: tidyverse functions remember which dataset you are working in once you name it, but with by() you have to name the data again in the indices argument.
?by
by(murders, murders$region, describe)
## murders$region: Northeast
## vars n mean sd median trimmed mad min
## state* 1 9 5.00 2.74 5 5.00 2.97 1
## abb* 2 9 5.00 2.74 5 5.00 2.97 1
## region* 3 9 1.00 0.00 1 1.00 0.00 1
## population 4 9 6146360.00 6469174.00 3574097 6146360.00 4371232.61 625741
## total 5 9 163.22 200.19 97 163.22 136.40 2
## max range skew kurtosis se
## state* 9 8 0.00 -1.60 0.91
## abb* 9 8 0.00 -1.60 0.91
## region* 1 0 NaN NaN 0.00
## population 19378102 18752361 0.85 -0.76 2156391.33
## total 517 515 0.76 -1.24 66.73
## ------------------------------------------------------------
## murders$region: South
## vars n mean sd median trimmed mad min
## state* 1 17 9.00 5.05 9 9.0 5.93 1
## abb* 2 17 9.00 5.05 9 9.0 5.93 1
## region* 3 17 2.00 0.00 2 2.0 0.00 2
## population 4 17 6804378.47 6516176.53 4625364 5995143.3 2551170.61 601723
## total 5 17 246.76 212.86 207 224.2 142.33 27
## max range skew kurtosis se
## state* 17 16 0.00 -1.41 1.22
## abb* 17 16 0.00 -1.41 1.22
## region* 2 0 NaN NaN 0.00
## population 25145561 24543838 1.62 1.72 1580404.95
## total 805 778 1.33 0.89 51.63
## ------------------------------------------------------------
## murders$region: North Central
## vars n mean sd median trimmed mad min
## state* 1 12 6.50 3.61 6.5 6.5 4.45 1
## abb* 2 12 6.50 3.61 6.5 6.5 4.45 1
## region* 3 12 3.00 0.00 3.0 3.0 0.00 3
## population 4 12 5577250.08 4071915.73 5495455.5 5342377.8 4678679.37 672591
## total 5 12 152.33 154.22 80.0 141.1 99.33 4
## max range skew kurtosis se
## state* 12 11 0.00 -1.50 1.04
## abb* 12 11 0.00 -1.50 1.04
## region* 3 0 NaN NaN 0.00
## population 12830632 12158041 0.44 -1.25 1175460.82
## total 413 409 0.54 -1.58 44.52
## ------------------------------------------------------------
## murders$region: West
## vars n mean sd median trimmed mad min
## state* 1 13 7 3.89 7 7 4.45 1
## abb* 2 13 7 3.89 7 7 4.45 1
## region* 3 13 4 0.00 4 4 0.00 4
## population 4 13 5534273 9751150.25 2700551 3102543 2536930.23 563626
## total 5 13 147 339.10 36 59 43.00 5
## max range skew kurtosis se
## state* 13 12 0.00 -1.48 1.08
## abb* 13 12 0.00 -1.48 1.08
## region* 4 0 NaN NaN 0.00
## population 37253956 36690330 2.60 5.63 2704482.48
## total 1257 1252 2.66 5.84 94.05
by(murders$total, murders$region, mean)
## murders$region: Northeast
## [1] 163.2222
## ------------------------------------------------------------
## murders$region: South
## [1] 246.7647
## ------------------------------------------------------------
## murders$region: North Central
## [1] 152.3333
## ------------------------------------------------------------
## murders$region: West
## [1] 147
aggregate() works similarly to the above, but it allows grouping by more than one category (an example with two grouping variables appears after the sd() call below). It takes a formula: the dependent (outcome) variable on the left, then the grouping variable(s), usually factors, on the right. This function does recognize that the variables are in the murders dataframe, so you pass the data once. Last, put in the function you want it to run; sometimes you need to put the function name in quotes.
?aggregate
aggregate(total ~ region, murders, mean)
## region total
## 1 Northeast 163.2222
## 2 South 246.7647
## 3 North Central 152.3333
## 4 West 147.0000
#total by region, what kind of data, what function
aggregate(total ~ region, murders, sd)
## region total
## 1 Northeast 200.1935
## 2 South 212.8622
## 3 North Central 154.2244
## 4 West 339.1015
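As noted above, aggregate() can group by more than one category. A sketch using a hypothetical pop_half indicator (a helper column not in the original data) as a second grouping variable:

murders2 <- murders
# hypothetical helper: split states at the median population
murders2$pop_half <- ifelse(murders2$population > median(murders2$population),
                            "above median", "below median")
aggregate(total ~ region + pop_half, murders2, mean)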
### Z-scores

Standardizing a score with respect to the other scores in the group tells you how far a value is from the mean in terms of standard deviations. In a perfect bell curve, the mean, median, and mode are equal. The entire area under the curve sums to 1, or 100%; everything below the middle point is 0.5, or 50%.

A z-score expresses a score in terms of how many sd it is away from the mean. It is often spoken about in percentiles, like when you go to a doctor and they tell you that you are in the 35th percentile: that is below the mean.
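For instance, a quick check that the 35th percentile really is below the mean: its z-score, from qnorm(), is negative:

qnorm(0.35) # about -0.39, i.e., below the mean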
If the mean = 39.68 and sd = 7.74, where does x~i~ = 30 fall?

Draw a picture and ask whether the answer should be < or > 0.50. After you have a prediction, you can use it to check the logic of the mathematical result. The z-score is (x~i~ - mean)/sd, and you know it should fall to the left of the mean.

You can also use the pnorm() function, which gives the probability (area) under the curve to the left of a value: pnorm(x~i~, mean, sd).
(30-39.68)/7.74
## [1] -1.250646
pnorm(-1.25)
## [1] 0.1056498
pnorm(30, 39.68, 7.74)
## [1] 0.1055318
The difference is because of rounding. With the middle of the bell curve at 0, everything below it is a negative sd value and everything above is positive. Some values come up over and over again and are worth remembering. 95% is a value we care a lot about in statistics. If you are interested in the central area that corresponds to 95%, centered around zero, there is 5% left for the two tails together, or 2.5% (0.025) in each tail. The z-scores corresponding to 0.025 are 1.96 and -1.96. If you check with pnorm() in R, pnorm(1.96) - pnorm(-1.96) is close to 0.95:
pnorm(1.96) - pnorm(-1.96)
## [1] 0.9500042
1 - (pnorm(1.96) - pnorm(-1.96))
## [1] 0.04999579
The first result is the central 95% area under the bell curve; the second result is the two tails combined.

1.96 cuts off the top 2.5% of the distribution; -1.96 cuts off the bottom 2.5% of the distribution.
pnorm(1.96)
## [1] 0.9750021
This is the area to the left of 1.96, i.e., everything except the top 2.5% tail.
pnorm(-1.96)
## [1] 0.0249979
This is the area under the left tail.
As such, 95% of z-scores lie between -1.96 and 1.96, 99% lie between -2.58 and 2.58, and 99.9% lie between -3.29 and 3.29.
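A quick check of the 99% and 99.9% ranges in R:

pnorm(2.58) - pnorm(-2.58) # about 0.99
pnorm(3.29) - pnorm(-3.29) # about 0.999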
The qnorm() function does the inverse of pnorm(): with pnorm() you give the z-score and it tells you the area under the curve, while with qnorm() you give the area and it returns the z-score.
qnorm(0.005)
## [1] -2.575829
The distribution is symmetric, with 0.5% in each tail. Round that to -2.58: 99% of the data is between -2.58 and 2.58.
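The same idea recovers the familiar 95% cutoffs:

qnorm(0.025) # about -1.96, cutting off the bottom 2.5%
qnorm(0.975) # about 1.96, cutting off the top 2.5%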
Within 1 sd there is about 68% of the data. Within 2 sd there is about 95% of the data, and the z-score is +/- 1.96, so within 2 sd you should have most of the data. Within 3 sd there is 99.7% of the data; only 0.3% should fall outside that range. Remember, this all assumes the data are “normal”.
To calculate the area under the curve within 1 sd (about 68.2% of the data) with R, ask: what percentage of the data is between -1 and 1? pnorm(1) - pnorm(-1) gives the area under the curve between -1 and 1: pnorm() returns everything below a value, so the subtraction leaves the area between the tails. Do the same for 2 sd; it should be about 95%.
pnorm(1) - pnorm(-1)
## [1] 0.6826895
pnorm(2) - pnorm(-2)
## [1] 0.9544997
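And within 3 sd, which should be about 99.7%:

pnorm(3) - pnorm(-3)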
#1 before
(50-39.68)/7.74
## [1] 1.333333
pnorm(1.33333)
## [1] 0.9087882
post_before <- pnorm(50, 39.68, 7.74)
#2 after
(70-39.68)/7.74
## [1] 3.917313
1 - pnorm(3.917313)
## [1] 4.477073e-05
pnorm(-3.917313)
## [1] 4.477073e-05
post_after <- 1 - pnorm(70, 39.68, 7.74)
post_after
## [1] 4.477079e-05
pnorm() gives the area to the left of the value. If you want to know what is to the right of the value (the “after”), you have to compute 1 - pnorm(). The likelihood that someone posted after 70 days is very small.
#3
pnorm(50, 39.68, 7.74) - (1- pnorm(70, 39.68, 7.74))
## [1] 0.908744
pnorm(1.3333) - pnorm(-3.917313)
## [1] 0.9087385
post_before - post_after
## [1] 0.908744
#4
qnorm(.25)
## [1] -0.6744898
qnorm(.25, 39.68, 7.74)
## [1] 34.45945
The answer to #4 is approximately 34.5 days: the first quarter of people posted within the first 34.5 days.
You could also have solved the equation by hand. The z-score formula, z = (x~i~ - mean)/sd, rearranges to x~i~ = (z * sd) + mean:
(-0.67 * 7.74) + 39.68
## [1] 34.4942
x~i~ = 34.5 days
outcome~i~ = (model) + error~i~
• We use a statistical model to represent what is happening in the real world.

#### The mean is a hypothetical value

• It doesn’t have to be a value that actually exists in the data set.
• The mean is a simple statistical model.
The mean is a model of what happens in the real world: the typical score. It is not a perfect representation of the data. How can we assess how well the mean represents reality?
outcome~i~ = (b) + error~i~
• A deviation is the difference between the mean and an actual data point.
• Deviations can be calculated by taking each score and subtracting the mean from it.
• The sum of the deviations equals 0.
• Therefore, we square each deviation.
• If we add these squared deviations we get the Sum of Squared Errors (SS).

What is the problem with SS?

• We calculate the average variability by dividing by the number of scores (N) minus 1.
• This value is called the variance (s^2^) or mean squared error.
• The variance has one problem: it is measured in units squared.
• This isn’t a very meaningful metric, so we take the square root.
• This is the standard deviation (s).
• The Sum of Squares, Variance, and Standard Deviation all represent the same thing:
  • The ‘fit’ of the mean to the data
  • The variability in the data
  • How well the mean represents the observed data
  • Error