library(dslabs)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggplot2)
library(psych)
##
## Attaching package: 'psych'
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(stats)
• A way to summarize or describe a set of data.
• It isn’t easy to make sense of data by looking at sets of numbers, but with descriptive statistics and graphs we can get a “sense” of the data.

Measures of central tendency:

• The mean
• The median
• Trimmed mean
• Mode

Measures of spread:

• Range
• IQR
• MAD (mean or median absolute deviation)
• Variance
• Standard deviation
data(murders)
head(murders)
## state abb region population total
## 1 Alabama AL South 4779736 135
## 2 Alaska AK West 710231 19
## 3 Arizona AZ West 6392017 232
## 4 Arkansas AR South 2915918 93
## 5 California CA West 37253956 1257
## 6 Colorado CO West 5029196 65
?murders
## starting httpd help server ... done
The murder rate is usually measured as murders per 100,000 people. Store it in the dataframe in a column named rate. The pipe operator (%>%) comes from the tidyverse.
murders_df <- murders %>%
mutate(rate = (total/population)*100000)
head(murders_df)
## state abb region population total rate
## 1 Alabama AL South 4779736 135 2.824424
## 2 Alaska AK West 710231 19 2.675186
## 3 Arizona AZ West 6392017 232 3.629527
## 4 Arkansas AR South 2915918 93 3.189390
## 5 California CA West 37253956 1257 3.374138
## 6 Colorado CO West 5029196 65 1.292453
Let’s calculate the mean and median of rate:
murders_rate_mean <- mean(murders_df$rate)
murders_rate_mean
## [1] 2.779125
median(murders_df$rate)
## [1] 2.687123
The $ means “of that”: use df$column to pull a column out of a dataframe. class() and str() tell you the type of data. str() is really useful: it stands for “structure” and gives the length and other information.
class(murders_df)
## [1] "data.frame"
str(murders_df)
## 'data.frame': 51 obs. of 6 variables:
## $ state : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ abb : chr "AL" "AK" "AZ" "AR" ...
## $ region : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
## $ population: num 4779736 710231 6392017 2915918 37253956 ...
## $ total : num 135 19 232 93 1257 ...
## $ rate : num 2.82 2.68 3.63 3.19 3.37 ...
If there’s a missing value, mean() returns NA. For example, using the na_example vector:
data(na_example)
mean(na_example)
## [1] NA
mean(na_example, na.rm=TRUE)
## [1] 2.301754
na.rm = TRUE removes the NA values so the function can calculate a result.
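If you want to see how many values are missing before removing them, a quick sketch using is.na():

sum(is.na(na_example)) # counts the NA values in the vector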
The summary() function provides useful information for the entire dataframe:
summary(murders_df)
## state abb region population
## Length:51 Length:51 Northeast : 9 Min. : 563626
## Class :character Class :character South :17 1st Qu.: 1696962
## Mode :character Mode :character North Central:12 Median : 4339367
## West :13 Mean : 6075769
## 3rd Qu.: 6636084
## Max. :37253956
## total rate
## Min. : 2.0 Min. : 0.3196
## 1st Qu.: 24.5 1st Qu.: 1.2526
## Median : 97.0 Median : 2.6871
## Mean : 184.4 Mean : 2.7791
## 3rd Qu.: 268.0 3rd Qu.: 3.3861
## Max. :1257.0 Max. :16.4528
The median may be more descriptive for skewed data or outliers.
You can “trim” the data if you are concerned about outliers. Of course, you need a really good reason for doing so, but the code is here; trim = 0.1 drops the highest and lowest 10% of values before averaging:
mean(murders_df$rate, trim = 0.1)
## [1] 2.442242
Rarely do you need to report the mode, but you may be interested in knowing about the frequency of different values in our categorical variables.
table(murders$region)
##
## Northeast South North Central West
## 9 17 12 13
### Range

The range is the smallest score subtracted from the largest. It is very biased by outliers.
range(murders_df$rate)
## [1] 0.3196211 16.4527532
max(murders_df$rate) - min(murders_df$rate)
## [1] 16.13313
This, too, is biased by outliers.
• Quartiles: the three values that split the sorted data into four equal parts.
• Second quartile = median.
• Lower quartile = median of the lower half of the data.
• Upper quartile = median of the upper half of the data.
The quantile() function (from the stats package) provides percentiles for a dataset. For instance, if you ask for the 10th percentile, you get the value x such that 10% of the data is less than x.
quantile(murders_df$rate, probs = 0.10)
## 10%
## 0.7655102
For the IQR, you can use the 25th and 75th percentiles
quantile(murders_df$rate, probs = c(0.25, 0.75))
## 25% 75%
## 1.252645 3.386104
If you want to know the difference between them, you can subtract them.
interquartile_range_rate <- quantile(murders_df$rate, probs = 0.75) - quantile(murders_df$rate, probs = 0.25)
interquartile_range_rate
## 75%
## 2.133458
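As an aside, the stats package also has a built-in IQR() function that computes this same difference directly:

IQR(murders_df$rate) # should match the subtraction above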
Mean (or average) absolute deviation and median absolute deviation: these are rarely used.
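Even so, stats provides a built-in mad() function. Note that by default it scales the median absolute deviation by a constant (about 1.4826) so it is comparable to the standard deviation for normal data; a minimal sketch:

mad(murders_df$rate) # scaled median absolute deviation (default)
mad(murders_df$rate, constant = 1) # raw median absolute deviation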
We can calculate the spread of scores by looking at how different each score is from the center of the distribution, e.g., the mean.
deviance <- function(x, y) {
deviance <- x - y
return(deviance)
}
deviance(murders_df$rate, murders_rate_mean)
## [1] 0.04529833 -0.10393949 0.85040182 0.41026465 0.59501286 -1.48667234
## [7] -0.06515322 1.45281142 13.67362773 0.61894338 1.01119713 -2.26453346
## [13] -2.01361526 0.05783535 -0.58905240 -2.08977703 -0.57101489 -0.10592450
## [19] 4.96345557 -1.95103730 2.29574007 -0.97694637 1.39949700 -1.77986547
## [25] 1.26495912 2.58076623 -1.56628756 -1.02698825 0.33135089 -2.39932189
## [31] 0.01890646 0.47459848 -0.11116550 0.22019823 -2.18441039 -0.09200290
## [37] 0.17980854 -1.83944117 0.81862581 -1.25903219 1.69619800 -1.79654175
## [43] 0.67181020 0.42223482 -1.98314443 -2.45950439 0.34547460 -1.39613122
## [49] -1.32202413 -1.07347680 -1.89201237
total_deviance <- function(x, y) {
deviance <- x - y
total <- sum(deviance)
return(total)
}
total_deviance(murders_df$rate, murders_rate_mean)
## [1] -3.996803e-15
class_sleep_hrs <- c(6, 7.5, 7, 3, 8.5, 8)
class_sleep_mean <- mean(class_sleep_hrs)
class_sleep_mean
## [1] 6.666667
total_deviance(class_sleep_hrs, class_sleep_mean)
## [1] -1.776357e-15
Notice that the total deviance is really small. This is because some values are positive and some are negative; apart from floating-point rounding error, the total is 0. This is why we square the deviations: everything becomes positive, and because the numbers get big, larger deviations from the mean are penalized more heavily.
Indicates the total dispersion, or total deviance of scores from the mean:
sum_of_squared_errors <- function(x, y) {
deviance <- x - y
deviance_squared <- (deviance^2)
total <- sum(deviance_squared)
return(total)
}
sum_of_squared_errors(murders_df$rate, murders_rate_mean)
## [1] 301.6259
sum_of_squared_errors(class_sleep_hrs, class_sleep_mean)
## [1] 19.83333
Its size depends on the number of scores in the data.
More useful to work with the average dispersion, known as the variance:
# N - 1 (degrees of freedom)
(length(class_sleep_hrs)-1)
## [1] 5
variance <- function(vector, mean) {
deviance <- vector - mean
deviance_squared <- (deviance^2)
total <- sum(deviance_squared)
variance <- total/(length(vector) - 1)
return(variance)
}
variance(murders_df$rate, murders_rate_mean)
## [1] 6.032518
variance(class_sleep_hrs, class_sleep_mean)
## [1] 3.966667
We need to control for the number of observations, or values, we have in the dataset. This is why we divide by N - 1, where N equals the length of the vector. Why N - 1 and not just N? When you try to estimate the variance of a population, the population variance is larger than the sample’s: the sample underestimates the variance of the population. Dividing by N - 1 makes the estimate a little bit bigger, which is the method statisticians settled on to remove this sample bias.
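As a quick numeric check of the N versus N - 1 choice, here is a sketch using the class_sleep_hrs vector from above (its sum of squares, 19.83333, was computed earlier):

n <- length(class_sleep_hrs) # 6 observations
ss <- sum((class_sleep_hrs - mean(class_sleep_hrs))^2) # 19.83333, as above
ss/n # dividing by N gives a smaller, biased estimate
ss/(n - 1) # dividing by N - 1 matches var(class_sleep_hrs)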
• The variance gives us a measure in units squared. • This problem is solved by taking the square root of the variance, which is known as the standard deviation:
sd_function <- function(vector, mean) {
deviance <- vector - mean
deviance_squared <- (deviance^2)
total <- sum(deviance_squared)
variance <- total/(length(vector) - 1)
std_dev <- sqrt(variance)
return(std_dev)
}
sd_function(murders_df$rate, murders_rate_mean)
## [1] 2.456118
sd_function(class_sleep_hrs, class_sleep_mean)
## [1] 1.991649
Or, you can just use the built-in functions
var(murders_df$rate)
## [1] 6.032518
sd(murders_df$rate)
## [1] 2.456118
var(class_sleep_hrs)
## [1] 3.966667
sd(class_sleep_hrs)
## [1] 1.991649
Watch out for operator precedence when taking the square root by hand: variance^1/2 is parsed as (variance^1)/2, which is half the variance, not the square root; that is why a hand-rolled function can fail to match sd(). Use sqrt(variance) or variance^(1/2).

If you have a missing value, you’ll need to use the na.rm = TRUE argument:
var(na_example, na.rm=TRUE)
## [1] 1.49666
sd(na_example, na.rm=TRUE)
## [1] 1.22338

### Analyzing Data: Histograms

A frequency distribution (aka histogram) is a graph plotting values of observations on the horizontal axis, with a bar showing how many times each value occurred in the data set.

#### The ‘Normal’ Distribution

Notice that it is bell-shaped and symmetrical around the center. Models are based on a normal distribution.
• Skew
  • Positive (or right)
  • Negative (or left)
• Kurtosis
  • Leptokurtic
  • Platykurtic
hist(murders_df$rate)
#Or using ggplot2
murders_df %>% ggplot(aes(rate)) +
geom_histogram(binwidth= 1)
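The skew and kurtosis just described can also be quantified numerically; a minimal sketch using the skew() and kurtosi() functions from the psych package loaded above:

skew(murders_df$rate) # positive means right (positive) skew
kurtosi(murders_df$rate) # excess kurtosis: large positive values suggest a leptokurtic shape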
The psych package’s describeBy(x, group) gives descriptive statistics split by a grouping variable.
# install.packages("psych")
library(psych)
?describeBy
# descriptive statistics for the whole murders dataframe, by region
describeBy(murders, group = murders$region)
##
## Descriptive statistics by group
## group: Northeast
## vars n mean sd median trimmed mad min
## state* 1 9 5.00 2.74 5 5.00 2.97 1
## abb* 2 9 5.00 2.74 5 5.00 2.97 1
## region* 3 9 1.00 0.00 1 1.00 0.00 1
## population 4 9 6146360.00 6469174.00 3574097 6146360.00 4371232.61 625741
## total 5 9 163.22 200.19 97 163.22 136.40 2
## max range skew kurtosis se
## state* 9 8 0.00 -1.60 0.91
## abb* 9 8 0.00 -1.60 0.91
## region* 1 0 NaN NaN 0.00
## population 19378102 18752361 0.85 -0.76 2156391.33
## total 517 515 0.76 -1.24 66.73
## ------------------------------------------------------------
## group: South
## vars n mean sd median trimmed mad min
## state* 1 17 9.00 5.05 9 9.0 5.93 1
## abb* 2 17 9.00 5.05 9 9.0 5.93 1
## region* 3 17 2.00 0.00 2 2.0 0.00 2
## population 4 17 6804378.47 6516176.53 4625364 5995143.3 2551170.61 601723
## total 5 17 246.76 212.86 207 224.2 142.33 27
## max range skew kurtosis se
## state* 17 16 0.00 -1.41 1.22
## abb* 17 16 0.00 -1.41 1.22
## region* 2 0 NaN NaN 0.00
## population 25145561 24543838 1.62 1.72 1580404.95
## total 805 778 1.33 0.89 51.63
## ------------------------------------------------------------
## group: North Central
## vars n mean sd median trimmed mad min
## state* 1 12 6.50 3.61 6.5 6.5 4.45 1
## abb* 2 12 6.50 3.61 6.5 6.5 4.45 1
## region* 3 12 3.00 0.00 3.0 3.0 0.00 3
## population 4 12 5577250.08 4071915.73 5495455.5 5342377.8 4678679.37 672591
## total 5 12 152.33 154.22 80.0 141.1 99.33 4
## max range skew kurtosis se
## state* 12 11 0.00 -1.50 1.04
## abb* 12 11 0.00 -1.50 1.04
## region* 3 0 NaN NaN 0.00
## population 12830632 12158041 0.44 -1.25 1175460.82
## total 413 409 0.54 -1.58 44.52
## ------------------------------------------------------------
## group: West
## vars n mean sd median trimmed mad min
## state* 1 13 7 3.89 7 7 4.45 1
## abb* 2 13 7 3.89 7 7 4.45 1
## region* 3 13 4 0.00 4 4 0.00 4
## population 4 13 5534273 9751150.25 2700551 3102543 2536930.23 563626
## total 5 13 147 339.10 36 59 43.00 5
## max range skew kurtosis se
## state* 13 12 0.00 -1.48 1.08
## abb* 13 12 0.00 -1.48 1.08
## region* 4 0 NaN NaN 0.00
## population 37253956 36690330 2.60 5.63 2704482.48
## total 1257 1252 2.66 5.84 94.05
# descriptive statistics for total only, by region
describeBy(murders$total, group = murders$region)
##
## Descriptive statistics by group
## group: Northeast
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 9 163.22 200.19 97 163.22 136.4 2 517 515 0.76 -1.24 66.73
## ------------------------------------------------------------
## group: South
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 17 246.76 212.86 207 224.2 142.33 27 805 778 1.33 0.89
## se
## X1 51.63
## ------------------------------------------------------------
## group: North Central
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 12 152.33 154.22 80 141.1 99.33 4 413 409 0.54 -1.58 44.52
## ------------------------------------------------------------
## group: West
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 13 147 339.1 36 59 43 5 1257 1252 2.66 5.84 94.05
The by() function is more general. First you tell it what the data is, then the INDICES (the grouping variable), then the FUN (function) to apply; here the function is describe. It doesn’t work the way tidyverse functions do: tidyverse functions remember which dataset you are working in once you name it, but with by() you have to name the data again in the indices argument.
?by
by(murders, murders$region, describe)
## murders$region: Northeast
## vars n mean sd median trimmed mad min
## state* 1 9 5.00 2.74 5 5.00 2.97 1
## abb* 2 9 5.00 2.74 5 5.00 2.97 1
## region* 3 9 1.00 0.00 1 1.00 0.00 1
## population 4 9 6146360.00 6469174.00 3574097 6146360.00 4371232.61 625741
## total 5 9 163.22 200.19 97 163.22 136.40 2
## max range skew kurtosis se
## state* 9 8 0.00 -1.60 0.91
## abb* 9 8 0.00 -1.60 0.91
## region* 1 0 NaN NaN 0.00
## population 19378102 18752361 0.85 -0.76 2156391.33
## total 517 515 0.76 -1.24 66.73
## ------------------------------------------------------------
## murders$region: South
## vars n mean sd median trimmed mad min
## state* 1 17 9.00 5.05 9 9.0 5.93 1
## abb* 2 17 9.00 5.05 9 9.0 5.93 1
## region* 3 17 2.00 0.00 2 2.0 0.00 2
## population 4 17 6804378.47 6516176.53 4625364 5995143.3 2551170.61 601723
## total 5 17 246.76 212.86 207 224.2 142.33 27
## max range skew kurtosis se
## state* 17 16 0.00 -1.41 1.22
## abb* 17 16 0.00 -1.41 1.22
## region* 2 0 NaN NaN 0.00
## population 25145561 24543838 1.62 1.72 1580404.95
## total 805 778 1.33 0.89 51.63
## ------------------------------------------------------------
## murders$region: North Central
## vars n mean sd median trimmed mad min
## state* 1 12 6.50 3.61 6.5 6.5 4.45 1
## abb* 2 12 6.50 3.61 6.5 6.5 4.45 1
## region* 3 12 3.00 0.00 3.0 3.0 0.00 3
## population 4 12 5577250.08 4071915.73 5495455.5 5342377.8 4678679.37 672591
## total 5 12 152.33 154.22 80.0 141.1 99.33 4
## max range skew kurtosis se
## state* 12 11 0.00 -1.50 1.04
## abb* 12 11 0.00 -1.50 1.04
## region* 3 0 NaN NaN 0.00
## population 12830632 12158041 0.44 -1.25 1175460.82
## total 413 409 0.54 -1.58 44.52
## ------------------------------------------------------------
## murders$region: West
## vars n mean sd median trimmed mad min
## state* 1 13 7 3.89 7 7 4.45 1
## abb* 2 13 7 3.89 7 7 4.45 1
## region* 3 13 4 0.00 4 4 0.00 4
## population 4 13 5534273 9751150.25 2700551 3102543 2536930.23 563626
## total 5 13 147 339.10 36 59 43.00 5
## max range skew kurtosis se
## state* 13 12 0.00 -1.48 1.08
## abb* 13 12 0.00 -1.48 1.08
## region* 4 0 NaN NaN 0.00
## population 37253956 36690330 2.60 5.63 2704482.48
## total 1257 1252 2.66 5.84 94.05
by(murders$total, murders$region, mean)
## murders$region: Northeast
## [1] 163.2222
## ------------------------------------------------------------
## murders$region: South
## [1] 246.7647
## ------------------------------------------------------------
## murders$region: North Central
## [1] 152.3333
## ------------------------------------------------------------
## murders$region: West
## [1] 147
aggregate() works similarly to the above, but it allows grouping by more than one category (an example with two grouping variables appears after the sd() call below). It takes a formula: the dependent (outcome) variable on the left, then the grouping variable(s), usually factors, on the right. This function does recognize that the variables are in the murders dataframe, so you pass the data once. Last, put in the function you want it to run; sometimes you need to put the function name in quotes.
?aggregate
aggregate(total ~ region, murders, mean)
## region total
## 1 Northeast 163.2222
## 2 South 246.7647
## 3 North Central 152.3333
## 4 West 147.0000
#total by region, what kind of data, what function
aggregate(total ~ region, murders, sd)
## region total
## 1 Northeast 200.1935
## 2 South 212.8622
## 3 North Central 154.2244
## 4 West 339.1015
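As noted above, aggregate() can group by more than one category. A sketch using a hypothetical pop_half indicator (a helper column not in the original data) as a second grouping variable:

murders2 <- murders
# hypothetical helper: split states at the median population
murders2$pop_half <- ifelse(murders2$population > median(murders2$population),
                            "above median", "below median")
aggregate(total ~ region + pop_half, murders2, mean)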
### Z-scores

Standardizing a score with respect to the other scores in the group tells you how far a value is from the mean in terms of standard deviations. In a perfect bell curve, the mean, median, and mode are equal. The entire area under the curve sums to 1, or 100%; everything below the middle point is 0.5, or 50%.

A z-score expresses a score in terms of how many sd it is away from the mean. It is often spoken about in percentiles, like when you go to a doctor and they tell you that you are in the 35th percentile: that is below the mean.
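For instance, a quick check that the 35th percentile really is below the mean: its z-score, from qnorm(), is negative:

qnorm(0.35) # about -0.39, i.e., below the mean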
If the mean = 39.68 and sd = 7.74, where does x~i~ = 30 fall?

Draw a picture and ask whether the answer should be < or > 0.50. After you have a prediction, you can use it to check the logic of the mathematical result. The z-score is (x~i~ - mean)/sd, and you know it should fall to the left of the mean.

You can also use the pnorm() function, which gives the probability (area) under the curve to the left of a value: pnorm(x~i~, mean, sd).
(30-39.68)/7.74
## [1] -1.250646
pnorm(-1.25)
## [1] 0.1056498
pnorm(30, 39.68, 7.74)
## [1] 0.1055318
The difference is because of rounding. With the middle of the bell curve at 0, everything below it is a negative sd value and everything above is positive. Some values come up over and over again and are worth remembering. 95% is a value we care a lot about in statistics. If you are interested in the central area that corresponds to 95%, centered around zero, there is 5% left for the two tails together, or 2.5% (0.025) in each tail. The z-scores corresponding to 0.025 are 1.96 and -1.96. If you check with pnorm() in R, pnorm(1.96) - pnorm(-1.96) is close to 0.95:
pnorm(1.96) - pnorm(-1.96)
## [1] 0.9500042
1 - (pnorm(1.96) - pnorm(-1.96))
## [1] 0.04999579
The first result is the central 95% area under the bell curve; the second result is the two tails combined.

1.96 cuts off the top 2.5% of the distribution; -1.96 cuts off the bottom 2.5% of the distribution.
pnorm(1.96)
## [1] 0.9750021
This is the area to the left of 1.96, i.e., everything except the top 2.5% tail.
pnorm(-1.96)
## [1] 0.0249979
This is the area under the left tail.
As such, 95% of z-scores lie between -1.96 and 1.96, 99% lie between -2.58 and 2.58, and 99.9% lie between -3.29 and 3.29.
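A quick check of the 99% and 99.9% ranges in R:

pnorm(2.58) - pnorm(-2.58) # about 0.99
pnorm(3.29) - pnorm(-3.29) # about 0.999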
The qnorm() function does the inverse of pnorm(): with pnorm() you give the z-score and it tells you the area under the curve, while with qnorm() you give the area and it returns the z-score.
qnorm(0.005)
## [1] -2.575829
The distribution is symmetric, with 0.5% in each tail. Round that to -2.58: 99% of the data is between -2.58 and 2.58.
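The same idea recovers the familiar 95% cutoffs:

qnorm(0.025) # about -1.96, cutting off the bottom 2.5%
qnorm(0.975) # about 1.96, cutting off the top 2.5%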
Within 1 sd there is about 68% of the data. Within 2 sd there is about 95% of the data, and the z-score is +/- 1.96, so within 2 sd you should have most of the data. Within 3 sd there is 99.7% of the data; only 0.3% should fall outside that range. Remember, this all assumes the data are “normal”.
To calculate the area under the curve within 1 sd (about 68.2% of the data) with R, ask: what percentage of the data is between -1 and 1? pnorm(1) - pnorm(-1) gives the area under the curve between -1 and 1: pnorm() returns everything below a value, so the subtraction leaves the area between the tails. Do the same for 2 sd; it should be about 95%.
pnorm(1) - pnorm(-1)
## [1] 0.6826895
pnorm(2) - pnorm(-2)
## [1] 0.9544997
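And within 3 sd, which should be about 99.7%:

pnorm(3) - pnorm(-3)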
#1 before
(50-39.68)/7.74
## [1] 1.333333
pnorm(1.33333)
## [1] 0.9087882
post_before <- pnorm(50, 39.68, 7.74)
#2 after
(70-39.68)/7.74
## [1] 3.917313
1 - pnorm(3.917313)
## [1] 4.477073e-05
pnorm(-3.917313)
## [1] 4.477073e-05
post_after <- 1 - pnorm(70, 39.68, 7.74)
post_after
## [1] 4.477079e-05
pnorm() gives the area to the left of the value. If you want to know what is to the right of the value (the “after”), you have to compute 1 - pnorm(). The likelihood that someone posted after 70 days is very small.
#3
pnorm(50, 39.68, 7.74) - (1- pnorm(70, 39.68, 7.74))
## [1] 0.908744
pnorm(1.3333) - pnorm(-3.917313)
## [1] 0.9087385
post_before - post_after
## [1] 0.908744
#4
qnorm(.25)
## [1] -0.6744898
qnorm(.25, 39.68, 7.74)
## [1] 34.45945
The answer to #4 is approximately 34.5 days: the first quarter of people posted within the first 34.5 days.
You could also have solved the equation by hand. The z-score formula, z = (x~i~ - mean)/sd, rearranges to x~i~ = (z * sd) + mean:
(-0.67 * 7.74) + 39.68
## [1] 34.4942
x~i~ = 34.5 days
outcome~i~ = (model) + error~i~
• We use a statistical model to represent what is happening in the real world.

#### The mean is a hypothetical value

• It doesn’t have to be a value that actually exists in the data set.
• The mean is a simple statistical model.
The mean is a model of what happens in the real world: the typical score. It is not a perfect representation of the data. How can we assess how well the mean represents reality?
outcome~i~ = (b) + error~i~
• A deviation is the difference between the mean and an actual data point.
• Deviations can be calculated by taking each score and subtracting the mean from it.
• The sum of the deviations equals 0.
• Therefore, we square each deviation.
• If we add these squared deviations we get the Sum of Squared Errors (SS).

What is the problem with SS?

• We calculate the average variability by dividing by the number of scores (N) minus 1.
• This value is called the variance (s^2^) or mean squared error.
• The variance has one problem: it is measured in units squared.
• This isn’t a very meaningful metric, so we take the square root.
• This is the standard deviation (s).
• The Sum of Squares, Variance, and Standard Deviation all represent the same thing:
  • The ‘fit’ of the mean to the data
  • The variability in the data
  • How well the mean represents the observed data
  • Error