stat4arch

Petr Pajdla & Peter Tkáč

AES_707: Statistics seminar for archaeologists

Seminar 4

31. 3. 2022

Today:

Relationship of two variables (correlation, scatterplots).
Data manipulation basics (base R and dplyr package).
Normal distribution.
Your data sets for the project?

Relationship of two variables

Correlation

A statistic describing a relationship between two continuous variables.
To what degree is a variable y explained by x?
Correlation coefficient r, from -1 to +1.
Correlation does not imply causation!
r = 1 – strong positive correlation
r = 0.5 – moderately strong positive correlation
r = 0 – variables are not correlated
r = -0.2 – weak negative correlation
r = -1 – strong negative correlation

Function cor(x, y)

x <- 1:100

cor(x, y = 1:100)

[1] 1

cor(x, y = 100:1)

[1] -1

cor(x, y = rnorm(100))

[1] -0.1474186

Scatter plot

Plot displying two continuous variables, x and y.
x axis: explanatory variable, independent, predictor
y axis: dependent variable, response

plot of chunk unnamed-chunk-6

Correlation

plot of chunk unnamed-chunk-7

Spurrious correlations

Correlation does not imply causality!

http://www.tylervigen.com/spurious-correlations

$plot of chunk unnamed-chunk-8$

Guess the correlation game...

http://guessthecorrelation.com/

Task - syntax elements

Try to identify the syntax elements of the code above:

what is ggplot2?
what is the name of the dataset?
what is the name of the variable plotted on axis y?
how many variables are plotted here?
what is the type of the plot? (what is the type of geometry?)
is there any way how to change the size of the dots / points?

Result

library(ggplot2)
p <- ggplot(mtcars, aes(mpg, wt, color=hp))
p+geom_point(size=3)+
  labs(title = "Porovnanie spotreby paliva s hmotnosťou vozidla",
       x="počet prejdených míl na 1 galón bezínu",
       y="hmotnosť",
       color="výkon motora (v konských silách)")

plot of chunk unnamed-chunk-9

Plots for 2 continuous variables

Packages

library(archdata)
library(ggplot2)

Data

data(DartPoints)
sipky <- data.frame(DartPoints)
head(sipky,4)

  Name Catalog     TARL  Quad Length Width Thickness B.Width J.Width H.Length
1 Darl 41-0322 41CV0536 26/59   42.8  15.8       5.8    11.3    10.6     11.6
2 Darl 35-2946 41CV0235 21/63   40.5  17.4       5.8      NA    13.7     12.9
3 Darl 35-2921 41CV0132 20/63   37.5  16.3       6.1    12.1    11.3      8.2
4 Darl 36-3487 41CV0594 10/54   40.3  16.1       6.3    13.5    11.7      8.3
  Weight Blade.Sh Base.Sh Should.Sh Should.Or Haft.Sh Haft.Or
1    3.6        S       I         S         T       S       E
2    4.5        S       I         S         T       S       E
3    3.6        S       I         S         T       S       E
4    4.0        S       I         S         T       S       E

Distribution

Length

p <- ggplot(sipky, aes(Length))
p + geom_histogram(binwidth=5)

plot of chunk unnamed-chunk-12

Width

p <- ggplot(sipky, aes(Width))
p + geom_histogram(binwidth=5)

plot of chunk unnamed-chunk-13

Plotting 2 variables

Correlation between length and width

p <- ggplot(sipky, aes(Length, Width))
p + geom_point()

plot of chunk unnamed-chunk-14

Plotting 3 variables

Adding 3th variable (weight) and adjusting size of the dots

p <- ggplot(sipky, aes(Length, Width, color=Weight))
p + geom_point(size=3)

plot of chunk unnamed-chunk-15

Observing trend

p <- ggplot(sipky, aes(Length, Width, color=Weight))
p + geom_point(size=3) +
  geom_smooth()

plot of chunk unnamed-chunk-16

Different models

method = "loess"

p <- ggplot(sipky, aes(Length, Width, color=Weight))
p + geom_point(size=3) + 
  geom_smooth(method = "loess")

plot of chunk unnamed-chunk-17

method = "lm"

p <- ggplot(sipky, aes(Length, Width, color=Weight))
p + geom_point(size=3) + 
  geom_smooth(method = "lm")

plot of chunk unnamed-chunk-18

Small multiples - 4th variable

Grouping the darts by categorical variable Name

p <- ggplot(sipky, aes(Length, Width, color=Weight))
p + geom_point(size=3) +
  facet_wrap(~Name, nrow=1)

plot of chunk unnamed-chunk-19

Data transformation

Package

dplyr

Functions

select
arrange
filter
summarise
%>%
mutate

Select

Lets create a smaller dataframes from selected variables

select

select(sipky, Name, Catalog, Length, Width, Weight)

         Name  Catalog Length Width Weight
1        Darl  41-0322   42.8  15.8    3.6
2        Darl  35-2946   40.5  17.4    4.5
3        Darl  35-2921   37.5  16.3    3.6
4        Darl  36-3487   40.3  16.1    4.0
5        Darl  36-3321   30.6  17.1    2.3
6        Darl  35-2959   41.8  16.8    3.0
7        Darl  35-2866   40.3  20.7    3.9
8        Darl  41-0323   48.5  18.7    6.2
9        Darl  35-2325   47.7  17.5    5.1
10       Darl  40-0847   33.6  15.8    2.8
11       Darl  36-3520   32.4  14.5    2.5
12       Darl  38-0736   42.2  14.6    4.8
13       Darl  41-0257   33.5  16.6    3.2
14       Darl  35-2905   41.8  16.7    3.8
15       Darl  36-3898   38.0  16.5    4.5
16       Darl  35-2871   35.5  16.3    4.4
17       Darl  35-2382   31.2  15.6    2.5
18       Darl  35-3026   34.5  15.9    2.3
19       Darl  36-4247   33.1  17.4    4.2
20       Darl  36-3619   32.0  16.0    3.3
21       Darl  36-3036   38.1  19.8    3.6
22       Darl  35-2004   47.6  20.3    7.4
23       Darl  35-0164   42.3  19.2    5.6
24       Darl  36-3059   38.3  20.4    4.8
25       Darl  35-2960   50.6  20.3    7.8
26       Darl  41-0237   54.2  21.0    9.2
27       Darl  35-3043   44.2  23.3    6.2
28       Darl  35-2928   40.0  18.2    4.3
29      Ensor  36-3019   43.5  20.1    4.6
30      Ensor  35-2958   42.1  20.8    5.4
31      Ensor  35-2174   42.1  25.1    5.9
32      Ensor  41-0321   43.1  20.0    5.1
33      Ensor  36-3493   37.5  21.8    4.7
34      Ensor  36-3526   55.2  22.5    7.2
35      Ensor  35-2176   38.0  17.5    2.5
36      Ensor  35-2979   42.5  18.6    3.9
37      Ensor  35-2645   34.9  21.4    4.1
38      Ensor  41-0210   48.5  27.3    7.2
39 Pedernales  36-3026   53.5  27.8   10.7
40 Pedernales  35-2391   66.0  27.2   12.5
41 Pedernales  35-2951   64.5  28.5   13.4
42 Pedernales  35-2875   59.0  22.8   11.1
43 Pedernales  35-2901   60.0  27.9    7.2
44 Pedernales  35-2855  109.5  49.3   28.8
45 Pedernales  35-2384   59.8  31.7   13.9
46 Pedernales  41-0054   52.8  25.3    9.4
47 Pedernales  36-3549   43.7  21.0    5.3
48 Pedernales  35-2873   43.2  23.1    7.9
49 Pedernales  41-0220   49.0  24.4    7.3
50 Pedernales  36-3229   55.3  27.3   12.2
51 Pedernales  36-3879   84.0  21.2    9.3
52 Pedernales  36-3081   48.0  26.9   11.1
53 Pedernales  35-0173   78.3  28.1   14.8
54 Pedernales  36-3880   57.2  26.9   10.7
55 Pedernales  36-3897   52.0  29.6   11.1
56 Pedernales  41-0058   61.1  25.2   12.3
57 Pedernales  38-0098   70.4  30.4   13.1
58 Pedernales  41-0008   48.1  20.9    6.1
59 Pedernales  35-2863   46.4  24.3    9.2
60 Pedernales  35-2650   56.2  22.6    9.4
61 Pedernales  36-4320   47.1  20.9    6.7
62 Pedernales  41-0239   67.2  27.1   15.3
63 Pedernales  36-4266   64.1  27.2   15.1
64 Pedernales  43-0110   65.0  31.6    4.6
65 Pedernales  44-0643   35.4  18.5    4.3
66 Pedernales 44-1253M   61.3  28.1   11.6
67 Pedernales 44-1492M   55.0  22.0   10.5
68 Pedernales 44-1315M   45.2  22.8    6.8
69 Pedernales  47-0041   44.8  26.8    9.1
70 Pedernales  50-0092   49.1  25.8    9.4
71     Travis  35-3164   56.5  21.1    9.5
72     Travis  41-0163   54.6  22.4   10.4
73     Travis  35-2897   46.3  21.3    7.5
74     Travis  35-3075   57.6  18.9    8.7
75     Travis  41-0160   49.1  21.4    6.9
76     Travis  36-0006   64.6  21.5   15.0
77     Travis  43-0112   69.0  20.9   11.4
78     Travis  37-0752   40.1  18.4    6.3
79     Travis  38-0802   41.5  19.2    7.5
80     Travis  40-0661   46.3  17.9    5.9
81     Travis  44-0031   39.6  21.5    5.4
82      Wells  35-0162   53.1  29.6    9.5
83      Wells  36-3586   41.2  19.7    5.4
84      Wells  35-0161   45.0  22.6    7.1
85      Wells  37-0762   54.2  26.0    9.7
86      Wells  36-3088   65.4  25.1   12.6
87      Wells  35-3079   58.9  24.4   10.5
88      Wells  35-2898   55.4  19.3    5.6
89      Wells  35-2458   45.8  18.9    4.9
90      Wells  35-3012   49.1  21.1    5.2
91      Wells  44-0732   63.1  24.7   16.3

Select

And lets create a new dataframe sipky_sel from the selection

sipky_sel <- select(sipky, Name, Catalog, Length, Width, Weight)
head(sipky_sel, 4)

  Name Catalog Length Width Weight
1 Darl 41-0322   42.8  15.8    3.6
2 Darl 35-2946   40.5  17.4    4.5
3 Darl 35-2921   37.5  16.3    3.6
4 Darl 36-3487   40.3  16.1    4.0

How to identify and filter out an outlier?

plot of chunk unnamed-chunk-22

Arrange

Lets arrange data from lowest to highest value

arrange(sipky_sel, Length)

         Name  Catalog Length Width Weight
1        Darl  36-3321   30.6  17.1    2.3
2        Darl  35-2382   31.2  15.6    2.5
3        Darl  36-3619   32.0  16.0    3.3
4        Darl  36-3520   32.4  14.5    2.5
5        Darl  36-4247   33.1  17.4    4.2
6        Darl  41-0257   33.5  16.6    3.2
7        Darl  40-0847   33.6  15.8    2.8
8        Darl  35-3026   34.5  15.9    2.3
9       Ensor  35-2645   34.9  21.4    4.1
10 Pedernales  44-0643   35.4  18.5    4.3
11       Darl  35-2871   35.5  16.3    4.4
12       Darl  35-2921   37.5  16.3    3.6
13      Ensor  36-3493   37.5  21.8    4.7
14       Darl  36-3898   38.0  16.5    4.5
15      Ensor  35-2176   38.0  17.5    2.5
16       Darl  36-3036   38.1  19.8    3.6
17       Darl  36-3059   38.3  20.4    4.8
18     Travis  44-0031   39.6  21.5    5.4
19       Darl  35-2928   40.0  18.2    4.3
20     Travis  37-0752   40.1  18.4    6.3
21       Darl  36-3487   40.3  16.1    4.0
22       Darl  35-2866   40.3  20.7    3.9
23       Darl  35-2946   40.5  17.4    4.5
24      Wells  36-3586   41.2  19.7    5.4
25     Travis  38-0802   41.5  19.2    7.5
26       Darl  35-2959   41.8  16.8    3.0
27       Darl  35-2905   41.8  16.7    3.8
28      Ensor  35-2958   42.1  20.8    5.4
29      Ensor  35-2174   42.1  25.1    5.9
30       Darl  38-0736   42.2  14.6    4.8
31       Darl  35-0164   42.3  19.2    5.6
32      Ensor  35-2979   42.5  18.6    3.9
33       Darl  41-0322   42.8  15.8    3.6
34      Ensor  41-0321   43.1  20.0    5.1
35 Pedernales  35-2873   43.2  23.1    7.9
36      Ensor  36-3019   43.5  20.1    4.6
37 Pedernales  36-3549   43.7  21.0    5.3
38       Darl  35-3043   44.2  23.3    6.2
39 Pedernales  47-0041   44.8  26.8    9.1
40      Wells  35-0161   45.0  22.6    7.1
41 Pedernales 44-1315M   45.2  22.8    6.8
42      Wells  35-2458   45.8  18.9    4.9
43     Travis  35-2897   46.3  21.3    7.5
44     Travis  40-0661   46.3  17.9    5.9
45 Pedernales  35-2863   46.4  24.3    9.2
46 Pedernales  36-4320   47.1  20.9    6.7
47       Darl  35-2004   47.6  20.3    7.4
48       Darl  35-2325   47.7  17.5    5.1
49 Pedernales  36-3081   48.0  26.9   11.1
50 Pedernales  41-0008   48.1  20.9    6.1
51       Darl  41-0323   48.5  18.7    6.2
52      Ensor  41-0210   48.5  27.3    7.2
53 Pedernales  41-0220   49.0  24.4    7.3
54 Pedernales  50-0092   49.1  25.8    9.4
55     Travis  41-0160   49.1  21.4    6.9
56      Wells  35-3012   49.1  21.1    5.2
57       Darl  35-2960   50.6  20.3    7.8
58 Pedernales  36-3897   52.0  29.6   11.1
59 Pedernales  41-0054   52.8  25.3    9.4
60      Wells  35-0162   53.1  29.6    9.5
61 Pedernales  36-3026   53.5  27.8   10.7
62       Darl  41-0237   54.2  21.0    9.2
63      Wells  37-0762   54.2  26.0    9.7
64     Travis  41-0163   54.6  22.4   10.4
65 Pedernales 44-1492M   55.0  22.0   10.5
66      Ensor  36-3526   55.2  22.5    7.2
67 Pedernales  36-3229   55.3  27.3   12.2
68      Wells  35-2898   55.4  19.3    5.6
69 Pedernales  35-2650   56.2  22.6    9.4
70     Travis  35-3164   56.5  21.1    9.5
71 Pedernales  36-3880   57.2  26.9   10.7
72     Travis  35-3075   57.6  18.9    8.7
73      Wells  35-3079   58.9  24.4   10.5
74 Pedernales  35-2875   59.0  22.8   11.1
75 Pedernales  35-2384   59.8  31.7   13.9
76 Pedernales  35-2901   60.0  27.9    7.2
77 Pedernales  41-0058   61.1  25.2   12.3
78 Pedernales 44-1253M   61.3  28.1   11.6
79      Wells  44-0732   63.1  24.7   16.3
80 Pedernales  36-4266   64.1  27.2   15.1
81 Pedernales  35-2951   64.5  28.5   13.4
82     Travis  36-0006   64.6  21.5   15.0
83 Pedernales  43-0110   65.0  31.6    4.6
84      Wells  36-3088   65.4  25.1   12.6
85 Pedernales  35-2391   66.0  27.2   12.5
86 Pedernales  41-0239   67.2  27.1   15.3
87     Travis  43-0112   69.0  20.9   11.4
88 Pedernales  38-0098   70.4  30.4   13.1
89 Pedernales  35-0173   78.3  28.1   14.8
90 Pedernales  36-3879   84.0  21.2    9.3
91 Pedernales  35-2855  109.5  49.3   28.8

Arrange

head(arrange(sipky_sel, Length),4)

  Name Catalog Length Width Weight
1 Darl 36-3321   30.6  17.1    2.3
2 Darl 35-2382   31.2  15.6    2.5
3 Darl 36-3619   32.0  16.0    3.3
4 Darl 36-3520   32.4  14.5    2.5

Arrange - descent

And now from highest to lowest value!

head(arrange(sipky_sel, desc(Length)),4)

        Name Catalog Length Width Weight
1 Pedernales 35-2855  109.5  49.3   28.8
2 Pedernales 36-3879   84.0  21.2    9.3
3 Pedernales 35-0173   78.3  28.1   14.8
4 Pedernales 38-0098   70.4  30.4   13.1

Filter

Lets filter some data

filter(sipky_sel, Length > 90)

        Name Catalog Length Width Weight
1 Pedernales 35-2855  109.5  49.3   28.8

sipky_90 <- filter(sipky_sel, Length < 90)
head(arrange(sipky_90, desc(Length)),4)

        Name Catalog Length Width Weight
1 Pedernales 36-3879   84.0  21.2    9.3
2 Pedernales 35-0173   78.3  28.1   14.8
3 Pedernales 38-0098   70.4  30.4   13.1
4     Travis 43-0112   69.0  20.9   11.4

Filter

You can also filter rows with specific value, e.g. here we filter all darts called “Darl”

sipky_darl<-filter(sipky_sel, Name=="Darl")
head(sipky_darl)

  Name Catalog Length Width Weight
1 Darl 41-0322   42.8  15.8    3.6
2 Darl 35-2946   40.5  17.4    4.5
3 Darl 35-2921   37.5  16.3    3.6
4 Darl 36-3487   40.3  16.1    4.0
5 Darl 36-3321   30.6  17.1    2.3
6 Darl 35-2959   41.8  16.8    3.0

Filtering more than one value

Create a dataframe with darts named “Darl” or “Ensor”

vyber<-c("Darl","Ensor")
darl_ensor <- filter(sipky_sel, Name %in% vyber)

summary(darl_ensor$Name)

      Darl      Ensor Pedernales     Travis      Wells 
        28         10          0          0          0

Summarise

mean(sipky_sel$Length)

[1] 49.33077

summarise(sipky_sel, dlzka_prum = mean(Length))

  dlzka_prum
1   49.33077

summarise(
  sipky_sel, 
  dlzka_prum = mean(Length),
  dlzka_sd = sd(Length),
  dlzka_min = min(Length),
  dlzka_max = max(Length),
  ks = n()
)

  dlzka_prum dlzka_sd dlzka_min dlzka_max ks
1   49.33077 12.73619      30.6     109.5 91

Grouping

group_by(sipky_sel, Name)

# A tibble: 91 × 5
# Groups:   Name [5]
   Name  Catalog Length Width Weight
   <fct> <chr>    <dbl> <dbl>  <dbl>
 1 Darl  41-0322   42.8  15.8    3.6
 2 Darl  35-2946   40.5  17.4    4.5
 3 Darl  35-2921   37.5  16.3    3.6
 4 Darl  36-3487   40.3  16.1    4  
 5 Darl  36-3321   30.6  17.1    2.3
 6 Darl  35-2959   41.8  16.8    3  
 7 Darl  35-2866   40.3  20.7    3.9
 8 Darl  41-0323   48.5  18.7    6.2
 9 Darl  35-2325   47.7  17.5    5.1
10 Darl  40-0847   33.6  15.8    2.8
# … with 81 more rows

Pipe operator

(originally comes from a package magrittr, but nowadays many packages use it as well)

%>%

(type Ctrl + Shift + m)

function(x, y)

can be written as:

x %>% function(y)

sipky %>% 
  select(Name, Length) %>% 
  filter(Length < 90) %>% 
  arrange(desc(Length)) %>% 
  head(6)

        Name Length
1 Pedernales   84.0
2 Pedernales   78.3
3 Pedernales   70.4
4     Travis   69.0
5 Pedernales   67.2
6 Pedernales   66.0

Summarizing data

sipky_group <- sipky_sel %>%
  group_by(Name) %>%
  summarise (dlzka_prum = mean(Length),
             ks = n())

sipky_group

# A tibble: 5 × 3
  Name       dlzka_prum    ks
  <fct>           <dbl> <int>
1 Darl             39.8    28
2 Ensor            42.7    10
3 Pedernales       57.9    32
4 Travis           51.4    11
5 Wells            53.1    10

Mutate

mutate(sipky_group, 
       frekvence = ks/sum(ks), 
       procento = round((frekvence*100), 0))

# A tibble: 5 × 5
  Name       dlzka_prum    ks frekvence procento
  <fct>           <dbl> <int>     <dbl>    <dbl>
1 Darl             39.8    28     0.308       31
2 Ensor            42.7    10     0.110       11
3 Pedernales       57.9    32     0.352       35
4 Travis           51.4    11     0.121       12
5 Wells            53.1    10     0.110       11

Summarizing once again

sipky_group <- sipky_sel %>%
  group_by(Name)%>%
  summarise(
    dlzka_prum = mean(Length),
    dlzka_sd = sd(Length),
    ks = n()
  ) %>%
  mutate(pomer = ks / sum(ks),
         procento = round((pomer*100),0))

head(sipky_group)

# A tibble: 5 × 6
  Name       dlzka_prum dlzka_sd    ks pomer procento
  <fct>           <dbl>    <dbl> <int> <dbl>    <dbl>
1 Darl             39.8     6.18    28 0.308       31
2 Ensor            42.7     5.79    10 0.110       11
3 Pedernales       57.9    14.1     32 0.352       35
4 Travis           51.4     9.90    11 0.121       12
5 Wells            53.1     7.94    10 0.110       11

Let's practice...

Normal distribution

(bell-shaped curve, Gaussian distribution)

One sigma

Two sigma

Tree sigma

Normal distribution in different plots

plot of chunk unnamed-chunk-39

Mean and median in normal distribution

plot of chunk unnamed-chunk-40

plot of chunk unnamed-chunk-41

Is my distribution normal?

Visual aids

Density plot
Q-Q plot (quantile-quantile plot)
qqnorm(x) nebo
ggplot(data, aes(sample = x)) + geom_qq()

Statistical hypothesis test

Shapiro-Wilk test
shapiro.test()
Kolmogorov-Smirnov normality test

Q-Q plot

ggplot(nd, aes(x)) + 
  geom_density()

plot of chunk unnamed-chunk-42

ggplot(nd, aes(sample = x)) + 
  geom_qq(alpha = 0.4)

plot of chunk unnamed-chunk-43

Shapiro-Wilk normality test

$ H_0 $ (null hypothesis): Values fit normal distribution.
$ H_A $ (alternative hypothesis): Values do not fit normal distribution.
p-value: probability of the event that observed values fit normal distribution
p > 0.05: Fail to reject null hypothesis.
Significance level = 0.05 – Event occurs in less than 5% of cases

shapiro.test(nd$x)


    Shapiro-Wilk normality test

data:  nd$x
W = 0.99897, p-value = 0.8592

Data distribution shapes

Transforming data

Many (!) statistical methods suppose the distribution of data is normal.
Parametric vs non-parametric statistical methods.
Most of (our) real world data is not normal.
Data are transformed using various methods…
Natural/Base 10 logarithms: log10(x)
Square root: sqrt(x)
Multimodal distributions?

Relative position of a data point in a data set

Z--Score

Recall what is standard deviation?
How many standard deviations is the data point from the mean?

\[ z-score = \frac{x_i-\overline{x}}{s} \]

Why is this useful?
What if we count z-scores for the whole dataset?