R01 STA1511

Introduction to R

R is a language and environment for statistical computing and graphics.

  1. Download R-base

https://cran.r-project.org/bin/windows/base/

  2. Download RStudio

https://www.rstudio.com/products/rstudio/download/

Statistics

  • Population -> the entire collection of objects that is the focus of our observation.

  • Sample -> a subset of the population.

  • Parameter -> a numerical measure that describes the population of interest.

  • Statistic -> a numerical measure computed from a sample; it is used as an estimator of the corresponding parameter.
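To make the distinction concrete, the following minimal sketch treats the 32 `mpg` values in the built-in `mtcars` data as if they were a whole population (an assumption made purely for illustration) and draws a small sample from it. The names `population`, `smp`, `mu`, `x_bar`, the seed, and the sample size of 10 are all hypothetical choices.

```r
# Illustration only: pretend the 32 cars are the entire population
population <- mtcars$mpg

# Parameter: a numerical measure of the population
mu <- mean(population)

# Draw a random sample (a subset of the population)
set.seed(1)                          # arbitrary seed, for reproducibility
smp <- sample(population, size = 10)

# Statistic: a numerical measure of the sample, used to estimate mu
x_bar <- mean(smp)

c(parameter = mu, statistic = x_bar)
```

The two numbers will usually be close but not equal; a different seed gives a different sample and therefore a different statistic, while the parameter stays fixed.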

Descriptive Statistics

Descriptive statistics is a set of techniques for presenting and summarizing data so that it becomes information that is easy to understand.

Import Data

The data can be downloaded through this link.

Tutorial: Import data to RStudio.

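library(readxl)  # only needed for Excel files; the CSV below is read with base R
library(tidyr)   # provides drop_na()
# read the CSV file (adjust the path to where the file is saved)
data1 <- read.csv("D:/MATERI KULIAH S2 IPB/ASPRAK 2/Example_data.csv")
head(data1)      # show the first six rows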
##   bookpageID    appdate ceremonydate delay     officialTitle person        dob
## 1   B230p539 10/29/1996    11/9/1996    11    CIRCUIT JUDGE   Groom  4/11/1964
## 2   B230p677 11/12/1996   11/12/1996     0 MARRIAGE OFFICIAL  Groom   8/6/1964
## 3   B230p766 11/19/1996   11/27/1996     8 MARRIAGE OFFICIAL  Groom  2/20/1962
## 4   B230p892  12/2/1996    12/7/1996     5          MINISTER  Groom  5/20/1956
## 5   B230p994  12/9/1996   12/14/1996     5          MINISTER  Groom 12/14/1966
## 6  B230p1209 12/26/1996   12/26/1996     0 MARRIAGE OFFICIAL  Groom  2/21/1970
##        age college     zodiacs
## 1 32.60274       7       Aries
## 2 32.29041       0         Leo
## 3 34.79178       3      Pisces
## 4 40.57808       4      Gemini
## 5 30.02192       0 Saggitarius
## 6 26.86301       0      Pisces
#check missing data
colSums(is.na(data1))
##    bookpageID       appdate  ceremonydate         delay officialTitle 
##             0             0             0             1             0 
##        person           dob           age       college       zodiacs 
##             0             0             1            11             0
# drop every row that contains at least one NA
dataz <- drop_na(data1)

# check missing values again
colSums(is.na(dataz))
##    bookpageID       appdate  ceremonydate         delay officialTitle 
##             0             0             0             0             0 
##        person           dob           age       college       zodiacs 
##             0             0             0             0             0
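The same check-and-drop pattern can be tried on a small made-up data frame. This sketch uses only base R (`complete.cases()` in place of `tidyr::drop_na()`); the data frame `toy` and its values are hypothetical.

```r
# Hypothetical data with two missing values
toy <- data.frame(
  id    = 1:5,
  delay = c(11, 0, NA, 5, 5),
  age   = c(32.6, NA, 34.8, 40.6, 30.0)
)

# Number of missing values in each column
colSums(is.na(toy))

# Keep only the rows that have no missing value in any column
toy_complete <- toy[complete.cases(toy), ]

nrow(toy_complete)   # rows 2 and 3 are dropped, so 3 rows remain
```

Note that dropping rows removes the whole observation, including its non-missing values, so it is worth checking how many rows are lost before continuing.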

Contingency Table

A contingency table summarizes two or more categorical variables by showing how many observations fall into each combination of their categories.

Data

mtcars data from R

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

A data frame with 32 observations on 11 (numeric) variables.

[, 1] mpg Miles/(US) gallon

[, 2] cyl Number of cylinders

[, 3] disp Displacement (cu.in.)

[, 4] hp Gross horsepower

[, 5] drat Rear axle ratio

[, 6] wt Weight (1000 lbs)

[, 7] qsec 1/4 mile time

[, 8] vs Engine (0 = V-shaped, 1 = straight)

[, 9] am Transmission (0 = automatic, 1 = manual)

[,10] gear Number of forward gears

[,11] carb Number of carburetors

# reading the data
data(mtcars)
colnames(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
attach(mtcars)


# Contingency Table – 2-way relationships
t0 = table(cyl, gear)
t0
##    gear
## cyl  3  4  5
##   4  1  8  2
##   6  2  4  1
##   8 12  0  2
t1 = xtabs(~ cyl + gear, data = mtcars)
t1
##    gear
## cyl  3  4  5
##   4  1  8  2
##   6  2  4  1
##   8 12  0  2
t2 = ftable(gear ~ cyl, data = mtcars)
t2
##     gear  3  4  5
## cyl              
## 4         1  8  2
## 6         2  4  1
## 8        12  0  2
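Counts are often easier to compare once they are converted to proportions or given totals. A short sketch using base R's `prop.table()` and `addmargins()` on the same `cyl` by `gear` table (the object names `tab`, `row_prop`, and `tab_total` are illustrative):

```r
tab <- xtabs(~ cyl + gear, data = mtcars)

# Row proportions: within each cylinder class, the share of each gear count
row_prop <- prop.table(tab, margin = 1)
round(row_prop, 2)

# Add row and column totals to the table of counts
tab_total <- addmargins(tab)
tab_total
```

Using `margin = 2` instead would give column proportions, that is, the share of each cylinder class within each gear count.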

Frequency Table

A frequency table is a table that lists items and shows the number of times the items occur.

library(kableExtra)
library(janitor)
table2 = tabyl(dataz, officialTitle) %>% 
    adorn_totals("row") %>%
    adorn_pct_formatting(digits = 0)
names(table2) = c("Official Title", "Frequency", "Percent")

kbl(table2, 
    caption = "Table 1: Distribution of participants by official title") %>%
    kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
Table 1: Distribution of participants by official title
Official Title      Frequency   Percent
BISHOP                      1        1%
CATHOLIC PRIEST             2        2%
CHIEF CLERK                 2        2%
CIRCUIT JUDGE               2        2%
ELDER                       2        2%
MARRIAGE OFFICIAL          40       45%
MINISTER                   19       22%
PASTOR                     20       23%
Total                      88      100%
library(tidyverse)
datatable1 <- dataz %>% count(officialTitle)
datatable1
##       officialTitle  n
## 1            BISHOP  1
## 2   CATHOLIC PRIEST  2
## 3       CHIEF CLERK  2
## 4    CIRCUIT JUDGE   2
## 5             ELDER  2
## 6 MARRIAGE OFFICIAL 40
## 7          MINISTER 19
## 8            PASTOR 20
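A frequency table with percentages can also be sketched in base R alone, which avoids the package dependencies. The example below uses the `cyl` column of the built-in `mtcars` data instead of the marriage data so that it runs without the CSV file; `freq` and `freq_table` are illustrative names.

```r
# Frequency of each number of cylinders in the built-in mtcars data
freq <- table(mtcars$cyl)

# Combine counts and percentages in one data frame
freq_table <- data.frame(
  cyl       = names(freq),
  Frequency = as.vector(freq),
  Percent   = round(100 * as.vector(freq) / sum(freq), 1)
)
freq_table
```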

Bar Chart


A bar chart is useful for displaying categorical data (nominal and ordinal). It can also be used to present data from contingency tables or data summary tables.

library(ggplot2)
ggplot(dataz, aes(x = zodiacs)) +              # one bar per category of `zodiacs`
  geom_bar(fill = "pink", color = "black") +   # fill and outline colors
  theme_minimal() +                            # background theme
  labs(x = "Zodiacs",                          # axis labels and title
       y = "Frequency",
       title = "Zodiacs")

ggplot(dataz, aes(x = zodiacs)) +              # one bar per category of `zodiacs`
  geom_bar(fill = "coral", color = "black") +  # fill and outline colors
  theme_minimal() +                            # background theme
  labs(x = "Zodiacs",                          # axis labels and title
       y = "Frequency",
       title = "Zodiacs") +
  coord_flip()                                 # flip the axes to get horizontal bars
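Since a bar chart can also present a contingency table, here is a minimal sketch that plots the `cyl` by `gear` counts from `mtcars` as grouped bars. It uses base R's `barplot()` rather than ggplot2, simply to keep the example free of package dependencies; the colours, labels, and the name `bar_mids` are arbitrary choices.

```r
counts <- table(mtcars$gear, mtcars$cyl)   # rows: gears, columns: cylinders

bar_mids <- barplot(
  counts,
  beside = TRUE,                           # grouped rather than stacked bars
  col    = c("pink", "coral", "steelblue"),
  xlab   = "Number of cylinders",
  ylab   = "Frequency",
  main   = "Cars by cylinders and gears",
  legend.text = paste(rownames(counts), "gears")
)
```

Setting `beside = FALSE` would stack the gear counts within each cylinder class instead of placing them side by side.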

Pie Chart

A pie chart is used to display categorical data, especially nominal data. It shows how the data are distributed across groups as shares of a whole (the shares total 100%).

library(tidyverse)
plotdata <- dataz %>%
  count(zodiacs) %>%
  arrange(desc(zodiacs)) %>%
  mutate(prop = round(n*100/sum(n), 1),
         lab.ypos = cumsum(prop) - 0.5*prop)

# Pie Chart
ggplot(plotdata, aes(x = "", y = prop, fill = zodiacs)) +
  geom_bar(width = 1, stat = "identity", color = "white") +
  coord_polar("y", start = 0)+
  geom_text(aes(y = lab.ypos, label = prop), color = "black")+
  scale_fill_manual(values = rainbow(13)) +
  theme_void()+
  labs(title = "Percentage of Zodiacs")
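The label positions in `lab.ypos` place each label at the midpoint of its slice: the cumulative percentage up to and including the slice, minus half of the slice's own percentage. A tiny worked example with made-up percentages (the vector `prop` here is hypothetical, not taken from the zodiac data):

```r
prop <- c(50, 30, 20)            # hypothetical slice percentages, summing to 100

upper    <- cumsum(prop)         # upper edge of each slice: 50, 80, 100
lab_ypos <- upper - 0.5 * prop   # midpoint of each slice: 25, 65, 90
lab_ypos
```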

Histogram

A histogram is a graph of the frequency distribution of a numerical variable. The bar heights can show either the frequency (count) or the relative frequency (proportion) of each bin.

# histogram of age from the marriage data (dataz)
ggplot(dataz, aes(x = age)) +
  geom_histogram(fill = "coral1",
                 color = "black",
                 bins = 15) +
  theme_minimal() +
  labs(title = "Age",
       x = "Age",
       y = "Frequency") # the distribution is skewed to the right

# histogram of sepal width from the built-in iris data
ggplot(iris, aes(x = Sepal.Width)) +
  geom_histogram(fill = "green",
                 color = "black",
                 bins = 10) +
  theme_minimal() +
  labs(title = "Sepal Width",
       x = "Sepal Width",
       y = "Frequency") # roughly symmetric and bell-shaped
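To see the difference between frequency and relative frequency in numbers, the bins can be computed without drawing the plot. This is a sketch using base R's `hist()` with `plot = FALSE`; note that `breaks = 10` is only a suggestion to `hist()`, so the actual number of bins may differ, and `h` and `rel_freq` are illustrative names.

```r
# Compute the histogram bins without drawing anything
h <- hist(iris$Sepal.Width, breaks = 10, plot = FALSE)

# Relative frequency: each bin's count divided by the total count
rel_freq <- h$counts / sum(h$counts)

data.frame(
  lower     = head(h$breaks, -1),   # left edge of each bin
  upper     = tail(h$breaks, -1),   # right edge of each bin
  frequency = h$counts,
  rel_freq  = round(rel_freq, 3)
)
```

The frequencies sum to the number of observations, while the relative frequencies sum to 1, so the two histograms have the same shape and differ only in the scale of the vertical axis.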

Dot plot

  • A graph used to see the distribution of the original data in the form of points

  • Used to see the frequency of occurrence for each value

ggplot(dataz, aes(x = age)) +
  geom_dotplot(fill = "blue",
               binwidth = 2) +
  theme_minimal() +                                 
  labs(title = "Age",
       y = "Proportions",
       x = "Age",
       subtitle = "binwidth = 2")

Stem & leaf plot

  • A stem and leaf plot is a very effective way of visually representing the data directly.

  • The shape of the plot may indicate whether the data set is skewed left, skewed right, or symmetric.

  • The appearance of tails in the plot may also indicate the presence of outliers in the data set, located in the tail region.

  • In R we can generate a stem and leaf plot for a data set using the stem() function.

library(aplpack)
variety_1 <-   c(20,12,39,38,
                 41,43,51,52,
                 59,55,53,59,
                 50,58,35,38,
                 23,32,43,53)
variety_2 <-   c(18,45,62,59,
                 53,25,13,57,
                 42,55,13,57,
                 42,55,56,38,
                 41,36,50,62,
                 45,55)
stem.leaf.backback(variety_1, variety_2, m = 1)
## _____________________________________
##   1 | 2: represents 12, leaf unit: 1 
##        variety_1     variety_2   
## _____________________________________
##    1           2| 1 |338         3   
##    3          30| 2 |5           4   
##    8       98852| 3 |68          6   
##   (3)        331| 4 |12255      (5)  
##    9   998533210| 5 |035556779  (9)  
##                 | 6 |22          2   
##                 | 7 |                
## _____________________________________
## n:            20     22          
## _____________________________________
stem(variety_1)
## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##   1 | 2
##   2 | 03
##   3 | 25889
##   4 | 133
##   5 | 012335899

Box Plot

A box plot presents the data through its five-number summary (Min, Q1, Q2, Q3, Max).

library(ggplot2)
head(airquality)   # built-in data set with 153 rows; only the first six are shown
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
# box plot of temperature for each month
ggplot(data = airquality, aes(x = as.character(Month), y = Temp)) +
    geom_boxplot(fill = 'steelblue')

# using a different color for each of the five months
ggplot(data = airquality, aes(x = as.character(Month), y = Temp)) +
    geom_boxplot(fill = c('steelblue', 'red', 'purple', 'green', 'orange'))
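The five numbers behind a box plot can be inspected directly. The sketch below uses base R's `fivenum()`, which returns Tukey's five-number summary; its lower and upper hinges can differ slightly from the quartiles returned by `quantile()`. The names `temp` and `five` are illustrative.

```r
temp <- airquality$Temp

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(temp)
names(five) <- c("Min", "Q1", "Q2", "Q3", "Max")
five

# The same summary computed separately for each month
tapply(temp, airquality$Month, fivenum)
```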

Data Summary Techniques

1. Central Tendency

Mean

• The center of mass (centroid) of the data

• For a population, the mean is denoted \(\mu\).

• For a sample, the mean is denoted \(\bar{x}\).

• Used for numerical data

• Not resistant to outliers: a single extreme value can shift the mean noticeably

dist<-c(12.5,29.9,14.8,18.7,7.6,16.2,16.5,27.4,12.1,17.5)
mean1<-sum(dist)/length(dist)
mean1
## [1] 17.32
# or using function `mean`

mean(dist)
## [1] 17.32

Median

• Denoted Q2

• The observation in the middle of the sorted data

• Splits the sorted data into two halves (50% below, 50% above)

median(dist)
## [1] 16.35
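How strongly the mean and the median react to an extreme value can be checked directly. In this sketch one made-up outlier (300, a hypothetical value) is appended to `dist`; `dist_out` is an illustrative name.

```r
dist <- c(12.5, 29.9, 14.8, 18.7, 7.6, 16.2, 16.5, 27.4, 12.1, 17.5)

# Add one extreme value (hypothetical, for illustration)
dist_out <- c(dist, 300)

c(mean = mean(dist),     median = median(dist))      # 17.32 and 16.35
c(mean = mean(dist_out), median = median(dist_out))  # about 43.02 and 16.5
```

The mean more than doubles, while the median barely moves, which is why the median is often preferred for skewed data or data with outliers.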

Mode

The value of the observation that occurs most often.

library(DescTools)
mode1<-Mode(iris$Sepal.Width)
mode1
## [1] 3
## attr(,"freq")
## [1] 26
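`Mode()` comes from the DescTools package. If that package is not available, the mode can be sketched in base R by tabulating the values. `mode_values()` below is a hypothetical helper, not part of any package, and it assumes a numeric input; it returns every value that shares the highest count.

```r
# Hypothetical helper: return the most frequent value(s) of a numeric vector
mode_values <- function(x) {
  counts <- table(x)                                  # count each distinct value
  is_max <- as.vector(counts) == max(counts)          # which values occur most often
  as.numeric(names(counts)[is_max])                   # convert the labels back to numbers
}

mode_values(iris$Sepal.Width)    # 3, in agreement with Mode() above
mode_values(c(1, 1, 2, 2, 3))    # two modes: 1 and 2
```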

Quartiles

• Values that divide the sorted data into 4 equal parts

• Q0 = min and Q4 = max

• Q1 (read "quartile 1") is the value with 25% of the data to its left and 75% of the data to its right

• Q3 (read "quartile 3") is the value with 75% of the data to its left and 25% of the data to its right

• Robust against outliers

# Q1 and Q3
quantile(dist,probs=c(0.25,0.75))
##    25%    75% 
## 13.075 18.400
# Q0 and Q4
min(dist)
## [1] 7.6
max(dist)
## [1] 29.9
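By default `quantile()` interpolates linearly between neighbouring sorted values (its `type = 7` rule): the p-th quantile sits at position \(1 + (n - 1)p\) in the sorted data. The sketch below reproduces Q1 by hand under that rule; the helper names `h`, `lo`, `frac`, and `q1_manual` are illustrative.

```r
dist <- c(12.5, 29.9, 14.8, 18.7, 7.6, 16.2, 16.5, 27.4, 12.1, 17.5)

x <- sort(dist)
n <- length(x)

h    <- 1 + (n - 1) * 0.25     # position of Q1 in the sorted data: 3.25
lo   <- floor(h)               # index of the sorted value just below: 3 (value 12.5)
frac <- h - lo                 # fractional part of the position: 0.25

# Move a quarter of the way from the 3rd to the 4th sorted value
q1_manual <- x[lo] + frac * (x[lo + 1] - x[lo])
q1_manual                      # 13.075, matching quantile(dist, 0.25)
```

Other software and textbooks may use a different rule (the `type` argument of `quantile()` selects among nine of them), so small differences in reported quartiles are not unusual.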

2. Dispersion Measure

  1. A dispersion measure gives a QUANTITATIVE MEASURE of how spread out or how tightly grouped the data are.

  2. Variation is usually defined in terms of distance:

  • How far the points are from each other

  • How far the points are from the mean

  • How well a central value represents the data as a whole

Range

Range = Max(data)-Min(data)

range1<-max(dist)-min(dist)
range1
## [1] 22.3

Interquartile Range (IQR)

The interquartile range explains the spread of the middle half of the distribution.

IQR = Q3 - Q1

Quartiles segment any distribution that’s ordered from low to high into four equal parts.

# named iqr1 so that the built-in function IQR() is not masked
iqr1 <- quantile(dist, probs = 0.75) - quantile(dist, probs = 0.25)
iqr1
##   75% 
## 5.325

Deviation

The difference between each observation and the mean of the data.

deviation<-dist-mean(dist)
deviation
##  [1] -4.82 12.58 -2.52  1.38 -9.72 -1.12 -0.82 10.08 -5.22  0.18

Variance

The variance is a measure of variability: it describes the degree of spread in the data set. The more spread out the data are around the mean, the larger the variance.

Formula: the sum of squared deviations from the mean, divided by \(n - 1\) for a sample, which is what `var()` computes: \(s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2\).

var(dist)
## [1] 46.11511

Standard Deviation

Standard Deviation is the square root of variance.

sd(dist)
## [1] 6.790811
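The results of `var()` and `sd()` can be reproduced step by step from the deviations. A minimal sketch, where `var_manual` and `sd_manual` are illustrative names:

```r
dist <- c(12.5, 29.9, 14.8, 18.7, 7.6, 16.2, 16.5, 27.4, 12.1, 17.5)

n         <- length(dist)
deviation <- dist - mean(dist)             # distance of each value from the mean

# Sample variance: sum of squared deviations divided by n - 1
var_manual <- sum(deviation^2) / (n - 1)

# Standard deviation: square root of the variance
sd_manual <- sqrt(var_manual)

c(var_manual = var_manual, var_builtin = var(dist))
c(sd_manual  = sd_manual,  sd_builtin  = sd(dist))
```

Dividing by `n` instead of `n - 1` would give the population variance, which is slightly smaller; R's `var()` always uses `n - 1`.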

Summary in R

summary(dist)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.60   13.07   16.35   17.32   18.40   29.90

Summary Table

library(tidyverse)
library(kableExtra)

table1 <- dataz %>% 
    group_by(zodiacs) %>% 
    summarise(Frequency = n(),
              Minimum = min(age),
              Maximum = max(age),
              Median = median(age),
              Mean=mean(age),
              IQR = diff(quantile(age, c(1, 3)/4)))
names(table1)[1] = c("Zodiacs")

kbl(table1, digits = 2, 
    caption = "Table 2: Descriptive statistics of age by zodiacs.") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
Table 2: Descriptive statistics of age by zodiacs.
Zodiacs       Frequency  Minimum  Maximum  Median   Mean    IQR
Aquarius              7    20.27    42.17   23.38  28.27  10.45
Aries                 9    20.04    52.44   33.98  34.00  17.83
Cancer                8    16.27    67.58   40.42  38.73  12.07
Capricorn             2    23.99    37.84   30.92  30.92   6.93
Gemini                9    18.46    74.25   34.01  42.09  29.81
Leo                   6    18.28    68.04   29.36  34.70  19.62
Libra                 6    18.36    45.02   22.30  27.59  16.85
Pisces               13    18.64    55.64   26.86  30.28  14.02
Saggitarius           9    21.34    44.85   37.55  34.11  16.44
Scorpio               6    18.40    72.80   28.93  36.13  13.34
Taurus                5    17.02    52.59   39.58  36.49  25.35
Virgo                 8    20.22    50.07   27.74  31.02  18.84
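A grouped summary of the same kind can be sketched on a built-in data set, which makes it runnable without the example CSV. Here `mpg` from `mtcars` is summarised by `cyl` using base R's `aggregate()` instead of dplyr; the choice of variables and the name `summary_by_cyl` are only for illustration.

```r
summary_by_cyl <- aggregate(
  mpg ~ cyl,
  data = mtcars,
  FUN  = function(x) c(
    Frequency = length(x),
    Minimum   = min(x),
    Maximum   = max(x),
    Median    = median(x),
    Mean      = mean(x),
    IQR       = IQR(x)
  )
)

# aggregate() stores the statistics as one matrix column; flatten it
summary_by_cyl <- do.call(data.frame, summary_by_cyl)

# Drop the "mpg." prefix that flattening adds to the column names
names(summary_by_cyl) <- sub("^mpg\\.", "", names(summary_by_cyl))
summary_by_cyl
```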