R01 STA1511

Introduction to R

R is a language and environment for statistical computing and graphics.

Download R-base

https://cran.r-project.org/bin/windows/base/

Download R-Studio

https://www.rstudio.com/products/rstudio/download/

Statistics

Statistics -> estimators of parameters.
Parameters -> numerical measures that describe the population of interest.
Statistics -> numerical measures computed from a sample.
Sample -> a subset of the population.
Population -> the entire set of objects that we want to study.
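
The relationship between these terms can be sketched with simulated data. The population below is hypothetical (10,000 made-up heights); the names `population`, `sample_data`, `mu` and `x_bar` are chosen only for this illustration.

```r
# Hypothetical population: 10,000 simulated heights (assumed values, not real data)
set.seed(1511)
population <- rnorm(10000, mean = 165, sd = 8)

# Parameter: a numerical measure of the whole population
mu <- mean(population)

# Sample: a subset of the population
sample_data <- sample(population, size = 50)

# Statistic: the same measure computed on the sample, used to estimate mu
x_bar <- mean(sample_data)

c(parameter = mu, statistic = x_bar)
```

The two values are usually close but not identical, which is why a statistic is called an estimator of the parameter.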

Descriptive Statistics

Descriptive statistics are techniques for presenting and summarizing data so that the data become information that is easy to understand.

Import Data

The data can be downloaded through this link.

Tutorial: Import data to RStudio.

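library(tidyr)   # provides drop_na()
# read.csv() is part of base R; adjust the path to where the file was saved
data1 <- read.csv("D:/MATERI KULIAH S2 IPB/ASPRAK 2/Example_data.csv")
head(data1)      # show the first six rows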

##   bookpageID    appdate ceremonydate delay     officialTitle person        dob
## 1   B230p539 10/29/1996    11/9/1996    11    CIRCUIT JUDGE   Groom  4/11/1964
## 2   B230p677 11/12/1996   11/12/1996     0 MARRIAGE OFFICIAL  Groom   8/6/1964
## 3   B230p766 11/19/1996   11/27/1996     8 MARRIAGE OFFICIAL  Groom  2/20/1962
## 4   B230p892  12/2/1996    12/7/1996     5          MINISTER  Groom  5/20/1956
## 5   B230p994  12/9/1996   12/14/1996     5          MINISTER  Groom 12/14/1966
## 6  B230p1209 12/26/1996   12/26/1996     0 MARRIAGE OFFICIAL  Groom  2/21/1970
##        age college     zodiacs
## 1 32.60274       7       Aries
## 2 32.29041       0         Leo
## 3 34.79178       3      Pisces
## 4 40.57808       4      Gemini
## 5 30.02192       0 Saggitarius
## 6 26.86301       0      Pisces

# check missing data: count the NA values in each column
colSums(is.na(data1))

##    bookpageID       appdate  ceremonydate         delay officialTitle 
##             0             0             0             1             0 
##        person           dob           age       college       zodiacs 
##             0             0             1            11             0

# drop every row that contains at least one NA
dataz <- drop_na(data1)

# check missing values again
colSums(is.na(dataz))

##    bookpageID       appdate  ceremonydate         delay officialTitle 
##             0             0             0             0             0 
##        person           dob           age       college       zodiacs 
##             0             0             0             0             0
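
The same clean-up can also be done without extra packages. The following is a minimal sketch on a small, hypothetical data frame (`toy` and its values are made up for illustration); it uses base R's `complete.cases()` and also reports how many rows are lost.

```r
# Toy data frame (hypothetical values) with a few missing entries
toy <- data.frame(delay  = c(11, 0, NA, 5),
                  age    = c(32.6, NA, 34.8, 40.6),
                  person = c("Groom", "Groom", "Bride", "Bride"))

# Count missing values per column
colSums(is.na(toy))

# Keep only the rows that have no missing value in any column
toy_complete <- toy[complete.cases(toy), ]
nrow(toy_complete)

# Number of rows removed
nrow(toy) - nrow(toy_complete)
```

Checking the number of removed rows is worthwhile, because dropping rows with any missing value can discard a large part of a data set.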

Contingency Table

A contingency table summarizes two or more categorical variables by showing how many observations fall into each combination of their categories.

Data

mtcars data from R

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

A data frame with 32 observations on 11 (numeric) variables.

[, 1] mpg Miles/(US) gallon

[, 2] cyl Number of cylinders

[, 3] disp Displacement (cu.in.)

[, 4] hp Gross horsepower

[, 5] drat Rear axle ratio

[, 6] wt Weight (1000 lbs)

[, 7] qsec 1/4 mile time

[, 8] vs Engine (0 = V-shaped, 1 = straight)

[, 9] am Transmission (0 = automatic, 1 = manual)

[,10] gear Number of forward gears

[,11] carb Number of carburetors

# reading the data
data(mtcars)
colnames(mtcars)

##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

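# Contingency table: 2-way relationship between cylinders and gears
# with() evaluates the expression inside mtcars, so attach() is not needed
t0 = with(mtcars, table(cyl, gear))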
t0

##    gear
## cyl  3  4  5
##   4  1  8  2
##   6  2  4  1
##   8 12  0  2

t1 = xtabs(~ cyl + gear, data = mtcars)
t1

##    gear
## cyl  3  4  5
##   4  1  8  2
##   6  2  4  1
##   8 12  0  2

t2 = ftable(gear ~ cyl, data = mtcars)
t2

##     gear  3  4  5
## cyl              
## 4         1  8  2
## 6         2  4  1
## 8        12  0  2
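
Counts are often easier to read together with totals and proportions. A short sketch using base R's `addmargins()` and `prop.table()` on the same `mtcars` variables (the object name `tab` is chosen here for illustration):

```r
# Same 2-way table as above, with explicit dimension names
tab <- table(cyl = mtcars$cyl, gear = mtcars$gear)

# Add row and column totals
addmargins(tab)

# Proportions of the grand total
round(prop.table(tab), 2)

# Row proportions: distribution of gears within each cylinder group
round(prop.table(tab, margin = 1), 2)
```

With `margin = 1` each row sums to 1; `margin = 2` would give column proportions instead.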

Frequency Table

A frequency table is a table that lists items and shows the number of times the items occur.

library(kableExtra)
library(janitor)
table2 = tabyl(dataz, officialTitle) %>% 
    adorn_totals("row") %>%
    adorn_pct_formatting(digits = 0)
names(table2) = c("Official Title", "Frequency", "Percent")

kbl(table2, 
    caption = "Table 1: Distribution of participants by official title") %>%
    kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")

Table 1: Distribution of participants by official title
Official Title	Frequency	Percent
BISHOP	1	1%
CATHOLIC PRIEST	2	2%
CHIEF CLERK	2	2%
CIRCUIT JUDGE	2	2%
ELDER	2	2%
MARRIAGE OFFICIAL	40	45%
MINISTER	19	22%
PASTOR	20	23%
Total	88	100%

library(tidyverse)
datatable1 <- dataz %>% count(officialTitle)
datatable1

##       officialTitle  n
## 1            BISHOP  1
## 2   CATHOLIC PRIEST  2
## 3       CHIEF CLERK  2
## 4    CIRCUIT JUDGE   2
## 5             ELDER  2
## 6 MARRIAGE OFFICIAL 40
## 7          MINISTER 19
## 8            PASTOR 20
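
A frequency table with percentages can also be built with base R only. The sketch below uses the built-in `mtcars` data instead of `dataz` so that it runs on its own; `freq_table` is a name chosen for this example.

```r
# Count how often each number of cylinders occurs
counts <- table(mtcars$cyl)

# Combine counts and percentages in one data frame
freq_table <- data.frame(Cylinders = names(counts),
                         Frequency = as.vector(counts),
                         Percent   = round(100 * as.vector(counts) / sum(counts), 1))
freq_table
```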

Bar Chart

Colours in R: the built-in colour names, such as "pink" and "coral" used below, can be listed with colors().

A bar chart is useful for displaying categorical data (nominal and ordinal). It can also be used to present data from contingency tables or data summary tables.

library(ggplot2)
ggplot(dataz, aes(x = zodiacs)) +              # bar chart of `zodiacs`
  geom_bar(fill = "pink", color = "black") +   # fill and outline colors
  theme_minimal() +                            # background theme
  labs(x = "Zodiacs",                          # axis labels and title
       y = "Frequency",
       title = "Zodiacs")

ggplot(dataz, aes(x = zodiacs)) +              # bar chart of `zodiacs`
  geom_bar(fill = "coral", color = "black") +  # fill and outline colors
  theme_minimal() +                            # background theme
  labs(x = "Zodiacs",                          # axis labels and title
       y = "Frequency",
       title = "Zodiacs") +
  coord_flip()                                 # horizontal bars

Pie Chart

A pie chart is used to display categorical data, especially nominal data. It shows how the data are distributed across groups, with the slices adding up to 100%.

library(tidyverse)
plotdata <- dataz %>%
  count(zodiacs) %>%
  arrange(desc(zodiacs)) %>%
  mutate(prop = round(n*100/sum(n), 1),
         lab.ypos = cumsum(prop) - 0.5*prop)

# Pie Chart
ggplot(plotdata, aes(x = "", y = prop, fill = zodiacs)) +
  geom_bar(width = 1, stat = "identity", color = "white") +
  coord_polar("y", start = 0)+
  geom_text(aes(y = lab.ypos, label = prop), color = "black")+
  scale_fill_manual(values = rainbow(13)) +
  theme_void()+
  labs(title = "Percentage of Zodiacs")
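
The `prop` and `lab.ypos` columns computed before the plot decide the size of each slice and where its label is placed. The arithmetic can be checked on a small, hypothetical set of counts (the categories A, B, C and their counts are made up):

```r
# Hypothetical category counts, for illustration only
n <- c(A = 10, B = 30, C = 60)

# Share of each category in percent
prop <- round(n * 100 / sum(n), 1)

# Midpoint of each slice along the stacked 0-100 axis:
# end of the slice (cumulative sum) minus half of its own width
lab_ypos <- cumsum(prop) - 0.5 * prop

data.frame(category = names(n), n = n, prop = prop, lab_ypos = lab_ypos)
```

Here the slices end at 10, 40 and 100, so the labels sit at 5, 25 and 70.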

Histogram

A histogram is a graph of the frequency distribution of a numerical variable. The bar heights can show either the frequency or the relative frequency of each bin.

# dataz
ggplot(dataz, aes(x = age)) +
  geom_histogram(fill = "coral1",
                 color = "black",
                 bins = 15) +
  theme_minimal() +
  labs(title = "Age",
       x = "Age",
       y = "Frequency") # the distribution is skewed to the right

# iris data from R
ggplot(iris, aes(x = Sepal.Width)) +
  geom_histogram(fill = "green",
                 color = "black",
                 bins = 10) +
  theme_minimal() +
  labs(title = "Sepal Width",
       x = "Sepal Width",
       y = "Frequency") # roughly bell-shaped, close to a normal curve
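
The numbers behind a histogram can be inspected without drawing it. A minimal sketch with base R's `hist(plot = FALSE)`; note that `breaks = 10` is only a suggestion to R, so the resulting bins need not match the ggplot2 bins above exactly.

```r
# Compute the bins without plotting
h <- hist(iris$Sepal.Width, breaks = 10, plot = FALSE)

freq     <- h$counts            # frequency of each bin
rel_freq <- freq / sum(freq)    # relative frequency of each bin

data.frame(lower     = head(h$breaks, -1),
           upper     = tail(h$breaks, -1),
           frequency = freq,
           relative  = round(rel_freq, 3))
```

The frequencies add up to the number of observations and the relative frequencies add up to 1, so both versions of the histogram have the same shape.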

Dot plot

A dot plot shows the distribution of the original data as points.
It is used to see how often each value occurs.

ggplot(dataz, aes(x = age)) +
  geom_dotplot(fill = "blue",
               binwidth = 2) +
  theme_minimal() +                                 
  labs(title = "Age",
       y = "Proportions",
       x = "Age",
       subtitle = "binwidth = 2")

Stem & leaf plot

A stem and leaf plot is an effective way of displaying the data values directly.
The shape of the plot may indicate whether the data set is left-skewed, right-skewed, or roughly symmetric.
Long tails in the plot may also indicate the presence of outliers, which sit in the tail region.
In R we can generate a stem and leaf plot for a data set using the stem() function; the aplpack package adds stem.leaf.backback() for comparing two data sets back to back.

library(aplpack)
variety_1 <-   c(20,12,39,38,
                 41,43,51,52,
                 59,55,53,59,
                 50,58,35,38,
                 23,32,43,53)
variety_2 <-   c(18,45,62,59,
                 53,25,13,57,
                 42,55,13,57,
                 42,55,56,38,
                 41,36,50,62,
                 45,55)
stem.leaf.backback(variety_1, variety_2, m = 1)

## _____________________________________
##   1 | 2: represents 12, leaf unit: 1 
##        variety_1     variety_2   
## _____________________________________
##    1           2| 1 |338         3   
##    3          30| 2 |5           4   
##    8       98852| 3 |68          6   
##   (3)        331| 4 |12255      (5)  
##    9   998533210| 5 |035556779  (9)  
##                 | 6 |22          2   
##                 | 7 |                
## _____________________________________
## n:            20     22          
## _____________________________________

stem(variety_1)

## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##   1 | 2
##   2 | 03
##   3 | 25889
##   4 | 133
##   5 | 012335899

Box Plot

A box plot presents the data through the five-number summary (Min, Q1, Q2, Q3, Max).

library(ggplot2)
# airquality ships with R; show only the first rows instead of all 153
head(datasets::airquality)

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

# box plot of temperature for each month (Month is treated as a category)
ggplot(data = airquality, aes(x = as.character(Month), y = Temp)) +
    geom_boxplot(fill = "steelblue")

# using a different color for each of the five months
ggplot(data = airquality, aes(x = as.character(Month), y = Temp)) +
    geom_boxplot(fill = c("steelblue", "red", "purple", "green", "orange"))
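
The five values that a box plot is built from can be computed directly. The sketch below uses base R's `fivenum()` on the `Temp` column; the labels given to the result are added here for readability.

```r
temp <- airquality$Temp   # Temp has no missing values in this data set

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(temp)
names(five) <- c("Min", "Q1", "Q2", "Q3", "Max")
five

# quantile() covers the same five positions; Q1 and Q3 can differ slightly
# from the hinges because a different interpolation rule is used
quantile(temp, probs = c(0, 0.25, 0.5, 0.75, 1))
```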

Data Summary Techniques

1. Central Tendency

Mean

• Center of mass (centroid) of the data

• For a population, the mean is denoted \(\mu\).

• For a sample, the mean is denoted \(\bar{x}\).

• Used for numerical data

• Sensitive to outliers (not robust): a single extreme value can shift the mean considerably

dist <- c(12.5, 29.9, 14.8, 18.7, 7.6, 16.2, 16.5, 27.4, 12.1, 17.5)
mean1 <- sum(dist) / length(dist)   # sum of the values divided by their count
mean1

## [1] 17.32

# or using function `mean`

mean(dist)

## [1] 17.32

Median

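• Denoted Q2 (the second quartile)

• The middle observation of the sorted data (the average of the two middle values when the number of observations is even)

• Splits the data into two halves of 50% each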

median(dist)

## [1] 16.35
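
How differently the mean and the median react to an extreme value can be checked with a quick experiment. The value 500 below is a hypothetical outlier added only for illustration.

```r
dist <- c(12.5, 29.9, 14.8, 18.7, 7.6, 16.2, 16.5, 27.4, 12.1, 17.5)

# Add one extreme, hypothetical observation
dist_outlier <- c(dist, 500)

# The mean moves a lot ...
mean(dist)
mean(dist_outlier)

# ... while the median barely changes
median(dist)
median(dist_outlier)
```

This is why the median is often preferred as the measure of center for skewed data or data with outliers.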

Mode

The value of the observation that occurs most often.

library(DescTools)
mode1<-Mode(iris$Sepal.Width)
mode1

## [1] 3
## attr(,"freq")
## [1] 26
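
If the DescTools package is not available, the mode can be found with base R by counting the values. `base_mode()` below is a hypothetical helper written for this example, not a built-in function.

```r
# Hypothetical helper: returns the most frequent value(s) of a vector
base_mode <- function(x) {
  counts <- table(x)
  modes <- names(counts)[as.vector(counts) == max(counts)]
  # table() stores the values as character names, so convert back for numeric input
  if (is.numeric(x)) as.numeric(modes) else modes
}

base_mode(iris$Sepal.Width)
base_mode(c("a", "b", "b", "c", "c"))   # ties: both "b" and "c" are returned
```

Unlike the mean and the median, the mode also works for categorical data, and a data set can have more than one mode.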

Quartiles

• Values that divide the sorted data into 4 equal parts

• Q0 = min and Q4 = max

• Q1 (the first quartile) is the value with 25% of the data to its left and 75% to its right

• Q3 (the third quartile) is the value with 75% of the data to its left and 25% to its right

• Robust against outliers

# Q1 and Q3
quantile(dist,probs=c(0.25,0.75))

##    25%    75% 
## 13.075 18.400

# Q0 and Q4
min(dist)

## [1] 7.6

max(dist)

## [1] 29.9
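
The quartiles reported by `quantile()` are interpolated between neighbouring sorted values. The sketch below reproduces R's default rule (`type = 7`) by hand; `quartile_by_hand()` is a hypothetical helper used only to show the arithmetic.

```r
dist <- c(12.5, 29.9, 14.8, 18.7, 7.6, 16.2, 16.5, 27.4, 12.1, 17.5)
x <- sort(dist)
n <- length(x)

# Hypothetical helper reproducing R's default quantile rule (type = 7)
quartile_by_hand <- function(p) {
  h  <- (n - 1) * p + 1        # fractional position in the sorted data
  lo <- floor(h)               # index of the value just below the position
  hi <- ceiling(h)             # index of the value just above the position
  x[lo] + (h - lo) * (x[hi] - x[lo])   # linear interpolation between the two
}

quartile_by_hand(0.25)   # position 3.25: between the 3rd and 4th sorted values
quartile_by_hand(0.75)   # position 7.75: between the 7th and 8th sorted values
```

Other software and textbooks may use a different rule, so quartiles computed by hand can differ slightly from R's output; the `type` argument of `quantile()` selects the rule.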

2. Measures of Dispersion

Measures of dispersion give a QUANTITATIVE MEASURE of how spread out or how tightly grouped the data are.
Variation is usually defined in terms of distance:

How far the points are from each other
How far the points are from the mean
How well these values represent the overall data

Range

Range = Max(data)-Min(data)

range1<-max(dist)-min(dist)
range1

## [1] 22.3

Interquartile range (IQR)

The interquartile range explains the spread of the middle half of the distribution.

IQR = Q3 - Q1

Quartiles segment any distribution that’s ordered from low to high into four equal parts.

# named iqr1 so that the built-in IQR() function is not shadowed
iqr1 <- quantile(dist, probs = c(0.75)) - quantile(dist, probs = c(0.25))
iqr1

##   75% 
## 5.325
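
R also has a built-in `IQR()` function that gives the same value. The sketch below additionally applies the common 1.5 × IQR rule of thumb for flagging possible outliers; that rule is an assumption added here for illustration and is not part of the definition of the IQR.

```r
dist <- c(12.5, 29.9, 14.8, 18.7, 7.6, 16.2, 16.5, 27.4, 12.1, 17.5)

q   <- quantile(dist, probs = c(0.25, 0.75))
iqr <- IQR(dist)               # same as Q3 - Q1

# Rule-of-thumb fences: values outside them are flagged as possible outliers
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr

flagged <- dist[dist < lower_fence | dist > upper_fence]
flagged
```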

Deviation

The difference between each observation and the mean.

deviation<-dist-mean(dist)
deviation

##  [1] -4.82 12.58 -2.52  1.38 -9.72 -1.12 -0.82 10.08 -5.22  0.18

Variance

The variance is a measure of variability: it describes the degree of spread in a data set. The more spread out the data are around the mean, the larger the variance.

Formula: the sum of the squared deviations from the mean, divided by \(n - 1\) for a sample: \(s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2\). This is the sample variance that var() computes.

var(dist)

## [1] 46.11511
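
The result of `var()` can be reproduced step by step from the deviations. In this sketch the names `sample_var` and `pop_var` are chosen for illustration; the population version (dividing by n) is shown only for comparison.

```r
dist <- c(12.5, 29.9, 14.8, 18.7, 7.6, 16.2, 16.5, 27.4, 12.1, 17.5)
n <- length(dist)

# Deviations from the mean; they always sum to zero (up to rounding error)
deviation <- dist - mean(dist)
sum(deviation)

# Sample variance: sum of squared deviations divided by n - 1
sample_var <- sum(deviation^2) / (n - 1)
sample_var

# Population variance: divides by n instead, so it is slightly smaller
pop_var <- sum(deviation^2) / n
pop_var

# The standard deviation is the square root of the variance
sqrt(sample_var)
```

Because the deviations sum to zero, they are squared before averaging; taking the square root afterwards brings the measure back to the original units.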

Standard Deviation

Standard Deviation is the square root of variance.

sd(dist)

## [1] 6.790811

Summary in R

summary(dist)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.60   13.07   16.35   17.32   18.40   29.90

Summary Table

library(tidyverse)
library(kableExtra)

table1 <- dataz %>% 
    group_by(zodiacs) %>% 
    summarise(Frequency = n(),
              Minimum = min(age),
              Maximum = max(age),
              Median = median(age),
              Mean=mean(age),
              IQR = diff(quantile(age, c(1, 3)/4)))
names(table1)[1] = c("Zodiacs")

kbl(table1, digits = 2, 
    caption = "Table 2: Descriptive statistics of age by zodiacs.") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")

Table 2: Descriptive statistics of age by zodiacs.
Zodiacs	Frequency	Minimum	Maximum	Median	Mean	IQR
Aquarius	7	20.27	42.17	23.38	28.27	10.45
Aries	9	20.04	52.44	33.98	34.00	17.83
Cancer	8	16.27	67.58	40.42	38.73	12.07
Capricorn	2	23.99	37.84	30.92	30.92	6.93
Gemini	9	18.46	74.25	34.01	42.09	29.81
Leo	6	18.28	68.04	29.36	34.70	19.62
Libra	6	18.36	45.02	22.30	27.59	16.85
Pisces	13	18.64	55.64	26.86	30.28	14.02
Saggitarius	9	21.34	44.85	37.55	34.11	16.44
Scorpio	6	18.40	72.80	28.93	36.13	13.34
Taurus	5	17.02	52.59	39.58	36.49	25.35
Virgo	8	20.22	50.07	27.74	31.02	18.84
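
A similar grouped summary can be produced with base R only. The sketch below uses the built-in `iris` data (sepal width by species) rather than `dataz`, so that it runs on its own; `stats_by_group` is a name chosen for this example.

```r
# Split the variable by group, summarise each group, then stack the results
stats_by_group <- do.call(rbind, lapply(
  split(iris$Sepal.Width, iris$Species),
  function(x) {
    data.frame(Frequency = length(x),
               Minimum   = min(x),
               Maximum   = max(x),
               Median    = median(x),
               Mean      = mean(x),
               IQR       = IQR(x))
  }
))

round(stats_by_group, 2)
```

Each row corresponds to one group, in the same layout as the table above.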