R01 STA1511
Introduction to R
R is a language and environment for statistical computing and graphics.
- Download R-base
https://cran.r-project.org/bin/windows/base/
- Download RStudio
https://www.rstudio.com/products/rstudio/download/
Statistics
Statistics -> estimators of parameters.
Parameters -> numerical measures that describe the population of interest.
Statistics -> numerical measures computed from a sample.
Sample -> a subset of the population.
Population -> the entire set of objects that is the focus of our observation.
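To make the distinction between a parameter and a statistic concrete, here is a minimal sketch in base R. The simulated population, its size, and the names `population`, `mu`, `smp`, and `x_bar` are assumptions made for illustration only; they are not part of the course data.

```r
# Hypothetical population: 10,000 simulated values (an assumption for illustration)
set.seed(1)
population <- rnorm(10000, mean = 50, sd = 10)

# Parameter: a numerical measure of the whole population
mu <- mean(population)

# Sample: a subset of the population
smp <- sample(population, size = 100)

# Statistic: the same measure computed on the sample; it estimates mu
x_bar <- mean(smp)

c(parameter = mu, statistic = x_bar)
```

In practice the population is rarely observed in full, so only the statistic can be computed and it serves as the estimate of the parameter.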
Descriptive Statistics
It is a technique of presenting and summarizing data so that it becomes information that is easy to understand.
Import Data
The data can be downloaded through this link.
Tutorial: Import data to R Studio. (Click here)
library(readxl)
library(tidyr)
data1 <- read.csv("D:/MATERI KULIAH S2 IPB/ASPRAK 2/Example_data.csv")
head(data1)
## bookpageID appdate ceremonydate delay officialTitle person dob
## 1 B230p539 10/29/1996 11/9/1996 11 CIRCUIT JUDGE Groom 4/11/1964
## 2 B230p677 11/12/1996 11/12/1996 0 MARRIAGE OFFICIAL Groom 8/6/1964
## 3 B230p766 11/19/1996 11/27/1996 8 MARRIAGE OFFICIAL Groom 2/20/1962
## 4 B230p892 12/2/1996 12/7/1996 5 MINISTER Groom 5/20/1956
## 5 B230p994 12/9/1996 12/14/1996 5 MINISTER Groom 12/14/1966
## 6 B230p1209 12/26/1996 12/26/1996 0 MARRIAGE OFFICIAL Groom 2/21/1970
## age college zodiacs
## 1 32.60274 7 Aries
## 2 32.29041 0 Leo
## 3 34.79178 3 Pisces
## 4 40.57808 4 Gemini
## 5 30.02192 0 Saggitarius
## 6 26.86301 0 Pisces
#check missing data
colSums(is.na(data1))
## bookpageID appdate ceremonydate delay officialTitle
## 0 0 0 1 0
## person dob age college zodiacs
## 0 0 1 11 0
# drop rows that contain missing values (NA)
dataz <- drop_na(data1)
# check missing values again
colSums(is.na(dataz))
## bookpageID appdate ceremonydate delay officialTitle
## 0 0 0 0 0
## person dob age college zodiacs
## 0 0 0 0 0
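The same check-then-drop workflow can be sketched in base R without `tidyr`. The small data frame `toy` below is hypothetical and only mimics two columns of the example data; `complete.cases()` keeps the rows that have no missing value in any column, which is roughly what `drop_na()` does when called without column arguments.

```r
# Hypothetical toy data with missing values (not the course data set)
toy <- data.frame(age = c(32.6, NA, 40.5, 26.9),
                  college = c(7, 0, NA, 0))

# Count missing values per column
na_counts <- colSums(is.na(toy))
na_counts

# Keep only rows with no missing value in any column
toy_complete <- toy[complete.cases(toy), ]
toy_complete
```

Dropping rows is the simplest option, but it discards every row that has even one missing value, so it is worth checking how many rows are lost.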
Contingency Table
A contingency table summarizes two or more categorical variables by counting how many observations fall into each combination of their categories.
Data
mtcars data from R
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
A data frame with 32 observations on 11 (numeric) variables.
[, 1] mpg Miles/(US) gallon
[, 2] cyl Number of cylinders
[, 3] disp Displacement (cu.in.)
[, 4] hp Gross horsepower
[, 5] drat Rear axle ratio
[, 6] wt Weight (1000 lbs)
[, 7] qsec 1/4 mile time
[, 8] vs Engine (0 = V-shaped, 1 = straight)
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears
[,11] carb Number of carburetors
# reading the data
data(mtcars)
colnames(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
attach(mtcars)
# Contingency Table – 2-way relationships
t0 = table(cyl, gear)
t0
## gear
## cyl 3 4 5
## 4 1 8 2
## 6 2 4 1
## 8 12 0 2
t1 = xtabs(~ cyl + gear, data = mtcars)
t1
## gear
## cyl 3 4 5
## 4 1 8 2
## 6 2 4 1
## 8 12 0 2
t2 = ftable(gear ~ cyl, data = mtcars)
t2
## gear 3 4 5
## cyl
## 4 1 8 2
## 6 2 4 1
## 8 12 0 2
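A contingency table is often easier to read with totals and proportions added. The following is a minimal sketch using base R's `addmargins()` and `prop.table()` on the same `mtcars` variables; the object names `tab`, `tab_totals`, and `row_prop` are illustrative.

```r
# 2-way contingency table of cylinders by gears
tab <- table(cyl = mtcars$cyl, gear = mtcars$gear)

# Add row and column totals
tab_totals <- addmargins(tab)
tab_totals

# Row proportions: share of each gear count within each cylinder group
row_prop <- prop.table(tab, margin = 1)
round(row_prop, 2)
```

Using `margin = 2` instead would give column proportions, that is, the share of each cylinder group within each gear count.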
Frequency Table
A frequency table is a table that lists items and shows the number of times the items occur.
library(kableExtra)
library(janitor)
table2 = tabyl(dataz, officialTitle) %>%
  adorn_totals("row") %>%
  adorn_pct_formatting(digits = 0)
names(table2) = c("Official Title", "Frequency", "Percent")
kbl(table2,
caption = "Table 1: Distribution of participants by official title") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
Official Title | Frequency | Percent |
---|---|---|
BISHOP | 1 | 1% |
CATHOLIC PRIEST | 2 | 2% |
CHIEF CLERK | 2 | 2% |
CIRCUIT JUDGE | 2 | 2% |
ELDER | 2 | 2% |
MARRIAGE OFFICIAL | 40 | 45% |
MINISTER | 19 | 22% |
PASTOR | 20 | 23% |
Total | 88 | 100% |
library(tidyverse)
datatable1 <- dataz %>% count(officialTitle)
datatable1
## officialTitle n
## 1 BISHOP 1
## 2 CATHOLIC PRIEST 2
## 3 CHIEF CLERK 2
## 4 CIRCUIT JUDGE 2
## 5 ELDER 2
## 6 MARRIAGE OFFICIAL 40
## 7 MINISTER 19
## 8 PASTOR 20
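For comparison, a frequency table with percentages can also be built in base R alone. This is a sketch on a short hypothetical vector `titles`, not on the course data; `table()` counts each item and `prop.table()` converts the counts to proportions.

```r
# Hypothetical categorical vector (an assumption for illustration)
titles <- c("MINISTER", "PASTOR", "MINISTER", "BISHOP", "PASTOR", "MINISTER")

# Number of times each item occurs
freq <- table(titles)

# Percentage of the total for each item
pct <- round(100 * prop.table(freq), 1)

# Combine into one data frame
freq_table <- data.frame(title = names(freq),
                         n = as.vector(freq),
                         percent = as.vector(pct))
freq_table
```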
Bar Chart
Colours in R:
A bar chart is useful for displaying categorical data (nominal and ordinal). It can also be used to present data from contingency tables or summary tables.
library(ggplot2)
ggplot(dataz, aes(x = zodiacs)) + # diagram view of `Zodiacs`
geom_bar(fill = "pink",color= "black") + # colors
theme_minimal() + # background theme
labs(x = "Zodiacs", # axis labels and title
y = "Frequency",
title = "Zodiacs")
ggplot(dataz, aes(x = zodiacs)) + # diagram view of `Zodiacs`
geom_bar(fill = "coral",color= "black") + # colors
theme_minimal() + # background theme
labs(x = "Zodiacs", # axis labels and title
y = "Frequency",
title = "Zodiacs") +
coord_flip()
Pie Chart
A pie chart is used to display categorical data, especially nominal data. It shows how the data are distributed across groups as shares of the whole (totalling 100%).
library(tidyverse)
plotdata <- dataz %>%
  count(zodiacs) %>%
  arrange(desc(zodiacs)) %>%
  mutate(prop = round(n*100/sum(n), 1),
         lab.ypos = cumsum(prop) - 0.5*prop)
# Pie Chart
ggplot(plotdata, aes(x = "", y = prop, fill = zodiacs)) +
geom_bar(width = 1, stat = "identity", color = "white") +
coord_polar("y", start = 0)+
geom_text(aes(y = lab.ypos, label = prop), color = "black")+
scale_fill_manual(values = rainbow(13)) +
theme_void()+
labs(title = "Percentage of Zodiacs")
Histogram
A histogram is a graph of the frequency distribution of a numerical variable. The bar heights can show either the frequencies or the relative frequencies.
#dataz
ggplot(dataz, aes(x = age)) +
geom_histogram(fill = "coral1",
color = "black",
bins = 15) +
theme_minimal() +
labs(title="Age",
x = "Age",
y = "Frequency") #skewed to right
#data iris from R
ggplot(iris, aes(x = Sepal.Width)) +
geom_histogram(fill = "green",
color = "black",
bins = 10) +
theme_minimal() +
labs(title="Sepal Width",
x = "Sepal Width",
y = "Frequency") #normal curve
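The difference between frequencies and relative frequencies can be sketched with base R's `hist()` called with `plot = FALSE`, which returns the bin counts without drawing anything. The names `h` and `rel_freq` are illustrative, and `breaks = 10` is only a suggestion to `hist()`, so the actual number of bins may differ.

```r
# Bin the data without plotting
h <- hist(iris$Sepal.Width, breaks = 10, plot = FALSE)

# Frequencies: number of observations in each bin
h$counts

# Relative frequencies: each count divided by the total
rel_freq <- h$counts / sum(h$counts)
round(rel_freq, 3)
```

The shape of the histogram is the same in both cases; only the scale of the vertical axis changes.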
Dot plot
A dot plot shows the distribution of the original data as points.
It is used to see how often each value occurs.
ggplot(dataz, aes(x = age)) +
geom_dotplot(fill = "blue",
binwidth = 2) +
theme_minimal() +
labs(title = "Age",
y = "Proportions",
x = "Age",
subtitle = "binwidth = 2")
Stem & leaf plot
A stem-and-leaf plot is a very effective way of visually representing the data directly.
The shape of the plot may indicate whether the data set is left-skewed, right-skewed, or centered.
Long tails in the plot may also indicate the presence of outliers, which are located in the tail region.
In R, we can generate a stem-and-leaf plot for a data set using the stem() function.
library(aplpack)
variety_1 <- c(20,12,39,38,
               41,43,51,52,
               59,55,53,59,
               50,58,35,38,
               23,32,43,53)
variety_2 <- c(18,45,62,59,
               53,25,13,57,
               42,55,13,57,
               42,55,56,38,
               41,36,50,62,
               45,55)
stem.leaf.backback(variety_1, variety_2, m = 1)
## _____________________________________
## 1 | 2: represents 12, leaf unit: 1
## variety_1 variety_2
## _____________________________________
## 1 2| 1 |338 3
## 3 30| 2 |5 4
## 8 98852| 3 |68 6
## (3) 331| 4 |12255 (5)
## 9 998533210| 5 |035556779 (9)
## | 6 |22 2
## | 7 |
## _____________________________________
## n: 20 22
## _____________________________________
stem(variety_1)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 1 | 2
## 2 | 03
## 3 | 25889
## 4 | 133
## 5 | 012335899
Box Plot
A box plot presents the data through the five-number summary (Min, Q1, Q2, Q3, Max).
library(ggplot2)
datasets::airquality
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
## 7 23 299 8.6 65 5 7
## 8 19 99 13.8 59 5 8
## 9 8 19 20.1 61 5 9
## 10 NA 194 8.6 69 5 10
## 11 7 NA 6.9 74 5 11
## 12 16 256 9.7 69 5 12
## 13 11 290 9.2 66 5 13
## 14 14 274 10.9 68 5 14
## 15 18 65 13.2 58 5 15
## 16 14 334 11.5 64 5 16
## 17 34 307 12.0 66 5 17
## 18 6 78 18.4 57 5 18
## 19 30 322 11.5 68 5 19
## 20 11 44 9.7 62 5 20
## 21 1 8 9.7 59 5 21
## 22 11 320 16.6 73 5 22
## 23 4 25 9.7 61 5 23
## 24 32 92 12.0 61 5 24
## 25 NA 66 16.6 57 5 25
## 26 NA 266 14.9 58 5 26
## 27 NA NA 8.0 57 5 27
## 28 23 13 12.0 67 5 28
## 29 45 252 14.9 81 5 29
## 30 115 223 5.7 79 5 30
## 31 37 279 7.4 76 5 31
## 32 NA 286 8.6 78 6 1
## 33 NA 287 9.7 74 6 2
## 34 NA 242 16.1 67 6 3
## 35 NA 186 9.2 84 6 4
## 36 NA 220 8.6 85 6 5
## 37 NA 264 14.3 79 6 6
## 38 29 127 9.7 82 6 7
## 39 NA 273 6.9 87 6 8
## 40 71 291 13.8 90 6 9
## 41 39 323 11.5 87 6 10
## 42 NA 259 10.9 93 6 11
## 43 NA 250 9.2 92 6 12
## 44 23 148 8.0 82 6 13
## 45 NA 332 13.8 80 6 14
## 46 NA 322 11.5 79 6 15
## 47 21 191 14.9 77 6 16
## 48 37 284 20.7 72 6 17
## 49 20 37 9.2 65 6 18
## 50 12 120 11.5 73 6 19
## 51 13 137 10.3 76 6 20
## 52 NA 150 6.3 77 6 21
## 53 NA 59 1.7 76 6 22
## 54 NA 91 4.6 76 6 23
## 55 NA 250 6.3 76 6 24
## 56 NA 135 8.0 75 6 25
## 57 NA 127 8.0 78 6 26
## 58 NA 47 10.3 73 6 27
## 59 NA 98 11.5 80 6 28
## 60 NA 31 14.9 77 6 29
## 61 NA 138 8.0 83 6 30
## 62 135 269 4.1 84 7 1
## 63 49 248 9.2 85 7 2
## 64 32 236 9.2 81 7 3
## 65 NA 101 10.9 84 7 4
## 66 64 175 4.6 83 7 5
## 67 40 314 10.9 83 7 6
## 68 77 276 5.1 88 7 7
## 69 97 267 6.3 92 7 8
## 70 97 272 5.7 92 7 9
## 71 85 175 7.4 89 7 10
## 72 NA 139 8.6 82 7 11
## 73 10 264 14.3 73 7 12
## 74 27 175 14.9 81 7 13
## 75 NA 291 14.9 91 7 14
## 76 7 48 14.3 80 7 15
## 77 48 260 6.9 81 7 16
## 78 35 274 10.3 82 7 17
## 79 61 285 6.3 84 7 18
## 80 79 187 5.1 87 7 19
## 81 63 220 11.5 85 7 20
## 82 16 7 6.9 74 7 21
## 83 NA 258 9.7 81 7 22
## 84 NA 295 11.5 82 7 23
## 85 80 294 8.6 86 7 24
## 86 108 223 8.0 85 7 25
## 87 20 81 8.6 82 7 26
## 88 52 82 12.0 86 7 27
## 89 82 213 7.4 88 7 28
## 90 50 275 7.4 86 7 29
## 91 64 253 7.4 83 7 30
## 92 59 254 9.2 81 7 31
## 93 39 83 6.9 81 8 1
## 94 9 24 13.8 81 8 2
## 95 16 77 7.4 82 8 3
## 96 78 NA 6.9 86 8 4
## 97 35 NA 7.4 85 8 5
## 98 66 NA 4.6 87 8 6
## 99 122 255 4.0 89 8 7
## 100 89 229 10.3 90 8 8
## 101 110 207 8.0 90 8 9
## 102 NA 222 8.6 92 8 10
## 103 NA 137 11.5 86 8 11
## 104 44 192 11.5 86 8 12
## 105 28 273 11.5 82 8 13
## 106 65 157 9.7 80 8 14
## 107 NA 64 11.5 79 8 15
## 108 22 71 10.3 77 8 16
## 109 59 51 6.3 79 8 17
## 110 23 115 7.4 76 8 18
## 111 31 244 10.9 78 8 19
## 112 44 190 10.3 78 8 20
## 113 21 259 15.5 77 8 21
## 114 9 36 14.3 72 8 22
## 115 NA 255 12.6 75 8 23
## 116 45 212 9.7 79 8 24
## 117 168 238 3.4 81 8 25
## 118 73 215 8.0 86 8 26
## 119 NA 153 5.7 88 8 27
## 120 76 203 9.7 97 8 28
## 121 118 225 2.3 94 8 29
## 122 84 237 6.3 96 8 30
## 123 85 188 6.3 94 8 31
## 124 96 167 6.9 91 9 1
## 125 78 197 5.1 92 9 2
## 126 73 183 2.8 93 9 3
## 127 91 189 4.6 93 9 4
## 128 47 95 7.4 87 9 5
## 129 32 92 15.5 84 9 6
## 130 20 252 10.9 80 9 7
## 131 23 220 10.3 78 9 8
## 132 21 230 10.9 75 9 9
## 133 24 259 9.7 73 9 10
## 134 44 236 14.9 81 9 11
## 135 21 259 15.5 76 9 12
## 136 28 238 6.3 77 9 13
## 137 9 24 10.9 71 9 14
## 138 13 112 11.5 71 9 15
## 139 46 237 6.9 78 9 16
## 140 18 224 13.8 67 9 17
## 141 13 27 10.3 76 9 18
## 142 24 238 10.3 68 9 19
## 143 16 201 8.0 82 9 20
## 144 13 238 12.6 64 9 21
## 145 23 14 9.2 71 9 22
## 146 36 139 10.3 81 9 23
## 147 7 49 10.3 69 9 24
## 148 14 20 16.6 63 9 25
## 149 30 193 6.9 70 9 26
## 150 NA 145 13.2 77 9 27
## 151 14 191 14.3 75 9 28
## 152 18 131 8.0 76 9 29
## 153 20 223 11.5 68 9 30
ggplot(data = airquality, aes(x=as.character(Month), y=Temp)) +
geom_boxplot(fill=c('steelblue')) #boxplot of temperature value every month
#using different color
ggplot(data = airquality, aes(x=as.character(Month), y=Temp)) +
geom_boxplot(fill=c('steelblue', 'red', 'purple', 'green', 'orange'))
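The five numbers behind each box can be inspected directly. Here is a minimal sketch using base R's `fivenum()` on the same `airquality` temperatures; the names `temp`, `five`, and `five_by_month` are illustrative. Note that `fivenum()` uses Tukey's hinges, which can differ slightly from the quartiles returned by `quantile()`.

```r
# Temperature values from the built-in airquality data
temp <- datasets::airquality$Temp

# Five-number summary for all months together
five <- fivenum(temp)
names(five) <- c("Min", "Q1", "Median", "Q3", "Max")
five

# Five-number summary for each month separately (one box per month)
five_by_month <- tapply(temp, datasets::airquality$Month, fivenum)
five_by_month
```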
Data Summary Techniques
1. Central Tendency
Mean
• Center of mass (centroid) of the data
• When it describes a population, it is denoted \(\mu\).
• When it describes a sample, it is denoted \(\bar{x}\).
• Used for numerical data
• Not resistant to outliers: a single extreme value can shift it noticeably
dist <- c(12.5,29.9,14.8,18.7,7.6,16.2,16.5,27.4,12.1,17.5)
mean1 <- sum(dist)/length(dist)
mean1
## [1] 17.32
# or using function `mean`
mean(dist)
## [1] 17.32
Median
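• Also denoted Q2
• The observation in the middle of the sorted data (the average of the two middle values when the number of observations is even)
• Splits the sorted data into two halves of 50% each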
median(dist)
## [1] 16.35
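The mean and the median react very differently to an extreme value. The sketch below appends one hypothetical outlier (200, an assumption for illustration) to the same `dist` values and compares both measures before and after.

```r
dist <- c(12.5, 29.9, 14.8, 18.7, 7.6, 16.2, 16.5, 27.4, 12.1, 17.5)

# Same data plus one hypothetical extreme value
dist_out <- c(dist, 200)

# The mean is pulled strongly toward the outlier
mean_before <- mean(dist)
mean_after  <- mean(dist_out)

# The median barely moves
median_before <- median(dist)
median_after  <- median(dist_out)

c(mean_before = mean_before, mean_after = mean_after,
  median_before = median_before, median_after = median_after)
```

This is why the median is usually preferred as a measure of center when the data are skewed or contain outliers.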
Mode
The value of the observation that occurs most often.
library(DescTools)
mode1 <- Mode(iris$Sepal.Width)
mode1
## [1] 3
## attr(,"freq")
## [1] 26
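Base R has no built-in function for the statistical mode, but it can be sketched with `table()`: count each distinct value and keep the value (or values) with the highest count. The names `counts` and `mode_value` are illustrative.

```r
x <- iris$Sepal.Width

# Count how often each distinct value occurs
counts <- table(x)

# Keep the value(s) with the highest count; ties return more than one value
mode_value <- as.numeric(names(counts)[counts == max(counts)])
mode_value

# Frequency of the mode
max(counts)
```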
Quartiles
• Values that divide the sorted data into 4 equal parts
• Q0 = min and Q4 = max
• Q1 (the first quartile) is the value with 25% of the data to its left and 75% of the data to its right
• Q3 (the third quartile) is the value with 75% of the data to its left and 25% of the data to its right
• Robust against outliers
# Q1 and Q3
quantile(dist,probs=c(0.25,0.75))
## 25% 75%
## 13.075 18.400
# Q0 and Q4
min(dist)
## [1] 7.6
max(dist)
## [1] 29.9
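The value 13.075 for Q1 is not one of the observations, because `quantile()` interpolates between neighbouring sorted values. The following sketch reproduces Q1 by hand under R's default method (type 7), where the position in the sorted data is 1 + p(n - 1); the names `x`, `h`, `lo`, `hi`, and `q1` are illustrative.

```r
dist <- c(12.5, 29.9, 14.8, 18.7, 7.6, 16.2, 16.5, 27.4, 12.1, 17.5)

x <- sort(dist)
p <- 0.25

# Position of the quantile in the sorted data (default type 7)
h <- 1 + p * (length(x) - 1)

# Neighbouring sorted values below and above that position
lo <- x[floor(h)]
hi <- x[ceiling(h)]

# Linear interpolation between the two neighbours
q1 <- lo + (h - floor(h)) * (hi - lo)
q1
```

Other software may use a different interpolation rule, so quartiles computed elsewhere can differ slightly from R's defaults.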
2. Dispersion Measure
Dispersion measures quantify how spread out or how tightly clustered the data are.
Variation is usually defined in terms of distance:
• How far the points are from each other
• How far the points are from the mean
• How well these values represent the data as a whole
Range
Range = Max(data)-Min(data)
range1 <- max(dist)-min(dist)
range1
## [1] 22.3
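Note that R's built-in `range()` function returns the minimum and maximum rather than their difference. A short sketch of how it relates to the range defined above (the names `min_max` and `range_width` are illustrative):

```r
dist <- c(12.5, 29.9, 14.8, 18.7, 7.6, 16.2, 16.5, 27.4, 12.1, 17.5)

# range() returns c(min, max)
min_max <- range(dist)
min_max

# diff() of that pair gives Max - Min
range_width <- diff(min_max)
range_width
```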
Interquartile range (IQR)
The interquartile range explains the spread of the middle half of the distribution.
IQR = Q3 - Q1
Quartiles segment any distribution that’s ordered from low to high into four equal parts.
IQR <- quantile(dist,probs=c(0.75)) - quantile(dist,probs=c(0.25))
IQR
## 75%
## 5.325
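R also provides the built-in function `IQR()`, which gives the same value without the `75%` label. The sketch below additionally applies the common 1.5 × IQR convention for flagging potential outliers, the same convention box plots typically use for their whiskers; the names `iqr_value`, `lower`, `upper`, and `flagged` are illustrative, and the 1.5 multiplier is a convention rather than a fixed rule.

```r
dist <- c(12.5, 29.9, 14.8, 18.7, 7.6, 16.2, 16.5, 27.4, 12.1, 17.5)

# Built-in interquartile range
iqr_value <- IQR(dist)
iqr_value

# Fences at 1.5 * IQR beyond the quartiles
q <- quantile(dist, probs = c(0.25, 0.75))
lower <- q[["25%"]] - 1.5 * iqr_value
upper <- q[["75%"]] + 1.5 * iqr_value

# Observations outside the fences are flagged as potential outliers
flagged <- dist[dist < lower | dist > upper]
flagged
```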
Deviation
The difference between each observation and the mean.
deviation <- dist-mean(dist)
deviation
## [1] -4.82 12.58 -2.52 1.38 -9.72 -1.12 -0.82 10.08 -5.22 0.18
Variance
The variance is a measure of variability. It describes the degree of spread in a data set: the more spread out the data are around the mean, the larger the variance.
Formula: the sum of the squared deviations from the mean, divided by n - 1 (the sample variance, which is what var() returns).
var(dist)
## [1] 46.11511
Standard Deviation
Standard Deviation is the square root of variance.
sd(dist)
## [1] 6.790811
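The variance and standard deviation can be checked by hand from the deviations. This is a minimal sketch that assumes the sample versions (divisor n - 1), which is what `var()` and `sd()` compute; the names `dev`, `var_manual`, and `sd_manual` are illustrative.

```r
dist <- c(12.5, 29.9, 14.8, 18.7, 7.6, 16.2, 16.5, 27.4, 12.1, 17.5)
n <- length(dist)

# Deviations of each observation from the mean
dev <- dist - mean(dist)

# Sample variance: sum of squared deviations divided by n - 1
var_manual <- sum(dev^2) / (n - 1)

# Standard deviation: square root of the variance
sd_manual <- sqrt(var_manual)

c(variance = var_manual, sd = sd_manual)
```

Because the deviations always sum to zero, they are squared before averaging; taking the square root afterwards brings the result back to the original units of the data.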
Summary in R
summary(dist)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.60 13.07 16.35 17.32 18.40 29.90
Summary Table
library(tidyverse)
library(kableExtra)
table1 <- dataz %>%
  group_by(zodiacs) %>%
  summarise(Frequency = n(),
            Minimum = min(age),
            Maximum = max(age),
            Median = median(age),
            Mean = mean(age),
            IQR = diff(quantile(age, c(1, 3)/4)))
names(table1)[1] = c("Zodiacs")
kbl(table1, digits = 2,
caption = "Table 2: Descriptive statistics of age by zodiacs.") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
Zodiacs | Frequency | Minimum | Maximum | Median | Mean | IQR |
---|---|---|---|---|---|---|
Aquarius | 7 | 20.27 | 42.17 | 23.38 | 28.27 | 10.45 |
Aries | 9 | 20.04 | 52.44 | 33.98 | 34.00 | 17.83 |
Cancer | 8 | 16.27 | 67.58 | 40.42 | 38.73 | 12.07 |
Capricorn | 2 | 23.99 | 37.84 | 30.92 | 30.92 | 6.93 |
Gemini | 9 | 18.46 | 74.25 | 34.01 | 42.09 | 29.81 |
Leo | 6 | 18.28 | 68.04 | 29.36 | 34.70 | 19.62 |
Libra | 6 | 18.36 | 45.02 | 22.30 | 27.59 | 16.85 |
Pisces | 13 | 18.64 | 55.64 | 26.86 | 30.28 | 14.02 |
Saggitarius | 9 | 21.34 | 44.85 | 37.55 | 34.11 | 16.44 |
Scorpio | 6 | 18.40 | 72.80 | 28.93 | 36.13 | 13.34 |
Taurus | 5 | 17.02 | 52.59 | 39.58 | 36.49 | 25.35 |
Virgo | 8 | 20.22 | 50.07 | 27.74 | 31.02 | 18.84 |
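A comparable grouped summary can be sketched in base R. Because the course CSV may not be at hand, this example uses the built-in `mtcars` data instead (an assumption for illustration), summarising `mpg` within each `cyl` group; the names `summarise_one` and `by_cyl` are hypothetical.

```r
# Summary statistics for one numeric vector
summarise_one <- function(v) {
  c(Frequency = length(v),
    Minimum   = min(v),
    Maximum   = max(v),
    Median    = median(v),
    Mean      = mean(v),
    IQR       = IQR(v))
}

# Split mpg by number of cylinders, summarise each group, stack the rows
by_cyl <- do.call(rbind, lapply(split(mtcars$mpg, mtcars$cyl), summarise_one))
round(by_cyl, 2)
```

Each row corresponds to one group, just as each row of the table above corresponds to one zodiac sign.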