LIBRARY AND DATA

First, we load the libraries and datasets we need:

library(readr)

## Warning: package 'readr' was built under R version 4.1.3

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.1.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.1.3

library(openintro)

## Warning: package 'openintro' was built under R version 4.1.3

## Loading required package: airports

## Warning: package 'airports' was built under R version 4.1.3

## Loading required package: cherryblossom

## Warning: package 'cherryblossom' was built under R version 4.1.3

## Loading required package: usdata

## Warning: package 'usdata' was built under R version 4.1.3

cars <- read.csv("https://assets.datacamp.com/production/course_1796/datasets/cars04.csv")
comics <- read.csv("https://assets.datacamp.com/production/course_1796/datasets/comics.csv")
life <- read.csv("https://assets.datacamp.com/production/course_1796/datasets/life_exp_raw.csv")

Contingency table review

# Print the first rows of the data
head(comics)

##                                    name      id   align        eye       hair
## 1             Spider-Man (Peter Parker)  Secret    Good Hazel Eyes Brown Hair
## 2       Captain America (Steven Rogers)  Public    Good  Blue Eyes White Hair
## 3 Wolverine (James \\"Logan\\" Howlett)  Public Neutral  Blue Eyes Black Hair
## 4   Iron Man (Anthony \\"Tony\\" Stark)  Public    Good  Blue Eyes Black Hair
## 5                   Thor (Thor Odinson) No Dual    Good  Blue Eyes Blond Hair
## 6            Benjamin Grimm (Earth-616)  Public    Good  Blue Eyes    No Hair
##   gender  gsm             alive appearances first_appear publisher
## 1   Male <NA> Living Characters        4043       Aug-62    marvel
## 2   Male <NA> Living Characters        3360       Mar-41    marvel
## 3   Male <NA> Living Characters        3061       Oct-74    marvel
## 4   Male <NA> Living Characters        2961       Mar-63    marvel
## 5   Male <NA> Living Characters        2258       Nov-50    marvel
## 6   Male <NA> Living Characters        2255       Nov-61    marvel

# Check levels of align
al <- as.factor(comics$align) # convert to a factor so that levels() can list the categories
levels(al)

## [1] "Bad"                "Good"               "Neutral"           
## [4] "Reformed Criminals"

EXPLANATION: The align variable in the comics dataset has four unique values: Bad, Good, Neutral, and Reformed Criminals.

# Check the levels of gender
gen <- as.factor(comics$gender) # convert to a factor so that levels() can list the categories
levels(gen)

## [1] "Female" "Male"   "Other"

EXPLANATION: The gender variable in the comics dataset has three unique values: Female, Male, and Other.

# Create a 2-way contingency table
table(al, gen)

##                     gen
## al                   Female Male Other
##   Bad                  1573 7561    32
##   Good                 2490 4809    17
##   Neutral               836 1799    17
##   Reformed Criminals      1    2     0

EXPLANATION: The table above cross-tabulates two variables, align and gender. It also shows that most characters in this dataset are male: the Male column has the largest count in every alignment row.
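The claim that male characters dominate can be checked by summing the counts per gender. The sketch below is only an illustration: it re-enters the printed counts by hand so that it runs without downloading comics.csv, and the object names tab_counts and gender_totals are hypothetical, not part of the original lab.

```r
# Rebuild the 2-way table from the counts printed above (hand-entered).
tab_counts <- matrix(
  c(1573, 7561, 32,
    2490, 4809, 17,
     836, 1799, 17,
       1,    2,  0),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    al  = c("Bad", "Good", "Neutral", "Reformed Criminals"),
    gen = c("Female", "Male", "Other")
  )
)

# Column totals give the number of characters per gender
gender_totals <- colSums(tab_counts)
gender_totals

# Share of each gender among the characters counted in the table
round(gender_totals / sum(tab_counts), 3)
```

Note that table() drops missing values by default, so these totals only cover characters for which both align and gender are recorded.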

Dropping levels

# Load dplyr

# Print tab
tab <- table(al, gen)
tab

##                     gen
## al                   Female Male Other
##   Bad                  1573 7561    32
##   Good                 2490 4809    17
##   Neutral               836 1799    17
##   Reformed Criminals      1    2     0

EXPLANATION: Here we only check that tab holds the expected contingency table.

# Remove align level
comics <- comics %>% filter(align != 'Reformed Criminals') %>% droplevels()

levels(as.factor(comics$align)) # convert to a factor so that levels() can list the categories

## [1] "Bad"     "Good"    "Neutral"

EXPLANATION: Once ‘Reformed Criminals’ has been removed, only three unique values remain: Bad, Good, and Neutral.
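droplevels() only has an effect on factor columns. On character columns, which is what read.csv() returns by default since R 4.0, the filter alone already removes the value. The sketch below uses a small made-up factor (the names align_f, kept and kept_clean are hypothetical) to show what droplevels() does when the column really is a factor.

```r
# Toy factor with the same four alignment levels (hypothetical data)
align_f <- factor(c("Bad", "Good", "Neutral", "Reformed Criminals", "Bad"))

# Subsetting removes the observation but keeps the now-empty level
kept <- align_f[align_f != "Reformed Criminals"]
levels(kept)

# droplevels() discards levels that no longer occur in the data
kept_clean <- droplevels(kept)
levels(kept_clean)
table(kept_clean)
```

Without droplevels(), the empty level would still show up as a zero row in table() and as an empty bar position in a plot.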

Side by side barcharts

# Load ggplot2

# Create side-by-side barchart of gender by alignment
ggplot(comics, aes(x = align, fill = gender)) + geom_bar(position = "dodge")

# Create side-by-side barchart of alignment by gender
ggplot(comics, aes(x = gender, fill = align)) + geom_bar(position = "dodge") + theme(axis.text.x = element_text(angle = 90))

EXPLANATION: The plots confirm that there are more male than female characters in this dataset. Male characters are also the most common among characters with ‘Neutral’ alignment. Because the mix of alignments differs between genders, there appears to be an association between gender and alignment.

Counts vs. proportions

# simplify display format
options(scipen = 999, digits = 3) 

# Create table of counts
tbl_cnt <- table(comics$id, comics$align)
tbl_cnt

##          
##            Bad Good Neutral
##   No Dual  474  647     390
##   Public  2172 2930     965
##   Secret  4493 2475     959
##   Unknown    7    0       2

EXPLANATION: The table above cross-tabulates two variables from the comics dataset, id and align.

# Proportional table
# All values add up to 1
prop.table(tbl_cnt)

##          
##                Bad     Good  Neutral
##   No Dual 0.030553 0.041704 0.025139
##   Public  0.140003 0.188862 0.062202
##   Secret  0.289609 0.159533 0.061815
##   Unknown 0.000451 0.000000 0.000129

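EXPLANATION: With no margin argument, prop.table() returns joint proportions: each cell is divided by the grand total, so all cells together sum to 1: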

sum(prop.table(tbl_cnt))

## [1] 1

# All rows add up to 1
prop.table(tbl_cnt, 1)

##          
##             Bad  Good Neutral
##   No Dual 0.314 0.428   0.258
##   Public  0.358 0.483   0.159
##   Secret  0.567 0.312   0.121
##   Unknown 0.778 0.000   0.222

# Columns add up to 1
prop.table(tbl_cnt, 2)

##          
##                Bad     Good  Neutral
##   No Dual 0.066331 0.106907 0.168394
##   Public  0.303946 0.484137 0.416667
##   Secret  0.628743 0.408956 0.414076
##   Unknown 0.000980 0.000000 0.000864

EXPLANATION: The tables show that a small number of characters have id = Unknown.
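The three calls to prop.table() differ only in what each cell is divided by. As a minimal sketch, the counts of id by align are re-entered by hand below (the object names tbl, joint, by_row, by_col and manual_row are hypothetical) and the sums are checked for each margin.

```r
# Counts of id by align, copied from the table of counts above
tbl <- matrix(
  c( 474,  647, 390,
    2172, 2930, 965,
    4493, 2475, 959,
       7,    0,   2),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    id    = c("No Dual", "Public", "Secret", "Unknown"),
    align = c("Bad", "Good", "Neutral")
  )
)

joint  <- prop.table(tbl)     # each cell divided by the grand total
by_row <- prop.table(tbl, 1)  # each cell divided by its row total
by_col <- prop.table(tbl, 2)  # each cell divided by its column total

sum(joint)       # the whole table sums to 1
rowSums(by_row)  # every row sums to 1
colSums(by_col)  # every column sums to 1

# The row-wise version is the same as dividing by the row totals by hand
manual_row <- tbl / rowSums(tbl)
all.equal(by_row, manual_row, check.attributes = FALSE)
```

Which margin to condition on depends on the question: row proportions answer "given this id, how are alignments distributed?", while column proportions answer "given this alignment, how are ids distributed?".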

ggplot(comics, aes(x = id, fill = align)) + geom_bar(position = "fill") + ylab("proportion")

EXPLANATION: With id on the x-axis and align as the fill, we can see that most Secret characters are bad (about 57%). The few characters with id = Unknown also stand out more clearly here. Next, we swap the x and fill variables.

ggplot(comics, aes(x = align, fill = id)) + geom_bar(position = "fill") + ylab("proportion")

EXPLANATION: This plot confirms that most bad characters have a Secret identity (about 63%). We can also see that the No Dual share is larger among Neutral characters than in the other two alignments, and that the Public share is larger among Good characters than among Neutral or Bad ones.

Conditional proportions

tab <- table(comics$align, comics$gender)
options(scipen = 999, digits = 3) # Print fewer digits
prop.table(tab)     # Joint proportions

##          
##             Female     Male    Other
##   Bad     0.082210 0.395160 0.001672
##   Good    0.130135 0.251333 0.000888
##   Neutral 0.043692 0.094021 0.000888

prop.table(tab, 2)

##          
##           Female  Male Other
##   Bad      0.321 0.534 0.485
##   Good     0.508 0.339 0.258
##   Neutral  0.171 0.127 0.258

EXPLANATION: From the two results above, roughly 51% of female characters are good (the Female column of the second table), whereas good female characters make up only about 13% of all characters (the first table).
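The difference between the joint and the conditional figure comes down to the denominator. A short sketch with the counts of align by gender entered by hand (after dropping Reformed Criminals; the variable names are hypothetical) makes the arithmetic explicit.

```r
# Counts per alignment for each gender, taken from the earlier contingency
# table without the Reformed Criminals row
female <- c(Bad = 1573, Good = 2490, Neutral = 836)
male   <- c(Bad = 7561, Good = 4809, Neutral = 1799)
other  <- c(Bad = 32,   Good = 17,   Neutral = 17)

grand_total <- sum(female, male, other)

# Joint proportion: good AND female, out of all characters
joint_good_female <- female[["Good"]] / grand_total

# Conditional proportion: good, out of female characters only
cond_good_given_female <- female[["Good"]] / sum(female)

round(joint_good_female, 3)
round(cond_good_given_female, 3)
```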

Counts vs. proportions (2)

# Plot of gender by align
ggplot(comics, aes(x = align, fill = gender)) + geom_bar()

# Plot proportion of gender, conditional on align
ggplot(comics, aes(x = align, fill = gender)) + geom_bar(position = "fill")

Distribution of one variable

# Can use table function on just one variable
# This is called a marginal distribution
table(comics$id)

## 
## No Dual  Public  Secret Unknown 
##    1511    6067    7927       9

EXPLANATION: There are 1511 No Dual, 6067 Public, 7927 Secret, and 9 Unknown characters.
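A marginal distribution is what remains of a contingency table once the other variable is summed out. As a rough sketch (counts re-entered by hand, object names hypothetical), the row totals of the id by align table reproduce the four counts above.

```r
# Counts of id by align, copied from the earlier table of counts
tbl <- matrix(
  c( 474,  647, 390,
    2172, 2930, 965,
    4493, 2475, 959,
       7,    0,   2),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    id    = c("No Dual", "Public", "Secret", "Unknown"),
    align = c("Bad", "Good", "Neutral")
  )
)

# Summing across alignments gives the marginal distribution of id
id_margin <- rowSums(tbl)
id_margin

# Summing across ids gives the marginal distribution of align
align_margin <- colSums(tbl)
align_margin
```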

# Simple barchart
ggplot(comics, aes(x = id)) + geom_bar()

EXPLANATION: Here we look at a single variable on its own, which also makes filtering much easier. From this starting point we can facet by alignment rather than coloring the stacked bars.

ggplot(comics, aes(x = id)) + geom_bar() + facet_wrap(~align)

EXPLANATION: Above is the same simple bar chart of id, drawn in a separate panel for each alignment category.

Marginal barchart

# Change the order of the levels in align
comics$align <- factor(comics$align, 
                       levels = c("Bad", "Neutral", "Good"))

# Create plot of align
ggplot(comics, aes(x = align)) + 
  geom_bar()

EXPLANATION: In this marginal bar chart it makes more sense to place Neutral between Bad and Good, so we reorder the levels to obtain the chart above. Otherwise the levels follow the default alphabetical order.
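The effect of the levels argument can be seen without any plotting. The sketch below uses a small made-up vector (align_chr and align_ord are hypothetical names) and compares the default alphabetical order with the explicit one.

```r
# Made-up alignment values for illustration
align_chr <- c("Good", "Bad", "Neutral", "Bad", "Good", "Bad")

# Default: levels are sorted alphabetically
levels(factor(align_chr))

# Explicit order: Neutral sits between Bad and Good
align_ord <- factor(align_chr, levels = c("Bad", "Neutral", "Good"))
levels(align_ord)

# table() and ggplot2 bar charts follow the level order
table(align_ord)
```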

Conditional barchart

# Plot of alignment broken down by gender
ggplot(comics, aes(x = align)) + 
  geom_bar() +
  facet_wrap(~ gender)

EXPLANATION: Above is the conditional bar chart of alignment, with one panel per gender.

Improve piechart

# Put levels of flavor in descending order of count
lev <- c("apple", "key lime", "boston creme", "blueberry", "cherry", "pumpkin", "strawberry")
# Name the column flavor so that it matches the pies$flavor references below
pies <- data.frame(flavor = rep(c("apple", "blueberry", "boston creme", "cherry", "key lime", "pumpkin", "strawberry"), times = c(17, 14, 15, 13, 16, 12, 11)))
pies$flavor <- factor(pies$flavor, levels = lev)

head(pies$flavor)

## [1] apple apple apple apple apple apple
## Levels: apple key lime boston creme blueberry cherry pumpkin strawberry

EXPLANATION: Above is a dataset named pies whose flavor variable has 7 levels: apple, key lime, boston creme, blueberry, cherry, pumpkin, and strawberry.

# Create barchart of flavor
ggplot(pies, aes(x = flavor)) + geom_bar(fill = "chartreuse") + theme(axis.text.x = element_text(angle = 90))

EXPLANATION: Above is the bar chart of the pies dataset, with the flavors ordered by count. Apple has the largest count, followed by key lime and boston creme.
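The level order does not have to be typed by hand. As one possible approach (not the one used in the lab), the levels can be derived from the counts themselves; the names flavor, counts and lev_auto below are hypothetical.

```r
# Same flavors and counts as in the pies data frame above
flavor <- rep(
  c("apple", "blueberry", "boston creme", "cherry",
    "key lime", "pumpkin", "strawberry"),
  times = c(17, 14, 15, 13, 16, 12, 11)
)

# Count each flavor, then sort the counts from largest to smallest
counts <- table(flavor)
lev_auto <- names(sort(counts, decreasing = TRUE))
lev_auto

# Use the derived order as the factor levels
flavor_ord <- factor(flavor, levels = lev_auto)
table(flavor_ord)
```

Deriving the order from the data keeps the chart correct if the counts change later.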

Exploring numerical data

ggplot(cars, aes(x = weight)) + geom_dotplot(dotsize = 0.4)

## Bin width defaults to 1/30 of the range of the data. Pick better value with `binwidth`.

## Warning: Removed 2 rows containing non-finite values (stat_bindot).

EXPLANATION: A dot plot is typically used to show every data point.

ggplot(cars, aes(x = weight)) + geom_histogram(binwidth = 500)

## Warning: Removed 2 rows containing non-finite values (stat_bin).

EXPLANATION: A histogram groups the points into bins, so the plot does not become overwhelming when there are many observations.
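Binning itself can be sketched in base R with cut(). The example below takes the first ten weight values shown later in the str(cars) output and groups them into bins of width 500. The break points are an assumption chosen for illustration; geom_histogram() may place its bin boundaries differently.

```r
# First ten weights from the cars dataset (see the str(cars) output)
weight <- c(2370, 2348, 2617, 2676, 2617, 2581, 2626, 2612, 2606, 2606)

# Bin edges every 500 units (assumed boundaries, for illustration only)
breaks <- seq(2000, 3000, by = 500)

# Assign each weight to a bin, then count the observations per bin
bins <- cut(weight, breaks = breaks, dig.lab = 4)
bin_counts <- table(bins)
bin_counts
```

The heights of the bars in a histogram are exactly these per-bin counts.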

ggplot(cars, aes(x = weight)) + geom_density()

## Warning: Removed 2 rows containing non-finite values (stat_density).

EXPLANATION: A density plot gives a smoother, big-picture view of the shape of the distribution. It is especially helpful when there is a lot of data.

ggplot(cars, aes(x = 1, y = weight)) + geom_boxplot() + coord_flip()

## Warning: Removed 2 rows containing non-finite values (stat_boxplot).

EXPLANATION: A box plot is a good way to display a summary of a distribution.

Faceted histogram

# Load package
library(ggplot2)

# Learn data structure
str(cars)

## 'data.frame':    428 obs. of  19 variables:
##  $ name       : chr  "Chevrolet Aveo 4dr" "Chevrolet Aveo LS 4dr hatch" "Chevrolet Cavalier 2dr" "Chevrolet Cavalier 4dr" ...
##  $ sports_car : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ suv        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ wagon      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ minivan    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ pickup     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ all_wheel  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ rear_wheel : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ msrp       : int  11690 12585 14610 14810 16385 13670 15040 13270 13730 15460 ...
##  $ dealer_cost: int  10965 11802 13697 13884 15357 12849 14086 12482 12906 14496 ...
##  $ eng_size   : num  1.6 1.6 2.2 2.2 2.2 2 2 2 2 2 ...
##  $ ncyl       : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ horsepwr   : int  103 103 140 140 140 132 132 130 110 130 ...
##  $ city_mpg   : int  28 28 26 26 26 29 29 26 27 26 ...
##  $ hwy_mpg    : int  34 34 37 37 37 36 36 33 36 33 ...
##  $ weight     : int  2370 2348 2617 2676 2617 2581 2626 2612 2606 2606 ...
##  $ wheel_base : int  98 98 104 104 104 105 105 103 103 103 ...
##  $ length     : int  167 153 183 183 183 174 174 168 168 168 ...
##  $ width      : int  66 66 69 68 69 67 67 67 67 67 ...

# Create faceted histogram
ggplot(cars, aes(x = city_mpg)) + geom_histogram() + facet_wrap(~ suv)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 14 rows containing non-finite values (stat_bin).

EXPLANATION: Above is the faceted histogram of the city_mpg variable from the cars dataset, split into panels by the suv variable.

Boxplots and density plots

unique(cars$ncyl)

## [1]  4  6  3  8  5 12 10 -1

table(cars$ncyl)

## 
##  -1   3   4   5   6   8  10  12 
##   2   1 136   7 190  87   2   3

# Filter cars with 4, 6, 8 cylinders
common_cyl <- filter(cars, ncyl %in% c(4,6,8))

# Create box plots of city mpg by ncyl
ggplot(common_cyl, aes(x = as.factor(ncyl), y = city_mpg)) + geom_boxplot()

## Warning: Removed 11 rows containing non-finite values (stat_boxplot).

# Create overlaid density plots for same data
ggplot(common_cyl, aes(x = city_mpg, fill = as.factor(ncyl))) + geom_density(alpha = .3)

## Warning: Removed 11 rows containing non-finite values (stat_density).

EXPLANATION: From the plots above we can see that the cars with the highest mileage have 4 cylinders. The typical 4-cylinder car gets better mileage than the typical 6-cylinder car, which in turn gets better mileage than the typical 8-cylinder car. We can also see that most 4-cylinder cars get better mileage than most 8-cylinder cars.

Distribution of one variable

Marginal and conditional histograms

# Create hist of horsepwr
cars %>% ggplot(aes(horsepwr)) + geom_histogram() + ggtitle("Horsepower distribution")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

EXPLANATION: The plot above is the histogram of the horsepwr variable.

# Create hist of horsepwr for affordable cars
cars %>% 
  filter(msrp < 25000) %>%
  ggplot(aes(horsepwr)) +
  geom_histogram() +
  xlim(c(90, 550)) +
  ggtitle("Horsepower distribution for msrp < 25000")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 1 rows containing non-finite values (stat_bin).

## Warning: Removed 2 rows containing missing values (geom_bar).

EXPLANATION: The plot above is the horsepower histogram for affordable cars (msrp < 25000). The most powerful car in this cheaper price range has less than 250 horsepower.

Three binwidths

# Create hist of horsepwr with binwidth of 3
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 3) +
  ggtitle("binwidth = 3")

EXPLANATION: The plot above is the horsepower histogram with a binwidth of 3.

# Create hist of horsepwr with binwidth of 30
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 30) +
  ggtitle("binwidth = 30")

EXPLANATION: The plot above is the horsepower histogram with a binwidth of 30.

# Create hist of horsepwr with binwidth of 60
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 60) +
  ggtitle("binwidth = 60")

EXPLANATION: The plot above is the horsepower histogram with a binwidth of 60.
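The three plots differ only in how many bins the same data are cut into. As a rough sketch, the helper count_bins() below (a hypothetical function, not part of ggplot2) counts how many bins of a given width are needed to cover the first ten horsepwr values from the str(cars) output. The bin edges are aligned to multiples of the binwidth, which is an assumption and not necessarily what geom_histogram() does.

```r
# First ten horsepower values from the cars dataset (see str(cars))
horsepwr <- c(103, 103, 140, 140, 140, 132, 132, 130, 110, 130)

# Hypothetical helper: number of bins of width `binwidth` needed to
# cover the data when edges sit on multiples of `binwidth`
count_bins <- function(x, binwidth) {
  lower <- floor(min(x) / binwidth) * binwidth
  upper <- ceiling(max(x) / binwidth) * binwidth
  length(seq(lower, upper, by = binwidth)) - 1
}

n_bins <- sapply(c(3, 30, 60), function(bw) count_bins(horsepwr, bw))
names(n_bins) <- c("binwidth = 3", "binwidth = 30", "binwidth = 60")
n_bins
```

A small binwidth produces many narrow bins and a jagged picture, while a large binwidth produces few wide bins that can hide features such as multiple peaks.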

Boxplot

Box plots for outliers

# Construct box plot of msrp
cars %>%
  ggplot(aes(x = 1, y = msrp)) +
  geom_boxplot()

EXPLANATION: The plot above is the box plot of msrp.

# Exclude outliers from data
cars_no_out <- cars %>%
  filter(msrp < 100000)

# Construct box plot of msrp using the reduced dataset
cars_no_out %>%
  ggplot(aes(x = 1, y = msrp)) +
  geom_boxplot()

EXPLANATION: The plot above is the box plot of msrp after the outliers (cars with an msrp of 100000 or more) have been removed.
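The lab removes outliers with a fixed cutoff of 100000. For comparison, the points that a box plot draws individually are those lying more than 1.5 times the interquartile range beyond the hinges. The sketch below applies that rule with boxplot.stats() from base R to a small made-up price vector (msrp_toy is hypothetical and not taken from the cars data).

```r
# Made-up prices with one very expensive car (illustration only)
msrp_toy <- c(10, 12, 13, 14, 15, 16, 100) * 1000

# boxplot.stats() returns the five box plot statistics and the outliers
stats <- boxplot.stats(msrp_toy)
stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # values flagged as outliers by the 1.5 * IQR rule

# Keep only the values that were not flagged
msrp_clean <- msrp_toy[!msrp_toy %in% stats$out]
msrp_clean
```

A rule-based cutoff adapts to the data, whereas a fixed threshold such as msrp < 100000 is easier to explain and reproduce; which one is preferable depends on the analysis.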

Plot selection

# Create plot of city_mpg
cars %>%
  ggplot(aes(x = 1, y = city_mpg)) +
  geom_boxplot()

## Warning: Removed 14 rows containing non-finite values (stat_boxplot).

cars %>%
  ggplot(aes(city_mpg)) +
  geom_density()

## Warning: Removed 14 rows containing non-finite values (stat_density).

# Create plot of width
cars %>%
  ggplot(aes(x = 1, y = width)) +
  geom_boxplot()

## Warning: Removed 28 rows containing non-finite values (stat_boxplot).

cars %>%
  ggplot(aes(x = width)) +
  geom_density()

## Warning: Removed 28 rows containing non-finite values (stat_density).

Visualization in higher dimensions

3 Variable plot

# Facet hists using hwy mileage and ncyl
common_cyl %>%
  ggplot(aes(x = hwy_mpg)) +
  geom_histogram() +
  facet_grid(ncyl ~ suv) +
  ggtitle("hwy_mpg by ncyl and suv")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 11 rows containing non-finite values (stat_bin).

EXPLANATION: In both SUVs and non-SUVs, mileage tends to decrease as the number of cylinders increases.

Numerical Summaries

Measures of center

head(life)

##     State         County fips Year Female.life.expectancy..years.
## 1 Alabama Autauga County 1001 1985                           77.0
## 2 Alabama Baldwin County 1003 1985                           78.8
## 3 Alabama Barbour County 1005 1985                           76.0
## 4 Alabama    Bibb County 1007 1985                           76.6
## 5 Alabama  Blount County 1009 1985                           78.9
## 6 Alabama Bullock County 1011 1985                           75.1
##   Female.life.expectancy..national..years.
## 1                                     77.8
## 2                                     77.8
## 3                                     77.8
## 4                                     77.8
## 5                                     77.8
## 6                                     77.8
##   Female.life.expectancy..state..years. Male.life.expectancy..years.
## 1                                  76.9                         68.1
## 2                                  76.9                         71.1
## 3                                  76.9                         66.8
## 4                                  76.9                         67.3
## 5                                  76.9                         70.6
## 6                                  76.9                         66.6
##   Male.life.expectancy..national..years. Male.life.expectancy..state..years.
## 1                                   70.8                                69.1
## 2                                   70.8                                69.1
## 3                                   70.8                                69.1
## 4                                   70.8                                69.1
## 5                                   70.8                                69.1
## 6                                   70.8                                69.1

x <- head(round(life$Female.life.expectancy..years.), 11)
x

##  [1] 77 79 76 77 79 75 77 77 77 78 77

sum(x)/11

## [1] 77.2

mean(x)

## [1] 77.2

EXPLANATION: The mean is the balance point of the data and is sensitive to extreme values.

sort(x)

##  [1] 75 76 77 77 77 77 77 77 78 79 79

median(x)

## [1] 77

EXPLANATION: The median is the middle value of the data. It is robust to extreme values and is well suited to skewed data.

table(x)

## x
## 75 76 77 78 79 
##  1  1  6  1  2

EXPLANATION: The mode is the value that occurs most often, here 77.
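The different sensitivity of the three measures can be seen by adding one extreme value to x. In the sketch below, the extra value 120 is a made-up outlier, and x_extreme and mode_x are hypothetical names.

```r
# The eleven rounded life expectancies used above
x <- c(77, 79, 76, 77, 79, 75, 77, 77, 77, 78, 77)

# Add one implausibly large value (hypothetical outlier)
x_extreme <- c(x, 120)

# The mean is pulled towards the outlier, the median stays put
c(mean = mean(x), median = median(x))
c(mean = mean(x_extreme), median = median(x_extreme))

# Base R has no function for the statistical mode; one simple way is
# to take the most frequent entry of table()
mode_x <- as.numeric(names(which.max(table(x))))
mode_x
```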

Calculate center measures

library(gapminder)

## Warning: package 'gapminder' was built under R version 4.1.3

str(gapminder)

## tibble [1,704 x 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...

# Create dataset of 2007 data
gap2007 <- filter(gapminder, year == 2007)

# Compute groupwise mean and median lifeExp
gap2007 %>%
  group_by(continent) %>%
  summarize(mean(lifeExp),
            median(lifeExp))

## # A tibble: 5 x 3
##   continent `mean(lifeExp)` `median(lifeExp)`
##   <fct>               <dbl>             <dbl>
## 1 Africa               54.8              52.9
## 2 Americas             73.6              72.9
## 3 Asia                 70.7              72.4
## 4 Europe               77.6              78.6
## 5 Oceania              80.7              80.7

# Generate box plots of lifeExp for each continent
gap2007 %>%
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_boxplot()

Measures of variability

x

##  [1] 77 79 76 77 79 75 77 77 77 78 77

EXPLANATION: Looking at the raw values alone is not enough. We want to boil the spread of the data down to a single number so that we can compare sample distributions.

# Look at the difference between each point and the mean
sum(x - mean(x))

## [1] -0.0000000000000568

EXPLANATION: The deviations from the mean always sum to zero; the tiny value above is only floating-point rounding error. The raw sum is therefore useless as a measure of spread. Squaring each deviation removes the negatives, but the sum of squares keeps growing as we add observations, and we want something stable.

# Square each difference to get rid of negatives then sum
sum((x - mean(x))^2)

## [1] 13.6

sum((x - mean(x))^2)/(length(x) - 1)

## [1] 1.36

var(x)

## [1] 1.36

EXPLANATION: We therefore divide by n - 1, which gives the sample variance. The variance is one of the most useful measures of the spread of a sample distribution.
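Why divide at all? The sum of squared deviations grows with the number of observations even when the spread does not change. The sketch below duplicates x to mimic a sample that is twice as large with the same spread; the helper ss() is a hypothetical name for the sum of squares.

```r
# The eleven rounded life expectancies used above
x <- c(77, 79, 76, 77, 79, 75, 77, 77, 77, 78, 77)

# The same values twice: a larger sample with an identical spread
xx <- rep(x, 2)

# Hypothetical helper: sum of squared deviations from the mean
ss <- function(v) sum((v - mean(v))^2)

# The sum of squares doubles when the sample size doubles ...
c(n11 = ss(x), n22 = ss(xx))

# ... while dividing by n - 1 gives a value that stays roughly the same
c(n11 = ss(x) / (length(x) - 1), n22 = ss(xx) / (length(xx) - 1))
```

Dividing by n - 1 is exactly what var() does, so the first of these two values matches var(x).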

sqrt(sum((x - mean(x))^2)/(length(x) - 1))

## [1] 1.17

sd(x)

## [1] 1.17

EXPLANATION: Another useful measure is the standard deviation, which is simply the square root of the variance and is therefore expressed in the same units as the data.

summary(x)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    75.0    77.0    77.0    77.2    77.5    79.0

IQR(x)

## [1] 0.5

EXPLANATION: The IQR is the width of the middle 50% of the data, here 77.5 - 77.0 = 0.5. It is not sensitive to extreme values.

max(x)

## [1] 79

min(x)

## [1] 75

diff(range(x))

## [1] 4

EXPLANATION: The max and min are also of interest, as is the range (the difference between the max and the min).
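The measures of spread react very differently to a single extreme value. As an illustration, the sketch below appends one made-up value of 120 to x (x_extreme is a hypothetical name) and recomputes the range, the IQR, and the standard deviation.

```r
# The eleven rounded life expectancies used above
x <- c(77, 79, 76, 77, 79, 75, 77, 77, 77, 78, 77)

# Add one implausibly large value (hypothetical outlier)
x_extreme <- c(x, 120)

# The range depends only on the two most extreme points
c(before = diff(range(x)), after = diff(range(x_extreme)))

# The IQR only looks at the middle half of the data
c(before = IQR(x), after = IQR(x_extreme))

# The standard deviation uses every point, so it is inflated as well
c(before = sd(x), after = sd(x_extreme))
```

The range jumps from 4 to 45, while the IQR only moves from 0.5 to 1.25.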

Calculate spread measures

str(gap2007)

## tibble [142 x 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 4 1 1 2 5 4 3 3 4 ...
##  $ year     : int [1:142] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
##  $ lifeExp  : num [1:142] 43.8 76.4 72.3 42.7 75.3 ...
##  $ pop      : int [1:142] 31889923 3600523 33333216 12420476 40301927 20434176 8199783 708573 150448339 10392226 ...
##  $ gdpPercap: num [1:142] 975 5937 6223 4797 12779 ...

# Compute groupwise measures of spread
gap2007 %>%
  group_by(continent) %>%
  summarize(sd(lifeExp),
            IQR(lifeExp),
            n())

## # A tibble: 5 x 4
##   continent `sd(lifeExp)` `IQR(lifeExp)` `n()`
##   <fct>             <dbl>          <dbl> <int>
## 1 Africa            9.63          11.6      52
## 2 Americas          4.44           4.63     25
## 3 Asia              7.96          10.2      33
## 4 Europe            2.98           4.78     30
## 5 Oceania           0.729          0.516     2

# Generate overlaid density plots
gap2007 %>%
  ggplot(aes(x = lifeExp, fill = continent)) +
  geom_density(alpha = 0.3)

Choose measures for center and spread

# Compute stats for lifeExp in Americas
head(gap2007)

## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       2007    43.8 31889923      975.
## 2 Albania     Europe     2007    76.4  3600523     5937.
## 3 Algeria     Africa     2007    72.3 33333216     6223.
## 4 Angola      Africa     2007    42.7 12420476     4797.
## 5 Argentina   Americas   2007    75.3 40301927    12779.
## 6 Australia   Oceania    2007    81.2 20434176    34435.

gap2007 %>%
  filter(continent == "Americas") %>%
  summarize(mean(lifeExp),
            sd(lifeExp))

## # A tibble: 1 x 2
##   `mean(lifeExp)` `sd(lifeExp)`
##             <dbl>         <dbl>
## 1            73.6          4.44

# Compute stats for population
gap2007 %>%
  summarize(median(pop),
            IQR(pop))

## # A tibble: 1 x 2
##   `median(pop)` `IQR(pop)`
##           <dbl>      <dbl>
## 1      10517531  26702008.

Describe the shape

# Create density plot of old variable
gap2007 %>%
  ggplot(aes(x = pop)) +
  geom_density()

# Transform the skewed pop variable
gap2007 <- gap2007 %>%
  mutate(log_pop = log(pop))

# Create density plot of new variable
gap2007 %>%
  ggplot(aes(x = log_pop)) +
  geom_density()
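A log transformation pulls in the long right tail of a skewed variable. One rough way to see this without a plot is to compare the mean with the median before and after the transformation. The sketch below uses the first ten pop values from the str(gap2007) output; the helper mean_to_median() is a hypothetical name, and the ratio is only an informal indicator of skew.

```r
# First ten population values from gap2007 (see the str(gap2007) output)
pop <- c(31889923, 3600523, 33333216, 12420476, 40301927,
         20434176, 8199783, 708573, 150448339, 10392226)

# Hypothetical helper: a ratio far above 1 hints at a long right tail
mean_to_median <- function(v) mean(v) / median(v)

# On the raw scale the mean sits well above the median
mean_to_median(pop)

# On the log scale the mean and the median are close together
mean_to_median(log(pop))
```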

Identify outliers

# Filter for Asia, add column indicating outliers
str(gapminder)

## tibble [1,704 x 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...

gap_asia <- gap2007 %>%
  filter(continent == "Asia") %>%
  mutate(is_outlier = lifeExp < 50)

# Remove outliers, create box plot of lifeExp
gap_asia %>%
  filter(!is_outlier) %>%
  ggplot(aes(x = 1, y = lifeExp)) +
  geom_boxplot()

Spam and num_char

# ggplot2, dplyr, and openintro are loaded

# Compute summary statistics
email %>%
  group_by(spam) %>%
  summarize( 
    median(num_char),
    IQR(num_char))

## # A tibble: 2 x 3
##   spam  `median(num_char)` `IQR(num_char)`
##   <fct>              <dbl>           <dbl>
## 1 0                   6.83           13.6 
## 2 1                   1.05            2.82

str(email)

## tibble [3,921 x 21] (S3: tbl_df/tbl/data.frame)
##  $ spam        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ to_multiple : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 1 1 ...
##  $ from        : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ cc          : int [1:3921] 0 0 0 0 0 0 0 1 0 0 ...
##  $ sent_email  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 1 1 ...
##  $ time        : POSIXct[1:3921], format: "2012-01-01 13:16:41" "2012-01-01 14:03:59" ...
##  $ image       : num [1:3921] 0 0 0 0 0 0 0 1 0 0 ...
##  $ attach      : num [1:3921] 0 0 0 0 0 0 0 1 0 0 ...
##  $ dollar      : num [1:3921] 0 0 4 0 0 0 0 0 0 0 ...
##  $ winner      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ inherit     : num [1:3921] 0 0 1 0 0 0 0 0 0 0 ...
##  $ viagra      : num [1:3921] 0 0 0 0 0 0 0 0 0 0 ...
##  $ password    : num [1:3921] 0 0 0 0 2 2 0 0 0 0 ...
##  $ num_char    : num [1:3921] 11.37 10.5 7.77 13.26 1.23 ...
##  $ line_breaks : int [1:3921] 202 202 192 255 29 25 193 237 69 68 ...
##  $ format      : Factor w/ 2 levels "0","1": 2 2 2 2 1 1 2 2 1 2 ...
##  $ re_subj     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ exclaim_subj: num [1:3921] 0 0 0 0 0 0 0 0 0 0 ...
##  $ urgent_subj : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ exclaim_mess: num [1:3921] 0 1 6 48 1 1 1 18 1 0 ...
##  $ number      : Factor w/ 3 levels "none","small",..: 3 2 2 2 1 1 3 2 2 2 ...

table(email$spam)

## 
##    0    1 
## 3554  367

email <- email %>%
  mutate(spam = factor(ifelse(spam == 0, "not-spam", "spam")))

# Create plot
email %>%
  mutate(log_num_char = log(num_char)) %>%
  ggplot(aes(x = spam, y = log_num_char)) +
  geom_boxplot()

Spam and !!!

# Compute center and spread for exclaim_mess by spam
email %>%
  group_by(spam) %>%
  summarize(
    median(exclaim_mess),
    IQR(exclaim_mess))

## # A tibble: 2 x 3
##   spam     `median(exclaim_mess)` `IQR(exclaim_mess)`
##   <fct>                     <dbl>               <dbl>
## 1 not-spam                      1                   5
## 2 spam                          0                   1

table(email$exclaim_mess)

## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 1435  733  507  128  190  113  115   51   93   45   85   17   56   20   43   11 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31 
##   29   12   26    5   29    9   15    3   11    6   11    1    6    8   13   12 
##   32   33   34   35   36   38   39   40   41   42   43   44   45   46   47   48 
##   13    3    3    2    3    3    1    2    1    1    3    3    5    3    2    1 
##   49   52   54   55   57   58   62   71   75   78   89   94   96  139  148  157 
##    3    1    1    4    2    2    2    1    1    1    1    1    1    1    1    1 
##  187  454  915  939  947 1197 1203 1209 1236 
##    1    1    1    1    1    1    2    1    1

# Create plot for spam and exclaim_mess
email %>%
  mutate(log_exclaim_mess = log(exclaim_mess)) %>%
  ggplot(aes(x = log_exclaim_mess)) + 
  geom_histogram() + 
  facet_wrap(~ spam)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 1435 rows containing non-finite values (stat_bin).
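The warning above is a consequence of taking the log of zero: the 1435 removed rows match the 1435 emails with exclaim_mess equal to 0 in the table. The sketch below shows the effect on the first ten exclaim_mess values from the str(email) output. Adding a small constant before taking the log is one common workaround; the value 0.01 is an assumption chosen for illustration, not a value used in the lab.

```r
# First ten exclaim_mess values from the email dataset (see str(email))
exclaim_mess <- c(0, 1, 6, 48, 1, 1, 1, 18, 1, 0)

# log(0) is -Inf, which ggplot2 drops as a non-finite value
log_raw <- log(exclaim_mess)
log_raw
sum(!is.finite(log_raw))  # number of values that would be dropped

# Shifting by a small constant (assumed: 0.01) keeps the zeros in the plot
log_shift <- log(exclaim_mess + 0.01)
all(is.finite(log_shift))
```

With the shift, emails without any exclamation marks appear as a separate spike at the far left of the histogram instead of disappearing.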

Collapsing levels

table(email$image)

## 
##    0    1    2    3    4    5    9   20 
## 3811   76   17   11    2    2    1    1

# Create plot of proportion of spam by image
email %>%
  mutate(has_image = image > 0) %>%
  ggplot(aes(x = has_image, fill = spam)) +
  geom_bar(position = "fill")

Data Integrity

# Test if images count as attachments
sum(email$image > email$attach)

## [1] 0

Answering questions with chains

## Within non-spam emails, is the typical length of emails shorter for 
## those that were sent to multiple people?
email %>%
   filter(spam == "not-spam") %>%
   group_by(to_multiple) %>%
   summarize(median(num_char))

## # A tibble: 2 x 2
##   to_multiple `median(num_char)`
##   <fct>                    <dbl>
## 1 0                         7.20
## 2 1                         5.36

# Question 1
## For emails containing the word "dollar", does the typical spam email 
## contain a greater number of occurrences of the word than the typical non-spam email?
table(email$dollar)

## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 3175  120  151   10  146   20   44   12   35   10   22   10   20    7   14    5 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   32 
##   23    2   14    1   10    7   12    7    7    3    7    1    5    1    1    2 
##   34   36   40   44   46   48   54   63   64 
##    1    2    3    3    2    1    1    1    3

email %>%
  filter(dollar > 0) %>%
  group_by(spam) %>%
  summarize(median(dollar))

## # A tibble: 2 x 2
##   spam     `median(dollar)`
##   <fct>               <dbl>
## 1 not-spam                4
## 2 spam                    2

# Question 2
## If you encounter an email with greater than 10 occurrences of the word "dollar", 
## is it more likely to be spam or not-spam?

email %>%
  filter(dollar > 10) %>%
  ggplot(aes(x = spam)) +
  geom_bar()

What’s in a number?

levels(email$number)

## [1] "none"  "small" "big"

table(email$number)

## 
##  none small   big 
##   549  2827   545

# Reorder levels
email$number <- factor(email$number, levels = c("none","small","big"))

# Construct plot of number
ggplot(email, aes(x = number)) +
  geom_bar() + 
  facet_wrap( ~ spam)

Session6_LAB

Tifara Beata Wibowo

2022-06-02
