#source('create_datasets.R')
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(openintro)
## Warning: package 'openintro' was built under R version 4.1.3
## Loading required package: airports
## Warning: package 'airports' was built under R version 4.1.3
## Loading required package: cherryblossom
## Warning: package 'cherryblossom' was built under R version 4.1.3
## Loading required package: usdata
## Warning: package 'usdata' was built under R version 4.1.3
cars <- read.csv("https://assets.datacamp.com/production/course_1796/datasets/cars04.csv")
comics <- read.csv("https://assets.datacamp.com/production/course_1796/datasets/comics.csv")
life <- read.csv("https://assets.datacamp.com/production/course_1796/datasets/life_exp_raw.csv")
Explanation
Here we load the readr, dplyr, ggplot2, and openintro libraries, then create the variables cars, comics, and life, each holding the data read from one of the CSV files used below.
# Print the first rows of the data
head(comics)
## name id align eye hair
## 1 Spider-Man (Peter Parker) Secret Good Hazel Eyes Brown Hair
## 2 Captain America (Steven Rogers) Public Good Blue Eyes White Hair
## 3 Wolverine (James \\"Logan\\" Howlett) Public Neutral Blue Eyes Black Hair
## 4 Iron Man (Anthony \\"Tony\\" Stark) Public Good Blue Eyes Black Hair
## 5 Thor (Thor Odinson) No Dual Good Blue Eyes Blond Hair
## 6 Benjamin Grimm (Earth-616) Public Good Blue Eyes No Hair
## gender gsm alive appearances first_appear publisher
## 1 Male <NA> Living Characters 4043 Aug-62 marvel
## 2 Male <NA> Living Characters 3360 Mar-41 marvel
## 3 Male <NA> Living Characters 3061 Oct-74 marvel
## 4 Male <NA> Living Characters 2961 Mar-63 marvel
## 5 Male <NA> Living Characters 2258 Nov-50 marvel
## 6 Male <NA> Living Characters 2255 Nov-61 marvel
Explanation
head() displays the first 6 rows. The output shows 6 rows and 11 columns, and each column has its own data type.
Character columns: name, id, align, eye, hair, gender, gsm, alive, first_appear, publisher
Integer column: appearances
# Convert to factor
comicsalign <- as.factor(comics$align)
# Check levels of align
levels(comicsalign)
## [1] "Bad" "Good" "Neutral"
## [4] "Reformed Criminals"
Explanation
levels() reports the names of the levels of a factor. The levels of align are "Bad", "Good", "Neutral", and "Reformed Criminals".
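As a minimal sketch of how as.factor() and levels() work together, here is a made-up vector (align_chr and align_fct are hypothetical names, and the values are illustrative rather than taken from comics):

```r
# Hypothetical toy vector standing in for comics$align
align_chr <- c("Good", "Bad", "Neutral", "Bad", "Good")

# as.factor() stores the distinct values as levels, sorted alphabetically
align_fct <- as.factor(align_chr)

levels(align_fct)   # the distinct categories
nlevels(align_fct)  # how many categories there are
```

The levels are stored once, separately from the data, which is why a factor can keep a level even when no observation uses it.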
# Convert to factor
comicsgender <- as.factor(comics$gender)
# Check the levels of gender
levels(comicsgender)
## [1] "Female" "Male" "Other"
Explanation
The levels of gender are "Female", "Male", and "Other".
# Create a 2-way contingency table
table(comics$align, comics$gender)
##
## Female Male Other
## Bad 1573 7561 32
## Good 2490 4809 17
## Neutral 836 1799 17
## Reformed Criminals 1 2 0
Explanation
table() gives a tabular view of the categorical data, with the levels of each variable as labels and the counts as cell values.
Frequencies:
The smallest counts are in the Reformed Criminals row.
# Create and print tab
tab <- table(comics$align, comics$gender)
tab
##
## Female Male Other
## Bad 1573 7561 32
## Good 2490 4809 17
## Neutral 836 1799 17
## Reformed Criminals 1 2 0
# Remove align level
comics <- comics %>%
filter(align != 'Reformed Criminals') %>%
droplevels()
levels(comicsalign)
## [1] "Bad" "Good" "Neutral"
## [4] "Reformed Criminals"
tab
##
## Female Male Other
## Bad 1573 7561 32
## Good 2490 4809 17
## Neutral 836 1799 17
## Reformed Criminals 1 2 0
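Explanation
The table shows that 'Reformed Criminals' has by far the fewest observations, so we filter those rows out of comics. Note that comics$align is a character column, so droplevels() has nothing to drop here, and both comicsalign and tab were created before the filter. That is why the two printouts above still list 'Reformed Criminals'; once they are recomputed from the filtered comics, only "Bad", "Good", and "Neutral" remain.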
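The difference between filtering rows and dropping levels can be sketched on a small made-up data frame (toy, align_before, align_after, and tab_after are hypothetical names; the rows are not from comics). It uses base R's subset() in place of dplyr's filter() so that it runs without extra packages:

```r
# Hypothetical toy data; the column is character, as read.csv() returns in R >= 4.0
toy <- data.frame(
  align  = c("Bad", "Good", "Reformed Criminals", "Good"),
  gender = c("Male", "Female", "Male", "Male")
)

# A factor copied out *before* filtering keeps all of its levels
align_before <- as.factor(toy$align)

# Filtering removes the rows; droplevels() only affects factor columns
toy <- droplevels(subset(toy, align != "Reformed Criminals"))

# Rebuild the factor and the table from the filtered data
align_after <- as.factor(toy$align)
tab_after   <- table(toy$align, toy$gender)

levels(align_before)  # still includes "Reformed Criminals"
levels(align_after)   # only "Bad" and "Good"
tab_after
```

Objects derived from the data are snapshots, so they have to be rebuilt after the data changes.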
# Load ggplot2
# Create side-by-side barchart of gender by alignment
ggplot(comics, aes(x = align, fill = gender))+ geom_bar(position = "dodge")
Explanation
Here we use ggplot2 to draw a side-by-side bar chart with the align levels on the x-axis and gender as the fill. Male has the highest count in every align level, most of all in "Bad" (7561 according to the table above).
# Create side-by-side barchart of alignment by gender
ggplot(comics, aes(x = gender, fill = align)) + geom_bar(position = "dodge") + theme(axis.text.x = element_text(angle = 90))
Explanation
This side-by-side bar chart shows the same data as the one above with the roles of the two variables swapped, and the x-axis labels are rotated 90 degrees to aid readability.
# simplify display format
options(scipen = 999, digits = 3)
# Create table of counts
tbl_cnt <- table(comics$id, comics$align)
tbl_cnt
##
## Bad Good Neutral
## No Dual 474 647 390
## Public 2172 2930 965
## Secret 4493 2475 959
## Unknown 7 0 2
Explanation
Here we build a table of counts. Bad characters with a Secret identity have the highest frequency (4493), and Good characters with a Public identity the second highest (2930).
# Proportional table
# All values add up to 1
prop.table(tbl_cnt)
##
## Bad Good Neutral
## No Dual 0.030553 0.041704 0.025139
## Public 0.140003 0.188862 0.062202
## Secret 0.289609 0.159533 0.061815
## Unknown 0.000451 0.000000 0.000129
Explanation: prop.table() expresses each cell of the table as a proportion of the grand total, so all cells together sum to 1.
sum(prop.table(tbl_cnt))
## [1] 1
# All rows add up to 1
prop.table(tbl_cnt, 1)
##
## Bad Good Neutral
## No Dual 0.314 0.428 0.258
## Public 0.358 0.483 0.159
## Secret 0.567 0.312 0.121
## Unknown 0.778 0.000 0.222
# Columns add up to 1
prop.table(tbl_cnt, 2)
##
## Bad Good Neutral
## No Dual 0.066331 0.106907 0.168394
## Public 0.303946 0.484137 0.416667
## Secret 0.628743 0.408956 0.414076
## Unknown 0.000980 0.000000 0.000864
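The three calls to prop.table() above differ only in what each cell is divided by. A minimal sketch on a made-up 2 x 2 table of counts (cnt, joint, by_row, and by_col are hypothetical names; the numbers are not from comics):

```r
# Hypothetical 2 x 2 table of counts, filled column by column
cnt <- matrix(c(10, 30, 20, 40), nrow = 2,
              dimnames = list(id = c("Public", "Secret"),
                              align = c("Bad", "Good")))

joint  <- prop.table(cnt)     # each cell divided by the grand total
by_row <- prop.table(cnt, 1)  # each cell divided by its row total
by_col <- prop.table(cnt, 2)  # each cell divided by its column total

sum(joint)       # the whole table sums to 1
rowSums(by_row)  # every row sums to 1
colSums(by_col)  # every column sums to 1
```

The second argument names the margin that is conditioned on: 1 for rows, 2 for columns.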
ggplot(comics, aes(x = id, fill = align)) + geom_bar(position = "fill") + ylab("proportion")
Explanation: Each bar is one id level, filled by the share of each align. Among Secret characters, Bad is the largest group (about 57%); among Public characters, Good is the largest (about 48%).
ggplot(comics, aes(x = align, fill = id)) + geom_bar(position = "fill") + ylab("proportion")
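Explanation: Most Bad characters have a Secret identity (about 63%), while Good characters are most often Public (about 48%). Neutral characters are split almost evenly between Public and Secret (41.7% vs 41.4%).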
tab <- table(comics$align, comics$gender)
options(scipen = 999, digits = 3) # Print fewer digits
prop.table(tab) # Joint proportions
##
## Female Male Other
## Bad 0.082210 0.395160 0.001672
## Good 0.130135 0.251333 0.000888
## Neutral 0.043692 0.094021 0.000888
prop.table(tab, 2)
##
## Female Male Other
## Bad 0.321 0.534 0.485
## Good 0.508 0.339 0.258
## Neutral 0.171 0.127 0.258
# Plot of gender by align
ggplot(comics, aes(x = align, fill = gender)) +
geom_bar()
# Plot proportion of gender, conditional on align
ggplot(comics, aes(x = align, fill = gender)) +
geom_bar(position = "fill")
# Can use table function on just one variable
# This is called a marginal distribution
table(comics$id)
##
## No Dual Public Secret Unknown
## 1511 6067 7927 9
Explanation
In comics, the id variable has 4 levels: No Dual, Public, Secret, and Unknown.
# Simple barchart
ggplot(comics, aes(x = id)) +
geom_bar()
Explanation: Secret has the most observations and Unknown the fewest.
We facet by alignment rather than coloring the stack.
This can make it a little easier to answer some questions.
ggplot(comics, aes(x = id)) +
geom_bar() +
facet_wrap(~align)
Explanation: Within Bad, Secret has the highest frequency; within Good, Public has the highest; within Neutral, Public is also the highest, narrowly ahead of Secret.
# Change the order of the levels in align
comics$align <- factor(comics$align,
levels = c("Bad", "Neutral", "Good"))
# Create plot of align
ggplot(comics, aes(x = align)) +
geom_bar()
Explanation: Here we reorder the levels so that Neutral sits between Bad and Good.
### – Conditional barchart
# Plot of alignment broken down by gender
ggplot(comics, aes(x = align)) +
geom_bar() +
facet_wrap(~ gender)
Explanation:
Highest frequency by gender: Female -> Good, Male -> Bad.
pies <- data.frame(flavor = as.factor(rep(c("apple", "blueberry", "boston creme", "cherry", "key lime", "pumpkin", "strawberry"), times = c(17, 14, 15, 13, 16, 12, 11))))
# Put levels of flavor in descending order
lev <- c("apple", "key lime", "boston creme", "blueberry", "cherry", "pumpkin", "strawberry")
pies$flavor <- factor(pies$flavor, levels = lev)
head(pies$flavor)
## [1] apple apple apple apple apple apple
## Levels: apple key lime boston creme blueberry cherry pumpkin strawberry
# Create barchart of flavor
ggplot(pies, aes(x = flavor)) +
geom_bar(fill = "chartreuse") +
theme(axis.text.x = element_text(angle = 90))
Explanation: Here we order the flavor levels from the highest count to the lowest and display them as a bar chart.
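The level order above was typed by hand. As a sketch of one way to derive it from the data instead, the example below sorts the counts and reuses their names as levels (the flavors and counts here are made up for illustration):

```r
# Hypothetical flavors and counts, for illustration only
flavor <- rep(c("apple", "cherry", "key lime"), times = c(3, 1, 2))

# Count each flavor, sort the counts from largest to smallest,
# and reuse the resulting names as the level order
lev <- names(sort(table(flavor), decreasing = TRUE))
flavor <- factor(flavor, levels = lev)

levels(flavor)
```

A bar chart of such a factor draws its bars in level order, so the tallest bar comes first.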
# A dot plot shows all the datapoints
ggplot(cars, aes(x = weight)) +
geom_dotplot(dotsize = 0.4)
## Bin width defaults to 1/30 of the range of the data. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (stat_bindot).
# A histogram groups the points into bins so it does not get overwhelming
ggplot(cars, aes(x = weight)) +
geom_histogram(binwidth = 500)
## Warning: Removed 2 rows containing non-finite values (stat_bin).
# A density plot gives a bigger picture representation of the distribution
# It is more helpful when there is a lot of data
ggplot(cars, aes(x = weight)) +
geom_density()
## Warning: Removed 2 rows containing non-finite values (stat_density).
# A boxplot is a good way to just show the summary info of the distribution
ggplot(cars, aes(x = 1, y = weight)) +
geom_boxplot() +
coord_flip()
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
Explanation: Here we draw a dot plot (with dotsize = 0.4) and a histogram of the weight variable from cars; dotsize is a geom_dotplot() parameter and has no effect on a histogram.
We also draw a density plot to judge the shape of the distribution. It does not look normal: the bulk of the values sits on the left, with a longer tail to the right.
The box plot also suggests that the data are not normally distributed.
# Load package
library(ggplot2)
# Learn data structure
str(cars)
## 'data.frame': 428 obs. of 19 variables:
## $ name : chr "Chevrolet Aveo 4dr" "Chevrolet Aveo LS 4dr hatch" "Chevrolet Cavalier 2dr" "Chevrolet Cavalier 4dr" ...
## $ sports_car : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ suv : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ wagon : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ minivan : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ pickup : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ all_wheel : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ rear_wheel : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ msrp : int 11690 12585 14610 14810 16385 13670 15040 13270 13730 15460 ...
## $ dealer_cost: int 10965 11802 13697 13884 15357 12849 14086 12482 12906 14496 ...
## $ eng_size : num 1.6 1.6 2.2 2.2 2.2 2 2 2 2 2 ...
## $ ncyl : int 4 4 4 4 4 4 4 4 4 4 ...
## $ horsepwr : int 103 103 140 140 140 132 132 130 110 130 ...
## $ city_mpg : int 28 28 26 26 26 29 29 26 27 26 ...
## $ hwy_mpg : int 34 34 37 37 37 36 36 33 36 33 ...
## $ weight : int 2370 2348 2617 2676 2617 2581 2626 2612 2606 2606 ...
## $ wheel_base : int 98 98 104 104 104 105 105 103 103 103 ...
## $ length : int 167 153 183 183 183 174 174 168 168 168 ...
## $ width : int 66 66 69 68 69 67 67 67 67 67 ...
Explanation: str() shows the size and structure of the data: 428 observations of 19 variables.
Data types:
- name -> character
- sports_car, suv, wagon, minivan, pickup, all_wheel, rear_wheel -> logical (TRUE, FALSE)
- msrp, dealer_cost, ncyl, horsepwr, city_mpg, hwy_mpg, weight, wheel_base, length, width -> integer
- eng_size -> numeric
# Create faceted histogram
ggplot(cars, aes(x = city_mpg)) +
geom_histogram() +
facet_wrap(~ suv)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 14 rows containing non-finite values (stat_bin).
Explanation: We plot a histogram of city_mpg from cars, faceted by suv (whether the car is an SUV). Most cars in the data are not SUVs.
unique(cars$ncyl)
## [1] 4 6 3 8 5 12 10 -1
Explanation: ncyl has 8 unique values: 4, 6, 3, 8, 5, 12, 10, and -1. Since -1 is not a valid cylinder count, it likely marks missing data.
table(cars$ncyl)
##
## -1 3 4 5 6 8 10 12
## 2 1 136 7 190 87 2 3
Explanation: This is the frequency of each value of ncyl.
# Filter cars with 4, 6, 8 cylinders
common_cyl <- filter(cars, ncyl %in% c(4,6,8))
# Create box plots of city mpg by ncyl
ggplot(common_cyl, aes(x = as.factor(ncyl), y = city_mpg)) +
geom_boxplot()
## Warning: Removed 11 rows containing non-finite values (stat_boxplot).
# Create overlaid density plots for same data
ggplot(common_cyl, aes(x = city_mpg, fill = as.factor(ncyl))) +
geom_density(alpha = .3)
## Warning: Removed 11 rows containing non-finite values (stat_density).
Explanation:
We draw side-by-side box plots of city_mpg for cars with 4, 6, or 8 cylinders.
We also draw overlaid density plots of the same data.
4-cylinder cars have the highest city mileage, compared with 6- and 8-cylinder cars.
# Create hist of horsepwr
cars %>%
ggplot(aes(horsepwr)) +
geom_histogram() +
ggtitle("Horsepower distribution")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Create hist of horsepwr for affordable cars
cars %>%
filter(msrp < 25000) %>%
ggplot(aes(horsepwr)) +
geom_histogram() +
xlim(c(90, 550)) +
ggtitle("Horsepower distribution for msrp < 25000")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
Explanation:
Here we draw a histogram of horsepwr for all cars and another for affordable cars (msrp < 25000). Comparing the two, even the most powerful affordable car has less than 250 horsepower.
# Create hist of horsepwr with binwidth of 3
cars %>%
ggplot(aes(horsepwr)) +
geom_histogram(binwidth = 3) +
ggtitle("binwidth = 3")
# Create hist of horsepwr with binwidth of 30
cars %>%
ggplot(aes(horsepwr)) +
geom_histogram(binwidth = 30) +
ggtitle("binwidth = 30")
# Create hist of horsepwr with binwidth of 60
cars %>%
ggplot(aes(horsepwr)) +
geom_histogram(binwidth = 60) +
ggtitle("binwidth = 60")
Explanation: These are histograms of horsepwr with binwidths of 3, 30, and 60. The overall shape looks approximately normal.
# Construct box plot of msrp
cars %>%
ggplot(aes(x = 1, y = msrp)) +
geom_boxplot()
# Exclude outliers from data
cars_no_out <- cars %>%
filter(msrp < 100000)
# Construct box plot of msrp using the reduced dataset
cars_no_out %>%
ggplot(aes(x = 1, y = msrp)) +
geom_boxplot()
Explanation: These box plots are used to look for outliers in msrp. The first shows many outliers above roughly 50000, so we reduce the data set with filter(msrp < 100000); the second box plot still shows outliers.
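A common rule of thumb, and the one box plots are built on, flags points that lie more than 1.5 times the IQR beyond the quartiles. A minimal sketch on made-up prices (msrp_toy and fence are hypothetical names, and the values are not from cars):

```r
# Hypothetical prices; the last value is deliberately extreme
msrp_toy <- c(12000, 15000, 18000, 21000, 24000, 27000, 30000, 120000)

# Upper fence: third quartile plus 1.5 times the IQR
q3    <- unname(quantile(msrp_toy, 0.75))
fence <- q3 + 1.5 * IQR(msrp_toy)

msrp_toy[msrp_toy > fence]   # values flagged as high outliers
boxplot.stats(msrp_toy)$out  # what base R's boxplot() would flag
```

A fixed cutoff such as msrp < 100000 is simpler, but the fence adapts to the spread of whatever data it is given.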
# Create plot of city_mpg
cars %>%
ggplot(aes(x = 1, y = city_mpg)) +
geom_boxplot()
## Warning: Removed 14 rows containing non-finite values (stat_boxplot).
cars %>%
ggplot(aes(city_mpg)) +
geom_density()
## Warning: Removed 14 rows containing non-finite values (stat_density).
Explanation: Here we draw a box plot of city_mpg, which shows many outliers, and a density plot of city_mpg, which is not normally distributed: the bulk of the values sits on the left, with a longer tail to the right.
# Create plot of width
cars %>%
ggplot(aes(x = 1, y = width)) +
geom_boxplot()
## Warning: Removed 28 rows containing non-finite values (stat_boxplot).
cars %>%
ggplot(aes(x = width)) +
geom_density()
## Warning: Removed 28 rows containing non-finite values (stat_density).
Explanation
Here we draw a box plot and a density plot of width. The box plot shows a few outliers, and the density plot shows that the data are not normally distributed.
# Facet hists using hwy mileage and ncyl
common_cyl %>%
ggplot(aes(x = hwy_mpg)) +
geom_histogram() +
facet_grid(ncyl ~ suv) +
ggtitle("hwy_mpg by ncyl and suv")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 11 rows containing non-finite values (stat_bin).
head(life)
## State County fips Year Female.life.expectancy..years.
## 1 Alabama Autauga County 1001 1985 77.0
## 2 Alabama Baldwin County 1003 1985 78.8
## 3 Alabama Barbour County 1005 1985 76.0
## 4 Alabama Bibb County 1007 1985 76.6
## 5 Alabama Blount County 1009 1985 78.9
## 6 Alabama Bullock County 1011 1985 75.1
## Female.life.expectancy..national..years.
## 1 77.8
## 2 77.8
## 3 77.8
## 4 77.8
## 5 77.8
## 6 77.8
## Female.life.expectancy..state..years. Male.life.expectancy..years.
## 1 76.9 68.1
## 2 76.9 71.1
## 3 76.9 66.8
## 4 76.9 67.3
## 5 76.9 70.6
## 6 76.9 66.6
## Male.life.expectancy..national..years. Male.life.expectancy..state..years.
## 1 70.8 69.1
## 2 70.8 69.1
## 3 70.8 69.1
## 4 70.8 69.1
## 5 70.8 69.1
## 6 70.8 69.1
x <- head(round(life$Female.life.expectancy..years.), 11)
x
## [1] 77 79 76 77 79 75 77 77 77 78 77
Explanation: The data set has 10 variables. Among the first 11 rounded values of female life expectancy, 77 years is the most common (it appears 6 times).
sum(x)/11
## [1] 77.2
mean(x)
## [1] 77.2
Explanation: The mean confirms this: the average female life expectancy across these 11 values is 77.2 years, whether computed as sum(x)/11 or with mean(x).
sort(x)
## [1] 75 76 77 77 77 77 77 77 78 79 79
Explanation: sort() orders the values from smallest to largest.
median(x)
## [1] 77
Explanation: The median (middle value) is 77.
table(x)
## x
## 75 76 77 78 79
## 1 1 6 1 2
Explanation: The most frequent value is 77, which occurs 6 times.
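Base R has no built-in function for the statistical mode, but it can be read off the frequency table. A small sketch using the 11 values printed above (x_toy, counts, and mode_value are hypothetical names):

```r
# The 11 rounded values printed above, retyped so the sketch is self-contained
x_toy <- c(77, 79, 76, 77, 79, 75, 77, 77, 77, 78, 77)

counts <- table(x_toy)

# which.max() returns the position of the largest count;
# the matching name is the most frequent value
mode_value <- as.numeric(names(counts)[which.max(counts)])

mode_value   # the most frequent value
max(counts)  # how often it occurs
```

If several values tie for the highest count, which.max() returns only the first of them.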
library(gapminder)
## Warning: package 'gapminder' was built under R version 4.1.3
str(gapminder)
## tibble [1,704 x 6] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
Explanation: There are 1704 observations and 6 variables, with factor, integer, and numeric data types.
country has 142 levels (unique values), while continent has 5.
# Create dataset of 2007 data
gap2007 <- filter(gapminder, year == 2007)
# Compute groupwise mean and median lifeExp
gap2007 %>%
group_by(continent) %>%
summarize(mean(lifeExp),
median(lifeExp))
## # A tibble: 5 x 3
## continent `mean(lifeExp)` `median(lifeExp)`
## <fct> <dbl> <dbl>
## 1 Africa 54.8 52.9
## 2 Americas 73.6 72.9
## 3 Asia 70.7 72.4
## 4 Europe 77.6 78.6
## 5 Oceania 80.7 80.7
# Generate box plots of lifeExp for each continent
gap2007 %>%
ggplot(aes(x = continent, y = lifeExp)) +
geom_boxplot()
Explanation: We take the 2007 data and compute the mean and median lifeExp per continent. Africa has the lowest mean and median lifeExp (54.8 and 52.9), while Oceania has the highest (mean 80.7, median 80.7). We also draw box plots to visualize this.
x
## [1] 77 79 76 77 79 75 77 77 77 78 77
# Look at the difference between each point and the mean
sum(x - mean(x))
## [1] -0.0000000000000568
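The deviations from the mean cancel out, so their sum is zero up to floating-point error. We want a measure of spread that does not cancel.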
# Square each difference to get rid of negatives then sum
sum((x - mean(x))^2)
## [1] 13.6
sum((x - mean(x))^2)/(length(x) - 1)
## [1] 1.36
var(x)
## [1] 1.36
sqrt(sum((x - mean(x))^2)/(length(x) - 1))
## [1] 1.17
sd(x)
## [1] 1.17
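Written out, the steps above compute the sample variance and the sample standard deviation. As a worked sketch using the rounded values printed above (n = 11 and a sum of squared deviations of 13.6):

```latex
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2
    = \frac{13.6}{11-1} = 1.36,
\qquad
s = \sqrt{s^2} = \sqrt{1.36} \approx 1.17
```

Dividing by n - 1 rather than n is what var() and sd() do, which is why the manual results match them.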
summary(x)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 75.0 77.0 77.0 77.2 77.5 79.0
IQR(x)
## [1] 0.5
max(x)
## [1] 79
min(x)
## [1] 75
diff(range(x))
## [1] 4
str(gap2007)
## tibble [142 x 6] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 4 1 1 2 5 4 3 3 4 ...
## $ year : int [1:142] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
## $ lifeExp : num [1:142] 43.8 76.4 72.3 42.7 75.3 ...
## $ pop : int [1:142] 31889923 3600523 33333216 12420476 40301927 20434176 8199783 708573 150448339 10392226 ...
## $ gdpPercap: num [1:142] 975 5937 6223 4797 12779 ...
Explanation: gap2007 has 142 observations and 6 variables, with factor, integer, and numeric data types.
# Compute groupwise measures of spread
gap2007 %>%
group_by(continent) %>%
summarize(sd(lifeExp),
IQR(lifeExp),
n())
## # A tibble: 5 x 4
## continent `sd(lifeExp)` `IQR(lifeExp)` `n()`
## <fct> <dbl> <dbl> <int>
## 1 Africa 9.63 11.6 52
## 2 Americas 4.44 4.63 25
## 3 Asia 7.96 10.2 33
## 4 Europe 2.98 4.78 30
## 5 Oceania 0.729 0.516 2
Explanation: Africa has both the highest sd and the highest IQR, and it is also the continent with the lowest lifeExp. Oceania has the lowest sd and IQR and the highest lifeExp, though it contributes only 2 countries.
# Generate overlaid density plots
gap2007 %>%
ggplot(aes(x = lifeExp, fill = continent)) +
geom_density(alpha = 0.3)
Explanation: The density plot shows Oceania concentrated at the highest lifeExp values of all the continents.
# Preview the 2007 data
head(gap2007)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2007 43.8 31889923 975.
## 2 Albania Europe 2007 76.4 3600523 5937.
## 3 Algeria Africa 2007 72.3 33333216 6223.
## 4 Angola Africa 2007 42.7 12420476 4797.
## 5 Argentina Americas 2007 75.3 40301927 12779.
## 6 Australia Oceania 2007 81.2 20434176 34435.
Explanation: Among these first six rows, Australia (Oceania) has the highest lifeExp (81.2) and Angola the lowest (42.7) in 2007.
# Compute stats for lifeExp in Americas
gap2007 %>%
filter(continent == "Americas") %>%
summarize(mean(lifeExp),
sd(lifeExp))
## # A tibble: 1 x 2
## `mean(lifeExp)` `sd(lifeExp)`
## <dbl> <dbl>
## 1 73.6 4.44
Explanation: This shows the mean and sd of lifeExp for the Americas.
# Compute stats for population
gap2007 %>%
summarize(median(pop),
IQR(pop))
## # A tibble: 1 x 2
## `median(pop)` `IQR(pop)`
## <dbl> <dbl>
## 1 10517531 26702008.
Explanation: This shows the median and IQR of pop.
4 characteristics of a distribution that are of interest:
# Create density plot of old variable
gap2007 %>%
ggplot(aes(x = pop)) +
geom_density()
Explanation: The distribution is not normal; it is right-skewed.
# Transform the skewed pop variable
gap2007 <- gap2007 %>%
mutate(log_pop = log(pop))
# Create density plot of new variable
gap2007 %>%
ggplot(aes(x = log_pop)) +
geom_density()
Explanation: After the log transform, the distribution is much closer to normal and far less skewed.
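One rough way to see what the log transform does is to compare the mean and the median before and after. A minimal sketch on made-up values (pop_toy and log_pop_toy are hypothetical names; log10() is used for readability, while the analysis above uses the natural log, which differs only by a constant factor):

```r
# Hypothetical, strongly right-skewed values (powers of 10)
pop_toy <- c(1e3, 1e4, 1e5, 1e6, 1e7)

# On the raw scale the large values pull the mean far above the median
c(mean = mean(pop_toy), median = median(pop_toy))

# On the log scale the values are evenly spaced, so mean and median agree
log_pop_toy <- log10(pop_toy)
c(mean = mean(log_pop_toy), median = median(log_pop_toy))
```

A mean well above the median is a typical sign of right skew, and the gap closes once the long right tail is compressed.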
str(gapminder)
## tibble [1,704 x 6] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
Explanation: gapminder has 6 variables and 1704 observations, with factor, numeric, and integer data types.
# Filter for Asia, add column indicating outliers
gap_asia <- gap2007 %>%
filter(continent == "Asia") %>%
mutate(is_outlier = lifeExp < 50)
# Remove outliers, create box plot of lifeExp
gap_asia %>%
filter(!is_outlier) %>%
ggplot(aes(x = 1, y = lifeExp)) +
geom_boxplot()
# ggplot2, dplyr, and openintro are loaded
# Compute summary statistics
email %>%
group_by(spam) %>%
summarize(
median(num_char),
IQR(num_char))
## # A tibble: 2 x 3
## spam `median(num_char)` `IQR(num_char)`
## <fct> <dbl> <dbl>
## 1 0 6.83 13.6
## 2 1 1.05 2.82
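Explanation: The summary has 2 rows (one per level of spam) and 3 columns. Both the median and the IQR of num_char are larger for non-spam emails (spam = 0) than for spam emails (spam = 1).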
str(email)
## tibble [3,921 x 21] (S3: tbl_df/tbl/data.frame)
## $ spam : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ to_multiple : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 1 1 ...
## $ from : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ cc : int [1:3921] 0 0 0 0 0 0 0 1 0 0 ...
## $ sent_email : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 1 1 ...
## $ time : POSIXct[1:3921], format: "2012-01-01 13:16:41" "2012-01-01 14:03:59" ...
## $ image : num [1:3921] 0 0 0 0 0 0 0 1 0 0 ...
## $ attach : num [1:3921] 0 0 0 0 0 0 0 1 0 0 ...
## $ dollar : num [1:3921] 0 0 4 0 0 0 0 0 0 0 ...
## $ winner : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ inherit : num [1:3921] 0 0 1 0 0 0 0 0 0 0 ...
## $ viagra : num [1:3921] 0 0 0 0 0 0 0 0 0 0 ...
## $ password : num [1:3921] 0 0 0 0 2 2 0 0 0 0 ...
## $ num_char : num [1:3921] 11.37 10.5 7.77 13.26 1.23 ...
## $ line_breaks : int [1:3921] 202 202 192 255 29 25 193 237 69 68 ...
## $ format : Factor w/ 2 levels "0","1": 2 2 2 2 1 1 2 2 1 2 ...
## $ re_subj : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ exclaim_subj: num [1:3921] 0 0 0 0 0 0 0 0 0 0 ...
## $ urgent_subj : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ exclaim_mess: num [1:3921] 0 1 6 48 1 1 1 18 1 0 ...
## $ number : Factor w/ 3 levels "none","small",..: 3 2 2 2 1 1 3 2 2 2 ...
Explanation: The columns have several data types, including numeric, factor, integer, and date-time (POSIXct).
table(email$spam)
##
## 0 1
## 3554 367
Explanation: In the spam variable, level 0 has 3554 observations and level 1 has 367.
email <- email %>%
mutate(spam = factor(ifelse(spam == 0, "not-spam", "spam")))
# Create plot
email %>%
mutate(log_num_char = log(num_char)) %>%
ggplot(aes(x = spam, y = log_num_char)) +
geom_boxplot()
Explanation: Both not-spam and spam show some outliers.
- The median length of not-spam emails is greater than that of spam emails.
### – Spam and !!!
# Compute center and spread for exclaim_mess by spam
email %>%
group_by(spam) %>%
summarize(
median(exclaim_mess),
IQR(exclaim_mess))
## # A tibble: 2 x 3
## spam `median(exclaim_mess)` `IQR(exclaim_mess)`
## <fct> <dbl> <dbl>
## 1 not-spam 1 5
## 2 spam 0 1
Explanation: For not-spam the median is 1 and the IQR is 5; for spam the median is 0 and the IQR is 1.
table(email$exclaim_mess)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 1435 733 507 128 190 113 115 51 93 45 85 17 56 20 43 11
## 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
## 29 12 26 5 29 9 15 3 11 6 11 1 6 8 13 12
## 32 33 34 35 36 38 39 40 41 42 43 44 45 46 47 48
## 13 3 3 2 3 3 1 2 1 1 3 3 5 3 2 1
## 49 52 54 55 57 58 62 71 75 78 89 94 96 139 148 157
## 3 1 1 4 2 2 2 1 1 1 1 1 1 1 1 1
## 187 454 915 939 947 1197 1203 1209 1236
## 1 1 1 1 1 1 2 1 1
# Create plot for spam and exclaim_mess
email %>%
mutate(log_exclaim_mess = log(exclaim_mess)) %>%
ggplot(aes(x = log_exclaim_mess)) +
geom_histogram() +
facet_wrap(~ spam)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1435 rows containing non-finite values (stat_bin).
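The warning above arises because log(0) is -Inf, so the 1435 emails with no exclamation marks are dropped from the histogram. A minimal sketch of the problem and two common workarounds (exclaim_toy is a hypothetical vector; the offset of 0.01 and the use of log1p() are assumptions for illustration, not part of the original analysis):

```r
# Hypothetical counts, including a zero
exclaim_toy <- c(0, 1, 4, 20)

log(exclaim_toy)             # the first element is -Inf
is.finite(log(exclaim_toy))  # FALSE for the zero, TRUE otherwise

# Workaround 1: add a small offset so zeros stay on the plot
log(exclaim_toy + 0.01)

# Workaround 2: log1p(x) computes log(1 + x), which maps 0 to 0
log1p(exclaim_toy)
```

Either choice keeps the zero-count emails visible, at the cost of a slightly different x-axis scale.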
table(email$image)
##
## 0 1 2 3 4 5 9 20
## 3811 76 17 11 2 2 1 1
# Create plot of proportion of spam by image
email %>%
mutate(has_image = image > 0) %>%
ggplot(aes(x = has_image, fill = spam)) +
geom_bar(position = "fill")
Explanation: Whether or not an email contains an image, almost all emails are not-spam.
# Test if images count as attachments
sum(email$image > email$attach)
## [1] 0
## Within non-spam emails, is the typical length of emails shorter for
## those that were sent to multiple people?
email %>%
filter(spam == "not-spam") %>%
group_by(to_multiple) %>%
summarize(median(num_char))
## # A tibble: 2 x 2
## to_multiple `median(num_char)`
## <fct> <dbl>
## 1 0 7.20
## 2 1 5.36
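Explanation: Yes. Among not-spam emails, the median num_char is 7.20 for emails sent to one recipient (to_multiple = 0) and 5.36 for emails sent to multiple recipients (to_multiple = 1), which is lower.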
# Question 1
## For emails containing the word "dollar", does the typical spam email
## contain a greater number of occurrences of the word than the typical non-spam email?
table(email$dollar)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 3175 120 151 10 146 20 44 12 35 10 22 10 20 7 14 5
## 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 32
## 23 2 14 1 10 7 12 7 7 3 7 1 5 1 1 2
## 34 36 40 44 46 48 54 63 64
## 1 2 3 3 2 1 1 1 3
email %>%
filter(dollar > 0) %>%
group_by(spam) %>%
summarize(median(dollar))
## # A tibble: 2 x 2
## spam `median(dollar)`
## <fct> <dbl>
## 1 not-spam 4
## 2 spam 2
Explanation: No. Among emails that mention "dollar", the median count is 4 for not-spam and 2 for spam, so the typical not-spam email contains more occurrences.
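The filter, group, and summarize pattern used for this question can also be sketched in base R. The example below uses a made-up data frame (toy, has_dollar, and med are hypothetical names, and the counts are not from email):

```r
# Hypothetical toy emails (not the openintro data)
toy <- data.frame(
  spam   = c("spam", "spam", "spam", "not-spam", "not-spam", "not-spam"),
  dollar = c(0, 2, 4, 0, 6, 10)
)

# Keep only emails that mention "dollar", like filter(dollar > 0)
has_dollar <- toy[toy$dollar > 0, ]

# Median count per group, like group_by(spam) followed by summarize(median(dollar))
med <- tapply(has_dollar$dollar, has_dollar$spam, median)
med
```

Filtering first matters: with the zeros left in, the medians would describe all emails rather than only those that contain the word.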
# Question 2
## If you encounter an email with greater than 10 occurrences of the word "dollar",
## is it more likely to be spam or not-spam?
email %>%
filter(dollar > 10) %>%
ggplot(aes(x = spam)) +
geom_bar()
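Explanation: The bar chart shows that, among emails with more than 10 occurrences of "dollar", not-spam emails outnumber spam emails, so such an email is more likely to be not-spam.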
levels(email$number)
## [1] "none" "small" "big"
Explanation: number has 3 levels.
table(email$number)
##
## none small big
## 549 2827 545
Explanation: This shows the frequency of each level of number.
# Reorder levels
email$number <- factor(email$number, levels = c("none","small","big"))
# Construct plot of number
ggplot(email, aes(x = number)) +
geom_bar() +
facet_wrap( ~ spam)
### – What’s in a number interpretation
- Given that an email contains a small number, it is more likely to be not-spam.
- Given that an email contains a big number, it is more likely to be not-spam.
- Within both spam and not-spam, the most common number is a small one.
Explanation: Emails containing a small number are mostly not-spam.