Libraries and Data

#source('create_datasets.R')

library(readr)
## Warning: package 'readr' was built under R version 4.1.3
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(openintro)
## Warning: package 'openintro' was built under R version 4.1.3
## Loading required package: airports
## Warning: package 'airports' was built under R version 4.1.3
## Loading required package: cherryblossom
## Warning: package 'cherryblossom' was built under R version 4.1.3
## Loading required package: usdata
## Warning: package 'usdata' was built under R version 4.1.3
library(gapminder)
## Warning: package 'gapminder' was built under R version 4.1.3
cars <- read.csv("https://assets.datacamp.com/production/course_1796/datasets/cars04.csv")
comics <- read.csv("https://assets.datacamp.com/production/course_1796/datasets/comics.csv")
life <- read.csv("https://assets.datacamp.com/production/course_1796/datasets/life_exp_raw.csv")

Exploring Categorical Data

– Bar chart expectations

  • Bar charts with categorical variables on the x axis and in the fill are a common way to see a contingency table visually.
  • It essentialy what you would get if you used the table function with two variables
  • Which way you show the data can change the perception.
  • Which variable you use for the fill or the position of the bars (fill, dodge, stack) all can give different perceptions

- Contingency table review

#Print the first rows of the data
head(comics)
##                                    name      id   align        eye       hair
## 1             Spider-Man (Peter Parker)  Secret    Good Hazel Eyes Brown Hair
## 2       Captain America (Steven Rogers)  Public    Good  Blue Eyes White Hair
## 3 Wolverine (James \\"Logan\\" Howlett)  Public Neutral  Blue Eyes Black Hair
## 4   Iron Man (Anthony \\"Tony\\" Stark)  Public    Good  Blue Eyes Black Hair
## 5                   Thor (Thor Odinson) No Dual    Good  Blue Eyes Blond Hair
## 6            Benjamin Grimm (Earth-616)  Public    Good  Blue Eyes    No Hair
##   gender  gsm             alive appearances first_appear publisher
## 1   Male <NA> Living Characters        4043       Aug-62    marvel
## 2   Male <NA> Living Characters        3360       Mar-41    marvel
## 3   Male <NA> Living Characters        3061       Oct-74    marvel
## 4   Male <NA> Living Characters        2961       Mar-63    marvel
## 5   Male <NA> Living Characters        2258       Nov-50    marvel
## 6   Male <NA> Living Characters        2255       Nov-61    marvel

There are 11 attributes in this comics dataset: name, id, align, eye, hair, gender, gsm, alive, appearances, first appear, and publisher. Except for the appearances property, all of the attributes in this dataset are of the character data type.

#Check levels of align
comics$align = as.factor(comics$align)
levels(comics$align)
## [1] "Bad"                "Good"               "Neutral"           
## [4] "Reformed Criminals"

To examine the align attribute’s level, first convert the align attribute’s data type from character to factor. After inspecting the levels of the align attribute in the comic dataset, it is discovered that there are four levels: “Bad”, “Good”, “Neutral”, and “Reformed Criminals.”

#Check the levels of gender
comics$gender = as.factor(comics$gender)
levels(comics$gender)
## [1] "Female" "Male"   "Other"

To examine the gender attribute’s level, first convert the gender attribute’s data type from character to factor. After inspecting the levels of the gender attribute in the comic dataset, it is discovered that there are three levels: “Female”, “Male”,and “Reformed Criminals.”

#Create a 2-way contingency table
table(comics$align, comics$gender)
##                     
##                      Female Male Other
##   Bad                  1573 7561    32
##   Good                 2490 4809    17
##   Neutral               836 1799    17
##   Reformed Criminals      1    2     0

Contingency table is use to represent the total counts of observations that fall into each combination of the levels of categorical variables. 1. According to the result above, the most common or most prevalent category is gender “Male” with align “Bad” with the value 7561. 2. The lowest category is gender “Other” with align “Reformed Criminals” with a value of 0. 3. Gender “Female” and align “Bad” have a total of 1573 instances in this comics data, Gender “Female” and align “Good” have a total of 2490 instances in this comics data, and so on. 4. The gender attribute has a low contingency on the align attribute with the “Reformed Criminals” level.

- Dropping levels

#Load dplyr
#Print tab
tab <- table(comics$align, comics$gender)
tab
##                     
##                      Female Male Other
##   Bad                  1573 7561    32
##   Good                 2490 4809    17
##   Neutral               836 1799    17
##   Reformed Criminals      1    2     0

The output above is the original contingency table between the align and gender attributes in the data.

# Remove align level
comics <- comics %>% filter(align != 'Reformed Criminals') %>% droplevels()

levels(comics$align)
## [1] "Bad"     "Good"    "Neutral"
  1. The output above is the result of the remaining levels in the align attribute after we removed the reformed criminal level in the align attribute by using filter(align != ‘Reformed Criminals’) %>% droplevels().
  2. Dropping levels can used to simplify our analysis.

- Side-by-side barcharts

# Load ggplot2
# Create side-by-side barchart of gender by alignment
ggplot(comics, aes(x = align, fill = gender)) + geom_bar(position = "dodge")

The output above is the outcome of the visualization of the contingency table that was created earlier, with the x representing the align property, which is filled with the gender attribute to make it easier to understand. 1. According to the result, align “Bad” and gender “Male” have the largest number when compared to other align and gender combinations.Then, followed by align “Good” with the gender “Male,” and finally, align “Neutral” with the gender “Male.”

  1. Furthermore, when compared to other align and gender combinations, align “Good” and gender “Female” have the highest number. Then, followed by align “Bad” with the gender “Female,” and finally align “Neutral” with the gender “Female.”

  2. Also, when compared to other align and gender combinations, align “Bad” and gender “NA” have the highest number. Then, followed by align “Good” with the gender “NA,” and finally, align “Neutral” with the gender “NA.”

  3. The number between align and gender “Other” has the fewest others until the bar is no longer visible.

# Create side-by-side barchart of alignment by gender
ggplot(comics, aes(x = gender, fill = align)) + geom_bar(positio = "dodge") + theme(axis.text.x = element_text(angle = 90))

The output above is the outcome of the preceding visualization of the contingency table, but it differs from the prior visualization, where the x that has the gender property now and its contains the align attribute’s values. But, for the explanation is the same as the previous visualization explanation.

- Bar chart interpretation

  • Among characters with “Neutral” alignment, males are the most common.
  • In general, there is an association between gender and alignment.
  • There are more male characters than female characters in this dataset.

Counts vs. proportions

# simplify display format
options(scipen = 999, digits = 3) 

## create table of counts
tbl_cnt <- table(comics$id, comics$align)
tbl_cnt
##          
##            Bad Good Neutral
##   No Dual  474  647     390
##   Public  2172 2930     965
##   Secret  4493 2475     959
##   Unknown    7    0       2

The output above is a table of the number of cases based on id and align. 1. From the output above, which has the highest number of cases is id “Secret” with align “Bad”, followed by id “Public” with align “Good”. Then id “Secret” and align “Good”, and so on.

  1. From the output above, which has the lowest number of cases is id “Unknown” with align “Good” with the value of 0.

  2. The number of cases between id “Unknown” and all align’s attribute level is small, with a value less than 10.

# Proportional table
# All values add up to 1
prop.table(tbl_cnt)
##          
##                Bad     Good  Neutral
##   No Dual 0.030553 0.041704 0.025139
##   Public  0.140003 0.188862 0.062202
##   Secret  0.289609 0.159533 0.061815
##   Unknown 0.000451 0.000000 0.000129

The output above is the proportion of all cases that fall into each category. 1. The largest single category is id “Secret” and align “Bad,” which stand for around 29% of id.

  1. The smallest single category is id “Unknown” and align “Good,” which stand for 0% of id.
sum(prop.table(tbl_cnt))
## [1] 1

The result above shows that the total of all the proportions in the whole dataset is worth 1.

# All rows add up to 1
prop.table(tbl_cnt, 1)
##          
##             Bad  Good Neutral
##   No Dual 0.314 0.428   0.258
##   Public  0.358 0.483   0.159
##   Secret  0.567 0.312   0.121
##   Unknown 0.778 0.000   0.222

Following that, conditional proportions are needed to determine the systematic relationship between variables such as the output above. 1. From the output above, it can be seen that about 57% of all “Secret” id(s) have “Bad” align.

# Columns add up to 1
prop.table(tbl_cnt, 2)
##          
##                Bad     Good  Neutral
##   No Dual 0.066331 0.106907 0.168394
##   Public  0.303946 0.484137 0.416667
##   Secret  0.628743 0.408956 0.414076
##   Unknown 0.000980 0.000000 0.000864

We add 1 to each row to condition the id property. Instead, we add argument 2 to prop.table() to condition the column. 1. According to the result above, the proportion of align “Bad” whose ID is “hidden”Secret” is roughly 63 percent.

  1. There are very few align(s) with id “Unknown”.
ggplot(comics, aes(x = id, fill = align)) + geom_bar(position = "fill") + ylab("proportion")

The proportion is visualized above to help you comprehend it. Above is a visualization with x as the id attribute filled with align attribute.

ggplot(comics, aes(x = align, fill = id)) + geom_bar(position = "fill") +ylab("proportion")

Above is a visualization with x as the align attribute filled with id attribute. 1. According to both visualization findings above, the id “Secret” has the worst alignment (align “Bad”). Id “No Dual” and Id “Public” have good alignment

  1. From the output above, id “Unknown” only has “Bad” or “Neutral” align

- Conditional proportions

tab <- table(comics$align, comics$gender)
options(scipen = 999, digits = 3) # Print fewer digits
prop.table(tab)     # Joint proportions
##          
##             Female     Male    Other
##   Bad     0.082210 0.395160 0.001672
##   Good    0.130135 0.251333 0.000888
##   Neutral 0.043692 0.094021 0.000888

The output above is the proportion of all cases that fall into each category between align attribute and gender attribute.

prop.table(tab, 2)
##          
##           Female  Male Other
##   Bad      0.321 0.534 0.485
##   Good     0.508 0.339 0.258
##   Neutral  0.171 0.127 0.258
  1. From the output above, there are 51% proportion of all female characters are good.

  2. The highest proportion are 53% that all male characters are Bad.

  3. The lowest proportion are 13% that all male characters are Neutral.

– Counts vs. proportions (2)

# Plot of gender by align
ggplot(comics, aes(x = align, fill = gender)) + geom_bar()

The output above is the outcome of the visualization with the x representing the align attribute, which is filled with the gender attribute but with no position “fill”.

# Plot proportion of gender, conditional on align
ggplot(comics, aes(x = align, fill = gender)) + geom_bar(position = "fill")

The output above is the outcome of the visualization with the x representing the align attribute, which is filled with the gender attribute but with position “fill”. The result is the same but only the difference is that the bar is full or not with a different y value. If given the attribute position=“fill” the value on the y-axis will be decimal, but if not then vice versa.

Distribution of one variable

# Can use table function on just one variable
# This is called a marginal distribution
table(comics$id)
## 
## No Dual  Public  Secret Unknown 
##    1511    6067    7927       9

The output above is the number of distributions at each level in the attribute id. “No Dual” id has the value of 1511. “Public” id has the value of 6067. “Secret” id has the value of 7927. And the “Unknown” id has the value of 9.

# Simple barchart
ggplot(comics, aes(x = id)) + geom_bar()

The output shown above depicts the distribution of each level on the id attribute.

ggplot(comics, aes(x = id)) + geom_bar() + facet_wrap(~align)

Above is a visualization of all cases with each one align already grouped against id. So we can see the distribution of id(s) in all cases that have bad, good, or neutral align.

- Marginal barchart

  • It makes more sense to put neutral between Bad and Good
  • We need to reorder the levels so it will chart this way
  • Otherwise it will defult to alphabetical
# Change the order of the levels in align
comics$align <- factor(comics$align, levels = c("Bad", "Neutral", "Good"))

# Create plot of align
ggplot(comics, aes(x = align)) + geom_bar()

To see the alignment distribution of all superheroes, we can create a bar chart for just that single variable like above. As we can see, align “Bad” has the highest distribution, followed by align “Good” and finally align “Neutral”

- Conditional barchart

# Plot of alignment broken down by gender
ggplot(comics, aes(x = align)) + geom_bar() + facet_wrap(~ gender)

Above is a visualization of all cases with each level on the gender attribute separated and combined with the align attribute. So we can see the distribution of align(s) in each gender. 1. Female gender are most has the good align, and least has neutral align.

  1. Gender “Male” are most has bad align, and least has neutral align.

  2. Gender “NA” are most has bad align, and least has neutral align.

  3. Gender “Other” can be said doesn’t have align.

– Improve piechart

pies <- data.frame(flavors = as.factor(rep(c("apple", "blueberry", "boston creme", "cherry", "key lime", "pumpkin", "strawberry"), times = c(17, 14, 15, 13, 16, 12, 11))))

#Put levels of flavor in decending order
lev <- c("apple", "key lime", "boston creme", "blueberry", "cherry", "pumpkin", "strawberry")
pies$flavor <- factor(pies$flavor, levels = lev)

head(pies$flavor)
## [1] apple apple apple apple apple apple
## Levels: apple key lime boston creme blueberry cherry pumpkin strawberry

From the output above, we can see that in the pies dataframe there are several levels of flavors, namely apple, key lime, boston creme, blueberry, cherry, pumpkin, and strawberry.

#Create barchart of flavor
ggplot(pies, aes(x = flavor)) + geom_bar(fill = "chartreuse") + theme(axis.text.x = element_text(angle = 90))

The visualization above depicts the distribution of each flavor level from the dataframe pies. The plot above has also been ranked from most to least. 1. Apple flavor are the most, and strawberry flavor are the least.

Exploring Numerical Data

# A dot plot shows all the datapoints
ggplot(cars, aes(x = weight)) + geom_dotplot(dotsize = 0.4)
## Bin width defaults to 1/30 of the range of the data. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (stat_bindot).

The output shown above is a depiction of the weight attribute distribution on the cars dataset. A dot is used to represent the visualization above.

# A histogram groups the points into bins so it does not get overwhelming
ggplot(cars, aes(x = weight)) + geom_histogram(dotsize = 0.4, binwidth = 500)
## Warning: Ignoring unknown parameters: dotsize
## Warning: Removed 2 rows containing non-finite values (stat_bin).

The output shown above is a depiction of the weight attribute distribution on the cars dataset. But the difference is, the visualization above is depicted with a histogram.

# A density plot gives a bigger picture representation of the distribution
# It more helpful when there is a lot of data
ggplot(cars, aes(x = weight)) + geom_density()
## Warning: Removed 2 rows containing non-finite values (stat_density).

The output shown above is a depiction of the weight attribute distribution on the cars dataset. The visualization above is depicted with a density line which more helpful when there are a lot of data we want to observed. The data above shows that the data is relatively unimodal and right-skewed.

# A boxplot is a good way to just show the summary info of the distriubtion
ggplot(cars, aes(x = 1, y = weight)) + geom_boxplot() + coord_flip()
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).

Above is a boxplot of the weight attribute on the cars dataset, As we can se, there are several outliers.

– Faceted histogram

# Load package
library(ggplot2)

# Learn data structure
str(cars)
## 'data.frame':    428 obs. of  19 variables:
##  $ name       : chr  "Chevrolet Aveo 4dr" "Chevrolet Aveo LS 4dr hatch" "Chevrolet Cavalier 2dr" "Chevrolet Cavalier 4dr" ...
##  $ sports_car : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ suv        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ wagon      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ minivan    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ pickup     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ all_wheel  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ rear_wheel : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ msrp       : int  11690 12585 14610 14810 16385 13670 15040 13270 13730 15460 ...
##  $ dealer_cost: int  10965 11802 13697 13884 15357 12849 14086 12482 12906 14496 ...
##  $ eng_size   : num  1.6 1.6 2.2 2.2 2.2 2 2 2 2 2 ...
##  $ ncyl       : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ horsepwr   : int  103 103 140 140 140 132 132 130 110 130 ...
##  $ city_mpg   : int  28 28 26 26 26 29 29 26 27 26 ...
##  $ hwy_mpg    : int  34 34 37 37 37 36 36 33 36 33 ...
##  $ weight     : int  2370 2348 2617 2676 2617 2581 2626 2612 2606 2606 ...
##  $ wheel_base : int  98 98 104 104 104 105 105 103 103 103 ...
##  $ length     : int  167 153 183 183 183 174 174 168 168 168 ...
##  $ width      : int  66 66 69 68 69 67 67 67 67 67 ...

The output shown above shows the attributes of the dataset cars. Cars dataset has 19 atributte with several data type: character, logical, numeric, and integer.

# Create faceted histogram
ggplot(cars, aes(x = city_mpg)) + geom_histogram() + facet_wrap(~ suv)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 14 rows containing non-finite values (stat_bin).

Above is a visualization of the faceted histogram of the city mpg attribute on the cars dataset, it is divided into true or false. As we can see, from this data, most cars don’t have city mpg than those that do.

- Boxplots and density plots

unique(cars$ncyl)
## [1]  4  6  3  8  5 12 10 -1

The output above is to see how many numbers there are in the ncyl attribute in the cars dataset. There are 8 unique values in the ncyl attribute : 4,6,3,8,5,12,10, and -1.

table(cars$ncyl)
## 
##  -1   3   4   5   6   8  10  12 
##   2   1 136   7 190  87   2   3

The output above is to observe how each ncyl number is distributed. In this data, ncyl with the number 6 appears the most.And ncyl with the number 3 appears the least.

# Filter cars with 4, 6, 8 cylinders
common_cyl <- filter(cars, ncyl %in% c(4,6,8))

# Create box plots of city mpg by ncyl
ggplot(common_cyl, aes(x = as.factor(ncyl), y = city_mpg)) + geom_boxplot()
## Warning: Removed 11 rows containing non-finite values (stat_boxplot).

The result above is the ncyl attribute boxplot with the numbers 4, 6, and 8. 1. Ncyl with 6 number has 1 outlier.

  1. Ncyl with 4 number has 5 outliers.

  2. Ncyl with 8 number appears don’t have outlier.

# Create overlaid density plots for same data
ggplot(common_cyl, aes(x = city_mpg, fill = as.factor(ncyl))) + geom_density(alpha = .3)
## Warning: Removed 11 rows containing non-finite values (stat_density).

Above is the ncyl attribute overlaid density plot with the numbers 4, 6, and 8. 1. Ncyl with number 8 has the highest spike among the others, followed by ncyl with number 6 and number 4.

– Compare distribution via plots

  • The highest mileage cars have 4 cylinders.
  • The typical 4 cylinder car gets better mileage than the typical 6 cylinder car, which gets better mileage than the typical 8 cylinder car.
  • Most of the 4 cylinder cars get better mileage than even the most efficient 8 cylinder cars.

Distribution of one variable

– Marginal and conditional histograms

# Create hist of horsepwr
cars %>% ggplot(aes(horsepwr)) + geom_histogram() + ggtitle("Horsepower distribution")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Above is the histogram of horsepower attribute in the car dataset.

# Create hist of horsepwr for affordable cars
cars %>% filter(msrp < 25000) %>% ggplot(aes(horsepwr)) + geom_histogram() + xlim(c(90, 550)) + ggtitle("Horsepower distribtion for msrp < 25000")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).

Above is the histogram of horsepower attribute in the car dataset that has value below 25000.

– Marginal and conditional histograms interpretation

  • The highest horsepower car in the less expensive range has just under 250 horsepower. ### – Three binwidths
# Create hist of horsepwr with binwidth of 3
cars %>% ggplot(aes(horsepwr)) + geom_histogram(binwidth = 3) + ggtitle("binwidth = 3")

Above is the histogram of horsepower attribute in the car dataset with value 3 as the binwidth. It can be noticed that the space between the bins is increasing and that they are no longer as close together as they were previously.

# Create hist of horsepwr with binwidth of 30
cars %>% ggplot(aes(horsepwr)) + geom_histogram(binwidth = 30) + ggtitle("binwidth = 30")

Above is the histogram of horsepower attribute in the car dataset with value 30 as the binwidth.It can be seen that the width of the bin is getting bigger and stick to each other.

# Create hist of horsepwr with binwidth of 60
cars %>% ggplot(aes(horsepwr)) + geom_histogram(binwidth = 60) + ggtitle("binwidth = 60")

Above is the histogram of horsepower attribute in the car dataset with value 60 as the binwidth.It can be seen that the width of the bin is getting bigger and than the previous one.

As the conclusion, The larger the binwidth number, the larger and closer the bars will be.

Box plots

– Box plots for outliers

# Construct box plot of msrp
cars %>% ggplot(aes(x = 1, y = msrp)) + geom_boxplot()

Above are is the boxplot of msrp attribute in the car dataset, with the x axis is x with the value 1 and the y axis is the msrp value. It can be seen that the msrp attribute has several outliers.

# Exclude outliers from data
cars_no_out <- cars %>% filter(msrp < 100000)

# Construct box plot of msrp using the reduced dataset
cars_no_out %>% ggplot(aes(x = 1, y = msrp)) + geom_boxplot()

Above is the output of the msrp attribute boxplot which has removed the outlier data. The intended data outliers are msrp values above 100000. So we only take msrp data which is worth under 100000.

Plot selection

# Create plot of city_mpg
cars %>% ggplot(aes(x = 1, y = city_mpg)) +geom_boxplot()
## Warning: Removed 14 rows containing non-finite values (stat_boxplot).

Above are is the boxplot of city mpg attribute in the car dataset, with the x axis is x with the value 1 and the y axis is the city mpg value. It can be seen that the city mpg attribute also has several outliers.

cars %>% ggplot(aes(city_mpg)) + geom_density()
## Warning: Removed 14 rows containing non-finite values (stat_density).

The output shown above is a density line of the city mpg attribute distribution on the cars dataset. The data above shows that the city mpg attribute is relatively right-skewed.

# Create plot of width
cars %>% ggplot(aes(x = 1, y = width)) + geom_boxplot()
## Warning: Removed 28 rows containing non-finite values (stat_boxplot).

Above are is the boxplot of width attribute in the car dataset, with the x axis is x with the value 1 and the y axis is the width value. It can be seen that the width attribute has 2 outliers.

cars %>% ggplot(aes(x = width)) + geom_density()
## Warning: Removed 28 rows containing non-finite values (stat_density).

The output shown above is a density line of the width attribute distribution on the cars dataset. The data above shows that the width attribute is roughly like multimodal.

Visualization in higher dimensions

– 3 variable plot

# Facet hists using hwy mileage and ncyl
common_cyl %>% ggplot(aes(x = hwy_mpg)) + geom_histogram() + facet_grid(ncyl ~ suv) + ggtitle("hwy_mpg by ncyl and suv")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 11 rows containing non-finite values (stat_bin).

Above is the visualization between 3 attribute: hwy_mpg by ncyl and suv. 1. ncyl number 4 with suv has the lowest hwy_mpg.

  1. ncyl number 6 with no suv has the most hwy_mpg in the range of 25-30.

  2. hwy_mpg values tends to decrease as the number of cylinders increases in both SUVs and non-SUVs.

Numerical Summaries

Measures of center

head(life)
##     State         County fips Year Female.life.expectancy..years.
## 1 Alabama Autauga County 1001 1985                           77.0
## 2 Alabama Baldwin County 1003 1985                           78.8
## 3 Alabama Barbour County 1005 1985                           76.0
## 4 Alabama    Bibb County 1007 1985                           76.6
## 5 Alabama  Blount County 1009 1985                           78.9
## 6 Alabama Bullock County 1011 1985                           75.1
##   Female.life.expectancy..national..years.
## 1                                     77.8
## 2                                     77.8
## 3                                     77.8
## 4                                     77.8
## 5                                     77.8
## 6                                     77.8
##   Female.life.expectancy..state..years. Male.life.expectancy..years.
## 1                                  76.9                         68.1
## 2                                  76.9                         71.1
## 3                                  76.9                         66.8
## 4                                  76.9                         67.3
## 5                                  76.9                         70.6
## 6                                  76.9                         66.6
##   Male.life.expectancy..national..years. Male.life.expectancy..state..years.
## 1                                   70.8                                69.1
## 2                                   70.8                                69.1
## 3                                   70.8                                69.1
## 4                                   70.8                                69.1
## 5                                   70.8                                69.1
## 6                                   70.8                                69.1
  1. Life dataset has 8 attributes.
  2. There are 2 attributes that has character data type, 2 attributes that has integer as its data type, and 4 attributes that has double as its data type.
x <- head(round(life$Female.life.expectancy..years.), 11)
x
##  [1] 77 79 76 77 79 75 77 77 77 78 77

Above are the top 11 values of the rounded female life expectancy year attribute.

mean

  • balance point of the data
  • sensitive to extreme values
sum(x)/11
## [1] 77.2
  1. To find the mean value of an attribute in the data, we can use the manual method and can use the function that are already available.

  2. If we want to calculate the mean value of the top 11 values of the rounded female life expectancy year attribute that has been saved into the x variable manually, we can add them all up using the function sum() and divide by 11.

mean(x)
## [1] 77.2
  1. However, if we want to calculate the mean value using an existing function, we may apply the mean() function on the variable x directly.

As a result, The top 11 values of the rounded female life expectancy year attribute has the mean value 77.2.

median

  • middle value of the data
  • robust to extreme values
  • most approrpriate measure when working with skewed data
sort(x)
##  [1] 75 76 77 77 77 77 77 77 78 79 79

To find median in the dataset, we must first sort its value from the smallest to the biggest.We can just use sort() function to sort our value. The output above is the top 11 values of the rounded female life expectancy year attribute that have been sorted.

median(x)
## [1] 77

Then, we can use median() function on the variable x directly to find the median.

As a result, The top 11 values of the rounded female life expectancy year attribute has the median value 77.

mode

  • most common value
table(x)
## x
## 75 76 77 78 79 
##  1  1  6  1  2

To find the mode value in an attribute in the dataset, we have to look at the distribution of each unique value using table() function and find out which value occurs the most.

As a result, the mode of the top 11 values of the rounded female life expectancy year attribute is 77 with 6 times occurs.

– Calculate center measures

str(gapminder)
## tibble [1,704 x 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...

Gapminder data has 6 attribute: country, continent, year,lifeExp, pop, and gpdPercap. 1. Country and continent attribute has factor as its data type.

  1. Year and pop attribute has integer as its data type.

  2. lifeExp and gpdPercap has numeric as its data type.

# Create dataset of 2007 data
gap2007 <- filter(gapminder, year == 2007)

# Compute groupwise mean and median lifeExp
gap2007 %>% group_by(continent) %>% summarize(mean(lifeExp), median(lifeExp))
## # A tibble: 5 x 3
##   continent `mean(lifeExp)` `median(lifeExp)`
##   <fct>               <dbl>             <dbl>
## 1 Africa               54.8              52.9
## 2 Americas             73.6              72.9
## 3 Asia                 70.7              72.4
## 4 Europe               77.6              78.6
## 5 Oceania              80.7              80.7

After we grouped the data of 2007 with the continent, above is the result of the mean and median of lifeExp for each continent in 2007 from the life dataset. 1. Oceania’s continent has the highest mean score of 80.7 and Africa’s continent has the lowest mean score in 2007 with value of 54.8.

  1. Oceania’s continent has the highest median score of 80.7 and Africa’s continent has the lowest median score in 2007 with value of 52.9.
# Generate box plots of lifeExp for each continent
gap2007 %>% ggplot(aes(x = continent, y = lifeExp)) + geom_boxplot()

Above are the boxplot of the lifeExp attribute on each continent. It can be seen that Asia and Americas continent seem to have 1 outlier while the other continent don’t have outlier.

The African continent has the most lifeExp, followed by the Asian continent, the Europe continent, the Americas continent and finally the Oceania continent

Measures of variability

x
##  [1] 77 79 76 77 79 75 77 77 77 78 77

Above are the value in the x variables (the top 11 values of the rounded female life expectancy year attribute).

# Look at the difference between each point and the mean
sum(x - mean(x))
## [1] -0.0000000000000568

-0.0000000000000568 is the sum value of the difference between each point and the mean.

# Square each difference to get rid of negatives then sum
sum((x - mean(x))^2)
## [1] 13.6

13.6 is the sum value of square each difference.

Variance

  • so we divide by n - 1
  • This is called the sample variance. One of the most useful measures of a sample distriution
sum((x - mean(x))^2)/(length(x) - 1)
## [1] 1.36

To find the variance in the data, we can use the manual method and use the provided function. If using the manual method, we can use and write the formula as above.

var(x)
## [1] 1.36

If we want to use the provided function, we can use var() function to find the variance.

As a result, the variance of variable x is 1.36

Standard Deviation

  • Another very useful metric is the sample standard deviation
  • This is just the square root of the variance
  • The nice thing about the std dev is that it is in the same units as the original data
  • In this case its 1.17 years
sqrt(sum((x - mean(x))^2)/(length(x) - 1))
## [1] 1.17

To find the standard deviation in the dataset, we can use the manual method and use the provided function, same as variance, mean, and etc.

If using the manual method, we can use and write the formula as above.

sd(x)
## [1] 1.17

If we want to use the provided function, we can use sd() function to find the standard deviation

As a result, the standard deviation of variable x is 1.17

Inter Quartile Range

  • The IQR is the middle 50% of the data
  • The nice thing about this one is that it is not sensitve to extreme values
  • All of the other measures listed here are sensitive to extreme values
summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    75.0    77.0    77.0    77.2    77.5    79.0

By using summary() function above, we can see that variable x has min value of 75, Q1 value of 77, median value of 77, mean value of 77.2, Q3 value of 77.5, and max value of 79.

IQR(x)
## [1] 0.5

By using IQR() function, we can find the interquartile range of our data. Variable X has IQR value of 0.5

Range

  • max and min are also interesting
  • as is the range, or the difference between max and min
max(x)
## [1] 79

By using max() function, we can find the max value of our data. Variable X has max value of 79.

min(x)
## [1] 75

By using min() function, we can find the min value of our data. Variable X has min value of 75.

diff(range(x))
## [1] 4

By using diff(range()) function, we can find the range value of our data. Variable X has range value of 4.

– Calculate spread measures

str(gap2007)
## tibble [142 x 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 4 1 1 2 5 4 3 3 4 ...
##  $ year     : int [1:142] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
##  $ lifeExp  : num [1:142] 43.8 76.4 72.3 42.7 75.3 ...
##  $ pop      : int [1:142] 31889923 3600523 33333216 12420476 40301927 20434176 8199783 708573 150448339 10392226 ...
##  $ gdpPercap: num [1:142] 975 5937 6223 4797 12779 ...

Above are attributes and each data type from life dataset in 2007

# Compute groupwise measures of spread
gap2007 %>% group_by(continent) %>% summarize(sd(lifeExp), IQR(lifeExp), n())
## # A tibble: 5 x 4
##   continent `sd(lifeExp)` `IQR(lifeExp)` `n()`
##   <fct>             <dbl>          <dbl> <int>
## 1 Africa            9.63          11.6      52
## 2 Americas          4.44           4.63     25
## 3 Asia              7.96          10.2      33
## 4 Europe            2.98           4.78     30
## 5 Oceania           0.729          0.516     2

After we grouped the data of 2007 with the continent, above is the result of the standard deviation, interquartile range and how many of lifeExp for each continent in 2007 from the life dataset. 1. Africa’s continent has the highest standard deviation score of 9.631 and Oceania’s continent has the lowest standard deviation score in 2007 with value of 0.729.

  1. Africa’s continent has the highest IQR score of 11.610 and Oceania’s continent has the lowest IQR score in 2007 with value of 0.516.

  2. The African continent has the highest number of data, which is 52 and the continent of Oceania has the lowest amount of data, which is 2.

# Generate overlaid density plots
gap2007 %>% ggplot(aes(x = lifeExp, fill = continent)) + geom_density(alpha = 0.3)

Above is the density plots with lifeAxp as the x axis filled with continent.

It seems that there are no normally distributed continents

– Choose measures for center and spread

# Compute stats for lifeExp in Americas
head(gap2007)
## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       2007    43.8 31889923      975.
## 2 Albania     Europe     2007    76.4  3600523     5937.
## 3 Algeria     Africa     2007    72.3 33333216     6223.
## 4 Angola      Africa     2007    42.7 12420476     4797.
## 5 Argentina   Americas   2007    75.3 40301927    12779.
## 6 Australia   Oceania    2007    81.2 20434176    34435.

Above is the 6 top data of life dataset in 2007.

gap2007 %>% filter(continent == "Americas") %>% summarize(mean(lifeExp), sd(lifeExp))
## # A tibble: 1 x 2
##   `mean(lifeExp)` `sd(lifeExp)`
##             <dbl>         <dbl>
## 1            73.6          4.44

Above, we calculate the mean and standard deviation lifeExp of Americas continent only. 1. Americas continent has the mean lifeExp value of 73.6

  1. Americas continent has the standard deviation lifeExp value of 4.44
# Compute stats for population
gap2007 %>% summarize(median(pop), IQR(pop))
## # A tibble: 1 x 2
##   `median(pop)` `IQR(pop)`
##           <dbl>      <dbl>
## 1      10517531  26702008.

Above, we calculate the median and interquartile range pop. 1. The median pop value is 10517531

  1. The IQR pop value is 26702008

Shape and transformations

4 chracteristics of a distribution that are of interest: ### center -already covered ### spread or variablity -already covered ### shape -modality: number of prominent humps (uni, bi, multi, or uniform - no humps) -skew (right, left, or symetric) -Can transform to fix skew ### outliers ## – Describe the shape There are 4 types of shape : - A: unimodal, left-skewed - B: unimodal, symmetric - C: unimodal, right-skewed - D: bimodal, symmetric

– Transformations

# Create density plot of old variable
gap2007 %>% ggplot(aes(x = pop)) + geom_density()

Above is the original density plot of the pop attribute. The plot appears to be poorly constructed.

So, using the log() method, we do the transformation below.

# Transform the skewed pop variable
gap2007 <- gap2007 %>% mutate(log_pop = log(pop))

# Create density plot of new variable
gap2007 %>% ggplot(aes(x = log_pop)) + geom_density()

After performing the transformation using the log() function on the pop attribute, we can see that the density plot result is better and is formed as visualized above.

Outliers

– Identify outliers

# Filter for Asia, add column indicating outliers
str(gapminder)
## tibble [1,704 x 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...

Above are the attribute and its data type of gapminder dataset.

gap_asia <- gap2007 %>% filter(continent == "Asia") %>% mutate(is_outlier = lifeExp < 50)

# Remove outliers, create box plot of lifeExp
gap_asia %>% filter(!is_outlier) %>% ggplot(aes(x = 1, y = lifeExp)) + geom_boxplot()

After removing outliers, i.e. removing the lifeExp value which is below 50, the boxplot visualization results are obtained as above.

Case Study

Introducing the data

– Spam and num_char

# ggplot2, dplyr, and openintro are loaded
#library(openintro)
email = read.csv("D:/KULIAH SEMESTER 2/DATA MINING AND VISUALIZATION/1. FIN SESSION/email.csv")
email$spam <- factor(email$spam, labels = c("not-spam", "spam"))

# Compute summary statistics
email %>%
  group_by(spam) %>%
  summarize( 
    median(num_char),
    IQR(num_char))
## # A tibble: 2 x 3
##   spam     `median(num_char)` `IQR(num_char)`
##   <fct>                 <dbl>           <dbl>
## 1 not-spam               6.83           13.6 
## 2 spam                   1.05            2.82

From the output above, it can be seen that there are 2 levels of the spam attribute in the email dataset, namely not-spam and spam. 1. Not-spam has the median value of 6.83 and spam has the median value of 1.05.

  1. Not-spam has the IQR value of 13.58 and spam has the IQR value of 2.82.
str(email)
## 'data.frame':    3921 obs. of  21 variables:
##  $ spam        : Factor w/ 2 levels "not-spam","spam": 1 1 1 1 1 1 1 1 1 1 ...
##  $ to_multiple : int  0 0 0 0 0 0 1 1 0 0 ...
##  $ from        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ cc          : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ sent_email  : int  0 0 0 0 0 0 1 1 0 0 ...
##  $ time        : chr  "2012-01-01T06:16:41Z" "2012-01-01T07:03:59Z" "2012-01-01T16:00:32Z" "2012-01-01T09:09:49Z" ...
##  $ image       : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ attach      : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ dollar      : int  0 0 4 0 0 0 0 0 0 0 ...
##  $ winner      : chr  "no" "no" "no" "no" ...
##  $ inherit     : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ viagra      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ password    : int  0 0 0 0 2 2 0 0 0 0 ...
##  $ num_char    : num  11.37 10.5 7.77 13.26 1.23 ...
##  $ line_breaks : int  202 202 192 255 29 25 193 237 69 68 ...
##  $ format      : int  1 1 1 1 0 0 1 1 0 1 ...
##  $ re_subj     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ exclaim_subj: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ urgent_subj : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ exclaim_mess: int  0 1 6 48 1 1 1 18 1 0 ...
##  $ number      : chr  "big" "small" "small" "small" ...

Above are the attribute and its data type of email dataset.

table(email$spam)
## 
## not-spam     spam 
##     3554      367

The output above is the distribution of each level of the spam attribute in the email dataset

email = read.csv("D:/KULIAH SEMESTER 2/DATA MINING AND VISUALIZATION/1. FIN SESSION/email.csv")
email <- email %>% mutate(spam = factor(ifelse(spam == 0, "not-spam", "spam")))

# Create plot
email %>% mutate(log_num_char = log(num_char)) %>%
  ggplot(aes(x = spam, y = log_num_char)) +
  geom_boxplot()

The output shown above are the boxplots of spam vs not-spam, with the y axis representing the logged num char.

It can be seen that more outliers owned by the level of non-spam than spam.

-Spam and num_char interpretation

-The median length of not-spam emails is greater than that of spam emails ## Spam and !!!

# Compute center and spread for exclaim_mess by spam
email = read.csv("D:/KULIAH SEMESTER 2/DATA MINING AND VISUALIZATION/1. FIN SESSION/email.csv")
email <- email %>% mutate(spam = factor(ifelse(spam == 0, "not-spam", "spam")))
email %>% group_by(spam) %>%summarize(median(exclaim_mess),IQR(exclaim_mess))  
## # A tibble: 2 x 3
##   spam     `median(exclaim_mess)` `IQR(exclaim_mess)`
##   <fct>                     <dbl>               <dbl>
## 1 not-spam                      1                   5
## 2 spam                          0                   1
  1. Not-spam has the median exclaim_mess value of 1 and spam has the median exclaim_mess value of 0.

  2. Not-spam has the IQR exclaim_mess value of 5 and spam has the exclaim_mess value of 1.

table(email$exclaim_mess)
## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 1435  733  507  128  190  113  115   51   93   45   85   17   56   20   43   11 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31 
##   29   12   26    5   29    9   15    3   11    6   11    1    6    8   13   12 
##   32   33   34   35   36   38   39   40   41   42   43   44   45   46   47   48 
##   13    3    3    2    3    3    1    2    1    1    3    3    5    3    2    1 
##   49   52   54   55   57   58   62   71   75   78   89   94   96  139  148  157 
##    3    1    1    4    2    2    2    1    1    1    1    1    1    1    1    1 
##  187  454  915  939  947 1197 1203 1209 1236 
##    1    1    1    1    1    1    2    1    1

Above are the distribution of each unique value from the exclaim_mess attribute in the email dataset. 1. The value 0 is the value that appears the most, that is, it appears 1435 times.

  1. Many values that have only 1 occurrence
# Create plot for spam and exclaim_mess
email %>%
  mutate(log_exclaim_mess = log(exclaim_mess)) %>%
  ggplot(aes(x = log_exclaim_mess)) + 
  geom_histogram() + 
  facet_wrap(~ spam)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1435 rows containing non-finite values (stat_bin).

From the output above, in the non-spam group, the typical number of exclamations appears to be somewhat greater than in the spam group.

– Spam and !!! interpretation

  • The most common value of exclaim_mess in both classes of email is zero (a log(exclaim_mess) of -4.6 after adding .01).
  • Even after a transformation, the distribution of exclaim_mess in both classes of email is right-skewed.

Check-in 1

table(email$image)
## 
##    0    1    2    3    4    5    9   20 
## 3811   76   17   11    2    2    1    1

Above are the distribution of each unique value from the image attribute in the email dataset. 1. The value 0 is the value that appears the most, that is, it appears 3811 times.

  1. The value of 9 and 20 appears the least, it appears only 1 time.
# Create plot of proportion of spam by image
email %>% mutate(has_image = image > 0) %>% ggplot(aes(x = has_image, fill = spam)) + geom_bar(position = "fill")

Above is the proportion plot with x axis is has_image attribute filled with spam attribute. 1. Both have or do not have an image, have a larger amount of not-spam than spam.

– Image and spam interpretation

# Test if images count as attachments
sum(email$image > email$attach)
## [1] 0

– Answering questions with chains

Within non-spam emails, is the typical length of emails shorter for

those that were sent to multiple people?

email %>%
   filter(spam == "not-spam") %>%
   group_by(to_multiple) %>%
   summarize(median(num_char))
## # A tibble: 2 x 2
##   to_multiple `median(num_char)`
##         <int>              <dbl>
## 1           0               7.20
## 2           1               5.36

Yes

Question 1

For emails containing the word “dollar”, does the typical spam email

contain a greater number of occurences of the word than the typical non-spam email?

table(email$dollar)
## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 3175  120  151   10  146   20   44   12   35   10   22   10   20    7   14    5 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   32 
##   23    2   14    1   10    7   12    7    7    3    7    1    5    1    1    2 
##   34   36   40   44   46   48   54   63   64 
##    1    2    3    3    2    1    1    1    3

Above are the distribution of each unique value from the dollar attribute in the email dataset. 1. The value 0 is the value that appears the most, that is, it appears 3175 times.

email %>% filter(dollar > 0) %>% group_by(spam) %>% summarize(median(dollar))
## # A tibble: 2 x 2
##   spam     `median(dollar)`
##   <fct>               <dbl>
## 1 not-spam                4
## 2 spam                    2

For emails containing the word “dollar”, does the typical spam email contain a greater number of occurences of the word than the typical non-spam email? No

Question 2

If you encounter an email with greater than 10 occurrences of the word “dollar”,

is it more likely to be spam or not -spam?

email %>% filter(dollar > 10) %>% ggplot(aes(x = spam)) + geom_bar()

Above are the plot of not-spam vs spam in spam attribute with the value above 10.

If you encounter an email with greater than 10 occurrences of the word “dollar”, is it more likely to be spam or not -spam? Not-spam, at least in this dataset

Check-in 2

– What’s in a number?

email$number = as.factor(email$number)
levels(email$number)
## [1] "big"   "none"  "small"

There are 3 levels in the number attribute: big, none, and small.

table(email$number)
## 
##   big  none small 
##   545   549  2827

Above are the distribution of each unique value from the number attribute in the email dataset. 1. The level of small is the value that appears the most, that is, it appears 2827 times. The followed by level of none, and finally level of big.

# Reorder levels
email$number <- factor(email$number, levels = c("none","small","big"))

# Construct plot of number
ggplot(email, aes(x = number)) + geom_bar() + facet_wrap( ~ spam)

Above are the plot for number attribute vs spam attribute.

– What’s in a number interpretation

  • Given that an email contains a small number, it is more likely to be not-spam.
  • Given that an email contains a big number, it is more likely to be not-spam.
  • Within both spam and not-spam, the most common number is a small one.