CS 424 Big Data Analytics

Session 6: Basics of Datasets

Instructor: Dr. Bob Batzinger
Academic year: 2021/2022
Semester: 1

Begins June 2021

R Studio Interface

Converting rates

dat = data.frame(rbind(
c(14, 2963,6.6),
c(14,10110,6.5),
c(11.1,1402112,7.0),
c(13,3214,2.2),
c(17.4,1380004,3.7),
c(6,51780,2.7)))
rownames(dat) = c("AR","AZ","CN","GE","IN","KR")
colnames(dat) = c('birthrate','population', 'diff.100')
unmatched = round((dat[,1] * dat[,2] * dat[,3]) / 10000, 3)
dat = cbind(dat,unmatched)

Resulting Dataset

	birthrate	population	diff.100	unmatched
AR	14.0	2963	6.6	27.378
AZ	14.0	10110	6.5	92.001
CN	11.1	1402112	7.0	10894.410
GE	13.0	3214	2.2	9.192
IN	17.4	1380004	3.7	8884.466
KR	6.0	51780	2.7	83.884

Dataframe dictionary

country : iso abbreviation of the country
birthrate : number of children born per 1000 population
population : the number of people in the country (in units of 1,000)
diff.100 : per cent increment of male births
(unmatched) : annual number of unmatched marriable men in 18 years

Dataframes

## dim(dat) ========

## [1] 6 4

## colnames(dat) ====

## [1] "birthrate"  "population" "diff.100"   "unmatched"

## rownames(dat) ====

## [1] "AR" "AZ" "CN" "GE" "IN" "KR"

## head(dat,2) ======

##    birthrate population diff.100 unmatched
## AR        14       2963      6.6    27.378
## AZ        14      10110      6.5    92.001

## tail(dat,2) ======

##    birthrate population diff.100 unmatched
## IN      17.4    1380004      3.7  8884.466
## KR       6.0      51780      2.7    83.884

Data frame inspection

## str(data) ======

## 'data.frame':    6 obs. of  4 variables:
##  $ birthrate : num  14 14 11.1 13 17.4 6
##  $ population: num  2963 10110 1402112 3214 1380004 ...
##  $ diff.100  : num  6.6 6.5 7 2.2 3.7 2.7
##  $ unmatched : num  27.38 92 10894.41 9.19 8884.47 ...

## 
## summary(dat) =====

##    birthrate       population         diff.100       unmatched        
##  Min.   : 6.00   Min.   :   2963   Min.   :2.200   Min.   :    9.192  
##  1st Qu.:11.57   1st Qu.:   4938   1st Qu.:2.950   1st Qu.:   41.505  
##  Median :13.50   Median :  30945   Median :5.100   Median :   87.942  
##  Mean   :12.58   Mean   : 475030   Mean   :4.783   Mean   : 3331.889  
##  3rd Qu.:14.00   3rd Qu.:1047948   3rd Qu.:6.575   3rd Qu.: 6686.350  
##  Max.   :17.40   Max.   :1402112   Max.   :7.000   Max.   :10894.410

Dataset decomposition

dat$birthrate

## [1] 14.0 14.0 11.1 13.0 17.4  6.0

dat$population

## [1]    2963   10110 1402112    3214 1380004   51780

dat$diff.100

## [1] 6.6 6.5 7.0 2.2 3.7 2.7

dat$unmatched

## [1]    27.378    92.001 10894.410     9.192  8884.466    83.884

Comparing User-defined to Built-in functions

Finding the mean and std dev of an array

doStat <- function(n=10000) {
    total = 0; sq = 0; cnt = 0; x = 1:n
    for (i in x) {total = total + i
      sq = sq + i * i; cnt = cnt + 1
    }
    mn = round(total / n,5); msq = mn * mn
    doStat = paste("mean=",mn,", std.dev=",
        round(sqrt((sq - msq*cnt)/(cnt-1)),5),"\n")
}

doStat2 <- function(n=10000) { x = 1:n
  doStat2 = paste("mean=",mean(x),", std.dev=",sd(x),"\n")
}

Comparing Runtimes

doStat(): user-defined

## time= 0.03311 +/- 0.0019 
##  mean= 5000.5 , std.dev= 2886.89568 
##

Same analysis using mean() and sd()

## time= 0.00278 +/- 0.00109 
##  mean= 5000.5 , std.dev= 2886.89567990717 
##

Dataset Case Study: Big Mac Index

Big Mac Menu Signage in Hong Kong¹

Big Mac are sold at McDonald’s around the world
The Big Mac Index was invented by The Economist in 1986
The cost of Big Mac in various locations can be used to demonstrate purchasing-power parity
Data is available from https://github.com/TheEconomist/big-mac-data and an interactive online tool²

Reading the dataset

## 'data.frame':    341 obs. of  6 variables:
##  $ name         : Factor w/ 42 levels "Argentina","Australia",..: 39 2 4 5 6 7 13 16 19 22 ...
##  $ iso_a3       : Factor w/ 41 levels "ARG","AUS","AUT",..: 38 2 4 5 16 6 15 18 21 24 ...
##  $ currency_code: Factor w/ 46 levels "ARS","ATS","AUD",..: 42 3 4 5 19 9 18 21 24 27 ...
##  $ local_price  : num  1.6 1.75 90 2.5 1.1 1.89 16.4 7.6 1.18 370 ...
##  $ dollar_ex    : num  1 1.64 42 13.8 0.67 1.39 6.65 7.8 0.74 154 ...
##  $ date         : Factor w/ 14 levels "1986-09-01","1987-01-01",..: 1 1 1 1 1 1 1 1 1 1 ...

Scanning the second version

str(macdat2)

## 'data.frame':    1730 obs. of  7 variables:
##  $ name         : Factor w/ 74 levels "Argentina","Australia",..: 1 2 7 8 9 10 11 15 16 19 ...
##  $ iso_a3       : Factor w/ 73 levels "ARE","ARG","AUS",..: 2 3 8 24 9 11 12 15 17 21 ...
##  $ currency_code: Factor w/ 57 levels "AED","ARS","AUD",..: 2 3 6 17 7 9 10 13 14 16 ...
##  $ local_price  : num  2.5 2.59 2.95 1.9 2.85 ...
##  $ dollar_ex    : num  1 1.68 1.79 0.633 1.47 ...
##  $ GDP_dollar   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ date         : Factor w/ 35 levels "2000-04-01","2001-04-01",..: 1 1 1 1 1 1 1 1 1 1 ...

Data Dictionary

name : country
iso_a3 : iso abbreviation for the country
currency_code : iso abbreviation for the local currency
local_price : cost of a Big Mac in local currency
dollar_ex : US dollar exchange rate for local currency
GDP_dollar : Gross Domestic Product in US Dollar (recent years only)
date : reporting data

Comparing the different dataset versions

Compare the lists of attributes

cat("macdat1 :\n",
      colnames(macdat1),
    "\n\nmacdat2 :\n",
   colnames(macdat2))

## macdat1 :
##  name iso_a3 currency_code local_price dollar_ex date 
## 
## macdat2 :
##  name iso_a3 currency_code local_price dollar_ex GDP_dollar date

Combining the versions

macdat = rbind(macdat1,macdat2[,colnames(macdat1)])
summary(macdat)

##             name          iso_a3     currency_code   local_price     
##  Britain      :  48   GBR    :  48   EUR    : 301   Min.   :      0  
##  United States:  48   USA    :  48   GBP    :  48   1st Qu.:      4  
##  Australia    :  47   AUS    :  47   USD    :  48   Median :     15  
##  Canada       :  47   CAN    :  47   AUD    :  47   Mean   :   6963  
##  Denmark      :  47   DNK    :  47   CAD    :  47   3rd Qu.:     89  
##  Hong Kong    :  47   HKG    :  47   DKK    :  47   Max.   :4000000  
##  (Other)      :1787   (Other):1787   (Other):1533                    
##    dollar_ex                 date     
##  Min.   :      0.0   2018-07-01:  72  
##  1st Qu.:      1.3   2019-01-01:  72  
##  Median :      5.8   2019-07-09:  72  
##  Mean   :   2662.7   2020-01-14:  72  
##  3rd Qu.:     33.7   2020-07-01:  72  
##  Max.   :1600500.0   2021-01-01:  71  
##                      (Other)   :1640

Plotting the local cost of a Big Mac

plot(as.Date(macdat$date),macdat$local_price,main="Cost of a Big Mac",
     ylab="Cost (in local currency)",xlab="Year",log="y",
     col=rainbow(50,)[macdat$iso_a3],pch=19,cex=0.5)

## Warning in xy.coords(x, y, xlabel, ylabel, log): 1 y value <= 0 omitted from
## logarithmic plot

Exchange rates (First attempt)

## Warning in xy.coords(x, y, xlabel, ylabel, log): 1 y value <= 0 omitted from
## logarithmic plot

Finding and removing the bad data point

Finding the bad data point

macdat[macdat$dollar_ex == 0,]

##           name iso_a3 currency_code local_price dollar_ex       date
## 1638 Venezuela    VEN           VEF           0         0 2018-01-01

Removing the bad data point

macdat = macdat[macdat$dollar_ex != 0,]

Exchange rates

Cost of Big Mac in US Dollar (first attempt)

mcost = macdat$local_price / macdat$dollar_ex
plot(as.Date(macdat$date), mcost,
     ylab= "cost in US Dollars", xlab="Year",
     pch=19,cex=0.5,col=rainbow(50)[macdat$iso_a3])

Search for low )outliers

m = mean(mcost)
sd = sd(mcost)

macdat[mcost < m - 2*sd, c("iso_a3","date","local_price","dollar_ex")]

##      iso_a3       date local_price dollar_ex
## 4       BRA 1986-09-01         2.5   13.8000
## 114     RUS 1992-04-01        58.0   98.9500
## 489     SAU 2004-05-01         2.4    3.7502
## 1358    VEN 2015-07-01       132.0  197.0000
## 1414    VEN 2016-01-01       132.0  198.6986

Search for high outliers

m = mean(mcost)
sd = sd(mcost)

macdat[mcost > m + 3.5*sd, c("iso_a3","date","local_price","dollar_ex")]

##      iso_a3       date local_price dollar_ex
## 719     NOR 2008-06-01        40.0   5.07915
## 894     NOR 2011-07-01        45.0   5.41405
## 908     CHE 2011-07-01         6.5   0.80615
## 1024    VEN 2012-07-01        34.0   4.29465
## 1057    NOR 2013-01-01        43.0   5.48310
## 1079    VEN 2013-01-01        39.0   4.29465
## 1168    NOR 2014-01-01        48.0   6.15745
## 1224    NOR 2014-07-01        48.0   6.18730

Plotting the outliers

Big Mac vs GDP

## function (x, y = NULL, legend, fill = NULL, col = par("col"), 
##     border = "black", lty, lwd, pch, angle = 45, density = NULL, 
##     bty = "o", bg = par("bg"), box.lwd = par("lwd"), box.lty = par("lty"), 
##     box.col = par("fg"), pt.bg = NA, cex = 1, pt.cex = cex, pt.lwd = lwd, 
##     xjust = 0, yjust = 1, x.intersp = 1, y.intersp = 1, adj = c(0, 
##         0.5), text.width = NULL, text.col = par("col"), text.font = NULL, 
##     merge = do.lines && has.pch, trace = FALSE, plot = TRUE, 
##     ncol = 1, horiz = FALSE, title = NULL, inset = 0, xpd, title.col = text.col, 
##     title.adj = 0.5, seg.len = 2) 
## {
##     if (missing(legend) && !missing(y) && (is.character(y) || 
##         is.expression(y))) {
##         legend <- y
##         y <- NULL
##     }
##     mfill <- !missing(fill) || !missing(density)
##     if (!missing(xpd)) {
##         op <- par("xpd")
##         on.exit(par(xpd = op))
##         par(xpd = xpd)
##     }
##     title <- as.graphicsAnnot(title)
##     if (length(title) > 1) 
##         stop("invalid 'title'")
##     legend <- as.graphicsAnnot(legend)
##     n.leg <- if (is.call(legend)) 
##         1
##     else length(legend)
##     if (n.leg == 0) 
##         stop("'legend' is of length 0")
##     auto <- if (is.character(x)) 
##         match.arg(x, c("bottomright", "bottom", "bottomleft", 
##             "left", "topleft", "top", "topright", "right", "center"))
##     else NA
##     if (is.na(auto)) {
##         xy <- xy.coords(x, y, setLab = FALSE)
##         x <- xy$x
##         y <- xy$y
##         nx <- length(x)
##         if (nx < 1 || nx > 2) 
##             stop("invalid coordinate lengths")
##     }
##     else nx <- 0
##     xlog <- par("xlog")
##     ylog <- par("ylog")
##     rect2 <- function(left, top, dx, dy, density = NULL, angle, 
##         ...) {
##         r <- left + dx
##         if (xlog) {
##             left <- 10^left
##             r <- 10^r
##         }
##         b <- top - dy
##         if (ylog) {
##             top <- 10^top
##             b <- 10^b
##         }
##         rect(left, top, r, b, angle = angle, density = density, 
##             ...)
##     }
##     segments2 <- function(x1, y1, dx, dy, ...) {
##         x2 <- x1 + dx
##         if (xlog) {
##             x1 <- 10^x1
##             x2 <- 10^x2
##         }
##         y2 <- y1 + dy
##         if (ylog) {
##             y1 <- 10^y1
##             y2 <- 10^y2
##         }
##         segments(x1, y1, x2, y2, ...)
##     }
##     points2 <- function(x, y, ...) {
##         if (xlog) 
##             x <- 10^x
##         if (ylog) 
##             y <- 10^y
##         points(x, y, ...)
##     }
##     text2 <- function(x, y, ...) {
##         if (xlog) 
##             x <- 10^x
##         if (ylog) 
##             y <- 10^y
##         text(x, y, ...)
##     }
##     if (trace) 
##         catn <- function(...) do.call("cat", c(lapply(list(...), 
##             formatC), list("\n")))
##     cin <- par("cin")
##     Cex <- cex * par("cex")
##     if (is.null(text.width)) 
##         text.width <- max(abs(strwidth(legend, units = "user", 
##             cex = cex, font = text.font)))
##     else if (!is.numeric(text.width) || text.width < 0) 
##         stop("'text.width' must be numeric, >= 0")
##     xc <- Cex * xinch(cin[1L], warn.log = FALSE)
##     yc <- Cex * yinch(cin[2L], warn.log = FALSE)
##     if (xc < 0) 
##         text.width <- -text.width
##     xchar <- xc
##     xextra <- 0
##     yextra <- yc * (y.intersp - 1)
##     ymax <- yc * max(1, strheight(legend, units = "user", cex = cex)/yc)
##     ychar <- yextra + ymax
##     if (trace) 
##         catn("  xchar=", xchar, "; (yextra,ychar)=", c(yextra, 
##             ychar))
##     if (mfill) {
##         xbox <- xc * 0.8
##         ybox <- yc * 0.5
##         dx.fill <- xbox
##     }
##     do.lines <- (!missing(lty) && (is.character(lty) || any(lty > 
##         0))) || !missing(lwd)
##     n.legpercol <- if (horiz) {
##         if (ncol != 1) 
##             warning(gettextf("horizontal specification overrides: Number of columns := %d", 
##                 n.leg), domain = NA)
##         ncol <- n.leg
##         1
##     }
##     else ceiling(n.leg/ncol)
##     has.pch <- !missing(pch) && length(pch) > 0
##     if (do.lines) {
##         x.off <- if (merge) 
##             -0.7
##         else 0
##     }
##     else if (merge) 
##         warning("'merge = TRUE' has no effect when no line segments are drawn")
##     if (has.pch) {
##         if (is.character(pch) && !is.na(pch[1L]) && nchar(pch[1L], 
##             type = "c") > 1) {
##             if (length(pch) > 1) 
##                 warning("not using pch[2..] since pch[1L] has multiple chars")
##             np <- nchar(pch[1L], type = "c")
##             pch <- substr(rep.int(pch[1L], np), 1L:np, 1L:np)
##         }
##         if (!is.character(pch)) 
##             pch <- as.integer(pch)
##     }
##     if (is.na(auto)) {
##         if (xlog) 
##             x <- log10(x)
##         if (ylog) 
##             y <- log10(y)
##     }
##     if (nx == 2) {
##         x <- sort(x)
##         y <- sort(y)
##         left <- x[1L]
##         top <- y[2L]
##         w <- diff(x)
##         h <- diff(y)
##         w0 <- w/ncol
##         x <- mean(x)
##         y <- mean(y)
##         if (missing(xjust)) 
##             xjust <- 0.5
##         if (missing(yjust)) 
##             yjust <- 0.5
##     }
##     else {
##         h <- (n.legpercol + (!is.null(title))) * ychar + yc
##         w0 <- text.width + (x.intersp + 1) * xchar
##         if (mfill) 
##             w0 <- w0 + dx.fill
##         if (do.lines) 
##             w0 <- w0 + (seg.len + x.off) * xchar
##         w <- ncol * w0 + 0.5 * xchar
##         if (!is.null(title) && (abs(tw <- strwidth(title, units = "user", 
##             cex = cex) + 0.5 * xchar)) > abs(w)) {
##             xextra <- (tw - w)/2
##             w <- tw
##         }
##         if (is.na(auto)) {
##             left <- x - xjust * w
##             top <- y + (1 - yjust) * h
##         }
##         else {
##             usr <- par("usr")
##             inset <- rep_len(inset, 2)
##             insetx <- inset[1L] * (usr[2L] - usr[1L])
##             left <- switch(auto, bottomright = , topright = , 
##                 right = usr[2L] - w - insetx, bottomleft = , 
##                 left = , topleft = usr[1L] + insetx, bottom = , 
##                 top = , center = (usr[1L] + usr[2L] - w)/2)
##             insety <- inset[2L] * (usr[4L] - usr[3L])
##             top <- switch(auto, bottomright = , bottom = , bottomleft = usr[3L] + 
##                 h + insety, topleft = , top = , topright = usr[4L] - 
##                 insety, left = , right = , center = (usr[3L] + 
##                 usr[4L] + h)/2)
##         }
##     }
##     if (plot && bty != "n") {
##         if (trace) 
##             catn("  rect2(", left, ",", top, ", w=", w, ", h=", 
##                 h, ", ...)", sep = "")
##         rect2(left, top, dx = w, dy = h, col = bg, density = NULL, 
##             lwd = box.lwd, lty = box.lty, border = box.col)
##     }
##     xt <- left + xchar + xextra + (w0 * rep.int(0:(ncol - 1), 
##         rep.int(n.legpercol, ncol)))[1L:n.leg]
##     yt <- top - 0.5 * yextra - ymax - (rep.int(1L:n.legpercol, 
##         ncol)[1L:n.leg] - 1 + (!is.null(title))) * ychar
##     if (mfill) {
##         if (plot) {
##             if (!is.null(fill)) 
##                 fill <- rep_len(fill, n.leg)
##             rect2(left = xt, top = yt + ybox/2, dx = xbox, dy = ybox, 
##                 col = fill, density = density, angle = angle, 
##                 border = border)
##         }
##         xt <- xt + dx.fill
##     }
##     if (plot && (has.pch || do.lines)) 
##         col <- rep_len(col, n.leg)
##     if (missing(lwd) || is.null(lwd)) 
##         lwd <- par("lwd")
##     if (do.lines) {
##         if (missing(lty) || is.null(lty)) 
##             lty <- 1
##         lty <- rep_len(lty, n.leg)
##         lwd <- rep_len(lwd, n.leg)
##         ok.l <- !is.na(lty) & (is.character(lty) | lty > 0) & 
##             !is.na(lwd)
##         if (trace) 
##             catn("  segments2(", xt[ok.l] + x.off * xchar, ",", 
##                 yt[ok.l], ", dx=", seg.len * xchar, ", dy=0, ...)")
##         if (plot) 
##             segments2(xt[ok.l] + x.off * xchar, yt[ok.l], dx = seg.len * 
##                 xchar, dy = 0, lty = lty[ok.l], lwd = lwd[ok.l], 
##                 col = col[ok.l])
##         xt <- xt + (seg.len + x.off) * xchar
##     }
##     if (has.pch) {
##         pch <- rep_len(pch, n.leg)
##         pt.bg <- rep_len(pt.bg, n.leg)
##         pt.cex <- rep_len(pt.cex, n.leg)
##         pt.lwd <- rep_len(pt.lwd, n.leg)
##         ok <- !is.na(pch)
##         if (!is.character(pch)) {
##             ok <- ok & (pch >= 0 | pch <= -32)
##         }
##         else {
##             ok <- ok & nzchar(pch)
##         }
##         x1 <- (if (merge && do.lines) 
##             xt - (seg.len/2) * xchar
##         else xt)[ok]
##         y1 <- yt[ok]
##         if (trace) 
##             catn("  points2(", x1, ",", y1, ", pch=", pch[ok], 
##                 ", ...)")
##         if (plot) 
##             points2(x1, y1, pch = pch[ok], col = col[ok], cex = pt.cex[ok], 
##                 bg = pt.bg[ok], lwd = pt.lwd[ok])
##     }
##     xt <- xt + x.intersp * xchar
##     if (plot) {
##         if (!is.null(title)) 
##             text2(left + w * title.adj, top - ymax, labels = title, 
##                 adj = c(title.adj, 0), cex = cex, col = title.col)
##         text2(xt, yt, labels = legend, adj = adj, cex = cex, 
##             col = text.col, font = text.font)
##     }
##     invisible(list(rect = list(w = w, h = h, left = left, top = top), 
##         text = list(x = xt, y = yt)))
## }
## <bytecode: 0x555b45df3868>
## <environment: namespace:graphics>

Price vs GDP

Regression analysis

## 
## Call:
## lm(formula = gdp ~ bigmac)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -23080  -9408  -1116   8913  41524 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -28609       6003  -4.766 1.01e-05 ***
## bigmac         14005       1470   9.527 3.30e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14260 on 69 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.5681, Adjusted R-squared:  0.5619 
## F-statistic: 90.77 on 1 and 69 DF,  p-value: 3.302e-14

Sinking of the Titanic

Artist rendering³

RMS Titanic sank in the North Atlantic Ocean on 15 April 1912, after striking an iceberg during her maiden voyage from Southampton to New York City.
Of 2,224 souls onboard, more than 1,500 died.
It has been said that the rigid class structure of the British passenger line has a great impact on survival.

Homework

Goal: Show the impact of gender and class on survival of the sinking.
Method:
- Load the Titanic passenger list into a notebook of R Studio.
- Insert a block of R commands that count the number of number of victims and survival for each class of passengers and the staff/crew.
- Use R to include a barchart in your notebook
- Summarize your findings in the Final section

R-Studio Notebook (Save as Homework01.Rmd)

---
title: "Title????"
author: "??Author??"
date: "?? July 2021"
output: html_notebook
---

# Introduction
## Background
## Research Question

# Methodology
## Data source
## Dictionary of the Dataset

# Results

# Summary and Conclusion

# References

Name list of people onboard

LastName	FirstName	Age	Position	Type
Allen*	Miss Elizabeth Walton	29	Pass	C1
Allen*	Mr. Ernest Frederick	24	Staff	Engineering
Allison	Mr. Hudson Joshua Creighton	30	Pass	C2
Allsop	Mr. Alfred Samuel	34	Staff	Housekeeping
Avery*	Mr. James Frank	22	Staff	Restuarant
Brown*	Miss Amelia Mary	18	Pass	C1
Swane	Mr. George Swane	19	Pass	C2