1.

Suppose we have an unfair coin. Please answer the following questions.

1.1

Let’s say the probability of getting “Head” p(H) is 0.6. We flip this coin for 30 times.

What’s the probability of getting exactly 10 Heads?

dbinom(x = 10, size = 30, prob = 0.6) 
## [1] 0.001997491

Also, what’s the cumulative probability of getting less than or equal to 20 Heads?

pbinom(20, size = 30, prob = 0.6)
## [1] 0.8237135
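As a quick sanity check (not part of the original question), the same numbers follow directly from the binomial formula and from summing the pmf:

choose(30, 10) * 0.6^10 * 0.4^20          # P(X = 10), matches 0.001997491
sum(dbinom(0:20, size = 30, prob = 0.6))  # P(X <= 20), matches 0.8237135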

1.2

Plot the discrete probability distribution of this coin. You may just plot a bar chart.

plot(0:30, dbinom(0:30, size = 30, prob = 0.6), type = "h",
     xlab = "Number of Heads", ylab = "Probability")

Use ggplot2 instead

library(ggplot2)
ggplot(data.frame(numOfHead = 0:30, prob = dbinom(0:30, size = 30, prob = 0.6)),
       aes(numOfHead, prob)) +
  geom_col()

1.3

Suppose we do not know p(H). The best way to estimate p(H) is to “just flip the coin!”. We flip the coin 5 times per trial and record the number of Heads in each of 20 trials, obtaining the sequence of counts below. So, in this case, what is p(H)?

x = c(4, 5, 3, 4, 5, 5, 4, 4, 4, 4, 3, 3, 5, 3, 3, 4, 2, 4, 4, 4)

# Function to compute the negative log-likelihood
LL_binom = function(p){
  -sum(dbinom(x, size = 5, prob = p, log = TRUE))
}
# Maximum likelihood estimation: one-dimensional optimization with Brent's method
optim(par = 0.5, fn = LL_binom, method = "Brent", lower = 0, upper = 1)
## $par
## [1] 0.77
## 
## $value
## [1] 24.01774
## 
## $counts
## function gradient 
##       NA       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

The MLE of p(H) is around 0.77.
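As a cross-check, the binomial MLE has a closed form, p̂ = total Heads / total flips, which gives the same answer:

sum(x) / (length(x) * 5)  # = 77/100 = 0.77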

2.

Please download the loan data 2007-2011 (“LoanStats3a.csv”) together with the data dictionary from LendingClub.com. Consider the following data management problems.

2.1

Import & read the first 39,786 observations of the CSV file (those that “meet the credit policy”). Only keep records with loan_status in (“Fully Paid”, “Charged Off”).

require(RCurl)
## Loading required package: RCurl
## Loading required package: bitops
# Download & unzip the file
download.file("https://resources.lendingclub.com/LoanStats3a.csv.zip", 
              "LoanStats3a.csv.zip", method = "auto")
unzip("LoanStats3a.csv.zip")

Read the CSV file and save it as an R Data Frame

library(data.table) # Using fread()
loan = fread("LoanStats3a.csv", skip = 1,  
             stringsAsFactors = F, verbose = T, nrows = 39786, data.table = F )
## Input contains no \n. Taking this to be a filename to open
## File opened, filesize is 0.039773 GB.
## Memory mapping ... ok
## Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
## Positioned on line 2 after skip or autostart
## This line isn't blank and skip>0 so we're done
## Detecting sep ... ','
## Detected 145 columns. Longest stretch was from line 2 to line 31
## Starting data input on line 2 (either column names or first row of data). First 10 characters: "id","memb
## The line before starting line 2 is non-empty and will be ignored (it has too few or too many items to be column names or data): Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)
## All the fields on line 2 are character fields. Treating as the column names.
## nrow set to nrows passed in (39786)
## Type codes (point  0): 4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
## Type codes (point  1): 4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
## Type codes (point  2): 4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
## Type codes (point  3): 4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
## Type codes (point  4): 4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
## Type codes (point  5): 4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
## Type codes (point  6): 4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
## Type codes (point  7): 4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
## Type codes (point  8): 4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
## Type codes (point  9): 4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
## Type codes (point 10): 4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
## Type codes: 4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444 (after applying colClasses and integer64)
## Type codes: 4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444 (after applying drop or select (if supplied)
## Allocating 145 column slots (145 - 0 dropped)
## Read 39786 rows. Exactly what was estimated and allocated up front
##    0.004s (  0%) Memory map (rerun may be quicker)
##    0.000s (  0%) sep and header detection
##    0.000s (  0%) Count rows (wc -l)
##    0.003s (  0%) Column type detection (100 rows at 10 points)
##    0.271s ( 30%) Allocation of 39786x145 result (xMB) in RAM
##    0.621s ( 69%) Reading data
##    0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
##    0.000s (  0%) Coercing data already read in type bumps (if any)
##    0.004s (  0%) Changing na.strings to NA
##    0.902s        Total

Keep those records with loan_status in (“Charged Off”, “Fully Paid”)

loan = loan[loan$loan_status %in% c("Charged Off", "Fully Paid"),]

2.2

Create an R function procFreq(df, x, y) that, given an R data frame df, returns a crosstab of x by y along with the Chi-squared statistic and p-value for the test of independence between variables x and y.

procFreq = function(df, x, y){
  crosstab = table(df[,x], df[, y])
  chisq = chisq.test(crosstab)
  # Return an R list
  return(list("crosstab" = crosstab,
              "chisq_stat" = chisq$statistic,
              "chisq_pvalue" = chisq$p.value));
}
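If useful, the same helper can also expose the expected cell counts from the chi-squared test (a small optional extension, not required by the question):

procFreq_expected = function(df, x, y){
  crosstab = table(df[[x]], df[[y]])
  chisq = chisq.test(crosstab)
  list("crosstab" = crosstab,
       "expected" = chisq$expected,      # expected counts under H0: independence
       "chisq_stat" = chisq$statistic,
       "chisq_pvalue" = chisq$p.value)
}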

e.g. Crosstab “loan_status” by “grade”

procFreq(loan, "loan_status", "grade")
## $crosstab
##              
##                   A     B     C     D     E     F     G
##   Charged Off   602  1433  1356  1130   725   323   101
##   Fully Paid   9483 10602  6755  4195  2133   731   217
## 
## $chisq_stat
## X-squared 
##  1472.815 
## 
## $chisq_pvalue
## [1] 4.136568e-315

2.3

Perform a series of bivariate analyses on loan_status (as the outcome) by grade, purpose, and term. Which variable(s) might affect loan_status? Use your own procFreq() if you’d like.

Map(function(y) procFreq(loan, "loan_status", y), list("grade", "purpose", "term"))
## [[1]]
## [[1]]$crosstab
##              
##                   A     B     C     D     E     F     G
##   Charged Off   602  1433  1356  1130   725   323   101
##   Fully Paid   9483 10602  6755  4195  2133   731   217
## 
## [[1]]$chisq_stat
## X-squared 
##  1472.815 
## 
## [[1]]$chisq_pvalue
## [1] 4.136568e-315
## 
## 
## [[2]]
## [[2]]$crosstab
##              
##                 car credit_card debt_consolidation educational
##   Charged Off   160         548               2792          56
##   Fully Paid   1391        4589              15884         269
##              
##               home_improvement house major_purchase medical moving other
##   Charged Off              351    59            222     106     92   637
##   Fully Paid              2634   323           1966     589    491  3364
##              
##               renewable_energy small_business vacation wedding
##   Charged Off               19            479       53      96
##   Fully Paid                84           1352      328     852
## 
## [[2]]$chisq_stat
## X-squared 
##  367.2878 
## 
## [[2]]$chisq_pvalue
## [1] 1.779283e-70
## 
## 
## [[3]]
## [[3]]$crosstab
##              
##                36 months  60 months
##   Charged Off       3227       2443
##   Fully Paid       25869       8247
## 
## [[3]]$chisq_stat
## X-squared 
##  884.1163 
## 
## [[3]]$chisq_pvalue
## [1] 2.78472e-194

All three variables (grade, purpose, and term) are strongly associated with loan_status (all chi-squared p-values are far below 0.05), so each of them may affect loan_status.
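For a compact view, the three p-values can be pulled out directly (these are the same values reported above):

sapply(c("grade", "purpose", "term"),
       function(y) procFreq(loan, "loan_status", y)$chisq_pvalue)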

3.

Load the data airquality. Run normality tests on continuous variables and see whether these variables look “normal”.

library(nortest)
data(airquality)
lapply(airquality, function(x) Map(function(f) f(x) , c(shapiro.test, lillie.test,  ad.test) ))
## $Ozone
## $Ozone[[1]]
## 
##  Shapiro-Wilk normality test
## 
## data:  x
## W = 0.87867, p-value = 2.79e-08
## 
## 
## $Ozone[[2]]
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  x
## D = 0.14799, p-value = 1.47e-06
## 
## 
## $Ozone[[3]]
## 
##  Anderson-Darling normality test
## 
## data:  x
## A = 4.5211, p-value = 2.787e-11
## 
## 
## 
## $Solar.R
## $Solar.R[[1]]
## 
##  Shapiro-Wilk normality test
## 
## data:  x
## W = 0.94183, p-value = 9.492e-06
## 
## 
## $Solar.R[[2]]
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  x
## D = 0.12, p-value = 2.35e-05
## 
## 
## $Solar.R[[3]]
## 
##  Anderson-Darling normality test
## 
## data:  x
## A = 2.7786, p-value = 5.007e-07
## 
## 
## 
## $Wind
## $Wind[[1]]
## 
##  Shapiro-Wilk normality test
## 
## data:  x
## W = 0.98575, p-value = 0.1178
## 
## 
## $Wind[[2]]
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  x
## D = 0.082388, p-value = 0.01294
## 
## 
## $Wind[[3]]
## 
##  Anderson-Darling normality test
## 
## data:  x
## A = 0.7367, p-value = 0.05378
## 
## 
## 
## $Temp
## $Temp[[1]]
## 
##  Shapiro-Wilk normality test
## 
## data:  x
## W = 0.97617, p-value = 0.009319
## 
## 
## $Temp[[2]]
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  x
## D = 0.081313, p-value = 0.01506
## 
## 
## $Temp[[3]]
## 
##  Anderson-Darling normality test
## 
## data:  x
## A = 0.96438, p-value = 0.01467
## 
## 
## 
## $Month
## $Month[[1]]
## 
##  Shapiro-Wilk normality test
## 
## data:  x
## W = 0.88804, p-value = 2.258e-09
## 
## 
## $Month[[2]]
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  x
## D = 0.16002, p-value = 2.737e-10
## 
## 
## $Month[[3]]
## 
##  Anderson-Darling normality test
## 
## data:  x
## A = 5.4111, p-value = 2.077e-13
## 
## 
## 
## $Day
## $Day[[1]]
## 
##  Shapiro-Wilk normality test
## 
## data:  x
## W = 0.95313, p-value = 5.048e-05
## 
## 
## $Day[[2]]
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  x
## D = 0.072739, p-value = 0.04633
## 
## 
## $Day[[3]]
## 
##  Anderson-Darling normality test
## 
## data:  x
## A = 1.7456, p-value = 0.0001726

Additionally, create density plots for all numeric variables. Do their distributions seem normal? Use normality tests on these variables to justify your answers.

library(nortest); library(ggpubr);  data("airquality")
## Loading required package: magrittr

Create density plots using functions in ggplot2 & ggpubr

g = Map(function(var) qplot(var, data = airquality, geom = "density") , airquality)
ggpubr::ggarrange(plotlist = g, labels = names(g))
## Warning: Removed 37 rows containing non-finite values (stat_density).
## Warning: Removed 7 rows containing non-finite values (stat_density).

Normality tests: based on the p-values below, most variables do not look “normal”.

Wind is the exception (its Shapiro-Wilk and Anderson-Darling p-values exceed 0.05), so it may be approximately normal; even so, non-parametric models/tests are worth considering.

sapply(airquality, function(x) Map(function(f) f(x)$p.value , c(shapiro.test, lillie.test,  ad.test) ))
##      Ozone        Solar.R      Wind      Temp        Month       
## [1,] 2.789602e-08 9.491956e-06 0.1177928 0.009319356 2.258252e-09
## [2,] 1.469615e-06 2.349614e-05 0.0129374 0.01506377  2.736884e-10
## [3,] 2.787162e-11 5.006901e-07 0.0537756 0.01466933  2.07709e-13 
##      Day         
## [1,] 5.047663e-05
## [2,] 0.04633025  
## [3,] 0.0001725843
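A Q-Q plot is another quick visual check (a sketch, not required by the question): Wind should follow the reference line fairly closely, while Ozone clearly deviates from it.

# Q-Q plots for a visual normality check
wind  = airquality$Wind
ozone = na.omit(airquality$Ozone)
par(mfrow = c(1, 2))
qqnorm(wind,  main = "Wind");  qqline(wind)
qqnorm(ozone, main = "Ozone"); qqline(ozone)
par(mfrow = c(1, 1))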

4.

Please read the article Chocolate Consumption, Cognitive Function, and Nobel Laureates on our reading list.

Try to reproduce the correlation analysis but use alcohol consumption instead of chocolate consumption.

Specifically, we consider the relationship between the alcohol consumption per capita and all Nobel prizes per capita.

Do you see anything interesting?

Justify your findings with your analysis results.

You may use any crawlers or just copy & paste the data from the websites

alcohol = read.csv("/Users/yuzhe/Downloads/alcohol.csv", header = T, stringsAsFactors = F)
nobel = read.csv("/Users/yuzhe/Downloads/nobel.csv", header = T, stringsAsFactors = F)

Lowercase Country and strip blanks/unnecessary characters for simplicity

nobel$Country = tolower(nobel$Country)
nobel$Country = stringr::str_trim(gsub(" ", "", nobel$Country))

alcohol$Country = tolower(alcohol$Country)
alcohol$Country = stringr::str_trim(gsub(" ", "", alcohol$Country))

Merged by Country (31 matched countries)

nobel_alcohol = merge(nobel, alcohol, by = "Country", all = F )
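A quick check of the merge confirms the number of matched countries (consistent with df = 29 in the correlation tests below):

nrow(nobel_alcohol)  # 31 matched countries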

Seems like there is no significant correlation (cor = 0.028, p = 0.88)

cor.test(nobel_alcohol$Alcohol_consumption, nobel_alcohol$Nobellaureates_2017, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  nobel_alcohol$Alcohol_consumption and nobel_alcohol$Nobellaureates_2017
## t = 0.14991, df = 29, p-value = 0.8819
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3297654  0.3784350
## sample estimates:
##        cor 
## 0.02782623

How about a log transformation? Still no significant correlation (cor = 0.123, p = 0.51)

cor.test(log(nobel_alcohol$Alcohol_consumption), log(nobel_alcohol$Nobellaureates_2017), method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  log(nobel_alcohol$Alcohol_consumption) and log(nobel_alcohol$Nobellaureates_2017)
## t = 0.66598, df = 29, p-value = 0.5107
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2421365  0.4571911
## sample estimates:
##       cor 
## 0.1227341

Kendall’s tau is not sensitive to scale or outliers, but it also shows no correlation

cor.test(nobel_alcohol$Alcohol_consumption, nobel_alcohol$Nobellaureates_2017, method = "kendall")
## Warning in cor.test.default(nobel_alcohol$Alcohol_consumption,
## nobel_alcohol$Nobellaureates_2017, : Cannot compute exact p-value with ties
## 
##  Kendall's rank correlation tau
## 
## data:  nobel_alcohol$Alcohol_consumption and nobel_alcohol$Nobellaureates_2017
## z = -0.15358, p-value = 0.8779
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##         tau 
## -0.01980401

How about adjusting for the population?

nobel_alcohol$Population = gsub(",", "", nobel_alcohol$Population)  # remove commas
nobel_alcohol$Population = as.numeric(nobel_alcohol$Population)
# Log scale, for tests sensitive to different scales
nobel_alcohol$logPopulation = log(nobel_alcohol$Population)

Even after adjusting for (log) population, we may say that there is no correlation between alcohol consumption and Nobel laureates (partial tau = 0.012, p = 0.93)

ppcor::pcor.test(nobel_alcohol$Nobellaureates_2017, nobel_alcohol$Alcohol_consumption, 
                 nobel_alcohol$logPopulation, method = "kendall")
##     estimate   p.value  statistic  n gp  Method
## 1 0.01156943 0.9284552 0.08978856 31  1 kendall
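A simple scatterplot (just a sketch; it assumes the merged column names used above) also makes the lack of association visible:

# Alcohol consumption vs. Nobel laureates; log1p() tames the skew in the counts
plot(nobel_alcohol$Alcohol_consumption, log1p(nobel_alcohol$Nobellaureates_2017),
     xlab = "Alcohol consumption per capita", ylab = "log(1 + Nobel laureates, 2017)")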

5.

Feature engineering and learning is key to the success of a data analytics project.

Please refer to the Unit 3 slides on Feature Scaling & Transformation and answer the following questions.

5.1

Load dataset ggplot2::diamonds, perform a log transformation on the price, and replace price with your logPrice.

library(ggplot2); diamonds = as.data.frame(ggplot2::diamonds)
diamonds$logPrice = log(diamonds$price)
diamonds$price = NULL

5.2

Split your diamonds dataset into training and testing sets with the following R code.

set.seed(1); train_idx = sample(1:nrow(diamonds), 40000)
train_d = diamonds[train_idx, ]
test_d = diamonds[setdiff( 1:nrow(diamonds) , train_idx), ]; rm(diamonds)

Please fit a linear model, logPrice = f(X), where X consists of all the remaining variables, with lm() and the training set train_d. Apply your linear model to the testing set test_d and compute the Mean Absolute Error (MAE).

dia_lm = lm(logPrice ~ ., data = train_d)
yhat = predict(dia_lm, newdata = test_d[, -10])
y_yhat = test_d$logPrice - yhat
# MAE = 0.112064
mean(abs(y_yhat))
## [1] 0.112064
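Since the same MAE computation is repeated in the later questions, a small helper (an optional convenience, not part of the original code) keeps it in one place:

# Mean Absolute Error between observed and predicted values
mae = function(y, yhat) mean(abs(y - yhat))
mae(test_d$logPrice, yhat)  # reproduces the value above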

5.3

A statistical machine learning model/algorithm ŷ = f(X) can be viewed as a special feature transformation function that extracts information about an outcome y from the features X.

Moreover, aggregating and stacking models, such as ŷ = g(f_1(X), f_2(X), …, f_n(X)) and ŷ = g(f_3(f_2(f_1(X)))), create more complex functions that may better predict the outcome y.

Here, we consider using a simple linear model transformer to convert/transform/encode the original input features onto a new feature space on the same scale as the outcome y.

Please load the functions create_lm_transformer() and predict_lm_transformer() from the attached .R files.

Use both functions to create a data transformation/pre-processing “plan” and the converted training and testing datasets with the following R code.

Function to create a linear model transformer (the transformation “plan” object) that fits ŷ = f_i(x_i) for each feature x_i

create_lm_transformer = function(df, targetName){
  X_colnames = setdiff(colnames(df), targetName)
  # Create linear models y ~ x_i 
  lms = lapply(X_colnames, function(x) eval(parse(text = paste("lm(", targetName, " ~ ", x , ", data = df)", sep = "") )))
  names(lms) = X_colnames
  return(lms)
}
# Function to apply plan object to dataset
predict_lm_transformer = function(df, targetName, lm_transformer){
  X_colnames = setdiff(colnames(df), targetName)
  convertedXY = sapply(X_colnames, function(x) predict( lm_transformer[[x]], newdata = df[, X_colnames]))
  return(convertedXY)
}
# Create lm transformation plan object
dia_lm_transformer = create_lm_transformer(train_d, "logPrice")
# Convert X of training and testing dataset into new feature spaces
train_d_t = predict_lm_transformer(train_d, "logPrice", dia_lm_transformer)
train_d_t = as.data.frame(train_d_t)
train_d_t$logPrice = train_d$logPrice

test_d_t = predict_lm_transformer(test_d, "logPrice", dia_lm_transformer)
test_d_t = as.data.frame(test_d_t)
test_d_t$logPrice = test_d$logPrice
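At this point every original predictor has been replaced by a single-variable linear-model prediction of logPrice, so all transformed columns live on the same scale as the outcome. A quick structural check (a sketch; the dimensions follow from the split above):

dim(train_d_t)        # 40000 rows; 9 transformed predictors + logPrice
colnames(train_d_t)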

Please fit a linear model on the new training set train_d_t and then apply the model to the new testing set test_d_t. Does this transformation lower the MAE?

dia_c_lm = lm(logPrice ~ ., data = train_d_t)
yhat = predict(dia_c_lm, newdata = test_d_t[, -10]) 
y_yhat = test_d_t$logPrice - yhat
# MAE = 0.1291423, it DOES NOT lower the MAE
mean(abs(y_yhat))
## [1] 0.1291423

5.4

Bengio et al. point out that “the expressive power of linear features is very limited: they cannot be stacked to form deeper, more abstract representations since the composition of linear operations yields another linear operation” (see the article on representation learning on the course page of the NSYSU cyber university), which suggests that stacking linear features may not improve the interpretability or accuracy of a model.

Here, instead of fitting a linear model as in Question 5.3, we consider applying the k-nearest neighbors (KNN) regression algorithm to learn a regression model from the aforementioned converted training set train_d_t. You may simply use FNN::knn.reg() from the R package FNN.

Does this KNN model result in lower MAE compared to the linear model learned from original raw training set?

yhat = FNN::knn.reg(train = train_d_t[, -10], test = test_d_t[,-10], y = train_d_t[, 10], k = 7  )
y_yhat = test_d$logPrice - yhat$pred
# MAE = 0.08651561, it DOES lower the MAE
mean(abs(y_yhat))
## [1] 0.08651561
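If desired, k can be tuned with a quick scan over a few candidate values (a sketch; the resulting MAEs will vary slightly around the value above):

for (k in c(3, 5, 7, 9, 15)) {
  pred = FNN::knn.reg(train = train_d_t[, -10], test = test_d_t[, -10],
                      y = train_d_t[, 10], k = k)$pred
  cat("k =", k, " MAE =", mean(abs(test_d_t$logPrice - pred)), "\n")
}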

However, such a simple model on the transformed dataset does not outperform modern ensemble learning algorithms such as a random forest. For example:

# Using package "ranger"
library(ranger); set.seed(1)
rg = ranger(logPrice ~ ., data = train_d, num.trees = 200)
yhat = predict(rg, data = test_d[, -10])
y_yhat = test_d$logPrice - yhat$predictions
# MAE = 0.06578337
mean(abs(y_yhat))
## [1] 0.06578337

Surely, we can create more complex functions that stack/combine/aggregate multiple different base learners, linear or non-linear (e.g., tree-based or linear models), to form a Super Learner!

Note that winners on Kaggle.com often have their own network/layered super learners! :)