Suppose we have an unfair coin. Please answer the following questions.
Let’s say the probability of getting “Heads”, p(H), is 0.6. We flip this coin 30 times.
What’s the probability of getting exactly 10 Heads?
dbinom(x = 10, size = 30, prob = 0.6)
## [1] 0.001997491
Also, what’s the cumulative probability of getting less than or equal to 20 Heads?
pbinom(20, size = 30, prob = 0.6)
## [1] 0.8237135
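As a quick sanity check, the same cumulative probability can be approximated by simulation (a sketch; the seed and number of draws are arbitrary):
# Monte Carlo approximation of P(X <= 20) for X ~ Binomial(30, 0.6)
set.seed(1)
mean(rbinom(1e5, size = 30, prob = 0.6) <= 20)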
Plot the discrete probability distribution of this coin. You may just plot a bar chart.
plot(dbinom(c(0:30),size = 30, prob = 0.6), type = "h")
Use ggplot2 instead
library(ggplot2)
ggplot(aes(numOfHead, prob), data = data.frame(numOfHead = 0:30, prob = dbinom(c(0:30),size = 30,prob = 0.6)) ) +
geom_bar(stat = "identity")
Suppose we do not know p(H). We understand that the best way to estimate p(H) is to just flip the coin! We next flip the coin 5 times per trial, for 20 trials, and get the following sequence of the numbers of Heads. In this case, what is p(H)?
x = c(4, 5, 3, 4, 5, 5, 4, 4, 4, 4, 3, 3, 5, 3, 3, 4, 2, 4, 4, 4)
# Negative log-likelihood of the observed data (optim() minimizes this)
LL_binom = function(p){
ll = dbinom(x, 5, p); return(-sum(log(ll)));
}
# Maximum likelihood estimation: one-dimensional bounded minimization using Brent's method
optim(par = 0.5, fn = LL_binom, method = "Brent", lower = 0, upper = 1)
## $par
## [1] 0.77
##
## $value
## [1] 24.01774
##
## $counts
## function gradient
## NA NA
##
## $convergence
## [1] 0
##
## $message
## NULL
The MLE of p(H) is about 0.77, which matches the pooled sample proportion of Heads.
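As a check, the closed-form MLE of a binomial proportion is simply the total number of Heads divided by the total number of flips:
# Closed-form MLE: total Heads / total flips = 77 / 100
sum(x) / (5 * length(x))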
Please download the loan data 2007-2011 (“LoanStats3a.csv”) and the data dictionary from LendingClub.com. Consider the following data management problems.
### 2.1 Import & read the first 39,786 observations of the CSV file for those that “meet the credit policy”. Only keep those records with loan_status in (“Fully Paid”, “Charged Off”).
require(RCurl)
## Loading required package: RCurl
## Loading required package: bitops
# Download & unzip the file
download.file("https://resources.lendingclub.com/LoanStats3a.csv.zip",
"LoanStats3a.csv.zip", method = "auto")
unzip("LoanStats3a.csv.zip")
Read the CSV file and save it as an R Data Frame
library(data.table) # Using fread()
loan = fread("LoanStats3a.csv", skip = 1,
stringsAsFactors = F, verbose = T, nrows = 39786, data.table = F )
## Input contains no \n. Taking this to be a filename to open
## File opened, filesize is 0.039773 GB.
## Memory mapping ... ok
## Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
## Positioned on line 2 after skip or autostart
## This line isn't blank and skip>0 so we're done
## Detecting sep ... ','
## Detected 145 columns. Longest stretch was from line 2 to line 31
## Starting data input on line 2 (either column names or first row of data). First 10 characters: "id","memb
## The line before starting line 2 is non-empty and will be ignored (it has too few or too many items to be column names or data): Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)All the fields on line 2 are character fields. Treating as the column names.
## nrow set to nrows passed in (39786)
## Type codes (point 0): 4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
## Type codes (point 1): 4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
## Type codes (point 2): 4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
## Type codes (point 3): 4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
## Type codes (point 4): 4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
## Type codes (point 5): 4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
## Type codes (point 6): 4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
## Type codes (point 7): 4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
## Type codes (point 8): 4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
## Type codes (point 9): 4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
## Type codes (point 10): 4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
## Type codes: 4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444 (after applying colClasses and integer64)
## Type codes: 4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444 (after applying drop or select (if supplied)
## Allocating 145 column slots (145 - 0 dropped)
## Read 39786 rows. Exactly what was estimated and allocated up front
## 0.004s ( 0%) Memory map (rerun may be quicker)
## 0.000s ( 0%) sep and header detection
## 0.000s ( 0%) Count rows (wc -l)
## 0.003s ( 0%) Column type detection (100 rows at 10 points)
## 0.271s ( 30%) Allocation of 39786x145 result (xMB) in RAM
## 0.621s ( 69%) Reading data
## 0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered
## 0.000s ( 0%) Coercing data already read in type bumps (if any)
## 0.004s ( 0%) Changing na.strings to NA
## 0.902s Total
Keep those records with loan_status in (“Charged Off”, “Fully Paid”)
loan = loan[loan$loan_status %in% c("Charged Off", "Fully Paid"),]
Create an R function procFreq(df, x, y) that, given an R data frame df, returns a crosstab of x by y along with the Chi-squared statistic and p-value for a test of independence between variables x and y.
procFreq = function(df, x, y){
crosstab = table(df[,x], df[, y])
chisq = chisq.test(crosstab)
# Return an R list
return(list("crosstab" = crosstab,
"chisq_stat" = chisq$statistic,
"chisq_pvalue" = chisq$p.value));
}
e.g., crosstab loan_status by grade
procFreq(loan, "loan_status", "grade")
## $crosstab
##
## A B C D E F G
## Charged Off 602 1433 1356 1130 725 323 101
## Fully Paid 9483 10602 6755 4195 2133 731 217
##
## $chisq_stat
## X-squared
## 1472.815
##
## $chisq_pvalue
## [1] 4.136568e-315
Perform a series of bivariate analyses on loan_status (as the outcome) by grade, purpose, and term. Which variable(s) might affect loan_status? Use your own procFreq() if you’d like.
Map(function(y) procFreq(loan, "loan_status", y), list("grade", "purpose", "term"))
## [[1]]
## [[1]]$crosstab
##
## A B C D E F G
## Charged Off 602 1433 1356 1130 725 323 101
## Fully Paid 9483 10602 6755 4195 2133 731 217
##
## [[1]]$chisq_stat
## X-squared
## 1472.815
##
## [[1]]$chisq_pvalue
## [1] 4.136568e-315
##
##
## [[2]]
## [[2]]$crosstab
##
## car credit_card debt_consolidation educational
## Charged Off 160 548 2792 56
## Fully Paid 1391 4589 15884 269
##
## home_improvement house major_purchase medical moving other
## Charged Off 351 59 222 106 92 637
## Fully Paid 2634 323 1966 589 491 3364
##
## renewable_energy small_business vacation wedding
## Charged Off 19 479 53 96
## Fully Paid 84 1352 328 852
##
## [[2]]$chisq_stat
## X-squared
## 367.2878
##
## [[2]]$chisq_pvalue
## [1] 1.779283e-70
##
##
## [[3]]
## [[3]]$crosstab
##
## 36 months 60 months
## Charged Off 3227 2443
## Fully Paid 25869 8247
##
## [[3]]$chisq_stat
## X-squared
## 884.1163
##
## [[3]]$chisq_pvalue
## [1] 2.78472e-194
All three variables (grade, purpose, and term) show a significant association with loan_status; every Chi-squared p-value is far below 0.05, so all three may affect loan_status.
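For instance, the charge-off rate by grade can be read directly off the crosstab above; a quick sketch using prop.table():
# Column-wise proportions of the loan_status-by-grade crosstab:
# the "Charged Off" row gives the charge-off rate per grade
tab = procFreq(loan, "loan_status", "grade")$crosstab
round(prop.table(tab, margin = 2)["Charged Off", ], 3)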
Load the data airquality. Run normality tests on continuous variables and see whether these variables look “normal”.
library(nortest)
data(airquality)
# Run Shapiro-Wilk, Lilliefors (KS), and Anderson-Darling tests on each column
lapply(airquality, function(x) Map(function(f) f(x), c(shapiro.test, lillie.test, ad.test)))
## $Ozone
## $Ozone[[1]]
##
## Shapiro-Wilk normality test
##
## data: x
## W = 0.87867, p-value = 2.79e-08
##
##
## $Ozone[[2]]
##
## Lilliefors (Kolmogorov-Smirnov) normality test
##
## data: x
## D = 0.14799, p-value = 1.47e-06
##
##
## $Ozone[[3]]
##
## Anderson-Darling normality test
##
## data: x
## A = 4.5211, p-value = 2.787e-11
##
##
##
## $Solar.R
## $Solar.R[[1]]
##
## Shapiro-Wilk normality test
##
## data: x
## W = 0.94183, p-value = 9.492e-06
##
##
## $Solar.R[[2]]
##
## Lilliefors (Kolmogorov-Smirnov) normality test
##
## data: x
## D = 0.12, p-value = 2.35e-05
##
##
## $Solar.R[[3]]
##
## Anderson-Darling normality test
##
## data: x
## A = 2.7786, p-value = 5.007e-07
##
##
##
## $Wind
## $Wind[[1]]
##
## Shapiro-Wilk normality test
##
## data: x
## W = 0.98575, p-value = 0.1178
##
##
## $Wind[[2]]
##
## Lilliefors (Kolmogorov-Smirnov) normality test
##
## data: x
## D = 0.082388, p-value = 0.01294
##
##
## $Wind[[3]]
##
## Anderson-Darling normality test
##
## data: x
## A = 0.7367, p-value = 0.05378
##
##
##
## $Temp
## $Temp[[1]]
##
## Shapiro-Wilk normality test
##
## data: x
## W = 0.97617, p-value = 0.009319
##
##
## $Temp[[2]]
##
## Lilliefors (Kolmogorov-Smirnov) normality test
##
## data: x
## D = 0.081313, p-value = 0.01506
##
##
## $Temp[[3]]
##
## Anderson-Darling normality test
##
## data: x
## A = 0.96438, p-value = 0.01467
##
##
##
## $Month
## $Month[[1]]
##
## Shapiro-Wilk normality test
##
## data: x
## W = 0.88804, p-value = 2.258e-09
##
##
## $Month[[2]]
##
## Lilliefors (Kolmogorov-Smirnov) normality test
##
## data: x
## D = 0.16002, p-value = 2.737e-10
##
##
## $Month[[3]]
##
## Anderson-Darling normality test
##
## data: x
## A = 5.4111, p-value = 2.077e-13
##
##
##
## $Day
## $Day[[1]]
##
## Shapiro-Wilk normality test
##
## data: x
## W = 0.95313, p-value = 5.048e-05
##
##
## $Day[[2]]
##
## Lilliefors (Kolmogorov-Smirnov) normality test
##
## data: x
## D = 0.072739, p-value = 0.04633
##
##
## $Day[[3]]
##
## Anderson-Darling normality test
##
## data: x
## A = 1.7456, p-value = 0.0001726
Load built-in data airquality. Create density plots for all numeric variables. Do their distributions seem normal?
Run normality tests on these variables to justify your answers.
library(nortest); library(ggpubr); data("airquality")
## Loading required package: magrittr
Create density plots using functions in ggplot2 & ggpubr
g = Map(function(var) qplot(var, data = airquality, geom = "density") , airquality)
ggpubr::ggarrange(plotlist = g, labels = names(g))
## Warning: Removed 37 rows containing non-finite values (stat_density).
## Warning: Removed 7 rows containing non-finite values (stat_density).
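Note that qplot() has been deprecated in recent ggplot2 releases (>= 3.4); an equivalent sketch built with ggplot() directly, which also drops the NAs explicitly so the warnings above disappear:
# Density plots via ggplot(); na.rm = TRUE silences the non-finite-value warnings
g = Map(function(v) ggplot(data.frame(x = airquality[[v]]), aes(x)) +
          geom_density(na.rm = TRUE) + xlab(v),
        names(airquality))
ggpubr::ggarrange(plotlist = g, labels = names(g))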
Normality tests. It seems that none of the variables are clearly “normal”.
We may say that Wind is approximately normal (Shapiro-Wilk p = 0.118, Anderson-Darling p = 0.054), but we should also consider some other non-parametric models/tests.
# Extract only the p-values from each of the three normality tests
sapply(airquality, function(x) Map(function(f) f(x)$p.value, c(shapiro.test, lillie.test, ad.test)))
## Ozone Solar.R Wind Temp Month
## [1,] 2.789602e-08 9.491956e-06 0.1177928 0.009319356 2.258252e-09
## [2,] 1.469615e-06 2.349614e-05 0.0129374 0.01506377 2.736884e-10
## [3,] 2.787162e-11 5.006901e-07 0.0537756 0.01466933 2.07709e-13
## Day
## [1,] 5.047663e-05
## [2,] 0.04633025
## [3,] 0.0001725843
Please read the article Chocolate Consumption, Cognitive Function, and Nobel Laureates on our reading list.
Try to reproduce the correlation analysis but use alcohol consumption instead of chocolate consumption.
Specifically, we consider the relationship between the alcohol consumption per capita and all Nobel prizes per capita.
Do you see anything interesting?
Justify your findings with your analysis results.
You may use any crawler or just copy & paste the data from the websites.
alcohol = read.csv("/Users/yuzhe/Downloads/alcohol.csv", header = T, stringsAsFactors = F)
nobel = read.csv("/Users/yuzhe/Downloads/nobel.csv", header = T, stringsAsFactors = F)
Lowercase Country and remove blanks and unnecessary characters for simplicity
nobel$Country = tolower(nobel$Country)
nobel$Country = stringr::str_trim(gsub(" ", "", nobel$Country))
alcohol$Country = tolower(alcohol$Country)
alcohol$Country = stringr::str_trim(gsub(" ", "", alcohol$Country))
Merged by Country (31 matched countries)
nobel_alcohol = merge(nobel, alcohol, by = "Country", all = F )
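Country-name cleaning like this is error-prone (for example, hypothetical mismatched spellings such as “usa” vs. “unitedstates” would silently drop rows from the merge), so it is worth checking what failed to match; a quick sketch:
# Sketch: inspect country names that did not match on either side
setdiff(nobel$Country, alcohol$Country)   # in nobel but not in alcohol
setdiff(alcohol$Country, nobel$Country)   # in alcohol but not in nobel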
There seems to be no significant correlation (cor = 0.028, p = 0.88)
cor.test(nobel_alcohol$Alcohol_consumption, nobel_alcohol$Nobellaureates_2017, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: nobel_alcohol$Alcohol_consumption and nobel_alcohol$Nobellaureates_2017
## t = 0.14991, df = 29, p-value = 0.8819
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3297654 0.3784350
## sample estimates:
## cor
## 0.02782623
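A scatter plot tells the same story visually; a sketch, reusing the column names from the test above:
# No visible linear trend between alcohol consumption and Nobel laureates
library(ggplot2)
ggplot(nobel_alcohol, aes(Alcohol_consumption, Nobellaureates_2017)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)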
How about doing some transformation? Taking logs still shows no significant correlation (cor = 0.123, p = 0.51)
cor.test(log(nobel_alcohol$Alcohol_consumption), log(nobel_alcohol$Nobellaureates_2017), method = "pearson")
##
## Pearson's product-moment correlation
##
## data: log(nobel_alcohol$Alcohol_consumption) and log(nobel_alcohol$Nobellaureates_2017)
## t = 0.66598, df = 29, p-value = 0.5107
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2421365 0.4571911
## sample estimates:
## cor
## 0.1227341
Kendall’s tau is not sensitive to scale or outliers, but it also shows no correlation
cor.test(nobel_alcohol$Alcohol_consumption, nobel_alcohol$Nobellaureates_2017, method = "kendall")
## Warning in cor.test.default(nobel_alcohol$Alcohol_consumption,
## nobel_alcohol$Nobellaureates_2017, : Cannot compute exact p-value with ties
##
## Kendall's rank correlation tau
##
## data: nobel_alcohol$Alcohol_consumption and nobel_alcohol$Nobellaureates_2017
## z = -0.15358, p-value = 0.8779
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## -0.01980401
How about adjusting for the population?
# Remove commas and convert Population to numeric
nobel_alcohol$Population = as.numeric(gsub(",", "", nobel_alcohol$Population))
# Put Population on the log scale for tests sensitive to different scales
nobel_alcohol$logPopulation = log(nobel_alcohol$Population)
After controlling for (log) population via a partial correlation, there is still no significant association between alcohol consumption and Nobel laureates (tau = 0.0116, p = 0.93)
ppcor::pcor.test(nobel_alcohol$Nobellaureates_2017, nobel_alcohol$Alcohol_consumption,
nobel_alcohol$logPopulation, method = "kendall")
## estimate p.value statistic n gp Method
## 1 0.01156943 0.9284552 0.08978856 31 1 kendall
Feature engineering and learning are key to the success of a data analytics project.
Please refer to the Unit 3 slides on Feature Scaling & Transformation and answer the following questions.
Load dataset ggplot2::diamonds, perform a log transformation on the price, and replace price with your logPrice.
library(ggplot2); diamonds = as.data.frame(ggplot2::diamonds)
diamonds$logPrice = log(diamonds$price)
diamonds$price = NULL
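Why the log transform? price is strongly right-skewed, while logPrice is much closer to symmetric; a quick look (sketch, using base graphics and the untouched copy in ggplot2::diamonds):
# Compare the raw price distribution with the log-transformed one
par(mfrow = c(1, 2))
hist(ggplot2::diamonds$price, main = "price", xlab = "price (USD)")
hist(diamonds$logPrice, main = "logPrice", xlab = "log(price)")
par(mfrow = c(1, 1))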
Split your diamonds dataset into training and testing sets with the following R code.
set.seed(1); train_idx = sample(1:nrow(diamonds), 40000)
train_d = diamonds[train_idx, ]
test_d = diamonds[setdiff( 1:nrow(diamonds) , train_idx), ]; rm(diamonds)
Please fit a general linear model, logPrice = f(X), where X is all of the remaining variables, using lm() and the training set train_d. Apply your linear model to the testing set test_d and compute the Mean Absolute Error (MAE).
dia_lm = lm(logPrice ~ ., data = train_d)
yhat = predict(dia_lm, newdata = test_d[, -10])
y_yhat = test_d$logPrice - yhat
# MAE = 0.112064
mean(abs(y_yhat))
## [1] 0.112064
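Because the error is measured on the log scale, a rough interpretation is that a typical prediction misses the true price by a multiplicative factor of about exp(0.112) ≈ 1.12, i.e. roughly 12%:
# Approximate typical relative error on the original price scale (about 0.12)
exp(mean(abs(y_yhat))) - 1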
A statistical machine learning model/algorithm ŷ = f(X) can be considered a special feature (X) transformation function that extracts the information in feature X about an outcome y.
Also, aggregating and stacking models, such as ŷ = g(f_1(X), f_2(X), …, f_n(X)) and ŷ = g(f_3(f_2(f_1(X)))), create more complex functions that may better predict the outcome y.
Here, we consider using a simple linear model transformer to convert/transform/encode the original input features onto a new feature space on the same scale as the outcome y.
Please load the functions create_lm_transformer() and predict_lm_transformer() from the attached .R files.
Use both functions to create a data transformation/pre-processing “plan” and the new, converted training and testing datasets with the following R code.
Function to create a linear model transformer (a transformation “plan” object) that fits y = f(x_i) for each feature x_i
create_lm_transformer = function(df, targetName){
X_colnames = setdiff(colnames(df), targetName)
# Create linear models y ~ x_i
lms = lapply(X_colnames, function(x) eval(parse(text = paste("lm(", targetName, " ~ ", x , ", data = df)", sep = "") )))
names(lms) = X_colnames
return(lms)
}
# Function to apply plan object to dataset
predict_lm_transformer = function(df, targetName, lm_transformer){
X_colnames = setdiff(colnames(df), targetName)
convertedXY = sapply(X_colnames, function(x) predict( lm_transformer[[x]], newdata = df[, X_colnames]))
return(convertedXY)
}
# Create lm transformation plan object
dia_lm_transformer = create_lm_transformer(train_d, "logPrice")
# Convert X of training and testing dataset into new feature spaces
train_d_t = predict_lm_transformer(train_d, "logPrice", dia_lm_transformer)
train_d_t = as.data.frame(train_d_t)
train_d_t$logPrice = train_d$logPrice
test_d_t = predict_lm_transformer(test_d, "logPrice", dia_lm_transformer)
test_d_t = as.data.frame(test_d_t)
test_d_t$logPrice = test_d$logPrice
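What did the transformer produce? Each column of train_d_t is the prediction of a single-predictor linear model, so every transformed feature now lives on the same (log-price) scale as the outcome; a quick peek (sketch):
# Inspect the converted training set: all feature columns are predicted logPrice values
head(train_d_t)
summary(train_d_t$carat)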
Please fit a linear model on the new training set train_d_t, and then apply the model to the new testing set test_d_t. Does this transformation lower the MAE?
dia_c_lm = lm(logPrice ~ ., data = train_d_t)
yhat = predict(dia_c_lm, newdata = test_d_t[, -10])
y_yhat = test_d_t$logPrice - yhat
# MAE = 0.1291423, it DOES NOT lower the MAE
mean(abs(y_yhat))
## [1] 0.1291423
Bengio et al. point out that “the expressive power of linear features is very limited:
they cannot be stacked to form deeper, more abstract representations since the composition of linear operations yields another linear operation” (see the article on representation learning on the course page of NSYSU cyber university), which also suggests that stacking linear features may not improve the interpretability or accuracy of a model.
Here, instead of fitting a linear model as in Question 5.3, we consider applying the k-nearest neighbors (KNN) regression algorithm to learn a regression model from the aforementioned converted training set train_d_t. You may just use FNN::knn.reg() from the R package FNN.
Does this KNN model result in lower MAE compared to the linear model learned from original raw training set?
yhat = FNN::knn.reg(train = train_d_t[, -10], test = test_d_t[,-10], y = train_d_t[, 10], k = 7 )
y_yhat = test_d$logPrice - yhat$pred
# MAE = 0.08651561, it DOES lower the MAE
mean(abs(y_yhat))
## [1] 0.08651561
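The choice k = 7 is arbitrary; an informal sweep over a few values of k (a sketch; it reuses the same test split, so it is only a rough sensitivity check, not proper tuning):
# Informal sensitivity check of the MAE over k
for (k in c(3, 5, 7, 9, 15)) {
  pred = FNN::knn.reg(train = train_d_t[, -10], test = test_d_t[, -10],
                      y = train_d_t[, 10], k = k)$pred
  cat("k =", k, "MAE =", round(mean(abs(test_d_t$logPrice - pred)), 5), "\n")
}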
However, such a simple model on the transformed dataset still does not outperform modern ensemble learning algorithms, such as a random forest. For example:
# Using package "ranger"
library(ranger); set.seed(1)
rg = ranger(logPrice ~ ., data = train_d, num.trees = 200)
yhat = predict(rg, data = test_d[, -10])
y_yhat = test_d$logPrice - yhat$predictions
# MAE = 0.06578337
mean(abs(y_yhat))
## [1] 0.06578337
Surely, we can create more complex functions that stack/combine/aggregate multiple different base learners, linear or non-linear (e.g., tree-based or linear models), to form a Super Learner!
Note that the winners on Kaggle.com often have their own network/layered super learners! :)
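As a closing illustration only (not a full Super Learner, which would learn the combining weights on out-of-fold predictions), here is a minimal sketch that blends, with equal weights, the raw-feature linear model and the random forest fitted above:
# Equal-weight blend of two base learners on the test set
yhat_lm = predict(dia_lm, newdata = test_d[, -10])
yhat_rf = predict(rg, data = test_d[, -10])$predictions
yhat_blend = (yhat_lm + yhat_rf) / 2
mean(abs(test_d$logPrice - yhat_blend))   # compare with the individual MAEs above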