1 Significance and False Discovery
In the starter script, I’ve extracted a set of p-values from (independent) marginal regression of review stars on presence in review text for each of 5000 words. That is, we fit
\[stars_i = \alpha + \beta 1 [ x_{ji} > 0 ] + \epsilon_{ji}\]
for each term \(j\) with count \(x_{ji}\) in review \(i\), and return the p-value associated with a test of \(\beta_j \neq 0\). We’ll use these 5000 independent regressions to screen words.
1.1
Plot the p-values and comment on their distribution.
library(tidyverse)
[37m── [1mAttaching packages[22m ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──[39m
[37m[32m✔[37m [34mggplot2[37m 2.2.1 [32m✔[37m [34mpurrr [37m 0.2.4
[32m✔[37m [34mtibble [37m 1.4.2 [32m✔[37m [34mdplyr [37m 0.7.4
[32m✔[37m [34mtidyr [37m 0.8.0 [32m✔[37m [34mstringr[37m 1.2.0
[32m✔[37m [34mreadr [37m 1.1.1 [32m✔[37m [34mforcats[37m 0.2.0[39m
package ‘tibble’ was built under R version 3.4.3package ‘tidyr’ was built under R version 3.4.3[37m── [1mConflicts[22m ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
[31m✖[37m [34mdplyr[37m::[32mcollapse()[37m masks [34mdistrom[37m::collapse()
[31m✖[37m [34mtidyr[37m::[32mexpand()[37m masks [34mMatrix[37m::expand()
[31m✖[37m [34mdplyr[37m::[32mfilter()[37m masks [34mstats[37m::filter()
[31m✖[37m [34mdplyr[37m::[32mlag()[37m masks [34mstats[37m::lag()[39m
pvals =as.data.frame(mrgpvals)
ggplot(pvals, aes(x = pvals$mrgpvals)) + geom_histogram()

1.2
What is the p-value rejection region associated with a 1% False Discovery Rate? How many p-values are in this rejection region?
rej_region = fdr_cut(pvals$mrgpvals, q = .01, plotit = TRUE)

The rejection region associated with a 1% False Discovery Rate is 0.0016078. With this in mind, we would reject anything less than this value, which would mean we would reject 812 values.
1.3
Suppose you just select the words with the 250 smallest p-values as significant. How many of these do you expect are false discoveries?
pvals = pvals[order(pvals$mrgpvals), , drop = FALSE]
pvals_250 = pvals[1:250, , drop = FALSE]
pvals_250_cut = fdr_cut(pvals = pvals_250$mrgpvals, q = .1, plotit = TRUE)

for the smallest 250 P-values at a q (i.e. alpha) of .01, we would need to cut anything below 1.790372310^{-8} and that would mean that 249 would be false discoveries.
2 Least-Squares and Bootstrapping
For this question you will fit a lasso path of Gaussian regressions for stars onto x. We will focus on the AICc selected slice of \(\hat\beta\) along this path.
2.1
What are the in-sample SSE and R2 for this regression?
df = data.frame(as.matrix(yhat))
df$stars = stars
is.r2 = R2(df$stars,df$seg66, family = "gaussian")
sse = deviance(df$stars, df$seg66, family = 'gaussian')
In Sample \(R^2\): 0.5000111
\(SSE\): 6632.7653746
2.2
Describe what ratings we’d expect for Bowling businesses and Airports.
3 Model Choice and Logistic Regression
Define a ‘bad review’ as one with fewer than 4 stars. We’ll regress this binary outcome on x.
rev$bad.review = ifelse(rev$stars < 4, 1, 0)
3.1 Use cv.gamlr to fit a cross-validated lasso regularization path for this regression. Plot the in-sample fit.
cv.badreview = cv.gamlr(x, rev$bad.review, lmr = 1e-4, verb=F, family = "binomial")
numerically perfect fit for some observations.
plot(cv.stars)

plot(linfit)

3.2
Describe the criteria used to choose models under select=“1se” and select=“min” rules. What are estimated out-of-sample R2 for models fit using these \(\lambda\)?
The “min” lambda is minimum average out of sample deviance, represented by the left most vertical line, while the “1se” lambda value/oos deviance is at least 1 standard deviation away from the minimum value and still penalizes the large number of coefficeints.
r2_min <- summary(cv.badreview)[cv.badreview$seg.min,"oos.r2"]
5-fold binomial cv.gamlr object
r2_1se <- summary(cv.badreview)[cv.badreview$seg.1se,"oos.r2"]
5-fold binomial cv.gamlr object
r2_val_min <- cor((predict(cv.badreview, x, select = "min")[,1]),rev$bad.review)
r2_val_1se <- cor((predict(cv.badreview, x, select = "1se")[,1]),rev$bad.review)
\(R^2\) for min lambda: 0.6254965
\(R^2\) for 1 se lambda: 0.6095646
3.3
Compare AICc, AIC, and BIC selection to each other and to the CV rules.
library(knitr)
cv_min = cor((predict(cv.stars, x, select = "min")[,1]), (stars))
cv_1se = cor((predict(cv.stars, x, select = "1se")[,1]), (stars))
aicc = cor((predict(cv.stars, x, select=NULL)[,1]), (stars))
aic = cor((predict(linfit,
x,
select=NULL,
corrected=FALSE)[,1]),(stars))
bic = cor(exp(predict(linfit,
x,
select=NULL,
k=log(linfit$nobs),
corrected=FALSE)[,1]), stars)
table1 = data.frame("Method" = c("Minimum Cross Validation",
"1se Cross Validation",
"AICc","AIC","BIC"),
"Value" = c(cv_min,
cv_1se,
aicc,
aic,
bic))
kable(table1)
Minimum Cross Validation |
-0.6700239 |
1se Cross Validation |
-0.6570002 |
AICc |
-0.6570002 |
AIC |
0.7076896 |
BIC |
0.5185542 |
Print the top ten best and worst review words under your preferred IC selection rule. Do the results make sense?
words = data.frame("coef" = as.matrix(coef(cv.badreview)),
"abs.val" = abs(as.matrix(coef(cv.badreview))),
"word" = rownames(coef(cv.badreview))
)
words = arrange(words, desc(words$seg33.1))
colnames(words) = c("Coefficient Value", "Coefficient Absolute Value", "Word")
kable(head(words, 10))
1.366706 |
1.366706 |
shitty |
1.350759 |
1.350759 |
stale |
1.350303 |
1.350303 |
overpriced |
1.299702 |
1.299702 |
tasteless |
1.189143 |
1.189143 |
disappointing |
1.119422 |
1.119422 |
Mobile Phones |
1.083031 |
1.083031 |
disgusting |
1.081217 |
1.081217 |
bland |
1.069535 |
1.069535 |
poisoning |
1.009014 |
1.009014 |
horrible |
Yes, these words all make sense they are all negative. Additionall all of their coefficients are greater than 1 meaning that any one of these words independently increase the odds of bad.review
being 1. ***
4 Multinomial Regression
Now, we’ll consider stars as an unordered 5-level factor. Use the code in your starter script to run a multinomial logistic regression.
4.1
What is the change in odds of 1 vs 2-3 stars for an extra count of word crap in the review?
multi.coef = as.data.frame(as.matrix(Bmn)) %>%rownames_to_column() %>% filter(rowname == 'crap' |rowname == 'Crap')
kable(multi.coef)
crap |
0.108569 |
0 |
0 |
-0.1969228 |
0 |
one_star = exp(multi.coef$`1`)
four_star = exp(multi.coef$`4`)
if the word ‘crap’ is in the review, the change in odds is 0.7819642, while it is zero for both 2 and 3, and then it changes again for 4 stars where the likelihood ratio is 0.9711562. This means that if the word crap appeasr in a review, it is 11% more likely to be a one-star review, and approximately 18% less likely to be a four star review.
4.2
What are the changes in odds of 1 star vs 2 and 1 star vs 5 star ratings for an extra count of word fantastic in the review?
multi.coef2 = as.data.frame(as.matrix(Bmn)) %>%rownames_to_column() %>% filter(rowname == 'fantastic' |rowname == 'Fantastic' | rowname == 'FANTASTIC')
kable(multi.coef2)
fantastic |
-0.2459463 |
-0.4714128 |
-0.2779925 |
-0.029268 |
0.173123 |
one_star = exp(multi.coef2$`1`)
two_star = exp(multi.coef2$`2`)
three_star = exp(multi.coef2$`3`)
four_star = exp(multi.coef2$`4`)
five_star = exp(multi.coef2$`5`)
stars = data.frame("Rating" = c("One Star", "Two Star","Three Star","Four Star","Five Star"),
"Likelihood Ratio" = c(one_star, two_star, three_star, four_star, five_star)
)
One Star |
0.7819642 |
Two Star |
0.6241199 |
Three Star |
0.7573025 |
Four Star |
0.9711562 |
Five Star |
1.1890124 |
We can see here that using the word fantastic in a review leads to an 18% more likely to be a five star review, while it means it is 3% less likely to be a four star review, 25% less likely to be a three star review, 48% less likely to be a two star review and interestly it is 22% less likely to be a 1 star review.
---
title: "Machine Learning Problem Set 4"
author: "Tom Curran"
date: "May 22, 2018"
output:
  pdf_document: default
  html_notebook: default
---

```{r starter_scripts, message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}
fdr_cut <- function(pvals, q, plotit=FALSE, ...){
  pvals <- pvals[!is.na(pvals)]
  N <- length(pvals)
  
  k <- rank(pvals, ties.method="min")
  alpha <- max(pvals[ pvals<= (q*k/N) ])
  
  if(plotit){
    sig <- factor(pvals<=alpha)
    o <- order(pvals)
    plot(pvals[o], col=c("grey60","red")[sig[o]], pch=20, ..., 
       ylab="p-values", xlab="tests ordered by p-value", main = paste('FDR =',q))
    lines(1:N, q*(1:N)/N)
  }
  
  return(alpha)
}

## pred must be probabilities (0<pred<1) for binomial
deviance <- function(y, pred, family=c("gaussian","binomial")){
	family <- match.arg(family)
	if(family=="gaussian"){
		return( sum( (y-pred)^2 ) )
	}else{
		if(is.factor(y)) y <- as.numeric(y)>1
		return( -2*sum( y*log(pred) + (1-y)*log(1-pred) ) )
	}
}


## get null devaince too, and return R2
R2 <- function(y, pred, family=c("gaussian","binomial")){
	fam <- match.arg(family)
	if(fam=="binomial"){
		if(is.factor(y)){ y <- as.numeric(y)>1 }
	}
	dev <- deviance(y, pred, family=fam)
	dev0 <- deviance(y, mean(y), family=fam)
	return(1-dev/dev0)
}


### MIDTERM: yelp business reviews

## read the data
biz <- read.csv("BizType.csv")  
txt <- read.csv("WordFreq.csv")
rev <- read.csv("ReviewStats.csv") # nwrd=review length, nrev= # reviews by this user
words <- scan("words.txt", what="character", sep="\n") #\n is `end of line'
categories <- scan("categories.txt", what="character", sep="\n") # these are the yelp categories

## create some sparse matrices
library(gamlr)
Biz <- sparseMatrix(i=biz$i,j=biz$j,x=biz$x,
	dimnames=list(rev=1:nrow(rev),cat=categories))

C <- sparseMatrix(i=txt$i,j=txt$j,
		x=txt$x,
		dimnames=list(rev=1:nrow(rev),cat=words))

## what do these look like?
Biz[1:2,c(12,103,96,124,129,130)] # reviews in rows, business types concerned in columns
C[1:5,1:5] # reviews in rows, word indexes in columns, values are word counts
# just matrices, but we haven't stored the zeros

## [1] Marginal Regression Screening 
##
## Another example of an algorithm that can be Distributed! 
## We'll do 5000 univariate regressions of 
## star rating on word presence, one for each word.
## Each regression will return a p-value, and we can
## use this as an initial screen for useful words.

# create a dense matrix of word presence
P <- as.data.frame(as.matrix(C>0))
library(parallel)
margreg <- function(p){
	fit <- lm(stars~p)
	sf <- summary(fit)
	return(sf$coef[2,4]) 
}

cl <- makeCluster(detectCores())
# pull out stars and export to cores
stars <- rev$stars
clusterExport(cl,"stars") 
# run the regressions in parallel
mrgpvals <- unlist(parLapply(cl,P,margreg))
## If parallel stuff is not working, 
## you can also just do (in serial):
# mrgpvals <- c()
# for(j in 1:5000){
# 	print(j)
# 	mrgpvals <- c(mrgpvals,margreg(P[,j]))
# }
## make sure we have names
names(mrgpvals) <- colnames(P)

#####  Set data for rest of exam #####

# Restrict text data to top 250 by above pvalues
Ccut <- C[,names(sort(mrgpvals)[1:250])]
# create matrix of interactions between length and category
BizNWRD <- rev$nwrd*Biz
colnames(BizNWRD) <- paste("NWRD:",colnames(Biz),sep="") 
# create our x matrix, used throughout.
x <- cBind(Biz,BizNWRD,Ccut) # make sure you understand what's in rows, what's in columns,
                             # and what do values mean

## [2] linear regression
# note: you'll want to use lambda.min.ratio=1e-3 in your lassos.
# linear regression for 1-5 stars on x. 
linfit <- gamlr(x, y=stars, lambda.min.ratio=1e-3)
# the coefficients (drop turns it into simple vector)
Blin <- drop(coef(linfit)) # AICc default selection
# predicted values
yhat <- predict(linfit,x)
# note linfit$deviance contains the in sample deviance for each lambda

## bootstrap code for 2.3
# the sampling distribution of AICc optimal df.
# function takes an index of data and will return fitted AICc optimal df
bootdf <- function(ib){
	require(gamlr)
	fit <- gamlr(x[ib,],y=stars[ib], lambda.min.ratio=1e-3)
	sum(coef(fit)!=0)-1
}
# export the data to the clusters 
clusterExport(cl,"x")
# run 100 bootstrap resample fits
boots <- 100
n <- nrow(x)
resamp <- as.data.frame(matrix(sample(1:n,boots*n,replace=TRUE),ncol=boots))
dfsamp <- unlist(parLapply(cl,resamp,bootdf))

## again, this is all pretty fast so it could have happened in serial:
# dfsamp <- c()
# for(b in 1:100){
# 	dfsamp <- c(dfsamp, bootdf(resamp[,b]))
# 	print(b)
# }

## [3] logistic regression and model selection
y <- rev$stars<4 # our `bad review' binary response
# Say binfit is the estimated cv.gamlr object.
# Then note binfit$cvm contains  average OOS deviance for each lambda, and
# binfit$seg.min and binfit$seg.1se identify selected segments under each rule

## [4] Treatment effects estimation for `reviewer experience'

# consider treatment acting on log scale
d <- log(rev$nrev)
# fit a model for d on x
treatfit <- gamlr(x,d)

## [5] multinomial regression
# very straight forward, you just need to read the coefficients
library(distrom)
ymn <- factor(stars)
mnfit <- dmr(cl,x,ymn)
Bmn <- coef(mnfit)
Bmn[c("crap","fantastic"),]

```

### 1 Significance and False Discovery

In the starter script, I’ve extracted a set of p-values from (independent) marginal regression of review stars on presence in review text for each of 5000 words. That is, we fit

$$stars_i = \alpha + \beta 1 [ x_{ji} > 0 ] + \epsilon_{ji}$$

for each term $j$ with count $x_{ji}$ in review $i$, and return the p-value associated with a test of $\beta_j \neq 0$. We’ll use these 5000 independent regressions to screen words.

**1.1** \ 

**Plot the p-values and comment on their distribution.**

```{r message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}
library(tidyverse)
```

```{r message=FALSE, warning=FALSE, paged.print=FALSE}

pvals =as.data.frame(mrgpvals)

ggplot(pvals, aes(x = pvals$mrgpvals)) + geom_histogram()
```

**1.2** \ 

**What is the p-value rejection region associated with a 1% False Discovery Rate? How many p-values are in this rejection region?**

```{r message=FALSE, warning=FALSE}

rej_region = fdr_cut(pvals$mrgpvals, q = .01, plotit = TRUE)

```
The rejection region associated with a 1% False Discovery Rate is `r rej_region`. With this in mind, we would reject anything less than this value, which would mean we would reject `r length(pvals[pvals$mrgpvals < rej_region,])` values.


**1.3** \

**Suppose you just select the words with the 250 smallest p-values as significant. How many of these do you expect are false discoveries?**
```{r}
pvals = pvals[order(pvals$mrgpvals), , drop = FALSE]

pvals_250 = pvals[1:250, , drop = FALSE]

pvals_250_cut = fdr_cut(pvals = pvals_250$mrgpvals, q = .1, plotit = TRUE)
```

for the smallest 250 P-values at a q (i.e. alpha) of .01, we would need to cut anything below `r pvals_250_cut` and that would mean that `r length(pvals_250[pvals_250 < pvals_250_cut,])` would be false discoveries.

***

### 2 Least-Squares and Bootstrapping

For this question you will fit a lasso path of Gaussian regressions for stars onto x. We will focus on the AICc selected slice of $\hat\beta$ along this path.



**2.1** \

**What are the in-sample SSE and R2 for this regression?**

```{r}

df = data.frame(as.matrix(yhat))

df$stars = stars

is.r2 = R2(df$stars,df$seg66, family = "gaussian")

sse = deviance(df$stars, df$seg66, family = 'gaussian')

```

In Sample $R^2$: `r is.r2`

$SSE$: `r sse`

**2.2** \ 

**Describe what ratings we’d expect for Bowling businesses and Airports.**


***

### 3 Model Choice and Logistic Regression

Define a ‘bad review’ as one with fewer than 4 stars. We’ll regress this binary outcome on x. 
```{r}
rev$bad.review = ifelse(rev$stars < 4, 1, 0)
```

**3.1** \ 
**Use cv.gamlr to fit a cross-validated lasso regularization path for this regression. Plot the in-sample fit.** 

```{r}
cv.badreview = cv.gamlr(x, rev$bad.review, lmr = 1e-4, verb=F, family = "binomial")

plot(cv.stars)

```

```{r}
plot(linfit)
```

**3.2**\

**Describe the criteria used to choose models under select="1se" and select="min" rules. What are estimated out-of-sample R2 for models fit using these $\lambda$?** 

The “min” lambda is minimum average out of sample deviance, represented by the left most vertical line, while the “1se” lambda value/oos deviance is at least 1 standard deviation away from the minimum value and still penalizes the large number of coefficeints.

```{r}
r2_min <- summary(cv.badreview)[cv.badreview$seg.min,"oos.r2"]

r2_1se <- summary(cv.badreview)[cv.badreview$seg.1se,"oos.r2"]

r2_val_min <- cor((predict(cv.badreview, x, select = "min")[,1]),rev$bad.review) 

r2_val_1se <- cor((predict(cv.badreview, x, select = "1se")[,1]),rev$bad.review)
```

$R^2$ for min lambda: `r r2_val_min`

$R^2$ for 1 se lambda: `r r2_val_1se`

**3.3** \ 

**Compare AICc, AIC, and BIC selection to each other and to the CV rules.**

```{r}
library(knitr)
cv_min = cor((predict(cv.stars, x, select = "min")[,1]), (stars))

cv_1se = cor((predict(cv.stars, x, select = "1se")[,1]), (stars))

aicc = cor((predict(cv.stars, x, select=NULL)[,1]), (stars))

aic = cor((predict(linfit, 
                      x, 
                      select=NULL, 
                      corrected=FALSE)[,1]),(stars))

bic = cor(exp(predict(linfit, 
                      x, 
                      select=NULL,
                      k=log(linfit$nobs),
                      corrected=FALSE)[,1]), stars)



table1 = data.frame("Method" = c("Minimum Cross Validation",
                                 "1se Cross Validation",
                                 "AICc","AIC","BIC"),
                    "Value" = c(cv_min, 
                                cv_1se, 
                                aicc,
                                aic, 
                                bic))

kable(table1)
```
**Print the top ten best and worst review words under your preferred IC selection rule. Do the results make sense?**
```{r}
words = data.frame("coef" = as.matrix(coef(cv.badreview)),
                   "abs.val" = abs(as.matrix(coef(cv.badreview))),
                   "word" = rownames(coef(cv.badreview))
                   )

words = arrange(words, desc(words$seg33.1))
colnames(words) = c("Coefficient Value", "Coefficient Absolute Value", "Word")

kable(head(words, 10))
```

Yes, these words all make sense they are all negative. Additionall all of their coefficients are greater than 1 meaning that any one of these words independently increase the odds of `bad.review` being 1.
***

### 4 Multinomial Regression
Now, we’ll consider stars as an unordered 5-level factor. Use the code in your starter script to run a multinomial logistic regression.

**4.1**\

**What is the change in odds of 1 vs 2-3 stars for an extra count of word crap in the review?**

```{r message=FALSE, warning=FALSE, paged.print=FALSE}
multi.coef = as.data.frame(as.matrix(Bmn)) %>%rownames_to_column() %>% filter(rowname == 'crap' |rowname == 'Crap')

kable(multi.coef)

one_star = exp(multi.coef$`1`)

four_star = exp(multi.coef$`4`)
```

if the word 'crap' is in the review, the change in odds is `r one_star`, while it is zero for both 2 and 3, and then it changes again for 4 stars where the likelihood ratio is `r four_star`. This means that if the word crap appeasr in a review, it is 11% more likely to be a one-star review, and approximately 18% less likely to be a four star review. 

**4.2**\

**What are the changes in odds of 1 star vs 2 and 1 star vs 5 star ratings for an extra count of word fantastic in the review?**

```{r message=FALSE, warning=FALSE, paged.print=FALSE}
multi.coef2 = as.data.frame(as.matrix(Bmn)) %>%rownames_to_column() %>% filter(rowname == 'fantastic' |rowname == 'Fantastic' | rowname == 'FANTASTIC')

kable(multi.coef2)

one_star = exp(multi.coef2$`1`)

two_star = exp(multi.coef2$`2`)

three_star = exp(multi.coef2$`3`)

four_star = exp(multi.coef2$`4`)

five_star = exp(multi.coef2$`5`)

stars = data.frame("Rating" = c("One Star", "Two Star","Three Star","Four Star","Five Star"),
                   "Likelihood Ratio" = c(one_star, two_star, three_star, four_star, five_star)
                   )
```

```{r message=FALSE, warning=FALSE, paged.print=FALSE}
kable(stars)
```

We can see here that using the word fantastic in a review leads to an 18% more likely to be a five star review, while it means it is 3% less likely to be a four star review, 25% less likely to be a three star review, 48% less likely to be a two star review and interestly it is 22% less likely to be a 1 star review.  