Question 1

In the A/B test, we are given the following information:

P(Liking Page A) = 0.20
P(Liking Page B) = 0.2143

This statistic suggests that B is the better choice given the data available, but I am not yet confident that B is better than A when the difference between their liking percentages is this small. I would rather require a bigger margin between them, perhaps 20%, i.e., B being higher than A by 20% or A being higher than B by 20%. Test duration can also affect the conversion rate; by conversion rate I mean a user coming to the page and clicking the button.

Let’s see if we can do hypothesis testing for this. A hypothesis test is one of the best ways to validate a claim about a population. For this case, there is the complication of deciding when to stop the test: do we keep running the tests on Page A and Page B until one page is significantly better than the other in terms of probability?

Consider A to be the control set and B to be the test set

H0 (Null Hypothesis): the probability of liking Page A is equal to the probability of liking Page B

H1 (Alternate Hypothesis): the probabilities of liking Page A and Page B are different

To argue for or against the null hypothesis, we need to check the p-value. A p-value is a number that you get by running the hypothesis test on your data.

If p-value > 0.05, then we do not have enough evidence against the null hypothesis and we would go with the null hypothesis. If p-value < 0.05, then we reject the null hypothesis.

We will start the test with a simulated random experiment for a large number of iterations under the null hypothesis, to check the resulting p-value and thus determine which page is better. There are two sets of people here: the control set (page A) and the test set (page B).
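A minimal sketch of such a simulation in R, assuming (purely for illustration) 1000 users per page and a common liking probability of 0.20 under the null hypothesis:

set.seed(123)
n_a <- 1000; n_b <- 1000          # assumed sample sizes per page
p_null <- 0.20                    # common liking probability under H0
observed_diff <- 0.2143 - 0.20    # difference reported above

# Simulate many experiments under the null and record the difference in rates
n_sim <- 10000
sim_diff <- replicate(n_sim, {
  likes_a <- rbinom(1, n_a, p_null)
  likes_b <- rbinom(1, n_b, p_null)
  likes_b / n_b - likes_a / n_a
})

# Two-sided p-value: how often a difference at least this large arises by chance
p_value <- mean(abs(sim_diff) >= abs(observed_diff))
p_value

With these assumed sample sizes, the observed difference would not be surprising under the null; larger samples would be needed before calling B the winner.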

Below is the theory for the analysis:

One of the decisions is determining the number of data points needed to get a statistically significant result; this is governed by statistical power. Power has an inverse trade-off with sample size: the smaller the change you want to detect, or the more confidence you want in the result, the larger the experiment you have to run.

As you increase the number of samples, the confidence interval tightens around the mean and we get a better idea of whether we can reject the null hypothesis.

\[\alpha = P(\text{reject null} \mid \text{null is true})\]
\[\beta = P(\text{fail to reject null} \mid \text{null is false})\]

1−β is referred to as the sensitivity of the experiment, or statistical power. People often choose high sensitivity, typically around 80%.

For a small sample, α is low and β is high. For a large sample, α remains the same but β goes down (i.e., sensitivity increases). There are good online calculators for determining the number of samples; as you change one of the parameters, the required sample size changes as well. For example:

  • If you increase the baseline click-through probability (while it stays under 0.5), this increases the standard error, and therefore you need a larger number of samples
  • If you increase the practical significance level, you require fewer samples, since larger changes are easier to detect
  • If you increase the confidence level, you want to be more certain that you are rejecting the null; at the same sensitivity, this requires increasing the number of samples
  • If you want to increase the sensitivity, you need to collect more samples (a sample-size sketch follows this list)
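A minimal sketch of this calculation with base R's power.prop.test, assuming the 20% baseline above and the ~0.2143 rate for B as the effect we want to detect:

# Solve for the required sample size per group
power.prop.test(p1 = 0.20,         # baseline liking probability (page A)
                p2 = 0.2143,       # liking probability we want to detect (page B)
                sig.level = 0.05,  # alpha
                power = 0.80)      # 1 - beta (sensitivity)

Moving p2 closer to p1, raising the power, or lowering the significance level all increase the required number of samples per group.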

Comparing two samples

For comparing two samples, we calculate the pooled standard error. For example, suppose Xcont and Ncont are the number of users that click and the total number of users in the control group, and let Xexp and Nexp be the corresponding values for the experiment group. The pooled probability is given by

\[p_{pool}=\frac{X_{cont}+X_{exp}}{N_{cont}+N_{exp}}\]
\[SE_{pool}=\sqrt{p_{pool}\,(1-p_{pool})\left(\frac{1}{N_{cont}}+\frac{1}{N_{exp}}\right)}\]
\[diff=\hat{p}_{exp}-\hat{p}_{cont}\]

\[H_0: diff = 0, \quad \text{where} \quad diff \sim N(0, SE_{pool})\]

If
\[diff > 1.96 \cdot SE_{pool} \quad \text{or} \quad diff < -1.96 \cdot SE_{pool}\]
then we can reject the null hypothesis and state that our difference represents a statistically significant difference.
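A minimal sketch of this pooled test in R, assuming illustrative counts (1000 users per group with the liking rates stated in the question):

x_cont <- 200;  n_cont <- 1000   # control (page A): clicks and total users
x_exp  <- 214;  n_exp  <- 1000   # experiment (page B): clicks and total users

p_pool  <- (x_cont + x_exp) / (n_cont + n_exp)
se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
diff    <- x_exp / n_exp - x_cont / n_cont

# Reject H0 at the 5% level if |diff| exceeds 1.96 * SE_pool
c(diff = diff, margin = 1.96 * se_pool, reject = abs(diff) > 1.96 * se_pool)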

Question 2

Since there is no restriction on how to categorize the users, I would like to use the following categories for this case: Sports, Entertainment, Food, Health, and Technology.

A list representing the above will be a 5-element list. We use a word list for each topic and increment that topic's count whenever one of its words appears in the tweet. For example, if the tweet is
“I love Bryan Adams!!”, it contains the words “Bryan” and “Adams”, which are part of the word list for the topic “Entertainment”; hence this will output a distribution like [0, 1, 0, 0, 0]
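A tiny sketch of this word-list counting in R, with hypothetical (and very small) topic word lists:

# Hypothetical topic word lists; real lists would be much larger
word_lists <- list(
  Sports        = c("basketball", "games", "football"),
  Entertainment = c("bryan", "adams", "movie"),
  Food          = c("dinner", "pizza"),
  Health        = c("doctor", "gym"),
  Technology    = c("software", "phone")
)

tweet <- "I love Basketball games in LosAngeles!"
# Strip punctuation, lowercase, and split into words
words <- tolower(unlist(strsplit(gsub("[[:punct:]]", "", tweet), "\\s+")))

# Count how many words from each topic's list appear in the tweet
distribution <- sapply(word_lists, function(wl) sum(words %in% wl))
distribution   # here Sports gets the matches ("basketball", "games")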

Now, the question is how to categorize the words. I need to find the relation of the words to the categories, which is where the word2vec model is helpful. My idea is to run word2vec on my text and, per tweet per userid, check the cosine distance of some of the important/main words from the category words; that way I will be able to categorize the userids based on their tweets. I can fill up the distribution matrix based on these similarities, and then each userid will be assigned a category to be printed out.

Describing the Whole Process

  • As there is a stream of tweets coming in, I assume there is a generator of some sort, and I would store the tweets row-wise in a database file so that I have a corpus to train my model
  • I have to choose how much data I need for training and validation, so I will feed a part of the text corpus as the input to the word2vec model
  • The word2vec model will output a vector representation of words, called “word embeddings”
  • For each tweet I will remove the punctuation and the stopwords/common words so that the important or uncommon words can be picked up for training
  • After training we can calculate the cosine distance of the priority words of each tweet with respect to the categories and thus categorize them
  • Each tweet can have more than one priority word; in that case I will use the category with the lowest cosine distance
  • An example is described below
  "user_id":11,
  "tweet":"I love Basketball games in LosAngeles!"

  After preprocessing, the categories, the word list, and the resulting dataframe will look something like
  [Sports, Entertainment, Food, Health, Technology]
  words = ["love", "Basketball", "games", "LosAngeles"]
  userid  tweet     Cosine_distance_from_categories
  11      love        [0.8, 0.7, 0.8, 0.9, 0.7]
  11      Basketball  [0.1, 0.7, 0.9, 0.6, 0.9]
  11      games       [0.1, 0.4, 0.8, 0.6, 0.7]
  11      LosAngeles  [0.4, 0.4, 0.6, 0.7, 0.8]

  Normalized Sum      [0.35, 0.55, 0.775, 0.7, 0.775] ==> Sports has the lowest average distance, hence the userid "11" is put under the category "Sports"
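A minimal sketch of this distance-based assignment in R, assuming word and category embeddings are already available from a trained word2vec model (the vectors below are made-up placeholders):

# Cosine distance between two embedding vectors
cosine_distance <- function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Hypothetical embeddings: one vector per priority word of the tweet...
word_vecs <- list(
  love       = c(0.2, 0.5, 0.1),
  basketball = c(0.9, 0.1, 0.0),
  games      = c(0.8, 0.3, 0.1)
)
# ...and one vector per category
category_vecs <- list(
  Sports        = c(1.0, 0.1, 0.0),
  Entertainment = c(0.2, 0.9, 0.1),
  Food          = c(0.1, 0.2, 0.9),
  Health        = c(0.3, 0.3, 0.8),
  Technology    = c(0.4, 0.1, 0.6)
)

# Average cosine distance of the tweet's priority words to each category
avg_dist <- sapply(category_vecs, function(cv) {
  mean(sapply(word_vecs, cosine_distance, b = cv))
})

# Assign the userid/tweet to the category with the lowest average distance
names(which.min(avg_dist))   # "Sports" for these made-up vectors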

Clustering Technique to be used

Some preprocessing steps are shown below:

  • Code for taking a JSON file as input and counting words, with punctuation removed
  • I am still in the process of completing this project, but these are the starting steps
{
 "user_id": [1,2,3,4,5,6],
 "timestamp": ["2016-03-22_11-21-20","2016-03-22_11-31-20","2016-05-22_11-31-20","2018-03-22_11-31-20","2016-03-27_11-31-20","2016-08-22_11-31-20"],
 "tweet": ["It's #dinner-time!", "It's going to be Fun!!", "I love New York!", "What is wrong with this Game!", "We are going to Atlanta!", "Jack Nicolson!!"]
}
library(dplyr)
library(janeaustenr)
library(tidytext)
library(rjson)
library(ggplot2)

# Read the JSON file of tweets and convert it to a data frame
res<-fromJSON(file = "C:/Public/Code/json/tw1.json")
res1<-as.data.frame(res)

#text<-c("The world is flat!","There is no good in doing this","I love Basketball","There is something in the air", "where are you going sir?","Come on , lets do it!")
text_df<-data_frame(line=1:6, text=as.character(res1$tweet))

# Tokenize the tweets into one word per row (this also strips most punctuation)
tidyform<-text_df %>% unnest_tokens(word, text)

# Count word frequencies across all tweets
counts_<-tidyform %>% count(word, sort=TRUE)
print(head(counts_))
## # A tibble: 6 x 2
##      word     n
##     <chr> <int>
## 1   going     2
## 2    it's     2
## 3      to     2
## 4     are     1
## 5 atlanta     1
## 6      be     1
# Bar chart of word counts; reference the columns by name inside aes()
plot1<-ggplot(data = counts_, aes(x = word, y = n)) + geom_bar(stat="identity")
print(plot1)

Question 3

In a classification problem with labelled cases, i.e. a supervised problem:

Detect Overfitting

  • Overfitting can be detected if the test error is high while the training error is low
  • Cross-validation can detect overfitting by partitioning the data (see the sketch after this list)
  • Calculating R-squared values also helps detect overfitting; a large difference between predicted R-squared and regular R-squared is a good indicator of overfitting
  • Predicted R-squared can be calculated using a LOO (leave-one-out) approach, where you leave out a particular data point and check how the model predicts that data point
  • Repeat this for all the data points
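A minimal sketch of detecting overfitting with k-fold cross-validation in R, using simulated data and a deliberately over-flexible logistic model (everything here is illustrative):

set.seed(1)
n <- 150
x <- rnorm(n)
y <- rbinom(n, 1, plogis(x))          # the true relationship uses only x
df <- data.frame(x = x, y = y)

k <- 5
folds <- sample(rep(1:k, length.out = n))
train_err <- cv_err <- numeric(k)

for (i in 1:k) {
  train <- df[folds != i, ]
  test  <- df[folds == i, ]
  # poly(x, 10) makes the model overly flexible on purpose
  fit <- glm(y ~ poly(x, 10), data = train, family = binomial)
  train_err[i] <- mean((predict(fit, train, type = "response") > 0.5) != train$y)
  cv_err[i]    <- mean((predict(fit, test,  type = "response") > 0.5) != test$y)
}

# A training error clearly lower than the cross-validation error signals overfitting
c(train_error = mean(train_err), cv_error = mean(cv_err))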

Prevent Overfitting

  • A rule of thumb is that a good model should have a low training error and a low generalization error; there is always a bias-variance tradeoff, and low bias with high variance causes over-fitting
  • One of the initial steps we can take is not to make the model overly complex; for example, for an image classification problem, a simple 3-4 layer Conv2D network is a good place to start rather than using too many hidden layers and parameters
  • Another approach is introducing a validation set, or doing cross-validation, specifically k-fold CV, where you rotate the training and validation sets, though k-fold CV might be heavy on compute time
  • One of the steps we can take is L2 regularization, where we add more information (a regularization term) so that the complexity of the model is reduced (a sketch follows this list)
  • Data augmentation techniques and normalization also help reduce overfitting in image classification problems
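A minimal sketch of L2 (ridge) regularization in R, assuming the glmnet package is available; the data are simulated only to illustrate the shrinkage:

library(glmnet)

set.seed(42)
n <- 200; p <- 20
x <- matrix(rnorm(n * p), nrow = n)                     # 20 mostly-noise features
y <- rbinom(n, 1, plogis(0.8 * x[, 1] - 0.5 * x[, 2]))  # only two features matter

# alpha = 0 selects the ridge (L2) penalty; cv.glmnet picks lambda by cross-validation
fit <- cv.glmnet(x, y, family = "binomial", alpha = 0)

# Coefficients are shrunk toward zero, reducing model complexity and overfitting
coef(fit, s = "lambda.min")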

Question 4

Problem: I am tasked with making a learning agent that learns a user's behavior while using a 3D modelling software, based on their click patterns, and recommends changes. An example click sequence:

File -> Project -> Load a Picture -> Rotate the picture (by a degree) ->select a component (in the pic) -> Symmetrize -> Flip (by a degree) -> Save the model
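One possible minimal sketch (my assumption, not a finished design): treat recorded click sequences like the one above as training data for a simple first-order transition model and recommend the most likely next action.

# Hypothetical logged click sequences from the 3D modelling software
sequences <- list(
  c("File", "Project", "LoadPicture", "Rotate", "SelectComponent", "Symmetrize", "Flip", "Save"),
  c("File", "Project", "LoadPicture", "Rotate", "Save"),
  c("File", "Project", "LoadPicture", "SelectComponent", "Symmetrize", "Save")
)

# Build a table of (current action, next action) counts
steps <- do.call(rbind, lapply(sequences, function(s) {
  data.frame(from = head(s, -1), to = tail(s, -1))
}))
transitions <- table(steps$from, steps$to)

# Recommend the action the user most often performs after the current one
recommend_next <- function(current) {
  names(which.max(transitions[current, ]))
}
recommend_next("LoadPicture")   # most frequent follow-up to loading a picture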

Question 5

Regularization is necessary for cases where we overfit easily to the training set; any time we try to fit the noise along with the pattern we want to predict, we will overfit.

  • The general case where regularization is necessary is a supervised learning setting, where we try to fit our models so that the predictions match the labels
  • We keep adding more features, which adds complexity and fits the training set better
  • The training set error decreases while the test set error stays higher

Regularization will not be ideal in cases of unsupervised learning, where there are no labels as such. It is also not required in under-fitting scenarios.
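As a concrete illustration, a regularized objective adds a penalty on the weights to the usual loss; the standard L2 (ridge) form is

\[J(\theta)=\sum_{i=1}^{n} L\big(y_i, f_\theta(x_i)\big) + \lambda \sum_{j} \theta_j^{2}\]

where a larger λ penalizes complex models more strongly and pushes the weights toward zero.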

Question 6

Problem description

Approach

Idea: use a content- and habit-based recommendation system to design a system that recommends a coupon.

Testing and evaluating the Recommender System

  • I would basically use hypothesis testing or A/B testing to test whether the customer liked the coupon. In an online coupon system the user can click “Like/Dislike” buttons to rate a coupon received, and with a high number of samples we can determine whether the user prefers a certain set of coupons, i.e., keep or reject the null hypothesis accordingly
  • We can make sure that the same user is not getting the same coupon from our system all the time
  • To estimate accuracy we will not have any separate test data, so we can use cross-validation: divide the current data into k folds, see how the recommender system does on each held-out fold, and average over the k runs
  • If the online system is up and running, the key indicator is the click rate: the click rate counts how often the user clicks “Like” or “Dislike” (a small sketch follows this list)
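A minimal sketch of the click-rate indicator in R, using a hypothetical log of coupon impressions where feedback is "like", "dislike", or NA when the user did not click:

coupon_log <- data.frame(
  user_id  = c(1, 1, 2, 3, 3, 4),
  coupon   = c("c1", "c2", "c1", "c3", "c2", "c1"),
  feedback = c("like", NA, "dislike", "like", NA, "like")
)

# Click rate: share of shown coupons where the user clicked Like or Dislike
click_rate <- mean(!is.na(coupon_log$feedback))

# Like rate among clicks: a rough signal of whether the recommended coupons are liked
like_rate <- mean(coupon_log$feedback == "like", na.rm = TRUE)

c(click_rate = click_rate, like_rate = like_rate)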

Question 7