Answer: the extinction probability satisfies \(p= .25+ .25\ast p+ .5\ast p^2 \implies p= .5\) (taking the smaller root of the quadratic).
Bobo’s lineage dies out if all of his children’s lineages die out. The probability that Bobo’s lineage dies out immediately (no children) is 1/4. If Bobo has one child (probability 1/4), the lineage dies out with probability P, where P is the extinction probability, since each child’s lineage has the same chances as Bobo’s. Similarly, if Bobo has 2 children (probability 1/2), the lineage dies out with probability P^2, since both new lineages need to die out. Putting all this together gives the equation above.
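A minimal sketch (assuming the 1/4, 1/4, 1/2 offspring probabilities above) that checks the fixed point both by iteration and by simulating lineages:

```python
import random

# Fixed-point iteration for p = 0.25 + 0.25*p + 0.5*p^2
p = 0.0
for _ in range(200):
    p = 0.25 + 0.25 * p + 0.5 * p ** 2
print("fixed point:", p)  # converges to 0.5, the smaller root

# Monte Carlo check: each individual has 0/1/2 children w.p. 1/4, 1/4, 1/2
def lineage_dies_out(max_generations=60, population_cap=1000):
    population = 1
    for _ in range(max_generations):
        if population == 0:
            return True
        if population > population_cap:  # a lineage this large almost surely survives
            return False
        population = sum(random.choices([0, 1, 2], weights=[1, 1, 2])[0]
                         for _ in range(population))
    return population == 0

trials = 20000
print("simulated extinction prob:",
      sum(lineage_dies_out() for _ in range(trials)) / trials)  # ≈ 0.5
```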
Answer: \(1-(1-.2)^4= .5904\); here we treat each 15-minute interval as a Bernoulli trial with two outcomes (the event happens at least once, or never), and the trial is repeated 4 times.
Answer: If you want a binomial distribution, roll the die 6 times, write down the number of odd rolls (or the number of even rolls), then add 1.
For a uniform distribution, here’s what I’d do: roll the die 3 times. If a roll is even, write down 0; if odd, write down 1. If you get all 0s, reroll. Call the three bits a, b, c and compute 4a + 2b + c, which is uniform on 1 to 7 (a short sketch follows after the source link).
[source]https://www.quora.com/How-can-you-generate-a-random-number-between-1-7-with-only-a-die-1
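A quick sketch of the uniform method (rejection sampling on three die-parity bits), with an empirical frequency check:

```python
import random
from collections import Counter

def uniform_1_to_7():
    """Roll a fair die three times; map even->0, odd->1 to get three random bits.
    Reject the value 0 (all evens) and reroll, leaving a uniform value in 1..7."""
    while True:
        a, b, c = (random.randint(1, 6) % 2 for _ in range(3))
        value = 4 * a + 2 * b + c
        if value != 0:
            return value

print(Counter(uniform_1_to_7() for _ in range(70000)))  # roughly 10000 of each value
```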
Answer: Flip twice; if HT, call it heads; if TH, call it tails; if HH or TT, discard the pair and flip twice again.
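A small sketch of this von Neumann trick, assuming (purely for illustration) a coin that lands heads with probability 0.7:

```python
import random

def fair_flip(p_heads=0.7):
    """Flip the biased coin twice: HT -> heads, TH -> tails, otherwise retry.
    HT and TH are equally likely, so the output is a fair coin flip."""
    while True:
        first = random.random() < p_heads
        second = random.random() < p_heads
        if first != second:
            return "H" if first else "T"

flips = [fair_flip() for _ in range(100000)]
print(flips.count("H") / len(flips))  # ≈ 0.5 despite the biased coin
```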
Answer: If \(\mu_1\) and \(\mu_2\) are the two means and the standard deviation is \(\sigma\) for both the distributions, then the mixture is bimodal if and only if \(|\mu_1-\mu_2|> 2\sigma\)
Answer: plug your draw into the CDF of the normal distribution with those parameters; the result is uniform on (0, 1). This technique is called the Probability Integral Transform (aka Universality of the Uniform); one requirement is that the CDF must be continuous and strictly increasing.
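A minimal check of the transform with scipy (the mean and standard deviation below are arbitrary illustrative values):

```python
import numpy as np
from scipy.stats import norm, kstest

mu, sigma = 3.0, 2.0                        # assumed parameters of the normal draws
draws = np.random.normal(mu, sigma, size=100_000)

u = norm.cdf(draws, loc=mu, scale=sigma)    # plug the draws into their own CDF
print(u.mean(), u.std())                    # ≈ 0.5 and ≈ 0.289, as for Uniform(0,1)
print(kstest(u, "uniform"))                 # KS test: consistent with Uniform(0,1)
```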
Answer: with two children there are 4 equally likely situations: BB, GG, GB, BG; since we know that at least one is a girl, only three remain (GG, GB, BG), each with probability 1/3, so the probability that both are girls is 1/3.
Answer: assume that the probability of having a girl is p; the number of children until the first girl follows a geometric distribution with parameter p, so the mean is \(1/p\).
Answer: \(\frac{\binom{12}{4}\ast \binom{8}{4}}{3!}\)
Answer: the probability of a hash collision: \(1-\frac{10!}{10^{10}}\); the expected number of hash collisions (counting each user that lands in an already-occupied bucket): \(10-10\left(1-(9/10)^{10}\right)= 10(9/10)^{10}\approx 3.49\); the expected number of hashes that are unused: \(10(9/10)^{10}\approx 3.49\).
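A short simulation (assuming 10 users hashed uniformly at random into 10 buckets, and counting a collision as a user landing in an already-occupied bucket) that matches the three quantities above:

```python
import random

n_users, n_buckets, trials = 10, 10, 200_000
any_collision = total_collisions = total_unused = 0

for _ in range(trials):
    buckets = [random.randrange(n_buckets) for _ in range(n_users)]
    distinct = len(set(buckets))
    any_collision += distinct < n_users        # at least one collision occurred
    total_collisions += n_users - distinct     # users minus distinct buckets used
    total_unused += n_buckets - distinct       # buckets that stayed empty

print("P(collision)        ~", any_collision / trials)     # 1 - 10!/10^10 ≈ 0.9996
print("E[# collisions]     ~", total_collisions / trials)  # 10*(9/10)^10 ≈ 3.49
print("E[# unused buckets] ~", total_unused / trials)      # 10*(9/10)^10 ≈ 3.49
```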
Answer: Lyfts arrive first: \(\frac{2!\cdot 3!}{5!}= \frac{1}{10}\); Ubers arrive first: the same, by symmetry.
Answer: 100+60-20=140
Answer: \(\frac{\binom{24}{5}\left(1+5(24-5)\right)}{\binom{24}{5}\binom{24}{5}}= \frac{1+5\cdot 19}{\binom{24}{5}}= \frac{96}{42504}= \frac{4}{1771}\)
Answer: 1
Answer: Sons of very tall fathers will be on average shorter than their fathers but taller than the average height of men.
Fathers of very tall sons will be on average shorter than their sons but taller than the average height of men.
Answer: (1) Let \(E(X)\) be the expected number of flips, and let \(E(X|\cdot)\) denote the expected number of additional flips given the flips seen so far. Then
\(E(X)= \frac{1}{2}(1+E(X|H))+\frac{1}{2}(1+E(X|T))\), with \(E(X|T)= E(X)\);
\(E(X|H)= \frac{1}{2}(1+E(X|HH))+\frac{1}{2}(1+E(X|HT))\), with \(E(X|HT)= E(X)\) and \(E(X|HH)=0\).
Substituting, \(E(X)= \frac{1}{2}(1+E(X|H))+ \frac{1}{2}(1+E(X))\) and \(E(X|H)= \frac{1}{2}(1+0)+\frac{1}{2}(1+E(X))\), which give \(E(X)=6\).
(2) Let the expected number of coin flips be x. The case analysis goes as follows:
a. If the first flip is a tails, we have wasted one flip. The probability of this event is 1/2 and the total number of flips required is x+1.
b. If the first flip is a heads and the second flip is a tails, we have wasted two flips. The probability of this event is 1/4 and the total number of flips required is x+2.
c. If the first flip is a heads and the second flip is also heads, we are done. The probability of this event is 1/4 and the total number of flips required is 2.
Adding, the equation that we get is x = (1/2)(x+1) + (1/4)(x+2) + (1/4)(2).
Solving, we get x = 6.
Thus, the expected number of coin flips for getting two consecutive heads is 6.
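A quick simulation confirming the answer of 6:

```python
import random

def flips_until_two_heads():
    """Flip a fair coin until two consecutive heads appear; return the flip count."""
    flips = run = 0
    while run < 2:
        flips += 1
        run = run + 1 if random.random() < 0.5 else 0
    return flips

trials = 200_000
print(sum(flips_until_two_heads() for _ in range(trials)) / trials)  # ≈ 6.0
```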
Answer: less than $3
Answer: \(4/13\approx .308\)
Answer: .4938
Answer: the p-value is calculated from the sample data, while the significance level is set up in advance when the experiment is designed. We use the distribution under H0 to calculate the probability of a result at least as extreme as the one observed.
Answer: First, we perform an overall health check using the multinomial goodness of fit test. This test checks whether observed bucket allocations collectively match expected traffic allocations. If overall health is bad, we also perform binomial tests on each bucket to pinpoint which buckets might have problems, and show a time series of newly bucketed users in case experimenters want to do a deep dive. To avoid imbalance in the first place, use proper random selection and bucket assignment, e.g. stratified random sampling. [source]blog.twitter.com/engineering/en_us/a/2015/detecting-and-avoiding-bucket-imbalance-in-ab-tests.html
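A hedged sketch of the two checks with scipy (the counts and the 25/25/50 split below are made-up numbers, not taken from the source post):

```python
import numpy as np
from scipy.stats import chisquare, binomtest

observed = np.array([5050, 4870, 10080])        # hypothetical bucket counts
expected_frac = np.array([0.25, 0.25, 0.50])    # intended traffic allocation
expected = expected_frac * observed.sum()

# Overall health check: multinomial (chi-square) goodness-of-fit test
print(chisquare(observed, f_exp=expected))

# If overall health looks bad, pinpoint problem buckets with per-bucket binomial tests
n_total = int(observed.sum())
for count, frac in zip(observed, expected_frac):
    print(binomtest(int(count), n=n_total, p=frac))
```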
Answer: A/A testing is done when organizations are taking up a new implementation of an A/B testing tool: Checking the accuracy of an A/B Testing tool; Setting a baseline conversion rate for future A/B tests; Deciding a minimum sample size. [source]vwo.com/blog/aa-test-before-ab-testing/
Answer: A test is run to see what impact a variable (i.e. a change) has on the audience. If your audience sees both versions then you are no longer testing one change at a time, you are testing two. This would invalidate your results because you wouldn’t be able to trace the audience’s behavior to a specific change. [source]www.quora.com/What-would-be-the-hazards-of-letting-users-sneak-a-peek-at-the-other-bucket-in-an-A-B-test
Answer: The principle that observation affects the outcome (users who know they are in an experiment may behave differently), and the blog posts could skew the results.
Answer: The best and easiest thing to split test would be what you are providing the user in exchange for their opt-in. Once you determine which of them is more effective, you can start to split test elements of the winning offer, such as the headlines, colors, and call-to-action copy. Of course, you can also do a reverse of this and split test elements first, then try a completely different offer later. [source]https://www.quora.com/How-would-you-conduct-an-A-B-test-on-an-opt-in-feature
Answer: multivariate testing
Answer: apply a transformation, e.g. a log transform; or use confidence intervals.
Answer: the p-value is a measure of statistical significance for accepting or rejecting a hypothesis. A small p-value means rejecting the null hypothesis; a large p-value means failing to reject it.
With “not pregnant” as the null hypothesis: a type-1 error (false positive) is a man’s test result showing pregnant; a type-2 error (false negative) is a pregnant woman’s test result showing not pregnant.
Answer: For randomly selected listings with more than one picture, hide one random picture for group A and show all pictures for group B. Compare the booking rates of the two groups. Alternatively, run a survey.
Answer: Isolate just that factor using a slowdown experiment, i.e., add a delay in an A/B test. [source]https://www.quora.com/How-would-you-design-an-experiment-to-determine-the-impact-of-latency-on-user-engagement
Answer: In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical model, given observations. Mathematically, take the first derivative of the (log-)likelihood, set it to zero, and solve for the parameter, checking that the second derivative is negative so the point is a maximum.
For the uniform distribution the derivative approach does not work (the likelihood is maximized at the boundary, e.g. the sample maximum for U(0, θ)), so the method of moments (MOM) can be used to calculate an estimate instead.
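A small sketch contrasting the calculus-based MLE with the boundary/MOM case, using an exponential sample (derivative approach) and a uniform sample; the rate 2.0 and the upper bound 5.0 are arbitrary illustrative values:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)

# Exponential(lambda): setting d/d_lambda [n*log(lambda) - lambda*sum(x)] = 0 gives lambda = n / sum(x)
x = rng.exponential(scale=1 / 2.0, size=5_000)
lambda_closed_form = len(x) / x.sum()
neg_loglik = lambda lam: -(len(x) * np.log(lam) - lam * x.sum())
lambda_numeric = minimize_scalar(neg_loglik, bounds=(1e-6, 100), method="bounded").x
print(lambda_closed_form, lambda_numeric)   # both ≈ 2.0

# Uniform(0, theta): the derivative trick fails; MLE is max(x), MOM gives theta = 2 * mean(x)
u = rng.uniform(0, 5.0, size=5_000)
print(u.max(), 2 * u.mean())                # both ≈ 5.0
```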
Answer: MAP (Maximum A Posteriori) estimation maximizes the posterior distribution given the prior distribution and the data, i.e. the likelihood weighted by the prior. MLE is a special case of MAP where the prior is an uninformative uniform distribution. MOM sets sample moments equal to population moments and solves for the parameters: \(E[W^r]= \frac{1}{n}\sum_{i=1}^{n}W^r_i\)
Answer: if the prior is uniform, MAP reduces to MLE; if the MLE does not exist or cannot be computed, use MOM; if the prior is not uniform, use MAP.
Answer: A Confidence Interval is a range of values we are fairly sure our true value lies in. In statistics, a confidence interval (CI) is a type of interval estimate, computed from the statistics of the observed data, that might contain the true value of an unknown population parameter. For example, 95% confidence interval is an interval that when constructed for a set of samples each sampled in the same way, the constructed intervals include the true mean 95% of the time.
Answer: Unbiasedness means that the expectation of the estimator is equal to the population value we are estimating. This is desirable in inference because the goal is to explain the dataset as accurately as possible. However, this is not always desirable for data analysis or predictive modeling as there is the bias variance tradeoff. We sometimes want to prioritize the generalizability and avoid overfitting by reducing variance and thus increasing bias.
Answer: R-squared is the “percent of variance explained” by the model. R-squared measures the proportion of the variation in your dependent variable (Y) explained by your independent variables (X) for a linear regression model.
Adjusted R-squared is always smaller than R-squared, but the difference is usually very small unless you are trying to estimate too many coefficients from too small a sample in the presence of too much noise.
Specifically, adjusted R-squared is equal to 1 minus (n - 1)/(n – k - 1) times 1-minus-R-squared, where n is the sample size and k is the number of independent variables. \(R^2_{adj}= 1-(1-R^2)(\frac{n-1}{n-k-1})\)
(It is possible that adjusted R-squared is negative if the model is too complex for the sample size and/or the independent variables have too little predictive value, and some software just reports that adjusted R-squared is zero in that case.) Adjusted R-squared bears the same relation to the standard error of the regression that R-squared bears to the standard deviation of the errors: one necessarily goes up when the other goes down for models fitted to the same sample of the same dependent variable.
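The formula as a one-liner, with made-up numbers showing how the penalty grows when k is large relative to n:

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.80, n=100, k=5))   # ≈ 0.789, small penalty
print(adjusted_r2(0.80, n=30,  k=20))  # ≈ 0.356, heavy penalty when k approaches n
```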
Answer: High dimensionality makes clustering hard, because having lots of dimensions means that everything is “far away” from everything else. For example, to cover a fixed fraction of the volume of the data we need to capture a very wide range of each variable as the number of variables increases. All samples end up close to the edge of the sample, which is bad news because prediction is much more difficult near the edges of the training sample. The sampling density decreases exponentially as p increases, so the data becomes much more sparse without significantly more observations. We can apply PCA first to reduce dimensionality.
Answer: Statistically, it depends on the quality of your data: if your data is biased, just getting more data won’t help. It also depends on your model: if your model suffers from high bias, getting more data won’t improve your test results beyond a point; you’d need to add more features instead. Practically, there’s a tradeoff between having more data and the additional storage, computational power, and memory it requires, so always think about the cost of collecting more data.
Answer: data visualization is an important part of data analysis; it can reveal erroneous data, outliers, missing values, and patterns, and show whether the original data needs scaling.
Answer: exploratory data analysis is necessary; we can graph the data, test it, summarize it, and even generate hypotheses.
[source]https://www.quora.com/How-can-you-make-sure-that-you-dont-analyze-something-that-ends-up-meaningless
Answer: Data analysis is part of Data Science and it therefore rests on the scientific method. The scientific method is eminently inductive: we elaborate a hypothesis, test it and refute it or not. As a result, we come up with new hypotheses which are in turn tested and so on. This is an iterative process, as science always is. We search for “the truth” by successive “mistakes”.
Using data science vocabulary, a (null) hypothesis would be something like this: “including feature X will not improve my model”, “a random forest model will perform the same as logistic regression”, and so on. You then try to refute these null hypotheses. “Successive mistakes” would be a sequence of models (remember that all models are wrong although some may be useful).
[source]https://www.quora.com/What-is-the-role-of-trial-and-error-in-data-analysis
Answer: in a Kaggle competition, you could use a Gradient Boosting Machine or Random Forest to generate plots of relative importance and information gain for each feature in the ensemble (a minimal example follows after the source link). Beyond that, a good understanding of the subject matter is important. We can also use regularization to select features.
[source]https://www.quora.com/How-do-you-determine-feature-importance-in-a-predictive-model
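A minimal scikit-learn sketch of the tree-ensemble approach, using a built-in dataset purely for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Rank features by the ensemble's built-in (impurity-based) importance
ranked = sorted(zip(X.columns, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked[:10]:
    print(f"{name:30s} {score:.3f}")
```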
Answer: remove the rows with missing values, or impute them with the mean or median (or mark them explicitly as NA).
Answer: PCA
Answer: regularization (ridge, lasso), model selection via AIC or BIC, PCA, and dropout in deep learning; on the machine learning side, random forests, ensembles, and gradient boosting.
Answer: the model cannot be fitted, e.g. the number of features exceeds the number of observations, or there is multicollinearity.
Answer: The dataset might be heterogeneous. In which case, it is recommended to cluster datasets into different subsets wisely, and then draw different models for different subsets. Or, use models like non parametric models (trees) which can deal with heterogeneity quite nicely.
Answer: the idea is to use a group of weak learners to build a strong learner, e.g. combining decision trees into a random forest. Most of the time an ensemble performs better than a single model: a single decision tree is a weak learner with a higher chance of overfitting, while a random forest usually has higher accuracy, the final prediction being a majority vote over the trees.
Bootstrap is a method for generating several resampled datasets; bagging is short for bootstrap aggregating, which fits a model on each bootstrap sample and takes a majority vote for the final result. Bagging helps to reduce the variance error.
Boosting is a sequential technique in which each new model tries to correct the errors of the previous models (e.g. by reweighting observations or fitting to the residuals); it can reduce bias but tends to overfit if not regularized (a quick comparison of these approaches follows after the source link below).
[source]https://www.analyticsvidhya.com/blog/2015/09/questions-ensemble-modeling/
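A hedged comparison on a synthetic dataset, showing a single tree versus bagging, a random forest, and gradient boosting (scores will vary with the data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "bagging (bootstrap + vote)": BaggingClassifier(DecisionTreeClassifier(),
                                                    n_estimators=100, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    print(name, round(cross_val_score(model, X, y, cv=5).mean(), 3))
```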
Answer: If the data shows one room is used much more than the others, that room is over-utilized! Maybe account for the room capacity and normalize the data.
Collect data on wifi-connected computers, laptops, and cell phones: the number connected at different times of a work day in each room. Calculate the mean number of connected devices and the mean online time for each room over a week or month. Generally, the higher the mean count and time, the more the room is utilized. We can roughly define an under-utilized room as one with a mean below the overall average, and vice versa. We still need to compare rooms with a similar purpose (meeting room, office room, café room) against each other.
[source]https://www.besanttechnologies.com/data-science-interview-questions-and-answers
Answer: like PageRank, with each user corresponding to a webpage and following a user equivalent to linking to a page.
Answer: One way you could do this is by storing a “skill level” for each user and a “difficulty level” for each problem. We assume that the probability that a user solves a problem only depends on the skill of the user and the difficulty of the problem. Then we maximize the likelihood of the data to find the hidden skill and difficulty levels. The Rasch model for dichotomous data takes the form:
\({\displaystyle \Pr\{X_{ni}=1\}={\frac {\exp({\beta _{n}}-{\delta _{i}})}{1+\exp({\beta _{n}}-{\delta _{i}})}},}\) where \(\beta_n\) is the ability of person \(n\) and \(\delta_i\) is the difficulty of item \(i\).
Answer: use the median instead of the mean, since ranks are ordinal and it is not really appropriate to add them.
Answer: collaborative filtering. You have your own votes, so you can calculate the similarity between you and each representative and select the most similar representative within the liberal and republican parties; or find the mean vote vector of each party and pick the representative closest to that center point.
Answer: reduce the text to a more compact form (e.g. fingerprinting, bag of words), then compare it with other texts by calculating a similarity score.
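A tiny bag-of-words plus cosine-similarity sketch (the three texts are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "the quick brown fox jumps over the lazy dog",
    "a quick brown dog jumps over a lazy fox",
    "quarterly earnings and stock prices moved sharply",
]
bow = CountVectorizer().fit_transform(texts)   # sparse bag-of-words matrix
print(cosine_similarity(bow))                  # first two texts score high, third low
```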
Answer: the SD is the square root of the variance, \(\sigma\), while the SE is the SD of the sampling distribution of the mean, \(\frac{\sigma}{\sqrt{n}}\)
[source]https://www.upgrad.com/blog/must-know-machine-learning-questions-linear-regression/
Answer: the loss function measures the error of a single training example, while the cost function is the average of the loss function over the training set.
Answer: 1. Filter methods, which measure the relevance of features by their correlation with the dependent variable: depending on whether the independent and dependent variables are continuous or categorical, you can use Pearson’s correlation, LDA (linear discriminant analysis), ANOVA, or the chi-square test. 2. Wrapper/embedded methods, which measure the usefulness of a subset of features by actually training a model on it: e.g. train a tree-based model and use its built-in feature importance to find the top features, or use L1 regularization, which can force the coefficients of unimportant features to shrink to exactly zero.
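A short sketch of one filter method (absolute Pearson correlation) and the L1/lasso idea on synthetic data; the dataset and the alpha value are arbitrary:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=500, n_features=15, n_informative=5,
                       noise=5.0, random_state=0)

# Filter method: rank features by |Pearson correlation| with the target
corr = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
print("top features by |corr|:", np.argsort(corr)[::-1][:5])

# L1 regularization: unimportant coefficients are shrunk exactly to zero
lasso = Lasso(alpha=1.0).fit(X, y)
print("features kept by lasso:", np.flatnonzero(lasso.coef_))
```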
Answer: advertising-driven: Pageviews and daily actives, CTR(click through rate), CPC (cost per click), click-ads, display-ads
service-driven: number of purchases, conversion rate
Answer: productivity tool: same as premium subscriptions. MOOC: same as premium subscriptions, plus completion rate.
Answer: for a product that relies heavily on engagement and interaction, such as a messaging product: active-user (AU) ratios, email summary by type, push notification summary by type, resurrection ratio.
Answer: Average Revenue Per Paid User, Average Revenue Per User
Answer: break the KPI down into its components and find where the change is; then further break that basic KPI down by channel, user cluster, etc., and relate the change to any campaigns or shifts in user behavior in that segment.
Answer: average increase in revenue gain per coupon, average increase in customers per coupon
Answer: define efficiency
Answer: rate for each action, duration users stay, CTR for sponsored feed posts. Cf. News Feed Optimization: Affinity score (how close the content creator and the user are); Weight (weight for the edge type: comment, like, tag, etc., with emphasis on features the company wants to promote); Time decay (the older, the less important).
Answer: A/B test different balance ratios and compare the results.
Answer: there is a gradual step-function type scaling mechanism until that imbalance of requests-to-drivers is alleviated and then vice versa as too many drivers come online enticed by the surge pricing structure. I would bet the algorithm is custom tailored and calibrated to each location as price elasticities almost certainly vary across different cities depending on a huge multitude of variables: income, distance/sprawl, traffic patterns, car ownership, etc. With the massive troves of user data that Uber probably has collected, they most likely have tweaked the algos for each city to adjust for these varying sensitivities to surge pricing. Throw in some machine learning and incredibly rich data and you’ve got yourself an incredible, constantly-evolving algorithm.
Answer: Netflix uses data to estimate the potential market size for an original series before giving it the go-ahead.
Answer: subscription based services
Answer: \(E(X)= nH_n\), where \(H_n= 1+\tfrac{1}{2}+\tfrac{1}{3}+\cdots+\tfrac{1}{n}\) is the \(n\)-th harmonic number; for large \(n\), \(H_n\approx \ln n+\gamma\) with \(\gamma =0.577216\), giving the approximate solution \(E(X)\approx n(\ln n + \gamma)+ \tfrac{1}{2}\)
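A simulation of the coupon-collector expectation against the approximation, with n = 50 as an arbitrary example:

```python
import random
from math import log

def draws_to_collect_all(n):
    """Draw coupons uniformly with replacement until all n types have been seen."""
    seen, draws = set(), 0
    while len(seen) < n:
        seen.add(random.randrange(n))
        draws += 1
    return draws

n, trials, gamma = 50, 5_000, 0.577216
simulated = sum(draws_to_collect_all(n) for _ in range(trials)) / trials
print(simulated, n * (log(n) + gamma) + 0.5)   # both ≈ 225 for n = 50
```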
Answer: If there is 1 tiger, it will eat the sheep because it does not need to worry about being eaten. The sheep will not survive. If there are 2 tigers, both of them know that if one eats the sheep, the other tiger will eat him. So the sheep will survive.
If there are 3 tigers, each of them knows that if one tiger eats up the sheep, the island will be left with 1 sheep and 2 tigers, and as shown in the previous case, that sheep will survive. Hence each tiger will try to eat up the sheep. The sheep will not survive.
If there are 4 tigers, then the sheep will survive.
And so on….
So, if there is an even number of tigers the sheep will survive, otherwise it will die. Hence, with 100 tigers the sheep will survive.
Answer: A: 1; B: 2; C: 5; D: 8. AB cross (2) – A returns (1) – CD cross (8) – B returns (2) – AB cross (2), for a total of 15 minutes: the fast pair makes both escort trips while the slow pair crosses only once.
May 15, May 16, May 19, June 17, June 18, July 14, July 16, August 14, August 15, August 17
Cheryl tells Albert and Bernard separately the month and the day of her birthday respectively. Then the following conversation takes place. Albert: I don’t know when Cheryl’s birthday is, but I know that Bernard doesn’t know either. Bernard: At first I didn’t know when Cheryl’s birthday is, but I know now. Albert: Then I also know when Cheryl’s birthday is. So when is Cheryl’s birthday?
Answer: Albert knows the month and Bernard knows the day. Albert’s statement that Bernard cannot know rules out May and June, since the unique days 19 and 18 occur only in those months. Bernard now knows the birthday, so the day is not 14 (which would still be ambiguous between July and August), leaving July 16, August 15 and August 17. Albert then also knows, so the month must be July (August would still leave two options). The birthday is July 16.
A casino offers a card game using a normal deck of 52 cards. The rule is that you turn over two cards each time. For each pair, if both are black, they go to the dealer’s pile; if both are red, they go to your pile; if one is black and one is red, they are discarded. The process is repeated until you go through all 52 cards. If you have more cards in your pile, you win $100; otherwise (including ties) you get nothing. The casino allows you to negotiate the price you want to pay for the game. How much would you be willing to pay to play this game?
Answer: pay nothing, since the number of all-black pairs and the number of all-red pairs are always the same, so you can never win.
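A quick simulation confirming that the two piles always tie, so any positive price loses money:

```python
import random

def pile_sizes():
    deck = ["B"] * 26 + ["R"] * 26
    random.shuffle(deck)
    mine = dealer = 0
    for i in range(0, 52, 2):
        pair = deck[i:i + 2]
        if pair == ["R", "R"]:
            mine += 1
        elif pair == ["B", "B"]:
            dealer += 1
    return mine, dealer

# blacks: 2*dealer + mixed = 26 and reds: 2*mine + mixed = 26, so mine == dealer always
print({(m, d) for m, d in (pile_sizes() for _ in range(10_000)) if m != d})  # empty set
```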
Answer: \(\lfloor 100/5\rfloor + \lfloor 100/25\rfloor = 20 + 4 = 24\). In general, take the number whose factorial you’ve been given. Divide by 5 and truncate to a whole number. Divide by \(5^2 = 25\) and truncate to a whole number. Divide by \(5^3 = 125\) and truncate to a whole number. Continue with ever-higher powers of 5 until the division results in a number less than 1, then stop. Sum all the whole numbers you got in your divisions; this is the number of trailing zeroes.
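The same procedure as a small function (counting factors of 5, since factors of 2 are always plentiful):

```python
def trailing_zeros_in_factorial(n: int) -> int:
    """Sum floor(n/5) + floor(n/25) + floor(n/125) + ... until the power of 5 exceeds n."""
    count, power = 0, 5
    while power <= n:
        count += n // power
        power *= 5
    return count

print(trailing_zeros_in_factorial(100))   # 20 + 4 = 24
```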
Answer: 5+1+1
Answer: gradient descent is the process of minimizing a function by following the gradients of the cost function. This involves knowing the form of the cost as well as its derivative, so that from a given point you know the gradient and can step in the opposite direction, i.e. downhill towards the minimum value.
Imagine a valley and a person with no sense of direction who wants to get to the bottom of the valley. He goes down the slope and takes large steps when the slope is steep and small steps when the slope is less steep. He decides his next position based on his current position and stops when he gets to the bottom of the valley which was his goal.
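A toy illustration of the idea on \(f(x)= (x-3)^2\); the function, learning rate, and step count are arbitrary choices for the sketch:

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to walk downhill toward a minimum."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

grad_f = lambda x: 2 * (x - 3)            # derivative of (x - 3)^2
print(gradient_descent(grad_f, x0=10.0))  # ≈ 3.0, the minimizer
```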
Answer: collinearity can be viewed as an extreme case of confounding, when essentially the same variable is entered into a regression equation twice, or when two variables contain exactly the same information as two other variables, and so on. [source]http://www.medicine.mcgill.ca/epidemiology/Joseph/courses/EPIB-621/confounding.pdf
Confounding variable: an important predictor that, when omitted, leads to spurious relationships in a regression equation.