Answer: the extinction probability satisfies \(p= .25+ .25\ast p+ .5\ast p^2 \implies p= .5\) (taking the smaller root of the quadratic).
Bobo’s lineage dies out if all of his children’s lineages die out. The probability that Bobo’s lineage dies out immediately (no children) is 1/4. If Bobo has one child (probability 1/4), the lineage dies out with probability P, where P is the extinction probability, since each child’s lineage has the same chances as Bobo’s. Similarly, if Bobo has 2 children (probability 1/2), the lineage dies out with probability P^2, since both new lineages need to die out. Putting all this together gives the equation above.
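A minimal sketch (assuming the 1/4, 1/4, 1/2 offspring probabilities above) that checks the fixed point both by iteration and by simulating lineages:

```python
import random

# Fixed-point iteration for p = 0.25 + 0.25*p + 0.5*p^2
p = 0.0
for _ in range(200):
    p = 0.25 + 0.25 * p + 0.5 * p ** 2
print("fixed point:", p)  # converges to 0.5, the smaller root

# Monte Carlo check: each individual has 0/1/2 children w.p. 1/4, 1/4, 1/2
def lineage_dies_out(max_generations=60, population_cap=1000):
    population = 1
    for _ in range(max_generations):
        if population == 0:
            return True
        if population > population_cap:  # a lineage this large almost surely survives
            return False
        population = sum(random.choices([0, 1, 2], weights=[1, 1, 2])[0]
                         for _ in range(population))
    return population == 0

trials = 20000
print("simulated extinction prob:",
      sum(lineage_dies_out() for _ in range(trials)) / trials)  # ≈ 0.5
```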
Answer: \(1-(1-.2)^4= .5904\); here we treat each 15-minute interval as a Bernoulli trial with two outcomes (the event happens at least once, or never), and the trial is repeated 4 times.
Answer: If you want a binomial distribution, roll the die 6 times, write down the number of odd rolls (or the number of even rolls), then add 1.
For a uniform distribution, here’s what I’d do: roll the die 3 times. If a roll is even, write down 0; if odd, write down 1. If you get all 0s, reroll. Call the three bits a, b, c and compute 4a + 2b + c, which is uniform on 1 to 7 (a short sketch follows after the source link).
[source]https://www.quora.com/How-can-you-generate-a-random-number-between-1-7-with-only-a-die-1
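A quick sketch of the uniform method (rejection sampling on three die-parity bits), with an empirical frequency check:

```python
import random
from collections import Counter

def uniform_1_to_7():
    """Roll a fair die three times; map even->0, odd->1 to get three random bits.
    Reject the value 0 (all evens) and reroll, leaving a uniform value in 1..7."""
    while True:
        a, b, c = (random.randint(1, 6) % 2 for _ in range(3))
        value = 4 * a + 2 * b + c
        if value != 0:
            return value

print(Counter(uniform_1_to_7() for _ in range(70000)))  # roughly 10000 of each value
```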
Answer: Flip twice; if HT, call it heads; if TH, call it tails; if HH or TT, discard the pair and flip twice again.
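A small sketch of this von Neumann trick, assuming (purely for illustration) a coin that lands heads with probability 0.7:

```python
import random

def fair_flip(p_heads=0.7):
    """Flip the biased coin twice: HT -> heads, TH -> tails, otherwise retry.
    HT and TH are equally likely, so the output is a fair coin flip."""
    while True:
        first = random.random() < p_heads
        second = random.random() < p_heads
        if first != second:
            return "H" if first else "T"

flips = [fair_flip() for _ in range(100000)]
print(flips.count("H") / len(flips))  # ≈ 0.5 despite the biased coin
```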
Answer: If \(\mu_1\) and \(\mu_2\) are the two means and the standard deviation is \(\sigma\) for both the distributions, then the mixture is bimodal if and only if \(|\mu_1-\mu_2|> 2\sigma\)
Answer: plug your draw into the CDF of the normal distribution with those parameters; the result is uniform on (0, 1). This technique is called the Probability Integral Transform (aka Universality of the Uniform); one requirement is that the CDF must be continuous and strictly increasing.
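A minimal check of the transform with scipy (the mean and standard deviation below are arbitrary illustrative values):

```python
import numpy as np
from scipy.stats import norm, kstest

mu, sigma = 3.0, 2.0                        # assumed parameters of the normal draws
draws = np.random.normal(mu, sigma, size=100_000)

u = norm.cdf(draws, loc=mu, scale=sigma)    # plug the draws into their own CDF
print(u.mean(), u.std())                    # ≈ 0.5 and ≈ 0.289, as for Uniform(0,1)
print(kstest(u, "uniform"))                 # KS test: consistent with Uniform(0,1)
```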
Answer: with two children there are 4 equally likely situations: BB, GG, GB, BG; since we know that at least one is a girl, only three remain (GG, GB, BG), each with probability 1/3, so the probability that both are girls is 1/3.
Answer: assume that the probability of having a girl is p; the number of children until the first girl follows a geometric distribution with parameter p, so the mean is \(1/p\).
Answer: \(\frac{\binom{12}{4}\ast \binom{8}{4}}{3!}\)
Answer: the probability of a hash collision: \(1-\frac{10!}{10^{10}}\); the expected number of hash collisions (counting each user that lands in an already-occupied bucket): \(10-10\left(1-(9/10)^{10}\right)= 10(9/10)^{10}\approx 3.49\); the expected number of hashes that are unused: \(10(9/10)^{10}\approx 3.49\).
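A short simulation (assuming 10 users hashed uniformly at random into 10 buckets, and counting a collision as a user landing in an already-occupied bucket) that matches the three quantities above:

```python
import random

n_users, n_buckets, trials = 10, 10, 200_000
any_collision = total_collisions = total_unused = 0

for _ in range(trials):
    buckets = [random.randrange(n_buckets) for _ in range(n_users)]
    distinct = len(set(buckets))
    any_collision += distinct < n_users        # at least one collision occurred
    total_collisions += n_users - distinct     # users minus distinct buckets used
    total_unused += n_buckets - distinct       # buckets that stayed empty

print("P(collision)        ~", any_collision / trials)     # 1 - 10!/10^10 ≈ 0.9996
print("E[# collisions]     ~", total_collisions / trials)  # 10*(9/10)^10 ≈ 3.49
print("E[# unused buckets] ~", total_unused / trials)      # 10*(9/10)^10 ≈ 3.49
```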
Answer: Lyfts arrive first: \(\frac{2!\cdot 3!}{5!}= \frac{1}{10}\); Ubers arrive first: the same, by symmetry.
Answer: 100+60-20=140
Answer: \(\frac{\binom{24}{5}\left(1+5(24-5)\right)}{\binom{24}{5}\binom{24}{5}}= \frac{1+5\cdot 19}{\binom{24}{5}}= \frac{96}{42504}= \frac{4}{1771}\)
Answer: 1
Answer: Sons of very tall fathers will be on average shorter than their fathers but taller than the average height of men.
Fathers of very tall sons will be on average shorter than their sons but taller than the average height of men.
Answer: (1) Let \(E(X)\) be the expected number of flips, and let \(E(X|\cdot)\) denote the expected number of additional flips given the flips seen so far. Then
\(E(X)= \frac{1}{2}(1+E(X|H))+\frac{1}{2}(1+E(X|T))\), with \(E(X|T)= E(X)\);
\(E(X|H)= \frac{1}{2}(1+E(X|HH))+\frac{1}{2}(1+E(X|HT))\), with \(E(X|HT)= E(X)\) and \(E(X|HH)=0\).
Substituting, \(E(X)= \frac{1}{2}(1+E(X|H))+ \frac{1}{2}(1+E(X))\) and \(E(X|H)= \frac{1}{2}(1+0)+\frac{1}{2}(1+E(X))\), which give \(E(X)=6\).
(2) Let the expected number of coin flips be x. The case analysis goes as follows:
a. If the first flip is a tails, we have wasted one flip. The probability of this event is 1/2 and the total number of flips required is x+1.
b. If the first flip is a heads and the second flip is a tails, we have wasted two flips. The probability of this event is 1/4 and the total number of flips required is x+2.
c. If the first flip is a heads and the second flip is also heads, we are done. The probability of this event is 1/4 and the total number of flips required is 2.
Adding, the equation that we get is x = (1/2)(x+1) + (1/4)(x+2) + (1/4)(2).
Solving, we get x = 6.
Thus, the expected number of coin flips for getting two consecutive heads is 6.
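A quick simulation confirming the answer of 6:

```python
import random

def flips_until_two_heads():
    """Flip a fair coin until two consecutive heads appear; return the flip count."""
    flips = run = 0
    while run < 2:
        flips += 1
        run = run + 1 if random.random() < 0.5 else 0
    return flips

trials = 200_000
print(sum(flips_until_two_heads() for _ in range(trials)) / trials)  # ≈ 6.0
```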
Answer: less than $3
Answer: \(4/13\approx .308\)
Answer: .4938
Answer: the p-value is calculated from the sample data, while the significance level is set up in advance when the experiment is designed. We use the distribution under H0 to calculate the probability of a result at least as extreme as the one observed.
Answer: First, we perform an overall health check using the multinomial goodness of fit test. This test checks whether observed bucket allocations collectively match expected traffic allocations. If overall health is bad, we also perform binomial tests on each bucket to pinpoint which buckets might have problems, and show a time series of newly bucketed users in case experimenters want to do a deep dive. To avoid imbalance in the first place, use proper random selection and bucket assignment, e.g. stratified random sampling. [source]blog.twitter.com/engineering/en_us/a/2015/detecting-and-avoiding-bucket-imbalance-in-ab-tests.html
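A hedged sketch of the two checks with scipy (the counts and the 25/25/50 split below are made-up numbers, not taken from the source post):

```python
import numpy as np
from scipy.stats import chisquare, binomtest

observed = np.array([5050, 4870, 10080])        # hypothetical bucket counts
expected_frac = np.array([0.25, 0.25, 0.50])    # intended traffic allocation
expected = expected_frac * observed.sum()

# Overall health check: multinomial (chi-square) goodness-of-fit test
print(chisquare(observed, f_exp=expected))

# If overall health looks bad, pinpoint problem buckets with per-bucket binomial tests
n_total = int(observed.sum())
for count, frac in zip(observed, expected_frac):
    print(binomtest(int(count), n=n_total, p=frac))
```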
Answer: A/A testing is done when organizations are taking up a new implementation of an A/B testing tool: Checking the accuracy of an A/B Testing tool; Setting a baseline conversion rate for future A/B tests; Deciding a minimum sample size. [source]vwo.com/blog/aa-test-before-ab-testing/
Answer: A test is run to see what impact a variable (i.e. a change) has on the audience. If your audience sees both versions then you are no longer testing one change at a time, you are testing two. This would invalidate your results because you wouldn’t be able to trace the audience’s behavior to a specific change. [source]www.quora.com/What-would-be-the-hazards-of-letting-users-sneak-a-peek-at-the-other-bucket-in-an-A-B-test
Answer: The principle that observation affects the outcome (users who know they are in an experiment may behave differently), and the blog posts could skew the results.
Answer: The best and easiest thing to split test would be what you are providing the user in exchange for their opt-in. Once you determine which of them is more effective, you can start to split test elements of the winning offer, such as the headlines, colors, and call-to-action copy. Of course, you can also do a reverse of this and split test elements first, then try a completely different offer later. [source]https://www.quora.com/How-would-you-conduct-an-A-B-test-on-an-opt-in-feature
Answer: multivariate testing
Answer: apply a transformation, e.g. a log transform; or use confidence intervals.
Answer: the p-value is a measure of statistical significance for accepting or rejecting a hypothesis. A small p-value means rejecting the null hypothesis; a large p-value means failing to reject it.
With “not pregnant” as the null hypothesis: a type-1 error (false positive) is a man’s test result showing pregnant; a type-2 error (false negative) is a pregnant woman’s test result showing not pregnant.
Answer: For randomly selected listings with more than one picture, hide one random picture for group A and show all pictures for group B. Compare the booking rates of the two groups. Alternatively, run a survey.
Answer: Isolate just that factor using a slowdown experiment, i.e., add a delay in an A/B test. [source]https://www.quora.com/How-would-you-design-an-experiment-to-determine-the-impact-of-latency-on-user-engagement
Answer: In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical model, given observations. Mathematically, take the first derivative of the (log-)likelihood, set it to zero, and solve for the parameter, checking that the second derivative is negative so the point is a maximum.
For the uniform distribution the derivative approach does not work (the likelihood is maximized at the boundary, e.g. the sample maximum for U(0, θ)), so the method of moments (MOM) can be used to calculate an estimate instead.
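A small sketch contrasting the calculus-based MLE with the boundary/MOM case, using an exponential sample (derivative approach) and a uniform sample; the rate 2.0 and the upper bound 5.0 are arbitrary illustrative values:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)

# Exponential(lambda): setting d/d_lambda [n*log(lambda) - lambda*sum(x)] = 0 gives lambda = n / sum(x)
x = rng.exponential(scale=1 / 2.0, size=5_000)
lambda_closed_form = len(x) / x.sum()
neg_loglik = lambda lam: -(len(x) * np.log(lam) - lam * x.sum())
lambda_numeric = minimize_scalar(neg_loglik, bounds=(1e-6, 100), method="bounded").x
print(lambda_closed_form, lambda_numeric)   # both ≈ 2.0

# Uniform(0, theta): the derivative trick fails; MLE is max(x), MOM gives theta = 2 * mean(x)
u = rng.uniform(0, 5.0, size=5_000)
print(u.max(), 2 * u.mean())                # both ≈ 5.0
```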
Answer: MAP (Maximum A Posteriori) estimation maximizes the posterior distribution given the prior distribution and the data, i.e. the likelihood weighted by the prior. MLE is a special case of MAP where the prior is an uninformative uniform distribution. MOM sets sample moments equal to population moments and solves for the parameters: \(E[W^r]= \frac{1}{n}\sum_{i=1}^{n}W^r_i\)
Answer: if the prior is uniform, MAP reduces to MLE; if the MLE does not exist or cannot be computed, use MOM; if the prior is not uniform, use MAP.
Answer: A Confidence Interval is a range of values we are fairly sure our true value lies in. In statistics, a confidence interval (CI) is a type of interval estimate, computed from the statistics of the observed data, that might contain the true value of an unknown population parameter. For example, 95% confidence interval is an interval that when constructed for a set of samples each sampled in the same way, the constructed intervals include the true mean 95% of the time.
Answer: Unbiasedness means that the expectation of the estimator is equal to the population value we are estimating. This is desirable in inference because the goal is to explain the dataset as accurately as possible. However, this is not always desirable for data analysis or predictive modeling as there is the bias variance tradeoff. We sometimes want to prioritize the generalizability and avoid overfitting by reducing variance and thus increasing bias.
Answer: R-squared is the “percent of variance explained” by the model. R-squared measures the proportion of the variation in your dependent variable (Y) explained by your independent variables (X) for a linear regression model.
Adjusted R-squared is always smaller than R-squared, but the difference is usually very small unless you are trying to estimate too many coefficients from too small a sample in the presence of too much noise.
Specifically, adjusted R-squared is equal to 1 minus (n - 1)/(n – k - 1) times 1-minus-R-squared, where n is the sample size and k is the number of independent variables. \(R^2_{adj}= 1-(1-R^2)(\frac{n-1}{n-k-1})\)
(It is possible that adjusted R-squared is negative if the model is too complex for the sample size and/or the independent variables have too little predictive value, and some software just reports that adjusted R-squared is zero in that case.) Adjusted R-squared bears the same relation to the standard error of the regression that R-squared bears to the standard deviation of the errors: one necessarily goes up when the other goes down for models fitted to the same sample of the same dependent variable.
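The formula as a one-liner, with made-up numbers showing how the penalty grows when k is large relative to n:

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.80, n=100, k=5))   # ≈ 0.789, small penalty
print(adjusted_r2(0.80, n=30,  k=20))  # ≈ 0.356, heavy penalty when k approaches n
```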
Answer: High dimensionality makes clustering hard, because having lots of dimensions means that everything is “far away” from everything else. For example, to cover a fixed fraction of the volume of the data we need to capture a very wide range of each variable as the number of variables increases. All samples end up close to the edge of the sample, which is bad news because prediction is much more difficult near the edges of the training sample. The sampling density decreases exponentially as p increases, so the data becomes much more sparse without significantly more observations. We can apply PCA first to reduce dimensionality.
Answer: Statistically, it depends on the quality of your data: if your data is biased, just getting more data won’t help. It also depends on your model: if your model suffers from high bias, getting more data won’t improve your test results beyond a point; you’d need to add more features instead. Practically, there’s a tradeoff between having more data and the additional storage, computational power, and memory it requires, so always think about the cost of collecting more data.
Answer: data visualization is an important part of data analysis; it can reveal erroneous data, outliers, missing values, and patterns, and show whether the original data needs scaling.
Answer: exploratory data analysis is necessary; we can graph the data, test it, summarize it, and even generate hypotheses.
[source]https://www.quora.com/How-can-you-make-sure-that-you-dont-analyze-something-that-ends-up-meaningless
Answer: Data analysis is part of Data Science and it therefore rests on the scientific method. The scientific method is eminently inductive: we elaborate a hypothesis, test it and refute it or not. As a result, we come up with new hypotheses which are in turn tested and so on. This is an iterative process, as science always is. We search for “the truth” by successive “mistakes”.
Using data science vocabulary, a (null) hypothesis would be something like this: “including feature X will not improve my model”, “a random forest model will perform the same as logistic regression”, and so on. You then try to refute these null hypotheses. “Successive mistakes” would be a sequence of models (remember that all models are wrong although some may be useful).
[source]https://www.quora.com/What-is-the-role-of-trial-and-error-in-data-analysis
Answer: in a Kaggle competition, you could use a Gradient Boosting Machine or Random Forest to generate plots of relative importance and information gain for each feature in the ensemble (a minimal example follows after the source link). Beyond that, a good understanding of the subject matter is important. We can also use regularization to select features.
[source]https://www.quora.com/How-do-you-determine-feature-importance-in-a-predictive-model
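A minimal scikit-learn sketch of the tree-ensemble approach, using a built-in dataset purely for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Rank features by the ensemble's built-in (impurity-based) importance
ranked = sorted(zip(X.columns, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked[:10]:
    print(f"{name:30s} {score:.3f}")
```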
Answer: remove the rows with missing values, or impute them with the mean or median (or mark them explicitly as NA).
Answer: PCA
Answer: regularization (ridge, lasso), model selection via AIC or BIC, PCA, and dropout in deep learning; on the machine learning side, random forests, ensembles, and gradient boosting.
Answer: the model cannot be fitted, e.g. the number of features exceeds the number of observations, or there is multicollinearity.
Answer: The dataset might be heterogeneous. In which case, it is recommended to cluster datasets into different subsets wisely, and then draw different models for different subsets. Or, use models like non parametric models (trees) which can deal with heterogeneity quite nicely.
Answer: the idea is to use a group of weak learners to build a strong learner, e.g. combining decision trees into a random forest. Most of the time an ensemble performs better than a single model: a single decision tree is a weak learner with a higher chance of overfitting, while a random forest usually has higher accuracy, the final prediction being a majority vote over the trees.
Bootstrap is a method for generating several resampled datasets; bagging is short for bootstrap aggregating, which fits a model on each bootstrap sample and takes a majority vote for the final result. Bagging helps to reduce the variance error.
Boosting is a sequential technique in which each new model tries to correct the errors of the previous models (e.g. by reweighting observations or fitting to the residuals); it can reduce bias but tends to overfit if not regularized (a quick comparison of these approaches follows after the source link below).
[source]https://www.analyticsvidhya.com/blog/2015/09/questions-ensemble-modeling/
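A hedged comparison on a synthetic dataset, showing a single tree versus bagging, a random forest, and gradient boosting (scores will vary with the data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "bagging (bootstrap + vote)": BaggingClassifier(DecisionTreeClassifier(),
                                                    n_estimators=100, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    print(name, round(cross_val_score(model, X, y, cv=5).mean(), 3))
```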
Answer: If the data shows one room is used much more than the others, that room is over-utilized! Maybe account for the room capacity and normalize the data.
Collect data on wifi-connected computers, laptops, and cell phones: the number connected at different times of a work day in each room. Calculate the mean number of connected devices and the mean online time for each room over a week or month. Generally, the higher the mean count and time, the more the room is utilized. We can roughly define an under-utilized room as one with a mean below the overall average, and vice versa. We still need to compare rooms with a similar purpose (meeting room, office room, café room) against each other.
[source]https://www.besanttechnologies.com/data-science-interview-questions-and-answers
Answer: like PageRank, with each user corresponding to a webpage and following a user equivalent to linking to a page.
Answer: One way you could do this is by storing a “skill level” for each user and a “difficulty level” for each problem. We assume that the probability that a user solves a problem only depends on the skill of the user and the difficulty of the problem. Then we maximize the likelihood of the data to find the hidden skill and difficulty levels. The Rasch model for dichotomous data takes the form:
\({\displaystyle \Pr\{X_{ni}=1\}={\frac {\exp({\beta _{n}}-{\delta _{i}})}{1+\exp({\beta _{n}}-{\delta _{i}})}},}\) where \(\beta_n\) is the ability of person \(n\) and \(\delta_i\) is the difficulty of item \(i\).
Answer: use the median instead of the mean, since ranks are ordinal and it is not really appropriate to add them.
Answer: collaborative filtering. You have your own votes, so you can calculate the similarity between you and each representative and select the most similar representative within the liberal and republican parties; or find the mean vote vector of each party and pick the representative closest to that center point.
Answer: reduce the text to a more compact form (e.g. fingerprinting, bag of words), then compare it with other texts by calculating a similarity score.
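A tiny bag-of-words plus cosine-similarity sketch (the three texts are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "the quick brown fox jumps over the lazy dog",
    "a quick brown dog jumps over a lazy fox",
    "quarterly earnings and stock prices moved sharply",
]
bow = CountVectorizer().fit_transform(texts)   # sparse bag-of-words matrix
print(cosine_similarity(bow))                  # first two texts score high, third low
```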
Answer: the SD is the square root of the variance, \(\sigma\), while the SE is the SD of the sampling distribution of the mean, \(\frac{\sigma}{\sqrt{n}}\)
[source]https://www.upgrad.com/blog/must-know-machine-learning-questions-linear-regression/
Answer: the loss function measures the error of a single training example, while the cost function is the average of the loss function over the training set.
Answer: 1. Filter methods, which measure the relevance of features by their correlation with the dependent variable: depending on whether the independent and dependent variables are continuous or categorical, you can use Pearson’s correlation, LDA (linear discriminant analysis), ANOVA, or the chi-square test. 2. Wrapper/embedded methods, which measure the usefulness of a subset of features by actually training a model on it: e.g. train a tree-based model and use its built-in feature importance to find the top features, or use L1 regularization, which can force the coefficients of unimportant features to shrink to exactly zero.
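A short sketch of one filter method (absolute Pearson correlation) and the L1/lasso idea on synthetic data; the dataset and the alpha value are arbitrary:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=500, n_features=15, n_informative=5,
                       noise=5.0, random_state=0)

# Filter method: rank features by |Pearson correlation| with the target
corr = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
print("top features by |corr|:", np.argsort(corr)[::-1][:5])

# L1 regularization: unimportant coefficients are shrunk exactly to zero
lasso = Lasso(alpha=1.0).fit(X, y)
print("features kept by lasso:", np.flatnonzero(lasso.coef_))
```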
Answer: advertising-driven: Pageviews and daily actives, CTR(click through rate), CPC (cost per click), click-ads, display-ads
service-driven: number of purchases, conversion rate
Answer: productivity tool: same as premium subscriptions. MOOC: same as premium subscriptions, plus completion rate.
Answer: for a product that relies heavily on engagement and interaction, such as a messaging product: active-user (AU) ratios, email summary by type, push notification summary by type, resurrection ratio.
Answer: Average Revenue Per Paid User, Average Revenue Per User
Answer: break the KPI down into its components and find where the change is; then further break that basic KPI down by channel, user cluster, etc., and relate the change to any campaigns or shifts in user behavior in that segment.
Answer: average increase in revenue gain per coupon, average increase in customers per coupon
Answer: define efficiency
Answer: rate for each action, duration users stay, CTR for sponsored feed posts. Cf. News Feed Optimization: Affinity score (how close the content creator and the user are); Weight (weight for the edge type: comment, like, tag, etc., with emphasis on features the company wants to promote); Time decay (the older, the less important).
Answer: A/B test different balance ratios and compare the results.
Answer: there is a gradual step-function type scaling mechanism until that imbalance of requests-to-drivers is alleviated and then vice versa as too many drivers come online enticed by the surge pricing structure. I would bet the algorithm is custom tailored and calibrated to each location as price elasticities almost certainly vary across different cities depending on a huge multitude of variables: income, distance/sprawl, traffic patterns, car ownership, etc. With the massive troves of user data that Uber probably has collected, they most likely have tweaked the algos for each city to adjust for these varying sensitivities to surge pricing. Throw in some machine learning and incredibly rich data and you’ve got yourself an incredible, constantly-evolving algorithm.
Answer: Netflix uses data to estimate the potential market size for an original series before giving it the go-ahead.
Answer: subscription based services
Answer: \(E(X)= nH_n\), where \(H_n= 1+\tfrac{1}{2}+\tfrac{1}{3}+\cdots+\tfrac{1}{n}\) is the \(n\)-th harmonic number; for large \(n\), \(H_n\approx \ln n+\gamma\) with \(\gamma =0.577216\), giving the approximate solution \(E(X)\approx n(\ln n + \gamma)+ \tfrac{1}{2}\)
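A simulation of the coupon-collector expectation against the approximation, with n = 50 as an arbitrary example:

```python
import random
from math import log

def draws_to_collect_all(n):
    """Draw coupons uniformly with replacement until all n types have been seen."""
    seen, draws = set(), 0
    while len(seen) < n:
        seen.add(random.randrange(n))
        draws += 1
    return draws

n, trials, gamma = 50, 5_000, 0.577216
simulated = sum(draws_to_collect_all(n) for _ in range(trials)) / trials
print(simulated, n * (log(n) + gamma) + 0.5)   # both ≈ 225 for n = 50
```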
Answer: If there is 1 tiger, it will eat the sheep because it does not need to worry about being eaten. The sheep will not survive. If there are 2 tigers, both of them know that if one eats the sheep, the other tiger will eat him. So the sheep will survive.
If there are 3 tigers, each of them knows that if one tiger eats up the sheep, the island will be left with 1 sheep and 2 tigers, and as shown in the previous case, that sheep will survive. Hence each tiger will try to eat up the sheep. The sheep will not survive.
If there are 4 tigers, then the sheep will survive.
And so on….
So, if there is an even number of tigers the sheep will survive, otherwise it will die. Hence, with 100 tigers the sheep will survive.
Answer: A: 1; B: 2; C: 5; D: 8. AB cross (2) – A returns (1) – CD cross (8) – B returns (2) – AB cross (2), for a total of 15 minutes: the fast pair makes both escort trips while the slow pair crosses only once.
May 15, May 16, May 19, June 17, June 18, July 14, July 16, August 14, August 15, August 17
Cheryl tells Albert and Bernard separately the month and the day of her birthday respectively. Then the following conversation takes place. Albert: I don’t know when Cheryl’s birthday is, but I know that Bernard doesn’t know either. Bernard: At first I didn’t know when Cheryl’s birthday is, but I know now. Albert: Then I also know when Cheryl’s birthday is. So when is Cheryl’s birthday?
Answer: Albert knows the month and Bernard knows the day. Albert’s statement that Bernard cannot know rules out May and June, since the unique days 19 and 18 occur only in those months. Bernard now knows the birthday, so the day is not 14 (which would still be ambiguous between July and August), leaving July 16, August 15 and August 17. Albert then also knows, so the month must be July (August would still leave two options). The birthday is July 16.
A casino offers a card game using a normal deck of 52 cards. The rule is that you turn over two cards each time. For each pair, if both are black, they go to the dealer’s pile; if both are red, they go to your pile; if one is black and one is red, they are discarded. The process is repeated until you go through all 52 cards. If you have more cards in your pile, you win $100; otherwise (including ties) you get nothing. The casino allows you to negotiate the price you want to pay for the game. How much would you be willing to pay to play this game?
Answer: pay nothing, since the number of all-black pairs and the number of all-red pairs are always the same, so you can never win.
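A quick simulation confirming that the two piles always tie, so any positive price loses money:

```python
import random

def pile_sizes():
    deck = ["B"] * 26 + ["R"] * 26
    random.shuffle(deck)
    mine = dealer = 0
    for i in range(0, 52, 2):
        pair = deck[i:i + 2]
        if pair == ["R", "R"]:
            mine += 1
        elif pair == ["B", "B"]:
            dealer += 1
    return mine, dealer

# blacks: 2*dealer + mixed = 26 and reds: 2*mine + mixed = 26, so mine == dealer always
print({(m, d) for m, d in (pile_sizes() for _ in range(10_000)) if m != d})  # empty set
```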
Answer: \(\lfloor 100/5\rfloor + \lfloor 100/25\rfloor = 20 + 4 = 24\). In general, take the number whose factorial you’ve been given. Divide by 5 and truncate to a whole number. Divide by \(5^2 = 25\) and truncate to a whole number. Divide by \(5^3 = 125\) and truncate to a whole number. Continue with ever-higher powers of 5 until the division results in a number less than 1, then stop. Sum all the whole numbers you got in your divisions; this is the number of trailing zeroes.
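The same procedure as a small function (counting factors of 5, since factors of 2 are always plentiful):

```python
def trailing_zeros_in_factorial(n: int) -> int:
    """Sum floor(n/5) + floor(n/25) + floor(n/125) + ... until the power of 5 exceeds n."""
    count, power = 0, 5
    while power <= n:
        count += n // power
        power *= 5
    return count

print(trailing_zeros_in_factorial(100))   # 20 + 4 = 24
```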
Answer: 5+1+1
Answer: gradient descent is the process of minimizing a function by following the gradients of the cost function. This involves knowing the form of the cost as well as its derivative, so that from a given point you know the gradient and can step in the opposite direction, i.e. downhill towards the minimum value.
Imagine a valley and a person with no sense of direction who wants to get to the bottom of the valley. He goes down the slope and takes large steps when the slope is steep and small steps when the slope is less steep. He decides his next position based on his current position and stops when he gets to the bottom of the valley which was his goal.
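A toy illustration of the idea on \(f(x)= (x-3)^2\); the function, learning rate, and step count are arbitrary choices for the sketch:

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to walk downhill toward a minimum."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

grad_f = lambda x: 2 * (x - 3)            # derivative of (x - 3)^2
print(gradient_descent(grad_f, x0=10.0))  # ≈ 3.0, the minimizer
```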
Answer: collinearity can be viewed as an extreme case of confounding, when essentially the same variable is entered into a regression equation twice, or when two variables contain exactly the same information as two other variables, and so on. [source]http://www.medicine.mcgill.ca/epidemiology/Joseph/courses/EPIB-621/confounding.pdf
Confounding variable: an important predictor that, when omitted, leads to spurious relationships in a regression equation.