The chapter began with the problem of overfitting, a universal phenomenon by which models with more parameters fit a sample better, even when the additional parameters are meaningless. Two common tools were introduced to address overfitting: regularizing priors and estimates of out-of-sample accuracy (WAIC and PSIS). Regularizing priors reduce overfitting during estimation, while WAIC and PSIS help estimate the degree of overfitting. The practical compare function in the rethinking package was introduced to help analyze collections of models fit to the same data. If you are after causal estimates, however, these predictive tools can mislead you, so models must be designed through some other method, not selected on the basis of out-of-sample predictive accuracy. But any causal estimate will still overfit the sample, so you always have to worry about overfitting, measuring it with WAIC/PSIS and reducing it with regularization.
Place each answer inside the code chunk (grey box). The code chunks should contain a text response or code that completes or answers the question or activity requested. Problems are labeled Easy (E), Medium (M), and Hard (H).
Finally, upon completion, name your final output .html file as: YourName_ANLY505-Year-Semester.html and publish the assignment to your R Pubs account and submit the link to Canvas. Each question is worth 5 points.
7E1. State the three motivating criteria that define information entropy. Try to express each in your own words.
#1. The first motivating criterion is continuity: the measure of uncertainty should be a continuous function of the probabilities, so that a small change in any outcome's probability produces only a small change in the measured uncertainty.
#2. The second criterion is that uncertainty should increase with the number of possible outcomes: when more distinct events can happen, there is more to be uncertain about.
#3. The third criterion is additivity: for independent events, the uncertainty over all combinations of outcomes should equal the sum of the separate uncertainties (a numeric check of criteria 2 and 3 follows below).
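#Below is a small, optional numeric check of these criteria using base R only; the H() helper and the probabilities are made up purely for illustration.
H <- function(p) -sum(p * log(p))            # Shannon entropy in nats
# Criterion 2: uncertainty grows with the number of equally likely outcomes
H(rep(1/2, 2))                               # ~0.69
H(rep(1/4, 4))                               # ~1.39
H(rep(1/6, 6))                               # ~1.79
# Criterion 3: additivity for independent events -- the entropy of the joint
# distribution equals the sum of the separate entropies
p_coin  <- c(0.7, 0.3)
p_die   <- rep(1/4, 4)
p_joint <- as.vector(outer(p_coin, p_die))   # all 8 independent combinations
H(p_joint)                                   # ~2.00
H(p_coin) + H(p_die)                         # same value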
7E2. Suppose a coin is weighted such that, when it is tossed and lands on a table, it comes up heads 70% of the time. What is the entropy of this coin? The entropy of this coin is 0.6108643.
p <- c(0.7, 0.3)              # probabilities of heads and tails
entropy <- -sum(p * log(p))   # Shannon entropy, H(p) = -sum(p * log(p))
entropy
## [1] 0.6108643
7E3. Suppose a four-sided die is loaded such that, when tossed onto a table, it shows “1” 20%, “2” 25%, “3” 25%, and “4” 30% of the time. What is the entropy of this die? The entropy of this die is 1.376227.
p <- c(0.2, 0.25, 0.25, 0.3)
entropy <- -sum(p * log(p))
entropy
## [1] 1.376227
7E4. Suppose another four-sided die is loaded such that it never shows “4”. The other three sides show equally often. What is the entropy of this die? The entropy of this die is 1.098612.
p <- c(1/3, 1/3, 1/3)
entropy <- -sum(p * log(p))
entropy
## [1] 1.098612
7M1. Write down and compare the definitions of AIC and WAIC. Which of these criteria is most general? Which assumptions are required to transform the more general criterion into a less general one?
#Akaike Information Criterion (AIC): defined as the deviance on the training sample (D_train) plus twice the number of parameters (p):
# AIC = D_train + 2p
#Widely Applicable Information Criterion (WAIC): defined as -2 times the difference between the log pointwise predictive density (lppd, the log of the average likelihood of each observation, summed over observations) and a penalty term equal to the sum of the posterior variances of the pointwise log-likelihoods:
# WAIC(y, Θ) = -2 * ( lppd - Σ_i var_Θ log p(y_i | Θ) )
#As the name suggests, WAIC is the more general criterion: it makes no assumption about the shape of the posterior distribution. To reduce WAIC to AIC, one must assume that the priors are flat or overwhelmed by the likelihood, that the posterior distribution is approximately multivariate Gaussian, and that the sample size is much greater than the number of parameters. A by-hand sketch of the WAIC pieces follows below.
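#An optional by-hand sketch of the WAIC pieces using base R. The "posterior" here is a crude normal approximation to the posterior of the mean of simulated data, used only to produce a matrix of pointwise log-likelihoods (rows = posterior samples, columns = observations); everything in it is made up for illustration.
set.seed(7)
y <- rnorm(20, mean = 1, sd = 1)                         # simulated data
mu_samples <- rnorm(1000, mean = mean(y),
                    sd = sd(y) / sqrt(length(y)))        # stand-in posterior draws for mu
log_lik <- sapply(y, function(yi)
  dnorm(yi, mean = mu_samples, sd = sd(y), log = TRUE))  # 1000 x 20 log-likelihood matrix
lppd    <- sum(log(colMeans(exp(log_lik))))              # log pointwise predictive density
penalty <- sum(apply(log_lik, 2, var))                   # sum of pointwise posterior variances
-2 * (lppd - penalty)                                    # WAIC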
7M2. Explain the difference between model selection and model comparison. What information is lost under model selection?
#Model selection means picking the single model with the lowest information criterion value and discarding the rest. Model comparison keeps the whole set of models and uses the differences in their IC values, together with causal reasoning about the variables, to understand how and why the models make different predictions. Under model selection we lose the information about relative model accuracy carried by those differences (and by the associated model weights), and we can also end up with a model that predicts well but does not answer the causal question of interest. The sketch below shows the kind of comparison table that selection would throw away.
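#An optional illustration (assumes the rethinking package is installed; the data and models are simulated for this example). compare() reports dWAIC, dSE, and model weights for the whole set of models; model selection would keep only the top row and discard that relative-accuracy information.
library(rethinking)
set.seed(7)
N <- 100
x <- rnorm(N)
y <- rnorm(N, mean = 0.5 * x, sd = 1)    # y depends on x
d <- list(x = x, y = y)
m_intercept <- quap(                     # model ignoring x
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a,
    a ~ dnorm(0, 1),
    sigma ~ dexp(1)
  ), data = d)
m_slope <- quap(                         # model including x
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + b * x,
    a ~ dnorm(0, 1),
    b ~ dnorm(0, 1),
    sigma ~ dexp(1)
  ), data = d)
compare(m_intercept, m_slope)            # full table: WAIC, dWAIC, weight, ...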
7M3. When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments, if you are not sure.
#Information criteria are built from the deviance (or lppd), which is a sum over observations, so its magnitude grows with the number of observations regardless of how well the model fits. If models were fit to different numbers of observations, the model fit to fewer observations would tend to have a smaller, apparently better, IC value simply because fewer terms are summed, so the comparison would say nothing about relative predictive accuracy. That is why all models must be fit to exactly the same observations. A quick experiment follows below.
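#A quick experiment (again assuming rethinking is installed; the data are simulated): fit the same intercept-only model to 50 observations and then to 500 observations from the same process. The WAIC values are not comparable -- the larger sample gives a much larger WAIC simply because more terms are summed.
library(rethinking)
set.seed(7)
y_full <- rnorm(500, mean = 1, sd = 2)
fit_gaussian <- function(y) {            # same model, different data
  quap(
    alist(
      y ~ dnorm(mu, sigma),
      mu ~ dnorm(0, 10),
      sigma ~ dexp(1)
    ), data = list(y = y))
}
WAIC(fit_gaussian(y_full[1:50]))         # 50 observations
WAIC(fit_gaussian(y_full))               # 500 observations: much larger WAIC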
7M4. What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated? Why? Perform some experiments, if you are not sure.
#As a prior becomes more concentrated, the effective number of parameters decreases. The penalty term in WAIC (the effective number of parameters) is the sum, over observations, of the variance of the log-likelihood across posterior samples; a more concentrated prior constrains the posterior, which shrinks those variances and therefore the penalty. This can be read off the WAIC formula:
# WAIC(y, Θ) = -2 * ( lppd - Σ_i var_Θ log p(y_i | Θ) )
#A small experiment is sketched below.
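#A small experiment (assumes rethinking is installed; data simulated): the same Gaussian model fit with a nearly flat prior on mu and then with a tightly concentrated one. The penalty column reported by WAIC() should come out smaller for the concentrated prior.
library(rethinking)
set.seed(7)
y <- rnorm(100, mean = 1, sd = 2)
m_weak <- quap(                          # weak, nearly flat prior on mu
  alist(
    y ~ dnorm(mu, sigma),
    mu ~ dnorm(0, 10),
    sigma ~ dexp(1)
  ), data = list(y = y))
m_tight <- quap(                         # strongly concentrated prior on mu
  alist(
    y ~ dnorm(mu, sigma),
    mu ~ dnorm(0, 0.1),
    sigma ~ dexp(1)
  ), data = list(y = y))
WAIC(m_weak)                             # inspect the penalty column
WAIC(m_tight)                            # penalty expected to be smaller (fewer effective parameters)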
7M5. Provide an informal explanation of why informative priors reduce overfitting.
#Informative (regularizing) priors are skeptical of extreme parameter values, so the model learns less from the sample: estimates are pulled toward the prior and respond less to the noisy, idiosyncratic features of this particular sample. Since overfitting is precisely the encoding of that sample-specific noise, constraining how much the model can learn from the sample reduces overfitting, usually at the cost of a slightly worse fit to the training data.
7M6. Provide an informal explanation of why overly informative priors result in underfitting.
#Underfitting from overly informative priors is the flip side of the same mechanism: a prior that is too concentrated is so skeptical that the model cannot learn even the regular, repeatable features of the sample. The estimates stay pinned near the prior no matter what the data say, so the model misses structure that would generalize to new data, which is underfitting.