Assignment 3: Statistical Estimation and Linear Regression

● Instructions: Submit the R Notebook with your name as the author. Refer to Sample Notebook.

Part I - Statistical Estimation

Q1: Concepts Definitions (20 points)

What is the difference between statistical estimation and statistical testing? Explain and give an example of the application of both concepts. (5 points)

Answer:

Statistical testing uses sample data to test a hypothesis about a population.
Statistical estimation uses sample data to estimate a population parameter.
For example, consider a distribution with mean 6. Statistical testing checks how extreme a given observation is under that distribution. Statistical estimation works the other way around: given the observations, it finds the most plausible value of a population parameter, such as the mean, for the distribution that produced them.
Applications of statistical testing include testing claims such as "men have a higher chance of getting heart disease than women" or "children who exercise regularly are less prone to falling sick".
Applications of statistical estimation include fitting the parameters of machine learning models.
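
As a minimal sketch of the distinction, the following uses simulated data (the sample size, standard deviation, seed, and the hypothesised mean of 5 are assumptions, not part of the assignment). A single call to t.test returns both a p-value for a stated hypothesis (testing) and a point estimate with a confidence interval (estimation).

```r
# Simulated sample; a mean of 6 as in the example, everything else is assumed
set.seed(1)
x <- rnorm(50, mean = 6, sd = 2)

# Testing: is the population mean equal to 5?
res <- t.test(x, mu = 5)
p_value <- res$p.value

# Estimation: point estimate and 95% confidence interval for the mean
point_estimate <- unname(res$estimate)
conf_interval  <- as.numeric(res$conf.int)

print(p_value)
print(point_estimate)
print(conf_interval)
```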

In what way does Likelihood differ from Probability? (5 points)

Answer:

Probability is the chance of observing a value, or a range of values, from a distribution whose parameters are given. In short, probability is the area under the curve of the distribution.
Probability can be written as P(data | distribution).
Likelihood measures how plausible a candidate parameter value (for example, an estimated population mean) is, given the observed data. In short, likelihood is the y-axis value of the density curve at the observed point.
Likelihood can be written as L(distribution | data).
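
The contrast can be sketched in R with a normal distribution. The numbers below (mean 6, standard deviation 2, observation 7, and the candidate means) are illustrative assumptions. pnorm gives an area under the curve (a probability), while dnorm gives the height of the curve at the observed point for each candidate mean (a likelihood).

```r
# Probability: parameters fixed (mean 6, sd 2), question is about the data.
# P(5 <= X <= 7) is an area under the density curve.
prob <- pnorm(7, mean = 6, sd = 2) - pnorm(5, mean = 6, sd = 2)

# Likelihood: data fixed (x = 7), question is about candidate means.
# Each value is the height of the density curve at x = 7.
candidate_means <- c(4, 6, 7, 9)
lik <- dnorm(7, mean = candidate_means, sd = 2)

print(prob)
print(data.frame(candidate_means, lik))
```

Among these candidates, the likelihood is highest for the mean that coincides with the observation.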

What is the Maximum Likelihood Estimation? (5 points)

Answer:

Maximum Likelihood Estimation is the approach used to calculate a population parameter (such as the mean) so that the given observations have the maximum possible likelihood under the distribution.
Elaborating further, for a given observation, statistical estimation finds the population parameter that most likely generated the distribution containing that observation. Computing this for every observation gives one likelihood per observation. If the observations are independent, the total likelihood for an estimated population parameter (mean) is the product of the individual likelihoods. This product is the likelihood function, denoted 'L'.
According to Maximum Likelihood Estimation, we select the population parameter that maximizes the value of this likelihood function. In other words, the chosen parameter (mean) is the one under which the observed data are most likely.
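
Written out for independent observations x_1, ..., x_n, the description above corresponds to the following. The density f is assumed here to be normal with parameters mu and sigma, as in Q2; the symbols are notation introduced for illustration.

```latex
L(\mu, \sigma) = \prod_{i=1}^{n} f(x_i \mid \mu, \sigma),
\qquad
(\hat{\mu}, \hat{\sigma}) = \arg\max_{\mu,\, \sigma} \; L(\mu, \sigma)
```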

Why is it more common to use the log-likelihood function than the regular likelihood function? (5 points)

Answer:

To calculate the likelihood function, we multiply the likelihoods (density values) of the individual observations under the estimated population parameter, giving the joint likelihood of all the points.
Because these values are typically small (often between 0 and 1), their product is an even smaller number, which is difficult to interpret and can cause computational problems such as numerical underflow. To avoid this, we take its logarithm, which turns the product into a sum; the result is the log-likelihood function. Because the logarithm is monotonically increasing, the parameter value that maximizes the log-likelihood also maximizes the likelihood.
Thus, the log-likelihood function is more commonly used than the regular likelihood function.
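
The computational problem can be seen in a short sketch. The data here are simulated (1000 values from a normal distribution with mean 70 and standard deviation 13, chosen as an assumption to resemble ages). The raw product of densities underflows to zero in double precision, while the sum of log densities remains a usable finite number.

```r
set.seed(42)
x <- rnorm(1000, mean = 70, sd = 13)  # simulated values, an assumption

# Product of 1000 small densities: underflows to 0 in double precision
lik <- prod(dnorm(x, mean = 70, sd = 13))

# Sum of log densities: finite and usable for optimisation
loglik <- sum(dnorm(x, mean = 70, sd = 13, log = TRUE))

print(lik)
print(loglik)
```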

Q2: Maximum Likelihood Estimation (30 points)

For this question, you will use the stroke data from the ISwR package. Consider that the hospital ABC provided this report which contains a sample of the patients that suffered from a stroke. The hospital wants to estimate the true mean and standard deviation of the age of the whole population of patients that had a stroke in that hospital. For this question, assume that the variable “age” follows a normal distribution.

Assuming that the population standard deviation is 13, plot the log-likelihood function to show the estimation of the true mean of the population age. (10 points)

library(ISwR)
#View(stroke)


# define the log-likelihood function for the mean (sd fixed at 13)
LLmean2a = function(M)
{
  LLsum2a = sum(dnorm(stroke$age,mean = M,sd = 13,log = TRUE))
  return(LLsum2a)
}


# now define M and use sapply
Mvalues = seq(0,90,by = 0.1)

# sapply applies a function to each element of a vector
LLres2a = sapply(Mvalues,LLmean2a)
plot(Mvalues,
     LLres2a,
     type="l",
     col="blue",
     xlab = "M Values",
     ylab="Log Likelihood",
     main="Log-Likelihood")
# index of the maximum log-likelihood
y <- which.max(LLres2a)
# M value at which the log-likelihood is maximal
theMax <- Mvalues[y]
# add to plot
abline(v=theMax, col="red", lty=2)

Now assume the mean and the standard deviation of the population age are unknown. Create the log-likelihood function to estimate the true values of the mean and standard deviation of the stroke patients’ age. (10 points)

# import package bbmle
library(stats4)
library(bbmle)

# Negative log-likelihood function for mean and sd
# (mle2 minimizes, so the log-likelihood is multiplied by -1)
LLmeansd2b = function(M,sigma)
{
  LLsum2b = sum(dnorm(stroke$age,mean = M,sd = sigma,log = TRUE))
  return(-1*LLsum2b)
}

Using the Maximum Likelihood Estimation concept, what is the true mean and standard deviation of the age of the stroke population in this hospital? (10 points)

# using function LLmeansd2b defined in previous chunk
mle2c = mle2(minuslogl = LLmeansd2b,
           start = list(M=30,sigma=1),
           lower=c(sigma=0),
           method = "L-BFGS-B")
summary(mle2c)

## Maximum likelihood estimation
## 
## Call:
## mle2(minuslogl = LLmeansd2b, start = list(M = 30, sigma = 1), 
##     method = "L-BFGS-B", lower = c(sigma = 0))
## 
## Coefficients:
##       Estimate Std. Error z value     Pr(z)    
## M     69.88541    0.47948 145.752 < 2.2e-16 ***
## sigma 13.80536    0.33904  40.719 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## -2 log L: 6704.948
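
For a normal sample, the maximum likelihood estimates have a closed form: the sample mean, and the standard deviation computed with a divisor of n rather than n - 1. The sketch below checks a numerical optimiser against that closed form. It uses simulated ages as a stand-in for stroke$age and base R's optim instead of mle2; both choices are assumptions made so the example runs without extra packages.

```r
set.seed(123)
ages <- rnorm(800, mean = 70, sd = 14)  # stand-in for stroke$age (assumption)

# Closed-form maximum likelihood estimates for a normal sample
mu_hat    <- mean(ages)
sigma_hat <- sqrt(mean((ages - mu_hat)^2))  # divides by n, not n - 1

# Numerical check: minimise the negative log-likelihood with optim
negLL <- function(par) {
  -sum(dnorm(ages, mean = par[1], sd = par[2], log = TRUE))
}
opt <- optim(c(60, 10), negLL, method = "L-BFGS-B", lower = c(-Inf, 1e-6))

print(c(mu_hat, sigma_hat))
print(opt$par)
```

Applied to the real data, the same reasoning suggests that mean(stroke$age) should agree with the estimate of M reported above up to optimiser tolerance, and that the estimate of sigma should be slightly smaller than sd(stroke$age), which divides by n - 1.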

Part II - Linear Regression

Q3: Concepts Definitions (20 points)

What is the regression analysis used for?

Answer:

Regression analysis is used to model the relationship between one or more predictor variables and a target variable. In predictive analytics, this relationship is used to predict the target variable.
Regression analysis can also be used in managing store and supply chain operations by relating the demand, stock, shelf life and supply of products.
It can also be used for forecasting.

What does the R-squared value represent on our model results?

Answer:

R-squared is the coefficient of determination. It indicates the proportion of the variation in the target variable that is explained by the predictor variable(s).
The R-squared value gives an idea of how well the model fits.
R^2 = 1 - (SSE/SST), where SSE is the sum of squared errors of the fitted model and SST is the sum of squared errors of the baseline model (which always predicts the mean of the target variable).
For a linear regression with an intercept, R-squared ranges from 0 to 1. The higher the R-squared, the better the model fits.
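
The formula can be checked by hand against the value reported by summary. This is a small sketch on the built-in mtcars data with the mpg ~ hp model used in Q4; the variable names are chosen for illustration.

```r
fit <- lm(mpg ~ hp, data = mtcars)

# SSE: squared error of the fitted model
sse <- sum(residuals(fit)^2)
# SST: squared error of the baseline model that always predicts the mean
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

r2_manual <- 1 - sse / sst
r2_lm     <- summary(fit)$r.squared

print(c(manual = r2_manual, from_summary = r2_lm))
```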

What is the main difference between multiple regression and linear regression?

Answer:

In a simple linear regression, there is only one independent (predictor) variable and one dependent (target) variable. The equation of simple linear regression is: Yi = B0 + B1X1i + Ei
In a multiple regression, the target variable depends on several independent variables. The equation of multiple regression is: Yi = B0 + B1X1i + B2X2i + … + BkXki + Ei
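
In R the two cases differ only in the model formula. The sketch below uses the built-in mtcars data; combining hp and wt in one model is an illustrative assumption and is not required by the assignment.

```r
# Simple linear regression: one predictor (B0 and B1)
simple_fit <- lm(mpg ~ hp, data = mtcars)

# Multiple regression: two predictors (B0, B1 and B2)
multiple_fit <- lm(mpg ~ hp + wt, data = mtcars)

print(coef(simple_fit))
print(coef(multiple_fit))
```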

Why is the error term in the regression equation important? What does it capture and what assumptions can we have about it?

Answer:

The error term captures the variation in the target variable that the regression line cannot explain. Residuals are the vertical distances between the actual data points and the regression line; they are the observed estimates of these errors.
Residual = Observed value - Predicted value
For a least-squares fit with an intercept, the sum (and therefore the mean) of the residuals is always zero.
The error term in the regression equation is assumed to follow a normal distribution with mean zero and constant variance, and the errors are assumed to be independent of one another.
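
These properties can be inspected on a fitted model. The sketch below uses the built-in mtcars data and the mpg ~ wt model from Q4; using shapiro.test as a normality check is one possible choice, not something the assignment prescribes.

```r
fit <- lm(mpg ~ wt, data = mtcars)

# Residual = Observed value - Predicted value
res    <- residuals(fit)
manual <- mtcars$mpg - fitted(fit)

# For least squares with an intercept, the residuals sum to (numerically) zero
sum_res <- sum(res)

# One possible check of the normality assumption
shapiro <- shapiro.test(res)

print(sum_res)
print(shapiro$p.value)
```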

Q4: Linear Regression (30 points)

For this question, you will use the mtcars dataset in R.

Build a linear regression model having the “mpg” as the target variable and the “hp” as the predictor. (5 points)

# linear regression for mpg~hp
fit4a <- lm(mpg~hp, data=mtcars)
summary(fit4a)

## 
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
## hp          -0.06823    0.01012  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

Build a second linear regression to predict the “mpg” but using the “wt” as a predictor variable. (5 points)

# linear regression for mpg~wt
fit4b <- lm(mpg~wt, data=mtcars)
summary(fit4b)

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

Identify the regression equation and the R-square of both models. (10 points)

Answer:

Regression equation of model 1: mpg = 30.09886 + (-0.06823)*hp
R-squared: 0.6024
Adjusted R-squared: 0.5892
Regression equation of model 2: mpg = 37.2851 + (-5.3445)*wt
R-squared: 0.7528
Adjusted R-squared: 0.7446
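
As a sanity check on the equations, the coefficients can be pulled out with coef and applied by hand, then compared with predict. The horsepower value of 110 is an arbitrary illustrative input.

```r
fit_hp <- lm(mpg ~ hp, data = mtcars)
b <- coef(fit_hp)

# Apply the regression equation by hand: mpg = B0 + B1 * hp
manual_pred <- unname(b["(Intercept)"] + b["hp"] * 110)

# Same prediction through predict()
predict_pred <- unname(predict(fit_hp, newdata = data.frame(hp = 110)))

print(c(manual = manual_pred, predict = predict_pred))
```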

Compare both models based on the results you obtained from the models. (5 points)

# model comparison
# model 1
mpg_analysis4a <- cbind(mtcars$mpg, fit4a$fitted.values)
mpg_analysis4a <- as.data.frame(mpg_analysis4a)
names(mpg_analysis4a) <- c("Actual","Predicted")
# View(mpg_analysis4a)

# look at error metrics
# create error column in an intuitive direction
mpg_analysis4a$Error <- mpg_analysis4a$Predicted - mpg_analysis4a$Actual

# squared error
mpg_analysis4a$SqError <- mpg_analysis4a$Error^2
# absolute error
mpg_analysis4a$AbsError <- abs(mpg_analysis4a$Error)

# Mean absolute error
mae4a <- mean(mpg_analysis4a$AbsError)
mae_sd4a <- sd(mpg_analysis4a$AbsError)

# Mean Squared Error
mse4a <- mean(mpg_analysis4a$SqError)
mse_sd4a <- sd(mpg_analysis4a$SqError)

# if there are no zeros in the target variable, percentage errors are useful
mpg_analysis4a$AbsPercError <- mpg_analysis4a$AbsError/mpg_analysis4a$Actual
# Mean Absolute Percentage Error
mape4a <- mean(mpg_analysis4a$AbsPercError)
mape_sd4a <- sd(mpg_analysis4a$AbsPercError)

# model 2
mpg_analysis4b <- cbind(mtcars$mpg, fit4b$fitted.values)
mpg_analysis4b <- as.data.frame(mpg_analysis4b)
names(mpg_analysis4b) <- c("Actual","Predicted")
# View(mpg_analysis4b)

# look at error metrics
# create error column in an intuitive direction
mpg_analysis4b$Error <- mpg_analysis4b$Predicted - mpg_analysis4b$Actual

# squared error
mpg_analysis4b$SqError <- mpg_analysis4b$Error^2
# absolute error
mpg_analysis4b$AbsError <- abs(mpg_analysis4b$Error)

# Mean absolute error
mae4b <- mean(mpg_analysis4b$AbsError)
mae_sd4b <- sd(mpg_analysis4b$AbsError)

# Mean Squared Error
mse4b <- mean(mpg_analysis4b$SqError)
mse_sd4b <- sd(mpg_analysis4b$SqError)

# if there are no zeros in the target variable, percentage errors are useful
mpg_analysis4b$AbsPercError <- mpg_analysis4b$AbsError/mpg_analysis4b$Actual
# Mean Absolute Percentage Error
mape4b <- mean(mpg_analysis4b$AbsPercError)
mape_sd4b <- sd(mpg_analysis4b$AbsPercError)

# AIC = Akaike Information Criterion
aic4a <- AIC(fit4a)
aic4b <- AIC(fit4b)

cat("R-squared for model 1 is 0.6024",
    "\nR-squared for model 2 is 0.7528",
    "\nMean Absolute Error for model 1 = ",mae4a,
    "\nMean Absolute Error for model 2 = ",mae4b,
    "\nMean Squared Error for model 1 = ",mse4a,
    "\nMean Squared Error for model 2 = ",mse4b,
    "\nMean Absolute Percentage Error for model 1 = ",mape4a,
    "\nMean Absolute Percentage Error for model 2 = ",mape4b,
    "\nAIC for model 1 = ",aic4a,
    "\nAIC for model 2 = ",aic4b)

## R-squared for model 1 is 0.6024 
## R-squared for model 2 is 0.7528 
## Mean Absolute Error for model 1 =  2.907452 
## Mean Absolute Error for model 2 =  2.340642 
## Mean Squared Error for model 1 =  13.98982 
## Mean Squared Error for model 2 =  8.697561 
## Mean Absolute Percentage Error for model 1 =  0.1566944 
## Mean Absolute Percentage Error for model 2 =  0.1260733 
## AIC for model 1 =  181.2386 
## AIC for model 2 =  166.0294

cat("After comparing the R-squared and the other error metrics, it can be concluded that:\n1) Model 2 has a higher R-squared value than model 1\n2) Model 2 has a lower mean absolute error, mean squared error and mean absolute percentage error\n3) Model 2 has a lower AIC than model 1\n\nSince a higher R-squared and a lower AIC both indicate a better model, model 2 is a better model than model 1.")

## After comparing the R-squared and the other error metrics, it can be concluded that:
## 1) Model 2 has a higher R-squared value than model 1
## 2) Model 2 has a lower mean absolute error, mean squared error and mean absolute percentage error
## 3) Model 2 has a lower AIC than model 1
## 
## Since a higher R-squared and a lower AIC both indicate a better model, model 2 is a better model than model 1.
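
The per-model metric code above is duplicated for the two models. One way to avoid the repetition is a small helper applied to both fits. The function name error_metrics and the list layout are hypothetical choices for this sketch, and the error direction (Predicted - Actual) follows the chunks above.

```r
# Hypothetical helper: error metrics for one fitted model
error_metrics <- function(fit, actual) {
  err <- fitted(fit) - actual            # Predicted - Actual
  c(MAE  = mean(abs(err)),               # mean absolute error
    MSE  = mean(err^2),                  # mean squared error
    MAPE = mean(abs(err) / actual),      # needs a non-zero target
    AIC  = AIC(fit))
}

models <- list(hp = lm(mpg ~ hp, data = mtcars),
               wt = lm(mpg ~ wt, data = mtcars))

# One column per model, one row per metric
comparison <- sapply(models, error_metrics, actual = mtcars$mpg)
print(round(comparison, 4))
```

Adding a third model to the comparison then only requires adding one entry to the list.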

Make two plots that show the spread of the variables in model 1 and 2 and add the regression line to the plots. (5 points)

# plot the variable spread for model 1
plot(y=mtcars$mpg,
     x=mtcars$hp,
     xlab="hp",
     ylab="mpg",
     main="hp vs. mpg") 
abline(fit4a,
       col="red")

# plot the variable spread for model 2
plot(y=mtcars$mpg,
     x=mtcars$wt,
     xlab="wt",
     ylab="mpg",
     main="wt vs. mpg") 
abline(fit4b,
       col="red")

Mrunmayi Shiveshwarkar