Problem 1

Probability Density 1: X~Gamma. Using R, generate a random variable X that has 10,000 random Gamma pdf values. A Gamma pdf is completely described by n (a size parameter) and lambda (a shape parameter). Choose any n greater than 3 and an expected value (lambda) between 2 and 10 (you choose).
set.seed(123)
n = 4
lambda = 5
X = rgamma(10000,shape = n, scale = 10/lambda)
hist(X, breaks = 50, col = "blue", xlab = "Value of X", main = "Gamma-distributed random variable X")
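As a visual check (a sketch, not part of the original output), the theoretical Gamma density with the same shape and scale passed to rgamma() can be overlaid on the histogram:

# re-plot on the density scale so the theoretical curve is comparable
hist(X, breaks = 50, freq = FALSE, col = "blue", xlab = "Value of X", main = "Gamma-distributed random variable X")
curve(dgamma(x, shape = n, scale = 10/lambda), add = TRUE, lwd = 2, col = "red")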

Probability Density 2: Y~Sum of Exponentials. Then generate 10,000 observations from the sum of n exponential pdfs with rate/shape parameter (lambda). The n and lambda must be the same as in the previous case.
set.seed(123)
Y = rexp(10000,rate = lambda/10)+rexp(10000,rate = lambda/10)+rexp(10000,rate = lambda/10)+rexp(10000,rate = lambda/10)
hist(Y, breaks = 50, col = "blue", xlab = "Value of Y", main = "Sum of Exponential-distributed random variable Y")
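Because the sum of n independent exponentials with a common rate follows a Gamma(n, rate) distribution, Y should have the same distribution as X generated above. A quick visual check (a sketch, not part of the original output):

# Q-Q plot of the Gamma draws against the sum-of-exponential draws;
# points falling near the identity line indicate the two distributions match
qqplot(X, Y, main = "Q-Q plot: X (Gamma) vs. Y (sum of exponentials)")
abline(0, 1, col = "red")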

Probability Density 3: Z~ Exponential. Then generate 10,000 observations from a single exponential pdf with rate/shape parameter (lambda).
set.seed(123)
Z = rexp(10000,rate = lambda/10)
hist(Z, breaks = 50, col = "blue", xlab = "Value of Z", main = "Single Exponential-distributed random variable Z")

##### 1a. Calculate the empirical expected value (means) and variances of all three pdfs.

X_mean = mean(X)
X_var = var(X)
print(paste0('The empirical mean of X is:',X_mean))
## [1] "The empirical mean of X is:7.92062845373995"
print(paste0('The empirical variance of X is:',X_var))
## [1] "The empirical variance of X is:15.4914314793823"
Y_mean = mean(Y)
Y_var = var(Y)
print(paste0('The empirical mean of Y is:',Y_mean))
## [1] "The empirical mean of Y is:8.01905804579485"
print(paste0('The empirical variance of Y is:',Y_var))
## [1] "The empirical variance of Y is:15.7656963882112"
Z_mean = mean(Z)
Z_var = var(Z)
print(paste0('The empirical mean of Z is:',Z_mean))
## [1] "The empirical mean of Z is:2.0075625394718"
print(paste0('The empirical variance of Z is:',Z_var))
## [1] "The empirical variance of Z is:3.99905760034308"
1b. Using calculus, calculate the expected value and variance of the Gamma pdf (X). Using the moment generating function for exponentials, calculate the expected value of the single exponential (Z) and the sum of exponentials (Y)

Derivation of Expected value of the Gamma pdf (X)

\[X\sim Gamma(\alpha,\lambda)\] The expected value is the probability-weighted average over all possible values: \[E(X)=\int xf_x(x)dx\]

using pdf of the gamma distribution: \[E(x)=\int_0^\infty x\frac{\lambda^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\lambda x}dx\] \[=\int_0^\infty \frac{1}{\lambda}\frac{\lambda^{\alpha+1}}{\Gamma(\alpha)}x^{(\alpha+1)-1}e^{-\lambda x}dx\]

using the relation: \[\Gamma(x+1)=\Gamma(x)x\] \[E(x)=\int_0^\infty \frac{\alpha}{\lambda}\frac{\lambda^{\alpha+1}}{\Gamma(\alpha+1)}x^{(\alpha+1)-1}e^{-\lambda x}dx\]

using density of gamma distribution: \[E(x) = \frac{\alpha}{\lambda}\int_0^\infty Gamma(x;\alpha+1,\lambda)dx\] \[E(x) = \frac{\alpha}{\lambda}\]

For the Gamma distribution X the expected value is \[E(X) = \frac{\alpha}{\lambda}\] where lambda is the rate. Since the simulation above uses rate lambda/10 = 0.5 (scale = 10/lambda), the theoretical mean is 4/0.5 = 8, which agrees with the empirical mean of about 7.92.

Derivation of the variance of the Gamma pdf (X)

The variance is the expectation of the square minus the square of the expectation: \[Var(X)=E(X^2)-(E(X))^2=\int_0^\infty x^2 f_X(x)dx-(E(X))^2\] \[E(X^2)=\int_0^\infty x^2 \frac{\lambda^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\lambda x}dx=\frac{\lambda^\alpha}{\Gamma(\alpha)}\int_0^\infty x^{\alpha+1}e^{-\lambda x}dx=\frac{\Gamma(\alpha+2)}{\lambda^2 \Gamma(\alpha)}\] \[Var(X)=\frac{\Gamma(\alpha+2)}{\lambda^2\Gamma(\alpha)}-\left(\frac{\alpha}{\lambda}\right)^2=\frac{\Gamma(\alpha+2)-\alpha^2\Gamma(\alpha)}{\lambda^2\Gamma(\alpha)}=\frac{\alpha(\alpha+1)\Gamma(\alpha)-\alpha^2\Gamma(\alpha)}{\lambda^2\Gamma(\alpha)}=\frac{\alpha\Gamma(\alpha)(\alpha+1-\alpha)}{\lambda^2\Gamma(\alpha)}=\frac{\alpha}{\lambda^2}\] With the rate lambda/10 = 0.5 used in the simulation, Var(X) = 4/0.25 = 16, in line with the empirical variance of about 15.5.

Expected value of the single exponential (Z)

Moment generating function (for t < lambda): \[M(t)=\int_0^{+\infty}e^{tx}\lambda e^{-\lambda x}dx=\int_0^{+\infty}\lambda e^{x(t-\lambda)}dx=\frac{\lambda e^{x(t-\lambda)}}{t-\lambda}\Big|_0^{+\infty}=-\frac{\lambda}{t-\lambda}=\frac{\lambda}{\lambda-t}\]

The first moment, i.e. the mean, is the derivative of the MGF evaluated at t = 0 (using the quotient rule): \[M'(0)=E(Z)=\frac{d}{dt}\frac{\lambda}{\lambda-t}\Big|_{t=0}=\frac{\lambda}{(\lambda -t)^2}\Big|_{t=0}=\frac{1}{\lambda}\] Since Z was simulated with rate lambda/10 = 0.5, the theoretical mean is 1/0.5 = 2, matching the empirical mean of about 2.01.

Expected value of the sum of exponential (Y)

Since Y is the sum of four independent exponentials with the same rate, its moment generating function is the product of the four individual MGFs: \[M_Y(t)=\left(\frac{\lambda}{\lambda-t}\right)^4,\qquad t<\lambda\]

The first moment, i.e. the mean, is the derivative of the MGF evaluated at t = 0: \[M_Y'(0)=E(Y)=\frac{d}{dt}\left(\frac{\lambda}{\lambda-t}\right)^4\Big|_{t=0}=\frac{4\lambda^4}{(\lambda-t)^5}\Big|_{t=0}=\frac{4}{\lambda}\] With the simulated rate lambda/10 = 0.5 this gives E(Y) = 4/0.5 = 8, matching the empirical mean of about 8.02.
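Collecting these results, a quick numeric comparison against the empirical values from 1a (a sketch; it simply plugs the rate lambda/10 that was actually passed to rgamma() and rexp() into the formulas derived above):

rate <- lambda/10                      # the rate actually used in the simulations
c(E_X = n/rate, Var_X = n/rate^2)      # theoretical mean and variance of the Gamma
c(E_Y = n/rate, Var_Y = n/rate^2)      # theoretical mean and variance of the sum of n exponentials
c(E_Z = 1/rate, Var_Z = 1/rate^2)      # theoretical mean and variance of a single exponential
# compare with X_mean, X_var, Y_mean, Y_var, Z_mean, Z_var computed in 1a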

1c-e. Probability. For pdf Z (the exponential), calculate empirically probabilities a through c. Then evaluate through calculus whether the memoryless property holds.
  a. P(Z > lambda | Z > lambda/2)
# Assuming Z is already greater than lambda/2, what is probability Z > lambda (assuming memoryless property)

length(Z[Z>lambda/2])/length(Z)
## [1] 0.2884
1-pexp(lambda/2,lambda/10)
## [1] 0.2865048

Calculus proof that the memoryless property holds:

Using the CDF of exponential distribution: \[P(Z \le \lambda|Z>\frac{\lambda}{2}) \] \[=\frac{P(Z\le\lambda\cap Z>\frac{\lambda}{2})}{P(Z>\frac{\lambda}{2})}=\frac{P(\frac{\lambda}{2}<Z\le\lambda)}{P(Z>\frac{\lambda}{2})}\] \[=\frac{e^{-\frac{\lambda}{2}*\lambda}-e^{-\lambda*\lambda}}{e^{-\frac{\lambda}{2}*\lambda}}=1-e^{-\frac{\lambda}{2}*\lambda}\]

\[P(Z>\lambda | Z>\frac{\lambda}{2}) = 1-P(Z \le \lambda|Z>\frac{\lambda}{2})=e^{-\frac{\lambda}{2}*\lambda}\]

By the memoryless property this equals the unconditional probability P(Z > lambda/2): having already exceeded lambda/2 does not change the distribution of the remaining waiting time, so the conditional probability can be computed simply as P(Z > lambda/2), which is what the empirical calculation above does.
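As a cross-check (a sketch, not in the original output), the conditional probability can also be estimated directly from the simulated draws, without invoking memorylessness, and compared with the unconditional tail probability:

# direct empirical estimate of P(Z > lambda | Z > lambda/2)
sum(Z > lambda) / sum(Z > lambda/2)
# unconditional P(Z > lambda/2), which it should match under the memoryless property
mean(Z > lambda/2)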

  b. P(Z > 2lambda | Z > lambda)
length(Z[Z>2*lambda])/length(Z)
## [1] 0.006
1-pexp(2*lambda,rate = lambda/10)
## [1] 0.006737947

Calculus proof that the memoryless property holds:

Using the CDF of exponential distribution: \[P(Z \le 2*\lambda|Z>\lambda) \] \[=\frac{P(Z\le2*\lambda\cap Z>\lambda)}{P(Z>\lambda)}=\frac{P(\lambda<Z\le2*\lambda)}{P(Z>\lambda)}\] \[=\frac{e^{-\lambda*\lambda}-e^{-2*\lambda*\lambda}}{e^{-\lambda*\lambda}}=1-e^{-\lambda*\lambda}\]

\[P(Z>2*\lambda | Z>\lambda) = 1-P(Z \le 2*\lambda|Z>\lambda)=e^{-\lambda*\lambda}\]

By the memoryless property this equals the unconditional probability P(Z > lambda): having already exceeded lambda does not change the distribution of the remaining waiting time, so the same answer is obtained by simply evaluating P(Z > lambda).

  c. P(Z > 3lambda | Z > lambda)
length(Z[Z>3*lambda])/length(Z)
## [1] 5e-04
1-pexp(3*lambda,rate = lambda/10)
## [1] 0.0005530844

Calculus proof that the memoryless property holds:

Using the CDF of exponential distribution: \[P(Z \le 3*\lambda|Z>\lambda) \] \[=\frac{P(Z\le3*\lambda\cap Z>\lambda)}{P(Z>\lambda)}=\frac{P(\lambda<Z\le3*\lambda)}{P(Z>\lambda)}\] \[=\frac{e^{-\lambda*\lambda}-e^{-3*\lambda*\lambda}}{e^{-\lambda*\lambda}}=1-e^{-2\lambda*\lambda}\]

\[P(Z>3*\lambda | Z>\lambda) = 1-P(Z \le 3*\lambda|Z>\lambda)=e^{-2\lambda*\lambda}\]

By the memoryless property this equals the unconditional probability P(Z > 2 lambda): the conditioning event does not change the distribution of the remaining waiting time, so the same answer is obtained by simply evaluating P(Z > 2 lambda).
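The same direct check applies to parts b and c: by memorylessness, P(Z > 2 lambda | Z > lambda) should match the unconditional P(Z > lambda), and P(Z > 3 lambda | Z > lambda) should match P(Z > 2 lambda). A sketch reusing Z and lambda from above:

# part b: direct conditional estimate versus the unconditional tail it should equal
sum(Z > 2*lambda) / sum(Z > lambda)
mean(Z > lambda)
# part c: direct conditional estimate versus the unconditional tail it should equal
sum(Z > 3*lambda) / sum(Z > lambda)
mean(Z > 2*lambda)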

Loosely investigate whether P(YZ) = P(Y) P(Z) by building a table with quartiles and evaluating the marginal and joint probabilities.
quartiles_Y <- quantile(Y, probs = c(0, 0.25, 0.5, 0.75, 1))
quartiles_Z <- quantile(Z, probs = c(0, 0.25, 0.5, 0.75, 1))

#each observation is counted into its (Y quartile, Z quartile) bin
contingency_table <- table(cut(Y, breaks = quartiles_Y), cut(Z, breaks = quartiles_Z))

marginal_prob_Y <- margin.table(contingency_table, 1) / sum(contingency_table)
marginal_prob_Z <- margin.table(contingency_table, 2) / sum(contingency_table)

joint_prob_YZ <- contingency_table / sum(contingency_table)
joint_prob_YZ
##               
##                (0.000676,0.579] (0.579,1.4] (1.4,2.77] (2.77,18.2]
##   (0.372,5.11]       0.10692138  0.08571714 0.04910982  0.00810162
##   (5.11,7.35]        0.06461292  0.07021404 0.07161432  0.04360872
##   (7.35,10.2]        0.04590918  0.05621124 0.07121424  0.07671534
##   (10.2,31.5]        0.03240648  0.03790758 0.05811162  0.12162432
#The probability of P(Y = Q2, Z = Q2)
joint_prob_YZ[2,2]
## [1] 0.07021404
#The probability of P(Y = Q2)*P(Z = Q2)
sum(joint_prob_YZ[1:4,2])*sum(joint_prob_YZ[2,1:4])
## [1] 0.06252501
#The probability of P(Y = Q3, Z = Q3)
joint_prob_YZ[3,3]
## [1] 0.07121424
#The probability of P(Y = Q3)*P(Z = Q3)
sum(joint_prob_YZ[1:4,3])*sum(joint_prob_YZ[3,1:4])
## [1] 0.06252501
#The probability of P(Y = Q1, Z = Q4)
joint_prob_YZ[1,4]
## [1] 0.00810162
#The probability of P(Y = Q1)*P(Z = Q4)
sum(joint_prob_YZ[1:4,4])*sum(joint_prob_YZ[1,1:4])
## [1] 0.06247499
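Rather than checking individual cells one at a time, the whole joint table can be compared against the product of the marginals, i.e. the joint probabilities that independence would imply. A sketch using the objects defined above:

# joint probabilities implied by independence: outer product of the two marginal distributions
indep_prob_YZ <- outer(marginal_prob_Y, marginal_prob_Z)
# cell-by-cell differences between the observed joint probabilities and the independence benchmark
round(joint_prob_YZ - indep_prob_YZ, 3)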
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

Fisher’s exact test is recommended for small sample sizes (e.g., fewer than 20 or 30 observations), while the chi-square test is recommended for larger samples. Fisher’s test can handle small samples because it calculates the exact probabilities of all possible outcomes rather than relying on a large-sample approximation as the chi-square test does; that exact enumeration also makes it computationally expensive and therefore less practical for large contingency tables. For this reason, the contingency table built from the quartile ranges of Y and Z, with its large sample size of 10,000 points, is more appropriately tested with the chi-square test; Fisher’s test on a sample of 10,000 would be impractical.

Additionally, the chi-square test is appropriate here because, with a large sample, the sampling distribution of the test statistic is well approximated by the chi-square distribution. Both the Fisher and chi-square tests work with categorical data and marginal totals, which are typically integers; when the sample size is large, the frequencies can effectively be treated as continuous, so non-integer counts are less of a concern.

The chi-square test is performed below to test the null hypothesis that independence holds. An X-squared of 1884 is reported, which measures the discrepancy between the observed counts in the contingency table and the counts expected under the null hypothesis of independence. This large value provides strong evidence against the null hypothesis. The p-value, which is far below the 0.05 significance level, shows that a test statistic as large as 1884 would be very unlikely if Y and Z were independent. This indicates strong evidence against the null hypothesis and suggests a significant association between the quartile categories of Y and Z.

n<-10000
table <- joint_prob_YZ*n
chisq.test(table)
## 
##  Pearson's Chi-squared test
## 
## data:  table
## X-squared = 1884, df = 9, p-value < 2.2e-16
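The test above is applied to joint_prob_YZ*n, which is essentially a rescaled copy of the original counts; running it directly on the contingency table should lead to the same conclusion (a sketch, not in the original output):

chisq.test(contingency_table)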
n<-30
table <- joint_prob_YZ*n
fisher.test(table)
## Warning in fisher.test(table): 'x' has been rounded to integer: Mean relative
## difference: 0.1292992
## 
##  Fisher's Exact Test for Count Data
## 
## data:  table
## p-value = 0.6649
## alternative hypothesis: two.sided
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.
Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
#libraries
library(readr)
library(psych)
library(GGally)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(corrplot)
## corrplot 0.92 loaded

By reviewing the list of possible predictors, two quantitative variables are chosen (GrLivArea & GarageArea) and one qualitative (OverallQual). The descriptive statistics (i.e. mean, median, standard dev, etc.) are shown for the chosen variables.

  • Observations:
    • The sale price is right-skewed with a median of $163K and a range from $34.9K to $755K; there are fewer homes at the high end of the price range, forming a long right tail.
    • The above grade (ground) living area is measured in square feet; its distribution is unimodal but right-skewed (skew 1.36, kurtosis 4.86). The mean square footage is 1515 ft^2 with a range from 334 to 5642 ft^2.
    • The garage area is roughly normally distributed with low kurtosis and skew. The mean is 473 ft^2 with a range from no garage area to 1418 ft^2.
    • The overall quality is defined as shown below. The mean quality rating is 6, and the distribution is fairly symmetric.

OverallQual: Rates the overall material and finish of the house

   10   Very Excellent
   9    Excellent
   8    Very Good
   7    Good
   6    Above Average
   5    Average
   4    Below Average
   3    Fair
   2    Poor
   1    Very Poor
   
urlfile<-"https://raw.githubusercontent.com/catcho1632/DATA605/main/house-prices-advanced-regression-techniques/train.csv"
train_data<-read_csv(url(urlfile))
## Rows: 1460 Columns: 81
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (43): MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConf...
## dbl (38): Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCond, Ye...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
train_var <- train_data[c('SalePrice','GrLivArea','GarageArea','OverallQual')]
describe(train_var)
##             vars    n      mean       sd median   trimmed      mad   min    max
## SalePrice      1 1460 180921.20 79442.50 163000 170783.29 56338.80 34900 755000
## GrLivArea      2 1460   1515.46   525.48   1464   1467.67   483.33   334   5642
## GarageArea     3 1460    472.98   213.80    480    469.81   177.91     0   1418
## OverallQual    4 1460      6.10     1.38      6      6.08     1.48     1     10
##              range skew kurtosis      se
## SalePrice   720100 1.88     6.50 2079.11
## GrLivArea     5308 1.36     4.86   13.75
## GarageArea    1418 0.18     0.90    5.60
## OverallQual      9 0.22     0.09    0.04
par(mfrow = c(2, 2))
hist(train_data$SalePrice, main = "Histogram of Sale Price")
hist(train_data$GrLivArea, main = "Histogram of GrLivArea", breaks = 30)
hist(train_data$GarageArea, main = "Histogram of Garage Area")
hist(train_data$OverallQual, main = "Histogram of Overall Quality", breaks = 10)

The scatterplot matrix shows a positive correlation between each of the independent variables and sale price, with Pearson correlations of roughly 0.6 to 0.7. The positive trend implies that, in general, greater living area, garage area, or overall quality is associated with a higher sale price.

train_var %>%
  ggpairs()

  • Correlation Plot:
    • Three quantitative variables are chosen: GrLivArea, GarageArea, SalePrice.
    • The correlation plot shows that all pairwise correlations are positive.
    • The strongest positive correlation is between SalePrice and GrLivArea; the second highest is between SalePrice and GarageArea.
    • There is also a positive correlation between GarageArea and GrLivArea, with a Pearson correlation of approximately 0.47.
# create df of 3 quantitative variables
train_3 <- train_data[c('GrLivArea','GarageArea','SalePrice')]

# Generate a correlation matrix
cor_mat <- cor(train_3)

# Create a correlation plot
corrplot(cor_mat, method="color")

Null hypothesis: the Pearson correlation is 0. Alternative hypothesis: the Pearson correlation is not 0.

The hypothesis is tested for different combinations of the three quantitative variables.

cor.1 tests whether the correlation between GrLivArea and GarageArea is 0. The Pearson correlation is 0.469 with a p-value below 2.2e-16. Since the p-value is less than the 0.05 significance level, the null hypothesis can be rejected: a sample correlation of 0.469 would be extremely unlikely if the true correlation were 0. We accept the alternative hypothesis that the correlation is not 0, and we are 80% confident that the true correlation between GrLivArea and GarageArea lies between 0.442 and 0.495.

cor.2 tests whether the correlation between GrLivArea and SalePrice is 0. The Pearson correlation is 0.709, with a p-value below 2.2e-16, so the null hypothesis of zero correlation is rejected. We are 80% confident that the true correlation between GrLivArea and SalePrice lies between 0.69 and 0.72.

cor.3 tests whether the correlation between GarageArea and SalePrice is 0. The Pearson correlation is 0.623, with a p-value below 2.2e-16, so the null hypothesis is rejected. We are 80% confident that the true correlation between GarageArea and SalePrice lies between 0.602 and 0.644.

#defaults to pearson method
#two.sided: The alternative hypothesis is that the true correlation coefficient is not equal to zero (default).
cor.1 <- cor.test(train_3$GrLivArea, train_3$GarageArea, conf.level = 0.8)
cor.2 <- cor.test(train_3$GrLivArea, train_3$SalePrice, conf.level = 0.8)
cor.3 <- cor.test(train_3$GarageArea, train_3$SalePrice, conf.level = 0.8)
cor.1
## 
##  Pearson's product-moment correlation
## 
## data:  train_3$GrLivArea and train_3$GarageArea
## t = 20.276, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.4423993 0.4947713
## sample estimates:
##       cor 
## 0.4689975
cor.2
## 
##  Pearson's product-moment correlation
## 
## data:  train_3$GrLivArea and train_3$SalePrice
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6915087 0.7249450
## sample estimates:
##       cor 
## 0.7086245
cor.3
## 
##  Pearson's product-moment correlation
## 
## data:  train_3$GarageArea and train_3$SalePrice
## t = 30.446, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6024756 0.6435283
## sample estimates:
##       cor 
## 0.6234314

Would you be worried about familywise error? Why or why not?

I would be concerned about familywise error because multiple hypothesis tests are being run, and the three variables are also highly correlated with one another. The correlation is problematic because highly correlated variables explain overlapping variation in the dependent variable, which makes the tests partially redundant. Running several tests increases the chance of at least one false positive across the family of tests, and when the tests are not independent of one another, the effective amount of independent evidence is smaller than the number of tests suggests.

The significance level is an important factor that contributes to familywise error (FWE) because it represents the probability of a false positive (Type I error) in a single hypothesis test. When multiple hypothesis tests are conducted on the same dataset, the probability of making at least one false positive across all of the tests increases. For example, at a 0.05 significance level each individual test has a 5% chance of a Type I error; across several such tests, the chance of at least one false positive is no longer 5% but greater.

To address this issue there are methods that correct for familywise error, such as the Bonferroni correction. The Bonferroni correction divides the desired alpha level by the number of tests being conducted, so that the overall probability of making a false positive across all tests is controlled at the desired level.
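For example, with three independent tests at alpha = 0.05 the familywise error rate is about 1 - 0.95^3 ≈ 0.14. A sketch of a Bonferroni adjustment applied to the three correlation tests above, using p.adjust() from base R:

# familywise error rate for three independent tests at the 0.05 level
1 - (1 - 0.05)^3
# Bonferroni-adjusted p-values for the three correlation tests (cor.1, cor.2, cor.3)
p.adjust(c(cor.1$p.value, cor.2$p.value, cor.3$p.value), method = "bonferroni")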

Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

The precision matrix provides information about the partial correlations between pairs of variables in a multivariate normal distribution. Partial correlations are a type of correlation coefficient that measures the strength and direction of the linear relationship between two variables while controlling for the effects of one or more additional variables. In other words, partial correlations provide a measure of the correlation between two variables that is not explained by the other variables in a multivariate data set.

#correlation matrix
cor_mat
##            GrLivArea GarageArea SalePrice
## GrLivArea  1.0000000  0.4689975 0.7086245
## GarageArea 0.4689975  1.0000000 0.6234314
## SalePrice  0.7086245  0.6234314 1.0000000
#inverse of correlation matrix
precision<-solve(cor_mat)
precision
##              GrLivArea  GarageArea  SalePrice
## GrLivArea   2.01353305 -0.08964955 -1.3709485
## GarageArea -0.08964955  1.63976057 -0.9587504
## SalePrice  -1.37094845 -0.95875043  2.5692028
#multiplying correlation matrix by precision matrix
round(cor_mat %*% precision)
##            GrLivArea GarageArea SalePrice
## GrLivArea          1          0         0
## GarageArea         0          1         0
## SalePrice          0          0         1
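The assignment also asks for the product in the other order; multiplying the precision matrix by the correlation matrix should likewise return the identity (a sketch, not in the original output):

#multiplying precision matrix by correlation matrix
round(precision %*% cor_mat)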
library(matrixcalc)
LU <- lu.decomposition(cor_mat)

LU
## $L
##           [,1]      [,2] [,3]
## [1,] 1.0000000 0.0000000    0
## [2,] 0.4689975 1.0000000    0
## [3,] 0.7086245 0.3731704    1
## 
## $U
##      [,1]      [,2]      [,3]
## [1,]    1 0.4689975 0.7086245
## [2,]    0 0.7800414 0.2910883
## [3,]    0 0.0000000 0.3892258
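As a sanity check (a sketch, not in the original output), the LU factors should multiply back to the correlation matrix:

# L %*% U should reconstruct cor_mat, so the differences should be essentially zero
round(LU$L %*% LU$U - cor_mat, 10)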
Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of lambda for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, lambda)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
library(MASS)

hist(train_data$`1stFlrSF`, breaks = 50, col = "blue", xlab = "`1stFlrSF`", main = "Observed Data")

The exponential probability distribution is fitted to 1stFlrSF. The $estimate value returns the estimated rate parameter of the exponential distribution, denoted lambda (λ).

exp <- fitdistr(train_data$`1stFlrSF`,'exponential')
lambda <- exp$estimate
lambda
##         rate 
## 0.0008601213

The optimal value of lambda for this distribution is used to draw 1,000 random samples with rexp(). The 1,000 samples are plotted in a histogram below.

sample_dist <- rexp(1000,lambda)

par(mfrow=c(1,2))
hist(sample_dist, breaks = 50, col = "blue", xlab = "`1stFlrSF`", main = "exp distribution of 1K samples")

hist(train_data$`1stFlrSF`, breaks = 50, col = "blue", xlab = "`1stFlrSF`", main = "Observed Data")

Conclusion on fit: the observed data has a much less severe right skew than the exponential samples. The exponential draws are more spread out, extending to roughly 8,000 sq ft, and peak at 0, whereas the observed data peaks at about 1,000 sq ft.

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).

Using the rate parameter lambda, 5th and 95th percentile is calculated below.

qexp(c(0.05,0.95), lambda)
## [1]   59.63495 3482.91836
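qexp() inverts the exponential CDF, F(x) = 1 - e^(-lambda*x); solving F(x) = p by hand gives x = -log(1 - p)/lambda, so the same percentiles can be computed directly (a sketch):

# percentiles from the closed-form inverse CDF; should match qexp() above
-log(1 - c(0.05, 0.95)) / lambda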

Also generate a 95% confidence interval from the empirical data, assuming normality.

t.test(train_data$`1stFlrSF`, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  train_data$`1stFlrSF`
## t = 114.91, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  1142.780 1182.473
## sample estimates:
## mean of x 
##  1162.627
#assuming normality by transforming the data using log(). 
par(mfrow=c(1,2))
hist(train_data$`1stFlrSF`, breaks = 50, col = "blue", xlab = "`1stFlrSF`", main = "Observed Data")
hist(log(train_data$`1stFlrSF`), breaks = 50, col = "blue", xlab = "`1stFlrSF`", main = "log(Observed Data)")

log_95<-t.test(log(train_data$`1stFlrSF`), conf.level = 0.95)
log_95
## 
##  One Sample t-test
## 
## data:  log(train_data$`1stFlrSF`)
## t = 842.72, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  6.991190 7.023813
## sample estimates:
## mean of x 
##  7.007501
#back transforming from the log() to get the actual confidence interval
exp(6.99)
## [1] 1085.721
exp(7.02)
## [1] 1118.787

Finally, provide the empirical 5th percentile and 95th percentile of the data.

quantile(train_data$`1stFlrSF`, probs = c(0.05, 0.95))
##      5%     95% 
##  672.95 1831.25

The 5th and 95th percentiles of the empirical data are 672.95 and 1831.25 sq ft. The distribution is only slightly skewed and looks arguably normal, so it makes sense that the confidence interval from the raw data (about 1143 to 1182) and the back-transformed interval from the log data (about 1086 to 1119) are similar. Both confidence intervals for the mean sit between the 5th and 95th percentiles and are, as expected, much narrower than that range, since they describe the mean rather than individual observations.

Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score. Provide a screen snapshot of your score with your name identifiable.
num_index<-sapply(train_data,is.numeric)
train_data<-train_data[,num_index]
#note lm() will remove NA data prior to analysis
lm_saleprice<-lm(SalePrice ~ ., data = train_data)

summ<-summary(lm_saleprice)
summ
## 
## Call:
## lm(formula = SalePrice ~ ., data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -442182  -16955   -2824   15125  318183 
## 
## Coefficients: (2 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -3.351e+05  1.701e+06  -0.197 0.843909    
## Id            -1.205e+00  2.658e+00  -0.453 0.650332    
## MSSubClass    -2.001e+02  3.451e+01  -5.797 8.84e-09 ***
## LotFrontage   -1.160e+02  6.126e+01  -1.894 0.058503 .  
## LotArea        5.422e-01  1.575e-01   3.442 0.000599 ***
## OverallQual    1.866e+04  1.482e+03  12.592  < 2e-16 ***
## OverallCond    5.239e+03  1.368e+03   3.830 0.000135 ***
## YearBuilt      3.164e+02  8.766e+01   3.610 0.000321 ***
## YearRemodAdd   1.194e+02  8.668e+01   1.378 0.168607    
## MasVnrArea     3.141e+01  7.022e+00   4.473 8.54e-06 ***
## BsmtFinSF1     1.736e+01  5.838e+00   2.973 0.003014 ** 
## BsmtFinSF2     8.342e+00  8.766e+00   0.952 0.341532    
## BsmtUnfSF      5.005e+00  5.277e+00   0.948 0.343173    
## TotalBsmtSF           NA         NA      NA       NA    
## `1stFlrSF`     4.597e+01  7.360e+00   6.246 6.02e-10 ***
## `2ndFlrSF`     4.663e+01  6.102e+00   7.641 4.72e-14 ***
## LowQualFinSF   3.341e+01  2.794e+01   1.196 0.232009    
## GrLivArea             NA         NA      NA       NA    
## BsmtFullBath   9.043e+03  3.198e+03   2.828 0.004776 ** 
## BsmtHalfBath   2.465e+03  5.073e+03   0.486 0.627135    
## FullBath       5.433e+03  3.531e+03   1.539 0.124182    
## HalfBath      -1.098e+03  3.321e+03  -0.331 0.740945    
## BedroomAbvGr  -1.022e+04  2.155e+03  -4.742 2.40e-06 ***
## KitchenAbvGr  -2.202e+04  6.710e+03  -3.282 0.001063 ** 
## TotRmsAbvGrd   5.464e+03  1.487e+03   3.674 0.000251 ***
## Fireplaces     4.372e+03  2.189e+03   1.998 0.046020 *  
## GarageYrBlt   -4.728e+01  9.106e+01  -0.519 0.603742    
## GarageCars     1.685e+04  3.491e+03   4.827 1.58e-06 ***
## GarageArea     6.274e+00  1.213e+01   0.517 0.605002    
## WoodDeckSF     2.144e+01  1.002e+01   2.139 0.032662 *  
## OpenPorchSF   -2.252e+00  1.949e+01  -0.116 0.907998    
## EnclosedPorch  7.295e+00  2.062e+01   0.354 0.723590    
## `3SsnPorch`    3.349e+01  3.758e+01   0.891 0.373163    
## ScreenPorch    5.805e+01  2.041e+01   2.844 0.004532 ** 
## PoolArea      -6.052e+01  2.990e+01  -2.024 0.043204 *  
## MiscVal       -3.761e+00  6.960e+00  -0.540 0.589016    
## MoSold        -2.217e+02  4.229e+02  -0.524 0.600188    
## YrSold        -2.474e+02  8.458e+02  -0.293 0.769917    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36800 on 1085 degrees of freedom
##   (339 observations deleted due to missingness)
## Multiple R-squared:  0.8096, Adjusted R-squared:  0.8034 
## F-statistic: 131.8 on 35 and 1085 DF,  p-value: < 2.2e-16

First, in order to reduce the list of predictors, each predictor's p-value is assessed to determine whether the null hypothesis of no relationship with SalePrice can be rejected. The significance level for this analysis is 0.05, so any predictor with a p-value greater than 0.05 is removed. Additionally, we can look at the ratio of the estimated coefficient to its standard error (the test statistic): as a rough rule of thumb, ratios of about 5 to 10 or more are preferred, since the larger the ratio, the smaller the relative variability in the slope estimate.

pvalue<-summ$coefficients[3:nrow(summ$coefficients),4]
pvalue_index<-abs(pvalue)<0.05
list_summ<-data.frame(summ$coefficients)
names(list_summ) <- c('Estimate','Error','tvalue','pvalue')

#variables listed with pvalue less than 0.05
list_summ<-list_summ[list_summ$pvalue<0.05,]

#the test_statistic is added as a column
list_summ$test_statistic<-list_summ$Estimate/list_summ$Error

#filter through list_summ for test_statistic greater than 4. Rule of thumb is >5, but that would leave us with very few variables
list_summ<-list_summ[abs(list_summ$test_statistic)>4,]
list_summ
##                  Estimate       Error    tvalue       pvalue test_statistic
## MSSubClass     -200.06229   34.511335 -5.797002 8.843883e-09      -5.797002
## OverallQual   18656.24965 1481.619499 12.591795 4.941611e-34      12.591795
## MasVnrArea       31.40756    7.022059  4.472700 8.536692e-06       4.472700
## `1stFlrSF`       45.97355    7.360064  6.246353 6.022410e-10       6.246353
## `2ndFlrSF`       46.62535    6.102050  7.640933 4.724807e-14       7.640933
## BedroomAbvGr -10218.91046 2155.037820 -4.741871 2.399480e-06      -4.741871
## GarageCars    16849.71719 3490.579204  4.827198 1.582910e-06       4.827198

The categorical variables are:
  • OverallQual (Rates the overall material and finish of the house, 1-10)
  • BedroomAbvGr (Bedrooms above grade; does NOT include basement bedrooms)
  • MSSubClass (Identifies the type of dwelling involved in the sale)
  • GarageCars (Size of garage in car capacity)

The continuous variables are:
  • MasVnrArea (Masonry veneer area in square feet)
  • 1stFlrSF (First Floor square feet)
  • 2ndFlrSF (Second floor square feet)

The new linear regression model is built:

# 1stFlrSF and 2ndFlrSF are renamed to match the test dataset's column names
train_data2<-train_data
colnames(train_data2)[14]<-'X1stFlrSF'
colnames(train_data2)[15]<-'X2ndFlrSF'
lm_saleprice2<-lm(SalePrice ~ OverallQual+BedroomAbvGr+MSSubClass+GarageCars+MasVnrArea+`X1stFlrSF`+`X2ndFlrSF`, data = train_data2)
summ2<-summary(lm_saleprice2)
summ2
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + BedroomAbvGr + MSSubClass + 
##     GarageCars + MasVnrArea + X1stFlrSF + X2ndFlrSF, data = train_data2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -440242  -19212   -1179   17211  277235 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -64151.794   6515.242  -9.846  < 2e-16 ***
## OverallQual   23750.794   1043.683  22.757  < 2e-16 ***
## BedroomAbvGr  -9564.999   1538.501  -6.217 6.62e-10 ***
## MSSubClass     -186.695     25.527  -7.314 4.29e-13 ***
## GarageCars    15601.329   1720.397   9.068  < 2e-16 ***
## MasVnrArea       37.766      6.226   6.066 1.67e-09 ***
## X1stFlrSF        75.040      3.659  20.507  < 2e-16 ***
## X2ndFlrSF        56.245      3.388  16.601  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37730 on 1444 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.7747, Adjusted R-squared:  0.7736 
## F-statistic: 709.2 on 7 and 1444 DF,  p-value: < 2.2e-16

From here the conditions are checked for linear regression.

predictors<-train_data2[c('SalePrice','OverallQual','BedroomAbvGr','MSSubClass','GarageCars','MasVnrArea','X1stFlrSF','X2ndFlrSF')]
predictors %>%
  ggpairs()
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 8 rows containing missing values

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 8 rows containing missing values

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 8 rows containing missing values

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 8 rows containing missing values

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 8 rows containing missing values
## Warning: Removed 8 rows containing missing values (`geom_point()`).
## Removed 8 rows containing missing values (`geom_point()`).
## Removed 8 rows containing missing values (`geom_point()`).
## Removed 8 rows containing missing values (`geom_point()`).
## Removed 8 rows containing missing values (`geom_point()`).
## Warning: Removed 8 rows containing non-finite values (`stat_density()`).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 8 rows containing missing values

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 8 rows containing missing values
## Warning: Removed 8 rows containing missing values (`geom_point()`).
## Removed 8 rows containing missing values (`geom_point()`).

The predictors have a fairly linear relationship to SalePrice, except for MSSubClass, although it still adds some predictive value to the model.
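Beyond linearity, the remaining regression conditions (nearly normal residuals and constant variability) can be examined with the standard diagnostic plots; a minimal sketch, not part of the original output:

# residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage
par(mfrow = c(2, 2))
plot(lm_saleprice2)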

An F-statistic of 709.2 with 7 numerator degrees of freedom and 1444 denominator degrees of freedom, and a p-value < 2.2e-16 (essentially 0), shows that the model is statistically significant and that the predictor variables are collectively contributing to the prediction of the outcome variable. Since the p-value is well below 0.05, the null hypothesis that there is no relationship between the predictors and the outcome can be rejected.

The baseline model where all the variables in the training data set was used showed the following: F-statistic: 131.8 on 35 and 1085 DF, p-value: < 2.2e-16

Reducing the number of predictors increased the F-statistic while using far fewer predictors (7 instead of 35). This is an improvement in the model.

test <- read.csv("https://raw.githubusercontent.com/catcho1632/DATA605/main/house-prices-advanced-regression-techniques/test.csv")
test[is.na(test)] = 0

prediction <- predict(lm_saleprice2, test)
prediction <- data.frame(cbind(test$Id, prediction))
colnames(prediction) = c('Id', 'SalePrice')
#write_csv(prediction, "/Users/catherinecho/Documents/DATA 605/SalePrice.csv")
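As a rough sanity check before submitting (a sketch; fitted_sp and keep are just local helper names introduced here), the competition's evaluation metric, the root mean squared error of log(SalePrice), can be approximated in-sample on the training data. A held-out split or cross-validation would give a more honest estimate:

# in-sample predictions from the reduced model
fitted_sp <- predict(lm_saleprice2, train_data2)
# drop rows where the prediction is missing or non-positive before taking logs
keep <- !is.na(fitted_sp) & fitted_sp > 0
# RMSE on the log scale, the metric used on the Kaggle leaderboard
sqrt(mean((log(train_data2$SalePrice[keep]) - log(fitted_sp[keep]))^2))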