
Data Preparation

Step 1: Load and Examine Data

To begin, we load the necessary libraries and the dataset. We then examine the first few rows of the data to get an overview of its structure.

# Load the data
data <- read.csv("https://raw.githubusercontent.com/hawa1983/DATA605/main/train.csv")

# Display the first few rows of the data
head(data)
##   Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1  1         60       RL          65    8450   Pave  <NA>      Reg         Lvl
## 2  2         20       RL          80    9600   Pave  <NA>      Reg         Lvl
## 3  3         60       RL          68   11250   Pave  <NA>      IR1         Lvl
## 4  4         70       RL          60    9550   Pave  <NA>      IR1         Lvl
## 5  5         60       RL          84   14260   Pave  <NA>      IR1         Lvl
## 6  6         50       RL          85   14115   Pave  <NA>      IR1         Lvl
##   Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 2    AllPub       FR2       Gtl      Veenker      Feedr       Norm     1Fam
## 3    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 4    AllPub    Corner       Gtl      Crawfor       Norm       Norm     1Fam
## 5    AllPub       FR2       Gtl      NoRidge       Norm       Norm     1Fam
## 6    AllPub    Inside       Gtl      Mitchel       Norm       Norm     1Fam
##   HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1     2Story           7           5      2003         2003     Gable  CompShg
## 2     1Story           6           8      1976         1976     Gable  CompShg
## 3     2Story           7           5      2001         2002     Gable  CompShg
## 4     2Story           7           5      1915         1970     Gable  CompShg
## 5     2Story           8           5      2000         2000     Gable  CompShg
## 6     1.5Fin           5           5      1993         1995     Gable  CompShg
##   Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1     VinylSd     VinylSd    BrkFace        196        Gd        TA      PConc
## 2     MetalSd     MetalSd       None          0        TA        TA     CBlock
## 3     VinylSd     VinylSd    BrkFace        162        Gd        TA      PConc
## 4     Wd Sdng     Wd Shng       None          0        TA        TA     BrkTil
## 5     VinylSd     VinylSd    BrkFace        350        Gd        TA      PConc
## 6     VinylSd     VinylSd       None          0        TA        TA       Wood
##   BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1       Gd       TA           No          GLQ        706          Unf
## 2       Gd       TA           Gd          ALQ        978          Unf
## 3       Gd       TA           Mn          GLQ        486          Unf
## 4       TA       Gd           No          ALQ        216          Unf
## 5       Gd       TA           Av          GLQ        655          Unf
## 6       Gd       TA           No          GLQ        732          Unf
##   BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1          0       150         856    GasA        Ex          Y      SBrkr
## 2          0       284        1262    GasA        Ex          Y      SBrkr
## 3          0       434         920    GasA        Ex          Y      SBrkr
## 4          0       540         756    GasA        Gd          Y      SBrkr
## 5          0       490        1145    GasA        Ex          Y      SBrkr
## 6          0        64         796    GasA        Ex          Y      SBrkr
##   X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1       856       854            0      1710            1            0        2
## 2      1262         0            0      1262            0            1        2
## 3       920       866            0      1786            1            0        2
## 4       961       756            0      1717            1            0        1
## 5      1145      1053            0      2198            1            0        2
## 6       796       566            0      1362            1            0        1
##   HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1        1            3            1          Gd            8        Typ
## 2        0            3            1          TA            6        Typ
## 3        1            3            1          Gd            6        Typ
## 4        0            3            1          Gd            7        Typ
## 5        1            4            1          Gd            9        Typ
## 6        1            1            1          TA            5        Typ
##   Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1          0        <NA>     Attchd        2003          RFn          2
## 2          1          TA     Attchd        1976          RFn          2
## 3          1          TA     Attchd        2001          RFn          2
## 4          1          Gd     Detchd        1998          Unf          3
## 5          1          TA     Attchd        2000          RFn          3
## 6          0        <NA>     Attchd        1993          Unf          2
##   GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1        548         TA         TA          Y          0          61
## 2        460         TA         TA          Y        298           0
## 3        608         TA         TA          Y          0          42
## 4        642         TA         TA          Y          0          35
## 5        836         TA         TA          Y        192          84
## 6        480         TA         TA          Y         40          30
##   EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1             0          0           0        0   <NA>  <NA>        <NA>
## 2             0          0           0        0   <NA>  <NA>        <NA>
## 3             0          0           0        0   <NA>  <NA>        <NA>
## 4           272          0           0        0   <NA>  <NA>        <NA>
## 5             0          0           0        0   <NA>  <NA>        <NA>
## 6             0        320           0        0   <NA> MnPrv        Shed
##   MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1       0      2   2008       WD        Normal    208500
## 2       0      5   2007       WD        Normal    181500
## 3       0      9   2008       WD        Normal    223500
## 4       0      2   2006       WD       Abnorml    140000
## 5       0     12   2008       WD        Normal    250000
## 6     700     10   2009       WD        Normal    143000

Step 2: Pick and Define Variables

We select LotArea as our independent variable X and SalePrice as our dependent variable Y.

# Pick and define variables
X <- data$LotArea
Y <- data$SalePrice

Probabilities

Step 1: Calculate Quartiles

Next, we calculate the 3rd quartile (Q3) for LotArea and the 2nd quartile (Q2) for SalePrice.

# Calculate quartiles
lotarea_q3 <- quantile(X, 0.75)
saleprice_q2 <- quantile(Y, 0.50)

# Print the results with labels
cat("LotArea 3rd Quartile (Q3):", lotarea_q3, "\n")
## LotArea 3rd Quartile (Q3): 11601.5
cat("SalePrice Median (Q2):", saleprice_q2, "\n")
## SalePrice Median (Q2): 163000

Step 2: Calculate Probabilities

We compute the following probabilities:

  • P(X>x | Y>y)
  • P(X>x and Y>y)
  • P(X<x | Y>y)
# Calculate probabilities
P_X_gt_x_given_Y_gt_y <- sum(X > lotarea_q3 & Y > saleprice_q2) / sum(Y > saleprice_q2)  # P(X>x | Y>y)
P_X_gt_x_and_Y_gt_y <- sum(X > lotarea_q3 & Y > saleprice_q2) / length(Y)                # P(X>x and Y>y)
P_X_lt_x_given_Y_gt_y <- sum(X < lotarea_q3 & Y > saleprice_q2) / sum(Y > saleprice_q2)  # P(X<x | Y>y)

cat("P(X>x | Y>y):", P_X_gt_x_given_Y_gt_y, "\n")
## P(X>x | Y>y): 0.3791209
cat("P(X>x and Y>y):", P_X_gt_x_and_Y_gt_y, "\n")
## P(X>x and Y>y): 0.1890411
cat("P(X<x | Y>y):", P_X_lt_x_given_Y_gt_y, "\n")
## P(X<x | Y>y): 0.6208791

Step 3: Interpret the Probabilities

P(X>x | Y>y) = 0.3791

Given that a property's SalePrice exceeds the median y, the probability that its LotArea also exceeds the 3rd quartile x is 0.3791. In other words, among the more expensive half of the properties, roughly 37.91% also have lots larger than x, which suggests that higher-priced properties tend to sit on larger lots.

P(X>x and Y>y) = 0.1890

The joint probability that LotArea exceeds x and SalePrice exceeds y is 0.1890, meaning roughly 18.90% of all properties satisfy both conditions. This is necessarily smaller than the conditional probability above because it is computed over the whole sample rather than only over the above-median-priced properties; relatively few properties combine a large lot with a high sale price.

P(X<x | Y>y) = 0.6209

Given that a property's SalePrice exceeds the median y, the probability that its LotArea is below the 3rd quartile x is 0.6209. That is, about 62.09% of above-median-priced properties sit on lots no larger than x: many properties command high prices despite modest lot sizes, likely due to other factors such as location, amenities, or house quality.
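
As a quick sanity check, the two conditional probabilities are exact complements, since no integer LotArea value can equal the non-integer cutoff of 11601.5. A minimal sketch using the quantities computed above:

# Complementary conditional probabilities should sum to exactly 1
P_X_gt_x_given_Y_gt_y + P_X_lt_x_given_Y_gt_y
# Expected: 1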

Step 4: Make a Contingency Table of Counts and Perform a Chi-Square Test

We create a contingency table for LotArea and SalePrice and perform a Chi-Square test to determine whether the variables are independent.

# Calculate quartiles
lotarea_q3 <- quantile(X, 0.75)
saleprice_q2 <- quantile(Y, 0.50)

# Create the contingency table
contingency_table <- table(cut(X, breaks=c(-Inf, lotarea_q3, Inf)),
                           cut(Y, breaks=c(-Inf, saleprice_q2, Inf)))

# Convert to matrix and add row and column totals
contingency_matrix <- addmargins(as.matrix(contingency_table))

# Rename rows and columns for clarity
rownames(contingency_matrix) <- c("<=3rd quartile", ">3rd quartile", "Total")
colnames(contingency_matrix) <- c("<=2nd quartile", ">2nd quartile", "Total")

# Create a new matrix with additional row and column for labels
new_matrix <- matrix("", nrow=nrow(contingency_matrix) + 1, ncol=ncol(contingency_matrix) + 1)
new_matrix[1, 2:ncol(new_matrix)] <- c("<=2nd quartile", ">2nd quartile", "Total")
new_matrix[2:nrow(new_matrix), 1] <- c("<=3rd quartile", ">3rd quartile", "Total")
new_matrix[2:nrow(new_matrix), 2:ncol(new_matrix)] <- contingency_matrix
new_matrix[1, 1] <- "LotArea \\ SalePrice"

Contingency Table Interpretation

The contingency table is as follows:

kable(new_matrix, format = "html", caption = "Contingency Table of LotArea and SalePrice with Custom Labels") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Contingency Table of LotArea and SalePrice with Custom Labels
LotArea \ SalePrice   <=2nd quartile   >2nd quartile   Total
<=3rd quartile                   643             452    1095
>3rd quartile                     89             276     365
Total                            732             728    1460

Majority of Properties with Smaller Lot Areas

The majority of properties (1,095 out of 1,460) have a LotArea that is less than or equal to the 3rd quartile. Among these, 643 properties also have a SalePrice less than or equal to the 2nd quartile, and 452 properties have a SalePrice greater than the 2nd quartile. This indicates that while a large number of smaller lot properties are less expensive, a significant number are also relatively more expensive.

Properties with Larger Lot Areas

There are fewer properties with a LotArea greater than the 3rd quartile (365 out of 1,460). Among these, 276 properties have a SalePrice greater than the 2nd quartile, and 89 properties have a SalePrice less than or equal to the 2nd quartile. This suggests that properties with larger lot areas are more likely to have higher sale prices, although some still have lower sale prices.

Comparison Across Quartiles

Properties with LotArea <= 3rd quartile and SalePrice > 2nd quartile (452 properties) outnumber properties with LotArea > 3rd quartile and SalePrice <= 2nd quartile (89 properties). This indicates that even among properties with smaller lot areas, a considerable number achieve higher sale prices, possibly due to factors other than lot size, such as location, house condition, or amenities.
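
To make these comparisons concrete, the row-wise conditional proportions can be read directly off the contingency table. A minimal sketch using the contingency_table object built above:

# P(SalePrice category | LotArea category), row-normalized proportions
round(prop.table(contingency_table, margin = 1), 3)
# Expected: roughly 0.587 / 0.413 for lots <= Q3, and 0.244 / 0.756 for lots > Q3,
# i.e., larger lots are far more likely to sell above the median price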

Step 5: Check the Independence of Variables A and B

We need to

  • Define variables A and B
  • Calculate P(A), P(B), and P(A|B)
  • Check whether P(A|B) = P(A)P(B)

Let A be the variable indicating observations above the 3rd quartile for X (LotArea), and B be the variable indicating observations above the 2nd quartile for Y (SalePrice).

# Load necessary libraries
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
## 
##     group_rows
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# Define A and B
A <- ifelse(X > lotarea_q3, 1, 0)
B <- ifelse(Y > saleprice_q2, 1, 0)

# Create the contingency table for A and B
contingency_table_AB <- table(A, B)

# Calculate probabilities
P_A <- mean(A)
P_B <- mean(B)
P_A_given_B <- mean(A[B == 1])

# Check if P(A|B) = P(A)P(B)
P_A_equal_PB <- P_A_given_B == P_A * P_B

Probabilities

# Create a data frame to display probabilities
prob_df <- data.frame(
  Metric = c("P(A)", "P(B)", "P(A|B)", "P(A|B) == P(A)P(B)"),
  Value = c(P_A, P_B, P_A_given_B, P_A_equal_PB)
)

kable(prob_df, format = "html", caption = "Probabilities for A and B") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Probabilities for A and B
Metric                 Value
P(A)                   0.2500000
P(B)                   0.4986301
P(A|B)                 0.3791209
P(A|B) == P(A)P(B)     0.0000000

Chi-Square Test

# Create a data frame for chi-square test results
# Perform Chi-Square test
chi_square_test_AB <- chisq.test(contingency_table_AB)

chi_square_df <- data.frame(
  Statistic = round(chi_square_test_AB$statistic, 2),
  `Degrees of Freedom` = chi_square_test_AB$parameter,
  `p-value` = format.pval(chi_square_test_AB$p.value, digits = 3, scientific = TRUE)
)

kable(chi_square_df, format = "html", caption = "Chi-Square Test Results for A and B") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Chi-Square Test Results for A and B
            Statistic   Degrees.of.Freedom   p.value
X-squared      127.74                    1   < 2.22e-16
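
Note that chisq.test() applies Yates' continuity correction by default on 2x2 tables. As a quick check (not part of the original analysis), disabling the correction gives a slightly larger statistic and the same conclusion:

# Uncorrected chi-square statistic, for comparison with the default (Yates-corrected) value
chisq.test(contingency_table_AB, correct = FALSE)$statistic
# Expected: slightly above 127.74; the p-value remains far below 0.05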

Conclusion: Independence of Variables A and B

Based on the analysis, we conclude that splitting the training data into variables A and B does not make them independent.

  • The mathematical check shows that P(A|B) ≠ P(A)P(B): P(A|B) = 0.3791, while P(A)P(B) = 0.25 × 0.4986 ≈ 0.1247. (Equivalently, independence would require P(A|B) = P(A) = 0.25, which also fails.)
  • The Chi-Square test for association supports this conclusion with a p-value far below 0.05. Given the very low p-value, we reject the null hypothesis that LotArea and SalePrice are independent; there is strong evidence of a significant association between them.

Therefore, we can state that A and B are dependent.
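
The numbers behind the mathematical check can be printed directly from the quantities already computed; a minimal sketch:

# Compare P(A|B) against P(A)P(B), and against P(A) -- the textbook independence check
cat("P(A|B)   =", P_A_given_B, "\n")  # 0.3791209
cat("P(A)P(B) =", P_A * P_B, "\n")    # 0.25 * 0.4986301 = 0.1246575
cat("P(A)     =", P_A, "\n")          # 0.25; independence would require P(A|B) = P(A)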

Descriptive Statistics and Plots

We provide descriptive statistics for LotArea and SalePrice. Additionally, we generate histograms, boxplots, and a scatterplot to visualize the data.

Step 1: Calculate Summary Statistics

# Descriptive statistics
summary(data$LotArea)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1300    7554    9478   10517   11602  215245
summary(data$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000

Step 2a: Plot Histograms

library(ggplot2)
library(patchwork)
## 
## Attaching package: 'patchwork'
## The following object is masked from 'package:MASS':
## 
##     area
# Histograms
p1 <- ggplot(data, aes(x=LotArea)) + 
  geom_histogram(bins=30, fill="blue", alpha=0.7) + 
  ggtitle("Histogram of LotArea")

p2 <- ggplot(data, aes(x=SalePrice)) + 
  geom_histogram(bins=30, fill="blue", alpha=0.7) + 
  ggtitle("Histogram of SalePrice")

# Combine all plots into a single layout
(p1 | p2)

Step 2b: Plot Boxplots

library(ggplot2)
library(patchwork)


# Boxplots
p3 <- ggplot(data, aes(y=LotArea)) + 
  geom_boxplot() + 
  ggtitle("Boxplot of LotArea")

p4 <- ggplot(data, aes(y=SalePrice)) + 
  geom_boxplot() + 
  ggtitle("Boxplot of SalePrice")

# Combine all plots into a single layout
(p3 | p4) 

Step 3: Scatterplot of LotArea and SalePrice

library(ggplot2)
library(patchwork)

# Scatterplot
p5 <- ggplot(data, aes(x=LotArea, y=SalePrice)) + 
  geom_point() + 
  ggtitle("Scatterplot of LotArea vs SalePrice")

p5

Step 4: Calculate 95% Confidence Interval for the Difference in Means

We calculate the difference in means between LotArea and SalePrice and compute the 95% confidence interval for this difference.

# Calculate the difference in means and the standard error of the difference
mean_diff <- mean(X) - mean(Y)
se_diff <- sqrt(var(X)/length(X) + var(Y)/length(Y))

# Calculate the 95% confidence interval
t_value <- qt(0.975, df=length(X)-1)
ci_lower_diff <- mean_diff - t_value * se_diff
ci_upper_diff <- mean_diff + t_value * se_diff

# Print the results with labels
cat("Mean difference:", mean_diff, "\n")
## Mean difference: -170404.4
cat("95% CI lower bound:", ci_lower_diff, "\n")
## 95% CI lower bound: -174514.8
cat("95% CI upper bound:", ci_upper_diff, "\n")
## 95% CI upper bound: -166293.9
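
For comparison, t.test() produces the Welch version of this interval; its degrees of freedom differ slightly from the df = n - 1 used above, but the interval should be essentially the same. A minimal sketch:

# Welch two-sample confidence interval for the difference in means
t.test(X, Y)$conf.int
# Expected: approximately (-174515, -166294), matching the manual calculation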

Step 5: Correlation Matrix and Hypothesis Testing

We compute the correlation matrix for LotArea and SalePrice and perform a hypothesis test to determine if the correlation is significantly different from zero.

# Compute the correlation matrix
correlation_matrix <- cor(data[, c("LotArea", "SalePrice")])

# Test the hypothesis that the correlation is 0
correlation_test <- cor.test(X, Y)

correlation_matrix
##             LotArea SalePrice
## LotArea   1.0000000 0.2638434
## SalePrice 0.2638434 1.0000000
correlation_test
## 
##  Pearson's product-moment correlation
## 
## data:  X and Y
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2154574 0.3109369
## sample estimates:
##       cor 
## 0.2638434

Step 6: Discussion and Analysis of the Descriptive Statistics

  • The descriptive statistics and visualizations provide insights into the distribution and central tendency of LotArea and SalePrice.
  • The histograms and boxplots reveal that both variables are right-skewed, with outliers present (a quick numeric check of skewness follows this list).
  • The scatterplot suggests a positive relationship between LotArea and SalePrice, although the relationship is not very strong.
  • The confidence interval for the difference in means and the correlation matrix further support these findings, indicating that properties with larger lot areas tend to have higher sale prices.
  • The hypothesis test confirms that the correlation between LotArea and SalePrice is statistically significant.
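
A minimal sketch confirming the skewness claim without extra packages (the moment-based formula below is a convenience; packages such as e1071 provide an equivalent statistic):

# Moment-based sample skewness: E[(x - mean)^3] / sd^3
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
skewness(data$LotArea)    # strongly positive, confirming heavy right skew
skewness(data$SalePrice)  # also positive, though less extreme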

Linear Algebra and PCA

We invert the correlation matrix to obtain the precision matrix and perform matrix multiplications. We also conduct Principal Components Analysis (PCA) to understand the variance explained by each principal component.

# Invert the correlation matrix
precision_matrix <- solve(correlation_matrix)

# Matrix multiplications
correlation_times_precision <- correlation_matrix %*% precision_matrix
precision_times_correlation <- precision_matrix %*% correlation_matrix

# Perform PCA
pca <- prcomp(data[, c("LotArea", "SalePrice")], scale. = TRUE)

# Principal components and explained variance
principal_components <- pca$rotation
explained_variance <- pca$sdev^2 / sum(pca$sdev^2)
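
Because the PCA is run on standardized variables (scale. = TRUE), the explained variances can be cross-checked against the eigenvalues of the correlation matrix; for two standardized variables they reduce to (1 + r)/2 and (1 - r)/2. A quick verification sketch:

# Eigenvalues of the correlation matrix, scaled to proportions of total variance
eigen(correlation_matrix)$values / 2
# Expected: (1 + 0.2638)/2 = 0.6319 and (1 - 0.2638)/2 = 0.3681,
# matching explained_variance from prcomp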

Step 2: Interpretation and Discussion

  1. Precision Matrix:
  • Diagonal Elements (1.0748219): Each diagonal element is the precision of a variable, i.e., the reciprocal of its variance after conditioning on the other variable. Because the correlation matrix involves only two standardized variables, the two diagonal entries are identical (approximately 1.0748).
  • Off-diagonal Elements (-0.2835846): The off-diagonal elements are proportional to the negative of the partial correlation. Scaling by the diagonal, the partial correlation is -(-0.2836)/1.0748 ≈ 0.264; with only two variables this simply reproduces the marginal correlation, confirming a modest positive dependence between LotArea and SalePrice rather than a negative one.
# Print the results with headings
cat("Precision Matrix:\n")
## Precision Matrix:
print(precision_matrix)
##              LotArea  SalePrice
## LotArea    1.0748219 -0.2835846
## SalePrice -0.2835846  1.0748219
  2. Correlation Matrix times Precision Matrix

This result shows an identity matrix, indicating that when the correlation matrix is multiplied by its inverse (the precision matrix), the result is an identity matrix. This confirms that the precision matrix was correctly calculated as the inverse of the correlation matrix.

cat("\nCorrelation Matrix times Precision Matrix:\n")
## 
## Correlation Matrix times Precision Matrix:
print(correlation_times_precision)
##           LotArea SalePrice
## LotArea         1         0
## SalePrice       0         1
  3. Precision Matrix times Correlation Matrix

Similar to the previous result, this multiplication also results in an identity matrix. This further confirms the correctness of the precision and correlation matrices.

cat("\nPrecision Matrix times Correlation Matrix:\n")
## 
## Precision Matrix times Correlation Matrix:
print(precision_times_correlation)
##           LotArea SalePrice
## LotArea         1         0
## SalePrice       0         1
  4. Principal Components
  • PC1 (First Principal Component): Both LotArea and SalePrice contribute equally to the first principal component, with positive coefficients (approximately 0.7071). This suggests that the first principal component represents a combined measure of LotArea and SalePrice.
  • PC2 (Second Principal Component): LotArea and SalePrice contribute equally but with opposite signs to the second principal component. This suggests that the second principal component represents the contrast between LotArea and SalePrice.
cat("\nPrincipal Components:\n")
## 
## Principal Components:
print(principal_components)
##                 PC1        PC2
## LotArea   0.7071068  0.7071068
## SalePrice 0.7071068 -0.7071068
  5. Explained Variance
  • First Principal Component (PC1): Explains approximately 63.19% of the variance in the data. This indicates that most of the variability in LotArea and SalePrice can be captured by this combined measure.
  • Second Principal Component (PC2): Explains approximately 36.81% of the variance in the data. This indicates that the remaining variability is captured by the contrast between LotArea and SalePrice.
cat("\nExplained Variance:\n")
## 
## Explained Variance:
print(explained_variance)
## [1] 0.6319217 0.3680783

Overall Conclusion:

The analysis provides insights into the relationship between LotArea and SalePrice. The precision matrix is consistent with a modest positive dependence between the variables (its negative off-diagonal element corresponds to a partial correlation of about +0.264), while the PCA reveals that a combined measure of these variables captures most of the variance. The significant portion of variance explained by the first principal component suggests that these variables are related, but there is also distinct variability captured by the second principal component.

Calculus-Based Probability & Statistics

Step 1: Fit an Exponential Distribution and Perform Calculations

We fit an exponential distribution to the LotArea data and generate samples from the fitted distribution. We also compare the empirical and theoretical percentiles.

# Check if MASS package is installed; if not, install it
if (!requireNamespace("MASS", quietly = TRUE)) {
  install.packages("MASS")
}

# Load the MASS package
library(MASS)
library(ggplot2)
library(dplyr)
library(tidyr)

# Set a seed for reproducibility
set.seed(42)

# Shift `LotArea` so that all values are strictly positive (the minimum maps to 1), as the exponential distribution requires
lotarea_shifted <- data$LotArea - min(data$LotArea) + 1

# Fit the exponential distribution
fit <- fitdistr(lotarea_shifted, "exponential")
lambda <- fit$estimate

# Generate 1000 samples from the fitted exponential distribution
samples <- rexp(1000, rate=lambda)

# Calculate percentiles
theoretical_percentiles <- qexp(c(0.05, 0.95), rate=lambda)
empirical_percentiles <- quantile(lotarea_shifted, probs=c(0.05, 0.95))

# Print results
cat("Estimated Lambda:", lambda, "\n")
## Estimated Lambda: 0.0001084854
cat("Theoretical Percentiles (5% and 95%):", theoretical_percentiles, "\n")
## Theoretical Percentiles (5% and 95%): 472.8128 27614.15
cat("Empirical Percentiles (5% and 95%):", empirical_percentiles, "\n")
## Empirical Percentiles (5% and 95%): 2012.7 16102.15
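
For an exponential distribution the maximum-likelihood estimate of the rate has a closed form, lambda = 1/mean(x), so the fitted rate can be verified directly; a minimal check:

# MLE of the exponential rate is the reciprocal of the sample mean
1 / mean(lotarea_shifted)
# Expected: ~0.0001084854, reproducing fitdistr's estimate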

Step 2: Plot the Histograms

# Create data frames for plotting
sample_df <- data.frame(Value = samples, Type = "Exponential Samples")
lotarea_df <- data.frame(Value = lotarea_shifted, Type = "Shifted LotArea")

# Combine data frames
plot_data <- bind_rows(sample_df, lotarea_df)

# Plot histograms with facet wrap
ggplot(plot_data, aes(x = Value)) +
  geom_histogram(bins = 30, fill = "blue", alpha = 0.7) +
  facet_wrap(~Type, scales = "free") +
  ggtitle("Histograms of Exponential Samples and Shifted LotArea") +
  xlab("Value") +
  ylab("Frequency")

Step 3: Discuss the output and histograms

Estimated Lambda

The estimated lambda for the exponential distribution is approximately 1.085 × 10^-4 (i.e., 0.0001084854), which for an exponential fit is simply the reciprocal of the sample mean of the shifted data. Lambda is the rate parameter of the exponential distribution and indicates how quickly the density decays: a smaller lambda means a slower decay, whereas a larger lambda means a quicker decay.

Theoretical and Empirical Percentiles

Theoretical Percentiles (5% and 95%): 472.81 and 27,614.15 are the 5th and 95th percentiles of the theoretical exponential distribution with the estimated lambda. They give the range within which 90% of draws from that distribution are expected to fall.

Empirical Percentiles

Empirical Percentiles (5% and 95%): 2,012.7 and 16,102.15 are the 5th and 95th percentiles of the empirical shifted LotArea data. They give the range within which the middle 90% of the shifted LotArea data falls.

Histograms

  • Exponential Samples Histogram: The left histogram shows the distribution of 1000 samples generated from the fitted exponential distribution. As expected for an exponential distribution, the histogram shows a rapid decay, with most values concentrated towards the lower end and fewer values as you move to the right. The shape confirms that the data follows an exponential decay, which is consistent with the nature of an exponential distribution.

  • Shifted LotArea Histogram: The right histogram shows the distribution of the shifted LotArea values. This histogram also shows a right-skewed distribution, with most values concentrated towards the lower end and fewer values extending to the right. The shape of this histogram is somewhat similar to the exponential samples histogram, indicating that the shifted LotArea data might reasonably follow an exponential distribution.

Comparison and Analysis

  • Shape Comparison: Both histograms show a right-skewed distribution, with a high frequency of small values and a long tail of larger values. This similarity suggests that the exponential distribution might be a good fit for the shifted LotArea data.

  • Percentile Comparison: The theoretical percentiles from the exponential distribution (472.81 and 27,614.15) and the empirical percentiles from the shifted LotArea data (2,012.7 and 16,102.15) span a broadly similar range but are far from identical. The empirical 5th percentile is well above the theoretical one, and the empirical 95th percentile is well below it, indicating that the real data has a thinner lower tail and fewer extreme values than the fitted exponential distribution.

  • Goodness of Fit: Taken together, the histograms and percentiles suggest that the exponential distribution is a plausible, though imperfect, model for the shifted LotArea data; a formal goodness-of-fit check appears below.
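
A minimal sketch of such a formal check using the Kolmogorov-Smirnov test (strictly speaking, the test assumes lambda was not estimated from the same data, and ties in LotArea will trigger a warning, so the p-value should be treated as approximate):

# Kolmogorov-Smirnov test of the shifted data against the fitted exponential
ks.test(lotarea_shifted, "pexp", rate = lambda)
# A small p-value would indicate measurable departures from the exponential fit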

Modeling

Step 1: Build and Evaluate Regression Model

We build a simple linear regression model using LotArea to predict SalePrice. We then evaluate the model using the root mean squared error (RMSE).

# Split the data
set.seed(42)
train_indices <- sample(1:nrow(data), 0.8 * nrow(data))
train_data <- data[train_indices, ]
test_data <- data[-train_indices, ]

# Build the model
model <- lm(SalePrice ~ LotArea, data=train_data)

# Predictions
predictions <- predict(model, newdata=test_data)

# Evaluate the model
mse <- mean((test_data$SalePrice - predictions)^2)
rmse <- sqrt(mse)

# Print model summary and RMSE
cat("Model Summary:\n")
## Model Summary:
print(summary(model))
## 
## Call:
## lm(formula = SalePrice ~ LotArea, data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -251762  -48469  -16582   32100  550953 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.571e+05  3.246e+03   48.38   <2e-16 ***
## LotArea     2.182e+00  2.257e-01    9.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 75670 on 1166 degrees of freedom
## Multiple R-squared:  0.07425,    Adjusted R-squared:  0.07345 
## F-statistic: 93.51 on 1 and 1166 DF,  p-value: < 2.2e-16
cat("RMSE:", rmse, "\n")
## RMSE: 80495.68
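
To put the RMSE in context, it can be compared against a naive intercept-only baseline that predicts the mean training SalePrice for every test house; a minimal sketch:

# Baseline RMSE: predict the training-set mean for every test observation
baseline_rmse <- sqrt(mean((test_data$SalePrice - mean(train_data$SalePrice))^2))
cat("Baseline RMSE:", baseline_rmse, "\n")
# The model's RMSE (~80496) should sit only modestly below this baseline,
# consistent with the low R-squared of ~0.07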

Diagnostic plots

# Check if MASS package is installed; if not, install it
if (!requireNamespace("MASS", quietly = TRUE)) {
  install.packages("MASS")
}

# Load necessary packages
library(MASS)
library(ggplot2)
library(dplyr)
library(tidyr)
library(patchwork)



# Diagnostic plots
# Residuals vs Fitted
residuals_vs_fitted <- ggplot(model, aes(.fitted, .resid)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = FALSE, color = "red") +
  ggtitle("Residuals vs Fitted") +
  xlab("Fitted values") +
  ylab("Residuals") +
  theme_minimal()

# Q-Q plot
qq_plot <- ggplot(model, aes(sample = .stdresid)) +
  stat_qq() +
  stat_qq_line() +
  ggtitle("Normal Q-Q") +
  theme_minimal()

# Scale-Location plot
scale_location_plot <- ggplot(model, aes(.fitted, sqrt(abs(.stdresid)))) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = FALSE, color = "red") +
  ggtitle("Scale-Location") +
  xlab("Fitted values") +
  ylab("Square Root of |Standardized Residuals|") +
  theme_minimal()

# Residuals vs Leverage plot
residuals_vs_leverage <- ggplot(model, aes(.hat, .stdresid)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = FALSE, color = "red") +
  ggtitle("Residuals vs Leverage") +
  xlab("Leverage") +
  ylab("Standardized Residuals") +
  theme_minimal()

# Combine diagnostic plots
diagnostic_plots <- (residuals_vs_fitted | scale_location_plot) / (qq_plot | residuals_vs_leverage)

diagnostic_plots
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

The diagnostic plots assess the fit of the linear regression model. Here’s an interpretation of each plot:

Residuals vs Fitted Plot:

Purpose: To check the linearity assumption and identify any patterns. Interpretation: The residuals should be randomly scattered around the horizontal line at zero. In the plot, the residuals exhibit a clear pattern (a curve), suggesting non-linearity. This indicates that the relationship between LotArea and SalePrice may not be purely linear.

Scale-Location Plot (Spread-Location):

Purpose: To check the homoscedasticity assumption (constant variance of residuals). Interpretation: The residuals should be evenly spread along the range of fitted values. In the plot, the red line curves upwards, indicating that the variance of the residuals increases with the fitted values. This suggests heteroscedasticity, meaning that the variance of SalePrice is not constant across different levels of LotArea.

Normal Q-Q Plot:

Purpose: To check the normality of the residuals. Interpretation: The points should lie on the 45-degree reference line if the residuals are normally distributed. In the plot, the points deviate significantly from the line, especially at the tails, indicating that the residuals are not normally distributed. This violation can affect the reliability of hypothesis tests and confidence intervals.

Residuals vs Leverage Plot:

Purpose: To identify influential observations that have a large impact on the model. Interpretation: Observations that combine high leverage with large standardized residuals (equivalently, large Cook's distance) are influential. In the plot, there are points with high leverage and large residuals, suggesting the presence of influential observations. These can disproportionately affect the model fit and should be investigated further.

Overall Interpretation:

The diagnostic plots suggest that the linear regression model may not be appropriate for the data due to violations of the key assumptions:

  • Non-linearity.
  • Heteroscedasticity.
  • Non-normality of residuals.
  • Presence of influential points.

Next Steps:

  • Consider transforming the predictor (LotArea) or the response (SalePrice) to achieve linearity and homoscedasticity (a sketch follows this list).
  • Explore non-linear models, such as polynomial regression or generalized additive models (GAMs).
  • Investigate and possibly remove or adjust for influential observations.
  • Consider robust regression techniques that are less sensitive to violations of assumptions.
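
As a concrete illustration of the first suggestion, a log-log specification often stabilizes the variance for price data. A minimal sketch under the same train/test split (the choice of transformation is an assumption, not part of the analysis above):

# Log-log model: both variables are strictly positive, so logs are well defined
log_model <- lm(log(SalePrice) ~ log(LotArea), data = train_data)

# Back-transform predictions to the dollar scale before computing RMSE
log_predictions <- exp(predict(log_model, newdata = test_data))
log_rmse <- sqrt(mean((test_data$SalePrice - log_predictions)^2))
cat("Log-log model RMSE:", log_rmse, "\n")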