DCorrea_FinalExam

Problem 1

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of $\mu = \sigma = (N+1)/2$

set.seed(123)
N <- 10

sigma <- (N + 1)/2
mu <- sigma
   
X <- runif(n = 10000, min = 1, max = N)
Y <- rnorm(n = 10000, mean = mu, sd = sigma)

df <- data.frame(cbind(X, Y))

summary(X)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.001   3.276   5.451   5.478   7.691   9.999

summary(Y)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -15.649   1.841   5.475   5.511   9.330  26.663

Probability

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

$P(X>x|X>y)$ b. $P(X>x, Y>y)$ c. $P(X<x|X>y)$

x <- median(X)
y <- quantile(Y, 0.25)

a

$P(X>x|X>y)$. The likelihood of event A, X > x, given that B, X > y), is true. $ = $

P_AB <- df %>%
  filter(X > x & X > y) %>%
  nrow()

P_B <- df %>%
  filter(X > y) %>%
  nrow()

P1 <- P_AB/P_B

round(P1, 2)

## [1] 0.55

b

$P(X>x, Y>y)$. This is the probability of the intersection of P(A) and P(B)

P2 <- df %>%
  filter (X > x & Y>y) %>%
  nrow()/10000

round(P2, 2)

## [1] 0.38

C

$P(X<x|X>y)$ The likelihood of event A, X < x, given that B, X > y, is true. $ = $

P_AB <- df %>%
  filter(X < x & X > y) %>%
  nrow()

P_B <- df %>%
  filter(X > y) %>%
  nrow()

P3 <- P_AB/P_B

round(P3, 2)

## [1] 0.45

Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

p11 <- df %>%
  filter (X > x & Y>y) %>%
  nrow()/10000

p21 <- df %>%
  filter (X > x & Y<y) %>%
  nrow()/10000

p31 <- df %>%
  filter (X > x) %>%
  nrow()/10000

p12 <- df %>%
  filter (X < x & Y>y) %>%
  nrow()/10000

p22 <- df %>%
  filter (X < x & Y<y) %>%
  nrow()/10000

p23 <- df %>%
  filter (X < x) %>%
  nrow()/10000

m1 <- df %>%
  filter (Y > y) %>%
  nrow()/10000

m2 <- df %>%
  filter (Y < y) %>%
  nrow()/10000


prop_matrix <- matrix(c(p11, p21,p31,p12,p22,p23,m1,m2,1.00), nrow = 3, byrow=FALSE, 
       dimnames = list(c('Y>y','Y<y', 'Total'),
                       c('X>x','X<x', 'Total')))
prop_matrix

##          X>x    X<x Total
## Y>y   0.3756 0.3744  0.75
## Y<y   0.1244 0.1256  0.25
## Total 0.5000 0.5000  1.00

#P(A) * P(B)
m1 * p31

## [1] 0.375

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

Both the Fisher’s Exact test and the Chi Square test are both used to determine if there is a significant relationship between variables. Chi-square test is used when the sample is large and the Fisher’s exact test is used when the sample is small. Here, the Chi-Square test is appropriate.

p11 <- p11 * 10000
p21 <- p21 * 10000
p12 <- p12 * 10000
p22 <- p22 * 10000

prop_matrix2 <- matrix(c(p11, p21,p12,p22), nrow = 2, byrow=FALSE, 
       dimnames = list(c('Y>y','Y<y'),
                       c('X>x','X<x')))
prop_matrix2

##      X>x  X<x
## Y>y 3756 3744
## Y<y 1244 1256

chisq.test(prop_matrix2)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  prop_matrix2
## X-squared = 0.064533, df = 1, p-value = 0.7995

fisher.test(prop_matrix2)

## 
##  Fisher's Exact Test for Count Data
## 
## data:  prop_matrix2
## p-value = 0.7995
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9242273 1.1100187
## sample estimates:
## odds ratio 
##   1.012883

We cannot reject the null hypothesis. The independence remains true.

Problem 2

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.

path <- 'house prices - train.csv'
house_prices_train <- read.csv(path)

# Remove outliers
out <- boxplot.stats(house_prices_train$SalePrice)$out
out_rows <- which(house_prices_train$SalePrice %in% c(out))
house_prices_train <- house_prices_train[-out_rows,]

Descriptive and Inferential Statistics

Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

# Descriptive statistics
summary(house_prices_train)

##        Id           MSSubClass       MSZoning          LotFrontage    
##  Min.   :   1.0   Min.   : 20.00   Length:1399        Min.   : 21.00  
##  1st Qu.: 367.5   1st Qu.: 20.00   Class :character   1st Qu.: 59.00  
##  Median : 739.0   Median : 50.00   Mode  :character   Median : 68.00  
##  Mean   : 733.5   Mean   : 57.49                      Mean   : 69.06  
##  3rd Qu.:1098.5   3rd Qu.: 70.00                      3rd Qu.: 80.00  
##  Max.   :1460.0   Max.   :190.00                      Max.   :313.00  
##                                                       NA's   :256     
##     LotArea          Street             Alley             LotShape        
##  Min.   :  1300   Length:1399        Length:1399        Length:1399       
##  1st Qu.:  7442   Class :character   Class :character   Class :character  
##  Median :  9317   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 10155                                                           
##  3rd Qu.: 11316                                                           
##  Max.   :164660                                                           
##                                                                           
##  LandContour         Utilities          LotConfig          LandSlope        
##  Length:1399        Length:1399        Length:1399        Length:1399       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Neighborhood        Condition1         Condition2          BldgType        
##  Length:1399        Length:1399        Length:1399        Length:1399       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   HouseStyle         OverallQual      OverallCond      YearBuilt   
##  Length:1399        Min.   : 1.000   Min.   :1.000   Min.   :1872  
##  Class :character   1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1953  
##  Mode  :character   Median : 6.000   Median :5.000   Median :1971  
##                     Mean   : 5.984   Mean   :5.591   Mean   :1970  
##                     3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:1999  
##                     Max.   :10.000   Max.   :9.000   Max.   :2009  
##                                                                    
##   YearRemodAdd   RoofStyle           RoofMatl         Exterior1st       
##  Min.   :1950   Length:1399        Length:1399        Length:1399       
##  1st Qu.:1966   Class :character   Class :character   Class :character  
##  Median :1992   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1984                                                           
##  3rd Qu.:2003                                                           
##  Max.   :2010                                                           
##                                                                         
##  Exterior2nd         MasVnrType          MasVnrArea       ExterQual        
##  Length:1399        Length:1399        Min.   :   0.00   Length:1399       
##  Class :character   Class :character   1st Qu.:   0.00   Class :character  
##  Mode  :character   Mode  :character   Median :   0.00   Mode  :character  
##                                        Mean   :  90.18                     
##                                        3rd Qu.: 144.00                     
##                                        Max.   :1600.00                     
##                                        NA's   :7                           
##   ExterCond          Foundation          BsmtQual           BsmtCond        
##  Length:1399        Length:1399        Length:1399        Length:1399       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  BsmtExposure       BsmtFinType1         BsmtFinSF1     BsmtFinType2      
##  Length:1399        Length:1399        Min.   :   0.0   Length:1399       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median : 370.0   Mode  :character  
##                                        Mean   : 417.3                     
##                                        3rd Qu.: 686.0                     
##                                        Max.   :5644.0                     
##                                                                           
##    BsmtFinSF2        BsmtUnfSF       TotalBsmtSF     Heating         
##  Min.   :   0.00   Min.   :   0.0   Min.   :   0   Length:1399       
##  1st Qu.:   0.00   1st Qu.: 218.0   1st Qu.: 788   Class :character  
##  Median :   0.00   Median : 476.0   Median : 973   Mode  :character  
##  Mean   :  47.48   Mean   : 559.5   Mean   :1024                     
##  3rd Qu.:   0.00   3rd Qu.: 807.0   3rd Qu.:1252                     
##  Max.   :1474.00   Max.   :2042.0   Max.   :6110                     
##                                                                      
##   HeatingQC          CentralAir         Electrical          X1stFlrSF     
##  Length:1399        Length:1399        Length:1399        Min.   : 334.0  
##  Class :character   Class :character   Class :character   1st Qu.: 870.5  
##  Mode  :character   Mode  :character   Mode  :character   Median :1069.0  
##                                                           Mean   :1132.2  
##                                                           3rd Qu.:1346.5  
##                                                           Max.   :4692.0  
##                                                                           
##    X2ndFlrSF       LowQualFinSF       GrLivArea     BsmtFullBath   
##  Min.   :   0.0   Min.   :  0.000   Min.   : 334   Min.   :0.0000  
##  1st Qu.:   0.0   1st Qu.:  0.000   1st Qu.:1119   1st Qu.:0.0000  
##  Median :   0.0   Median :  0.000   Median :1437   Median :0.0000  
##  Mean   : 336.5   Mean   :  5.691   Mean   :1474   Mean   :0.4103  
##  3rd Qu.: 720.0   3rd Qu.:  0.000   3rd Qu.:1728   3rd Qu.:1.0000  
##  Max.   :1818.0   Max.   :528.000   Max.   :5642   Max.   :3.0000  
##                                                                    
##   BsmtHalfBath        FullBath       HalfBath       BedroomAbvGr  
##  Min.   :0.00000   Min.   :0.00   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:1.00   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :0.00000   Median :2.00   Median :0.0000   Median :3.000  
##  Mean   :0.05861   Mean   :1.54   Mean   :0.3703   Mean   :2.862  
##  3rd Qu.:0.00000   3rd Qu.:2.00   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :2.00000   Max.   :3.00   Max.   :2.0000   Max.   :8.000  
##                                                                   
##   KitchenAbvGr   KitchenQual         TotRmsAbvGrd     Functional       
##  Min.   :0.000   Length:1399        Min.   : 2.000   Length:1399       
##  1st Qu.:1.000   Class :character   1st Qu.: 5.000   Class :character  
##  Median :1.000   Mode  :character   Median : 6.000   Mode  :character  
##  Mean   :1.049                      Mean   : 6.417                     
##  3rd Qu.:1.000                      3rd Qu.: 7.000                     
##  Max.   :3.000                      Max.   :14.000                     
##                                                                        
##    Fireplaces     FireplaceQu         GarageType         GarageYrBlt  
##  Min.   :0.0000   Length:1399        Length:1399        Min.   :1900  
##  1st Qu.:0.0000   Class :character   Class :character   1st Qu.:1960  
##  Median :1.0000   Mode  :character   Mode  :character   Median :1978  
##  Mean   :0.5833                                         Mean   :1977  
##  3rd Qu.:1.0000                                         3rd Qu.:2001  
##  Max.   :3.0000                                         Max.   :2010  
##                                                         NA's   :81    
##  GarageFinish         GarageCars      GarageArea      GarageQual       
##  Length:1399        Min.   :0.000   Min.   :   0.0   Length:1399       
##  Class :character   1st Qu.:1.000   1st Qu.: 312.0   Class :character  
##  Mode  :character   Median :2.000   Median : 471.0   Mode  :character  
##                     Mean   :1.718   Mean   : 458.8                     
##                     3rd Qu.:2.000   3rd Qu.: 573.0                     
##                     Max.   :4.000   Max.   :1418.0                     
##                                                                        
##   GarageCond         PavedDrive          WoodDeckSF      OpenPorchSF    
##  Length:1399        Length:1399        Min.   :  0.00   Min.   :  0.00  
##  Class :character   Class :character   1st Qu.:  0.00   1st Qu.:  0.00  
##  Mode  :character   Mode  :character   Median :  0.00   Median : 22.00  
##                                        Mean   : 89.62   Mean   : 44.89  
##                                        3rd Qu.:165.00   3rd Qu.: 64.00  
##                                        Max.   :736.00   Max.   :547.00  
##                                                                         
##  EnclosedPorch      X3SsnPorch       ScreenPorch        PoolArea      
##  Min.   :  0.00   Min.   :  0.000   Min.   :  0.00   Min.   :  0.000  
##  1st Qu.:  0.00   1st Qu.:  0.000   1st Qu.:  0.00   1st Qu.:  0.000  
##  Median :  0.00   Median :  0.000   Median :  0.00   Median :  0.000  
##  Mean   : 22.38   Mean   :  3.232   Mean   : 14.63   Mean   :  2.482  
##  3rd Qu.:  0.00   3rd Qu.:  0.000   3rd Qu.:  0.00   3rd Qu.:  0.000  
##  Max.   :552.00   Max.   :508.000   Max.   :480.00   Max.   :738.000  
##                                                                       
##     PoolQC             Fence           MiscFeature           MiscVal        
##  Length:1399        Length:1399        Length:1399        Min.   :    0.00  
##  Class :character   Class :character   Class :character   1st Qu.:    0.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :    0.00  
##                                                           Mean   :   45.38  
##                                                           3rd Qu.:    0.00  
##                                                           Max.   :15500.00  
##                                                                             
##      MoSold           YrSold       SaleType         SaleCondition     
##  Min.   : 1.000   Min.   :2006   Length:1399        Length:1399       
##  1st Qu.: 5.000   1st Qu.:2007   Class :character   Class :character  
##  Median : 6.000   Median :2008   Mode  :character   Mode  :character  
##  Mean   : 6.312   Mean   :2008                                        
##  3rd Qu.: 8.000   3rd Qu.:2009                                        
##  Max.   :12.000   Max.   :2010                                        
##                                                                       
##    SalePrice     
##  Min.   : 34900  
##  1st Qu.:129000  
##  Median :159500  
##  Mean   :170237  
##  3rd Qu.:203500  
##  Max.   :340000  
##

# Distribution of the Dependent Variable, Sale Price
ggplot(data = house_prices_train, aes(x = SalePrice)) + 
    geom_histogram(aes(y=..density..), color = 'black', fill = 'white') + 
    geom_density(alpha = .2, fill = '#FF6666') + 
  labs(title = 'Distribution of Sale Price')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Scatterplots

ggplot(data = house_prices_train, aes(x = OverallQual, y = SalePrice
                                      , color = OverallQual)) + 
  geom_point() + 
  labs(title = 'Sale Price by Overall Quality')

ggplot(data = house_prices_train, aes(x = GrLivArea, y = SalePrice)) + 
  geom_point() + 
  labs(title = 'Sale Price by Above Ground Living Area')

# Correlation
(corr_matrix <- house_prices_train %>%
  dplyr::select(SalePrice, OverallQual, GrLivArea) %>%
  cor() %>%
  as.matrix())

##             SalePrice OverallQual GrLivArea
## SalePrice   1.0000000   0.7842943 0.6613246
## OverallQual 0.7842943   1.0000000 0.5379843
## GrLivArea   0.6613246   0.5379843 1.0000000

# Testing the Hypothesis
cor.test(house_prices_train$OverallQual, house_prices_train$SalePrice, 
        conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  house_prices_train$OverallQual and house_prices_train$SalePrice
## t = 47.251, df = 1397, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.7707334 0.7971450
## sample estimates:
##       cor 
## 0.7842943

cor.test(house_prices_train$GrLivArea, house_prices_train$SalePrice, 
        conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  house_prices_train$GrLivArea and house_prices_train$SalePrice
## t = 32.953, df = 1397, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6415857 0.6801881
## sample estimates:
##       cor 
## 0.6613246

#Famimlywise Error Rate
(fwe <- 1-(1-.05)^2)

## [1] 0.0975

The independent variables have strong correlations with out dependent variable. Here we are not worried about a familywise error as the number of tests are low and the p-value from our tests are very low. Thus, we can reject the null hypothesis.

Linear Algebra and Correlation

Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

# Precision matrix
(prec_matrix <- solve(corr_matrix))

##             SalePrice OverallQual   GrLivArea
## SalePrice    3.286933 -1.98219009 -1.10734249
## OverallQual -1.982190  2.60267798 -0.08932896
## GrLivArea   -1.107342 -0.08932896  1.78037038

# Correlation matrix * precision matrix
round(corr_matrix %*% prec_matrix,2)

##             SalePrice OverallQual GrLivArea
## SalePrice           1           0         0
## OverallQual         0           1         0
## GrLivArea           0           0         1

# Precision matrix * correlation matrix
round(prec_matrix %*% corr_matrix,2)

##             SalePrice OverallQual GrLivArea
## SalePrice           1           0         0
## OverallQual         0           1         0
## GrLivArea           0           0         1

#LU Decomposition
LU(corr_matrix)

## $P
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1
## 
## $L
##           [,1]       [,2] [,3]
## [1,] 1.0000000 0.00000000    0
## [2,] 0.7842943 1.00000000    0
## [3,] 0.6613246 0.05017437    1
## 
## $U
##      SalePrice OverallQual  GrLivArea
## [1,]         1   0.7842943 0.66132457
## [2,]         0   0.3848824 0.01931123
## [3,]         0   0.0000000 0.56168088

Calculus-Based Probability & Statistics

Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of $\lambda$ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, $\lambda$)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss. 10 points.

# Variable skewed to the right
ggplot(data = house_prices_train, aes(x = LotArea)) + 
    geom_histogram(aes(y=..density..), color = 'black', fill = 'white') + 
    geom_density(alpha = .2, fill = '#FF6666') + 
  labs(title = 'Distribution of Lot Area')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# fitting Lot Area to an exponential probability density function
(epd <- fitdistr(house_prices_train$LotArea, 'exponential'))

##        rate    
##   9.847093e-05 
##  (2.632687e-06)

# optimal lambda value
(lambda <- epd$estimate)

##         rate 
## 9.847093e-05

# 1000 samples using optimal lambda
opt_lam <- data.frame(samples = rexp(1000, lambda))

# Plot Histogram
ggplot(data = opt_lam, aes(x = samples)) + 
    geom_histogram(aes(y=..density..), color = 'black', fill = 'white') + 
    geom_density(alpha = .2, fill = '#FF6666') + 
  labs(title = 'Exponential Probability Distribution of Lot Area')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# 5th and 95th percentiles using CDF
qexp(p = 0.05, rate = lambda)

## [1] 520.8978

qexp(p = 0.95, rate = lambda)

## [1] 30422.5

# 95% confidence interval
sigma <- sd(house_prices_train$LotArea)
n <- length(house_prices_train$LotArea)
ci <- qnorm(0.95)

me <- ci * (sigma/sqrt(n)) 

# Upper bound
(mean(house_prices_train$LotArea) + me)

## [1] 10521.38

# Lower bound
(mean(house_prices_train$LotArea) - me)

## [1] 9789.186

# Percentiles
quantile(house_prices_train$LotArea, 0.05)

##   5% 
## 3196

quantile(house_prices_train$LotArea, 0.95)

##     95% 
## 16783.5

Modeling

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

lr <- lm(SalePrice ~ GarageCars + OverallQual + GrLivArea + BldgType + CentralAir +
          Utilities + MSZoning + OverallCond, data = house_prices_train) 

# Linear regression
summary(lr)

## 
## Call:
## lm(formula = SalePrice ~ GarageCars + OverallQual + GrLivArea + 
##     BldgType + CentralAir + Utilities + MSZoning + OverallCond, 
##     data = house_prices_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -260412  -16754   -1085   13861  109731 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -90450.203   9982.587  -9.061  < 2e-16 ***
## GarageCars       17507.748   1319.348  13.270  < 2e-16 ***
## OverallQual      20510.961    852.005  24.074  < 2e-16 ***
## GrLivArea           38.017      2.003  18.978  < 2e-16 ***
## BldgType2fmCon   -4649.219   5398.021  -0.861 0.389232    
## BldgTypeDuplex  -16526.140   4190.804  -3.943 8.43e-05 ***
## BldgTypeTwnhs   -16086.251   4649.587  -3.460 0.000557 ***
## BldgTypeTwnhsE    -140.767   3071.066  -0.046 0.963447    
## CentralAirY      10631.039   3427.583   3.102 0.001964 ** 
## UtilitiesNoSeWa -40329.815  28421.620  -1.419 0.156129    
## MSZoningFV       35529.362  10028.651   3.543 0.000409 ***
## MSZoningRH       18377.653  11593.665   1.585 0.113162    
## MSZoningRL       31327.878   9283.545   3.375 0.000760 ***
## MSZoningRM        5755.675   9391.415   0.613 0.540066    
## OverallCond       2857.578    729.230   3.919 9.34e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28400 on 1384 degrees of freedom
## Multiple R-squared:  0.7724, Adjusted R-squared:  0.7701 
## F-statistic: 335.4 on 14 and 1384 DF,  p-value: < 2.2e-16

# Residuals
ggplot(data = lr, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

ggplot(data = lr, aes(x = .resid)) +
  geom_histogram() +
  xlab("Residuals")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = lr, aes(sample = .resid)) +
  stat_qq()

Using the above linear regression, my kaggle score is a 0.75094, username dcorrea614.