Computational Mathematics

Your final is due by the end of the last week of class. You should post your solutions to your GitHub account or RPubs. You are also expected to make a short presentation via YouTube and post that recording to the board. This project will show off your ability to understand the elements of the class.

Problem 1.

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of ????????????(N+1)/2.
Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

# creating variables X and Y with random values, let's choose base value 1 to 9 only otherwise it will be a big messy data

X <- runif(10000, min = 1, max = 9)
Y <- rnorm(10000, 5, 5)

x <- median(X)
y <- summary(Y)[2]
# now let's calculate the most probabilities you can get from these X, x
Xgtx <- sum(X > x)/10000
Xgty <- sum(X > y)/10000
Xltx <- sum(X < x)/10000
Xlty <- sum(X < y)/10000

# for Y, y
Ygtx <- sum(Y > x)/10000
Ygty <- sum(Y > y)/10000
Yltx <- sum(Y < x)/10000
Ylty <- sum(Y < y)/10000

Probability.

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities. 5 points

  1. P(X>x | X>y)
    Answer:
# Here, the question is asking about the Probabilty of X that is greater than x and y
prob_a <- (sum(X > max(x, y))/10000) / Xgty
prob_a
## [1] 0.5414771
  1. P(X>x, Y>y)
# Here, the question is asking about the Probabilty of X greater than x multiplied Y is greater than y
prob_b <- Xgtx * Ygty
prob_b
## [1] 0.375
  1. P(Xy)
# Here, the question is asking about the Probabilty of X less than x and X is greater than y
prob_c <- (Xgty - Xgtx)/Xgty
prob_c
## [1] 0.4585229

5 points. Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

m <- c(0.38, 0.38, 0.75)
n <- c(0.12, 0.12, 0.25)
o <-c(0.5, 0.5, 1)
table <- (rbind(m, n, o))
names(table) <- c("P(X>x)", "P(X<x)", "Marginal Probabilities")
row.names(table) <- c("P(Y>y)", "P(Y<y)", "Marginal Probabilities")

table
##                        [,1] [,2] [,3]
## P(Y>y)                 0.38 0.38 0.75
## P(Y<y)                 0.12 0.12 0.25
## Marginal Probabilities 0.50 0.50 1.00
## attr(,"names")
## [1] "P(X>x)"                 "P(X<x)"                
## [3] "Marginal Probabilities" NA                      
## [5] NA                       NA                      
## [7] NA                       NA                      
## [9] NA

5 points. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

Answer:

p <- rbind(m, n)
fisher.test(p)
## Warning in fisher.test(p): 'x' has been rounded to integer: Mean relative
## difference: 0.75
## 
##  Fisher's Exact Test for Count Data
## 
## data:  p
## p-value = 1
## alternative hypothesis: two.sided
chisq.test(p)
## Warning in chisq.test(p): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  p
## X-squared = 0.00027031, df = 2, p-value = 0.9999

In both test p-value is approximately 1. We can say that the variables are independent.

Problem 2

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.

5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

5 points. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of ??? for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, ???)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

# import data from github which downloaded from kaggle 
train <- read.csv("https://raw.githubusercontent.com/maharjansudhan/DATA605/master/train.csv", sep =",")

head(train)
##   Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1  1         60       RL          65    8450   Pave  <NA>      Reg
## 2  2         20       RL          80    9600   Pave  <NA>      Reg
## 3  3         60       RL          68   11250   Pave  <NA>      IR1
## 4  4         70       RL          60    9550   Pave  <NA>      IR1
## 5  5         60       RL          84   14260   Pave  <NA>      IR1
## 6  6         50       RL          85   14115   Pave  <NA>      IR1
##   LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1         Lvl    AllPub    Inside       Gtl      CollgCr       Norm
## 2         Lvl    AllPub       FR2       Gtl      Veenker      Feedr
## 3         Lvl    AllPub    Inside       Gtl      CollgCr       Norm
## 4         Lvl    AllPub    Corner       Gtl      Crawfor       Norm
## 5         Lvl    AllPub       FR2       Gtl      NoRidge       Norm
## 6         Lvl    AllPub    Inside       Gtl      Mitchel       Norm
##   Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1       Norm     1Fam     2Story           7           5      2003
## 2       Norm     1Fam     1Story           6           8      1976
## 3       Norm     1Fam     2Story           7           5      2001
## 4       Norm     1Fam     2Story           7           5      1915
## 5       Norm     1Fam     2Story           8           5      2000
## 6       Norm     1Fam     1.5Fin           5           5      1993
##   YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1         2003     Gable  CompShg     VinylSd     VinylSd    BrkFace
## 2         1976     Gable  CompShg     MetalSd     MetalSd       None
## 3         2002     Gable  CompShg     VinylSd     VinylSd    BrkFace
## 4         1970     Gable  CompShg     Wd Sdng     Wd Shng       None
## 5         2000     Gable  CompShg     VinylSd     VinylSd    BrkFace
## 6         1995     Gable  CompShg     VinylSd     VinylSd       None
##   MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1        196        Gd        TA      PConc       Gd       TA           No
## 2          0        TA        TA     CBlock       Gd       TA           Gd
## 3        162        Gd        TA      PConc       Gd       TA           Mn
## 4          0        TA        TA     BrkTil       TA       Gd           No
## 5        350        Gd        TA      PConc       Gd       TA           Av
## 6          0        TA        TA       Wood       Gd       TA           No
##   BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1          GLQ        706          Unf          0       150         856
## 2          ALQ        978          Unf          0       284        1262
## 3          GLQ        486          Unf          0       434         920
## 4          ALQ        216          Unf          0       540         756
## 5          GLQ        655          Unf          0       490        1145
## 6          GLQ        732          Unf          0        64         796
##   Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1    GasA        Ex          Y      SBrkr       856       854            0
## 2    GasA        Ex          Y      SBrkr      1262         0            0
## 3    GasA        Ex          Y      SBrkr       920       866            0
## 4    GasA        Gd          Y      SBrkr       961       756            0
## 5    GasA        Ex          Y      SBrkr      1145      1053            0
## 6    GasA        Ex          Y      SBrkr       796       566            0
##   GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1      1710            1            0        2        1            3
## 2      1262            0            1        2        0            3
## 3      1786            1            0        2        1            3
## 4      1717            1            0        1        0            3
## 5      2198            1            0        2        1            4
## 6      1362            1            0        1        1            1
##   KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1            1          Gd            8        Typ          0        <NA>
## 2            1          TA            6        Typ          1          TA
## 3            1          Gd            6        Typ          1          TA
## 4            1          Gd            7        Typ          1          Gd
## 5            1          Gd            9        Typ          1          TA
## 6            1          TA            5        Typ          0        <NA>
##   GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1     Attchd        2003          RFn          2        548         TA
## 2     Attchd        1976          RFn          2        460         TA
## 3     Attchd        2001          RFn          2        608         TA
## 4     Detchd        1998          Unf          3        642         TA
## 5     Attchd        2000          RFn          3        836         TA
## 6     Attchd        1993          Unf          2        480         TA
##   GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1         TA          Y          0          61             0          0
## 2         TA          Y        298           0             0          0
## 3         TA          Y          0          42             0          0
## 4         TA          Y          0          35           272          0
## 5         TA          Y        192          84             0          0
## 6         TA          Y         40          30             0        320
##   ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1           0        0   <NA>  <NA>        <NA>       0      2   2008
## 2           0        0   <NA>  <NA>        <NA>       0      5   2007
## 3           0        0   <NA>  <NA>        <NA>       0      9   2008
## 4           0        0   <NA>  <NA>        <NA>       0      2   2006
## 5           0        0   <NA>  <NA>        <NA>       0     12   2008
## 6           0        0   <NA> MnPrv        Shed     700     10   2009
##   SaleType SaleCondition SalePrice
## 1       WD        Normal    208500
## 2       WD        Normal    181500
## 3       WD        Normal    223500
## 4       WD       Abnorml    140000
## 5       WD        Normal    250000
## 6       WD        Normal    143000
# for any housing pricing the location matters a lot so lets see if we can find somthing here.

summary(train$LotArea)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1300    7554    9478   10517   11602  215245
summary(train$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000
#let's see if there is the relationship between the quality of the house and the price of the house
plot(train$SalePrice~train$OverallQual)

#let's see if there is the relationship between the LotFrontage and the price of the house
plot(train$SalePrice~train$LotFrontage)

# to see if there is the correlation between the locaiton and yearbuilt with 80% confidence interval
corr <- cor.test(train$YearBuilt, train$LotArea, method = "pearson", conf.level = 0.8)
corr
## 
##  Pearson's product-moment correlation
## 
## data:  train$YearBuilt and train$LotArea
## t = 0.54332, df = 1458, p-value = 0.587
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  -0.01934322  0.04776648
## sample estimates:
##        cor 
## 0.01422765
# to see if there is the correlation between the saleprice and yearbuilt with 80% confidence interval
corr <- cor.test(train$SalePrice, train$YearBuilt, method = "pearson", conf.level = 0.8)
corr
## 
##  Pearson's product-moment correlation
## 
## data:  train$SalePrice and train$YearBuilt
## t = 23.424, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.4980766 0.5468619
## sample estimates:
##       cor 
## 0.5228973
# to see if there is the correlation between the saleprice and lotarea with 80% confidence interval
corr <- cor.test(train$SalePrice, train$LotArea, method = "pearson", conf.level = 0.8)
corr
## 
##  Pearson's product-moment correlation
## 
## data:  train$SalePrice and train$LotArea
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.2323391 0.2947946
## sample estimates:
##       cor 
## 0.2638434
#to see the saleprice histogram
hist(train$SalePrice)

library(MASS)
fit <- fitdistr(train$SalePrice, densfun = "exponential")
lambda <- fit$estimate
redist <- rexp(500, lambda)
summary(redist)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     166.6   52233.5  123229.4  180946.0  256535.1 1245643.3
#to see the redist histogram
hist(redist)

# to see the 5th precentile value
p05 <- round(log(1-0.05)/-lambda,2)
p05
##    rate 
## 9280.04
# to see the 95th percentile value
p95 <- round(log(1-0.95)/-lambda,2)
p95
##     rate 
## 541991.5
# to calculate the multiple regression, and create a model lets choose and compare correct variables

x <- lm(train$SalePrice ~ train$LotArea + train$YearBuilt)
summary(x)
## 
## Call:
## lm(formula = train$SalePrice ~ train$LotArea + train$YearBuilt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -258044  -37455  -14472   21135  520542 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -2.533e+06  1.104e+05  -22.93   <2e-16 ***
## train$LotArea    2.041e+00  1.695e-01   12.04   <2e-16 ***
## train$YearBuilt  1.366e+03  5.602e+01   24.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 64620 on 1457 degrees of freedom
## Multiple R-squared:  0.3392, Adjusted R-squared:  0.3383 
## F-statistic: 373.9 on 2 and 1457 DF,  p-value: < 2.2e-16

It seems like 33.9% of the variation in the SalePrice can be predicted in the model.

# to see if location, bedroom, features and qualities of the house makes any difference on the price of the house we will create another model
y <- lm(train$SalePrice ~ train$TotRmsAbvGrd + train$BedroomAbvGr + train$FullBath + train$HalfBath + train$GarageCars + train$LotArea + train$OpenPorchSF + train$WoodDeckSF)
summary(y)
## 
## Call:
## lm(formula = train$SalePrice ~ train$TotRmsAbvGrd + train$BedroomAbvGr + 
##     train$FullBath + train$HalfBath + train$GarageCars + train$LotArea + 
##     train$OpenPorchSF + train$WoodDeckSF)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -242103  -28149   -2007   23342  418013 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -9.066e+03  5.892e+03  -1.539    0.124    
## train$TotRmsAbvGrd  1.703e+04  1.304e+03  13.063  < 2e-16 ***
## train$BedroomAbvGr -2.131e+04  2.235e+03  -9.531  < 2e-16 ***
## train$FullBath      3.015e+04  3.083e+03   9.778  < 2e-16 ***
## train$HalfBath      1.176e+04  2.820e+03   4.171 3.21e-05 ***
## train$GarageCars    3.751e+04  2.093e+03  17.924  < 2e-16 ***
## train$LotArea       8.947e-01  1.346e-01   6.645 4.29e-11 ***
## train$OpenPorchSF   1.119e+02  2.075e+01   5.392 8.11e-08 ***
## train$WoodDeckSF    7.916e+01  1.082e+01   7.317 4.19e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49550 on 1451 degrees of freedom
## Multiple R-squared:  0.6132, Adjusted R-squared:  0.611 
## F-statistic: 287.5 on 8 and 1451 DF,  p-value: < 2.2e-16

It seems 61.32% of the variation in the SalePrice can be predicted in the model. This model is more stronger than the previous one. Because of these variables the price of the house is predicted.

qqnorm(resid(y))
qqline(resid(y))

Even though there are outliers on this Normal Q-Q plot, this is a fit model.

My Kaggle account is : maharjansudhan