Your final is due by the end of day on 19 May. This project will show off your ability to understand the elements of the class. You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition: https://www.kaggle.com/c/house-prices-advanced-regression-techniques. I want you to do the following.
Pick one of the quantitative independent variables from the training data set (train.csv) and define that variable as X. Make sure this variable is skewed to the right! Pick the dependent variable and define it as Y.
library("tidyverse")
library("dplyr")
df = read.csv("https://raw.githubusercontent.com/tonyCUNY/tonyCUNY/main/train.csv")
Pick one of the quantitative independent variables from the training data set (train.csv) and define that variable as X. Make sure this variable is skewed to the right!
# TotalBsmtSF is the independent variable; its histogram below shows the right skew
ggplot(df, aes(x = TotalBsmtSF)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
X = df$TotalBsmtSF
Pick the dependent variable and define it as Y.
# SalePrice is the dependent variable with right-skewness
ggplot(df, aes(x = SalePrice)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Y = df$SalePrice
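As a quick numeric check of the right skew (beyond eyeballing the histograms), here is a small hand-rolled sketch of the moment-based skewness coefficient, which should come out clearly positive for both variables; the skew() helper is introduced here for illustration and is not part of the original code.
# Moment-based skewness: positive values indicate a long right tail
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3
skew(X)  # TotalBsmtSF: positive, i.e., right-skewed
skew(Y)  # SalePrice: positive, i.e., right-skewed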
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 3rd quartile of the X variable, and the small letter “y” is estimated as the 2nd quartile of the Y variable. Interpret the meaning of all probabilities: a. P(X > x | Y > y), b. P(X > x, Y > y), c. P(X < x | Y > y). In addition, make a table of counts as shown below.
summary(X)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 795.8 991.5 1057.4 1298.2 6110.0
# 3rd quartile of X:
x_3q = summary(X)["3rd Qu."]
summary(Y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
# 2nd quartile of Y equals the median, 163000
y_2q = summary(Y)["Median"]
# a. P(X > x | Y > y): given that SalePrice is above its median, there is a
# 45.2% chance that TotalBsmtSF is above its 3rd quartile.
X_1 = X > x_3q
Y_1 = Y > y_2q
n = sum(X_1 & Y_1)
d = sum(Y_1)
n/d
## [1] 0.4519231
# b. P(X > x, Y > y): about 22.5% of all houses have TotalBsmtSF above its
# 3rd quartile and SalePrice above its median.
n = sum(X_1 & Y_1)
d = nrow(df)
n/d
## [1] 0.2253425
# c. P(X < x | Y > y): given that SalePrice is above its median, there is a
# 54.8% chance that TotalBsmtSF is below its 3rd quartile.
X_2 = X < x_3q
Y_1 = Y > y_2q
n = sum(X_2 & Y_1)
d = sum(Y_1)
n/d
## [1] 0.5480769
# Table of counts. Strict inequalities are used, so the four observations with
# Y exactly equal to the median fall in no cell (totals are 1456, not 1460).
table_counts <- matrix(0, nrow = 3, ncol = 3)
colnames(table_counts) <- c("<2nd quartile", ">2nd quartile", "Total")
rownames(table_counts) <- c("<3rd quartile", ">3rd quartile", "Total")
table_counts[1, 1] <- sum(X < x_3q & Y < y_2q)
table_counts[1, 2] <- sum(X < x_3q & Y > y_2q)
table_counts[1, 3] <- table_counts[1, 1] + table_counts[1, 2]
table_counts[2, 1] <- sum(X > x_3q & Y < y_2q)
table_counts[2, 2] <- sum(X > x_3q & Y > y_2q)
table_counts[2, 3] <- table_counts[2, 1] + table_counts[2, 2]
table_counts[3, 1] <- table_counts[1, 1] + table_counts[2, 1]
table_counts[3, 2] <- table_counts[1, 2] + table_counts[2, 2]
table_counts[3, 3] <- table_counts[1, 3] + table_counts[2, 3]
print(table_counts)
##               <2nd quartile >2nd quartile Total
## <3rd quartile           693           399  1092
## >3rd quartile            35           329   364
## Total                   728           728  1456
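A more compact construction, for comparison, is table() with addmargins(); note that unlike the matrix above it partitions all 1460 rows (the observations with Y exactly at the median fall into the FALSE column), so its margins total 1460.
# Equivalent cross-tabulation with margins added automatically
addmargins(table(X > x_3q, Y > y_2q))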
Does splitting the training data in this fashion make them independent?
No. Splitting the training data this way does not make them independent: as shown below, P(A|B) ≈ 0.452 while P(A) ≈ 0.249, so knowing that a house's price is above the median changes the probability that its basement area is above the 3rd quartile.
Let A be the new variable counting those observations above the 3rd quartile for X, and let B be the new variable counting those observations above the 2nd quartile for Y. Does P(A|B) = P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.
The results below show that P(A|B) ≈ 0.452 while P(A)P(B) ≈ 0.125, so P(A|B) ≠ P(A)P(B), and A and B are not independent.
A = X > x_3q
B = Y > y_2q
# Find P(A|B)
P_A_given_B <- sum(A & B) / sum(B)
P_A_given_B
## [1] 0.4519231
# Find P(A) * P(B)
P_A <- sum(A) / length(A)
P_B <- sum(B) / length(B)
P_A * P_B
## [1] 0.1246575
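The textbook independence check compares the joint probability with the product of the marginals; it fails here as well.
# Find P(A and B), the joint probability
P_AB <- sum(A & B) / length(A)
P_AB
## [1] 0.2253425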
Evaluate by running a Chi Square test for association.
c_test <- chisq.test(table(A, B))
print(c_test)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(A, B)
## X-squared = 313.61, df = 1, p-value < 2.2e-16
Use the chi-square test to test the null hypothesis
H0: there is no association between the two categorical variables.
Since the p-value (< 2.2e-16) is far below the 0.05 significance level, we reject H0 and conclude that A and B are statistically dependent, consistent with the probability calculations above.
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot of X and Y. Provide a 95% CI for the difference in the mean of the variables. Derive a correlation matrix for two of the quantitative variables you selected. Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis.
Provide univariate descriptive statistics and appropriate plots for the training data set.
summary(X)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 795.8 991.5 1057.4 1298.2 6110.0
summary(Y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
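Only the numeric summaries are shown above; a minimal base-graphics sketch of the accompanying univariate plots could look like this.
# Histograms and boxplots for the two selected variables
par(mfrow = c(2, 2))
hist(X, main = "TotalBsmtSF", xlab = "Square feet")
boxplot(X, horizontal = TRUE, main = "TotalBsmtSF")
hist(Y, main = "SalePrice", xlab = "Dollars")
boxplot(Y, horizontal = TRUE, main = "SalePrice")
par(mfrow = c(1, 1))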
Provide a scatterplot of X and Y.
# Scatterplot of X (TotalBsmtSF) against Y (SalePrice)
ggplot(df, aes(x = TotalBsmtSF, y = SalePrice)) +
  geom_point() +
  labs(x = "TotalBsmtSF", y = "SalePrice", title = "Scatter Plot of TotalBsmtSF and SalePrice")
Provide a 95% CI for the difference in the mean of the variables.
The Welch two-sample t-test below gives a 95% confidence interval of (-183942.2, -175785.3) for the difference in means, mean(X) − mean(Y). (The two variables are measured in different units, square feet versus dollars, so the size of this difference mostly reflects the difference in scale.)
t.test(X, Y, conf.level = 0.95)
##
## Welch Two Sample t-test
##
## data: X and Y
## t = -86.509, df = 1459.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -183942.2 -175785.3
## sample estimates:
## mean of x mean of y
## 1057.429 180921.196
Derive a correlation matrix for two of the quantitative variables you selected.
df2 = df |>
select(TotalBsmtSF, SalePrice)
co_matrix = cor(df2)
co_matrix
## TotalBsmtSF SalePrice
## TotalBsmtSF 1.0000000 0.6135806
## SalePrice 0.6135806 1.0000000
Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis.
Since the p-value (< 2.2e-16) is below the 0.01 significance level implied by the 99% confidence interval, we reject the null hypothesis that the correlation is 0. The interval (0.570, 0.654) excludes zero, indicating a moderately strong positive association between TotalBsmtSF and SalePrice.
co_test <- cor.test(df2$TotalBsmtSF, df2$SalePrice, method = "pearson", conf.level = 0.99)
print(co_test)
##
## Pearson's product-moment correlation
##
## data: df2$TotalBsmtSF and df2$SalePrice
## t = 29.671, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
## 0.5697562 0.6539251
## sample estimates:
## cor
## 0.6135806
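As a cross-check (an illustrative sketch, not part of the original output), the 99% interval can be reproduced by hand with the Fisher z-transformation that cor.test uses internally; n_obs is a name introduced here to avoid clobbering the earlier n.
# Fisher z: atanh(r) is approximately normal with standard error 1/sqrt(n - 3)
r <- co_matrix[1, 2]
n_obs <- nrow(df2)
tanh(atanh(r) + c(-1, 1) * qnorm(0.995) / sqrt(n_obs - 3))  # ~ (0.570, 0.654)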
Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct principal components analysis (research this!) and interpret. Discuss.
Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)
precision_matrix = solve(co_matrix)
precision_matrix
## TotalBsmtSF SalePrice
## TotalBsmtSF 1.6038006 -0.9840609
## SalePrice -0.9840609 1.6038006
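As a sanity check, with two variables the diagonal of the precision matrix equals the familiar variance inflation factor 1/(1 − r²).
# VIF for the two-variable case: 1 / (1 - r^2)
1 / (1 - co_matrix[1, 2]^2)
## [1] 1.603801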
Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.
# Multiply the correlation matrix by the precision matrix
round(co_matrix %*% precision_matrix)
## TotalBsmtSF SalePrice
## TotalBsmtSF 1 0
## SalePrice 0 1
# Multiply the precision matrix by the correlation matrix; both products return
# the identity matrix (up to rounding), as expected for a matrix and its inverse
round(precision_matrix %*% co_matrix)
## TotalBsmtSF SalePrice
## TotalBsmtSF 1 0
## SalePrice 0 1
Conduct principal components analysis (research this!) and interpret. Discuss.
# The two variables are strongly correlated (r = 0.61), so PCA is worth running;
# note this is the mean of all correlation matrix entries, including the unit diagonal
mean(cor(df2))
## [1] 0.8067903
There are two ways to evaluate this PCA:
The component loadings show how much of each variable's variation a component carries. Here Comp.2 explains essentially all the variation of TotalBsmtSF while Comp.1 explains essentially all the variation of SalePrice, so each component simply reproduces one original variable. This is a consequence of running princomp() on the unscaled data: SalePrice's variance dwarfs TotalBsmtSF's, so the first component is dominated by scale (it accounts for 99.998% of the total variance in the summary below).
The correlations among the principal component scores are essentially zero, so the components are uncorrelated, as designed.
pca_result <- princomp(df2)
pca_result$loadings
##
## Loadings:
## Comp.1 Comp.2
## TotalBsmtSF 1.000
## SalePrice 1.000
##
## Comp.1 Comp.2
## SS loadings 1.0 1.0
## Proportion Var 0.5 0.5
## Cumulative Var 0.5 1.0
PC = pca_result$scores
cor(PC)
## Comp.1 Comp.2
## Comp.1 1.000000e+00 3.352131e-15
## Comp.2 3.352131e-15 1.000000e+00
summary(pca_result)
## Importance of components:
## Comp.1 Comp.2
## Standard deviation 79415.747781 3.462952e+02
## Proportion of Variance 0.999981 1.901391e-05
## Cumulative Proportion 0.999981 1.000000e+00
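One caveat worth noting: princomp() above was run on the raw data, where SalePrice's much larger variance lets Comp.1 absorb 99.998% of the total variance, so the loadings mostly reflect scale rather than structure. Here is a sketch of the standardized alternative (equivalent to PCA on the correlation matrix), where Comp.1 should capture (1 + r)/2, about 81%, of the variance and load equally on both variables.
# PCA on standardized variables; with two variables the loadings become
# (1, 1)/sqrt(2) and (1, -1)/sqrt(2), up to sign
pca_scaled <- prcomp(df2, scale. = TRUE)
summary(pca_scaled)
pca_scaled$rotation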
Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. For your variable that is skewed to the right, shift it so that the minimum value is above zero. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
For your variable that is skewed to the right, shift it so that the minimum value is above zero.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
X = df$TotalBsmtSF
summary(X)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 795.8 991.5 1057.4 1298.2 6110.0
# Adding a constant value to all observations so that the minimum value is above zero.
X2 = X + 1
summary(X2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 796.8 992.5 1058.4 1299.2 6111.0
Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ).
# Fit an exponential distribution to the shifted variable X2
# (for the exponential, the fitted rate is 1 / mean(X2))
exp_pdf = fitdistr(X2, "exponential")
exp_pdf
## rate
## 9.447961e-04
## (2.472645e-05)
Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)).
# Optimal lambda value (the fitted rate)
lambda = exp_pdf$estimate
lambda
## rate
## 0.0009447961
# Draw 1000 samples from the fitted exponential distribution
# (calling set.seed() first would make the draw reproducible)
samples <- rexp(1000, rate = lambda)
Plot a histogram and compare it with a histogram of your original variable.
The simulated sample decays smoothly away from zero, as an exponential density must, while the original variable is concentrated around its median of roughly 1,000 sq ft. Both are right-skewed, but the shapes clearly differ.
# Plot histograms
par(mfrow = c(1, 2))
hist(X, main = "Original Variable", xlab = "Value")
hist(samples, main = "Exponential Distribution", xlab = "Value")
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality.
# Percentiles from the fitted exponential CDF: q_p = -log(1 - p) / lambda
round(qexp(c(0.05, 0.95), rate = lambda), 2)
## [1]   54.29 3170.77
# Generate a 95% confidence interval from the empirical data
mean_x <- mean(X)
std_x <- sd(X)
n <- length(X)
z <- qnorm(1 - 0.05/2)
lower_ci <- mean_x - z * (std_x / sqrt(n))
upper_ci <- mean_x + z * (std_x / sqrt(n))
lower_ci
## [1] 1034.926
upper_ci
## [1] 1079.933
Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
# Empirical 5th and 95th percentiles of the original data
quantile(X, probs=c(0.05, 0.95))
## 5% 95%
## 519.3 1753.0
The exponential percentiles (about 54 and 3171) are far from the empirical 5th and 95th percentiles (519.3 and 1753). The exponential distribution does not describe the variability or shape of the original data, which implies it is a poor-fitting model for this variable.
An exponential density puts most of its mass near zero and decays monotonically, whereas TotalBsmtSF is concentrated between roughly 800 and 1,300 sq ft, so percentiles computed from the exponential PDF misrepresent both tails of the distribution.
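One more way to see the misfit: evaluating the fitted exponential CDF at the empirical percentiles should return roughly 0.05 and 0.95 if the model were adequate, but it returns values near 0.39 and 0.81 instead.
# Fitted exponential CDF evaluated at the empirical 5th and 95th percentiles
round(pexp(c(519.3, 1753.0), rate = lambda), 2)
## [1] 0.39 0.81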
Build some type of regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
# dplyr::select is called explicitly because MASS (loaded above) masks select()
df_mm <- dplyr::select(df, SalePrice, LotArea, YearBuilt, TotalBsmtSF, CentralAir, GrLivArea, FullBath, HalfBath, BedroomAbvGr, KitchenAbvGr, BsmtFullBath, TotRmsAbvGrd, Fireplaces, GarageArea, GarageCars)
head(df_mm)
## SalePrice LotArea YearBuilt TotalBsmtSF CentralAir GrLivArea FullBath
## 1 208500 8450 2003 856 Y 1710 2
## 2 181500 9600 1976 1262 Y 1262 2
## 3 223500 11250 2001 920 Y 1786 2
## 4 140000 9550 1915 756 Y 1717 1
## 5 250000 14260 2000 1145 Y 2198 2
## 6 143000 14115 1993 796 Y 1362 1
## HalfBath BedroomAbvGr KitchenAbvGr BsmtFullBath TotRmsAbvGrd Fireplaces
## 1 1 3 1 1 8 0
## 2 0 3 1 0 6 1
## 3 1 3 1 1 6 1
## 4 0 3 1 1 7 1
## 5 1 4 1 1 9 1
## 6 1 1 1 1 5 0
## GarageArea GarageCars
## 1 548 2
## 2 460 2
## 3 608 2
## 4 642 3
## 5 836 3
## 6 480 2
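Before fitting, a quick completeness check is worth running; none of these columns has missing values in train.csv, which is consistent with the full model below keeping all 1460 observations (1445 residual df plus 15 estimated coefficients).
# Confirm there are no missing values among the selected columns
colSums(is.na(df_mm))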
# Fit the full multiple regression model
m_full <- lm(SalePrice ~ LotArea + YearBuilt + TotalBsmtSF + CentralAir + GrLivArea + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + BsmtFullBath + TotRmsAbvGrd + Fireplaces + GarageArea + GarageCars, data = df_mm)
summary(m_full)
##
## Call:
## lm(formula = SalePrice ~ LotArea + YearBuilt + TotalBsmtSF +
## CentralAir + GrLivArea + FullBath + HalfBath + BedroomAbvGr +
## KitchenAbvGr + BsmtFullBath + TotRmsAbvGrd + Fireplaces +
## GarageArea + GarageCars, data = df_mm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -548095 -18314 -2128 16073 294903
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.570e+05 1.041e+05 -9.194 < 2e-16 ***
## LotArea 2.841e-01 1.115e-01 2.548 0.0109 *
## YearBuilt 5.061e+02 5.405e+01 9.363 < 2e-16 ***
## TotalBsmtSF 2.887e+01 3.346e+00 8.628 < 2e-16 ***
## CentralAirY 1.355e+03 4.696e+03 0.289 0.7730
## GrLivArea 6.224e+01 4.812e+00 12.935 < 2e-16 ***
## FullBath 7.667e+03 3.087e+03 2.483 0.0131 *
## HalfBath -1.302e+03 2.701e+03 -0.482 0.6298
## BedroomAbvGr -1.472e+04 1.818e+03 -8.097 1.18e-15 ***
## KitchenAbvGr -4.622e+04 5.257e+03 -8.792 < 2e-16 ***
## BsmtFullBath 1.115e+04 2.193e+03 5.083 4.21e-07 ***
## TotRmsAbvGrd 7.518e+03 1.367e+03 5.499 4.51e-08 ***
## Fireplaces 7.649e+03 1.927e+03 3.969 7.58e-05 ***
## GarageArea 7.548e+00 1.084e+01 0.696 0.4862
## GarageCars 1.596e+04 3.170e+03 5.037 5.34e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39380 on 1445 degrees of freedom
## Multiple R-squared: 0.7566, Adjusted R-squared: 0.7542
## F-statistic: 320.8 on 14 and 1445 DF, p-value: < 2.2e-16
Drop the variable with the highest p-value and re-fit the model.
# Drop CentralAir, whose coefficient (CentralAirY) has the highest p-value (0.7730)
m_full_drop1 <- lm(SalePrice ~ LotArea + YearBuilt + TotalBsmtSF + GrLivArea + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + BsmtFullBath + TotRmsAbvGrd + Fireplaces + GarageArea + GarageCars, data = df_mm)
summary(m_full_drop1)
##
## Call:
## lm(formula = SalePrice ~ LotArea + YearBuilt + TotalBsmtSF +
## GrLivArea + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr +
## BsmtFullBath + TotRmsAbvGrd + Fireplaces + GarageArea + GarageCars,
## data = df_mm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -548428 -18336 -2177 15914 294879
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.633e+05 1.017e+05 -9.470 < 2e-16 ***
## LotArea 2.842e-01 1.115e-01 2.549 0.0109 *
## YearBuilt 5.100e+02 5.229e+01 9.753 < 2e-16 ***
## TotalBsmtSF 2.891e+01 3.343e+00 8.646 < 2e-16 ***
## GrLivArea 6.220e+01 4.808e+00 12.936 < 2e-16 ***
## FullBath 7.608e+03 3.080e+03 2.470 0.0136 *
## HalfBath -1.285e+03 2.700e+03 -0.476 0.6341
## BedroomAbvGr -1.467e+04 1.810e+03 -8.107 1.09e-15 ***
## KitchenAbvGr -4.646e+04 5.184e+03 -8.962 < 2e-16 ***
## BsmtFullBath 1.116e+04 2.192e+03 5.089 4.08e-07 ***
## TotRmsAbvGrd 7.518e+03 1.367e+03 5.501 4.47e-08 ***
## Fireplaces 7.713e+03 1.914e+03 4.030 5.87e-05 ***
## GarageArea 7.681e+00 1.082e+01 0.710 0.4780
## GarageCars 1.595e+04 3.168e+03 5.034 5.41e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39370 on 1446 degrees of freedom
## Multiple R-squared: 0.7566, Adjusted R-squared: 0.7544
## F-statistic: 345.7 on 13 and 1446 DF, p-value: < 2.2e-16
# Drop HalfBath, now the highest p-value (0.6341)
m_full_drop2 <- lm(SalePrice ~ LotArea + YearBuilt + TotalBsmtSF + GrLivArea + FullBath + BedroomAbvGr + KitchenAbvGr + BsmtFullBath + TotRmsAbvGrd + Fireplaces + GarageArea + GarageCars, data = df_mm)
summary(m_full_drop2)
##
## Call:
## lm(formula = SalePrice ~ LotArea + YearBuilt + TotalBsmtSF +
## GrLivArea + FullBath + BedroomAbvGr + KitchenAbvGr + BsmtFullBath +
## TotRmsAbvGrd + Fireplaces + GarageArea + GarageCars, data = df_mm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -548702 -18199 -2289 15874 295458
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.442e+05 9.348e+04 -10.101 < 2e-16 ***
## LotArea 2.864e-01 1.114e-01 2.572 0.01021 *
## YearBuilt 5.001e+02 4.790e+01 10.441 < 2e-16 ***
## TotalBsmtSF 2.953e+01 3.076e+00 9.599 < 2e-16 ***
## GrLivArea 6.122e+01 4.348e+00 14.080 < 2e-16 ***
## FullBath 8.144e+03 2.866e+03 2.841 0.00455 **
## BedroomAbvGr -1.470e+04 1.809e+03 -8.126 9.45e-16 ***
## KitchenAbvGr -4.633e+04 5.175e+03 -8.952 < 2e-16 ***
## BsmtFullBath 1.119e+04 2.190e+03 5.110 3.65e-07 ***
## TotRmsAbvGrd 7.512e+03 1.366e+03 5.499 4.52e-08 ***
## Fireplaces 7.685e+03 1.913e+03 4.018 6.17e-05 ***
## GarageArea 8.158e+00 1.077e+01 0.757 0.44904
## GarageCars 1.584e+04 3.159e+03 5.014 5.99e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39360 on 1447 degrees of freedom
## Multiple R-squared: 0.7565, Adjusted R-squared: 0.7545
## F-statistic: 374.7 on 12 and 1447 DF, p-value: < 2.2e-16
# Drop GarageArea, now the highest p-value (0.44904)
m_full_drop3 <- lm(SalePrice ~ LotArea + YearBuilt + TotalBsmtSF + GrLivArea + FullBath + BedroomAbvGr + KitchenAbvGr + BsmtFullBath + TotRmsAbvGrd + Fireplaces + GarageCars, data = df_mm)
summary(m_full_drop3)
##
## Call:
## lm(formula = SalePrice ~ LotArea + YearBuilt + TotalBsmtSF +
## GrLivArea + FullBath + BedroomAbvGr + KitchenAbvGr + BsmtFullBath +
## TotRmsAbvGrd + Fireplaces + GarageCars, data = df_mm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -544995 -18339 -2089 15905 294855
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.433e+05 9.346e+04 -10.093 < 2e-16 ***
## LotArea 2.895e-01 1.113e-01 2.602 0.00937 **
## YearBuilt 4.997e+02 4.789e+01 10.435 < 2e-16 ***
## TotalBsmtSF 2.993e+01 3.030e+00 9.876 < 2e-16 ***
## GrLivArea 6.168e+01 4.306e+00 14.323 < 2e-16 ***
## FullBath 7.971e+03 2.857e+03 2.790 0.00533 **
## BedroomAbvGr -1.476e+04 1.807e+03 -8.172 6.55e-16 ***
## KitchenAbvGr -4.645e+04 5.172e+03 -8.981 < 2e-16 ***
## BsmtFullBath 1.130e+04 2.186e+03 5.168 2.70e-07 ***
## TotRmsAbvGrd 7.480e+03 1.365e+03 5.478 5.06e-08 ***
## Fireplaces 7.521e+03 1.900e+03 3.958 7.91e-05 ***
## GarageCars 1.777e+04 1.860e+03 9.557 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39350 on 1448 degrees of freedom
## Multiple R-squared: 0.7564, Adjusted R-squared: 0.7546
## F-statistic: 408.8 on 11 and 1448 DF, p-value: < 2.2e-16
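The manual backward elimination above can also be automated; here is a sketch using step(), which drops terms by AIC rather than by p-value and so may settle on a slightly different subset.
# Automated backward elimination by AIC (trace = 0 suppresses the step-by-step log)
m_step <- step(m_full, direction = "backward", trace = 0)
summary(m_step)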
Verify that the conditions for this model are reasonable using diagnostic plots.
Check for outliers: the residual histogram and Normal Q-Q plot show the residuals are nearly normal through the center of the distribution, though a few extreme residuals appear in the tails (consistent with the min and max residuals in the summary above).
The variability of the residuals is nearly constant: the Residuals vs Fitted plot shows a fairly uniform spread of residuals across the fitted values.
plot(m_full_drop3)
hist(m_full_drop3$residuals)
Perform predictions using the test.csv.
My Kaggle user name is TLeung1234, and my public score is 0.1977.
test = read.csv("https://raw.githubusercontent.com/tonyCUNY/tonyCUNY/main/test.csv")
# Keep only the numeric columns and replace missing values with 0
# (a crude imputation; median imputation would arguably serve better)
test <- select_if(test, is.numeric)
test[is.na(test)] <- 0
m_predict = predict(m_full_drop3, test)
kaggle <- as.data.frame(cbind(test$Id, m_predict ))
colnames(kaggle) <- c("Id", "SalePrice")
write.csv(kaggle, file = "data605_CL_submission.csv", quote=FALSE, row.names=FALSE)
knitr::include_graphics('score.png')