This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter.
Sys.setlocale("LC_ALL", my_locale)
OS reports request to set locale to "LC_COLLATE=English_India.1252;LC_CTYPE=English_India.1252;LC_MONETARY=English_India.1252;LC_NUMERIC=C;LC_TIME=English_India.1252" cannot be honored
[1] ""
This dataset consists of six variables: bedrooms, bathrooms, sqft_living, sqft_lot, floors and price. Here price is the dependent variable and the other five are independent variables.
View(House_Price_Kaggle)
The View function opens the dataset in a spreadsheet-style viewer so that we can inspect it in R.
summary(House_Price_Kaggle)
    bedrooms        bathrooms      sqft_living       sqft_lot
 Min.   : 0.000   Min.   :0.000   Min.   :  290   Min.   :    520
 1st Qu.: 3.000   1st Qu.:1.750   1st Qu.: 1427   1st Qu.:   5040
 Median : 3.000   Median :2.250   Median : 1910   Median :   7618
 Mean   : 3.371   Mean   :2.115   Mean   : 2080   Mean   :  15107
 3rd Qu.: 4.000   3rd Qu.:2.500   3rd Qu.: 2550   3rd Qu.:  10688
 Max.   :33.000   Max.   :8.000   Max.   :13540   Max.   :1651359
     floors          price
 Min.   :1.000   Min.   :  75000
 1st Qu.:1.000   1st Qu.: 321950
 Median :1.500   Median : 450000
 Mean   :1.494   Mean   : 540088
 3rd Qu.:2.000   3rd Qu.: 645000
 Max.   :3.500   Max.   :7700000
summary gives us the minimum, maximum, quartiles, mean and median of each variable. This gives us a basic understanding of our dataset.
str(House_Price_Kaggle)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 21613 obs. of 6 variables:
$ bedrooms : num 3 3 2 4 3 4 3 3 3 3 ...
$ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
$ sqft_living: num 1180 2570 770 1960 1680 ...
$ sqft_lot : num 5650 7242 10000 5000 8080 ...
$ floors : num 1 2 1 1 1 1 2 1 1 2 ...
$ price : num 221900 538000 180000 604000 510000 ...
str shows the structure of the dataset, that is, the type of each column, so we can find out which variables are character and which are numeric. Here all six variables are numeric.
Calling plot on the whole dataset gives a matrix of scatter plots, one for each pair of variables.
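The call that produced this plot is not shown in the notebook, so the following is only a sketch of what it may have looked like. The `houses` data frame is made-up stand-in data with the same six column names; it is an assumption and not the real House_Price_Kaggle file.

```r
# Made-up stand-in data (assumption): same column names as the real dataset.
set.seed(1)
n <- 100
houses <- data.frame(
  bedrooms    = sample(1:6, n, replace = TRUE),
  bathrooms   = sample(c(1, 1.5, 2, 2.5, 3), n, replace = TRUE),
  sqft_living = round(runif(n, 500, 4000)),
  sqft_lot    = round(runif(n, 1000, 20000)),
  floors      = sample(c(1, 1.5, 2), n, replace = TRUE)
)
houses$price <- 50000 + 250 * houses$sqft_living + rnorm(n, sd = 50000)

# plot() on a data frame draws one scatter plot per pair of columns.
plot(houses)
```

With the real data, the same call would be plot(House_Price_Kaggle).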
scatter.smooth(House_Price_Kaggle)
pseudoinverse used at 3
neighborhood radius 1
reciprocal condition number 0
There are other near singularities as well. 1
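scatter.smooth gives us a scatter plot like plot, with a smoothed curve added. When it is given the whole dataset it uses only the first two columns (bedrooms and bathrooms), so here we get a single plot instead of one for every pair of variables.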
This boxplot is for the bedrooms variable of the dataset.
This boxplot is for the bathrooms variable of the dataset.
This boxplot is for the sqft_living variable of the dataset.
This boxplot is for the sqft_lot variable of the dataset.
This boxplot is for the floors variable of the dataset.
This boxplot is for the price variable of the dataset.
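The boxplot calls themselves did not survive in this notebook. A minimal sketch of how one boxplot per variable could be drawn is given here. The `houses` data frame is made-up stand-in data (an assumption) with only two of the six columns, to keep the example short.

```r
# Made-up stand-in data (assumption), two columns only.
set.seed(2)
houses <- data.frame(
  bedrooms = sample(1:6, 100, replace = TRUE),
  price    = round(runif(100, 75000, 1000000))
)

# One boxplot per variable, each titled with its column name.
for (v in names(houses)) {
  boxplot(houses[[v]], main = paste("Boxplot of", v))
}

# plot = FALSE returns the numbers behind the plot instead of drawing it.
price_box <- boxplot(houses$price, plot = FALSE)
price_box$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
price_box$out    # values that would be drawn as outliers (may be empty)
```

Looking at `price_box$out` is a quick way to list the extreme values that the boxplot marks as separate points.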
This scatter plot compares Bathrooms and Price of the house.
This scatter plot compares sqft_living and Price of the house.
This scatter plot compares sqft_lot and Price of the house.
This scatter plot compares Floors and Price of the house.
scatter.smooth(House_Price_Kaggle$price,House_Price_Kaggle$bathrooms,col=c('brown4','green'),main='Price vs Bathrooms')
scatter.smooth(House_Price_Kaggle$price,House_Price_Kaggle$sqft_living,col=c('green','brown1'),main='Price vs Sqft_living')
scatter.smooth(House_Price_Kaggle$price,House_Price_Kaggle$sqft_lot,col=c('green','black'),main='Price vs Sqft_lot')
scatter.smooth(House_Price_Kaggle$price,House_Price_Kaggle$floors,col=c('green','mediumvioletred'),main='Price vs Floors')
Now if we want to see all four graphs on a single screen we use the par function. We give the number of rows and the number of columns in mfrow, and the graphs that follow are drawn together in that grid.
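A sketch of the par step is given here, since the call is not shown in the notebook. The `houses` data frame is made-up stand-in data (an assumption), and saving and restoring the old settings in `old_par` is an addition for tidiness, not something the notebook is known to do.

```r
# Made-up stand-in data (assumption).
set.seed(3)
n <- 100
houses <- data.frame(
  bathrooms   = sample(c(1, 1.5, 2, 2.5, 3), n, replace = TRUE),
  sqft_living = round(runif(n, 500, 4000)),
  sqft_lot    = round(runif(n, 1000, 20000)),
  floors      = sample(c(1, 1.5, 2), n, replace = TRUE)
)
houses$price <- 50000 + 250 * houses$sqft_living + rnorm(n, sd = 50000)

# mfrow = c(2, 2) means 2 rows and 2 columns; par() returns the old settings.
old_par <- par(mfrow = c(2, 2))
scatter.smooth(houses$price, houses$bathrooms,   main = "Price vs Bathrooms")
scatter.smooth(houses$price, houses$sqft_living, main = "Price vs Sqft_living")
scatter.smooth(houses$price, houses$sqft_lot,    main = "Price vs Sqft_lot")
scatter.smooth(houses$price, houses$floors,      main = "Price vs Floors")
par(old_par)  # go back to one plot per screen
```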
cor(House_Price_Kaggle)
              bedrooms  bathrooms sqft_living     sqft_lot       floors      price
bedrooms    1.00000000 0.51588364   0.5766707  0.031703243  0.175428935 0.30834960
bathrooms   0.51588364 1.00000000   0.7546653  0.087739662  0.500653173 0.52513751
sqft_living 0.57667069 0.75466528   1.0000000  0.172825661  0.353949290 0.70203505
sqft_lot    0.03170324 0.08773966   0.1728257  1.000000000 -0.005200991 0.08966086
floors      0.17542894 0.50065317   0.3539493 -0.005200991  1.000000000 0.25679389
price       0.30834960 0.52513751   0.7020351  0.089660861  0.256793888 1.00000000
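Now we check the correlation between all the variables to determine the strength of their relationships. Price is most strongly correlated with sqft_living (0.70) and least with sqft_lot (0.09).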
In order to plot the correlation coefficients we load the corrplot library and call its corrplot function on the correlation matrix. We assign the result to a variable and run that variable.
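The corrplot call is not shown in the notebook, so this is only a sketch of how it might look. It assumes the corrplot package is installed and falls back to printing the matrix if it is not. The `houses` data, the variable names `corr_matrix` and `corr_plot`, and the choice of method = "circle" are all assumptions.

```r
# Made-up stand-in data (assumption).
set.seed(4)
n <- 100
houses <- data.frame(
  bedrooms    = sample(1:6, n, replace = TRUE),
  sqft_living = round(runif(n, 500, 4000)),
  sqft_lot    = round(runif(n, 1000, 20000))
)
houses$price <- 50000 + 250 * houses$sqft_living + rnorm(n, sd = 50000)

# Pairwise Pearson correlations between all columns.
corr_matrix <- cor(houses)

if (requireNamespace("corrplot", quietly = TRUE)) {
  # Draws the matrix as a grid of circles sized by correlation strength.
  corr_plot <- corrplot::corrplot(corr_matrix, method = "circle")
} else {
  print(round(corr_matrix, 2))
}
```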
Regprice
Call:
lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot +
floors, data = House_Price_Kaggle)
Coefficients:
(Intercept)     bedrooms    bathrooms  sqft_living     sqft_lot       floors
  8.066e+04   -5.953e+04    6.958e+03    3.143e+02   -3.788e-01   -1.758e+03
After checking correlation we move on to regression. To perform regression we use the lm function with price as the dependent variable and the other five variables as predictors, and we store the fitted model in a variable (Regprice).
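The formula in the printed Call can be turned back into the lm call roughly as follows. The `houses` data frame is made-up stand-in data (an assumption), so the coefficients it gives will not match the ones printed above; `reg_price` is also just an illustrative name.

```r
# Made-up stand-in data (assumption).
set.seed(5)
n <- 200
houses <- data.frame(
  bedrooms    = sample(1:6, n, replace = TRUE),
  bathrooms   = sample(c(1, 1.5, 2, 2.5, 3), n, replace = TRUE),
  sqft_living = round(runif(n, 500, 4000)),
  sqft_lot    = round(runif(n, 1000, 20000)),
  floors      = sample(c(1, 1.5, 2), n, replace = TRUE)
)
houses$price <- 50000 + 250 * houses$sqft_living + rnorm(n, sd = 50000)

# price is the response; the five other columns are the predictors.
reg_price <- lm(price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors,
                data = houses)
print(reg_price)  # shows the Call and the fitted coefficients
```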
summary(Regprice)
Call:
lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot +
floors, data = House_Price_Kaggle)
Residuals:
     Min       1Q   Median       3Q      Max
-1573404  -143855   -22380   102493  4148365
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  8.066e+04  7.696e+03  10.481   <2e-16 ***
bedrooms    -5.953e+04  2.351e+03 -25.319   <2e-16 ***
bathrooms    6.958e+03  3.809e+03   1.827   0.0678 .
sqft_living  3.143e+02  3.132e+00 100.355   <2e-16 ***
sqft_lot    -3.788e-01  4.320e-02  -8.769   <2e-16 ***
floors      -1.758e+03  3.776e+03  -0.466   0.6415
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 257400 on 21607 degrees of freedom
Multiple R-squared: 0.5087, Adjusted R-squared: 0.5086
F-statistic: 4474 on 5 and 21607 DF, p-value: < 2.2e-16
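The summary of the model gives us the p value of each variable, which we use to decide which variables to keep in the regression equation. Bathrooms (p = 0.0678) and floors (p = 0.6415) are above 0.05, so they are not significant at the 5% level.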
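Instead of reading the p values off the printed summary, they can also be pulled out in code. This is a sketch on made-up stand-in data (an assumption), so which variables pass the 0.05 cut-off here says nothing about the real dataset. The names `houses`, `reg_price`, `p_values` and `significant` are illustrative.

```r
# Made-up stand-in data (assumption): price depends only on sqft_living.
set.seed(6)
n <- 200
houses <- data.frame(
  bedrooms    = sample(1:6, n, replace = TRUE),
  sqft_living = round(runif(n, 500, 4000)),
  floors      = sample(c(1, 1.5, 2), n, replace = TRUE)
)
houses$price <- 50000 + 250 * houses$sqft_living + rnorm(n, sd = 50000)

reg_price <- lm(price ~ bedrooms + sqft_living + floors, data = houses)

# coef(summary(...)) is the coefficient table as a numeric matrix.
coef_table <- coef(summary(reg_price))
p_values <- coef_table[, "Pr(>|t|)"]

# Keep the predictors (not the intercept) with p value below 0.05.
predictors <- setdiff(names(p_values), "(Intercept)")
significant <- predictors[p_values[predictors] < 0.05]
print(significant)
```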
regfinal
Call:
lm(formula = price ~ bedrooms + sqft_living + sqft_lot, data = House_Price_Kaggle)
Coefficients:
(Intercept)     bedrooms  sqft_living     sqft_lot
  8.278e+04   -5.880e+04    3.179e+02   -3.818e-01
After checking the p values we are left with three significant variables (bedrooms, sqft_living and sqft_lot), so we form a new equation with those three variables.
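The printed coefficients of regfinal can be written out as an equation and evaluated by hand. The sketch below copies the four rounded coefficients from the output above, so its results are only approximate. The function name `predict_price` and the example house (3 bedrooms, 2000 sqft living, 5000 sqft lot) are hypothetical.

```r
# Coefficients copied from the printed regfinal output (rounded to 4 digits).
b <- c(intercept   = 8.278e+04,
       bedrooms    = -5.880e+04,
       sqft_living = 3.179e+02,
       sqft_lot    = -3.818e-01)

# price = intercept + b1 * bedrooms + b2 * sqft_living + b3 * sqft_lot
predict_price <- function(bedrooms, sqft_living, sqft_lot) {
  unname(b["intercept"] +
         b["bedrooms"]    * bedrooms +
         b["sqft_living"] * sqft_living +
         b["sqft_lot"]    * sqft_lot)
}

# Hypothetical house: 82780 - 176400 + 635800 - 1909
predict_price(3, 2000, 5000)  # about 540271
```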
my_prediction_price
      1
1061021
Finally, with the regression equation we have formed, we can predict the price for any given values of bedrooms, sqft_living and sqft_lot.
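The input that gave the prediction above is not shown, so the following only sketches how such a prediction could be made with predict. The `houses` data and the `new_house` values are made up (assumptions), and the predicted number will differ from the one printed above.

```r
# Made-up stand-in data (assumption).
set.seed(7)
n <- 200
houses <- data.frame(
  bedrooms    = sample(1:6, n, replace = TRUE),
  sqft_living = round(runif(n, 500, 4000)),
  sqft_lot    = round(runif(n, 1000, 20000))
)
houses$price <- 50000 + 250 * houses$sqft_living + rnorm(n, sd = 50000)

# Final model with the three variables that were kept.
reg_final <- lm(price ~ bedrooms + sqft_living + sqft_lot, data = houses)

# newdata must use the same column names as the predictors in the model.
new_house <- data.frame(bedrooms = 3, sqft_living = 2000, sqft_lot = 5000)
my_prediction <- predict(reg_final, newdata = new_house)
print(my_prediction)
```

predict applies the fitted coefficients to the new values, which is the same arithmetic as writing the regression equation out by hand.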