For this project, I decided to investigate house prices in the rural NY area. I wanted to know which factors affect the price: the number of bedrooms, the size, and so on. The conclusion is at the end.
library(RCurl)
## Loading required package: bitops
# load the required package
library(ggplot2)
df <- read.csv(url("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/Stat2Data/HousesNY.csv") , header = TRUE)
summary(df)
## X Price Beds Baths
## Min. : 1 Min. : 38.5 Min. :2.000 Min. :1.000
## 1st Qu.:14 1st Qu.: 82.7 1st Qu.:3.000 1st Qu.:1.500
## Median :27 Median :107.0 Median :3.000 Median :2.000
## Mean :27 Mean :113.6 Mean :3.396 Mean :1.858
## 3rd Qu.:40 3rd Qu.:141.0 3rd Qu.:4.000 3rd Qu.:2.000
## Max. :53 Max. :197.5 Max. :6.000 Max. :3.500
## Size Lot
## Min. :0.712 Min. :0.0000
## 1st Qu.:1.296 1st Qu.:0.2700
## Median :1.528 Median :0.4200
## Mean :1.678 Mean :0.7985
## 3rd Qu.:2.060 3rd Qu.:1.1000
## Max. :3.100 Max. :3.5000
str(df)
## 'data.frame': 53 obs. of 6 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Price: num 57.6 120 150 143 92.5 ...
## $ Beds : int 3 6 4 3 3 2 2 4 4 3 ...
## $ Baths: num 2 2 2 2 1 1 2 3 2.5 2 ...
## $ Size : num 0.96 2.79 1.7 1.2 1.33 ...
## $ Lot : num 1.3 0.23 0.27 0.8 0.42 0.34 0.29 0.21 1 0.3 ...
#head(df)
sub_df <- subset(df, Price >= 38 & Size >= 1.5)
head(sub_df, 30)
## X Price Beds Baths Size Lot
## 2 2 120.0 6 2.0 2.786 0.23
## 3 3 150.0 4 2.0 1.704 0.27
## 8 8 140.0 4 3.0 2.818 0.21
## 9 9 197.5 4 2.5 2.268 1.00
## 10 10 125.1 3 2.0 1.936 0.30
## 11 11 175.0 3 2.0 1.528 1.30
## 14 14 160.0 4 2.0 2.060 0.60
## 15 15 63.5 3 2.0 1.781 0.20
## 17 17 185.0 4 3.5 2.220 0.12
## 20 20 118.0 3 1.0 1.500 0.50
## 21 21 87.5 4 2.0 2.464 0.33
## 22 22 67.5 4 1.0 1.515 0.22
## 23 23 105.0 5 2.0 2.732 0.37
## 24 24 114.0 4 2.0 2.720 2.50
## 25 25 100.5 3 2.0 1.608 0.58
## 28 28 144.0 3 1.5 1.968 0.34
## 29 29 179.0 4 2.0 3.100 0.28
## 31 31 175.0 3 2.0 2.127 1.80
## 34 34 92.0 3 2.0 1.815 0.00
## 40 40 139.0 4 2.0 2.160 0.12
## 43 43 119.0 4 2.0 2.375 0.34
## 44 44 89.0 4 1.0 2.274 1.00
## 45 45 75.1 4 2.0 1.540 0.35
## 46 46 92.5 3 2.0 1.790 0.73
## 47 47 141.0 4 1.0 1.620 0.28
## 49 49 162.0 4 1.5 2.044 0.85
## 50 50 195.0 4 3.0 1.848 1.84
## 51 51 190.0 5 3.5 2.794 0.31
## 53 53 87.0 3 1.5 1.740 0.25
summary(sub_df)
## X Price Beds Baths
## Min. : 2.00 Min. : 63.5 Min. :3.000 Min. :1.0
## 1st Qu.:15.00 1st Qu.: 92.5 1st Qu.:3.000 1st Qu.:2.0
## Median :25.00 Median :125.1 Median :4.000 Median :2.0
## Mean :28.07 Mean :130.6 Mean :3.793 Mean :2.0
## 3rd Qu.:44.00 3rd Qu.:162.0 3rd Qu.:4.000 3rd Qu.:2.0
## Max. :53.00 Max. :197.5 Max. :6.000 Max. :3.5
## Size Lot
## Min. :1.500 Min. :0.0000
## 1st Qu.:1.740 1st Qu.:0.2500
## Median :2.044 Median :0.3400
## Mean :2.098 Mean :0.5938
## 3rd Qu.:2.375 3rd Qu.:0.7300
## Max. :3.100 Max. :2.5000
# rename columns
names(sub_df) <- c("Record", "price", "no.beds", "no.baths", "size", "lot")
head(sub_df, 30)
## Record price no.beds no.baths size lot
## 2 2 120.0 6 2.0 2.786 0.23
## 3 3 150.0 4 2.0 1.704 0.27
## 8 8 140.0 4 3.0 2.818 0.21
## 9 9 197.5 4 2.5 2.268 1.00
## 10 10 125.1 3 2.0 1.936 0.30
## 11 11 175.0 3 2.0 1.528 1.30
## 14 14 160.0 4 2.0 2.060 0.60
## 15 15 63.5 3 2.0 1.781 0.20
## 17 17 185.0 4 3.5 2.220 0.12
## 20 20 118.0 3 1.0 1.500 0.50
## 21 21 87.5 4 2.0 2.464 0.33
## 22 22 67.5 4 1.0 1.515 0.22
## 23 23 105.0 5 2.0 2.732 0.37
## 24 24 114.0 4 2.0 2.720 2.50
## 25 25 100.5 3 2.0 1.608 0.58
## 28 28 144.0 3 1.5 1.968 0.34
## 29 29 179.0 4 2.0 3.100 0.28
## 31 31 175.0 3 2.0 2.127 1.80
## 34 34 92.0 3 2.0 1.815 0.00
## 40 40 139.0 4 2.0 2.160 0.12
## 43 43 119.0 4 2.0 2.375 0.34
## 44 44 89.0 4 1.0 2.274 1.00
## 45 45 75.1 4 2.0 1.540 0.35
## 46 46 92.5 3 2.0 1.790 0.73
## 47 47 141.0 4 1.0 1.620 0.28
## 49 49 162.0 4 1.5 2.044 0.85
## 50 50 195.0 4 3.0 1.848 1.84
## 51 51 190.0 5 3.5 2.794 0.31
## 53 53 87.0 3 1.5 1.740 0.25
#get mean/median values
mean(sub_df$price)
## [1] 130.6276
mean(sub_df$size)
## [1] 2.097759
median(sub_df$price)
## [1] 125.1
median(sub_df$size)
## [1] 2.044
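Note that for this subset the mean and median of both price and size are higher than for the whole dataset: the mean price is 130.6 versus 113.6 and the median price is 125.1 versus 107.0, while the mean size is 2.098 versus 1.678 and the median size is 2.044 versus 1.528. This is expected, because the subset keeps only houses with Size >= 1.5.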
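The comparison between a subset and the full data can also be computed side by side instead of being read off two separate summaries. The following is a minimal sketch on a small made-up data frame: `toy`, `toy_sub` and the helper `compare_stats` are illustrative names, and the values are not taken from HousesNY.

```r
# Toy data for illustration only; these values are made up, not from HousesNY.
toy <- data.frame(
  Price = c(50, 80, 120, 150, 200),
  Size  = c(0.9, 1.2, 1.6, 2.0, 2.8)
)

# Same filter as in the analysis above.
toy_sub <- subset(toy, Price >= 38 & Size >= 1.5)

# Hypothetical helper: mean and median of each column, full data vs subset.
compare_stats <- function(full, sub, cols) {
  sapply(cols, function(col) {
    c(full_mean   = mean(full[[col]]),
      sub_mean    = mean(sub[[col]]),
      full_median = median(full[[col]]),
      sub_median  = median(sub[[col]]))
  })
}

# One column per variable, one row per statistic.
stats <- compare_stats(toy, toy_sub, c("Price", "Size"))
print(stats)
```

Calling `compare_stats(df, sub_df, c("Price", "Size"))` before the columns are renamed should give the same kind of table for the real data.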
# Group by price range
# compare the price column itself (not the whole data frame) against each threshold
sub_df$price <- ifelse(is.na(sub_df$price), 'NOTFOUND', ifelse(sub_df$price <= 38, 'Low_price', ifelse(sub_df$price > 38 & sub_df$price <= 100, 'Median_price', 'High_price')))
head(sub_df, 30)
## Record price no.beds no.baths size lot
## 2 2 High_price 6 2.0 2.786 0.23
## 3 3 High_price 4 2.0 1.704 0.27
## 8 8 High_price 4 3.0 2.818 0.21
## 9 9 High_price 4 2.5 2.268 1.00
## 10 10 High_price 3 2.0 1.936 0.30
## 11 11 High_price 3 2.0 1.528 1.30
## 14 14 High_price 4 2.0 2.060 0.60
## 15 15 Median_price 3 2.0 1.781 0.20
## 17 17 High_price 4 3.5 2.220 0.12
## 20 20 High_price 3 1.0 1.500 0.50
## 21 21 Median_price 4 2.0 2.464 0.33
## 22 22 Median_price 4 1.0 1.515 0.22
## 23 23 High_price 5 2.0 2.732 0.37
## 24 24 High_price 4 2.0 2.720 2.50
## 25 25 High_price 3 2.0 1.608 0.58
## 28 28 High_price 3 1.5 1.968 0.34
## 29 29 High_price 4 2.0 3.100 0.28
## 31 31 High_price 3 2.0 2.127 1.80
## 34 34 Median_price 3 2.0 1.815 0.00
## 40 40 High_price 4 2.0 2.160 0.12
## 43 43 High_price 4 2.0 2.375 0.34
## 44 44 Median_price 4 1.0 2.274 1.00
## 45 45 Median_price 4 2.0 1.540 0.35
## 46 46 Median_price 3 2.0 1.790 0.73
## 47 47 High_price 4 1.0 1.620 0.28
## 49 49 High_price 4 1.5 2.044 0.85
## 50 50 High_price 4 3.0 1.848 1.84
## 51 51 High_price 5 3.5 2.794 0.31
## 53 53 Median_price 3 1.5 1.740 0.25
# show the frequency of each price category in the subset data
c2 <- ggplot(sub_df, aes(price))
c2 + geom_bar(fill = "purple")
As the chart above shows, High_price houses (priced above 100) are the most frequent in the subset, with 21 of the 29 houses. The remaining 8 are Median_price. No house falls in the Low_price category, because the lowest price in the subset is 63.5.
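The binning step can also be written with base R's `cut()`, which avoids nested `ifelse()` calls and keeps the bins as an ordered factor. This is a minimal sketch, assuming the thresholds 38 and 100 used in this analysis; the vector `price` is copied from the subset listing earlier in the report, and the name `price_range` is illustrative.

```r
# Prices of the 29 houses in the subset, copied from the listing above.
price <- c(120.0, 150.0, 140.0, 197.5, 125.1, 175.0, 160.0, 63.5, 185.0,
           118.0, 87.5, 67.5, 105.0, 114.0, 100.5, 144.0, 179.0, 175.0,
           92.0, 139.0, 119.0, 89.0, 75.1, 92.5, 141.0, 162.0, 195.0,
           190.0, 87.0)

# cut() assigns each value to an interval; right = TRUE (the default) makes
# the intervals (-Inf, 38], (38, 100] and (100, Inf].
price_range <- cut(price,
                   breaks = c(-Inf, 38, 100, Inf),
                   labels = c("Low_price", "Median_price", "High_price"))

# Count houses per bin; empty bins are kept because the result is a factor.
counts <- table(price_range, useNA = "ifany")
print(counts)
```

Because `price_range` is a factor with fixed levels, a bar chart built from it keeps the bins in Low, Median, High order rather than alphabetical order.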
library(ggplot2)
p <- ggplot(sub_df, aes(price, no.beds))
#p + geom_label(aes(label = price), nudge_x = 1, nudge_y = 1, check_overlap = TRUE)
p + geom_violin(scale = "area", fill = "red", linetype = "dotted", color = "blue")
Note that the High_price houses (priced above 100) have between 3 and 6 bedrooms, and most of them have 4. The Median_price houses (priced between 38 and 100) are split evenly between 3 and 4 bedrooms, and none of them has more than 4. The subset contains no house priced at or below 38.
# inspect the dataset house sizes regarding no. of beds based on the price.
ggplot(sub_df, aes(size, no.beds, color = price)) + geom_point() + stat_smooth(method=lm, se=FALSE, fullrange=TRUE)
# inspect the dataset house prices regarding no. of beds and baths.
ggplot(sub_df, aes(no.beds, no.baths, color = price)) + geom_point() + stat_smooth(method=lm, se=FALSE, fullrange=TRUE)
As the plots show, neither the size of the house nor the number of bedrooms clearly determines the price category. For instance, 4-bedroom houses appear in more than one price category. However, many of the houses have 4 bedrooms and a size between 2 and 2.5.
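The visual impression from the plots can be checked numerically with correlation coefficients. This is a minimal sketch, assuming the 29 subset rows listed earlier in the report; the vectors `price`, `beds` and `size` are copied from that listing, and the original numeric prices are used rather than the price categories.

```r
# Numeric columns of the 29-row subset, copied from the listing above.
price <- c(120.0, 150.0, 140.0, 197.5, 125.1, 175.0, 160.0, 63.5, 185.0,
           118.0, 87.5, 67.5, 105.0, 114.0, 100.5, 144.0, 179.0, 175.0,
           92.0, 139.0, 119.0, 89.0, 75.1, 92.5, 141.0, 162.0, 195.0,
           190.0, 87.0)
beds  <- c(6, 4, 4, 4, 3, 3, 4, 3, 4, 3, 4, 4, 5, 4, 3, 3, 4, 3, 3, 4, 4,
           4, 4, 3, 4, 4, 4, 5, 3)
size  <- c(2.786, 1.704, 2.818, 2.268, 1.936, 1.528, 2.060, 1.781, 2.220,
           1.500, 2.464, 1.515, 2.732, 2.720, 1.608, 1.968, 3.100, 2.127,
           1.815, 2.160, 2.375, 2.274, 1.540, 1.790, 1.620, 2.044, 1.848,
           2.794, 1.740)

# Pearson correlation of price with each candidate factor.
# Values near 0 indicate a weak linear relationship, values near 1 or -1 a strong one.
cors <- c(size = cor(price, size), beds = cor(price, beds))
print(round(cors, 2))
```

With only 29 rows these coefficients are rough indicators and should be read alongside the plots rather than in place of them.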
Study objective:
The main aim of this study is to identify the factors that affect house prices: the size of the house, the number of bedrooms, or other factors.
Main outcome:
The results should show which factors affect house prices.
Results:
Conclusion:
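For the given dataset, more variables are needed to reach a definite conclusion. For instance, the geographic location may affect the price dramatically. However, from the results obtained we can say that houses with 4 or more bedrooms generally have higher prices: in the subset, 15 of the 19 houses with 4 or more bedrooms are priced above 100, compared with 6 of the 10 three-bedroom houses.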
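The link between bedrooms and price can be tabulated directly. This is a minimal sketch, assuming the 29 subset rows listed earlier in the report and the threshold of 100 used for the price categories; the vectors `price` and `beds` are copied from that listing, and the group labels are illustrative.

```r
# Price and bedroom count of the 29 houses in the subset, copied from the listing.
price <- c(120.0, 150.0, 140.0, 197.5, 125.1, 175.0, 160.0, 63.5, 185.0,
           118.0, 87.5, 67.5, 105.0, 114.0, 100.5, 144.0, 179.0, 175.0,
           92.0, 139.0, 119.0, 89.0, 75.1, 92.5, 141.0, 162.0, 195.0,
           190.0, 87.0)
beds  <- c(6, 4, 4, 4, 3, 3, 4, 3, 4, 3, 4, 4, 5, 4, 3, 3, 4, 3, 3, 4, 4,
           4, 4, 3, 4, 4, 4, 5, 3)

# Cross-tabulate bedroom group against price group.
# Every house in the subset has at least 3 bedrooms, so the two groups cover all rows.
tab <- table(beds  = ifelse(beds >= 4, "4+ beds", "3 beds"),
             price = ifelse(price > 100, "above 100", "100 or less"))
print(tab)

# Row proportions: the share of each bedroom group that is priced above 100.
print(round(prop.table(tab, margin = 1), 2))
```

The counts are small, so the difference between the two groups is suggestive rather than conclusive.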
The steps to place the dataset in your own GitHub repo are the following:
Go to the dataset source link and download the required dataset to your local machine.
After downloading the file, add and commit it to git from your terminal, then push it to your GitHub repo:
Create a new folder on your desktop and move the downloaded file into it.
Run “git init” inside the folder to make it a git repository.
Run “git add .” to add the files to git.
Run “git commit -m "any message you want"” to commit them.
Run “git push origin master” to push your changes to GitHub (this assumes a remote named origin already points to your GitHub repo).
Note that you need git installed on your local machine; you can check by running git --version.
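The git check can also be done from the R session before switching to the terminal. This is a minimal sketch using base R only; the variable name `git_path` is illustrative, and it only checks that a `git` executable is on the PATH, not that a GitHub remote is configured.

```r
# Look up the git executable on the PATH; the result is "" when it is not found.
git_path <- Sys.which("git")

if (nzchar(git_path)) {
  message("git found at: ", git_path)
  # Ask git for its version, the same check as running it in a terminal.
  print(system2("git", "--version", stdout = TRUE))
} else {
  message("git was not found on the PATH; install it before running the steps above")
}
```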
library(RCurl)
theurl <- "https://raw.githubusercontent.com/salma71/dataset2/master/HousesNY.csv"
df_hou <- read.table(file=theurl, header=TRUE, sep = ",")
summary(df_hou)
## X Price Beds Baths
## Min. : 1 Min. : 38.5 Min. :2.000 Min. :1.000
## 1st Qu.:14 1st Qu.: 82.7 1st Qu.:3.000 1st Qu.:1.500
## Median :27 Median :107.0 Median :3.000 Median :2.000
## Mean :27 Mean :113.6 Mean :3.396 Mean :1.858
## 3rd Qu.:40 3rd Qu.:141.0 3rd Qu.:4.000 3rd Qu.:2.000
## Max. :53 Max. :197.5 Max. :6.000 Max. :3.500
## Size Lot
## Min. :0.712 Min. :0.0000
## 1st Qu.:1.296 1st Qu.:0.2700
## Median :1.528 Median :0.4200
## Mean :1.678 Mean :0.7985
## 3rd Qu.:2.060 3rd Qu.:1.1000
## Max. :3.100 Max. :3.5000