For this project, I decided to investigate house prices in the rural NY area. I wanted to know which factors affect the price: is it the number of bedrooms, the size, etc. The conclusion is at the end.

Q1: Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.

A1:

library(RCurl)
## Loading required package: bitops
# load the required package

library(ggplot2)
df <- read.csv(url("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/Stat2Data/HousesNY.csv") , header = TRUE)
summary(df)
##        X          Price            Beds           Baths      
##  Min.   : 1   Min.   : 38.5   Min.   :2.000   Min.   :1.000  
##  1st Qu.:14   1st Qu.: 82.7   1st Qu.:3.000   1st Qu.:1.500  
##  Median :27   Median :107.0   Median :3.000   Median :2.000  
##  Mean   :27   Mean   :113.6   Mean   :3.396   Mean   :1.858  
##  3rd Qu.:40   3rd Qu.:141.0   3rd Qu.:4.000   3rd Qu.:2.000  
##  Max.   :53   Max.   :197.5   Max.   :6.000   Max.   :3.500  
##       Size            Lot        
##  Min.   :0.712   Min.   :0.0000  
##  1st Qu.:1.296   1st Qu.:0.2700  
##  Median :1.528   Median :0.4200  
##  Mean   :1.678   Mean   :0.7985  
##  3rd Qu.:2.060   3rd Qu.:1.1000  
##  Max.   :3.100   Max.   :3.5000
str(df)
## 'data.frame':    53 obs. of  6 variables:
##  $ X    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Price: num  57.6 120 150 143 92.5 ...
##  $ Beds : int  3 6 4 3 3 2 2 4 4 3 ...
##  $ Baths: num  2 2 2 2 1 1 2 3 2.5 2 ...
##  $ Size : num  0.96 2.79 1.7 1.2 1.33 ...
##  $ Lot  : num  1.3 0.23 0.27 0.8 0.42 0.34 0.29 0.21 1 0.3 ...
#head(df)
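
Individual quartiles can also be pulled out directly instead of reading them off summary(). This is a minimal sketch, assuming only the first five prices as displayed (rounded) by str(df) above; the names first_prices, price_quartiles and price_iqr are hypothetical.

```r
# First five prices as displayed (rounded) by str(df); hypothetical vector name
first_prices <- c(57.6, 120, 150, 143, 92.5)

# Quartiles (25%, 50%, 75%) and the interquartile range of those prices
price_quartiles <- quantile(first_prices, probs = c(0.25, 0.5, 0.75))
price_iqr <- IQR(first_prices)

print(price_quartiles)
print(price_iqr)
```

On the full dataset the same calls would be applied to df$Price.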

Q2: Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example - if it makes sense you could sum two columns together)

A2:

# keep only the houses with a price of at least 38 and a size of at least 1.5
sub_df <- subset(df, Price >= 38 & Size >= 1.5)
head(sub_df, 30)
##     X Price Beds Baths  Size  Lot
## 2   2 120.0    6   2.0 2.786 0.23
## 3   3 150.0    4   2.0 1.704 0.27
## 8   8 140.0    4   3.0 2.818 0.21
## 9   9 197.5    4   2.5 2.268 1.00
## 10 10 125.1    3   2.0 1.936 0.30
## 11 11 175.0    3   2.0 1.528 1.30
## 14 14 160.0    4   2.0 2.060 0.60
## 15 15  63.5    3   2.0 1.781 0.20
## 17 17 185.0    4   3.5 2.220 0.12
## 20 20 118.0    3   1.0 1.500 0.50
## 21 21  87.5    4   2.0 2.464 0.33
## 22 22  67.5    4   1.0 1.515 0.22
## 23 23 105.0    5   2.0 2.732 0.37
## 24 24 114.0    4   2.0 2.720 2.50
## 25 25 100.5    3   2.0 1.608 0.58
## 28 28 144.0    3   1.5 1.968 0.34
## 29 29 179.0    4   2.0 3.100 0.28
## 31 31 175.0    3   2.0 2.127 1.80
## 34 34  92.0    3   2.0 1.815 0.00
## 40 40 139.0    4   2.0 2.160 0.12
## 43 43 119.0    4   2.0 2.375 0.34
## 44 44  89.0    4   1.0 2.274 1.00
## 45 45  75.1    4   2.0 1.540 0.35
## 46 46  92.5    3   2.0 1.790 0.73
## 47 47 141.0    4   1.0 1.620 0.28
## 49 49 162.0    4   1.5 2.044 0.85
## 50 50 195.0    4   3.0 1.848 1.84
## 51 51 190.0    5   3.5 2.794 0.31
## 53 53  87.0    3   1.5 1.740 0.25
summary(sub_df)
##        X             Price            Beds           Baths    
##  Min.   : 2.00   Min.   : 63.5   Min.   :3.000   Min.   :1.0  
##  1st Qu.:15.00   1st Qu.: 92.5   1st Qu.:3.000   1st Qu.:2.0  
##  Median :25.00   Median :125.1   Median :4.000   Median :2.0  
##  Mean   :28.07   Mean   :130.6   Mean   :3.793   Mean   :2.0  
##  3rd Qu.:44.00   3rd Qu.:162.0   3rd Qu.:4.000   3rd Qu.:2.0  
##  Max.   :53.00   Max.   :197.5   Max.   :6.000   Max.   :3.5  
##       Size            Lot        
##  Min.   :1.500   Min.   :0.0000  
##  1st Qu.:1.740   1st Qu.:0.2500  
##  Median :2.044   Median :0.3400  
##  Mean   :2.098   Mean   :0.5938  
##  3rd Qu.:2.375   3rd Qu.:0.7300  
##  Max.   :3.100   Max.   :2.5000
# rename columns
names(sub_df) <- c("Record", "price", "no.beds", "no.baths", "size", "lot")
head(sub_df, 30)
##    Record price no.beds no.baths  size  lot
## 2       2 120.0       6      2.0 2.786 0.23
## 3       3 150.0       4      2.0 1.704 0.27
## 8       8 140.0       4      3.0 2.818 0.21
## 9       9 197.5       4      2.5 2.268 1.00
## 10     10 125.1       3      2.0 1.936 0.30
## 11     11 175.0       3      2.0 1.528 1.30
## 14     14 160.0       4      2.0 2.060 0.60
## 15     15  63.5       3      2.0 1.781 0.20
## 17     17 185.0       4      3.5 2.220 0.12
## 20     20 118.0       3      1.0 1.500 0.50
## 21     21  87.5       4      2.0 2.464 0.33
## 22     22  67.5       4      1.0 1.515 0.22
## 23     23 105.0       5      2.0 2.732 0.37
## 24     24 114.0       4      2.0 2.720 2.50
## 25     25 100.5       3      2.0 1.608 0.58
## 28     28 144.0       3      1.5 1.968 0.34
## 29     29 179.0       4      2.0 3.100 0.28
## 31     31 175.0       3      2.0 2.127 1.80
## 34     34  92.0       3      2.0 1.815 0.00
## 40     40 139.0       4      2.0 2.160 0.12
## 43     43 119.0       4      2.0 2.375 0.34
## 44     44  89.0       4      1.0 2.274 1.00
## 45     45  75.1       4      2.0 1.540 0.35
## 46     46  92.5       3      2.0 1.790 0.73
## 47     47 141.0       4      1.0 1.620 0.28
## 49     49 162.0       4      1.5 2.044 0.85
## 50     50 195.0       4      3.0 1.848 1.84
## 51     51 190.0       5      3.5 2.794 0.31
## 53     53  87.0       3      1.5 1.740 0.25
#get mean/median values
mean(sub_df$price)
## [1] 130.6276
mean(sub_df$size)
## [1] 2.097759
median(sub_df$price)
## [1] 125.1
median(sub_df$size)
## [1] 2.044

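Note that in this subset both the mean and the median of price and size (price: mean 130.6, median 125.1; size: mean 2.098, median 2.044) are higher than in the whole dataset (price: mean 113.6, median 107.0; size: mean 1.678, median 1.528). This is expected, because the subset keeps only the houses with a size of at least 1.5.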
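
The comparison in the note above can be sketched in a few lines. This is only an illustration, assuming a tiny data frame built from the first five rows as displayed (rounded) by str(df); the names first_rows, larger_houses, describe and comparison are hypothetical.

```r
# First five rows as displayed (rounded) by str(df); hypothetical data frame name
first_rows <- data.frame(
  Price = c(57.6, 120, 150, 143, 92.5),
  Size  = c(0.96, 2.79, 1.7, 1.2, 1.33)
)

# Same size filter as in A2
larger_houses <- subset(first_rows, Size >= 1.5)

# Helper returning the mean and median of price and size for a data frame
describe <- function(d) {
  c(mean_price = mean(d$Price), median_price = median(d$Price),
    mean_size = mean(d$Size), median_size = median(d$Size))
}

# One row per group, so the two sets of statistics can be read side by side
comparison <- rbind(whole = describe(first_rows), subset = describe(larger_houses))
print(comparison)
```

With the real data, describe(df) and describe(subset(df, Size >= 1.5)) would give the same side by side view.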

# Group by price range: replace the numeric price with a category label
sub_df$price <- ifelse(is.na(sub_df$price), 'NOTFOUND', ifelse(sub_df$price <= 38, 'Low_price', ifelse(sub_df$price > 38 & sub_df$price <= 100, 'Median_price', 'High_price')))
head(sub_df, 30)
##    Record        price no.beds no.baths  size  lot
## 2       2   High_price       6      2.0 2.786 0.23
## 3       3   High_price       4      2.0 1.704 0.27
## 8       8   High_price       4      3.0 2.818 0.21
## 9       9   High_price       4      2.5 2.268 1.00
## 10     10   High_price       3      2.0 1.936 0.30
## 11     11   High_price       3      2.0 1.528 1.30
## 14     14   High_price       4      2.0 2.060 0.60
## 15     15 Median_price       3      2.0 1.781 0.20
## 17     17   High_price       4      3.5 2.220 0.12
## 20     20   High_price       3      1.0 1.500 0.50
## 21     21 Median_price       4      2.0 2.464 0.33
## 22     22 Median_price       4      1.0 1.515 0.22
## 23     23   High_price       5      2.0 2.732 0.37
## 24     24   High_price       4      2.0 2.720 2.50
## 25     25   High_price       3      2.0 1.608 0.58
## 28     28   High_price       3      1.5 1.968 0.34
## 29     29   High_price       4      2.0 3.100 0.28
## 31     31   High_price       3      2.0 2.127 1.80
## 34     34 Median_price       3      2.0 1.815 0.00
## 40     40   High_price       4      2.0 2.160 0.12
## 43     43   High_price       4      2.0 2.375 0.34
## 44     44 Median_price       4      1.0 2.274 1.00
## 45     45 Median_price       4      2.0 1.540 0.35
## 46     46 Median_price       3      2.0 1.790 0.73
## 47     47   High_price       4      1.0 1.620 0.28
## 49     49   High_price       4      1.5 2.044 0.85
## 50     50   High_price       4      3.0 1.848 1.84
## 51     51   High_price       5      3.5 2.794 0.31
## 53     53 Median_price       3      1.5 1.740 0.25
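
The same price grouping can also be sketched with cut(), which avoids the nested ifelse() calls and keeps the numeric price column intact. This is a minimal sketch, assuming five records copied from the subset printed in A2; the names prices, price_group and group_counts are hypothetical.

```r
# Five records copied from the subset printed in A2; hypothetical data frame name
prices <- data.frame(
  Record = c(2, 15, 25, 44, 51),
  price  = c(120.0, 63.5, 100.5, 89.0, 190.0)
)

# Intervals are closed on the right: (-Inf, 38], (38, 100], (100, Inf)
prices$price_group <- cut(
  prices$price,
  breaks = c(-Inf, 38, 100, Inf),
  labels = c("Low_price", "Median_price", "High_price")
)

# Count the houses per category, including categories with no houses
group_counts <- table(prices$price_group)

print(prices)
print(group_counts)
```

Because the result is a factor with ordered levels, plots and tables list the categories from low to high rather than alphabetically.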

Q3: Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.

A3:

# show how many houses in the subset fall into each price category
c2 <- ggplot(sub_df, aes(price))
c2 + geom_bar(fill = "purple") 

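The bar chart above shows how many houses in the subset fall into each price category. Because the lowest price in the subset is 63.5, no house is at or below the Low_price threshold of 38, so every house belongs to either Median_price or High_price.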

library(ggplot2)
p <- ggplot(sub_df, aes(price, no.beds)) 
#p +  geom_label(aes(label = price), nudge_x = 1, nudge_y = 1, check_overlap = TRUE)
p + geom_violin(scale = "area", fill = "red", linetype = "dotted", color = "blue")

Note that most of the houses with high prices have 4+ bedrooms, while the houses with median prices have either 3 or 4 bedrooms.

# plot house size against number of bedrooms, colored by price category
ggplot(sub_df, aes(size, no.beds, color = price)) + geom_point() + stat_smooth(method = "lm", se = FALSE, fullrange = TRUE)

# plot number of bedrooms against number of bathrooms, colored by price category
ggplot(sub_df, aes(no.beds, no.baths, color = price)) + geom_point() + stat_smooth(method = "lm", se = FALSE, fullrange = TRUE)

As shown in the plots, neither the size of the house nor the number of bedrooms clearly determines the price category. For instance, houses with 4 bedrooms appear in every price category that occurs in the subset. However, the densest cluster of points lies within the size range of 2-2.5 with 4 bedrooms.
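
Q3 also asks for a box plot and a histogram. The following is a minimal ggplot2 sketch, assuming a data frame (hypothetically named houses) that holds the numeric price and the number of bedrooms of the 29 subset rows printed in A2, taken before price is replaced by a category label. The bin width of 25 is an arbitrary choice for illustration.

```r
library(ggplot2)

# Numeric price and bedroom count of the 29 subset rows printed in A2
houses <- data.frame(
  price = c(120.0, 150.0, 140.0, 197.5, 125.1, 175.0, 160.0, 63.5, 185.0, 118.0,
            87.5, 67.5, 105.0, 114.0, 100.5, 144.0, 179.0, 175.0, 92.0, 139.0,
            119.0, 89.0, 75.1, 92.5, 141.0, 162.0, 195.0, 190.0, 87.0),
  no.beds = c(6, 4, 4, 4, 3, 3, 4, 3, 4, 3,
              4, 4, 5, 4, 3, 3, 4, 3, 3, 4,
              4, 4, 4, 3, 4, 4, 4, 5, 3)
)

# Histogram: distribution of the numeric price
p_hist <- ggplot(houses, aes(x = price)) +
  geom_histogram(binwidth = 25, fill = "purple", color = "white")

# Box plot: price spread for each bedroom count (beds treated as a category)
p_box <- ggplot(houses, aes(x = factor(no.beds), y = price)) +
  geom_boxplot() +
  labs(x = "no.beds")

print(p_hist)
print(p_box)
```

Treating no.beds as a factor gives one box per bedroom count, which makes the price ranges directly comparable.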

Q4: Meaningful question for analysis: Please state at the beginning a meaningful question for analysis. Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose. Please write a brief conclusion paragraph in R markdown at the end.

A4:

Study objective:

The main goal of this study is to identify the factors that affect house prices: is it the size of the house, the number of bedrooms, or other factors?

Main outcome:

The results should show which factors affect the house price.

Results:

Conclusion:

For the given dataset, more variables are needed to reach a definite conclusion. For instance, the geographic location may affect the price dramatically. However, from the obtained results we can agree that houses with 4+ bedrooms generally have higher prices.
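
One possible way to look further at which factor relates most to the price is a simple correlation check. This is only a sketch, assuming that Pearson correlation is an acceptable first look and using the 29 subset rows printed in A2 with the numeric price; the names houses and price_cor are hypothetical, and a correlation on such a small subset does not establish a cause.

```r
# The 29 subset rows printed in A2, with the numeric price kept as-is
houses <- data.frame(
  price = c(120.0, 150.0, 140.0, 197.5, 125.1, 175.0, 160.0, 63.5, 185.0, 118.0,
            87.5, 67.5, 105.0, 114.0, 100.5, 144.0, 179.0, 175.0, 92.0, 139.0,
            119.0, 89.0, 75.1, 92.5, 141.0, 162.0, 195.0, 190.0, 87.0),
  no.beds = c(6, 4, 4, 4, 3, 3, 4, 3, 4, 3, 4, 4, 5, 4, 3,
              3, 4, 3, 3, 4, 4, 4, 4, 3, 4, 4, 4, 5, 3),
  no.baths = c(2.0, 2.0, 3.0, 2.5, 2.0, 2.0, 2.0, 2.0, 3.5, 1.0,
               2.0, 1.0, 2.0, 2.0, 2.0, 1.5, 2.0, 2.0, 2.0, 2.0,
               2.0, 1.0, 2.0, 2.0, 1.0, 1.5, 3.0, 3.5, 1.5),
  size = c(2.786, 1.704, 2.818, 2.268, 1.936, 1.528, 2.060, 1.781, 2.220, 1.500,
           2.464, 1.515, 2.732, 2.720, 1.608, 1.968, 3.100, 2.127, 1.815, 2.160,
           2.375, 2.274, 1.540, 1.790, 1.620, 2.044, 1.848, 2.794, 1.740),
  lot = c(0.23, 0.27, 0.21, 1.00, 0.30, 1.30, 0.60, 0.20, 0.12, 0.50,
          0.33, 0.22, 0.37, 2.50, 0.58, 0.34, 0.28, 1.80, 0.00, 0.12,
          0.34, 1.00, 0.35, 0.73, 0.28, 0.85, 1.84, 0.31, 0.25)
)

# Pearson correlation of each candidate factor with the price
factors <- c("no.beds", "no.baths", "size", "lot")
price_cor <- sapply(houses[, factors], function(x) cor(houses$price, x))

# Largest correlation first
print(round(sort(price_cor, decreasing = TRUE), 2))
```

A value close to 0 would suggest little linear relation with the price in this subset, while a value close to 1 or -1 would suggest a strong one.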

Q5: BONUS – place the original .csv in a github file and have R read from the link. This will be a very useful skill as you progress in your data science education and career.

A5:

The steps to place the dataset on your own github are the following:

  1. Go to the dataset source link and download the required dataset to your local machine.

  2. After downloading the file to your computer, add and commit it to git from your terminal, then push it to your github repo.

Note that you should have git installed on your local machine - you can check using git --version.

  3. Open the file on github and click on raw to get the link.

library(RCurl)

theurl <- "https://raw.githubusercontent.com/salma71/dataset2/master/HousesNY.csv"
df_hou <- read.table(file=theurl, header=TRUE, sep = ",")
summary(df_hou)
##        X          Price            Beds           Baths      
##  Min.   : 1   Min.   : 38.5   Min.   :2.000   Min.   :1.000  
##  1st Qu.:14   1st Qu.: 82.7   1st Qu.:3.000   1st Qu.:1.500  
##  Median :27   Median :107.0   Median :3.000   Median :2.000  
##  Mean   :27   Mean   :113.6   Mean   :3.396   Mean   :1.858  
##  3rd Qu.:40   3rd Qu.:141.0   3rd Qu.:4.000   3rd Qu.:2.000  
##  Max.   :53   Max.   :197.5   Max.   :6.000   Max.   :3.500  
##       Size            Lot        
##  Min.   :0.712   Min.   :0.0000  
##  1st Qu.:1.296   1st Qu.:0.2700  
##  Median :1.528   Median :0.4200  
##  Mean   :1.678   Mean   :0.7985  
##  3rd Qu.:2.060   3rd Qu.:1.1000  
##  Max.   :3.100   Max.   :3.5000