Nam Anh Le
Housing Prices Data set

mydata <- read.table("./Housing.csv", header=TRUE,sep = ",", dec=".")
head(mydata)
##      price area bedrooms bathrooms stories mainroad basement airconditioning
## 1 13300000 7420        4         2       3      yes       no             yes
## 2 12250000 8960        4         4       4      yes       no             yes
## 3 12250000 9960        3         2       2      yes      yes              no
## 4 12215000 7500        4         2       2      yes      yes             yes
## 5 11410000 7420        4         1       2      yes      yes             yes
## 6 10850000 7500        3         3       1      yes      yes             yes
##   parking
## 1       2
## 2       3
## 3       2
## 4       3
## 5       2
## 6       2

Unit of observation: one house
Sample size: 545 houses

Definitions of variables:
price: House Price in $
area: House Area in square feet
bedrooms: Number of Bedrooms
bathrooms: Number of Bathrooms
stories: Number of floors
mainroad: Whether the house is connected to a main road
basement: Whether the house has a basement
airconditioning: Whether the house has air conditioning
parking: Number of house parking spaces

Data Source: https://www.kaggle.com/datasets/yasserh/housing-prices-dataset/data

mydatanew <- mydata

mydatanew$mainroad <- factor(mydata$mainroad,
                          levels = c("yes","no"),
                          labels = c("yes","no"))

mydatanew$basement <- factor(mydata$basement,
                          levels = c("yes","no"),
                          labels = c("yes","no"))

mydatanew$airconditioning <- factor(mydata$airconditioning,
                                 levels = c("yes","no"),
                                 labels = c("yes","no"))

head(mydatanew)
##      price area bedrooms bathrooms stories mainroad basement airconditioning
## 1 13300000 7420        4         2       3      yes       no             yes
## 2 12250000 8960        4         4       4      yes       no             yes
## 3 12250000 9960        3         2       2      yes      yes              no
## 4 12215000 7500        4         2       2      yes      yes             yes
## 5 11410000 7420        4         1       2      yes      yes             yes
## 6 10850000 7500        3         3       1      yes      yes             yes
##   parking
## 1       2
## 2       3
## 3       2
## 4       3
## 5       2
## 6       2
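
The three yes/no columns can also be converted in a single step and then checked. The sketch below is only an illustration: it rebuilds the six rows printed above in a data frame with the hypothetical name head_rows instead of reading Housing.csv, and yes_no_cols is likewise an illustrative name.

```r
# Stand-in data: the yes/no columns of the six rows shown by head(mydata).
# head_rows and yes_no_cols are hypothetical names used only in this sketch.
head_rows <- data.frame(
  mainroad        = c("yes", "yes", "yes", "yes", "yes", "yes"),
  basement        = c("no",  "no",  "yes", "yes", "yes", "yes"),
  airconditioning = c("yes", "yes", "no",  "yes", "yes", "yes")
)

yes_no_cols <- c("mainroad", "basement", "airconditioning")

# Apply factor() to each selected column with the same level order.
head_rows[yes_no_cols] <- lapply(head_rows[yes_no_cols], factor,
                                 levels = c("yes", "no"))

# Confirm the columns are factors and count the values in one of them.
str(head_rows)
table(head_rows$basement)
```

On the full data the same lapply() call could be applied to mydatanew; a value other than "yes" or "no" would become NA, so a quick table() check after converting is worthwhile.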
library(pastecs)
# Drop the three yes/no factor columns (6 to 8); stat.desc() needs numeric input
round(stat.desc(mydatanew[,-c(6,7,8)]),2)
##                     price       area bedrooms bathrooms stories parking
## nbr.val      5.450000e+02     545.00   545.00    545.00  545.00  545.00
## nbr.null     0.000000e+00       0.00     0.00      0.00    0.00  299.00
## nbr.na       0.000000e+00       0.00     0.00      0.00    0.00    0.00
## min          1.750000e+06    1650.00     1.00      1.00    1.00    0.00
## max          1.330000e+07   16200.00     6.00      4.00    4.00    3.00
## range        1.155000e+07   14550.00     5.00      3.00    3.00    3.00
## sum          2.597867e+09 2807045.00  1616.00    701.00  984.00  378.00
## median       4.340000e+06    4600.00     3.00      1.00    2.00    0.00
## mean         4.766729e+06    5150.54     2.97      1.29    1.81    0.69
## SE.mean      8.012083e+04      92.96     0.03      0.02    0.04    0.04
## CI.mean.0.95 1.573841e+05     182.60     0.06      0.04    0.07    0.07
## var          3.498544e+12 4709512.06     0.54      0.25    0.75    0.74
## std.dev      1.870440e+06    2170.14     0.74      0.50    0.87    0.86
## coef.var     3.900000e-01       0.42     0.25      0.39    0.48    1.24

The sample mean house price is $4,766,729.
The median house price is $4,340,000, meaning that about half of the houses are priced above $4.34 million and about half below it.
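
The reading of the median as a half-and-half split can be checked directly. The sketch below uses only the six prices printed by head(mydata), stored in a hypothetical vector named prices, so its numbers differ from the full-sample values reported above.

```r
# Prices of the six houses shown by head(mydata); not the full sample.
prices <- c(13300000, 12250000, 12250000, 12215000, 11410000, 10850000)

avg <- mean(prices)
med <- median(prices)

# Share of houses priced strictly above and strictly below the median.
share_above <- mean(prices > med)
share_below <- mean(prices < med)

c(mean = avg, median = med, above = share_above, below = share_below)
```

When several houses sit exactly at the median price, both shares fall below 50%, which is why the split is best described as approximate.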

stdev <- sd(mydatanew$price)

The standard deviation of house price is $1,870,440, about 39% of the mean (coefficient of variation 0.39).
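
Dollar amounts like these are easier to read with thousands separators. A minimal sketch using base R's format(), applied to the three summary values reported above; the names stats and labelled are illustrative only.

```r
# Summary values reported in the text (mean, median, standard deviation).
stats <- c(mean = 4766729, median = 4340000, sd = 1870440)

# big.mark inserts thousands separators; scientific = FALSE avoids
# exponent notation such as 4.766729e+06.
labelled <- paste0("$", format(stats, big.mark = ",",
                               scientific = FALSE, trim = TRUE))
names(labelled) <- names(stats)
labelled
```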

library(ggplot2)
ggplot(mydatanew,aes(x=area,y=price))+
  geom_point(colour="royalblue1")


There appears to be a positive correlation between house area and price, meaning larger houses tend to have higher prices.

However, the relationship is far from perfectly linear: the points are widely spread rather than falling along a single line.
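
The strength of the relationship seen in the plot can be quantified with a correlation coefficient. The sketch below is an assumption-laden illustration: it uses only the six rows printed by head(mydata), stored under the hypothetical name head_rows, so the result says nothing reliable about the full sample of 545 houses.

```r
# Stand-in data: price and area of the six rows shown by head(mydata).
head_rows <- data.frame(
  price = c(13300000, 12250000, 12250000, 12215000, 11410000, 10850000),
  area  = c(7420, 8960, 9960, 7500, 7420, 7500)
)

# Pearson correlation measures linear association.
r_pearson <- cor(head_rows$area, head_rows$price)

# Spearman correlation uses ranks, so it is less sensitive to outliers.
r_spearman <- cor(head_rows$area, head_rows$price, method = "spearman")

c(pearson = r_pearson, spearman = r_spearman)
```

On the full data the same calls would be cor(mydatanew$area, mydatanew$price); a value well below 1 would be consistent with the wide spread in the scatter plot.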

The scatter is substantial, especially for mid-sized houses (e.g., 4000–8000 sq ft).

This suggests that factors beyond area, such as location, amenities, and condition, also affect house prices.

Some houses have very high prices despite having a small area. These could be luxury properties.

Some large houses are priced lower than expected, possibly due to location or condition.
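
One way to make "higher than expected" and "lower than expected" concrete is to fit a simple regression of price on area and look at the residuals. This is a sketch of one possible approach, not part of the original analysis, and it again uses only the six rows printed by head(mydata) under the hypothetical name head_rows.

```r
# Stand-in data: price and area of the six rows shown by head(mydata).
head_rows <- data.frame(
  price = c(13300000, 12250000, 12250000, 12215000, 11410000, 10850000),
  area  = c(7420, 8960, 9960, 7500, 7420, 7500)
)

# Simple linear regression: expected price as a function of area.
fit <- lm(price ~ area, data = head_rows)

# Residual = actual price minus the price predicted from area alone.
head_rows$residual <- resid(fit)

# Most negative residuals first: houses priced lowest relative to their area.
head_rows[order(head_rows$residual), ]
```

Houses with large positive residuals are priced above what their area alone would predict, and those with large negative residuals are priced below it.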