Nam Anh Le
Housing Prices Data set
mydata <- read.table("./Housing.csv", header=TRUE,sep = ",", dec=".")
head(mydata)
## price area bedrooms bathrooms stories mainroad basement airconditioning
## 1 13300000 7420 4 2 3 yes no yes
## 2 12250000 8960 4 4 4 yes no yes
## 3 12250000 9960 3 2 2 yes yes no
## 4 12215000 7500 4 2 2 yes yes yes
## 5 11410000 7420 4 1 2 yes yes yes
## 6 10850000 7500 3 3 1 yes yes yes
## parking
## 1 2
## 2 3
## 3 2
## 4 3
## 5 2
## 6 2
Unit of Observation: one house
The sample size is 545
Definition of Variables:
price: House price in $
area: House area in square feet
bedrooms: Number of bedrooms
bathrooms: Number of bathrooms
stories: Number of floors
mainroad: Whether the house is connected to a main road
basement: Whether the house has a basement
airconditioning: Whether the house has air conditioning
parking: Number of house parking spaces
Data Source: https://www.kaggle.com/datasets/yasserh/housing-prices-dataset/data
mydatanew <- mydata
mydatanew$mainroad <- factor(mydata$mainroad,
                             levels = c("yes","no"),
                             labels = c("yes","no"))
mydatanew$basement <- factor(mydata$basement,
                             levels = c("yes","no"),
                             labels = c("yes","no"))
mydatanew$airconditioning <- factor(mydata$airconditioning,
                                    levels = c("yes","no"),
                                    labels = c("yes","no"))
head(mydatanew)
## price area bedrooms bathrooms stories mainroad basement airconditioning
## 1 13300000 7420 4 2 3 yes no yes
## 2 12250000 8960 4 4 4 yes no yes
## 3 12250000 9960 3 2 2 yes yes no
## 4 12215000 7500 4 2 2 yes yes yes
## 5 11410000 7420 4 1 2 yes yes yes
## 6 10850000 7500 3 3 1 yes yes yes
## parking
## 1 2
## 2 3
## 3 2
## 4 3
## 5 2
## 6 2
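The three factor conversions above repeat the same call, so they could also be written as a single step. The following is only a sketch: `demo` is a made-up toy data frame standing in for mydata, and `yes_no_cols` is a hypothetical helper name, not part of the original analysis.

```r
# Toy stand-in with made-up values; the real analysis would use mydata instead
demo <- data.frame(
  mainroad        = c("yes", "yes", "no"),
  basement        = c("no", "yes", "yes"),
  airconditioning = c("yes", "no", "yes"),
  stringsAsFactors = FALSE
)

# Names of the yes/no columns to convert (hypothetical helper variable)
yes_no_cols <- c("mainroad", "basement", "airconditioning")

# Convert each listed column to a factor with "yes" as the first level
demo[yes_no_cols] <- lapply(demo[yes_no_cols], factor, levels = c("yes", "no"))

str(demo)
```

Listing the columns once keeps the level order consistent across all three variables and makes it easy to add another yes/no column later.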
library(pastecs)
round(stat.desc(mydatanew[,-c(6,7,8)]),2)
## price area bedrooms bathrooms stories parking
## nbr.val 5.450000e+02 545.00 545.00 545.00 545.00 545.00
## nbr.null 0.000000e+00 0.00 0.00 0.00 0.00 299.00
## nbr.na 0.000000e+00 0.00 0.00 0.00 0.00 0.00
## min 1.750000e+06 1650.00 1.00 1.00 1.00 0.00
## max 1.330000e+07 16200.00 6.00 4.00 4.00 3.00
## range 1.155000e+07 14550.00 5.00 3.00 3.00 3.00
## sum 2.597867e+09 2807045.00 1616.00 701.00 984.00 378.00
## median 4.340000e+06 4600.00 3.00 1.00 2.00 0.00
## mean 4.766729e+06 5150.54 2.97 1.29 1.81 0.69
## SE.mean 8.012083e+04 92.96 0.03 0.02 0.04 0.04
## CI.mean.0.95 1.573841e+05 182.60 0.06 0.04 0.07 0.07
## var 3.498544e+12 4709512.06 0.54 0.25 0.75 0.74
## std.dev 1.870440e+06 2170.14 0.74 0.50 0.87 0.86
## coef.var 3.900000e-01 0.42 0.25 0.39 0.48 1.24
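Several rows of this table can be derived from others, which gives a quick sanity check. The sketch below recomputes SE.mean, CI.mean.0.95 and coef.var for price from the rounded values printed above. The inputs are copied from the output, so small rounding differences are expected, and the code assumes that the reported CI value is the half-width of a t-based 95% interval.

```r
# Values copied from the stat.desc output for price (rounded as printed)
n      <- 545
mean_p <- 4766729
sd_p   <- 1870440

se_mean  <- sd_p / sqrt(n)                    # standard error of the mean
ci_half  <- qt(0.975, df = n - 1) * se_mean   # half-width of the 95% CI (assumed t-based)
coef_var <- sd_p / mean_p                     # coefficient of variation

round(c(SE.mean = se_mean, CI.mean.0.95 = ci_half, coef.var = coef_var), 2)
```

If the assumption holds, the results agree with the table up to rounding, and the 95% confidence interval for the mean price is roughly $4.77 million plus or minus $0.16 million.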
The sample mean house price is $4,766,729.
The median house price is $4,340,000, meaning that half of the houses are
priced above $4.34 million and half are priced below it.
stdev <- sd(mydatanew$price)
The standard deviation of house price is $1,870,440.
library(ggplot2)
ggplot(mydatanew, aes(x = area, y = price)) +
  geom_point(colour = "royalblue1")
There appears to be a positive association between house area and
price: larger houses tend to have higher prices.
However, the relationship is far from perfectly linear, since the points
are widely spread.
The scatter is especially pronounced for mid-sized houses (roughly
4,000–8,000 sq ft).
This suggests that factors beyond area, such as location, amenities,
and condition, also affect house prices.
Some houses have very high prices despite a small area; these
could be luxury properties.
Some large houses are priced lower than expected, possibly because of
location or condition.
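The visual impression from the scatterplot can be backed up with numbers. The sketch below defines a hypothetical helper, `area_price_summary()`, which returns the Pearson correlation, the slope and R-squared of a simple linear fit, and the price per square foot. It is demonstrated on the six rows shown by head() only so that the code runs on its own. Those rows are the most expensive houses and are not representative, so the real analysis would call it on mydatanew.

```r
# First six rows of the data, copied from the head() output above
six <- data.frame(
  area  = c(7420, 8960, 9960, 7500, 7420, 7500),
  price = c(13300000, 12250000, 12250000, 12215000, 11410000, 10850000)
)

# Hypothetical helper: summarises the linear relationship between area and price
area_price_summary <- function(df) {
  fit <- lm(price ~ area, data = df)
  list(
    r              = cor(df$area, df$price),      # Pearson correlation
    slope          = unname(coef(fit)["area"]),   # fitted price change per extra sq ft
    r_squared      = summary(fit)$r.squared,      # share of price variance explained by area
    price_per_sqft = df$price / df$area           # high values flag small but expensive houses
  )
}

res <- area_price_summary(six)  # illustration only; use mydatanew for the real analysis
str(res)
```

A correlation well below 1 and a modest R-squared on the full data would be consistent with the wide scatter described above, and unusually high or low price_per_sqft values would point to the small expensive houses and the large cheap ones.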