Haiyuan Gui's Document

Data

# import library
library(ggplot2)


# read data
westroxbury.df <- mlba::WestRoxbury
# data structure
str(westroxbury.df)

## 'data.frame':    5802 obs. of  14 variables:
##  $ TOTAL.VALUE: num  344 413 330 499 332 ...
##  $ TAX        : int  4330 5190 4152 6272 4170 4244 4521 4030 4195 5150 ...
##  $ LOT.SQFT   : int  9965 6590 7500 13773 5000 5142 5000 10000 6835 5093 ...
##  $ YR.BUILT   : int  1880 1945 1890 1957 1910 1950 1954 1950 1958 1900 ...
##  $ GROSS.AREA : int  2436 3108 2294 5032 2370 2124 3220 2208 2582 4818 ...
##  $ LIVING.AREA: int  1352 1976 1371 2608 1438 1060 1916 1200 1092 2992 ...
##  $ FLOORS     : num  2 2 2 1 2 1 2 1 1 2 ...
##  $ ROOMS      : int  6 10 8 9 7 6 7 6 5 8 ...
##  $ BEDROOMS   : int  3 4 4 5 3 3 3 3 3 4 ...
##  $ FULL.BATH  : int  1 2 1 1 2 1 1 1 1 2 ...
##  $ HALF.BATH  : int  1 1 1 1 0 0 1 0 0 0 ...
##  $ KITCHEN    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ FIREPLACE  : int  0 0 0 1 0 1 0 0 1 0 ...
##  $ REMODEL    : chr  "None" "Recent" "None" "None" ...

# column names
t(t(names(westroxbury.df)))

##       [,1]         
##  [1,] "TOTAL.VALUE"
##  [2,] "TAX"        
##  [3,] "LOT.SQFT"   
##  [4,] "YR.BUILT"   
##  [5,] "GROSS.AREA" 
##  [6,] "LIVING.AREA"
##  [7,] "FLOORS"     
##  [8,] "ROOMS"      
##  [9,] "BEDROOMS"   
## [10,] "FULL.BATH"  
## [11,] "HALF.BATH"  
## [12,] "KITCHEN"    
## [13,] "FIREPLACE"  
## [14,] "REMODEL"

# show the number of rows and columns
dim(westroxbury.df)  # 5802 rows, 14 columns

## [1] 5802   14

# subset of the WestRoxbury
west.df <- westroxbury.df[1:3000, ]
# view the dataset WestRoxbury
View(west.df)

Box plots

# 1. Boxplot for subset of the WestRoxbury dataset
# 1.1 Distribution of TAX
# Not the desired effect 
# boxplot(west.df$TAX, ylab = "TAX", col = ifelse(west.df$TAX < quantile(west.df$TAX, 0.25) | west.df$TAX > quantile(west.df$TAX, 0.75), "red", "green"))
# Box, box border, and scatter color settings in the boxplot
boxplot(west.df$TAX, ylab="TAX", col="green", border="blue", outcol="red")

Figure 1: Most houses have a TAX between 4000 and 6000, and some houses have unusually high or unusually low TAX values, which appear as individual outlier points.
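
The outlier points in a box plot follow a fixed rule: by default, boxplot() draws individually any value lying more than 1.5 box lengths beyond the box. A minimal sketch of that rule with boxplot.stats(), using made-up TAX-like values (toy.tax is illustrative and not taken from the WestRoxbury data):

```r
# Toy values, invented for illustration (not taken from WestRoxbury)
toy.tax <- c(3900, 4100, 4200, 4300, 4500, 4800, 5200, 5600, 9800)

# boxplot.stats() returns the five-number summary and the flagged points
toy.stats <- boxplot.stats(toy.tax)
hinges <- toy.stats$stats[c(2, 4)]   # lower and upper hinge (box edges)
box.length <- diff(hinges)

# Fences: 1.5 box lengths beyond each hinge
fences <- c(hinges[1] - 1.5 * box.length, hinges[2] + 1.5 * box.length)

# Values outside the fences are the points drawn individually
toy.outliers <- toy.tax[toy.tax < fences[1] | toy.tax > fences[2]]
print(fences)
print(toy.outliers)
```

Applied to west.df$TAX, the same calls would list the values drawn as red points in the TAX box plot, assuming the default whisker range is used.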

# 1.2 TOTAL.VALUE vs. number of ROOMS
boxplot(west.df$TOTAL.VALUE~west.df$ROOMS, xlab="Number of ROOMS", ylab="TOTAL VALUE", col="green", border="blue", outcol="red")

Figure 2: The number of rooms ranges from 3 to 14, and TOTAL VALUE tends to increase with the number of rooms. Outliers appear for houses with 4 to 10 rooms, and most of these outliers have an unusually high TOTAL VALUE.

# 1.3 GROSS.AREA vs. REMODEL types
boxplot(west.df$GROSS.AREA~west.df$REMODEL, xlab="REMODEL Types", ylab="GROSS AREA", col="green", border="blue", outcol="red")

Figure 3: The distribution of GROSS AREA by REMODEL type shows several outliers in each category. The distribution is right-skewed.

Histograms

# 2. Histograms for subset of the WestRoxbury dataset
# 2.1 histogram of TAX
ggplot(west.df) + geom_histogram(aes(x=TAX), fill="blue", color="red", bins=15) + ylab("Count")

Figure 4: The graph shows that the TAX of most houses is around 4000, and the distribution of the data is right-skewed.

# 2.2 histogram of ROOMS
ggplot(west.df) + geom_histogram(aes(x=ROOMS), fill="yellow", color="black", bins=10) + ylab("Count")

Figure 5: Most of the houses have between 5 and 9 rooms, and a few houses have more than 10 rooms.

# 2.3 histogram of GROSS.AREA
ggplot(west.df) + geom_histogram(aes(x=GROSS.AREA), fill="pink", color="blue", bins=10) + ylab("Count")

Figure 6: Most houses have a gross area between 2000 and 4000, and a few houses have a gross area over 6000. The distribution is right-skewed.
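
A right skew can be checked numerically as well as visually. The sketch below uses a hypothetical helper, sample_skewness(), and made-up GROSS.AREA-like values (toy.area is not taken from WestRoxbury). A positive moment coefficient and a mean above the median are both consistent with a longer right tail.

```r
# Hypothetical helper: moment coefficient of skewness
# (positive values suggest a longer right tail)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))   # population-style standard deviation
  mean((x - m)^3) / s^3
}

# Toy values, invented for illustration
toy.area <- c(2000, 2200, 2400, 2600, 2800, 3000, 3500, 4200, 6500)

toy.skew <- sample_skewness(toy.area)
mean.above.median <- mean(toy.area) > median(toy.area)
print(toy.skew)
print(mean.above.median)
```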

Scatter plots

# 3. Scatter plots for the subset of the WestRoxbury dataset
# 3.1 TAX vs. GROSS.AREA
plot(west.df$GROSS.AREA, west.df$TAX, xlab="GROSS.AREA", ylab="TAX", col="red")

Figure 7: From the graph we can see that as the gross area increases, the TAX required for the house increases and there seems to be some linear relationship between the two.
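
The strength of a linear relationship like this one can be quantified with cor() and lm(). A minimal sketch on made-up values (toy.df is illustrative and does not contain WestRoxbury rows):

```r
# Toy data frame, invented for illustration
toy.df <- data.frame(
  GROSS.AREA = c(2000, 2500, 3000, 3500, 4000),
  TAX        = c(3600, 4300, 5100, 5700, 6500)
)

# Pearson correlation: values near 1 indicate a strong positive linear trend
toy.r <- cor(toy.df$GROSS.AREA, toy.df$TAX)

# Simple linear regression of TAX on GROSS.AREA
toy.fit <- lm(TAX ~ GROSS.AREA, data = toy.df)
toy.slope <- coef(toy.fit)[["GROSS.AREA"]]   # change in TAX per unit of area

print(toy.r)
print(toy.slope)
```

On the real subset, the same two calls with data = west.df would give the correlation and the fitted slope behind the scatter plot.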

# 3.2 GROSS.AREA vs. LIVING.AREA
plot(west.df$LIVING.AREA, west.df$GROSS.AREA, xlab="LIVING.AREA", ylab="GROSS.AREA", col="blue")

Figure 8: The living area and the gross area of a house show a clear linear relationship, and the gross area increases as the living area increases.

# 3.3 TOTAL.VALUE vs. FIREPLACE
plot(west.df$FIREPLACE, west.df$TOTAL.VALUE, xlab="FIREPLACE", ylab="TOTAL.VALUE", col="green")

Figure 9: We can see that the number of fireplaces in all houses is mainly 0-2, very few houses have 3 or 4 fireplaces, and there are no houses with more than 4 fireplaces.

Bar charts

# 4. Bar charts for subset of the WestRoxbury dataset
# 4.1 Average Taxes with Different Number of Fireplaces
# compute mean TAX per FIREPLACE
TAX.per.FIREPLACE <- aggregate(west.df$TAX, by=list(west.df$FIREPLACE), FUN=mean)
names(TAX.per.FIREPLACE) <- c("FIREPLACE", "AverageTAX")
TAX.per.FIREPLACE$FIREPLACE <- factor(TAX.per.FIREPLACE$FIREPLACE)
str(TAX.per.FIREPLACE)

## 'data.frame':    5 obs. of  2 variables:
##  $ FIREPLACE : Factor w/ 5 levels "0","1","2","3",..: 1 2 3 4 5
##  $ AverageTAX: num  4214 4853 5676 8107 5282

ggplot(TAX.per.FIREPLACE, aes(x=FIREPLACE, y=AverageTAX, fill=FIREPLACE)) + geom_bar(stat="identity") + geom_text(aes(label = sprintf("%.2f", AverageTAX)), vjust = -0.5) +  ylab("Average TAX")

Figure 10: From the graph we can see that when the number of fireplaces is 3, the house has the highest average tax, and when the house has no fireplaces, the house has the lowest average tax.
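
A group mean can be misleading when the group holds very few houses, so it may help to report the group size next to each mean. A sketch using the formula interface of aggregate() on a made-up data frame (toy.df and the column n are hypothetical):

```r
# Toy data frame, invented for illustration
toy.df <- data.frame(
  FIREPLACE = c(0, 0, 1, 1, 2),
  TAX       = c(4000, 4400, 4800, 5000, 5600)
)

# Mean TAX per FIREPLACE; the formula interface keeps the column names
toy.avg <- aggregate(TAX ~ FIREPLACE, data = toy.df, FUN = mean)
names(toy.avg)[2] <- "AverageTAX"

# Number of houses per group, in the same group order
toy.avg$n <- aggregate(TAX ~ FIREPLACE, data = toy.df, FUN = length)$TAX

print(toy.avg)
```

A mean based on only a handful of houses, as may be the case for the groups with 3 or 4 fireplaces, deserves less weight than one based on hundreds.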

# 4.2 Average LOT.SQFT with Different Number of FULL.BATH
# compute mean LOT.SQFT per FULL.BATH
LOTSQFT.per.FULLBATH <- aggregate(west.df$LOT.SQFT, by=list(west.df$FULL.BATH), FUN=mean)
names(LOTSQFT.per.FULLBATH) <- c("FULL.BATH", "Average.LOT.SQFT")
LOTSQFT.per.FULLBATH$FULL.BATH <- factor(LOTSQFT.per.FULLBATH$FULL.BATH)
str(LOTSQFT.per.FULLBATH)

## 'data.frame':    5 obs. of  2 variables:
##  $ FULL.BATH       : Factor w/ 5 levels "1","2","3","4",..: 1 2 3 4 5
##  $ Average.LOT.SQFT: num  6014 6733 8015 9856 7449

ggplot(LOTSQFT.per.FULLBATH, aes(x=FULL.BATH, y=Average.LOT.SQFT, fill=FULL.BATH)) + geom_bar(stat="identity") + geom_text(aes(label = sprintf("%.2f", Average.LOT.SQFT)), vjust = -0.5) +  ylab("Average LOT SQFT")

Figure 11: From the graph we can see that Average LOT SQFT increases and then decreases as the number of FULL BATH increases. When the number of FULL BATH is 4, the house has the highest Average LOT SQFT, while when the house has only 1 FULL BATH, the house has the lowest Average LOT SQFT.

# 4.3 Average Taxes with Different Number of Rooms
# compute mean TAX per ROOMS
TAX.per.ROOMS <- aggregate(west.df$TAX, by=list(west.df$ROOMS), FUN=mean)
names(TAX.per.ROOMS) <- c("ROOMS", "AverageTAX")
TAX.per.ROOMS$ROOMS <- factor(TAX.per.ROOMS$ROOMS)
str(TAX.per.ROOMS)

## 'data.frame':    12 obs. of  2 variables:
##  $ ROOMS     : Factor w/ 12 levels "3","4","5","6",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ AverageTAX: num  2824 3421 3653 4176 4709 ...

ggplot(TAX.per.ROOMS, aes(x=ROOMS, y=AverageTAX, fill=ROOMS)) + geom_bar(stat="identity") + geom_text(aes(label = sprintf("%.2f", AverageTAX)), vjust = -0.5) +  ylab("Average TAX")

Figure 12: As the number of rooms increases, the average TAX of the house keeps increasing, and it only decreases when the number of rooms reaches 14. The average TAX is highest when the number of rooms is 13 and lowest when the number of rooms is 3.