Question 1

First, the package “arsenal” is installed and loaded, and its tableby.control() function is called. The descriptive statistics mean, variance, maximum, minimum and number of missing observations are specified in numeric.stats, and stats.labels=list() gives each statistic a display name. The result is saved in descriptive_stats. Then every variable in the dataset HomeSales is given a label in the list HomeSalesLabels. Next, tableby() is called with the formula ~., which has no grouping variable on the left side of the tilde and all variables on the right side, together with the dataset HomeSales and the control object descriptive_stats. The result is saved in descriptive_table. Finally, the descriptive table is printed with summary(), with the labels from HomeSalesLabels and text set to TRUE.

# arsenal provides tableby.control() and tableby()
library(arsenal)

descriptive_stats<-
  tableby.control(
  numeric.stats=c("mean","var","max","min","Nmiss2"),
  cat.stats=c("Nmiss2"),
  stats.labels=list(
    mean="Mean",
    var="Variance",
    max="Max",
    min="Min",
    Nmiss2="Missing observations"
  ))

HomeSalesLabels<-list(
  R="Number of rooms",
  BR="Number of bedrooms",
  B="Number of bathrooms",
  A="Age (years)",
  Style="1 if bungalow, 2 if two-storey",
  s="Living area (square meters)",
  g="Number of garage spaces",
  att="Attached",
  bas="Basement",
  f="Number of fireplaces",
  c="Number of chattels",
  lots="Lot size (square meters)",
  listprice="House price (CAD)",
  time="Month of sale",
  cul="Cul-de-sac",
  co="Corner lot",
  la="Lane behind",
  e="Exposure of yard",
  z0="Zone 0",
  z1="Zone 1",
  z2="Zone 2",
  z3="Zone 3",
  z4="Zone 4",
  z5="Zone 5",
  z6="Zone 6",
  z7="Zone 7",
  z8="Zone 8",
  z9="Zone 9"
)

descriptive_table<-tableby(~.,
                     data=HomeSales,
                     control=descriptive_stats
)
summary(descriptive_table,
        labelTranslations=HomeSalesLabels,
        title="Descriptive statistics",
        text=TRUE
)

## 
## 
## Table: Descriptive statistics
## 
## |                            | Overall (N=240) |
## |:---------------------------|:---------------:|
## |Number of rooms             |                 |
## |-  Mean                     |      6.287      |
## |-  Variance                 |      0.574      |
## |-  Max                      |     10.000      |
## |-  Min                      |      4.000      |
## |-  Missing observations     |        0        |
## |Number of bedrooms          |                 |
## |-  Mean                     |      3.079      |
## |-  Variance                 |      0.149      |
## |-  Max                      |      5.000      |
## |-  Min                      |      2.000      |
## |-  Missing observations     |        0        |
## |Number of bathrooms         |                 |
## |-  Mean                     |      2.513      |
## |-  Variance                 |      0.452      |
## |-  Max                      |      5.000      |
## |-  Min                      |      1.000      |
## |-  Missing observations     |        0        |
## |Age (years)                 |                 |
## |-  Mean                     |     21.996      |
## |-  Variance                 |     31.385      |
## |-  Max                      |     32.000      |
## |-  Min                      |      5.000      |
## |-  Missing observations     |        0        |
## |style                       |                 |
## |-  Mean                     |      1.100      |
## |-  Variance                 |      0.090      |
## |-  Max                      |      2.000      |
## |-  Min                      |      1.000      |
## |-  Missing observations     |        0        |
## |Living area (square meters) |                 |
## |-  Mean                     |     125.768     |
## |-  Variance                 |     742.654     |
## |-  Max                      |     267.300     |
## |-  Min                      |     90.000      |
## |-  Missing observations     |        0        |
## |Number of garage spaces     |                 |
## |-  Mean                     |      1.758      |
## |-  Variance                 |      0.343      |
## |-  Max                      |      4.000      |
## |-  Min                      |      0.000      |
## |-  Missing observations     |        0        |
## |Attached                    |                 |
## |-  Mean                     |      0.304      |
## |-  Variance                 |      0.213      |
## |-  Max                      |      1.000      |
## |-  Min                      |      0.000      |
## |-  Missing observations     |        0        |
## |Basement                    |                 |
## |-  Mean                     |      2.587      |
## |-  Variance                 |      0.427      |
## |-  Max                      |      3.000      |
## |-  Min                      |      1.000      |
## |-  Missing observations     |        0        |
## |Number of fireplaces        |                 |
## |-  Mean                     |      0.654      |
## |-  Variance                 |      0.445      |
## |-  Max                      |      2.000      |
## |-  Min                      |      0.000      |
## |-  Missing observations     |        0        |
## |Number of chattels          |                 |
## |-  Mean                     |      2.504      |
## |-  Variance                 |      1.113      |
## |-  Max                      |      5.000      |
## |-  Min                      |      0.000      |
## |-  Missing observations     |        0        |
## |Lot size (square meters)    |                 |
## |-  Mean                     |     620.688     |
## |-  Variance                 |    19483.714    |
## |-  Max                      |    1396.000     |
## |-  Min                      |     272.000     |
## |-  Missing observations     |        0        |
## |House price (CAD)           |                 |
## |-  Mean                     |   140324.633    |
## |-  Variance                 |  417696978.919  |
## |-  Max                      |   279900.000    |
## |-  Min                      |    85000.000    |
## |-  Missing observations     |        0        |
## |Month of sale               |                 |
## |-  Mean                     |     45.737      |
## |-  Variance                 |     31.316      |
## |-  Max                      |     54.000      |
## |-  Min                      |     36.000      |
## |-  Missing observations     |        0        |
## |Cul-de-sac                  |                 |
## |-  Mean                     |      0.121      |
## |-  Variance                 |      0.107      |
## |-  Max                      |      1.000      |
## |-  Min                      |      0.000      |
## |-  Missing observations     |        0        |
## |Corner lot                  |                 |
## |-  Mean                     |      0.050      |
## |-  Variance                 |      0.048      |
## |-  Max                      |      1.000      |
## |-  Min                      |      0.000      |
## |-  Missing observations     |        0        |
## |Lane behind                 |                 |
## |-  Mean                     |      0.575      |
## |-  Variance                 |      0.245      |
## |-  Max                      |      1.000      |
## |-  Min                      |      0.000      |
## |-  Missing observations     |        0        |
## |Exposure of yard            |                 |
## |-  Mean                     |      0.371      |
## |-  Variance                 |      0.234      |
## |-  Max                      |      1.000      |
## |-  Min                      |      0.000      |
## |-  Missing observations     |        0        |
## |Zone 0                      |                 |
## |-  Mean                     |      0.054      |
## |-  Variance                 |      0.051      |
## |-  Max                      |      1.000      |
## |-  Min                      |      0.000      |
## |-  Missing observations     |        0        |
## |Zone 1                      |                 |
## |-  Mean                     |      0.242      |
## |-  Variance                 |      0.184      |
## |-  Max                      |      1.000      |
## |-  Min                      |      0.000      |
## |-  Missing observations     |        0        |
## |Zone 2                      |                 |
## |-  Mean                     |      0.117      |
## |-  Variance                 |      0.103      |
## |-  Max                      |      1.000      |
## |-  Min                      |      0.000      |
## |-  Missing observations     |        0        |
## |Zone 3                      |                 |
## |-  Mean                     |      0.071      |
## |-  Variance                 |      0.066      |
## |-  Max                      |      1.000      |
## |-  Min                      |      0.000      |
## |-  Missing observations     |        0        |
## |Zone 4                      |                 |
## |-  Mean                     |      0.104      |
## |-  Variance                 |      0.094      |
## |-  Max                      |      1.000      |
## |-  Min                      |      0.000      |
## |-  Missing observations     |        0        |
## |Zone 5                      |                 |
## |-  Mean                     |      0.096      |
## |-  Variance                 |      0.087      |
## |-  Max                      |      1.000      |
## |-  Min                      |      0.000      |
## |-  Missing observations     |        0        |
## |Zone 6                      |                 |
## |-  Mean                     |      0.129      |
## |-  Variance                 |      0.113      |
## |-  Max                      |      1.000      |
## |-  Min                      |      0.000      |
## |-  Missing observations     |        0        |
## |Zone 7                      |                 |
## |-  Mean                     |      0.092      |
## |-  Variance                 |      0.084      |
## |-  Max                      |      1.000      |
## |-  Min                      |      0.000      |
## |-  Missing observations     |        0        |
## |Zone 8                      |                 |
## |-  Mean                     |      0.075      |
## |-  Variance                 |      0.070      |
## |-  Max                      |      1.000      |
## |-  Min                      |      0.000      |
## |-  Missing observations     |        0        |
## |Zone 9                      |                 |
## |-  Mean                     |      0.021      |
## |-  Variance                 |      0.020      |
## |-  Max                      |      1.000      |
## |-  Min                      |      0.000      |
## |-  Missing observations     |        0        |
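
As a cross-check on a table like the one above, the same five statistics can be computed with base R. The sketch below is not part of the original analysis: it uses a small made-up data frame (toy) instead of HomeSales, and the helper name describe_numeric is hypothetical.

```r
# Toy data standing in for HomeSales (values are invented for illustration).
toy <- data.frame(
  rooms = c(6, 7, 5, 6, NA),
  price = c(120000, 150000, 99000, 135000, 142000)
)

# Hypothetical helper: mean, variance, max, min and number of missing values.
describe_numeric <- function(x) {
  c(
    mean    = mean(x, na.rm = TRUE),
    var     = var(x, na.rm = TRUE),
    max     = max(x, na.rm = TRUE),
    min     = min(x, na.rm = TRUE),
    missing = sum(is.na(x))
  )
}

# One column of statistics per variable, one row per statistic.
stats_table <- sapply(toy, describe_numeric)
print(round(stats_table, 3))
```

Applying the helper to the real data would only require replacing toy with HomeSales, assuming all columns are numeric as in the table.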

Question 2

ggplot2 is used to draw all four scatter plots with similar syntax: each independent variable is mapped to x and the dependent variable, the list price of a home, is mapped to y. geom_point() draws the points. In the first three plots the color is mapped to the independent variable, so the points become lighter as the discrete value increases and darker as it decreases. geom_smooth() adds a regression line, with the method set to “lm” for Ordinary Least Squares (OLS). To make the density of overlapping observations visible, alpha is set to 0.5. labs() labels the x- and y-axes and adds a title, and hjust=0.5 (the midpoint between 0 and 1) centers the title. Finally, p1+p2+p3+p4 uses the patchwork package to combine the four plots into one figure.

My conclusion is that the list price tends to increase with each of the four independent variables. This is visible in the positive slope of the fitted regression line in each of the plots.

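# Packages used for the plots: dplyr (%>%), ggplot2 and patchwork (p1+p2+p3+p4)
library(dplyr)
library(ggplot2)
library(patchwork)

p1<-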
HomeSales%>%
  ggplot(aes(x=R,y=listprice))+
  geom_point(aes(color=R),size=1,alpha=0.5)+  # constants are set outside aes()
  geom_smooth(method="lm")+
  labs(x="Number of rooms",
       y="Price in CAD",
       title="Number of rooms and price")+
  theme_gray(base_size=20)+
  theme(plot.title=element_text(hjust=0.5))
p2<-
HomeSales%>%
  ggplot(aes(x=B,y=listprice))+
  geom_point(aes(color=B),size=1,alpha=0.5)+  # constants are set outside aes()
  geom_smooth(method="lm")+
  labs(x="Number of bathrooms",
       y="Price in CAD",
       title="Number of bathrooms and price")+
  theme_gray(base_size=20)+
  theme(plot.title=element_text(hjust=0.5))
p3<-
HomeSales%>%
  ggplot(aes(x=BR,y=listprice))+
  geom_point(aes(color=BR),size=1,alpha=0.5)+  # constants are set outside aes()
  geom_smooth(method="lm")+
  labs(x="Number of bedrooms",
       y="Price in CAD",
       title="Number of bedrooms and price")+
  theme_gray(base_size=20)+
  theme(plot.title=element_text(hjust=0.5))
p4<-
HomeSales%>%
  ggplot(aes(x=s,y=listprice))+
  geom_point()+
  geom_smooth(method="lm")+
  labs(x="Living area (square meters)",
       y="Price in CAD",
       title="Living area and price")+
  theme_gray(base_size=20)+
  theme(plot.title=element_text(hjust=0.5))
p1+p2+p3+p4
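
The line drawn by geom_smooth(method="lm") is the least-squares fit of y on x. As a minimal sketch on simulated data (x and y below are invented and not taken from HomeSales), the slope of that line can be obtained either from lm() or as the sample covariance divided by the sample variance of x:

```r
set.seed(1)

# Simulated stand-ins: x plays the role of a room count, y of a list price.
x <- sample(4:10, 100, replace = TRUE)
y <- 50000 + 15000 * x + rnorm(100, sd = 20000)

# Slope and intercept from lm(), the same fit geom_smooth(method = "lm") uses.
fit <- lm(y ~ x)
slope_lm <- unname(coef(fit)["x"])
intercept_lm <- unname(coef(fit)["(Intercept)"])

# Closed-form OLS solution for a single regressor.
slope_formula <- cov(x, y) / var(x)
intercept_formula <- mean(y) - slope_formula * mean(x)

print(c(lm = slope_lm, formula = slope_formula))
```

A positive slope here corresponds to a regression line that rises from left to right in the scatter plot.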

Question 3

First, the dataset HomeSales is attached so that its variables can be referenced without specifying the dataset each time. Then the model r1 is estimated by linear regression with lm(), where the dependent variable listprice is specified on the left side of the tilde and the independent variables on the right side. The zone dummy z0 is left out of the formula and therefore serves as the reference zone. summary() is then applied to r1 to obtain, most importantly, the coefficient estimates, the p-values and the coefficient of determination, i.e. R-squared. Living area “s”, lot size “lots” and basement “bas” are statistically significant at the 0.1% significance level, so there is strong evidence that these three variables affect the house price. The estimate for lane behind “la” is not statistically significant even at the 10% level, so there is no evidence that it affects the listing price of houses. Most of the zone dummies are statistically insignificant (the exceptions are z5 and z9), but these estimates are not of primary interest, as the dummies are included to isolate the effects of the other independent variables.

attach(HomeSales)
r1<-
  lm(listprice~s+lots+la+bas+
  z1+z2+z3+z4+z5+z6+z7+z8+z9)
summary(r1)

## 
## Call:
## lm(formula = listprice ~ s + lots + la + bas + z1 + z2 + z3 + 
##     z4 + z5 + z6 + z7 + z8 + z9)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33971  -5793   -278   6153  56660 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36885.76    7245.46    5.09  7.5e-07 ***
## s             577.70      32.19   17.95  < 2e-16 ***
## lots           29.84       5.39    5.53  8.7e-08 ***
## la          -2306.32    1717.68   -1.34  0.18072    
## bas          4169.08    1136.77    3.67  0.00031 ***
## z1           2622.23    3492.60    0.75  0.45356    
## z2           1546.82    3845.28    0.40  0.68787    
## z3           4997.75    4255.79    1.17  0.24149    
## z4           4324.45    3953.59    1.09  0.27521    
## z5          10890.48    3887.85    2.80  0.00553 ** 
## z6            352.81    3712.14    0.10  0.92436    
## z7          -4846.45    3877.89   -1.25  0.21268    
## z8            409.19    3913.21    0.10  0.91681    
## z9          24309.70    5731.25    4.24  3.2e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10600 on 226 degrees of freedom
## Multiple R-squared:  0.745,  Adjusted R-squared:  0.731 
## F-statistic: 50.9 on 13 and 226 DF,  p-value: <2e-16
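
The t values and p-values in the coefficient table follow directly from the estimates and standard errors. As a small check, the sketch below recomputes them for three rows of the output above. The numbers are copied by hand from that output, and the variable names are only illustrative.

```r
# Estimates and standard errors copied from the summary(r1) output.
est <- c(s = 577.70, la = -2306.32, bas = 4169.08)
se  <- c(s = 32.19,  la = 1717.68,  bas = 1136.77)
df_resid <- 226  # residual degrees of freedom reported by summary(r1)

# t value = estimate / standard error.
t_value <- est / se

# Two-sided p-value from the t distribution with df_resid degrees of freedom.
p_value <- 2 * pt(-abs(t_value), df = df_resid)

print(round(t_value, 2))
print(signif(p_value, 3))
```

Up to rounding of the inputs, this reproduces the t values 17.95, -1.34 and 3.67 and the p-value of about 0.18 for la.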

Question 4

Whether the linear model in question 3 is a good model cannot be fully answered without also checking for heteroskedasticity and multicollinearity, which is done in question 5.

The adjusted R-squared is 0.731 (the multiple R-squared is 0.745), indicating that about 73% of the variation in the dependent variable is explained by the independent variables. This is evidence of a “good” model, as it shows that the independent variables jointly account for much of the variation in house price. However, a model with a very low R-squared can also be correctly specified, and hence a “good” model. The F-statistic of 50.9 tests the null hypothesis that all slope coefficients are zero. Its p-value is below 2*10^-16, meaning that an F-statistic this large would be very unlikely if none of the independent variables had an effect on the listing price. At least one independent variable therefore has a statistically significant effect, which is also evidence of a “good” model.
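
The adjusted R-squared and the F-statistic can both be derived from the multiple R-squared, the sample size and the number of regressors. The sketch below does this by hand from the rounded figures in the regression output (R-squared 0.745, 240 observations, 13 regressors), so small rounding differences from the reported values are expected.

```r
r2 <- 0.745  # multiple R-squared, rounded, from summary(r1)
n  <- 240    # number of observations
p  <- 13     # number of regressors, excluding the intercept

# Adjusted R-squared penalises R-squared for the number of regressors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic for the null that all p slope coefficients are zero.
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

print(round(c(adj_r2 = adj_r2, f_stat = f_stat), 3))
```

With the rounded R-squared this gives an adjusted R-squared of about 0.730 and an F-statistic of about 50.8, in line with the reported 0.731 and 50.9.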

Question 5

First, the Breusch-Pagan test is performed with bptest() to test for heteroskedasticity in the model r1. The null hypothesis of homoskedasticity is rejected with a p-value far below 0.01 (4e-09), indicating clear heteroskedasticity. Heteroskedasticity does not bias the OLS coefficient estimates, but it makes the usual standard errors, and therefore the t-tests and p-values in question 3, unreliable.

Next, the fitted values and residuals of r1 are saved with fitted() and resid() respectively, and combined column-wise into a data frame. The scatter plot is drawn with geom_point(), with the fitted values FIT on the x-axis and the residuals RES on the y-axis. geom_hline() adds a horizontal line at y=0. Labels are added in labs(), the text size is increased in theme_gray() and, as in question 2, the title is centered using theme() with hjust=0.5. As the spread of the residuals varies along the x-axis, homoskedasticity cannot be assumed. This agrees with the test result and counts against the model.

lmtest::bptest(r1)

## 
##  studentized Breusch-Pagan test
## 
## data:  r1
## BP = 66, df = 13, p-value = 4e-09

FIT<-fitted(r1)
RES<-resid(r1)
F2<-as.data.frame(cbind(FIT,RES))

F2%>%
  ggplot(aes(x=FIT,y=RES))+
  geom_point()+
  geom_hline(yintercept=0)+
  labs(x="Fitted values",
       y="Residuals",
       title="Residuals vs fitted values")+
  theme_gray(base_size=24)+
  theme(plot.title=element_text(hjust=0.5))
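
To illustrate what the Breusch-Pagan test measures, the sketch below computes the studentized version of the statistic by hand in base R on simulated data. To my understanding this is the quantity bptest() reports by default, but the sketch does not call lmtest, and the data are invented so that the error variance grows with x.

```r
set.seed(42)
n <- 500
x <- runif(n, 1, 10)

# The error standard deviation grows with x: heteroskedastic by construction.
y <- 2 + 3 * x + rnorm(n, sd = x)

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the regressors.
aux <- lm(resid(fit)^2 ~ x)

# Studentized (Koenker) statistic: n times the R-squared of the auxiliary fit.
bp_stat <- n * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1  # one degree of freedom per regressor
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

print(c(BP = bp_stat, df = bp_df, p_value = bp_p))
```

A small p-value means that the squared residuals are systematically related to the regressors, which is the pattern seen as a changing spread in a residuals-versus-fitted plot.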

To detect whether there is a relevant level of multicollinearity, one can create a correlation matrix. First, listprice and the independent variables other than the zone dummies are selected into the data set HomeSales3. Then the correlation matrix of HomeSales3 is computed and rounded to two digits, giving the matrix HS. A function is then defined that keeps only the diagonal and the lower triangle of the correlation matrix by setting the upper triangle to NA. To plot the matrix on two axes, it has to be reshaped into long format with melt(), which produces one row per pair of variables with the columns “Var1”, “Var2” and “value”, where “value” holds the correlation. While melting the correlation matrix into the data frame, the NA entries are dropped by using na.rm=TRUE.

The plot is then drawn with ggplot2 from the melted data set MELTLHS, denoting MELTed Lower HomeSales. Var1 and Var2 are mapped to the axes, and the fill color runs from white (for a correlation of 0) to full turquoise (for a correlation of 1), with intermediate shades for values in between. In labs(), the axis labels are set to empty strings to remove the defaults Var1 and Var2, and a title is added. The base text size is set to 24 and the title is centered with hjust=0.5. Finally, geom_text() prints the correlation value in each tile, with a text size of 6. Some correlation between the independent variables is detected here: lane behind “la” is correlated with both lot size and living area, with correlation values exceeding 0.2 in each case.

# reshape2 provides melt() for matrices (columns Var1, Var2, value)
library(reshape2)

HomeSales3<-
  HomeSales%>%
  select(c("listprice","s","lots","la","bas"))
HS<-
  round(cor(HomeSales3),2)  

# Keep the diagonal and lower triangle; blank out the upper triangle
get_lower_hs<-function(HS){
  HS[upper.tri(HS)]<-NA
  return(HS)
}
LHS<-get_lower_hs(HS)
MELTLHS<-melt(LHS,na.rm=TRUE)
MELTLHS%>%
  ggplot(aes(x=Var1,y=Var2,fill=value))+
  geom_tile(color="black")+
  coord_fixed()+
  scale_fill_gradient2(low="white",high="turquoise",name="Correlation\nValue")+ 
  labs(x="",
       y="",
       title="Correlation matrix")+
  theme_gray(base_size=24)+
  theme(plot.title=element_text(hjust=0.5))+
  geom_text(aes(Var1,Var2,label=value),size=6)
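
The reshaping step can also be reproduced in base R, which shows what the long format contains. The sketch below uses an invented three-variable data frame (toy) and as.table() instead of melt(). The column names Var1, Var2 and value are chosen to mirror the ones used above.

```r
set.seed(7)

# Invented data standing in for HomeSales3.
toy <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))

# Rounded correlation matrix with the upper triangle blanked out.
cm <- round(cor(toy), 2)
cm[upper.tri(cm)] <- NA

# Long format: one row per (Var1, Var2) pair, then drop the NA entries.
long <- as.data.frame(as.table(cm), stringsAsFactors = FALSE)
names(long) <- c("Var1", "Var2", "value")
long <- long[!is.na(long$value), ]

print(long)
```

For k variables the result has k(k+1)/2 rows: the k diagonal entries, which are all 1, plus one row for each distinct pair of variables.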

In the variance inflation factor (VIF) values below, none of the main independent variables (s, lots, la and bas) exceeds a value of 2, indicating no relevant multicollinearity problem among them in the estimated linear model. The average VIF value (AVGVIF) is then computed by taking the sum of the variance inflation factors and dividing by the number of coefficients with length(). Note that length(coefficients(r1)) counts 14 coefficients including the intercept, whereas there are only 13 VIF values, so the reported 2.26 slightly understates the mean of the VIFs, which is about 2.44 based on the rounded values shown. Either way, the average is well below a cut-off value of 5, indicating no relevant multicollinearity problem in the estimated linear model.

vif(r1)

##    s lots   la  bas   z1   z2   z3   z4   z5   z6   z7   z8 
## 1.64 1.20 1.54 1.17 4.77 3.25 2.54 3.11 2.80 3.31 2.67 2.27 
##   z9 
## 1.43

AVGVIF<-sum(vif(r1))/length(coefficients(r1))
AVGVIF

## [1] 2.26
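
For a numeric regressor, the VIF is 1/(1 - R-squared), where the R-squared comes from regressing that variable on all the other regressors. The sketch below computes this by hand in base R on simulated data (x1, x2 and x3 are invented) and compares it with the diagonal of the inverse correlation matrix, which gives the same numbers.

```r
set.seed(3)
n <- 200

# Invented regressors: x2 is constructed to be correlated with x1.
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)
x3 <- rnorm(n)
X <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing x_j on the others.
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

# Equivalent shortcut: diagonal of the inverse correlation matrix.
vif_inverse <- diag(solve(cor(X)))

# The mean VIF divides by the number of VIF values (one per regressor).
mean_vif <- mean(vif_manual)

print(round(rbind(manual = vif_manual, inverse = vif_inverse), 3))
print(round(mean_vif, 3))
```

A VIF is never below 1, and values close to 1 indicate that a regressor is nearly uncorrelated with the others.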

Final Project

1ST817: Statistical Data Analysis

Olle Scherling

2021-11-16
