First, let us load the data and necessary packages:
load("ames_train.Rdata")
library(devtools)
library(MASS)
library(dplyr)
library(ggplot2)
library(statsr)
library(modes)
library(Hmisc)
library(stats)
Make a labeled histogram (with 30 bins) of the ages of the houses in the data set, and describe the distribution.
# type your code for Question 1 here, and Knit
# Define a variable "age" as (2017 - Year.Built) and draw a histogram.
# To describe the distribution, compute the range, modes, skewness and kurtosis.
# Draw a density plot to confirm the multimodal nature of age.
age <- 2017 - ames_train$Year.Built
range(age)
## [1] 7 145
hist(age, breaks = 30, col = "blue", main = "Distribution of House Age in 30 bins")
modes(age, type = 1, digits = "NULL", nmore = "NULL")
## [,1]
## Value 12
## Length 49
skewness(age, finite = TRUE)
## [1] 0.6591292
kurtosis(age, finite=TRUE)
## [1] -0.3243304
plot(density(age))
da=density(age)
da
##
## Call:
## density.default(x = age)
##
## Data: age (1000 obs.); Bandwidth 'bw' = 6.7
##
## x y
## Min. :-13.10 Min. :6.990e-07
## 1st Qu.: 31.45 1st Qu.:5.257e-04
## Median : 76.00 Median :4.466e-03
## Mean : 76.00 Mean :5.606e-03
## 3rd Qu.:120.55 3rd Qu.:9.901e-03
## Max. :165.10 Max. :1.825e-02
summary(age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.0 16.0 42.0 44.8 62.0 145.0
describe(age)
## age
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 102 1 44.8 33.14 10 11
## .25 .50 .75 .90 .95
## 16 42 62 92 98
##
## lowest : 7 8 9 10 11, highest: 122 127 132 137 145
Answer to Question 1 First I define age as 2017 minus Year.Built, then use breaks = 30 to get 30 bins and make a histogram of the distribution. The ages of the houses range from 7 to 145 years (computed as of 2017), with a mean of 44.8 years, a median of 42 years, a minimum of 7 and a maximum of 145. The distribution is slightly right-skewed (skewness 0.66) and multimodal, with at least two peaks, which the density plot confirms.
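As a visual cross-check of the multimodality claim, here is a minimal sketch that overlays a density curve on the 30-bin histogram, assuming age and ggplot2 as loaded above:
# Sketch: histogram of age with a density overlay to show the multiple peaks
ggplot(data.frame(age = age), aes(x = age)) +
  geom_histogram(aes(y = ..density..), bins = 30, fill = "steelblue", colour = "white") +
  geom_density(colour = "red") +
  labs(title = "Distribution of House Age with density overlay",
       x = "Age (years, as of 2017)", y = "Density")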
The mantra in real estate is “Location, Location, Location!” Make a graphical display that relates a home price to its neighborhood in Ames, Iowa. Which summary statistics are most appropriate to use for determining the most expensive, least expensive, and most heterogeneous (having the most variation in housing price) neighborhoods? Report which neighborhoods these are based on the summary statistics of your choice. Report the value of your chosen summary statistics for these neighborhoods.
ames_train %>%
  select(Neighborhood, price) %>%
  ggplot(aes(x = reorder(Neighborhood, -price, FUN = median), y = price)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab("Neighborhoods ordered by median price") +
  ggtitle("Price vs. Ames Neighborhood")
ames_train %>%
  group_by(Neighborhood) %>%
  summarise(median_price = median(price), sd_price = sd(price)) %>%
  arrange(desc(median_price)) %>%
  print(n = 20)
## # A tibble: 27 × 3
## Neighborhood median_price sd_price
## <fctr> <dbl> <dbl>
## 1 StoneBr 340691.5 123459.10
## 2 NridgHt 336860.0 105088.90
## 3 NoRidge 290000.0 35888.97
## 4 GrnHill 280000.0 70710.68
## 5 Timber 232500.0 84029.57
## 6 Somerst 221650.0 65199.49
## 7 Greens 212625.0 29063.42
## 8 Veenker 205750.0 72545.41
## 9 Crawfor 205000.0 71267.56
## 10 CollgCr 195800.0 52786.08
## 11 Blmngtn 191000.0 26454.86
## 12 ClearCr 185000.0 48068.69
## 13 NWAmes 185000.0 41340.50
## 14 Gilbert 183500.0 41190.38
## 15 SawyerW 182500.0 48354.36
## 16 Mitchel 156500.0 39682.94
## 17 NPkVill 142100.0 11958.37
## 18 NAmes 139900.0 27267.97
## 19 Sawyer 136000.0 21216.22
## 20 SWISU 134000.0 27375.76
## # ... with 7 more rows
Answer to Question 2 The median is more robust to outliers than the mean, so I used the median price to compare home prices across neighborhoods, and the standard deviation to measure how heterogeneous each neighborhood is. A selection was made by Neighborhood and price. The plot and summary table show: Most expensive: Stone Brook (median price 340691.5 USD). Least expensive: Meadow Village (median price 85750.0 USD). Most heterogeneous: Stone Brook (standard deviation 123459.10 USD).
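The three neighborhoods can also be pulled straight out of the grouped summary; a small sketch, assuming ames_train and dplyr as loaded above:
# Sketch: extract the most expensive, least expensive, and most heterogeneous neighborhoods
nbhd_stats <- ames_train %>%
  group_by(Neighborhood) %>%
  summarise(median_price = median(price), sd_price = sd(price))
nbhd_stats %>% filter(median_price == max(median_price)) # most expensive
nbhd_stats %>% filter(median_price == min(median_price)) # least expensive
nbhd_stats %>% filter(sd_price == max(sd_price))         # most heterogeneous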
Which variable has the largest number of missing values? Explain why it makes sense that there are so many missing values for this variable.
# Overall fraction of missing cells in the data set
sum(is.na(ames_train)) / (nrow(ames_train) * ncol(ames_train))
## [1] 0.05816049
# Number of missing values per variable
colSums(sapply(ames_train, is.na))
## PID area price MS.SubClass
## 0 0 0 0
## MS.Zoning Lot.Frontage Lot.Area Street
## 0 167 0 0
## Alley Lot.Shape Land.Contour Utilities
## 933 0 0 0
## Lot.Config Land.Slope Neighborhood Condition.1
## 0 0 0 0
## Condition.2 Bldg.Type House.Style Overall.Qual
## 0 0 0 0
## Overall.Cond Year.Built Year.Remod.Add Roof.Style
## 0 0 0 0
## Roof.Matl Exterior.1st Exterior.2nd Mas.Vnr.Type
## 0 0 0 0
## Mas.Vnr.Area Exter.Qual Exter.Cond Foundation
## 7 0 0 0
## Bsmt.Qual Bsmt.Cond Bsmt.Exposure BsmtFin.Type.1
## 21 21 21 21
## BsmtFin.SF.1 BsmtFin.Type.2 BsmtFin.SF.2 Bsmt.Unf.SF
## 1 21 1 1
## Total.Bsmt.SF Heating Heating.QC Central.Air
## 1 0 0 0
## Electrical X1st.Flr.SF X2nd.Flr.SF Low.Qual.Fin.SF
## 0 0 0 0
## Bsmt.Full.Bath Bsmt.Half.Bath Full.Bath Half.Bath
## 1 1 0 0
## Bedroom.AbvGr Kitchen.AbvGr Kitchen.Qual TotRms.AbvGrd
## 0 0 0 0
## Functional Fireplaces Fireplace.Qu Garage.Type
## 0 0 491 46
## Garage.Yr.Blt Garage.Finish Garage.Cars Garage.Area
## 48 46 1 1
## Garage.Qual Garage.Cond Paved.Drive Wood.Deck.SF
## 47 47 0 0
## Open.Porch.SF Enclosed.Porch X3Ssn.Porch Screen.Porch
## 0 0 0 0
## Pool.Area Pool.QC Fence Misc.Feature
## 0 997 798 971
## Misc.Val Mo.Sold Yr.Sold Sale.Type
## 0 0 0 0
## Sale.Condition
## 0
plot_Missing <- function(data_in, title = NULL){
  temp_df <- as.data.frame(ifelse(is.na(data_in), 0, 1))
  temp_df <- temp_df[, order(colSums(temp_df))]
  data_temp <- expand.grid(list(x = 1:nrow(temp_df), y = colnames(temp_df)))
  data_temp$m <- as.vector(as.matrix(temp_df))
  data_temp <- data.frame(x = unlist(data_temp$x), y = unlist(data_temp$y), m = unlist(data_temp$m))
  ggplot(data_temp) +
    geom_tile(aes(x = x, y = y, fill = factor(m))) +
    scale_fill_manual(values = c("white", "black"), name = "Missing\n(0=Yes, 1=No)") +
    theme_light() +
    ylab("Variable") + xlab("Observation no.") +
    ggtitle("Missing Values")
}
plot_Missing(ames_train[,colSums(is.na(ames_train)) >0])
max(colSums(is.na(ames_train)))
## [1] 997
summary(ames_train$Pool.QC)
## Ex Fa Gd TA NA's
## 1 1 1 0 997
Answer to Q3 The variable Pool.QC, with 997 of 1000 values missing (99.7%), has the largest number of missing values. It measures the quality of a pool, and only three houses in the data set have a pool; that is why so many values are missing.
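This can be cross-checked against Pool.Area; a quick sketch, assuming a positive pool area means the home has a pool:
# Sketch: homes with a pool should equal homes with a non-missing Pool.QC
sum(ames_train$Pool.Area > 0)   # homes that actually have a pool
sum(!is.na(ames_train$Pool.QC)) # homes with a recorded pool quality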
We want to predict the natural log of the home prices. Candidate explanatory variables are lot size in square feet (Lot.Area), slope of property (Land.Slope), original construction date (Year.Built), remodel date (Year.Remod.Add), and the number of bedrooms above grade (Bedroom.AbvGr). Pick a model selection or model averaging method covered in the Specialization, and describe how this method works. Then, use this method to find the best multiple regression model for predicting the natural log of the home prices.
# type your code for Question 4 here, and Knit
# Keep the response (price) and the five candidate predictors
vars <- names(ames_train) %in% c("price","Lot.Area","Land.Slope","Year.Built","Year.Remod.Add","Bedroom.AbvGr")
ames2 <- ames_train[vars]
names(ames2)
## [1] "price" "Lot.Area" "Land.Slope" "Year.Built"
## [5] "Year.Remod.Add" "Bedroom.AbvGr"
ames2$Land.Slope <- as.integer(ames2$Land.Slope)
str(ames2)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1000 obs. of 6 variables:
## $ price : int 126000 139500 124900 114000 227000 198500 93000 187687 137500 140000 ...
## $ Lot.Area : int 7890 4235 6060 8146 8400 7301 6000 3710 12395 3675 ...
## $ Land.Slope : int 1 1 1 1 1 1 2 1 1 1 ...
## $ Year.Built : int 1939 1984 1930 1900 2001 2003 1953 2007 1984 2005 ...
## $ Year.Remod.Add: int 1950 1984 2007 2003 2001 2003 1953 2008 1984 2005 ...
## $ Bedroom.AbvGr : int 2 2 2 2 3 4 2 2 3 2 ...
# Find the percentage of missing values in each column
MissingData <- function(x){sum(is.na(x))/length(x)*100}
apply(ames2, 2, MissingData)
## price Lot.Area Land.Slope Year.Built Year.Remod.Add
## 0 0 0 0 0
## Bedroom.AbvGr
## 0
summary(ames2)
## price Lot.Area Land.Slope Year.Built
## Min. : 12789 Min. : 1470 Min. :1.000 Min. :1872
## 1st Qu.:129763 1st Qu.: 7314 1st Qu.:1.000 1st Qu.:1955
## Median :159467 Median : 9317 Median :1.000 Median :1975
## Mean :181190 Mean : 10352 Mean :1.043 Mean :1972
## 3rd Qu.:213000 3rd Qu.: 11650 3rd Qu.:1.000 3rd Qu.:2001
## Max. :615000 Max. :215245 Max. :3.000 Max. :2010
## Year.Remod.Add Bedroom.AbvGr
## Min. :1950 Min. :0.000
## 1st Qu.:1966 1st Qu.:2.000
## Median :1992 Median :3.000
## Mean :1984 Mean :2.806
## 3rd Qu.:2004 3rd Qu.:3.000
## Max. :2010 Max. :6.000
# Build a model with step(), combining forward selection and backward
# elimination guided by AIC; also measure BIC.
# Model the best fit after log-transforming the price variable.
fit <- lm(log(ames2$price) ~ ., data = ames2) # fit a multiple linear regression
magnus <- step(fit,direction = "both")
## Start: AIC=-2530.07
## log(ames2$price) ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add +
## Bedroom.AbvGr
##
## Df Sum of Sq RSS AIC
## - Land.Slope 1 0.0465 78.750 -2531.5
## <none> 78.704 -2530.1
## - Bedroom.AbvGr 1 5.1713 83.875 -2468.4
## - Lot.Area 1 5.3617 84.065 -2466.2
## - Year.Remod.Add 1 12.4099 91.114 -2385.7
## - Year.Built 1 19.6773 98.381 -2308.9
##
## Step: AIC=-2531.47
## log(ames2$price) ~ Lot.Area + Year.Built + Year.Remod.Add + Bedroom.AbvGr
##
## Df Sum of Sq RSS AIC
## <none> 78.750 -2531.5
## + Land.Slope 1 0.0465 78.704 -2530.1
## - Bedroom.AbvGr 1 5.1250 83.875 -2470.4
## - Lot.Area 1 7.1223 85.872 -2446.9
## - Year.Remod.Add 1 12.3912 91.141 -2387.3
## - Year.Built 1 19.6618 98.412 -2310.6
summary(magnus)
##
## Call:
## lm(formula = log(ames2$price) ~ Lot.Area + Year.Built + Year.Remod.Add +
## Bedroom.AbvGr, data = ames2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.09102 -0.16488 -0.02135 0.16605 1.13359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.385e+01 8.633e-01 -16.049 < 2e-16 ***
## Lot.Area 8.697e-06 9.168e-07 9.486 < 2e-16 ***
## Year.Built 6.019e-03 3.819e-04 15.761 < 2e-16 ***
## Year.Remod.Add 6.889e-03 5.505e-04 12.512 < 2e-16 ***
## Bedroom.AbvGr 8.704e-02 1.082e-02 8.047 2.4e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2813 on 995 degrees of freedom
## Multiple R-squared: 0.5544, Adjusted R-squared: 0.5526
## F-statistic: 309.5 on 4 and 995 DF, p-value: < 2.2e-16
anova(magnus)
## Analysis of Variance Table
##
## Response: log(ames2$price)
## Df Sum Sq Mean Sq F value Pr(>F)
## Lot.Area 1 10.392 10.392 131.306 < 2.2e-16 ***
## Year.Built 1 69.442 69.442 877.389 < 2.2e-16 ***
## Year.Remod.Add 1 13.018 13.018 164.483 < 2.2e-16 ***
## Bedroom.AbvGr 1 5.125 5.125 64.754 2.404e-15 ***
## Residuals 995 78.750 0.079
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
BIC(fit)
## [1] 344.166
BIC(magnus)
## [1] 337.8492
Answer to Question 4 Step 1: Construct a new data frame, ames2, consisting of six variables from ames_train, and convert Land.Slope to an integer. Step 2: Run a multiple regression on ames2 with log(price) as the dependent variable and the five remaining variables as independent variables. Step 3: Use the step() method, which combines forward selection and backward elimination: starting from the full model, it repeatedly adds or drops the single variable that most improves the AIC, and stops when no addition or removal can lower the AIC further. Step 4: The usual manual elimination criteria back this up: prefer the higher adjusted R-squared, keep predictors with p-values of 0.05 or less, drop terms with low F values, and check the residuals (they should be normally distributed and independent, with constant variance and no extreme outliers or leverage points). The log transformation of price improves the results. The selected model drops Land.Slope and keeps Lot.Area, Year.Built, Year.Remod.Add and Bedroom.AbvGr; its BIC (the Bayesian information criterion) of 337.85 is lower than the full model's 344.17.
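To make the two directions concrete, here is a minimal sketch of running forward selection and backward elimination separately (direction = "both", used above, alternates between them); it assumes the ames2 data frame from above:
# Sketch: the two directions that direction = "both" combines
null_fit <- lm(log(price) ~ 1, data = ames2) # intercept-only model
full_fit <- lm(log(price) ~ Lot.Area + Land.Slope + Year.Built +
                 Year.Remod.Add + Bedroom.AbvGr, data = ames2)
# Forward: start empty and add the variable that lowers AIC most at each step
forward <- step(null_fit,
                scope = ~ Lot.Area + Land.Slope + Year.Built +
                  Year.Remod.Add + Bedroom.AbvGr,
                direction = "forward")
# Backward: start full and drop the variable whose removal lowers AIC most
backward <- step(full_fit, direction = "backward")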
Which home has the largest squared residual in the previous analysis (Question 4)? Looking at all the variables in the data set, can you explain why this home stands out from the rest (what factors contribute to the high squared residual and why are those factors relevant)?
# type your code for Question 5 here, and Knit
# 1. Construct an extract of the training data set with 7 variables.
#    PID is a unique identifier for each observation, which will be
#    essential for selecting the home with the largest squared residual.
ames3 <- ames_train %>%
select(PID, price, Lot.Area, Land.Slope, Year.Built, Year.Remod.Add, Bedroom.AbvGr)
plm <- lm(log(price) ~ Lot.Area + Land.Slope + Year.Built +
            Year.Remod.Add + Bedroom.AbvGr, data = na.omit(ames3))
# k = log(n) makes stepAIC() use the BIC penalty instead of the default AIC
stplm <- stepAIC(plm, trace = TRUE, k = log(nrow(na.omit(ames3))))
## Start: AIC=-2511.42
## log(price) ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add +
## Bedroom.AbvGr
##
## Df Sum of Sq RSS AIC
## <none> 77.322 -2511.4
## - Land.Slope 2 1.4281 78.750 -2506.9
## - Bedroom.AbvGr 1 5.0628 82.385 -2454.9
## - Lot.Area 1 6.7292 84.051 -2434.9
## - Year.Remod.Add 1 11.9642 89.286 -2374.5
## - Year.Built 1 19.8546 97.177 -2289.8
BIC(stplm)
## [1] 333.3642
# Build a table of log prices, fitted values, and (squared) residuals,
# then sort in descending order of squared residual
residuals <- ames3 %>%
  select(PID, price) %>%
  mutate(log_price = log(price))
residuals$predicted <- predict(stplm)
residuals$residuals <- residuals(stplm)
residuals$squared_residuals <- residuals(stplm)^2
residuals <- residuals %>%
  arrange(desc(squared_residuals))
residuals[c(1:10),]
## # A tibble: 10 × 6
## PID price log_price predicted residuals squared_residuals
## <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 902207130 12789 9.456341 11.54419 -2.0878529 4.3591298
## 2 534427010 84900 11.349229 12.37473 -1.0254998 1.0516498
## 3 911175430 35311 10.471950 11.47230 -1.0003537 1.0007075
## 4 528164060 615000 13.329378 12.33487 0.9945048 0.9890399
## 5 534450090 39300 10.578980 11.55148 -0.9724980 0.9457524
## 6 528150070 611657 13.323927 12.36909 0.9548393 0.9117180
## 7 902477120 34900 10.460242 11.37222 -0.9119760 0.8317003
## 8 911102170 40000 10.596635 11.46546 -0.8688291 0.7548640
## 9 905427030 415000 12.936034 12.11720 0.8188343 0.6704895
## 10 902326030 265979 12.491173 11.70249 0.7886795 0.6220154
ames_train[which(ames_train$PID == 902207130),]
## # A tibble: 1 × 81
## PID area price MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street
## <int> <int> <int> <int> <fctr> <int> <int> <fctr>
## 1 902207130 832 12789 30 RM 68 9656 Pave
## # ... with 73 more variables: Alley <fctr>, Lot.Shape <fctr>,
## # Land.Contour <fctr>, Utilities <fctr>, Lot.Config <fctr>,
## # Land.Slope <fctr>, Neighborhood <fctr>, Condition.1 <fctr>,
## # Condition.2 <fctr>, Bldg.Type <fctr>, House.Style <fctr>,
## # Overall.Qual <int>, Overall.Cond <int>, Year.Built <int>,
## # Year.Remod.Add <int>, Roof.Style <fctr>, Roof.Matl <fctr>,
## # Exterior.1st <fctr>, Exterior.2nd <fctr>, Mas.Vnr.Type <fctr>,
## # Mas.Vnr.Area <int>, Exter.Qual <fctr>, Exter.Cond <fctr>,
## # Foundation <fctr>, Bsmt.Qual <fctr>, Bsmt.Cond <fctr>,
## # Bsmt.Exposure <fctr>, BsmtFin.Type.1 <fctr>, BsmtFin.SF.1 <int>,
## # BsmtFin.Type.2 <fctr>, BsmtFin.SF.2 <int>, Bsmt.Unf.SF <int>,
## # Total.Bsmt.SF <int>, Heating <fctr>, Heating.QC <fctr>,
## # Central.Air <fctr>, Electrical <fctr>, X1st.Flr.SF <int>,
## # X2nd.Flr.SF <int>, Low.Qual.Fin.SF <int>, Bsmt.Full.Bath <int>,
## # Bsmt.Half.Bath <int>, Full.Bath <int>, Half.Bath <int>,
## # Bedroom.AbvGr <int>, Kitchen.AbvGr <int>, Kitchen.Qual <fctr>,
## # TotRms.AbvGrd <int>, Functional <fctr>, Fireplaces <int>,
## # Fireplace.Qu <fctr>, Garage.Type <fctr>, Garage.Yr.Blt <int>,
## # Garage.Finish <fctr>, Garage.Cars <int>, Garage.Area <int>,
## # Garage.Qual <fctr>, Garage.Cond <fctr>, Paved.Drive <fctr>,
## # Wood.Deck.SF <int>, Open.Porch.SF <int>, Enclosed.Porch <int>,
## # X3Ssn.Porch <int>, Screen.Porch <int>, Pool.Area <int>,
## # Pool.QC <fctr>, Fence <fctr>, Misc.Feature <fctr>, Misc.Val <int>,
## # Mo.Sold <int>, Yr.Sold <int>, Sale.Type <fctr>, Sale.Condition <fctr>
model.resid<-stplm$residuals
plot(model.resid, main="Residuals",pch=20)
abline(0,0, lwd=2, col="blueviolet")
qqnorm(residuals(stplm))
qqline(residuals(stplm))
Answer to Question 5 Every fitted lm model has components: "coefficients", "residuals", "effects", "rank", "fitted.values", "assign", "qr", "df.residual", "xlevels", "call", "terms" and "model".
I extracted the residuals from the fitted model (stplm), squared them, sorted them in descending order, and picked the PID with the largest squared residual, then looked that PID up in ames_train.
The residual plot and normal Q-Q plot show that PID 902207130 stands out: its squared residual is more than 4 times the second largest, so it appears to be an outlier.
This home sold for the lowest price in the data set (12789 USD). The house is 91 years old and located in Old Town, its overall quality is poor, its square footage (832) is below average, it has 2 bedrooms, and it was sold in 2010. All of these factors make it the lowest-priced home, which is why its squared residual is the largest.
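A small sketch pulling out the fields behind this home's large residual, assuming PID 902207130 as identified above:
# Sketch: inspect the factors that contribute to the largest squared residual
ames_train %>%
  filter(PID == 902207130) %>%
  select(price, area, Neighborhood, Overall.Qual, Overall.Cond,
         Year.Built, Bedroom.AbvGr, Yr.Sold)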
Use the same model selection method you chose in Question 4 to again find the best multiple regression model to predict the natural log of home prices, but this time replacing Lot.Area with log(Lot.Area). Do you arrive at a model including the same set of predictors?
# type your code for Question 6 here, and Knit
fitlog <- lm(log(ames3$price) ~ log(ames3$Lot.Area) + ames3$Land.Slope +
               ames3$Year.Built + ames3$Year.Remod.Add + ames3$Bedroom.AbvGr,
             data = ames3)
magnuslog1 <- stepAIC(fitlog, trace = T, k = log(nrow(na.omit(ames3))))
## Start: AIC=-2615.14
## log(ames3$price) ~ log(ames3$Lot.Area) + ames3$Land.Slope + ames3$Year.Built +
## ames3$Year.Remod.Add + ames3$Bedroom.AbvGr
##
## Df Sum of Sq RSS AIC
## - ames3$Land.Slope 2 0.4429 70.147 -2622.6
## <none> 69.704 -2615.1
## - ames3$Bedroom.AbvGr 1 2.2002 71.904 -2591.0
## - ames3$Year.Remod.Add 1 11.8182 81.522 -2465.4
## - log(ames3$Lot.Area) 1 14.3474 84.051 -2434.9
## - ames3$Year.Built 1 19.4083 89.112 -2376.4
##
## Step: AIC=-2622.62
## log(ames3$price) ~ log(ames3$Lot.Area) + ames3$Year.Built + ames3$Year.Remod.Add +
## ames3$Bedroom.AbvGr
##
## Df Sum of Sq RSS AIC
## <none> 70.147 -2622.6
## - ames3$Bedroom.AbvGr 1 2.0816 72.228 -2600.3
## - ames3$Year.Remod.Add 1 11.9455 82.092 -2472.3
## - log(ames3$Lot.Area) 1 15.7256 85.872 -2427.3
## - ames3$Year.Built 1 19.3032 89.450 -2386.4
anova(stplm)
## Analysis of Variance Table
##
## Response: log(price)
## Df Sum Sq Mean Sq F value Pr(>F)
## Lot.Area 1 10.392 10.392 133.462 < 2.2e-16 ***
## Land.Slope 2 2.424 1.212 15.564 2.211e-07 ***
## Year.Built 1 68.991 68.991 886.012 < 2.2e-16 ***
## Year.Remod.Add 1 12.535 12.535 160.982 < 2.2e-16 ***
## Bedroom.AbvGr 1 5.063 5.063 65.018 2.124e-15 ***
## Residuals 993 77.322 0.078
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(magnuslog1)
## Analysis of Variance Table
##
## Response: log(ames3$price)
## Df Sum Sq Mean Sq F value Pr(>F)
## log(ames3$Lot.Area) 1 24.338 24.338 345.224 < 2.2e-16 ***
## ames3$Year.Built 1 67.894 67.894 963.045 < 2.2e-16 ***
## ames3$Year.Remod.Add 1 12.267 12.267 174.000 < 2.2e-16 ***
## ames3$Bedroom.AbvGr 1 2.082 2.082 29.526 6.939e-08 ***
## Residuals 995 70.147 0.070
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
BIC(magnuslog1)
## [1] 222.1599
BIC(stplm)
## [1] 333.3642
improvement <- BIC(stplm) - BIC(magnuslog1)
improvement
## [1] 111.2043
ggplot(data = ames3, aes(x = Lot.Area)) +
geom_histogram() +
ggtitle("Lot.Area w/o a log-transformation")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = ames3, aes(x = log(Lot.Area))) +
geom_histogram() +
ggtitle("Lot.Area with after transformation")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Answer to Question 6 The log transformation reduces skewness, bringing the original distribution of the predictor closer to normal, which helps the regression model work better. Replacing Lot.Area with log(Lot.Area) achieves a BIC of 222.16, which is 111.20 below the BIC of 333.36 for the fit before the log transform, indicating higher predictive power for the second model. So no, the method does not arrive at the same set of predictors: once log(Lot.Area) is used, Land.Slope is dropped from the model (compare the ANOVA tables for stplm and magnuslog1 above). The most likely explanation is that Lot.Area is heavily right-skewed; after a log transformation it looks much more normal, as the two histograms show.
The residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage plots all become more linear when the regression is carried out with log(Lot.Area), so we get better predictions; this is shown in the plots and in the summaries of the two fitted models.
Constant error variance, linearity, normality of residuals, and independence of residuals are indicated in the plots.
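The skewness claim can be checked numerically; a minimal sketch using the skewness() function from the modes package loaded above (exact values will depend on the data):
# Sketch: skewness of Lot.Area before and after the log transform
skewness(ames_train$Lot.Area, finite = TRUE)      # heavily right-skewed
skewness(log(ames_train$Lot.Area), finite = TRUE) # much closer to symmetric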
Do you think it is better to log transform Lot.Area, in terms of assumptions for linear regression? Make graphs of the predicted values of log home price versus the true values of log home price for the regression models selected for Lot.Area and log(Lot.Area). Referencing these two plots, provide a written support that includes a quantitative justification for your answer in the first part of question 7.
# type your code for Question 7 here, and Knit
# Compare the log-transformed and untransformed models.
# The previously fitted models, before and after the log transform,
# are used to draw the residual diagnostic plots.
op <- par(mfrow = c(2, 3))
plot(stplm)
plot(magnuslog1)
summary(stplm)
##
## Call:
## lm(formula = log(price) ~ Lot.Area + Land.Slope + Year.Built +
## Year.Remod.Add + Bedroom.AbvGr, data = na.omit(ames3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0878 -0.1651 -0.0211 0.1657 0.9945
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.371e+01 8.574e-01 -15.996 < 2e-16 ***
## Lot.Area 1.028e-05 1.106e-06 9.296 < 2e-16 ***
## Land.SlopeMod 1.384e-01 4.991e-02 2.773 0.00565 **
## Land.SlopeSev -4.567e-01 1.514e-01 -3.016 0.00263 **
## Year.Built 6.049e-03 3.788e-04 15.968 < 2e-16 ***
## Year.Remod.Add 6.778e-03 5.468e-04 12.395 < 2e-16 ***
## Bedroom.AbvGr 8.686e-02 1.077e-02 8.063 2.12e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.279 on 993 degrees of freedom
## Multiple R-squared: 0.5625, Adjusted R-squared: 0.5598
## F-statistic: 212.8 on 6 and 993 DF, p-value: < 2.2e-16
summary(magnuslog1)
##
## Call:
## lm(formula = log(ames3$price) ~ log(ames3$Lot.Area) + ames3$Year.Built +
## ames3$Year.Remod.Add + ames3$Bedroom.AbvGr, data = ames3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.14609 -0.15825 -0.01477 0.15354 1.01578
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.557e+01 8.213e-01 -18.964 < 2e-16 ***
## log(ames3$Lot.Area) 2.471e-01 1.654e-02 14.935 < 2e-16 ***
## ames3$Year.Built 5.964e-03 3.604e-04 16.547 < 2e-16 ***
## ames3$Year.Remod.Add 6.765e-03 5.197e-04 13.017 < 2e-16 ***
## ames3$Bedroom.AbvGr 5.726e-02 1.054e-02 5.434 6.94e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2655 on 995 degrees of freedom
## Multiple R-squared: 0.6031, Adjusted R-squared: 0.6015
## F-statistic: 377.9 on 4 and 995 DF, p-value: < 2.2e-16
anova(stplm)
## Analysis of Variance Table
##
## Response: log(price)
## Df Sum Sq Mean Sq F value Pr(>F)
## Lot.Area 1 10.392 10.392 133.462 < 2.2e-16 ***
## Land.Slope 2 2.424 1.212 15.564 2.211e-07 ***
## Year.Built 1 68.991 68.991 886.012 < 2.2e-16 ***
## Year.Remod.Add 1 12.535 12.535 160.982 < 2.2e-16 ***
## Bedroom.AbvGr 1 5.063 5.063 65.018 2.124e-15 ***
## Residuals 993 77.322 0.078
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(magnuslog1)
## Analysis of Variance Table
##
## Response: log(ames3$price)
## Df Sum Sq Mean Sq F value Pr(>F)
## log(ames3$Lot.Area) 1 24.338 24.338 345.224 < 2.2e-16 ***
## ames3$Year.Built 1 67.894 67.894 963.045 < 2.2e-16 ***
## ames3$Year.Remod.Add 1 12.267 12.267 174.000 < 2.2e-16 ***
## ames3$Bedroom.AbvGr 1 2.082 2.082 29.526 6.939e-08 ***
## Residuals 995 70.147 0.070
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Answer to Question 7 Yes, it is better to log-transform Lot.Area: after the transformation, the residual plots of the multiple regression model improve. 1. Errors should be normally distributed with constant variance; the residual Q-Q plots suggest this assumption is fairly well met by both models. 2. Homogeneity of the residuals is met by both models and does not appear to be violated. 3. Independence of the residuals (uncorrelated errors) is better supported by the set of plots for the second model. 4. Outliers and high-leverage points can distort the predictions because they have outsized influence on the fit; according to the residuals vs. leverage plots, the influence of such points is much lower in the second model. Quantitatively, the log(Lot.Area) model has a higher adjusted R-squared (0.6015 vs. 0.5598), a lower residual standard error (0.2655 vs. 0.2790) and a lower BIC (222.16 vs. 333.36). Transforming variables is very useful: the log transformation of Lot.Area improves how well the model assumptions for multiple linear regression are met.
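The question also asks for plots of predicted versus true values of log home price. A minimal sketch of how these could be drawn from the two fitted models, assuming stplm and magnuslog1 as fitted above (a 45-degree line marks perfect prediction):
# Sketch: predicted vs. true log(price) for the Lot.Area and log(Lot.Area) models
op <- par(mfrow = c(1, 2))
plot(log(na.omit(ames3)$price), fitted(stplm),
     xlab = "True log(price)", ylab = "Predicted log(price)",
     main = "Model with Lot.Area", pch = 20)
abline(0, 1, col = "red", lwd = 2) # perfect-prediction reference line
plot(log(ames3$price), fitted(magnuslog1),
     xlab = "True log(price)", ylab = "Predicted log(price)",
     main = "Model with log(Lot.Area)", pch = 20)
abline(0, 1, col = "red", lwd = 2)
par(op)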