First, let us load the data and necessary packages:
load("ames_train.Rdata")
library(devtools)
library(MASS)
library(dplyr)
library(ggplot2)
library(statsr)
library(modes)
library(Hmisc)
library(stats)
Make a labeled histogram (with 30 bins) of the ages of the houses in the data set, and describe the distribution.
# type your code for Question 1 here, and Knit
# Define a variable "age" as (2017 - Year.Built) and draw a histogram.
# To describe the distribution, compute the range, modes, skewness and kurtosis.
# Draw a density plot to confirm the multimodal nature of age.
age <- 2017 - ames_train$Year.Built
range(age)
## [1] 7 145
hist(age, breaks = 30, col = "blue", main = "Distribution of House Age in 30 bins")
modes(age, type = 1, digits = "NULL", nmore = "NULL")
## [,1]
## Value 12
## Length 49
skewness(age, finite = TRUE)
## [1] 0.6591292
kurtosis(age, finite=TRUE)
## [1] -0.3243304
plot(density(age))
da=density(age)
da
##
## Call:
## density.default(x = age)
##
## Data: age (1000 obs.); Bandwidth 'bw' = 6.7
##
## x y
## Min. :-13.10 Min. :6.990e-07
## 1st Qu.: 31.45 1st Qu.:5.257e-04
## Median : 76.00 Median :4.466e-03
## Mean : 76.00 Mean :5.606e-03
## 3rd Qu.:120.55 3rd Qu.:9.901e-03
## Max. :165.10 Max. :1.825e-02
summary(age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.0 16.0 42.0 44.8 62.0 145.0
describe(age)
## age
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 102 1 44.8 33.14 10 11
## .25 .50 .75 .90 .95
## 16 42 62 92 98
##
## lowest : 7 8 9 10 11, highest: 122 127 132 137 145
Answer to Question 1 First I define age as 2017 minus Year.Built, then use breaks = 30 to get 30 bins and make a histogram of the distribution. The ages of the houses range from 7 to 145 years (computed as of 2017), with a mean of 44.8 years, a median of 42 years, a minimum of 7 and a maximum of 145. The distribution is slightly right-skewed (skewness 0.66) and multimodal, with at least two peaks, which the density plot confirms.
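As a visual cross-check of the multimodality claim, here is a minimal sketch that overlays a density curve on the 30-bin histogram, assuming age and ggplot2 as loaded above:
# Sketch: histogram of age with a density overlay to show the multiple peaks
ggplot(data.frame(age = age), aes(x = age)) +
  geom_histogram(aes(y = ..density..), bins = 30, fill = "steelblue", colour = "white") +
  geom_density(colour = "red") +
  labs(title = "Distribution of House Age with density overlay",
       x = "Age (years, as of 2017)", y = "Density")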
The mantra in real estate is “Location, Location, Location!” Make a graphical display that relates a home price to its neighborhood in Ames, Iowa. Which summary statistics are most appropriate to use for determining the most expensive, least expensive, and most heterogeneous (having the most variation in housing price) neighborhoods? Report which neighborhoods these are based on the summary statistics of your choice. Report the value of your chosen summary statistics for these neighborhoods.
ames_train %>%
  select(Neighborhood, price) %>%
  ggplot(aes(x = reorder(Neighborhood, -price, FUN = median), y = price)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab("Neighborhoods ordered by median price") +
  ggtitle("Price vs. Ames Neighborhood")
ames_train %>%
  group_by(Neighborhood) %>%
  summarise(median_price = median(price), sd_price = sd(price)) %>%
  arrange(desc(median_price)) %>%
  print(n = 20)
## # A tibble: 27 × 3
## Neighborhood median_price sd_price
## <fctr> <dbl> <dbl>
## 1 StoneBr 340691.5 123459.10
## 2 NridgHt 336860.0 105088.90
## 3 NoRidge 290000.0 35888.97
## 4 GrnHill 280000.0 70710.68
## 5 Timber 232500.0 84029.57
## 6 Somerst 221650.0 65199.49
## 7 Greens 212625.0 29063.42
## 8 Veenker 205750.0 72545.41
## 9 Crawfor 205000.0 71267.56
## 10 CollgCr 195800.0 52786.08
## 11 Blmngtn 191000.0 26454.86
## 12 ClearCr 185000.0 48068.69
## 13 NWAmes 185000.0 41340.50
## 14 Gilbert 183500.0 41190.38
## 15 SawyerW 182500.0 48354.36
## 16 Mitchel 156500.0 39682.94
## 17 NPkVill 142100.0 11958.37
## 18 NAmes 139900.0 27267.97
## 19 Sawyer 136000.0 21216.22
## 20 SWISU 134000.0 27375.76
## # ... with 7 more rows
Answer to Question 2 The median is more robust to outliers than the mean, so I used the median price to compare home prices across neighborhoods, and the standard deviation to measure how heterogeneous each neighborhood is. A selection was made by Neighborhood and price. The plot and summary table show: Most expensive: Stone Brook (median price 340691.5 USD). Least expensive: Meadow Village (median price 85750.0 USD). Most heterogeneous: Stone Brook (standard deviation 123459.10 USD).
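The three neighborhoods can also be pulled straight out of the grouped summary; a small sketch, assuming ames_train and dplyr as loaded above:
# Sketch: extract the most expensive, least expensive, and most heterogeneous neighborhoods
nbhd_stats <- ames_train %>%
  group_by(Neighborhood) %>%
  summarise(median_price = median(price), sd_price = sd(price))
nbhd_stats %>% filter(median_price == max(median_price)) # most expensive
nbhd_stats %>% filter(median_price == min(median_price)) # least expensive
nbhd_stats %>% filter(sd_price == max(sd_price))         # most heterogeneous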
Which variable has the largest number of missing values? Explain why it makes sense that there are so many missing values for this variable.
# Overall fraction of missing cells in the data set
sum(is.na(ames_train)) / (nrow(ames_train) * ncol(ames_train))
## [1] 0.05816049
# Number of missing values per variable
colSums(sapply(ames_train, is.na))
## PID area price MS.SubClass
## 0 0 0 0
## MS.Zoning Lot.Frontage Lot.Area Street
## 0 167 0 0
## Alley Lot.Shape Land.Contour Utilities
## 933 0 0 0
## Lot.Config Land.Slope Neighborhood Condition.1
## 0 0 0 0
## Condition.2 Bldg.Type House.Style Overall.Qual
## 0 0 0 0
## Overall.Cond Year.Built Year.Remod.Add Roof.Style
## 0 0 0 0
## Roof.Matl Exterior.1st Exterior.2nd Mas.Vnr.Type
## 0 0 0 0
## Mas.Vnr.Area Exter.Qual Exter.Cond Foundation
## 7 0 0 0
## Bsmt.Qual Bsmt.Cond Bsmt.Exposure BsmtFin.Type.1
## 21 21 21 21
## BsmtFin.SF.1 BsmtFin.Type.2 BsmtFin.SF.2 Bsmt.Unf.SF
## 1 21 1 1
## Total.Bsmt.SF Heating Heating.QC Central.Air
## 1 0 0 0
## Electrical X1st.Flr.SF X2nd.Flr.SF Low.Qual.Fin.SF
## 0 0 0 0
## Bsmt.Full.Bath Bsmt.Half.Bath Full.Bath Half.Bath
## 1 1 0 0
## Bedroom.AbvGr Kitchen.AbvGr Kitchen.Qual TotRms.AbvGrd
## 0 0 0 0
## Functional Fireplaces Fireplace.Qu Garage.Type
## 0 0 491 46
## Garage.Yr.Blt Garage.Finish Garage.Cars Garage.Area
## 48 46 1 1
## Garage.Qual Garage.Cond Paved.Drive Wood.Deck.SF
## 47 47 0 0
## Open.Porch.SF Enclosed.Porch X3Ssn.Porch Screen.Porch
## 0 0 0 0
## Pool.Area Pool.QC Fence Misc.Feature
## 0 997 798 971
## Misc.Val Mo.Sold Yr.Sold Sale.Type
## 0 0 0 0
## Sale.Condition
## 0
plot_Missing <- function(data_in, title = NULL){
  temp_df <- as.data.frame(ifelse(is.na(data_in), 0, 1))
  temp_df <- temp_df[, order(colSums(temp_df))]
  data_temp <- expand.grid(list(x = 1:nrow(temp_df), y = colnames(temp_df)))
  data_temp$m <- as.vector(as.matrix(temp_df))
  data_temp <- data.frame(x = unlist(data_temp$x), y = unlist(data_temp$y), m = unlist(data_temp$m))
  ggplot(data_temp) +
    geom_tile(aes(x = x, y = y, fill = factor(m))) +
    scale_fill_manual(values = c("white", "black"), name = "Missing\n(0=Yes, 1=No)") +
    theme_light() +
    ylab("Variable") + xlab("Observation no.") +
    ggtitle("Missing Values")
}
plot_Missing(ames_train[,colSums(is.na(ames_train)) >0])
max(colSums(is.na(ames_train)))
## [1] 997
summary(ames_train$Pool.QC)
## Ex Fa Gd TA NA's
## 1 1 1 0 997
Answer to Q3 The variable Pool.QC, with 997 of 1000 values missing (99.7%), has the largest number of missing values. It measures the quality of a pool, and only three houses in the data set have a pool; that is why so many values are missing.
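This can be cross-checked against Pool.Area; a quick sketch, assuming a positive pool area means the home has a pool:
# Sketch: homes with a pool should equal homes with a non-missing Pool.QC
sum(ames_train$Pool.Area > 0)   # homes that actually have a pool
sum(!is.na(ames_train$Pool.QC)) # homes with a recorded pool quality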
We want to predict the natural log of the home prices. Candidate explanatory variables are lot size in square feet (Lot.Area), slope of property (Land.Slope), original construction date (Year.Built), remodel date (Year.Remod.Add), and the number of bedrooms above grade (Bedroom.AbvGr). Pick a model selection or model averaging method covered in the Specialization, and describe how this method works. Then, use this method to find the best multiple regression model for predicting the natural log of the home prices.
# type your code for Question 4 here, and Knit
# Keep the response (price) and the five candidate predictors
vars <- names(ames_train) %in% c("price","Lot.Area","Land.Slope","Year.Built","Year.Remod.Add","Bedroom.AbvGr")
ames2 <- ames_train[vars]
names(ames2)
## [1] "price" "Lot.Area" "Land.Slope" "Year.Built"
## [5] "Year.Remod.Add" "Bedroom.AbvGr"
ames2$Land.Slope <- as.integer(ames2$Land.Slope)
str(ames2)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1000 obs. of 6 variables:
## $ price : int 126000 139500 124900 114000 227000 198500 93000 187687 137500 140000 ...
## $ Lot.Area : int 7890 4235 6060 8146 8400 7301 6000 3710 12395 3675 ...
## $ Land.Slope : int 1 1 1 1 1 1 2 1 1 1 ...
## $ Year.Built : int 1939 1984 1930 1900 2001 2003 1953 2007 1984 2005 ...
## $ Year.Remod.Add: int 1950 1984 2007 2003 2001 2003 1953 2008 1984 2005 ...
## $ Bedroom.AbvGr : int 2 2 2 2 3 4 2 2 3 2 ...
# Find the percentage of missing values in each column
MissingData <- function(x){sum(is.na(x))/length(x)*100}
apply(ames2, 2, MissingData)
## price Lot.Area Land.Slope Year.Built Year.Remod.Add
## 0 0 0 0 0
## Bedroom.AbvGr
## 0
summary(ames2)
## price Lot.Area Land.Slope Year.Built
## Min. : 12789 Min. : 1470 Min. :1.000 Min. :1872
## 1st Qu.:129763 1st Qu.: 7314 1st Qu.:1.000 1st Qu.:1955
## Median :159467 Median : 9317 Median :1.000 Median :1975
## Mean :181190 Mean : 10352 Mean :1.043 Mean :1972
## 3rd Qu.:213000 3rd Qu.: 11650 3rd Qu.:1.000 3rd Qu.:2001
## Max. :615000 Max. :215245 Max. :3.000 Max. :2010
## Year.Remod.Add Bedroom.AbvGr
## Min. :1950 Min. :0.000
## 1st Qu.:1966 1st Qu.:2.000
## Median :1992 Median :3.000
## Mean :1984 Mean :2.806
## 3rd Qu.:2004 3rd Qu.:3.000
## Max. :2010 Max. :6.000
# Build a model with step(), combining forward selection and backward
# elimination guided by AIC; also measure BIC.
# Model the best fit after log-transforming the price variable.
fit <- lm(log(ames2$price) ~ ., data = ames2) # fit a multiple linear regression
magnus <- step(fit,direction = "both")
## Start: AIC=-2530.07
## log(ames2$price) ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add +
## Bedroom.AbvGr
##
## Df Sum of Sq RSS AIC
## - Land.Slope 1 0.0465 78.750 -2531.5
## <none> 78.704 -2530.1
## - Bedroom.AbvGr 1 5.1713 83.875 -2468.4
## - Lot.Area 1 5.3617 84.065 -2466.2
## - Year.Remod.Add 1 12.4099 91.114 -2385.7
## - Year.Built 1 19.6773 98.381 -2308.9
##
## Step: AIC=-2531.47
## log(ames2$price) ~ Lot.Area + Year.Built + Year.Remod.Add + Bedroom.AbvGr
##
## Df Sum of Sq RSS AIC
## <none> 78.750 -2531.5
## + Land.Slope 1 0.0465 78.704 -2530.1
## - Bedroom.AbvGr 1 5.1250 83.875 -2470.4
## - Lot.Area 1 7.1223 85.872 -2446.9
## - Year.Remod.Add 1 12.3912 91.141 -2387.3
## - Year.Built 1 19.6618 98.412 -2310.6
summary(magnus)
##
## Call:
## lm(formula = log(ames2$price) ~ Lot.Area + Year.Built + Year.Remod.Add +
## Bedroom.AbvGr, data = ames2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.09102 -0.16488 -0.02135 0.16605 1.13359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.385e+01 8.633e-01 -16.049 < 2e-16 ***
## Lot.Area 8.697e-06 9.168e-07 9.486 < 2e-16 ***
## Year.Built 6.019e-03 3.819e-04 15.761 < 2e-16 ***
## Year.Remod.Add 6.889e-03 5.505e-04 12.512 < 2e-16 ***
## Bedroom.AbvGr 8.704e-02 1.082e-02 8.047 2.4e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2813 on 995 degrees of freedom
## Multiple R-squared: 0.5544, Adjusted R-squared: 0.5526
## F-statistic: 309.5 on 4 and 995 DF, p-value: < 2.2e-16
anova(magnus)
## Analysis of Variance Table
##
## Response: log(ames2$price)
## Df Sum Sq Mean Sq F value Pr(>F)
## Lot.Area 1 10.392 10.392 131.306 < 2.2e-16 ***
## Year.Built 1 69.442 69.442 877.389 < 2.2e-16 ***
## Year.Remod.Add 1 13.018 13.018 164.483 < 2.2e-16 ***
## Bedroom.AbvGr 1 5.125 5.125 64.754 2.404e-15 ***
## Residuals 995 78.750 0.079
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
BIC(fit)
## [1] 344.166
BIC(magnus)
## [1] 337.8492
Answer to Question 4 Step 1: Construct a new data frame, ames2, consisting of six variables from ames_train, and convert Land.Slope to an integer. Step 2: Run a multiple regression on ames2 with log(price) as the dependent variable and the five remaining variables as independent variables. Step 3: Use the step() method, which combines forward selection and backward elimination: starting from the full model, it repeatedly adds or drops the single variable that most improves the AIC, and stops when no addition or removal can lower the AIC further. Step 4: The usual manual elimination criteria back this up: prefer the higher adjusted R-squared, keep predictors with p-values of 0.05 or less, drop terms with low F values, and check the residuals (they should be normally distributed and independent, with constant variance and no extreme outliers or leverage points). The log transformation of price improves the results. The selected model drops Land.Slope and keeps Lot.Area, Year.Built, Year.Remod.Add and Bedroom.AbvGr; its BIC (the Bayesian information criterion) of 337.85 is lower than the full model's 344.17.
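To make the two directions concrete, here is a minimal sketch of running forward selection and backward elimination separately (direction = "both", used above, alternates between them); it assumes the ames2 data frame from above:
# Sketch: the two directions that direction = "both" combines
null_fit <- lm(log(price) ~ 1, data = ames2) # intercept-only model
full_fit <- lm(log(price) ~ Lot.Area + Land.Slope + Year.Built +
                 Year.Remod.Add + Bedroom.AbvGr, data = ames2)
# Forward: start empty and add the variable that lowers AIC most at each step
forward <- step(null_fit,
                scope = ~ Lot.Area + Land.Slope + Year.Built +
                  Year.Remod.Add + Bedroom.AbvGr,
                direction = "forward")
# Backward: start full and drop the variable whose removal lowers AIC most
backward <- step(full_fit, direction = "backward")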
Which home has the largest squared residual in the previous analysis (Question 4)? Looking at all the variables in the data set, can you explain why this home stands out from the rest (what factors contribute to the high squared residual and why are those factors relevant)?
# type your code for Question 5 here, and Knit
# 1. Construct an extract of the training data set with 7 variables.
#    PID is a unique identifier for each observation, which will be
#    essential for selecting the home with the largest squared residual.
ames3 <- ames_train %>%
select(PID, price, Lot.Area, Land.Slope, Year.Built, Year.Remod.Add, Bedroom.AbvGr)
plm <- lm(log(price) ~ Lot.Area + Land.Slope + Year.Built +
            Year.Remod.Add + Bedroom.AbvGr, data = na.omit(ames3))
# k = log(n) makes stepAIC() use the BIC penalty instead of the default AIC
stplm <- stepAIC(plm, trace = TRUE, k = log(nrow(na.omit(ames3))))
## Start: AIC=-2511.42
## log(price) ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add +
## Bedroom.AbvGr
##
## Df Sum of Sq RSS AIC
## <none> 77.322 -2511.4
## - Land.Slope 2 1.4281 78.750 -2506.9
## - Bedroom.AbvGr 1 5.0628 82.385 -2454.9
## - Lot.Area 1 6.7292 84.051 -2434.9
## - Year.Remod.Add 1 11.9642 89.286 -2374.5
## - Year.Built 1 19.8546 97.177 -2289.8
BIC(stplm)
## [1] 333.3642
# Build a table of log prices, fitted values, and (squared) residuals,
# then sort in descending order of squared residual
residuals <- ames3 %>%
  select(PID, price) %>%
  mutate(log_price = log(price))
residuals$predicted <- predict(stplm)
residuals$residuals <- residuals(stplm)
residuals$squared_residuals <- residuals(stplm)^2
residuals <- residuals %>%
  arrange(desc(squared_residuals))
residuals[c(1:10),]
## # A tibble: 10 × 6
## PID price log_price predicted residuals squared_residuals
## <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 902207130 12789 9.456341 11.54419 -2.0878529 4.3591298
## 2 534427010 84900 11.349229 12.37473 -1.0254998 1.0516498
## 3 911175430 35311 10.471950 11.47230 -1.0003537 1.0007075
## 4 528164060 615000 13.329378 12.33487 0.9945048 0.9890399
## 5 534450090 39300 10.578980 11.55148 -0.9724980 0.9457524
## 6 528150070 611657 13.323927 12.36909 0.9548393 0.9117180
## 7 902477120 34900 10.460242 11.37222 -0.9119760 0.8317003
## 8 911102170 40000 10.596635 11.46546 -0.8688291 0.7548640
## 9 905427030 415000 12.936034 12.11720 0.8188343 0.6704895
## 10 902326030 265979 12.491173 11.70249 0.7886795 0.6220154
ames_train[which(ames_train$PID == 902207130),]
## # A tibble: 1 × 81
## PID area price MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street
## <int> <int> <int> <int> <fctr> <int> <int> <fctr>
## 1 902207130 832 12789 30 RM 68 9656 Pave
## # ... with 73 more variables: Alley <fctr>, Lot.Shape <fctr>,
## # Land.Contour <fctr>, Utilities <fctr>, Lot.Config <fctr>,
## # Land.Slope <fctr>, Neighborhood <fctr>, Condition.1 <fctr>,
## # Condition.2 <fctr>, Bldg.Type <fctr>, House.Style <fctr>,
## # Overall.Qual <int>, Overall.Cond <int>, Year.Built <int>,
## # Year.Remod.Add <int>, Roof.Style <fctr>, Roof.Matl <fctr>,
## # Exterior.1st <fctr>, Exterior.2nd <fctr>, Mas.Vnr.Type <fctr>,
## # Mas.Vnr.Area <int>, Exter.Qual <fctr>, Exter.Cond <fctr>,
## # Foundation <fctr>, Bsmt.Qual <fctr>, Bsmt.Cond <fctr>,
## # Bsmt.Exposure <fctr>, BsmtFin.Type.1 <fctr>, BsmtFin.SF.1 <int>,
## # BsmtFin.Type.2 <fctr>, BsmtFin.SF.2 <int>, Bsmt.Unf.SF <int>,
## # Total.Bsmt.SF <int>, Heating <fctr>, Heating.QC <fctr>,
## # Central.Air <fctr>, Electrical <fctr>, X1st.Flr.SF <int>,
## # X2nd.Flr.SF <int>, Low.Qual.Fin.SF <int>, Bsmt.Full.Bath <int>,
## # Bsmt.Half.Bath <int>, Full.Bath <int>, Half.Bath <int>,
## # Bedroom.AbvGr <int>, Kitchen.AbvGr <int>, Kitchen.Qual <fctr>,
## # TotRms.AbvGrd <int>, Functional <fctr>, Fireplaces <int>,
## # Fireplace.Qu <fctr>, Garage.Type <fctr>, Garage.Yr.Blt <int>,
## # Garage.Finish <fctr>, Garage.Cars <int>, Garage.Area <int>,
## # Garage.Qual <fctr>, Garage.Cond <fctr>, Paved.Drive <fctr>,
## # Wood.Deck.SF <int>, Open.Porch.SF <int>, Enclosed.Porch <int>,
## # X3Ssn.Porch <int>, Screen.Porch <int>, Pool.Area <int>,
## # Pool.QC <fctr>, Fence <fctr>, Misc.Feature <fctr>, Misc.Val <int>,
## # Mo.Sold <int>, Yr.Sold <int>, Sale.Type <fctr>, Sale.Condition <fctr>
model.resid<-stplm$residuals
plot(model.resid, main="Residuals",pch=20)
abline(0,0, lwd=2, col="blueviolet")
qqnorm(residuals(stplm))
qqline(residuals(stplm))
Answer to Question 5 Every fitted lm model has components: "coefficients", "residuals", "effects", "rank", "fitted.values", "assign", "qr", "df.residual", "xlevels", "call", "terms" and "model".
I extracted the residuals from the fitted model (stplm), squared them, sorted them in descending order, and picked the PID with the largest squared residual, then looked that PID up in ames_train.
The residual plot and normal Q-Q plot show that PID 902207130 stands out: its squared residual is more than 4 times the second largest, so it appears to be an outlier.
This home sold for the lowest price in the data set (12789 USD). The house is 91 years old and located in Old Town, its overall quality is poor, its square footage (832) is below average, it has 2 bedrooms, and it was sold in 2010. All of these factors make it the lowest-priced home, which is why its squared residual is the largest.
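A small sketch pulling out the fields behind this home's large residual, assuming PID 902207130 as identified above:
# Sketch: inspect the factors that contribute to the largest squared residual
ames_train %>%
  filter(PID == 902207130) %>%
  select(price, area, Neighborhood, Overall.Qual, Overall.Cond,
         Year.Built, Bedroom.AbvGr, Yr.Sold)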
Use the same model selection method you chose in Question 4 to again find the best multiple regression model to predict the natural log of home prices, but this time replacing Lot.Area with log(Lot.Area). Do you arrive at a model including the same set of predictors?
# type your code for Question 6 here, and Knit
fitlog <- lm(log(ames3$price) ~ log(ames3$Lot.Area) + ames3$Land.Slope +
               ames3$Year.Built + ames3$Year.Remod.Add + ames3$Bedroom.AbvGr,
             data = ames3)
magnuslog1 <- stepAIC(fitlog, trace = T, k = log(nrow(na.omit(ames3))))
## Start: AIC=-2615.14
## log(ames3$price) ~ log(ames3$Lot.Area) + ames3$Land.Slope + ames3$Year.Built +
## ames3$Year.Remod.Add + ames3$Bedroom.AbvGr
##
## Df Sum of Sq RSS AIC
## - ames3$Land.Slope 2 0.4429 70.147 -2622.6
## <none> 69.704 -2615.1
## - ames3$Bedroom.AbvGr 1 2.2002 71.904 -2591.0
## - ames3$Year.Remod.Add 1 11.8182 81.522 -2465.4
## - log(ames3$Lot.Area) 1 14.3474 84.051 -2434.9
## - ames3$Year.Built 1 19.4083 89.112 -2376.4
##
## Step: AIC=-2622.62
## log(ames3$price) ~ log(ames3$Lot.Area) + ames3$Year.Built + ames3$Year.Remod.Add +
## ames3$Bedroom.AbvGr
##
## Df Sum of Sq RSS AIC
## <none> 70.147 -2622.6
## - ames3$Bedroom.AbvGr 1 2.0816 72.228 -2600.3
## - ames3$Year.Remod.Add 1 11.9455 82.092 -2472.3
## - log(ames3$Lot.Area) 1 15.7256 85.872 -2427.3
## - ames3$Year.Built 1 19.3032 89.450 -2386.4
anova(stplm)
## Analysis of Variance Table
##
## Response: log(price)
## Df Sum Sq Mean Sq F value Pr(>F)
## Lot.Area 1 10.392 10.392 133.462 < 2.2e-16 ***
## Land.Slope 2 2.424 1.212 15.564 2.211e-07 ***
## Year.Built 1 68.991 68.991 886.012 < 2.2e-16 ***
## Year.Remod.Add 1 12.535 12.535 160.982 < 2.2e-16 ***
## Bedroom.AbvGr 1 5.063 5.063 65.018 2.124e-15 ***
## Residuals 993 77.322 0.078
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(magnuslog1)
## Analysis of Variance Table
##
## Response: log(ames3$price)
## Df Sum Sq Mean Sq F value Pr(>F)
## log(ames3$Lot.Area) 1 24.338 24.338 345.224 < 2.2e-16 ***
## ames3$Year.Built 1 67.894 67.894 963.045 < 2.2e-16 ***
## ames3$Year.Remod.Add 1 12.267 12.267 174.000 < 2.2e-16 ***
## ames3$Bedroom.AbvGr 1 2.082 2.082 29.526 6.939e-08 ***
## Residuals 995 70.147 0.070
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
BIC(magnuslog1)
## [1] 222.1599
BIC(stplm)
## [1] 333.3642
improvement <- BIC(stplm) - BIC(magnuslog1)
improvement
## [1] 111.2043
ggplot(data = ames3, aes(x = Lot.Area)) +
geom_histogram() +
ggtitle("Lot.Area w/o a log-transformation")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = ames3, aes(x = log(Lot.Area))) +
geom_histogram() +
ggtitle("Lot.Area with after transformation")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Answer to Question 6 The log transformation reduces skewness, bringing the original distribution of the predictor closer to normal, which helps the regression model work better. Replacing Lot.Area with log(Lot.Area) achieves a BIC of 222.16, which is 111.20 below the BIC of 333.36 for the fit before the log transform, indicating higher predictive power for the second model. So no, the method does not arrive at the same set of predictors: once log(Lot.Area) is used, Land.Slope is dropped from the model (compare the ANOVA tables for stplm and magnuslog1 above). The most likely explanation is that Lot.Area is heavily right-skewed; after a log transformation it looks much more normal, as the two histograms show.
The residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage plots all become more linear when the regression is carried out with log(Lot.Area), so we get better predictions; this is shown in the plots and in the summaries of the two fitted models.
Constant error variance, linearity, normality of residuals, and independence of residuals are indicated in the plots.
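The skewness claim can be checked numerically; a minimal sketch using the skewness() function from the modes package loaded above (exact values will depend on the data):
# Sketch: skewness of Lot.Area before and after the log transform
skewness(ames_train$Lot.Area, finite = TRUE)      # heavily right-skewed
skewness(log(ames_train$Lot.Area), finite = TRUE) # much closer to symmetric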
Do you think it is better to log transform Lot.Area, in terms of assumptions for linear regression? Make graphs of the predicted values of log home price versus the true values of log home price for the regression models selected for Lot.Area and log(Lot.Area). Referencing these two plots, provide a written support that includes a quantitative justification for your answer in the first part of question 7.
# type your code for Question 7 here, and Knit
# Compare the log-transformed and untransformed models.
# The previously fitted models, before and after the log transform,
# are used to draw the residual diagnostic plots.
op <- par(mfrow = c(2, 3))
plot(stplm)
plot(magnuslog1)
summary(stplm)
##
## Call:
## lm(formula = log(price) ~ Lot.Area + Land.Slope + Year.Built +
## Year.Remod.Add + Bedroom.AbvGr, data = na.omit(ames3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0878 -0.1651 -0.0211 0.1657 0.9945
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.371e+01 8.574e-01 -15.996 < 2e-16 ***
## Lot.Area 1.028e-05 1.106e-06 9.296 < 2e-16 ***
## Land.SlopeMod 1.384e-01 4.991e-02 2.773 0.00565 **
## Land.SlopeSev -4.567e-01 1.514e-01 -3.016 0.00263 **
## Year.Built 6.049e-03 3.788e-04 15.968 < 2e-16 ***
## Year.Remod.Add 6.778e-03 5.468e-04 12.395 < 2e-16 ***
## Bedroom.AbvGr 8.686e-02 1.077e-02 8.063 2.12e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.279 on 993 degrees of freedom
## Multiple R-squared: 0.5625, Adjusted R-squared: 0.5598
## F-statistic: 212.8 on 6 and 993 DF, p-value: < 2.2e-16
summary(magnuslog1)
##
## Call:
## lm(formula = log(ames3$price) ~ log(ames3$Lot.Area) + ames3$Year.Built +
## ames3$Year.Remod.Add + ames3$Bedroom.AbvGr, data = ames3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.14609 -0.15825 -0.01477 0.15354 1.01578
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.557e+01 8.213e-01 -18.964 < 2e-16 ***
## log(ames3$Lot.Area) 2.471e-01 1.654e-02 14.935 < 2e-16 ***
## ames3$Year.Built 5.964e-03 3.604e-04 16.547 < 2e-16 ***
## ames3$Year.Remod.Add 6.765e-03 5.197e-04 13.017 < 2e-16 ***
## ames3$Bedroom.AbvGr 5.726e-02 1.054e-02 5.434 6.94e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2655 on 995 degrees of freedom
## Multiple R-squared: 0.6031, Adjusted R-squared: 0.6015
## F-statistic: 377.9 on 4 and 995 DF, p-value: < 2.2e-16
anova(stplm)
## Analysis of Variance Table
##
## Response: log(price)
## Df Sum Sq Mean Sq F value Pr(>F)
## Lot.Area 1 10.392 10.392 133.462 < 2.2e-16 ***
## Land.Slope 2 2.424 1.212 15.564 2.211e-07 ***
## Year.Built 1 68.991 68.991 886.012 < 2.2e-16 ***
## Year.Remod.Add 1 12.535 12.535 160.982 < 2.2e-16 ***
## Bedroom.AbvGr 1 5.063 5.063 65.018 2.124e-15 ***
## Residuals 993 77.322 0.078
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(magnuslog1)
## Analysis of Variance Table
##
## Response: log(ames3$price)
## Df Sum Sq Mean Sq F value Pr(>F)
## log(ames3$Lot.Area) 1 24.338 24.338 345.224 < 2.2e-16 ***
## ames3$Year.Built 1 67.894 67.894 963.045 < 2.2e-16 ***
## ames3$Year.Remod.Add 1 12.267 12.267 174.000 < 2.2e-16 ***
## ames3$Bedroom.AbvGr 1 2.082 2.082 29.526 6.939e-08 ***
## Residuals 995 70.147 0.070
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Answer to Question 7 Yes, it is better to log-transform Lot.Area: after the transformation, the residual plots of the multiple regression model improve. 1. Errors should be normally distributed with constant variance; the residual Q-Q plots suggest this assumption is fairly well met by both models. 2. Homogeneity of the residuals is met by both models and does not appear to be violated. 3. Independence of the residuals (uncorrelated errors) is better supported by the set of plots for the second model. 4. Outliers and high-leverage points can distort the predictions because they have outsized influence on the fit; according to the residuals vs. leverage plots, the influence of such points is much lower in the second model. Quantitatively, the log(Lot.Area) model has a higher adjusted R-squared (0.6015 vs. 0.5598), a lower residual standard error (0.2655 vs. 0.2790) and a lower BIC (222.16 vs. 333.36). Transforming variables is very useful: the log transformation of Lot.Area improves how well the model assumptions for multiple linear regression are met.
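The question also asks for plots of predicted versus true values of log home price. A minimal sketch of how these could be drawn from the two fitted models, assuming stplm and magnuslog1 as fitted above (a 45-degree line marks perfect prediction):
# Sketch: predicted vs. true log(price) for the Lot.Area and log(Lot.Area) models
op <- par(mfrow = c(1, 2))
plot(log(na.omit(ames3)$price), fitted(stplm),
     xlab = "True log(price)", ylab = "Predicted log(price)",
     main = "Model with Lot.Area", pch = 20)
abline(0, 1, col = "red", lwd = 2) # perfect-prediction reference line
plot(log(ames3$price), fitted(magnuslog1),
     xlab = "True log(price)", ylab = "Predicted log(price)",
     main = "Model with log(Lot.Area)", pch = 20)
abline(0, 1, col = "red", lwd = 2)
par(op)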