For this question, you will the ames data set that we have already worked in class. This data set contains information on 2,930 properties in Ames, Iowa, including columns related to: - house characteristics (bedrooms, garage, fireplace, pool, porch, etc.) - location (neighborhood) - lot information (zoning, shape, size, etc.) - ratings of condition and quality - sale price
Before you dig in predicting house prices, we would like to know a couple a things about our data. To answer each subquestion, please create a nice plot !
How expensive are houses (min, max, average, median) ?
When were the houses built, including the oldest and newest?
When were houses sold?
Where are houses?
How big are houses?
Code
library(ggplot2)library(dplyr)library(corrplot)properties <-read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/modeldata/ames.csv")summary(properties %>%select(Sale_Price, Year_Built, Year_Sold, Lot_Area))# Subquestion 1: how expensive are houses?ggplot(properties, aes(x = Sale_Price)) +geom_histogram(bins =50, fill ="#046307", color ="#5C655A", alpha =0.7) +scale_x_log10() +labs(title ="Housing Price",subtitle ="1. How expensive are houses?",x ="Logged Housing Price ($)",y ="Frequency" ) +theme_minimal() +theme(plot.title =element_text(size =18, face ="bold", hjust =0.5),plot.subtitle =element_text(size =12, hjust =0.5) )# Subquestion 2: When were houses built? ggplot(properties, aes(x = Year_Built)) +geom_histogram(bins =50, fill ="#002366", color ="black", alpha =0.7) +labs(title ="Year Built",subtitle ="2. When were houses built?",x ="Year Built",y ="Frequency" ) +theme_minimal() +theme(plot.title =element_text(size =18, face ="bold", hjust =0.5),plot.subtitle =element_text(size =12, hjust =0.5) )# Subquestion 3: When are houses sold? ggplot(properties, aes(x = Year_Sold)) +geom_bar(bins =50, fill ="#005672", color ="black", alpha =0.7) +labs(title ="Year Sold",subtitle ="3. When are houses sold? ",x ="Year Sold",y ="Frequency" ) +theme_minimal() +theme(plot.title =element_text(size =18, face ="bold", hjust =0.5),plot.subtitle =element_text(size =12, hjust =0.5) )# Subquestion 4: Where are Houses?properties %>%count(Neighborhood) %>%mutate(Neighborhood =gsub("_", " ", Neighborhood)) %>%ggplot(aes(x =reorder(Neighborhood, n), y = n)) +geom_col(fill ="#8E9AFE", color ="black", alpha =0.7) +coord_flip() +labs(title ="Neighborhoods",subtitle ="Neighborhoods of houses sold in Ames, Iowa from 2006 to 2010",x ="Neighborhood",y ="Frequency" ) +theme_minimal() +theme(plot.title =element_text(size =18, face ="bold", hjust =0.5),plot.subtitle =element_text(size =10, hjust =0.5) )# Subquestion 5: How big are the houses? ggplot(properties, aes(x = Lot_Area)) +geom_histogram(bins =50, fill ="#6A5ACD", color ="black", alpha =0.7) +scale_x_log10() +labs(title ="Lot Area",subtitle ="5: How big are houses?",x =" Logged Lot size (sq. ft)",y ="Frequency" ) +theme_minimal() +theme(plot.title =element_text(size =18, face ="bold", hjust =0.5),plot.subtitle =element_text(size =12, hjust =0.5) )
Output
Sale_Price Year_Built Year_Sold Lot_Area
Min. : 12789 Min. :1872 Min. :2006 Min. : 1300
1st Qu.:129500 1st Qu.:1954 1st Qu.:2007 1st Qu.: 7440
Median :160000 Median :1973 Median :2008 Median : 9436
Mean :180796 Mean :1971 Mean :2008 Mean : 10148
3rd Qu.:213500 3rd Qu.:2001 3rd Qu.:2009 3rd Qu.: 11555
Max. :755000 Max. :2010 Max. :2010 Max. :215245
Plot
Plot
Plot
Plot
Plot
Based on the plots and summary statistics, the sale price of houses is quite skewed with a median of 160,000 USD, a minimum of 12,789 USD and a maximum of 755,000 USD. For visual purposes, we logged the sale price to visualize the prices on the extreme ends that were otherwise minimized by the skewness. Although the age range of the houses varies significantly from the 1870s to 2010, the mean year for the year a house was built was 1971 revealing a leftward skew in building years. While houses were sold at a fairly uniform frequency there is a sharp decline in 2010, and a sizeable spike in 2007.The majority of houses were sold in the neighborhoods North Ames, College Creek and Old Town, and the least were old in Greens, Green Hills, and Landmark. Finally, the overall lot area of the houses had a similar leftward skew to the houses prices, thus we logged the Lot Area variable for visual purposes. Based on the summary statistics, the minimum lot area was 1300 sq. ft., the maximum was 215,245 sq. ft., and the mean was 10,148 sq. ft.
Since our target variable is house prices, please provide a detail overview of the distribution of our target variables including stats and plots.
Code
# Summary statistics of key numeric variables by Neighborhoodproperties %>%mutate(Neighborhood =gsub("_", " ", Neighborhood)) %>%group_by(Neighborhood) %>%summarise(count =n(),mean_Lot_Area =mean(Lot_Area, na.rm =TRUE),median_Lot_Area =median(Lot_Area, na.rm =TRUE),mean_Sale_Price =mean(Sale_Price, na.rm =TRUE),median_Sale_Price =median(Sale_Price, na.rm =TRUE),mean_Year_Built =mean(Year_Built, na.rm =TRUE),median_Year_Built =median(Year_Built, na.rm =TRUE),mean_Year_Sold =mean(Year_Sold, na.rm =TRUE),median_Year_Sold =median(Year_Sold, na.rm =TRUE) ) %>%arrange(desc(mean_Sale_Price))# Lot Area vs Sale Priceggplot(properties, aes(x = Lot_Area, y = Sale_Price)) +geom_point(color ="#6A5ACD", alpha =0.6) +geom_smooth(method ="lm", color ="black", se =TRUE, linewidth =1) +scale_y_log10() +scale_x_log10() +labs(title ="Lot Area vs Sale Price",x ="Logged Lot Area (sq. ft.)",y ="Logged Sale Price ($)" ) +theme_minimal() +theme(plot.title =element_text(size =18, face ="bold", hjust =0.5),plot.subtitle =element_text(size =12, hjust =0.5) )# Neighborhood vs Sale Price (Top 10 by Median Price) since there are multipleproperties %>%mutate(Neighborhood =gsub("_", " ", Neighborhood)) %>%group_by(Neighborhood) %>%mutate(median_price =median(Sale_Price, na.rm =TRUE)) %>%ungroup() %>%filter(Neighborhood %in%names(sort(tapply(Sale_Price, Neighborhood, median, na.rm =TRUE), decreasing =TRUE))[1:10]) %>%ggplot(aes(x =reorder(Neighborhood, Sale_Price, median), y = Sale_Price)) +geom_boxplot(fill ="#8E9AFE", color ="black", alpha =0.7) +scale_y_log10() +coord_flip() +labs(title ="Neighborhood vs Sale Price",subtitle ="Top 10 neighborhoods by median price",x ="Neighborhood",y ="Logged Sale Price ($)" ) +theme_minimal() +theme(plot.title =element_text(size =18, face ="bold", hjust =0.5),plot.subtitle =element_text(size =12, hjust =0.5) )# Year Built vs Sale Price (Smoothed Trend)properties %>%group_by(Year_Built) %>%summarise(mean_price =mean(Sale_Price, na.rm =TRUE)) %>%ggplot(aes(x = Year_Built, y = mean_price)) +geom_point(color ="#002366", alpha =0.6) +geom_smooth(color ="#002366", se =FALSE, linewidth =1.2) +scale_y_log10() +labs(title ="Year Built vs Average Sale Price",subtitle ="How do average prices change over time?",x ="Year Built",y ="Mean Logged Sale Price ($)" ) +theme_minimal() +theme(plot.title =element_text(size =18, face ="bold", hjust =0.5),plot.subtitle =element_text(size =12, hjust =0.5) )# Sale_Price by Year_Soldproperties %>%ggplot(aes(x =factor(Year_Sold), y = Sale_Price)) +geom_boxplot(fill ="#005672", color ="black", alpha =0.7) +scale_y_log10() +labs(title ="Sale Price by Year Sold",x ="Year Sold",y ="Logged Sale Price ($)" ) +theme_minimal() +theme(plot.title =element_text(size =18, face ="bold", hjust =0.5),plot.subtitle =element_text(size =12, hjust =0.5) )
Sale Price and Lot Area: Based on the scatter plot of the Logged Lot Area and Logged Sale Price,there appears to be a general upwards trend meaning that a percentage increase in lot area would result in some percentage increase in sale price.
Sale Price and Neighborhood: Given the large number of neighborhoods, we selected the top 10 neighborhooods which had mean sale prices ranging from 201,803.43 USD to 330,319 USD while the bottom 10 (not plotted) had mean sale prices ranging from 95,756.49 USD to 136,751.15 USD. In some neighborhoods such as Stone Brook, Northridge, Somerset, and College Creek, there appears to be significant variability with a few outliers in sale price. Whereas other neighborhoods such as Green Hills and Greens have more rigidity in the variation of sale price.
Sale Price and Year Built: Based on the plot of the build year and average sale price, there is a general upwards trend, suggesting that newer houses are sold on average for higher values. However, it is worth noting that some outliers do exist during the late 1890s-late 1930s period, and there appears to be a slightly non-linear relationship.
Sale Price and Year Sold: Given that houses are sold over the short time frame of 2006-2010, we created a series of box plots to display the relationship between sale price and selling year. Despite the presence of outliers across all years, the mean sale price across all 5 years remains fairly constant.
Provide a overview of the correlation between numerical features of the house (Think about to create a plot that shows all correlations at once). Include any information that you deem important
For our selection of numerical variables to include, we chose to eliminate some variables from the num_var subset that would cause potential endogeneity. For instance, the total basement variable was selected rather than first or second basement as the inclusion of all of these variables would lead to multicollinearity, and total basement area is more comprehensive. For readability, we created a subset of numerical variables, then plotted those variables against Sale_Price in subsets using the ggpairs function which gives a series of plots within and across variables, as well as their numerical correlations. From this tactic, we noticed that Lot_Area, Garage_Area, Total_Bsmt_SF, Gr_Liv_Area and Pool_Area all had statistically significant correlations to sale price at all levels while Misc_Val did not.
Choose 5 categorical features, and create plots to understand the relationship between houses prices and those variables.
Code
# Utilitiesmed_utils <- properties %>%group_by(Utilities) %>%summarise(Med_Sale_Price =median(Sale_Price, na.rm =TRUE)) %>%arrange(desc(Med_Sale_Price))ggplot(med_utils, aes(x =reorder(Utilities, -Med_Sale_Price), y = Med_Sale_Price)) +geom_col(fill ="#FF6F61", color ="black", alpha =0.8) +scale_x_discrete(labels =c("AllPub"="All Public Utilities","NoSewr"="Electricity, Gas and Water","NoSeWa"="Electricity and Gas Only","ELO"="Electricity Only")) +labs(title ="Median Sale Price by Utilities",subtitle ="Comparing median home prices across utility types",x ="Utilities",y ="Sale Price" ) +theme_minimal() +theme(plot.title =element_text(size =18, face ="bold", hjust =0.5),plot.subtitle =element_text(size =12, hjust =0.5),axis.text.x =element_text(angle =45, hjust =1))# Overall Conditionmed_cond <- properties %>%group_by(Overall_Cond) %>%summarise(Med_Sale_Price =median(Sale_Price, na.rm =TRUE)) %>%arrange(desc(Med_Sale_Price))ggplot(med_cond, aes(x =reorder(as.factor(Overall_Cond), -Med_Sale_Price), y = Med_Sale_Price)) +geom_col(fill ="#FFD700", color ="black", alpha =0.8) +labs(title ="Median Sale Price by Overall Condition",subtitle ="Comparing median home prices across overall condition ratings",x ="Overall Condition",y ="Sale Price" ) +theme_minimal() +theme(plot.title =element_text(size =18, face ="bold", hjust =0.5),plot.subtitle =element_text(size =12, hjust =0.5),axis.text.x =element_text(angle =45, hjust =1) )# Central Airmed_air <- properties %>%group_by(Central_Air) %>%summarise(Med_Sale_Price =median(Sale_Price, na.rm =TRUE)) %>%arrange(desc(Med_Sale_Price))ggplot(med_air, aes(x =reorder(Central_Air, -Med_Sale_Price), y = Med_Sale_Price)) +geom_col(fill ="#1E90FF", color ="black", alpha =0.8) +scale_y_continuous(labels =NULL, breaks =NULL) +scale_x_discrete(labels =c("Y"="Yes","N"="No")) +labs(title ="Median Sale Price by Central Air",subtitle ="Comparing median home prices across central air types",x ="Central Air",y =NULL ) +theme_minimal() +theme(plot.title =element_text(size =18, face ="bold", hjust =0.5),plot.subtitle =element_text(size =12, hjust =0.5),axis.text.x =element_text(angle =45, hjust =1) )# Sale Conditionmed_salecond <- properties %>%group_by(Sale_Condition) %>%summarise(Med_Sale_Price =median(Sale_Price, na.rm =TRUE)) %>%arrange(desc(Med_Sale_Price))ggplot(med_salecond, aes(x =reorder(Sale_Condition, -Med_Sale_Price), y = Med_Sale_Price)) +geom_col(fill ="#32CD32", color ="black", alpha =0.8) +scale_x_discrete(labels =c("Abnorml"="Abnormal Sale","AdjLand"="Adjoining Land Purchase","Alloca"="Allocation"))+labs(title ="Median Sale Price by Sale Condition",subtitle ="Comparing median home prices across sale conditions",x ="Sale Condition",y ="Sale Price" ) +theme_minimal() +theme(plot.title =element_text(size =18, face ="bold", hjust =0.5),plot.subtitle =element_text(size =12, hjust =0.5),axis.text.x =element_text(angle =45, hjust =1) )# MS Zoningmed_zoning <- properties %>%group_by(MS_Zoning) %>%summarise(Med_Sale_Price =median(Sale_Price, na.rm =TRUE)) %>%arrange(desc(Med_Sale_Price))ggplot(med_zoning, aes(x =reorder(MS_Zoning, -Med_Sale_Price), y = Med_Sale_Price)) +geom_col(fill ="#FF69B4", color ="black", alpha =0.8) +scale_x_discrete(labels =c("A_agr"="Agriculture","C_all"="Commercial","Floating_Village_Residential"="Floating Village Residential","I_all"="Industrial","Residential_High_Density"="Residential High Density","Residential_Low_Density"="Residential Low Density","Residential_Medium_Density"="Residential Medium Density" )) +labs(title ="Median Sale Price by MS Zoning",subtitle ="Comparing median home prices across zoning categories",x ="MS Zoning",y ="Sale_Price" ) +theme_minimal() +theme(plot.title =element_text(size =18, face ="bold", hjust =0.5),plot.subtitle =element_text(size =12, hjust =0.5),axis.text.x =element_text(angle =45, hjust =1) )
Plot
Plot
Plot
Plot
Plot
For readability and interpretability, we created plots using the median price for each of the categorical variables selected: 1) Median Price by Utilities: Based on the bar graph, it appears that houses with all public utilities had higher median selling prices than houses with electricity and gas only, and houses with electricity, gas, and water. Notably, the majority of houses had all public utilities, so the visual may be slightly biased in that regard. 2) Median Price by Condition: Based on the graph, there appears to be a downward tend in median price vs condition suggesting that houses in worse condition sell for lower median prices, which makes sense in context. Interestingly, there is not much of a visible price difference in houses rated as “good”, “very good” and “above average” 3) Median Sale Price and Central Air: Houses with central air appear to have higher median selling prices compared to houses that do not have central air. Like the Condition variable, this trend makes sense in context as additional features like central air are often incorporated into the selling price of homes 4) Median Sale Price and Sale Condition: Partial sales have a significantly higher median sale price compared to the other sale conditions, which would hold true as houses sold in the “partial” category are often new houses that are not yet finished. Given that they are new, their value would be higher than older depreciated houses 5) Median Sale Price and MS Zoning: Houses in the agriculture zones had lower median selling prices whereas houses in the floating village residential category had the highest median selling prices. Given that argicultural houses are in rural landscapes, property values on average would be lower than the floating village residential houses in urban areas.
Alright, after checking some of the variables, let’s predict house prices. Split the dataset into training and testing (60-40), and compare the following models using cross-validation (k=10).
Best Sub Selection Linear Regression
Lasso Linear Regression
Regression Tree with Pruning
Bagging
Random Forest.
You should decide the criteria to choose the most accurate model, and show the main variables of the chosen models. Beware, that you might have to do some data wrangling before running model (Have fun!)
Code
library(leaps)data1 <-sapply(train_data, is.numeric)data2 <- train_data[, data1]ctrl <-trainControl(method ="cv", number =10)# Best subsets best_sub <-train( Sale_Price ~ .,data = data2,method ="leapSeq",tuneGrid =data.frame(nvmax =1:20),trControl = ctrl)best_subrmse_best <- best_sub$results$RMSE[best_sub$results$nvmax == best_sub$bestTune$nvmax]summary(best_sub)# LASSOset.seed(123)# Create numeric matrixx <-model.matrix(Sale_Price ~ . -1, data = train_data)y <- train_data$Sale_Price# 10-fold CVkfolds <-trainControl(method ="cv", number =10)# Manually specify alpha and lambda gridlasso_grid <-expand.grid(alpha =1, # LASSOlambda =c(0.001, 0.01, 0.1, 1, 10, 100) # short grid, works in all versions)# Train LASSOlasso <-train(x = x,y = y,method ="glmnet",tuneGrid = lasso_grid,trControl = kfolds)lassormse_lasso <- lasso$results$RMSE[lasso$results$lambda == lasso$bestTune$lambda & lasso$results$alpha == lasso$bestTune$alpha]
# TREE WITH PRUNINGtree_fit <-train( Sale_Price ~ .,data = train_data,method ="rpart",tuneLength =10,trControl = kfolds)tree_fitrmse_tree <- tree_fit$results$RMSE[tree_fit$results$cp == tree_fit$bestTune$cp]
Output
CART
1760 samples
53 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 1584, 1584, 1584, 1584, 1584, 1584, ...
Resampling results across tuning parameters:
cp RMSE Rsquared MAE
0.01545238 44221.35 0.6831832 31068.34
0.01692598 45864.79 0.6620878 32212.33
0.01731402 46285.56 0.6573025 32287.91
0.02150539 46731.88 0.6486357 32870.75
0.02373662 46999.47 0.6442860 32963.89
0.03074978 48111.18 0.6264282 33892.18
0.04501678 51539.62 0.5700389 36538.57
0.06902094 54475.98 0.5272571 38168.43
0.11884697 59307.37 0.4285324 42921.00
0.39785290 72891.77 0.3086824 53077.25
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.01545238.
# RANDOM FORESTlibrary(randomForest)random_forest <-train( Sale_Price ~ ., data = train_data,method ="rf",trControl =trainControl(method ="cv", number =5), # 5-fold to speed upntree =200)random_forestrmse_rf <- random_forest$results$RMSE[random_forest$results$mtry == random_forest$bestTune$mtry]
Output
Random Forest
1760 samples
53 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1408, 1408, 1408, 1409, 1407
Resampling results across tuning parameters:
mtry RMSE Rsquared MAE
2 45962.30 0.8029054 30054.16
106 28255.91 0.8776451 17307.71
211 28860.47 0.8711075 17566.15
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 106.
Model RMSE
1 Best Subset 35774.62
2 LASSO 29795.76
3 Tree (Pruned) 44221.35
4 Bagging 36066.84
5 Random Forest 28255.91
Based on the models, the lasso regression has the lowest RMSE and highest R-squared value, making it the most viable choice for housing price prediction. Additionally, since lasso regression preserves the important variables and assigns coefficients to them, we had the added advantage of seeing the marginal effects of each explanatory variable on housing sale price.
Source Code
---title: "Lab #4 - Tree-Method and Unsupervised Learning "author: "Tessa Reilly and Chanel Anderson"institute: "Denison University"format: html: theme: lux # Check here for more themes: https://quarto.org/docs/output-formats/html-themes.html code-tools: true code-fold: true code-summary: "Code" code-copy: hover link-external-newwindow: true tbl-cap-location: top fig-cap-location: bottomself-contained: trueeditor: source---```{r setup, include=FALSE}# DO NOT EDIT THISknitr::opts_chunk$set(fig.align = 'center')knitr::opts_chunk$set(out.width = '90%')knitr::opts_chunk$set(results = 'hold')knitr::opts_chunk$set(fig.show = 'hold')knitr::opts_chunk$set(error = TRUE)knitr::opts_chunk$set(warning = FALSE)knitr::opts_chunk$set(message = FALSE)par(mar = c(4.1, 4.1, 1.1, 4.1))hooks = knitr::knit_hooks$get()hook_foldable = function(type) { force(type) function(x, options) { res = hooks[[type]](x, options) if (isFALSE(options[[paste0("fold.", type)]])) return(res) paste0( "<details open><summary>", gsub("^p", "P", gsub("^o", "O", type)), "</summary>\n\n", res, "\n\n</details>" ) }}knitr::knit_hooks$set( output = hook_foldable("output"), plot = hook_foldable("plot"))Q <- 0```## Question `r Q <- Q+1; Q````{r include=FALSE}q <- 0```For this question, you will the `ames` data set that we have already worked in class. This data set contains information on 2,930 properties in Ames, Iowa, including columns related to: - house characteristics (bedrooms, garage, fireplace, pool, porch, etc.) - location (neighborhood) - lot information (zoning, shape, size, etc.) - ratings of condition and quality - sale priceThe ([data](https://vincentarelbundock.github.io/Rdatasets/csv/modeldata/ames.csv), [documentation](https://jse.amstat.org/v19n3/decock/DataDocumentation.txt))) are available in the links above`r q <- q+1; letters[q]`. Before you dig in predicting house prices, we would like to know a couple a things about our data. To answer each subquestion, please create a nice plot ! - How expensive are houses (min, max, average, median) ? - When were the houses built, including the oldest and newest? - When were houses sold? - Where are houses? - How big are houses?```{r, message = FALSE}library(ggplot2)library(dplyr)library(corrplot)properties <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/modeldata/ames.csv")summary(properties %>% select(Sale_Price, Year_Built, Year_Sold, Lot_Area))# Subquestion 1: how expensive are houses?ggplot(properties, aes(x = Sale_Price)) + geom_histogram(bins = 50, fill = "#046307", color = "#5C655A", alpha = 0.7) + scale_x_log10() + labs( title = "Housing Price", subtitle = "1. How expensive are houses?", x = "Logged Housing Price ($)", y = "Frequency" ) + theme_minimal() + theme( plot.title = element_text(size = 18, face = "bold", hjust = 0.5), plot.subtitle = element_text(size = 12, hjust = 0.5) )# Subquestion 2: When were houses built? ggplot(properties, aes(x = Year_Built)) + geom_histogram(bins = 50, fill = "#002366", color = "black", alpha = 0.7) + labs( title = "Year Built", subtitle = "2. When were houses built?", x = "Year Built", y = "Frequency" ) + theme_minimal() + theme( plot.title = element_text(size = 18, face = "bold", hjust = 0.5), plot.subtitle = element_text(size = 12, hjust = 0.5) )# Subquestion 3: When are houses sold? ggplot(properties, aes(x = Year_Sold)) + geom_bar(bins = 50, fill = "#005672", color = "black", alpha = 0.7) + labs( title = "Year Sold", subtitle = "3. When are houses sold? ", x = "Year Sold", y = "Frequency" ) + theme_minimal() + theme( plot.title = element_text(size = 18, face = "bold", hjust = 0.5), plot.subtitle = element_text(size = 12, hjust = 0.5) )# Subquestion 4: Where are Houses?properties %>% count(Neighborhood) %>% mutate(Neighborhood = gsub("_", " ", Neighborhood)) %>% ggplot(aes(x = reorder(Neighborhood, n), y = n)) + geom_col(fill = "#8E9AFE", color = "black", alpha = 0.7) + coord_flip() + labs( title = "Neighborhoods", subtitle = "Neighborhoods of houses sold in Ames, Iowa from 2006 to 2010", x = "Neighborhood", y = "Frequency" ) + theme_minimal() + theme( plot.title = element_text(size = 18, face = "bold", hjust = 0.5), plot.subtitle = element_text(size = 10, hjust = 0.5) )# Subquestion 5: How big are the houses? ggplot(properties, aes(x = Lot_Area)) + geom_histogram(bins = 50, fill = "#6A5ACD", color = "black", alpha = 0.7) + scale_x_log10() + labs( title = "Lot Area", subtitle = "5: How big are houses?", x = " Logged Lot size (sq. ft)", y = "Frequency" ) + theme_minimal() + theme( plot.title = element_text(size = 18, face = "bold", hjust = 0.5), plot.subtitle = element_text(size = 12, hjust = 0.5) )```Based on the plots and summary statistics, the sale price of houses is quite skewed with a median of 160,000 USD, a minimum of 12,789 USD and a maximum of 755,000 USD. For visual purposes, we logged the sale price to visualize the prices on the extreme ends that were otherwise minimized by the skewness. Although the age range of the houses varies significantly from the 1870s to 2010, the mean year for the year a house was built was 1971 revealing a leftward skew in building years. While houses were sold at a fairly uniform frequency there is a sharp decline in 2010, and a sizeable spike in 2007.The majority of houses were sold in the neighborhoods North Ames, College Creek and Old Town, and the least were old in Greens, Green Hills, and Landmark. Finally, the overall lot area of the houses had a similar leftward skew to the houses prices, thus we logged the Lot Area variable for visual purposes. Based on the summary statistics, the minimum lot area was 1300 sq. ft., the maximum was 215,245 sq. ft., and the mean was 10,148 sq. ft. `r q <- q+1; letters[q]`. Since our target variable is house prices, please provide a detail overview of the distribution of our target variables including stats and plots.```{r}# Summary statistics of key numeric variables by Neighborhoodproperties %>%mutate(Neighborhood =gsub("_", " ", Neighborhood)) %>%group_by(Neighborhood) %>%summarise(count =n(),mean_Lot_Area =mean(Lot_Area, na.rm =TRUE),median_Lot_Area =median(Lot_Area, na.rm =TRUE),mean_Sale_Price =mean(Sale_Price, na.rm =TRUE),median_Sale_Price =median(Sale_Price, na.rm =TRUE),mean_Year_Built =mean(Year_Built, na.rm =TRUE),median_Year_Built =median(Year_Built, na.rm =TRUE),mean_Year_Sold =mean(Year_Sold, na.rm =TRUE),median_Year_Sold =median(Year_Sold, na.rm =TRUE) ) %>%arrange(desc(mean_Sale_Price))# Lot Area vs Sale Priceggplot(properties, aes(x = Lot_Area, y = Sale_Price)) +geom_point(color ="#6A5ACD", alpha =0.6) +geom_smooth(method ="lm", color ="black", se =TRUE, linewidth =1) +scale_y_log10() +scale_x_log10() +labs(title ="Lot Area vs Sale Price",x ="Logged Lot Area (sq. ft.)",y ="Logged Sale Price ($)" ) +theme_minimal() +theme(plot.title =element_text(size =18, face ="bold", hjust =0.5),plot.subtitle =element_text(size =12, hjust =0.5) )# Neighborhood vs Sale Price (Top 10 by Median Price) since there are multipleproperties %>%mutate(Neighborhood =gsub("_", " ", Neighborhood)) %>%group_by(Neighborhood) %>%mutate(median_price =median(Sale_Price, na.rm =TRUE)) %>%ungroup() %>%filter(Neighborhood %in%names(sort(tapply(Sale_Price, Neighborhood, median, na.rm =TRUE), decreasing =TRUE))[1:10]) %>%ggplot(aes(x =reorder(Neighborhood, Sale_Price, median), y = Sale_Price)) +geom_boxplot(fill ="#8E9AFE", color ="black", alpha =0.7) +scale_y_log10() +coord_flip() +labs(title ="Neighborhood vs Sale Price",subtitle ="Top 10 neighborhoods by median price",x ="Neighborhood",y ="Logged Sale Price ($)" ) +theme_minimal() +theme(plot.title =element_text(size =18, face ="bold", hjust =0.5),plot.subtitle =element_text(size =12, hjust =0.5) )# Year Built vs Sale Price (Smoothed Trend)properties %>%group_by(Year_Built) %>%summarise(mean_price =mean(Sale_Price, na.rm =TRUE)) %>%ggplot(aes(x = Year_Built, y = mean_price)) +geom_point(color ="#002366", alpha =0.6) +geom_smooth(color ="#002366", se =FALSE, linewidth =1.2) +scale_y_log10() +labs(title ="Year Built vs Average Sale Price",subtitle ="How do average prices change over time?",x ="Year Built",y ="Mean Logged Sale Price ($)" ) +theme_minimal() +theme(plot.title =element_text(size =18, face ="bold", hjust =0.5),plot.subtitle =element_text(size =12, hjust =0.5) )# Sale_Price by Year_Soldproperties %>%ggplot(aes(x =factor(Year_Sold), y = Sale_Price)) +geom_boxplot(fill ="#005672", color ="black", alpha =0.7) +scale_y_log10() +labs(title ="Sale Price by Year Sold",x ="Year Sold",y ="Logged Sale Price ($)" ) +theme_minimal() +theme(plot.title =element_text(size =18, face ="bold", hjust =0.5),plot.subtitle =element_text(size =12, hjust =0.5) )```Sale Price and Lot Area: Based on the scatter plot of the Logged Lot Area and Logged Sale Price,there appears to be a general upwards trend meaning that a percentage increase in lot area would result in some percentage increase in sale price. Sale Price and Neighborhood: Given the large number of neighborhoods, we selected the top 10 neighborhooods which had mean sale prices ranging from 201,803.43 USD to 330,319 USD while the bottom 10 (not plotted) had mean sale prices ranging from 95,756.49 USD to 136,751.15 USD. In some neighborhoods such as Stone Brook, Northridge, Somerset, and College Creek, there appears to be significant variability with a few outliers in sale price. Whereas other neighborhoods such as Green Hills and Greens have more rigidity in the variation of sale price. Sale Price and Year Built: Based on the plot of the build year and average sale price, there is a general upwards trend, suggesting that newer houses are sold on average for higher values. However, it is worth noting that some outliers do exist during the late 1890s-late 1930s period, and there appears to be a slightly non-linear relationship. Sale Price and Year Sold: Given that houses are sold over the short time frame of 2006-2010, we created a series of box plots to display the relationship between sale price and selling year. Despite the presence of outliers across all years, the mean sale price across all 5 years remains fairly constant. `r q <- q+1; letters[q]`. Provide a overview of the correlation between numerical features of the house (Think about to create a plot that shows all correlations at once). Include any information that you deem important```{r}library(GGally)# Select numeric variables for correlation analysisnum_vars <- properties %>%select(Lot_Area, Total_Bsmt_SF, Gr_Liv_Area, Garage_Area, Wood_Deck_SF, Enclosed_Porch, Pool_Area, Misc_Val, Sale_Price)# Split into 3 subsets (each including Sale_Price)num_vars1 <- num_vars %>%select(Lot_Area,Garage_Area, Sale_Price)num_vars2 <- num_vars %>%select(Total_Bsmt_SF, Gr_Liv_Area, Sale_Price)num_vars3 <- num_vars %>%select(Pool_Area, Misc_Val, Sale_Price)# Make correlation plotsggpairs(num_vars1)ggpairs(num_vars2)ggpairs(num_vars3)```For our selection of numerical variables to include, we chose to eliminate some variables from the num_var subset that would cause potential endogeneity. For instance, the total basement variable was selected rather than first or second basement as the inclusion of all of these variables would lead to multicollinearity, and total basement area is more comprehensive. For readability, we created a subset of numerical variables, then plotted those variables against Sale_Price in subsets using the ggpairs function which gives a series of plots within and across variables, as well as their numerical correlations. From this tactic, we noticed that Lot_Area, Garage_Area, Total_Bsmt_SF, Gr_Liv_Area and Pool_Area all had statistically significant correlations to sale price at all levels while Misc_Val did not. `r q <- q+1; letters[q]`. Choose 5 categorical features, and create plots to understand the relationship between houses prices and those variables.```{r}# Utilitiesmed_utils <- properties %>%group_by(Utilities) %>%summarise(Med_Sale_Price =median(Sale_Price, na.rm =TRUE)) %>%arrange(desc(Med_Sale_Price))ggplot(med_utils, aes(x =reorder(Utilities, -Med_Sale_Price), y = Med_Sale_Price)) +geom_col(fill ="#FF6F61", color ="black", alpha =0.8) +scale_x_discrete(labels =c("AllPub"="All Public Utilities","NoSewr"="Electricity, Gas and Water","NoSeWa"="Electricity and Gas Only","ELO"="Electricity Only")) +labs(title ="Median Sale Price by Utilities",subtitle ="Comparing median home prices across utility types",x ="Utilities",y ="Sale Price" ) +theme_minimal() +theme(plot.title =element_text(size =18, face ="bold", hjust =0.5),plot.subtitle =element_text(size =12, hjust =0.5),axis.text.x =element_text(angle =45, hjust =1))# Overall Conditionmed_cond <- properties %>%group_by(Overall_Cond) %>%summarise(Med_Sale_Price =median(Sale_Price, na.rm =TRUE)) %>%arrange(desc(Med_Sale_Price))ggplot(med_cond, aes(x =reorder(as.factor(Overall_Cond), -Med_Sale_Price), y = Med_Sale_Price)) +geom_col(fill ="#FFD700", color ="black", alpha =0.8) +labs(title ="Median Sale Price by Overall Condition",subtitle ="Comparing median home prices across overall condition ratings",x ="Overall Condition",y ="Sale Price" ) +theme_minimal() +theme(plot.title =element_text(size =18, face ="bold", hjust =0.5),plot.subtitle =element_text(size =12, hjust =0.5),axis.text.x =element_text(angle =45, hjust =1) )# Central Airmed_air <- properties %>%group_by(Central_Air) %>%summarise(Med_Sale_Price =median(Sale_Price, na.rm =TRUE)) %>%arrange(desc(Med_Sale_Price))ggplot(med_air, aes(x =reorder(Central_Air, -Med_Sale_Price), y = Med_Sale_Price)) +geom_col(fill ="#1E90FF", color ="black", alpha =0.8) +scale_y_continuous(labels =NULL, breaks =NULL) +scale_x_discrete(labels =c("Y"="Yes","N"="No")) +labs(title ="Median Sale Price by Central Air",subtitle ="Comparing median home prices across central air types",x ="Central Air",y =NULL ) +theme_minimal() +theme(plot.title =element_text(size =18, face ="bold", hjust =0.5),plot.subtitle =element_text(size =12, hjust =0.5),axis.text.x =element_text(angle =45, hjust =1) )# Sale Conditionmed_salecond <- properties %>%group_by(Sale_Condition) %>%summarise(Med_Sale_Price =median(Sale_Price, na.rm =TRUE)) %>%arrange(desc(Med_Sale_Price))ggplot(med_salecond, aes(x =reorder(Sale_Condition, -Med_Sale_Price), y = Med_Sale_Price)) +geom_col(fill ="#32CD32", color ="black", alpha =0.8) +scale_x_discrete(labels =c("Abnorml"="Abnormal Sale","AdjLand"="Adjoining Land Purchase","Alloca"="Allocation"))+labs(title ="Median Sale Price by Sale Condition",subtitle ="Comparing median home prices across sale conditions",x ="Sale Condition",y ="Sale Price" ) +theme_minimal() +theme(plot.title =element_text(size =18, face ="bold", hjust =0.5),plot.subtitle =element_text(size =12, hjust =0.5),axis.text.x =element_text(angle =45, hjust =1) )# MS Zoningmed_zoning <- properties %>%group_by(MS_Zoning) %>%summarise(Med_Sale_Price =median(Sale_Price, na.rm =TRUE)) %>%arrange(desc(Med_Sale_Price))ggplot(med_zoning, aes(x =reorder(MS_Zoning, -Med_Sale_Price), y = Med_Sale_Price)) +geom_col(fill ="#FF69B4", color ="black", alpha =0.8) +scale_x_discrete(labels =c("A_agr"="Agriculture","C_all"="Commercial","Floating_Village_Residential"="Floating Village Residential","I_all"="Industrial","Residential_High_Density"="Residential High Density","Residential_Low_Density"="Residential Low Density","Residential_Medium_Density"="Residential Medium Density" )) +labs(title ="Median Sale Price by MS Zoning",subtitle ="Comparing median home prices across zoning categories",x ="MS Zoning",y ="Sale_Price" ) +theme_minimal() +theme(plot.title =element_text(size =18, face ="bold", hjust =0.5),plot.subtitle =element_text(size =12, hjust =0.5),axis.text.x =element_text(angle =45, hjust =1) )```For readability and interpretability, we created plots using the median price for each of the categorical variables selected:1) Median Price by Utilities: Based on the bar graph, it appears that houses with all public utilities had higher median selling prices than houses with electricity and gas only, and houses with electricity, gas, and water. Notably, the majority of houses had all public utilities, so the visual may be slightly biased in that regard. 2) Median Price by Condition: Based on the graph, there appears to be a downward tend in median price vs condition suggesting that houses in worse condition sell for lower median prices, which makes sense in context. Interestingly, there is not much of a visible price difference in houses rated as "good", "very good" and "above average" 3) Median Sale Price and Central Air: Houses with central air appear to have higher median selling prices compared to houses that do not have central air. Like the Condition variable, this trend makes sense in context as additional features like central air are often incorporated into the selling price of homes4) Median Sale Price and Sale Condition: Partial sales have a significantly higher median sale price compared to the other sale conditions, which would hold true as houses sold in the "partial" category are often new houses that are not yet finished. Given that they are new, their value would be higher than older depreciated houses5) Median Sale Price and MS Zoning: Houses in the agriculture zones had lower median selling prices whereas houses in the floating village residential category had the highest median selling prices. Given that argicultural houses are in rural landscapes, property values on average would be lower than the floating village residential houses in urban areas. `r q <- q+1; letters[q]`. Alright, after checking some of the variables, let's predict house prices. Split the dataset into training and testing (60-40), and compare the following models using cross-validation (k=10). 1) Best Sub Selection Linear Regression 2) Lasso Linear Regression3) Regression Tree with Pruning4) Bagging 5) Random Forest.You should decide the criteria to choose the most accurate model, and show the main variables of the chosen models. Beware, that you might have to do some data wrangling before running model (Have fun!)```{r, include=FALSE}library(caret)library(glmnet)library(rpart)library(ipred)library(dplyr)library(janitor)# Remove ID columnproperties <- properties %>% select(-matches("^ID$|^id$"))# Remove columns with near-zero variancenzv_cols <- nearZeroVar(properties)properties <- properties[ , -nzv_cols, drop = FALSE]# 4. Drop columns with > 50% missingmissing_pct <- colMeans(is.na(properties))properties <- properties[ , missing_pct < 0.5, drop = FALSE]# 5. Impute missing valuesfor(col in names(properties)) { if (is.numeric(properties[[col]])) { properties[[col]][is.na(properties[[col]])] <- median(properties[[col]], na.rm = TRUE) } else if (is.factor(properties[[col]]) || is.character(properties[[col]])) { mode_value <- names(which.max(table(properties[[col]]))) properties[[col]][is.na(properties[[col]])] <- mode_value properties[[col]] <- factor(properties[[col]]) }}# Remove rows with missing Sale_Price (exact original spelling)properties <- properties %>% filter(!is.na(Sale_Price))# split train_idx <- createDataPartition(properties$Sale_Price, p = 0.6, list = FALSE)train_data <- properties[train_idx, ]test_data <- properties[-train_idx, ]str(train_data)``````{r}library(leaps)data1 <-sapply(train_data, is.numeric)data2 <- train_data[, data1]ctrl <-trainControl(method ="cv", number =10)# Best subsets best_sub <-train( Sale_Price ~ .,data = data2,method ="leapSeq",tuneGrid =data.frame(nvmax =1:20),trControl = ctrl)best_subrmse_best <- best_sub$results$RMSE[best_sub$results$nvmax == best_sub$bestTune$nvmax]summary(best_sub)# LASSOset.seed(123)# Create numeric matrixx <-model.matrix(Sale_Price ~ . -1, data = train_data)y <- train_data$Sale_Price# 10-fold CVkfolds <-trainControl(method ="cv", number =10)# Manually specify alpha and lambda gridlasso_grid <-expand.grid(alpha =1, # LASSOlambda =c(0.001, 0.01, 0.1, 1, 10, 100) # short grid, works in all versions)# Train LASSOlasso <-train(x = x,y = y,method ="glmnet",tuneGrid = lasso_grid,trControl = kfolds)lassormse_lasso <- lasso$results$RMSE[lasso$results$lambda == lasso$bestTune$lambda & lasso$results$alpha == lasso$bestTune$alpha]``````{r}# TREE WITH PRUNINGtree_fit <-train( Sale_Price ~ .,data = train_data,method ="rpart",tuneLength =10,trControl = kfolds)tree_fitrmse_tree <- tree_fit$results$RMSE[tree_fit$results$cp == tree_fit$bestTune$cp]``````{r}#BAGGINGbag <-train( Sale_Price ~ .,data = train_data,method ="treebag",trControl = kfolds)bagrmse_bag <- bag$results$RMSE``````{r}# RANDOM FORESTlibrary(randomForest)random_forest <-train( Sale_Price ~ ., data = train_data,method ="rf",trControl =trainControl(method ="cv", number =5), # 5-fold to speed upntree =200)random_forestrmse_rf <- random_forest$results$RMSE[random_forest$results$mtry == random_forest$bestTune$mtry]``````{r}# Combine RMSEs into a summary tableresults_table <-data.frame(Model =c("Best Subset", "LASSO", "Tree (Pruned)", "Bagging", "Random Forest"),RMSE =c(rmse_best, rmse_lasso, rmse_tree, rmse_bag, rmse_rf))# View resultsresults_table```Based on the models, the lasso regression has the lowest RMSE and highest R-squared value, making it the most viable choice for housing price prediction. Additionally, since lasso regression preserves the important variables and assigns coefficients to them, we had the added advantage of seeing the marginal effects of each explanatory variable on housing sale price.