M1 Project Report
ALY6015: Intermediate Analytics
Northeastern University
Professor: Vladimir Shapiro
By: Zeeshan Ahmad Ansari
Date of Submission: 12 November, 2023
#The report utilizes a set of libraries for various data processing and visualization tasks.
library(tidyverse)
library(readxl)
library(dplyr)
library(readr)
library(kableExtra)
library(corrplot)
library(car)
Our research examines home sales data from the city of Ames, located in the state of Iowa, in the Midwestern United States. The timeframe of focus is the years spanning 2006 to 2010, a period of fluctuation in the national housing market. The dataset we will utilize contains 2,930 total observations, with each observation representing a home sale and including 82 distinct attributes about the home.
Through comprehensive analysis of this robust dataset, we aim to achieve two primary objectives. The first goal is identifying key factors that exhibit strong correlations with and influence on sold home prices in the Ames market during the study period. Uncovering these important determinants will shed light on the underlying drivers of real estate values and sale prices. The second objective is developing a predictive pricing model for Ames homes using linear regression techniques. Constructing an accurate forecasting model will enable projection of expected sales prices for future home listings in this city.
Our methodological approach will begin with thorough exploratory data analysis to understand data distributions and relationships. We will then conduct correlation studies to quantify connections between attributes and sale prices. After proper data preprocessing, including handling of any missing values and assessment of multicollinearity, we will fit a linear regression model to the data.
Several R programming packages will assist with analysis tasks. Corrplot and ggcorrplot will help visualize correlations. The leaps package will facilitate regression modeling and variable selection. Use of these tools will streamline our analytical workflow.
Completing these objectives will provide meaningful insights into the factors shaping the Ames real estate market. The final regression model can support accurate home valuation and price prediction. More broadly, our findings will deepen understanding of recent housing market trends, offering knowledge potentially transferable to other US cities.
1. Load the Ames housing dataset.
In this task we load the Ames Housing dataset provided to us; it will be used for all subsequent analysis.
#Dataset_Employed_in_this_Report
M1W1Data = read.csv("D:/Quater_2/Second Part/ALY6015/Week_1/Assignment/Assignement_ALY6015/Dataset/AmesHousing.csv")
2. Perform Exploratory Data Analysis and use descriptive statistics to describe the data.
In this task we generate descriptive statistics for both numerical and categorical columns. We use the summary() command to obtain information about the dataset and improve the presentation with the kable() and kable_styling() commands (Chiluiza, 2022).
# Rename Columns
colnames(M1W1Data) <- tolower(gsub("[ ,.]", "_", colnames(M1W1Data)))
#colnames(M1W1Data)
# Structure of the dataset
str(M1W1Data)
## 'data.frame': 2930 obs. of 82 variables:
## $ order : int 1 2 3 4 5 6 7 8 9 10 ...
## $ pid : int 526301100 526350040 526351010 526353030 527105010 527105030 527127150 527145080 527146030 527162130 ...
## $ ms_subclass : int 20 20 20 20 60 60 120 120 120 60 ...
## $ ms_zoning : chr "RL" "RH" "RL" "RL" ...
## $ lot_frontage : int 141 80 81 93 74 78 41 43 39 60 ...
## $ lot_area : int 31770 11622 14267 11160 13830 9978 4920 5005 5389 7500 ...
## $ street : chr "Pave" "Pave" "Pave" "Pave" ...
## $ alley : chr NA NA NA NA ...
## $ lot_shape : chr "IR1" "Reg" "IR1" "Reg" ...
## $ land_contour : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ lot_config : chr "Corner" "Inside" "Corner" "Corner" ...
## $ land_slope : chr "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ neighborhood : chr "NAmes" "NAmes" "NAmes" "NAmes" ...
## $ condition_1 : chr "Norm" "Feedr" "Norm" "Norm" ...
## $ condition_2 : chr "Norm" "Norm" "Norm" "Norm" ...
## $ bldg_type : chr "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ house_style : chr "1Story" "1Story" "1Story" "1Story" ...
## $ overall_qual : int 6 5 6 7 5 6 8 8 8 7 ...
## $ overall_cond : int 5 6 6 5 5 6 5 5 5 5 ...
## $ year_built : int 1960 1961 1958 1968 1997 1998 2001 1992 1995 1999 ...
## $ year_remod_add : int 1960 1961 1958 1968 1998 1998 2001 1992 1996 1999 ...
## $ roof_style : chr "Hip" "Gable" "Hip" "Hip" ...
## $ roof_matl : chr "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ exterior_1st : chr "BrkFace" "VinylSd" "Wd Sdng" "BrkFace" ...
## $ exterior_2nd : chr "Plywood" "VinylSd" "Wd Sdng" "BrkFace" ...
## $ mas_vnr_type : chr "Stone" "None" "BrkFace" "None" ...
## $ mas_vnr_area : int 112 0 108 0 0 20 0 0 0 0 ...
## $ exter_qual : chr "TA" "TA" "TA" "Gd" ...
## $ exter_cond : chr "TA" "TA" "TA" "TA" ...
## $ foundation : chr "CBlock" "CBlock" "CBlock" "CBlock" ...
## $ bsmt_qual : chr "TA" "TA" "TA" "TA" ...
## $ bsmt_cond : chr "Gd" "TA" "TA" "TA" ...
## $ bsmt_exposure : chr "Gd" "No" "No" "No" ...
## $ bsmtfin_type_1 : chr "BLQ" "Rec" "ALQ" "ALQ" ...
## $ bsmtfin_sf_1 : int 639 468 923 1065 791 602 616 263 1180 0 ...
## $ bsmtfin_type_2 : chr "Unf" "LwQ" "Unf" "Unf" ...
## $ bsmtfin_sf_2 : int 0 144 0 0 0 0 0 0 0 0 ...
## $ bsmt_unf_sf : int 441 270 406 1045 137 324 722 1017 415 994 ...
## $ total_bsmt_sf : int 1080 882 1329 2110 928 926 1338 1280 1595 994 ...
## $ heating : chr "GasA" "GasA" "GasA" "GasA" ...
## $ heating_qc : chr "Fa" "TA" "TA" "Ex" ...
## $ central_air : chr "Y" "Y" "Y" "Y" ...
## $ electrical : chr "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
## $ x1st_flr_sf : int 1656 896 1329 2110 928 926 1338 1280 1616 1028 ...
## $ x2nd_flr_sf : int 0 0 0 0 701 678 0 0 0 776 ...
## $ low_qual_fin_sf: int 0 0 0 0 0 0 0 0 0 0 ...
## $ gr_liv_area : int 1656 896 1329 2110 1629 1604 1338 1280 1616 1804 ...
## $ bsmt_full_bath : int 1 0 0 1 0 0 1 0 1 0 ...
## $ bsmt_half_bath : int 0 0 0 0 0 0 0 0 0 0 ...
## $ full_bath : int 1 1 1 2 2 2 2 2 2 2 ...
## $ half_bath : int 0 0 1 1 1 1 0 0 0 1 ...
## $ bedroom_abvgr : int 3 2 3 3 3 3 2 2 2 3 ...
## $ kitchen_abvgr : int 1 1 1 1 1 1 1 1 1 1 ...
## $ kitchen_qual : chr "TA" "TA" "Gd" "Ex" ...
## $ totrms_abvgrd : int 7 5 6 8 6 7 6 5 5 7 ...
## $ functional : chr "Typ" "Typ" "Typ" "Typ" ...
## $ fireplaces : int 2 0 0 2 1 1 0 0 1 1 ...
## $ fireplace_qu : chr "Gd" NA NA "TA" ...
## $ garage_type : chr "Attchd" "Attchd" "Attchd" "Attchd" ...
## $ garage_yr_blt : int 1960 1961 1958 1968 1997 1998 2001 1992 1995 1999 ...
## $ garage_finish : chr "Fin" "Unf" "Unf" "Fin" ...
## $ garage_cars : int 2 1 1 2 2 2 2 2 2 2 ...
## $ garage_area : int 528 730 312 522 482 470 582 506 608 442 ...
## $ garage_qual : chr "TA" "TA" "TA" "TA" ...
## $ garage_cond : chr "TA" "TA" "TA" "TA" ...
## $ paved_drive : chr "P" "Y" "Y" "Y" ...
## $ wood_deck_sf : int 210 140 393 0 212 360 0 0 237 140 ...
## $ open_porch_sf : int 62 0 36 0 34 36 0 82 152 60 ...
## $ enclosed_porch : int 0 0 0 0 0 0 170 0 0 0 ...
## $ x3ssn_porch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ screen_porch : int 0 120 0 0 0 0 0 144 0 0 ...
## $ pool_area : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pool_qc : chr NA NA NA NA ...
## $ fence : chr NA "MnPrv" NA NA ...
## $ misc_feature : chr NA NA "Gar2" NA ...
## $ misc_val : int 0 0 12500 0 0 0 0 0 0 0 ...
## $ mo_sold : int 5 6 6 4 3 6 4 1 3 6 ...
## $ yr_sold : int 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
## $ sale_type : chr "WD " "WD " "WD " "WD " ...
## $ sale_condition : chr "Normal" "Normal" "Normal" "Normal" ...
## $ saleprice : int 215000 105000 172000 244000 189900 195500 213500 191500 236500 189000 ...
# Summary statistics of the dataset
data_summary <- summary(M1W1Data)
kable(data_summary, format = "html", align = "l") %>%
column_spec(1, bold = TRUE)%>%
kable_styling(full_width = TRUE, "striped",font_size = 14) %>%
row_spec(0, bold = TRUE, background = "slategrey" , color = "white")%>%
scroll_box(width = "100%", height = "400px")
| order | pid | ms_subclass | ms_zoning | lot_frontage | lot_area | street | alley | lot_shape | land_contour | utilities | lot_config | land_slope | neighborhood | condition_1 | condition_2 | bldg_type | house_style | overall_qual | overall_cond | year_built | year_remod_add | roof_style | roof_matl | exterior_1st | exterior_2nd | mas_vnr_type | mas_vnr_area | exter_qual | exter_cond | foundation | bsmt_qual | bsmt_cond | bsmt_exposure | bsmtfin_type_1 | bsmtfin_sf_1 | bsmtfin_type_2 | bsmtfin_sf_2 | bsmt_unf_sf | total_bsmt_sf | heating | heating_qc | central_air | electrical | x1st_flr_sf | x2nd_flr_sf | low_qual_fin_sf | gr_liv_area | bsmt_full_bath | bsmt_half_bath | full_bath | half_bath | bedroom_abvgr | kitchen_abvgr | kitchen_qual | totrms_abvgrd | functional | fireplaces | fireplace_qu | garage_type | garage_yr_blt | garage_finish | garage_cars | garage_area | garage_qual | garage_cond | paved_drive | wood_deck_sf | open_porch_sf | enclosed_porch | x3ssn_porch | screen_porch | pool_area | pool_qc | fence | misc_feature | misc_val | mo_sold | yr_sold | sale_type | sale_condition | saleprice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 1.0 | Min. :5.263e+08 | Min. : 20.00 | Length:2930 | Min. : 21.00 | Min. : 1300 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Min. : 1.000 | Min. :1.000 | Min. :1872 | Min. :1950 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Min. : 0.0 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Min. : 0.0 | Length:2930 | Min. : 0.00 | Min. : 0.0 | Min. : 0 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Min. : 334.0 | Min. : 0.0 | Min. : 0.000 | Min. : 334 | Min. :0.0000 | Min. :0.00000 | Min. :0.000 | Min. :0.0000 | Min. :0.000 | Min. :0.000 | Length:2930 | Min. : 2.000 | Length:2930 | Min. :0.0000 | Length:2930 | Length:2930 | Min. :1895 | Length:2930 | Min. :0.000 | Min. : 0.0 | Length:2930 | Length:2930 | Length:2930 | Min. : 0.00 | Min. : 0.00 | Min. : 0.00 | Min. : 0.000 | Min. : 0 | Min. : 0.000 | Length:2930 | Length:2930 | Length:2930 | Min. : 0.00 | Min. : 1.000 | Min. :2006 | Length:2930 | Length:2930 | Min. : 12789 | |
| 1st Qu.: 733.2 | 1st Qu.:5.285e+08 | 1st Qu.: 20.00 | Class :character | 1st Qu.: 58.00 | 1st Qu.: 7440 | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | 1st Qu.: 5.000 | 1st Qu.:5.000 | 1st Qu.:1954 | 1st Qu.:1965 | Class :character | Class :character | Class :character | Class :character | Class :character | 1st Qu.: 0.0 | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | 1st Qu.: 0.0 | Class :character | 1st Qu.: 0.00 | 1st Qu.: 219.0 | 1st Qu.: 793 | Class :character | Class :character | Class :character | Class :character | 1st Qu.: 876.2 | 1st Qu.: 0.0 | 1st Qu.: 0.000 | 1st Qu.:1126 | 1st Qu.:0.0000 | 1st Qu.:0.00000 | 1st Qu.:1.000 | 1st Qu.:0.0000 | 1st Qu.:2.000 | 1st Qu.:1.000 | Class :character | 1st Qu.: 5.000 | Class :character | 1st Qu.:0.0000 | Class :character | Class :character | 1st Qu.:1960 | Class :character | 1st Qu.:1.000 | 1st Qu.: 320.0 | Class :character | Class :character | Class :character | 1st Qu.: 0.00 | 1st Qu.: 0.00 | 1st Qu.: 0.00 | 1st Qu.: 0.000 | 1st Qu.: 0 | 1st Qu.: 0.000 | Class :character | Class :character | Class :character | 1st Qu.: 0.00 | 1st Qu.: 4.000 | 1st Qu.:2007 | Class :character | Class :character | 1st Qu.:129500 | |
| Median :1465.5 | Median :5.355e+08 | Median : 50.00 | Mode :character | Median : 68.00 | Median : 9436 | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Median : 6.000 | Median :5.000 | Median :1973 | Median :1993 | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Median : 0.0 | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Median : 370.0 | Mode :character | Median : 0.00 | Median : 466.0 | Median : 990 | Mode :character | Mode :character | Mode :character | Mode :character | Median :1084.0 | Median : 0.0 | Median : 0.000 | Median :1442 | Median :0.0000 | Median :0.00000 | Median :2.000 | Median :0.0000 | Median :3.000 | Median :1.000 | Mode :character | Median : 6.000 | Mode :character | Median :1.0000 | Mode :character | Mode :character | Median :1979 | Mode :character | Median :2.000 | Median : 480.0 | Mode :character | Mode :character | Mode :character | Median : 0.00 | Median : 27.00 | Median : 0.00 | Median : 0.000 | Median : 0 | Median : 0.000 | Mode :character | Mode :character | Mode :character | Median : 0.00 | Median : 6.000 | Median :2008 | Mode :character | Mode :character | Median :160000 | |
| Mean :1465.5 | Mean :7.145e+08 | Mean : 57.39 | NA | Mean : 69.22 | Mean : 10148 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | Mean : 6.095 | Mean :5.563 | Mean :1971 | Mean :1984 | NA | NA | NA | NA | NA | Mean : 101.9 | NA | NA | NA | NA | NA | NA | NA | Mean : 442.6 | NA | Mean : 49.72 | Mean : 559.3 | Mean :1052 | NA | NA | NA | NA | Mean :1159.6 | Mean : 335.5 | Mean : 4.677 | Mean :1500 | Mean :0.4314 | Mean :0.06113 | Mean :1.567 | Mean :0.3795 | Mean :2.854 | Mean :1.044 | NA | Mean : 6.443 | NA | Mean :0.5993 | NA | NA | Mean :1978 | NA | Mean :1.767 | Mean : 472.8 | NA | NA | NA | Mean : 93.75 | Mean : 47.53 | Mean : 23.01 | Mean : 2.592 | Mean : 16 | Mean : 2.243 | NA | NA | NA | Mean : 50.63 | Mean : 6.216 | Mean :2008 | NA | NA | Mean :180796 | |
| 3rd Qu.:2197.8 | 3rd Qu.:9.072e+08 | 3rd Qu.: 70.00 | NA | 3rd Qu.: 80.00 | 3rd Qu.: 11555 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 3rd Qu.: 7.000 | 3rd Qu.:6.000 | 3rd Qu.:2001 | 3rd Qu.:2004 | NA | NA | NA | NA | NA | 3rd Qu.: 164.0 | NA | NA | NA | NA | NA | NA | NA | 3rd Qu.: 734.0 | NA | 3rd Qu.: 0.00 | 3rd Qu.: 802.0 | 3rd Qu.:1302 | NA | NA | NA | NA | 3rd Qu.:1384.0 | 3rd Qu.: 703.8 | 3rd Qu.: 0.000 | 3rd Qu.:1743 | 3rd Qu.:1.0000 | 3rd Qu.:0.00000 | 3rd Qu.:2.000 | 3rd Qu.:1.0000 | 3rd Qu.:3.000 | 3rd Qu.:1.000 | NA | 3rd Qu.: 7.000 | NA | 3rd Qu.:1.0000 | NA | NA | 3rd Qu.:2002 | NA | 3rd Qu.:2.000 | 3rd Qu.: 576.0 | NA | NA | NA | 3rd Qu.: 168.00 | 3rd Qu.: 70.00 | 3rd Qu.: 0.00 | 3rd Qu.: 0.000 | 3rd Qu.: 0 | 3rd Qu.: 0.000 | NA | NA | NA | 3rd Qu.: 0.00 | 3rd Qu.: 8.000 | 3rd Qu.:2009 | NA | NA | 3rd Qu.:213500 | |
| Max. :2930.0 | Max. :1.007e+09 | Max. :190.00 | NA | Max. :313.00 | Max. :215245 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | Max. :10.000 | Max. :9.000 | Max. :2010 | Max. :2010 | NA | NA | NA | NA | NA | Max. :1600.0 | NA | NA | NA | NA | NA | NA | NA | Max. :5644.0 | NA | Max. :1526.00 | Max. :2336.0 | Max. :6110 | NA | NA | NA | NA | Max. :5095.0 | Max. :2065.0 | Max. :1064.000 | Max. :5642 | Max. :3.0000 | Max. :2.00000 | Max. :4.000 | Max. :2.0000 | Max. :8.000 | Max. :3.000 | NA | Max. :15.000 | NA | Max. :4.0000 | NA | NA | Max. :2207 | NA | Max. :5.000 | Max. :1488.0 | NA | NA | NA | Max. :1424.00 | Max. :742.00 | Max. :1012.00 | Max. :508.000 | Max. :576 | Max. :800.000 | NA | NA | NA | Max. :17000.00 | Max. :12.000 | Max. :2010 | NA | NA | Max. :755000 | |
| NA | NA | NA | NA | NA’s :490 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA’s :23 | NA | NA | NA | NA | NA | NA | NA | NA’s :1 | NA | NA’s :1 | NA’s :1 | NA’s :1 | NA | NA | NA | NA | NA | NA | NA | NA | NA’s :2 | NA’s :2 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA’s :159 | NA | NA’s :1 | NA’s :1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
The str() function displays the structure of the dataset M1W1Data. It reveals the types and formats of the variables in the dataset, including the number of observations and variables.
# Descriptive statistics of numerical columns
num_summary <- summary(select_if(M1W1Data, is.numeric))
kable(num_summary, format = "html", align = "l") %>%
column_spec(1, bold = TRUE)%>%
kable_styling(full_width = TRUE, "striped",font_size = 14) %>%
row_spec(0, bold = TRUE, background = "slategrey" , color = "white")%>%
scroll_box(width = "100%", height = "400px")
| order | pid | ms_subclass | lot_frontage | lot_area | overall_qual | overall_cond | year_built | year_remod_add | mas_vnr_area | bsmtfin_sf_1 | bsmtfin_sf_2 | bsmt_unf_sf | total_bsmt_sf | x1st_flr_sf | x2nd_flr_sf | low_qual_fin_sf | gr_liv_area | bsmt_full_bath | bsmt_half_bath | full_bath | half_bath | bedroom_abvgr | kitchen_abvgr | totrms_abvgrd | fireplaces | garage_yr_blt | garage_cars | garage_area | wood_deck_sf | open_porch_sf | enclosed_porch | x3ssn_porch | screen_porch | pool_area | misc_val | mo_sold | yr_sold | saleprice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 1.0 | Min. :5.263e+08 | Min. : 20.00 | Min. : 21.00 | Min. : 1300 | Min. : 1.000 | Min. :1.000 | Min. :1872 | Min. :1950 | Min. : 0.0 | Min. : 0.0 | Min. : 0.00 | Min. : 0.0 | Min. : 0 | Min. : 334.0 | Min. : 0.0 | Min. : 0.000 | Min. : 334 | Min. :0.0000 | Min. :0.00000 | Min. :0.000 | Min. :0.0000 | Min. :0.000 | Min. :0.000 | Min. : 2.000 | Min. :0.0000 | Min. :1895 | Min. :0.000 | Min. : 0.0 | Min. : 0.00 | Min. : 0.00 | Min. : 0.00 | Min. : 0.000 | Min. : 0 | Min. : 0.000 | Min. : 0.00 | Min. : 1.000 | Min. :2006 | Min. : 12789 | |
| 1st Qu.: 733.2 | 1st Qu.:5.285e+08 | 1st Qu.: 20.00 | 1st Qu.: 58.00 | 1st Qu.: 7440 | 1st Qu.: 5.000 | 1st Qu.:5.000 | 1st Qu.:1954 | 1st Qu.:1965 | 1st Qu.: 0.0 | 1st Qu.: 0.0 | 1st Qu.: 0.00 | 1st Qu.: 219.0 | 1st Qu.: 793 | 1st Qu.: 876.2 | 1st Qu.: 0.0 | 1st Qu.: 0.000 | 1st Qu.:1126 | 1st Qu.:0.0000 | 1st Qu.:0.00000 | 1st Qu.:1.000 | 1st Qu.:0.0000 | 1st Qu.:2.000 | 1st Qu.:1.000 | 1st Qu.: 5.000 | 1st Qu.:0.0000 | 1st Qu.:1960 | 1st Qu.:1.000 | 1st Qu.: 320.0 | 1st Qu.: 0.00 | 1st Qu.: 0.00 | 1st Qu.: 0.00 | 1st Qu.: 0.000 | 1st Qu.: 0 | 1st Qu.: 0.000 | 1st Qu.: 0.00 | 1st Qu.: 4.000 | 1st Qu.:2007 | 1st Qu.:129500 | |
| Median :1465.5 | Median :5.355e+08 | Median : 50.00 | Median : 68.00 | Median : 9436 | Median : 6.000 | Median :5.000 | Median :1973 | Median :1993 | Median : 0.0 | Median : 370.0 | Median : 0.00 | Median : 466.0 | Median : 990 | Median :1084.0 | Median : 0.0 | Median : 0.000 | Median :1442 | Median :0.0000 | Median :0.00000 | Median :2.000 | Median :0.0000 | Median :3.000 | Median :1.000 | Median : 6.000 | Median :1.0000 | Median :1979 | Median :2.000 | Median : 480.0 | Median : 0.00 | Median : 27.00 | Median : 0.00 | Median : 0.000 | Median : 0 | Median : 0.000 | Median : 0.00 | Median : 6.000 | Median :2008 | Median :160000 | |
| Mean :1465.5 | Mean :7.145e+08 | Mean : 57.39 | Mean : 69.22 | Mean : 10148 | Mean : 6.095 | Mean :5.563 | Mean :1971 | Mean :1984 | Mean : 101.9 | Mean : 442.6 | Mean : 49.72 | Mean : 559.3 | Mean :1052 | Mean :1159.6 | Mean : 335.5 | Mean : 4.677 | Mean :1500 | Mean :0.4314 | Mean :0.06113 | Mean :1.567 | Mean :0.3795 | Mean :2.854 | Mean :1.044 | Mean : 6.443 | Mean :0.5993 | Mean :1978 | Mean :1.767 | Mean : 472.8 | Mean : 93.75 | Mean : 47.53 | Mean : 23.01 | Mean : 2.592 | Mean : 16 | Mean : 2.243 | Mean : 50.63 | Mean : 6.216 | Mean :2008 | Mean :180796 | |
| 3rd Qu.:2197.8 | 3rd Qu.:9.072e+08 | 3rd Qu.: 70.00 | 3rd Qu.: 80.00 | 3rd Qu.: 11555 | 3rd Qu.: 7.000 | 3rd Qu.:6.000 | 3rd Qu.:2001 | 3rd Qu.:2004 | 3rd Qu.: 164.0 | 3rd Qu.: 734.0 | 3rd Qu.: 0.00 | 3rd Qu.: 802.0 | 3rd Qu.:1302 | 3rd Qu.:1384.0 | 3rd Qu.: 703.8 | 3rd Qu.: 0.000 | 3rd Qu.:1743 | 3rd Qu.:1.0000 | 3rd Qu.:0.00000 | 3rd Qu.:2.000 | 3rd Qu.:1.0000 | 3rd Qu.:3.000 | 3rd Qu.:1.000 | 3rd Qu.: 7.000 | 3rd Qu.:1.0000 | 3rd Qu.:2002 | 3rd Qu.:2.000 | 3rd Qu.: 576.0 | 3rd Qu.: 168.00 | 3rd Qu.: 70.00 | 3rd Qu.: 0.00 | 3rd Qu.: 0.000 | 3rd Qu.: 0 | 3rd Qu.: 0.000 | 3rd Qu.: 0.00 | 3rd Qu.: 8.000 | 3rd Qu.:2009 | 3rd Qu.:213500 | |
| Max. :2930.0 | Max. :1.007e+09 | Max. :190.00 | Max. :313.00 | Max. :215245 | Max. :10.000 | Max. :9.000 | Max. :2010 | Max. :2010 | Max. :1600.0 | Max. :5644.0 | Max. :1526.00 | Max. :2336.0 | Max. :6110 | Max. :5095.0 | Max. :2065.0 | Max. :1064.000 | Max. :5642 | Max. :3.0000 | Max. :2.00000 | Max. :4.000 | Max. :2.0000 | Max. :8.000 | Max. :3.000 | Max. :15.000 | Max. :4.0000 | Max. :2207 | Max. :5.000 | Max. :1488.0 | Max. :1424.00 | Max. :742.00 | Max. :1012.00 | Max. :508.000 | Max. :576 | Max. :800.000 | Max. :17000.00 | Max. :12.000 | Max. :2010 | Max. :755000 | |
| NA | NA | NA | NA’s :490 | NA | NA | NA | NA | NA | NA’s :23 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA | NA | NA | NA | NA’s :2 | NA’s :2 | NA | NA | NA | NA | NA | NA | NA’s :159 | NA’s :1 | NA’s :1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
The code above computes summary statistics for the numerical columns in the dataset.
# Descriptive statistics of categorical columns
cat_summary <- summary(select_if(M1W1Data, is.character))
kable(cat_summary, format = "html", align = "l") %>%
column_spec(1, bold = TRUE)%>%
kable_styling(full_width = TRUE, "striped",font_size = 14) %>%
row_spec(0, bold = TRUE, background = "slategrey" , color = "white")%>%
scroll_box(width = "100%", height = "200px")
| ms_zoning | street | alley | lot_shape | land_contour | utilities | lot_config | land_slope | neighborhood | condition_1 | condition_2 | bldg_type | house_style | roof_style | roof_matl | exterior_1st | exterior_2nd | mas_vnr_type | exter_qual | exter_cond | foundation | bsmt_qual | bsmt_cond | bsmt_exposure | bsmtfin_type_1 | bsmtfin_type_2 | heating | heating_qc | central_air | electrical | kitchen_qual | functional | fireplace_qu | garage_type | garage_finish | garage_qual | garage_cond | paved_drive | pool_qc | fence | misc_feature | sale_type | sale_condition | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | Length:2930 | |
| Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | |
| Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character |
The code above computes summary statistics for the categorical columns in the dataset.
Observations to Task_2:
The Ames Housing dataset under evaluation has 2930 observations of 82 variables. The variables span a range of data types, including integers (denoted 'int') and character strings (denoted 'chr'). It is important to note that some variables contain NA values, which indicate missing or unrecorded data.
The summary() command used above provides a concise overview of the dataset variables. The resulting table includes key statistics for each numeric variable, such as the minimum, maximum, median, mean, and quartile values. This allows for a quick grasp of the range and distribution of the numerical attributes.
3. Prepare the dataset for modeling by imputing missing values with the variable's mean value or any other value that you prefer.
data <- M1W1Data
# Check for missing values
summarize_na <- colSums(is.na(data))
#summarize_na
# Separating numeric and non-numeric columns
numeric_cols <- sapply(data, is.numeric)
non_numeric_cols <- sapply(data, function(x) !is.numeric(x))
# Imputing missing values for numeric cols with mean
data[numeric_cols] <- lapply(data[numeric_cols], function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))
# Imputing missing values for non-numeric cols with the string "NA"
data[non_numeric_cols] <- lapply(data[non_numeric_cols], function(x) ifelse(is.na(x), "NA", x))
variables_missing <- c("bsmt_qual", "bsmt_cond", "bsmt_exposure", "bsmtfin_type_1", "bsmtfin_type_2", "garage_type", "garage_finish", "garage_qual", "garage_cond", "mas_vnr_type")
# Replace NA with "NA" in the specified columns
data[variables_missing][is.na(data[variables_missing])] <- "NA"
# Checking again for missing values
missing_value_check <- colSums(is.na(data))
#missing_value_check
#view(data)
Observations to Task_3:
The code first separates the columns into numeric and non-numeric groups using sapply(). This allows the imputation to be tailored to each data type, which is an important consideration.
For numeric columns, lapply() loops through each column and replaces missing values with the column mean. Mean imputation is a suitable strategy for numeric data.
Non-numeric columns have their NA values imputed with a new "NA" category. This preserves the categorical nature of these variables rather than converting them to numeric; introducing an explicit category is appropriate for these data types.
Finally, any remaining NA entries in the listed basement, garage, and masonry columns are explicitly replaced with the "NA" category, and the missing-value counts are recomputed to confirm that no missing values remain. Handling missing values consistently is an important data-cleaning step.
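As an illustrative alternative (not part of the original workflow), the same imputation could be written more compactly with dplyr's across(); this is a minimal sketch assuming M1W1Data has been loaded and dplyr attached as above.
# A minimal dplyr sketch of the same imputation (alternative, not used above)
data_imputed <- M1W1Data %>%
  # Replace NA in every numeric column with that column's mean
  mutate(across(where(is.numeric), ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x))) %>%
  # Replace NA in every character column with the explicit category "NA"
  mutate(across(where(is.character), ~ ifelse(is.na(.x), "NA", .x)))
# Confirm that no missing values remain
sum(is.na(data_imputed))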
4. Use the cor() function to produce a correlation matrix of the numeric values.
numeric_data <- data[sapply(data, is.numeric)]
# Calculate the correlation matrix
correlation_matrix <- cor(numeric_data)
#correlation_matrix
# View the correlation matrix
view(correlation_matrix)
Observations to Task_4:
First, we select only the columns that contain numerical data using the code numeric_data <- data[sapply(data, is.numeric)]. This step isolates the numeric columns for the subsequent analysis.
Next, we generate the correlation matrix with the cor() function, which quantifies how the different numeric columns in the dataset are related to one another.
Finally, we display the correlation matrix. Each value in the matrix represents the relationship between two numeric columns: a value closer to +1 or -1 signifies a stronger relationship, while a value closer to zero suggests a weaker or negligible relationship.
The purpose of generating a correlation matrix is to understand how the numeric columns are interrelated. Analyzing this matrix reveals which columns might influence each other, which is valuable for data analysis and predictive modeling tasks.
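As a supplementary sketch (not part of the original code), the matrix can also be scanned programmatically for strongly correlated pairs of variables; the 0.8 threshold below is an arbitrary illustrative choice.
# A minimal sketch: list variable pairs with |r| > 0.8 (illustrative threshold)
high_corr <- which(abs(correlation_matrix) > 0.8 & upper.tri(correlation_matrix),
                   arr.ind = TRUE)
data.frame(var_1 = rownames(correlation_matrix)[high_corr[, 1]],
           var_2 = colnames(correlation_matrix)[high_corr[, 2]],
           r     = round(correlation_matrix[high_corr], 3))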
5. Produce a plot of the correlation matrix, and explain how to interpret it. (hint - check the corrplot or ggcorrplot plot libraries)
corrplot(correlation_matrix, tl.cex = 0.5, type = "upper")
Observations to Task_5:
In the correlation plot above, the colors represent the degree and direction of each correlation: darker, more saturated cells indicate stronger correlations, with blue shades for positive correlations and red shades for negative correlations.
The type = "upper" argument displays only the upper triangle of the matrix, since the lower triangle would repeat the same information, and tl.cex = 0.5 shrinks the variable labels so that all of the numeric columns remain legible.
This plot allows a more detailed exploration of correlations between variables, especially in larger datasets. By observing the color intensity and patterns, we can quickly identify variables that exhibit strong relationships, understand the direction of their correlations, and recognize groups of variables that are closely related.
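The hierarchical ordering and cell labels offered by ggcorrplot (mentioned in the assignment hint) provide an optional alternative rendering; the sketch below is illustrative and not part of the output above.
library(ggcorrplot)
# Optional alternative: hc.order = TRUE clusters strongly correlated variables
# together, and lab = TRUE prints the correlation value inside each cell.
ggcorrplot(correlation_matrix,
           hc.order = TRUE,
           type = "upper",
           lab = TRUE,
           lab_size = 1.5,
           tl.cex = 5)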
6. Make a scatter plot for the X continuous variable with the highest correlation with SalePrice. Do the same for the X variable that has the lowest correlation with SalePrice. Finally, make a scatter plot between X and SalePrice with the correlation closest to 0.5. Interpret the scatter plots and describe how the patterns differ.
# Extract the correlation values of SalePrice with other variables
correlation_with_saleprice <- correlation_matrix["saleprice", ]
correlation_with_saleprice_sorted <- sort(correlation_with_saleprice)
correlation_with_saleprice_sorted
## pid enclosed_porch kitchen_abvgr overall_cond ms_subclass
## -0.246521213 -0.128787442 -0.119813720 -0.101696932 -0.085091576
## low_qual_fin_sf bsmt_half_bath order yr_sold misc_val
## -0.037659765 -0.035815123 -0.031407925 -0.030569087 -0.015691463
## bsmtfin_sf_2 x3ssn_porch mo_sold pool_area screen_porch
## 0.005889764 0.032224649 0.035258842 0.068403247 0.112151214
## bedroom_abvgr bsmt_unf_sf lot_area x2nd_flr_sf bsmt_full_bath
## 0.143913428 0.182804552 0.266549220 0.269373357 0.275893674
## half_bath open_porch_sf wood_deck_sf lot_frontage bsmtfin_sf_1
## 0.285056032 0.312950506 0.327143174 0.340751054 0.432794357
## fireplaces totrms_abvgrd mas_vnr_area garage_yr_blt year_remod_add
## 0.474558093 0.495474417 0.505784081 0.510684432 0.532973754
## full_bath year_built x1st_flr_sf total_bsmt_sf garage_area
## 0.545603901 0.558426106 0.621676063 0.632105117 0.640385461
## garage_cars gr_liv_area overall_qual saleprice
## 0.647861110 0.706779921 0.799261795 1.000000000
plot(data$saleprice, data$saleprice,
     xlab = "SalePrice", ylab = "SalePrice",
     main = "Scatter plot of SalePrice against itself")
plot(data$bsmtfin_sf_2, data$saleprice,
     xlab = "bsmtfin_sf_2", ylab = "SalePrice",
     main = "Scatter plot of SalePrice vs. bsmtfin_sf_2")
plot(data$garage_area, data$saleprice,
     xlab = "Garage Area", ylab = "Sale Price",
     main = "Scatter plot of SalePrice vs. Garage Area")
Observations to Task_6:
Examining the sorted correlation values provides an overview of how strongly each variable is related to SalePrice. Variables with higher correlations are typically more influential in predicting SalePrice; SalePrice itself naturally shows the highest correlation (1.0).
Scatter Plots:
SalePrice vs. SalePrice: This plot simply shows a perfect linear relationship, which is expected since the variable is plotted against itself.
SalePrice vs. bsmtfin_sf_2: This scatter plot visualizes the relationship between SalePrice and bsmtfin_sf_2, the variable whose correlation with SalePrice is closest to zero. The points show little linear pattern and contain outliers, which makes interpretation difficult.
SalePrice vs. Garage Area: This plot shows the relationship between SalePrice and Garage Area. It has some outliers, but for most properties the sale price tends to increase as the garage area increases.
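As a supplementary sketch (not part of the original code), the variables requested in this task could also be selected programmatically from the correlation vector computed above.
# A minimal sketch: pick the requested variables programmatically.
# Drop saleprice's correlation with itself before searching.
corr_vec <- correlation_with_saleprice[names(correlation_with_saleprice) != "saleprice"]
names(which.max(corr_vec))            # highest correlation with saleprice
names(which.min(abs(corr_vec)))       # correlation closest to 0 (weakest)
names(which.min(abs(corr_vec - 0.5))) # correlation closest to 0.5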
7. Using at least 3 continuous variables, fit a regression model in R.
I have chosen the following three continuous variables for the regression model: lot_frontage, lot_area, and garage_area.
# Fit a linear regression model
model <- lm(saleprice ~ lot_frontage + lot_area + garage_area, data = numeric_data)
# Summary of the regression model
summary(model)
##
## Call:
## lm(formula = saleprice ~ lot_frontage + lot_area + garage_area,
## data = numeric_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -378818 -32640 -4477 25820 470664
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.027e+04 3.983e+03 10.111 < 2e-16 ***
## lot_frontage 3.976e+02 5.827e+01 6.823 1.08e-11 ***
## lot_area 1.052e+00 1.519e-01 6.928 5.22e-12 ***
## garage_area 2.164e+02 5.505e+00 39.311 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 59980 on 2926 degrees of freedom
## Multiple R-squared: 0.4368, Adjusted R-squared: 0.4362
## F-statistic: 756.5 on 3 and 2926 DF, p-value: < 2.2e-16
# Extract model coefficients
coefficients <- summary(model)$coefficients
coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40270.388263 3982.6789662 10.111382 1.197653e-23
## lot_frontage 397.582519 58.2747105 6.822557 1.082565e-11
## lot_area 1.052361 0.1518948 6.928226 5.219572e-12
## garage_area 216.412087 5.5051092 39.311134 9.216922e-272
# Report the model in equation form
cat("SalePrice =",
round(coefficients[1, 1], digits = 2),
"+", round(coefficients[2, 1], digits = 2), "* lot_frontage",
"+", round(coefficients[3, 1], digits = 2), "* lot_area",
"+", round(coefficients[4, 1], digits = 2), "* garage_area", "\n")
## SalePrice = 40270.39 + 397.58 * lot_frontage + 1.05 * lot_area + 216.41 * garage_area
Observations to Task_7:
The model predicts home sale prices in Ames using three predictor variables: lot_frontage, lot_area, and garage_area. These variables likely represent important property characteristics influencing market value.
The output from the R summary() function indicates all three predictors are highly statistically significant (p < 0.001) with t-values exceeding 6. This signals they provide useful information to the model.
The R-squared of 0.4368 and adjusted R-squared of 0.4362 suggest that together the predictors explain approximately 43.7% of the variation in sale prices. The minimal difference between the two values implies the model should generalize reasonably well to new data.
Multicollinearity does not seem problematic as evidenced by the large t-values and relatively small standard errors for the coefficients.
The extracted model coefficients show that lot_frontage and garage_area have the largest positive relationships with sale price. In equation form, a one-unit increase in lot_frontage or garage_area is associated with an increase of approximately 398 or 216 dollars in predicted sale price, respectively, holding the other variables constant.
Overall, the model provides a reasonable fit but could likely be improved by incorporating more predictive variables. Further refinement may increase the R-squared and decrease the residual error.
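To probe the generalization point made above, the model could also be evaluated on held-out data; the sketch below is an illustrative check rather than part of the original analysis, and the 70/30 split and random seed are arbitrary choices.
# A minimal sketch: refit on a 70% training split and measure error on the rest.
set.seed(123)
train_idx  <- sample(seq_len(nrow(numeric_data)), size = floor(0.7 * nrow(numeric_data)))
train_data <- numeric_data[train_idx, ]
test_data  <- numeric_data[-train_idx, ]
model_train <- lm(saleprice ~ lot_frontage + lot_area + garage_area, data = train_data)
pred <- predict(model_train, newdata = test_data)
# Root mean squared error on unseen data, in dollars
sqrt(mean((test_data$saleprice - pred)^2))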
8. Report the model in equation form and interpret each coefficient of the model in the context of this problem.
Formula Used:
Sales Price = b0 + b1* lot frontage + b2* lot area + b3* garage area
Where:
b0 is the intercept.
b1 is the coefficient for lot_frontage.
b2 is the coefficient for lot_area.
b3 is the coefficient for garage_area.
Intercept (b0 = 40270.39): The intercept represents the estimated SalePrice when all the independent variables (lot_frontage, lot_area, and garage_area) are zero. In this context it has little practical meaning, because a property with zero lot frontage, lot area, and garage area is unrealistic.
Coefficient for lot_frontage (b1 = 397.58): This coefficient represents the change in SalePrice for a one-unit increase in lot_frontage, holding the other variables constant; each additional unit of lot_frontage is estimated to increase SalePrice by roughly 398 dollars.
Coefficient for lot_area (b2 = 1.05): This coefficient represents the change in SalePrice for a one-unit increase in lot_area, holding the other variables constant; each additional unit of lot_area is estimated to increase SalePrice by roughly 1.05 dollars.
Coefficient for garage_area (b3 = 216.41): This coefficient represents the change in SalePrice for a one-unit increase in garage_area, holding the other variables constant; each additional unit of garage_area is estimated to increase SalePrice by roughly 216 dollars.
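To make the equation concrete, the fitted model can be used to score a hypothetical property; the input values below are invented purely for illustration.
# A minimal sketch: predict the sale price of a hypothetical property
# (the three input values are invented for illustration only).
new_home <- data.frame(lot_frontage = 80,
                       lot_area     = 10000,
                       garage_area  = 500)
predict(model, newdata = new_home)                          # point estimate
predict(model, newdata = new_home, interval = "prediction") # with 95% prediction interval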
9. Use the plot() function to plot your regression model. Interpret the four graphs that are produced.
# Plot the regression model diagnostics
par(mfrow=c(2,2))
plot(model)
Observations to Task_9:
These four plots are common diagnostic plots used to evaluate the assumptions of a linear regression model.
The residuals versus fitted values plot shows the errors scattered against the predicted values. We want to see a random pattern, but here the residuals seem to grow as the fitted values increase, implying the variance may not be constant across the range.
The normal Q-Q plot checks if the residuals follow a normal distribution, a requirement for regression. Most points align with the reference line, but some deviation at the extremes indicates potential outliers or heavy-tailed residuals.
The scale-location plot is similar to the first, assessing constancy of variance. Again, the increasing spread of residuals across fitted values points to possible heteroscedasticity.
The residuals versus leverage plot identifies highly influential observations. Points outside the Cook’s distance lines are leveraged and could skew the model. A few cases here appear potentially problematic for the overall fit.
Overall, while the model captures the central tendency of the data, the diagnostic plots reveal areas of concern: non-constant variance (heteroscedasticity), departures from normality in the residuals, and the presence of influential points. Depending on the context and severity of these issues, addressing them might involve transforming the dependent variable, using robust regression methods, or removing influential observations, which could improve the model's performance and reliability.
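One of the remedies mentioned above, transforming the dependent variable, could be sketched as follows; this refit is illustrative and not part of the original analysis.
# A minimal sketch: refit the model on log(saleprice), which often stabilizes
# the variance of right-skewed price data, then re-check the diagnostics.
model_log <- lm(log(saleprice) ~ lot_frontage + lot_area + garage_area,
                data = numeric_data)
par(mfrow = c(2, 2))
plot(model_log)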
10. Check your model for multicollinearity and report your findings. What steps would you take to correct multicollinearity if it exists?
# Calculate Variance Inflation Factor (VIF)
vif_result <- vif(model)
print(vif_result)
## lot_frontage lot_area garage_area
## 1.256838 1.166333 1.140594
Observations to Task_10:
The Variance Inflation Factor (VIF) assesses multicollinearity in a regression model, with values below 5 generally considered acceptable. For lot_frontage, lot_area, and garage_area, the VIFs of 1.26, 1.17, and 1.14 respectively indicate very low multicollinearity. Overall, these variables exhibit very little inter-variable influence, so multicollinearity is not a concern in this model.
If there is a multicollinearity we can use these steps to mitigate it:
Remove variables with high VIF values. Sometimes dropping one of the highly correlated variables can help alleviate multicollinearity.
Scale continuous variables (e.g., using standardization or normalization) to reduce the scale differences between predictors.
If multicollinearity persists, Principal Component Analysis (PCA) can be applied to transform the correlated predictors into a set of uncorrelated components, as sketched below.
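A minimal PCA sketch on the three predictors follows, assuming the imputed numeric_data used above; the predictors are centered and scaled first because they are measured on very different scales.
# A minimal sketch: derive uncorrelated principal components from the predictors.
predictors <- numeric_data[, c("lot_frontage", "lot_area", "garage_area")]
pca_fit    <- prcomp(predictors, center = TRUE, scale. = TRUE)
summary(pca_fit)        # variance explained by each component
pc_scores <- pca_fit$x  # uncorrelated component scores
# The components could then replace the raw predictors in the regression
model_pca <- lm(numeric_data$saleprice ~ pc_scores)
summary(model_pca)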
11. Check your model for outliers and report your findings. Should these observations be removed from the model?
# Extracting residuals and leverage values
residuals <- residuals(model)
leverage <- hatvalues(model)
stud_resid <- rstudent(model)
cooks_d <- cooks.distance(model)
# Combine the data with the residuals, leverage, and statistics
model_data <- model.frame(model)
outlier_data <- cbind(model_data, residuals, leverage, stud_resid, cooks_d)
# Create plots to visualize outliers
# Plot residuals vs. fitted values
plot(model, which = 1)
# Plot scale-location (spread of residuals vs. fitted values)
plot(model, which = 3)
# Plot Cook's distance
plot(model, which = 4)
Observations to Task_11:
Yes, the model does contain outliers.
Outliers can have a significant impact on a regression model, influencing coefficient estimation and interpretation, so they require careful consideration before removal. Their removal should be supported by domain knowledge or statistical reasoning, and it is critical to determine whether an outlier reflects genuine variability in the data or a systematic problem such as a data-entry error.
The residuals versus fitted values plot shows the errors spreading outwards as the predicted values get larger. This fanning pattern hints at heteroscedasticity - the error variance changes across the range of predictors, violating a regression assumption.
A few distinct points detach from the main cluster, like observations 1768 and 1499. These potential outliers could substantially impact model fit if they represent anomalies and not just natural variability.
The red line depicting the residual average trend isn’t perfectly horizontal, implying the relationship between predictors and outcome may not be strictly linear. The model may be missing some nonlinear associations between the independent and dependent variables.
In essence, the diagnostic plots signal concerns like heteroscedasticity, outliers, and nonlinearity that could undermine model assumptions. Addressing these issues through things like data cleaning, transformations, or model revisions could greatly improve the reliability and interpretability of the regression results.
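Building on the Cook's distance values computed above, potentially influential observations can be flagged explicitly; the 4/n cutoff below is a common rule of thumb, used here purely for illustration.
# A minimal sketch: flag observations whose Cook's distance exceeds the
# common 4/n rule of thumb (an illustrative cutoff, not a hard rule).
cutoff      <- 4 / nrow(model_data)
influential <- which(cooks_d > cutoff)
length(influential)  # how many observations are flagged
head(outlier_data[influential, c("saleprice", "residuals", "leverage", "cooks_d")])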
Conclusion
In this analysis, we conducted an in-depth examination of the Ames, Iowa housing dataset, with the goals of understanding key drivers of home sale prices and developing a predictive pricing model. Through exploratory analysis and correlation studies, we gained insight into relationships between sale price and property attributes like lot size, living area, and garage space.
Our multiple linear regression model revealed lot frontage, lot area, and garage area as useful predictors, explaining approximately 44% of variation in sale prices.
In summary, our methodology enabled tangible progress toward the goals outlined. The workflow, from initial data wrangling to regression diagnostics, serves as a practical example of an end-to-end analytics process, and the skills applied here (exploratory analysis, correlation mapping, and model development) are broadly transferable.
References
Appendix
This report was generated from the R Markdown file Week_1_ALY6015_Ansari.Rmd.