Go to Section:
1. Data Source
2. Data Pre-processing
3. Pattern Discovery
4. Shiny App


Data Source

https://data.ct.gov/Housing-and-Development/Real-Estate-Sales-By-Town-for-2011-2012-2013/8udc-aepg

Summary: The Office of Policy and Management maintains a listing of all real estate sales with a sales price of $2,000 or greater that occur between October 1 and September 30 of each year. For each sale record, the file includes: town, property address, date of sale, property type (residential, apartment, commercial, industrial or vacant land), sales price, and property assessment.

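library(RSocrata)
# Socrata endpoint for the Connecticut real estate sales data set
url <- "https://data.ct.gov/Housing-and-Development/Real-Estate-Sales-By-Town-for-2011-2012-2013/8udc-aepg"
# download the full data set as a data frame
real_estate <- read.socrata(url)
# cache a local copy so the download does not have to be repeated
save(real_estate, file = "real_estate.RData")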
str(real_estate)
## 'data.frame':    106981 obs. of  13 variables:
##  $ Name             : chr  "Wilton" "Guilford" "Bridgeport" "Tolland" ...
##  $ SerialNbr        : int  11245 110276 110038 120001 110001 11002 110005 110010 110 111 ...
##  $ ListYear         : int  2011 2011 2011 2012 2011 2011 2011 2011 2011 2011 ...
##  $ DateRecorded     : POSIXlt, format: "2011-05-23" "2011-08-23" ...
##  $ AssessedValue    : chr  "$552090.00" "$208800.00" "$2182650.00" "$205495.00" ...
##  $ SalePrice        : chr  "$591500.00" "$265000.00" "$72500.00" "$336000.00" ...
##  $ AdditionalRemarks: int  0 0 NA 0 NA NA NA NA NA NA ...
##  $ SalesRatio       : num  0.933 0.788 30.106 0.612 0.727 ...
##  $ NonUseCode       : int  25 14 25 28 NA 25 NA 14 24 24 ...
##  $ ResidentialType  : chr  "1" "1" "" "1" ...
##  $ ResidentialUnits : int  1 1 0 1 1 1 1 1 0 0 ...
##  $ Address          : chr  "25 BUCKINGHAM RIDGE RD" "5 ROSEMARY LN" "20 HADDON ST UNIT 2" "1235 TOLLAND STAGE ROAD" ...
##  $ Location         : chr  "25 BUCKINGHAM RIDGE RD\nWilton, CT\n(41.19885403200004, -73.41298888899996)" "5 ROSEMARY LN\nGuilford, CT\n(41.27730713900007, -72.67815227299997)" "20 HADDON ST UNIT 2\nBridgeport, CT\n(41.15726416200005, -73.22493095999994)" "1235 TOLLAND STAGE ROAD\nTolland, CT\n(41.88591704400005, -72.33325819399994)" ...

Data Pre-processing

# remove the leading $ from the currency strings
real_estate$AssessedValue <- gsub("^[[:punct:]]", "", real_estate$AssessedValue)
real_estate$SalePrice <- gsub("^[[:punct:]]", "", real_estate$SalePrice)
# change columns AssessedValue and SalePrice from chr to numeric
real_estate$AssessedValue2 <- as.numeric(real_estate$AssessedValue)
real_estate$SalePrice2 <- as.numeric(real_estate$SalePrice)
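
The cleaning step above can also be written as a small reusable helper. The following is a minimal sketch, not the method used in this analysis: the function name `parse_currency` and the sample strings are hypothetical, and the handling of thousands separators is an assumption, since the `str()` output above shows no commas in the data.

```r
# Hypothetical helper: convert strings such as "$552090.00" to numeric
parse_currency <- function(x) {
  cleaned <- gsub("[$,]", "", x)         # drop dollar signs and thousands separators
  suppressWarnings(as.numeric(cleaned))  # values that cannot be parsed become NA
}

# Made-up sample values, including an empty string and a non-numeric entry
prices <- c("$552090.00", "$2,182,650.00", "", "n/a")
parsed <- parse_currency(prices)
parsed              # two parsed values followed by two NAs
sum(is.na(parsed))  # number of values that failed to parse
```

Counting the NA values after conversion is a quick way to check that no sale records were silently lost during the type change.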

Pattern Discovery

Summary: Data mining is the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns from large amounts of data. Pattern discovery is a step in the KDD (knowledge discovery in databases) process and includes supervised and unsupervised learning. In supervised learning, the labels, that is, the values of the dependent (response) variable, are known in a training data set. When the labels are continuous, such as the SalePrice of real estate, the task is prediction (regression). In unsupervised learning the labels are unknown, and the goal is to group observations by their similarities, for example by clustering them around group means.
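
To make the distinction concrete, here is a minimal sketch on simulated data (the data and variable names are illustrative assumptions, not part of the real estate data set). The linear model uses a known continuous label, while k-means receives only the inputs and has to find the groups on its own.

```r
set.seed(7)
# Simulated input with two well-separated groups
x <- c(rnorm(50, mean = 0), rnorm(50, mean = 10))
# Known continuous label: this is what makes the problem supervised
y <- 3 * x + rnorm(100)

# Supervised: learn the relationship between x and the known label y
sup <- lm(y ~ x)
coef(sup)

# Unsupervised: no labels, group observations around two cluster means
unsup <- kmeans(matrix(x, ncol = 1), centers = 2, nstart = 10)
table(unsup$cluster)
```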

Prediction: A reasonably strong correlation between the independent and dependent variables of interest is desirable. My assumption is that assessed value will be a good predictor of sale price in this data set. I also assume that town (the Name variable) will be a strong indicator of assessed value and sale price, because affluence varies widely across Connecticut. As seen in the fitted model below, the significant Name dummy variables (Darien, Westport, Greenwich, New Canaan, Stamford, and Wilton) are among the wealthiest towns in Connecticut and the US. So the effect is not really about the town name itself but about the makeup of the town's residents (and their assets), which is what affects real estate prices in Connecticut. This also underscores the state of the housing market and its revival from the recent recession.

# correlations
cor(real_estate$SalePrice2, real_estate$AssessedValue2) # moderate positive correlation
## [1] 0.4199754
# scatterplot 
library(ggplot2)
s <- ggplot(real_estate, aes(x = AssessedValue2, y = SalePrice2)) + 
            geom_point(aes(color = factor(ListYear), size = SalePrice2))
s # observable positive correlation
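
Sale prices are heavily right-skewed, and a handful of extreme sales can pull the Pearson correlation down. One way to check this is to compare it with a rank-based and a log-scale correlation. The sketch below uses simulated values (all numbers are assumptions for illustration, not taken from the data set).

```r
set.seed(42)
# Simulated right-skewed assessed values and sale prices
assessed <- rlnorm(1000, meanlog = 12, sdlog = 0.8)
sale <- assessed * rlnorm(1000, meanlog = 0.3, sdlog = 0.4)
sale[1:3] <- sale[1:3] * 200  # inject a few extreme sales

pearson_raw <- cor(sale, assessed)                        # sensitive to outliers
pearson_log <- cor(log10(sale), log10(assessed))          # correlation on the log scale
spearman    <- cor(sale, assessed, method = "spearman")   # rank-based
c(raw = pearson_raw, log = pearson_log, spearman = spearman)
```

If the rank-based or log-scale values are clearly higher than the raw Pearson value, a log transformation of both price columns may be worth trying before fitting a model.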

# Linear Model Building
fit = lm(SalePrice2 ~ AssessedValue2 + Name, data = real_estate)
summary(fit) # low p-values for the overall model and AssessedValue2; only a few town coefficients are significant
## 
## Call:
## lm(formula = SalePrice2 ~ AssessedValue2 + Name, data = real_estate)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -20868191   -120959    -42785     38750 139728634 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.544e+05  2.512e+05   0.615 0.538838    
## AssessedValue2        3.582e-01  2.581e-03 138.785  < 2e-16 ***
## NameAnsonia          -3.401e+04  2.719e+05  -0.125 0.900468    
## NameAshford          -4.716e+04  3.249e+05  -0.145 0.884585    
## NameAvon              1.902e+05  4.057e+05   0.469 0.639139    
## NameBarkhamsted      -1.653e+04  3.004e+05  -0.055 0.956123    
## NameBeacon Falls      2.250e+03  2.737e+05   0.008 0.993442    
## NameBerlin            3.164e+04  2.587e+05   0.122 0.902648    
## NameBethany           7.202e+04  2.826e+05   0.255 0.798861    
## NameBethel            1.277e+05  2.599e+05   0.491 0.623282    
## NameBethlehem         5.636e+04  3.105e+05   0.181 0.855991    
## NameBloomfield        4.118e+04  2.570e+05   0.160 0.872701    
## NameBolton            1.809e+04  2.717e+05   0.067 0.946920    
## NameBozrah           -3.329e+04  3.164e+05  -0.105 0.916218    
## NameBranford          5.935e+04  2.545e+05   0.233 0.815633    
## NameBridgeport       -1.346e+05  2.522e+05  -0.534 0.593537    
## NameBridgewater       2.114e+05  3.174e+05   0.666 0.505338    
## NameBristol           5.926e+03  2.545e+05   0.023 0.981425    
## NameBrookfield        1.305e+05  2.583e+05   0.505 0.613293    
## NameBrooklyn         -7.268e+03  2.636e+05  -0.028 0.978007    
## NameBurlington        8.165e+04  2.671e+05   0.306 0.759839    
## NameCanaan            1.241e+04  3.485e+05   0.036 0.971597    
## NameCanterbury       -4.786e+04  2.758e+05  -0.174 0.862216    
## NameCanton            8.046e+04  2.633e+05   0.306 0.759897    
## NameChaplin          -5.360e+04  3.485e+05  -0.154 0.877769    
## NameCheshire          1.204e+05  2.573e+05   0.468 0.639790    
## NameChester           3.555e+04  2.910e+05   0.122 0.902792    
## NameClinton           5.807e+04  2.585e+05   0.225 0.822264    
## NameColchester        1.586e+04  2.623e+05   0.060 0.951779    
## NameColebrook        -2.439e+04  3.129e+05  -0.078 0.937864    
## NameColumbia          1.794e+04  2.782e+05   0.064 0.948595    
## NameCornwall          1.101e+05  3.286e+05   0.335 0.737513    
## NameCoventry         -1.834e+04  2.585e+05  -0.071 0.943454    
## NameCromwell          7.973e+04  2.615e+05   0.305 0.760393    
## NameDanbury           1.042e+05  2.538e+05   0.411 0.681385    
## NameDarien            1.069e+06  2.574e+05   4.155 3.26e-05 ***
## NameDeep River        2.864e+04  2.733e+05   0.105 0.916532    
## NameDerby            -4.631e+04  2.670e+05  -0.173 0.862328    
## NameDurham            4.173e+04  2.674e+05   0.156 0.875996    
## NameEast Granby       2.368e+04  2.776e+05   0.085 0.932036    
## NameEast Haddam       4.385e+03  2.677e+05   0.016 0.986930    
## NameEast Hampton      8.616e+03  2.584e+05   0.033 0.973403    
## NameEast Hartford    -5.150e+04  2.550e+05  -0.202 0.839945    
## NameEast Haven       -3.458e+04  2.577e+05  -0.134 0.893254    
## NameEast Lyme         6.365e+04  2.575e+05   0.247 0.804724    
## NameEast Windsor      2.879e+04  2.642e+05   0.109 0.913221    
## NameEastford         -3.481e+04  3.955e+05  -0.088 0.929869    
## NameEaston            2.944e+05  2.650e+05   1.111 0.266539    
## NameEllington         5.775e+04  2.581e+05   0.224 0.822912    
## NameEnfield           4.369e+03  2.556e+05   0.017 0.986361    
## NameEssex             1.840e+05  2.639e+05   0.697 0.485640    
## NameFairfield         3.863e+05  2.528e+05   1.528 0.126517    
## NameFarmington        1.843e+05  2.557e+05   0.721 0.471108    
## NameFranklin         -1.778e+04  3.374e+05  -0.053 0.957975    
## NameGlastonbury       1.373e+05  2.538e+05   0.541 0.588655    
## NameGoshen            6.060e+04  2.835e+05   0.214 0.830745    
## NameGranby            4.595e+04  2.656e+05   0.173 0.862643    
## NameGreenwich         8.420e+05  2.528e+05   3.331 0.000866 ***
## NameGriswold         -4.745e+04  2.648e+05  -0.179 0.857816    
## NameGroton            7.730e+04  2.559e+05   0.302 0.762592    
## NameGuilford          1.271e+05  2.581e+05   0.492 0.622399    
## NameHaddam            2.588e+04  2.637e+05   0.098 0.921833    
## NameHamden            1.655e+04  2.532e+05   0.065 0.947883    
## NameHampton          -2.019e+04  3.299e+05  -0.061 0.951197    
## NameHartford          7.177e+04  2.543e+05   0.282 0.777820    
## NameHartland         -2.131e+04  3.164e+05  -0.067 0.946317    
## NameHarwinton         5.335e+03  2.854e+05   0.019 0.985088    
## NameHebron            3.180e+04  2.690e+05   0.118 0.905905    
## NameKent              1.344e+05  2.879e+05   0.467 0.640603    
## NameKillingly        -5.522e+04  2.590e+05  -0.213 0.831177    
## NameKillingworth      7.593e+04  2.745e+05   0.277 0.782103    
## NameLebanon          -2.750e+04  2.727e+05  -0.101 0.919670    
## NameLedyard           3.762e+03  2.586e+05   0.015 0.988395    
## NameLisbon           -5.612e+04  2.879e+05  -0.195 0.845422    
## NameLitchfield        1.069e+05  2.709e+05   0.395 0.693180    
## NameLyme              1.398e+05  3.015e+05   0.464 0.642796    
## NameMadison           1.942e+05  2.583e+05   0.752 0.452239    
## NameManchester        1.046e+04  2.546e+05   0.041 0.967238    
## NameMansfield         1.304e+05  2.587e+05   0.504 0.614199    
## NameMarlborough       4.101e+04  2.757e+05   0.149 0.881737    
## NameMeriden          -2.243e+04  2.537e+05  -0.088 0.929553    
## NameMiddlebury        7.232e+04  2.681e+05   0.270 0.787368    
## NameMiddlefield       2.805e+04  2.888e+05   0.097 0.922630    
## NameMiddletown        2.936e+04  2.555e+05   0.115 0.908512    
## NameMilford           9.881e+04  2.540e+05   0.389 0.697265    
## NameMonroe            9.591e+04  2.573e+05   0.373 0.709321    
## NameMontville         1.577e+03  2.606e+05   0.006 0.995170    
## NameMorris            3.874e+04  2.907e+05   0.133 0.893991    
## NameNaugatuck         2.372e+04  2.573e+05   0.092 0.926534    
## NameNew Britain      -3.604e+04  2.541e+05  -0.142 0.887216    
## NameNew Canaan        8.072e+05  2.568e+05   3.143 0.001674 ** 
## NameNew Fairfield     9.014e+04  2.587e+05   0.348 0.727499    
## NameNew Hartford      3.883e+04  2.754e+05   0.141 0.887887    
## NameNew Haven         4.495e+04  2.531e+05   0.178 0.859027    
## NameNew London       -6.378e+04  2.595e+05  -0.246 0.805820    
## NameNew Milford       6.190e+04  2.552e+05   0.243 0.808393    
## NameNewington         3.569e+04  2.568e+05   0.139 0.889465    
## NameNewtown           1.388e+05  2.574e+05   0.539 0.589894    
## NameNorfolk           7.569e+04  3.215e+05   0.235 0.813849    
## NameNorth Branford    3.445e+04  2.609e+05   0.132 0.894947    
## NameNorth Canaan     -5.981e+04  2.989e+05  -0.200 0.841388    
## NameNorth Haven       8.393e+04  2.561e+05   0.328 0.743110    
## NameNorth Stonington  7.981e+03  2.837e+05   0.028 0.977560    
## NameNorwalk           1.902e+05  2.534e+05   0.751 0.452720    
## NameNorwich          -6.708e+04  2.561e+05  -0.262 0.793333    
## NameOld Lyme          9.376e+04  2.600e+05   0.361 0.718346    
## NameOld Saybrook      9.603e+04  2.610e+05   0.368 0.712897    
## NameOrange            1.618e+05  2.640e+05   0.613 0.540070    
## NameOxford            8.356e+04  2.597e+05   0.322 0.747663    
## NamePlainfield       -2.327e+04  2.634e+05  -0.088 0.929625    
## NamePlainville       -9.316e+03  2.595e+05  -0.036 0.971363    
## NamePlymouth         -4.100e+04  2.645e+05  -0.155 0.876797    
## NamePomfret          -1.540e+04  2.784e+05  -0.055 0.955886    
## NamePortland          1.051e+04  2.614e+05   0.040 0.967939    
## NamePreston          -1.015e+05  2.840e+05  -0.357 0.720767    
## NameProspect          4.383e+04  2.702e+05   0.162 0.871155    
## NamePutnam           -4.652e+04  2.619e+05  -0.178 0.859010    
## NameRedding           2.326e+05  2.710e+05   0.858 0.390811    
## NameRidgefield        3.437e+05  2.572e+05   1.336 0.181493    
## NameRocky Hill        1.792e+05  2.601e+05   0.689 0.490763    
## NameRoxbury           3.535e+05  2.999e+05   1.179 0.238498    
## NameSalem            -1.228e+04  2.826e+05  -0.043 0.965349    
## NameSalisbury         2.386e+05  2.690e+05   0.887 0.374994    
## NameScotland         -2.925e+04  3.552e+05  -0.082 0.934368    
## NameSeymour           2.835e+03  2.611e+05   0.011 0.991334    
## NameSharon            1.308e+05  2.826e+05   0.463 0.643601    
## NameShelton           1.082e+05  2.552e+05   0.424 0.671488    
## NameSherman           1.405e+05  2.824e+05   0.497 0.618956    
## NameSimsbury          8.446e+04  2.576e+05   0.328 0.743010    
## NameSomers            4.481e+04  2.633e+05   0.170 0.864847    
## NameSouth Windsor     5.196e+04  2.572e+05   0.202 0.839892    
## NameSouthbury         5.391e+03  2.569e+05   0.021 0.983255    
## NameSouthington       4.820e+04  2.540e+05   0.190 0.849494    
## NameSprague          -6.508e+04  3.299e+05  -0.197 0.843626    
## NameStafford         -1.563e+05  2.625e+05  -0.596 0.551441    
## NameStamford          6.149e+05  2.528e+05   2.432 0.014998 *  
## NameSterling         -7.505e+04  2.865e+05  -0.262 0.793337    
## NameStonington        1.265e+05  2.584e+05   0.489 0.624569    
## NameStratford        -1.837e+05  2.532e+05  -0.726 0.468119    
## NameSuffield          7.898e+04  2.621e+05   0.301 0.763128    
## NameThomaston        -1.790e+04  2.713e+05  -0.066 0.947390    
## NameThompson         -5.144e+04  2.618e+05  -0.196 0.844235    
## NameTolland           3.882e+04  2.589e+05   0.150 0.880830    
## NameTorrington       -6.794e+04  2.541e+05  -0.267 0.789220    
## NameTrumbull          1.332e+05  2.556e+05   0.521 0.602143    
## NameUnion             2.805e+04  4.113e+05   0.068 0.945637    
## NameVernon           -4.014e+03  2.567e+05  -0.016 0.987524    
## NameVoluntown        -2.624e+04  2.948e+05  -0.089 0.929078    
## NameWallingford       3.294e+04  2.538e+05   0.130 0.896746    
## NameWarren            4.466e+05  3.391e+05   1.317 0.187733    
## NameWashington        4.276e+05  2.779e+05   1.539 0.123924    
## NameWaterbury        -4.968e+04  2.529e+05  -0.196 0.844262    
## NameWaterford         2.262e+04  2.580e+05   0.088 0.930113    
## NameWatertown        -9.515e+02  2.597e+05  -0.004 0.997077    
## NameWest Hartford     1.636e+05  2.535e+05   0.645 0.518807    
## NameWest Haven       -4.802e+04  2.544e+05  -0.189 0.850255    
## NameWestbrook         6.500e+04  2.669e+05   0.244 0.807612    
## NameWeston            4.253e+05  2.645e+05   1.608 0.107874    
## NameWestport          1.037e+06  2.539e+05   4.083 4.44e-05 ***
## NameWethersfield      1.756e+04  2.576e+05   0.068 0.945646    
## NameWillington        1.307e+05  2.894e+05   0.452 0.651565    
## NameWilton            5.452e+05  2.591e+05   2.104 0.035365 *  
## NameWinchester       -4.460e+04  2.690e+05  -0.166 0.868295    
## NameWindham          -8.117e+04  2.616e+05  -0.310 0.756303    
## NameWindsor           9.588e+04  2.575e+05   0.372 0.709611    
## NameWindsor Locks     9.214e+03  2.606e+05   0.035 0.971794    
## NameWolcott          -7.718e+03  2.636e+05  -0.029 0.976644    
## NameWoodbridge        1.429e+05  2.628e+05   0.544 0.586776    
## NameWoodbury          4.032e+04  2.678e+05   0.151 0.880313    
## NameWoodstock         1.317e+03  2.659e+05   0.005 0.996049    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1528000 on 106811 degrees of freedom
## Multiple R-squared:  0.1969, Adjusted R-squared:  0.1956 
## F-statistic: 154.9 on 169 and 106811 DF,  p-value: < 2.2e-16
# the model could definitely be improved with other factors not in this data set (R-squared is only about 0.20)
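
With roughly 170 coefficients, scanning the summary by eye for significant towns is tedious. A minimal sketch of filtering the coefficient table by p-value follows. It runs on simulated toy data (the `toy` data frame and its values are assumptions for illustration), but the same steps should apply to `fit`.

```r
# Toy data standing in for real_estate (simulated, not from the source)
set.seed(1)
toy <- data.frame(
  Name = rep(c("A", "B", "C"), each = 50),
  AssessedValue2 = runif(150, 1e5, 5e5)
)
# Town "C" gets a price premium on top of the assessed-value effect
toy$SalePrice2 <- 1.4 * toy$AssessedValue2 +
  ifelse(toy$Name == "C", 2e5, 0) + rnorm(150, sd = 2e4)

toy_fit <- lm(SalePrice2 ~ AssessedValue2 + Name, data = toy)

# coef(summary(...)) returns a matrix with a "Pr(>|t|)" column
coefs <- as.data.frame(coef(summary(toy_fit)))
sig <- coefs[coefs[["Pr(>|t|)"]] < 0.05, ]
sig[order(sig[["Pr(>|t|)"]]), ]  # significant terms, smallest p-value first
```

Replacing `toy_fit` with `fit` would list only the terms that pass the 0.05 threshold, which makes it easier to see which towns stand out.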