Executive Summary

The overall goal of this project is to:

  1. Devise a segmentation of malls based on available data. That is, find natural groups in the mall data that show which malls are “like” each other: what makes a mall similar to other malls and what makes malls different.
  2. Build a model to predict which malls will have a J.Crew store.
  3. Build a model to predict which malls will have a Lane Bryant.

This report documents the exploratory data analysis and cleaning of the provided Center Data file. The various attributes are examined individually and in relation to one another. The malls are plotted on a geographical map to illustrate their locations. The Gross Leasable Area and the mall type (open or enclosed) are analyzed and compared to other factors to examine relationships. The opening dates are used to look for trends in the attributes over time. The anchor stores are analyzed by the number of vacancies and the occurrences of stores. The report ends with a projected course of action for further analysis of the remaining files and the intended methods to achieve the project goals specified above.

Exploratory Analysis of the Center Data set

Files for analysis were provided in a zip file via email. The files were extracted from the zip file and placed into the working directory for this session. The archive also contained a folder named “MACOSX”, intended for Mac operating systems. Because the current session is running on a PC, the MACOSX folder was removed from the directory before analysis.

##  [1] "Center Data- Product A - (US_CAN).csv"            
##  [2] "Demographic Data.csv"                             
##  [3] "Demographic Metadata.csv"                         
##  [4] "DMM Store Data - US and CAN.csv"                  
##  [5] "Field descriptions - CENTER DATA.csv"             
##  [6] "Field descriptions - STORE DATA.csv"              
##  [7] "Look-up table - MSA-CMSA (metro market codes).csv"
##  [8] "Look-up table -center_Classification.csv"         
##  [9] "Look-up table -designType.csv"                    
## [10] "Look-up table -shape.csv"                         
## [11] "Look-up table -storeTypes.csv"

Eleven files are available in CSV format. Each of these files was loaded into the R environment.
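The loading step can be sketched as follows. This is a minimal illustration rather than the code used for this report: the helper name `load_csv_dir`, its `skip` argument, and the toy files are hypothetical, and the sketch assumes every CSV can be read with the default `read.csv` settings.

```r
# Hypothetical helper: read every CSV under a directory into a named list,
# skipping anything inside a folder whose path contains `skip`.
load_csv_dir <- function(dir = ".", skip = "MACOSX") {
  rel <- list.files(dir, pattern = "\\.csv$", recursive = TRUE)
  rel <- rel[!grepl(skip, rel, fixed = TRUE)]
  tables <- lapply(file.path(dir, rel), read.csv, stringsAsFactors = FALSE)
  names(tables) <- basename(rel)
  tables
}

# Toy demonstration in a temporary directory (not the project files).
demo_dir <- tempfile("centers")
dir.create(demo_dir)
dir.create(file.path(demo_dir, "MACOSX"))
toy <- data.frame(code = 1:3, name = c("A", "B", "A"))
write.csv(toy, file.path(demo_dir, "toy.csv"), row.names = FALSE)
write.csv(toy, file.path(demo_dir, "MACOSX", "ignored.csv"), row.names = FALSE)

tables <- load_csv_dir(demo_dir)
print(names(tables))
```

Keeping the tables in a list keyed by file name makes it easy to refer to each file by the name shown in the directory listing.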

Preliminary examination of the 11 files shows three main data collections - Center Data (or Mall Data), Store Data, and Demographic Data. Three files provide Metadata about the Center, Store, and Demographic Data files, respectively.

The remaining five files provide the details from various lookup codes for MSA, the Center Class, the Center Design Type, the Center Shape, and the Store Types.

There are 8,002 malls represented in the Center Data. Malls are identified by a unique code and the name of the mall.

## [1] "There are 8002 unique mall codes in the dataset. There are 7515 unique mall names in the dataset. This means that there are 487 duplicate mall names."

It will be important to keep in mind that not all malls can be uniquely identified by their name.
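A check of this kind can be sketched in base R. The column names `code` and `name` and the toy values are assumptions made for illustration; the actual field names in the Center Data file may differ.

```r
# Toy data with hypothetical column names; two malls share a name.
malls <- data.frame(
  code = c(101, 102, 103, 104),
  name = c("Town Center", "Plaza", "Town Center", "Galleria"),
  stringsAsFactors = FALSE
)

n_codes <- length(unique(malls$code))   # distinct mall codes
n_names <- length(unique(malls$name))   # distinct mall names
n_shared <- n_codes - n_names           # names that repeat an earlier name

# Names that occur more than once, for manual inspection.
dup_names <- unique(malls$name[duplicated(malls$name)])
print(dup_names)
```

Because names repeat, any join between files is safer on the mall code than on the mall name.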

The data also provide addresses, including geographic coordinates, of each mall. The map below illustrates the locations of the malls included in the data:

The map shows that a majority of the malls are in the United States.

## [1] "7291 (approximately 91 percent) of the malls are in the United States, representing 52 United States territories (all 50 states, the District of Columbia, and Puerto Rico (Note that Hawaii and much of Alaska, which are not on the above map, provide another 41 malls)."

The ten US territories with the most malls are:

## 
##  CA  TX  FL  PA  NY  OH  IL  NJ  VA  GA 
## 781 629 561 383 325 323 304 284 269 265

As illustrated in the above map, the densest concentration of malls is on the East Coast. Malls become sparse toward the western Midwest and through the West until California, which has the largest number of malls of any territory in the United States or Canada.

The Canadian malls are a much smaller subset of the data.

## [1] "711 (approximately 9 percent) of the malls are in Canada, representing 10 Canadian territories."

The distribution of malls across the ten Canadian territories is:

## 
##  ON  QC  AB  BC  MB  NS  SK  NB  NF  PE 
## 298 137  86  77  34  26  19  18  12   4

Along with geographic data, the Center Data offers several descriptions of the malls, including Gross Leasable Area (GLA), the type of mall (open or enclosed), the year the mall was opened, and the anchor stores in the mall.

The Gross Leasable Area shows the square footage of the malls available to lease.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   28000  238300  330000  439500  514200 4200000
## [1] 320416.6

The distribution is positively skewed: the mean (439,500) is well above the median (330,000), and the maximum (4,200,000) lies far beyond the third quartile. There are many smaller malls, although the counts spike sporadically at certain sizes. These spikes suggest there may be something profitable about specific size ranges, which may prove useful in determining mall types, and they point to non-uniform marginal returns from Gross Leasable Area. While it may not be beneficial to add enough space for a few small retailers, it may be profitable to add extensive amounts of space for several large anchor stores.
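One way to quantify the skew is to compare the mean with the median and to compute a moment-based skewness coefficient. The sketch below uses made-up values and a hypothetical helper named `skewness`; it is not the calculation behind the figures above.

```r
# Moment-based sample skewness: third central moment divided by the
# 1.5 power of the second central moment. Positive values indicate a
# longer right tail.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Made-up Gross Leasable Area values (square feet) with one very large mall.
gla <- c(28000, 150000, 240000, 330000, 510000, 900000, 4200000)

print(c(mean = mean(gla), median = median(gla)))
print(skewness(gla))
```

A mean above the median together with a positive coefficient is consistent with a right-skewed distribution.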

There are two types of malls, open (“O”) and enclosed (“E”).

##         E    O 
##   18 1766 6218
## [1] "Approximately 78 percent of the malls in the dataset are open and approximately 22 percent are enclosed. 18 are unspecified in the data."

One might expect open malls to be more common at lower latitudes, where the climate is warmer.

Contrary to that expectation, there is no apparent large difference in latitude between open and enclosed malls.

It may also be helpful to determine whether or not malls with a larger Gross Leasable Area tend to be open or enclosed:

There does not appear to be a large difference. However, the variance in Gross Leasable Area is larger for enclosed malls, which may reflect the kinds of stores that open and enclosed malls attract.
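Comparisons like these can be sketched with grouped summaries. The column names `type`, `latitude`, and `gla` and all values below are hypothetical and chosen only to show the mechanics.

```r
# Toy data: three open ("O") and three enclosed ("E") malls.
malls <- data.frame(
  type = c("O", "O", "O", "E", "E", "E"),
  latitude = c(30, 40, 45, 32, 41, 44),
  gla = c(300000, 320000, 340000, 100000, 330000, 900000),
  stringsAsFactors = FALSE
)

# Median latitude and median GLA for each mall type.
medians <- aggregate(cbind(latitude, gla) ~ type, data = malls, FUN = median)
print(medians)

# Variance of GLA within each mall type.
gla_var <- tapply(malls$gla, malls$type, var)
print(gla_var)
```

Medians are used here because a skewed variable such as Gross Leasable Area can pull the mean toward a few very large malls.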

The Center Data also provide the date the malls opened:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0    1973    1990    1872    2003    2028      91
## [1] "558 are listed as opening at year 0, and 91 are NULL. It does not seem probable that malls were opened in the United States and Canada during the Roman Era, so they will be removed during analysis of the dates. When these values are removed from the data, the descriptive statistics are more helpful."
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1904    1976    1991    1990    2004    2028
## [1] "Also to note, 32 stores in the data have not yet opened. These do not necessarily need to be removed from the dataset, as they may already have contracts with specific stores or even be built and have a structure. This will be kept in mind. The summary of the data without the future malls:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1904    1976    1991    1989    2003    2014
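The cleaning steps described in the output above can be sketched as follows. The vector `year_opened` and the cutoff `current_year` are assumptions for illustration; 2014 is used as the cutoff only because it is the maximum shown in the final summary.

```r
# Toy opening years: zeros and NA stand in for missing values,
# and 2028 stands in for a mall that has not opened yet.
year_opened <- c(0, 1973, NA, 1990, 2003, 2028, 0, 1904)
current_year <- 2014  # assumed cutoff for "already open"

n_zero <- sum(year_opened == 0, na.rm = TRUE)  # recorded as year 0
n_missing <- sum(is.na(year_opened))           # recorded as NA

# Drop the zeros and NAs, then separate out future openings.
valid <- year_opened[!is.na(year_opened) & year_opened > 0]
n_future <- sum(valid > current_year)
opened <- valid[valid <= current_year]

print(summary(valid))
print(summary(opened))
```

Filtering into new vectors leaves the original column untouched, so the future malls stay available for later modeling.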

These data make it possible to see how malls have changed over time.

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

The data do not show an increasing or decreasing trend in Gross Leasable Area over the last century.

But the data do show a potential increase in open malls over the last century. This may warrant further exploration during model construction.
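A simple way to examine a shift like this numerically is to compute the share of open malls per opening decade. The sketch below uses made-up rows and hypothetical column names (`year`, `type`); it illustrates the idea and does not reproduce the plot.

```r
# Toy data: opening year and mall type ("O" open, "E" enclosed).
malls <- data.frame(
  year = c(1965, 1968, 1985, 1988, 2005, 2008, 2009),
  type = c("E", "E", "E", "O", "O", "O", "E"),
  stringsAsFactors = FALSE
)

# Bucket each opening year into its decade (1965 -> 1960, and so on).
malls$decade <- floor(malls$year / 10) * 10

# Share of malls in each decade that are open.
share_open <- tapply(malls$type == "O", malls$decade, mean)
print(share_open)
```

The decade-level shares could later serve as a quick check on whether opening year is worth including as a model feature.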

Finally, the Center Data provide a list of anchor stores. An anchor store is a tenant that attracts customers, typically a larger store such as a department store or retail chain. Some malls have multiple anchor stores. This information is likely to play a valuable role in classifying malls: according to the International Council of Shopping Centers, anchor stores are a major component of mall classification.

## [1] 2847

The data list a total of 157,827 stores, and 2,856 unique stores. However, some of the recorded anchor stores are listed as “Vacant”. These vacancies are important for analysis because they represent space for anchor stores, even though the space is currently empty.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   4.000   3.856   5.000  20.000
## [1] 1416
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.563   2.000   8.000

There are 1,416 malls with anchor store vacancies (assuming vacancies are actually listed as “vacant” and not merely left blank or NA). Of those malls, at least half have exactly one vacancy and three quarters have two or fewer; however, some have as many as eight. Vacancies should be considered, and perhaps weighted more heavily, in the predictive model, since the business would be able to acquire the vacant space and begin preparations relatively quickly, whereas in malls without vacancies the business would have to wait until a space opened.
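Counting vacancies per mall can be sketched as below. The sketch assumes each mall's anchor stores arrive as one comma-separated string, which may not match the actual layout of the Center Data file; the vector `anchors` and its values are made up.

```r
# Toy anchor-store field: one comma-separated string per mall.
anchors <- c("Target, Vacant, Sears", "Macy's", "Vacant, Vacant", "")

# Split each string into individual store names.
anchor_list <- strsplit(anchors, ",\\s*")

# Count entries equal to "vacant" (case-insensitive) for each mall.
vacancies <- vapply(
  anchor_list,
  function(s) sum(tolower(trimws(s)) == "vacant"),
  integer(1)
)

n_with_vacancy <- sum(vacancies > 0)
print(vacancies)
print(summary(vacancies[vacancies > 0]))
```

The per-mall count could be kept as a numeric feature, so that a model can weight malls by how many anchor spaces are free.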

Some of the anchor stores are listed as “(closing)”, “(closing soon)”, “(coming soon)”, “(opening soon)” or some other variation to indicate a change.

## [1] "74 anchor stores are listed as closing soon."

These stores that are closing soon should also be taken into consideration and perhaps weighted, as they represent near-future opportunities.

Other values also appear within parentheses. Each of these indicates a closing, an upcoming opening or opening date, a component of the store (e.g. “w pharmacy”), or a location within the mall (e.g. North, South), with the exception of the anchor store “DMV”, which is specified as “(WV)”. “DMV” remains a distinct anchor store once the parenthetical is removed.

In order to establish basic groups of stores existing or soon opening in the data, all of the parentheticals will be removed from the data.

## [1] "The cleaned and grouped data show there are 2724 unique anchor stores."
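The parenthetical removal can be sketched with a regular expression. The helper name `strip_parentheticals` and the example strings are hypothetical; the pattern assumes parentheses are never nested.

```r
# Remove any "(...)" group, plus the whitespace before it, then trim.
strip_parentheticals <- function(x) {
  trimws(gsub("\\s*\\([^)]*\\)", "", x))
}

stores <- c("Sears (closing)", "Sears", "Target",
            "Walmart (w pharmacy)", "Macy's (North)", "DMV (WV)")

# Count status flags before they are stripped away.
n_closing <- sum(grepl("(closing", stores, fixed = TRUE))

cleaned <- strip_parentheticals(stores)
print(cleaned)

# Frequency of each cleaned store name, most common first.
print(sort(table(cleaned), decreasing = TRUE))
```

Counting the closing and opening flags before stripping preserves that information for later weighting, while the cleaned names group variants such as “Sears (closing)” and “Sears” together.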

The most common anchor stores are:

## anchorStores
##              Target            JCPenney               Sears 
##                 996                 923                 886 
## Ross Dress For Less            PetSmart          Home Depot 
##                 704                 642                 638 
##              Macy's             Walmart              Kohl's 
##                 638                 625                 611 
##   Bed Bath & Beyond 
##                 607

The following diagram illustrates the prevalence of particular stores. More frequent stores appear in larger font. Only the top 100 stores are shown for readability.

The kinds of anchor stores could prove very valuable in the predictive algorithm. Characteristics of these stores may provide further insight into the qualities of different types of malls.

Projected course of action

The following steps will achieve the goals of this project:

  1. Further analysis of the remaining files

  2. Correlation and cluster analysis to discover relationships in the data and to establish potential groupings for different types of malls. The International Council of Shopping Centers provides classification criteria for mall types, mostly relying on the number of anchor stores. The listed mall types are similar to the Center Class data file provided. Both will inform the model and the labeling of the mall types.

  3. Based upon the relationships discovered in step 2 above, various algorithms can be tested to create potential predictive models that recommend placement of J.Crew and Lane Bryant stores. The data will need to be explored to determine which factors are important in mall selection for these stores. It may be necessary to examine public documents (if available) such as investor relations materials, quarterly earnings statements, and public relations materials to determine how to weight the criteria for selecting particular malls and mall attributes.

  4. Once a predictive model is established, the results will need to be reported via a markdown document that includes the code.

  5. A Shiny application should be created to provide interactive reporting. One potential use would allow users to enter the various attributes of a mall and would display the mall type based upon the entered criteria. Another use would allow the user to enter or select certain criteria, and the application would provide a suggestion for a J.Crew or Lane Bryant opening.
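As a rough illustration of the cluster analysis proposed in step 2, the sketch below runs k-means on standardized features. The choice of k-means, the feature names `gla` and `n_anchors`, the number of clusters, and the toy values are all assumptions made for illustration; the method actually used may differ.

```r
set.seed(1)  # k-means uses random starting centers

# Toy mall features: Gross Leasable Area and number of anchor stores.
malls <- data.frame(
  gla = c(100000, 120000, 110000, 900000, 950000, 1000000),
  n_anchors = c(1, 2, 1, 6, 7, 8)
)

# Standardize so that GLA does not dominate purely because of its scale.
features <- scale(malls)

# Two clusters, with several random restarts for stability.
fit <- kmeans(features, centers = 2, nstart = 10)
malls$cluster <- fit$cluster
print(malls)
```

In practice the number of clusters would need to be chosen from the data and compared against the mall types in the Center Class lookup file.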

Session Info:

sessionInfo()
## R version 3.1.2 (2014-10-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] mgcv_1.8-3          nlme_3.1-118        wordcloud_2.5      
## [4] RColorBrewer_1.0-5  ggplot2_1.0.0       reshape2_1.4       
## [7] RgoogleMaps_1.2.0.6
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.2-4 digest_0.6.4     evaluate_0.5.5   formatR_1.0     
##  [5] grid_3.1.2       gtable_0.1.2     htmltools_0.2.6  knitr_1.7       
##  [9] labeling_0.3     lattice_0.20-29  MASS_7.3-35      Matrix_1.1-4    
## [13] munsell_0.4.2    plyr_1.8.1       png_0.1-7        proto_0.3-10    
## [17] Rcpp_0.11.3      RJSONIO_1.3-0    rmarkdown_0.3.11 scales_0.2.4    
## [21] slam_0.1-32      stringr_0.6.2    tools_3.1.2      yaml_2.1.13