Notebook 01

Executive Summary

In this notebook we introduce the basic R functions used to build data science knowledge and experience. To do so, we work with a CompStak dataset covering the 2008 to 2015 period and use a fantastic collection of R packages called tidyverse. Roughly eighty percent of the time in data science is spent on cleaning and preparing data (Kelleher and Tierney, 2018). Moreover, this is not just a first step: data cleaning and preparation must be repeated many times over the course of an analysis as new problems come to light or new data is collected. To get a handle on the problem, this notebook focuses on a small but important aspect of data cleaning that we call data tidying: structuring datasets to facilitate analysis.

The tidyverse

The principles of tidy data provide a standard way to organize data values within a dataset. A standard makes initial data cleaning easier because you don’t need to start from scratch and reinvent the wheel every time. The tidy data standard has been designed to facilitate initial exploration and analysis of the data, and to simplify the development of data analysis tools that work well together. Current tools often require translation. You have to spend time managing the output from one tool so you can input it into another. Tidy datasets and tidy tools work hand in hand to make data analysis easier, allowing you to focus on the interesting domain problem, not on the uninteresting logistics of data.

So what is tidy data? A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.
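
To make the three rules concrete, here is a minimal sketch on a small made-up table. The rents_wide data frame and its values are invented for illustration and are not part of the CompStak data, and the sketch assumes tidyr 1.0.0 or later, where pivot_longer() is available.

```r
library(tidyr)

# Hypothetical "messy" table: one column per year, so the variable
# "year" is hidden in the column names.
rents_wide <- data.frame(
  building = c("A", "B", "C"),
  `2014`   = c(50, 62, 48),
  `2015`   = c(53, 65, 51),
  check.names = FALSE
)

# Tidy version: each variable (building, year, rent) is a column
# and each building-year observation is a row.
rents_tidy <- pivot_longer(
  rents_wide,
  cols      = c(`2014`, `2015`),
  names_to  = "year",
  values_to = "rent"
)

rents_tidy
```

The wide table has three rows and the tidy one has six, one per building and year, which is the shape that group_by(), summarise() and ggplot() expect.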

A Tidy Example - the diamonds dataset:

In data analysis more than anything, a picture really is worth a thousand words. When you start analyzing data in R, your first step shouldn’t be to run a complex statistical test: first, you should visualize your data in a graph. This lets you understand the basic nature of the data, so that you know what tests you can perform, and where you should focus your analysis. I’m David Robinson, and in this lesson we’ll introduce you to ggplot2, a powerful R package that produces data visualizations easily and intuitively. We will assume you are moderately familiar with basic concepts in R, including variables and functions, and with RStudio, the integrated development environment for programming in R.

# diamonds ships with ggplot2, so load the package first
library(ggplot2)

# example dataset
head(diamonds)

# example using ggplot2 3.0.0
dsamp <- diamonds[sample(nrow(diamonds), 1000), ]
d <- ggplot(dsamp, aes(carat, price)) +
  geom_point(aes(colour = clarity))
d + scale_colour_viridis_d()

Note that with just a few lines of simple code we were able to visualize this dataset.

Our goal in this tutorial is to apply this type of analytical technique to our real estate dataset from CompStak in order to gain basic insights into the data. In the next tutorial (notebook #2) we will look at more advanced techniques aimed at creating statistical models based on the analysis we performed here.

The Tidyverse Pipe System

The RStudio shortcut for the pipe is CTRL + SHIFT + M (CMD + SHIFT + M on a Mac). Pipes are a powerful tool for clearly expressing a sequence of operations, and we will make extensive use of them in this notebook. If you are familiar with the bash shell, you may know the pipe character, |, which sends the output of one command to the input of the next. A similar operator in R is %>%, which sends whatever is on the left side of the operator to the first argument of the function on the right side. So, these two statements are effectively the same:

    # This:
    diamonds %>% group_by(cut)
    # is the same as:
    group_by(diamonds, cut)

But here comes the really cool part! We can chain these pipes together in a string of commands, sending the output of one command directly to the next. So instead of a two-step process that first groups the data by cut and then calculates the means, we can do it all at once with pipes:

diamonds.means <- diamonds %>%
  group_by(cut) %>%
  summarize(SL.mean = mean(carat))

Let’s break apart what we just did, line by line:

diamonds.means <- diamonds %>%: We did two things here. First, we instructed R to assign the final output to the variable diamonds.means. Second, we sent the diamonds data to whatever command comes on the next line.
group_by(cut): This line is effectively the same as group_by(.data = diamonds, cut), because we sent the diamonds data to group_by through the pipe, %>%. We then sent this grouped data to the next line.
summarize(SL.mean = mean(carat)): This used the grouped data from the preceding group_by command to calculate the mean carat value for each cut type.
The final output of summarize was then assigned to the variable diamonds.means.

We will make extensive use of this kind of pipeline in this notebook.
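
As a small sketch of why the piped form reads more easily, the snippet below writes the same computation twice, once nested and once piped. It uses R's built-in mtcars dataset as a stand-in, so the column names cyl and mpg are not from the CompStak data.

```r
library(dplyr)

# Nested form: has to be read from the inside out.
nested <- summarise(group_by(mtcars, cyl), mean_mpg = mean(mpg))

# Piped form: reads top to bottom, one step per line.
piped <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))

# Both forms compute the same group means.
identical(nested$mean_mpg, piped$mean_mpg)
```

The two results match because %>% only changes how the call is written, not what is computed.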

Exploratory Data Analysis on CompStak data

About the CompStak data:

CompStak is a real estate data provider for analyst-reviewed comps. CompStak employs a crowdsourced model to gather commercial real estate information for investors, brokers, asset managers and appraisers.

Tidying CompStak

Using tidyverse we will perform the following tasks to understand and deconstruct the CompStak data:

1. Explore the basic properties of our dataset.
2. Rearrange, clean and explore descriptive statistics on the dataset.
3. Visualize the data and build a data-driven hypothesis.

Step one - Load libraries:

Load the main libraries for this exercise. Make sure you install these packages before loading them.

library(tidyverse) 
library(psych)
library(kableExtra)
library(knitr)
library(scales)

About these packages

tidyverse - A set of packages that work in harmony because they share common data representations and API design. This package is designed to make it easy to install and load multiple tidyverse packages in a single step. More information can be found on the tidyverse website.
psych - A general purpose toolbox for personality, psychometric theory and experimental psychology. Functions are primarily for multivariate analysis and scale construction using factor analysis, principal component analysis, cluster analysis and reliability analysis, although others provide basic descriptive statistics.
kableExtra - The goal of kableExtra is to help build common complex tables and manipulate table styles. It imports the pipe %>% symbol from magrittr and verbalizes all the functions, so you can add “layers” to a kable output in a way that is similar to ggplot2.
knitr - A general-purpose literate programming engine, with lightweight APIs designed to give users full control of the output without heavy coding work.
scales - The scales package provides the internal scaling infrastructure for ggplot2, and its functions allow users to customize the transformations, breaks, guides and palettes used in visualizations in ggplot2 and beyond.

Step two - Load Dataset:

Label the data we want to bring in as “mydata_00” (any name will do; the numeric suffix lets us number later versions).
Read the CompStak data with the read.csv function by passing the name of the original file.

mydata_00 <- read.csv("compdata.csv", na.strings=c(""," ","NA"))

A quick way to inspect the data frame is the head function, which prints the first six rows. The View function opens the full table in a spreadsheet-like viewer in RStudio.

mydata_00 %>% 
  head()

mydata_00 %>% 
  View()

To check the class and size of this dataset, we can use the class and dim commands; dim quickly shows how many rows and columns the dataset has.

class(mydata_00)

## [1] "data.frame"

dim(mydata_00)

## [1] 5228   81

The data is stored as a “data.frame” and comprises 5228 rows and 81 columns.

Step Three - Generate Descriptive Statistics

describe(mydata_00)

Some of these variables have complicated names, which we can change using the pipe syntax (RStudio shortcut: CTRL + SHIFT + M, or CMD + SHIFT + M on a Mac). Use the rename function for this procedure, and remember to create a new data frame, mydata_01, to hold the changed set.

mydata_01 <- mydata_00 %>%
  rename(
    # new_name = old_name; each original column can be renamed only once
    Starting_Rrent_Gross = Starting.Rent..Gross.Annual...USD.,
    Effective_Rent = Effective.Rent..USD...per.year.,
    Current_Rent = Starting.Rent..USD...per.year.
  )

colnames(mydata_01)

##  [1] "Zip.Code"                         "State"                           
##  [3] "Retail.Notes"                     "Tenant.Brokers"                  
##  [5] "Tenant.Brokerage.Firms"           "Tenant.Industry"                 
##  [7] "Landlord.Brokers"                 "Landlord.Brokerage.Firms"        
##  [9] "Parking.Notes"                    "Retail.Anchor"                   
## [11] "Parking.Lot.Type"                 "Parking.Ratio"                   
## [13] "Year.Renovated"                   "Year.Built"                      
## [15] "Building.Floors"                  "Building.Size"                   
## [17] "Submarket"                        "Property.Subtype"                
## [19] "Property.Type"                    "Building.Name"                   
## [21] "Building.Class"                   "Cross.Streets"                   
## [23] "Street.Frontage"                  "Corner.Unit"                     
## [25] "Vented.Space"                     "Selling.Basement"                
## [27] "Office.Portion"                   "Rail"                            
## [29] "Loading.Docks"                    "Doors"                           
## [31] "Clear.Height"                     "Sprinkler"                       
## [33] "Load.Factor"                      "Suite"                           
## [35] "Space.Type"                       "Date.Created"                    
## [37] "Renewal.Options"                  "Break.Option.Dates"              
## [39] "Rent.Review.Dates"                "Execution.Date"                  
## [41] "Commencement.Date"                "Total.Transaction.Size"          
## [43] "Blended.Rent..USD...per.year."    "Sublessor"                       
## [45] "Break.Option.Type"                "Pro.Rata.Percent"                
## [47] "Annual.Taxes"                     "Operating.Expenses.Notes"        
## [49] "Operating.Expenses..USD."         "Operating.Expenses.Type"         
## [51] "Concessions.Notes"                "Current.Rent..USD."              
## [53] "Work.Value..USD."                 "Work.Type"                       
## [55] "Additional.Rent.Free"             "Rent.Bump.Year"                  
## [57] "Rent.Bump.Dollar..USD."           "Rent.Bump.Percent"               
## [59] "Percentage.Rent"                  "Asking.Rent..Gross.Annual...USD."
## [61] "Starting_Rrent_Gross"             "Asking.Rent..USD...per.year."    
## [63] "Free.Rent.Type"                   "Lease.Type"                      
## [65] "Street.Address"                   "City"                            
## [67] "Tenant.Name"                      "Transaction.Quarter"             
## [69] "Floors.Occupied"                  "Transaction.Type"                
## [71] "Transaction.SQFT"                 "Sublease"                        
## [73] "Expiration.Date"                  "Lease.Term"                      
## [75] "Current_Rent"                     "Effective_Rent"                  
## [77] "Rent.Schedule..USD."              "Free.Rent"                       
## [79] "Current.Landlord"                 "Comments"                        
## [81] "Geo.Point"

write.csv(mydata_01, file = "mydata_01.csv")

Looks much better!
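
Renaming columns one by one works for a handful of variables. When many names follow the same dotted pattern, a programmatic rename may be easier. The sketch below is one possible approach and not part of the original workflow: the comps data frame and the clean_name helper are hypothetical, and rename_with() assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Toy data frame with names in the style read.csv() produces.
comps <- data.frame(
  Effective.Rent..USD...per.year. = c(45, 60),
  Starting.Rent..USD...per.year.  = c(40, 55)
)

# Hypothetical helper: collapse runs of dots into one underscore
# and drop a trailing underscore.
clean_name <- function(x) {
  x <- gsub("\\.+", "_", x)
  sub("_$", "", x)
}

# Apply the helper to every column name.
comps_clean <- comps %>% rename_with(clean_name)
names(comps_clean)
```

Note that this produces names such as Effective_Rent_USD_per_year, which differ from the shorter hand-picked names used in the rest of this notebook.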

Examining explanatory variables:

ex_vr <- mydata_01 %>%
  select(
    Tenant.Name,
    Effective_Rent,
    Current_Rent,
    Starting_Rrent_Gross,
    Total.Transaction.Size,
    Transaction.SQFT
  )

ex_vr %>% 
  arrange(desc (Effective_Rent)) %>% 
  describe()

Question - Who are the top paying tenants in this dataset?

We can use the arrange function to rank this subset by the effective rent

ex_vr %>% 
  select(-Starting_Rrent_Gross) %>% 
  arrange(desc (Effective_Rent)) %>% 
  head()

Data completeness analysis

How many missing observations are there in this subset of explanatory variables?

NA_count <- ex_vr %>%
  # keep only the columns that contain at least one NA
  select_if(function(x) any(is.na(x))) %>%
  # count the NAs in each remaining column
  summarise_all(~ sum(is.na(.)))

NA_count

So, what have we done here? The select_if part keeps every column in which is.na is TRUE for at least one value. Then, for each of those columns, summarise_all sums up the number of NAs. Note that each column is reduced to a single value, which is why we use a summarise function. Finally, the resulting data frame (dplyr, which is part of the tidyverse, always aims to give back a data frame) is stored in a new variable, NA_count, for further processing.
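
For a quick cross-check, the same count can be obtained in base R. This is a minimal sketch on a made-up data frame (toy stands in for ex_vr, and its values are invented), and it also reports the share of missing values, which is often easier to compare across columns.

```r
# Toy data frame standing in for ex_vr (values are made up).
toy <- data.frame(
  tenant = c("A", "B", NA, "D"),
  rent   = c(100, NA, NA, 250),
  sqft   = c(1200, 800, 950, 400)
)

# Number of missing values per column.
na_count <- colSums(is.na(toy))

# Share of missing values per column.
na_share <- colMeans(is.na(toy))

na_count
na_share
```

Unlike the select_if version, this keeps columns with zero missing values in the result.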

Wait, is there a package for that?

# VIM library for using 'aggr'
library(VIM)

# 'aggr' plots the amount of missing/imputed values in each column
a <- aggr(ex_vr, prop = TRUE, numbers = FALSE, combined = FALSE,
          labels = names(ex_vr), cex.axis = .8, oma = c(8, 4, 4, 3))

Exploratory graphs

One of the biggest attractions of the R programming language is that it has built in the functionality required for complete programmatic control over the plotting of complex figures and graphs. Whilst using these functions to generate simple plots is very easy, generating more complex multi-layered plots with precise positioning, layout, fonts and colors can be challenging. This course aims to go through the core R plotting functionality in a structured way that illustrates the basic features you need to create plots of any kind. Although the course is mostly restricted to plotting within the tidyverse, we will also introduce a few very common add-in packages which may prove useful.

ggplot

The ggplot2 package, created by Hadley Wickham, offers a powerful graphics language for creating elegant and complex plots, and its popularity in the R community has exploded in recent years. Hadley Wickham is (in my opinion) perhaps the most important statistician/data scientist on the planet, and he supports R software like few other people. ggplot2 is one of the best maintained, most important, and best-made R packages. Unlike base R graphs, ggplot2 graphs are not affected by many of the options set in the par() function. They can be modified using the theme() function, and by adding graphic parameters within the qplot() function. For greater control, use ggplot() and other functions provided by the package. Note that ggplot2 functions can be chained with “+” signs to generate the final plot. Having exploratory plots be pretty, even if it’s not necessary, is clearly a bonus (particularly if Andrea is in the audience). Exploratory analyses aren’t publication ready, but they’re definitely ready to send to a colleague or to share on the company’s internal chat.

Histogram - What is the overall Effective Rent Distribution?

In the following exercise we’ll use CompStak data to make a histogram. We plot mydata_01$Effective_Rent along the x-axis and use the geom_histogram() function to tell ggplot2 that we want the distribution of Effective_Rent shown as a histogram. Lastly, we customize the plot by adding labs(), to which we pass the title and axis labels, and coord_cartesian() to set the limits of the x-axis.

First, the quick way

ggplot(mydata_01) +
  geom_density(aes(Effective_Rent), fill="blue")

With just a bit more effort we get a clear read on the density of the rents

His_01 <- ggplot(mydata_01,
                 # Set variable to Effective Rent
                 aes(x = Effective_Rent)) +
  # Limit x axis to the 0-1000 range
  coord_cartesian(xlim = c(0, 1000)) +
  # Set histogram params
  geom_histogram(binwidth = 15, aes(y = ..density..),
                 colour = "lightblue", fill = "blue") +
  # Histogram with density plot
  geom_density(aes(color = "Density"), alpha = .2, fill = "#FF6666") +
  labs(title = "Effective Rent Distribution in CompStak Data") +
  # Add mean line
  geom_vline(aes(xintercept = mean(mydata_01$Effective_Rent, na.rm = TRUE),
                 color = "Average Rent"), linetype = "twodash", size = 1) +
  # Add labels
  labs(x = "Effective Rent", color = "Legend") +
  # Label the mean line once; annotate() avoids drawing one label per row
  annotate("text", x = 166.86, y = 0.006, angle = 90, size = 4, colour = "red",
           label = "\nAvg. Effective Rent = 166.86") +
  theme_bw()
His_01
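
The label in the plot above hard-codes the average rent as 166.86. A possible refinement, sketched here on simulated data rather than the CompStak file, is to compute the mean once and reuse it for both the line and the label, so the text stays correct if the data changes. The toy data frame and the avg_rent variable are illustrative names only.

```r
library(ggplot2)

# Simulated rents standing in for mydata_01$Effective_Rent.
set.seed(1)
toy <- data.frame(
  Effective_Rent = c(rlnorm(500, meanlog = 5, sdlog = 0.5), NA)
)

# Compute the mean once and reuse it for the line and the label.
avg_rent <- mean(toy$Effective_Rent, na.rm = TRUE)

p <- ggplot(toy, aes(x = Effective_Rent)) +
  geom_histogram(binwidth = 15, na.rm = TRUE) +
  geom_vline(xintercept = avg_rent, linetype = "twodash") +
  annotate("text", x = avg_rent, y = 0, hjust = 0, vjust = -0.5, angle = 90,
           label = paste("Avg. Effective Rent =", round(avg_rent, 2))) +
  theme_bw()
p
```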

Complex Histograms - Where are these properties located?

ggplot(data = mydata_01,
       aes(x = Effective_Rent)) +
  # breaks sets the bin edges, so no separate binwidth is needed
  geom_histogram(breaks = seq(10, 6000, by = 20), aes(fill = Submarket)) +
  theme_bw() + coord_cartesian(xlim = c(0, 1000)) +
  labs(title = "Number of properties relative to price and location") +
  labs(x = "Effective Rent", y = "Number of properties") +
  geom_density(alpha = .2, fill = "#FF6666")

What is the distribution of the expiration dates on these properties?

      mydata_01$Expiration.Date <- as.Date(mydata_01$Expiration.Date)
      ggplot(mydata_01, 
      aes(x = Expiration.Date)) +
      geom_histogram(aes(fill  = Submarket), binwidth = 120) +
      scale_x_date(labels = date_format("%Y-%b"),
      limits = c(as.Date("2014-12-11"), as.Date("2040-05-01"))) +
      labs(title="Lease Expiration Dates by submarket ") +
      ylab("Frequency") + xlab("Year and Month") +
      theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
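
A note on the as.Date() call used above: without a format argument it expects ISO-style strings such as 2019-03-18. We have not verified here how the CompStak export stores its dates, so the sketch below only illustrates how to pass an explicit format if the file uses a different layout; the example strings are made up.

```r
# ISO dates parse without a format string.
iso <- as.Date("2019-03-18")

# US-style month/day/year needs an explicit format.
us <- as.Date("3/18/2019", format = "%m/%d/%Y")

# Both strings describe the same day.
iso == us

# A string that does not match the given format becomes NA,
# so it is worth checking for NAs after converting.
bad <- as.Date("18-03-2019", format = "%m/%d/%Y")
is.na(bad)
```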

What is the distribution of the commencement dates on these properties?

      mydata_01$Commencement.Date <- as.Date(mydata_01$Commencement.Date)
      ggplot(mydata_01, 
      aes(x = Commencement.Date)) +
      geom_histogram(aes(fill  = Submarket), binwidth=40) +
      scale_x_date(labels = date_format("%Y-%b"),
      limits = c(as.Date("2013-12-01"), as.Date("2019-05-01"))) +
      labs(title="Lease Commencement Date Distribution") +
      ylab("Number of properties") + xlab("Year and Month") +
      theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1))

mydata_01$Execution.Date <- as.Date(mydata_01$Execution.Date)

mydata_01 %>%
  group_by(Property.Type, Tenant.Industry, Execution.Date) %>%
  summarise_at(vars(Current_Rent, Transaction.SQFT), sum, na.rm = TRUE) %>%
  arrange(Execution.Date) %>%
  filter(Execution.Date < as.Date("2035-01-01")) %>%
  ggplot(aes(Execution.Date, Transaction.SQFT, color = Property.Type)) +
  scale_y_continuous(name = "Transaction.SQFT", labels = comma) +
  theme_bw() +
  # scale_colour_viridis_d(na.value = "blue") +
  labs(title = "Lease Execution Date Distribution") +
  # facet_wrap(~ Property.Type) +
  geom_line() +
  geom_point()

# Same plot as before, now split into one panel per property type
mydata_01 %>%
  group_by(Property.Type, Tenant.Industry, Execution.Date) %>%
  summarise_at(vars(Current_Rent, Transaction.SQFT), sum, na.rm = TRUE) %>%
  arrange(Execution.Date) %>%
  filter(Execution.Date < as.Date("2035-01-01")) %>%
  ggplot(aes(Execution.Date, Transaction.SQFT, color = Property.Type)) +
  scale_y_continuous(name = "Transaction.SQFT", labels = comma) +
  theme_bw() +
  # scale_colour_viridis_d(na.value = "blue") +
  labs(title = "Lease Execution Date Distribution") +
  facet_wrap(~ Property.Type) +
  geom_line() +
  geom_point()

mydata_01$Execution.Date <- as.Date(mydata_01$Execution.Date)

mydata_01 %>% 
  ggplot(aes(Execution.Date, fill = is.na(Property.Type))) +
          labs(title="Lease Execution Date") +
          geom_histogram() + theme_bw()  + facet_wrap(~ Property.Type)

mydata_01 %>% 
  group_by(Property.Type, Tenant.Industry, Expiration.Date) %>% 
  summarise_at(vars(Current_Rent, Transaction.SQFT), sum, na.rm = TRUE) %>% 
  arrange(Expiration.Date) %>% 
  filter(Expiration.Date < "2035-1-1") %>% 
  ggplot(aes(Expiration.Date, Transaction.SQFT, color = Property.Type))+
          scale_y_continuous(name = "Transaction.SQFT", labels = comma) + theme_bw() +
#        scale_colour_viridis_d(na.value="blue") + theme_bw() +
        labs(title="Lease Expiration Date Over Transaction SQFT") +
        facet_wrap(~ Property.Type) +

  geom_line() +
  geom_point()

What is the use associated with the properties in this dataset?

ggplot(mydata_01,
       aes(x = Property.Type)) +
  geom_bar(aes(fill = Submarket)) +
  xlab("Property type") +
  ylab("Number of properties") +
  labs(title = "Property type VS submarket") +
  theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1))

How does the size of these transactions (in SQFT) relate to the property type?

Total Transaction Size (SF): Total amount of space leased by tenant in the transaction in square feet. Used when transaction size does not reflect entirety of space leased or used when multiple spaces are rented at different rates.

ggplot(data = mydata_01,
       aes(x = Property.Type, y = Transaction.SQFT)) +
  # geom_col() is the shorthand for geom_bar(stat = "identity")
  geom_col(aes(fill = Submarket)) +
  xlab("Property type") +
  theme_bw() +
  scale_y_continuous(name = "Transaction size/SQFT", labels = comma) +
  labs(title = "Type & Transaction SQFT") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

How are these office buildings classified?

ggplot(mydata_01,
       aes(x = Building.Class)) +
  geom_bar(aes(fill = Submarket)) +
  xlab("Building class") +
  ylab("Number of properties") +
  labs(title = "Building class vs. submarket") +
  theme_bw() + theme(axis.text.x = element_text(hjust = 1))

Which property types and locations have the highest effective rents?

d <-  ggplot(data=mydata_01, 
      aes(x=Submarket, y=Effective_Rent, fill = Property.Type)) 
d +   ggtitle("Effective Rent over submarket") +
      theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 9))  + 
      scale_fill_brewer(palette="Spectral", na.value="black", direction = -1) +
      xlab("Submarket") +
      ylab("Effective_Rent") +
      geom_bar(stat="identity")

These variables are in fact better described as discrete rather than continuous, so a point graph might be a better graphical representation of the data.

d <-  ggplot(mydata_01,
      aes(y = Effective_Rent, x = Submarket, color = Property.Type  )) 
d +   ggtitle("Effective Rent over submarket") +
      scale_colour_viridis_d(option = "viridis", na.value = "red") +
      theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 9)) +

      geom_point()

What are the industries associated with each of the property types and Current rent?

d <-    ggplot(mydata_01,
        aes(x = Tenant.Industry, y = Current_Rent, color = Property.Type)) 
d +     ggtitle("Tenant industry over current rent price") +
        scale_colour_viridis_d(option = "viridis", na.value = "red") + theme_bw() +
        theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7)) +

  geom_point()

Looks like Apparel, Consumer durables and Retail have on average the highest current rents

Which Industries have the most listings in the database?

By_Industry <- mydata_01 %>% 
     group_by(Tenant.Industry) %>% 
     tally()

## Warning: Factor `Tenant.Industry` contains implicit NA, consider using
## `forcats::fct_explicit_na`

By_Industry %>% 
    arrange(desc(n))
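
The group_by(), tally() and arrange(desc(n)) steps above can also be written as a single count() call. The sketch below shows this on a made-up vector of industries (toy is a stand-in for mydata_01, not real data).

```r
library(dplyr)

# Toy data standing in for mydata_01 (values are made up).
toy <- data.frame(
  Tenant.Industry = c("Retail", "Apparel", "Retail", NA, "Retail", "Apparel")
)

# group_by() + tally() + arrange(desc(n)) in one call.
by_industry <- toy %>% count(Tenant.Industry, sort = TRUE)
by_industry
```

Missing industries are counted as their own NA group, which makes gaps in the data visible in the same table.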

What are the industries associated with each of the property types and Transaction size?

d <-    ggplot(mydata_01,
        aes(x = Tenant.Industry , y = Transaction.SQFT  , color = Property.Type)) 
d +     ggtitle("Tenant Industry over Transaction SQFT") +
        scale_y_continuous(name = "Transaction.SQFT", labels = comma) +
        scale_colour_viridis_d(option = "viridis", na.value = "red") + theme_bw() +
        theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7)) +

  geom_point()

Looks like retail spaces have the most properties with high transaction sizes

Which floors are most occupied?

d <-    ggplot(mydata_01,
        aes(x = Floors.Occupied, y = Effective_Rent, color = Property.Type)) 
d +     ggtitle("Floor occupied vs effective rent") +
        scale_colour_viridis_d(option = "viridis", na.value = "red") + theme_bw() +
        theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 5)) +

  geom_point()

Which quarters had the highest associated Effective rents?

d <-    ggplot(mydata_01,
        aes(Transaction.Quarter , y = Effective_Rent, color = Property.Type  )) 
d  +    ggtitle("Transaction period over Effective_Rent") + 
        scale_colour_viridis_d(option = "viridis", na.value = "red") +
        theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8)) +

        geom_point()

top_tenants <- mydata_01 %>%
  filter(Effective_Rent > 6000) %>%
  select(Tenant.Name, Tenant.Industry, Effective_Rent, Current_Rent)
head(top_tenants)

Which time periods had the highest associated Transaction size?

d <-    ggplot(mydata_01,
        aes(x = Transaction.Quarter , y = Total.Transaction.Size , color = Property.Type)) 
d +     ggtitle("Transaction Period over Total Transaction Size") +
        scale_y_continuous(name = "Total Transaction Size", labels = comma) +
        scale_colour_viridis_d(option = "viridis", na.value = "red") + theme_bw() +
        theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8)) +

  geom_point()

Looks like most of the observations are either Retail or Office. Using the %in% operator we can isolate these for a closer look.

s1 <- mydata_00 %>%
  filter(Property.Type %in% c("Office", "Retail"))
s1$Execution.Date <- as.Date(s1$Execution.Date)

ggplot(s1, aes(x = Execution.Date, y = Current.Rent..USD.)) +
  ggtitle("Execution Date over Current rent") +
  geom_point(aes(color = Property.Type), size = 1) +
  scale_color_manual(values = c("#00AFBB", "#E7B800")) +
  # scale_color_manual(values = c("#73f759", "#fff575")) +
  theme_minimal()
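
As a quick illustration of what %in% does inside filter(), here is a minimal base R sketch on a made-up vector of property types (the values are invented, not taken from the dataset).

```r
types <- c("Office", "Retail", "Industrial", NA, "Office")

# One TRUE/FALSE per element of `types`.
keep <- types %in% c("Office", "Retail")
keep

# %in% returns FALSE (not NA) for missing values,
# so subsetting with it never produces NA rows.
types[keep]
```

This differs from ==, which returns NA when compared against a missing value.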

Conversely we can look at the Tenant Industry variable and break down the lease execution even further

s1 <- mydata_01 %>%
  filter(Tenant.Industry %in% c("Apparel", "Retail", "Consumer Durables", " Leisure & Restaurants"))
s1$Execution.Date <- as.Date(s1$Execution.Date)

ggplot(s1, aes(x = Execution.Date, y = Current_Rent)) +
  ggtitle("Breakdown by Tenant Industry") +
  geom_point(aes(color = Tenant.Industry), size = 0.8) +
  # one colour per industry in the filter; the fourth value is an arbitrary choice
  scale_color_manual(values = c("#00AFBB", "#ef4ef4", "#E7B800", "#FC4E07")) +
  # scale_color_manual(values = c("#ef4ef4", "#fff575")) +
  theme_minimal()

Here we see that more Apparel leases were signed between 2014 and 2016.

How are the different lease types associated with property type and effective rent?

d <-    ggplot(mydata_01,
        aes(x = Lease.Type , y = Effective_Rent , color = Property.Type)) 
d +     ggtitle("Lease type over effective rent") +
        scale_colour_viridis_d(option = "viridis", na.value = "red") + theme_bw() +
        theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10)) +

  geom_point()

What is the relationship between the size of the building and the effective rent?

d <-    ggplot(mydata_01, 
        aes(Building.Size, Effective_Rent, color = Property.Type)) 
d +     ggtitle("Building size over Effective rent") +
        scale_x_continuous(name = "Building.Size", labels = comma) + 
        scale_colour_viridis_d(option = "viridis", na.value = "red") + theme_bw() +
        theme(axis.text.x = element_text(hjust = 1, size = 10)) +
        geom_point()

Log Transformations of Effective Rents & Building size

The log transformation is, arguably, the most popular among the different types of transformations used to transform skewed data to approximately conform to normality. If the original data follows a log-normal distribution or approximately so, then the log-transformed data follows a normal or near normal distribution.
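
The effect of the transformation can be checked numerically. The sketch below simulates log-normal data (it does not use the CompStak rents) and compares the skewness before and after taking logs; skewness() is a small hypothetical helper defined here, not a function from one of the loaded packages.

```r
set.seed(42)

# Simulated right-skewed data drawn from a log-normal distribution.
x <- rlnorm(10000, meanlog = 4, sdlog = 1)

# Simple moment-based sample skewness (hypothetical helper).
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skewness(x)       # strongly positive: long right tail
skewness(log(x))  # close to zero: roughly symmetric
```

Real rent data will not be exactly log-normal, so the transformed values should still be inspected with a histogram before assuming normality.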

d <-    ggplot(mydata_01, 
        aes(log(Building.Size), log(Effective_Rent), color = Property.Type)) 
d +     ggtitle("Log of building size over log of effective rent") +
        # the x values are logged, so the axis label says so
        scale_x_continuous(name = "log(Building.Size)") + 
        scale_colour_viridis_d(option = "viridis", na.value = "red") + theme_bw() +
        theme(axis.text.x = element_text(hjust = 1, size = 10)) +
        geom_point()

Log Transformations of Effective Rents & Transaction Size

d <-    ggplot(mydata_01, 
        aes(log(Total.Transaction.Size), log(Effective_Rent), color = Property.Type)) 
d +     ggtitle("Transaction  size over Effective rent") +
        scale_x_continuous(name = "Total.Transaction.Size", labels = comma) + 
        scale_colour_viridis_d(option = "viridis", na.value = "red") + theme_bw() +
        theme(axis.text.x = element_text(hjust = 1, size = 10)) +
        geom_point()

Log Transformations of Effective Rents & Transaction Size with best fit

d <-    ggplot(mydata_01, 
        aes(log(Transaction.SQFT), log(Effective_Rent))) +
        ggtitle("Transaction  size over Effective rent") +
      # xlab("Log of Rent Concession") + ylab(" Log of Effective Rent") +
        geom_point(aes(colour = factor(Property.Type))) + theme_bw() 
d +     scale_colour_viridis_d(option = "viridis", na.value = "red") + theme_bw() +
        theme(axis.text.x = element_text(hjust = 1, size = 10)) +
      # fit a line 
        geom_smooth(mapping = aes(linetype = "r1"), method = "lm", color = "red", lwd=0.5)
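
The straight line drawn by geom_smooth(method = "lm") on a log-log plot is an ordinary linear regression of log(y) on log(x). As a hedged sketch, the snippet below fits that regression directly on simulated data; the variables sqft and rent and the elasticity of -0.3 are invented for illustration and say nothing about the CompStak data.

```r
set.seed(123)

# Simulated data: rent falls with size with a known elasticity of -0.3.
n    <- 500
sqft <- exp(runif(n, min = 6, max = 11))
rent <- 2000 * sqft^(-0.3) * exp(rnorm(n, sd = 0.2))

# The same kind of line geom_smooth(method = "lm") draws on a log-log plot.
fit <- lm(log(rent) ~ log(sqft))
coef(fit)
```

In a log-log model the slope can be read as an elasticity: a slope near -0.3 means a 1% larger space is associated with roughly 0.3% lower rent.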

TIME PARAMETER - Log of building size and transaction quarter

d <-    ggplot(mydata_01,
        aes(log(Total.Transaction.Size), log(Effective_Rent) , color = Transaction.Quarter)) 
d +     ggtitle("Building size over effective rent") +
        scale_x_continuous(name = "Building.Size", labels = comma) +
        scale_colour_viridis_d(option = "viridis", na.value = "red") + theme_bw() +
        theme(axis.text.x = element_text(hjust = 1, size = 10)) +

  geom_point()

BONUS: 3D Plots & geo mapping

How about plotting in 3D? There is a library for that!

Plotly’s R graphing library makes interactive, publication-quality graphs online. It can produce line plots, scatter plots, area charts, bar charts, error bars, box plots, histograms, heatmaps, subplots, multiple-axes, and 3D (WebGL based) charts.

library(plotly)

mydata_01$Property.Type <- as.factor(mydata_01$Property.Type)

p <- plot_ly(mydata_01, x = ~Commencement.Date, y = ~City, z = ~Effective_Rent,
             color = ~Property.Type,
             colors = c('#BF382A', '#0C4B8E', '#e147ef')) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'Commencement.Date'),
                      yaxis = list(title = 'City'),
                      zaxis = list(title = 'Effective_Rent')))

# Create a shareable link to your chart
# (api_create() uploads the chart and needs Plotly account credentials)
chart_link <- api_create(p, filename = "scatter3d-basic")
chart_link

NOTE: Animations can be created by either using the frame argument in plot_ly() or the (unofficial) frame ggplot2 aesthetic in ggplotly(). By default, animations populate a play button and slider component for controlling the state of the animation (to pause an animation, click on a relevant location on the slider bar). Both the play button and slider component transition between frames according to rules specified by animation_opts().

Go to example: https://plot.ly/~titel.y/27/#/

Interactive Maps with leaflet in R

Leaflet is one of our all-time favorite open-source JavaScript libraries for interactive maps. It’s used by websites ranging from The New York Times and The Washington Post to GitHub and Flickr, as well as GIS specialists like OpenStreetMap, Mapbox, and CartoDB. Leaflet is designed with simplicity, performance and usability in mind. It works efficiently across all major desktop and mobile platforms, can be extended with lots of plugins, has a beautiful, easy to use and well-documented API and a simple, readable source code that is a joy to contribute to.

Step one - load libraries

library(rgdal)
library(leaflet)
library(ggmap)
library(maptools)

Step two - obtain a private API key and geocode addresses

You will need to sign up for your own API key to run this next bit of code, which geolocates the addresses in the dataset.

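# origAddress is assumed to be a data frame with a Street.Address column,
# for example a copy of mydata_01.
# Register your own key first, e.g. register_google(key = "YOUR_API_KEY").
for (i in seq_len(nrow(origAddress))) {
  # print("Working...")
  # geocode() expects a character string, not a factor
  result <- geocode(as.character(origAddress$Street.Address[i]),
                    output = "latlona", source = "google")
  origAddress$lon[i] <- as.numeric(result[1])
  origAddress$lat[i] <- as.numeric(result[2])
  origAddress$geoAddress[i] <- as.character(result[3])
}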

Step three - run leaflet

The basic usage of leaflet is that you create a map widget using the leaflet() function, and add layers to the map using the layer functions such as addTiles(), addMarkers(), and so on. Adding layers can be done through the pipe operator %>% from magrittr (you are not required to use %>%, though):

# first 80 properties

df.100 <- origAddress[1:80,]

getColor <- function(df) {
  sapply(df$Effective_Rent, function(Effective_Rent) {
    if (is.na(Effective_Rent)) {
      "gray"   # missing rent; if() would otherwise fail on NA
    } else if (Effective_Rent < 100) {
      "green"
    } else if (Effective_Rent > 300) {
      "orange"
    } else {
      "red"
    }
  })
}

icons <- awesomeIcons(
  icon = 'ios-close',
  iconColor = 'black',
  library = 'ion',
  markerColor = getColor(df.100)
  
)

leaflet(df.100) %>%
  addTiles() %>%
  addProviderTiles("CartoDB.Positron") %>%
  addAwesomeMarkers(~lon, ~lat, icon = icons, label = ~as.character(Effective_Rent))

The leaflet() function and all layer functions have a data argument that can take several types of spatial data objects, including matrices and data frames with latitude and longitude columns, spatial objects from the sp package (e.g. SpatialPoints and SpatialPointsDataFrame, etc), and the data frame returned from maps::map(). When you have got a data object in leaflet() or layer functions, you may use the formula interface to pass values of variables to function arguments.
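
The marker colours above follow a simple rule: green below 100, orange above 300 and red otherwise. As a sketch of an alternative, the same rule can be written in vectorised form with dplyr's case_when(). The rent_colour function is a hypothetical helper, and mapping missing rents to "gray" is our own assumption rather than something the notebook specifies.

```r
library(dplyr)

# Hypothetical vectorised version of the marker colour rule.
rent_colour <- function(rent) {
  case_when(
    is.na(rent) ~ "gray",    # assumption: missing rents get a neutral colour
    rent < 100  ~ "green",
    rent > 300  ~ "orange",
    TRUE        ~ "red"
  )
}

rent_colour(c(50, 100, 250, 300, 301, NA))
```

Because case_when() works on whole vectors, the result can be passed directly as markerColor without an sapply() loop.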

Next Steps -

There is a lot more to learn about leaflet; here, we’ve just scratched the surface. We highly recommend that you read The Map Widget page of the Leaflet for R documentation before exploring the rest of that site, as it describes common idioms used throughout its examples. Although the package tries to provide an R-like interface to Leaflet, you may want to check the Leaflet API documentation occasionally when the meaning of certain parameters is not clear to you.

End of notebook 01

Tutorial No.1

YYT

3/18/2019