Notebook 01
In this notebook we introduce the basic functions of R used to develop data science knowledge and experience. To do so, we will explore a dataset from CompStak covering the 2008 to 2015 period using a fantastic R package called tidyverse. Roughly eighty percent of data science effort is spent on cleaning and preparing data (Kelleher and Tierney, 2018). Moreover, this is not just a first step: data cleaning and preparation must be repeated many times over the course of an analysis as new problems come to light or new data is collected. To get a handle on the problem, this notebook focuses on a small but important aspect of data cleaning that we call data tidying: structuring datasets to facilitate analysis.
The principles of tidy data provide a standard way to organize data values within a dataset. A standard makes initial data cleaning easier because you don’t need to start from scratch and reinvent the wheel every time. The tidy data standard has been designed to facilitate initial exploration and analysis of the data, and to simplify the development of data analysis tools that work well together. Current tools often require translation. You have to spend time managing the output from one tool so you can input it into another. Tidy datasets and tidy tools work hand in hand to make data analysis easier, allowing you to focus on the interesting domain problem, not on the uninteresting logistics of data.
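So what is tidy data? A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data: 1. Each variable forms a column. 2. Each observation forms a row. 3. Each type of observational unit forms a table.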
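A minimal sketch of what tidying can look like in practice. The table below is made up for illustration (the data frame, its column names and its values are hypothetical, not CompStak data), and it assumes the tidyr package is installed.

```r
library(tidyr)

# Hypothetical "messy" table: one row per building, one column per year.
# The year is a variable, but here it is hidden in the column names.
messy <- data.frame(
  building  = c("A", "B", "C"),
  rent_2014 = c(40, 55, 62),
  rent_2015 = c(42, 57, 65)
)

# Tidy version: one row per building-year observation, one column per variable.
tidy <- pivot_longer(
  messy,
  cols         = c(rent_2014, rent_2015),
  names_to     = "year",
  names_prefix = "rent_",
  values_to    = "rent"
)
tidy$year <- as.integer(tidy$year)

tidy
```

In the tidy version, filtering by year or plotting rent against year no longer requires knowing which columns hold which year.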
In data analysis more than anything, a picture really is worth a thousand words. When you start analyzing data in R, your first step shouldn’t be to run a complex statistical test: first, you should visualize your data in a graph. This lets you understand the basic nature of the data, so that you know what tests you can perform, and where you should focus your analysis. I’m David Robinson, and in this lesson we’ll introduce you to ggplot2, a powerful R package that produces data visualizations easily and intuitively. We will assume you are moderately familiar with basic concepts in R, including variables and functions, and with RStudio, the integrated development environment for programming in R.
# load ggplot2, which also provides the diamonds example dataset
library(ggplot2)
# example dataset
head(diamonds)
# example using ggplot2 3.0.0
dsamp <- diamonds[sample(nrow(diamonds), 1000), ]
d <- ggplot(dsamp, aes(carat, price)) +
  geom_point(aes(colour = clarity))
d + scale_colour_viridis_d()
Note that with just a few lines of simple code we were able to visualize this dataset.
Our goal in this tutorial is to apply this type of analytical technique to our real estate dataset from CompStak in order to gain basic insights into the data. In the next tutorial (notebook #2) we will look at more advanced techniques aimed at creating statistical models based on the analysis we performed here.
Pipes are a powerful tool for clearly expressing a sequence of multiple operations, and we will make extensive use of them in this notebook. The pipe operator is written %>%; the RStudio shortcut for it is CTRL + SHIFT + M (CMD + SHIFT + M on a Mac). If you are familiar with the bash shell, you might be familiar with the pipe character, |, which sends the output of one command to the input of the next. The similar operator in R, %>%, sends whatever is on the left side of the operator to the first argument of the function on the right side of the operator. So, these two statements are effectively the same:
# This:
diamonds %>% group_by(cut)
# is the same as:
group_by(diamonds, cut)
But here comes the really cool part! We can chain these pipes together in a string of commands, sending the output of one command directly to the next. So instead of a two-step process that first groups the data by cut and then calculates the means, we can do it all at once with pipes:
diamonds.means <- diamonds %>%
group_by(cut) %>%
summarize(SL.mean = mean(carat))
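Let’s break apart what we just did, line by line: the first line takes the diamonds data and pipes it onward, assigning the final result to diamonds.means; the second line groups the rows by cut; the third line collapses each group to a single row holding the mean carat, stored in a column named SL.mean.
We will make extensive use of this kind of pipeline in this notebook.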
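To see that a pipeline is only a different way of writing nested function calls, here is a small sketch that computes the same summary both ways. It uses the built-in mtcars dataset rather than the notebook's data, and the object names nested and piped are chosen for illustration.

```r
library(dplyr)

# Nested calls: read from the inside out.
nested <- summarise(group_by(mtcars, cyl), mean_mpg = mean(mpg))

# The same computation as a pipeline: read from top to bottom.
piped <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))

# Both approaches give the same table: one row per value of cyl.
all.equal(nested, piped)
```

The pipeline version tends to stay readable as more steps are added, which is why the rest of this notebook is written that way.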
CompStak is a real estate data provider for analyst-reviewed comps. CompStak employs a crowdsourced model to gather commercial real estate information for investors, brokers, asset managers and appraisers.
Using tidyverse we will perform the following tasks to understand and deconstruct the CompStak data:
1. Explore the basic properties of our dataset.
2. Rearrange, clean and explore descriptive statistics on the dataset.
3. Visualize the data and build a data-driven hypothesis.
Load the main libraries for this exercise. Make sure you install these packages before loading them.
library(tidyverse)
library(psych)
library(kableExtra)
library(knitr)
library(scales)
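If some of these packages are not installed yet, something along these lines could check for them first. This is only a sketch: the helper name missing_packages is made up for this example, and the install step is guarded so that it runs only in an interactive session.

```r
# Packages loaded in this notebook.
needed <- c("tidyverse", "psych", "kableExtra", "knitr", "scales")

# Return the subset of `pkgs` that is not installed yet.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(needed)
to_install

# Only install from an interactive session, so that knitting the notebook
# never triggers a download by surprise.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```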
tidyverse is a set of packages that work in harmony because they share common data representations and API design. The tidyverse package is designed to make it easy to install and load multiple tidyverse packages in a single step. More information on the tidyverse can be found at https://www.tidyverse.org.
psych is a general purpose toolbox for personality, psychometric theory and experimental psychology. Functions are primarily for multivariate analysis and scale construction using factor analysis, principal component analysis, cluster analysis and reliability analysis, although others provide basic descriptive statistics.
kableExtra helps build common complex tables and manipulate table styles. It imports the pipe %>% symbol from magrittr and verbalizes all the functions, so you can add “layers” to a kable output in a way that is similar to ggplot2.
knitr is a general-purpose literate programming engine, with lightweight APIs designed to give users full control of the output without heavy coding work.
scales provides the internal scaling infrastructure to ggplot2, and its functions allow users to customize the transformations, breaks, guides and palettes used in visualizations in ggplot2 and beyond.
mydata_00 <- read.csv("compdata.csv", na.strings=c(""," ","NA"))
A quick way to inspect the data frame is the head function, which prints the first six rows. In RStudio, the View function opens the full data frame in an Excel-like viewer.
mydata_00 %>%
head()
mydata_00 %>%
View()
To look at the class and size of this dataset, we can use the class and dim commands; dim quickly shows how many rows and columns the dataset has.
class(mydata_00)
## [1] "data.frame"
dim(mydata_00)
## [1] 5228 81
The data is stored as a “data.frame” and comprises 5228 rows and 81 columns.
describe(mydata_00)
Some of these variables have complicated names, which we can change inside a pipeline (the shortcut for the %>% operator is CTRL + SHIFT + M, or CMD + SHIFT + M on a Mac). Use the rename function for this procedure, with the new name on the left of each = and the old name on the right. Remember to create a new data frame, mydata_01, to hold the changed set.
mydata_01 <- mydata_00 %>%
  # rename() takes new_name = old_name pairs; each old column is renamed once
  rename(
    Starting_Rrent_Gross = Starting.Rent..Gross.Annual...USD.,
    Effective_Rent = Effective.Rent..USD...per.year.,
    Current_Rent = Starting.Rent..USD...per.year.
  )
colnames(mydata_01)
## [1] "Zip.Code" "State"
## [3] "Retail.Notes" "Tenant.Brokers"
## [5] "Tenant.Brokerage.Firms" "Tenant.Industry"
## [7] "Landlord.Brokers" "Landlord.Brokerage.Firms"
## [9] "Parking.Notes" "Retail.Anchor"
## [11] "Parking.Lot.Type" "Parking.Ratio"
## [13] "Year.Renovated" "Year.Built"
## [15] "Building.Floors" "Building.Size"
## [17] "Submarket" "Property.Subtype"
## [19] "Property.Type" "Building.Name"
## [21] "Building.Class" "Cross.Streets"
## [23] "Street.Frontage" "Corner.Unit"
## [25] "Vented.Space" "Selling.Basement"
## [27] "Office.Portion" "Rail"
## [29] "Loading.Docks" "Doors"
## [31] "Clear.Height" "Sprinkler"
## [33] "Load.Factor" "Suite"
## [35] "Space.Type" "Date.Created"
## [37] "Renewal.Options" "Break.Option.Dates"
## [39] "Rent.Review.Dates" "Execution.Date"
## [41] "Commencement.Date" "Total.Transaction.Size"
## [43] "Blended.Rent..USD...per.year." "Sublessor"
## [45] "Break.Option.Type" "Pro.Rata.Percent"
## [47] "Annual.Taxes" "Operating.Expenses.Notes"
## [49] "Operating.Expenses..USD." "Operating.Expenses.Type"
## [51] "Concessions.Notes" "Current.Rent..USD."
## [53] "Work.Value..USD." "Work.Type"
## [55] "Additional.Rent.Free" "Rent.Bump.Year"
## [57] "Rent.Bump.Dollar..USD." "Rent.Bump.Percent"
## [59] "Percentage.Rent" "Asking.Rent..Gross.Annual...USD."
## [61] "Starting_Rrent_Gross" "Asking.Rent..USD...per.year."
## [63] "Free.Rent.Type" "Lease.Type"
## [65] "Street.Address" "City"
## [67] "Tenant.Name" "Transaction.Quarter"
## [69] "Floors.Occupied" "Transaction.Type"
## [71] "Transaction.SQFT" "Sublease"
## [73] "Expiration.Date" "Lease.Term"
## [75] "Current_Rent" "Effective_Rent"
## [77] "Rent.Schedule..USD." "Free.Rent"
## [79] "Current.Landlord" "Comments"
## [81] "Geo.Point"
write.csv(mydata_01, file = "mydata_01.csv")
Looks much better!
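The dotted column names come from the way read.csv() cleans up headers. The sketch below illustrates the mechanism on a toy example; the header text and the values are assumptions made for illustration, since the raw header row of compdata.csv is not shown in this notebook.

```r
library(dplyr)

# Hypothetical spreadsheet-style header, assumed here for illustration.
raw_header <- "Starting Rent (USD) (per year)"

# read.csv() passes headers through make.names(), which replaces every
# character that is not valid in an R name (spaces, parentheses) with a dot.
make.names(raw_header)

# A small data frame carrying the resulting name, with made-up values.
toy <- data.frame(Starting.Rent..USD...per.year. = c(120, 95, 210))

# rename() uses new_name = old_name.
toy_renamed <- toy %>%
  rename(Starting_Rent = Starting.Rent..USD...per.year.)

colnames(toy_renamed)
```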
ex_vr <- mydata_01 %>%
select(
Tenant.Name,
Effective_Rent,
Current_Rent,
Starting_Rrent_Gross,
Total.Transaction.Size,
Transaction.SQFT
)
ex_vr %>%
arrange(desc (Effective_Rent)) %>%
describe()
We can use the arrange function to rank this subset by the effective rent
ex_vr %>%
select(-Starting_Rrent_Gross) %>%
arrange(desc (Effective_Rent)) %>%
head()
# Count the missing values in every column that contains at least one NA
ex_vr %>%
  select_if(function(x) any(is.na(x))) %>%
  summarise_all(function(x) sum(is.na(x))) -> NA_count
NA_count
So, what have we done here? The select_if part chooses every column that contains at least one NA (any(is.na(x)) is TRUE). Then, for each of those columns, summarise_all sums up the number of NAs. Note that each column is summarized to a single value; that is why we use a summarise function. Finally, the resulting data frame (dplyr, which is a tool in the tidyverse, always aims at giving back a data frame) is stored in a new variable, NA_count, for further processing.
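As a quick cross-check, the same count can be obtained in base R. The sketch below uses a toy data frame with made-up values rather than the CompStak subset.

```r
# Toy data frame with made-up values and a few missing entries.
toy <- data.frame(
  tenant = c("A", "B", "C", "D"),
  rent   = c(100, NA, 250, NA),
  sqft   = c(1200, 800, NA, 1500)
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column.
na_per_column <- colSums(is.na(toy))
na_per_column

# Keep only the columns that actually have missing values.
na_per_column[na_per_column > 0]
```

Running both versions on the same data is a cheap way to confirm that the pipeline counts what we think it counts.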
# VIM library for using 'aggr'
library(VIM)
# 'aggr' plots the amount of missing/imputed values in each column
a <- aggr(ex_vr, prop = TRUE, numbers = FALSE, combined = FALSE,
          labels = names(ex_vr), cex.axis = .8, oma = c(8,4,4,3))
One of the biggest attractions of the R programming language is that it has, built in, the functionality required for complete programmatic control over the plotting of complex figures and graphs. Whilst the use of these functions to generate simple plots is very easy, the generation of more complex multi-layered plots with precise positioning, layout, fonts and colors can be challenging. This course aims to go through the core R plotting functionality in a structured way which should illustrate the basic features you would need to create plots of any kind. Although the course is mostly restricted to plotting within the tidyverse, we will also introduce a few very common add-in packages which may prove to be useful.
The ggplot2 package offers a powerful graphics language for creating elegant and complex plots, and its popularity in the R community has exploded in recent years. It was created by Hadley Wickham, who is (in my opinion) perhaps the most important statistician/data scientist on the planet and who supports R software like few other people. It is one of the best maintained and most important R packages. Unlike base R graphs, ggplot2 graphs are not affected by many of the options set in the par() function. They can be modified using the theme() function, and by adding graphic parameters within the qplot() function. For greater control, use ggplot() and other functions provided by the package. Note that ggplot2 functions can be chained with “+” signs to generate the final plot. Having exploratory plots be pretty, even if it’s not necessary, is clearly a bonus (particularly if Andrea is in the audience). Exploratory analyses aren’t publication ready, but they’re definitely ready to send to a colleague or to share on the company’s internal chat.
In the following exercise we’ll use CompStak data to plot the distribution of effective rent. We map mydata_01$Effective_Rent to the x-axis, first drawing a quick density plot with geom_density(). After that, we use the geom_histogram() function to tell ggplot2 that we want to plot the distribution of mydata_01$Effective_Rent as a histogram. Lastly, we customize the plot by adding labs(), to which we pass the title and x arguments to add labels, and coord_cartesian() to set the limits of the x-axis.
ggplot(mydata_01) +
geom_density(aes(Effective_Rent), fill="blue")
His_01 <- ggplot(mydata_01,
                 # Set variable to Effective Rent
                 aes(x = Effective_Rent)) +
  # Limit X axis to 1000
  coord_cartesian(xlim = c(0, 1000)) +
  # Set histogram params
  geom_histogram(binwidth = 15, aes(y = ..density..), colour = "lightblue", fill = "blue") +
  # Histogram with density plot
  geom_density(aes(color = "Density"), alpha = .2, fill = "#FF6666") +
  # Add mean line
  geom_vline(aes(xintercept = mean(mydata_01$Effective_Rent, na.rm = TRUE),
                 color = "Average Rent"), linetype = "twodash", size = 1) +
  # Add title and labels
  labs(title = "Effective Rent Distribution in CompStak Data",
       x = "Effective Rent", color = "Legend") +
  # Label the mean line; annotate() draws one label instead of one per data row
  annotate("text", x = 166.86, y = 0.006, label = "\nAvg. Effective Rent = 166.86",
           colour = "red", angle = 90, size = 4) +
  theme_bw()
His_01
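The layering idea is easier to see on a stripped-down example. The sketch below builds a histogram with a mean line from synthetic values; the data frame toy and its rent column are made up for illustration and do not come from CompStak.

```r
library(ggplot2)

# Synthetic, right-skewed "rent" values, made up for illustration only.
set.seed(1)
toy <- data.frame(rent = rlnorm(500, meanlog = 5, sdlog = 0.5))
avg_rent <- mean(toy$rent)

p <- ggplot(toy, aes(x = rent)) +
  # Histogram layer: bars count the observations in each 20-unit bin
  geom_histogram(binwidth = 20, colour = "lightblue", fill = "blue") +
  # Reference layer: a single vertical line at the sample mean
  geom_vline(xintercept = avg_rent, linetype = "twodash", colour = "red") +
  # Label layer: title and axis names
  labs(title = "Toy rent distribution", x = "Rent", y = "Count") +
  theme_bw()

p
```

Each + adds one layer or setting, so a plot can be built up and checked one step at a time.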
ggplot(data = mydata_01,
       aes(x = Effective_Rent)) +
  # breaks defines the bins directly, so no binwidth is needed
  geom_histogram(breaks = seq(10, 6000, by = 20), aes(fill = Submarket)) +
  theme_bw() + coord_cartesian(xlim = c(0, 1000)) +
  labs(title = "Number of properties relative to price and location") +
  labs(x = "Effective Rent", y = "Number of properties") +
  geom_density(alpha = .2, fill = "#FF6666")
mydata_01$Expiration.Date <- as.Date(mydata_01$Expiration.Date)
ggplot(mydata_01,
aes(x = Expiration.Date)) +
geom_histogram(aes(fill = Submarket), binwidth = 120) +
scale_x_date(labels = date_format("%Y-%b"),
limits = c(as.Date("2014-12-11"), as.Date("2040-05-01"))) +
labs(title="Lease Expiration Dates by submarket ") +
ylab("Frequency") + xlab("Year and Month") +
theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
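The as.Date() conversions in this notebook are called without a format argument, which works when the dates are stored in a layout that as.Date() recognizes by default, such as "2015-12-31". The layout used in compdata.csv is not shown here, so the sketch below uses made-up strings to illustrate what would be needed if a file used another layout.

```r
# Made-up date strings in two layouts; the real file's layout may differ.
iso_dates <- c("2015-12-31", "2016-01-15")
us_dates  <- c("12/31/2015", "01/15/2016")

# ISO-style strings are parsed without any extra arguments.
parsed_iso <- as.Date(iso_dates)

# Other layouts need an explicit format: %m = month, %d = day, %Y = 4-digit year.
parsed_us <- as.Date(us_dates, format = "%m/%d/%Y")

# Both routes give the same Date values.
all(parsed_iso == parsed_us)

# Strings that do not match the given format become NA instead of raising an error.
as.Date("31.12.2015", format = "%m/%d/%Y")
```

Because mismatched strings silently turn into NA, it is worth counting the NAs in a date column right after converting it.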
mydata_01$Commencement.Date <- as.Date(mydata_01$Commencement.Date)
ggplot(mydata_01,
aes(x = Commencement.Date)) +
geom_histogram(aes(fill = Submarket), binwidth=40) +
scale_x_date(labels = date_format("%Y-%b"),
limits = c(as.Date("2013-12-01"), as.Date("2019-05-01"))) +
labs(title="Lease Commencement Date Distribution") +
ylab("Number of properties") + xlab("Year and Month") +
theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
mydata_01$Execution.Date <- as.Date(mydata_01$Execution.Date)
mydata_01 %>%
group_by(Property.Type, Tenant.Industry, Execution.Date) %>%
summarise_at(vars(Current_Rent, Transaction.SQFT), sum, na.rm = TRUE) %>%
arrange(Execution.Date) %>%
filter(Execution.Date < "2035-1-1") %>%
ggplot(aes(Execution.Date, Transaction.SQFT, color = Property.Type))+
scale_y_continuous(name = "Transaction.SQFT", labels = comma) + theme_bw() +
# scale_colour_viridis_d(na.value="blue") + theme_bw() +
labs(title="Lease Execution Date Distribution") +
# facet_wrap(~ Property.Type) +
geom_line() +
geom_point()
mydata_01$Execution.Date <- as.Date(mydata_01$Execution.Date)
mydata_01 %>%
group_by(Property.Type, Tenant.Industry, Execution.Date) %>%
summarise_at(vars(Current_Rent, Transaction.SQFT), sum, na.rm = TRUE) %>%
arrange(Execution.Date) %>%
filter(Execution.Date < "2035-1-1") %>%
ggplot(aes(Execution.Date, Transaction.SQFT, color = Property.Type))+
scale_y_continuous(name = "Transaction.SQFT", labels = comma) + theme_bw() +
# scale_colour_viridis_d(na.value="blue") + theme_bw() +
labs(title="Lease Execution Date Distribution") +
facet_wrap(~ Property.Type) +
geom_line() +
geom_point()
mydata_01$Execution.Date <- as.Date(mydata_01$Execution.Date)
mydata_01 %>%
ggplot(aes(Execution.Date, fill = is.na(Property.Type))) +
labs(title="Lease Execution Date") +
geom_histogram() + theme_bw() + facet_wrap(~ Property.Type)
mydata_01 %>%
group_by(Property.Type, Tenant.Industry, Expiration.Date) %>%
summarise_at(vars(Current_Rent, Transaction.SQFT), sum, na.rm = TRUE) %>%
arrange(Expiration.Date) %>%
filter(Expiration.Date < "2035-1-1") %>%
ggplot(aes(Expiration.Date, Transaction.SQFT, color = Property.Type))+
scale_y_continuous(name = "Transaction.SQFT", labels = comma) + theme_bw() +
# scale_colour_viridis_d(na.value="blue") + theme_bw() +
labs(title="Lease Expiration Date Over Transaction SQFT") +
facet_wrap(~ Property.Type) +
geom_line() +
geom_point()
ggplot(mydata_01,
aes(x = Property.Type)) +
geom_bar(aes(fill = Submarket)) +
xlab("Property type")+
ylab("Number of properties")+theme_bw()+
labs(title="Property type VS submarket") +
theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Total Transaction Size (SF): Total amount of space leased by tenant in the transaction in square feet. Used when transaction size does not reflect entirety of space leased or used when multiple spaces are rented at different rates.
ggplot(data=mydata_01,
aes(x=Property.Type, y=Transaction.SQFT)) +
geom_bar(stat="identity", (aes(fill = Submarket)))+
xlab("Property type")+
ylab("Transaction size")+theme_bw()+
scale_y_continuous(name="Transaction size/SQF", labels = comma)+
labs(title="Type & Transaction SQF") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(mydata_01,
aes(x = Building.Class)) +
geom_bar(aes(fill = Submarket)) +
xlab("Building class")+
ylab("Number of properties")+theme_bw()+
labs(title="Building class VS submarket") +
theme_bw() + theme(axis.text.x = element_text( hjust = 1))
d <- ggplot(data=mydata_01,
aes(x=Submarket, y=Effective_Rent, fill = Property.Type))
d + ggtitle("Effective Rent over submarket") +
theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 9)) +
scale_fill_brewer(palette="Spectral", na.value="black", direction = -1) +
xlab("Submarket") +
ylab("Effective_Rent") +
geom_bar(stat="identity")
These properties are in fact better described as discrete variables rather than continuous ones, so a point graph might be a better graphical representation of the data.
d <- ggplot(mydata_01,
aes(y = Effective_Rent, x = Submarket, color = Property.Type ))
d + ggtitle("Effective Rent over submarket") +
scale_colour_viridis_d(option = "viridis", na.value="red") +
theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 9)) +
geom_point()
d <- ggplot(mydata_01,
aes(x = Tenant.Industry, y = Current_Rent, color = Property.Type))
d + ggtitle("Tenant industry over current rent price") +
scale_colour_viridis_d(option = "viridis", na.value="red") + theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7)) +
geom_point()
Looks like Apparel, Consumer durables and Retail have on average the highest current rents
By_Industry <- mydata_01 %>%
group_by(Tenant.Industry) %>%
tally()
## Warning: Factor `Tenant.Industry` contains implicit NA, consider using
## `forcats::fct_explicit_na`
By_Industry %>%
arrange(desc(n))
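A scatter plot suggests which industries pay more, but an average is easier to read from a table. The sketch below shows one way to compute it next to the lease count; the toy data frame and its values are hypothetical and are not taken from the CompStak data.

```r
library(dplyr)

# Made-up rents for three industries; not taken from the CompStak data.
toy <- data.frame(
  industry = c("Apparel", "Apparel", "Retail", "Retail", "Legal", "Legal"),
  rent     = c(300, 500, 200, NA, 60, 80)
)

rent_by_industry <- toy %>%
  group_by(industry) %>%
  summarise(
    n_leases  = n(),                       # rows per industry, like tally()
    mean_rent = mean(rent, na.rm = TRUE)   # average rent, ignoring NAs
  ) %>%
  arrange(desc(mean_rent))

rent_by_industry
```

Keeping the count beside the mean helps to spot averages that rest on only a handful of leases.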
d <- ggplot(mydata_01,
aes(x = Tenant.Industry , y = Transaction.SQFT , color = Property.Type))
d + ggtitle("Tenant Industry over Transaction SQFT") +
scale_y_continuous(name = "Transaction.SQFT", labels = comma) +
scale_colour_viridis_d(option = "viridis", na.value="red") + theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7)) +
geom_point()
Looks like retail spaces have the most properties with high transaction sizes
d <- ggplot(mydata_01,
aes(x = Floors.Occupied, y = Effective_Rent, color = Property.Type))
d + ggtitle("Floor occupied vs effective rent") +
scale_colour_viridis_d(option = "viridis", na.value="red") + theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 5)) +
geom_point()
d <- ggplot(mydata_01,
aes(Transaction.Quarter , y = Effective_Rent, color = Property.Type ))
d + ggtitle("Transaction period over Effective_Rent") +
scale_colour_viridis_d(option = "viridis", na.value="red") +
theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8)) +
geom_point()
top_tenants <- mydata_01 %>%
filter(Effective_Rent > 6000 ) %>%
select(Tenant.Name, Tenant.Industry, Effective_Rent, Current_Rent )
head(top_tenants)
d <- ggplot(mydata_01,
aes(x = Transaction.Quarter , y = Total.Transaction.Size , color = Property.Type))
d + ggtitle("Transaction Period over Total Transaction Size") +
scale_y_continuous(name = "Total Transaction Size", labels = comma) +
scale_colour_viridis_d(option = "viridis", na.value="red") + theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8)) +
geom_point()
Looks like most of the observations are either Retail or Office. Using the %in% operator we can isolate these for a closer look.
s1 <- mydata_00 %>%
filter(Property.Type %in% c("Office", "Retail"))
s1$Execution.Date <- as.Date(s1$Execution.Date)
ggplot(s1, aes(x = Execution.Date, y = Current.Rent..USD.)) +
ggtitle("Execution Date over Current Rent") +
geom_point(aes(color = Property.Type), size = 1) +
scale_color_manual(values = c("#00AFBB", "#E7B800")) +
#scale_color_manual(values = c("#73f759", "#fff575")) +
theme_minimal()
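For readers who have not met %in% before, here is a small base R sketch of what it does. The toy data frame is made up for illustration.

```r
# Made-up property records for illustration.
toy <- data.frame(
  id   = 1:6,
  type = c("Office", "Retail", "Industrial", "Office", "Land", "Retail")
)

# x %in% y returns TRUE for every element of x that appears anywhere in y.
keep <- toy$type %in% c("Office", "Retail")
keep

# Use the logical vector to keep only the matching rows.
toy[keep, ]

# Unlike ==, %in% never returns NA, so a missing type counts as "no match".
c("Office", NA) %in% c("Office", "Retail")
```

Inside filter(), the same logical vector decides which rows are kept.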
Conversely we can look at the Tenant Industry variable and break down the lease execution even further
s1 <- mydata_01 %>%
filter(Tenant.Industry %in% c("Apparel", "Retail" , "Consumer Durables" , " Leisure & Restaurants"))
s1$Execution.Date <- as.Date(s1$Execution.Date)
ggplot(s1, aes(x = Execution.Date, y = Current_Rent)) +
ggtitle("Breakdown by Tenant Industry") +
geom_point(aes(color = Tenant.Industry ), size = 0.8) +
# one colour per industry named in the filter above
scale_color_manual(values = c("#00AFBB", "#ef4ef4" , "#E7B800", "#FC4E07")) +
# scale_color_manual(values = c("#ef4ef4", "#fff575")) +
theme_minimal()
Here we see that more Apparel leases were signed between 2014 and 2016.
d <- ggplot(mydata_01,
aes(x = Lease.Type , y = Effective_Rent , color = Property.Type))
d + ggtitle("Lease type over effective rent") +
scale_colour_viridis_d(option = "viridis", na.value="red") + theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10)) +
geom_point()
d <- ggplot(mydata_01,
aes(Building.Size, Effective_Rent, color = Property.Type))
d + ggtitle("Building size over Effective rent") +
scale_x_continuous(name = "Building.Size", labels = comma) +
scale_colour_viridis_d(option = "viridis", na.value="red") + theme_bw() +
theme(axis.text.x = element_text(hjust = 1, size = 10)) +
geom_point()
The log transformation is, arguably, the most popular among the different types of transformations used to transform skewed data to approximately conform to normality. If the original data follows a log-normal distribution or approximately so, then the log-transformed data follows a normal or near normal distribution.
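The effect can be checked on simulated data. The sketch below draws a log-normal sample (the parameters are arbitrary choices) and compares the skewness before and after taking logs; the skewness helper is written out here and is not a function from one of the packages loaded above.

```r
# Simulated log-normal sample; the parameters are arbitrary choices.
set.seed(42)
x <- rlnorm(10000, meanlog = 0, sdlog = 1)

# Sample skewness: about 0 for a symmetric distribution,
# positive for a long right tail.
skewness <- function(v) {
  z <- (v - mean(v)) / sd(v)
  mean(z^3)
}

skewness(x)       # strongly positive: the raw values are right-skewed
skewness(log(x))  # close to zero: the logged values are roughly symmetric
```

Rents and building sizes tend to have the same long right tail, which is why the scatter plots that follow use log axes.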
d <- ggplot(mydata_01,
aes(log(Building.Size), log(Effective_Rent), color = Property.Type))
d + ggtitle("Building size over Effective rent") +
scale_x_continuous(name = "Building.Size") +
scale_colour_viridis_d(option = "viridis", na.value="red") + theme_bw() +
theme(axis.text.x = element_text(hjust = 1, size = 10)) +
geom_point()
d <- ggplot(mydata_01,
aes(log(Total.Transaction.Size), log(Effective_Rent), color = Property.Type))
d + ggtitle("Transaction size over Effective rent") +
scale_x_continuous(name = "Total.Transaction.Size", labels = comma) +
scale_colour_viridis_d(option = "viridis", na.value="red") + theme_bw() +
theme(axis.text.x = element_text(hjust = 1, size = 10)) +
geom_point()
d <- ggplot(mydata_01,
aes(log(Transaction.SQFT), log(Effective_Rent))) +
ggtitle("Transaction size over Effective rent") +
# xlab("Log of Rent Concession") + ylab(" Log of Effective Rent") +
geom_point(aes(colour = factor(Property.Type))) + theme_bw()
d + scale_colour_viridis_d(option = "viridis", na.value="red") + theme_bw() +
theme(axis.text.x = element_text(hjust = 1, size = 10)) +
# fit a line
geom_smooth(mapping = aes(linetype = "r1"), method = "lm", color = "red", lwd=0.5)
d <- ggplot(mydata_01,
aes(log(Total.Transaction.Size), log(Effective_Rent) , color = Transaction.Quarter))
d + ggtitle("Transaction size over effective rent") +
scale_x_continuous(name = "Total.Transaction.Size", labels = comma) +
scale_colour_viridis_d(option = "viridis", na.value="red") + theme_bw() +
theme(axis.text.x = element_text(hjust = 1, size = 10)) +
geom_point()
Plotly’s R graphing library makes interactive, publication-quality graphs online. It can produce line plots, scatter plots, area charts, bar charts, error bars, box plots, histograms, heatmaps, subplots, multiple-axes, and 3D (WebGL based) charts.
library(plotly)
mydata_01$Property.Type <- as.factor(mydata_01$Property.Type)
p <- plot_ly(mydata_01, x = ~Commencement.Date, y = ~City, z = ~Effective_Rent, color = ~Property.Type, colors = c('#BF382A', '#0C4B8E', '#e147ef')) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'Commencement.Date'),
                      yaxis = list(title = 'City'),
                      zaxis = list(title = 'Effective_Rent')))
# Display the interactive chart
p
# Create a shareable link to your chart (requires a Plotly account and API credentials)
chart_link <- api_create(p, filename="scatter3d-basic")
chart_link
NOTE: Animations can be created by either using the frame argument in plot_ly() or the (unofficial) frame ggplot2 aesthetic in ggplotly(). By default, animations populate a play button and slider component for controlling the state of the animation (to pause an animation, click on a relevant location on the slider bar). Both the play button and slider component transition between frames according to rules specified by animation_opts().
Go to example: https://plot.ly/~titel.y/27/#/
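A minimal sketch of the frame argument, assuming the plotly package is installed. The data frame toy and its values are made up for illustration; each distinct value of period becomes one animation frame.

```r
library(plotly)

# Made-up data: three points observed in each of three periods.
toy <- data.frame(
  period = rep(c(2013, 2014, 2015), each = 3),
  sqft   = rep(c(1000, 2500, 4000), times = 3),
  rent   = c(50, 60, 70, 55, 66, 78, 61, 70, 85)
)

p <- plot_ly(
  toy,
  x = ~sqft, y = ~rent,
  frame = ~period,            # one animation frame per period
  type = "scatter", mode = "markers"
) %>%
  # frame = time per frame in milliseconds; transition = duration of the tween
  animation_opts(frame = 1000, transition = 500)

p
```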
Leaflet is one of our all-time favorite open-source JavaScript libraries for interactive maps. It’s used by websites ranging from The New York Times and The Washington Post to GitHub and Flickr, as well as GIS specialists like OpenStreetMap, Mapbox, and CartoDB. Leaflet is designed with simplicity, performance and usability in mind. It works efficiently across all major desktop and mobile platforms, can be extended with lots of plugins, has a beautiful, easy to use and well-documented API and a simple, readable source code that is a joy to contribute to.
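# leaflet draws the map and ggmap provides geocode(); rgdal and maptools
# were retired from CRAN in 2023 and are not needed for the code below
library(leaflet)
library(ggmap)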
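You will need to sign up for your own Google Maps API key to run this next bit of code, which geolocates the addresses in the data set.
# Register your key first, e.g. register_google(key = "YOUR_API_KEY")
# origAddress is not defined earlier in the notebook; here we assume it is a
# copy of mydata_01 with the street addresses stored as character strings
origAddress <- mydata_01
origAddress$Street.Address <- as.character(origAddress$Street.Address)
for(i in seq_len(nrow(origAddress)))
{
  # print("Working...")
  result <- geocode(origAddress$Street.Address[i], output = "latlona", source = "google")
  origAddress$lon[i] <- as.numeric(result[1])
  origAddress$lat[i] <- as.numeric(result[2])
  origAddress$geoAddress[i] <- as.character(result[3])
}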
The basic usage of leaflet is that you create a map widget using the leaflet() function, and add layers to the map using the layer functions such as addTiles(), addMarkers(), and so on. Adding layers can be done through the pipe operator %>% from magrittr (you are not required to use %>%, though):
# first 80 properties
df.100 <- origAddress[1:80,]
getColor <- function(origAddress) {
  sapply(origAddress$Effective_Rent, function(Effective_Rent) {
    if(is.na(Effective_Rent)) {
      # missing rent: neutral marker instead of an error in the comparisons below
      "gray"
    } else if(Effective_Rent < 100) {
      "green"
    } else if(Effective_Rent > 300) {
      "orange"
    } else {
      "red"
    } })
}
icons <- awesomeIcons(
icon = 'ios-close',
iconColor = 'black',
library = 'ion',
markerColor = getColor(df.100)
)
leaflet(df.100) %>% addTiles() %>%
addProviderTiles("CartoDB.Positron") %>%
addAwesomeMarkers(~lon, ~lat, icon=icons, label=~as.character(Effective_Rent))
The leaflet() function and all layer functions have a data argument that can take several types of spatial data objects, including matrices and data frames with latitude and longitude columns, spatial objects from the sp package (e.g. SpatialPoints and SpatialPointsDataFrame, etc), and the data frame returned from maps::map(). When you have got a data object in leaflet() or layer functions, you may use the formula interface to pass values of variables to function arguments.
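A minimal sketch of the formula interface, assuming the leaflet package is installed. The data frame toy, its coordinates (points around Midtown Manhattan) and its rents are made up for illustration.

```r
library(leaflet)

# Made-up points near Midtown Manhattan; coordinates and rents are illustrative.
toy <- data.frame(
  name = c("Property A", "Property B", "Property C"),
  lon  = c(-73.9857, -73.9772, -73.9712),
  lat  = c(40.7484, 40.7527, 40.7580),
  rent = c(85, 150, 320)
)

m <- leaflet(toy) %>%
  addTiles() %>%
  # ~lon, ~lat and ~name are formulas evaluated against the columns of `toy`
  addCircleMarkers(lng = ~lon, lat = ~lat, label = ~name,
                   radius = ~sqrt(rent))

m
```

Because the data frame is passed once to leaflet(), every layer function can refer to its columns with ~ instead of repeating toy$.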
There is a lot more to learn about leaflet. Here, you’ve just scratched the surface. We highly recommend that you proceed to The Map Widget page before exploring the rest of this site, as it describes common idioms we’ll use throughout the examples on the other pages. Although we have tried to provide an R-like interface to Leaflet, you may want to check out the API documentation of Leaflet occasionally when the meanings of certain parameters are not clear to you.
End of notebook 01