This assignment asks us to use a dataset on the 100 tallest buildings in the world to create summary graphs. Here is what the uploaded data look like:
## 'data.frame': 100 obs. of 9 variables:
## $ Rank : int 1 2 3 4 5 6 7 7 9 10 ...
## $ Building.Name: Factor w/ 100 levels "23 Marina","4 World Trade Center",..: 15 53 65 79 73 40 68 69 100 96 ...
## $ City : Factor w/ 40 levels "Abu Dhabi\xa0(AE)",..: 9 23 28 35 32 13 19 19 27 6 ...
## $ HeightMeter : num 828 601 541 508 492 ...
## $ HeightFeet : Factor w/ 77 levels "1,001","1,002",..: 65 64 63 62 61 60 59 59 58 57 ...
## $ Floors : int 163 120 94 101 101 108 88 88 66 108 ...
## $ Completed : int 2010 2012 2014 2004 2008 2010 1998 1998 2010 1974 ...
## $ Material : Factor w/ 4 levels "composite","concrete",..: 4 4 1 1 1 1 1 1 1 3 ...
## $ Use : Factor w/ 13 levels "hotel","hotel / office",..: 7 8 6 6 2 2 6 6 2 6 ...
## Source: local data frame [100 x 9]
##
## Rank Building.Name City HeightMeter
## (int) (fctr) (fctr) (dbl)
## 1 1 Burj Khalifa Dubai\xa0(AE) 828.0
## 2 2 Makkah Royal Clock Tower Hotel Mecca\xa0(SA) 601.0
## 3 3 One World Trade Center New York City\xa0(US) 541.3
## 4 4 TAIPEI 101 Taipei\xa0(TW) 508.0
## 5 5 Shanghai World Financial Center Shanghai\xa0(CN) 492.0
## 6 6 International Commerce Centre Hong Kong\xa0(CN) 484.0
## 7 7 Petronas Twin Tower 1 Kuala Lumpur\xa0(MY) 451.9
## 8 7 Petronas Twin Tower 2 Kuala Lumpur\xa0(MY) 451.9
## 9 9 Zifeng Tower Nanjing\xa0(CN) 450.0
## 10 10 Willis Tower Chicago\xa0(US) 442.1
## .. ... ... ... ...
## Variables not shown: HeightFeet (fctr), Floors (int), Completed (int),
## Material (fctr), Use (fctr)
When the datafile is read into RStudio from the csv format, certain non-alphanumeric characters are insterted into the City/Country column. Because we will be doing our analyses by city and then by country, we need to separate the country code from the city variable. To prepare for that, we need to remove the inserted characters and replace them with spaces. This requires converting the City variable from a factor to a character variable, then doing a global substitution using gsub() to replace the unwanted characters:
tallbuildings$City = as.character(tallbuildings$City) # converts factor variable to character variable.
tallbuildings$City = gsub("\xa0", " ", tallbuildings$City) # Replaces the unwanted string with a space.
View the structure of the data to confirm the replacement:
str(tallbuildings)
## 'data.frame': 100 obs. of 9 variables:
## $ Rank : int 1 2 3 4 5 6 7 7 9 10 ...
## $ Building.Name: Factor w/ 100 levels "23 Marina","4 World Trade Center",..: 15 53 65 79 73 40 68 69 100 96 ...
## $ City : chr "Dubai (AE)" "Mecca (SA)" "New York City (US)" "Taipei (TW)" ...
## $ HeightMeter : num 828 601 541 508 492 ...
## $ HeightFeet : Factor w/ 77 levels "1,001","1,002",..: 65 64 63 62 61 60 59 59 58 57 ...
## $ Floors : int 163 120 94 101 101 108 88 88 66 108 ...
## $ Completed : int 2010 2012 2014 2004 2008 2010 1998 1998 2010 1974 ...
## $ Material : Factor w/ 4 levels "composite","concrete",..: 4 4 1 1 1 1 1 1 1 3 ...
## $ Use : Factor w/ 13 levels "hotel","hotel / office",..: 7 8 6 6 2 2 6 6 2 6 ...
Because the assignment includes analyses by city and by copuntry, we need to separate the city name and the country code into two columns. We will also clean up the country codes by using substr() to extract just the alphabetical characters and leave behind the parentheses.
tallbuildings <- separate(tallbuildings, City, c("City", "CountryCode"), sep = -5)
# sep = -5 forces separation at the space before the country code. The default would separate at a space, but in several city names there is more than one space. I tried using "(" as the separation character, but that was interpreted as the beginning of a regular expression that then lacked a corresponding closing parenthesis.
tallbuildings$CountryCode <- substr(tallbuildings$CountryCode, start=2, stop=3)
# I've since learned that another way to do this would be to use separate(tallbuildings, City, c("City","CountryCode"),sep="[:punct:]"). The expression "[:punct:]" is a named class of characters that includes all punctuation characters.
Now we are ready to proceed with the assignment!
For part 1, please display how many buildings are there in each city represented in that dataset. An arrangement in either an ascending or a descending order of number of buildings is always helpful to an eye. For part 2, plot the cities according to the mean height (in meters) of buildings in a city.
CityData <- tallbuildings %>% group_by(City) %>%
summarise(NumBldgs = length(City), MeanMeters = mean(HeightMeter))
ggplot(data=CityData, aes(x=reorder(City, NumBldgs), y=NumBldgs)) + # Uses reorder(x,X) to order the City variable by the value of NumBldgs.
geom_bar(stat="identity") + # Length of bars = a value, not a count.
coord_flip() + # Switches location of x- and y-axes.
labs(list(title = "Cities ranked by the number of tall buildings", x = "", y = "Number of Tall Buildings")) +
theme_classic()
ggplot(data=CityData, aes(x=reorder(City, MeanMeters), y=MeanMeters)) + # Uses reorder(x,X) to order the City variable by the value of MeanMeters.
geom_bar(stat="identity") + # Length of bars = a value, not a count.
coord_flip() + # Switches location of x- and y-axes.
labs(list(title = "Cities Ranked by the mean height of tall buildings", x = "", y = "Height (m)")) +
theme_classic()
Just for fun, here is a a scatter plot to see if there is a correlation between number of buildings and mean height. Do cities with lots of tall buildings also have taller ones?
ggplot(data=CityData, aes(x=NumBldgs, y=MeanMeters)) +
geom_point() +
labs(list(title = "Comparison between mean height and number of tall buildings in a city", x = "Number of Tall Buildings", y = "Mean Height (m)")) +
theme_classic()
Please redo parts 1 and 2, summarizing and plotting the data by country.
This is why we split out the country codes earlier. Because the two-character country codes in the dataset may not be obvious to readers of our graphs, we would like to use the full English names as labels. The package “countrycode” allows a user to convert country information to and from a variety of different formats. The two-character format in the dataset is known as “iso2C”, and the format of the full English name is “country.name”.
Load the countrycode package and convert the country codes to names, keeping the codes in case we still want them in the dataset:
install.packages("countrycode", repos="http://cran.rstudio.com/")
##
## The downloaded binary packages are in
## /var/folders/ss/rgxgrncn6xl0rg2z5ksg3sdn771znz/T//RtmpLhrcEp/downloaded_packages
library(countrycode)
tallbuildings$Country <- countrycode(tallbuildings$CountryCode,"iso2c", "country.name", warn=TRUE)
CountryData <- tallbuildings %>% group_by(Country) %>%
summarise(NumBldgs = length(Country), MeanMeters = mean(HeightMeter))
ggplot(data=CountryData, aes(x=reorder(Country, NumBldgs), y=NumBldgs)) + # Uses reorder(x,X) to order the City variable by the value of NumBldgs.
geom_bar(stat="identity") + # Length of bars = a value, not a count.
coord_flip() + # Switches location of x- and y-axes.
labs(list(title = "Countries ranked by the number of tall buildings", x = "", y = "Number of Tall Buildings")) +
theme_classic()
ggplot(data=CountryData, aes(x=reorder(Country, MeanMeters), y=MeanMeters)) + # Uses reorder(x,X) to order the Country variable by the value of MeanMeters.
geom_bar(stat="identity") + # Length of bars = a value, not a count.
coord_flip() + # Switches location of x- and y-axes.
labs(list(title = "Countries ranked by the mean height of tall buildings", x = "", y = "Mean Height (m)")) +
theme_classic()
And once again, just for fun, a scatter plot of number of tall buildings and mean height, by country:
ggplot(data=CountryData, aes(x=NumBldgs, y=MeanMeters)) +
geom_point() +
labs(list(title = "Comparison between the mean height and the number of tall buildings in a country", x = "Number of Buildings", y = "Mean Height (ft.)")) +
theme_classic()
Recreate the bar chart of countries ranked by the mean height of their tall buildings. For this part, please color each bar for a country based on the number of buildings from this dataset that are present in that country.
For this, we will use the horizontal bar chart from above, with the option aes(fill=NumBldgs) to color the bars appropriately.
ggplot(data=CountryData, aes(x=reorder(Country, MeanMeters), y=MeanMeters)) + # Uses reorder(x,X) to order the Country variable by the value of MeanMeters.
geom_bar(stat="identity", aes(fill=as.factor(NumBldgs))) + # Length of bars = a value, not a count. Bars color-coded by the number of buildings in the country. Designating the number of buildings as a factor means each value has a distinctly different color. Using the number of buildings as a numeric variable gives a continuous color scale.
coord_flip() + # Switches location of x- and y-axes.
labs(list(title = "Countries ranked by the mean height of tall buildings", x = "", y = "Mean Height (m)")) +
theme_classic()
This again requires conversion of the data format. Because the height values in feet include a comma as the thousands separator, the HeightFeet variable is read in as a factor variable, rather than numeric. In order to calculate means, we need to convert HeithgFeet to a character variable, use gsub() to remove the commas, and convert the variable to a numeric type. Looking at the structure of the dataset confirs the changes have been made.
tallbuildings$HeightFeet = as.character(tallbuildings$HeightFeet) # converts factor variable to character variable.
tallbuildings$HeightFeet = as.numeric(gsub(',','',tallbuildings$HeightFeet)) # removes comma and converts to numeric.
str(tallbuildings)
## 'data.frame': 100 obs. of 11 variables:
## $ Rank : int 1 2 3 4 5 6 7 7 9 10 ...
## $ Building.Name: Factor w/ 100 levels "23 Marina","4 World Trade Center",..: 15 53 65 79 73 40 68 69 100 96 ...
## $ City : chr "Dubai " "Mecca " "New York City " "Taipei " ...
## $ CountryCode : chr "AE" "SA" "US" "TW" ...
## $ HeightMeter : num 828 601 541 508 492 ...
## $ HeightFeet : num 2717 1972 1776 1667 1614 ...
## $ Floors : int 163 120 94 101 101 108 88 88 66 108 ...
## $ Completed : int 2010 2012 2014 2004 2008 2010 1998 1998 2010 1974 ...
## $ Material : Factor w/ 4 levels "composite","concrete",..: 4 4 1 1 1 1 1 1 1 3 ...
## $ Use : Factor w/ 13 levels "hotel","hotel / office",..: 7 8 6 6 2 2 6 6 2 6 ...
## $ Country : chr "United Arab Emirates" "Saudi Arabia" "United States" "Taiwan, Province of China" ...
CityDataFeet <- tallbuildings %>% group_by(City) %>%
summarise(NumBldgs = length(City), MeanFeet = mean(HeightFeet))
ggplot(data=CityDataFeet, aes(x=reorder(City, MeanFeet), y=MeanFeet)) + # Uses reorder(x,X) to order the City variable by the value of MeanFeet.
geom_bar(stat="identity") + # Length of bars = a value, not a count.
coord_flip() + # Switches location of x- and y-axes.
labs(list(title = "Cities ranked by the mean height of tall buildings", x = "", y = "Mean Height (ft.)")) +
theme_classic()
CountryDataFeet <- tallbuildings %>% group_by(Country) %>%
summarise(NumBldgs = length(Country), MeanFeet = mean(HeightFeet))
ggplot(data=CountryDataFeet, aes(x=reorder(Country, MeanFeet), y=MeanFeet)) + # Uses reorder(x,X) to order the Country variable by the value of MeanMeters.
geom_bar(stat="identity") + # Length of bars = a value, not a count.
coord_flip() + # Switches location of x- and y-axes.
labs(list(title = "Countries ranked by the mean height of tall buildings", x = "", y = "Mean Height (ft.)")) +
theme_classic()
This is problematic for a couple of reasons. First, nearly all of the buildings have multiple uses, so the same building will show up several times in any comparison. Second, the “Use” variable has all of the different uses for a building listed together. What I need is a summary data set that has one column with the different individual uses, and another column with the mean heights for buildings with that use. This took me a long time to figure out. The solution I finally came up with was this:
I created a data frame called “UseData” with variables “Use”, “MeanFeet”, “MeanMeters”, and “Count”. The last three are the numeric variables that will contain the mean heights in feet and meters, and the number of buildings that have a given use. Then, using a for() loop, I repeatedly applied the grep() command to the tallbuildings data frame, searching for rows in which the Variable “Use” contained the string defining a particular use (e.g., “hotel”). The grep() call returns a list of row numbers, which I used as an argument in the slice() command to create a temporary subset of the data. I then used the commands mean() and length() on this subset to assign the means and counts to elements in the appropriate vectors in UseData.
Use = c("hotel", "office", "residential", "serviced apartments", "retail", "other")
MeanFeet = c(1:6)
MeanMeters = c(1:6)
Count = c(1:6)
UseData <- data.frame(Use, MeanFeet, MeanMeters, Count)
for(i in 1:6) {
tempdata <- slice(tallbuildings, grep(UseData$Use[i], tallbuildings$Use))
MeanFeet[i] = mean(tempdata$HeightFeet)
MeanMeters[i] = mean(tempdata$HeightMeter)
Count[i] = length(tempdata$HeightFeet)
}
UseData <- data.frame(Use, MeanFeet, MeanMeters, Count)
Because there are a small number of values in the dataset, and because knowing the precise values might be important, I opted to use a table display instead of a graph. I created a vector of column names to use as headings for the table, and limited the printing of values to a single decimal place.
colnames(UseData) <- c("Use", "Mean Height (ft.)", "Mean Height (m)", "Number of Bldgs.")
kable(UseData, digits = 1, col.names = colnames(UseData), align = c('l', 'c', 'c', 'c'), caption = "Mean Heights of Buildings by Type of Use")
| Use | Mean Height (ft.) | Mean Height (m) | Number of Bldgs. |
|---|---|---|---|
| hotel | 1190.1 | 362.7 | 40 |
| office | 1166.2 | 355.5 | 72 |
| residential | 1133.4 | 345.4 | 31 |
| serviced apartments | 1118.0 | 340.8 | 2 |
| retail | 1072.7 | 326.9 | 3 |
| other | 1972.0 | 601.0 | 1 |
The formatting could be better in the html version (I’m not impressed with size and visibility of the caption, for example), but that will have to wait until I have more time. The pdf table is very pretty.
Okay, here’s a better way of acheiving the same thing: Separate the Use variable into its components (there are at most three of them for any one building–call them “use1”, “use2”, and “use3”), then gather them back into a key-value pair where the key is “UseNum”, which will have values “use1”, “use2”, etc, and the value will be “UseType”, which will have values of “hotel”, or “office”, etc. Each building will be repeated six times in this data set, but will have values for at most three of these key-value pairs, so you can remove the rows that have “NA” in the Value column. Then you can use group_by() and summarize() to calculate the mean heights.
WideData <- separate(tallbuildings, Use, c("use1", "use2", "use3"), sep = "/", remove = TRUE)
## Warning: Too few values at 90 locations: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
## 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, ...
TallData <- gather(WideData, key = UseNum, value = UseType, use1:use3, na.rm = TRUE)
# The separate() function will leave white spaces at the beginning and end of some of the values of "UseType", and those will be interpreted as different levels of the factor. StackOverflow.com has a function for trimming that white space (courtesy of Vivek Patil): #http://stackoverflow.com/questions/2261079/how-to-trim-leading-and-trailing-whitespace-in-r
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
TallData$UseType <- trim(TallData$UseType)
UseData2 <- group_by(TallData, UseType) %>% summarize(MeanFeet = mean(HeightFeet), MeanMeters = mean(HeightMeter), Count = length(HeightFeet))
colnames(UseData2) <- c("Use", "Mean Height (ft.)", "Mean Height (m)", "Number of Bldgs.")
kable(UseData2, digits = 1, col.names = colnames(UseData2), align = c('l', 'c', 'c', 'c'), caption = "Mean Heights of Buildings by Type of Use")
| Use | Mean Height (ft.) | Mean Height (m) | Number of Bldgs. |
|---|---|---|---|
| hotel | 1190.1 | 362.7 | 40 |
| office | 1166.2 | 355.5 | 72 |
| other | 1972.0 | 601.0 | 1 |
| residential | 1133.4 | 345.4 | 31 |
| retail | 1072.7 | 326.9 | 3 |
| serviced apartments | 1118.0 | 340.8 | 2 |