The Goal

At the start of DataFest in April, you will be given a data set and provided with the research questions your client would like you to explore - and that's it! How you approach the problem, what model you choose...everything is completely up to you.

Throughout this course, we are going to practice some of the skills that will help you tackle these sorts of open ended problems. For today, we are going to start with reviewing how to visualize data in R. We will also see some best practices for visualizations (like adding appropriate labels and titles to graphs) that your team should use for DataFest and indeed any formal report or presentation! The activity will also give you chance to get to know your team mates.

As you work through this activity, feel free to use RMarkdown, R scripts, or whatever you prefer. This assignment is just to help you all practice. I strongly recommend storing your work in a folder you can find later, so you can refer to it during DataFest!

The Data

Our data today is provided by a client who is interested in exploring how existing buildings can be updated to improve their energy efficiency and reduce their carbon foot print. They are interested in identifying traits associated with high energy usage, and recommendations on how buildings can improve their energy efficiency.They have provided us with information on over 70,000 buildings.

The link to the data can be found in the form of a ZIP file at: https://drive.google.com/file/d/1hM08da1DH21P5I126FyEoKJ0OwIElFUo/view?usp=sharing

Our response variable (outcome variable) is:

There are other columns in the data set. These represent identifying information about the building and possible features that we can use to help us build a model for Site EUI.

First Steps: The Plan

Now, we have a lot of rows, and a LOT of possible features in this data set. The data at DataFest may be even bigger. Sometimes the hardest part is figuring out how to get started, so let's talk about that.

The first question to ask yourselves: What is the goal? Are we building a model to help predict something? If so, then our goal is prediction. Is our goal to understand relationships among some of the features and our response? If so, then our goal is association.

There are other types of goals, like establishing causal relationships, but generally with DataFest we have either prediction or association.

  1. Based on what we have seen so far, is this a prediction or an association task?

Why does this matter? Well, if our goal is only prediction, we can use really complicated models, and as long as they predict well, we are probably fine with that. However, if our goal is association, we have to choose a model that allows us to actually interpret the relationships going on in the data. Really complex models may not allow this. This means we need to be aware of the goal of the analysis from the very start.

Now that we know the goal, the next step is to start digging in to the data. Characteristics of the data themselves (yes, data is a plural term!) can also help us decide on modeling techniques. In the real competition, this is a good time to split up the features you might want to visualize among your team - it's more efficient to split up the work.

To make visualizations, we will be working with a group of functions in the ggplot2 package in R. If you have already installed ggplot2, skip the next section.

Installing ggplot2

The package ggplot2 is a powerful collection of R functions for creating flexible, professional graphics. A package is collection of R codes that relate to one another. When you load a package into R, you give R access to all of the functions within the package. We will be using several packages as we move through this course.

The first step to using ggplot2 is to install the ggplot2 package. Go to the top of your RStudio window and find "Tools". From there, click on "Install Packages." In the blank box, type in ggplot2, and hit install. The computer should automatically begin to load in the packages that you need, but this may take a minute.

Now, some of you may see an error about language parsing, or an error involving rlang. If you do, go ahead and install the rlang package. Then, copy and paste the following into a chunk and hit play. Nothing will seem to happen, and that's okay.

library(rlang)

Note that this process of installing a package is one you need to do only once. Think of this as teaching R a new set of skills. Once it knows the skills, you don't have to teach it again.

Once you have installed the ggplot2 package, you need to tell R that you would like to begin using the function by loading the library. Remember that we said installing a package is like teaching R a skill? Loading a library is how we tell we R we want it to use those skills. To do this, create a chunk in your RMarkdown and copy and paste the following, and hit play.

suppressMessages(library(ggplot2))

Note that this process of loading a library is one you have to do ONCE each time we start a lab or project.This tells R "Hey, remember those skills we taught you? Use them."

You are now ready to begin EDA.

EDA: One Numeric Variable

To start off with, let's create a visual to explore the distribution the response variable, site EUI. What are we looking for? Things like modality, outliers, spread, etc.

  1. What types of variable is site EUI, and based on that, what kinds of plots could we create to explore it? List at least two.

We are going to start off with a histogram.To create the histogram of site EUI, paste the following code in a chunk and hit play.

ggplot(train, aes(x=site_eui)) + geom_histogram(bins = 20, fill='blue', col = 'black')

The creation of plots in ggplot2 requires building the plot in layers. First, we build the background, the grid on which we will be building our graph. This is the job of the ggplot() part of the code. Once we have built the background, we are ready to plot our data. The command we will use for this depends upon the data type we are working with. In this case, we want a histogram. The command we use to build a histogram is geom_histogram(). In this same layer, we specify that we want the bars to be filled ('fill') in blue and we want them outlined ('col') in black. The only other part of the command is bins=20. We need this because for histograms, we have to specify how many bins we want.

Notice that to add each layer to the graph, in the code we use a plus sign. We add the background AND THEN the bars to make the final graph. Let's break that down in more detail:

Let's try it.

  1. Create a histogram of average temperatures in March, using 15 bins. Make the bars of the histogram cyan and outline them in white. Show your result.

First Rule of Graphs: Labels and Titles

We mentioned at the start of this lab that we would cover some best practices for visualizations. Here is the first - always label your graphs appropriately.

What does that mean? It means every graph you create needs a title (like Figure 1, Figure 2, etc.) and your axes need to be labelled in such a way that a person looking at your graph can tell what information is being displayed on each axis.

The command labs, which stands for labels, is used for this.

ggplot(train, aes(x=site_eui)) + geom_histogram(fill='blue', col = 'black', bins = 20) + labs(title="Figure 1:", x = "Site EUI", y = "Frequency")

This ggplot syntax actually mimics the ways humans would draw a graph by hand. First, you draw the axes. Then, you add on your data. Finally, you add a label. Thinking through the steps in this manner will help you understand the syntax of this package.

You can also add captions to the bottom of graphs:

ggplot(train, aes(x=site_eui)) + geom_histogram(fill='blue', col = 'black', bins = 20) + labs(title="Figure 1:", x = "Site EUI", y = "Frequency", caption = "A histogram of site EUI.")
  1. Copy the code you used to make the graph from Question 3. Now, add the title "Figure 2:" and add appropriate labels to the x and y axis, and add a caption.
  2. One VERY important thing to remember when we make plots is to make sure the axes are easily interpretable by your reader. You do not want to use default variable names, like "march_avg_temp". Instead, we want clear labels like "March Average Temperature". This is important for any plots you make in statistics - label your axes appropriately and title your graphs.

    Now let's try a new plot: a box plot.

    1. Create a box plot of site EUI. Fill the plot in gold and outline it in black. Title your plot Figure 3, and label the x axis "Site EUI". The y axis does not matter in this box plot (it literally gives us no information), so we want a blank axis label " " . Hint: Instead of a geom_bar, we want geom_boxplot, and box plots do not have bins.
    2. Now that we have our graphs, let's think about what information they yield.

      1. Based on Figure 1 and Figure 3, describe the distribution of site EUI. Do you see any outliers? Does the distribution seem unimodal or multimodal? Symmetric or skewed? Etc.
      2. All of these questions can help us determine the type of model we might want to consider.

        Plotting Two Numeric Variables

        Once we have explored our response variable, we will likely want to get some idea of how some of the features relate to the response variable. This means that we want to create a graph to examine the relationship between two variables.

        Let's start with the amount of square feet in the building. This means we have two numeric variables, so we need a scatter plot. we could create the necessary plot using the following code:

        ggplot(train, aes(x=floor_area, y = site_eui)) + geom_point()

        Just as before, we have two layers. The first draws the background and the second adds on the graph. Here, in the first layer, we specify both the x and y axis of the graph, as we have two variables that we are working with.

        1. Start with the code above to create a scatter plot for the year the building was constructed (X) vs site EUI (Y). Color the dots purple.(Hint: This time we are not specifying a fill, but a color). Title your plot Figure 4, and label the x axis and the y axis.
        2. In the figure you have just constructed, you should notice something unusual. There are 5 buildings which were built in year 0!

          which(train$year_built ==0)

          This is part of why we do EDA when we work with real data. We can see data quality issues, like missing or incorrect data.

          1. What do you think a 0 for year built is supposed to indicate? And how would you suggest dealing with these 5 buildings? Explain your choice.
          2. Stacking Graphs

            Okay, so we have seen how to make a few different kinds of graphs. Great. However, we have a lot of features to explore, so we have the potential for a lot of plots. That starts to take up a lot of space in a report. One nice way to present multiple graphs is to stack them.

            Suppose I want to make a histogram of site EUI (as we have already done). The code I need for that is:

            ggplot(train, aes(x=site_eui)) + geom_histogram(fill='blue', col = 'black', bins = 20) + labs(title="Figure 1:", x = "Site EUI", y = "Frequency")

            If I put that in a chunk and press play, one histogram will appear on my screen. Let's suppose I want to show a box plot along with this histogram. I can create both graphs and have them both print out separately in my knit document. However, I can also tell the computer to print more than one graph at once to save space. Try the following:

            g1 <- ggplot(train, aes(x=site_eui)) + geom_histogram(fill='blue', col = 'black', bins = 20) + labs(title="Figure 1:", x = "Site EUI", y = "Frequency")
            g2 <- ggplot(train, aes(x=site_eui)) + geom_histogram(fill='blue', col = 'black', bins = 20) + labs(title="Figure 1:", x = "Site EUI", y = "Frequency")
            gridExtra::grid.arrange(g1,g2, ncol = 2)
            

            If you get a warning message about not having gridExtra, this means you will need to install the package using the same steps we did above to install ggplot2. We have to install packages all the time with R, depending on the version of R you have and your computer system.

            What you will notice is that we have stored each of the two graphs under a name. Our histogram is stored under g1 and our box plot is stored under g2. Then, we use a special code called gridExtra::grid.arrange to help us arrange the graphs in a grid. In our case, we have to graphs, and we want them side by side. This means we want the graphs in a 1 (row) by 2 (column) grid.

            To create the 1 by 2 grid, we feed the computer our two graphs, and then tell it we want the figures in 2 columns (next to each other) by specifying ncol = 2. In other words, the number of columns we want is 2!

            1. Create a (1) box plot for site SUI, (2) histogram for site EUI, (3) box plot for the average temperature in March and (4) histogram for the average temperature in March. Stack 4 graphs in a 2 x 2 grid (2 rows and 2 columns). You have already made some of the graphs you need!

            Changing Themes

            When using ggplot2, you also have an option to change the theme of your graphs. Consider these:

            The Light Theme

            ggplot(train, aes(x=site_eui)) + geom_histogram(fill='blue', col = 'black', bins = 20) + labs(title="Figure 1:", x = "Site EUI", y = "Frequency", caption = "A histogram of site EUI.") + theme_light()

            The BW Theme

            ggplot(train, aes(x=site_eui)) + geom_histogram(fill='blue', col = 'black', bins = 20) + labs(title="Figure 1:", x = "Site EUI", y = "Frequency", caption = "A histogram of site EUI.") + theme_bw()

            The Minimal Theme

            ggplot(train, aes(x=site_eui)) + geom_histogram(fill='blue', col = 'black', bins = 20) + labs(title="Figure 1:", x = "Site EUI", y = "Frequency", caption = "A histogram of site EUI.") + theme_minimal()

            Take a look and, as a team, decide what you like!

            Making Tables

            While we are thinking about EDA, we don't only focus on numeric variables. What if we have categories? With categorical variables, we can make tables or bar plots. We make bar plots just like we made histograms, except their are no bins and we use geom_bar instead of geom_histogram.

            1. Make a bar plot to display the building class. Make sure you title your graphs appropriately!
            2. There are several ways to make tables in R, but we will discuss two. The first is very direct. We tell R we want to use the train data set and the variable building_class, by using the code train$building_class (dataset$variable). Then, we use the table(whatWeWantToMakeATableWith) command to actually make the table.

              table(train$building_class)

              However, this makes a table that is not particularly pretty or professional when you knit. A second option that does create professional tables is:

              knitr::kable(table(train$building_class), col.names=c("Building Type", "Count") )

              The code is more complex, but the heart of it is the same table. This table will not look very pretty when you press play, but go ahead and knit. See how nicely the table gets formatted?

              1. Create a table, using the second way to make a table, for the facility type. Make sure your columns are labelled Facility Type and Count. (Hint: The col.names part of the code above will help with that.)
              2. Next Steps

                Now that we know some basics of visualization, think about if this was the real DataFest. What would you do now?

                1. Discuss as a team how you would proceed with exploring and visualizing this data set. This is called making a plan, and is critical to success in applied statistics/data science work.
                2. We will keep working with this data set for future activities.

                  Creative Commons License
                  This work was created by Nicole Dalzell is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2021 January 6.
                  The css file used to format this lab was retrieved from the GitHub of Mine Çetinkaya-Rundel, version 2016 Jan 13.
                  The data set used in this lab is from the palmerpenguins library in R: Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218. .