Introduction to Data-Viz

# Introduction to Data-Viz
----
### 
### June 23, 2022

---

<img src="img/team_2022.png" height="550px"/>
---
# Agenda

.pull-left[**Part-1  Conceptual Overview**
- Purpose of Visualizations
- Principles
  + Perception
  + Color
  + Cognitive Processing
  ]

]
]

---

Part 1:

----

Intro to Data Visualization: Conceptual Overview

---

# Purpose of Data Visualizations

???

Looking at data graphed visualization may be easy to understand but it can be messy.
This visualization looks at Twitter sentiment of four popular Learning Management Systems.

Looking at the the frequency table on the left it is hard to get a good sense of the twitter sentiment for each LMS.

In the the visualization on the right we can quickly see that it looks as though Blackboard had the most positive scores. However, when really diving in and reading the analysis you know that the Twitter counts for Blackboard were the lowest. So, this graph is very informative but requires you to look at all of the analysis for understanding.

In "Data Science for Education," Estrellado et al. (2020) states," that while this relationship might seem intuitive, you can’t draw a definitive conclusion just from the visualization, because it doesn’t tell you whether the relationship between those variables is meaningful." In the proceeding session we will talk about modeling.

The authors explain that, while descriptive statistics and data visualization during the**Explore** step can help us to identify patterns and relationships
in our data, statistical models can be used to help us determine if relationships, patterns and trends are actually meaningful.

???

Healy, 2018 explains that knowing who you are creating the visualization for is as important as how it looks in an abstract. Taking large sets of data and having an idea of what sort of visual work is most effective to deliberately simplify things in a way that lets us see past a cloud of data points shown in a figure.

But, know who the visualization is for (who the stakeholders are) is just as important as tidying the data. Having an image for an audience of experts reading a professional journal is going to differ from having one to be interpreted by the public.

???

We will be using the `ggplot2` package which requires you to have a `tidy` format. Have your data in long format rather than short format. We put our data into a `tidy` format during the `wrangle` phase of our learning analytics cycle.
- join data using a key if there are multiple data sets.
- each variable is in its own column using `seperate` function
- each observation is saved in its own row 
- we also create a variable with observations when we used the `mutate` function.

Once that happens you can use the various geometric object names from ggplot2 to display your plots.

]

---
# Principles

???
Healy suggests we can often clean up the typeface, remove extraneous colors and backgrounds, and simplify, mute, or delete gridlines, superfluous axis marks, or needless keys and legends.

Following (Tufte, 1983, p. 177)principles of having “more routine, workaday designs,” create charts 
-by having a properly chosen format and design,” 
-“use words, numbers, and drawing together,” 
-“display an accessible complexity of detail” and “avoid content-free decoration, including chartjunk.

Not so much on how it looks rather, how human visual perception works.

<img src="img/perception.png" height="400px"/>
]
???
This is a great graphic from Healy's book.

What do you see with the graph on the **left**?

- the lines appear at first glance to be converging as the value of x increases. 
- they might even intersect if we extended the graph out further

What about the graph on the **right**?
- the curves are clearly equidistant from the beginning.

These are the same graph - But misleading. The apparent convergence in the left panel is just a result of the aspect ratio of the figure.

<img src="img/color.png" height="400px"/>
]

.pull-left[
- **Proximity**
- **Similarity**
- **Connection**
- **Continuity**
- **Closure**
- **Figure and Ground**
- **Common Fate**

]
.pull-right[

]
]
???
- **Proximity**: Things that are spatially near to one another seem to be related.
- **Similarity**: Things that look alike seem to be related.
- **Connection**: Things that are visually tied to one another seem to be related.
- **Continuity**: Partially hidden objects are completed into familiar shapes.
- **Closure**: Incomplete shapes are perceived as complete.
- **Figure and Ground**: Visual elements are taken to be either in the foreground or the background.
- **Common Fate**: Elements sharing a direction of movement are perceived as a unit.

]

---

----

Code-Along

---
# Explore setup

]

```r
#load library
library(skimr)

#skim data
skim(data_to_explore)
```

]

???
As highlighted in both
[DSEIUR](https://datascienceineducation.com/c03.html#doing-analysis-exploring-visualizing-and-modeling-data)
and Learning Analytics Goes to School, calculating summary statistics
and data visualization are a key part of exploratory data analysis. One
goal in this phase is explore questions that drove the original analysis
and develop new questions and hypotheses to test in later stages. 
In this part we will:
-   **Summarize Key Stats**. We'll learn about the {skmir} package for
    pulling together some quick creating descriptive statistics when
    your goal is to understand your data internally.

-   **Visualize Data**. We'll introduce the histogram geom for taking a
    quick peak the distributions of some key variables and put together
    some scatter plots for examining potential relationships between
    time spent and student performance.
    
    On the right,
    we see that the skimr package is used for creating descriptive statistics *when your goal
is to understand your data internally* (rather than to create a tablefor an external-to-the-research-team audience, like for a journal article).

-   Pass to the `skim()` function a single argument
-   That single *argument* is the data frame (aka in tidyverse parlance,
    a tibble) for which you are aiming to calculate descriptive
    statistics
    
**Note:** If you are having difficult viewing your data in the code
chunk, try clicking the icon in the output that looks like a spreadsheet
with an arrow on it to expand your output in a separate window.

```r
data_to_explore %>% 
 select(c('subject', 'gender', 'proportion_earned', 'time_spent')) %>% 
 filter(subject == "OcnA" | subject == "PhysA") %>%
 skim() <##
```

]
]
---

# GGplot2

]

???
In the last session we looked at the first two parts of the Learning Analytics Workflow, *Prepare* and *Wrangle.*
In this session we will continue with the *Explore* phase in the LA workflow cycle. 
ggplot. An important package that is used for data visualization.

ggplot consists of a grammar of graphics to organize and make sense of  different elements (Wilkinson, 2005). Healy explains that ggplot breaks up the task of making a graph into a series of distinct tasks.

ggplot had logical connections between your data and the plot elements are called aesthetic mappings

ggplot2 has **layers** what you see on the plot (lines, points, etc)

- The overall type of plot is called a **geom.** Each geom has a function that creates it.
- Of course you must name the data that you are using.

- mappings are specified using the **aes()** function.  we need to tell it which variables in the data should be represented by which visual elements in the plot. It also doesn’t know what sort of plot we want.
- **scales** function maps the data to graphical output
- **coordinates** the visualization perspective (grid)
- **labels and legends** create titles and change up your legends.
- **faceting** provides digital "drill-down" into the data. Multiple plots by value
- **theme** fine grain control over the visualization of the display (fonts)

Do you need all of these things to create a graph? No!

```r
#layer 1: add data and aesthetics mapping 
*ggplot(data_to_explore,
 aes(x = time_spent_hours, 
 y = proportion_earned)) +
#layer 2: + geom function type
* geom_point()
```
]
.pull-right[
<img src="img/scatter_1.png" height="300px"/>

]
]
???
What you need to create a graph 80% of the time is 
1. **The data**: the raw material that is required for your visualization what you are trying to explore.
2. **Aesthetics**: Its the mappings of your data to the visualization.
3. **Layer**: must provide graphics layers as geom functions. You will need to tell them what geom function you want to render based on the data you provide and the aesthetic mapping. ie: geom_line(), geom_bar() etc

Recall our engineered data to explore dataset. Here we call the ggplot () functions, we then add our dataset = "data_to_explore" followed by a comma and then the aes() function stating the variables we want to run for x and y.

While in the ggplot function we will always use a plus operator and not the pipe. The pipe is for dplyr and we are working in ggplot2.

Immediately the next layer is the geom() function. we are going to create a scatter plot so we will use the geom_point() function.
This is the least amount you need to create a graph 80% of the time. 
Do you think this would be enough. No we could add a scale to it.

```r
#layer 1: add data and aesthetics mapping 
#layer 3: add color scale by type
ggplot(data_to_explore, 
 aes(x = time_spent_hours, 
 y = proportion_earned,
* color = enrollment_status)) +
#layer 2: + geom function type
 geom_point()
```
]
.pull-right[
<img src="img/scatter_2.png" height="300px"/>

]
]

```r
#layer 1: add data and aesthetics mapping 
#layer 3: add color scale by type
ggplot(data_to_explore, 
 aes(x = time_spent_hours, 
 y = proportion_earned,
 color = enrollment_status)) +
#layer 2: + geom function type
 geom_point() +
#layer 4: add lables
* xlab labs(title="How Time Spent on Course LMS is Related to Points Earned in the Course",
* x="Time Spent (Hours)",
* y = "MProportion of Points Earned")
```
]
.pull-right[
<img src="img/scatter_3.png" height="300px"/>

]
]

```r
#layer 1: add data and aesthetics mapping 
#layer 3: add color scale by type
ggplot(data_to_explore, aes(x = time_spent_hours, y = proportion_earned, color = enrollment_status)) +
#layer 2: + geom function type
 geom_point() +
#layer 4: add lables
 xlab labs(title="How Time Spent on Course LMS is Related to Points Earned in the Course", 
 x="Time Spent (Hours)",
 y = "Proportion of Points Earned")
#layer 5: add facet wrap
* facet_wrap(~ subject)
```
]
.pull-right[
<img src="img/scatter_4.png" height="300px"/>

]
]
]
---
# ggplot continue
.panelset[

```r
data_to_explore %>% 
 ggplot(aes(x = time_spent_hours)) +
 geom_histogram(bins = 30)
```
]
.pull-right[
<img src="img/hist.png" height="250px"/>

]

---

## .font130[.center[**Thank you!**]]

.left-col[
.center[<img style="border-radius: 50%;" src="img/author1.jpeg" height="120px"/> **First Author** <mailto:author1@email.com>]
]
.center-col[
.center[<img style="border-radius: 50%;" src="img/author2.jpeg" height="120px"/> **Second Author** <mailto:author2@email.com>]
]
.right-col[
.center[<img style="border-radius: 50%;" src="img/author3.jpeg" height="120px"/> **Third Author** <mailto:author3@email.com>]
]

.pull-left-narrow[ .center[<img style="border-radius: 50%;" src="https://www.nsf.gov/images/logos/NSF_4-Color_bitmap_Logo.png" height="100px"/> ]]

.pull-right-wide[
.left[.font70[
This work was supported by countless cups of hot tea, cookies, and the RStats Xaringan community willingness to post of plethora of tutorials. But you may wish to thank your funding agency (e.g., the National Science Foundation) and put their logo on the left as shown here.
]]
]