class: center, top, title-slide

.title[
# Webinar 1
]
.subtitle[
## Data Visualization
]
.author[
### Rogers Ochenge
]
.date[
### November 30, 2022
Updated: Jan 23, 2023
]

---
# Course Goals

* Develop intermediate data management and visualization skills in R

--

* Learn basic programming

---
# Univariate Graphs

## Categorical variables

--

+ The distribution of a single categorical variable is typically plotted with a bar chart, a pie chart, or (less commonly) a treemap.

---
# Bar Charts

--

+ The Marriage dataset contains the marriage records of 98 individuals in Mobile County, Alabama. Below, a bar chart is used to display the distribution of wedding participants by race.

---

<!-- figure: bar chart of wedding participants by race -->

--

+ The majority of participants are White, followed by Black, with very few Hispanics or American Indians.

---

+ Percents

--

+ Bars can represent percents rather than counts.

--

<!-- figure: bar chart of wedding participants by race, in percents -->

---
# Sorting categories

+ It is often helpful to sort the bars by frequency.

---

<!-- figure: bar chart with bars sorted by frequency -->

---
# Pie Charts

--

* Pie charts are controversial in statistics.

--

* If your goal is to compare the frequency of categories, you are better off with bar charts (humans are better at judging the length of bars than the area of pie slices).

--

* If your goal is to compare each category with the whole (e.g., what portion of participants are Hispanic compared to all participants), and the number of categories is small, then pie charts may work for you.

---

<!-- figure: pie chart of wedding participants by race -->

--

* The pie chart makes it easy to compare each slice with the whole. For example, Black participants are seen to make up roughly a quarter of the total.

---
# Treemap

--

* In a treemap, each tile represents a single observation, with the area of the tile proportional to a variable.

--

* Example: a treemap with each tile representing a G-20 country.

--

* The area of the tile will be mapped to the country’s GDP, and the tile’s fill colour mapped to its HDI (Human Development Index).
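The mapping just described (tile area to GDP, fill colour to HDI) can be sketched as follows. This is a minimal sketch, not necessarily the code behind the slides. It assumes the ggplot2 and treemapify packages are installed and uses the `G20` data frame bundled with treemapify, with its `gdp_mil_usd`, `hdi`, and `country` columns. The title and label styling are illustrative choices.

```r
# Assumption: ggplot2 and treemapify are installed.
library(ggplot2)
library(treemapify)

# G20 ships with treemapify: one row per G-20 member.
treemap_plot <- ggplot(
  G20,
  aes(area = gdp_mil_usd,   # tile area proportional to GDP
      fill = hdi,           # fill colour mapped to HDI
      label = country)      # text drawn inside each tile
) +
  geom_treemap() +
  geom_treemap_text(colour = "white", place = "centre") +
  labs(title = "G-20 economies", fill = "HDI")

# Printing the object draws the treemap.
treemap_plot
```

Adding `subgroup = region` to `aes()` together with `geom_treemap_subgroup_border()` would group the tiles by region.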
--

<!-- figure: treemap of G-20 countries, area by GDP and fill by HDI -->

---
# Subgrouping tiles

--

* Let’s subgroup the countries by region.

--

<!-- figure: treemap of G-20 countries subgrouped by region -->

---
# Quantitative variables

--

* The distribution of a single quantitative variable is typically plotted with:

--

1. a histogram,

--

2. a kernel density plot, or

--

3. a dot chart.

---
# Histogram

--

* Using the Marriage dataset, let’s plot the ages of the wedding participants.

---

<!-- figure: histogram of participant ages -->

--

* Most participants appear to be in their early 20s, with another group in their 40s and a much smaller group in their late sixties and early seventies. This would be a multimodal distribution.

---
# Kernel Density Plot

--

* An alternative to a histogram is the kernel density plot.
* Technically, kernel density estimation is a nonparametric method for estimating the probability density function of a continuous random variable. (What??)
* Basically, we are trying to draw a smoothed histogram, where the area under the curve equals one.

--

<!-- figure: kernel density plot of participant ages -->

---

* Fill the density with color

--

<!-- figure: kernel density plot with colour fill -->

--

* The graph shows the distribution of ages. For example, the proportion of cases between 20 and 40 years old would be represented by the area under the curve between 20 and 40 on the x-axis.

---

* Better still

--

```
## [1] 5.181946
```

<!-- figure: kernel density plot -->

--

* Kernel density plots allow you to easily see which values are most frequent and which are relatively rare.

--

* However, it can be difficult to explain the meaning of the y-axis to a non-statistician. (But it will make you look really smart at parties!)

---
# Dot Chart

--

* Another alternative to the histogram is the **dot chart**.
* Again, the quantitative variable is divided into bins, but rather than summary bars, each observation is represented by a dot.
* By default, the width of a dot corresponds to the bin width, and dots are stacked, with each dot representing one observation.
* This works best when the number of observations is small (say, fewer than 150).
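A dot chart along these lines can be sketched with `geom_dotplot()`. This is a minimal sketch under two assumptions: that the Marriage data is the `Marriage` data frame from the mosaicData package, and that its `age` column holds the participants' ages. The bin width, colours, and labels are illustrative choices rather than values taken from the slides.

```r
# Assumption: ggplot2 and mosaicData are installed.
library(ggplot2)
data(Marriage, package = "mosaicData")

dot_plot <- ggplot(Marriage, aes(x = age)) +
  # Each dot is one observation; dot width equals the bin width.
  geom_dotplot(binwidth = 2, fill = "gold", colour = "black") +
  # The y-axis of a dot plot carries no meaningful scale, so hide it.
  scale_y_continuous(NULL, breaks = NULL) +
  labs(title = "Participants by age", x = "Age")

# Printing the object draws the chart.
dot_plot
```

Swapping `geom_dotplot()` for `geom_histogram()` or `geom_density()` on the same `aes(x = age)` mapping gives the other two univariate views.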
--

<!-- figure: dot chart -->

---
# Bivariate Graphs

--

* Bivariate graphs display the relationship between two variables. The type of graph will depend on the measurement level of the variables (categorical or quantitative).

--

* **Categorical vs. Categorical**

--

* When plotting the relationship between two categorical variables, stacked, grouped, or segmented bar charts are typically used.

--

* **Stacked bar chart**

--

+ Let’s plot the relationship between automobile class and drive type (front-wheel, rear-wheel, or 4-wheel drive) for the automobiles in the Fuel economy dataset.

---

<!-- figure: stacked bar chart of automobile class by drive type -->

--

+ From the chart, we can see, for example, that the most common vehicle is the SUV. All 2seater cars are rear-wheel drive, while most, but not all, SUVs are 4-wheel drive.

---

* **Grouped bar chart**

--

+ Grouped bar charts place bars for the second categorical variable side by side.

--

<!-- figure: grouped bar chart of automobile class by drive type -->

--

+ Notice that all minivans are front-wheel drive. By default, zero-count bars are dropped and the remaining bars are made wider. This may not be the behavior you want.

---

+ You can modify this behavior:

--

<!-- figure: grouped bar chart -->

---

* **Segmented bar chart**

--

+ A segmented bar chart is a stacked bar chart where each bar represents 100 percent.

--

<!-- figure: segmented bar chart of automobile class by drive type -->

---
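In ggplot2, the three bar chart variants above differ mainly in the `position` argument of `geom_bar()`. The following is a minimal sketch, assuming ggplot2 is installed and that the Fuel economy dataset is ggplot2's bundled `mpg` data frame, with its `class` and `drv` columns. The object names are illustrative.

```r
# Assumption: ggplot2 is installed; mpg ships with it.
library(ggplot2)

# Shared mapping: one bar per vehicle class, filled by drive type.
base <- ggplot(mpg, aes(x = class, fill = drv))

# Stacked: drive-type counts are stacked within each class bar.
stacked <- base + geom_bar(position = "stack")

# Grouped: bars side by side. preserve = "single" keeps every bar the
# same width even when a class has no cars for some drive type.
grouped <- base + geom_bar(position = position_dodge(preserve = "single"))

# Segmented: each bar is rescaled to sum to 1 (100 percent).
segmented <- base +
  geom_bar(position = "fill") +
  labs(y = "Proportion")

# Printing any of the objects draws it.
segmented
```

Using plain `position = "dodge"` instead gives the default grouped behavior, where the remaining bars widen to fill the space left by zero-count combinations.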