Announcements

  • Homework 4 and Project Proposals are graded
  • I changed all missing grades to zero to better reflect your grade in this class.
  • There is no formal class Monday BUT I will be on campus from Noon until 3:30pm on Monday.
    • I was originally planning on traveling this day and had canceled class. I’m not going to require class last minute, but if you come to class that day, you can think of it as an office hour or study session.
    • Go over past readings (Calling Bullshit might be helpful as you work on your final project, plus it’s almost a fun book to read)

Do Previous Readings
- Wickham Chapter 7 - Exploratory Analysis
- Wickham Chapter 3 - Data Visualization
- Wickham Chapter 27 - R Markdown
- Wickham Chapter 28 - Graphics for Communication
- The Wickham online book has a LOT of very helpful chapters. I suggest skimming the topics to see if they are relevant for your final project.
- Calling Bullshit Book Chapters

Other resources:

R Markdown reminders:

[Dates and times infographic](https://github.com/rstudio/concept-maps/raw/master/inspirations/datetime-silvia-canelon.png)

Cheat Sheet for Data Vis

Moving Averages

knitr::include_graphics("moving-average-excel-formula.png")

See example final project for code with moving averages.
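
If you want to try a moving average directly in R, here is a minimal sketch. It is not the code from the example final project: the daily counts are made up, and the 3-day trailing window is just an assumption for illustration. It uses base R's stats::filter().

```r
# Hypothetical daily counts (made-up numbers for illustration)
daily_cases <- c(10, 12, 9, 15, 18, 14, 20)

# 3-day trailing moving average: each value is the mean of that day
# and the two days before it. sides = 1 means "use past values only".
window <- 3
moving_avg <- stats::filter(daily_cases, rep(1 / window, window), sides = 1)

# The first (window - 1) entries are NA because the window is incomplete
as.numeric(moving_avg)
```

Changing window to 7 would give a weekly average, which is a common choice for daily data.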

Recoding

ANES (American National Election Studies) is a large survey done every 4 years. This data is from 2016, but newer data was collected in 2020.

library(haven) # to read .dta files
ANES <- read_dta("ANES2016.dta") 

V161115: Health

Survey Question V161115: Would you say that in general your health is excellent, very good, good, fair, or poor? (1 = Excellent, 5 = Poor)

table(ANES$V161115)
## 
##   -9    1    2    3    4    5 
##    8  742 1429 1341  604  146

Remove values = -9 (turn them into missing values that R recognizes):

ANES$healthy <- car::recode(ANES$V161115, "-9=NA") # recode() from the car package (not dplyr); turn -9 to NA

table(ANES$healthy)
## 
##    1    2    3    4    5 
##  742 1429 1341  604  146
# install.packages("descr") # run once if the package is not installed
library(descr)
freq(ANES$healthy) # freq is from descr package

## PRE: Self-evaluation of R health 
##       Frequency  Percent Valid Percent
## 1           742  17.3770        17.410
## 2          1429  33.4660        33.529
## 3          1341  31.4052        31.464
## 4           604  14.1452        14.172
## 5           146   3.4192         3.426
## NA's          8   0.1874              
## Total      4270 100.0000       100.000
# does frequency table with counts and percent + graph
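
The same cleanup can be done without any packages. This is only a sketch on a made-up vector (not the ANES data), using logical indexing to turn the -9 code into NA.

```r
# Made-up survey responses; -9 stands in for "refused / missing"
health <- c(1, 3, -9, 2, 5, -9, 4, 2)

# Copy first so the original stays untouched, then overwrite the -9s
healthy <- health
healthy[healthy == -9] <- NA

table(healthy, useNA = "ifany")  # counts, with NA shown as its own column
mean(healthy, na.rm = TRUE)      # NA values must be dropped explicitly
```

Leaving the -9 values in place would drag the mean down, which is why missing-value codes should be converted before any calculation.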

V161232: Opinions on Abortion, 4 options

Categorical variable.

attributes(ANES$V161232) # only works for data that has attributes stored. Will not work for most datasets used for the final project unless it is survey data.
## $label
## [1] "PRE: STD Abortion: self-placement"
## 
## $format.stata
## [1] "%93.0g"
## 
## $class
## [1] "haven_labelled" "vctrs_vctr"     "double"        
## 
## $labels
##                                                                                   -9. Refused 
##                                                                                            -9 
##                                                                     -8. Don't know (FTF only) 
##                                                                                            -8 
##                                                1. By law, abortion should never be permitted. 
##                                                                                             1 
##                           2. By law, only in case of rape, incest, or woman's life in danger. 
##                                                                                             2 
## 3. By law, for reasons other than rape, incest, or woman's life in danger if need established 
##                                                                                             3 
##                                           4. By law, abortion as a matter of personal choice. 
##                                                                                             4 
##                                                                              5. Other SPECIFY 
##                                                                                             5

Here is the count of each potential survey response (including missing data and “Other”). Remember, we only know what these numbers mean from having a codebook for the data (which I already had and is available from the ANES website). When downloading data from the internet, you will need a codebook to know how the data was coded and what the numbers mean.

table(ANES$V161232)
## 
##   -9   -8    1    2    3    4    5 
##   48    9  544 1115  616 1932    6

Remove numbers less than 1 and make them NA:

ANES$choice <- ifelse(ANES$V161232 <1, NA, ANES$V161232)
table(ANES$choice)
## 
##    1    2    3    4    5 
##  544 1115  616 1932    6

Now we have 5 remaining options for our variable. Let’s add labels to our variable and keep the 4 main options. Only 6 people said “Other”, so let’s drop it for this example.

We can recode the variable as a string by putting the new values in 'single quotes'.

ANES$choice_labeled <- car::recode(ANES$V161232, "1='By law Never'; 2='Law Permits Extreme Cases'; 3='Law Permits If Need Established'; 4='By law Always Personal Choice'; else=NA")
## Error: Can't convert <character> to <double>.
# It won't work! Read the error message!
# let's check the class of the variable

class(ANES$V161232) # stored as double
## [1] "haven_labelled" "vctrs_vctr"     "double"

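The solution to our problem: as.numeric() is wrapped around the variable name in the code chunk below. This lets the code work because R is… complicated: the variable is stored as a haven_labelled vector (see the class output above), which refuses to hold text values, and as.numeric() strips that class so recode() is working with plain numbers that it can turn into strings. I solved this by googling the error message and the words “car package” together.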

ANES$choice_labeled <- recode(as.numeric(ANES$V161232), "1='By law Never'; 2='Law Permits Extreme Cases'; 3='Law Permits If Need Established'; 4='By law Always Personal Choice'; else=NA")

table(ANES$choice_labeled) # check labels
## 
##   By law Always Personal Choice                    By law Never 
##                            1932                             544 
##       Law Permits Extreme Cases Law Permits If Need Established 
##                            1115                             616
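
Another way to attach text labels is base R's factor(). The sketch below uses made-up responses rather than the ANES file, so treat it as an illustration only. With the real haven_labelled variable, I would assume you still want to wrap it in as.numeric() first.

```r
# Made-up responses coded like V161232 (toy data, not the ANES file)
responses <- c(1, 4, 2, 4, 3, 5, -9, 4, 2)

choice_f <- factor(
  responses,
  levels = 1:4,  # any code outside 1-4 becomes NA
  labels = c("By law Never", "Law Permits Extreme Cases",
             "Law Permits If Need Established",
             "By law Always Personal Choice")
)

table(choice_f)       # counts appear in the level order given above
sum(is.na(choice_f))  # the 5 and the -9 became NA
```

One difference from the recode() output: a factor keeps the categories in the order you list them instead of alphabetical order, which is handy for tables and plots.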

V161232: Creating a Binary Variable

Simplified 4 options to Never or Conditional Yes/Yes

  • Make a dummy variable where 0 = “Never allow abortion”, and 1 = options 2,3,4 which approve of an abortion to some extent.
ANES$choice01 <- recode(ANES$V161232, "1=0; 2=1;3=1;4=1; else=NA")
table(ANES$choice01) # check work
## 
##    0    1 
##  544 3663

If I wanted to give choice01 labels:

ANES$choice01_labels <- recode(ANES$V161232, "1='Never'; 2:4= 'Conditional Yes'; else=NA")
## Error: Can't convert <character> to <double>.
# This won't work
# same error message as last time
ANES$choice01_labels <- recode(as.numeric(ANES$V161232), "1='Never'; 2:4= 'Conditional Yes'; else=NA")

table(ANES$choice01_labels)
## 
## Conditional Yes           Never 
##            3663             544

Make sure that 0 and Never have the same number of people (544). Always check your work!
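
One quick way to check a recode is to cross-tabulate the original codes against the new variable. This is a sketch with made-up codes (not the ANES data); the nested ifelse() is just one assumed way to write the same 0/1 logic.

```r
# Made-up codes standing in for V161232 (toy data)
original <- c(1, 2, 3, 4, 4, 1, -9, 2, 5)

# Same logic as the dummy above: 1 -> 0, 2:4 -> 1, everything else NA
dummy <- ifelse(original == 1, 0,
                ifelse(original %in% 2:4, 1, NA))

# Rows = original codes, columns = recoded values (NA column included)
check <- table(original, dummy, useNA = "ifany")
check
```

Every original code should land in exactly one column. If a row has counts in the wrong column, the recode has a mistake.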

V161342: Gender

Categorical Variable. The codebook says that -9 = missing, 1 = Male, 2 = Female, 3 = Other. We should add labels.

table(ANES$V161342)
## 
##   -9    1    2    3 
##   41 1987 2231   11
ANES$gender <- recode(as.numeric(ANES$V161342), "1='Male';2='Female'; 3='Other'; else=NA")  #else=NA turns the -9 values into NA
table(ANES$gender)
## 
## Female   Male  Other 
##   2231   1987     11

Recoding with ifelse()

V161019: Party of registration

Categorical variable: 1 = Democratic party, 2 = Republican Party, 4 = Independent, 5 = Other

table(ANES$V161019)
## 
##   -9   -8   -1    1    2    4    5 
##    9   11 2151  924  682  471   22

Remove missing data:

ANES$regparty <- ANES$V161019 # duplicate the variable so we still have the original
# this adds regparty to the end of the dataset

ANES$regparty <- ifelse(ANES$V161019 <0, NA, ANES$V161019)
table(ANES$regparty)
## 
##   1   2   4   5 
## 924 682 471  22
# the same recode written with dplyr (needs library(dplyr)); values below 1 become NA
ANES <- ANES %>%
  mutate(regparty = ifelse(V161019 <1, NA, V161019))

Add Labels:

ANES$regpartylabels <- recode(ANES$regparty, "1='Democrat'; 2='Republican'; 4='Independent'; 5='Other' ")
table(ANES$regpartylabels)
## 
##    Democrat Independent       Other  Republican 
##         924         471          22         682

Cross Tables: Gender and Opinions on Abortion

# Gender and Choice with 2 options (No = 0, Yes/Conditional Yes = 1)
table(ANES$choice01_labels, ANES$gender)
##                  
##                   Female Male Other
##   Conditional Yes   1901 1719    10
##   Never              296  242     1
CrossTable(ANES$choice01_labels, ANES$gender, prop.r = FALSE, prop.c = FALSE, prop.chisq = FALSE)
##    Cell Contents 
## |-------------------------|
## |                       N | 
## |         N / Table Total | 
## |-------------------------|
## 
## ======================================================
##                         ANES$gender
## ANES$choice01_labels    Female    Male   Other   Total
## ------------------------------------------------------
## Conditional Yes           1901    1719      10    3630
##                          0.456   0.412   0.002        
## ------------------------------------------------------
## Never                      296     242       1     539
##                          0.071   0.058   0.000        
## ------------------------------------------------------
## Total                     2197    1961      11    4169
## ======================================================
# Gender and Choice with 4 options 
CrossTable(ANES$choice_labeled, ANES$gender, 
           prop.r = FALSE, prop.c = FALSE, prop.t = FALSE, prop.chisq = FALSE)
##    Cell Contents 
## |-------------------------|
## |                       N | 
## |-------------------------|
## 
## ================================================================
##                                    ANES$gender
## ANES$choice_labeled                Female   Male   Other   Total
## ----------------------------------------------------------------
## By law Always Personal Choice        1044    862       7    1913
## ----------------------------------------------------------------
## By law Never                          296    242       1     539
## ----------------------------------------------------------------
## Law Permits Extreme Cases             563    542       3    1108
## ----------------------------------------------------------------
## Law Permits If Need Established       294    315       0     609
## ----------------------------------------------------------------
## Total                                2197   1961      11    4169
## ================================================================

V161158x: Political party identification

Respondents self-identify on a scale from Strong Democrat to Strong Republican. Originally coded as a 7-point ordinal scale:
1. Strong Democrat
2. Not very strong Democrat
3. Independent-Democrat
4. Independent
5. Independent-Republican
6. Not very strong Republican
7. Strong Republican

ANES$partyid <- ifelse(ANES$V161158x <1,NA, ANES$V161158x)

table(ANES$partyid)
## 
##   1   2   3   4   5   6   7 
## 890 559 490 579 500 508 721
hist(ANES$partyid) #  base R histogram

ANES %>% ggplot(aes(partyid)) +  
  geom_histogram(binwidth = 1) # ggplot histogram with more options
## Warning: Removed 23 rows containing non-finite values (stat_bin).

People seem to cluster at the two extremes, Strong Democrat and Strong Republican (this is a bimodal distribution).

Now let’s consolidate this scale. (I wouldn’t normally do this because you lose valuable information regarding how strongly someone considers themselves to be one thing or the other, but it makes a good example for recoding)

The mapvalues() command from the plyr package swaps each old value for a new one.

# V161158x: Democrat -> Republican, 3 consolidated options, Ordinal

ANES$partyid3 <- plyr::mapvalues(as.numeric(ANES$partyid), c(1,2,3,4,5,6,7), c('Democrat', 'Democrat', 'Democrat', 'Independent', 'Republican', 'Republican', 'Republican'))
table(ANES$partyid3)
## 
##    Democrat Independent  Republican 
##        1939         579        1729
CrossTable(ANES$choice_labeled, ANES$partyid3,chisq = FALSE,  prop.c = FALSE, prop.t = FALSE, prop.r = FALSE,  prop.chisq = FALSE)
##    Cell Contents 
## |-------------------------|
## |                       N | 
## |-------------------------|
## 
## ==============================================================================
##                                    ANES$partyid3
## ANES$choice_labeled                Democrat   Independent   Republican   Total
## ------------------------------------------------------------------------------
## By law Always Personal Choice          1205           245          477    1927
## ------------------------------------------------------------------------------
## By law Never                            149            66          327     542
## ------------------------------------------------------------------------------
## Law Permits Extreme Cases               303           157          649    1109
## ------------------------------------------------------------------------------
## Law Permits If Need Established         258            94          262     614
## ------------------------------------------------------------------------------
## Total                                  1915           562         1715    4192
## ==============================================================================
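
If you do not have plyr installed, the same 7-to-3 consolidation can be sketched in base R with cut(). The values here are made up, and the break points are my assumption about how to express the 1-3, 4, 5-7 grouping.

```r
# Made-up 7-point party ID values (toy data)
partyid <- c(1, 7, 4, 2, 6, 3, 5, 1, NA)

# Breaks at 0, 3, 4, 7 create the groups 1-3, 4, and 5-7
partyid3 <- cut(
  partyid,
  breaks = c(0, 3, 4, 7),
  labels = c("Democrat", "Independent", "Republican")
)

table(partyid3)  # levels stay in Democrat -> Republican order
```

Because cut() returns a factor, the categories keep their ordinal order in tables and plots, and the NA stays NA.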

Finding Data

UIC’s data sources: UIC has a list of data sources of various types categorized as “Research Tools” and “Policy Documents”.

If you don’t remember what your options are, this is a good place to start. It links to HUD, American Community Survey, USA Gov, CMAP, National Low Income Housing Coalition, and Chicago Data Portal, just to name a few. UIC also allows students to access other large data repositories such as ICPSR, Policy Map, and more.

Data Management

Use some form of version control or versioned cloud storage to save your work!
- UIC’s Box, GitHub, OneDrive, Google Drive

Vocab reminders

Below is a partial list of vocab from the course. Tie some terms from the course into your final project:

Bimodal distribution: a set of observations where two values occur more often than the other values. A graph of the frequencies shows two peaks in the distribution.

Accuracy: How close a number is to the true quantity being measured. Different than precision.

Precision: A measure of the level of detail, or resolution, in a number. 98.683 is more precise than 100, but if the true value is 101.00, then a measurement equal to 100 is more accurate.
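
The numbers from that example can be checked with a couple of lines of R. This is only a sketch of the arithmetic; the variable names are made up.

```r
true_value <- 101.00
precise_measure  <- 98.683  # more decimal places (more precise)
accurate_measure <- 100     # fewer digits, but closer to the truth

abs(precise_measure - true_value)   # error of the precise number
abs(accurate_measure - true_value)  # error of the rounder number
```

The smaller error belongs to 100, so it is the more accurate measurement even though it carries less detail.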

Incidence: Number of new cases reported in a specific time period

Prevalence: Number of existing cases (cumulative)

Univariate Data: Univariate data involves analyzing one variable. It does not deal with causal relationships and it is mostly used to describe the variable.

Categorical Data: A categorical variable can only take on a specific set of values representing a set of possible categories. Even if a number is assigned to it for data recording purposes, the numbers do not mean anything (e.g., taking the average of 1 = Blue, 2 = Orange, 3 = Green is meaningless). This occurs most often when analyzing survey data, where information is stored as numbered responses (that work for analysis) and labels (that describe the survey options). Categorical variables are also called nominal variables. I always remember this as nominal = numbers have no meaning.

Binary Variable: Categorical variables with only two options (Yes/No, 0/1, true/false, etc.). Synonyms include: dichotomous, logical, indicator, boolean

Ordinal: This is a type of categorical variable where the responses have a meaningful order. Age groups, income brackets, Likert scales (Strongly Dislike, Dislike, Neutral, Like, Strongly Like) are all ordinal variables. These you must approach carefully when doing analysis.

Continuous Variables: Data that can take on any value in an interval. ex. income, age, miles driven, etc.

The density for a continuous distribution is a measure of relative probability for getting a value close to x. The probability of getting a value in a particular interval is the area under the corresponding part of the curve. (Calc flashbacks…)
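
To make the "area under the curve" idea concrete, here is a small sketch using the standard normal distribution (my choice of example, not something from the readings). It computes the same probability two ways.

```r
# Probability that a standard normal value falls between -1 and 1,
# computed as the area under the density curve dnorm()
area <- integrate(dnorm, lower = -1, upper = 1)$value

# The same probability from the cumulative distribution function
from_cdf <- pnorm(1) - pnorm(-1)

c(area = area, from_cdf = from_cdf)  # both about 0.68
```

This is where the rule of thumb comes from that roughly 68% of a normal distribution lies within one standard deviation of the mean.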

Mean: The average: the sum of the values divided by the number of values.
In Excel: =AVERAGE(CELL_1, CELL_2, CELL_3), which is the same as =(CELL_1 + CELL_2 + CELL_3)/Number_Of_Cells

Median: Think of the thing between lanes in a road - the median. It is in the center. Equally distanced between both sides. Half of the observations are on one side, and half are on the other.

Medians are useful for “measuring the middle” when there are outliers.
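
Here is a quick sketch of why that matters, using made-up incomes (in thousands of dollars). One extreme value moves the mean a lot and the median very little.

```r
incomes <- c(30, 35, 40, 45, 50)     # made-up values
incomes_outlier <- c(incomes, 1000)  # add one extreme value

mean(incomes);         median(incomes)          # 40 and 40
mean(incomes_outlier); median(incomes_outlier)  # 200 and 42.5
```

With the outlier included, the mean of 200 describes nobody in the data, while the median of 42.5 is still close to a typical value.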

When is the mean an inappropriate statistic?

Bimodal distributions. Bimodal distributions indicate heterogeneity in the sample. If possible, look at the statistics for each group to gain more context.

Mode: Observation that occurs the most often.

Measures of Dispersion: Assess how tightly clustered or spread out data points are

Variation: The tendency of variables to change from measurement to measurement
- Each measurement includes a small amount of error

Deviations: How far a measurement is from an expected value (i.e. scatterplot dot from best fit line)

Variance: The sum of squared deviations from the mean divided by n-1, where n is the number of data values.

Standard deviation: Square root of the variance.
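
The two definitions above can be worked through by hand and compared with R's built-in var() and sd(). The numbers are made up for illustration.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # made-up values

n <- length(x)
deviations <- x - mean(x)                         # distance of each value from the mean
variance_by_hand <- sum(deviations^2) / (n - 1)   # sum of squared deviations over n - 1
sd_by_hand <- sqrt(variance_by_hand)              # square root of the variance

# Both should match R's built-in functions
all.equal(variance_by_hand, var(x))
all.equal(sd_by_hand, sd(x))
```

Note that var() and sd() in R divide by n - 1 (the sample versions), which matches the definition given here.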

Graphing Univariate Data

Histograms: A histogram is similar to a barplot but is used for numeric data types. Instead of counting the occurrence of each individual value, a histogram divides the data into bins (buckets/ranges) depending on the entire data range and displays the count. It gives an idea about the distribution of the numeric variable.

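# example using the mpg data that ships with ggplot2 (swap in your own data and variable)
ggplot(data = mpg) +
  geom_histogram(aes(x = hwy), binwidth = 2)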
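
To see what the bins actually contain, base R's hist() can return the counts without drawing anything. This is a sketch with made-up numbers and bin edges I picked by hand.

```r
values <- c(1, 2, 2, 3, 5, 6, 6, 7, 9, 10)  # made-up numbers

# plot = FALSE returns the bin information instead of drawing the chart
h <- hist(values, breaks = c(0, 2, 4, 6, 8, 10), plot = FALSE)

h$breaks  # bin edges
h$counts  # observations per bin: [0,2], (2,4], (4,6], (6,8], (8,10]
```

Changing the breaks changes the shape of the histogram, so it is worth trying a few bin widths before settling on one.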

Boxplots are a type of visual shorthand for distribution. The box stretches from the 25th percentile to the 75th percentile of the distribution (so the middle 50% of all observations, also known as the interquartile range). In the middle of the box is a line that shows the median value in the observations. Together, these help show the spread of the distribution, if it is symmetric around the median, and if it is skewed in one direction or the other.

A line (or a whisker) extends from the ends of the box to the farthest nonoutlier point in the distribution.

ggplot(data = mpg) + 
  geom_boxplot(mapping = aes(x = class, y = hwy))
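
The pieces of a boxplot can also be computed as numbers. The sketch below uses made-up data with one extreme value and base R functions only; ggplot2 may compute the box edges slightly differently on some datasets.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 30)  # made-up data with one extreme value

quantile(x, c(0.25, 0.5, 0.75))  # edges of the box and the median line
IQR(x)                           # length of the box (75th minus 25th percentile)

# boxplot.stats() reports the whisker ends and any points flagged as outliers
# (by default, points more than 1.5 box lengths beyond the box)
b <- boxplot.stats(x)
b$stats  # lower whisker, lower edge of box, median, upper edge of box, upper whisker
b$out    # values drawn as individual points
```

Here the upper whisker stops at 8, the farthest point that is not an outlier, and 30 is drawn on its own.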

Graphing Bivariate Data

Bivariate data is used to find out if there is a relationship between two different variables. It is frequently represented with scatter plots where one variable is on the X axis and the other is on the Y axis. If the data seems to fit a line or curve then there may be a relationship, or correlation, between the two variables. Always be careful when examining relationships. Many variables may appear related when in fact their relationship happened by chance or a third variable is influencing both variables.

Scatterplots: Scatter plots can be used to identify if any relationship exists between numeric variables.

Test of Association: Used to test whether two variables are related to one another or not
- Depends on the nature of the variables you are studying (nominal, ordinal, interval) and number of categories
- Depends on the nature of the question you’re answering (independence, agreement of coders, effect of an intervention, etc.)

Questions to ask yourself:

  • Could this pattern happen by chance?
  • How do you describe the relationship implied by the pattern?
  • How strong is the relationship implied by the pattern?
  • What other variables might affect the relationship?
  • Does the relationship change if you look at individual subgroups of the data?

“Variation creates uncertainty but covariation reduces it.”

Correlation is the extent that two continuous variables are related. A relationship exists when knowing the value of one variable is useful for predicting the value of a second variable.

Correlation Coefficient: symmetric, scale-invariant measure of association between two variables
- Ranges from -1 to +1.
- Strength of Correlation: 0 means no correlation, -1 and +1 imply perfect correlation (either one increases as the other decreases [-1] or they move in the same direction[+1])
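
These properties are easy to check in R. The sketch below uses made-up study hours and exam scores (hypothetical numbers, not real data).

```r
# Made-up paired measurements
hours_studied <- c(1, 2, 3, 4, 5, 6)
exam_score    <- c(52, 55, 61, 70, 72, 80)

r <- cor(hours_studied, exam_score)

# Symmetric: swapping the variables gives the same coefficient
all.equal(r, cor(exam_score, hours_studied))

# Scale-invariant: converting hours to minutes does not change it
all.equal(r, cor(hours_studied * 60, exam_score))

# Perfect negative correlation: one rises exactly as the other falls
cor(hours_studied, -2 * hours_studied + 10)  # -1
```

Scale invariance is why it does not matter whether income is measured in dollars or thousands of dollars when you compute a correlation.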

Pearson Correlation
- assumes the variables are normally distributed
- cor() can be used to compute correlation between two or more vectors
- cor.test() tells you if the correlation is significantly different from zero
- cor.test() has options for different correlation calculations (pearson (default), spearman, kendall)
- Spearman and Kendall correlation methods are NON-parametric tests, meaning they do not assume the variables follow a normal distribution. For this course, you do not need to worry about non-parametric tests BUT if you were doing your own analysis, you must remember that the type of test you use depends on your data and look into which tests are appropriate.

cor(variable1, variable2, use = "complete.obs")
cor.test(v1, v2)
cor.test(v1, v2, method = "spearman") # Spearman's rho; method = "kendall" gives Kendall's tau (both nonparametric)
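
To run these commands end to end, you need two actual vectors. Here is a sketch with simulated data; v1 and v2 are generated on the spot and the 0.6 slope is an arbitrary assumption, so the exact numbers mean nothing outside this example.

```r
set.seed(42)                # make the simulated data reproducible
v1 <- rnorm(50)
v2 <- 0.6 * v1 + rnorm(50)  # built to be positively related to v1

pearson  <- cor.test(v1, v2)                       # default method
spearman <- cor.test(v1, v2, method = "spearman")  # rank-based

pearson$estimate   # the correlation coefficient (same as cor(v1, v2))
pearson$p.value    # a small p-value suggests the correlation is not zero
spearman$estimate  # Spearman's rho
```

The object returned by cor.test() also stores a confidence interval for the Pearson correlation in pearson$conf.int.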

Covariation describes how the values of two or more variables vary together (Reminder: variation describes behavior within a single variable). The easiest way to spot covariation is to visualize the relationship between the two variables. The best way to visualize the relationship depends on the types of variables involved.

Two Categorical Variables: Covariation between categorical variables is visualized by examining the number of observations for each combination.

Contingency tables or Cross Tables: A tally of counts between two or more categorical variables. Cross tabulation groups variables to understand the correlation between different variables. It also shows how correlations change from one variable grouping to another. It is usually used in statistical analysis to find patterns, trends, and probabilities within raw data.

  • Preferred for categorical data since it can be divided into different groups
  • Ex. Democrats, Republicans, and Independents in geographic areas (North, South, Northeast, West, etc.)
  • Good for studying survey responses
  • Cross Tabulations in Excel is just the pivot table feature
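
In R, a basic cross table needs nothing beyond base R. This is a sketch with made-up party and region responses (toy data, not from any survey).

```r
# Made-up survey responses (toy data)
party  <- c("Democrat", "Republican", "Independent", "Democrat",
            "Republican", "Democrat", "Independent", "Republican")
region <- c("North", "South", "North", "South",
            "South", "North", "West", "West")

counts <- table(party, region)  # contingency table of counts
counts

addmargins(counts)                        # add row and column totals
round(prop.table(counts, margin = 1), 2)  # row proportions (each row sums to 1)
```

Use margin = 2 instead to get column proportions, which answers a different question (the party mix within each region rather than the regional mix within each party).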