Do Previous Readings
- Wickham Chapter 7 - Exploratory Data Analysis
- Wickham Chapter 3 - Data Visualization
- Wickham Chapter 27 - R Markdown
- Wickham Chapter 28 - Graphics for Communication
- The Wickham online book has a LOT of very helpful chapters. I suggest skimming the topics to see if they are relevant for your final project.
- Calling Bullshit Book Chapters
Other resources:
[Dates and times infographic](https://github.com/rstudio/concept-maps/raw/master/inspirations/datetime-silvia-canelon.png)
knitr::include_graphics("moving-average-excel-formula.png")
See example final project for code with moving averages.
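As a rough illustration of the idea, a centered moving average can be computed in base R with `stats::filter()`. The vector `daily_counts` and the 3-value window are made up for this sketch; the example final project may take a different approach.

```r
# Hypothetical data: seven daily counts (made up for illustration)
daily_counts <- c(10, 12, 8, 15, 20, 18, 25)

# Centered 3-value moving average: each value is averaged with its two neighbors.
# The first and last positions are NA because they lack a neighbor on one side.
moving_avg <- stats::filter(daily_counts, rep(1 / 3, 3), sides = 2)
moving_avg <- as.numeric(moving_avg)  # drop the time-series class for easier printing
print(moving_avg)
```

A wider window (for example `rep(1 / 7, 7)`) smooths more but leaves more NA values at the ends.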
ANES (American National Election Studies) is a large survey done every 4 years. This data is from 2016, but newer data came out in 2020.
library(haven) # to read .dta files
ANES <- read_dta("ANES2016.dta")
Survey Question V161115: Would you say that in general your health is excellent, very good, good, fair, or poor? (1 = Excellent, 5 = Poor)
table(ANES$V161115)
##
## -9 1 2 3 4 5
## 8 742 1429 1341 604 146
Remove values = -9 (turn them into missing values that R recognizes):
ANES$healthy <- recode(ANES$V161115, "-9=NA") # turn -9 to NA
table(ANES$healthy)
##
## 1 2 3 4 5
## 742 1429 1341 604 146
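The same step can be sketched in base R without any packages. The vector `health` below is made up, and -9 stands in for the missing-data code; this is one possible approach, not the only one.

```r
# Made-up responses; -9 is the missing-data code
health <- c(1, 2, -9, 3, 5, -9, 2)

# Copy first so the original coding is preserved
health_clean <- health
health_clean[health_clean == -9] <- NA

# useNA = "ifany" shows the NA count instead of silently dropping it
table(health_clean, useNA = "ifany")
```

Keeping the original vector untouched makes it easy to compare before and after.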
# install.packages("descr") # the package name must be in quotes
# library(descr)
freq(ANES$healthy) # freq is from descr package
## PRE: Self-evaluation of R health
## Frequency Percent Valid Percent
## 1 742 17.3770 17.410
## 2 1429 33.4660 33.529
## 3 1341 31.4052 31.464
## 4 604 14.1452 14.172
## 5 146 3.4192 3.426
## NA's 8 0.1874
## Total 4270 100.0000 100.000
# does frequency table with counts and percent + graph
Categorical variable.
attributes(ANES$V161232) # only works for data that has attributes stored. Will not work for most datasets used for the final project unless it is survey data.
## $label
## [1] "PRE: STD Abortion: self-placement"
##
## $format.stata
## [1] "%93.0g"
##
## $class
## [1] "haven_labelled" "vctrs_vctr" "double"
##
## $labels
## -9. Refused
## -9
## -8. Don't know (FTF only)
## -8
## 1. By law, abortion should never be permitted.
## 1
## 2. By law, only in case of rape, incest, or woman's life in danger.
## 2
## 3. By law, for reasons other than rape, incest, or woman's life in danger if need established
## 3
## 4. By law, abortion as a matter of personal choice.
## 4
## 5. Other SPECIFY
## 5
Here is the count of each potential survey response (including missing data and “Other”). Remember, we only know what these numbers mean from having a codebook for the data (which I already had and is available from the ANES website). When downloading data from the internet, you will need a codebook to know how the data was coded and what the numbers mean.
table(ANES$V161232)
##
## -9 -8 1 2 3 4 5
## 48 9 544 1115 616 1932 6
Remove numbers less than 1 and make them NA:
ANES$choice <- ifelse(ANES$V161232 <1, NA, ANES$V161232)
table(ANES$choice)
##
## 1 2 3 4 5
## 544 1115 616 1932 6
Now we have 5 remaining options for our variable. Let’s add labels to our variable and keep the 4 main options. Only 6 people said “Other”, so let’s drop it for this example.
We can recode the variable as a string by putting the new variable values in ‘single quotation marks’.
ANES$choice_labeled <- car::recode(ANES$V161232, "1='By law Never'; 2='Law Permits Extreme Cases'; 3='Law Permits If Need Established'; 4='By law Always Personal Choice'; else=NA")
## Error: Can't convert <character> to <double>.
# It won't work! Read the error message!
# lets check the class of the variable
class(ANES$V161232) # stored as double
## [1] "haven_labelled" "vctrs_vctr" "double"
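The solution to our problem: wrap the variable name in as.numeric(), as in the code chunk below. The variable was read in by haven as a labelled numeric vector (class haven_labelled), which will not accept character values; as.numeric() converts it to a plain numeric vector that recode() can turn into strings. I solved this by googling the error message and the words “car package” together.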
ANES$choice_labeled <- recode(as.numeric(ANES$V161232), "1='By law Never'; 2='Law Permits Extreme Cases'; 3='Law Permits If Need Established'; 4='By law Always Personal Choice'; else=NA")
table(ANES$choice_labeled) # check labels
##
## By law Always Personal Choice By law Never
## 1932 544
## Law Permits Extreme Cases Law Permits If Need Established
## 1115 616
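As an alternative sketch, base R's `factor()` can attach labels to numeric codes. The `codes` vector is made up for illustration; any value not listed in `levels` becomes NA, which plays the same role as `else=NA` above.

```r
# Made-up numeric codes, including a -9 and a 5 that should become NA
codes <- c(1, 4, 2, 4, 3, -9, 5, 4)

# Values not listed in `levels` become NA automatically
choice_factor <- factor(
  codes,
  levels = 1:4,
  labels = c("By law Never",
             "Law Permits Extreme Cases",
             "Law Permits If Need Established",
             "By law Always Personal Choice")
)

table(choice_factor)  # categories appear in codebook order, not alphabetical
```

One practical difference from a character variable: a factor keeps the order given in `levels`, which carries over to tables and plots.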
Simplified 4 options to Never or Conditional Yes/Yes
ANES$choice01 <- recode(ANES$V161232, "1=0; 2=1;3=1;4=1; else=NA")
table(ANES$choice01) # check work
##
## 0 1
## 544 3663
If I wanted to give choice01 labels:
ANES$choice01_labels <- recode(ANES$V161232, "1='Never'; 2:4= 'Conditional Yes'; else=NA")
## Error: Can't convert <character> to <double>.
# This won't work
# same error message as last time
ANES$choice01_labels <- recode(as.numeric(ANES$V161232), "1='Never'; 2:4= 'Conditional Yes'; else=NA")
table(ANES$choice01_labels)
##
## Conditional Yes Never
## 3663 544
Make sure that 0 and Never have the same number of people (544). Always check your work!
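One way to check a recode is to cross-tabulate the original codes against the new ones. The sketch below uses a made-up `original` vector and a base R `ifelse()` recode rather than the ANES data.

```r
# Made-up codes standing in for the original survey variable
original <- c(1, 2, 3, 4, 4, -9, 1, 2)

# Same logic as above: 1 -> 0, 2 through 4 -> 1, anything else -> NA
recoded <- ifelse(original == 1, 0, ifelse(original %in% 2:4, 1, NA))

# Rows are original codes, columns are recoded values (NA gets its own column)
check <- table(original, recoded, useNA = "ifany")
print(check)
```

Each row should have counts in exactly one column; a row split across columns means the recode did something unexpected.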
Categorical Variable. Codebook said that 1=Male, 2=Female, 3=Other; the table below shows that missing responses are coded as -9. We should add labels.
table(ANES$V161342)
##
## -9 1 2 3
## 41 1987 2231 11
ANES$gender <- recode(as.numeric(ANES$V161342), "1='Male';2='Female'; 3='Other'; else=NA") # else=NA turns the -9 values into NA
table(ANES$gender)
##
## Female Male Other
## 2231 1987 11
Categorical variable: 1 = Democratic party, 2 = Republican Party, 4 = Independent, 5 = Other
table(ANES$V161019)
##
## -9 -8 -1 1 2 4 5
## 9 11 2151 924 682 471 22
Remove missing data:
ANES$regparty <- ANES$V161019 # duplicate the variable so we still have the original
# this adds regparty to the end of the dataset
ANES$regparty <- ifelse(ANES$V161019 <0, NA, ANES$V161019)
table(ANES$regparty)
##
## 1 2 4 5
## 924 682 471 22
# tidyverse version of the same step (requires dplyr)
ANES <- ANES %>%
  mutate(regparty = ifelse(V161019 < 1, NA, V161019))
Add Labels:
ANES$regpartylabels <- recode(ANES$regparty, "1='Democrat'; 2='Republican'; 4='Independent'; 5='Other' ")
table(ANES$regpartylabels)
##
## Democrat Independent Other Republican
## 924 471 22 682
# Gender and Choice with 2 options (No = 0, Yes/Conditional Yes = 1)
table(ANES$choice01_labels, ANES$gender)
##
## Female Male Other
## Conditional Yes 1901 1719 10
## Never 296 242 1
CrossTable(ANES$choice01_labels, ANES$gender, prop.r = FALSE, prop.c = FALSE, prop.chisq = FALSE)
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
## ======================================================
## ANES$gender
## ANES$choice01_labels Female Male Other Total
## ------------------------------------------------------
## Conditional Yes 1901 1719 10 3630
## 0.456 0.412 0.002
## ------------------------------------------------------
## Never 296 242 1 539
## 0.071 0.058 0.000
## ------------------------------------------------------
## Total 2197 1961 11 4169
## ======================================================
# Gender and Choice with 4 options
CrossTable(ANES$choice_labeled, ANES$gender,
prop.r = FALSE, prop.c = FALSE, prop.t = FALSE, prop.chisq = FALSE)
## Cell Contents
## |-------------------------|
## | N |
## |-------------------------|
##
## ================================================================
## ANES$gender
## ANES$choice_labeled Female Male Other Total
## ----------------------------------------------------------------
## By law Always Personal Choice 1044 862 7 1913
## ----------------------------------------------------------------
## By law Never 296 242 1 539
## ----------------------------------------------------------------
## Law Permits Extreme Cases 563 542 3 1108
## ----------------------------------------------------------------
## Law Permits If Need Established 294 315 0 609
## ----------------------------------------------------------------
## Total 2197 1961 11 4169
## ================================================================
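Raw counts are hard to compare when the groups differ in size. As a small sketch, base R's `prop.table()` converts a table of counts to proportions; the counts below are copied from the two-option gender table above, and the choice of column proportions is just one option.

```r
# Counts copied from the gender by choice01_labels table above
counts <- matrix(
  c(1901, 1719, 10,
     296,  242,  1),
  nrow = 2, byrow = TRUE,
  dimnames = list(choice = c("Conditional Yes", "Never"),
                  gender = c("Female", "Male", "Other"))
)

# margin = 2 divides each cell by its column total,
# so each gender's column sums to 1
col_props <- prop.table(counts, margin = 2)
round(col_props, 3)
```

Use `margin = 1` instead to get row proportions.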
Self identify as Strong Democrat to Strong Republican. Originally coded as 7-point ordinal scale:
1. Strong Democrat
2. Not very strong Democrat
3. Independent-Democrat
4. Independent
5. Independent-Republican
6. Not very strong Republican
7. Strong Republican
ANES$partyid <- ifelse(ANES$V161158x <1,NA, ANES$V161158x)
table(ANES$partyid)
##
## 1 2 3 4 5 6 7
## 890 559 490 579 500 508 721
hist(ANES$partyid) # base R histogram
ANES %>% ggplot(aes(partyid)) +
geom_histogram(binwidth = 1) # ggplot histogram with more options
## Warning: Removed 23 rows containing non-finite values (stat_bin).
People tend to cluster at the two extremes, Strong Democrat and Strong Republican, so this is a bimodal distribution.
Now let’s consolidate this scale. (I wouldn’t normally do this because you lose valuable information regarding how strongly someone considers themselves to be one thing or the other, but it makes a good example for recoding)
The mapvalues() function from the plyr package replaces each old value with the new value in the same position.
# V161158x: Democrat -> Republican, 3 consolidated options, Ordinal
ANES$partyid3 <- plyr::mapvalues(as.numeric(ANES$partyid), c(1,2,3,4,5,6,7), c('Democrat', 'Democrat', 'Democrat', 'Independent', 'Republican', 'Republican', 'Republican'))
table(ANES$partyid3)
##
## Democrat Independent Republican
## 1939 579 1729
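For an ordinal scale like this one, base R's `cut()` is another way to collapse adjacent values into groups. This is a sketch with a made-up `partyid` vector, not the ANES data.

```r
# Made-up 7-point party ID scores, with one missing value
partyid <- c(1, 2, 3, 4, 5, 6, 7, NA, 4, 1)

# Intervals are (0,3], (3,4], (4,7], so 1-3, 4, and 5-7 land in separate groups
partyid3 <- cut(
  partyid,
  breaks = c(0, 3, 4, 7),
  labels = c("Democrat", "Independent", "Republican")
)

table(partyid3, useNA = "ifany")
```

The result is a factor, so the three groups keep their Democrat to Republican order in tables and plots.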
CrossTable(ANES$choice_labeled, ANES$partyid3,chisq = FALSE, prop.c = FALSE, prop.t = FALSE, prop.r = FALSE, prop.chisq = FALSE)
## Cell Contents
## |-------------------------|
## | N |
## |-------------------------|
##
## ==============================================================================
## ANES$partyid3
## ANES$choice_labeled Democrat Independent Republican Total
## ------------------------------------------------------------------------------
## By law Always Personal Choice 1205 245 477 1927
## ------------------------------------------------------------------------------
## By law Never 149 66 327 542
## ------------------------------------------------------------------------------
## Law Permits Extreme Cases 303 157 649 1109
## ------------------------------------------------------------------------------
## Law Permits If Need Established 258 94 262 614
## ------------------------------------------------------------------------------
## Total 1915 562 1715 4192
## ==============================================================================
UIC’s data sources
UIC has a list of data sources of various types categorized as “Research Tools” and “Policy Documents”.
If you don’t remember what your options are, this is a good place to start. It links to HUD, American Community Survey, USA Gov, CMAP, National Low Income Housing Coalition, and Chicago Data Portal, just to name a few. UIC also allows students to access other large data repositories such as ICPSR, Policy Map, and more.
Use some form of version control to save your work!
- UIC’s Box, Github, OneDrive, GoogleDrive
Below is a partial list of vocab from the course. Tie some terms from the course into your final project:
Bimodal distribution: a set of observations where two values occur more often than the other values. A graph of the frequency shows two peaks in the distribution.
Accuracy: How close a number is to the true quantity being measured. Different than precision.
Precision: A measure of the level of detail, or resolution, in a number. 98.683 is more precise than 100, but if the true value is 101.00, then a measurement equal to 100 is more accurate.
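The arithmetic behind that example can be checked directly. The variable names below are made up for this sketch.

```r
true_value <- 101.00
coarse <- 100       # less precise (no decimals)
fine   <- 98.683    # more precise (three decimals)

# Accuracy is about distance from the truth, not the number of digits
error_coarse <- abs(coarse - true_value)  # 1
error_fine   <- abs(fine - true_value)    # 2.317

error_coarse < error_fine  # TRUE: the less precise number is the more accurate one
```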
Incidence: Number of new cases reported in a specific time period
Prevalence: Number of existing cases (cumulative)
Univariate Data: Univariate data involves analyzing one variable. It does not deal with causal relationships and it is mostly used to describe the variable.
Categorical Data: A categorical variable can only take on a specific set of values representing a set of possible categories. Even if a number is assigned to it for data recording purposes, the numbers do not mean anything (eg. taking the average of: 1. Blue, 2. Orange, 3. Green). This occurs most often when analyzing survey data when information is stored as numbered responses (that work for analysis) and labels (that describe the survey options). Categorical variables are also called nominal variables. I always remember this as nominal = numbers have no meaning.
Binary Variable: Categorical variables with only two options (Yes/No, 0/1, true/false, etc.). Synonyms include: dichotomous, logical, indicator, boolean
Ordinal: This is a type of categorical variable where the responses have a meaningful order. Age groups, income brackets, and Likert scales (Strongly Dislike, Dislike, Neutral, Like, Strongly Like) are all ordinal variables. Approach these carefully when doing analysis: the order is meaningful, but the distance between categories is not necessarily equal.
Continuous Variables: Data that can take on any value in an interval. ex. income, age, miles driven, etc.
The density for a continuous distribution is a measure of relative probability for getting a value close to x. The probability of getting a value in a particular interval is the area under the corresponding part of the curve. (Calc flashbacks…)
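To make the "area under the curve" idea concrete, here is a small sketch using the standard normal distribution (chosen only as a familiar example). It computes the probability of a value between -1 and 1 two ways.

```r
# Probability that a standard normal value falls between -1 and 1

# 1. Numerically integrate the density curve over the interval
area_numeric <- integrate(dnorm, lower = -1, upper = 1)$value

# 2. Use the cumulative distribution function directly
area_cdf <- pnorm(1) - pnorm(-1)

c(area_numeric, area_cdf)  # both are roughly 0.68
```

No calculus by hand is needed: `pnorm()` gives the area to the left of a value, so subtracting two of them gives the area between.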
Mean: The average: the sum of the values divided by the number of values.
In Excel: =AVERAGE(CELL_1:CELL_3), which is the same as =(CELL_1 + CELL_2 + CELL_3)/3
Median: Think of the thing between lanes in a road - the median. It is in the center. Equally distanced between both sides. Half of the observations are on one side, and half are on the other.
Medians are useful for “measuring the middle” when there are outliers.
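A quick sketch of why this matters, using made-up incomes (in thousands of dollars) with and without one extreme value:

```r
# Made-up incomes in thousands of dollars
incomes <- c(30, 35, 40, 45, 50)
incomes_outlier <- c(incomes, 1000)  # add one extreme value

mean(incomes)             # 40
median(incomes)           # 40

mean(incomes_outlier)     # 200: pulled far upward by the single outlier
median(incomes_outlier)   # 42.5: barely moves
```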
When is the mean an inappropriate statistic?
Bimodal distributions. Bimodal distributions indicate heterogeneity in the sample. If possible, look at the statistics for each group to gain more context.
Mode: Observation that occurs the most often.
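Base R has no built-in function for the statistical mode (its `mode()` function reports how an object is stored). A minimal sketch, using a hypothetical helper named `stat_mode()`:

```r
# Hypothetical helper: returns the most frequent value(s) as character strings
stat_mode <- function(x) {
  counts <- table(x)
  is_max <- as.vector(counts == max(counts))
  names(counts)[is_max]  # every tied value is returned
}

stat_mode(c(1, 7, 7, 4, 1, 7))   # "7"
stat_mode(c(1, 1, 2, 7, 7))      # "1" "7" (two modes)
```

Returning all tied values is a design choice here; it makes a bimodal set of responses visible instead of hiding one of the peaks.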
Measures of Dispersion: Assess how tightly clustered or spread out data points are
Variation: The tendency of variables to change from measurement to measurement
- Each measurement includes a small amount of error
Deviations: How far a measurement is from an expected value (e.g., a scatterplot dot from the best fit line)
Variance: The sum of squared deviations from the mean divided by n-1, where n is the number of data values.
Standard deviation: Square root of the variance.
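The definitions above can be followed step by step and compared against R's built-in `var()` and `sd()`. The data vector is made up for this sketch.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # made-up data
n <- length(x)

deviations <- x - mean(x)                        # distance of each value from the mean
variance_by_hand <- sum(deviations^2) / (n - 1)  # sample variance
sd_by_hand <- sqrt(variance_by_hand)             # standard deviation

# Both comparisons should return TRUE
all.equal(variance_by_hand, var(x))
all.equal(sd_by_hand, sd(x))
```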
Histograms: A histogram is similar to a barplot but is used for numeric data types. Instead of counting the occurrence of each individual value, a histogram divides the data into bins (buckets/ranges) depending on the entire data range and displays the count in each bin. It gives an idea about the distribution of the numeric variable.
# mpg and hwy are example choices from ggplot2's built-in data
ggplot(data = mpg) +
  geom_histogram(aes(x = hwy))
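To see the binning itself, base R's `hist()` can return the bins and counts without drawing a plot. The data and the bin edges below are made up for this sketch.

```r
# Made-up numeric data
x <- c(1, 2, 2, 3, 5, 6, 6, 7, 9, 10)

# plot = FALSE returns the bins and counts without drawing anything
h <- hist(x, breaks = c(0, 2, 4, 6, 8, 10), plot = FALSE)

h$breaks  # bin edges
h$counts  # number of observations in each bin: 3 1 3 1 2
```

Changing `breaks` changes the picture, which is why it is worth trying more than one bin width.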
Boxplots are a type of visual shorthand for a distribution. The box stretches from the 25th percentile to the 75th percentile of the distribution (so the middle 50% of all observations, also known as the interquartile range). In the middle of the box is a line that shows the median value of the observations. Together, these help show the spread of the distribution, whether it is symmetric around the median, and whether it is skewed in one direction or the other.
A line (or a whisker) extends from the ends of the box to the farthest nonoutlier point in the distribution.
ggplot(data = mpg) +
  geom_boxplot(
    mapping = aes(x = class, y = hwy)
  )
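The pieces of a boxplot can also be computed by hand. This sketch uses made-up data and the common 1.5 × IQR rule for flagging outliers; plotting functions may compute the quartiles slightly differently.

```r
# Made-up data with one unusually large value
x <- c(12, 15, 17, 18, 20, 21, 22, 24, 26, 60)

q1 <- unname(quantile(x, 0.25))  # bottom of the box
q3 <- unname(quantile(x, 0.75))  # top of the box
iqr <- IQR(x)                    # height of the box, same as q3 - q1

# Points more than 1.5 * IQR beyond the box are treated as outliers
upper_fence <- q3 + 1.5 * iqr
lower_fence <- q1 - 1.5 * iqr
outliers <- x[x > upper_fence | x < lower_fence]
outliers  # 60
```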
Bivariate data is used to find out if there is a relationship between two different variables. It is frequently represented with scatter plots where one variable is on the X axis and the other is on the Y axis. If the data seems to fit a line or curve then there may be a relationship, or correlation, between the two variables. Always be careful when examining relationships. Many variables may appear related when in fact their relationship happened by chance or a third variable is influencing both variables.
Scatterplots: Scatter plots can be used to identify if any relationship exists between numeric variables.
Test of Association: Used to test whether two variables are related to one another or not
- Depends on the nature of the variables you are studying (nominal, ordinal, interval) and number of categories
- Depends on the nature of the question you’re answering (independence, agreement of coders, effect of an intervention, etc.)
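For two nominal variables, one common choice is a chi-squared test of independence. The sketch below is only an illustration of the mechanics: it uses the Female and Male counts copied from the gender by choice table earlier in these notes, and it is not a claim about which test suits any particular project.

```r
# Counts copied from the earlier gender by choice01_labels table (Female and Male only)
counts <- matrix(
  c(1901, 1719,
     296,  242),
  nrow = 2, byrow = TRUE,
  dimnames = list(choice = c("Conditional Yes", "Never"),
                  gender = c("Female", "Male"))
)

test <- chisq.test(counts)  # applies a continuity correction by default for 2x2 tables

test$expected  # counts we would expect if the two variables were unrelated
test$p.value   # a small p-value suggests the variables are associated
```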
Questions to ask yourself:
“Variation creates uncertainty but covariation reduces it.”
Correlation is the extent that two continuous variables are related. A relationship exists when knowing the value of one variable is useful for predicting the value of a second variable.
Correlation Coefficient: symmetric, scale-invariant measure of association between two variables
- Ranges from -1 to +1.
- Strength of Correlation: 0 means no correlation, -1 and +1 imply perfect correlation (either one increases as the other decreases [-1] or they move in the same direction[+1])
Pearson Correlation
- assumes there is a normal distribution
- cor() can be used to compute correlation between two or more vectors
- cor.test() tells you if the correlation is significantly different from zero
- cor.test() has options for different correlation calculations (pearson (default), spearman, kendall)
- Spearman and Kendall correlation methods are NON-parametric, meaning they do not assume that the variables follow a normal distribution. For this course, you do not need to worry about non-parametric tests BUT if you were doing your own analysis, you must remember that the type of test you use depends on your data and look into which tests are appropriate.
cor(variable1, variable2, use = "complete.obs")
cor.test(v1, v2)
cor.test(v1, v2, method = "spearman") # Spearman's rho; use method = "kendall" for Kendall's tau (both nonparametric)
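Here is a runnable version of those calls using `mtcars`, a dataset that ships with R (chosen only as a convenient example). It also checks the two properties named above: symmetry and scale-invariance.

```r
# mtcars ships with R; mpg = miles per gallon, wt = weight in 1000 lbs
r <- cor(mtcars$mpg, mtcars$wt, use = "complete.obs")
r  # negative: heavier cars tend to get fewer miles per gallon

# Symmetric: the order of the variables does not matter
all.equal(r, cor(mtcars$wt, mtcars$mpg))

# Scale-invariant: converting weight to pounds leaves r unchanged
all.equal(r, cor(mtcars$mpg, mtcars$wt * 1000))

# cor.test() adds a p-value for whether r differs from zero
p_value <- cor.test(mtcars$mpg, mtcars$wt)$p.value
p_value
```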
Covariation describes the behavior between variables: the tendency for the values of two or more variables to vary together. (Reminder: variation describes behavior within a single variable.) The easiest way to spot covariation is to visualize the relationship between the two variables. The best way to visualize the relationship depends on the types of variables involved.
Two Categorical Variables
Covariation between categorical variables is visualized by examining the number of observations for each combination.
Contingency tables or Cross Tables: A tally of counts between two or more categorical variables. Cross tabulation groups variables to understand the correlation between different variables. It also shows how correlations change from one variable grouping to another. It is usually used in statistical analysis to find patterns, trends, and probabilities within raw data.