R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown, see http://rmarkdown.rstudio.com; also check Lessons 1 and 15 and the PDF document rmarkdown-2.0 on Canvas.

When you click the Knit button, a document will be generated that includes both the content and the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00
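In the .Rmd source, that chunk is delimited with standard knitr fences, which the rendered document hides; a sketch (the chunk label cars is optional):

```{r cars}
summary(cars)
```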

R Markdown can show the code, the result of the code, both, or neither.
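Which pieces appear is controlled by knitr chunk options in the chunk header; the standard options are listed below (the actual headers used in this document are not visible in the rendered output):

echo=FALSE       # hide the code, keep the output
results='hide'   # keep the code, hide the printed output
include=FALSE    # run the chunk but show neither
eval=FALSE       # show the code without running it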

Including Plots

You can also embed plots, for example:

qplot(speed, dist, data=cars) + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Note that the warning = FALSE parameter was added to the code chunk to prevent printing of the warnings.

It is also possible to prevent printing of the R code that generated the plot by adding the echo = FALSE parameter to the code chunk, as in the sketch below.
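A sketch of what that chunk could look like in the .Rmd source:

```{r, echo=FALSE, warning=FALSE}
qplot(speed, dist, data = cars) + geom_smooth()
```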

More on R Markdown

You can show some calculations in your text if you do them in the homework: for example, `r 5+5` renders as 10. The expression is not calculated if you put spaces between the backticks and the code; it then appears literally, as in r 5+5.

You can write equations as in LaTeX:

\[ E = mc^2 \]

Additional online resources: Yihui Xie’s knitr page, his book, and the cheatsheets.

Important notes

Publishing your practice homework

  • When you are done with the homework, click the publish document logo (the blue, roughly circular, eye-like icon in the top right corner). This will take you to RPubs, where you can publish using the fake name and username you created for your account.
  • Because your homework is public, don’t identify yourself by your real name. Copy the RPubs link to your work and submit it safely on Canvas or by email to me. For any identical submissions, whoever sent me their RPubs link last is assumed to have copied the others, so timely submission is important. Own your work. I may randomly ask you for your .Rmd file for double-checking purposes.

Abide by academic integrity

Academic integrity is the pursuit of scholarly activity in an open, honest and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Maps: just a few basic examples
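The map chunks below assume a setup chunk, not shown in the rendered output, that loads roughly the following packages (an assumption about the source .Rmd):

library(maps)      # map() and the "state" boundary database
library(ggplot2)   # map_data(), ggplot(), geom_polygon(), geom_map()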

Without color

map("state", boundary=TRUE, col="black")

Using no specific color scheme

states <- map_data("state")
ggplot(data = states) +
  geom_polygon(aes(x = long, y = lat, fill = region, group = group),
               color = "white") +
  coord_fixed(1.3) +
  guides(fill = "none")

State-level income

head(state.x77, 15) ## Built-in state-level data
##             Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
## Alabama           3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska             365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona           2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas          2110   3378        1.9    70.66   10.1    39.9    65  51945
## California       21198   5114        1.1    71.71   10.3    62.6    20 156361
## Colorado          2541   4884        0.7    72.06    6.8    63.9   166 103766
## Connecticut       3100   5348        1.1    72.48    3.1    56.0   139   4862
## Delaware           579   4809        0.9    70.06    6.2    54.6   103   1982
## Florida           8277   4815        1.3    70.66   10.7    52.6    11  54090
## Georgia           4931   4091        2.0    68.54   13.9    40.6    60  58073
## Hawaii             868   4963        1.9    73.60    6.2    61.9     0   6425
## Idaho              813   4119        0.6    71.87    5.3    59.5   126  82677
## Illinois         11197   5107        0.9    70.14   10.3    52.6   127  55748
## Indiana           5313   4458        0.7    70.88    7.1    52.9   122  36097
## Iowa              2861   4628        0.5    72.56    2.3    59.0   140  55941

Color based on state-level income

usdata <- data.frame(region = tolower(rownames(state.x77)), state.x77,
                     stringsAsFactors = TRUE)
mapIncome <- ggplot(usdata, aes(map_id = region)) +
  geom_map(aes(fill = Income), map = states) +
  scale_fill_gradientn(colours = c("lightblue", "darkblue")) +
  expand_limits(x = states$long, y = states$lat) +
  coord_fixed(1.3)
mapIncome
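The same pattern works for any other column of state.x77; for example, a hypothetical illiteracy map (a sketch, not part of the original homework):

ggplot(usdata, aes(map_id = region)) +
  geom_map(aes(fill = Illiteracy), map = states) +   # Illiteracy is another state.x77 column
  scale_fill_gradientn(colours = c("lightyellow", "darkred")) +
  expand_limits(x = states$long, y = states$lat) +
  coord_fixed(1.3)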

Word clouds and sentiment analysis
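The chunks in this section rely on several text-analysis packages, presumably loaded in a setup chunk that the rendered output does not show (an assumption):

library(tm)            # removeWords(), stopwords(), removePunctuation(), removeNumbers()
library(wordcloud)     # wordcloud()
library(RColorBrewer)  # brewer.pal()
library(quanteda)      # tokens(), dfm()
library(syuzhet)       # get_nrc_sentiment()
library(dplyr)         # %>%, group_by(), summarize()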

Nobel lectures of Guido W. Imbens and Joshua D. Angrist

Source: https://www.nobelprize.org/prizes/economic-sciences/2021/summary/

> The Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel 2021 was divided, one half awarded to David Card “for his empirical contributions to labour economics”, the other half jointly to Joshua D. Angrist and Guido W. Imbens “for their methodological contributions to the analysis of causal relationships.”

Parts of the introduction of their Nobel lectures: Source

nobel <- "Causality in Econometrics: Choice vs. Chance. Knowledge of causal effects is of great importance for decision makers in government, firms, as well as individuals in their private lives. Inferring the values of these effects from observed data is often a major challenge when causal mechanisms are not fully understood. These challenges have motivated methodological research in multiple disciplines. This research got a major boost in the 1920s and 1930s, thanks to advances in the design and analysis of randomized experiments in statistics and, separately, methodological work on observational studies in econometrics.
More recently, in the late 1980s and early 1990s, there was a sharp increase in empirical and methodological research in economics, as well as other disciplines, with an explicit focus on estimating causal effects. A convergence of the statistical and econometric traditions has been a catalyst for this increase. More than thirty years later, causality is a thriving
area of study. Researchers from many disciplines, including economics, statistics, political science, psychology, epidemiology, computer science and other fields, bring new questions and different methodological perspectives to the discussion. Applications range widely from biomedical to social science, with interest coming from academic, government, and private
sector organizations. 
In this lecture I discuss some of the themes of this field. Per the charge of the committee awarding the prize, this article focuses primarily on my contributions to the study of causality, but I shall place them in the context of the broader interdisciplinary literature. I start by discussing briefly some of the history of methods for causal inference in statistics and econometrics. I then discuss the credibility crisis in the 1980s that provided some of the motivation for the work that was recognized in the prize. After that I discuss some of my contributions to the causal inference literature. In that part of the paper, I will also add some background and color to the specific research I describe, discussing the origins and questions that motivated my collaborators and myself, as well as pivotal moments in my intellectual journey. I see this prize as a recognition of the importance of this general interdisciplinary enterprise and hope it further invigorates the field.
Empirical Strategies in Economics: Illuminating the Path from Cause to Effect. In a chapter in the Handbook of Labor Economics, Alan Krueger and I employed the phrase “empirical strategy” to describe econometric analysis of natural experiments like the one John Snow (1855) used to establish that cholera is a waterborne illness. The Handbook volume in question (Ashenfelter and Card, 1999) was edited by two of my Princeton Ph.D. thesis advisors, Orley Ashenfelter and David Card, leaders in the battle to bring empirical strategies like Snow’s into the econometric mainstream. Ashenfelter and Card’s quest for an empirical strategy that reliably captures the causal effects of government training programs inspired me and others at Princeton to explore the econometrics of program evaluation.
An empirical strategy for program or policy evaluation is a research plan that encompasses data collection, identification, and estimation. As Krueger and I used it, the term “identification” is shorthand for research design. The Prize I share with David Card and Guido Imbens recognizes the prominent role research design has come to play in modern economics. A randomized clinical trial (RCT) is the simplest and most powerful research design. Random assignment ensures that treatment and control groups are comparable in the absence of treatment, so differences between them after random assignment reflect only the treatment effect. Not surprisingly, though also not without resistance, RCTs have come to be both an aspiration and a benchmark for empirical strategies in economics.
This past October, I worried about what I should expect from the Economics Prize treatment effect. The spotlight and disruption accompanying the prize made me wonder how the Economics Prize celebrity might change life for the Angrist family. It soon dawned on me that the matter of how public recognition affects a scholar’s life is a simple causal question: the Economics Prize intervention is substantial, sudden, and well-measured; outcomes like health and wealth are easy to record. Although the Economics Prizes are probably not randomly assigned, a compelling empirical strategy for the Economics Prize treatment effect comes to mind, at least as a flight of empirical fancy."

Word cloud (very basic cleaning)

nobelClean <- nobel %>%
  tolower() %>%
  removeWords("’") %>% # curly apostrophe causing trouble
  removeWords("…") %>% # … causing trouble
  removeWords(stopwords("en")) %>%
  removePunctuation() %>%
  removeNumbers()
wordcloud(nobelClean, scale = c(2, 0.5), max.words = 200, random.order = FALSE,
          rot.per = 0.35, use.r.layout = FALSE, colors = brewer.pal(6, "Dark2"))

nobelClean_tokens <- tokens(nobelClean)
nobelClean_dfm <- dfm(nobelClean_tokens)
causal_counts <- nobelClean_dfm[, c("causal", "causality", "treatment", "design",
                                    "methodological", "econometrics", "econometric",
                                    "statistics", "empirical", "strategy")]
causal_counts
## Document-feature matrix of: 1 document, 10 features (0.00% sparse) and 0 docvars.
##        features
## docs    causal causality treatment design methodological econometrics
##   text1      7         3         5      4              4            4
##        features
## docs    econometric statistics empirical strategy
##   text1           3          3         9        4
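To see the most frequent terms overall, rather than a hand-picked list, quanteda’s topfeatures() works directly on the dfm; a usage sketch (not part of the original):

topfeatures(nobelClean_dfm, 10)  # the ten most frequent features with their counts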

Sentiment analysis of Elon Musk’s tweets

Elon Musk Tweets Datasets: here or there

muskfile <- read.csv("https://query.data.world/s/yusehiqh3mj4usgmbk4bd3nkquresv",
                     header = TRUE, stringsAsFactors = FALSE)
musk <- data.frame(date = muskfile$created_at,
                   tweet = as.character(muskfile$text),
                   stringsAsFactors = FALSE)
musk$tweet[1:20]
##  [1] "b'And so the robots spared humanity ... https://t.co/v7JUJQWfCv'"                                                                                               
##  [2] "b\"@ForIn2020 @waltmossberg @mims @defcon_5 Exactly. Tesla is absurdly overvalued if based on the past, but that's irr\\xe2\\x80\\xa6 https://t.co/qQcTqkzgMl\""
##  [3] "b'@waltmossberg @mims @defcon_5 Et tu, Walt?'"                                                                                                                  
##  [4] "b'Stormy weather in Shortville ...'"                                                                                                                            
##  [5] "b\"@DaveLeeBBC @verge Coal is dying due to nat gas fracking. It's basically dead.\""                                                                            
##  [6] "b\"@Lexxxzis It's just a helicopter in helicopter's clothing\""                                                                                                 
##  [7] "b\"@verge It won't matter\""                                                                                                                                    
##  [8] "b'@SuperCoolCube Pretty good'"                                                                                                                                  
##  [9] "b\"Why did we waste so much time developing silly rockets? Damn you, aliens! So obtuse! You have all this crazy tech, but can't speak English!?\""              
## [10] "b'Technology breakthrough: turns out chemtrails are actually a message from time-traveling aliens describing the secret of teleportation'"                      
## [11] "b\"RT @OpenAI: We've created the world's first Spam-detecting AI trained entirely in simulation and deployed on a physical robot: https://t.co\\xe2\\x80\\xa6\""
## [12] "b'RT @ProfBrianCox: This is extremely important from @elonmusk and @SpaceX - reusable rockets bring us MUCH closer to becoming a spacefaring\\xe2\\x80\\xa6'"   
## [13] "b'@adamsbj Def P100D with Ludicrous+, although the rocket starts going a lot faster after that'"                                                                
## [14] "b'@BadAstronomer We can def bring it back like Dragon. Just a question of how much weight we need to add.'"                                                     
## [15] "b'@tesla_addict @TeslaMotors Working on it'"                                                                                                                    
## [16] "b\"@jasonlamb Looks like it could do 20% more with some structural upgrades to handle higher loads. But that's in fully expendable mode.\""                     
## [17] "b'@cheron A lot'"                                                                                                                                               
## [18] "b'@Cardoso Silliest thing we can imagine! Secret payload of 1st Dragon flight was a giant wheel of cheese. Inspired b\\xe2\\x80\\xa6 https://t.co/68nMJkiPsC'"  
## [19] "b'@redletterdave Good point, odds go from 0% to &gt;0% :)'"                                                                                                     
## [20] "b'Falcon Heavy test flight currently scheduled for late summer'"
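As the printout shows, the tweets were scraped as Python bytes literals: each one is wrapped in b'...' or b"..." and long tweets carry escaped UTF-8 sequences such as \xe2\x80\xa6 (an ellipsis). A hypothetical cleanup step, not part of the original analysis, could strip this debris before scoring:

# Hypothetical cleanup, not in the original analysis:
musk$tweet <- gsub("^b['\"]|['\"]$", "", musk$tweet)      # drop the b'...' wrappers
musk$tweet <- gsub("\\\\x[0-9a-f]{2}", "", musk$tweet)    # drop literal \xNN escape sequences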

Scoring feelings behind each tweet

feelings <- get_nrc_sentiment(musk$tweet)
head(feelings, 20)
##    anger anticipation disgust fear joy sadness surprise trust negative positive
## 1      0            0       0    0   1       0        0     1        0        1
## 2      0            0       0    0   0       0        0     0        0        0
## 3      0            0       0    0   0       0        0     0        0        0
## 4      0            0       0    0   0       0        0     0        0        0
## 5      1            1       1    2   0       1        0     0        2        0
## 6      0            0       0    0   0       0        0     0        0        0
## 7      0            1       0    1   0       0        0     0        1        0
## 8      0            1       0    0   1       0        0     1        0        1
## 9      2            1       2    1   1       1        0     0        5        0
## 10     0            1       0    0   0       0        0     1        0        1
## 11     0            0       0    0   0       0        0     0        0        0
## 12     0            0       0    0   0       0        0     1        0        1
## 13     1            0       0    0   0       0        0     0        1        0
## 14     0            1       1    2   1       1        1     1        1        2
## 15     0            0       0    0   0       0        0     0        0        1
## 16     0            0       0    0   0       0        0     2        0        1
## 17     0            0       0    0   0       0        0     0        0        0
## 18     0            0       0    2   1       0        1     2        0        1
## 19     0            1       0    0   1       0        1     1        0        1
## 20     0            0       0    0   0       1        0     0        1        0

A little cleaning

# Merge the tweet data and the feelings matrix
musk <- cbind(musk, feelings)
# Collapse the data to monthly totals:
musk$month <- format(as.Date(musk$date), "%Y-%m")
muskMonthly <- musk %>%
  group_by(month) %>%
  summarize(sumNeg = sum(negative), sumPos = sum(positive))
# Compute a monthly sentiment score: the ratio of positive to negative words
muskMonthly$positivity_ratio <- muskMonthly$sumPos / muskMonthly$sumNeg
# Make the dates look better:
muskMonthly$month <- as.Date(paste(muskMonthly$month, "01", sep = "-"))
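One caveat the chunk above does not handle: a month with no negative words gives sumNeg = 0 and an infinite ratio. A hypothetical guard (an assumption, not in the original) is add-one smoothing:

# Add-one smoothing to avoid division by zero (hypothetical, not in the original)
muskMonthly$positivity_smoothed <- (muskMonthly$sumPos + 1) / (muskMonthly$sumNeg + 1)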

Positivity ratio over time

ggplot(muskMonthly, aes(x=month, y=positivity_ratio)) + geom_line()
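A lightly polished version with informative labels, a sketch building on the same data:

ggplot(muskMonthly, aes(x = month, y = positivity_ratio)) +
  geom_line() +
  labs(x = "Month", y = "Positive-to-negative word ratio",
       title = "Monthly positivity ratio of Elon Musk's tweets")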