Homework 3

##[Michael Sanjurjo Jr]

Instructions

Due by midnight October 30th. Upload the .Rmd file and the knit .html file. 10 point penalty for not uploading in the appropriate file formats.

As with prior homeworks, this homework is in two parts. Part 1 will review what we’ve learned during the lab sessions, using the social capital data set. Part 2 will ask you to continue building your final project, your data self portrait. Each part is worth 50 points.

For your code, you will be reviewed on its appropriateness, readability, and parsimony. That is to say that you should be using the functions and packages we discuss in class, prioritizing readable code and text in markdown, and writing code that is simple but effective. Copying and pasting from ChatGPT or other gen AI platforms will result in a 0 for the whole assignment, regardless of whether or not you got the right answer.

Please type out and initial the Middlebury honor code at the bottom of this file.

Part 1 - 50 Points

Q1 - Q5 (2 pts each, 10 points total)

For the following 5 questions, consider a scenario where you have four different dataframes:

DF1 : 100 observations of variables A, B, and C
DF2 : 25 additional observations of variables A, B, C (no repeats from DF1)
DF3 : 100 observations of variables A, D, and E (column A is an exact match to DF1$A)
DF4 : 50 observations of variables A and F (column A matches 50 rows in DF1$A)

Q1 Which two data frames could you combine with bind_rows() and what are the dimensions of the new data frame?

You can combine DF1 and DF2 because they have the same columns (variables) but different rows. The new data frame would have 125 observations (rows) and 3 variables (columns).

Q2 Which two data frames could you combine with bind_cols() and what are the dimensions of the new data frame?

You can combine DF1 and DF3 using bind_cols() since they have the same number of observations (rows). The new data frame would have 100 observations and 6 columns.

Q3 DF1 and DF4 could be combined to produce a data frame with 100 observations and 4 variables using what function?

Since column A in DF4 matches 50 of the rows in DF1, I could combine the data frames using the left_join() function, which keeps all 100 rows of DF1 and fills F with NA where there is no match.

Q4 What would be the correct order of those two data frames within your chosen function?

DF1_4 <- left_join(DF1, DF4, by = "A")

BONUS POINT What would be the dimensions of the new data frame if you reversed that order?

There would be 50 observations and 4 variables. With the order reversed, left_join(DF4, DF1, by = "A") keeps every row of DF4 and adds the matching columns from DF1, so only DF4’s 50 rows remain. The columns would be A, F, B, and C.

Q5 You can use left_join() and bind_cols() to combine two of the above data frames with nearly the same results (One new data frame would have two A columns, one would have just one A column). What are those two dataframes?

The two data frames are DF1 and DF3. If you use left_join(DF1, DF3, by = "A"), the A columns are merged because they are an exact match, leaving only one A column. If you use bind_cols(DF1, DF3), the new data frame keeps the A column from both DF1 and DF3, giving two A columns.
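The dimensions in Q1 through Q5 can be checked with a small sketch. The values below are made up, and the only assumption is that column A works as a row ID the way the scenario describes.

```r
library(dplyr)

# Toy stand-ins for the four data frames in the scenario (values are made up)
DF1 <- tibble(A = 1:100, B = rnorm(100), C = rnorm(100))
DF2 <- tibble(A = 101:125, B = rnorm(25), C = rnorm(25))
DF3 <- tibble(A = 1:100, D = rnorm(100), E = rnorm(100))
DF4 <- tibble(A = 1:50, F = rnorm(50))

dim(bind_rows(DF1, DF2))            # Q1: 125 rows, 3 columns
dim(bind_cols(DF1, DF3))            # Q2: 100 rows, 6 columns (A appears twice, renamed)
dim(left_join(DF1, DF4, by = "A"))  # Q3 and Q4: 100 rows, 4 columns (F is NA where unmatched)
dim(left_join(DF4, DF1, by = "A"))  # Bonus: 50 rows, 4 columns
dim(left_join(DF1, DF3, by = "A"))  # Q5: 100 rows, 5 columns (a single A column)
```

bind_cols() prints a message about new names because it has to rename the two A columns to keep them distinct.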

Q6- Q9 - Import and Examine Data (10 points)

Q6 Import the homework 3 data set here. You should import 2 files. (2 points)

library(tidyverse)

DailyV20t25 <- read_csv("~/Desktop/Data_in_the_SW/Homework3Data/DailyViews_2020to2025.csv")

DailyV25 <- read_csv("~/Desktop/Data_in_the_SW/Homework3Data/DailyViews_2025ytd.csv")

Q7 Look at the data, including the names of the files. Write a brief description of the two data sets and how they are related (2 sentences, 4 pts).

The first data set, “DailyV20t25”, has one row for every day from Jan. 1, 2020 to Dec. 31, 2024, totaling 1827 observations. The second data set, “DailyV25”, has the same variables but covers Jan. 1, 2025 to Oct. 21, 2025, totaling 294 observations, so it picks up where the first one ends.

Q8 What are the dimensions of the two data sets? (2 pts)

Daily Views 2020 to 2025 has 4 variables (columns) and 1827 observations (rows).

Daily Views 2025 has 4 variables (columns) and 294 observations (rows).

Q9 Is the data in wide or long format? (2 pts)

The data is in wide format: each holiday has its own column of views, with one row per date.
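As a rough illustration of the difference between the two formats, the sketch below reshapes a tiny made-up table. The dates and view counts are invented, and only two holiday columns are used to keep it short.

```r
library(tidyr)
library(tibble)

# Made-up numbers in a wide layout: one column per holiday
views_wide <- tibble(
  Date = c("1/1/20", "1/2/20"),
  Halloween = c(10, 12),
  Thanksgiving = c(5, 7)
)

# Long layout: one row per date-and-holiday combination
views_long <- pivot_longer(
  views_wide,
  cols = c(Halloween, Thanksgiving),
  names_to = "Holiday",
  values_to = "Views"
)

dim(views_wide)  # 2 rows, 3 columns
dim(views_long)  # 4 rows, 3 columns
```

The long version has more rows but a single Views column, which is the shape ggplot needs in order to color lines by holiday.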

Q10. Views over time. (10 points)

For this question, I want you to produce a figure that plots the views for each holiday’s Wikipedia page from 2020 to October 21st, 2025.

You will need to transform the datasets I provided so that you have one data frame (not two!), with variables for the date, page, and views.

When plotting the date, you’ll need to tell R to interpret the variable as a date. You can use the following code to transform the date variable!

# as.Date(Date, "%m/%d/%y")

Use this either when you call date within ggplot or as a transformation to the data before you start graphing. This tells R to understand that variable as a date in the format of month / day / year.
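A minimal sketch of what that format string does, using two made-up date strings rather than the homework data:

```r
# Made-up date strings in month/day/two-digit-year form
raw_dates <- c("10/31/24", "7/4/25")

parsed <- as.Date(raw_dates, format = "%m/%d/%y")
parsed         # a Date vector, printed as year-month-day
class(parsed)  # "Date"

# Once parsed, dates compare and sort chronologically instead of alphabetically
parsed[2] > parsed[1]  # TRUE
```

Left as text, "7/4/25" would sort after "10/31/24" only by accident of its characters, which is why the conversion matters before plotting or filtering.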

DailyV_ALL <- bind_rows(DailyV20t25, DailyV25)

# Column names with spaces or apostrophes need backticks
DailyV_long <- DailyV_ALL |>
  pivot_longer(
    cols = c(Halloween, Thanksgiving, `New Year's Eve`),
    names_to = "Holiday",
    values_to = "Views"
  )

ggplot(DailyV_long, aes(x = as.Date(Date, "%m/%d/%y"), y = Views, color = Holiday)) +
  geom_line(linewidth = 1) +
  scale_color_manual(values = c("Halloween" = "blue", "Thanksgiving" = "red", "New Year's Eve" = "yellow")) +
  theme_classic() +
  labs(title = "Views over time", x = "Date", y = "Views", color = "Holiday") +
  theme(legend.position = "top")

10 points total. 5 points for correct data transformation, 5 points for producing a figure with the right variables. Show your code

BONUS 2 POINTS Change the colors for each holiday so Halloween = orange, Thanksgiving = red, and New Year’s Eve = navy, and set linewidth = 3.

ggplot(DailyV_long, aes(x = as.Date(Date, "%m/%d/%y"), y = Views, color = Holiday)) +
  geom_line(linewidth = 3) +
  scale_color_manual(values = c("Halloween" = "orange", "Thanksgiving" = "red", "New Year's Eve" = "navy")) +
  theme_classic() +
  labs(title = "Views over time", x = "Date", y = "Views", color = "Holiday") +
  theme(legend.position = "top")

Q11 Highest Views (5 points)

Which of our three holidays got the highest cumulative number of views from 2020 until today? Does the answer surprise you? Why or why not?

Hint: you should use the same dataframe that you created in Q10.

Show your code! Only 1 point for correct answer without code.

holiday_totals <- DailyV_long |>
  group_by(Holiday) |>
  summarise(Total_Views = sum(Views, na.rm = TRUE)) |>
  arrange(desc(Total_Views))

holiday_totals

Halloween has the highest cumulative views. This doesn’t really surprise me: of the three holidays, Halloween is the one whose origins I know the least about, so maybe that’s why people look it up more often.

Q12 July 4th ( 5 points)

On July 4th of this year (2025) how many people were looking up each of our Holidays? Show your code! Only 1 point for correct answer without code

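# Date is stored as month/day/year text, so parse it before comparing
July4_2025 <- DailyV_long |>
  filter(as.Date(Date, "%m/%d/%y") == as.Date("2025-07-04"))

July4_2025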

Halloween had 1380 views, Thanksgiving had 1730 views, and New Year’s Eve had 290 views.

Q13 Frankenstein (10 points)

Run the following code to download the entire book of Frankenstein from Project Gutenberg. We’ll keep the Halloween theme going!

library(gutenbergr)

Frankenstein <- gutenberg_download(84)

## Determining mirror for Project Gutenberg from
## https://www.gutenberg.org/robot/harvest.
## Using mirror http://aleph.gutenberg.org.

# Every book in Project Gutenberg has an id that allows you to download it. You can look up books using this code (just remove the hashtag and add your particular query).

gutenberg_works() |>
filter(str_detect(title, "YOUR QUERY!"))

## # A tibble: 0 × 8
## # ℹ 8 variables: gutenberg_id <int>, title <chr>, author <chr>,
## #   gutenberg_author_id <int>, language <fct>, gutenberg_bookshelf <chr>,
## #   rights <fct>, has_text <lgl>

Frankenstein <- Frankenstein |> filter(text != "") # this just helps clean a little! Removing all the blank lines for page breaks.

Following our steps from Lab7, identify the top 10 most frequently occurring non-stop words in Frankenstein.

# install.packages("tidytext") # run once in the console rather than in the .Rmd
library(tidytext)

Frankenstein_short <- Frankenstein |> select(gutenberg_id, text)

# One row per word, then drop stop words ("the", "and", ...) using tidytext's stop_words
Frankenstein_short_words <- Frankenstein_short |>
  unnest_tokens(word, text) |>
  anti_join(stop_words, by = "word")

# Count how many times each remaining word occurs
total_words <- Frankenstein_short_words |>
  group_by(word) |>
  summarize(total = n())

# Ten most frequent non-stop words, most frequent first
top_words <- total_words |>
  arrange(desc(total), word) |>
  slice_head(n = 10)

top_words

ggplot(top_words, aes(x = total, y = reorder(word, total))) +
  geom_col() +
  labs(x = "Occurrences", y = "Word")

The top 10 words are life, father, eyes, time, night, elizabeth, found, mind, heart, and day. I messed up a step somewhere when making my word_frequency variable, and instead ended up with columns called “n” and “total” that show the word frequency. I am not sure how that happened.
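The stop-word step can be sketched on a tiny made-up text. This is only an illustration, not the lab's exact code; it also shows one likely source of the “n” and “total” columns, since count() names its frequency column “n” by default while summarize(total = n()) names it “total”.

```r
library(dplyr)
library(tidytext)

# A made-up two-line text standing in for the novel
toy <- tibble(text = c("The monster saw the fire", "The fire was warm"))

toy_counts <- toy |>
  unnest_tokens(word, text) |>           # one lowercase word per row
  anti_join(stop_words, by = "word") |>  # drop common words such as "the"
  count(word, sort = TRUE)               # count() calls the frequency column "n"

toy_counts
```

Running both count() and summarize(total = n()) on the same words would produce two frequency columns holding the same numbers under different names.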

Part 2 - 50 Points

Question 1

As part of your final project, you will build three visualizations. By now we’ve reviewed the basics of ggplot in class and you’ve looked at a wide variety of data visualizations in the readings. Looking at your returned data package, describe two figures that you would like to create for your final project. Note: you aren’t yet building the figures, but identifying what you want to create! This is the first step in any sort of visualization project. Often, I map out my figures on a piece of scrap paper or my whiteboard before I start coding! It might be helpful to do that in order to describe your figure here.

Answer each of the following questions for those two figures:

What will be the x axis?
What will be the y axis?
What form will your figure take (bars, lines, points, think about what geom() you would use!)?
Will there be any groups (colors, different facets, etc) if so, what variables will determine these?
What will this graph show us? Maybe you can’t tell from looking at your data what you’ll find, but what do you expect and/or why is this interesting?
What are the steps you’ll have to take to be ready to plot your data? Write them out here. It’s ok if you don’t know how to do every step, especially if your data package isn’t a csv file! I’m looking for an understanding of how you’ll need to transform your data to accomplish your plans

Figure 1

My x-axis would be “search history queries”. My y-axis would be “inferences made about your interests”. The inferences are a list of topics that I assume were drawn from my searches and likes. I would use this figure to show the relationship between my searches and the inferences about my interests. It would also make the list of inferences clearer by ranking them based on their related search queries. I believe a bar graph would be most suitable for this hypothesis. To build this data frame, I would need to categorize my search queries by filtering for keywords for each topic. I would also need to decide how to consolidate related areas of interest into a single category; for example, there is one category named “Men Curly hairstyles” and others named “Afro” or “Cornrows”. My data frame would most likely need to be in long format, because there are sub-categories within big categories that I need to sort using cleaning and binding functions. Since my data is an HTML file, I would need to be careful about identifying headings in my code so that RStudio knows how to read and sort my data. I often struggle in RStudio to locate the data I’m asking for, so this project will require attention to detail.

Figure 2

The main goal of my analysis is to determine in which months of the year I am most active on Pinterest. To do this, my x-axis would represent “months of the year,” and my y-axis would represent “user sessions.” I want to see if the timing of my Pinterest activity corresponds with when I receive more clothing ads while searching for fashion inspiration. This data can be further refined by reviewing my search history and identifying when I searched for specific outfit options, such as distinguishing Halloween costume searches from casual searches like “80-degree weather outfits summer men.” I’d need to filter out unrelated searches, like “action figures,” that usually occur in December for Christmas shopping. A line graph would best visualize these time-specific user sessions, highlighting spikes during certain months or years. A problem with my hypothesis about ads is that the “user events” data, which covers ads, isn’t categorized by topic. The ads are listed by how I interacted with them, but without time stamps. For example, the data shows whether I watched an ad entirely or just skipped it. I would need to check and categorize all the ads myself.
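The counting step for Figure 2 might look something like the sketch below. The timestamps and the column names timestamp, month, and user_sessions are hypothetical; the real Pinterest export may use different names and formats and would need to be parsed first.

```r
library(dplyr)
library(ggplot2)

# Hypothetical session dates; the real export may be structured differently
sessions <- tibble(
  timestamp = as.Date(c("2025-01-03", "2025-01-20", "2025-06-11",
                        "2025-10-02", "2025-10-15", "2025-10-30"))
)

# Count sessions per calendar month
monthly <- sessions |>
  mutate(month = as.integer(format(timestamp, "%m"))) |>
  count(month, name = "user_sessions")

ggplot(monthly, aes(x = month, y = user_sessions)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = 1:12, labels = month.abb, limits = c(1, 12)) +
  labs(x = "Month of the year", y = "User sessions")
```

Months with no sessions do not appear in the counts at all, so the line simply connects the months that do have data.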

15 points per figure. 30 points total

Question 2

Also as part of your final project, you’ll need to cite at least 2 academic references that help contextualize your data. These should not come from the syllabus. To get started, think about the variables you’re most interested in and how you use your platform of choice, then find an academic, peer-reviewed article that helps you understand your data better. I’m only asking for one paper now, but if you find two or more relevant ones, make sure to save them so you can use them for your final project!

In addition to my office hours, the librarians are well equipped to help you find relevant literature!

Read the paper and write a short description of it (~250 words)

What’s the research question?
What data and analysis methods did the researcher(s) use?
What did they find?
How does it relate to your data and/or visualization plans?
Include a citation, properly following APA citation styles.

20 points. An AI “hallucinated” paper or description will result in a 0 for the whole assignment.

The authors provide detailed explanations of Pinterest’s data policies concerning data storage, user content, and intellectual property rights. Using this information, the article investigates how users perceive Pinterest’s policies and how much they trust the platform with their data. The authors point out that consumers and users often do not fully understand or read the policies and agreements when creating an account. Between April 5 and July 5, 2019, 365 users across 41 countries were surveyed to analyze their perceptions of Pinterest’s policy agreements. “According to our survey, 16.9 percent of users read the terms and conditions and guidelines” (Kasakowskij et al., 2021, p. 10). This suggests that users generally trust Pinterest’s policies and data collection even though most are unaware of how their data is being collected and used. The article also offered me some new insights into Pinterest’s globalization as a social media and shopping platform. The trademark policies in Germany differ slightly from those I am familiar with in the U.S. The authors note, “In Germany there is also the possibility to protect a trademark without registration. An entry of the trademark in a register is not necessary, because under certain conditions the mere use of the trademark can be sufficient to justify its protection” (Kasakowskij et al., 2021, p. 4). Branding, trademark, and marketing policies are crucial for protecting consumers from scams and misleading advertisements. For instance, a company might operate under a logo similar to that of a well-known “high fashion” brand without being officially recognized and verified, yet still have access to Pinterest impression ads.

In conclusion, this article would be more useful for my second figure hypothesis because it provides more insight into Pinterest policies, data sharing, and collection for marketing purposes; however, it offers little information about the search engine and search history, which would be valuable for my first figure hypothesis.

Kasakowskij, T., Kasakowskij, R., & Fietkiewicz, K. J. (2021). “Can I pin this?” The legal position of Pinterest and its users: An analysis of Pinterest’s data storage policies and users’ trust in the service. First Monday, 26(7). https://doi.org/10.5210/fm.v26i7.11477

I have neither given nor received unauthorized aid on this assignment

Homework3

2025-10-22