2026-4-5

Today’s Class

  • Warm-up: collaboration activity
  • Intro to GitHub
  • Three types of collaborative data
  • 10 Features of Big Data
  • Start with GitHub

Wednesday’s Class

  • Managing Coding: RStudio, GitHub
  • Gathering Data with R
  • Creating and handling data
  • Running, committing, pull requests and branches

Office Hours

  • Fridays, 1:30pm-3:00pm (Tyler)
  • Tuesdays, 10:30am-12:00pm (Yao)

Miscellaneous

  • Absences
  • R Markdown issues
  • Readings

Groups!

##          group 1           group 2         group 3           group 4
## 1 Devir, Lindsey   Batson, Anthony    Myoung, Sein       Smith, Reid
## 2  Leahy, Olivia     Pacheco, Alex     Qin, Celine   Kang, Christine
## 3   Barga, Jolie Wolfenstein, Luci Bell, Mary Rose Mahoney, Brigette
##          group 5      group 6
## 1    Ong, Alyssa Mendoza, Ava
## 2 Knowles, Genny             
## 3  Moore, Allana  Pham, Canon

System for Collaboration Warm-up

  • In groups: how would you design a system for collaboration on the issues named?
  • See handout for details

Intro to GitHub

Git and GitHub

Git is a version control system.

GitHub is a way to share your version history online with collaborators.

GitHub

  • Some basic functions:
  • Commit: save a snapshot of your changes to your local version history
  • Push: send your commits “up” to the shared repository on GitHub
  • Pull: bring collaborators’ commits “down” from the shared repository
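
To make the direction of each operation concrete, here is a toy model in Python. It is only a sketch: real Git tracks hashes, parent commits, and file contents, whereas this model treats a repository as a plain list of commit messages. The `Repo` class, the `push` and `pull` helpers, and the collaborators `alice` and `bob` are all hypothetical names made up for illustration.

```python
# Toy model of commit, push, and pull. Real Git stores far more than this;
# a list of commit messages stands in for the version history.

class Repo:
    """A repository modeled as an ordered list of commit messages."""

    def __init__(self, history=None):
        self.history = list(history or [])

    def commit(self, message):
        # Commit: record a snapshot in this repository only.
        self.history.append(message)


def push(local, remote):
    # Push: send local commits "up" to the remote.
    # Only allowed when the remote history is a prefix of the local one.
    if local.history[:len(remote.history)] != remote.history:
        raise ValueError("remote has commits you lack; pull first")
    remote.history = list(local.history)


def pull(local, remote):
    # Pull: bring remote commits "down" into the local copy.
    # Only allowed when the local history is a prefix of the remote one.
    if remote.history[:len(local.history)] != local.history:
        raise ValueError("histories diverged; a merge is needed")
    local.history = list(remote.history)


github = Repo(["initial commit"])   # shared copy on GitHub
alice = Repo(github.history)        # hypothetical collaborator
bob = Repo(github.history)          # hypothetical collaborator

alice.commit("add data cleaning script")
push(alice, github)                 # Alice's work goes "up"
pull(bob, github)                   # Bob brings it "down"
print(bob.history)
```

In this model a push is refused when someone else has pushed first, which mirrors why collaborators usually pull before they push.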

GitHub

  • Other functions:
  • Fork: Create your own copy of an existing repository and build on it
  • Branch: Create parallel streams of work (e.g. working version and editing version)

GitHub

  • In the activity, the central question is how to merge information up and down when it is changing on both ends
  • Such scenarios are common in computational tasks and collaboration
  • A tool like GitHub could help. For example:
  • A principal could create a repository with attendance information and notes
  • Teachers could pull data on expected students and push information on actual attendance
  • Teachers might use branches to investigate ambiguous information (e.g. substitute recordings)
  • A principal could merge branches when information is finalized
  • There are other ways to do this! (no single solution)
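
The attendance scenario can be sketched as a three-way merge: compare each side against the last shared version and keep whichever side changed. The Python below is a minimal illustration under simplifying assumptions, not how Git merges files. The `merge` function, the student labels, and the status values are all hypothetical, and a student missing from one side is treated as unchanged there.

```python
def merge(base, principal, teacher):
    """Three-way merge of {student: status} attendance records.

    A value changed on only one side is taken from that side. If both
    sides changed it to different values, the student is reported as a
    conflict and the base value is kept.
    """
    merged, conflicts = {}, []
    for student in sorted(set(base) | set(principal) | set(teacher)):
        b = base.get(student)
        p = principal.get(student, b)  # missing on a side = unchanged there
        t = teacher.get(student, b)
        if p == t:
            merged[student] = p        # both sides agree
        elif p == b:
            merged[student] = t        # only the teacher changed it
        elif t == b:
            merged[student] = p        # only the principal changed it
        else:
            merged[student] = b        # both changed it differently
            conflicts.append(student)
    return merged, conflicts


# Last shared version: the principal's list of expected students.
base = {"Student A": "expected", "Student B": "expected", "Student C": "expected"}

# Both ends change after the teacher pulls.
principal = {**base, "Student C": "withdrawn"}
teacher = {**base, "Student A": "present", "Student B": "absent"}

merged, conflicts = merge(base, principal, teacher)
print(merged)
print(conflicts)

# If the teacher also edits Student C, the two edits collide.
teacher_conflict = {**base, "Student C": "present"}
print(merge(base, principal, teacher_conflict)[1])
```

Reporting conflicts instead of silently picking a side is one design choice among several; someone still has to decide which record is correct.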

Types of Collaborative Data

Recall: Custom made vs. ready made

What about collaborative data?

Human Computation

Open Calls

  • Example: Netflix Prize
  • 2006
  • Data: user, movie, date of grade, grade
  • Key element: training set and test set
  • In 2009, a group beat Netflix’s algorithm by 10%, winning the challenge
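
The training set and test set idea can be sketched in a few lines of Python: fit a predictor on one part of the data, then score it on held-out ratings it never saw. Everything here is an assumption for illustration. The ratings are randomly generated rather than Netflix data, the 80/20 split is arbitrary, the per-movie average is a deliberately simple baseline, and root mean squared error (RMSE) is used as the score.

```python
import math
import random

# Made-up ratings as (user, movie, rating) tuples; not Netflix data.
random.seed(0)
ratings = [(u, m, random.randint(1, 5)) for u in range(20) for m in range(5)]

# Hold out 20% of the ratings as a test set (the ratio is an assumption).
random.shuffle(ratings)
cut = int(0.8 * len(ratings))
train, test = ratings[:cut], ratings[cut:]

# "Train" a baseline: the average rating of each movie in the training set.
global_mean = sum(r for _, _, r in train) / len(train)
by_movie = {}
for _, movie, rating in train:
    by_movie.setdefault(movie, []).append(rating)
movie_mean = {m: sum(rs) / len(rs) for m, rs in by_movie.items()}


def predict(user, movie):
    # Fall back to the overall average for movies absent from training data.
    return movie_mean.get(movie, global_mean)


def rmse(rows):
    # Root mean squared error between predicted and actual ratings.
    return math.sqrt(sum((predict(u, m) - r) ** 2 for u, m, r in rows) / len(rows))


print(f"train RMSE: {rmse(train):.3f}")
print(f"test RMSE:  {rmse(test):.3f}")
```

Scoring on the held-out set is what lets an open call compare many teams fairly: nobody can tune their algorithm against the answers used for judging.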

Distributed Data Collection

Assess Collaboration Examples

  • In groups:
  • Why did the collaborative data collection work?
  • What are potential risks?
  • Could a similar model be applied to other social science tasks?
  • Groups 1, 2: Galaxy Zoo. Groups 3, 4: Netflix. Groups 5, 6: eBird.

Galaxy Zoo

  • Humanly possible … but takes a long time!
  • Fun
  • Relatively easy (for humans)
  • Difficult (for machines)

Netflix Prize

  • Fun
  • Massive amount of data (>100,000,000 records)
  • Money involved
  • Very likely that someone out there had a better algorithm than Netflix
  • In late 2009, a group filed a privacy lawsuit against Netflix, which was later settled

eBird

  • Birders are already birding!
  • Risk for rare/poached birds

10 Features of Big Data

Think …

  • On your own, write down one example of “big data”
  • Maybe something you are interested in analyzing or using for your class project

Big!

  • Too much to analyze individually (e.g. 100,000,000 observations in Netflix competition)

Always On

  • Always tracking data (e.g. social media)

Nonreactive

  • People don’t know they are being observed (e.g., when they use Google Search)

Incomplete

  • Often missing key components for social theory, such as information on users (e.g. mobile phone data)

Inaccessible

  • The best datasets are often not publicly available (e.g. tax records)

Nonrepresentative

  • Users are not drawn from a probability sample of the population (e.g. Reddit forums)
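
A small simulation can show why this matters. The Python sketch below compares a probability sample with a self-selected "platform" sample drawn from the same population. All numbers are invented: the age range, the sample size, and especially the assumption that the chance of joining the platform falls with age.

```python
import random

random.seed(1)

# Simulated population of ages 18 to 80; entirely made up.
population = [random.randint(18, 80) for _ in range(10_000)]

# Probability sample: every person is equally likely to be selected.
prob_sample = random.sample(population, 500)


def joins_platform(age):
    # Assumed for illustration: younger people are more likely to join.
    return random.random() < (90 - age) / 100


# Platform "sample": whoever happens to join, not a designed sample.
platform_users = [age for age in population if joins_platform(age)]


def mean(values):
    return sum(values) / len(values)


print(f"population mean age:         {mean(population):.1f}")
print(f"probability sample mean age: {mean(prob_sample):.1f}")
print(f"platform users mean age:     {mean(platform_users):.1f}")
```

The platform group is far larger than the probability sample, yet its average age is pulled toward younger users, so more data does not fix a selection problem.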

Drifting

  • How people use platforms changes over time (e.g. Twitter)

Algorithmically Confounded

  • Site-specific rules may create social artifacts (e.g. Facebook friend numbers)

Dirty

  • Because data are collected for non-research purposes, they are often messy (e.g. housing records)

Sensitive

  • Could contain non-obvious sensitive data (e.g. Netflix)

10 Features of Big Data

  • In groups: pick a big data set
  • How many of the 10 terms apply (big, always on, nonreactive, incomplete, inaccessible, nonrepresentative, drifting, algorithmically confounded, dirty, sensitive)?

Starting with GitHub

  • github.com
  • Before next class: add your GitHub username