Spring 2025

Data Collection

Statistics Overview, p1

  • The real world (and most simulations) are random and uncertain
  • We need a way to describe, predict, and draw conclusions from such observations
  • We do this using statistics
    • population - the thing in which we are interested … “truth”
    • parameters - the defining characteristics of a population
  • We usually cannot accurately know populations or the parameters describing a population

Statistics Overview, p2

  • So we have to collect examples from the population and characterize these examples
    • sample - subset of a population that we collect and can describe
  • Types of statistics:
    • Collecting data and analyzing it (descriptive statistics)
    • Using this to draw conclusions about a population (inferential statistics)
  • We can organize this data in a data matrix (a table)
    • Each row represents a specific thing observed
    • Each column represents a variable

Data Matrix Example (p.1)

Data collected on students in a statistics class on a variety of variables:

  • Each row is a single observation of a student
  • Each column is a variable we are measuring

Data Matrix Example (p.2)

Student   Gender   Intro/Extra   \(\cdots\)   Dread
1         male     extravert     \(\cdots\)   3
2         female   extravert     \(\cdots\)   2
3         female   introvert     \(\cdots\)   4
4         female   extravert     \(\cdots\)   2
\(\vdots\)   \(\vdots\)   \(\vdots\)   \(\ddots\)   \(\vdots\)
86        male     extravert     \(\cdots\)   3

Types of variables

  • Numerical - numeric values
    • discrete - integers, counting numbers (e.g., \(1, 2, 3\))
    • continuous - real values (e.g., \(-1.2, \pi, 5.2\))


  • Categorical - non-numeric values
    • nominal - values that cannot be ordered (e.g., red, blue, green)
    • ordinal - values that can be ordered (e.g., tall, medium, short)
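
A minimal R sketch (hypothetical values) showing how each kind of variable is commonly represented in R:

counts  <- c(1L, 2L, 3L)                          # discrete: integers
heights <- c(-1.2, pi, 5.2)                       # continuous: real values
colors  <- factor(c("red", "blue", "green"))      # nominal: unordered factor
sizes   <- ordered(c("short", "medium", "tall"),  # ordinal: ordered factor
                   levels = c("short", "medium", "tall"))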

Associated vs. independent

  • Associated or dependent variables
    • show some connection with one another
  • Independent variables show no evident connection with one another
  • Variables might be:
    • Negatively associated - As one increases in value or proportion, the other tends to decrease
    • Positively associated - As one increases in value or proportion, the other also tends to increase
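
A quick R sketch (made-up values) using the sample correlation to see association:

x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)   # tends to increase with x
cor(x, y)                          # near +1: positively associated
cor(x, -y)                         # near -1: negatively associated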

Point vs. Summary Statistics

  • We sometimes use the word “statistic” to refer to a characteristic of data
  • Point Statistic - a specific measured value (e.g., I am 1.86 meters tall)
  • Summary Statistic - a value representing a characteristic of many values (e.g., average)
  • Summaries are abstract descriptions of a sample or population
  • Many inferential statistical methods deal with distributions of summary statistics (not point statistics)

Measures of Center

  • mean - statistical average of numeric data
  • median - middle-most value of numeric data, when sorted
  • mode - most common value of categorical data
  • proportion - frequency with which some categorical value occurs
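
A short R sketch (toy data) of each measure; note that base R has no built-in function for the mode of categorical data, so we tabulate:

grades <- c(96, 82, 97, 74)
mean(grades)                       # mean
median(grades)                     # median

colors <- factor(c("red", "blue", "red", "green"))
names(which.max(table(colors)))    # mode: most frequent category
mean(colors == "red")              # proportion of "red"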

What Makes Data Useful?

To be able to conduct useful analysis, data must:

  • Quantity: exist in sufficient quantity
  • Consistent: be on common scales, along common baselines
  • Structure: be in a common organizational structure
  • Clean & Clear: have few errors, omissions, and ambiguities
  • Trustworthy: come from a reputable source

Traits of Even More Useful Data

To produce the most meaningful results, data should also:

  • Atomic: provide the analyst access to the unaggregated, atomic components of the data
  • Multivariate: permit the analyst to see how different variables compare
  • Contextual: permit the analyst to place the data along some context (space, time, group, etc.)

Dealing with Data

  • Most data isn’t all of these things
  • Some data isn’t any of these things …
  • The hardest part of data analysis is usually dealing with the raw data

Collecting Your Own Data

Often, we are coordinating some experiment and collecting empirical results ourselves

  • We might be manually recording data based on observation
  • We might be using some software that produces data
  • We might have colleagues that give us empirical data
  • Sometimes, it’s all of these

When we are organizing our own data, we should choose to lay it out and store it in a way that makes later analysis and visualization as easy as possible.
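
For instance (hypothetical variables), one observation per row and one variable per column keeps most analysis and plotting tools happy:

results <- data.frame(Trial   = c(1, 2, 3),
                      Method  = factor(c("A", "A", "B")),
                      Runtime = c(0.42, 0.39, 0.57))   # seconds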

Garbage-In, Garbage-Out

  • Manually entered data is subject to sporadic errors (e.g., typos)
  • Automatically populated data is subject to systematic error (e.g., miscalculating some statistic)
  • You should assume that any data you have (whether you recorded it or someone else did) is likely to contain errors
  • So check the data!
  • You should also keep track of the metadata:
    • Where did the data come from?
    • When was the data collected?
    • What does the data represent?
    • What are the observations, what are the variables?
    • Etc.
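
For example, a few quick checks in R, assuming the data has already been loaded into a hypothetical data frame df:

summary(df)          # per-column ranges and NA counts; spot absurd values
any(is.na(df))       # are any entries missing?
sum(duplicated(df))  # how many rows are exact duplicates?
str(df)              # did each column get the type you expected?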

Where to Find Data

Sometimes you must collect data that is not directly part of your own project

  • Most statistics tools have some built-in data sets
  • Many public institutions publish their data
  • Some field-specific sites provide a portal to relevant data sets
  • There are programs and libraries designed to help you access certain kinds of data
  • You can always email authors / PIs and ask for their data (the worst that can happen is that they say, “no”)


Common Data Storage Formats

Common data storage formats include:

  • Excel
  • Delimited text (e.g., CSV)
  • JSON
  • HTML / XML
  • Database (e.g., MySQL tables)
  • SAS, SPSS, or other stat-package format

Data challenges

Dealing with data is usually the most challenging part.
As it is produced, data often

  • Comes from multiple sources or appears in multiple files / tables
  • Must be manually entered or automatically extracted and consolidated
  • Is missing values or has inaccuracies
  • Is not formatted conveniently for analysis

Data Scraping

  • Sometimes we can find the data we want on the web but:
    • It’s not in one place
    • It’s not in one file
    • It’s posted on the web in HTML or other formats
  • We might tediously visit sites and record values in a format we can use
  • More often, we write code to automatically gather and reformat the data
  • Such automated processes are called data scraping
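
As a minimal sketch, assuming the external rvest package and a hypothetical page URL, scraping an HTML table might look like:

library(rvest)   # install.packages("rvest") first

page <- read_html("https://example.com/results.html")   # hypothetical URL
tbl  <- html_table(html_element(page, "table"))         # first <table> as a data frame
head(tbl)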

Data Munging

  • Sometimes we can conveniently collect certain variables of data but:
    • Those variables are not in the form we want
    • We must perform some grouping, summarizing, or other computation to create new variables
  • Again, we might record the true data in one place and tediously manipulate and extract what we need to record in another place
  • But usually we have computers do all this for us
  • We sometimes call such automated processes data munging
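
For example, a small munging step in R (toy data), deriving a per-group summary with aggregate():

runs <- data.frame(Problem = c("P1", "P1", "P2", "P2"),
                   Best    = c(13.6, 14.7, 9.2, 9.8))
aggregate(Best ~ Problem, data = runs, FUN = mean)   # mean Best per Problem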

Data Formatting Tools

There are some tools to help with managing data:

  • Many of the tools we have can do some conversion
    • Excel can export in CSV format
    • R can read xlsx and SPSS formats
  • But most data scraping and munging is done in a programming language
    • Python
    • Perl
    • Unix shell scripts
    • R

A Typical Workflow

  1. Collect data using algorithm, device, or tool from your field
  2. Perform scraping and munging tasks in Python
    • Python has very powerful text manipulation tools built in
    • Python has well-developed data manipulation libraries (e.g., pandas)
    • Python’s data visualization packages are not great
  3. Load data in R for statistical testing and visualization
    • R is designed for data analysis
    • ggplot2 is a particularly mature and well-known data visualization library
    • But R’s file parsing routines are tedious and crufty

Collecting Data in R

Data Frames

  • Basic data type for statistical operations
  • Implements a table
    • Variables stored in columns
    • Observations stored in rows
  • Typically columns are named
  • Values in each column must be the same type
    • factors
    • numeric vectors
    • character vectors
  • Different columns may have different types

Data Frames: Creating Data Frames

Student = c("Bob", "Sue", "Cat", "Lin")
NumberGrade = c(96, 82, 97, 74)
LetterGrade = factor(c("A","B","A","C"))
RosterData = data.frame(Student,NumberGrade,LetterGrade)
RosterData
##   Student NumberGrade LetterGrade
## 1     Bob          96           A
## 2     Sue          82           B
## 3     Cat          97           A
## 4     Lin          74           C

Data Frames: Getting Variable Data

rd = data.frame( Student = c("Bob", "Sue", "Cat", "Lin"),
                 NumberGrade = c(96, 82, 97, 74),
                 LetterGrade = factor(c("A","B","A","C")) )
rd$NumberGrade
## [1] 96 82 97 74
rd$LetterGrade
## [1] A B A C
## Levels: A B C

More about Data Frames

# rd from last slide
dim(rd); levels(rd$LetterGrade)
## [1] 4 3
## [1] "A" "B" "C"
summary(rd)
##    Student           NumberGrade    LetterGrade
##  Length:4           Min.   :74.00   A:2        
##  Class :character   1st Qu.:80.00   B:1        
##  Mode  :character   Median :89.00   C:1        
##                     Mean   :87.25              
##                     3rd Qu.:96.25              
##                     Max.   :97.00
summary(rd$NumberGrade)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   74.00   80.00   89.00   87.25   96.25   97.00

Read Table

x = read.table('https://faculty.winthrop.edu/wiegandrp/teaching/csci296/color-cookbook-eg.txt', 
               header=T)
head(x) 
##   cond1 cond2 yval
## 1     A     I  2.0
## 2     A     J  2.5
## 3     A     K  1.6
## 4     A     L  0.8
## 5     B     I  2.2
## 6     B     J  2.4

Note: Different reading functions assume different defaults for the header argument (read.table() assumes FALSE, while read.csv() assumes TRUE) – so best to specify what you want

Common Manipulation Needs

  • Robust file reading options
    • read.table() has a lot of features
  • Convert “wide” format tables into traditional long format
    • melt()
  • Next week:
    • subset(), filtering data, and slicing data
    • aggregate() and joining data
    • transform(), reorder(), and arrange()

Robustly Reading Data

R’s read.table() family of functions can handle a lot of things

  • Include / exclude header or assign it on read
  • Ignore lines after a specific comment character
  • Explicitly specify the class of each data column
  • Skip the first \(x\) lines
  • Skip blank lines
  • Fill in missing data after last column
  • Fill in missing numeric data as NA

https://faculty.winthrop.edu/wiegandrp/teaching/csci296/L8-knapsack-data.csv

Robustly Reading Data (example)

x = read.table('https://faculty.winthrop.edu/wiegandrp/teaching/csci296/L8-knapsack-data.csv',
               sep=',',       # Use a comma as a field separator
               header=FALSE,  # Don't use the first line as the header
               skip=10,       # But skip the first 10 lines
               col.names=c('ProblemInstance','Trial','BestSolution'),
               colClasses =c('character', 'integer', 'numeric'),
               blank.lines.skip=TRUE,  # Skip blank lines
               comment.char=";")
head(x) 
##   ProblemInstance Trial BestSolution
## 1              P1     0     13.55185
## 2              P1     1     13.79003
## 3              P1     3     14.68695
## 4              P1     4     13.52434
## 5              P1     5     13.55290
## 6              P1     6     14.03507

Other Formats to Read

  • R also has the read.csv function, which is simpler for standard, clean CSV files
  • The readxl library in R has the function read_excel that will read Excel spreadsheets
  • The haven library in R has the functions read_sas, read_spss, and read_stata
  • The scan function will just read numeric data into a vector out of a file
  • If you need lower-level reading and parsing, better to use a different language
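
A quick sketch (hypothetical file names):

grades <- read.csv("grades.csv")   # header=TRUE and sep="," by default
vals   <- scan("numbers.txt")      # bare numbers into a numeric vector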

Reshaping Data

  • Often we get data in wide form, but we need it in long form
  • For example:
df <- data.frame(Instructor=factor(c("Wiegand", "Besmer", "Doman", "Scibelli")),
                 PresentationStyle=c(2.0, 1.6, 1.7, 1.8),
                 PreparedMaterials=c(1.5, 1.8, 1.6, 1.7),
                 Availability=c(2.0, 2.0, 2.0, 2.0))
df
##   Instructor PresentationStyle PreparedMaterials Availability
## 1    Wiegand               2.0               1.5            2
## 2     Besmer               1.6               1.8            2
## 3      Doman               1.7               1.6            2
## 4   Scibelli               1.8               1.7            2

R’s Melt Function

library(reshape2)
melt(df, variable.name="Measure", value.name="Score")
## Using Instructor as id variables
##    Instructor           Measure Score
## 1     Wiegand PresentationStyle   2.0
## 2      Besmer PresentationStyle   1.6
## 3       Doman PresentationStyle   1.7
## 4    Scibelli PresentationStyle   1.8
## 5     Wiegand PreparedMaterials   1.5
## 6      Besmer PreparedMaterials   1.8
## 7       Doman PreparedMaterials   1.6
## 8    Scibelli PreparedMaterials   1.7
## 9     Wiegand      Availability   2.0
## 10     Besmer      Availability   2.0
## 11      Doman      Availability   2.0
## 12   Scibelli      Availability   2.0

Collecting Data in Python

Reading Data in Python

  • Python has some powerful abilities to read files and perform text manipulation
  • These are great if you need to perform your own lexing and parsing to build a data set
  • But they aren’t good for just picking up well-formatted data
  • For that, we turn to an external library: pandas
  • https://pandas.pydata.org/docs/user_guide/10min.html

Pandas

  • Pandas provides some of the basic data types typical to modeling and statistics (e.g., categorical, numeric, date/time, etc.)
  • Pandas provides a data frame structure that’s similar to R’s data.frame
  • It also provides some basic I/O functionality from different formats
  • Many other Python libraries build on pandas (e.g., seaborn, statsmodels)
  • It’s an external library and must be installed (e.g., via pip)

Pandas Series

  • A Series in Pandas is a list structure that enforces a common type across that list
  • Those types can be any supported Pandas types: integer, floating point, categorical, datetime, strings, etc.
  • Series form the columns in data frames
  • Note that in practice, Pandas is typically imported as pd:
import pandas as pd

my_series = pd.Series([3.14, 2.78, 1.0, 5])
print(my_series)
## 0    3.14
## 1    2.78
## 2    1.00
## 3    5.00
## dtype: float64

Pandas Data Frames, p1

  • We can get close to the convenience of R data frames with the Pandas DataFrame structure:
df = pd.DataFrame({
    "Student":["Bob", "Sue", "Cat", "Lin"],
    "NumberGrade":pd.Series([96, 82, 97, 74]),
    "LetterGrade":pd.Categorical(["A","B","A","C"])
  })
  
df
##   Student  NumberGrade LetterGrade
## 0     Bob           96           A
## 1     Sue           82           B
## 2     Cat           97           A
## 3     Lin           74           C

Pandas Data Frames, p2

df.columns
## Index(['Student', 'NumberGrade', 'LetterGrade'], dtype='object')
df.dtypes
## Student          object
## NumberGrade       int64
## LetterGrade    category
## dtype: object
df.shape
## (4, 3)

Pulling Out a Single Column

df["Student"]
## 0    Bob
## 1    Sue
## 2    Cat
## 3    Lin
## Name: Student, dtype: object

Sort of Like R’s summary:

df["NumberGrade"].describe()
## count     4.000000
## mean     87.250000
## std      11.176612
## min      74.000000
## 25%      80.000000
## 50%      89.000000
## 75%      96.250000
## max      97.000000
## Name: NumberGrade, dtype: float64
df["LetterGrade"].cat.categories  # Levels of the factor
## Index(['A', 'B', 'C'], dtype='object')

Reading from Files into a DataFrame

  • pd.read_csv("foo.csv") to read from a CSV
  • pd.read_excel("foo.xlsx") to read from an Excel file
  • pd.read_sas("foo.sas7bdat") to read from a SAS file
  • pd.read_spss("foo.sav") to read from an SPSS file

More Versatile Reading

import numpy as np
df = pd.read_csv('https://faculty.winthrop.edu/wiegandrp/teaching/csci296/L8-knapsack-data.csv',
                 sep=',',               # Use a comma as a field separator
                 header=0,              # Row 0 (after skiprows) is read as a header
                                        #   and replaced by names= below; use
                                        #   header=None to keep that row as data
                 skiprows=list(range(10)),       # But skip the first 10 lines
                 names=['ProblemInstance','Trial','BestSolution'],
                 skip_blank_lines=True,  # Skip blank lines
                 comment=";")
df.head() 
##   ProblemInstance  Trial  BestSolution
## 0              P1      1     13.790031
## 1              P1      3     14.686953
## 2              P1      4     13.524335
## 3              P1      5     13.552902
## 4              P1      6     14.035066

Reshaping Data

  • We can reshape data using pandas, as well
df = pd.DataFrame({
       "Instructor":pd.Categorical(["Wiegand", "Besmer", "Doman", "Scibelli"]),
       "PresentationStyle":pd.Series([2.0, 1.6, 1.7, 1.8]),
       "PreparedMaterials":pd.Series([1.5, 1.8, 1.6, 1.7]),
       "Availability":pd.Series([2.0, 2.0, 2.0, 2.0])
     })
df
##   Instructor  PresentationStyle  PreparedMaterials  Availability
## 0    Wiegand                2.0                1.5           2.0
## 1     Besmer                1.6                1.8           2.0
## 2      Doman                1.7                1.6           2.0
## 3   Scibelli                1.8                1.7           2.0

Pandas’ Melt Function

df.melt(id_vars=['Instructor'], var_name="Measure", value_name="Score")
##    Instructor            Measure  Score
## 0     Wiegand  PresentationStyle    2.0
## 1      Besmer  PresentationStyle    1.6
## 2       Doman  PresentationStyle    1.7
## 3    Scibelli  PresentationStyle    1.8
## 4     Wiegand  PreparedMaterials    1.5
## 5      Besmer  PreparedMaterials    1.8
## 6       Doman  PreparedMaterials    1.6
## 7    Scibelli  PreparedMaterials    1.7
## 8     Wiegand       Availability    2.0
## 9      Besmer       Availability    2.0
## 10      Doman       Availability    2.0
## 11   Scibelli       Availability    2.0