Spring 2025

Data Collection

Statistics Overview, p1

  • The real world (and most simulations) are random and uncertain
  • We need a way to describe, predict, and draw conclusions from such observations
  • We do this using statistics
    • population - the thing in which we are interested … “truth”
    • parameters - the defining characteristics of a population
  • We usually cannot accurately know populations or the parameters describing a population

Statistics Overview, p2

  • So we have to collect examples from the population and characterize these examples
    • sample - subset of a population that we collect and can describe
  • Types of statistics:
    • Collecting data and analyzing it (descriptive statistics)
    • Using this to draw conclusions about a population (inferential statistics)
  • We can organize this data in a data matrix (a table)
    • Each row represents a specific thing observed
    • Each column represents a variable

Data Matrix Example (p.1)

Data collected on students in a statistics class on a variety of variables:

  • Each row is a single observation of a student
  • Each column is a variable we are measuring

Data Matrix Example (p.2)

Student   Gender   Intro/Extra   \(\cdots\)   Dread
1         male     extravert     \(\cdots\)   3
2         female   extravert     \(\cdots\)   2
3         female   introvert     \(\cdots\)   4
4         female   extravert     \(\cdots\)   2
\(\vdots\)   \(\vdots\)   \(\vdots\)   \(\ddots\)   \(\vdots\)
86        male     extravert     \(\cdots\)   3

Types of variables

  • Numerical - numeric values
    • discrete - integers, counting numbers (e.g., \(1, 2, 3\))
    • continuous - real values (e.g., \(-1.2, \pi, 5.2\))


  • Categorical - non-numeric values
    • nominal - values that cannot be ordered (e.g., red, blue, green)
    • ordinal - values that can be ordered (e.g., tall, medium, short)
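
A minimal R sketch (hypothetical values) showing how each kind of variable is commonly represented in R:

counts  <- c(1L, 2L, 3L)                          # discrete: integers
heights <- c(-1.2, pi, 5.2)                       # continuous: real values
colors  <- factor(c("red", "blue", "green"))      # nominal: unordered factor
sizes   <- ordered(c("short", "medium", "tall"),  # ordinal: ordered factor
                   levels = c("short", "medium", "tall"))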

Associated vs. independent

  • Associated or dependent variables
    • show some connection with one another
  • Independent variables show no evident connection with one another
  • Variables might be:
    • Negatively associated - As one increases in value or proportion, the other tends to decrease
    • Positively associated - As one increases in value or proportion, the other also tends to increase
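
A quick R sketch (made-up values) using the sample correlation to see association:

x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)   # tends to increase with x
cor(x, y)                          # near +1: positively associated
cor(x, -y)                         # near -1: negatively associated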

Point vs. Summary Statistics

  • We sometimes use the word “statistic” to refer to a characteristic of data
  • Point Statistic - a specific measured value (e.g., I am 1.86 meters tall)
  • Summary Statistic - a value representing a characteristic of many values (e.g., average)
  • Summaries are abstract descriptions of a sample or population
  • Many inferential statistical methods deal with distributions of summary statistics (not point statistics)

Measures of Center

  • mean - statistical average of numeric data
  • median - middle-most value of numeric data, when sorted
  • mode - most common value of categorical data
  • proportion - frequency with which some categorical value occurs
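
A short R sketch (toy data) of each measure; note that base R has no built-in function for the mode of categorical data, so we tabulate:

grades <- c(96, 82, 97, 74)
mean(grades)                       # mean
median(grades)                     # median

colors <- factor(c("red", "blue", "red", "green"))
names(which.max(table(colors)))    # mode: most frequent category
mean(colors == "red")              # proportion of "red"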

What Makes Data Useful?

To be able to conduct useful analysis, data must:

  • Quantity: exist in sufficient quantity
  • Consistent: be on common scales, along common baselines
  • Structure: be in a common organizational structure
  • Clean & Clear: have few errors, omissions, and ambiguities
  • Trustworthy: come from a reputable source

Traits of Even More Useful Data

To produce the most meaningful results, data should also:

  • Atomic: provide the analyst access to the unaggregated, atomic components of the data
  • Multivariate: permit the analyst to see how different variables compare
  • Contextual: permit the analyst to place the data along some context (space, time, group, etc.)

Dealing with Data

  • Most data isn’t all of these things
  • Some data isn’t any of these things …
  • The hardest part of data analysis is usually dealing with the raw data

Collecting Your Own Data

Often, we are coordinating some experiment and collecting empirical results ourselves

  • We might be manually recording data based on observation
  • We might be using some software that produces data
  • We might have colleagues that give us empirical data
  • Sometimes, it’s all of these

When we are organizing our own data, we should choose to lay it out and store it in a way that makes later analysis and visualization as easy as possible.
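
For instance (hypothetical variables), one observation per row and one variable per column keeps most analysis and plotting tools happy:

results <- data.frame(Trial   = c(1, 2, 3),
                      Method  = factor(c("A", "A", "B")),
                      Runtime = c(0.42, 0.39, 0.57))   # seconds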

Garbage-In, Garbage-Out

  • Manually entered data is subject to sporadic errors (e.g., typos)
  • Automatically populated data is subject to systematic error (e.g., miscalculating some statistic)
  • You should assume that any data you have (whether you recorded it or someone else did) is likely to contain errors
  • So check the data!
  • You should also keep track of the metadata:
    • Where did the data come from?
    • When was the data collected?
    • What does the data represent?
    • What are the observations, what are the variables?
    • Etc.
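
For example, a few quick checks in R, assuming the data has already been loaded into a hypothetical data frame df:

summary(df)          # per-column ranges and NA counts; spot absurd values
any(is.na(df))       # are any entries missing?
sum(duplicated(df))  # how many rows are exact duplicates?
str(df)              # did each column get the type you expected?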

Where to Find Data

Sometimes you must collect data that is not directly part of your own project

  • Most statistics tools have some built-in data sets
  • Many public institutions publish their data
  • Some field-specific sites provide a portal to relevant data sets
  • There are programs and libraries designed to help you access certain kinds of data
  • You can always email authors / PIs and ask for their data (the worst that can happen is that they say, “no”)


Common Data Storage Formats

Common data storage formats include:

  • Excel
  • Delimited text (e.g., CSV)
  • JSON
  • HTML / XML
  • Database (e.g., MySQL tables)
  • SAS, SPSS, or other stat-package format

Data challenges

Dealing with data is usually the most challenging part.
As it is produced, data often

  • Comes from multiple sources or appears in multiple files / tables
  • Must be manually entered or automatically extracted and consolidated
  • Is missing values or has inaccuracies
  • Is not formatted conveniently for analysis

Data Scraping

  • Sometimes we can find the data we want on the web but:
    • It’s not in one place
    • It’s not in one file
    • It’s posted on the web in HTML or other formats
  • We might tediously visit sites and record values in a format we can use
  • More often, we write code to automatically gather and reformat the data
  • Such automated processes are called data scraping
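
As a minimal sketch, assuming the external rvest package and a hypothetical page URL, scraping an HTML table might look like:

library(rvest)   # install.packages("rvest") first

page <- read_html("https://example.com/results.html")   # hypothetical URL
tbl  <- html_table(html_element(page, "table"))         # first <table> as a data frame
head(tbl)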

Data Munging

  • Sometimes we can conveniently collect certain variables of data but:
    • Those variables are not in the form we want
    • We must perform some grouping, summarizing, or other computation to create new variables
  • Again, we might record the true data in one place and tediously manipulate and extract what we need to record in another place
  • But usually we have computers do all this for us
  • We sometimes call such automated processes data munging
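
For example, a small munging step in R (toy data), deriving a per-group summary with aggregate():

runs <- data.frame(Problem = c("P1", "P1", "P2", "P2"),
                   Best    = c(13.6, 14.7, 9.2, 9.8))
aggregate(Best ~ Problem, data = runs, FUN = mean)   # mean Best per Problem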

Data Formatting Tools

There are some tools to help with managing data:

  • Many of the tools we have can do some conversion
    • Excel can export in CSV format
    • R can read xlsx and SPSS formats
  • But most data scraping and munging is done in a programming language
    • Python
    • Perl
    • Unix shell scripts
    • R

A Typical Workflow

  1. Collect data using algorithm, device, or tool from your field
  2. Perform scraping and munging tasks in Python
    • Python has very powerful text manipulation tools built in
    • Python has well-developed data manipulation libraries (e.g., pandas)
    • Python’s data visualization packages are not great
  3. Load data in R for statistical testing and visualization
    • R is designed for data analysis
    • ggplot2 is a particularly mature and well-known data visualization library
    • But R’s file parsing routines are tedious and crufty

Collecting Data in R

Data Frames

  • Basic data type for statistical operations
  • Implements a table
    • Variables stored in columns
    • Observations stored in rows
  • Typically columns are named
  • Values in each column must be the same type
    • factors
    • numeric vectors
    • character vectors
  • Different columns may have different types

Data Frames: Creating Data Frames

Student = c("Bob", "Sue", "Cat", "Lin")
NumberGrade = c(96, 82, 97, 74)
LetterGrade = factor(c("A","B","A","C"))
RosterData = data.frame(Student,NumberGrade,LetterGrade)
RosterData
##   Student NumberGrade LetterGrade
## 1     Bob          96           A
## 2     Sue          82           B
## 3     Cat          97           A
## 4     Lin          74           C

Data Frames: Getting Variable Data

rd = data.frame( Student = c("Bob", "Sue", "Cat", "Lin"),
                 NumberGrade = c(96, 82, 97, 74),
                 LetterGrade = factor(c("A","B","A","C")) )
rd$NumberGrade
## [1] 96 82 97 74
rd$LetterGrade
## [1] A B A C
## Levels: A B C

More about Data Frames

# rd from last slide
dim(rd); levels(rd$LetterGrade)
## [1] 4 3
## [1] "A" "B" "C"
summary(rd)
##    Student           NumberGrade    LetterGrade
##  Length:4           Min.   :74.00   A:2        
##  Class :character   1st Qu.:80.00   B:1        
##  Mode  :character   Median :89.00   C:1        
##                     Mean   :87.25              
##                     3rd Qu.:96.25              
##                     Max.   :97.00
summary(rd$NumberGrade)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   74.00   80.00   89.00   87.25   96.25   97.00

Read Table

x = read.table('https://faculty.winthrop.edu/wiegandrp/teaching/csci296/color-cookbook-eg.txt', 
               header=T)
head(x) 
##   cond1 cond2 yval
## 1     A     I  2.0
## 2     A     J  2.5
## 3     A     K  1.6
## 4     A     L  0.8
## 5     B     I  2.2
## 6     B     J  2.4

Note: Different reading functions assume different defaults for the header argument (read.table() assumes FALSE, while read.csv() assumes TRUE) – so best to specify what you want

Common Manipulation Needs

  • Robust file reading options
    • read.table() has a lot of features
  • Convert “wide” format tables into traditional long format
    • melt()
  • Next week:
    • subset(), filtering data, and slicing data
    • aggregate() and joining data
    • transform(), reorder(), and arrange()

Robustly Reading Data

R’s read.table() family of functions can handle a lot of things

  • Include / exclude header or assign it on read
  • Ignore lines after a specific comment character
  • Explicitly specify the class of each data column
  • Skip the first \(x\) lines
  • Skip blank lines
  • Fill in missing data after last column
  • Fill in missing numeric data as NA

https://faculty.winthrop.edu/wiegandrp/teaching/csci296/L8-knapsack-data.csv

Robustly Reading Data (example)

x = read.table('https://faculty.winthrop.edu/wiegandrp/teaching/csci296/L8-knapsack-data.csv',
               sep=',',       # Use a comma as a field separator
               header=FALSE,  # Don't use the first line as the header
               skip=10,       # But skip the first 10 lines
               col.names=c('ProblemInstance','Trial','BestSolution'),
               colClasses =c('character', 'integer', 'numeric'),
               blank.lines.skip=TRUE,  # Skip blank lines
               comment.char=";")
head(x) 
##   ProblemInstance Trial BestSolution
## 1              P1     0     13.55185
## 2              P1     1     13.79003
## 3              P1     3     14.68695
## 4              P1     4     13.52434
## 5              P1     5     13.55290
## 6              P1     6     14.03507

Other Formats to Read

  • R also has the read.csv function, which is simpler for standard, clean CSV files
  • The readxl library in R has the function read_excel that will read Excel spreadsheets
  • The haven library in R has the functions read_sas, read_spss, and read_stata
  • The scan function will just read numeric data into a vector out of a file
  • If you need lower-level reading and parsing, better to use a different language
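
A quick sketch (hypothetical file names):

grades <- read.csv("grades.csv")   # header=TRUE and sep="," by default
vals   <- scan("numbers.txt")      # bare numbers into a numeric vector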

Reshaping Data

  • Often we get data in wide form, but we need it in long form
  • For example:
df <- data.frame(Instructor=factor(c("Wiegand", "Besmer", "Doman", "Scibelli")),
                 PresentationStyle=c(2.0, 1.6, 1.7, 1.8),
                 PreparedMaterials=c(1.5, 1.8, 1.6, 1.7),
                 Availability=c(2.0, 2.0, 2.0, 2.0))
df
##   Instructor PresentationStyle PreparedMaterials Availability
## 1    Wiegand               2.0               1.5            2
## 2     Besmer               1.6               1.8            2
## 3      Doman               1.7               1.6            2
## 4   Scibelli               1.8               1.7            2

R’s Melt Function

library(reshape2)
melt(df, variable.name="Measure", value.name="Score")
## Using Instructor as id variables
##    Instructor           Measure Score
## 1     Wiegand PresentationStyle   2.0
## 2      Besmer PresentationStyle   1.6
## 3       Doman PresentationStyle   1.7
## 4    Scibelli PresentationStyle   1.8
## 5     Wiegand PreparedMaterials   1.5
## 6      Besmer PreparedMaterials   1.8
## 7       Doman PreparedMaterials   1.6
## 8    Scibelli PreparedMaterials   1.7
## 9     Wiegand      Availability   2.0
## 10     Besmer      Availability   2.0
## 11      Doman      Availability   2.0
## 12   Scibelli      Availability   2.0

Collecting Data in Python

Reading Data in Python

  • Python has some powerful abilities to read files and perform text manipulation
  • These are great if you need to perform your own lexing and parsing to build a data set
  • But they aren’t good for just picking up well-formatted data
  • For that, we turn to an external library: pandas
  • https://pandas.pydata.org/docs/user_guide/10min.html

Pandas

  • Pandas provides some of the basic data types typical to modeling and statistics (e.g., categorical, numeric, date/time, etc.)
  • Pandas provides a data frame structure that’s similar to R’s data.frame
  • It also provides some basic I/O functionality from different formats
  • Many other Python libraries build on pandas (e.g., seaborn, statsmodels)
  • It’s an external library and must be installed (e.g., via pip)

Pandas Series

  • A Series in Pandas is a list structure that enforces a common type across that list
  • Those types can be any supported Pandas types: integer, floating point, categorical, datetime, strings, etc.
  • Series form the columns in data frames
  • Note that in practice, Pandas is typically imported as pd:
import pandas as pd

my_series = pd.Series([3.14, 2.78, 1.0, 5])
print(my_series)
## 0    3.14
## 1    2.78
## 2    1.00
## 3    5.00
## dtype: float64

Pandas Data Frames, p1

  • We can get close to the convenience of R data frames with the Pandas DataFrame structure:
df = pd.DataFrame({
    "Student":["Bob", "Sue", "Cat", "Lin"],
    "NumberGrade":pd.Series([96, 82, 97, 74]),
    "LetterGrade":pd.Categorical(["A","B","A","C"])
  })
  
df
##   Student  NumberGrade LetterGrade
## 0     Bob           96           A
## 1     Sue           82           B
## 2     Cat           97           A
## 3     Lin           74           C

Pandas Data Frames, p2

df.columns
## Index(['Student', 'NumberGrade', 'LetterGrade'], dtype='object')
df.dtypes
## Student          object
## NumberGrade       int64
## LetterGrade    category
## dtype: object
df.shape
## (4, 3)

Pulling Out a Single Column

df["Student"]
## 0    Bob
## 1    Sue
## 2    Cat
## 3    Lin
## Name: Student, dtype: object

Sort of Like R’s summary:

df["NumberGrade"].describe()
## count     4.000000
## mean     87.250000
## std      11.176612
## min      74.000000
## 25%      80.000000
## 50%      89.000000
## 75%      96.250000
## max      97.000000
## Name: NumberGrade, dtype: float64
df["LetterGrade"].cat.categories  # Levels of the factor
## Index(['A', 'B', 'C'], dtype='object')

Reading from Files into a DataFrame

  • pd.read_csv("foo.csv") to read from a CSV
  • pd.read_excel("foo.xlsx") to read from an Excel file
  • pd.read_sas("foo.sas7bdat") to read from a SAS file
  • pd.read_spss("foo.sav") to read from an SPSS file

More Versatile Reading

import numpy as np
df = pd.read_csv('https://faculty.winthrop.edu/wiegandrp/teaching/csci296/L8-knapsack-data.csv',
                 sep=',',               # Use a comma as a field separator
                 header=0,              # Row 0 (after skiprows) is read as a header
                                        #   and replaced by names= below; use
                                        #   header=None to keep that row as data
                 skiprows=list(range(10)),       # But skip the first 10 lines
                 names=['ProblemInstance','Trial','BestSolution'],
                 skip_blank_lines=True,  # Skip blank lines
                 comment=";")
df.head() 
##   ProblemInstance  Trial  BestSolution
## 0              P1      1     13.790031
## 1              P1      3     14.686953
## 2              P1      4     13.524335
## 3              P1      5     13.552902
## 4              P1      6     14.035066

Reshaping Data

  • We can reshape data using pandas, as well
df = pd.DataFrame({
       "Instructor":pd.Categorical(["Wiegand", "Besmer", "Doman", "Scibelli"]),
       "PresentationStyle":pd.Series([2.0, 1.6, 1.7, 1.8]),
       "PreparedMaterials":pd.Series([1.5, 1.8, 1.6, 1.7]),
       "Availability":pd.Series([2.0, 2.0, 2.0, 2.0])
     })
df
##   Instructor  PresentationStyle  PreparedMaterials  Availability
## 0    Wiegand                2.0                1.5           2.0
## 1     Besmer                1.6                1.8           2.0
## 2      Doman                1.7                1.6           2.0
## 3   Scibelli                1.8                1.7           2.0

Pandas’ Melt Function

df.melt(id_vars=['Instructor'], var_name="Measure", value_name="Score")
##    Instructor            Measure  Score
## 0     Wiegand  PresentationStyle    2.0
## 1      Besmer  PresentationStyle    1.6
## 2       Doman  PresentationStyle    1.7
## 3    Scibelli  PresentationStyle    1.8
## 4     Wiegand  PreparedMaterials    1.5
## 5      Besmer  PreparedMaterials    1.8
## 6       Doman  PreparedMaterials    1.6
## 7    Scibelli  PreparedMaterials    1.7
## 8     Wiegand       Availability    2.0
## 9      Besmer       Availability    2.0
## 10      Doman       Availability    2.0
## 11   Scibelli       Availability    2.0