Spring 2026

Data Analysis

Analysis

  • Data Analysis is the act of trying to find utility or meaning in data
  • During analysis, we are typically attempting to summarize data, draw conclusions from it, make predictions, or discover patterns in it
  • Data analysis is a broad idea that encompasses many tasks, techniques, and objectives
  • But a key aspect of it typically entails some form of modeling

What is a Model?

  • A model is an abstraction or generalization of something that captures salient properties to study
  • All models are wrong, but good models are useful
  • We model because it is difficult to understand reality in toto
  • With data, models typically involve summarization at some level
  • E.g., there is no actual “average height of humans” out in the world, and fully comprehending the entire population of human heights is not possible – but knowing the average height can be useful
  • The process of finding a model that represents the data as well as possible is called fitting
  • A model can also (sometimes) be seen as a kind of explanation for the data

Explaining vs. Observing

  • In science, we observe phenomena – We record or note what we see, hear, smell, taste, feel, or detect through some apparatus
  • We can do a lot of things with that data, including just summarize it
  • An explanation is something that provides a (typically causal) reason that we observe those things
  • A hypothesis is a (testable) possible explanation for what we are seeing
  • A theory is the explanation we have for what we are observing that is best supported by evidentiary data
  • Theories can be descriptive, predictive, or prescriptive – and they also often involve modeling to accomplish those
  • This is why statistics is so important in science

Statistics, Machine Learning, Data Analytics! Oh My!

  • Statistics – Practice or science of collecting and analyzing data, especially for the purpose of making inferences about a population based on a sample
  • Machine Learning – The use of existing data to build a generalized model to apply to new, unseen data
  • Data Analytics – The process of analyzing data to make informed decisions and solve problems
  • In practice, though, there’s very little difference
  • All of them involve building models to try to generalize from information we have
  • They often use exactly the same techniques and are applied to the same kinds of tasks

Some Typical Modeling Tasks

  • Hypothesis Testing – Determining whether a hypothesis is a likely explanation for observed data
  • Classification – Build a model to predict categories of data
  • Regression – Build a model to show the functional relationship between data variables
  • Clustering – Build a model to group “similar” data
  • Dimension Reduction – Build a model to transform data so that fewer dimensions are used, but as much information as possible still remains
  • Reinforcement Learning – Build a behavior model that performs well on some task

Statistical Modeling in R

Hypothesis Testing in R

https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/t.test

  • HA: Cars get less than 22 miles per gallon
  • H0: Cars tend to get 22 miles per gallon or more
  • The \(p\)-value is less than 5%, so we reject the null hypothesis and conclude cars get less than 22 mpg
t.test(mtcars$mpg, mu=22, alternative="less", conf.level=0.95)
## 
##  One Sample t-test
## 
## data:  mtcars$mpg
## t = -1.7921, df = 31, p-value = 0.04144
## alternative hypothesis: true mean is less than 22
## 95 percent confidence interval:
##      -Inf 21.89707
## sample estimates:
## mean of x 
##  20.09062

Models in R

  • We can use R to construct statistical models

  • Typically, we have to tell R which variables are the response and which are explanatory, using something called a formula

  • response variable ~ explanatory variables

  • For example (each of these forms appears in the sketch after this list):

    • \(y \sim x\)
    • \(y \sim x + z + w\)
    • \(y \sim \;.\)
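A minimal sketch of how these formulas are used in practice (the variable choices from the built-in mtcars data are just for illustration):

# y ~ x : a single explanatory variable
fit1 = lm(data=mtcars, formula= mpg ~ wt)

# y ~ x + z + w : several explanatory variables
fit2 = lm(data=mtcars, formula= mpg ~ wt + hp + disp)

# y ~ . : every other column as an explanatory variable
fit3 = lm(data=mtcars, formula= mpg ~ .)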

Two-Way Analysis of Variance (ANOVA)

What matters for predicting MPG, number of gears or manual vs. automatic?

mtcars_fit <- aov(mtcars$mpg ~  factor(mtcars$gear) * factor(mtcars$am)) 
summary(mtcars_fit) 
##                     Df Sum Sq Mean Sq F value   Pr(>F)    
## factor(mtcars$gear)  2  483.2  241.62  11.869 0.000185 ***
## factor(mtcars$am)    1   72.8   72.80   3.576 0.069001 .  
## Residuals           28  570.0   20.36                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Gears have a statistically significant impact, but manual vs. automatic doesn’t

How Well Does a Linear Model Fit?

  • One way to assess the goodness of a fit is the coefficient of determination
  • It is computed by examining the inter- and intra-variation of the variables
  • Uni-variate model (explanatory variable \(x\) vs. response variable \(y\)):

\[ r = \frac{n\left(\sum_{i=1}^{n} x_iy_i\right) - \left(\sum_{i=1}^{n}x_i\right)\left(\sum_{i=1}^{n}y_i\right)} {\sqrt{\left(n\sum_{i=1}^{n}x_i^2 -\left(\sum_{i=1}^{n}x_i\right)^2\right) \left(n\sum_{i=1}^{n}y_i^2 -\left(\sum_{i=1}^{n}y_i\right)^2\right)}}\]

  • Typically this is squared to give us \(r^2\), a number between 0 and 1, where larger values suggest a stronger linear relationship
  • \(r^2\) is basically a measure of explained variation over total variation (a worked sketch follows this list)
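As a quick check of the formula above, we can compute \(r\) directly and compare with R’s built-in cor() (mtcars again, purely for illustration):

# Compute r from the summation formula, then square it
x = mtcars$wt; y = mtcars$mpg; n = length(x)
r = (n*sum(x*y) - sum(x)*sum(y)) /
    sqrt((n*sum(x^2) - sum(x)^2) * (n*sum(y^2) - sum(y)^2))
c(r=r, r.squared=r^2, builtin=cor(x,y))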

Back to Linear Models in R, p.1

  • For linear models, we use the function lm()

  • lm() takes the data set and the formula, then returns a model

fit = lm(data=Orange,formula= age ~ circumference)
summary(fit)

Back to Linear Models in R, p.2

## 
## Call:
## lm(formula = age ~ circumference, data = Orange)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -317.88 -140.90  -17.20   96.54  471.16 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    16.6036    78.1406   0.212    0.833    
## circumference   7.8160     0.6059  12.900 1.93e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 203.1 on 33 degrees of freedom
## Multiple R-squared:  0.8345, Adjusted R-squared:  0.8295 
## F-statistic: 166.4 on 1 and 33 DF,  p-value: 1.931e-14

Issues with R^2

  • \(r^2\) is an ad-hoc measure of the variance between the actual values and the values predicted by the model

  • It can’t tell you whether the estimates are biased, meaning whether there are systematic errors in the predictions

    • For this, you have to look at the residuals
  • Adding more explanatory variables never lowers it, even when the new variables are irrelevant

    • For multi-variate models, you should use the adjusted coefficient of determination
  • You can have a high \(r^2\) for a bad model or a low \(r^2\) for a good model (see the sketch after this list)

    • For one, the data can be non-linearly related
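R’s built-in anscombe data set makes that last point concrete: four very differently shaped data sets whose linear fits all produce essentially the same \(r^2\) (a minimal sketch):

# r^2 is about 0.67 for all four Anscombe sets, despite wildly different shapes
sapply(1:4, function(i) {
  fit = lm(anscombe[[paste0("y",i)]] ~ anscombe[[paste0("x",i)]])
  summary(fit)$r.squared
})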

Residuals Plot

(Figure: residuals plot with a smoothed trend line; `geom_smooth()` using formula = 'y ~ x')
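The plot itself is not reproduced here; a hedged sketch of how such a residuals plot might be drawn with ggplot2, assuming the Orange fit from the earlier slides:

library(ggplot2)
fit = lm(data=Orange, formula= age ~ circumference)
# Plot residuals against fitted values, with a smoothed trend line
ggplot(data.frame(Fitted=fitted(fit), Residual=resid(fit)),
       aes(x=Fitted, y=Residual)) +
  geom_point() +
  geom_smooth()    # this call emits the message above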

Multivariate Adjusted R^2

# adjusted R^2 with one vs. two explanatory variables (mtcars)
summary(lm(mpg ~ hp, data=mtcars))$adj.r.squared
## [1] 0.5891853
summary(lm(mpg ~ hp + wt, data=mtcars))$adj.r.squared
## [1] 0.8148396

Some Data is Obviously Not Linear

Notice Something about Polynomial Fits:

  • Suppose I have explanatory variable \(x\), as well as a variable \(y\) that depends on \(x\) by a quadratic function

  • This means that there’s some function: \[y = \alpha_0 + \alpha_1 x^2 + \alpha_2 x\]

  • But this is just like a linear function, if \(x^2\) were its own variable! \[y = \alpha_0 + \alpha_1 z + \alpha_2 x, \quad \text{where } z = x^2\]

Notice Something about Polynomial Fits (2):

  • So we could rework our formula to “trick” R into fitting a polynomial model (the I() wrapper keeps R from reading ^ as formula-interaction syntax):
fit = lm(data=hatcolor,formula= y ~ I(x^2) + x)
  • Actually, we could use this trick for all sorts of non-linear functions: \[y = \alpha_0 + \alpha_1 \sin(x) + \alpha_2 x + \alpha_3 x^3\]
fit = lm(data=hatcolor,formula= y ~ sin(x) + x + I(x^3))
  • For polynomials, R gives us a function, poly(), so we can be more general:
fit = lm(data=hatcolor,formula= y ~ poly(x,2)); summary(fit)

Use Non-Linear Models for Non-Linear Data

hatcolorURL = "http://cs.ucf.edu/~wiegand/ids6938/datasets/hatcolor.csv"
hatcolor = read.table(hatcolorURL,header=TRUE)

summary(lm(data=hatcolor,formula=Coolitude ~ HatLightness))$r.squared
## [1] 0.005302867
summary(lm(data=hatcolor,formula=Coolitude ~ HatLightness))$adj.r.squared
## [1] -0.01541999
summary(lm(data=hatcolor,formula=Coolitude ~ poly(HatLightness,2)))$r.squared
## [1] 0.8888319
summary(lm(data=hatcolor,formula=Coolitude ~ poly(HatLightness,2)))$adj.r.squared
## [1] 0.8841013

Isn’t That Better?

Model Prediction

# Build the Model
carModel = lm(data=mtcars, formula=mpg ~ wt + factor(cyl))

# Create the dataset over which to make predictions
newCarData = data.frame(wt=c(2.5,1.9),
                        cyl=factor(c(4,6), levels= levels(factor(mtcars$cyl))))

# Make predictions
newCarData$mpg = predict(carModel, newdata=newCarData)

print(newCarData)
##    wt cyl      mpg
## 1 2.5   4 25.97676
## 2 1.9   6 23.64455

Generalized Linear Models

  • The lm() function assumes the response variable is given by a linear function of the explanatory variables plus normally distributed error

  • I.e.,: \(y = \alpha_0 + \left(\sum_{i=1}^n \alpha_i x_i\right) + \epsilon\), where \(\epsilon \sim N(0,\sigma^2)\)

  • That is, points more or less follow the model except for an additive error that is both Normal and i.i.d.

  • But what if the distribution is different?

  • Generalized linear models extend traditional methods to allow the mean to depend on the explanatory variable through a link function, and use different kinds of distributions

  • This is particularly significant when the response variable is categorical, rather than numeric

  • In R, you can use glm() to perform such modeling

Logistic Regression, p.1

  • Suppose you have a dataset with different student applicants to a graduate program, each with verbal, quantitative, and analytical scores on the GRE, as well as a GPA value and several other numeric measures

  • You want a predictive model as to whether a given applicant will be accepted or not

  • The response variable is categorical (1 or 0, accepted or not) and the explanatory variables are numeric

  • You can’t just use regular linear regression; instead, use logistic regression (“logit”), which models the response variable using a binomial distribution

acceptModel = glm(data=somedata, formula = accept ~ ., family=binomial)

Logistic Regression, p.2

We can do a number of things with such a model:

  • Basically a classifier

  • Get the coefficients for the model

  • Get confidence intervals for the different explanatory variables for acceptance

  • Predict the probability that new points would be accepted

acceptModel = glm(data=somedata, formula = accept ~ ., family=binomial)
coef(acceptModel)
exp(confint.default(acceptModel))
predict.glm(acceptModel,
            data.frame(verbal=155,quant=160,analytic=4,gpa=3.6),
            type="response",
            se.fit=TRUE)

Other Generalizations

  • You can weight different points in the data set differently (i.e., weighted linear regression) by giving lm() or glm() a weights vector (a short sketch follows this list)

  • You can use other non-linear models (e.g., an exponential function of the explanatory variables)

  • You can use local regression, loess(), which uses a \(k\)-nearest-neighbor style approach to produce a smooth curve that fits points as well as possible in local regions of the space
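A minimal weighted-regression sketch (the weights here are hypothetical, chosen only to show the mechanics):

# Hypothetical weights: give lighter cars more influence on the fit
w = 1/mtcars$wt
weightedFit = lm(data=mtcars, formula= mpg ~ hp, weights=w)
coef(weightedFit)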

LOESS Modeling

library(dplyr,quietly=TRUE, warn.conflicts=FALSE)

# Get population by state and build a DF of year and population
popbystate <- read.csv("population-by-year.csv", header=F, col.names=c("State", "Year", "Population"))
population <- summarise(group_by(filter(popbystate, Year>=1986, Year<=2023), Year),      
                                 Total=sum(Population))

# Get crime by year and build a DF of year and violent crime rate
crimebyyear <- read.csv("fbi-violentcrime-1986-2025.csv", header=T)
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on
## 'fbi-violentcrime-1986-2025.csv'
crime <- data.frame(Year=as.numeric(gsub("X", "", colnames(crimebyyear)[2:39])), 
                    Count=as.numeric(crimebyyear[1,2:39]))
 
# Merge these, then compute the rate per 100K
df <- mutate(merge(crime, population), Rate=100000*Count/Total)

crimeModel = loess(data=df, formula=Rate ~ Year)

LOESS Visualization
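The figure is not reproduced here; one hedged way such a visualization might be drawn with ggplot2 (assuming df from the previous slide):

library(ggplot2)
# Scatter the yearly rates and overlay the fitted LOESS curve
ggplot(df, aes(x=Year, y=Rate)) +
  geom_point() +
  geom_smooth(method="loess") +
  labs(y="Violent crimes per 100,000 people")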

Statistical Modeling in Python

The SciPy Package

Getting Some Data

Let’s use another car data set that we can get from here: https://media.geeksforgeeks.org/wp-content/uploads/20250320141852453432/auto-mpg.csv

Read it into Python like this:

import pandas as pd
cars = pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/20250320141852453432/auto-mpg.csv")
cars.head()
##     mpg  cylinders  displacement  ... model year  origin                   car name
## 0  18.0          8         307.0  ...         70       1  chevrolet chevelle malibu
## 1  15.0          8         350.0  ...         70       1          buick skylark 320
## 2  18.0          8         318.0  ...         70       1         plymouth satellite
## 3  16.0          8         304.0  ...         70       1              amc rebel sst
## 4  17.0          8         302.0  ...         70       1                ford torino
## 
## [5 rows x 9 columns]

Hypothesis Testing in Python

  • HA: Cars get less than 24.5 miles per gallon
  • H0: Cars tend to get 24.5 miles per gallon or more
  • The \(p\)-value is less than 5%, so we reject the null hypothesis and conclude cars get less than 24.5 mpg
import scipy
result = scipy.stats.ttest_1samp(cars["mpg"], popmean=24.5, alternative="less")
result.pvalue
result.confidence_interval(0.95)

Analysis of Variance

  • Suppose I would like to know what effect the number of cylinders has on MPG
  • I could perform a one-way ANOVA, where each treatment is the set of MPG values for each cylinder
  • Python is much more annoying about this than R… here’s one way:
scipy.stats.f_oneway(cars["mpg"][cars["cylinders"]==8], 
                     cars["mpg"][cars["cylinders"]==6], 
                     cars["mpg"][cars["cylinders"]==5], 
                     cars["mpg"][cars["cylinders"]==4], 
                     cars["mpg"][cars["cylinders"]==3])

Yes, the number of cylinders has an effect on the MPG…

Formulae in Python Using StatsModels

  • There’s another external package called statsmodels that lets you build formulae like in R
  • It permits a variety of regression tasks
  • But it is very persnickety about the version of numpy you are running
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

dat = sm.datasets.get_rdataset("Guerry", "HistData").data
results = smf.ols('Lottery ~ Literacy + np.log(Pop1831)', data=dat).fit()
results.summary()

More Info: https://www.statsmodels.org/stable/index.html

More Complex Models

  • For basic stats in Python: Use scipy.stats
  • There are a variety of other libraries to help provide easier ways of building standard statistical models
  • For more complex models, we’ll be using SciKit-Learn and TensorFlow
  • More on those next week

Statistical Modeling in Julia

Getting Useful Packages

  • Julia’s Statistics package contains basic statistical operations (mean, median, stddev, etc.)
  • There are packages that extend this (StatsBase and StatsPlots)
  • You can even install most of the default R datasets (RDatasets)
  • For statistical testing, we’ll be using the HypothesisTests package (an installation sketch follows this list)
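A one-time installation sketch for the packages listed above:

# Install the statistics packages used in this section
using Pkg
Pkg.add(["Statistics", "StatsBase", "StatsPlots", "RDatasets", "HypothesisTests"])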

More Info: https://juliastats.org/HypothesisTests.jl/stable/

Getting Some Data

Let’s grab that mtcars dataset:

using RDatasets
mtcars = dataset("datasets","mtcars")

Hypothesis Testing in Julia

  • HA: Cars do not get 22 miles per gallon on average
  • H0: Cars tend to get 22 miles per gallon on average
  • The \(p\)-value is bigger than 5%, so we fail to reject the null hypothesis
using HypothesisTests
result = OneSampleTTest(mtcars.MPG, 22);
pvalue(result)
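HypothesisTests can also report a confidence interval for the same test, e.g.:

confint(result)    # default two-sided 95% confidence interval for the mean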

Analysis of Variance

  • The ANOVA test works with groups extracted from Julia’s grouped DataFrames
  • Use groupby to break the data into groups
  • Then use a list comprehension to extract just the variable you care about
using DataFrames
samplegroups = [grp.MPG for grp in groupby(mtcars, :Cyl)]
OneWayANOVATest(samplegroups...)

Yes, the number of cylinders has an effect on the MPG…

Linear Regression

  • The GLM package in Julia facilitates regression modeling
  • You can use the @formula macro to create formulas like in R
using GLM
lm(@formula(MPG ~ Cyl + WT), mtcars)

More Info: https://juliastats.org/GLM.jl/stable/

More Complex Models

  • Julia is particularly well-suited for complex modeling
  • There are many modules that help with this
  • We’ll focus next week on Flux