Spring 2026

Data Analysis

Analysis

  • Data Analysis is the act of trying to find utility or meaning in data
  • During analysis, we are typically attempting to summarize data, draw conclusions from it, make predictions, or discover patterns in it
  • Data analysis is a broad idea that encompasses many tasks, techniques, and objectives
  • But a key aspect of it typically entails some form of modeling

What is a Model?

  • A model is an abstraction or generalization of something that captures salient properties to study
  • All models are wrong, but good models are useful
  • We model because it is difficult to understand reality in toto
  • With data, models typically involve summarization at some level
  • E.g., there is no actual “average height of humans” out in the world, and fully comprehending the entire population of human heights is not possible – but knowing the average height can be useful
  • The process of finding a model that represents the data as well as possible is called fitting
  • A model can also (sometimes) be seen as a kind of explanation for the data

Explaining vs. Observing

  • In science, we observe phenomena – We record or note what we see, hear, smell, taste, feel, or detect through some apparatus
  • We can do a lot of things with that data, including just summarize it
  • An explanation is something that provides a (typically causal) reason that we observe those things
  • A hypothesis is a (testable) possible explanation for what we are seeing
  • A theory is the explanation we have for what we are observing that is best supported by evidentiary data
  • Theories can be descriptive, predictive, or prescriptive – and they also often involve modeling to accomplish those
  • This is why statistics is so important in science

Statistics, Machine Learning, Data Analytics! Oh My!

  • Statistics – Practice or science of collecting and analyzing data, especially for the purpose of making inferences about a population based on a sample
  • Machine Learning – The use of existing data to build a generalized model to apply to new, unseen data
  • Data Analytics – The process of analyzing data to make informed decisions and solve problems
  • In practice, though, there’s very little difference
  • All of them involve building models to try to generalize from information we have
  • They often use exactly the same techniques and are applied to the same kinds of tasks

Some Typical Modeling Tasks

  • Hypothesis Testing – Determining whether a hypothesis is a likely explanation for observed data
  • Classification – Build a model to predict categories of data
  • Regression – Build a model to show the functional relationship between data variables
  • Clustering – Build a model to group “similar” data
  • Dimension Reduction – Build a model to transform data so that fewer dimensions are used, but as much information as possible still remains
  • Reinforcement Learning – Build a behavior model that performs well on some task

Statistical Modeling in R

Hypothesis Testing in R

https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/t.test

  • HA: Cars get less than 22 miles per gallon
  • H0: Cars tend to get 22 miles per gallon or more
  • The \(p\)-value is less than 5%, so we reject the null hypothesis and conclude cars get less than 22 mpg
t.test(mtcars$mpg, mu=22, alternative="less", conf.level=0.95)
## 
##  One Sample t-test
## 
## data:  mtcars$mpg
## t = -1.7921, df = 31, p-value = 0.04144
## alternative hypothesis: true mean is less than 22
## 95 percent confidence interval:
##      -Inf 21.89707
## sample estimates:
## mean of x 
##  20.09062

Models in R

  • We can use R to construct statistical models

  • Typically, we have to tell R which variables are the response and which are explanatory, using something called a formula

  • response variable ~ explanatory variables

  • For example (each of these forms appears in the sketch after this list):

    • \(y \sim x\)
    • \(y \sim x + z + w\)
    • \(y \sim \;.\)
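A minimal sketch of how these formulas are used in practice (the variable choices from the built-in mtcars data are just for illustration):

# y ~ x : a single explanatory variable
fit1 = lm(data=mtcars, formula= mpg ~ wt)

# y ~ x + z + w : several explanatory variables
fit2 = lm(data=mtcars, formula= mpg ~ wt + hp + disp)

# y ~ . : every other column as an explanatory variable
fit3 = lm(data=mtcars, formula= mpg ~ .)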

Two-Way Analysis of Variance (ANOVA)

What matters for predicting MPG, number of gears or manual vs. automatic?

mtcars_fit <- aov(mtcars$mpg ~  factor(mtcars$gear) * factor(mtcars$am)) 
summary(mtcars_fit) 
##                     Df Sum Sq Mean Sq F value   Pr(>F)    
## factor(mtcars$gear)  2  483.2  241.62  11.869 0.000185 ***
## factor(mtcars$am)    1   72.8   72.80   3.576 0.069001 .  
## Residuals           28  570.0   20.36                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Gears have a statistically significant impact, but manual vs. automatic doesn’t

How Well Does a Linear Model Fit?

  • One way to assess the goodness of a fit is the coefficient of determination
  • It is computed by examining the inter- and intra-variation of the variables
  • Uni-variate model (explanatory variable \(x\) vs. response variable \(y\)):

\[ r = \frac{n\left(\sum_{i=1}^{n} x_iy_i\right) - \left(\sum_{i=1}^{n}x_i\right)\left(\sum_{i=1}^{n}y_i\right)} {\sqrt{\left(n\sum_{i=1}^{n}x_i^2 -\left(\sum_{i=1}^{n}x_i\right)^2\right) \left(n\sum_{i=1}^{n}y_i^2 -\left(\sum_{i=1}^{n}y_i\right)^2\right)}}\]

  • Typically this is squared to give us \(r^2\), a number between 0 and 1, where larger values suggest a stronger linear relationship
  • \(r^2\) is basically a measure of explained variation over total variation (a worked sketch follows this list)
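As a quick check of the formula above, we can compute \(r\) directly and compare with R’s built-in cor() (mtcars again, purely for illustration):

# Compute r from the summation formula, then square it
x = mtcars$wt; y = mtcars$mpg; n = length(x)
r = (n*sum(x*y) - sum(x)*sum(y)) /
    sqrt((n*sum(x^2) - sum(x)^2) * (n*sum(y^2) - sum(y)^2))
c(r=r, r.squared=r^2, builtin=cor(x,y))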

Back to Linear Models in R, p.1

  • For linear models, we use the function lm()

  • lm() takes the data set and the formula, then returns a model

fit = lm(data=Orange,formula= age ~ circumference)
summary(fit)

Back to Linear Models in R, p.2

## 
## Call:
## lm(formula = age ~ circumference, data = Orange)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -317.88 -140.90  -17.20   96.54  471.16 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    16.6036    78.1406   0.212    0.833    
## circumference   7.8160     0.6059  12.900 1.93e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 203.1 on 33 degrees of freedom
## Multiple R-squared:  0.8345, Adjusted R-squared:  0.8295 
## F-statistic: 166.4 on 1 and 33 DF,  p-value: 1.931e-14

Issues with R^2

  • \(r^2\) is an ad-hoc measure of the variance between the actual values and the values predicted by the model

  • It can’t tell you whether the estimates are biased, meaning whether there are systematic errors in the predictions

    • For this, you have to look at the residuals
  • Adding more explanatory variables never lowers it, even when the new variables are irrelevant

    • For multi-variate models, you should use the adjusted coefficient of determination
  • You can have a high \(r^2\) for a bad model or a low \(r^2\) for a good model (see the sketch after this list)

    • For one, the data can be non-linearly related
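R’s built-in anscombe data set makes that last point concrete: four very differently shaped data sets whose linear fits all produce essentially the same \(r^2\) (a minimal sketch):

# r^2 is about 0.67 for all four Anscombe sets, despite wildly different shapes
sapply(1:4, function(i) {
  fit = lm(anscombe[[paste0("y",i)]] ~ anscombe[[paste0("x",i)]])
  summary(fit)$r.squared
})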

Residuals Plot

(Figure: residuals plot with a smoothed trend line; `geom_smooth()` using formula = 'y ~ x')
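The plot itself is not reproduced here; a hedged sketch of how such a residuals plot might be drawn with ggplot2, assuming the Orange fit from the earlier slides:

library(ggplot2)
fit = lm(data=Orange, formula= age ~ circumference)
# Plot residuals against fitted values, with a smoothed trend line
ggplot(data.frame(Fitted=fitted(fit), Residual=resid(fit)),
       aes(x=Fitted, y=Residual)) +
  geom_point() +
  geom_smooth()    # this call emits the message above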

Multivariate Adjusted R^2

# adjusted R^2 with one vs. two explanatory variables (mtcars)
summary(lm(mpg ~ hp, data=mtcars))$adj.r.squared
## [1] 0.5891853
summary(lm(mpg ~ hp + wt, data=mtcars))$adj.r.squared
## [1] 0.8148396

Some Data is Obviously Not Linear

Notice Something about Polynomial Fits:

  • Suppose I have explanatory variable \(x\), as well as a variable \(y\) that depends on \(x\) by a quadratic function

  • This means that there’s some function: \[y = \alpha_0 + \alpha_1 x^2 + \alpha_2 x\]

  • But this is just like a linear function, if \(x^2\) were its own variable! \[y = \alpha_0 + \alpha_1 z + \alpha_2 x, \quad \text{where } z = x^2\]

Notice Something about Polynomial Fits (2):

  • So we could rework our formula to “trick” R into fitting a polynomial model (the I() wrapper keeps R from reading ^ as formula-interaction syntax):
fit = lm(data=hatcolor,formula= y ~ I(x^2) + x)
  • Actually, we could use this trick for all sorts of non-linear functions: \[y = \alpha_0 + \alpha_1 \sin(x) + \alpha_2 x + \alpha_3 x^3\]
fit = lm(data=hatcolor,formula= y ~ sin(x) + x + I(x^3))
  • For polynomials, R gives us a function, poly(), so we can be more general:
fit = lm(data=hatcolor,formula= y ~ poly(x,2)); summary(fit)

Use Non-Linear Models for Non-Linear Data

hatcolorURL = "http://cs.ucf.edu/~wiegand/ids6938/datasets/hatcolor.csv"
hatcolor = read.table(hatcolorURL,header=TRUE)

summary(lm(data=hatcolor,formula=Coolitude ~ HatLightness))$r.squared
## [1] 0.005302867
summary(lm(data=hatcolor,formula=Coolitude ~ HatLightness))$adj.r.squared
## [1] -0.01541999
summary(lm(data=hatcolor,formula=Coolitude ~ poly(HatLightness,2)))$r.squared
## [1] 0.8888319
summary(lm(data=hatcolor,formula=Coolitude ~ poly(HatLightness,2)))$adj.r.squared
## [1] 0.8841013

Isn’t That Better?

Model Prediction

# Build the Model
carModel = lm(data=mtcars, formula=mpg ~ wt + factor(cyl))

# Create the dataset over which to make predictions
newCarData = data.frame(wt=c(2.5,1.9),
                        cyl=factor(c(4,6), levels= levels(factor(mtcars$cyl))))

# Make predictions
newCarData$mpg = predict(carModel, newdata=newCarData)

print(newCarData)
##    wt cyl      mpg
## 1 2.5   4 25.97676
## 2 1.9   6 23.64455

Generalized Linear Models

  • The lm() function assumes the response variable is given by a linear function of the explanatory variables plus normally distributed error

  • I.e.,: \(y = \alpha_0 + \left(\sum_{i=1}^n \alpha_i x_i\right) + \epsilon\), where \(\epsilon \sim N(0,\sigma^2)\)

  • That is, points more or less follow the model except for an additive error that is both Normal and i.i.d.

  • But what if the distribution is different?

  • Generalized linear models extend traditional methods to allow the mean to depend on the explanatory variable through a link function, and use different kinds of distributions

  • This is particularly significant when the response variable is categorical, rather than numeric

  • In R, you can use glm() to perform such modeling

Logistic Regression, p.1

  • Suppose you have a dataset with different student applicants to a graduate program, each with verbal, quantitative, and analytical scores on the GRE, as well as a GPA value and several other numeric measures

  • You want a predictive model as to whether a given applicant will be accepted or not

  • The response variable is categorical (1 or 0, accepted or not) and the explanatory variables are numeric

  • You can’t just use regular linear regression; instead, use logistic regression (“logit”), which models the response variable using a binomial distribution

acceptModel = glm(data=somedata, formula = accept ~ ., family=binomial)

Logistic Regression, p.2

We can do a number of things with such a model:

  • Basically a classifier

  • Get the coefficients for the model

  • Get confidence intervals for the different explanatory variables for acceptance

  • Predict the probability that new points would be accepted

acceptModel = glm(data=somedata, formula = accept ~ ., family=binomial)
coef(acceptModel)
exp(confint.default(acceptModel))
predict.glm(acceptModel,
            data.frame(verbal=155,quant=160,analytic=4,gpa=3.6),
            type="response",
            se.fit=TRUE)

Other Generalizations

  • You can weight different points in the data set differently (i.e., weighted linear regression) by giving lm() or glm() a weights vector (a short sketch follows this list)

  • You can use other non-linear models (e.g., an exponential function of the explanatory variables)

  • You can use local regression, loess(), which uses a \(k\)-nearest-neighbor style approach to produce a smooth curve that fits points as well as possible in local regions of the space
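A minimal weighted-regression sketch (the weights here are hypothetical, chosen only to show the mechanics):

# Hypothetical weights: give lighter cars more influence on the fit
w = 1/mtcars$wt
weightedFit = lm(data=mtcars, formula= mpg ~ hp, weights=w)
coef(weightedFit)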

LOESS Modeling

library(dplyr,quietly=TRUE, warn.conflicts=FALSE)

# Get population by state and build a DF of year and population
popbystate <- read.csv("population-by-year.csv", header=F, col.names=c("State", "Year", "Population"))
population <- summarise(group_by(filter(popbystate, Year>=1986, Year<=2023), Year),      
                                 Total=sum(Population))

# Get crime by year and build a DF of year and violent crime rate
crimebyyear <- read.csv("fbi-violentcrime-1986-2025.csv", header=T)
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on
## 'fbi-violentcrime-1986-2025.csv'
crime <- data.frame(Year=as.numeric(gsub("X", "", colnames(crimebyyear)[2:39])), 
                    Count=as.numeric(crimebyyear[1,2:39]))
 
# Merge these, then compute the rate per 100K
df <- mutate(merge(crime, population), Rate=100000*Count/Total)

crimeModel = loess(data=df, formula=Rate ~ Year)

LOESS Visualization
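The figure is not reproduced here; one hedged way such a visualization might be drawn with ggplot2 (assuming df from the previous slide):

library(ggplot2)
# Scatter the yearly rates and overlay the fitted LOESS curve
ggplot(df, aes(x=Year, y=Rate)) +
  geom_point() +
  geom_smooth(method="loess") +
  labs(y="Violent crimes per 100,000 people")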

Statistical Modeling in Python

The SciPy Package

Getting Some Data

Let’s use another car data set that we can get from here: https://media.geeksforgeeks.org/wp-content/uploads/20250320141852453432/auto-mpg.csv

Read it into Python like this:

import pandas as pd
cars = pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/20250320141852453432/auto-mpg.csv")
cars.head()
##     mpg  cylinders  displacement  ... model year  origin                   car name
## 0  18.0          8         307.0  ...         70       1  chevrolet chevelle malibu
## 1  15.0          8         350.0  ...         70       1          buick skylark 320
## 2  18.0          8         318.0  ...         70       1         plymouth satellite
## 3  16.0          8         304.0  ...         70       1              amc rebel sst
## 4  17.0          8         302.0  ...         70       1                ford torino
## 
## [5 rows x 9 columns]

Hypothesis Testing in Python

  • HA: Cars get less than 24.5 miles per gallon
  • H0: Cars tend to get 24.5 miles per gallon or more
  • The \(p\)-value is less than 5%, so we reject the null hypothesis and conclude cars get less than 24.5 mpg
import scipy
result = scipy.stats.ttest_1samp(cars["mpg"], popmean=24.5, alternative="less")
result.pvalue
result.confidence_interval(0.95)

Analysis of Variance

  • Suppose I would like to know what effect the number of cylinders has on MPG
  • I could perform a one-way ANOVA, where each treatment is the set of MPG values for each cylinder
  • Python is much more annoying about this than R… here’s one way:
scipy.stats.f_oneway(cars["mpg"][cars["cylinders"]==8], 
                     cars["mpg"][cars["cylinders"]==6], 
                     cars["mpg"][cars["cylinders"]==5], 
                     cars["mpg"][cars["cylinders"]==4], 
                     cars["mpg"][cars["cylinders"]==3])

Yes, the number of cylinders has an effect on the MPG…

Formulae in Python Using StatsModels

  • There’s another external package called statsmodels that lets you build formulae like in R
  • It permits a variety of regression tasks
  • But it is very persnickety about the version of numpy you are running
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

dat = sm.datasets.get_rdataset("Guerry", "HistData").data
results = smf.ols('Lottery ~ Literacy + np.log(Pop1831)', data=dat).fit()
results.summary()

More Info: https://www.statsmodels.org/stable/index.html

More Complex Models

  • For basic stats in Python: Use scipy.stats
  • There are a variety of other libraries to help provide easier ways of building standard statistical models
  • For more complex models, we’ll be using SciKit-Learn and TensorFlow
  • More on those next week

Statistical Modeling in Julia

Getting Useful Packages

  • Julia’s Statistics package contains basic statistical operations (mean, median, stddev, etc.)
  • There are packages that extend this (StatsBase and StatsPlots)
  • You can even install most of the default R datasets (RDatasets)
  • For statistical testing, we’ll be using the HypothesisTests package (an installation sketch follows this list)
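A one-time installation sketch for the packages listed above:

# Install the statistics packages used in this section
using Pkg
Pkg.add(["Statistics", "StatsBase", "StatsPlots", "RDatasets", "HypothesisTests"])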

More Info: https://juliastats.org/HypothesisTests.jl/stable/

Getting Some Data

Let’s grab that mtcars dataset:

using RDatasets
mtcars = dataset("datasets","mtcars")

Hypothesis Testing in Julia

  • HA: Cars do not get 22 miles per gallon on average
  • H0: Cars tend to get 22 miles per gallon on average
  • The \(p\)-value is bigger than 5%, so we fail to reject the null hypothesis
using HypothesisTests
result = OneSampleTTest(mtcars.MPG, 22);
pvalue(result)
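HypothesisTests can also report a confidence interval for the same test, e.g.:

confint(result)    # default two-sided 95% confidence interval for the mean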

Analysis of Variance

  • The ANOVA test works with groups extracted from Julia’s grouped DataFrames
  • Use groupby to break the data into groups
  • Then use a list comprehension to extract just the variable you care about
using DataFrames
samplegroups = [grp.MPG for grp in groupby(mtcars, :Cyl)]
OneWayANOVATest(samplegroups...)

Yes, the number of cylinders has an effect on the MPG…

Linear Regression

  • The GLM package in Julia facilitates regression modeling
  • You can use the @formula macro to create formulas like in R
using GLM
lm(@formula(MPG ~ Cyl + WT), mtcars)

More Info: https://juliastats.org/GLM.jl/stable/

More Complex Models

  • Julia is particularly well-suited for complex modeling
  • There are many modules that help with this
  • We’ll focus next week on Flux