Spring 2025
https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/t.test
t.test(mtcars$mpg, mu=22, alternative="less", conf.level=0.95)
## 
##  One Sample t-test
## 
## data:  mtcars$mpg
## t = -1.7921, df = 31, p-value = 0.04144
## alternative hypothesis: true mean is less than 22
## 95 percent confidence interval:
##      -Inf 21.89707
## sample estimates:
## mean of x 
##  20.09062
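The object returned by t.test() is an ordinary R list (class "htest"), so individual pieces can be pulled out for later use instead of re-reading the printed output; a minimal sketch:
tt <- t.test(mtcars$mpg, mu=22, alternative="less", conf.level=0.95)
tt$p.value    # just the p-value
tt$conf.int   # the one-sided confidence interval
tt$estimate   # the sample mean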
We can use R to construct statistical models
Typically, we have to tell R what the response and explanatory variables of the model are using something called a formula
response variable ~ explanatory variables
For example:
What matters for predicting MPG, number of gears or manual vs. automatic?
mtcars_fit <- aov(mtcars$mpg ~ factor(mtcars$gear) * factor(mtcars$am))
summary(mtcars_fit)
##                      Df Sum Sq Mean Sq F value   Pr(>F)    
## factor(mtcars$gear)   2  483.2  241.62  11.869 0.000185 ***
## factor(mtcars$am)     1   72.8   72.80   3.576 0.069001 .  
## Residuals            28  570.0   20.36                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Gears have a statistically significant impact, but manual vs. automatic doesn’t
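A natural follow-up is asking which gear counts actually differ from one another; a hedged sketch using Tukey's honest significant differences on a simple one-way fit (not the interaction model above):
# Pairwise comparisons of mean mpg across gear counts
gear_fit <- aov(mpg ~ factor(gear), data=mtcars)
TukeyHSD(gear_fit)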
\[ r = \frac{n\sum_{i=1}^{n}x_i y_i - \left(\sum_{i=1}^{n}x_i\right)\left(\sum_{i=1}^{n}y_i\right)} {\sqrt{\left(n\sum_{i=1}^{n}x_i^2 -\left(\sum_{i=1}^{n}x_i\right)^2\right) \left(n\sum_{i=1}^{n}y_i^2 -\left(\sum_{i=1}^{n}y_i\right)^2\right)}}\]
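As a sanity check, the formula can be computed directly and compared against R's built-in cor(); a minimal sketch using two mtcars columns chosen only for illustration:
x <- mtcars$hp; y <- mtcars$mpg; n <- length(x)
# Pearson correlation computed from the formula above
r_manual <- (n*sum(x*y) - sum(x)*sum(y)) /
  sqrt((n*sum(x^2) - sum(x)^2) * (n*sum(y^2) - sum(y)^2))
c(r_manual, cor(x, y))   # the two values should agree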
For linear models, we use the function lm()
lm() takes the data set and the formula, then returns a model
fit = lm(data=Orange, formula= age ~ circumference)
summary(fit)
## 
## Call:
## lm(formula = age ~ circumference, data = Orange)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -317.88 -140.90  -17.20   96.54  471.16 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    16.6036    78.1406   0.212    0.833    
## circumference   7.8160     0.6059  12.900 1.93e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 203.1 on 33 degrees of freedom
## Multiple R-squared:  0.8345, Adjusted R-squared:  0.8295 
## F-statistic: 166.4 on 1 and 33 DF,  p-value: 1.931e-14
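Once fitted, the model can be queried and overlaid on the data; a quick sketch with the same Orange fit:
coef(fit)      # intercept and slope
confint(fit)   # confidence intervals for the coefficients
plot(age ~ circumference, data=Orange)
abline(fit)    # add the fitted regression line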
\(r^2\) is an ad-hoc measure of how much of the variance in the actual values is captured by the model’s predicted values
It can’t tell you whether the estimates are biased, meaning whether there are systematic errors in the predictions
The more explanatory variables you add to the model, the higher it gets, regardless of whether those variables actually help
You can have a high \(r^2\) for a bad model or a low \(r^2\) for a good model
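A quick way to see the inflation problem: add a column of pure noise and watch \(r^2\) creep up while adjusted \(r^2\) does not; a minimal sketch (the noise column is invented purely for illustration):
set.seed(42)
mt <- mtcars
mt$noise <- rnorm(nrow(mt))   # a meaningless explanatory variable
summary(lm(mpg ~ hp, data=mt))$r.squared
summary(lm(mpg ~ hp + noise, data=mt))$r.squared       # never lower than above
summary(lm(mpg ~ hp + noise, data=mt))$adj.r.squared   # penalized instead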
# using ds from last slide
summary(lm(mpg ~ hp, data=mtcars))$adj.r.squared
## [1] 0.5891853
summary(lm(mpg ~ hp + wt, data=mtcars))$adj.r.squared
## [1] 0.8148396
Suppose I have explanatory variable \(x\), as well as a variable \(y\) that depends on \(x\) by a quadratic function
This means that there’s some function: \[y = \alpha_0 + \alpha_1 x^2 + \alpha_2 x\]
But this is just like a linear function, if \(x^2\) were its own variable! \[y = \alpha_0 + \alpha_1 z + \alpha_2 x, \quad \text{where } z = x^2\]
# Note: inside an R formula, ^ means crossing, so literal powers must be wrapped in I()
fit = lm(data=hatcolor, formula= y ~ I(x^2) + x)
fit = lm(data=hatcolor, formula= y ~ sin(x) + x + I(x^3))
fit = lm(data=hatcolor,formula= y ~ poly(x,2)); summary(fit)
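Note that poly(x, 2) uses orthogonal polynomials by default, so its coefficients are not the raw \(\alpha\) values, but the fitted curve is the same as spelling out I(x^2); a sketch with a made-up data frame to show the equivalence:
# Small synthetic data set, just to compare the two spellings
d <- data.frame(x = seq(-3, 3, by=0.25))
d$y <- 2 + 0.5*d$x - 1.5*d$x^2 + rnorm(nrow(d), sd=0.5)
fit_raw  <- lm(y ~ I(x^2) + x, data=d)   # raw quadratic term
fit_poly <- lm(y ~ poly(x, 2), data=d)   # orthogonal polynomial
all.equal(fitted(fit_raw), fitted(fit_poly))   # same fitted values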
hatcolorURL = "http://cs.ucf.edu/~wiegand/ids6938/datasets/hatcolor.csv"
hatcolor = read.table(hatcolorURL, header=TRUE)
summary(lm(data=hatcolor, formula=Coolitude ~ HatLightness))$r.squared
## [1] 0.005302867
summary(lm(data=hatcolor,formula=Coolitude ~ HatLightness))$adj.r.squared
## [1] -0.01541999
summary(lm(data=hatcolor,formula=Coolitude ~ poly(HatLightness,2)))$r.squared
## [1] 0.8888319
summary(lm(data=hatcolor,formula=Coolitude ~ poly(HatLightness,2)))$adj.r.squared
## [1] 0.8841013
# Build the Model
carModel = lm(data=mtcars, formula=mpg ~ wt + factor(cyl))
# Create the dataset over which to make predictions
newCarData = data.frame(wt=c(2.5,1.9),
                        cyl=factor(c(4,6), levels=levels(factor(mtcars$cyl))))
# Make predictions
newCarData$mpg = predict(carModel, newdata=newCarData)
print(newCarData)
##    wt cyl      mpg
## 1 2.5   4 25.97676
## 2 1.9   6 23.64455
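predict() on an lm fit can also return interval estimates rather than just point predictions; a minimal sketch with the same carModel and newCarData:
# 95% confidence interval for the mean response at the new points
predict(carModel, newdata=newCarData, interval="confidence")
# 95% prediction interval for an individual new car
predict(carModel, newdata=newCarData, interval="prediction")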
The lm() function assumes that the response variable deviates from the linear function of the explanatory variables according to a normal distribution
I.e.,: \(y = \alpha_0 + \left(\sum_{i=1}^n \alpha_i x_i\right) + N(0,\sigma)\)
That is, the points more or less follow the model except for an additive error that is both Normal and i.i.d.
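You can eyeball whether this assumption is plausible by looking at the residuals of a fitted model; a quick sketch using the Orange fit from earlier:
fit <- lm(age ~ circumference, data=Orange)
qqnorm(resid(fit)); qqline(resid(fit))   # roughly a straight line if Normal
plot(fitted(fit), resid(fit))            # no obvious pattern if i.i.d.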
But what if the distribution is different?
Generalized linear models extend traditional methods to allow the mean to depend on the explanatory variable through a link function, and use different kinds of distributions
This is particularly significant when the response variable is categorical, rather than numeric
In R, you can use glm() to perform such modeling
Suppose you have a dataset with different student applicants to a graduate program, each with verbal, quantitative, and analytical scores on the GRE, as well as a GPA value and several other numeric measures
You want a model that predicts whether a given applicant will be accepted or not
The response variable is categorical (1 or 0, accepted or not) and the explanatory variables are numeric
You can’t just use regular linear regression, you must use logistic regression (“logit”), which models the response variable using a binomial distribution
acceptModel = glm(data=somedata, formula = accept ~ ., family=binomial)
We can do a number of things with such a model:
Basically a classifier
Get the coefficients for the model
Get confidence intervals for the different explanatory variables for acceptance
Predict the probability whether a new point(s) would be accepted or not
acceptModel = glm(data=somedata, formula = accept ~ ., family=binomial)
coef(acceptModel)
exp(confint.default(acceptModel))
predict.glm(acceptModel,
            data.frame(verbal=155, quant=160, analytic=4, gpa=3.6),
            type="response", se.fit=TRUE)
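Since the logit model returns a probability, using it as a classifier is just a matter of thresholding; a hedged sketch continuing the hypothetical example above (the 0.5 cutoff is an assumption):
p <- predict.glm(acceptModel,
                 data.frame(verbal=155, quant=160, analytic=4, gpa=3.6),
                 type="response")
ifelse(p > 0.5, "accept", "reject")   # turn the probability into a decision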
You can weight different points in the data set differently (i.e., weighted linear regression) by giving lm() or glm() a weights vector (see the sketch after this list)
You can use other non-linear models (e.g., an exponential function of the explanatory variables); also sketched below
You can use local regression, loess(), which uses a \(k\)-nearest-neighbor-style approach to produce a piecewise-continuous curve that fits the points as well as possible in local regions of the space
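A minimal sketch of the first two points, using mtcars and made-up weights and starting values purely for illustration:
# Weighted regression: points with larger weights pull the fit harder
w <- runif(nrow(mtcars))
wfit <- lm(mpg ~ wt, data=mtcars, weights=w)
# Non-linear least squares: an exponential model fit with nls()
efit <- nls(mpg ~ a * exp(b * wt), data=mtcars,
            start=list(a=50, b=-0.3))   # starting guesses; adjust if convergence fails
summary(efit)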
library(dplyr, quietly=TRUE, warn.conflicts=FALSE)
# Get population by state and build a DF of year and population
popbystate <- read.csv("population-by-year.csv", header=F,
                       col.names=c("State", "Year", "Population"))
population <- summarise(group_by(filter(popbystate, Year>=1986, Year<=2023), Year),
                        Total=sum(Population))
# Get crime by year and build a DF of year and violent crime rate
crimebyyear <- read.csv("fbi-violentcrime-1986-2025.csv", header=T)
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on
## 'fbi-violentcrime-1986-2025.csv'
crime <- data.frame(Year=as.numeric(gsub("X", "", colnames(crimebyyear)[2:39])),
                    Count=as.numeric(crimebyyear[1,2:39]))
# Merge these, then compute the rate per 100K
df <- mutate(merge(crime, population), Rate=100000*Count/Total)
# Response is the crime rate; Year is the explanatory variable
crimeModel = loess(data=df, formula=Rate ~ Year)
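The fitted loess object can then be used like any other model, for example to get smoothed values over a grid of years and plot them against the raw rates; a minimal sketch assuming the Rate ~ Year form used above:
yearGrid <- data.frame(Year=seq(min(df$Year), max(df$Year), by=1))
yearGrid$Smoothed <- predict(crimeModel, newdata=yearGrid)
plot(Rate ~ Year, data=df)                 # raw violent-crime rates
lines(yearGrid$Year, yearGrid$Smoothed)    # loess-smoothed trend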