Learning Objectives:

Students will learn how to …

  • Import data into R
  • Obtain summary statistics in R
  • Create scatter plots
  • Perform mathematical functions on vectors of data in R (elementwise)

Example 1: Ice Cream

Last week was hot!!! What refreshing treat do people enjoy on a hot day? ICE CREAM! As we come to the end of the summer and the cooler weather of the fall, let’s take a look at this fun dataset about ice cream sales and weather.

These data are provided on github by Guilherme D. F. Silva (guilhermedom).

Step 1: Import the Data

###Import the data
icecream<-read.csv("https://raw.githubusercontent.com/guilhermedom/perceptron-regression-ice-cream-per-temp/refs/heads/main/data/raw/icecream_sales.csv",
                 header=TRUE)

Step 2: Look at the Structure of the Data

###Look at the data
str(icecream)
## 'data.frame':    500 obs. of  2 variables:
##  $ Temperature: num  24.6 26 27.8 20.6 11.5 ...
##  $ Revenue    : num  535 625 661 488 316 ...
#View(icecream)

Step 3: Save Vectors for Response and Explanatory

###Get access to the data set and create variables for easy access later
temp<-icecream$Temperature
rev<-icecream$Revenue

Step 4: Create a Histogram

### IN BASE
###Make a histogram of the response variable 
hist(rev)

### IN GGPLOT
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
ggplot(icecream, aes(x=rev))+
  geom_histogram(bins=10)

Step 5: Summary Statistics

###Report basic statistics of the response variable 
summary(rev)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    10.0   405.6   529.4   521.6   642.3  1000.0
mean(rev)
## [1] 521.5708
sd(rev)
## [1] 175.4048

Step 6: Linear Model in R

###Run a simple linear regression model on shear against age and store the results
results<-lm(rev~temp)
summary(results)
## 
## Call:
## lm(formula = rev ~ temp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -73.303 -15.596  -0.167  16.811  91.294 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  44.8313     3.2718    13.7   <2e-16 ***
## temp         21.4436     0.1383   155.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.01 on 498 degrees of freedom
## Multiple R-squared:  0.9797, Adjusted R-squared:  0.9797 
## F-statistic: 2.404e+04 on 1 and 498 DF,  p-value: < 2.2e-16

Step 7: Scatter Plot

####Create a scatterplot
## BASE
plot(temp, rev)
####Add the least squares regression line
abline(coefficients(results), col="red", lwd=2)

## GGPLOT
ggplot(icecream, aes(x=temp, y=rev))+
  geom_point()+
  geom_smooth(method="lm", color="red")
## `geom_smooth()` using formula = 'y ~ x'

Step 8: Verify from Scratch

Use the mathematical equations derived in class to confirm the least squares estimates from lm.

Slope:

\[\hat{\beta}_1 = \frac{n\sum_{i=1}^n x_i Y_i - \sum_{i=1}^n x_i \sum_{i=1} ^n Y_i}{n\sum_{i=1}^n x_i^2-(\sum_{i=1}^n x_i)^2}\]

Intercept:

\[\hat{\beta}_0 = \bar{Y}-\hat{\beta}_1 \bar{x}\]

### FOR EASE OF NOTATION
x<-temp
Y<-rev

### VECTOR MATH IS ELEMENT WISE
head(x)
## [1] 24.56688 26.00519 27.79055 20.59534 11.50350 14.35251
head(Y)
## [1] 534.7990 625.1901 660.6323 487.7070 316.2402 367.9407
head(x*Y)
## [1] 13138.346 16258.189 18359.337 10044.488  3637.868  5280.875
### NUMERATOR (Part 1)
n<-length(x)
num1<-n*sum(x*Y)

### NUMERATOR (Part 2)
num2<-sum(x)*(sum(Y))

### DENOMINATOR (Part 1)
den1<-n*sum(x^2)

### DENOMINATOR (Part 2)
den2<-sum(x)^2

### BETA 1 HAT
beta_1_hat<-(num1-num2)/(den1-den2)
beta_1_hat
## [1] 21.44363
### BETA 0 HAT
beta_0_hat<-mean(Y)-beta_1_hat*mean(x)
beta_0_hat
## [1] 44.83127

Example 2: Rockets

The rocket data are available here:

rocket<-read.csv("https://raw.githubusercontent.com/kitadasmalley/Teaching_Public/refs/heads/main/STAT441/FA25/data/rocketProp.csv",
                 header=TRUE)

These rocket data are from:

“Introduction to Linear Regression Analysis (3rd Edition)” by Douglas C. Montgomery, Elizabeth A. Peek, and G. Geoffrey Vining. Published by Wiley-Interscience as part of the Wiley Series in Probability and Statistics.