What is the fundamental difference between parametric and
nonparametric models?
Answer: The fundamental difference between parametric and nonparametric models lies in whether the form of the model is specified in advance and in the assumptions each makes about the underlying data:
Parametric Models: Require specifying the form of the function f(X) in advance and make assumptions about the error terms. They are interpretable, useful for prediction, and allow traditional hypothesis testing. However, they can be less effective for complex or large datasets and require diagnostic checks of their assumptions.
Nonparametric Models: Do not require specifying f(X) in advance and are more data-driven, making them suitable for complex relationships. They offer flexibility, but at the cost of interpretability and the ability to conduct traditional hypothesis testing.
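As a concrete illustration, the sketch below contrasts the two approaches on the Advertising data used later in this discussion (a minimal sketch; it assumes Advertising.csv is in the working directory):
# Parametric: we specify f(X) as linear in TV and radio and estimate a few interpretable coefficients
library(FNN)
Adver <- read.csv("Advertising.csv", header=TRUE)
param.fit <- lm(sales ~ TV + radio, data=Adver)
coef(param.fit)            # the entire fit is summarized by three parameters
# Nonparametric: no functional form is assumed; each prediction is the average of the 5 nearest neighbors
nonparam.fit <- knn.reg(train=Adver[, c("TV","radio")], test=Adver[, c("TV","radio")], y=Adver$sales, k=5)
head(nonparam.fit$pred)    # fitted values only; there are no parameters to interpret or test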
Did any of the embedded questions throughout the synchronous material give you problems? If so, which ones? Try to articulate what is troubling you. - No.
For this discussion, let us read in the Advertising data set discussed in the unit 1 videos. The packages here are for the KNN fitting as well as the graphical presentations.
Code explanation:
Libraries: rgl for 3D plotting and FNN for KNN methods.
Reading the data: read.csv("Advertising.csv", header=T) reads the CSV file into a data frame called Adver; header=T indicates that the first row of the CSV contains the column names.
Data preview: head(Adver) displays the first few rows of the dataset for a quick preview.
Multiple Linear Regression (MLR) and 3D plotting:
MLR model: lm(sales~TV+radio, data=Adver) fits a linear regression model with sales as the response variable and TV and radio as predictor variables.
3D scatter plot: plot3d creates a 3D scatter plot of TV, radio, and sales.
Regression surface: surface3d adds a surface to the 3D plot representing the MLR fit. The call outer(0:300, 0:50, function(x, y) {2.92 + .04575*x + .18799*y}) computes predicted values of sales over a grid of TV and radio values, creating a surface that represents the MLR model. The coefficients used (2.92, 0.04575, 0.18799) should ideally come from the fitted MLR model rather than being hardcoded; a sketch that builds the surface from coef() appears after the MLR code chunk below.
K-Nearest Neighbors (KNN) Regression Model:
predictors <- data.frame(TV=rep(0:300, 51), radio=rep(0:50, each=301)) # Creates a grid of predictor values
knn5 <- knn.reg(Adver[, 2:3], test=predictors, y=Adver$sales, k=5) # KNN regression with k=5
pred.surface <- matrix(knn5$pred, 301, 51) # Reshapes the predictions for plotting
plot3d(Adver$TV, Adver$radio, Adver$sales, xlab="TV", ylab="Radio", zlab="Sales") # 3D scatter plot
surface3d(0:300, 0:50, pred.surface, alpha=.4) # Adds the KNN surface to the plot
Grid of predictors: data.frame(TV=rep(0:300, 51), radio=rep(0:50, each=301)) creates a grid of TV and radio values used to make predictions across their ranges.
KNN regression: knn.reg performs KNN regression; k=5 specifies that the algorithm should use the 5 nearest neighbors when making each prediction.
Plotting the KNN predictions: the predictions are reshaped into a matrix corresponding to the grid of TV and radio values, and a 3D plot is generated. The surface3d function adds a surface representing the KNN model's predictions.
Creating a Grid of Predictor Values: predictors <- data.frame(TV=rep(0:300, 51), radio=rep(0:50, each=301))
Fitting the KNN Regression Model: knn5 <- knn.reg(Adver[, 2:3], test=predictors, y=Adver$sales, k=5)
Reshaping Predictions for Plotting: pred.surface <- matrix(knn5$pred, 301, 51)
Creating the 3D Scatter Plot: plot3d(Adver$TV, Adver$radio, Adver$sales, xlab="TV", ylab="Radio", zlab="Sales")
Adding KNN Surface to plot: surface3d(0:300, 0:50, pred.surface, alpha=.4)
Adver <- read.csv("Advertising.csv", header=T) # Reads the CSV file into a data frame
head(Adver) # Displays the first few rows of the data frame; the output matches the head(Adver) listing shown below
simple.fit <- lm(sales~TV+radio, data=Adver) # Fits an MLR model
plot3d(Adver$TV, Adver$radio, Adver$sales, xlab="TV", ylab="Radio", zlab="Sales") # 3D scatter plot
surface3d(0:300, 0:50, outer(0:300, 0:50, function(x, y) {2.92 + .04575*x + .18799*y}), alpha=.4) # Adds the MLR surface
library(rgl) ##For mac users you may need to download Xquartz before the 3d plots will run.
## Warning: package 'rgl' was built under R version 4.3.2
knitr::knit_hooks$set(webgl=hook_webgl)
library(FNN)
## Warning: package 'FNN' was built under R version 4.3.2
Adver<-read.csv("Advertising.csv",header=T)
head(Adver)
## X TV radio newspaper sales
## 1 1 230.1 37.8 69.2 22.1
## 2 2 44.5 39.3 45.1 10.4
## 3 3 17.2 45.9 69.3 9.3
## 4 4 151.5 41.3 58.5 18.5
## 5 5 180.8 10.8 58.4 12.9
## 6 6 8.7 48.9 75.0 7.2
The following code chunk presents a simple, additive MLR fit by way of a 3D scatter plot similar to the unit 1 videos. Note the graphs may show up only when you knit the document.
simple.fit<-lm(sales~TV+radio,data=Adver)
plot3d(Adver$TV,Adver$radio,Adver$sales,xlab="TV",ylab="Radio",zlab="Sales")
surface3d(0:300,0:50,outer(0:300,0:50,function(x,y) {2.92+.04575*x+.18799*y}),alpha=.4)
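As noted in the code explanation above, the surface coefficients (2.92, 0.04575, 0.18799) are hardcoded. A minimal sketch that pulls them from the fitted model instead (assuming simple.fit from the chunk above):
b <- coef(simple.fit)   # (Intercept), TV, and radio estimates from the fitted MLR model
surface3d(0:300, 0:50, outer(0:300, 0:50, function(x, y) {b[1] + b[2]*x + b[3]*y}), alpha=.4)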
Next we have a KNN model with k=5.
predictors<-data.frame(TV=rep(0:300,51),radio=rep(0:50,each=301)) #Grid of points instead of using outer function previously
knn5<-knn.reg(Adver[,2:3],test=predictors,y=Adver$sales,k=5)
pred.surface<-matrix(knn5$pred,301,51)
plot3d(Adver$TV,Adver$radio,Adver$sales,xlab="TV",ylab="Radio",zlab="Sales")
surface3d(0:300,0:50,pred.surface,alpha=.4)
Compare the MLR fit to additional KNN models changing k=5 to k=2, 10, 30. Summarize your findings.
Answer: Based on the Mean Squared Error (MSE) values I've computed for the KNN models with different values of k:
It appears that the KNN model with k=2 has the lowest MSE, indicating the best performance of the three in terms of prediction accuracy on this data. As k increases, the MSE also increases, suggesting a decrease in accuracy. Note, however, that these MSEs are computed on the same observations used to fit each model, so a lower k (which allows the model to capture more complex, and possibly noisier, patterns) will naturally look better by this measure.
Comparing the MLR and KNN models for different k values:
In summary, KNN with a lower k value (especially k=2) performs better on this dataset than MLR and than KNN with higher k values, but careful consideration should be given to the potential for overfitting.
library(rgl)
library(FNN)
Adver <- read.csv("Advertising.csv", header=TRUE)
head(Adver)
simple.fit <- lm(sales ~ TV + radio, data=Adver)
plot3d(Adver$TV, Adver$radio, Adver$sales, xlab="TV", ylab="Radio", zlab="Sales")
surface3d(0:300, 0:50, outer(0:300, 0:50, function(x, y) {2.92 + .04575*x + .18799*y}), alpha=0.4)
knn5 <- knn.reg(Adver[, 2:3], test=Adver[, 2:3], y=Adver$sales, k=5)
mse_knn5 <- mean((Adver$sales - knn5$pred)^2)
knn2 <- knn.reg(Adver[, 2:3], test=Adver[, 2:3], y=Adver$sales, k=2)
mse_knn2 <- mean((Adver$sales - knn2$pred)^2)
knn10 <- knn.reg(Adver[, 2:3], test=Adver[, 2:3], y=Adver$sales, k=10)
mse_knn10 <- mean((Adver$sales - knn10$pred)^2)
knn30 <- knn.reg(Adver[, 2:3], test=Adver[, 2:3], y=Adver$sales, k=30)
mse_knn30 <- mean((Adver$sales - knn30$pred)^2)
cat("MSE for KNN with k=2:", mse_knn2, "\n")
## MSE for KNN with k=2: 0.2649625
cat("MSE for KNN with k=10:", mse_knn10, "\n")
## MSE for KNN with k=10: 1.061833
cat("MSE for KNN with k=30:", mse_knn30, "\n")
## MSE for KNN with k=30: 4.628589
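Note that the MSEs above are computed on the same observations used to fit each KNN model, so the smallest k is expected to look best. A minimal sketch of a holdout comparison that also includes the MLR fit (the 75/25 split and the seed are my own choices, not part of the assignment):
set.seed(1234)   # arbitrary seed for reproducibility
train.idx <- sample(1:nrow(Adver), size=floor(0.75*nrow(Adver)))
train <- Adver[train.idx, ]
test <- Adver[-train.idx, ]
for (k in c(2, 5, 10, 30)) {   # test-set MSE for several k values
  knn.fit <- knn.reg(train[, 2:3], test=test[, 2:3], y=train$sales, k=k)
  cat("Test MSE, KNN k =", k, ":", mean((test$sales - knn.fit$pred)^2), "\n")
}
mlr.fit <- lm(sales ~ TV + radio, data=train)
cat("Test MSE, MLR:", mean((test$sales - predict(mlr.fit, test))^2), "\n")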
The following data set was obtained during a sex discrimination case involving a bank in the 1960s and 1970s. There are 93 employees who were hired during this time. I've provided a few graphs to explore the data set as well. For this exercise, the response is their base salary upon being hired and we will ignore what their salary was as of March 1975.
library(Sleuth3)
## Warning: package 'Sleuth3' was built under R version 4.3.2
library(ggplot2)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.3.2
head(case1202)
## Bsal Sal77 Sex Senior Age Educ Exper
## 1 5040 12420 Male 96 329 15 14.0
## 2 6300 12060 Male 82 357 15 72.0
## 3 6000 15120 Male 67 315 15 35.5
## 4 6000 16320 Male 97 354 12 24.0
## 5 6000 12300 Male 66 351 12 56.0
## 6 6840 10380 Male 92 374 15 41.5
p1<-ggplot(data=case1202,aes(x=Age,y=Bsal,colour=Sex))+geom_point()
p2<-ggplot(data=case1202,aes(x=Educ,y=Bsal,colour=Sex))+geom_point()
p3<-ggplot(data=case1202,aes(x=Exper,y=Bsal,colour=Sex))+geom_point()
#grid.arrange(p1,p2,p3,nrow=1)
p1
p2
p3
Code:
library(GGally)
library(ggplot2)
library(gridExtra)
FEV <- read.csv("FEV.csv")
View(FEV)
FEV <- FEV[,-1]
names(FEV)
#Quick Visualization
ggpairs(FEV[,c(1,3,4,5,2)], mapping=ggplot2::aes(colour=Smoker), lower=list(continuous=wrap("points", alpha=0.5, size=0.5)))
boxplot(FEV~Smoker, data=FEV)
#Simple Pooled t-test
ttest <- lm(FEV~Smoker, data=FEV)
summary(ttest)
#If you don't believe me
Smokers <- FEV$FEV[which(FEV$Smoker=="Current")]
NonSmokers <- FEV$FEV[which(FEV$Smoker=="Non")]
t.test(NonSmokers, Smokers, var.equal=T)
#Acknowledging we have some confounding
p1 <- ggplot(data=FEV, aes(x=Age, y=FEV, colour=Smoker)) + geom_point()
p2 <- ggplot(data=FEV, aes(x=Sex, y=FEV)) + geom_boxplot()
grid.arrange(p1, p2, nrow=1)
plot(table(FEV$Sex, FEV$Smoker))
#If we want to adjust for Age and Sex
mymodel <- lm(FEV~Smoker+Age+Sex, data=FEV)
summary(mymodel)
confint(mymodel)
#Fit and analyze regression model
library(Sleuth3) # Make sure the dataset 'case1202' is in this package
data(case1202) # Load the dataset
mymodel <- lm(Bsal ~ Sex + Age + Educ + Exper, data=case1202)
par(mfrow=c(2,2))
plot(mymodel)
par(mfrow=c(1,1))
#Regression coefficient and interpretation:
summary_model <- summary(mymodel)
summary_model$coefficients
p1 + geom_smooth(method="lm")
p2 + geom_smooth(method="lm")
Using the provided graphs, do any of the predictor variables look like they are confounded with Sex? That is, is there any strong association between Sex and the other predictor variables?
Answer: Based on the provided graphs, the boxplot suggests a difference in FEV between males and females, indicating that Sex may be associated with FEV. Without examining the scatter plots of Age, Education, and Experience by Sex more closely, it is hard to conclude confounding definitively.
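A more direct check of confounding is to compare the distribution of each predictor across Sex in the case1202 data (a small sketch; the particular summaries are my choice, not part of the assignment):
library(Sleuth3)
library(ggplot2)
aggregate(cbind(Age, Educ, Exper) ~ Sex, data=case1202, FUN=mean)   # average of each predictor by Sex
ggplot(case1202, aes(x=Sex, y=Exper)) + geom_boxplot()   # distribution of Experience by Sex
ggplot(case1202, aes(x=Sex, y=Age)) + geom_boxplot()     # distribution of Age by Sex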
Use the lm function to fit a regression model to baseline salary with Sex, Education, Experience, and Age in the model. Store this regression model and use the plot function to examine the residuals. Comment on the validity of this model in terms of the assumptions of MLR. The code should look similar to the following; you can modify as necessary.
mymodel<-lm(Bsal~Sex+Age,data=case1202)
par(mfrow=c(2,2))
plot(mymodel)
par(mfrow=c(1,1))
Answer:
# Load necessary package
library(Sleuth3)
data(case1202)
mymodel <- lm(Bsal ~ Sex + Educ + Exper + Age, data=case1202)
par(mfrow=c(2,2))
plot(mymodel)
par(mfrow=c(1,1))
The residual plots from the linear model show:
Residuals vs Fitted: no clear pattern, indicating a reasonable model fit.
Normal Q-Q: points largely follow the line, suggesting the residuals are approximately normally distributed.
Scale-Location: the spread appears roughly constant, indicating homoscedasticity.
Residuals vs Leverage: no points with both high leverage and large residuals, suggesting no unduly influential points.
Overall, the model seems to meet the assumptions of linear regression well.
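To supplement the visual checks, a couple of formal diagnostics could also be run on the stored model (a sketch; these particular tests are my choice, not required by the assignment):
library(car)                       # provides ncvTest()
shapiro.test(residuals(mymodel))   # test of normality of the residuals
ncvTest(mymodel)                   # test of constant error variance (homoscedasticity)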
Use the summary function to obtain the regression coefficients of your model. Interpret the regression coefficient corresponding to the sex variable. What is the result of the test suggesting in regards to the ultimate goal of the study, which is to determine whether there is evidence of sexual discrimination?
Answer:
# Load the necessary library
library(Sleuth3)
data(case1202)
mymodel <- lm(Bsal ~ Sex + Age + Educ + Exper, data=case1202)
summary_mymodel <- summary(mymodel)
print(summary_mymodel$coefficients)
##                 Estimate   Std. Error     t value     Pr(>|t|)
## (Intercept) 3.439390e+03  475.3684798  7.23520939 1.640493e-10
## SexMale     7.654309e+02  140.9994331  5.42860945 4.963952e-07
## Age         1.191235e+00    0.7746744  1.53772375 1.277042e-01
## Educ        9.214428e+01   27.1828848  3.38979015 1.048768e-03
## Exper       1.384471e-03    1.1465790  0.00120748 9.990393e-01
The coefficient for SexMale is approximately 765.43 with a highly significant p-value (approximately 4.96 x 10^-7), indicating that, holding Age, Education, and Experience constant, being male is associated with roughly a $765 higher baseline salary than being female, suggesting evidence of sex discrimination in this dataset.
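A confidence interval makes the size of this adjusted difference more concrete (a minimal sketch using the mymodel fit above):
confint(mymodel)["SexMale", ]   # 95% CI for the male-female difference in Bsal, adjusted for Age, Educ, and Exper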
Now add +geom_smooth at the end of each of the earlier plots. What does this do graphically? Does it suggest we should do anything to perhaps better fit our model?
Answer:
library(ggplot2)
p1 <- ggplot(data=case1202, aes(x=Age, y=Bsal, colour=Sex)) + geom_point() + geom_smooth(method="lm")
p2 <- ggplot(data=case1202, aes(x=Educ, y=Bsal, colour=Sex)) + geom_point() + geom_smooth(method="lm")
p3 <- ggplot(data=case1202, aes(x=Exper, y=Bsal, colour=Sex)) + geom_point() + geom_smooth(method="lm")
p1: illustrates the relationship between Age and baseline salary (Bsal) while distinguishing male and female employees; the fitted lines suggest an increasing relationship between Age and salary, with differences in slope and intercept by Sex.
p2: displays the relationship between Education (Educ) and Bsal, differentiated by Sex; the separate trend lines suggest that education may be related to salary differences.
p3: shows the relationship between Experience (Exper) and Bsal by Sex; the trend lines suggest there may be an experience-related pattern in salary differences by sex.
Graphically, geom_smooth(method="lm") overlays a fitted regression line (with a confidence band) for each Sex group. Where the male and female lines show noticeably different slopes, this suggests that Sex interaction terms might improve the model fit.
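If the slopes really do differ by Sex, one way to explore that is to let Sex interact with the other predictors and compare against the additive fit (a sketch; which interactions to include is my modeling choice, not something the assignment specifies; mymodel is the additive fit from above):
int.model <- lm(Bsal ~ Sex*(Age + Educ + Exper), data=case1202)   # separate slopes by Sex
summary(int.model)
anova(mymodel, int.model)   # F-test: does adding the Sex interactions improve the fit?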
The following data set is a study investigating the tooth growth of guinea pigs based on three different dosage levels of vitamin C supplement (pill) or orange juice. The key question is whether there are differences in average tooth growth between the two approaches to giving the rodents vitamin C and whether dosage has anything to do with it as well.
newTooth<-ToothGrowth
newTooth$dose<-factor(newTooth$dose)
head(newTooth)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
Since both of these variables are categorical, let's take a look at some boxplots.
ggplot(data=newTooth,aes(x=dose,y=len,colour=supp))+geom_boxplot()
Based on the boxplots, do you believe that an MLR model with dose and supp is all that is needed, or should an interaction term be included as well?
Answer: The boxplots suggest that tooth growth varies with both dosage and supplement type, and the effect of dose appears different for each supplement, indicating that an interaction term between dose and supp may be needed in the MLR model.
The following code chunk runs a regression model with an interaction term and provides the ANOVA F test for each component of the model. Assuming assumptions are met, does this output support your answer in the previous question?
mymodel<-lm(len~dose*supp,data=newTooth)
library(car)
## Loading required package: carData
Anova(mymodel,type=3)
Answer: Yes, the ANOVA output supports the inclusion of an interaction term, as the interaction between dose and supp is significant (p = 0.021860), indicating that the effect of dose on tooth length depends on the supplement type.
summary(mymodel)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.23 1.148353 11.5208468 3.602548e-16
## dose1 9.47 1.624017 5.8312215 3.175641e-07
## dose2 12.83 1.624017 7.9001660 1.429712e-10
## suppVC -5.25 1.624017 -3.2327258 2.092470e-03
## dose1:suppVC -0.68 2.296706 -0.2960762 7.683076e-01
## dose2:suppVC 5.33 2.296706 2.3207148 2.410826e-02
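To see what these coefficients represent, the fitted cell means can be reconstructed from them; OJ at dose 0.5 is the reference cell under R's default treatment coding (a short check, not part of the assignment):
aggregate(len ~ supp + dose, data=newTooth, FUN=mean)   # observed cell means for comparison
# Example: fitted mean for VC at dose 2 = (Intercept) + dose2 + suppVC + dose2:suppVC
13.23 + 12.83 + (-5.25) + 5.33   # = 26.14, which matches the observed VC, dose 2 cell mean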
#Contrast maker
lm.contrast<-function(lmobj,contrast){
if(is.matrix(contrast)==F) contrast<-matrix(contrast,nrow=1)
if(ncol(contrast)!=length(coef(lmobj))) return("Error: Contrast length does not match the number of coefficients in linear model object.")
estimates<-contrast %*% coef(lmobj)
se<- sqrt(diag(contrast %*% vcov(lmobj) %*% t(contrast)))
t.stat<-estimates/se
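# Note: pt(-abs(t.stat)) returns a one-sided p-value; the values reported below are half the two-sided p-values given by summary()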
p.val<-pt(-abs(t.stat),df=lmobj$df.residual)
ci.l<-estimates-qt(.975,df=lmobj$df.residual)*se
ci.u<-estimates+qt(.975,df=lmobj$df.residual)*se
results<-data.frame(Estimates=estimates,SE=se,t.stat,p.val,Lower=ci.l,Upper=ci.u)
return(results)
}
#Contrasts for the discussion:
contrast1<-c(0,0,0,1,0,0)
contrastHint<-c(0,0,0,1,1,0) #Hint: This is comparing VC vs Supp specifically at Dose 1
contrast2<-c(0,0,0,1,0,1)
contrast3<-c(0,-1,1,0,0,0)
contrast.matrix<-rbind(contrast1,contrast2,contrast3)
lm.contrast(mymodel,contrast.matrix)
## Estimates SE t.stat p.val Lower Upper
## contrast1 -5.25 1.624017 -3.23272575 0.001046235 -8.5059571 -1.994043
## contrast2 0.08 1.624017 0.04926058 0.480446692 -3.1759571 3.335957
## contrast3 3.36 1.624017 2.06894448 0.021676073 0.1040429 6.615957
Answer: Under R's default treatment coding, OJ at dose 0.5 is the reference cell, so the model coefficients represent: dose1, the difference in mean tooth growth between dose 1 and the baseline dose 0.5 within the OJ group; dose2, the difference between dose 2 and dose 0.5 within the OJ group; suppVC, the difference between the two supplement types (VC versus OJ) at the baseline dose 0.5; and dose1:suppVC and dose2:suppVC, how that VC-versus-OJ difference changes at dose levels 1 and 2, respectively. The null hypothesis for each test is that the corresponding difference is zero, and low p-values (typically below 0.05) provide evidence against it. The contrast tests combine these coefficients to compare specific conditions within the vitamin C tooth growth study:
contrast1 compares VC to OJ at the baseline dose of 0.5. The negative estimate (-5.25) and significant p-value suggest lower tooth growth with the VC supplement than with orange juice at this dose.
contrast2 compares VC to OJ at dose 2 (the suppVC coefficient plus the dose2:suppVC interaction). The estimate is very close to zero (0.08) and the p-value is not significant, indicating essentially no difference between the two supplements at the highest dose.
contrast3 compares dose 2 to dose 1 within the OJ group. The positive estimate (3.36) and significant p-value suggest that tooth growth is greater at dose 2 than at dose 1 for orange juice.
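These contrasts can be sanity-checked against the fitted cell means (a small sketch mirroring contrast1 through contrast3; since the interaction model is saturated, its fitted cell means equal the observed ones):
means <- aggregate(len ~ supp + dose, data=newTooth, FUN=mean)   # cell means by supplement and dose
with(means, len[supp=="VC" & dose=="0.5"] - len[supp=="OJ" & dose=="0.5"])   # contrast1: VC vs OJ at dose 0.5 = -5.25
with(means, len[supp=="VC" & dose=="2"] - len[supp=="OJ" & dose=="2"])       # contrast2: VC vs OJ at dose 2 = 0.08
with(means, len[supp=="OJ" & dose=="2"] - len[supp=="OJ" & dose=="1"])       # contrast3: dose 2 vs dose 1 within OJ = 3.36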