What is the fundamental difference between parametric and
nonparametric models?
Answer: The fundamental difference between parametric and nonparametric models lies in whether the form of the model is specified in advance and in the assumptions each makes about the underlying data:
Parametric Models: Require specifying the form of the function f(X) in advance and make assumptions about the error terms. They are interpretable, useful for prediction, and allow traditional hypothesis testing. However, they can be less effective for complex or large datasets and require diagnostic checks of their assumptions.
Nonparametric Models: Do not require specifying f(X) in advance and are more data-driven, making them suitable for complex relationships. They offer flexibility, but at the cost of interpretability and the ability to conduct traditional hypothesis testing.
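As a concrete illustration, the sketch below contrasts the two approaches on the Advertising data used later in this discussion (a minimal sketch; it assumes Advertising.csv is in the working directory):
# Parametric: we specify f(X) as linear in TV and radio and estimate a few interpretable coefficients
library(FNN)
Adver <- read.csv("Advertising.csv", header=TRUE)
param.fit <- lm(sales ~ TV + radio, data=Adver)
coef(param.fit)            # the entire fit is summarized by three parameters
# Nonparametric: no functional form is assumed; each prediction is the average of the 5 nearest neighbors
nonparam.fit <- knn.reg(train=Adver[, c("TV","radio")], test=Adver[, c("TV","radio")], y=Adver$sales, k=5)
head(nonparam.fit$pred)    # fitted values only; there are no parameters to interpret or test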
Did any of the embedded questions throughout the synchronous material give you problems? If so, which ones? Try to articulate what is troubling you. - No.
For this discussion, let us read in the Advertising data set discussed in the unit 1 videos. The packages here are for the KNN fitting as well as the graphical presentations.
Code explanation:
Libraries: rgl for 3D plotting and FNN for KNN methods.
Reading the data: read.csv("Advertising.csv", header=T) reads the CSV file into a data frame called Adver; header=T indicates that the first row of the CSV contains the column names.
Data preview: head(Adver) displays the first few rows of the dataset for a quick preview.
Multiple Linear Regression (MLR) and 3D plotting:
MLR model: lm(sales~TV+radio, data=Adver) fits a linear regression model with sales as the response variable and TV and radio as predictor variables.
3D scatter plot: plot3d creates a 3D scatter plot of TV, radio, and sales.
Regression surface: surface3d adds a surface to the 3D plot representing the MLR fit. The call outer(0:300, 0:50, function(x, y) {2.92 + .04575*x + .18799*y}) computes predicted values of sales over a grid of TV and radio values, creating a surface that represents the MLR model. The coefficients used (2.92, 0.04575, 0.18799) should ideally come from the fitted MLR model rather than being hardcoded; a sketch that builds the surface from coef() appears after the MLR code chunk below.
K-Nearest Neighbors (KNN) Regression Model:
predictors <- data.frame(TV=rep(0:300, 51), radio=rep(0:50, each=301)) # Creates a grid of predictor values
knn5 <- knn.reg(Adver[, 2:3], test=predictors, y=Adver$sales, k=5) # KNN regression with k=5
pred.surface <- matrix(knn5$pred, 301, 51) # Reshapes the predictions for plotting
plot3d(Adver$TV, Adver$radio, Adver$sales, xlab="TV", ylab="Radio", zlab="Sales") # 3D scatter plot
surface3d(0:300, 0:50, pred.surface, alpha=.4) # Adds the KNN surface to the plot
Grid of predictors: data.frame(TV=rep(0:300, 51), radio=rep(0:50, each=301)) creates a grid of TV and radio values used to make predictions across their ranges.
KNN regression: knn.reg performs KNN regression; k=5 specifies that the algorithm should use the 5 nearest neighbors when making each prediction.
Plotting the KNN predictions: the predictions are reshaped into a matrix corresponding to the grid of TV and radio values, and a 3D plot is generated. The surface3d function adds a surface representing the KNN model's predictions.
Creating a Grid of Predictor Values: predictors <- data.frame(TV=rep(0:300, 51), radio=rep(0:50, each=301))
Fitting the KNN Regression Model: knn5 <- knn.reg(Adver[, 2:3], test=predictors, y=Adver$sales, k=5)
Reshaping Predictions for Plotting: pred.surface <- matrix(knn5$pred, 301, 51)
Creating the 3D Scatter Plot: plot3d(Adver$TV, Adver$radio, Adver$sales, xlab="TV", ylab="Radio", zlab="Sales")
Adding KNN Surface to plot: surface3d(0:300, 0:50, pred.surface, alpha=.4)
Adver <- read.csv("Advertising.csv", header=T) # Reads the CSV file into a data frame
head(Adver) # Displays the first few rows of the data frame; the output matches the head(Adver) listing shown below
simple.fit <- lm(sales~TV+radio, data=Adver) # Fits an MLR model
plot3d(Adver$TV, Adver$radio, Adver$sales, xlab="TV", ylab="Radio", zlab="Sales") # 3D scatter plot
surface3d(0:300, 0:50, outer(0:300, 0:50, function(x, y) {2.92 + .04575*x + .18799*y}), alpha=.4) # Adds the MLR surface
library(rgl) ##For mac users you may need to download Xquartz before the 3d plots will run.
## Warning: package 'rgl' was built under R version 4.3.2
knitr::knit_hooks$set(webgl=hook_webgl)
library(FNN)
## Warning: package 'FNN' was built under R version 4.3.2
Adver<-read.csv("Advertising.csv",header=T)
head(Adver)
## X TV radio newspaper sales
## 1 1 230.1 37.8 69.2 22.1
## 2 2 44.5 39.3 45.1 10.4
## 3 3 17.2 45.9 69.3 9.3
## 4 4 151.5 41.3 58.5 18.5
## 5 5 180.8 10.8 58.4 12.9
## 6 6 8.7 48.9 75.0 7.2
The following code chunk presents a simple, additive MLR fit by way of a 3D scatter plot similar to the unit 1 videos. Note the graphs may show up only when you knit the document.
simple.fit<-lm(sales~TV+radio,data=Adver)
plot3d(Adver$TV,Adver$radio,Adver$sales,xlab="TV",ylab="Radio",zlab="Sales")
surface3d(0:300,0:50,outer(0:300,0:50,function(x,y) {2.92+.04575*x+.18799*y}),alpha=.4)
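As noted in the code explanation above, the surface coefficients (2.92, 0.04575, 0.18799) are hardcoded. A minimal sketch that pulls them from the fitted model instead (assuming simple.fit from the chunk above):
b <- coef(simple.fit)   # (Intercept), TV, and radio estimates from the fitted MLR model
surface3d(0:300, 0:50, outer(0:300, 0:50, function(x, y) {b[1] + b[2]*x + b[3]*y}), alpha=.4)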
Next we have a KNN model with k=5.
predictors<-data.frame(TV=rep(0:300,51),radio=rep(0:50,each=301)) #Grid of points instead of using outer function previously
knn5<-knn.reg(Adver[,2:3],test=predictors,y=Adver$sales,k=5)
pred.surface<-matrix(knn5$pred,301,51)
plot3d(Adver$TV,Adver$radio,Adver$sales,xlab="TV",ylab="Radio",zlab="Sales")
surface3d(0:300,0:50,pred.surface,alpha=.4)
Compare the MLR fit to additional KNN models changing k=5 to k=2, 10, 30. Summarize your findings.
Answer: Based on the Mean Squared Error (MSE) values I've computed for the KNN models with different values of k:
It appears that the KNN model with k=2 has the lowest MSE, indicating the best performance of the three in terms of prediction accuracy on this data. As k increases, the MSE also increases, suggesting a decrease in accuracy. Note, however, that these MSEs are computed on the same observations used to fit each model, so a lower k (which allows the model to capture more complex, and possibly noisier, patterns) will naturally look better by this measure.
Comparing the MLR and KNN models for different k values:
In summary, KNN with a lower k value (especially k=2) performs better on this dataset than MLR and than KNN with higher k values, but careful consideration should be given to the potential for overfitting.
library(rgl)
library(FNN)
Adver <- read.csv("Advertising.csv", header=TRUE)
head(Adver)
simple.fit <- lm(sales ~ TV + radio, data=Adver)
plot3d(Adver$TV, Adver$radio, Adver$sales, xlab="TV", ylab="Radio", zlab="Sales")
surface3d(0:300, 0:50, outer(0:300, 0:50, function(x, y) {2.92 + .04575*x + .18799*y}), alpha=0.4)
knn5 <- knn.reg(Adver[, 2:3], test=Adver[, 2:3], y=Adver$sales, k=5)
mse_knn5 <- mean((Adver$sales - knn5$pred)^2)
knn2 <- knn.reg(Adver[, 2:3], test=Adver[, 2:3], y=Adver$sales, k=2)
mse_knn2 <- mean((Adver$sales - knn2$pred)^2)
knn10 <- knn.reg(Adver[, 2:3], test=Adver[, 2:3], y=Adver$sales, k=10)
mse_knn10 <- mean((Adver$sales - knn10$pred)^2)
knn30 <- knn.reg(Adver[, 2:3], test=Adver[, 2:3], y=Adver$sales, k=30)
mse_knn30 <- mean((Adver$sales - knn30$pred)^2)
cat("MSE for KNN with k=2:", mse_knn2, "\n")
## MSE for KNN with k=2: 0.2649625
cat("MSE for KNN with k=10:", mse_knn10, "\n")
## MSE for KNN with k=10: 1.061833
cat("MSE for KNN with k=30:", mse_knn30, "\n")
## MSE for KNN with k=30: 4.628589
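Note that the MSEs above are computed on the same observations used to fit each KNN model, so the smallest k is expected to look best. A minimal sketch of a holdout comparison that also includes the MLR fit (the 75/25 split and the seed are my own choices, not part of the assignment):
set.seed(1234)   # arbitrary seed for reproducibility
train.idx <- sample(1:nrow(Adver), size=floor(0.75*nrow(Adver)))
train <- Adver[train.idx, ]
test <- Adver[-train.idx, ]
for (k in c(2, 5, 10, 30)) {   # test-set MSE for several k values
  knn.fit <- knn.reg(train[, 2:3], test=test[, 2:3], y=train$sales, k=k)
  cat("Test MSE, KNN k =", k, ":", mean((test$sales - knn.fit$pred)^2), "\n")
}
mlr.fit <- lm(sales ~ TV + radio, data=train)
cat("Test MSE, MLR:", mean((test$sales - predict(mlr.fit, test))^2), "\n")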
The following data set was obtained during a sex discrimination case involving a bank in the 1960s and 1970s. There are 93 employees who were hired during this time. I've provided a few graphs to explore the data set as well. For this exercise, the response is their base salary upon being hired and we will ignore what their salary was as of March 1975.
library(Sleuth3)
## Warning: package 'Sleuth3' was built under R version 4.3.2
library(ggplot2)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.3.2
head(case1202)
## Bsal Sal77 Sex Senior Age Educ Exper
## 1 5040 12420 Male 96 329 15 14.0
## 2 6300 12060 Male 82 357 15 72.0
## 3 6000 15120 Male 67 315 15 35.5
## 4 6000 16320 Male 97 354 12 24.0
## 5 6000 12300 Male 66 351 12 56.0
## 6 6840 10380 Male 92 374 15 41.5
p1<-ggplot(data=case1202,aes(x=Age,y=Bsal,colour=Sex))+geom_point()
p2<-ggplot(data=case1202,aes(x=Educ,y=Bsal,colour=Sex))+geom_point()
p3<-ggplot(data=case1202,aes(x=Exper,y=Bsal,colour=Sex))+geom_point()
#grid.arrange(p1,p2,p3,nrow=1)
p1
p2
p3
Code:
library(GGally)
library(ggplot2)
library(gridExtra)
FEV <- read.csv("FEV.csv")
View(FEV)
FEV <- FEV[,-1]
names(FEV)
#Quick Visualization
ggpairs(FEV[,c(1,3,4,5,2)], mapping=ggplot2::aes(colour=Smoker), lower=list(continuous=wrap("points", alpha=0.5, size=0.5)))
boxplot(FEV~Smoker, data=FEV)
#Simple Pooled t-test
ttest <- lm(FEV~Smoker, data=FEV)
summary(ttest)
#If you don't believe me
Smokers <- FEV$FEV[which(FEV$Smoker=="Current")]
NonSmokers <- FEV$FEV[which(FEV$Smoker=="Non")]
t.test(NonSmokers, Smokers, var.equal=T)
#Acknowledging we have some confounding
p1 <- ggplot(data=FEV, aes(x=Age, y=FEV, colour=Smoker)) + geom_point()
p2 <- ggplot(data=FEV, aes(x=Sex, y=FEV)) + geom_boxplot()
grid.arrange(p1, p2, nrow=1)
plot(table(FEV$Sex, FEV$Smoker))
#If we want to adjust for Age and Sex
mymodel <- lm(FEV~Smoker+Age+Sex, data=FEV)
summary(mymodel)
confint(mymodel)
#Fit and analyze regression model
library(Sleuth3) # Make sure the dataset 'case1202' is in this package
data(case1202) # Load the dataset
mymodel <- lm(Bsal ~ Sex + Age + Educ + Exper, data=case1202)
par(mfrow=c(2,2))
plot(mymodel)
par(mfrow=c(1,1))
#Regression coefficient and interpretation:
summary_model <- summary(mymodel)
summary_model$coefficients
p1 + geom_smooth(method="lm")
p2 + geom_smooth(method="lm")
Using the provided graphs, do any of the predictor variables look like they are confounded with Sex? That is, is there any strong association between Sex and the other predictor variables?
Answer: Based on the provided graphs, the boxplot suggests a difference in FEV between males and females, indicating that Sex may be associated with FEV. Without examining the scatter plots of Age, Education, and Experience by Sex more closely, it is hard to conclude confounding definitively.
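A more direct check of confounding is to compare the distribution of each predictor across Sex in the case1202 data (a small sketch; the particular summaries are my choice, not part of the assignment):
library(Sleuth3)
library(ggplot2)
aggregate(cbind(Age, Educ, Exper) ~ Sex, data=case1202, FUN=mean)   # average of each predictor by Sex
ggplot(case1202, aes(x=Sex, y=Exper)) + geom_boxplot()   # distribution of Experience by Sex
ggplot(case1202, aes(x=Sex, y=Age)) + geom_boxplot()     # distribution of Age by Sex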
Use the lm function to fit a regression model to baseline salary with Sex, Education, Experience, and Age in the model. Store this regression model and use the plot function to examine the residuals. Comment on the validity of this model in terms of the assumptions of MLR. The code should look similar to the following; you can modify as necessary.
mymodel<-lm(Bsal~Sex+Age,data=case1202)
par(mfrow=c(2,2))
plot(mymodel)
par(mfrow=c(1,1))
Answer:
# Load necessary package
library(Sleuth3)
data(case1202)
mymodel <- lm(Bsal ~ Sex + Educ + Exper + Age, data=case1202)
par(mfrow=c(2,2))
plot(mymodel)
par(mfrow=c(1,1))
The residual plots from the linear model show:
Residuals vs Fitted: no clear pattern, indicating a reasonable model fit.
Normal Q-Q: points largely follow the line, suggesting the residuals are approximately normally distributed.
Scale-Location: the spread appears roughly constant, indicating homoscedasticity.
Residuals vs Leverage: no points with both high leverage and large residuals, suggesting no unduly influential points.
Overall, the model seems to meet the assumptions of linear regression well.
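To supplement the visual checks, a couple of formal diagnostics could also be run on the stored model (a sketch; these particular tests are my choice, not required by the assignment):
library(car)                       # provides ncvTest()
shapiro.test(residuals(mymodel))   # test of normality of the residuals
ncvTest(mymodel)                   # test of constant error variance (homoscedasticity)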
Use the summary function to obtain the regression coefficients of your model. Interpret the regression coefficient corresponding to the sex variable. What is the result of the test suggesting in regards to the ultimate goal of the study, which is to determine whether there is evidence of sexual discrimination?
Answer:
# Load the necessary library
library(Sleuth3)
data(case1202)
mymodel <- lm(Bsal ~ Sex + Age + Educ + Exper, data=case1202)
summary_mymodel <- summary(mymodel)
print(summary_mymodel$coefficients)
##                 Estimate   Std. Error     t value     Pr(>|t|)
## (Intercept) 3.439390e+03  475.3684798  7.23520939 1.640493e-10
## SexMale     7.654309e+02  140.9994331  5.42860945 4.963952e-07
## Age         1.191235e+00    0.7746744  1.53772375 1.277042e-01
## Educ        9.214428e+01   27.1828848  3.38979015 1.048768e-03
## Exper       1.384471e-03    1.1465790  0.00120748 9.990393e-01
The coefficient for SexMale is approximately 765.43 with a highly significant p-value (approximately 4.96 x 10^-7), indicating that, holding Age, Education, and Experience constant, being male is associated with roughly a $765 higher baseline salary than being female, suggesting evidence of sex discrimination in this dataset.
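A confidence interval makes the size of this adjusted difference more concrete (a minimal sketch using the mymodel fit above):
confint(mymodel)["SexMale", ]   # 95% CI for the male-female difference in Bsal, adjusted for Age, Educ, and Exper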
Now add +geom_smooth at the end of each of the earlier plots. What does this do graphically? Does it suggest we should do anything to perhaps better fit our model?
Answer:
library(ggplot2)
p1 <- ggplot(data=case1202, aes(x=Age, y=Bsal, colour=Sex)) + geom_point() + geom_smooth(method="lm")
p2 <- ggplot(data=case1202, aes(x=Educ, y=Bsal, colour=Sex)) + geom_point() + geom_smooth(method="lm")
p3 <- ggplot(data=case1202, aes(x=Exper, y=Bsal, colour=Sex)) + geom_point() + geom_smooth(method="lm")
p1: illustrates the relationship between Age and baseline salary (Bsal) while distinguishing male and female employees; the fitted lines suggest an increasing relationship between Age and salary, with differences in slope and intercept by Sex.
p2: displays the relationship between Education (Educ) and Bsal, differentiated by Sex; the separate trend lines suggest that education may be related to salary differences.
p3: shows the relationship between Experience (Exper) and Bsal by Sex; the trend lines suggest there may be an experience-related pattern in salary differences by sex.
Graphically, geom_smooth(method="lm") overlays a fitted regression line (with a confidence band) for each Sex group. Where the male and female lines show noticeably different slopes, this suggests that Sex interaction terms might improve the model fit.
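If the slopes really do differ by Sex, one way to explore that is to let Sex interact with the other predictors and compare against the additive fit (a sketch; which interactions to include is my modeling choice, not something the assignment specifies; mymodel is the additive fit from above):
int.model <- lm(Bsal ~ Sex*(Age + Educ + Exper), data=case1202)   # separate slopes by Sex
summary(int.model)
anova(mymodel, int.model)   # F-test: does adding the Sex interactions improve the fit?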
The following data set is a study investigating the tooth growth of guinea pigs based on three different dosage levels of vitamin C supplement (pill) or orange juice. The key question is whether there are differences in average tooth growth between the two approaches to giving the rodents vitamin C and whether dosage has anything to do with it as well.
newTooth<-ToothGrowth
newTooth$dose<-factor(newTooth$dose)
head(newTooth)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
Since both of these variables are categorical, let's take a look at some boxplots.
ggplot(data=newTooth,aes(x=dose,y=len,colour=supp))+geom_boxplot()
Based on the boxplots, do you believe that an MLR model with dose and supp is all that is needed, or should an interaction term be included as well?
Answer: The boxplots suggest that tooth growth varies with both dosage and supplement type, and the effect of dose appears different for each supplement, indicating that an interaction term between dose and supp may be needed in the MLR model.
The following code chunk runs a regression model with an interaction term and provides the ANOVA F test for each component of the model. Assuming assumptions are met, does this output support your answer in the previous question?
mymodel<-lm(len~dose*supp,data=newTooth)
library(car)
## Loading required package: carData
Anova(mymodel,type=3)
Answer: Yes, the ANOVA output supports the inclusion of an interaction term, as the interaction between dose and supp is significant (p = 0.021860), indicating that the effect of dose on tooth length depends on the supplement type.
summary(mymodel)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.23 1.148353 11.5208468 3.602548e-16
## dose1 9.47 1.624017 5.8312215 3.175641e-07
## dose2 12.83 1.624017 7.9001660 1.429712e-10
## suppVC -5.25 1.624017 -3.2327258 2.092470e-03
## dose1:suppVC -0.68 2.296706 -0.2960762 7.683076e-01
## dose2:suppVC 5.33 2.296706 2.3207148 2.410826e-02
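To see what these coefficients represent, the fitted cell means can be reconstructed from them; OJ at dose 0.5 is the reference cell under R's default treatment coding (a short check, not part of the assignment):
aggregate(len ~ supp + dose, data=newTooth, FUN=mean)   # observed cell means for comparison
# Example: fitted mean for VC at dose 2 = (Intercept) + dose2 + suppVC + dose2:suppVC
13.23 + 12.83 + (-5.25) + 5.33   # = 26.14, which matches the observed VC, dose 2 cell mean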
#Contrast maker
lm.contrast<-function(lmobj,contrast){
if(is.matrix(contrast)==F) contrast<-matrix(contrast,nrow=1)
if(ncol(contrast)!=length(coef(lmobj))) return("Error: Contrast length does not match the number of coefficients in linear model object.")
estimates<-contrast %*% coef(lmobj)
se<- sqrt(diag(contrast %*% vcov(lmobj) %*% t(contrast)))
t.stat<-estimates/se
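# Note: pt(-abs(t.stat)) returns a one-sided p-value; the values reported below are half the two-sided p-values given by summary()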
p.val<-pt(-abs(t.stat),df=lmobj$df.residual)
ci.l<-estimates-qt(.975,df=lmobj$df.residual)*se
ci.u<-estimates+qt(.975,df=lmobj$df.residual)*se
results<-data.frame(Estimates=estimates,SE=se,t.stat,p.val,Lower=ci.l,Upper=ci.u)
return(results)
}
#Contrasts for the discussion:
contrast1<-c(0,0,0,1,0,0)
contrastHint<-c(0,0,0,1,1,0) #Hint: This is comparing VC vs Supp specifically at Dose 1
contrast2<-c(0,0,0,1,0,1)
contrast3<-c(0,-1,1,0,0,0)
contrast.matrix<-rbind(contrast1,contrast2,contrast3)
lm.contrast(mymodel,contrast.matrix)
## Estimates SE t.stat p.val Lower Upper
## contrast1 -5.25 1.624017 -3.23272575 0.001046235 -8.5059571 -1.994043
## contrast2 0.08 1.624017 0.04926058 0.480446692 -3.1759571 3.335957
## contrast3 3.36 1.624017 2.06894448 0.021676073 0.1040429 6.615957
Answer: Under R's default treatment coding, OJ at dose 0.5 is the reference cell, so the model coefficients represent: dose1, the difference in mean tooth growth between dose 1 and the baseline dose 0.5 within the OJ group; dose2, the difference between dose 2 and dose 0.5 within the OJ group; suppVC, the difference between the two supplement types (VC versus OJ) at the baseline dose 0.5; and dose1:suppVC and dose2:suppVC, how that VC-versus-OJ difference changes at dose levels 1 and 2, respectively. The null hypothesis for each test is that the corresponding difference is zero, and low p-values (typically below 0.05) provide evidence against it. The contrast tests combine these coefficients to compare specific conditions within the vitamin C tooth growth study:
contrast1 compares VC to OJ at the baseline dose of 0.5. The negative estimate (-5.25) and significant p-value suggest lower tooth growth with the VC supplement than with orange juice at this dose.
contrast2 compares VC to OJ at dose 2 (the suppVC coefficient plus the dose2:suppVC interaction). The estimate is very close to zero (0.08) and the p-value is not significant, indicating essentially no difference between the two supplements at the highest dose.
contrast3 compares dose 2 to dose 1 within the OJ group. The positive estimate (3.36) and significant p-value suggest that tooth growth is greater at dose 2 than at dose 1 for orange juice.
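These contrasts can be sanity-checked against the fitted cell means (a small sketch mirroring contrast1 through contrast3; since the interaction model is saturated, its fitted cell means equal the observed ones):
means <- aggregate(len ~ supp + dose, data=newTooth, FUN=mean)   # cell means by supplement and dose
with(means, len[supp=="VC" & dose=="0.5"] - len[supp=="OJ" & dose=="0.5"])   # contrast1: VC vs OJ at dose 0.5 = -5.25
with(means, len[supp=="VC" & dose=="2"] - len[supp=="OJ" & dose=="2"])       # contrast2: VC vs OJ at dose 2 = 0.08
with(means, len[supp=="OJ" & dose=="2"] - len[supp=="OJ" & dose=="1"])       # contrast3: dose 2 vs dose 1 within OJ = 3.36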