Homework 4

DACSS 603 - Tyler Paske

Tyler Paske
2022-04-14
#Question 1.

For the house.selling.price.2 data the tables below show a correlation matrix and a model fit using four predictors of selling price.

##A. For backward elimination, which variable would be deleted first? Why?

For backward elimination, Beds would be eliminated first because it has the largest p-value in the fitted model. As referenced in the slides, backward elimination works as follows: choose a significance level in advance, start with all variables in the model, delete the variable with the largest p-value at each stage, and stop when all remaining variables are significant. Once Beds is deleted, all of the remaining variables are significant.

library(smss)

data("house.selling.price.2")

summary(lm(P~ ., data = house.selling.price.2))
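The procedure described above can be sketched in a few lines of R. This is a minimal sketch, not the method used to produce the answer: `backward_eliminate` is a hypothetical helper name, the 0.05 cutoff is an assumed significance level, and the demonstration at the end only runs if the smss package is installed.

```r
# Hypothetical helper: backward elimination by p-value.
# Starts from a fitted lm and repeatedly drops the least significant term
# until every remaining term has a p-value below `alpha` (assumed 0.05).
backward_eliminate <- function(fit, alpha = 0.05) {
  repeat {
    # Stop if only the intercept is left
    if (length(attr(terms(fit), "term.labels")) == 0) break
    # F-test for removing each term; row 1 is "<none>", so skip it
    tests <- drop1(fit, test = "F")
    pvals <- setNames(tests[["Pr(>F)"]][-1], rownames(tests)[-1])
    if (max(pvals) < alpha) break
    worst <- names(pvals)[which.max(pvals)]
    message("Dropping ", worst, " (p = ", signif(max(pvals), 3), ")")
    fit <- update(fit, as.formula(paste(". ~ . -", worst)))
  }
  fit
}

# Demonstration on the homework data, only if smss is available
if (requireNamespace("smss", quietly = TRUE)) {
  data("house.selling.price.2", package = "smss")
  full <- lm(P ~ ., data = house.selling.price.2)
  print(formula(backward_eliminate(full)))
}
```

For terms with a single coefficient, the F-test from `drop1()` gives the same p-value as the t-test in `summary()`, so the helper drops variables in the same order one would by reading the summary table.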

##B. For forward selection, which variable would be added first? Why?

Forward selection is essentially the opposite of backward elimination: we begin with no explanatory variables, add the most significant variable at each step, and stop when no remaining variable makes a significant partial contribution. In this case, Size would be added first because it has the highest correlation with Price in the correlation matrix and therefore the smallest p-value (essentially 0) as a single predictor. New, which is an indicator variable rather than a continuous one, is less significant.
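The first step of forward selection can be checked directly by fitting one simple regression per candidate predictor and ranking the slope p-values. A minimal sketch follows; `first_forward_step` is a hypothetical helper, it assumes every other column of the data frame is a numeric candidate predictor, and the demonstration on the homework data runs only if smss is installed.

```r
# Hypothetical helper: p-value of each predictor when it is the only one
# in the model, sorted so the first entry is the variable forward
# selection would add first.
first_forward_step <- function(data, response) {
  predictors <- setdiff(names(data), response)
  pvals <- sapply(predictors, function(x) {
    fit <- lm(reformulate(x, response), data = data)
    summary(fit)$coefficients[2, 4]   # row 2 is the single slope
  })
  sort(pvals)
}

# Demonstration on the homework data, only if smss is available
if (requireNamespace("smss", quietly = TRUE)) {
  data("house.selling.price.2", package = "smss")
  print(first_forward_step(house.selling.price.2, "P"))
}
```

With a single predictor, the smallest p-value always belongs to the variable with the largest absolute correlation with the outcome, which is why the correlation matrix alone is enough to answer this part.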

##C. Why do you think that BEDS has such a large P-value in the multiple regression model, even though it has a substantial correlation with PRICE?

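Beds has such a large p-value because, once the other predictors are in the model, there is only weak evidence that it has a partial effect on Price. The substantial correlation with Price looks at Beds on its own, whereas the p-value in the multiple regression asks what Beds adds after Size, Baths, and New are accounted for. Because the number of bedrooms is itself related to Size and Baths, much of the information it carries is already captured by those variables. To me this shows that, although the number of bedrooms matters, other factors such as Size and Baths take on more importance in explaining the price of a home.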

##D. Using software with these four predictors, find the model that would be selected using each criterion:

###a

R-squared is the proportion of the variation in the outcome that the model explains. In other words, it measures how much of the behavior of the outcome can be explained by the predictors.

Because R-squared can never decrease when a predictor is added, this criterion always favors the model with all four predictors. For that reason I find that R-squared would NOT be a useful criterion for selecting a model here.

###b

The adjusted R-squared is a modified version of R-squared that adjusts for predictors that are not significant in a regression model. If adding input variables lowers the adjusted R-squared, the additional variables are not adding value to the model; if adding them raises the adjusted R-squared, they are adding value.

In my opinion this criterion would be SELECTED, because the change in adjusted R-squared indicates whether additional input variables are or are not adding value to the model.
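To make the comparison concrete, adjusted R-squared can be computed from R-squared, the sample size n, and the number of predictors p as 1 - (1 - R-squared)(n - 1)/(n - p - 1). The sketch below uses the built-in trees data rather than the housing data so that it runs without extra packages; `adj_r2` is a hypothetical helper and assumes a model with an intercept.

```r
# Hypothetical helper: adjusted R-squared computed by hand
adj_r2 <- function(fit) {
  r2 <- summary(fit)$r.squared
  n  <- nobs(fit)
  p  <- length(coef(fit)) - 1          # predictors, excluding the intercept
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

small <- lm(Volume ~ Girth, data = trees)
large <- lm(Volume ~ Girth + Height, data = trees)

# R-squared cannot go down when a predictor is added;
# adjusted R-squared goes up only if the predictor improves the fit enough.
c(r2_small = summary(small)$r.squared, r2_large = summary(large)$r.squared)
c(adj_small = adj_r2(small), adj_large = adj_r2(large))
```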

###c

The idea is that the residual sum of squares (RSS) describes how well a linear model fits the data to which it was fitted, while PRESS (the predicted residual sum of squares) tells you how well the model will predict new data; a lower PRESS is better. As we're not interested in predicting new data, this criterion would NOT be selected.
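For a linear model, PRESS does not require refitting the model once per observation: each leave-one-out residual equals the ordinary residual divided by one minus that observation's leverage. The sketch below shows this on the built-in trees data (chosen only so the example runs without extra packages); `press` is a hypothetical helper name.

```r
# Hypothetical helper: PRESS from the leave-one-out shortcut
press <- function(fit) {
  loo_resid <- residuals(fit) / (1 - hatvalues(fit))
  sum(loo_resid^2)
}

fit <- lm(Volume ~ Girth + Height, data = trees)

rss <- sum(residuals(fit)^2)   # fit to the data the model has seen
c(RSS = rss, PRESS = press(fit))
```

PRESS is never smaller than RSS, because every leave-one-out residual is at least as large in magnitude as the corresponding ordinary residual.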

###d

This criterion would again be important for predicting the relationship between variables or, in our case, predictors. Since we're not looking to predict but rather to compare and contrast, I lean toward this criterion again NOT being selected.

###e

Bayesian Information Criterion (BIC) is a model selection tool. If a model is estimated on a particular data set (training set), the BIC score gives an estimate of the model performance on a new, fresh data set (testing set). BIC is given by the formula:

                            BIC = -2 * loglikelihood + d * log(N),

where N is the sample size of the training set and d is the total number of parameters. The lower BIC score signals a better model.
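The formula can be checked against R's built-in `BIC()` function. The sketch below uses the built-in trees data so that it runs without extra packages; `bic_by_hand` is a hypothetical helper. Note that for an `lm` fit, the parameter count d reported by `logLik()` includes the error variance as well as the regression coefficients.

```r
# Hypothetical helper: BIC = -2 * loglikelihood + d * log(N)
bic_by_hand <- function(fit) {
  ll <- logLik(fit)
  d  <- attr(ll, "df")   # coefficients plus the error variance
  n  <- nobs(fit)
  -2 * as.numeric(ll) + d * log(n)
}

small <- lm(Volume ~ Girth, data = trees)
large <- lm(Volume ~ Girth + Height, data = trees)

# The by-hand value and BIC() should agree; the lower score is preferred
c(by_hand = bic_by_hand(small), builtin = BIC(small))
c(by_hand = bic_by_hand(large), builtin = BIC(large))
```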

##E. Explain which model you prefer and why.

As mentioned, I would prefer the adjusted R-squared criterion, because the change in adjusted R-squared indicates whether additional input variables are or are not adding value to the model.

#Question 2

“This data set provides measurements of the diameter, height and volume of timber in 31 felled black cherry trees. Note that the diameter (in inches) is erroneously labelled Girth in the data. It is measured at 4 ft 6 in above the ground.” Tree volume estimation is a big deal, especially in the lumber industry. Use the trees data to build a basic model of tree volume prediction. In particular,

##A fit a multiple regression model with the Volume as the outcome and Girth and Height as the explanatory variables

CODE:

Call:
lm(formula = log(Volume) ~ log(Girth), data = trees)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.205999 -0.068702  0.001011  0.072585  0.247963 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.35332    0.23066  -10.20 4.18e-11 ***
log(Girth)   2.19997    0.08983   24.49  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.115 on 29 degrees of freedom
Multiple R-squared:  0.9539,    Adjusted R-squared:  0.9523 
F-statistic: 599.7 on 1 and 29 DF,  p-value: < 2.2e-16

summary(fm1 <- lm(log(Volume) ~ log(Girth), data = trees))
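The output above comes from a model with log(Girth) as its only predictor. A minimal sketch of the two-predictor model that part A asks for is shown below; the object names `fm2` and `fm2_log` are illustrative, and the log-log variant is an assumption that simply extends the transformation already used in `fm1`.

```r
# Volume as the outcome, Girth and Height as explanatory variables
fm2 <- lm(Volume ~ Girth + Height, data = trees)
summary(fm2)

# Assumed variant: same predictors on the log scale, matching fm1's transform
fm2_log <- lm(log(Volume) ~ log(Girth) + log(Height), data = trees)
summary(fm2_log)
```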

##B Run regression diagnostic plots on the model. Based on the plots, do you think any of the regression assumptions is violated?

After reviewing the following plots, I conclude that no regression assumption is clearly violated. The plots show a strong, roughly linear relationship between the log of Volume and the log of Girth, which is what the model assumes. A violation would instead show up as a systematic pattern, such as curvature or a funnel shape, in the residuals.

CODE:

pairs(trees, panel = panel.smooth, main = "trees data")
plot(Volume ~ Girth, data = trees, log = "xy")
coplot(log(Volume) ~ log(Girth) | Height, data = trees, panel = panel.smooth)
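The plots above describe the raw data. The standard residual diagnostics for the fitted model itself could be drawn as sketched below, assuming the `fm1` model from part A; calling `plot()` on an `lm` object produces the residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage panels.

```r
fm1 <- lm(log(Volume) ~ log(Girth), data = trees)

op <- par(mfrow = c(2, 2))   # 2 x 2 grid for the four default panels
plot(fm1)
par(op)                      # restore the previous layout

# Numeric companions to the plots
shapiro.test(residuals(fm1))   # normality of the residuals
max(cooks.distance(fm1))       # largest influence of any single tree
```

A flat, patternless residuals vs. fitted panel supports linearity and constant variance, while points far from the line in the Q-Q panel would cast doubt on normality.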

#Question 3.

In the 2000 election for U.S. president, the counting of votes in Florida was controversial. In Palm Beach County in south Florida, for example, voters used a so-called butterfly ballot. Some believe that the layout of the ballot caused some voters to cast votes for Buchanan when their intended choice was Gore. The data has variables for the number of votes for each candidate: Gore, Bush, and Buchanan. Run a simple linear regression model where the Buchanan vote is the outcome and the Bush vote is the explanatory variable. Produce the regression diagnostic plots. Is Palm Beach County an outlier based on the diagnostic plots? Why or why not?

Simple linear regression model

CODE:

Call:
lm(formula = Buchanan ~ Bush, data = florida)

Residuals:
    Min      1Q  Median      3Q     Max 
-907.50  -46.10  -29.19   12.26 2610.19 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.529e+01  5.448e+01   0.831    0.409    
Bush        4.917e-03  7.644e-04   6.432 1.73e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 353.9 on 65 degrees of freedom
Multiple R-squared:  0.3889,    Adjusted R-squared:  0.3795 
F-statistic: 41.37 on 1 and 65 DF,  p-value: 1.727e-08

summary(lm(Buchanan ~ Bush, data = florida))

Regression Diagnostic Plots

CODE:

Based on the plots, in my opinion Palm Beach County is an outlier: the summary above reports a maximum residual of 2610.19, far larger than the residual standard error of 353.9. However, the plots show a similar relationship between Buchanan and Gore as they do between Buchanan and Bush. Because the two regressions look similar, I am led to believe that the layout of the butterfly ballot had little to do with voters casting votes for Buchanan when their intended choice was Gore.
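The code for the diagnostic plots is not shown in the source. One way it could look is sketched below. The helper `flag_outliers` and its cutoff of 3 standardized residuals are illustrative choices, and loading `florida` from the alr4 package is an assumption, since the source does not say where the data come from.

```r
# Hypothetical helper: observations whose standardized residual
# exceeds `cutoff` in absolute value
flag_outliers <- function(fit, cutoff = 3) {
  r <- rstandard(fit)
  r[abs(r) > cutoff]
}

# Assumed data source: the florida data set in the alr4 package
if (requireNamespace("alr4", quietly = TRUE)) {
  data("florida", package = "alr4")
  fit_fl <- lm(Buchanan ~ Bush, data = florida)

  op <- par(mfrow = c(2, 2))
  plot(fit_fl)                  # the four default diagnostic panels
  par(op)

  print(flag_outliers(fit_fl))  # counties with unusually large residuals
}
```

Repeating the same steps with `Buchanan ~ Gore` would allow the two sets of diagnostics to be compared side by side.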

#PART 2 (Final Project)

##1. What is your research question for the final project?

Based on studies conducted in 1986, what percentage of people were working remotely versus those who had to travel for work? How does that compare with the data from the same year on the number of hours people were working? Is there any correlation?

##2. What is your hypothesis (i.e. an answer to the research question) that you want to test?

My hypothesis concerns the difference between time spent commuting and time worked. I'd like to see whether people who don't travel worked very many hours.

##3. Present some exploratory analysis. In particular:

###a. Numerically summarize (e.g. with the summary() function) the variables of interest (the outcome, the explanatory variable, the control variables).

Could not knit my data file for summary

CHW <- DACSS_603_Tyler_Paske_Final_Project_Commuting_Hours_worked_
summary(CHW)

The outcome variable will be the number of commute participants, as that study had the greatest number of participants. The explanatory variable will be the length of time it took them to commute, and the control variable will be the number of hours worked, as it relates to those who commuted and those who didn't have to commute. From the summary we already start to gather evidence of what we can expect: most of the people who are actually working are those who work full time.

###b. Plot the relationships between key variables. You can do this any way you want, but one straightforward way of doing this would be with the pairs() function or other scatter plots / box plots. Interpret what you see.

Could not knit my data file for summary

plot(x = CHW$`Commute hours`, y = CHW$`Commute participants`, xlab = "Time in hours", ylab = "Participants")

Among the key variables (Commute hours and Commute participants), we see that most of our participants had to travel roughly 10 to 45 minutes for work within this year.