install.packages("rmarkdown")
library(rmarkdown)
Part A. Variables
In the field of psychology, much research is done using self-report surveys with Likert scales (look it up!).
A1 What type of variable is a Likert response? (1 pt)
A Likert response is an ordinal variable: the answer choices have a natural order, but the gaps between them are not necessarily equal.
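As a small illustration of what "ordinal" means in R, the sketch below stores some made-up Likert responses as an ordered factor. The response values and level labels are hypothetical and are not part of the assignment data.

```r
# Hypothetical 5-point Likert responses (made-up data for illustration)
responses <- c("Agree", "Neutral", "Strongly agree", "Disagree", "Agree")
likert_levels <- c("Strongly disagree", "Disagree", "Neutral",
                   "Agree", "Strongly agree")

# ordered = TRUE records the ranking of the levels without implying equal spacing
likert <- factor(responses, levels = likert_levels, ordered = TRUE)

is.ordered(likert)           # the order is part of the variable's type
likert < "Agree"             # comparisons respect the level order
table(likert)                # counts per category, in rank order
median(as.integer(likert))   # rank-based summary; a mean would assume equal gaps
```

Rank-based summaries such as the median only use the ordering, which is why they are often preferred over the mean for data of this kind.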
A2 What are some (at least 2) benefits of using Likert scales? (2 pts)
One benefit is that Likert scales are easy for survey takers to understand, which allows data to be gathered quickly. Another benefit is that they let us quantify subjective opinions so that the opinions can be analyzed with statistical software.
A3 What are some drawbacks, and dangers, of using them? Make sure you mention at least one ‘drawback’ and one ‘danger’ (a ‘drawback’ is a shortcoming, while a ‘danger’ implies potential harm). (2 pts)
A significant drawback of Likert scales is central tendency bias: respondents are less inclined to choose an extreme response. Also, even though the scales let us quantify these opinions, the results still lack complexity.
Part B. Simple Linear Regression
Perform linear regressions on a dataset from a European Toyota car dealer on the sales records of used cars (Toyota Corolla). We would like to construct a reasonable linear regression model for the relationship between the sales prices of used cars and various explanatory variables (such as age, mileage, horsepower). We are interested to see what factors affect the sales price of a used car and by how much.
Data Description
Id - ID number of each used car
Model - Model name of each used car
Price - The price (in Euros) at which each used car was sold
Age - Age (in months) of each used car as of August 2004
KM - Accumulated kilometers on odometer
HP - Horsepower
Metallic - Metallic color? (Yes = 1, No = 0)
Automatic - Automatic transmission? (Yes = 1, No = 0)
CC - Cylinder volume (in cubic centimeters)
Doors - Number of doors
Gears - Number of gears
Weight - Weight (in kilograms)
The data is in the file “UsedCars.csv”. To read the data in R, save the file in your working directory (make sure you have changed the directory if different from the R working directory) and read the data using the R function read.csv().
Read data and show few rows of data
# Read in the data
data = read.csv("UsedCars.csv",sep = ",",header = TRUE)
# Show the first few rows of data
head(data, 3)
## Id Model Price Age KM HP
## 1 1 TOYOTA Corolla 1800 T SPORT VVT I 2/3-Doors 21500 27 19700 192
## 2 2 TOYOTA Corolla 1.8 VVTL-i T-Sport 3-Drs 2/3-Doors 20950 25 31461 192
## 3 3 TOYOTA Corolla 1.8 16V VVTLI 3DR T SPORT BNS 2/3-Doors 19950 22 43610 192
## Metallic Automatic CC Doors Gears Weight
## 1 0 0 1800 3 5 1185
## 2 0 0 1800 3 6 1185
## 3 0 0 1800 3 6 1185
Question B1: Exploratory Data Analysis
setwd("~/Regression Analysis/Homework")
getwd()
data = read.csv("UsedCars.csv", header = TRUE, sep = ",")
head(data)
## Id Model Price Age KM HP
## 1 1 TOYOTA Corolla 1800 T SPORT VVT I 2/3-Doors 21500 27 19700 192
## 2 2 TOYOTA Corolla 1.8 VVTL-i T-Sport 3-Drs 2/3-Doors 20950 25 31461 192
## 3 3 TOYOTA Corolla 1.8 16V VVTLI 3DR T SPORT BNS 2/3-Doors 19950 22 43610 192
## 4 4 TOYOTA Corolla 1.8 16V VVTLI 3DR T SPORT 2/3-Doors 19600 25 32189 192
## 5 5 TOYOTA Corolla 1.8 16V VVTLI 3DR T SPORT 2/3-Doors 21500 31 23000 192
## 6 6 TOYOTA Corolla 1.8 16V VVTLI 3DR T SPORT 2/3-Doors 22500 32 34131 192
## Metallic Automatic CC Doors Gears Weight
## 1 0 0 1800 3 5 1185
## 2 0 0 1800 3 6 1185
## 3 0 0 1800 3 6 1185
## 4 0 0 1800 3 6 1185
## 5 1 0 1800 3 6 1185
## 6 1 0 1800 3 6 1185
str(data)
## 'data.frame': 1264 obs. of 12 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Model : chr "TOYOTA Corolla 1800 T SPORT VVT I 2/3-Doors" "TOYOTA Corolla 1.8 VVTL-i T-Sport 3-Drs 2/3-Doors" "TOYOTA Corolla 1.8 16V VVTLI 3DR T SPORT BNS 2/3-Doors" "TOYOTA Corolla 1.8 16V VVTLI 3DR T SPORT 2/3-Doors" ...
## $ Price : int 21500 20950 19950 19600 21500 22500 22000 22750 17950 16750 ...
## $ Age : int 27 25 22 25 31 32 28 30 24 24 ...
## $ KM : int 19700 31461 43610 32189 23000 34131 18739 34000 21716 25563 ...
## $ HP : int 192 192 192 192 192 192 192 192 110 110 ...
## $ Metallic : int 0 0 0 0 1 1 0 1 1 0 ...
## $ Automatic: int 0 0 0 0 0 0 0 0 0 0 ...
## $ CC : int 1800 1800 1800 1800 1800 1800 1800 1800 1600 1600 ...
## $ Doors : int 3 3 3 3 3 3 3 3 3 3 ...
## $ Gears : int 5 6 6 6 6 6 6 5 5 5 ...
## $ Weight : int 1185 1185 1185 1185 1185 1185 1185 1185 1105 1065 ...
plot(data$KM, data$Price, main = "Scatterplot of Price vs Accumulated KM",
ylab = "Price", xlab = "Accumulated KM")
There appears to be a negative (inverse) relationship between Price and accumulated KM: on average, as accumulated KM increases, the price decreases.
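# Note: this fits KM on Price; in simple linear regression the R-squared is the same as for Price ~ KM
model_lr = lm(KM ~ Price, data = data)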
summary(model_lr)
##
## Call:
## lm(formula = KM ~ Price, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -83459 -16217 -2149 13685 105476
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.224e+05 2.244e+03 54.56 <2e-16 ***
## Price -5.609e+00 2.007e-01 -27.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23720 on 1262 degrees of freedom
## Multiple R-squared: 0.3824, Adjusted R-squared: 0.3819
## F-statistic: 781.4 on 1 and 1262 DF, p-value: < 2.2e-16
The R-squared is 0.3824, so the correlation coefficient is r = -sqrt(0.3824) ≈ -0.62 (negative because the slope is negative). This indicates a moderate negative linear relationship between KM and the response variable Price.
Yes. Even though the correlation is only moderate, the regression is statistically significant: the F-statistic is large (781.4) and the p-values are very small (< 2.2e-16).
Yes. Because of the fairly low R-squared value and the potential outliers in the residuals, a transformation of the data could possibly improve the fit of the model.
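The relationship between R-squared and the correlation coefficient in simple linear regression can be checked with a short sketch. The data below are simulated stand-ins (made-up numbers, not UsedCars.csv), and the names km, price, and fit are illustrative only.

```r
set.seed(1)
# Simulated stand-in for the used-car data (made-up numbers, not UsedCars.csv)
km    <- runif(200, 0, 200000)
price <- 15000 - 0.07 * km + rnorm(200, sd = 2500)

r   <- cor(km, price)              # Pearson correlation coefficient
fit <- lm(price ~ km)
r2  <- summary(fit)$r.squared

c(r = r, r_squared = r^2, R2 = r2)   # r^2 equals R-squared
sign(r) == sign(coef(fit)[["km"]])   # r carries the sign of the slope

# Applied to the reported R-squared of 0.3824 with a negative slope:
-sqrt(0.3824)
```

On the real data, cor(data$KM, data$Price) would give the correlation coefficient directly without fitting a model.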
Question B2: Fitting the Simple Linear Regression Model
Fit a linear regression model, named model_1, to evaluate the relationship between UsedCars Price and the accumulated KM. Do not transform the data. The function you should use in R is:
# Create the model
model_1 = lm(Price ~ KM, data)
summary(model_1)
##
## Call:
## lm(formula = Price ~ KM, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7993 -1774 -457 1394 11437
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.494e+04 1.693e+02 88.24 <2e-16 ***
## KM -6.817e-02 2.439e-03 -27.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2615 on 1262 degrees of freedom
## Multiple R-squared: 0.3824, Adjusted R-squared: 0.3819
## F-statistic: 781.4 on 1 and 1262 DF, p-value: < 2.2e-16
Model parameters:
β0 = 1.494e4
β1 = -6.817e-2
SE(β1) = 2.439e-3
σ = 2615 (residual standard error, the estimated standard deviation of the error term ϵ)
Price = 1.494e4 - 6.817e-2 × KM
With an increase of one km in the accumulated total on the odometer, the price decreases by about €0.068 on average.
# 95% confidence interval for the model coefficients
confint(model_1, level= 0.95)
## 2.5 % 97.5 %
## (Intercept) 1.461094e+04 1.527539e+04
## KM -7.296019e-02 -6.339078e-02
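The 95% confidence interval for 𝛽1 is (-0.0730, -0.0634). Because the interval does not contain 0, 𝛽1 is statistically significant at the 0.05 level.
# One-sided test: is the slope statistically negative? (lower-tail p-value of the t statistic)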
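The interval reported by confint() can be reproduced by hand as estimate ± critical t value × standard error. The sketch below uses the rounded values printed in the summary(model_1) output, so it matches confint() only up to rounding; the variable names are illustrative.

```r
# Values copied from the summary(model_1) output (rounded as printed)
b1    <- -6.817e-02   # slope estimate for KM
se_b1 <- 2.439e-03    # standard error of the slope
df    <- 1262         # residual degrees of freedom

t_crit <- qt(0.975, df)                    # two-sided 95% critical value
ci     <- b1 + c(-1, 1) * t_crit * se_b1   # lower and upper limits
ci
```

Both limits are negative, which is another way of seeing that the slope differs significantly from zero at the 0.05 level.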
tvalue= -27.95
pt(tvalue,1262)
## [1] 1.661374e-134
Yes, 𝛽1 is statistically negative at an alpha level of 0.01: the one-sided p-value is approximately 1.66e-134, far below 0.01.
Question B3: Checking the Assumptions of the Model
Create and interpret the following graphs with respect to the assumptions of the linear regression model. In other words, comment on whether there are any apparent departures from the assumptions of the linear regression model. Make sure that you state the model assumptions and assess each one. Each graph may be used to assess one or more model assumptions.
# Scatter plot with KM on the x-axis and Price on the y-axis
plot(data$KM, data$Price, main = "Scatterplot of Price vs Accumulated KM",
ylab = "Price", xlab = "Accumulated KM")
abline(model_1, col = "red", lwd = 2)
Model Assumption(s) it checks: Linearity
Interpretation: The data exhibits linear behavior in an inverse relationship, and there are no obvious outliers or leverage points visible. Therefore, the assumption is satisfied.
# Residuals vs fitted values
plot(fitted(model_1), resid(model_1), main = "Residuals Plot",
xlab = "Fitted Price", ylab = "Residuals", lwd = 2)
abline(h = 0, col = "red", lwd = 2)
Model Assumption(s) it checks: Constant Variance
Interpretation: The residuals are centered around y = 0, which is consistent with constant variance, but they are grouped together in a way that could indicate a pattern in the residuals, outliers, or a right skew. Because this pattern is minor, the model has room for improvement, but overall the assumption is satisfied.
#QQplot and histogram
library(car)
## Loading required package: carData
qqPlot(model_1$residuals, ylab = "Residuals")
## [1] 36 115
hist(model_1$residuals, xlab="Residuals")
Model Assumption(s) it checks: Normality
Interpretation: For the most part the residuals look normally distributed. The Q-Q plot shows a slight lift in the upper tail, which suggests a mild right skew, and the histogram agrees with this. Because the departure is small, the assumption is approximately satisfied.
Question B4: Prediction
Use the results from model_1 to discuss the effects of KM on the dependent variable: Holding everything else equal, how much would the sales price decrease if a car accumulated 10,000 more kilometers? What observations can you make about the result in the context of the problem? (3 pts)
# Predicted change in Price for 10,000 additional km: the intercept cancels, leaving 10,000 x slope
round(10000 * coef(model_1)[["KM"]], 1)
## [1] -681.8
The price would drop by about €682 if the car were to accumulate another 10,000 km, holding all other things constant.
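The effect of 10,000 additional km can also be read off as the difference between two predictions, because the intercept cancels in the subtraction. The sketch below uses simulated stand-in data (made-up numbers, not UsedCars.csv); the names cars, fit, and new_cars are illustrative.

```r
set.seed(2)
# Simulated stand-in for the used-car data (made-up numbers, not UsedCars.csv)
cars <- data.frame(KM = runif(300, 0, 200000))
cars$Price <- 15000 - 0.07 * cars$KM + rnorm(300, sd = 2500)
fit <- lm(Price ~ KM, data = cars)

# Two hypothetical cars that differ only by 10,000 km
new_cars <- data.frame(KM = c(50000, 60000))
pred     <- predict(fit, newdata = new_cars)

diff_pred  <- unname(pred[2] - pred[1])    # change in predicted price
diff_slope <- 10000 * coef(fit)[["KM"]]    # same quantity from the slope alone
c(diff_pred, diff_slope)
```

Subtracting a rounded intercept from a single prediction instead would leave a small rounding error in the result, which is why the difference of two predictions (or the slope itself) is the safer calculation.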
Part C. Experiment!
You work for the National Park Service (NPS), and you absolutely love bears. Describe an imaginary (it can be realistic) scenario in which you get to run a one-way ANOVA on a few (3+) species of bears.
Part C1 What are you comparing (name the variable!)? What do you hope to learn from ANOVA? (2 pts)
The variable of interest is claw length, compared across different species of bears. We hope to learn whether there is a significant difference in the average claw length among the species.
Part C2 Imagine that the results are “mixed”, meaning you can draw some conclusions and not others. Describe your conclusions and make sure you detail, with reference to your ANOVA, why the results were “mixed.” (3 pts)
• The one-way ANOVA reported a p-value lower than our alpha of 0.01, indicating that at least one species' mean claw length differs significantly from the others.
• The species means were then compared in pairs using Tukey's method. This revealed that some pairs of bear species are significantly different, whereas others show no significant difference.
• The results are therefore mixed: the ANOVA tells us the species are not all alike, but the pairwise comparisons can separate only some pairs of species and not others.
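The workflow described above, an overall ANOVA followed by pairwise comparisons, might look roughly like the sketch below. All claw lengths are invented for illustration: two species are given similar means and one a clearly larger mean, so that a "mixed" pattern is plausible, though the exact result depends on the simulated draw.

```r
set.seed(3)
# Made-up claw lengths (cm) for three species; the means are invented for illustration
bears <- data.frame(
  species = factor(rep(c("black", "brown", "polar"), each = 30)),
  claw_cm = c(rnorm(30, mean = 5.0, sd = 1),
              rnorm(30, mean = 5.2, sd = 1),
              rnorm(30, mean = 8.0, sd = 1))
)

fit_aov <- aov(claw_cm ~ species, data = bears)
summary(fit_aov)    # overall F-test: is at least one mean different?

tukey <- TukeyHSD(fit_aov, conf.level = 0.99)
tukey               # pairwise differences with adjusted p-values
```

The overall F-test only says that the means are not all equal; the Tukey table is what shows which specific pairs differ.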
Part C3 Now imagine that you have just been granted 3 months and $50,000 to continue this study (you’re a great grant writer and a very likable member of the NPS!). Describe some next steps you would take to clarify, reinforce and/or further explore your nascent investigation. You MUST reference using a ‘controlling’ variable somehow in your response. (5 pts)
After achieving mixed results in the study, the next step would be to consider other factors (diet, environment, other animals in the area, etc.) that might influence the length of a bear’s claws. In addition, we would use controlling variables, such as sex and age, when comparing claw length. A two-way ANOVA would then incorporate these control variables, giving a narrower view of which mean claw lengths are actually being compared. The mixed results could also stem from a small sample size, so more observations are needed in each species sample to investigate the mean claw lengths accurately.
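One way to bring controlling variables into the comparison is to fit a model with the controls alone and a second model that adds species, then compare the two. This is a sketch of that idea rather than the only possible design; the data, effect sizes, and column names (age_yr, sex, claw_cm) are all invented.

```r
set.seed(4)
n <- 90
# Made-up data: claw length depends on age and species (all values invented)
bears <- data.frame(
  species = factor(rep(c("black", "brown", "polar"), each = n / 3)),
  sex     = factor(sample(c("F", "M"), n, replace = TRUE)),
  age_yr  = runif(n, 2, 20)
)
species_effect <- c(black = 0, brown = 0.5, polar = 2)[as.character(bears$species)]
bears$claw_cm  <- 4 + 0.1 * bears$age_yr + species_effect + rnorm(n, sd = 0.8)

# Controls only vs. controls plus species
fit_controls <- lm(claw_cm ~ age_yr + sex, data = bears)
fit_full     <- lm(claw_cm ~ age_yr + sex + species, data = bears)

# Partial F-test: does species still matter after adjusting for age and sex?
anova(fit_controls, fit_full)
```

A small p-value in the comparison suggests that species explains variation in claw length beyond what age and sex already account for.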
Part D. Explain the meaning of a p-value!
Explain in detail what it means specifically, in a statistical sense, for any result to be “statistically significant” at a particular α-level. In other words, explain the meaning and use of p-values. You should research this question, and you should expect your answer to be at least a paragraph long. (6 pts)
A result is statistically significant at a given α-level when the observed data would be unlikely if there were no true effect. To establish statistical significance, we write a null and an alternative hypothesis. The null hypothesis states that there is no true effect and that any apparent effect is due to random chance, whereas the alternative hypothesis states that there is a true effect. Before running the test we choose the alpha value, which is the largest probability we are willing to accept of rejecting the null hypothesis when it is actually true (a Type I error). We then run the test and compute the probability, assuming the null hypothesis is true, of obtaining a result at least as extreme as the one observed in our sample. This probability is called the p-value. If the p-value is lower than alpha, we reject the null hypothesis and call the result statistically significant; otherwise we fail to reject it. The p-value is not the probability that the null hypothesis is true, and a significant result does not by itself prove that the effect is real or large.
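The role of alpha as a false-positive rate can be illustrated by simulation. The sketch below runs many two-sample t-tests on made-up data in which the null hypothesis is true by construction; the sample sizes and number of simulations are arbitrary choices.

```r
set.seed(5)
alpha  <- 0.05
n_sims <- 2000

# Simulate experiments in which the null hypothesis is TRUE (both groups share one mean)
p_values <- replicate(n_sims, {
  x <- rnorm(25)
  y <- rnorm(25)
  t.test(x, y)$p.value
})

# Share of true-null experiments declared "significant": close to alpha
mean(p_values < alpha)
```

Roughly 5% of these experiments are flagged as significant even though no effect exists, which is what an alpha of 0.05 allows for.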