##Question #1: Toyota
#set working directory and import the data
setwd("~/NYU/classes/4. Statistical Modeling/Assessment 1")
library(readr)
Toyota <- read_csv("HW1DatasetToyotaPrices _2_.csv")
## Rows: 1436 Columns: 2
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (2): Price, AgeInMonths
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(Toyota)
#printing the data. 'head()' with n = 10 prints only the first 10 rows of the dataset.
head(Toyota, 10)
## # A tibble: 10 x 2
## Price AgeInMonths
## <dbl> <dbl>
## 1 13500 23
## 2 13750 23
## 3 13950 24
## 4 14950 26
## 5 13750 30
## 6 12950 32
## 7 16900 27
## 8 18600 30
## 9 21500 27
## 10 12950 23
#1) Make a scatter plot of the price (Y variable) against the age (X variable) of the cars.
library(readr)
library(ggplot2)
# command for building the plot: assign the axes and plot the points in blue
# 'labs()' function is used to label the axes
ggplot(data=Toyota,aes(x= AgeInMonths, y= Price))+
geom_point(color='blue')+labs(x="AgeInMonths",y="Price")
#2) Fit a regression model to the data. What is the equation of the regression line that you get?
# simple linear regression model stored in the variable named "logRegModel"
logRegModel<- lm(Price ~ AgeInMonths, data= Toyota)
#getting the summary
summary(logRegModel)
##
## Call:
## lm(formula = Price ~ AgeInMonths, data = Toyota)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8423.0 -997.4 -24.6 878.5 12889.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20294.059 146.097 138.91 <2e-16 ***
## AgeInMonths -170.934 2.478 -68.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1746 on 1434 degrees of freedom
## Multiple R-squared: 0.7684, Adjusted R-squared: 0.7682
## F-statistic: 4758 on 1 and 1434 DF, p-value: < 2.2e-16
##Answer to #2 - The equation of the regression line is y = intercept + slope * x, so in this case Price = 20294.059 - 170.934 * AgeInMonths.
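# A minimal sketch of how the equation can be read off a fitted model with coef() instead of
# copying numbers by hand. ASSUMPTION: 'toyota_sample' and 'fit_sample' are hypothetical names, and
# the data frame holds only the 10 rows printed by head(Toyota, 10) above, so its coefficients
# will differ from the full-data values reported in the summary.

```r
# Hypothetical 10-row sample copied from the head(Toyota, 10) output above
toyota_sample <- data.frame(
  Price       = c(13500, 13750, 13950, 14950, 13750, 12950, 16900, 18600, 21500, 12950),
  AgeInMonths = c(23, 23, 24, 26, 30, 32, 27, 30, 27, 23)
)

# Fit the same kind of simple linear regression as in the analysis
fit_sample <- lm(Price ~ AgeInMonths, data = toyota_sample)

# coef() returns a named vector: "(Intercept)" and "AgeInMonths"
b <- coef(fit_sample)

# Assemble the regression equation as text
equation <- sprintf("Price = %.3f + (%.3f) * AgeInMonths",
                    b[["(Intercept)"]], b[["AgeInMonths"]])
cat(equation, "\n")
```

# Applied to the full model, coef(logRegModel) should return the two estimates shown in the summary.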
#We will again use the library ggplot2 to add the regression line
library(ggplot2)
# the Toyota tibble is already in the workspace from read_csv() above;
# data() only loads datasets shipped with packages, so it is not needed here
# deciding on X and Y axis + plotting the points on graph in blue
p1 = ggplot( data= Toyota,aes( x= AgeInMonths, y= Price)) +
geom_point( color= 'blue')
# plotting the regression line through the points
# "labs()" function is used to label the axes
p1 + geom_smooth( method= 'lm', se= FALSE, col= "red")+
labs(x="AgeInMonths",y="Price")
## `geom_smooth()` using formula 'y ~ x'
#3) Does the value of the slope in the regression model align with your intuition of the relationship that you would expect between price and age of a second hand car?
##Answer - Yes. I would expect the price to decrease as the age increases, and the negative slope (-170.934 Euros per month) and the downward-sloping fitted line confirm this.
#4) Based on the output that you get from the regression, is there evidence of a linear relationship between price and age of the car? Specify any number(s) that you base your conclusion on
##Answer to #4 - Yes, there is evidence of a linear relationship between price and the age of a car. The t-statistic for the slope is -68.98, which is far outside (-2,2), and its p-value is below 2e-16, which is well under 5%. The slope is negative, so price falls as age rises, which is also in line with my intuition as noted above.
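# A short sketch of where the p-value comes from, assuming the usual two-sided t-test on the slope.
# The t value and degrees of freedom are copied from the summary output above; 't_stat', 'df_resid'
# and 'p_value' are illustrative names.

```r
t_stat   <- -68.98   # slope t value from the summary output
df_resid <- 1434     # residual degrees of freedom from the summary output

# Two-sided p-value: probability of a t value at least this extreme under H0 (slope = 0)
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

# The result is far below the 5% threshold, and below the 2.2e-16 that R reports as a floor
p_value < 0.05
p_value < 2.2e-16
```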
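#5) What is the R-squared value for your model
##Answer to #5: The R-squared value is 0.7684 (Multiple R-squared in the summary output), so age explains about 77% of the variation in price.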
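# As a cross-check, in a simple regression with one predictor the R-squared can be recovered
# from the slope t value and the residual degrees of freedom. This is a sketch using numbers
# copied from the summary output; the variable names are illustrative.

```r
t_stat   <- -68.98   # slope t value from the summary output
df_resid <- 1434     # residual degrees of freedom

# With a single predictor, F = t^2 and R^2 = F / (F + residual df)
f_stat    <- t_stat^2
r_squared <- f_stat / (f_stat + df_resid)

round(f_stat)          # compare with the F-statistic line of the summary
round(r_squared, 4)    # compare with Multiple R-squared in the summary
```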
#6) According to your model, what is the average predicted price in Euros for a second hand Toyota which is 30 months old?
##Answer to #6: the predicted price in Euros for a second hand Toyota which is 30 months old is 20294.059 - 170.934*30 = 15,166.04, or roughly 15,000.
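# The prediction can be reproduced by plugging the age into the fitted equation. A minimal sketch,
# with the coefficients copied from the summary output; 'predict_price' is a hypothetical helper.

```r
intercept <- 20294.059   # (Intercept) estimate from the summary output
slope     <- -170.934    # AgeInMonths estimate from the summary output

# Predicted price in Euros for a car of the given age in months
predict_price <- function(age_months) {
  intercept + slope * age_months
}

predicted_30 <- predict_price(30)
predicted_30
```

# With the fitted model in the workspace, predict(logRegModel, newdata = data.frame(AgeInMonths = 30))
# should give the same value up to rounding of the coefficients.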
#7) What is the 95% PI for the price of a second hand Toyota that is 30 months old? Show the calculations that you made to arrive at your answer?
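##Answer to #7 is as follows: the predicted price at 30 months is 20294.059 - 170.934*30, or about 15,166. The residual standard error is 1746, and 2*(1746) is 3492, therefore the approximate 95% PI is 15,166 +/- 3492, or (11674, 18658).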
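# The interval arithmetic can be sketched as follows, assuming the rule of thumb
# "prediction +/- 2 * residual standard error". All inputs are copied from the summary output;
# the variable names are illustrative.

```r
y_hat <- 20294.059 - 170.934 * 30   # predicted price at 30 months
s     <- 1746                       # residual standard error

# Rule-of-thumb 95% prediction interval: y_hat +/- 2 * s
pi_approx <- c(lower = y_hat - 2 * s, upper = y_hat + 2 * s)
pi_approx

# Same idea with the t quantile for 1434 degrees of freedom instead of the constant 2
t_crit <- qt(0.975, df = 1434)
pi_t   <- c(lower = y_hat - t_crit * s, upper = y_hat + t_crit * s)
pi_t
```

# The interval from predict(logRegModel, newdata = data.frame(AgeInMonths = 30), interval = "prediction")
# also accounts for the uncertainty in the fitted line, so it would be slightly wider than pi_t.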
#8) A second hand Toyota dealer has a Toyota that is 30 months old that you are interested in buying. He quotes a price of 19,500 Euros for it. Based on all the analysis done so far in the earlier questions, do you think that this price is reasonable or do you think that it is too high? Justify your answer in a couple of sentences
##Answer to #8: I think a price of 19,500 is too high. Based on our analysis, the 95% PI for the price of a 30-month-old Toyota has an upper bound below 19,000. Since 19,500 lies above the whole interval, I think the price is too high.
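# One way to quantify "too high" is to express the quote as a number of residual standard errors
# above the predicted price. A sketch using the summary output values; 'quoted_price' and
# 'se_above' are illustrative names.

```r
quoted_price <- 19500                        # dealer's quote in Euros
y_hat        <- 20294.059 - 170.934 * 30     # predicted price at 30 months
s            <- 1746                         # residual standard error

# Distance of the quote from the prediction, in units of the residual standard error
se_above <- (quoted_price - y_hat) / s
se_above

# TRUE when the quote lies above the rule-of-thumb 95% prediction interval
se_above > 2
```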
##Question #2: Airports
#set working directory and import the data
setwd("~/NYU/classes/4. Statistical Modeling/Assessment 1")
library(readr)
Airport <- read_csv("AirportViolHW1 _2_.csv")
## Rows: 19 Columns: 3
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): Airport
## dbl (2): TurnRate, ViolDet
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#printing the data. 'head()' with n = 10 prints only the first 10 rows of the dataset.
head(Airport, 10)
## # A tibble: 10 x 3
## TurnRate ViolDet Airport
## <dbl> <dbl> <chr>
## 1 416 11.9 St. Louis
## 2 375 7.3 Atlanta
## 3 237 10.6 Houston
## 4 207 22.9 Boston
## 5 200 6.5 Chicago
## 6 193 15.2 Denver
## 7 156 18.2 Dallas
## 8 155 21.7 Baltimore
## 9 140 31.5 Seattle/Tacoma
## 10 110 20.7 San Francisco
names(Airport)
## [1] "TurnRate" "ViolDet" "Airport"
#1) Make a scatter plot of violations detected per million passengers (Y) versus the turnover rate (X). Visually, do you see a pattern that indicates a linear association between the two?
library(readr)
library(ggplot2)
# command for building the plot: assign the axes and plot the points in blue
# 'labs()' function is used to label the axes
ggplot(data=Airport,aes(x= TurnRate, y= ViolDet))+
geom_point(color='blue')+labs(x="turnover rate",y="violations detected per million passengers")
##Answer to #1: Visually I do NOT see a pattern that indicates a linear association between the two
#Separately run a Correlation of the data
cor(Airport$TurnRate, Airport$ViolDet)
## [1] -0.4014181
#2) Run a regression analysis and based on the output, state whether there is evidence that the two variables are linearly related
# simple linear regression model stored in the variable named "logRegModel2"
logRegModel2<- lm(ViolDet ~ TurnRate, data= Airport)
#getting the summary
summary(logRegModel2)
##
## Call:
## lm(formula = ViolDet ~ TurnRate, data = Airport)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.4440 -5.9082 -0.8105 5.2192 13.8808
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.86874 3.02808 7.222 1.43e-06 ***
## TurnRate -0.03035 0.01680 -1.807 0.0885 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.509 on 17 degrees of freedom
## Multiple R-squared: 0.1611, Adjusted R-squared: 0.1118
## F-statistic: 3.266 on 1 and 17 DF, p-value: 0.08848
##Answer to #2: Based on the regression analysis, the slope coefficient has a t-statistic of -1.807, which lies inside the interval (-2,2). Equivalently, its p-value is 0.0885, which is greater than 5%. Hence, at the 5% level there is no evidence that the Y variable, here violations detected, is linearly related to the X variable, here turnover rate.
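# The correlation computed earlier and the regression output tell the same story. As a sketch,
# the R-squared, slope t value and p-value can be recovered from the correlation alone in a
# simple regression. The inputs are copied from the output above; the variable names are illustrative.

```r
r <- -0.4014181   # cor(Airport$TurnRate, Airport$ViolDet) from the output above
n <- 19           # number of airports (rows in the dataset)

# R-squared of a simple regression is the squared correlation
r_squared <- r^2

# t statistic for the slope, on n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value for H0: no linear relationship
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

round(c(r_squared = r_squared, t_stat = t_stat, p_value = p_value), 4)

# FALSE here: the relationship is not significant at the 5% level
p_value < 0.05
```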