Project=read_xlsx("QUANT_Project.xlsx")
Dependent variable: - Dollars spent on pavement projects
Independent variables: - Pavement Condition (PCI) - 311 Call Volume - Length (if needed for kitchen sink exercise)
Cost_PCI_Model=lm(Cost~IMPPCI, data=Project)
summary(Cost_PCI_Model)
##
## Call:
## lm(formula = Cost ~ IMPPCI, data = Project)
##
## Residuals:
## Min 1Q Median 3Q Max
## -73651 -25458 -10957 -747 2373021
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 73651.42 1555.01 47.36 <2e-16 ***
## IMPPCI -778.19 21.31 -36.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 87450 on 33334 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.03845, Adjusted R-squared: 0.03842
## F-statistic: 1333 on 1 and 33334 DF, p-value: < 2.2e-16
This dataset includes all streets, whether or not they received a pavement project. Right now, I want to filter out the ones that did not get a project (ones where the cost field contains a zero) and see what happens.
Project_NoZeroCost = Project[Project$Cost != 0, ]
Model_ProjectNoZero=lm(Cost~IMPPCI, data=Project_NoZeroCost)
summary(Model_ProjectNoZero)
##
## Call:
## lm(formula = Cost ~ IMPPCI, data = Project_NoZeroCost)
##
## Residuals:
## Min 1Q Median 3Q Max
## -266907 -45391 -20598 967 2222838
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 267738.27 4853.98 55.16 <2e-16 ***
## IMPPCI -2687.06 64.25 -41.82 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 148500 on 8600 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.169, Adjusted R-squared: 0.1689
## F-statistic: 1749 on 1 and 8600 DF, p-value: < 2.2e-16
Looking at the report, my understanding is that there is statistical significance. The p-value is extremely low, meaning it is extremely unlikely that the dependent variable’s results are random chance.
Removing the zero values also increased the p-value, which should mean a stronger relationship.
I also plotted it out of curiosity.
ggplot(Project_NoZeroCost, aes(x = IMPPCI, y = Cost)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 13 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 13 rows containing missing values or values outside the scale range
## (`geom_point()`).
6) Choose some significant independent variables. Interpret its
Estimates (or Beta Coefficients). How do the independent variables
individually affect the dependent variable?
SinkNoZeroModel=lm(Cost~IMPPCI+Calls+Shape_Length, data=Project_NoZeroCost)
summary(SinkNoZeroModel)
##
## Call:
## lm(formula = Cost ~ IMPPCI + Calls + Shape_Length, data = Project_NoZeroCost)
##
## Residuals:
## Min 1Q Median 3Q Max
## -736677 -41764 -1892 24860 2007529
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.701e+05 4.587e+03 37.08 <2e-16 ***
## IMPPCI -2.433e+03 5.640e+01 -43.14 <2e-16 ***
## Calls 1.189e+04 5.062e+02 23.50 <2e-16 ***
## Shape_Length 1.868e+01 6.153e-01 30.35 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 128300 on 8598 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.3797, Adjusted R-squared: 0.3795
## F-statistic: 1754 on 3 and 8598 DF, p-value: < 2.2e-16
The report suggests that: 1. Cost goes down by 2433 for every PCI point 2. Cost raises by 11,890 for every call that comes in 3. The cost only goes up by 18.68 for every foot of pavement. This result seems off, and this variable was only included for the purpose of the exercise. In general, there are too many differences in the pavement applications and other factors that make the road’s length a poor factor to include without extremely specific context.
I am removing pavement length just to see.
PCIandCallNoZeroModel=lm(Cost~IMPPCI+Calls, data=Project_NoZeroCost)
summary(PCIandCallNoZeroModel)
##
## Call:
## lm(formula = Cost ~ IMPPCI + Calls, data = Project_NoZeroCost)
##
## Residuals:
## Min 1Q Median 3Q Max
## -642250 -40148 -8701 10986 2014978
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 211858.60 4604.56 46.01 <2e-16 ***
## IMPPCI -2293.51 59.14 -38.78 <2e-16 ***
## Calls 19591.44 460.97 42.50 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 135000 on 8599 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.3132, Adjusted R-squared: 0.3131
## F-statistic: 1961 on 2 and 8599 DF, p-value: < 2.2e-16
Even though the points are visually scattered, the model line seems relatively linear. In future homeworks, I will see how different transformations affect this, and I am also hoping to factor in the streets that did not receive a project in some form or another.
plot(PCIandCallNoZeroModel,which=1)