7.1 After reading the dataset into R
(naming it using your student ID as usual), identify the number of
observations, data types, as well as the mean, median, and quartiles for
each variable in the dataset. Use any function(s) that you like. Show
your function(s) and your output. Consider other statistics or graphics
you may want to create.
getwd()
## [1] "C:/Users/rafid/Desktop"
setwd("C:/Users/rafid/Desktop")
data28 <- read.csv("CommercialProperties.csv")
str(data28)
## 'data.frame': 82 obs. of 5 variables:
## $ Rental.Rates : num 13.5 12 10.5 15 14 10.5 14 16.5 17.5 16.5 ...
## $ Age : int 1 14 16 4 11 15 2 1 1 8 ...
## $ Operating.Expenses.and.Taxes: num 5.02 8.19 3 10.7 8.97 ...
## $ Vacancy.Rates : num 0.14 0.27 0 0.05 0.07 0.24 0.19 0.6 0 0.03 ...
## $ Total.Square.Footage : int 123000 104079 39998 57112 60000 101385 31300 248172 215000 251015 ...
#The number of observations#
no_obsv <- nrow(data28)
cat("The Number of Observations:", no_obsv, "\n")
## The Number of Observations: 82
#The Data Types#
data_type <- sapply(data28, class)
print(data_type)
## Rental.Rates Age
## "numeric" "integer"
## Operating.Expenses.and.Taxes Vacancy.Rates
## "numeric" "numeric"
## Total.Square.Footage
## "integer"
#Finding the mean, median and quartile for each variable in the dataset#
summary(data28)
## Rental.Rates Age Operating.Expenses.and.Taxes Vacancy.Rates
## Min. :10.5 Min. : 0.000 Min. : 3.000 Min. :0.00000
## 1st Qu.:14.0 1st Qu.: 2.000 1st Qu.: 8.145 1st Qu.:0.00000
## Median :15.0 Median : 4.000 Median :10.370 Median :0.03000
## Mean :15.2 Mean : 8.012 Mean : 9.814 Mean :0.08244
## 3rd Qu.:16.5 3rd Qu.:15.000 3rd Qu.:11.620 3rd Qu.:0.09750
## Max. :20.0 Max. :20.000 Max. :20.000 Max. :0.73000
## Total.Square.Footage
## Min. : 27000
## 1st Qu.: 70500
## Median :129614
## Mean :164772
## 3rd Qu.:239000
## Max. :500020
summaryStatistics <- sapply(data28, function(x) {
  c(Mean = mean(x, na.rm = TRUE),
    Median = median(x, na.rm = TRUE),
    First_Quartile = quantile(x, 0.25, na.rm = TRUE),
    Third_Quartile = quantile(x, 0.75, na.rm = TRUE))
})
print(summaryStatistics)
## Rental.Rates Age Operating.Expenses.and.Taxes
## Mean 15.19817 8.012195 9.813902
## Median 15.00000 4.000000 10.370000
## First_Quartile.25% 14.00000 2.000000 8.145000
## Third_Quartile.75% 16.50000 15.000000 11.620000
## Vacancy.Rates Total.Square.Footage
## Mean 0.08243902 164772.1
## Median 0.03000000 129614.0
## First_Quartile.25% 0.00000000 70500.0
## Third_Quartile.75% 0.09750000 239000.0
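The row labels `First_Quartile.25%` and `Third_Quartile.75%` in the output above appear because `quantile()` returns a named value. A minimal sketch of one way to get plain row labels is shown below; the `toy` data frame and its values are made up for illustration and are not the assignment data.

```r
# Hypothetical toy data frame standing in for data28 (values are made up).
toy <- data.frame(a = c(1, 2, 3, 4, 5),
                  b = c(10, 20, 30, 40, 100))

summary_stats <- sapply(toy, function(x) {
  c(Mean = mean(x, na.rm = TRUE),
    Median = median(x, na.rm = TRUE),
    # names = FALSE drops the "25%" / "75%" labels that quantile() attaches
    First_Quartile = quantile(x, 0.25, na.rm = TRUE, names = FALSE),
    Third_Quartile = quantile(x, 0.75, na.rm = TRUE, names = FALSE))
})

# Rows are now labelled Mean, Median, First_Quartile, Third_Quartile
print(summary_stats)
```

The same `names = FALSE` argument could be added to the `quantile()` calls applied to `data28`; the computed values would not change, only the row labels.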
#Histogram Data of the Rental Rates#
hist(data28$Rental.Rates,
     main = "Distribution of Commercial Rental Rates",
     xlab = "Rental Rates ($USD per square foot)",
     col = "darkblue",
     border = "grey")
7.2 Perform a series of regressions where
you predict
7.2.1 Rental rates of the properties (Rental Rates) using Age,
rentalRatesByAgeModel <- lm(Rental.Rates ~ Age, data = data28)
summary(rentalRatesByAgeModel)
##
## Call:
## lm(formula = Rental.Rates ~ Age, data = data28)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3739 -0.9308 0.1209 1.0112 5.3582
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.57003 0.30593 50.895 <2e-16 ***
## Age -0.04641 0.02932 -1.583 0.117
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.775 on 80 degrees of freedom
## Multiple R-squared: 0.03038, Adjusted R-squared: 0.01826
## F-statistic: 2.506 on 1 and 80 DF, p-value: 0.1173
7.2.2 Rental rates of the properties (Rental Rates) using Operating Expenses & Taxes,
rentalRatesByExpenses <- lm(Rental.Rates ~ Operating.Expenses.and.Taxes, data = data28)
summary(rentalRatesByExpenses)
##
## Call:
## lm(formula = Rental.Rates ~ Operating.Expenses.and.Taxes, data = data28)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5862 -0.9241 -0.1766 0.7576 4.8161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.17872 0.63735 19.109 < 2e-16 ***
## Operating.Expenses.and.Taxes 0.30767 0.06247 4.925 4.45e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.579 on 80 degrees of freedom
## Multiple R-squared: 0.2327, Adjusted R-squared: 0.2231
## F-statistic: 24.26 on 1 and 80 DF, p-value: 4.452e-06
7.2.3 Rental rates of the properties (Rental Rates) using Vacancy Rates
rentalRatesByVacancy <- lm(Rental.Rates ~ Vacancy.Rates, data = data28)
summary(rentalRatesByVacancy)
##
## Call:
## lm(formula = Rental.Rates ~ Vacancy.Rates, data = data28)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8924 -1.1243 -0.0965 1.1264 4.6569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.0965 0.2329 64.815 <2e-16 ***
## Vacancy.Rates 1.2329 1.4841 0.831 0.409
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.795 on 80 degrees of freedom
## Multiple R-squared: 0.008552, Adjusted R-squared: -0.003841
## F-statistic: 0.6901 on 1 and 80 DF, p-value: 0.4086
7.2.4 Rental rates of the properties (Rental Rates) using Total Square Footage
rentalRatesBySQFT <- lm(Rental.Rates ~ Total.Square.Footage, data = data28)
summary(rentalRatesBySQFT)
##
## Call:
## lm(formula = Rental.Rates ~ Total.Square.Footage, data = data28)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1236 -0.7636 0.2848 1.0718 3.3699
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.370e+01 2.848e-01 48.117 < 2e-16 ***
## Total.Square.Footage 9.065e-06 1.421e-06 6.377 1.08e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.468 on 80 degrees of freedom
## Multiple R-squared: 0.337, Adjusted R-squared: 0.3287
## F-statistic: 40.67 on 1 and 80 DF, p-value: 1.084e-08
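Explanation : Operating expenses and taxes (p = 4.45e-06) and total square footage (p = 1.08e-08) are statistically significant predictors of rental rates at the 0.05 level, explaining 23.3% and 33.7% of the variation respectively. Age (p = 0.117) and vacancy rates (p = 0.409) are not significant, so these two models give no evidence that either variable is linearly related to rent.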
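The slope estimates, p-values, and R-squared values reported in the four summaries can be placed side by side to see which predictors clear a significance threshold. The sketch below copies those numbers from the output above rather than refitting the models; the 0.05 cutoff stored in `alpha` and the names `slopes` and `significant` are assumptions made for illustration.

```r
# Slope estimates, p-values, and R-squared copied from the four summary() outputs.
slopes <- data.frame(
  predictor = c("Age", "Operating.Expenses.and.Taxes",
                "Vacancy.Rates", "Total.Square.Footage"),
  estimate  = c(-0.04641, 0.30767, 1.2329, 9.065e-06),
  p_value   = c(0.117, 4.45e-06, 0.409, 1.08e-08),
  r_squared = c(0.03038, 0.2327, 0.008552, 0.337)
)

alpha <- 0.05  # assumed significance level

# Flag the predictors whose slope p-value falls below alpha
slopes$significant <- slopes$p_value < alpha
print(slopes)
```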
7.3 Look at your output for the
regression of rental rates and total square footage. How many
observations are analyzed (what is the “n” value)? What is the
regression equation? What is the R2 value? [use the “Multiple
R-Squared”]. What percentage of variation does this model explain?
According to this model, how much would a property of 300,000 square
feet rent for?
Explanation
Number of observations analyzed (n): 82 (80 residual degrees of freedom plus 2 estimated coefficients).
Regression equation: Rental Rates = 13.70 + 0.000009065 × Total Square Footage
R² value (Multiple R-squared): 0.337
Percentage of variation explained: the model explains 33.7% of the variation in rental rates.
Rental rate for a property of 300,000 square feet: 13.70 + 0.000009065 × 300,000 = 13.70 + 2.72 = 16.42 USD per square foot. This value lies within the observed range of total square footage (27,000 to 500,020), so the prediction does not extrapolate beyond the data.
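The arithmetic for the 300,000 square foot property can be checked in R. This sketch plugs in the rounded coefficients from the regression output; the variable names are chosen here for illustration.

```r
# Coefficients copied (rounded) from summary(rentalRatesBySQFT)
intercept <- 13.70
slope     <- 9.065e-06
sqft      <- 300000

# Predicted rental rate in USD per square foot
predicted_rate <- intercept + slope * sqft
print(round(predicted_rate, 2))

# With the fitted model object available, the equivalent call would be:
# predict(rentalRatesBySQFT, newdata = data.frame(Total.Square.Footage = 300000))
```

Because `predict()` uses the unrounded coefficients, its result may differ slightly in the later decimal places from the hand calculation.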
7.4 Use the ggplot() to check the
required conditions for the model [only do model 4, the one with total
square footage as the independent variable and rental rates as the
dependent variable]. Show your commands and your output. Choose your own
colors for your graphics. Be sure you have appropriate titles on all of
your graphics, and that all axes are labeled appropriately. State your
conclusion about the usefulness of the model.
library(ggplot2)
#Leverage Values#
data28$leverage <- hatvalues(rentalRatesBySQFT)
data28$residuals <- resid(rentalRatesBySQFT)
data28$fitted.values <- fitted(rentalRatesBySQFT)
#Plot 1#
plot_1 <- ggplot(data28, aes(x = fitted.values, y = residuals)) +
  geom_point(color = "darkgreen") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray") +
  labs(title = "Residuals vs. Fitted Values", x = "Fitted Values", y = "Residuals") +
  theme(plot.title = element_text(hjust = 0.5))
print(plot_1)
Explanation : This plot evaluates the linear relationship assumption and homoscedasticity (constant variance) of residuals. Ideally, the points should be randomly dispersed around the horizontal line at 0, without forming any patterns. Patterns or a funnel shape would indicate potential issues with the model.
#Plot 2#
plot_2 <- ggplot(data28, aes(x = residuals)) +
  geom_histogram(fill = "coral", color = "black", binwidth = 0.5) +
  labs(title = "Histogram of Residuals", x = "Residuals", y = "Frequency") +
  theme(plot.title = element_text(hjust = 0.5))
print(plot_2)
Explanation : The histogram of residuals checks for the normality of residuals. A well-fitting model should have residuals that approximately follow a normal distribution. Deviations from this pattern suggest the model might not be capturing some aspects of the data’s structure.
#Plot 3#
plot_3 <- ggplot(data28, aes(sample = residuals)) +
  geom_qq(color = "purple") +
  geom_qq_line(color = "darkred") +
  labs(title = "QQ Plot of Residuals", x = "Theoretical Quantiles", y = "Sample Quantiles") +
  theme(plot.title = element_text(hjust = 0.5))
print(plot_3)
Explanation : The QQ plot provides another way to assess the normality of residuals. Points following closely along the reference line indicate that residuals are normally distributed. Significant deviations from the line, especially at the tails, would signal that the residuals do not follow a normal distribution, potentially undermining the model’s assumptions.
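A numeric check can supplement the visual reading of the histogram and QQ plot. The sketch below applies the Shapiro-Wilk test to the residuals of a simple regression; it is not part of the original analysis, and the data are simulated (the names `x`, `y`, and `fit` are hypothetical stand-ins for the real variables and model).

```r
set.seed(1)

# Simulated stand-ins for square footage and rental rates (not the real data)
x <- runif(82, min = 27000, max = 500000)
y <- 13.7 + 9e-06 * x + rnorm(82, sd = 1.5)

fit <- lm(y ~ x)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal,
# so a small p-value is evidence against normality
sw <- shapiro.test(resid(fit))
print(sw)
```

On the assignment data the same call would be `shapiro.test(resid(rentalRatesBySQFT))`, read alongside the plots rather than in place of them.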
#Plot 4#
plot_4 <- ggplot(data28, aes(x = leverage, y = residuals)) +
  geom_point(color = "steelblue") +
  labs(title = "Residuals vs. Leverage", x = "Leverage", y = "Residuals") +
  theme(plot.title = element_text(hjust = 0.5)) # Centering the title
# Print the plot
print(plot_4)
Explanation : This plot helps in
identifying influential cases in the regression model. Points with high
leverage can have a disproportionate effect on the model’s parameters.
Ideally, residuals should be randomly distributed, and leverage values
should be low. A cluster of points along the zero line for residuals,
with no distinct patterns or outliers with high leverage, suggests the
model is robust and the predictors are appropriate.
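To make "high leverage" concrete, one common rule of thumb flags observations whose hat value exceeds twice the average leverage, 2p/n, where p is the number of estimated coefficients. The sketch below illustrates this on simulated data; the cutoff is a convention rather than a formal test, and `x`, `y`, `fit`, and `cutoff` are hypothetical names, not objects from the analysis above.

```r
set.seed(1)

# Simulated stand-ins for square footage and rental rates (not the real data)
x <- runif(82, min = 27000, max = 500000)
y <- 13.7 + 9e-06 * x + rnorm(82, sd = 1.5)

fit <- lm(y ~ x)
h <- hatvalues(fit)        # leverage of each observation

p <- length(coef(fit))     # intercept + slope = 2 coefficients
n <- length(h)             # number of observations
cutoff <- 2 * p / n        # rule-of-thumb threshold: twice the mean leverage

# Indices of observations above the cutoff (may be empty)
high_leverage <- which(h > cutoff)
print(cutoff)
print(high_leverage)
```

With n = 82 and p = 2, as in the square footage model, the cutoff works out to 4/82, or about 0.049.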
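Conclusion
The regression summary shows that total square footage is a
statistically significant predictor of rental rates (p = 1.084e-08), but
the model explains only 33.7% of the variation, with a residual standard
error of 1.468, so individual predictions carry considerable uncertainty.
Provided the diagnostic plots show no clear pattern in the residuals, no
strong departure from normality, and no high-leverage outliers, the
model's assumptions are reasonable and the model is moderately useful
for predicting rental rates based on the total square footage.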