Find your own data set that contains at least 20 units and at least 4 variables (most of which are numeric, but it is good to have at least one categorical variable as well). Perform the following steps using the R program:
TASK 1:1 Explain data set
library(readr)
car_data <- read_csv("./car data.csv")
## Rows: 301 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Car_Name, Fuel_Type, Seller_Type, Transmission
## dbl (5): Year, Selling_Price, Present_Price, Kms_Driven, Owner
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(car_data)
## # A tibble: 6 × 9
## Car_Name Year Selling_Price Present_Price Kms_Driven Fuel_Type Seller_Type
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 ritz 2014 3.35 5.59 27000 Petrol Dealer
## 2 sx4 2013 4.75 9.54 43000 Diesel Dealer
## 3 ciaz 2017 7.25 9.85 6900 Petrol Dealer
## 4 wagon r 2011 2.85 4.15 5200 Petrol Dealer
## 5 swift 2014 4.6 6.87 42450 Diesel Dealer
## 6 vitara bre… 2018 9.25 9.83 2071 Diesel Dealer
## # ℹ 2 more variables: Transmission <chr>, Owner <dbl>
Dataset description:
This dataset contains information about cars being sold, including their specifications and details about the sellers.
# Check the structure of the dataset
str(car_data)
## spc_tbl_ [301 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Car_Name : chr [1:301] "ritz" "sx4" "ciaz" "wagon r" ...
## $ Year : num [1:301] 2014 2013 2017 2011 2014 ...
## $ Selling_Price: num [1:301] 3.35 4.75 7.25 2.85 4.6 9.25 6.75 6.5 8.75 7.45 ...
## $ Present_Price: num [1:301] 5.59 9.54 9.85 4.15 6.87 9.83 8.12 8.61 8.89 8.92 ...
## $ Kms_Driven : num [1:301] 27000 43000 6900 5200 42450 ...
## $ Fuel_Type : chr [1:301] "Petrol" "Diesel" "Petrol" "Petrol" ...
## $ Seller_Type : chr [1:301] "Dealer" "Dealer" "Dealer" "Dealer" ...
## $ Transmission : chr [1:301] "Manual" "Manual" "Manual" "Manual" ...
## $ Owner : num [1:301] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. Car_Name = col_character(),
## .. Year = col_double(),
## .. Selling_Price = col_double(),
## .. Present_Price = col_double(),
## .. Kms_Driven = col_double(),
## .. Fuel_Type = col_character(),
## .. Seller_Type = col_character(),
## .. Transmission = col_character(),
## .. Owner = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
# View the first few rows
head(car_data, 6)
## # A tibble: 6 × 9
## Car_Name Year Selling_Price Present_Price Kms_Driven Fuel_Type Seller_Type
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 ritz 2014 3.35 5.59 27000 Petrol Dealer
## 2 sx4 2013 4.75 9.54 43000 Diesel Dealer
## 3 ciaz 2017 7.25 9.85 6900 Petrol Dealer
## 4 wagon r 2011 2.85 4.15 5200 Petrol Dealer
## 5 swift 2014 4.6 6.87 42450 Diesel Dealer
## 6 vitara bre… 2018 9.25 9.83 2071 Diesel Dealer
## # ℹ 2 more variables: Transmission <chr>, Owner <dbl>
TASK 1:2 Data manipulations
# Create a new variable 'Car_Age' based on the difference between the current year and the car's year
car_data$Car_Age <- 2024 - car_data$Year
# View the first few rows to confirm the new variable
head(car_data, 6)
## # A tibble: 6 × 10
## Car_Name Year Selling_Price Present_Price Kms_Driven Fuel_Type Seller_Type
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 ritz 2014 3.35 5.59 27000 Petrol Dealer
## 2 sx4 2013 4.75 9.54 43000 Diesel Dealer
## 3 ciaz 2017 7.25 9.85 6900 Petrol Dealer
## 4 wagon r 2011 2.85 4.15 5200 Petrol Dealer
## 5 swift 2014 4.6 6.87 42450 Diesel Dealer
## 6 vitara bre… 2018 9.25 9.83 2071 Diesel Dealer
## # ℹ 3 more variables: Transmission <chr>, Owner <dbl>, Car_Age <dbl>
# Since I have no missing data, I proposed to remove 2 data that seem irrelevant given the amount of data. These irrelevant data are CNG combustion cars.
# Remove rows where Fuel_Type is 'CNG'
car_data_clean <- subset(car_data, Fuel_Type != "CNG")
# Check the first few rows to confirm the removal of CNG cars
head(car_data_clean, 6)
## # A tibble: 6 × 10
## Car_Name Year Selling_Price Present_Price Kms_Driven Fuel_Type Seller_Type
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 ritz 2014 3.35 5.59 27000 Petrol Dealer
## 2 sx4 2013 4.75 9.54 43000 Diesel Dealer
## 3 ciaz 2017 7.25 9.85 6900 Petrol Dealer
## 4 wagon r 2011 2.85 4.15 5200 Petrol Dealer
## 5 swift 2014 4.6 6.87 42450 Diesel Dealer
## 6 vitara bre… 2018 9.25 9.83 2071 Diesel Dealer
## # ℹ 3 more variables: Transmission <chr>, Owner <dbl>, Car_Age <dbl>
# Check how many rows were removed (original vs clean data)
nrow(car_data) # Number of rows before
## [1] 301
nrow(car_data_clean) # Number of rows after removing CNG cars
## [1] 299
# Rename a column in the dataset
names(car_data)[names(car_data) == "Selling_Price"] <- "Price_To_Sell"
# View the first few rows to confirm the rename
head(car_data, 6)
## # A tibble: 6 × 10
## Car_Name Year Price_To_Sell Present_Price Kms_Driven Fuel_Type Seller_Type
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 ritz 2014 3.35 5.59 27000 Petrol Dealer
## 2 sx4 2013 4.75 9.54 43000 Diesel Dealer
## 3 ciaz 2017 7.25 9.85 6900 Petrol Dealer
## 4 wagon r 2011 2.85 4.15 5200 Petrol Dealer
## 5 swift 2014 4.6 6.87 42450 Diesel Dealer
## 6 vitara bre… 2018 9.25 9.83 2071 Diesel Dealer
## # ℹ 3 more variables: Transmission <chr>, Owner <dbl>, Car_Age <dbl>
# Create a new data frame based on certain conditions
petrol_cars <- subset(car_data, Fuel_Type == "Petrol" & Kms_Driven < 50000)
# View the new data frame
head(petrol_cars, 6)
## # A tibble: 6 × 10
## Car_Name Year Price_To_Sell Present_Price Kms_Driven Fuel_Type Seller_Type
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 ritz 2014 3.35 5.59 27000 Petrol Dealer
## 2 ciaz 2017 7.25 9.85 6900 Petrol Dealer
## 3 wagon r 2011 2.85 4.15 5200 Petrol Dealer
## 4 ciaz 2015 6.75 8.12 18796 Petrol Dealer
## 5 alto 800 2017 2.85 3.6 2135 Petrol Dealer
## 6 ciaz 2015 7.5 9.94 15000 Petrol Dealer
## # ℹ 3 more variables: Transmission <chr>, Owner <dbl>, Car_Age <dbl>
TASK 1:3 Descriptive data statistics
# Summary of descriptive statistics for the selected variables
summary(car_data_clean[, c("Selling_Price", "Present_Price", "Kms_Driven")])
## Selling_Price Present_Price Kms_Driven
## Min. : 0.100 Min. : 0.320 Min. : 500
## 1st Qu.: 0.850 1st Qu.: 1.200 1st Qu.: 15000
## Median : 3.650 Median : 6.400 Median : 32000
## Mean : 4.672 Mean : 7.637 Mean : 36908
## 3rd Qu.: 6.000 3rd Qu.: 9.900 3rd Qu.: 48384
## Max. :35.000 Max. :92.600 Max. :500000
# Calculate specific sample statistics
# Mean
mean_price_to_sell <- mean(car_data_clean$Selling_Price)
# Median
median_price_to_sell <- median(car_data_clean$Selling_Price)
# Standard Deviation
sd_price_to_sell <- sd(car_data_clean$Selling_Price)
# Quantile
quantile_price_to_sell <- quantile(car_data_clean$Selling_Price, 0.30)
Explanation of the samples statistics:
TASK 1: 4 Graphics of distribution of variables
# Load ggplot2 library
library(ggplot2)
# Plot a histogram for 'Selling_Price'
ggplot(car_data_clean, aes(x = Selling_Price)) +
geom_histogram(binwidth = 0.5, fill = "blue", color = "black") +
labs(title = "Distribution of Selling Price", x = "Selling Price (Units)", y = "Frequency")
Interpreting the histogram:
The sales price histogram shows a right-skewed distribution, where most cars are sold at lower prices and only a few cars are sold at higher prices.
# Scatterplot of Kms_Driven vs Selling_Price
ggplot(car_data_clean, aes(x = Kms_Driven, y = Selling_Price)) +
geom_point(color = "red") +
labs(title = "Scatterplot of Kms Driven vs Selling Price",
x = "Kms Driven", y = "Selling Price (Units)")
Interpretation of the scatterplot (km_driven vs. sales_price):
There appears to be a negative relationship between the number of kilometers driven and sales_price. As km_driven increases, sales_price tends to decrease. This is intuitive because cars that have been driven more often experience more wear and tear, leading to lower resale prices.
There are some outlier cases where cars with a very high km_driven (around 200,000–500,000 km) still have higher sales prices (above 10 units). This could be because these cars have unique attributes (e.g. being luxury or rare models) that allow them to maintain a higher value despite the many kilometers driven.
While there is an overall negative association, the graph does not show a clear and strong linear relationship. Factors other than km_driven (e.g. car model, condition, age or fuel type) are also likely to significantly influence the selling price.
# Boxplot of Selling Price by Fuel Type
ggplot(car_data_clean, aes(x = as.factor(Fuel_Type), y = Selling_Price)) +
geom_boxplot(fill = "orange", color = "black") +
labs(title = "Boxplot of Selling Price by Fuel Type",
x = "Fuel Type", y = "Selling Price (Units)")
Interpretation of the box plot (selling price by fuel type):
The median selling price of diesel cars is higher than that of gasoline cars. The thick horizontal line inside the boxes shows the median value, which is approximately 10 units for diesel cars and slightly higher than 5 units for gasoline cars.
The interquartile range (IQR) is wider for diesel cars compared to gasoline cars. This suggests that there is greater variability in the selling prices of diesel cars, indicating that prices are spread over a wider range. In contrast, gasoline car prices tend to be concentrated in a narrower range around the median.
Both diesel and gasoline cars have outliers (the black dots outside the boxes). These outliers represent cars with significantly higher prices than the rest. Diesel cars show outliers above 30 units, while gasoline cars have outliers between 15 and 20 units. These vehicles could be high-end or luxury models that maintain a high price despite the type of fuel.
You have a data set for 100 MBA students of the current generation. In the previous year, the average grade in this program was 74.
1 Graph the distribution of undergrad degrees using the ggplot function. Which degree is the most common? 2 Show the descriptive statistics of the Annual Salary and its distribution with the histogram (use the ggplot function). Describe the distribution. 3 Test the following hypothesis: 𝐻0: 𝜇MBA Grade = 74. Explain the result and interpret the effect size.
# Read data
library(readxl)
# Use read_excel function from readxl package
MBAdata <- read_xlsx("Business School.xlsx")
MBAdata <- as.data.frame(MBAdata)
TASK 2: 1
# Load necessary library
library(ggplot2)
# Create a bar chart for the distribution of Undergraduate Degrees
ggplot(MBAdata, aes(x = `Undergrad Degree`, fill = `Undergrad Degree`)) +
geom_bar(color = "black") +
labs(title = "Distribution of Undergraduate Degrees", x = "Undergraduate Degree", y = "Count") +
theme_minimal()
Interpretation:
This distribution suggests that most MBA students come from business-related backgrounds, followed by technical degrees like Computer Science and Finance. Engineering and Art have a much lower representation.
TASK 2: 2
# Descriptive Statistics
mean_salary <- mean(MBAdata$`Annual Salary`, na.rm = TRUE)
median_salary <- median(MBAdata$`Annual Salary`, na.rm = TRUE)
min_salary <- min(MBAdata$`Annual Salary`, na.rm = TRUE)
max_salary <- max(MBAdata$`Annual Salary`, na.rm = TRUE)
std_dev_salary <- sd(MBAdata$`Annual Salary`, na.rm = TRUE)
# Print Descriptive Statistics
mean_salary
## [1] 109058
median_salary
## [1] 103500
min_salary
## [1] 20000
max_salary
## [1] 340000
std_dev_salary
## [1] 41501.49
data.frame(
Mean = mean_salary,
Median = median_salary,
Min = min_salary,
Max = max_salary,
Std_Dev = std_dev_salary
)
## Mean Median Min Max Std_Dev
## 1 109058 103500 20000 340000 41501.49
# Plot Histogram of Annual Salary
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
##
## col_factor
library(ggplot2)
library(scales)
ggplot(MBAdata, aes(x = `Annual Salary`)) +
geom_histogram(binwidth = 10000, fill = "blue", color = "black", alpha = 0.7) +
labs(title = "Distribution of Annual Salary", x = "Annual Salary", y = "Frequency") +
scale_x_continuous(labels = comma) +
theme_minimal()
Interpretation: The histogram of annual salaries reveals a right-skewed
distribution, where most students earn within the middle salary range of
$90,000 to $110,000, as indicated by the tallest bars.
The mean salary is around $109,058, while the median is slightly lower at $103,500, further suggesting a right-skewed pattern.
A few students earn significantly higher salaries, exceeding $200,000 or even $300,000, which stretches the tail of the distribution to the right. These high salaries can be considered outliers, potentially pulling the mean salary higher than the median.
The concentration of students within the $90,000 to $100,000 range reflects typical earnings, but the presence of these outliers notably influences the overall distribution.
TASK 2: 3
#Null Hypothesis (𝐻0): The mean MBA grade is equal to 74 (μ=74)
# Calculate the average MBA grade
average_mba_grade <- mean(MBAdata$`MBA Grade`, na.rm = TRUE)
# Print the average MBA grade
average_mba_grade
## [1] 76.04055
# Perform a one-sample t-test
t_test_result <- t.test(MBAdata$`MBA Grade`, mu = 74)
# Print the t-test results
t_test_result
##
## One Sample t-test
##
## data: MBAdata$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
Interpretation: The t-statistic of 2.6587, with a p-value of 0.00915, indicates that the difference between the sample mean (76.04) and the hypothesized mean (74) is statistically significant, as the p-value is less than the standard significance level of 0.05. This suggests that the sample mean is about 2.66 times larger than the variation in the data.
Since the p-value is below 0.05, you can reject the null hypothesis and conclude that the average MBA grade in the sample is significantly different from 74.
Additionally, the confidence interval (74.52 to 77.56) supports this conclusion, as it does not include the value 74.
library(readxl)
apartments_data <- read_excel("./Apartments.xlsx")
head(apartments_data)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description:
apartments_data$Parking <- factor(apartments_data$Parking, levels = c(0, 1), labels = c("No", "Yes"))
apartments_data$Balcony <- factor(apartments_data$Balcony, levels = c(0, 1), labels = c("No", "Yes"))
str(apartments_data)
## tibble [85 × 5] (S3: tbl_df/tbl/data.frame)
## $ Age : num [1:85] 7 18 7 28 18 28 14 18 22 25 ...
## $ Distance: num [1:85] 28 1 28 29 18 12 20 6 7 2 ...
## $ Price : num [1:85] 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
## $ Parking : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 2 2 2 ...
## $ Balcony : Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 2 2 1 1 ...
mean_price <- mean(apartments_data$Price)
sd_price <- sd(apartments_data$Price)
n_price <- length(apartments_data$Price)
mean_price
## [1] 2018.941
sd_price
## [1] 377.8417
n_price
## [1] 85
t_test_result <- t.test(apartments_data$Price, mu = 1900)
t_test_result
##
## One Sample t-test
##
## data: apartments_data$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
The mean price of the apartments is 2018.941 EUR.
Since the p-value (0.004731) is less than the significance level (typically 0.05), we reject the null hypothesis H0:μPrice = 1900. This indicates that the mean price of apartments is significantly different from 1900 EUR.
# Estimate the simple regression function: Price = f(Age)
fit1 <- lm(Price ~ Age, data = apartments_data)
# View the summary of the regression results
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = apartments_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
# Calculate the correlation between Price and Age
correlation <- cor(apartments_data$Price, apartments_data$Age)
# Display the correlation
correlation
## [1] -0.230255
The regression coefficient represents the relationship between the independent variable (Age) and the dependent variable (Price). In this case, it tells us how much the apartment price changes as the age of the apartment increases by one year. A coefficient of -8.975 means that for each additional year of age, the apartment price decreases by 8.975 EUR on average. This negative sign indicates an inverse relationship: as apartments age, their prices tend to decrease.
Intercept = 2185.455: The intercept represents the expected price when the age of the apartment is 0. In this case, when an apartment is newly built (age = 0), the estimated price is 2185.455 EUR.
Correlation coefficient (r = -0.230255): In this case, r = -0.230255 indicates: a weak negative correlation between Age and Price. This means that as the Age of the apartment increases, the Price tends to decrease slightly. A perfect correlation would be closer to -1 (for a perfect negative correlation) or +1 (for a perfect positive correlation). Since their correlation is -0.23, it suggests that the relationship is weak, meaning that other factors (besides Age) likely play a significant role in determining the price of the apartment.
Coefficient of determination (R² = 0.05302): The coefficient of determination (R² = 5.3%) suggests that only a small part of the variation in price can be explained by the age of the apartment, and that most of the variation is due to other factors.
library(car)
## Loading required package: carData
scatterplotMatrix(~ Price + Age + Distance, data = apartments_data, main = "Scatterplot Matrix")
Interpretation:
Based on the scatterplot between Age and Distance, there doesn’t seem to be a multicollinearity problem. The lack of a clear linear relationship between Age and Distance suggests that these two independent variables are not highly correlated.
fit2 <- lm(Price ~ Age + Distance, data = apartments_data)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = apartments_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
library(car)
vif(fit2)
## Age Distance
## 1.001845 1.001845
Interpretation:
The VIF values for both Age and Distance are approximately 1.0018, which are very low and well below the common threshold of 5.
The low VIF values confirm that multicollinearity is not an issue in your multiple regression model, and the coefficients for Age and Distance can be interpreted independently without concern for inflated variances.
apartments_data$StdResid <- round(rstandard(fit2), 3) #Standardized residuals
apartments_data$CooksD <- round(cooks.distance(fit2), 3) #Cooks distances
apartments_data$fitted <- fitted(fit2)
head(apartments_data[order(-apartments_data$StdResid),], 3) #Three units with lowest value of stand. residuals
## # A tibble: 3 × 8
## Age Distance Price Parking Balcony StdResid CooksD fitted
## <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl> <dbl>
## 1 5 45 2180 Yes Yes 2.58 0.32 1490.
## 2 2 11 2790 Yes No 2.05 0.069 2217.
## 3 18 1 2800 Yes No 1.78 0.03 2297.
head(apartments_data[order(-apartments_data$CooksD),], 6) #Six units with highest value of Cooks distance
## # A tibble: 6 × 8
## Age Distance Price Parking Balcony StdResid CooksD fitted
## <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl> <dbl>
## 1 5 45 2180 Yes Yes 2.58 0.32 1490.
## 2 43 37 1740 No No 1.44 0.104 1354.
## 3 2 11 2790 Yes No 2.05 0.069 2217.
## 4 7 2 1760 No Yes -2.15 0.066 2363.
## 5 37 3 2540 Yes Yes 1.58 0.061 2105.
## 6 40 2 2400 No Yes 1.09 0.038 2101.
apartments_data <- apartments_data[c(-38,-55), ]
fit2_cleaned <- lm(Price ~ Age + Distance, data = apartments_data)
summary(fit2_cleaned)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = apartments_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -627.27 -212.96 -46.23 205.05 578.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2490.112 76.189 32.684 < 2e-16 ***
## Age -7.850 3.244 -2.420 0.0178 *
## Distance -23.945 2.826 -8.473 9.53e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared: 0.4968, Adjusted R-squared: 0.4842
## F-statistic: 39.49 on 2 and 80 DF, p-value: 1.173e-12
Interpretation:
After running the commands to identify outliers or influential points, the result show us no outliers (for standardized residuals) and no influential points (for Cook’s Distance).
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(fit2)
##
## studentized Breusch-Pagan test
##
## data: fit2
## BP = 4.4439, df = 2, p-value = 0.1084
Interpretation:
The p-value (0.1084) is greater than 0.05, which means we fail to reject the null hypothesis of homoskedasticity. This suggests that there is no significant evidence of heteroskedasticity in our model. The residuals appear to have constant variance, meaning heteroskedasticity is not a problem for this regression model.
shapiro_test <- shapiro.test(apartments_data$StdResid)
shapiro_test
##
## Shapiro-Wilk normality test
##
## data: apartments_data$StdResid
## W = 0.94963, p-value = 0.002636
hist(apartments_data$StdResid,
main = "Histogram of Standardized Residuals",
xlab = "Standardized Residuals",
breaks = 20,
col = "lightblue",
border = "black")
plot(apartments_data$fitted, apartments_data$StdResid,
main = "Scatterplot of Standardized Residuals vs Fitted Values",
xlab = "Fitted Values",
ylab = "Standardized Residuals",
pch = 19, col = "blue")
Interpretation:
p-value = 0.00366 is less than 0.05, which means that we reject the null hypothesis that the standardized residuals are normally distributed. This indicates that the standardized residuals are not normally distributed.
The scatterplot shows how the standardized residuals are distributed relative to the fitted values of the regression model. Therefore, there is no clear pattern in the scatterplot, which is a good sign. The points appear to be randomly scattered, suggesting that there is no heteroscedasticity.
fit2 <- lm(Price ~ Age + Distance, data = apartments_data)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = apartments_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -627.27 -212.96 -46.23 205.05 578.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2490.112 76.189 32.684 < 2e-16 ***
## Age -7.850 3.244 -2.420 0.0178 *
## Distance -23.945 2.826 -8.473 9.53e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared: 0.4968, Adjusted R-squared: 0.4842
## F-statistic: 39.49 on 2 and 80 DF, p-value: 1.173e-12
Interpretation:
Intercept (2460,101): Represents the expected price of an apartment when both age and distance are 0. In other words, when an apartment is new and located in the city center (distance = 0), the estimated price is 2460,101 EUR.
Age coefficient (-7.934): For each additional year of age of the apartment, the price is expected to decrease by 7.934 EUR, holding distance constant.
The p-value for age is 0.016, which is less than 0.05, indicating that age is a statistically significant predictor of price.
Distance coefficient (-20.667): For each additional unit of distance from the city center, the price decreases by 20.667 EUR, holding age constant.
The p-value for distance is 6.18e-11, which is extremely small, indicating that distance is a highly significant predictor of price.
# 1. Check the structure of the data
str(apartments_data)
## tibble [83 × 8] (S3: tbl_df/tbl/data.frame)
## $ Age : num [1:83] 7 18 7 28 18 28 14 18 22 25 ...
## $ Distance: num [1:83] 28 1 28 29 18 12 20 6 7 2 ...
## $ Price : num [1:83] 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
## $ Parking : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 2 2 2 ...
## $ Balcony : Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 2 2 1 1 ...
## $ StdResid: Named num [1:83] -0.665 1.783 -0.594 0.754 -1.073 ...
## ..- attr(*, "names")= chr [1:83] "1" "2" "3" "4" ...
## $ CooksD : Named num [1:83] 0.007 0.03 0.006 0.008 0.005 0.005 0.001 0.004 0.001 0.017 ...
## ..- attr(*, "names")= chr [1:83] "1" "2" "3" "4" ...
## $ fitted : Named num [1:83] 1826 2297 1826 1639 1945 ...
## ..- attr(*, "names")= chr [1:83] "1" "2" "3" "4" ...
# 2. Handle missing values: Remove rows with NA values for Parking and Balcony
apartments_data_clean <- na.omit(apartments_data[, c("Price", "Age", "Distance", "Parking", "Balcony")])
# 3. Ensure Parking and Balcony are treated as factors
apartments_data_clean$Parking <- factor(apartments_data_clean$Parking, levels = c(0, 1), labels = c("No", "Yes"))
apartments_data_clean$Balcony <- factor(apartments_data_clean$Balcony, levels = c(0, 1), labels = c("No", "Yes"))
# 4. Check unique levels in Parking and Balcony
if (length(unique(apartments_data_clean$Parking)) == 1) {
apartments_data_clean$Parking <- NULL # Remove Parking if it has only one level
print("Parking has only one level, removing from model.")
}
## [1] "Parking has only one level, removing from model."
if (length(unique(apartments_data_clean$Balcony)) == 1) {
apartments_data_clean$Balcony <- NULL # Remove Balcony if it has only one level
print("Balcony has only one level, removing from model.")
}
## [1] "Balcony has only one level, removing from model."
# 5. Create dynamic regression formula
formula <- "Price ~ Age + Distance"
# Add Parking if it still exists
if (!is.null(apartments_data_clean$Parking)) {
formula <- paste(formula, "+ Parking")
}
## Warning: Unknown or uninitialised column: `Parking`.
# Add Balcony if it still exists
if (!is.null(apartments_data_clean$Balcony)) {
formula <- paste(formula, "+ Balcony")
}
## Warning: Unknown or uninitialised column: `Balcony`.
# 6. Convert the formula to actual formula object
formula <- as.formula(formula)
# 7. Fit the linear regression model with the correct variables
fit3 <- lm(formula, data = apartments_data_clean)
# 8. Show the summary of the model
summary(fit3)
##
## Call:
## lm(formula = formula, data = apartments_data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -627.27 -212.96 -46.23 205.05 578.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2490.112 76.189 32.684 < 2e-16 ***
## Age -7.850 3.244 -2.420 0.0178 *
## Distance -23.945 2.826 -8.473 9.53e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared: 0.4968, Adjusted R-squared: 0.4842
## F-statistic: 39.49 on 2 and 80 DF, p-value: 1.173e-12
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 80 5982100
## 2 80 5982100 0 0
Interpretation:
The ANOVA test shows that fit3 does not improve the fit over fit2 because they are identical after removing Parking and Balcony. Since Parking and Balcony were removed due to lack of variability, it’s expected that both models perform exactly the same.
summary(fit3)
##
## Call:
## lm(formula = formula, data = apartments_data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -627.27 -212.96 -46.23 205.05 578.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2490.112 76.189 32.684 < 2e-16 ***
## Age -7.850 3.244 -2.420 0.0178 *
## Distance -23.945 2.826 -8.473 9.53e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared: 0.4968, Adjusted R-squared: 0.4842
## F-statistic: 39.49 on 2 and 80 DF, p-value: 1.173e-12
fit3_summary <- summary(fit3)
fit3_summary$fstatistic
## value numdf dendf
## 39.49241 2.00000 80.00000
Interpretation: Age and Distance interpretation (Parking and Balcony removed): - Age: Represents the price change for each additional year in the age of the apartment. - Distance: Represents the price change for each additional unit of distance from the city center.
Hypothesis being tested: H₀: All coefficients except the intercept are zero. H₁: At least one coefficient is not zero.
Since the p-value is much smaller than 0.05 (4.896e-11), we reject the null hypothesis (H₀). This means that at least one of the variables, Age or Distance, significantly explains the variation in Price.
fitted_values <- fitted(fit3)
residuals_values <- residuals(fit3)
fitted_value_ID2 <- fitted_values[2]
residual_ID2 <- residuals_values[2]
cat("Fitted value for apartment ID 2:", fitted_value_ID2, "\n")
## Fitted value for apartment ID 2: 2324.86
cat("Residual for apartment ID 2:", residual_ID2, "\n")
## Residual for apartment ID 2: 475.1398