Task 1

Find your own data set that contains at least 20 units and at least 4 variables (most of which are numeric, but it is good to have at least one categorical variable as well). Perform the following steps using the R program:

Explain the data set (the variables used in the analysis).
Perform some data manipulations (create new variable, delete some units due to missing data, rename variables, create new data.frame based on conditions, etc.).
Present the descriptive statistics for the selected variables and explain at least 3 sample statistics (mean, median, etc.).
Graph the distribution of the variables using histograms, scatterplots, and/or boxplots. Explain the results.

TASK 1:1 Explain data set

library(readr)
car_data <- read_csv("./car data.csv")

## Rows: 301 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Car_Name, Fuel_Type, Seller_Type, Transmission
## dbl (5): Year, Selling_Price, Present_Price, Kms_Driven, Owner
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(car_data)

## # A tibble: 6 × 9
##   Car_Name     Year Selling_Price Present_Price Kms_Driven Fuel_Type Seller_Type
##   <chr>       <dbl>         <dbl>         <dbl>      <dbl> <chr>     <chr>      
## 1 ritz         2014          3.35          5.59      27000 Petrol    Dealer     
## 2 sx4          2013          4.75          9.54      43000 Diesel    Dealer     
## 3 ciaz         2017          7.25          9.85       6900 Petrol    Dealer     
## 4 wagon r      2011          2.85          4.15       5200 Petrol    Dealer     
## 5 swift        2014          4.6           6.87      42450 Diesel    Dealer     
## 6 vitara bre…  2018          9.25          9.83       2071 Diesel    Dealer     
## # ℹ 2 more variables: Transmission <chr>, Owner <dbl>

Dataset description:

Car_Name:Variable representing the name and model of the car.
Year: Numeric Variable indicating the year the car was bought.
Selling_Price: Variable representing the price at which the owner wants to sell the car.
Present_Price: Variable representing the current ex-showroom price of the car.
Kms_Driven: Variable showing the total kilometers the car has been driven.
Fuel_Type: Variable representing the type of fuel the car uses (e.g., Petrol, Diesel, etc.).
Seller_Type: Variable defining whether the seller is a dealer or an individual.
Transmission: Variable defining whether the car has a manual or automatic transmission.
Owner: Variable indicating how many owners the car has had previously.

This dataset contains information about cars being sold, including their specifications and details about the sellers.

# Check the structure of the dataset
str(car_data)

## spc_tbl_ [301 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Car_Name     : chr [1:301] "ritz" "sx4" "ciaz" "wagon r" ...
##  $ Year         : num [1:301] 2014 2013 2017 2011 2014 ...
##  $ Selling_Price: num [1:301] 3.35 4.75 7.25 2.85 4.6 9.25 6.75 6.5 8.75 7.45 ...
##  $ Present_Price: num [1:301] 5.59 9.54 9.85 4.15 6.87 9.83 8.12 8.61 8.89 8.92 ...
##  $ Kms_Driven   : num [1:301] 27000 43000 6900 5200 42450 ...
##  $ Fuel_Type    : chr [1:301] "Petrol" "Diesel" "Petrol" "Petrol" ...
##  $ Seller_Type  : chr [1:301] "Dealer" "Dealer" "Dealer" "Dealer" ...
##  $ Transmission : chr [1:301] "Manual" "Manual" "Manual" "Manual" ...
##  $ Owner        : num [1:301] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Car_Name = col_character(),
##   ..   Year = col_double(),
##   ..   Selling_Price = col_double(),
##   ..   Present_Price = col_double(),
##   ..   Kms_Driven = col_double(),
##   ..   Fuel_Type = col_character(),
##   ..   Seller_Type = col_character(),
##   ..   Transmission = col_character(),
##   ..   Owner = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

# View the first few rows
head(car_data, 6)

## # A tibble: 6 × 9
##   Car_Name     Year Selling_Price Present_Price Kms_Driven Fuel_Type Seller_Type
##   <chr>       <dbl>         <dbl>         <dbl>      <dbl> <chr>     <chr>      
## 1 ritz         2014          3.35          5.59      27000 Petrol    Dealer     
## 2 sx4          2013          4.75          9.54      43000 Diesel    Dealer     
## 3 ciaz         2017          7.25          9.85       6900 Petrol    Dealer     
## 4 wagon r      2011          2.85          4.15       5200 Petrol    Dealer     
## 5 swift        2014          4.6           6.87      42450 Diesel    Dealer     
## 6 vitara bre…  2018          9.25          9.83       2071 Diesel    Dealer     
## # ℹ 2 more variables: Transmission <chr>, Owner <dbl>

TASK 1:2 Data manipulations

# Create a new variable 'Car_Age' based on the difference between the current year and the car's year
car_data$Car_Age <- 2024 - car_data$Year

# View the first few rows to confirm the new variable
head(car_data, 6)

## # A tibble: 6 × 10
##   Car_Name     Year Selling_Price Present_Price Kms_Driven Fuel_Type Seller_Type
##   <chr>       <dbl>         <dbl>         <dbl>      <dbl> <chr>     <chr>      
## 1 ritz         2014          3.35          5.59      27000 Petrol    Dealer     
## 2 sx4          2013          4.75          9.54      43000 Diesel    Dealer     
## 3 ciaz         2017          7.25          9.85       6900 Petrol    Dealer     
## 4 wagon r      2011          2.85          4.15       5200 Petrol    Dealer     
## 5 swift        2014          4.6           6.87      42450 Diesel    Dealer     
## 6 vitara bre…  2018          9.25          9.83       2071 Diesel    Dealer     
## # ℹ 3 more variables: Transmission <chr>, Owner <dbl>, Car_Age <dbl>

# Since I have no missing data, I proposed to remove 2 data that seem irrelevant given the amount of data. These irrelevant data are CNG combustion cars.

# Remove rows where Fuel_Type is 'CNG'
car_data_clean <- subset(car_data, Fuel_Type != "CNG")

# Check the first few rows to confirm the removal of CNG cars
head(car_data_clean, 6)

## # A tibble: 6 × 10
##   Car_Name     Year Selling_Price Present_Price Kms_Driven Fuel_Type Seller_Type
##   <chr>       <dbl>         <dbl>         <dbl>      <dbl> <chr>     <chr>      
## 1 ritz         2014          3.35          5.59      27000 Petrol    Dealer     
## 2 sx4          2013          4.75          9.54      43000 Diesel    Dealer     
## 3 ciaz         2017          7.25          9.85       6900 Petrol    Dealer     
## 4 wagon r      2011          2.85          4.15       5200 Petrol    Dealer     
## 5 swift        2014          4.6           6.87      42450 Diesel    Dealer     
## 6 vitara bre…  2018          9.25          9.83       2071 Diesel    Dealer     
## # ℹ 3 more variables: Transmission <chr>, Owner <dbl>, Car_Age <dbl>

# Check how many rows were removed (original vs clean data)
nrow(car_data)       # Number of rows before

## [1] 301

nrow(car_data_clean) # Number of rows after removing CNG cars

## [1] 299

# Rename a column in the dataset
names(car_data)[names(car_data) == "Selling_Price"] <- "Price_To_Sell"

# View the first few rows to confirm the rename
head(car_data, 6)

## # A tibble: 6 × 10
##   Car_Name     Year Price_To_Sell Present_Price Kms_Driven Fuel_Type Seller_Type
##   <chr>       <dbl>         <dbl>         <dbl>      <dbl> <chr>     <chr>      
## 1 ritz         2014          3.35          5.59      27000 Petrol    Dealer     
## 2 sx4          2013          4.75          9.54      43000 Diesel    Dealer     
## 3 ciaz         2017          7.25          9.85       6900 Petrol    Dealer     
## 4 wagon r      2011          2.85          4.15       5200 Petrol    Dealer     
## 5 swift        2014          4.6           6.87      42450 Diesel    Dealer     
## 6 vitara bre…  2018          9.25          9.83       2071 Diesel    Dealer     
## # ℹ 3 more variables: Transmission <chr>, Owner <dbl>, Car_Age <dbl>

# Create a new data frame based on certain conditions
petrol_cars <- subset(car_data, Fuel_Type == "Petrol" & Kms_Driven < 50000)

# View the new data frame
head(petrol_cars, 6)

## # A tibble: 6 × 10
##   Car_Name  Year Price_To_Sell Present_Price Kms_Driven Fuel_Type Seller_Type
##   <chr>    <dbl>         <dbl>         <dbl>      <dbl> <chr>     <chr>      
## 1 ritz      2014          3.35          5.59      27000 Petrol    Dealer     
## 2 ciaz      2017          7.25          9.85       6900 Petrol    Dealer     
## 3 wagon r   2011          2.85          4.15       5200 Petrol    Dealer     
## 4 ciaz      2015          6.75          8.12      18796 Petrol    Dealer     
## 5 alto 800  2017          2.85          3.6        2135 Petrol    Dealer     
## 6 ciaz      2015          7.5           9.94      15000 Petrol    Dealer     
## # ℹ 3 more variables: Transmission <chr>, Owner <dbl>, Car_Age <dbl>

TASK 1:3 Descriptive data statistics

# Summary of descriptive statistics for the selected variables
summary(car_data_clean[, c("Selling_Price", "Present_Price", "Kms_Driven")])

##  Selling_Price    Present_Price      Kms_Driven    
##  Min.   : 0.100   Min.   : 0.320   Min.   :   500  
##  1st Qu.: 0.850   1st Qu.: 1.200   1st Qu.: 15000  
##  Median : 3.650   Median : 6.400   Median : 32000  
##  Mean   : 4.672   Mean   : 7.637   Mean   : 36908  
##  3rd Qu.: 6.000   3rd Qu.: 9.900   3rd Qu.: 48384  
##  Max.   :35.000   Max.   :92.600   Max.   :500000

# Calculate specific sample statistics

# Mean
mean_price_to_sell <- mean(car_data_clean$Selling_Price)

# Median
median_price_to_sell <- median(car_data_clean$Selling_Price)

# Standard Deviation
sd_price_to_sell <- sd(car_data_clean$Selling_Price)

# Quantile
quantile_price_to_sell <- quantile(car_data_clean$Selling_Price, 0.30)

Explanation of the samples statistics:

The mean selling price is 4.67 units, which represents the average price at which cars are being sold.
The median selling price is 3.65 units, which means that half the cars are sold for less than 3.65 units and the other half for more than 3.65 units.
The standard deviation of 5.10 units indicates a large spread in the selling prices, meaning there is significant variability in the prices at which cars are being sold.
The 30% of the cars are sold for 1.15 units or less, while the remaining 70% are sold for a higher price.

TASK 1: 4 Graphics of distribution of variables

# Load ggplot2 library
library(ggplot2)

# Plot a histogram for 'Selling_Price'
ggplot(car_data_clean, aes(x = Selling_Price)) + 
  geom_histogram(binwidth = 0.5, fill = "blue", color = "black") + 
  labs(title = "Distribution of Selling Price", x = "Selling Price (Units)", y = "Frequency")

Interpreting the histogram:

The sales price histogram shows a right-skewed distribution, where most cars are sold at lower prices and only a few cars are sold at higher prices.

# Scatterplot of Kms_Driven vs Selling_Price
ggplot(car_data_clean, aes(x = Kms_Driven, y = Selling_Price)) + 
  geom_point(color = "red") + 
  labs(title = "Scatterplot of Kms Driven vs Selling Price", 
       x = "Kms Driven", y = "Selling Price (Units)")

Interpretation of the scatterplot (km_driven vs. sales_price):

There appears to be a negative relationship between the number of kilometers driven and sales_price. As km_driven increases, sales_price tends to decrease. This is intuitive because cars that have been driven more often experience more wear and tear, leading to lower resale prices.

There are some outlier cases where cars with a very high km_driven (around 200,000–500,000 km) still have higher sales prices (above 10 units). This could be because these cars have unique attributes (e.g. being luxury or rare models) that allow them to maintain a higher value despite the many kilometers driven.

While there is an overall negative association, the graph does not show a clear and strong linear relationship. Factors other than km_driven (e.g. car model, condition, age or fuel type) are also likely to significantly influence the selling price.

# Boxplot of Selling Price by Fuel Type
ggplot(car_data_clean, aes(x = as.factor(Fuel_Type), y = Selling_Price)) + 
  geom_boxplot(fill = "orange", color = "black") + 
  labs(title = "Boxplot of Selling Price by Fuel Type", 
       x = "Fuel Type", y = "Selling Price (Units)")

Interpretation of the box plot (selling price by fuel type):

The median selling price of diesel cars is higher than that of gasoline cars. The thick horizontal line inside the boxes shows the median value, which is approximately 10 units for diesel cars and slightly higher than 5 units for gasoline cars.

The interquartile range (IQR) is wider for diesel cars compared to gasoline cars. This suggests that there is greater variability in the selling prices of diesel cars, indicating that prices are spread over a wider range. In contrast, gasoline car prices tend to be concentrated in a narrower range around the median.

Both diesel and gasoline cars have outliers (the black dots outside the boxes). These outliers represent cars with significantly higher prices than the rest. Diesel cars show outliers above 30 units, while gasoline cars have outliers between 15 and 20 units. These vehicles could be high-end or luxury models that maintain a high price despite the type of fuel.

Task 2

You have a data set for 100 MBA students of the current generation. In the previous year, the average grade in this program was 74.

1 Graph the distribution of undergrad degrees using the ggplot function. Which degree is the most common? 2 Show the descriptive statistics of the Annual Salary and its distribution with the histogram (use the ggplot function). Describe the distribution. 3 Test the following hypothesis: 𝐻0: 𝜇MBA Grade = 74. Explain the result and interpret the effect size.

# Read  data 
library(readxl)

# Use read_excel function from readxl package
MBAdata <- read_xlsx("Business School.xlsx")
MBAdata <- as.data.frame(MBAdata)

TASK 2: 1

# Load necessary library
library(ggplot2)

# Create a bar chart for the distribution of Undergraduate Degrees

ggplot(MBAdata, aes(x = `Undergrad Degree`, fill = `Undergrad Degree`)) +
  geom_bar(color = "black") +
  labs(title = "Distribution of Undergraduate Degrees", x = "Undergraduate Degree", y = "Count") +
  theme_minimal()

Interpretation:

This distribution suggests that most MBA students come from business-related backgrounds, followed by technical degrees like Computer Science and Finance. Engineering and Art have a much lower representation.

TASK 2: 2

# Descriptive Statistics
mean_salary <- mean(MBAdata$`Annual Salary`, na.rm = TRUE)
median_salary <- median(MBAdata$`Annual Salary`, na.rm = TRUE)
min_salary <- min(MBAdata$`Annual Salary`, na.rm = TRUE)
max_salary <- max(MBAdata$`Annual Salary`, na.rm = TRUE)
std_dev_salary <- sd(MBAdata$`Annual Salary`, na.rm = TRUE)


# Print Descriptive Statistics
mean_salary

## [1] 109058

median_salary

## [1] 103500

min_salary

## [1] 20000

max_salary

## [1] 340000

std_dev_salary

## [1] 41501.49

data.frame(
  Mean = mean_salary,
  Median = median_salary,
  Min = min_salary,
  Max = max_salary,
  Std_Dev = std_dev_salary
)

##     Mean Median   Min    Max  Std_Dev
## 1 109058 103500 20000 340000 41501.49

# Plot Histogram of Annual Salary
library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:readr':
## 
##     col_factor

library(ggplot2)
library(scales)

ggplot(MBAdata, aes(x = `Annual Salary`)) +
  geom_histogram(binwidth = 10000, fill = "blue", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Annual Salary", x = "Annual Salary", y = "Frequency") +
  scale_x_continuous(labels = comma) +
  theme_minimal()

Interpretation: The histogram of annual salaries reveals a right-skewed distribution, where most students earn within the middle salary range of $90,000 to $110,000, as indicated by the tallest bars.

The mean salary is around $109,058, while the median is slightly lower at $103,500, further suggesting a right-skewed pattern.

A few students earn significantly higher salaries, exceeding $200,000 or even $300,000, which stretches the tail of the distribution to the right. These high salaries can be considered outliers, potentially pulling the mean salary higher than the median.

The concentration of students within the $90,000 to $100,000 range reflects typical earnings, but the presence of these outliers notably influences the overall distribution.

TASK 2: 3

#Null Hypothesis (𝐻0): The mean MBA grade is equal to 74 (μ=74)

# Calculate the average MBA grade
average_mba_grade <- mean(MBAdata$`MBA Grade`, na.rm = TRUE)

# Print the average MBA grade
average_mba_grade

## [1] 76.04055

# Perform a one-sample t-test
t_test_result <- t.test(MBAdata$`MBA Grade`, mu = 74)

# Print the t-test results
t_test_result

## 
##  One Sample t-test
## 
## data:  MBAdata$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055

Interpretation: The t-statistic of 2.6587, with a p-value of 0.00915, indicates that the difference between the sample mean (76.04) and the hypothesized mean (74) is statistically significant, as the p-value is less than the standard significance level of 0.05. This suggests that the sample mean is about 2.66 times larger than the variation in the data.

Since the p-value is below 0.05, you can reject the null hypothesis and conclude that the average MBA grade in the sample is significantly different from 74.

Additionally, the confidence interval (74.52 to 77.56) supports this conclusion, as it does not include the value 74.

Task 3

Import the dataset Apartments.xlsx

library(readxl)

apartments_data <- read_excel("./Apartments.xlsx")

head(apartments_data)

## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

Description:

Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes

Change categorical variables into factors.

apartments_data$Parking <- factor(apartments_data$Parking, levels = c(0, 1), labels = c("No", "Yes"))

apartments_data$Balcony <- factor(apartments_data$Balcony, levels = c(0, 1), labels = c("No", "Yes"))

str(apartments_data)

## tibble [85 × 5] (S3: tbl_df/tbl/data.frame)
##  $ Age     : num [1:85] 7 18 7 28 18 28 14 18 22 25 ...
##  $ Distance: num [1:85] 28 1 28 29 18 12 20 6 7 2 ...
##  $ Price   : num [1:85] 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
##  $ Parking : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 2 2 2 ...
##  $ Balcony : Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 2 2 1 1 ...

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

mean_price <- mean(apartments_data$Price)

sd_price <- sd(apartments_data$Price)

n_price <- length(apartments_data$Price)

mean_price

## [1] 2018.941

sd_price

## [1] 377.8417

n_price

## [1] 85

t_test_result <- t.test(apartments_data$Price, mu = 1900)

t_test_result

## 
##  One Sample t-test
## 
## data:  apartments_data$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

The mean price of the apartments is 2018.941 EUR.

Since the p-value (0.004731) is less than the significance level (typically 0.05), we reject the null hypothesis H0:μPrice = 1900. This indicates that the mean price of apartments is significantly different from 1900 EUR.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

# Estimate the simple regression function: Price = f(Age)
fit1 <- lm(Price ~ Age, data = apartments_data)

# View the summary of the regression results
summary(fit1)

## 
## Call:
## lm(formula = Price ~ Age, data = apartments_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

# Calculate the correlation between Price and Age
correlation <- cor(apartments_data$Price, apartments_data$Age)

# Display the correlation
correlation

## [1] -0.230255

Regression coefficient = -8.975

The regression coefficient represents the relationship between the independent variable (Age) and the dependent variable (Price). In this case, it tells us how much the apartment price changes as the age of the apartment increases by one year. A coefficient of -8.975 means that for each additional year of age, the apartment price decreases by 8.975 EUR on average. This negative sign indicates an inverse relationship: as apartments age, their prices tend to decrease.

Intercept = 2185.455: The intercept represents the expected price when the age of the apartment is 0. In this case, when an apartment is newly built (age = 0), the estimated price is 2185.455 EUR.

Correlation coefficient (r = -0.230255): In this case, r = -0.230255 indicates: a weak negative correlation between Age and Price. This means that as the Age of the apartment increases, the Price tends to decrease slightly. A perfect correlation would be closer to -1 (for a perfect negative correlation) or +1 (for a perfect positive correlation). Since their correlation is -0.23, it suggests that the relationship is weak, meaning that other factors (besides Age) likely play a significant role in determining the price of the apartment.
Coefficient of determination (R² = 0.05302): The coefficient of determination (R² = 5.3%) suggests that only a small part of the variation in price can be explained by the age of the apartment, and that most of the variation is due to other factors.

Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

library(car)

## Loading required package: carData

scatterplotMatrix(~ Price + Age + Distance, data = apartments_data, main = "Scatterplot Matrix")

Interpretation:

Based on the scatterplot between Age and Distance, there doesn’t seem to be a multicollinearity problem. The lack of a clear linear relationship between Age and Distance suggests that these two independent variables are not highly correlated.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, data = apartments_data)

summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = apartments_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Chech the multicolinearity with VIF statistics. Explain the findings.

library(car)
vif(fit2)

##      Age Distance 
## 1.001845 1.001845

Interpretation:

The VIF values for both Age and Distance are approximately 1.0018, which are very low and well below the common threshold of 5.

The low VIF values confirm that multicollinearity is not an issue in your multiple regression model, and the coefficients for Age and Distance can be interpreted independently without concern for inflated variances.

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

apartments_data$StdResid <- round(rstandard(fit2), 3) #Standardized residuals
apartments_data$CooksD <- round(cooks.distance(fit2), 3) #Cooks distances

apartments_data$fitted <- fitted(fit2)

head(apartments_data[order(-apartments_data$StdResid),], 3) #Three units with lowest value of stand. residuals

## # A tibble: 3 × 8
##     Age Distance Price Parking Balcony StdResid CooksD fitted
##   <dbl>    <dbl> <dbl> <fct>   <fct>      <dbl>  <dbl>  <dbl>
## 1     5       45  2180 Yes     Yes         2.58  0.32   1490.
## 2     2       11  2790 Yes     No          2.05  0.069  2217.
## 3    18        1  2800 Yes     No          1.78  0.03   2297.

head(apartments_data[order(-apartments_data$CooksD),], 6) #Six units with highest value of Cooks distance

## # A tibble: 6 × 8
##     Age Distance Price Parking Balcony StdResid CooksD fitted
##   <dbl>    <dbl> <dbl> <fct>   <fct>      <dbl>  <dbl>  <dbl>
## 1     5       45  2180 Yes     Yes         2.58  0.32   1490.
## 2    43       37  1740 No      No          1.44  0.104  1354.
## 3     2       11  2790 Yes     No          2.05  0.069  2217.
## 4     7        2  1760 No      Yes        -2.15  0.066  2363.
## 5    37        3  2540 Yes     Yes         1.58  0.061  2105.
## 6    40        2  2400 No      Yes         1.09  0.038  2101.

apartments_data <- apartments_data[c(-38,-55), ]

fit2_cleaned <- lm(Price ~ Age + Distance, data = apartments_data)

summary(fit2_cleaned)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = apartments_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -627.27 -212.96  -46.23  205.05  578.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2490.112     76.189  32.684  < 2e-16 ***
## Age           -7.850      3.244  -2.420   0.0178 *  
## Distance     -23.945      2.826  -8.473 9.53e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared:  0.4968, Adjusted R-squared:  0.4842 
## F-statistic: 39.49 on 2 and 80 DF,  p-value: 1.173e-12

Interpretation:

After running the commands to identify outliers or influential points, the result show us no outliers (for standardized residuals) and no influential points (for Cook’s Distance).

Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings.

library(lmtest)

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

bptest(fit2)

## 
##  studentized Breusch-Pagan test
## 
## data:  fit2
## BP = 4.4439, df = 2, p-value = 0.1084

Interpretation:

The p-value (0.1084) is greater than 0.05, which means we fail to reject the null hypothesis of homoskedasticity. This suggests that there is no significant evidence of heteroskedasticity in our model. The residuals appear to have constant variance, meaning heteroskedasticity is not a problem for this regression model.

Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings.

shapiro_test <- shapiro.test(apartments_data$StdResid)

shapiro_test

## 
##  Shapiro-Wilk normality test
## 
## data:  apartments_data$StdResid
## W = 0.94963, p-value = 0.002636

hist(apartments_data$StdResid, 
     main = "Histogram of Standardized Residuals", 
     xlab = "Standardized Residuals", 
     breaks = 20,  
     col = "lightblue", 
     border = "black")

plot(apartments_data$fitted, apartments_data$StdResid,
     main = "Scatterplot of Standardized Residuals vs Fitted Values",
     xlab = "Fitted Values",
     ylab = "Standardized Residuals",
     pch = 19, col = "blue")

Interpretation:

p-value = 0.00366 is less than 0.05, which means that we reject the null hypothesis that the standardized residuals are normally distributed. This indicates that the standardized residuals are not normally distributed.

The scatterplot shows how the standardized residuals are distributed relative to the fitted values of the regression model. Therefore, there is no clear pattern in the scatterplot, which is a good sign. The points appear to be randomly scattered, suggesting that there is no heteroscedasticity.

Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients.

fit2 <- lm(Price ~ Age + Distance, data = apartments_data)

summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = apartments_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -627.27 -212.96  -46.23  205.05  578.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2490.112     76.189  32.684  < 2e-16 ***
## Age           -7.850      3.244  -2.420   0.0178 *  
## Distance     -23.945      2.826  -8.473 9.53e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared:  0.4968, Adjusted R-squared:  0.4842 
## F-statistic: 39.49 on 2 and 80 DF,  p-value: 1.173e-12

Interpretation:

Intercept (2460,101): Represents the expected price of an apartment when both age and distance are 0. In other words, when an apartment is new and located in the city center (distance = 0), the estimated price is 2460,101 EUR.

Age coefficient (-7.934): For each additional year of age of the apartment, the price is expected to decrease by 7.934 EUR, holding distance constant.

The p-value for age is 0.016, which is less than 0.05, indicating that age is a statistically significant predictor of price.

Distance coefficient (-20.667): For each additional unit of distance from the city center, the price decreases by 20.667 EUR, holding age constant.

The p-value for distance is 6.18e-11, which is extremely small, indicating that distance is a highly significant predictor of price.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

# 1. Check the structure of the data
str(apartments_data)

## tibble [83 × 8] (S3: tbl_df/tbl/data.frame)
##  $ Age     : num [1:83] 7 18 7 28 18 28 14 18 22 25 ...
##  $ Distance: num [1:83] 28 1 28 29 18 12 20 6 7 2 ...
##  $ Price   : num [1:83] 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
##  $ Parking : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 2 2 2 ...
##  $ Balcony : Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 2 2 1 1 ...
##  $ StdResid: Named num [1:83] -0.665 1.783 -0.594 0.754 -1.073 ...
##   ..- attr(*, "names")= chr [1:83] "1" "2" "3" "4" ...
##  $ CooksD  : Named num [1:83] 0.007 0.03 0.006 0.008 0.005 0.005 0.001 0.004 0.001 0.017 ...
##   ..- attr(*, "names")= chr [1:83] "1" "2" "3" "4" ...
##  $ fitted  : Named num [1:83] 1826 2297 1826 1639 1945 ...
##   ..- attr(*, "names")= chr [1:83] "1" "2" "3" "4" ...

# 2. Handle missing values: Remove rows with NA values for Parking and Balcony
apartments_data_clean <- na.omit(apartments_data[, c("Price", "Age", "Distance", "Parking", "Balcony")])

# 3. Ensure Parking and Balcony are treated as factors
apartments_data_clean$Parking <- factor(apartments_data_clean$Parking, levels = c(0, 1), labels = c("No", "Yes"))
apartments_data_clean$Balcony <- factor(apartments_data_clean$Balcony, levels = c(0, 1), labels = c("No", "Yes"))

# 4. Check unique levels in Parking and Balcony
if (length(unique(apartments_data_clean$Parking)) == 1) {
  apartments_data_clean$Parking <- NULL  # Remove Parking if it has only one level
  print("Parking has only one level, removing from model.")
}

## [1] "Parking has only one level, removing from model."

if (length(unique(apartments_data_clean$Balcony)) == 1) {
  apartments_data_clean$Balcony <- NULL  # Remove Balcony if it has only one level
  print("Balcony has only one level, removing from model.")
}

## [1] "Balcony has only one level, removing from model."

# 5. Create dynamic regression formula
formula <- "Price ~ Age + Distance"

# Add Parking if it still exists
if (!is.null(apartments_data_clean$Parking)) {
  formula <- paste(formula, "+ Parking")
}

## Warning: Unknown or uninitialised column: `Parking`.

# Add Balcony if it still exists
if (!is.null(apartments_data_clean$Balcony)) {
  formula <- paste(formula, "+ Balcony")
}

## Warning: Unknown or uninitialised column: `Balcony`.

# 6. Convert the formula to actual formula object
formula <- as.formula(formula)

# 7. Fit the linear regression model with the correct variables
fit3 <- lm(formula, data = apartments_data_clean)

# 8. Show the summary of the model
summary(fit3)

## 
## Call:
## lm(formula = formula, data = apartments_data_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -627.27 -212.96  -46.23  205.05  578.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2490.112     76.189  32.684  < 2e-16 ***
## Age           -7.850      3.244  -2.420   0.0178 *  
## Distance     -23.945      2.826  -8.473 9.53e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared:  0.4968, Adjusted R-squared:  0.4842 
## F-statistic: 39.49 on 2 and 80 DF,  p-value: 1.173e-12

With function anova check if model fit3 fits data better than model fit2.

anova(fit2, fit3)

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance
##   Res.Df     RSS Df Sum of Sq F Pr(>F)
## 1     80 5982100                      
## 2     80 5982100  0         0

Interpretation:

The ANOVA test shows that fit3 does not improve the fit over fit2 because they are identical after removing Parking and Balcony. Since Parking and Balcony were removed due to lack of variability, it’s expected that both models perform exactly the same.

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)

## 
## Call:
## lm(formula = formula, data = apartments_data_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -627.27 -212.96  -46.23  205.05  578.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2490.112     76.189  32.684  < 2e-16 ***
## Age           -7.850      3.244  -2.420   0.0178 *  
## Distance     -23.945      2.826  -8.473 9.53e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared:  0.4968, Adjusted R-squared:  0.4842 
## F-statistic: 39.49 on 2 and 80 DF,  p-value: 1.173e-12

fit3_summary <- summary(fit3)
fit3_summary$fstatistic

##    value    numdf    dendf 
## 39.49241  2.00000 80.00000

Interpretation: Age and Distance interpretation (Parking and Balcony removed): - Age: Represents the price change for each additional year in the age of the apartment. - Distance: Represents the price change for each additional unit of distance from the city center.

Hypothesis being tested: H₀: All coefficients except the intercept are zero. H₁: At least one coefficient is not zero.

Since the p-value is much smaller than 0.05 (4.896e-11), we reject the null hypothesis (H₀). This means that at least one of the variables, Age or Distance, significantly explains the variation in Price.

Save fitted values and claculate the residual for apartment ID2.

fitted_values <- fitted(fit3)

residuals_values <- residuals(fit3)

fitted_value_ID2 <- fitted_values[2]
residual_ID2 <- residuals_values[2]

cat("Fitted value for apartment ID 2:", fitted_value_ID2, "\n")

## Fitted value for apartment ID 2: 2324.86

cat("Residual for apartment ID 2:", residual_ID2, "\n")

## Residual for apartment ID 2: 475.1398

Cars

Tomas Miklic

2024-09-19

Task 1

Task 2

Task 3

Import the dataset Apartments.xlsx

Change categorical variables into factors.

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

Chech the multicolinearity with VIF statistics. Explain the findings.

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings.

Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings.

Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

With function anova check if model fit3 fits data better than model fit2.

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

Save fitted values and claculate the residual for apartment ID2.