The dataset contains 25 motorcycle models from various manufacturers.
It includes technical specifications (engine size, power, torque, etc.)
and a categorical variable for the type of bike.
Variables:

- Price: base-model retail price in €
- Weight: wet weight in kg with full fuel tank
- EngineCC: displacement in cubic centimeters (ccm)
- PowerHP: peak power in horsepower (hp)
- TorqueNm: peak torque in Newton meters (Nm)
- Cylinders: number of cylinders (integer)
- Category: type of motorcycle (Sport, Naked, Cruiser, Adventure, Touring)
motorbikes <- data.frame(
Model = c(
"Yamaha R6", "Honda CBR650R", "Kawasaki Ninja 1000SX", "Suzuki GSX-R750",
"BMW S1000RR", "Ducati Panigale V2", "KTM 890 Duke R", "Yamaha MT-07",
"Honda CB500F", "Kawasaki Z900", "Suzuki V-Strom 650", "BMW R1250GS",
"Harley-Davidson Iron 883", "Indian Scout Bobber", "Triumph Street Triple RS",
"Ducati Multistrada V4", "Honda Africa Twin", "Kawasaki Vulcan S",
"Yamaha XSR900", "KTM 1290 Super Adventure R", "Honda Gold Wing",
"Harley-Davidson Road Glide", "BMW F900XR", "Triumph Tiger 900", "Ducati Monster 937"
),
Price = c(12500, 9500, 13500, 11800, 19000, 17000, 12500, 8000, 6500, 9500,
9000, 19500, 11000, 13000, 12500, 21000, 15000, 9000, 11000, 18500,
30000, 28000, 12500, 13500, 14000),
Weight = c(190, 208, 235, 190, 197, 200, 184, 184, 189, 210, 213, 249, 256, 252,
188, 240, 226, 228, 193, 245, 360, 375, 219, 228, 188),
EngineCC = c(599, 649, 1043, 750, 999, 955, 890, 689, 471, 948, 645, 1254, 883,
1133, 765, 1158, 1084, 649, 889, 1301, 1833, 1868, 895, 888, 937),
PowerHP = c(118, 94, 142, 128, 207, 155, 121, 73, 47, 125, 70, 136, 49, 100,
121, 170, 101, 61, 117, 160, 125, 89, 105, 95, 111),
TorqueNm = c(61, 64, 111, 86, 113, 104, 99, 68, 43, 98, 62, 143, 74, 97, 79,
125, 105, 63, 93, 140, 170, 150, 92, 87, 95),
Cylinders = c(4, 4, 4, 4, 4, 2, 2, 2, 2, 4, 2, 2, 2, 2, 3, 4, 2, 2, 3, 2, 6,
2, 2, 3, 2),
Category = c("Sport", "Sport", "Sport", "Sport", "Sport", "Sport", "Naked", "Naked",
"Naked", "Naked", "Adventure", "Adventure", "Cruiser", "Cruiser",
"Naked", "Adventure", "Adventure", "Cruiser", "Naked", "Adventure",
"Touring", "Cruiser", "Adventure", "Adventure", "Naked")
)
I created a few new variables, shown in the table below:

- Power-to-weight ratio (hp/kg)
- Torque-to-weight ratio (Nm/kg)
- Power-to-torque ratio (hp/Nm)

All three are rounded to three decimal places. I then removed the heaviest bikes (250 kg and over), renamed some variables to make the data easier to understand, and finally, because I like Naked bikes, created a subset containing only the Naked bikes.
library(dplyr)  # mutate(), filter(), rename(), and the %>% pipe
library(knitr)  # kable() tables

# New variables
motorbikes <- motorbikes %>%
mutate(
PowerToWeight = round(PowerHP / Weight, 3),
TorqueToWeight = round(TorqueNm / Weight, 3),
PowerToTorque = round(PowerHP / TorqueNm, 3)
)
# Remove heavier bikes (over 250 kg)
motorbikes_clean <- motorbikes %>%
filter(Weight < 250)
# Rename variables
motorbikes_clean <- motorbikes_clean %>%
rename(Displacement_ccm = EngineCC,
Power_HP = PowerHP,
Torque_Nm = TorqueNm)
# Only Naked bikes
naked_bikes <- motorbikes_clean %>%
filter(Category == "Naked")
kable(naked_bikes, caption = "Final dataset: Naked bikes only")
| Model | Price | Weight | Displacement_ccm | Power_HP | Torque_Nm | Cylinders | Category | PowerToWeight | TorqueToWeight | PowerToTorque |
|---|---|---|---|---|---|---|---|---|---|---|
| KTM 890 Duke R | 12500 | 184 | 890 | 121 | 99 | 2 | Naked | 0.658 | 0.538 | 1.222 |
| Yamaha MT-07 | 8000 | 184 | 689 | 73 | 68 | 2 | Naked | 0.397 | 0.370 | 1.074 |
| Honda CB500F | 6500 | 189 | 471 | 47 | 43 | 2 | Naked | 0.249 | 0.228 | 1.093 |
| Kawasaki Z900 | 9500 | 210 | 948 | 125 | 98 | 4 | Naked | 0.595 | 0.467 | 1.276 |
| Triumph Street Triple RS | 12500 | 188 | 765 | 121 | 79 | 3 | Naked | 0.644 | 0.420 | 1.532 |
| Yamaha XSR900 | 11000 | 193 | 889 | 117 | 93 | 3 | Naked | 0.606 | 0.482 | 1.258 |
| Ducati Monster 937 | 14000 | 188 | 937 | 111 | 95 | 2 | Naked | 0.590 | 0.505 | 1.168 |
Below is a summary of the cleaned dataset (motorbikes_clean) across all bike
categories. The cleaned dataset differs from the original by including the
newly calculated variables and the renamed columns.
# Summary stats of cleaned dataset
kable(summary(motorbikes_clean[, c("Price", "Weight", "Displacement_ccm", "Power_HP", "Torque_Nm")]),
caption = "Summary statistics of cleaned dataset")
| Price | Weight | Displacement_ccm | Power_HP | Torque_Nm |
|---|---|---|---|---|
| Min. : 6500 | Min. :184.0 | Min. : 471 | Min. : 47 | Min. : 43.00 |
| 1st Qu.: 9500 | 1st Qu.:190.0 | 1st Qu.: 689 | 1st Qu.: 95 | 1st Qu.: 68.00 |
| Median :12500 | Median :208.0 | Median : 890 | Median :118 | Median : 93.00 |
| Mean :13110 | Mean :209.7 | Mean : 879 | Mean :117 | Mean : 91.95 |
| 3rd Qu.:15000 | 3rd Qu.:228.0 | 3rd Qu.: 999 | 3rd Qu.:136 | 3rd Qu.:105.00 |
| Max. :21000 | Max. :249.0 | Max. :1301 | Max. :207 | Max. :143.00 |
For the graphical analysis I opted for a histogram, a scatterplot, and a
boxplot to show different motorcycle characteristics.

- The histogram shows the distribution of engine power.
- The scatterplot shows the relationship between engine power and price.
- The boxplot shows how price is distributed across categories.
# Histogram of engine power
ggplot(motorbikes_clean, aes(x = Power_HP)) +
geom_histogram(binwidth = 20, fill = "darkgreen", color = "black") +
labs(title = "Distribution of Engine Power", x = "Engine Power (HP)", y = "Count")
# Scatterplot: Power vs Price
ggplot(motorbikes_clean, aes(x = Price, y = Power_HP, color = Category)) +
geom_point(size=3) +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Engine Power vs Price", x = "Price (€)", y = "Power (HP)")
## `geom_smooth()` using formula = 'y ~ x'
# Boxplot: Price by Category
ggplot(motorbikes_clean, aes(x = Category, y = Price, fill = Category)) +
geom_boxplot() +
labs(title = "Price Distribution by Category", x = "Category", y = "Price (€)")
The power histogram shows that most bikes fall in the 100–150 HP range, with a few powerful outliers such as the BMW S1000RR (207 HP), while models like the Honda CB500F sit at the lower end of the power scale.
The scatterplot indicates a clear positive relationship between price and engine power, which was expected. Sport bikes stand out as the most powerful category on average, while Naked and Cruiser bikes tend to cluster at lower values. We can also observe that the trend lines for the Naked and Adventure categories are practically identical, which was unexpected.
The boxplot shows that Adventure bikes are consistently the most expensive, with the widest price dispersion. Naked bikes are the cheapest on average, while Sport bikes and Naked bikes show similar variability in prices. Because there is only one Cruiser bike left after cleaning, its boxplot is not representative.
library(readxl)
library(ggplot2)
library(dplyr)
library(knitr)
mba <- read_excel("Business School.xlsx")
# Distribution of undergrad degrees
ggplot(mba, aes(x = `Undergrad Degree`, fill = `Undergrad Degree`)) +
geom_bar() +
labs(title = "Distribution of Undergraduate Degrees",
x = "Undergraduate Degree", y = "Count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
From the graph we can see that the most common undergraduate degree is Business.
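The exact counts behind the bar chart can be tabulated as a quick check (a minimal sketch; output omitted here):

# Frequency of each undergraduate degree, most common first
sort(table(mba$`Undergrad Degree`), decreasing = TRUE)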
# Descriptive statistics of Annual Salary
salary_summary <- t(unclass(summary(mba$`Annual Salary`)))
kable(salary_summary, caption = "Summary of Annual Salary")
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| 20000 | 87125 | 103500 | 109058 | 124000 | 340000 |
# Histogram of salary
ggplot(mba, aes(x = `Annual Salary`)) +
geom_histogram(binwidth = 10000, fill = "darkgreen", color = "black") +
labs(title = "Distribution of Annual Salary", x = "Annual Salary", y = "Count")
The summary table shows the mean, median, minimum, maximum, and
quartiles of Annual Salary.
The histogram indicates that most graduates earn between 90,000 and
120,000, while a few earn much higher salaries.
This creates a positively skewed distribution, with a long tail
extending to the right.
As a result, the mean salary (109,058) is higher than the median
(103,500), since the few very high salaries pull the average upwards.
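This mean–median relationship can be checked directly in R (a minimal sketch using only the variables already loaded above):

# For a positively skewed variable the mean exceeds the median
mean(mba$`Annual Salary`) > median(mba$`Annual Salary`)  # TRUE (109058 > 103500)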
# t-test against population mean = 74
t_result <- t.test(mba$`MBA Grade`, mu = 74)
t_result
##
## One Sample t-test
##
## data: mba$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
# Effect size (Cohen's d)
mean_grade <- mean(mba$`MBA Grade`)
sd_grade <- sd(mba$`MBA Grade`)
mu0 <- 74
cohen_d <- (mean_grade - mu0) / sd_grade
cohen_d
## [1] 0.2658658
The sample mean MBA grade is 76.04, which is higher than the
hypothesized value of 74.
The t-test result is t(99) = 2.66, p = 0.009. Since the
p-value is less than 0.05, we reject the null hypothesis.
The 95% confidence interval for the mean (74.52 – 77.56) does not
include 74, which further confirms the result.
The effect size, Cohen’s d = 0.27, indicates a
small effect.
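For reference, Cohen's (1988) rule-of-thumb cutoffs can be wrapped in a small helper; interpret_d below is a hypothetical convenience function, not part of the original analysis:

# Classify |d| by Cohen's conventional cutoffs: 0.2 small, 0.5 medium, 0.8 large
interpret_d <- function(d) {
  cut(abs(d),
      breaks = c(0, 0.2, 0.5, 0.8, Inf),
      labels = c("negligible", "small", "medium", "large"),
      right = FALSE)
}
interpret_d(cohen_d)  # "small" for d = 0.27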
# Load dataset
apartments <- read_excel("Apartments.xlsx")
# Show first rows
kable(head(apartments), caption = "Apartments dataset")
| Age | Distance | Price | Parking | Balcony |
|---|---|---|---|---|
| 7 | 28 | 1640 | 0 | 1 |
| 18 | 1 | 2800 | 1 | 0 |
| 7 | 28 | 1660 | 0 | 0 |
| 28 | 29 | 1850 | 0 | 1 |
| 18 | 18 | 1640 | 1 | 1 |
| 28 | 12 | 1770 | 0 | 1 |
Description: the Parking and Balcony columns are binary indicators (0 = No, 1 = Yes), so I convert them to labelled factors.
apartments$Parking <- factor(apartments$Parking,
levels = c(0, 1),
labels = c("No", "Yes"))
apartments$Balcony <- factor(apartments$Balcony,
levels = c(0, 1),
labels = c("No", "Yes"))
# Updated dataset
knitr::kable(head(apartments), caption = "Apartments dataset with categorical variables as factors")
| Age | Distance | Price | Parking | Balcony |
|---|---|---|---|---|
| 7 | 28 | 1640 | No | Yes |
| 18 | 1 | 2800 | Yes | No |
| 7 | 28 | 1660 | No | No |
| 28 | 29 | 1850 | No | Yes |
| 18 | 18 | 1640 | Yes | Yes |
| 28 | 12 | 1770 | No | Yes |
# One-sample t-test
t_price <- t.test(apartments$Price, mu = 1900)
t_price
##
## One Sample t-test
##
## data: apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
# Effect size (Cohen's d)
mean_price <- mean(apartments$Price)
sd_price <- sd(apartments$Price)
mu0 <- 1900
cohen_d <- (mean_price - mu0) / sd_price
cohen_d
## [1] 0.314791
The hypothesis is that the average apartment price per m² is 1900 €.
The sample mean turned out to be 2018.94 €/m², which is higher than 1900 €. The test result was t(84) = 2.90, p = 0.0047. Since the p-value is below 0.05, the difference is statistically significant. The 95% confidence interval, 1937.44–2100.44 €/m², does not include 1900, which backs up this conclusion.
The effect size, Cohen's d = 0.31, is a small-to-medium effect: the difference is real, but not extremely large.
Apartments in Ljubljana are on average priced noticeably above 1900 €/m².
# Simple regression: Price = f(Age)
fit1 <- lm(Price ~ Age, data = apartments)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
# Correlation
cor_age_price <- cor(apartments$Age, apartments$Price)
cor_age_price
## [1] -0.230255
I estimated a simple regression function: Price = f(Age).
Regression coefficient (slope): The slope is –8.98, meaning that for every additional year of building age, the average apartment price decreases by about 9 €/m². Since the p-value = 0.034 (< 0.05), this effect is statistically significant.
Intercept: The intercept is 2185.46 €/m², the estimated price per m² for a new building (Age = 0).
Correlation coefficient (r): The correlation between Age and Price is –0.23, showing a weak negative relationship: as apartments get older, prices tend to decrease.
Coefficient of determination (R²): The R² value is 0.053, meaning that about 5.3% of the variation in apartment prices can be explained by age alone. The remaining variation depends on other factors.
Older apartments tend to be cheaper, but age alone explains only a small portion of price differences.
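To make this fit visible, the regression line can be overlaid on a scatterplot (a sketch in the same ggplot2 style used earlier; this plot was not part of the original output):

# Scatterplot of Price vs Age with the fitted regression line
ggplot(apartments, aes(x = Age, y = Price)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Price vs Age", x = "Age (years)", y = "Price (€/m²)")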
library(GGally)
# Scatterplot matrix
ggpairs(apartments[, c("Price", "Age", "Distance")])
- Price vs Age: weak negative relationship (correlation = –0.23); older apartments are slightly cheaper.
- Price vs Distance: stronger negative relationship (correlation = –0.63); apartments farther from the city center are cheaper.
- Age vs Distance: almost no correlation (correlation = 0.04).
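The same pairwise correlations can be printed directly, reproducing the values quoted above (a minimal check):

# Correlation matrix of the numeric variables
round(cor(apartments[, c("Price", "Age", "Distance")]), 2)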
# Multiple regression function
fit2 <- lm(Price ~ Age + Distance, data = apartments)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
library(car)
# VIF for fit2
vif(fit2)
## Age Distance
## 1.001845 1.001845
The VIF values for both Age (1.00) and
Distance (1.00) are very close to 1.
This means that the two predictors are practically uncorrelated, and
there is no problem with multicollinearity.
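With only two predictors this can be verified by hand, since VIF = 1 / (1 − r²), where r is the correlation between the predictors (a quick sanity check, not part of the original analysis):

# With two predictors, VIF = 1 / (1 - r^2); r = cor(Age, Distance) ≈ 0.04
r <- cor(apartments$Age, apartments$Distance)
1 / (1 - r^2)  # ~1.0018, matching vif(fit2)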
# Standardized residuals
std_resid <- rstandard(fit2)
# Cook's distances
cooks_d <- cooks.distance(fit2)
# Combine into a data frame for inspection
diagnostics <- data.frame(std_resid, cooks_d)
knitr::kable(head(diagnostics), caption = "First rows of diagnostics: standardized residuals & Cook's D")
| std_resid | cooks_d |
|---|---|
| -0.6653487 | 0.0073866 |
| 1.7832876 | 0.0303654 |
| -0.5937629 | 0.0058826 |
| 0.7543794 | 0.0082992 |
| -1.0733987 | 0.0051126 |
| -0.7775190 | 0.0049009 |
# Identify problematic units
#
# |standardized residual| > 2 (possible outlier)
# Cook's D > 4/n (influential point)
n <- nrow(apartments)
outliers <- which(abs(std_resid) > 2 | cooks_d > (4/n))
outliers
## 22 33 38 53 55
## 22 33 38 53 55
apartments[outliers, ]
## # A tibble: 5 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <fct> <fct>
## 1 37 3 2540 Yes Yes
## 2 2 11 2790 Yes No
## 3 5 45 2180 Yes Yes
## 4 7 2 1760 No Yes
## 5 43 37 1740 No No
# Standardized residuals and fitted values
std_resid <- rstandard(fit2)
std_fitted <- scale(fit2$fitted.values)
# Scatterplot
plot(std_fitted, std_resid,
main = "Residuals vs Fitted Values",
xlab = "Standardized Fitted Values",
ylab = "Standardized Residuals",
pch = 19, col = "blue")
abline(h = 0, col = "red", lwd = 2)
The residuals are scattered around zero without a clear pattern,
although the spread widens somewhat at higher fitted values.
This suggests that heteroskedasticity is not a serious problem, though
the mild increase in variance at higher fitted values is worth keeping
in mind.
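A formal test can complement the visual check, for example the Breusch–Pagan test (a sketch assuming the lmtest package is installed; its output is not shown in this report):

# Breusch-Pagan test: H0 = constant error variance (homoskedasticity)
library(lmtest)
bptest(fit2)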
# Standardized residuals
std_resid <- rstandard(fit2)
# Histogram
hist(std_resid, breaks = 10, col = "darkgreen", border = "black",
main = "Histogram of Standardized Residuals",
xlab = "Standardized Residuals")
# Q-Q plot
qqnorm(std_resid, main = "Q-Q Plot of Standardized Residuals")
qqline(std_resid, col = "blue", lwd = 2)
# Formal test of normality (Shapiro-Wilk test)
shapiro_test <- shapiro.test(std_resid)
shapiro_test
##
## Shapiro-Wilk normality test
##
## data: std_resid
## W = 0.95306, p-value = 0.00366
The histogram of standardized residuals looks roughly bell-shaped,
and the Q-Q plot shows that most points follow the straight line, with
only small deviations at the tails.
The Shapiro–Wilk test, however, rejects normality (W = 0.953,
p = 0.004). With n = 85 this test is sensitive to even minor
departures, so although the formal test flags non-normality, the
graphical diagnostics suggest the deviation is mild and not a serious
practical concern for the regression.
# Remove problematic units (identified earlier)
apartments_clean <- apartments[-outliers, ]
# Re-estimate regression: Price ~ Age + Distance
fit2_clean <- lm(Price ~ Age + Distance, data = apartments_clean)
# Show results
summary(fit2_clean)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = apartments_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -411.50 -203.69 -45.24 191.11 492.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2502.467 75.024 33.356 < 2e-16 ***
## Age -8.674 3.221 -2.693 0.00869 **
## Distance -24.063 2.692 -8.939 1.57e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared: 0.5361, Adjusted R-squared: 0.524
## F-statistic: 44.49 on 2 and 77 DF, p-value: 1.437e-13
Intercept (2502.47): Estimated price per m² for a
brand-new apartment located right in the city center (Age = 0, Distance
= 0). This is the baseline value.
Age (–8.67): For every additional year of apartment
age, the price decreases on average by about 9 €/m²,
keeping distance constant.
Distance (–24.06): For every additional kilometer away
from the city center, the price decreases on average by about 24
€/m², keeping age constant.
R² = 0.54: About 54% of the variation in
apartment prices can be explained by Age and Distance together,
which indicates a fairly strong model fit.
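Confidence intervals for these coefficients would make the precision of the estimates explicit (a one-line addition; output omitted here):

# 95% confidence intervals for the coefficients of the cleaned model
confint(fit2_clean)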
# Linear regression function Price = f(Age, Distance, Parking and Balcony)
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = apartments_clean)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = apartments_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.93 -198.19 -53.64 186.73 518.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2393.316 93.930 25.480 < 2e-16 ***
## Age -7.970 3.191 -2.498 0.0147 *
## Distance -21.961 2.830 -7.762 3.39e-11 ***
## ParkingYes 128.700 60.801 2.117 0.0376 *
## BalconyYes 6.032 57.307 0.105 0.9165
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared: 0.5623, Adjusted R-squared: 0.5389
## F-statistic: 24.08 on 4 and 75 DF, p-value: 7.764e-13
#### Fit2 and fit3 comparison using ANOVA
anova(fit2_clean, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 77 5077362
## 2 75 4791128 2 286234 2.2403 0.1135
The comparison is not statistically significant (F(2, 75) = 2.24, p = 0.114): adding Parking and Balcony does not significantly improve on the model with Age and Distance alone.

#### Fit3 results
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = apartments_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.93 -198.19 -53.64 186.73 518.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2393.316 93.930 25.480 < 2e-16 ***
## Age -7.970 3.191 -2.498 0.0147 *
## Distance -21.961 2.830 -7.762 3.39e-11 ***
## ParkingYes 128.700 60.801 2.117 0.0376 *
## BalconyYes 6.032 57.307 0.105 0.9165
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared: 0.5623, Adjusted R-squared: 0.5389
## F-statistic: 24.08 on 4 and 75 DF, p-value: 7.764e-13
Hypotheses tested with the F-statistic:
Null hypothesis (H₀): All slope coefficients are
equal to 0.
→ Age, Distance, Parking, and Balcony have no effect on Price.
Alternative hypothesis (H₁): At least one slope
coefficient is different from 0.
→ At least one of the predictors significantly affects Price.
F(4, 75) = 24.08, p < 0.001 → we reject H₀. The model as a whole is statistically significant, meaning that Age, Distance, Parking, and Balcony explain a significant share of apartment price variation.
# Fitted values and residuals from fit3
fitted_values <- fitted(fit3)
residuals <- resid(fit3)
# Data frame with row numbers as IDs
results <- data.frame(
Row = 1:length(fitted_values),
Fitted = fitted_values,
Residual = residuals
)
# Show only apartment ID2 (2nd row in dataset)
results[results$Row == 2, ]
## Row Fitted Residual
## 2 2 2356.597 443.4026

For apartment 2 the model predicts 2356.60 €/m², while its actual price is 2800 €/m² (see the data table above), giving a positive residual of 443.40 €/m²: this apartment is more expensive than the model predicts.
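In the same spirit, predict() can score a new, hypothetical apartment; the input values below are illustrative and not taken from the dataset:

# Predicted price for a hypothetical apartment: 10 years old, 5 km from the
# center, with parking and without a balcony (illustrative inputs only)
predict(fit3, newdata = data.frame(Age = 10, Distance = 5,
                                   Parking = "Yes", Balcony = "No"))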