HOMEWORK 1

Anja Kadič

TASK 1

mydata <- read.csv("./car_sales_data.csv")

1.) Explain the data set

We have 50,000 observations (cars) and 7 variables.

Unit: 1 car

We have 7 variables, both numerical (Engine.size, Year.of.manufacture, Mileage, Price) and categorical (Manufacturer, Model, Fuel.type).

2.) Perform some data manipulations

mydata1.1 <- mydata[c(1, 2, 3, 7)]  # keep Manufacturer, Model, Engine.size and Price
mydata_cars_after_2015 <- subset(mydata, Year.of.manufacture > 2015)  # only cars manufactured after 2015

summary(mydata_cars_after_2015$Year.of.manufacture)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2016    2017    2018    2018    2019    2022
nrow(mydata_cars_after_2015)
## [1] 7811
mydata$Fuel.type.example <- factor (mydata$Fuel.type,
                                    levels = c("Diesel", "Petrol", "Hybrid"),
                                    labels = c("D", "P", "H")) 
str(mydata)
## 'data.frame':    50000 obs. of  8 variables:
##  $ Manufacturer       : chr  "Ford" "Porsche" "Ford" "Toyota" ...
##  $ Model              : chr  "Fiesta" "718 Cayman" "Mondeo" "RAV4" ...
##  $ Engine.size        : num  1 4 1.6 1.8 1 1.4 1.8 1.4 1.2 2 ...
##  $ Fuel.type          : chr  "Petrol" "Petrol" "Diesel" "Hybrid" ...
##  $ Year.of.manufacture: int  2002 2016 2014 1988 2006 2018 2010 2015 2012 1992 ...
##  $ Mileage            : int  127300 57850 39190 210814 127869 33603 86686 30663 73470 262514 ...
##  $ Price              : int  3074 49704 24072 1705 4101 29204 14350 30297 9977 1049 ...
##  $ Fuel.type.example  : Factor w/ 3 levels "D","P","H": 2 2 1 3 2 2 1 3 2 1 ...
mydata_cars_removed <- mydata[-c(2, 7, 10, 100, 5000, 10000, 15555), ]

3.) Present the descriptive statistics for the selected variables and explain at least 3 sample statistics

library(psych)
describe(mydata[c("Price", "Mileage", "Engine.size")])
##             vars     n      mean       sd   median   trimmed      mad min
## Price          1 50000  13828.90 16416.68   7971.5  10735.19  8851.86  76
## Mileage        2 50000 112497.32 71632.52 100987.5 106408.91 75488.06 630
## Engine.size    3 50000      1.77     0.73      1.6      1.65     0.59   1
##                max  range skew kurtosis     se
## Price       168081 168005 2.85    12.81  73.42
## Mileage     453537 452907 0.73     0.08 320.35
## Engine.size      5      4 2.08     5.32   0.00

Mean = the arithmetic average, a measure of central tendency (e.g., the average car price is 13,828.90 EUR)

Standard deviation (sd) = a measure of variability, i.e., how much values typically deviate from the mean (e.g., 16,416.68 EUR for Price)

Median = half of the units have a value equal to or lower than the median (e.g., half of the cars cost at most 7,971.50 EUR)
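
As a quick cross-check, the same three statistics can be computed directly with base R (a minimal sketch, assuming mydata is still loaded):

mean(mydata$Price)    # arithmetic average of the 50,000 prices
sd(mydata$Price)      # typical deviation of a price from the mean
median(mydata$Price)  # half of the cars cost at most this much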

4.) Graph the distribution of the variables using histograms, scatterplots, and/or boxplots. Explain the results.

HISTOGRAM

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
ggplot(mydata, aes(x = Engine.size)) +
  geom_histogram(binwidth = 0.2, fill = "lightcoral", color = "black") +
  labs(title = "Distribution of Engine Sizes", x= "Engine Size (L)", y = "Frequency")

Explanation:

  • The graph shows right (positive) skewness, which means that cars with large engine displacements are rare, and bimodality, which indicates that two groups of engine sizes are more common than the others. The graph also shows that there are outliers.
mydata_no_outliers <- subset(mydata, Engine.size < 3L)
library(ggplot2)

ggplot(mydata_no_outliers, aes(x = Engine.size)) +
  geom_histogram(binwidth = 0.2, fill = "lightcoral", color = "black") +
  labs(title = "Distribution of Engine Sizes", x= "Engine Size (L)", y = "Frequency")

Explanation:

  • In this graph I removed the outliers (engine sizes of 3 L and above), which shows the bimodality more clearly.
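
The bimodality can also be checked without a plot by tabulating the engine sizes (a sketch, assuming the same mydata object); the two largest counts correspond to the two modes:

head(sort(table(mydata$Engine.size), decreasing = TRUE))  # most frequent engine sizes first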

SCATTERPLOT

ggplot(mydata, aes(x = Mileage, y = Price)) +
  geom_point(alpha = 0.3, color = "pink2")+
  labs (title = "Relationship between Mileage and Price",
        x = "Mileage (km)", y = "Price (EUR)")

Explanation:

  • The scatterplot between mileage and price shows a clear negative relationship. The graph also shows differences in price variability across mileage classes (more dispersion at higher mileage), which suggests possible heteroscedasticity.
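
The strength of this negative relationship can be quantified with the correlation coefficient (a sketch using the same variables as the plot):

cor(mydata$Mileage, mydata$Price)  # expected to be negative, in line with the scatterplot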

BAR CHART

ggplot(mydata, aes(x = Fuel.type)) + 
  geom_bar(fill = "darkblue") +
  labs (title = "Distribution of Fuel Types", x = "Fuel Type", y = "Frequency")

Explanation:

  • The fuel type bar chart shows that the sample is predominantly composed of petrol vehicles.
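
A sketch of the corresponding frequency table and shares (assuming the same Fuel.type column) makes the dominance of petrol cars explicit:

table(mydata$Fuel.type)              # absolute frequencies
prop.table(table(mydata$Fuel.type))  # relative frequencies (shares)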

BOXPLOT

ggplot(mydata_no_outliers, aes(x = Engine.size, y = Price, group = Engine.size)) +
  geom_boxplot(fill = "lightyellow") +
  labs(title = "Car Prices by Engine Size", x = "Engine Size (L)", y = "Price (EUR)")

Explanation:

  • The boxplot shows that vehicles with larger engines achieve higher median prices. At the same time, some engine-size classes show large price variability and many outliers, indicating heterogeneity of the supply within classes.
mydata_no_outliers2 <- subset(mydata, Price < 50000)
ggplot(mydata_no_outliers2, aes(x = Engine.size, y = Price, group = Engine.size)) +
  geom_boxplot(fill = "lightyellow") +
  labs(title = "Car Prices by Engine Size", x = "Engine Size (L)", y = "Price (EUR)")

Explanation:

  • I restricted the price range (Price < 50,000 EUR), which gives a more readable graph and again shows that vehicles with larger engines achieve higher median prices.
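
The median prices behind the boxes can also be listed numerically (a sketch using the trimmed data set mydata_no_outliers2):

tapply(mydata_no_outliers2$Price, mydata_no_outliers2$Engine.size, median)  # median price per engine size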

TASK 2

1.) Graph the distribution of undergrad degrees using the ggplot function. Which degree is the most common?

library(readxl)

mydata_Business_School <- read_excel("./Business School.xlsx")
library(ggplot2)

ggplot(mydata_Business_School, aes(x = `Undergrad Degree`)) +
  geom_bar(fill = "darkviolet") +
  ylab("Frequency") +
  theme_minimal() +
  geom_text(stat = "count",
            aes(label = after_stat(count)),
            vjust = -0.3)

Explanation:

  • The most common Undergrad Degree is the Business degree.
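
The same answer can be read off a sorted frequency table (a sketch, assuming the data frame imported above):

sort(table(mydata_Business_School$`Undergrad Degree`), decreasing = TRUE)  # most common degree first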

2.) Show the descriptive statistics of the Annual Salary and its distribution with the histogram (use the ggplot function). Describe the distribution.

summary(mydata_Business_School$`Annual Salary`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20000   87125  103500  109058  124000  340000
library(ggplot2)

ggplot(mydata_Business_School, aes(x = `Annual Salary`)) +
  geom_histogram(bins = 20, fill = "pink", color = "black", alpha = 0.7) +
  geom_density(aes(y = after_stat(count) * 1000), color = "red2", linewidth = 1) +
  theme_minimal() +
  labs(title = "Distribution of Annual Salary",
       x = "Annual Salary",
       y = "Frequency")

Explanation:

  • The histogram shows that the distribution is nearly, but not fully, symmetrical: there is a longer right tail formed by a few individuals with very high salaries, so the distribution is right-skewed. The red density line illustrates the continuous shape of the distribution and confirms one distinct peak with a longer tail towards higher values. The outliers on the right side are what pull the otherwise almost symmetrical distribution to the right.
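
The right skew can also be confirmed numerically (a sketch, assuming the psych package loaded earlier; a positive value indicates a longer right tail):

skew(mydata_Business_School$`Annual Salary`)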

3.) Test the following hypothesis: 𝐻0: 𝜇MBA Grade = 74. Explain the result and interpret the effect size.

H₀: 𝜇MBA Grade = 74

t.test(mydata_Business_School$`MBA Grade`, mu = 74)
## 
##  One Sample t-test
## 
## data:  mydata_Business_School$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055
library(effectsize)
## 
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
## 
##     phi
cohens_d(mydata_Business_School$`MBA Grade`, mu = 74, na.rm = TRUE)
## Cohen's d |       95% CI
## ------------------------
## 0.27      | [0.07, 0.46]
## 
## - Deviation from a difference of 74.

Explanation:

  • We reject the null hypothesis because the p-value (p = 0.009) is smaller than 0.05. The average MBA grade is significantly different from 74: the sample mean (76.04) is higher, so students performed better on average than the hypothesised value. According to Cohen’s guidelines:

    • 0.2 = small
    • 0.5 = medium
    • 0.8 = large

which means that 0.27 is a small effect size. The difference between 76.04 and 74 is statistically significant, but the magnitude of the difference is small.
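
The reported Cohen's d can be reproduced by hand as the standardized difference between the sample mean and the hypothesised value (a sketch, assuming no missing grades):

(mean(mydata_Business_School$`MBA Grade`) - 74) / sd(mydata_Business_School$`MBA Grade`)  # should be about 0.27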

TASK 3

Import the dataset Apartments.xlsx

library(readxl)

mydata_Apartments <- read_excel("./Apartments.xlsx")

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

Change categorical variables into factors.

mydata_Apartments$Parking <- factor(mydata_Apartments$Parking,
                        levels = c(0, 1),
                        labels = c("No", "Yes"))
mydata_Apartments$Balcony <- factor(mydata_Apartments$Balcony,
                        levels = c(0, 1),
                        labels = c("No", "Yes"))

Test the hypothesis H₀: 𝜇Price = 1900 EUR. What can you conclude?

t.test(mydata_Apartments$Price, mu = 1900)
## 
##  One Sample t-test
## 
## data:  mydata_Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

Conclusion:

  • P-value (0.0047) is smaller than 0.05, which means we can reject the null hypothesis. This indicates that the sample mean (2018.94) is significantly different from the hypothesised mean of 1900 at the 5% significance level, and based on the 95% confidence interval (1937.44 to 2100.44), we can be confident that the true population mean is above 1900.
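
The t statistic itself can be reproduced by hand from its definition (a sketch, assuming Price has no missing values):

x <- mydata_Apartments$Price
(mean(x) - 1900) / (sd(x) / sqrt(length(x)))  # should match t = 2.9022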

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(mydata_Apartments$Price ~ mydata_Apartments$Age)
summary(fit1)
## 
## Call:
## lm(formula = mydata_Apartments$Price ~ mydata_Apartments$Age)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2185.455     87.043  25.108   <2e-16 ***
## mydata_Apartments$Age   -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401
cor(mydata_Apartments$Price, mydata_Apartments$Age)
## [1] -0.230255

Explanation:

  • Estimate of regression coefficient: On average, for each additional year of an apartment’s age, the price decreases by about 8.98 EUR/m2.

  • Coefficient of correlation: As age increases, prices tend to decrease, but the relationship is not very strong (weak negative linear relationship).

  • Coefficient of determination: Only 5.3 % of the price variation is explained by Age; therefore, Age alone is not a strong predictor of apartment prices.
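
In simple regression the coefficient of determination is just the squared correlation, which can be checked directly (a sketch):

cor(mydata_Apartments$Price, mydata_Apartments$Age)^2  # should equal the Multiple R-squared of fit1 (about 0.053)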

Show the scatterplot matrix between Price, Age and Distance. Based on the matrix, determine if there is a potential problem with multicollinearity.

library(GGally)

ggpairs(mydata_Apartments[, c("Price", "Age", "Distance")],
        lower = list(continuous = wrap("smooth_lm", color = "darkblue")),
        diag = list(continuous = "densityDiag"),
        title = "Scatterplot Matrix with Trend Lines")

Explanation:

  • The scatterplot matrix shows that apartment price is negatively correlated with both age (r = −0.230) and distance from the centre (r = −0.631), indicating that older and more distant apartments tend to be cheaper. In contrast, the correlation between age and distance is very low (r = 0.043), suggesting that these two predictors are essentially independent. Because the predictors are not highly correlated with each other, there is no indication of problematic multicollinearity.
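
The pairwise correlations quoted above can also be produced as a plain correlation matrix (a sketch, same three variables):

cor(mydata_Apartments[, c("Price", "Age", "Distance")])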

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(mydata_Apartments$Price ~ mydata_Apartments$Age + mydata_Apartments$Distance)
summary(fit2)
## 
## Call:
## lm(formula = mydata_Apartments$Price ~ mydata_Apartments$Age + 
##     mydata_Apartments$Distance)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                2460.101     76.632   32.10  < 2e-16 ***
## mydata_Apartments$Age        -7.934      3.225   -2.46    0.016 *  
## mydata_Apartments$Distance  -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Check the multicollinearity with VIF statistics. Explain the findings.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
vif(fit2)
##      mydata_Apartments$Age mydata_Apartments$Distance 
##                   1.001845                   1.001845

Explanation:

  • For both Age and Distance the VIF is around 1, which means there is no multicollinearity problem (a VIF above 5 would be concerning).
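
The VIF for a predictor is 1 / (1 − R²) from regressing that predictor on the remaining predictors, which can be verified by hand (a sketch for Age):

r2_age <- summary(lm(Age ~ Distance, data = mydata_Apartments))$r.squared  # R² of Age explained by Distance
1 / (1 - r2_age)                                                           # should be close to 1.0018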

Calculate standardized residuals and Cook's distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

mydata_Apartments$StdResid <- round(rstandard(fit2), 3)

hist(mydata_Apartments$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

mydata_Apartments$CooksD <- round(cooks.distance(fit2), 3)

hist(mydata_Apartments$CooksD,
     xlab = "Cooks distance",
     ylab = "Frequency",
     main = "Histogram of Cook's distances")

head(mydata_Apartments[order(-mydata_Apartments$CooksD),], 6)
## # A tibble: 6 × 7
##     Age Distance Price Parking Balcony StdResid CooksD
##   <dbl>    <dbl> <dbl> <fct>   <fct>      <dbl>  <dbl>
## 1     5       45  2180 Yes     Yes         2.58  0.32 
## 2    43       37  1740 No      No          1.44  0.104
## 3     2       11  2790 Yes     No          2.05  0.069
## 4     7        2  1760 No      Yes        -2.15  0.066
## 5    37        3  2540 Yes     Yes         1.58  0.061
## 6    40        2  2400 No      Yes         1.09  0.038
threshold <- 0.06 

mydata_Apartments <- mydata_Apartments [mydata_Apartments$CooksD <= threshold, ]

Explanation: the threshold of 0.06 is based on the visible gaps in the distribution of Cook's distances.

hist(mydata_Apartments$CooksD,
     xlab = "Cook's distance",
     main = "Histogram of Cook's distances")

Explanation:

  • The histogram shows that most values are close to zero, meaning the majority of observations have little influence on the regression model. No Cook’s Distance exceeds the typical threshold of 1 (or 4/n), and because we removed the outliers there are no highly influential data points that could distort the results.
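
The claim that no retained unit exceeds the conventional cutoff of 1 can be checked directly (a sketch, using the CooksD column computed above):

any(mydata_Apartments$CooksD > 1)  # TRUE would flag a highly influential unit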

Check for potential heteroskedasticity with a scatterplot between standardized residuals and standardized fitted values. Explain the findings.

fit2 <- lm(Price ~ Age + Distance, data = mydata_Apartments)

mydata_Apartments$StdResid <- rstandard(fit2)
mydata_Apartments$StdFitted <- as.numeric(scale(fit2$fitted.values)) 

library(car)
scatterplot(y = mydata_Apartments$StdResid, x = mydata_Apartments$StdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE, regLine = FALSE, smooth = FALSE)

Explanation:

  • The plot shows the standardized residuals against the standardized fitted values. Ideally, residuals should be randomly scattered around zero without any clear pattern, which would suggest that the assumptions of linearity and constant variance are reasonable. In this plot the residuals appear fairly randomly distributed, indicating that the linear regression model is appropriate, with no strong evidence of non-linearity or heteroscedasticity.
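
A formal test could complement the visual check (a sketch using ncvTest() from the already loaded car package; the null hypothesis is constant error variance):

ncvTest(fit2)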

Are the standardized residuals distributed normally? Show the graph and formally test it. Explain the findings.

mydata_Apartments$StdResid <- round(rstandard(fit2), 3)

hist(mydata_Apartments$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

shapiro.test(mydata_Apartments$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata_Apartments$StdResid
## W = 0.94154, p-value = 0.001166

Explanation:

  • The p-value (0.0012) is smaller than 0.05, which means that the standardized residuals are not normally distributed. They deviate significantly from a normal distribution, so the normality assumption of the regression model may not be fully satisfied.
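
A Q-Q plot is a common complementary graph for this check (a sketch, using the standardized residuals computed above); points far from the line indicate departures from normality:

qqnorm(mydata_Apartments$StdResid)
qqline(mydata_Apartments$StdResid)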

Estimate fit2 again without the potentially problematic units and show the summary of the model. Explain all coefficients.

fit2 <- lm(Price ~ Age + Distance, data = mydata_Apartments)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata_Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -411.50 -203.69  -45.24  191.11  492.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2502.467     75.024  33.356  < 2e-16 ***
## Age           -8.674      3.221  -2.693  0.00869 ** 
## Distance     -24.063      2.692  -8.939 1.57e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared:  0.5361, Adjusted R-squared:  0.524 
## F-statistic: 44.49 on 2 and 77 DF,  p-value: 1.437e-13

Explanation:

  • Intercept: 2502.47 EUR/m2 is the estimated average price of a new apartment (Age = 0) located in the city centre (Distance = 0).

  • If the age of the apartment increases by one year, the price decreases by 8.67 EUR/m2 on average (p < 0.01), controlling for distance.

  • If the distance from the city centre increases by one km, the price decreases by 24.06 EUR/m2 on average (p < 0.001), controlling for age.
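
The fitted coefficients can be turned into a concrete prediction (a sketch with made-up values: a hypothetical 10-year-old apartment 5 km from the centre):

predict(fit2, newdata = data.frame(Age = 10, Distance = 5))  # expected price per m2 for the hypothetical apartment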

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + factor(Parking) + factor(Balcony),
           data = mydata_Apartments)

With function anova check if model fit3 fits data better than model fit2.

anova(fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + factor(Parking) + factor(Balcony)
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     77 5077362                           
## 2     75 4791128  2    286234 2.2403 0.1135

Explanation:

  • Model fit3 explains slightly more variance than model fit2, as shown by its lower residual sum of squares. However, the improvement is not statistically significant (F = 2.24, p = 0.1135). The residuals vs fitted plot examined earlier shows no major violations of the regression assumptions, suggesting that the simpler model (fit2) already fits the data adequately. Therefore, fit2 is preferable, as it is more parsimonious without significantly losing explanatory power.
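
A complementary way to compare the two models is an information criterion (a sketch; the lower AIC indicates the better trade-off between fit and complexity):

AIC(fit2, fit3)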

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + factor(Parking) + factor(Balcony), 
##     data = mydata_Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -390.93 -198.19  -53.64  186.73  518.34 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2393.316     93.930  25.480  < 2e-16 ***
## Age                  -7.970      3.191  -2.498   0.0147 *  
## Distance            -21.961      2.830  -7.762 3.39e-11 ***
## factor(Parking)Yes  128.700     60.801   2.117   0.0376 *  
## factor(Balcony)Yes    6.032     57.307   0.105   0.9165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared:  0.5623, Adjusted R-squared:  0.5389 
## F-statistic: 24.08 on 4 and 75 DF,  p-value: 7.764e-13

Explanation:

  • Regression coefficient:

    • Parking: Apartments with parking are on average 128.70 EUR/m2 more expensive than those without parking, holding all other variables constant (p = 0.038).
    • Balcony: Apartments with a balcony are on average 6.03 EUR/m2 more expensive than those without, holding other variables constant, but because the p-value (0.92) is not statistically significant, a balcony does not have a meaningful effect on price in this dataset.
  • H₀: None of the predictors (Age, Distance, Parking, Balcony) have any effect on the apartment price.

    βAge = βDistance = βParking = βBalcony = 0

    H₁: At least one of the predictors (Age, Distance, Parking, Balcony) has a significant effect on the apartment price.

    At least one βi≠0
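
This H₀ is exactly what the overall F-statistic tests; it can be reproduced by comparing fit3 with the intercept-only model (a sketch):

fit0 <- lm(Price ~ 1, data = mydata_Apartments)  # model with no predictors
anova(fit0, fit3)                                # F should match 24.08 on 4 and 75 df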

Save the fitted values and calculate the residual for apartment ID 2.

fitted_values <- fitted(fit3)   # predicted prices for all apartments
residuals_fit3 <- resid(fit3)   # actual price minus predicted price

residuals_fit3[2]
##        2 
## 443.4026

Explanation:

  • The residual of 443.40 is positive, which means that the actual price of apartment 2 is higher than the price predicted by the model.