Task 1

mydataTask1 <- read.table("./Customer Purchasing Behaviors.csv",
                     header = TRUE,
                     sep = ",")

head(mydataTask1)

##   UserId age AnnualIncome PurchaseAmount LoyaltyScore Region PurchaseFrequency
## 1      1  25        45000            200          4.5  North                12
## 2      2  34        55000            350          7.0  South                18
## 3      3  45        65000            500          8.0   West                22
## 4      4  22        30000            150          3.0   East                10
## 5      5  29        47000            220          4.8  North                13
## 6      6  41        61000            480          7.8  South                21

1. Explanation of the data set

• Age: The age of the customer.

• Annual income: The customer’s annual income (in USD).

• Purchase amount: The total amount of purchases made by the customer (in USD).

• Loyalty score: The customer’s loyalty score (on a scale 1-10; 10 being highest).

• Region: The region where the customer lives (north, west, south, east).

• Purchase Frequency: Frequency of customer purchases (number of times per year).

2. Data manipulations

mydataTask1$AnnualIncome1000 <- mydataTask1$AnnualIncome / 1000
head(mydataTask1)

##   UserId age AnnualIncome PurchaseAmount LoyaltyScore Region PurchaseFrequency AnnualIncome1000
## 1      1  25        45000            200          4.5  North                12               45
## 2      2  34        55000            350          7.0  South                18               55
## 3      3  45        65000            500          8.0   West                22               65
## 4      4  22        30000            150          3.0   East                10               30
## 5      5  29        47000            220          4.8  North                13               47
## 6      6  41        61000            480          7.8  South                21               61

I created a new variable, which presents the Annual Income in 1000 USD.

colnames(mydataTask1) [2] <- "Age"

I renamed the variable “age” into “Age, for more cohesiveness.

mydataTask1.1 <- subset(mydataTask1, Age > 30)
head(mydataTask1.1)

##    UserId Age AnnualIncome PurchaseAmount LoyaltyScore Region PurchaseFrequency AnnualIncome1000
## 2       2  34        55000            350          7.0  South                18               55
## 3       3  45        65000            500          8.0   West                22               65
## 6       6  41        61000            480          7.8  South                21               61
## 7       7  36        54000            400          6.5   West                19               54
## 9       9  50        70000            600          9.0  North                25               70
## 10     10  31        50000            320          5.5  South                17               50

I created a new database with the variable “Age” value over 30.

3. Presenting the descriptive statistics

summary(mydataTask1[, c("Age", "AnnualIncome", "PurchaseAmount")])

##       Age         AnnualIncome   PurchaseAmount 
##  Min.   :22.00   Min.   :30000   Min.   :150.0  
##  1st Qu.:31.00   1st Qu.:50000   1st Qu.:320.0  
##  Median :39.00   Median :59000   Median :440.0  
##  Mean   :38.68   Mean   :57408   Mean   :425.6  
##  3rd Qu.:46.75   3rd Qu.:66750   3rd Qu.:527.5  
##  Max.   :55.00   Max.   :75000   Max.   :640.0

Age:

Min: The minimum age of customers from my data set is 22 years.
Max: The maximum age of customers from my data set is 55 years.

Annual income:

Median: Half of the customers’ annual income was lower than 59000 USD, and half of the customers’ annual income was higher.
3rd Quartile: 25 % of the customers’ annual income was higher than 66750 USD.

Purchase amount:

Mean: The average purchase amount was 425.6 USD.

4. Graph the distribution of the variables

hist(mydataTask1$AnnualIncome,
     main = "Distribution of Annual Income",
     ylab = "Frequency",
     xlab = "Annual Income (in USD)",
     breaks = seq(30000, 80000, (10000)),
     right = FALSE,
     col = "magenta", 
     fill = "magenta4")

## Warning in plot.window(xlim, ylim, "", ...): "fill" is not a graphical parameter

## Warning in title(main = main, sub = sub, xlab = xlab, ylab = ylab, ...): "fill" is not a graphical parameter

## Warning in axis(1, ...): "fill" is not a graphical parameter

## Warning in axis(2, at = yt, ...): "fill" is not a graphical parameter

Explanation of the histogram for Annual Income: The customers’ annual income (in USD) has a normal distribution, which is slightly skewed to the left.

boxplot(mydataTask1$PurchaseAmount,
        main = "Purchase Amount",
        ylab = "USD",
        col = "magenta",
        border = "magenta4")

Explanation of the boxplot for Purchase Amount: We can see that there are no outliers; from the boxplot we can also see that the the median is 440 USD, it shows us the minimum (150 USD), the maximum (640 USD), the first quartile (320 USD) and the third quartile (527,5 USD).

Task 2

library(readxl)
mydataTask2 <- read_xlsx("./Business School.xlsx")

head(mydataTask2)

## # A tibble: 6 × 9
##   `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade` `Work Experience` `Employability (Before)`
##          <dbl> <chr>                          <dbl>       <dbl> <chr>                                <dbl>
## 1            1 Business                        68.4        90.2 No                                     252
## 2            2 Computer Science                70.2        68.7 Yes                                    101
## 3            3 Finance                         76.4        83.3 No                                     401
## 4            4 Business                        82.6        88.7 No                                     287
## 5            5 Finance                         76.9        75.4 No                                     275
## 6            6 Computer Science                83.3        82.1 No                                     254
## # ℹ 3 more variables: `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>

1. Graph the distribution of undergrad degrees.

library(ggplot2)

ggplot(mydataTask2, aes(x=`Undergrad Degree`)) + 
  geom_bar(colour = "magenta4", fill = "magenta") +
  ylab("Frequency") +
  xlab("Undergrad Degree")

From the histogram we can see that the most common Undergrad Degree among the MBA students is in Business.

2. Show the descriptive statistics of the Annual Salary and its distribution with the histogram

library(psych)

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

## The following object is masked from 'package:car':
## 
##     logit

describeBy(mydataTask2$`Annual Salary`)

## Warning in describeBy(mydataTask2$`Annual Salary`): no grouping variable requested

##    vars   n   mean       sd median  trimmed     mad   min    max  range skew kurtosis      se
## X1    1 100 109058 41501.49 103500 104600.2 25945.5 20000 340000 320000 2.22     9.41 4150.15

mydataTask2$`Annual Salary1000` <- mydataTask2$`Annual Salary` / 1000

library(ggplot2)
ggplot(mydataTask2, aes(x=`Annual Salary1000`))+
  geom_histogram(color="magenta4", fill ="magenta")+
  xlab("Annual Salary (in 1000 USD)")+
  ylab("Frequency")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Explanation of the histogram for Annual Salary (in 1000 USD): we can see that the histogram is unimodal and it is skewed to the right. We can also notice that there are some outliers on the right, which we can see from the big gaps.

3. Test the following hypothesis: 𝐻0: 𝜇MBA Grade = 74.

t.test(mydataTask2$`MBA Grade`,
       mu=74,
       alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  mydataTask2$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055

H0: mu = 74

H1: mu ≠ 74

We can reject the null hypothesis at p = 0.00915, and accept H1, which means that the average of the MBA grades is different from 74.

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following object is masked from 'package:psych':
## 
##     phi

cohens_d(mydataTask2$`MBA Grade`,
                     mu=74)

## Cohen's d |       95% CI
## ------------------------
## 0.27      | [0.07, 0.46]
## 
## - Deviation from a difference of 74.

The value of Cohen’s d is 0.27, which means that there is a small effect size (95% confidence interval).

The value of Cohen’s d being 0.27, indicates that there is a small effect size with (95 % confidence). There is a modest difference between this year’s generation and last year’s generation.

Task 3

1. Import the dataset Apartments.xlsx

library(readxl)
mydataTask3 <- read_excel("./Apartments.xlsx")
head(mydataTask3)

## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

Description:

Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes

2. Change categorical variables into factors.

mydataTask3$Balcony <- factor(mydataTask3$Balcony,
                             levels = c(0, 1),
                             labels = c("No", "Yes"))

mydataTask3$Parking <- factor(mydataTask3$Parking,
                             levels = c(0, 1),
                             labels = c("No", "Yes"))

3. Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

t.test(mydataTask3$Price,
       mu = 1900,
       alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  mydataTask3$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

H0: mu = 1900

H1: mu ≠ 1900

We can reject the null hypothesis at p = 0.004731, and accept H1, which means that the average price of the apartments per m2 is different from 1900.

4. Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age, data = mydataTask3)
summary(fit1)

## 
## Call:
## lm(formula = Price ~ Age, data = mydataTask3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

sqrt(summary(fit1)$r.squared)

## [1] 0.230255

If the age of an apartment increases by 1 year, the price per m2 decreases by 8.975 euros on average (p = 0.034).

The correlation coefficient is 0.23, which means that the linear correlation between price and age of an apartment is weak and positive.

The coefficient of determination is 0.053, which means that age explains only 5.3 % of the variability of price per m2.

5. Show the scatterplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

scatterplotMatrix(mydataTask3[c("Price", "Age", "Distance")], smooth = FALSE)

Based on the matrix we can see that there is no problem with potential multicollinearity, because none of the scatter plots show a strong relationship.

6. Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, data = mydataTask3)
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydataTask3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

7. Chech the multicolinearity with VIF statistics. Explain the findings.

library(car)
vif(fit2)

##      Age Distance 
## 1.001845 1.001845

mean(vif(fit2))

## [1] 1.001845

All VIF statistics are below 5 and very close to 1, and the average VIF statistic is 1.002, which is very close to 1. This means that there is no multicollinearity between the variables.

8. Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

mydataTask3$StdResid <- round(rstandard(fit2), 3)

hist(mydataTask3$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals",
     col = "magenta",
     fill = "magenta4")

## Warning in plot.window(xlim, ylim, "", ...): "fill" is not a graphical parameter

## Warning in title(main = main, sub = sub, xlab = xlab, ylab = ylab, ...): "fill" is not a graphical parameter

## Warning in axis(1, ...): "fill" is not a graphical parameter

## Warning in axis(2, at = yt, ...): "fill" is not a graphical parameter

From the histogram we can see that all values are between 3 and -3, so there are no outliers.

mydataTask3$CooksD <- round(cooks.distance(fit2), 3)

hist(mydataTask3$CooksD,
     xlab = "Cooks distance",
     ylab = "Frequency",
     main = "Histogram of Cooks distances",
     col = "magenta",
     fill = "magenta4")

## Warning in plot.window(xlim, ylim, "", ...): "fill" is not a graphical parameter

## Warning in title(main = main, sub = sub, xlab = xlab, ylab = ylab, ...): "fill" is not a graphical parameter

## Warning in axis(1, ...): "fill" is not a graphical parameter

## Warning in axis(2, at = yt, ...): "fill" is not a graphical parameter

From the Cook’s distance histogram we can see that at least 1 unit has a value that is to some extent greater than the values of other units, so we are checking (which we can see from the gap between 0.15 and 0.30).

head(mydataTask3[order(-mydataTask3$CooksD), "CooksD"], 10)

## # A tibble: 10 × 1
##    CooksD
##     <dbl>
##  1  0.32 
##  2  0.104
##  3  0.069
##  4  0.066
##  5  0.061
##  6  0.038
##  7  0.037
##  8  0.034
##  9  0.032
## 10  0.03

If we print the first 10 units with the highest Cook’s distances, we can see that the units with the greatest impact are 0.320 and 0.104. There’s another gap between the 5th and 6th value, so we can decide to remove them as well.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

mydataTask3 <- mydataTask3 %>%
  filter(!CooksD %in% c(0.320, 0.104, 0.069, 0.066, 0.061))

9. Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings.

fit2 <- lm(Price ~ Age + Distance, data = mydataTask3)
mydataTask3$StdFittedValues <- scale(fit2$fitted.values)

library(car)
scatterplot(y=mydataTask3$StdResid, x=mydataTask3$StdFittedValues,
            ylab = "Standardized residuals", 
            xlab = "Standardized fitted values", 
            boxplots = FALSE, 
            regLine = FALSE, 
            smooth = FALSE)

The points in the scatterplot of the standardized residuals against the standardized fitted values are randomly distributed and in a horizontal band of constant variability, so there is no problem of heteroskedasticity.

10. Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings.

hist(mydataTask3$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals",
     col = "magenta",
     fill = "magenta4")

## Warning in plot.window(xlim, ylim, "", ...): "fill" is not a graphical parameter

## Warning in title(main = main, sub = sub, xlab = xlab, ylab = ylab, ...): "fill" is not a graphical parameter

## Warning in axis(1, ...): "fill" is not a graphical parameter

## Warning in axis(2, at = yt, ...): "fill" is not a graphical parameter

From the histogram of standardized residuals we can see that all values are between 3 and -3, so there are no outliers. The standardized residuals are slightly assimetrically distributed to the right.

shapiro.test(mydataTask3$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydataTask3$StdResid
## W = 0.93418, p-value = 0.0004761

The p-value is lower than 0.05, so we can reject the null hypothesis that the standardized residuals are normally distributed and we can conclude that errors are not distributed normally.

11. Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients.

summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydataTask3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -411.50 -203.69  -45.24  191.11  492.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2502.467     75.024  33.356  < 2e-16 ***
## Age           -8.674      3.221  -2.693  0.00869 ** 
## Distance     -24.063      2.692  -8.939 1.57e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared:  0.5361, Adjusted R-squared:  0.524 
## F-statistic: 44.49 on 2 and 77 DF,  p-value: 1.437e-13

sqrt(summary(fit2)$r.squared)

## [1] 0.732187

Age: If the age of an apartment increases by 1 year, the price per m2 decreases on average by 8.674 USD (p-value < 0.05), assuming distance remains constant.
Distance: If the distance from the city center increases by 1 km, the price per m2 decreases on average by 24.063 USD (p-value < 0.001), assuming age is constant.
Coefficient of determination = 0.5361, which means that 53.61 % of the variability in price per m2 is explained by the linear effect of age and distance.
Multiple correlation coefficient = 0.7322 , which means that the linear correlation between price, age and distance of the apartment is strong.

12. Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = mydataTask3)

13. With function anova check if model fit3 fits data better than model fit2.

anova(fit2, fit3)

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     77 5077362                           
## 2     75 4791128  2    286234 2.2403 0.1135

The results from the comparison of the two regression models show that the fit3 model does not fit the data better (p-value = 0.1135)

14. Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = mydataTask3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -390.93 -198.19  -53.64  186.73  518.34 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2393.316     93.930  25.480  < 2e-16 ***
## Age           -7.970      3.191  -2.498   0.0147 *  
## Distance     -21.961      2.830  -7.762 3.39e-11 ***
## ParkingYes   128.700     60.801   2.117   0.0376 *  
## BalconyYes     6.032     57.307   0.105   0.9165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared:  0.5623, Adjusted R-squared:  0.5389 
## F-statistic: 24.08 on 4 and 75 DF,  p-value: 7.764e-13

Parking: Given the values of the other explanatory variables, apartments with a parking space have on average a 128.700 USD higher price per m2 in comparison to the apartments without a parking space (p-value = 0.0376).
Balcony: we found no difference between the average price per m2 of two identical apartments, with the exception that one apartment has a balcony and the other does not, because it is not statistically significant (p-value = 0.9165).
The F-statistic is testing H0: R-squared = 0.

15. Save fitted values and calculate the residual for apartment ID2.

mydataTask3$StdFittedValues <- fitted.values(fit3)
mydataTask3$StdResid <- residuals(fit3)
head(mydataTask3[ , colnames(mydataTask3) %in% c("ID", "Price", "StdFittedValues", "StdResid")])

## # A tibble: 6 × 3
##   Price StdResid StdFittedValues
##   <dbl>    <dbl>           <dbl>
## 1  1640    -88.6           1729.
## 2  2800    443.            2357.
## 3  1660    -62.6           1723.
## 4  1850    311.            1539.
## 5  1640   -349.            1989.
## 6  1770   -143.            1913.

Based on the estimated regression function, the expected price per m2 for this apartment is 2356.597 euros, whereas the actual price is 2800 euros, so the residual is 443.4 euros.

Take-home exam

Vijola Spasovic

2024-09-18