Task 1

data(package = .packages(all.available = TRUE))
#install.packages("reshape2")
library(reshape2)

mydata <- force(tips)
head(mydata)

##   total_bill  tip    sex smoker day   time size
## 1      16.99 1.01 Female     No Sun Dinner    2
## 2      10.34 1.66   Male     No Sun Dinner    3
## 3      21.01 3.50   Male     No Sun Dinner    3
## 4      23.68 3.31   Male     No Sun Dinner    2
## 5      24.59 3.61 Female     No Sun Dinner    4
## 6      25.29 4.71   Male     No Sun Dinner    4

Explanaiton of variables:

- total_bill: amount of the bill in dollars

- tip: amount of the tip in dollars

- sex: sex of the bill payer

- smoker: whether there were smokers in the party

- day: day of the week

- time: time of the day

- size: size of the party

colnames(mydata) [1] <- "Total Bill"
colnames(mydata) [2] <- "Tip"
colnames(mydata) [3] <- "Sex"
colnames(mydata) [4] <- "Smoker"
colnames(mydata) [5] <- "Day"
colnames(mydata) [6] <- "Time"
colnames(mydata) [7] <- "Size"


head(mydata)

##   Total Bill  Tip    Sex Smoker Day   Time Size
## 1      16.99 1.01 Female     No Sun Dinner    2
## 2      10.34 1.66   Male     No Sun Dinner    3
## 3      21.01 3.50   Male     No Sun Dinner    3
## 4      23.68 3.31   Male     No Sun Dinner    2
## 5      24.59 3.61 Female     No Sun Dinner    4
## 6      25.29 4.71   Male     No Sun Dinner    4

Here I changed/renamed the name of the variables, so they start with a capital letter and the table looks neater.

mydataWeekend <- subset(mydata, !(Day %in% c("Thur", "Fri")))
head(mydataWeekend)

##   Total Bill  Tip    Sex Smoker Day   Time Size
## 1      16.99 1.01 Female     No Sun Dinner    2
## 2      10.34 1.66   Male     No Sun Dinner    3
## 3      21.01 3.50   Male     No Sun Dinner    3
## 4      23.68 3.31   Male     No Sun Dinner    2
## 5      24.59 3.61 Female     No Sun Dinner    4
## 6      25.29 4.71   Male     No Sun Dinner    4

I created a new dataframe by excluding the observations that have been collected either on friday or thursday, so I have a dataframe which only includes the data for the weekend.

mydata$Sex <- factor(mydata$Sex,
                    levels = c("Male", "Female"), 
                    labels = c("M", "F"))

head(mydata)

##   Total Bill  Tip Sex Smoker Day   Time Size
## 1      16.99 1.01   F     No Sun Dinner    2
## 2      10.34 1.66   M     No Sun Dinner    3
## 3      21.01 3.50   M     No Sun Dinner    3
## 4      23.68 3.31   M     No Sun Dinner    2
## 5      24.59 3.61   F     No Sun Dinner    4
## 6      25.29 4.71   M     No Sun Dinner    4

Here I changed variable Sex from Male” and “Female” to “M” and “F” for possible later purposes.

summary(mydata)

##    Total Bill         Tip         Sex     Smoker      Day         Time    
##  Min.   : 3.07   Min.   : 1.000   M:157   No :151   Fri :19   Dinner:176  
##  1st Qu.:13.35   1st Qu.: 2.000   F: 87   Yes: 93   Sat :87   Lunch : 68  
##  Median :17.80   Median : 2.900                     Sun :76               
##  Mean   :19.79   Mean   : 2.998                     Thur:62               
##  3rd Qu.:24.13   3rd Qu.: 3.562                                           
##  Max.   :50.81   Max.   :10.000                                           
##       Size     
##  Min.   :1.00  
##  1st Qu.:2.00  
##  Median :2.00  
##  Mean   :2.57  
##  3rd Qu.:3.00  
##  Max.   :6.00

Here I provided descriptive statistics for selected variables “Total Bill”and “Size”.

Mean:

19.79 —> explains that on average the amount of the bill was 19.79 dollars

2.57 —> explains that on average the size of the party was 2.57 persons

Median:

17.80 —> half of the amounts of bills where lower than 17.80 dollars and the other half were higher than 17.80 dollars

2.00 —> half of the parties were smaller than 2 persons and the other half were bigger than 2 persons

Max:

50.81 —> the highest bill in the dataset amounts to 50.81 dollars

6.00 —> the size of biggest party in the dataset amounts to 6

hist(mydata$Tip,
     main = "Distribution of the variable Tip", 
     ylab = "Frequency", 
     xlab = "Tip in dollars", 
     breaks = seq(0, 12, (0.5)), 
     right = FALSE)

Here I provided you a histogram that shows the distribution of the variable Tip. The two peaks that can be seen in the histogram indicate that it is a case of a bimodal distribution. Since most of the distribution is on the left side, I would argue that it is right skewed.

boxplot(mydata$`Total Bill`,
        main = "Boxplot of Total Bill", 
        ylab = "Value in dollars", 
        col = "lightgreen", 
        border = "black")

Here I provided Boxplot of variable “Total Bill”. Half of the values of the variable lie in the green box, which is called interquartile range. Bottom whisker represents the minimum and the upper whisker represents maximum (the exact value of maximum and minimum is provided in descriptive statistic above). We can also see some outliers, these points are considered unusually high compared to the rest of the data. The line the box represents the Median which sits at 17.80 as already stated before.

Task 2

#install.packages("readxl")
library(readxl)

Business_School <- read_excel("R Take Home Exam 2024/Task 2/Business School.xlsx")

Here I imported the new dataset and named it Business_School.

library(ggplot2)
ggplot(Business_School, aes(x=`Undergrad Degree`,))+
  geom_bar(colour="blue", fill = "lightblue")+
  xlab("Undergrad Degrees")+
  ylab("Number of degrees")

Above I provided a histogram displaying the distribution of undergrad degrees among all 100 MBA students. We can see that the most common one is Bussines degree (35).

#install.packages("summarytools")
library(summarytools)

descr(Business_School$`Annual Salary`)

## Descriptive Statistics  
## Business_School$`Annual Salary`  
## N: 100  
## 
##                       Annual Salary
## ----------------- -----------------
##              Mean         109058.00
##           Std.Dev          41501.49
##               Min          20000.00
##                Q1          86750.00
##            Median         103500.00
##                Q3         124000.00
##               Max         340000.00
##               MAD          25945.50
##               IQR          36875.00
##                CV              0.38
##          Skewness              2.22
##       SE.Skewness              0.24
##          Kurtosis              9.41
##           N.Valid            100.00
##         Pct.Valid            100.00

library(psych)

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

describeBy(Business_School$`Annual Salary`)

## Warning in describeBy(Business_School$`Annual Salary`): no grouping variable
## requested

##    vars   n   mean       sd median  trimmed     mad   min    max  range skew
## X1    1 100 109058 41501.49 103500 104600.2 25945.5 20000 340000 320000 2.22
##    kurtosis      se
## X1     9.41 4150.15

Here I displayed descriptive statistics of the variable “Annual Salary” with two different functions.

Business_School$`Annual Salary2` <- Business_School$`Annual Salary` / 1000

First off all I made a new variable named “Annual Salary_2” which presents the values of Annual Salary in the thousands, since the histogram made with the initial variable “Annual Salary” looks weird, because of the high values of the variable.

library(ggplot2)
ggplot(Business_School, aes(x=`Annual Salary2`))+
  geom_histogram(color="grey4", fill= "green4")+
  xlab("Annual Salary in thousands")+
  ylab("Frequency")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can describe the distribution of the histogram as follows:

The histogram is unimodal, that means it has only one high, which lies at 100 (Annual salary in thousands). Since most of the distribution is on the left side, it is right skewed. On the right side we can also see some outliers. Latter means that there are a few values that are much higher than the rest of the values.

t.test(Business_School$`MBA Grade`,
       mu=74,
        alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  Business_School$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055

With the t.test function I tested null hypothesis (𝐻0:𝜇MBA Grade=74, 𝐻1:𝜇MBA Grade≠74). The P value of 0.00915 is significantly lower than 0.05, therefore I can reject the null hypothesis. That means that the true mean MBA grade differs from 74, sample mean of the MBA grades is 76.04055, which is obviously greater than 74.

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following object is masked from 'package:psych':
## 
##     phi

effectsize::cohens_d(Business_School$`MBA Grade`,
                     mu=74)

## Cohen's d |       95% CI
## ------------------------
## 0.27      | [0.07, 0.46]
## 
## - Deviation from a difference of 74.

The value of Cohen’s d being 0.27, indicates that there is a small effect size with (95% confidence). There is a modest difference between this year’s generation and last year’s generation.

Task 3

library(readxl)
Apartments <- read_excel("R Take Home Exam 2024/Task 3/Apartments.xlsx")
head(Apartments)

## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

Description:

- Age: Age of an apartment in years

- Distance: The distance from city center in km

- Price: Price per m2

- Parking: 0-No, 1-Yes

- Balcony: 0-No, 1-Yes

Apartments$Balcony <- factor(Apartments$Balcony,
                             levels = c(0, 1),
                             labels = c("No", "Yes"))
Apartments$Parking <- factor(Apartments$Parking,
                             levels = c(0, 1),
                             labels = c("No", "Yes"))

I changed categorical variables into factors.

t.test(Apartments$Price,
       mu = 1900,
       alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

Since the P value I got with T-Test is significantly lower than 0.05, I can reject the null hypothesis, therefore I can say that the true mean differs from 1900. With 95 % confidence I can say that true mean of apartment prices per m2 lies in the interval from 1937.443 to 2100.440.

fit1 <- lm(Price ~ Age, data = Apartments)
summary(fit1)

## 
## Call:
## lm(formula = Price ~ Age, data = Apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

sqrt(summary(fit1)$r.squared)

## [1] 0.230255

Estimate of regression coefficient = -8.975

If the age of an apartment increases by 1 year, the price per m2 decreases by 8.975 euros on average (p = 0.034).

Correlation coefficient = 0.230355

The linear correlation between price and age of an apartment is weak based on the value of correlation coefficient being 0.23.

Coefficient of determination = 0.05302

This coefficient indicates the proportion of the total variability of the dependent variable that can be explained by the linear effect of all explanatory variables. Therefore only 5.3 % of the variability of the price per m2 is explained by the linear effect of age.

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

scatterplotMatrix(Apartments[c("Price", "Age", "Distance")], smooth = FALSE)

Scatterplots between Age and Distance do not show any kind of relationship between the two variables, since the line is horizontal. Change in one variable does not affect the other, so there is currently no problem with multiciolinearity.

fit2 <- lm(Price ~ Age + Distance, data = Apartments)
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

library(car)
vif(fit2)

##      Age Distance 
## 1.001845 1.001845

mean(vif(fit2))

## [1] 1.001845

Both VIF values are close to 1, which indicates no multicollinearity between the two independent variables. They are not correlated.

Apartments$StdResid <- round(rstandard(fit2), 3)

Here I calculated all standardized residuals and stored them as StdResid. All values are between 3 and -3 (I checked the values in the table “Apartments”), so I can conclude that here there are no outliers.

Apartments$CooksD <- round(cooks.distance(fit2), 3)
hist(Apartments$CooksD,
     xlab = "Cooks distance",
     ylab = "Freequency",
     main = "Histogram of Cooks distances")

First of all I saved Cooks distances, which I caluclated with cooks.distance function. Then I made a histogram providing me their distribution. From the histogram I can see that there is at least one value that is to some extent greater than the values of other units. We can see that by a gap between 0.15 and 0.30.

head(Apartments[order(-Apartments$CooksD), "CooksD"], 10)

## # A tibble: 10 × 1
##    CooksD
##     <dbl>
##  1  0.32 
##  2  0.104
##  3  0.069
##  4  0.066
##  5  0.061
##  6  0.038
##  7  0.037
##  8  0.034
##  9  0.032
## 10  0.03

By printing the 10 highest cooks distance values we can see that the first 2 values are much greater than the others. On top of that 3rd, 4th and 5th value are also slitly higher than the others, so I decided to delete them all with the function below.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Apartments <- Apartments %>%
  filter(!CooksD %in% c(0.320, 0.104, 0.069, 0.066, 0.061))

fit2 <- lm(Price ~ Age + Distance, data = Apartments)
Apartments$StdFittedValues <- scale(fit2$fitted.values)

I already estimated/created the fit2 again without potentially excluded units and show the summary of the model here, because I could not create standardized fitted values, since I deleted 5 units (one data included 80 and the other 85 units). I also stored standardized fitted values.

library(car)
scatterplot(y=Apartments$StdResid, x=Apartments$StdFittedValues,
            ylab = "Standardized residuals", 
            xlab = "Standardized fitted values", 
            boxplots = FALSE, 
            regLine = FALSE, 
            smooth = FALSE)

The points in the scatterplot of the standardized residuals against the standardized fitted values are from what I see randomly distributed, which inidicated that there is no heteroskedasticity. To be completely sure I will also conduct Breuch-Pagan test.

library(olsrr)

## 
## Attaching package: 'olsrr'

## The following object is masked from 'package:datasets':
## 
##     rivers

ols_test_breusch_pagan(fit2)

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    1.738591 
##  Prob > Chi2   =    0.1873174

This test includes following hypothesis:

-𝐻0:the variance of errors is constant (homoskedasticity)

-𝐻1:the variance is errors not constant (heteroskedasticity)

P-value is above 0.05, therefore null hypothesis can not be rejected, and I assume homoskedasticity.

hist(Apartments$StdResid,
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

The histogram is slightly right skewed. We also see here that all values are between -3 and 3 which is a threshold used to identify outliers. We will test distribution with Shapiro test.

shapiro.test(Apartments$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  Apartments$StdResid
## W = 0.93418, p-value = 0.0004761

Now that we formally tested the distribution we see that the P-value is lower than 0.05 (0.001166), which means that the null hypothesis (standard residuals are normally distributed) can be rejected. That shows that standard residuals are not distributed normally. But since our sample is over the size of 30, a violation of this assumption does not significantly affect our results.

summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -411.50 -203.69  -45.24  191.11  492.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2502.467     75.024  33.356  < 2e-16 ***
## Age           -8.674      3.221  -2.693  0.00869 ** 
## Distance     -24.063      2.692  -8.939 1.57e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared:  0.5361, Adjusted R-squared:  0.524 
## F-statistic: 44.49 on 2 and 77 DF,  p-value: 1.437e-13

sqrt(summary(fit2)$r.squared)

## [1] 0.732187

Here are the results of the regression function without excluded units.

The coefficient of determination is 0.5361. That means that approximately 54% of the variability of the price per m2 is explained by the linear effect of age and distance (from city center in km).

Intercept of 2502.467 indicated the average value of the dependent variable when all explanatory variables are 0. But because neither Age or Distance can be valued at 0, the regression constant has no meaning.

Estimate of regression coefficient = -8.674

If the age of an apartment increases by 1 year, the price per m2 decreases by 8.674 euros on average (p = 0.00869), assuming that distance constant.

Estimate of regression coefficient = -24.063

If the distance from the city center is increased by 1 km, the price per m2 decreases on average by 24.063 (p < 0.001), assuming that age is constant.

The linear correlation between price, age and distance of an apartment is strong based on the value of correlation coefficient being 0.73.

F-statistics is used to test the statistical suitability Since P value is <0.001, we can reject the null hypothesis. That means that this model is a good model and it makes sense.

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = Apartments)

Here I estimated and saved the linear regression function Price = f(Age, Distance, Parking and Balcony) as fit3.

anova(fit2, fit3)

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     77 5077362                           
## 2     75 4791128  2    286234 2.2403 0.1135

With Anova I tested if the fit3 fits the data statistically better than fit2. The results shows that fit3 model does not fit the data better since p=0.1135.

summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -390.93 -198.19  -53.64  186.73  518.34 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2393.316     93.930  25.480  < 2e-16 ***
## Age           -7.970      3.191  -2.498   0.0147 *  
## Distance     -21.961      2.830  -7.762 3.39e-11 ***
## ParkingYes   128.700     60.801   2.117   0.0376 *  
## BalconyYes     6.032     57.307   0.105   0.9165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared:  0.5623, Adjusted R-squared:  0.5389 
## F-statistic: 24.08 on 4 and 75 DF,  p-value: 7.764e-13

Given the values of the other explanatory variables, apartments with a parking space have an average price per m2 higher by 128.700 euros compared to the apartments without a parking space (p = 0.0376).

We found no difference between the average price per m2 of two identical apartments, with the exception that one apartment has a balcony and the other does not. That is because P-value is greater than 0.05 (p=0.9165).

With F-statistics the hypothesis that is being tested is: 𝐻0: R-squared=0.

Apartments$StdFittedValues <- fitted.values(fit3)
Apartments$StdResid <- residuals(fit3)
head(Apartments[ , colnames(Apartments) %in% c("Price", "StdFittedValues", "StdResid")])

## # A tibble: 6 × 3
##   Price StdResid StdFittedValues
##   <dbl>    <dbl>           <dbl>
## 1  1640    -88.6           1729.
## 2  2800    443.            2357.
## 3  1660    -62.6           1723.
## 4  1850    311.            1539.
## 5  1640   -349.            1989.
## 6  1770   -143.            1913.

Based on the estimated regression function, we would expect the price per m2 for this apartment to be 2356.597 euros. The actual price was 2800 euros, so the residual is 443.4 euros.

Pino Sekač Take Home Exam

Pino Sekač

2024-09-19

Task 1

Task 2

Task 3