Take home exam - IMB Bootcamp

Results for each excercise are shown in tabs

Task 1

For the first excercise I choose a data set of Nvidia Daily Stock Price Data sales I got from kaggle.com

First I read my data from .csv file format into object mydata using the function read.table. I used an absolute path to get to my .csv file, because it was in different folder than this markdown file.

mydata <- read.table("C:/Delo/Sola/IMB/Bootcamp/TakeHomeExam-R/Task 1/nvidia.csv", header=TRUE, sep=";", dec=",")


head(mydata)

##         Date     Open     High      Low    Close     Volume
## 1 2004-01-02 0.196417 0.199083 0.192333 0.192333  436416000
## 2 2004-01-05 0.195250 0.199917 0.193500 0.198583  575292000
## 3 2004-01-06 0.198000 0.209417 0.197083 0.206667 1093344000
## 4 2004-01-07 0.204333 0.209500 0.202917 0.208500  673032000
## 5 2004-01-08 0.211083 0.212083 0.207250 0.209250  433752000
## 6 2004-01-09 0.207833 0.214833 0.206167 0.212250  766584000

1.1 Explanation of dataset

The variables in the analysis are explained as following:

Date - The date of the stock price data
Open - The price of the stock at market opening in USD
High - The highest price of the stock during the trading day in USD
Low - The lowest price of the stock during the trading day in USD
Close - The price of the stock at market closing in USD
Volume - The total number of shares traded during the day

1.2 Performing some data manipulation

#renaming column names
colnames(mydata) <- c("Date", "OpenPrice", "HighPrice", "LowPrice", "ClosePrice", "Volume")

#function for creating new variable of daily return
mydata[["Return"]] <- (mydata[["ClosePrice"]] - mydata[["OpenPrice"]]) / mydata[["OpenPrice"]]

#this is a function I found for deleting all the rows with missing data
mydata <- na.omit(mydata)

#created a new object for new dataframe called positiveRreturn where I would store only days where return > 0
positiveReturn <- subset(mydata, Return > 0)

# print resoults
head(positiveReturn)

##          Date OpenPrice HighPrice LowPrice ClosePrice     Volume      Return
## 2  2004-01-05  0.195250  0.199917 0.193500   0.198583  575292000 0.017070434
## 3  2004-01-06  0.198000  0.209417 0.197083   0.206667 1093344000 0.043772763
## 4  2004-01-07  0.204333  0.209500 0.202917   0.208500  673032000 0.020393135
## 6  2004-01-09  0.207833  0.214833 0.206167   0.212250  766584000 0.021252579
## 7  2004-01-12  0.213000  0.215333 0.211000   0.214667  541980000 0.007826328
## 10 2004-01-15  0.200583  0.204083 0.197000   0.202417  609204000 0.009143373

1.3 Descriptive Statistics

summary(mydata[, c("OpenPrice", "ClosePrice", "Volume", "Return")])

##    OpenPrice          ClosePrice           Volume              Return          
##  Min.   : 0.08058   Min.   : 0.07858   Min.   :4.564e+07   Min.   :-0.1380099  
##  1st Qu.: 0.34400   1st Qu.: 0.34325   1st Qu.:3.483e+08   1st Qu.:-0.0132584  
##  Median : 0.54400   Median : 0.54200   Median :4.911e+08   Median : 0.0002939  
##  Mean   : 5.21467   Mean   : 5.21704   Mean   :5.588e+08   Mean   : 0.0003971  
##  3rd Qu.: 5.23375   3rd Qu.: 5.22900   3rd Qu.:6.817e+08   3rd Qu.: 0.0139310  
##  Max.   :50.21600   Max.   :50.40900   Max.   :5.089e+09   Max.   : 0.1491142

Explanation of some of the sample statistics:

Mean (ClosePrice): The average closing price is 5.21704.
This is the arithmetic average and shows the central tendency of prices.
Median (Return): Half of the daily returns are below 0.0002939 and half are above.
This means the stock had slightly more positive days than negative ones.
Max (Volume): The maximum traded volume is 5.089×10⁹ shares.
This shows the most active trading day — useful for spotting unusual market activity.
3rd Quartile (OpenPrice): The third quartile is 5.23375, meaning 75% of opening prices are at or below this value.
Only 25% of days opened higher — these represent the strongest opening price days.

library(ggplot2)

#histogram for OpenPrice
ggplot(mydata, aes(x = OpenPrice)) +
  geom_histogram(bins = 50, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Opening Prices", x = "Open Price", y = "Count")

Histogram for variable OpenPrice: Most opening prices are very low and close to zero, which is why there’s a tall bar on the left. There are fewer days with high opening prices, but a few go all the way up to around 50. This shows the data is very uneven, with most prices small and just a few very big ones.

#histogram for Return
ggplot(mydata, aes(x = Return)) +
  geom_histogram(bins = 50, fill = "darkgreen", color = "white") +
  labs(title = "Distribution of Daily Returns", x = "Daily Return", y = "Count")

Histogram for variable Return: The histogram shows that most daily returns are close to zero, meaning the stock usually moves only a little each day. The shape looks like a bell curve, so small gains or losses are most common. Big jumps or drops happen only sometimes, which we can see from the few bars far from the center.

#scatterplot Open vs Close
ggplot(mydata, aes(x = OpenPrice, y = ClosePrice)) +
  geom_point(alpha = 0.4) +
  labs(title = "Open vs. Close Prices", x = "Open Price", y = "Close Price")

ScatterPlot of Open vs Close: This scatterplot shows that open and close prices move almost the same way. The points form a straight line, meaning the price at the end of the day is usually very close to the price at the start of the day.

mydata$Date <- as.Date(mydata$Date)

#boxplot of Volume
ggplot(mydata, aes(x = Date, y = Volume)) +
  geom_col(fill = "steelblue") + # e.g. 2B instead of 2e+09
  labs(
    title = "Daily Trading Volume",
    x = "Date",
    y = "Volume"
  ) +
  theme_minimal()

BoxPlot of volume: This chart shows how trading volume changed over time. Most days have similar volume, but there are some very tall spikes where way more shares were traded. Probably on big news or earnings days.

Task 2

First, I import the data from given .xslx file using read_excel function from readx1 library and stored it to object mbaStudents.

library(readxl)

mbaStudents <- read_excel("C:/Delo/Sola/IMB/Bootcamp/TakeHomeExam-R/Task 2/Business School.xlsx")

head(mbaStudents)

## # A tibble: 6 × 9
##   `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade`
##          <dbl> <chr>                          <dbl>       <dbl>
## 1            1 Business                        68.4        90.2
## 2            2 Computer Science                70.2        68.7
## 3            3 Finance                         76.4        83.3
## 4            4 Business                        82.6        88.7
## 5            5 Finance                         76.9        75.4
## 6            6 Computer Science                83.3        82.1
## # ℹ 5 more variables: `Work Experience` <chr>, `Employability (Before)` <dbl>,
## #   `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>

Question 1

library(ggplot2)

ggplot(mbaStudents, aes(x = `Undergrad Degree`)) +
  geom_bar(fill = "hotpink") +
  labs(
    title = "Distribution of Undergraduate Degrees",
    x = "Undergrad Degree",
    y = "Number of Students"
  ) +
  theme_minimal()

In the shown chart the tallest bar is that of the most common undergrad degree in this particular .csv data set this is Business.

Question 2

First I ran summary function on Annual Salary column from dataset stored in object mbaStudents.

# Descriptive statistics
summary(mbaStudents$`Annual Salary`)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20000   87125  103500  109058  124000  340000

That gave me the following descriptive statistics: Minimum Sallary, 1st Quartile, Median, Mean, 3rd quartile and Maximum.

Than i used a ggplot function to draw me a histogram for Distribution od Annual salaries. For the correction of notation on x-axis from, for example 1e+05 to 100,000, I used label_comma() from scales library.

library(scales)

ggplot(mbaStudents, aes(x = `Annual Salary`)) +
  geom_histogram(bins = 20, fill = "darkgreen", color = "white") +
  scale_x_continuous(labels = label_comma()) +
  labs(
    title = "Distribution of Annual Salaries",
    x = "Annual Salary",
    y = "Count"
  ) +
  theme_minimal()

Most salaries are clustered around the 100000$ salary mark, with a few outliers with very high salaries on the right side, so the distribution has a slight positive skew. The mean is higher than the median because of these few very high salaries.

Question 3

Using t.test() to test given hypothesis $H_0: \mu_{\text{MBA Grade}} = 74$.

t.test(mbaStudents$`MBA Grade`, mu = 74)

## 
##  One Sample t-test
## 
## data:  mbaStudents$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055

Function t-test() was performed to check whether the mean MBA grade is equal to 74. The sample mean was 76.04, which is higher than 74. The test produced a t-value of 2.66 with 99 degrees of freedom and a p-value of 0.00915.

Since the p-value is less than 0.05, we reject the null hypothesis and conclude that the average MBA grade is significantly different from 74.

The 95% confidence interval (74.52, 77.56) suggests that the true mean lies in this range and since this range does not include 74, the difference is statistically significant. ***

Task 3

Import the dataset Apartments.xlsx

library(readxl)

apartments <- read_excel("C:/Delo/Sola/IMB/Bootcamp/TakeHomeExam-R/Task 3/Apartments.xlsx")

head(apartments)

## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

Description:

Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes

Change categorical variables into factors.

apartments$ParkingF <- factor(apartments$Parking, levels = c(0, 1),
                             labels = c("No Parking", "Parking"))

apartments$BalconyF <- factor(apartments$Balcony, levels = c(0, 1),
                             labels = c("No Balcony", "Balcony"))

head(apartments)

## # A tibble: 6 × 7
##     Age Distance Price Parking Balcony ParkingF   BalconyF  
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>      <fct>     
## 1     7       28  1640       0       1 No Parking Balcony   
## 2    18        1  2800       1       0 Parking    No Balcony
## 3     7       28  1660       0       0 No Parking No Balcony
## 4    28       29  1850       0       1 No Parking Balcony   
## 5    18       18  1640       1       1 Parking    Balcony   
## 6    28       12  1770       0       1 No Parking Balcony

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

t.test(apartments$Price, mu = 1900)

## 
##  One Sample t-test
## 
## data:  apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

The average apartment price is 2018.9 EUR. The t-test gives t = 2.90, p = 0.005. Since the p-value is < 0.05, we reject the null hypothesis that the mean price is 1900 EUR. The 95% confidence interval (1937.4, 2100.4) suggests that the true mean price is in this range. So the conclusion is that apartments are on average more expensive than 1900 EUR.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age, data = apartments)

summary(fit1)

## 
## Call:
## lm(formula = Price ~ Age, data = apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

The estimate for Age is −8.975, meaning that for every additional year an apartment is older, its price per m² decreases by about 9 EUR on average. Since the p-value is 0.034 < 0.05, this negative effect is statistically significant older apartments really do tend to be cheaper.

Coefition of corelation \[ r = -\sqrt{0.05302} \approx -0.23 \], this means there is a weak negative relationship between age and price as apartments get older, their price tends to go down slightly.

Coefitient of Determination is \[ r = -\sqrt{0.05302} \approx -0.23 \], which means that 5.3% of variation in apartment prices is explained by age.

Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

library(car)

## Loading required package: carData

subsetOfApartments <- apartments[, c("Age", "Distance", "Price")]


scatterplotMatrix(subsetOfApartments, ,
      main = "Scatterplot Matrix: Price, Age, and Distance",
      col = "blue",
      smooth = FALSE)

Based on the given output there is no signs of multicolinearity, the points are randomly scattared and based on that there is no strongly corelated variables.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, apartments)

Chech the multicolinearity with VIF statistics. Explain the findings.

library(car)

vif(fit2)

##      Age Distance 
## 1.001845 1.001845

The VIF values for both Age and Distance are close to 1. This means there is no multicollinearity problem — Age and Distance do not overlap much in the information they provide about Price. We can keep both variables in the model.

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

stdRes <- rstandard(fit2)

cooks_d <- cooks.distance(fit2)
potenProbl <- data.frame(
  Price = apartments$Price,
  Age = apartments$Age,
  Distance = apartments$Distance,
  StdResidual = stdRes,
  CooksD = cooks_d
)

head(potenProbl)

##   Price Age Distance StdResidual      CooksD
## 1  1640   7       28  -0.6653487 0.007386569
## 2  2800  18        1   1.7832876 0.030365432
## 3  1660   7       28  -0.5937629 0.005882612
## 4  1850  28       29   0.7543794 0.008299153
## 5  1640  18       18  -1.0733987 0.005112584
## 6  1770  28       12  -0.7775190 0.004900891

n <- nrow(apartments)

thresholdCook <- 4 / n#


apartmentsClean <- apartments[
  abs(stdRes) <= 3 & cooks_d <= thresholdCook, ]

fit2_clean <- lm(Price ~ Age + Distance, data = apartmentsClean)

summary(fit2_clean)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = apartmentsClean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -411.50 -203.69  -45.24  191.11  492.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2502.467     75.024  33.356  < 2e-16 ***
## Age           -8.674      3.221  -2.693  0.00869 ** 
## Distance     -24.063      2.692  -8.939 1.57e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared:  0.5361, Adjusted R-squared:  0.524 
## F-statistic: 44.49 on 2 and 77 DF,  p-value: 1.437e-13

Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings.

std_res <- rstandard(fit2)
std_fitted <- scale(fitted(fit2))

plot(std_fitted, std_res,
     xlab = "Standardized Fitted Values",
     ylab = "Standardized Residuals",
     main = "Residuals vs Fitted (Check for Heteroskedasticity)",
     pch = 19, col = "blue")

abline(h = 0, lty = 2, col = "red")

This plot suggests that heteroskedasticity is not a serious problem in this model. The residuals seem to have roughly constant variance, so the regression assumption of homoskedasticity is likely satisfied.

Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings.

std_res <- rstandard(fit2)


hist(std_res,
     breaks = 15,
     main = "Histogram of Standardized Residuals",
     xlab = "Standardized Residuals",
     col = "lightblue", border = "white")

#sapiro test
shapiro.test(std_res)

## 
##  Shapiro-Wilk normality test
## 
## data:  std_res
## W = 0.95306, p-value = 0.00366

The residuals are close to normal but not perfect — there are a few outliers and a slight skew. The Shapiro–Wilk test confirms this (p < 0.05)

Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients.

#I already have fit2 object from earlier so I just re-use it
std_res <- rstandard(fit2_clean)
cooks_d <- cooks.distance(fit2_clean)


summary(fit2_clean)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = apartmentsClean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -411.50 -203.69  -45.24  191.11  492.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2502.467     75.024  33.356  < 2e-16 ***
## Age           -8.674      3.221  -2.693  0.00869 ** 
## Distance     -24.063      2.692  -8.939 1.57e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared:  0.5361, Adjusted R-squared:  0.524 
## F-statistic: 44.49 on 2 and 77 DF,  p-value: 1.437e-13

After removing outliers, the regression model is stronger and explains more than half of the variation in prices. Both Age and Distance are significant: older and farther apartments are cheaper.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = apartments)
summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2301.667     94.271  24.415  < 2e-16 ***
## Age           -6.799      3.110  -2.186  0.03172 *  
## Distance     -18.045      2.758  -6.543 5.28e-09 ***
## Parking      196.168     62.868   3.120  0.00251 ** 
## Balcony        1.935     60.014   0.032  0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11

With function anova check if model fit3 fits data better than model fit2.

anova(fit2, fit3)

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     82 6720983                              
## 2     80 5991088  2    729894 4.8732 0.01007 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The reasult of p-value < 0.05, we conclude that fit3 is significantly better. Adding Parking and Balcony makes the model better.

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2301.667     94.271  24.415  < 2e-16 ***
## Age           -6.799      3.110  -2.186  0.03172 *  
## Distance     -18.045      2.758  -6.543 5.28e-09 ***
## Parking      196.168     62.868   3.120  0.00251 ** 
## Balcony        1.935     60.014   0.032  0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11

The F-statistic of 20.03 tells us that our model explains apartment prices much better than a model with no predictors. The degrees of freedom (4 and 80) show that we used 4 predictors and had 80 pieces of information left to estimate error. Because the p-value is almost zero, we can be very confident that at least one variable (Age, Distance, Parking, or Balcony) really affects apartment price.

Save fitted values and claculate the residual for apartment ID2.

apartments$fitted_values <- fitted(fit3)

apartments$residuals <- apartments$Price - apartments$fitted_values

apartments[2, c("Price", "fitted_values", "residuals")]

## # A tibble: 1 × 3
##   Price fitted_values residuals
##   <dbl>         <dbl>     <dbl>
## 1  2800         2357.      443.