For the first excercise I choose a data set of Nvidia Daily Stock Price Data sales I got from kaggle.com
First I read my data from .csv file format into object
mydata using the function read.table. I used an absolute
path to get to my .csv file, because it was in different folder than
this markdown file.
mydata <- read.table("C:/Delo/Sola/IMB/Bootcamp/TakeHomeExam-R/Task 1/nvidia.csv", header=TRUE, sep=";", dec=",")
head(mydata)## Date Open High Low Close Volume
## 1 2004-01-02 0.196417 0.199083 0.192333 0.192333 436416000
## 2 2004-01-05 0.195250 0.199917 0.193500 0.198583 575292000
## 3 2004-01-06 0.198000 0.209417 0.197083 0.206667 1093344000
## 4 2004-01-07 0.204333 0.209500 0.202917 0.208500 673032000
## 5 2004-01-08 0.211083 0.212083 0.207250 0.209250 433752000
## 6 2004-01-09 0.207833 0.214833 0.206167 0.212250 766584000
1.1 Explanation of dataset
The variables in the analysis are explained as following:
1.2 Performing some data manipulation
#renaming column names
colnames(mydata) <- c("Date", "OpenPrice", "HighPrice", "LowPrice", "ClosePrice", "Volume")
#function for creating new variable of daily return
mydata[["Return"]] <- (mydata[["ClosePrice"]] - mydata[["OpenPrice"]]) / mydata[["OpenPrice"]]
#this is a function I found for deleting all the rows with missing data
mydata <- na.omit(mydata)
#created a new object for new dataframe called positiveRreturn where I would store only days where return > 0
positiveReturn <- subset(mydata, Return > 0)
# print resoults
head(positiveReturn)## Date OpenPrice HighPrice LowPrice ClosePrice Volume Return
## 2 2004-01-05 0.195250 0.199917 0.193500 0.198583 575292000 0.017070434
## 3 2004-01-06 0.198000 0.209417 0.197083 0.206667 1093344000 0.043772763
## 4 2004-01-07 0.204333 0.209500 0.202917 0.208500 673032000 0.020393135
## 6 2004-01-09 0.207833 0.214833 0.206167 0.212250 766584000 0.021252579
## 7 2004-01-12 0.213000 0.215333 0.211000 0.214667 541980000 0.007826328
## 10 2004-01-15 0.200583 0.204083 0.197000 0.202417 609204000 0.009143373
1.3 Descriptive Statistics
## OpenPrice ClosePrice Volume Return
## Min. : 0.08058 Min. : 0.07858 Min. :4.564e+07 Min. :-0.1380099
## 1st Qu.: 0.34400 1st Qu.: 0.34325 1st Qu.:3.483e+08 1st Qu.:-0.0132584
## Median : 0.54400 Median : 0.54200 Median :4.911e+08 Median : 0.0002939
## Mean : 5.21467 Mean : 5.21704 Mean :5.588e+08 Mean : 0.0003971
## 3rd Qu.: 5.23375 3rd Qu.: 5.22900 3rd Qu.:6.817e+08 3rd Qu.: 0.0139310
## Max. :50.21600 Max. :50.40900 Max. :5.089e+09 Max. : 0.1491142
Explanation of some of the sample statistics:
Mean (ClosePrice): The average closing price is
5.21704.
This is the arithmetic average and shows the central tendency of
prices.
Median (Return): Half of the daily returns are
below 0.0002939 and half are above.
This means the stock had slightly more positive days than negative
ones.
Max (Volume): The maximum traded volume is
5.089×10⁹ shares.
This shows the most active trading day — useful for spotting unusual
market activity.
3rd Quartile (OpenPrice): The third quartile is
5.23375, meaning 75% of opening prices are at or below
this value.
Only 25% of days opened higher — these represent the strongest opening
price days.
library(ggplot2)
#histogram for OpenPrice
ggplot(mydata, aes(x = OpenPrice)) +
geom_histogram(bins = 50, fill = "steelblue", color = "white") +
labs(title = "Distribution of Opening Prices", x = "Open Price", y = "Count")Histogram for variable OpenPrice: Most opening prices are very low and close to zero, which is why there’s a tall bar on the left. There are fewer days with high opening prices, but a few go all the way up to around 50. This shows the data is very uneven, with most prices small and just a few very big ones.
#histogram for Return
ggplot(mydata, aes(x = Return)) +
geom_histogram(bins = 50, fill = "darkgreen", color = "white") +
labs(title = "Distribution of Daily Returns", x = "Daily Return", y = "Count")Histogram for variable Return: The histogram shows that most daily returns are close to zero, meaning the stock usually moves only a little each day. The shape looks like a bell curve, so small gains or losses are most common. Big jumps or drops happen only sometimes, which we can see from the few bars far from the center.
#scatterplot Open vs Close
ggplot(mydata, aes(x = OpenPrice, y = ClosePrice)) +
geom_point(alpha = 0.4) +
labs(title = "Open vs. Close Prices", x = "Open Price", y = "Close Price")ScatterPlot of Open vs Close: This scatterplot shows that open and close prices move almost the same way. The points form a straight line, meaning the price at the end of the day is usually very close to the price at the start of the day.
mydata$Date <- as.Date(mydata$Date)
#boxplot of Volume
ggplot(mydata, aes(x = Date, y = Volume)) +
geom_col(fill = "steelblue") + # e.g. 2B instead of 2e+09
labs(
title = "Daily Trading Volume",
x = "Date",
y = "Volume"
) +
theme_minimal()BoxPlot of volume: This chart shows how trading volume changed over time. Most days have similar volume, but there are some very tall spikes where way more shares were traded. Probably on big news or earnings days.
First, I import the data from given .xslx file using
read_excel function from readx1 library and
stored it to object mbaStudents.
library(readxl)
mbaStudents <- read_excel("C:/Delo/Sola/IMB/Bootcamp/TakeHomeExam-R/Task 2/Business School.xlsx")
head(mbaStudents)## # A tibble: 6 × 9
## `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade`
## <dbl> <chr> <dbl> <dbl>
## 1 1 Business 68.4 90.2
## 2 2 Computer Science 70.2 68.7
## 3 3 Finance 76.4 83.3
## 4 4 Business 82.6 88.7
## 5 5 Finance 76.9 75.4
## 6 6 Computer Science 83.3 82.1
## # ℹ 5 more variables: `Work Experience` <chr>, `Employability (Before)` <dbl>,
## # `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>
Question 1
library(ggplot2)
ggplot(mbaStudents, aes(x = `Undergrad Degree`)) +
geom_bar(fill = "hotpink") +
labs(
title = "Distribution of Undergraduate Degrees",
x = "Undergrad Degree",
y = "Number of Students"
) +
theme_minimal()In the shown chart the tallest bar is that of the most common undergrad degree in this particular .csv data set this is Business.
Question 2
First I ran summary function on
Annual Salary column from dataset stored in object
mbaStudents.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20000 87125 103500 109058 124000 340000
That gave me the following descriptive statistics: Minimum
Sallary, 1st Quartile, Median, Mean, 3rd quartile and Maximum.
Than i used a ggplot function to draw me a
histogram for Distribution od Annual salaries. For the correction of
notation on x-axis from, for example 1e+05 to 100,000, I used
label_comma() from scales library.
library(scales)
ggplot(mbaStudents, aes(x = `Annual Salary`)) +
geom_histogram(bins = 20, fill = "darkgreen", color = "white") +
scale_x_continuous(labels = label_comma()) +
labs(
title = "Distribution of Annual Salaries",
x = "Annual Salary",
y = "Count"
) +
theme_minimal()Most salaries are clustered around the 100000$ salary mark, with a few outliers with very high salaries on the right side, so the distribution has a slight positive skew. The mean is higher than the median because of these few very high salaries.
Question 3
Using t.test() to test given hypothesis \(H_0: \mu_{\text{MBA Grade}} = 74\).
##
## One Sample t-test
##
## data: mbaStudents$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
Function t-test() was performed to check whether the
mean MBA grade is equal to 74. The sample mean was 76.04, which is
higher than 74. The test produced a t-value of 2.66 with 99
degrees of freedom and a p-value of 0.00915.
Since the p-value is less than 0.05, we reject the null
hypothesis and conclude that the average MBA grade is significantly
different from 74.
The 95% confidence interval (74.52, 77.56) suggests that the true
mean lies in this range and since this range does not include 74, the
difference is statistically significant. ***
library(readxl)
apartments <- read_excel("C:/Delo/Sola/IMB/Bootcamp/TakeHomeExam-R/Task 3/Apartments.xlsx")
head(apartments)## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description:
apartments$ParkingF <- factor(apartments$Parking, levels = c(0, 1),
labels = c("No Parking", "Parking"))
apartments$BalconyF <- factor(apartments$Balcony, levels = c(0, 1),
labels = c("No Balcony", "Balcony"))
head(apartments)## # A tibble: 6 × 7
## Age Distance Price Parking Balcony ParkingF BalconyF
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
## 1 7 28 1640 0 1 No Parking Balcony
## 2 18 1 2800 1 0 Parking No Balcony
## 3 7 28 1660 0 0 No Parking No Balcony
## 4 28 29 1850 0 1 No Parking Balcony
## 5 18 18 1640 1 1 Parking Balcony
## 6 28 12 1770 0 1 No Parking Balcony
##
## One Sample t-test
##
## data: apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
The average apartment price is 2018.9 EUR. The t-test gives t = 2.90, p = 0.005. Since the p-value is < 0.05, we reject the null hypothesis that the mean price is 1900 EUR. The 95% confidence interval (1937.4, 2100.4) suggests that the true mean price is in this range. So the conclusion is that apartments are on average more expensive than 1900 EUR.
##
## Call:
## lm(formula = Price ~ Age, data = apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
The estimate for Age is −8.975, meaning that for every additional year an apartment is older, its price per m² decreases by about 9 EUR on average. Since the p-value is 0.034 < 0.05, this negative effect is statistically significant older apartments really do tend to be cheaper.
Coefition of corelation \[ r = -\sqrt{0.05302} \approx -0.23 \], this means there is a weak negative relationship between age and price as apartments get older, their price tends to go down slightly.
Coefitient of Determination is \[ r = -\sqrt{0.05302} \approx -0.23 \], which means that 5.3% of variation in apartment prices is explained by age.
## Loading required package: carData
subsetOfApartments <- apartments[, c("Age", "Distance", "Price")]
scatterplotMatrix(subsetOfApartments, ,
main = "Scatterplot Matrix: Price, Age, and Distance",
col = "blue",
smooth = FALSE)
Based on the given output there is no signs of multicolinearity, the
points are randomly scattared and based on that there is no strongly
corelated variables.
## Age Distance
## 1.001845 1.001845
The VIF values for both Age and Distance are close to 1. This means there is no multicollinearity problem — Age and Distance do not overlap much in the information they provide about Price. We can keep both variables in the model.
stdRes <- rstandard(fit2)
cooks_d <- cooks.distance(fit2)
potenProbl <- data.frame(
Price = apartments$Price,
Age = apartments$Age,
Distance = apartments$Distance,
StdResidual = stdRes,
CooksD = cooks_d
)
head(potenProbl)## Price Age Distance StdResidual CooksD
## 1 1640 7 28 -0.6653487 0.007386569
## 2 2800 18 1 1.7832876 0.030365432
## 3 1660 7 28 -0.5937629 0.005882612
## 4 1850 28 29 0.7543794 0.008299153
## 5 1640 18 18 -1.0733987 0.005112584
## 6 1770 28 12 -0.7775190 0.004900891
n <- nrow(apartments)
thresholdCook <- 4 / n#
apartmentsClean <- apartments[
abs(stdRes) <= 3 & cooks_d <= thresholdCook, ]
fit2_clean <- lm(Price ~ Age + Distance, data = apartmentsClean)
summary(fit2_clean)##
## Call:
## lm(formula = Price ~ Age + Distance, data = apartmentsClean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -411.50 -203.69 -45.24 191.11 492.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2502.467 75.024 33.356 < 2e-16 ***
## Age -8.674 3.221 -2.693 0.00869 **
## Distance -24.063 2.692 -8.939 1.57e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared: 0.5361, Adjusted R-squared: 0.524
## F-statistic: 44.49 on 2 and 77 DF, p-value: 1.437e-13
std_res <- rstandard(fit2)
std_fitted <- scale(fitted(fit2))
plot(std_fitted, std_res,
xlab = "Standardized Fitted Values",
ylab = "Standardized Residuals",
main = "Residuals vs Fitted (Check for Heteroskedasticity)",
pch = 19, col = "blue")
abline(h = 0, lty = 2, col = "red") This plot suggests that heteroskedasticity is not a serious problem in this model. The residuals seem to have roughly constant variance, so the regression assumption of homoskedasticity is likely satisfied.
std_res <- rstandard(fit2)
hist(std_res,
breaks = 15,
main = "Histogram of Standardized Residuals",
xlab = "Standardized Residuals",
col = "lightblue", border = "white")##
## Shapiro-Wilk normality test
##
## data: std_res
## W = 0.95306, p-value = 0.00366
The residuals are close to normal but not perfect — there are a few
outliers and a slight skew. The Shapiro–Wilk test confirms this
(p < 0.05)
#I already have fit2 object from earlier so I just re-use it
std_res <- rstandard(fit2_clean)
cooks_d <- cooks.distance(fit2_clean)
summary(fit2_clean)##
## Call:
## lm(formula = Price ~ Age + Distance, data = apartmentsClean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -411.50 -203.69 -45.24 191.11 492.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2502.467 75.024 33.356 < 2e-16 ***
## Age -8.674 3.221 -2.693 0.00869 **
## Distance -24.063 2.692 -8.939 1.57e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared: 0.5361, Adjusted R-squared: 0.524
## F-statistic: 44.49 on 2 and 77 DF, p-value: 1.437e-13
After removing outliers, the regression model is stronger and explains more than half of the variation in prices. Both Age and Distance are significant: older and farther apartments are cheaper.
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## Parking 196.168 62.868 3.120 0.00251 **
## Balcony 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 82 6720983
## 2 80 5991088 2 729894 4.8732 0.01007 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The reasult of p-value < 0.05, we conclude that fit3 is significantly better. Adding Parking and Balcony makes the model better.
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## Parking 196.168 62.868 3.120 0.00251 **
## Balcony 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
The F-statistic of 20.03 tells us that our model explains apartment prices much better than a model with no predictors. The degrees of freedom (4 and 80) show that we used 4 predictors and had 80 pieces of information left to estimate error. Because the p-value is almost zero, we can be very confident that at least one variable (Age, Distance, Parking, or Balcony) really affects apartment price.
apartments$fitted_values <- fitted(fit3)
apartments$residuals <- apartments$Price - apartments$fitted_values
apartments[2, c("Price", "fitted_values", "residuals")]## # A tibble: 1 × 3
## Price fitted_values residuals
## <dbl> <dbl> <dbl>
## 1 2800 2357. 443.