.columns {display: flex;}
h1 {color: crimson;font-family: "Copperplate", fantasy; font-size: 36px;}
h2 {color: indigo;font-family: "Copperplate", fantasy; font-size: 36px;}
h3 {color: salmon;font-family: "Copperplate", fantasy; font-size: 36px;}
h1.title {color: darkgreen;font-family: "Copperplate", fantasy; font-size: 48px;}
.author {color: goldenrod ;font-family: "Copperplate", fantasy; font-size: 24px;}
.date {color: olive ;font-family: "Copperplate", fantasy; font-size: 18px;}
data(package = .packages(all.available = TRUE))
#install.packages("reshape2")
library(reshape2)
mydata <- force(tips)
head(mydata)
## total_bill tip sex smoker day time size
## 1 16.99 1.01 Female No Sun Dinner 2
## 2 10.34 1.66 Male No Sun Dinner 3
## 3 21.01 3.50 Male No Sun Dinner 3
## 4 23.68 3.31 Male No Sun Dinner 2
## 5 24.59 3.61 Female No Sun Dinner 4
## 6 25.29 4.71 Male No Sun Dinner 4
Explanaiton of variables:
- total_bill: amount of the bill in dollars
- tip: amount of the tip in dollars
- sex: sex of the bill payer
- smoker: whether there were smokers in the party
- day: day of the week
- time: time of the day
- size: size of the party
colnames(mydata) [1] <- "Total Bill"
colnames(mydata) [2] <- "Tip"
colnames(mydata) [3] <- "Sex"
colnames(mydata) [4] <- "Smoker"
colnames(mydata) [5] <- "Day"
colnames(mydata) [6] <- "Time"
colnames(mydata) [7] <- "Size"
head(mydata)
## Total Bill Tip Sex Smoker Day Time Size
## 1 16.99 1.01 Female No Sun Dinner 2
## 2 10.34 1.66 Male No Sun Dinner 3
## 3 21.01 3.50 Male No Sun Dinner 3
## 4 23.68 3.31 Male No Sun Dinner 2
## 5 24.59 3.61 Female No Sun Dinner 4
## 6 25.29 4.71 Male No Sun Dinner 4
Here I changed/renamed the name of the variables, so they start with a capital letter and the table looks neater.
mydataWeekend <- subset(mydata, !(Day %in% c("Thur", "Fri")))
head(mydataWeekend)
## Total Bill Tip Sex Smoker Day Time Size
## 1 16.99 1.01 Female No Sun Dinner 2
## 2 10.34 1.66 Male No Sun Dinner 3
## 3 21.01 3.50 Male No Sun Dinner 3
## 4 23.68 3.31 Male No Sun Dinner 2
## 5 24.59 3.61 Female No Sun Dinner 4
## 6 25.29 4.71 Male No Sun Dinner 4
I created a new dataframe by excluding the observations that have been collected either on friday or thursday, so I have a dataframe which only includes the data for the weekend.
mydata$Sex <- factor(mydata$Sex,
levels = c("Male", "Female"),
labels = c("M", "F"))
head(mydata)
## Total Bill Tip Sex Smoker Day Time Size
## 1 16.99 1.01 F No Sun Dinner 2
## 2 10.34 1.66 M No Sun Dinner 3
## 3 21.01 3.50 M No Sun Dinner 3
## 4 23.68 3.31 M No Sun Dinner 2
## 5 24.59 3.61 F No Sun Dinner 4
## 6 25.29 4.71 M No Sun Dinner 4
Here I changed variable Sex from Male” and “Female” to “M” and “F” for possible later purposes.
summary(mydata)
## Total Bill Tip Sex Smoker Day Time
## Min. : 3.07 Min. : 1.000 M:157 No :151 Fri :19 Dinner:176
## 1st Qu.:13.35 1st Qu.: 2.000 F: 87 Yes: 93 Sat :87 Lunch : 68
## Median :17.80 Median : 2.900 Sun :76
## Mean :19.79 Mean : 2.998 Thur:62
## 3rd Qu.:24.13 3rd Qu.: 3.562
## Max. :50.81 Max. :10.000
## Size
## Min. :1.00
## 1st Qu.:2.00
## Median :2.00
## Mean :2.57
## 3rd Qu.:3.00
## Max. :6.00
Here I provided descriptive statistics for selected variables “Total Bill”and “Size”.
Mean:
19.79 —> explains that on average the amount of the bill was 19.79 dollars
2.57 —> explains that on average the size of the party was 2.57 persons
Median:
17.80 —> half of the amounts of bills where lower than 17.80 dollars and the other half were higher than 17.80 dollars
2.00 —> half of the parties were smaller than 2 persons and the other half were bigger than 2 persons
Max:
50.81 —> the highest bill in the dataset amounts to 50.81 dollars
6.00 —> the size of biggest party in the dataset amounts to 6
hist(mydata$Tip,
main = "Distribution of the variable Tip",
ylab = "Frequency",
xlab = "Tip in dollars",
breaks = seq(0, 12, (0.5)),
right = FALSE)
Here I provided you a histogram that shows the distribution of the variable Tip. The two peaks that can be seen in the histogram indicate that it is a case of a bimodal distribution. Since most of the distribution is on the left side, I would argue that it is right skewed.
boxplot(mydata$`Total Bill`,
main = "Boxplot of Total Bill",
ylab = "Value in dollars",
col = "lightgreen",
border = "black")
Here I provided Boxplot of variable “Total Bill”. Half of the values of the variable lie in the green box, which is called interquartile range. Bottom whisker represents the minimum and the upper whisker represents maximum (the exact value of maximum and minimum is provided in descriptive statistic above). We can also see some outliers, these points are considered unusually high compared to the rest of the data. The line the box represents the Median which sits at 17.80 as already stated before.
#install.packages("readxl")
library(readxl)
Business_School <- read_excel("R Take Home Exam 2024/Task 2/Business School.xlsx")
Here I imported the new dataset and named it Business_School.
library(ggplot2)
ggplot(Business_School, aes(x=`Undergrad Degree`,))+
geom_bar(colour="blue", fill = "lightblue")+
xlab("Undergrad Degrees")+
ylab("Number of degrees")
Above I provided a histogram displaying the distribution of undergrad degrees among all 100 MBA students. We can see that the most common one is Bussines degree (35).
#install.packages("summarytools")
library(summarytools)
descr(Business_School$`Annual Salary`)
## Descriptive Statistics
## Business_School$`Annual Salary`
## N: 100
##
## Annual Salary
## ----------------- -----------------
## Mean 109058.00
## Std.Dev 41501.49
## Min 20000.00
## Q1 86750.00
## Median 103500.00
## Q3 124000.00
## Max 340000.00
## MAD 25945.50
## IQR 36875.00
## CV 0.38
## Skewness 2.22
## SE.Skewness 0.24
## Kurtosis 9.41
## N.Valid 100.00
## Pct.Valid 100.00
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
describeBy(Business_School$`Annual Salary`)
## Warning in describeBy(Business_School$`Annual Salary`): no grouping variable
## requested
## vars n mean sd median trimmed mad min max range skew
## X1 1 100 109058 41501.49 103500 104600.2 25945.5 20000 340000 320000 2.22
## kurtosis se
## X1 9.41 4150.15
Here I displayed descriptive statistics of the variable “Annual Salary” with two different functions.
Business_School$`Annual Salary2` <- Business_School$`Annual Salary` / 1000
First off all I made a new variable named “Annual Salary_2” which presents the values of Annual Salary in the thousands, since the histogram made with the initial variable “Annual Salary” looks weird, because of the high values of the variable.
library(ggplot2)
ggplot(Business_School, aes(x=`Annual Salary2`))+
geom_histogram(color="grey4", fill= "green4")+
xlab("Annual Salary in thousands")+
ylab("Frequency")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can describe the distribution of the histogram as follows:
The histogram is unimodal, that means it has only one high, which lies at 100 (Annual salary in thousands). Since most of the distribution is on the left side, it is right skewed. On the right side we can also see some outliers. Latter means that there are a few values that are much higher than the rest of the values.
t.test(Business_School$`MBA Grade`,
mu=74,
alternative = "two.sided")
##
## One Sample t-test
##
## data: Business_School$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
With the t.test function I tested null hypothesis (𝐻0:𝜇MBA Grade=74, 𝐻1:𝜇MBA Grade≠74). The P value of 0.00915 is significantly lower than 0.05, therefore I can reject the null hypothesis. That means that the true mean MBA grade differs from 74, sample mean of the MBA grades is 76.04055, which is obviously greater than 74.
library(effectsize)
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
effectsize::cohens_d(Business_School$`MBA Grade`,
mu=74)
## Cohen's d | 95% CI
## ------------------------
## 0.27 | [0.07, 0.46]
##
## - Deviation from a difference of 74.
The value of Cohen’s d being 0.27, indicates that there is a small effect size with (95% confidence). There is a modest difference between this year’s generation and last year’s generation.
library(readxl)
Apartments <- read_excel("R Take Home Exam 2024/Task 3/Apartments.xlsx")
head(Apartments)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description:
- Age: Age of an apartment in years
- Distance: The distance from city center in km
- Price: Price per m2
- Parking: 0-No, 1-Yes
- Balcony: 0-No, 1-Yes
Apartments$Balcony <- factor(Apartments$Balcony,
levels = c(0, 1),
labels = c("No", "Yes"))
Apartments$Parking <- factor(Apartments$Parking,
levels = c(0, 1),
labels = c("No", "Yes"))
I changed categorical variables into factors.
t.test(Apartments$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
Since the P value I got with T-Test is significantly lower than 0.05, I can reject the null hypothesis, therefore I can say that the true mean differs from 1900. With 95 % confidence I can say that true mean of apartment prices per m2 lies in the interval from 1937.443 to 2100.440.
fit1 <- lm(Price ~ Age, data = Apartments)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
sqrt(summary(fit1)$r.squared)
## [1] 0.230255
Estimate of regression coefficient = -8.975
If the age of an apartment increases by 1 year, the price per m2 decreases by 8.975 euros on average (p = 0.034).
Correlation coefficient = 0.230355
The linear correlation between price and age of an apartment is weak based on the value of correlation coefficient being 0.23.
Coefficient of determination = 0.05302
This coefficient indicates the proportion of the total variability of the dependent variable that can be explained by the linear effect of all explanatory variables. Therefore only 5.3 % of the variability of the price per m2 is explained by the linear effect of age.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplotMatrix(Apartments[c("Price", "Age", "Distance")], smooth = FALSE)
Scatterplots between Age and Distance do not show any kind of relationship between the two variables, since the line is horizontal. Change in one variable does not affect the other, so there is currently no problem with multiciolinearity.
fit2 <- lm(Price ~ Age + Distance, data = Apartments)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
library(car)
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
Both VIF values are close to 1, which indicates no multicollinearity between the two independent variables. They are not correlated.
Apartments$StdResid <- round(rstandard(fit2), 3)
Here I calculated all standardized residuals and stored them as StdResid. All values are between 3 and -3 (I checked the values in the table “Apartments”), so I can conclude that here there are no outliers.
Apartments$CooksD <- round(cooks.distance(fit2), 3)
hist(Apartments$CooksD,
xlab = "Cooks distance",
ylab = "Freequency",
main = "Histogram of Cooks distances")
First of all I saved Cooks distances, which I caluclated with cooks.distance function. Then I made a histogram providing me their distribution. From the histogram I can see that there is at least one value that is to some extent greater than the values of other units. We can see that by a gap between 0.15 and 0.30.
head(Apartments[order(-Apartments$CooksD), "CooksD"], 10)
## # A tibble: 10 × 1
## CooksD
## <dbl>
## 1 0.32
## 2 0.104
## 3 0.069
## 4 0.066
## 5 0.061
## 6 0.038
## 7 0.037
## 8 0.034
## 9 0.032
## 10 0.03
By printing the 10 highest cooks distance values we can see that the first 2 values are much greater than the others. On top of that 3rd, 4th and 5th value are also slitly higher than the others, so I decided to delete them all with the function below.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Apartments <- Apartments %>%
filter(!CooksD %in% c(0.320, 0.104, 0.069, 0.066, 0.061))
.
fit2 <- lm(Price ~ Age + Distance, data = Apartments)
Apartments$StdFittedValues <- scale(fit2$fitted.values)
I already estimated/created the fit2 again without potentially excluded units and show the summary of the model here, because I could not create standardized fitted values, since I deleted 5 units (one data included 80 and the other 85 units). I also stored standardized fitted values.
library(car)
scatterplot(y=Apartments$StdResid, x=Apartments$StdFittedValues,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
The points in the scatterplot of the standardized residuals against the standardized fitted values are from what I see randomly distributed, which inidicated that there is no heteroskedasticity. To be completely sure I will also conduct Breuch-Pagan test.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 1.738591
## Prob > Chi2 = 0.1873174
This test includes following hypothesis:
-𝐻0:the variance of errors is constant (homoskedasticity)
-𝐻1:the variance is errors not constant (heteroskedasticity)
P-value is above 0.05, therefore null hypothesis can not be rejected, and I assume homoskedasticity.
hist(Apartments$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
The histogram is slightly right skewed. We also see here that all values are between -3 and 3 which is a threshold used to identify outliers. We will test distribution with Shapiro test.
shapiro.test(Apartments$StdResid)
##
## Shapiro-Wilk normality test
##
## data: Apartments$StdResid
## W = 0.93418, p-value = 0.0004761
Now that we formally tested the distribution we see that the P-value is lower than 0.05 (0.001166), which means that the null hypothesis (standard residuals are normally distributed) can be rejected. That shows that standard residuals are not distributed normally. But since our sample is over the size of 30, a violation of this assumption does not significantly affect our results.
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -411.50 -203.69 -45.24 191.11 492.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2502.467 75.024 33.356 < 2e-16 ***
## Age -8.674 3.221 -2.693 0.00869 **
## Distance -24.063 2.692 -8.939 1.57e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared: 0.5361, Adjusted R-squared: 0.524
## F-statistic: 44.49 on 2 and 77 DF, p-value: 1.437e-13
sqrt(summary(fit2)$r.squared)
## [1] 0.732187
Here are the results of the regression function without excluded units.
The coefficient of determination is 0.5361. That means that approximately 54% of the variability of the price per m2 is explained by the linear effect of age and distance (from city center in km).
Intercept of 2502.467 indicated the average value of the dependent variable when all explanatory variables are 0. But because neither Age or Distance can be valued at 0, the regression constant has no meaning.
Estimate of regression coefficient = -8.674
If the age of an apartment increases by 1 year, the price per m2 decreases by 8.674 euros on average (p = 0.00869), assuming that distance constant.
Estimate of regression coefficient = -24.063
If the distance from the city center is increased by 1 km, the price per m2 decreases on average by 24.063 (p < 0.001), assuming that age is constant.
The linear correlation between price, age and distance of an apartment is strong based on the value of correlation coefficient being 0.73.
F-statistics is used to test the statistical suitability Since P value is <0.001, we can reject the null hypothesis. That means that this model is a good model and it makes sense.
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = Apartments)
Here I estimated and saved the linear regression function Price = f(Age, Distance, Parking and Balcony) as fit3.
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 77 5077362
## 2 75 4791128 2 286234 2.2403 0.1135
With Anova I tested if the fit3 fits the data statistically better than fit2. The results shows that fit3 model does not fit the data better since p=0.1135.
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.93 -198.19 -53.64 186.73 518.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2393.316 93.930 25.480 < 2e-16 ***
## Age -7.970 3.191 -2.498 0.0147 *
## Distance -21.961 2.830 -7.762 3.39e-11 ***
## ParkingYes 128.700 60.801 2.117 0.0376 *
## BalconyYes 6.032 57.307 0.105 0.9165
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared: 0.5623, Adjusted R-squared: 0.5389
## F-statistic: 24.08 on 4 and 75 DF, p-value: 7.764e-13
Given the values of the other explanatory variables, apartments with a parking space have an average price per m2 higher by 128.700 euros compared to the apartments without a parking space (p = 0.0376).
We found no difference between the average price per m2 of two identical apartments, with the exception that one apartment has a balcony and the other does not. That is because P-value is greater than 0.05 (p=0.9165).
With F-statistics the hypothesis that is being tested is: 𝐻0: R-squared=0.
Apartments$StdFittedValues <- fitted.values(fit3)
Apartments$StdResid <- residuals(fit3)
head(Apartments[ , colnames(Apartments) %in% c("Price", "StdFittedValues", "StdResid")])
## # A tibble: 6 × 3
## Price StdResid StdFittedValues
## <dbl> <dbl> <dbl>
## 1 1640 -88.6 1729.
## 2 2800 443. 2357.
## 3 1660 -62.6 1723.
## 4 1850 311. 1539.
## 5 1640 -349. 1989.
## 6 1770 -143. 1913.
Based on the estimated regression function, we would expect the price per m2 for this apartment to be 2356.597 euros. The actual price was 2800 euros, so the residual is 443.4 euros.