Find your own data set that contains at least 20 units and at least 4
variables (most of which are numeric, but it is good to have at least
one categorical variable as well). Perform the following steps using R:
1. Explain the data set (the variables used in the analysis).
2. Perform some data manipulations (create a new variable, delete some
units due to missing data, rename variables, create a new data.frame
based on conditions, etc.).
3. Present the descriptive statistics for the selected variables and
explain at least 3 sample statistics (mean, median, etc.).
4. Graph the distribution of the variables using histograms, scatter
plots, and/or box plots. Explain the results.
mydata1 <- read.table("./dataexam.csv",
header=TRUE,
sep=",",
dec=".")
head(mydata1)
## ID Age Experience Projects Income JobTitle
## 1 1 25 2 5 30000 2
## 2 2 30 7 12 32000 4
## 3 3 22 1 3 28000 1
## 4 4 28 5 8 35000 3
## 5 5 35 10 20 37000 5
## 6 6 40 15 25 40000 6
Description of the variables:
- Age: Age of the subject.
- Experience: Experience of the subject, given in years.
- Projects: Number of projects in which the subject has taken part.
- Income: Annual income of the subject, given in US dollars.
- JobTitle: Current job position of the subject (Intern = 1, Junior
Developer = 2, Developer = 3, Software Engineer = 4, Senior Developer =
5, Lead Developer = 6).
mydata1 <- mydata1[,-1]
mydata1$JobTitle <- factor(mydata1$JobTitle,
levels = c(1, 2, 3, 4, 5, 6),
labels = c("Intern", "Junior Developer", "Developer", "Software Engineer", "Senior Developer", "Lead Developer")
)
head(mydata1)
## Age Experience Projects Income JobTitle
## 1 25 2 5 30000 Junior Developer
## 2 30 7 12 32000 Software Engineer
## 3 22 1 3 28000 Intern
## 4 28 5 8 35000 Developer
## 5 35 10 20 37000 Senior Developer
## 6 40 15 25 40000 Lead Developer
mydata1$Average <- mydata1$Projects / mydata1$Experience
mydata1$Average <- round(mydata1$Average, digits = 1)
head(mydata1)
## Age Experience Projects Income JobTitle Average
## 1 25 2 5 30000 Junior Developer 2.5
## 2 30 7 12 32000 Software Engineer 1.7
## 3 22 1 3 28000 Intern 3.0
## 4 28 5 8 35000 Developer 1.6
## 5 35 10 20 37000 Senior Developer 2.0
## 6 40 15 25 40000 Lead Developer 1.7
The new variable, Average, represents the average number of projects per year that each individual has completed at the company.
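This division assumes Experience is never zero (its minimum in this data set is 1, per the descriptive statistics below). A defensive sketch for a hypothetical future data set in which interns with zero years could appear:
# Guard against division by zero: return NA when Experience is 0
mydata1$Average <- ifelse(mydata1$Experience > 0,
                          round(mydata1$Projects / mydata1$Experience, 1),
                          NA)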
mydata1order <- mydata1[, c(5, 4, 1, 2, 3, 6)]
head(mydata1order)
## JobTitle Income Age Experience Projects Average
## 1 Junior Developer 30000 25 2 5 2.5
## 2 Software Engineer 32000 30 7 12 1.7
## 3 Intern 28000 22 1 3 3.0
## 4 Developer 35000 28 5 8 1.6
## 5 Senior Developer 37000 35 10 20 2.0
## 6 Lead Developer 40000 40 15 25 1.7
library(pastecs)
round(stat.desc(mydata1order[,-1]), 2)
## Income Age Experience Projects Average
## nbr.val 20.00 20.00 20.00 20.00 20.00
## nbr.null 0.00 0.00 0.00 0.00 0.00
## nbr.na 0.00 0.00 0.00 0.00 0.00
## min 28000.00 22.00 1.00 2.00 1.60
## max 41000.00 40.00 15.00 25.00 3.00
## range 13000.00 18.00 14.00 23.00 1.40
## sum 671000.00 589.00 125.00 231.00 38.90
## median 33000.00 29.00 6.00 10.50 1.90
## mean 33550.00 29.45 6.25 11.55 1.95
## SE.mean 789.65 1.06 0.85 1.48 0.07
## CI.mean.0.95 1652.76 2.21 1.78 3.09 0.15
## var 12471052.63 22.26 14.41 43.63 0.10
## std.dev 3531.44 4.72 3.80 6.61 0.32
## coef.var 0.11 0.16 0.61 0.57 0.16
From the results given by stat.desc we explain the following
statistics (a verification sketch follows the list):
- range: the difference between the lowest and the highest value of a
variable in the data set.
- mean: the arithmetic average of the values of a variable in the data
set.
- SE.mean: the standard error of the mean, which tells us how
precisely the sample mean estimates the population mean; a smaller
value means better precision.
- var: the variance, a measure of how much the data points deviate
from the mean; a lower value indicates the data are closer to the
mean. It is calculated as the square of the standard deviation.
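As a quick cross-check, these four statistics can be reproduced with
base R. A minimal sketch for the Income column (the relationships
SE.mean = sd/sqrt(n) and var = sd^2 hold for every variable):
inc <- mydata1order$Income
diff(range(inc))            # range: highest minus lowest value
mean(inc)                   # arithmetic average
sd(inc) / sqrt(length(inc)) # SE.mean: standard error of the mean
sd(inc)^2                   # var: square of the standard deviation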
library(ggplot2)
ggplot(mydata1order, aes(x = JobTitle)) +
geom_bar(color="Black") +
ylab("Frequency")
This bar chart helps us easily identify how many people are in each of the positions within the company in the data we are analyzing.
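The exact counts behind the bars can also be read from a frequency table; a minimal sketch:
# Frequency of each job title
table(mydata1order$JobTitle)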
library(ggplot2)
ggplot(mydata1order,aes(x = Experience, y = Income))+
geom_point(color = "black")
This is a scatter plot in which the x axis shows the years of experience and the y axis shows the annual income of the individuals in this data set. The plot suggests a positive linear relationship between these two variables: more years of experience are associated with a higher salary.
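To quantify the strength of this relationship, we could additionally compute the Pearson correlation coefficient; a minimal sketch:
# Pearson correlation between experience and income; values close to 1
# indicate a strong positive linear relationship
cor(mydata1order$Experience, mydata1order$Income)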
You have a data set for 100 MBA students of the current generation.
In the previous year, the average grade in this program was 74.
1. Graph the distribution of undergrad degrees using the ggplot
function. Which degree is the most common?
2. Show the descriptive statistics of the Annual Salary and its
distribution with the histogram (use the ggplot function). Describe
the distribution.
3. Test the following hypothesis: H0: μ(MBA Grade) = 74. Explain the
result and interpret the effect size.
#install.packages("openxlsx")
library(openxlsx)
mydata2 <- read.xlsx("~/IMB/Bootcamp/R/Exam/R Take Home Exam 2024/Task 2/Business School.xlsx")
head(mydata2)
## Student.ID Undergrad.Degree Undergrad.Grade MBA.Grade Work.Experience
## 1 1 Business 68.4 90.2 No
## 2 2 Computer Science 70.2 68.7 Yes
## 3 3 Finance 76.4 83.3 No
## 4 4 Business 82.6 88.7 No
## 5 5 Finance 76.9 75.4 No
## 6 6 Computer Science 83.3 82.1 No
## Employability.(Before) Employability.(After) Status Annual.Salary
## 1 252 276 Placed 111000
## 2 101 119 Placed 107000
## 3 401 462 Placed 109000
## 4 287 342 Placed 148000
## 5 275 347 Placed 255500
## 6 254 313 Placed 103500
library(ggplot2)
ggplot(mydata2, aes(x = Undergrad.Degree)) +
geom_bar(color="Black") +
ylab("Frequency")
With the help of this graph we can conclude that the most common
degree is Business.
library(pastecs)
round(stat.desc(mydata2$Annual.Salary), 0)
## nbr.val nbr.null nbr.na min max range
## 100 0 0 20000 340000 320000
## sum median mean SE.mean CI.mean.0.95 var
## 10905800 103500 109058 4150 8235 1722373475
## std.dev coef.var
## 41501 0
library(ggplot2)
ggplot(mydata2, aes(x = Annual.Salary)) +
geom_histogram(binwidth = 10000,color="Black") +
labs(title = "Histogram for annual salary",
x = "Salary",
y = "Frequency")+
scale_y_continuous(labels = scales::comma) +
scale_x_continuous(labels = scales::comma)
The descriptive statistics and the histogram of annual salary show
that the majority of people earn around 100,000. The histogram is
skewed to the right (positively skewed); after removing a few extreme
outliers, the distribution might be approximately normal.
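The direction of the skew can be checked numerically with the moment-based sample skewness (positive values indicate a right skew); a minimal sketch:
# Moment-based sample skewness of annual salary
sal <- mydata2$Annual.Salary
mean((sal - mean(sal))^3) / sd(sal)^3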
t.test(mydata2$MBA.Grade, mu = 74)
##
## One Sample t-test
##
## data: mydata2$MBA.Grade
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
After performing the t-test we can reject the null hypothesis (p < 0.05): the average MBA grade of the current generation (76.04) is significantly different from 74.
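The task also asks us to interpret the effect size. For a one-sample t-test this is commonly measured with Cohen's d, the standardized difference between the sample mean and the hypothesized value; a minimal sketch (conventionally, |d| of about 0.2 is read as a small effect, 0.5 medium, and 0.8 large):
# Cohen's d for a one-sample t-test: (sample mean - mu0) / sample sd
(mean(mydata2$MBA.Grade) - 74) / sd(mydata2$MBA.Grade)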
You analyze the price per m2 for a sample of apartments in the Ljubljana region (Apartments.xlsx in Task 3 folder). Follow the questions in R Markdown included in the folder.
library(openxlsx)
mydata3 <- read.xlsx("~/IMB/Bootcamp/R/Exam/R Take Home Exam 2024/Task 3/Apartments.xlsx")
mydata3$ID <- seq_len(nrow(mydata3))
head(mydata3)
## Age Distance Price Parking Balcony ID
## 1 7 28 1640 0 1 1
## 2 18 1 2800 1 0 2
## 3 7 28 1660 0 0 3
## 4 28 29 1850 0 1 4
## 5 18 18 1640 1 1 5
## 6 28 12 1770 0 1 6
Description: Parking and Balcony are binary indicators of whether an apartment has a parking spot or a balcony (0 = No, 1 = Yes), so we recode them as factors.
mydata3$Parking <- factor(mydata3$Parking,
levels = c(0, 1),
labels = c("No", "Yes")
)
mydata3$Balcony <- factor(mydata3$Balcony,
levels = c(0, 1),
labels = c("No", "Yes")
)
head(mydata3)
## Age Distance Price Parking Balcony ID
## 1 7 28 1640 No Yes 1
## 2 18 1 2800 Yes No 2
## 3 7 28 1660 No No 3
## 4 28 29 1850 No Yes 4
## 5 18 18 1640 Yes Yes 5
## 6 28 12 1770 No Yes 6
t.test(mydata3$Price, mu = 1900)
##
## One Sample t-test
##
## data: mydata3$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
We can reject the null hypothesis (p < 0.05): the mean price (2018.94 euros per m2) is significantly different from 1900 euros, and the 95% confidence interval (1937.44 to 2100.44) does not include 1900.
fit1 <- lm(formula = Price ~ Age,
data = mydata3)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
library(car)
## Loading required package: carData
scatterplotMatrix(mydata3[ , c(1, 2, 3)],
smooth = FALSE)
By visually analyzing the matrix we see no clear signs of multicollinearity: the panel relating the two explanatory variables, Age and Distance, shows no strong linear relationship between them.
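A numeric complement to the visual check is the correlation between the two explanatory variables; a minimal sketch:
# Correlation between the predictors; a value close to 0 suggests
# multicollinearity is unlikely
cor(mydata3$Age, mydata3$Distance)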
fit2 <- lm(formula = Price ~ Age + Distance,
data = mydata3)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
Given the VIF statistics (both close to 1, the lowest possible value) we can confirm that there is very low correlation between these two explanatory variables, so multicollinearity will not be a problem in our model.
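For intuition, the VIF of a predictor is 1 / (1 - R^2) from regressing that predictor on all the other predictors; a minimal sketch reproducing the VIF of Age:
# VIF of Age by hand: regress Age on the remaining predictor
r2 <- summary(lm(Age ~ Distance, data = mydata3))$r.squared
1 / (1 - r2)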
mydata3$StdResid <- round(rstandard(fit2), 2)
mydata3$CooksD <- round(cooks.distance(fit2), 2)
hist(mydata3$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
hist(mydata3$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
head(mydata3[order(-mydata3$StdResid),], 5)
## Age Distance Price Parking Balcony ID StdResid CooksD
## 38 5 45 2180 Yes Yes 38 2.58 0.32
## 33 2 11 2790 Yes No 33 2.05 0.07
## 2 18 1 2800 Yes No 2 1.78 0.03
## 61 18 1 2800 Yes Yes 61 1.78 0.03
## 58 8 2 2820 Yes No 58 1.66 0.04
head(mydata3[order(-mydata3$CooksD),], 5)
## Age Distance Price Parking Balcony ID StdResid CooksD
## 38 5 45 2180 Yes Yes 38 2.58 0.32
## 55 43 37 1740 No No 55 1.44 0.10
## 33 2 11 2790 Yes No 33 2.05 0.07
## 53 7 2 1760 No Yes 53 -2.15 0.07
## 22 37 3 2540 Yes Yes 22 1.58 0.06
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:pastecs':
##
## first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata3 <- mydata3 %>%
filter(!ID %in% c(38, 55))
head(mydata3[order(-mydata3$StdResid),], 5)
## Age Distance Price Parking Balcony ID StdResid CooksD
## 33 2 11 2790 Yes No 33 2.05 0.07
## 2 18 1 2800 Yes No 2 1.78 0.03
## 59 18 1 2800 Yes Yes 61 1.78 0.03
## 56 8 2 2820 Yes No 58 1.66 0.04
## 55 10 1 2810 No No 57 1.60 0.03
head(mydata3[order(-mydata3$CooksD),], 5)
## Age Distance Price Parking Balcony ID StdResid CooksD
## 33 2 11 2790 Yes No 33 2.05 0.07
## 52 7 2 1760 No Yes 53 -2.15 0.07
## 22 37 3 2540 Yes Yes 22 1.58 0.06
## 38 40 2 2400 No Yes 39 1.09 0.04
## 56 8 2 2820 Yes No 58 1.66 0.04
fit2 <- lm(formula = Price ~ Age + Distance,
data = mydata3)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -627.27 -212.96 -46.23 205.05 578.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2490.112 76.189 32.684 < 2e-16 ***
## Age -7.850 3.244 -2.420 0.0178 *
## Distance -23.945 2.826 -8.473 9.53e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared: 0.4968, Adjusted R-squared: 0.4842
## F-statistic: 39.49 on 2 and 80 DF, p-value: 1.173e-12
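One caveat: the StdResid and CooksD columns still hold the values computed from the model fitted before the two outliers were removed. For the diagnostics below it would arguably be safer to recompute them from the refitted model; a minimal sketch:
# Recompute the standardized residuals from the refitted model
mydata3$StdResid <- round(rstandard(fit2), 2)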
mydata3$StdFitted <- scale(fit2$fitted.values)
library(car)
scatterplot(y = mydata3$StdResid, x = mydata3$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
By analyzing the scatter plot visually we can assume that there is no heteroskedasticity in our model, because the standardized residuals show no clear pattern and scatter randomly around 0.
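The visual impression can be backed by a formal test; a minimal sketch using ncvTest from the already loaded car package, which tests the null hypothesis of constant error variance:
# Score test for non-constant variance; a large p-value is consistent
# with homoskedasticity
ncvTest(fit2)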
hist(mydata3$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
From the visual analysis alone we cannot be sure that the residuals are normally distributed, given the high frequency between +1.5 and +2.0, so we perform a Shapiro-Wilk test to verify normality.
shapiro.test(mydata3$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata3$StdResid
## W = 0.94959, p-value = 0.00262
Since the p-value is below 0.05, we reject the null hypothesis of the Shapiro-Wilk test and conclude that the standardized residuals are not normally distributed.
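A normal Q-Q plot is a complementary way to see where the departure from normality occurs; a minimal sketch:
# Standardized residuals against theoretical normal quantiles;
# systematic departures from the reference line indicate non-normality
qqnorm(mydata3$StdResid)
qqline(mydata3$StdResid)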
fit2 <- lm(formula = Price ~ Age + Distance,
data = mydata3)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -627.27 -212.96 -46.23 205.05 578.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2490.112 76.189 32.684 < 2e-16 ***
## Age -7.850 3.244 -2.420 0.0178 *
## Distance -23.945 2.826 -8.473 9.53e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared: 0.4968, Adjusted R-squared: 0.4842
## F-statistic: 39.49 on 2 and 80 DF, p-value: 1.173e-12
With the summary of fit2 we can observe the relationship of Price
with Age and Distance, which helps us understand the effect these
variables have on the price.
First, for an apartment with Age and Distance both equal to 0, the
estimated average price is around 2490 euros per m2 (p < 0.001).
Second, with every additional year of age an apartment loses around
7.85 euros per m2, holding Distance constant (p < 0.05). Finally, the
price decreases by around 23.95 euros per m2 for every additional km
away from the center, holding Age constant (p < 0.001).
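These coefficients translate directly into predictions; a minimal sketch for a hypothetical apartment that is 10 years old and 5 km from the center (both values are illustrative assumptions, not from the data):
# Predicted price per m2 for a hypothetical apartment
predict(fit2, newdata = data.frame(Age = 10, Distance = 5))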
fit3 <- lm(formula = Price ~ Age + Distance + Parking + Balcony,
data = mydata3)
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 80 5982100
## 2 78 5458696 2 523404 3.7395 0.02813 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -499.06 -194.33 -32.04 219.03 544.31
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2358.900 93.664 25.185 < 2e-16 ***
## Age -7.197 3.148 -2.286 0.02499 *
## Distance -21.241 2.911 -7.296 2.14e-10 ***
## ParkingYes 168.921 62.166 2.717 0.00811 **
## BalconyYes -6.985 58.745 -0.119 0.90566
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 264.5 on 78 degrees of freedom
## Multiple R-squared: 0.5408, Adjusted R-squared: 0.5173
## F-statistic: 22.97 on 4 and 78 DF, p-value: 1.449e-12
The categorical variables we have in this model are Parking and
Balcony, which simply express whether an apartment has these features
or not. In this case we can say that having a parking spot increases
the average price of the apartment by 168.92 euros per m2, all the
other variables remaining constant (p < 0.01). In the case of Balcony,
we cannot say that this variable is statistically significant in
determining the price of an apartment, because its p-value is very
high (p = 0.906).
The null hypothesis of the partial F-test (anova) is that the added
variables, Parking and Balcony, do not improve the model; since
p = 0.028 < 0.05, we can reject it and conclude that the extended
model explains the price significantly better.
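The uncertainty around these coefficient estimates can be expressed with 95% confidence intervals; a minimal sketch (the interval for BalconyYes should contain 0, consistent with its insignificance):
# 95% confidence intervals for the coefficients of fit3
confint(fit3)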
fitted_values <- fitted(fit3)
residual_ID2 <- residuals(fit3)[2]
# Print the fitted value and residual for apartment ID2
fitted_ID2 <- fitted_values[2]
cat("Fitted value for apartment ID2:", fitted_ID2, "\n")
## Fitted value for apartment ID2: 2377.043
cat("Residual for apartment ID2:", residual_ID2, "\n")
## Residual for apartment ID2: 422.9572
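Indexing by position works here only because apartment ID 2 still sits in row 2 after the two outliers were removed; a more robust sketch looks the row up by its ID instead:
# Look up the fitted value and residual by ID rather than row position
fitted(fit3)[mydata3$ID == 2]
residuals(fit3)[mydata3$ID == 2]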