Find your own data set that contains at least 20 units and at least 4
variables (most of which are numeric, but it is good to have at least
one categorical variable as well). Perform the following steps using R:
1. Explain the data set (the variables used in the analysis).
2. Perform some data manipulations (create a new variable, delete some
units due to missing data, rename variables, create a new data.frame
based on conditions, etc.).
3. Present the descriptive statistics for the selected variables and
explain at least 3 sample statistics (mean, median, etc.).
4. Graph the distribution of the variables using histograms, scatter
plots, and/or box plots. Explain the results.
mydata1 <- read.table("./dataexam.csv",
header=TRUE,
sep=",",
dec=".")
head(mydata1)
## ID Age Experience Projects Income JobTitle
## 1 1 25 2 5 30000 2
## 2 2 30 7 12 32000 4
## 3 3 22 1 3 28000 1
## 4 4 28 5 8 35000 3
## 5 5 35 10 20 37000 5
## 6 6 40 15 25 40000 6
Description of the variables:
- Age: Age of the subject.
- Experience: Experience of the subject, given in years.
- Projects: Number of projects in which the subject has taken part.
- Income: Annual income of the subject, given in US dollars.
- JobTitle: Current job position of the subject (Intern = 1, Junior
Developer = 2, Developer = 3, Software Engineer = 4, Senior Developer =
5, Lead Developer = 6).
mydata1 <- mydata1[,-1]
mydata1$JobTitle <- factor(mydata1$JobTitle,
levels = c(1, 2, 3, 4, 5, 6),
labels = c("Intern", "Junior Developer", "Developer", "Software Engineer", "Senior Developer", "Lead Developer")
)
head(mydata1)
## Age Experience Projects Income JobTitle
## 1 25 2 5 30000 Junior Developer
## 2 30 7 12 32000 Software Engineer
## 3 22 1 3 28000 Intern
## 4 28 5 8 35000 Developer
## 5 35 10 20 37000 Senior Developer
## 6 40 15 25 40000 Lead Developer
mydata1$Average <- mydata1$Projects / mydata1$Experience
mydata1$Average <- round(mydata1$Average, digits = 1)
head(mydata1)
## Age Experience Projects Income JobTitle Average
## 1 25 2 5 30000 Junior Developer 2.5
## 2 30 7 12 32000 Software Engineer 1.7
## 3 22 1 3 28000 Intern 3.0
## 4 28 5 8 35000 Developer 1.6
## 5 35 10 20 37000 Senior Developer 2.0
## 6 40 15 25 40000 Lead Developer 1.7
The new variable, Average, represents the average number of projects per year that each individual has completed at the company.
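This division assumes Experience is never zero (its minimum in this data set is 1, per the descriptive statistics below). A defensive sketch for a hypothetical future data set in which interns with zero years could appear:
# Guard against division by zero: return NA when Experience is 0
mydata1$Average <- ifelse(mydata1$Experience > 0,
                          round(mydata1$Projects / mydata1$Experience, 1),
                          NA)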
mydata1order <- mydata1[, c(5, 4, 1, 2, 3, 6)]
head(mydata1order)
## JobTitle Income Age Experience Projects Average
## 1 Junior Developer 30000 25 2 5 2.5
## 2 Software Engineer 32000 30 7 12 1.7
## 3 Intern 28000 22 1 3 3.0
## 4 Developer 35000 28 5 8 1.6
## 5 Senior Developer 37000 35 10 20 2.0
## 6 Lead Developer 40000 40 15 25 1.7
library(pastecs)
round(stat.desc(mydata1order[,-1]), 2)
## Income Age Experience Projects Average
## nbr.val 20.00 20.00 20.00 20.00 20.00
## nbr.null 0.00 0.00 0.00 0.00 0.00
## nbr.na 0.00 0.00 0.00 0.00 0.00
## min 28000.00 22.00 1.00 2.00 1.60
## max 41000.00 40.00 15.00 25.00 3.00
## range 13000.00 18.00 14.00 23.00 1.40
## sum 671000.00 589.00 125.00 231.00 38.90
## median 33000.00 29.00 6.00 10.50 1.90
## mean 33550.00 29.45 6.25 11.55 1.95
## SE.mean 789.65 1.06 0.85 1.48 0.07
## CI.mean.0.95 1652.76 2.21 1.78 3.09 0.15
## var 12471052.63 22.26 14.41 43.63 0.10
## std.dev 3531.44 4.72 3.80 6.61 0.32
## coef.var 0.11 0.16 0.61 0.57 0.16
From the results given by stat.desc we explain the following
statistics (a verification sketch follows the list):
- range: the difference between the lowest and the highest value of a
variable in the data set.
- mean: the arithmetic average of the values of a variable in the data
set.
- SE.mean: the standard error of the mean, which tells us how
precisely the sample mean estimates the population mean; a smaller
value means better precision.
- var: the variance, a measure of how much the data points deviate
from the mean; a lower value indicates the data are closer to the
mean. It is calculated as the square of the standard deviation.
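As a quick cross-check, these four statistics can be reproduced with
base R. A minimal sketch for the Income column (the relationships
SE.mean = sd/sqrt(n) and var = sd^2 hold for every variable):
inc <- mydata1order$Income
diff(range(inc))            # range: highest minus lowest value
mean(inc)                   # arithmetic average
sd(inc) / sqrt(length(inc)) # SE.mean: standard error of the mean
sd(inc)^2                   # var: square of the standard deviation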
library(ggplot2)
ggplot(mydata1order, aes(x = JobTitle)) +
geom_bar(color="Black") +
ylab("Frequency")
This bar chart helps us easily identify how many people are in each of the positions within the company in the data we are analyzing.
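The exact counts behind the bars can also be read from a frequency table; a minimal sketch:
# Frequency of each job title
table(mydata1order$JobTitle)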
library(ggplot2)
ggplot(mydata1order,aes(x = Experience, y = Income))+
geom_point(color = "black")
This is a scatter plot in which the x axis shows the years of experience and the y axis shows the annual income of the individuals in this data set. The plot suggests a positive linear relationship between these two variables: more years of experience are associated with a higher salary.
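To quantify the strength of this relationship, we could additionally compute the Pearson correlation coefficient; a minimal sketch:
# Pearson correlation between experience and income; values close to 1
# indicate a strong positive linear relationship
cor(mydata1order$Experience, mydata1order$Income)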
You have a data set for 100 MBA students of the current generation.
In the previous year, the average grade in this program was 74.
1. Graph the distribution of undergrad degrees using the ggplot
function. Which degree is the most common?
2. Show the descriptive statistics of the Annual Salary and its
distribution with the histogram (use the ggplot function). Describe
the distribution.
3. Test the following hypothesis: H0: μ(MBA Grade) = 74. Explain the
result and interpret the effect size.
#install.packages("openxlsx")
library(openxlsx)
mydata2 <- read.xlsx("~/IMB/Bootcamp/R/Exam/R Take Home Exam 2024/Task 2/Business School.xlsx")
head(mydata2)
## Student.ID Undergrad.Degree Undergrad.Grade MBA.Grade Work.Experience
## 1 1 Business 68.4 90.2 No
## 2 2 Computer Science 70.2 68.7 Yes
## 3 3 Finance 76.4 83.3 No
## 4 4 Business 82.6 88.7 No
## 5 5 Finance 76.9 75.4 No
## 6 6 Computer Science 83.3 82.1 No
## Employability.(Before) Employability.(After) Status Annual.Salary
## 1 252 276 Placed 111000
## 2 101 119 Placed 107000
## 3 401 462 Placed 109000
## 4 287 342 Placed 148000
## 5 275 347 Placed 255500
## 6 254 313 Placed 103500
library(ggplot2)
ggplot(mydata2, aes(x = Undergrad.Degree)) +
geom_bar(color="Black") +
ylab("Frequency")
With the help of this graph we can conclude that the most common
degree is Business.
library(pastecs)
round(stat.desc(mydata2$Annual.Salary), 0)
## nbr.val nbr.null nbr.na min max range
## 100 0 0 20000 340000 320000
## sum median mean SE.mean CI.mean.0.95 var
## 10905800 103500 109058 4150 8235 1722373475
## std.dev coef.var
## 41501 0
library(ggplot2)
ggplot(mydata2, aes(x = Annual.Salary)) +
geom_histogram(binwidth = 10000,color="Black") +
labs(title = "Histogram for annual salary",
x = "Salary",
y = "Frequency")+
scale_y_continuous(labels = scales::comma) +
scale_x_continuous(labels = scales::comma)
The descriptive statistics and the histogram of annual salary show
that the majority of people earn around 100,000. The histogram is
skewed to the right (positively skewed); after removing a few extreme
outliers, the distribution might be approximately normal.
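The direction of the skew can be checked numerically with the moment-based sample skewness (positive values indicate a right skew); a minimal sketch:
# Moment-based sample skewness of annual salary
sal <- mydata2$Annual.Salary
mean((sal - mean(sal))^3) / sd(sal)^3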
t.test(mydata2$MBA.Grade, mu = 74)
##
## One Sample t-test
##
## data: mydata2$MBA.Grade
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
After performing the t-test we can reject the null hypothesis (p < 0.05): the average MBA grade of the current generation (76.04) is significantly different from 74.
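The task also asks us to interpret the effect size. For a one-sample t-test this is commonly measured with Cohen's d, the standardized difference between the sample mean and the hypothesized value; a minimal sketch (conventionally, |d| of about 0.2 is read as a small effect, 0.5 medium, and 0.8 large):
# Cohen's d for a one-sample t-test: (sample mean - mu0) / sample sd
(mean(mydata2$MBA.Grade) - 74) / sd(mydata2$MBA.Grade)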
You analyze the price per m2 for a sample of apartments in the Ljubljana region (Apartments.xlsx in Task 3 folder). Follow the questions in R Markdown included in the folder.
library(openxlsx)
mydata3 <- read.xlsx("~/IMB/Bootcamp/R/Exam/R Take Home Exam 2024/Task 3/Apartments.xlsx")
mydata3$ID <- seq_len(nrow(mydata3))
head(mydata3)
## Age Distance Price Parking Balcony ID
## 1 7 28 1640 0 1 1
## 2 18 1 2800 1 0 2
## 3 7 28 1660 0 0 3
## 4 28 29 1850 0 1 4
## 5 18 18 1640 1 1 5
## 6 28 12 1770 0 1 6
Description: Parking and Balcony are binary indicators of whether an apartment has a parking spot or a balcony (0 = No, 1 = Yes), so we recode them as factors.
mydata3$Parking <- factor(mydata3$Parking,
levels = c(0, 1),
labels = c("No", "Yes")
)
mydata3$Balcony <- factor(mydata3$Balcony,
levels = c(0, 1),
labels = c("No", "Yes")
)
head(mydata3)
## Age Distance Price Parking Balcony ID
## 1 7 28 1640 No Yes 1
## 2 18 1 2800 Yes No 2
## 3 7 28 1660 No No 3
## 4 28 29 1850 No Yes 4
## 5 18 18 1640 Yes Yes 5
## 6 28 12 1770 No Yes 6
t.test(mydata3$Price, mu = 1900)
##
## One Sample t-test
##
## data: mydata3$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
We can reject the null hypothesis (p < 0.05): the mean price (2018.94 euros per m2) is significantly different from 1900 euros, and the 95% confidence interval (1937.44 to 2100.44) does not include 1900.
fit1 <- lm(formula = Price ~ Age,
data = mydata3)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
library(car)
## Loading required package: carData
scatterplotMatrix(mydata3[ , c(1, 2, 3)],
smooth = FALSE)
By visually analyzing the matrix we see no clear signs of multicollinearity: the panel relating the two explanatory variables, Age and Distance, shows no strong linear relationship between them.
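A numeric complement to the visual check is the correlation between the two explanatory variables; a minimal sketch:
# Correlation between the predictors; a value close to 0 suggests
# multicollinearity is unlikely
cor(mydata3$Age, mydata3$Distance)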
fit2 <- lm(formula = Price ~ Age + Distance,
data = mydata3)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
Given the VIF statistics (both close to 1, the lowest possible value) we can confirm that there is very low correlation between these two explanatory variables, so multicollinearity will not be a problem in our model.
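For intuition, the VIF of a predictor is 1 / (1 - R^2) from regressing that predictor on all the other predictors; a minimal sketch reproducing the VIF of Age:
# VIF of Age by hand: regress Age on the remaining predictor
r2 <- summary(lm(Age ~ Distance, data = mydata3))$r.squared
1 / (1 - r2)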
mydata3$StdResid <- round(rstandard(fit2), 2)
mydata3$CooksD <- round(cooks.distance(fit2), 2)
hist(mydata3$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
hist(mydata3$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
head(mydata3[order(-mydata3$StdResid),], 5)
## Age Distance Price Parking Balcony ID StdResid CooksD
## 38 5 45 2180 Yes Yes 38 2.58 0.32
## 33 2 11 2790 Yes No 33 2.05 0.07
## 2 18 1 2800 Yes No 2 1.78 0.03
## 61 18 1 2800 Yes Yes 61 1.78 0.03
## 58 8 2 2820 Yes No 58 1.66 0.04
head(mydata3[order(-mydata3$CooksD),], 5)
## Age Distance Price Parking Balcony ID StdResid CooksD
## 38 5 45 2180 Yes Yes 38 2.58 0.32
## 55 43 37 1740 No No 55 1.44 0.10
## 33 2 11 2790 Yes No 33 2.05 0.07
## 53 7 2 1760 No Yes 53 -2.15 0.07
## 22 37 3 2540 Yes Yes 22 1.58 0.06
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:pastecs':
##
## first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata3 <- mydata3 %>%
filter(!ID %in% c(38, 55))
head(mydata3[order(-mydata3$StdResid),], 5)
## Age Distance Price Parking Balcony ID StdResid CooksD
## 33 2 11 2790 Yes No 33 2.05 0.07
## 2 18 1 2800 Yes No 2 1.78 0.03
## 59 18 1 2800 Yes Yes 61 1.78 0.03
## 56 8 2 2820 Yes No 58 1.66 0.04
## 55 10 1 2810 No No 57 1.60 0.03
head(mydata3[order(-mydata3$CooksD),], 5)
## Age Distance Price Parking Balcony ID StdResid CooksD
## 33 2 11 2790 Yes No 33 2.05 0.07
## 52 7 2 1760 No Yes 53 -2.15 0.07
## 22 37 3 2540 Yes Yes 22 1.58 0.06
## 38 40 2 2400 No Yes 39 1.09 0.04
## 56 8 2 2820 Yes No 58 1.66 0.04
fit2 <- lm(formula = Price ~ Age + Distance,
data = mydata3)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -627.27 -212.96 -46.23 205.05 578.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2490.112 76.189 32.684 < 2e-16 ***
## Age -7.850 3.244 -2.420 0.0178 *
## Distance -23.945 2.826 -8.473 9.53e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared: 0.4968, Adjusted R-squared: 0.4842
## F-statistic: 39.49 on 2 and 80 DF, p-value: 1.173e-12
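One caveat: the StdResid and CooksD columns still hold the values computed from the model fitted before the two outliers were removed. For the diagnostics below it would arguably be safer to recompute them from the refitted model; a minimal sketch:
# Recompute the standardized residuals from the refitted model
mydata3$StdResid <- round(rstandard(fit2), 2)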
mydata3$StdFitted <- scale(fit2$fitted.values)
library(car)
scatterplot(y = mydata3$StdResid, x = mydata3$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
By analyzing the scatter plot visually we can assume that there is no heteroskedasticity in our model, because the standardized residuals show no clear pattern and scatter randomly around 0.
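The visual impression can be backed by a formal test; a minimal sketch using ncvTest from the already loaded car package, which tests the null hypothesis of constant error variance:
# Score test for non-constant variance; a large p-value is consistent
# with homoskedasticity
ncvTest(fit2)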
hist(mydata3$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
From the visual analysis alone we cannot be sure that the residuals are normally distributed, given the high frequency between +1.5 and +2.0, so we perform a Shapiro-Wilk test to verify normality.
shapiro.test(mydata3$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata3$StdResid
## W = 0.94959, p-value = 0.00262
Since the p-value is below 0.05, we reject the null hypothesis of the Shapiro-Wilk test and conclude that the standardized residuals are not normally distributed.
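A normal Q-Q plot is a complementary way to see where the departure from normality occurs; a minimal sketch:
# Standardized residuals against theoretical normal quantiles;
# systematic departures from the reference line indicate non-normality
qqnorm(mydata3$StdResid)
qqline(mydata3$StdResid)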
fit2 <- lm(formula = Price ~ Age + Distance,
data = mydata3)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -627.27 -212.96 -46.23 205.05 578.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2490.112 76.189 32.684 < 2e-16 ***
## Age -7.850 3.244 -2.420 0.0178 *
## Distance -23.945 2.826 -8.473 9.53e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared: 0.4968, Adjusted R-squared: 0.4842
## F-statistic: 39.49 on 2 and 80 DF, p-value: 1.173e-12
With the summary of fit2 we can observe the relationship of Price
with Age and Distance, which helps us understand the effect these
variables have on the price.
First, for an apartment with Age and Distance both equal to 0, the
estimated average price is around 2490 euros per m2 (p < 0.001).
Second, with every additional year of age an apartment loses around
7.85 euros per m2, holding Distance constant (p < 0.05). Finally, the
price decreases by around 23.95 euros per m2 for every additional km
away from the center, holding Age constant (p < 0.001).
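These coefficients translate directly into predictions; a minimal sketch for a hypothetical apartment that is 10 years old and 5 km from the center (both values are illustrative assumptions, not from the data):
# Predicted price per m2 for a hypothetical apartment
predict(fit2, newdata = data.frame(Age = 10, Distance = 5))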
fit3 <- lm(formula = Price ~ Age + Distance + Parking + Balcony,
data = mydata3)
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 80 5982100
## 2 78 5458696 2 523404 3.7395 0.02813 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -499.06 -194.33 -32.04 219.03 544.31
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2358.900 93.664 25.185 < 2e-16 ***
## Age -7.197 3.148 -2.286 0.02499 *
## Distance -21.241 2.911 -7.296 2.14e-10 ***
## ParkingYes 168.921 62.166 2.717 0.00811 **
## BalconyYes -6.985 58.745 -0.119 0.90566
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 264.5 on 78 degrees of freedom
## Multiple R-squared: 0.5408, Adjusted R-squared: 0.5173
## F-statistic: 22.97 on 4 and 78 DF, p-value: 1.449e-12
The categorical variables we have in this model are Parking and
Balcony, which simply express whether an apartment has these features
or not. In this case we can say that having a parking spot increases
the average price of the apartment by 168.92 euros per m2, all the
other variables remaining constant (p < 0.01). In the case of Balcony,
we cannot say that this variable is statistically significant in
determining the price of an apartment, because its p-value is very
high (p = 0.906).
The null hypothesis of the partial F-test (anova) is that the added
variables, Parking and Balcony, do not improve the model; since
p = 0.028 < 0.05, we can reject it and conclude that the extended
model explains the price significantly better.
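The uncertainty around these coefficient estimates can be expressed with 95% confidence intervals; a minimal sketch (the interval for BalconyYes should contain 0, consistent with its insignificance):
# 95% confidence intervals for the coefficients of fit3
confint(fit3)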
fitted_values <- fitted(fit3)
residual_ID2 <- residuals(fit3)[2]
# Print the fitted value and residual for apartment ID2
fitted_ID2 <- fitted_values[2]
cat("Fitted value for apartment ID2:", fitted_ID2, "\n")
## Fitted value for apartment ID2: 2377.043
cat("Residual for apartment ID2:", residual_ID2, "\n")
## Residual for apartment ID2: 422.9572
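Indexing by position works here only because apartment ID 2 still sits in row 2 after the two outliers were removed; a more robust sketch looks the row up by its ID instead:
# Look up the fitted value and residual by ID rather than row position
fitted(fit3)[mydata3$ID == 2]
residuals(fit3)[mydata3$ID == 2]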