Exam

TASK 1

mydata <- force(airquality)
head(mydata)

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

Explanation of the variables in dataset airquality:

Ozone: Numeric, ozone concentration in parts per billion.
Solar.R: Numeric, solar radiation in Langleys.
Wind: Numeric, wind speed in miles per hour.
Temp: Numeric, temperature in degrees Fahrenheit.
Month: Categorical (integer representing months from May to September).

library(tidyr)
mydataNA <- drop_na(mydata)

By applying drop_na(), I cleaned mydata (airquality) dataset by removing incomplete rows. This approach helps ensure that the data I analyze is free from missing values, which could otherwise lead to inaccurate or biased results in subsequent statistical analysis.

mydataNA$Month <- factor(mydataNA$Month, 
                           levels = c(5, 6, 7, 8, 9), 
                           labels = c("May", "June", "July", "August", "September"))

The Month column still had numeric values, and I wanted to label them as month names.

mydataNA$Temp_C <- (mydataNA$Temp - 32) * 5 / 9

The measurements were made with imperial units, and I want metric units. In order to convert Fahrenheit to Celsius, I use the formula, °C = (°F - 32) × 5/9. I created a new variable called Temp_C.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

mydataNA <- rename(mydataNA,
                           Ozone_ppb = Ozone,
                           Solar_Radiation = Solar.R)

I renamed some variables for clarity. Ozone to Ozone_ppb (parts per billion), Solar.R to Solar_Radiation

airquality_filtered <- subset(mydataNA, Temp_C > 20 & Wind < 10)
head(airquality_filtered)

##    Ozone_ppb Solar_Radiation Wind Temp Month Day   Temp_C
## 2         36             118  8.0   72   May   2 22.22222
## 8         16             256  9.7   69   May  12 20.55556
## 23       115             223  5.7   79   May  30 26.11111
## 24        37             279  7.4   76   May  31 24.44444
## 25        29             127  9.7   82  June   7 27.77778
## 28        23             148  8.0   82  June  13 27.77778

From mydataNA, I created a new dataset with specific conditions. I called it airquality_filtered, and it contains measurements where Temp_C is greater than 20°C and Wind is less than 10 mph.

summary(airquality_filtered[, c("Ozone_ppb", "Wind", "Temp_C")])

##    Ozone_ppb           Wind           Temp_C     
##  Min.   : 16.00   Min.   :2.300   Min.   :20.56  
##  1st Qu.: 31.00   1st Qu.:6.300   1st Qu.:26.11  
##  Median : 61.00   Median :7.400   Median :28.33  
##  Mean   : 62.98   Mean   :7.041   Mean   :28.61  
##  3rd Qu.: 84.50   3rd Qu.:8.000   3rd Qu.:31.39  
##  Max.   :168.00   Max.   :9.700   Max.   :36.11

Mean (Average): Ozone_ppb: The mean concentration of ozone is 62.98 ppb. This value represents the average ozone concentration across all the days in the dataset. Since ozone levels vary widely, this mean value gives us a central tendency but doesn’t fully capture variability or outliers.

Median: Ozone_ppb: The median ozone concentration is 31.50 ppb, which is lower than the mean. This suggests that the distribution of ozone values is right-skewed, meaning a few high ozone levels are pulling the mean higher than the median.

sd(airquality_filtered$Ozone_ppb)

## [1] 35.74548

sd(airquality_filtered$Wind)

## [1] 1.87213

sd(airquality_filtered$Temp_C)

## [1] 3.883127

Standard Deviation (SD): Ozone_ppb: The standard deviation of ozone levels is 31.14 ppb, indicating that there is high variability in ozone concentrations. The high standard deviation reflects that ozone levels fluctuate significantly from day to day.

library(ggplot2)
ggplot(airquality_filtered, aes(x = Ozone_ppb)) +
  geom_histogram(binwidth = 10, fill = "blue", color = "black") +
  labs(title = "Histogram of Ozone Concentration",
       x = "Ozone (ppb)", y = "Frequency")

ggplot(airquality_filtered, aes(y = Ozone_ppb)) +
  geom_boxplot(fill = "blue") +
  labs(title = "Boxplot of Ozone Concentration",
       y = "Ozone (ppb)")

ggplot(airquality_filtered, aes(x = Wind, y = Ozone_ppb)) +
  geom_point(color = "green") +
  labs(title = "Scatterplot of Ozone vs. Wind",
       x = "Wind (mph)", y = "Ozone (ppb)")

The histogram likely shows a right-skewed distribution, indicating that while most ozone values are relatively low, there are several days with significantly higher ozone levels (outliers).

The boxplot may reveal outliers above the upper whisker, confirming the high values identified in the histogram.

The scatterplot may show a positive correlation; as temperature increases, ozone levels tend to rise, possibly indicating that higher temperatures promote ozone formation.

TASK 2

library(readxl)
Business_School <- read_excel("~/R Studio/Bootcamp/R Take Home Exam 2024/Task 2/Business School.xlsx")

I import the new data set.

library(ggplot2)
ggplot(Business_School, aes(x = `Undergrad Degree`)) + 
  geom_bar(fill="blue") +
  ylab("Frequency")

I create a histogram of the frequency of various undergraduate degrees among the observed students. The business degree is most common.

library(psych)

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

describeBy(Business_School$`Annual Salary`)

## Warning in describeBy(Business_School$`Annual Salary`): no grouping variable
## requested

##    vars   n   mean       sd median  trimmed     mad   min    max  range skew
## X1    1 100 109058 41501.49 103500 104600.2 25945.5 20000 340000 320000 2.22
##    kurtosis      se
## X1     9.41 4150.15

library(ggplot2)
ggplot(Business_School, aes(x=`Annual Salary`))+
  geom_histogram(colour="black",
                 fill="blue")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Business_School$AnnualS1000 <- Business_School$`Annual Salary` / 1000

library(ggplot2)
ggplot(Business_School, aes(x=`AnnualS1000`))+
  geom_histogram(colour="black",
                 fill="blue")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

I show the descriptive statistics of the Annual Salary and its distribution with the histogram.

I divided the values of the annual salary by 1000, because they were too high to show on the histogram. Now the numbers are shown in thousands.

The distribution is unimodal and skewed to the right. There are a few outliers.

t.test (Business_School$`MBA Grade`,
        mu=74,
        alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  Business_School$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055

I test the null hypothesis of the mean being 74. I reject the null hypothesis, based on the p-value being lower than 0.05 (it is 0.00915). I can say that the true mean is different from 74. The true mean is 76.04055.

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following object is masked from 'package:psych':
## 
##     phi

effectsize:: cohens_d(Business_School$`MBA Grade`,
                      mu=74)

## Cohen's d |       95% CI
## ------------------------
## 0.27      | [0.07, 0.46]
## 
## - Deviation from a difference of 74.

A Cohen’s d value of 0.27 indicates a small to medium effect size. The difference between the two groups being compared (average MBA grade of last years generation versus this years generation) is noticeable but not substantial.

TASK 3

library(readxl)
Apartments <- read_excel("~/R Studio/Bootcamp/R Take Home Exam 2024/Task 3/Apartments.xlsx")
View(Apartments)

library(readxl)
Apartments <- read_excel("~/R Studio/Bootcamp/R Take Home Exam 2024/Task 3/Apartments.xlsx")
View(Apartments)

Description:

Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes

Apartments$Parking <- factor(Apartments$Parking, 
                           levels = c(0, 1), 
                           labels = c("No", "Yes"))

Apartments$Balcony <- factor(Apartments$Balcony, 
                           levels = c(0, 1), 
                           labels = c("No", "Yes"))

t.test (Apartments$Price,
        mu=1900,
        alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

The p-value from the t-test is lower than p=0.05, therefore I can conclude that the mean price of observed apartments differs from 1900. I reject the null hypothesis. The mean price is greater than 1900; it lies in the interval from 1937.443 to 2100.440.

fit1 <- lm(Price ~ Age, data=Apartments)
summary(fit1)

## 
## Call:
## lm(formula = Price ~ Age, data = Apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

sqrt(summary(fit1)$r.squared)

## [1] 0.230255

Estimate of regression coefficient = -8.975 If the age of an apartment increases by 1 year, the price of an apartment per m2 decreases by 8.975€ on average (p=0.034), assuming all other variables remain unchanged.

Coefficient of correlation = 0.230355 This value is relatively low, indicating a weak linear relationship between Age and Price. It suggests that as Age increases, Price slightly tends to change, but not in a strong linear pattern. Since the correlation is positive, it suggests that Price and Age have a slightly positive relationship, for example, as Age increases, Price tends to increase a bit.

Coefficient of determination = 0.05302 5.3% of the variability in price per m2 is explained by the linear effect of age. This suggests that Age alone does a poor job of explaining the variability in Price. There are likely other factors not included in the model that have a much stronger influence on Price.

Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

#install.packages("car")
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

## The following object is masked from 'package:dplyr':
## 
##     recode

scatterplotMatrix (Apartments[c("Price", "Age", "Distance")], smooth= FALSE)

The horizontal lines suggests no linear relationship between the variables (slope ≈ 0), therefore there is no problems with multicolinearity.

fit2 <- lm(Price ~ Age + Distance, data = Apartments)
summary (fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

library(car)
vif(fit2)

##      Age Distance 
## 1.001845 1.001845

mean(vif(fit2))

## [1] 1.001845

Since both VIF values are around 1, multicollinearity is not a problem in the model. The independent variables Age and Distance do not interfere with each others ability to predict the dependent variable (Price), and the estimates of their effects should be reliable.

Apartments$StdRsd <- round(rstandard(fit2), 3)

All values of the standardized residuals are between -3 and +3, therefore there are no outliers.

Apartments$CooksD <- round(cooks.distance(fit2), 3)
hist(Apartments$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")

From the Cooks distance histogram, I can identify potential outliers because of a big gap in values from 0.15 to 0.30.

head(Apartments[order(-Apartments$CooksD), "CooksD"], 10)

## # A tibble: 10 × 1
##    CooksD
##     <dbl>
##  1  0.32 
##  2  0.104
##  3  0.069
##  4  0.066
##  5  0.061
##  6  0.038
##  7  0.037
##  8  0.034
##  9  0.032
## 10  0.03

From showing the 10 biggest Cooks distance values, I can identify the biggest outliers, which are the first two values, but also the next three are quite big, so I decided to remove them all from our observation to receive greater reliability in the results.

library (dplyr)
Apartments <- Apartments %>%
  filter (!CooksD %in% c(0.320, 0.104, 0.069, 0.066, 0.61))

fit2 <- lm(Price ~ Age + Distance, data = Apartments)
Apartments$StdFtd <- scale (fit2$fitted.values)

I renew the fit2 model with the excluded outliers, so that there is the same amount of units in the datas and I put the standardized fitted values in the table of apartments.

library(car)
scatterplot (y=Apartments$StdRsd, x=Apartments$StdFtd,
             ylab= "Standardized residuals",
             xlab= "Standardized fitted values",
             boxplots=FALSE,
             regLine = FALSE,
             smooth= FALSE)

The scatterplot between standardized residuals and standardized fitted values look randomly distributed, therefore I assume there is no heteroskedasticity.

hist(Apartments$StdRsd,
     xlab= "Standardized residuals",
     ylab= "Frequency",
     main="Histogram of standardized residuals")

The histogram is slightly skewed to the right. All of the values are between -3 and +3, therefore there is no outliers. I formally test the graph with with the Shapiro test.

shapiro.test(Apartments$StdRsd)

## 
##  Shapiro-Wilk normality test
## 
## data:  Apartments$StdRsd
## W = 0.93191, p-value = 0.0003328

I can see that the p-value is lower than 0.05 (p-value = 0.0003328), therefore I can reject the null hypothesis of the standard residuals being normally distributed. So, the standard residuals are not distributed normally. However, we have a sample size of over 30, so this does not affect our result much.

summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -408.72 -217.93  -38.86  220.60  506.18 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2492.715     75.544  32.997  < 2e-16 ***
## Age           -7.496      3.169  -2.365   0.0205 *  
## Distance     -24.574      2.700  -9.100  6.9e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 259.4 on 78 degrees of freedom
## Multiple R-squared:  0.5324, Adjusted R-squared:  0.5204 
## F-statistic:  44.4 on 2 and 78 DF,  p-value: 1.335e-13

sqrt(summary(fit2)$r.squared)

## [1] 0.7296492

The coefficient of determination is 0.5324. That means that approximately 53% of the variability of price per m2 is explained by the linear effect of age and distance.

Estimate of regression coefficient -7.496. If the age of an apartment increases by 1 year, the price per m2 decreases by 8.674 euros on average (p = 0.0205), assuming that distance is constant.

Estimate of regression coefficient -24.574 If the distance from the city center is increased by 1 km, the price per m2 decreases on average by 24.574 (p<0.001), assuming that age is constant.

The linear correlation between price, age and distance of an apartment is strong based on the value of correlation coefficient being 0.73.

fit3 <- lm (Price~ Age + Distance + Parking + Balcony, data = Apartments)

I estimate and save the linear regression function Price = f(Age, Distance, Parking and Balcony).

anova (fit2, fit3)

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     78 5248925                              
## 2     76 4915944  2    332981 2.5739 0.08287 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With the Anova function, I test how fit3 fits the data statistically compared to fit2. The p-value is 0.08287, therefore it fits the data worse than fit2.

summary (fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -403.16 -204.08  -41.24  239.52  528.62 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2372.693     93.344  25.419  < 2e-16 ***
## Age           -6.906      3.118  -2.215   0.0298 *  
## Distance     -22.254      2.839  -7.837 2.25e-11 ***
## ParkingYes   137.479     60.854   2.259   0.0267 *  
## BalconyYes    17.234     57.099   0.302   0.7636    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 254.3 on 76 degrees of freedom
## Multiple R-squared:  0.5621, Adjusted R-squared:  0.539 
## F-statistic: 24.38 on 4 and 76 DF,  p-value: 5.29e-13

Given the values of other explanatory variables, apartments with a parking space have an average price per m2 higher by 137.479€ compared to apartments without a parking space (p=0.0267)

I found no difference in the price of an apartment with a balcony compared to apartments without a balcony. That is because the p-value of this test is higher than 0.05 (p=0.7636)

Hypothesis of F-statistic is that r^2=0

Apartments$StdFtd <- fitted.values (fit3)
Apartments$StdRsd <- residuals(fit3)
head(Apartments [, colnames(Apartments) %in% c("ID", "Price", "StdFtd", "StdRsd")])

## # A tibble: 6 × 3
##   Price StdRsd StdFtd
##   <dbl>  <dbl>  <dbl>
## 1  1640  -78.5  1718.
## 2  2800  436.   2364.
## 3  1660  -41.2  1701.
## 4  1850  299.   1551.
## 5  1640 -363.   2003.
## 6  1770 -160.   1930.

Based on the estimated regression function, I expect the price per m2 for the ID2 apartment to be 2363.611€. The actual price is 2800€, therefore the residual is 436.38945€.

Exam

Jernej Košorok

2024-09-22

Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.