Task 1

Identifying a dataset to use

data()   # list the datasets available in the loaded packages

Importing the identified dataset, mtcars, which contains the Motor Trend Car Road Tests data

mydata <- mtcars   # copy the built-in data frame; wrapping it in force() is unnecessary
head(mydata)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Part 1

Explanation of variables

A data frame with 32 observations on 11 variables.

  • mpg Miles/(US) gallon
  • cyl Number of cylinders
  • disp Displacement (cu.in.)
  • hp Gross horsepower
  • drat Rear axle ratio
  • wt Weight (1000 lbs)
  • qsec 1/4 mile time
  • vs Engine (0 = V-shaped, 1 = straight)
  • am Transmission (0 = automatic, 1 = manual)
  • gear Number of forward gears
  • carb Number of carburetors
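
Before summarizing, it is worth confirming how R stores these variables (a quick sketch; output omitted). All eleven columns, including the 0/1 dummies vs and am, are numeric in the raw data.

str(mydata)   # inspect the storage type of each of the 11 columns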

Numerical Summary of the Data

summary(mydata)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

Part 2

Renaming five columns for better readability

# Rename via a named lookup; assigning into a %in% subset matches columns in
# data-frame order, not in the order the new names are listed in c()
new_names <- c(am = "Transmission", wt = "weight", hp = "Horsepower",
               vs = "Engine", carb = "Carburetors")
colnames(mydata)[match(names(new_names), colnames(mydata))] <- new_names

View the data with renamed columns

head(mydata)
##                    mpg cyl disp Horsepower drat weight  qsec Engine
## Mazda RX4         21.0   6  160        110 3.90  2.620 16.46      0
## Mazda RX4 Wag     21.0   6  160        110 3.90  2.875 17.02      0
## Datsun 710        22.8   4  108         93 3.85  2.320 18.61      1
## Hornet 4 Drive    21.4   6  258        110 3.08  3.215 19.44      1
## Hornet Sportabout 18.7   8  360        175 3.15  3.440 17.02      0
## Valiant           18.1   6  225        105 2.76  3.460 20.22      1
##                   Transmission gear Carburetors
## Mazda RX4                    1    4           4
## Mazda RX4 Wag                1    4           4
## Datsun 710                   1    4           1
## Hornet 4 Drive               0    3           1
## Hornet Sportabout            0    3           2
## Valiant                      0    3           1

Creating a New Variable based on the “Transmission” column

mydata$`Car Transmission` <- ifelse(mydata$Transmission == 0, "automatic", "manual")
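
An equivalent and slightly more idiomatic alternative (a sketch; it produces the same labels, stored as a factor rather than as character) is:

# Equivalent sketch: recode the 0/1 dummy as a labelled factor, which also
# fixes the level order for later plots and models
mydata$`Car Transmission` <- factor(mydata$Transmission, levels = c(0, 1),
                                    labels = c("automatic", "manual"))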

View the data with new variable

head(mydata)
##                    mpg cyl disp Horsepower drat weight  qsec Engine
## Mazda RX4         21.0   6  160        110 3.90  2.620 16.46      0
## Mazda RX4 Wag     21.0   6  160        110 3.90  2.875 17.02      0
## Datsun 710        22.8   4  108         93 3.85  2.320 18.61      1
## Hornet 4 Drive    21.4   6  258        110 3.08  3.215 19.44      1
## Hornet Sportabout 18.7   8  360        175 3.15  3.440 17.02      0
## Valiant           18.1   6  225        105 2.76  3.460 20.22      1
##                   Transmission gear Carburetors Car Transmission
## Mazda RX4                    1    4           4           manual
## Mazda RX4 Wag                1    4           4           manual
## Datsun 710                   1    4           1           manual
## Hornet 4 Drive               0    3           1        automatic
## Hornet Sportabout            0    3           2        automatic
## Valiant                      0    3           1        automatic

Creating a new data frame showing only manual cars

manual_cars <- mydata[mydata$`Car Transmission` == "manual", ]

Viewing new data frame

head(manual_cars)
##                 mpg cyl  disp Horsepower drat weight  qsec Engine
## Mazda RX4      21.0   6 160.0        110 3.90  2.620 16.46      0
## Mazda RX4 Wag  21.0   6 160.0        110 3.90  2.875 17.02      0
## Datsun 710     22.8   4 108.0         93 3.85  2.320 18.61      1
## Fiat 128       32.4   4  78.7         66 4.08  2.200 19.47      1
## Honda Civic    30.4   4  75.7         52 4.93  1.615 18.52      1
## Toyota Corolla 33.9   4  71.1         65 4.22  1.835 19.90      1
##                Transmission gear Carburetors Car Transmission
## Mazda RX4                 1    4           4           manual
## Mazda RX4 Wag             1    4           4           manual
## Datsun 710                1    4           1           manual
## Fiat 128                  1    4           1           manual
## Honda Civic               1    4           2           manual
## Toyota Corolla            1    4           1           manual

Part 3

Descriptive statistics for the selected variables, showing three statistics: mean, median, and standard deviation

Present descriptive statistics for mpg, Horsepower, and weight variables
select_variables <- mydata[, c("mpg", "Horsepower", "weight")]
summary(select_variables)
##       mpg          Horsepower        weight     
##  Min.   :10.40   Min.   : 52.0   Min.   :1.513  
##  1st Qu.:15.43   1st Qu.: 96.5   1st Qu.:2.581  
##  Median :19.20   Median :123.0   Median :3.325  
##  Mean   :20.09   Mean   :146.7   Mean   :3.217  
##  3rd Qu.:22.80   3rd Qu.:180.0   3rd Qu.:3.610  
##  Max.   :33.90   Max.   :335.0   Max.   :5.424
  • The typical car in this dataset gets around 20.09 miles per gallon: on average, it can drive about 20 miles on a single gallon of gas.

  • The average engine power output is about 146.7 horsepower, which gives a good sense of the overall strength and performance of these vehicles.

  • If the cars were ordered by fuel efficiency, the middle car would get 19.20 miles per gallon: half of the cars are more fuel-efficient than this, and half are less.

Calculating the standard deviations for horsepower and weight

sd(mydata$Horsepower)   # Standard deviation for horsepower (hp)
## [1] 68.56287
sd(mydata$weight)   # Standard deviation for weight (wt)
## [1] 0.9784574
  • With a standard deviation of about 68.56 horsepower, there are large differences between the most and least powerful vehicles: some cars have much stronger engines than others.

  • With a standard deviation of about 0.98 (weights are recorded in units of 1,000 lbs, so roughly 980 lbs), car weights also vary noticeably around the mean of about 3,217 lbs. A table combining all three statistics is sketched below.
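
Since summary() does not report standard deviations, a compact way to collect all three requested statistics at once is a small sapply() over the selected columns (a sketch):

# Mean, median, and SD for each selected variable in one table
stats_table <- sapply(select_variables,
                      function(x) c(mean = mean(x), median = median(x), sd = sd(x)))
round(stats_table, 2)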

Part 4

Histogram for miles per gallon

hist(mydata$mpg, main="Histogram of MPG", xlab="Miles Per Gallon", col="darkolivegreen", border="gray20")

  • The histogram shows that the most common MPG values are around 15-20, with a distribution skewed to the right, indicating that a few cars achieve very high MPG values (i.e., very low fuel consumption).

  • The histogram also shows a few cars with very low fuel efficiency, i.e., MPG values near the minimum of 10.4.

Scatterplot for Miles per Gallon vs. Horsepower

plot(mydata$Horsepower, mydata$mpg, main="Scatterplot of MPG vs. Horsepower", xlab="Horsepower", ylab="Miles per Gallon", col="darkolivegreen")

Explanation:

  • There seems to be a general negative relationship between horsepower and MPG. Evidencing that as horsepower increases, MPG tends to decrease. This would suggest that cars with more powerful engines are generally less fuel-efficient.
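
To make the downward trend explicit, one could overlay a least-squares line on the scatterplot above (an illustrative sketch, not part of the original output):

# Add a least-squares trend line to the scatterplot drawn above
abline(lm(mpg ~ Horsepower, data = mydata), col = "gray20", lwd = 2)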

Boxplot for Miles per Gallon

boxplot(mydata$mpg, main="Miles per Gallon", ylab="MPG", col="darkolivegreen")

The box plot suggests that the middle 50% of cars in the dataset have an MPG value between roughly 15.4 and 22.8 (the first and third quartiles), with a median MPG of 19.2.
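
The box edges can be cross-checked against the quartiles reported by summary() earlier (a sketch; the values are 15.43, 19.20, and 22.80):

# Verify the box edges: first quartile, median, third quartile of MPG
quantile(mydata$mpg, probs = c(0.25, 0.50, 0.75))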

Task 2

Step 1 - importing the Excel file into R

#install.packages("readxl")
library(readxl)
MBA <- read_excel("/Users/jethro/Downloads/R Take Home Exam 2024/Task 2/Business School.xlsx")

We confirm that the data has been imported correctly.

head(MBA)
## # A tibble: 6 × 9
##   `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade`
##          <dbl> <chr>                          <dbl>       <dbl>
## 1            1 Business                        68.4        90.2
## 2            2 Computer Science                70.2        68.7
## 3            3 Finance                         76.4        83.3
## 4            4 Business                        82.6        88.7
## 5            5 Finance                         76.9        75.4
## 6            6 Computer Science                83.3        82.1
## # ℹ 5 more variables: `Work Experience` <chr>, `Employability (Before)` <dbl>,
## #   `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>

Create a bar plot of the degree distribution

#install.packages("ggplot2")

library(ggplot2)
ggplot(MBA, aes(x = `Undergrad Degree`, fill = `Undergrad Degree`)) +
  geom_bar(color = "gray20") +
  labs(title = "Undergraduate Degree Distribution", x = "Undergraduate Degree", y = "Total")

We find the most common undergraduate degree among the MBA students

undergrad_cnt <- table(MBA$`Undergrad Degree`)
most_listed_degree <- names(undergrad_cnt)[which.max(undergrad_cnt)]
highest_count <- max(undergrad_cnt)

print(most_listed_degree)
## [1] "Business"
print(highest_count)
## [1] 35

The most common undergraduate degree among the MBA students is Business.

Part 2

Descriptive statistics of the Annual Salary and its distribution with histogram

statistics <- summary(MBA$`Annual Salary`)
mean_salary <- mean(MBA$`Annual Salary`, na.rm = TRUE)      # named to avoid masking base::mean
median_salary <- median(MBA$`Annual Salary`, na.rm = TRUE)  # named to avoid masking base::median

paste("Mean Salary:", mean_salary)
## [1] "Mean Salary: 109058"
paste("Median Salary:", median_salary)
## [1] "Median Salary: 103500"
# Based on the results above, we create a histogram for 'Annual Salary'
library(ggplot2)
ggplot(MBA, aes(x = `Annual Salary`)) +
  geom_histogram(binwidth = 5000, fill = "darkolivegreen", color = "gray20") +
  labs(title = "Distribution of Annual Salary", x = "Annual Salary", y = "Count") +
  theme_minimal()

The histogram shows a roughly bell-shaped distribution, suggesting that MBA graduates’ salaries are concentrated around the mean, with the frequency of salaries decreasing as we move away from the center in either direction. Note, however, that the mean (109,058) slightly exceeds the median (103,500), hinting at a mild right skew driven by a few high earners.

Test the following hypothesis: H0: μ_MBA Grade = 74. Explain the result and interpret the effect size.

# One-sample t-test of H0: mu = 74
Grade_test <- t.test(MBA$`MBA Grade`, mu = 74)

# Calculate effect size (Cohen's d)
mean_grade <- mean(MBA$`MBA Grade`)
sd_grade <- sd(MBA$`MBA Grade`)
cohens_d <- (mean_grade - 74) / sd_grade

# Print Cohen's d
cohens_d
## [1] 0.2658658

Based on the one-sample t-test, the average MBA grade differs significantly from the hypothesized mean of 74. The effect size, Cohen’s d of about 0.27, indicates a small effect: the difference between the sample mean and 74 is statistically detectable but minor in practical terms.
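
The test object itself was not printed above; the standard components of the returned htest object can be inspected directly (a sketch; output omitted):

# Inspect the stored one-sample t-test result
Grade_test$statistic   # t value
Grade_test$p.value     # p-value for H0: mu = 74
Grade_test$conf.int    # 95% confidence interval for the mean MBA grade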

Task 3

Import the dataset Apartments.xlsx

Apartments <- read_excel("/Users/jethro/Downloads/R Take Home Exam 2024/Task 3/Apartments.xlsx")

head(Apartments)
## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

Change categorical variables into factors.

# Understanding the structure of the dataset
str(Apartments)
## tibble [85 × 5] (S3: tbl_df/tbl/data.frame)
##  $ Age     : num [1:85] 7 18 7 28 18 28 14 18 22 25 ...
##  $ Distance: num [1:85] 28 1 28 29 18 12 20 6 7 2 ...
##  $ Price   : num [1:85] 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
##  $ Parking : num [1:85] 0 1 0 0 1 0 0 1 1 1 ...
##  $ Balcony : num [1:85] 1 0 0 1 1 1 1 1 0 0 ...
Apartments$Parking <- factor(Apartments$Parking, 
                             levels = c(0, 1), 
                             labels = c("No", "Yes"))

Apartments$Balcony <- factor(Apartments$Balcony, 
                             levels = c(0, 1), 
                             labels = c("No", "Yes"))

# Display the new data input
head(Apartments)
## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl> <fct>   <fct>  
## 1     7       28  1640 No      Yes    
## 2    18        1  2800 Yes     No     
## 3     7       28  1660 No      No     
## 4    28       29  1850 No      Yes    
## 5    18       18  1640 Yes     Yes    
## 6    28       12  1770 No      Yes

Test the hypothesis H0: μ_Price = 1900 EUR. What can you conclude?

HP_test <- t.test(Apartments$Price, mu = 1900)

# View the results
HP_test
## 
##  One Sample t-test
## 
## data:  Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

The p-value of 0.004731 is well below the usual 0.05 significance level, so we reject the null hypothesis and conclude that the mean price per m² is not equal to 1900 EUR. Moreover, since the entire 95% confidence interval (1937.4 to 2100.4) lies above 1900, the mean price is in fact significantly higher.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age, data = Apartments)

# Summary
summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = Apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

From the above, three conclusions can be made in answering the question:

  • Regression coefficient: the price per square meter decreases by approximately 8.98 EUR for every additional year of apartment age, and this effect is statistically significant at the 5% level (p = 0.034).
  • Coefficient of correlation: r = −√0.053 ≈ −0.23 (negative because the slope is negative), a weak negative linear relationship between Age and Price; it is computed directly below.
  • Coefficient of determination: R² = 0.053, so age alone explains only about 5.3% of the variation in price and is not, by itself, a reliable predictor of price.
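
The correlation coefficient quoted above can also be computed directly (a sketch; its value should be about −0.23, the negative square root of R²):

# Pearson correlation between Price and Age
cor(Apartments$Price, Apartments$Age)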

Show the scatterplot matrix between Price, Age and Distance. Based on the matrix, determine if there is a potential problem with multicollinearity.

# Install and load library GGally
# install.packages("GGally")

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(Apartments, columns = c("Price", "Age", "Distance"))

  • The correlation between Price and Age is −0.23, meaning that as the age of an apartment increases, the price tends to decrease slightly, but the relationship is weak.

  • The correlation between Price and Distance is −0.63, indicating that apartments further from the city center tend to have noticeably lower prices.

  • The correlation between Age and Distance is 0.04, meaning there is essentially no relationship between the age of an apartment and its distance from the city center. Therefore, multicollinearity between Age and Distance is not a concern.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, data = Apartments)

summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Check the multicollinearity with VIF statistics. Explain the findings.

#install and load package car
# install.packages("car")

library(car)
## Loading required package: carData
vif(fit2)
##      Age Distance 
## 1.001845 1.001845
  • VIF values close to 1 mean that there is no multicollinearity between Age and Distance. The two predictors are essentially independent of each other in this model, so the regression coefficient estimates can be relied on. A manual cross-check of the VIF formula follows below.
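
As a cross-check on what car::vif reports, the VIF for Age can be computed by hand as 1/(1 − R²) from regressing Age on the other predictor (a sketch; with only two predictors in fit2, Distance is the only regressor):

# Manual VIF for Age: 1 / (1 - R^2) of Age regressed on Distance
1 / (1 - summary(lm(Age ~ Distance, data = Apartments))$r.squared)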

Calculate standardized residuals and Cook’s distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

# Standardized residuals
standardized_residuals <- rstandard(fit2)

# Cook's distances
cooks_distance <- cooks.distance(fit2)

# Histograms
hist(standardized_residuals, main = "Histogram for Standardized Residuals",
     xlab = "Standardized Residuals", col = "darkolivegreen", border = "gray20")

hist(cooks_distance, main = "Histogram for Cook's Distances",
     xlab = "Cook's Distance", col = "darkolivegreen", border = "gray20")

# outliers
outliers <- which(abs(standardized_residuals) > 3)

# influential and problem points
n <- nrow(Apartments)
high_influence <- which(cooks_distance > (4/n))

problem_data_units <- unique(c(outliers, high_influence))

cat("Potentially problematic units:\n", problem_data_units, "\n")
## Potentially problematic units:
##  22 33 38 53 55
# New dataset without problematic data units
New_Apartments_set <- Apartments[-problem_data_units, ]

fit2_clean <- lm(Price ~ Age + Distance, data = New_Apartments_set)
coef(fit2_clean)   # head() is not meaningful for an lm object; inspect the coefficients instead
## (Intercept)         Age    Distance 
## 2502.466701   -8.673848  -24.063024
summary(fit2_clean)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = New_Apartments_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -411.50 -203.69  -45.24  191.11  492.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2502.467     75.024  33.356  < 2e-16 ***
## Age           -8.674      3.221  -2.693  0.00869 ** 
## Distance     -24.063      2.692  -8.939 1.57e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared:  0.5361, Adjusted R-squared:  0.524 
## F-statistic: 44.49 on 2 and 77 DF,  p-value: 1.437e-13

Check for potential heteroskedasticity with a scatterplot between standardized residuals and standardized fitted values. Explain the findings.

# Standardize the fitted values directly with scale(); regressing fitted(fit2)
# on itself, as originally attempted, is degenerate and only raises warnings
standardized_fitted <- as.numeric(scale(fitted(fit2)))
library(ggplot2)
ggplot(data = data.frame(standardized_fitted, standardized_residuals), 
       aes(x = standardized_fitted, y = standardized_residuals)) +
  geom_point(color = "gray") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray20") +
  labs(title = "Residuals vs Fitted Values",
       x = "Standardized Fitted Values",
       y = "Standardized Residuals") +
  theme_minimal()

The standardized residuals are scattered fairly evenly around zero, with no obvious funnel shape (no systematic widening or narrowing across the fitted values). This suggests the model satisfies the assumption of homoskedasticity.
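
As a complement to the visual check, a formal Breusch-Pagan test could be run, assuming the lmtest package is installed (a sketch; output not shown):

# install.packages("lmtest")
library(lmtest)
bptest(fit2)   # H0: constant error variance (homoskedasticity)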

Are the standardized residuals distributed normally? Show the graph and formally test it. Explain the findings.

standardized_residuals <- rstandard(fit2)

qqnorm(standardized_residuals)
qqline(standardized_residuals, col = "gray20", lwd = 2)

Because the points on the Q-Q plot lie close to the diagonal line, the standardized residuals appear to be approximately normally distributed.
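
A formal test was also requested; a Shapiro-Wilk test on the standardized residuals provides one (a sketch; output not shown). A p-value above 0.05 would be consistent with normally distributed residuals.

# Shapiro-Wilk test; H0: the standardized residuals are normally distributed
shapiro.test(standardized_residuals)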

Estimate fit2 again without the potentially problematic units and show the summary of the model. Explain all coefficients.

cooks_distance <- cooks.distance(fit2)

n <- nrow(Apartments) 
high_influence_data <- which(cooks_distance > (4/n))  # Identify influential points

print(high_influence_data)
## 22 33 38 53 55 
## 22 33 38 53 55
# delete high influence data points
new_apartments <- Apartments[-high_influence_data, ]

fit2_prune <- lm(Price ~ Age + Distance, data = new_apartments)

summary(fit2_prune)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = new_apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -411.50 -203.69  -45.24  191.11  492.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2502.467     75.024  33.356  < 2e-16 ***
## Age           -8.674      3.221  -2.693  0.00869 ** 
## Distance     -24.063      2.692  -8.939 1.57e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared:  0.5361, Adjusted R-squared:  0.524 
## F-statistic: 44.49 on 2 and 77 DF,  p-value: 1.437e-13

Explanations

Intercept: the estimated average price per m² when both Age and Distance are zero, i.e. a brand-new apartment in the city center: about 2502 EUR.

Age: every additional year of age reduces the price per m² by approximately 8.67 EUR. The p-value below 0.01 shows that the effect of age on price is statistically significant.

Distance: every additional kilometer from the city center reduces the price per m² by approximately 24.06 EUR. The p-value far below 0.01 shows that the effect of distance on price is statistically significant.
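
To see how these coefficients combine in practice, consider a hypothetical apartment that is 10 years old and 5 km from the center (illustrative inputs, not from the data):

# Predicted price per m2: 2502.467 - 8.674*10 - 24.063*5, about 2295.4 EUR
predict(fit2_prune, newdata = data.frame(Age = 10, Distance = 5))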

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = Apartments)

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2301.667     94.271  24.415  < 2e-16 ***
## Age           -6.799      3.110  -2.186  0.03172 *  
## Distance     -18.045      2.758  -6.543 5.28e-09 ***
## ParkingYes   196.168     62.868   3.120  0.00251 ** 
## BalconyYes     1.935     60.014   0.032  0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11

With the function anova, check if model fit3 fits the data better than model fit2.

anova(fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     82 6720983                              
## 2     80 5991088  2    729894 4.8732 0.01007 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The reduction in RSS when Parking and Balcony are added in Model 2 indicates that Model 2 fits the data better than Model 1.

Since the p-value of 0.01007 is below the 0.05 significance level, this improvement is statistically significant: the added variables genuinely improve the prediction of apartment prices.

Show the results of fit3 and explain the regression coefficients for both categorical variables. Can you write down the hypothesis which is being tested with the F-statistic, shown at the bottom of the output?

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2301.667     94.271  24.415  < 2e-16 ***
## Age           -6.799      3.110  -2.186  0.03172 *  
## Distance     -18.045      2.758  -6.543 5.28e-09 ***
## ParkingYes   196.168     62.868   3.120  0.00251 ** 
## BalconyYes     1.935     60.014   0.032  0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11

ParkingYes: holding the other variables constant, apartments with a parking space cost about 196.17 EUR more per m² than comparable apartments without one, an effect that is statistically significant (p = 0.0025). BalconyYes: having a balcony is associated with a negligible 1.94 EUR higher price per m², an effect that is far from significant (p = 0.974).

The F-statistic tests whether the model’s predictors (Age, Distance, Parking, and Balcony) jointly explain a significant amount of the variation in Price. Formally, H0: β_Age = β_Distance = β_Parking = β_Balcony = 0, against the alternative that at least one coefficient differs from zero. The very small p-value (1.849e-11) leads us to reject H0 and conclude that at least one predictor significantly contributes to explaining the variation in Price.

Save the fitted values and calculate the residual for apartment ID 2.

Fitted_values <- fitted(fit3)

Residuals_Values <- residuals(fit3)

Fitted_Value_ID2 <- Fitted_values[2]
Residual_ID2 <- Residuals_Values[2]

cat("Fitted value for apartment ID 2:", Fitted_Value_ID2, "\n")
## Fitted value for apartment ID 2: 2357.411
cat("Residual for apartment ID 2:", Residual_ID2, "\n")
## Residual for apartment ID 2: 442.5889
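
As a quick sanity check, the fitted value and residual must add back up to the observed price for apartment ID 2, which the imported data showed as 2800:

# 2357.411 + 442.589 = 2800, the observed Price for apartment 2
Fitted_Value_ID2 + Residual_ID2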

END

# Set a CRAN mirror, then install readxl if it is not already available
options(repos = c(CRAN = "https://cran.rstudio.com/"))
if (!require(readxl)) {
   install.packages("readxl")
}