TASK 1

Matej Suhalj

library(carData)
mydata1 <- SLID
head(mydata1)
##   wages education age    sex language
## 1 10.56      15.0  40   Male  English
## 2 11.00      13.2  19   Male  English
## 3    NA      16.0  49   Male    Other
## 4 17.76      14.0  46   Male    Other
## 5    NA       8.0  71   Male  English
## 6 14.00      16.0  50 Female  English

Explanation of the variables used in the analysis:

  • Wages: Composite hourly wage rate from all jobs; “NA” means missing data.
  • Education: Number of years of schooling.
  • Age: Age in years.
  • Sex: A factor with levels: Female, Male.
  • Language: A factor with levels: English, French, Other.
  1. The survey was conducted in 1994, so I wanted to see how old the subjects in the survey are today (in 2023). That’s why I created a new variable (ageNOW).
mydata1$ageNOW <- mydata1$age + 29 #Creating new variable (ageNOW). 
mydata1_CleanData <- na.omit(mydata1) #Removing the units that have NA in them (missing data)
head(mydata1_CleanData)
##    wages education age    sex language ageNOW
## 1  10.56      15.0  40   Male  English     69
## 2  11.00      13.2  19   Male  English     48
## 4  17.76      14.0  46   Male    Other     75
## 6  14.00      16.0  50 Female  English     79
## 9   8.20      15.0  31   Male  English     60
## 12 16.97      13.5  30 Female  English     59
names(mydata1)[names(mydata1) == "ageNOW"] <- "age2023"
names(mydata1_CleanData)[names(mydata1_CleanData) == "ageNOW"] <- "age2023" #Renaming the variable "ageNOW" to "age2023"

I would also like to rename the variable sex to gender, but only in mydata1_CleanData.

names(mydata1_CleanData)[names(mydata1_CleanData) == "sex"] <- "gender"
  2. I would like to explain some of the results using the summary() function.
summary(mydata1_CleanData[ , c(-4,-5)]) #Removing gender and language from the summary
##      wages         education          age          age2023    
##  Min.   : 2.30   Min.   : 0.00   Min.   :16.0   Min.   :45.0  
##  1st Qu.: 9.25   1st Qu.:12.00   1st Qu.:28.0   1st Qu.:57.0  
##  Median :14.13   Median :13.00   Median :36.0   Median :65.0  
##  Mean   :15.54   Mean   :13.34   Mean   :37.1   Mean   :66.1  
##  3rd Qu.:19.72   3rd Qu.:15.10   3rd Qu.:46.0   3rd Qu.:75.0  
##  Max.   :49.92   Max.   :20.00   Max.   :69.0   Max.   :98.0

Explanation:

  • Q3 for wages: 19.72 -> 75% of the people in this survey had an hourly wage rate of $19.72 or less, and 25% of them had a higher hourly wage rate than $19.72.
  • Range for education: 20 years -> the difference in years of schooling between the person who went to school the longest and the person who went to school the shortest amount of time is 20 years.
  • Mean for wages: the average hourly wage rate among the people participating in our survey is $15.54.
  • Median age: the median age of this group is 36. This means that when you arrange the ages in order, 36 is the middle value: half of the group is younger than 36 and half is older. These values can be cross-checked directly, as sketched below.
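
A minimal sketch of that cross-check with base R functions (outputs omitted; the numbers should match the summary above):

quantile(mydata1_CleanData$wages, probs = 0.75) #3rd quartile of wages, ~19.72
mean(mydata1_CleanData$wages) #Mean hourly wage, ~15.54
diff(range(mydata1_CleanData$education)) #Range of education, 20 - 0 = 20
median(mydata1_CleanData$age) #Median age, 36
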
hist(mydata1_CleanData$wages,
     main = "Histogram of wages for people in our survay",
     xlab = "Hourly wage rate",
     ylab = "Frequency",
     breaks = seq(from=0, to=50, by=2))

Explanation of the histogram:

We can see that the majority of people in our sample had an hourly wage rate lower than $10 (more precisely, around $8). As expected, we can also notice that the frequency gets lower as the hourly wage rate gets higher.

library(car)
scatterplot(y=mydata1_CleanData$age,
            x=mydata1_CleanData$wages,
            ylab="age",
            xlab="wages",
            smooth=FALSE)

Explanation of the scatterplot:

The scatterplot shows that the majority of people in our sample have an hourly wage rate between $5 and $20. There is a positive correlation between wage and age: on average, the older the people are, the higher their hourly wage rate.
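
To put a number on this correlation, cor() can be used (a minimal sketch; output not shown):

cor(mydata1_CleanData$age, mydata1_CleanData$wages) #Pearson correlation between age and wages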

TASK 2

Body_mass_data <- read.table("./Body mass.csv", header=TRUE, sep=";", dec=",")
head(Body_mass_data)
##   ID Mass
## 1  1 62.1
## 2  2 64.5
## 3  3 56.5
## 4  4 53.4
## 5  5 61.3
## 6  6 62.2
library(pastecs)
stat.desc(Body_mass_data)
##                       ID         Mass
## nbr.val        50.000000 5.000000e+01
## nbr.null        0.000000 0.000000e+00
## nbr.na          0.000000 0.000000e+00
## min             1.000000 4.970000e+01
## max            50.000000 8.320000e+01
## range          49.000000 3.350000e+01
## sum          1275.000000 3.143800e+03
## median         25.500000 6.280000e+01
## mean           25.500000 6.287600e+01
## SE.mean         2.061553 8.501407e-01
## CI.mean.0.95    4.142845 1.708422e+00
## var           212.500000 3.613696e+01
## std.dev        14.577380 6.011403e+00
## coef.var        0.571662 9.560727e-02
round(stat.desc(Body_mass_data), 1)
##                  ID   Mass
## nbr.val        50.0   50.0
## nbr.null        0.0    0.0
## nbr.na          0.0    0.0
## min             1.0   49.7
## max            50.0   83.2
## range          49.0   33.5
## sum          1275.0 3143.8
## median         25.5   62.8
## mean           25.5   62.9
## SE.mean         2.1    0.9
## CI.mean.0.95    4.1    1.7
## var           212.5   36.1
## std.dev        14.6    6.0
## coef.var        0.6    0.1
hist(Body_mass_data$Mass,
     breaks = 4,
     col = "blue",
     border = "black",
     main = "Histogram of the weight of some nine graders",
     xlab = "Body mass",
     ylab = "Frequency")

  1. Testing the hypothesis with a t-test:

H0: Mu = 59.5
H1: Mu != 59.5

mean(Body_mass_data$Mass)
## [1] 62.876
result <- t.test(Body_mass_data$Mass, mu = 59.5)
cat("P-value:", result$p.value,"\n")
## P-value: 0.0002339924

Since p < 0.001, we reject the null hypothesis. This confirms our earlier impression that the mean body mass of ninth graders differs from 59.5 kg, i.e., that this population has become heavier over the period of 3 years (from 2018/19 to 2021/22).
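
The same conclusion can be read off the 95% confidence interval stored in the test object (a minimal sketch; output not shown):

result$conf.int #95% CI for the mean body mass; 59.5 should lie outside it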

  2. To interpret the effect size, I decided to use Cohen’s d, as shown below:
library("effectsize")
cohens_d(Body_mass_data$Mass, mu= 59.5)
## Cohen's d |       95% CI
## ------------------------
## 0.56      | [0.26, 0.86]
## 
## - Deviation from a difference of 59.5.
interpret_cohens_d(0.56, "sawilowsky2009")
## [1] "medium"
## (Rules: sawilowsky2009)

In our case Cohen’s d is 0.56, so the effect size is medium. This means that the difference between the sample mean and the hypothesized population mean (59.5 kg) is of moderate magnitude. In practical terms, there is a noticeable difference between the sample mean of Mass and the hypothesized population mean.
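
As a cross-check, Cohen’s d for a one-sample test is the mean difference divided by the sample standard deviation; plugging in the values from stat.desc above (mean ≈ 62.88, sd ≈ 6.01) reproduces the 0.56:

(mean(Body_mass_data$Mass) - 59.5) / sd(Body_mass_data$Mass) #(62.876 - 59.5) / 6.011 ≈ 0.56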

TASK 3

Import the dataset Apartments.xlsx

#install.packages("readxl")

library(readxl)
mydata_Apartments <- read_xlsx("./Apartments.xlsx")

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

Change categorical variables into factors.

mydata_Apartments$ParkingF <- factor(mydata_Apartments$Parking,
 levels = c(1, 0),
 labels = c("YES", "NO"))
mydata_Apartments$BalconyF <- factor(mydata_Apartments$Balcony,
 levels = c(1, 0),
 labels = c("YES", "NO"))

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

t.test(mydata_Apartments$Price,
 mu = 1900,
 alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  mydata_Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

The p-value (0.004731) is less than the significance level of 0.05, so I reject the null hypothesis (H0) in favor of the alternative hypothesis (H1). In other words, there is evidence that the true mean price of apartments is statistically different from 1900 EUR; consistently, the 95% confidence interval (1937.4, 2100.4) does not contain 1900.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

library(ggplot2)
lm(Price ~ Age, data = mydata_Apartments)
## 
## Call:
## lm(formula = Price ~ Age, data = mydata_Apartments)
## 
## Coefficients:
## (Intercept)          Age  
##    2185.455       -8.975
fit1 <- lm(Price ~ Age, data = mydata_Apartments)
summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = mydata_Apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401
r <- -sqrt(0.05302) #Correlation coefficient; it takes the sign of the slope, which is negative here, so r ≈ -0.23

Explanation:

  • Estimate of the regression coefficient: if an apartment gets older by 1 year (if the variable “Age” increases by 1), the price per m2 decreases on average by 8.975 EUR, assuming all other variables are constant.
  • Coefficient of correlation: since the slope is negative, r = -sqrt(R²) ≈ -0.23. The negative sign indicates a negative linear relationship between “Age” and “Price”: as “Age” increases, “Price” tends to decrease. The magnitude of 0.23 suggests a weak linear relationship between the two.
  • Coefficient of determination: only about 5.3% of the variability in “Price” is explained by the linear relationship with “Age.” In other words, “Age” explains a relatively small proportion of the variation in apartment prices.
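
The correlation can also be computed directly instead of taking the square root of R² (a minimal sketch; output not shown):

cor(mydata_Apartments$Age, mydata_Apartments$Price) #Should be ≈ -0.23, matching -sqrt(R²)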

Show the scatterplot matrix between Price, Age and Distance. Based on the matrix, determine if there is a potential problem with multicollinearity.

library(car)
scatterplotMatrix(mydata_Apartments[ , c(-4, -5, -6, -7)], smooth = FALSE) #Keep only Price, Age and Distance

cor_matrix <- cor(mydata_Apartments[, c("Price", "Age", "Distance")])
round(cor_matrix, 2) #Display the correlation matrix

To see if there is a potential problem with multicollinearity, I created a correlation matrix. Multicollinearity concerns the correlations among the explanatory variables; since the correlation between “Age” and “Distance” is nowhere near 1 (positive or negative), I can say that there is no real potential problem of multicollinearity.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance,
           data = mydata_Apartments)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata_Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Check the multicollinearity with the VIF statistics. Explain the findings.

vif(fit2)
##      Age Distance 
## 1.001845 1.001845

The VIF values for both “Age” and “Distance” are very close to 1. A VIF of 1 indicates that there is almost no multicollinearity between these variables. Multicollinearity between two variables is usually considered high when the VIF is 5 or higher.

Calculate standardized residuals and Cook’s distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

mydata_Apartments$StdResid <- round(rstandard(fit2), 3)
mydata_Apartments$CooksD <- round(cooks.distance(fit2), 3)
#I removed the one outlier (row 38) three questions below; otherwise my new data frame (mydata11) would only have 9 variables instead of 10, because StdFitted could not be added: mydata11 would have 84 rows while fit2 was fitted on 85 rows. That is why the outlier is removed a few questions after this one.
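
As an illustrative sketch (not part of the original analysis), a common rule of thumb flags units whose absolute standardized residual exceeds 3 or whose Cook’s distance exceeds 4/n:

n <- nrow(mydata_Apartments)
which(abs(mydata_Apartments$StdResid) > 3 | mydata_Apartments$CooksD > 4 / n) #Candidate problematic units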

Check for potential heteroskedasticity with a scatterplot between standardized residuals and standardized fitted values. Explain the findings.

mydata_Apartments$StdFitted <- scale(fit2$fitted.values)
library(car)
scatterplot(y = mydata_Apartments$StdResid, x = mydata_Apartments$StdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

Based on the scatterplot, I have not found any signs of heteroskedasticity: the spread of the standardized residuals looks roughly constant across the standardized fitted values.
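
A formal check is also possible, for example the Breusch-Pagan test (a sketch, assuming the lmtest package is installed; output not shown):

library(lmtest)
bptest(fit2) #H0: homoskedasticity; a large p-value supports the visual impression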

Are standardized residuals distributed normally? Show the graph and formally test it. Explain the findings.

hist(mydata_Apartments$StdResid,
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

shapiro.test(mydata_Apartments$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata_Apartments$StdResid
## W = 0.95303, p-value = 0.003645

The standardized residuals are not distributed normally. The Shapiro-Wilk test shows this: p-value = 0.003645 (p < 0.05), so I can reject the null hypothesis of normality (H0) and accept the alternative one (H1).
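
A Q-Q plot offers another visual check of the same assumption (a minimal sketch):

qqnorm(mydata_Apartments$StdResid, main = "Normal Q-Q plot of standardized residuals")
qqline(mydata_Apartments$StdResid) #Points far from the line indicate a departure from normality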

Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients.

mydata11 <- mydata_Apartments[-38, ] #Removing the potentially problematic unit (row 38)
fit2 <- lm(Price ~ Age + Distance,
           data = mydata11)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -604.92 -229.63  -56.49  192.97  599.35 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2456.076     73.931  33.221  < 2e-16 ***
## Age           -6.464      3.159  -2.046    0.044 *  
## Distance     -22.955      2.786  -8.240 2.52e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared:  0.4838, Adjusted R-squared:  0.4711 
## F-statistic: 37.96 on 2 and 81 DF,  p-value: 2.339e-12

Explanation of all coefficients:

  • Estimate for Age: -6.464 indicates that, on average, for each additional year of “Age,” the estimated “Price” decreases by 6.464 EUR, holding “Distance” constant.
  • Estimate for Distance: -22.955 indicates that, on average, each additional km of “Distance” from the city center decreases the estimated “Price” by 22.955 EUR, holding “Age” constant.
  • Std. Error: the standard error of “Distance” is quite low, which indicates that the estimate is precise. A small standard error means that the sample estimate is likely to be close to the corresponding population parameter.
  • P-values: all p-values are lower than 0.05, which means that we reject the null hypothesis (H0) in all cases and accept the alternative one (H1).
  • Residual standard error: a residual standard error of 276.1 means that, on average, the actual “Price” values in this dataset deviate from the predicted values by approximately 276.1 EUR.
  • R-squared: an R² value of 0.4838 means that approximately 48.38% of the variability in “Price” is explained by the independent variables (“Age” and “Distance”) included in this model. A worked prediction is sketched below.
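
For illustration, the fitted equation can be used for a prediction; the Age and Distance values below are hypothetical:

predict(fit2, newdata = data.frame(Age = 10, Distance = 5)) #≈ 2456.08 - 6.464*10 - 22.955*5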

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF, data=mydata11)
summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = mydata11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -473.21 -192.37  -28.89  204.17  558.77 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2482.048     79.263  31.314  < 2e-16 ***
## Age           -5.821      3.074  -1.894  0.06190 .  
## Distance     -20.279      2.886  -7.026 6.66e-10 ***
## ParkingFNO  -167.531     62.864  -2.665  0.00933 ** 
## BalconyFNO    15.207     59.201   0.257  0.79795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared:  0.5275, Adjusted R-squared:  0.5035 
## F-statistic: 22.04 on 4 and 79 DF,  p-value: 3.018e-12

With function anova check if model fit3 fits data better than model fit2.

anova(fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     81 6176767                              
## 2     79 5654480  2    522287 3.6485 0.03051 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

H0: fit2 = fit3. The null hypothesis in this context is that there is no significant difference in model fit between fit2 and fit3 (the added variables ParkingF and BalconyF do not improve the fit).

H1: fit2 != fit3. The alternative hypothesis is that there is a significant difference in model fit between fit2 and fit3.

Results: the p-value is 0.03051, which is less than 0.05, so I can reject the null hypothesis; based on the ANOVA test results, fit3 fits the data better than fit2.
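
The F statistic in the ANOVA table can be reproduced by hand from the two residual sums of squares printed above (a minimal sketch):

((6176767 - 5654480) / 2) / (5654480 / 79) #= 3.6485, matching the ANOVA output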

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = mydata11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -473.21 -192.37  -28.89  204.17  558.77 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2482.048     79.263  31.314  < 2e-16 ***
## Age           -5.821      3.074  -1.894  0.06190 .  
## Distance     -20.279      2.886  -7.026 6.66e-10 ***
## ParkingFNO  -167.531     62.864  -2.665  0.00933 ** 
## BalconyFNO    15.207     59.201   0.257  0.79795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared:  0.5275, Adjusted R-squared:  0.5035 
## F-statistic: 22.04 on 4 and 79 DF,  p-value: 3.018e-12

Explanation:

  • ParkingFNO: the coefficient for “ParkingFNO” represents the change in the predicted price when parking is not available (compared to when it is available). The negative coefficient (-167.531) suggests that, on average, apartments without parking are about 167.5 EUR per m2 cheaper, holding the other variables constant. The p-value indicates that this effect is statistically significant at the 0.01 level, meaning that the presence or absence of parking has a significant impact on the price prediction.
  • BalconyFNO: the coefficient for “BalconyFNO” represents the change in the predicted price when there is no balcony (compared to when there is a balcony). The positive coefficient (15.207) suggests that, on average, apartments without balconies have a slightly higher price, but the effect is not statistically significant (p-value > 0.05).

The hypothesis which is being tested with F-statistics:

  • H0: b1 = b2 = b3 = b4 = 0 (no relationship between any of the explanatory and the dependent variable)
  • H1: at least one of b1, b2, b3, b4 != 0 (at least one explanatory variable is related to the dependent one)

Save fitted values and calculate the residual for apartment ID2.

mydata11$residuals <- residuals(fit3)
mydata11$fittedValues <- fitted.values(fit3)
head(mydata11)
## # A tibble: 6 × 12
##     Age Distance Price Parking Balcony ParkingF BalconyF StdResid
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>       <dbl>
## 1     7       28  1640       0       1 NO       YES        -0.665
## 2    18        1  2800       1       0 YES      NO          1.78 
## 3     7       28  1660       0       0 NO       NO         -0.594
## 4    28       29  1850       0       1 NO       YES         0.754
## 5    18       18  1640       1       1 YES      YES        -1.07 
## 6    28       12  1770       0       1 NO       YES        -0.778
## # ℹ 4 more variables: CooksD <dbl>, StdFitted <dbl[,1]>,
## #   residuals <dbl>, fittedValues <dbl>
head(mydata11[2, c(11,12)])
## # A tibble: 1 × 2
##   residuals fittedValues
##       <dbl>        <dbl>
## 1      428.        2372.
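
Selecting the columns by name instead of by position is more robust if columns are later added or reordered (an equivalent minimal sketch):

mydata11[2, c("residuals", "fittedValues")] #Residual and fitted value for apartment ID2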