Task 1

library(carData)
data(Salaries)

Checking the first 6 rows

head(Salaries)
##        rank discipline yrs.since.phd yrs.service  sex salary
## 1      Prof          B            19          18 Male 139750
## 2      Prof          B            20          16 Male 173200
## 3  AsstProf          B             4           3 Male  79750
## 4      Prof          B            45          39 Male 115000
## 5      Prof          B            40          41 Male 141500
## 6 AssocProf          B             6           6 Male  97000

Explanation of the dataset:

The dataset has 397 units of observations and 6 variables, out of which 3 are numerical and 3 are categorical variables.

Rank - categorical variable that shows the academic rank of faculty members, it has 3 levels: “Assisstant Professor”, “Associate Professor” and “Professor”

Discipline - categorical variable that indicates the field of study in which the professor works, it has 2 levels: ‘A’(theoretical departments) and ‘B’(applied departments)

Years since PhD - numerical variable, the number of years since professor earned PhD

Years of service - numerical variable, the number of years the professor has been working at the University

Sex - binary variable with 2 categories, “Male” and “Female”

Salary - represents nine-month salary in dollars, numerical variable

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Creating a new variable called “IncreasedSalary” that increases each professor’s salary by 10%

Salaries <- Salaries %>%
  mutate(IncreasedSalary = salary * 1.10)
head(Salaries)
##        rank discipline yrs.since.phd yrs.service  sex salary IncreasedSalary
## 1      Prof          B            19          18 Male 139750          153725
## 2      Prof          B            20          16 Male 173200          190520
## 3  AsstProf          B             4           3 Male  79750           87725
## 4      Prof          B            45          39 Male 115000          126500
## 5      Prof          B            40          41 Male 141500          155650
## 6 AssocProf          B             6           6 Male  97000          106700

Deleting this variable

Salaries <- Salaries[ ,-7]

Renaming the variable “discipline” to “department”

colnames(Salaries)[2] <- "department"
head(Salaries)
##        rank department yrs.since.phd yrs.service  sex salary
## 1      Prof          B            19          18 Male 139750
## 2      Prof          B            20          16 Male 173200
## 3  AsstProf          B             4           3 Male  79750
## 4      Prof          B            45          39 Male 115000
## 5      Prof          B            40          41 Male 141500
## 6 AssocProf          B             6           6 Male  97000

Creating a new dataframe that displays the professors that have lower salary than 100000

new_dataframe <- Salaries %>%
  filter(rank == "Prof" & salary > 100000)
head(new_dataframe)
##   rank department yrs.since.phd yrs.service  sex salary
## 1 Prof          B            19          18 Male 139750
## 2 Prof          B            20          16 Male 173200
## 3 Prof          B            45          39 Male 115000
## 4 Prof          B            40          41 Male 141500
## 5 Prof          B            30          23 Male 175000
## 6 Prof          B            45          45 Male 147765

Descriptive statistics of the variables

summary(Salaries)
##         rank     department yrs.since.phd    yrs.service        sex     
##  AsstProf : 67   A:181      Min.   : 1.00   Min.   : 0.00   Female: 39  
##  AssocProf: 64   B:216      1st Qu.:12.00   1st Qu.: 7.00   Male  :358  
##  Prof     :266              Median :21.00   Median :16.00               
##                             Mean   :22.31   Mean   :17.61               
##                             3rd Qu.:32.00   3rd Qu.:27.00               
##                             Max.   :56.00   Max.   :60.00               
##      salary      
##  Min.   : 57800  
##  1st Qu.: 91000  
##  Median :107300  
##  Mean   :113706  
##  3rd Qu.:134185  
##  Max.   :231545

Salary, 1st Qu.: 91000 - it means that 25% of the salaries in the dataset are less than or equal to 91000

Salary, Mean :113706 - it means that the average salary is 113706

Years of service, Median :16.00 - it means that half of the faculty members have 16 or more years of service, and the other half have 16 or fewer years of services

From the data, we can see that the Salary Range is 173745 (Max-Min)

Years since PhD, 3rd Qu.:32.00 - this means that 75% of the members in the dataset have between 0 and 32 years of experience since earning PhD

#install.packages("ggplot2")
library(ggplot2)

Displaying the salaries based on the rank

ggplot(Salaries, aes(x = salary, fill = rank)) +
  geom_histogram(binwidth = 5000, position = "dodge") +
  labs(title = "Salary Distribution by Rank",
       x = "Salary",
       y = "Frequency") +
  scale_fill_manual(values = c("blue", "red", "green"), name = "Rank")

Salary ranges: The salary distribution for the Professors is the widest, with salaries spanning from 70000 to 230000

Central Tendency: The median salary for Assisstant Professors would roughly be 80000

Spread: The salary distribution for Professors has the widest range, with some professors earning significantly higher salaries

There are overlapping in salary ranges between ranks. For example, some Assisstant Professors earn similar salaries or even more than some Associate Professors

There are potential outliers in the category Professors

The distribution is slightly right-skewed

Task 2

Data import

dataset <- read.table("./Body mass.csv", 
                     header = TRUE, 
                     sep = ";", 
                     dec = ",")

head(dataset)
##   ID Mass.
## 1  1  62.1
## 2  2  64.5
## 3  3  56.5
## 4  4  53.4
## 5  5  61.3
## 6  6  62.2

ID: ID of a student Mass: weight of a student in kg

str(dataset[ , -1])
##  num [1:50] 62.1 64.5 56.5 53.4 61.3 62.2 62.7 64.5 59.5 68.9 ...

Descriptive statistics

summary(dataset$Mass.)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   49.70   60.23   62.80   62.88   64.50   83.20

Creating a histogram for Body Mass distribution

hist(dataset$Mass.,
     main = "Distribution of Body Mass",
     xlab = "Body Mass (kg)",
     ylab = "Frequency",
     col = "lightblue",
     border = "black")

Hypothesis testing

H0: Mu = 59.5

H1: Mu =/ 59.5

t.test(dataset$Mass.,
       mu = 59.5,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  dataset$Mass.
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
##  61.16758 64.58442
## sample estimates:
## mean of x 
##    62.876

Since p-value = 0.000234 is less than alpha = 0.05, we have to reject H0 and accept H1. That means that the average weight of ninth graders in 2021/2022 is different than it was in 2018/2019. The average weight of students increased.

95% CI for arithmetic mean: 61.16758 < Mu < 64.58442 Because 59.9 is not included in this interval, we reject H0 at 5%

Calculating the effect size

#install.packages("effectsize")
library(effectsize)

cohens_d(dataset$Mass., mu = 59.5)
## Cohen's d |       95% CI
## ------------------------
## 0.56      | [0.26, 0.86]
## 
## - Deviation from a difference of 59.5.

Interpreting the effect size

interpret_cohens_d(0.56, rules = "sawilowsky2009")
## [1] "medium"
## (Rules: sawilowsky2009)

Medium effect size means that the increase in the weight was not too big, but also not too small.

Task 3

Importing the dataset

library(readxl)

mydata <- "Apartments.xlsx"

apartments_data <- read_excel(mydata)

head(apartments_data)
## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

Changing categorical variables into factors

apartments_data$Parking1 <- factor(apartments_data$Parking,
                          levels = c(0,1),
                          labels = c("No", "Yes"))
head(apartments_data)
## # A tibble: 6 × 6
##     Age Distance Price Parking Balcony Parking1
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>   
## 1     7       28  1640       0       1 No      
## 2    18        1  2800       1       0 Yes     
## 3     7       28  1660       0       0 No      
## 4    28       29  1850       0       1 No      
## 5    18       18  1640       1       1 Yes     
## 6    28       12  1770       0       1 No

Doing the same for variable Balcony

apartments_data$Balcony1 <- factor(apartments_data$Balcony,
                                   levels = c(0,1),
                                   labels = c("No", "Yes"))

head(apartments_data)
## # A tibble: 6 × 7
##     Age Distance Price Parking Balcony Parking1 Balcony1
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>   
## 1     7       28  1640       0       1 No       Yes     
## 2    18        1  2800       1       0 Yes      No      
## 3     7       28  1660       0       0 No       No      
## 4    28       29  1850       0       1 No       Yes     
## 5    18       18  1640       1       1 Yes      Yes     
## 6    28       12  1770       0       1 No       Yes

Testing the hypothesis:

H0: Mu_Price = 1900

H1: Mu_Price =/ 1900

t.test(apartments_data$Price,
       mu = 1900,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  apartments_data$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

We reject H0 because p-value = 0.004731 is less than alpha = 0.05 and accept H1. From the data, we can see that the average price is higher than 1900.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age,
          data = apartments_data)
summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = apartments_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

Price = 2185.455 - 8.975Age

Regression coefficient = 8.975, which means that if age increases for 1 year, the price per m2 will be lowered by 8.975 dollars.

b0 = 2185.455 - the price at the beginning is 2185.455 per m2 (new apartment)

Coefficient of determination (r2) - 53% of variability of price can be explained by the effect of age

cor(apartments_data$Price, apartments_data$Age)
## [1] -0.230255

r = -0.23, indicates a relatively weak negative linear relationship, which means that if the age of the apartment increases, the price will slightly decrease

Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
scatterplotMatrix(apartments_data[ , c(-4, -5, -6, -7)], #Scatterplot matrix,
                  smooth = FALSE)

#install.packages("Hmisc")
library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
rcorr(as.matrix(apartments_data[ , c(-4, -5, -6, -7)])) #Correlation matrix
##            Age Distance Price
## Age       1.00     0.04 -0.23
## Distance  0.04     1.00 -0.63
## Price    -0.23    -0.63  1.00
## 
## n= 85 
## 
## 
## P
##          Age    Distance Price 
## Age             0.6966   0.0340
## Distance 0.6966          0.0000
## Price    0.0340 0.0000

There is not a potential problem of multicolinearity because we don’t see any high correlation values between variables

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance,
   data = apartments_data)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = apartments_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Price = 2460.101 - 7.934 * Age - 20.667 * Distance

Chech the multicolinearity with VIF statistics. Explain the findings.

vif(fit2) #Mulitcolinearity
##      Age Distance 
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845

None of the variables should be removed, because both values are under 5 - there is no problem of multicolinearity

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

apartments_data$StdResid <- round(rstandard(fit2), 3) #Standardized residuals
apartments_data$CooksD <- round(cooks.distance(fit2), 3) #Cooks distances
head(apartments_data[order(apartments_data$StdResid), ], 3)
## # A tibble: 3 × 9
##     Age Distance Price Parking Balcony Parking1 Balcony1 StdResid CooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>       <dbl>  <dbl>
## 1     7        2  1760       0       1 No       Yes         -2.15  0.066
## 2    12       14  1650       0       1 No       Yes         -1.50  0.013
## 3    12       14  1650       0       0 No       No          -1.50  0.013
head(apartments_data[order(-apartments_data$CooksD), ], 6)
## # A tibble: 6 × 9
##     Age Distance Price Parking Balcony Parking1 Balcony1 StdResid CooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>       <dbl>  <dbl>
## 1     5       45  2180       1       1 Yes      Yes          2.58  0.32 
## 2    43       37  1740       0       0 No       No           1.44  0.104
## 3     2       11  2790       1       0 Yes      No           2.05  0.069
## 4     7        2  1760       0       1 No       Yes         -2.15  0.066
## 5    37        3  2540       1       1 Yes      Yes          1.58  0.061
## 6    40        2  2400       0       1 No       Yes          1.09  0.038

Removing the one with the highest cooks distance

removed_unit_data <- apartments_data[-37, ] 

There is a 1 outlier based on cooks distance equal to 0.320 which highly differs from other units

hist(apartments_data$CooksD,
     xlab = "CooksD",
     ylab = "Frequency",
     main = "Histogram of cooks distances")

#### Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings.

hist(apartments_data$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

shapiro.test(apartments_data$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  apartments_data$StdResid
## W = 0.95303, p-value = 0.003645

H0: variable is normally distributed H1: variable is not normally distributed

Since p-value is less than 5%, we reject H0 - so standardized residuals are not normally distributed

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

Inserted categorical variables as factors

fit3 <- lm(Price ~ Age + Distance + Parking1 + Balcony1,
           data = apartments_data)

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking1 + Balcony1, data = apartments_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2301.667     94.271  24.415  < 2e-16 ***
## Age           -6.799      3.110  -2.186  0.03172 *  
## Distance     -18.045      2.758  -6.543 5.28e-09 ***
## Parking1Yes  196.168     62.868   3.120  0.00251 ** 
## Balcony1Yes    1.935     60.014   0.032  0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11

A positive coefficient for the parking variable indicates that there is a positive relationship between the presence of parking and price. If there is a parking, the price of the apartment is expected to increase by 182.107 dollars, assuming all other variables remain constant.

A negative coefficient for the balcony variable indicates that there is a negative relationship between the presence of a balcony and price. If there is a balcony in the apartment, the price is expected to decrease by 29.899 dollars, assuming all other variables remain constant.