library(carData)
data(Salaries)
head(Salaries)
## rank discipline yrs.since.phd yrs.service sex salary
## 1 Prof B 19 18 Male 139750
## 2 Prof B 20 16 Male 173200
## 3 AsstProf B 4 3 Male 79750
## 4 Prof B 45 39 Male 115000
## 5 Prof B 40 41 Male 141500
## 6 AssocProf B 6 6 Male 97000
The dataset has 397 units of observations and 6 variables, out of which 3 are numerical and 3 are categorical variables.
Rank - categorical variable that shows the academic rank of faculty members, it has 3 levels: “Assisstant Professor”, “Associate Professor” and “Professor”
Discipline - categorical variable that indicates the field of study in which the professor works, it has 2 levels: ‘A’(theoretical departments) and ‘B’(applied departments)
Years since PhD - numerical variable, the number of years since professor earned PhD
Years of service - numerical variable, the number of years the professor has been working at the University
Sex - binary variable with 2 categories, “Male” and “Female”
Salary - represents nine-month salary in dollars, numerical variable
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Salaries <- Salaries %>%
mutate(IncreasedSalary = salary * 1.10)
head(Salaries)
## rank discipline yrs.since.phd yrs.service sex salary IncreasedSalary
## 1 Prof B 19 18 Male 139750 153725
## 2 Prof B 20 16 Male 173200 190520
## 3 AsstProf B 4 3 Male 79750 87725
## 4 Prof B 45 39 Male 115000 126500
## 5 Prof B 40 41 Male 141500 155650
## 6 AssocProf B 6 6 Male 97000 106700
Salaries <- Salaries[ ,-7]
colnames(Salaries)[2] <- "department"
head(Salaries)
## rank department yrs.since.phd yrs.service sex salary
## 1 Prof B 19 18 Male 139750
## 2 Prof B 20 16 Male 173200
## 3 AsstProf B 4 3 Male 79750
## 4 Prof B 45 39 Male 115000
## 5 Prof B 40 41 Male 141500
## 6 AssocProf B 6 6 Male 97000
new_dataframe <- Salaries %>%
filter(rank == "Prof" & salary > 100000)
head(new_dataframe)
## rank department yrs.since.phd yrs.service sex salary
## 1 Prof B 19 18 Male 139750
## 2 Prof B 20 16 Male 173200
## 3 Prof B 45 39 Male 115000
## 4 Prof B 40 41 Male 141500
## 5 Prof B 30 23 Male 175000
## 6 Prof B 45 45 Male 147765
summary(Salaries)
## rank department yrs.since.phd yrs.service sex
## AsstProf : 67 A:181 Min. : 1.00 Min. : 0.00 Female: 39
## AssocProf: 64 B:216 1st Qu.:12.00 1st Qu.: 7.00 Male :358
## Prof :266 Median :21.00 Median :16.00
## Mean :22.31 Mean :17.61
## 3rd Qu.:32.00 3rd Qu.:27.00
## Max. :56.00 Max. :60.00
## salary
## Min. : 57800
## 1st Qu.: 91000
## Median :107300
## Mean :113706
## 3rd Qu.:134185
## Max. :231545
Salary, 1st Qu.: 91000 - it means that 25% of the salaries in the dataset are less than or equal to 91000
Salary, Mean :113706 - it means that the average salary is 113706
Years of service, Median :16.00 - it means that half of the faculty members have 16 or more years of service, and the other half have 16 or fewer years of services
From the data, we can see that the Salary Range is 173745 (Max-Min)
Years since PhD, 3rd Qu.:32.00 - this means that 75% of the members in the dataset have between 0 and 32 years of experience since earning PhD
#install.packages("ggplot2")
library(ggplot2)
ggplot(Salaries, aes(x = salary, fill = rank)) +
geom_histogram(binwidth = 5000, position = "dodge") +
labs(title = "Salary Distribution by Rank",
x = "Salary",
y = "Frequency") +
scale_fill_manual(values = c("blue", "red", "green"), name = "Rank")
Salary ranges: The salary distribution for the Professors is the widest,
with salaries spanning from 70000 to 230000
Central Tendency: The median salary for Assisstant Professors would roughly be 80000
Spread: The salary distribution for Professors has the widest range, with some professors earning significantly higher salaries
There are overlapping in salary ranges between ranks. For example, some Assisstant Professors earn similar salaries or even more than some Associate Professors
There are potential outliers in the category Professors
The distribution is slightly right-skewed
dataset <- read.table("./Body mass.csv",
header = TRUE,
sep = ";",
dec = ",")
head(dataset)
## ID Mass.
## 1 1 62.1
## 2 2 64.5
## 3 3 56.5
## 4 4 53.4
## 5 5 61.3
## 6 6 62.2
ID: ID of a student Mass: weight of a student in kg
str(dataset[ , -1])
## num [1:50] 62.1 64.5 56.5 53.4 61.3 62.2 62.7 64.5 59.5 68.9 ...
summary(dataset$Mass.)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 49.70 60.23 62.80 62.88 64.50 83.20
hist(dataset$Mass.,
main = "Distribution of Body Mass",
xlab = "Body Mass (kg)",
ylab = "Frequency",
col = "lightblue",
border = "black")
H0: Mu = 59.5
H1: Mu =/ 59.5
t.test(dataset$Mass.,
mu = 59.5,
alternative = "two.sided")
##
## One Sample t-test
##
## data: dataset$Mass.
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
## 61.16758 64.58442
## sample estimates:
## mean of x
## 62.876
Since p-value = 0.000234 is less than alpha = 0.05, we have to reject H0 and accept H1. That means that the average weight of ninth graders in 2021/2022 is different than it was in 2018/2019. The average weight of students increased.
95% CI for arithmetic mean: 61.16758 < Mu < 64.58442 Because 59.9 is not included in this interval, we reject H0 at 5%
#install.packages("effectsize")
library(effectsize)
cohens_d(dataset$Mass., mu = 59.5)
## Cohen's d | 95% CI
## ------------------------
## 0.56 | [0.26, 0.86]
##
## - Deviation from a difference of 59.5.
interpret_cohens_d(0.56, rules = "sawilowsky2009")
## [1] "medium"
## (Rules: sawilowsky2009)
Medium effect size means that the increase in the weight was not too big, but also not too small.
library(readxl)
mydata <- "Apartments.xlsx"
apartments_data <- read_excel(mydata)
head(apartments_data)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description:
apartments_data$Parking1 <- factor(apartments_data$Parking,
levels = c(0,1),
labels = c("No", "Yes"))
head(apartments_data)
## # A tibble: 6 × 6
## Age Distance Price Parking Balcony Parking1
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
## 1 7 28 1640 0 1 No
## 2 18 1 2800 1 0 Yes
## 3 7 28 1660 0 0 No
## 4 28 29 1850 0 1 No
## 5 18 18 1640 1 1 Yes
## 6 28 12 1770 0 1 No
apartments_data$Balcony1 <- factor(apartments_data$Balcony,
levels = c(0,1),
labels = c("No", "Yes"))
head(apartments_data)
## # A tibble: 6 × 7
## Age Distance Price Parking Balcony Parking1 Balcony1
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
## 1 7 28 1640 0 1 No Yes
## 2 18 1 2800 1 0 Yes No
## 3 7 28 1660 0 0 No No
## 4 28 29 1850 0 1 No Yes
## 5 18 18 1640 1 1 Yes Yes
## 6 28 12 1770 0 1 No Yes
H0: Mu_Price = 1900
H1: Mu_Price =/ 1900
t.test(apartments_data$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: apartments_data$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
We reject H0 because p-value = 0.004731 is less than alpha = 0.05 and accept H1. From the data, we can see that the average price is higher than 1900.
fit1 <- lm(Price ~ Age,
data = apartments_data)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = apartments_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
Price = 2185.455 - 8.975Age
Regression coefficient = 8.975, which means that if age increases for 1 year, the price per m2 will be lowered by 8.975 dollars.
b0 = 2185.455 - the price at the beginning is 2185.455 per m2 (new apartment)
Coefficient of determination (r2) - 53% of variability of price can be explained by the effect of age
cor(apartments_data$Price, apartments_data$Age)
## [1] -0.230255
r = -0.23, indicates a relatively weak negative linear relationship, which means that if the age of the apartment increases, the price will slightly decrease
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
scatterplotMatrix(apartments_data[ , c(-4, -5, -6, -7)], #Scatterplot matrix,
smooth = FALSE)
#install.packages("Hmisc")
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(apartments_data[ , c(-4, -5, -6, -7)])) #Correlation matrix
## Age Distance Price
## Age 1.00 0.04 -0.23
## Distance 0.04 1.00 -0.63
## Price -0.23 -0.63 1.00
##
## n= 85
##
##
## P
## Age Distance Price
## Age 0.6966 0.0340
## Distance 0.6966 0.0000
## Price 0.0340 0.0000
There is not a potential problem of multicolinearity because we don’t see any high correlation values between variables
fit2 <- lm(Price ~ Age + Distance,
data = apartments_data)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = apartments_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
Price = 2460.101 - 7.934 * Age - 20.667 * Distance
vif(fit2) #Mulitcolinearity
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
None of the variables should be removed, because both values are under 5 - there is no problem of multicolinearity
apartments_data$StdResid <- round(rstandard(fit2), 3) #Standardized residuals
apartments_data$CooksD <- round(cooks.distance(fit2), 3) #Cooks distances
head(apartments_data[order(apartments_data$StdResid), ], 3)
## # A tibble: 3 × 9
## Age Distance Price Parking Balcony Parking1 Balcony1 StdResid CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 7 2 1760 0 1 No Yes -2.15 0.066
## 2 12 14 1650 0 1 No Yes -1.50 0.013
## 3 12 14 1650 0 0 No No -1.50 0.013
head(apartments_data[order(-apartments_data$CooksD), ], 6)
## # A tibble: 6 × 9
## Age Distance Price Parking Balcony Parking1 Balcony1 StdResid CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 5 45 2180 1 1 Yes Yes 2.58 0.32
## 2 43 37 1740 0 0 No No 1.44 0.104
## 3 2 11 2790 1 0 Yes No 2.05 0.069
## 4 7 2 1760 0 1 No Yes -2.15 0.066
## 5 37 3 2540 1 1 Yes Yes 1.58 0.061
## 6 40 2 2400 0 1 No Yes 1.09 0.038
Removing the one with the highest cooks distance
removed_unit_data <- apartments_data[-37, ]
There is a 1 outlier based on cooks distance equal to 0.320 which highly differs from other units
hist(apartments_data$CooksD,
xlab = "CooksD",
ylab = "Frequency",
main = "Histogram of cooks distances")
#### Are standardized residuals ditributed normally? Show the graph and
formally test it. Explain the findings.
hist(apartments_data$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(apartments_data$StdResid)
##
## Shapiro-Wilk normality test
##
## data: apartments_data$StdResid
## W = 0.95303, p-value = 0.003645
H0: variable is normally distributed H1: variable is not normally distributed
Since p-value is less than 5%, we reject H0 - so standardized residuals are not normally distributed
Inserted categorical variables as factors
fit3 <- lm(Price ~ Age + Distance + Parking1 + Balcony1,
data = apartments_data)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking1 + Balcony1, data = apartments_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## Parking1Yes 196.168 62.868 3.120 0.00251 **
## Balcony1Yes 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
A positive coefficient for the parking variable indicates that there is a positive relationship between the presence of parking and price. If there is a parking, the price of the apartment is expected to increase by 182.107 dollars, assuming all other variables remain constant.
A negative coefficient for the balcony variable indicates that there is a negative relationship between the presence of a balcony and price. If there is a balcony in the apartment, the price is expected to decrease by 29.899 dollars, assuming all other variables remain constant.