##TASK 1 ####LUKA ČERNILA
data() #used this function to summon the data
data(package = .packages(all.available = TRUE)) #copied this row from the previus data base to get even more datasets
#install.packages("library") #i intalled the package library so i can run datasets and I put # before it afterwards so it doesnt install again and again everytime
library("MASS")
mydata <- OME #I have choosen this dataset and inserted it as my data
head(mydata) #i summoned the first six rows of my dataset for a quick overview
## ID Age OME Loud Noise Correct Trials
## 1 1 30 low 35 coherent 1 4
## 2 1 30 low 35 incoherent 4 5
## 3 1 30 low 40 coherent 0 3
## 4 1 30 low 40 incoherent 1 1
## 5 1 30 low 45 coherent 2 4
## 6 1 30 low 45 incoherent 2 2
Varibles explanations: ID- id of the subject in the study OME- the extent of OME which can be low high or not given AGE-age of the subjects in months Loud-loudnes of stimulus in decibels Noise- whether the signal in the stimulus was “coherent” or “incoherent”. Correct- Number of correct responses from Trials trials. Trials- Number of trials performed.
mydata$AgeY <- mydata$Age / 12 #here i tried to get an extra row that shows the age in years for easier comprehention
mydata$ID<-NULL #i eliminated the column ID because it doesnt make sense to run descriptive statistics for this variable
head(mydata)
## Age OME Loud Noise Correct Trials AgeY
## 1 30 low 35 coherent 1 4 2.5
## 2 30 low 35 incoherent 4 5 2.5
## 3 30 low 40 coherent 0 3 2.5
## 4 30 low 40 incoherent 1 1 2.5
## 5 30 low 45 coherent 2 4 2.5
## 6 30 low 45 incoherent 2 2 2.5
summary(mydata) #i ran this code to see the descriptive statistics
## Age OME Loud Noise Correct
## Min. : 7.00 N/A :385 Min. :35.00 coherent :552 Min. : 0.000
## 1st Qu.:18.00 high:178 1st Qu.:40.00 incoherent:545 1st Qu.: 1.000
## Median :30.00 low :534 Median :45.00 Median : 3.000
## Mean :28.52 Mean :45.38 Mean : 2.964
## 3rd Qu.:30.00 3rd Qu.:50.00 3rd Qu.: 4.000
## Max. :60.00 Max. :65.00 Max. :13.000
## Trials AgeY
## Min. : 1.000 Min. :0.5833
## 1st Qu.: 2.000 1st Qu.:1.5000
## Median : 3.000 Median :2.5000
## Mean : 3.825 Mean :2.3766
## 3rd Qu.: 5.000 3rd Qu.:2.5000
## Max. :14.000 Max. :5.0000
summary(mydata)[ , c(-4)] #i deleted the forth column, because it is categorical
## Age OME Loud Correct Trials
## Min. : 7.00 N/A :385 Min. :35.00 Min. : 0.000 Min. : 1.000
## 1st Qu.:18.00 high:178 1st Qu.:40.00 1st Qu.: 1.000 1st Qu.: 2.000
## Median :30.00 low :534 Median :45.00 Median : 3.000 Median : 3.000
## Mean :28.52 Mean :45.38 Mean : 2.964 Mean : 3.825
## 3rd Qu.:30.00 3rd Qu.:50.00 3rd Qu.: 4.000 3rd Qu.: 5.000
## Max. :60.00 Max. :65.00 Max. :13.000 Max. :14.000
## AgeY
## Min. :0.5833
## 1st Qu.:1.5000
## Median :2.5000
## Mean :2.3766
## 3rd Qu.:2.5000
## Max. :5.0000
description: MEAN for Age= The average age of a baby in this study is 28.52 years old. Min. Trials= The minimum amount of trials performed on a baby was 1. Max. Loud= The maximum loudness of the sound played was 65 decibels.
hist(mydata$Correct,
main = "Histogram of correct responses to trials",
xlab = "times of correct responses",
ylab = "Frequency",
breaks = 11)
Describtion: From the histogram above we can tell that the largest
frequency of babies had between zero and two correct answers in the
study.
#install.packages("car")
library(car)
## Loading required package: carData
scatterplot(y=mydata$Age,
x=mydata$Correct,
ylab="age",
xlab="Times of correct responses",
smooth=FALSE)
Description: From this scatterplot we can tell, that the older the age
of the baby, more correct answers he or she gets.
##TASK 2
mydata <- read.table("./Body mass.csv", header = TRUE, sep = ";", dec = ",") #First I used the function above to instruct R which file to read. I used sep to specify the field seperator and dec to specify the decimal point used.
summary(mydata) #I used this function to show the descriptive statistics of our data.
## ID Mass
## Min. : 1.00 Min. :49.70
## 1st Qu.:13.25 1st Qu.:60.23
## Median :25.50 Median :62.80
## Mean :25.50 Mean :62.88
## 3rd Qu.:37.75 3rd Qu.:64.50
## Max. :50.00 Max. :83.20
summary(mydata[ , c(-1)]) #I deleted first column because there is no point to run all of those statistics for ID
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 49.70 60.23 62.80 62.88 64.50 83.20
hist(mydata$Mass,
breaks = 6,
col = "yellow",
border = "black",
main = "Diatribution of the nine grader's weight",
xlab = "Body mass",
ylab = "Frequency")
#here I used the simplest function - hist(), to create a histrogram that shows me the frequency of body mass distribution.
Explanation of the histogram:
From this histogram we can tell, that the most teenagers in this sample have between 60kg and 65kg.
T TEST (mean)
H0: Mu = 59.5 H1: Mu =/ 59.5
mean(mydata$Mass)
## [1] 62.876
We can see that the average student in 2021/22 was 3,376 heavier compared to the year 2018/2019.
average <- mean(mydata$Mass)
result <- t.test(mydata$Mass, mu = 59.5)
cat("P-value:", result$p.value,"\n")
## P-value: 0.0002339924
P-value < 0,0001 That means that we can reject the H0
Effectsize
library("effectsize")
cohens_d(mydata$Mass, mu= 59.5)
## Cohen's d | 95% CI
## ------------------------
## 0.56 | [0.26, 0.86]
##
## - Deviation from a difference of 59.5.
interpret_cohens_d(0.56, "sawilowsky2009")
## [1] "medium"
## (Rules: sawilowsky2009)
Because Cohen’s d is medium, that there is a moderate or noticable difference between the conditions we are comparing.
##TASK 3
#install.packages("readxl")
library(readxl)
mydataA <- read_xlsx("./Apartments.xlsx")
Description:
mydataA$ParkingFactor <- factor(mydataA$Parking,
levels = c(1, 0),
labels = c("YES", "NO"))
mydataA$BalconyFactor <- factor(mydataA$Balcony,
levels = c(1, 0),
labels = c("YES", "NO"))
t.test(mydataA$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydataA$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
Since the p-value of this test was less then 0.0001 we can confidentlly reject the H0.
#install.packages("ggplot2")
library(ggplot2)
lm(Price ~ Age, data = mydataA)
##
## Call:
## lm(formula = Price ~ Age, data = mydataA)
##
## Coefficients:
## (Intercept) Age
## 2185.455 -8.975
fit1 <- lm(Price ~ Age, data = mydataA)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydataA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
r<-sqrt(0.05302)
#install.packages(car)
library(car)
scatterplotMatrix(mydataA[ , c(-4, -5, -6, -7)],
smooth = FALSE)
cor_matrix <- cor(mydataA[, c("Price", "Age", "Distance")])
As we can from cor_matrix , no value is close to 1, which means there is no potential risk for multicolinearity.
fit2 <- lm(Price ~ Age + Distance,
data = mydataA)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydataA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
library("car")
vif(fit2)
## Age Distance
## 1.001845 1.001845
mydataA$StdResid <- round(rstandard(fit2), 3)
mydataA$CooksD <- round(cooks.distance(fit2), 3)
mydataA$StdFitted <- scale(fit2$fitted.values)
library(car)
scatterplot(y = mydataA$StdResid, x = mydataA$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
Description: As we can see from the scatterplot above there is no sign
of heterostecity.
hist(mydataA$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(mydataA$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydataA$StdResid
## W = 0.95303, p-value = 0.003645
Description: we can tell that the residuals are not distributed normally . That is proven also by the P-value which is less then 0.05, therefore we can reject H0.
mydata2 <- mydataA [-38, ]
fit2 <- lm(Price ~ Age + Distance,
data = mydata2)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -604.92 -229.63 -56.49 192.97 599.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2456.076 73.931 33.221 < 2e-16 ***
## Age -6.464 3.159 -2.046 0.044 *
## Distance -22.955 2.786 -8.240 2.52e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared: 0.4838, Adjusted R-squared: 0.4711
## F-statistic: 37.96 on 2 and 81 DF, p-value: 2.339e-12
Est of age: estimated price decreases by 6.464$, with each additional year of age. Std. Error: it is very low, which means the estimate is very accured. Res. std. error: on average, the actual price varies from predicted values by 276.1 P-value: we can reject all of the H0 since p values are all bellow 0.05 R squared: 48,38% of the variability in price is explained by independet variables.
fit3 <- lm(Price ~ Age + Distance + ParkingFactor + BalconyFactor, data=mydata2)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor,
## data = mydata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -473.21 -192.37 -28.89 204.17 558.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2482.048 79.263 31.314 < 2e-16 ***
## Age -5.821 3.074 -1.894 0.06190 .
## Distance -20.279 2.886 -7.026 6.66e-10 ***
## ParkingFactorNO -167.531 62.864 -2.665 0.00933 **
## BalconyFactorNO 15.207 59.201 0.257 0.79795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared: 0.5275, Adjusted R-squared: 0.5035
## F-statistic: 22.04 on 4 and 79 DF, p-value: 3.018e-12
anova(fit2,fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingFactor + BalconyFactor
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 81 6176767
## 2 79 5654480 2 522287 3.6485 0.03051 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Description: H0=no difference between fit2 and fit3 H1= there is a difference
p-value < 0.05, which means we can reject the H0.
anova suggests that fit3 is a better fit.
summary (fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor,
## data = mydata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -473.21 -192.37 -28.89 204.17 558.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2482.048 79.263 31.314 < 2e-16 ***
## Age -5.821 3.074 -1.894 0.06190 .
## Distance -20.279 2.886 -7.026 6.66e-10 ***
## ParkingFactorNO -167.531 62.864 -2.665 0.00933 **
## BalconyFactorNO 15.207 59.201 0.257 0.79795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared: 0.5275, Adjusted R-squared: 0.5035
## F-statistic: 22.04 on 4 and 79 DF, p-value: 3.018e-12
Describe: H0: There is no relationship between b1,b2,b3 and b4 H1: b1, b2, b3, b4, !=0 (at least on of them is not worth zero)
Since the p value is 3.018e-12, we can reject the null hypothesis.
mydata2$residuals <- residuals(fit3)
mydata2$fittedValues <- fitted.values(fit3)
head(mydata2)
## # A tibble: 6 × 12
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor StdResid
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl>
## 1 7 28 1640 0 1 NO YES -0.665
## 2 18 1 2800 1 0 YES NO 1.78
## 3 7 28 1660 0 0 NO NO -0.594
## 4 28 29 1850 0 1 NO YES 0.754
## 5 18 18 1640 1 1 YES YES -1.07
## 6 28 12 1770 0 1 NO YES -0.778
## # ℹ 4 more variables: CooksD <dbl>, StdFitted <dbl[,1]>, residuals <dbl>,
## # fittedValues <dbl>
mydata2$residuals <- residuals(fit3)
mydata2$fittedValues <- fitted.values(fit3)
head(mydata2[2, c(11,12)])
## # A tibble: 1 × 2
## residuals fittedValues
## <dbl> <dbl>
## 1 428. 2372.