TASK1

##TASK 1 ####LUKA ČERNILA

data() #used this function to summon the data

data(package = .packages(all.available = TRUE)) #copied this row from the previus data base to get even more datasets

#install.packages("library") #i intalled the package library so i can run datasets and I put # before it afterwards so it doesnt install again and again everytime

library("MASS")
mydata <- OME  #I have choosen this dataset and inserted it as my data

head(mydata) #i summoned the first six rows of my dataset for a quick overview

##   ID Age OME Loud      Noise Correct Trials
## 1  1  30 low   35   coherent       1      4
## 2  1  30 low   35 incoherent       4      5
## 3  1  30 low   40   coherent       0      3
## 4  1  30 low   40 incoherent       1      1
## 5  1  30 low   45   coherent       2      4
## 6  1  30 low   45 incoherent       2      2

Varibles explanations: ID- id of the subject in the study OME- the extent of OME which can be low high or not given AGE-age of the subjects in months Loud-loudnes of stimulus in decibels Noise- whether the signal in the stimulus was “coherent” or “incoherent”. Correct- Number of correct responses from Trials trials. Trials- Number of trials performed.

mydata$AgeY <- mydata$Age / 12 #here i tried to get an extra row that shows the age in years for easier comprehention

mydata$ID<-NULL #i eliminated the column ID because it doesnt make sense to run descriptive statistics for this variable

head(mydata)

##   Age OME Loud      Noise Correct Trials AgeY
## 1  30 low   35   coherent       1      4  2.5
## 2  30 low   35 incoherent       4      5  2.5
## 3  30 low   40   coherent       0      3  2.5
## 4  30 low   40 incoherent       1      1  2.5
## 5  30 low   45   coherent       2      4  2.5
## 6  30 low   45 incoherent       2      2  2.5

summary(mydata) #i ran this code to see the descriptive statistics

##       Age          OME           Loud              Noise        Correct      
##  Min.   : 7.00   N/A :385   Min.   :35.00   coherent  :552   Min.   : 0.000  
##  1st Qu.:18.00   high:178   1st Qu.:40.00   incoherent:545   1st Qu.: 1.000  
##  Median :30.00   low :534   Median :45.00                    Median : 3.000  
##  Mean   :28.52              Mean   :45.38                    Mean   : 2.964  
##  3rd Qu.:30.00              3rd Qu.:50.00                    3rd Qu.: 4.000  
##  Max.   :60.00              Max.   :65.00                    Max.   :13.000  
##      Trials            AgeY       
##  Min.   : 1.000   Min.   :0.5833  
##  1st Qu.: 2.000   1st Qu.:1.5000  
##  Median : 3.000   Median :2.5000  
##  Mean   : 3.825   Mean   :2.3766  
##  3rd Qu.: 5.000   3rd Qu.:2.5000  
##  Max.   :14.000   Max.   :5.0000

summary(mydata)[ , c(-4)] #i deleted the forth column, because it is categorical

##       Age          OME           Loud          Correct           Trials      
##  Min.   : 7.00   N/A :385   Min.   :35.00   Min.   : 0.000   Min.   : 1.000  
##  1st Qu.:18.00   high:178   1st Qu.:40.00   1st Qu.: 1.000   1st Qu.: 2.000  
##  Median :30.00   low :534   Median :45.00   Median : 3.000   Median : 3.000  
##  Mean   :28.52              Mean   :45.38   Mean   : 2.964   Mean   : 3.825  
##  3rd Qu.:30.00              3rd Qu.:50.00   3rd Qu.: 4.000   3rd Qu.: 5.000  
##  Max.   :60.00              Max.   :65.00   Max.   :13.000   Max.   :14.000  
##       AgeY       
##  Min.   :0.5833  
##  1st Qu.:1.5000  
##  Median :2.5000  
##  Mean   :2.3766  
##  3rd Qu.:2.5000  
##  Max.   :5.0000

description: MEAN for Age= The average age of a baby in this study is 28.52 years old. Min. Trials= The minimum amount of trials performed on a baby was 1. Max. Loud= The maximum loudness of the sound played was 65 decibels.

hist(mydata$Correct,
     main = "Histogram of correct responses to trials",
     xlab = "times of correct responses",
     ylab = "Frequency",
     breaks = 11)

Describtion: From the histogram above we can tell that the largest frequency of babies had between zero and two correct answers in the study.

#install.packages("car")

library(car)

## Loading required package: carData

scatterplot(y=mydata$Age,
            x=mydata$Correct,
            ylab="age",
            xlab="Times of correct responses",
            smooth=FALSE)

Description: From this scatterplot we can tell, that the older the age of the baby, more correct answers he or she gets.

##TASK 2

mydata <- read.table("./Body mass.csv", header = TRUE, sep = ";", dec = ",")        #First I used the function above to instruct R which file to read. I used sep to specify the field seperator and dec to specify the decimal point used.

summary(mydata) #I used this function to show the descriptive statistics of our data.

##        ID             Mass      
##  Min.   : 1.00   Min.   :49.70  
##  1st Qu.:13.25   1st Qu.:60.23  
##  Median :25.50   Median :62.80  
##  Mean   :25.50   Mean   :62.88  
##  3rd Qu.:37.75   3rd Qu.:64.50  
##  Max.   :50.00   Max.   :83.20

summary(mydata[ , c(-1)]) #I deleted first column because there is no point to run all of those statistics for ID

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   49.70   60.23   62.80   62.88   64.50   83.20

hist(mydata$Mass,
     breaks = 6,
     col = "yellow",
     border = "black",
     main = "Diatribution of the nine grader's weight",
     xlab = "Body mass",
     ylab = "Frequency")

#here I used the simplest function - hist(),  to create a histrogram that shows me the frequency of body mass distribution.

Explanation of the histogram:

From this histogram we can tell, that the most teenagers in this sample have between 60kg and 65kg.

T TEST (mean)

H0: Mu = 59.5 H1: Mu =/ 59.5

mean(mydata$Mass)

## [1] 62.876

We can see that the average student in 2021/22 was 3,376 heavier compared to the year 2018/2019.

average <- mean(mydata$Mass)

result <- t.test(mydata$Mass, mu = 59.5)

cat("P-value:", result$p.value,"\n")

## P-value: 0.0002339924

P-value < 0,0001 That means that we can reject the H0

Effectsize

library("effectsize")
cohens_d(mydata$Mass, mu= 59.5)

## Cohen's d |       95% CI
## ------------------------
## 0.56      | [0.26, 0.86]
## 
## - Deviation from a difference of 59.5.

interpret_cohens_d(0.56, "sawilowsky2009")

## [1] "medium"
## (Rules: sawilowsky2009)

Because Cohen’s d is medium, that there is a moderate or noticable difference between the conditions we are comparing.

##TASK 3

Import the dataset Apartments.xlsx

#install.packages("readxl")

library(readxl)
mydataA <- read_xlsx("./Apartments.xlsx")

Description:

Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes

Change categorical variables into factors.

mydataA$ParkingFactor <- factor(mydataA$Parking,
 levels = c(1, 0),
 labels = c("YES", "NO"))

mydataA$BalconyFactor <- factor(mydataA$Balcony,
 levels = c(1, 0),
 labels = c("YES", "NO"))

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

t.test(mydataA$Price,
 mu = 1900,
 alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  mydataA$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

Since the p-value of this test was less then 0.0001 we can confidentlly reject the H0.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

#install.packages("ggplot2")
library(ggplot2)
lm(Price ~ Age, data = mydataA)

## 
## Call:
## lm(formula = Price ~ Age, data = mydataA)
## 
## Coefficients:
## (Intercept)          Age  
##    2185.455       -8.975

fit1 <- lm(Price ~ Age, data = mydataA)
summary(fit1)

## 
## Call:
## lm(formula = Price ~ Age, data = mydataA)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

r<-sqrt(0.05302)

Regression coefficient: If any of the apartments get older by 1 year, on average there will be a decrease in price by 8.975$, assuming all other variables are constant.
Coefficient in correlation: The positive sign of the correlation coefficient indicates a positive linear relationship between “Age” and “Price,” meaning that as “Age” increases, “Price” will decrease. The magnitude of 0.2300 suggests a weak linear relationship between those two.
Coefficient of determination: Only about 5.3% of the variability in “Price” is explained by the linear relationship with “Age.” That means, “Age” explains a relatively small proportion of the variation in apartment prices.

Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

#install.packages(car)
library(car)
scatterplotMatrix(mydataA[ , c(-4, -5, -6, -7)], 
                  smooth = FALSE)

cor_matrix <- cor(mydataA[, c("Price", "Age", "Distance")])

As we can from cor_matrix , no value is close to 1, which means there is no potential risk for multicolinearity.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance,
           data = mydataA)
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydataA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Chech the multicolinearity with VIF statistics. Explain the findings.

library("car")
vif(fit2)

##      Age Distance 
## 1.001845 1.001845

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

mydataA$StdResid <- round(rstandard(fit2), 3)
mydataA$CooksD <- round(cooks.distance(fit2), 3)

Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings.

mydataA$StdFitted <- scale(fit2$fitted.values)
library(car)
scatterplot(y = mydataA$StdResid, x = mydataA$StdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

Description: As we can see from the scatterplot above there is no sign of heterostecity.

Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings.

hist(mydataA$StdResid,
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

shapiro.test(mydataA$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydataA$StdResid
## W = 0.95303, p-value = 0.003645

Description: we can tell that the residuals are not distributed normally . That is proven also by the P-value which is less then 0.05, therefore we can reject H0.

Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients.

mydata2 <- mydataA [-38, ]
fit2 <- lm(Price ~ Age + Distance,
           data = mydata2)
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -604.92 -229.63  -56.49  192.97  599.35 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2456.076     73.931  33.221  < 2e-16 ***
## Age           -6.464      3.159  -2.046    0.044 *  
## Distance     -22.955      2.786  -8.240 2.52e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared:  0.4838, Adjusted R-squared:  0.4711 
## F-statistic: 37.96 on 2 and 81 DF,  p-value: 2.339e-12

Est of age: estimated price decreases by 6.464$, with each additional year of age. Std. Error: it is very low, which means the estimate is very accured. Res. std. error: on average, the actual price varies from predicted values by 276.1 P-value: we can reject all of the H0 since p values are all bellow 0.05 R squared: 48,38% of the variability in price is explained by independet variables.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + ParkingFactor + BalconyFactor, data=mydata2)
summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor, 
##     data = mydata2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -473.21 -192.37  -28.89  204.17  558.77 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2482.048     79.263  31.314  < 2e-16 ***
## Age               -5.821      3.074  -1.894  0.06190 .  
## Distance         -20.279      2.886  -7.026 6.66e-10 ***
## ParkingFactorNO -167.531     62.864  -2.665  0.00933 ** 
## BalconyFactorNO   15.207     59.201   0.257  0.79795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared:  0.5275, Adjusted R-squared:  0.5035 
## F-statistic: 22.04 on 4 and 79 DF,  p-value: 3.018e-12

With function anova check if model fit3 fits data better than model fit2.

anova(fit2,fit3)

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingFactor + BalconyFactor
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     81 6176767                              
## 2     79 5654480  2    522287 3.6485 0.03051 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Description: H0=no difference between fit2 and fit3 H1= there is a difference

p-value < 0.05, which means we can reject the H0.

anova suggests that fit3 is a better fit.

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary (fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor, 
##     data = mydata2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -473.21 -192.37  -28.89  204.17  558.77 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2482.048     79.263  31.314  < 2e-16 ***
## Age               -5.821      3.074  -1.894  0.06190 .  
## Distance         -20.279      2.886  -7.026 6.66e-10 ***
## ParkingFactorNO -167.531     62.864  -2.665  0.00933 ** 
## BalconyFactorNO   15.207     59.201   0.257  0.79795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared:  0.5275, Adjusted R-squared:  0.5035 
## F-statistic: 22.04 on 4 and 79 DF,  p-value: 3.018e-12

Describe: H0: There is no relationship between b1,b2,b3 and b4 H1: b1, b2, b3, b4, !=0 (at least on of them is not worth zero)

Since the p value is 3.018e-12, we can reject the null hypothesis.

Save fitted values and claculate the residual for apartment ID2.

mydata2$residuals <- residuals(fit3)
mydata2$fittedValues <- fitted.values(fit3)
head(mydata2)

## # A tibble: 6 × 12
##     Age Distance Price Parking Balcony ParkingFactor BalconyFactor StdResid
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>         <fct>            <dbl>
## 1     7       28  1640       0       1 NO            YES             -0.665
## 2    18        1  2800       1       0 YES           NO               1.78 
## 3     7       28  1660       0       0 NO            NO              -0.594
## 4    28       29  1850       0       1 NO            YES              0.754
## 5    18       18  1640       1       1 YES           YES             -1.07 
## 6    28       12  1770       0       1 NO            YES             -0.778
## # ℹ 4 more variables: CooksD <dbl>, StdFitted <dbl[,1]>, residuals <dbl>,
## #   fittedValues <dbl>

mydata2$residuals <- residuals(fit3)
mydata2$fittedValues <- fitted.values(fit3)
head(mydata2[2, c(11,12)])

## # A tibble: 1 × 2
##   residuals fittedValues
##       <dbl>        <dbl>
## 1      428.        2372.