library(readxl)
mydata2 <- read_xlsx("~/Desktop/r task 1.xlsx")
print (mydata2)
## # A tibble: 111 × 5
## gender `test preparation course` `math score` `reading score` `writing score`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 female 1 72 72 74
## 2 female 3 69 90 88
## 3 female 16 90 95 93
## 4 male 0 47 57 44
## 5 male 3 76 78 75
## 6 female 6 71 83 78
## 7 female 10 88 95 92
## 8 male 7 40 43 39
## 9 male 12 64 64 67
## 10 female 0 38 60 50
## # … with 101 more rows
The data set I will be referring to in the Task 1 was taken from the github webpage and it consists of the sample data regarding Student performance of an unknown highschool in 2020. The sample consists of 111 observations and 5 variables.
Gender is a categorical variable that indicates weather the student is a female or a male.
Test preparation course is a numerical variable that tells us how many days has the student been preparing for all of the following tests.
Math score, reading score and writing score are all numerical variables that tell us how many points (between 0 and 100) has each student achieved on the specific exam.
mydata2$ID <- (1:111)
mydata2[36,3] <-71
print(mydata2)
## # A tibble: 111 × 6
## gender `test preparation course` `math score` `reading score` writing…¹ ID
## <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 female 1 72 72 74 1
## 2 female 3 69 90 88 2
## 3 female 16 90 95 93 3
## 4 male 0 47 57 44 4
## 5 male 3 76 78 75 5
## 6 female 6 71 83 78 6
## 7 female 10 88 95 92 7
## 8 male 7 40 43 39 8
## 9 male 12 64 64 67 9
## 10 female 0 38 60 50 10
## # … with 101 more rows, and abbreviated variable name ¹`writing score`
mydata2$`corrected math score`<-mydata2$`math score` + 2
mydata3 <- mydata2[ , -3]
colnames(mydata3)[2] ="total studying days"
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata3 = mydata3 %>% select(ID, gender, `total studying days`,`reading score`,`writing score`,`corrected math score`)
head(mydata3)
## # A tibble: 6 × 6
## ID gender `total studying days` `reading score` `writing score` corrected…¹
## <int> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 female 1 72 74 74
## 2 2 female 3 90 88 71
## 3 3 female 16 95 93 92
## 4 4 male 0 57 44 49
## 5 5 male 3 78 75 78
## 6 6 female 6 83 78 73
## # … with abbreviated variable name ¹`corrected math score`
tail(mydata3)
## # A tibble: 6 × 6
## ID gender `total studying days` `reading score` `writing score` corrected…¹
## <int> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 106 female 1 67 72 60
## 2 107 female 25 100 100 89
## 3 108 male 0 63 64 68
## 4 109 female 1 76 70 54
## 5 110 female 0 64 72 72
## 6 111 female 10 89 98 79
## # … with abbreviated variable name ¹`corrected math score`
mydata3$gender <-factor(mydata3$gender)
mydata5 <- mydata3[order(-mydata3$`total studying days`), ]
head(mydata5)
## # A tibble: 6 × 6
## ID gender `total studying days` `reading score` `writing score` corrected…¹
## <int> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 107 female 25 100 100 89
## 2 105 male 21 86 90 100
## 3 35 male 20 87 82 99
## 4 30 female 19 70 75 64
## 5 50 male 18 84 82 84
## 6 16 female 17 75 78 71
## # … with abbreviated variable name ¹`corrected math score`
mydata6 <-mydata3 [mydata3$`reading score` >= 60 & mydata3$`reading score`<= 100, ]
head(mydata6)
## # A tibble: 6 × 6
## ID gender `total studying days` `reading score` `writing score` corrected…¹
## <int> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 1 female 1 72 74 74
## 2 2 female 3 90 88 71
## 3 3 female 16 95 93 92
## 4 5 male 3 78 75 78
## 5 6 female 6 83 78 73
## 6 7 female 10 95 92 90
## # … with abbreviated variable name ¹`corrected math score`
str(mydata3)
## tibble [111 × 6] (S3: tbl_df/tbl/data.frame)
## $ ID : int [1:111] 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : Factor w/ 2 levels "female","male": 1 1 1 2 2 1 1 2 2 1 ...
## $ total studying days : num [1:111] 1 3 16 0 3 6 10 7 12 0 ...
## $ reading score : num [1:111] 72 90 95 57 78 83 95 43 64 60 ...
## $ writing score : num [1:111] 74 88 93 44 75 78 92 39 67 50 ...
## $ corrected math score: num [1:111] 74 71 92 49 78 73 90 42 66 40 ...
There are now 111 observations and 6 variables in my dataframe, from which 4 are numerical and one is categorical, converted into factors. I also have an ID variable which I inserted for easier reading of a specific observation.
summary(mydata3)
## ID gender total studying days reading score
## Min. : 1.0 female:56 Min. : 0.00 Min. : 17.00
## 1st Qu.: 28.5 male :55 1st Qu.: 2.00 1st Qu.: 55.00
## Median : 56.0 Median : 5.00 Median : 67.00
## Mean : 56.0 Mean : 6.45 Mean : 65.56
## 3rd Qu.: 83.5 3rd Qu.: 9.00 3rd Qu.: 75.50
## Max. :111.0 Max. :25.00 Max. :100.00
## writing score corrected math score
## Min. : 10.00 Min. : 2.00
## 1st Qu.: 54.00 1st Qu.: 54.50
## Median : 67.00 Median : 65.00
## Mean : 64.59 Mean : 63.74
## 3rd Qu.: 75.00 3rd Qu.: 73.50
## Max. :100.00 Max. :100.00
From the summary above, I can conclude that there is 56 female and 55 men in my sample. Since gender is a categorical variable, it will not be needed in the following analysis.
mydata4 <- mydata3[, -2]
library(psych)
describe(mydata4)
## vars n mean sd median trimmed mad min max range
## ID 1 111 56.00 32.19 56 56.00 41.51 1 111 110
## total studying days 2 111 6.45 5.44 5 5.81 5.93 0 25 25
## reading score 3 111 65.56 16.07 67 66.17 16.31 17 100 83
## writing score 4 111 64.59 17.00 67 65.25 16.31 10 100 90
## corrected math score 5 111 63.74 16.07 65 64.29 14.83 2 100 98
## skew kurtosis se
## ID 0.00 -1.23 3.06
## total studying days 0.97 0.51 0.52
## reading score -0.39 -0.13 1.53
## writing score -0.44 0.06 1.61
## corrected math score -0.59 1.22 1.53
library(pastecs)
##
## Attaching package: 'pastecs'
## The following objects are masked from 'package:dplyr':
##
## first, last
round(stat.desc(mydata4)
, 2)
## ID total studying days reading score writing score
## nbr.val 111.00 111.00 111.00 111.00
## nbr.null 0.00 10.00 0.00 0.00
## nbr.na 0.00 0.00 0.00 0.00
## min 1.00 0.00 17.00 10.00
## max 111.00 25.00 100.00 100.00
## range 110.00 25.00 83.00 90.00
## sum 6216.00 716.00 7277.00 7169.00
## median 56.00 5.00 67.00 67.00
## mean 56.00 6.45 65.56 64.59
## SE.mean 3.06 0.52 1.53 1.61
## CI.mean.0.95 6.05 1.02 3.02 3.20
## var 1036.00 29.61 258.34 288.92
## std.dev 32.19 5.44 16.07 17.00
## coef.var 0.57 0.84 0.25 0.26
## corrected math score
## nbr.val 111.00
## nbr.null 0.00
## nbr.na 0.00
## min 2.00
## max 100.00
## range 98.00
## sum 7075.00
## median 65.00
## mean 63.74
## SE.mean 1.53
## CI.mean.0.95 3.02
## var 258.30
## std.dev 16.07
## coef.var 0.25
boxplot(mydata4[ , c(-1, -2)])
From the boxplots above, I can conclude that the medians of the reading score variable as well as the writing score variable are the same (67), whereas the mean of the corrected math score is lower than the mean reading and writing scores. The highest overall obtained score was 100, the lowest obtained score overall lays in the outlier of the math test, with the value of approximately 2.
READING SCORE VARIABLE: The minimum value excluding outliers is approximately 25, the maximum value excluding outliers is 100.
WRITING SCORE: The minimum value excluding outliers is approximately 28, the maximum value excluding outliers is 100.
CORRECTED MATH SCORE: The minimum value excluding outliers is approximately 30, the maximum value excluding outliers is 100.
Overall, I can conclude that the dispursion of scores was not that high, meaning that all together, the test scores were similar. The students showed approximately the same amount of knowledge for all three tests and more than 50% of the students have passed ( score more than 60) each subject.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
scatterplot(y=mydata4$`total studying days`,
x=mydata4$`corrected math score`,
ylab = "Total studying days",
xlab= "Math score",
smooth=FALSE,
regLine= TRUE)
With this scatter plot I wanted to show the relationship between two of the numeric variables, total studying days and corrected math score. Based on this graph, I can conclude that if the math score increases, studying days increased too. There is a positive correlation between these two variables, with some outliers present. Based on that, I can say that more days the students were studying, higher was their math score on the exam, excluding some exceptions.
scatterplotMatrix(mydata4[ , -1],
smooth = FALSE)
Using this graphical representation I can see the relationship between the four numerical variables in my data set. The pictures with a curve depict the distribution of variables (ex. total studying days is positively skewed- to the right). In addition, I can see how positively correlated are most score variables with the total studying days (total studying days increased, the score increased as well).
#mydata <- Body_mass
mydata <-read.table("~/Desktop/Task 2/Body mass.csv", header= TRUE,
sep= ";", dec= ",")
head(mydata)
## ID Mass
## 1 1 62.1
## 2 2 64.5
## 3 3 56.5
## 4 4 53.4
## 5 5 61.3
## 6 6 62.2
str(mydata[-1])
## 'data.frame': 50 obs. of 1 variable:
## $ Mass: num 62.1 64.5 56.5 53.4 61.3 62.2 62.7 64.5 59.5 68.9 ...
There is 50 units of observation in the data set. Mass is a numerical variable.
summary(mydata[-1])
## Mass
## Min. :49.70
## 1st Qu.:60.23
## Median :62.80
## Mean :62.88
## 3rd Qu.:64.50
## Max. :83.20
The mass data, summarized.
#install.packages("psych")
library(psych)
describe(mydata[-1])
## vars n mean sd median trimmed mad min max range skew kurtosis se
## Mass 1 50 62.88 6.01 62.8 62.56 3.34 49.7 83.2 33.5 0.85 2.11 0.85
Descriptive statistics of mass. An example of the interpretation: MEAN: The mean of the mass variable is 62.88, meaning that on average, the mass of all the units of observation from the sample is 62.88.
hist(mydata$Mass, main= "Histogram of body mass distribution",
xlab= "Mass",
ylab="Frequency",
breaks= seq(from= 40, to= 90, by= 5))
The distribution of the mass with the histogram.
t.test(mydata$Mass, mu= 59.5 , alternative = "two.sided" )
##
## One Sample t-test
##
## data: mydata$Mass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
## 61.16758 64.58442
## sample estimates:
## mean of x
## 62.876
As the p value (which is equal to 0.000234) is lower than alpha of 5%, we reject the null hypothesis at the p= 0.000234. The mean (=62,88) falls out of the interval for confirmation of H0.
r <- sqrt (3.9711^2/(3.9711^2 + 49))
print(r)
## [1] 0.4934295
The effect size is 0.4934295, which means that there is a significant change in the weight of the students in 2021/22 compared to the year 2018/19. H0 can therefore be rejected.
#install.packages("readxl")
library(readxl)
mydata1 <- read_xlsx("~/Desktop/Task 3/Apartments.xlsx")
head(mydata1)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
There are 85 observations and 5 variables in this data set. - Age refers to the age of an apartment, measured in years. It is a numerical variable. - Distance means the distance from the center.It is a numerical variable. - Price is measured per m2. It is a numerical variable. - Parking is a categorical variable, later factored, where 0-No, 1-Yes. - Balcony is a categorical variable, later factored, where 0-No, 1-Yes.
mydata1$ParkingFactor <- factor(mydata1$Parking,
levels = c(1,0),
labels = c("yes", "no"))
mydata1$BalconyFactor <- factor(mydata1$Balcony,
levels = c(1,0),
labels = c("yes", "no"))
str(mydata1)
## tibble [85 × 7] (S3: tbl_df/tbl/data.frame)
## $ Age : num [1:85] 7 18 7 28 18 28 14 18 22 25 ...
## $ Distance : num [1:85] 28 1 28 29 18 12 20 6 7 2 ...
## $ Price : num [1:85] 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
## $ Parking : num [1:85] 0 1 0 0 1 0 0 1 1 1 ...
## $ Balcony : num [1:85] 1 0 0 1 1 1 1 1 0 0 ...
## $ ParkingFactor: Factor w/ 2 levels "yes","no": 2 1 2 2 1 2 2 1 1 1 ...
## $ BalconyFactor: Factor w/ 2 levels "yes","no": 1 2 2 1 1 1 1 1 2 2 ...
t.test(mydata1$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata1$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
Based on the sample data we can conclude that we can reject H0 with 5% level of significance, with p-value= 0.004731. (p is smaller than alpha) H0 = 1900.
fit1 <- lm(Price ~ Age,
data = mydata1)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydata1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
sqrt(0.05302)
## [1] 0.2302607
Estimate of regression coefficient (8.975) means that if the apartment age increases for 1 year, then the price per m2 decreases by 8.975 EUR on average, ceteris paribus.
Coefficient of correlation is 0.23, meaning that there is not a strong linear relationship between apartment age and price per m2.
Coefficient of determination (0.05302) means that 5.302 of variability of price per m2 is explained by the linear effect of the apartment age variable.
scatterplotMatrix(mydata1[ ,c(1,2,3)],
smooth = TRUE)
Based on the Scatter plot Matrix, there is no potential problem with multicolinearity detected.
fit2 <- lm(Price ~ Age + Distance,
data = mydata1)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
VIF average should be as close to one as possible (which it is in our case= 1.001845), all VIFs should be less than 5. Our data applies to both of these conditions so we should leave it in the dataset. These variables are therefore not strongly, but just moderately correlated.
mydata1$stdresid <- round(rstandard(fit2), 2)
mydata1$cooksD <- round(cooks.distance(fit2), 2)
hist(mydata1$stdresid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
hist(mydata1$cooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distance")
head(mydata1[order(-mydata1$stdresid), ])
## # A tibble: 6 × 9
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor stdresid cooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 5 45 2180 1 1 yes yes 2.58 0.32
## 2 2 11 2790 1 0 yes no 2.05 0.07
## 3 18 1 2800 1 0 yes no 1.78 0.03
## 4 18 1 2800 1 1 yes yes 1.78 0.03
## 5 8 2 2820 1 0 yes no 1.66 0.04
## 6 10 1 2810 0 0 no no 1.6 0.03
head(mydata1 [order(-mydata1$cooksD) , ])
## # A tibble: 6 × 9
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor stdresid cooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 5 45 2180 1 1 yes yes 2.58 0.32
## 2 43 37 1740 0 0 no no 1.44 0.1
## 3 2 11 2790 1 0 yes no 2.05 0.07
## 4 7 2 1760 0 1 no yes -2.15 0.07
## 5 37 3 2540 1 1 yes yes 1.58 0.06
## 6 40 2 2400 0 1 no yes 1.09 0.04
Since the cooks distance is significantly higher at the first above observation, I will get rid of this outlier.
mydata7 <- mydata1[-38, ]
head(mydata7 [order(-mydata7$cooksD) , ])
## # A tibble: 6 × 9
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor stdresid cooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 43 37 1740 0 0 no no 1.44 0.1
## 2 2 11 2790 1 0 yes no 2.05 0.07
## 3 7 2 1760 0 1 no yes -2.15 0.07
## 4 37 3 2540 1 1 yes yes 1.58 0.06
## 5 40 2 2400 0 1 no yes 1.09 0.04
## 6 8 2 2820 1 0 yes no 1.66 0.04
hist(mydata7$cooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distance")
I removed a variable where the cooks distance was 0.32, since it was very different from the rest of them and was an outlier to this case.
mydata1$stdfittedvalues <- scale(fit2$fitted.values)
scatterplot(y=mydata1$stdresid, x = mydata1$stdfittedvalues,
ylab = "Standardized Residuals",
xlab = "Standardized Fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
The heteroskedasticity is not detected, therefore there is no valuation of the homoskedasticity.
hist(mydata1$stdresid,
xlab = "Standardized Residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(mydata1$stdresid)
##
## Shapiro-Wilk normality test
##
## data: mydata1$stdresid
## W = 0.95307, p-value = 0.003667
The distribution seen from the graph has two peaks. Therefore it is not a normal distribution (but bimodal).
fit2 <- lm(Price ~ Age + Distance,
data = mydata1)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
sqrt(summary(fit2)$r.squared)
## [1] 0.6629887
If the age of an apartment increased by 1 year, the price per m2 decreased by 7.934 EUR on average, ceteris paribus. When the distance between the apartment and the center increases by 1 kilometer,price per m2 decreases by 20 667 EUR on average, ceteris paribus.
fit3 <- lm(Price ~ Age + Distance + ParkingFactor + BalconyFactor,
data = mydata1)
anova(fit2,fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingFactor + BalconyFactor
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 82 6720983
## 2 80 5991088 2 729894 4.8732 0.01007 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Because p is less than 5%, it means that fit3 explains more and is therefore better.
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor,
## data = mydata1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2499.770 80.645 30.997 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## ParkingFactorno -196.168 62.868 -3.120 0.00251 **
## BalconyFactorno -1.935 60.014 -0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
mydata1$fittedvalues <- fitted.values(fit3)
mydata1$residuals <- residuals(fit3)
mydata1[2,9]
## # A tibble: 1 × 1
## cooksD
## <dbl>
## 1 0.03