task1data <- read.csv("~/R Bootcamp/R Take Home Exam/oasis_cross-sectional.csv")
head(task1data)
## ID M.F Hand Age Educ SES MMSE CDR eTIV nWBV ASF Delay
## 1 OAS1_0001_MR1 F R 74 2 3 29 0.0 1344 0.743 1.306 N/A
## 2 OAS1_0002_MR1 F R 55 4 1 29 0.0 1147 0.810 1.531 N/A
## 3 OAS1_0003_MR1 F R 73 4 3 27 0.5 1454 0.708 1.207 N/A
## 4 OAS1_0004_MR1 M R 28 NA NA NA NA 1588 0.803 1.105 N/A
## 5 OAS1_0005_MR1 M R 18 NA NA NA NA 1737 0.848 1.010 N/A
## 6 OAS1_0006_MR1 F R 24 NA NA NA NA 1131 0.862 1.551 N/A
Variable description:
For this assignment, the target population is people aged 40 or older. The data come from the OASIS cross-sectional brain-imaging study.
task1data <- task1data[task1data$Age >= 40, ]
Next, the number of missing values per column/variable is determined, and the missing values are imputed with the median. The median was chosen based on how missing values have been handled in similar brain-function datasets.
colSums(is.na(task1data))
## ID M.F Hand Age Educ SES MMSE CDR eTIV nWBV ASF Delay
## 0 0 0 0 29 48 29 29 0 0 0 0
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
task1data$Educ <- impute(task1data$Educ, median)
task1data$SES <- impute(task1data$SES, median)
task1data$MMSE <- impute(task1data$MMSE, median)
task1data$CDR <- impute(task1data$CDR, median)
head(task1data)
## ID M.F Hand Age Educ SES MMSE CDR eTIV nWBV ASF Delay
## 1 OAS1_0001_MR1 F R 74 2 3 29 0.0 1344 0.743 1.306 N/A
## 2 OAS1_0002_MR1 F R 55 4 1 29 0.0 1147 0.810 1.531 N/A
## 3 OAS1_0003_MR1 F R 73 4 3 27 0.5 1454 0.708 1.207 N/A
## 9 OAS1_0010_MR1 M R 74 5 2 30 0.0 1636 0.689 1.073 N/A
## 10 OAS1_0011_MR1 F R 52 3 2 30 0.0 1321 0.827 1.329 N/A
## 12 OAS1_0013_MR1 F R 81 5 2 30 0.0 1664 0.679 1.055 N/A
Count rows and columns:
dim(task1data)
## [1] 262 12
Due to the large data set, the str() function was used to display the internal structure:
str(task1data)
## 'data.frame': 262 obs. of 12 variables:
## $ ID : chr "OAS1_0001_MR1" "OAS1_0002_MR1" "OAS1_0003_MR1" "OAS1_0010_MR1" ...
## $ M.F : chr "F" "F" "F" "M" ...
## $ Hand : chr "R" "R" "R" "R" ...
## $ Age : int 74 55 73 74 52 81 76 82 89 48 ...
## $ Educ : 'impute' int 2 4 4 5 3 5 2 2 5 5 ...
## ..- attr(*, "imputed")= int [1:29] 15 28 29 36 54 83 84 89 90 95 ...
## $ SES : 'impute' num 3 1 3 2 2 2 2 4 1 2 ...
## ..- attr(*, "imputed")= int [1:48] 7 15 24 27 28 29 34 36 46 51 ...
## $ MMSE : 'impute' int 29 29 27 30 30 30 28 27 30 29 ...
## ..- attr(*, "imputed")= int [1:29] 15 28 29 36 54 83 84 89 90 95 ...
## $ CDR : 'impute' num 0 0 0.5 0 0 0 0.5 0.5 0 0 ...
## ..- attr(*, "imputed")= int [1:29] 15 28 29 36 54 83 84 89 90 95 ...
## $ eTIV : int 1344 1147 1454 1636 1321 1664 1738 1477 1536 1326 ...
## $ nWBV : num 0.743 0.81 0.708 0.689 0.827 0.679 0.719 0.739 0.715 0.785 ...
## $ ASF : num 1.31 1.53 1.21 1.07 1.33 ...
## $ Delay: chr "N/A" "N/A" "N/A" "N/A" ...
Two columns, Hand and Delay, were deleted since they are irrelevant for this analysis:
task1data <- subset(task1data, select = -c(Hand, Delay))
Rename the variable M.F to Gender:
colnames(task1data)[colnames(task1data) == "M.F"] <- "Gender"
head(task1data)
## ID Gender Age Educ SES MMSE CDR eTIV nWBV ASF
## 1 OAS1_0001_MR1 F 74 2 3 29 0.0 1344 0.743 1.306
## 2 OAS1_0002_MR1 F 55 4 1 29 0.0 1147 0.810 1.531
## 3 OAS1_0003_MR1 F 73 4 3 27 0.5 1454 0.708 1.207
## 9 OAS1_0010_MR1 M 74 5 2 30 0.0 1636 0.689 1.073
## 10 OAS1_0011_MR1 F 52 3 2 30 0.0 1321 0.827 1.329
## 12 OAS1_0013_MR1 F 81 5 2 30 0.0 1664 0.679 1.055
Statistical description of numerical variables:
library(pastecs)
round(stat.desc(task1data[c(-1,-2)]), 2)
## Age Educ SES MMSE CDR eTIV nWBV ASF
## nbr.val 262.00 262.00 262.00 262.00 262.00 262.00 262.00 262.00
## nbr.null 0.00 0.00 0.00 0.00 162.00 0.00 0.00 0.00
## nbr.na 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## min 40.00 1.00 1.00 14.00 0.00 1123.00 0.64 0.88
## max 96.00 5.00 5.00 30.00 2.00 1992.00 0.86 1.56
## range 56.00 4.00 4.00 16.00 2.00 869.00 0.22 0.68
## sum 18323.00 827.00 629.00 7143.00 67.00 382609.00 198.17 318.59
## median 72.00 3.00 2.00 29.00 0.00 1446.50 0.76 1.21
## mean 69.94 3.16 2.40 27.26 0.26 1460.34 0.76 1.22
## SE.mean 0.84 0.08 0.06 0.22 0.02 9.97 0.00 0.01
## CI.mean.0.95 1.65 0.15 0.12 0.43 0.05 19.62 0.01 0.02
## var 183.62 1.54 1.05 12.59 0.14 26018.37 0.00 0.02
## std.dev 13.55 1.24 1.03 3.55 0.37 161.30 0.05 0.13
## coef.var 0.19 0.39 0.43 0.13 1.46 0.11 0.07 0.11
Interpretation:
The average age was 69.94 years, with a standard deviation of 13.55 years, so ages are quite widely dispersed around the mean. For MMSE, the wide range (14 to 30) indicates that the study includes people spanning moderate dementia, mild dementia, and no dementia.
The highest coefficient of variation belongs to CDR (1.46), meaning that it shows the greatest relative dispersion around its mean compared with the other research variables.
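The coefficient of variation is simply the standard deviation divided by the mean, so this figure can be cross-checked directly. A minimal sketch (as.numeric() is used here only to drop the Hmisc 'impute' class):
sd(as.numeric(task1data$CDR)) / mean(as.numeric(task1data$CDR))
# should reproduce the coef.var value of about 1.46 reported by stat.desc()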
An additional variable was added to indicate whether a person is classified as demented or not, based on the CDR score.
task1data$Group <- ifelse(task1data$CDR > 0.4,"Demented", "Nondemented")
hist(task1data[c(-1, -2, -11)])  # Hmisc's hist.data.frame() draws a histogram for each remaining numeric column
Age distribution by CDR score and gender:
library(ggplot2)
ggplot(task1data, aes(as.factor(CDR), Age, fill = Gender))+
geom_boxplot()+
ggtitle("Degree of CDR by Age")+
xlab("CDR")+
#geom_text(stat = "count", aes(label = ..count..), y = 60, col = "red")+
theme(plot.title = element_text(hjust = .5))
The tested females who do not exhibit dementia are older than the men in the same category. People who score 0 on the CDR show a larger dispersion in age than those who scored 0.5 or higher. For older subjects who exhibit dementia, the data are more consistent, as the interquartile range is smaller.
task1data$GroupFactor <- factor(task1data$Group,
levels = c("Demented" , "Nondemented"),
labels = c(1, 0))
library(car)
## Loading required package: carData
scatterplot(y = task1data$CDR, x= task1data$nWBV,
main = "The relationship between the group and nWBV",
ylab = "CDR",
xlab = "Normalised whole brain volume",
smooth = FALSE)
The scatter plot shows the relationship between CDR, which is used to identify whether a test subject is demented, and the normalised whole brain volume. The relationship between the variables is negative: as the normalised whole brain volume decreases, the CDR increases, and vice versa.
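The direction of this relationship can also be checked numerically with the Pearson correlation; a minimal sketch (as.numeric() drops the Hmisc 'impute' class from CDR):
cor(as.numeric(task1data$CDR), task1data$nWBV)
# a negative coefficient confirms the downward trend visible in the scatter plot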
task2data <- read.table("~/R Bootcamp/R Take Home Exam/Task 2/Body mass.csv", header=TRUE, sep=";", dec=",")
head(task2data)
## ID Mass
## 1 1 62.1
## 2 2 64.5
## 3 3 56.5
## 4 4 53.4
## 5 5 61.3
## 6 6 62.2
Description (data from the beginning of the 2021/2022 school year).
The mean at the beginning of the 2018/2019 school year was 59.5 kg.
summary(task2data[-1])
## Mass
## Min. :49.70
## 1st Qu.:60.23
## Median :62.80
## Mean :62.88
## 3rd Qu.:64.50
## Max. :83.20
The output above shows the minimum, the first quartile (Q1, below which the lowest 25% of observations fall), the median (which separates the lower 50% from the upper 50%), the third quartile (Q3, the 75% mark), and the maximum of the ninth graders' body mass.
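The quartiles can also be obtained directly with quantile(); a minimal sketch:
quantile(task2data$Mass, probs = c(0.25, 0.5, 0.75))
# Q1, median and Q3 of the ninth graders' body mass, as reported by summary()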
round(stat.desc(task2data[-1]), 2)
## Mass
## nbr.val 50.00
## nbr.null 0.00
## nbr.na 0.00
## min 49.70
## max 83.20
## range 33.50
## sum 3143.80
## median 62.80
## mean 62.88
## SE.mean 0.85
## CI.mean.0.95 1.71
## var 36.14
## std.dev 6.01
## coef.var 0.10
x <- task2data$Mass
hist(task2data$Mass,
prob = TRUE,
main = "Distribution of body mass of ninth graders 2021/2022",
ylab = "Frequency",
xlab = "Body mass",
breaks = seq(40, 90, 2),
col="turquoise")
H0: mu = 59.5 H1: mu ≠ 59.5
t.test(task2data$Mass,
mu = 59.5,
alternative = "two.sided")
##
## One Sample t-test
##
## data: task2data$Mass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
## 61.16758 64.58442
## sample estimates:
## mean of x
## 62.876
Results interpretation:
One can be 95% confident that the population mean is between 61.17 and 64.58 kg. The null hypothesis is rejected since the p-value < 0.001: it is extremely unlikely that the average weight of ninth graders at the beginning of school year 2021/2022 is 59.5 kg. The mean has increased since the beginning of school year 2018/2019.
r = sqrt(t^2 / (t^2 + df)) = sqrt(3.9711^2 / (3.9711^2 + 49)) ≈ 0.49
Furthermore, the effect size of 0.49 corresponds to a medium effect.
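The same effect size can be computed from the stored test object instead of by hand; a minimal sketch (tt is a name introduced here for illustration):
tt <- t.test(task2data$Mass, mu = 59.5, alternative = "two.sided")
sqrt(tt$statistic^2 / (tt$statistic^2 + tt$parameter))  # r = sqrt(t^2 / (t^2 + df)), approximately 0.49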
library(readxl)
task3data <- read_xlsx("~/R Bootcamp/R Take Home Exam/Task 3/Apartments.xlsx")
head(task3data)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description:
task3data$ParkingFactor <- factor(task3data$Parking,
levels = c(0 , 1),
labels = c("No", "Yes"))
task3data$BalconyFactor <- factor(task3data$Balcony,
levels = c(0 , 1),
labels = c("No", "Yes"))
t.test(task3data$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: task3data$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
We can conclude with 95% confidence that the mean price is between 1937.44 and 2100.44. The null hypothesis is rejected since the p-value < 0.01: it is very unlikely that the average price of an apartment is 1900, as stated in H0. The mean is higher.
fit1 <- lm(Price ~ Age,
data = task3data)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = task3data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
As seen from the summary, the estimated regression function is y = 2185.455 − 8.975x. The slope of −8.975 means that the relationship is negative: if the age of the apartment increases by one year, the average price per m2 decreases by 8.975 units. The coefficient of determination (R-squared) of 0.05 means that this model explains about 5% of the variation in the data. The coefficient of correlation is r = −0.230 (the negative square root of R-squared, since the slope is negative), demonstrating that the relationship between the variables is weak.
sqrt(0.05302)
## [1] 0.2302607
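To illustrate the estimated equation, the model can be used to predict the price for an apartment of a given age; a minimal sketch with a hypothetical 10-year-old apartment:
predict(fit1, newdata = data.frame(Age = 10))
# roughly 2185.455 - 8.975 * 10 = 2095.7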
library(car)
scatterplotMatrix(task3data[c( 1, 2, 3)],
smooth= FALSE)
Without further calculation it is difficult to determine potential multicollinearity; however, based on the scatter plot matrix, there appears to be little risk of it: none of the pairwise relationships show abnormally high correlation and the points are fairly evenly spread out.
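This impression can be checked numerically with the pairwise correlation between the two explanatory variables; a minimal sketch:
cor(task3data$Age, task3data$Distance)
# a correlation close to 0 supports the assumption of no multicollinearity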
fit2 <- lm(Price ~ Age + Distance,
data = task3data)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = task3data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
library(car)
vif(fit2)
## Age Distance
## 1.001845 1.001845
The findings would be concerning if the VIF were higher than 5. Since it is approximately 1, the explanatory variables are not correlated with each other.
head(round(rstandard(fit2), 2))
## 1 2 3 4 5 6
## -0.67 1.78 -0.59 0.75 -1.07 -0.78
Judging by the standardized residuals above, none of the values appear extreme, so there are no obvious outliers.
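A more systematic check, sketched here under the common rule of thumb that |standardized residual| > 3 flags a potential outlier:
which(abs(rstandard(fit2)) > 3)
# an empty result means no observation is flagged as an outlier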
fit2$CooksD <- round(cooks.distance(fit2), 3)  # Cook's distance (influence measure) for each observation
scatterplot(y = fit2$residuals, x = fit2$fitted.values,
ylab = "Residuals",
xlab = "Fitted values",
boxplots = FALSE,
regLine = TRUE,
smooth = FALSE)
The plot shows that the spread of the residuals is roughly constant across the fitted values, which suggests that the assumption of constant error variance (homoscedasticity) holds.
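This visual impression can be supported by a formal test of non-constant error variance; a minimal sketch using ncvTest() from the already loaded car package (H0: constant variance):
ncvTest(fit2)
# a p-value above 0.05 would be consistent with homoscedasticity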
hist(fit2$residuals,
prob = TRUE,
main = "Distribution of residuals",
ylab = "Density",
xlab = "Residuals",
col = "green")
library(stats)
shapiro.test(fit2$residuals)
##
## Shapiro-Wilk normality test
##
## data: fit2$residuals
## W = 0.95366, p-value = 0.00398
H0: the residuals are normally distributed. H1: the residuals are not normally distributed.
Since the p-value (0.004) is less than 0.05, H0 is rejected at the 5% significance level, meaning that the residuals are not normally distributed.
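A Q-Q plot offers a complementary visual check of the normality assumption; a minimal sketch using qqPlot() from the car package:
qqPlot(fit2)
# points deviating from the reference line indicate departures from normality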
fit2 <- lm(Price ~ Age + Distance,
data = task3data)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = task3data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
fit3 <- lm(Price ~ Age + Distance +ParkingFactor +BalconyFactor,
data = task3data)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor,
## data = task3data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## ParkingFactorYes 196.168 62.868 3.120 0.00251 **
## BalconyFactorYes 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingFactor + BalconyFactor
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 82 6720983
## 2 80 5991088 2 729894 4.8732 0.01007 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Setting the hypotheses: H0: fit2 (model 1) is adequate. H1: fit3 (model 2) is more appropriate.
The results show a p-value of 0.01, so H0 is rejected at the 5% level and fit3 is accepted as the more appropriate model.
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor,
## data = task3data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## ParkingFactorYes 196.168 62.868 3.120 0.00251 **
## BalconyFactorYes 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
Explanation for the categorical variables:
Parking: Holding age, distance and balcony constant, an apartment with parking is on average about 196 units (EUR) more expensive. The p-value of 0.00251 indicates that this effect is statistically significant.
Balcony: The p-value of 0.974 is greater than 0.05, so the effect is not significant: having a balcony does not significantly change the price of the apartment.
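One way to illustrate the parking effect is to predict the price of two otherwise identical apartments that differ only in parking; a minimal sketch in which newapts and its age/distance values are made up for illustration:
newapts <- data.frame(Age = c(10, 10),
Distance = c(5, 5),
ParkingFactor = factor(c("No", "Yes"), levels = c("No", "Yes")),
BalconyFactor = factor(c("No", "No"), levels = c("No", "Yes")))
predict(fit3, newdata = newapts)
# the two predicted prices differ by about 196 EUR, the ParkingFactorYes coefficient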
The hypotheses tested by the F-statistic: H0: ρ² = 0; H1: ρ² > 0.
H0 is rejected (p < 0.001), meaning that a linear relationship between price and at least one explanatory variable was found.
task3data$Fittedvalues <- fitted.values(fit3)
head(task3data)
## # A tibble: 6 × 8
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor Fittedvalues
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl>
## 1 7 28 1640 0 1 No Yes 1751.
## 2 18 1 2800 1 0 Yes No 2357.
## 3 7 28 1660 0 0 No No 1749.
## 4 28 29 1850 0 1 No Yes 1590.
## 5 18 18 1640 1 1 Yes Yes 2053.
## 6 28 12 1770 0 1 No Yes 1897.
Calculating residuals for apartment ID2:
Residual = actual value (Price) − predicted value (fitted value)
2800 − 2357.411 = 442.6 EUR.
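The same residual can also be extracted directly from the fitted model; a minimal sketch (apartment ID 2 is the second observation):
round(residuals(fit3)[2], 1)
# 442.6, i.e. the actual price is about 442.6 EUR above the model's prediction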