Task 1

Dataset introduction and preparation

task1data <- read.csv("~/R Bootcamp/R Take Home Exam/oasis_cross-sectional.csv")

head(task1data)
##              ID M.F Hand Age Educ SES MMSE CDR eTIV  nWBV   ASF Delay
## 1 OAS1_0001_MR1   F    R  74    2   3   29 0.0 1344 0.743 1.306   N/A
## 2 OAS1_0002_MR1   F    R  55    4   1   29 0.0 1147 0.810 1.531   N/A
## 3 OAS1_0003_MR1   F    R  73    4   3   27 0.5 1454 0.708 1.207   N/A
## 4 OAS1_0004_MR1   M    R  28   NA  NA   NA  NA 1588 0.803 1.105   N/A
## 5 OAS1_0005_MR1   M    R  18   NA  NA   NA  NA 1737 0.848 1.010   N/A
## 6 OAS1_0006_MR1   F    R  24   NA  NA   NA  NA 1131 0.862 1.551   N/A

Variable description:

  • ID: Identification of a subject
  • M.F: Gender (Male/Female)
  • Hand: Dominant hand (left/right)
  • Age: Age in years
  • Educ: Education level (1: less than high school grad., 2: high school grad., 3: some college, 4: college grad., 5: beyond college)
  • SES: Socioeconomic status (1-5)
  • MMSE: Mini-Mental State Examination score
  • CDR: Clinical Dementia Rating (0 = normal, 0.5 = very mild dementia, 1 = mild dementia, 2 = moderate dementia, 3 = severe dementia)
  • eTIV: Estimated Total Intracranial Volume
  • nWBV: Normalized Whole Brain Volume
  • ASF: Atlas Scaling Factor
  • Delay: the number of days between two medical visits

For this assignment, the target population is people aged 40 and above. The data come from the OASIS cross-sectional brain imaging study.

task1data <- task1data[task1data$Age >= 40, ]

Next, the missing values per column need to be counted and imputed with the median. Median imputation was chosen in line with how N/A values are handled in similar brain-function datasets.

colSums(is.na(task1data))
##    ID   M.F  Hand   Age  Educ   SES  MMSE   CDR  eTIV  nWBV   ASF Delay 
##     0     0     0     0    29    48    29    29     0     0     0     0
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
task1data$Educ <- impute(task1data$Educ, median)

task1data$SES <- impute(task1data$SES, median)

task1data$MMSE <- impute(task1data$MMSE, median)

task1data$CDR <- impute(task1data$CDR, median)

head(task1data)
##               ID M.F Hand Age Educ SES MMSE CDR eTIV  nWBV   ASF Delay
## 1  OAS1_0001_MR1   F    R  74    2   3   29 0.0 1344 0.743 1.306   N/A
## 2  OAS1_0002_MR1   F    R  55    4   1   29 0.0 1147 0.810 1.531   N/A
## 3  OAS1_0003_MR1   F    R  73    4   3   27 0.5 1454 0.708 1.207   N/A
## 9  OAS1_0010_MR1   M    R  74    5   2   30 0.0 1636 0.689 1.073   N/A
## 10 OAS1_0011_MR1   F    R  52    3   2   30 0.0 1321 0.827 1.329   N/A
## 12 OAS1_0013_MR1   F    R  81    5   2   30 0.0 1664 0.679 1.055   N/A

Count rows and columns:

dim(task1data)
## [1] 262  12

Because the data set is large, the str() function was used to display its internal structure:

str(task1data)
## 'data.frame':    262 obs. of  12 variables:
##  $ ID   : chr  "OAS1_0001_MR1" "OAS1_0002_MR1" "OAS1_0003_MR1" "OAS1_0010_MR1" ...
##  $ M.F  : chr  "F" "F" "F" "M" ...
##  $ Hand : chr  "R" "R" "R" "R" ...
##  $ Age  : int  74 55 73 74 52 81 76 82 89 48 ...
##  $ Educ : 'impute' int  2 4 4 5 3 5 2 2 5 5 ...
##   ..- attr(*, "imputed")= int [1:29] 15 28 29 36 54 83 84 89 90 95 ...
##  $ SES  : 'impute' num  3 1 3 2 2 2 2 4 1 2 ...
##   ..- attr(*, "imputed")= int [1:48] 7 15 24 27 28 29 34 36 46 51 ...
##  $ MMSE : 'impute' int  29 29 27 30 30 30 28 27 30 29 ...
##   ..- attr(*, "imputed")= int [1:29] 15 28 29 36 54 83 84 89 90 95 ...
##  $ CDR  : 'impute' num  0 0 0.5 0 0 0 0.5 0.5 0 0 ...
##   ..- attr(*, "imputed")= int [1:29] 15 28 29 36 54 83 84 89 90 95 ...
##  $ eTIV : int  1344 1147 1454 1636 1321 1664 1738 1477 1536 1326 ...
##  $ nWBV : num  0.743 0.81 0.708 0.689 0.827 0.679 0.719 0.739 0.715 0.785 ...
##  $ ASF  : num  1.31 1.53 1.21 1.07 1.33 ...
##  $ Delay: chr  "N/A" "N/A" "N/A" "N/A" ...

Two columns, Hand and Delay, were removed because they are irrelevant to the analysis:

task1data <- subset(task1data, select = -c(Hand, Delay))

Rename the variable M.F to Gender:

colnames(task1data)[colnames(task1data) == "M.F"] <- "Gender"
head(task1data)
##               ID Gender Age Educ SES MMSE CDR eTIV  nWBV   ASF
## 1  OAS1_0001_MR1      F  74    2   3   29 0.0 1344 0.743 1.306
## 2  OAS1_0002_MR1      F  55    4   1   29 0.0 1147 0.810 1.531
## 3  OAS1_0003_MR1      F  73    4   3   27 0.5 1454 0.708 1.207
## 9  OAS1_0010_MR1      M  74    5   2   30 0.0 1636 0.689 1.073
## 10 OAS1_0011_MR1      F  52    3   2   30 0.0 1321 0.827 1.329
## 12 OAS1_0013_MR1      F  81    5   2   30 0.0 1664 0.679 1.055

Statistical description of numerical variables:

library(pastecs)
round(stat.desc(task1data[c(-1,-2)]), 2)
##                   Age   Educ    SES    MMSE    CDR      eTIV   nWBV    ASF
## nbr.val        262.00 262.00 262.00  262.00 262.00    262.00 262.00 262.00
## nbr.null         0.00   0.00   0.00    0.00 162.00      0.00   0.00   0.00
## nbr.na           0.00   0.00   0.00    0.00   0.00      0.00   0.00   0.00
## min             40.00   1.00   1.00   14.00   0.00   1123.00   0.64   0.88
## max             96.00   5.00   5.00   30.00   2.00   1992.00   0.86   1.56
## range           56.00   4.00   4.00   16.00   2.00    869.00   0.22   0.68
## sum          18323.00 827.00 629.00 7143.00  67.00 382609.00 198.17 318.59
## median          72.00   3.00   2.00   29.00   0.00   1446.50   0.76   1.21
## mean            69.94   3.16   2.40   27.26   0.26   1460.34   0.76   1.22
## SE.mean          0.84   0.08   0.06    0.22   0.02      9.97   0.00   0.01
## CI.mean.0.95     1.65   0.15   0.12    0.43   0.05     19.62   0.01   0.02
## var            183.62   1.54   1.05   12.59   0.14  26018.37   0.00   0.02
## std.dev         13.55   1.24   1.03    3.55   0.37    161.30   0.05   0.13
## coef.var         0.19   0.39   0.43    0.13   1.46      0.11   0.07   0.11

Interpretation:

The average age was 69.94 years. The standard error of the mean is relatively high for this variable, which increases the chance that the sample mean does not represent the population mean precisely. For MMSE, the wide range (14 to 30) indicates that the study includes people ranging from moderate and mild dementia to nondemented.
The highest coefficient of variation belongs to CDR (1.46), meaning that it has the greatest relative dispersion around the mean of all the variables examined.
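
As a quick check, the coefficient of variation is the standard deviation divided by the mean. A sketch (CDR carries Hmisc's impute class, so it is converted to plain numeric first):

cdr <- as.numeric(task1data$CDR)
sd(cdr) / mean(cdr)  # should reproduce the coef.var of 1.46 from stat.desc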

An additional variable, Group, was added to indicate whether a person is classified as demented, based on the CDR score.

# CDR of 0.5 or higher indicates at least very mild dementia
task1data$Group <- ifelse(task1data$CDR > 0.4, "Demented", "Nondemented")
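
A quick frequency check of the new variable; the stat.desc output above shows 162 subjects with CDR = 0, so 162 should be Nondemented and the remaining 100 Demented:

table(task1data$Group)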

Data Presentation

# Hmisc provides a hist.data.frame() method, so hist() on a data frame
# draws a histogram for each numeric variable (ID, Gender and Group excluded)
hist(task1data[c(-1, -2, -11)])

The difference between males and females by CDR level:

library(ggplot2)
ggplot(task1data, aes(as.factor(CDR), Age, fill = Gender))+
  geom_boxplot()+
  ggtitle("Degree of CDR by Age")+
  xlab("CDR")+
  #geom_text(stat = "count", aes(label = ..count..), y = 60, col = "red")+
  theme(plot.title = element_text(hjust = .5))

The tested females that do not exhibit dementia are older than men in the same category. People who score 0 on the CDR show a larger dispersion of age than those who scored 0.5 or higher. For older subjects who exhibit dementia the data are more reliable, as the interquartile range is smaller.

task1data$GroupFactor <- factor(task1data$Group,
                                levels = c("Demented", "Nondemented"),
                                labels = c(1, 0))
library(car)
## Loading required package: carData
scatterplot(y = task1data$CDR, x= task1data$nWBV,
            main = "The relationship between the group and nWBV",
            ylab = "CDR",
            xlab = "Normalised whole brain volume",
            smooth = FALSE)

The scatter plot shows the relationship between CDR, which is used to identify whether a test subject is demented, and normalized whole brain volume. The relationship between the variables is negative: as the normalized whole brain volume decreases, the CDR increases, and vice versa.
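
The strength of this negative association can be quantified with the correlation coefficient. A quick check (again converting CDR from the impute class to plain numeric):

cor(as.numeric(task1data$CDR), task1data$nWBV)  # expected to be negative, consistent with the plot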

Task 2

task2data <- read.table("~/R Bootcamp/R Take Home Exam/Task 2/Body mass.csv", header=TRUE, sep=";", dec=",")
head(task2data)
##   ID Mass
## 1  1 62.1
## 2  2 64.5
## 3  3 56.5
## 4  4 53.4
## 5  5 61.3
## 6  6 62.2

Description: the data are from the beginning of the 2021/2022 school year.

For comparison, the mean at the beginning of the 2018/2019 school year was 59.5 kg.

Descriptive statistics and histogram

summary(task2data[-1])
##       Mass      
##  Min.   :49.70  
##  1st Qu.:60.23  
##  Median :62.80  
##  Mean   :62.88  
##  3rd Qu.:64.50  
##  Max.   :83.20

The output above shows the minimum, the first quartile (Q1, below which the lowest 25% of values fall), the median (which separates the lower 50% from the upper 50%), the third quartile (Q3, the 75% mark) and the maximum of the observed body mass of the ninth graders.

round(stat.desc(task2data[-1]), 2)
##                 Mass
## nbr.val        50.00
## nbr.null        0.00
## nbr.na          0.00
## min            49.70
## max            83.20
## range          33.50
## sum          3143.80
## median         62.80
## mean           62.88
## SE.mean         0.85
## CI.mean.0.95    1.71
## var            36.14
## std.dev         6.01
## coef.var        0.10
  • nbr.val: 50 students were observed
  • The minimum body mass at the time of data collection was 49.70 kg and the maximum was 83.20 kg
  • The range between minimum and maximum is 33.50 kg
  • The mean, 62.88 kg, is the average body mass of the ninth graders
  • The median shows that 50% of the ninth graders have a body mass above 62.80 kg and 50% below it
  • CI.mean.0.95: with 95% confidence, the arithmetic mean of the body mass is between 61.17 kg and 64.59 kg (see the quick check below)
  • var: the larger the variance, the larger the variability in the collected body masses
  • coef.var: used to compare the variability between different numeric variables; in this case only one numeric variable is given
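
As a quick check, the confidence interval bounds follow from the stat.desc values above: mean ± CI.mean.0.95.

62.88 - 1.71  # lower bound, approximately 61.17 kg
62.88 + 1.71  # upper bound, approximately 64.59 kg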

hist(task2data$Mass,
     prob = TRUE,
     main = "Distribution of body mass of ninth graders 2021/2022",
     ylab = "Density",  # with prob = TRUE the y-axis shows density, not counts
     xlab = "Body mass",
     breaks = seq(40, 90, 2),
     col = "turquoise")

Hypothesis testing and results interpretation

H0: μ = 59.5
H1: μ ≠ 59.5

t.test(task2data$Mass, 
       mu = 59.5,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  task2data$Mass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
##  61.16758 64.58442
## sample estimates:
## mean of x 
##    62.876

Results interpretation:

One can be 95% confident that the population mean is between 61.17 kg and 64.58 kg. The null hypothesis is rejected since the p-value < 0.001: it is extremely unlikely that the average weight of ninth graders at the beginning of the 2021/2022 school year is still 59.5 kg. The mean has increased since the beginning of the 2018/2019 school year.

r = sqrt(t² / (t² + df)) = sqrt(3.9711² / (3.9711² + 49)) ≈ 0.49
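
The same calculation in R, using the t statistic and degrees of freedom from the test output above:

t <- 3.9711
df <- 49
sqrt(t^2 / (t^2 + df))  # effect size r, approximately 0.49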

Furthermore, the effect size r = 0.49 corresponds to a medium effect.

Task 3

Import the dataset Apartments.xlsx

library(readxl)
task3data <- read_xlsx("~/R Bootcamp/R Take Home Exam/Task 3/Apartments.xlsx")

head(task3data) 
## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

Change categorical variables into factors

task3data$ParkingFactor <- factor(task3data$Parking,
                                  levels = c(0, 1),
                                  labels = c("No", "Yes"))

task3data$BalconyFactor <- factor(task3data$Balcony,
                                  levels = c(0, 1),
                                  labels = c("No", "Yes"))

Test the hypothesis H0: μ_Price = 1900 EUR. What can you conclude?

t.test(task3data$Price, 
       mu = 1900,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  task3data$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

We can conclude with 95% confidence that the mean price is between 1937.44 and 2100.44 EUR per m2. The null hypothesis is rejected since the p-value < 0.01: it is very unlikely that the average price per m2 is 1900 EUR as stated in H0. The mean is higher.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age,
           data = task3data)
summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = task3data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

As seen from the summary, the estimated regression function is y = 2185.455 - 8.975x, where x is Age. The regression coefficient (slope) of -8.975 means the relationship is negative: if the age of the apartment increases by one year, the average price per m2 decreases by 8.975 EUR. The coefficient of determination, also known as R-squared, is 0.053, so the model explains about 5% of the variation in price. The coefficient of correlation is r = -0.23 (the square root of R-squared, taking the sign of the slope), which indicates a weak relationship between the variables.

sqrt(0.05302)
## [1] 0.2302607
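
Equivalently, the correlation can be computed directly, with the sign included:

cor(task3data$Age, task3data$Price)  # expected to be about -0.23, negative like the slope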

Show the scatterplot matrix between Price, Age and Distance. Based on the matrix, determine if there is a potential problem with multicollinearity.

library(car)
scatterplotMatrix(task3data[c(1, 2, 3)],
                  smooth = FALSE)

Without a formal calculation it is difficult to determine potential multicollinearity; however, based on the scatter plot matrix it seems there is little risk of it: none of the pairwise relationships between the explanatory variables shows abnormally high correlation, and the points are evenly spread out.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance,
           data = task3data)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = task3data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Check the multicollinearity with VIF statistics. Explain the findings.

library(car)
vif(fit2)
##      Age Distance 
## 1.001845 1.001845

The findings would be concerning if a VIF were higher than 5. Since both values are close to 1, the explanatory variables are practically uncorrelated.
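
For reference, the VIF of a regressor is 1 / (1 - R²), where R² comes from regressing that variable on the remaining regressors. A quick check for Age under that definition:

r2_age <- summary(lm(Age ~ Distance, data = task3data))$r.squared
1 / (1 - r2_age)  # should match vif(fit2) for Age, approximately 1.0018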

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic case (outlier or unit with big influence).

head(round(rstandard(fit2), 2))
##     1     2     3     4     5     6 
## -0.67  1.78 -0.59  0.75 -1.07 -0.78

None of the displayed standardized residuals comes close to the usual ±3 cut-off, so no obvious outliers stand out; see the fuller check below.

# Store the (rounded) Cook's distances alongside the model for inspection
fit2$CooksD <- round(cooks.distance(fit2), 3)
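
A fuller check over all cases flags large standardized residuals and influential points. A sketch; the 4/n cut-off for Cook's distance is one common rule of thumb, not the only one:

n <- nrow(task3data)
which(abs(rstandard(fit2)) > 3)      # candidate outliers
which(cooks.distance(fit2) > 4 / n)  # candidate high-influence cases

Any flagged rows would be candidates for removal; here no case was treated as problematic, so the data are left unchanged.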

Check for potential heteroskedasticity with a scatterplot between the standardized residuals and standardized fitted values. Explain the findings.

# Plot standardized residuals against standardized fitted values,
# matching the axis labels
scatterplot(y = rstandard(fit2), x = as.numeric(scale(fitted(fit2))),
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = TRUE,
            smooth = FALSE)

The standardized residuals show similar variability across the whole range of fitted values, i.e. their variance is roughly constant, so there is no clear sign of heteroskedasticity.

Are the standardized residuals distributed normally? Show the graph and formally test it. Explain the findings.

hist(rstandard(fit2),
     prob = TRUE,
     main = "Distribution of standardized residuals",
     ylab = "Density",  # with prob = TRUE the y-axis shows density
     xlab = "Standardized residuals",
     col = "green")

library(stats)
shapiro.test(fit2$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  fit2$residuals
## W = 0.95366, p-value = 0.00398

H0: the residuals are normally distributed.
H1: the residuals are not normally distributed.

Since the p-value (0.004) is less than 0.05, H0 is rejected at the 5% significance level, meaning the residuals are not normally distributed.
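
A Q-Q plot is another common visual check of normality (a sketch, not part of the original output):

qqnorm(rstandard(fit2), main = "Q-Q plot of standardized residuals")
qqline(rstandard(fit2))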

Estimate fit2 again with the potentially problematic cases excluded and show the summary of the model. Explain all coefficients. Since no problematic cases were identified above, no observations were removed and the estimates are unchanged: holding distance constant, each additional year of age lowers the average price per m2 by 7.93 EUR; holding age constant, each additional km from the city center lowers it by 20.67 EUR. The intercept, 2460.10 EUR, is the estimated price per m2 of a brand-new apartment in the city center.

fit2 <- lm(Price ~ Age + Distance,
           data = task3data)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = task3data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + ParkingFactor + BalconyFactor,
           data = task3data)
summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor, 
##     data = task3data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2301.667     94.271  24.415  < 2e-16 ***
## Age                -6.799      3.110  -2.186  0.03172 *  
## Distance          -18.045      2.758  -6.543 5.28e-09 ***
## ParkingFactorYes  196.168     62.868   3.120  0.00251 ** 
## BalconyFactorYes    1.935     60.014   0.032  0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11

With function anova check if model fit3 fits data better than model fit2.

anova(fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingFactor + BalconyFactor
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     82 6720983                              
## 2     80 5991088  2    729894 4.8732 0.01007 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Setting the hypotheses:
H0: fit2 (model 1) is more appropriate.
H1: fit3 (model 2) is more appropriate.

The results show a p-value of 0.01, so H0 is rejected at the 5% level and fit3 is accepted as the more appropriate model.
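
The F statistic can be reproduced by hand from the ANOVA table above: the drop in the residual sum of squares per removed degree of freedom, divided by the residual variance of the larger model.

(729894 / 2) / (5991088 / 80)  # approximately 4.873, matching the table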

Show the results of fit3 and explain the regression coefficients for both categorical variables. Can you write down the hypothesis that is being tested with the F-statistic, shown at the bottom of the output?

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor, 
##     data = task3data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2301.667     94.271  24.415  < 2e-16 ***
## Age                -6.799      3.110  -2.186  0.03172 *  
## Distance          -18.045      2.758  -6.543 5.28e-09 ***
## ParkingFactorYes  196.168     62.868   3.120  0.00251 ** 
## BalconyFactorYes    1.935     60.014   0.032  0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11

Explanation for the categorical variables:

Parking: holding age, distance and balcony constant, an apartment with parking is on average 196 EUR per m2 more expensive than one without. The p-value = 0.00251, so the effect is statistically significant.

Balcony: the p-value of 0.974 is not significant, since p-value > 0.05. Having a balcony therefore does not significantly change the price per m2 of the apartment.

The hypothesis tested by the F-statistic:
H0: ρ² = 0
H1: ρ² > 0

H0 is rejected at p < 0.001, meaning there is a linear relationship between price and at least one explanatory variable.

Save the fitted values and calculate the residual for apartment ID2.

task3data$Fittedvalues <- fitted.values(fit3)
head(task3data)
## # A tibble: 6 × 8
##     Age Distance Price Parking Balcony ParkingFactor BalconyFactor Fittedvalues
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>         <fct>                <dbl>
## 1     7       28  1640       0       1 No            Yes                  1751.
## 2    18        1  2800       1       0 Yes           No                   2357.
## 3     7       28  1660       0       0 No            No                   1749.
## 4    28       29  1850       0       1 No            Yes                  1590.
## 5    18       18  1640       1       1 Yes           Yes                  2053.
## 6    28       12  1770       0       1 No            Yes                  1897.

Calculating the residual for apartment ID2:

Residual = actual value (Price) - predicted value (fitted value) = 2800 - 2357.411 ≈ 442.6 EUR
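
The same residual can be read directly from the model (a quick check, assuming the second row is apartment ID2):

round(residuals(fit3)[2], 1)  # approximately 442.6 EUR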