# data(package = .packages(all.available = TRUE)) = Code used to show all data sets available in my R studio, even the ones that aren't loaded at the moment <- This is how I found my data set in package "boot"
mydata <- boot::melanoma
head(mydata)
## time status sex age year thickness ulcer
## 1 10 3 1 76 1972 6.76 1
## 2 30 3 1 56 1968 0.65 0
## 3 35 2 1 41 1977 1.34 0
## 4 99 3 0 71 1968 2.90 0
## 5 185 1 1 52 1965 12.08 1
## 6 204 1 1 28 1971 4.84 1
Description of variables:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata <- dplyr::mutate(mydata, ID = row_number())
head((mydata), 3)
## time status sex age year thickness ulcer ID
## 1 10 3 1 76 1972 6.76 1 1
## 2 30 3 1 56 1968 0.65 0 2
## 3 35 2 1 41 1977 1.34 0 3
In this step I added the variable “ID” to make tracking observations easier.
mydata <- mydata[c(8, 3, 4, 5, 6, 7, 1, 2)]
colnames(mydata) <- c("ID", "Sex", "Age", "YearOfOp", "Thickness", "Ulceration", "Time", "Status")
mydata$SexF <- factor(mydata$Sex,
levels = c(0, 1),
labels = c("Female", "Male"))
mydata$UlcerationF <- factor(mydata$Ulceration,
levels = c(0, 1),
labels = c("Absent", "Present"))
mydata$StatusF <- factor(mydata$Status,
levels = c(1, 2, 3),
labels = c("Died from Melanoma", "Alive", "Death from Othr"))
tail(mydata)
## ID Sex Age YearOfOp Thickness Ulceration Time Status SexF UlcerationF StatusF
## 200 200 0 19 1965 1.13 1 4479 2 Female Present Alive
## 201 201 1 29 1965 7.06 1 4492 2 Male Present Alive
## 202 202 0 40 1965 6.12 0 4668 2 Female Absent Alive
## 203 203 0 42 1965 0.48 0 4688 2 Female Absent Alive
## 204 204 0 50 1964 2.26 0 4926 2 Female Absent Alive
## 205 205 0 41 1962 2.90 0 5565 2 Female Absent Alive
Here I re-organised the column order and renamed to variables to make the data set more organised. I also added the Factor variables for the 3 categorical variables.
mydata <- mydata[c(-2, -4, -6, -8)] # Removing the non Factor variables and variable YearOfOp since it is not necessarily useful (unless we would like to research the difference in effects of new technology)
mydata <- mydata[c(1, 5, 2, 3, 6, 4, 7)] # Reordering the variables
mydata1 <- mydata[!(mydata$StatusF == "Death from Othr"), ]
summary(mydata1[7])
## StatusF
## Died from Melanoma: 57
## Alive :134
## Death from Othr : 0
In this step I removed the variables which are not needed anymore, reordered the columns and created a new data set without the observations whose cause of death was other than melanoma (they are not relevant in the research). Summary of variable StatusF in the new data set shows us that there are no such observations present.
library(pastecs)
##
## Attaching package: 'pastecs'
## The following objects are masked from 'package:dplyr':
##
## first, last
round(stat.desc(mydata[c(-1, -2, -5, -7)]), 2)
## Age Thickness Time
## nbr.val 205.00 205.00 205.00
## nbr.null 0.00 0.00 0.00
## nbr.na 0.00 0.00 0.00
## min 4.00 0.10 10.00
## max 95.00 17.42 5565.00
## range 91.00 17.32 5555.00
## sum 10755.00 598.57 441324.00
## median 54.00 1.94 2005.00
## mean 52.46 2.92 2152.80
## SE.mean 1.16 0.21 78.37
## CI.mean.0.95 2.30 0.41 154.52
## var 277.95 8.76 1259020.14
## std.dev 16.67 2.96 1122.06
## coef.var 0.32 1.01 0.52
We can see that there were 205 observations in this data set. Of these 205 observations the minimum age of patients was 4yo and the max was 95yo, which makes the range of Age in the data set 91 years. The average age of the patients was 52.46 years and the median was 54 years. This means that 50% of the patients were older than 54 and 50% were younger. The average tumor thickness of the patients was 2.92mm. The 95% CI for the mean of tumor thickness is 0.41, so we can say with 95% certainty that the real population mean lies somewhere between 2.92mm +- 0.41mm.
summary(mydata[c(2, 5, 7)])
## SexF UlcerationF StatusF
## Female:126 Absent :115 Died from Melanoma: 57
## Male : 79 Present: 90 Alive :134
## Death from Othr : 14
Here is the summary data for only the Factor variables which shows us the number of each category in the variables.
ggplot(mydata, aes(x = Thickness))+
theme_bw()+
geom_histogram(binwidth = 1, colour = "black", fill = "dark green")+
ggtitle("Distribution of tumor thickness among patients")+
ylab("Frequency")+
xlab("Thickness of tumor in mm")
Here we have the distribution of tumor thickness values in mm for the
data set. On the Histogram we can see that the thickness is positively
or right skewed. The size of the bins is 1mm so we can see that the most
frequent size of patients’ tumor was between 0.5mm and 1.5mm. This
Histogram also leads us to believe that any tumor above 5mm is quite
rare.
scatterplot(y = mydata$Thickness, x = mydata$Age,
ylab = "Tumor thickness in mm",
xlab = "Age",
boxplots = FALSE,
smooth = FALSE)
In this scatter plot I plotted the values of Age on the X axis and tumor
thickness on the Y axis searching for a relationship. I predicted that
there will be a positive correlation between age and thickness as the
tumor gets the chance grow with age. My prediction was correct as we can
see with the regression line on the scatter plot, the slope is
positive.
ggplot(mydata1, aes(x = StatusF, y = Thickness, fill = SexF))+
theme_bw()+
geom_boxplot()+
scale_fill_brewer(palette = "Dark2")+
ylab("Tumor thickness in mm")+
xlab("Status")+
labs(fill = "Sex")
In this box plot I decided to compare the tumor thickness by “Status”
and “Sex” to see if there are any differences in thickness of those
patients who died and those that survived as well as between genders.
For this box plot I used the data set without observations whose cause
of death was not melanoma. We can see that the ones who died from
melanoma had on average thicker tumors that the ones who did not. We can
also depict from the graph that Male patients on average had thicker
tumors than Female patients.
mydata2 <- read.table("~/Desktop/IMB/R data/R Take Home Exam/Task 2/Body mass.csv",
sep = ";",
dec = ",",
header = TRUE)
library(pastecs)
round(stat.desc(mydata2[-1]), 2)
## Mass
## nbr.val 50.00
## nbr.null 0.00
## nbr.na 0.00
## min 49.70
## max 83.20
## range 33.50
## sum 3143.80
## median 62.80
## mean 62.88
## SE.mean 0.85
## CI.mean.0.95 1.71
## var 36.14
## std.dev 6.01
## coef.var 0.10
ggplot(mydata2, aes(x = Mass))+
theme_bw()+
geom_histogram(binwidth = 5, colour = "black", fill = "light blue")+
ggtitle("Distribution of body mass of 9th graders 2021/22")+
ylab("Frequency")+
xlab("Body mass in kg")
t.test(mydata2$Mass,
mu = 59.5,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata2$Mass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
## 61.16758 64.58442
## sample estimates:
## mean of x
## 62.876
With this t-test we researched two hypothesis. H0 = average weight of 9th graders in 2018/19 is equal to the avg weight of 9th graders in 2021/22, Ha = average weight of 9th graders in 2018/19 is NOT equal to the avg weight of 9th graders in 2021/22. Because we are checking equality the mean of 2021/22 can either be higher or lower than the mean of 2018/19 which makes this test a two tailed test. From observing the histogram we can already suspect that the two means are not equal as the most frequent values lay between 62.5 and 67.5kg. Conducting the t-test we can deduct that at a 5% level of significance (a = 5%) we can reject the null hypothesis. The p value for the t-test is 0.0002 or 0.02% which also equals to the percentage of us making a mistake and the null hypothesis being correct. Therefore we can safely say that the mean of 2018/19 in not equal to the mean of 2021/22, it is higher in 2021/22. The 95% CI also shows us that we can say with 95% certainty that the population mean of 2021/2022 lies between 61.17 and 64.58 which does not include 59.5 (2018/19 mean).
#install.packages("readxl")
library(readxl)
mydata3 <- read_xlsx("~/Desktop/IMB/R data/R Take Home Exam/Task 3/Apartments.xlsx")
Description:
mydata3$ParkingFactor <- factor(mydata3$Parking,
levels = c(0, 1),
labels = c("No", "Yes"))
mydata3$BalconyFactor <- factor(mydata3$Balcony,
levels = c(0, 1),
labels = c("No", "Yes"))
str(mydata3)
## tibble [85 × 7] (S3: tbl_df/tbl/data.frame)
## $ Age : num [1:85] 7 18 7 28 18 28 14 18 22 25 ...
## $ Distance : num [1:85] 28 1 28 29 18 12 20 6 7 2 ...
## $ Price : num [1:85] 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
## $ Parking : num [1:85] 0 1 0 0 1 0 0 1 1 1 ...
## $ Balcony : num [1:85] 1 0 0 1 1 1 1 1 0 0 ...
## $ ParkingFactor: Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 2 2 2 ...
## $ BalconyFactor: Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 2 2 1 1 ...
t.test(mydata3$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata3$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
H0 : mu = 1900 Ha : mu is not = 1900
I can conclude that the mean price of apartments is not equal to 1900 at a 5% significance level (a = 5%), therefore we can reject H0. The 95% CI for the apartment price is between 1937.443 and 2100.440 which means that we can say with 95% certainty that the true population means lies between those 2 values. This interval does not include 1900 which again lets us reject the null hypothesis at a 5% significance level.
fit1 <- lm(Price ~ Age,
data = mydata3)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
sqrt(summary(fit1)$r.squared) # Coefficient of correlation (in absolute terms)
## [1] 0.230255
The regression function is constructed as Price = 2185.455 - 8.975 x Age
The intercept coefficient means that on average is Age is 0 and everything else stays the same the price will be 2185.455. The regression coefficient for Age tells us the if Age increases by 1, the price will on average decrease by 8.975 if all else stays equal. Both of the coefficients have a p value below 0.05 which leads us to reject the null hypothesis at each. The null hypothesis is that the coefficient in equal to 0 therefore we can say at a 5% level of significance, or even less with the intercept, that the coefficients are not equal to 0
The multiple r squared statistic is the coefficient of determination an it tells us that 5.302% of the variability in the price is explained by the x variables (in this case Age).
The square root of R-squared is the coefficient of correlation and it tells us the absolute correlation between the observed variables and the predicted variables, in this case it is 0.23.
scatterplotMatrix(mydata3[ , 1:3],
smooth = FALSE)
There would be a potential problem with multicolinearity when the
explanatory variables would be correlated with each other. In the case
of the scatter plot we can deduct if this is the case by looking at the
regression line on the scatter plot between the explanatory variables,
in this case Age and Distance. If we look at the plot we can see that
the slope of the line is close to 0 therefore meaning that there is no
significant correlation between the two variables which tells us that
there is not problem with multicolinearity.
fit2 <- lm(Price ~ Age + Distance,
data = mydata3)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
The VIF statistic checks for multicolinearity between the explanatory variables. The rule for the VIF statistic is that it should be less than 5 for all explanatory variables and the mean of the VIF statistic should be as close to one. If a variable has a VIF of higher than 5 this variable is causing the problem of multicolinearity and should be looked into/removed. In our case the explanatury variables are both less than 5 and the mean is close to 1 which tells us there is no problem with multycolinearity.
mydata3$StdResid <- round(rstandard(fit2), 3)
head(mydata3[order(-mydata3$StdResid), ])
## # A tibble: 6 × 8
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor StdResid
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl>
## 1 5 45 2180 1 1 Yes Yes 2.58
## 2 2 11 2790 1 0 Yes No 2.05
## 3 18 1 2800 1 0 Yes No 1.78
## 4 18 1 2800 1 1 Yes Yes 1.78
## 5 8 2 2820 1 0 Yes No 1.66
## 6 10 1 2810 0 0 No No 1.60
tail(mydata3[order(-mydata3$StdResid), ])
## # A tibble: 6 × 8
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor StdResid
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl>
## 1 24 5 1830 1 0 Yes No -1.19
## 2 14 16 1660 0 1 No Yes -1.26
## 3 13 8 1800 0 0 No No -1.38
## 4 12 14 1650 0 1 No Yes -1.50
## 5 12 14 1650 0 0 No No -1.50
## 6 7 2 1760 0 1 No Yes -2.15
mydata3$CooksD <- round(cooks.distance(fit2), 3)
head(mydata3[order(-mydata3$CooksD), ])
## # A tibble: 6 × 9
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor StdResid CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 5 45 2180 1 1 Yes Yes 2.58 0.32
## 2 43 37 1740 0 0 No No 1.44 0.104
## 3 2 11 2790 1 0 Yes No 2.05 0.069
## 4 7 2 1760 0 1 No Yes -2.15 0.066
## 5 37 3 2540 1 1 Yes Yes 1.58 0.061
## 6 40 2 2400 0 1 No Yes 1.09 0.038
hist(mydata3$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
hist(mydata3$StdResid,
xlab = "Standardized Residuals",
ylab = "Frequency",
main = "Histogram of Standardized Residuals")
By looking at the standardized residuals we can see that none are above 3 or below -3, which would automatically categorize them as outlines and they would have been removed. However when looking at the cooks distance we can see that one observation in particular has a significantly higher cooks distance than all others. This means that it has a higher effect on the regression than other observations. When combining this fact with the facts that the standardized residual of this observation is very close to 3 (2.577) I decided to remove it from my data set. This will be done later when estimating fit 2 again in order to avoid problems when completing the next 2 exercises. In the next 2 exercises I will use the regression with the “outlier” included.
mydata3$StdFittedValues <- scale(fit2$fitted.values)
scatterplot(y = mydata3$StdResid, x = mydata3$StdFittedValues,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
Heteroskedasticity occurs when the variance is not constant. on this
scatter plot we can see that the variance of standardized residuals is
constant so therefore we can say there is no heteroskedasticity and
there is homoskedasticity, which is not a problem.
hist(mydata3$StdResid,
prob = TRUE,
xlab = "Standardized residuals",
ylab = "Density",
main = "Histogram of standardized residuals")
curve(dnorm(x, mean = mean(mydata3$StdResid), sd = sd(mydata3$StdResid)),
add = TRUE,
lwd = 2,
col = "red")
shapiro.test(mydata3$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata3$StdResid
## W = 0.95303, p-value = 0.003645
From the histogram we can already suspect that the standardized residuals are not distributed normally because the shape of the distribution does not match the one of the normal distribution. When we formally test it with the Shapiro-Wilk normality test where the H0 : distribution is normal, and Ha : distribution is not normal we can reject the null hypothesis at a 5% significance level. This means that the distribution is not normal.
mydata3 <- mydata3[!(mydata3$CooksD == 0.320), ]
head(mydata3[order(-mydata3$CooksD), ])
## # A tibble: 6 × 10
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor StdResid CooksD StdFittedValues[,1]
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl> <dbl>
## 1 43 37 1740 0 0 No No 1.44 0.104 -2.65
## 2 2 11 2790 1 0 Yes No 2.05 0.069 0.790
## 3 7 2 1760 0 1 No Yes -2.15 0.066 1.37
## 4 37 3 2540 1 1 Yes Yes 1.58 0.061 0.342
## 5 40 2 2400 0 1 No Yes 1.09 0.038 0.329
## 6 8 2 2820 1 0 Yes No 1.66 0.037 1.34
fit2 <- lm(Price ~ Age + Distance,
data = mydata3)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -604.92 -229.63 -56.49 192.97 599.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2456.076 73.931 33.221 < 2e-16 ***
## Age -6.464 3.159 -2.046 0.044 *
## Distance -22.955 2.786 -8.240 2.52e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared: 0.4838, Adjusted R-squared: 0.4711
## F-statistic: 37.96 on 2 and 81 DF, p-value: 2.339e-12
sqrt(summary(fit2)$r.squared)
## [1] 0.6955609
Here I finally removed the observation that I decided needed to be removed, more specifically the on with a cooks distance of 0.320 and standardized residual of 2.577. After that i checked if it was properly removed, and it was. I later estimated fit2 again to “update” the regression without the removed observation.
The regression functions is constructed as Price = 2456.076 - 6.464 x Age - 22.955 x Distance
The intercept coefficient means that on average is Age and Distance are 0 and everything else stays the same the price will be 2456.076. The regression coefficient for Age tells us the if Age increases by 1, the price will on average decrease by 6.464 if all else stays equal. The regression coefficient for Distance tells us the if Distance increases by 1, the price will on average decrease by 22.955 if all else stays equal. All 3 of the coefficients have a p value below 0.05 which leads us to reject the null hypothesis at each. The null hypothesis is that the coefficient in equal to 0 therefore we can say at a 5% level of significance, or even less with the intercept and reg coeff of Distance, that the coefficients are not equal to 0
The multiple r squared statistic is the coefficient of determination an it tells us that 48.38% of the variability in the price is explained by the x variables (in this case Age and Distance).
The square root of R-squared is the coefficient of correlation and it tells us the absolute correlation between the observed variables and the predicted variables, in this case it is 0.6955609.
fit3 <- lm(Price ~ Age + Distance + ParkingFactor + BalconyFactor,
data = mydata3)
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingFactor + BalconyFactor
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 81 6176767
## 2 79 5654480 2 522287 3.6485 0.03051 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
With the analysis of variance we can see that at a 5% level of significance we can conclude that the fit3 regression fits data better than fit2. This is because if the p value is lower than 0.05 anova says the more complex model is better (fit3) and if the p value is more than 0.05 anova says the less complex is better (fit2)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor,
## data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -473.21 -192.37 -28.89 204.17 558.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2329.724 93.066 25.033 < 2e-16 ***
## Age -5.821 3.074 -1.894 0.06190 .
## Distance -20.279 2.886 -7.026 6.66e-10 ***
## ParkingFactorYes 167.531 62.864 2.665 0.00933 **
## BalconyFactorYes -15.207 59.201 -0.257 0.79795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared: 0.5275, Adjusted R-squared: 0.5035
## F-statistic: 22.04 on 4 and 79 DF, p-value: 3.018e-12
The regression coefficient for Parking tells us that is the apartment has a parking spot the price will on average be higher by 167.531 if all else stays the same. This coefficient also has a p value of below 0.01 so we can say at a 1% level of significance that the coefficient is not equal to 0. The coefficient for balcony has a p value of 0.79795 therefore we can not say that the coefficient is not equal to 0.
The null hypothesis for the test with the F-statistic is that the population coefficient of determination equals to 0 and the alternative hypothesis is that the population coefficient of determination is greater than 0.
mydata3$FittedValues <- fitted.values(fit3)
mydata3$Residuals <- residuals(fit3)
round(mydata3[2, 12], 2)
## # A tibble: 1 × 1
## Residuals
## <dbl>
## 1 428.
The residual for apartment ID2 is equal to 427.8 which means that the difference between the actual price and price calculated by the regression is 427.8, or that the actual price is higher by 427.8 that the estimated value by the regression.