data()
data(package = .packages(all.available = TRUE))
library(boot)
mydata <- force(melanoma)
head(mydata)
## time status sex age year thickness ulcer
## 1 10 3 1 76 1972 6.76 1
## 2 30 3 1 56 1968 0.65 0
## 3 35 2 1 41 1977 1.34 0
## 4 99 3 0 71 1968 2.90 0
## 5 185 1 1 52 1965 12.08 1
## 6 204 1 1 28 1971 4.84 1
Explanation of variables: - time: Survival time in days since the operation - status: The patients status at the end of the study. 1 indicates that they had died from melanoma, 2 indicates that they were still alive and 3 indicates that they had died from causes unrelated to their melanoma. - sex: The patients sex; 1=male, 0=female. - age: Age in years at the time of the operation. - year: Year of operation. - thickness: Tumour thickness in mm. - ulcer: Indicator of ulceration; 1=present, 0=absent
colnames(mydata)[3] <- "gender"
mydata$GenderF <- factor(mydata$gender,
levels = c(0, 1),
labels = c("F", "M"))
head (mydata)
## time status gender age year thickness ulcer GenderF
## 1 10 3 1 76 1972 6.76 1 M
## 2 30 3 1 56 1968 0.65 0 M
## 3 35 2 1 41 1977 1.34 0 M
## 4 99 3 0 71 1968 2.90 0 F
## 5 185 1 1 52 1965 12.08 1 M
## 6 204 1 1 28 1971 4.84 1 M
I renamed the variable sex into gender and change 0, 1 into M, F to be more logical when analysing.
summary(mydata)
## time status gender age year
## Min. : 10 Min. :1.00 Min. :0.0000 Min. : 4.00 Min. :1962
## 1st Qu.:1525 1st Qu.:1.00 1st Qu.:0.0000 1st Qu.:42.00 1st Qu.:1968
## Median :2005 Median :2.00 Median :0.0000 Median :54.00 Median :1970
## Mean :2153 Mean :1.79 Mean :0.3854 Mean :52.46 Mean :1970
## 3rd Qu.:3042 3rd Qu.:2.00 3rd Qu.:1.0000 3rd Qu.:65.00 3rd Qu.:1972
## Max. :5565 Max. :3.00 Max. :1.0000 Max. :95.00 Max. :1977
## thickness ulcer GenderF
## Min. : 0.10 Min. :0.000 F:126
## 1st Qu.: 0.97 1st Qu.:0.000 M: 79
## Median : 1.94 Median :0.000
## Mean : 2.92 Mean :0.439
## 3rd Qu.: 3.56 3rd Qu.:1.000
## Max. :17.42 Max. :1.000
I see that I don’t have any missing data, so I won’t delete some units with drop-na function.
#install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata <- dplyr::mutate(mydata, ID = row_number())
head(mydata)
## time status gender age year thickness ulcer GenderF ID
## 1 10 3 1 76 1972 6.76 1 M 1
## 2 30 3 1 56 1968 0.65 0 M 2
## 3 35 2 1 41 1977 1.34 0 M 3
## 4 99 3 0 71 1968 2.90 0 F 4
## 5 185 1 1 52 1965 12.08 1 M 5
## 6 204 1 1 28 1971 4.84 1 M 6
I added new variable, named ID.
tail ((mydata), 4)
## time status gender age year thickness ulcer GenderF ID
## 202 4668 2 0 40 1965 6.12 0 F 202
## 203 4688 2 0 42 1965 0.48 0 F 203
## 204 4926 2 0 50 1964 2.26 0 F 204
## 205 5565 2 0 41 1962 2.90 0 F 205
Showing only last 4 rows.
mydata <- mydata[c(8, 3, 4, 5, 6, 7, 1, 2)]
head(mydata)
## GenderF gender age year thickness ulcer time status
## 1 M 1 76 1972 6.76 1 10 3
## 2 M 1 56 1968 0.65 0 30 3
## 3 M 1 41 1977 1.34 0 35 2
## 4 F 0 71 1968 2.90 0 99 3
## 5 M 1 52 1965 12.08 1 185 1
## 6 M 1 28 1971 4.84 1 204 1
I made difference in column order so I couls easily analyse.
mydata2 <- mydata %>% select(-c(1, 2, 4, 5, 6, 7))
mydata2 <- mydata[mydata$age >= 20 & mydata$status == 3, ]
I made new data set. Firstly, I deleted the variables which I won’t need, then I made the set with only age and status 3 (which means the died), so I will easier see the colaration between age of deaths from malenoma. Now I have only 14 obs. so it’s easier to analyse.
#install.packages("pastecs")
library(pastecs)
##
## Attaching package: 'pastecs'
## The following objects are masked from 'package:dplyr':
##
## first, last
summary(mydata[, -c(1, 2, 6, 7, 8)])
## age year thickness
## Min. : 4.00 Min. :1962 Min. : 0.10
## 1st Qu.:42.00 1st Qu.:1968 1st Qu.: 0.97
## Median :54.00 Median :1970 Median : 1.94
## Mean :52.46 Mean :1970 Mean : 2.92
## 3rd Qu.:65.00 3rd Qu.:1972 3rd Qu.: 3.56
## Max. :95.00 Max. :1977 Max. :17.42
I wanted observe only this 3 specific variables variables.The youngest patient was 4 years old, the oldest 95. Avarage age of patient is 52,46. Because the median was 54, half of the patients were older than that and half was younger. We can see that all the operations happened between year 1962 and 1977. The smallest tumor thickness was 0,10mm and the biggest 17,42mm.
summary(mydata[, c(1)])
## F M
## 126 79
We have some category variables. We can see that there was more women than men having the operation.
round(stat.desc(mydata[ , 3]))
## nbr.val nbr.null nbr.na min max range
## 205 0 0 4 95 91
## sum median mean SE.mean CI.mean.0.95 var
## 10755 54 52 1 2 278
## std.dev coef.var
## 17 0
Statistic only for age. The 95% CI for the mean of age is 2.
#install.packages("ggplot2")
library(ggplot2)
ggplot(mydata, aes(x = thickness))+
theme_bw()+
geom_histogram(binwidth = 1, colour = "hotpink", fill = "lavenderblush")+
ggtitle("Distribution of tumor thickness")+
ylab("Frequency")+
xlab("Thickness of tumor (mm)")
On the histogram we can see that the thickness is positively or right
side. Tumor above 8mm is quite rare. The most frequently the tumor is
about 2mm.
#install.packages("car")
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:boot':
##
## logit
scatterplot(y = mydata$thickness, x = mydata$age,
ylab = "Tumor thickness (mm)",
xlab = "Age",
boxplots = FALSE,
smooth = FALSE)
We can see a positive correlation between age and thickness, what makes
sense because usually the tumor gets the chance grow with age. We can
see with the regression line on the scatter plot, the slope is
positive.
library(readxl)
Business_School <- read_excel("Task 2/Business School.xlsx")
head(Business_School)
## # A tibble: 6 × 9
## `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade`
## <dbl> <chr> <dbl> <dbl>
## 1 1 Business 68.4 90.2
## 2 2 Computer Science 70.2 68.7
## 3 3 Finance 76.4 83.3
## 4 4 Business 82.6 88.7
## 5 5 Finance 76.9 75.4
## 6 6 Computer Science 83.3 82.1
## # ℹ 5 more variables: `Work Experience` <chr>, `Employability (Before)` <dbl>,
## # `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>
library(ggplot2)
ggplot(Business_School, aes(x = `Undergrad Degree`))+
geom_bar()+
labs(title = "University Degree Distribution", x = "Degree", y = "Count")
Business degree is most common.
summary(Business_School$`Annual Salary`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20000 87125 103500 109058 124000 340000
library(pastecs)
round(stat.desc(Business_School$`Annual Salary`))
## nbr.val nbr.null nbr.na min max range
## 100 0 0 20000 340000 320000
## sum median mean SE.mean CI.mean.0.95 var
## 10905800 103500 109058 4150 8235 1722373475
## std.dev coef.var
## 41501 0
library(scales)
ggplot(Business_School, aes(x = `Annual Salary`))+
geom_histogram()+
scale_x_continuous(labels = comma)+
labs(title = "Annual Salary Distribution", x = "Annual Salary", y = "Frequency")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Graph is right skewed. Mediana of annual salary is 103 500 (that means
for half is higher, for half is lower), average salary is 109 058,
minimal salary is 20 000, maximum 340 000. Range is 320 000.
library(stats)
t.test(Business_School$`MBA Grade`,
mu = 74,
alternative = "two.sided")
##
## One Sample t-test
##
## data: Business_School$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
With this t-test e researched two hypothesis. H0= average MBA grade is equal to 74, Ha = average MBA Grade is not equal to 74. Because we are checking equality, the mean of MBA grade can either be higher or lower than 74 which makes this test a two tailed test. Conducting the t-test we can deduct that at a 5% level of significance (a= 5%) we can reject null hypothesis. The p value for the t-test is 0,009 or 9% which also equals to the percentage of us making the mistake and the null hypothesis being correct. Therefore we can say that the mean MBA grade is not equal to 74, it is higher. The 95% CI also shows us that we can say with 95% certainity that the population mean of MBA Grade of current MBA students lies between 74.52% and 77.56, which does not include 74.
library(readxl)
Apartments <- read_excel("Task 3/Apartments.xlsx")
head(Apartments)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description:
Apartments$BalconyF<-factor(Apartments$Balcony,
levels=c(1,0),
labels=c("Yes","No"))
Apartments$ParkingF<-factor(Apartments$Parking,
levels=c(1,0),
labels=c("Yes","No"))
head(Apartments)
## # A tibble: 6 × 7
## Age Distance Price Parking Balcony BalconyF ParkingF
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
## 1 7 28 1640 0 1 Yes No
## 2 18 1 2800 1 0 No Yes
## 3 7 28 1660 0 0 No No
## 4 28 29 1850 0 1 Yes No
## 5 18 18 1640 1 1 Yes Yes
## 6 28 12 1770 0 1 Yes No
t.test(Apartments$Price,
mu=1900,
alternative="two.sided")
##
## One Sample t-test
##
## data: Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
H0 : mu = 1900 Ha : mu is not = 1900 We have to check the p-value. Here it is very low, so we have to reject the hypothesis.The 95% CI for the apartment price is between 1937.443 and 2100.440, which means that we can say with 95% certainty that the true population means lies between those 2 values. Also, this interval does not include 1900, so we can again reject the null hypothesis.
scatterplot(y = Apartments$Price,
x = Apartments$Age,
ylab = "Price",
xlab = "Age",
smooth = FALSE)
fit1 <- lm(Price ~ Age, data = Apartments)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
cor(Apartments$Price,Apartments$Age)
## [1] -0.230255
We can see that the p-value for both coeffcients is again under the 0.05 so we reject the hypothesis at each. The regression coeffient for age tells us, if age increases by 1. the price will on average decrease by 8.975, if averything else stay equal. Coefficient of determination is 0.05302, so the multiple r squared statistics tells us that age explains only around 5% of variability of price. The square root of R-squared is the corellation coefficient is -0.23, that means there is a linear relationship between price and age.
library(car)
scatterplotMatrix(Apartments[ , c(-4,-5,-6, -7)],
smooth = FALSE)
data2 <- lm(Price ~ Age + Distance,
data = Apartments)
vif(data2)
## Age Distance
## 1.001845 1.001845
When analysing scatter plot we can deduct that there is a potentinal problem with multycolinearity, by looking at the regression line on the scatter plot between the explanatory variables, in this case Age and Distance. In our case there is not problem, because we can see that the slope of the line is close to 0 therefore meaning that there is no significant correlation between the two variables.
fit2 <- lm(Price ~ Age + Distance,
data = Apartments)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
vif(fit2)
## Age Distance
## 1.001845 1.001845
The VIF statistic checks for multicolinearity between the explanatory variables. In our case the explanatury variables are both less than 5 and the mean is close to 1 which tells us there is no problem with multicolinearity.
Apartments$StdResid <- round(rstandard(fit2), 3)
Apartments$CooksD <- round(cooks.distance(fit2), 3)
hist(Apartments$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
Apartments$StdResid <- round(rstandard(fit2), 3)
head(Apartments[order(-Apartments$StdResid), ])
## # A tibble: 6 × 9
## Age Distance Price Parking Balcony BalconyF ParkingF StdResid CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 5 45 2180 1 1 Yes Yes 2.58 0.32
## 2 2 11 2790 1 0 No Yes 2.05 0.069
## 3 18 1 2800 1 0 No Yes 1.78 0.03
## 4 18 1 2800 1 1 Yes Yes 1.78 0.03
## 5 8 2 2820 1 0 No Yes 1.66 0.037
## 6 10 1 2810 0 0 No No 1.60 0.032
Apartments$CooksD <- round(cooks.distance(fit2), 3)
head(Apartments[order(-Apartments$CooksD), ])
## # A tibble: 6 × 9
## Age Distance Price Parking Balcony BalconyF ParkingF StdResid CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 5 45 2180 1 1 Yes Yes 2.58 0.32
## 2 43 37 1740 0 0 No No 1.44 0.104
## 3 2 11 2790 1 0 No Yes 2.05 0.069
## 4 7 2 1760 0 1 Yes No -2.15 0.066
## 5 37 3 2540 1 1 Yes Yes 1.58 0.061
## 6 40 2 2400 0 1 Yes No 1.09 0.038
tail(Apartments[order(-Apartments$StdResid), ])
## # A tibble: 6 × 9
## Age Distance Price Parking Balcony BalconyF ParkingF StdResid CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 24 5 1830 1 0 No Yes -1.19 0.012
## 2 14 16 1660 0 1 Yes No -1.26 0.008
## 3 13 8 1800 0 0 No No -1.38 0.012
## 4 12 14 1650 0 1 Yes No -1.50 0.013
## 5 12 14 1650 0 0 No No -1.50 0.013
## 6 7 2 1760 0 1 Yes No -2.15 0.066
hist(Apartments$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
If we look at the standardized residuals we can see that none are above 3 or below -3, which would automatically categorize them as outlines.However if we look at the cooks distance we can see that one observation has a significantly higher cooks distance than all others, so we can decide to remove it from data set.
Apartments <- Apartments[-38, ]
head(Apartments[order(-Apartments$CooksD),], 6)
## # A tibble: 6 × 9
## Age Distance Price Parking Balcony BalconyF ParkingF StdResid CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 43 37 1740 0 0 No No 1.44 0.104
## 2 2 11 2790 1 0 No Yes 2.05 0.069
## 3 7 2 1760 0 1 Yes No -2.15 0.066
## 4 37 3 2540 1 1 Yes Yes 1.58 0.061
## 5 40 2 2400 0 1 Yes No 1.09 0.038
## 6 8 2 2820 1 0 No Yes 1.66 0.037
fit2$StdFitted <- scale(fit2$fitted.values)
library(car)
scatterplot(y = fit2$StdResid, x = fit2$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
Heteroskedasticity occurs when the variance is not constant. On this scatter plot we can see that dots are relatively constant, so there is no heteroskedasticity.
hist(Apartments$StdResid,
prob = TRUE,
xlab = "Standardized residuals",
ylab = "Density",
main = "Histogram of standardized residuals")
curve(dnorm(x, mean = mean(Apartments$StdResid), sd = sd(Apartments$StdResid)),
add = TRUE,
lwd = 2,
col = "red")
shapiro.test(Apartments$StdResid)
##
## Shapiro-Wilk normality test
##
## data: Apartments$StdResid
## W = 0.94879, p-value = 0.002187
Because of the shape of the distribution, we see that is not normal.
fit2 <- lm(Price ~ Age + Distance,
data = Apartments)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -604.92 -229.63 -56.49 192.97 599.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2456.076 73.931 33.221 < 2e-16 ***
## Age -6.464 3.159 -2.046 0.044 *
## Distance -22.955 2.786 -8.240 2.52e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared: 0.4838, Adjusted R-squared: 0.4711
## F-statistic: 37.96 on 2 and 81 DF, p-value: 2.339e-12
Both age and distance have a negative correlation with a price of apartment. One year older apartment is on average 6,4 units cheaper than one year younger. And apartment with one additional unit of distance is on average 22,9 units cheaper than the one closer. R squared is 0,48 which means 48% of variability of price is explained by age and distance. F-statistic shows that this model is good and explains at least some of the variability.
sqrt(summary(fit2)$r.squared)
## [1] 0.6955609
The square root of R-squared is the coefficient of correlation and it tells us the absolute correlation between the observed variables and the predicted variables.
fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF,
data = Apartments)
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 81 6176767
## 2 79 5654480 2 522287 3.6485 0.03051 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see that at a 5% level of significance we can conclude that the fit3 regression fits data better than fit2. This is because if the p value is lower than 0.05.
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -473.21 -192.37 -28.89 204.17 558.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2482.048 79.263 31.314 < 2e-16 ***
## Age -5.821 3.074 -1.894 0.06190 .
## Distance -20.279 2.886 -7.026 6.66e-10 ***
## ParkingFNo -167.531 62.864 -2.665 0.00933 **
## BalconyFNo 15.207 59.201 0.257 0.79795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared: 0.5275, Adjusted R-squared: 0.5035
## F-statistic: 22.04 on 4 and 79 DF, p-value: 3.018e-12
On average an apartment with parking is 167.5 units more expensive. Apartment having balcony is statistically irrelevant as p-value is greater than 0.05. F-statistic shows that we can reject null hypotheses so this model is good and explains at least some of the variability.
Apartments$Fitted <- fitted.values(fit3)
Apartments$Residuals <- residuals(fit3)
head(Apartments[ ,colnames(Apartments)%in% c("Price", "Fitted", "Residuals")])
## # A tibble: 6 × 3
## Price Fitted Residuals
## <dbl> <dbl> <dbl>
## 1 1640 1706. -66.0
## 2 2800 2372. 428.
## 3 1660 1721. -61.2
## 4 1850 1563. 287.
## 5 1640 2012. -372.
## 6 1770 1908. -138.
The residual for apartment ID2 is equal to 427.8.