# Task 1
library(car)
## Loading required package: carData
data(package = .packages(all.available = TRUE))
mydata <- Prestige
summary(Prestige)
## education income women prestige
## Min. : 6.380 Min. : 611 Min. : 0.000 Min. :14.80
## 1st Qu.: 8.445 1st Qu.: 4106 1st Qu.: 3.592 1st Qu.:35.23
## Median :10.540 Median : 5930 Median :13.600 Median :43.60
## Mean :10.738 Mean : 6798 Mean :28.979 Mean :46.83
## 3rd Qu.:12.648 3rd Qu.: 8187 3rd Qu.:52.203 3rd Qu.:59.27
## Max. :15.970 Max. :25879 Max. :97.510 Max. :87.20
## census type
## Min. :1113 bc :44
## 1st Qu.:3120 prof:31
## Median :5135 wc :23
## Mean :5402 NA's: 4
## 3rd Qu.:8312
## Max. :9517
head(Prestige)
## education income women prestige census type
## gov.administrators 13.11 12351 11.16 68.8 1113 prof
## general.managers 12.26 25879 4.02 69.1 1130 prof
## accountants 12.77 9271 15.70 63.4 1171 prof
## purchasing.officers 11.42 8865 9.11 56.8 1175 prof
## chemists 14.62 8403 11.68 73.5 2111 prof
## physicists 15.64 11030 5.13 77.6 2113 prof
Description of variables:
The Prestige data frame has 102 rows and 6 columns. The observations are occupations.
This data frame contains the following columns:
education: Average education of occupational incumbents, years, in 1971.
income: Average income of incumbents, dollars, in 1971.
women: Percentage of incumbents who are women.
prestige: Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s.
census: Canadian Census occupational code.
type: Type of occupation. A factor with levels (note: out of order): bc, Blue Collar; prof, Professional, Managerial, and Technical; wc, White Collar.
Prestige1 <- Prestige[, c(-5, -3)]
We deleted variables 5 (census) and 3 (women, the percentage of female incumbents) because they were irrelevant to our conclusions.
library(pastecs)
round(stat.desc(Prestige1[, -4]),2)
## education income prestige
## nbr.val 102.00 102.00 102.00
## nbr.null 0.00 0.00 0.00
## nbr.na 0.00 0.00 0.00
## min 6.38 611.00 14.80
## max 15.97 25879.00 87.20
## range 9.59 25268.00 72.40
## sum 1095.28 693386.00 4777.00
## median 10.54 5930.50 43.60
## mean 10.74 6797.90 46.83
## SE.mean 0.27 420.41 1.70
## CI.mean.0.95 0.54 833.98 3.38
## var 7.44 18027855.55 295.99
## std.dev 2.73 4245.92 17.20
## coef.var 0.25 0.62 0.37
The median income (the value above and below which 50% of the observations fall) is 5930.50 dollars; the median of years of education is 10.54, and the median prestige score is 43.60.
The arithmetic mean of education is 10.74 years, the mean income is 6797.90 dollars, and the mean prestige score is 46.83.
The coefficient of variation is highest for income (0.62), meaning income has the largest relative dispersion around its mean. It is followed by prestige (0.37) and education (0.25).
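As a quick cross-check of the stat.desc() output (a supplementary sketch, not part of the original analysis), the coefficient of variation can be computed directly as the standard deviation divided by the arithmetic mean:
sapply(Prestige1[, c("education", "income", "prestige")],
       function(x) sd(x) / mean(x))  # should reproduce coef.var: 0.25, 0.62, 0.37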
library(car)
scatterplot(y = Prestige1$income,
            x = Prestige1$prestige,
            ylab = "Average income",
            xlab = "Prestige score for occupation",
            smooth = FALSE)
scatterplot(y = Prestige1$income,
            x = Prestige1$education,
            ylab = "Average income",
            xlab = "Years of education",
            smooth = FALSE)
By examining the scatterplots we can conclude the following:
Scatterplot 1 (prestige score vs. income): income is positively correlated with the prestige score; in other words, the higher the prestige score, the higher the income.
Scatterplot 2 (years of education vs. income): income is also positively correlated with years of education; the more years of education, the higher the income.
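The visual impressions can be quantified (a supplementary check, not in the original output) with Pearson correlation coefficients; both should come out positive, matching the upward-sloping point clouds:
cor(Prestige1$income, Prestige1$prestige)   # association behind scatterplot 1
cor(Prestige1$income, Prestige1$education)  # association behind scatterplot 2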
# Task 2
mydata <- read.table("~/IMB/BootcampR/Takehomeexam/R Take Home Exam/Task 2/Body mass.csv", header=TRUE, sep=";", dec=",")
head(mydata)
## ID Mass
## 1 1 62.1
## 2 2 64.5
## 3 3 56.5
## 4 4 53.4
## 5 5 61.3
## 6 6 62.2
library(pastecs)
round(stat.desc(mydata),2)
## ID Mass
## nbr.val 50.00 50.00
## nbr.null 0.00 0.00
## nbr.na 0.00 0.00
## min 1.00 49.70
## max 50.00 83.20
## range 49.00 33.50
## sum 1275.00 3143.80
## median 25.50 62.80
## mean 25.50 62.88
## SE.mean 2.06 0.85
## CI.mean.0.95 4.14 1.71
## var 212.50 36.14
## std.dev 14.58 6.01
## coef.var 0.57 0.10
Description: ID is the identifier of each 9th grader; Mass is the body mass of that 9th grader. (The descriptive statistics for ID are not meaningful, since ID is only an identifier.)
hist(mydata$Mass,
main = "Distribution of the variable Mass",
ylab = "Frequency",
xlab = "Mass",
breaks = seq(40, 90, by =5 ),
right = FALSE)
The histogram shows the frequency distribution of body mass across the 9th graders.
library(ggplot2)
ggplot(NULL, aes(c(-4, 4))) +
  geom_line(stat = "function", fun = dt, args = list(df = 49)) +
  ylab("Density") +
  xlab("Sample estimates") +
  labs(title = "Distribution of sample estimates")
qt(p = 0.025, df = 49, lower.tail = FALSE)
## [1] 2.009575
qt(p = 0.025, df = 49, lower.tail = TRUE)
## [1] -2.009575
t.test(mydata$Mass,
mu = 59.5,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata$Mass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
## 61.16758 64.58442
## sample estimates:
## mean of x
## 62.876
H0: μ = 59.5, H1: μ ≠ 59.5
We can reject the null hypothesis for two reasons: (1) the t statistic, 3.9711, exceeds the critical value 2.009575; (2) the hypothesized mean of 59.5 lies outside the 95% confidence interval (61.17, 64.58), whose centre is the sample mean of 62.876.
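As a supplementary check (not part of the original output), the t statistic and the confidence interval can be reproduced by hand from t = (x̄ − μ₀) / (s/√n):
xbar <- mean(mydata$Mass)
s <- sd(mydata$Mass)
n <- length(mydata$Mass)
(xbar - 59.5) / (s / sqrt(n))                          # should match t = 3.9711
xbar + c(-1, 1) * qt(0.975, df = n - 1) * s / sqrt(n)  # should match the 95% CI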
# Task 3
#### Import the dataset Apartments.xlsx
#install.packages("readxl")
library(readxl)
Apartments <- read_excel("~/IMB/BootcampR/Takehomeexam/R Take Home Exam/Task 3/Apartments.xlsx")
Description: each row is an apartment, with its Price (in €), its Age, its Distance, and the 0/1 indicators Balcony and Parking, which are recoded into factors below.
Apartments$Balconyf <- factor(Apartments$Balcony, levels = c("0", "1"), labels = c("No", "Yes"))
Apartments$Parkingf <- factor(Apartments$Parking, levels = c("0", "1"), labels = c("No", "Yes"))
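A quick sanity check of the recoding (supplementary, not in the original output) is to tabulate the new factors:
table(Apartments$Balconyf, useNA = "ifany")  # counts of No/Yes plus any NAs
table(Apartments$Parkingf, useNA = "ifany")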
Apartments1 <- Apartments
t.test(Apartments$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
We can reject the null hypothesis and conclude, at the 95% confidence level, that the population mean price is not equal to 1900.
#### Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of the regression coefficient, the coefficient of correlation and the coefficient of determination.
fit1 <- lm(formula = Price ~ Age, data = Apartments1)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = Apartments1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
cor(Apartments1$Price, Apartments1$Age)
## [1] -0.230255
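Interpretation: each additional year of Age lowers the expected Price by about 8.98 € (b = -8.975, p = 0.034). The correlation coefficient of -0.23 indicates a weak negative linear association, and the coefficient of determination (R-squared = 0.053) means Age explains only about 5.3% of the variation in Price. As a supplementary check, in simple regression R-squared equals the squared correlation coefficient:
cor(Apartments1$Price, Apartments1$Age)^2  # approx. 0.053, matching Multiple R-squared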
scatterplotMatrix(Apartments[,c(1,2,3)],
smooth = FALSE)
The scatterplot matrix shows no strong linear relationship between the explanatory variables (Age and Distance), so we can assume that multicollinearity is not present.
fit2 <- lm(formula = Price ~ Age + Distance, data = Apartments)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
vif(fit2)
## Age Distance
## 1.001845 1.001845
Both VIF statistics are well below 5, so we can conclude that there is no multicollinearity.
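For reference, VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing the j-th explanatory variable on the remaining ones; a minimal hand computation (supplementary, not in the original output):
r2_age <- summary(lm(Age ~ Distance, data = Apartments))$r.squared
1 / (1 - r2_age)  # should reproduce vif(fit2)["Age"] = 1.001845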
#install.packages("olsrr")
Apartments$StdResid <- round(rstandard(fit2), 3)
Apartments$CooksD <- round(cooks.distance(fit2), 3)
hist(Apartments$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(Apartments$StdResid)
##
## Shapiro-Wilk normality test
##
## data: Apartments$StdResid
## W = 0.95303, p-value = 0.003645
hist(Apartments$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
The Shapiro-Wilk p-value is below 5%, so the standardized residuals are not normally distributed; the histogram of Cook's distances also reveals influential observations that could skew the regression conclusions. To obtain reliable conclusions, we remove the observations whose Cook's distance exceeds the 4/N rule of thumb.
excluded_cases <- which(Apartments$CooksD > (4 / 85))  # 4/N rule of thumb, N = 85
Apartments1 <- Apartments[-excluded_cases, ]           # removes rows 22, 33, 38, 53, 55
head(Apartments[order(Apartments$StdResid),], 3)
## # A tibble: 3 × 9
## Age Distance Price Parking Balcony Balconyf Parkingf StdResid CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 7 2 1760 0 1 Yes No -2.15 0.066
## 2 12 14 1650 0 1 Yes No -1.50 0.013
## 3 12 14 1650 0 0 No No -1.50 0.013
head(Apartments[order(-Apartments$CooksD),], 6)
## # A tibble: 6 × 9
## Age Distance Price Parking Balcony Balconyf Parkingf StdResid CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 24 5 1830 1 0 No Yes -1.19 0.012
## 2 7 2 1760 0 1 Yes No -2.15 0.066
## 3 18 4 2110 1 1 Yes Yes -0.44 0.001
## 4 22 16 1710 0 1 Yes No -0.861 0.003
## 5 18 1 2800 1 0 No Yes 1.78 0.03
## 6 1 5 2420 1 1 Yes Yes 0.256 0.001
I removed the outliers, i.e. the cases with a Cook's distance greater than 4/N, and will rerun the regression to obtain unbiased estimates.
fit2 <- lm(formula = Price ~ Age + Distance, data = Apartments1)
Apartments1$StdResid <- round(rstandard(fit2), 3)
Apartments1$CooksD <- round(cooks.distance(fit2), 3)
hist(Apartments1$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
hist(Apartments1$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
The histograms show that no extreme outliers remain, so we can proceed to rerun the Shapiro-Wilk test.
shapiro.test(Apartments1$StdResid)
##
## Shapiro-Wilk normality test
##
## data: Apartments1$StdResid
## W = 0.94154, p-value = 0.001166
head(Apartments1[order(Apartments1$StdResid),], 3)
## # A tibble: 3 × 9
## Age Distance Price Parking Balcony Balconyf Parkingf StdResid CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 12 14 1650 0 1 Yes No -1.62 0.017
## 2 12 14 1650 0 0 No No -1.62 0.017
## 3 13 8 1800 0 0 No No -1.56 0.017
head(Apartments1[order(-Apartments1$CooksD),], 6)
## # A tibble: 6 × 9
## Age Distance Price Parking Balcony Balconyf Parkingf StdResid CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 45 21 1910 0 1 Yes No 1.26 0.077
## 2 8 26 2300 1 1 Yes Yes 1.96 0.064
## 3 40 2 2400 0 1 Yes No 1.20 0.055
## 4 8 2 2820 1 0 No Yes 1.73 0.046
## 5 10 1 2810 0 0 No No 1.66 0.04
## 6 18 1 2800 1 0 No Yes 1.89 0.038
Apartments1$StdfittedValues <- scale(fit2$fitted.values)
library(car)
scatterplot(y = Apartments1$StdResid, x = Apartments1$StdfittedValues,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)
We checked for heteroskedasticity with a scatterplot of standardized residuals against standardized fitted values. The variance of the residuals does not appear to grow with the fitted values, so the assumption of homoskedasticity is not violated.
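A more formal check (optional, not part of the original analysis) is the Breusch-Pagan test from the lmtest package, which tests H0: constant error variance:
# install.packages("lmtest")
library(lmtest)
bptest(fit2)  # a large p-value would support the visual conclusion above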
hist(Apartments1$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")
shapiro.test(Apartments1$StdResid)
##
## Shapiro-Wilk normality test
##
## data: Apartments1$StdResid
## W = 0.94154, p-value = 0.001166
H0: the standardized residuals are normally distributed. H1: the standardized residuals are not normally distributed.
The p-value (0.001166) is less than α = 5%, so we reject H0 and accept H1.
The assumption of normally distributed errors is violated, so inference based on the t-distribution may not be reliable.
fit2 <- lm(formula = Price ~ Age + Distance, data = Apartments)
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
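The mean VIF is essentially 1; values close to 1 indicate that the explanatory variables are nearly uncorrelated, which again confirms that multicollinearity is not a concern in this model.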
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
fit3 <- lm(formula = Price ~ Age + Distance + Balconyf + Parkingf,
           data = Apartments)
fit3
##
## Call:
## lm(formula = Price ~ Age + Distance + Balconyf + Parkingf, data = Apartments)
##
## Coefficients:
## (Intercept) Age Distance BalconyfYes ParkingfYes
## 2301.667 -6.799 -18.045 1.935 196.168
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Balconyf + Parkingf
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 82 6720983
## 2 80 5991088 2 729894 4.8732 0.01007 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
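The partial F-test comparing fit2 and fit3 yields F = 4.8732 with p = 0.01007 < 5%, so Balconyf and Parkingf jointly add explanatory power and the extended model fit3 is preferred.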
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Balconyf + Parkingf, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## BalconyfYes 1.935 60.014 0.032 0.97436
## ParkingfYes 196.168 62.868 3.120 0.00251 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
Parking: given Age, Distance and Balconyf, apartments with a parking space are on average about 196 € more expensive; the p-value of 0.0025 is below α = 5%, so the effect is significant. Balcony: p = 0.97 > α = 5%, so we cannot reject the null hypothesis that having a balcony has no effect on the price.
F-test: H0: ρ² = 0, H1: ρ² > 0.
The p-value is below 0.001, so we reject H0: there is a linear relationship between Price and at least one explanatory variable, and the model as a whole is significant (R² = 0.5004).
Apartments$Fitted <- fitted(fit2)
Apartments$FittedRes <- Apartments$Price - Apartments$Fitted  # residual = observed - fitted
print(Apartments1[2,])
## # A tibble: 1 × 10
## Age Distance Price Parking Balcony Balconyf Parkingf StdResid CooksD Stdfi…¹
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl> <dbl>
## 1 18 1 2800 1 0 No Yes 1.89 0.038 1.16
## # … with abbreviated variable name ¹StdfittedValues[,1]