data(package = .packages(all.available = TRUE))
Air_Quality<-force(airquality)
head(Air_Quality)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
I am using the dataset containing the details of daily air quality measurements in New York, May to September 1973.
This dataset contains 153 observations on 6 variables:
Air_Quality$Month <- factor(Air_Quality$Month,
levels = c (5,6,7,8,9),
labels = c ("May","Jun","Jul","Aug","Sep"))
Air_Quality$Day <- factor(Air_Quality$Day,
levels = c (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31),
labels = c (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31))
head(Air_Quality)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 May 1
## 2 36 118 8.0 72 May 2
## 3 12 149 12.6 74 May 3
## 4 18 313 11.5 62 May 4
## 5 NA NA 14.3 56 May 5
## 6 28 NA 14.9 66 May 6
# removing units with missing data
library(tidyr)
Air_Quality_clean <- drop_na(Air_Quality)
# renaming variable
colnames(Air_Quality_clean)[4]<- "Temp_in_F"
# create new variable
Air_Quality_clean$Solar_Temp_Ratio <- Air_Quality_clean$Solar.R / Air_Quality_clean$Temp_in_F
# create new data.frame
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Air_Quality_clean_July <- Air_Quality_clean %>%
filter(Month == "Jul")
head(Air_Quality_clean_July)
## Ozone Solar.R Wind Temp_in_F Month Day Solar_Temp_Ratio
## 1 135 269 4.1 84 Jul 1 3.202381
## 2 49 248 9.2 85 Jul 2 2.917647
## 3 32 236 9.2 81 Jul 3 2.913580
## 4 64 175 4.6 83 Jul 5 2.108434
## 5 40 314 10.9 83 Jul 6 3.783133
## 6 77 276 5.1 88 Jul 7 3.136364
library(pastecs)
##
## Attaching package: 'pastecs'
## The following objects are masked from 'package:dplyr':
##
## first, last
## The following object is masked from 'package:tidyr':
##
## extract
round(stat.desc(Air_Quality_clean[ , c (-5, -6) ]), 2)
## Ozone Solar.R Wind Temp_in_F Solar_Temp_Ratio
## nbr.val 111.00 111.00 111.00 111.00 111.00
## nbr.null 0.00 0.00 0.00 0.00 0.00
## nbr.na 0.00 0.00 0.00 0.00 0.00
## min 1.00 7.00 2.30 57.00 0.09
## max 168.00 334.00 20.70 97.00 5.22
## range 167.00 327.00 18.40 40.00 5.12
## sum 4673.00 20513.00 1103.30 8635.00 262.68
## median 31.00 207.00 9.70 79.00 2.48
## mean 42.10 184.80 9.94 77.79 2.37
## SE.mean 3.16 8.65 0.34 0.90 0.12
## CI.mean.0.95 6.26 17.15 0.67 1.79 0.23
## var 1107.29 8308.74 12.66 90.82 1.47
## std.dev 33.28 91.15 3.56 9.53 1.21
## coef.var 0.79 0.49 0.36 0.12 0.51
mean(Air_Quality_clean$Wind)
## [1] 9.93964
The average wind speed measured in May, June, July, August and September in New York in 1973 was 9.94 mph.
median(Air_Quality_clean$Temp_in_F)
## [1] 79
The median temperature was 79 degrees F - 50% of the days were warmer than 79 degrees F and 50% were colder.
min(Air_Quality_clean_July$Temp_in_F)
## [1] 73
max(Air_Quality_clean_July$Temp_in_F)
## [1] 92
The lowest temperature measured in July was 73 degrees F. The highest was 92 degrees F.
library(ggplot2)
ggplot(Air_Quality_clean, aes(x = Month)) +
geom_bar() +
ylab("Frequency") +
xlab("Month")
After removing the days that had some missing data, I have decided to check how many days had complete data and were included in the clean data.The graph shows that the least represented month in the clean data is June. Most values are included from September.
hist(Air_Quality_clean$Ozone, main="Histogram of Ozone Levels", xlab="Ozone (ppb)")
In above histogram I wanted to see how the ozone was distributed in the measurements. We can see that the ozone level distribution is skewed to the right. On most days the ozone levels were up to 50 ppb.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
scatterplot(y = Air_Quality_clean$Solar.R,
x = Air_Quality_clean$Temp_in_F,
ylab = "Solar radiation",
xlab = "Temperature",
smooth = FALSE,
boxplots = FALSE)
With this scatter plot I wanted to check if there is any correlation between the measured temperature and the solar radiation. I assumed that on the hotter days the solar radiation would also be higher (essentially, that the temperature would be higher on sunny days). From the graph, I can see that there seems to be a positive correlation between both variables like I expected.
ggplot(Air_Quality_clean, aes(x = Month, y = Temp_in_F)) +
geom_boxplot() +
labs(title = "Temperature by Month",
x = "Month",
y = "Temperature in F")
With the box plots I tried to check how the temperature varied across different months. As expected, the temperatures were highest in the summer months (June, July and August). We can also see that in July, the temperature were most constant.
library(readxl)
MBA <- read_xlsx("./Business School.xlsx")
MBA <- as.data.frame(MBA)
head(MBA)
## Student ID Undergrad Degree Undergrad Grade MBA Grade Work Experience
## 1 1 Business 68.4 90.2 No
## 2 2 Computer Science 70.2 68.7 Yes
## 3 3 Finance 76.4 83.3 No
## 4 4 Business 82.6 88.7 No
## 5 5 Finance 76.9 75.4 No
## 6 6 Computer Science 83.3 82.1 No
## Employability (Before) Employability (After) Status Annual Salary
## 1 252 276 Placed 111000
## 2 101 119 Placed 107000
## 3 401 462 Placed 109000
## 4 287 342 Placed 148000
## 5 275 347 Placed 255500
## 6 254 313 Placed 103500
colnames(MBA)[2]<- "Undergrad_Degree"
colnames(MBA)[3]<- "Undergrad_Grade"
colnames(MBA)[4]<- "MBA_Grade"
colnames(MBA)[5]<- "Work_Experience"
colnames(MBA)[9]<- "Annual_Salary"
MBA$Undergrad_Degree <- factor(MBA$Undergrad_Degree,
levels = c("Art", "Business", "Computer Science", "Engineering", "Finance"),
labels = c("Art", "Business", "Computer Science", "Engineering", "Finance"))
library(ggplot2)
ggplot(MBA, aes(x=Undergrad_Degree)) +
geom_bar() +
ylab("Frequency") +
xlab("Undergraduate degree")
The most common undergraduate degree is a Business degree.
library(pastecs)
round(stat.desc(MBA$Annual_Salary),0)
## nbr.val nbr.null nbr.na min max range
## 100 0 0 20000 340000 320000
## sum median mean SE.mean CI.mean.0.95 var
## 10905800 103500 109058 4150 8235 1722373475
## std.dev coef.var
## 41501 0
options(scipen = 999)
library(ggplot2)
ggplot(MBA, aes(x=Annual_Salary)) +
geom_histogram(col="white", binwidth = 10000) +
ylab("Frequency") +
xlab("Annual Salary")
The distribution seems to be skewed to the right. To check the distribution we could do the Shapiro-Wilks test. There also seem to be some outliers in our distribution - we need to check if they can be removed, which could make the distribution more normal.
mean(MBA$MBA_Grade)
## [1] 76.04055
t.test(MBA$MBA_Grade,
mu = 74,
alternative = "two.sided")
##
## One Sample t-test
##
## data: MBA$MBA_Grade
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
H0: μ equals 74
H1: μ does not equal 74
Based on the p value (p=0.009 < 0.05) we reject the null hypothesis, the true mean of MBA Grade does not equal 74. Based on the 95% confidence interval, we can also see that the true mean is most likely between 74.5 and 77.6.
library(effectsize)
cohens_d(MBA$MBA_Grade, mu = 74)
## Cohen's d | 95% CI
## ------------------------
## 0.27 | [0.07, 0.46]
##
## - Deviation from a difference of 74.
The effect size is 0.27 and according to theory this is a small effect size. This means that the difference between the sample mean and tested mean (74) is 0.27 standard deviations.
library(readxl)
Apartments <- read_xlsx("./Apartments.xlsx")
Apartments <- as.data.frame(Apartments)
head(Apartments)
## Age Distance Price Parking Balcony
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description:
We have categorical values Parking and Balcony:
Apartments$Balcony <- factor(Apartments$Balcony,
levels = c (0,1),
labels = c ("No","Yes"))
Apartments$Parking <- factor(Apartments$Parking,
levels = c (0,1),
labels = c ("No", "Yes"))
head(Apartments)
## Age Distance Price Parking Balcony
## 1 7 28 1640 No Yes
## 2 18 1 2800 Yes No
## 3 7 28 1660 No No
## 4 28 29 1850 No Yes
## 5 18 18 1640 Yes Yes
## 6 28 12 1770 No Yes
calculating sample mean:
mean(Apartments$Price)
## [1] 2018.941
H0: Mu_Price = 1900 eur
H1: Mu_Price does not equal 1900 eur
t.test(Apartments$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
We can reject the null hypothesis at p=0,004731 - the true mean does not equal 1900 EUR per m2. As per the 95% confidence interval, the true mean is somewhere between 1937 and 2101 EUR per m2.
fit1 <- lm(Price ~ Age, data = Apartments)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <0.0000000000000002 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
sqrt(summary(fit1)$r.squared)
## [1] 0.230255
Regression coefficient: If the age of the apartment increases by 1 year, the price of the apartment falls on average by 8.975 EUR per m2 (p = 0.034).
Coefficient of determination: 5.3% of the variability of price is explained by the linear effect of the age of the apartment.
Coefficient of correlation: Coefficient of correlation can be calculated as the square root of R-squared. In this care we have a weak correlation. Since this is not a multiple correlation coefficient (because we are checking the correlation between only two variables), we can also see if the correlation is positive or negative.
cor(Apartments$Price, Apartments$Age)
## [1] -0.230255
So in this case we can observe a weak negative correlation between the price and the age of the apartment.
library(car)
scatterplotMatrix(Apartments[ , c(1, 2, 3)],
smooth = FALSE)
From the graphs it seems that there might a strong connection between distance and price. Since price is a dependent variable, this does not point to a problem of multicolinearity. There does not seem to be a strong correlation between the two independent variables (Distance and Age), so I am assuming that multicolinearity will not be a problem in this model. We could check this more closely with VIF statistic.
fit2 <- lm(Price ~ Age + Distance,
data = Apartments)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 0.0000000000000002 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 0.0000000000618 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 0.00000000004896
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
To say that we do not have a problem with multicolinearity, the VIF statistic for each variable should be equal or below 5 and the average VIF should be around one. The calculated results confirm that multicolinearity is not too strong in our case.
Apartments$StdResid <- round(rstandard(fit2), 3) # Calculating standardized residuals
Apartments$CooksD <- round(cooks.distance(fit2), 3) #Calculating cooks distances
hist(Apartments$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
We can see that there are no standardized errors that have a value larger than -3 or smaller than 3. So there seems to be no need to remove any of the data.
We can do the Shapiro-Wilk test to check if the standardized residuals are distributed normally.
shapiro.test(Apartments$StdResid)
##
## Shapiro-Wilk normality test
##
## data: Apartments$StdResid
## W = 0.95303, p-value = 0.003645
H0: Errors are normally distributed.
H1: Errors are not normally distributed.
Based on the results, we reject the H0 (p = 0.003645), which means that we assume the standardized errors are not distributed normally. From the theory we know, that one of the assumptions for regression analysis is that the errors are distributed normally, which is not true in our case. However, since we have a large sample (n>30) we can assume t distribution.
hist(Apartments$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
I can see that we have at least one value with high impact that I will need to remove. I’d like to see how the cooks distances look in my data so I will ask R to display the 6 values with the highest Cooks distances.
head(Apartments[order(-Apartments$CooksD),], 6)
## Age Distance Price Parking Balcony StdResid CooksD
## 38 5 45 2180 Yes Yes 2.577 0.320
## 55 43 37 1740 No No 1.445 0.104
## 33 2 11 2790 Yes No 2.051 0.069
## 53 7 2 1760 No Yes -2.152 0.066
## 22 37 3 2540 Yes Yes 1.576 0.061
## 39 40 2 2400 No Yes 1.091 0.038
I can see that at least the largest two values should be removed:
library(dplyr)
Apartments <- Apartments %>%
filter(!row_number() %in% c(38, 55))
head(Apartments[order(-Apartments$CooksD), ], 6)
## Age Distance Price Parking Balcony StdResid CooksD
## 33 2 11 2790 Yes No 2.051 0.069
## 52 7 2 1760 No Yes -2.152 0.066
## 22 37 3 2540 Yes Yes 1.576 0.061
## 38 40 2 2400 No Yes 1.091 0.038
## 56 8 2 2820 Yes No 1.655 0.037
## 25 8 26 2300 Yes Yes 1.571 0.034
Let’s check the model now with the values removed:
fit2_E <- lm(Price ~ Age + Distance,
data = Apartments)
summary(fit2_E)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -627.27 -212.96 -46.23 205.05 578.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2490.112 76.189 32.684 < 0.0000000000000002 ***
## Age -7.850 3.244 -2.420 0.0178 *
## Distance -23.945 2.826 -8.473 0.000000000000953 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared: 0.4968, Adjusted R-squared: 0.4842
## F-statistic: 39.49 on 2 and 80 DF, p-value: 0.000000000001173
I will check this for my new model, that already has values of high impact removed (fit2_E).
Apartments$StdFitted <- scale(fit2_E$fitted.values)
scatterplot(y = Apartments$StdResid, x = Apartments$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
From the scatterplot it seems that we might have a problem with heteroskedasitcity. I will check this with the Breusch Pagan test.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2_E)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## -----------------------------
## DF = 1
## Chi2 = 3.775135
## Prob > Chi2 = 0.05201969
Based on the p value (p>0.05) we cannot reject H0, meaning we assume homoskedasticity.
I will also check this for my new model:
Apartments$StdResid_f2 <- round(rstandard(fit2_E), 3)
Apartments$CooksD_f2 <- round(cooks.distance(fit2_E), 3)
hist(Apartments$StdResid_f2,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(Apartments$StdResid_f2)
##
## Shapiro-Wilk normality test
##
## data: Apartments$StdResid_f2
## W = 0.95952, p-value = 0.01044
For formal testing I used the Skapiro Wilk test.
H0: errors are distributed normally.
H1: errors are not distributed normally.
Based on the p value (p = 0.01) we reject the null hypothesis. However, because we have a large sample (n >> 30) we can assume t distribution of errors.
I have already estimated this above with the model fit2_E. A summary of my new model:
summary(fit2_E)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -627.27 -212.96 -46.23 205.05 578.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2490.112 76.189 32.684 < 0.0000000000000002 ***
## Age -7.850 3.244 -2.420 0.0178 *
## Distance -23.945 2.826 -8.473 0.000000000000953 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared: 0.4968, Adjusted R-squared: 0.4842
## F-statistic: 39.49 on 2 and 80 DF, p-value: 0.000000000001173
The regression coefficients:
Multiple R-squared = coefficient of determination:
49.68% of variability of the apartment price can be explained by the linear effect of distance and age.
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony,
data = Apartments)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -499.06 -194.33 -32.04 219.03 544.31
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2358.900 93.664 25.185 < 0.0000000000000002 ***
## Age -7.197 3.148 -2.286 0.02499 *
## Distance -21.241 2.911 -7.296 0.000000000214 ***
## ParkingYes 168.921 62.166 2.717 0.00811 **
## BalconyYes -6.985 58.745 -0.119 0.90566
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 264.5 on 78 degrees of freedom
## Multiple R-squared: 0.5408, Adjusted R-squared: 0.5173
## F-statistic: 22.97 on 4 and 78 DF, p-value: 0.000000000001449
anova(fit2_E, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 80 5982100
## 2 78 5458696 2 523404 3.7395 0.02813 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: ∆ro^2=0
H1: ∆ro^2>0
Based on the Pr(>F) = 0.028 (<0.05) we can reject the null hypothesis. We therefore say that the second model (fit3) fits the data statistically better.
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -499.06 -194.33 -32.04 219.03 544.31
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2358.900 93.664 25.185 < 0.0000000000000002 ***
## Age -7.197 3.148 -2.286 0.02499 *
## Distance -21.241 2.911 -7.296 0.000000000214 ***
## ParkingYes 168.921 62.166 2.717 0.00811 **
## BalconyYes -6.985 58.745 -0.119 0.90566
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 264.5 on 78 degrees of freedom
## Multiple R-squared: 0.5408, Adjusted R-squared: 0.5173
## F-statistic: 22.97 on 4 and 78 DF, p-value: 0.000000000001449
Given the Age, Distance and presence of Balcony in the apartment, the apartments with a parking space have on average higher price by 168.921 EUR per m2 than the apartments without the parking space (p = 0.008).
As the coefficient for balcony is not statistically significant, I will not interpret it. We cannot say that the balcony has an effect on the price of the apartment.
F statistics:
H0: ro^2 = 0
H1: ro^2 > 0
We can reject the null hypothesis (p<0.001). The coefficient of determination of the population is above 0. At least some of the variability of the dependent variable can be explained by the liner influence f the explanatory variables.
Apartments$Fitted_3 <- fitted.values(fit3)
Apartments$Residual_3 <- residuals(fit3)
Apartments[2, "Residual_3"]
## [1] 422.9572