data(iris)
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Explanation: Showing the structure of the data set.
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
Explanation: The average lenght of a sepal of an iris flower was 5.843 cm, while the minimum width of a petal of an iris flower was 0.100 cm.
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
colnames(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
These are the variables that we have in our data.
dim(iris)
## [1] 150 5
In this data we have 150 rows and 5 columns.
iris$Petal.Area <- iris$Petal.Length * iris$Petal.Width
colnames(iris)[1] <- "Sepal_Length"
iris_filtered <- subset(iris, Petal.Length > 1.5)
head(iris)
## Sepal_Length Sepal.Width Petal.Length Petal.Width Species Petal.Area
## 1 5.1 3.5 1.4 0.2 setosa 0.28
## 2 4.9 3.0 1.4 0.2 setosa 0.28
## 3 4.7 3.2 1.3 0.2 setosa 0.26
## 4 4.6 3.1 1.5 0.2 setosa 0.30
## 5 5.0 3.6 1.4 0.2 setosa 0.28
## 6 5.4 3.9 1.7 0.4 setosa 0.68
summary(iris)
## Sepal_Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species Petal.Area
## setosa :50 Min. : 0.110
## versicolor:50 1st Qu.: 0.420
## virginica :50 Median : 5.615
## Mean : 5.794
## 3rd Qu.: 9.690
## Max. :15.870
mean(iris$Sepal_Length)
## [1] 5.843333
median(iris$Sepal_Length)
## [1] 5.8
sd(iris$Sepal_Length)
## [1] 0.8280661
Explanation: On average, the sepal lenght across 150 iris flowers from the data is 5.84 cm. The median is 5.80 cm which means that half of the flowers have a sepal smaller than 5.80 cm, while the other half have a larger sepal lenght.The sepal lenght of iris flowers vary on average +-0.83 cm from the mean.
hist(iris$Sepal_Length, main="Histogram of Sepal Length", xlab="Sepal Length",
col = "violet",
border = "deeppink4")
Explanation: From the histogram we can see that the most frequent were
iris flowers with sepal leghts between 6.0 and 6.5 cm, and that the
least frequent sepal lenght is between 4.0 and 4.5 cm.
plot(iris$Sepal_Length, iris$Petal.Length, main="Sepal vs Petal Length",
xlab="Sepal Length", ylab="Petal Length", col=iris$Species)
boxplot(Sepal_Length ~ Species, data=iris, main="Sepal Length by Species",
xlab="Species", ylab="Sepal Length")
library(readxl)
file_path <- "./Business School.xlsx"
mba_data <- read_excel("./Business School.xlsx")
head(mba_data)
## # A tibble: 6 × 9
## `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade`
## <dbl> <chr> <dbl> <dbl>
## 1 1 Business 68.4 90.2
## 2 2 Computer Science 70.2 68.7
## 3 3 Finance 76.4 83.3
## 4 4 Business 82.6 88.7
## 5 5 Finance 76.9 75.4
## 6 6 Computer Science 83.3 82.1
## # ℹ 5 more variables: `Work Experience` <chr>, `Employability (Before)` <dbl>,
## # `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>
colnames("mba_data")
## NULL
library(ggplot2)
ggplot(mba_data, aes(x=`Undergrad Degree`)) +
geom_bar(colour = "yellowgreen", fill = "skyblue") +
ylab("Frequency")
Explanation: The most common degree is “Business”.
# Summary statistics for Annual Salary
summary(mba_data$"Annual Salary")
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20000 87125 103500 109058 124000 340000
mean(mba_data$"Annual Salary")
## [1] 109058
median(mba_data$"Annual Salary")
## [1] 103500
sd(mba_data$"Annual Salary")
## [1] 41501.49
ggplot(mba_data, aes(x = `Annual Salary` / 1000)) +
geom_histogram(binwidth = 5, fill = "darkgoldenrod2", color = "forestgreen") +
labs(title = "Distribution of Annual Salary (in Thousands)",
x = "Annual Salary (in Thousands)",
y = "Count") +
scale_x_continuous(labels = scales::comma)
Explanation: From the graph we can see that the most common annual salary for MBA students is 105,000 EUR.
t_test_result <- t.test(mba_data$`MBA Grade`, mu = 74)
print(t_test_result)
##
## One Sample t-test
##
## data: mba_data$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
library(readxl)
file_path <- "./Apartments.xlsx"
apartment_data <- read_excel("./Apartments.xlsx")
Description:
apartment_data$Parking <- as.factor(apartment_data$Parking)
apartment_data$Balcony <- as.factor(apartment_data$Balcony)
str(apartment_data)
## tibble [85 × 5] (S3: tbl_df/tbl/data.frame)
## $ Age : num [1:85] 7 18 7 28 18 28 14 18 22 25 ...
## $ Distance: num [1:85] 28 1 28 29 18 12 20 6 7 2 ...
## $ Price : num [1:85] 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
## $ Parking : Factor w/ 2 levels "0","1": 1 2 1 1 2 1 1 2 2 2 ...
## $ Balcony : Factor w/ 2 levels "0","1": 2 1 1 2 2 2 2 2 1 1 ...
t_test_result <- t.test(apartment_data$Price, mu = 1900)
print(t_test_result)
##
## One Sample t-test
##
## data: apartment_data$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
Conclusion: We reject the null hypothesis because the p value is less than 0.05. The mean price of the apartments is significantly different from 1900 EUR. The t-value being positive suggests that the mean price of the apartments is higher than 1900 EUR.
fit1 <- lm(Price ~ Age, data = apartment_data)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = apartment_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
cor(apartment_data$Age, apartment_data$Price)
## [1] -0.230255
From the results we can read that the age of apartments effects it’s price negatively, the price goes down (Estimate of Age: -8.975). As the age increases, the price decreases. We reject the null hypothesis that age of apartments has no effect on the price (p-value: 0.034). Only 5.3% of the variation in price can be explained by the age of the apartment. Age alone is not a very strong predictor of price.
options(repos = c(CRAN = "https://cloud.r-project.org/"))
install.packages("GGally")
##
## The downloaded binary packages are in
## /var/folders/0b/8vj65c4j39v9_bn7pds5v6sr0000gn/T//RtmpVHwTp2/downloaded_packages
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(apartment_data[, c("Price", "Age", "Distance")])
Multicolinearity is not a potential problem because the it is between 0.7 and -0.7.
fit2 <- lm(Price ~ Age + Distance, data = apartment_data)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = apartment_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
install.packages("car")
##
## The downloaded binary packages are in
## /var/folders/0b/8vj65c4j39v9_bn7pds5v6sr0000gn/T//RtmpVHwTp2/downloaded_packages
library(car)
## Loading required package: carData
vif_values <- vif(fit2)
print(vif_values)
## Age Distance
## 1.001845 1.001845
There is no serious multicollinearity problem.
standardized_residuals <- rstandard(fit2)
cooks_distances <- cooks.distance(fit2)
diagnostics <- data.frame(
Standardized_Residuals = standardized_residuals,
Cooks_Distance = cooks_distances
)
print(diagnostics)
## Standardized_Residuals Cooks_Distance
## 1 -0.66534868 7.386569e-03
## 2 1.78328759 3.036543e-02
## 3 -0.59376287 5.882612e-03
## 4 0.75437937 8.299153e-03
## 5 -1.07339872 5.112584e-03
## 6 -0.77751899 4.900891e-03
## 7 -0.30197426 5.481629e-04
## 8 -0.78702618 3.782998e-03
## 9 0.45519754 1.288366e-03
## 10 1.24081605 1.671333e-02
## 11 1.55056437 2.012202e-02
## 12 -0.35348493 1.052317e-03
## 13 -1.49938655 1.311979e-02
## 14 -0.46873367 1.596111e-03
## 15 0.29161240 8.077040e-04
## 16 -0.99974519 6.450802e-03
## 17 -0.64518999 4.466658e-03
## 18 0.57708256 3.352797e-03
## 19 -0.81224591 1.016915e-02
## 20 -1.38103755 1.226489e-02
## 21 -0.37096388 1.514554e-03
## 22 1.57598216 6.086868e-02
## 23 -0.28302175 5.473966e-04
## 24 -0.44000857 1.410045e-03
## 25 1.57098451 3.412790e-02
## 26 0.62752775 3.800650e-03
## 27 1.55037406 2.341203e-02
## 28 -1.07148455 5.406313e-03
## 29 -0.51613952 2.634792e-03
## 30 0.25601392 1.325713e-03
## 31 0.88866941 3.026431e-02
## 32 0.45424508 1.726921e-03
## 33 2.05058591 6.913379e-02
## 34 -0.66346701 5.158092e-03
## 35 -1.26104001 7.942022e-03
## 36 -1.18908643 1.150838e-02
## 37 -1.00495060 7.355297e-03
## 38 2.57677232 3.197306e-01
## 39 1.09117576 3.750987e-02
## 40 0.27869313 2.343477e-03
## 41 -1.05183308 4.572751e-03
## 42 -0.10301522 1.181142e-04
## 43 0.25167568 1.363230e-03
## 44 0.11668553 1.908988e-04
## 45 -0.32874383 2.843472e-03
## 46 1.47782998 1.908942e-02
## 47 -0.05286709 3.387028e-05
## 48 -1.00856647 2.169110e-02
## 49 0.74691175 4.469820e-03
## 50 -0.68260290 2.798905e-03
## 51 0.39348196 3.794891e-03
## 52 -0.86117571 3.384716e-03
## 53 -2.15178743 6.625775e-02
## 54 -1.10136340 1.199531e-02
## 55 1.44476770 1.042044e-01
## 56 -0.32780310 4.879320e-04
## 57 1.60066793 3.199717e-02
## 58 1.65534294 3.654413e-02
## 59 -0.55038426 1.237107e-03
## 60 0.33337375 1.203554e-03
## 61 1.78328759 3.036543e-02
## 62 -0.59376287 5.882612e-03
## 63 0.75437937 8.299153e-03
## 64 -1.07339872 5.112584e-03
## 65 -0.77751899 4.900891e-03
## 66 -0.30197426 5.481629e-04
## 67 -0.78702618 3.782998e-03
## 68 0.45519754 1.288366e-03
## 69 1.24081605 1.671333e-02
## 70 1.55056437 2.012202e-02
## 71 -0.35348493 1.052317e-03
## 72 -1.49938655 1.311979e-02
## 73 -0.46873367 1.596111e-03
## 74 0.29161240 8.077040e-04
## 75 -0.99974519 6.450802e-03
## 76 -0.64518999 4.466658e-03
## 77 -0.32874383 2.843472e-03
## 78 1.47782998 1.908942e-02
## 79 -0.05286709 3.387028e-05
## 80 -1.00856647 2.169110e-02
## 81 0.74691175 4.469820e-03
## 82 -0.68260290 2.798905e-03
## 83 0.39348196 3.794891e-03
## 84 -0.86117571 3.384716e-03
## 85 -0.05286709 3.387028e-05
n <- nrow(apartment_data)
outliers <- abs(standardized_residuals) > 3
influential <- cooks_distances > (4 / n)
diagnostics$Outlier <- outliers
diagnostics$Influential <- influential
problematic_units <- diagnostics[diagnostics$Outlier | diagnostics$Influential, ]
print(problematic_units)
## Standardized_Residuals Cooks_Distance Outlier Influential
## 22 1.575982 0.06086868 FALSE TRUE
## 33 2.050586 0.06913379 FALSE TRUE
## 38 2.576772 0.31973058 FALSE TRUE
## 53 -2.151787 0.06625775 FALSE TRUE
## 55 1.444768 0.10420445 FALSE TRUE
There are no problemati units.
standardized_fitted_values <- scale(fitted(fit2))
library(ggplot2)
ggplot(data = data.frame(standardized_fitted_values, standardized_residuals),
aes(x = standardized_fitted_values, y = standardized_residuals)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(title = "Standardized Residuals vs. Standardized Fitted Values",
x = "Standardized Fitted Values",
y = "Standardized Residuals") +
theme_minimal()
The scatterplot indicates potential heteroskedasticity - variance of the errors changes with the level of predictor variables.
ggplot(data = data.frame(standardized_residuals), aes(sample = standardized_residuals)) +
geom_qq() +
geom_qq_line() +
labs(title = "Q-Q Plot of Standardized Residuals",
x = "Theoretical Quantiles",
y = "Sample Quantiles") +
theme_minimal()
shapiro_test_result <- shapiro.test(standardized_residuals)
print(shapiro_test_result)
##
## Shapiro-Wilk normality test
##
## data: standardized_residuals
## W = 0.95306, p-value = 0.00366
The assumption of normality for the residuals has been violated.
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = apartment_data)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = apartment_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## Parking1 196.168 62.868 3.120 0.00251 **
## Balcony1 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
anova_result <- anova(fit2, fit3)
print(anova_result)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 82 6720983
## 2 80 5991088 2 729894 4.8732 0.01007 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can reject the null hypothesis that Parking and Balcony do not improve the model. When parking and Balcony are included in the model, it is statistically significantly improved than the model with only Age and Distance.
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = apartment_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## Parking1 196.168 62.868 3.120 0.00251 **
## Balcony1 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
Age and distance negatively impact the price, while parking positively impact the price. Balcony variable does not appear to have significant impact. The null hypothesis shows that all of the regression coefficients in the model are equal to zero. None of the predict variables have a significant effect on price (dependent variable). Alternative null hypothesis indicated that at least one variable has a significant effect on the price.