library(carData)
mydata1 <- SLID
head(mydata1)
## wages education age sex language
## 1 10.56 15.0 40 Male English
## 2 11.00 13.2 19 Male English
## 3 NA 16.0 49 Male Other
## 4 17.76 14.0 46 Male Other
## 5 NA 8.0 71 Male English
## 6 14.00 16.0 50 Female English
Explanation of the variables used in the analysis:
mydata1$ageNOW <- mydata1$age + 29 #Creating new variable (ageNOW).
mydata1_CleanData <- na.omit(mydata1) #Removing the units that have NA in them (missing data)
head(mydata1_CleanData)
## wages education age sex language ageNOW
## 1 10.56 15.0 40 Male English 69
## 2 11.00 13.2 19 Male English 48
## 4 17.76 14.0 46 Male Other 75
## 6 14.00 16.0 50 Female English 79
## 9 8.20 15.0 31 Male English 60
## 12 16.97 13.5 30 Female English 59
names(mydata1)[names(mydata1) == "ageNOW"] <- "age2023"
names(mydata1_CleanData)[names(mydata1_CleanData) == "ageNOW"] <- "age2023" #Renaming the variable "ageNOW" to "age2023"
I would also like to rename the variable sex to gender, but only in mydata1_CleanData.
names(mydata1_CleanData)[names(mydata1_CleanData) == "sex"] <- "gender"
summary(mydata1_CleanData[ , c(-4,-5)]) #Removing sex and language from the summary
## wages education age age2023
## Min. : 2.30 Min. : 0.00 Min. :16.0 Min. :45.0
## 1st Qu.: 9.25 1st Qu.:12.00 1st Qu.:28.0 1st Qu.:57.0
## Median :14.13 Median :13.00 Median :36.0 Median :65.0
## Mean :15.54 Mean :13.34 Mean :37.1 Mean :66.1
## 3rd Qu.:19.72 3rd Qu.:15.10 3rd Qu.:46.0 3rd Qu.:75.0
## Max. :49.92 Max. :20.00 Max. :69.0 Max. :98.0
Explanation:
hist(mydata1_CleanData$wages,
main = "Histogram of wages for people in our survay",
xlab = "Hourly wage rate",
ylab = "Frequency",
breaks = seq(from=0, to=50, by=2))
Explanation of the histogram:
We can see that the majority of people in our example have had their hourly wage rate lover than 10$ (to be more precised, somewhere arount 8$). As expected we can also notice that as soon as the hourly wage rate gets higher, the frequency gets lower.
library(car)
scatterplot(y=mydata1_CleanData$age,
x=mydata1_CleanData$wages,
ylab="age",
xlab="wages",
smooth=FALSE)
Explanation of the scatterplot:
The scatterplot shows us that the majority of people in our sample has a hourly wage rate between 5$ and 20$. There is a positive correlation between wage and age, and we can also notice that as older the people are, the higher their hourly wage rate is.
Body_mass_data <- read.table("./Body mass.csv", header=TRUE, sep=";", dec=",")
head(Body_mass_data)
## ID Mass
## 1 1 62.1
## 2 2 64.5
## 3 3 56.5
## 4 4 53.4
## 5 5 61.3
## 6 6 62.2
library(pastecs)
stat.desc(Body_mass_data)
## ID Mass
## nbr.val 50.000000 5.000000e+01
## nbr.null 0.000000 0.000000e+00
## nbr.na 0.000000 0.000000e+00
## min 1.000000 4.970000e+01
## max 50.000000 8.320000e+01
## range 49.000000 3.350000e+01
## sum 1275.000000 3.143800e+03
## median 25.500000 6.280000e+01
## mean 25.500000 6.287600e+01
## SE.mean 2.061553 8.501407e-01
## CI.mean.0.95 4.142845 1.708422e+00
## var 212.500000 3.613696e+01
## std.dev 14.577380 6.011403e+00
## coef.var 0.571662 9.560727e-02
round(stat.desc(Body_mass_data), 1)
## ID Mass
## nbr.val 50.0 50.0
## nbr.null 0.0 0.0
## nbr.na 0.0 0.0
## min 1.0 49.7
## max 50.0 83.2
## range 49.0 33.5
## sum 1275.0 3143.8
## median 25.5 62.8
## mean 25.5 62.9
## SE.mean 2.1 0.9
## CI.mean.0.95 4.1 1.7
## var 212.5 36.1
## std.dev 14.6 6.0
## coef.var 0.6 0.1
hist(Body_mass_data$Mass,
breaks = 4,
col = "blue",
border = "black",
main = "Histogram of the weight of some nine graders",
xlab = "Body mass",
ylab = "Frequency")
H0: Mu = 59.5 H1: Mu =/ 59.5
mean(Body_mass_data$Mass)
## [1] 62.876
result <- t.test(Body_mass_data$Mass, mu = 59.5)
cat("P-value:", result$p.value,"\n")
## P-value: 0.0002339924
p < 0.001, that is why we reject the null hypothesis, therefore we confirm our results from before that the population of nine graders have become more obese over the period of 3 years (from 2018/19 to 2021/22).
library("effectsize")
cohens_d(Body_mass_data$Mass, mu= 59.5)
## Cohen's d | 95% CI
## ------------------------
## 0.56 | [0.26, 0.86]
##
## - Deviation from a difference of 59.5.
interpret_cohens_d(0.56, "sawilowsky2009")
## [1] "medium"
## (Rules: sawilowsky2009)
In our case Cohen’s d is 0.56, therefore the effect size is medium. This means that the difference between the sample mean and the hypothesized population mean (59.5 kg) is of moderate magnitude. In practical terms, this suggests that there is a noticeable difference between the sample mean of (of Mass in our case) and the hypothesized population mean.
#install.packages("readxl")
library(readxl)
mydata_Apartments <- read_xlsx("./Apartments.xlsx")
Description:
mydata_Apartments$ParkingF <- factor(mydata_Apartments$Parking,
levels = c(1, 0),
labels = c("YES", "NO"))
mydata_Apartments$BalconyF <- factor(mydata_Apartments$Balcony,
levels = c(1, 0),
labels = c("YES", "NO"))
t.test(mydata_Apartments$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata_Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
Based on the p-value, I can conclude that the p-value (0.004731) is less than the level of 0.05. This indicates that I reject the null hypothesis (H0) in favor of the alternative hypothesis (H1). In other words, there is evidence that the true mean price of apartments is statistically different from 1900.
library(ggplot2)
lm(Price ~ Age, data = mydata_Apartments)
##
## Call:
## lm(formula = Price ~ Age, data = mydata_Apartments)
##
## Coefficients:
## (Intercept) Age
## 2185.455 -8.975
fit1 <- lm(Price ~ Age, data = mydata_Apartments)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydata_Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
r <- sqrt(0.05302)
Explanation: - Estimate of regression coefficient: If any of the apartments get older by 1 year (if variable “Age” increases by 1 year), on average there will be a decrease in price by 8.975$, assuming all other variables are constant. - Estimate of coefficient in correlation: The positive sign of the correlation coefficient indicates a positive linear relationship between “Age” and “Price,” meaning that as “Age” increases, “Price” will decrease. The magnitude of 0.2300 suggests a weak linear relationship between those two. - Estimate of coefficient of determination: Only about 5.3% of the variability in “Price” is explained by the linear relationship with “Age.” In other words, “Age” explains a relatively small proportion of the variation in apartment prices.
library(car)
scatterplotMatrix(mydata_Apartments[ , c(-4, -5, -6, -7)],smooth = FALSE)
cor_matrix <- cor(mydata_Apartments[, c("Price", "Age", "Distance")])
To see if there is a potential problem with multicollinearity, I created a correlation matrix. Because there is no number close to 1 (positive or negative), I can say that there is no real potential problem of multicollinearity.
fit2 <- lm(Price ~ Age + Distance,
data = mydata_Apartments)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata_Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
vif(fit2)
## Age Distance
## 1.001845 1.001845
The VIF values for both “Age” and “Distance” are very close to 1. A VIF of 1 indicates that there is almost no multicollinearity between these variables. A high multicollinearity between two variables is considered, when VIF is 5 or higher.
mydata_Apartments$StdResid <- round(rstandard(fit2), 3)
mydata_Apartments$CooksD <- round(cooks.distance(fit2), 3)
#I removed the one outlier (row 38) 3 questions below, because otherwise my new data.frame (mydata11) would only consist of 9 variables insted of 10, because StdFitted could not be added in because my data11 would have 84 rows and fit2 has 85 rows. That's why I removed the outlier a few questions after this one.
mydata_Apartments$StdFitted <- scale(fit2$fitted.values)
library(car)
scatterplot(y = mydata_Apartments$StdResid, x = mydata_Apartments$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
Based on the findings with the scatterplot, I haven’t located any
potential heteroskedasticity between standarized residuals and
standrdized fitted values.
hist(mydata_Apartments$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(mydata_Apartments$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata_Apartments$StdResid
## W = 0.95303, p-value = 0.003645
standardized residuals are not distributed normally. I have proved this with a shapiro.test, where p-value = 0.003645 (p < 0.05) and because of that I can reject the null hypothesis (H0) and accept the alternative one (H1).
mydata11 <- mydata_Apartments [-38, ]
fit2 <- lm(Price ~ Age + Distance,
data = mydata11)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -604.92 -229.63 -56.49 192.97 599.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2456.076 73.931 33.221 < 2e-16 ***
## Age -6.464 3.159 -2.046 0.044 *
## Distance -22.955 2.786 -8.240 2.52e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared: 0.4838, Adjusted R-squared: 0.4711
## F-statistic: 37.96 on 2 and 81 DF, p-value: 2.339e-12
Explanation of all coefficients:
fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF, data=mydata11)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = mydata11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -473.21 -192.37 -28.89 204.17 558.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2482.048 79.263 31.314 < 2e-16 ***
## Age -5.821 3.074 -1.894 0.06190 .
## Distance -20.279 2.886 -7.026 6.66e-10 ***
## ParkingFNO -167.531 62.864 -2.665 0.00933 **
## BalconyFNO 15.207 59.201 0.257 0.79795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared: 0.5275, Adjusted R-squared: 0.5035
## F-statistic: 22.04 on 4 and 79 DF, p-value: 3.018e-12
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 81 6176767
## 2 79 5654480 2 522287 3.6485 0.03051 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: fit2 = fit3….The null hypothesis in this context is that there is no significant difference in model fit between fit2 and fit3.
H1: fit2 != fit3….The alternative hypothesis is that there is a significant difference in model fit between fit2 and fit3.
Results: The p-value is 0.03051. This p-value is less than the 0.05. Since the p-value is less than 0.05, I can reject the null hypothesis, and based on the ANOVA test results fit3 is better suited for the data than fit2.
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = mydata11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -473.21 -192.37 -28.89 204.17 558.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2482.048 79.263 31.314 < 2e-16 ***
## Age -5.821 3.074 -1.894 0.06190 .
## Distance -20.279 2.886 -7.026 6.66e-10 ***
## ParkingFNO -167.531 62.864 -2.665 0.00933 **
## BalconyFNO 15.207 59.201 0.257 0.79795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared: 0.5275, Adjusted R-squared: 0.5035
## F-statistic: 22.04 on 4 and 79 DF, p-value: 3.018e-12
Explanation:
The hypothesis which is being tested with F-statistics:
mydata11$residuals <- residuals(fit3)
mydata11$fittedValues <- fitted.values(fit3)
head(mydata11)
## # A tibble: 6 × 12
## Age Distance Price Parking Balcony ParkingF BalconyF StdResid
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl>
## 1 7 28 1640 0 1 NO YES -0.665
## 2 18 1 2800 1 0 YES NO 1.78
## 3 7 28 1660 0 0 NO NO -0.594
## 4 28 29 1850 0 1 NO YES 0.754
## 5 18 18 1640 1 1 YES YES -1.07
## 6 28 12 1770 0 1 NO YES -0.778
## # ℹ 4 more variables: CooksD <dbl>, StdFitted <dbl[,1]>,
## # residuals <dbl>, fittedValues <dbl>
mydata11$residuals <- residuals(fit3)
mydata11$fittedValues <- fitted.values(fit3)
head(mydata11[2, c(11,12)])
## # A tibble: 1 × 2
## residuals fittedValues
## <dbl> <dbl>
## 1 428. 2372.