data()
data(package = .packages(all.available = TRUE)) #Listing the data sets available in all installed packages
#install.packages("carData")
library(carData) #Activating the package carData
mydata <- States #Importing the data set called "States" and creating a data frame
head(mydata) #Showing first 6 rows of the newly created data frame
## region pop SATV SATM percent dollars pay
## AL ESC 4041 470 514 8 3.648 27
## AK PAC 550 438 476 42 7.887 43
## AZ MTN 3665 445 497 25 4.231 30
## AR WSC 2351 470 511 6 3.334 23
## CA PAC 29760 419 484 45 4.826 39
## CO MTN 3294 456 513 28 4.809 31
Explanation of the data set:
The States data frame contains 51 observations of 7 variables. The observations are the 50 U.S. states and Washington, D.C.
Source of data set: United States (1992) Statistical Abstract of the United States. Bureau of the Census.
Explanation of the variables in the States data set:
region: U.S. Census regions.
pop: The population of each state, measured in thousands (1,000s).
SATV: The average score of graduating high-school students in each state on the verbal component of the Scholastic Aptitude Test (SAT), a widely recognized university admission exam.
SATM: The average score of graduating high-school students in each state on the math component of the Scholastic Aptitude Test (SAT).
percent: The percentage of graduating high-school students in each state who took the SAT exam.
dollars: State spending on public education, reported in thousands of dollars per student.
pay: The average salary of teachers in each state, measured in thousands of dollars.
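As a quick check of this description, the dimensions and structure of the data frame can be inspected directly (a minimal sketch using base R):
dim(mydata) #Should return 51 rows and 7 columns, matching the description above
str(mydata) #Compact overview of each variable's type and first few values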
mydata$popreal <- mydata$pop * 1000 #Creating new variable popreal to represent the actual population figures (not in thousands)
head(mydata)
## region pop SATV SATM percent dollars pay popreal
## AL ESC 4041 470 514 8 3.648 27 4041000
## AK PAC 550 438 476 42 7.887 43 550000
## AZ MTN 3665 445 497 25 4.231 30 3665000
## AR WSC 2351 470 511 6 3.334 23 2351000
## CA PAC 29760 419 484 45 4.826 39 29760000
## CO MTN 3294 456 513 28 4.809 31 3294000
colnames(mydata) <- c("Region", "Population", "SATVerbal", "SATMath", "Percent", "StateSpending", "Salary", "RealPopulation") #Renaming variables in data frame mydata
head(mydata)
## Region Population SATVerbal SATMath Percent StateSpending Salary
## AL ESC 4041 470 514 8 3.648 27
## AK PAC 550 438 476 42 7.887 43
## AZ MTN 3665 445 497 25 4.231 30
## AR WSC 2351 470 511 6 3.334 23
## CA PAC 29760 419 484 45 4.826 39
## CO MTN 3294 456 513 28 4.809 31
## RealPopulation
## AL 4041000
## AK 550000
## AZ 3665000
## AR 2351000
## CA 29760000
## CO 3294000
mydata2 <- mydata[c(-1, -2), -8] #Making new data frame mydata2 excluding the first two rows and the eighth column
mydata2[2, 6] <- 1 #Changing a particular value in the data frame (row 2, column 6: StateSpending for AR)
head(mydata2)
## Region Population SATVerbal SATMath Percent StateSpending Salary
## AZ MTN 3665 445 497 25 4.231 30
## AR WSC 2351 470 511 6 1.000 23
## CA PAC 29760 419 484 45 4.826 39
## CO MTN 3294 456 513 28 4.809 31
## CN NE 3287 430 471 74 7.914 43
## DE SA 666 433 470 58 6.016 35
mydata3 <- mydata[mydata$Salary >= 25 & mydata$Salary <= 35, ] #Making new data frame mydata3 including only states with average teacher salaries between $25,000 and $35,000 inclusive
head(mydata3)
## Region Population SATVerbal SATMath Percent StateSpending Salary
## AL ESC 4041 470 514 8 3.648 27
## AZ MTN 3665 445 497 25 4.231 30
## CO MTN 3294 456 513 28 4.809 31
## DE SA 666 433 470 58 6.016 35
## FL SA 12938 418 466 44 5.154 30
## GA SA 6478 401 443 57 4.860 29
## RealPopulation
## AL 4041000
## AZ 3665000
## CO 3294000
## DE 666000
## FL 12938000
## GA 6478000
summary(mydata) #Descriptive statistics; there are no missing (NA) values in this data set
## Region Population SATVerbal SATMath
## SA : 9 Min. : 454 Min. :397.0 Min. :437.0
## MTN : 8 1st Qu.: 1215 1st Qu.:422.5 1st Qu.:470.0
## WNC : 7 Median : 3294 Median :443.0 Median :490.0
## NE : 6 Mean : 4877 Mean :448.2 Mean :497.4
## ENC : 5 3rd Qu.: 5780 3rd Qu.:474.5 3rd Qu.:522.5
## PAC : 5 Max. :29760 Max. :511.0 Max. :577.0
## (Other):11
## Percent StateSpending Salary RealPopulation
## Min. : 4.00 Min. :2.993 Min. :22.00 Min. : 454000
## 1st Qu.:11.50 1st Qu.:4.354 1st Qu.:27.50 1st Qu.: 1215000
## Median :25.00 Median :5.045 Median :30.00 Median : 3294000
## Mean :33.75 Mean :5.175 Mean :30.94 Mean : 4876647
## 3rd Qu.:57.50 3rd Qu.:5.689 3rd Qu.:33.50 3rd Qu.: 5780000
## Max. :74.00 Max. :9.159 Max. :43.00 Max. :29760000
##
library(psych)
describe(mydata) #Interpretation of descriptive statistics
## vars n mean sd median trimmed
## Region* 1 51 5.27 2.45 5.000e+00 5.37
## Population 2 51 4876.65 5439.20 3.294e+03 3813.15
## SATVerbal 3 51 448.16 30.82 4.430e+02 447.29
## SATMath 4 51 497.39 34.57 4.900e+02 496.51
## Percent 5 51 33.75 24.07 2.500e+01 32.76
## StateSpending 6 51 5.18 1.38 5.040e+00 5.02
## Salary 7 51 30.94 5.31 3.000e+01 30.63
## RealPopulation 8 51 4876647.06 5439202.69 3.294e+06 3813146.34
## mad min max range skew
## Region* 2.97 1.00e+00 9.000e+00 8.0000e+00 -0.25
## Population 3239.48 4.54e+02 2.976e+04 2.9306e+04 2.41
## SATVerbal 37.06 3.97e+02 5.110e+02 1.1400e+02 0.18
## SATMath 40.03 4.37e+02 5.770e+02 1.4000e+02 0.23
## Percent 28.17 4.00e+00 7.400e+01 7.0000e+01 0.22
## StateSpending 1.03 2.99e+00 9.160e+00 6.1700e+00 0.97
## Salary 4.45 2.20e+01 4.300e+01 2.1000e+01 0.51
## RealPopulation 3239481.00 4.54e+05 2.976e+07 2.9306e+07 2.41
## kurtosis se
## Region* -1.11 0.34
## Population 7.06 761.64
## SATVerbal -1.11 4.32
## SATMath -0.86 4.84
## Percent -1.63 3.37
## StateSpending 0.71 0.19
## Salary -0.49 0.74
## RealPopulation 7.06 761640.72
Interpretation of the descriptive statistics for data frame mydata:
Mean for StateSpending: The mean of StateSpending is approximately 5.175, meaning that, on average, states spend around $5,175 per student on public education.
Median for Salary: The median Salary is 30.00, suggesting that half of the states have average teacher salaries below $30,000 and half above.
IQR for RealPopulation: 5,780,000 (Q3) - 1,215,000 (Q1) = 4,565,000 (IQR). The IQR means that the middle 50% of the states have populations between 1,215,000 and 5,780,000 (recomputed in the sketch after this list).
Max. for Percent: The highest percentage of graduating high-school students who took the SAT exam in any state is 74%.
Min. for Salary: The lowest average teacher salary among the states is $22,000.
1st Qu. for SATVerbal: Q1 of 422.5 indicates that 25% of the states had an average SAT verbal score of 422.5 or below, while 75% had higher average scores.
3rd Qu. for Population: Q3 of 5,780 indicates that 75% of the states have populations of 5,780,000 or below, while 25% have larger populations.
Skewness of Population: A skewness value of 2.41 indicates that the distribution of Population is strongly positively skewed, i.e. asymmetrical to the right.
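The quartiles and the IQR quoted above can be recomputed directly; a short sketch using base R whose output should match summary():
quantile(mydata$RealPopulation, probs = c(0.25, 0.75)) #Q1 = 1,215,000 and Q3 = 5,780,000
IQR(mydata$RealPopulation) #Interquartile range: 5,780,000 - 1,215,000 = 4,565,000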
hist(mydata$Percent, #Histogram of the percentage of students who took the SAT exam per state
main = "Distribution of the percent of students taking the SAT exam per state",
xlab = "Percent of students taking the SAT exam",
ylab = "Frequecny",
breaks = seq(from = 0, to = 100, by = 10),
col = "lavenderblush",
border = "black")
Based on the graphical representation above, the histogram exhibits a bimodal distribution, with one mode at a frequency of 12 in the 0% to 10% interval and another at a frequency of 9 in the 50% to 60% interval. This means that 12 states had participation rates of 0 to 10 percent of graduating students taking the SAT exam, while 9 states had participation rates of 50 to 60 percent. Furthermore, the histogram reveals that the highest participation rates lie between 70% and 80%, with only two states exhibiting such high rates. From the descriptive statistics, the variable "Percent" has a skewness of 0.22, which is consistent with the histogram's slight positive skew (asymmetrical to the right).
boxplot(mydata[ , c(3, 4)], #Boxplots of average SATVerbal and SATMath scores
col = "lavenderblush",
border = "black")
The boxplots clearly show that, on average, states scored higher on SAT Math than on SAT Verbal: the SAT Math median is around 490 points, compared with about 440 points for SAT Verbal. That is, half of the states averaged roughly 490 points or more on SAT Math, while half averaged roughly 440 points or more on SAT Verbal. Furthermore, SAT Math's interquartile range spans from approximately 470 (Q1) to 520 (Q3) points, while SAT Verbal's IQR ranges from about 420 (Q1) to 480 (Q3) points; these ranges cover the average scores of the middle 50% of states. Looking at the extremes, the lowest state average on SAT Verbal is around 400 points and the highest approximately 510 points, whereas for SAT Math the lowest average is about 440 points and the highest around 580 points. Notably, neither boxplot suggests the presence of outliers.
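The five values each boxplot draws (whisker ends, quartiles and median) can also be read off numerically; a small sketch with base R's boxplot.stats():
boxplot.stats(mydata$SATVerbal)$stats #Lower whisker, Q1, median, Q3, upper whisker for SATVerbal
boxplot.stats(mydata$SATMath)$stats #The same five statistics for SATMath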
mydata4 <- read.table("./body mass.csv", #Data importing
header = TRUE,
sep = ";",
dec = ",")
head(mydata4)
## ID Mass
## 1 1 62.1
## 2 2 64.5
## 3 3 56.5
## 4 4 53.4
## 5 5 61.3
## 6 6 62.2
library(pastecs)
stat.desc(mydata4) #Descriptive statistics
## ID Mass
## nbr.val 50.000000 5.000000e+01
## nbr.null 0.000000 0.000000e+00
## nbr.na 0.000000 0.000000e+00
## min 1.000000 4.970000e+01
## max 50.000000 8.320000e+01
## range 49.000000 3.350000e+01
## sum 1275.000000 3.143800e+03
## median 25.500000 6.280000e+01
## mean 25.500000 6.287600e+01
## SE.mean 2.061553 8.501407e-01
## CI.mean.0.95 4.142845 1.708422e+00
## var 212.500000 3.613696e+01
## std.dev 14.577380 6.011403e+00
## coef.var 0.571662 9.560727e-02
round(stat.desc(mydata4), 2) #Rounding the descriptive statistics to two decimals
##                   ID    Mass
## nbr.val        50.00   50.00
## nbr.null        0.00    0.00
## nbr.na          0.00    0.00
## min             1.00   49.70
## max            50.00   83.20
## range          49.00   33.50
## sum          1275.00 3143.80
## median         25.50   62.80
## mean           25.50   62.88
## SE.mean         2.06    0.85
## CI.mean.0.95    4.14    1.71
## var           212.50   36.14
## std.dev        14.58    6.01
## coef.var        0.57    0.10
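The coef.var row reported by stat.desc() is simply the standard deviation divided by the mean, which can be verified by hand:
sd(mydata4$Mass) / mean(mydata4$Mass) #Coefficient of variation: 6.011 / 62.876, approximately 0.096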
hist(mydata4$Mass, #Histogram of Mass
main = "Histogram of Mass",
xlab = "Body Mass (in kg)",
ylab = "Frequency",
breaks = seq(from = 30, to = 100, by = 10),
col = "lavenderblush")
t.test(mydata4$Mass, #Testing hypothesis with t-test
mu = 59.5,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata4$Mass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
## 61.16758 64.58442
## sample estimates:
## mean of x
## 62.876
Testing hypothesis:
H0: 𝜇 = 59.5. H1: 𝜇 != 59.5.
p-value = 0.000234
We reject H0 at p < 0.001 and accept H1 >>>>> We cannot say that the population mean (𝜇) weight of ninth graders in the school year 2021/2022 is equal to 59.5. Since the sample mean (62.876) and the entire 95% confidence interval lie above 59.5, we can say that the average weight of ninth graders increased in comparison with the school year 2018/2019.
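For reference, the reported t statistic can be reproduced from its definition t = (xbar - mu0) / (s / sqrt(n)); a minimal sketch using the same sample:
xbar <- mean(mydata4$Mass) #Sample mean, 62.876
s <- sd(mydata4$Mass) #Sample standard deviation, about 6.011
n <- length(mydata4$Mass) #Sample size, 50
(xbar - 59.5) / (s / sqrt(n)) #Should reproduce t = 3.9711 reported by t.test()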
library(effectsize)
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
cohens_d(mydata4$Mass, mu = 59.5) #Calculating effect size through Cohen's d measure
## Cohen's d | 95% CI
## ------------------------
## 0.56 | [0.26, 0.86]
##
## - Deviation from a difference of 59.5.
interpret_cohens_d(0.56, rules = "sawilowsky2009") #Interpreting the results of calculating the effect size through Cohen's d measure
## [1] "medium"
## (Rules: sawilowsky2009)
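For a one-sample test, Cohen's d is the difference between the sample mean and mu0 divided by the sample standard deviation, so the reported value can be checked by hand:
(mean(mydata4$Mass) - 59.5) / sd(mydata4$Mass) #Cohen's d by hand: 3.376 / 6.011, approximately 0.56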
library(readxl)
mydata5 <- read_xlsx("./Apartments.xlsx") #Import the dataset Apartments.xlsx
mydata5 <- as.data.frame(mydata5)
head(mydata5)
## Age Distance Price Parking Balcony
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description: The Apartments data set contains 85 observations (apartments) of 5 variables: Age (age of the apartment, in years), Distance (distance from the city center, in km), Price (price of the apartment, in euros), Parking (1 if the apartment has parking, 0 otherwise) and Balcony (1 if the apartment has a balcony, 0 otherwise).
mydata5$ParkingFactor <- factor(mydata5$Parking,
levels = c(0, 1),
labels = c("No", "Yes"))
mydata5$BalconyFactor <- factor(mydata5$Balcony,
levels = c(0, 1),
labels = c("No", "Yes"))
str(mydata5) #Changing categorical variables into factors and showing the structure of the data frame
## 'data.frame': 85 obs. of 7 variables:
## $ Age : num 7 18 7 28 18 28 14 18 22 25 ...
## $ Distance : num 28 1 28 29 18 12 20 6 7 2 ...
## $ Price : num 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
## $ Parking : num 0 1 0 0 1 0 0 1 1 1 ...
## $ Balcony : num 1 0 0 1 1 1 1 1 0 0 ...
## $ ParkingFactor: Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 2 2 2 ...
## $ BalconyFactor: Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 2 2 1 1 ...
t.test(mydata5$Price, #Testing hypothesis with t-test
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata5$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
H0: Mu_Price = 1900 eur. H1: Mu_Price != 1900 eur.
p-value = 0.004731
Conclusion: We reject H0 at p = 0.004731 and accept H1 >>>>> We cannot say that the population mean price of apartments is equal to 1900 eur. We accept H1 and conclude that it is different (the sample mean of 2018.94 eur suggests it is higher).
fit1 <- lm(Price ~ Age, #Simple regression function
data = mydata5)
summary(fit1) #Summary of fit1
##
## Call:
## lm(formula = Price ~ Age, data = mydata5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
Simple linear function: Price = 2185.455 - 8.975 * Age
Estimated regression coefficient
H0: Beta 1 = 0 H1: Beta 1 != 0
p = 0.034
We reject H0 at p = 0.034 and accept H1.
Explanation of b1 = -8.975. If Age increases by 1 year, on average the Price of the apartment decreases by 8.975 euros.
Coefficient of determination, Multiple-R2: 0.05302
5.302% of the variability of Price can be explained by the effect of Age of an apartment.
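To illustrate the estimated function, the expected price of a hypothetical 10-year-old apartment (an arbitrary illustrative value) can be obtained with predict():
predict(fit1, newdata = data.frame(Age = 10)) #2185.455 - 8.975 * 10, approximately 2095.7 euros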
cor(mydata5$Price, mydata5$Age) #Coefficient of correlation
## [1] -0.230255
There is a weak negative linear relationship between the Price of apartments and their Age.
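In simple regression, the coefficient of determination equals the squared correlation coefficient, which ties the two results together:
cor(mydata5$Price, mydata5$Age)^2 #(-0.230255)^2, approximately 0.05302 = Multiple R-squared of fit1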
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplotMatrix(mydata5[c("Price", "Age", "Distance")], #Scatterplot matrix
col = "navyblue",
smooth = FALSE)
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata5[c("Price", "Age", "Distance")])) #Calculating the correlation matrix and associated p-values for the specified variables
## Price Age Distance
## Price 1.00 -0.23 -0.63
## Age -0.23 1.00 0.04
## Distance -0.63 0.04 1.00
##
## n= 85
##
##
## P
## Price Age Distance
## Price 0.0340 0.0000
## Age 0.0340 0.6966
## Distance 0.0000 0.6966
There is a moderately strong negative linear relationship between Price and Distance (r = -0.63). Note that multicollinearity concerns correlations among the explanatory variables themselves; Age and Distance are correlated at only 0.04, which we verify formally with VIF statistics below.
fit2 <- lm(Price ~ Age + Distance, #Multiple regression function
data = mydata5)
vif(fit2) #Checking for multicolineairty with VIF statistics
## Age Distance
## 1.001845 1.001845
mean(vif(fit2)) #Calculating mean of VIF statistics
## [1] 1.001845
All variance inflation factors are below 5 and their mean is close to 1. We can therefore be quite confident that there is no strong multicollinearity in the data.
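Each VIF equals 1 / (1 - R2_j), where R2_j is the R-squared from regressing predictor j on the remaining predictors; with only two predictors both VIFs coincide, as this short sketch shows:
aux <- lm(Age ~ Distance, data = mydata5) #Auxiliary regression of Age on the other predictor
1 / (1 - summary(aux)$r.squared) #Should match vif(fit2), about 1.0018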
mydata5$StdResid <- round(rstandard(fit2), 3) #Standardized residuals
mydata5$CooksD <- round(cooks.distance(fit2), 3) #Cook's distance
hist(mydata5$StdResid, #Histogram of standardized residuals
xlab = "Standardized residuals",
ylab = "Frequency",
col = "lavenderblush",
main = "Histogram of standardized residuals")
No units have critical values of standardized residuals (below -3 or above 3).
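This can also be confirmed numerically instead of reading it off the histogram:
sum(abs(mydata5$StdResid) > 3) #Number of units with |standardized residual| above 3; returns 0 here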
shapiro.test(mydata5$StdResid) #Shapiro-Wilk test
##
## Shapiro-Wilk normality test
##
## data: mydata5$StdResid
## W = 0.95303, p-value = 0.003645
H0: Variable is normally distributed H1: Variable is not normally distributed
p-value = 0.003645
We reject H0 at p = 0.003645 and accept H1. We cannot say that the standardized residuals are normally distributed; we assume they are not. However, this assumption is critical only for sample sizes of fewer than 30 units, and this sample contains 85.
hist(mydata5$CooksD, #Histogram for Cook's distance
xlab = "Cooks distance",
ylab = "Frequency",
col = "lavenderblush",
main = "Histogram of Cooks distance")
head(mydata5[order(-mydata5$StdResid),], 3) #Three units with highest values of standardized residuals
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor
## 38 5 45 2180 1 1 Yes Yes
## 33 2 11 2790 1 0 Yes No
## 2 18 1 2800 1 0 Yes No
## StdResid CooksD
## 38 2.577 0.320
## 33 2.051 0.069
## 2 1.783 0.030
head(mydata5[order(-mydata5$CooksD),], 3) #Three units with highest value of Cooks distance
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor
## 38 5 45 2180 1 1 Yes Yes
## 55 43 37 1740 0 0 No No
## 33 2 11 2790 1 0 Yes No
## StdResid CooksD
## 38 2.577 0.320
## 55 1.445 0.104
## 33 2.051 0.069
mydata5 <- mydata5[-38, ] #Removing unit ID38 because of high Cook's distance value
mydata5 <- mydata5[-54, ] #Removing unit ID55 (now in row 54 after ID38 was dropped): even without ID38, the histogram of Cook's distance was not continuous and this unit's Cook's distance exceeded 0.1
fit2 <- lm(Price ~ Age + Distance, #Estimating fit2 again without ID38 and ID55
data = mydata5)
mydata5$StdResid <- round(rstandard(fit2), 3) #Estimating standardized residuals again without ID38 and ID55
mydata5$CooksD <- round(cooks.distance(fit2), 3) #Estimating Cook's distance again without ID38 and ID55
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(data = mydata5, aes(x = StdResid)) +
geom_histogram(binwidth = 0.5, fill = "lavenderblush", color = "black") +
labs(title = "Histogram of Standardized Residuals", x = "Standardized Residuals", y = "Frequency") #Histogram of standardized residuals without unit ID38 and ID55
shapiro.test(mydata5$StdResid) #Shapiro-Wilk test
##
## Shapiro-Wilk normality test
##
## data: mydata5$StdResid
## W = 0.95952, p-value = 0.01044
H0: Variable is normally distributed H1: Variable is not normally distributed
p-value = 0.01044
We reject H0 at p = 0.01044 and accept H1. We cannot say that the standardized residuals are normally distributed; we assume they are not. However, this assumption is critical only for sample sizes of fewer than 30 units, and 83 units remain in the sample after removing ID38 and ID55.
hist(mydata5$CooksD, #Histogram of Cook's distance without unit ID 38 and ID55
xlab = "Cooks distance",
ylab = "Frequency",
col = "lavenderblush",
main = "Histogram of Cooks distances")
mydata5$StdFitted <- scale(fit2$fitted.values) #Adding the standardized values of the fitted values from a regression model stored in fit2
library(car)
scatterplot(y = mydata5$StdResid, x = mydata5$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
col = "navyblue",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE) #Creating a scatterplot to visualize the relationship between standardized residuals and standardized fitted values and check for heteroskedasticity
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2) #Checking heteroskedasticity through the Breusch Pagan Test
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## -----------------------------
## DF = 1
## Chi2 = 3.775135
## Prob > Chi2 = 0.05201969
H0: the variance is constant
Ha: the variance is not constant
p-value = 0.05201969
Since the p-value (0.052) is just above the 5% significance level (alpha), we cannot reject the null hypothesis. We can therefore assume that the variance of the residuals is constant, i.e. there is no significant heteroskedasticity, although the result is borderline.
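As a cross-check, the same hypothesis can be tested with bptest() from the lmtest package; this is a sketch assuming lmtest is installed, and its default studentized variant need not reproduce olsrr's statistic exactly:
library(lmtest)
bptest(fit2) #Studentized Breusch-Pagan test; the conclusion should agree with ols_test_breusch_pagan() here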
fit2 <- lm(Price ~ Age + Distance,
data = mydata5)
summary(fit2) #Summary of fit2
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -627.27 -212.96 -46.23 205.05 578.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2490.112 76.189 32.684 < 2e-16 ***
## Age -7.850 3.244 -2.420 0.0178 *
## Distance -23.945 2.826 -8.473 9.53e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared: 0.4968, Adjusted R-squared: 0.4842
## F-statistic: 39.49 on 2 and 80 DF, p-value: 1.173e-12
H0: Beta0 = 0 H1: Beta0 != 0 p < 2e-16 b0 = 2490.112
We reject H0 at p-value < 0.001 and accept H1. Explanation of the intercept coefficient (b0): if the explanatory variables used in the regression function (Age and Distance) equal zero, the apartment on average has a Price of 2490.112 euros.
H0: Beta1 = 0 H1: Beta1 != 0 p = 0.0178 b1 = -7.850
We reject H0 at p-value = 0.0178 and accept H1. Explanation of b1 = -7.850: if Age increases by one year, Price on average decreases by 7.850 euros, assuming that all other explanatory variables included in the model are constant.
H0: Beta2 = 0 H1: Beta2 != 0 p < 0.001 b2 = -23.945
We reject H0 at p-value < 0.001 and accept H1. Explanation of b2 = -23.945: if Distance increases by 1 km, Price on average decreases by 23.945 euros, assuming that all other explanatory variables included in the model are constant.
Coefficient of determination, Multiple R-squared: 0.4968. 49.68% of the variability of Price is explained by the linear effect of Age and Distance.
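The Adjusted R-squared in the summary follows from the formula adj.R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1); a quick check with n = 83 observations and k = 2 explanatory variables:
n <- nobs(fit2) #83 observations remain after removing ID38 and ID55
k <- 2 #Number of explanatory variables in fit2
1 - (1 - 0.4968) * (n - 1) / (n - k - 1) #Approximately 0.4842, matching the reported Adjusted R-squared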
fit3 <- lm(Price ~ Age + Distance + ParkingFactor + BalconyFactor,
data = mydata5) #Creating fit3
anova(fit2, fit3) #Checking whether fit3 or fit2 fits data better through ANOVA
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingFactor + BalconyFactor
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 80 5982100
## 2 78 5458696 2 523404 3.7395 0.02813 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: The added variables (ParkingFactor and BalconyFactor) have no effect, so Model 1 is adequate. H1: At least one of the added variables improves the model, so Model 2 is better.
p = 0.02813
We reject the null hypothesis at p = 0.02813. Model 2 is better >>> it explains significantly more variability in the price of an apartment.
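The F statistic in the ANOVA table can be reproduced from the two residual sums of squares reported above:
((5982100 - 5458696) / 2) / (5458696 / 78) #Drop in RSS per added parameter, divided by Model 2's residual mean square; about 3.7395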
summary(fit3) #Summary of fit3
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor,
## data = mydata5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -499.06 -194.33 -32.04 219.03 544.31
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2358.900 93.664 25.185 < 2e-16 ***
## Age -7.197 3.148 -2.286 0.02499 *
## Distance -21.241 2.911 -7.296 2.14e-10 ***
## ParkingFactorYes 168.921 62.166 2.717 0.00811 **
## BalconyFactorYes -6.985 58.745 -0.119 0.90566
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 264.5 on 78 degrees of freedom
## Multiple R-squared: 0.5408, Adjusted R-squared: 0.5173
## F-statistic: 22.97 on 4 and 78 DF, p-value: 1.449e-12
H0: Beta3 = 0 H1: Beta3 != 0
p = 0.00811
We reject H0 at p = 0.00811. Assuming that all other variables are constant, apartments with parking are on average 168.921 euros more expensive than those without parking.
H0: Beta4 = 0 H1: Beta4 != 0
p = 0.90566
We cannot reject the null hypothesis at p = 0.90566. We cannot say that having a balcony has a statistically significant effect on Price.
Test of significance of regression H0: Ro2 = 0 H1: Ro2 > 0
F = 22.97, p < 0.001
We reject the null hypothesis at p < 0.001. We can say that the population coefficient of determination is greater than 0, meaning that there is a linear relationship between Price and the explanatory variables included in the model.
mydata5$Fitted <- fitted.values(fit3) #Calculating fitted values
mydata5$Residuals <- residuals(fit3) #Calculating residuals
mydata5[2, "Residuals"] #Showing the residual for unit ID2
## [1] 422.9572
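By definition, a residual is the observed value minus the fitted value, so the figure above can be reproduced directly:
mydata5[2, "Price"] - mydata5[2, "Fitted"] #Observed price (2800) minus the fitted value for ID2, about 422.96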