data()
data(package = .packages(all.available = TRUE)) #Listing the data sets available in all installed packages
#install.packages("carData")
library(carData) #Activating the package carData
mydata <- States #Importing the data set called "States" and creating a data frame
head(mydata) #Showing first 6 rows of the newly created data frame
## region pop SATV SATM percent dollars pay
## AL ESC 4041 470 514 8 3.648 27
## AK PAC 550 438 476 42 7.887 43
## AZ MTN 3665 445 497 25 4.231 30
## AR WSC 2351 470 511 6 3.334 23
## CA PAC 29760 419 484 45 4.826 39
## CO MTN 3294 456 513 28 4.809 31
Explanation of the data set:
The States data frame contains 51 observations of 7 variables. The observations are the 50 U.S. states and Washington, D.C.
Source of data set: United States (1992) Statistical Abstract of the United States. Bureau of the Census.
Explanation of the variables in the States data set:
region: U.S. Census regions.
pop: The population of each state, measured in thousands (1,000s).
SATV: The average score of graduating high-school students in each state on the verbal component of the Scholastic Aptitude Test (SAT), a widely recognized university admission exam.
SATM: The average score of graduating high-school students in each state on the math component of the Scholastic Aptitude Test (SAT).
percent: The percentage of graduating high-school students in each state who took the SAT exam.
dollars: State spending on public education, reported in thousands of dollars per student.
pay: The average salary of teachers in each state, measured in thousands of dollars.
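As a quick check of this description, the dimensions and structure of the data frame can be inspected directly (a minimal sketch using base R):
dim(mydata) #Should return 51 rows and 7 columns, matching the description above
str(mydata) #Compact overview of each variable's type and first few values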
mydata$popreal <- mydata$pop * 1000 #Creating new variable popreal to represent the actual population figures (not in thousands)
head(mydata)
## region pop SATV SATM percent dollars pay popreal
## AL ESC 4041 470 514 8 3.648 27 4041000
## AK PAC 550 438 476 42 7.887 43 550000
## AZ MTN 3665 445 497 25 4.231 30 3665000
## AR WSC 2351 470 511 6 3.334 23 2351000
## CA PAC 29760 419 484 45 4.826 39 29760000
## CO MTN 3294 456 513 28 4.809 31 3294000
colnames(mydata) <- c("Region", "Population", "SATVerbal", "SATMath", "Percent", "StateSpending", "Salary", "RealPopulation") #Renaming variables in data frame mydata
head(mydata)
## Region Population SATVerbal SATMath Percent StateSpending Salary
## AL ESC 4041 470 514 8 3.648 27
## AK PAC 550 438 476 42 7.887 43
## AZ MTN 3665 445 497 25 4.231 30
## AR WSC 2351 470 511 6 3.334 23
## CA PAC 29760 419 484 45 4.826 39
## CO MTN 3294 456 513 28 4.809 31
## RealPopulation
## AL 4041000
## AK 550000
## AZ 3665000
## AR 2351000
## CA 29760000
## CO 3294000
mydata2 <- mydata[c(-1, -2), -8] #Making new data frame mydata2 excluding the first two rows and the eighth column
mydata2[2, 6] <- 1 #Changing a particular value in the data frame (row 2, column 6: StateSpending for AR)
head(mydata2)
## Region Population SATVerbal SATMath Percent StateSpending Salary
## AZ MTN 3665 445 497 25 4.231 30
## AR WSC 2351 470 511 6 1.000 23
## CA PAC 29760 419 484 45 4.826 39
## CO MTN 3294 456 513 28 4.809 31
## CN NE 3287 430 471 74 7.914 43
## DE SA 666 433 470 58 6.016 35
mydata3 <- mydata[mydata$Salary >= 25 & mydata$Salary <= 35, ] #Making new data frame mydata3 including only states with average teacher salaries between $25,000 and $35,000 inclusive
head(mydata3)
## Region Population SATVerbal SATMath Percent StateSpending Salary
## AL ESC 4041 470 514 8 3.648 27
## AZ MTN 3665 445 497 25 4.231 30
## CO MTN 3294 456 513 28 4.809 31
## DE SA 666 433 470 58 6.016 35
## FL SA 12938 418 466 44 5.154 30
## GA SA 6478 401 443 57 4.860 29
## RealPopulation
## AL 4041000
## AZ 3665000
## CO 3294000
## DE 666000
## FL 12938000
## GA 6478000
summary(mydata) #Descriptive statistics; there are no missing (NA) values in this data set
## Region Population SATVerbal SATMath
## SA : 9 Min. : 454 Min. :397.0 Min. :437.0
## MTN : 8 1st Qu.: 1215 1st Qu.:422.5 1st Qu.:470.0
## WNC : 7 Median : 3294 Median :443.0 Median :490.0
## NE : 6 Mean : 4877 Mean :448.2 Mean :497.4
## ENC : 5 3rd Qu.: 5780 3rd Qu.:474.5 3rd Qu.:522.5
## PAC : 5 Max. :29760 Max. :511.0 Max. :577.0
## (Other):11
## Percent StateSpending Salary RealPopulation
## Min. : 4.00 Min. :2.993 Min. :22.00 Min. : 454000
## 1st Qu.:11.50 1st Qu.:4.354 1st Qu.:27.50 1st Qu.: 1215000
## Median :25.00 Median :5.045 Median :30.00 Median : 3294000
## Mean :33.75 Mean :5.175 Mean :30.94 Mean : 4876647
## 3rd Qu.:57.50 3rd Qu.:5.689 3rd Qu.:33.50 3rd Qu.: 5780000
## Max. :74.00 Max. :9.159 Max. :43.00 Max. :29760000
##
library(psych)
describe(mydata) #Interpretation of descriptive statistics
## vars n mean sd median trimmed
## Region* 1 51 5.27 2.45 5.000e+00 5.37
## Population 2 51 4876.65 5439.20 3.294e+03 3813.15
## SATVerbal 3 51 448.16 30.82 4.430e+02 447.29
## SATMath 4 51 497.39 34.57 4.900e+02 496.51
## Percent 5 51 33.75 24.07 2.500e+01 32.76
## StateSpending 6 51 5.18 1.38 5.040e+00 5.02
## Salary 7 51 30.94 5.31 3.000e+01 30.63
## RealPopulation 8 51 4876647.06 5439202.69 3.294e+06 3813146.34
## mad min max range skew
## Region* 2.97 1.00e+00 9.000e+00 8.0000e+00 -0.25
## Population 3239.48 4.54e+02 2.976e+04 2.9306e+04 2.41
## SATVerbal 37.06 3.97e+02 5.110e+02 1.1400e+02 0.18
## SATMath 40.03 4.37e+02 5.770e+02 1.4000e+02 0.23
## Percent 28.17 4.00e+00 7.400e+01 7.0000e+01 0.22
## StateSpending 1.03 2.99e+00 9.160e+00 6.1700e+00 0.97
## Salary 4.45 2.20e+01 4.300e+01 2.1000e+01 0.51
## RealPopulation 3239481.00 4.54e+05 2.976e+07 2.9306e+07 2.41
## kurtosis se
## Region* -1.11 0.34
## Population 7.06 761.64
## SATVerbal -1.11 4.32
## SATMath -0.86 4.84
## Percent -1.63 3.37
## StateSpending 0.71 0.19
## Salary -0.49 0.74
## RealPopulation 7.06 761640.72
Interpretation of the descriptive statistics for data frame mydata:
Mean for StateSpending: The mean of StateSpending is approximately 5.175, meaning that, on average, states spend around $5,175 per student on public education.
Median for Salary: The median Salary is 30.00, suggesting that half of the states have average teacher salaries below $30,000 and half above.
IQR for RealPopulation: 5,780,000 (Q3) - 1,215,000 (Q1) = 4,565,000 (IQR). The IQR means that the middle 50% of the states have populations between 1,215,000 and 5,780,000 (recomputed in the sketch after this list).
Max. for Percent: The highest percentage of graduating high-school students who took the SAT exam in any state is 74%.
Min. for Salary: The lowest average teacher salary among the states is $22,000.
1st Qu. for SATVerbal: Q1 of 422.5 indicates that 25% of the states had an average SAT verbal score of 422.5 or below, while 75% had higher average scores.
3rd Qu. for Population: Q3 of 5,780 indicates that 75% of the states have populations of 5,780,000 or below, while 25% have larger populations.
Skewness of Population: A skewness value of 2.41 indicates that the distribution of Population is strongly positively skewed, i.e. asymmetrical to the right.
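The quartiles and the IQR quoted above can be recomputed directly; a short sketch using base R whose output should match summary():
quantile(mydata$RealPopulation, probs = c(0.25, 0.75)) #Q1 = 1,215,000 and Q3 = 5,780,000
IQR(mydata$RealPopulation) #Interquartile range: 5,780,000 - 1,215,000 = 4,565,000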
hist(mydata$Percent, #Histogram of the percentage of students who took the SAT exam per state
main = "Distribution of the percent of students taking the SAT exam per state",
xlab = "Percent of students taking the SAT exam",
ylab = "Frequecny",
breaks = seq(from = 0, to = 100, by = 10),
col = "lavenderblush",
border = "black")
Based on the graphical representation above, the histogram exhibits a bimodal distribution, with one mode at a frequency of 12 in the 0% to 10% interval and another at a frequency of 9 in the 50% to 60% interval. This means that 12 states had participation rates of 0 to 10 percent of graduating students taking the SAT exam, while 9 states had participation rates of 50 to 60 percent. Furthermore, the histogram reveals that the highest participation rates lie between 70% and 80%, with only two states exhibiting such high rates. From the descriptive statistics, the variable "Percent" has a skewness of 0.22, which is consistent with the histogram's slight positive skew (asymmetrical to the right).
boxplot(mydata[ , c(3, 4)], #Boxplots of average SATVerbal and SATMath scores
col = "lavenderblush",
border = "black")
The boxplots clearly show that, on average, states scored higher on SAT Math than on SAT Verbal: the SAT Math median is around 490 points, compared with about 440 points for SAT Verbal. That is, half of the states averaged roughly 490 points or more on SAT Math, while half averaged roughly 440 points or more on SAT Verbal. Furthermore, SAT Math's interquartile range spans from approximately 470 (Q1) to 520 (Q3) points, while SAT Verbal's IQR ranges from about 420 (Q1) to 480 (Q3) points; these ranges cover the average scores of the middle 50% of states. Looking at the extremes, the lowest state average on SAT Verbal is around 400 points and the highest approximately 510 points, whereas for SAT Math the lowest average is about 440 points and the highest around 580 points. Notably, neither boxplot suggests the presence of outliers.
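The five values each boxplot draws (whisker ends, quartiles and median) can also be read off numerically; a small sketch with base R's boxplot.stats():
boxplot.stats(mydata$SATVerbal)$stats #Lower whisker, Q1, median, Q3, upper whisker for SATVerbal
boxplot.stats(mydata$SATMath)$stats #The same five statistics for SATMath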
mydata4 <- read.table("./body mass.csv", #Data importing
header = TRUE,
sep = ";",
dec = ",")
head(mydata4)
## ID Mass
## 1 1 62.1
## 2 2 64.5
## 3 3 56.5
## 4 4 53.4
## 5 5 61.3
## 6 6 62.2
library(pastecs)
stat.desc(mydata4) #Descriptive statistics
## ID Mass
## nbr.val 50.000000 5.000000e+01
## nbr.null 0.000000 0.000000e+00
## nbr.na 0.000000 0.000000e+00
## min 1.000000 4.970000e+01
## max 50.000000 8.320000e+01
## range 49.000000 3.350000e+01
## sum 1275.000000 3.143800e+03
## median 25.500000 6.280000e+01
## mean 25.500000 6.287600e+01
## SE.mean 2.061553 8.501407e-01
## CI.mean.0.95 4.142845 1.708422e+00
## var 212.500000 3.613696e+01
## std.dev 14.577380 6.011403e+00
## coef.var 0.571662 9.560727e-02
round(stat.desc(mydata4), 2) #Rounding the descriptive statistics to two decimals
##                   ID    Mass
## nbr.val        50.00   50.00
## nbr.null        0.00    0.00
## nbr.na          0.00    0.00
## min             1.00   49.70
## max            50.00   83.20
## range          49.00   33.50
## sum          1275.00 3143.80
## median         25.50   62.80
## mean           25.50   62.88
## SE.mean         2.06    0.85
## CI.mean.0.95    4.14    1.71
## var           212.50   36.14
## std.dev        14.58    6.01
## coef.var        0.57    0.10
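The coef.var row reported by stat.desc() is simply the standard deviation divided by the mean, which can be verified by hand:
sd(mydata4$Mass) / mean(mydata4$Mass) #Coefficient of variation: 6.011 / 62.876, approximately 0.096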
hist(mydata4$Mass, #Histogram of Mass
main = "Histogram of Mass",
xlab = "Body Mass (in kg)",
ylab = "Frequency",
breaks = seq(from = 30, to = 100, by = 10),
col = "lavenderblush")
t.test(mydata4$Mass, #Testing hypothesis with t-test
mu = 59.5,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata4$Mass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
## 61.16758 64.58442
## sample estimates:
## mean of x
## 62.876
Testing hypothesis:
H0: 𝜇 = 59.5. H1: 𝜇 != 59.5.
p-value = 0.000234
We reject H0 at p < 0.001 and accept H1 >>>>> We cannot say that the population mean (𝜇) weight of ninth graders in the school year 2021/2022 is equal to 59.5. Since the sample mean (62.876) and the entire 95% confidence interval lie above 59.5, we can say that the average weight of ninth graders increased in comparison with the school year 2018/2019.
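For reference, the reported t statistic can be reproduced from its definition t = (xbar - mu0) / (s / sqrt(n)); a minimal sketch using the same sample:
xbar <- mean(mydata4$Mass) #Sample mean, 62.876
s <- sd(mydata4$Mass) #Sample standard deviation, about 6.011
n <- length(mydata4$Mass) #Sample size, 50
(xbar - 59.5) / (s / sqrt(n)) #Should reproduce t = 3.9711 reported by t.test()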
library(effectsize)
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
cohens_d(mydata4$Mass, mu = 59.5) #Calculating effect size through Cohen's d measure
## Cohen's d | 95% CI
## ------------------------
## 0.56 | [0.26, 0.86]
##
## - Deviation from a difference of 59.5.
interpret_cohens_d(0.56, rules = "sawilowsky2009") #Interpreting the results of calculating the effect size through Cohen's d measure
## [1] "medium"
## (Rules: sawilowsky2009)
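For a one-sample test, Cohen's d is the difference between the sample mean and mu0 divided by the sample standard deviation, so the reported value can be checked by hand:
(mean(mydata4$Mass) - 59.5) / sd(mydata4$Mass) #Cohen's d by hand: 3.376 / 6.011, approximately 0.56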
library(readxl)
mydata5 <- read_xlsx("./Apartments.xlsx") #Import the dataset Apartments.xlsx
mydata5 <- as.data.frame(mydata5)
head(mydata5)
## Age Distance Price Parking Balcony
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description: The Apartments data set contains 85 observations (apartments) of 5 variables: Age (age of the apartment, in years), Distance (distance from the city center, in km), Price (price of the apartment, in euros), Parking (1 if the apartment has parking, 0 otherwise) and Balcony (1 if the apartment has a balcony, 0 otherwise).
mydata5$ParkingFactor <- factor(mydata5$Parking,
levels = c(0, 1),
labels = c("No", "Yes"))
mydata5$BalconyFactor <- factor(mydata5$Balcony,
levels = c(0, 1),
labels = c("No", "Yes"))
str(mydata5) #Changing categorical variables into factors and showing the structure of the data frame
## 'data.frame': 85 obs. of 7 variables:
## $ Age : num 7 18 7 28 18 28 14 18 22 25 ...
## $ Distance : num 28 1 28 29 18 12 20 6 7 2 ...
## $ Price : num 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
## $ Parking : num 0 1 0 0 1 0 0 1 1 1 ...
## $ Balcony : num 1 0 0 1 1 1 1 1 0 0 ...
## $ ParkingFactor: Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 2 2 2 ...
## $ BalconyFactor: Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 2 2 1 1 ...
t.test(mydata5$Price, #Testing hypothesis with t-test
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata5$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
H0: Mu_Price = 1900 eur. H1: Mu_Price != 1900 eur.
p-value = 0.004731
Conclusion: We reject H0 at p = 0.004731 and accept H1 >>>>> We cannot say that the population mean price of apartments is equal to 1900 eur. We accept H1 and conclude that it is different (the sample mean of 2018.94 eur suggests it is higher).
fit1 <- lm(Price ~ Age, #Simple regression function
data = mydata5)
summary(fit1) #Summary of fit1
##
## Call:
## lm(formula = Price ~ Age, data = mydata5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
Simple linear function: Price = 2185.455 - 8.975 * Age
Estimated regression coefficient
H0: Beta 1 = 0 H1: Beta 1 != 0
p = 0.034
We reject H0 at p = 0.034 and accept H1.
Explanation of b1 = -8.975. If Age increases by 1 year, on average the Price of the apartment decreases by 8.975 euros.
Coefficient of determination, Multiple-R2: 0.05302
5.302% of the variability of Price can be explained by the effect of Age of an apartment.
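To illustrate the estimated function, the expected price of a hypothetical 10-year-old apartment (an arbitrary illustrative value) can be obtained with predict():
predict(fit1, newdata = data.frame(Age = 10)) #2185.455 - 8.975 * 10, approximately 2095.7 euros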
cor(mydata5$Price, mydata5$Age) #Coefficient of correlation
## [1] -0.230255
There is a weak negative linear relationship between the Price of apartments and their Age.
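In simple regression, the coefficient of determination equals the squared correlation coefficient, which ties the two results together:
cor(mydata5$Price, mydata5$Age)^2 #(-0.230255)^2, approximately 0.05302 = Multiple R-squared of fit1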
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplotMatrix(mydata5[c("Price", "Age", "Distance")], #Scatterplot matrix
col = "navyblue",
smooth = FALSE)
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata5[c("Price", "Age", "Distance")])) #Calculating the correlation matrix and associated p-values for the specified variables
## Price Age Distance
## Price 1.00 -0.23 -0.63
## Age -0.23 1.00 0.04
## Distance -0.63 0.04 1.00
##
## n= 85
##
##
## P
## Price Age Distance
## Price 0.0340 0.0000
## Age 0.0340 0.6966
## Distance 0.0000 0.6966
There is a moderately strong negative linear relationship between Price and Distance (r = -0.63). Note that multicollinearity concerns correlations among the explanatory variables themselves; Age and Distance are correlated at only 0.04, which we verify formally with VIF statistics below.
fit2 <- lm(Price ~ Age + Distance, #Multiple regression function
data = mydata5)
vif(fit2) #Checking for multicolineairty with VIF statistics
## Age Distance
## 1.001845 1.001845
mean(vif(fit2)) #Calculating mean of VIF statistics
## [1] 1.001845
All variance inflation factors are below 5 and their mean is close to 1. We can therefore be quite confident that there is no strong multicollinearity in the data.
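Each VIF equals 1 / (1 - R2_j), where R2_j is the R-squared from regressing predictor j on the remaining predictors; with only two predictors both VIFs coincide, as this short sketch shows:
aux <- lm(Age ~ Distance, data = mydata5) #Auxiliary regression of Age on the other predictor
1 / (1 - summary(aux)$r.squared) #Should match vif(fit2), about 1.0018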
mydata5$StdResid <- round(rstandard(fit2), 3) #Standardized residuals
mydata5$CooksD <- round(cooks.distance(fit2), 3) #Cook's distance
hist(mydata5$StdResid, #Histogram of standardized residuals
xlab = "Standardized residuals",
ylab = "Frequency",
col = "lavenderblush",
main = "Histogram of standardized residuals")
No units have critical values of standardized residuals (below -3 or above 3).
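This can also be confirmed numerically instead of reading it off the histogram:
sum(abs(mydata5$StdResid) > 3) #Number of units with |standardized residual| above 3; returns 0 here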
shapiro.test(mydata5$StdResid) #Shapiro-Wilk test
##
## Shapiro-Wilk normality test
##
## data: mydata5$StdResid
## W = 0.95303, p-value = 0.003645
H0: Variable is normally distributed H1: Variable is not normally distributed
p-value = 0.003645
We reject H0 at p = 0.003645 and accept H1. We cannot say that the standardized residuals are normally distributed; we assume they are not. However, this assumption is critical only for sample sizes of fewer than 30 units, and this sample contains 85.
hist(mydata5$CooksD, #Histogram for Cook's distance
xlab = "Cooks distance",
ylab = "Frequency",
col = "lavenderblush",
main = "Histogram of Cooks distance")
head(mydata5[order(-mydata5$StdResid),], 3) #Three units with highest values of standardized residuals
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor
## 38 5 45 2180 1 1 Yes Yes
## 33 2 11 2790 1 0 Yes No
## 2 18 1 2800 1 0 Yes No
## StdResid CooksD
## 38 2.577 0.320
## 33 2.051 0.069
## 2 1.783 0.030
head(mydata5[order(-mydata5$CooksD),], 3) #Three units with highest value of Cooks distance
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor
## 38 5 45 2180 1 1 Yes Yes
## 55 43 37 1740 0 0 No No
## 33 2 11 2790 1 0 Yes No
## StdResid CooksD
## 38 2.577 0.320
## 55 1.445 0.104
## 33 2.051 0.069
mydata5 <- mydata5[-38, ] #Removing unit ID38 because of high Cook's distance value
mydata5 <- mydata5[-54, ] #Removing unit ID55 (now in row 54 after ID38 was dropped): even without ID38, the histogram of Cook's distance was not continuous and this unit's Cook's distance exceeded 0.1
fit2 <- lm(Price ~ Age + Distance, #Estimating fit2 again without ID38 and ID55
data = mydata5)
mydata5$StdResid <- round(rstandard(fit2), 3) #Estimating standardized residuals again without ID38 and ID55
mydata5$CooksD <- round(cooks.distance(fit2), 3) #Estimating Cook's distance again without ID38 and ID55
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(data = mydata5, aes(x = StdResid)) +
geom_histogram(binwidth = 0.5, fill = "lavenderblush", color = "black") +
labs(title = "Histogram of Standardized Residuals", x = "Standardized Residuals", y = "Frequency") #Histogram of standardized residuals without unit ID38 and ID55
shapiro.test(mydata5$StdResid) #Shapiro-Wilk test
##
## Shapiro-Wilk normality test
##
## data: mydata5$StdResid
## W = 0.95952, p-value = 0.01044
H0: Variable is normally distributed H1: Variable is not normally distributed
p-value = 0.01044
We reject H0 at p = 0.01044 and accept H1. We cannot say that the standardized residuals are normally distributed; we assume they are not. However, this assumption is critical only for sample sizes of fewer than 30 units, and 83 units remain in the sample after removing ID38 and ID55.
hist(mydata5$CooksD, #Histogram of Cook's distance without unit ID 38 and ID55
xlab = "Cooks distance",
ylab = "Frequency",
col = "lavenderblush",
main = "Histogram of Cooks distances")
mydata5$StdFitted <- scale(fit2$fitted.values) #Adding the standardized values of the fitted values from a regression model stored in fit2
library(car)
scatterplot(y = mydata5$StdResid, x = mydata5$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
col = "navyblue",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE) #Creating a scatterplot to visualize the relationship between standardized residuals and standardized fitted values and check for heteroskedasticity
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2) #Checking heteroskedasticity through the Breusch Pagan Test
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## -----------------------------
## DF = 1
## Chi2 = 3.775135
## Prob > Chi2 = 0.05201969
H0: the variance is constant
Ha: the variance is not constant
p-value = 0.05201969
Since the p-value (0.052) is just above the 5% significance level (alpha), we cannot reject the null hypothesis. We can therefore assume that the variance of the residuals is constant, i.e. there is no significant heteroskedasticity, although the result is borderline.
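As a cross-check, the same hypothesis can be tested with bptest() from the lmtest package; this is a sketch assuming lmtest is installed, and its default studentized variant need not reproduce olsrr's statistic exactly:
library(lmtest)
bptest(fit2) #Studentized Breusch-Pagan test; the conclusion should agree with ols_test_breusch_pagan() here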
fit2 <- lm(Price ~ Age + Distance,
data = mydata5)
summary(fit2) #Summary of fit2
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -627.27 -212.96 -46.23 205.05 578.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2490.112 76.189 32.684 < 2e-16 ***
## Age -7.850 3.244 -2.420 0.0178 *
## Distance -23.945 2.826 -8.473 9.53e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared: 0.4968, Adjusted R-squared: 0.4842
## F-statistic: 39.49 on 2 and 80 DF, p-value: 1.173e-12
H0: Beta0 = 0 H1: Beta0 != 0 p < 2e-16 b0 = 2490.112
We reject H0 at p-value < 0.001 and accept H1. Explanation of the intercept coefficient (b0): if the explanatory variables used in the regression function (Age and Distance) equal zero, the apartment on average has a Price of 2490.112 euros.
H0: Beta1 = 0 H1: Beta1 != 0 p = 0.0178 b1 = -7.850
We reject H0 at p-value = 0.0178 and accept H1. Explanation of b1 = -7.850: if Age increases by one year, Price on average decreases by 7.850 euros, assuming that all other explanatory variables included in the model are constant.
H0: Beta2 = 0 H1: Beta2 != 0 p < 0.001 b2 = -23.945
We reject H0 at p-value < 0.001 and accept H1. Explanation of b2 = -23.945: if Distance increases by 1 km, Price on average decreases by 23.945 euros, assuming that all other explanatory variables included in the model are constant.
Coefficient of determination, Multiple R-squared: 0.4968. 49.68% of the variability of Price is explained by the linear effect of Age and Distance.
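The Adjusted R-squared in the summary follows from the formula adj.R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1); a quick check with n = 83 observations and k = 2 explanatory variables:
n <- nobs(fit2) #83 observations remain after removing ID38 and ID55
k <- 2 #Number of explanatory variables in fit2
1 - (1 - 0.4968) * (n - 1) / (n - k - 1) #Approximately 0.4842, matching the reported Adjusted R-squared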
fit3 <- lm(Price ~ Age + Distance + ParkingFactor + BalconyFactor,
data = mydata5) #Creating fit3
anova(fit2, fit3) #Checking whether fit3 or fit2 fits data better through ANOVA
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingFactor + BalconyFactor
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 80 5982100
## 2 78 5458696 2 523404 3.7395 0.02813 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: The added variables (ParkingFactor and BalconyFactor) have no effect, so Model 1 is adequate. H1: At least one of the added variables improves the model, so Model 2 is better.
p = 0.02813
We reject the null hypothesis at p = 0.02813. Model 2 is better >>> it explains significantly more variability in the price of an apartment.
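The F statistic in the ANOVA table can be reproduced from the two residual sums of squares reported above:
((5982100 - 5458696) / 2) / (5458696 / 78) #Drop in RSS per added parameter, divided by Model 2's residual mean square; about 3.7395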
summary(fit3) #Summary of fit3
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor,
## data = mydata5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -499.06 -194.33 -32.04 219.03 544.31
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2358.900 93.664 25.185 < 2e-16 ***
## Age -7.197 3.148 -2.286 0.02499 *
## Distance -21.241 2.911 -7.296 2.14e-10 ***
## ParkingFactorYes 168.921 62.166 2.717 0.00811 **
## BalconyFactorYes -6.985 58.745 -0.119 0.90566
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 264.5 on 78 degrees of freedom
## Multiple R-squared: 0.5408, Adjusted R-squared: 0.5173
## F-statistic: 22.97 on 4 and 78 DF, p-value: 1.449e-12
H0: Beta3 = 0 H1: Beta3 != 0
p = 0.00811
We reject H0 at p = 0.00811. Assuming that all other variables are constant, apartments with parking are on average 168.921 euros more expensive than those without parking.
H0: Beta4 = 0 H1: Beta4 != 0
p = 0.90566
We cannot reject the null hypothesis at p = 0.90566. We cannot say that having a balcony has a statistically significant effect on Price.
Test of significance of regression H0: Ro2 = 0 H1: Ro2 > 0
F = 22.97, p < 0.001
We reject the null hypothesis at p < 0.001. We can say that the population coefficient of determination is greater than 0, meaning that there is a linear relationship between Price and the explanatory variables included in the model.
mydata5$Fitted <- fitted.values(fit3) #Calculating fitted values
mydata5$Residuals <- residuals(fit3) #Calculating residuals
mydata5[2, "Residuals"] #Showing the residual for unit ID2
## [1] 422.9572
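By definition, a residual is the observed value minus the fitted value, so the figure above can be reproduced directly:
mydata5[2, "Price"] - mydata5[2, "Fitted"] #Observed price (2800) minus the fitted value for ID2, about 422.96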