R Markdown

candy_data <- read.table("./candy_data.csv",
                         sep = ",",
                         dec = ".")

#TASK 1 ##EXPLAIN THE DATA

names(candy_data)
##  [1] "V1"  "V2"  "V3"  "V4"  "V5"  "V6"  "V7"  "V8"  "V9"  "V10" "V11" "V12"
## [13] "V13"

###we can deduct the data contain 13 diferent variables

##DATA MANIPULATION

#candy_data$sugarprice <- candy_data$pricepercent / candy_data$sugarpercent

izračunamo razmerje med ceno in sladkorjem

ifelse(candy_data $sugarpercent >= 0.5,
       yes = 1,
       no = 0 )
## logical(0)
print(candy_data)
##                                                                                                V1
## 1                                                                                  competitorname
## 2                                                                                       100 Grand
## 3                                                                                    3 Musketeers
## 4                                                                                        One dime
## 5                                                                                     One quarter
## 6                                                                                       Air Heads
## 7                                                                                      Almond Joy
## 8                                                                                       Baby Ruth
## 9                                                                              Boston Baked Beans
## 10                                                                                     Candy Corn
## 11                                                                             Caramel Apple Pops
## 12                                                                                Charleston Chew
## 13                                                                     Chewey Lemonhead Fruit Mix
## 14                                                                                       Chiclets
## 15                                                                                           Dots
## 16                                                                                       Dum Dums
## 17                                                                                    Fruit Chews
## 18                                                                                        Fun Dip
## 19                                                                                     Gobstopper
## 20                                                                              Haribo Gold Bears
## 21                                                                              Haribo Happy Cola
## 22                                                                              Haribo Sour Bears
## 23                                                                             Haribo Twin Snakes
## 24                  Hersheys Kisses,1,0,0,0,0,0,0,0,1,.127,.093000002,55.375454\nHersheys Krackel
## 25 Hersheys Milk Chocolate,1,0,0,0,0,0,0,1,0,.43000001,.91799998,56.490501\nHersheys Special Dark
## 26                                                                                     Jawbusters
## 27                                                                                   Junior Mints
## 28                                                                                        Kit Kat
## 29                                                                                    Laffy Taffy
## 30                                                                                      Lemonhead
## 31                                                                    Lifesavers big ring gummies
## 32                        Peanut butter M&Ms,1,0,0,1,0,0,0,0,1,.82499999,.65100002,71.46505\nM&Ms
## 33                                                                                     Mike & Ike
## 34                                                                                      Milk Duds
## 35                                                                                      Milky Way
## 36                                                                             Milky Way Midnight
## 37                                                                       Milky Way Simply Caramel
## 38                                                                                         Mounds
## 39                                                                                    Mr Good Bar
## 40                                                                                          Nerds
## 41                                                                            Nestle Butterfinger
## 42                                                                                  Nestle Crunch
## 43                                                                                      Nik L Nip
## 44                                                                                    Now & Later
## 45                                                                                         Payday
## 46                                                                                    Peanut M&Ms
## 47                                                                                   Pixie Sticks
## 48                                                                                      Pop Rocks
## 49                                                                                      Red vines
## 50   Reeses Miniatures,1,0,0,1,0,0,0,0,0,.034000002,.27900001,81.866257\nReeses Peanut Butter cup
## 51       Reeses pieces,1,0,0,1,0,0,0,0,1,.40599999,.65100002,73.43499\nReeses stuffed with pieces
## 52                                                                                       Ring pop
## 53                                                                                           Rolo
## 54                                                                              Root Beer Barrels
## 55                                                                                          Runts
## 56                                                                                        Sixlets
## 57                                                                              Skittles original
## 58                                                                             Skittles wildberry
## 59                                                                                Nestle Smarties
## 60                                                                                 Smarties candy
## 61                                                                                       Snickers
## 62                                                                               Snickers Crisper
## 63                                                                                Sour Patch Kids
## 64                                                                          Sour Patch Tricksters
## 65                                                                                      Starburst
## 66                                                                            Strawberry bon bons
## 67                                                                                   Sugar Babies
## 68                                                                                    Sugar Daddy
## 69                                                                                   Super Bubble
## 70                                                                                   Swedish Fish
## 71                                                                                    Tootsie Pop
## 72                                                                           Tootsie Roll Juniors
## 73                                                                           Tootsie Roll Midgies
## 74                                                                        Tootsie Roll Snack Bars
## 75                                                                              Trolli Sour Bites
## 76                                                                                           Twix
## 77                                                                                      Twizzlers
## 78                                                                                       Warheads
## 79 Welchs Fruit Snacks,0,1,0,0,0,0,0,0,1,.31299999,.31299999,44.375519\nWerthers Original Caramel
## 80                                                                                       Whoppers
##           V2     V3      V4             V5     V6               V7   V8  V9
## 1  chocolate fruity caramel peanutyalmondy nougat crispedricewafer hard bar
## 2          1      0       1              0      0                1    0   1
## 3          1      0       0              0      1                0    0   1
## 4          0      0       0              0      0                0    0   0
## 5          0      0       0              0      0                0    0   0
## 6          0      1       0              0      0                0    0   0
## 7          1      0       0              1      0                0    0   1
## 8          1      0       1              1      1                0    0   1
## 9          0      0       0              1      0                0    0   0
## 10         0      0       0              0      0                0    0   0
## 11         0      1       1              0      0                0    0   0
## 12         1      0       0              0      1                0    0   1
## 13         0      1       0              0      0                0    0   0
## 14         0      1       0              0      0                0    0   0
## 15         0      1       0              0      0                0    0   0
## 16         0      1       0              0      0                0    1   0
## 17         0      1       0              0      0                0    0   0
## 18         0      1       0              0      0                0    1   0
## 19         0      1       0              0      0                0    1   0
## 20         0      1       0              0      0                0    0   0
## 21         0      0       0              0      0                0    0   0
## 22         0      1       0              0      0                0    0   0
## 23         0      1       0              0      0                0    0   0
## 24         1      0       0              0      0                1    0   1
## 25         1      0       0              0      0                0    0   1
## 26         0      1       0              0      0                0    1   0
## 27         1      0       0              0      0                0    0   0
## 28         1      0       0              0      0                1    0   1
## 29         0      1       0              0      0                0    0   0
## 30         0      1       0              0      0                0    1   0
## 31         0      1       0              0      0                0    0   0
## 32         1      0       0              0      0                0    0   0
## 33         0      1       0              0      0                0    0   0
## 34         1      0       1              0      0                0    0   0
## 35         1      0       1              0      1                0    0   1
## 36         1      0       1              0      1                0    0   1
## 37         1      0       1              0      0                0    0   1
## 38         1      0       0              0      0                0    0   1
## 39         1      0       0              1      0                0    0   1
## 40         0      1       0              0      0                0    1   0
## 41         1      0       0              1      0                0    0   1
## 42         1      0       0              0      0                1    0   1
## 43         0      1       0              0      0                0    0   0
## 44         0      1       0              0      0                0    0   0
## 45         0      0       0              1      1                0    0   1
## 46         1      0       0              1      0                0    0   0
## 47         0      0       0              0      0                0    0   0
## 48         0      1       0              0      0                0    1   0
## 49         0      1       0              0      0                0    0   0
## 50         1      0       0              1      0                0    0   0
## 51         1      0       0              1      0                0    0   0
## 52         0      1       0              0      0                0    1   0
## 53         1      0       1              0      0                0    0   0
## 54         0      0       0              0      0                0    1   0
## 55         0      1       0              0      0                0    1   0
## 56         1      0       0              0      0                0    0   0
## 57         0      1       0              0      0                0    0   0
## 58         0      1       0              0      0                0    0   0
## 59         1      0       0              0      0                0    0   0
## 60         0      1       0              0      0                0    1   0
## 61         1      0       1              1      1                0    0   1
## 62         1      0       1              1      0                1    0   1
## 63         0      1       0              0      0                0    0   0
## 64         0      1       0              0      0                0    0   0
## 65         0      1       0              0      0                0    0   0
## 66         0      1       0              0      0                0    1   0
## 67         0      0       1              0      0                0    0   0
## 68         0      0       1              0      0                0    0   0
## 69         0      1       0              0      0                0    0   0
## 70         0      1       0              0      0                0    0   0
## 71         1      1       0              0      0                0    1   0
## 72         1      0       0              0      0                0    0   0
## 73         1      0       0              0      0                0    0   0
## 74         1      0       0              0      0                0    0   1
## 75         0      1       0              0      0                0    0   0
## 76         1      0       1              0      0                1    0   1
## 77         0      1       0              0      0                0    0   0
## 78         0      1       0              0      0                0    1   0
## 79         0      0       1              0      0                0    1   0
## 80         1      0       0              0      0                1    0   0
##         V10          V11          V12        V13
## 1  pluribus sugarpercent pricepercent winpercent
## 2         0    .73199999    .86000001  66.971725
## 3         0    .60399997    .51099998  67.602936
## 4         0         .011         .116  32.261086
## 5         0         .011    .51099998  46.116505
## 6         0    .90600002    .51099998  52.341465
## 7         0         .465    .76700002  50.347546
## 8         0    .60399997    .76700002  56.914547
## 9         1    .31299999    .51099998  23.417824
## 10        1    .90600002    .32499999  38.010963
## 11        0    .60399997    .32499999  34.517681
## 12        0    .60399997    .51099998  38.975037
## 13        1    .73199999    .51099998  36.017628
## 14        1         .046    .32499999  24.524988
## 15        1    .73199999    .51099998  42.272076
## 16        0    .73199999   .034000002  39.460556
## 17        1         .127   .034000002  43.088924
## 18        0    .73199999    .32499999  39.185505
## 19        1    .90600002    .45300001  46.783348
## 20        1         .465         .465   57.11974
## 21        1         .465         .465  34.158958
## 22        1         .465         .465   51.41243
## 23        1         .465         .465  42.178772
## 24        0    .43000001    .91799998  62.284481
## 25        0    .43000001    .91799998  59.236122
## 26        1   .093000002    .51099998  28.127439
## 27        1         .197    .51099998   57.21925
## 28        0    .31299999    .51099998    76.7686
## 29        0          .22         .116  41.389557
## 30        0         .046         .104  39.141056
## 31        0    .26699999    .27900001  52.911392
## 32        1    .82499999    .65100002  66.574585
## 33        1    .87199998    .32499999  46.411716
## 34        1    .30199999    .51099998  55.064072
## 35        0    .60399997    .65100002  73.099556
## 36        0    .73199999    .44100001  60.800701
## 37        0    .96499997    .86000001   64.35334
## 38        0    .31299999    .86000001  47.829754
## 39        0    .31299999    .91799998  54.526451
## 40        1    .84799999    .32499999  55.354046
## 41        0    .60399997    .76700002  70.735641
## 42        0    .31299999    .76700002   66.47068
## 43        1         .197    .97600001  22.445341
## 44        1          .22    .32499999    39.4468
## 45        0         .465    .76700002  46.296597
## 46        1    .59299999    .65100002  69.483788
## 47        1   .093000002         .023  37.722336
## 48        1    .60399997    .83700001  41.265511
## 49        1    .58099997         .116  37.348522
## 50        0    .72000003    .65100002   84.18029
## 51        0    .98799998    .65100002  72.887901
## 52        0    .73199999    .96499997  35.290756
## 53        1    .86000001    .86000001  65.716286
## 54        1    .73199999   .068999998  29.703691
## 55        1    .87199998    .27900001  42.849144
## 56        1          .22         .081     34.722
## 57        1    .94099998          .22   63.08514
## 58        1    .94099998          .22  55.103695
## 59        1    .26699999    .97600001  37.887188
## 60        1    .26699999         .116  45.995827
## 61        0         .546    .65100002  76.673782
## 62        0    .60399997    .65100002  59.529251
## 63        1   .068999998         .116  59.863998
## 64        1   .068999998         .116  52.825947
## 65        1    .15099999          .22  67.037628
## 66        1    .56900001   .057999998  34.578991
## 67        1    .96499997    .76700002   33.43755
## 68        0    .41800001    .32499999  32.230995
## 69        0         .162         .116  27.303865
## 70        1    .60399997         .755  54.861111
## 71        0    .60399997    .32499999  48.982651
## 72        0    .31299999    .51099998  43.068897
## 73        1    .17399999         .011  45.736748
## 74        0         .465    .32499999  49.653503
## 75        1    .31299999         .255  47.173229
## 76        0         .546    .90600002  81.642914
## 77        0          .22         .116  45.466282
## 78        0   .093000002         .116  39.011898
## 79        0         .186    .26699999  41.904308
## 80        1    .87199998    .84799999  49.524113

##CREATING NEW DATA FRAME, THAT INCLOUDES ONLY Chocolate and Nougat

candy_data2 <- candy_data [ , c(2,6) ]
print(candy_data2)
##           V2     V6
## 1  chocolate nougat
## 2          1      0
## 3          1      1
## 4          0      0
## 5          0      0
## 6          0      0
## 7          1      0
## 8          1      1
## 9          0      0
## 10         0      0
## 11         0      0
## 12         1      1
## 13         0      0
## 14         0      0
## 15         0      0
## 16         0      0
## 17         0      0
## 18         0      0
## 19         0      0
## 20         0      0
## 21         0      0
## 22         0      0
## 23         0      0
## 24         1      0
## 25         1      0
## 26         0      0
## 27         1      0
## 28         1      0
## 29         0      0
## 30         0      0
## 31         0      0
## 32         1      0
## 33         0      0
## 34         1      0
## 35         1      1
## 36         1      1
## 37         1      0
## 38         1      0
## 39         1      0
## 40         0      0
## 41         1      0
## 42         1      0
## 43         0      0
## 44         0      0
## 45         0      1
## 46         1      0
## 47         0      0
## 48         0      0
## 49         0      0
## 50         1      0
## 51         1      0
## 52         0      0
## 53         1      0
## 54         0      0
## 55         0      0
## 56         1      0
## 57         0      0
## 58         0      0
## 59         1      0
## 60         0      0
## 61         1      1
## 62         1      0
## 63         0      0
## 64         0      0
## 65         0      0
## 66         0      0
## 67         0      0
## 68         0      0
## 69         0      0
## 70         0      0
## 71         1      0
## 72         1      0
## 73         1      0
## 74         1      0
## 75         0      0
## 76         1      0
## 77         0      0
## 78         0      0
## 79         0      0
## 80         1      0

DELETING UNIT 5,10 and 15

candy_data3 <- candy_data2 [ c(-5,-10, -15) , ]
print(candy_data3)
##           V2     V6
## 1  chocolate nougat
## 2          1      0
## 3          1      1
## 4          0      0
## 6          0      0
## 7          1      0
## 8          1      1
## 9          0      0
## 11         0      0
## 12         1      1
## 13         0      0
## 14         0      0
## 16         0      0
## 17         0      0
## 18         0      0
## 19         0      0
## 20         0      0
## 21         0      0
## 22         0      0
## 23         0      0
## 24         1      0
## 25         1      0
## 26         0      0
## 27         1      0
## 28         1      0
## 29         0      0
## 30         0      0
## 31         0      0
## 32         1      0
## 33         0      0
## 34         1      0
## 35         1      1
## 36         1      1
## 37         1      0
## 38         1      0
## 39         1      0
## 40         0      0
## 41         1      0
## 42         1      0
## 43         0      0
## 44         0      0
## 45         0      1
## 46         1      0
## 47         0      0
## 48         0      0
## 49         0      0
## 50         1      0
## 51         1      0
## 52         0      0
## 53         1      0
## 54         0      0
## 55         0      0
## 56         1      0
## 57         0      0
## 58         0      0
## 59         1      0
## 60         0      0
## 61         1      1
## 62         1      0
## 63         0      0
## 64         0      0
## 65         0      0
## 66         0      0
## 67         0      0
## 68         0      0
## 69         0      0
## 70         0      0
## 71         1      0
## 72         1      0
## 73         1      0
## 74         1      0
## 75         0      0
## 76         1      0
## 77         0      0
## 78         0      0
## 79         0      0
## 80         1      0

RENAMING

Change peanutyalmondy to NUTS

names(candy_data)[names(candy_data) == "peanutyalmondy"] <- "NUTS"

#3. Present the descriptive statistics

summary(candy_data [ , c(-1,-2,-3,-4,-5,-6,-7,-8,-9,-10)])
##      V11                V12                V13           
##  Length:80          Length:80          Length:80         
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character

Sugarprecentage ranges from 1,1% to 98,8%, half of the samples were below 46,5% and avarage sugar precentage was 47,8%

price percentage ranges from 1,1% to 97,6%, half of the samples were below 47,8% and avarage sugar precentage was 50,3%

###in our new variable sugarprice the mean price percentage per sugar precentage is 1,97 and the meadiana for Sugarprice is 1,0.

colnames(candy_data)
##  [1] "V1"  "V2"  "V3"  "V4"  "V5"  "V6"  "V7"  "V8"  "V9"  "V10" "V11" "V12"
## [13] "V13"
#install.packages("ggplot2")  # Install ggplot2 if you haven't already
library(ggplot2)              # Load ggplot2
# Assuming candy_data is your data frame
ggplot(data = candy_data, aes(x = V11, y = V13)) +
  geom_point() +  # Adds points
  labs(title = "Scatter Plot of Win Percentage vs Sugar Percentage",
       x = "Sugar Percentage",
       y = "Win Percentage") +
  theme_minimal()  # Optional: adds a clean theme

##Results interpretation

The plot is titled “Scatter Plot of Win Percentage vs Sugar Percentage,” indicating it illustrates the relationship between these two metrics.

on the X-Axis we have sugarpercentage, ranging from 0 to 100.

and on the y-axis we have winpercent, ranging from 0% to 100%.

Data Points: Each point represents a specific candy’s sugar percentage and its corresponding win percentage.

patterns: we cannot conlcude there are any patterns visible and from this we cannot conclude any relation whatsoever between sugar precentage and win precentage.

###this is probably a less optimal example of raw data to analyze and analyze the data in R studio program for this task, however I believe I learned from this exercise and hope that I have showed I understand the concept of processing statistical data by programming in Rstudio.

#TASK 2

the average grade in this program was 74.

library(readxl)
Business_School <- read_excel("C:/Users/Gregor/Desktop/IMB/BOOTCAMP/R Take Home Exam 2024/Task 2/Business School.xlsx")
Business_School <- as.data.frame(Business_School)
head(Business_School)
##   Student ID Undergrad Degree Undergrad Grade MBA Grade Work Experience
## 1          1         Business            68.4      90.2              No
## 2          2 Computer Science            70.2      68.7             Yes
## 3          3          Finance            76.4      83.3              No
## 4          4         Business            82.6      88.7              No
## 5          5          Finance            76.9      75.4              No
## 6          6 Computer Science            83.3      82.1              No
##   Employability (Before) Employability (After) Status Annual Salary
## 1                    252                   276 Placed        111000
## 2                    101                   119 Placed        107000
## 3                    401                   462 Placed        109000
## 4                    287                   342 Placed        148000
## 5                    275                   347 Placed        255500
## 6                    254                   313 Placed        103500

C:/Users/Gregor/Desktop/IMB/BOOTCAMP/R Take Home Exam 2024/Task 2/Business School.xlsx

##1 Graph the distribution of undergrad degrees using the ggplot function. Which degree is the most common?

library(ggplot2)
ggplot(Business_School, aes(x = `Undergrad Degree`)) + 
  geom_bar() + 
  ylab("Frequency") +
  scale_fill_brewer(palette="blues")
## Warning: Unknown palette: "blues"

in the histogram we can seethe Business was the most common undegraduate degree inthe population of students, followed by Finance, Computer science, Engineering and Art on the last place

2 točka: Descriptive statistics of the Annual Salary

library(pastecs)
round(stat.desc(Business_School$`Annual Salary`))
##      nbr.val     nbr.null       nbr.na          min          max        range 
##          100            0            0        20000       340000       320000 
##          sum       median         mean      SE.mean CI.mean.0.95          var 
##     10905800       103500       109058         4150         8235   1722373475 
##      std.dev     coef.var 
##        41501            0
library(ggplot2)
options(scipen = 999)
ggplot(Business_School, aes(x=`Annual Salary`)) + 
  geom_histogram(binwidth = 5000, fill = "navy", color = "black") +
  ylab("Frequency") + 
  xlab("Annual Salary") +
  ggtitle("Distribution of Annual Salary for MBA Students") +
  theme_minimal()

With descriptive statistics we can analyze that the annual sllaries of the MBA students range from minimal 20,000 to maximal 340,000. The avarage is 10,900 and the mediana is 103,500, which means half of population erns less than 109,500 per year and half of the population more.

The histogram shows skeewness to the right, possibly coused by the few outliers which drastically exceed 200,000 annually, which would have been upper limit of the unskeewed histogram shall top three earners be remowed as outliers and the population be analyzed as normally distibuted.

Additionally, to improve the readability of the histogram, with some colleagues we used the options(scipen = 999) function, which prevents scientific notation from being used on the x-axis. This method was sourced from this link: https://stackoverflow.com/questions/5352099/how-can-i-disable-scientific-notation).

#3 est the following hypothesis: 𝐻0: 𝜇MBA Grade = 74

mean(Business_School$`MBA Grade`)
## [1] 76.04055
sd(Business_School$`MBA Grade`)
## [1] 7.675114
t.test(Business_School$`MBA Grade`,
       mu = 74,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  Business_School$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055

Hypotheses: - Null (H₀): The mean MBA grade is equal to 74 (𝜇 = 74). - Alternative (H₁): The mean MBA grade is not 74 (𝜇 ≠ 74).

Upon running the one sample t-test we get results of mean = 76,04 and p-value = 0.00915, which is statistically significant and does not confirm our H₀hypotheses.And thus we cffnirm H₁ hypotheses.

So this year avarage grade is about two points huigher than lasts year generation, and that is statisticaly significiant but the answer to a question how much of the meaning this is to the school is beyond our data and understanding of the faculy program.

#TASK 3 You analyze the price per m2 for a sample of apartments in the Ljubljana region (Apartments.xlsx in Task 3 folder). Follow the questions in R Markdown included in the folder.

output: html_document: # code_folding: hide —

Import the dataset Apartments.xlsx

library(readxl)
Apartments <- read_excel("C:/Users/Gregor/Desktop/IMB/BOOTCAMP/R Take Home Exam 2024/Task 3/Apartments.xlsx")
Apartments <- as.data.frame(Apartments)
head(Apartments)
##   Age Distance Price Parking Balcony
## 1   7       28  1640       0       1
## 2  18        1  2800       1       0
## 3   7       28  1660       0       0
## 4  28       29  1850       0       1
## 5  18       18  1640       1       1
## 6  28       12  1770       0       1

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

Change categorical variables into factors.

Apartments$Parking <- factor(Apartments$Parking,
                                           levels = c (0, 1),
                                           labels = c ("No", "Yes"))
head(Apartments)
##   Age Distance Price Parking Balcony
## 1   7       28  1640      No       1
## 2  18        1  2800     Yes       0
## 3   7       28  1660      No       0
## 4  28       29  1850      No       1
## 5  18       18  1640     Yes       1
## 6  28       12  1770      No       1

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

Apartments$Balcony <- factor(Apartments$Balcony,
                                           levels = c (0, 1),
                                           labels = c ("No", "Yes"))
mean(Apartments$Price)
## [1] 2018.941
sd(Apartments$Price)
## [1] 377.8417
t.test(Apartments$Price,
       mu = 1900,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

as we concloude the T-test we can analyze that the mean price for an apartment is with the result of an 2018,9€ diferent as the null hypothesis (1900€) suggests and with p value = 0.004731, which is less than 0,05 this becomes statistically significant diference, furthermore the null hypothesis value fall out of the 95% confidence interval, with range from 1935,4 - 2100,4. All this backs the conclusion, the avarage price per square meter is not 1900€ but estimated at 2018,9€.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age,
          data = Apartments)
summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = Apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108 <0.0000000000000002 ***
## Age           -8.975      4.164  -2.156               0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

Estimate of the regression coefficient: Regression coeffciient = -8.975, the negative value represents the negative relationship between the age of an aparmtent and price, by that we mean for every one year the price goes down for 8.975€ per square meter.

Coefficient of correlation: we test correlation coeffiecency wtih conducting R-squared = 0.05302 as slope of the null hypothesis, with p value = 0.03401, which means the age does have a statistically sigificiant impact on the price, thus rejecting the null hypothesis β₁ = 0.

Coefficient of determination (R-squared): R-squared = 0.05302, suggesting that as little as 5,3% the impact of age on the price per squeare meter. Thus beeing quite low, we cannot conloude that the age itself beeing an important factor by the apartment pricing.

Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

library(car)
## Loading required package: carData
scatterplotMatrix(Apartments[ , c(1, 2, 3)],
                  smooth = FALSE) 

#install.packages("Hmisc")
library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units

Attaching package: ‘Hmisc’

The following objects are masked from ‘package:dplyr’:

src, summarize

The following objects are masked from ‘package:base’:

format.pval, units

rcorr(as.matrix(Apartments[ , c(1, 2, 3)]))
##            Age Distance Price
## Age       1.00     0.04 -0.23
## Distance  0.04     1.00 -0.63
## Price    -0.23    -0.63  1.00
## 
## n= 85 
## 
## 
## P
##          Age    Distance Price 
## Age             0.6966   0.0340
## Distance 0.6966          0.0000
## Price    0.0340 0.0000

Corelation between: Age - Distance = 0.04 is weak quite low impact on the price Age - Price = -0.23 is weak and thus not strongly related to lower prices despite the negative value of r Distance - Price = -0.63 is strong - meaning that the greater the distance the lower the price

And it does not appear to be the case of multicollinearity between Age, Distance and Price, since correlations between Age and Distance, which are independant variables is not strong.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance,
          data = Apartments)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10 < 0.0000000000000002 ***
## Age           -7.934      3.225   -2.46                0.016 *  
## Distance     -20.667      2.748   -7.52      0.0000000000618 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 0.00000000004896

Chech the multicolinearity with VIF statistics. Explain the findings.

vif(fit2)
##      Age Distance 
## 1.001845 1.001845

Age Distance

1.001845 1.001845

Both VIF results for Age and Distance are 1,001845 and with that much lower than the trashold of 5 and with that we can cocloude there is no multicollinerity between the predictors. Meaning Variables affect the model unicely and the regression model remains stable. Estimations for Price based on Age and distance are thus thrustworty.

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

Apartments$StdResid <- round(rstandard(fit2), 3) #Standardized residuals
Apartments$CooksD <- round(cooks.distance(fit2), 3) #Cooks distances

head(Apartments[, c("StdResid", "CooksD")])
##   StdResid CooksD
## 1   -0.665  0.007
## 2    1.783  0.030
## 3   -0.594  0.006
## 4    0.754  0.008
## 5   -1.073  0.005
## 6   -0.778  0.005
head(Apartments[order(Apartments$StdResid),], 5) 
##    Age Distance Price Parking Balcony StdResid CooksD
## 53   7        2  1760      No     Yes   -2.152  0.066
## 13  12       14  1650      No     Yes   -1.499  0.013
## 72  12       14  1650      No      No   -1.499  0.013
## 20  13        8  1800      No      No   -1.381  0.012
## 35  14       16  1660      No     Yes   -1.261  0.008
head(Apartments[order(-Apartments$CooksD),], 5)
##    Age Distance Price Parking Balcony StdResid CooksD
## 38   5       45  2180     Yes     Yes    2.577  0.320
## 55  43       37  1740      No      No    1.445  0.104
## 33   2       11  2790     Yes      No    2.051  0.069
## 53   7        2  1760      No     Yes   -2.152  0.066
## 22  37        3  2540     Yes     Yes    1.576  0.061

first aparmtent has the Cook’s distance = 0.320 second apartment Cook’s distance = 0.104

difference is quite big which suggest the first apartment to be an outlier since also the standardized residual is quite big 2.577 in order to confirm the suspicion, the removal of the first apartment is sugested.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:pastecs':
## 
##     first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
Apartments_new <- Apartments %>%
  filter(!CooksD %in% c(0.320))

new model:

fit2_new <- lm(Price ~ Age + Distance,
          data = Apartments_new)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10 < 0.0000000000000002 ***
## Age           -7.934      3.225   -2.46                0.016 *  
## Distance     -20.667      2.748   -7.52      0.0000000000618 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 0.00000000004896
summary(fit2_new)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -604.92 -229.63  -56.49  192.97  599.35 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 2456.076     73.931  33.221 < 0.0000000000000002 ***
## Age           -6.464      3.159  -2.046                0.044 *  
## Distance     -22.955      2.786  -8.240     0.00000000000252 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared:  0.4838, Adjusted R-squared:  0.4711 
## F-statistic: 37.96 on 2 and 81 DF,  p-value: 0.000000000002339

After removing the outlier of the first apartment the R-squared is still below 0.5 and thus the model is weak. Which means there are additional factors infuencing the price beside the Age and Distance.

Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings.

Apartments_new$StdFitted <- scale(fit2_new$fitted.values)

library(car)
scatterplot(y = Apartments_new$StdResid, x = Apartments_new$StdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

The distribution of the residuals on the scatterplot does not show eye panatrating pattern, and as mentioned in the exercise the suspicion of the heteroskedasticity arises. To test it we can choose a Breusch-Pagan test for the heteroskedasticity

#download.packages(olsrr)
#library(olsrr)
#ols_test_breusch_pagan(fit2_new)

Breusch Pagan Test for Heteroskedasticity

—————————————–

Ho: the variance is constant

Ha: the variance is not constant

Data

———————————

Response : Price

Variables: fitted values of Price

Test Summary

—————————–

DF = 1

Chi2 = 2.927455

Prob > Chi2 = 0.08708469

Breusch-Pagan test: Chi2 = 2.927455 with p = 0.08708469 which is statistically non significant and thus we reject the null hypothesis the existance of the heteroskedasticity.

Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings.

Apartments_new$StdResid <- round(rstandard(fit2_new), 3) #Standardized residuals
Apartments_new$CooksD <- round(cooks.distance(fit2_new), 3) #Cooks distances

library(ggplot2)
ggplot(Apartments_new, aes(x = StdResid)) +
  geom_histogram(binwidth = 0.2, fill = "navy", color = "gray") +
  labs(x = "Standardized Residuals", 
       y = "Frequency", 
       title = "Histogram of Standardized Residuals") +
  theme_minimal()

shapiro.test(Apartments_new$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  Apartments_new$StdResid
## W = 0.95649, p-value = 0.006355

H0: the data is normally distributed, H1: the data is not normally distributed.

P=0.006355 we can reject the H0 and thus we can concloude the unnormal distribution of hte stardardized residuals.

Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients.

fit2 <- lm(Price ~ Age + Distance,
          data = Apartments)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10 < 0.0000000000000002 ***
## Age           -7.934      3.225   -2.46                0.016 *  
## Distance     -20.667      2.748   -7.52      0.0000000000618 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 0.00000000004896

Distance: with each adiditonal unit of distance the price decreases for -20.667 and p<0,05 it is statistically significiant

Age: additional year of the apartment results in -7.934 decrease of rpice per square meter. and p<0,05 it is statistically significiant

R-squared value = 0.4396 meaning 43,96% of the variability in prices is explained by the combination of infleuences the age and distances of the apartments

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony,
          data = Apartments)

With function anova check if model fit3 fits data better than model fit2.

anova(fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     82 6720983                              
## 2     80 5991088  2    729894 4.8732 0.01007 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Anova suggests that the fit3 model follows the data much better than fit2 and with p=0.01007 is this also statistically sigificiant. meaning that adding aditional variables (Parking, Balcony) makes a models predicament of the price stornger.

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 2301.667     94.271  24.415 < 0.0000000000000002 ***
## Age           -6.799      3.110  -2.186              0.03172 *  
## Distance     -18.045      2.758  -6.543        0.00000000528 ***
## ParkingYes   196.168     62.868   3.120              0.00251 ** 
## BalconyYes     1.935     60.014   0.032              0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 0.00000000001849

Parking coefficient: addition of the parking raises the price for 196.168€ and p=0.00251 makes it statistically sigificiant Balcony coefficiant: addition of the balcony raises the price for 1.935€ but p=0.97436 which makes it statistically unsigificiant

H0: none of the adidtional independant variables have a statistically sigificiant effect on the price H1: at least one of the adidtional independant variables have a statistically sigificiant effect on the price

We reject the H0 and confirm H1

Save fitted values and claculate the residual for apartment ID2.

Apartments$Fitted <- fitted.values(fit3) 
Apartments$Residuals <- residuals(fit3) 
head(Apartments[colnames(Apartments) %in% c("Fitted", "Residuals")])
##     Fitted  Residuals
## 1 1750.741 -110.74150
## 2 2357.411  442.58893
## 3 1748.807  -88.80674
## 4 1589.921  260.07897
## 5 2052.576 -412.57579
## 6 1896.691 -126.69107
Apartments[2, c("Fitted", "Residuals")]
##     Fitted Residuals
## 2 2357.411  442.5889

The residual for Apartment n2 is 422.5889