candy_data <- read.table("./candy_data.csv",
sep = ",",
dec = ".")
#TASK 1 ##EXPLAIN THE DATA
names(candy_data)
## [1] "V1" "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V9" "V10" "V11" "V12"
## [13] "V13"
###we can deduct the data contain 13 diferent variables
##DATA MANIPULATION
#candy_data$sugarprice <- candy_data$pricepercent / candy_data$sugarpercent
ifelse(candy_data $sugarpercent >= 0.5,
yes = 1,
no = 0 )
## logical(0)
print(candy_data)
## V1
## 1 competitorname
## 2 100 Grand
## 3 3 Musketeers
## 4 One dime
## 5 One quarter
## 6 Air Heads
## 7 Almond Joy
## 8 Baby Ruth
## 9 Boston Baked Beans
## 10 Candy Corn
## 11 Caramel Apple Pops
## 12 Charleston Chew
## 13 Chewey Lemonhead Fruit Mix
## 14 Chiclets
## 15 Dots
## 16 Dum Dums
## 17 Fruit Chews
## 18 Fun Dip
## 19 Gobstopper
## 20 Haribo Gold Bears
## 21 Haribo Happy Cola
## 22 Haribo Sour Bears
## 23 Haribo Twin Snakes
## 24 Hersheys Kisses,1,0,0,0,0,0,0,0,1,.127,.093000002,55.375454\nHersheys Krackel
## 25 Hersheys Milk Chocolate,1,0,0,0,0,0,0,1,0,.43000001,.91799998,56.490501\nHersheys Special Dark
## 26 Jawbusters
## 27 Junior Mints
## 28 Kit Kat
## 29 Laffy Taffy
## 30 Lemonhead
## 31 Lifesavers big ring gummies
## 32 Peanut butter M&Ms,1,0,0,1,0,0,0,0,1,.82499999,.65100002,71.46505\nM&Ms
## 33 Mike & Ike
## 34 Milk Duds
## 35 Milky Way
## 36 Milky Way Midnight
## 37 Milky Way Simply Caramel
## 38 Mounds
## 39 Mr Good Bar
## 40 Nerds
## 41 Nestle Butterfinger
## 42 Nestle Crunch
## 43 Nik L Nip
## 44 Now & Later
## 45 Payday
## 46 Peanut M&Ms
## 47 Pixie Sticks
## 48 Pop Rocks
## 49 Red vines
## 50 Reeses Miniatures,1,0,0,1,0,0,0,0,0,.034000002,.27900001,81.866257\nReeses Peanut Butter cup
## 51 Reeses pieces,1,0,0,1,0,0,0,0,1,.40599999,.65100002,73.43499\nReeses stuffed with pieces
## 52 Ring pop
## 53 Rolo
## 54 Root Beer Barrels
## 55 Runts
## 56 Sixlets
## 57 Skittles original
## 58 Skittles wildberry
## 59 Nestle Smarties
## 60 Smarties candy
## 61 Snickers
## 62 Snickers Crisper
## 63 Sour Patch Kids
## 64 Sour Patch Tricksters
## 65 Starburst
## 66 Strawberry bon bons
## 67 Sugar Babies
## 68 Sugar Daddy
## 69 Super Bubble
## 70 Swedish Fish
## 71 Tootsie Pop
## 72 Tootsie Roll Juniors
## 73 Tootsie Roll Midgies
## 74 Tootsie Roll Snack Bars
## 75 Trolli Sour Bites
## 76 Twix
## 77 Twizzlers
## 78 Warheads
## 79 Welchs Fruit Snacks,0,1,0,0,0,0,0,0,1,.31299999,.31299999,44.375519\nWerthers Original Caramel
## 80 Whoppers
## V2 V3 V4 V5 V6 V7 V8 V9
## 1 chocolate fruity caramel peanutyalmondy nougat crispedricewafer hard bar
## 2 1 0 1 0 0 1 0 1
## 3 1 0 0 0 1 0 0 1
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 1 0 0 0 0 0 0
## 7 1 0 0 1 0 0 0 1
## 8 1 0 1 1 1 0 0 1
## 9 0 0 0 1 0 0 0 0
## 10 0 0 0 0 0 0 0 0
## 11 0 1 1 0 0 0 0 0
## 12 1 0 0 0 1 0 0 1
## 13 0 1 0 0 0 0 0 0
## 14 0 1 0 0 0 0 0 0
## 15 0 1 0 0 0 0 0 0
## 16 0 1 0 0 0 0 1 0
## 17 0 1 0 0 0 0 0 0
## 18 0 1 0 0 0 0 1 0
## 19 0 1 0 0 0 0 1 0
## 20 0 1 0 0 0 0 0 0
## 21 0 0 0 0 0 0 0 0
## 22 0 1 0 0 0 0 0 0
## 23 0 1 0 0 0 0 0 0
## 24 1 0 0 0 0 1 0 1
## 25 1 0 0 0 0 0 0 1
## 26 0 1 0 0 0 0 1 0
## 27 1 0 0 0 0 0 0 0
## 28 1 0 0 0 0 1 0 1
## 29 0 1 0 0 0 0 0 0
## 30 0 1 0 0 0 0 1 0
## 31 0 1 0 0 0 0 0 0
## 32 1 0 0 0 0 0 0 0
## 33 0 1 0 0 0 0 0 0
## 34 1 0 1 0 0 0 0 0
## 35 1 0 1 0 1 0 0 1
## 36 1 0 1 0 1 0 0 1
## 37 1 0 1 0 0 0 0 1
## 38 1 0 0 0 0 0 0 1
## 39 1 0 0 1 0 0 0 1
## 40 0 1 0 0 0 0 1 0
## 41 1 0 0 1 0 0 0 1
## 42 1 0 0 0 0 1 0 1
## 43 0 1 0 0 0 0 0 0
## 44 0 1 0 0 0 0 0 0
## 45 0 0 0 1 1 0 0 1
## 46 1 0 0 1 0 0 0 0
## 47 0 0 0 0 0 0 0 0
## 48 0 1 0 0 0 0 1 0
## 49 0 1 0 0 0 0 0 0
## 50 1 0 0 1 0 0 0 0
## 51 1 0 0 1 0 0 0 0
## 52 0 1 0 0 0 0 1 0
## 53 1 0 1 0 0 0 0 0
## 54 0 0 0 0 0 0 1 0
## 55 0 1 0 0 0 0 1 0
## 56 1 0 0 0 0 0 0 0
## 57 0 1 0 0 0 0 0 0
## 58 0 1 0 0 0 0 0 0
## 59 1 0 0 0 0 0 0 0
## 60 0 1 0 0 0 0 1 0
## 61 1 0 1 1 1 0 0 1
## 62 1 0 1 1 0 1 0 1
## 63 0 1 0 0 0 0 0 0
## 64 0 1 0 0 0 0 0 0
## 65 0 1 0 0 0 0 0 0
## 66 0 1 0 0 0 0 1 0
## 67 0 0 1 0 0 0 0 0
## 68 0 0 1 0 0 0 0 0
## 69 0 1 0 0 0 0 0 0
## 70 0 1 0 0 0 0 0 0
## 71 1 1 0 0 0 0 1 0
## 72 1 0 0 0 0 0 0 0
## 73 1 0 0 0 0 0 0 0
## 74 1 0 0 0 0 0 0 1
## 75 0 1 0 0 0 0 0 0
## 76 1 0 1 0 0 1 0 1
## 77 0 1 0 0 0 0 0 0
## 78 0 1 0 0 0 0 1 0
## 79 0 0 1 0 0 0 1 0
## 80 1 0 0 0 0 1 0 0
## V10 V11 V12 V13
## 1 pluribus sugarpercent pricepercent winpercent
## 2 0 .73199999 .86000001 66.971725
## 3 0 .60399997 .51099998 67.602936
## 4 0 .011 .116 32.261086
## 5 0 .011 .51099998 46.116505
## 6 0 .90600002 .51099998 52.341465
## 7 0 .465 .76700002 50.347546
## 8 0 .60399997 .76700002 56.914547
## 9 1 .31299999 .51099998 23.417824
## 10 1 .90600002 .32499999 38.010963
## 11 0 .60399997 .32499999 34.517681
## 12 0 .60399997 .51099998 38.975037
## 13 1 .73199999 .51099998 36.017628
## 14 1 .046 .32499999 24.524988
## 15 1 .73199999 .51099998 42.272076
## 16 0 .73199999 .034000002 39.460556
## 17 1 .127 .034000002 43.088924
## 18 0 .73199999 .32499999 39.185505
## 19 1 .90600002 .45300001 46.783348
## 20 1 .465 .465 57.11974
## 21 1 .465 .465 34.158958
## 22 1 .465 .465 51.41243
## 23 1 .465 .465 42.178772
## 24 0 .43000001 .91799998 62.284481
## 25 0 .43000001 .91799998 59.236122
## 26 1 .093000002 .51099998 28.127439
## 27 1 .197 .51099998 57.21925
## 28 0 .31299999 .51099998 76.7686
## 29 0 .22 .116 41.389557
## 30 0 .046 .104 39.141056
## 31 0 .26699999 .27900001 52.911392
## 32 1 .82499999 .65100002 66.574585
## 33 1 .87199998 .32499999 46.411716
## 34 1 .30199999 .51099998 55.064072
## 35 0 .60399997 .65100002 73.099556
## 36 0 .73199999 .44100001 60.800701
## 37 0 .96499997 .86000001 64.35334
## 38 0 .31299999 .86000001 47.829754
## 39 0 .31299999 .91799998 54.526451
## 40 1 .84799999 .32499999 55.354046
## 41 0 .60399997 .76700002 70.735641
## 42 0 .31299999 .76700002 66.47068
## 43 1 .197 .97600001 22.445341
## 44 1 .22 .32499999 39.4468
## 45 0 .465 .76700002 46.296597
## 46 1 .59299999 .65100002 69.483788
## 47 1 .093000002 .023 37.722336
## 48 1 .60399997 .83700001 41.265511
## 49 1 .58099997 .116 37.348522
## 50 0 .72000003 .65100002 84.18029
## 51 0 .98799998 .65100002 72.887901
## 52 0 .73199999 .96499997 35.290756
## 53 1 .86000001 .86000001 65.716286
## 54 1 .73199999 .068999998 29.703691
## 55 1 .87199998 .27900001 42.849144
## 56 1 .22 .081 34.722
## 57 1 .94099998 .22 63.08514
## 58 1 .94099998 .22 55.103695
## 59 1 .26699999 .97600001 37.887188
## 60 1 .26699999 .116 45.995827
## 61 0 .546 .65100002 76.673782
## 62 0 .60399997 .65100002 59.529251
## 63 1 .068999998 .116 59.863998
## 64 1 .068999998 .116 52.825947
## 65 1 .15099999 .22 67.037628
## 66 1 .56900001 .057999998 34.578991
## 67 1 .96499997 .76700002 33.43755
## 68 0 .41800001 .32499999 32.230995
## 69 0 .162 .116 27.303865
## 70 1 .60399997 .755 54.861111
## 71 0 .60399997 .32499999 48.982651
## 72 0 .31299999 .51099998 43.068897
## 73 1 .17399999 .011 45.736748
## 74 0 .465 .32499999 49.653503
## 75 1 .31299999 .255 47.173229
## 76 0 .546 .90600002 81.642914
## 77 0 .22 .116 45.466282
## 78 0 .093000002 .116 39.011898
## 79 0 .186 .26699999 41.904308
## 80 1 .87199998 .84799999 49.524113
##CREATING NEW DATA FRAME, THAT INCLOUDES ONLY Chocolate and Nougat
candy_data2 <- candy_data [ , c(2,6) ]
print(candy_data2)
## V2 V6
## 1 chocolate nougat
## 2 1 0
## 3 1 1
## 4 0 0
## 5 0 0
## 6 0 0
## 7 1 0
## 8 1 1
## 9 0 0
## 10 0 0
## 11 0 0
## 12 1 1
## 13 0 0
## 14 0 0
## 15 0 0
## 16 0 0
## 17 0 0
## 18 0 0
## 19 0 0
## 20 0 0
## 21 0 0
## 22 0 0
## 23 0 0
## 24 1 0
## 25 1 0
## 26 0 0
## 27 1 0
## 28 1 0
## 29 0 0
## 30 0 0
## 31 0 0
## 32 1 0
## 33 0 0
## 34 1 0
## 35 1 1
## 36 1 1
## 37 1 0
## 38 1 0
## 39 1 0
## 40 0 0
## 41 1 0
## 42 1 0
## 43 0 0
## 44 0 0
## 45 0 1
## 46 1 0
## 47 0 0
## 48 0 0
## 49 0 0
## 50 1 0
## 51 1 0
## 52 0 0
## 53 1 0
## 54 0 0
## 55 0 0
## 56 1 0
## 57 0 0
## 58 0 0
## 59 1 0
## 60 0 0
## 61 1 1
## 62 1 0
## 63 0 0
## 64 0 0
## 65 0 0
## 66 0 0
## 67 0 0
## 68 0 0
## 69 0 0
## 70 0 0
## 71 1 0
## 72 1 0
## 73 1 0
## 74 1 0
## 75 0 0
## 76 1 0
## 77 0 0
## 78 0 0
## 79 0 0
## 80 1 0
candy_data3 <- candy_data2 [ c(-5,-10, -15) , ]
print(candy_data3)
## V2 V6
## 1 chocolate nougat
## 2 1 0
## 3 1 1
## 4 0 0
## 6 0 0
## 7 1 0
## 8 1 1
## 9 0 0
## 11 0 0
## 12 1 1
## 13 0 0
## 14 0 0
## 16 0 0
## 17 0 0
## 18 0 0
## 19 0 0
## 20 0 0
## 21 0 0
## 22 0 0
## 23 0 0
## 24 1 0
## 25 1 0
## 26 0 0
## 27 1 0
## 28 1 0
## 29 0 0
## 30 0 0
## 31 0 0
## 32 1 0
## 33 0 0
## 34 1 0
## 35 1 1
## 36 1 1
## 37 1 0
## 38 1 0
## 39 1 0
## 40 0 0
## 41 1 0
## 42 1 0
## 43 0 0
## 44 0 0
## 45 0 1
## 46 1 0
## 47 0 0
## 48 0 0
## 49 0 0
## 50 1 0
## 51 1 0
## 52 0 0
## 53 1 0
## 54 0 0
## 55 0 0
## 56 1 0
## 57 0 0
## 58 0 0
## 59 1 0
## 60 0 0
## 61 1 1
## 62 1 0
## 63 0 0
## 64 0 0
## 65 0 0
## 66 0 0
## 67 0 0
## 68 0 0
## 69 0 0
## 70 0 0
## 71 1 0
## 72 1 0
## 73 1 0
## 74 1 0
## 75 0 0
## 76 1 0
## 77 0 0
## 78 0 0
## 79 0 0
## 80 1 0
names(candy_data)[names(candy_data) == "peanutyalmondy"] <- "NUTS"
#3. Present the descriptive statistics
summary(candy_data [ , c(-1,-2,-3,-4,-5,-6,-7,-8,-9,-10)])
## V11 V12 V13
## Length:80 Length:80 Length:80
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
###in our new variable sugarprice the mean price percentage per sugar precentage is 1,97 and the meadiana for Sugarprice is 1,0.
colnames(candy_data)
## [1] "V1" "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V9" "V10" "V11" "V12"
## [13] "V13"
#install.packages("ggplot2") # Install ggplot2 if you haven't already
library(ggplot2) # Load ggplot2
# Assuming candy_data is your data frame
ggplot(data = candy_data, aes(x = V11, y = V13)) +
geom_point() + # Adds points
labs(title = "Scatter Plot of Win Percentage vs Sugar Percentage",
x = "Sugar Percentage",
y = "Win Percentage") +
theme_minimal() # Optional: adds a clean theme
##Results interpretation
The plot is titled “Scatter Plot of Win Percentage vs Sugar Percentage,” indicating it illustrates the relationship between these two metrics.
on the X-Axis we have sugarpercentage, ranging from 0 to 100.
and on the y-axis we have winpercent, ranging from 0% to 100%.
Data Points: Each point represents a specific candy’s sugar percentage and its corresponding win percentage.
patterns: we cannot conlcude there are any patterns visible and from this we cannot conclude any relation whatsoever between sugar precentage and win precentage.
###this is probably a less optimal example of raw data to analyze and analyze the data in R studio program for this task, however I believe I learned from this exercise and hope that I have showed I understand the concept of processing statistical data by programming in Rstudio.
#TASK 2
the average grade in this program was 74.
library(readxl)
Business_School <- read_excel("C:/Users/Gregor/Desktop/IMB/BOOTCAMP/R Take Home Exam 2024/Task 2/Business School.xlsx")
Business_School <- as.data.frame(Business_School)
head(Business_School)
## Student ID Undergrad Degree Undergrad Grade MBA Grade Work Experience
## 1 1 Business 68.4 90.2 No
## 2 2 Computer Science 70.2 68.7 Yes
## 3 3 Finance 76.4 83.3 No
## 4 4 Business 82.6 88.7 No
## 5 5 Finance 76.9 75.4 No
## 6 6 Computer Science 83.3 82.1 No
## Employability (Before) Employability (After) Status Annual Salary
## 1 252 276 Placed 111000
## 2 101 119 Placed 107000
## 3 401 462 Placed 109000
## 4 287 342 Placed 148000
## 5 275 347 Placed 255500
## 6 254 313 Placed 103500
C:/Users/Gregor/Desktop/IMB/BOOTCAMP/R Take Home Exam 2024/Task 2/Business School.xlsx
##1 Graph the distribution of undergrad degrees using the ggplot function. Which degree is the most common?
library(ggplot2)
ggplot(Business_School, aes(x = `Undergrad Degree`)) +
geom_bar() +
ylab("Frequency") +
scale_fill_brewer(palette="blues")
## Warning: Unknown palette: "blues"
in the histogram we can seethe Business was the most common undegraduate
degree inthe population of students, followed by Finance, Computer
science, Engineering and Art on the last place
library(pastecs)
round(stat.desc(Business_School$`Annual Salary`))
## nbr.val nbr.null nbr.na min max range
## 100 0 0 20000 340000 320000
## sum median mean SE.mean CI.mean.0.95 var
## 10905800 103500 109058 4150 8235 1722373475
## std.dev coef.var
## 41501 0
library(ggplot2)
options(scipen = 999)
ggplot(Business_School, aes(x=`Annual Salary`)) +
geom_histogram(binwidth = 5000, fill = "navy", color = "black") +
ylab("Frequency") +
xlab("Annual Salary") +
ggtitle("Distribution of Annual Salary for MBA Students") +
theme_minimal()
With descriptive statistics we can analyze that the annual sllaries of the MBA students range from minimal 20,000 to maximal 340,000. The avarage is 10,900 and the mediana is 103,500, which means half of population erns less than 109,500 per year and half of the population more.
The histogram shows skeewness to the right, possibly coused by the few outliers which drastically exceed 200,000 annually, which would have been upper limit of the unskeewed histogram shall top three earners be remowed as outliers and the population be analyzed as normally distibuted.
Additionally, to improve the readability of the histogram, with some colleagues we used the options(scipen = 999) function, which prevents scientific notation from being used on the x-axis. This method was sourced from this link: https://stackoverflow.com/questions/5352099/how-can-i-disable-scientific-notation).
#3 est the following hypothesis: 𝐻0: 𝜇MBA Grade = 74
mean(Business_School$`MBA Grade`)
## [1] 76.04055
sd(Business_School$`MBA Grade`)
## [1] 7.675114
t.test(Business_School$`MBA Grade`,
mu = 74,
alternative = "two.sided")
##
## One Sample t-test
##
## data: Business_School$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
Hypotheses: - Null (H₀): The mean MBA grade is equal to 74 (𝜇 = 74). - Alternative (H₁): The mean MBA grade is not 74 (𝜇 ≠ 74).
Upon running the one sample t-test we get results of mean = 76,04 and p-value = 0.00915, which is statistically significant and does not confirm our H₀hypotheses.And thus we cffnirm H₁ hypotheses.
So this year avarage grade is about two points huigher than lasts year generation, and that is statisticaly significiant but the answer to a question how much of the meaning this is to the school is beyond our data and understanding of the faculy program.
#TASK 3 You analyze the price per m2 for a sample of apartments in the Ljubljana region (Apartments.xlsx in Task 3 folder). Follow the questions in R Markdown included in the folder.
output: html_document: # code_folding: hide —
library(readxl)
Apartments <- read_excel("C:/Users/Gregor/Desktop/IMB/BOOTCAMP/R Take Home Exam 2024/Task 3/Apartments.xlsx")
Apartments <- as.data.frame(Apartments)
head(Apartments)
## Age Distance Price Parking Balcony
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description:
Apartments$Parking <- factor(Apartments$Parking,
levels = c (0, 1),
labels = c ("No", "Yes"))
head(Apartments)
## Age Distance Price Parking Balcony
## 1 7 28 1640 No 1
## 2 18 1 2800 Yes 0
## 3 7 28 1660 No 0
## 4 28 29 1850 No 1
## 5 18 18 1640 Yes 1
## 6 28 12 1770 No 1
Apartments$Balcony <- factor(Apartments$Balcony,
levels = c (0, 1),
labels = c ("No", "Yes"))
mean(Apartments$Price)
## [1] 2018.941
sd(Apartments$Price)
## [1] 377.8417
t.test(Apartments$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
as we concloude the T-test we can analyze that the mean price for an apartment is with the result of an 2018,9€ diferent as the null hypothesis (1900€) suggests and with p value = 0.004731, which is less than 0,05 this becomes statistically significant diference, furthermore the null hypothesis value fall out of the 95% confidence interval, with range from 1935,4 - 2100,4. All this backs the conclusion, the avarage price per square meter is not 1900€ but estimated at 2018,9€.
fit1 <- lm(Price ~ Age,
data = Apartments)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <0.0000000000000002 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
Estimate of the regression coefficient: Regression coeffciient = -8.975, the negative value represents the negative relationship between the age of an aparmtent and price, by that we mean for every one year the price goes down for 8.975€ per square meter.
Coefficient of correlation: we test correlation coeffiecency wtih conducting R-squared = 0.05302 as slope of the null hypothesis, with p value = 0.03401, which means the age does have a statistically sigificiant impact on the price, thus rejecting the null hypothesis β₁ = 0.
Coefficient of determination (R-squared): R-squared = 0.05302, suggesting that as little as 5,3% the impact of age on the price per squeare meter. Thus beeing quite low, we cannot conloude that the age itself beeing an important factor by the apartment pricing.
library(car)
## Loading required package: carData
scatterplotMatrix(Apartments[ , c(1, 2, 3)],
smooth = FALSE)
#install.packages("Hmisc")
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(Apartments[ , c(1, 2, 3)]))
## Age Distance Price
## Age 1.00 0.04 -0.23
## Distance 0.04 1.00 -0.63
## Price -0.23 -0.63 1.00
##
## n= 85
##
##
## P
## Age Distance Price
## Age 0.6966 0.0340
## Distance 0.6966 0.0000
## Price 0.0340 0.0000
Corelation between: Age - Distance = 0.04 is weak quite low impact on the price Age - Price = -0.23 is weak and thus not strongly related to lower prices despite the negative value of r Distance - Price = -0.63 is strong - meaning that the greater the distance the lower the price
And it does not appear to be the case of multicollinearity between Age, Distance and Price, since correlations between Age and Distance, which are independant variables is not strong.
fit2 <- lm(Price ~ Age + Distance,
data = Apartments)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 0.0000000000000002 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 0.0000000000618 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 0.00000000004896
vif(fit2)
## Age Distance
## 1.001845 1.001845
Both VIF results for Age and Distance are 1,001845 and with that much lower than the trashold of 5 and with that we can cocloude there is no multicollinerity between the predictors. Meaning Variables affect the model unicely and the regression model remains stable. Estimations for Price based on Age and distance are thus thrustworty.
Apartments$StdResid <- round(rstandard(fit2), 3) #Standardized residuals
Apartments$CooksD <- round(cooks.distance(fit2), 3) #Cooks distances
head(Apartments[, c("StdResid", "CooksD")])
## StdResid CooksD
## 1 -0.665 0.007
## 2 1.783 0.030
## 3 -0.594 0.006
## 4 0.754 0.008
## 5 -1.073 0.005
## 6 -0.778 0.005
head(Apartments[order(Apartments$StdResid),], 5)
## Age Distance Price Parking Balcony StdResid CooksD
## 53 7 2 1760 No Yes -2.152 0.066
## 13 12 14 1650 No Yes -1.499 0.013
## 72 12 14 1650 No No -1.499 0.013
## 20 13 8 1800 No No -1.381 0.012
## 35 14 16 1660 No Yes -1.261 0.008
head(Apartments[order(-Apartments$CooksD),], 5)
## Age Distance Price Parking Balcony StdResid CooksD
## 38 5 45 2180 Yes Yes 2.577 0.320
## 55 43 37 1740 No No 1.445 0.104
## 33 2 11 2790 Yes No 2.051 0.069
## 53 7 2 1760 No Yes -2.152 0.066
## 22 37 3 2540 Yes Yes 1.576 0.061
first aparmtent has the Cook’s distance = 0.320 second apartment Cook’s distance = 0.104
difference is quite big which suggest the first apartment to be an outlier since also the standardized residual is quite big 2.577 in order to confirm the suspicion, the removal of the first apartment is sugested.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
##
## src, summarize
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:pastecs':
##
## first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Apartments_new <- Apartments %>%
filter(!CooksD %in% c(0.320))
new model:
fit2_new <- lm(Price ~ Age + Distance,
data = Apartments_new)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 0.0000000000000002 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 0.0000000000618 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 0.00000000004896
summary(fit2_new)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -604.92 -229.63 -56.49 192.97 599.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2456.076 73.931 33.221 < 0.0000000000000002 ***
## Age -6.464 3.159 -2.046 0.044 *
## Distance -22.955 2.786 -8.240 0.00000000000252 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared: 0.4838, Adjusted R-squared: 0.4711
## F-statistic: 37.96 on 2 and 81 DF, p-value: 0.000000000002339
After removing the outlier of the first apartment the R-squared is still below 0.5 and thus the model is weak. Which means there are additional factors infuencing the price beside the Age and Distance.
Apartments_new$StdFitted <- scale(fit2_new$fitted.values)
library(car)
scatterplot(y = Apartments_new$StdResid, x = Apartments_new$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
The distribution of the residuals on the scatterplot does not show eye
panatrating pattern, and as mentioned in the exercise the suspicion of
the heteroskedasticity arises. To test it we can choose a Breusch-Pagan
test for the heteroskedasticity
#download.packages(olsrr)
#library(olsrr)
#ols_test_breusch_pagan(fit2_new)
Breusch-Pagan test: Chi2 = 2.927455 with p = 0.08708469 which is statistically non significant and thus we reject the null hypothesis the existance of the heteroskedasticity.
Apartments_new$StdResid <- round(rstandard(fit2_new), 3) #Standardized residuals
Apartments_new$CooksD <- round(cooks.distance(fit2_new), 3) #Cooks distances
library(ggplot2)
ggplot(Apartments_new, aes(x = StdResid)) +
geom_histogram(binwidth = 0.2, fill = "navy", color = "gray") +
labs(x = "Standardized Residuals",
y = "Frequency",
title = "Histogram of Standardized Residuals") +
theme_minimal()
shapiro.test(Apartments_new$StdResid)
##
## Shapiro-Wilk normality test
##
## data: Apartments_new$StdResid
## W = 0.95649, p-value = 0.006355
H0: the data is normally distributed, H1: the data is not normally distributed.
P=0.006355 we can reject the H0 and thus we can concloude the unnormal distribution of hte stardardized residuals.
fit2 <- lm(Price ~ Age + Distance,
data = Apartments)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 0.0000000000000002 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 0.0000000000618 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 0.00000000004896
Distance: with each adiditonal unit of distance the price decreases for -20.667 and p<0,05 it is statistically significiant
Age: additional year of the apartment results in -7.934 decrease of rpice per square meter. and p<0,05 it is statistically significiant
R-squared value = 0.4396 meaning 43,96% of the variability in prices is explained by the combination of infleuences the age and distances of the apartments
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony,
data = Apartments)
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 82 6720983
## 2 80 5991088 2 729894 4.8732 0.01007 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova suggests that the fit3 model follows the data much better than fit2 and with p=0.01007 is this also statistically sigificiant. meaning that adding aditional variables (Parking, Balcony) makes a models predicament of the price stornger.
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 0.0000000000000002 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 0.00000000528 ***
## ParkingYes 196.168 62.868 3.120 0.00251 **
## BalconyYes 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 0.00000000001849
Parking coefficient: addition of the parking raises the price for 196.168€ and p=0.00251 makes it statistically sigificiant Balcony coefficiant: addition of the balcony raises the price for 1.935€ but p=0.97436 which makes it statistically unsigificiant
H0: none of the adidtional independant variables have a statistically sigificiant effect on the price H1: at least one of the adidtional independant variables have a statistically sigificiant effect on the price
We reject the H0 and confirm H1
Apartments$Fitted <- fitted.values(fit3)
Apartments$Residuals <- residuals(fit3)
head(Apartments[colnames(Apartments) %in% c("Fitted", "Residuals")])
## Fitted Residuals
## 1 1750.741 -110.74150
## 2 2357.411 442.58893
## 3 1748.807 -88.80674
## 4 1589.921 260.07897
## 5 2052.576 -412.57579
## 6 1896.691 -126.69107
Apartments[2, c("Fitted", "Residuals")]
## Fitted Residuals
## 2 2357.411 442.5889
The residual for Apartment n2 is 422.5889