Tajda Korosec, 25.9.2022 (R take-home Exam)

Task 1

PRESTIGE OF CANADIAN OCCUPATIONS: The Pineo-Porter method of socioeconomic status (SES) estimation attaches prestige scores to 16 occupational categories and is the basis for the Blishen scale. It assigns SES codes to the occupations listed in the 1981 Canadian Classification and Dictionary of Occupations.

library(carData)
mydata1 <- Prestige
mydata1$occupation <- rownames(mydata1)
rownames(mydata1) <- NULL


head(mydata1)
##   education income women prestige census type          occupation
## 1     13.11  12351 11.16     68.8   1113 prof  gov.administrators
## 2     12.26  25879  4.02     69.1   1130 prof    general.managers
## 3     12.77   9271 15.70     63.4   1171 prof         accountants
## 4     11.42   8865  9.11     56.8   1175 prof purchasing.officers
## 5     14.62   8403 11.68     73.5   2111 prof            chemists
## 6     15.64  11030  5.13     77.6   2113 prof          physicists
colnames(mydata1) <- c("Education", "Income", "Women", "Prestige", "Census", "Type", "Occupation")
head(mydata1)
##   Education Income Women Prestige Census Type          Occupation
## 1     13.11  12351 11.16     68.8   1113 prof  gov.administrators
## 2     12.26  25879  4.02     69.1   1130 prof    general.managers
## 3     12.77   9271 15.70     63.4   1171 prof         accountants
## 4     11.42   8865  9.11     56.8   1175 prof purchasing.officers
## 5     14.62   8403 11.68     73.5   2111 prof            chemists
## 6     15.64  11030  5.13     77.6   2113 prof          physicists

Description of this data set:

  • education: Average education of occupational incumbents, years, in 1971.

  • income: Average income of incumbents, dollars, in 1971.

  • women: Percentage of incumbents who are women.

  • prestige: Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s.

  • census: Canadian Census occupational code.

  • type: Type of occupation. A factor with levels (note: out of order): bc, Blue Collar; prof, Professional, Managerial, and Technical; wc, White Collar.
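
Since the levels of type are noted as being out of order, they could be reordered explicitly. A small sketch (nothing below depends on it):

mydata1$Type <- factor(mydata1$Type, levels = c("bc", "wc", "prof"))  # order: blue collar, white collar, professional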

summary(mydata1[, c(-6, -7)])
##    Education          Income          Women           Prestige    
##  Min.   : 6.380   Min.   :  611   Min.   : 0.000   Min.   :14.80  
##  1st Qu.: 8.445   1st Qu.: 4106   1st Qu.: 3.592   1st Qu.:35.23  
##  Median :10.540   Median : 5930   Median :13.600   Median :43.60  
##  Mean   :10.738   Mean   : 6798   Mean   :28.979   Mean   :46.83  
##  3rd Qu.:12.648   3rd Qu.: 8187   3rd Qu.:52.203   3rd Qu.:59.27  
##  Max.   :15.970   Max.   :25879   Max.   :97.510   Max.   :87.20  
##      Census    
##  Min.   :1113  
##  1st Qu.:3120  
##  Median :5135  
##  Mean   :5402  
##  3rd Qu.:8312  
##  Max.   :9517
mean(mydata1$Income)
## [1] 6797.902
library(car)
scatterplot(y = mydata1$Education,
            x = mydata1$Income,
            ylab = "Years of education",
            xlab = "Average income in dollars, in 1971",
            smooth = FALSE)

library(tidyr)
mydata1 <- drop_na(mydata1)

Average income in dollars increases with years of education.
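
The strength of this relationship could be quantified with the Pearson correlation coefficient; a quick check (output not shown here):

# Pearson correlation between average income and average education
cor(mydata1$Income, mydata1$Education)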

library(pastecs)
round(stat.desc(mydata1[, c(-6, -7)]))
##              Education   Income Women Prestige  Census
## nbr.val             98       98    98       98      98
## nbr.null             0        0     5        0       0
## nbr.na               0        0     0        0       0
## min                  6     1656     0       17    1113
## max                 16    25879    98       87    9517
## range               10    24223    98       70    8404
## sum               1058   680008  2841     4638  529206
## median              11     6036    14       44    5132
## mean                11     6939    29       47    5400
## SE.mean              0      427     3        2     271
## CI.mean.0.95         1      848     6        3     538
## var                  8 17877105   985      292 7205479
## std.dev              3     4228    31       17    2684
## coef.var             0        1     1        0       0

Descriptive statistics, using Education as an example:

  • min/max: the minimum education is 6 years and the maximum is 16 years; the range (max − min) is 10 years.

  • median: 50% of occupations have more than 11 years of education and 50% have less.

  • mean: the average education is 11 years.

  • CI.mean.0.95 = 1: with 95% confidence, the population mean lies between 11 − 1 and 11 + 1 years.

  • nbr.na: I removed all non-available values earlier, so the number of NAs is 0.

  • coef.var: to compare variability between variables we look at coef.var; variables with the same coef.var have the same relative variability.
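
For example, coef.var is simply the standard deviation divided by the mean, so it can be reproduced by hand:

# coefficient of variation = standard deviation / mean
sd(mydata1$Education) / mean(mydata1$Education)  # rounds to 0 in the table above
sd(mydata1$Income) / mean(mydata1$Income)        # rounds to 1 in the table above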

hist(mydata1$Income, 
     main = "Distribution of Income", 
     xlab = "Average income in dollars", 
     ylab = "Frequency", 
     breaks = seq(from = 0, to = 30000, by = 1000))

This histogram shows the distribution of average income in dollars in 1971.

library(ggplot2)
ggplot(mydata1, aes(x = Income)) +
  geom_histogram(binwidth = 1000, colour = "gray") +
  facet_wrap(~ Type, ncol = 1) +
  ylab("Frequency")

We can see that the prof occupations (Professional, Managerial, and Technical) have higher incomes compared to the other two types.

ggplot(mydata1, aes(x = Type, y = Income)) +
  geom_boxplot() +
  xlab("Type of occupation")

We can see that the professional type of occupation has the highest income (its median is the highest); two occupations with exceptionally high income are also visible.

Task 2

library(readr)
mydata2 <- read.table("C:/Users/Tajda/Desktop/IMB BOOTCAMP/Task 2/Body mass.csv", header = TRUE, sep = ";", dec = ",")

colnames(mydata2)[2] <- c("BodyMass")
head(mydata2)
##   ID BodyMass
## 1  1     62.1
## 2  2     64.5
## 3  3     56.5
## 4  4     53.4
## 5  5     61.3
## 6  6     62.2

Description:

  • ID: ID of a ninth grader (a student in their ninth year at school)

  • BodyMass: body mass in kilograms (kg)

1. Show the descriptive statistics of body mass and its distribution with a histogram.

mean(mydata2$BodyMass)
## [1] 62.876
sd(mydata2$BodyMass)
## [1] 6.011403
#install.packages("pastecs")
library(pastecs)
round(stat.desc(mydata2[,-1]), 2)
##      nbr.val     nbr.null       nbr.na          min          max        range 
##        50.00         0.00         0.00        49.70        83.20        33.50 
##          sum       median         mean      SE.mean CI.mean.0.95          var 
##      3143.80        62.80        62.88         0.85         1.71        36.14 
##      std.dev     coef.var 
##         6.01         0.10

Explanation:

  • nbr.val: 50 units of observation (students).

  • min/max/range: the minimum body mass is 49.70 kg, the maximum is 83.20 kg; the difference (range) is 33.50 kg.

  • mean: the average body mass of the students is 62.88 kg.

  • median: 50% of students have a body mass above 62.80 kg and 50% below.

  • CI.mean.0.95: with 95% confidence, the arithmetic mean of body mass lies between (62.88 − 1.71) kg and (62.88 + 1.71) kg (reproduced in the sketch below).

  • var: the larger the variance, the larger the variability of the variable (body mass).

  • coef.var: used to compare variability between numeric variables that are not directly comparable; here we only have one variable.
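
As a check, CI.mean.0.95 (the half-width of the 95% confidence interval for the mean) can be reproduced by hand:

# half-width of the 95% CI for the mean: t quantile times standard error
n  <- length(mydata2$BodyMass)
se <- sd(mydata2$BodyMass) / sqrt(n)
qt(0.975, df = n - 1) * se  # approx. 1.71, matching CI.mean.0.95 above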

hist(mydata2$BodyMass,
     main = "Distribution of body mass among ninth graders",
     ylab = "Frequency",
     xlab = "Body mass (kg)",
     breaks = seq(0, 100, 1),
     right = FALSE)

2. Test the following hypothesis with R: H0: μ = 59.5.

H0: μ = 59.5 (the average body mass is 59.5 kg)
H1 (alternative): μ is different from 59.5

We set α = 5%.

t.test(mydata2$BodyMass,
       mu = 59.5,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  mydata2$BodyMass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
##  61.16758 64.58442
## sample estimates:
## mean of x 
##    62.876

p-value = 0.000234 → p < 0.001, so the p-value is far smaller than α (p < α). We reject H0 at α = 5% (p < 0.001) and accept H1: the average body mass is different from 59.5 kg.

3. Explain the results and interpret the effect size.

Online schooling effect on body mass: the average mass increased with online schooling. The effect size r is between 0.3 and 0.5, so the effect is medium.

t <- 3.9711
# effect size from the t statistic and its degrees of freedom: r = sqrt(t^2 / (t^2 + df))
r <- sqrt((t^2) / (t^2 + 49))
r
## [1] 0.4934295

Task 3

Import the dataset Apartments.xlsx

library(readxl)
mydata3 <- read_xlsx("C:/Users/Tajda/Desktop/IMB BOOTCAMP/Task 3/Apartments.xlsx")
View(mydata3)
head(mydata3)
## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

Description:

  • Age: age of the apartment in years

  • Distance: distance from the city center in km

  • Price: price per m2

  • Parking: 0 - No, 1 - Yes

  • Balcony: 0 - No, 1 - Yes

Change categorical variables into factors

mydata3$ParkingFactor <- factor(mydata3$Parking,
                                levels = c(0, 1),
                                labels = c("no", "yes"))


mydata3$BalconyFactor <- factor(mydata3$Balcony,
                                levels = c(0, 1),
                                labels = c("no", "yes"))

Test the hypothesis H0: μ_Price = 1900 EUR. What can you conclude?

t.test(mydata3$Price,
       mu = 1900,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  mydata3$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

p = 0.0047, so the p-value is smaller than α (α = 5%, p < α). We reject H0 at α = 5% (indeed p < 0.01) and accept H1: the average price (true mean) is different from 1900 EUR.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age,
           data = mydata3)

summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = mydata3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401
cor(mydata3$Price, mydata3$Age)
## [1] -0.230255

Explanation:

  • Regression coefficient (−8.975): if the age of the apartment increases by 1 year, the price of the apartment on average decreases by 8.975 EUR (p = 0.034, p < α).

  • Coefficient of correlation (Pearson r = −0.230): the absolute value lies between 0.1 and 0.3, so there is a weak negative linear relationship between the price and the age of the apartment.

  • Coefficient of determination (Multiple R-squared = 0.05302): 5.3% of the variability of Price is explained by the linear effect of Age. This proportion is low (verified in the sketch below).
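
In simple regression the coefficient of determination is the square of the correlation coefficient, which can be checked directly:

# R-squared equals r^2 in simple linear regression
cor(mydata3$Price, mydata3$Age)^2  # approx. 0.053
summary(fit1)$r.squared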

Show the scatterplot matrix between Price, Age and Distance. Based on the matrix, determine if there is a potential problem with multicollinearity.

mydata4 <- mydata3[, c(3, 1, 2)]

library(car)
scatterplotMatrix(mydata4, smooth = FALSE)

Is there a potential problem with multicollinearity?

We have to look at how strong the relationship is between Age and Distance (the independent variables). From the matrix alone we cannot conclude anything, only form an impression: if the majority of points lay close to the fitted line, we could suspect a correlation between these two variables. Here the line is nearly horizontal, so there is no sign of a potential multicollinearity problem.
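
A direct numeric check of the relationship between the two predictors (a quick sketch; output not shown):

# correlation between the two explanatory variables
cor(mydata3$Age, mydata3$Distance)  # should be close to zero, consistent with the VIF values below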

Estimate the multiple regression function: Price = f(Age, Distance). Save it in an object named fit2.

fit2 <- lm(Price ~ Age + Distance,
           data = mydata3)

Check the multicollinearity with the VIF statistic. Explain the findings.

vif(fit2)
##      Age Distance 
## 1.001845 1.001845

There is no strong relationship between Age and Distance because VIF < 5 (5 is a common cutoff value). So there is no problem with multicollinearity.
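
The VIF value can also be reproduced from first principles, since VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing predictor j on the remaining predictors. A minimal sketch:

# VIF for Age: regress Age on Distance and invert 1 - R^2
r2_age <- summary(lm(Age ~ Distance, data = mydata3))$r.squared
1 / (1 - r2_age)  # approx. 1.0018, matching vif(fit2)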

Calculate standardized residuals and Cook's distances for model fit2. Remove any potentially problematic case (outlier or unit with a big influence).

mydata3$StdResid <- round(rstandard(fit2), 3)
mydata3$CooksD <- round(cooks.distance(fit2), 3)


head(mydata3[order(mydata3$StdResid),], 3)
## # A tibble: 3 × 9
##     Age Distance Price Parking Balcony ParkingFactor BalconyFac…¹ StdRe…² CooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>         <fct>          <dbl>  <dbl>
## 1     7        2  1760       0       1 no            yes            -2.15  0.066
## 2    12       14  1650       0       1 no            yes            -1.50  0.013
## 3    12       14  1650       0       0 no            no             -1.50  0.013
## # … with abbreviated variable names ¹​BalconyFactor, ²​StdResid

There are no outliers: all the standardized residuals are above −3, the lower limit in standardized form. None of the Cook's distances is above 1, so there is no unit with a big influence and no need to remove any unit.
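
The positive tail and the maxima are worth checking as well; a quick sketch using the columns created above:

# largest standardized residuals and Cook's distances
tail(mydata3[order(mydata3$StdResid), ], 3)
max(abs(mydata3$StdResid))  # within +/- 3?
max(mydata3$CooksD)         # below 1?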

Check for potential heteroskedasticity with a scatterplot between standardized residuals and standardized fitted values. Explain the findings.

mydata3$StdfittedValues <- scale(fit2$fitted.values)

library(car)
scatterplot(y = mydata3$StdResid, x = mydata3$StdfittedValues,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

Explanation of the scatterplot: the variance of the residuals appears constant across the fitted values, so we can tentatively say there is no potential heteroskedasticity. To be sure we would need to run the Breusch-Pagan test.
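
A formal check could be done with the Breusch-Pagan test, for example via the lmtest package (assuming it is installed):

# Breusch-Pagan test: H0 = constant variance (homoskedasticity)
library(lmtest)
bptest(fit2)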

Are the standardized residuals distributed normally? Show the graph and formally test it. Explain the findings.

hist(mydata3$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

shapiro.test(mydata3$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata3$StdResid
## W = 0.95303, p-value = 0.003645

H0: the variable is normally distributed.
H1: the variable is not normally distributed.

p-value = 0.003645 → the p-value is smaller than α (α = 5%, p < α), so we reject H0 and accept H1 (p = 0.0036). This means the standardized residuals are not normally distributed.

The assumption of normally distributed errors is violated, meaning the t-distributions of the test statistics may not be correct.
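
A Q-Q plot offers a complementary graphical check of normality; a minimal sketch:

# Q-Q plot: points should follow the line if the residuals are normal
qqnorm(mydata3$StdResid)
qqline(mydata3$StdResid)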

Estimate fit2 again without the potentially excluded cases and show the summary of the model. Explain all coefficients. (No cases were excluded above, so fit2 is estimated on the same data.)

fit2 <- lm(Price ~ Age + Distance,
           data = mydata3)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Explanation of all the coefficients:

  • Age (−7.934): if the age of the apartment increases by 1 year, the price of the apartment on average decreases by 7.934 EUR (p = 0.016, p < α), assuming the other variable (Distance) remains unchanged.

  • Distance (−20.667): if the distance of the apartment from the city center increases by one km, the price of the apartment on average decreases by 20.667 EUR (p < 0.001, p < α), assuming the other variable (Age) remains unchanged.

  • Coefficient of correlation (R = 0.66): the linear relationship between Price and the two explanatory variables (Age and Distance) is moderately strong (checked in the sketch below).

  • Coefficient of determination (Multiple R-squared = 0.4396): 44% of the variability of Price is explained by the linear effect of Age and Distance.
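
The multiple correlation coefficient is the square root of the coefficient of determination, which can be checked directly:

# multiple correlation coefficient R = sqrt(R-squared)
sqrt(summary(fit2)$r.squared)  # approx. 0.66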

Estimate the linear regression function Price = f(Age, Distance, Parking, Balcony). Be careful to correctly include the categorical variables. Save it in an object named fit3.

fit3 <- lm(Price ~ Age + Distance + ParkingFactor + BalconyFactor,
           data = mydata3)

With the function anova, check if model fit3 fits the data better than model fit2.

anova(fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingFactor + BalconyFactor
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     82 6720983                              
## 2     80 5991088  2    729894 4.8732 0.01007 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

H0: fit2 (model 1) is more appropriate. H1: fit3 (model 2) is more appropriate.

The p-value is 0.01007 → p < α (α = 5%): we reject H0. fit3 is more appropriate and fits the data better than fit2.

Show the results of fit3 and explain the regression coefficients for both categorical variables. Can you write down the hypothesis which is being tested with the F-statistic, shown at the bottom of the output?

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor, 
##     data = mydata3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2301.667     94.271  24.415  < 2e-16 ***
## Age                -6.799      3.110  -2.186  0.03172 *  
## Distance          -18.045      2.758  -6.543 5.28e-09 ***
## ParkingFactoryes  196.168     62.868   3.120  0.00251 ** 
## BalconyFactoryes    1.935     60.014   0.032  0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11

Explanation of the regression coefficients for both categorical variables:

Parking: given Age, Distance and Balcony, apartments with a parking place have on average a price higher by 196 EUR (p = 0.0025; p < α, α = 5%).

Balcony: p = 0.97 (p > α, α = 5%), so we cannot claim that the presence of a balcony affects the price.

Hypothesis for the F-statistic: H0: ρ² = 0; H1: ρ² > 0.

p-value < 0.001 → p < α (α = 5%) → we reject H0 at p < 0.001. This means there is a linear relationship between the dependent variable (Price) and at least one explanatory variable; the model is useful (ρ² > 0).

Save the fitted values and calculate the residual for apartment ID 2.

mydata3$Fittedvalues <- fitted.values(fit3)

mydata3$Residuals <- residuals(fit3)
head(mydata3)
## # A tibble: 6 × 12
##     Age Distance Price Parking Balcony ParkingF…¹ Balco…² StdRe…³ CooksD Stdfi…⁴
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>      <fct>     <dbl>  <dbl>   <dbl>
## 1     7       28  1640       0       1 no         yes      -0.665  0.007  -0.771
## 2    18        1  2800       1       0 yes        no        1.78   0.03    1.11 
## 3     7       28  1660       0       0 no         no       -0.594  0.006  -0.771
## 4    28       29  1850       0       1 no         yes       0.754  0.008  -1.52 
## 5    18       18  1640       1       1 yes        yes      -1.07   0.005  -0.294
## 6    28       12  1770       0       1 no         yes      -0.778  0.005  -0.116
## # … with 2 more variables: Fittedvalues <dbl>, Residuals <dbl>, and abbreviated
## #   variable names ¹​ParkingFactor, ²​BalconyFactor, ³​StdResid,
## #   ⁴​StdfittedValues[,1]

Calculation of the residual for apartment ID 2:

Price: 2800; fitted value: 2357.411

Residual: 2800 − 2357.411 = 442.6 EUR
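
The same residual can be read off the saved columns; a quick check (assuming the row order is unchanged):

# residual for the second row (apartment ID 2)
mydata3$Price[2] - mydata3$Fittedvalues[2]  # approx. 442.6
mydata3$Residuals[2]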