PRESTIGE OF CANADIAN OCCUPATIONS: The Pineo-Porter method of socioeconomic status (SES) estimation attaches prestige scores to 16 occupational categories and is the basis for the Blishen scale. It assigns SES codes to the occupations listed in the 1981 Canadian Classification and Dictionary of Occupations.
library(carData)
mydata1 <- Prestige
mydata1$occupation <- rownames(mydata1)
rownames(mydata1) <- NULL
head(mydata1)
## education income women prestige census type occupation
## 1 13.11 12351 11.16 68.8 1113 prof gov.administrators
## 2 12.26 25879 4.02 69.1 1130 prof general.managers
## 3 12.77 9271 15.70 63.4 1171 prof accountants
## 4 11.42 8865 9.11 56.8 1175 prof purchasing.officers
## 5 14.62 8403 11.68 73.5 2111 prof chemists
## 6 15.64 11030 5.13 77.6 2113 prof physicists
colnames(mydata1) <- c("Education", "Income", "Women", "Prestige", "Census", "Type", "Occupation")
head(mydata1)
## Education Income Women Prestige Census Type Occupation
## 1 13.11 12351 11.16 68.8 1113 prof gov.administrators
## 2 12.26 25879 4.02 69.1 1130 prof general.managers
## 3 12.77 9271 15.70 63.4 1171 prof accountants
## 4 11.42 8865 9.11 56.8 1175 prof purchasing.officers
## 5 14.62 8403 11.68 73.5 2111 prof chemists
## 6 15.64 11030 5.13 77.6 2113 prof physicists
Description of this data set:
education: Average education of occupational incumbents, years, in 1971.
income: Average income of incumbents, dollars, in 1971.
women: Percentage of incumbents who are women.
prestige: Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s.
census: Canadian Census occupational code.
type: Type of occupation. A factor with levels (note: out of order): bc, Blue Collar; prof, Professional, Managerial, and Technical; wc, White Collar.
summary(mydata1[ , c(-6, -7)])
## Education Income Women Prestige
## Min. : 6.380 Min. : 611 Min. : 0.000 Min. :14.80
## 1st Qu.: 8.445 1st Qu.: 4106 1st Qu.: 3.592 1st Qu.:35.23
## Median :10.540 Median : 5930 Median :13.600 Median :43.60
## Mean :10.738 Mean : 6798 Mean :28.979 Mean :46.83
## 3rd Qu.:12.648 3rd Qu.: 8187 3rd Qu.:52.203 3rd Qu.:59.27
## Max. :15.970 Max. :25879 Max. :97.510 Max. :87.20
## Census
## Min. :1113
## 1st Qu.:3120
## Median :5135
## Mean :5402
## 3rd Qu.:8312
## Max. :9517
mean(mydata1$Income)
## [1] 6797.902
library(car)
scatterplot(y = mydata1$Education,
            x = mydata1$Income,
            ylab = "Years of education",
            xlab = "Average income in dollars, in 1971",
            smooth = FALSE)
library(tidyr)
mydata1 <- drop_na(mydata1)
Average income in dollars increases with years of education.
library(pastecs)
round(stat.desc(mydata1[ ,c(-6,-7)]))
## Education Income Women Prestige Census
## nbr.val 98 98 98 98 98
## nbr.null 0 0 5 0 0
## nbr.na 0 0 0 0 0
## min 6 1656 0 17 1113
## max 16 25879 98 87 9517
## range 10 24223 98 70 8404
## sum 1058 680008 2841 4638 529206
## median 11 6036 14 44 5132
## mean 11 6939 29 47 5400
## SE.mean 0 427 3 2 271
## CI.mean.0.95 1 848 6 3 538
## var 8 17877105 985 292 7205479
## std.dev 3 4228 31 17 2684
## coef.var 0 1 1 0 0
Descriptive statistics, for example for Education:
- max education: 16 years, min: 6 years. The difference (range) between max and min education is 10 years.
- median: 50% of occupations require more than 11 years of education, 50% require less than 11 years.
- mean: the average years of education is 11 years.
- CI.mean.0.95: 1 –> with 95% confidence the population mean is between 11 - 1 and 11 + 1 years.
- nbr.na: I removed all non-available values, so the number is 0.
- coef.var: if we want to compare the variability of different variables we have to look at coef.var; variables with the same coef.var have the same relative variability.
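The coefficient of variation is simply the standard deviation divided by the mean. As a quick check, it can be reproduced directly, here for Education (a minimal sketch using the renamed columns above; output not shown):
sd(mydata1$Education) / mean(mydata1$Education)   # std.dev / mean = coef.var for Education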
hist(mydata1$Income,
main = "Distribution of Income",
xlab = "Average income in dollars",
ylab = "Frequency",
breaks = seq(from = 0, to = 30000, by = 1000))
This histogram shows the distribution of average income in dollars in 1971.
library(ggplot2)
ggplot(mydata1, aes(x = Income)) +
  geom_histogram(binwidth = 1000, colour = "gray") +
  facet_wrap(~ Type, ncol = 1) +
  ylab("Frequency")
We can see that prof occupations (type of occupation prof: Professional, Managerial, and Technical) have higher incomes compared to the other two types.
ggplot(mydata1, aes(x = Type, y = Income)) +
  geom_boxplot() +
  xlab("Type of occupation")
We can see that the professional type of occupation has higher income (its median is the highest), and two occupations with extra high income are visible as outliers (one way to identify them is sketched below).
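To see which two occupations these are, the data can be sorted by income (a quick sketch using the renamed columns above; output not shown):
# the two occupations with the highest average income
head(mydata1[order(-mydata1$Income), c("Occupation", "Type", "Income")], 2)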
## Task 2
library(readr)
mydata2 <- read.table("C:/Users/Tajda/Desktop/IMB BOOTCAMP/Task 2/Body mass.csv", header = TRUE, sep = ";", dec = ",")
colnames(mydata2)[2] <- c("BodyMass")
head(mydata2)
## ID BodyMass
## 1 1 62.1
## 2 2 64.5
## 3 3 56.5
## 4 4 53.4
## 5 5 61.3
## 6 6 62.2
Description:
- ID: ID of a ninth grader (a student in their ninth year at school)
- BodyMass: body weight in kilograms (kg)
1. Show the descriptive statistics of body mass and its distribution with a histogram.
mean(mydata2$BodyMass)
## [1] 62.876
sd(mydata2$BodyMass)
## [1] 6.011403
#install.packages("pastecs")
library(pastecs)
round(stat.desc(mydata2[,-1]), 2)
## nbr.val nbr.null nbr.na min max range
## 50.00 0.00 0.00 49.70 83.20 33.50
## sum median mean SE.mean CI.mean.0.95 var
## 3143.80 62.80 62.88 0.85 1.71 36.14
## std.dev coef.var
## 6.01 0.10
Explanation:
- nbr.val: 50 units of observation (students).
- min/max/range: the minimum body mass is 49.70 kg and the maximum is 83.20 kg, so the range is 33.50 kg.
- mean: the average body mass of the students is 62.88 kg.
- median: 50% of the students have a body mass above 62.80 kg and 50% below 62.80 kg.
- CI.mean.0.95: with 95% confidence the population mean body mass lies between (62.88 - 1.71) kg and (62.88 + 1.71) kg.
- var/std.dev: the larger the variance, the larger the variability of body mass.
- coef.var: used to compare the variability of numeric variables that are not directly comparable; here we only have one variable.
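The CI.mean.0.95 value is the half-width of the 95% confidence interval for the mean. As a check, the interval can be computed by hand (a minimal sketch; it should match mean ± CI.mean.0.95 above):
n <- length(mydata2$BodyMass)   # number of students
mean(mydata2$BodyMass) + c(-1, 1) * qt(0.975, df = n - 1) * sd(mydata2$BodyMass) / sqrt(n)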
hist(mydata2$BodyMass,
main = "Distribution of Body mass among ninth graders",
ylab = "Frequency",
xlab = "Body mass",
breaks = seq(0, 100, 1),
right = FALSE)
2. Test the following hypothesis with R: H0: μ = 59.5.
H0: μ = 59.5 (the average body mass is 59.5 kg); H1 (alternative): μ is different from 59.5.
We set α = 5%.
t.test(mydata2$BodyMass,
mu =59.5,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata2$BodyMass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
## 61.16758 64.58442
## sample estimates:
## mean of x
## 62.876
p-value = 0.000234, which is far smaller than α = 0.05 (p < α), so we reject H0 at p < 0.001. We conclude (accept H1) that the average body mass is different from 59.5 kg at α = 5%.
Possible interpretation: online schooling affected body weight, i.e. the average body mass increased during online schooling. The effect size r = sqrt(t^2 / (t^2 + df)) is between 0.3 and 0.5, so the effect is medium.
t <- 3.9711                      # t statistic from the one-sample t-test above
r <- sqrt((t^2) / (t^2 + 49))    # effect size r, with df = 49
r
## [1] 0.4934295
Import the dataset Apartments.xlsx
library(readxl)
mydata3 <- read_xlsx("C:/Users/Tajda/Desktop/IMB BOOTCAMP/Task 3/Apartments.xlsx")
View(mydata3)
head(mydata3)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description:
- Age: age of the apartment in years
- Distance: distance from the city center in km
- Price: price per m2
- Parking: 0 = No, 1 = Yes
- Balcony: 0 = No, 1 = Yes
mydata3$ParkingFactor <- factor(mydata3$Parking,
                                levels = c(0, 1),
                                labels = c("no", "yes"))
mydata3$BalconyFactor <- factor(mydata3$Balcony,
                                levels = c(0, 1),
                                labels = c("no", "yes"))
t.test(mydata3$Price,
mu =1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata3$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
We test H0: μ = 1900 against H1: μ ≠ 1900 at α = 5%. The p-value = 0.0047 is smaller than α (p < α), so we reject H0 at p < 0.01. We conclude (accept H1) that the average price (true mean) is different from 1900 EUR at α = 5%; consistently, the 95% confidence interval (1937.4, 2100.4) does not contain 1900.
fit1 <- lm(Price ~ Age,
data =mydata3)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
cor(mydata3$Price, mydata3$Age)
## [1] -0.230255
Explanation:
- Estimate of the regression coefficient for Age (-8.975): if the age of the apartment increases by 1 year, the price per m2 on average decreases by 8.975 EUR (p = 0.034, p < α).
- Coefficient of correlation (Pearson r) = -0.230: its absolute value is between 0.1 and 0.3, so there is a weak negative linear relationship between price and age of the apartment.
- Coefficient of determination (Multiple R-squared) = 0.05302: 5.3% of the variability of Price is explained by the linear effect of Age. This proportion is low.
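In simple linear regression the coefficient of determination is just the squared Pearson correlation, which can be verified directly (output not shown):
cor(mydata3$Price, mydata3$Age)^2   # equals Multiple R-squared of fit1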
mydata4 <- mydata3[ , c(3,1,2)]
library(car)
scatterplotMatrix(mydata4, smooth = FALSE)
Is there a potential problem with multicollinearity?
We have to look at how strong the relationship between Age and Distance (the explanatory variables) is. From the scatterplot matrix alone we can only get an impression: if the points followed the fitted line closely, we could suspect a correlation between the two variables. Here the line in the Age-Distance panel is nearly horizontal, so there does not appear to be a problem with multicollinearity.
fit2 <- lm(Price ~ Age + Distance,
data=mydata3)
vif(fit2)
## Age Distance
## 1.001845 1.001845
There is no strong relationship between Age and Distance because VIF < 5 (5 is a commonly used cutoff), so there is no problem with multicollinearity.
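With only two explanatory variables, the VIF equals 1 / (1 - r^2), where r is the correlation between Age and Distance; a quick sketch reproducing the value above (output not shown):
1 / (1 - cor(mydata3$Age, mydata3$Distance)^2)   # should match vif(fit2)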
mydata3$StdResid <- round(rstandard(fit2), 3)
mydata3$CooksD <- round(cooks.distance(fit2), 3)
head(mydata3[order(mydata3$StdResid),], 3)
## # A tibble: 3 × 9
## Age Distance Price Parking Balcony ParkingFactor BalconyFac…¹ StdRe…² CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 7 2 1760 0 1 no yes -2.15 0.066
## 2 12 14 1650 0 1 no yes -1.50 0.013
## 3 12 14 1650 0 0 no no -1.50 0.013
## # … with abbreviated variable names ¹BalconyFactor, ²StdResid
There are no outliers: all standardized residuals are above -3 (and below 3), the usual limits for standardized residuals. None of the Cook's distances is above 1, so no unit has a large influence on the model –> no need to remove any unit.
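Both rules of thumb can be checked numerically (a minimal sketch using the columns created above; both counts should be zero):
sum(abs(mydata3$StdResid) > 3)   # units with |standardized residual| > 3
sum(mydata3$CooksD > 1)          # units with Cook's distance > 1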
mydata3$StdfittedValues <- scale(fit2$fitted.values)
library(car)
scatterplot(y = mydata3$StdResid, x = mydata3$StdfittedValues,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)
Explanation of the scatter plot: the variance of the residuals looks roughly constant, so we can tentatively say there is no sign of heteroskedasticity. To be sure we would need to run the Breusch-Pagan test.
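A minimal sketch of that test, assuming the lmtest package is installed (output not shown):
library(lmtest)
bptest(fit2)   # Breusch-Pagan test; H0: constant error variance (homoskedasticity)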
hist(mydata3$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")
shapiro.test(mydata3$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata3$StdResid
## W = 0.95303, p-value = 0.003645
H0: the variable is normally distributed. H1: the variable is not normally distributed.
p-value = 0.003645, which is smaller than α = 0.05 (p < α), so we reject H0 and accept H1 at p = 0.0036 with α = 5%. The standardized residuals are therefore not normally distributed.
The assumption of normally distributed errors is violated, which means the t-based tests and confidence intervals may not be reliable.
fit2 <- lm(Price ~ Age + Distance,
data=mydata3)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
Explanation of all the coefficients:
- Regression coefficient for Age (-7.934): holding Distance constant, an apartment that is 1 year older has on average a price per m2 lower by about 7.9 EUR (p = 0.016).
- Regression coefficient for Distance (-20.667): holding Age constant, each additional km from the city center lowers the price per m2 on average by about 20.7 EUR (p < 0.001).
- Coefficient of (multiple) correlation r = 0.66: the linear relationship between Price and the two explanatory variables (Age and Distance) together is moderately strong.
- Coefficient of determination (Multiple R-squared) = 0.4396: 44% of the variability of Price is explained by the linear effect of Age and Distance.
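The value 0.66 is obtained as the square root of Multiple R-squared; a quick check (output not shown):
sqrt(summary(fit2)$r.squared)   # multiple correlation coefficient r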
fit3 <- lm(Price ~ Age + Distance + ParkingFactor + BalconyFactor,
data = mydata3)
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingFactor + BalconyFactor
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 82 6720983
## 2 80 5991088 2 729894 4.8732 0.01007 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: fit2 (Model 1) is appropriate (the added variables do not improve the model). H1: fit3 (Model 2) is more appropriate.
The p-value is 0.01007, so p < α (α = 5%): we reject H0 and conclude that fit3 fits the data better than fit2.
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor,
## data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## ParkingFactoryes 196.168 62.868 3.120 0.00251 **
## BalconyFactoryes 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
Explanation of the regression coefficients for the two categorical variables:
- Parking: given Age, Distance and Balcony, apartments with a parking place have on average a price per m2 higher by about 196 EUR (p = 0.0025, p < α with α = 5%).
- Balcony: p = 0.97, p > α (α = 5%), so we cannot claim that the presence of a balcony affects the price.
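To illustrate the Parking coefficient, one can predict the price per m2 for two otherwise identical apartments, without and with a parking place; the Age = 10, Distance = 5 and no-balcony values below are arbitrary, chosen only for this hypothetical example (output not shown):
# two apartments that differ only in ParkingFactor
newdat <- data.frame(Age = 10,
                     Distance = 5,
                     ParkingFactor = factor(c("no", "yes"), levels = c("no", "yes")),
                     BalconyFactor = factor("no", levels = c("no", "yes")))
predict(fit3, newdata = newdat)   # the difference between the two predictions is about 196 EUR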
Hypotheses for the F-statistic: H0: ρ² = 0; H1: ρ² > 0.
p-value < 0.001, so p < α (α = 5%) and we reject H0 at p < 0.001. There is a linear relationship between the dependent variable (Price) and at least one explanatory variable; the model as a whole is useful (ρ² > 0).
mydata3$Fittedvalues <- fitted.values(fit3)
mydata3$Residuals <- residuals(fit3)
head(mydata3)
## # A tibble: 6 × 12
## Age Distance Price Parking Balcony ParkingF…¹ Balco…² StdRe…³ CooksD Stdfi…⁴
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1 no yes -0.665 0.007 -0.771
## 2 18 1 2800 1 0 yes no 1.78 0.03 1.11
## 3 7 28 1660 0 0 no no -0.594 0.006 -0.771
## 4 28 29 1850 0 1 no yes 0.754 0.008 -1.52
## 5 18 18 1640 1 1 yes yes -1.07 0.005 -0.294
## 6 28 12 1770 0 1 no yes -0.778 0.005 -0.116
## # … with 2 more variables: Fittedvalues <dbl>, Residuals <dbl>, and abbreviated
## # variable names ¹ParkingFactor, ²BalconyFactor, ³StdResid,
## # ⁴StdfittedValues[,1]
Calculation of the residual for the apartment in row 2: Price = 2800, fitted value = 2357.411.
Residual: 2800 - 2357.411 = 442.6 EUR.
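The same value can be reproduced directly from the columns created above (a quick check; output not shown):
mydata3$Price[2] - mydata3$Fittedvalues[2]   # price minus fitted value for row 2
residuals(fit3)[2]                           # residual stored in the model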