library(car)   # attaches carData, which provides the Chile data set; car is also used later for scatterplotMatrix(), scatterplot() and vif()
data(Chile)
firstmydata <- Chile
head(firstmydata)
## region population sex age education income statusquo vote
## 1 N 175000 M 65 P 35000 1.00820 Y
## 2 N 175000 M 29 PS 7500 -1.29617 N
## 3 N 175000 F 38 P 15000 1.23072 Y
## 4 N 175000 F 49 P 35000 -1.03163 N
## 5 N 175000 F 23 S 35000 -1.10496 N
## 6 N 175000 F 28 P 7500 -1.04685 N
The data set describes voting intentions in the 1988 Chilean plebiscite. It contains 2700 observations on 8 variables.
Variables:
- REGION: C = Central, M = Metropolitan Santiago area, N = North, S = South, SA = city of Santiago
- POPULATION: population size of the respondent's community
- SEX: F = female, M = male
- AGE: in years
- EDUCATION: P = primary, PS = post-secondary, S = secondary
- INCOME: monthly income, in pesos
- STATUSQUO: scale of support for the status quo
- VOTE: A = will abstain, N = will vote no (against Pinochet), U = undecided, Y = will vote yes (for Pinochet)
library(tidyr) # for drop_na()
firstmydata1 <- drop_na(firstmydata)
head(firstmydata1, 10)
## region population sex age education income statusquo vote
## 1 N 175000 M 65 P 35000 1.00820 Y
## 2 N 175000 M 29 PS 7500 -1.29617 N
## 3 N 175000 F 38 P 15000 1.23072 Y
## 4 N 175000 F 49 P 35000 -1.03163 N
## 5 N 175000 F 23 S 35000 -1.10496 N
## 6 N 175000 F 28 P 7500 -1.04685 N
## 7 N 175000 M 26 PS 35000 -0.78626 N
## 8 N 175000 F 24 S 15000 -1.11348 N
## 9 N 175000 F 41 P 15000 -1.01292 U
## 10 N 175000 M 41 P 15000 -1.29617 N
We do this because the data set contains some missing values, and we want to remove the rows with NA (not available) entries.
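As a quick sanity check (a minimal sketch, not part of the original analysis), the number of dropped rows can be verified directly:
sum(!complete.cases(firstmydata))        # rows containing at least one NA
nrow(firstmydata) - nrow(firstmydata1)   # should return the same count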
First we removed the population, education and statusquo variables, because we will not be comparing them and no longer need them. We also made small changes to the names of the remaining variables.
firstmydata2 <- (firstmydata1[ ,c(-2, -5, -7)])
colnames(firstmydata2) <- c("Region", "Gender", "Age", "Income", "Vote")
head(firstmydata2, 10) #shows 10 rows instead of 6
## Region Gender Age Income Vote
## 1 N M 65 35000 Y
## 2 N M 29 7500 N
## 3 N F 38 15000 Y
## 4 N F 49 35000 N
## 5 N F 23 35000 N
## 6 N F 28 7500 N
## 7 N M 26 35000 N
## 8 N F 24 15000 N
## 9 N F 41 15000 U
## 10 N M 41 15000 N
firstmydata2$VoteFactor <- factor(firstmydata2$Vote,
levels = c("A","N", "U", "Y"),
labels = c("abstain", "no", "undecided","yes")) #M(male) = 0, F(female) = 1
head(firstmydata2)
## Region Gender Age Income Vote VoteFactor
## 1 N M 65 35000 Y yes
## 2 N M 29 7500 N no
## 3 N F 38 15000 Y yes
## 4 N F 49 35000 N no
## 5 N F 23 35000 N no
## 6 N F 28 7500 N no
We renamed the levels of the Vote variable to more understandable ones, so that the newly formed VoteFactor variable is faster and easier to read.
summary(firstmydata2[ , -5])
## Region Gender Age Income VoteFactor
## C :548 F:1250 Min. :18.00 Min. : 2500 abstain :177
## M : 75 M:1181 1st Qu.:25.00 1st Qu.: 7500 no :867
## N :305 Median :36.00 Median : 15000 undecided:551
## S :655 Mean :38.29 Mean : 34020 yes :836
## SA:848 3rd Qu.:49.00 3rd Qu.: 35000
## Max. :70.00 Max. :200000
We excluded the Vote variable from the summary because the output is clearer with the labelled VoteFactor variable.
From the summary we can see that 548 respondents come from the central part of Chile, 75 from Metropolitan Santiago, 305 from the north, 655 from the south and 848 from the city of Santiago, so the city of Santiago contributed the most respondents. Roughly half of the respondents were female and half male. The average age was 38.29 years and the average income of the respondents was 34,020 pesos. According to the survey, 177 people said they would abstain, 551 were undecided, 867 would vote no and 836 would vote yes. From this we can predict that the vote will be close, but with the "no" votes exceeding the "yes" votes by roughly 3.6% (computed below), the survey suggests Pinochet will not be confirmed.
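A minimal sketch of the arithmetic behind the 3.6% figure, using the counts from the summary above:
votes <- table(firstmydata2$VoteFactor)
round((votes["no"] - votes["yes"]) / votes["no"], 3)   # (867 - 836) / 867 is roughly 0.036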
sapply(firstmydata2[ ,c(3,4)], FUN = mean)
## Age Income
## 38.29 34019.95
We checked the mean of the two numeric variables; these values already appear in the summary above.
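For comparison, the medians could be obtained the same way (a sketch, not run in the original report):
sapply(firstmydata2[ ,c(3,4)], FUN = median)   # medians of Age and Income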
firstmydata2$GenderFactor <- factor(firstmydata2$Gender,
                                    levels = c("F","M"),
                                    labels = c("female", "male"))
firstmydata2$RegionFactor <- factor(firstmydata2$Region,
                                    levels = c("C","M", "N", "S", "SA"),
                                    labels = c("1", "2", "3", "4", "5"))
head(firstmydata2)
## Region Gender Age Income Vote VoteFactor GenderFactor RegionFactor
## 1 N M 65 35000 Y yes male 3
## 2 N M 29 7500 N no male 3
## 3 N F 38 15000 Y yes female 3
## 4 N F 49 35000 N no female 3
## 5 N F 23 35000 N no female 3
## 6 N F 28 7500 N no female 3
In this case we formed GenderFactor with the labels male and female so that the graphs we show next are easier to read. We also formed RegionFactor to be able to refer to the regions more easily in those graphs.
library(ggplot2)
ggplot(firstmydata2, aes (x = Age, fill = GenderFactor)) +
geom_histogram(position = position_dodge(width = 11), binwidth = 7, colour = "white") +
facet_wrap(~VoteFactor, ncol = 1) +
ylab("Frequency") +
labs(fill = "Gender") +
ggtitle("Distribution of votes people with certain age gave, by gender")
We formed a graph which shows how people of different ages voted, broken down by gender. For every vote category the distribution is skewed to the right, so we can assume the sample contained far more younger people and few "older" people above age 50. For the "no" vote there is a clear peak among young males, and males outnumber females in the "no" vote at almost every age, with the exception of the under-20 age group. Another interesting observation is that no male under 20 appears in any of the vote categories, so apparently none were interviewed. We can also see that women were more often undecided than men, and that very few people said they would abstain.
ggplot(firstmydata2, aes (x = Age, fill = GenderFactor)) +
geom_histogram(position = position_dodge(width = 11), binwidth = 7, colour = "white") +
facet_wrap(~VoteFactor:RegionFactor) +
ylab("Frequency") +
labs(fill = "Gender") +
ggtitle("Distribution of votes people gave in a single region, by gender")
VOTES: abstain, no, undecided, yes. REGIONS: 1 = Central, 2 = Metropolitan Santiago, 3 = North, 4 = South, 5 = city of Santiago.
With this more complex set of panels we tried to show how people of different ages voted by gender (as in the previous histograms), separately for each region. Each column of panels represents one region. The graph makes it easy to see which regions are strongly against Pinochet (1: Central and 5: city of Santiago), with male respondents dominating the "no" vote there. We can also see that very few people were interviewed in Metropolitan Santiago (2). A graph like this shows how politically oriented each region is, which can be crucial for political campaigns in the run-up to the vote.
library(pastecs) # for stat.desc()
round(stat.desc(firstmydata2[ ,c(3,4) ]),1)
## Age Income
## nbr.val 2431.0 2431.0
## nbr.null 0.0 0.0
## nbr.na 0.0 0.0
## min 18.0 2500.0
## max 70.0 200000.0
## range 52.0 197500.0
## sum 93083.0 82702500.0
## median 36.0 15000.0
## mean 38.3 34020.0
## SE.mean 0.3 806.0
## CI.mean.0.95 0.6 1580.5
## var 215.1 1579268017.4
## std.dev 14.7 39740.0
## coef.var 0.4 1.2
From the descriptive statistics we can see that the respondents were between 18 and 70 years old, with an average age of 38.3; the range of the Age variable is therefore 52. The minimum income of the respondents was 2,500 pesos and the maximum 200,000. The median of 15,000 means that half of the respondents earned more than that; since the mean (34,020) lies far above the median, the income distribution is strongly skewed to the right by a group of wealthy respondents, which could indicate that the sample is biased.
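To see the skew directly, one could tabulate the income levels (the head of the data suggests the variable takes only a few distinct values); a sketch not run in the original analysis:
table(firstmydata2$Income)   # frequency of each income level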
bodymassmydata <- read.table("C:/Users/Alisa/Downloads/R Take Home Exam/R Take Home Exam/Task 2/Body mass.csv",
header=TRUE,
sep=";",
dec=",")
head(bodymassmydata)
## ID Mass
## 1 1 62.1
## 2 2 64.5
## 3 3 56.5
## 4 4 53.4
## 5 5 61.3
## 6 6 62.2
hist(bodymassmydata$Mass,
xlab = "Mass",
ylab = "Frequency",
main = "Distribution of body mass",
breaks = seq(30, 100, 4))
With the histogram we show the distribution of the body mass of ninth graders. The data appear approximately normally distributed.
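The visual impression of normality could be backed up with a formal test, for example a Shapiro-Wilk test (a sketch; output not shown):
shapiro.test(bodymassmydata$Mass)   # H0: the data come from a normal distribution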
mean(bodymassmydata$Mass)
## [1] 62.876
t.test(bodymassmydata$Mass,
mu = 59.5,
alternative = "two.sided")
##
## One Sample t-test
##
## data: bodymassmydata$Mass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
## 61.16758 64.58442
## sample estimates:
## mean of x
## 62.876
With the t-test we tested the null hypothesis that mu = 59.5. The resulting p-value is 0.000234, so we can reject the null hypothesis at the 0.001 significance level and accept the alternative hypothesis that the mean body mass in 2021/22 differs from the 2018/19 value of 59.5. From the t-test and the mean function we can see that the sample mean is 62.876.
To explain the results and interpret the effect size, we first need to calculate r.
sqrt(3.9711^2/((3.9711^2)+49))
## [1] 0.4934295
According to the result (r ≈ 0.49), the effect size is medium to large: by the usual convention, r between 0.3 and 0.5 counts as a medium effect and r of 0.5 or above as a large effect.
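The same calculation can be wrapped in a small helper that pulls t and df straight from the t.test object; the function name effect_size_r is ours, not from any package:
effect_size_r <- function(ttest) {
  t  <- unname(ttest$statistic)   # the t value
  df <- unname(ttest$parameter)   # the degrees of freedom
  sqrt(t^2 / (t^2 + df))
}
effect_size_r(t.test(bodymassmydata$Mass, mu = 59.5))   # should reproduce 0.493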
library(readxl)
mydata = read_excel("C:/Users/Alisa/Downloads/R Take Home Exam/R Take Home Exam/Task 3/Apartments.xlsx")
head(mydata)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description: Age is the age of the apartment in years, Distance is the distance to the city center in kilometers, Price is the price in EUR per square meter, and Parking and Balcony are 0/1 indicators for a parking space and a balcony.
mydata$ParkingFactor <- factor(mydata$Parking,
levels = c(0, 1),
labels = c("No", "Yes"))
mydata$BalconyFactor <- factor(mydata$Balcony,
levels = c(0, 1),
labels = c("No", "Yes"))
head(mydata)
## # A tibble: 6 × 7
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
## 1 7 28 1640 0 1 No Yes
## 2 18 1 2800 1 0 Yes No
## 3 7 28 1660 0 0 No No
## 4 28 29 1850 0 1 No Yes
## 5 18 18 1640 1 1 Yes Yes
## 6 28 12 1770 0 1 No Yes
t.test(mydata$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
Because the p-value (0.0047) is smaller than 0.05, we reject the null hypothesis (H0) and accept H1 at the given significance level. We can conclude that the mean price is different from 1900 EUR; the sample mean is 2018.94 EUR.
fit1 <- lm(Price ~ Age,
data = mydata)
sqrt(summary(fit1)$r.squared)
## [1] 0.230255
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
The regression coefficient represents the slope of the regression line, in this case -8.975. It tells us that for each additional year of apartment age, the price per square meter on average decreases by 8.975 EUR.
The coefficient of determination is represented by R-squared and is the share of the variability of the dependent variable explained by the model. In this case it is quite low: 5.3% of the variability of price is explained by the linear effect of age.
The coefficient of correlation tells us that the linear relationship between price and age is negative and weak, r ≈ -0.23 (sqrt(R-squared) returns only the magnitude; the sign follows from the negative slope). It is weak because its absolute value is between 0.1 and 0.3.
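Since sqrt(R-squared) only gives the magnitude, the signed correlation can be checked directly (a quick sketch):
cor(mydata$Price, mydata$Age)   # carries the sign; negative here, matching the negative slope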
scatterplotMatrix(mydata[ ,c(3,1,2)],
smooth = FALSE)
Judging from the matrix, there is no obvious problem with multicollinearity (the points are widely scattered), but to be sure we would have to calculate the correlation between the explanatory variables, as shown below.
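That check is a one-liner (a sketch; output not shown):
cor(mydata$Age, mydata$Distance)   # correlation between the two explanatory variables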
fit2 <- lm(Price ~ Age + Distance,
data = mydata)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
vif(fit2)
## Age Distance
## 1.001845 1.001845
There is no problem here: the correlation between the predictors is negligible and there is no multicollinearity, since the VIF values are well below the usual cutoff of 5.
mydata$StdResid <- round(rstandard(fit2), 3)
mydata$CooksD <- round(cooks.distance(fit2), 3)
hist(mydata$StdResid,
xlab = "Standardized residual",
ylab = "Frequency",
main = "Histogram of standard residuals")
We made a histogram to check for potentially problematic standardized residuals.
hist(mydata$CooksD,
     xlab = "Cook's distance",
     ylab = "Frequency",
     main = "Histogram of Cook's distances")
We plot the Cook's distances to visually check for observations with unusually large influence.
head(mydata[order(mydata$StdResid),],6)
## # A tibble: 6 × 9
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor StdResid CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 7 2 1760 0 1 No Yes -2.15 0.066
## 2 12 14 1650 0 1 No Yes -1.50 0.013
## 3 12 14 1650 0 0 No No -1.50 0.013
## 4 13 8 1800 0 0 No No -1.38 0.012
## 5 14 16 1660 0 1 No Yes -1.26 0.008
## 6 24 5 1830 1 0 Yes No -1.19 0.012
head(mydata[order(-mydata$CooksD),],6)
## # A tibble: 6 × 9
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor StdResid CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 5 45 2180 1 1 Yes Yes 2.58 0.32
## 2 43 37 1740 0 0 No No 1.44 0.104
## 3 2 11 2790 1 0 Yes No 2.05 0.069
## 4 7 2 1760 0 1 No Yes -2.15 0.066
## 5 37 3 2540 1 1 Yes Yes 1.58 0.061
## 6 40 2 2400 0 1 No Yes 1.09 0.038
Based on these diagnostics we will remove the apartment in row 53, because its standardized residual is considerably larger than all the others; judging by its combination of variable values, this apartment is very different from the rest. We will also remove row 38 because of its large Cook's distance.
mydata1 <- mydata[c(-38,-53),]
We have now removed the potentially problematic outliers, i.e. units with a large influence on the model.
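A sketch of how the model could be refitted on the cleaned data (the object name fit2clean is ours; note that the diagnostics below continue to use the full mydata, as in the original analysis):
fit2clean <- lm(Price ~ Age + Distance, data = mydata1)   # same model without rows 38 and 53
summary(fit2clean)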
scatterplot(y = fit2$residuals, x = fit2$fitted.values,
            ylab = "Residuals",
            xlab = "Fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)
From the graph we can assume that there is no violation of the constant-variance assumption: the spread of the residuals looks roughly constant across the fitted values, so we see no sign of heteroskedasticity and can work with the assumption of homoskedasticity. To be sure we would have to run the Breusch-Pagan heteroskedasticity test, sketched below.
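The Breusch-Pagan test is available in the lmtest package (not used elsewhere in this report); a sketch, output not shown:
library(lmtest)
bptest(fit2)   # H0: the error variance is constant (homoskedasticity)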
hist(mydata$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")
We already plotted this histogram earlier, when checking for potential outliers.
shapiro.test(mydata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.95303, p-value = 0.003645
Since the p-value is smaller than 0.05, we reject the null hypothesis that the variable is normally distributed and conclude that the standardized residuals are not normally distributed; the histogram above is skewed to the right.
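A Q-Q plot gives another quick visual check of the same assumption (a sketch):
qqnorm(mydata$StdResid, main = "Q-Q plot of standardized residuals")
qqline(mydata$StdResid)   # points far from this line indicate non-normality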
fit2 <- lm(Price ~ Age + Distance,
data=mydata)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
Based on the p-values (0.016 for Age and below 0.001 for Distance) we can reject the null hypotheses that the regression coefficients are zero: both age and distance have a statistically significant effect on the price.
From the estimates we can see that if the age of the apartment increases by 1 year, the price on average decreases by 7.934 EUR per square meter, assuming the other variable remains unchanged. If the distance increases by 1 kilometer, the price on average decreases by 20.667 EUR per square meter, ceteris paribus.
From the output we can see that R-squared (the coefficient of determination) is 0.4396, so about 44% of the variability of price is explained by age and distance.
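As a usage example of the fitted model, the predicted price per square meter for a hypothetical apartment (values chosen purely for illustration) could be obtained with predict():
predict(fit2, newdata = data.frame(Age = 10, Distance = 5))   # hypothetical: 10 years old, 5 km from the center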
fit3 <- lm(Price ~ Age + Distance + ParkingFactor + BalconyFactor,
data = mydata)
anova(fit2,fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingFactor + BalconyFactor
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 82 6720983
## 2 80 5991088 2 729894 4.8732 0.01007 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: Fit2 (model 1) is more appropriate. H1: Fit3 (model 2) is more appropriate.
Based on the p-value of 0.01007 we can reject the null hypothesis and conclude that fit3 fits the data better than fit2.
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor,
## data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## ParkingFactorYes 196.168 62.868 3.120 0.00251 **
## BalconyFactorYes 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
Hypotheses for the F-statistic: H0: ρ² = 0; H1: ρ² > 0.
Apartments with a parking space have on average a 196.2 EUR higher price per square meter (p = 0.0025), assuming the other variables remain unchanged. Based on the high p-value (0.974), we cannot confirm that having a balcony has an effect on the price of the apartment.
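To accompany the p-value, a 95% confidence interval for the parking effect can be extracted (a sketch; output not shown):
confint(fit3)["ParkingFactorYes", ]   # 95% CI for the parking coefficient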
mydata$Fittedvalues <- fitted.values(fit3)
mydata$StdResid <- residuals(fit3) # raw residuals of fit3; note this overwrites the standardized residuals computed earlier for fit2
head(mydata)
## # A tibble: 6 × 10
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor StdResid CooksD Fittedvalues
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1 No Yes -111. 0.007 1751.
## 2 18 1 2800 1 0 Yes No 443. 0.03 2357.
## 3 7 28 1660 0 0 No No -88.8 0.006 1749.
## 4 28 29 1850 0 1 No Yes 260. 0.008 1590.
## 5 18 18 1640 1 1 Yes Yes -413. 0.005 2053.
## 6 28 12 1770 0 1 No Yes -127. 0.005 1897.
From the output we can see that the fit3 residual for apartment 2 is 442.59.