Alisa Kim Jagodic

Task 1

library(car)  # provides the Chile data set (via carData) and scatterplotMatrix(), scatterplot(), vif() used later
data(Chile)
firstmydata <- Chile
head(firstmydata)
##   region population sex age education income statusquo vote
## 1      N     175000   M  65         P  35000   1.00820    Y
## 2      N     175000   M  29        PS   7500  -1.29617    N
## 3      N     175000   F  38         P  15000   1.23072    Y
## 4      N     175000   F  49         P  35000  -1.03163    N
## 5      N     175000   F  23         S  35000  -1.10496    N
## 6      N     175000   F  28         P   7500  -1.04685    N

The data set describes voting intentions in the 1988 Chilean plebiscite. It contains 8 variables and 2700 observations.

Variables:

REGION: - C:Central - M:Metropolitan Santiago area - N:North - S:South - SA:city of Santiago

POPULATION: population size of respondent’s community

SEX: - F:female - M:male

AGE: in years

EDUCATION: - P:primary - PS:post-secondary - S:secondary

INCOME: monthly income, in pesos

STATUSQUO: scale of support for the status quo

VOTE: - A:will abstain - N:will vote no (against Pinochet) - U:undecided - Y:will vote yes (for Pinochet)

library(tidyr)                        # provides drop_na()
firstmydata1 <- drop_na(firstmydata)  # keep only complete cases

head(firstmydata1, 10)
##    region population sex age education income statusquo vote
## 1       N     175000   M  65         P  35000   1.00820    Y
## 2       N     175000   M  29        PS   7500  -1.29617    N
## 3       N     175000   F  38         P  15000   1.23072    Y
## 4       N     175000   F  49         P  35000  -1.03163    N
## 5       N     175000   F  23         S  35000  -1.10496    N
## 6       N     175000   F  28         P   7500  -1.04685    N
## 7       N     175000   M  26        PS  35000  -0.78626    N
## 8       N     175000   F  24         S  15000  -1.11348    N
## 9       N     175000   F  41         P  15000  -1.01292    U
## 10      N     175000   M  41         P  15000  -1.29617    N

We do this because the data set contains some missing data and we want to remove the rows that have NA (not available) values.
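
Before dropping rows, it can help to see how much data is actually missing. A minimal check in base R (column-wise NA counts and the number of incomplete rows):

colSums(is.na(firstmydata))        # number of missing values per variable
sum(!complete.cases(firstmydata))  # number of rows with at least one NA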

First we removed the population, education and statusquo variables, because we won’t be comparing these variables and don’t need them anymore. We also made small changes to the variable names.

firstmydata2 <- firstmydata1[ , c(-2, -5, -7)]  # drop population, education and statusquo
colnames(firstmydata2) <- c("Region", "Gender", "Age", "Income", "Vote")
head(firstmydata2, 10) #shows 10 rows instead of 6
##    Region Gender Age Income Vote
## 1       N      M  65  35000    Y
## 2       N      M  29   7500    N
## 3       N      F  38  15000    Y
## 4       N      F  49  35000    N
## 5       N      F  23  35000    N
## 6       N      F  28   7500    N
## 7       N      M  26  35000    N
## 8       N      F  24  15000    N
## 9       N      F  41  15000    U
## 10      N      M  41  15000    N
firstmydata2$VoteFactor <- factor(firstmydata2$Vote,
                           levels = c("A", "N", "U", "Y"),
                           labels = c("abstain", "no", "undecided", "yes"))
head(firstmydata2)
##   Region Gender Age Income Vote VoteFactor
## 1      N      M  65  35000    Y        yes
## 2      N      M  29   7500    N         no
## 3      N      F  38  15000    Y        yes
## 4      N      F  49  35000    N         no
## 5      N      F  23  35000    N         no
## 6      N      F  28   7500    N         no

We renamed the levels of the Vote variable to more descriptive ones, so the newly formed VoteFactor variable is faster and easier to read and understand.

summary(firstmydata2[ , -5])
##  Region   Gender        Age            Income           VoteFactor 
##  C :548   F:1250   Min.   :18.00   Min.   :  2500   abstain  :177  
##  M : 75   M:1181   1st Qu.:25.00   1st Qu.:  7500   no       :867  
##  N :305            Median :36.00   Median : 15000   undecided:551  
##  S :655            Mean   :38.29   Mean   : 34020   yes      :836  
##  SA:848            3rd Qu.:49.00   3rd Qu.: 35000                  
##                    Max.   :70.00   Max.   :200000

We removed the Vote variable from the summary because the output is clearer if we look at the VoteFactor variable instead.

From the summary of the data we can see that 548 survey respondents come from the central part of Chile, 75 from Metropolitan Santiago, 305 from the north, 655 from the south and 848 from the city of Santiago, so the city of Santiago contributed the most respondents. Roughly half of the respondents were female (1250) and half male (1181). The average age was 38.29 years and the average income 34,020 pesos. According to the survey, 177 people said they would abstain, 551 were undecided, 867 would vote no and 836 yes. The “no” side leads the “yes” side by only 31 respondents, so the survey points to a very close vote, with a slight edge against Pinochet.
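
To back up the “close vote” reading, we can compute the share of each answer directly; a quick sketch:

round(prop.table(table(firstmydata2$VoteFactor)) * 100, 1)  # percentage of respondents per vote category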

sapply(firstmydata2[ ,c(3,4)], FUN = mean)
##      Age   Income 
##    38.29 34019.95

We checked the means of the two numeric variables, which were already presented in the summary.

firstmydata2$GenderFactor <- factor(firstmydata2$Gender,
                           levels = c("F", "M"),
                           labels = c("female", "male"))

firstmydata2$RegionFactor <- factor(firstmydata2$Region,
                           levels = c("C","M", "N", "S", "SA"),
                           labels = c("1", "2", "3", "4", "5"))

head(firstmydata2)
##   Region Gender Age Income Vote VoteFactor GenderFactor RegionFactor
## 1      N      M  65  35000    Y        yes         male            3
## 2      N      M  29   7500    N         no         male            3
## 3      N      F  38  15000    Y        yes       female            3
## 4      N      F  49  35000    N         no       female            3
## 5      N      F  23  35000    N         no       female            3
## 6      N      F  28   7500    N         no       female            3

In this step we formed GenderFactor, with the levels female and male, so the graphs we show next are easier to read. We also formed RegionFactor, so we can refer to the regions more easily in those graphs.

library(ggplot2)
ggplot(firstmydata2, aes (x = Age, fill = GenderFactor)) +
         geom_histogram(position = position_dodge(width = 11), binwidth = 7, colour = "white") +
         facet_wrap(~VoteFactor, ncol = 1) +
         ylab("Frequency") +
         labs(fill = "Gender") +
        ggtitle("Distribution of votes people with certain age gave, by gender") 

The graph shows how people of different ages voted, broken down by gender. For every vote category the distribution is skewed to the right, so the sample contained many more younger people and relatively few “older” people above 50. For the “no” vote there is a noticeable peak among young men, and men outnumbered women in the “no” category at almost every age, with the exception of the under-20 group. Interestingly, no men under the age of 20 appear in any of the vote categories, which suggests they either did not answer or were not interviewed. We can also see that women were more often undecided than men, and that very few people said they would abstain.

ggplot(firstmydata2, aes (x = Age, fill = GenderFactor)) +
         geom_histogram(position = position_dodge(width = 11), binwidth = 7, colour = "white") +
         facet_wrap(~VoteFactor:RegionFactor) +
         ylab("Frequency") +
         labs(fill = "Gender") +
          ggtitle("Distribution of votes people gave in a single region, by gender") 

VOTES: - abstain - no - undecided - yes
REGIONS: - 1:Central - 2:Metropolitan Santiago - 3:North - 4:South - 5:city of Santiago

With this more complex set of panels we tried to show how people of different ages voted by gender (as in the previous histograms), separately for each region they live in. Each panel combines a vote category with a region. The graph shows clearly which regions lean strongly against Pinochet (1: Central and 5: city of Santiago), with men dominating the “no” vote. We can also see that very few people were interviewed in Metropolitan Santiago (2). A graph like this suggests how politically oriented each region is, which can be crucial for political campaigns in the run-up to the vote.

library(pastecs)  # provides stat.desc()
round(stat.desc(firstmydata2[ , c(3, 4)]), 1)
##                  Age       Income
## nbr.val       2431.0       2431.0
## nbr.null         0.0          0.0
## nbr.na           0.0          0.0
## min             18.0       2500.0
## max             70.0     200000.0
## range           52.0     197500.0
## sum          93083.0   82702500.0
## median          36.0      15000.0
## mean            38.3      34020.0
## SE.mean          0.3        806.0
## CI.mean.0.95     0.6       1580.5
## var            215.1 1579268017.4
## std.dev         14.7      39740.0
## coef.var         0.4          1.2

From the descriptive statistics we can see that the respondents were between 18 and 70 years old, with an average age of 38.3; the range of the Age variable is therefore 52. The minimum income of the respondents was 2,500 pesos and the maximum 200,000. The median of 15,000 tells us that half of the respondents earned less than that amount and half earned more. Since the mean income (34,020) is far above the median, the income distribution is strongly skewed to the right: a relatively small number of wealthy respondents pulls the mean up.
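
A quick way to see this skew is to compare the mean and median directly and to plot the income distribution; a minimal sketch:

mean(firstmydata2$Income) - median(firstmydata2$Income)  # large positive gap indicates right skew
hist(firstmydata2$Income,
     xlab = "Income",
     ylab = "Frequency",
     main = "Distribution of income")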

Task 2

bodymassmydata <- read.table("C:/Users/Alisa/Downloads/R Take Home Exam/R Take Home Exam/Task 2/Body mass.csv",
                     header = TRUE, 
                     sep = ";",   # semicolon-separated file
                     dec = ",")   # comma as the decimal mark

head(bodymassmydata)
##   ID Mass
## 1  1 62.1
## 2  2 64.5
## 3  3 56.5
## 4  4 53.4
## 5  5 61.3
## 6  6 62.2
hist(bodymassmydata$Mass,
     xlab = "Mass",
     ylab = "Frequency",
     main = "Distribution of body mass",
     breaks = seq(30, 100, 4))

The histogram shows the distribution of body mass of ninth graders. The distribution appears approximately normal.
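
The visual impression could be backed up with a formal normality test; a minimal sketch using the Shapiro-Wilk test (output not shown here):

shapiro.test(bodymassmydata$Mass)  # H0: body mass is normally distributed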

mean(bodymassmydata$Mass)
## [1] 62.876
t.test(bodymassmydata$Mass,
       mu = 59.5,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  bodymassmydata$Mass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
##  61.16758 64.58442
## sample estimates:
## mean of x 
##    62.876

With the t-test we tested the null hypothesis that mu = 59.5. The resulting p-value is 0.000234, so we can reject the null hypothesis at any significance level above 0.001. We can accept the alternative hypothesis that the mean body mass in 2021/22 differs from 59.5, the mean in 2018/19. From the t-test and the mean() function we can see that the sample mean is 62.876.

To interpret the effect size we first need to calculate r from the test statistic, using r = sqrt(t^2 / (t^2 + df)).

sqrt(3.9711^2/((3.9711^2)+49))
## [1] 0.4934295

With r ≈ 0.49 the effect size is medium to large: by the usual convention, r around 0.3 indicates a medium effect and r of 0.5 or more a large effect.
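
The same conversion can be wrapped in a small helper function; t_to_r is a hypothetical name, not part of the original analysis:

t_to_r <- function(t, df) sqrt(t^2 / (t^2 + df))  # effect size r from a t statistic
t_to_r(3.9711, 49)                                # 0.4934, matching the value above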

Task 3

Import the dataset Apartments.xlsx

library(readxl)
mydata = read_excel("C:/Users/Alisa/Downloads/R Take Home Exam/R Take Home Exam/Task 3/Apartments.xlsx")

head(mydata)
## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

Change categorical variables into factors

mydata$ParkingFactor <- factor(mydata$Parking, 
                              levels = c(0, 1), 
                              labels = c("No", "Yes"))

mydata$BalconyFactor <- factor(mydata$Balcony, 
                              levels = c(0, 1), 
                              labels = c("No", "Yes"))

head(mydata)
## # A tibble: 6 × 7
##     Age Distance Price Parking Balcony ParkingFactor BalconyFactor
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>         <fct>        
## 1     7       28  1640       0       1 No            Yes          
## 2    18        1  2800       1       0 Yes           No           
## 3     7       28  1660       0       0 No            No           
## 4    28       29  1850       0       1 No            Yes          
## 5    18       18  1640       1       1 Yes           Yes          
## 6    28       12  1770       0       1 No            Yes

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

t.test(mydata$Price,
       mu = 1900,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  mydata$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

Because the p-value (0.0047) is smaller than 0.05, we reject the null hypothesis (H0) and accept H1 at the given significance level. We can conclude that the mean price differs from 1900 eur.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age,
           data = mydata)

sqrt(summary(fit1)$r.squared)
## [1] 0.230255
summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

The regression coefficient for Age is -8.975 and represents the slope of the regression line. It tells us that with each additional year of age of the apartment, the price per square meter on average decreases by 8.975 eur.

The coefficient of determination is represented by R-squared and is the share of variability of the dependent variable explained by the model. In this case it is quite low: 5.3% of the variability of price is explained by the linear effect of age.

The coefficient of correlation between price and age is weak and, because the slope is negative, also negative: r = -0.23 (the square root of R-squared only gives its magnitude). It counts as weak because its absolute value lies between 0.1 and 0.3.
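
Note that sqrt(R-squared) only recovers the magnitude of r; cor() returns the signed value directly:

cor(mydata$Price, mydata$Age)  # approximately -0.23, negative like the slope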

Show the scatterplot matrix between Price, Age and Distance. Based on the matrix, determine whether there is a potential problem with multicollinearity.

scatterplotMatrix(mydata[ ,c(3,1,2)],
            smooth = FALSE)

Based on the widely scattered points in the matrix, there does not appear to be a problem with multicollinearity, but to be sure we should calculate the correlation between the explanatory variables.
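
That correlation is easy to compute; a one-line check of the linear association between the two explanatory variables:

cor(mydata$Age, mydata$Distance)  # a value close to 0 means little risk of multicollinearity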

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance,
           data = mydata)

summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Check the multicollinearity with the VIF statistics. Explain the findings.

vif(fit2)
##      Age Distance 
## 1.001845 1.001845

There is no problem: both VIF values are practically 1, far below the usual cut-off of 5, so the correlation between the explanatory variables is negligible and there is no multicollinearity.

Calculate standardized residuals and Cook’s distances for model fit2. Remove any potentially problematic case (outlier or unit with big influence).

mydata$StdResid <- round(rstandard(fit2), 3)
mydata$CooksD <- round(cooks.distance(fit2), 3)

hist(mydata$StdResid,
     xlab = "Standardized residual",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

We drew a histogram to check for potentially problematic standardized residuals.

hist(mydata$CooksD,
     xlab = "Cook's distance",
     ylab = "Frequency",
     main = "Histogram of Cook's distances")

This lets us visually check for units with unusually large influence.

head(mydata[order(mydata$StdResid),],6)
## # A tibble: 6 × 9
##     Age Distance Price Parking Balcony ParkingFactor BalconyFactor StdResid CooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>         <fct>            <dbl>  <dbl>
## 1     7        2  1760       0       1 No            Yes              -2.15  0.066
## 2    12       14  1650       0       1 No            Yes              -1.50  0.013
## 3    12       14  1650       0       0 No            No               -1.50  0.013
## 4    13        8  1800       0       0 No            No               -1.38  0.012
## 5    14       16  1660       0       1 No            Yes              -1.26  0.008
## 6    24        5  1830       1       0 Yes           No               -1.19  0.012
head(mydata[order(-mydata$CooksD),],6)
## # A tibble: 6 × 9
##     Age Distance Price Parking Balcony ParkingFactor BalconyFactor StdResid CooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>         <fct>            <dbl>  <dbl>
## 1     5       45  2180       1       1 Yes           Yes               2.58  0.32 
## 2    43       37  1740       0       0 No            No                1.44  0.104
## 3     2       11  2790       1       0 Yes           No                2.05  0.069
## 4     7        2  1760       0       1 No            Yes              -2.15  0.066
## 5    37        3  2540       1       1 Yes           Yes               1.58  0.061
## 6    40        2  2400       0       1 No            Yes               1.09  0.038

Based on this output we will remove the apartment in row 53, because its standardized residual is noticeably larger in absolute value than all the others; this apartment differs from the rest in its combination of variable values. We will also remove row 38 because of its large Cook’s distance.

mydata1 <- mydata[c(-38,-53),]

We have now removed the potentially problematic outliers and units with large influence.
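
Instead of reading the sorted tables by eye, the same cases could be flagged with common rules of thumb; the thresholds below (|standardized residual| > 2.5, Cook’s distance > 4/n) are conventions, not part of the original analysis:

which(abs(mydata$StdResid) > 2.5)        # unusually large standardized residuals
which(mydata$CooksD > 4 / nrow(mydata))  # influential cases by the 4/n rule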

Check for potential heteroskedasticity with a scatterplot between standardized residuals and standardized fitted values. Explain the findings.

scatterplot(y = rstandard(fit2), x = as.numeric(scale(fit2$fitted.values)),
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

From the graph we can see no obvious violation of the constant-variance assumption: the spread of the residuals looks roughly the same across the fitted values, so there is no clear heteroskedasticity. To be sure we would have to run the Breusch-Pagan heteroskedasticity test.
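
A minimal sketch of that test, assuming the lmtest package is installed (output not shown here):

library(lmtest)  # provides bptest()
bptest(fit2)     # H0: residual variance is constant (homoskedasticity)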

Are the standardized residuals distributed normally? Show the graph and formally test it. Explain the findings.

hist(mydata$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

We already drew this histogram earlier, when checking for potential outliers.

shapiro.test(mydata$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResid
## W = 0.95303, p-value = 0.003645

Because the p-value (0.0036) is smaller than 0.05, we can reject the null hypothesis that the variable is normally distributed; the standardized residuals are therefore not normally distributed. The histogram is skewed to the right.
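
A normal Q-Q plot is another common graphical check; points bending away from the reference line in the upper tail would match the right skew seen in the histogram:

qqnorm(mydata$StdResid, main = "Normal Q-Q plot of standardized residuals")
qqline(mydata$StdResid)  # reference line for a perfectly normal sample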

Estimate fit2 again without the excluded cases and show the summary of the model. Explain all coefficients.

fit2 <- lm(Price ~ Age + Distance, 
           data = mydata1)  # refit on the cleaned data set, without the two removed cases
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Based on the p-values (0.016 for age and less than 0.001 for distance) we can reject the null hypothesis for both regression coefficients and conclude that age and distance each have a statistically significant effect on price.

From the estimates we can see that if the age of the apartment increases by 1 year, the price per square meter on average decreases by 7.934 eur, holding the other variable constant. If the distance increases by 1 kilometer, the price on average decreases by 20.667 eur, ceteris paribus.

R-squared (the coefficient of determination) is 0.4396, so roughly 44% of the variability of price is explained by age and distance.
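
To illustrate how the estimated coefficients combine, here is a prediction for a hypothetical apartment (10 years old, 5 km from the center; both values are made up for the example), using the coefficients shown above:

# by hand: 2460.101 - 7.934 * 10 - 20.667 * 5 = 2277.43 eur per m2
predict(fit2, newdata = data.frame(Age = 10, Distance = 5))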

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + ParkingFactor + BalconyFactor,
           data = mydata)

With the function anova, check whether model fit3 fits the data better than model fit2.

anova(fit2,fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingFactor + BalconyFactor
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     82 6720983                              
## 2     80 5991088  2    729894 4.8732 0.01007 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

H0: the simpler model fit2 (model 1) is adequate, i.e. the additional coefficients are zero. H1: fit3 (model 2) fits the data better.

Based on the p-value of 0.01007 we can reject the null hypothesis and conclude that fit3 fits the data better than fit2.

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor, 
##     data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2301.667     94.271  24.415  < 2e-16 ***
## Age                -6.799      3.110  -2.186  0.03172 *  
## Distance          -18.045      2.758  -6.543 5.28e-09 ***
## ParkingFactorYes  196.168     62.868   3.120  0.00251 ** 
## BalconyFactorYes    1.935     60.014   0.032  0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11

Hypothesis for the F-statistic: H0: ρ² = 0 (none of the explanatory variables has a linear effect on price); H1: ρ² > 0 (at least one explanatory variable has a linear effect on price).

Apartments with a parking space have, on average, a price per square meter higher by 196.17 eur (p = 0.0025), holding the other variables constant. Based on the high p-value (0.974) for the balcony coefficient, we cannot confirm that having a balcony has an effect on the price of the apartment.
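
The parking effect can be illustrated by predicting prices for a hypothetical pair of apartments that differ only in parking (the values 15 years and 10 km are made up for the example):

newapts <- data.frame(Age = 15, Distance = 10,
                      ParkingFactor = factor(c("No", "Yes"), levels = c("No", "Yes")),
                      BalconyFactor = factor(c("No", "No"), levels = c("No", "Yes")))
predict(fit3, newdata = newapts)  # the two predictions differ by the ParkingFactorYes coefficient, about 196.17 eur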

Save fitted values and calculate the residual for apartment ID2.

mydata$Fittedvalues <- fitted.values(fit3)

mydata$StdResid <- residuals(fit3)  # raw residuals of fit3 (note: this overwrites the standardized residuals computed earlier)
head(mydata)
## # A tibble: 6 × 10
##     Age Distance Price Parking Balcony ParkingFactor BalconyFactor StdResid CooksD Fittedvalues
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>         <fct>            <dbl>  <dbl>        <dbl>
## 1     7       28  1640       0       1 No            Yes             -111.   0.007        1751.
## 2    18        1  2800       1       0 Yes           No               443.   0.03         2357.
## 3     7       28  1660       0       0 No            No               -88.8  0.006        1749.
## 4    28       29  1850       0       1 No            Yes              260.   0.008        1590.
## 5    18       18  1640       1       1 Yes           Yes             -413.   0.005        2053.
## 6    28       12  1770       0       1 No            Yes             -127.   0.005        1897.

From the calculation we can see that the residual for apartment ID2 is 442.59 (the observed price of 2800 minus the fitted value of about 2357).
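
The same residual can be computed directly as observed minus fitted:

mydata$Price[2] - mydata$Fittedvalues[2]  # 2800 - 2357.41 = 442.59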