Marios Zoulias 01766825

Question 1

The data set ceosal2.RData contains information on chief executive officers for U.S. corporations. Two variables of interest are the annual compensation (\(salary\)) and the prior number of years as company CEO (\(ceoten\)).

First I load the data and take a quick look at it. Then I store it under a descriptive name so that it is easier to work with in the questions that follow.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
load("ceosal2.RData")
ceo_data = data

head(ceo_data)
summary(ceo_data)
##      salary            age           college            grad       
##  Min.   : 100.0   Min.   :33.00   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 471.0   1st Qu.:52.00   1st Qu.:1.0000   1st Qu.:0.0000  
##  Median : 707.0   Median :57.00   Median :1.0000   Median :1.0000  
##  Mean   : 865.9   Mean   :56.43   Mean   :0.9718   Mean   :0.5311  
##  3rd Qu.:1119.0   3rd Qu.:62.00   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :5299.0   Max.   :86.00   Max.   :1.0000   Max.   :1.0000  
##      comten         ceoten           sales          profits      
##  Min.   : 2.0   Min.   : 0.000   Min.   :   29   Min.   :-463.0  
##  1st Qu.:12.0   1st Qu.: 3.000   1st Qu.:  561   1st Qu.:  34.0  
##  Median :23.0   Median : 6.000   Median : 1400   Median :  63.0  
##  Mean   :22.5   Mean   : 7.955   Mean   : 3529   Mean   : 207.8  
##  3rd Qu.:33.0   3rd Qu.:11.000   3rd Qu.: 3500   3rd Qu.: 208.0  
##  Max.   :58.0   Max.   :37.000   Max.   :51300   Max.   :2700.0  
##      mktval         lsalary          lsales          lmktval      
##  Min.   :  387   Min.   :4.605   Min.   : 3.367   Min.   : 5.958  
##  1st Qu.:  644   1st Qu.:6.155   1st Qu.: 6.330   1st Qu.: 6.468  
##  Median : 1200   Median :6.561   Median : 7.244   Median : 7.090  
##  Mean   : 3600   Mean   :6.583   Mean   : 7.231   Mean   : 7.399  
##  3rd Qu.: 3500   3rd Qu.:7.020   3rd Qu.: 8.161   3rd Qu.: 8.161  
##  Max.   :45400   Max.   :8.575   Max.   :10.845   Max.   :10.723  
##     comtensq         ceotensq         profmarg       
##  Min.   :   4.0   Min.   :   0.0   Min.   :-203.077  
##  1st Qu.: 144.0   1st Qu.:   9.0   1st Qu.:   4.231  
##  Median : 529.0   Median :  36.0   Median :   6.834  
##  Mean   : 656.7   Mean   : 114.1   Mean   :   6.420  
##  3rd Qu.:1089.0   3rd Qu.: 121.0   3rd Qu.:  10.947  
##  Max.   :3364.0   Max.   :1369.0   Max.   :  47.458
  1. Find the average salary and the average tenure in the sample.
    The average salary is 865.86 and the average tenure is 7.95 years.
mean(ceo_data$salary)
## [1] 865.8644
mean(ceo_data$ceoten)
## [1] 7.954802
  2. How many CEOs are in their first year as CEO (that is, \(ceoten=0\))? What is the longest tenure as a CEO?
    Five CEOs are in their first year (ceoten = 0), and the longest tenure as CEO is 37 years.
ceo_data %>%
  count(ceoten == 0)
max(ceo_data$ceoten)
## [1] 37
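As a cross-check on the count, a logical comparison can also be summed directly in base R, because TRUE counts as 1. The sketch below uses a small made-up tenure vector (`toy_ceoten` is hypothetical, not the ceosal2 data), so the numbers it prints are not results from the assignment.

```r
# Hypothetical tenure values, for illustration only (not the ceosal2 data)
toy_ceoten <- c(0, 3, 0, 12, 7, 1, 37)

# TRUE is treated as 1 when summed, so this counts the zeros
n_first_year <- sum(toy_ceoten == 0)

# Longest tenure in the toy vector
longest <- max(toy_ceoten)

print(c(first_year = n_first_year, longest = longest))
```

On the real data the equivalent call would presumably be `sum(ceo_data$ceoten == 0)`.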
  3. What is the average salary for CEOs with tenure longer than or equal to the average tenure? What is the average salary for CEOs with tenure shorter than the average tenure?
    The average salary for CEOs with tenure longer than or equal to the average tenure is 1003.5, and for CEOs with tenure shorter than the average it is 766.98.
# Compare with the computed sample mean instead of a rounded constant
above_av = ceo_data %>%
  filter(ceoten >= mean(ceoten)) %>%
  select(salary)

mean(above_av$salary)
## [1] 1003.5
below_av = ceo_data %>%
  filter(ceoten < mean(ceoten)) %>%
  select(salary)

mean(below_av$salary)
## [1] 766.9806
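The same split can be sketched in base R without creating two separate data frames. The example below is only an illustration on a made-up data frame (`toy` and its values are hypothetical, not the ceosal2 data); it flags each row by whether tenure is at or above the computed mean and then averages salary within each group.

```r
# Hypothetical data frame, for illustration only (not the ceosal2 data)
toy <- data.frame(
  salary = c(500, 700, 900, 1200, 1500, 650),
  ceoten = c(1, 2, 4, 10, 15, 3)
)

# Flag tenure at or above the sample mean, computed rather than typed in
avg_ten <- mean(toy$ceoten)
at_or_above <- toy$ceoten >= avg_ten

# Mean salary within each group; the names are "FALSE" (below) and "TRUE" (at or above)
group_means <- tapply(toy$salary, at_or_above, mean)
print(group_means)
```

Computing the threshold with `mean()` avoids depending on how the average was rounded when it was typed.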
  4. Create a graph to examine the relationship between \(salary\) and \(ceoten\) for all CEOs. Comment.
    To examine this relationship I use a scatterplot of salary against ceoten, with a fitted regression line. Salary does not appear to be strongly related to the number of prior years as CEO: CEOs with 0-7 years of tenure earn roughly as much as CEOs with more than 7 years. The pattern may differ from company to company, but overall the relationship between salary and ceoten looks at most slightly positive.
library(ggplot2)
ggplot(data = ceo_data, aes(x = ceoten, y = salary)) + geom_point() + stat_smooth(method = "lm")

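  5. Estimate the simple regression model \[\log(salary)=\beta_0+\beta_1 ceoten+u,\] and report your results. What is the (approximate) predicted percentage increase in salary given one more year as a CEO?
    The estimated equation is \[\widehat{\log(salary)}=6.505+0.0097\, ceoten,\] with \(n=177\) and \(R^2=0.013\). One more year as CEO is predicted to raise salary by approximately \(100\times 0.0097\approx 0.97\%\). The coefficient is not statistically significant at conventional levels (p-value 0.128).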
model = lm(log(salary) ~ ceoten, data=ceo_data)
summary(model)
## 
## Call:
## lm(formula = log(salary) ~ ceoten, data = ceo_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.15314 -0.38319 -0.02251  0.44439  1.94337 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.505498   0.067991  95.682   <2e-16 ***
## ceoten      0.009724   0.006364   1.528    0.128    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6038 on 175 degrees of freedom
## Multiple R-squared:  0.01316,    Adjusted R-squared:  0.007523 
## F-statistic: 2.334 on 1 and 175 DF,  p-value: 0.1284
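The 0.97% figure relies on the usual approximation that 100 times the slope is the percentage change in a log-level model. As a rough check, the sketch below compares it with the exact change implied by the same slope, \(100(e^{\hat{\beta}_1}-1)\). The slope is copied by hand from the output above; `b1`, `approx_pct` and `exact_pct` are names chosen here for illustration.

```r
# Slope copied from the summary() output above
b1 <- 0.009724

# Approximation used in the text: 100 * b1
approx_pct <- 100 * b1

# Exact percentage change implied by a log-level model: 100 * (exp(b1) - 1)
exact_pct <- 100 * (exp(b1) - 1)

print(round(c(approx = approx_pct, exact = exact_pct), 3))
```

For a slope this small the two values differ only in the third decimal place, so the approximation is harmless here.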

Question 2

The data set bwght.RData contains data on births to women in the United States. Two variables of interest are the infant birth weight in ounces (\(bwght\)), and the average number of cigarettes the mother smoked per day during pregnancy (\(cigs\)).

I follow the same strategy as before and first explore the data briefly.

load("bwght.RData")
birth_data = data

head(birth_data)
summary(birth_data)
##      faminc          cigtax         cigprice         bwght      
##  Min.   : 0.50   Min.   : 2.00   Min.   :103.8   Min.   : 23.0  
##  1st Qu.:14.50   1st Qu.:15.00   1st Qu.:122.8   1st Qu.:107.0  
##  Median :27.50   Median :20.00   Median :130.8   Median :120.0  
##  Mean   :29.03   Mean   :19.55   Mean   :130.6   Mean   :118.7  
##  3rd Qu.:37.50   3rd Qu.:26.00   3rd Qu.:137.0   3rd Qu.:132.0  
##  Max.   :65.00   Max.   :38.00   Max.   :152.5   Max.   :271.0  
##                                                                 
##     fatheduc        motheduc         parity           male       
##  Min.   : 1.00   Min.   : 2.00   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:12.00   1st Qu.:12.00   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :12.00   Median :12.00   Median :1.000   Median :1.0000  
##  Mean   :13.19   Mean   :12.94   Mean   :1.633   Mean   :0.5209  
##  3rd Qu.:16.00   3rd Qu.:14.00   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :18.00   Max.   :18.00   Max.   :6.000   Max.   :1.0000  
##  NA's   :196     NA's   :1                                       
##      white             cigs            lbwght         bwghtlbs     
##  Min.   :0.0000   Min.   : 0.000   Min.   :3.135   Min.   : 1.438  
##  1st Qu.:1.0000   1st Qu.: 0.000   1st Qu.:4.673   1st Qu.: 6.688  
##  Median :1.0000   Median : 0.000   Median :4.787   Median : 7.500  
##  Mean   :0.7846   Mean   : 2.087   Mean   :4.760   Mean   : 7.419  
##  3rd Qu.:1.0000   3rd Qu.: 0.000   3rd Qu.:4.883   3rd Qu.: 8.250  
##  Max.   :1.0000   Max.   :50.000   Max.   :5.602   Max.   :16.938  
##                                                                    
##      packs           lfaminc       
##  Min.   :0.0000   Min.   :-0.6931  
##  1st Qu.:0.0000   1st Qu.: 2.6741  
##  Median :0.0000   Median : 3.3142  
##  Mean   :0.1044   Mean   : 3.0713  
##  3rd Qu.:0.0000   3rd Qu.: 3.6243  
##  Max.   :2.5000   Max.   : 4.1744  
## 
  1. Estimate the simple regression model \[bwght=\beta_0+\beta_1 cigs+u,\] and report your results.
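    The estimated equation is \[\widehat{bwght}=119.77-0.51\, cigs,\] with \(n=1388\) and \(R^2=0.023\). The slope is statistically significant (p-value \(1.66\times 10^{-8}\)).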
model = lm(bwght ~ cigs, data=birth_data)
summary(model)
## 
## Call:
## lm(formula = bwght ~ cigs, data = birth_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -96.772 -11.772   0.297  13.228 151.228 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 119.77190    0.57234 209.267  < 2e-16 ***
## cigs         -0.51377    0.09049  -5.678 1.66e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.13 on 1386 degrees of freedom
## Multiple R-squared:  0.02273,    Adjusted R-squared:  0.02202 
## F-statistic: 32.24 on 1 and 1386 DF,  p-value: 1.662e-08
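When reporting the results it can help to attach a confidence interval to the slope. The sketch below rebuilds an approximate 95% interval from the estimate, standard error and degrees of freedom printed above; the values are copied by hand, and `b1`, `se_b1` and `ci` are names chosen here for illustration.

```r
# Values copied from the summary() output above
b1    <- -0.51377   # slope on cigs
se_b1 <- 0.09049    # its standard error
df    <- 1386       # residual degrees of freedom

# Two-sided 95% critical value from the t distribution
t_crit <- qt(0.975, df)

# Confidence interval: estimate plus/minus critical value times standard error
ci <- b1 + c(-1, 1) * t_crit * se_b1
print(round(ci, 3))
```

With the fitted object available, `confint(model)` should give the same interval without copying numbers by hand.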
  2. What is the predicted birth weight when \(cigs = 0\)? What about when \(cigs = 20\) (one pack per day)? Comment on the difference.

    Using the fitted equation:
    for cigs = 0 we have \[\widehat{bwght}=119.77-0.51\times 0=119.77\]
    for cigs = 20 we have \[\widehat{bwght}=119.77-0.51\times 20=109.57\]
    The difference between the two predictions is \(20\times\hat{\beta}_1=-10.2\) ounces, about 8.5% of the predicted weight for a non-smoker. The negative relationship between cigs and bwght means that the more cigarettes the mother smokes during pregnancy, the lower the predicted birth weight.
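The hand calculation uses the slope rounded to two decimals. A minimal sketch with the unrounded coefficients from the regression output is given below (values copied by hand; `b0`, `b1` and `pred` are names chosen here). It gives a slightly lower prediction at 20 cigarettes, about 109.5 ounces, and the gap to the hand calculation comes only from rounding.

```r
# Coefficients copied from the summary() output for bwght ~ cigs
b0 <- 119.77190
b1 <- -0.51377

# Fitted birth weight (ounces) at 0 and 20 cigarettes per day
cigs <- c(0, 20)
pred <- b0 + b1 * cigs

# Difference between the two predictions, in ounces and as a share of the first
diff_oz  <- pred[2] - pred[1]
diff_pct <- 100 * diff_oz / pred[1]

print(round(c(pred_0 = pred[1], pred_20 = pred[2],
              diff_oz = diff_oz, diff_pct = diff_pct), 2))
```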

  3. Does this simple regression necessarily capture a causal relationship between the child’s birth weight and the mother’s smoking habits? Explain.

    The regression shows a negative relationship between the two variables, as stated above. However, this does not necessarily mean that smoking more reduces the infant’s birth weight. The slope has a causal interpretation only if the other factors in \(u\) are unrelated to cigs, and factors such as family income, parental education or prenatal care plausibly affect birth weight and are correlated with smoking. The R-squared of 0.02 also shows that cigs explains only about 2% of the variation in bwght, so many other factors are being ignored.
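The concern can be made more precise with the standard omitted-variable result. Suppose, purely as an assumption for illustration, that birth weight also depends on one omitted factor \(x\) (family income, say), so that the true model is \[bwght=\beta_0+\beta_1 cigs+\beta_2 x+v.\] If \(\tilde{\beta}_1\) denotes the slope from the simple regression of bwght on cigs and \(\tilde{\delta}_1\) the slope from regressing \(x\) on cigs in the same sample, then, conditional on the sample values of cigs and \(x\), \[E(\tilde{\beta}_1)=\beta_1+\beta_2\tilde{\delta}_1.\] Under this assumed setup, if \(x\) raises birth weight (\(\beta_2>0\)) and smokers tend to have lower \(x\) (\(\tilde{\delta}_1<0\)), the bias term is negative and the simple regression would overstate the harm from smoking. Whether that is the case here cannot be read off the simple regression alone.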

  4. To predict a birth weight of \(125\) ounces, what would \(cigs\) have to be? Comment.

    To predict a weight of 125 ounces we solve the equation \[125 = 119.77 - 0.51\times cigs\] which gives \(cigs=(119.77-125)/0.51\approx -10.25\). A negative value is impossible, since a woman cannot smoke -10 cigarettes per day. The fitted line never exceeds its intercept of 119.77 ounces for any feasible value of cigs, so this model cannot predict a birth weight of 125 ounces.
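The inversion can be sketched in R with the unrounded coefficients from the regression output (copied by hand; `target`, `cigs_needed` and `max_fitted` are names chosen here). It gives roughly -10.2 instead of -10.25, again only because of rounding, and it also checks that over the observed range of cigs (0 to 50 in the summary) the largest fitted value is the intercept.

```r
# Coefficients copied from the summary() output for bwght ~ cigs
b0 <- 119.77190
b1 <- -0.51377

# Solve target = b0 + b1 * cigs for cigs
target <- 125
cigs_needed <- (target - b0) / b1

# Largest fitted value over the observed range of cigs occurs at cigs = 0
max_fitted <- max(b0 + b1 * 0:50)

print(round(c(cigs_needed = cigs_needed, max_fitted = max_fitted), 2))
```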

  5. The proportion of women in the sample who do not smoke while pregnant is about \(85\%\). Does this reconcile your finding from part 4?

    Yes. Because about 85% of the women have cigs = 0, the model predicts the same birth weight, 119.77 ounces, for most of the sample, and only the remaining 15% of observations contribute variation in cigs. Many infants in the sample nevertheless weigh more than 125 ounces (the third quartile of bwght is 132), so a model with cigs as the only explanatory variable cannot reproduce those weights. The negative value found in part 4 is therefore a limitation of the model and not a meaningful prediction. We need additional explanatory variables to predict birth weight well.