The data set ceosal2.RData contains information on chief executive officers for U.S. corporations. Two variables of interest are the annual compensation (\(salary\)) and the prior number of years as company CEO (\(ceoten\)).
First, I want to load the data and take a quick look at them to get a first impression. Then I store them in a form that is easier to analyse in the following questions (keeping only the columns I need).
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
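# load() restores the objects stored in the file, including a data frame called `data`
load("ceosal2.RData")
# keep the data frame under a descriptive name
ceo_data = data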
head(ceo_data)
summary(ceo_data)
## salary age college grad
## Min. : 100.0 Min. :33.00 Min. :0.0000 Min. :0.0000
## 1st Qu.: 471.0 1st Qu.:52.00 1st Qu.:1.0000 1st Qu.:0.0000
## Median : 707.0 Median :57.00 Median :1.0000 Median :1.0000
## Mean : 865.9 Mean :56.43 Mean :0.9718 Mean :0.5311
## 3rd Qu.:1119.0 3rd Qu.:62.00 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :5299.0 Max. :86.00 Max. :1.0000 Max. :1.0000
## comten ceoten sales profits
## Min. : 2.0 Min. : 0.000 Min. : 29 Min. :-463.0
## 1st Qu.:12.0 1st Qu.: 3.000 1st Qu.: 561 1st Qu.: 34.0
## Median :23.0 Median : 6.000 Median : 1400 Median : 63.0
## Mean :22.5 Mean : 7.955 Mean : 3529 Mean : 207.8
## 3rd Qu.:33.0 3rd Qu.:11.000 3rd Qu.: 3500 3rd Qu.: 208.0
## Max. :58.0 Max. :37.000 Max. :51300 Max. :2700.0
## mktval lsalary lsales lmktval
## Min. : 387 Min. :4.605 Min. : 3.367 Min. : 5.958
## 1st Qu.: 644 1st Qu.:6.155 1st Qu.: 6.330 1st Qu.: 6.468
## Median : 1200 Median :6.561 Median : 7.244 Median : 7.090
## Mean : 3600 Mean :6.583 Mean : 7.231 Mean : 7.399
## 3rd Qu.: 3500 3rd Qu.:7.020 3rd Qu.: 8.161 3rd Qu.: 8.161
## Max. :45400 Max. :8.575 Max. :10.845 Max. :10.723
## comtensq ceotensq profmarg
## Min. : 4.0 Min. : 0.0 Min. :-203.077
## 1st Qu.: 144.0 1st Qu.: 9.0 1st Qu.: 4.231
## Median : 529.0 Median : 36.0 Median : 6.834
## Mean : 656.7 Mean : 114.1 Mean : 6.420
## 3rd Qu.:1089.0 3rd Qu.: 121.0 3rd Qu.: 10.947
## Max. :3364.0 Max. :1369.0 Max. : 47.458
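The loading step above relies on the `.RData` file restoring an object called `data` into the global workspace, where it can silently overwrite an existing object of the same name. A minimal sketch of a more defensive alternative is shown here: the helper `load_rdata_object()` is a hypothetical name of my own, not part of any package, and the demonstration uses a small made-up data frame written to a temporary file rather than the real data set.

```r
# Hypothetical helper: load one named object from an .RData file
# into a private environment instead of the global workspace.
load_rdata_object <- function(path, name = "data") {
  env <- new.env()
  loaded <- load(path, envir = env)  # load() returns the names it restored
  if (!(name %in% loaded)) {
    stop("object '", name, "' not found in ", path)
  }
  get(name, envir = env, inherits = FALSE)
}

# Demonstration with made-up values written to a temporary file.
tmp <- tempfile(fileext = ".RData")
local({
  data <- data.frame(salary = c(100, 200), ceoten = c(1, 5))
  save(data, file = tmp)
})
demo <- load_rdata_object(tmp)
print(demo)

# With the real file this would be (assuming it sits in the working directory):
# ceo_data <- load_rdata_object("ceosal2.RData")
```

The design choice is simply to make the name of the loaded object explicit, so the two data sets in this report cannot overwrite each other through a shared `data` object.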
mean(ceo_data$salary)
## [1] 865.8644
mean(ceo_data$ceoten)
## [1] 7.954802
# number of CEOs in their first year as CEO (ceoten == 0)
ceo_data %>%
  count(ceoten == 0)
max(ceo_data$ceoten)
## [1] 37
# salaries of CEOs with at least average tenure (mean ceoten is about 7.95)
above_av = ceo_data %>%
  filter(ceoten >= mean(ceoten)) %>%
  select(salary)
mean(above_av$salary)
## [1] 1003.5
# salaries of CEOs with below-average tenure
below_av = ceo_data %>%
  filter(ceoten < mean(ceoten)) %>%
  select(salary)
mean(below_av$salary)
## [1] 766.9806
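The two group means above can also be computed in one step. The following is only a sketch in base R: the function name `mean_salary_by_tenure()` and the toy data frame are my own illustrative assumptions, and the cutoff is the sample mean of `ceoten`, as in the comparison above.

```r
# Hypothetical helper: mean salary for CEOs at or above vs. below
# the average tenure of the data frame passed in.
mean_salary_by_tenure <- function(df) {
  stopifnot(all(c("salary", "ceoten") %in% names(df)))
  cutoff <- mean(df$ceoten)
  group <- ifelse(df$ceoten >= cutoff, "above_or_equal", "below")
  tapply(df$salary, group, mean)
}

# Made-up toy data, only to show the mechanics.
toy <- data.frame(salary = c(500, 700, 900, 1300),
                  ceoten = c(0, 2, 10, 20))
toy_means <- mean_salary_by_tenure(toy)
print(toy_means)

# Applied to the real data, if it has been loaded in this session:
if (exists("ceo_data")) print(mean_salary_by_tenure(ceo_data))
```

In the toy data the average tenure is 8 years, so the last two rows form the "above_or_equal" group and the first two the "below" group.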
library(ggplot2)
ggplot(data = ceo_data, aes(x = ceoten, y = salary)) + geom_point() + stat_smooth(method = "lm")
model = lm(log(salary) ~ ceoten, data=ceo_data)
summary(model)
##
## Call:
## lm(formula = log(salary) ~ ceoten, data = ceo_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.15314 -0.38319 -0.02251 0.44439 1.94337
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.505498 0.067991 95.682 <2e-16 ***
## ceoten 0.009724 0.006364 1.528 0.128
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6038 on 175 degrees of freedom
## Multiple R-squared: 0.01316, Adjusted R-squared: 0.007523
## F-statistic: 2.334 on 1 and 175 DF, p-value: 0.1284
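Because the dependent variable is `log(salary)`, the slope on `ceoten` is read as an approximate proportional change in salary per additional year as CEO. The sketch below works only with the numbers printed in the regression output above (the variable names are my own); it computes the approximate and exact percentage effect, a 95% confidence interval for the slope, and reproduces the reported t statistic and p-value.

```r
# Values copied from the regression output above.
b_ceoten  <- 0.009724   # estimated slope on ceoten
se_ceoten <- 0.006364   # its standard error
df_resid  <- 175        # residual degrees of freedom

# Log-level model: 100 * b is the approximate % change in salary
# for one more year of tenure; 100 * (exp(b) - 1) is the exact one.
approx_pct <- 100 * b_ceoten
exact_pct  <- 100 * (exp(b_ceoten) - 1)

# 95% confidence interval for the slope.
t_crit <- qt(0.975, df_resid)
ci <- b_ceoten + c(-1, 1) * t_crit * se_ceoten

# t statistic and two-sided p-value for H0: slope = 0.
t_stat <- b_ceoten / se_ceoten
p_val  <- 2 * pt(-abs(t_stat), df_resid)

print(round(c(approx_pct = approx_pct, exact_pct = exact_pct), 3))
print(round(ci, 4))
print(round(c(t = t_stat, p = p_val), 3))
```

One more year as CEO is associated with roughly a 1 percent higher salary, but the confidence interval includes zero, which matches the p-value of about 0.128: the effect is not statistically significant at the usual levels, and tenure alone explains only about 1.3 percent of the variation in log salary.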
The data set bwght.RData contains data on births to women in the United States. Two variables of interest are the infant birth weight in ounces (\(bwght\)), and the average number of cigarettes the mother smoked per day during pregnancy (\(cigs\)).
I will follow the same strategy and first explore the data a little.
load("bwght.RData")
birth_data = data
head(birth_data)
summary(birth_data)
## faminc cigtax cigprice bwght
## Min. : 0.50 Min. : 2.00 Min. :103.8 Min. : 23.0
## 1st Qu.:14.50 1st Qu.:15.00 1st Qu.:122.8 1st Qu.:107.0
## Median :27.50 Median :20.00 Median :130.8 Median :120.0
## Mean :29.03 Mean :19.55 Mean :130.6 Mean :118.7
## 3rd Qu.:37.50 3rd Qu.:26.00 3rd Qu.:137.0 3rd Qu.:132.0
## Max. :65.00 Max. :38.00 Max. :152.5 Max. :271.0
##
## fatheduc motheduc parity male
## Min. : 1.00 Min. : 2.00 Min. :1.000 Min. :0.0000
## 1st Qu.:12.00 1st Qu.:12.00 1st Qu.:1.000 1st Qu.:0.0000
## Median :12.00 Median :12.00 Median :1.000 Median :1.0000
## Mean :13.19 Mean :12.94 Mean :1.633 Mean :0.5209
## 3rd Qu.:16.00 3rd Qu.:14.00 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :18.00 Max. :18.00 Max. :6.000 Max. :1.0000
## NA's :196 NA's :1
## white cigs lbwght bwghtlbs
## Min. :0.0000 Min. : 0.000 Min. :3.135 Min. : 1.438
## 1st Qu.:1.0000 1st Qu.: 0.000 1st Qu.:4.673 1st Qu.: 6.688
## Median :1.0000 Median : 0.000 Median :4.787 Median : 7.500
## Mean :0.7846 Mean : 2.087 Mean :4.760 Mean : 7.419
## 3rd Qu.:1.0000 3rd Qu.: 0.000 3rd Qu.:4.883 3rd Qu.: 8.250
## Max. :1.0000 Max. :50.000 Max. :5.602 Max. :16.938
##
## packs lfaminc
## Min. :0.0000 Min. :-0.6931
## 1st Qu.:0.0000 1st Qu.: 2.6741
## Median :0.0000 Median : 3.3142
## Mean :0.1044 Mean : 3.0713
## 3rd Qu.:0.0000 3rd Qu.: 3.6243
## Max. :2.5000 Max. : 4.1744
##
model = lm(bwght ~ cigs, data=birth_data)
summary(model)
##
## Call:
## lm(formula = bwght ~ cigs, data = birth_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -96.772 -11.772 0.297 13.228 151.228
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 119.77190 0.57234 209.267 < 2e-16 ***
## cigs -0.51377 0.09049 -5.678 1.66e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.13 on 1386 degrees of freedom
## Multiple R-squared: 0.02273, Adjusted R-squared: 0.02202
## F-statistic: 32.24 on 1 and 1386 DF, p-value: 1.662e-08
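To make the fitted line concrete, the estimated coefficients can be used to predict birth weight for different smoking levels. This is a small sketch that uses only the intercept and slope printed above; the function name `predict_bwght()` and the choice of 0 and 20 cigarettes per day (one pack) are my own illustrative assumptions.

```r
# Coefficients copied from the regression output above.
b0 <- 119.77190   # intercept: predicted bwght (ounces) for a non-smoker
b1 <- -0.51377    # change in bwght (ounces) per cigarette smoked per day

# Hypothetical helper: fitted birth weight in ounces for a given cigs value.
predict_bwght <- function(cigs) b0 + b1 * cigs

pred_0  <- predict_bwght(0)    # no smoking
pred_20 <- predict_bwght(20)   # one pack per day

diff_oz  <- pred_20 - pred_0                      # difference in ounces
pct_drop <- 100 * (pred_0 - pred_20) / pred_0     # relative drop in percent

print(round(c(pred_0 = pred_0, pred_20 = pred_20,
              diff_oz = diff_oz, pct_drop = pct_drop), 2))
```

According to this model, smoking a pack a day is associated with a predicted birth weight about 10.3 ounces lower than not smoking. The slope is highly significant, yet the R-squared of about 0.023 shows that `cigs` alone explains only around 2.3 percent of the variation in birth weight, and the simple regression should not be read as a causal effect without controlling for other factors.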