Question 2

Q2.1

library(dplyr)  # provides %>% and filter()
data <- wooldridge::bwght
obs <- nrow(data)
data_nonsmoke <- data %>% filter(cigs==0)
nonsmoke <- nrow(data_nonsmoke)
obs
## [1] 1388
nonsmoke
## [1] 1176

According to the result, the sample contains 1,388 observations (women). Of these, 1,176 women do not smoke (cigs = 0).

Q2.2

# named mean_cigs to avoid shadowing base::mean()
mean_cigs <- mean(data$cigs)
dis1 <- table(data$cigs)
dis2 <- hist(data$cigs)

mean_cigs
## [1] 2.087176
dis1
## 
##    0    1    2    3    4    5    6    7    8    9   10   12   15   20   30   40 
## 1176    3    4    7    9   19    6    4    5    1   55    5   19   62    5    6 
##   46   50 
##    1    1
dis2
## $breaks
##  [1]  0  5 10 15 20 25 30 35 40 45 50
## 
## $counts
##  [1] 1218   71   24   62    0    5    0    6    0    2
## 
## $density
##  [1] 0.1755043228 0.0102305476 0.0034582133 0.0089337176 0.0000000000
##  [6] 0.0007204611 0.0000000000 0.0008645533 0.0000000000 0.0002881844
## 
## $mids
##  [1]  2.5  7.5 12.5 17.5 22.5 27.5 32.5 37.5 42.5 47.5
## 
## $xname
## [1] "data$cigs"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

The average number of cigarettes smoked per day is 2.087. However, the average is not a good measure of the typical woman here, because the distribution is far from normal and strongly right-skewed: 1,176 of the 1,388 women (about 85%) smoke zero cigarettes, so the median is 0. In addition, a small number of heavy smokers (up to 50 cigarettes per day) pull the average upward.
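As a rough check on the skewness argument, the frequency table printed above can be re-entered by hand and used to compare the mean with the median. The vectors `values` and `counts` are typed in from that output rather than read from the package, so this is only a sketch under that assumption, not part of the original analysis.

```r
# Frequency table of cigs, copied by hand from the table() output above
values <- c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 15, 20, 30, 40, 46, 50)
counts <- c(1176, 3, 4, 7, 9, 19, 6, 4, 5, 1, 55, 5, 19, 62, 5, 6, 1, 1)

# Rebuild the full sample from the table
cigs <- rep(values, times = counts)

length(cigs)      # number of observations
mean(cigs)        # sample mean
median(cigs)      # middle observation
mean(cigs == 0)   # share of women reporting zero cigarettes
```

Because more than half of the rebuilt sample is zero, the median is 0, while the mean is pulled above 2 by the minority of smokers.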

Q2.3

data_smoke <- data %>% filter(cigs != 0)
mean(data_smoke$cigs)
## [1] 13.66509

Among women who smoked, the average number of cigarettes smoked per day was 13.665, much higher than the 2.087 found in Q2.2. The reason is that about 85% of the observations in Q2.2 are nonsmokers (cigs = 0), and this large group of zeros pulls the full-sample average down.
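The link between the two averages can be made explicit: nonsmokers contribute zero, so the full-sample mean equals the smokers' mean multiplied by the share of smokers. The sketch below uses only numbers reported in the output above; the variable names are illustrative.

```r
# Numbers taken from the output above
n_total      <- 1388          # all women in the sample
n_smokers    <- 1388 - 1176   # women with cigs > 0
mean_smokers <- 13.66509      # mean of cigs among smokers

# Nonsmokers add zero to the total, so the overall mean is the
# smokers' mean scaled by the share of smokers in the sample
share_smokers <- n_smokers / n_total
overall_mean  <- share_smokers * mean_smokers

share_smokers
overall_mean
```

The result reproduces the full-sample mean from Q2.2 up to rounding of the smokers' mean.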

Q2.4

mean_fa <- mean(data$fatheduc, na.rm= TRUE)
mean_fa
## [1] 13.18624
na_fa <- sum(is.na(data$fatheduc))
na_fa
## [1] 196

The average of fatheduc (father's years of education) is 13.19 years. Only 1,192 observations are used to compute this average, because 196 of the 1,388 values are missing.
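The role of `na.rm = TRUE` can be seen on a small made-up vector. The vector `x` below is hypothetical and is not taken from bwght; it only illustrates how missing values affect `mean()` and how to count the observations actually used.

```r
# Hypothetical toy vector with missing values (not taken from bwght)
x <- c(12, 16, NA, 14, NA)

mean(x)                 # NA: missing values propagate by default
mean(x, na.rm = TRUE)   # mean of the non-missing values only
sum(is.na(x))           # number of missing values
sum(!is.na(x))          # number of values actually used in the mean
```

Applied to fatheduc, the same counting logic gives 1388 - 196 = 1192 observations used.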

Q2.5

mean_inc <- mean(data$faminc)
std_inc <- sd(data$faminc)
mean_inc
## [1] 29.02666
std_inc
## [1] 18.73928

The average family income in the dataset is about 29,027 dollars per year, with a standard deviation of about 18,739 dollars (faminc is recorded in thousands of dollars).

Question 3

Q3.1

data2 <- wooldridge::meap01
max(data2$math4)
## [1] 100
min(data2$math4)
## [1] 0

The largest and smallest values of math4 are 100 and 0. This range makes sense because math4 is a pass rate in percent, so it must lie between 0 and 100. Still, the extremes are striking: some schools have a 100% pass rate, while in others no student passed the math test. This points to a large gap in performance across schools.

Q3.2

data2_pass100 <- data2 %>% filter(math4==100)
pass100 <- nrow(data2_pass100)
perpass100 <- nrow(data2_pass100)/nrow(data2)*100
pass100
## [1] 38
perpass100
## [1] 2.084476

There are 38 schools with a perfect pass rate on the math test, which is 2.08% of the total sample.

Q3.3

data2_pass50 <- data2 %>% filter(math4==50)
pass50 <- nrow(data2_pass50)
pass50
## [1] 17

There are 17 schools with a math pass rate of exactly 50%.

Q3.4

mean(data2$math4)
## [1] 71.909
mean(data2$read4)
## [1] 60.06188

The average pass rate is 71.909% for math and 60.062% for reading. Because the reading pass rate is lower on average, the reading test is harder to pass.

Q3.5

cor(data2$math4,data2$read4)
## [1] 0.8427281

The correlation between math4 and read4 is 0.843, indicating a strong positive linear relationship: schools with high math pass rates tend to have high reading pass rates as well.

Q3.6

mean(data2$exppp)
## [1] 5194.865
sd(data2$exppp)
## [1] 1091.89

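The average expenditure per pupil is about 5,195 dollars, with a standard deviation of about 1,092 dollars. Whether this counts as wide variation depends on context; a natural benchmark is the size of the standard deviation relative to the mean.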
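One way to put the standard deviation in context is the coefficient of variation, the standard deviation divided by the mean. The sketch below uses the two numbers from the output above; the variable names are illustrative, and the choice of this particular benchmark is an assumption rather than something the exercise asks for.

```r
# Values copied from the output above
mean_exppp <- 5194.865
sd_exppp   <- 1091.89

# Coefficient of variation: spread relative to the average level
cv <- sd_exppp / mean_exppp
round(100 * cv, 1)   # expressed in percent of the mean
```

A standard deviation of roughly one fifth of the mean suggests that per-pupil spending differs noticeably across schools.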

Q3.7

dif <- (6000-5500)/5500
dif
## [1] 0.09090909
dif_log <- (log(6000)-log(5500))
dif_log
## [1] 0.08701138
dif - dif_log
## [1] 0.003897714

School A spends 9.09% more than school B. Using the difference in natural logs, the approximate percentage difference is 8.70%. The two figures differ by about 0.39 percentage points.
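The log difference approximates the exact percentage change well only when the gap is small. The sketch below compares the two for several spending levels relative to 5500. Only 6000 comes from the question; the other values in `other` are illustrative assumptions.

```r
# Exact proportional gap versus the log-difference approximation
base  <- 5500
other <- c(5600, 6000, 7000, 11000)   # 6000 is from the question; the rest are illustrative

exact  <- (other - base) / base       # exact proportional difference
approx <- log(other) - log(base)      # log-difference approximation

# Side-by-side comparison; gap is how far the approximation falls short
round(data.frame(other, exact, approx, gap = exact - approx), 4)
```

The gap column grows as the two spending levels move further apart, which is why the log approximation is usually reserved for small percentage changes.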