C2: Use BWGHT Data to answer the questions
library(dplyr)
library(dslabs)
library(ISLR2)
library(matlib)
library(wooldridge)
data("bwght")
dat <- bwght
attach(dat)
str(dat)
## 'data.frame': 1388 obs. of 14 variables:
## $ faminc : num 13.5 7.5 0.5 15.5 27.5 7.5 65 27.5 27.5 37.5 ...
## $ cigtax : num 16.5 16.5 16.5 16.5 16.5 16.5 16.5 16.5 16.5 16.5 ...
## $ cigprice: num 122 122 122 122 122 ...
## $ bwght : int 109 133 129 126 134 118 140 86 121 129 ...
## $ fatheduc: int 12 6 NA 12 14 12 16 12 12 16 ...
## $ motheduc: int 12 12 12 12 12 14 14 14 17 18 ...
## $ parity : int 1 2 2 2 2 6 2 2 2 2 ...
## $ male : int 1 1 0 1 1 1 0 0 0 0 ...
## $ white : int 1 0 0 0 1 0 1 0 1 1 ...
## $ cigs : int 0 0 0 0 0 0 0 0 0 0 ...
## $ lbwght : num 4.69 4.89 4.86 4.84 4.9 ...
## $ bwghtlbs: num 6.81 8.31 8.06 7.88 8.38 ...
## $ packs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ lfaminc : num 2.603 2.015 -0.693 2.741 3.314 ...
## - attr(*, "time.stamp")= chr "25 Jun 2011 23:03"
cig_during_preg <- sum(dat$cigs>0)
cig_during_preg # number of mothers who smoked during pregnancy
## [1] 212
- The data set contains 1388 observations, one for each mother in the
sample. Of these, 212 mothers smoked during pregnancy.
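The count can also be expressed as a share of the sample. The sketch below shows the idiom on a small made-up vector (`cigs_toy` is hypothetical, not the BWGHT data): the mean of a logical vector is the proportion of TRUE values.

```r
# Hypothetical toy data, NOT taken from bwght:
# cigarettes smoked per day by 8 mothers
cigs_toy <- c(0, 0, 0, 10, 0, 20, 0, 0)

n_smokers   <- sum(cigs_toy > 0)   # number of mothers with cigs > 0
share_smoke <- mean(cigs_toy > 0)  # proportion of mothers with cigs > 0

n_smokers
share_smoke
```

Applied to the real data, `mean(dat$cigs > 0)` should correspond to 212/1388, roughly 15.3% of mothers.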
summary(cigs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 2.087 0.000 50.000
boxplot(cigs)

- The average number of cigarettes smoked per day is 2.087. This is
not a good measure of the “typical” woman, because most women (1176 of
1388) did not smoke during pregnancy; the median is 0.
dat_smoked <- dat[dat$cigs>0,]
str(dat_smoked$cigs)
## int [1:212] 6 10 20 40 10 10 20 3 10 20 ...
summary(dat_smoked$cigs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 8.00 10.00 13.67 20.00 50.00
boxplot(dat_smoked$cigs, col = "light yellow")

- The average is far more informative once the 1176 non-smokers are
excluded. Among the 212 mothers who smoked, the mean is 13.67
cigarettes per day and the median is 10.
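The two averages are linked: non-smokers contribute zero, so the overall mean equals the share of smokers times the mean among smokers. A minimal sketch with made-up numbers (the vector `cigs_toy` is hypothetical, not the BWGHT data):

```r
# Hypothetical toy data, NOT taken from bwght
cigs_toy <- c(0, 0, 0, 10, 0, 20, 0, 0)

smokers      <- cigs_toy[cigs_toy > 0]  # keep only positive values
overall_mean <- mean(cigs_toy)          # mean over all mothers
smoker_mean  <- mean(smokers)           # mean among smokers only
share        <- mean(cigs_toy > 0)      # proportion of smokers

# The overall mean is the smoker share times the smoker mean
all.equal(overall_mean, share * smoker_mean)
```

With the figures reported above, (212/1388) × 13.67 ≈ 2.09, which is consistent with the overall mean of 2.087.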
fatheduc_dat <- na.omit(dat$fatheduc)
str(fatheduc_dat)
## int [1:1192] 12 6 12 14 12 16 12 12 16 12 ...
## - attr(*, "na.action")= 'omit' int [1:196] 3 13 18 20 27 30 64 65 77 89 ...
summary(fatheduc_dat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 12.00 12.00 13.19 16.00 18.00
- On average, fathers in the sample have 13.19 years of education.
Only 1192 observations enter this average because fatheduc is NA for
the other 196, presumably cases where the father’s education was not
reported (for example, single mothers).
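Missing values can be counted directly and skipped without building a separate vector. The sketch below uses a short made-up vector (`fatheduc_toy` is hypothetical) to show that `na.rm = TRUE` and `na.omit()` give the same mean.

```r
# Hypothetical toy data, NOT taken from bwght
fatheduc_toy <- c(12L, 6L, NA, 12L, 14L, NA, 16L)

n_missing <- sum(is.na(fatheduc_toy))   # number of NA entries
n_used    <- sum(!is.na(fatheduc_toy))  # number of non-missing entries

mean_narm <- mean(fatheduc_toy, na.rm = TRUE)  # drop NAs inside mean()
mean_omit <- mean(na.omit(fatheduc_toy))       # drop NAs beforehand

all.equal(mean_narm, mean_omit)
```

Without `na.rm = TRUE`, `mean()` returns NA as soon as the vector contains a single missing value.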
summary(faminc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.50 14.50 27.50 29.03 37.50 65.00
sd(faminc)
## [1] 18.73928
- The average 1988 family income is about $29,030 (faminc is measured
in thousands of dollars). The standard deviation is 18.74, or about
$18,740, which is large relative to the mean: family incomes vary
widely across the sample.
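One way to judge whether a standard deviation is "large" is to compare it with the mean (the coefficient of variation). The sketch below plugs in the values printed above; the variable names are illustrative, and the conversion assumes faminc is recorded in thousands of dollars.

```r
# Values copied from the summary() and sd() output above
mean_faminc_k <- 29.03     # mean family income, in $1000s
sd_faminc_k   <- 18.73928  # standard deviation, in $1000s

# Convert to dollars (assumes the unit is $1000s)
mean_dollars <- mean_faminc_k * 1000
sd_dollars   <- sd_faminc_k * 1000

# Coefficient of variation: spread relative to the mean
cv <- sd_faminc_k / mean_faminc_k
round(cv, 2)
```

A ratio of about 0.65 means the typical deviation from the mean is roughly two thirds of the mean itself.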
C3: Working with MEAP01
data("meap01")
dat3 <- meap01
View(dat3)
attach(dat3)
str(dat3)
## 'data.frame': 1823 obs. of 11 variables:
## $ dcode : num 1010 2070 2080 3010 3010 3010 3020 3020 3020 3030 ...
## $ bcode : int 4937 597 4860 790 1403 4056 922 2864 4851 881 ...
## $ math4 : num 83.3 90.3 61.9 85.7 77.3 ...
## $ read4 : num 77.8 82.3 71.4 60 59.1 ...
## $ lunch : num 40.6 27.1 41.8 12.8 17.1 ...
## $ enroll : int 468 679 400 251 439 561 442 381 274 326 ...
## $ expend : num 2747475 1505772 2121871 1211034 1913501 ...
## $ exppp : num 5871 2218 5305 4825 4359 ...
## $ lenroll: num 6.15 6.52 5.99 5.53 6.08 ...
## $ lexpend: num 14.8 14.2 14.6 14 14.5 ...
## $ lexppp : num 8.68 7.7 8.58 8.48 8.38 ...
## - attr(*, "time.stamp")= chr "25 Jun 2011 23:03"
summary(math4)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 61.60 76.40 71.91 87.00 100.00
- The percentage of students with a satisfactory math score (math4)
ranges from 0% to 100%: in some schools every student passed the math
exam, and in at least one school no student passed. Specific pass
rates are examined below.
perfect_math_school <- sum(math4 == 100) # number of schools with a 100% math pass rate
perfect_math_rate <- perfect_math_school / length(math4) * 100
perfect_math_rate # percentage of schools with a 100% math pass rate
## [1] 2.084476
- 38 schools had every student pass the math exam (math4 = 100), about
2.08% of the 1823 schools.
half_math_school <- sum(math4 == 50) # number of schools with a math pass rate of exactly 50%
half_math_rate <- half_math_school / length(math4) * 100
half_math_rate # percentage of schools with a 50% math pass rate
## [1] 0.9325288
- 17 schools had a math pass rate of exactly 50%, about 0.93% of the
1823 schools.
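Because math4 is stored as a numeric (double) column, `==` tests exact equality. That is fine when the values were read in as exactly 50 or 100, but a rate computed from other quantities can miss by a rounding error. The sketch below illustrates the difference on made-up values (`math4_toy` and the tolerance `tol` are assumptions for illustration, not part of the MEAP01 data).

```r
# Hypothetical pass rates, NOT taken from meap01.
# The third value equals 50 only up to floating-point rounding error.
math4_toy <- c(50, 100, 25 * sqrt(2)^2, 83.3, 50)

tol <- 1e-8  # assumed tolerance, chosen for illustration

n_exact <- sum(math4_toy == 50)            # exact comparison
n_near  <- sum(abs(math4_toy - 50) < tol)  # comparison within tolerance

n_exact
n_near
```

Here the exact test counts two values and the tolerance-based test counts three. If both approaches give the same count on the real data, the exact comparison used above is safe.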
mean(math4) # average pass rate for math
## [1] 71.909
mean(read4) # average pass rate for reading
## [1] 60.06188
- The average pass rate is higher for math (71.91%) than for reading
(60.06%), which suggests that the reading test was harder to pass.
correlation <- cor(math4, read4) # correlation between math passing rate and reading passing rate.
correlation
## [1] 0.8427281
- The correlation between the math and reading pass rates is 0.8427,
indicating a strong positive linear relationship: schools with high
math pass rates tend to have high reading pass rates as well.
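For reference, the Pearson correlation returned by `cor()` is the covariance divided by the product of the two standard deviations. A minimal check on made-up pass rates (`math_toy` and `read_toy` are hypothetical, not the MEAP01 data):

```r
# Hypothetical pass rates for 5 schools, NOT taken from meap01
math_toy <- c(80, 90, 60, 85, 75)
read_toy <- c(75, 80, 70, 60, 58)

r_builtin <- cor(math_toy, read_toy)  # Pearson correlation
r_manual  <- cov(math_toy, read_toy) /
  (sd(math_toy) * sd(read_toy))       # covariance scaled by both SDs

all.equal(r_builtin, r_manual)
```

Because the covariance is scaled by both standard deviations, the result is unit-free and always lies between -1 and 1.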
mean(exppp) # average expenditures per pupil
## [1] 5194.865
sd(exppp) # standard deviation of expenditures per pupil
## [1] 1091.89
library(ggplot2)
ggplot(dat3, aes(x=exppp)) + geom_density()

- Average expenditure per pupil is about $5,194.87, with a standard
deviation of $1,091.89 (roughly 21% of the mean). The density plot
shows that spending per pupil is spread over a wide range, so
per-pupil spending differs substantially across schools.