HW1_230923

Homework 1: Working with BWGHT data

C2: Use BWGHT data to answer the questions

library(dplyr)
library(dslabs)
library(ISLR2)
library(matlib)
library(wooldridge)
data("bwght")
dat <- bwght
attach(dat)
str(dat)

## 'data.frame':    1388 obs. of  14 variables:
##  $ faminc  : num  13.5 7.5 0.5 15.5 27.5 7.5 65 27.5 27.5 37.5 ...
##  $ cigtax  : num  16.5 16.5 16.5 16.5 16.5 16.5 16.5 16.5 16.5 16.5 ...
##  $ cigprice: num  122 122 122 122 122 ...
##  $ bwght   : int  109 133 129 126 134 118 140 86 121 129 ...
##  $ fatheduc: int  12 6 NA 12 14 12 16 12 12 16 ...
##  $ motheduc: int  12 12 12 12 12 14 14 14 17 18 ...
##  $ parity  : int  1 2 2 2 2 6 2 2 2 2 ...
##  $ male    : int  1 1 0 1 1 1 0 0 0 0 ...
##  $ white   : int  1 0 0 0 1 0 1 0 1 1 ...
##  $ cigs    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ lbwght  : num  4.69 4.89 4.86 4.84 4.9 ...
##  $ bwghtlbs: num  6.81 8.31 8.06 7.88 8.38 ...
##  $ packs   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ lfaminc : num  2.603 2.015 -0.693 2.741 3.314 ...
##  - attr(*, "time.stamp")= chr "25 Jun 2011 23:03"

cig_during_preg <- sum(dat$cigs>0)
cig_during_preg # number of mothers who smoked during pregnancy

## [1] 212

The data set contains 1388 observations, one for each mother in the sample. Of these, 212 mothers smoked during pregnancy.
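To put the count in perspective, the share of smokers can be computed from the two numbers reported above. This is a minimal sketch that hard-codes those counts instead of reloading the data; the variable names are illustrative.

```r
# Counts taken from the output above (not recomputed from the data)
n_mothers <- 1388   # total observations
n_smokers <- 212    # mothers with cigs > 0

# Share of mothers who smoked during pregnancy, in percent
pct_smokers <- n_smokers / n_mothers * 100
round(pct_smokers, 1)
```

So roughly 15% of the mothers smoked. With the data loaded, `mean(dat$cigs > 0) * 100` should give the same figure.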

summary(cigs)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   2.087   0.000  50.000

boxplot(cigs)

The average number of cigarettes smoked per day is 2.087, but this is not a good measure of the “typical” woman: most women (1176 of 1388) did not smoke during pregnancy, so the median and the third quartile are both 0.

dat_smoked <- dat[dat$cigs>0,]
str(dat_smoked$cigs)

##  int [1:212] 6 10 20 40 10 10 20 3 10 20 ...

summary(dat_smoked$cigs)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    8.00   10.00   13.67   20.00   50.00

boxplot(dat_smoked$cigs, col = "light yellow")

The average is more informative once the 1176 non-smoking observations are excluded. Among the 212 mothers who smoked, the mean is 13.67 cigarettes per day and the median is 10.
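The two means are linked: non-smokers contribute zero cigarettes, so the overall mean equals the smokers' mean weighted by the share of smokers. The sketch below checks this using the rounded values printed above (the variable names are illustrative, and small differences come from rounding 13.67).

```r
# Values copied from the printed output above
n_mothers    <- 1388
n_smokers    <- 212
mean_smokers <- 13.67   # rounded mean of cigs among smokers

# Non-smokers add 0 to the total, so:
# overall mean = (number of smokers * smokers' mean) / total observations
implied_overall_mean <- n_smokers * mean_smokers / n_mothers
implied_overall_mean
```

The result is close to the overall mean of 2.087 reported by `summary(cigs)`, which confirms that the low overall average is driven by the large number of zeros.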

fatheduc_dat <- na.omit(dat$fatheduc)
str(fatheduc_dat)

##  int [1:1192] 12 6 12 14 12 16 12 12 16 12 ...
##  - attr(*, "na.action")= 'omit' int [1:196] 3 13 18 20 27 30 64 65 77 89 ...

summary(fatheduc_dat)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   12.00   12.00   13.19   16.00   18.00

On average, fathers in the sample have 13.19 years of education. The variable fatheduc is recorded as NA for 196 observations, possibly because the mother was single or the father's education was not reported, so the average is based on the remaining 1192 observations.
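As a small illustration of how the missing values are handled, the sketch below uses only the first five fatheduc values shown in the `str(dat)` output (12, 6, NA, 12, 14). It compares the `na.omit()` approach used above with the `na.rm = TRUE` argument of `mean()`; the variable names are illustrative.

```r
# First five fatheduc values as printed by str(dat) above
fatheduc_head <- c(12, 6, NA, 12, 14)

# Number of missing values
n_missing <- sum(is.na(fatheduc_head))

# Two equivalent ways to average over the non-missing values
mean_omit <- mean(na.omit(fatheduc_head))
mean_narm <- mean(fatheduc_head, na.rm = TRUE)

# Without either option, the mean of a vector containing NA is NA
mean_raw <- mean(fatheduc_head)
```

Both approaches drop the NA before averaging, so they agree; `na.rm = TRUE` simply avoids creating a separate vector.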

summary(faminc)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.50   14.50   27.50   29.03   37.50   65.00

sd(faminc)

## [1] 18.73928

Family income (faminc) is measured in thousands of 1988 dollars, so the average family income was about $29,030. The standard deviation is 18.74 (about $18,740), which is large relative to the mean and indicates that income varied widely across families.

C3: Working with MEAP01

data("meap01")
dat3 <- meap01
View(dat3)
attach(dat3)
str(dat3)

## 'data.frame':    1823 obs. of  11 variables:
##  $ dcode  : num  1010 2070 2080 3010 3010 3010 3020 3020 3020 3030 ...
##  $ bcode  : int  4937 597 4860 790 1403 4056 922 2864 4851 881 ...
##  $ math4  : num  83.3 90.3 61.9 85.7 77.3 ...
##  $ read4  : num  77.8 82.3 71.4 60 59.1 ...
##  $ lunch  : num  40.6 27.1 41.8 12.8 17.1 ...
##  $ enroll : int  468 679 400 251 439 561 442 381 274 326 ...
##  $ expend : num  2747475 1505772 2121871 1211034 1913501 ...
##  $ exppp  : num  5871 2218 5305 4825 4359 ...
##  $ lenroll: num  6.15 6.52 5.99 5.53 6.08 ...
##  $ lexpend: num  14.8 14.2 14.6 14 14.5 ...
##  $ lexppp : num  8.68 7.7 8.58 8.48 8.38 ...
##  - attr(*, "time.stamp")= chr "25 Jun 2011 23:03"

summary(math4)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   61.60   76.40   71.91   87.00  100.00

The percentage of students with a satisfactory math score (math4) ranges from 0% to 100%. This means that in some schools every student passed the math exam, while in at least one school no student passed. The number of schools at specific pass rates is examined below.

perfect_math_school <- sum(math4 == 100) # number of schools with a 100% math pass rate
perfect_math_rate <- perfect_math_school / length(math4) * 100
perfect_math_rate # percentage of schools with a 100% math pass rate

## [1] 2.084476

In 38 schools, all students passed the math exam. This is about 2.08% of the 1823 schools in the data.

half_math_school <- sum(math4 == 50) # number of schools with a math pass rate of exactly 50%
half_math_rate <- half_math_school / length(math4) * 100
half_math_rate # percentage of schools with a math pass rate of exactly 50%

## [1] 0.9325288

In 17 schools, exactly 50% of students passed the math exam. This is about 0.93% of all schools.
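The two calculations above follow the same pattern, so they could be wrapped in one function. The sketch below is a hypothetical helper (`share_equal` is not part of any package) and runs on made-up pass rates, not on meap01. It compares with a small tolerance instead of `==`, since exact equality on non-integer numeric values can be fragile.

```r
# Hypothetical helper: count and percentage of entries equal to `value`,
# using a small tolerance instead of exact `==` on doubles
share_equal <- function(x, value, tol = 1e-8) {
  hits <- abs(x - value) < tol
  c(count = sum(hits), percent = mean(hits) * 100)
}

# Toy pass rates (made up for illustration, not taken from meap01)
toy_rates <- c(100, 50, 83.3, 50, 61.9, 100, 100, 0)

res_100 <- share_equal(toy_rates, 100)  # schools where everyone passed
res_50  <- share_equal(toy_rates, 50)   # schools where exactly half passed
```

With the real data, `share_equal(math4, 100)` and `share_equal(math4, 50)` would be expected to reproduce the counts and percentages reported above.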

mean(math4) # average pass rate for math

## [1] 71.909

mean(read4) # average pass rate for reading

## [1] 60.06188

The average pass rate is higher for math (71.91%) than for reading (60.06%). This suggests that the reading test was harder to pass.

correlation <- cor(math4, read4) # correlation between math passing rate and reading passing rate. 
correlation

## [1] 0.8427281

The correlation between the math pass rate and the reading pass rate is 0.8427, indicating a strong positive linear relationship between the two variables: schools with high math pass rates tend to have high reading pass rates as well.
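For reference, the Pearson correlation returned by `cor()` is the covariance divided by the product of the two standard deviations. The sketch below verifies this on the first five (rounded) math4 and read4 values shown in the `str(dat3)` output; the variable names are illustrative, and the correlation of these five schools is not expected to match the full-sample value of 0.8427.

```r
# First five math4 and read4 values as printed by str(dat3) above (rounded)
math_head <- c(83.3, 90.3, 61.9, 85.7, 77.3)
read_head <- c(77.8, 82.3, 71.4, 60.0, 59.1)

# Pearson correlation computed by hand: cov(x, y) / (sd(x) * sd(y))
r_manual <- cov(math_head, read_head) / (sd(math_head) * sd(read_head))

# Built-in version for comparison
r_builtin <- cor(math_head, read_head)

all.equal(r_manual, r_builtin)
```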

mean(exppp) # average expenditures per pupil

## [1] 5194.865

sd(exppp) # standard deviation of expenditures per pupil

## [1] 1091.89

library(ggplot2)
ggplot(dat3, aes(x=exppp)) + geom_density()

The average expenditure per pupil is about $5,194.87, with a standard deviation of $1,091.89. The density plot shows that spending per pupil covers a wide range, so there are sizable differences in per-pupil spending across schools.
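One way to judge whether the spread is large is the coefficient of variation, the standard deviation expressed as a fraction of the mean. The sketch below uses the two values printed above rather than reloading the data; the variable names are illustrative.

```r
# Values copied from the printed output above
mean_exppp <- 5194.865
sd_exppp   <- 1091.89

# Coefficient of variation: standard deviation relative to the mean
cv_exppp <- sd_exppp / mean_exppp
round(cv_exppp * 100, 1)   # in percent
```

The standard deviation is about 21% of the mean. Whether that counts as a large difference is a judgment call, but it means a school one standard deviation above the mean spends over $1,000 more per pupil than the average school.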

Vu Hong Ha

2023-09-23
