Today, we continue to learn how to navigate datasets in RStudio. This will be a larger dataset (N = 1000) from the real world. We’ll also learn how to calculate basic summary and descriptive statistics. And finally, we’ll create simple plots to look at our data.
You read this dataset last week as part of the end of class exercise. Today’s lab will dig deeper into this dataset. First, we load the dataset:
nc.data<-read.csv("./nc.csv")
The table below describes the dataset. It shows us what each variable means.
| Variable | Description |
|---|---|
| fage | Father’s age in years. |
| mage | Mother’s age in years. |
| mature | Maturity status of mother. |
| weeks | Length of pregnancy in weeks. |
| premie | Whether the birth was classified as premature (premie) or full-term. |
| visits | Number of hospital visits during pregnancy. |
| marital | Whether a mother is married or not married at birth. |
| gained | Weight gained by mother during pregnancy in pounds. |
| weight | Weight of the baby at birth in pounds. |
| lowbirthweight | Whether the baby was classified as low birthweight (low) or not (not low). |
| gender | Gender of the baby, female or male. |
| habit | Status of the mother as a nonsmoker or a smoker. |
| whitemom | Whether mom is white or not white. |
As always, you want to make sure your data looks good in R. So let’s run the three commands we learned about last week:
head(nc.data) gives us the first 6 rows; head(nc.data, 5) gives us the first 5 rows.
head(nc.data)
Missing values are shown as NA. This is important to keep in mind.
str(nc.data)
'data.frame': 1000 obs. of 13 variables:
$ fage : int NA NA 19 21 NA NA 18 17 NA 20 ...
$ mage : int 13 14 15 15 15 15 15 15 16 16 ...
$ mature : Factor w/ 2 levels "mature mom","younger mom": 2 2 2 2 2 2 2 2 2 2 ...
$ weeks : int 39 42 37 41 39 38 37 35 38 37 ...
$ premie : Factor w/ 2 levels "full term","premie": 1 1 1 1 1 1 1 2 1 1 ...
$ visits : int 10 15 11 6 9 19 12 5 9 13 ...
$ marital : Factor w/ 2 levels "married","not married": 1 1 1 1 1 1 1 1 1 1 ...
$ gained : int 38 20 38 34 27 22 76 15 NA 52 ...
$ weight : num 7.63 7.88 6.63 8 6.38 5.38 8.44 4.69 8.81 6.94 ...
$ lowbirthweight: Factor w/ 2 levels "low","not low": 2 2 2 2 2 1 2 1 2 2 ...
$ gender : Factor w/ 2 levels "female","male": 2 2 1 2 1 2 2 2 2 1 ...
$ habit : Factor w/ 2 levels "nonsmoker","smoker": 1 1 1 1 1 1 1 1 1 1 ...
$ whitemom : Factor w/ 2 levels "not white","white": 1 1 2 2 1 1 1 1 2 2 ...
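The Factor columns in the str() output above depend on how the file was read. Since R 4.0.0, read.csv() keeps text columns as character unless you ask for factors, so on a recent version of R you may need stringsAsFactors = TRUE to get the same output. Here is a minimal sketch of the difference. It uses a tiny made-up CSV passed in through the text argument, not the real nc.csv, and the names csv_text, as_char and as_fact are only for illustration:

```r
# Made-up three-row CSV standing in for nc.csv (illustration only)
csv_text <- "mage,habit
25,nonsmoker
31,smoker
19,nonsmoker"

# Default behaviour in R >= 4.0.0: text columns stay as character
as_char <- read.csv(text = csv_text)
class(as_char$habit)

# Ask for factors explicitly, which matches the str() output in this lab
as_fact <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(as_fact$habit)
levels(as_fact$habit)
```

For the real file, the same argument would go into the original call: read.csv("./nc.csv", stringsAsFactors = TRUE).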
summary(nc.data)
fage mage mature weeks premie visits marital
Min. :14.00 Min. :13 mature mom :133 Min. :20.00 full term:846 Min. : 0.0 married :386
1st Qu.:25.00 1st Qu.:22 younger mom:867 1st Qu.:37.00 premie :152 1st Qu.:10.0 not married:613
Median :30.00 Median :27 Median :39.00 NA's : 2 Median :12.0 NA's : 1
Mean :30.26 Mean :27 Mean :38.33 Mean :12.1
3rd Qu.:35.00 3rd Qu.:32 3rd Qu.:40.00 3rd Qu.:15.0
Max. :55.00 Max. :50 Max. :45.00 Max. :30.0
NA's :171 NA's :2 NA's :9
gained weight lowbirthweight gender habit whitemom
Min. : 0.00 Min. : 1.000 low :111 female:503 nonsmoker:873 not white:284
1st Qu.:20.00 1st Qu.: 6.380 not low:889 male :497 smoker :126 white :714
Median :30.00 Median : 7.310 NA's : 1 NA's : 2
Mean :30.33 Mean : 7.101
3rd Qu.:38.00 3rd Qu.: 8.060
Max. :85.00 Max. :11.750
NA's :27
For numeric or integer variables, we can summarise the values using two numbers: the mean and the median. Let’s calculate both for the variable mage, which is the mother’s age:
mean(nc.data$mage)
[1] 27
median(nc.data$mage)
[1] 27
Interesting. The mean and median are equal. What about the father’s age?
mean(nc.data$fage)
[1] NA
median(nc.data$fage)
[1] NA
Oh wait, why do we get NAs? Remember, sometimes you have to tell R what to do with missing values. Otherwise R might fail to perform a command. If you look at our summary(nc.data) output, you’ll see NA's :171 for the variable fage. So let’s tell R to ignore the NAs:
mean(nc.data$fage, na.rm = TRUE)
[1] 30.25573
median(nc.data$fage, na.rm = TRUE)
[1] 30
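To see what na.rm does on a small scale, here is a minimal sketch with a made-up vector called ages (it is not part of the nc data). It also shows one way to count the missing values yourself. The argument is spelled out as TRUE because T is an ordinary variable that can be overwritten, while TRUE cannot.

```r
# Made-up ages with two missing values (illustration only)
ages <- c(19, 21, NA, 30, NA, 25)

mean(ages)                 # NA, because the missing values are included
mean(ages, na.rm = TRUE)   # 23.75, computed from the four known values
sum(is.na(ages))           # 2 missing values
```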
Now we get some numbers. But keep in mind that we have some missing values for this variable. This might have an impact on our analysis later. Anyway, for the father’s age, the mean and median are not the same, and they don’t really have to be. But if they are so close, why even use the two? Here’s why. The variable example here is created using 11 numbers: the numbers 1 through 10 and 1000.
example<-c(1:10, 1000)
example
[1] 1 2 3 4 5 6 7 8 9 10 1000
Now look at the mean and median:
mean(example)
[1] 95.90909
median(example)
[1] 6
Which one would you say better indicates the values of our variable? The mean or the median? Of course, the median. There is one extreme value (1000) which is making the mean so high. The median, on the other hand, is not affected by this extreme value. Let’s demonstrate this further:
example<-c(1:10, 1000000)
example
[1] 1 2 3 4 5 6 7 8 9 10 1000000
mean(example)
[1] 90914.09
median(example)
[1] 6
Increasing the one extreme value also increases the mean. But the median stays the same. The median is not susceptible to extreme values, which often makes it a better summary of a variable when the data contain a few very large or very small values. Can you think of a real-life example where using means can be misleading?
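It can help to see why the two numbers behave so differently. The sketch below recomputes them by hand for the first version of example: the mean uses every value in a sum, so one huge number drags it up, while the median only looks at the value in the middle of the sorted list. The shortcut used for the middle position assumes an odd number of values, as is the case here.

```r
example <- c(1:10, 1000)

# Mean: add everything up and divide by how many values there are
sum(example) / length(example)     # same result as mean(example)

# Median: sort the values and take the one in the middle
sorted <- sort(example)
sorted[(length(sorted) + 1) / 2]   # position 6 of 11, same as median(example)
```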
Ok, so how do we calculate means and medians for individual variables? We use the $ sign, which, when typed after the dataset name, allows us to access the variables in the dataset (i.e., the columns of the dataset). Once you type the $ after the dataset name, press the Tab key to select from the list of available columns (variables).
Once you select a variable, you can use it as an input to a function:
mean(nc.data$mage)
[1] 27
We can do the same for the median:
median(nc.data$fage, na.rm = TRUE)
[1] 30
Remember, the dataset is just a table with rows and columns. We can access anything in this table: entire rows, entire columns, or a single value at a specific row and column. The command below gives us the first row:
nc.data[1,]
And the command nc.data[,1] gives us the first column:
nc.data[,1]
[1] NA NA 19 21 NA NA 18 17 NA 20 30 NA NA NA 21 NA 14 16 20 18 NA 20 20 NA 26 NA NA 20 31 17 NA 19 20 17 19 18 18 NA 28 28 20 NA 25 20 23 21 21 24 20 20 21 21 25 20
[55] 21 NA 18 20 18 21 NA 22 NA 23 20 24 20 NA 25 24 NA 22 23 21 NA NA 21 31 20 NA 21 22 NA NA 23 19 24 20 21 24 20 NA 21 32 21 21 NA 19 NA NA 25 25 NA NA NA 27 NA NA
[109] 19 26 22 21 17 NA 21 21 29 NA 21 20 22 24 21 20 23 25 25 21 21 18 21 24 NA 21 21 31 24 20 21 18 NA 36 22 22 20 23 25 18 28 NA 22 NA NA 23 38 21 22 22 33 22 24 26
[163] 20 28 NA NA NA NA 28 19 NA 19 NA NA 22 22 22 35 24 NA 25 32 24 28 27 22 22 18 22 20 23 NA 26 24 24 29 22 21 25 22 18 22 23 25 NA NA 26 NA 35 25 22 NA NA 21 NA 21
[217] NA 24 NA NA 27 NA 21 NA 21 NA 28 34 24 22 28 35 24 26 25 24 26 31 26 27 26 22 24 NA 23 26 NA NA 26 23 NA 28 28 26 NA 22 23 21 NA 22 21 21 30 23 24 NA 21 22 29 NA
[271] 23 27 24 27 24 30 NA 21 25 NA 22 NA 27 25 25 31 28 26 41 NA 26 31 24 25 26 29 24 NA 28 28 24 20 NA 28 23 22 36 31 NA NA 26 23 27 26 NA 23 27 25 24 22 NA 20 NA 26
[325] NA NA NA 28 NA 25 23 22 21 26 44 26 NA 26 43 26 21 27 25 28 24 45 NA 26 27 27 25 NA 30 27 32 NA 29 30 22 NA 29 32 NA 24 27 22 28 25 47 46 NA NA 28 34 25 28 19 28
[379] 20 NA 26 30 32 NA NA NA NA 32 29 NA NA 24 NA NA 26 30 NA 23 24 32 32 NA 29 23 24 NA 26 25 36 NA 31 35 27 NA 28 NA 27 NA 24 27 34 32 29 29 24 25 NA NA 27 30 25 26
[433] 26 34 NA 23 21 25 NA 33 NA 24 NA 23 30 27 28 28 33 30 35 28 29 30 32 25 NA 33 28 29 30 29 27 NA 29 NA 26 27 28 25 32 29 30 NA NA 32 23 25 30 NA 32 26 NA 25 NA 42
[487] 22 NA 29 31 27 29 25 35 NA 29 28 33 26 28 28 30 31 27 32 37 30 40 31 30 31 33 45 28 32 36 30 31 NA NA 39 NA 33 32 44 35 27 37 NA NA 34 39 24 31 27 30 33 30 NA 34
[541] 27 33 28 NA 28 36 NA 27 NA 25 28 30 28 30 33 28 33 30 45 31 27 24 33 30 23 35 46 34 28 29 33 NA 31 29 24 35 28 34 28 33 NA 30 23 31 37 26 NA 25 34 32 33 34 37 31
[595] NA 30 28 28 31 29 30 31 27 45 34 28 30 37 31 30 31 28 42 30 35 34 32 NA 30 29 28 33 30 32 29 29 30 NA 30 34 26 22 34 40 36 35 34 31 27 33 33 29 30 36 NA 23 NA 24
[649] 31 34 33 29 33 31 31 34 31 33 30 25 33 32 30 41 36 33 NA 29 34 NA 35 29 31 33 31 31 29 34 36 NA 30 NA NA 32 32 31 30 34 38 35 30 28 26 35 34 35 27 42 32 NA 40 35
[703] 32 34 38 NA NA 31 31 34 35 NA 38 31 36 34 32 35 35 35 30 32 33 32 28 32 38 34 35 NA 43 37 30 32 32 32 NA 32 29 33 NA 33 30 NA 33 28 29 33 36 35 26 31 34 31 44 34
[757] 32 31 30 40 35 35 36 34 36 35 38 27 NA 33 35 32 45 42 32 NA 23 35 35 33 35 NA 32 42 32 31 38 33 36 36 36 36 27 34 31 34 33 35 31 42 33 35 35 33 33 40 32 31 34 38
[811] 37 27 34 32 38 34 35 48 34 34 37 37 34 25 32 35 37 40 38 38 35 35 34 33 35 35 37 34 33 36 32 33 35 35 40 36 40 36 34 22 38 31 34 40 30 26 35 35 30 34 39 38 38 38
[865] 38 NA 34 38 43 30 34 39 38 NA 36 39 36 34 39 37 33 32 NA 35 41 35 35 42 34 32 48 32 36 33 NA 32 38 31 38 37 42 35 37 37 34 39 41 30 32 40 37 42 33 39 33 42 30 37
[919] 33 37 36 39 35 32 40 34 28 41 37 34 41 36 38 39 37 35 NA 41 44 42 37 38 37 36 39 36 38 47 37 42 34 43 37 NA NA 41 38 45 38 37 39 44 42 NA 40 44 40 42 26 36 NA 41
[973] 39 43 38 NA 46 53 40 34 43 41 41 44 42 33 46 NA 48 50 42 43 40 34 NA 47 34 39 55 45
Important: Notice that the row number is placed before the comma and the column number after the comma inside the brackets: row [1,], column [,1]. This is very important. In R, everything is rows and columns, so you want to keep this notation in mind.
You can also use the notation below to access a specific value:
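The row-before-comma, column-after-comma rule is easier to see on a table small enough to read at a glance. The sketch below builds a made-up data frame called toy from the first three values of mage and weeks shown in the str() output earlier; toy itself is not part of the lab data.

```r
# Small data frame with 3 rows and 2 columns (illustration only)
toy <- data.frame(mage = c(13, 14, 15), weeks = c(39, 42, 37))

toy[1, ]    # first row, all columns
toy[, 1]    # first column, all rows: 13 14 15
toy[2, 2]   # row 2, column 2: 42
dim(toy)    # number of rows, then number of columns: 3 2
```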
nc.data$mage[1]
[1] 13
How do we read this line of code? We are telling R to take row 1 of the column named mage of the dataset named nc.data. If you refer to the column using the $ sign, then all you need to do is specify the row number. Let’s try a few things. (The first command prints all 1000 values of mage, so its output is not shown here.)
nc.data$mage
nc.data$mage[1]
[1] 13
nc.data$mage[5]
[1] 15
nc.data$mage[2:5]
[1] 14 15 15 15
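One thing to watch for with the $ sign: R does not raise an error when a column name is misspelled. It quietly returns NULL instead. Here is a minimal sketch using a made-up data frame called toy that holds the first five values of mage from the str() output:

```r
# Small data frame with one column (illustration only)
toy <- data.frame(mage = c(13, 14, 15, 15, 15))

toy$mage[1]      # 13
toy$mage[2:5]    # 14 15 15 15
toy$age          # NULL: there is no column called "age"

# List the real column names, or test for one, before using it
names(toy)
"age" %in% names(toy)    # FALSE
```

So if a command returns NULL when you expected numbers, checking the spelling of the column name with names() is a good first step.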
So far, we’ve learned how to read a dataset and how to access rows, columns, and individual data points. We also know how to calculate the mean and the median. Now we’ll learn how to calculate the mean and median for different groups. But before that, let us learn one other super useful command for distributions. Run this command:
table(nc.data$premie)
full term premie
846 152
The table function gives us a frequency table of the different values/cases. If the variable is a factor, then the table gives us the number of times each level of that factor is present in the dataset. So here, for example, there are 152 cases of premie and 846 cases of full term. Try the command below:
table(nc.data$visits)
0 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 22 23 24 25 26 30
7 5 8 12 30 26 17 49 39 132 75 143 78 100 139 47 16 34 4 18 3 1 1 1 1 5
Oh, what’s this? We gave a numeric variable to the table function. But table is so nice that it still gives us a frequency table of the values in the visits variable. So how many mothers never visited the hospital during their pregnancy? 7. How many mothers went once? None. How many mothers went 14 times? 100. And five mothers actually visited the hospital 30 times.
So what if we are interested in comparing the mean age of mothers who gave birth at full term and those who had premature babies? Basically, we are interested in the mean age split by the premie variable:
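Note that table() leaves out missing values by default, which is why the premie counts above add up to 998 rather than 1000. The sketch below uses a made-up factor called status (not the lab data) to show how to include the NAs with the useNA argument, and how prop.table() turns counts into proportions.

```r
# Made-up values with one missing entry (illustration only)
status <- factor(c("full term", "premie", "full term", NA, "full term"))

table(status)                     # counts only the non-missing values
table(status, useNA = "ifany")    # adds a column for NA when there are any
prop.table(table(status))         # proportions instead of counts: 0.75 and 0.25
```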
by(nc.data$mage, nc.data$premie, mean)
nc.data$premie: full term
[1] 27
-------------------------------------------------------------------------------------------------------------------------------
nc.data$premie: premie
[1] 26.875
Amazing, no? We are telling R to take the variable mage, split it by premie, and calculate the mean separately for each group. Let’s try father’s age:
by(nc.data$fage, nc.data$premie, mean)
nc.data$premie: full term
[1] NA
-------------------------------------------------------------------------------------------------------------------------------
nc.data$premie: premie
[1] NA
Oh, we got NAs. Why? Because, remember, some fathers’ ages are missing. So we have to tell R what to do. What if we told R to ignore the NAs?
by(nc.data$fage, nc.data$premie, mean, na.rm = TRUE)
nc.data$premie: full term
[1] 30.2423
-------------------------------------------------------------------------------------------------------------------------------
nc.data$premie: premie
[1] 30.31579
There we go. It doesn’t seem like there is much of a difference. Let’s try a few other things:
by(nc.data$mage, nc.data$whitemom, mean)
nc.data$whitemom: not white
[1] 25.33099
-------------------------------------------------------------------------------------------------------------------------------
nc.data$whitemom: white
[1] 27.64986
There is an interesting difference between the ages of mothers who are white and those who are not white. Run the commands below. Try to make sense of the results:
by(nc.data$mage, nc.data$lowbirthweight, mean)
nc.data$lowbirthweight: low
[1] 26.96396
-------------------------------------------------------------------------------------------------------------------------------
nc.data$lowbirthweight: not low
[1] 27.0045
by(nc.data$mage, nc.data$gender, mean)
nc.data$gender: female
[1] 27.05368
-------------------------------------------------------------------------------------------------------------------------------
nc.data$gender: male
[1] 26.94567
by(nc.data$mage, nc.data$habit, mean)
nc.data$habit: nonsmoker
[1] 27.23711
-------------------------------------------------------------------------------------------------------------------------------
nc.data$habit: smoker
[1] 25.24603
by(nc.data$mage, nc.data$marital, mean)
nc.data$marital: married
[1] 23.56218
-------------------------------------------------------------------------------------------------------------------------------
nc.data$marital: not married
[1] 29.14192
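The by() function is not the only way to get group means in base R. As a sketch of two alternatives, tapply() returns the same numbers in a more compact form, and aggregate() takes a formula like the one used for boxplots. The data frame toy below is made up for illustration and is not the nc data.

```r
# Made-up data: father's age with one missing value (illustration only)
toy <- data.frame(
  fage   = c(30, NA, 28, 35, 41),
  premie = factor(c("full term", "full term", "premie", "premie", "full term"))
)

# Same idea as by(): split fage by premie, then take the mean of each group
tapply(toy$fage, toy$premie, mean, na.rm = TRUE)

# Formula interface; aggregate() drops rows with missing values by default
aggregate(fage ~ premie, data = toy, FUN = mean)
```

Which one to use is mostly a matter of taste: by() prints one block per group, while tapply() and aggregate() return results that are easier to reuse in later commands.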
Let’s make some elementary plots:
hist(nc.data$mage)
boxplot(mage ~ habit, data = nc.data)
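Both plotting functions accept optional arguments for a title and axis labels, which makes the plots easier to read. The sketch below shows the idea on a small made-up data frame called toy (ten invented ages, not the nc data); the same main, xlab and ylab arguments can be added to the two commands above.

```r
# Made-up ages and smoking status (illustration only, not the nc data)
toy <- data.frame(
  mage  = c(19, 22, 25, 27, 27, 30, 33, 36, 24, 29),
  habit = factor(rep(c("nonsmoker", "smoker"), each = 5))
)

# Histogram with a title and an axis label
hist(toy$mage, main = "Mother's age", xlab = "Age in years")

# One box per group, with labelled axes
boxplot(mage ~ habit, data = toy,
        main = "Mother's age by smoking habit",
        xlab = "Habit", ylab = "Age in years")
```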