Introductory Statistics (CRN: 6896)



Objective

Today, we continue to learn how to navigate datasets in RStudio. This will be a larger dataset (N = 1000) from the real world. We’ll also learn how to calculate basic summary and descriptive statistics. And finally, we’ll create simple plots to look at our data.


Reading the dataset

You read this dataset last week as part of the end of class exercise. Today’s lab will dig deeper into this dataset. First, we load the dataset:

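# stringsAsFactors = TRUE reads text columns as factors (the default before R 4.0), as in the output below
nc.data <- read.csv("./nc.csv", stringsAsFactors = TRUE)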
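
If you want to see what read.csv does with text columns and missing values before trusting it on the real file, here is a minimal sketch. It writes a tiny made-up file to a temporary location (the file, the object name toy, and its values are assumptions for illustration, not part of nc.csv) and reads it back:

```r
# Hypothetical three-row file written to a temporary location (not nc.csv)
tmp <- tempfile(fileext = ".csv")
writeLines(c("mage,habit",
             "13,nonsmoker",
             "NA,smoker",
             "15,nonsmoker"), tmp)

# stringsAsFactors = TRUE turns text columns into factors
toy <- read.csv(tmp, stringsAsFactors = TRUE)

dim(toy)              # number of rows, then number of columns
class(toy$mage)       # whole numbers are read as integer
class(toy$habit)      # text is read as factor
sum(is.na(toy$mage))  # the entry written as NA is read as a missing value
```

Since R 4.0, read.csv keeps text columns as character unless stringsAsFactors = TRUE is given, so str() and summary() may report chr instead of Factor on a newer installation.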


The table below describes the dataset. It shows us what each variable means.

Variable        Description
fage            Father’s age in years.
mage            Mother’s age in years.
mature          Maturity status of mother.
weeks           Length of pregnancy in weeks.
premie          Whether the birth was classified as premature (premie) or full-term.
visits          Number of hospital visits during pregnancy.
marital         Whether a mother is married or not married at birth.
gained          Weight gained by mother during pregnancy in pounds.
weight          Weight of the baby at birth in pounds.
lowbirthweight  Whether the baby was classified as low birthweight (low) or not (not low).
gender          Gender of the baby, female or male.
habit           Status of the mother as a nonsmoker or a smoker.
whitemom        Whether mom is white or not white.


As always, you want to make sure your data looks good in R. So let’s run the three commands we learned about last week:


  • First few rows of the dataset: By default, head shows us the first 6 rows of the dataset. We can change the number of rows we’d like to see by specifying it after the name of the dataset. The command head(nc.data, 5) gives us the first 5 rows.
head(nc.data)


  • Structure of the dataset: This one tells us what kind of variables we are reading: factors, numeric, integers, etc. It also shows us examples of data points. Note that there are cases where the recorded value is NA (missing). This is important to keep in mind.
str(nc.data)
'data.frame':   1000 obs. of  13 variables:
 $ fage          : int  NA NA 19 21 NA NA 18 17 NA 20 ...
 $ mage          : int  13 14 15 15 15 15 15 15 16 16 ...
 $ mature        : Factor w/ 2 levels "mature mom","younger mom": 2 2 2 2 2 2 2 2 2 2 ...
 $ weeks         : int  39 42 37 41 39 38 37 35 38 37 ...
 $ premie        : Factor w/ 2 levels "full term","premie": 1 1 1 1 1 1 1 2 1 1 ...
 $ visits        : int  10 15 11 6 9 19 12 5 9 13 ...
 $ marital       : Factor w/ 2 levels "married","not married": 1 1 1 1 1 1 1 1 1 1 ...
 $ gained        : int  38 20 38 34 27 22 76 15 NA 52 ...
 $ weight        : num  7.63 7.88 6.63 8 6.38 5.38 8.44 4.69 8.81 6.94 ...
 $ lowbirthweight: Factor w/ 2 levels "low","not low": 2 2 2 2 2 1 2 1 2 2 ...
 $ gender        : Factor w/ 2 levels "female","male": 2 2 1 2 1 2 2 2 2 1 ...
 $ habit         : Factor w/ 2 levels "nonsmoker","smoker": 1 1 1 1 1 1 1 1 1 1 ...
 $ whitemom      : Factor w/ 2 levels "not white","white": 1 1 2 2 1 1 1 1 2 2 ...


  • Summary of the dataset: This one gives us a summary of each variable. For numeric or integer variables, this is the distribution and some basic descriptive statistics. For factors, this is a table of counts for each category (known as levels in R) of that variable.
summary(nc.data)
      fage            mage            mature        weeks             premie        visits            marital   
 Min.   :14.00   Min.   :13   mature mom :133   Min.   :20.00   full term:846   Min.   : 0.0   married    :386  
 1st Qu.:25.00   1st Qu.:22   younger mom:867   1st Qu.:37.00   premie   :152   1st Qu.:10.0   not married:613  
 Median :30.00   Median :27                     Median :39.00   NA's     :  2   Median :12.0   NA's       :  1  
 Mean   :30.26   Mean   :27                     Mean   :38.33                   Mean   :12.1                    
 3rd Qu.:35.00   3rd Qu.:32                     3rd Qu.:40.00                   3rd Qu.:15.0                    
 Max.   :55.00   Max.   :50                     Max.   :45.00                   Max.   :30.0                    
 NA's   :171                                    NA's   :2                       NA's   :9                       
     gained          weight       lowbirthweight    gender          habit          whitemom  
 Min.   : 0.00   Min.   : 1.000   low    :111    female:503   nonsmoker:873   not white:284  
 1st Qu.:20.00   1st Qu.: 6.380   not low:889    male  :497   smoker   :126   white    :714  
 Median :30.00   Median : 7.310                               NA's     :  1   NA's     :  2  
 Mean   :30.33   Mean   : 7.101                                                              
 3rd Qu.:38.00   3rd Qu.: 8.060                                                              
 Max.   :85.00   Max.   :11.750                                                              
 NA's   :27                                                                                  
  • Summary gives us frequencies for factor variables. In the above output, we can see there are 133 mature moms and 867 younger moms.
  • For numeric/integer variables, summary gives us the mean and median, which we’ll talk about below. It also gives us the minimum, the maximum, and the number of NAs. Finally, it gives us the 1st Qu. and 3rd Qu., which are the values that cut off the lowest 25% and the lowest 75% of the values of a variable when it is sorted (roughly, the medians of the lower and upper halves of the data).
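
The NA counts and quartiles that summary reports can also be computed directly. The sketch below uses a small made-up data frame (the name toy and all of its values are assumptions for illustration, not taken from nc.csv):

```r
# Hypothetical toy data (not taken from nc.csv)
toy <- data.frame(
  age    = c(19, 21, 25, 30, 35, NA, 42),
  visits = c(10, NA, 12, 15, NA, 9, 11)
)

# Number of missing values in each column
na.counts <- colSums(is.na(toy))
na.counts

# Quartiles of age: the values below which 25%, 50%, and 75% of the
# sorted, non-missing ages fall
age.quartiles <- quantile(toy$age, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
age.quartiles
```

The middle quartile (50%) is the median, so quantile gives the same number as median(toy$age, na.rm = TRUE).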

Descriptives

For numeric or integer variables, we can summarise the values using two numbers: the mean and the median. Let’s calculate both for the variable mage, which is the mother’s age:

mean(nc.data$mage)
[1] 27
median(nc.data$mage)
[1] 27

Interesting. The mean and median are equal. What about father’s age?

mean(nc.data$fage)
[1] NA
median(nc.data$fage)
[1] NA


Oh wait, why do we get NAs? Remember, sometimes you have to tell R what to do with missing values. Otherwise, functions such as mean and median simply return NA. If you look at our summary(nc.data) output, you’ll see NA's :171 for the variable fage. So let’s tell R to ignore the NAs:

mean(nc.data$fage, na.rm = TRUE)
[1] 30.25573
median(nc.data$fage, na.rm = TRUE)
[1] 30

Now we get some numbers. But keep in mind that we have some missing values for this variable. This might have an impact on our analysis later. Anyway, for father’s age, the mean and median are not the same, and they don’t really have to be. But if they are so close, why even use both? Here’s why. The variable example below is created using 11 numbers: the numbers 1 through 10 and 1000.

example<-c(1:10, 1000)
example
 [1]    1    2    3    4    5    6    7    8    9   10 1000

Now look at the mean and median:

mean(example)
[1] 95.90909
median(example)
[1] 6

Which one would you say better represents the values of our variable? The mean or the median? Of course, the median. There is one extreme value (1000) which is making the mean so high. The median, on the other hand, is not affected by this extreme score. Let’s demonstrate this further:

example<-c(1:10, 1000000)
example
 [1]       1       2       3       4       5       6       7       8       9      10 1000000
mean(example)
[1] 90914.09
median(example)
[1] 6

Increasing the one extreme value also increases the mean, but the median stays the same. The median is not susceptible to extreme values, which often makes it a better number for understanding the distribution of values of a variable. Can you think of a real-life example where using means can be misleading?
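
To see the pattern for several sizes of the extreme value at once, here is a minimal sketch. The object names and the particular extreme values are assumptions chosen for illustration:

```r
# Illustrative values for the one extreme observation
extremes <- c(11, 100, 1000, 1000000)

# For each extreme value, build c(1:10, extreme) and compute both summaries
means   <- sapply(extremes, function(big) mean(c(1:10, big)))
medians <- sapply(extremes, function(big) median(c(1:10, big)))

# One row per extreme value: the mean keeps growing, the median does not move
data.frame(extreme = extremes, mean = means, median = medians)
```

With 11 as the last value there is no extreme score at all, and the mean and the median agree. As the last value grows, only the mean follows it.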


OK, so how do we calculate means and medians for individual variables? We use the $ sign which, when typed after the dataset name, allows us to access the variables in the dataset (i.e., the columns of the dataset). Once you type the $ after the dataset name, press the Tab key and RStudio lists the available columns (variables) for you to select from.

Once you select a variable, you can use it as an input to a function:

mean(nc.data$mage)
[1] 27

We can do the same for the median:

median(nc.data$fage, na.rm = TRUE)
[1] 30
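
Besides $, a column can also be reached with double square brackets and its name in quotes. A minimal sketch on a made-up data frame (toy and its values are assumptions for illustration, not nc.csv):

```r
# Hypothetical toy data frame (not nc.csv)
toy <- data.frame(mage = c(13, 14, 15), fage = c(NA, 19, 21))

names(toy)            # lists the available column names

# These two lines refer to the same column
mean(toy$mage)
mean(toy[["mage"]])

# [[ ]] also accepts a column name stored in a variable
column <- "fage"
mean(toy[[column]], na.rm = TRUE)
```

The $ form is shorter to type and works with Tab completion; the [[ ]] form is handy when the column name is stored in a variable.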


Remember, the dataset is just a table with rows and columns. We can access anything in this table: entire rows, entire columns, or a single value at a specific row and column. The command below gives us the first row:

nc.data[1,]

And the command nc.data[,1] gives us the first column:

nc.data[,1]
   [1] NA NA 19 21 NA NA 18 17 NA 20 30 NA NA NA 21 NA 14 16 20 18 NA 20 20 NA 26 NA NA 20 31 17 NA 19 20 17 19 18 18 NA 28 28 20 NA 25 20 23 21 21 24 20 20 21 21 25 20
  [55] 21 NA 18 20 18 21 NA 22 NA 23 20 24 20 NA 25 24 NA 22 23 21 NA NA 21 31 20 NA 21 22 NA NA 23 19 24 20 21 24 20 NA 21 32 21 21 NA 19 NA NA 25 25 NA NA NA 27 NA NA
 [109] 19 26 22 21 17 NA 21 21 29 NA 21 20 22 24 21 20 23 25 25 21 21 18 21 24 NA 21 21 31 24 20 21 18 NA 36 22 22 20 23 25 18 28 NA 22 NA NA 23 38 21 22 22 33 22 24 26
 [163] 20 28 NA NA NA NA 28 19 NA 19 NA NA 22 22 22 35 24 NA 25 32 24 28 27 22 22 18 22 20 23 NA 26 24 24 29 22 21 25 22 18 22 23 25 NA NA 26 NA 35 25 22 NA NA 21 NA 21
 [217] NA 24 NA NA 27 NA 21 NA 21 NA 28 34 24 22 28 35 24 26 25 24 26 31 26 27 26 22 24 NA 23 26 NA NA 26 23 NA 28 28 26 NA 22 23 21 NA 22 21 21 30 23 24 NA 21 22 29 NA
 [271] 23 27 24 27 24 30 NA 21 25 NA 22 NA 27 25 25 31 28 26 41 NA 26 31 24 25 26 29 24 NA 28 28 24 20 NA 28 23 22 36 31 NA NA 26 23 27 26 NA 23 27 25 24 22 NA 20 NA 26
 [325] NA NA NA 28 NA 25 23 22 21 26 44 26 NA 26 43 26 21 27 25 28 24 45 NA 26 27 27 25 NA 30 27 32 NA 29 30 22 NA 29 32 NA 24 27 22 28 25 47 46 NA NA 28 34 25 28 19 28
 [379] 20 NA 26 30 32 NA NA NA NA 32 29 NA NA 24 NA NA 26 30 NA 23 24 32 32 NA 29 23 24 NA 26 25 36 NA 31 35 27 NA 28 NA 27 NA 24 27 34 32 29 29 24 25 NA NA 27 30 25 26
 [433] 26 34 NA 23 21 25 NA 33 NA 24 NA 23 30 27 28 28 33 30 35 28 29 30 32 25 NA 33 28 29 30 29 27 NA 29 NA 26 27 28 25 32 29 30 NA NA 32 23 25 30 NA 32 26 NA 25 NA 42
 [487] 22 NA 29 31 27 29 25 35 NA 29 28 33 26 28 28 30 31 27 32 37 30 40 31 30 31 33 45 28 32 36 30 31 NA NA 39 NA 33 32 44 35 27 37 NA NA 34 39 24 31 27 30 33 30 NA 34
 [541] 27 33 28 NA 28 36 NA 27 NA 25 28 30 28 30 33 28 33 30 45 31 27 24 33 30 23 35 46 34 28 29 33 NA 31 29 24 35 28 34 28 33 NA 30 23 31 37 26 NA 25 34 32 33 34 37 31
 [595] NA 30 28 28 31 29 30 31 27 45 34 28 30 37 31 30 31 28 42 30 35 34 32 NA 30 29 28 33 30 32 29 29 30 NA 30 34 26 22 34 40 36 35 34 31 27 33 33 29 30 36 NA 23 NA 24
 [649] 31 34 33 29 33 31 31 34 31 33 30 25 33 32 30 41 36 33 NA 29 34 NA 35 29 31 33 31 31 29 34 36 NA 30 NA NA 32 32 31 30 34 38 35 30 28 26 35 34 35 27 42 32 NA 40 35
 [703] 32 34 38 NA NA 31 31 34 35 NA 38 31 36 34 32 35 35 35 30 32 33 32 28 32 38 34 35 NA 43 37 30 32 32 32 NA 32 29 33 NA 33 30 NA 33 28 29 33 36 35 26 31 34 31 44 34
 [757] 32 31 30 40 35 35 36 34 36 35 38 27 NA 33 35 32 45 42 32 NA 23 35 35 33 35 NA 32 42 32 31 38 33 36 36 36 36 27 34 31 34 33 35 31 42 33 35 35 33 33 40 32 31 34 38
 [811] 37 27 34 32 38 34 35 48 34 34 37 37 34 25 32 35 37 40 38 38 35 35 34 33 35 35 37 34 33 36 32 33 35 35 40 36 40 36 34 22 38 31 34 40 30 26 35 35 30 34 39 38 38 38
 [865] 38 NA 34 38 43 30 34 39 38 NA 36 39 36 34 39 37 33 32 NA 35 41 35 35 42 34 32 48 32 36 33 NA 32 38 31 38 37 42 35 37 37 34 39 41 30 32 40 37 42 33 39 33 42 30 37
 [919] 33 37 36 39 35 32 40 34 28 41 37 34 41 36 38 39 37 35 NA 41 44 42 37 38 37 36 39 36 38 47 37 42 34 43 37 NA NA 41 38 45 38 37 39 44 42 NA 40 44 40 42 26 36 NA 41
 [973] 39 43 38 NA 46 53 40 34 43 41 41 44 42 33 46 NA 48 50 42 43 40 34 NA 47 34 39 55 45


Important: Notice where the number is placed relative to the comma inside the brackets. The row number goes before the comma ([1,]) and the column number goes after it ([,1]). This is very important. In R, a dataset is always indexed as rows first and columns second, so you want to keep this notation in mind.
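
The row-then-column rule can be tried out on a small table. The sketch below builds a made-up four-row data frame (the name toy is an assumption; the values mirror the first rows shown in the str output above):

```r
# Hypothetical four-row data frame for practising the [row, column] notation
toy <- data.frame(
  mage   = c(13, 14, 15, 15),
  weeks  = c(39, 42, 37, 41),
  weight = c(7.63, 7.88, 6.63, 8.00)
)

toy[2, ]           # row 2, all columns
toy[, 3]           # all rows, column 3
toy[2, 3]          # the single value in row 2, column 3
toy[1:2, "weeks"]  # rows 1 to 2 of the column named weeks
```

Leaving one side of the comma empty means "everything" on that side, and a column can be given by position or by name.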

You can also use the notation below to access a specific value:

nc.data$mage[1]
[1] 13

How do we read this line of code? We are telling R to take row 1 of the column named mage of the dataset named nc.data. If you refer to the column using the $ sign, then all you need to do is specify the row number. Let’s try a few things.

nc.data$mage      # the whole column, all 1000 values (output not shown)
nc.data$mage[1]
[1] 13
nc.data$mage[5]
[1] 15
nc.data$mage[2:5]
[1] 14 15 15 15
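
One thing to watch for: when a column name is misspelled, $ does not raise an error. It quietly returns NULL. A minimal sketch on a made-up data frame (toy and its values are assumptions for illustration):

```r
# Hypothetical toy data frame with the columns mage and fage
toy <- data.frame(mage = c(13, 14, 15), fage = c(NA, 19, 21))

toy$age                  # no column has this name, so the result is NULL
is.null(toy$age)

"age" %in% names(toy)    # FALSE: the column does not exist
"mage" %in% names(toy)   # TRUE

toy$mage[2:3]            # rows 2 to 3 of the column that does exist
```

So if a command prints NULL where you expected numbers, check the spelling of the column name against names() of the dataset first.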

So far, we’ve learned how to read a dataset and how to access rows, columns, and individual data points. We also know how to calculate the mean and the median. Now we’ll learn how to calculate the mean and median for different groups. But before that, let us learn one other super useful command for distributions. Run this command:

table(nc.data$premie)

full term    premie 
      846       152 

The table function gives us a frequency table of the different values/cases. If the variable is a factor, then the table gives us the number of times each level of that factor is present in the dataset. So here, for example, there are 152 cases of premie and 846 cases of full term. Try the command below:

table(nc.data$visits)

  0   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  22  23  24  25  26  30 
  7   5   8  12  30  26  17  49  39 132  75 143  78 100 139  47  16  34   4  18   3   1   1   1   1   5 


Oh, what’s this? We gave a numeric variable to the table function. But table is so nice that it still gives us a frequency table of the values in the visits variable. So how many mothers never visited the hospital during their pregnancy? 7. How many mothers went once? None. How many mothers went 14 times? 100. And five mothers actually visited the hospital 30 times.
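
Note that the counts from table(nc.data$premie) add up to 998, not 1000: by default, table leaves out the NAs. The sketch below shows how to include them and how to turn counts into proportions, using a made-up factor (the object names and values are assumptions for illustration):

```r
# Hypothetical factor with one missing value
premie <- factor(c("full term", "premie", "full term", NA, "full term"))

table(premie)                             # NA is left out by default

counts <- table(premie, useNA = "ifany")  # adds a count for NA when any exist
counts

shares <- prop.table(table(premie))       # proportions among non-missing values
shares
```

Here prop.table divides each count by the total, so the proportions add up to 1.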


Summary statistics for groups

What if we are interested in comparing the mean age of mothers who gave birth at full term and those who had premature babies? Basically, we are interested in the mean age split by the premie variable:

by(nc.data$mage, nc.data$premie, mean)
nc.data$premie: full term
[1] 27
------------------------------------------------------------------------------------------------------------------------------- 
nc.data$premie: premie
[1] 26.875

Amazing, no? We are telling R to take the variable mage, split it by premie, and calculate the mean separately for each group. Let’s try father’s age:

by(nc.data$fage, nc.data$premie, mean)
nc.data$premie: full term
[1] NA
------------------------------------------------------------------------------------------------------------------------------- 
nc.data$premie: premie
[1] NA

Oh, we got NAs. Why? Because, remember, some fathers’ ages are missing. So we have to tell R what to do. What if we told R to ignore the NAs?

by(nc.data$fage, nc.data$premie, mean, na.rm = TRUE)
nc.data$premie: full term
[1] 30.2423
------------------------------------------------------------------------------------------------------------------------------- 
nc.data$premie: premie
[1] 30.31579

There we go. It doesn’t seem like there is much of a difference. Let’s try a few other things:

by(nc.data$mage, nc.data$whitemom, mean)
nc.data$whitemom: not white
[1] 25.33099
------------------------------------------------------------------------------------------------------------------------------- 
nc.data$whitemom: white
[1] 27.64986

There is an interesting difference between the mean age of mothers who are white and those who are not white. Run the commands below. Try to make sense of the results:

by(nc.data$mage, nc.data$lowbirthweight, mean)
nc.data$lowbirthweight: low
[1] 26.96396
------------------------------------------------------------------------------------------------------------------------------- 
nc.data$lowbirthweight: not low
[1] 27.0045
by(nc.data$mage, nc.data$gender, mean)
nc.data$gender: female
[1] 27.05368
------------------------------------------------------------------------------------------------------------------------------- 
nc.data$gender: male
[1] 26.94567
by(nc.data$mage, nc.data$habit, mean)
nc.data$habit: nonsmoker
[1] 27.23711
------------------------------------------------------------------------------------------------------------------------------- 
nc.data$habit: smoker
[1] 25.24603
by(nc.data$mage, nc.data$marital, mean)
nc.data$marital: married
[1] 23.56218
------------------------------------------------------------------------------------------------------------------------------- 
nc.data$marital: not married
[1] 29.14192
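
Group means like the ones above can also be computed with tapply or aggregate, which return the results in a more compact form than by. This is a minimal sketch on a made-up data frame (toy and its values are assumptions for illustration, not nc.csv):

```r
# Hypothetical toy data with missing ages in both groups
toy <- data.frame(
  fage  = c(30, NA, 25, 40, 35, NA),
  habit = factor(c("nonsmoker", "nonsmoker", "smoker",
                   "nonsmoker", "smoker", "smoker"))
)

# Mean of fage within each level of habit; na.rm is passed on to mean()
group.means <- tapply(toy$fage, toy$habit, mean, na.rm = TRUE)
group.means

# The same calculation with the formula interface; aggregate() drops rows
# with missing values by default
agg <- aggregate(fage ~ habit, data = toy, FUN = mean)
agg
```

tapply returns a named vector with one mean per group, while aggregate returns a small data frame, which is convenient if you want to keep working with the result as a table.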


Plotting

Let’s make some elementary plots:

hist(nc.data$mage)

boxplot(mage ~ habit, data = nc.data) 
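
Both plots accept a title and axis labels, which makes them easier to read. The sketch below uses a small made-up data frame so that it runs on its own (toy and its values are assumptions for illustration); with the lab data you would pass nc.data instead:

```r
# Hypothetical toy data: 20 made-up ages and a smoking status for each
toy <- data.frame(
  mage  = rep(c(18, 22, 25, 27, 27, 30, 33, 35, 40, 45), times = 2),
  habit = factor(rep(c("nonsmoker", "smoker"), each = 10))
)

# Histogram with a title and an x-axis label
h <- hist(toy$mage,
          main = "Mother's age",
          xlab = "Age in years")

# One box per group, with a title and labels on both axes
b <- boxplot(mage ~ habit, data = toy,
             main = "Mother's age by smoking status",
             xlab = "Smoking status",
             ylab = "Age in years")
```

Storing the results in h and b is optional. It only keeps the numbers behind the plots, such as the bar heights in h$counts.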

