Lab number one for the course is a simple exercise for learning how to use STATA. While STATA is quite powerful for performing statistical analyses, its cost may be a limiting factor for many professionals looking to do statistical analysis of data for research and other industries. R and the R Studio Integrated Development Environment offer a reasonable solution as R and R Studio are open source software that are free to use. For this lab, I will show you how to replicate the work done on STATA using R.
R operates using a series of programs stored in libraries. These libraries can be downloaded and installed directly from the R repositories via R Studio by typing in some simple code into the console or your script. Here, I add the libraries that I am going to use for the lab. (Note the comments preceeded by a hashtag, or pound, sign.)
library(tidyverse) # To import data in STATA format
library(Hmisc) # To summarize data
library(haven)
Some project require many more libraries than that, but these two will do for this lab.
Next, we need to bring in the dataset. For this lab, we are going to use the arbuthnot dataset, which is in a .dta format. Not a problem. R can handle it through the use of the haven package that you loaded above. Before you bring in your data, make sure you set the working directory to where the data are stored. Here, we will import the data to a temporary data frame and call it mydata.
mydata <- read_dta("arbuthnot.dta") # Imports the file in quotes. Make sure the working directory is correct.
Let’s do some quick data exploration.
describe(mydata) # Look at the variables quickly and see some of their values
## mydata
##
## 3 Variables 82 Observations
## ---------------------------------------------------------------------------
## year Format:%12.0g
## n missing distinct Info Mean Gmd .05 .10
## 82 0 82 1 1670 27.67 1633 1637
## .25 .50 .75 .90 .95
## 1649 1670 1690 1702 1706
##
## lowest : 1629 1630 1631 1632 1633, highest: 1706 1707 1708 1709 1710
## ---------------------------------------------------------------------------
## boys Format:%12.0g
## n missing distinct Info Mean Gmd .05 .10
## 82 0 79 1 5907 1906 3210 3400
## .25 .50 .75 .90 .95
## 4759 6073 7576 7911 8100
##
## lowest : 2890 3079 3157 3196 3209, highest: 8102 8239 8366 8379 8426
## ---------------------------------------------------------------------------
## girls Format:%12.0g
## n missing distinct Info Mean Gmd .05 .10
## 82 0 80 1 5535 1835 2911 3188
## .25 .50 .75 .90 .95
## 4457 5718 7150 7480 7654
##
## lowest : 2722 2746 2781 2840 2908, highest: 7656 7683 7687 7767 7779
## ---------------------------------------------------------------------------
For the first exercise, we need to find out more about the boys variable in the dataset.
describe(mydata$boys)
## mydata$boys Format:%12.0g
## n missing distinct Info Mean Gmd .05 .10
## 82 0 79 1 5907 1906 3210 3400
## .25 .50 .75 .90 .95
## 4759 6073 7576 7911 8100
##
## lowest : 2890 3079 3157 3196 3209, highest: 8102 8239 8366 8379 8426
The lab exercise asks you to visualize the girls variable with a scatterplot. So let’s do that in R:
plot(mydata$year, mydata$girls) # Creates a basic scatterplot of the "girls" variable. (NOTE: X, Y)
If we wanted to change the color, we can change the color to red:
plot(mydata$year, mydata$girls, col = "red") # Adding the color red to the points.
Is there an apparent trend in the number of girls baptized over the years? How would you describe it? (To ensure that your lab report is comprehensive, be sure to include the code needed to make the plot as well as your written interpretation.) Read all about the plot function.
plot(mydata$year, # The same scatterplot, with some labeling and color to it.
mydata$girls,
type = "p",
main = "Scatterplot of the number of girls baptized per year",
xlab = "YEAR",
ylab = "GIRLS",
col = "red")
lines(lowess(mydata),
col = "blue") # Adds a lowess line. What is that?
5 + 9
## [1] 14
5 / 9
## [1] 0.5555556
5 * 9
## [1] 45
5 - 9
## [1] -4
5^9
## [1] 1953125
sqrt(25) # Square root of "25"
## [1] 5
log(100, base = 10) # Log of 100, using the base 10
## [1] 2
exp(2) # Exponentiate 2
## [1] 7.389056
mydata$total = mydata$boys + mydata$girls # Creates new variable "total" by addings "boys" plus "girls"
plot(mydata$year, # X variable
mydata$total) # Y variable
5218 / 4683 # Ratio of boys to girls baptized in 1629
## [1] 1.114243
mydata$ratio = mydata$boys/mydata$girls
5218 / (5218 + 4683) # Proportion that are newborn boys in 1629
## [1] 0.5270175
mydata$boyratio = mydata$boys / mydata$total
mydata <- mutate(mydata, boyratio = (boys/total)) # A different way of doing it, courtesy of the "tidyverse" package
Now, generate a plot of the proportion of boys born over time. What do you see?
plot(mydata$year, # The same scatterplot, with some labeling and color to it.
mydata$boyratio,
type = "p",
main = "Scatterplot of the proportion of boys born over time",
xlab = "YEAR",
ylab = "PROPORTION OF BIRTHS THAT ARE BOYS",
col = "blue")
mydata$moreboys = mydata$boys > mydata$girls # "TRUE" if there are more boys than girls being born
Before we do exercise 4, we need to load the present data.
present <- read_dta("present.dta")
What years are included in this data set? What are the dimensions of the data frame? What are the variable (column) names?
describe(present)
## present
##
## 3 Variables 74 Observations
## ---------------------------------------------------------------------------
## year Format:%10.0g
## n missing distinct Info Mean Gmd .05 .10
## 74 0 74 1 1976 25 1944 1947
## .25 .50 .75 .90 .95
## 1958 1976 1995 2006 2009
##
## lowest : 1940 1941 1942 1943 1944, highest: 2009 2010 2011 2012 2013
## ---------------------------------------------------------------------------
## boys Format:%10.0g
## n missing distinct Info Mean Gmd .05 .10
## 74 0 74 1 1917502 242448 1441193 1615829
## .25 .50 .75 .90 .95
## 1823071 1988038 2076156 2146859 2179796
##
## lowest : 1211684 1289734 1404587 1435301 1444365
## highest: 2179708 2179960 2184237 2186274 2208071
## ---------------------------------------------------------------------------
## girls Format:%10.0g
## n missing distinct Info Mean Gmd .05 .10
## 74 0 74 1 1825037 234497 1362835 1533097
## .25 .50 .75 .90 .95
## 1731210 1897810 1979778 2046355 2075985
##
## lowest : 1148715 1223693 1330869 1359499 1364631
## highest: 2074824 2078142 2081318 2082052 2108162
## ---------------------------------------------------------------------------
How do these counts compare to Arbuthnot’s? Are they of a similar magnitude?
summary(present$boys)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1211684 1823071 1988038 1917502 2076156 2208071
summary(mydata$boys)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2890 4759 6073 5907 7576 8426
summary(present$girls)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1148715 1731210 1897810 1825037 1979778 2108162
summary(present$boys)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1211684 1823071 1988038 1917502 2076156 2208071
Make a plot that displays the proportion of boys born over time. What do you see? Does Arbuthnot’s observation about boys being born in greater proportion than girls hold up in the U.S.? Include the plot in your response. Hint: You should be able to reuse your code from Ex 3 above, just replace the dataframe name.
present <- mutate(present, total = (boys + girls))
present <- mutate(present, boyratio = (boys/total))
plot(present$year, # The same scatterplot, with some labeling and color to it.
present$boyratio,
type = "p",
main = "Scatterplot of the proportion of boys born over time",
xlab = "YEAR",
ylab = "PROPORTION OF BIRTHS THAT ARE BOYS",
col = "blue")
In what year did we see the most total number of births in the U.S.? Hint: First calculate the totals and save it as a new variable. Then, sort your dataset in descending order based on the total column.
sort(present$total) # This doesn't help much because you have too many observations
## [1] 2360399 2513427 2735456 2794800 2808996 2936860 3136965 3144198
## [9] 3159958 3167788 3258411 3288672 3326632 3333279 3494398 3501564
## [17] 3520959 3535068 3554149 3555970 3559529 3600206 3606274 3612258
## [25] 3629238 3638933 3669141 3680537 3699940 3731386 3750850 3756547
## [33] 3760358 3760561 3809394 3846986 3880894 3891494 3899589 3902120
## [41] 3909510 3932181 3941553 3952767 3952841 3953590 3959417 3999386
## [49] 4000240 4017362 4021726 4025933 4027490 4040958 4047295 4058814
## [57] 4065014 4089950 4098020 4110907 4112052 4130665 4138349 4158212
## [65] 4163090 4167362 4203812 4244796 4247694 4254784 4257850 4265555
## [73] 4268326 4316233
tibble(present$year,present$total) # Again, this doesn't help
## # A tibble: 74 x 2
## `present$year` `present$total`
## <dbl> <dbl>
## 1 1940 2360399
## 2 1941 2513427
## 3 1942 2808996
## 4 1943 2936860
## 5 1944 2794800
## 6 1945 2735456
## 7 1946 3288672
## 8 1947 3699940
## 9 1948 3535068
## 10 1949 3559529
## # … with 64 more rows
max(present$total) # This tells you how many births were the highesst in a year, not what year
## [1] 4316233
present[present$total == max(present$total), "year"] # This does the trick!
## # A tibble: 1 x 1
## year
## <dbl>
## 1 2007
In that last command, we told R to filter data to keep only the highest “total” row. We then told it to return the value of “year” for that corresponding highest “total.”
That’s it. You’re all done with lab 1 using R. Congratulations!