This is the document for Lab1.
For those of you who have not used Stata before, this will probably be the hardest exercise you do! Don’t get discouraged. Stata has a steep “learning curve”, but once you have worked out the basics it will become less frustrating. Help each other, and ask me for help if you get stuck. It is vital that you attempt these exercises if you want to become proficient in using Stata. Form groups of two or three to do these exercises. You should write a do-file which carries out each of the steps from 2 onwards and also create a log-file into which all the output is sent.
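A do-file that also writes a log-file could be organised roughly as follows. This is only a sketch: the file names Lab1.do and Lab1.log are assumptions chosen for illustration, and the placeholder comment stands in for your own commands.

```stata
* Lab1.do: skeleton do-file (file names here are illustrative assumptions)

* Close any log left open by an earlier run; -capture- suppresses the
* error that -log close- would give if no log is open.
capture log close

* Send all subsequent output to a plain-text log, overwriting any old copy.
log using "Lab1.log", replace text

* ... your commands for steps 2 onwards go here ...

* Close the log once all the steps have run.
log close
```

Typing `do Lab1.do` in the Command window runs the whole file from top to bottom, so you can rerun every step after each correction.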
.
. cd "Z:\L14009\Dataset_part_1"
Z:\L14009\Dataset_part_1
.
. insheet using "Z:\L14009\Dataset_part_1\Data lab 1.csv"
(12 vars, 4,290 obs)
.
. assert _N == 4290
.
. label variable er30001 "1968 interview number"
. label variable er30002 "person number"
. label variable er32000 "sex of individual"
. label variable er32022 "# live births to this individual"
. label variable er32049 "last known marital status"
. label variable er30733 "1992 interview number"
. label variable er30734 "sequence number"
. label variable er30735 "relation to head"
. label variable er30736 "age of individual"
. label variable er30748 "years of completed education"
. label variable er30750 "tot labor income"
. label variable er30754 "ann work hrs"
.
.
. describe
Contains data
  obs:         4,290
 vars:            12
 size:        81,510
-------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
er30001         int     %8.0g                 1968 interview number
er30002         int     %8.0g                 person number
er32000         byte    %8.0g                 sex of individual
er32022         byte    %8.0g                 # live births to this individual
er32049         byte    %8.0g                 last known marital status
er30733         int     %8.0g                 1992 interview number
er30734         byte    %8.0g                 sequence number
er30735         byte    %8.0g                 relation to head
er30736         byte    %8.0g                 age of individual
er30748         byte    %8.0g                 years of completed education
er30750         long    %12.0g                tot labor income
er30754         int     %8.0g                 ann work hrs
-------------------------------------------------------------------------------
Sorted by:
     Note: Dataset has changed since last saved.
.
. summarize er30748
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     er30748 |      4,290    14.87249    15.07546          0         99
.
. replace er30748 = . if er30748 == 0 | er30748 == 99
(209 real changes made, 209 to missing)
.
. summarize er30748
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     er30748 |      4,081     12.5533    2.963696          1         17
.
. histogram er30748
(bin=36, start=1, width=.44444444)
.
.
. graph export "Lab1.png", replace
(note: file Lab1.png not found)
(file Lab1.png written in PNG format)
.
Variable   Label
er30001    1968 interview number
er30002    person number
er32000    sex of individual
er32022    # live births to this individual
er32049    last known marital status
er30733    1992 interview number
er30734    sequence number
er30735    relation to head
er30736    age of individual
er30748    years of completed education
er30750    tot labor income
er30754    ann work hrs

5. What is the average number of years of education in the sample?
6. The education variable takes the values 0, 1, 2, ..., 99. It turns out that the values 0 and 99 denote missing values. Change these values into system missing values. In addition, set the variables earnings and hours to missing if education is missing.
7. Now what is the average number of years of education in the sample?
8. Create a table which shows the distribution of years of education across the sample. What proportion of the sample have had exactly 12 years of education?
9. Create a histogram which shows, graphically, the same information. Can you draw the histogram so that it has one bar for each value of the variable?
10. Create a box and whisker plot to show how the distribution of earnings varies by years of education. Remove outliers on earnings which are above the 99th percentile and redraw the plot.
11. Repeat (10), but use the log of earnings instead.

See also Section 2.8 in Cameron & Trivedi (2009) for some more challenging questions.
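One possible way to approach the second part of step 6 and steps 8 to 11 is sketched below. It is not a model answer. It assumes that "earnings" is er30750 (tot labor income) and "hours" is er30754 (ann work hrs), going by the variable labels, and that the values 0 and 99 of er30748 have already been recoded to missing. The names p99 and ln_earn are made up for illustration.

```stata
* Assumption: er30748 (education) already has 0 and 99 set to missing.
* Assumption: earnings = er30750, hours = er30754 (from the variable labels).

* Step 6, second part: blank out earnings and hours where education is missing.
replace er30750 = . if missing(er30748)
replace er30754 = . if missing(er30748)

* Step 8: frequency table; the Percent column gives the share at each value.
tabulate er30748

* Step 9: the -discrete- option draws one bar per distinct value.
histogram er30748, discrete percent

* Step 10: box plot of earnings by years of education.
graph box er30750, over(er30748)

* -summarize, detail- stores the 99th percentile in r(p99); keep it in a
* local macro (hypothetical name p99) so that later commands can reuse it.
summarize er30750, detail
local p99 = r(p99)
graph box er30750 if er30750 <= `p99', over(er30748)

* Step 11: the same plots using log earnings (hypothetical name ln_earn).
* The log is only defined for strictly positive earnings.
generate ln_earn = ln(er30750) if er30750 > 0
graph box ln_earn, over(er30748)
graph box ln_earn if er30750 <= `p99', over(er30748)
```

Because Stata treats missing values as larger than any number, the condition `er30750 <= `p99'` also leaves out observations with missing earnings. Local macros only last for the run of a do-file, so these lines should be run together rather than one at a time.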