Introduction

This is the document for Lab1.

For those of you who have not used Stata before, this will probably be the hardest exercise you do! Don’t get discouraged. Stata has a steep “learning curve”, so once you have worked out the basics it will become less frustrating. Help each other and ask me for help if you get stuck. It is vital that you attempt these exercises if you want to become proficient in using Stata. Form groups of two or three to do these exercises. You should write a do-file which carries out each of the steps from 2 onwards and also create a log file into which all the output is sent.
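As a starting point, a do-file that sends its output to a log file might be laid out roughly as follows. This is only a sketch: the file name lab1.log is a placeholder, not something the lab prescribes, and the display line stands in for the commands you will write for each step.

```stata
* lab1.do : minimal do-file skeleton (lab1.log is a placeholder name)
capture log close                    // close any log left open by an earlier run
clear all
set more off                         // do not pause long output

log using "lab1.log", replace text   // send all output to a plain-text log

* ... commands for step 2 onwards go here ...
display "Lab 1 do-file running"

log close
```

Running the whole do-file from top to bottom each time (rather than typing commands one by one) keeps the log complete and makes your work reproducible.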

Q1. There is a spreadsheet on the L14009 Moodle page called Data Lab 1.xls. Save this spreadsheet as a comma-separated-values (csv) file.

Q2. Read the csv file into Stata using the insheet command.

Q3. Use the assert command to check that there are 4,290 observations in the data.
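To see how reading a csv file and checking the observation count fit together, here is a small sketch that runs without the lab data. The file toy.csv and its three rows are made up for illustration; with the real file you would check for 4,290 observations instead of 3.

```stata
* Build a tiny stand-in csv file (toy.csv and its contents are invented)
clear
tempname fh
file open `fh' using "toy.csv", write text replace
file write `fh' "er30001,er30748" _n
file write `fh' "4,12" _n
file write `fh' "5,99" _n
file write `fh' "6,16" _n
file close `fh'

* Read it in; the first row supplies the variable names
insheet using "toy.csv", comma names clear
* In Stata 13 or later, the newer equivalent would be:
*   import delimited using "toy.csv", clear

* assert is silent when the condition holds and stops the do-file when it fails
assert _N == 3

* count displays the number of observations and stores it in r(N)
count
assert r(N) == 3
```

Because assert produces no output when the condition is true, a do-file that runs past this line has confirmed the count.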



. 
. cd "Z:\L14009\Dataset_part_1"
Z:\L14009\Dataset_part_1

. 
. insheet using "Z:\L14009\Dataset_part_1\Data lab 1.csv"
(12 vars, 4,290 obs)

. 
. assert _N == 4290

. 
. label variable er30001 "1968 interview number"

. label variable er30002 "person number"

. label variable er32000 "sex of individual"

. label variable er32022 "# live births to this individual"

. label variable er32049 "last known marital status"

. label variable er30733 "1992 interview number"

. label variable er30734 "sequence number"

. label variable er30735 "relation to head"

. label variable er30736 "age of individual"

. label variable er30748 "years of cdompleted education"

. label variable er30750 "tot labor income"

. label variable er30754 "ann work hrs"

. 
. 
. describe 

Contains data
  obs:         4,290                          
 vars:            12                          
 size:        81,510                          
-------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
er30001         int     %8.0g                 1968 interview number
er30002         int     %8.0g                 person number
er32000         byte    %8.0g                 sex of individual
er32022         byte    %8.0g                 # live births to this individual
er32049         byte    %8.0g                 last known marital status
er30733         int     %8.0g                 1992 interview number
er30734         byte    %8.0g                 sequence number
er30735         byte    %8.0g                 relation to head
er30736         byte    %8.0g                 age of individual
er30748         byte    %8.0g                 years of cdompleted education
er30750         long    %12.0g                tot labor income
er30754         int     %8.0g                 ann work hrs
-------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.

. 
. summarize er30748 

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     er30748 |      4,290    14.87249    15.07546          0         99

. 
. replace er30748 = . if er30748 == 0 | er30748 == 99 
(209 real changes made, 209 to missing)

. 
. summarize er30748 

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     er30748 |      4,081     12.5533    2.963696          1         17

. 
.  histogram er30748 
(bin=36, start=1, width=.44444444)

. 
. 
. graph export "Lab1.png", replace 
(note: file Lab1.png not found)
(file Lab1.png written in PNG format)

. 

Figure: histogram of er30748, exported above as Lab1.png.
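The log above recodes er30748 but does not set earnings or hours to missing, and it does not tabulate education. A minimal sketch of those operations on toy data is given here. It assumes that earnings is er30750 and hours is er30754, going by the variable labels; the five observations are invented.

```stata
* Toy data: education, earnings, hours (values are invented)
clear
input er30748 er30750 er30754
 0 15000 1800
12 22000 2000
12 18000 1900
16 40000 2100
99 30000 2000
end

* Recode the two missing-value codes to system missing (.)
replace er30748 = . if inlist(er30748, 0, 99)

* Assumption: er30750 is earnings and er30754 is hours
replace er30750 = . if missing(er30748)
replace er30754 = . if missing(er30748)

summarize er30748        // the mean now ignores the recoded observations
tabulate er30748         // frequency, percent and cumulative percent by value

* Share with exactly 12 years, among those with non-missing education
count if er30748 == 12
scalar n12 = r(N)
count if !missing(er30748)
display "share with 12 years = " n12/r(N)
```

The Percent column of tabulate reports the same share, since tabulate leaves out missing values unless the missing option is given.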

Q4. Label the variables with the following variable labels:

Variable   Label
er30001    1968 interview number
er30002    person number
er32000    sex of individual
er32022    # live births to this individual
er32049    last known marital status
er30733    1992 interview number
er30734    sequence number
er30735    relation to head
er30736    age of individual
er30748    years of completed education
er30750    tot labor income
er30754    ann work hrs

Q5. What is the average number of years of education in the sample?

Q6. The education variable takes the values 0, 1, 2, ..., 99. It turns out that the values 0 and 99 denote missing values. Change these values into system missing values. In addition, set the variables earnings and hours to missing if education is missing.

Q7. Now what is the average number of years of education in the sample?

Q8. Create a table which shows the distribution of years of education across the sample. What proportion of the sample have had exactly 12 years of education?

Q9. Create a histogram which shows, graphically, the same information. Can you draw the histogram so that it has one bar for each value of the variable?

Q10. Create a box and whisker plot to show how the distribution of earnings varies by years of education. Remove outliers on earnings which are above the 99th percentile and redraw the plot.

Q11. Repeat Q10, but use the log of earnings instead.

See also Section 2.8 in Cameron & Trivedi (2009) for some more challenging questions.
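For the graphing steps, the sketch below shows one possible way to use the relevant commands. It uses simulated data, so the variable names educ and earn, the seed and the numbers are all invented; treat it as a hint about syntax rather than a model answer.

```stata
* Simulated stand-in data (educ and earn are hypothetical names)
clear
set obs 200
set seed 12345
generate educ = 8 + floor(10*runiform())               // whole years, 8 to 17
generate earn = exp(9 + 0.08*educ + rnormal(0, 0.6))   // strictly positive

* One bar per distinct value of the variable
histogram educ, discrete

* Box plot of earnings within each education level
graph box earn, over(educ)

* Store the 99th percentile, then redraw without the values above it
summarize earn, detail
scalar p99 = r(p99)
graph box earn if earn <= p99, over(educ)

* Log earnings; ln() returns missing for zero or negative values
generate ln_earn = ln(earn)
graph box ln_earn if earn <= p99, over(educ)
```

Restricting the plot with an if condition leaves the data unchanged, which is one way to "remove" outliers without dropping observations. Note that real earnings data may contain zeros, which become missing once logged.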