Objective

Today you will:

You may only estimate: \[Y_i = \beta_0 + \beta_1 X_i + u_i\]

No multiple regression (yet!). Please also avoid using variables listed as factors or categorical variables in your regressions; we have only covered continuous variables in class so far.
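
One way to see which variables R stores as factors, before settling on Y and X, is sketched below. It uses the DoctorVisits data from Step 1 purely as an illustration and assumes the AER package is already installed; the object names `is_fac`, `factor_vars`, and `numeric_vars` are made up for this example.

```r
# Load one dataset directly from the AER package (assumes AER is installed)
data("DoctorVisits", package = "AER")

# TRUE for every column that R stores as a factor
is_fac <- sapply(DoctorVisits, is.factor)

# Variables to avoid for now (factors) and candidates for Y and X (numeric)
factor_vars  <- names(DoctorVisits)[is_fac]
numeric_vars <- names(DoctorVisits)[!is_fac]

print(factor_vars)
print(numeric_vars)
```

The same three lines work for any of the other datasets once the data frame name is swapped.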

Step 1: Dataset Selection

Please select one of the following datasets. A brief description of the variables and the R code to load the data are provided below for each dataset:

Australian Health Service Utilization Data

Cross-section data originating from the 1977–1978 Australian Health Survey.

Variables:

  • visits: Number of doctor visits in past 2 weeks.
  • gender: Factor indicating gender.
  • age: Age in years divided by 100.
  • income: Annual income in tens of thousands of dollars.
  • illness: Number of illnesses in past 2 weeks.
  • reduced: Number of days of reduced activity in past 2 weeks due to illness or injury.
  • health: General health questionnaire score using Goldberg’s method.
  • private: Factor. Does the individual have private health insurance?
  • freepoor: Factor. Does the individual have free government health insurance due to low income?
  • freerepat: Factor. Does the individual have free government health insurance due to old age, disability or veteran status?
  • nchronic: Factor. Is there a chronic condition not limiting activity?
  • lchronic: Factor. Is there a chronic condition limiting activity?

library(pacman)
p_load(AER)
data("DoctorVisits")
head(DoctorVisits)
##   visits gender  age income illness reduced health private freepoor freerepat
## 1      1 female 0.19   0.55       1       4      1     yes       no        no
## 2      1 female 0.19   0.45       1       2      1     yes       no        no
## 3      1   male 0.19   0.90       3       0      0      no       no        no
## 4      1   male 0.19   0.15       1       0      0      no       no        no
## 5      1   male 0.19   0.45       2       5      1      no       no        no
## 6      1 female 0.19   0.35       5       1      9      no       no        no
##   nchronic lchronic
## 1       no       no
## 2       no       no
## 3       no       no
## 4       no       no
## 5      yes       no
## 6      yes       no

Cost Function of Electricity Producers 1970

Cross-section data, at the firm level, on electric power generation.

Variables:

  • cost: Total cost.
  • output: Total output.
  • labor: Wage rate.
  • laborshare: Cost share for labor.
  • capital: Capital price index.
  • capitalshare: Cost share for capital.
  • fuel: Fuel price.
  • fuelshare: Cost share for fuel.

library(pacman)
p_load(AER)
data("Electricity1970")
head(Electricity1970)
##      cost output   labor laborshare capital capitalshare   fuel fuelshare
## 1  0.2130      8 6869.47     0.3291  64.945       0.4197 18.000    0.2512
## 4  3.0427    869 8372.96     0.1030  68.227       0.2913 21.067    0.6057
## 5  9.4059   1412 7960.90     0.0891  40.692       0.1567 41.530    0.7542
## 14 0.7606     65 8971.89     0.2802  41.243       0.1282 28.539    0.5916
## 15 2.2587    295 8218.40     0.1772  71.940       0.1623 39.200    0.6606
## 16 1.3422    183 5063.49     0.0960  74.430       0.2629 35.510    0.6411

US General Social Survey 1974–2002

Cross-section data for 9120 women taken from every fourth year of the US General Social Survey between 1974 and 2002 to investigate the determinants of fertility.

Variables:

  • kids: Number of children. This is coded as a numerical variable, but note that the value 8 actually encompasses 8 or more children.
  • age: Age of respondent.
  • education: Highest year of school completed.
  • year: GSS year for respondent.
  • siblings: Number of brothers and sisters.
  • agefirstbirth: Woman’s age at birth of first child.
  • ethnicity: Factor indicating ethnicity. Is the individual Caucasian (“cauc”) or not (“other”)?
  • city16: Factor. Did the respondent live in a city (with population \(>\) 50,000) at age 16?
  • lowincome16: Factor. Was the income below average at age 16?
  • immigrant: Factor. Was the respondent (or both parents) born abroad?

library(pacman)
p_load(AER)
data("GSS7402")
head(GSS7402)
##   kids age education year siblings agefirstbirth ethnicity city16 lowincome16
## 1    0  25        14 2002        1            NA      cauc     no          no
## 2    1  30        13 2002        4            19      cauc    yes          no
## 3    1  55         2 2002        1            27      cauc     no          no
## 4    2  57        16 2002        1            22      cauc     no          no
## 5    2  71        12 2002        6            29      cauc    yes          no
## 6    0  19        13 2002        1            NA     other    yes          no
##   immigrant
## 1        no
## 2        no
## 3       yes
## 4        no
## 5        no
## 6        no
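
The `NA` entries for agefirstbirth in the output above matter for estimation: by default `lm()` drops any row with a missing value in the variables it uses, so the number of observations in a regression can be smaller than the 9120 rows in the data. A minimal sketch of checking this ahead of time, assuming the AER package is installed (the names `na_counts` and `complete_rows` are illustrative):

```r
# Load the GSS data directly from the AER package (assumes AER is installed)
data("GSS7402", package = "AER")

# Number of missing values in each column
na_counts <- colSums(is.na(GSS7402))
print(na_counts)

# Rows where both variables of a candidate regression are observed,
# here kids (Y) and agefirstbirth (X)
complete_rows <- complete.cases(GSS7402[, c("kids", "agefirstbirth")])
cat("Rows available for estimation:", sum(complete_rows), "\n")
```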

Medicaid Utilization Data

Cross-section data originating from the 1986 Medicaid Consumer Survey.

Variables:

  • visits: Number of doctor visits.
  • exposure: Length of observation period for ambulatory care (days).
  • children: Total number of children in the household.
  • age: Age of the respondent.
  • income: Annual household income (average of income range in million USD).
  • health1: The first principal component (divided by 1000) of three health-status variables: functional limitations, acute conditions, and chronic conditions.
  • health2: The second principal component (divided by 1000) of three health-status variables: functional limitations, acute conditions, and chronic conditions.
  • access: Availability of health services (0 = low access, 1 = high access).
  • married: Factor. Is the individual married?
  • gender: Factor indicating gender.
  • ethnicity: Factor indicating ethnicity (“cauc” or “other”).
  • school: Number of years completed in school.
  • enroll: Factor. Is the individual enrolled in a demonstration program?
  • program: Factor indicating the managed care demonstration program: Aid to Families with Dependent Children (“afdc”) or non-institutionalized Supplemental Security Income (“ssi”).

library(pacman)
p_load(AER)
data("Medicaid1986")
head(Medicaid1986)
##   visits exposure children age income health1 health2 access married gender
## 1      0      100        1  24 14.500   0.495  -0.854   0.50      no female
## 2      1       90        3  19  6.000   0.520  -0.969   0.17      no female
## 3      0      106        4  17  8.377  -1.227   0.317   0.42      no female
## 4      0      114        2  29  6.000  -1.524   0.457   0.33      no female
## 5     11      115        1  26  8.500   0.173  -0.599   0.67      no female
## 6      3      102        1  22  6.000  -0.905   0.062   0.25      no female
##   ethnicity school enroll program
## 1      cauc     13    yes    afdc
## 2      cauc     11    yes    afdc
## 3      cauc     12    yes    afdc
## 4      cauc     12    yes    afdc
## 5      cauc     16    yes    afdc
## 6     other     12    yes    afdc

Determinants of Murder Rates in the United States

Cross-section data on states in 1950.

Variables:

  • rate: Murder rate per 100,000 (FBI estimate, 1950).
  • convictions: Number of convictions divided by number of murders in 1950.
  • executions: Average number of executions during 1946–1950 divided by convictions in 1950.
  • time: Median time served (in months) of convicted murderers released in 1951.
  • income: Median family income in 1949 (in 1,000 USD).
  • lfp: Labor force participation rate in 1950 (in percent).
  • noncauc: Proportion of population that is non-Caucasian in 1950.
  • southern: Factor indicating region.

library(pacman)
p_load(AER)
data("MurderRates")
head(MurderRates)
##    rate convictions executions time income  lfp noncauc southern
## 1 19.25       0.204      0.035   47   1.10 51.2   0.321      yes
## 2  7.53       0.327      0.081   58   0.92 48.5   0.224      yes
## 3  5.66       0.401      0.012   82   1.72 50.8   0.127       no
## 4  3.21       0.318      0.070  100   2.18 54.4   0.063       no
## 5  2.80       0.350      0.062  222   1.75 52.4   0.021       no
## 6  1.41       0.283      0.100  164   2.26 56.7   0.027       no

Step 2: Formulate a Research Question

Write your research question clearly: “Is there a linear relationship between ______ and ______?”

Define:

Step 3: Create a Scatterplot

Create a scatterplot of your two variables and examine it. Does the relationship look linear? Are there outliers?
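
As a rough illustration, a base-R scatterplot for one possible pairing (murder rate as Y, median family income as X, from the MurderRates data) might look like the sketch below. The choice of variables is only an example, the code assumes the AER package is installed, and `i_max` is an illustrative helper name.

```r
# Load the data directly from the AER package (assumes AER is installed)
data("MurderRates", package = "AER")

# Scatterplot with X on the horizontal axis and Y on the vertical axis
plot(rate ~ income, data = MurderRates,
     xlab = "Median family income in 1949 (1,000 USD)",
     ylab = "Murder rate per 100,000 (1950)",
     main = "Murder rate vs. income, US states")

# Label the observation with the largest Y value to help spot a possible outlier
i_max <- which.max(MurderRates$rate)
text(MurderRates$income[i_max], MurderRates$rate[i_max],
     labels = rownames(MurderRates)[i_max], pos = 4)
```

Labelling axes with units taken from the variable descriptions makes it easier to interpret the slope later.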

Step 4: Estimate the Model

Estimate your regression model. Present the results in a table that reports:
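
A minimal sketch of estimating the simple regression and pulling out the usual pieces of output is given below, again using rate and income from MurderRates as a stand-in for whichever variables you chose. It assumes the AER package is installed; the object names `fit`, `coef_table`, `n_obs`, and `r2` are illustrative.

```r
# Load the data directly from the AER package (assumes AER is installed)
data("MurderRates", package = "AER")

# Estimate rate_i = beta_0 + beta_1 * income_i + u_i by OLS
fit <- lm(rate ~ income, data = MurderRates)

# Coefficient table: estimate, standard error, t statistic, p-value
coef_table <- summary(fit)$coefficients
print(round(coef_table, 3))

# Number of observations used and R-squared
n_obs <- nobs(fit)
r2    <- summary(fit)$r.squared
cat("N =", n_obs, " R-squared =", round(r2, 3), "\n")
```

The formula `rate ~ income` is R's way of writing Y on the left and X on the right; the intercept is included automatically.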

Step 5: Interpret Your Results

In complete sentences:
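
When writing the interpretation, it can help to pull the estimates out of the fitted model and state them with the units from the variable descriptions. The sketch below does this for the illustrative rate-on-income regression from MurderRates; it assumes the AER package is installed, and the sentence wording and object names are only one possible phrasing, not a required template.

```r
# Load the data and fit the illustrative model (assumes AER is installed)
data("MurderRates", package = "AER")
fit <- lm(rate ~ income, data = MurderRates)

# Extract the intercept and slope estimates by name
b0 <- coef(fit)[["(Intercept)"]]
b1 <- coef(fit)[["income"]]

# income is measured in 1,000 USD and rate in murders per 100,000,
# so the slope is the change in rate associated with 1,000 USD more income
slope_sentence <- sprintf(
  "A 1,000 USD higher median family income is associated with a %.2f change in the murder rate per 100,000, on average.",
  b1
)

# The intercept is the predicted rate at income = 0, which may lie outside the data
intercept_sentence <- sprintf(
  "The predicted murder rate at a median family income of zero is %.2f per 100,000.",
  b0
)

cat(slope_sentence, "\n")
cat(intercept_sentence, "\n")
```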

Step 6: Econometric Considerations

Even though we are using simple OLS, we must think carefully.

Consider the following:

Step 7: Share

Share your results with a classmate.

Step 8: Submit Work

Before you leave today, please submit your written worksheet to me and upload your R script to the in-class Canvas assignment for today.