Weekly Discussion 1

Author

Andres Garcia

Data Types & Slope Parameter

Data Set 1: Affairs

The Affairs dataset comes from the AER package and contains survey information on married individuals, including variables such as age, years married, religiousness, occupation rating, and the number of affairs. It lets us examine differences across people at one point in time rather than following the same people over multiple years.

library(AER)      # provides the Affairs data
library(ggplot2)  # provides ggplot() for the plots
data("Affairs")
head(Affairs)
   affairs gender age yearsmarried children religiousness education occupation
4        0   male  37        10.00       no             3        18          7
5        0 female  27         4.00       no             4        14          6
11       0 female  32        15.00      yes             1        12          1
16       0   male  57        15.00      yes             5        18          6
23       0   male  22         0.75       no             2        17          6
29       0 female  32         1.50       no             2        17          5
   rating
4       4
5       4
11      4
16      5
23      3
29      5
summary(Affairs)
    affairs          gender         age         yearsmarried    children 
 Min.   : 0.000   female:315   Min.   :17.50   Min.   : 0.125   no :171  
 1st Qu.: 0.000   male  :286   1st Qu.:27.00   1st Qu.: 4.000   yes:430  
 Median : 0.000                Median :32.00   Median : 7.000            
 Mean   : 1.456                Mean   :32.49   Mean   : 8.178            
 3rd Qu.: 0.000                3rd Qu.:37.00   3rd Qu.:15.000            
 Max.   :12.000                Max.   :57.00   Max.   :15.000            
 religiousness     education       occupation        rating     
 Min.   :1.000   Min.   : 9.00   Min.   :1.000   Min.   :1.000  
 1st Qu.:2.000   1st Qu.:14.00   1st Qu.:3.000   1st Qu.:3.000  
 Median :3.000   Median :16.00   Median :5.000   Median :4.000  
 Mean   :3.116   Mean   :16.17   Mean   :4.195   Mean   :3.932  
 3rd Qu.:4.000   3rd Qu.:18.00   3rd Qu.:6.000   3rd Qu.:5.000  
 Max.   :5.000   Max.   :20.00   Max.   :7.000   Max.   :5.000  
ggplot(Affairs, aes(x = age, y = affairs)) +
  geom_point(alpha = 0.6) +
  labs(
    title = "Affairs Dataset: Number of Affairs vs Age",
    x = "Age",
    y = "Number of Affairs"
  ) +
  theme_minimal()

The plot suggests there is no strong linear relationship between age and the number of affairs: values are scattered across ages, with most observations clustered near zero.

Type of Data

Affairs is cross-sectional data because each row represents a different individual observed at a single point in time. We are comparing different people, not tracking the same people repeatedly across years.

Data Set 2: Gapminder

The gapminder data set contains information for many countries over multiple years, including life expectancy, population, and GDP per capita. Because the same countries appear repeatedly over time, this data set lets us study changes within countries as well as differences across countries.

library(gapminder)  # provides the gapminder data
head(gapminder)
# A tibble: 6 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.
summary(gapminder)
        country        continent        year         lifeExp     
 Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
 Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
 Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
 Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
 Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
 Australia  :  12                  Max.   :2007   Max.   :82.60  
 (Other)    :1632                                                
      pop              gdpPercap       
 Min.   :6.001e+04   Min.   :   241.2  
 1st Qu.:2.794e+06   1st Qu.:  1202.1  
 Median :7.024e+06   Median :  3531.8  
 Mean   :2.960e+07   Mean   :  7215.3  
 3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
 Max.   :1.319e+09   Max.   :113523.1  
                                       
# Drop unused factor levels so only countries in the first 20 rows are listed
table(droplevels(gapminder$country[1:20]))

Afghanistan     Albania 
         12           8 
unique(gapminder$year)
 [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
table(gapminder$country %in% c("United States", "China", "Brazil"), gapminder$year)
       
        1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
  FALSE  139  139  139  139  139  139  139  139  139  139  139  139
  TRUE     3    3    3    3    3    3    3    3    3    3    3    3

In every year the three selected countries are all present (TRUE = 3) alongside the same number of other countries (FALSE = 139), which is consistent with the same countries being observed in each of the 12 years. Therefore, gapminder is panel data because it combines both a cross-sectional dimension (countries) and a time dimension (years).

library(dplyr)  # provides %>% and filter()

selected_countries <- gapminder %>%
  filter(country %in% c("United States", "China", "Brazil"))

ggplot(selected_countries, aes(x = year, y = lifeExp, color = country)) +
  geom_line(linewidth = 1) +
  geom_point() +
  labs(
    title = "Gapminder: Life Expectancy Over Time",
    x = "Year",
    y = "Life Expectancy"
  ) +
  theme_minimal()

In this chart we can observe life expectancy in three different countries over time. As expected, the United States has had the highest life expectancy throughout, with comparatively little change since the 1950s. China has had the largest increase, rising from under 50 years in the 1950s to above 70 by the 2000s.

Type of Data

gapminder is panel data because it follows the same units, which are countries, over multiple years. In other words, it combines a cross-sectional dimension (many countries) with a time-series dimension (repeated years).
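One way to make the panel structure concrete is to check that every unit appears exactly once in every period. The sketch below is a minimal base-R illustration on made-up data; the helper `is_balanced_panel()` and the toy data frames are hypothetical names introduced here and are not part of the gapminder package.

```r
# Hypothetical helper: TRUE when every unit-period combination occurs exactly once.
is_balanced_panel <- function(df, unit, time) {
  counts <- table(df[[unit]], df[[time]])  # rows = units, columns = periods
  all(counts == 1)
}

# Toy panel: 3 units observed in the same 3 years (made-up values).
toy_panel <- data.frame(
  unit  = rep(c("A", "B", "C"), each = 3),
  year  = rep(c(2000, 2005, 2010), times = 3),
  value = c(1.0, 1.2, 1.5, 2.0, 2.1, 2.3, 0.5, 0.7, 0.6)
)

# Dropping one row leaves unit A unobserved in 2000, so the panel is unbalanced.
toy_unbalanced <- toy_panel[-1, ]

balanced_result   <- is_balanced_panel(toy_panel, "unit", "year")
unbalanced_result <- is_balanced_panel(toy_unbalanced, "unit", "year")
print(balanced_result)
print(unbalanced_result)
```

Under the same assumptions, `is_balanced_panel(gapminder, "country", "year")` would apply this check to the country-year data.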

Part 2: Intuition for the Slope Formula

What is covariance?

Covariance tells us whether two variables move together. If x and y tend to increase together, covariance is positive; if one tends to increase while the other decreases, covariance is negative.
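As a small illustration, the sample covariance can be computed by hand and compared with R's built-in `cov()`. The vectors `x` and `y` below are made-up values chosen only to keep the arithmetic easy to follow.

```r
# Made-up example values.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Sample covariance: sum of products of deviations from the means,
# divided by n - 1.
cov_by_hand <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)

cov_by_hand  # 1.5, positive because x and y tend to rise together
cov(x, y)    # built-in function, same value
```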

What is variance?

Variance measures how much a variable spreads out around its mean. A larger variance means the variable has more dispersion, while a smaller variance means the values are more tightly clustered.
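A quick sketch of this idea: the two made-up vectors below share the same mean but differ in how far their values sit from it. The helper `var_by_hand()` is a hypothetical name used here only to spell out the formula.

```r
tight  <- c(4, 5, 6)   # values close to the mean of 5
spread <- c(0, 5, 10)  # same mean, much more dispersion

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
var_by_hand <- function(v) sum((v - mean(v))^2) / (length(v) - 1)

var_by_hand(tight)   # 1
var_by_hand(spread)  # 25
```

Both results agree with the built-in `var()`.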

Why does Cov(y, x) / Var(x) give the slope?

The slope in a simple linear regression measures how much y changes, on average, when x increases by one unit. Covariance tells us how strongly x and y move together, while variance tells us how much x itself varies. Dividing covariance by the variance of x scales that shared movement by the amount of variation in x, which gives the best-fitting slope for the regression line.
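For readers who want the algebra behind this statement, the standard least-squares derivation can be sketched as follows. The notation (estimates with hats, sample means with bars, sample size n) is introduced here for illustration and does not appear in the text above.

```latex
\begin{align*}
&\min_{\beta_0,\,\beta_1} \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right)^2 \\[4pt]
&\text{First-order condition for } \beta_0: \quad
  \hat\beta_0 = \bar y - \hat\beta_1 \bar x \\[4pt]
&\text{First-order condition for } \beta_1: \quad
  \sum_{i=1}^{n} x_i \left(y_i - \hat\beta_0 - \hat\beta_1 x_i\right) = 0 \\[4pt]
&\text{Substituting } \hat\beta_0: \quad
  \sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)
  = \hat\beta_1 \sum_{i=1}^{n} (x_i - \bar x)^2 \\[4pt]
&\hat\beta_1
  = \frac{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}
         {\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar x)^2}
  = \frac{\mathrm{Cov}(y, x)}{\mathrm{Var}(x)}
\end{align*}
```

The factor 1/(n - 1) appears in both the sample covariance and the sample variance, so it cancels in the ratio.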

model <- lm(affairs ~ age, data = Affairs)
summary(model)

Call:
lm(formula = affairs ~ age, data = Affairs)

Residuals:
   Min     1Q Median     3Q    Max 
-2.285 -1.609 -1.270 -0.949 11.051 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.35711    0.48804   0.732   0.4646  
age          0.03382    0.01444   2.342   0.0195 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.287 on 599 degrees of freedom
Multiple R-squared:  0.00907,   Adjusted R-squared:  0.007416 
F-statistic: 5.483 on 1 and 599 DF,  p-value: 0.01953

For this example, I use the Affairs dataset and regress affairs on age.

The results show a small, positive association between age and the number of affairs that is statistically significant at the 5% level (p = 0.0195): each additional year of age is associated with about 0.034 more affairs on average. However, the relationship is very weak. The R-squared of 0.009 means age alone explains less than 1% of the variation in affairs across individuals.
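To put the size of the slope in perspective, the fitted line can be evaluated at two ages. The sketch below hard-codes the rounded coefficients from the regression output above; the ages 30 and 40 and the helper `predicted_affairs()` are arbitrary illustrative choices.

```r
# Rounded estimates copied from the summary(model) output.
intercept <- 0.35711
slope_age <- 0.03382

# Fitted value from the simple regression line.
predicted_affairs <- function(age) intercept + slope_age * age

pred_30 <- predicted_affairs(30)
pred_40 <- predicted_affairs(40)

pred_40 - pred_30  # equals 10 * slope_age, about 0.34
```

A ten-year age gap corresponds to a predicted difference of roughly a third of an affair, which is small next to the residual standard error of 3.287.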

beta1_lm <- coef(model)[2]
beta1_lm
     age 
0.033822 
beta1_formula <- cov(Affairs$affairs, Affairs$age) / var(Affairs$age)
beta1_formula
[1] 0.033822
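The match between the two numbers is not specific to this dataset. As a hedged sketch, the same comparison can be run on simulated data; the seed, sample size, and the linear relationship below are all made-up assumptions.

```r
set.seed(123)  # arbitrary seed so the sketch is reproducible

n <- 200
x <- rnorm(n, mean = 30, sd = 8)
y <- 1 + 0.5 * x + rnorm(n, sd = 3)  # made-up linear relationship plus noise

slope_lm      <- unname(coef(lm(y ~ x))[2])  # slope estimated by lm()
slope_formula <- cov(y, x) / var(x)          # slope from Cov(y, x) / Var(x)

all.equal(slope_lm, slope_formula)  # TRUE, up to floating-point tolerance
```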

Conclusion

In this discussion, I used two datasets with different structures. Affairs is cross-sectional because it compares many individuals at one point in time, while gapminder is panel data because it follows the same countries across multiple years. I also showed that the slope from a simple linear regression is the same as Cov(y, x) / Var(x), which helps explain the intuition behind the regression coefficient.