The Affairs dataset comes from the AER package and contains survey information on married individuals, including variables such as age, years married, religiousness, occupation rating, and the number of affairs. It lets us examine differences across people at one point in time rather than following the same people over multiple years.
library(AER)  # provides the Affairs dataset
data("Affairs")
head(Affairs)
   affairs gender age yearsmarried children religiousness education occupation rating
4        0   male  37        10.00       no             3        18          7      4
5        0 female  27         4.00       no             4        14          6      4
11       0 female  32        15.00      yes             1        12          1      4
16       0   male  57        15.00      yes             5        18          6      5
23       0   male  22         0.75       no             2        17          6      3
29       0 female  32         1.50       no             2        17          5      5
summary(Affairs)
affairs gender age yearsmarried children
Min. : 0.000 female:315 Min. :17.50 Min. : 0.125 no :171
1st Qu.: 0.000 male :286 1st Qu.:27.00 1st Qu.: 4.000 yes:430
Median : 0.000 Median :32.00 Median : 7.000
Mean : 1.456 Mean :32.49 Mean : 8.178
3rd Qu.: 0.000 3rd Qu.:37.00 3rd Qu.:15.000
Max. :12.000 Max. :57.00 Max. :15.000
religiousness education occupation rating
Min. :1.000 Min. : 9.00 Min. :1.000 Min. :1.000
1st Qu.:2.000 1st Qu.:14.00 1st Qu.:3.000 1st Qu.:3.000
Median :3.000 Median :16.00 Median :5.000 Median :4.000
Mean :3.116 Mean :16.17 Mean :4.195 Mean :3.932
3rd Qu.:4.000 3rd Qu.:18.00 3rd Qu.:6.000 3rd Qu.:5.000
Max. :5.000 Max. :20.00 Max. :7.000 Max. :5.000
library(ggplot2)
# Scatter plot of number of affairs against age
ggplot(Affairs, aes(x = age, y = affairs)) +
  geom_point(alpha = 0.6) +
  labs(
    title = "Affairs Dataset: Number of Affairs vs Age",
    x = "Age",
    y = "Number of Affairs"
  ) +
  theme_minimal()
The plot suggests there is no strong linear relationship between age and the number of affairs: values are scattered across ages, with most observations clustered at or near zero.
Type of Data
Affairs is cross-sectional data because each row represents a different individual observed at a single point in time. We are comparing different people, not tracking the same people repeatedly across years.
Data Set 2: Gapminder
The gapminder data set contains information for many countries over multiple years, including life expectancy, population, and GDP per capita. Because the same countries appear repeatedly over time, this data set lets us study changes within countries as well as differences across countries.
library(gapminder)  # provides the gapminder dataset
head(gapminder)
# A tibble: 6 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
summary(gapminder)
country continent year lifeExp
Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
Algeria : 12 Asia :396 Median :1980 Median :60.71
Angola : 12 Europe :360 Mean :1980 Mean :59.47
Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
Australia : 12 Max. :2007 Max. :82.60
(Other) :1632
pop gdpPercap
Min. :6.001e+04 Min. : 241.2
1st Qu.:2.794e+06 1st Qu.: 1202.1
Median :7.024e+06 Median : 3531.8
Mean :2.960e+07 Mean : 7215.3
3rd Qu.:1.959e+07 3rd Qu.: 9325.5
Max. :1.319e+09 Max. :113523.1
table(gapminder$country[1:20])
Afghanistan Albania Algeria
12 8 0
Angola Argentina Australia
0 0 0
Austria Bahrain Bangladesh
0 0 0
Belgium Benin Bolivia
0 0 0
Bosnia and Herzegovina Botswana Brazil
0 0 0
Bulgaria Burkina Faso Burundi
0 0 0
Cambodia Cameroon Canada
0 0 0
Central African Republic Chad Chile
0 0 0
China Colombia Comoros
0 0 0
Congo, Dem. Rep. Congo, Rep. Costa Rica
0 0 0
Cote d'Ivoire Croatia Cuba
0 0 0
Czech Republic Denmark Djibouti
0 0 0
Dominican Republic Ecuador Egypt
0 0 0
El Salvador Equatorial Guinea Eritrea
0 0 0
Ethiopia Finland France
0 0 0
Gabon Gambia Germany
0 0 0
Ghana Greece Guatemala
0 0 0
Guinea Guinea-Bissau Haiti
0 0 0
Honduras Hong Kong, China Hungary
0 0 0
Iceland India Indonesia
0 0 0
Iran Iraq Ireland
0 0 0
Israel Italy Jamaica
0 0 0
Japan Jordan Kenya
0 0 0
Korea, Dem. Rep. Korea, Rep. Kuwait
0 0 0
Lebanon Lesotho Liberia
0 0 0
Libya Madagascar Malawi
0 0 0
Malaysia Mali Mauritania
0 0 0
Mauritius Mexico Mongolia
0 0 0
Montenegro Morocco Mozambique
0 0 0
Myanmar Namibia Nepal
0 0 0
Netherlands New Zealand Nicaragua
0 0 0
Niger Nigeria Norway
0 0 0
Oman Pakistan Panama
0 0 0
Paraguay Peru Philippines
0 0 0
Poland Portugal Puerto Rico
0 0 0
Reunion Romania Rwanda
0 0 0
Sao Tome and Principe Saudi Arabia Senegal
0 0 0
Serbia Sierra Leone Singapore
0 0 0
Slovak Republic Slovenia Somalia
0 0 0
South Africa Spain Sri Lanka
0 0 0
Sudan Swaziland Sweden
0 0 0
Switzerland Syria Taiwan
0 0 0
Tanzania Thailand Togo
0 0 0
Trinidad and Tobago Tunisia Turkey
0 0 0
Uganda United Kingdom United States
0 0 0
Uruguay Venezuela Vietnam
0 0 0
West Bank and Gaza Yemen, Rep. Zambia
0 0 0
Zimbabwe
0
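This table counts the countries in the first 20 rows only: Afghanistan appears 12 times and Albania 8 times, while every other country shows 0 because table() still lists unused factor levels. Combined with the summary() output above, where each listed country has 12 observations, this shows that each country appears repeatedly, once per recorded year. Therefore, gapminder is panel data: it combines a cross-sectional dimension (countries) with a time dimension (years).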
library(dplyr)  # provides %>% and filter()
# Keep three countries and plot their life expectancy over time
selected_countries <- gapminder %>%
  filter(country %in% c("United States", "China", "Brazil"))
ggplot(selected_countries, aes(x = year, y = lifeExp, color = country)) +
  geom_line(linewidth = 1) +
  geom_point() +
  labs(
    title = "Gapminder: Life Expectancy Over Time",
    x = "Year",
    y = "Life Expectancy"
  ) +
  theme_minimal()
This chart compares life expectancy in three countries over time. As expected, the United States has had the highest life expectancy throughout, with comparatively little change since the 1950s. China shows the largest increase, rising from roughly 50 years in the 1950s to around 70 by 2000.
Type of Data
gapminder is panel data because it follows the same units, which are countries, over multiple years. In other words, it combines a cross-sectional dimension (many countries) with a time-series dimension (repeated years).
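The difference between the two structures can be checked mechanically. The sketch below is only an illustration: panel_shape() is a hypothetical helper I wrote for this post (it is not part of gapminder or dplyr), and the two toy data frames are made up. It counts units and time periods and checks whether every unit is observed exactly once in every period.

```r
# Hypothetical helper (not from any package): summarises the unit-by-time
# structure of a data frame.
panel_shape <- function(df, unit, time) {
  counts <- table(df[[unit]], df[[time]])  # rows = units, columns = periods
  list(
    n_units   = nrow(counts),
    n_periods = ncol(counts),
    balanced  = all(counts == 1)  # each unit seen exactly once per period
  )
}

# Made-up panel: 3 units, each observed in the same 4 years.
toy_panel <- expand.grid(
  unit = c("A", "B", "C"),
  year = c(2000, 2005, 2010, 2015),
  stringsAsFactors = FALSE
)

# Made-up cross-section: each unit observed once, in a single year.
toy_cross_section <- data.frame(
  unit = c("A", "B", "C"),
  year = 2000
)

panel_info <- panel_shape(toy_panel, "unit", "year")
cross_info <- panel_shape(toy_cross_section, "unit", "year")

panel_info$n_periods  # 4 periods: repeated observations, so a panel
cross_info$n_periods  # 1 period: a single snapshot, so a cross-section
```

With the gapminder package loaded, panel_shape(gapminder, "country", "year") should report 12 periods, in line with the 12 observations per country in the summary output.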
Part 2: Intuition for the Slope Formula
What is covariance?
Covariance tells us whether two variables move together. If x and y tend to increase together, covariance is positive; if one tends to increase while the other decreases, covariance is negative.
What is variance?
Variance measures how much a variable spreads out around its mean. A larger variance means the variable has more dispersion, while a smaller variance means the values are more tightly clustered.
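To make the two definitions concrete, here is a minimal sketch that computes them by hand and compares the results with base R's var() and cov(). The five data points are made up for illustration and do not come from Affairs or gapminder.

```r
# Made-up illustrative data (an assumption, not from either dataset).
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
n <- length(x)

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
var_x_manual <- sum((x - mean(x))^2) / (n - 1)

# Sample covariance: sum of the products of the x and y deviations,
# divided by n - 1.
cov_xy_manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Base R's var() and cov() use the same n - 1 divisor.
c(manual = var_x_manual, builtin = var(x))       # both 2.5
c(manual = cov_xy_manual, builtin = cov(x, y))   # both 1.5
```

The covariance here is positive because larger values of x tend to come with larger values of y.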
Why does Cov(y, x) / Var(x) give the slope?
The slope in a simple linear regression measures how much y changes, on average, when x increases by one unit. Covariance tells us how strongly x and y move together, while variance tells us how much x itself varies. Dividing covariance by the variance of x scales that shared movement by the amount of variation in x, which gives the best-fitting slope for the regression line.
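The claim can be checked numerically. The sketch below uses simulated data (the sample size, coefficients, and seed are arbitrary assumptions chosen for illustration) and compares the covariance-to-variance ratio with the slope that lm() estimates.

```r
set.seed(1)  # arbitrary seed so the simulated numbers are reproducible

# Simulated data (an assumption for illustration): y depends on x plus noise.
x <- rnorm(200, mean = 30, sd = 8)
y <- 0.5 + 0.03 * x + rnorm(200, sd = 3)

# Slope from the covariance/variance ratio.
slope_ratio <- cov(y, x) / var(x)

# Slope estimated by lm(); coef() returns a named vector and "x" is the slope.
fit <- lm(y ~ x)
slope_lm <- unname(coef(fit)["x"])

# The fitted line passes through the point of means, which pins down
# the intercept once the slope is known.
intercept_ratio <- mean(y) - slope_ratio * mean(x)
intercept_lm <- unname(coef(fit)["(Intercept)"])

all.equal(slope_ratio, slope_lm)          # TRUE
all.equal(intercept_ratio, intercept_lm)  # TRUE
```

Applying the same ratio to the data used here, cov(Affairs$affairs, Affairs$age) / var(Affairs$age), should reproduce the age coefficient of 0.03382 that lm() reports in the regression output.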
model <- lm(affairs ~ age, data = Affairs)
summary(model)
Call:
lm(formula = affairs ~ age, data = Affairs)
Residuals:
Min 1Q Median 3Q Max
-2.285 -1.609 -1.270 -0.949 11.051
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.35711 0.48804 0.732 0.4646
age 0.03382 0.01444 2.342 0.0195 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.287 on 599 degrees of freedom
Multiple R-squared: 0.00907, Adjusted R-squared: 0.007416
F-statistic: 5.483 on 1 and 599 DF, p-value: 0.01953
For this example, I use the Affairs dataset and regress affairs on age.
The results show that age has a small, positive, and statistically significant association with the number of affairs: each additional year of age is associated with about 0.034 more affairs on average (p = 0.0195). However, the relationship is very weak. The R-squared of 0.009 means that age alone explains less than 1% of the variation in affairs across individuals.
In this discussion, I used two datasets with different structures. Affairs is cross-sectional because it compares many individuals at one point in time, while gapminder is panel data because it follows the same countries across multiple years. I also showed that the slope from a simple linear regression is the same as Cov(y, x) / Var(x), which helps explain the intuition behind the regression coefficient.