Important Note: The purpose of this page is to introduce the dataset and how to organize it, not to instruct you on proper visualization or statistical analysis. In fact, some of the analyses are meaningless.

おことわり：このページの目的はデータセットの紹介し当該データをどのように整理することができるかを紹介することであり，リサーチ・クエスチョンや各データの特性に合わせて適切な統計分析手法をレクチャーするものではありません．

Please refer to “An Introduction to R” for basic usage of R and more detailed data manipulation methods.

Rの基本的な使い方やより詳しいデータの操作方法は「An Introduction to R」などを参照して各自勉強してください．

1 tidyverse package

Load the tidyverse, a package (or more precisely, a bundle of packages) for manipulating, extracting, and processing data.

データの整理・操作は tidyverse というパッケージに含まれる関数を使うのが一般的です．

# install.packages("tidyverse")  # run for the first time only
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✔ ggplot2 3.4.1     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.5.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Important function in dplyr (pronounced “dee-ply-er”) package:

filter … fetch only the subset (rows) we’re interested in
select … select one or multiple columns (variables)
mutate … creates new columns (variables)
rename … change columns (variables) name
left_join … join (merge) two datasets
summarise … summarize data
group_by … group the rows ** Note: group_by alone will not give any output; it should be followed by summarise function.

2 Example of `swiss`

Learn the basic usage of the tidyverse using the swiss data built-in to R.

Rにビルトインされているswissデータを用いて，tidyverseの基本的な使い方を確認します．

You can find the other built-in datasets using data().

他にビルトインされているデータセットは data() で確認できます．

For ease of understanding, we cut out only the first six rows (provinces) and the first four columns (variables) of the swiss dataset and define it as a new dataset named swiss2.

理解しやすいように，swissデータセットの最初の6行（provinces）と最初の4列（variables）だけを切り出して，「swiss2」という名前の新しいデータセットとして定義します．

swiss2 <- swiss[1:6, 1:4]  # Extract a portion of `swiss` dataset
swiss2

##              Fertility Agriculture Examination Education
## Courtelary        80.2        17.0          15        12
## Delemont          83.1        45.1           6         9
## Franches-Mnt      92.5        39.7           5         5
## Moutier           85.8        36.5          12         7
## Neuveville        76.9        43.5          17        15
## Porrentruy        76.1        35.3           9         7

2.1 filter, select, mutate, rename, and summarise

swiss2 %>% filter(Fertility > 80)  # Keep only provinces with a "Fertility" greater than 80

##              Fertility Agriculture Examination Education
## Courtelary        80.2        17.0          15        12
## Delemont          83.1        45.1           6         9
## Franches-Mnt      92.5        39.7           5         5
## Moutier           85.8        36.5          12         7

filter(swiss2, Fertility > 80)  # same as above

##              Fertility Agriculture Examination Education
## Courtelary        80.2        17.0          15        12
## Delemont          83.1        45.1           6         9
## Franches-Mnt      92.5        39.7           5         5
## Moutier           85.8        36.5          12         7

swiss2 %>% filter(Fertility > 80, Examination > 10)  # multiple conditions

##            Fertility Agriculture Examination Education
## Courtelary      80.2        17.0          15        12
## Moutier         85.8        36.5          12         7

swiss2 %>% select(Fertility, Examination)  # select specified variables

##              Fertility Examination
## Courtelary        80.2          15
## Delemont          83.1           6
## Franches-Mnt      92.5           5
## Moutier           85.8          12
## Neuveville        76.9          17
## Porrentruy        76.1           9

swiss2 %>% mutate(Exam_Edu = (Examination + Education)/2)  # creates new variable as the mean

##              Fertility Agriculture Examination Education Exam_Edu
## Courtelary        80.2        17.0          15        12     13.5
## Delemont          83.1        45.1           6         9      7.5
## Franches-Mnt      92.5        39.7           5         5      5.0
## Moutier           85.8        36.5          12         7      9.5
## Neuveville        76.9        43.5          17        15     16.0
## Porrentruy        76.1        35.3           9         7      8.0

swiss2 %>% rename(Exam = Examination)  # new name is `Exam`, old name is `Examination`

##              Fertility Agriculture Exam Education
## Courtelary        80.2        17.0   15        12
## Delemont          83.1        45.1    6         9
## Franches-Mnt      92.5        39.7    5         5
## Moutier           85.8        36.5   12         7
## Neuveville        76.9        43.5   17        15
## Porrentruy        76.1        35.3    9         7

swiss2 %>% summarize(mean_fert = mean(Fertility), sd_fert = sd(Fertility), 
                     median_fert = median(Fertility), mean_educ = mean(Education))

##   mean_fert  sd_fert median_fert mean_educ
## 1  82.43333 6.145459       81.65  9.166667

2.2 group_by and summarise

We can also compare states with high fertility rates with provinces with low fertility rates. To do so, add a variable called high_fertility_province that indicates whether the province has a high fertility rate or not, and use the group_by function with this variable.

出生率が高いか低いかでグループを分けて，そのグループごとに平均値や分散などを計算することもできます．その場合は出生率が高いかどうかを示す high_fertility_province という変数を作成し，それを group_by 関数に使用します．

swiss2 %>%
  mutate(high_fertility_province = Fertility > mean(Fertility)) %>%
  group_by(high_fertility_province) %>%
  summarize(mean_fert = mean(Fertility), mean_exam = mean(Examination))

## # A tibble: 2 × 3
##   high_fertility_province mean_fert mean_exam
##   <lgl>                       <dbl>     <dbl>
## 1 FALSE                        77.7     13.7 
## 2 TRUE                         87.1      7.67

If the dataset organized as described above is to be used for future analysis, the original dataset may be overwritten or a new dataset may be created (create a new object called swiss2_new).

Overwriting the original dataset (swiss2) will cause the original dataset to disappear from R, so I recommend assigning it to a new object (swiss2_new).

上記の操作を行ったデータセットを今後の分析に使用する場合，元のデータセットを上書きしてもよいですし，新しいデータセットを作成してもよいですが，上書きすると元のデータはR上から消えますので注意してください．

swiss2_new <- swiss2 %>% filter(Fertility > 90)  # create new
swiss2 <- swiss2 %>% filter(Fertility > 90)  # overwrite

2.3 left_join

Two datasets can be merged using the left_join function. For example, the dataset swiss2_pop contains two variables: province name and population.

2つのデータセットがある場合，left_join 関数を使って統合することができます．たとえば，swiss2_pop というデータセットには州の名前と人口の2つの変数があるとしましょう．

swiss2 <- swiss[1:6, 1:4]
swiss2b <- swiss2 %>% mutate(prov = rownames(swiss2))  # add province names as a column "prov"
swiss2b

##              Fertility Agriculture Examination Education         prov
## Courtelary        80.2        17.0          15        12   Courtelary
## Delemont          83.1        45.1           6         9     Delemont
## Franches-Mnt      92.5        39.7           5         5 Franches-Mnt
## Moutier           85.8        36.5          12         7      Moutier
## Neuveville        76.9        43.5          17        15   Neuveville
## Porrentruy        76.1        35.3           9         7   Porrentruy

swiss2_pop <- data.frame(prov = c("Courtelary", "Delemont", "Franches-Mnt", 
                                  "Moutier", "Neuveville", "Porrentruy"),
                         pop = c(100, 200, 300, 400, 500, 600))
swiss2_pop

##           prov pop
## 1   Courtelary 100
## 2     Delemont 200
## 3 Franches-Mnt 300
## 4      Moutier 400
## 5   Neuveville 500
## 6   Porrentruy 600

swiss2b %>% left_join(swiss2_pop, by = "prov")

##   Fertility Agriculture Examination Education         prov pop
## 1      80.2        17.0          15        12   Courtelary 100
## 2      83.1        45.1           6         9     Delemont 200
## 3      92.5        39.7           5         5 Franches-Mnt 300
## 4      85.8        36.5          12         7      Moutier 400
## 5      76.9        43.5          17        15   Neuveville 500
## 6      76.1        35.3           9         7   Porrentruy 600

In the above example, the state order in the dataset named swiss2b and the state order in the dataset named swiss2_pop are the same, but they can be merged even if they are different. It is also possible to merge even if there is not all the data in swiss2_pop.

上記の例では，swiss2b というデータセットの州の並びと swiss2_pop というデータセットの州の並びが同じでしたが，異なっている場合でもマージ可能です．また，swiss2_pop にすべての州のデータがなかったとしてもマージ可能です．

swiss2_pop <- data.frame(prov = c("Porrentruy", "Courtelary", "Delemont", 
                                  "Franches-Mnt", "Neuveville"),
                         pop = c(600, 100, 200, 300, 500))
swiss2_pop

##           prov pop
## 1   Porrentruy 600
## 2   Courtelary 100
## 3     Delemont 200
## 4 Franches-Mnt 300
## 5   Neuveville 500

swiss2b %>% left_join(swiss2_pop, by = "prov")

##   Fertility Agriculture Examination Education         prov pop
## 1      80.2        17.0          15        12   Courtelary 100
## 2      83.1        45.1           6         9     Delemont 200
## 3      92.5        39.7           5         5 Franches-Mnt 300
## 4      85.8        36.5          12         7      Moutier  NA
## 5      76.9        43.5          17        15   Neuveville 500
## 6      76.1        35.3           9         7   Porrentruy 600

3 IPUMS

IPUMS (Integrated Public Use Microdata Series) is the individual-level population database.

Example:

select “IPUMS USA”
select “Create your custom data set”
select “2021 ASC” under “Select samples”
select “economic characteristic” and “dwelling characteristic” under “Household” of “Select harmonized variables”
click “View cart” –> “Create data extract” –> “Submit extract”
login (or “Create an account” only for users who do not have an account)
When the download is ready, you will receive an email with the subject “IPUMS USA data extract is ready.”
Data can be downloaded from “Download .dat”. The file is in GZIP (.gz) format, so use unzip/decompression software (7-Zip, Lhaplus etc.) to extract the file.
Variable definitions can be found in the “Codebook.”

# install.packages("ipumsr")  # run for the first time only
library(ipumsr)

## Warning: パッケージ 'ipumsr' はバージョン 4.2.3 の R の下で造られました

setwd("c://ws_stat/ipums_dwellings_202309")
ddi <- read_ipums_ddi("usa_00001.xml")
data <- read_ipums_micro(ddi)

## Use of data from IPUMS USA is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.

names(data)  # name of variables

##  [1] "YEAR"      "SAMPLE"    "SERIAL"    "CBSERIAL"  "HHWT"      "CLUSTER"  
##  [7] "STRATA"    "GQ"        "FARM"      "OWNERSHP"  "OWNERSHPD" "MORTGAGE" 
## [13] "MORTGAG2"  "FARMPROD"  "ACREHOUS"  "MORTAMT1"  "MORTAMT2"  "TAXINCL"  
## [19] "INSINCL"   "PROPINSR"  "PROPTX99"  "OWNCOST"   "RENT"      "RENTGRS"  
## [25] "RENTMEAL"  "CONDOFEE"  "MOBLHOME"  "COSTELEC"  "COSTGAS"   "COSTWATR" 
## [31] "COSTFUEL"  "HHINCOME"  "FOODSTMP"  "VALUEH"    "LINGISOL"  "VACANCY"  
## [37] "VACDUR"    "VACOTH"    "KITCHEN"   "SINK"      "STOVE"     "ROOMS"    
## [43] "PLUMBING"  "HOTWATER"  "SHOWER"    "BUILTYR2"  "UNITSSTR"  "BEDROOMS"

nrow(data)  # number of observation

## [1] 3252599

# write.csv(data, "usa_00001_all.csv")  # save all observation
index_sample <- sample(x = 1:nrow(data), size = 100, replace = F)
write.csv(data[index_sample, ], "usa_00001.csv")  # save partial
summary(data[, c("YEAR", "HHINCOME", "VALUEH", "KITCHEN", "SINK", "STOVE", "ROOMS", 
                 "PLUMBING", "HOTWATER", "SHOWER", "BUILTYR2", "BEDROOMS")])

##       YEAR         HHINCOME           VALUEH           KITCHEN     
##  Min.   :2021   Min.   : -17000   Min.   :   1000   Min.   :0.000  
##  1st Qu.:2021   1st Qu.:  46800   1st Qu.: 200000   1st Qu.:4.000  
##  Median :2021   Median :  88400   Median : 450000   Median :4.000  
##  Mean   :2021   Mean   : 599698   Mean   :3232178   Mean   :3.785  
##  3rd Qu.:2021   3rd Qu.: 155000   3rd Qu.:9999999   3rd Qu.:4.000  
##  Max.   :2021   Max.   :9999999   Max.   :9999999   Max.   :4.000  
##       SINK         STOVE           ROOMS           PLUMBING        HOTWATER  
##  Min.   :0.0   Min.   :0.000   Min.   : 0.000   Min.   : 0.00   Min.   :0.0  
##  1st Qu.:2.0   1st Qu.:2.000   1st Qu.: 5.000   1st Qu.:20.00   1st Qu.:4.0  
##  Median :2.0   Median :2.000   Median : 6.000   Median :20.00   Median :4.0  
##  Mean   :1.9   Mean   :1.898   Mean   : 6.225   Mean   :18.98   Mean   :3.8  
##  3rd Qu.:2.0   3rd Qu.:2.000   3rd Qu.: 8.000   3rd Qu.:20.00   3rd Qu.:4.0  
##  Max.   :2.0   Max.   :2.000   Max.   :20.000   Max.   :20.00   Max.   :4.0  
##      SHOWER         BUILTYR2        BEDROOMS   
##  Min.   :0.000   Min.   : 0.00   Min.   :0.00  
##  1st Qu.:4.000   1st Qu.: 3.00   1st Qu.:3.00  
##  Median :4.000   Median : 5.00   Median :4.00  
##  Mean   :3.796   Mean   : 5.71   Mean   :3.88  
##  3rd Qu.:4.000   3rd Qu.: 7.00   3rd Qu.:5.00  
##  Max.   :4.000   Max.   :26.00   Max.   :9.00

data_mod <- data %>%
  filter(VALUEH < 9999999) %>%
  mutate(with_kitchen = KITCHEN >= 3, with_sink = SINK == 2, with_stove = STOVE == 2, 
         with_hot_cold = HOTWATER == 4, with_shower = SHOWER == 4, is_farm = FARM == 2,
         area = NA,
         area = ifelse(ACREHOUS == 1, 0, 1),
         built_yr = 2021,
         built_yr = ifelse(BUILTYR2 >= 10, BUILTYR2 + 1995, built_yr),
         built_yr = ifelse(BUILTYR2 == 9, 2002, built_yr),  # 2000-2004
         built_yr = ifelse(BUILTYR2 == 8, 1997, built_yr),  # 1995-1999
         built_yr = ifelse(BUILTYR2 == 7, 1992, built_yr),  # 1990-1994
         built_yr = ifelse(BUILTYR2 == 6, 1985, built_yr),  # 1980-1989
         built_yr = ifelse(BUILTYR2 == 5, 1975, built_yr),  # 1970-1979
         built_yr = ifelse(BUILTYR2 == 4, 1965, built_yr),  # 1960-1969
         built_yr = ifelse(BUILTYR2 == 3, 1955, built_yr),  # 1950-1959
         built_yr = ifelse(BUILTYR2 == 2, 1945, built_yr),  # 1940-1949
         built_yr = ifelse(BUILTYR2 == 1, 1930, built_yr),  # 1939 or earlier
         age = 2021 - built_yr) %>%
  select(VALUEH, with_kitchen, with_sink, ROOMS, with_hot_cold, with_shower, is_farm, 
         area, built_yr, age, BEDROOMS, HHINCOME)

# cor(data_mod$VALUEH, data_mod$HHINCOME)
dp <- data_mod[sample(1:nrow(data_mod), 10000), ]
plot(data.frame(value = dp$VALUEH, room = dp$ROOMS, age = jitter(dp$age), 
                bedroom = jitter(dp$BEDROOMS)), pch = ".")

lm_ipums <- lm(log(VALUEH) ~ with_kitchen + with_sink + ROOMS + with_hot_cold + with_shower +
               is_farm + area + age + BEDROOMS, data = data_mod)  # linear regression
summary(lm_ipums)

## 
## Call:
## lm(formula = log(VALUEH) ~ with_kitchen + with_sink + ROOMS + 
##     with_hot_cold + with_shower + is_farm + area + age + BEDROOMS, 
##     data = data_mod)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.3434 -0.4499  0.0754  0.5849  4.5840 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.080e+01  1.608e-02  671.50   <2e-16 ***
## with_kitchenTRUE   6.042e-01  1.498e-02   40.33   <2e-16 ***
## with_sinkTRUE     -7.741e-01  2.630e-02  -29.43   <2e-16 ***
## ROOMS              5.457e-02  3.659e-04  149.13   <2e-16 ***
## with_hot_coldTRUE  6.505e-01  1.838e-02   35.39   <2e-16 ***
## with_showerTRUE   -1.113e-02  2.318e-02   -0.48    0.631    
## is_farmTRUE       -2.389e-01  5.654e-03  -42.26   <2e-16 ***
## area               2.019e-01  2.505e-03   80.62   <2e-16 ***
## age               -3.304e-03  2.742e-05 -120.48   <2e-16 ***
## BEDROOMS           2.145e-01  9.365e-04  229.00   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.019 on 2295380 degrees of freedom
## Multiple R-squared:  0.09506,    Adjusted R-squared:  0.09506 
## F-statistic: 2.679e+04 on 9 and 2295380 DF,  p-value: < 2.2e-16

A home is more valued if it: has a kitchen, more rooms/bedrooms, a larger square footage, a smaller age, etc.

4 OECD Statistics

OECD Statistics provides statistics on the economy, industry, population, environment, health, labor, and more for OECD member countries.

cf. 総務省統計局作成のマニュアル「OECD.Statの使い方」

Population by country

select “Demography and population” > “Population statistics” > “Historical population data”
click “Export” –> “Text file (CSV)” –> “Download”

Carbon dioxide (CO2) by country

select “Environment” > “Air and Climate” > “Air Emission Accounts”
click “Export” –> “Text file (CSV)” –> “Download”

setwd("c://ws_stat")
pop <- read.csv("HISTPOP.csv")
env <- read.csv("AEA.csv")

pop_mod <- pop %>%
  filter(SEX == "T") %>%
  rename(pop = Value, Year = Time) %>%
  select(Country, Year, pop)

env_mod <- env %>%
  filter(POLLUTANT == "GHG", ACTIVITY == "IND-TOTAL") %>%
  rename(ghg = Value) %>%
  select(Country, Year, ghg)
pop_env <- pop_mod %>%
  left_join(env_mod, by = c("Country", "Year"))  # merge two dataset

summary(lm(ghg ~ pop + I(pop^2) + as.factor(Year), data = pop_env))

## 
## Call:
## lm(formula = ghg ~ pop + I(pop^2) + as.factor(Year), data = pop_env)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -494984794  -96105105  -61494956   -4611388  676509799 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.302e+08  5.254e+06  24.777  < 2e-16 ***
## pop                  2.085e+01  5.215e-01  39.974  < 2e-16 ***
## I(pop^2)            -2.446e-07  1.005e-08 -24.337  < 2e-16 ***
## as.factor(Year)2011 -6.590e+05  7.347e+06  -0.090  0.92853    
## as.factor(Year)2012 -2.238e+06  7.347e+06  -0.305  0.76067    
## as.factor(Year)2013 -5.403e+06  7.347e+06  -0.735  0.46211    
## as.factor(Year)2014 -9.365e+06  7.347e+06  -1.275  0.20244    
## as.factor(Year)2015 -8.878e+06  7.347e+06  -1.208  0.22692    
## as.factor(Year)2016 -2.294e+07  7.399e+06  -3.101  0.00193 ** 
## as.factor(Year)2017 -3.080e+07  7.454e+06  -4.132 3.62e-05 ***
## as.factor(Year)2018 -3.252e+07  7.454e+06  -4.362 1.30e-05 ***
## as.factor(Year)2019 -3.744e+07  7.454e+06  -5.022 5.18e-07 ***
## as.factor(Year)2020 -4.906e+07  7.512e+06  -6.531 6.77e-11 ***
## as.factor(Year)2021 -6.013e+07  7.779e+06  -7.730 1.16e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 173600000 on 12851 degrees of freedom
##   ( 9703 個の観測値が欠損のため削除されました )
## Multiple R-squared:  0.1692, Adjusted R-squared:  0.1684 
## F-statistic: 201.4 on 13 and 12851 DF,  p-value: < 2.2e-16

There appears to be a nonlinear relationship between greenhouse gases and population, as the coefficient on population is estimated to be positive and the coefficient on population squared is estimated to be negative. Note, however, that the relationship between the two variables is not causal, but simply correlational (or co-movement relationship).

温室効果ガスと人口の間には非線形関係があり，人口の係数は正，人口の二乗の係数は負に推定されます．ただし，この2つの変数の関係は因果関係ではなくただの相関関係（または共変動関係）であることに留意しましょう．

5 e-Stat

e-Stat (English) or e-Stat (Japanese) is a portal site for data on population, households, labor, industry, economy, housing and construction, energy, transportation, information and communication, education, social security, etc. surveyed by the Japanese government.

e-Statは日本政府が調査した人口，家計，労働，産業，経済，住宅・建設，エネルギー，運輸，情報通信，教育，社会保障などに関するデータのポータルサイトです．

Example

search for “census”
click “Population Census”
click “2020 Population Census: File”
click “■Basic Complete Tabulation on Population and Households”
click “Table number 1-1, Excel” and then file will be downloaded

Note

Data on population, households, etc. aggregated at the municipal level is available
To merge with other datasets, use the 6-digit municipal code (地方公共団体コード, recorded in the column name “Area codes”)
To analyze it in R, the downloaded Excel file should be exported in CSV format.

Since numeric data contains commas in every three digits, these commas should be removed (using str_replace_all function) and then converted as numerical data (using as.numeric function). In addition, to exclude values for the whole of Japan and for the whole of the prefectures, the observed values whose area code is a multiple of 1000 are removed (Note: the remainder divided by 1000 is calculated as %%, so we delete observations whose remainder is equal to 0).

数字データは3桁ごとにカンマが含まれているので，このカンマを除去し，数値データとして変換する．さらに，日本全体や都道府県レベルの値が除外されるように，area codeが1000の倍数になっている観測値を削除する（1000で割った余りは %% で計算されるので，それが0と等しい観測値は削除する）．

Note that this is not sufficient to eliminate duplications (e.g., city total and each ward).

ただし，これだけでは重複（たとえば，市合計と各区の値が重複している）を排除できていない点に注意する．

setwd("c://ws_stat/")
pop <- read.csv("b01_01e.csv")
pop$pop_total <- as.numeric(str_replace_all(pop$pop_total, pattern = ",", replacement = ""))

## Warning: 強制変換により NA が生成されました

pop$area <- as.numeric(str_replace_all(pop$area, pattern = ",", replacement = ""))
pop_mod <- pop %>%
  filter(Area.codes..2020. %% 1000 != 0)
plot(pop_mod$pop_total, pop_mod$area, log = "xy", pch = ".")  # population and area of municipality

6 Digital National Land Information / 国土数値情報

Digital National Land Information provides basic data on national land, including geographical features, land use, and public facilities. The data is provided as GIS data.

地形・土地利用・公共施設などの国土に関する情報をGISデータとして提供しています．

Example

choose “地価公示データ (land price data)”
click “L01-23_13_GML.zip” data and then data will be downloaded
shapefile is stored in a ZIP file, so we should unzip it

library(sf)

## Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is TRUE

setwd("c://ws_stat//L01-23_13_GML")
lp <- sf::st_read(dsn = getwd(), layer = "L01-23_13", quiet = T, stringsAsFactors = F)
st_crs(lp)  # Coordinate Reference System ... JGD2000

## Coordinate Reference System:
##   User input: JGD2000 
##   wkt:
## GEOGCRS["JGD2000",
##     DATUM["Japanese Geodetic Datum 2000",
##         ELLIPSOID["GRS 1980",6378137,298.257222101,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["geodetic latitude (Lat)",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["geodetic longitude (Lon)",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     USAGE[
##         SCOPE["Horizontal component of 3D system."],
##         AREA["Japan - onshore and offshore."],
##         BBOX[17.09,122.38,46.05,157.65]],
##     ID["EPSG",4612]]

st_is_longlat(lp)  # TRUE if coordinates are in latitude/longitude format

## [1] TRUE

lp_shinjuku <- lp %>%
  filter(L01_023 == "新宿")
summary(lp_shinjuku$L01_006)  # land price variable [JPY]

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   605000   878750  1270000  3414551  2407500 36600000

# plot(lp_shinjuku[6])

Download additional data on administrative areas (polygon data) and overlay them.

行政区域データを追加でダウンロードし，重ねて描画する．

setwd("c://ws_stat//L01-23_13_GML")
region <- sf::st_read(dsn = getwd(), layer = "N03-23_13_230101", quiet = T, stringsAsFactors = F)
# st_crs(region)  # JGD2011
region_shinjuku <- region %>%
  filter(N03_007 == 13104)  # Municipality code of Shinjuku is 13104
st_geometry(region_shinjuku) %>% plot
# setting color of land price
cm_col <- cm.colors(n = 5)[cut(x = log(lp_shinjuku$L01_006), breaks = 5)]
st_geometry(lp_shinjuku) %>% plot(add = T, col = cm_col, pch = 19)

Land prices are high around Shinjuku Station.

新宿駅の周辺は地価が高い．

7 Replication file of published paper

Database

Example

choose Harvard Dataverse https://dataverse.harvard.edu/
search for “crime and immigration”
find “Replication data for: Crime and Immigration: Evidence from Large Immigrant Waves”, published in Review of Economics and Statistics … Bell, Fasani, and Machin (2013)
confirm “Terms” and see it has been released as public domain (CC0 license)
download .do (Stata program) and .dta (Stata data) files

setwd("c://ws_stat/bell")
crime <- haven::read_dta("crime_immig.dta")
crime <- crime %>%
  mutate(adult_pop = totalpop - pop014,  # over 15 yo
         asylum_pop = (dispsMF + subsMF)/adult_pop,  # proportion of immigrants
         viol_crime_rate = viol / adult_pop)  # violence crime rate
summary(lm(viol_crime_rate ~ asylum_pop + as.factor(year), data = crime))

## 
## Call:
## lm(formula = viol_crime_rate ~ asylum_pop + as.factor(year), 
##     data = crime)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.033066 -0.005174 -0.001281  0.003586  0.056121 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         0.0110758  0.0004011  27.615  < 2e-16 ***
## asylum_pop          2.8610436  0.0915565  31.249  < 2e-16 ***
## as.factor(year)2002 0.0032287  0.0005584   5.782 8.07e-09 ***
## as.factor(year)2003 0.0062137  0.0005577  11.141  < 2e-16 ***
## as.factor(year)2004 0.0084673  0.0005575  15.187  < 2e-16 ***
## as.factor(year)2005 0.0088038  0.0005579  15.780  < 2e-16 ***
## as.factor(year)2006 0.0087216  0.0005580  15.629  < 2e-16 ***
## as.factor(year)2007 0.0069726  0.0005583  12.489  < 2e-16 ***
## as.factor(year)2008 0.0059517  0.0005590  10.647  < 2e-16 ***
## as.factor(year)2009 0.0051784  0.0005600   9.247  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.007593 on 3327 degrees of freedom
##   ( 756 個の観測値が欠損のため削除されました )
## Multiple R-squared:  0.2883, Adjusted R-squared:  0.2864 
## F-statistic: 149.7 on 9 and 3327 DF,  p-value: < 2.2e-16

8 Tokyo Open Data Catalog / 東京都オープンデータカタログ

Tokyo Open Data Catalog is a data portal platform.

Unfortunately, the datasets are basically recorded in Japanese, and searches can only be performed in Japanese.

英語版はないようです．

Example

search for “新型コロナウイルス” (Covid-19)
choose “東京都新型コロナウイルス感染症重症患者数”
move to “データとリソース”
click URL at the top of the page then CSV file will be downloaded

setwd("c://ws_stat/")
tokyo_covid <- read.csv("130001_tokyo_covid19_details_testing_positive_cases.csv")
str(tokyo_covid)

## 'data.frame':    1166 obs. of  17 variables:
##  $ 全国地方公共団体コード   : int  130001 130001 130001 130001 130001 130001 130001 130001 130001 130001 ...
##  $ 都道府県名               : chr  "東京都" "東京都" "東京都" "東京都" ...
##  $ 市区町村名               : logi  NA NA NA NA NA NA ...
##  $ 公表_年月日              : chr  "2020/2/28" "2020/2/29" "2020/3/1" "2020/3/2" ...
##  $ 陽性者数.累計.           : int  37 37 39 39 40 44 52 58 64 64 ...
##  $ 入院中                   : int  21 21 23 23 22 26 32 37 43 43 ...
##  $ 軽症.中等症              : int  16 16 18 18 17 20 25 29 35 35 ...
##  $ 軽症.中等症_確保病床     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ 重症                     : int  5 5 5 5 5 6 7 8 8 8 ...
##  $ 重症_確保病床            : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ 宿泊療養                 : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ 宿泊療養施設_受入可能室数: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ 感染拡大時療養施設_床数  : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ 自宅療養                 : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ 調整中                   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ 死亡                     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ 退院                     : int  15 15 15 15 17 17 19 20 20 20 ...

Since the announcement date is recorded as string data in the format “yyyy-mm-dd”, it is converted to date-type data using the as.Data function.

公表日は「yyyy/mm/dd」という形式の文字列データとして記録されているので，as.Data 関数を使って日付型のデータに変換する．

tokyo_covid$date <- as.Date(tokyo_covid$公表_年月日)
plot(x = tokyo_covid$date, y = tokyo_covid$入院中, type = "l", 
     xlab = "Date", ylab = "Number of people hospitalized")

9 Kaggle

Kaggle is a data science competition platform. Kaggle provides many datasets that are ready for analysis.

Kaggleはデータサイエンスのコンペを実施しているプラットフォームで，すぐに分析可能な状態の多くのデータセットが提供されています．

Example

Search for “automobile”
choose “Automobile Dataset”
click “Download”
sign in (for example, “Sign in with Google”)
CSV file will be downloaded

setwd("c://ws_stat")
auto <- read.csv("Automobile_data.csv")
# str(auto)
table(auto$fuel.type)

## 
## diesel    gas 
##     20    185

table(auto$num.of.doors)  # Missing values seems to be "?" instead of "NA"

## 
##    ? four  two 
##    2  114   89

auto <- auto %>%
  mutate(ln_price = log(as.numeric(price)),
         num_door = ifelse(num.of.doors == "?", NA, num.of.doors),
         num_door = ifelse(num_door == "two", 2, 4),
         hp = as.numeric(horsepower))

## Warning in mask$eval_all_mutate(quo): 強制変換により NA が生成されました

## Warning in mask$eval_all_mutate(quo): 強制変換により NA が生成されました

lm_auto <- lm(ln_price ~ fuel.type + num_door + wheel.base + curb.weight + 
                engine.size + hp + city.mpg, data = auto)  # linear regression
summary(lm_auto)

## 
## Call:
## lm(formula = ln_price ~ fuel.type + num_door + wheel.base + curb.weight + 
##     engine.size + hp + city.mpg, data = auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45979 -0.10706 -0.02197  0.12239  0.45095 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.0969622  0.4390109  18.444  < 2e-16 ***
## fuel.typegas -0.2766453  0.0642596  -4.305 2.68e-05 ***
## num_door      0.0063051  0.0160383   0.393 0.694668    
## wheel.base    0.0061218  0.0044188   1.385 0.167561    
## curb.weight   0.0002436  0.0000938   2.597 0.010154 *  
## engine.size   0.0013949  0.0007798   1.789 0.075257 .  
## hp            0.0049638  0.0009470   5.241 4.24e-07 ***
## city.mpg     -0.0172043  0.0049470  -3.478 0.000628 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1861 on 189 degrees of freedom
##   ( 8 個の観測値が欠損のため削除されました )
## Multiple R-squared:  0.8699, Adjusted R-squared:  0.865 
## F-statistic: 180.5 on 7 and 189 DF,  p-value: < 2.2e-16

Kaggle also provides R and Python code to analyze its datasets. For example, for the above data, we can view the code developed by the contributors for visualization and analysis from the “Code” section.

当該データセットを可視化・分析するRやPythonのコードも提供されています．

Data for empirical studies / 実証研究に使用できるデータの例

Sho Kuroda

2023-10-26

1 tidyverse package

2 Example of `swiss`

2.1 filter, select, mutate, rename, and summarise

2.2 group_by and summarise

2.3 left_join

3 IPUMS

4 OECD Statistics

5 e-Stat

6 Digital National Land Information / 国土数値情報

7 Replication file of published paper

8 Tokyo Open Data Catalog / 東京都オープンデータカタログ

9 Kaggle

Data for empirical studies / 実証研究に使用できるデータの例

Sho Kuroda

2023-10-26

1 tidyverse package

2 Example of swiss

2.1 filter, select, mutate, rename, and summarise

2.2 group_by and summarise

2.3 left_join

3 IPUMS

4 OECD Statistics

5 e-Stat

6 Digital National Land Information / 国土数値情報

7 Replication file of published paper

8 Tokyo Open Data Catalog / 東京都オープンデータカタログ

9 Kaggle

2 Example of `swiss`