Important Note: The purpose of this page is to introduce the dataset and how to organize it, not to instruct you on proper visualization or statistical analysis. In fact, some of the analyses are meaningless.
おことわり:このページの目的はデータセットの紹介し当該データをどのように整理することができるかを紹介することであり,リサーチ・クエスチョンや各データの特性に合わせて適切な統計分析手法をレクチャーするものではありません.
Please refer to “An Introduction to R” for basic usage of R and more detailed data manipulation methods.
Rの基本的な使い方やより詳しいデータの操作方法は「An Introduction to R」などを参照して各自勉強してください.
Load the tidyverse, a package (or more precisely, a
bundle of packages) for manipulating, extracting, and processing
data.
データの整理・操作は tidyverse
というパッケージに含まれる関数を使うのが一般的です.
# install.packages("tidyverse") # run for the first time only
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.4.1 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.5.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Important function in dplyr (pronounced “dee-ply-er”)
package:
filter … fetch only the subset (rows) we’re interested
inselect … select one or multiple columns
(variables)mutate … creates new columns (variables)rename … change columns (variables) nameleft_join … join (merge) two datasetssummarise … summarize datagroup_by … group the rows ** Note:
group_by alone will not give any output; it should be
followed by summarise function.swissLearn the basic usage of the tidyverse using the swiss
data built-in to R.
Rにビルトインされているswissデータを用いて,tidyverseの基本的な使い方を確認します.
You can find the other built-in datasets using
data().
他にビルトインされているデータセットは data()
で確認できます.
For ease of understanding, we cut out only the first six rows
(provinces) and the first four columns (variables) of the
swiss dataset and define it as a new dataset named
swiss2.
理解しやすいように,swissデータセットの最初の6行(provinces)と最初の4列(variables)だけを切り出して,「swiss2」という名前の新しいデータセットとして定義します.
swiss2 <- swiss[1:6, 1:4] # Extract a portion of `swiss` dataset
swiss2
## Fertility Agriculture Examination Education
## Courtelary 80.2 17.0 15 12
## Delemont 83.1 45.1 6 9
## Franches-Mnt 92.5 39.7 5 5
## Moutier 85.8 36.5 12 7
## Neuveville 76.9 43.5 17 15
## Porrentruy 76.1 35.3 9 7
swiss2 %>% filter(Fertility > 80) # Keep only provinces with a "Fertility" greater than 80
## Fertility Agriculture Examination Education
## Courtelary 80.2 17.0 15 12
## Delemont 83.1 45.1 6 9
## Franches-Mnt 92.5 39.7 5 5
## Moutier 85.8 36.5 12 7
filter(swiss2, Fertility > 80) # same as above
## Fertility Agriculture Examination Education
## Courtelary 80.2 17.0 15 12
## Delemont 83.1 45.1 6 9
## Franches-Mnt 92.5 39.7 5 5
## Moutier 85.8 36.5 12 7
swiss2 %>% filter(Fertility > 80, Examination > 10) # multiple conditions
## Fertility Agriculture Examination Education
## Courtelary 80.2 17.0 15 12
## Moutier 85.8 36.5 12 7
swiss2 %>% select(Fertility, Examination) # select specified variables
## Fertility Examination
## Courtelary 80.2 15
## Delemont 83.1 6
## Franches-Mnt 92.5 5
## Moutier 85.8 12
## Neuveville 76.9 17
## Porrentruy 76.1 9
swiss2 %>% mutate(Exam_Edu = (Examination + Education)/2) # creates new variable as the mean
## Fertility Agriculture Examination Education Exam_Edu
## Courtelary 80.2 17.0 15 12 13.5
## Delemont 83.1 45.1 6 9 7.5
## Franches-Mnt 92.5 39.7 5 5 5.0
## Moutier 85.8 36.5 12 7 9.5
## Neuveville 76.9 43.5 17 15 16.0
## Porrentruy 76.1 35.3 9 7 8.0
swiss2 %>% rename(Exam = Examination) # new name is `Exam`, old name is `Examination`
## Fertility Agriculture Exam Education
## Courtelary 80.2 17.0 15 12
## Delemont 83.1 45.1 6 9
## Franches-Mnt 92.5 39.7 5 5
## Moutier 85.8 36.5 12 7
## Neuveville 76.9 43.5 17 15
## Porrentruy 76.1 35.3 9 7
swiss2 %>% summarize(mean_fert = mean(Fertility), sd_fert = sd(Fertility),
median_fert = median(Fertility), mean_educ = mean(Education))
## mean_fert sd_fert median_fert mean_educ
## 1 82.43333 6.145459 81.65 9.166667
We can also compare states with high fertility rates with provinces
with low fertility rates. To do so, add a variable called
high_fertility_province that indicates whether the province
has a high fertility rate or not, and use the group_by
function with this variable.
出生率が高いか低いかでグループを分けて,そのグループごとに平均値や分散などを計算することもできます.その場合は出生率が高いかどうかを示す
high_fertility_province という変数を作成し,それを
group_by 関数に使用します.
swiss2 %>%
mutate(high_fertility_province = Fertility > mean(Fertility)) %>%
group_by(high_fertility_province) %>%
summarize(mean_fert = mean(Fertility), mean_exam = mean(Examination))
## # A tibble: 2 × 3
## high_fertility_province mean_fert mean_exam
## <lgl> <dbl> <dbl>
## 1 FALSE 77.7 13.7
## 2 TRUE 87.1 7.67
If the dataset organized as described above is to be used for future
analysis, the original dataset may be overwritten or a new dataset may
be created (create a new object called swiss2_new).
Overwriting the original dataset (swiss2) will cause the
original dataset to disappear from R, so I recommend assigning it to a
new object (swiss2_new).
上記の操作を行ったデータセットを今後の分析に使用する場合,元のデータセットを上書きしてもよいですし,新しいデータセットを作成してもよいですが,上書きすると元のデータはR上から消えますので注意してください.
swiss2_new <- swiss2 %>% filter(Fertility > 90) # create new
swiss2 <- swiss2 %>% filter(Fertility > 90) # overwrite
Two datasets can be merged using the left_join function.
For example, the dataset swiss2_pop contains two variables: province
name and population.
2つのデータセットがある場合,left_join
関数を使って統合することができます.たとえば,swiss2_pop
というデータセットには州の名前と人口の2つの変数があるとしましょう.
swiss2 <- swiss[1:6, 1:4]
swiss2b <- swiss2 %>% mutate(prov = rownames(swiss2)) # add province names as a column "prov"
swiss2b
## Fertility Agriculture Examination Education prov
## Courtelary 80.2 17.0 15 12 Courtelary
## Delemont 83.1 45.1 6 9 Delemont
## Franches-Mnt 92.5 39.7 5 5 Franches-Mnt
## Moutier 85.8 36.5 12 7 Moutier
## Neuveville 76.9 43.5 17 15 Neuveville
## Porrentruy 76.1 35.3 9 7 Porrentruy
swiss2_pop <- data.frame(prov = c("Courtelary", "Delemont", "Franches-Mnt",
"Moutier", "Neuveville", "Porrentruy"),
pop = c(100, 200, 300, 400, 500, 600))
swiss2_pop
## prov pop
## 1 Courtelary 100
## 2 Delemont 200
## 3 Franches-Mnt 300
## 4 Moutier 400
## 5 Neuveville 500
## 6 Porrentruy 600
swiss2b %>% left_join(swiss2_pop, by = "prov")
## Fertility Agriculture Examination Education prov pop
## 1 80.2 17.0 15 12 Courtelary 100
## 2 83.1 45.1 6 9 Delemont 200
## 3 92.5 39.7 5 5 Franches-Mnt 300
## 4 85.8 36.5 12 7 Moutier 400
## 5 76.9 43.5 17 15 Neuveville 500
## 6 76.1 35.3 9 7 Porrentruy 600
In the above example, the state order in the dataset named
swiss2b and the state order in the dataset named
swiss2_pop are the same, but they can be merged even if
they are different. It is also possible to merge even if there is not
all the data in swiss2_pop.
上記の例では,swiss2b というデータセットの州の並びと
swiss2_pop
というデータセットの州の並びが同じでしたが,異なっている場合でもマージ可能です.また,swiss2_pop
にすべての州のデータがなかったとしてもマージ可能です.
swiss2_pop <- data.frame(prov = c("Porrentruy", "Courtelary", "Delemont",
"Franches-Mnt", "Neuveville"),
pop = c(600, 100, 200, 300, 500))
swiss2_pop
## prov pop
## 1 Porrentruy 600
## 2 Courtelary 100
## 3 Delemont 200
## 4 Franches-Mnt 300
## 5 Neuveville 500
swiss2b %>% left_join(swiss2_pop, by = "prov")
## Fertility Agriculture Examination Education prov pop
## 1 80.2 17.0 15 12 Courtelary 100
## 2 83.1 45.1 6 9 Delemont 200
## 3 92.5 39.7 5 5 Franches-Mnt 300
## 4 85.8 36.5 12 7 Moutier NA
## 5 76.9 43.5 17 15 Neuveville 500
## 6 76.1 35.3 9 7 Porrentruy 600
IPUMS (Integrated Public Use Microdata Series) is the individual-level population database.
Example:
# install.packages("ipumsr") # run for the first time only
library(ipumsr)
## Warning: パッケージ 'ipumsr' はバージョン 4.2.3 の R の下で造られました
setwd("c://ws_stat/ipums_dwellings_202309")
ddi <- read_ipums_ddi("usa_00001.xml")
data <- read_ipums_micro(ddi)
## Use of data from IPUMS USA is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.
names(data) # name of variables
## [1] "YEAR" "SAMPLE" "SERIAL" "CBSERIAL" "HHWT" "CLUSTER"
## [7] "STRATA" "GQ" "FARM" "OWNERSHP" "OWNERSHPD" "MORTGAGE"
## [13] "MORTGAG2" "FARMPROD" "ACREHOUS" "MORTAMT1" "MORTAMT2" "TAXINCL"
## [19] "INSINCL" "PROPINSR" "PROPTX99" "OWNCOST" "RENT" "RENTGRS"
## [25] "RENTMEAL" "CONDOFEE" "MOBLHOME" "COSTELEC" "COSTGAS" "COSTWATR"
## [31] "COSTFUEL" "HHINCOME" "FOODSTMP" "VALUEH" "LINGISOL" "VACANCY"
## [37] "VACDUR" "VACOTH" "KITCHEN" "SINK" "STOVE" "ROOMS"
## [43] "PLUMBING" "HOTWATER" "SHOWER" "BUILTYR2" "UNITSSTR" "BEDROOMS"
nrow(data) # number of observation
## [1] 3252599
# write.csv(data, "usa_00001_all.csv") # save all observation
index_sample <- sample(x = 1:nrow(data), size = 100, replace = F)
write.csv(data[index_sample, ], "usa_00001.csv") # save partial
summary(data[, c("YEAR", "HHINCOME", "VALUEH", "KITCHEN", "SINK", "STOVE", "ROOMS",
"PLUMBING", "HOTWATER", "SHOWER", "BUILTYR2", "BEDROOMS")])
## YEAR HHINCOME VALUEH KITCHEN
## Min. :2021 Min. : -17000 Min. : 1000 Min. :0.000
## 1st Qu.:2021 1st Qu.: 46800 1st Qu.: 200000 1st Qu.:4.000
## Median :2021 Median : 88400 Median : 450000 Median :4.000
## Mean :2021 Mean : 599698 Mean :3232178 Mean :3.785
## 3rd Qu.:2021 3rd Qu.: 155000 3rd Qu.:9999999 3rd Qu.:4.000
## Max. :2021 Max. :9999999 Max. :9999999 Max. :4.000
## SINK STOVE ROOMS PLUMBING HOTWATER
## Min. :0.0 Min. :0.000 Min. : 0.000 Min. : 0.00 Min. :0.0
## 1st Qu.:2.0 1st Qu.:2.000 1st Qu.: 5.000 1st Qu.:20.00 1st Qu.:4.0
## Median :2.0 Median :2.000 Median : 6.000 Median :20.00 Median :4.0
## Mean :1.9 Mean :1.898 Mean : 6.225 Mean :18.98 Mean :3.8
## 3rd Qu.:2.0 3rd Qu.:2.000 3rd Qu.: 8.000 3rd Qu.:20.00 3rd Qu.:4.0
## Max. :2.0 Max. :2.000 Max. :20.000 Max. :20.00 Max. :4.0
## SHOWER BUILTYR2 BEDROOMS
## Min. :0.000 Min. : 0.00 Min. :0.00
## 1st Qu.:4.000 1st Qu.: 3.00 1st Qu.:3.00
## Median :4.000 Median : 5.00 Median :4.00
## Mean :3.796 Mean : 5.71 Mean :3.88
## 3rd Qu.:4.000 3rd Qu.: 7.00 3rd Qu.:5.00
## Max. :4.000 Max. :26.00 Max. :9.00
data_mod <- data %>%
filter(VALUEH < 9999999) %>%
mutate(with_kitchen = KITCHEN >= 3, with_sink = SINK == 2, with_stove = STOVE == 2,
with_hot_cold = HOTWATER == 4, with_shower = SHOWER == 4, is_farm = FARM == 2,
area = NA,
area = ifelse(ACREHOUS == 1, 0, 1),
built_yr = 2021,
built_yr = ifelse(BUILTYR2 >= 10, BUILTYR2 + 1995, built_yr),
built_yr = ifelse(BUILTYR2 == 9, 2002, built_yr), # 2000-2004
built_yr = ifelse(BUILTYR2 == 8, 1997, built_yr), # 1995-1999
built_yr = ifelse(BUILTYR2 == 7, 1992, built_yr), # 1990-1994
built_yr = ifelse(BUILTYR2 == 6, 1985, built_yr), # 1980-1989
built_yr = ifelse(BUILTYR2 == 5, 1975, built_yr), # 1970-1979
built_yr = ifelse(BUILTYR2 == 4, 1965, built_yr), # 1960-1969
built_yr = ifelse(BUILTYR2 == 3, 1955, built_yr), # 1950-1959
built_yr = ifelse(BUILTYR2 == 2, 1945, built_yr), # 1940-1949
built_yr = ifelse(BUILTYR2 == 1, 1930, built_yr), # 1939 or earlier
age = 2021 - built_yr) %>%
select(VALUEH, with_kitchen, with_sink, ROOMS, with_hot_cold, with_shower, is_farm,
area, built_yr, age, BEDROOMS, HHINCOME)
# cor(data_mod$VALUEH, data_mod$HHINCOME)
dp <- data_mod[sample(1:nrow(data_mod), 10000), ]
plot(data.frame(value = dp$VALUEH, room = dp$ROOMS, age = jitter(dp$age),
bedroom = jitter(dp$BEDROOMS)), pch = ".")
lm_ipums <- lm(log(VALUEH) ~ with_kitchen + with_sink + ROOMS + with_hot_cold + with_shower +
is_farm + area + age + BEDROOMS, data = data_mod) # linear regression
summary(lm_ipums)
##
## Call:
## lm(formula = log(VALUEH) ~ with_kitchen + with_sink + ROOMS +
## with_hot_cold + with_shower + is_farm + area + age + BEDROOMS,
## data = data_mod)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3434 -0.4499 0.0754 0.5849 4.5840
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.080e+01 1.608e-02 671.50 <2e-16 ***
## with_kitchenTRUE 6.042e-01 1.498e-02 40.33 <2e-16 ***
## with_sinkTRUE -7.741e-01 2.630e-02 -29.43 <2e-16 ***
## ROOMS 5.457e-02 3.659e-04 149.13 <2e-16 ***
## with_hot_coldTRUE 6.505e-01 1.838e-02 35.39 <2e-16 ***
## with_showerTRUE -1.113e-02 2.318e-02 -0.48 0.631
## is_farmTRUE -2.389e-01 5.654e-03 -42.26 <2e-16 ***
## area 2.019e-01 2.505e-03 80.62 <2e-16 ***
## age -3.304e-03 2.742e-05 -120.48 <2e-16 ***
## BEDROOMS 2.145e-01 9.365e-04 229.00 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.019 on 2295380 degrees of freedom
## Multiple R-squared: 0.09506, Adjusted R-squared: 0.09506
## F-statistic: 2.679e+04 on 9 and 2295380 DF, p-value: < 2.2e-16
A home is more valued if it: has a kitchen, more rooms/bedrooms, a larger square footage, a smaller age, etc.
OECD Statistics provides statistics on the economy, industry, population, environment, health, labor, and more for OECD member countries.
cf. 総務省統計局作成のマニュアル「OECD.Statの使い方」
Population by country
Carbon dioxide (CO2) by country
setwd("c://ws_stat")
pop <- read.csv("HISTPOP.csv")
env <- read.csv("AEA.csv")
pop_mod <- pop %>%
filter(SEX == "T") %>%
rename(pop = Value, Year = Time) %>%
select(Country, Year, pop)
env_mod <- env %>%
filter(POLLUTANT == "GHG", ACTIVITY == "IND-TOTAL") %>%
rename(ghg = Value) %>%
select(Country, Year, ghg)
pop_env <- pop_mod %>%
left_join(env_mod, by = c("Country", "Year")) # merge two dataset
summary(lm(ghg ~ pop + I(pop^2) + as.factor(Year), data = pop_env))
##
## Call:
## lm(formula = ghg ~ pop + I(pop^2) + as.factor(Year), data = pop_env)
##
## Residuals:
## Min 1Q Median 3Q Max
## -494984794 -96105105 -61494956 -4611388 676509799
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.302e+08 5.254e+06 24.777 < 2e-16 ***
## pop 2.085e+01 5.215e-01 39.974 < 2e-16 ***
## I(pop^2) -2.446e-07 1.005e-08 -24.337 < 2e-16 ***
## as.factor(Year)2011 -6.590e+05 7.347e+06 -0.090 0.92853
## as.factor(Year)2012 -2.238e+06 7.347e+06 -0.305 0.76067
## as.factor(Year)2013 -5.403e+06 7.347e+06 -0.735 0.46211
## as.factor(Year)2014 -9.365e+06 7.347e+06 -1.275 0.20244
## as.factor(Year)2015 -8.878e+06 7.347e+06 -1.208 0.22692
## as.factor(Year)2016 -2.294e+07 7.399e+06 -3.101 0.00193 **
## as.factor(Year)2017 -3.080e+07 7.454e+06 -4.132 3.62e-05 ***
## as.factor(Year)2018 -3.252e+07 7.454e+06 -4.362 1.30e-05 ***
## as.factor(Year)2019 -3.744e+07 7.454e+06 -5.022 5.18e-07 ***
## as.factor(Year)2020 -4.906e+07 7.512e+06 -6.531 6.77e-11 ***
## as.factor(Year)2021 -6.013e+07 7.779e+06 -7.730 1.16e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 173600000 on 12851 degrees of freedom
## ( 9703 個の観測値が欠損のため削除されました )
## Multiple R-squared: 0.1692, Adjusted R-squared: 0.1684
## F-statistic: 201.4 on 13 and 12851 DF, p-value: < 2.2e-16
There appears to be a nonlinear relationship between greenhouse gases and population, as the coefficient on population is estimated to be positive and the coefficient on population squared is estimated to be negative. Note, however, that the relationship between the two variables is not causal, but simply correlational (or co-movement relationship).
温室効果ガスと人口の間には非線形関係があり,人口の係数は正,人口の二乗の係数は負に推定されます.ただし,この2つの変数の関係は因果関係ではなくただの相関関係(または共変動関係)であることに留意しましょう.
e-Stat (English) or e-Stat (Japanese) is a portal site for data on population, households, labor, industry, economy, housing and construction, energy, transportation, information and communication, education, social security, etc. surveyed by the Japanese government.
e-Statは日本政府が調査した人口,家計,労働,産業,経済,住宅・建設,エネルギー,運輸,情報通信,教育,社会保障などに関するデータのポータルサイトです.
Example
Note
Since numeric data contains commas in every three digits, these
commas should be removed (using str_replace_all function)
and then converted as numerical data (using as.numeric
function). In addition, to exclude values for the whole of Japan and for
the whole of the prefectures, the observed values whose area code is a
multiple of 1000 are removed (Note: the remainder divided by 1000 is
calculated as %%, so we delete observations whose remainder
is equal to 0).
数字データは3桁ごとにカンマが含まれているので,このカンマを除去し,数値データとして変換する.さらに,日本全体や都道府県レベルの値が除外されるように,area
codeが1000の倍数になっている観測値を削除する(1000で割った余りは
%%
で計算されるので,それが0と等しい観測値は削除する).
Note that this is not sufficient to eliminate duplications (e.g., city total and each ward).
ただし,これだけでは重複(たとえば,市合計と各区の値が重複している)を排除できていない点に注意する.
setwd("c://ws_stat/")
pop <- read.csv("b01_01e.csv")
pop$pop_total <- as.numeric(str_replace_all(pop$pop_total, pattern = ",", replacement = ""))
## Warning: 強制変換により NA が生成されました
pop$area <- as.numeric(str_replace_all(pop$area, pattern = ",", replacement = ""))
pop_mod <- pop %>%
filter(Area.codes..2020. %% 1000 != 0)
plot(pop_mod$pop_total, pop_mod$area, log = "xy", pch = ".") # population and area of municipality
Digital National Land Information provides basic data on national land, including geographical features, land use, and public facilities. The data is provided as GIS data.
地形・土地利用・公共施設などの国土に関する情報をGISデータとして提供しています.
Example
library(sf)
## Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is TRUE
setwd("c://ws_stat//L01-23_13_GML")
lp <- sf::st_read(dsn = getwd(), layer = "L01-23_13", quiet = T, stringsAsFactors = F)
st_crs(lp) # Coordinate Reference System ... JGD2000
## Coordinate Reference System:
## User input: JGD2000
## wkt:
## GEOGCRS["JGD2000",
## DATUM["Japanese Geodetic Datum 2000",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["geodetic latitude (Lat)",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["geodetic longitude (Lon)",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## USAGE[
## SCOPE["Horizontal component of 3D system."],
## AREA["Japan - onshore and offshore."],
## BBOX[17.09,122.38,46.05,157.65]],
## ID["EPSG",4612]]
st_is_longlat(lp) # TRUE if coordinates are in latitude/longitude format
## [1] TRUE
lp_shinjuku <- lp %>%
filter(L01_023 == "新宿")
summary(lp_shinjuku$L01_006) # land price variable [JPY]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 605000 878750 1270000 3414551 2407500 36600000
# plot(lp_shinjuku[6])
Download additional data on administrative areas (polygon data) and overlay them.
行政区域データを追加でダウンロードし,重ねて描画する.
setwd("c://ws_stat//L01-23_13_GML")
region <- sf::st_read(dsn = getwd(), layer = "N03-23_13_230101", quiet = T, stringsAsFactors = F)
# st_crs(region) # JGD2011
region_shinjuku <- region %>%
filter(N03_007 == 13104) # Municipality code of Shinjuku is 13104
st_geometry(region_shinjuku) %>% plot
# setting color of land price
cm_col <- cm.colors(n = 5)[cut(x = log(lp_shinjuku$L01_006), breaks = 5)]
st_geometry(lp_shinjuku) %>% plot(add = T, col = cm_col, pch = 19)
Land prices are high around Shinjuku Station.
新宿駅の周辺は地価が高い.
Database
Example
.do (Stata program) and .dta
(Stata data) filessetwd("c://ws_stat/bell")
crime <- haven::read_dta("crime_immig.dta")
crime <- crime %>%
mutate(adult_pop = totalpop - pop014, # over 15 yo
asylum_pop = (dispsMF + subsMF)/adult_pop, # proportion of immigrants
viol_crime_rate = viol / adult_pop) # violence crime rate
summary(lm(viol_crime_rate ~ asylum_pop + as.factor(year), data = crime))
##
## Call:
## lm(formula = viol_crime_rate ~ asylum_pop + as.factor(year),
## data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.033066 -0.005174 -0.001281 0.003586 0.056121
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0110758 0.0004011 27.615 < 2e-16 ***
## asylum_pop 2.8610436 0.0915565 31.249 < 2e-16 ***
## as.factor(year)2002 0.0032287 0.0005584 5.782 8.07e-09 ***
## as.factor(year)2003 0.0062137 0.0005577 11.141 < 2e-16 ***
## as.factor(year)2004 0.0084673 0.0005575 15.187 < 2e-16 ***
## as.factor(year)2005 0.0088038 0.0005579 15.780 < 2e-16 ***
## as.factor(year)2006 0.0087216 0.0005580 15.629 < 2e-16 ***
## as.factor(year)2007 0.0069726 0.0005583 12.489 < 2e-16 ***
## as.factor(year)2008 0.0059517 0.0005590 10.647 < 2e-16 ***
## as.factor(year)2009 0.0051784 0.0005600 9.247 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.007593 on 3327 degrees of freedom
## ( 756 個の観測値が欠損のため削除されました )
## Multiple R-squared: 0.2883, Adjusted R-squared: 0.2864
## F-statistic: 149.7 on 9 and 3327 DF, p-value: < 2.2e-16
Tokyo Open Data Catalog is a data portal platform.
Unfortunately, the datasets are basically recorded in Japanese, and searches can only be performed in Japanese.
英語版はないようです.
Example
setwd("c://ws_stat/")
tokyo_covid <- read.csv("130001_tokyo_covid19_details_testing_positive_cases.csv")
str(tokyo_covid)
## 'data.frame': 1166 obs. of 17 variables:
## $ 全国地方公共団体コード : int 130001 130001 130001 130001 130001 130001 130001 130001 130001 130001 ...
## $ 都道府県名 : chr "東京都" "東京都" "東京都" "東京都" ...
## $ 市区町村名 : logi NA NA NA NA NA NA ...
## $ 公表_年月日 : chr "2020/2/28" "2020/2/29" "2020/3/1" "2020/3/2" ...
## $ 陽性者数.累計. : int 37 37 39 39 40 44 52 58 64 64 ...
## $ 入院中 : int 21 21 23 23 22 26 32 37 43 43 ...
## $ 軽症.中等症 : int 16 16 18 18 17 20 25 29 35 35 ...
## $ 軽症.中等症_確保病床 : int NA NA NA NA NA NA NA NA NA NA ...
## $ 重症 : int 5 5 5 5 5 6 7 8 8 8 ...
## $ 重症_確保病床 : int NA NA NA NA NA NA NA NA NA NA ...
## $ 宿泊療養 : int NA NA NA NA NA NA NA NA NA NA ...
## $ 宿泊療養施設_受入可能室数: int NA NA NA NA NA NA NA NA NA NA ...
## $ 感染拡大時療養施設_床数 : int NA NA NA NA NA NA NA NA NA NA ...
## $ 自宅療養 : int NA NA NA NA NA NA NA NA NA NA ...
## $ 調整中 : int NA NA NA NA NA NA NA NA NA NA ...
## $ 死亡 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ 退院 : int 15 15 15 15 17 17 19 20 20 20 ...
Since the announcement date is recorded as string data in the format
“yyyy-mm-dd”, it is converted to date-type data using the
as.Data function.
公表日は「yyyy/mm/dd」という形式の文字列データとして記録されているので,as.Data
関数を使って日付型のデータに変換する.
tokyo_covid$date <- as.Date(tokyo_covid$公表_年月日)
plot(x = tokyo_covid$date, y = tokyo_covid$入院中, type = "l",
xlab = "Date", ylab = "Number of people hospitalized")
Kaggle is a data science competition platform. Kaggle provides many datasets that are ready for analysis.
Kaggleはデータサイエンスのコンペを実施しているプラットフォームで,すぐに分析可能な状態の多くのデータセットが提供されています.
Example
setwd("c://ws_stat")
auto <- read.csv("Automobile_data.csv")
# str(auto)
table(auto$fuel.type)
##
## diesel gas
## 20 185
table(auto$num.of.doors) # Missing values seems to be "?" instead of "NA"
##
## ? four two
## 2 114 89
auto <- auto %>%
mutate(ln_price = log(as.numeric(price)),
num_door = ifelse(num.of.doors == "?", NA, num.of.doors),
num_door = ifelse(num_door == "two", 2, 4),
hp = as.numeric(horsepower))
## Warning in mask$eval_all_mutate(quo): 強制変換により NA が生成されました
## Warning in mask$eval_all_mutate(quo): 強制変換により NA が生成されました
lm_auto <- lm(ln_price ~ fuel.type + num_door + wheel.base + curb.weight +
engine.size + hp + city.mpg, data = auto) # linear regression
summary(lm_auto)
##
## Call:
## lm(formula = ln_price ~ fuel.type + num_door + wheel.base + curb.weight +
## engine.size + hp + city.mpg, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.45979 -0.10706 -0.02197 0.12239 0.45095
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.0969622 0.4390109 18.444 < 2e-16 ***
## fuel.typegas -0.2766453 0.0642596 -4.305 2.68e-05 ***
## num_door 0.0063051 0.0160383 0.393 0.694668
## wheel.base 0.0061218 0.0044188 1.385 0.167561
## curb.weight 0.0002436 0.0000938 2.597 0.010154 *
## engine.size 0.0013949 0.0007798 1.789 0.075257 .
## hp 0.0049638 0.0009470 5.241 4.24e-07 ***
## city.mpg -0.0172043 0.0049470 -3.478 0.000628 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1861 on 189 degrees of freedom
## ( 8 個の観測値が欠損のため削除されました )
## Multiple R-squared: 0.8699, Adjusted R-squared: 0.865
## F-statistic: 180.5 on 7 and 189 DF, p-value: < 2.2e-16
Kaggle also provides R and Python code to analyze its datasets. For example, for the above data, we can view the code developed by the contributors for visualization and analysis from the “Code” section.
当該データセットを可視化・分析するRやPythonのコードも提供されています.
.