参考資料

★★★ 1.回帰分析の説明変数や目的変数は正規分布していなくてもよいか? https://toukeier.hatenablog.com/entry/2019/09/08/224346

回帰分析の説明変数や目的変数は正規分布していなくてもよいか?という質問に対しては、「していなくてよい」が答えだ。 正規分布している必要があるのは残差なのだ。説明変数や目的変数ではない。これを間違えてはいけない。 残差が正規分布している必要がある理由は、分散分析のF検定の検定統計量の分母に該当するから。F分布は、分母・分子とも正規分布していることが前提だからだ。

2.統計ブログランキング https://blog.with2.net/rank4465-0.html

3.三重大学奥村先生サイト https://oku.edu.mie-u.ac.jp/~okumura/stat/first.html

4.自分のサイト - http://rpubs.com/fumi/569286 # 始めに文法の基本: - “文頭に#+スペースで大みだしができる。” - “以降を箇条書きで書いていく場合は、ここに書いたとおりスペース+”-“+スペースで書ける” - このローカルデータは、“C:\721540.—Rmd”であり - webには、http://rpubs.com/fumi/562118 - バックアップとして、上記のデータのダウンロードを C:\721540

# 厚木の緊急度推定における機械学習法と線形回帰分析の比較検討 - 厚木データのインポート

library(dummies)
## dummies-1.5.6 provided by Decision Patterns
library(data.table)
## Warning: package 'data.table' was built under R version 3.6.2
library(tidyr)
library(ranger)
## Warning: package 'ranger' was built under R version 3.6.2
library(xtable)
library(nnet)
## Warning: package 'nnet' was built under R version 3.6.2
library(e1071)
## Warning: package 'e1071' was built under R version 3.6.2
library(epitools)
library(car)
## Loading required package: carData
library(caret)
## Warning: package 'caret' was built under R version 3.6.2
## Loading required package: lattice
library(ggplot2)
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 3.6.2
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ranger':
## 
##     importance
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(rgl)
library(rattle)
## Warning: package 'rattle' was built under R version 3.6.2
## Rattle: A free graphical interface for data science with R.
## バージョン 5.3.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## 'rattle()' と入力して、データを多角的に分析します。
## 
## Attaching package: 'rattle'
## The following object is masked from 'package:randomForest':
## 
##     importance
## The following object is masked from 'package:ranger':
## 
##     importance
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.6.2
## Loading required package: rpart
library(rpart)
library(epitools)
library(caret)
library(ggthemes)
library(readr)
## Warning: package 'readr' was built under R version 3.6.2
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
## 
##     combine
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(cluster)
library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.2. https://CRAN.R-project.org/package=stargazer

下水道データ読み込み# 基本統計量表示 

gesui = read_csv("gesuidou.csv")
## Parsed with column specification:
## cols(
##   OBJECTID = col_double(),
##   slope = col_double(),
##   long = col_double(),
##   uedokaburi = col_double(),
##   sitadokaburi = col_double(),
##   masuhonsuu = col_double(),
##   nendo = col_double(),
##   kei = col_double(),
##   kubun = col_double(),
##   did = col_double(),
##   kouhou = col_double(),
##   ekijyouka = col_double(),
##   kansyu = col_double(),
##   kinkyuudo = col_double()
## )
dim(gesui)
## [1] 478  14
summary(gesui)
##     OBJECTID         slope            long         uedokaburi    
##  Min.   :  1.0   Min.   :0.000   Min.   :1.004   Min.   : 1.020  
##  1st Qu.:120.2   1st Qu.:2.000   1st Qu.:2.152   1st Qu.: 2.675  
##  Median :239.5   Median :2.800   Median :3.183   Median : 3.722  
##  Mean   :239.5   Mean   :3.509   Mean   :3.646   Mean   : 4.138  
##  3rd Qu.:358.8   3rd Qu.:4.700   3rd Qu.:4.717   3rd Qu.: 4.900  
##  Max.   :478.0   Max.   :9.710   Max.   :9.940   Max.   :13.863  
##   sitadokaburi         masuhonsuu         nendo           kei        
##  Min.   : 0.000685   Min.   : 0.000   Min.   :1976   Min.   :0.2000  
##  1st Qu.: 1.844750   1st Qu.: 0.000   1st Qu.:1978   1st Qu.:0.2500  
##  Median : 2.693033   Median : 1.000   Median :1986   Median :0.6000  
##  Mean   : 2.983303   Mean   : 1.554   Mean   :1985   Mean   :0.5779  
##  3rd Qu.: 3.820872   3rd Qu.: 2.000   3rd Qu.:1991   3rd Qu.:0.8000  
##  Max.   :11.196442   Max.   :14.000   Max.   :1992   Max.   :1.6500  
##      kubun             did             kouhou         ekijyouka     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :1.0000   Median :1.0000   Median :1.0000   Median :0.0000  
##  Mean   :0.8285   Mean   :0.8264   Mean   :0.5021   Mean   :0.3033  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      kansyu        kinkyuudo   
##  Min.   :0.000   Min.   :0.00  
##  1st Qu.:0.000   1st Qu.:0.00  
##  Median :1.000   Median :2.00  
##  Mean   :0.705   Mean   :1.51  
##  3rd Qu.:1.000   3rd Qu.:3.00  
##  Max.   :1.000   Max.   :3.00
gesui<- data.frame(gesui) 
stargazer(as.data.frame(gesui),type = "html")
Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
OBJECTID 478 239.500 138.131 1 120.2 358.8 478
slope 478 3.509 2.128 0 2 4.7 10
long 478 3.646 1.844 1.004 2.152 4.717 9.940
uedokaburi 478 4.138 2.096 1.020 2.675 4.900 13.863
sitadokaburi 478 2.983 1.686 0.001 1.845 3.821 11.196
masuhonsuu 478 1.554 2.090 0 0 2 14
nendo 478 1,985.094 6.015 1,976 1,978 1,991 1,992
kei 478 0.578 0.353 0.200 0.250 0.800 1.650
kubun 478 0.828 0.377 0 1 1 1
did 478 0.826 0.379 0 1 1 1
kouhou 478 0.502 0.501 0 0 1 1
ekijyouka 478 0.303 0.460 0 0 1 1
kansyu 478 0.705 0.457 0 0 1 1
kinkyuudo 478 1.510 1.462 0 0 3 3
ggplot(gesui,aes(slope))+
geom_histogram(aes(y = ..density..),
bins = 10,
colour = "gray",
fill = "blue")+
  theme_classic()

ggplot(gesui,aes(long))+
geom_histogram(aes(y = ..density..),
bins = 10,
colour = "gray",
fill = "blue")+
  theme_classic()

ggplot(gesui,aes(uedokaburi))+
geom_histogram(aes(y = ..density..),
bins = 10,
colour = "gray",
fill = "blue")+
  theme_classic()

ggplot(gesui,aes(sitadokaburi))+
geom_histogram(aes(y = ..density..),
bins = 10,
colour = "gray",
fill = "blue")+
  theme_classic()

ggplot(gesui,aes(masuhonsuu ))+
geom_histogram(aes(y = ..density..),
bins = 10,
colour = "gray",
fill = "blue")+
  theme_classic()

ggplot(gesui,aes(nendo))+
geom_histogram(aes(y = ..density..),
bins = 10,
colour = "gray",
fill = "blue")+
  theme_classic()

ggplot(gesui,aes(kei))+
geom_histogram(aes(y = ..density..),
bins = 10,
colour = "gray",
fill = "blue")+
  theme_classic()

ggplot(gesui,aes(ekijyouka))+
geom_histogram(aes(y = ..density..),
bins = 10,
colour = "gray",
fill = "blue")+
  theme_classic()

ggplot(gesui,aes(kinkyuudo))+
geom_histogram(aes(y = ..density..),
bins = 10,
colour = "gray",
fill = "blue")+
  theme_classic()

head(gesui)
##   OBJECTID slope  long uedokaburi sitadokaburi masuhonsuu nendo  kei kubun did
## 1        1  0.00 9.940   3.758000     3.258000          0  1983 1.10     1   1
## 2        2  5.00 1.232   2.331392     2.255000          0  1982 1.10     1   1
## 3        3  5.00 6.500   4.654844     4.513411          0  1981 0.60     1   1
## 4        4  2.70 5.606   4.584913     3.037999          0  1976 0.60     1   1
## 5        5  4.71 5.020   1.414000     1.403298          0  1992 0.25     0   1
## 6        6  1.10 1.317   1.544714     1.153690          3  1992 0.25     0   1
##   kouhou ekijyouka kansyu kinkyuudo
## 1      0         1      1         0
## 2      1         1      1         3
## 3      0         1      1         3
## 4      0         1      1         3
## 5      1         1      0         0
## 6      1         1      0         0
plot(gesui[,2],gesui[,3])

#plot(gesui[,2],gesui[,3],type="l",xlab="Length of slope", ylab="Length of long",cex=2,col="red")
#参照頁はcol=”red”でエラー
plot(gesui[,2],gesui[,3],type="p",xlab="Length of slope", ylab="Length of long", main="延長と傾斜の散布図",cex=0.8,col="red")

pairs(gesui[2:7], pch = 21, bg = c("red", "green3", "blue", "orenge","black", "yellow", "skyblue")[unclass(gesui$kinkyuudo)])

pairs(gesui[2:14], pch = 21)

upper <- function(x, y, ...){ # 右上の散布図を設定
  oldpar <- par(usr = c(0, 1, 0, 1)) # 3行下でテキストの位置を(0.5, 0.5)にするための調整
  v <- abs(cor(x, y))
  sz <- 1 + v * 2 # テキストのサイズを直線性の強さで調整
  text(0.5, 0.5, sprintf("%.3f", v), cex = sz)
  par(oldpar)
}

lower <- function(x, y, ...){ # 左下の散布図を設定
  points(x, y, pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])
  if( abs(cor(x, y)) > 0.8 ){ # 直線性の強いデータの組だけ回帰式を作成
    lmobj <- lm(y~x)
    abline(lmobj)
  }
}

pairs(gesui[,2:14], upper.panel=upper, lower.panel=lower)

各変数の正規分布性の確認(qqnorm()関数)s悦明変数は正規分布しているか?

https://detail.chiebukuro.yahoo.co.jp/qa/question_detail/q1283089632 - https://sites.google.com/site/webtextofr/graph - 正規分布ならばQQプロットは直線になるはずです.

まずSLOPE

qqnorm(gesui$slope)

対数正規性を調べる http://cse.naro.affrc.go.jp/takezawa/r-tips/r/63.html https://tutorialmore.com/questions-1478863.htm

 x <- gesui$slope
 xy <- qqplot(qt(ppoints(50), df=5), x)
 abline(lsfit(xy$x, xy$y), lty=2)

 plot(qlnorm(ppoints(x)), sort(x))

次に桝本数:カテゴリー変数であり正規分布にはならない

qqnorm(gesui$masu)

qqnorm(gesui$uedokaburi)

qqnorm(gesui$sitadokaburi)

qqnorm(gesui$masu)

qqnorm(gesui$long)

シャピロ・ウィルク検定

-敢えて検定するまでもなく今までの分布形状を見ると、左側に偏った形状をしている。 https://tkstock.site/2017/04/12/2017-04-12-000000_1/ で傾斜に対してシャピロ・ウィルク検定を行うと

 shapiro.test(gesui$slope)
## 
##  Shapiro-Wilk normality test
## 
## data:  gesui$slope
## W = 0.89138, p-value < 2.2e-16

同様にして管路延長も正規分布でない

 shapiro.test(gesui$long)
## 
##  Shapiro-Wilk normality test
## 
## data:  gesui$long
## W = 0.92612, p-value = 1.365e-14

-data: gesui$long - W = 0.92612, p-value = 1.365e-14

同様に桝も正規分でない

 shapiro.test(gesui$masu)
## 
##  Shapiro-Wilk normality test
## 
## data:  gesui$masu
## W = 0.73461, p-value < 2.2e-16

Shapiro-Wilk normality test

data: gesui$masu W = 0.73461, p-value < 2.2e-16

# 説明変数slope等は正規分布する必要はない。必要なのは誤差分散が正規分布しているかいなかである

-https://toukeier.hatenablog.com/entry/2019/09/08/224346

slopeの確立密度関数

https://oku.edu.mie-u.ac.jp/~okumura/stat/first.html x=slope

 x <- gesui$slope
curve(dnorm(x), 6, -6)

slope等の非正規なデータの最尤法フィッティング

下準備:http://tips-r.blogspot.com/2014/05/repseq.html

rep(1, 10) #「○を○個だけrepeat」って感じ
##  [1] 1 1 1 1 1 1 1 1 1 1

slope等の非正規なデータの最尤法フィッティング

-https://oku.edu.mie-u.ac.jp/~okumura/stat/141115.html

slopeの平均 -平均3.509,標準偏差2.128の正規分布の密度関数 と指数関数 とを50:10で混ぜ合わせた簡単な関数を考えます:

何をやりたいのか途中でわからなくなったので中止 #x=478 #y=gesui$slope x = 1:20 y = c(11,4,13,10,4,8,6,16,7,12,10,13,6,5,1,4,2,0,0,1)

#z = rep(x, y) r = glm(y ~ dnorm(x,10,3) + exp(-x/10) - 1, family=poisson(link=“identity”)) summary(r) #curve(dnorm(x,10,3))