Loading Packages
require(readxl)
## Loading required package: readxl
require(readr)
## Loading required package: readr
## Warning: package 'readr' was built under R version 4.2.2
require(skimr)
## Loading required package: skimr
require(lmtest)
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
require(car)
## Loading required package: car
## Loading required package: carData
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Importing and cleaning the data
df <- read_csv("PS4_GamesSales.csv")
## Rows: 1034 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Game, Year, Genre, Publisher
## dbl (5): North America, Europe, Japan, Rest of World, Global
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Dropping rows that contain NA values
df = na.omit(df)
Inspecting the data
head(df)
## # A tibble: 6 × 9
## Game Year Genre Publi…¹ North…² Europe Japan Rest …³ Global
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Grand Theft Auto V 2014 Acti… Rockst… 6.06 9.71 0.6 3.02 19.4
## 2 Call of Duty: Black O… 2015 Shoo… Activi… 6.18 6.05 0.41 2.44 15.1
## 3 Red Dead Redemption 2 2018 Acti… Rockst… 5.26 6.21 0.21 2.26 13.9
## 4 Call of Duty: WWII 2017 Shoo… Activi… 4.67 6.21 0.4 2.12 13.4
## 5 FIFA 18 2017 Spor… EA Spo… 1.27 8.64 0.15 1.73 11.8
## 6 FIFA 17 2016 Spor… Electr… 1.26 7.95 0.12 1.61 10.9
## # … with abbreviated variable names ¹Publisher, ²`North America`,
## # ³`Rest of World`
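# Year was read as character; entries that are not numeric become NA here
df$Year = as.numeric(df$Year)
# Drop the rows whose Year could not be converted
df = na.omit(df)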
str(df)
## tibble [825 × 9] (S3: tbl_df/tbl/data.frame)
## $ Game : chr [1:825] "Grand Theft Auto V" "Call of Duty: Black Ops 3" "Red Dead Redemption 2" "Call of Duty: WWII" ...
## $ Year : num [1:825] 2014 2015 2018 2017 2017 ...
## $ Genre : chr [1:825] "Action" "Shooter" "Action-Adventure" "Shooter" ...
## $ Publisher : chr [1:825] "Rockstar Games" "Activision" "Rockstar Games" "Activision" ...
## $ North America: num [1:825] 6.06 6.18 5.26 4.67 1.27 1.26 4.49 3.64 3.11 2.91 ...
## $ Europe : num [1:825] 9.71 6.05 6.21 6.21 8.64 7.95 3.93 3.39 3.83 3.97 ...
## $ Japan : num [1:825] 0.6 0.41 0.21 0.4 0.15 0.12 0.21 0.32 0.19 0.27 ...
## $ Rest of World: num [1:825] 3.02 2.44 2.26 2.12 1.73 1.61 1.7 1.41 1.36 1.34 ...
## $ Global : num [1:825] 19.4 15.1 13.9 13.4 11.8 ...
## - attr(*, "na.action")= 'omit' Named int [1:209] 448 467 631 725 728 729 730 734 735 736 ...
## ..- attr(*, "names")= chr [1:209] "448" "467" "631" "725" ...
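The `Year` handling above relies on `as.numeric()` turning strings it cannot parse into `NA`, which `na.omit()` then removes. A minimal sketch with made-up values shows one way to count what would be dropped before removing anything. The toy data frame and the "N/A" marker are assumptions for illustration, not taken from the dataset.

```r
# Toy data: the "N/A" marker is an assumption for illustration only
toy <- data.frame(
  Game = c("A", "B", "C", "D"),
  Year = c("2014", "N/A", "2016", "2017"),
  stringsAsFactors = FALSE
)

# as.numeric() returns NA (with a warning) for strings it cannot parse
toy$Year <- suppressWarnings(as.numeric(toy$Year))

# Count missing values per column before dropping anything
na_counts <- colSums(is.na(toy))
print(na_counts)

# na.omit() drops every row that has at least one NA
clean <- na.omit(toy)
n_dropped <- nrow(toy) - nrow(clean)
print(n_dropped)
```

Checking the counts first makes it easier to tell which column is responsible for the rows that disappear.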
skim(df)
| Data summary | |
|---|---|
| Name | df |
| Number of rows | 825 |
| Number of columns | 9 |
| Column type frequency: | |
| character | 3 |
| numeric | 6 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Game | 0 | 1 | 4 | 78 | 0 | 824 | 0 |
| Genre | 0 | 1 | 3 | 16 | 0 | 17 | 0 |
| Publisher | 0 | 1 | 2 | 38 | 0 | 152 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Year | 0 | 1 | 2015.97 | 1.30 | 2013 | 2015.00 | 2016.00 | 2017.00 | 2020.00 | ▂▃▇▁▁ |
| North America | 0 | 1 | 0.26 | 0.62 | 0 | 0.00 | 0.05 | 0.19 | 6.18 | ▇▁▁▁▁ |
| Europe | 0 | 1 | 0.31 | 0.87 | 0 | 0.00 | 0.02 | 0.22 | 9.71 | ▇▁▁▁▁ |
| Japan | 0 | 1 | 0.04 | 0.12 | 0 | 0.00 | 0.00 | 0.04 | 2.17 | ▇▁▁▁▁ |
| Rest of World | 0 | 1 | 0.11 | 0.27 | 0 | 0.00 | 0.02 | 0.09 | 3.02 | ▇▁▁▁▁ |
| Global | 0 | 1 | 0.72 | 1.74 | 0 | 0.03 | 0.12 | 0.56 | 19.39 | ▇▁▁▁▁ |
Building the model
model1 = lm(Japan ~ Genre + Publisher + `North America` + Europe + `Rest of World`, df)
Visualizing and interpreting the model
par(mfrow= c(2,2))
plot(model1)
## Warning: not plotting observations with leverage one:
## 71, 116, 155, 163, 204, 225, 257, 294, 342, 367, 378, 385, 427, 468, 499, 501, 528, 549, 560, 567, 581, 596, 601, 614, 629, 641, 642, 649, 655, 672, 679, 687, 708, 709, 711, 714, 715, 718, 721, 724, 725, 730, 744, 746, 752, 757, 759, 760, 764, 765, 768, 771, 776, 778, 785, 790, 793, 796, 797, 823, 824, 825
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
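In the first plot (Residuals vs Fitted), the residuals appear to follow a roughly linear pattern. The second (Normal Q-Q) suggests they are approximately normally distributed, and the third (Scale-Location) suggests that the variance is homogeneous. These are visual readings only, and they should be treated with caution: the Shapiro-Wilk test further down rejects normality of the residuals.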
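The "leverage one" warning printed by `plot(model1)` typically appears when an observation is fitted exactly, for example when it is the only row belonging to some factor level. With 152 distinct publishers in 825 rows, singleton levels are a plausible cause here, though the output does not confirm it. The sketch below uses hypothetical toy data to show the effect with `hatvalues()`.

```r
# Toy data: publisher "C" has a single game (hypothetical values)
toy <- data.frame(
  sales     = c(1.0, 1.4, 0.8, 1.1, 2.5),
  publisher = factor(c("A", "A", "B", "B", "C"))
)

fit <- lm(sales ~ publisher, data = toy)

# Hat values (leverages); the singleton level gets leverage 1
h <- hatvalues(fit)
print(round(h, 3))

# Count observations per level to spot singletons
level_counts <- table(toy$publisher)
print(level_counts[level_counts == 1])
```

An observation with leverage one has a residual of zero by construction, so it carries no information for the diagnostic plots.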
Checking the assumptions with separate tests
### Checking normality with the Shapiro-Wilk test
shapiro.test(model1$residuals)
##
## Shapiro-Wilk normality test
##
## data: model1$residuals
## W = 0.37162, p-value < 2.2e-16
# The Shapiro-Wilk test (W = 0.37, p < 2.2e-16) rejects normality: the residuals are not normally distributed
hist(model1$residuals)
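To get a feel for how `shapiro.test()` reacts to skewed residuals, the following sketch runs it on simulated data. The exponential sample and the seed are arbitrary choices for illustration and have nothing to do with the PS4 dataset.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Simulated right-skewed "residuals" (exponential, centred at zero)
skewed <- rexp(200) - 1

# Shapiro-Wilk: W close to 1 supports normality; small W and small p reject it
sw <- shapiro.test(skewed)
print(sw$statistic)
print(sw$p.value)

# Q-Q coordinates without drawing; points that bend away from a straight
# line indicate skew or heavy tails
qq <- qqnorm(skewed, plot.it = FALSE)
print(cor(qq$x, qq$y))
```

A W statistic as low as the 0.37 reported above points to a very strong departure from normality, which is consistent with the heavily right-skewed sales columns in the `skim()` summary.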
### Checking independence of the residuals
durbinWatsonTest(model1)
## lag Autocorrelation D-W Statistic p-value
## 1 0.01790156 1.963336 0.386
## Alternative hypothesis: rho != 0
# The Durbin-Watson test (p = 0.386) gives no evidence of first-order autocorrelation, so we treat the residuals as independent
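The Durbin-Watson statistic can be computed by hand, which helps explain why values near 2 indicate little first-order autocorrelation. The sketch below uses simulated data (an assumption, not the model above) and base R only.

```r
set.seed(1)  # arbitrary seed

# Simulated regression with independent errors (illustrative only)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
e <- residuals(lm(y ~ x))

# Durbin-Watson statistic: sum of squared successive differences
# divided by the residual sum of squares
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 autocorrelation of the residuals
r1 <- sum(e[-1] * e[-length(e)]) / sum(e^2)

print(dw)
print(2 * (1 - r1))  # close to dw, since DW is approximately 2 * (1 - r1)
```

The values reported above fit this approximation: with an autocorrelation of about 0.018, 2 * (1 - 0.018) is roughly 1.96. Note that the statistic depends on the order of the rows, so it is most meaningful when that order has a natural interpretation.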
### Checking homoscedasticity
bptest(model1)
##
## studentized Breusch-Pagan test
##
## data: model1
## BP = 41.6, df = 170, p-value = 1
# The Breusch-Pagan test (p = 1) gives no evidence against constant variance,
# so we treat the residual variance as homogeneous
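As a rough illustration of what the studentized Breusch-Pagan test measures, the statistic can be reproduced as n times the R-squared of an auxiliary regression of the squared residuals on the regressors. The sketch below does this in base R on simulated data that is heteroscedastic by design. The data, seed, and variable names are assumptions for illustration.

```r
set.seed(7)  # arbitrary seed

# Simulated data whose error spread grows with x (heteroscedastic by design)
n <- 200
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 * x)

fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on the same regressor
aux <- lm(residuals(fit)^2 ~ x)

# Studentized (Koenker) statistic: n * R^2, compared with a chi-square
bp <- n * summary(aux)$r.squared
df_bp <- 1  # number of regressors excluding the intercept
p_value <- pchisq(bp, df = df_bp, lower.tail = FALSE)

print(bp)
print(p_value)
```

In the model above the test has 170 degrees of freedom because of the many dummy variables for `Genre` and `Publisher`. That can leave the test with little power, so a p-value of 1 is better read as an absence of evidence than as proof of constant variance.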