Data Analysis: An Introduction to R Tools and Techniques
Author
Compiled By: Joshua Lizardi
1 Introduction
This text is an introduction to the R programming language for basic statistics and data analysis, intended as a reference for students attending one of Trine University’s Online Graduate Programs. It covers modern tools and methods used to summarize, analyze, and interpret data. It is not meant to replace a statistics textbook, nor a textbook on R programming. Where possible, it avoids the complex mathematical theory underpinning many of the methods displayed and focuses on correctly applying each method and correctly interpreting the results. The reader should approach this text the way one approaches learning to drive: when learning to drive a car, only the rules of the road and how to operate the vehicle are necessary. There is no need to understand how the engine runs, or how to change the oil or a tire (that knowledge comes later). In this text the rules of the road are statistical concepts, and the car is R.
Hence, this text focuses on showing HOW to use basic data analysis techniques. Rarely will it go into WHY techniques are used, though, when necessary, I will try to explain WHAT is going on and point to resources for further understanding.
The text discusses descriptive analytics, basic data visualization, graphing, and clustering (including cluster plots), followed by a section on reading and understanding null hypothesis testing and p-values. It then covers some commonly used statistical tests and closes with a section on predictive modeling. Each section uses the same case study dataset. The text also briefly covers a few topics concerning data cleaning/transforming/munging/wrangling.
The text’s main focus is using R to create and present things like summary statistics, contingency tables, pie charts, bar graphs, histograms, box plots, dendrograms and hierarchical clustering, K-means clustering and clustering plots, t-tests, Chi-squared tests, ANOVA, and regression analysis.
The case study is a contrived example and presentation of a practical application of modern data science tools and techniques on a small dataset of 13 variables and 50 observations. This case study is for educational purposes and may be used, remixed, and shared freely without limitation.
When I was a student, statistics was done in spiral notebooks with large textbooks that had never-ending statistical tables printed in the back. Now there is R. R has simple, easy-to-use functions and packages for everything from densities and distributions to linear regression and neural networks; if it is statistics, it can be done in R (Navarro, n.d.).
R is freely distributed online and can be downloaded from: http://cran.r-project.org/ At the top of the page, under the heading “Download and Install R”, there are download links for Windows, Mac, and Linux users. After you have installed R, download and install RStudio from: http://rstudio.org. RStudio provides a convenient way to work with R on all platforms and is also freely available (Muller).
The R console, the lower left window in RStudio, is where you give R commands. It is the same way you would interact with R on the command line or in a terminal. In other words, the “Console” tab in the lower left window is the only part of RStudio that is actually R itself; everything else is extra/optional (Analysis of Microbiome Data in R, n.d.).
You can enter text directly into the R console next to the prompt, >, and hit Enter to run the code. For the purposes of this text we will make use of a simple text file called an R script. To start, create an R script by choosing the New File icon at the top left, then choose R Script. The R script will open in the upper left window and is basically a plain text editor (think Notepad). You can have multiple R scripts open at once, and they appear in tabs (Analysis of Microbiome Data in R, n.d.).
To follow along with the analysis presented in this text, copy and paste the R code presented in each section into a single R script. To run a script or a selection of code from the R script, put the cursor on the line of code or highlight the code selection and then click the Run button at the top of the file window, or just press Ctrl+Enter.
The “Environment” tab in the top right window lists the variables and functions present in the current R session. You can view your dataset in a tabbed window after importing it by clicking its entry in the Environment tab (Analysis of Microbiome Data in R, n.d.).
2.0.0.1 R: Installing packages and libraries
R’s flexibility, functionality, and ease of use come in the form of packages and libraries. R packages are free libraries of code written by R’s open source developer community. There are on the order of tens of thousands of R packages (Quick List of Useful R Packages, 2023). This text will make use of at least 10.
To install the R packages and libraries necessary to follow along with this text, open an RStudio session and paste the following into the console.
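The package list below is inferred from the functions used later in this text, so treat it as a starting point rather than a definitive list; install once per machine, then load each package with library() as needed.
Code
install.packages(c(
  "dplyr",        # data wrangling: group_by(), summarise(), the %>% pipe
  "knitr",        # kable() tables
  "summarytools", # dfSummary()
  "lessR",        # PieChart(), Histogram(), Plot()
  "psych",        # pairs.panels() correlation panels
  "corrplot",     # corrplot() correlation matrices
  "cluster",      # daisy(), pam() for clustering
  "factoextra",   # fviz_cluster() cluster plots
  "mosaic",       # teaching-oriented helpers used alongside ggformula
  "ggformula",    # gf_histogram() and other formula-based graphics
  "randtests",    # runs.test()
  "outliers",     # grubbs.test()
  "car",          # leveneTest(), vif()
  "rstatix",      # tidy statistical tests
  "gvlma",        # global validation of linear model assumptions
  "forecast"      # time series models
))

# Load a package into the current session, for example:
library(dplyr)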
Amazon Web Services defines structured data as “…data that has a standardized format for efficient access by software and humans alike” (What Is Structured Data?, n.d.). Structured data is typically presented in a table with rows and columns.
This text will deal with structured data only. That is, data presented in rows and columns. Columns will correspond to variables, and rows to observations.
The basic data type taxonomy for this text is Numerical and Categorical:
Numerical: Continuous, Discrete
Categorical: Nominal, Ordinal, Binary
Data type determines what kinds of graphs, tables, and tests are appropriate (Types of Data in Statistics, n.d.).
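For instance, in R numerical variables are stored as numeric or integer columns and categorical variables as factors. A minimal sketch, assuming the case study dataset has already been read into a data frame named data, as it is later in the text:
Code
# Numerical (continuous) variable: stored as numeric
class(data$Unit.Price)

# Categorical (nominal) variable: best stored as a factor
data$Platform = factor(data$Platform)
levels(data$Platform)

# Categorical (binary) variable: a two-level factor or character column
table(data$IGP)

# str() reports the type of every column at once
str(data)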
2.2 Video Game Cost & Sales Case Study
Dataset Description: The data is composed of games created by a video game publisher over 15 years. The dataset, aggregated sales data, includes 50 observations (video games) and 13 features, including marketing spend, research and development spend, etc. The data contains no missing values. The Cost, Sales, and Profit features are aggregates of other variables in the data:
Sales = Unit.Price*Units.Sold
Cost = R.D.Spend+Administration+Marketing.Spend
Profit = Sales - Cost
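A minimal sketch of these aggregates in R, assuming the dataset has already been read into a data frame named data with the column names used throughout this text:
Code
# Recompute the aggregate columns from their component variables
data$Sales  = data$Unit.Price * data$units.sold
data$cost   = data$R.D.Spend + data$Administration + data$Marketing.Spend
data$Profit = data$Sales - data$cost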
2.2.0.1 Data collection methods:
All games were divided into groups based on game type. Then games that had many expansions or parts/chapters, or many related releases, such as re-themed/re-skinned games, were either removed or averaged out to create a representative.
For games that released on multiple platforms, the best-performing platform version was kept and the rest removed. The games in the dataset were then selected randomly from the groups using a cluster sampling algorithm.
The Game ID column contains the unique identifier for each game. The ID number contains coded information about each game; for example, multi-part games of the same series will have similar IDs, the same game on different platforms will differ by only one digit in its ID, etc.
The data is presented in the .csv format, and can be downloaded here
Before you re-code a categorical variable, you first need to figure out which numerical values correspond to each of its categories in the data set (Recoding and Labeling Variables, n.d.). An example is sketched below.
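A minimal sketch of one common way to re-code a categorical variable with numeric labels; the Platform.coded column name and the level ordering are illustrative choices, not part of the case study:
Code
# Inspect the categories and their counts first
table(data$Platform)

# Re-code the categories to numeric labels via a factor
# (Platform.coded is an illustrative new column name)
data$Platform.coded = as.numeric(factor(data$Platform,
                                        levels = c("Nintendo", "PC",
                                                   "PlayStation", "XBOX")))

# Check which number ended up matched to which category
table(data$Platform, data$Platform.coded)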
2.3.2 Scaling numerical variables:
For data that has a mean and standard deviation, standardization is done by subtracting the mean from each observation and then dividing each observation by the standard deviation. The standardized variable always has mean 0 and standard deviation 1 (Standardized Linear Regression, n.d.).
For example, if profits have a mean and standard deviation of $480,000 and $160,000, respectively, then a game making $560,000 in profit has a standardized profit of (560,000 - 480,000)/160,000 = 1/2, because the profit is one-half of a standard deviation above the mean profit. The advantage of standardizing is that it facilitates the comparison of values that have different units of measurement. For example, compare the summary statistics for profits and units sold before and after standardizing.
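A minimal sketch of the same calculation in R, using the built-in scale() function, which performs exactly this subtract-the-mean, divide-by-the-standard-deviation operation; the Profit.z column name is an illustrative addition:
Code
# Standardize by hand: (value - mean) / sd
(560000 - 480000) / 160000        # 0.5, i.e. half a standard deviation above the mean

# Standardize a whole column with scale()
data$Profit.z = scale(data$Profit)

# Compare summaries before and after: the standardized version has mean 0, sd 1
summary(data$Profit);   sd(data$Profit)
summary(data$Profit.z); sd(data$Profit.z)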
2.3.3 Binning Variables:
Binning takes a variable and breaks it into a small number of ranges; this is also called grouping or discretizing. Note that a version of this is done to create the histograms for continuous variables. The basic idea is to split the data into groups based on some criteria (Dividing a Continuous Variable into Categories, n.d.; Comprehensive Guide to Grouping and Aggregating with Pandas - Practical Business Python, n.d.).
2.3.4 Aggregation of Binned Data:
An aggregation function takes multiple individual values and returns a summary. In the majority of cases this summary is a single value (Integrate.io, n.d.). The most common aggregation functions are the average or the sum of the values. Aggregation is typically used in conjunction with grouping (Zaitsev, 2017). Once the data is binned into groups, we can apply an aggregation to each group independently, after which we can combine the results into a variable to use for both descriptive visualizations and inferential analysis (Comprehensive Guide to Grouping and Aggregating with Pandas - Practical Business Python, n.d.).
Code
# numerical data only subset, for box plots
keeps = c("R.D.Spend", "Year.Created", "Marketing.Spend", "Administration",
          "cost", "Profit", "Sales", "units.sold", "IGN.Rating", "Unit.Price")
numericdata = data[, (names(data) %in% keeps)]
index = 1:ncol(numericdata)
numericdata[, index] = lapply(numericdata[, index], as.numeric)

# coding categorical variables with numeric labels and transposing (rows to columns)
# for clustering
index = 1:ncol(x)
x[, index] = lapply(x[, index], as.numeric)
str(x)

dfx = x
x = as.data.frame(t(x))

# Bin continuous variables into high / mid / low groups
data$Unit.Pricebin = cut(data$Unit.Price, breaks = c(7, 50, 70, 200),
                         labels = c("low price", "mid price", "high price"))
data$units.soldbin = cut(data$units.sold, breaks = c(0, 9000, 10000, 15000),
                         labels = c("low selling", "mid selling", "high selling"))

# grouping by and aggregating variables
GrpBysumdata = data %>% group_by(Year.Created) %>% summarise(units.sold = sum(units.sold))
colnames(GrpBysumdata)[2] = "TotalSold"
GrpBydatamean3 = data %>% group_by(Platform) %>% summarise(Unit.Price = mean(Unit.Price))
colnames(GrpBydatamean3)[2] = "AVERAGEPRICE"
GrpBydatamean3 = as.data.frame(GrpBydatamean3)
GrpBydata1 = data %>% group_by(IGP) %>% summarise(units.sold = sum(units.sold))
colnames(GrpBydata1)[2] = "TotalSold"
GrpBydata2 = data %>% group_by(IGP) %>% summarise(Sales = sum(Sales))
We start the study with a basic overview and application of descriptive analysis.
3 Descriptive Analysis
According to Harvard Business School, descriptive analytics comprises the processes and methods used on data to identify trends and find relationships. It is the simplest form of data analysis (What Is Descriptive Analytics?, 2021).
Descriptive statistics, for the purpose of this text, are numbers used to summarize and describe data; they are also referred to as summary statistics. In practice we present several descriptive statistics at once to help give as full a picture of the data as possible. Keep in mind that descriptive statistics are just that: descriptive. They do not involve “generalizing beyond the data at hand”, and they cannot help you come to conclusions or make predictions based on your data. Generalizing is the business of inferential methods, which we will see in a later section (“What Are Descriptive Statistics?,” 2021).
Descriptive statistics are presented in graphs and tables. Apart from the above-described univariate descriptive statistics (one variable only), there are also bivariate and multivariate descriptive statistics, which describe a relation between two or more variables. These could include, among others, scatter plots, cross-tabulations, clustering analysis, and other multi-dimensional graphical presentations. These are not plainly descriptive statistics anymore, since the true aim of such analysis is to provide inductive and exploratory insights (“What Are Descriptive Statistics?,” 2021).
Descriptive statistics summarize the data at hand and can present data using graphs. Exploratory analysis helps you discover correlations and relationships among variables in the dataset using graphs and tables. So what really is the difference? (Monica, 2020).
For the purposes of this text we will use the umbrella term Descriptive Analytics to refer to techniques (and attitudes) from both descriptive statistics and exploratory analysis, and we will use these terms interchangeably, in a colloquial fashion.
Descriptive analytics are reported for the entire dataset (each variable separately and all of them at once), and for subgroups predefined by domain experts or hinted at during the exploratory/descriptive phase; for example, by grouping games according to their Platform and/or IGN rating, because a manager thinks it could be important, or because an exploratory table or graph hints at a meaningful association. Below, a sample of the dataset with R code is presented.
3.1 Full Dataset Univariate Descriptive Analytics
Code
summary(data)
R.D.Spend Administration Marketing.Spend Profit
Min. : 0 Min. : 51283 Min. : 0 Min. : 6431
1st Qu.: 39936 1st Qu.:103731 1st Qu.:129300 1st Qu.: 53259
Median : 73051 Median :122700 Median :212716 Median :123546
Mean : 73722 Mean :121345 Mean :211025 Mean :107584
3rd Qu.:101603 3rd Qu.:144842 3rd Qu.:299469 3rd Qu.:149041
Max. :165349 Max. :182646 Max. :471784 Max. :226888
Platform Game.Type IGN.Rating Year.Created
Length:50 Length:50 Min. : 3.00 Min. :2007
Class :character Class :character 1st Qu.: 4.00 1st Qu.:2009
Mode :character Mode :character Median : 5.00 Median :2014
Mean : 5.52 Mean :2014
3rd Qu.: 6.00 3rd Qu.:2020
Max. :10.00 Max. :2022
cost Sales Unit.Price units.sold
Min. : 52285 Min. : 66335 Min. : 7.14 Min. : 6000
1st Qu.:293422 1st Qu.:387044 1st Qu.: 38.08 1st Qu.: 7826
Median :411889 Median :520625 Median : 52.98 Median : 8986
Mean :406091 Mean :513675 Mean : 58.44 Mean : 9273
3rd Qu.:516943 3rd Qu.:617329 3rd Qu.: 76.20 3rd Qu.:10578
Max. :774031 Max. :956208 Max. :157.22 Max. :12788
IGP gameids Unit.Pricebin units.soldbin
Length:50 284201892897.14 : 1 low price :22 low selling :25
Class :character 1142020891743.26: 1 mid price :12 mid selling : 7
Mode :character 1352018779751.58: 1 high price:16 high selling:18
1572012980480.36: 1
1662020876984.54: 1
1832020941634.5 : 1
(Other) :44
Code
knitr::kable(data)
R.D.Spend | Administration | Marketing.Spend | Profit | Platform | Game.Type | IGN.Rating | Year.Created | cost | Sales | Unit.Price | units.sold | IGP | gameids | Unit.Pricebin | units.soldbin
66051.52 | 182645.6 | 118148.20 | 226888.39 | XBOX | Action-adventure. | 10 | 2022 | 366845.3 | 593733.7 | 76.18 | 7794 | No | 41102022779476.1 | high price | low selling
100671.96 | 91790.6 | 249744.55 | 182937.09 | XBOX | Multiplayer online battle arena (MOBA) | 10 | 2021 | 442207.1 | 625144.2 | 71.65 | 8725 | Yes | 42102021872571.6 | high price | low selling
165349.20 | 136897.8 | 471784.10 | 182177.09 | PC | Role-playing (RPG, ARPG, and More) | 9 | 2020 | 774031.1 | 956208.2 | 157.22 | 6082 | No | 26920206082157.2 | high price | low selling
91992.39 | 135495.1 | 252664.93 | 166198.07 | PC | Racing. | 9 | 2019 | 480152.4 | 646350.5 | 103.91 | 6220 | Yes | 24920196220103.9 | high price | low selling
142107.34 | 91391.8 | 366168.42 | 124287.50 | PC | Real-time strategy (RTS) | 9 | 2018 | 599667.5 | 723955.0 | 87.97 | 8230 | No | 2592018823087.97 | high price | low selling
123334.88 | 108679.2 | 304981.62 | 191026.85 | PC | Multiplayer online battle arena (MOBA) | 8 | 2015 | 536995.7 | 728022.5 | 73.32 | 9930 | Yes | 2282015993073.32 | high price | mid selling
144372.41 | 118671.9 | 383199.62 | 193686.68 | PlayStation | Role-playing (RPG, ARPG, and More) | 8 | 2017 | 646243.9 | 839930.6 | 66.71 | 12590 | No | 36820171259066.7 | mid price | high selling
78389.47 | 153773.4 | 299737.29 | 112538.52 | PC | Shooters (FPS and TPS) | 8 | 2013 | 531900.2 | 644438.7 | 50.39 | 12788 | No | 28820131278850.3 | mid price | high selling
134615.46 | 147198.9 | 127716.82 | 191808.13 | Nintendo | Racing. | 8 | 2016 | 409531.2 | 601339.3 | 48.24 | 12465 | Yes | 14820161246548.2 | low price | high selling
15505.73 | 127382.3 | 35534.17 | 113608.47 | PC | Sports | 8 | 2014 | 178422.2 | 292030.7 | 27.49 | 10623 | No | 29820141062327.4 | low price | high selling
153441.51 | 101145.6 | 407934.54 | 125349.35 | Nintendo | Real-time strategy (RTS) | 7 | 2012 | 662521.6 | 787871.0 | 80.36 | 9804 | No | 1572012980480.36 | high price | mid selling
75328.87 | 144136.0 | 134050.07 | 95973.11 | Nintendo | Sports | 7 | 2011 | 353514.9 | 449488.0 | 47.68 | 9428 | Yes | 1972011942847.68 | low price | mid selling
130298.13 | 145530.1 | 323876.68 | 122804.55 | PC | Real-time strategy (RTS) | 6 | 2020 | 599704.9 | 722509.4 | 100.21 | 7210 | Yes | 25620207210100.2 | high price | low selling
78013.11 | 121597.6 | 264346.06 | 157015.76 | PC | Multiplayer online battle arena (MOBA) | 6 | 2010 | 463956.7 | 620972.5 | 88.06 | 7052 | Yes | 2262010705288.06 | high price | low selling
131876.90 | 99814.7 | 362861.36 | 146790.23 | Nintendo | Role-playing (RPG, ARPG, and More) | 6 | 2020 | 594553.0 | 741343.2 | 84.54 | 8769 | No | 1662020876984.54 | high price | low selling
76253.86 | 113867.3 | 298664.47 | 156084.42 | PC | Multiplayer online battle arena (MOBA) | 6 | 2009 | 488785.6 | 644870.0 | 76.21 | 8462 | Yes | 2262009846276.21 | high price | low selling
46426.07 | 157693.9 | 210797.67 | 149791.50 | PC | Multiplayer online battle arena (MOBA) | 6 | 2020 | 414917.7 | 564709.2 | 65.59 | 8610 | Yes | 2262020861065.59 | mid price | low selling
64664.71 | 139553.2 | 137962.62 | 155748.03 | Nintendo | Racing. | 6 | 2008 | 342180.5 | 497928.5 | 46.87 | 10624 | Yes | 14620081062446.8 | low price | high selling
63408.86 | 129219.6 | 46085.25 | 152201.97 | PC | Racing. | 6 | 2007 | 238713.7 | 390915.7 | 34.69 | 11269 | Yes | 24620071126934.6 | low price | high selling
94657.16 | 145077.6 | 282574.31 | 53310.08 | PC | Sports | 5 | 2009 | 522309.0 | 575619.1 | 90.83 | 6337 | No | 2952009633790.83 | high price | low selling
73994.56 | 122782.8 | 303319.26 | 28937.44 | PlayStation | Sports | 5 | 2007 | 500096.6 | 529034.0 | 88.17 | 6000 | Yes | 3952007600088.17 | high price | low selling
44069.95 | 51283.1 | 197029.42 | 146098.18 | XBOX | Multiplayer online battle arena (MOBA) | 5 | 2007 | 292382.5 | 438480.7 | 61.82 | 7093 | No | 4252007709361.82 | mid price | low selling
28754.33 | 118546.1 | 172795.67 | 141560.25 | PC | Multiplayer online battle arena (MOBA) | 5 | 2018 | 320096.0 | 461656.3 | 59.89 | 7708 | Yes | 2252018770859.89 | mid price | low selling
72107.60 | 127864.6 | 353183.81 | 53241.58 | PC | Sports | 5 | 2008 | 553156.0 | 606397.5 | 58.06 | 10445 | No | 29520081044558 | mid price | high selling
61136.38 | 152701.9 | 88218.23 | 138141.05 | PC | Shooters (FPS and TPS) | 5 | 2020 | 302056.5 | 440197.6 | 57.42 | 7666 | No | 2852020766657.42 | mid price | low selling
65605.48 | 153032.1 | 107138.38 | 132168.98 | PC | Shooters (FPS and TPS) | 5 | 2009 | 325775.9 | 457944.9 | 56.97 | 8038 | Yes | 2852009803856.97 | mid price | low selling
23640.93 | 96189.6 | 148001.11 | 134327.19 | Nintendo | Puzzlers and party games. | 5 | 2018 | 267831.7 | 402158.9 | 51.58 | 7797 | Yes | 1352018779751.58 | mid price | low selling
38558.51 | 82982.1 | 174999.30 | 144269.99 | XBOX | Multiplayer online battle arena (MOBA) | 5 | 2012 | 296539.9 | 440809.9 | 42.54 | 10363 | No | 42520121036342.5 | low price | high selling
91749.16 | 114175.8 | 294919.57 | 24081.67 | PlayStation | Action-adventure. | 5 | 2009 | 500844.5 | 524926.2 | 41.37 | 12690 | Yes | 31520091269041.3 | low price | high selling
61994.48 | 115641.3 | 91131.24 | 33864.79 | PlayStation | Sports | 5 | 2007 | 268767.0 | 302631.8 | 38.25 | 7912 | No | 3952007791238.25 | low price | low selling
77044.01 | 99281.3 | 140574.81 | 19261.80 | PC | Shooters (FPS and TPS) | 5 | 2011 | 316900.2 | 336162.0 | 37.13 | 9054 | Yes | 2852011905437.13 | low price | mid selling
46014.02 | 85047.4 | 205517.64 | 20769.08 | PC | Shooters (FPS and TPS) | 5 | 2020 | 336579.1 | 357348.2 | 35.89 | 9958 | Yes | 2852020995835.89 | low price | mid selling
20229.59 | 65947.9 | 185265.10 | 22495.25 | PC | Shooters (FPS and TPS) | 5 | 2016 | 271442.6 | 293937.9 | 33.42 | 8795 | No | 2852016879533.42 | low price | low selling
22177.74 | 154806.1 | 28334.72 | 132615.72 | PlayStation | Puzzlers and party games. | 5 | 2016 | 205318.6 | 337934.3 | 32.83 | 10293 | Yes | 33520161029332.8 | low price | high selling
0.00 | 135426.9 | 0.00 | 129904.93 | XBOX | Puzzlers and party games. | 5 | 2020 | 135426.9 | 265331.8 | 26.19 | 10131 | Yes | 43520201013126.1 | low price | high selling
0.00 | 116983.8 | 45173.06 | 126989.88 | PC | Sandbox. | 5 | 2009 | 162156.9 | 289146.7 | 23.46 | 12324 | Yes | 27520091232423.4 | low price | high selling
120542.52 | 148719.0 | 311613.29 | 10500.38 | PC | Racing. | 4 | 2007 | 580874.8 | 591375.1 | 76.26 | 7755 | No | 2442007775576.26 | high price | low selling
114523.61 | 122616.8 | 261776.23 | 18457.77 | PC | Shooters (FPS and TPS) | 4 | 2020 | 498916.7 | 517374.4 | 46.14 | 11213 | No | 28420201121346.1 | low price | high selling
101913.08 | 110594.1 | 229160.95 | 118468.25 | PC | Action-adventure. | 4 | 2009 | 441668.1 | 560136.4 | 45.95 | 12189 | Yes | 21420091218945.9 | low price | high selling
55493.95 | 103057.5 | 214634.81 | 12566.79 | Nintendo | Action-adventure. | 4 | 2020 | 373186.2 | 385753.0 | 43.26 | 8917 | Yes | 1142020891743.26 | low price | low selling
27892.92 | 84710.8 | 164470.71 | 11248.54 | PC | Action-adventure. | 4 | 2011 | 277074.4 | 288322.9 | 34.77 | 8292 | No | 2142011829234.77 | low price | low selling
542.05 | 51743.2 | 0.00 | 14049.35 | PC | Shooters (FPS and TPS) | 4 | 2018 | 52285.2 | 66334.5 | 7.14 | 9289 | No | 284201892897.14 | low price | mid selling
162597.70 | 151377.6 | 443898.53 | 192276.44 | PC | Sports | 3 | 2020 | 757873.8 | 950150.3 | 108.79 | 8734 | Yes | 29320208734108.7 | high price | low selling
67532.53 | 105751.0 | 304768.73 | 6430.77 | PC | Action-adventure. | 3 | 2007 | 478052.3 | 484483.1 | 77.16 | 6279 | Yes | 2132007627977.16 | high price | low selling
93863.75 | 127320.4 | 249839.44 | 111327.32 | XBOX | Sports | 3 | 2009 | 471023.6 | 582350.9 | 53.59 | 10866 | No | 49320091086653.5 | mid price | high selling
119943.24 | 156547.4 | 256512.92 | 69234.42 | Nintendo | Action-adventure. | 3 | 2007 | 533003.6 | 602238.0 | 53.59 | 11238 | No | 11320071123853.5 | mid price | high selling
1315.46 | 115816.2 | 297114.46 | 109630.39 | PlayStation | Sports | 3 | 2016 | 414246.1 | 523876.5 | 52.36 | 10006 | No | 39320161000652.3 | mid price | high selling
28663.76 | 127056.2 | 201126.82 | 79153.00 | XBOX | Action-adventure. | 3 | 2014 | 356846.8 | 435999.8 | 38.02 | 11467 | Yes | 41320141146738 | low price | high selling
86419.70 | 153514.1 | 0.00 | 84919.88 | Nintendo | Shooters (FPS and TPS) | 3 | 2020 | 239933.8 | 324853.7 | 34.50 | 9416 | No | 1832020941634.5 | low price | mid selling
1000.23 | 124153.0 | 1903.93 | 111977.20 | PC | Shooters (FPS and TPS) | 3 | 2009 | 127057.2 | 239034.4 | 27.42 | 8719 | Yes | 2832009871927.42 | low price | low selling
Code
print(dfSummary(data), method ='render')
Data Frame Summary: data
Dimensions: 50 x 16
Duplicates: 0

No | Variable | Stats / Values | Freqs (% of Valid) | Valid | Missing
1 | R.D.Spend [numeric] | Mean (sd): 73721.6 (45902.3); min ≤ med ≤ max: 0 ≤ 73051.1 ≤ 165349; IQR (CV): 61666.4 (0.6) | 49 distinct values | 50 (100.0%) | 0 (0.0%)
2 | Administration [numeric] | Mean (sd): 121345 (28017.8); min ≤ med ≤ max: 51283.1 ≤ 122700 ≤ 182646; IQR (CV): 41111.3 (0.2) | 50 distinct values | 50 (100.0%) | 0 (0.0%)
3 | Marketing.Spend [numeric] | Mean (sd): 211025 (122290); min ≤ med ≤ max: 0 ≤ 212716 ≤ 471784; IQR (CV): 170169 (0.6) | 48 distinct values | 50 (100.0%) | 0 (0.0%)
4 | Profit [numeric] | Mean (sd): 107584 (61181.4); min ≤ med ≤ max: 6430.8 ≤ 123546 ≤ 226888; IQR (CV): 95782.5 (0.6) | 50 distinct values | 50 (100.0%) | 0 (0.0%)
5 | Platform [character] | 1. Nintendo, 2. PC, 3. PlayStation, 4. XBOX | 9 (18.0%), 28 (56.0%), 6 (12.0%), 7 (14.0%) | 50 (100.0%) | 0 (0.0%)
6 | Game.Type [character] | 1. Action-adventure., 2. Multiplayer online battle arena (MOBA), 3. Puzzlers and party games., 4. Racing., 5. Real-time strategy (RTS), 6. Role-playing (RPG, ARPG, and More), 7. Sandbox., 8. Shooters (FPS and TPS), 9. Sports | 8 (16.0%), 8 (16.0%), 3 (6.0%), 5 (10.0%), 3 (6.0%), 3 (6.0%), 1 (2.0%), 10 (20.0%), 9 (18.0%) | 50 (100.0%) | 0 (0.0%)
7 | IGN.Rating [integer] | Mean (sd): 5.5 (1.9); min ≤ med ≤ max: 3 ≤ 5 ≤ 10; IQR (CV): 2 (0.3) | 3: 8 (16.0%), 4: 6 (12.0%), 5: 17 (34.0%), 6: 7 (14.0%), 7: 2 (4.0%), 8: 5 (10.0%), 9: 3 (6.0%), 10: 2 (4.0%) | 50 (100.0%) | 0 (0.0%)
8 | Year.Created [integer] | Mean (sd): 2013.9 (5.1); min ≤ med ≤ max: 2007 ≤ 2014 ≤ 2022; IQR (CV): 10.8 (0) | 16 distinct values | 50 (100.0%) | 0 (0.0%)
9 | cost [numeric] | Mean (sd): 406091 (162419); min ≤ med ≤ max: 52285.2 ≤ 411889 ≤ 774031; IQR (CV): 223521 (0.4) | 50 distinct values | 50 (100.0%) | 0 (0.0%)
10 | Sales [numeric] | Mean (sd): 513675 (184056); min ≤ med ≤ max: 66334.5 ≤ 520626 ≤ 956208; IQR (CV): 230285 (0.4) | 50 distinct values | 50 (100.0%) | 0 (0.0%)
11 | Unit.Price [numeric] | Mean (sd): 58.4 (27.1); min ≤ med ≤ max: 7.1 ≤ 53 ≤ 157.2; IQR (CV): 38.1 (0.5) | 49 distinct values | 50 (100.0%) | 0 (0.0%)
12 | units.sold [integer] | Mean (sd): 9273.2 (1880.4); min ≤ med ≤ max: 6000 ≤ 8985.5 ≤ 12788; IQR (CV): 2752.8 (0.2) | 50 distinct values | 50 (100.0%) | 0 (0.0%)
13 | IGP [character] | 1. No, 2. Yes | 23 (46.0%), 27 (54.0%) | 50 (100.0%) | 0 (0.0%)
14 | gameids [factor] | 1. 284201892897.14, 2. 1142020891743.26, 3. 1352018779751.58, 4. 1572012980480.36, 5. 1662020876984.54, 6. 1832020941634.5, 7. 1972011942847.68, 8. 2132007627977.16, 9. 2142011829234.77, 10. 2252018770859.89, [40 others] | 1 (2.0%) each for the ten listed IDs, 40 (80.0%) for the others | 50 (100.0%) | 0 (0.0%)
15 | Unit.Pricebin [factor] | 1. low price, 2. mid price, 3. high price | 22 (44.0%), 12 (24.0%), 16 (32.0%) | 50 (100.0%) | 0 (0.0%)
16 | units.soldbin [factor] | 1. low selling, 2. mid selling, 3. high selling | 25 (50.0%), 7 (14.0%), 18 (36.0%) | 50 (100.0%) | 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1) 2023-05-13
The output of the preceding summary provides descriptive statistics for all variables in the data set; this includes summary statistics, basic plots, and contingency tables for the categorical variables (and the time variable). Next we will look at pie charts, histograms, and box plots.
Code
PieChart(Platform)
>>> Note: Platform is not in a data frame (table)
>>> Note: Platform is not in a data frame (table)
>>> suggestions
PieChart(Platform, hole=0) # traditional pie chart
PieChart(Platform, values="%") # display %'s on the chart
PieChart(Platform) # bar chart
Plot(Platform) # bubble plot
Plot(Platform, values="count") # lollipop plot
--- Platform ---
Nintendo PC PlayStation XBOX Total
Frequencies: 9 28 6 7 50
Proportions: 0.180 0.560 0.120 0.140 1.000
Chi-squared test of null hypothesis of equal probabilities
Chisq = 26.000, df = 3, p-value = 0.000
Code
PieChart(GameType)
>>> Note: GameType is not in a data frame (table)
>>> Note: GameType is not in a data frame (table)
>>> suggestions
PieChart(GameType, hole=0) # traditional pie chart
PieChart(GameType, values="%") # display %'s on the chart
PieChart(GameType) # bar chart
Plot(GameType) # bubble plot
Plot(GameType, values="count") # lollipop plot
--- GameType ---
GameType Count Prop
---------------------------------
Action-adventure. 8 0.160
Mltplyronlnbta(MOBA) 8 0.160
Puzzlersandpartygms. 3 0.060
Racing. 5 0.100
Real-timstratgy(RTS) 3 0.060
Rl-ply(RPG,ARPG,aMr) 3 0.060
Sandbox. 1 0.020
Shooters(FPSandTPS) 10 0.200
Sports 9 0.180
---------------------------------
Total 50 1.000
Chi-squared test of null hypothesis of equal probabilities
Chisq = 15.160, df = 8, p-value = 0.056
Code
PieChart(IGP)
>>> Note: IGP is not in a data frame (table)
>>> Note: IGP is not in a data frame (table)
>>> suggestions
PieChart(IGP, hole=0) # traditional pie chart
PieChart(IGP, values="%") # display %'s on the chart
PieChart(IGP) # bar chart
Plot(IGP) # bubble plot
Plot(IGP, values="count") # lollipop plot
--- IGP ---
No Yes Total
Frequencies: 23 27 50
Proportions: 0.460 0.540 1.000
Chi-squared test of null hypothesis of equal probabilities
Chisq = 0.320, df = 1, p-value = 0.572
Code
PieChart(Rating)
>>> Note: Rating is not in a data frame (table)
>>> Note: Rating is not in a data frame (table)
>>> suggestions
PieChart(Rating, hole=0) # traditional pie chart
PieChart(Rating, values="%") # display %'s on the chart
PieChart(Rating) # bar chart
Plot(Rating) # bubble plot
Plot(Rating, values="count") # lollipop plot
--- Rating ---
3 4 5 6 7 8 9 10 Total
Frequencies: 8 6 17 7 2 5 3 2 50
Proportions: 0.160 0.120 0.340 0.140 0.040 0.100 0.060 0.040 1.000
Chi-squared test of null hypothesis of equal probabilities
Chisq = 26.800, df = 7, p-value = 0.000
Code
PieChart(Year)
>>> Note: Year is not in a data frame (table)
>>> Note: Year is not in a data frame (table)
>>> suggestions
PieChart(Year, hole=0) # traditional pie chart
PieChart(Year, values="%") # display %'s on the chart
PieChart(Year) # bar chart
Plot(Year) # bubble plot
Plot(Year, values="count") # lollipop plot
--- Year ---
Year Count Prop
------------------
2007 7 0.140
2008 2 0.040
2009 8 0.160
2010 1 0.020
2011 3 0.060
2012 2 0.040
2013 1 0.020
2014 2 0.040
2015 1 0.020
2016 4 0.080
2017 1 0.020
2018 4 0.080
2019 1 0.020
2020 11 0.220
2021 1 0.020
2022 1 0.020
------------------
Total 50 1.000
Chi-squared test of null hypothesis of equal probabilities
Chisq = 44.080, df = 15, p-value = 0.000
>>> Low cell expected frequencies, so chi-squared approximation may not be accurate
Code
# HISTOGRAMS
Histogram(R.D.Spend)
>>> Note: R.D.Spend is not in a data frame (table)
>>> Note: R.D.Spend is not in a data frame (table)
>>> Suggestions
bin_width: set the width of each bin
bin_start: set the start of the first bin
bin_end: set the end of the last bin
Histogram(R.D.Spend, density=TRUE) # smoothed curve + histogram
Plot(R.D.Spend) # Violin/Box/Scatterplot (VBS) plot
--- R.D.Spend ---
n miss mean sd min mdn max
50 0 73721.616 45902.256 0.000 73051.080 165349.200
No (Box plot) outliers
Bin Width: 20000
Number of Bins: 9
Bin Midpnt Count Prop Cumul.c Cumul.p
---------------------------------------------------------
0 > 20000 10000 6 0.12 6 0.12
20000 > 40000 30000 7 0.14 13 0.26
40000 > 60000 50000 4 0.08 17 0.34
60000 > 80000 70000 14 0.28 31 0.62
80000 > 100000 90000 5 0.10 36 0.72
100000 > 120000 110000 4 0.08 40 0.80
120000 > 140000 130000 5 0.10 45 0.90
140000 > 160000 150000 3 0.06 48 0.96
160000 > 180000 170000 2 0.04 50 1.00
Code
Histogram(Administration)
>>> Note: Administration is not in a data frame (table)
>>> Note: Administration is not in a data frame (table)
>>> Suggestions
bin_width: set the width of each bin
bin_start: set the start of the first bin
bin_end: set the end of the last bin
Histogram(Administration, density=TRUE) # smoothed curve + histogram
Plot(Administration) # Violin/Box/Scatterplot (VBS) plot
--- Administration ---
n miss mean sd min mdn max
50 0 121344.64 28017.80 51283.14 122699.79 182645.56
No (Box plot) outliers
Bin Width: 20000
Number of Bins: 8
Bin Midpnt Count Prop Cumul.c Cumul.p
---------------------------------------------------------
40000 > 60000 50000 2 0.04 2 0.04
60000 > 80000 70000 1 0.02 3 0.06
80000 > 100000 90000 8 0.16 11 0.22
100000 > 120000 110000 12 0.24 23 0.46
120000 > 140000 130000 13 0.26 36 0.72
140000 > 160000 150000 13 0.26 49 0.98
160000 > 180000 170000 0 0.00 49 0.98
180000 > 200000 190000 1 0.02 50 1.00
Code
Histogram(Profit)
>>> Note: Profit is not in a data frame (table)
>>> Note: Profit is not in a data frame (table)
>>> Suggestions
bin_width: set the width of each bin
bin_start: set the start of the first bin
bin_end: set the end of the last bin
Histogram(Profit, density=TRUE) # smoothed curve + histogram
Plot(Profit) # Violin/Box/Scatterplot (VBS) plot
--- Profit ---
n miss mean sd min mdn max
50 0 107583.881 61181.438 6430.770 123546.025 226888.390
No (Box plot) outliers
Bin Width: 50000
Number of Bins: 5
Bin Midpnt Count Prop Cumul.c Cumul.p
---------------------------------------------------------
0 > 50000 25000 12 0.24 12 0.24
50000 > 100000 75000 6 0.12 18 0.36
100000 > 150000 125000 20 0.40 38 0.76
150000 > 200000 175000 11 0.22 49 0.98
200000 > 250000 225000 1 0.02 50 1.00
Code
Histogram(Unit.Price)
>>> Note: Unit.Price is not in a data frame (table)
>>> Note: Unit.Price is not in a data frame (table)
>>> Suggestions
bin_width: set the width of each bin
bin_start: set the start of the first bin
bin_end: set the end of the last bin
Histogram(Unit.Price, density=TRUE) # smoothed curve + histogram
Plot(Unit.Price) # Violin/Box/Scatterplot (VBS) plot
--- Unit.Price ---
n miss mean sd min mdn max
50 0 58.441 27.103 7.140 52.975 157.220
--- Outliers --- from the box plot: 1
Small Large
----- -----
157.2
Bin Width: 20
Number of Bins: 8
Bin Midpnt Count Prop Cumul.c Cumul.p
---------------------------------------------------
0 > 20 10 1 0.02 1 0.02
20 > 40 30 13 0.26 14 0.28
40 > 60 50 17 0.34 31 0.62
60 > 80 70 9 0.18 40 0.80
80 > 100 90 6 0.12 46 0.92
100 > 120 110 3 0.06 49 0.98
120 > 140 130 0 0.00 49 0.98
140 > 160 150 1 0.02 50 1.00
Code
Histogram(units.sold)
>>> Note: units.sold is not in a data frame (table)
>>> Note: units.sold is not in a data frame (table)
>>> Suggestions
bin_width: set the width of each bin
bin_start: set the start of the first bin
bin_end: set the end of the last bin
Histogram(units.sold, density=TRUE) # smoothed curve + histogram
Plot(units.sold) # Violin/Box/Scatterplot (VBS) plot
--- units.sold ---
n miss mean sd min mdn max
50 0 9273.18 1880.45 6000.00 8985.50 12788.00
No (Box plot) outliers
Bin Width: 1000
Number of Bins: 7
Bin Midpnt Count Prop Cumul.c Cumul.p
-------------------------------------------------------
6000 > 7000 6500 5 0.10 5 0.10
7000 > 8000 7500 9 0.18 14 0.28
8000 > 9000 8500 11 0.22 25 0.50
9000 > 10000 9500 7 0.14 32 0.64
10000 > 11000 10500 8 0.16 40 0.80
11000 > 12000 11500 4 0.08 44 0.88
12000 > 13000 12500 6 0.12 50 1.00
Code
Histogram(cost)
>>> Note: cost is not in a data frame (table)
>>> Note: cost is not in a data frame (table)
>>> Suggestions
bin_width: set the width of each bin
bin_start: set the start of the first bin
bin_end: set the end of the last bin
Histogram(cost, density=TRUE) # smoothed curve + histogram
Plot(cost) # Violin/Box/Scatterplot (VBS) plot
--- cost ---
n miss mean sd min mdn max
50 0 406091.35 162419.01 52285.20 411888.64 774031.10
No (Box plot) outliers
Bin Width: 100000
Number of Bins: 8
Bin Midpnt Count Prop Cumul.c Cumul.p
---------------------------------------------------------
0 > 100000 50000 1 0.02 1 0.02
100000 > 200000 150000 4 0.08 5 0.10
200000 > 300000 250000 9 0.18 14 0.28
300000 > 400000 350000 10 0.20 24 0.48
400000 > 500000 450000 11 0.22 35 0.70
500000 > 600000 550000 11 0.22 46 0.92
600000 > 700000 650000 2 0.04 48 0.96
700000 > 800000 750000 2 0.04 50 1.00
Code
Histogram(Marketing.Spend)
>>> Note: Marketing.Spend is not in a data frame (table)
>>> Note: Marketing.Spend is not in a data frame (table)
>>> Suggestions
bin_width: set the width of each bin
bin_start: set the start of the first bin
bin_end: set the end of the last bin
Histogram(Marketing.Spend, density=TRUE) # smoothed curve + histogram
Plot(Marketing.Spend) # Violin/Box/Scatterplot (VBS) plot
--- Marketing.Spend ---
n miss mean sd min mdn max
50 0 211025.098 122290.311 0.000 212716.240 471784.100
No (Box plot) outliers
Bin Width: 50000
Number of Bins: 10
Bin Midpnt Count Prop Cumul.c Cumul.p
---------------------------------------------------------
0 > 50000 25000 8 0.16 8 0.16
50000 > 100000 75000 2 0.04 10 0.20
100000 > 150000 125000 7 0.14 17 0.34
150000 > 200000 175000 5 0.10 22 0.44
200000 > 250000 225000 7 0.14 29 0.58
250000 > 300000 275000 9 0.18 38 0.76
300000 > 350000 325000 5 0.10 43 0.86
350000 > 400000 375000 4 0.08 47 0.94
400000 > 450000 425000 2 0.04 49 0.98
450000 > 500000 475000 1 0.02 50 1.00
Code
Histogram(Sales)
>>> Note: Sales is not in a data frame (table)
>>> Note: Sales is not in a data frame (table)
>>> Suggestions
bin_width: set the width of each bin
bin_start: set the start of the first bin
bin_end: set the end of the last bin
Histogram(Sales, density=TRUE) # smoothed curve + histogram
Plot(Sales) # Violin/Box/Scatterplot (VBS) plot
--- Sales ---
n miss mean sd min mdn max
50 0 513675.23 184055.64 66334.55 520625.48 956208.19
No (Box plot) outliers
Bin Width: 100000
Number of Bins: 10
Bin Midpnt Count Prop Cumul.c Cumul.p
-----------------------------------------------------------
0 > 100000 50000 1 0.02 1 0.02
100000 > 200000 150000 0 0.00 1 0.02
200000 > 300000 250000 6 0.12 7 0.14
300000 > 400000 350000 7 0.14 14 0.28
400000 > 500000 450000 10 0.20 24 0.48
500000 > 600000 550000 10 0.20 34 0.68
600000 > 700000 650000 8 0.16 42 0.84
700000 > 800000 750000 5 0.10 47 0.94
800000 > 900000 850000 1 0.02 48 0.96
900000 > 1000000 950000 2 0.04 50 1.00
Code
# box plots, un-scaled and scaled
boxplot(numericdata)
Code
boxplot(scale(numericdata))
Now we can delve into exploring bivariate & multivariate relationships.
3.3 Bivariate Descriptive Tables for Categorical Variables.
We will see these tables again soon in the inferential section.
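No table-building code survives at this point in the text, so here is a minimal sketch of how bivariate tables for categorical variables are typically produced in base R; the pairing of Platform with the binned units-sold variable is an illustrative choice.
Code
# Cross-tabulate two categorical variables: counts for each combination
table(data$Platform, data$units.soldbin)

# The same table as proportions of the overall total
prop.table(table(data$Platform, data$units.soldbin))

# Add row and column totals for presentation
addmargins(table(data$Platform, data$units.soldbin))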
Code
corrplot(corMatrix, method = "number", order = "hclust", addrect = 4)
Code
# correlation plots
pairs.panels(numericdata)
The bivariate correlation analysis above does not take into account the categorical variables of the dataset (CorrMatrix Function - RDocumentation, n.d.).
3.5 Clustering
Next we will explore clustering variables into groups that share similar characteristics... or is it grouping variables into clusters that share similar characteristics? Whatever the terminology, variables that are similar to each other in some measurable way are put together. Calculations and definitions of similarity and distance for the clusters vary by method (Kaushik, 2016).
3.6 Hierarchical Clustering
Below, a hierarchical clustering algorithm that can handle mixed data is used. This will help us visualize potentially significant relationships between our numerical and non-numerical data in a multivariate manner (Kodali, 2016). Note that a hierarchical clustering algorithm was also used in the correlation analysis to group variables based on the values of their correlation coefficients.
To start most clustering analyses, the number of clusters we want to model needs to be specified beforehand. A graph of the calculated silhouette width versus the number of clusters (groups) is usually presented, and we pick a number of clusters with a high average silhouette width, typically just before the width falls suddenly (Mahendru, 2019). Note that as the number of clusters increases, the width generally decreases.
Code
# Hierarchical Clustering Analysis
# Plot silhouette width (higher is better)
d_dist = daisy(x, metric = "gower", type = list(logratio = 4))
sil_width = c(NA)
for (i in 2:10) {
  pam_fit = pam(d_dist, diss = TRUE, k = i)
  sil_width[i] = pam_fit$silinfo$avg.width
}
# plot number of clusters vs average silhouette width
plot(1:10, sil_width, xlab = "Number of clusters", ylim = c(0, 1),
     ylab = "Silhouette Width")
lines(1:10, sil_width)
A K-means clustering algorithm was also run on the dataset. This algorithm does not take non-numeric variables into account (Harris, 2021). Even so, a similar association among variables was found using both methods.
Code
kmeans = kmeans(scale(t(numericdata)), 4, nstart = 30000)
# plot the clusters
fviz_cluster(kmeans, data = scale(t(numericdata)), ellipse.type = "norm")
Too few points to calculate an ellipse
Too few points to calculate an ellipse
Too few points to calculate an ellipse
It’s always good practice when clustering variables to run different types of clustering algorithms on the same data. This helps give as full a picture as possible of probable correlations and associations among all variables (Kaushik, 2016). Notice any differences? Similarities? Where might these come from? Why do they exist? *“Calculations and definitions of similarity and distance for the clusters vary by method.”*
Different kinds of graphs, including the histograms, bar graphs, box plots, and time series plots below, also help in investigating the associations hinted at in the visualizations above.
Code
# HISTOGRAMS W/ FACETS: IGN rating vs categorical variables
gf_histogram(~IGN.Rating, data = data, binwidth = 64000, stat = "count") %>%
  gf_facet_wrap(~IGP) %>%
  gf_labs(title = "")
Code
gf_histogram(~IGN.Rating, data = data, binwidth = 64000, stat = "count") %>%
  gf_facet_wrap(~Platform) %>%
  gf_labs(title = "")
Code
gf_histogram(~IGN.Rating, data = data, binwidth = 64000, stat = "count") %>%
  gf_facet_wrap(~Year.Created) %>%
  gf_labs(title = "")
Code
gf_histogram(~IGN.Rating, data = data, binwidth = 64000, stat = "count") %>%
  gf_facet_wrap(~Game.Type) %>%
  gf_labs(title = "")
mplot is an enhanced plotting engine for R. The mplot() function consolidates daily plotting and formatting tasks into a single, easy-to-use application. Code and examples for one and two variables are below.
Hint: look over any graphs and tables made so far (or make your own!) and come up with questions based on them.
That concludes our exploratory study of the video game dataset. We continue the study with an overview and application of inferential analysis.
4 Inferential Analysis
Inferential analysis is concerned with making decisions and predictions, and with calculating estimates and intervals, based on the information contained in a set of data (Scott, 2009).
Hypothesis Tests: This section focuses on understanding the basics of hypothesis testing.
Below is the basic formal structure of hypothesis testing:
An observation has been made/ A question has been asked about the data.
The null (H0 ) and alternative (H1 or HA) hypotheses are specified.
With given data, a value of a statistic is calculated.
Under a set of general assumptions about the data, as well as assuming the null hypothesis is true, the distribution of the test statistic is known.
Given the distribution and value of the test statistic, as well as the form of the alternative hypothesis, we can calculate a p-value of the test.
Based on the p-value and a prespecified level of significance, we make one of two decisions: fail to reject the null hypothesis, or reject the null hypothesis.
Inferential analysis (statistical testing) is done for subgroups of variables from our dataset as a way to answer questions (confirm or deny hypotheses) about these subgroups that may have arisen during the exploratory phase.
All statistical tests include a set of assumptions, a null hypothesis, an alternative hypothesis, a p-value, and a significance level. This allows us to create statistically valid estimates and intervals (given that the set of general assumptions holds).
Keep in mind the point: the smaller the p-value, the stronger the evidence against the null hypothesis and in favor of the alternative hypothesis. Reject the null hypothesis when the p-value is small.
4.1 Summary of Inferential Analysis Steps
Step 1. Create a Question about the data, most likely based on the Exploratory phase.
Step 2. Turn the question into a statement that a relationship among variables does not exist, or that a difference between groups does not exist, and you have a null hypothesis.
Step 3. Choose an appropriate statistical test, verify any required test assumptions.
Step 4. Pick a significance level, which is 0.05, usually. (why?)
Step 5. Reject the null hypothesis if the calculated p-value is less than the significance level. (why).
For example, we may have initially grouped games according to their Platform and IGN rating and made graphs and tables. There could be many questions asked (hypotheses made) about this grouping; for example: Are the differences we see among game platform types significant? Does a significant relationship between platform type and IGN rating exist? Can we use a variable or combination of variables to predict units sold?
4.2 Assumption Checking for Statistical Testing
Assumption checking allows you to determine whether conclusions drawn from the results of your analysis are valid. Assumptions are the requirements you must fulfill (Moran, 2017). To stick with our car example: just as you should not drive a car until you can demonstrate working knowledge of the rules of the road, you should not conduct statistical analyses without demonstrating that your data follows the rules (and can receive a permit for testing, if you will).
So now that we have talked about how to come up with questions, create hypotheses, and check our test assumptions, let’s practice reading p-value results for a statistical test. We will start with an example of assumption checking, testing two assumptions about our data. As you will notice in the statistical test assumption lists that follow, a common assumption is that observations are independent and random.
4.2.1 Ljung-Box Test for independence
The Ljung-Box test (sometimes called the portmanteau test) is used to test whether or not observations over time are independent (10.3 - Regression with Autoregressive Errors | STAT 462, n.d.).
Ljung-Box test assumptions: this procedure requires certain assumptions on the data which we will not discuss; see (10.3 - Regression with Autoregressive Errors | STAT 462, n.d.).
H0: The observations are independent in time.
HA: The observations are not independent in time.
4.2.2 Run Test of Randomness
The runs test of randomness, sometimes called the Geary test, is a nonparametric test commonly used to test data for randomness (Lani, 2009).
Runs Test for randomness assumptions:
Assumption #1: Independence of observations.
H0: the sequence of observations is random
Ha: the sequence of observations is not random (there exists some pattern)
Code
x$gameids = data$gameids
index = 1:ncol(x)
x[, index] = lapply(x[, index], as.numeric)

## H0: The symbols occur in random order; reject if p-value < .05
randtests::runs.test(x$gameids)
Runs Test
data: x$gameids
statistic = -1.143, runs = 22, n1 = 25, n2 = 25, n = 50, p-value =
0.253
alternative hypothesis: nonrandomness
Code
# Null Hypothesis (H0): autocorrelation is not present
# If the p-value < .05, reject
Box.test(x$gameids)
What should we conclude from the testing? What are the possible issues our data could face in testing? More on these issues at the end of the analysis.
4.3 Chi Squared Testing
R’s built-in chi-squared test, chisq.test, compares the proportion of counts in each category with the expected proportions. By default, the expected frequencies in each category are assumed to be equal (Team, 2018). The goodness-of-fit test is the test you would use to check whether the differences you observe in visuals like pie charts or bar graphs are statistically significant. The test of association is used to test for relationships between two categorical variables.
4.3.0.1 Chi2 Goodness of Fit Test assumptions:
Assumption #1: One categorical variable.
Assumption #2: Independence of random observations.
Assumption #3: The groups of the categorical variable must be mutually exclusive.
Assumption #4: There must be at least 5 expected frequencies in each level of the variable.
Test Hypothesis:
H0: All category groups have Equal Probabilities.
H1: At least one of the categories has a Probability unequal to other categories.
Code
# PieChart() makes a graph along with chisq.test for counts.
# Note the violations of assumption #4 for 2 of the tests.
# Remedial Measures:
chisq.test(counts(data$Platform), simulate.p.value = TRUE)
Chi-squared test for given probabilities with simulated p-value (based
on 2000 replicates)
data: counts(data$Platform)
X-squared = 26, df = NA, p-value = 0.0005
Chi-squared test for given probabilities with simulated p-value (based
on 2000 replicates)
data: counts(data$IGP)
X-squared = 0.32, df = NA, p-value = 0.679
Chi-squared test for given probabilities with simulated p-value (based
on 2000 replicates)
data: counts(data$IGN.Rating)
X-squared = 26.8, df = NA, p-value = 0.0015
Chi-squared test for given probabilities with simulated p-value (based
on 2000 replicates)
data: counts(data$Game.Type)
X-squared = 15.16, df = NA, p-value = 0.059
4.4 Remedial Measures: Chi2 Testing
We take our first look at how to make corrections to the data, or to the test, when the test assumptions are not valid for the data. In this case we use a different method to obtain a valid p-value.
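The construction of the cross-tabulations tested below is not shown in the text; here is a minimal sketch of how such tables are typically built with table() before calling chisq.test() with a simulated p-value. The object names match the output that follows, but the construction itself is an assumption.
Code
# Assumed construction of the cross-tabulations tested below
Platform.IGNRating = table(data$Platform, data$IGN.Rating)
Platform.IGP       = table(data$Platform, data$IGP)

# Test of association with a simulated (Monte Carlo) p-value instead of the
# chi-squared approximation, which is unreliable when expected cell counts are small
chisq.test(Platform.IGNRating, simulate.p.value = TRUE)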
Pearson's Chi-squared test with simulated p-value (based on 2000
replicates)
data: Platform.IGNRating
X-squared = 34.52, df = NA, p-value = 0.0295
Code
chisq.test(Platform.IGP,simulate.p.value =TRUE)
Pearson's Chi-squared test with simulated p-value (based on 2000
replicates)
data: Platform.IGP
X-squared = 0.5087, df = NA, p-value = 0.977
4.5 t-Test & ANOVA
We continue our study of inference with a comparison of the average amounts spent on developing games by different departments, which are unrelated/independent groups.
4.5.0.1 Unpaired t-Testing
Assumption #1: Two continuous variables.
Assumption #2: Independence of random observations.
Assumption #3: Both variables are approximately normally distributed.
Assumption #4: Both variables have approximately the same variance.
Assumption #5: No significant outliers.
Hypothesis
H0: The difference between variable means is zero
H1: The difference between variable means is not zero
The following object is masked from 'package:psych':
outlier
Code
library(rstatix)
Attaching package: 'rstatix'
The following object is masked from 'package:MASS':
select
The following objects are masked from 'package:mosaic':
cor_test, prop_test, t_test
The following object is masked from 'package:stats':
filter
Code
library(car)
grubbs.test(data$Administration, type = 10, opposite = FALSE, two.sided = FALSE)
Grubbs test for one outlier
data: data$Administration
G = 2.5006, U = 0.8698, p-value = 0.251
alternative hypothesis: lowest value 51283.14 is an outlier
Code
grubbs.test(data$Marketing.Spend, type =10, opposite =FALSE, two.sided =FALSE)
Grubbs test for one outlier
data: data$Marketing.Spend
G = 2.1323, U = 0.9053, p-value = 0.743
alternative hypothesis: highest value 471784.1 is an outlier
Code
grubbs.test(data$R.D.Spend, type =10, opposite =FALSE, two.sided =FALSE)
Grubbs test for one outlier
data: data$R.D.Spend
G = 1.996, U = 0.917, p-value = 1
alternative hypothesis: highest value 165349.2 is an outlier
Code
shapiro.test(data$Administration)
Shapiro-Wilk normality test
data: data$Administration
W = 0.9702, p-value = 0.237
Code
shapiro.test(data$Marketing.Spend)
Shapiro-Wilk normality test
data: data$Marketing.Spend
W = 0.9744, p-value = 0.345
Code
shapiro.test(data$R.D.Spend)
Shapiro-Wilk normality test
data: data$R.D.Spend
W = 0.9673, p-value = 0.18
Code
leveneTest(data$Administration,data$Marketing.Spend, center = mean)
Levene's Test for Homogeneity of Variance (center = mean)
Df F value Pr(>F)
group 47 0.255 0.974
2
Code
leveneTest(data$Marketing.Spend,data$R.D.Spend)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 48 171315312150900043184486424644 0.00000000000000192 ***
1
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
leveneTest(data$Administration,data$R.D.Spend)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 48 33984670753675280972846260086 0.00000000000000431 ***
1
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
# reject if p-value < .05
t.test(data$Administration, data$Marketing.Spend)
Welch Two Sample t-test
data: data$Administration and data$Marketing.Spend
t = -5.055, df = 54.13, p-value = 0.00000526
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-125250.2 -54110.7
sample estimates:
mean of x mean of y
121345 211025
Code
t.test(data$R.D.Spend,data$Marketing.Spend)
Welch Two Sample t-test
data: data$R.D.Spend and data$Marketing.Spend
t = -7.433, df = 62.54, p-value = 0.000000000365
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-174223 -100384
sample estimates:
mean of x mean of y
73721.6 211025.1
Code
t.test(data$Administration,data$R.D.Spend)
Welch Two Sample t-test
data: data$Administration and data$R.D.Spend
t = 6.262, df = 81.06, p-value = 0.0000000172
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
32491.1 62755.0
sample estimates:
mean of x mean of y
121344.6 73721.6
4.6 Remedial Measures: t-Testing
Again, in this case we use a different method to obtain a valid p-value. How? The standard (Student’s) t-test assumes that the variances of the two groups are equal. We instead use Welch’s t-test, a test for means in which equal variances are not assumed; in other words, we find an alternative test for which the violated assumption does not matter. This is a typical way around a violated assumption. Note that Welch’s t-test is not a nonparametric test; it is a modified t-test that simply drops the equal-variance assumption, and it is what R’s t.test() runs by default. (See: Welch’s t-test in R; Some notes about Welch’s t-test; The power of Welch’s test.)
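A minimal sketch of the distinction, reusing two of the spend columns compared above; since var.equal = FALSE is the default, the earlier t.test() calls already ran Welch’s version.
Code
# Student's t-test: assumes the two groups have equal variances
t.test(data$Administration, data$Marketing.Spend, var.equal = TRUE)

# Welch's t-test: equal variances not assumed (this is R's default)
t.test(data$Administration, data$Marketing.Spend, var.equal = FALSE)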
Kruskal-Wallis rank sum test
data: spend by Department
Kruskal-Wallis chi-squared = 48.32, df = 2, p-value = 0.0000000000322
4.8 Remedial Measures: ANOVA
“Roughly speaking, a test or estimator is called ‘robust’ if it still works reasonably well, even if some assumptions required for its theoretical development are not met in practice (BruceET, 2021).” The ANOVA test for means is considered to be robust to violations of the homogeneity of variances assumption when the groups’ sizes are similar.
In our case the group sizes are equal, so the results from our ANOVA test are valid even though our data failed assumption #4. Note that we still used an alternative test, the Kruskal-Wallis test, which is the nonparametric equivalent of the one-way ANOVA. Notice that the results from all tests agree.
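The code that produced the Kruskal-Wallis output above is not shown in this section, so here is a minimal sketch of how the spend-by-department comparison could be set up; the long-format data frame and the names spend and Department are assumptions inferred from the output.
Code
# Reshape the three spend columns into long format: one spend value per row,
# labeled by the department it came from (assumed reconstruction)
long = data.frame(
  spend      = c(data$R.D.Spend, data$Administration, data$Marketing.Spend),
  Department = rep(c("R.D.Spend", "Administration", "Marketing.Spend"), each = nrow(data))
)

# One-way ANOVA: do the mean spends differ across departments?
summary(aov(spend ~ Department, data = long))

# Kruskal-Wallis: the nonparametric alternative reported above
kruskal.test(spend ~ Department, data = long)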
5 Predictive Analysis
According to investopedia.com, predictive analysis is the use of a mix of statistical and machine learning modeling techniques for making predictions about the future. “Predictive analytics looks at current and historical data patterns to determine if those patterns are likely to emerge again.” (Predictive Analytics, n.d.)
This section of the text will cover: regression analysis for predicting numerical variables; logistic regression for predicting categorical variables (also referred to as classification); time series analysis for the prediction of time-dependent variables; and finally a brief introduction to utilizing neural networks for prediction and classification.
One notable difference between this section and the last is the switch to focusing on performance metrics and the use of hold-out datasets for model validation.
While the paradigm of p-value calculation and hypothesis testing does play a role here (mostly in the model and variable selection phases), significance testing of variables is not the goal; the goal is good predictions on unseen data. To this end, methods from both previous sections will be utilized, so a good grasp of the topics previously covered is imperative before moving forward.
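Before the regression examples, here is a minimal sketch of the hold-out idea mentioned above; the 80/20 split fraction, the single-predictor model, and the use of RMSE are illustrative choices, not the text’s own split (which appears later in this section).
Code
# Split the data into a training set and a hold-out (test) set
set.seed(1)
train_idx = sample(seq_len(nrow(data)), size = floor(0.8 * nrow(data)))
train = data[train_idx, ]
test  = data[-train_idx, ]

# Fit on the training set only, then predict on the unseen hold-out rows
fit   = lm(units.sold ~ Unit.Price, data = train)
preds = predict(fit, newdata = test)

# A performance metric: root mean squared error on the hold-out set
sqrt(mean((test$units.sold - preds)^2))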
5.1 Regression Analysis
Regression analysis is used to predict the value of a numerical dependent variable based on the value of at least one independent variable. Methods of inference aid in explaining the impact of changes in the independent variables on the dependent variable, and these techniques can help in producing models that make “better” predictions.
Assumption #1: Multiple feature variables of any type, One continuous target variable.
Assumption #2: The relationship between the features and some transformation of target variable is linear.
Assumption #3: Independence of random observations.
Assumption #4: All variables approximately normally distributed.
Assumption #5: All variables have approximately the same variance.
Assumption #6: There is No Multicollinearity Among feature Variables.
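Assumption #6 is commonly checked with variance inflation factors (VIFs); below is a minimal sketch using car::vif() (the car package is loaded elsewhere in this text). The model formula is illustrative.
Code
library(car)

# Fit a model on a set of candidate feature variables, then compute variance
# inflation factors; values well above 5-10 suggest problematic multicollinearity
vif_model = lm(units.sold ~ R.D.Spend + Administration + Marketing.Spend + Unit.Price,
               data = data)
vif(vif_model)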
Code
rm(list = ls())
library(gvlma)
library(MASS)
library(forecast)

data = read.csv("~/50_Video Games.csv", stringsAsFactors = TRUE)
drops <- c("gameids")
data = data[, !(names(data) %in% drops)]

# regression model predicting units sold
# model fitting: marketing
model1 = lm(data$units.sold ~ data$Marketing.Spend, data = data)
mod1 = gvlma(model1)
summary(model1)
Call:
lm(formula = data$units.sold ~ data$Marketing.Spend, data = data)
Residuals:
Min 1Q Median 3Q Max
-3051 -1592 -356 1132 3731
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9780.21420 533.32979 18.3 <0.0000000000000002 ***
data$Marketing.Spend -0.00240 0.00219 -1.1 0.28
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1880 on 48 degrees of freedom
Multiple R-squared: 0.0244, Adjusted R-squared: 0.00409
F-statistic: 1.2 on 1 and 48 DF, p-value: 0.279
Code
plot.gvlma(mod1)
Code
mod1
Call:
lm(formula = data$units.sold ~ data$Marketing.Spend, data = data)
Coefficients:
(Intercept) data$Marketing.Spend
9780.2142 -0.0024
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = model1)
Value p-value Decision
Global Stat 3.24845 0.517 Assumptions acceptable.
Skewness 0.79037 0.374 Assumptions acceptable.
Kurtosis 1.20017 0.273 Assumptions acceptable.
Link Function 0.00811 0.928 Assumptions acceptable.
Heteroscedasticity 1.24981 0.264 Assumptions acceptable.
Code
# model fitting: R&D spend
model2 = lm(data$units.sold ~ data$R.D.Spend, data = data)
mod2 = gvlma(model2)
summary(model2)
Call:
lm(formula = data$units.sold ~ data$R.D.Spend, data = data)
Residuals:
Min 1Q Median 3Q Max
-3273 -1432 -299 1241 3522
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9380.28099 511.74898 18.33 <0.0000000000000002 ***
data$R.D.Spend -0.00145 0.00591 -0.25 0.81
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1900 on 48 degrees of freedom
Multiple R-squared: 0.00126, Adjusted R-squared: -0.0195
F-statistic: 0.0604 on 1 and 48 DF, p-value: 0.807
Code
plot.gvlma(mod2)
Code
mod2
Call:
lm(formula = data$units.sold ~ data$R.D.Spend, data = data)
Coefficients:
(Intercept) data$R.D.Spend
9380.28099 -0.00145
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = model2)
Value p-value Decision
Global Stat 3.463 0.484 Assumptions acceptable.
Skewness 0.282 0.595 Assumptions acceptable.
Kurtosis 1.288 0.256 Assumptions acceptable.
Link Function 0.391 0.532 Assumptions acceptable.
Heteroscedasticity 1.502 0.220 Assumptions acceptable.
Code
# model fitting: Unit Price
model3 = lm(data$units.sold ~ data$Unit.Price, data = data)
mod3 = gvlma(model3)
summary(model3)
Call:
lm(formula = data$units.sold ~ data$Unit.Price, data = data)
Residuals:
Min 1Q Median 3Q Max
-2259 -1287 -157 1220 3641
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11567.35 531.00 21.78 < 0.0000000000000002 ***
data$Unit.Price -39.26 8.26 -4.75 0.000019 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1570 on 48 degrees of freedom
Multiple R-squared: 0.32, Adjusted R-squared: 0.306
F-statistic: 22.6 on 1 and 48 DF, p-value: 0.0000185
Code
plot.gvlma(mod3)
Code
mod3
Call:
lm(formula = data$units.sold ~ data$Unit.Price, data = data)
Coefficients:
(Intercept) data$Unit.Price
11567.4 -39.3
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = model3)
Value p-value Decision
Global Stat 2.5846 0.630 Assumptions acceptable.
Skewness 1.5604 0.212 Assumptions acceptable.
Kurtosis 0.9497 0.330 Assumptions acceptable.
Link Function 0.0643 0.800 Assumptions acceptable.
Heteroscedasticity 0.0101 0.920 Assumptions acceptable.
Code
# all non-correlated variables
model = lm(data$units.sold ~ . -units.sold -Sales -Profit -cost -Year.Created -IGN.Rating, data = data)
mod = gvlma(model)
summary(model)
Call:
lm(formula = ((data$units.sold^lambda)/lambda) ~ . - units.sold -
Sales - Profit - cost - Year.Created - IGN.Rating, data = data)
Coefficients:
(Intercept)
-0.0825937832
R.D.Spend
0.0000000550
Administration
0.0000000936
Marketing.Spend
0.0000000410
PlatformPC
0.0008674707
PlatformPlayStation
-0.0007373308
PlatformXBOX
0.0033384235
Game.TypeMultiplayer online battle arena (MOBA)
0.0033065419
Game.TypePuzzlers and party games.
0.0024447066
Game.TypeRacing.
0.0029368748
Game.TypeReal-time strategy (RTS)
0.0036570825
Game.TypeRole-playing (RPG, ARPG, and More)
0.0086166708
Game.TypeSandbox.
0.0074161604
Game.TypeShooters (FPS and TPS)
0.0001845619
Game.TypeSports
0.0011497796
Unit.Price
-0.0004116328
IGPYes
0.0010734659
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = new_model)
Value p-value Decision
Global Stat 7.005 0.1356 Assumptions acceptable.
Skewness 0.841 0.3592 Assumptions acceptable.
Kurtosis 0.874 0.3497 Assumptions acceptable.
Link Function 4.562 0.0327 Assumptions NOT satisfied!
Heteroscedasticity 0.729 0.3933 Assumptions acceptable.
Code
# Adjust for overfitting
# Use roughly 60% of the dataset as the training set and the remaining 40% as the testing set
set.seed(168988)
sample = sample(c(TRUE, FALSE), nrow(data), replace = TRUE, prob = c(0.6, 0.4))
train = data[sample, ]
test = data[!sample, ]
# Refined model fit, fix for each model
model = lm(units.sold ~ . -units.sold -Sales -Profit -cost -Year.Created -IGN.Rating, data = train)
# model evaluation
summary(model)
Call:
lm(formula = units.sold ~ . - units.sold - Sales - Profit - cost -
Year.Created - IGN.Rating, data = train)
Coefficients:
(Intercept)
8983.97070
R.D.Spend
0.01917
Administration
0.01818
Marketing.Spend
0.00783
PlatformPC
326.85229
PlatformPlayStation
354.10692
PlatformXBOX
1265.87947
Game.TypeMultiplayer online battle arena (MOBA)
536.54472
Game.TypePuzzlers and party games.
437.30033
Game.TypeRacing.
867.70128
Game.TypeReal-time strategy (RTS)
400.81265
Game.TypeRole-playing (RPG, ARPG, and More)
3371.99236
Game.TypeSandbox.
2488.26148
Game.TypeShooters (FPS and TPS)
-262.78032
Game.TypeSports
776.93612
Unit.Price
-107.15825
IGPYes
558.16095
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = model)
Value p-value Decision
Global Stat 4.05441 0.399 Assumptions acceptable.
Skewness 1.70355 0.192 Assumptions acceptable.
Kurtosis 0.04579 0.831 Assumptions acceptable.
Link Function 2.29652 0.130 Assumptions acceptable.
Heteroscedasticity 0.00855 0.926 Assumptions acceptable.
Code
# test model on unseen data
pdata = predict(model, newdata = test)
# model evaluation
# MAPE: <10 great ... 10-20 good ... 20-50 ok ... >50 bad
accuracy(pdata, test$units.sold)
ME RMSE MAE MPE MAPE
Test set 41.622 1131.26 948.576 -0.870989 9.9329
Code
res = as.data.frame(round(test$units.sold - pdata, 0))
table = cbind(test$units.sold, round(pdata, 0), res)
table
Call:
lm(formula = ((units.sold^lambda)/lambda) ~ . - units.sold -
Sales - Profit - cost - Year.Created - IGN.Rating, data = train)
Coefficients:
(Intercept)
28.58733713
R.D.Spend
0.00001115
Administration
0.00001147
Marketing.Spend
0.00000481
PlatformPC
0.23168335
PlatformPlayStation
0.11422110
PlatformXBOX
0.81411645
Game.TypeMultiplayer online battle arena (MOBA)
0.47406392
Game.TypePuzzlers and party games.
0.46950196
Game.TypeRacing.
0.50587260
Game.TypeReal-time strategy (RTS)
0.33752255
Game.TypeRole-playing (RPG, ARPG, and More)
1.89507334
Game.TypeSandbox.
1.35910178
Game.TypeShooters (FPS and TPS)
-0.03027904
Game.TypeSports
0.55574426
Unit.Price
-0.06424986
IGPYes
0.25880967
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = new_model)
Value p-value Decision
Global Stat 3.5517 0.470 Assumptions acceptable.
Skewness 1.5891 0.207 Assumptions acceptable.
Kurtosis 0.0832 0.773 Assumptions acceptable.
Link Function 1.7883 0.181 Assumptions acceptable.
Heteroscedasticity 0.0911 0.763 Assumptions acceptable.
Code
# test model on unseen data
pdata2 = predict(new_model, newdata = test)
# model evaluation
# MAPE: <10 great ... 10-20 good ... 20-50 ok ... >50 bad
accuracy(pdata2 * lambda, (test$units.sold ^ lambda))
ME RMSE MAE MPE MAPE
Test set -0.00266521 0.107353 0.0936984 -0.0930724 1.77156
Note* “The one way ANOVA model is identical to the linear regression model with one categorical variable - the group. When using the linear regression the results will be the same ANOVA table and the same p-value.” - https://www.statskingdom.com/doc_anova.html. Try it out on the anovadata data set and see if you get the same results, as in the sketch below.
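As a quick illustration of that note, the sketch below fits the same comparison two ways on R's built-in PlantGrowth data (a stand-in here; swap in the anovadata data set to check it yourself): once with aov() and once with lm(). The F statistic and p-value from anova() on the lm fit should match the ANOVA table.
Code
# One-way ANOVA and linear regression with one categorical predictor are the same model
aov_fit <- aov(weight ~ group, data = PlantGrowth)
lm_fit  <- lm(weight ~ group, data = PlantGrowth)

summary(aov_fit)   # ANOVA table: F and p-value for group
anova(lm_fit)      # same F and p-value from the regression fit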
5.2 Logistic Regression Analysis
Logistic regression analysis concerns statistical models known as logit models, as opposed to the linear models used in regression, though this is a bit of a misnomer: both kinds of model are subsets of a larger class of models called generalized linear models, which also includes ANOVA. (Beyond Logistic Regression, n.d.)
Logit models are used in predictive analytics for categorical dependent variables, based on at least one independent variable (What Is Logistic Regression?, n.d.). Predictive analysis for a categorical dependent variable is often referred to as classification.
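A minimal sketch of the idea, using R's built-in mtcars data rather than the case study (the variables here are illustrative only): glm() with family = "binomial" fits the model on the log-odds (logit) scale, and predict(..., type = "response") converts those log-odds back to probabilities between 0 and 1, which can then be cut at a threshold to produce class predictions.
Code
# Predict a binary outcome (automatic vs. manual transmission) from weight
logit_fit <- glm(am ~ wt, data = mtcars, family = "binomial")

log_odds <- predict(logit_fit)                     # link (log-odds) scale
probs    <- predict(logit_fit, type = "response")  # probability scale
classes  <- ifelse(probs > 0.5, 1, 0)              # classify at a 0.5 threshold

head(cbind(log_odds, probs, classes))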
Assumption #1: Multiple feature variables of any type; one categorical (typically binary) target variable.
Assumption #2: The relationship between the features and the log-odds of the target variable is linear.
Assumption #3: Independence of random observations.
Assumption #4: There is no multicollinearity among the feature variables.
Code
library(forecast)
library(caret)
Attaching package: 'caret'
The following object is masked from 'package:mosaic':
dotPlot
Code
library(fastDummies)
rm(list = ls())
data = read.csv("~/50_Video Games.csv", stringsAsFactors = TRUE)
# Adjust for overfitting
# Use roughly 60% of the dataset as the training set and the remaining 40% as the testing set
set.seed(168989) #7 and 0
sample <- sample(c(TRUE, FALSE), nrow(data), replace = TRUE, prob = c(0.6, 0.4))
train <- data[sample, ]
test <- data[!sample, ]
logmodel <- glm(IGP ~ ., data = train, family = "binomial")
summary(logmodel)
Call:
glm(formula = IGP ~ ., family = "binomial", data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.00002834 -0.00001433 -0.00000002 0.00000421 0.00007239
Coefficients: (1 not defined because of singularities)
Estimate
(Intercept) 29240.4281452081013
R.D.Spend 73414.1703861917777
Administration 73414.1613511074102
Marketing.Spend 73414.1672007999732
Profit 73414.1714038907376
PlatformPC -373.1464739538064
PlatformPlayStation -433.0080715562260
PlatformXBOX -1070.2425904622673
Game.TypeMultiplayer online battle arena (MOBA) -108.4417547538506
Game.TypePuzzlers and party games. -467.8504076024070
Game.TypeRacing. -764.3972176317810
Game.TypeReal-time strategy (RTS) -618.0205071556582
Game.TypeRole-playing (RPG, ARPG, and More) -2328.3049753069054
Game.TypeSandbox. -1670.0744821309625
Game.TypeShooters (FPS and TPS) -9.5374177143739
Game.TypeSports -48.6225275119522
IGN.Rating 47.8319401061441
Year.Created -15.7986001839345
cost NA
Sales -73414.1783372929203
Unit.Price 77.0297076037164
units.sold 0.4083716522604
gameids 0.0000000000193
Std. Error z value
(Intercept) 15849567.5515316426754 0
R.D.Spend 35521249.4668596461415 0
Administration 35521243.1408409625292 0
Marketing.Spend 35521247.9900171309710 0
Profit 35521251.2150845378637 0
PlatformPC 289563.4477628378663 0
PlatformPlayStation 255740.8106062586885 0
PlatformXBOX 673425.5798976761289 0
Game.TypeMultiplayer online battle arena (MOBA) 394560.5836117870640 0
Game.TypePuzzlers and party games. 317232.7138482637238 0
Game.TypeRacing. 38751068.2598710581660 0
Game.TypeReal-time strategy (RTS) 621468.2836358817294 0
Game.TypeRole-playing (RPG, ARPG, and More) 1336643.3898657865357 0
Game.TypeSandbox. 995403.4291097223759 0
Game.TypeShooters (FPS and TPS) 170702.5483604636102 0
Game.TypeSports 162427.0008985286404 0
IGN.Rating 45123.0311261153620 0
Year.Created 8273.9072929590766 0
cost NA NA
Sales 35521252.6594500169158 0
Unit.Price 36031.9709425770270 0
units.sold 198.9865542775573 0
gameids 0.0000000127286 0
Pr(>|z|)
(Intercept) 1
R.D.Spend 1
Administration 1
Marketing.Spend 1
Profit 1
PlatformPC 1
PlatformPlayStation 1
PlatformXBOX 1
Game.TypeMultiplayer online battle arena (MOBA) 1
Game.TypePuzzlers and party games. 1
Game.TypeRacing. 1
Game.TypeReal-time strategy (RTS) 1
Game.TypeRole-playing (RPG, ARPG, and More) 1
Game.TypeSandbox. 1
Game.TypeShooters (FPS and TPS) 1
Game.TypeSports 1
IGN.Rating 1
Year.Created 1
cost NA
Sales 1
Unit.Price 1
units.sold 1
gameids 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 42.942861533340 on 30 degrees of freedom
Residual deviance: 0.000000013759 on 9 degrees of freedom
AIC: 44
Number of Fisher Scoring iterations: 25
Code
# test model on unseen data
pdata <- predict(logmodel, newdata = test)
pdata = as.data.frame(ifelse(pdata > 0, 0, 1))
test$IGP <- ifelse(as.numeric(test$IGP) > 1, 1, 0)
table = as.data.frame(cbind(test$IGP, pdata))
confusionMatrix(factor(table$`test$IGP`), factor(table$`ifelse(pdata > 0, 0, 1)`))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 2 5
1 1 11
Accuracy : 0.684
95% CI : (0.434, 0.874)
No Information Rate : 0.842
P-Value [Acc > NIR] : 0.979
Kappa : 0.23
Mcnemar's Test P-Value : 0.221
Sensitivity : 0.667
Specificity : 0.688
Pos Pred Value : 0.286
Neg Pred Value : 0.917
Prevalence : 0.158
Detection Rate : 0.105
Detection Prevalence : 0.368
Balanced Accuracy : 0.677
'Positive' Class : 0
Code
#####################################################################
rm(list = ls())
data = read.csv("~/50_Video Games.csv", stringsAsFactors = TRUE)
# Adjust for overfitting
# Use roughly 60% of the dataset as the training set and the remaining 40% as the testing set
set.seed(168989) #7 and 0
sample <- sample(c(TRUE, FALSE), nrow(data), replace = TRUE, prob = c(0.6, 0.4))
train <- data[sample, ]
test <- data[!sample, ]
logmodel <- glm(IGP ~ . -IGP -gameids -cost -Platform, data = train, family = "binomial")
summary(logmodel)
Call:
glm(formula = IGP ~ . - IGP - gameids - cost - Platform, family = "binomial",
data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.476 -0.379 0.000 0.183 1.553
Coefficients:
Estimate Std. Error z value
(Intercept) 399.60880 421.41572 0.95
R.D.Spend 696.87320 713.16243 0.98
Administration 696.87321 713.16237 0.98
Marketing.Spend 696.87312 713.16235 0.98
Profit 696.87311 713.16236 0.98
Game.TypeMultiplayer online battle arena (MOBA) 8.97974 5.92772 1.51
Game.TypePuzzlers and party games. 13.50005 6926.30230 0.00
Game.TypeRacing. 13.28762 5542.64153 0.00
Game.TypeReal-time strategy (RTS) 20.43086 10754.01114 0.00
Game.TypeRole-playing (RPG, ARPG, and More) -37.71766 5487.58612 -0.01
Game.TypeSandbox. 0.72570 10754.02223 0.00
Game.TypeShooters (FPS and TPS) -0.24538 2.05925 -0.12
Game.TypeSports -2.52468 3.61598 -0.70
IGN.Rating 1.65462 1.25927 1.31
Year.Created -0.23688 0.22792 -1.04
Sales -696.87331 713.16249 -0.98
Unit.Price 1.24655 1.03298 1.21
units.sold 0.00821 0.00685 1.20
Pr(>|z|)
(Intercept) 0.34
R.D.Spend 0.33
Administration 0.33
Marketing.Spend 0.33
Profit 0.33
Game.TypeMultiplayer online battle arena (MOBA) 0.13
Game.TypePuzzlers and party games. 1.00
Game.TypeRacing. 1.00
Game.TypeReal-time strategy (RTS) 1.00
Game.TypeRole-playing (RPG, ARPG, and More) 0.99
Game.TypeSandbox. 1.00
Game.TypeShooters (FPS and TPS) 0.91
Game.TypeSports 0.49
IGN.Rating 0.19
Year.Created 0.30
Sales 0.33
Unit.Price 0.23
units.sold 0.23
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 42.943 on 30 degrees of freedom
Residual deviance: 18.824 on 13 degrees of freedom
AIC: 54.82
Number of Fisher Scoring iterations: 18
Code
# test model on unseen data
pdata <- predict(logmodel, newdata = test)
pdata = as.data.frame(ifelse(pdata > 0, 0, 1))
test$IGP <- ifelse(as.numeric(test$IGP) > 1, 1, 0)
table = as.data.frame(cbind(test$IGP, pdata))
confusionMatrix(factor(table$`test$IGP`), factor(table$`ifelse(pdata > 0, 0, 1)`))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 5 2
1 1 11
Accuracy : 0.842
95% CI : (0.604, 0.966)
No Information Rate : 0.684
P-Value [Acc > NIR] : 0.105
Kappa : 0.65
Mcnemar's Test P-Value : 1.000
Sensitivity : 0.833
Specificity : 0.846
Pos Pred Value : 0.714
Neg Pred Value : 0.917
Prevalence : 0.316
Detection Rate : 0.263
Detection Prevalence : 0.368
Balanced Accuracy : 0.840
'Positive' Class : 0
5.3 Time Series Analysis
Time series analysis is the analysis of data collected over time. In time series analysis, time is a significant variable; this dependence on time is usually something we want to avoid, especially in the regression methods covered so far. Predictive analysis using time-dependent data is usually referred to as forecasting.
The key difference between modeling data via time series methods and using the methods discussed so far is that “Time series analysis accounts for the fact that data points taken over time may have an internal structure (such as autocorrelation, trend or seasonal variation) that should be accounted for.”
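The sketch below, using R's built-in AirPassengers monthly series (an illustration only, not the case-study data), shows two quick ways to look for that internal structure: decompose() separates trend and seasonal variation, and acf() plots the autocorrelation of the series with its own past values.
Code
# AirPassengers is a built-in monthly time series (1949-1960)
plot(decompose(AirPassengers))   # trend, seasonal, and remainder components
acf(AirPassengers)               # autocorrelation with lagged copies of itself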
Code
library(prophet)
library(readr)
library(aweek)
library(dplyr)
library(forecast)
df <- read.csv("~/Daily_Demand_Forecasting_Orders.csv", stringsAsFactors = FALSE)
colnames(df)[colnames(df) == "Date"] = "ds"
colnames(df)[colnames(df) == "Units.Sold"] = "y"
#colnames(df)[colnames(df) == "Total.Profit"] = "y"
# Convert to class date and remove $
df <- df %>% mutate(ds = as.Date(ds, format = "%Y-%m-%d"))
#df <- df %>%
#  mutate(y = parse_number(y))
# Dataframe must have columns 'ds' and 'y' with the dates and values respectively.
m <- prophet(df, yearly.seasonality = TRUE, daily.seasonality = TRUE)
future <- make_future_dataframe(m, periods = 5) # R
forecast <- predict(m, future)
plot(m, forecast, ylim = c(0, 6000))
Code
prophet_plot_components(m, forecast)
Code
dyplot.prophet(m, forecast)
Code
model1_cv <- cross_validation(m, initial = 330, horizon = 365/12, units = "days")
Making 13 forecasts with cutoffs between 2021-11-29 02:00:00 and 2022-05-30 14:00:00
Code
###############################################
rm(list = ls())
library(prophet)
library(readr)
library(aweek)
library(dplyr)
library(forecast)
df <- read.csv("~/Daily_Demand_Forecasting_Orders.csv", stringsAsFactors = FALSE)
colnames(df)[colnames(df) == "Date"] = "ds"
colnames(df)[colnames(df) == "Total.Profit"] = "y"
# Convert to class date and remove $
df <- df %>% mutate(ds = as.Date(ds, format = "%Y-%m-%d"))
df <- df %>% mutate(y = parse_number(y))
df$week_num <- strftime(df$ds, format = "%V")
df = df %>% group_by(week_num) %>% summarize(y = sum(y))
colnames(df)[colnames(df) == "week_num"] = "ds"
df <- df %>% mutate(ds = get_date(ds, year = 2021))
# Dataframe must have columns 'ds' and 'y' with the dates and values respectively.
m <- prophet(df, yearly.seasonality = TRUE, daily.seasonality = TRUE)
Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.
Code
future <- make_future_dataframe(m, periods = 5) # R
forecast <- predict(m, future)
plot(m, forecast, ylim = c(0, 6000))
Code
prophet_plot_components(m, forecast)
Code
dyplot.prophet(m, forecast)
Code
model1_cv <- cross_validation(m, initial = 330, horizon = 365/12, units = "days")
Making 1 forecasts with cutoffs between 2021-12-03 14:00:00 and 2021-12-03 14:00:00
Code
###############################################
rm(list = ls())
library(prophet)
library(readr)
library(aweek)
library(dplyr)
library(forecast)
df <- read.csv("~/Daily_Demand_Forecasting_Orders.csv", stringsAsFactors = FALSE)
colnames(df)[colnames(df) == "Date"] = "ds"
colnames(df)[colnames(df) == "Total.Profit"] = "y"
# Convert to class date and remove $
df <- df %>% mutate(ds = as.Date(ds, format = "%Y-%m-%d"))
df <- df %>% mutate(y = parse_number(y))
df$week_num <- strftime(df$ds, format = "%V")
df = df %>% group_by(week_num) %>% summarize(y = mean(y))
colnames(df)[colnames(df) == "week_num"] = "ds"
df <- df %>% mutate(ds = get_date(ds, year = 2021))
# Dataframe must have columns 'ds' and 'y' with the dates and values respectively.
m <- prophet(df, yearly.seasonality = TRUE, daily.seasonality = TRUE)
Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.
Code
future <- make_future_dataframe(m, periods = 5) # R
forecast <- predict(m, future)
plot(m, forecast, ylim = c(0, 6000))
Code
prophet_plot_components(m, forecast)
Code
dyplot.prophet(m, forecast)
Code
model1_cv <- cross_validation(m, initial = 330, horizon = 365/12, units = "days")
Making 1 forecasts with cutoffs between 2021-12-03 14:00:00 and 2021-12-03 14:00:00
5.4 Neural Networks
Neural networks are models inspired by the structure of the human brain and are used to detect patterns in data sets. These models can detect the most subtle and complex relationships between variables using sheer mathematical power. Neural networks can be used to make predictions for dependent variables of any type, including numerical, categorical, and time series.
The structure of a neural-network algorithm has three kinds of layers. The input layer is where each variable enters the network, so the size of the input layer is the number of feature variables in your dataset. The output layer is where the results are displayed. The hidden layers sit in the middle. One (very) simple way for a new analyst to think about a neural network is as a net of logit models, as the sketch below illustrates.
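A minimal sketch of the "net of logit models" idea: a single neuron with a logistic activation takes a weighted sum of its inputs plus a bias and squashes it to a value between 0 and 1, exactly like the right-hand side of a logistic regression. The inputs, weights, and bias below are made-up numbers for illustration.
Code
logistic <- function(z) 1 / (1 + exp(-z))   # the "logistic" activation used by neuralnet

x <- c(0.2, 0.7, 1.5)      # inputs arriving at one neuron (hypothetical values)
w <- c(0.4, -0.3, 0.8)     # weights on those inputs (hypothetical values)
b <- 0.1                   # bias term

neuron_output <- logistic(sum(w * x) + b)   # weighted sum, then squash to (0, 1)
neuron_output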
Though, if we want, we can use other activation functions, and we can even mix and match... it gets complicated.
The important conceptual point to keep in mind is that we input variables and the network outputs predictions. We can check those predictions using the techniques and metrics we have utilized for predictive analysis so far.
Attaching package: 'neuralnet'
The following object is masked from 'package:dplyr':
compute
Code
library(caret)library(generics)
Attaching package: 'generics'
The following object is masked from 'package:keras':
evaluate
The following object is masked from 'package:lubridate':
as.difftime
The following object is masked from 'package:caret':
train
The following object is masked from 'package:dplyr':
explain
The following objects are masked from 'package:base':
as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
setequal, union
Code
library(forecast)library(scales)
Attaching package: 'scales'
The following object is masked from 'package:purrr':
discard
The following object is masked from 'package:readr':
col_factor
The following objects are masked from 'package:psych':
alpha, rescale
The following object is masked from 'package:mosaic':
rescale
The following object is masked from 'package:lessR':
rescale
Code
data <- dummy_cols(data, select_columns = c('Game.Type', 'Platform', 'IGP'), remove_selected_columns = TRUE)
maxs <- apply(data, 2, max)
mins <- apply(data, 2, min)
data = scale(data, center = mins, scale = maxs - mins)
# Adjust for overfitting
# Use 80% of dataset as training set and remaining 20% as testing set
set.seed(168988)
sample = sample(c(TRUE, FALSE), nrow(data), replace = TRUE, prob = c(0.8, 0.2))
train = as.data.frame(data[sample, ])
test = as.data.frame(data[!sample, ])
model = neuralnet(factor(IGP_Yes) ~ IGN.Rating + cost + Unit.Price + Game.Type_Sports,
                  data = train, hidden = c(4, 2), rep = 1,
                  act.fct = "logistic", linear.output = FALSE)
plot(model, rep = "best")
Code
# test model on unseen data
pdata = as.data.frame(predict(model, newdata = test))
# model evaluation
# MAPE: <10 great ... 10-20 good ... 20-50 ok ... >50 bad
table = as.data.frame(cbind(round((pdata$V2)), (test$IGP_Yes)))
table
ME RMSE MAE MPE MAPE
Test set 125.753 798.024 654.017 0.610453 6.71192
Code
sum(res)
[1] -2139
Code
library(tidyverse)
library(keras)
library(neuralnet)
library(prophet)
library(readr)
library(aweek)
library(dplyr)
library(forecast)
df <- read.csv("~/Daily_Demand_Forecasting_Orders.csv", stringsAsFactors = FALSE)
data = df
colnames(df)[colnames(df) == "Date"] = "ds"
colnames(df)[colnames(df) == "Units.Sold"] = "y"
#colnames(df)[colnames(df) == "Total.Profit"] = "y"
# Convert to class date and remove $
df <- df %>% mutate(ds = as.Date(ds, format = "%Y-%m-%d"))
#df <- df %>%
#  mutate(y = parse_number(y))
data = df
fit = nnetar(ts(df$y), lambda = 0.5)
fit
Series: ts(df$y)
Model: NNAR(1,1)
Call: nnetar(y = ts(df$y), lambda = 0.5)
Average of 20 networks, each of which is
a 1-1-1 network with 4 weights
options were - linear output units
sigma^2 estimated as 278
ME RMSE MAE MPE MAPE MASE ACF1
Training set 69.6006 340.94 296.709 -305.662 347.691 0.772336 0.00712707
Code
############################################### avg
rm(list = ls())
df <- read.csv("~/Daily_Demand_Forecasting_Orders.csv", stringsAsFactors = FALSE)
colnames(df)[colnames(df) == "Date"] = "ds"
colnames(df)[colnames(df) == "Units.Sold"] = "y"
#colnames(df)[colnames(df) == "Total.Profit"] = "y"
# Convert to class date and remove $
df <- df %>% mutate(ds = as.Date(ds, format = "%Y-%m-%d"))
#df <- df %>%
#  mutate(y = parse_number(y))
df$week_num <- strftime(df$ds, format = "%V")
df = df %>% group_by(week_num) %>% summarize(y = mean(y))
colnames(df)[colnames(df) == "week_num"] = "ds"
df <- df %>% mutate(ds = get_date(ds, year = 2021))
data = df
fit = nnetar(ts(df$y), lambda = 0.5, xreg = df$Total.Profit)
fit
Series: ts(df$y)
Model: NNAR(1,1)
Call: nnetar(y = ts(df$y), xreg = df$Total.Profit, lambda = 0.5)
Average of 20 networks, each of which is
a 1-1-1 network with 4 weights
options were - linear output units
sigma^2 estimated as 30.5
ME RMSE MAE MPE MAPE MASE ACF1
Training set 7.62245 125.19 101.2 -4.86941 20.8985 0.679469 0.0206835
Code
###############################################
rm(list = ls())
df <- read.csv("~/Daily_Demand_Forecasting_Orders.csv", stringsAsFactors = FALSE)
colnames(df)[colnames(df) == "Date"] = "ds"
colnames(df)[colnames(df) == "Total.Profit"] = "y"
# Convert to class date and remove $
df <- df %>% mutate(ds = as.Date(ds, format = "%Y-%m-%d"))
df <- df %>% mutate(y = parse_number(y))
df$week_num <- strftime(df$ds, format = "%V")
df = df %>% group_by(week_num) %>% summarize(y = mean(y))
colnames(df)[colnames(df) == "week_num"] = "ds"
df <- df %>% mutate(ds = get_date(ds, year = 2021))
data = df
fit = nnetar(ts(df$y), lambda = 0.5, xreg = df$Total.Profit)
fit
Series: ts(df$y)
Model: NNAR(5,3)
Call: nnetar(y = ts(df$y), xreg = df$Total.Profit, lambda = 0.5)
Average of 20 networks, each of which is
a 5-3-1 network with 22 weights
options were - linear output units
sigma^2 estimated as 49.4
ME RMSE MAE MPE MAPE MASE ACF1
Training set 17.522 294.914 216.112 -1.04745 6.82279 0.269934 0.00272143
6 Next Steps: Further Research and Analysis
Look back at the descriptive analytics section: what other relationships do you think are worth testing, and why? What methods would you use to test those relationships?
This concludes our analysis of the video game dataset.
7 R Programming Resources:
This text is in no way meant to be a complete reference for the R programming language, but rather an introduction to many of the concepts utilized in modern statistical approaches to problem solving. The following resources will prove to be useful if you would like a deeper understanding of R:
Alternative Hypothesis: In hypothesis testing, the null hypothesis and an alternative hypothesis are put forward. If the data are sufficiently strong to reject the null hypothesis, then the null hypothesis is rejected in favor of an alternative hypothesis. For instance, if the null hypothesis were that mu 1 = mu 2 then the alternative hypothesis (for a two-tailed test) would be mu 1 != mu 2 .
Analysis of Variance: Analysis of variance is a method for testing hypotheses about means. It is the most widely-used method of statistical inference for the analysis of experimental data.
Average: The (arithmetic) mean; Any measure of central tendency.
Bar Chart: A graphical method of presenting data. A bar is drawn for each level of a variable. The height of each bar represents the value of the variable. Bar charts are useful for displaying things such as frequency counts and percent increases. They are not recommended for displaying means (despite the widespread practice) since box plots present more information in the same amount of space.
Beta weight: A standardized regression coefficient.
Bias: 1. A sampling method is biased if each element does not have an equal chance of being selected. A sample of internet users found reading an online statistics book would be a biased sample of all internet users. A random sample is unbiased. Note that possible bias refers to the sampling method, not the result. An unbiased method could, by chance, lead to a very non-representative sample.
2. An estimator is biased if it systematically overestimates or underestimates the parameter it is estimating. In other words, it is biased if the mean of the sampling distribution of the statistic is not the parameter it is estimating. The sample mean is an unbiased estimate of the population mean. The mean squared deviation of sample scores from their mean is a biased estimate of the variance since it tends to underestimate the population variance.
Binomial Distribution: A probability distribution for independent events for which there are only two possible outcomes, such as a coin flip. If one of the two outcomes is defined as a success, then the probability of exactly x successes out of N trials (events) is given by: P(x) = [N! / (x!(N - x)!)] * p^x * (1 - p)^(N - x), where p is the probability of success on a single trial.
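In R these probabilities come from dbinom(); for example (an illustration), the probability of exactly 7 heads in 10 flips of a fair coin:
Code
dbinom(7, size = 10, prob = 0.5)          # P(exactly 7 successes in 10 trials)
sum(dbinom(0:10, size = 10, prob = 0.5))  # all possible outcomes together sum to 1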
Bin Width: Also known as the class interval, the bin width is a division of data for use in a histogram. For instance, it is possible to partition scores on a 100 point test into class intervals of 1-25, 26-50, 51-75, and 76-100.
Bivariate: Bivariate data is data for which there are two variables for each observation. That is, two scores per subject.
Bonferroni Correction: In general, to keep the familywise error rate (FER) at or below .05, the per-comparison error rate (PCER) should be: PCER = .05/c where c is the number of comparisons. More generally, to insure that the FER is less than or equal to alpha, use PCER = alpha/c.
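A small R illustration of the correction (the p-values below are made up): either compare each p-value to alpha/c, or equivalently use p.adjust() with method = "bonferroni".
Code
pvals <- c(0.010, 0.030, 0.045)   # hypothetical p-values from c = 3 comparisons
alpha <- 0.05

pvals < alpha / length(pvals)                   # compare to the per-comparison rate .05/3
p.adjust(pvals, method = "bonferroni") < alpha  # equivalent: adjust the p-values instead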
Box Plot: One of the more effective graphical summaries of a data set, the box plot generally shows mean, median, 25th and 75th percentiles, and outliers. A standard box plot is composed of the median, upper hinge, lower hinge, higher adjacent value, lower adjacent value, outside values, and far out values. An example is shown below. Parallel box plots are very useful for comparing distributions.
Central Tendency: There are many measures of the center of a distribution. These are called measures of central tendency. The most common are the mean, median, and mode. Others include the trimean, trimmed mean, and geometric mean.
Class Frequency: One of the components of a histogram, the class frequency is the number of observations in each class interval. See also: relative frequency.
Class Interval: Also known as bin width, the class interval is a division of data for use in a histogram. For instance, it is possible to partition scores on a 100 point test into class intervals of 1-25, 26-50, 51-75, and 76-100.
Conditional Probability: The probability that event A occurs given that event B has already occurred is called the conditional probability of A given B. Symbolically, this is written as P(A|B). The probability it rains on Monday given that it rained on Sunday would be written as P(Rain on Monday | Rain on Sunday).
Confidence Interval: A confidence interval is a range of scores likely to contain the parameter being estimated. Intervals can be constructed to be more or less likely to contain the parameter: 95% of 95% confidence intervals contain the estimated parameter whereas 99% of 99% confidence intervals contain the estimated parameter. The wider the confidence interval, the more uncertainty there is about the value of the parameter.
Confounding: Two or more variables are confounded if their effects cannot be separated because they vary together. For example, if a study on the effect of light inadvertently manipulated heat along with light, then light and heat would be confounded.
Cook’s D: Cook’s D is a measure of the influence of an observation in regression and is proportional to the sum of the squared differences between predictions made with all observations in the analysis and predictions made leaving out the observation in question.
Constant: A value that does not change. Values such as pi or the mass of the Earth are constants.
Continuous Variables: Variables that can take on any value in a certain range. Time and distance are continuous; gender, SAT score and “time rounded to the nearest second” are not. Variables that are not continuous are known as discrete variables. No measured variable is truly continuous; however, discrete variables measured with enough precision can often be considered continuous for practical purposes.
Dependent Variable: A variable that measures the experimental outcome. In most experiments, the effects of the independent variable on the dependent variables are observed. For example, if a study investigated the effectiveness of an experimental treatment for depression, then the measure of depression would be the dependent variable.
Descriptive Statistics: 1. The branch of statistics concerned with describing and summarizing data. 2. A set of statistics such as the mean, standard deviation, and skew that describe a distribution.
Degrees of Freedom: The degrees of freedom of an estimate is the number of independent pieces of information that go into the estimate. In general, the degrees of freedom for an estimate is equal to the number of values minus the number of parameters estimated en route to the estimate in question. For example, to estimate the population variance, one must first estimate the population mean. Therefore, if the estimate of variance is based on N observations, there are N-1 degrees of freedom.
Discrete Variables: Variables that can only take on a finite number of values are called “discrete variables.” All qualitative variables are discrete. Some quantitative variables are discrete, such as performance rated as 1,2,3,4, or 5, or temperature rounded to the nearest degree. Sometimes, a variable that takes on enough discrete values can be considered to be continuous for practical purposes. One example is time to the nearest millisecond.
Distribution: The distribution of empirical data is called a frequency distribution and consists of a count of the number of occurrences of each value. If the data are continuous, then a grouped frequency distribution is used. Typically, a distribution is portrayed using a frequency polygon or a histogram. Mathematical equations are often used to define distributions. The normal distribution is, perhaps, the best known example. Many empirical distributions are approximated well by mathematical distributions such as the normal distribution.
Expected Value: The expected value of a statistic is the mean of the sampling distribution of the statistic. It can be loosely thought of as the long-run average value of the statistic.
Factor (Independent Variable): Variables that are manipulated by the experimenter, as opposed to dependent variables. Most experiments consist of observing the effect of the independent variable(s) on the dependent variable(s).
False Positive: A false positive occurs when a diagnostic procedure returns a positive result while the true state of the subject is negative. For example, if a test for strep says the patient has strep when in fact he or she does not, then the error in diagnosis would be called a false positive. In some contexts, a false positive is called a false alarm. The concept is similar to a Type I error in significance testing.
Familywise Error Rate: When a series of significance tests is conducted, the familywise error rate (FER) is the probability that one or more of the significance tests results in a Type I error.
Far Out Value: One of the components of a box plot, far out values are those that are more than 2 steps beyond the nearest hinge. They are beyond an outer fence.
Favorable Outcome: A favorable outcome is the outcome of interest. For example, one could define a favorable outcome in the flip of a coin as a head. The term “favorable outcome” does not necessarily mean that the outcome is desirable; in some experiments, the favorable outcome could be the failure of a test, or the occurrence of an undesirable event.
Frequency Distribution: For a discrete variable, a frequency distribution consists of the distribution of the number of occurrences for each value of the variable. For a continuous variable, it is the number of occurrences for a variety of ranges of variables.
Frequency Table: A table containing the number of occurrences in each class of data; for example, the number of each color of M&Ms in a bag. Frequency tables are often used to create histograms and frequency polygons. When a frequency table is created for a quantitative variable, a grouped frequency table is generally used.
Histogram: A histogram is a graphical representation of a distribution . It partitions the variable on the x-axis into various contiguous class intervals of (usually) equal widths. The heights of the bars represent the class frequencies.
History Effect: A problem of confounding where the passage of time, and not the variable of interest, is responsible for observed effects. See also: third variable problem.
Homogeneity of Variance: The assumption that the variances of all the populations are equal.
Homoscedasticity: In linear regression, the assumption that the variance around the regression line is the same for all values of the predictor variable.
Independent Events: Events A and B are independent events if the probability of Event B occurring is the same whether or not Event A occurs. For example, if you throw two dice, the probability that the second die comes up 1 is independent of whether the first die came up 1. Formally, this can be stated in terms of conditional probabilities: P(A|B) = P(A) and P(B|A) = P(B).
Inferential Statistics: The branch of statistics concerned with drawing conclusions about a population from a sample. This is generally done through random sampling, followed by inferences made about central tendency, or any of a number of other aspects of a distribution.
Influence: Influence refers to the degree to which a single observation in regression influences the estimation of the regression parameters. It is often measured in terms how much the predicted scores for other observations would differ if the observation in question were not included.
Interquartile Range: The Interquartile Range (IQR) is the 75th percentile minus the 25th percentile. It is a robust measure of variability.
Interval Estimate: An interval estimate is a range of scores likely to contain the estimated parameter. see “confidence interval.”
Interval Scale: One of four commonly used levels of measurement, an interval scale is a numerical scale in which intervals have the same meaning throughout. As an example, consider the Fahrenheit scale of temperature. The difference between 30 degrees and 40 degrees represents the same temperature difference as the difference between 80 degrees and 90 degrees. This is because each 10 degree interval has the same physical meaning (in terms of kinetic energy). Unlike ratio scales, interval scales do not have a true zero point.
Levels of Measurement: Measurement scales differ in their level of measurement. There are four common levels of measurement: 1. Nominal scales are only labels. 2. Ordinal scales are ordered but are not truly quantitative; equal intervals on the ordinal scale do not imply equal intervals on the underlying trait. 3. Interval scales are ordered, and equal intervals on the scale imply equal intervals on the underlying trait; however, interval scales do not have a true zero point. 4. Ratio scales are interval scales that do have a true zero point. With ratio scales, it is sensible to talk about one value being twice as large as another, for example.
Leverage: Leverage is a factor affecting the influence of an observation in regression. Leverage is based on how much the observation’s value on the predictor variable differs from the mean of the predictor variable. The greater an observation’s leverage, the more potential it has to be an influential observation.
Lies: There are three types of lies: 1. regular lies 2. damned lies 3. statistics This is according to Benjamin Disraeli as quoted by Mark Twain.
Line Graph: Essentially a bar graph in which the height of each bar is represented by a single point, with each of these points connected by a line. Line graphs are best used to show change over time, and should not be used if your X-axis is not an ordered variable. An example is shown below.
Linear Combination: A linear combination of variables is a way of creating a new variable by combining other variables. A linear combination is one in which each variable is multiplied by a coefficient and the products are summed. For example, if Y = 3X1 + 2X2 + .5X3 then Y is a linear combination of the variables X1, X2, and X3.
Linear Regression: Linear regression is a method for predicting a criterion variable from one or more predictor variable. In simple regression, the criterion is predicted from a single predictor variable and the best-fitting straight line is of the form Y’ = bX + A where Y’ is the predicted score, X is the predictor variable, b is the slope, and A is the Y intercept. Typically, the criterion for the “best fitting” line is the line for which the sum of the squared errors of prediction is minimized. In multiple regression, the criterion is predicted from two or more predictor variables.
Linear Relationship: There is a perfect linear relationship between two variables if a scatter-plot of the points falls on a straight line. The relationship is linear even if the points diverge from the line as long as the divergence is random rather than being systematic.
Linear Transformation: A linear transformation is any transformation of a variable that can be achieved by multiplying it by a constant, and then adding a second constant. If Y is the transformed value of X, then Y = aX + b. The transformation from degrees Fahrenheit to degrees Centigrade is linear and is done using the formula: C = 0.55556F - 17.7778.
Logarithm: The logarithm of a number is the power the base of the logarithm has to be raised to in order to equal the number. If the base of the logarithm is 10 and the number is 1,000, then the log is 3 since 10 has to be raised to the 3rd power to equal 1,000.
Margin of Error: When a statistic is used to estimate a parameter, it is common to compute a confidence interval. The margin of error is the difference between the statistic and the endpoints of the interval. For example, if the statistic were 0.6 and the confidence interval ranged from 0.4 to 0.8, then the margin of error would be 0.20. Unless otherwise specified, the 95% confidence interval is used.
Mean: Also known as the arithmetic mean, the mean is typically what is meant by the word “average.” The mean is perhaps the most common measure of central tendency. The mean of a variable is given by (the sum of all its values)/(the number of values). For example, the mean of 4, 8, and 9 is 7. The sample mean is written as M, and the population mean as the Greek letter mu ( mu ). Despite its popularity, the mean may not be an appropriate measure of central tendency for skewed distributions, or in situations with outliers. Other than the arithmetic mean, there is the geometric mean and the harmonic mean.
Median:The median is a popular measure of central tendency. It is the 50th percentile of a distribution. To find the median of a number of values, first order them, then find the observation in the middle: the median of 5, 2, 7, 9, and 4 is 5. (Note that if there is an even number of values, one takes the average of the middle two: the median of 4, 6, 8, and 10 is 7.) The median is often more appropriate than the mean in skewed distributions and in situations with outliers.
Mode: The mode is a measure of central tendency. It is the most frequent value in a distribution: the mode of 3, 4, 4, 5, 5, 5, 8 is 5. Note that the mode may be very different from the mean and the median.
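In R, mean() and median() are built in; there is no built-in function for the mode of a data vector (R's mode() returns the storage type), so a small helper is sketched below using the values from the definitions above.
Code
x <- c(3, 4, 4, 5, 5, 5, 8)

mean(x)    # arithmetic mean
median(x)  # 50th percentile

# mode: the most frequent value (simple helper; returns the first value in case of ties)
stat_mode <- function(v) as.numeric(names(which.max(table(v))))
stat_mode(x)   # 5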
Multiple Regression: Multiple regression is linear regression in which two or more predictor variables are used to predict the criterion.
Negative Association: There is a negative association between variables X and Y if smaller values of X are associated with larger values of Y and larger values of X are associated with smaller values of Y.
Nominal Scales: A nominal scale is one of four commonly-used levels of measurement. No ordering is implied, and addition/subtraction and multiplication/division would be inappropriate for a variable on a nominal scale. {Female, Male} and {Buddhist, Christian, Hindu, Muslim} have no natural ordering (except alphabetic). Occasionally, numeric values are nominal: for instance, if a variable were coded as Female = 1, Male =2, the set {1,2} is still nominal.
Non-representative: A non-representative sample is a sample that does not accurately reflect the population.
Normal Distribution: One of the most common continuous distributions, a normal distribution is sometimes referred to as a “bell-shaped distribution.” If mu is the distribution mean and sigma the standard deviation, the probability density is f(x) = (1 / (sigma * sqrt(2 * pi))) * exp(-((x - mu)^2) / (2 * sigma^2)). If the mean is 0 and the standard deviation is 1, the distribution is referred to as the “standard normal distribution.”
Null Hypothesis: A null hypothesis is a hypothesis tested in significance testing. It is typically the hypothesis that a parameter is zero or that a difference between parameters is zero. For example, the null hypothesis might be that the difference between population means is zero. Experimenters typically design experiments to allow the null hypothesis to be rejected.
Omnibus Null Hypothesis: The null hypothesis that all population means are equal.
One Tailed: The last step in significance testing involves calculating the probability that a statistic would differ as much or more from the parameter specified in the null hypothesis as does the statistic obtained in the experiment. A probability computed considering differences in only one direction, such as the statistic being larger than the parameter, is called a one-tailed probability. For example, if a parameter is 0 and the statistic is 12, a one-tailed probability (the positive tail) would be the probability of a statistic being >= 12. Compare with the two-tailed probability, which would be the probability of being either <= -12 or >= 12.
Ordinal Scales: One of four commonly-used levels of measurement, an ordinal scale is a set of ordered values. However, there is no set distance between scale values. For instance, the scale (Very Poor, Poor, Average, Good, Very Good) is an ordinal scale.
Outer Fence: In a box plot, the lower outer fence is two steps below the lower hinge whereas the upper outer fence is two steps above the upper hinge.
Outlier: Outliers are atypical, infrequent observations; values that have an extreme deviation from the center of the distribution. There is no universally-agreed on criterion for defining an outlier, and outliers should only be discarded with extreme caution. However, one should always assess the effects of outliers on the statistical conclusions.
Outside Values: A component of a box plot, outside values are more than one step beyond the nearest hinge but not more than two steps. They are beyond an inner fence but not beyond an outer fence.
Pairwise Comparisons: Comparisons between pairs of means, such as comparing each pair of group means following an ANOVA.
Parallel Box Plots: Two or more box plots drawn on the same Y-axis. These are often useful in comparing features of distributions. An example portraying the times it took samples of women and men to do a task is shown below.
Parameter: A value calculated in a population. For example, the mean of the numbers in a population is a parameter. Compare with a statistic, which is a value computed in a sample to estimate a parameter.
Partial slope: The partial slope in multiple regression is the slope of the relationship between the part of the predictor variable that is independent of the other predictor variables and criterion. It is also the regression coefficient for the predictor variable in question.
Pearson’s r: Pearson’s correlation is a measure of the strength of the linear relationship between two variables. It ranges from -1 for a perfect negative relationship to +1 for a perfect positive relationship. A correlation of 0 means that there is no linear relationship.
Percentile: The Pth percentile of a set of N ordered scores can be computed by finding the rank R = (P/100) x (N + 1) and then interpolating: 1. Define IR as the integer portion of R (the number to the left of the decimal point). 2. Define FR as the fractional portion of R. 3. Find the scores with Rank IR and with Rank IR + 1. 4. Interpolate by multiplying the difference between the scores by FR and add the result to the lower score.
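A small R sketch of those interpolation steps for a hypothetical set of scores, computing the 30th percentile; in practice the built-in quantile() function performs this calculation for you (type = 6 uses the same (N + 1) rank definition).
Code
scores <- c(3, 5, 7, 8, 9, 11, 13, 15)   # hypothetical scores, already ordered
P <- 30
N <- length(scores)

R  <- (P / 100) * (N + 1)   # rank of the 30th percentile: 2.7
IR <- floor(R)              # integer portion: 2
FR <- R - IR                # fractional portion: 0.7

scores[IR] + FR * (scores[IR + 1] - scores[IR])   # interpolated value: 5 + 0.7 * (7 - 5) = 6.4

quantile(scores, 0.30, type = 6)   # same result from the built-in function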
Per-Comparison Error Rate: The per-comparison error rate refers to the Type I error rate of any one significance test conducted as part of a series of significance tests. Thus, if 10 significance tests were each conducted at 0.05 significance level, then the per-comparison error rate would be 0.05. Compare with the familywise error rate.
Pie Chart: A graphical representation of data, the pie chart shows relative frequencies of classes of data. It is a circle cut into a number of wedges, one for each class, with the area of each wedge proportional to its relative frequency. Pie charts are only effective for a small number of classes, and are one of the less effective graphical representations.
Point Estimate: When a parameter is being estimated, the estimate can be either a single number or it can be a range of numbers such as in a confidence interval. When the estimate is a single number, the estimate is called a “point estimate.”
Polynomial Regression: Polynomial regression is a form of multiple regression in which powers of a predictor variable are used instead of other predictor variables. In the following example, the criterion (Y) is predicted by X, X^2, and X^3: Y = b1*X + b2*X^2 + b3*X^3 + A
Population: A population is the complete set of observations a researcher is interested in. Contrast this with a sample which is a subset of a population. A population can be defined in a manner convenient for a researcher. For example, one could define a population as all girls in fourth grade in Houston, Texas. Or, a different population is the set of all girls in fourth grade in the United States. Inferential statistics are computed from sample data in order to make inferences about the population.
Positive Association: There is a positive association between variables X and Y if smaller values of X are associated with smaller values of Y and larger values of X are associated with larger values of Y.
Power: In significance testing, power is the probability of rejecting a false null hypothesis.
Precision: A statistic’s precision concerns how close it is expected to be to the parameter it is estimating. Precise statistics vary less from sample to sample. The precision of a statistic is usually defined in terms of its standard error.
Predictor: A predictor variable is a variable used in regression to predict another variable. It is sometimes referred to as an independent variable if it is manipulated rather than just measured.
Probability Density: For a discrete random variable, a probability distribution contains the probability of each possible outcome. However, for a continuous random variable, the probability of any one outcome is zero (if you specify it to enough decimal places). A probability density function is a formula that can be used to compute probabilities of a range of outcomes for a continuous random variable. The total area under the density function is always 1.0, and the value of the function is always greater than or equal to zero.
Probability Distribution: For a discrete random variable, a probability distribution contains the probability of each possible outcome. The sum of all probabilities is always 1.0. See binomial distribution for an example.
Probability Value: In significance testing, the probability value (sometimes called the p value) is the probability of obtaining a statistic as different or more different from the parameter specified in the null hypothesis as the statistic obtained in the experiment. The probability value is computed assuming the null hypothesis is true. The lower the probability value, the stronger the evidence that the null hypothesis is false. Traditionally, the null hypothesis is rejected if the probability value is below 0.05.
Qualitative Variable: Also known as categorical variables, qualitative variables are variables with no natural sense of ordering. They are therefore measured on a nominal scale. For instance, hair color (Black, Brown, Gray, Red, Yellow) is a qualitative variable, as is name (Adam, Becky, Christina, Dave . . .). Qualitative variables can be coded to appear numeric but their numbers are meaningless, as in male=1, female=2. Variables that are not qualitative are known as quantitative variables.
Quantitative Variable: Variables that are measured on a numeric or quantitative scale. Ordinal, interval and ratio scales are quantitative. A country’s population, a person’s shoe size, or a car’s speed are all quantitative variables. Variables that are not quantitative are known as qualitative variables.
Quantile-Quantile Plot: A quantile-quantile or q-q plot is an exploratory graphical device used to check the validity of a distributional assumption for a data set. In general, the basic idea is to compute the theoretically expected value for each data point based on the distribution in question. If the data indeed follow the assumed distribution, then the points on the q-q plot will fall approximately on a straight line.
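In R, a normal q-q plot takes two lines; the data below are randomly generated purely for illustration.
Code
set.seed(42)
x <- rnorm(100, mean = 10, sd = 2)   # simulated data that really are normal

qqnorm(x)   # theoretical normal quantiles vs. observed quantiles
qqline(x)   # reference line; points near the line support the normality assumption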
Random Sampling: The process of selecting a subset of a population for the purposes of statistical inference. Random sampling means that every member of the population is equally likely to be chosen.
Range: The difference between the maximum and minimum values of a variable or distribution. The range is the simplest measure of variability.
Ratio Scale: One of the four basic levels of measurement, a ratio scale is a numerical scale with a true zero point and in which a given size interval has the same interpretation for the entire scale. Weight is a ratio scale; therefore, it is meaningful to say that a 200 pound person weighs twice as much as a 100 pound person.
Regression: Regression means “prediction.” The regression of Y on X means the prediction of Y by X.
Regression Coefficient: A regression coefficient is the slope of the regression line in simple regression or the partial slope in multiple regression.
Regression Line: In linear regression, the line of best fit is called the regression line.
Relative Frequency: The proportion of observations falling into a given class. For example, if a bag of 55 M & M’s has 11 green M&M’s, then the frequency of green M&M’s is 11 and the relative frequency is 11/55 = 0.20. Relative frequencies are often used in histograms, pie charts, and bar graphs.
Relative Frequency Distribution: A relative frequency distribution is just like a frequency distribution except that it consists of the proportions of occurrences instead of the numbers of occurrences for each value (or range of values) of a variable.
Reliability: Although there are many ways to conceive of the reliability of a test, the classical way is to define the reliability as the correlation between two parallel forms of the test. When defined this way, the reliability is the ratio of true score variance to test score variance. Cronbach’s alpha is a common measure of reliability.
Representative Sample: A representative sample is a sample chosen to match the qualities of the population from which it is drawn. With a large sample size, random sampling will approximate a representative sample; stratified random sampling can be used to make a small sample more representative.
Robust: Something is robust if it holds up well in the face of adversity. A measure of central tendency or variability is considered robust if it is not greatly affected by a few extreme scores. A statistical test is considered robust if it works well in spite of moderate violations of the assumptions on which it is based.
Sample: A sample is a subset of a population, often taken for the purpose of statistical inference. Generally, one uses a random sample.
Sampling Distribution: A sampling distribution can be thought of as a relative frequency distribution with a very large number of samples. More precisely, a relative frequency distribution approaches the sampling distribution as the number of samples approaches infinity. When a variable is discrete, the heights of the distribution are probabilities. When a variable is continuous, the class intervals have no width and the heights of the distribution are probability densities.
Scatter Plot: A scatter plot of two variables shows the values of one variable on the Y axis and the values of the other variable on the X axis. Scatter plots are well suited for revealing the relationship between two variables. The scatter plot shown below illustrates the relationship between grip strength and arm strength in a sample of workers.
Significance Level: In significance testing, the significance level is the highest value of a probability value for which the null hypothesis is rejected. Common significance levels are 0.05 and 0.01. If the 0.05 level is used, then the null hypothesis is rejected if the probability value is less than or equal to 0.05.
Significance Testing: A statistical procedure that tests the viability of the null hypothesis. If data (or more extreme data) are very unlikely given that the null hypothesis is true, then the null hypothesis is rejected. If the data or more extreme data are not unlikely, then the null hypothesis is not rejected. If the null hypothesis is rejected, then the result of the test is said to be significant. A statistically significant effect does not mean the effect is important.
Simple Regression: Simple regression is linear regression in which only one predictor variable is used to predict the criterion.
Skew: A distribution is skewed if one tail extends out further than the other. A distribution has a positive skew (is skewed to the right) if the tail to the right is longer. It has a negative skew (skewed to the left) if the tail to the left is longer.
Slope: The slope of a line is the change in Y for each change of one unit of X. It is sometimes defined as “rise over run” which is the same thing. The slope of the black line in the graph is 0.675 because the line increases by 0.675 each time X increases by 1.0.
Standard Deviation: The standard deviation is a widely used measure of variability. It is computed by taking the square root of the variance. An important attribute of the standard deviation as a measure of variability is that, if the mean and standard deviation of a normal distribution are known, it is possible to compute the percentile rank associated with any given score.
Standard Error of the Estimate:
Standard Error: The standard error of a statistic is the standard deviation of the sampling distribution of that statistic. For example, the standard error of the mean is the standard deviation of the sampling distribution of the mean.
Standard Error of Measurement: The standard error of measurement is the standard deviation of the errors of measurement associated with test scores. It equals the standard deviation of the test scores multiplied by the square root of one minus the reliability of the test.
Standard Error of the Mean: The standard error of the mean is the standard deviation of the sampling distribution of the mean. The formula for the standard error of the mean in a population is: sigma_M = sigma / sqrt(N), where sigma is the population standard deviation and N is the sample size.
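In R, the standard error of the mean is usually estimated from a sample; a minimal sketch, where x is a placeholder numeric vector:

```r
x <- c(12, 15, 9, 14, 11, 13)       # placeholder sample data
sem <- sd(x) / sqrt(length(x))      # sample estimate of the standard error of the mean
sem
```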
Standard Normal Distribution: The standard normal distribution is a normal distribution with a mean of 0 and a standard deviation of 1. The transformation from a raw score X to a z score can be done using the following formula: z = (X - mu)/sigma. Transforming a variable in this way is called “standardizing” the variable. It should be kept in mind that if X is not normally distributed, then the transformed variable will not be normally distributed either.
Standardize: A variable is standardized if it has a mean of 0 and a standard deviation of 1. The transformation from a raw score X to a standard score can be done using the following formula: X standardized = (X - mu)/sigma, where mu is the mean and sigma is the standard deviation. Transforming a variable in this way is called “standardizing” the variable. It should be kept in mind that if X is not normally distributed, then the transformed variable will not be normally distributed either.
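Base R provides scale() for this transformation; a minimal sketch with placeholder data:

```r
x <- c(10, 12, 15, 9, 14)           # placeholder data
z <- scale(x)                       # subtracts mean(x) and divides by sd(x)
z
# equivalently: (x - mean(x)) / sd(x)
```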
Statistics: 1. What you are studying right now, also known as statistical analysis, or statistical inference. It is a field of study concerned with summarizing data, interpreting data, and making decisions based on data. 2. A quantity calculated in a sample to estimate a value in a population is called a “statistic.”
Stratified Random Sampling: In stratified random sampling, the population is divided into a number of subgroups (or strata). Random samples are then taken from each subgroup with sample sizes proportional to the size of the subgroup in the population. For instance, if a population contained equal numbers of men and women, and the variable of interest is suspected to vary by gender, one might conduct stratified random sampling to ensure a representative sample.
Sturges’ Rule: One method of determining the number of classes for a histogram, Sturges’ rule is to take 1 + log2(N) classes, rounded to the nearest integer.
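Base R implements this rule as nclass.Sturges(); a minimal sketch with simulated placeholder data:

```r
set.seed(1)
x <- rnorm(50)                       # placeholder data: 50 simulated values
nclass.Sturges(x)                    # suggested number of classes: ceiling(log2(50) + 1) = 7
hist(x, breaks = nclass.Sturges(x))  # histogram using the suggested number of classes
```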
Sum of Squares Error: In linear regression, the sum of squares error is the sum of squared errors of prediction. In analysis of variance, it is the sum of squared deviations from cell means for between-subjects factors and the Subjects x Treatment interaction for within-subject factors.
Symmetric Distribution: In a symmetric distribution, the upper and lower halves of the distribution are mirror images of each other. In a symmetric distribution, the mean is equal to the median.
t distribution: The t distribution is the distribution of a value sampled from a normal distribution divided by an estimate of the distribution’s standard deviation. In practice, the value is typically a statistic such as the mean or the difference between means and the standard deviation is an estimate of the standard error of the statistic.
t test: Most commonly, a significance test of the difference between means based on the t distribution. Other applications include (a) testing the significance of the difference between a sample mean and a hypothesized value of the mean and (b) testing a specific contrast among means.
Third Variable Problem: A type of confounding in which a third variable leads to a mistaken causal relationship between two others. For instance, cities with a greater number of churches have a higher crime rate. However, more churches do not lead to more crime, but instead the third variable, population, leads to both more churches and more crime.
Tukey HSD Test: The “Honestly Significantly Different” (HSD) test developed by the statistician John Tukey to test all pairwise comparisons among means. The test is based on the “studentized range distribution.”
Two Tailed: The last step in significance testing involves calculating the probability that a statistic would differ as much or more from the parameter specified in the null hypothesis as does the statistic obtained in the experiment. A probability computed considering differences in both directions (statistic either larger or smaller than the parameter) is called a two-tailed probability. For example, if a parameter is 0 and the statistic is 12, a two-tailed probability would be the probability of the statistic being either <= -12 or >= 12. Compare with the one-tailed probability, which would be the probability of the statistic being >= 12 if that were the direction specified in advance.
Type I Error: In significance testing, the error of rejecting a true null hypothesis.
Type II Error: In significance testing, the failure to reject a false null hypothesis.
Unbiased: A sample is said to be unbiased when every individual has an equal chance of being chosen from the population. An estimator is unbiased if it does not systematically overestimate or underestimate the parameter it is estimating. In other words, it is unbiased if the mean of the sampling distribution of the statistic is the parameter it is estimating. The sample mean is an unbiased estimate of the population mean.
Variability: Variability refers to the extent to which values differ from one another. That is, how much they vary. Variability can also be thought of as how spread out a distribution is. The standard deviation and the semi-interquartile range are measures of variability.
Variable: Something that can take on different values. For example, different subjects in an experiment weigh different amounts. Therefore “weight” is a variable in the experiment. Or, subjects may be given different doses of a drug. This would make “dosage” a variable. Variables can be dependent or independent, qualitative or quantitative, and continuous or discrete.
Variance: The variance is a widely used measure of variability. It is defined as the mean squared deviation of scores from the mean.
Y Intercept: The Y-intercept of a line is the value of Y at the point where the line intercepts the Y axis. It is the value of Y when X equals 0. For example, a line with a Y intercept of 0.785 crosses the Y axis at the point (0, 0.785).
z score: The number of standard deviations a score is from the mean of its population. The term “standard score” is usually used for normal populations; the terms “z score” and “normal deviate” should only be used in reference to normal distributions. The transformation from a raw score X to a z score can be done using the following formula: z = (X - mu)/sigma. Transforming a variable in this way is called “standardizing” the variable. It should be kept in mind that if X is not normally distributed, then the transformed variable will not be normally distributed either.
9 Practical Problems:
A finance manager claims that the average profit of the games we create is $6000, with a standard deviation of $1000. Find the probability that a random sample of 36 games averages less than $5700 in profit.
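One way to work this problem in R is with pnorm() and the sampling distribution of the mean; a minimal sketch using only the values given in the problem:

```r
# Probability that the mean profit of 36 games is below $5700,
# given mu = 6000 and sigma = 1000 from the claim.
mu    <- 6000
sigma <- 1000
n     <- 36
se    <- sigma / sqrt(n)          # standard error of the mean
pnorm(5700, mean = mu, sd = se)   # P(sample mean < 5700)
```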
Suppose a senior analyst from the company claims that the average profit from the games made by the company is at least $600,000, and that any game that makes less is a fluke, an outlier. Suppose that you suspect the claim may be exaggerated. Using our sample of 50 games, find the average profit. Test the analyst’s claim against your suspicion at the 5% level of significance.
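One way to approach this in R is a one-sample t-test; a minimal sketch, where profit is a placeholder vector standing in for the Profit column of our 50-game sample:

```r
# Placeholder profits; replace with the actual Profit column from the case-study data.
profit <- c(520000, 610000, 480000, 555000, 700000, 450000, 630000, 590000)
mean(profit)                                        # sample average profit
t.test(profit, mu = 600000, alternative = "less")   # H0: mean >= 600000 vs H1: mean < 600000
```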
The CEO of the business claims that because the average profit from the past 3 Nintendo race car games was less than $200,000, the next Nintendo race car game we make will also have an average profit of less than $200,000. Given this information, find the probability that the profit from a random sample of 3 of these games averages less than $200,000. What does this suggest about the claim made by the CEO?
“In Game Purchases” (IGP) are a growing revenue stream for many video-game companies. One survey showed that up to 20% of players take part in IGP. Based on this information, what is the probability that, in a random sample of 10 gamers, 6 will take part in IGP? Do we need our dataset to answer this question?
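One way to work this in R is with the binomial probability function dbinom(); a minimal sketch using only the values given in the problem:

```r
# P(exactly 6 of 10 randomly sampled gamers take part in IGP), with p = 0.20
# taken from the survey; no dataset is needed for this calculation.
dbinom(6, size = 10, prob = 0.20)
```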
A freshly hired product analyst alleges that “in game purchases” are favored by the rating agency IGN Entertainment Inc. His claim is that this would bias any metrics based on these ratings that we use to measure the performance of our games: “…essentially this would/could have us making games for high ratings and not for high sales or profit (for the customers), for example putting IGP in every game because it will raise our IGN rating.” Is there a significant association between IGP and IGN Rating?
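One way to check for an association between two categorical variables is a Chi2 test of independence; a minimal sketch, where the counts are placeholders standing in for a table of IGP against IGN rating built from our dataset:

```r
# Placeholder contingency table; in practice build it with something like
# table(games$IGP, games$IGN_Rating) using the dataset's actual column names.
tbl <- matrix(c(8, 5, 4,
                6, 12, 10),
              nrow = 2, byrow = TRUE,
              dimnames = list(IGP = c("yes", "no"),
                              Rating = c("low", "medium", "high")))
chisq.test(tbl)   # small expected counts may trigger a warning; see the assumptions section
```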
The owner wants to understand the relationship between sales and the amount spent to develop, market, and distribute a game. She suspects the sales of new games can be predicted from the amount of time, money, and effort spent on the game, regardless of game type or console.
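One way to explore this relationship is a regression model; a minimal sketch with placeholder data and placeholder column names (Sales, RnD, Marketing; the case-study dataset may use different names):

```r
# Placeholder data standing in for the case-study games.
games_demo <- data.frame(Sales     = c(120, 95, 150, 80, 200, 60),
                         RnD       = c(30, 20, 45, 15, 60, 10),
                         Marketing = c(25, 18, 30, 12, 55, 8))
fit <- lm(Sales ~ RnD + Marketing, data = games_demo)
summary(fit)   # coefficients, R-squared, and p-values
```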
Suppose the Operations Manager claimed that Research & Development, Marketing, and Administration have roughly the same budget on every project. How can we check this claim?
On average, the company spends the same amount on marketing as it does on Research & Development per game.
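Because each game has both a marketing and an R&D budget, one way to check this claim is a paired t-test; a minimal sketch with placeholder spend values standing in for the dataset’s Marketing and R&D columns:

```r
# Placeholder per-game spend; replace with the actual Marketing and R&D columns.
marketing <- c(25, 18, 30, 12, 55, 8, 22, 40)
rnd       <- c(30, 20, 45, 15, 60, 10, 25, 35)
t.test(marketing, rnd, paired = TRUE)   # H0: the mean per-game difference is 0
```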
Suppose a manager claimed that Marketing and Administration have roughly the same average budget on every project. How can we check this claim?
How could we check the claim that the average sales of MOBA games before 2012 were significantly higher than the sales of MOBA games after 2012?
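One way to check this claim is a two-sample t-test; a minimal sketch, where the sales figures are placeholders standing in for the MOBA games in our dataset split at 2012:

```r
# Placeholder MOBA sales; replace with the case-study Sales values for MOBA games
# released before and after 2012.
sales_before_2012 <- c(210, 180, 250, 190, 230)
sales_after_2012  <- c(160, 175, 150, 140, 170)
t.test(sales_before_2012, sales_after_2012, alternative = "greater")  # H1: before > after
```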
The CFO of the business claims that IGP is a main driver of sales among all games. Given this assumption, how could we check this?
The marketing manager wants to know the average sales of MOBA games after 2012.
The marketing manager wants to estimate the average sales of MOBA games before 2012.
How could we check the claim that average sales of MOBA games before 2012 were significantly lower than sales of RTS games after 2012?
The marketing manager wants to estimate the minimum cost of making a high rated MMO game.
Dataset 2 contains the play time in minutes from a high-earning MOBA game. A new analyst is convinced that time of day affects playtime regardless of country. Test whether country and time of day have an effect on playtime. Is the analyst’s claim justified? Explain.
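One way to test both factors at once is a two-way ANOVA; a minimal sketch, where d2_demo is a placeholder standing in for Dataset 2 and the column names (Playtime, TimeOfDay, Country) are assumptions:

```r
# Placeholder version of Dataset 2.
d2_demo <- data.frame(
  Playtime  = c(620, 700, 540, 810, 760, 590, 480, 650, 720, 830, 510, 600),
  TimeOfDay = rep(c("morning", "evening"), times = 6),
  Country   = rep(c("US", "UK", "JP"), each = 4)
)
fit <- aov(Playtime ~ TimeOfDay * Country, data = d2_demo)
summary(fit)   # main effects of time of day and country, plus their interaction
```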
Suppose the Operations Manager claimed that the median number of minutes played in the US can’t be more than 750 minutes. A sales manager doubts the accuracy of this claim. Can you reject the claim given the data?
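Because the claim is about a median, one way to check it is a one-sample Wilcoxon signed-rank test; a minimal sketch with placeholder US playtimes standing in for the US rows of Dataset 2:

```r
# Placeholder US playtimes in minutes; replace with the actual US observations.
us_playtime <- c(620, 810, 760, 930, 700, 680, 890, 540)
# H0: median <= 750 vs H1: median > 750
wilcox.test(us_playtime, mu = 750, alternative = "greater")
```

A sign test is another option if the symmetry assumption of the Wilcoxon test is doubtful.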
9.1 Statistical Test Assumptions:
Ljung-Box test assumptions: This procedure requires certain assumptions about the data which we will not discuss; see (10.3 - Regression with Autoregressive Errors | STAT 462, n.d.).
Runs Test for randomness assumptions:
Assumption #1: Independence of observations.
Chi2 Goodness of Fit Test assumptions:
Assumption #1: One categorical variable.
Assumption #2: Independence of random observations.
Assumption #3: The groups of the categorical variable must be mutually exclusive.
Assumption #4: There must be at least 5 expected frequencies in each level of the variable.
Chi2 Test of Independence assumptions:
Assumption #1: Two categorical variables.
Assumption #2: Independence of random observations.
Assumption #3: The groups of the categorical variable must be mutually exclusive.
Assumption #4: There must be at least 5 expected frequencies in each level of the variable.
t-Test assumptions:
Assumption #1: Two continuous variables.
Assumption #2: Independence of random observations.
Assumption #3: Both variables are approximately normally distributed.
Assumption #4: Both variables have approximately the same variance.
ANOVA assumptions:
Assumption #1: Multiple continuous variables.
Assumption #2: Independence of random observations.
Assumption #3: All variables approximately normally distributed.
Assumption #4: All variables have approximately the same variance.
Assumption #5: There is no multicollinearity among feature variables.
Linear Regression assumptions:
Assumption #1: Multiple feature variables of any type, One continuous target variable.
Assumption #2: The relationship between the features and the target variable is linear.
Assumption #3: Independence of random observations.
Assumption #4: All variables approximately normally distributed.
Assumption #5: All variables have approximately the same variance.
Assumption #6: There is no multicollinearity among feature variables.
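A minimal sketch of how some of the assumptions listed above might be checked in R, using placeholder data and placeholder column names (Sales, RnD, Marketing):

```r
# Placeholder data; in practice fit the model to the case-study dataset.
games_demo <- data.frame(Sales     = c(120, 95, 150, 80, 200, 60, 130, 110),
                         RnD       = c(30, 20, 45, 15, 60, 10, 35, 25),
                         Marketing = c(25, 18, 30, 12, 55, 8, 28, 20))
fit <- lm(Sales ~ RnD + Marketing, data = games_demo)

shapiro.test(residuals(fit))              # normality of the residuals
plot(fit, which = 1)                      # residuals vs fitted: roughly constant spread?
cor(games_demo[, c("RnD", "Marketing")])  # large correlations hint at multicollinearity
```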
10 References and Resources
R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Comprehensive Guide to Grouping and Aggregating with Pandas—Practical Business Python. (n.d.). Retrieved April 15, 2023, from https://pbpython.com/groupby-agg.html
Hassani, H., & Yeganegi, M. R. (2020). Selecting optimal lag order in Ljung–Box test. Physica A: Statistical Mechanics and Its Applications, 541, 123700. https://doi.org/10.1016/j.physa.2019.123700
Scott, D. M. (2009). Statistics, Inferential. In R. Kitchin & N. Thrift (Eds.), International Encyclopedia of Human Geography (pp. 429–435). Elsevier. https://doi.org/10.1016/B978-008044910-4.00535-6
R. Pruim, D. T. Kaplan and N. J. Horton. The mosaic Package: Helping Students to ‘Think with Data’ Using R (2017). The R Journal, 9(1):77-102.
‘corrplot’:
Taiyun Wei and Viliam Simko (2021). R package ‘corrplot’: Visualization of a Correlation Matrix (Version 0.92). https://github.com/taiyun/corrplot
‘psych’:
William Revelle (2023). _psych: Procedures for Psychological, Psychometric, and Personality Research_. Northwestern University, Evanston, Illinois. R package version 2.3.3, https://CRAN.R-project.org/package=psych.
Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0
‘forecast’:
Hyndman R, Athanasopoulos G, Bergmeir C, Caceres G, Chhay L, O’Hara-Wild M, Petropoulos F, Razbash S, Wang E, Yasmeen F (2022). _forecast: Forecasting functions for time series and linear models_. R package version 8.18,<https://pkg.robjhyndman.com/forecast/>.
Hyndman RJ, Khandakar Y (2008). “Automatic time series forecasting: the forecast package for R.” _Journal of Statistical Software_, *27*(3), 1-22. https://doi.org/10.18637/jss.v027.i03
My name is Joshua Lizardi. For the past 7 years, I have worked for various institutions teaching a wide range of courses in Math, Statistics, and Technology. These included Quantitative Reasoning, Calculus, Applied Technical Mathematics, Remedial Mathematics, Statistics, Computers & Office Automation, Introductory College Algebra, Intermediate College Algebra, and Business Statistics.
I hold a bachelor’s in mathematics (Mercy College), a master’s in applied mathematics (Purdue University), and a master’s in data analytics (Western Governors University). I also hold a few certifications including “SAS Certified Statistical Business Analyst SAS 9”, “SAS Certified Base Programmer SAS 9”, “Oracle Database SQL Certified Associate”.
Subjects like mathematics, statistics, and computer science should not be taught as if they were spectator sports; the best way to learn these subjects is to perform them. Although understanding textbooks and lecture notes is valuable, the learning that comes from one’s own attempts at solving problems is the key to becoming competent in the subject overall. I have always been passionate about mathematics, statistics, and computer science, and I enjoy encouraging students to see the utility of these subjects.
SPECIALTIES
Applied Mathematics, Applied Statistics, Data Analytics, Data Science, Machine Learning, Artificial Intelligence
SKILLS
R, Python, SQL, SAS, MiniTab, Tableau, Power BI, Microsoft Office