UNIVARIATE STATISTICS TERMINOLOGIES

This mini project covers the basic terminology of univariate statistics in R. It involves the following:

1. Analysing a dataset and finding its mean, median, mode, standard deviation and quartile ranges.
2. Loading the psych library and learning how it stores statistical summaries in data frames.
3. Plotting boxplots and histograms.
4. Summarising categorical data using frequencies. Here we use the Arthritis dataset and analyse it with cross tables, xtabs, collapsing tables, adding margins, etc.
5. Assignment: to tell whether automatic transmission results in better mileage of the cars or not (using the mtcars dataset).
6. Assignment: to check whether age impacts the treatment outcome of arthritis patients, i.e. the "Treated" people (using the Arthritis dataset).

1. Let's load the mtcars dataset, which consists of 32 observations and 11 variables.

data("mtcars")
View(mtcars)

The basic statistical functions are as follows:

mean(mtcars$mpg) # gives the mean of the variable (column) mpg

## [1] 20.09062

median(mtcars$mpg) # gives the median of mpg

## [1] 19.2

min(mtcars$mpg) # gives the minimum value of mpg

## [1] 10.4

max(mtcars$mpg) # gives the maximum value of mpg

## [1] 33.9

var(mtcars$mpg) # gives the variance of mpg

## [1] 36.3241

sd(mtcars$mpg) # gives the standard deviation of mpg

## [1] 6.026948

IQR(mtcars$mpg) # gives the interquartile range of mpg

## [1] 7.375
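The introduction lists the mode and the quartiles, but base R has no built-in function for the statistical mode (mode() reports the storage type of an object instead). A minimal sketch of one way to get it, where stat_mode is a hypothetical helper name and not a base R function:

```r
data("mtcars")

# stat_mode is a hypothetical helper (not part of base R):
# it returns the most frequent value(s) of a numeric vector.
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

stat_mode(mtcars$cyl)  # most common number of cylinders

# quantile() returns the quartiles themselves; IQR() is the
# distance between the 3rd and the 1st quartile.
q <- quantile(mtcars$mpg, probs = c(0.25, 0.5, 0.75))
q
q[["75%"]] - q[["25%"]]  # same value as IQR(mtcars$mpg)
```

If several values are tied for the highest count, the helper returns all of them rather than picking one.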

summary(mtcars) # summarises all variables at once, giving the key statistics of each

##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

s=summary(mtcars$mpg) #stores the summary of column mpg in table format in s.
class(s)

## [1] "summaryDefault" "table"

names(s) #gives the names of table data

## [1] "Min."    "1st Qu." "Median"  "Mean"    "3rd Qu." "Max."

We can access the values stored in s with the following commands:

s["Max."] # to access the maximum value in s

## Max. 
## 33.9

s["1st Qu."] # to access the first quartile value in s

## 1st Qu. 
##   15.42

ss=summary(mtcars) #similarly we can store entire summary and access individual values.
class(ss)

## [1] "table"

names(ss)

## NULL

How to access the values stored in ss:

ss[1,2] # 1st row and 2nd column: the minimum of the cyl column

## [1] "Min.   :4.000  "

ss[5,6] #5th row n 6th col:: 3rd quartile of wt variable.

## [1] "3rd Qu.:3.610  "
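The entries of ss are formatted strings, which is awkward if the numbers are needed for further calculation. One possible workaround, sketched here with the hypothetical name num_summary, is to apply summary() column by column so that the result stays numeric:

```r
data("mtcars")

# Apply summary() to each column; because every result has the same six
# elements, sapply() simplifies them into a numeric matrix.
num_summary <- sapply(mtcars, summary)

class(num_summary[1, 2])       # a number, not a formatted string
num_summary["Min.", "cyl"]     # minimum of cyl
num_summary["3rd Qu.", "wt"]   # 3rd quartile of wt
```

Rows are the statistics and columns are the variables, so values can be looked up by name instead of by position.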

2. Loading the psych library. Its describe function stores the summary data in a different form.

library(psych)
describe(mtcars) # gives detailed statistical measurements of all column variables

##      vars  n   mean     sd median trimmed    mad   min    max  range  skew
## mpg     1 32  20.09   6.03  19.20   19.70   5.41 10.40  33.90  23.50  0.61
## cyl     2 32   6.19   1.79   6.00    6.23   2.97  4.00   8.00   4.00 -0.17
## disp    3 32 230.72 123.94 196.30  222.52 140.48 71.10 472.00 400.90  0.38
## hp      4 32 146.69  68.56 123.00  141.19  77.10 52.00 335.00 283.00  0.73
## drat    5 32   3.60   0.53   3.70    3.58   0.70  2.76   4.93   2.17  0.27
## wt      6 32   3.22   0.98   3.33    3.15   0.77  1.51   5.42   3.91  0.42
## qsec    7 32  17.85   1.79  17.71   17.83   1.42 14.50  22.90   8.40  0.37
## vs      8 32   0.44   0.50   0.00    0.42   0.00  0.00   1.00   1.00  0.24
## am      9 32   0.41   0.50   0.00    0.38   0.00  0.00   1.00   1.00  0.36
## gear   10 32   3.69   0.74   4.00    3.62   1.48  3.00   5.00   2.00  0.53
## carb   11 32   2.81   1.62   2.00    2.65   1.48  1.00   8.00   7.00  1.05
##      kurtosis    se
## mpg     -0.37  1.07
## cyl     -1.76  0.32
## disp    -1.21 21.91
## hp      -0.14 12.12
## drat    -0.71  0.09
## wt      -0.02  0.17
## qsec     0.34  0.32
## vs      -2.00  0.09
## am      -1.92  0.09
## gear    -1.07  0.13
## carb     1.26  0.29

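Kurtosis above is a measure of how peaked and heavy-tailed a distribution is. The values shown are relative to the normal distribution, so a value near 0 means a normal-like shape and a binary variable such as vs comes out close to -2.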
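To make the term concrete, the sketch below computes the plain moment-based excess kurtosis by hand. The function name excess_kurtosis is an assumption for illustration, and describe() may apply a small-sample adjustment, so its figure for the same column need not match this one exactly:

```r
# Moment-based excess kurtosis: m4 / m2^2 - 3, where m2 and m4 are the
# second and fourth central moments. A normal distribution gives about 0.
# excess_kurtosis is a hypothetical helper, not a psych function.
excess_kurtosis <- function(x) {
  d <- x - mean(x)
  m2 <- mean(d^2)
  m4 <- mean(d^4)
  m4 / m2^2 - 3
}

excess_kurtosis(c(1, 2, 3, 4, 5))   # flat, evenly spread data: negative

set.seed(1)                          # arbitrary seed for reproducibility
normal_kurt <- excess_kurtosis(rnorm(100000))
normal_kurt                          # large normal sample: close to 0
```

Negative values indicate a flatter shape with lighter tails than the normal distribution, and positive values indicate heavier tails.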

Store it in an object:

sss=describe(mtcars)
class(sss)  # the class now includes data.frame, whereas ss was a table

## [1] "psych"      "describe"   "data.frame"

View(sss)
View(ss) # note the difference: ss was a table, whereas sss is a data frame

Accessing values within the data frame:

rownames(sss) # gives the row names, i.e. the variable names of mtcars

##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

colnames(sss) #gives all the statistical measurement names.

##  [1] "vars"     "n"        "mean"     "sd"       "median"   "trimmed" 
##  [7] "mad"      "min"      "max"      "range"    "skew"     "kurtosis"
## [13] "se"

Let's access the values in the above data frame.

sss["gear","max"] # gear is the row and max is the column

## [1] 5

sss[c("gear","mpg","wt"),c("max","sd","skew")] #multiple elements access

##        max   sd skew
## gear  5.00 0.74 0.53
## mpg  33.90 6.03 0.61
## wt    5.42 0.98 0.42
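One practical benefit of a data frame over a table of strings is that it can be sorted and filtered like any other data. The sketch below builds a small stand-in data frame with base R (the name col_stats is hypothetical, and it only mimics a few of the describe() columns) so that it runs even without psych installed:

```r
data("mtcars")

# A stand-in for part of the describe() output, built with base R only.
col_stats <- data.frame(
  mean = sapply(mtcars, mean),
  sd   = sapply(mtcars, sd),
  max  = sapply(mtcars, max)
)

# Sort the variables by standard deviation, largest first.
col_stats[order(col_stats$sd, decreasing = TRUE), ]

# Keep only the variables whose maximum exceeds 100.
big_vars <- rownames(col_stats[col_stats$max > 100, ])
big_vars
```

The same indexing works on sss itself, since it is also a data frame with variables as rows and statistics as columns.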

3. Plotting boxplots and histograms

A boxplot is made as follows:

boxplot(mtcars)

boxplot(mtcars$mpg)

boxplot(mtcars$hp)
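The numbers behind a boxplot can be inspected with boxplot.stats(), which returns the five values drawn as the box and whiskers plus any points flagged as outliers. A short sketch (the variable name hp_outliers is chosen here for illustration):

```r
data("mtcars")

# $stats holds: lower whisker, lower hinge, median, upper hinge, upper whisker
mpg_box <- boxplot.stats(mtcars$mpg)$stats
mpg_box

# $out lists the points drawn individually beyond the whiskers
hp_outliers <- boxplot.stats(mtcars$hp)$out
hp_outliers

# Side-by-side boxplots of one variable split by a grouping variable
boxplot(mpg ~ cyl, data = mtcars, xlab = "cyl", ylab = "mpg")
```

The formula form in the last line draws one box per group, which is often more informative than a single box for the whole column.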

Plotting histograms:

hist(mtcars$vs)

hist(mtcars$mpg,breaks=10) # avoid asking for many breaks when there are few observations

Let's generate a random sample from the normal distribution (x) and plot its histogram.

x=rnorm(10000) 
hist(x)

hist(x,breaks=100)
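Because x is drawn from a standard normal distribution, the histogram should follow the bell curve. A sketch that checks this visually by plotting densities instead of counts (freq = FALSE) and overlaying the theoretical curve; the seed is an arbitrary choice made only for reproducibility:

```r
set.seed(42)            # arbitrary seed, only so the sample is reproducible
x <- rnorm(10000)

# freq = FALSE rescales the bars so that their total area is 1
h <- hist(x, breaks = 100, freq = FALSE)

# Overlay the standard normal density on the histogram
curve(dnorm(x), add = TRUE, lwd = 2)

sum(h$counts)                     # every observation falls in some bin
sum(h$density * diff(h$breaks))   # total area of the bars
```

Note that breaks is treated as a suggestion, so the number of bars actually drawn may differ somewhat from the value requested.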

4. Let's summarise categorical data using frequencies.

Loading the libraries vcd and grid and the dataset Arthritis, which consists of 84 observations and 5 variables.

The Arthritis dataset contains data from Koch & Edwards (1988), from a double-blind clinical trial investigating a new treatment for rheumatoid arthritis.

The format of the dataset is as follows:

ID: patient ID.

Treatment: factor indicating treatment (Placebo, Treated).

Sex: factor indicating sex (Female, Male).

Age: age of patient.

Improved: ordered factor indicating treatment outcome (None, Some, Marked).

library(vcd)

## Loading required package: grid

library(grid)
data("Arthritis")
View(Arthritis)

Let's summarise it:

summary(Arthritis)

##        ID          Treatment      Sex          Age          Improved 
##  Min.   : 1.00   Placebo:43   Female:59   Min.   :23.00   None  :42  
##  1st Qu.:21.75   Treated:41   Male  :25   1st Qu.:46.00   Some  :14  
##  Median :42.50                            Median :57.00   Marked:28  
##  Mean   :42.50                            Mean   :53.36              
##  3rd Qu.:63.25                            3rd Qu.:63.00              
##  Max.   :84.00                            Max.   :74.00

describe(Arthritis)

##            vars  n  mean    sd median trimmed   mad min max range  skew
## ID            1 84 42.50 24.39   42.5   42.50 31.13   1  84    83  0.00
## Treatment*    2 84  1.49  0.50    1.0    1.49  0.00   1   2     1  0.05
## Sex*          3 84  1.30  0.46    1.0    1.25  0.00   1   2     1  0.87
## Age           4 84 53.36 12.77   57.0   54.44 10.38  23  74    51 -0.76
## Improved*     5 84  1.83  0.90    1.5    1.79  0.74   1   3     2  0.33
##            kurtosis   se
## ID            -1.24 2.66
## Treatment*    -2.02 0.05
## Sex*          -1.26 0.05
## Age           -0.45 1.39
## Improved*     -1.71 0.10

tab=table(Arthritis$Improved) # stores the count of each value of the variable Improved in tab
tab

## 
##   None   Some Marked 
##     42     14     28

prop.table converts the counts into proportions; multiplying by 100 gives percentages.

t=round(prop.table(tab)*100,2)
t

## 
##   None   Some Marked 
##  50.00  16.67  33.33

The output shows that out of 84 patients, 50% showed no improvement, 16.67% had some improvement and 33.33% had marked improvement.

Let's create tables with more dimensions, i.e. cross tables.

tab=table(Arthritis$Improved,Arthritis$Sex)
tab

##         
##          Female Male
##   None       25   17
##   Some       12    2
##   Marked     22    6

class(Arthritis) # it is a data frame

## [1] "data.frame"

Another way to create a cross table is xtabs:

xtabs(~Improved+Sex,data=Arthritis)

##         Sex
## Improved Female Male
##   None       25   17
##   Some       12    2
##   Marked     22    6

xtabs(~Improved+Treatment+Sex,data=Arthritis)

## , , Sex = Female
## 
##         Treatment
## Improved Placebo Treated
##   None        19       6
##   Some         7       5
##   Marked       6      16
## 
## , , Sex = Male
## 
##         Treatment
## Improved Placebo Treated
##   None        10       7
##   Some         0       2
##   Marked       1       5

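The difference between the xtabs and table functions is mainly the syntax. The counts are the same, although xtabs also labels the dimensions with the variable names.

Let's create a two-way table between Treatment and Improved for further analysis.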

tab=table(Arthritis$Treatment,Arthritis$Improved)
tab

##          
##           None Some Marked
##   Placebo   29    7      7
##   Treated   13    7     21

Output: most patients who received the placebo had no improvement (29 of 43), while about half of the treated patients had marked improvement (21 of 41). Instead of raw counts, let's convert the table into percentages using prop.table.

round(prop.table(tab)*100,2)

##          
##            None  Some Marked
##   Placebo 34.52  8.33   8.33
##   Treated 15.48  8.33  25.00

NOTE: These percentages are of the overall total, i.e. 34.52% is out of all 84 entries and not out of the people who received the placebo. We need percentages within the placebo group and within the treated group, hence we pass 1 for the first dimension (Treatment); passing 2 would give percentages within each level of Improved.

final=round(prop.table(tab,1)*100,2) # percentages within each treatment, i.e. each row adds up to 100%
final

##          
##            None  Some Marked
##   Placebo 67.44 16.28  16.28
##   Treated 31.71 17.07  51.22

addmargins(final,2)

##          
##             None   Some Marked    Sum
##   Placebo  67.44  16.28  16.28 100.00
##   Treated  31.71  17.07  51.22 100.00

CONCLUSION: 67% of the patients who received the placebo had no improvement, while 51% of the treated patients showed marked improvement. Hence the treatment is better than giving a placebo to arthritis patients.
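The second argument of prop.table only chooses which totals to divide by. The sketch below rebuilds the two-way table from the counts printed above, so it runs without the vcd package, and reproduces the row and column percentages by hand:

```r
# Counts copied from the Treatment x Improved table above
tab <- as.table(matrix(
  c(29, 13, 7, 7, 7, 21), nrow = 2,
  dimnames = list(Treatment = c("Placebo", "Treated"),
                  Improved  = c("None", "Some", "Marked"))
))

# Margin 1: divide every cell by its row total
row_pct <- sweep(tab, 1, rowSums(tab), "/") * 100
round(row_pct, 2)

# Margin 2: divide every cell by its column total
col_pct <- sweep(tab, 2, colSums(tab), "/") * 100
round(col_pct, 2)

# Both agree with prop.table
all.equal(as.vector(row_pct), as.vector(prop.table(tab, 1) * 100))
all.equal(as.vector(col_pct), as.vector(prop.table(tab, 2) * 100))
```

Row percentages answer "what happened to the patients in each treatment group", while column percentages answer "which treatment did the patients with a given outcome receive".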

Let's check the summary across improvement.

final1=round(prop.table(tab,2)*100,2) # percentages within each level of Improved, i.e. each column adds up to 100%
addmargins(final1,1)

##          
##             None   Some Marked
##   Placebo  69.05  50.00  25.00
##   Treated  30.95  50.00  75.00
##   Sum     100.00 100.00 100.00

Let's see how to collapse a table.

tab

##          
##           None Some Marked
##   Placebo   29    7      7
##   Treated   13    7     21

margin.table(tab) # collapses the table to its grand total: 84 observations

## [1] 84

margin.table(tab,1)

## 
## Placebo Treated 
##      43      41

margin.table(tab,2)

## 
##   None   Some Marked 
##     42     14     28

Let's add margins.

tab=table(Arthritis$Treatment,Arthritis$Improved)
tab

##          
##           None Some Marked
##   Placebo   29    7      7
##   Treated   13    7     21

addmargins(tab)

##          
##           None Some Marked Sum
##   Placebo   29    7      7  43
##   Treated   13    7     21  41
##   Sum       42   14     28  84

addmargins(tab,1) # adds a Sum row: the counts are added up over the treatments

##          
##           None Some Marked
##   Placebo   29    7      7
##   Treated   13    7     21
##   Sum       42   14     28

addmargins(tab,2) # adds a Sum column: the counts are added up over the improvement levels

##          
##           None Some Marked Sum
##   Placebo   29    7      7  43
##   Treated   13    7     21  41

Let's see a three-dimensional example.

tab3=xtabs(~Treatment+Sex+Improved,data=Arthritis)
tab3

## , , Improved = None
## 
##          Sex
## Treatment Female Male
##   Placebo     19   10
##   Treated      6    7
## 
## , , Improved = Some
## 
##          Sex
## Treatment Female Male
##   Placebo      7    0
##   Treated      5    2
## 
## , , Improved = Marked
## 
##          Sex
## Treatment Female Male
##   Placebo      6    1
##   Treated     16    5

ftable(tab3) # displays the table in a flatter, more readable layout

##                  Improved None Some Marked
## Treatment Sex                             
## Placebo   Female            19    7      6
##           Male              10    0      1
## Treated   Female             6    5     16
##           Male               7    2      5

#=======================
margin.table(tab3)

## [1] 84

margin.table(tab3,1)

## Treatment
## Placebo Treated 
##      43      41

margin.table(tab3,c(1,3))

##          Improved
## Treatment None Some Marked
##   Placebo   29    7      7
##   Treated   13    7     21

margin.table(tab3,c(1,2))

##          Sex
## Treatment Female Male
##   Placebo     32   11
##   Treated     27   14
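Collapsing a three-way table with margin.table simply sums the counts over the dimensions that are left out. A sketch, with the array rebuilt from the counts printed above (so vcd is not needed) and apply() used as the by-hand equivalent:

```r
# Counts copied from the tab3 output above, in Treatment x Sex x Improved order
tab3 <- as.table(array(
  c(19, 6, 10, 7,    # Improved = None
     7, 5,  0, 2,    # Improved = Some
     6, 16, 1, 5),   # Improved = Marked
  dim = c(2, 2, 3),
  dimnames = list(Treatment = c("Placebo", "Treated"),
                  Sex       = c("Female", "Male"),
                  Improved  = c("None", "Some", "Marked"))
))

# Keep dimensions 1 and 3 (Treatment, Improved): Sex is summed out
by_hand <- apply(tab3, c(1, 3), sum)
by_hand

all(by_hand == margin.table(tab3, c(1, 3)))   # same counts either way
```

The vector passed as the second argument always names the dimensions to keep; everything else is added up.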

ftable(prop.table(tab3,c(1,3))*100) # percentages within each Treatment x Improved combination, e.g. the Female and Male values for Placebo/None add up to 100%

##                  Improved      None      Some    Marked
## Treatment Sex                                          
## Placebo   Female           65.51724 100.00000  85.71429
##           Male             34.48276   0.00000  14.28571
## Treated   Female           46.15385  71.42857  76.19048
##           Male             53.84615  28.57143  23.80952

# likewise, the Female and Male values for Treated/Marked add up to 100%

ftable(addmargins(prop.table(tab3,c(1,2)),3)) # 3 is the Improved dimension, so the Sum column adds up the improvement levels

##                  Improved       None       Some     Marked        Sum
## Treatment Sex                                                        
## Placebo   Female          0.59375000 0.21875000 0.18750000 1.00000000
##           Male            0.90909091 0.00000000 0.09090909 1.00000000
## Treated   Female          0.22222222 0.18518519 0.59259259 1.00000000
##           Male            0.50000000 0.14285714 0.35714286 1.00000000

CONCLUSION: 59% of the females who were treated had marked improvement, whereas 91% of the males on placebo and 50% of the treated males had no improvement. Hence we can say that the drug has more impact on females than on males.
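The comparison behind this conclusion can be made explicit by subtracting the placebo rate of marked improvement from the treated rate within each sex. A sketch, again rebuilding the three-way counts from the output above rather than loading vcd (the names rates, marked and gain are chosen here for illustration):

```r
# Counts copied from the tab3 output above, in Treatment x Sex x Improved order
tab3 <- array(
  c(19, 6, 10, 7,  7, 5, 0, 2,  6, 16, 1, 5),
  dim = c(2, 2, 3),
  dimnames = list(Treatment = c("Placebo", "Treated"),
                  Sex       = c("Female", "Male"),
                  Improved  = c("None", "Some", "Marked"))
)

# Proportions within each Treatment x Sex group
rates <- prop.table(tab3, c(1, 2))

# Share of each group with marked improvement
marked <- rates[, , "Marked"]
round(marked, 3)

# Treated rate minus placebo rate, separately for each sex
gain <- marked["Treated", ] - marked["Placebo", ]
round(gain, 3)
```

The gain comes out larger for females than for males. With only 25 males in the data, the male figures rest on small counts and should be read with caution.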

5. QUESTION: To tell whether automatic transmission results in better mileage of the cars or not. Let's load the mtcars data.

data(mtcars)
View(mtcars)

Let's create a cross table of mileage and transmission type.

mycars=mtcars[,c("mpg","am")]
mycars

##                      mpg am
## Mazda RX4           21.0  1
## Mazda RX4 Wag       21.0  1
## Datsun 710          22.8  1
## Hornet 4 Drive      21.4  0
## Hornet Sportabout   18.7  0
## Valiant             18.1  0
## Duster 360          14.3  0
## Merc 240D           24.4  0
## Merc 230            22.8  0
## Merc 280            19.2  0
## Merc 280C           17.8  0
## Merc 450SE          16.4  0
## Merc 450SL          17.3  0
## Merc 450SLC         15.2  0
## Cadillac Fleetwood  10.4  0
## Lincoln Continental 10.4  0
## Chrysler Imperial   14.7  0
## Fiat 128            32.4  1
## Honda Civic         30.4  1
## Toyota Corolla      33.9  1
## Toyota Corona       21.5  0
## Dodge Challenger    15.5  0
## AMC Javelin         15.2  0
## Camaro Z28          13.3  0
## Pontiac Firebird    19.2  0
## Fiat X1-9           27.3  1
## Porsche 914-2       26.0  1
## Lotus Europa        30.4  1
## Ford Pantera L      15.8  1
## Ferrari Dino        19.7  1
## Maserati Bora       15.0  1
## Volvo 142E          21.4  1

xtabs(~am+mpg,data=mycars) # a bad idea: nothing can be read from this, because almost every mpg value occurs only once or twice

##    mpg
## am  10.4 13.3 14.3 14.7 15 15.2 15.5 15.8 16.4 17.3 17.8 18.1 18.7 19.2
##   0    2    1    1    1  0    2    1    0    1    1    1    1    1    2
##   1    0    0    0    0  1    0    0    1    0    0    0    0    0    0
##    mpg
## am  19.7 21 21.4 21.5 22.8 24.4 26 27.3 30.4 32.4 33.9
##   0    0  0    1    1    1    1  0    0    0    0    0
##   1    1  2    1    0    1    0  1    1    2    1    1

mean(mycars$mpg)

## [1] 20.09062

Let's subset the data according to the am values 0 (automatic) and 1 (manual).

mycars.am.0=mycars[mycars$am==0,]
mycars.am.0

##                      mpg am
## Hornet 4 Drive      21.4  0
## Hornet Sportabout   18.7  0
## Valiant             18.1  0
## Duster 360          14.3  0
## Merc 240D           24.4  0
## Merc 230            22.8  0
## Merc 280            19.2  0
## Merc 280C           17.8  0
## Merc 450SE          16.4  0
## Merc 450SL          17.3  0
## Merc 450SLC         15.2  0
## Cadillac Fleetwood  10.4  0
## Lincoln Continental 10.4  0
## Chrysler Imperial   14.7  0
## Toyota Corona       21.5  0
## Dodge Challenger    15.5  0
## AMC Javelin         15.2  0
## Camaro Z28          13.3  0
## Pontiac Firebird    19.2  0

mycars.am.1=mycars[mycars$am==1,]

Now let's find the mean mpg of both subsets, for am = 0 and am = 1.

mean(mycars.am.0$mpg)

## [1] 17.14737

mean(mycars.am.1$mpg)

## [1] 24.39231

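CONCLUSION: In mtcars, am = 0 denotes automatic and am = 1 denotes manual transmission. The average mileage (mpg) is therefore better when the transmission is manual (24.39 mpg, am = 1) than when it is automatic (17.15 mpg, am = 0), so automatic transmission does not give better mileage in this dataset.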
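The subset-then-average steps can also be done in a single call. The sketch below labels the am codes first, following the mtcars documentation (0 = automatic, 1 = manual); the label strings and variable names are chosen here for readability:

```r
data("mtcars")

# Label the transmission codes: in mtcars, 0 = automatic and 1 = manual
transmission <- factor(mtcars$am, levels = c(0, 1),
                       labels = c("automatic", "manual"))

# Mean mpg per transmission type in one step
mpg_by_trans <- tapply(mtcars$mpg, transmission, mean)
round(mpg_by_trans, 2)

# Group sizes, to see how many cars each mean is based on
trans_counts <- table(transmission)
trans_counts
```

Labelling the codes before summarising makes it harder to mix up which group is which when reading the result.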

6. QUESTION: To check whether age impacts the treatment outcome of arthritis patients, i.e. the "Treated" people.

myarth=Arthritis[Arthritis$Treatment=="Treated",] # subset of the people who were treated, stored in myarth

none.improved=myarth[myarth$Improved=="None",] # treated patients with no improvement
some.improved=myarth[myarth$Improved=="Some",] # treated patients with some improvement
marked.improved=myarth[myarth$Improved=="Marked",] # treated patients with marked improvement

mean(none.improved$Age)

## [1] 49.84615

mean(some.improved$Age)

## [1] 56.71429

mean(marked.improved$Age)

## [1] 56.80952

CONCLUSION: Among the treated patients, those with some or marked improvement were older on average (about 56.7 and 56.8 years) than those with no improvement (about 49.8 years). This suggests that the medicine works better for older patients. The business understanding is that older people have more pronounced symptoms, hence the result of the treatment is also more pronounced for them than for younger people.
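The three subsets and means can also be produced in one step. A sketch, assuming the vcd package is installed (the code is skipped otherwise); the names treated and age_by_outcome are hypothetical:

```r
if (requireNamespace("vcd", quietly = TRUE)) {
  data("Arthritis", package = "vcd")

  # Keep only the patients who received the treatment
  treated <- Arthritis[Arthritis$Treatment == "Treated", ]

  # Mean age for each improvement level among treated patients
  age_by_outcome <- tapply(treated$Age, treated$Improved, mean)
  print(round(age_by_outcome, 2))

  # Group sizes: each mean rests on this many patients
  print(table(treated$Improved))
}
```

The groups are small (13, 7 and 21 patients according to the Treated row of the two-way table above), so the difference in mean age is suggestive rather than conclusive.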

                                           **FINISH**

Executed by Neha More

Nov 3rd, 2017