This mini project covers the basic terminology of univariate statistics in R. It involves the following:
1. Analysing a dataset and finding its mean, median, mode, standard deviation and quartile ranges.
2. Loading the psych library and learning how it stores statistical summaries in data frames.
3. Plotting boxplots and histograms.
4. Summarising categorical data using frequencies. Here we use the Arthritis dataset and analyse it with cross tables, xtabs, collapsing tables, adding margins, etc.
5. Assignment: to tell whether automatic transmission results in better mileage of the cars or not (using the mtcars dataset).
6. Assignment: to check whether age impacts the treatment outcome of arthritis patients, i.e. the “Treated” people (using the Arthritis dataset).
1. Let's load the dataset mtcars, which consists of 32 observations and 11 variables.
data("mtcars")
View(mtcars)
Various statistical functions are as follows:
mean(mtcars$mpg) #gives the mean of variable(column) mpg
## [1] 20.09062
median(mtcars$mpg) #gives the median of variable(column) mpg
## [1] 19.2
min(mtcars$mpg)#gives the min value of variable(column) mpg
## [1] 10.4
max(mtcars$mpg) #gives the max value of variable(column) mpg
## [1] 33.9
var(mtcars$mpg)#gives the variance of variable(column) mpg
## [1] 36.3241
sd(mtcars$mpg)#gives the sd of variable(column) mpg
## [1] 6.026948
IQR(mtcars$mpg) #gives the interquartile range (3rd quartile minus 1st quartile) of variable(column) mpg
## [1] 7.375
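The overview mentions the mode and the quartiles, but base R has no built-in function for the statistical mode (`mode()` reports the storage type of an object instead). A minimal sketch follows; `stat_mode` is a hypothetical helper name that does not appear in the original notes.

```r
data("mtcars")

# Hypothetical helper: returns every value tied for the highest frequency.
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

stat_mode(mtcars$cyl)   # 8, the most common cylinder count

# quantile() gives the quartiles that IQR() is built from.
q <- quantile(mtcars$mpg, probs = c(0.25, 0.5, 0.75))
q
unname(q[3] - q[1])     # same value as IQR(mtcars$mpg)
```

The helper returns all tied values because a column such as mpg has several values that occur equally often.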
summary(mtcars) #summarises all variables together, giving the statistical values of each one
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
s=summary(mtcars$mpg) #stores the summary of column mpg in s (class summaryDefault, which inherits from table)
class(s)
## [1] "summaryDefault" "table"
names(s) #gives the names of table data
## [1] "Min." "1st Qu." "Median" "Mean" "3rd Qu." "Max."
We can access the values stored in s with the following commands:
s["Max."] #to access the maximum value in s
## Max.
## 33.9
s["1st Qu."] #to access the first quartile value in s
## 1st Qu.
## 15.42
ss=summary(mtcars) #similarly we can store entire summary and access individual values.
class(ss)
## [1] "table"
names(ss)
## NULL
How to access the values in ss (note that each cell is a character string, not a number):
ss[1,2] #1st row and 2nd col: minimum of the cyl column
## [1] "Min. :4.000 "
ss[5,6] #5th row and 6th col: 3rd quartile of the wt variable
## [1] "3rd Qu.:3.610 "
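Because the cells of ss are formatted strings, they cannot be used in calculations directly. One possible workaround, sketched here with a hypothetical object name `num_summary`, is to apply `summary()` column by column so the results stay numeric.

```r
data("mtcars")

# sapply() runs summary() on every column; each result has the same six
# entries, so the results are combined into a numeric matrix.
num_summary <- sapply(mtcars, summary)

num_summary["3rd Qu.", "wt"]     # a number, not a string
num_summary["Min.", "cyl"] + 1   # arithmetic works on the result
```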
2. Loading library psych. Its describe function stores the summary data in a different form.
library(psych)
describe(mtcars) #gives detailed statistical measurements of all column variables.
## vars n mean sd median trimmed mad min max range skew
## mpg 1 32 20.09 6.03 19.20 19.70 5.41 10.40 33.90 23.50 0.61
## cyl 2 32 6.19 1.79 6.00 6.23 2.97 4.00 8.00 4.00 -0.17
## disp 3 32 230.72 123.94 196.30 222.52 140.48 71.10 472.00 400.90 0.38
## hp 4 32 146.69 68.56 123.00 141.19 77.10 52.00 335.00 283.00 0.73
## drat 5 32 3.60 0.53 3.70 3.58 0.70 2.76 4.93 2.17 0.27
## wt 6 32 3.22 0.98 3.33 3.15 0.77 1.51 5.42 3.91 0.42
## qsec 7 32 17.85 1.79 17.71 17.83 1.42 14.50 22.90 8.40 0.37
## vs 8 32 0.44 0.50 0.00 0.42 0.00 0.00 1.00 1.00 0.24
## am 9 32 0.41 0.50 0.00 0.38 0.00 0.00 1.00 1.00 0.36
## gear 10 32 3.69 0.74 4.00 3.62 1.48 3.00 5.00 2.00 0.53
## carb 11 32 2.81 1.62 2.00 2.65 1.48 1.00 8.00 7.00 1.05
## kurtosis se
## mpg -0.37 1.07
## cyl -1.76 0.32
## disp -1.21 21.91
## hp -0.14 12.12
## drat -0.71 0.09
## wt -0.02 0.17
## qsec 0.34 0.32
## vs -2.00 0.09
## am -1.92 0.09
## gear -1.07 0.13
## carb 1.26 0.29
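Kurtosis above measures how heavy the tails of a distribution are (often loosely described as peakedness). The values are relative to a normal distribution, so a negative value means lighter tails than normal.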
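To make the kurtosis column less of a black box, here is a minimal sketch of one common definition: the fourth central moment divided by sd^4, minus 3. The helper name `excess_kurtosis` is hypothetical, and the assumption that this matches the default used by describe is mine, not something stated in these notes.

```r
data("mtcars")

# Hypothetical helper: average fourth power of the deviations from the mean,
# scaled by sd^4, minus 3 so that a normal distribution scores about 0.
excess_kurtosis <- function(x) {
  m <- mean(x)
  s <- sd(x)
  sum((x - m)^4) / (length(x) * s^4) - 3
}

excess_kurtosis(mtcars$mpg)   # about -0.37
```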
Store it in an object:
sss=describe(mtcars)
class(sss) #the class now includes data.frame, whereas ss earlier was a table
## [1] "psych" "describe" "data.frame"
View(sss)
View(ss) #note the difference: "ss" is a table, while "sss" is a data frame
Accessing values within the data frame:
rownames(sss) #gives the row names, i.e. the variable names of mtcars
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
colnames(sss) #gives all the statistical measurement names.
## [1] "vars" "n" "mean" "sd" "median" "trimmed"
## [7] "mad" "min" "max" "range" "skew" "kurtosis"
## [13] "se"
Let's access the values in the above data frame.
sss["gear","max"] #gear is the row and max is the column
## [1] 5
sss[c("gear","mpg","wt"),c("max","sd","skew")] #multiple elements access
## max sd skew
## gear 5.00 0.74 0.53
## mpg 33.90 6.03 0.61
## wt 5.42 0.98 0.42
3. Plotting boxplots and histograms
A boxplot is made as follows:
boxplot(mtcars)
boxplot(mtcars$mpg)
boxplot(mtcars$hp)
Plotting histograms:
hist(mtcars$vs)
hist(mtcars$mpg,breaks=10) #avoid asking for many breaks when there are few data points
Let's generate a random normally distributed dataset (x) and plot its histogram:
x=rnorm(10000)
hist(x)
hist(x,breaks=100)
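When a single number is passed as breaks, hist treats it as a suggestion and picks "pretty" breakpoints, so the number of bars may differ from the number requested. A small sketch for inspecting the bins without drawing anything; the seed value is an assumption added only to make the random draw repeatable.

```r
set.seed(42)                 # assumed seed, only for repeatability
x <- rnorm(10000)

# plot = FALSE computes the bins and returns them instead of drawing.
h <- hist(x, breaks = 100, plot = FALSE)

length(h$counts)             # number of bins actually used
sum(h$counts)                # every observation falls in exactly one bin
```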
4. Let's summarise categorical data using frequencies.
Loading libraries vcd and grid and the dataset Arthritis, which consists of 84 observations and 5 variables.
The Arthritis dataset consists of data from Koch & Edwards (1988) from a double-blind clinical trial investigating a new treatment for rheumatoid arthritis.
The format of the dataset is as follows:
ID: patient ID.
Treatment: factor indicating treatment (Placebo, Treated).
Sex: factor indicating sex (Female, Male).
Age: age of patient.
Improved: ordered factor indicating treatment outcome (None, Some, Marked).
library(vcd)
## Loading required package: grid
library(grid)
data("Arthritis")
View(Arthritis)
Let's summarise it:
summary(Arthritis)
## ID Treatment Sex Age Improved
## Min. : 1.00 Placebo:43 Female:59 Min. :23.00 None :42
## 1st Qu.:21.75 Treated:41 Male :25 1st Qu.:46.00 Some :14
## Median :42.50 Median :57.00 Marked:28
## Mean :42.50 Mean :53.36
## 3rd Qu.:63.25 3rd Qu.:63.00
## Max. :84.00 Max. :74.00
describe(Arthritis)
## vars n mean sd median trimmed mad min max range skew
## ID 1 84 42.50 24.39 42.5 42.50 31.13 1 84 83 0.00
## Treatment* 2 84 1.49 0.50 1.0 1.49 0.00 1 2 1 0.05
## Sex* 3 84 1.30 0.46 1.0 1.25 0.00 1 2 1 0.87
## Age 4 84 53.36 12.77 57.0 54.44 10.38 23 74 51 -0.76
## Improved* 5 84 1.83 0.90 1.5 1.79 0.74 1 3 2 0.33
## kurtosis se
## ID -1.24 2.66
## Treatment* -2.02 0.05
## Sex* -1.26 0.05
## Age -0.45 1.39
## Improved* -1.71 0.10
tab=table(Arthritis$Improved) #stores the count of each value of variable Improved in tab
tab
##
## None Some Marked
## 42 14 28
prop.table converts the counts into proportions; multiplying by 100 gives percentages.
t=round(prop.table(tab)*100,2)
t
##
## None Some Marked
## 50.00 16.67 33.33
The output shows that out of 84 patients, 50% show no improvement, 16.67% have some improvement and 33.33% have marked improvement.
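The percentages above are simply each count divided by the total. A minimal sketch of that arithmetic, with the counts copied from the output above so that it runs without the vcd package (the object names are hypothetical):

```r
# Counts of the Improved variable, copied from the table output above.
improved_counts <- as.table(c(None = 42, Some = 14, Marked = 28))

by_hand  <- improved_counts / sum(improved_counts)   # count / total
built_in <- prop.table(improved_counts)              # same calculation

round(by_hand * 100, 2)    # 50.00 16.67 33.33
```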
Let's create a table with more dimensions, i.e. a cross table:
tab=table(Arthritis$Improved,Arthritis$Sex)
tab
##
## Female Male
## None 25 17
## Some 12 2
## Marked 22 6
class(Arthritis) #it's a data frame
## [1] "data.frame"
Another way to create a cross table is xtabs:
xtabs(~Improved+Sex,data=Arthritis)
## Sex
## Improved Female Male
## None 25 17
## Some 12 2
## Marked 22 6
xtabs(~Improved+Treatment+Sex,data=Arthritis)
## , , Sex = Female
##
## Treatment
## Improved Placebo Treated
## None 19 6
## Some 7 5
## Marked 6 16
##
## , , Sex = Male
##
## Treatment
## Improved Placebo Treated
## None 10 7
## Some 0 2
## Marked 1 5
The only difference between the xtabs and table functions is the syntax (xtabs takes a formula and also labels the dimensions). The counts are the same.
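The equivalence of the two functions can be checked on a small example. This sketch uses a hypothetical toy data frame rather than the Arthritis data, so it runs without any extra package.

```r
# Hypothetical toy data, not taken from the Arthritis dataset.
toy <- data.frame(
  outcome = c("None", "Some", "None", "Marked", "None", "Marked"),
  sex     = c("Female", "Male", "Male", "Female", "Female", "Female")
)

by_table <- table(toy$outcome, toy$sex)            # vectors as arguments
by_xtabs <- xtabs(~ outcome + sex, data = toy)     # formula plus data frame

all(as.vector(by_table) == as.vector(by_xtabs))    # same counts cell by cell
names(dimnames(by_xtabs))                          # xtabs keeps the variable names
```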
Let's create a two-way table between Treatment and Improved for further analysis.
tab=table(Arthritis$Treatment,Arthritis$Improved)
tab
##
## None Some Marked
## Placebo 29 7 7
## Treated 13 7 21
Output: we can say that most patients who received placebo had no improvement, while the largest group of treated patients had marked improvement. Instead of counts, let's convert the table into percentages using prop.table.
round(prop.table(tab)*100,2)
##
## None Some Marked
## Placebo 34.52 8.33 8.33
## Treated 15.48 8.33 25.00
NOTE: these percentages are out of the overall total, i.e. 34.52% is out of all 84 entries and not out of the people who received placebo. We need percentages within the placebo group and within the treated group, so we pass a margin to prop.table: 1 for the first dimension (Treatment, the rows) and 2 for the second (Improved, the columns).
final=round(prop.table(tab,1)*100,2) #gives percentages within each treatment, i.e. each row adds up to 100%
final
##
## None Some Marked
## Placebo 67.44 16.28 16.28
## Treated 31.71 17.07 51.22
addmargins(final,2)
##
## None Some Marked Sum
## Placebo 67.44 16.28 16.28 100.00
## Treated 31.71 17.07 51.22 100.00
CONCLUSION: 67% of patients who received placebo had no improvement, and 51% of those who were treated showed marked improvement. Hence the treatment works better than the placebo for arthritis patients.
Let's check the summary across improvement:
final1=round(prop.table(tab,2)*100,2) #gives percentages within each improvement level, i.e. each column adds up to 100%
addmargins(final1,1)
##
## None Some Marked
## Placebo 69.05 50.00 25.00
## Treated 30.95 50.00 75.00
## Sum 100.00 100.00 100.00
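The margin argument can be read as "divide each cell by the total of its row (1) or its column (2)". A minimal sketch of that idea using sweep, with the counts copied from the two-way table above (object names are hypothetical):

```r
# Treatment by Improved counts, copied from the output above.
counts <- as.table(matrix(
  c(29, 13, 7, 7, 7, 21), nrow = 2,
  dimnames = list(c("Placebo", "Treated"), c("None", "Some", "Marked"))
))

# Row percentages: divide every cell by its row total.
row_pct <- sweep(counts, 1, rowSums(counts), "/") * 100
# Column percentages: divide every cell by its column total.
col_pct <- sweep(counts, 2, colSums(counts), "/") * 100

round(row_pct, 2)   # rows add up to 100, as with prop.table(tab, 1)
round(col_pct, 2)   # columns add up to 100, as with prop.table(tab, 2)
```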
Let's see how to collapse a table:
tab
##
## None Some Marked
## Placebo 29 7 7
## Treated 13 7 21
margin.table(tab) #collapses the whole table to its grand total: 84 observations
## [1] 84
margin.table(tab,1)
##
## Placebo Treated
## 43 41
margin.table(tab,2)
##
## None Some Marked
## 42 14 28
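For a two-way table, collapsing over a margin amounts to taking row or column sums. A short sketch, again with the counts copied from the output above so that it is self-contained:

```r
# Treatment by Improved counts, copied from the output above.
counts <- as.table(matrix(
  c(29, 13, 7, 7, 7, 21), nrow = 2,
  dimnames = list(c("Placebo", "Treated"), c("None", "Some", "Marked"))
))

by_treatment <- margin.table(counts, 1)   # keeps dimension 1, sums the rest
by_improved  <- margin.table(counts, 2)   # keeps dimension 2, sums the rest

rowSums(counts)   # same numbers as by_treatment: 43 41
colSums(counts)   # same numbers as by_improved: 42 14 28
```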
Let's add margins:
tab=table(Arthritis$Treatment,Arthritis$Improved)
tab
##
## None Some Marked
## Placebo 29 7 7
## Treated 13 7 21
addmargins(tab)
##
## None Some Marked Sum
## Placebo 29 7 7 43
## Treated 13 7 21 41
## Sum 42 14 28 84
addmargins(tab,1) #adds a Sum row: totals over the treatments for each improvement level
##
## None Some Marked
## Placebo 29 7 7
## Treated 13 7 21
## Sum 42 14 28
addmargins(tab,2) #adds a Sum column: totals over the improvement levels for each treatment
##
## None Some Marked Sum
## Placebo 29 7 7 43
## Treated 13 7 21 41
Let's see a three-dimensional example.
tab3=xtabs(~Treatment+Sex+Improved,data=Arthritis)
tab3
## , , Improved = None
##
## Sex
## Treatment Female Male
## Placebo 19 10
## Treated 6 7
##
## , , Improved = Some
##
## Sex
## Treatment Female Male
## Placebo 7 0
## Treated 5 2
##
## , , Improved = Marked
##
## Sex
## Treatment Female Male
## Placebo 6 1
## Treated 16 5
ftable(tab3) #displays the same table in a flatter, more readable layout
## Improved None Some Marked
## Treatment Sex
## Placebo Female 19 7 6
## Male 10 0 1
## Treated Female 6 5 16
## Male 7 2 5
#=======================
margin.table(tab3)
## [1] 84
margin.table(tab3,1)
## Treatment
## Placebo Treated
## 43 41
margin.table(tab3,c(1,3))
## Improved
## Treatment None Some Marked
## Placebo 29 7 7
## Treated 13 7 21
margin.table(tab3,c(1,2))
## Sex
## Treatment Female Male
## Placebo 32 11
## Treated 27 14
ftable(prop.table(tab3,c(1,3))*100) #gives percentages within each Treatment and Improved combination, i.e. Female and Male under Placebo and None add up to 100%
## Improved None Some Marked
## Treatment Sex
## Placebo Female 65.51724 100.00000 85.71429
## Male 34.48276 0.00000 14.28571
## Treated Female 46.15385 71.42857 76.19048
## Male 53.84615 28.57143 23.80952
#likewise, Female and Male under Treated and Marked add up to 100%
ftable(addmargins(prop.table(tab3,c(1,2)),3)) # 3 is the Improved dimension, so the Sum column adds over Improved
## Improved None Some Marked Sum
## Treatment Sex
## Placebo Female 0.59375000 0.21875000 0.18750000 1.00000000
## Male 0.90909091 0.00000000 0.09090909 1.00000000
## Treated Female 0.22222222 0.18518519 0.59259259 1.00000000
## Male 0.50000000 0.14285714 0.35714286 1.00000000
CONCLUSION: 59% of females who were treated had marked improvement, while 91% of males on placebo and 50% of males who were treated had no improvement. Hence we can say that the drug has more impact on females than on males.
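The figures quoted in the conclusion can be pulled out of the three-way proportion table by name. In this sketch the counts are copied from the ftable output above into a hypothetical object `arth3`, so it runs without the vcd package.

```r
# Treatment x Sex x Improved counts, copied from the ftable output above.
# array() fills the first dimension fastest, then the second, then the third.
arth3 <- as.table(array(
  c(19, 6, 10, 7,    # Improved = None
    7, 5, 0, 2,      # Improved = Some
    6, 16, 1, 5),    # Improved = Marked
  dim = c(2, 2, 3),
  dimnames = list(Treatment = c("Placebo", "Treated"),
                  Sex       = c("Female", "Male"),
                  Improved  = c("None", "Some", "Marked"))
))

# Proportions within each Treatment and Sex combination.
p <- prop.table(arth3, c(1, 2))

round(p["Treated", "Female", "Marked"] * 100, 1)   # 59.3
round(p["Placebo", "Male", "None"] * 100, 1)       # 90.9
round(p["Treated", "Male", "None"] * 100, 1)       # 50
```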
5. QUESTION: To tell whether automatic transmission results in better mileage of the cars or not. Let's load the mtcars data:
data(mtcars)
View(mtcars)
Let's keep only the mileage (mpg) and transmission (am) columns, then try a cross table:
mycars=mtcars[,c("mpg","am")]
mycars
## mpg am
## Mazda RX4 21.0 1
## Mazda RX4 Wag 21.0 1
## Datsun 710 22.8 1
## Hornet 4 Drive 21.4 0
## Hornet Sportabout 18.7 0
## Valiant 18.1 0
## Duster 360 14.3 0
## Merc 240D 24.4 0
## Merc 230 22.8 0
## Merc 280 19.2 0
## Merc 280C 17.8 0
## Merc 450SE 16.4 0
## Merc 450SL 17.3 0
## Merc 450SLC 15.2 0
## Cadillac Fleetwood 10.4 0
## Lincoln Continental 10.4 0
## Chrysler Imperial 14.7 0
## Fiat 128 32.4 1
## Honda Civic 30.4 1
## Toyota Corolla 33.9 1
## Toyota Corona 21.5 0
## Dodge Challenger 15.5 0
## AMC Javelin 15.2 0
## Camaro Z28 13.3 0
## Pontiac Firebird 19.2 0
## Fiat X1-9 27.3 1
## Porsche 914-2 26.0 1
## Lotus Europa 30.4 1
## Ford Pantera L 15.8 1
## Ferrari Dino 19.7 1
## Maserati Bora 15.0 1
## Volvo 142E 21.4 1
xtabs(~am+mpg,data=mycars) #bad idea: almost every mpg value is unique, so the frequencies are too small to reveal anything
## mpg
## am 10.4 13.3 14.3 14.7 15 15.2 15.5 15.8 16.4 17.3 17.8 18.1 18.7 19.2
## 0 2 1 1 1 0 2 1 0 1 1 1 1 1 2
## 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0
## mpg
## am 19.7 21 21.4 21.5 22.8 24.4 26 27.3 30.4 32.4 33.9
## 0 0 0 1 1 1 1 0 0 0 0 0
## 1 1 2 1 0 1 0 1 1 2 1 1
mean(mycars$mpg)
## [1] 20.09062
Let's subset the data by am value (in mtcars, am=0 is automatic and am=1 is manual):
mycars.am.0=mycars[mycars$am==0,]
mycars.am.0
## mpg am
## Hornet 4 Drive 21.4 0
## Hornet Sportabout 18.7 0
## Valiant 18.1 0
## Duster 360 14.3 0
## Merc 240D 24.4 0
## Merc 230 22.8 0
## Merc 280 19.2 0
## Merc 280C 17.8 0
## Merc 450SE 16.4 0
## Merc 450SL 17.3 0
## Merc 450SLC 15.2 0
## Cadillac Fleetwood 10.4 0
## Lincoln Continental 10.4 0
## Chrysler Imperial 14.7 0
## Toyota Corona 21.5 0
## Dodge Challenger 15.5 0
## AMC Javelin 15.2 0
## Camaro Z28 13.3 0
## Pontiac Firebird 19.2 0
mycars.am.1=mycars[mycars$am==1,]
Now let's find the mean mpg of both groups, am=0 and am=1:
mean(mycars.am.0$mpg)
## [1] 17.14737
mean(mycars.am.1$mpg)
## [1] 24.39231
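CONCLUSION: the average mileage (mpg) is better when the transmission is manual (am=1), at 24.39 mpg against 17.15 mpg for automatic (am=0). So in this dataset automatic transmission does not result in better mileage.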
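The same comparison can be done in one step, with the groups labelled so that the 0/1 coding is harder to misread. A minimal sketch; the label names and the object names are hypothetical, and the coding (0 = automatic, 1 = manual) comes from the mtcars help page.

```r
data("mtcars")

# Give the 0/1 codes readable labels (0 = automatic, 1 = manual).
transmission <- factor(mtcars$am, levels = c(0, 1),
                       labels = c("automatic", "manual"))

# Mean mpg within each transmission group.
mpg_by_transmission <- tapply(mtcars$mpg, transmission, mean)
round(mpg_by_transmission, 2)   # automatic 17.15, manual 24.39

table(transmission)             # group sizes: 19 automatic, 13 manual
```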
6. QUESTION: To check whether age impacts the treatment outcome of arthritis patients, i.e. the “Treated” people.
myarth=Arthritis[Arthritis$Treatment=="Treated",] #we subset the people who were treated into object myarth
none.improved=myarth[myarth$Improved=="None",] #subset with no improvement
some.improved=myarth[myarth$Improved=="Some",] #subset with some improvement
marked.improved=myarth[myarth$Improved=="Marked",] #subset with marked improvement
mean(none.improved$Age)
## [1] 49.84615
mean(some.improved$Age)
## [1] 56.71429
mean(marked.improved$Age)
## [1] 56.80952
CONCLUSION: among treated patients, the mean age of those with some or marked improvement (about 57) is higher than the mean age of those with no improvement (about 50). This suggests that the medicine works better for older patients. A possible business understanding is that older people have more severe symptoms, so the result of the treatment is more pronounced in them than in younger people.
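The three subsets and three mean calls can be condensed into a single grouped calculation. This is a sketch under the assumption that the vcd package is installed; the requireNamespace guard and the object names `treated` and `age_by_outcome` are additions that are not in the original notes.

```r
# Only run when vcd (which ships the Arthritis data) is available.
if (requireNamespace("vcd", quietly = TRUE)) {
  data("Arthritis", package = "vcd")

  # Keep the treated patients only.
  treated <- subset(Arthritis, Treatment == "Treated")

  # Mean age within each level of Improved.
  age_by_outcome <- tapply(treated$Age, treated$Improved, mean)
  print(round(age_by_outcome, 2))

  # Group sizes, to see how many patients each mean is based on.
  print(table(treated$Improved))
}
```

The group sizes are worth a look here because the "Some" group contains only 7 treated patients.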
**FINISH**