1.1 The dataset teengamb
concerns a study of teenage gambling in Britain. Make a numerical and graphical summary of the data, commenting on any features that you find interesting. Limit the output you present to a quantity that a busy reader would find sufficient to get a basic understanding of the data.
Information regarding the dataset is found here: http://www.stat.unc.edu/faculty/cji/664-2012/HW1-faraway.pdf
This is also in the faraway package.
sex
: 0 = male, 1 = femalestatus
: socioeconomic status score based on parents’ occupationincome
: in pounds per weekverbal
: verbal score in words out of 12 correctly definedgamble
: expenditure on gambling in pounds per yeardata(teengamb, package='faraway')
head(teengamb)
## sex status income verbal gamble
## 1 1 51 2.00 8 0.0
## 2 1 28 2.50 8 0.0
## 3 1 37 2.00 6 0.0
## 4 1 28 7.00 4 7.3
## 5 1 65 2.00 8 19.6
## 6 1 61 3.47 6 0.1
What are the dimensions of this dataframe?
dim(teengamb)
## [1] 47 5
What is the summary descriptive statistics for this dataframe?
summary(teengamb)
## sex status income verbal
## Min. :0.0000 Min. :18.00 Min. : 0.600 Min. : 1.00
## 1st Qu.:0.0000 1st Qu.:28.00 1st Qu.: 2.000 1st Qu.: 6.00
## Median :0.0000 Median :43.00 Median : 3.250 Median : 7.00
## Mean :0.4043 Mean :45.23 Mean : 4.642 Mean : 6.66
## 3rd Qu.:1.0000 3rd Qu.:61.50 3rd Qu.: 6.210 3rd Qu.: 8.00
## Max. :1.0000 Max. :75.00 Max. :15.000 Max. :10.00
## gamble
## Min. : 0.0
## 1st Qu.: 1.1
## Median : 6.0
## Mean : 19.3
## 3rd Qu.: 19.4
## Max. :156.0
What is the split between men and women?
sex_count <- factor(teengamb$sex)
levels(sex_count) <- c('Male', 'Female')
plot(sex_count,ylab="Frequency", main="Male vs. Female Count")
Let’s use ggplot2.
if (!require(ggplot2)) install.packages("ggplot2")
## Loading required package: ggplot2
require(ggplot2)
teengamb$sex <- factor(teengamb$sex)
levels(teengamb$sex) <- c('Male','Female')
ggplot(teengamb, aes(x=sex)) + geom_bar(aes(fill=sex)) + labs(title="Male vs. Female Frequency", y="Count")
Looks like in this dataset that are more men in this set. Let’s see the overall count for gambling expenditure. Are these teenagers big time gamblers?
ggplot(teengamb, aes(x=gamble)) + geom_histogram(bins=50, fill='blue',col='black') + labs(title="Histogram of Gambling Expenditure", x='Pounds per Year', y="Count")
The brief overview of the histogram suggests that the data is very much right skew, with many teenagers overall appearing not to be ‘big rollers’.
Now let’s ask if income matters at all. Do teenagers with bigger income appear to be bigger gamblers? Let’s plot this, and see if we can subset this against sex.
ggplot(teengamb, aes(x=income,y=gamble,col=sex)) + geom_point() + labs(title="Income vs. Gambling Expenditure", x="Income", y="Gambling Expenditure")
At an initial glance, it appears that men seem to be overall tend to gamble more than women, particularly when the income increases.
men <- subset(teengamb, teengamb$sex == 'Male')
female <- subset(teengamb, teengamb$sex == 'Female')
summary(men)
## sex status income verbal
## Male :28 Min. :18.00 Min. : 0.600 Min. : 1.000
## Female: 0 1st Qu.:38.00 1st Qu.: 2.000 1st Qu.: 6.000
## Median :51.00 Median : 3.375 Median : 7.000
## Mean :52.00 Mean : 4.976 Mean : 6.821
## 3rd Qu.:65.25 3rd Qu.: 6.625 3rd Qu.: 8.250
## Max. :75.00 Max. :15.000 Max. :10.000
## gamble
## Min. : 0.000
## 1st Qu.: 2.775
## Median : 14.250
## Mean : 29.775
## 3rd Qu.: 42.175
## Max. :156.000
summary(female)
## sex status income verbal
## Male : 0 Min. :18.00 Min. : 1.500 Min. :4.000
## Female:19 1st Qu.:28.00 1st Qu.: 2.000 1st Qu.:6.000
## Median :30.00 Median : 3.000 Median :6.000
## Mean :35.26 Mean : 4.149 Mean :6.421
## 3rd Qu.:43.00 3rd Qu.: 5.750 3rd Qu.:8.000
## Max. :65.00 Max. :10.000 Max. :8.000
## gamble
## Min. : 0.000
## 1st Qu.: 0.100
## Median : 1.700
## Mean : 3.866
## 3rd Qu.: 6.000
## Max. :19.600
As you can see, both the mean and median for gambling expenditures for women is less than the men’s gambling expenditures.
1.3 The dataset prostate
is from a study on 97 men with prostate cancer who were due to receive a radical prostatectomy. Make a numerical and graphical summary of the data as in the first question.
Information regarding the dataset is found here: https://rdrr.io/cran/faraway/man/prostate.html
lcavol
: log(cancer volume)lweight
: log(prostate weight)age
: agelbph
: log(benign prostatic hyperplasia amount)svi
: seminal vesicle invasionlcp
: log(capsular penetration)gleason
: Gleason score. Score ranges from 0 to 10, with 6 or greater indicates likely cancer.pgg45
: percentage Gleason scores 4 or 5lpsa
: log(prostate specific antigen)Numerical Data Summary:
data(prostate, package='faraway')
head(prostate)
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 1 -0.5798185 2.7695 50 -1.386294 0 -1.38629 6 0 -0.43078
## 2 -0.9942523 3.3196 58 -1.386294 0 -1.38629 6 0 -0.16252
## 3 -0.5108256 2.6912 74 -1.386294 0 -1.38629 7 20 -0.16252
## 4 -1.2039728 3.2828 58 -1.386294 0 -1.38629 6 0 -0.16252
## 5 0.7514161 3.4324 62 -1.386294 0 -1.38629 6 0 0.37156
## 6 -1.0498221 3.2288 50 -1.386294 0 -1.38629 6 0 0.76547
summary(prostate)
## lcavol lweight age lbph
## Min. :-1.3471 Min. :2.375 Min. :41.00 Min. :-1.3863
## 1st Qu.: 0.5128 1st Qu.:3.376 1st Qu.:60.00 1st Qu.:-1.3863
## Median : 1.4469 Median :3.623 Median :65.00 Median : 0.3001
## Mean : 1.3500 Mean :3.653 Mean :63.87 Mean : 0.1004
## 3rd Qu.: 2.1270 3rd Qu.:3.878 3rd Qu.:68.00 3rd Qu.: 1.5581
## Max. : 3.8210 Max. :6.108 Max. :79.00 Max. : 2.3263
## svi lcp gleason pgg45
## Min. :0.0000 Min. :-1.3863 Min. :6.000 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.:-1.3863 1st Qu.:6.000 1st Qu.: 0.00
## Median :0.0000 Median :-0.7985 Median :7.000 Median : 15.00
## Mean :0.2165 Mean :-0.1794 Mean :6.753 Mean : 24.38
## 3rd Qu.:0.0000 3rd Qu.: 1.1786 3rd Qu.:7.000 3rd Qu.: 40.00
## Max. :1.0000 Max. : 2.9042 Max. :9.000 Max. :100.00
## lpsa
## Min. :-0.4308
## 1st Qu.: 1.7317
## Median : 2.5915
## Mean : 2.4784
## 3rd Qu.: 3.0564
## Max. : 5.5829
Interestingly, there are some logarithmic values that are negative, which is not possible. There may be some data errors, or perhaps, the values may have been placed in incorrectly. Let’s continue further with the analysis.
Let’s look at the histogram frequency for age.
ggplot(prostate, aes(x=age)) + geom_histogram(col='black', fill='green', bins=40) + labs(title="Frequency Count for Age", x="Age",y="Count")
Most of the men are greater than 60 years old, which the visualization confirms the mean value from summary
of approximately 63.
Does age correlate with volume and weight of the prostate cancer?
ggplot(prostate, aes(x=age,y=lcavol)) + geom_point(col='blue') + labs(title='Age vs. Prostate Cancer Volume (log)')
ggplot(prostate, aes(x=age,y=lweight)) + geom_point(col='red') + labs(title='Age vs. Prostate Cancer Weight (log)')
At least from initial visualization, it’s difficult to say if there’s a trend in age vs. size/weight. This means that we will need to look into this further.
Let’s look closer into the gleason
score.
ggplot(prostate, aes(x=gleason)) + geom_bar() + labs(title="Frequency of Gleason Scores")
We can certainly do a futher look, but we will continue onto the next exercise in the meantime.
The dataset sat
comes from a study entitled “Getting What You Pay For: The Debate Over Equity in Public School Expenditures.” Make a numerical and graphical summary of the data as in the first question.
More information: https://www.rdocumentation.org/packages/faraway/versions/1.0.7/topics/sat
expend
: Current expenditure per pupil in average daily attendance in public elementary and secondary schools, 1994-95 (in thousands of dollars)ratio
: Average pupil/teacher ratio in public elementary and secondary schools, Fall 1994salary
: Estimated average annual salary of teachers in public elementary and secondary schools, 1994-95 (in thousands of dollars)takers
: Percentage of all eligible students taking the SAT, 1994-95verbal
: Average verbal SAT score, 1994-95math
: Average math SAT score, 1994-95total
: Average total score on the SAT, 1994-95data(sat, package='faraway')
head(sat)
## expend ratio salary takers verbal math total
## Alabama 4.405 17.2 31.144 8 491 538 1029
## Alaska 8.963 17.6 47.951 47 445 489 934
## Arizona 4.778 19.3 32.175 27 448 496 944
## Arkansas 4.459 17.1 28.934 6 482 523 1005
## California 4.992 24.0 41.078 45 417 485 902
## Colorado 5.443 18.4 34.571 29 462 518 980
summary(sat)
## expend ratio salary takers
## Min. :3.656 Min. :13.80 Min. :25.99 Min. : 4.00
## 1st Qu.:4.882 1st Qu.:15.22 1st Qu.:30.98 1st Qu.: 9.00
## Median :5.768 Median :16.60 Median :33.29 Median :28.00
## Mean :5.905 Mean :16.86 Mean :34.83 Mean :35.24
## 3rd Qu.:6.434 3rd Qu.:17.57 3rd Qu.:38.55 3rd Qu.:63.00
## Max. :9.774 Max. :24.30 Max. :50.05 Max. :81.00
## verbal math total
## Min. :401.0 Min. :443.0 Min. : 844.0
## 1st Qu.:427.2 1st Qu.:474.8 1st Qu.: 897.2
## Median :448.0 Median :497.5 Median : 945.5
## Mean :457.1 Mean :508.8 Mean : 965.9
## 3rd Qu.:490.2 3rd Qu.:539.5 3rd Qu.:1032.0
## Max. :516.0 Max. :592.0 Max. :1107.0
ggplot(sat, aes(x=expend,y=total)) + geom_point()
ggplot(sat, aes(x=total)) + geom_histogram(bins=50)