Suggested Exercises

1.1 The dataset teengamb concerns a study of teenage gambling in Britain. Make a numerical and graphical summary of the data, commenting on any features that you find interesting. Limit the output you present to a quantity that a busy reader would find sufficient to get a basic understanding of the data.

Information regarding the dataset is found here: http://www.stat.unc.edu/faculty/cji/664-2012/HW1-faraway.pdf

This is also in the faraway package.

sex: 0 = male, 1 = female
status: socioeconomic status score based on parents’ occupation
income: in pounds per week
verbal: verbal score in words out of 12 correctly defined
gamble: expenditure on gambling in pounds per year

data(teengamb, package='faraway')
head(teengamb)

##   sex status income verbal gamble
## 1   1     51   2.00      8    0.0
## 2   1     28   2.50      8    0.0
## 3   1     37   2.00      6    0.0
## 4   1     28   7.00      4    7.3
## 5   1     65   2.00      8   19.6
## 6   1     61   3.47      6    0.1

What are the dimensions of this dataframe?

dim(teengamb)

## [1] 47  5

What is the summary descriptive statistics for this dataframe?

summary(teengamb)

##       sex             status          income           verbal     
##  Min.   :0.0000   Min.   :18.00   Min.   : 0.600   Min.   : 1.00  
##  1st Qu.:0.0000   1st Qu.:28.00   1st Qu.: 2.000   1st Qu.: 6.00  
##  Median :0.0000   Median :43.00   Median : 3.250   Median : 7.00  
##  Mean   :0.4043   Mean   :45.23   Mean   : 4.642   Mean   : 6.66  
##  3rd Qu.:1.0000   3rd Qu.:61.50   3rd Qu.: 6.210   3rd Qu.: 8.00  
##  Max.   :1.0000   Max.   :75.00   Max.   :15.000   Max.   :10.00  
##      gamble     
##  Min.   :  0.0  
##  1st Qu.:  1.1  
##  Median :  6.0  
##  Mean   : 19.3  
##  3rd Qu.: 19.4  
##  Max.   :156.0

What is the split between men and women?

sex_count <- factor(teengamb$sex)
levels(sex_count) <- c('Male', 'Female')
plot(sex_count,ylab="Frequency", main="Male vs. Female Count")

Let’s use ggplot2.

if (!require(ggplot2)) install.packages("ggplot2")

## Loading required package: ggplot2

require(ggplot2)

teengamb$sex <- factor(teengamb$sex)
levels(teengamb$sex) <- c('Male','Female')
ggplot(teengamb, aes(x=sex)) + geom_bar(aes(fill=sex)) + labs(title="Male vs. Female Frequency", y="Count")

Looks like in this dataset that are more men in this set. Let’s see the overall count for gambling expenditure. Are these teenagers big time gamblers?

ggplot(teengamb, aes(x=gamble)) + geom_histogram(bins=50, fill='blue',col='black') + labs(title="Histogram of Gambling Expenditure", x='Pounds per Year', y="Count")

The brief overview of the histogram suggests that the data is very much right skew, with many teenagers overall appearing not to be ‘big rollers’.

Now let’s ask if income matters at all. Do teenagers with bigger income appear to be bigger gamblers? Let’s plot this, and see if we can subset this against sex.

ggplot(teengamb, aes(x=income,y=gamble,col=sex)) + geom_point() + labs(title="Income vs. Gambling Expenditure", x="Income", y="Gambling Expenditure")

At an initial glance, it appears that men seem to be overall tend to gamble more than women, particularly when the income increases.

men <- subset(teengamb, teengamb$sex == 'Male')
female <- subset(teengamb, teengamb$sex == 'Female')
summary(men)

##      sex         status          income           verbal      
##  Male  :28   Min.   :18.00   Min.   : 0.600   Min.   : 1.000  
##  Female: 0   1st Qu.:38.00   1st Qu.: 2.000   1st Qu.: 6.000  
##              Median :51.00   Median : 3.375   Median : 7.000  
##              Mean   :52.00   Mean   : 4.976   Mean   : 6.821  
##              3rd Qu.:65.25   3rd Qu.: 6.625   3rd Qu.: 8.250  
##              Max.   :75.00   Max.   :15.000   Max.   :10.000  
##      gamble       
##  Min.   :  0.000  
##  1st Qu.:  2.775  
##  Median : 14.250  
##  Mean   : 29.775  
##  3rd Qu.: 42.175  
##  Max.   :156.000

summary(female)

##      sex         status          income           verbal     
##  Male  : 0   Min.   :18.00   Min.   : 1.500   Min.   :4.000  
##  Female:19   1st Qu.:28.00   1st Qu.: 2.000   1st Qu.:6.000  
##              Median :30.00   Median : 3.000   Median :6.000  
##              Mean   :35.26   Mean   : 4.149   Mean   :6.421  
##              3rd Qu.:43.00   3rd Qu.: 5.750   3rd Qu.:8.000  
##              Max.   :65.00   Max.   :10.000   Max.   :8.000  
##      gamble      
##  Min.   : 0.000  
##  1st Qu.: 0.100  
##  Median : 1.700  
##  Mean   : 3.866  
##  3rd Qu.: 6.000  
##  Max.   :19.600

As you can see, both the mean and median for gambling expenditures for women is less than the men’s gambling expenditures.

1.3 The dataset prostate is from a study on 97 men with prostate cancer who were due to receive a radical prostatectomy. Make a numerical and graphical summary of the data as in the first question.

Information regarding the dataset is found here: https://rdrr.io/cran/faraway/man/prostate.html

lcavol: log(cancer volume)
lweight: log(prostate weight)
age: age
lbph: log(benign prostatic hyperplasia amount)
svi: seminal vesicle invasion
lcp: log(capsular penetration)
gleason: Gleason score. Score ranges from 0 to 10, with 6 or greater indicates likely cancer.
pgg45: percentage Gleason scores 4 or 5
lpsa: log(prostate specific antigen)

Numerical Data Summary:

data(prostate, package='faraway')
head(prostate)

##       lcavol lweight age      lbph svi      lcp gleason pgg45     lpsa
## 1 -0.5798185  2.7695  50 -1.386294   0 -1.38629       6     0 -0.43078
## 2 -0.9942523  3.3196  58 -1.386294   0 -1.38629       6     0 -0.16252
## 3 -0.5108256  2.6912  74 -1.386294   0 -1.38629       7    20 -0.16252
## 4 -1.2039728  3.2828  58 -1.386294   0 -1.38629       6     0 -0.16252
## 5  0.7514161  3.4324  62 -1.386294   0 -1.38629       6     0  0.37156
## 6 -1.0498221  3.2288  50 -1.386294   0 -1.38629       6     0  0.76547

summary(prostate)

##      lcavol           lweight           age             lbph        
##  Min.   :-1.3471   Min.   :2.375   Min.   :41.00   Min.   :-1.3863  
##  1st Qu.: 0.5128   1st Qu.:3.376   1st Qu.:60.00   1st Qu.:-1.3863  
##  Median : 1.4469   Median :3.623   Median :65.00   Median : 0.3001  
##  Mean   : 1.3500   Mean   :3.653   Mean   :63.87   Mean   : 0.1004  
##  3rd Qu.: 2.1270   3rd Qu.:3.878   3rd Qu.:68.00   3rd Qu.: 1.5581  
##  Max.   : 3.8210   Max.   :6.108   Max.   :79.00   Max.   : 2.3263  
##       svi              lcp             gleason          pgg45       
##  Min.   :0.0000   Min.   :-1.3863   Min.   :6.000   Min.   :  0.00  
##  1st Qu.:0.0000   1st Qu.:-1.3863   1st Qu.:6.000   1st Qu.:  0.00  
##  Median :0.0000   Median :-0.7985   Median :7.000   Median : 15.00  
##  Mean   :0.2165   Mean   :-0.1794   Mean   :6.753   Mean   : 24.38  
##  3rd Qu.:0.0000   3rd Qu.: 1.1786   3rd Qu.:7.000   3rd Qu.: 40.00  
##  Max.   :1.0000   Max.   : 2.9042   Max.   :9.000   Max.   :100.00  
##       lpsa        
##  Min.   :-0.4308  
##  1st Qu.: 1.7317  
##  Median : 2.5915  
##  Mean   : 2.4784  
##  3rd Qu.: 3.0564  
##  Max.   : 5.5829

Interestingly, there are some logarithmic values that are negative, which is not possible. There may be some data errors, or perhaps, the values may have been placed in incorrectly. Let’s continue further with the analysis.

Let’s look at the histogram frequency for age.

ggplot(prostate, aes(x=age)) + geom_histogram(col='black', fill='green', bins=40) + labs(title="Frequency Count for Age", x="Age",y="Count")

Most of the men are greater than 60 years old, which the visualization confirms the mean value from summary of approximately 63.

Does age correlate with volume and weight of the prostate cancer?

ggplot(prostate, aes(x=age,y=lcavol)) + geom_point(col='blue') + labs(title='Age vs. Prostate Cancer Volume (log)')

ggplot(prostate, aes(x=age,y=lweight)) + geom_point(col='red') + labs(title='Age vs. Prostate Cancer Weight (log)')

At least from initial visualization, it’s difficult to say if there’s a trend in age vs. size/weight. This means that we will need to look into this further.

Let’s look closer into the gleason score.

ggplot(prostate, aes(x=gleason)) + geom_bar() + labs(title="Frequency of Gleason Scores")

We can certainly do a futher look, but we will continue onto the next exercise in the meantime.

The dataset sat comes from a study entitled “Getting What You Pay For: The Debate Over Equity in Public School Expenditures.” Make a numerical and graphical summary of the data as in the first question.

More information: https://www.rdocumentation.org/packages/faraway/versions/1.0.7/topics/sat

expend: Current expenditure per pupil in average daily attendance in public elementary and secondary schools, 1994-95 (in thousands of dollars)
ratio: Average pupil/teacher ratio in public elementary and secondary schools, Fall 1994
salary: Estimated average annual salary of teachers in public elementary and secondary schools, 1994-95 (in thousands of dollars)
takers: Percentage of all eligible students taking the SAT, 1994-95
verbal: Average verbal SAT score, 1994-95
math: Average math SAT score, 1994-95
total: Average total score on the SAT, 1994-95

data(sat, package='faraway')
head(sat)

##            expend ratio salary takers verbal math total
## Alabama     4.405  17.2 31.144      8    491  538  1029
## Alaska      8.963  17.6 47.951     47    445  489   934
## Arizona     4.778  19.3 32.175     27    448  496   944
## Arkansas    4.459  17.1 28.934      6    482  523  1005
## California  4.992  24.0 41.078     45    417  485   902
## Colorado    5.443  18.4 34.571     29    462  518   980

summary(sat)

##      expend          ratio           salary          takers     
##  Min.   :3.656   Min.   :13.80   Min.   :25.99   Min.   : 4.00  
##  1st Qu.:4.882   1st Qu.:15.22   1st Qu.:30.98   1st Qu.: 9.00  
##  Median :5.768   Median :16.60   Median :33.29   Median :28.00  
##  Mean   :5.905   Mean   :16.86   Mean   :34.83   Mean   :35.24  
##  3rd Qu.:6.434   3rd Qu.:17.57   3rd Qu.:38.55   3rd Qu.:63.00  
##  Max.   :9.774   Max.   :24.30   Max.   :50.05   Max.   :81.00  
##      verbal           math           total       
##  Min.   :401.0   Min.   :443.0   Min.   : 844.0  
##  1st Qu.:427.2   1st Qu.:474.8   1st Qu.: 897.2  
##  Median :448.0   Median :497.5   Median : 945.5  
##  Mean   :457.1   Mean   :508.8   Mean   : 965.9  
##  3rd Qu.:490.2   3rd Qu.:539.5   3rd Qu.:1032.0  
##  Max.   :516.0   Max.   :592.0   Max.   :1107.0

ggplot(sat, aes(x=expend,y=total)) + geom_point()

ggplot(sat, aes(x=total)) + geom_histogram(bins=50)

Linear Regression Chapter 1

Joel Park

2/1/2018

Chapter 1 Regression

Suggested Exercises