Data621-HW1

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

library(faraway)

## Warning: package 'faraway' was built under R version 3.4.4

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

Question 1 :

The dataset teengamb concerns a study of teenage gambling in Britain. Make a numerical and graphical summary of the data, commenting on any features that you find interesting. Limit the output you present to a quantity that a busy reader would find sufficient to get a basic understanding of the data.

I would like to experiment the below questions

What percentage of income do teens spend on gambling to indicate an issue

newdata = teengamb
newdata$income <- newdata$income * 52  
newdata$poi <- (newdata$gamble / newdata$income)*100
hist(newdata$poi,xlab="percent of income to gambling")

(length(which(newdata$poi > 10)) / nrow(newdata) ) * 100

## [1] 29.78723

The result shows that about 30% of teens spend more than 10 percent of their income in gambling which is not good

male & female gambling comparisons

plot(gamble ~ sex, newdata)

plot(sort(newdata$gamble), ylab = "Sorted Expenditure")

plot(newdata$gamble ~ newdata$verbal)

## question 1.3 The dataset prostate is from a study on 97 men with prostate cancer who were due to receive a radical prostatectomy. Make a numerical and graphical summary of the data as in the first question

data("prostate")
summary(prostate)

##      lcavol           lweight           age             lbph        
##  Min.   :-1.3471   Min.   :2.375   Min.   :41.00   Min.   :-1.3863  
##  1st Qu.: 0.5128   1st Qu.:3.376   1st Qu.:60.00   1st Qu.:-1.3863  
##  Median : 1.4469   Median :3.623   Median :65.00   Median : 0.3001  
##  Mean   : 1.3500   Mean   :3.653   Mean   :63.87   Mean   : 0.1004  
##  3rd Qu.: 2.1270   3rd Qu.:3.878   3rd Qu.:68.00   3rd Qu.: 1.5581  
##  Max.   : 3.8210   Max.   :6.108   Max.   :79.00   Max.   : 2.3263  
##       svi              lcp             gleason          pgg45       
##  Min.   :0.0000   Min.   :-1.3863   Min.   :6.000   Min.   :  0.00  
##  1st Qu.:0.0000   1st Qu.:-1.3863   1st Qu.:6.000   1st Qu.:  0.00  
##  Median :0.0000   Median :-0.7985   Median :7.000   Median : 15.00  
##  Mean   :0.2165   Mean   :-0.1794   Mean   :6.753   Mean   : 24.38  
##  3rd Qu.:0.0000   3rd Qu.: 1.1786   3rd Qu.:7.000   3rd Qu.: 40.00  
##  Max.   :1.0000   Max.   : 2.9042   Max.   :9.000   Max.   :100.00  
##       lpsa        
##  Min.   :-0.4308  
##  1st Qu.: 1.7317  
##  Median : 2.5915  
##  Mean   : 2.4784  
##  3rd Qu.: 3.0564  
##  Max.   : 5.5829

prostate$gleason <- factor(prostate$gleason)
prostate$svi <- factor(prostate$svi)
summary(prostate)

##      lcavol           lweight           age             lbph        
##  Min.   :-1.3471   Min.   :2.375   Min.   :41.00   Min.   :-1.3863  
##  1st Qu.: 0.5128   1st Qu.:3.376   1st Qu.:60.00   1st Qu.:-1.3863  
##  Median : 1.4469   Median :3.623   Median :65.00   Median : 0.3001  
##  Mean   : 1.3500   Mean   :3.653   Mean   :63.87   Mean   : 0.1004  
##  3rd Qu.: 2.1270   3rd Qu.:3.878   3rd Qu.:68.00   3rd Qu.: 1.5581  
##  Max.   : 3.8210   Max.   :6.108   Max.   :79.00   Max.   : 2.3263  
##  svi         lcp          gleason     pgg45             lpsa        
##  0:76   Min.   :-1.3863   6:35    Min.   :  0.00   Min.   :-0.4308  
##  1:21   1st Qu.:-1.3863   7:56    1st Qu.:  0.00   1st Qu.: 1.7317  
##         Median :-0.7985   8: 1    Median : 15.00   Median : 2.5915  
##         Mean   :-0.1794   9: 5    Mean   : 24.38   Mean   : 2.4784  
##         3rd Qu.: 1.1786           3rd Qu.: 40.00   3rd Qu.: 3.0564  
##         Max.   : 2.9042           Max.   :100.00   Max.   : 5.5829

hist(prostate$lcavol, xlab = "Cancer Volume", main = "")

plot(density(prostate$lcavol, na.rm = TRUE), main = "")

plot(lcavol ~ lweight, prostate)
abline(lm(lcavol ~ lweight, prostate))

question 1.4

The dataset sat comes from a study entitled “Getting What You Pay For: The Debate Over Equity in Public School Expenditures.” Make a numerical and graphical summary of the data as in the first question.

data(sat)
summary(sat)

##      expend          ratio           salary          takers     
##  Min.   :3.656   Min.   :13.80   Min.   :25.99   Min.   : 4.00  
##  1st Qu.:4.882   1st Qu.:15.22   1st Qu.:30.98   1st Qu.: 9.00  
##  Median :5.768   Median :16.60   Median :33.29   Median :28.00  
##  Mean   :5.905   Mean   :16.86   Mean   :34.83   Mean   :35.24  
##  3rd Qu.:6.434   3rd Qu.:17.57   3rd Qu.:38.55   3rd Qu.:63.00  
##  Max.   :9.774   Max.   :24.30   Max.   :50.05   Max.   :81.00  
##      verbal           math           total       
##  Min.   :401.0   Min.   :443.0   Min.   : 844.0  
##  1st Qu.:427.2   1st Qu.:474.8   1st Qu.: 897.2  
##  Median :448.0   Median :497.5   Median : 945.5  
##  Mean   :457.1   Mean   :508.8   Mean   : 965.9  
##  3rd Qu.:490.2   3rd Qu.:539.5   3rd Qu.:1032.0  
##  Max.   :516.0   Max.   :592.0   Max.   :1107.0

hist(sat$expend, xlab = "Expenditure", main = "")

plot(density(sat$expend, na.rm = TRUE), main = "")

plot(expend ~ ratio, sat)
abline(lm(expend ~ ratio, sat))

plot(expend ~ verbal, sat)
abline(lm(expend ~ verbal, sat))

question 1.5

The dataset divusa contains data on divorces in the United States from 1920 to 1996. Make a numerical and graphical summary of the data as in the first question

data("divusa")
summary(divusa)

##       year         divorce        unemployed         femlab     
##  Min.   :1920   Min.   : 6.10   Min.   : 1.200   Min.   :22.70  
##  1st Qu.:1939   1st Qu.: 8.70   1st Qu.: 4.200   1st Qu.:27.47  
##  Median :1958   Median :10.60   Median : 5.600   Median :37.10  
##  Mean   :1958   Mean   :13.27   Mean   : 7.173   Mean   :38.58  
##  3rd Qu.:1977   3rd Qu.:20.30   3rd Qu.: 7.500   3rd Qu.:47.80  
##  Max.   :1996   Max.   :22.80   Max.   :24.900   Max.   :59.30  
##     marriage          birth           military     
##  Min.   : 49.70   Min.   : 65.30   Min.   : 1.940  
##  1st Qu.: 61.90   1st Qu.: 68.90   1st Qu.: 3.469  
##  Median : 74.10   Median : 85.90   Median : 9.102  
##  Mean   : 72.97   Mean   : 88.89   Mean   :12.365  
##  3rd Qu.: 80.00   3rd Qu.:107.30   3rd Qu.:14.266  
##  Max.   :118.10   Max.   :122.90   Max.   :86.641

hist(divusa$divorce, xlab = "DIVORCE", main = "")

plot(sort(divusa$divorce), ylab = "Sorted Divorce")

lmod10 <- lm(divorce ~ marriage, divusa)
coef(lmod10)

## (Intercept)    marriage 
##  30.1081994  -0.2307625

plot(divorce ~ femlab, divusa)
abline(lm(divorce ~ femlab, divusa))

plot(divorce ~ military, divusa)
abline(lm(divorce ~ military, divusa))