A survey was conducted to study teenage gambling in Britain. This frame contains the following columns:
sex 0=male, 1=female
status Socioeconomic status score based on parents’ occupation
income in pounds per week
verbal verbal score in words out of 12 correctly defined
gamble expenditure on gambling in pounds per year
The data set ‘’teengamp’’ has 47 observations and 5 variables. As We notice there are 28 men and 19 females teenagers. There no missing value in the dataset.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
# load data set teengamp
teengamp <- read.table('teengamb.txt', header = T)
# read the first 5 rows
head(teengamp)
## sex status income verbal gamble
## 1 1 51 2.00 8 0.0
## 2 1 28 2.50 8 0.0
## 3 1 37 2.00 6 0.0
## 4 1 28 7.00 4 7.3
## 5 1 65 2.00 8 19.6
## 6 1 61 3.47 6 0.1
Thre is not enough information on this data set to do inference. We can see basic statitics using summary function. The graph below show
# attch data set
attach(teengamp)
summary(teengamp)
## sex status income verbal
## Min. :0.0000 Min. :18.00 Min. : 0.600 Min. : 1.00
## 1st Qu.:0.0000 1st Qu.:28.00 1st Qu.: 2.000 1st Qu.: 6.00
## Median :0.0000 Median :43.00 Median : 3.250 Median : 7.00
## Mean :0.4043 Mean :45.23 Mean : 4.642 Mean : 6.66
## 3rd Qu.:1.0000 3rd Qu.:61.50 3rd Qu.: 6.210 3rd Qu.: 8.00
## Max. :1.0000 Max. :75.00 Max. :15.000 Max. :10.00
## gamble
## Min. : 0.0
## 1st Qu.: 1.1
## Median : 6.0
## Mean : 19.3
## 3rd Qu.: 19.4
## Max. :156.0
str(teengamp)
## 'data.frame': 47 obs. of 5 variables:
## $ sex : int 1 1 1 1 1 1 1 1 1 1 ...
## $ status: int 51 28 37 28 65 61 28 27 43 18 ...
## $ income: num 2 2.5 2 7 2 3.47 5.5 6.42 2 6 ...
## $ verbal: int 8 8 6 4 8 6 7 5 6 7 ...
## $ gamble: num 0 0 0 7.3 19.6 0.1 1.45 6.6 1.7 0.1 ...
# Look at the distribution for test '
summary(factor(teengamp$sex))
## 0 1
## 28 19
The distribution of income shows a wide range, indicating a diverse economic background among the teenagers.The distribution of gambling expenditure is highly skewed, with most teenagers spending very little, while a few spend much more. This could indicate that gambling is not a common activity for most but significant for some
library(ggplot2)
ggplot(teengamp, aes(x=income, y=gamble)) + geom_point(size =1) + facet_grid(~sex)
par(mfrow=c(2,3))
# Histograms for each numerical variable
hist(teengamp$income, main="Income Distribution", xlab="Income")
hist(teengamp$verbal, main="Verbal Score Distribution", xlab="Verbal Score")
hist(teengamp$gamble, main="Gambling Expenditure Distribution", xlab="Gambling Expenditure")
# Boxplots to compare gambling expenditure by sex and status
boxplot(gamble ~ sex, data=teengamp, main="Gambling Expenditure by Sex", xlab="Sex", ylab="Gambling Expenditure")
boxplot(gamble ~ status, data=teengamp, main="Gambling Expenditure by Status", xlab="Status", ylab="Gambling Expenditure")
library(GGally)
ggpairs(teengamp,
color = 2:4,
aes(color = factor(sex),
alpha=0.5))
## Warning in warn_if_args_exist(list(...)): Extra arguments: "color" are being
## ignored. If these are meant to be aesthetics, submit them using the 'mapping'
## variable within ggpairs with ggplot2::aes or ggplot2::aes_string.
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
BY looking at the pairplot,m we can notice that there is a correlation between income and gamble
1.3 The dataset prostate is from a study on 97 men with prostate cancer who were due to receive a radical prostatectomy. Make a numerical and graphical summary of the data as in the first question
Solution
The prostate data set has 97 observations and 9 variables. The variables are:
lpsa: log PSA score
lcavol: log cancer volume
lweight: log prostate weight
age: age of patient
lbph: log of the amount of benign prostatic hyperplasia
svi: seminal vesicle invasion
lcp: log of capsular penetration
gleason: Gleason score
pgg45: percent of Gleason scores 4 or 5
As We can notice minimum age that appear to have a prostate cancer is 41 and the maximum age is 79. It is important to understand the relationships between these variables and the data is distributed.
prostate <- read.table('prostate.txt',header=T)
head(prostate)
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 1 -0.5798185 2.7695 50 -1.386294 0 -1.38629 6 0 -0.43078
## 2 -0.9942523 3.3196 58 -1.386294 0 -1.38629 6 0 -0.16252
## 3 -0.5108256 2.6912 74 -1.386294 0 -1.38629 7 20 -0.16252
## 4 -1.2039728 3.2828 58 -1.386294 0 -1.38629 6 0 -0.16252
## 5 0.7514161 3.4324 62 -1.386294 0 -1.38629 6 0 0.37156
## 6 -1.0498221 3.2288 50 -1.386294 0 -1.38629 6 0 0.76547
str(prostate)
## 'data.frame': 97 obs. of 9 variables:
## $ lcavol : num -0.58 -0.994 -0.511 -1.204 0.751 ...
## $ lweight: num 2.77 3.32 2.69 3.28 3.43 ...
## $ age : int 50 58 74 58 62 50 64 58 47 63 ...
## $ lbph : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ svi : int 0 0 0 0 0 0 0 0 0 0 ...
## $ lcp : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ gleason: int 6 6 7 6 6 6 6 6 6 6 ...
## $ pgg45 : int 0 0 20 0 0 0 0 0 0 0 ...
## $ lpsa : num -0.431 -0.163 -0.163 -0.163 0.372 ...
summary(prostate)
## lcavol lweight age lbph
## Min. :-1.3471 Min. :2.375 Min. :41.00 Min. :-1.3863
## 1st Qu.: 0.5128 1st Qu.:3.376 1st Qu.:60.00 1st Qu.:-1.3863
## Median : 1.4469 Median :3.623 Median :65.00 Median : 0.3001
## Mean : 1.3500 Mean :3.653 Mean :63.87 Mean : 0.1004
## 3rd Qu.: 2.1270 3rd Qu.:3.878 3rd Qu.:68.00 3rd Qu.: 1.5581
## Max. : 3.8210 Max. :6.108 Max. :79.00 Max. : 2.3263
## svi lcp gleason pgg45
## Min. :0.0000 Min. :-1.3863 Min. :6.000 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.:-1.3863 1st Qu.:6.000 1st Qu.: 0.00
## Median :0.0000 Median :-0.7985 Median :7.000 Median : 15.00
## Mean :0.2165 Mean :-0.1794 Mean :6.753 Mean : 24.38
## 3rd Qu.:0.0000 3rd Qu.: 1.1786 3rd Qu.:7.000 3rd Qu.: 40.00
## Max. :1.0000 Max. : 2.9042 Max. :9.000 Max. :100.00
## lpsa
## Min. :-0.4308
## 1st Qu.: 1.7317
## Median : 2.5915
## Mean : 2.4784
## 3rd Qu.: 3.0564
## Max. : 5.5829
attach(prostate)
par(mfrow = c(3,3)) # 3*3 plots
for (col in names(prostate)){
hist(prostate[[col]], main = paste('Histogram of', col), xlab=col, col='lightblue', border='black') # histogram for all the numerical value features
}
for (col in names(prostate)) {
boxplot(prostate[[col]], main=paste('Boxplot of ', col), xlab=col, col='lightblue', border = 'black')
}
library(corrplot)
## corrplot 0.92 loaded
cor_matrix <- cor(prostate)
corrplot(cor_matrix, method = 'circle', type='upper',order= 'hclust',
addCoef.col = 'black', tl.col = 'black',tl.srt = 45,
title = 'Correlation of Plot Prostate Cancer', mar = c(0,0,1,0))
Histograms and boxplots reveal the distribution and potential outliers
in the data. The scatter plot matrix helps in visualizing relationships
between variables. We use the correlation plot to show how well these
variables are related to each other.
This model suggests that lcavol, lweight are significant predictors of lpsa, while age is less so. The model explains a significant portion of the variance in lpsa, indicating that it has a reasonably good fit. The R-squared value of 0.5981 indicates that about 60% of the variability in lpsa is explained by the model. The F-statistic is significant (p-value < 0.0001), suggesting that the model as a whole is a good fit for the data.
lm_pros <- lm(lpsa ~ lcavol+age+lweight+gleason)
summary(lm_pros)
##
## Call:
## lm(formula = lpsa ~ lcavol + age + lweight + gleason)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.48679 -0.46467 -0.00243 0.38421 1.95227
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.85636 1.04299 -0.821 0.413732
## lcavol 0.64443 0.07367 8.748 9.82e-14 ***
## age -0.01320 0.01125 -1.173 0.243783
## lweight 0.58736 0.16506 3.559 0.000592 ***
## gleason 0.17211 0.12103 1.422 0.158393
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7476 on 92 degrees of freedom
## Multiple R-squared: 0.5981, Adjusted R-squared: 0.5806
## F-statistic: 34.22 on 4 and 92 DF, p-value: < 2.2e-16
plot(lm_pros)
Exercise 4
1.4 The dataset sat comes from a study entitled “Getting What You Pay For: The Debate Over Equity in Public School Expenditures.” Make a numerical and graphical summary of the data as in the first question.
Solution
The datset SAT has 50 observations and 8 columns. These variables are:
Variable Description State: Name of state expend: Expenditure per pupil in average daily attendance in public elementary and secondary schools, 1994-95 (in thousands of dollars) ratio: Average pupil/teacher ratio in public elementary and secondary schools, Fall 1994 salary: Estimated average annual salary of teachers in public elementary and secondary schools, 1994-95 (in thousands of dollars) frac: Percentage of all eligible students taking the SAT, 1994-95 verbal: Average verbal SAT score, 1994-95 math: Average math SAT score, 1994-95 sat: Average total score on the SAT, 1994-95
We change the row names into columns and make an histogram of how the data is distributed. We loop through all the numerical variables to create the histograms. The correlation plot shows that there is a high correlation between salary and expenditure per pupil in average daily attendance in public elementary and secondary schools. State like North Dakota has the highest math test score for SAT and Iowa has the highest total points for the SAT Test score.
library(tibble)
sat <- read.table('sat.txt',header = T, sep='\t')
head(sat)
## expend ratio salary takers verbal math total
## Alabama 4.405 17.2 31.144 8 491 538 1029
## Alaska 8.963 17.6 47.951 47 445 489 934
## Arizona 4.778 19.3 32.175 27 448 496 944
## Arkansas 4.459 17.1 28.934 6 482 523 1005
## California 4.992 24.0 41.078 45 417 485 902
## Colorado 5.443 18.4 34.571 29 462 518 980
# Changing the shape of the table
new_df <- rownames_to_column(sat, var='State')
new_df
## State expend ratio salary takers verbal math total
## 1 Alabama 4.405 17.2 31.144 8 491 538 1029
## 2 Alaska 8.963 17.6 47.951 47 445 489 934
## 3 Arizona 4.778 19.3 32.175 27 448 496 944
## 4 Arkansas 4.459 17.1 28.934 6 482 523 1005
## 5 California 4.992 24.0 41.078 45 417 485 902
## 6 Colorado 5.443 18.4 34.571 29 462 518 980
## 7 Connecticut 8.817 14.4 50.045 81 431 477 908
## 8 Delaware 7.030 16.6 39.076 68 429 468 897
## 9 Florida 5.718 19.1 32.588 48 420 469 889
## 10 Georgia 5.193 16.3 32.291 65 406 448 854
## 11 Hawaii 6.078 17.9 38.518 57 407 482 889
## 12 Idaho 4.210 19.1 29.783 15 468 511 979
## 13 Illinois 6.136 17.3 39.431 13 488 560 1048
## 14 Indiana 5.826 17.5 36.785 58 415 467 882
## 15 Iowa 5.483 15.8 31.511 5 516 583 1099
## 16 Kansas 5.817 15.1 34.652 9 503 557 1060
## 17 Kentucky 5.217 17.0 32.257 11 477 522 999
## 18 Louisiana 4.761 16.8 26.461 9 486 535 1021
## 19 Maine 6.428 13.8 31.972 68 427 469 896
## 20 Maryland 7.245 17.0 40.661 64 430 479 909
## 21 Massachusetts 7.287 14.8 40.795 80 430 477 907
## 22 Michigan 6.994 20.1 41.895 11 484 549 1033
## 23 Minnesota 6.000 17.5 35.948 9 506 579 1085
## 24 Mississippi 4.080 17.5 26.818 4 496 540 1036
## 25 Missouri 5.383 15.5 31.189 9 495 550 1045
## 26 Montana 5.692 16.3 28.785 21 473 536 1009
## 27 Nebraska 5.935 14.5 30.922 9 494 556 1050
## 28 Nevada 5.160 18.7 34.836 30 434 483 917
## 29 New Hampshire 5.859 15.6 34.720 70 444 491 935
## 30 New Jersey 9.774 13.8 46.087 70 420 478 898
## 31 New Mexico 4.586 17.2 28.493 11 485 530 1015
## 32 New York 9.623 15.2 47.612 74 419 473 892
## 33 North Carolina 5.077 16.2 30.793 60 411 454 865
## 34 North Dakota 4.775 15.3 26.327 5 515 592 1107
## 35 Ohio 6.162 16.6 36.802 23 460 515 975
## 36 Oklahoma 4.845 15.5 28.172 9 491 536 1027
## 37 Oregon 6.436 19.9 38.555 51 448 499 947
## 38 Pennsylvania 7.109 17.1 44.510 70 419 461 880
## 39 Rhode Island 7.469 14.7 40.729 70 425 463 888
## 40 South Carolina 4.797 16.4 30.279 58 401 443 844
## 41 South Dakota 4.775 14.4 25.994 5 505 563 1068
## 42 Tennessee 4.388 18.6 32.477 12 497 543 1040
## 43 Texas 5.222 15.7 31.223 47 419 474 893
## 44 Utah 3.656 24.3 29.082 4 513 563 1076
## 45 Vermont 6.750 13.8 35.406 68 429 472 901
## 46 Virginia 5.327 14.6 33.987 65 428 468 896
## 47 Washington 5.906 20.2 36.151 48 443 494 937
## 48 West Virginia 6.107 14.8 31.944 17 448 484 932
## 49 Wisconsin 6.930 15.9 37.746 9 501 572 1073
## 50 Wyoming 6.160 14.9 31.285 10 476 525 1001
str(new_df)
## 'data.frame': 50 obs. of 8 variables:
## $ State : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ expend: num 4.41 8.96 4.78 4.46 4.99 ...
## $ ratio : num 17.2 17.6 19.3 17.1 24 18.4 14.4 16.6 19.1 16.3 ...
## $ salary: num 31.1 48 32.2 28.9 41.1 ...
## $ takers: int 8 47 27 6 45 29 81 68 48 65 ...
## $ verbal: int 491 445 448 482 417 462 431 429 420 406 ...
## $ math : int 538 489 496 523 485 518 477 468 469 448 ...
## $ total : int 1029 934 944 1005 902 980 908 897 889 854 ...
summary(new_df)
## State expend ratio salary
## Length:50 Min. :3.656 Min. :13.80 Min. :25.99
## Class :character 1st Qu.:4.882 1st Qu.:15.22 1st Qu.:30.98
## Mode :character Median :5.768 Median :16.60 Median :33.29
## Mean :5.905 Mean :16.86 Mean :34.83
## 3rd Qu.:6.434 3rd Qu.:17.57 3rd Qu.:38.55
## Max. :9.774 Max. :24.30 Max. :50.05
## takers verbal math total
## Min. : 4.00 Min. :401.0 Min. :443.0 Min. : 844.0
## 1st Qu.: 9.00 1st Qu.:427.2 1st Qu.:474.8 1st Qu.: 897.2
## Median :28.00 Median :448.0 Median :497.5 Median : 945.5
## Mean :35.24 Mean :457.1 Mean :508.8 Mean : 965.9
## 3rd Qu.:63.00 3rd Qu.:490.2 3rd Qu.:539.5 3rd Qu.:1032.0
## Max. :81.00 Max. :516.0 Max. :592.0 Max. :1107.0
attach(new_df)
## The following object is masked from teengamp:
##
## verbal
new10 <- new_df %>% arrange(math) %>% tail()
new10
## State expend ratio salary takers verbal math total
## 45 South Dakota 4.775 14.4 25.994 5 505 563 1068
## 46 Utah 3.656 24.3 29.082 4 513 563 1076
## 47 Wisconsin 6.930 15.9 37.746 9 501 572 1073
## 48 Minnesota 6.000 17.5 35.948 9 506 579 1085
## 49 Iowa 5.483 15.8 31.511 5 516 583 1099
## 50 North Dakota 4.775 15.3 26.327 5 515 592 1107
library(dplyr)
corr_matrix <- cor(new_df[,-1])
corrplot(corr_matrix, method = 'circle', type='upper',order= 'hclust',
addCoef.col = 'black', tl.col = 'black',tl.srt = 45,
title = 'Correlation of Plot SAT ', mar = c(0,0,1,0))
# Scatter plot matrix
pairs(new_df[,-1], main = "Scatterplot Matrix", pch = 19, col = "blue")
**Excerise 1.5
The dataset divusa contains data on divorces in the United States from 1920 to 1996. Make a numerical and graphical summary of the data as in the first question
Solution
Divorce in the USA 1920-1996 > Description
Divorce rates in the USA from 1920-1996
Format A data frame with 77 observations on the following 7 variables.
year the year from 1920-1996
divorce divorce per 1000 women aged 15 or more
unemployed unemployment rate
femlab percent female participation in labor force aged 16+
marriage marriages per 1000 unmarried women aged 16+
birth births per 1000 women aged 15-44
military military personnel per 1000 population
# Read the dataset
df <- read.table("divusa.txt", header = TRUE, sep = "\t")
# Summary statistics
summary(df)
## year divorce unemployed femlab
## Min. :1920 Min. : 6.10 Min. : 1.200 Min. :22.70
## 1st Qu.:1939 1st Qu.: 8.70 1st Qu.: 4.200 1st Qu.:27.47
## Median :1958 Median :10.60 Median : 5.600 Median :37.10
## Mean :1958 Mean :13.27 Mean : 7.173 Mean :38.58
## 3rd Qu.:1977 3rd Qu.:20.30 3rd Qu.: 7.500 3rd Qu.:47.80
## Max. :1996 Max. :22.80 Max. :24.900 Max. :59.30
## marriage birth military
## Min. : 49.70 Min. : 65.30 Min. : 1.940
## 1st Qu.: 61.90 1st Qu.: 68.90 1st Qu.: 3.469
## Median : 74.10 Median : 85.90 Median : 9.102
## Mean : 72.97 Mean : 88.89 Mean :12.365
## 3rd Qu.: 80.00 3rd Qu.:107.30 3rd Qu.:14.266
## Max. :118.10 Max. :122.90 Max. :86.641
str(df)
## 'data.frame': 77 obs. of 7 variables:
## $ year : int 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 ...
## $ divorce : num 8 7.2 6.6 7.1 7.2 7.2 7.5 7.8 7.8 8 ...
## $ unemployed: num 5.2 11.7 6.7 2.4 5 3.2 1.8 3.3 4.2 3.2 ...
## $ femlab : num 22.7 22.8 22.9 23 23.1 ...
## $ marriage : num 92 83 79.7 85.2 80.3 79.2 78.7 77 74.1 75.5 ...
## $ birth : num 118 120 111 110 111 ...
## $ military : num 3.22 3.56 2.46 2.21 2.29 ...
# Correlation matrix
cor(df[,-1]) # Exclude 'year' column
## divorce unemployed femlab marriage birth military
## divorce 1.00000000 -0.2106019 0.91039698 -0.5342554 -0.7219242 0.01857483
## unemployed -0.21060188 1.0000000 -0.25746176 -0.2707630 -0.3138890 -0.40029295
## femlab 0.91039698 -0.2574618 1.00000000 -0.6486273 -0.6040949 0.05126339
## marriage -0.53425537 -0.2707630 -0.64862728 1.0000000 0.6737273 0.25819826
## birth -0.72192425 -0.3138890 -0.60409490 0.6737273 1.0000000 0.14089864
## military 0.01857483 -0.4002930 0.05126339 0.2581983 0.1408986 1.00000000
cor(df[,-1]) # Exclude 'year' column
## divorce unemployed femlab marriage birth military
## divorce 1.00000000 -0.2106019 0.91039698 -0.5342554 -0.7219242 0.01857483
## unemployed -0.21060188 1.0000000 -0.25746176 -0.2707630 -0.3138890 -0.40029295
## femlab 0.91039698 -0.2574618 1.00000000 -0.6486273 -0.6040949 0.05126339
## marriage -0.53425537 -0.2707630 -0.64862728 1.0000000 0.6737273 0.25819826
## birth -0.72192425 -0.3138890 -0.60409490 0.6737273 1.0000000 0.14089864
## military 0.01857483 -0.4002930 0.05126339 0.2581983 0.1408986 1.00000000
# 2. Graphical Summary
head(df)
## year divorce unemployed femlab marriage birth military
## 1 1920 8.0 5.2 22.70 92.0 117.9 3.2247
## 2 1921 7.2 11.7 22.79 83.0 119.8 3.5614
## 3 1922 6.6 6.7 22.88 79.7 111.2 2.4553
## 4 1923 7.1 2.4 22.97 85.2 110.5 2.2065
## 5 1924 7.2 5.0 23.06 80.3 110.9 2.2889
## 6 1925 7.2 3.2 23.15 79.2 106.6 2.1735
# Convert the data to long format for easier plotting
df_long <- reshape2::melt(df, id.vars = "year")
# Plot histograms for each variable
ggplot(df_long, aes(x = value)) +
geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
facet_wrap(~variable, scales = "free") +
labs(title = "Histograms of DivUSA Variables", x = "Value", y = "Frequency") +
theme_minimal()
corr_matrix <- cor(df[-1])
corrplot(corr_matrix, method = 'circle', type='upper',order= 'hclust',
addCoef.col = 'black', tl.col = 'black',tl.srt = 45,
title = 'Correlation of Female working ', mar = c(0,0,1,0))
The histogram displays the distribution of all the variables and gives us a better insight of the kind of data we have. It is important to notice that divorce among female who participate in labor forced at the of 16 and up have a strong correlation.