The dataset teengamb concerns a study of teenage gambling in Britain. Make a numerical and graphical summary of the data, commenting on any features that you find interesting. Limit the output you present to a quantity that a busy reader would find sufficient to get a basic understanding of the data.
library(faraway)
## Warning: package 'faraway' was built under R version 4.3.3
data(teengamb)
summary(teengamb)
## sex status income verbal
## Min. :0.0000 Min. :18.00 Min. : 0.600 Min. : 1.00
## 1st Qu.:0.0000 1st Qu.:28.00 1st Qu.: 2.000 1st Qu.: 6.00
## Median :0.0000 Median :43.00 Median : 3.250 Median : 7.00
## Mean :0.4043 Mean :45.23 Mean : 4.642 Mean : 6.66
## 3rd Qu.:1.0000 3rd Qu.:61.50 3rd Qu.: 6.210 3rd Qu.: 8.00
## Max. :1.0000 Max. :75.00 Max. :15.000 Max. :10.00
## gamble
## Min. : 0.0
## 1st Qu.: 1.1
## Median : 6.0
## Mean : 19.3
## 3rd Qu.: 19.4
## Max. :156.0
str(teengamb) # Structure of the dataset
## 'data.frame': 47 obs. of 5 variables:
## $ sex : int 1 1 1 1 1 1 1 1 1 1 ...
## $ status: int 51 28 37 28 65 61 28 27 43 18 ...
## $ income: num 2 2.5 2 7 2 3.47 5.5 6.42 2 6 ...
## $ verbal: int 8 8 6 4 8 6 7 5 6 7 ...
## $ gamble: num 0 0 0 7.3 19.6 0.1 1.45 6.6 1.7 0.1 ...
sapply(teengamb, function(x) sum(is.na(x))) # Check for missing values
## sex status income verbal gamble
## 0 0 0 0 0
cor(teengamb) # Correlation matrix
## sex status income verbal gamble
## sex 1.0000000 -0.48093526 -0.1154586 -0.1069762 -0.40780533
## status -0.4809353 1.00000000 -0.2750340 0.5316102 -0.05042081
## income -0.1154586 -0.27503402 1.0000000 -0.1755707 0.62207690
## verbal -0.1069762 0.53161022 -0.1755707 1.0000000 -0.22005619
## gamble -0.4078053 -0.05042081 0.6220769 -0.2200562 1.00000000
hist(teengamb$gamble, breaks = 15, col = "lightblue", main = "Distribution of Gambling Expenditure", xlab = "Gambling Amount")
### Boxplot to Compare Gambling by Gender
boxplot(gamble ~ sex, data = teengamb, col = c("pink", "lightblue"),
names = c("Female", "Male"),
main = "Gambling Expenditure by Gender",
ylab = "Gambling Expenditure")
### Scatterplot of Gambling vs. Income
plot(teengamb$income, teengamb$gamble, pch = 19, col = "blue",
xlab = "Income", ylab = "Gambling Expenditure",
main = "Relationship between Income and Gambling")
abline(lm(gamble ~ income, data = teengamb), col = "red", lwd = 2)
### Key Observations:
Gambling Distribution: The histogram may show a skewed distribution, with some teenagers spending significantly more than others.
Gender Differences: The boxplot may reveal that males tend to spend more on gambling than females.
Income Correlation: The scatterplot might suggest a positive relationship between income and gambling expenditure.
The dataset prostate is from a study on 97 men with prostate cancer who were due to receive a radical prostatectomy. Make a numerical and graphical summary of the data as in the first question.
library(faraway)
data(prostate)
summary(prostate) # Summary statistics
## lcavol lweight age lbph
## Min. :-1.3471 Min. :2.375 Min. :41.00 Min. :-1.3863
## 1st Qu.: 0.5128 1st Qu.:3.376 1st Qu.:60.00 1st Qu.:-1.3863
## Median : 1.4469 Median :3.623 Median :65.00 Median : 0.3001
## Mean : 1.3500 Mean :3.653 Mean :63.87 Mean : 0.1004
## 3rd Qu.: 2.1270 3rd Qu.:3.878 3rd Qu.:68.00 3rd Qu.: 1.5581
## Max. : 3.8210 Max. :6.108 Max. :79.00 Max. : 2.3263
## svi lcp gleason pgg45
## Min. :0.0000 Min. :-1.3863 Min. :6.000 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.:-1.3863 1st Qu.:6.000 1st Qu.: 0.00
## Median :0.0000 Median :-0.7985 Median :7.000 Median : 15.00
## Mean :0.2165 Mean :-0.1794 Mean :6.753 Mean : 24.38
## 3rd Qu.:0.0000 3rd Qu.: 1.1786 3rd Qu.:7.000 3rd Qu.: 40.00
## Max. :1.0000 Max. : 2.9042 Max. :9.000 Max. :100.00
## lpsa
## Min. :-0.4308
## 1st Qu.: 1.7317
## Median : 2.5915
## Mean : 2.4784
## 3rd Qu.: 3.0564
## Max. : 5.5829
# Check the structure and missing values:
str(prostate) # Data types and structure
## 'data.frame': 97 obs. of 9 variables:
## $ lcavol : num -0.58 -0.994 -0.511 -1.204 0.751 ...
## $ lweight: num 2.77 3.32 2.69 3.28 3.43 ...
## $ age : int 50 58 74 58 62 50 64 58 47 63 ...
## $ lbph : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ svi : int 0 0 0 0 0 0 0 0 0 0 ...
## $ lcp : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ gleason: int 6 6 7 6 6 6 6 6 6 6 ...
## $ pgg45 : int 0 0 20 0 0 0 0 0 0 0 ...
## $ lpsa : num -0.431 -0.163 -0.163 -0.163 0.372 ...
sapply(prostate, function(x) sum(is.na(x))) # Count missing values
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 0 0 0 0 0 0 0 0 0
cor(prostate) # Correlation matrix
## lcavol lweight age lbph svi lcp
## lcavol 1.00000000 0.194128387 0.2249999 0.02734971 0.53884500 0.67531058
## lweight 0.19412839 1.000000000 0.3075247 0.43493174 0.10877818 0.10023889
## age 0.22499988 0.307524741 1.0000000 0.35018592 0.11765804 0.12766778
## lbph 0.02734971 0.434931744 0.3501859 1.00000000 -0.08584327 -0.00699944
## svi 0.53884500 0.108778185 0.1176580 -0.08584327 1.00000000 0.67311122
## lcp 0.67531058 0.100238891 0.1276678 -0.00699944 0.67311122 1.00000000
## gleason 0.43241705 -0.001283003 0.2688916 0.07782044 0.32041222 0.51482991
## pgg45 0.43365224 0.050846195 0.2761124 0.07846000 0.45764762 0.63152807
## lpsa 0.73446028 0.354121818 0.1695929 0.17980950 0.56621818 0.54881316
## gleason pgg45 lpsa
## lcavol 0.432417052 0.4336522 0.7344603
## lweight -0.001283003 0.0508462 0.3541218
## age 0.268891599 0.2761124 0.1695929
## lbph 0.077820444 0.0784600 0.1798095
## svi 0.320412221 0.4576476 0.5662182
## lcp 0.514829912 0.6315281 0.5488132
## gleason 1.000000000 0.7519045 0.3689867
## pgg45 0.751904512 1.0000000 0.4223157
## lpsa 0.368986693 0.4223157 1.0000000
This helps identify relationships between different variables in the dataset.
hist(prostate$lpsa, breaks = 15, col = "lightblue",
main = "Distribution of Log PSA Levels",
xlab = "Log PSA Level")
### Boxplot of Log Cancer Volume (lcavol)
boxplot(prostate$lcavol, col = "lightgreen",
main = "Boxplot of Log Cancer Volume",
ylab = "Log Cancer Volume")
plot(prostate$lcavol, prostate$lpsa, col = "red", pch = 19,
main = "Log Cancer Volume vs Log PSA",
xlab = "Log Cancer Volume", ylab = "Log PSA Level")
abline(lm(lpsa ~ lcavol, data = prostate), col = "blue") # Add regression line
### Pairwise Scatter Plot for Key Variables
pairs(prostate[, c("lcavol", "lweight", "age", "lbph", "lpsa")],
col = "darkgreen", pch = 19)
Interpretation of Results: Summary statistics provide insights into the
distribution of PSA levels and other clinical variables. Histogram shows
the distribution of log PSA levels, which are often used instead of raw
PSA values. Boxplot visualizes the distribution of log cancer volume.
Scatter plot examines the relationship between log cancer volume and PSA
levels. Pairwise scatter plots help explore relationships between
multiple clinical variables.
# Load dataset
data(sat)
# View first few rows
head(sat)
## expend ratio salary takers verbal math total
## Alabama 4.405 17.2 31.144 8 491 538 1029
## Alaska 8.963 17.6 47.951 47 445 489 934
## Arizona 4.778 19.3 32.175 27 448 496 944
## Arkansas 4.459 17.1 28.934 6 482 523 1005
## California 4.992 24.0 41.078 45 417 485 902
## Colorado 5.443 18.4 34.571 29 462 518 980
# Check structure
str(sat)
## 'data.frame': 50 obs. of 7 variables:
## $ expend: num 4.41 8.96 4.78 4.46 4.99 ...
## $ ratio : num 17.2 17.6 19.3 17.1 24 18.4 14.4 16.6 19.1 16.3 ...
## $ salary: num 31.1 48 32.2 28.9 41.1 ...
## $ takers: int 8 47 27 6 45 29 81 68 48 65 ...
## $ verbal: int 491 445 448 482 417 462 431 429 420 406 ...
## $ math : int 538 489 496 523 485 518 477 468 469 448 ...
## $ total : int 1029 934 944 1005 902 980 908 897 889 854 ...
# Summary statistics
summary(sat)
## expend ratio salary takers
## Min. :3.656 Min. :13.80 Min. :25.99 Min. : 4.00
## 1st Qu.:4.882 1st Qu.:15.22 1st Qu.:30.98 1st Qu.: 9.00
## Median :5.768 Median :16.60 Median :33.29 Median :28.00
## Mean :5.905 Mean :16.86 Mean :34.83 Mean :35.24
## 3rd Qu.:6.434 3rd Qu.:17.57 3rd Qu.:38.55 3rd Qu.:63.00
## Max. :9.774 Max. :24.30 Max. :50.05 Max. :81.00
## verbal math total
## Min. :401.0 Min. :443.0 Min. : 844.0
## 1st Qu.:427.2 1st Qu.:474.8 1st Qu.: 897.2
## Median :448.0 Median :497.5 Median : 945.5
## Mean :457.1 Mean :508.8 Mean : 965.9
## 3rd Qu.:490.2 3rd Qu.:539.5 3rd Qu.:1032.0
## Max. :516.0 Max. :592.0 Max. :1107.0
This provides an overview of the dataset, including variable types and summary statistics like min, max, mean, and quartiles.
colSums(is.na(sat)) # Count missing values per column
## expend ratio salary takers verbal math total
## 0 0 0 0 0 0 0
If missing values exist, handle them using na.omit(sat) or imputation methods.
cor(sat, use = "complete.obs") # Correlation between numeric variables
## expend ratio salary takers verbal math
## expend 1.0000000 -0.371025386 0.869801513 0.5926274 -0.41004987 -0.34941409
## ratio -0.3710254 1.000000000 -0.001146081 -0.2130536 0.06376664 0.09542173
## salary 0.8698015 -0.001146081 1.000000000 0.6167799 -0.47696364 -0.40131282
## takers 0.5926274 -0.213053607 0.616779867 1.0000000 -0.89326296 -0.86938393
## verbal -0.4100499 0.063766636 -0.476963635 -0.8932630 1.00000000 0.97025604
## math -0.3494141 0.095421730 -0.401312817 -0.8693839 0.97025604 1.00000000
## total -0.3805370 0.081253823 -0.439883381 -0.8871187 0.99150325 0.99350238
## total
## expend -0.38053700
## ratio 0.08125382
## salary -0.43988338
## takers -0.88711868
## verbal 0.99150325
## math 0.99350238
## total 1.00000000
This helps identify relationships between variables.
Histogram of SAT Scores
hist(sat$total, breaks = 15, col = "lightblue",
main = "Distribution of SAT Scores", xlab = "Total SAT Score")
boxplot(sat$total ~ sat$expend, col = "lightgreen",
main = "SAT Scores vs Expenditure", xlab = "Expenditure", ylab = "Total SAT Score")
### Scatter Plot: SAT Scores vs Expenditure
plot(sat$expend, sat$total, col = "blue", pch = 19,
main = "SAT Scores vs School Expenditure", xlab = "Expenditure per Student", ylab = "Total SAT Score")
abline(lm(sat$total ~ sat$expend), col = "red") # Add regression line
Interpretation of Results:
Summary statistics reveal the spread and central tendency of SAT scores and expenditures. Correlation matrix helps understand relationships between school expenditure and SAT performance. Histograms & boxplots show the distribution and possible outliers. Scatter plots help visualize trends, such as whether higher spending leads to better SAT scores
# Load necessary package
library(faraway)
# Load dataset
data(divusa)
# View first few rows
head(divusa)
## year divorce unemployed femlab marriage birth military
## 1 1920 8.0 5.2 22.70 92.0 117.9 3.2247
## 2 1921 7.2 11.7 22.79 83.0 119.8 3.5614
## 3 1922 6.6 6.7 22.88 79.7 111.2 2.4553
## 4 1923 7.1 2.4 22.97 85.2 110.5 2.2065
## 5 1924 7.2 5.0 23.06 80.3 110.9 2.2889
## 6 1925 7.2 3.2 23.15 79.2 106.6 2.1735
# Check structure of dataset
str(divusa)
## 'data.frame': 77 obs. of 7 variables:
## $ year : int 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 ...
## $ divorce : num 8 7.2 6.6 7.1 7.2 7.2 7.5 7.8 7.8 8 ...
## $ unemployed: num 5.2 11.7 6.7 2.4 5 3.2 1.8 3.3 4.2 3.2 ...
## $ femlab : num 22.7 22.8 22.9 23 23.1 ...
## $ marriage : num 92 83 79.7 85.2 80.3 79.2 78.7 77 74.1 75.5 ...
## $ birth : num 118 120 111 110 111 ...
## $ military : num 3.22 3.56 2.46 2.21 2.29 ...
# Summary statistics
summary(divusa)
## year divorce unemployed femlab
## Min. :1920 Min. : 6.10 Min. : 1.200 Min. :22.70
## 1st Qu.:1939 1st Qu.: 8.70 1st Qu.: 4.200 1st Qu.:27.47
## Median :1958 Median :10.60 Median : 5.600 Median :37.10
## Mean :1958 Mean :13.27 Mean : 7.173 Mean :38.58
## 3rd Qu.:1977 3rd Qu.:20.30 3rd Qu.: 7.500 3rd Qu.:47.80
## Max. :1996 Max. :22.80 Max. :24.900 Max. :59.30
## marriage birth military
## Min. : 49.70 Min. : 65.30 Min. : 1.940
## 1st Qu.: 61.90 1st Qu.: 68.90 1st Qu.: 3.469
## Median : 74.10 Median : 85.90 Median : 9.102
## Mean : 72.97 Mean : 88.89 Mean :12.365
## 3rd Qu.: 80.00 3rd Qu.:107.30 3rd Qu.:14.266
## Max. :118.10 Max. :122.90 Max. :86.641
This provides an overview of the dataset, including variable types and basic statistics like min, max, mean, and quartiles.
colSums(is.na(divusa)) # Count missing values per column
## year divorce unemployed femlab marriage birth military
## 0 0 0 0 0 0 0
If missing values exist, handle them using na.omit(divusa) or imputation techniques
cor(divusa, use = "complete.obs") # Correlation between numeric variables
## year divorce unemployed femlab marriage
## year 1.000000000 0.87923753 -0.2344792 0.98598207 -0.6173255
## divorce 0.879237534 1.00000000 -0.2106019 0.91039698 -0.5342554
## unemployed -0.234479195 -0.21060188 1.0000000 -0.25746176 -0.2707630
## femlab 0.985982068 0.91039698 -0.2574618 1.00000000 -0.6486273
## marriage -0.617325533 -0.53425537 -0.2707630 -0.64862728 1.0000000
## birth -0.576313991 -0.72192425 -0.3138890 -0.60409490 0.6737273
## military 0.007267171 0.01857483 -0.4002930 0.05126339 0.2581983
## birth military
## year -0.5763140 0.007267171
## divorce -0.7219242 0.018574832
## unemployed -0.3138890 -0.400292954
## femlab -0.6040949 0.051263390
## marriage 0.6737273 0.258198260
## birth 1.0000000 0.140898643
## military 0.1408986 1.000000000
This helps identify relationships between variables.
Histogram of Divorce Rate
hist(divusa$divorce, breaks = 15, col = "lightblue",
main = "Distribution of Divorce Rates in the U.S.",
xlab = "Divorce Rate")
### Line Plot of Divorce Rate Over Tim
plot(divusa$year, divusa$divorce, type = "l", col = "blue", lwd = 2,
main = "Divorce Rate in the U.S. (1920-1996)",
xlab = "Year", ylab = "Divorce Rate")
### Scatter Plot: Divorce Rate vs Unemployment
plot(divusa$unemployed, divusa$divorce, col = "red", pch = 19,
main = "Divorce Rate vs Unemployment Rate",
xlab = "Unemployment Rate", ylab = "Divorce Rate")
abline(lm(divusa$divorce ~ divusa$unemployed), col = "blue") # Add regression line
Interpretation of Results: Summary statistics provide insights into how the divorce rate varied over time. Histogram shows the distribution of divorce rates. Line plot highlights trends in divorce rates from 1920 to 1996. Scatter plot helps analyze relationships between economic factors (e.g., unemployment) and divorce rates.
Discussion Question:
“In both the teengamb and prostate datasets, we analyzed numerical summaries and visualized relationships between variables. When working with real-world data, what are some potential biases or limitations that might affect the interpretation of these datasets? How can we address these issues to ensure more reliable insights?”
Answer: When working with real-world datasets like teengamb (teenage gambling behavior) and prostate (prostate cancer patients), there are several potential biases and limitations that could affect our interpretation:
Use random sampling to ensure a diverse and representative dataset. Combine self-reported data with objective measures where possible. Use Statistical Adjustments
Apply techniques like normalization, outlier removal, or transformations (e.g., log transformation of skewed PSA levels). Conduct regression analysis while controlling for confounders to isolate the effect of key variables. Validate Findings with External Data
Compare insights from the dataset with other studies to check for consistency. Use cross-validation techniques to test the robustness of predictive models. Conclusion Recognizing and addressing biases in data analysis is crucial to making reliable interpretations. By improving study design, using appropriate statistical techniques, and validating results with external data, we can reduce errors and derive more meaningful insights from datasets like teengamb and prostate.
Answer to other’s Question:
Answer: Regression analysis isn’t about finding a single “true” model—it’s about building a model that’s useful, interpretable, and fits the data well. There are often multiple ways to get to a reasonable and statistically significant answer, so the key is knowing when to choose one model over another and when to stop refining it. 1. Choosing One Model Over Another A. Model Performance Matters We typically compare models using metrics like: • Adjusted R² – Tells us how well the model explains the data while adjusting for extra variables. • AIC/BIC (Akaike/Bayesian Information Criterion) – Helps balance model fit with complexity (lower is better). • Cross-validation – Tests how well the model generalizes to new data. B. Keep It Simple (But Not Too Simple) • A model with fewer variables is often better if it explains the data well—this avoids unnecessary complexity. • Overfitting is a big risk when we try to make a model too perfect for our dataset; it may not work well with new data. C. Practicality and Interpretability • If two models perform similarly, the simpler one or the one that makes more sense in the real world is usually the better choice. • A business or research problem often requires a model that aligns with domain knowledge rather than just statistical fit. 2. When is a Model “Good Enough”? A model is good enough when it: • Answers the research question – Does it provide useful insights? Follows key assumptions – Is it reasonable and not violating key regression rules (e.g., no multicollinearity, normally distributed errors)? Generalizes well – Does it work well on new data without being too rigid or too flexible? Balances accuracy and simplicity – Is it making reliable predictions without overcomplicating things?
• Instead of chasing a “perfect” model, focus on one that provides meaningful insights, works well for decision-making, and remains stable over different datasets.