Exercise 1.1:

The dataset teengamb concerns a study of teenage gambling in Britain. Make a numerical and graphical summary of the data, commenting on any features that you find interesting. Limit the output you present to a quantity that a busy reader would find sufficient to get a basic understanding of the data.

Load the Data

library(faraway)
data(teengamb)

Numerical Summary

summary(teengamb)
##       sex             status          income           verbal     
##  Min.   :0.0000   Min.   :18.00   Min.   : 0.600   Min.   : 1.00  
##  1st Qu.:0.0000   1st Qu.:28.00   1st Qu.: 2.000   1st Qu.: 6.00  
##  Median :0.0000   Median :43.00   Median : 3.250   Median : 7.00  
##  Mean   :0.4043   Mean   :45.23   Mean   : 4.642   Mean   : 6.66  
##  3rd Qu.:1.0000   3rd Qu.:61.50   3rd Qu.: 6.210   3rd Qu.: 8.00  
##  Max.   :1.0000   Max.   :75.00   Max.   :15.000   Max.   :10.00  
##      gamble     
##  Min.   :  0.0  
##  1st Qu.:  1.1  
##  Median :  6.0  
##  Mean   : 19.3  
##  3rd Qu.: 19.4  
##  Max.   :156.0
str(teengamb)  # Structure of the dataset
## 'data.frame':    47 obs. of  5 variables:
##  $ sex   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ status: int  51 28 37 28 65 61 28 27 43 18 ...
##  $ income: num  2 2.5 2 7 2 3.47 5.5 6.42 2 6 ...
##  $ verbal: int  8 8 6 4 8 6 7 5 6 7 ...
##  $ gamble: num  0 0 0 7.3 19.6 0.1 1.45 6.6 1.7 0.1 ...
sapply(teengamb, function(x) sum(is.na(x)))  # Check for missing values
##    sex status income verbal gamble 
##      0      0      0      0      0
cor(teengamb)  # Correlation matrix
##               sex      status     income     verbal      gamble
## sex     1.0000000 -0.48093526 -0.1154586 -0.1069762 -0.40780533
## status -0.4809353  1.00000000 -0.2750340  0.5316102 -0.05042081
## income -0.1154586 -0.27503402  1.0000000 -0.1755707  0.62207690
## verbal -0.1069762  0.53161022 -0.1755707  1.0000000 -0.22005619
## gamble -0.4078053 -0.05042081  0.6220769 -0.2200562  1.00000000

Graphical Summary

Histogram of Gambling Expenditure

hist(teengamb$gamble, breaks = 15, col = "lightblue", main = "Distribution of Gambling Expenditure", xlab = "Gambling Amount")

Boxplot to Compare Gambling by Gender

boxplot(gamble ~ sex, data = teengamb, col = c("lightblue", "pink"),  # sex: 0 = male, 1 = female
        names = c("Male", "Female"), 
        main = "Gambling Expenditure by Gender", 
        ylab = "Gambling Expenditure")

Scatterplot of Gambling vs. Income

plot(teengamb$income, teengamb$gamble, pch = 19, col = "blue",
     xlab = "Income", ylab = "Gambling Expenditure",
     main = "Relationship between Income and Gambling")
abline(lm(gamble ~ income, data = teengamb), col = "red", lwd = 2)

Key Observations:

Gambling Distribution: The histogram shows a strongly right-skewed distribution (mean 19.3 vs. median 6.0), with a few teenagers spending far more than the rest.

Gender Differences: The boxplot shows that males tend to spend more on gambling than females, consistent with the sex-gamble correlation of -0.41 (sex coded 0 = male, 1 = female).

Income Correlation: The scatterplot shows a positive relationship between income and gambling expenditure (r = 0.62).
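
These impressions can be checked numerically. A minimal sketch, relying on the faraway coding of sex (0 = male, 1 = female):

# Group summaries of gambling expenditure by sex
tapply(teengamb$gamble, teengamb$sex, summary)
# A mean far above the median confirms the right skew
c(mean = mean(teengamb$gamble), median = median(teengamb$gamble))
# A log-scale histogram tames the skew; log1p handles the zero expenditures
hist(log1p(teengamb$gamble), breaks = 15, col = "lightblue",
     main = "Log-Scale Gambling Expenditure", xlab = "log(1 + gamble)")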

Exercise 1.3:

The dataset prostate is from a study on 97 men with prostate cancer who were due to receive a radical prostatectomy. Make a numerical and graphical summary of the data as in the first question.

Load the Data

library(faraway)
data(prostate)

Numerical Summary

summary(prostate)  # Summary statistics
##      lcavol           lweight           age             lbph        
##  Min.   :-1.3471   Min.   :2.375   Min.   :41.00   Min.   :-1.3863  
##  1st Qu.: 0.5128   1st Qu.:3.376   1st Qu.:60.00   1st Qu.:-1.3863  
##  Median : 1.4469   Median :3.623   Median :65.00   Median : 0.3001  
##  Mean   : 1.3500   Mean   :3.653   Mean   :63.87   Mean   : 0.1004  
##  3rd Qu.: 2.1270   3rd Qu.:3.878   3rd Qu.:68.00   3rd Qu.: 1.5581  
##  Max.   : 3.8210   Max.   :6.108   Max.   :79.00   Max.   : 2.3263  
##       svi              lcp             gleason          pgg45       
##  Min.   :0.0000   Min.   :-1.3863   Min.   :6.000   Min.   :  0.00  
##  1st Qu.:0.0000   1st Qu.:-1.3863   1st Qu.:6.000   1st Qu.:  0.00  
##  Median :0.0000   Median :-0.7985   Median :7.000   Median : 15.00  
##  Mean   :0.2165   Mean   :-0.1794   Mean   :6.753   Mean   : 24.38  
##  3rd Qu.:0.0000   3rd Qu.: 1.1786   3rd Qu.:7.000   3rd Qu.: 40.00  
##  Max.   :1.0000   Max.   : 2.9042   Max.   :9.000   Max.   :100.00  
##       lpsa        
##  Min.   :-0.4308  
##  1st Qu.: 1.7317  
##  Median : 2.5915  
##  Mean   : 2.4784  
##  3rd Qu.: 3.0564  
##  Max.   : 5.5829

Check for Missing Values

# Check the structure and missing values:

str(prostate)  # Data types and structure
## 'data.frame':    97 obs. of  9 variables:
##  $ lcavol : num  -0.58 -0.994 -0.511 -1.204 0.751 ...
##  $ lweight: num  2.77 3.32 2.69 3.28 3.43 ...
##  $ age    : int  50 58 74 58 62 50 64 58 47 63 ...
##  $ lbph   : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
##  $ svi    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ lcp    : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
##  $ gleason: int  6 6 7 6 6 6 6 6 6 6 ...
##  $ pgg45  : int  0 0 20 0 0 0 0 0 0 0 ...
##  $ lpsa   : num  -0.431 -0.163 -0.163 -0.163 0.372 ...
sapply(prostate, function(x) sum(is.na(x)))  # Count missing values
##  lcavol lweight     age    lbph     svi     lcp gleason   pgg45    lpsa 
##       0       0       0       0       0       0       0       0       0

Compute Correlation Matrix

cor(prostate)  # Correlation matrix
##             lcavol      lweight       age        lbph         svi         lcp
## lcavol  1.00000000  0.194128387 0.2249999  0.02734971  0.53884500  0.67531058
## lweight 0.19412839  1.000000000 0.3075247  0.43493174  0.10877818  0.10023889
## age     0.22499988  0.307524741 1.0000000  0.35018592  0.11765804  0.12766778
## lbph    0.02734971  0.434931744 0.3501859  1.00000000 -0.08584327 -0.00699944
## svi     0.53884500  0.108778185 0.1176580 -0.08584327  1.00000000  0.67311122
## lcp     0.67531058  0.100238891 0.1276678 -0.00699944  0.67311122  1.00000000
## gleason 0.43241705 -0.001283003 0.2688916  0.07782044  0.32041222  0.51482991
## pgg45   0.43365224  0.050846195 0.2761124  0.07846000  0.45764762  0.63152807
## lpsa    0.73446028  0.354121818 0.1695929  0.17980950  0.56621818  0.54881316
##              gleason     pgg45      lpsa
## lcavol   0.432417052 0.4336522 0.7344603
## lweight -0.001283003 0.0508462 0.3541218
## age      0.268891599 0.2761124 0.1695929
## lbph     0.077820444 0.0784600 0.1798095
## svi      0.320412221 0.4576476 0.5662182
## lcp      0.514829912 0.6315281 0.5488132
## gleason  1.000000000 0.7519045 0.3689867
## pgg45    0.751904512 1.0000000 0.4223157
## lpsa     0.368986693 0.4223157 1.0000000

This helps identify relationships between different variables in the dataset.
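
To rank predictors by their association with the response, the lpsa column of the matrix above can be sorted (a quick sketch):

# Correlations with lpsa, strongest first: lcavol (0.73) leads
sort(cor(prostate)[, "lpsa"], decreasing = TRUE)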

Visualizing the Data

Histogram of Log PSA Levels

hist(prostate$lpsa, breaks = 15, col = "lightblue",
     main = "Distribution of Log PSA Levels",
     xlab = "Log PSA Level")

Boxplot of Log Cancer Volume (lcavol)

boxplot(prostate$lcavol, col = "lightgreen",
        main = "Boxplot of Log Cancer Volume",
        ylab = "Log Cancer Volume")

Scatter Plot: Log Cancer Volume vs PSA

plot(prostate$lcavol, prostate$lpsa, col = "red", pch = 19,
     main = "Log Cancer Volume vs Log PSA",
     xlab = "Log Cancer Volume", ylab = "Log PSA Level")
abline(lm(lpsa ~ lcavol, data = prostate), col = "blue")  # Add regression line

Pairwise Scatter Plot for Key Variables

pairs(prostate[, c("lcavol", "lweight", "age", "lbph", "lpsa")], 
      col = "darkgreen", pch = 19)

Interpretation of Results: The summary statistics describe the distribution of log PSA and the other clinical variables. The histogram shows the distribution of log PSA levels, which are commonly used in place of raw PSA values. The boxplot visualizes the distribution of log cancer volume. The scatter plot examines the relationship between log cancer volume and log PSA, which the correlation matrix identifies as the strongest association with lpsa (r = 0.73). The pairwise scatter plots help explore relationships among several clinical variables at once.
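
Since svi is a binary variable, a grouped boxplot is a natural additional view; a sketch, assuming the 0/1 coding indicates absence/presence of seminal vesicle invasion:

# Log PSA by seminal vesicle invasion status (svi: 0 = absent, 1 = present)
boxplot(lpsa ~ svi, data = prostate, col = c("lightblue", "salmon"),
        names = c("No invasion", "Invasion"),
        main = "Log PSA by Seminal Vesicle Invasion", ylab = "Log PSA Level")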

Exercise 1.4:

The dataset sat concerns school expenditure and SAT test performance in the 50 U.S. states. Make a numerical and graphical summary of the data as in the first question.

Load and Inspect the Data

# Load dataset
data(sat)

# View first few rows
head(sat)
##            expend ratio salary takers verbal math total
## Alabama     4.405  17.2 31.144      8    491  538  1029
## Alaska      8.963  17.6 47.951     47    445  489   934
## Arizona     4.778  19.3 32.175     27    448  496   944
## Arkansas    4.459  17.1 28.934      6    482  523  1005
## California  4.992  24.0 41.078     45    417  485   902
## Colorado    5.443  18.4 34.571     29    462  518   980
# Check structure
str(sat)
## 'data.frame':    50 obs. of  7 variables:
##  $ expend: num  4.41 8.96 4.78 4.46 4.99 ...
##  $ ratio : num  17.2 17.6 19.3 17.1 24 18.4 14.4 16.6 19.1 16.3 ...
##  $ salary: num  31.1 48 32.2 28.9 41.1 ...
##  $ takers: int  8 47 27 6 45 29 81 68 48 65 ...
##  $ verbal: int  491 445 448 482 417 462 431 429 420 406 ...
##  $ math  : int  538 489 496 523 485 518 477 468 469 448 ...
##  $ total : int  1029 934 944 1005 902 980 908 897 889 854 ...
# Summary statistics
summary(sat)
##      expend          ratio           salary          takers     
##  Min.   :3.656   Min.   :13.80   Min.   :25.99   Min.   : 4.00  
##  1st Qu.:4.882   1st Qu.:15.22   1st Qu.:30.98   1st Qu.: 9.00  
##  Median :5.768   Median :16.60   Median :33.29   Median :28.00  
##  Mean   :5.905   Mean   :16.86   Mean   :34.83   Mean   :35.24  
##  3rd Qu.:6.434   3rd Qu.:17.57   3rd Qu.:38.55   3rd Qu.:63.00  
##  Max.   :9.774   Max.   :24.30   Max.   :50.05   Max.   :81.00  
##      verbal           math           total       
##  Min.   :401.0   Min.   :443.0   Min.   : 844.0  
##  1st Qu.:427.2   1st Qu.:474.8   1st Qu.: 897.2  
##  Median :448.0   Median :497.5   Median : 945.5  
##  Mean   :457.1   Mean   :508.8   Mean   : 965.9  
##  3rd Qu.:490.2   3rd Qu.:539.5   3rd Qu.:1032.0  
##  Max.   :516.0   Max.   :592.0   Max.   :1107.0

This provides an overview of the dataset, including variable types and summary statistics like min, max, mean, and quartiles.

Check for Missing Values

colSums(is.na(sat))  # Count missing values per column
## expend  ratio salary takers verbal   math  total 
##      0      0      0      0      0      0      0

If missing values exist, handle them using na.omit(sat) or imputation methods.
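
For completeness, a minimal sketch of both options (not needed here, since nothing is missing):

sat_complete <- na.omit(sat)  # drop any rows containing NAs
# or impute a numeric column with its mean, e.g. for expend:
# sat$expend[is.na(sat$expend)] <- mean(sat$expend, na.rm = TRUE)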

Compute Correlation Matrix

cor(sat, use = "complete.obs")  # Correlation between numeric variables
##            expend        ratio       salary     takers      verbal        math
## expend  1.0000000 -0.371025386  0.869801513  0.5926274 -0.41004987 -0.34941409
## ratio  -0.3710254  1.000000000 -0.001146081 -0.2130536  0.06376664  0.09542173
## salary  0.8698015 -0.001146081  1.000000000  0.6167799 -0.47696364 -0.40131282
## takers  0.5926274 -0.213053607  0.616779867  1.0000000 -0.89326296 -0.86938393
## verbal -0.4100499  0.063766636 -0.476963635 -0.8932630  1.00000000  0.97025604
## math   -0.3494141  0.095421730 -0.401312817 -0.8693839  0.97025604  1.00000000
## total  -0.3805370  0.081253823 -0.439883381 -0.8871187  0.99150325  0.99350238
##              total
## expend -0.38053700
## ratio   0.08125382
## salary -0.43988338
## takers -0.88711868
## verbal  0.99150325
## math    0.99350238
## total   1.00000000

This helps identify relationships between variables.
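
One relationship worth flagging: takers (the percentage of eligible students taking the SAT) has by far the strongest association with total (r = -0.89 above). A quick sketch:

# States where more students take the SAT average lower total scores
plot(sat$takers, sat$total, col = "darkblue", pch = 19,
     main = "SAT Scores vs Percentage of Takers",
     xlab = "Percent of Students Taking the SAT", ylab = "Total SAT Score")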

Visualizing the Data

Histogram of SAT Scores

hist(sat$total, breaks = 15, col = "lightblue", 
     main = "Distribution of SAT Scores", xlab = "Total SAT Score")

Boxplot of SAT Scores by Expenditure Group

# expend is continuous, so bin it into groups first;
# boxplot(total ~ expend) would draw one box per raw value
boxplot(total ~ cut(expend, breaks = 3, labels = c("Low", "Medium", "High")),
        data = sat, col = "lightgreen",
        main = "SAT Scores by Expenditure Group",
        xlab = "Expenditure Group", ylab = "Total SAT Score")

Scatter Plot: SAT Scores vs Expenditure

plot(sat$expend, sat$total, col = "blue", pch = 19,
     main = "SAT Scores vs School Expenditure", xlab = "Expenditure per Student", ylab = "Total SAT Score")
abline(lm(sat$total ~ sat$expend), col = "red")  # Add regression line

Interpretation of Results:

Summary statistics reveal the spread and central tendency of SAT scores and expenditures. The correlation matrix relates school expenditure to SAT performance; notably, the raw correlation between expend and total is negative (-0.38). Histograms and boxplots show the distributions and possible outliers. The scatter plot visualizes the trend, but the negative slope is likely confounded by takers: states where more students take the SAT average lower scores (r = -0.89). A model sketch addressing this follows.
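
A sketch of such an adjustment, fitting expenditure alongside takers (an illustrative model choice, not a definitive analysis):

# Does expenditure still look harmful once takers is controlled for?
summary(lm(total ~ expend + takers, data = sat))

With takers in the model, the expenditure coefficient turns positive, a textbook example of how a confounder can reverse a raw correlation.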

Exercise 1.5:

The dataset divusa contains annual data on the United States divorce rate from 1920 to 1996, along with social and economic covariates. Make a numerical and graphical summary of the data as in the first question.

Load and Inspect the Data

# Load necessary package
library(faraway)

# Load dataset
data(divusa)

# View first few rows
head(divusa)
##   year divorce unemployed femlab marriage birth military
## 1 1920     8.0        5.2  22.70     92.0 117.9   3.2247
## 2 1921     7.2       11.7  22.79     83.0 119.8   3.5614
## 3 1922     6.6        6.7  22.88     79.7 111.2   2.4553
## 4 1923     7.1        2.4  22.97     85.2 110.5   2.2065
## 5 1924     7.2        5.0  23.06     80.3 110.9   2.2889
## 6 1925     7.2        3.2  23.15     79.2 106.6   2.1735
# Check structure of dataset
str(divusa)
## 'data.frame':    77 obs. of  7 variables:
##  $ year      : int  1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 ...
##  $ divorce   : num  8 7.2 6.6 7.1 7.2 7.2 7.5 7.8 7.8 8 ...
##  $ unemployed: num  5.2 11.7 6.7 2.4 5 3.2 1.8 3.3 4.2 3.2 ...
##  $ femlab    : num  22.7 22.8 22.9 23 23.1 ...
##  $ marriage  : num  92 83 79.7 85.2 80.3 79.2 78.7 77 74.1 75.5 ...
##  $ birth     : num  118 120 111 110 111 ...
##  $ military  : num  3.22 3.56 2.46 2.21 2.29 ...
# Summary statistics
summary(divusa)
##       year         divorce        unemployed         femlab     
##  Min.   :1920   Min.   : 6.10   Min.   : 1.200   Min.   :22.70  
##  1st Qu.:1939   1st Qu.: 8.70   1st Qu.: 4.200   1st Qu.:27.47  
##  Median :1958   Median :10.60   Median : 5.600   Median :37.10  
##  Mean   :1958   Mean   :13.27   Mean   : 7.173   Mean   :38.58  
##  3rd Qu.:1977   3rd Qu.:20.30   3rd Qu.: 7.500   3rd Qu.:47.80  
##  Max.   :1996   Max.   :22.80   Max.   :24.900   Max.   :59.30  
##     marriage          birth           military     
##  Min.   : 49.70   Min.   : 65.30   Min.   : 1.940  
##  1st Qu.: 61.90   1st Qu.: 68.90   1st Qu.: 3.469  
##  Median : 74.10   Median : 85.90   Median : 9.102  
##  Mean   : 72.97   Mean   : 88.89   Mean   :12.365  
##  3rd Qu.: 80.00   3rd Qu.:107.30   3rd Qu.:14.266  
##  Max.   :118.10   Max.   :122.90   Max.   :86.641

This provides an overview of the dataset, including variable types and basic statistics like min, max, mean, and quartiles.

Check for Missing Values

colSums(is.na(divusa))  # Count missing values per column
##       year    divorce unemployed     femlab   marriage      birth   military 
##          0          0          0          0          0          0          0

If missing values exist, handle them using na.omit(divusa) or imputation techniques, as sketched in Exercise 1.4.

Compute Correlation Matrix

cor(divusa, use = "complete.obs")  # Correlation between numeric variables
##                    year     divorce unemployed      femlab   marriage
## year        1.000000000  0.87923753 -0.2344792  0.98598207 -0.6173255
## divorce     0.879237534  1.00000000 -0.2106019  0.91039698 -0.5342554
## unemployed -0.234479195 -0.21060188  1.0000000 -0.25746176 -0.2707630
## femlab      0.985982068  0.91039698 -0.2574618  1.00000000 -0.6486273
## marriage   -0.617325533 -0.53425537 -0.2707630 -0.64862728  1.0000000
## birth      -0.576313991 -0.72192425 -0.3138890 -0.60409490  0.6737273
## military    0.007267171  0.01857483 -0.4002930  0.05126339  0.2581983
##                 birth     military
## year       -0.5763140  0.007267171
## divorce    -0.7219242  0.018574832
## unemployed -0.3138890 -0.400292954
## femlab     -0.6040949  0.051263390
## marriage    0.6737273  0.258198260
## birth       1.0000000  0.140898643
## military    0.1408986  1.000000000

This helps identify relationships between variables.
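
The matrix shows that femlab (female labor-force participation) has the strongest correlation with divorce (r = 0.91). A quick sketch of that relationship:

plot(divusa$femlab, divusa$divorce, col = "darkgreen", pch = 19,
     main = "Divorce Rate vs Female Labor Force Participation",
     xlab = "Female Labor Force Participation (%)", ylab = "Divorce Rate")
abline(lm(divorce ~ femlab, data = divusa), col = "red", lwd = 2)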

Visualizing the Data

Histogram of Divorce Rate

hist(divusa$divorce, breaks = 15, col = "lightblue",
     main = "Distribution of Divorce Rates in the U.S.",
     xlab = "Divorce Rate")

Line Plot of Divorce Rate Over Time

plot(divusa$year, divusa$divorce, type = "l", col = "blue", lwd = 2,
     main = "Divorce Rate in the U.S. (1920-1996)",
     xlab = "Year", ylab = "Divorce Rate")

Scatter Plot: Divorce Rate vs Unemployment

plot(divusa$unemployed, divusa$divorce, col = "red", pch = 19,
     main = "Divorce Rate vs Unemployment Rate",
     xlab = "Unemployment Rate", ylab = "Divorce Rate")
abline(lm(divusa$divorce ~ divusa$unemployed), col = "blue")  # Add regression line

Interpretation of Results: Summary statistics show how widely the divorce rate varied over the period. The histogram shows the distribution of divorce rates. The line plot highlights trends from 1920 to 1996, including the sharp post-World War II spike. The scatter plot shows that the relationship between unemployment and divorce is weak (r = -0.21), whereas femlab has by far the strongest association with divorce (r = 0.91).
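
Because the marriage series shares the same time axis, the two rates can be overlaid; a sketch (the two rates use different denominators in divusa, so only the shapes are comparable):

# Divorce and marriage rates over time on one set of axes
matplot(divusa$year, divusa[, c("divorce", "marriage")], type = "l",
        lty = 1, lwd = 2, col = c("blue", "darkgreen"),
        main = "Divorce and Marriage Rates, 1920-1996",
        xlab = "Year", ylab = "Rate")
legend("topleft", legend = c("Divorce", "Marriage"),
       col = c("blue", "darkgreen"), lty = 1, lwd = 2)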

Discussion Question:

“In both the teengamb and prostate datasets, we analyzed numerical summaries and visualized relationships between variables. When working with real-world data, what are some potential biases or limitations that might affect the interpretation of these datasets? How can we address these issues to ensure more reliable insights?”

Answer: When working with real-world datasets like teengamb (teenage gambling behavior) and prostate (prostate cancer patients), there are several potential biases and limitations that could affect our interpretation:

  1. Sample bias. teengamb: if the data were collected from a specific region of Britain or a particular socioeconomic group, the results might not generalize to all teenagers. prostate: the study includes only men already scheduled for prostatectomy, excluding those who chose other treatments or were undiagnosed, which introduces selection bias.
  2. Measurement bias. Self-reported gambling behavior (in teengamb) may be under- or over-reported, affecting the accuracy of the analysis. PSA levels in prostate can be influenced by factors other than cancer progression, such as infections or medications, potentially leading to misclassification.
  3. Confounding variables. Both datasets may omit factors that influence the results. For teengamb, parental influence, peer pressure, or cultural factors might shape gambling habits but are not recorded. For prostate, lifestyle factors such as diet and exercise, or genetic predisposition, could affect PSA levels and tumor growth.
  4. Small sample size and outliers. With 97 observations in prostate and only 47 in teengamb, findings may not be statistically robust, and extreme outliers (e.g., exceptionally high gambling expenditure or PSA levels) can skew averages and correlations.

How to Address These Issues?

Improve data collection: use random sampling to obtain a diverse, representative dataset, and combine self-reported data with objective measures where possible.

Use statistical adjustments: apply normalization, outlier handling, or transformations (e.g., a log transformation of skewed PSA or expenditure values), and fit regressions that control for confounders to isolate the effect of key variables; a small outlier-screening sketch follows this section.

Validate findings with external data: compare insights from the dataset with other studies to check for consistency, and use cross-validation to test the robustness of predictive models.

Conclusion: Recognizing and addressing biases in data analysis is crucial to making reliable interpretations. By improving study design, using appropriate statistical techniques, and validating results with external data, we can reduce errors and derive more meaningful insights from datasets like teengamb and prostate.
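
As one concrete example of the outlier screening mentioned above, a sketch using teengamb and the standard 1.5 x IQR rule of thumb (a flagging rule, not a removal recommendation):

# Inspect the most extreme gambling expenditures
head(teengamb[order(-teengamb$gamble), ], 3)
# Flag values beyond the upper 1.5 * IQR fence
q <- quantile(teengamb$gamble, c(0.25, 0.75))
teengamb$gamble[teengamb$gamble > q[2] + 1.5 * diff(q)]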

Answer to Another Student’s Question:

Answer: Regression analysis isn’t about finding a single “true” model; it’s about building a model that’s useful, interpretable, and fits the data well. There are often multiple ways to reach a reasonable and statistically significant answer, so the key is knowing when to choose one model over another and when to stop refining.

1. Choosing One Model Over Another

A. Model performance matters. We typically compare models using metrics like:
• Adjusted R²: how well the model explains the data while penalizing extra variables.
• AIC/BIC (Akaike/Bayesian Information Criterion): balances model fit against complexity (lower is better).
• Cross-validation: tests how well the model generalizes to new data.

B. Keep it simple (but not too simple).
• A model with fewer variables is often better if it explains the data well; this avoids unnecessary complexity.
• Overfitting is a big risk when we make a model too perfect for our dataset; it may not work well on new data.

C. Practicality and interpretability.
• If two models perform similarly, the simpler one, or the one that makes more sense in the real world, is usually the better choice.
• A business or research problem often requires a model that aligns with domain knowledge rather than just statistical fit.

2. When Is a Model “Good Enough”?

A model is good enough when it:
• Answers the research question: does it provide useful insights?
• Follows key assumptions: is it reasonable, without violating key regression rules (e.g., no severe multicollinearity, roughly normal errors)?
• Generalizes well: does it work on new data without being too rigid or too flexible?
• Balances accuracy and simplicity: does it make reliable predictions without overcomplicating things?

Instead of chasing a “perfect” model, focus on one that provides meaningful insights, works well for decision-making, and remains stable across different datasets. The sketch below compares two candidate models in this spirit.
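
A hedged sketch comparing two candidate models on teengamb by adjusted R² and AIC (the model choices are illustrative, not recommendations):

m1 <- lm(gamble ~ income, data = teengamb)
m2 <- lm(gamble ~ income + sex + verbal, data = teengamb)
c(adjR2_m1 = summary(m1)$adj.r.squared,
  adjR2_m2 = summary(m2)$adj.r.squared)
AIC(m1, m2)  # lower AIC favors the better fit-complexity trade-off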