stats-lab-1.knit

QUESTION 9

install.packages("ISLR2", repos = "https://cloud.r-project.org/")

library(ISLR2)             
data(Auto)  # Load the dataset
str(Auto)  # Check the structure

## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
##   ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...

head(Auto)  # View the first few rows

##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

9a:

str(Auto)

## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
##   ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...

The numerical variables are quantitative:
- mpg (num)
- cylinders (int)
- displacement (num)
- horsepower (int)
- weight (int)
- acceleration (num)
- year (int)

The categorical variables are qualitative:
- origin (int): stored as the integer codes 1 to 3, but it identifies the region of manufacture rather than measuring a quantity
- name (Factor w/ 304 levels)
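Because origin is stored as an integer, R treats it as a number unless it is converted. Below is a minimal sketch of recoding it as a factor. The data frame `toy` is made up for illustration, and the labels assume the coding given in the ISLR2 documentation (1 = American, 2 = European, 3 = Japanese).

```r
# Toy stand-in for Auto; values are made up for illustration only.
toy <- data.frame(
  mpg    = c(18, 26, 31),
  origin = c(1L, 2L, 3L)
)

# Recode the integer codes as a labelled factor so R treats origin as qualitative.
toy$origin <- factor(toy$origin,
                     levels = c(1, 2, 3),
                     labels = c("American", "European", "Japanese"))

# After the conversion, is.numeric() no longer selects origin.
quantitative <- names(toy)[sapply(toy, is.numeric)]
print(quantitative)
```

With this conversion applied to Auto, a filter based on is.numeric would leave origin out of the range, mean and standard deviation tables.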

9b:

sapply(Auto[, sapply(Auto, is.numeric)], range)

##       mpg cylinders displacement horsepower weight acceleration year origin
## [1,]  9.0         3           68         46   1613          8.0   70      1
## [2,] 46.6         8          455        230   5140         24.8   82      3

Here we can see the range, i.e., the minimum and maximum values, of each numeric variable. origin appears only because it is stored as an integer; it is qualitative, so its range is not meaningful.

9c:

sapply(Auto[, sapply(Auto, is.numeric)], mean)

##          mpg    cylinders displacement   horsepower       weight acceleration 
##    23.445918     5.471939   194.411990   104.469388  2977.584184    15.541327 
##         year       origin 
##    75.979592     1.576531

sapply(Auto[, sapply(Auto, is.numeric)], sd)

##          mpg    cylinders displacement   horsepower       weight acceleration 
##    7.8050075    1.7057832  104.6440039   38.4911599  849.4025600    2.7588641 
##         year       origin 
##    3.6837365    0.8055182

Here we can see the mean and standard deviation of each numeric variable. The values for origin are not meaningful because it is a categorical variable coded as an integer.
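The range, mean and standard deviation can also be collected into a single table instead of three separate calls. This is only a sketch: `summarise_numeric` is a hypothetical helper, and the built-in mtcars data set stands in for Auto so that the code runs without ISLR2.

```r
# Summarise every numeric column of a data frame in one table.
# `summarise_numeric` is a hypothetical helper, not part of any package.
summarise_numeric <- function(df) {
  num <- df[, sapply(df, is.numeric), drop = FALSE]
  t(vapply(num,
           function(x) c(min = min(x), max = max(x), mean = mean(x), sd = sd(x)),
           numeric(4)))
}

# mtcars ships with R and is used here only as a stand-in for Auto.
stats <- summarise_numeric(mtcars[, c("mpg", "hp", "wt")])
print(round(stats, 2))
```

Each row of the result is one variable and each column is one statistic, which makes it easier to compare variables side by side.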

9d:

Create a subset without rows 10 to 85:

Auto_subset <- Auto[-(10:85), ]

Calculate the range, mean, and standard deviation of the subset:

sapply(Auto_subset[, sapply(Auto_subset, is.numeric)], range)  # Range

##       mpg cylinders displacement horsepower weight acceleration year origin
## [1,] 11.0         3           68         46   1649          8.5   70      1
## [2,] 46.6         8          455        230   4997         24.8   82      3

sapply(Auto_subset[, sapply(Auto_subset, is.numeric)], mean)   # Mean

##          mpg    cylinders displacement   horsepower       weight acceleration 
##    24.404430     5.373418   187.240506   100.721519  2935.971519    15.726899 
##         year       origin 
##    77.145570     1.601266

sapply(Auto_subset[, sapply(Auto_subset, is.numeric)], sd)     # Standard deviation

##          mpg    cylinders displacement   horsepower       weight acceleration 
##     7.867283     1.654179    99.678367    35.708853   811.300208     2.693721 
##         year       origin 
##     3.106217     0.819910
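Negative indexing drops the listed rows, so removing rows 10 to 85 (76 rows) from 392 observations should leave 316. The sketch below shows the same kind of check, and a before/after comparison of a mean, on a small made-up data frame (`toy` and its values are hypothetical).

```r
# Toy data frame with 20 rows; values are generated for illustration only.
toy <- data.frame(id = 1:20, value = seq(10, 200, by = 10))

# Drop rows 5 through 8 with a negative index.
toy_subset <- toy[-(5:8), ]

# The number of rows removed equals the length of the dropped range.
rows_removed <- nrow(toy) - nrow(toy_subset)
print(rows_removed)

# Compare the mean before and after removing the rows.
comparison <- c(full = mean(toy$value), subset = mean(toy_subset$value))
print(comparison)
```

For the lab data, the equivalent check would be nrow(Auto) - nrow(Auto_subset).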

9e:

Scatterplot: MPG vs Weight

install.packages("ggplot2", repos = "https://cloud.r-project.org/")

library(ggplot2)             
ggplot(Auto, aes(x = weight, y = mpg)) + 
  geom_point() + 
  geom_smooth(method = "lm", col = "red") +
  ggtitle("MPG vs Weight") +
  xlab("Weight") +
  ylab("Miles Per Gallon")

## `geom_smooth()` using formula = 'y ~ x'

comment: The scatterplot shows a clear downward trend. As weight increases, miles per gallon (MPG) decreases. This suggests that heavier cars consume more fuel per mile. The red regression line represents a linear relationship between weight and MPG.
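The downward trend can be backed up with numbers: a correlation coefficient and the slope of the fitted line. The sketch below uses the built-in mtcars data set as a stand-in so that it runs without ISLR2; for this lab, Auto$weight and Auto$mpg would take the place of mtcars$wt and mtcars$mpg.

```r
# mtcars ships with R; wt is weight in 1000 lbs and mpg is miles per gallon.
# It stands in for Auto here so the example runs without ISLR2.
r <- cor(mtcars$wt, mtcars$mpg)

# Fit the same kind of straight line that geom_smooth(method = "lm") draws.
fit <- lm(mpg ~ wt, data = mtcars)
slope <- unname(coef(fit)["wt"])

cat("correlation:", round(r, 3), "\n")
cat("slope (mpg per 1000 lbs):", round(slope, 3), "\n")
```

A negative correlation and a negative slope both correspond to the downward trend seen in the plot.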

Scatterplot: MPG vs Horsepower

ggplot(Auto, aes(x = horsepower, y = mpg)) + 
  geom_point() + 
  geom_smooth(method = "lm", col = "purple") +
  ggtitle("MPG vs Horsepower") +
  xlab("Horsepower") +
  ylab("Miles Per Gallon")

## `geom_smooth()` using formula = 'y ~ x'

comment: The scatterplot shows a clear downward trend, indicating that as horsepower increases, miles per gallon decreases. This suggests that cars with more powerful engines consume more fuel, making them less fuel-efficient.

Boxplots (Categorical vs Numerical Variables)

ggplot(Auto, aes(x = as.factor(cylinders), y = mpg, fill = as.factor(cylinders))) +
  geom_boxplot() +
  ggtitle("MPG by Number of Cylinders") +
  xlab("Cylinders") +
  ylab("Miles Per Gallon") +
  theme_minimal()

comment: Cars with fewer cylinders tend to have higher MPG. Cars with 8 cylinders are the least fuel efficient.

9f:

Based on the scatterplots and the boxplot of mpg against the other variables, several variables appear useful for predicting mpg:
- Weight: strong negative correlation. Heavier cars require more fuel to move, which reduces fuel efficiency. The relationship looks roughly linear, making weight an important variable to include in any regression model for predicting MPG.
- Horsepower: negative correlation. More powerful engines are less fuel-efficient. Although the relationship looks roughly linear, it may be worth exploring non-linear models to capture any curvature in the relationship between horsepower and MPG.
- Year: positive correlation. Newer cars tend to have higher MPG, so the year of manufacture is an important variable for predicting MPG.
- Cylinders: cars with fewer cylinders are more fuel-efficient.
- Displacement: big engines are typically less fuel-efficient.
- Origin: region of manufacture impacts fuel efficiency.
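One way to explore a non-linear relationship with horsepower is to compare a straight-line fit with a fit that adds a squared term. This is a sketch under assumptions, not part of the original analysis: the built-in mtcars data set stands in for Auto, and its hp column plays the role of horsepower.

```r
# mtcars (built into R) stands in for Auto; hp is gross horsepower.
fit_linear    <- lm(mpg ~ hp, data = mtcars)
fit_quadratic <- lm(mpg ~ hp + I(hp^2), data = mtcars)

# R-squared never decreases when a term is added, so also compare adjusted R-squared.
r2 <- c(linear    = summary(fit_linear)$r.squared,
        quadratic = summary(fit_quadratic)$r.squared)
adj_r2 <- c(linear    = summary(fit_linear)$adj.r.squared,
            quadratic = summary(fit_quadratic)$adj.r.squared)
print(round(rbind(r2, adj_r2), 3))

# F-test for whether the squared term improves the fit.
print(anova(fit_linear, fit_quadratic))
```

If the adjusted R-squared rises noticeably and the F-test gives a small p-value, the curved model describes the data better than the straight line.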

QUESTION 10

10a:

# Load the Boston data set
data("Boston")

# Check the structure of the Boston dataset
str(Boston)

## 'data.frame':    506 obs. of  13 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

nrow(Boston)

## [1] 506

ncol(Boston)

## [1] 13

The data set has 506 rows and 13 columns. Each row represents a census tract in the Boston area.
Each column is a variable measured for each census tract, such as the per capita crime rate or the property tax rate.

10b:

library(ISLR2)

# Load the Boston data set
data("Boston")

# Boston has only 13 columns, so Boston[, -14] drops nothing; plot all columns directly
pairs(Boston, main = "Pairwise Scatterplots of Boston Housing Data")

Findings:
- Crime Rate vs. Property Tax: There is a positive relationship between crime rate and property tax rate (correlation about 0.58). Higher crime areas tend to have higher tax rates, possibly to fund community policing or other public services.
- Crime Rate vs. Number of Rooms: A weak negative relationship is observed (about -0.22). Areas with higher crime rates tend to have somewhat fewer rooms per dwelling, possibly due to economic factors affecting the region.
- Average Rooms per Dwelling vs. Property Tax: The relationship is weakly negative (about -0.29), not positive as one might expect. Tracts with larger homes tend to have somewhat lower property tax rates.
- Distance to Employment Centers vs. House Price: A weak positive relationship is observed (about 0.25). Tracts farther from the employment centers tend to have somewhat higher median home values.

10c:

# Correlation matrix of all 13 variables; the crim column gives each
# variable's correlation with the per capita crime rate.
# (Boston[, -14] dropped nothing, so medv was never removed.)
cor(Boston)

##                crim          zn       indus         chas         nox
## crim     1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn      -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus    0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas    -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox      0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm      -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age      0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis     -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad      0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax      0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## lstat    0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv    -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
##                  rm         age         dis          rad         tax    ptratio
## crim    -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431  0.2899456
## zn       0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus   -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018  0.3832476
## chas     0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox     -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320  0.1889327
## rm       1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783 -0.3555015
## age     -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559  0.2615150
## dis      0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad     -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819  0.4647412
## tax     -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000  0.4608530
## ptratio -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304  1.0000000
## lstat   -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341  0.3740443
## medv     0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593 -0.5077867
##              lstat       medv
## crim     0.4556215 -0.3883046
## zn      -0.4129946  0.3604453
## indus    0.6037997 -0.4837252
## chas    -0.0539293  0.1752602
## nox      0.5908789 -0.4273208
## rm      -0.6138083  0.6953599
## age      0.6023385 -0.3769546
## dis     -0.4969958  0.2499287
## rad      0.4886763 -0.3816262
## tax      0.5439934 -0.4685359
## ptratio  0.3740443 -0.5077867
## lstat    1.0000000 -0.7376627
## medv    -0.7376627  1.0000000

# Scatterplot of crime rate against one predictor (property tax rate)
plot(Boston$crim, Boston$tax, main = "Crime Rate vs Tax Rate", xlab = "Crime Rate", ylab = "Tax Rate")

Yes, several predictors are associated with the per capita crime rate (CRIM).

Positive:
- INDUS (0.41): Areas with higher crime rates tend to have a higher proportion of industrial land.
- NOX (0.42): Higher crime rates are associated with higher nitrogen oxide concentrations, i.e., poorer air quality.
- RAD (0.63): Areas with greater accessibility to radial highways tend to have higher crime rates. This is the strongest association with crime rate in the matrix.
- TAX (0.58): Higher crime rates are associated with higher property tax rates, possibly due to increased funding needed for public services.
- LSTAT (0.46): Crime rates are higher in areas with a larger percentage of lower-income residents.

Negative:
- MEDV (-0.39): Higher crime rates are associated with lower median house prices, suggesting that areas with higher crime rates are less desirable and have lower property values.
- DIS (-0.38): Crime rates are higher in tracts closer to the employment centers.
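Reading one column out of a 13 x 13 matrix is error-prone, so it can help to extract just the correlations with one target variable and sort them. A minimal sketch, where `cor_with` is a hypothetical helper and the built-in mtcars data set stands in for Boston:

```r
# `cor_with` is a hypothetical helper: correlations of every other numeric
# column with one target column, sorted from most negative to most positive.
cor_with <- function(df, target) {
  r <- cor(df)[, target]
  sort(r[names(r) != target])
}

# mtcars stands in for Boston; for the lab this would be cor_with(Boston, "crim").
r_mpg <- cor_with(mtcars, "mpg")
print(round(r_mpg, 2))
```

The strongest negative associations appear at the top of the output and the strongest positive ones at the bottom.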

10d:

For this, we can summarize the range of each predictor.

# Summary statistics of each predictor
summary(Boston)

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          lstat      
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   : 1.73  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.: 6.95  
##  Median : 5.000   Median :330.0   Median :19.05   Median :11.36  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :12.65  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:16.95  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :37.97  
##       medv      
##  Min.   : 5.00  
##  1st Qu.:17.02  
##  Median :21.20  
##  Mean   :22.53  
##  3rd Qu.:25.00  
##  Max.   :50.00

comment:
- Crime Rate: Range: Min: 0.006, Max: 88.98. A few census tracts have very high crime rates, with a maximum of 88.98. Most tracts have much lower crime rates: the median is about 0.26, and the mean (3.61) is pulled up by the extreme values.
- Tax Rate: Range: Min: 187, Max: 711. The median is 330 and the mean is 408. The third quartile (666) is close to the maximum (711), so a sizeable group of tracts has very high property tax rates.
- Pupil-Teacher Ratio: Range: Min: 12.6, Max: 22. The median is 19.05, so the maximum of 22 is relatively high but not extreme. This predictor has a much narrower spread than crime rate or tax rate.
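To put a number on "particularly high", one common convention is to count the values above the upper boxplot fence, Q3 + 1.5 * IQR. This is only one possible rule, and the sketch below uses made-up values; `count_high` is a hypothetical helper.

```r
# `count_high` is a hypothetical helper: number of values above Q3 + 1.5 * IQR.
count_high <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  upper_fence <- q[2] + 1.5 * (q[2] - q[1])
  sum(x > upper_fence)
}

# Made-up values: mostly small, with two very large ones.
x <- c(0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.5, 0.6, 25, 90)
n_high <- count_high(x)
print(n_high)
```

For the lab data, the same helper could be applied to Boston$crim, Boston$tax and Boston$ptratio.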

10e:

# Count the number of census tracts bounding the Charles River
sum(Boston$chas == 1)

## [1] 35

There are 35 census tracts in this data set that bound the Charles River.

10f:

# Median pupil-teacher ratio
median(Boston$ptratio)

## [1] 19.05

The median pupil-teacher ratio is 19.05.

10g:

# Find the index of the census tract with the lowest 'medv'
lowest_medv_index <- which.min(Boston$medv)

# View the details of that census tract
Boston[lowest_medv_index, ]

##        crim zn indus chas   nox    rm age    dis rad tax ptratio lstat medv
## 399 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 30.59    5

The census tract with the lowest median value of owner-occupied homes (MEDV = 5) has the following values for the predictors:
- Crime rate (CRIM): 38.35
- Proportion of residential land zoned for large lots (ZN): 0
- Industrial proportion (INDUS): 18.1
- Charles River dummy (CHAS): 0
- Nitrogen oxides concentration (NOX): 0.693
- Average number of rooms (RM): 5.453
- Age of homes (AGE): 100
- Distance to employment centers (DIS): 1.49
- Radial highway accessibility (RAD): 24
- Property tax rate (TAX): 666
- Pupil-teacher ratio (PTRATIO): 20.2
- Percentage of lower-income population (LSTAT): 30.59
- Median value of owner-occupied homes (MEDV): 5

Comparison to Overall Ranges:

- Crime rate (CRIM): 38.35 (maximum: 88.98, minimum: 0.00632). This is very high, far above the third quartile of 3.68. A high crime rate suggests a less safe area, which can significantly lower property values.
- Proportion of residential land zoned for large lots (ZN): 0 (maximum: 100). A value of 0 indicates no land is zoned for large lots, which likely reflects a more urbanized area with smaller homes, less living space, and lower property values.
- Industrial proportion (INDUS): 18.1 (maximum: 27.74). This value is on the higher end (it equals the third quartile), indicating a high industrial presence. Areas with high industrial activity tend to have lower residential property values due to noise, pollution, and less aesthetic appeal.
- Charles River dummy (CHAS): 0 (maximum: 1). A value of 0 means this tract does not bound the Charles River, so it is likely less desirable than tracts along the river. Proximity to the river generally raises property values.

These factors are collectively consistent with the low property values in this tract:
- High crime rate
- High industrial presence
- Older homes (AGE = 100, the maximum)
- Higher pollution (NOX = 0.693, above the third quartile of 0.624)
- Smaller homes (RM = 5.453, below the first quartile of 5.886)
- Higher tax rates (TAX = 666, the third quartile)
- Not bounding the Charles River

The tract is, however, close to the employment centers (DIS = 1.49, near the minimum of 1.13), so distance from employment is not one of these factors.
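Two details are worth checking when picking out the lowest-value tract: which.min returns only the first position of the minimum, so ties are silently dropped, and a value can be placed within the overall distribution with an empirical CDF rather than by eye. A minimal sketch on made-up values (`medv_toy` is hypothetical):

```r
# Made-up home values with a tie for the minimum.
medv_toy <- c(24, 5, 17, 5, 50)

# which.min() reports only the first position of the minimum...
first_min <- which.min(medv_toy)

# ...while this comparison returns every position that ties for it.
all_min <- which(medv_toy == min(medv_toy))

# ecdf() gives the share of observations at or below a value,
# a compact way to place one tract within the overall distribution.
share_at_or_below <- ecdf(medv_toy)(17)

print(first_min)
print(all_min)
print(share_at_or_below)
```

For the lab data, which(Boston$medv == min(Boston$medv)) would show whether more than one tract shares the lowest value, and ecdf(Boston$crim)(38.35) would give the share of tracts with a crime rate at or below this tract's.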

10h:

sum(Boston$rm > 7)

## [1] 64

sum(Boston$rm > 8)

## [1] 13

There are 64 census tracts where the average number of rooms per dwelling is more than seven.
There are 13 census tracts where the average number of rooms per dwelling is more than eight.
comment: Census tracts that average more than eight rooms per dwelling contain larger homes, which are often located in more affluent areas. Larger homes often indicate suburban or residential areas with lower population density and more space per person.
Thus, the census tracts with more than eight rooms per dwelling are likely to be in wealthier neighborhoods.
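The claim about wealthier neighborhoods can be checked by comparing the tracts above the threshold with the rest, rather than inferring it from room counts alone. The sketch below does this on a small made-up data frame; `toy` and its values are hypothetical, with column names chosen to match Boston.

```r
# Made-up stand-in for Boston with the two columns needed here.
toy <- data.frame(rm   = c(5.9, 6.2, 7.4, 8.1, 8.3, 6.6),
                  medv = c(19, 22, 35, 48, 50, 24))

# Logical flag for tracts averaging more than eight rooms per dwelling.
big <- toy$rm > 8

# Summing a logical vector counts the TRUE values.
n_big <- sum(big)

# Median home value in each group (FALSE = eight rooms or fewer, TRUE = more than eight).
group_medians <- tapply(toy$medv, big, median)

print(n_big)
print(group_medians)
```

Running summary(Boston[Boston$rm > 8, ]) on the real data would give the same kind of comparison for every predictor at once.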