This question deals with the airquality dataset. Read in the data and answer the following questions:
DataSet <- airquality
Number of rows
nrow(DataSet)
## [1] 153
Number of columns
ncol(DataSet)
## [1] 6
Column labels:
colnames(DataSet)
## [1] "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day"
Number of missing values per column, in order from Ozone to Day:
sum(is.na(DataSet[,"Ozone"]))
## [1] 37
sum(is.na(DataSet[,"Solar.R"]))
## [1] 7
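The two calls above cover only Ozone and Solar.R. A minimal sketch, assuming the unmodified built-in `airquality` data frame, counts the missing values in all six columns at once. The name `na_counts` is introduced here only for illustration.

```r
# Count missing values in every column of the built-in airquality data.
# na_counts is an illustrative name, not part of the original analysis.
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Names of the columns that contain at least one missing value
names(na_counts)[na_counts > 0]
```

Only Ozone (37) and Solar.R (7) contain missing values; Wind, Temp, Month, and Day are complete.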
str(DataSet)
## 'data.frame': 153 obs. of 6 variables:
## $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
## $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
## $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
## $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
Variable types: Wind is numeric. The rest are integer.
DataSet$Day <- as.factor(DataSet$Day)
DataSet$Month <- as.factor(DataSet$Month)
boxplot(split(DataSet$Ozone, DataSet$Month), main = "Ozone by month")
boxplot(split(DataSet$Solar.R, DataSet$Month), main = "Solar.R by month")
boxplot(split(DataSet$Wind, DataSet$Month), main = "Wind by month")
boxplot(split(DataSet$Temp, DataSet$Month), main = "Temp by month")
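The boxplots compare the monthly distributions visually. As a numeric companion, the sketch below tabulates the median of each measurement per month, which is the centre line each boxplot draws. It assumes the built-in `airquality` data; `vars` and `monthly_medians` are illustrative names.

```r
# Median of each measurement per month, ignoring missing values.
# vars and monthly_medians are illustrative names.
vars <- c("Ozone", "Solar.R", "Wind", "Temp")
monthly_medians <- sapply(
  airquality[vars],
  function(x) tapply(x, airquality$Month, median, na.rm = TRUE)
)

# Rows are months (5 to 9), columns are the four measurements
print(monthly_medians)
```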
plot(DataSet[,c(1,2,3,4,5,6)])
library(ggpubr) # provides ggscatter()
ggscatter(DataSet, x = "Ozone", y = "Solar.R", add = "reg.line", conf.int = TRUE, cor.coef = TRUE, cor.method = "pearson", xlab = "Ozone", ylab = "Solar.R")
cor.test(DataSet$Ozone, DataSet$Solar.R)
##
## Pearson's product-moment correlation
##
## data: DataSet$Ozone and DataSet$Solar.R
## t = 3.8798, df = 109, p-value = 0.0001793
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.173194 0.502132
## sample estimates:
## cor
## 0.3483417
There is a weak positive correlation between Solar.R and Ozone (r = 0.35).
ggscatter(DataSet, x = "Ozone", y = "Wind", add = "reg.line", conf.int = TRUE, cor.coef = TRUE, cor.method = "pearson", xlab = "Ozone", ylab = "Wind")
cor.test(DataSet$Ozone, DataSet$Wind)
##
## Pearson's product-moment correlation
##
## data: DataSet$Ozone and DataSet$Wind
## t = -8.0401, df = 114, p-value = 9.272e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7063918 -0.4708713
## sample estimates:
## cor
## -0.6015465
There is a strong negative correlation between Wind and Ozone (r = -0.60).
ggscatter(DataSet, x = "Ozone", y = "Temp", add = "reg.line", conf.int = TRUE, cor.coef = TRUE, cor.method = "pearson", xlab = "Ozone", ylab = "Temp")
cor.test(DataSet$Ozone, DataSet$Temp)
##
## Pearson's product-moment correlation
##
## data: DataSet$Ozone and DataSet$Temp
## t = 10.418, df = 114, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5913340 0.7812111
## sample estimates:
## cor
## 0.6983603
There is a strong positive correlation between Temp and Ozone (r = 0.70).
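The three pairwise tests can also be summarised in a single correlation matrix. This is a sketch assuming the built-in `airquality` data; `use = "pairwise.complete.obs"` drops missing values pair by pair, which is the same handling `cor.test()` applies to each pair. The names `vars` and `cor_matrix` are illustrative.

```r
# Pearson correlations between the four measurements.
# Missing values are dropped separately for each pair of columns.
vars <- c("Ozone", "Solar.R", "Wind", "Temp")
cor_matrix <- cor(airquality[, vars], use = "pairwise.complete.obs")

# Correlation of Ozone with each variable, rounded for readability
round(cor_matrix["Ozone", ], 3)
```

The Ozone row should match the `cor` estimates reported by the three `cor.test()` calls.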
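Based on the results above, ozone is lower on days that are cooler, windier, and less sunny. To avoid high ozone, you should go to places that are cold, have strong wind, and have limited sun exposure.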
This question deals with the Carseats dataset. I own a retail store in which I am going to bring in child’s car seat(s) to sell. If I want to be successful with this initiative, what should I do?
library(ISLR)
Dataset2 <- Carseats
What are the important factors that will drive sales of car seats?
Set Expectations:
What are the important factors that will drive sales of car seats?
Collect Information:
Based on Consumer Reports and Parents magazines, customers should know:
Revise Expectation:
People who research car seats are likely to want to buy a new one. People who do this research are usually well educated, so they may be the customers who buy new car seats in the store.
New question: Does the education level of an area affect car seat sales?
Set Expectations:
Missing value in the dataset:
sum(is.na(Dataset2))
## [1] 0
Variables in the dataset:
colnames(Dataset2)
## [1] "Sales" "CompPrice" "Income" "Advertising" "Population"
## [6] "Price" "ShelveLoc" "Age" "Education" "Urban"
## [11] "US"
The dataset has no missing values and includes both Sales and Education, so it can help us answer the question.
Collect Information:
Some basic information about the dataset:
str(Dataset2)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
summary(Dataset2)
## Sales CompPrice Income Advertising
## Min. : 0.000 Min. : 77 Min. : 21.00 Min. : 0.000
## 1st Qu.: 5.390 1st Qu.:115 1st Qu.: 42.75 1st Qu.: 0.000
## Median : 7.490 Median :125 Median : 69.00 Median : 5.000
## Mean : 7.496 Mean :125 Mean : 68.66 Mean : 6.635
## 3rd Qu.: 9.320 3rd Qu.:135 3rd Qu.: 91.00 3rd Qu.:12.000
## Max. :16.270 Max. :175 Max. :120.00 Max. :29.000
## Population Price ShelveLoc Age Education
## Min. : 10.0 Min. : 24.0 Bad : 96 Min. :25.00 Min. :10.0
## 1st Qu.:139.0 1st Qu.:100.0 Good : 85 1st Qu.:39.75 1st Qu.:12.0
## Median :272.0 Median :117.0 Medium:219 Median :54.50 Median :14.0
## Mean :264.8 Mean :115.8 Mean :53.32 Mean :13.9
## 3rd Qu.:398.5 3rd Qu.:131.0 3rd Qu.:66.00 3rd Qu.:16.0
## Max. :509.0 Max. :191.0 Max. :80.00 Max. :18.0
## Urban US
## No :118 No :142
## Yes:282 Yes:258
##
##
##
##
We may want to subset the dataset, since the maximum of the Age variable (the average age of the local population) is 80, and people usually have children at around 30 to 50.
# Histogram of Age with the mean and median marked
hist(Dataset2$Age,
     main = "Histogram for Age",
     xlab = "Age",
     border = "blue",
     col = "green",
     xlim = c(25, 80))
abline(v = mean(Dataset2$Age), col = "red", lwd = 2)     # red line: mean age
abline(v = median(Dataset2$Age), col = "black", lwd = 2) # black line: median age
We might subset the data to keep only locations where Age is at most 70.
library(dplyr) # provides %>% and filter()
Dataset2a <- Dataset2 %>% filter(Age <= 70)
str(Dataset2)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
summary(Dataset2a)
## Sales CompPrice Income Advertising
## Min. : 0.160 Min. : 77.0 Min. : 21.00 Min. : 0.000
## 1st Qu.: 5.720 1st Qu.:115.0 1st Qu.: 44.50 1st Qu.: 0.000
## Median : 7.680 Median :125.0 Median : 69.00 Median : 5.000
## Mean : 7.792 Mean :125.4 Mean : 69.02 Mean : 6.613
## 3rd Qu.: 9.495 3rd Qu.:135.5 3rd Qu.: 90.50 3rd Qu.:11.000
## Max. :16.270 Max. :175.0 Max. :120.00 Max. :29.000
## Population Price ShelveLoc Age Education
## Min. : 10.0 Min. : 24 Bad : 75 Min. :25.00 Min. :10.00
## 1st Qu.:135.5 1st Qu.:100 Good : 70 1st Qu.:36.00 1st Qu.:12.00
## Median :269.0 Median :119 Medium:178 Median :49.00 Median :14.00
## Mean :263.4 Mean :116 Mean :47.96 Mean :13.85
## 3rd Qu.:397.0 3rd Qu.:131 3rd Qu.:60.00 3rd Qu.:16.00
## Max. :509.0 Max. :191 Max. :70.00 Max. :18.00
## Urban US
## No : 93 No :112
## Yes:230 Yes:211
##
##
##
##
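A quick check of how many rows the Age filter keeps can confirm that the subset behaves as intended. The sketch below uses base R indexing instead of `dplyr`, so it runs with only the ISLR package loaded; `subset_base` is an illustrative name and is expected to hold the same rows as `Dataset2a`.

```r
library(ISLR)

# Base R equivalent of the dplyr filter: keep locations with Age <= 70.
# subset_base is an illustrative name.
subset_base <- Carseats[Carseats$Age <= 70, ]

nrow(Carseats)     # rows before filtering
nrow(subset_base)  # rows after filtering
max(subset_base$Age)
```

The factor counts in the two summaries add up to 400 rows before filtering and 323 after, so the filter removes 77 locations.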
Revise Expectation:
New question: In areas where the average age is within the usual child-rearing range, does education level affect car seat sales?
c. Build a formal model
Use a boxplot to see whether education level affects car seat sales.
boxplot(split(Dataset2a$Sales, Dataset2a$Education), main = "Sales of each education level")
Based on the graph, unit sales are roughly even across education levels.
The lowest education level even seems to have the highest sales.
Education level is not the main factor driving car seat sales.
One possible explanation is that people with higher education levels tend to have fewer children than those with lower education levels.
It might be good to target areas with average to lower education levels, since we might sell more car seats there.
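The boxplot is only a visual check. One possible way to move toward the formal model named in this step is a simple linear regression of Sales on Education, with a Pearson correlation test alongside it. This is a sketch, not the model used in the original analysis. It assumes the ISLR package is installed; `carseats_sub` and `fit_edu` are illustrative names.

```r
library(ISLR)

# Same Age filter as above, written with base R.
# carseats_sub and fit_edu are illustrative names.
carseats_sub <- subset(Carseats, Age <= 70)

# Simple linear regression: does Education explain unit sales?
fit_edu <- lm(Sales ~ Education, data = carseats_sub)
summary(fit_edu)$coefficients

# Pearson correlation between Sales and Education, for comparison
cor.test(carseats_sub$Sales, carseats_sub$Education)
```

A slope near zero with a large p-value would support the reading of the boxplot that education level is not a main driver of sales.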
Boxplot of sales by ShelveLoc:
boxplot(split(Dataset2$Sales, Dataset2$ShelveLoc), main = "Sales based on shelving location")
Based on the graph, shelving location needs to be considered when selling car seats.
The Good location has higher typical sales than the Medium and Bad locations.
If you want better car seat sales, you should put the products in a good shelving location.
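To back the boxplot with numbers, the sketch below computes the median sales for each shelving location and redraws the plot with the levels in quality order (Bad, Medium, Good) instead of the default alphabetical order. It assumes the ISLR package is installed; `cs` and `shelf_medians` are illustrative names.

```r
library(ISLR)

# Work on a copy so the packaged dataset is left untouched.
# cs and shelf_medians are illustrative names.
cs <- Carseats

# Reorder the factor levels from worst to best shelving location
cs$ShelveLoc <- factor(cs$ShelveLoc, levels = c("Bad", "Medium", "Good"))

# Median unit sales per shelving location
shelf_medians <- tapply(cs$Sales, cs$ShelveLoc, median)
print(shelf_medians)

# Boxplot with the locations in quality order
boxplot(Sales ~ ShelveLoc, data = cs,
        main = "Sales based on shelving location")
```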
Exploratory