Question 1:

This question deals with the airquality dataset. Read in the data and answer the following questions:

DataSet <- airquality
  1. How many rows and columns does the data set have?

Number of rows

nrow(DataSet)
## [1] 153

Number of columns

ncol(DataSet)
## [1] 6
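
As a small aside, both counts can be read in one call with `dim()`, which returns the number of rows followed by the number of columns. The name `dims` is only illustrative.

```r
# dim() returns c(rows, columns) for a data frame.
dims <- dim(airquality)
print(dims)

# The two values can also be pulled out by position.
n_rows <- dims[1]
n_cols <- dims[2]
```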
  2. What are the column labels of the data?

Column labels:

colnames(DataSet)
## [1] "Ozone"   "Solar.R" "Wind"    "Temp"    "Month"   "Day"
  3. Determine how many values are missing in each column.

Number of missing values in Ozone and Solar.R; the other four columns (Wind, Temp, Month, Day) contain no missing values:

sum(is.na(DataSet[,"Ozone"]))
## [1] 37
sum(is.na(DataSet[,"Solar.R"]))
## [1] 7
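
A compact way to get the count for every column at once, sketched here as an alternative to checking the columns one by one, is to combine `is.na()` with `colSums()`. The name `na_counts` is only illustrative.

```r
# is.na() on a data frame gives a logical matrix; colSums() counts the TRUEs
# in each column, i.e. the number of missing values per column.
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Columns that contain at least one missing value.
cols_with_na <- names(na_counts)[na_counts > 0]
print(cols_with_na)
```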
  4. Determine what type of variable each column has. Change the Day and Month variables to factors.
str(DataSet)
## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...

Wind is numeric; the other five columns are integer.

DataSet$Day <- as.factor(DataSet$Day)
DataSet$Month <- as.factor(DataSet$Month)
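
To confirm that the conversion worked, the class of each column can be listed with `sapply()`. This is a minimal sketch on a copy of the data (the names `aq` and `col_classes` are illustrative), so the built-in data set is not modified.

```r
# Work on a copy so the built-in airquality data stays untouched.
aq <- airquality
aq$Day   <- as.factor(aq$Day)
aq$Month <- as.factor(aq$Month)

# One class name per column; Day and Month should now be "factor".
col_classes <- sapply(aq, class)
print(col_classes)

# The factor levels are the distinct month numbers in the data.
print(levels(aq$Month))
```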
  5. Create a boxplot of ozone count split by month.
boxplot(split(DataSet$Ozone, DataSet$Month), main = "Ozone count by month")

  6. Create a boxplot for each of the other variables. Split again by month.
boxplot(split(DataSet$Solar.R, DataSet$Month), main = "Solar.R by month")

boxplot(split(DataSet$Wind, DataSet$Month), main = "Wind by month")

boxplot(split(DataSet$Temp, DataSet$Month), main = "Temp by month")
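
The four boxplots above repeat the same call, so they could also be drawn in a loop using the formula interface of `boxplot()`. The following is a sketch, not the code used for the plots above; `aq`, `vars` and `box_stats` are illustrative names.

```r
# Work on a copy and treat Month as a factor, as in the steps above.
aq <- airquality
aq$Month <- as.factor(aq$Month)

# Variables to plot against Month.
vars <- c("Ozone", "Solar.R", "Wind", "Temp")

# Draw the plots in a 2 x 2 grid and keep the statistics boxplot() returns.
old_par <- par(mfrow = c(2, 2))
box_stats <- lapply(vars, function(v) {
  # reformulate() builds the formula `v ~ Month` from the column name.
  boxplot(reformulate("Month", response = v), data = aq,
          main = paste(v, "by month"), xlab = "Month", ylab = v)
})
par(old_par)
names(box_stats) <- vars

# Number of non-missing observations behind each Ozone box.
print(box_stats$Ozone$n)
```

`boxplot()` ignores missing values when it computes each box, so the Ozone boxes are based on fewer observations than the Wind and Temp boxes.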

  7. Determine whether there are any correlations between variables.
plot(DataSet[,c(1,2,3,4,5,6)])

library(ggpubr)  # provides ggscatter()
ggscatter(DataSet, x = "Ozone", y = "Solar.R", add = "reg.line", conf.int = TRUE, cor.coef = TRUE, cor.method = "pearson", xlab = "Ozone", ylab = "Solar.R")

cor.test(DataSet$Ozone, DataSet$Solar.R)
## 
##  Pearson's product-moment correlation
## 
## data:  DataSet$Ozone and DataSet$Solar.R
## t = 3.8798, df = 109, p-value = 0.0001793
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.173194 0.502132
## sample estimates:
##       cor 
## 0.3483417

There is a weak positive correlation between Solar.R and Ozone (r ≈ 0.35).

ggscatter(DataSet, x = "Ozone", y = "Wind", add = "reg.line", conf.int = TRUE, cor.coef = TRUE, cor.method = "pearson", xlab = "Ozone", ylab = "Wind")

cor.test(DataSet$Ozone, DataSet$Wind)
## 
##  Pearson's product-moment correlation
## 
## data:  DataSet$Ozone and DataSet$Wind
## t = -8.0401, df = 114, p-value = 9.272e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7063918 -0.4708713
## sample estimates:
##        cor 
## -0.6015465

There is a fairly strong negative correlation between Wind and Ozone (r ≈ -0.60).

ggscatter(DataSet, x = "Ozone", y = "Temp", add = "reg.line", conf.int = TRUE, cor.coef = TRUE, cor.method = "pearson", xlab = "Ozone", ylab = "Temp")

cor.test(DataSet$Ozone, DataSet$Temp)
## 
##  Pearson's product-moment correlation
## 
## data:  DataSet$Ozone and DataSet$Temp
## t = 10.418, df = 114, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5913340 0.7812111
## sample estimates:
##       cor 
## 0.6983603

There is a strong positive correlation between Temp and Ozone (r ≈ 0.70).
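
The three pairwise tests above can be cross-checked with a single correlation matrix. This is a sketch that assumes pairwise deletion of missing values (`use = "pairwise.complete.obs"`), which uses the same complete pairs as `cor.test()` does for each pair; `num_vars` and `cor_mat` are illustrative names.

```r
# Keep only the four measured variables (Month and Day are left out).
num_vars <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]

# Pearson correlations, dropping missing values pair by pair.
cor_mat <- cor(num_vars, use = "pairwise.complete.obs", method = "pearson")
print(round(cor_mat, 2))

# The Ozone row holds the three correlations discussed above.
print(round(cor_mat["Ozone", c("Solar.R", "Wind", "Temp")], 2))
```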

  8. If I want to limit my exposure to ozone, what should I do?

Based on the results above, ozone is lower when it is cooler and windier and when there is less solar radiation, so you should prefer cool, windy places with limited sun exposure.

Question 2:

This question deals with the Carseats dataset. I own a retail store in which I am going to bring in child’s car seat(s) to sell. If I want to be successful with this initiative, what should I do?

library(ISLR)
Dataset2 <- Carseats
  1. Develop a question. Ensure that your question has the characteristics of a good question.

What are the important factors that drive sales of car seats?

  2. Go through the epicycle of data analysis - including doing research.
  a. State the question

Set Expectations:

What are the important factors that drive sales of car seats?

Collected Information:

Based on Consumer Reports and Parents magazine, customers should know:

link

link

Revise Expectation:

People who research car seats are likely to be in the market for a new one. We assume that people who do this research tend to be more educated, and that they may want to buy a new car seat in the store.

New question: Does the education level of an area affect car seat sales?

  b. Exploring the data

Set Expectations:

Missing values in the data set:

sum(is.na(Dataset2))
## [1] 0

Variables in the dataset:

colnames(Dataset2)
##  [1] "Sales"       "CompPrice"   "Income"      "Advertising" "Population" 
##  [6] "Price"       "ShelveLoc"   "Age"         "Education"   "Urban"      
## [11] "US"

The data set has no missing values and contains both Sales and Education, so it can help us answer the question.

Collect Information:

Some basic information about the data set

str(Dataset2)
## 'data.frame':    400 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
summary(Dataset2)
##      Sales          CompPrice       Income        Advertising    
##  Min.   : 0.000   Min.   : 77   Min.   : 21.00   Min.   : 0.000  
##  1st Qu.: 5.390   1st Qu.:115   1st Qu.: 42.75   1st Qu.: 0.000  
##  Median : 7.490   Median :125   Median : 69.00   Median : 5.000  
##  Mean   : 7.496   Mean   :125   Mean   : 68.66   Mean   : 6.635  
##  3rd Qu.: 9.320   3rd Qu.:135   3rd Qu.: 91.00   3rd Qu.:12.000  
##  Max.   :16.270   Max.   :175   Max.   :120.00   Max.   :29.000  
##    Population        Price        ShelveLoc        Age          Education   
##  Min.   : 10.0   Min.   : 24.0   Bad   : 96   Min.   :25.00   Min.   :10.0  
##  1st Qu.:139.0   1st Qu.:100.0   Good  : 85   1st Qu.:39.75   1st Qu.:12.0  
##  Median :272.0   Median :117.0   Medium:219   Median :54.50   Median :14.0  
##  Mean   :264.8   Mean   :115.8                Mean   :53.32   Mean   :13.9  
##  3rd Qu.:398.5   3rd Qu.:131.0                3rd Qu.:66.00   3rd Qu.:16.0  
##  Max.   :509.0   Max.   :191.0                Max.   :80.00   Max.   :18.0  
##  Urban       US     
##  No :118   No :142  
##  Yes:282   Yes:258  
##                     
##                     
##                     
## 

We may want to subset the data set, since the Age variable (the average age of the local population) goes up to 80, while people usually have children at around 30 to 50.

hist(Dataset2$Age,
     main = "Histogram for Age",
     xlab = "Age",
     border = "blue",
     col = "green",
     xlim = c(25,80))
abline(v = mean(Dataset2$Age), col = "red", lwd = 2)
abline(v = median(Dataset2$Age), col = "black", lwd = 2)

We might subset the data to keep only locations with an average age of 70 or below.

library(dplyr)  # provides %>% and filter()
Dataset2a <- Dataset2 %>% filter(Age <= 70)
nrow(Dataset2a)
## [1] 323
summary(Dataset2a)
##      Sales          CompPrice         Income        Advertising    
##  Min.   : 0.160   Min.   : 77.0   Min.   : 21.00   Min.   : 0.000  
##  1st Qu.: 5.720   1st Qu.:115.0   1st Qu.: 44.50   1st Qu.: 0.000  
##  Median : 7.680   Median :125.0   Median : 69.00   Median : 5.000  
##  Mean   : 7.792   Mean   :125.4   Mean   : 69.02   Mean   : 6.613  
##  3rd Qu.: 9.495   3rd Qu.:135.5   3rd Qu.: 90.50   3rd Qu.:11.000  
##  Max.   :16.270   Max.   :175.0   Max.   :120.00   Max.   :29.000  
##    Population        Price      ShelveLoc        Age          Education    
##  Min.   : 10.0   Min.   : 24   Bad   : 75   Min.   :25.00   Min.   :10.00  
##  1st Qu.:135.5   1st Qu.:100   Good  : 70   1st Qu.:36.00   1st Qu.:12.00  
##  Median :269.0   Median :119   Medium:178   Median :49.00   Median :14.00  
##  Mean   :263.4   Mean   :116                Mean   :47.96   Mean   :13.85  
##  3rd Qu.:397.0   3rd Qu.:131                3rd Qu.:60.00   3rd Qu.:16.00  
##  Max.   :509.0   Max.   :191                Max.   :70.00   Max.   :18.00  
##  Urban       US     
##  No : 93   No :112  
##  Yes:230   Yes:211  
##                     
##                     
##                     
## 

Revise Expectation:

New question: In areas whose average age is in the range where people typically have children (here, 70 or below), does education level affect car seat sales?

  c. Build a formal model

Use a boxplot to see whether education level affects car seat sales.

boxplot(split(Dataset2a$Sales, Dataset2a$Education), main = "Sales of each education level")

  d. Interpret the Results

Based on the graph, unit sales are roughly even across education levels.

The lowest education level even seems to have the highest sales.
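
The visual impression from the boxplot could be backed up with a correlation test between Sales and Education. The sketch below is an addition, not part of the original analysis: `sales_cor` is a hypothetical helper, and the example call only runs if the ISLR package is installed. No output is shown because the test was not run for this write-up.

```r
# Hypothetical helper: Pearson correlation test between Sales and one
# other numeric column of a data frame.
sales_cor <- function(data, var) {
  cor.test(data$Sales, data[[var]], method = "pearson")
}

# Example call on the Age <= 70 subset used above (skipped without ISLR).
if (requireNamespace("ISLR", quietly = TRUE)) {
  carseats_sub <- subset(ISLR::Carseats, Age <= 70)
  print(sales_cor(carseats_sub, "Education"))
}
```

A correlation close to zero with a confidence interval that includes zero would be consistent with the reading of the boxplot above.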

  e. Communicate the Results

Education level is not a main factor driving car seat sales.

One possible explanation, which this data set cannot confirm, is that people with more education tend to have fewer children than people with less education.

It might therefore be worth targeting areas with average to lower education levels, where we might sell more car seats.

  3. Do an exploratory data analysis.

Boxplot of Sales by ShelveLoc:

boxplot(split(Dataset2$Sales, Dataset2$ShelveLoc), main = "Sales based on shelving location")

Based on the graph, shelving location needs to be considered when selling car seats.

Stores with a Good shelving location have higher median sales.

If you want better car seat sales, it seems you should put the products in a good shelving location.
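
To put numbers behind the boxplot, sales could be summarised per shelving location. This is a minimal sketch, not part of the original analysis: `group_summary` is a hypothetical helper, and the example call only runs if the ISLR package is installed.

```r
# Hypothetical helper: count, median and mean of a value column per group.
group_summary <- function(data, group, value = "Sales") {
  counts  <- tapply(data[[value]], data[[group]], length)
  medians <- tapply(data[[value]], data[[group]], median)
  means   <- tapply(data[[value]], data[[group]], mean)
  data.frame(n = as.vector(counts),
             median = as.vector(medians),
             mean = as.vector(means),
             row.names = names(counts))
}

# Example call on the full Carseats data (skipped without ISLR).
if (requireNamespace("ISLR", quietly = TRUE)) {
  print(group_summary(ISLR::Carseats, "ShelveLoc"))
}
```

The same helper works for any grouping column, for example `group_summary(ISLR::Carseats, "Urban")`.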

  4. What is the type of question that we are answering?

Exploratory