## Question 1
DataSet <- airquality
1. How many rows and columns does the data set have?
nrow(DataSet)
## [1] 153
ncol(DataSet)
## [1] 6
The data set has 153 rows and 6 columns.
2. What are the column labels of the data?
colnames(DataSet)
## [1] "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day"
3. Determine how many values are missing in each column.
sum(is.na(DataSet$Ozone))
## [1] 37
sum(is.na(DataSet$Solar.R))
## [1] 7
sum(is.na(DataSet$Wind))
## [1] 0
sum(is.na(DataSet$Temp))
## [1] 0
sum(is.na(DataSet$Month))
## [1] 0
sum(is.na(DataSet$Day))
## [1] 0
The Ozone column has 37 missing values and the Solar.R column has 7 missing values. The other columns have none.
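The six separate calls above can be collapsed into one. A minimal sketch, assuming the built-in airquality data; `na_counts` is a name introduced here for illustration:

```r
DataSet <- airquality

# is.na() on a data frame returns a logical matrix;
# colSums() then counts the TRUE values in each column
na_counts <- colSums(is.na(DataSet))
na_counts
```

The result is a named vector, so a single column can be looked up with `na_counts[["Ozone"]]`.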
4. Determine what type of variable each column has. Change the Day and Month variables to factors.
sapply(DataSet, class)
##     Ozone   Solar.R      Wind      Temp     Month       Day
## "integer" "integer" "numeric" "integer" "integer" "integer"
as.factor(DataSet$Month)
## [1] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6
## [38] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 7 7
## [75] 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
## [112] 8 8 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [149] 9 9 9 9 9
## Levels: 5 6 7 8 9
as.factor(DataSet$Day)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
## [51] 20 21 22 23 24 25 26 27 28 29 30 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## [76] 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 2 3 4 5 6 7 8
## [101] 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 2
## [126] 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
## [151] 28 29 30
## 31 Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 ... 31
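Note that `as.factor()` returns a new vector and leaves `DataSet` unchanged, so the calls above only print the converted values. To actually store the factors, one option is to assign them back. The sketch below does this on a copy (`aq` is a hypothetical name), so that the numeric `Month` column of `DataSet` stays available for the correlation tests later on:

```r
# Work on a copy so the original integer columns are preserved
aq <- airquality

# Assign the converted vectors back into the data frame
aq$Month <- as.factor(aq$Month)
aq$Day <- as.factor(aq$Day)

# Month and Day should now be reported as "factor"
sapply(aq, class)
```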
5. Create a boxplot of ozone count split by month.
ozone <- DataSet$Ozone
month <- DataSet$Month
boxplot(ozone~month)
6. Create a boxplot for each of the other variables. Split again by month.
solar.r <- DataSet$Solar.R
wind <- DataSet$Wind
temp <- DataSet$Temp
day <- DataSet$Day
boxplot(solar.r~month)
boxplot(wind~month)
boxplot(temp~month)
boxplot(day~month)
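The boxplots can be backed up with numbers. A possible sketch, assuming the built-in airquality data, computes the median of each measured variable per month (`vars` and `monthly_medians` are names chosen here for illustration):

```r
DataSet <- airquality
vars <- c("Ozone", "Solar.R", "Wind", "Temp")

# For each variable, take the median within each month, ignoring missing values.
# The result is a matrix with one row per month and one column per variable.
monthly_medians <- sapply(DataSet[vars], function(x) {
  tapply(x, DataSet$Month, median, na.rm = TRUE)
})
monthly_medians
```

Each row corresponds to the thick line inside the boxes of one month, which makes it easier to compare months exactly.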
7. Determine if there is any correlation between variables.
cortest <- cor.test(DataSet$Ozone, DataSet$Month, method = "pearson")
cortest2 <- cor.test(DataSet$Solar.R, DataSet$Month, method = "pearson")
cortest3 <- cor.test(DataSet$Wind, DataSet$Month, method = "pearson")
cortest4 <- cor.test(DataSet$Temp, DataSet$Month, method = "pearson")
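The four tests above are stored but never printed, so their results do not appear in the output. One way to inspect them, sketched here with a hypothetical summary matrix `month_cor`, is to collect the estimate and p-value of each test in one place:

```r
DataSet <- airquality
vars <- c("Ozone", "Solar.R", "Wind", "Temp")

# Run the same Pearson test of each variable against Month
# and keep only the correlation estimate and the p-value
month_cor <- sapply(vars, function(v) {
  ct <- cor.test(DataSet[[v]], DataSet$Month, method = "pearson")
  c(estimate = unname(ct$estimate), p.value = ct$p.value)
})
round(month_cor, 3)
```

`cor.test()` drops incomplete pairs itself, so the missing values in Ozone and Solar.R need no extra handling here.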
8. If I want to limit my exposure to ozone, what should I do?
Although the data are limited (they cover only May to September), the boxplots show that ozone levels are highest in July and August, which are also the hottest months. The correlation tests in step 7 relate each variable to Month rather than to each other, so they do not test the ozone-temperature relationship directly. The boxplots for ozone and temperature do follow the same monthly pattern, however, which suggests that ozone rises with temperature, apart from a few outliers. Therefore, someone who wants to limit their exposure to ozone should avoid New York in July and August and visit during cooler months, such as the winter. Since the data set contains no winter measurements, the winter recommendation is an extrapolation from the trend that ozone levels rise with warmer temperatures.
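The relationship between ozone and temperature can also be checked directly. A minimal sketch, assuming the built-in airquality data (`ozone_temp` is a name introduced here):

```r
DataSet <- airquality

# Pearson correlation between ozone and temperature;
# days with a missing ozone value are dropped by cor.test()
ozone_temp <- cor.test(DataSet$Ozone, DataSet$Temp, method = "pearson")

ozone_temp$estimate  # correlation coefficient
ozone_temp$p.value   # p-value of the test
```

A clearly positive estimate with a small p-value would support the reading of the boxplots. It would still describe an association in these summer data only, not a proven cause.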
## Question 2
This question deals with the Carseats dataset. I own a retail store and plan to start selling child car seats. If I want this initiative to be successful, what should I do?
library(ISLR)   # provides the Carseats data
library(dplyr)  # provides %>% and filter(), used below
Dataset2 <- Carseats
View(Dataset2)
1. Develop a question. Ensure that your question has the characteristics of a good question.
What are the important factors that drive successful sales of car seats?
2. Go through the epicycle of data analysis, including doing research.
a. State the question
Expectations:
What are the important factors that will drive sales of car seats?
Collected Information:
Based on “Baby Logic” and “Safe Kids” (links below), customers should know:
- Quality: A new car seat has benefits compared to a used one. Features and design details that reflect quality include ease of cleaning, the seat’s straps, the seat’s harness, etc.
- Type of car seat: The types are rear-facing, convertible, and booster. There are also car seats for specific child ages (infant and toddler). A child’s weight and height influence the buying decision.
- Price is not necessarily indicative of a good product.
- Price range is usually from 11 to 500, but in our dataset it is from 24 to 191.
- Car seats are sold by retailers both in store and on their websites (Walmart, Canadian Tire, etc.). Amazon also sells car seats.
https://www.safekids.org/ultimate-car-seat-guide/basic-tips/buying/
Missing values in the dataset:
sum(is.na(Dataset2))
## [1] 0
Variables in the dataset:
colnames(Dataset2)
## [1] "Sales" "CompPrice" "Income" "Advertising" "Population"
## [6] "Price" "ShelveLoc" "Age" "Education" "Urban"
## [11] "US"
summary(Dataset2)
##      Sales          CompPrice       Income        Advertising
##  Min.   : 0.000   Min.   : 77   Min.   : 21.00   Min.   : 0.000
##  1st Qu.: 5.390   1st Qu.:115   1st Qu.: 42.75   1st Qu.: 0.000
##  Median : 7.490   Median :125   Median : 69.00   Median : 5.000
##  Mean   : 7.496   Mean   :125   Mean   : 68.66   Mean   : 6.635
##  3rd Qu.: 9.320   3rd Qu.:135   3rd Qu.: 91.00   3rd Qu.:12.000
##  Max.   :16.270   Max.   :175   Max.   :120.00   Max.   :29.000
##    Population        Price        ShelveLoc        Age          Education
##  Min.   : 10.0   Min.   : 24.0   Bad   : 96   Min.   :25.00   Min.   :10.0
##  1st Qu.:139.0   1st Qu.:100.0   Good  : 85   1st Qu.:39.75   1st Qu.:12.0
##  Median :272.0   Median :117.0   Medium:219   Median :54.50   Median :14.0
##  Mean   :264.8   Mean   :115.8                Mean   :53.32   Mean   :13.9
##  3rd Qu.:398.5   3rd Qu.:131.0                3rd Qu.:66.00   3rd Qu.:16.0
##  Max.   :509.0   Max.   :191.0                Max.   :80.00   Max.   :18.0
##  Urban       US
##  No :118   No :142
##  Yes:282   Yes:258
Dataset2a <- Dataset2 %>% filter(Age <= 70)
str(Dataset2a)
## 'data.frame': 323 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 136 121 117 122 115 ...
## $ Income : num 73 48 35 100 64 81 78 94 35 28 ...
## $ Advertising: num 11 16 10 4 3 15 9 4 2 11 ...
## $ Population : num 276 260 269 466 340 425 150 503 393 29 ...
## $ Price : num 120 83 80 97 128 120 100 94 136 86 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 2 1 2 3 2 ...
## $ Age : num 42 65 59 55 38 67 26 50 62 53 ...
## $ Education : num 17 10 12 14 13 10 10 13 18 18 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 1 2 2 2 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 2 2 1 2 ...
summary(Dataset2a)
##      Sales          CompPrice         Income        Advertising
##  Min.   : 0.160   Min.   : 77.0   Min.   : 21.00   Min.   : 0.000
##  1st Qu.: 5.720   1st Qu.:115.0   1st Qu.: 44.50   1st Qu.: 0.000
##  Median : 7.680   Median :125.0   Median : 69.00   Median : 5.000
##  Mean   : 7.792   Mean   :125.4   Mean   : 69.02   Mean   : 6.613
##  3rd Qu.: 9.495   3rd Qu.:135.5   3rd Qu.: 90.50   3rd Qu.:11.000
##  Max.   :16.270   Max.   :175.0   Max.   :120.00   Max.   :29.000
##    Population        Price       ShelveLoc        Age          Education
##  Min.   : 10.0   Min.   : 24   Bad   : 75   Min.   :25.00   Min.   :10.00
##  1st Qu.:135.5   1st Qu.:100   Good  : 70   1st Qu.:36.00   1st Qu.:12.00
##  Median :269.0   Median :119   Medium:178   Median :49.00   Median :14.00
##  Mean   :263.4   Mean   :116                Mean   :47.96   Mean   :13.85
##  3rd Qu.:397.0   3rd Qu.:131                3rd Qu.:60.00   3rd Qu.:16.00
##  Max.   :509.0   Max.   :191                Max.   :70.00   Max.   :18.00
##  Urban       US
##  No : 93   No :112
##  Yes:230   Yes:211
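We apply this filter because people generally have children between the ages of 30 and 50. Age in this dataset is the average age of the local population at each store location, so locations with an average age above 70 are unlikely to matter for this particular problem.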
c. Build a model. Does the education level of an area affect car seat sales?
boxplot(split(Dataset2a$Sales, Dataset2a$Education))
d. Interpretation
As we can see, sales are fairly even across education levels, but the lowest education level actually seems to have the highest sales. One possible explanation is the tendency of lower-income households to have more children earlier.
e. Results
Education is not visibly the main factor that drives sales of car seats. However, we can expect somewhat higher sales if we target a customer base with a lower education level. What works in our favour is that income and education levels usually vary by geographical area because of affordability. Therefore, a good place to sell car seats would be a lower-income neighbourhood.
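To check how even sales really are across education levels, the medians behind the boxplot can be printed. A small sketch, assuming the ISLR package is installed; `subset()` is used here as a base R stand-in for the dplyr filter, and `edu_medians` is a name chosen for illustration:

```r
library(ISLR)

# Same filter as above, written with base R
Dataset2a <- subset(Carseats, Age <= 70)

# Median sales for each education level
edu_medians <- tapply(Dataset2a$Sales, Dataset2a$Education, median)
edu_medians
```

Small differences between these medians would be consistent with education not being the main driver of sales.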
boxplot(split(Dataset2a$Sales, Dataset2a$Advertising))
This boxplot suggests that advertising levels influence sales. However, we can probably find a variable that is more indicative.
boxplot(split(Dataset2a$Sales, Dataset2a$ShelveLoc))
That’s much better. Now we can see that, in addition to targeting lower education profiles, placing the car seats in a good shelf location has a major impact on sales.
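Step c asks for a model, while the analysis so far relies on boxplots. As a hedged sketch of what a simple model could look like (this is not part of the original analysis; the choice of predictors and the name `fit` are assumptions made here, and the ISLR package is assumed to be installed), a linear regression of Sales on the variables discussed above:

```r
library(ISLR)

# Same filter as above, written with base R
Dataset2a <- subset(Carseats, Age <= 70)

# Linear model of sales on price, advertising, shelf location and education
fit <- lm(Sales ~ Price + Advertising + ShelveLoc + Education, data = Dataset2a)

# Coefficient table: estimates, standard errors, t values and p-values
summary(fit)$coefficients
```

Comparing the size and significance of the coefficients gives a rough way to rank the factors, which the separate boxplots cannot do on their own.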
What is the type of question that we are answering? This is an exploratory analysis.