Assignment 3 DSCI100

Ross Ciancio

24/02/2020

Complete the questions below. Compile your answers using Markdown and upload to your RPubs page. Submit a link to your RPubs page (through MOODLE) for grading.



## Question 1


DataSet <- airquality



1. How many rows and columns does the data set have?

nrow(DataSet)
## [1] 153
ncol(DataSet)
## [1] 6

The data set has 153 rows and 6 columns.


2. What are the column labels of the data?

colnames(DataSet)
## [1] "Ozone"   "Solar.R" "Wind"    "Temp"    "Month"   "Day"



3. Determine how many values are missing in each column.

sum(is.na(DataSet$Ozone))
## [1] 37
sum(is.na(DataSet$Solar.R))
## [1] 7
sum(is.na(DataSet$Wind))
## [1] 0
sum(is.na(DataSet$Temp))
## [1] 0
sum(is.na(DataSet$Month))
## [1] 0
sum(is.na(DataSet$Day))
## [1] 0

The Ozone column has 37 missing values and the Solar.R column has 7 missing values. The other columns have none.
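The six separate calls above can be collapsed into one. This is a minimal sketch using base R only; `na_counts` is a name chosen here for illustration.

```r
# Count missing values in every column of airquality in a single call
DataSet <- airquality
na_counts <- colSums(is.na(DataSet))  # is.na() gives a logical matrix; colSums() counts the TRUEs
print(na_counts)
```

The result is a named vector, so a single column can be read with `na_counts[["Ozone"]]`.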


4. Determine what type of variable each column has. Change the Day and Month variables to factors.

sapply(DataSet, class)
##     Ozone   Solar.R      Wind      Temp     Month       Day 
## "integer" "integer" "numeric" "integer" "integer" "integer"
as.factor(DataSet$Month)
##   [1] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6
##  [38] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 7 7
##  [75] 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
## [112] 8 8 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
## [149] 9 9 9 9 9
## Levels: 5 6 7 8 9
as.factor(DataSet$Day)
##   [1] 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
##  [26] 26 27 28 29 30 31 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19
##  [51] 20 21 22 23 24 25 26 27 28 29 30 1  2  3  4  5  6  7  8  9  10 11 12 13 14
##  [76] 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1  2  3  4  5  6  7  8 
## [101] 9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1  2 
## [126] 3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
## [151] 28 29 30
## 31 Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 ... 31
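Note that `as.factor()` returns a new vector and does not modify the data frame, so the calls above only print the converted values. To store the change, the result has to be assigned back. The sketch below does this on a copy (`DataSetF`, a name assumed here) so that the original integer `Month` column remains available for the correlation tests later in this question.

```r
# Convert Month and Day to factors on a copy of the data,
# leaving the original integer columns in airquality untouched
DataSetF <- airquality
DataSetF$Month <- as.factor(DataSetF$Month)
DataSetF$Day <- as.factor(DataSetF$Day)

# Confirm the new column classes
print(sapply(DataSetF, class))
```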



5. Create a boxplot of ozone count split by month.

ozone <- DataSet$Ozone
month <- DataSet$Month

boxplot(ozone~month)
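To read the boxplot numerically, the monthly medians can be computed directly. This is a small sketch rather than part of the original answer; `ozone_median` is an illustrative name, and missing ozone readings are dropped with `na.rm = TRUE`.

```r
# Median ozone reading for each month (5 = May, ..., 9 = September),
# ignoring the missing Ozone values
ozone_median <- tapply(airquality$Ozone, airquality$Month, median, na.rm = TRUE)
print(ozone_median)
```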



6. Create a boxplot for each of the other variables. Split again by month.

solar.r <- DataSet$Solar.R
wind <- DataSet$Wind
temp <- DataSet$Temp
day <- DataSet$Day

boxplot(solar.r~month)

boxplot(wind~month)

boxplot(temp~month)

boxplot(day~month)
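The same plots can be produced in a loop with axis labels and titles, which makes them easier to tell apart. This is a sketch under the assumption that `Day` can be left out, since the day of the month carries little information when split by month.

```r
# Draw one labelled boxplot per variable, split by month
vars <- c("Solar.R", "Wind", "Temp")
for (v in vars) {
  boxplot(airquality[[v]] ~ airquality$Month,
          xlab = "Month", ylab = v,
          main = paste(v, "by month"))
}
```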



7. Determine whether there are any correlations between variables.

cortest <- cor.test(DataSet$Ozone, DataSet$Month, 
                    method = "pearson")

cortest2 <- cor.test(DataSet$Solar.R, DataSet$Month, method= "pearson")

cortest3 <- cor.test(DataSet$Wind, DataSet$Month, method= "pearson")

cortest4 <- cor.test(DataSet$Temp, DataSet$Month, method= "pearson")
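The tests above store their results without printing them, and each one compares a variable with `Month` rather than with another measurement. The sketch below tests ozone against temperature directly and prints a pairwise correlation matrix for the four measurement columns. The names `ct_temp` and `cor_matrix` are assumptions made here for illustration.

```r
# Pearson correlation between ozone and temperature;
# cor.test() drops incomplete pairs automatically
ct_temp <- cor.test(airquality$Ozone, airquality$Temp, method = "pearson")
print(ct_temp$estimate)  # correlation coefficient
print(ct_temp$p.value)   # p-value for the null hypothesis of zero correlation

# Pairwise correlations between the measurement columns,
# using every complete pair of observations for each cell
cor_matrix <- cor(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")],
                  use = "pairwise.complete.obs")
print(round(cor_matrix, 2))
```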



8. If I want to limit my exposure to ozone, what should I do?

Although we have limited information (the data only covers May to September), the boxplots show that ozone levels are highest in July and August, which are also the hottest months. Ozone and temperature follow the same monthly pattern, which suggests that the two are positively correlated, apart from a few outliers. Note that the correlation tests above compare each variable with Month, so they do not measure the ozone-temperature relationship directly. Therefore, anyone who wants to limit their exposure to ozone should visit New York during the cooler months, ideally in winter, since ozone levels rise on average with warmer temperatures. The data does not cover winter, so this last step is an extrapolation.


## Question 2

This question deals with the Carseats dataset. I own a retail store in which I am going to bring in child’s car seat(s) to sell. If I want to be successful with this initiative, what should I do?

library(ISLR)
library(dplyr)  # provides %>% and filter(), which are used below
Dataset2 <- Carseats
View(Dataset2)


1. Develop a question. Ensure that your question has the characteristics of a good question.

What are the important factors that drive successful sales of car seats?


2. Go through the epicycle of data analysis, including doing research.

a. State the question

What are the important factors that drive sales of car seats?

Expectations and collected information:

Based on “Baby Logic” and “Safe Kids” (links below), customers should know:

- Quality: A new car seat has benefits compared to a used one. The features and design elements that reflect quality include ease of cleaning, the seat straps, the seat harness, etc.

- Type of car seat: The main types are rear-facing, convertible, and booster. There are also car seats for specific age groups, such as infants and toddlers. A child’s weight and height influence the buying decision.

- Price is not necessarily indicative of a good product.

- The price range is usually from 11 to 500, but in our dataset it runs from 24 to 191.

- Car seats are sold by retailers both in store and on their websites (Walmart, Canadian Tire, etc.). Amazon also sells car seats.

https://www.baby-logic.com/blogs/news/75216132-car-seat-safety-and-when-to-upgrade-your-child-s-car-seat

https://www.safekids.org/ultimate-car-seat-guide/basic-tips/buying/


b. Exploring the data

Missing values in the dataset:

sum(is.na(Dataset2))
## [1] 0

Variables in the dataset:

colnames(Dataset2)
##  [1] "Sales"       "CompPrice"   "Income"      "Advertising" "Population" 
##  [6] "Price"       "ShelveLoc"   "Age"         "Education"   "Urban"      
## [11] "US"
summary(Dataset2)
##      Sales          CompPrice       Income        Advertising    
##  Min.   : 0.000   Min.   : 77   Min.   : 21.00   Min.   : 0.000  
##  1st Qu.: 5.390   1st Qu.:115   1st Qu.: 42.75   1st Qu.: 0.000  
##  Median : 7.490   Median :125   Median : 69.00   Median : 5.000  
##  Mean   : 7.496   Mean   :125   Mean   : 68.66   Mean   : 6.635  
##  3rd Qu.: 9.320   3rd Qu.:135   3rd Qu.: 91.00   3rd Qu.:12.000  
##  Max.   :16.270   Max.   :175   Max.   :120.00   Max.   :29.000  
##    Population        Price        ShelveLoc        Age          Education   
##  Min.   : 10.0   Min.   : 24.0   Bad   : 96   Min.   :25.00   Min.   :10.0  
##  1st Qu.:139.0   1st Qu.:100.0   Good  : 85   1st Qu.:39.75   1st Qu.:12.0  
##  Median :272.0   Median :117.0   Medium:219   Median :54.50   Median :14.0  
##  Mean   :264.8   Mean   :115.8                Mean   :53.32   Mean   :13.9  
##  3rd Qu.:398.5   3rd Qu.:131.0                3rd Qu.:66.00   3rd Qu.:16.0  
##  Max.   :509.0   Max.   :191.0                Max.   :80.00   Max.   :18.0  
##  Urban       US     
##  No :118   No :142  
##  Yes:282   Yes:258  
##                     
##                     
##                     
## 
Dataset2a <- Dataset2 %>% filter(Age <= 70)

str(Dataset2a)
## 'data.frame':    323 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : num  138 111 113 117 141 136 121 117 122 115 ...
##  $ Income     : num  73 48 35 100 64 81 78 94 35 28 ...
##  $ Advertising: num  11 16 10 4 3 15 9 4 2 11 ...
##  $ Population : num  276 260 269 466 340 425 150 503 393 29 ...
##  $ Price      : num  120 83 80 97 128 120 100 94 136 86 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 2 1 2 3 2 ...
##  $ Age        : num  42 65 59 55 38 67 26 50 62 53 ...
##  $ Education  : num  17 10 12 14 13 10 10 13 18 18 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 1 2 2 2 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 2 2 1 2 ...
summary(Dataset2a)
##      Sales          CompPrice         Income        Advertising    
##  Min.   : 0.160   Min.   : 77.0   Min.   : 21.00   Min.   : 0.000  
##  1st Qu.: 5.720   1st Qu.:115.0   1st Qu.: 44.50   1st Qu.: 0.000  
##  Median : 7.680   Median :125.0   Median : 69.00   Median : 5.000  
##  Mean   : 7.792   Mean   :125.4   Mean   : 69.02   Mean   : 6.613  
##  3rd Qu.: 9.495   3rd Qu.:135.5   3rd Qu.: 90.50   3rd Qu.:11.000  
##  Max.   :16.270   Max.   :175.0   Max.   :120.00   Max.   :29.000  
##    Population        Price      ShelveLoc        Age          Education    
##  Min.   : 10.0   Min.   : 24   Bad   : 75   Min.   :25.00   Min.   :10.00  
##  1st Qu.:135.5   1st Qu.:100   Good  : 70   1st Qu.:36.00   1st Qu.:12.00  
##  Median :269.0   Median :119   Medium:178   Median :49.00   Median :14.00  
##  Mean   :263.4   Mean   :116                Mean   :47.96   Mean   :13.85  
##  3rd Qu.:397.0   3rd Qu.:131                3rd Qu.:60.00   3rd Qu.:16.00  
##  Max.   :509.0   Max.   :191                Max.   :70.00   Max.   :18.00  
##  Urban       US     
##  No : 93   No :112  
##  Yes:230   Yes:211  
##                     
##                     
##                     
## 

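We apply this filter because people generally have children between the ages of 30 and 50. Age in this dataset is the average age of the local population, so locations with an average age above 70 are not really relevant to this particular problem.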
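The filter above relies on `%>%` and `filter()` from the dplyr package. For readers without dplyr, a base R equivalent is sketched below; `Dataset2a_base` is a name assumed here, and the ISLR package must be installed.

```r
library(ISLR)  # provides the Carseats data set

# Base R equivalent of Dataset2 %>% filter(Age <= 70)
Dataset2a_base <- subset(Carseats, Age <= 70)

# Should match the 323 observations reported by str() above
print(nrow(Dataset2a_base))
```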

c. Build a model. Does the education level of an area affect car seat sales?

boxplot(split(Dataset2a$Sales, Dataset2a$Education))

d. Interpretation

As we can see, sales are fairly even across education levels, but the lowest education level appears to have the highest sales. One possible explanation is that lower-income households tend to have more children at a younger age.
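The visual impression from the boxplot can be checked against numbers. The sketch below computes the median sales for each education level and the overall correlation between the two variables. It is an illustration rather than the original analysis: the names `d`, `sales_by_edu` and `edu_cor` are assumptions, and the data is filtered with base R so that the block runs on its own.

```r
library(ISLR)  # provides the Carseats data set

# Same age filter as in the exploration step
d <- subset(Carseats, Age <= 70)

# Median sales for each education level
sales_by_edu <- tapply(d$Sales, d$Education, median)
print(round(sales_by_edu, 2))

# Overall linear association between education and sales
edu_cor <- cor(d$Sales, d$Education)
print(edu_cor)
```

A correlation close to zero would be consistent with the reading that education is not a strong driver of sales.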


e. Results

Education does not appear to be the main factor driving car seat sales. However, we can expect somewhat higher sales if we target a customer base with a lower education level. It works in our favour that income and education levels tend to cluster geographically because of housing affordability. Therefore, the best place to sell car seats would be a lower-income neighbourhood.

boxplot(split(Dataset2a$Sales, Dataset2a$Advertising))

This boxplot suggests that advertising levels influence sales. However, there is probably a more informative variable to look at.

boxplot(split(Dataset2a$Sales, Dataset2a$ShelveLoc))

That’s much clearer. In addition to targeting lower-education areas, placing the car seats in a good shelf location will have a major impact on sales.
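The boxplots compare one variable at a time. One way to compare several factors together is a linear regression, sketched below. This is not the method used in the answer above: the choice of predictors (including `Price`) and the names `d` and `fit` are assumptions made for illustration.

```r
library(ISLR)  # provides the Carseats data set

# Same age filter as in the exploration step
d <- subset(Carseats, Age <= 70)

# Linear model of sales on the factors discussed above, plus price
fit <- lm(Sales ~ Education + Advertising + ShelveLoc + Price, data = d)

# Coefficient table: estimate, standard error, t value and p-value
print(round(summary(fit)$coefficients, 3))
```

The `ShelveLocGood` and `ShelveLocMedium` rows give the estimated difference in sales relative to a bad shelf location, holding the other predictors fixed.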


What type of question are we answering? This is an exploratory analysis.