DSCI 100 Assignment 3

Complete the questions below. Compile your answers using Markdown and upload to your RPubs page. Submit a link to your RPubs page (through MOODLE) for grading.

Question 1:

This question deals with the airquality dataset. Read in the data and answer the following questions:

DataSet <- airquality

How many rows and columns does the data set have?

Number of rows

nrow(DataSet)

## [1] 153

Number of columns

ncol(DataSet)

## [1] 6

What are the column labels of the data?

Column labels:

colnames(DataSet)

## [1] "Ozone"   "Solar.R" "Wind"    "Temp"    "Month"   "Day"

Determine how many values are missing in each column.

Number of missing values per column (Ozone and Solar.R shown):

sum(is.na(DataSet[,"Ozone"]))

## [1] 37

sum(is.na(DataSet[,"Solar.R"]))

## [1] 7
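The per-column checks above can also be done for every column in one call. This is a minimal sketch, assuming the built-in `airquality` data; `na_counts` is an illustrative name that does not appear in the original answer.

```r
# Count missing values in every column at once.
# is.na() returns a logical matrix; colSums() adds up the TRUEs per column.
na_counts <- colSums(is.na(airquality))
na_counts

# Columns that contain at least one missing value
names(na_counts[na_counts > 0])
```

This reports the columns in order from Ozone to Day, so the remaining columns (Wind, Temp, Month, Day) can be checked without writing one line per column.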

Determine what type of variable each column has. Change the Day and Month variables to factors.

str(DataSet)

## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...

Wind is numeric; the remaining variables are integer.

DataSet$Day <- as.factor(DataSet$Day)
DataSet$Month <- as.factor(DataSet$Month)
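To confirm that the conversion took effect, the class of each column can be listed afterwards. This is a small sketch that works on a copy named `aq` (an illustrative name, not from the original) so it runs on its own.

```r
# Work on a copy of the built-in data so the example is self-contained
aq <- airquality
aq$Day <- as.factor(aq$Day)
aq$Month <- as.factor(aq$Month)

# One class per column: Day and Month should now be "factor"
sapply(aq, class)

# Number of distinct months and days in the data
nlevels(aq$Month)
nlevels(aq$Day)
```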

Create a boxplot of ozone count split by month.

boxplot(split(DataSet$Ozone, DataSet$Month), main = "Ozone count by month")

Create a boxplot for each of the other variables. Split again by month.

boxplot(split(DataSet$Solar.R, DataSet$Month), main = "Solar.R count by month")

boxplot(split(DataSet$Wind, DataSet$Month), main = "Wind count by month")

boxplot(split(DataSet$Temp, DataSet$Month), main = "Temp count by month")
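The boxplots above can be complemented with a numeric summary of the same groups. This is a minimal sketch, assuming the built-in `airquality` data and using the median (the center line of each box); `vars` and `monthly_medians` are illustrative names that are not part of the original answer.

```r
# Variables to summarise by month
vars <- c("Ozone", "Solar.R", "Wind", "Temp")

# For each variable, compute the median within each month.
# na.rm = TRUE is needed because Ozone and Solar.R contain missing values.
monthly_medians <- sapply(airquality[vars], function(x) {
  tapply(x, airquality$Month, median, na.rm = TRUE)
})

# Rows are months (5 to 9), columns are variables
monthly_medians
```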

Determine if there are any correlations between variables.

plot(DataSet[,c(1,2,3,4,5,6)])

library(ggpubr)  # provides ggscatter()
ggscatter(DataSet, x = "Ozone", y = "Solar.R", add = "reg.line", conf.int = TRUE, cor.coef = TRUE, cor.method = "pearson", xlab = "Ozone", ylab = "Solar.R")

cor.test(DataSet$Ozone, DataSet$Solar.R)

## 
##  Pearson's product-moment correlation
## 
## data:  DataSet$Ozone and DataSet$Solar.R
## t = 3.8798, df = 109, p-value = 0.0001793
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.173194 0.502132
## sample estimates:
##       cor 
## 0.3483417

There is a weak positive correlation between Solar.R and Ozone (r ≈ 0.35).

ggscatter(DataSet, x = "Ozone", y = "Wind", add = "reg.line", conf.int = TRUE, cor.coef = TRUE, cor.method = "pearson", xlab = "Ozone", ylab = "Wind")

cor.test(DataSet$Ozone, DataSet$Wind)

## 
##  Pearson's product-moment correlation
## 
## data:  DataSet$Ozone and DataSet$Wind
## t = -8.0401, df = 114, p-value = 9.272e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7063918 -0.4708713
## sample estimates:
##        cor 
## -0.6015465

There is a fairly strong negative correlation between Wind and Ozone (r ≈ -0.60).

ggscatter(DataSet, x = "Ozone", y = "Temp", add = "reg.line", conf.int = TRUE, cor.coef = TRUE, cor.method = "pearson", xlab = "Ozone", ylab = "Temp")

cor.test(DataSet$Ozone, DataSet$Temp)

## 
##  Pearson's product-moment correlation
## 
## data:  DataSet$Ozone and DataSet$Temp
## t = 10.418, df = 114, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5913340 0.7812111
## sample estimates:
##       cor 
## 0.6983603

There is a strong positive correlation between Temp and Ozone (r ≈ 0.70).
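The three pairwise tests can be summarised in a single correlation matrix. This is a sketch, assuming the built-in `airquality` data; `vars` and `cor_mat` are illustrative names. The option `use = "pairwise.complete.obs"` drops missing values pair by pair, which is also what `cor.test()` does for each pair, so the Ozone row should agree with the estimates reported above.

```r
# Continuous variables of interest
vars <- c("Ozone", "Solar.R", "Wind", "Temp")

# Pearson correlations, ignoring missing values separately for each pair
cor_mat <- cor(airquality[, vars], use = "pairwise.complete.obs")

# Rounded for readability
round(cor_mat, 2)
```

The matrix also shows how the predictors relate to each other (for example Wind and Temp), which the individual Ozone tests do not cover.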

If I want to limit my exposure to ozone, what should I do?

Based on the results above, you should favor places that are cooler and windier and limit your exposure to strong sun, since ozone rises with temperature and solar radiation and falls with wind.

Question 2:

This question deals with the Carseats dataset. I own a retail store in which I am going to bring in child’s car seat(s) to sell. If I want to be successful with this initiative, what should I do?

library(ISLR)
Dataset2 <- Carseats

Develop a question. Ensure that your question has the characteristics of a good question.

What are the important factors that drive sales of car seats?

Go through the epicycle of data analysis - including doing research.

State the question

Set Expectations:

What are the important factors that drive sales of car seats?

Collected Information:

Based on Consumer Reports and Parents magazine, customers should know:

Quality: Buy a new car seat rather than a used one, and check the seat's features and design: ease of cleaning, straps, harness, etc.
Type of car seat: There are several types, such as rear-facing, convertible, and booster, as well as seats for specific ages (infant and toddler). Change the car seat based on the child’s weight and height.
Price is not a good indicator of quality. Some mid-priced products are actually better.
Prices seem to range from 11 to 500.
Car seats are sold by retail stores both in store and on their websites (Walmart, Canadian Tire, etc.). Amazon also sells car seats.
Research car seats before buying, and see the seat in person before deciding to buy it online.

link

Revise Expectation:

People who research car seats may want to buy new ones. People who do this kind of research tend to be more educated, and they might prefer to buy a new car seat in a store.

New question: Does the education level of an area affect car seat sales?

Exploring the data

Set Expectations:

Missing values in the dataset:

sum(is.na(Dataset2))

## [1] 0

Variables in the dataset:

colnames(Dataset2)

##  [1] "Sales"       "CompPrice"   "Income"      "Advertising" "Population" 
##  [6] "Price"       "ShelveLoc"   "Age"         "Education"   "Urban"      
## [11] "US"

The dataset has no missing values and includes an Education variable, so it can help us answer the question.

Collect Information:

Some basic information about the data set

str(Dataset2)

## 'data.frame':    400 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...

summary(Dataset2)

##      Sales          CompPrice       Income        Advertising    
##  Min.   : 0.000   Min.   : 77   Min.   : 21.00   Min.   : 0.000  
##  1st Qu.: 5.390   1st Qu.:115   1st Qu.: 42.75   1st Qu.: 0.000  
##  Median : 7.490   Median :125   Median : 69.00   Median : 5.000  
##  Mean   : 7.496   Mean   :125   Mean   : 68.66   Mean   : 6.635  
##  3rd Qu.: 9.320   3rd Qu.:135   3rd Qu.: 91.00   3rd Qu.:12.000  
##  Max.   :16.270   Max.   :175   Max.   :120.00   Max.   :29.000  
##    Population        Price        ShelveLoc        Age          Education   
##  Min.   : 10.0   Min.   : 24.0   Bad   : 96   Min.   :25.00   Min.   :10.0  
##  1st Qu.:139.0   1st Qu.:100.0   Good  : 85   1st Qu.:39.75   1st Qu.:12.0  
##  Median :272.0   Median :117.0   Medium:219   Median :54.50   Median :14.0  
##  Mean   :264.8   Mean   :115.8                Mean   :53.32   Mean   :13.9  
##  3rd Qu.:398.5   3rd Qu.:131.0                3rd Qu.:66.00   3rd Qu.:16.0  
##  Max.   :509.0   Max.   :191.0                Max.   :80.00   Max.   :18.0  
##  Urban       US     
##  No :118   No :142  
##  Yes:282   Yes:258  
##                     
##                     
##                     
##

We may want to subset the dataset: the maximum of the Age variable (the average age of the local population) is 80, while people usually have young children between about 30 and 50.

hist(Dataset2$Age,
     main = "Histogram for Age",
     xlab = "Age",
     border = "blue",
     col = "green",
     xlim = c(25,80))
abline(v = mean(Dataset2$Age), col = "red", lwd = 2)
abline(v = median(Dataset2$Age), col = "black", lwd = 2)

We subset the data to keep only locations where Age is at most 70.

library(dplyr)  # provides %>% and filter()
Dataset2a <- Dataset2 %>% filter(Age <= 70)
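The same subsetting can be written in base R if dplyr is not installed. The sketch below uses a tiny data frame named `toy` with made-up values (an assumption for illustration only, not rows from Carseats) so that it runs without any packages; with the real data, `toy` would be replaced by `Dataset2`.

```r
# Hypothetical stand-in for Dataset2 (made-up values, not from Carseats)
toy <- data.frame(
  Sales = c(9.5, 7.4, 4.2, 10.8),
  Age   = c(42, 78, 65, 71)
)

# Base R equivalent of: toy %>% filter(Age <= 70)
toy_sub <- toy[toy$Age <= 70, ]

# subset() is another base option with the same result here
toy_sub2 <- subset(toy, Age <= 70)

# Number of rows kept after filtering
nrow(toy_sub)
```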

dim(Dataset2a)

## [1] 323  11

summary(Dataset2a)

##      Sales          CompPrice         Income        Advertising    
##  Min.   : 0.160   Min.   : 77.0   Min.   : 21.00   Min.   : 0.000  
##  1st Qu.: 5.720   1st Qu.:115.0   1st Qu.: 44.50   1st Qu.: 0.000  
##  Median : 7.680   Median :125.0   Median : 69.00   Median : 5.000  
##  Mean   : 7.792   Mean   :125.4   Mean   : 69.02   Mean   : 6.613  
##  3rd Qu.: 9.495   3rd Qu.:135.5   3rd Qu.: 90.50   3rd Qu.:11.000  
##  Max.   :16.270   Max.   :175.0   Max.   :120.00   Max.   :29.000  
##    Population        Price      ShelveLoc        Age          Education    
##  Min.   : 10.0   Min.   : 24   Bad   : 75   Min.   :25.00   Min.   :10.00  
##  1st Qu.:135.5   1st Qu.:100   Good  : 70   1st Qu.:36.00   1st Qu.:12.00  
##  Median :269.0   Median :119   Medium:178   Median :49.00   Median :14.00  
##  Mean   :263.4   Mean   :116                Mean   :47.96   Mean   :13.85  
##  3rd Qu.:397.0   3rd Qu.:131                3rd Qu.:60.00   3rd Qu.:16.00  
##  Max.   :509.0   Max.   :191                Max.   :70.00   Max.   :18.00  
##  Urban       US     
##  No : 93   No :112  
##  Yes:230   Yes:211  
##                     
##                     
##                     
##

Revise Expectation:

New question: In areas whose average age is typical of parents with young children, does education level affect car seat sales?

Build a formal model

Use a boxplot to see whether education level affects car seat sales.

boxplot(split(Dataset2a$Sales, Dataset2a$Education), main = "Sales of each education level")

Interpret the Results

Based on the graph, unit sales are roughly even across the different education levels.

The lowest education level even seems to have the highest sales.
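The visual comparison can be backed by per-level medians and a rank correlation. The sketch below uses a small data frame named `toy` with made-up values (an assumption for illustration only, not rows from Carseats) so that it runs without the ISLR package; with the real data, `toy` would be replaced by `Dataset2a`. The names `med_by_edu` and `rho` are also illustrative.

```r
# Hypothetical stand-in for Dataset2a (made-up values, not from Carseats)
toy <- data.frame(
  Sales     = c(9.5, 11.2, 7.4, 4.2, 10.1, 6.6),
  Education = c(10, 10, 14, 14, 18, 18)
)

# Median sales within each education level (the center line of each box)
med_by_edu <- tapply(toy$Sales, toy$Education, median)
med_by_edu

# Spearman rank correlation between education level and sales;
# a value near 0 would be consistent with "roughly even" sales
rho <- cor(toy$Sales, toy$Education, method = "spearman")
rho
```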

Communicate the Results

Education level is not the main factor driving car seat sales.

We may also need to consider that people with a higher education level tend to have fewer children than those with a lower education level.

It might be worthwhile to target areas with average to lower education levels, since we might sell more car seats there.

Do an exploratory data analysis.

Boxplot about ShelveLoc

boxplot(split(Dataset2$Sales, Dataset2$ShelveLoc), main = "Sales based on shelving location")

Based on the graph, shelving location needs to be considered when selling car seats.

Good shelf locations have higher typical (median) sales.

If you want better car seat sales, you should put the products in a good shelving location.

What is the type of question that we are answering?

Exploratory


Minh Le
