Lab Assignment 1

First, load the packages.
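The original does not show which packages were loaded. The functions used later in this lab (`gf_histogram`, `favstats`, `rsquared`, `makeFun`, and the formula interface to `cor`) are all available once the mosaic package is attached, so a minimal sketch, assuming mosaic is installed, might be:

```r
# Assumption: the lab relies on the mosaic package; the original does not name it.
# Attaching mosaic also attaches ggformula (gf_histogram) and lattice (xyplot).
library(mosaic)
```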

Next, import the data set Cereal_Data.xlsx from Canvas and display the first 6 rows of the data set.

load("~/Cereal_Data_1_.RData")

 head(Cereal_Data_1_)

## # A tibble: 6 x 15
##   Shelf Name  Manufacturer Type  Calories Protein   Fat Sodium Fiber
##   <chr> <chr> <chr>        <chr>    <dbl>   <dbl> <dbl>  <dbl> <dbl>
## 1 Top   100%… N            C           70       4     1    130  10  
## 2 Top   100%… Q            C          120       3     5     15   2  
## 3 Top   All-… K            C           70       4     1    260   9  
## 4 Top   All-… K            C           50       4     0    140  14  
## 5 Top   Almo… R            C          110       2     2    200   1  
## 6 Bott… Appl… G            C          110       2     2    180   1.5
## # … with 6 more variables: Carbohydrates <dbl>, Sugars <dbl>,
## #   Potassium <dbl>, Vitamins <dbl>, `Weight (of One Serving Cup)` <dbl>,
## #   `Cups in Serving` <dbl>

2. Consider the variables in the data set. Identify the variables that are qualitative and those that are quantitative.

str(Cereal_Data_1_)

## Classes 'tbl_df', 'tbl' and 'data.frame':    77 obs. of  15 variables:
##  $ Shelf                      : chr  "Top" "Top" "Top" "Top" ...
##  $ Name                       : chr  "100%_Bran" "100%_Natural_Bran" "All-Bran" "All-Bran_with_Extra_Fiber" ...
##  $ Manufacturer               : chr  "N" "Q" "K" "K" ...
##  $ Type                       : chr  "C" "C" "C" "C" ...
##  $ Calories                   : num  70 120 70 50 110 110 110 130 90 90 ...
##  $ Protein                    : num  4 3 4 4 2 2 2 3 2 3 ...
##  $ Fat                        : num  1 5 1 0 2 2 0 2 1 0 ...
##  $ Sodium                     : num  130 15 260 140 200 180 125 210 200 210 ...
##  $ Fiber                      : num  10 2 9 14 1 1.5 1 2 4 5 ...
##  $ Carbohydrates              : num  5 8 7 8 14 10.5 11 18 15 13 ...
##  $ Sugars                     : num  6 8 5 0 8 10 14 8 6 5 ...
##  $ Potassium                  : num  280 135 320 330 NA 70 30 100 125 190 ...
##  $ Vitamins                   : num  25 0 25 25 25 25 25 25 25 25 ...
##  $ Weight (of One Serving Cup): num  1 1 1 1 1 1 1 1.33 1 1 ...
##  $ Cups in Serving            : num  0.33 1 0.33 0.5 0.75 0.75 1 0.75 0.67 0.67 ...
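The `chr` and `num` types in the `str()` output above map onto the qualitative/quantitative split asked for in this question. As a rough sketch, the same split can be computed from the column types. The small data frame below is only an illustration built from the first six rows printed earlier, and the names `cereal_head`, `quantitative`, and `qualitative` are hypothetical:

```r
# First six rows of four columns, copied from the head() output above.
# "Bottom" is assumed for the value printed as "Bott…".
cereal_head <- data.frame(
  Shelf        = c("Top", "Top", "Top", "Top", "Top", "Bottom"),
  Manufacturer = c("N", "Q", "K", "K", "R", "G"),
  Calories     = c(70, 120, 70, 50, 110, 110),
  Fiber        = c(10, 2, 9, 14, 1, 1.5),
  stringsAsFactors = FALSE
)

# TRUE for numeric columns, FALSE for character columns.
is_quantitative <- vapply(cereal_head, is.numeric, logical(1))

quantitative <- names(cereal_head)[is_quantitative]
qualitative  <- names(cereal_head)[!is_quantitative]
```

Column type is only a first guess: a numeric column that stores category codes would still be qualitative, so the result should be checked against what each variable means.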

The qualitative variables are Shelf, Name, Manufacturer, and Type. The quantitative variables are Calories, Protein, Fat, Sodium, Fiber, Carbohydrates, Sugars, Potassium, Vitamins, Weight (of One Serving Cup), and Cups in Serving.

#### 3. Consider the variable Shelf. This variable is the shelf position of the cereal (bottom, middle, top) starting from the floor up. To see whether the shelf position is associated with one measure of nutritive value, the amount of sugar, look at the data for the variable Sugars. Compare the sugar content of cereals on each shelf by making a separate histogram for the sugar content of the cereals on each shelf: a total of three histograms. Use the sugar content values as they are - do not factor in the serving size. (The data for one of the cereals, Quaker Oatmeal, is missing. Just continue with what is available. That’s the way it is in real life - values are missing, files are incomplete, etc.)

topshelf <- subset(Cereal_Data_1_, Shelf == "Top")

middleshelf <- subset(Cereal_Data_1_, Shelf == "Middle")

bottomshelf <- subset(Cereal_Data_1_, Shelf == "Bottom")

gf_histogram(~Sugars, title = "A Histogram for the Sugar Quantity on the Top Shelf", ylab = "Cereals", data=topshelf, binwidth = 2, breaks=seq(0,16, by =2), color="blue", fill="green")

gf_histogram(~Sugars, title = "A Histogram for the Sugar Quantity on the Middle Shelf", ylab = "Cereals",data=middleshelf, binwidth = 2, breaks=seq(0,16, by =2), color="pink", fill="blue")

gf_histogram(~Sugars, title = "A Histogram for the Sugar Quantity on the Bottom Shelf", ylab = "Cereals", data=bottomshelf, binwidth = 2, breaks=seq(0,16, by =2), color="blue", fill="purple")

## Warning: Removed 1 rows containing non-finite values (stat_bin).
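The warning says that one row with a missing Sugars value was dropped before binning. As a rough sketch of what the histograms count, the bin totals per shelf can be tabulated in base R with the same breaks. The data frame below holds only the first six rows shown earlier, so its counts are not those of the full 77-row data set, and the names `cereal_head`, `sugar_breaks`, and `bin_counts` are hypothetical:

```r
# First six rows, copied from the head() and str() output above.
# "Bottom" is assumed for the value printed as "Bott…".
cereal_head <- data.frame(
  Shelf  = c("Top", "Top", "Top", "Top", "Top", "Bottom"),
  Sugars = c(6, 8, 5, 0, 8, 10),
  stringsAsFactors = FALSE
)

# Same bin edges as in the gf_histogram() calls.
sugar_breaks <- seq(0, 16, by = 2)

# Count cereals per bin for each shelf, dropping missing values explicitly.
# hist() uses right-closed bins by default, e.g. (4, 6].
bin_counts <- lapply(
  split(cereal_head$Sugars, cereal_head$Shelf),
  function(x) hist(x[!is.na(x)], breaks = sugar_breaks, plot = FALSE)$counts
)
```

Here `bin_counts$Top` and `bin_counts$Bottom` each hold eight counts, one per 2-gram bin.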

4. Briefly describe the distribution in each histogram with respect to shape. Based on your histograms, which shelf position has cereals with the most sugar?

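The top shelf distribution is roughly symmetric, which means the sugar content is spread fairly evenly on both sides of the center. The middle shelf distribution is skewed left, so the majority of the cereals on that shelf have a high sugar content. The bottom shelf distribution is skewed right, so the majority of cereals on this shelf are not as sugary. Based on the histograms, the middle shelf has the cereals with the most sugar.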

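5. (Bonus Question) Consider your histograms for sugar content. Is the shelf position of a cereal related to its nutritive value as measured by sugar content? Can you provide a possible explanation for this relation?

The shelves are not in order from most to least sugar. The top shelf has a more varied range of sugar content than the other shelves: its distribution is more symmetric, so its cereals are neither all sugary nor all low in sugar. The middle shelf has mostly sugary cereals, and the bottom shelf has the least sugary ones. So I believe shelf position is somewhat related to sugar content, but there has to be a confounding variable as well. The middle shelf may have more sugar because it is at eye level for kids, so they would be drawn to it more.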

**6. Find the five-number-summary, mean, and standard deviation of the variable “Fiber”.**

favstats(Cereal_Data_1_$Fiber)

##  min Q1 median Q3 max     mean       sd  n missing
##    0  1      2  3  14 2.151948 2.383364 77       0
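The quantities reported by `favstats()` can also be computed with base R functions. The sketch below uses only the first ten Fiber values visible in the `str()` output, so its results differ from the 77-row summary above. The names `fiber`, `five_number`, `fiber_mean`, and `fiber_sd` are hypothetical:

```r
# First ten Fiber values, copied from the str() output above.
fiber <- c(10, 2, 9, 14, 1, 1.5, 1, 2, 4, 5)

# Five-number summary: min, Q1, median, Q3, max (default quantile method).
five_number <- quantile(fiber, probs = c(0, 0.25, 0.5, 0.75, 1))

fiber_mean <- mean(fiber)
fiber_sd   <- sd(fiber)  # sample standard deviation (divides by n - 1)
```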

#### **8. Are Calories and Carbohydrates related in cereals? Let’s find out by studying the linear regression between the two variables.**

(a) Draw a scatterplot of the two variables, with Calories on the y axis and Carbohydrates on the x axis. Does there appear to be a linear association? If so, is it positive or negative? Strong or weak?

xyplot(Calories ~ Carbohydrates , data = Cereal_Data_1_, xlab="Carbohydrates", ylab = "Calories" , main = "The Relationship between Calories and Carbohydrates")

The linear association is positive but weak.

(b) What is the correlation coefficient, r? In one of the cases the Carbohydrates entry is NA, so if you use the “cor” function, it will return “NA”. In that case, use na.omit(Cereal_Data) instead of Cereal_Data when naming the dataset.

xyplot(Carbohydrates ~ Calories, xlab="Calories", ylab="Carbohydrates", type=c("p", "r"), data=Cereal_Data_1_)

cor(Carbohydrates~Calories, data =na.omit(Cereal_Data_1_))

## [1] 0.270606
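One thing to keep in mind: `na.omit()` applied to the whole data frame drops every row that has an NA in any column, not only in Calories or Carbohydrates. That may be why the square of this r (0.270606² ≈ 0.0732) does not match the R-squared of about 0.0664 reported for the regression model in part (c), which only drops the row with the missing Carbohydrates value. A minimal sketch of the difference, using the first ten rows from the `str()` output (the names `cereal_head`, `rows_after_na_omit`, `r_all_columns`, and `r_two_columns` are hypothetical):

```r
# First ten rows of three columns, copied from the str() output above.
cereal_head <- data.frame(
  Calories      = c(70, 120, 70, 50, 110, 110, 110, 130, 90, 90),
  Carbohydrates = c(5, 8, 7, 8, 14, 10.5, 11, 18, 15, 13),
  Potassium     = c(280, 135, 320, 330, NA, 70, 30, 100, 125, 190)
)

# na.omit() on the whole data frame drops row 5 because Potassium is NA there.
rows_after_na_omit <- nrow(na.omit(cereal_head))

# Correlation after dropping rows with an NA in any column.
r_all_columns <- with(na.omit(cereal_head), cor(Calories, Carbohydrates))

# Correlation using only rows complete in the two variables of interest;
# here that keeps all ten rows.
r_two_columns <- cor(cereal_head$Calories, cereal_head$Carbohydrates,
                     use = "complete.obs")
```

With only ten rows these values are not comparable to the 0.270606 above; the point is only that the two approaches can use different sets of rows.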

(c) Create a regression model, call it “cal_carb_model”, and obtain a summary of this model.

cal_carb_model <-lm(Calories ~ Carbohydrates, data = Cereal_Data_1_)
summary(cal_carb_model)

## 
## Call:
## lm(formula = Calories ~ Carbohydrates, data = Cereal_Data_1_)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.644  -8.844   0.187   9.555  50.187 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    87.8459     8.6211  10.190 9.78e-16 ***
## Carbohydrates   1.2922     0.5634   2.294   0.0246 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.06 on 74 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.06638,    Adjusted R-squared:  0.05376 
## F-statistic: 5.261 on 1 and 74 DF,  p-value: 0.02465

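(d) What is the least-squares regression equation?

Interpret the slope in context here.
Interpret the intercept in context here. Does the intercept make practical sense here?

Slope: 1.2922. Intercept: 87.8459. The least-squares regression equation is Calories = 87.8459 + 1.2922 × Carbohydrates. The slope means that each additional gram of carbohydrates is associated with about 1.29 more calories, on average. The intercept is the predicted calorie count, about 87.8, for a cereal with 0 grams of carbohydrates. It does not make much practical sense here, because a cereal with no carbohydrates at all is unrealistic.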

(e) What percent of the change in Calories is explained by the change in Carbohydrates and the regression equation? Is this regression equation effective in modeling the relationship between the two variables?

rsquared(cal_carb_model)

## [1] 0.06637725

About 6.64% (R² = 0.06637725). The regression equation is not effective in modeling the relationship between these two variables, because this percentage is not close to 100%: only about 6.64% of the variation in Calories is explained by the linear relationship with Carbohydrates.
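To make the meaning of R-squared concrete, the sketch below computes it by hand as one minus the ratio of unexplained to total variation, and checks it against the squared correlation. It uses only the first ten rows from the `str()` output, so the value is not the 0.0664 of the full model. The names `calories`, `carbohydrates`, `fit`, `sse`, and `sst` are hypothetical:

```r
# First ten values of each variable, copied from the str() output above.
calories      <- c(70, 120, 70, 50, 110, 110, 110, 130, 90, 90)
carbohydrates <- c(5, 8, 7, 8, 14, 10.5, 11, 18, 15, 13)

fit <- lm(calories ~ carbohydrates)

sse <- sum(residuals(fit)^2)               # variation left unexplained by the line
sst <- sum((calories - mean(calories))^2)  # total variation in calories
r_squared_manual <- 1 - sse / sst

# For simple linear regression, R-squared equals the squared correlation.
r_squared_from_cor <- cor(calories, carbohydrates)^2
```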

(f) Use the regression equation to predict the calories in cereals with 18 grams of carbohydrates.

regression <-makeFun(cal_carb_model)
regression(18)

##        1 
## 111.1053
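As a quick check, the same prediction can be reproduced by hand from the coefficients printed in the model summary. This is only a sketch: the coefficients are the rounded values from the output above, and `predict_calories` is a hypothetical helper name:

```r
# Coefficients copied (rounded) from the summary(cal_carb_model) output above.
intercept <- 87.8459
slope     <- 1.2922

# Predicted calories for a given number of grams of carbohydrates.
predict_calories <- function(carbohydrates) {
  intercept + slope * carbohydrates
}

predicted_18 <- predict_calories(18)
```

The result agrees with the `makeFun()` prediction to about three decimal places; the small remaining difference comes from rounding the coefficients.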

Olivia Zacok

October 17, 2019
