Echo is set to FALSE in each code chunk per the instructions (NO R CODE). The pipe operator %>% is used throughout this homework assignment; it passes an object through a chain of function calls without re-entering the object in each one.
Exercise 1.1 Calculate the median profit for the companies in the US and the median profit for the companies in the UK, France, and Germany
Answer: To calculate the median profit, we will use dplyr to group the data by the country variable and compute the median profit, setting the argument na.rm to TRUE to drop NA values from the calculation. Once the aggregation is complete, we will use dplyr's filter function to select the four countries of interest.
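A minimal sketch of the pipeline described above; the package holding Forbes2000 is assumed to be HSAUR2 (the data also ship with HSAUR/HSAUR3).

library(dplyr)
data("Forbes2000", package = "HSAUR2")

Forbes2000 %>%
  group_by(country) %>%
  summarize(Median_Profit = median(profits, na.rm = TRUE)) %>%
  filter(country %in% c("France", "Germany", "United Kingdom", "United States"))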
## # A tibble: 4 x 2
## country Median_Profit
## <fctr> <dbl>
## 1 France 0.190
## 2 Germany 0.230
## 3 United Kingdom 0.205
## 4 United States 0.240
Exercise 1.2 Find all German companies with negative profit.
Answer: To answer this question, we must subset the Forbes2000 data set to the rows where country equals Germany and profits are less than 0. To keep the final data frame clear, we will select only the columns name, country, and profits.
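A sketch of this step; subset() is one way to express the row and column selection described above (Forbes2000 as loaded in the first sketch).

data("Forbes2000", package = "HSAUR2")

subset(Forbes2000, country == "Germany" & profits < 0,
       select = c(name, country, profits))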
## name country profits
## 350 Allianz Worldwide Germany -1.23
## 364 Deutsche Telekom Germany -25.83
## 397 E.ON Germany -0.73
## 431 HVB-HypoVereinsbank Germany -0.87
## 500 Commerzbank Germany -0.31
## 798 Infineon Technologies Germany -0.51
## 869 BHW Holding Germany -0.38
## 926 Bankgesellschaft Berlin Germany -0.74
## 1034 W&W-Wustenrot Germany -0.08
## 1187 mg technologies Germany -0.13
## 1477 Nurnberger Beteiligungs Germany -0.03
## 1887 SPAR Handels Germany -0.40
## 1994 Mobilcom Germany -3.62
Exercise 1.3 To which business category do most of the Bermuda island companies belong?
Answer: To answer this question, we will use the dplyr functions filter, group_by, summarize, arrange, and slice. First, we will filter the Forbes2000 data set to only the Bermuda companies. With the filter in place, we will use group_by and summarize to count the number of observations for each country and category combination. Once the aggregation is complete, we will arrange the data set so that the category with the highest count is in the first row, and then slice the data frame to keep only that first row. This provides the category and the count of companies within it.
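A sketch of the dplyr chain just described.

library(dplyr)
data("Forbes2000", package = "HSAUR2")

Forbes2000 %>%
  filter(country == "Bermuda") %>%
  group_by(country, category) %>%
  summarize(Count = n()) %>%
  arrange(desc(Count)) %>%
  slice(1)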
## # A tibble: 1 x 3
## # Groups: country [1]
## country category Count
## <fctr> <fctr> <int>
## 1 Bermuda Insurance 10
Exercise 1.4 For the 50 companies in the Forbes data set with the highest profits, plot sales against assets (or some suitable transformation of each variable), labeling each point with the appropriate country name which may need to be abbreviated (using abbreviate) to avoid making the plot look too ‘messy’.
Answer: First, we will arrange the Forbes2000 data from highest to lowest profit using dplyr's arrange function and slice the first 50 observations to obtain the 50 companies with the highest profits. Using ggplot2, we will draw a scatterplot of sales versus assets, mapping the colour aesthetic to the country variable. Using geom_text, we will label the individual points with abbreviated country names. Because the labels overlap in the full plot, a second plot zooms in on the region between x coordinates 15 and 70.
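A sketch of the plot; the aesthetic choices (assets on the x axis, sales on the y axis, label size and offset) are assumptions based on the description above.

library(dplyr)
library(ggplot2)
data("Forbes2000", package = "HSAUR2")

top50 <- Forbes2000 %>%
  arrange(desc(profits)) %>%
  slice(1:50)

ggplot(top50, aes(x = assets, y = sales, colour = country)) +
  geom_point() +
  geom_text(aes(label = abbreviate(country)), size = 3, vjust = -0.5) +
  coord_cartesian(xlim = c(15, 70))  # the zoomed-in second plot described above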
Exercise 1.5 Find the average value of sales for the companies in each country in the Forbes data set, and find the number of companies in each country with profits above 5 billion US dollars.
Answer: We will compute the average sales per country and count the companies with profits above 5 billion US dollars using dplyr's group_by/summarize approach.
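A sketch of the aggregation, assuming (as documented for Forbes2000) that profits are recorded in billions of US dollars.

library(dplyr)
data("Forbes2000", package = "HSAUR2")

Forbes2000 %>%
  group_by(country) %>%
  summarize(Average_Sales      = mean(sales, na.rm = TRUE),
            Profits_Above_5bil = sum(profits > 5, na.rm = TRUE))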
## # A tibble: 61 x 3
## country Average_Sales Profits_Above_5bil
## <fctr> <dbl> <int>
## 1 Africa 6.820000 0
## 2 Australia 5.244595 0
## 3 Australia/ United Kingdom 11.595000 0
## 4 Austria 4.142500 0
## 5 Bahamas 1.350000 0
## 6 Belgium 10.114444 0
## 7 Bermuda 6.840500 0
## 8 Brazil 6.338667 0
## 9 Canada 6.429643 0
## 10 Cayman Islands 1.660000 0
## # ... with 51 more rows
Please attempt to use the R Graphics Cookbook to complete the following problems with ggplot2: 2.1 (Handbook) and 2.3 (Handbook). Please use the methods from Chapter 6 of the Cookbook.
Exercise 2.1 The data in Table 2.3 are part of a data set collected from a survey of household expenditure and give the expenditure of 20 single men and 20 single women on four commodity groups. The units of expenditure are Hong Kong dollars, and the four commodity groups are: housing (housing, including fuel and light), food (foodstuffs, including alcohol and tobacco), goods (other goods, including clothing, footwear, and durable goods), and service (services, including transport and vehicles).
The aim of the survey was to investigate how the division of household expenditure between the four commodity groups depends on total expenditure and to find out whether this relationship differs for men and women. Use appropriate graphical methods to answer these questions and state your conclusions.
Answer: We will first begin by examining the relationship of gender to expenditures. To do this, we will express each expenditure as a percentage of total expenditure. Using dplyr's mutate function, we will add these vectors to the household data frame. After the vectors have been added, we will select the new vectors, removing the old columns, which are unnecessary for the final plot (a bar graph faceted by gender). We will convert the data from wide to long format using tidyr's gather function, placing the categorical variables into Type and the percentage values into Percent. Because we computed the mean share of each expenditure relative to total expenditure, we will keep only the distinct values in the data frame. Using ggplot2, we will plot the bar graph and facet by the gender variable. The final plot shows no significant difference between genders in how total expenditure is divided among the expenditure types.
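A sketch of the first plot, assuming the household data come from the HSAUR2 package; summarize() is used here in place of the mutate()/distinct() combination described above and yields the same mean shares.

library(dplyr)
library(tidyr)
library(ggplot2)
data("household", package = "HSAUR2")

household_pct <- household %>%
  mutate(total = housing + food + goods + service) %>%
  group_by(gender) %>%
  summarize(housing = mean(housing / total),
            food    = mean(food / total),
            goods   = mean(goods / total),
            service = mean(service / total)) %>%
  gather(Type, Percent, housing:service)

ggplot(household_pct, aes(x = Type, y = Percent)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ gender)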
Part 2. In the second plot, we will examine the relationship of the individual expenditures to total expenditure with scatter plots. Using a mutate and gather approach similar to the one above, we will plot the relationship with ggplot2's geom_point geom. The final output shows a clear relationship between service expenditure and total expenditure: as total expenditure increases, so does spending on services. The relationship between total expenditure and food expenditure is harder to determine because the points do not follow a line. The other two variables, goods and housing, do increase with total expenditure, but not as cleanly as service expenditure.
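A sketch of the scatter plots; showing each commodity group in its own facet and colouring by gender are assumptions.

library(dplyr)
library(tidyr)
library(ggplot2)
data("household", package = "HSAUR2")

household %>%
  mutate(total = housing + food + goods + service) %>%
  gather(Type, Expenditure, housing, food, goods, service) %>%
  ggplot(aes(x = total, y = Expenditure, colour = gender)) +
  geom_point() +
  facet_wrap(~ Type, scales = "free_y")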
Part 3. In the last plot, we will examine the density curves of each expenditure by gender. It is evident from the density curves that females spend much less than males on food. Additionally, housing expenditure is more evenly distributed for females than for males.
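A sketch of the density curves, again faceted by commodity group.

library(tidyr)
library(ggplot2)
data("household", package = "HSAUR2")

household %>%
  gather(Type, Expenditure, housing, food, goods, service) %>%
  ggplot(aes(x = Expenditure, fill = gender)) +
  geom_density(alpha = 0.4) +
  facet_wrap(~ Type, scales = "free")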
Exercise 2.3 Mortality rates per 100,000 from male suicides for a number of age groups and a number of countries are given in Table 2.5. Construct side-by-side box plots for the data from different age groups, and comment on what the graphic tells us about the data.
Using a single R expression, calculate the median absolute deviation, 1.4826 * median(|x - µ|), where µ is the sample median. Dataset = chickwts. Use the R function mad() to verify your answer.
Answer: The answer is shown below; mad(chickwts$weight) equals the value from the manual calculation.
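A sketch of the single expression and its check against mad() (chickwts is a base R data set); the two lines correspond to the two output values below.

1.4826 * median(abs(chickwts$weight - median(chickwts$weight)))
1.4826 * median(abs(chickwts$weight - median(chickwts$weight))) == mad(chickwts$weight)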
## [1] 91.9212
## [1] TRUE
Using the state.x77 data matrix, obtain side-by-side boxplots of the per capita income variable for the nine different divisions defined by the variable state.division. Comment on the plot. Use the following code to access the dataset and read about the dataset. data(state) head(state.x77) ?state.x77
Answer: First, we will convert state.x77 to a data frame so that we can bind the state.division factor to it. We will then use the geom_boxplot function to produce the plot.
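A sketch of the plot; the angled x-axis labels are an assumption made for legibility.

library(ggplot2)
data(state)

state_df <- data.frame(state.x77, state.division)

ggplot(state_df, aes(x = state.division, y = Income)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))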
It is evident from the boxplot that the Pacific division has the highest per capita income with the smallest deviation and that the East South Central division has the lowest income with a low standard deviation. The South Atlantic has the highest standard deviation of the state divisions.
Using the state.x77 data matrix, find the state with the minimum per capita income in the New England region as defined by the factor state.division. Use the vector state.name to get the state name.
Answer: Again, we will bind the necessary columns to the data frame built from state.x77. Once the needed columns are bound, we will use dplyr's filter function to keep only observations with a state.division of 'New England'. Next, with dplyr's slice function, we will extract the observation with the minimum income. Finally, to clean the output, we will select the columns pertinent to the question.
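A sketch of the chain described above; which.min() inside slice() is one way to pick the minimum-income row.

library(dplyr)
data(state)

ne_income <- data.frame(State = state.name,
                        Income = state.x77[, "Income"],
                        state.division)

ne_income %>%
  filter(state.division == "New England") %>%
  slice(which.min(Income)) %>%
  select(State, Income, state.division)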
## # A tibble: 1 x 3
## State Income state.division
## <fctr> <dbl> <fctr>
## 1 Maine 3694 New England
Use subscripting operations on Cars93 to find the vehicles with highway mileage of less than 25 miles per gallon (variable MPG.highway) and weight (variable Weight) over 3500 lbs. Print the model name, the price range (low, high), highway mileage, and the weight of the cars that satisfy these conditions. data(Cars93, package = "MASS")
Answer: Using a standard subsetting procedure, we subset the rows with MPG.highway less than 25 and Weight greater than 3500, keeping the columns Model, Price, MPG.highway, and Weight. We then use dplyr's arrange function to sort the data frame by Price.
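A sketch of the subset and sort.

library(dplyr)
data(Cars93, package = "MASS")

heavy <- subset(Cars93, MPG.highway < 25 & Weight > 3500,
                select = c(Model, Price, MPG.highway, Weight))
arrange(heavy, Price)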
## Model Price MPG.highway Weight
## 1 Lumina_APV 16.3 23 3715
## 2 Astro 16.6 20 4025
## 3 Caravan 19.0 21 3705
## 4 MPV 19.1 24 3735
## 5 Quest 19.1 23 4100
## 6 Silhouette 19.5 23 3715
## 7 Eurovan 19.7 21 3960
## 8 Aerostar 19.9 20 3735
## 9 Previa 22.7 22 3785
## 10 Stealth 25.8 24 3805
## 11 Diamante 26.1 24 3730
## 12 ES300 28.0 24 3510
## 13 SC300 35.2 23 3515
## 14 Q45 47.9 22 4000
Form a matrix object named mycars from the variables Min.Price, Max.Price, MPG.city, MPG.highway, EngineSize, Length, and Weight of the data frame Cars93 from MASS. Use it to create a list object named cars.stats containing named components as follows: (a) a vector of means, named Cars.Means, (b) a vector of standard errors of the means, named Cars.Std.Errors, (c) a matrix with 2 rows containing lower and upper limits of 99% confidence intervals for the means, named Cars.CI.99.
Answer: First, we will create a function for the standard error of the mean, named std. After the function is created, we will build a matrix with the required variables, assigning it to mycars. Once we have the function and the matrix, we will construct the list. Cars.Means is simply the colMeans of the matrix. Cars.Std.Errors applies std over the columns of mycars. The Cars.CI.99 component is created by taking the column means plus/minus the standard error, arranged in a 2-row, 7-column matrix.
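A sketch of the construction; the limits below follow the description above (column means plus/minus one standard error), matching the printed Cars.CI.99.

data(Cars93, package = "MASS")

std <- function(x) sd(x) / sqrt(length(x))  # standard error of the mean

mycars <- as.matrix(Cars93[, c("Min.Price", "Max.Price", "MPG.city",
                               "MPG.highway", "EngineSize", "Length", "Weight")])

cars.stats <- list(
  Cars.Means      = colMeans(mycars),
  Cars.Std.Errors = apply(mycars, 2, std),
  Cars.CI.99      = matrix(c(colMeans(mycars) - apply(mycars, 2, std),
                             colMeans(mycars) + apply(mycars, 2, std)),
                           nrow = 2, byrow = TRUE)
)
cars.stats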
## $Cars.Means
## Min.Price Max.Price MPG.city MPG.highway EngineSize Length
## 17.125806 21.898925 22.365591 29.086022 2.667742 183.204301
## Weight
## 3072.903226
##
## $Cars.Std.Errors
## Min.Price Max.Price MPG.city MPG.highway EngineSize Length
## 0.9069210 1.1438051 0.5827473 0.5528742 0.1075695 1.5141964
## Weight
## 61.1694186
##
## $Cars.CI.99
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 16.21889 20.75512 21.78284 28.53315 2.560172 181.6901 3011.734
## [2,] 18.03273 23.04273 22.94834 29.63890 2.775311 184.7185 3134.073
Use the apply() function on the 3-dimensional array iris3 to compute: (a) sample means of the variables Sepal Length, Sepal Width, Petal Length, and Petal Width, for each of the three species Setosa, Versicolor, and Virginica; (b) sample means of the variables Sepal Length, Sepal Width, Petal Length, and Petal Width, for the entire data set
Answer: This is a straightforward calculation with the apply function. For part (a), we set MARGIN to the vector c(2, 3) so that the mean is computed for each variable within each species. For part (b), we specify MARGIN = 2 to compute the mean of each variable over the entire data set.
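A sketch of the two apply() calls (iris3 is a 50 x 4 x 3 array in base R).

apply(iris3, MARGIN = c(2, 3), FUN = mean)  # (a) mean of each variable within each species
apply(iris3, MARGIN = 2, FUN = mean)        # (b) mean of each variable over all 150 flowers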
Problem8A
## Setosa Versicolor Virginica
## Sepal L. 5.006 5.936 6.588
## Sepal W. 3.428 2.770 2.974
## Petal L. 1.462 4.260 5.552
## Petal W. 0.246 1.326 2.026
Problem8B
## Sepal L. Sepal W. Petal L. Petal W.
## 5.843333 3.057333 3.758000 1.199333
Use the state.x77 data matrix and the tapply() function to obtain (a) the mean per capita income of the states in each of the four regions defined by the factor state.region, (b) the maximum illiteracy rates for states in each of the nine divisions defined by the factor state.division, (c) the number of states in each region, (d) the median high school graduation rates for groups of states defined by combinations of the factors state.region and state.size.
Answer: Using the provided definition of state.size, bind the required vectors to state.x77. Once the vectors are bound to the data frame, for part (a) use tapply on Income grouped by state.region with the function mean. For part (b), use tapply on Illiteracy grouped by state.division with the function max. For part (c), use tapply on state.name grouped by state.region with the function length. For part (d), pass HS Grad as the first argument and list(state.region, state.size) as the second, with the function median.
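A sketch of the tapply() calls. The cut points used to define state.size here are an assumption for illustration; the exercise supplies its own definition.

data(state)

state.size <- cut(state.x77[, "Population"],
                  breaks = c(0, 2000, 10000, Inf),  # assumed cut points (thousands)
                  labels = c("Small", "Medium", "Large"))

tapply(state.x77[, "Income"], state.region, mean)                       # (a)
tapply(state.x77[, "Illiteracy"], state.division, max)                  # (b)
tapply(state.name, state.region, length)                                # (c)
tapply(state.x77[, "HS Grad"], list(state.region, state.size), median)  # (d)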
Problem9A
## Northeast South North Central West
## 4570.222 4011.938 4611.083 4702.615
Problem9B
## New England Middle Atlantic South Atlantic
## 1.3 1.4 2.3
## East South Central West South Central East North Central
## 2.4 2.8 0.9
## West North Central Mountain Pacific
## 0.8 2.2 1.9
Problem9C
## Northeast South North Central West
## 9 16 12 13
Problem9D
## Small Medium Large
## Northeast 55.9 56.00 51.45
## South 48.1 41.30 47.40
## North Central 53.3 54.50 52.90
## West 62.4 61.75 62.60
Produce a scatter plot matrix of the variables mpg, disp, hp, drat, and qsec in the data frame mtcars. Use different colors to identify cars belonging to each of the categories defined by the carsize variable. data(mtcars) head(mtcars) carsize <- cut(mtcars[, "wt"], breaks = c(0, 2.5, 3.5, 5.5), labels = c("Compact", "Midsize", "Large"))
Answer: Use the cut() call above to compute carsize and bind the vector to the data frame mtcars. Use dplyr to select the variables needed for the scatterplot. Set the colors for the three car sizes to green, blue, and red. Use the function pairs to create the scatter plot matrix with the col argument set to the variable cols.
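A sketch of the plot, using the three colours named above.

data(mtcars)

carsize <- cut(mtcars[, "wt"], breaks = c(0, 2.5, 3.5, 5.5),
               labels = c("Compact", "Midsize", "Large"))
cols <- c("green", "blue", "red")[carsize]  # one colour per size category

pairs(mtcars[, c("mpg", "disp", "hp", "drat", "qsec")], col = cols)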
Use the function aov() to perform a one-way analysis of variance on the chickwts data with feed as the treatment factor. Assign the result to an object named chick.aov and use it to print an ANOVA table. Use this object to obtain side-by-side box plots of the residuals for each feed. attach(chickwts) head(chickwts)
Answer: Create the object chick.aov using the aov function. Create a dataframe named dat by binding the chick.aov residuals with the feed types. Use ggplot to create a boxplot of these two variables.
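A sketch of the ANOVA fit and the residual boxplots.

library(ggplot2)

chick.aov <- aov(weight ~ feed, data = chickwts)
summary(chick.aov)  # prints the ANOVA table

dat <- data.frame(feed = chickwts$feed, resid = residuals(chick.aov))
ggplot(dat, aes(x = feed, y = resid)) +
  geom_boxplot()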
Write an R function ttest() for conducting a one-sample t-test. Return a list object containing the two components: the t-statistic named T and the two-sided p-value named P. Use this function to test the hypothesis that the mean of the weight variable (in chickwts dataset) is equal to 240 against the two-sided alternative.
Answer: Create the function with the arguments y and x, where x is the hypothesized mean. Within the function, declare n1 as the length of y, y1 as the mean of y, and v1 as the variance of y; compute the test statistic from these quantities along with a two-sided p-value (the P1 value below appears to use the normal approximation). The statistic and p-value are returned in a list. In the output, T1 and P1 are the manually computed statistic and p-value for the hypothesized mean of 240, while T2 and P2 are taken from R's built-in t.test(), whose printed output (shown last) tests its default null mean of 0.
Computing the test for the hypothesized mean of 240 gives a t statistic of about 2.30 and a two-sided p-value of about 0.021. Because this p-value is below 0.05 and 240 falls outside the 95% confidence interval (242.83, 279.79) reported by t.test(), the hypothesized mean of 240 is not a good approximation of the true mean weight.
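A minimal sketch of a ttest() function with the interface the exercise asks for (components T and P). The p-value here uses the t distribution with n - 1 degrees of freedom, whereas the P1 value reported below is consistent with a normal approximation, so the numbers differ slightly.

ttest <- function(y, mu) {
  n <- length(y)
  t <- (mean(y) - mu) / (sd(y) / sqrt(n))  # one-sample t statistic
  p <- 2 * pt(-abs(t), df = n - 1)         # two-sided p-value
  list(T = t, P = p)
}

ttest(chickwts$weight, mu = 240)
t.test(chickwts$weight, mu = 240)  # built-in test of the same hypothesis, for comparison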
## $T1
## [1] 2.299879
##
## $P1
## [1] 0.02145507
##
## $T2
## t
## 28.20202
##
## $P2
## [1] 5.919394e-40
##
## One Sample t-test
##
## data: chickwts$weight
## t = 28.202, df = 70, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 242.8301 279.7896
## sample estimates:
## mean of x
## 261.3099