beer.data <- read.csv("http://homepages.gac.edu/~anienow2/MCS_142/Data/beer.csv", header=TRUE, sep=",")
hist(beer.data$PercentAlcohol, main="Histogram of Percent Alcohol in 86 Domestic Beers", xlab="Percent Alcohol")
The mean is 4.7593023, the standard deviation is 0.7523106, and the median is 4.7. The histogram displays the distribution of percent alcohol across the 86 domestic beers, with the height of each bar giving the number of beers in that range.
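As a minimal sketch of how statistics like these can be computed, the code below uses a short made-up vector, `pct`, as a stand-in for `beer.data$PercentAlcohol`. The values are hypothetical and will not reproduce the numbers reported above.

```r
# Hypothetical stand-in for beer.data$PercentAlcohol (made-up values)
pct <- c(4.2, 4.7, 5.0, 4.6, 5.3, 0.4, 4.9)

pct_mean   <- mean(pct)    # arithmetic mean
pct_sd     <- sd(pct)      # sample standard deviation (n - 1 denominator)
pct_median <- median(pct)  # middle value of the sorted data

print(c(mean = pct_mean, sd = pct_sd, median = pct_median))
```

With the real data, the same three calls would be applied to `beer.data$PercentAlcohol`.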
boxplot(beer.data$PercentAlcohol, horizontal=TRUE, xlab="Percent Alcohol", main="Boxplot of Percent Alcohol in 86 Domestic Beers")
The summary statistics are the minimum, the 1st quartile, the median, the mean, the 3rd quartile, and the maximum, respectively: 0.4, 4.325, 4.7, 4.759, 5, 6.5. The five-number summary consists of these values without the mean. The boxplot makes the outlier clearly visible.
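The difference between a six-value summary and the five-number summary can be sketched in R as follows. The vector `pct` is a hypothetical stand-in for `beer.data$PercentAlcohol`, so the printed values are illustrative only.

```r
# Hypothetical stand-in for beer.data$PercentAlcohol (made-up values)
pct <- c(4.2, 4.7, 5.0, 4.6, 5.3, 0.4, 4.9)

# summary() reports six values: min, 1st quartile, median, mean,
# 3rd quartile, and max
six <- summary(pct)

# fivenum() reports Tukey's five-number summary: min, lower hinge,
# median, upper hinge, and max (no mean)
five <- fivenum(pct)

print(six)
print(five)
```

Note that `fivenum()` uses hinges, which can differ slightly from the quartiles that `summary()` reports.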
Observation 57 is an outlier. I found it with the function which.min(beer.data$PercentAlcohol), which returns the position of the smallest value in the dataset. Observation 57 is an outlier because its percent alcohol, 0.4, is much lower than that of the other observations.
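Since `which.min()` only locates the smallest value, a separate check is needed to confirm that the value is an outlier. One possible check, sketched below, is the 1.5 × IQR rule that boxplots use, available through `boxplot.stats()`. The vector `pct` is a hypothetical stand-in for `beer.data$PercentAlcohol`.

```r
# Hypothetical stand-in for beer.data$PercentAlcohol (made-up values)
pct <- c(4.2, 4.7, 5.0, 4.6, 5.3, 0.4, 4.9)

# Position of the smallest value in the vector
min_pos <- which.min(pct)

# Values lying beyond the boxplot whiskers (1.5 * IQR rule)
flagged <- boxplot.stats(pct)$out

# TRUE if the smallest value is among the flagged points
min_is_outlier <- pct[min_pos] %in% flagged

print(min_pos)
print(flagged)
print(min_is_outlier)
```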
The mean is 4.759 with the outlier and 4.810588 without it. The median is 4.7 both with and without the outlier. Removing the outlier left the median unchanged but increased the mean, because the median is resistant to extreme values while the mean is pulled toward them. With the outlier removed, the mean better represents the typical beer in the dataset.
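The comparison can be sketched with negative indexing, which drops one observation from a vector. The vector `pct` is a hypothetical stand-in for `beer.data$PercentAlcohol`, so its means and medians are illustrative. The last line is an arithmetic check that uses only the values reported in this write-up (86 beers, mean 4.7593023, outlier 0.4).

```r
# Hypothetical stand-in for beer.data$PercentAlcohol (made-up values)
pct <- c(4.2, 4.7, 5.0, 4.6, 5.3, 0.4, 4.9)

# Drop the smallest observation using negative indexing
outlier_pos <- which.min(pct)
pct_no_out  <- pct[-outlier_pos]

mean_with      <- mean(pct)
mean_without   <- mean(pct_no_out)
median_with    <- median(pct)
median_without <- median(pct_no_out)

# Check of the reported mean: remove 0.4 from the total of 86 beers,
# then divide by the 85 beers that remain
reported_check <- (86 * 4.7593023 - 0.4) / 85

print(c(mean_with = mean_with, mean_without = mean_without))
print(c(median_with = median_with, median_without = median_without))
print(reported_check)
```

The arithmetic check agrees with the reported mean of 4.810588 to the precision shown.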
With the outlier, the standard deviation is 0.7523106. Without the outlier, the standard deviation is 0.5863575. Thus, removing the outlier lowered the standard deviation. This means that the remaining values are, on average, closer to the mean, so the summary statistics describe the bulk of the data more accurately without the outlier.
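A short sketch of the same comparison for the standard deviation is shown here, again on a hypothetical vector `pct` rather than the real `beer.data$PercentAlcohol` column.

```r
# Hypothetical stand-in for beer.data$PercentAlcohol (made-up values)
pct <- c(4.2, 4.7, 5.0, 4.6, 5.3, 0.4, 4.9)

# Standard deviation with every observation included
sd_with <- sd(pct)

# Standard deviation after dropping the smallest observation
sd_without <- sd(pct[-which.min(pct)])

print(c(sd_with = sd_with, sd_without = sd_without))
```

Because the dropped value sits far from the rest of the data, the spread around the mean shrinks once it is removed.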
This exercise has taught me how to better use RStudio to display data and to calculate statistics from that data. It has also taught me how to interpret outliers and how to remove them from the dataset when calculating statistics. From that, I learned how much outliers affect the data and its statistics. By removing the outlier from the dataset, I was able to calculate statistics that better represent the data.
hist(beer.data$Calories, main="Histogram of Calories in 86 Domestic Beers", xlab="Calories")
The mean is 141.0581395 and the standard deviation is 27.7913887. A standard deviation this large suggests that calorie counts vary considerably from beer to beer, so the mean alone does not describe the dataset very well.
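One way to put the size of a standard deviation in context is to divide it by the mean (the coefficient of variation). The sketch below uses only the values reported in this write-up. The variable names are hypothetical.

```r
# Values reported in this write-up
cal_mean <- 141.0581395
cal_sd   <- 27.7913887
pct_mean <- 4.7593023
pct_sd   <- 0.7523106

# Coefficient of variation: standard deviation as a fraction of the mean
cal_cv <- cal_sd / cal_mean
pct_cv <- pct_sd / pct_mean

print(round(c(calories = cal_cv, percent_alcohol = pct_cv), 3))
```

This gives a unit-free measure of spread, so the calories and percent alcohol variables can be compared even though they are on different scales.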
boxplot(beer.data$Calories, main="Boxplot of Calories in 86 Domestic Beers", horizontal=TRUE, xlab="Calories")
As seen in the boxplot above, the brand that was an outlier for the percent alcohol data is not an outlier for the calories data. The boxplot shows no outliers at all.
These two groups may correspond to the higher-calorie beers and the lower-calorie beers, also known as regular and low-cal/light beers. There is a connection between the number of calories and the number of carbohydrates: the higher-calorie beers tend to have more carbohydrates than the lower-calorie beers. The connection between calories and percent alcohol is less noticeable.
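A connection like this could be quantified with the correlation coefficient, as in the rough sketch below. The data frame `toy` and all of its values are made up for illustration, and the column name `Carbohydrates` is an assumption (only `Calories` and `PercentAlcohol` appear in the code above), so the output says nothing about the real beer data.

```r
# Hypothetical toy data (made-up values). The column name
# "Carbohydrates" is an assumption about the dataset.
toy <- data.frame(
  Calories       = c(95, 102, 110, 145, 150, 153, 160),
  Carbohydrates  = c(3.2, 5.0, 6.6, 10.6, 11.5, 12.0, 13.1),
  PercentAlcohol = c(4.2, 4.1, 4.2, 4.6, 5.0, 4.7, 5.0)
)

# Pearson correlation: values near 1 or -1 indicate a strong linear
# relationship, values near 0 a weak one
cor_carbs   <- cor(toy$Calories, toy$Carbohydrates)
cor_alcohol <- cor(toy$Calories, toy$PercentAlcohol)

print(c(carbs = cor_carbs, alcohol = cor_alcohol))
```

A scatterplot such as `plot(toy$Carbohydrates, toy$Calories)` would show the same relationship visually.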