HW2

Preliminaries

rm(list = ls())
aliens <- read.csv("aliens.csv", header = TRUE, stringsAsFactors = TRUE)
source('specialfunctions.R')
my_sample <- make.my.sample(35046056, 100, aliens)

Question 1

college.table <- table(my_sample$college)
college.table

## 
## Callisto   Europa Ganymede       Io 
##       31       22       20       27

The output is a table showing how many of the 100 sampled aliens attended each college: 31 at Callisto, 22 at Europa, 20 at Ganymede, and 27 at Io. Because the sample size is 100, the counts are also percentages: Callisto has the largest share (31%), followed by Io (27%), Europa (22%), and Ganymede (20%). Overall, the distribution across colleges is fairly even, and no single college holds a majority.
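
The percentages quoted above can also be computed rather than read off the counts. The sketch below rebuilds the counts by hand from the printed table so that it runs on its own; the names college_counts and college_pct are introduced here for illustration. With the real sample, prop.table(college.table) should give the same shares.

```r
# Counts copied from the printed college table above
college_counts <- c(Callisto = 31, Europa = 22, Ganymede = 20, Io = 27)

# prop.table() divides each count by the total; multiply by 100 for percent
college_pct <- round(100 * prop.table(college_counts), 1)
print(college_pct)
```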

color.table <- table(my_sample$color)
color.table

## 
## Blue Pink 
##   36   64

This table shows the number of aliens in each color category. Pink is the more common color, with 64 aliens, and blue is the less common, with 36. Pink aliens are therefore considerably more common than blue ones in this sample, and the distribution is not evenly split between the two colors.

island.table <- table(my_sample$island)
island.table

## 
##      Blick Nanspucket      Plume 
##         28         33         39

This table shows that the largest number of aliens in this sample live on Plume (39 aliens, or 39%), followed by Nanspucket (33 aliens, 33%) and Blick (28 aliens, 28%). Although Plume has the most aliens, no island holds a majority of the sample, so the distribution is relatively balanced.

Question 2

barplot(college.table)

barplot(color.table)

barplot(island.table)

Question 3

college_color.table <- table(my_sample$college, my_sample$color)
college_color.table

##           
##            Blue Pink
##   Callisto   13   18
##   Europa      7   15
##   Ganymede    7   13
##   Io          9   18

island_color.table <- table(my_sample$island, my_sample$color)
island_color.table

##             
##              Blue Pink
##   Blick         7   21
##   Nanspucket   21   12
##   Plume         8   31

Question 4

barplot(college_color.table)

barplot(island_color.table)

This plot looks a bit odd because barplot() stacks the rows of the table by default. Instead of showing a separate bar for each college or island within each color, the counts are stacked vertically into one combined bar per color, which makes it harder to compare categories directly.

barplot(college_color.table, beside = T)

barplot(island_color.table, beside = T)

The argument beside = T tells R not to stack the bars within each color. Instead, it places them side by side, so each color has one bar per college or island, which makes it easier to compare colleges or islands within each color group.

barplot(college_color.table, beside = T, legend = T)

barplot(island_color.table, beside = T, legend = T)

The argument legend = T adds a key to the graph that shows which bar shade corresponds to which category. Without this key, it may be unclear what each bar means.

Personally, I prefer the bar graphs that use both legend = T and beside = T, because the categories are clearly separated and labeled, which makes the data easier to interpret.

The previous layout, with the legend drawn on top of the bars, was a bit confusing, so this is what I did to fix that. I also added a y-axis title and a graph title to make the plots easier for me to interpret.

opar <- par(no.readonly = TRUE)
par(mar = c(5, 4, 4, 8), xpd = TRUE)

bp <- barplot(college_color.table, beside = TRUE, col = c("gray30", "gray55", "gray75", "gray90"), ylab = "Number of Aliens", xlab = "Color", main = "College Distribution by Color")

legend("topright", inset = c(-0.30, 0), legend = rownames(college_color.table), fill = c("gray30","gray55","gray75","gray90"), bty = "n")

par(opar)

opar <- par(no.readonly = TRUE)
par(mar = c(5, 4, 4, 8), xpd = TRUE)

bp <- barplot(island_color.table, beside = TRUE, col = c("gray30", "gray60", "gray85"), ylab = "Number of Aliens", xlab = "Color", main = "Island Distribution by Color")

legend("topright", inset = c(-0.30, 0), legend = rownames(island_color.table), fill = c("gray30","gray60","gray85"), bty = "n")

par(opar)

Question 5

One thing I noticed about the categorical variables is that pink aliens are more common overall than blue aliens (64 pink vs. 36 blue). As a result, in both the college-and-color graph and the island-and-color graph, pink aliens outnumber blue aliens in almost every group. The one exception is the island of Nanspucket, which has noticeably more blue aliens (21) than pink aliens (12). In the college-and-color graph, every college has more pink than blue aliens, but this likely reflects the overall higher number of pink aliens in the sample rather than a meaningful association between color and college. In the island-and-color graph, Nanspucket stands out, but the other two islands follow the overall pattern, so I do not see a consistent pattern between color and island. This may suggest that there is no strong association between these categorical variables.
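
One way to compare groups of different sizes is to convert each row of a two-way table to proportions, so that every island is judged on its own share of blue and pink aliens. The sketch below rebuilds the island-by-color counts by hand from the printed table; the names island_color and row_props are introduced here for illustration. With the real sample, prop.table(island_color.table, margin = 1) should give the same result.

```r
# Counts copied from the printed island-by-color table above.
# matrix() fills column by column: first the Blue column, then Pink.
island_color <- matrix(
  c(7, 21, 8,     # Blue: Blick, Nanspucket, Plume
    21, 12, 31),  # Pink: Blick, Nanspucket, Plume
  nrow = 3,
  dimnames = list(island = c("Blick", "Nanspucket", "Plume"),
                  color = c("Blue", "Pink"))
)

# margin = 1 makes each row (island) sum to 1
row_props <- prop.table(island_color, margin = 1)
print(round(row_props, 3))
```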

Question 6

mean(my_sample$anxiety)

## [1] 49.83

median(my_sample$anxiety)

## [1] 49.5

var(my_sample$anxiety)

## [1] 30.38495

sd(my_sample$anxiety)

## [1] 5.512254

mean(my_sample$income)

## [1] 70780

median(my_sample$income)

## [1] 59000

var(my_sample$income)

## [1] 1938961212

sd(my_sample$income)

## [1] 44033.64

mean(my_sample$intelligence)

## [1] 108.54

median(my_sample$intelligence)

## [1] 108

var(my_sample$intelligence)

## [1] 87.76606

sd(my_sample$intelligence)

## [1] 9.368354

hist(my_sample$anxiety, main = "Histogram of Anxiety", xlab = "Anxiety Score")

boxplot(my_sample$anxiety, main = "Boxplot of Anxiety", ylab = "Anxiety Score")

hist(my_sample$income, main = "Histogram of Income", xlab = "Income")

boxplot(my_sample$income, main = "Boxplot of Income", ylab = "Income")

hist(my_sample$intelligence, main = "Histogram of Intelligence", xlab = "Intelligence Score")

boxplot(my_sample$intelligence, main = "Boxplot of Intelligence", ylab = "Intelligence Score")

For the anxiety variable, the mean and median are very close (mean = 49.83, median = 49.5). This is consistent with a nearly symmetric distribution, which matches the fairly balanced histogram. The closeness of the mean and median also suggests that there are no strong outliers, and the small standard deviation (5.51) suggests that most anxiety scores cluster near the mean.

Similarly, for the intelligence variable, the mean and median are very close (mean = 108.54, median = 108), which again suggests a roughly symmetric distribution, as the histogram shows. The standard deviation is somewhat larger (9.37), indicating more variability around the mean.

For the income variable, the mean is much larger than the median (mean = 70,780, median = 59,000), which is consistent with a right-skewed (positively skewed) distribution, as the histogram also shows. The most likely reason is the few aliens with very high incomes (outliers), which are clearly visible in the box plot. Those high incomes pull the mean upward, while the median is less affected by extreme values. Because of these extreme values, the standard deviation is also very large (44,033.64).

Question 7

Yes, there are outliers in the distributions, mainly for the income variable, based on the 1.5 * IQR rule. In the income box plot, there are points beyond the ends of the whiskers. Under the 1.5 * IQR rule, the whiskers extend to the minimum and maximum values unless some values lie more than 1.5 * IQR below Q1 or above Q3; such values are drawn as individual dots. In this case, the aliens with extremely high incomes fall beyond this range and appear as dots in the box plot. Although these points are extreme, I don’t think they should be excluded from the data set, because outliers are generally removed only if they are data entry errors, physically impossible values, or known to have been measured incorrectly. Here, the high incomes are not impossible, just rare, and they are what makes the income variable naturally right-skewed. Removing them would distort the real distribution, so they should be kept in the dataset.
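
The 1.5 * IQR rule can also be applied by hand. The sketch below uses a small, made-up income vector (not the sample data) and hypothetical names such as lower_fence and upper_fence. Note that boxplot() computes its box from hinges, which can differ slightly from quantile(), so the two approaches may occasionally disagree on borderline points.

```r
# Hypothetical right-skewed incomes, invented for illustration only
income <- c(30000, 42000, 48000, 55000, 59000,
            63000, 70000, 82000, 95000, 250000)

# Quartiles and IQR using R's default quantile method
q <- quantile(income, c(0.25, 0.75))
iqr <- IQR(income)

# Fences: anything outside them counts as an outlier under the rule
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr
flagged <- income[income < lower_fence | income > upper_fence]
print(flagged)

# The points boxplot() would draw as dots (hinge-based version of the rule)
print(boxplot.stats(income)$out)
```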

Question 8

hist(my_sample$income, breaks = 5, main="Histogram of Income (5 bins)", xlab="Income")

hist(my_sample$income, breaks = 10, main="Histogram of Income (10 bins)", xlab="Income")

hist(my_sample$income, breaks = 20, main="Histogram of Income (20 bins)", xlab="Income")

hist(my_sample$income, breaks = 50, main="Histogram of Income (50 bins)", xlab="Income")

After trying several different numbers of bins, I believe that around 20 bins provides the best view of the income distribution. Too few bins (such as 5) hide the shape of the distribution, while too many bins (such as 50) make the histogram noisy and hard to interpret, since individual bins may contain very few or zero observations. Using around 20 bins clearly shows that the distribution is right-skewed, with more aliens earning lower incomes and only a few earning very high incomes.
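
One detail worth knowing is that hist() treats a single number passed to breaks as a suggestion and may choose a different number of bins to get round break points. The sketch below uses simulated right-skewed data (an assumption, not the sample) to show how to check the number of bins actually used and how to force an exact count by passing explicit break points.

```r
# Simulated right-skewed "incomes"; purely hypothetical data
set.seed(1)
income <- rlnorm(100, meanlog = 11, sdlog = 0.6)

# A single number is only a suggestion; check what hist() actually used
h_suggested <- hist(income, breaks = 20, plot = FALSE)
print(length(h_suggested$counts))

# A vector of 21 break points forces exactly 20 equal-width bins
edges <- seq(min(income), max(income), length.out = 21)
h_exact <- hist(income, breaks = edges, plot = FALSE)
print(length(h_exact$counts))
```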

Question 9

boxplot(my_sample$anxiety~my_sample$island, ylim = c(30, 70))

This code includes the “ylim” argument, which restricts the y-axis to the meaningful range of values and makes the differences between the islands easier to see. Since anxiety scores have a mean of 49.83 and a standard deviation of about 5.5, nearly all values fall between 30 and 70. Without the “ylim” argument, the plot might stretch to accommodate extreme outliers, which would compress the main part of each boxplot and make the islands harder to compare. The median for each island shows the typical anxiety level, while the box (the IQR) shows variability: a larger IQR indicates more spread in anxiety scores among aliens from that island. Lastly, the whiskers show the range of the non-outlying scores, and any dots show outliers, which are aliens with unusually high or low anxiety scores.

These boxplots show that the distribution of anxiety is roughly the same across the three islands. The median (the thick black line in the middle of the box) is almost the same for all three islands, and the spread (the distance between the ends of the whiskers) is also roughly the same.

Question 10

boxplot(my_sample$income ~ my_sample$college, main = "Income by College", xlab = "College", ylab = "Income")

boxplot(my_sample$intelligence ~ my_sample$antennae, main = "Intelligence by Antennae Type", xlab = "Antennae", ylab = "Intelligence Score")

I made two new side-by-side boxplots: the first shows Income by College and the second shows Intelligence by Antennae Type. In the Income by College boxplot, the median incomes are fairly similar across all colleges, which indicates that the typical income is about the same regardless of college. What varies is the spread, as shown by the IQR (the height of the box) and the distance between the whiskers. Callisto and Io have similar, relatively small IQRs, which means that the middle 50% of incomes in those colleges are less variable. Callisto also has a large distance between its whiskers, and its outlier is a much higher income than the incomes near the median. Europa and Ganymede have very large distances between their whiskers, which indicates that their incomes range far from the typical values. Europa also has the largest IQR, which means that this college has the most variability in income among the middle 50% of aliens.

In the Intelligence by Antennae Type boxplot, the median intelligence scores are fairly similar across antennae types, although the spread (IQR and whisker length) varies slightly. For example, aliens with “Curly” antennae have a smaller spread, suggesting that most aliens in this group are close to the median.
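
The medians and spreads discussed above can also be computed per group instead of being read off the plot. The sketch below uses a tiny invented data frame called toy (not the sample) with hypothetical incomes. With the real data, the same tapply() calls on my_sample$income and my_sample$college should work.

```r
# Hypothetical data frame standing in for my_sample; values are invented
toy <- data.frame(
  college = factor(rep(c("Callisto", "Europa"), each = 5)),
  income = c(40, 45, 50, 55, 60, 20, 35, 50, 80, 140) * 1000
)

# One summary value per college for each statistic
group_median <- tapply(toy$income, toy$college, median)
group_iqr <- tapply(toy$income, toy$college, IQR)
group_sd <- tapply(toy$income, toy$college, sd)

print(cbind(median = group_median, IQR = group_iqr, SD = group_sd))
```

In this toy example the two groups share a median but differ in IQR and standard deviation, which is the kind of pattern a side-by-side boxplot is meant to reveal.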

Question 11

my_sample_2 <- make.my.sample(35046057, 100, aliens)
boxplot(my_sample_2$income ~ my_sample_2$college, main = "Income by College (New Sample)", xlab = "College", ylab = "Income")

boxplot(my_sample_2$intelligence ~ my_sample_2$antennae, main = "Intelligence by Antennae Type (New Sample)", xlab = "Antennae Type", ylab = "Intelligence Score")

In the new boxplots, which use a new sample of aliens, the medians in the Income by College plot differ slightly from those in question 10. While the median incomes in question 10 were about the same across all colleges, in this sample only Callisto, Io, and Ganymede have similar medians (though Io is slightly lower). This indicates that the aliens in these three colleges have a similar typical income, with 50% of aliens earning more and 50% earning less. Europa, however, has a lower median than the other three colleges and than the Europa group in question 10.

Europa also has a very small distance between the median and Q1, which may indicate that the data are positively skewed: the 25% of data points directly below the median are concentrated within a narrow range, while the larger distance between the median and Q3 indicates that the 25% of data points directly above the median are more spread out. The Europa group also appears to be positively skewed in question 10’s sample. Europa and Ganymede seem to have less spread than in question 10, as shown by the smaller distance between the two whiskers.

For the Intelligence by Antennae Type plot, the two boxplots have slightly different medians, with straight-antennae aliens having the higher median, but the two groups have about the same spread. In the original sample from question 10, by contrast, the medians are similar but the spreads differ, with the straight-antennae group showing greater variation.

Question 12

newvar <- c(70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98, 102, 104, 106, 108, 110, 112, 114, 116, 118, 120, 122, 124, 126, 128, 130)
mean(newvar)

## [1] 100

hist(newvar, main="Histogram of newvar", xlab="Values")

boxplot(newvar, main="Boxplot of newvar")

Question 13

newvar2 <- c(80, 85, 90, 92, 94, 96, 98, 99, 99, 99, 100, 100, 100, 100, 100, 101, 101, 101, 102, 104, 106, 108, 110, 115, 120, 90, 92, 108, 110, 94)
mean(newvar2)

## [1] 99.8

hist(newvar2, main="Histogram of newvar2", xlab="Values")

boxplot(newvar2, main="Boxplot of newvar2")

Question 14

newvar3 <- c(80, 82, 84, 85, 85, 86, 88, 90, 87, 83, 84, 86, 88, 89, 81, 110, 112, 114, 115, 115, 116, 118, 120, 117, 113, 114, 116, 118, 119, 111)
mean(newvar3)

## [1] 100.2

hist(newvar3, main="Histogram of newvar3 (Bimodal)", xlab="Values")

boxplot(newvar3, main="Boxplot of newvar3")

Question 15

Based on the previous distributions, I believe that histograms are better for accurately representing the data and showing the detailed shape of the distribution. For the uniform distribution in question 12, the histogram clearly showed the flat, even spread across values, while the boxplot showed only symmetry and range, not flatness. For the symmetric unimodal distribution in question 13, the histogram showed a single peak, with most values near the mean in the middle and the counts tapering off toward the ends. The boxplot showed only symmetry and the location of the median; it did not clearly show that there was one peak. The boxplots for the uniform and unimodal distributions also looked similar, so from the boxplots alone it would be hard to determine the type and shape of each distribution. For the bimodal distribution in question 14, the histogram clearly showed two separate peaks. The boxplot did not show this separation; it showed only a wider spread (IQR) and symmetry.
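
A quick numeric check illustrates why a boxplot can hide the shape. The sketch below recreates newvar and newvar3 from questions 12 and 14 and compares their five-number summaries (the values a boxplot is drawn from) with a count of the observations that fall strictly between 90 and 110. The names middle_uniform and middle_bimodal are introduced here for illustration.

```r
# Same values as newvar (question 12) and newvar3 (question 14)
newvar <- c(seq(70, 98, by = 2), seq(102, 130, by = 2))
newvar3 <- c(80, 82, 84, 85, 85, 86, 88, 90, 87, 83, 84, 86, 88, 89, 81,
             110, 112, 114, 115, 115, 116, 118, 120, 117, 113, 114, 116,
             118, 119, 111)

# Five-number summaries: min, lower hinge, median, upper hinge, max
print(fivenum(newvar))
print(fivenum(newvar3))

# How many observations sit strictly between 90 and 110?
middle_uniform <- sum(newvar > 90 & newvar < 110)
middle_bimodal <- sum(newvar3 > 90 & newvar3 < 110)
print(c(uniform = middle_uniform, bimodal = middle_bimodal))
```

Both summaries are symmetric around a median of 100, yet the bimodal variable has no observations at all in the middle band, which only a histogram makes visible.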

Question 16

newvar_extreme <- c(newvar2, 1000)
mean(newvar2)

## [1] 99.8

median(newvar2)

## [1] 100

mean(newvar_extreme)

## [1] 128.8387

median(newvar_extreme)

## [1] 100

After adding one extreme value (1000) to my second new variable (from question 13), the mean increased substantially (from 99.8 to about 128.84), while the median stayed exactly the same (100). The mean uses every value, so a single extreme value pulls it toward that extreme; in this case, the 1000 pulls the average upward. The median did not change because it depends only on the middle of the ordered values, and “newvar2” has several values of 100 in the center, so adding one high value at the end does not move the middle. This shows that the mean is affected by outliers, while the median is more resistant to extreme values.

Piyusha Majgaonkar 35046056

02/16/26