aliens <- read.csv("aliens.csv", header = TRUE, stringsAsFactors = TRUE)
library(skimr)
source('special_functions.R')
my_sample <- suppressWarnings(make.my.sample(33243684, 100, aliens))
The first thing you’ll do is to make a simple table showing the distribution of a categorical variable, using the table command. To make a table showing the distribution of the college variable in your sample of aliens, naming it ‘college.table’, and showing the output, do this:
college.table <- table(my_sample$college)
college.table
##
## Callisto Europa Ganymede Io
## 27 22 28 23
color.table <- table(my_sample$color)
color.table
##
## Blue Pink
## 30 70
politics.table <- table(my_sample$politics)
politics.table
##
## Democrulite Independone Republicant
## 32 38 30
anxiety.table <- table(my_sample$anxiety)
anxiety.table
##
## 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 63
## 2 1 1 5 4 6 8 7 6 10 10 6 7 6 6 2 2 5 3 1 1 1
intelligence.table <- table(my_sample$intelligence)
intelligence.table
##
## 85 89 91 92 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109
## 1 1 1 1 1 2 3 1 3 4 4 2 3 5 2 3 4 2 4 6
## 110 111 112 113 114 115 116 117 118 119 121 122 123 124 127 130
## 1 1 4 3 5 6 2 4 6 4 4 2 2 1 1 1
income.table <- table(my_sample$income)
income.table
##
## 13000 15000 18000 21000 22000 25000 26000 27000 28000 30000 31000
## 1 2 1 2 1 2 2 3 3 1 1
## 32000 33000 34000 35000 36000 37000 38000 39000 41000 43000 46000
## 3 3 3 2 1 1 2 2 1 2 1
## 48000 49000 50000 51000 53000 54000 55000 57000 58000 60000 61000
## 3 2 1 2 1 1 1 3 1 2 1
## 62000 63000 64000 65000 66000 67000 68000 70000 71000 73000 74000
## 1 1 1 1 1 1 1 1 1 1 1
## 76000 78000 79000 81000 82000 83000 85000 90000 93000 97000 1e+05
## 1 1 1 2 1 1 1 1 2 1 1
## 103000 104000 110000 111000 113000 115000 118000 120000 130000 132000 138000
## 1 1 1 2 1 1 1 1 1 1 1
## 142000 143000 144000 153000 157000 214000
## 1 1 1 1 1 1
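For a quantitative variable such as income, table() produces one column per distinct value, which is hard to read. One possible alternative, shown here only as a sketch, is to group the values into intervals with cut() before tabulating. The income values, the break points, and the object names below are made up for illustration and are not taken from the sample.

```r
# Hypothetical incomes, invented for this sketch (not the aliens data)
income <- c(13000, 27000, 32000, 48000, 54000, 57000, 81000, 93000, 111000, 214000)

# cut() assigns each value to an interval; with the default right = TRUE the
# intervals are (0, 50000], (50000, 100000], and (100000, 250000]
income.bins <- cut(income, breaks = c(0, 50000, 100000, 250000), dig.lab = 6)

# table() now counts values per interval instead of per distinct value
income.bin.table <- table(income.bins)
income.bin.table
```

The break points are a judgment call; wider or narrower intervals would give a coarser or finer table.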
You can make a bar graph of the results by making your table into an argument of the barplot function, like this:
barplot(college.table)
barplot(color.table)
barplot(politics.table)
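The bar graphs above use the default title and axis settings. As a minimal sketch, a title and axis labels can be added through the main, xlab, and ylab arguments of barplot(). The counts are copied from the college.table output shown earlier; the label text and the object names are chosen here for illustration.

```r
# Counts copied from the college.table output shown earlier
college.counts <- c(Callisto = 27, Europa = 22, Ganymede = 28, Io = 23)

# barplot() draws the graph and returns the x-coordinates of the bar midpoints
bar.mids <- barplot(college.counts,
                    main = "Aliens by college",
                    xlab = "College",
                    ylab = "Number of aliens",
                    ylim = c(0, 30))
```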
[Figure: bar graph of college.table with bars for Callisto, Europa, Ganymede, and Io; count axis from 0 to 25]
Do this, and also do it for the two other variables that you used in Question 1.

## Question 3

You can also make a contingency table showing the joint distribution of two categorical variables, with the same table function that you used in Question 1. You simply have to give it the two variables as separate arguments, separating them with a comma. Make two separate contingency tables, for two distinct pairs of variables.
college.color.table <- table(my_sample$college,my_sample$color)
college.color.table
##
## Blue Pink
## Callisto 10 17
## Europa 5 17
## Ganymede 6 22
## Io 9 14
color.politics.table <- table(my_sample$color,my_sample$politics)
color.politics.table
##
## Democrulite Independone Republicant
## Blue 5 12 13
## Pink 27 26 17
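Raw counts are hard to compare when the groups differ in size (30 blue aliens versus 70 pink). A minimal sketch of converting a contingency table to row proportions with prop.table() follows; the counts are copied from the color.politics.table output above, and the object names are chosen for illustration.

```r
# Counts copied from the color.politics.table output above
# (matrix() fills column by column: Blue then Pink for each party)
color.politics <- as.table(matrix(
  c(5, 27, 12, 26, 13, 17), nrow = 2,
  dimnames = list(color = c("Blue", "Pink"),
                  politics = c("Democrulite", "Independone", "Republicant"))
))

# addmargins() appends row and column totals
addmargins(color.politics)

# margin = 1 divides each row by its row total, giving the distribution of
# politics within each color
row.props <- prop.table(color.politics, margin = 1)
round(row.props, 2)
```

In this sample about 17% of blue aliens are Democrulites, compared with about 39% of pink aliens.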
You can also make a bar graph that shows the joint distribution of two variables by using a contingency table (like the one you made in Question 3) as the argument to the barplot function.
barplot(college.color.table, beside = T, legend.text = T)
barplot(color.politics.table, beside = T, legend.text = T)
Explain what these two arguments do. Do you like these graphs better with, or without, these arguments?
The beside = T argument draws the bars for each group side by side instead of stacking them on top of each other, and legend.text = T adds a legend showing which shade of bar belongs to which row of the table (the colleges in the first graph, the colors in the second). I think these arguments are very useful, because side-by-side bars with a legend make the chart easier to read and understand.
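The effect of beside can be seen by drawing the same table both ways. This is a minimal sketch: the counts are copied from the college.color.table output above, and the object names are chosen for illustration.

```r
# Counts copied from the college.color.table output above
college.color <- matrix(
  c(10, 5, 6, 9, 17, 17, 22, 14), nrow = 4,
  dimnames = list(c("Callisto", "Europa", "Ganymede", "Io"), c("Blue", "Pink"))
)

# Draw the two versions next to each other in one plotting window
old.par <- par(mfrow = c(1, 2))

# Default: one stacked bar per column of the table
stacked.mids <- barplot(college.color, legend.text = TRUE,
                        main = "Default (stacked)")

# beside = TRUE: one bar per cell, grouped by column
grouped.mids <- barplot(college.color, beside = TRUE, legend.text = TRUE,
                        main = "beside = TRUE")

par(old.par)  # restore the previous plotting layout
```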
Based on the graphs you made in Question 4, do you conclude anything about how the categorical variables that you’re looking at might be related to each other? If so, what do you conclude?
The graphs show the differences between the color groups more clearly. For example, there are far more pink aliens than blue aliens in the sample overall (70 versus 30). The Democrulite group has the most pink aliens (27) and the fewest blue aliens (5), which suggests that color and politics may be related.
Use all of these functions to give both numerical and graphical summaries of the anxiety, income, and intelligence variables. Compare the median and the mean, and explain the patterns you find.
hist(my_sample$anxiety)
boxplot(my_sample$anxiety)
mean(my_sample$anxiety)
## [1] 49.74
median(my_sample$anxiety)
## [1] 49.5
var(my_sample$anxiety)
## [1] 22.17414
sd(my_sample$anxiety)
## [1] 4.708943
hist(my_sample$income)
boxplot(my_sample$income)
mean(my_sample$income)
## [1] 64050
median(my_sample$income)
## [1] 54500
var(my_sample$income)
## [1] 1514027778
sd(my_sample$income)
## [1] 38910.51
hist(my_sample$intelligence)
boxplot(my_sample$intelligence)
mean(my_sample$intelligence)
## [1] 108.93
median(my_sample$intelligence)
## [1] 109
var(my_sample$intelligence)
## [1] 85.96475
sd(my_sample$intelligence)
## [1] 9.271718
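The boxplots above mark points that fall outside the 1.5xIQR fences. A minimal sketch of that rule, together with a mean and median comparison, follows. The data vector is hypothetical, and quantile() is used for the quartiles; boxplot() uses hinges, which can differ slightly from quantile() for some sample sizes.

```r
# Hypothetical right-skewed values, invented for this sketch
x <- c(21, 25, 27, 30, 32, 34, 38, 41, 48, 55, 140)

# Quartiles and interquartile range
q <- quantile(x, c(0.25, 0.75))
iqr <- IQR(x)

# Fences: points beyond these are flagged as outliers
lower.fence <- q[[1]] - 1.5 * iqr
upper.fence <- q[[2]] + 1.5 * iqr
outliers <- x[x < lower.fence | x > upper.fence]
outliers

# A long right tail pulls the mean above the median
mean(x)
median(x)
```

The same pattern appears in the income output above, where the mean (64050) is well above the median (54500).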
Are there outliers in the distributions you’ve looked at, based on the 1.5xIQR rule? If so, do you think these points should be excluded from the data set for the purpose of descriptive statistics, or should they be kept? Justify your conclusion.
By the 1.5xIQR rule, income has outliers: the two highest incomes, 157000 and 214000, lie above the upper fence of about 156000. Anxiety does not: its highest value, 63, falls just inside the upper fence of 63.5 (the quartiles are 46 and 53).

## Question 8

Sometimes the histogram that R gives you doesn’t have suitable bin sizes, and you need to tell the hist function how many bins to make. One way to do this is to use the breaks argument (for example, breaks = 50). Try different numbers of bins for the histogram of the income variable.
hist(my_sample$income, breaks=50)
hist(my_sample$income, breaks=100)
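When breaks is a single number, hist() treats it as a suggestion and rounds the break points to convenient values, so the number of bars drawn may not match the number requested. A minimal sketch of the alternative, passing an explicit vector of break points, follows. The incomes are simulated for illustration and the bin width of 10000 is an assumption.

```r
# Hypothetical right-skewed incomes, simulated for this sketch
set.seed(1)
income <- round(rlnorm(100, meanlog = 11, sdlog = 0.6), -3)

# A single number is only a suggestion for the number of bins
h.suggested <- hist(income, breaks = 50)

# A vector of break points is used exactly as given: bins 10000 wide
bin.edges <- seq(0, max(income) + 10000, by = 10000)
h.exact <- hist(income, breaks = bin.edges, main = "Income, bins of 10000")

# Number of bins actually drawn in each case
length(h.suggested$counts)
length(h.exact$counts)
```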
What do you think is the best value for the breaks argument in this case? Why?
I think the best value for the breaks argument is 50, because the bars are easier to see, and so the frequencies they reach are easier to read.

## Question 9

The boxplot function also allows you to make separate boxplots, based on the values of a categorical variable, like this:
boxplot(my_sample$anxiety~my_sample$island, ylim = c(30, 70))
my_sample$island
## [1] Nanspucket Plume Plume Plume Plume Blick
## [7] Plume Blick Plume Nanspucket Nanspucket Blick
## [13] Nanspucket Plume Blick Plume Blick Plume
## [19] Plume Plume Plume Plume Nanspucket Blick
## [25] Nanspucket Nanspucket Plume Blick Blick Blick
## [31] Plume Plume Nanspucket Plume Plume Nanspucket
## [37] Blick Plume Blick Blick Plume Nanspucket
## [43] Plume Nanspucket Nanspucket Blick Blick Blick
## [49] Nanspucket Blick Plume Plume Blick Plume
## [55] Blick Plume Plume Blick Blick Blick
## [61] Blick Plume Plume Blick Blick Plume
## [67] Nanspucket Plume Nanspucket Plume Blick Plume
## [73] Blick Blick Nanspucket Plume Plume Nanspucket
## [79] Blick Nanspucket Nanspucket Blick Plume Plume
## [85] Nanspucket Blick Nanspucket Plume Nanspucket Nanspucket
## [91] Nanspucket Blick Blick Blick Nanspucket Plume
## [97] Nanspucket Plume Plume Blick
## Levels: Blick Nanspucket Plume
my_sample$anxiety
## [1] 52 52 51 55 49 40 49 53 51 47 46 49 50 57 59 49 43 53 48 48 55 46 45 51 63
## [26] 50 44 43 47 51 50 47 46 46 45 53 50 57 45 46 51 48 52 47 57 56 48 45 47 52
## [51] 54 46 50 43 45 53 49 52 57 50 44 40 42 44 58 49 49 53 44 58 50 50 47 47 51
## [76] 54 54 60 54 58 56 52 48 46 49 54 52 46 49 48 50 57 41 54 50 43 45 53 43 49
[Figure: boxplots of anxiety by island (Blick, Nanspucket, Plume); y-axis from 30 to 70]
Note that the little squiggly line (~) can be read as ‘depends on’ – in this case, you’re making a boxplot of anxiety, depending on island. Run this command. Why does it include the ylim argument? What happens if you leave this out? What do these boxplots tell you about the distribution of anxiety for aliens from each of the three islands?
The ylim argument sets the lower and upper limits of the y-axis. If you leave it out, R chooses the limits from the range of the data, so the boxes fill more of the plot. The boxplots show that the median anxiety is very similar for all three islands, and the maximum and minimum for Blick and Plume are very similar. Nanspucket’s maximum is also similar, but its minimum is higher.

## Question 10

Make two more side-by-side boxplots, in each case exploring the distribution of one of the quantitative variables, depending on the values of one of the categorical variables. Explain what you learn from these plots.
boxplot(my_sample$intelligence~my_sample$college, ylim = c(75, 150))
my_sample$intelligence
## [1] 94 112 121 103 130 123 114 112 108 118 122 96 114 100 105 100 119 113
## [19] 117 85 99 119 112 108 106 98 117 98 102 114 102 107 100 103 103 119
## [37] 111 106 117 112 103 116 91 103 116 115 115 115 105 115 121 104 105 96
## [55] 99 115 109 98 109 108 100 118 101 121 121 114 99 97 101 92 113 114
## [73] 99 109 118 89 127 118 117 118 96 122 104 113 102 123 106 124 108 110
## [91] 115 95 106 107 118 109 95 109 109 119
my_sample$college
## [1] Europa Callisto Ganymede Europa Io Ganymede Ganymede Io
## [9] Callisto Callisto Ganymede Callisto Ganymede Europa Callisto Callisto
## [17] Ganymede Io Io Callisto Callisto Io Ganymede Io
## [25] Europa Ganymede Io Europa Callisto Io Callisto Europa
## [33] Europa Io Europa Ganymede Europa Callisto Ganymede Ganymede
## [41] Callisto Europa Callisto Io Ganymede Io Ganymede Ganymede
## [49] Callisto Ganymede Ganymede Callisto Callisto Europa Callisto Ganymede
## [57] Ganymede Europa Europa Callisto Europa Ganymede Callisto Ganymede
## [65] Io Io Callisto Europa Europa Europa Io Io
## [73] Europa Europa Io Callisto Ganymede Ganymede Ganymede Io
## [81] Europa Ganymede Callisto Ganymede Europa Ganymede Callisto Ganymede
## [89] Callisto Io Io Callisto Callisto Io Ganymede Io
## [97] Europa Callisto Io Io
## Levels: Callisto Europa Ganymede Io
boxplot(my_sample$memory~my_sample$color, ylim = c(75, 150))
my_sample$color
## [1] Pink Pink Blue Blue Pink Pink Pink Blue Blue Blue Pink Pink Pink Pink Blue
## [16] Blue Pink Blue Pink Blue Blue Blue Blue Pink Blue Pink Pink Pink Pink Pink
## [31] Pink Pink Pink Pink Blue Pink Pink Pink Pink Pink Pink Pink Pink Blue Pink
## [46] Blue Pink Pink Blue Pink Pink Pink Pink Pink Pink Pink Pink Pink Pink Blue
## [61] Pink Pink Pink Pink Blue Pink Pink Pink Pink Pink Pink Blue Pink Pink Pink
## [76] Pink Blue Pink Blue Blue Pink Blue Blue Pink Blue Blue Blue Pink Pink Pink
## [91] Blue Pink Pink Pink Pink Pink Blue Pink Pink Pink
## Levels: Blue Pink
my_sample$memory
## [1] 79 113 101 98 122 99 91 87 102 111 98 82 87 91 93 97 108 83
## [19] 93 64 90 96 86 86 105 62 100 91 97 88 94 101 86 74 97 98
## [37] 107 99 92 87 100 119 79 73 92 91 87 100 101 93 104 93 96 83
## [55] 88 91 79 92 103 99 92 96 90 103 96 89 85 89 92 80 88 85
## [73] 93 112 92 81 103 102 88 101 87 101 93 86 93 105 101 108 99 83
## [91] 96 74 102 76 89 90 83 100 78 96
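Side-by-side boxplots such as the ones above can be paired with a numerical summary for each group. A minimal sketch follows, using the data argument of boxplot() so the formula can refer to bare column names, and tapply() for the group medians. The small data frame and its values are hypothetical.

```r
# Hypothetical mini data set, invented for this sketch
demo <- data.frame(
  island  = factor(c("Blick", "Blick", "Blick", "Plume", "Plume", "Plume")),
  anxiety = c(44, 50, 53, 47, 49, 58)
)

# With data = demo, the formula can use column names directly
boxplot(anxiety ~ island, data = demo, ylim = c(30, 70))

# Median anxiety within each island
group.medians <- tapply(demo$anxiety, demo$island, median)
group.medians
```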
my_sample_2<- suppressWarnings(make.my.sample(33243684+1, 100, aliens))
boxplot(my_sample_2$memory~my_sample_2$color, ylim = c(75, 150))
my_sample$color
## [1] Pink Pink Blue Blue Pink Pink Pink Blue Blue Blue Pink Pink Pink Pink Blue
## [16] Blue Pink Blue Pink Blue Blue Blue Blue Pink Blue Pink Pink Pink Pink Pink
## [31] Pink Pink Pink Pink Blue Pink Pink Pink Pink Pink Pink Pink Pink Blue Pink
## [46] Blue Pink Pink Blue Pink Pink Pink Pink Pink Pink Pink Pink Pink Pink Blue
## [61] Pink Pink Pink Pink Blue Pink Pink Pink Pink Pink Pink Blue Pink Pink Pink
## [76] Pink Blue Pink Blue Blue Pink Blue Blue Pink Blue Blue Blue Pink Pink Pink
## [91] Blue Pink Pink Pink Pink Pink Blue Pink Pink Pink
## Levels: Blue Pink
my_sample$memory
## [1] 79 113 101 98 122 99 91 87 102 111 98 82 87 91 93 97 108 83
## [19] 93 64 90 96 86 86 105 62 100 91 97 88 94 101 86 74 97 98
## [37] 107 99 92 87 100 119 79 73 92 91 87 100 101 93 104 93 96 83
## [55] 88 91 79 92 103 99 92 96 90 103 96 89 85 89 92 80 88 85
## [73] 93 112 92 81 103 102 88 101 87 101 93 86 93 105 101 108 99 83
## [91] 96 74 102 76 89 90 83 100 78 96
This boxplot differs from the one in Question 10: there are no outliers, and the minimum is lower.

## Question 12

Make a new variable, with at least 40 values, that has (a) a mean of about 100, and (b) a fairly uniform distribution. Make a histogram and a boxplot.
new_var12 <- c(50, 100, 150)
mean(new_var12)
## [1] 100
median(new_var12)
## [1] 100
hist(new_var12)
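The prompt asks for at least 40 values. One way to build such a variable, offered only as a sketch, is to take evenly spaced values with seq(). The name uniform.var and the range of 50 to 150 are choices made for illustration.

```r
# 41 evenly spaced values from 50 to 150, centered on 100
uniform.var <- seq(50, 150, length.out = 41)

length(uniform.var)
mean(uniform.var)
median(uniform.var)

# Bins of width 10 hold nearly equal counts, so the histogram is roughly flat
hist(uniform.var, breaks = seq(50, 150, by = 10))
boxplot(uniform.var)
```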
Make another new variable, again with at least 40 values and a mean of about 100, that has a fairly symmetrical, unimodal distribution. Again, make a histogram and boxplot.
new_var12 <- c(25, 75, 100, 125, 150, 175)
mean(new_var12)
## [1] 108.3333
median(new_var12)
## [1] 112.5
hist(new_var12)
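For a symmetrical, unimodal variable with at least 40 values, one possible approach is to pass evenly spaced probabilities through the normal quantile function, which gives a deterministic bell-shaped set of values. This is a sketch; the name symmetric.var and the standard deviation of 15 are assumptions made for illustration.

```r
# 41 evenly spaced probabilities mapped through qnorm() give values that are
# symmetric around 100 and densest near the center
symmetric.var <- qnorm(ppoints(41), mean = 100, sd = 15)

length(symmetric.var)
mean(symmetric.var)
median(symmetric.var)

hist(symmetric.var)
boxplot(symmetric.var)
```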
Now make another one, again with at least 40 values and a mean of about 100, that has a bimodal distribution. Again, make a histogram and boxplot.
new_var12 <- c(94, 96, 98, 100, 102, 104, 106)
mean(new_var12)
## [1] 100
median(new_var12)
## [1] 100
hist(new_var12)
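A bimodal variable with at least 40 values can be sketched by combining two separate clusters, one below 100 and one above. The cluster centers (80 and 120), the spread, and the object names are assumptions made for illustration.

```r
# Two bell-shaped clusters of 20 values each, centered on 80 and 120
low.cluster  <- qnorm(ppoints(20), mean = 80,  sd = 5)
high.cluster <- qnorm(ppoints(20), mean = 120, sd = 5)
bimodal.var  <- c(low.cluster, high.cluster)

length(bimodal.var)
mean(bimodal.var)

# The histogram shows two peaks with an empty gap between them
h <- hist(bimodal.var, breaks = seq(65, 135, by = 5))
boxplot(bimodal.var)
```

The boxplot of this variable draws a single box spanning both clusters, so the two peaks are visible only in the histogram.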
Based on 12-14, do you think histograms or boxplots are better for showing the detailed shape of distributions? Why?
I think histograms are better for showing the shape of a distribution, because it is easier to see how the frequencies differ across the range of the data. For this reason, I think histograms are the more efficient way to view the shape of the data.
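Add a single extremely high value to one of your variables from problems 12-14. What happens to the mean? What happens to the median?
The mean rose sharply, from 100 to 200, while the median rose much less, from 100 to 125. The median is more resistant to a single extreme value than the mean.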
new_var12 <- c(50, 100, 150, 500)
mean(new_var12)
## [1] 200
median(new_var12)
## [1] 125
hist(new_var12)
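The effect of the extreme value can be measured directly by comparing each statistic before and after adding it. A minimal sketch using the values from the code above follows; the object names are chosen for illustration.

```r
# Values from the code above, before and after adding the extreme value 500
base.var     <- c(50, 100, 150)
with.outlier <- c(base.var, 500)

# Change in each statistic caused by the single extreme value
change <- c(
  mean   = mean(with.outlier) - mean(base.var),
  median = median(with.outlier) - median(base.var)
)
change
```

The mean moves by 100 while the median moves by 25.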