Preliminaries

aliens <- read.csv("aliens.csv", header = TRUE, stringsAsFactors = TRUE)
library(skimr)
source('special_functions.R')
my_sample <- suppressWarnings(make.my.sample(33243684, 100, aliens))

Question 1

The first thing you’ll do is to make a simple table showing the distribution of a categorical variable, using the table command. To make a table showing the distribution of the college variable in your sample of aliens, naming it ‘college.table’, and showing the output, do this:

college.table <- table(my_sample$college)
college.table
## 
## Callisto   Europa Ganymede       Io 
##       27       22       28       23
color.table <- table(my_sample$color)
color.table
## 
## Blue Pink 
##   30   70
politics.table <- table(my_sample$politics)
politics.table
## 
## Democrulite Independone Republicant 
##          32          38          30
anxiety.table <- table(my_sample$anxiety)
anxiety.table
## 
## 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 63 
##  2  1  1  5  4  6  8  7  6 10 10  6  7  6  6  2  2  5  3  1  1  1
intelligence.table <- table(my_sample$intelligence)
intelligence.table
## 
##  85  89  91  92  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108 109 
##   1   1   1   1   1   2   3   1   3   4   4   2   3   5   2   3   4   2   4   6 
## 110 111 112 113 114 115 116 117 118 119 121 122 123 124 127 130 
##   1   1   4   3   5   6   2   4   6   4   4   2   2   1   1   1
income.table <- table(my_sample$income)
income.table
## 
##  13000  15000  18000  21000  22000  25000  26000  27000  28000  30000  31000 
##      1      2      1      2      1      2      2      3      3      1      1 
##  32000  33000  34000  35000  36000  37000  38000  39000  41000  43000  46000 
##      3      3      3      2      1      1      2      2      1      2      1 
##  48000  49000  50000  51000  53000  54000  55000  57000  58000  60000  61000 
##      3      2      1      2      1      1      1      3      1      2      1 
##  62000  63000  64000  65000  66000  67000  68000  70000  71000  73000  74000 
##      1      1      1      1      1      1      1      1      1      1      1 
##  76000  78000  79000  81000  82000  83000  85000  90000  93000  97000  1e+05 
##      1      1      1      2      1      1      1      1      2      1      1 
## 103000 104000 110000 111000 113000 115000 118000 120000 130000 132000 138000 
##      1      1      1      2      1      1      1      1      1      1      1 
## 142000 143000 144000 153000 157000 214000 
##      1      1      1      1      1      1

27 Callisto 22 Europa 28 Ganymede 23 Io

30 Blue 70 Pink

32 Democrulite 38 Independone 30 Republicant
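As a small follow-up sketch, the counts from table can be turned into proportions with prop.table. The factor below is rebuilt from the college counts printed above (27, 22, 28, 23) rather than read from aliens.csv, and the name college_demo is only illustrative.

```r
# Rebuild a factor with the same counts as college.table above
# (college_demo is an illustrative name, not part of the assignment).
college_demo <- factor(rep(c("Callisto", "Europa", "Ganymede", "Io"),
                           times = c(27, 22, 28, 23)))

college_counts <- table(college_demo)         # counts per category
college_props  <- prop.table(college_counts)  # counts divided by the total

college_counts
college_props
```

Because the sample has exactly 100 aliens, each proportion is simply the count divided by 100.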

Question 2

You can make a bar graph of the results by making your table into an argument of the barplot function, like this:

barplot(college.table)

barplot(color.table)

barplot(politics.table)

Do this, and also do it for the two other variables that you used in Question 1.

Question 3

You can also make a contingency table showing the joint distribution of two categorical variables, with the same table function that you used in Question 1. You simply have to give it the two variables as separate arguments, separating them with a comma. Make two separate contingency tables, for two distinct pairs of variables.

college.color.table <- table(my_sample$college,my_sample$color)
college.color.table
##           
##            Blue Pink
##   Callisto   10   17
##   Europa      5   17
##   Ganymede    6   22
##   Io          9   14
color.politics.table <- table(my_sample$color,my_sample$politics)
color.politics.table
##       
##        Democrulite Independone Republicant
##   Blue           5          12          13
##   Pink          27          26          17
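A contingency table of raw counts can be hard to compare across rows of different sizes. The sketch below rebuilds the color-by-politics counts printed above (the name color_politics_demo is illustrative) and uses addmargins and prop.table to add totals and within-row proportions.

```r
# Rebuild the color-by-politics counts printed above as a table
# (color_politics_demo is an illustrative name).
color_politics_demo <- as.table(matrix(
  c(5, 12, 13,
    27, 26, 17),
  nrow = 2, byrow = TRUE,
  dimnames = list(color = c("Blue", "Pink"),
                  politics = c("Democrulite", "Independone", "Republicant"))
))

# Add row and column totals to the counts.
addmargins(color_politics_demo)

# Proportions within each color (each row sums to 1).
row_props <- prop.table(color_politics_demo, margin = 1)
round(row_props, 2)
```

Within each row, the proportions show how politics is distributed for that color, which makes the Blue and Pink rows directly comparable even though there are 30 Blue and 70 Pink aliens.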

Question 4

You can also make a bar graph that shows the joint distribution of two variables by using a contingency table (like the one you made in Question 3) as the argument to the barplot function.

barplot(college.color.table, beside = T, legend.text = T)

barplot(color.politics.table, beside = T, legend.text = T)

Explain what these two arguments do. Do you like these graphs better with, or without, these arguments?

The beside = T argument draws the bars for each group next to each other instead of stacking them in a single bar, and legend.text = T adds a legend that matches each bar color to a row of the table (the colleges in the first graph, the colors in the second). I think having these arguments is very useful, because side-by-side bars and a legend make the chart easier to read and understand.
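One way to see what beside does is to look at what barplot returns, which is the set of bar midpoints. This is only a sketch: the matrix is copied from the college.color.table output in Question 3, and the names college_color_demo, mids_stacked, and mids_beside are illustrative.

```r
# Counts copied from college.color.table above (rows = college, columns = color).
college_color_demo <- matrix(
  c(10, 17,
     5, 17,
     6, 22,
     9, 14),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("Callisto", "Europa", "Ganymede", "Io"), c("Blue", "Pink"))
)

# Default: one stacked bar per column (color), split by row (college).
mids_stacked <- barplot(college_color_demo, legend.text = TRUE)

# beside = TRUE: one bar per cell, grouped by column.
mids_beside <- barplot(college_color_demo, beside = TRUE, legend.text = TRUE)

length(mids_stacked)  # number of stacked bars
dim(mids_beside)      # layout of the side-by-side bars
```

The stacked version returns one midpoint per column, while the beside version returns a matrix with one midpoint per cell of the table.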

Question 5

Based on the graphs you made in Question 4, do you conclude anything about how the categorical variables that you’re looking at might be related to each other? If so, what do you conclude?

The graphs show the differences between the color groups more clearly. For example, there are many more Pink aliens than Blue aliens in the sample overall (70 versus 30). Democrulite also has the most Pink aliens (27) and the fewest Blue aliens (5).

Question 6

Use all of these functions to give both numerical and graphical summaries of the anxiety, income, and intelligence variables. Compare the median and the mean, and explain the patterns you find.

hist(my_sample$anxiety)

boxplot(my_sample$anxiety)

mean(my_sample$anxiety)
## [1] 49.74
median(my_sample$anxiety)
## [1] 49.5
var(my_sample$anxiety)
## [1] 22.17414
sd(my_sample$anxiety)
## [1] 4.708943
hist(my_sample$income)

boxplot(my_sample$income)

mean(my_sample$income)
## [1] 64050
median(my_sample$income)
## [1] 54500
var(my_sample$income)
## [1] 1514027778
sd(my_sample$income)
## [1] 38910.51
hist(my_sample$intelligence)

boxplot(my_sample$intelligence)

mean(my_sample$intelligence)
## [1] 108.93
median(my_sample$intelligence)
## [1] 109
var(my_sample$intelligence)
## [1] 85.96475
sd(my_sample$intelligence)
## [1] 9.271718
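The boxplot function marks points that lie more than 1.5xIQR beyond the box as outliers. As a rough sketch of that rule, the code below rebuilds the anxiety values from the counts in anxiety.table (Question 1) and computes the fences by hand. The names anxiety_demo, lower_fence, upper_fence, and flagged are illustrative, and quantile is used for the quartiles, which can differ slightly from the hinges that boxplot uses.

```r
# Rebuild the anxiety sample from the counts in anxiety.table (Question 1).
anxiety_demo <- rep(c(40:60, 63),
                    times = c(2, 1, 1, 5, 4, 6, 8, 7, 6, 10, 10,
                              6, 7, 6, 6, 2, 2, 5, 3, 1, 1, 1))

# First and third quartiles, and the interquartile range.
q <- quantile(anxiety_demo, c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Fences of the 1.5xIQR rule.
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Values outside the fences are flagged as outliers.
flagged <- anxiety_demo[anxiety_demo < lower_fence | anxiety_demo > upper_fence]

mean(anxiety_demo)
median(anxiety_demo)
c(lower_fence, upper_fence)
flagged
```

The same steps can be applied to income or intelligence by swapping in those values.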

Question 7

Are there outliers in the distributions you’ve looked at, based on the 1.5xIQR rule? If so, do you think these points should be excluded from the data set for the purpose of descriptive statistics, or should they be kept? Justify your conclusion.

Applying the 1.5xIQR rule to the values tabulated in Question 1, income has two high outliers (157000 and 214000). The largest anxiety value, 63, falls just inside its upper fence of 63.5, so anxiety has no outliers by this rule, and neither does intelligence.

Question 8

Sometimes the histogram that R gives you doesn’t have suitable bin sizes, and you need to tell the hist function how many bins to make. One way to do this is to use the breaks argument (for example, breaks = 50). Try different numbers of bins for the histogram of the income variable.

hist(my_sample$income, breaks=50)

hist(my_sample$income, breaks=100)
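The two calls above pass a single number as breaks. R treats a single number as a suggestion and rounds to convenient breakpoints, so the number of bars may not match the number requested. The sketch below, which uses only a handful of income values taken from income.table (the name income_demo is illustrative), compares a suggested number of bins with an explicit vector of breakpoints.

```r
# A few income values taken from income.table above (not the full sample).
income_demo <- c(13000, 21000, 27000, 33000, 48000, 57000,
                 81000, 111000, 157000, 214000)

# A single number is only a suggestion: R rounds to "pretty" breakpoints.
h_suggested <- hist(income_demo, breaks = 5, plot = FALSE)

# A vector of breakpoints is used exactly as given (here, bins 20000 wide).
h_exact <- hist(income_demo, breaks = seq(0, 220000, by = 20000), plot = FALSE)

h_suggested$breaks
h_exact$breaks
h_exact$counts
```

Setting plot = FALSE returns the bin information without drawing, which is a quick way to inspect the breakpoints that a given breaks value produces.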

What do you think is the best value for the breaks argument in this case? Why?

I think the best value for the breaks argument is 50, because the bars are easier to see, which makes the frequencies they reach easier to read.

Question 9

The boxplot function also allows you to make separate boxplots, based on the values of a categorical variable, like this:

boxplot(my_sample$anxiety~my_sample$island, ylim = c(30, 70))

my_sample$island
##   [1] Nanspucket Plume      Plume      Plume      Plume      Blick     
##   [7] Plume      Blick      Plume      Nanspucket Nanspucket Blick     
##  [13] Nanspucket Plume      Blick      Plume      Blick      Plume     
##  [19] Plume      Plume      Plume      Plume      Nanspucket Blick     
##  [25] Nanspucket Nanspucket Plume      Blick      Blick      Blick     
##  [31] Plume      Plume      Nanspucket Plume      Plume      Nanspucket
##  [37] Blick      Plume      Blick      Blick      Plume      Nanspucket
##  [43] Plume      Nanspucket Nanspucket Blick      Blick      Blick     
##  [49] Nanspucket Blick      Plume      Plume      Blick      Plume     
##  [55] Blick      Plume      Plume      Blick      Blick      Blick     
##  [61] Blick      Plume      Plume      Blick      Blick      Plume     
##  [67] Nanspucket Plume      Nanspucket Plume      Blick      Plume     
##  [73] Blick      Blick      Nanspucket Plume      Plume      Nanspucket
##  [79] Blick      Nanspucket Nanspucket Blick      Plume      Plume     
##  [85] Nanspucket Blick      Nanspucket Plume      Nanspucket Nanspucket
##  [91] Nanspucket Blick      Blick      Blick      Nanspucket Plume     
##  [97] Nanspucket Plume      Plume      Blick     
## Levels: Blick Nanspucket Plume
my_sample$anxiety
##   [1] 52 52 51 55 49 40 49 53 51 47 46 49 50 57 59 49 43 53 48 48 55 46 45 51 63
##  [26] 50 44 43 47 51 50 47 46 46 45 53 50 57 45 46 51 48 52 47 57 56 48 45 47 52
##  [51] 54 46 50 43 45 53 49 52 57 50 44 40 42 44 58 49 49 53 44 58 50 50 47 47 51
##  [76] 54 54 60 54 58 56 52 48 46 49 54 52 46 49 48 50 57 41 54 50 43 45 53 43 49

Note that the little squiggly line (~) can be read as ‘depends on’: in this case, you’re making a boxplot of anxiety, depending on island. Run this command. Why does it include the ylim argument? What happens if you leave this out? What do these boxplots tell you about the distribution of anxiety for aliens from each of the three islands?

The ylim argument sets the lower and upper limits of the y-axis. If you leave this argument out, R chooses the limits from the range of the data instead. The boxplots show that the median anxiety is very similar for all three islands, and that the maximum and minimum for Blick and Plume are very similar. Nanspucket’s maximum is also similar, but its minimum is higher.

Question 10

Make two more side-by-side boxplots, in each case exploring the distribution of one of the quantitative variables, depending on the values of one of the categorical variables. Explain what you learn from these plots.

boxplot(my_sample$intelligence~my_sample$college, ylim = c(75, 150))

my_sample$intelligence
##   [1]  94 112 121 103 130 123 114 112 108 118 122  96 114 100 105 100 119 113
##  [19] 117  85  99 119 112 108 106  98 117  98 102 114 102 107 100 103 103 119
##  [37] 111 106 117 112 103 116  91 103 116 115 115 115 105 115 121 104 105  96
##  [55]  99 115 109  98 109 108 100 118 101 121 121 114  99  97 101  92 113 114
##  [73]  99 109 118  89 127 118 117 118  96 122 104 113 102 123 106 124 108 110
##  [91] 115  95 106 107 118 109  95 109 109 119
my_sample$college
##   [1] Europa   Callisto Ganymede Europa   Io       Ganymede Ganymede Io      
##   [9] Callisto Callisto Ganymede Callisto Ganymede Europa   Callisto Callisto
##  [17] Ganymede Io       Io       Callisto Callisto Io       Ganymede Io      
##  [25] Europa   Ganymede Io       Europa   Callisto Io       Callisto Europa  
##  [33] Europa   Io       Europa   Ganymede Europa   Callisto Ganymede Ganymede
##  [41] Callisto Europa   Callisto Io       Ganymede Io       Ganymede Ganymede
##  [49] Callisto Ganymede Ganymede Callisto Callisto Europa   Callisto Ganymede
##  [57] Ganymede Europa   Europa   Callisto Europa   Ganymede Callisto Ganymede
##  [65] Io       Io       Callisto Europa   Europa   Europa   Io       Io      
##  [73] Europa   Europa   Io       Callisto Ganymede Ganymede Ganymede Io      
##  [81] Europa   Ganymede Callisto Ganymede Europa   Ganymede Callisto Ganymede
##  [89] Callisto Io       Io       Callisto Callisto Io       Ganymede Io      
##  [97] Europa   Callisto Io       Io      
## Levels: Callisto Europa Ganymede Io
boxplot(my_sample$memory~my_sample$color, ylim = c(75, 150))

my_sample$color
##   [1] Pink Pink Blue Blue Pink Pink Pink Blue Blue Blue Pink Pink Pink Pink Blue
##  [16] Blue Pink Blue Pink Blue Blue Blue Blue Pink Blue Pink Pink Pink Pink Pink
##  [31] Pink Pink Pink Pink Blue Pink Pink Pink Pink Pink Pink Pink Pink Blue Pink
##  [46] Blue Pink Pink Blue Pink Pink Pink Pink Pink Pink Pink Pink Pink Pink Blue
##  [61] Pink Pink Pink Pink Blue Pink Pink Pink Pink Pink Pink Blue Pink Pink Pink
##  [76] Pink Blue Pink Blue Blue Pink Blue Blue Pink Blue Blue Blue Pink Pink Pink
##  [91] Blue Pink Pink Pink Pink Pink Blue Pink Pink Pink
## Levels: Blue Pink
my_sample$memory
##   [1]  79 113 101  98 122  99  91  87 102 111  98  82  87  91  93  97 108  83
##  [19]  93  64  90  96  86  86 105  62 100  91  97  88  94 101  86  74  97  98
##  [37] 107  99  92  87 100 119  79  73  92  91  87 100 101  93 104  93  96  83
##  [55]  88  91  79  92 103  99  92  96  90 103  96  89  85  89  92  80  88  85
##  [73]  93 112  92  81 103 102  88 101  87 101  93  86  93 105 101 108  99  83
##  [91]  96  74 102  76  89  90  83 100  78  96
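The memory printout above includes values as low as 62, which fall below the lower ylim of 75 used in the memory boxplot and so would not be drawn. As a sketch of one alternative, the code below takes the axis limits from the data and adds group medians as a numerical companion to the plot. It uses only the first 15 aliens from the printouts above, and the names demo, y_limits, and group_medians are illustrative.

```r
# First 15 aliens from the my_sample$color and my_sample$memory printouts above.
demo <- data.frame(
  color = factor(c("Pink", "Pink", "Blue", "Blue", "Pink", "Pink", "Pink", "Blue",
                   "Blue", "Blue", "Pink", "Pink", "Pink", "Pink", "Blue")),
  memory = c(79, 113, 101, 98, 122, 99, 91, 87, 102, 111, 98, 82, 87, 91, 93)
)

# Choose y-axis limits from the data so that no point falls outside the plot.
y_limits <- range(demo$memory) + c(-5, 5)
boxplot(memory ~ color, data = demo, ylim = y_limits)

# Numerical companion to the plot: the median of memory within each color.
group_medians <- tapply(demo$memory, demo$color, median)
group_medians
```

The data argument lets the formula refer to the column names directly, which avoids repeating the data frame name on both sides of the squiggly line.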

Question 11

my_sample_2 <- suppressWarnings(make.my.sample(33243684+1, 100, aliens))
# Group the new sample's memory values by the new sample's own color variable
boxplot(my_sample_2$memory~my_sample_2$color, ylim = c(75, 150))

my_sample$color
##   [1] Pink Pink Blue Blue Pink Pink Pink Blue Blue Blue Pink Pink Pink Pink Blue
##  [16] Blue Pink Blue Pink Blue Blue Blue Blue Pink Blue Pink Pink Pink Pink Pink
##  [31] Pink Pink Pink Pink Blue Pink Pink Pink Pink Pink Pink Pink Pink Blue Pink
##  [46] Blue Pink Pink Blue Pink Pink Pink Pink Pink Pink Pink Pink Pink Pink Blue
##  [61] Pink Pink Pink Pink Blue Pink Pink Pink Pink Pink Pink Blue Pink Pink Pink
##  [76] Pink Blue Pink Blue Blue Pink Blue Blue Pink Blue Blue Blue Pink Pink Pink
##  [91] Blue Pink Pink Pink Pink Pink Blue Pink Pink Pink
## Levels: Blue Pink
my_sample$memory
##   [1]  79 113 101  98 122  99  91  87 102 111  98  82  87  91  93  97 108  83
##  [19]  93  64  90  96  86  86 105  62 100  91  97  88  94 101  86  74  97  98
##  [37] 107  99  92  87 100 119  79  73  92  91  87 100 101  93 104  93  96  83
##  [55]  88  91  79  92 103  99  92  96  90 103  96  89  85  89  92  80  88  85
##  [73]  93 112  92  81 103 102  88 101  87 101  93  86  93 105 101 108  99  83
##  [91]  96  74 102  76  89  90  83 100  78  96

This boxplot differs from the one in Question 10: there are no outliers, and the minimum is lower.

Question 12

Make a new variable, with at least 40 values, that has (a) a mean of about 100, and (b) a fairly uniform distribution. Make a histogram and a boxplot.

new_var12 <- c(50, 100, 150)
mean(new_var12)
## [1] 100
median(new_var12)
## [1] 100
hist(new_var12)
boxplot(new_var12)
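The vector above has only three values, while the question asks for at least 40. One possible sketch that meets that requirement is an evenly spaced sequence centered on 100 (the name uniform_demo is illustrative).

```r
# 40 evenly spaced values from 61 to 139 (uniform_demo is an illustrative name).
uniform_demo <- seq(from = 61, to = 139, by = 2)

length(uniform_demo)
mean(uniform_demo)
median(uniform_demo)

hist(uniform_demo)
boxplot(uniform_demo)
```

Evenly spaced values give a flat histogram by construction. Random draws from runif would be another option, although their mean would only be close to 100 rather than exactly 100.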

Question 13

Make another new variable, again with at least 40 values and a mean of about 100, that has a fairly symmetrical, unimodal distribution. Again, make a histogram and boxplot.

new_var13 <- c(25, 75, 100, 125, 150, 175)
mean(new_var13)
## [1] 108.3333
median(new_var13)
## [1] 112.5
hist(new_var13)
boxplot(new_var13)
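The vector above has six values and is not centered on 100. As a sketch of a variable that does satisfy the question, the code below repeats values so that the counts rise to a single peak at 100 and fall off evenly on both sides (the name symmetric_demo is illustrative).

```r
# 40 values whose counts peak at 100 and mirror each other on both sides
# (symmetric_demo is an illustrative name).
symmetric_demo <- rep(c(85, 90, 95, 100, 105, 110, 115),
                      times = c(2, 5, 8, 10, 8, 5, 2))

length(symmetric_demo)
mean(symmetric_demo)
median(symmetric_demo)

hist(symmetric_demo)
boxplot(symmetric_demo)
```

Because the counts are mirrored around 100, the mean and the median coincide.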

Question 14

Now make another one, again with at least 40 values and a mean of about 100, that has a bimodal distribution. Again, make a histogram and boxplot.

new_var14 <- c(94, 96, 98, 100, 102, 104, 106)
mean(new_var14)
## [1] 100
median(new_var14)
## [1] 100
hist(new_var14)
boxplot(new_var14)
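The vector above is evenly spread rather than bimodal. A sketch of a bimodal variable with at least 40 values and a mean of 100 can be built from two clusters, one around 80 and one around 120, with very few values in between (the name bimodal_demo is illustrative).

```r
# 40 values in two clusters, centered near 80 and near 120
# (bimodal_demo is an illustrative name).
bimodal_demo <- rep(c(75, 80, 85, 100, 115, 120, 125),
                    times = c(5, 9, 5, 2, 5, 9, 5))

length(bimodal_demo)
mean(bimodal_demo)
median(bimodal_demo)

hist(bimodal_demo)
boxplot(bimodal_demo)
```

The histogram shows the two peaks, while the boxplot of the same data shows only a wide box, which is worth keeping in mind for Question 15.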

Question 15

Based on 12-14, do you think histograms or boxplots are better for showing the detailed shape of distributions? Why?

I think histograms are better for showing the shape of a distribution, because you can see the differences between the levels of the data more easily. Because of this, I think histograms are overall a more efficient way to view data.

Question 16

Add a single extremely high value to one of your variables from problems 12-14. What happens to the mean? What happens to the median?

The mean and median both became higher: the mean doubled from 100 to 200, while the median rose by less, from 100 to 125.

new_var12 <- c(50, 100, 150, 500)
mean(new_var12)
## [1] 200
median(new_var12)
## [1] 125
hist(new_var12)
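With only a few values, one extra point moves the median noticeably. As a sketch of how this looks with a larger variable, the code below assumes a 40-value, evenly spaced variable (the names base_demo and with_outlier are illustrative) and adds the same extreme value of 500.

```r
# 40 evenly spaced values with mean and median 100 (illustrative names).
base_demo <- seq(from = 61, to = 139, by = 2)

# The same variable with one extremely high value added.
with_outlier <- c(base_demo, 500)

c(mean = mean(base_demo), median = median(base_demo))
c(mean = mean(with_outlier), median = median(with_outlier))
```

In this larger variable the mean rises by almost 10, while the median moves by only 1, because the median depends on the order of the values and not on how extreme the largest one is.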