A: 300 tanning salons nationwide
A: Yes, because the dataset included at least 3 randomly selected salons from each state.
A: Because the data were collected from people, bias is possible. The investigators may only have phoned the randomly selected salons in each state instead of visiting them in person.
A: Probably not. 90% of the salons which did not pose a health risk might directly inform their customers about the risk of skin cancer, so the collected sample might be biased.
A: Hospital A
A: Hospital B
A: Hospital A
| Death rate by patient condition | Hospital A | Hospital B |
|---|---|---|
| Good | 0.01 (6/600) | 0.016 (8/500) |
| Poor | 0.038 (57/1500) | 0.04 (8/200) |
From the table above, which shows the death rate for each patient condition, Hospital A had the lower death rate for patients in both conditions.
The table of results when all patients (in either condition) are included is:
| _ | Hospital A | Hospital B |
|---|---|---|
| Died | 63 | 16 |
| Survived | 2037 | 684 |
| Total | 2100 | 700 |
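A: Hospital B had the lower overall death rate. Hospital A death rate = (6+57)/(600+1500) = 63/2100 = 3%. Hospital B death rate = (8+8)/(500+200) = 16/700 = 2.286%.
A: From (c) and (e), we can conclude that the lurking variable (the condition of the patients) had a large effect on the overall death rate. Hospital A treated mostly patients in poor condition (1500 of 2100), which raised its overall rate, yet it had the lower death rate within each condition. So Hospital A would really be the better choice in this case.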
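As a rough check on the reasoning above, the following sketch recomputes the death rates within each condition and overall, using only the counts from the two tables. The object names (`deaths`, `patients`, `condition_mix`) are illustrative and not part of the homework template.

```r
# Deaths and patient counts from the tables above, by condition and hospital.
# matrix() fills column by column: first Hospital A, then Hospital B.
deaths <- matrix(c(6, 57, 8, 8), nrow = 2,
                 dimnames = list(condition = c("Good", "Poor"),
                                 hospital  = c("A", "B")))
patients <- matrix(c(600, 1500, 500, 200), nrow = 2,
                   dimnames = dimnames(deaths))

# Death rate within each condition (one row per condition).
rate_by_condition <- deaths / patients

# Death rate when the condition is ignored.
rate_overall <- colSums(deaths) / colSums(patients)

# Share of each hospital's patients in each condition (the lurking variable).
condition_mix <- prop.table(patients, margin = 2)

print(round(rate_by_condition, 3))
print(round(rate_overall, 4))
print(round(condition_mix, 3))
```

The last table is the useful one for interpretation: it shows how differently the two hospitals' patients are split between good and poor condition, which is what drives the reversal.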
The example below, which I found on the internet, gives the numbers of on-time and delayed flights for Alaska Airlines and America West Airlines at 5 airports in June 1991.

Reference: Moore, David S. (2003), The Basic Practice of Statistics (3rd ed.), New York, NY: W.H. Freeman.

| Airport | Alaska: On time | Alaska: Delayed | Alaska: Delay rate | America West: On time | America West: Delayed | America West: Delay rate |
|---|---|---|---|---|---|---|
| LA | 497 | 62 | 11.1% | 694 | 117 | 14.4% |
| Phoenix | 221 | 12 | 5.2% | 4840 | 415 | 7.9% |
| San Diego | 212 | 20 | 8.6% | 383 | 65 | 14.5% |
| San Fran. | 503 | 102 | 16.9% | 320 | 129 | 28.7% |
| Seattle | 1841 | 305 | 14.2% | 201 | 61 | 23.3% |
| Total | 3274 | 501 | 13.3% | 6438 | 787 | 10.9% |
The table shows that America West Airlines’ delay rate is higher at each of the 5 airports. However, Alaska Airlines’ overall delay rate is higher when the airport (the lurking variable) is not considered.
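The delay rates in the table can be recomputed from the raw counts. The sketch below is one way to do that, assuming the delay rate is defined as delayed / (on time + delayed). The data frame `flights` and the helper `delay_rate()` are names introduced here for illustration.

```r
# Counts copied from the table above.
flights <- data.frame(
  airport        = c("LA", "Phoenix", "San Diego", "San Fran.", "Seattle"),
  alaska_ontime  = c(497, 221, 212, 503, 1841),
  alaska_delayed = c(62, 12, 20, 102, 305),
  west_ontime    = c(694, 4840, 383, 320, 201),
  west_delayed   = c(117, 415, 65, 129, 61)
)

# Assumed definition: delayed flights as a fraction of all flights.
delay_rate <- function(delayed, ontime) delayed / (delayed + ontime)

# Per-airport delay rates for each airline.
flights$alaska_rate <- delay_rate(flights$alaska_delayed, flights$alaska_ontime)
flights$west_rate   <- delay_rate(flights$west_delayed, flights$west_ontime)

# Overall delay rates, pooling all five airports.
overall <- c(
  alaska = delay_rate(sum(flights$alaska_delayed), sum(flights$alaska_ontime)),
  west   = delay_rate(sum(flights$west_delayed), sum(flights$west_ontime))
)

print(round(100 * flights[, c("alaska_rate", "west_rate")], 1))
print(round(100 * overall, 1))
```

Pooling weights each airport by its number of flights, so the airline that flies mostly out of a low-delay airport can look better overall even when it is worse at every airport.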
# need to replace next line with where the file is. See homework for how.
# code for Exercise 1 is already entered below as an example
criminal_justice <- read.csv("~/Documents/criminal_justice.csv")
head(criminal_justice)
## state police judicial corrections violent_crime
## 1 Alabama 159 71 103 486
## 2 Alaska 412 204 273 567
## 3 Arizona 231 120 206 532
## 4 Arkansas 149 72 137 445
## 5 California 290 201 257 622
## 6 Colorado 238 86 190 334
# Calculate mean, etc here.
summary(criminal_justice)
## state police judicial corrections
## Alabama : 1 Min. : 104 Min. : 52.0 Min. : 90
## Alaska : 1 1st Qu.: 164 1st Qu.: 72.5 1st Qu.:134
## Arizona : 1 Median : 190 Median : 92.0 Median :162
## Arkansas : 1 Mean : 224 Mean : 99.1 Mean :179
## California: 1 3rd Qu.: 232 3rd Qu.:114.0 3rd Qu.:204
## Colorado : 1 Max. :1348 Max. :205.0 Max. :609
## (Other) :45
## violent_crime
## Min. : 81
## 1st Qu.: 282
## Median : 384
## Mean : 442
## 3rd Qu.: 550
## Max. :1508
##
A: The data must be collected from every state and from many sources to find the proportions for each variable. Thus, police records should not be the sole source of data on the violent crime rate.
# Calculate mean, etc here.
summary(criminal_justice$violent_crime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 81 282 384 442 550 1510
A: The mean of the variable ‘violent_crime’ is 441.6. The median of the variable ‘violent_crime’ is 384.0.
# Calculate mean, etc here.
sd(criminal_justice$violent_crime)
## [1] 241.4
A: The standard deviation of the variable ‘violent_crime’ is 241.3983
# Calculate mean, etc here.
IQR(criminal_justice$violent_crime)
## [1] 268
A: The interquartile range of the variable ‘violent_crime’ is 268
# make your histograms here
hist(criminal_justice$violent_crime, col="grey", xlab="violent crime", ylab="Number of states", main="Violent Crime Rates of States (2004)" )
A: In the Violent Crime Rates of States histogram, the violent crime rate appears to range from 0 to about 1000, with a possible outlier near 1500. The distribution is not symmetric: it is skewed to the right, because the data stretch out more to the right than to the left.
# modify the code below to add appropriate axis labeling
hist(criminal_justice$violent_crime, breaks=15, col="grey",
xlim=c(0,2000), ylim=c(0,12),
xlab="violent crime", ylab="Number of states",
main="Violent Crime Rates of States (2004)" )
A: This histogram specifies the number of breaks (15) through the breaks option. It gives the graph more detail and shows some differences from the previous one. The violent crime rate appears to range from 0 to about 900, with a possible outlier above 1500.
# median of the variable police
median(criminal_justice$police)
## [1] 190
# create a new variable
# (this recode() range syntax is assumed to come from the car package)
library(car)
criminal_justice$PoliceState <- recode( criminal_justice$police, "lo:190='low'; else='high'" )
head(criminal_justice)
## state police judicial corrections violent_crime PoliceState
## 1 Alabama 159 71 103 486 low
## 2 Alaska 412 204 273 567 high
## 3 Arizona 231 120 206 532 high
## 4 Arkansas 149 72 137 445 low
## 5 California 290 201 257 622 high
## 6 Colorado 238 86 190 334 high
Make your new variable and count the number of states that are high payment and low payment. Does this align with what you expect, given the definition of a median?
tally( ~PoliceState, data=criminal_justice )
##
## high low
## 25 26
A: The result shows 26 low police states and 25 high police states. The median was calculated from criminal_justice$police, which has 51 records, so it is the 26th value after sorting. The recode puts values up to and including the median (190) in ‘low’, so the median state and the 25 states below it are low, and the remaining 25 are high. This agrees with the definition of a median, which splits the sorted data into two halves.
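The same median split can be sketched in base R with `ifelse()` and `table()`. The seven values below are illustrative only (an odd count with no ties), not the real 51-state data, and the names `police`, `cutoff`, and `group` are introduced for this example.

```r
# Illustrative spending values: odd count, no ties (not the real data set).
police <- c(159, 412, 231, 149, 290, 238, 190)

# With 7 values, the median is the 4th value after sorting.
cutoff <- median(police)

# Values up to and including the median go to "low", the rest to "high".
group  <- ifelse(police <= cutoff, "low", "high")
counts <- table(group)

print(cutoff)
print(counts)
```

With an odd number of distinct values, the "low" group always has one more member than the "high" group, because the median itself is counted as low.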
tapply(criminal_justice$violent_crime, criminal_justice$PoliceState, mean)
## high low
## 527.7 358.9
A: The mean of the low group is 358.88, while the mean of the high group is 527.68. The mean measures the center of each group: the sum of the ‘violent_crime’ values in the group divided by the number of states in the group. There is a high outlier in the dataset (see the histograms in c) and e)). This high value pulls up the mean of the group that contains it. Thus, the comparison might not come out the same if that value were excluded.
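To illustrate how a single high value pulls the mean much more than the median, here is a small sketch with made-up numbers. The vector `crime` is hypothetical and only loosely resembles the scale of the ‘violent_crime’ variable.

```r
# Hypothetical crime rates: five ordinary values plus one extreme value.
crime <- c(280, 330, 380, 450, 550, 1500)

# Center of the data with the extreme value included.
with_outlier <- c(mean = mean(crime), median = median(crime))

# Center of the data after dropping the extreme value.
trimmed <- crime[crime < 1000]
without_outlier <- c(mean = mean(trimmed), median = median(trimmed))

# How far each measure moved when the extreme value was added.
shift <- with_outlier - without_outlier

print(with_outlier)
print(without_outlier)
print(shift)
```

Comparing group medians alongside group means is one way to see whether a difference between groups depends on a single state.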
# make your scatterplot for 8) and 10) here
plot(criminal_justice$police,criminal_justice$violent_crime,
#xlim=c(0,1000), ylim=c(0,500),
xlab="police", ylab="violent_crime",
main="The relationship between police and violent_crime")
Describe the relationship. Are there any outliers? Is the trend linear?
A: ‘police’ is positively associated with ‘violent_crime’, with a strong association along a linear trend. One state with a high ‘police’ value also has a high ‘violent_crime’ value and is considered an outlier.
There is one outlier in the plot, and larger values of ‘police’ tend to be associated with higher levels of ‘violent_crime’. Since the points are scattered widely around any straight line, with a strong influence from the outlier, ‘police’ appears to have a weak positive linear association with ‘violent_crime’.
# make your scatterplot for 9) here
plot(criminal_justice$judicial, criminal_justice$violent_crime,
#xlim=c(0,1000), ylim=c(0,500),
xlab="judicial", ylab="violent_crime",
main="The relationship between judicial and violent_crime")
#abline(lm(criminal_justice$violent_crime~criminal_justice$judicial), col="red")
# regression line (y~x)
A: The association between ‘violent_crime’ and ‘judicial’ in this plot is weaker than in the original plot. ‘judicial’ appears to have a weak positive linear association with ‘violent_crime’, and there are also some outliers in the plot.
# make your scatterplot for 8) and 10) here
plot(criminal_justice$police,criminal_justice$violent_crime,
xlab="police", ylab="violent_crime",
main="The relationship between police and violent_crime")
model<-lm(criminal_justice$violent_crime~criminal_justice$police)
# regression line (y~x)
abline(model, col="red")
# summary model
model
##
## Call:
## lm(formula = criminal_justice$violent_crime ~ criminal_justice$police)
##
## Coefficients:
## (Intercept) criminal_justice$police
## 217.237 0.999
model$coefficients[[2]]*100+model$coefficients[[1]]
## [1] 317.2
A: The regression line is shown as a red line in the plot. From the output of the linear model ‘model’, the y-intercept is 217.2368 and the slope is 0.9994. Thus, the regression line for the relationship between violent_crime and police is y = 217.2368 + 0.9994*x. If police spending is only 100 per capita, the predicted violent_crime is 317.175. This prediction makes sense in my opinion: according to the trend in the raw data and the regression line, for ‘police’ values around 100, violent_crime could vary from 100 to 400.
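The prediction can also be obtained with `predict()` instead of indexing the coefficients by position. The sketch below uses only the six rows printed by `head(criminal_justice)`, since the CSV file is not available here, so its coefficients will differ from the ones reported above. The names `cj` and `fit` are introduced for illustration.

```r
# First six rows shown by head(criminal_justice); the full file has 51 rows,
# so this fit is NOT the model reported in the answer above.
cj <- data.frame(
  police        = c(159, 412, 231, 149, 290, 238),
  violent_crime = c(486, 567, 532, 445, 622, 334)
)

# Fit with a formula plus data=, so predict() can match 'police' in newdata.
fit <- lm(violent_crime ~ police, data = cj)

# Prediction at police = 100, two equivalent ways.
pred_predict <- predict(fit, newdata = data.frame(police = 100))
pred_manual  <- coef(fit)[["(Intercept)"]] + coef(fit)[["police"]] * 100

print(coef(fit))
print(pred_predict)
print(pred_manual)
```

Writing the formula as `violent_crime ~ police` with `data =` matters here: a model fitted as `criminal_justice$violent_crime ~ criminal_justice$police` cannot look up `police` in `newdata`, so `predict()` would not return the single value wanted.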
nrow( criminal_justice )
## [1] 51
crim = subset( criminal_justice, police < 1000 )
nrow( crim )
## [1] 50
# reproduce your scatterplot from 9) here, add the new line with a different color.
plot(criminal_justice$police,criminal_justice$violent_crime,
xlim=c(0,800),ylim=c(0,1000),
xlab="police", ylab="violent_crime",
main="The relationship between police and violent_crime")
model<-lm(criminal_justice$violent_crime~criminal_justice$police)
newmodel<-lm(crim$violent_crime~crim$police)
# regression line (y~x)
abline(model, col="red")
abline(newmodel, col="blue")
# summary model
newmodel
##
## Call:
## lm(formula = crim$violent_crime ~ crim$police)
##
## Coefficients:
## (Intercept) crim$police
## 137.6 1.4
newmodel$coefficients[[2]]*100+newmodel$coefficients[[1]]
## [1] 277.5
Which is better, do you think? Which gives a better prediction?
A: The regression line is shown as a blue line in the plot. From the output of the linear model ‘newmodel’, the y-intercept is 137.576 and the slope is 1.399. Thus, the regression line for the relationship between violent_crime and police is y = 137.576 + 1.399*x. If police spending is only 100 per capita, the predicted violent_crime for the new crim dataset is 277.497.
The regression line that fits the dataset best should be the one that gives the best predictions, with residuals closest to zero. The Multiple R-squared reported by summary(model) measures goodness of fit. In the first case the Multiple R-squared is 0.4967, and in the second case it is 0.1805. The higher the R-squared, the better the model fits the data. Therefore, the first case, with the whole dataset, gives a better prediction.
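The R-squared values can be pulled out of `summary()` directly rather than read off the printed output. The sketch below uses a toy data set: the six rows printed by `head(criminal_justice)` plus one extreme point built from the column maxima in `summary()`, which is assumed here (not confirmed by the output) to belong to a single state. The names `toy`, `full_fit`, and `trimmed_fit` are illustrative, and the numbers will not match 0.4967 and 0.1805.

```r
# Toy data: six rows from head() plus one assumed extreme point.
toy <- data.frame(
  police        = c(159, 412, 231, 149, 290, 238, 1348),
  violent_crime = c(486, 567, 532, 445, 622, 334, 1508)
)

# Fit once on all rows and once without the extreme point.
full_fit    <- lm(violent_crime ~ police, data = toy)
trimmed_fit <- lm(violent_crime ~ police, data = subset(toy, police < 1000))

# Multiple R-squared for each fit.
r2 <- c(full    = summary(full_fit)$r.squared,
        trimmed = summary(trimmed_fit)$r.squared)

# Residual standard error: the typical size of a residual, in crime-rate units.
sigma_hat <- c(full    = summary(full_fit)$sigma,
               trimmed = summary(trimmed_fit)$sigma)

print(round(r2, 3))
print(round(sigma_hat, 1))
```

Printing the residual standard error next to R-squared gives a second view of fit, since it is expressed in the units of the prediction itself.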
A: From these results we may conclude that the higher the spending on police, the higher the violent crime rate. But we cannot make a decision based on just these 2 variables. Where the violent crime rate is low, there is no need to spend more on police, while states with a higher violent crime rate have higher police costs.
A: From the tweets using the hashtag #iPhone6:
What kind of topic do the users talk about? > Categorical variable
How many people retweeted that topic? > Quantitative variable
How many people favourited that topic? > Quantitative variable
How many followers does the user who tweeted that topic have? > Quantitative variable
What is the user’s gender? > Categorical variable
Has he/she already bought or reserved an iPhone6, or does it seem like he/she does not want to buy one? > Categorical variable
# data loading
mydata<-read.csv("~/Documents/mydata.csv")
head(mydata)
## Topics retweet favourite follower sex BuyiPhone6
## 1 UserExperience 1379 46 356 female Y
## 2 UserExperience 3598 249 1192000 male Y
## 3 General 128 11 155 female N
## 4 General 1931 114 1536000 male N
## 5 News 683 43 716000 male N
## 6 News 467 24 1067000 male Y
# compute the mean
summary(mydata)
## Topics retweet favourite follower
## Features :3 Min. : 16 Min. : 1.0 Min. : 155
## General :7 1st Qu.: 34 1st Qu.: 9.5 1st Qu.: 5142
## News :5 Median : 128 Median : 31.0 Median : 86000
## Opinion :4 Mean : 599 Mean : 173.6 Mean : 467252
## UserExperience:4 3rd Qu.:1037 3rd Qu.: 82.0 3rd Qu.: 357500
## Max. :3598 Max. :2838.0 Max. :4285000
## sex BuyiPhone6
## female: 8 N:10
## male :15 Y:13
##
##
##
##
# make a table
tally( BuyiPhone6~Topics, data=mydata, format="count")
## Topics
## BuyiPhone6 Features General News Opinion UserExperience
## N 0 4 3 3 0
## Y 3 3 2 1 4
# plot of your choice with two variables
bargraph(~BuyiPhone6, group=Topics, data = mydata, auto.key=TRUE, horizontal = TRUE)
A: Among Twitter accounts talking about #iPhone6, the users who tweeted about their user experience of the iPhone6 appear to be users who had already reserved or bought one.
A: My sample was picked haphazardly from highly retweeted tweets by Twitter users who put the hashtag #iPhone6 in their tweets. Ignoring the issue of sampling, my results generalize to the population of Twitter users talking about the new iPhone6.