Homework 1

Your name: Rapheephan Thongkham-Uan

Collaborators names: -

Question 1: Do tanning salons mislead their customers?

What is the sample?

A: 300 tanning salons nationwide

Do you think the sample is representative of all tanning salons in the US?

A: Yes, because the dataset included at least 3 randomly sampled salons from each state.

Although the sample is random, discuss why the results may not paint an accurate picture of the dangers of tanning.

A: Because the data were collected from people, bias may occur. The investigators might have only called the randomly selected salons in each state instead of visiting the salons directly.

Do you think the study accurately portrays the messages tanning salons give to teenage girls?

A: Probably not. The 90% of salons that said tanning did not pose a health risk might still inform their customers directly about the risk of skin cancer. The collected sample might be biased.

Question 2: Simpson’s Paradox.

Considering only patients in good condition, which hospital has a lower death rate?

A: Hospital A

Considering only patients in poor condition, which hospital has a lower death rate?

A: Hospital A

No matter what condition a patient is in, which hospital is the best choice?

A: Hospital A

Death rate	Hospital A	Hospital B
Good	0.01 (6/600)	0.016 (8/500)
Poor	0.038 (57/1500)	0.04 (8/200)

From the above table showing the death rate in each condition, Hospital A had the lower death rate for patients in both conditions.

Generate a table of overall death rates

The table of results when all patients (in either condition) are included is

_	Hospital A	Hospital B
Died	63	16
Survived	2037	684
Total	2100	700

Use your work to determine which hospital has a lower death rate.

A: Hospital B had a lower death rate.

Hospital A death rate = (6+57)/(600+1500) = 63/2100 = 3%

Hospital B death rate = (8+8)/(500+200) = 16/700 = 2.286%

Discuss the discrepancy

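A: From (c) and (e), we can see that the lurking variable (the condition of the patients) has a large effect on the overall death rate. Hospital B looks better overall only because most of its patients (500 of 700) arrive in good condition, while most of Hospital A’s patients (1500 of 2100) arrive in poor condition. So Hospital A really is the best choice in this case.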
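The reversal can be checked directly from the counts in the two tables above. The following is a minimal sketch in base R; the object names (`deaths`, `patients`, and so on) are made up for this illustration.

```r
# Deaths and patient counts copied from the tables above.
# matrix() fills column by column: first Hospital A (Good, Poor), then Hospital B.
deaths <- matrix(c(6, 57, 8, 8), nrow = 2,
                 dimnames = list(condition = c("Good", "Poor"),
                                 hospital = c("A", "B")))
patients <- matrix(c(600, 1500, 500, 200), nrow = 2,
                   dimnames = dimnames(deaths))

# Death rate within each condition.
rate_by_condition <- deaths / patients

# Overall death rate per hospital, pooling both conditions.
rate_overall <- colSums(deaths) / colSums(patients)

# Share of each hospital's patients who arrived in poor condition.
share_poor <- patients["Poor", ] / colSums(patients)

print(round(rate_by_condition, 4))
print(round(rate_overall, 4))
print(round(share_poor, 4))
```

The last vector shows why the pooled rates flip: the two hospitals treat very different mixes of patients.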

(Optional) Find another example on the web.

The example below, which I found on the internet, gives the numbers of on-time and delayed flights for Alaska Airlines and America West Airlines at 5 airports in June 1991. Reference: Moore, David S. (2003), The Basic Practice of Statistics (3rd ed.), New York, NY: W.H. Freeman.

                                **Alaska Airlines**                         **America West Airlines**

_	On time	Delayed	Delay Rate	On time	Delayed	Delay Rate
LA	497	62	11.1%	694	117	14.4%
Phoenix	221	12	5.2%	4840	415	7.9%
San Diego	212	20	8.6%	383	65	14.5%
San Fran.	503	102	16.9%	320	129	28.7%
Seattle	1841	305	14.2%	201	61	23.3%
Total	3274	501	13.3%	6438	787	10.9%

The table shows that America West Airlines’ delay rate is higher at each of the 5 airports. However, the overall delay rate of Alaska Airlines is higher when the airport locations (the lurking variable) are not considered.
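The per-airport and overall delay rates can be recomputed from the counts in the table. This is a small sketch in base R; the data frame `flights` and the helper `delay_rate()` are names invented for this illustration, and the delay rate is assumed to be delayed / (on time + delayed).

```r
# Counts copied from the table above.
flights <- data.frame(
  airport        = c("LA", "Phoenix", "San Diego", "San Fran.", "Seattle"),
  alaska_on_time = c(497, 221, 212, 503, 1841),
  alaska_delayed = c(62, 12, 20, 102, 305),
  west_on_time   = c(694, 4840, 383, 320, 201),
  west_delayed   = c(117, 415, 65, 129, 61)
)

# Delay rate as a percentage of all flights (on time + delayed).
delay_rate <- function(on_time, delayed) 100 * delayed / (on_time + delayed)

# Rates at each airport.
flights$alaska_rate <- delay_rate(flights$alaska_on_time, flights$alaska_delayed)
flights$west_rate   <- delay_rate(flights$west_on_time, flights$west_delayed)

# Overall rates, pooling all five airports.
alaska_total <- delay_rate(sum(flights$alaska_on_time), sum(flights$alaska_delayed))
west_total   <- delay_rate(sum(flights$west_on_time), sum(flights$west_delayed))

print(round(flights[, c("alaska_rate", "west_rate")], 1))
print(round(c(alaska = alaska_total, west = west_total), 1))
```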

Question 3: Criminal justice data set.

Load the data

# need to replace next line with where the file is.  See homework for how.
# code for Exercise 1 is already entered below as an example
criminal_justice <- read.csv("~/Documents/criminal_justice.csv")

head(criminal_justice)

##        state police judicial corrections violent_crime
## 1    Alabama    159       71         103           486
## 2     Alaska    412      204         273           567
## 3    Arizona    231      120         206           532
## 4   Arkansas    149       72         137           445
## 5 California    290      201         257           622
## 6   Colorado    238       86         190           334

# Calculate mean, etc here.
summary(criminal_justice)

##         state        police        judicial      corrections 
##  Alabama   : 1   Min.   : 104   Min.   : 52.0   Min.   : 90  
##  Alaska    : 1   1st Qu.: 164   1st Qu.: 72.5   1st Qu.:134  
##  Arizona   : 1   Median : 190   Median : 92.0   Median :162  
##  Arkansas  : 1   Mean   : 224   Mean   : 99.1   Mean   :179  
##  California: 1   3rd Qu.: 232   3rd Qu.:114.0   3rd Qu.:204  
##  Colorado  : 1   Max.   :1348   Max.   :205.0   Max.   :609  
##  (Other)   :45                                               
##  violent_crime 
##  Min.   :  81  
##  1st Qu.: 282  
##  Median : 384  
##  Mean   : 442  
##  3rd Qu.: 550  
##  Max.   :1508  
##

Why do you think police records are not the sole source for the data on violent crime rates?

A: The data must be collected from every state and from many sources to find the proportions for each variable. Thus, police records should not be the sole source for the data on violent crime rates.

Calculate the mean, median, standard deviation, and interquartile range for the variable ‘violent_crime’

# Calculate mean, etc here.
summary(criminal_justice$violent_crime)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      81     282     384     442     550    1510

A: The mean of the variable ‘violent_crime’ is 441.6. The median of the variable ‘violent_crime’ is 384.0.

# Calculate mean, etc here.
sd(criminal_justice$violent_crime)

## [1] 241.4

A: The standard deviation of the variable ‘violent_crime’ is 241.3983

# Calculate mean, etc here.
IQR(criminal_justice$violent_crime)

## [1] 268

A: The interquartile range of the variable ‘violent_crime’ is 268

Plot a histogram of the variable ‘violent_crime’.

# make your histograms here
hist(criminal_justice$violent_crime, col="grey", xlab="violent crime", ylab="Number of states", main="Violent Crime Rates of States (2004)" )

plot of chunk unnamed-chunk-6

Discuss these results and describe the distribution of violent crime (shape, center, spread). Also, any outliers?

A: In the Violent Crime Rates of States histogram, the violent crime rates appear to range from 0 to about 1000, with a possible outlier near 1500. The distribution is not symmetric: it is skewed to the right, because the data stretch out more to the right than to the left. The center is around the median of 384, and the spread, measured by the interquartile range, is 268.

Draw a second histogram where you increase the number of bins from the default (7 in this case, I believe) to 15. What is visible in the second plot that is hidden in the first plot?

# modify the code below to add appropriate axis labeling
hist(criminal_justice$violent_crime, breaks=15, col="grey", 
     xlim=c(0,2000), ylim=c(0,12),
     xlab="violent crime", ylab="Number of states", 
     main="Violent Crime Rates of States (2004)" )

plot of chunk unnamed-chunk-7

A: For this histogram, the number of breaks (15) was specified with the breaks option. It gives more detail than the previous plot and shows some differences. The violent crime rates appear to range from 0 to about 900, with a possible outlier higher than 1500.
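When `breaks` is given as a single number, `hist()` treats it as a suggestion and chooses "pretty" bin edges near that count, so the actual number of bins may not be exactly 15. The sketch below inspects the bins without drawing anything; the vector `x` is hypothetical right-skewed data, not the real violent_crime values.

```r
# Hypothetical right-skewed data standing in for violent_crime (not the real values).
set.seed(1)
x <- c(rexp(50, rate = 1 / 400), 1500)

# plot = FALSE returns the bin edges and counts without drawing the histogram.
h_default <- hist(x, plot = FALSE)
h_more    <- hist(x, breaks = 15, plot = FALSE)

# Number of bins actually used in each case, and the edges chosen for breaks = 15.
print(length(h_default$counts))
print(length(h_more$counts))
print(h_more$breaks)
```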

Dividing the data

Calculate the median of the variable police, i.e., median of the distribution of the amount spent per capita on police across states (including DC). You are now going to make a new variable of high police states and low police states. A state will be a high state if it pays more than the median, and a low state otherwise. You can make this new variable with the recode command.

# median of the variable police
median(criminal_justice$police)

## [1] 190

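# create a new variable: police <= 190 (the median) is 'low', everything else is 'high'
# (recode() with this string syntax is assumed to come from the car package)
criminal_justice$PoliceState <- recode(criminal_justice$police, "lo:190='low'; else='high'")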
head(criminal_justice)

##        state police judicial corrections violent_crime PoliceState
## 1    Alabama    159       71         103           486         low
## 2     Alaska    412      204         273           567        high
## 3    Arizona    231      120         206           532        high
## 4   Arkansas    149       72         137           445         low
## 5 California    290      201         257           622        high
## 6   Colorado    238       86         190           334        high

Make your new variable and count the number of states that are high payment and low payment. Does this align with what you expect, given the definition of a median?

tally( ~PoliceState, data=criminal_justice )

## 
## high  low 
##   25   26

A: The result shows that there are 26 low police states and 25 high police states. Since criminal_justice$police has 51 records, the median is the middle value after sorting, i.e., the 26th value. The state at the median is counted as low (it does not pay more than the median), so 25 states fall above it and 26 fall at or below it. This aligns with the definition of a median.
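The same counting argument can be seen on a tiny example. This sketch uses base R `ifelse()` instead of `recode()`, and the seven `police` values are hypothetical (an odd count with no ties), not taken from the data set.

```r
# Hypothetical spending values for 7 units (odd count, no ties); not the real data.
police <- c(120, 150, 175, 190, 210, 260, 400)

med <- median(police)

# Same rule as above: more than the median is "high", everything else is "low".
group <- ifelse(police > med, "high", "low")

print(med)
print(table(group))
```

With an odd number of distinct values, the unit sitting exactly at the median lands in the low group, so the low group is one larger than the high group.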

Calculate the mean for ‘violent crime’ for each of your two groups separately. They will differ. Show your results and discuss possible reasons why you think they differ.

tapply(criminal_justice$violent_crime, criminal_justice$PoliceState, mean)

##  high   low 
## 527.7 358.9

A: The mean of the low group is 358.88, while the mean of the high group is 527.68. The mean measures the center of each group: the sum of the ‘violent_crime’ values in the group divided by the number of states in the group. There is a high outlier in the dataset (from the histograms in c) and e)). This high value pulls the mean up in the group that contains it. Thus, it is not surprising that the two means differ.

Make a scatterplot showing the relationship between ‘violent crime’ and ‘police’.

# make your scatterplot for 8) and 10) here
plot(criminal_justice$police,criminal_justice$violent_crime,
     #xlim=c(0,1000), ylim=c(0,500),
     xlab="police", ylab="violent_crime",
     main="The relationship between police and violent_crime")

plot of chunk unnamed-chunk-11

Describe the relationship. Are there any outliers? Is the trend linear?

A: ‘police’ is positively associated with ‘violent_crime’: larger values of ‘police’ tend to be associated with higher levels of ‘violent_crime’. One state with a very high ‘police’ value also has a very high ‘violent_crime’ value, and is considered an outlier. Apart from that point, the points are scattered widely around any straight line, so the trend looks strong mainly because of the influence of the outlier. Overall, ‘police’ appears to have a weak positive linear association with ‘violent_crime’.

Do the same for violent crime and judicial expenditures. Compare the strength of the relationship in this plot to the original plot. Are there outliers in the judicial expenditures?

# make your scatterplot for 9) here
plot(criminal_justice$judicial, criminal_justice$violent_crime,
     #xlim=c(0,1000), ylim=c(0,500),
     xlab="judicial", ylab="violent_crime",
     main="The relationship between judicial and violent_crime")

plot of chunk unnamed-chunk-12

#abline(lm(criminal_justice$violent_crime~criminal_justice$judicial), col="red") 
# regression line (y~x)

A: The association between ‘violent_crime’ and ‘judicial’ in this plot is weaker than in the original plot. ‘judicial’ appears to have a weak positive linear association with ‘violent_crime’, and there are also some outliers in the plot.

Make a linear model to predict violent crime with the police variable. Be sure to use all your data. Using abline(), add this line to your scatterplot in 8). Use it to predict the violent crime for a hypothetical state that only paid 100 per capita on police. Does this prediction make sense given the data you have?

# make your scatterplot for 8) and 10) here
plot(criminal_justice$police,criminal_justice$violent_crime,
     xlab="police", ylab="violent_crime",
     main="The relationship between police and violent_crime")

model<-lm(criminal_justice$violent_crime~criminal_justice$police)
# regression line (y~x) 
abline(model, col="red")

plot of chunk unnamed-chunk-13

# summary model
model

## 
## Call:
## lm(formula = criminal_justice$violent_crime ~ criminal_justice$police)
## 
## Coefficients:
##             (Intercept)  criminal_justice$police  
##                 217.237                    0.999

model$coefficients[[2]]*100+model$coefficients[[1]]

## [1] 317.2

A: The regression line is shown as a red line in the plot. From the output of the linear model ‘model’, the y-intercept is 217.2368 and the slope is 0.9994. Thus, the regression line for the relationship between violent_crime and police is

y = 217.2368 + 0.9994*x

If a state pays only 100 per capita on police, the predicted violent_crime is 317.175. In my opinion, this prediction makes sense. According to the trend of the raw data and the regression line, for ‘police’ values around 100, violent_crime could vary from about 100 to 400. Note that 100 is slightly below the smallest observed ‘police’ value (104), so the prediction is a mild extrapolation.
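An alternative to multiplying the coefficients by hand is `predict()`. For `predict()` to accept new values, the model should be fit with a formula plus a `data =` argument; a model fit as `lm(df$y ~ df$x)` does not pick up `newdata` by column name. The sketch below uses only the six rows printed by head() earlier, so its coefficients will not match the output above; the name `six_states` is made up for this illustration.

```r
# Only the six rows printed by head() above; the full file has 51 rows,
# so the coefficients here will NOT match the model output above.
six_states <- data.frame(
  police        = c(159, 412, 231, 149, 290, 238),
  violent_crime = c(486, 567, 532, 445, 622, 334)
)

# Fitting with a formula plus data = lets predict() look up new values by name.
fit <- lm(violent_crime ~ police, data = six_states)

# Prediction for a hypothetical state spending 100 per capita on police.
pred <- predict(fit, newdata = data.frame(police = 100))

# The same number computed as intercept + slope * 100.
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["police"]] * 100

print(unname(pred))
print(by_hand)
```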

Dropping outliers

Now find and remove any outliers using the subset command. Here the subset() command is used to drop all states with a per-capita police payment of 1000 or more.

nrow( criminal_justice )

## [1] 51

crim = subset( criminal_justice, police < 1000 )
nrow( crim )

## [1] 50

With your new dataset, make a new linear model and use it to predict violent crime for a 100 per capita cost. Add this line to your plot as well, but make it a different color.

# reproduce your scatterplot from 9) here, add the new line with a different color.
plot(criminal_justice$police,criminal_justice$violent_crime,
     xlim=c(0,800),ylim=c(0,1000),
     xlab="police", ylab="violent_crime",
     main="The relationship between police and violent_crime")

model<-lm(criminal_justice$violent_crime~criminal_justice$police)
newmodel<-lm(crim$violent_crime~crim$police)
# regression line (y~x) 
abline(model, col="red") 
abline(newmodel, col="blue")

plot of chunk unnamed-chunk-15

# summary model
newmodel

## 
## Call:
## lm(formula = crim$violent_crime ~ crim$police)
## 
## Coefficients:
## (Intercept)  crim$police  
##       137.6          1.4

newmodel$coefficients[[2]]*100+newmodel$coefficients[[1]]

## [1] 277.5

Which is better, do you think? Which gives a better prediction?

A: The regression line is shown as a blue line in the plot. From the output of the linear model ‘newmodel’, the y-intercept is 137.576 and the slope is 1.399. Thus, the regression line for the relationship between violent_crime and police is

y = 137.576 + 1.399*x

If a state pays only 100 per capita on police, the predicted violent_crime for the new crim dataset is 277.497.

The regression line that fits the dataset best should be the one that gives the best predictions, with residuals close to zero. The Multiple R-squared reported by summary(model) measures goodness of fit. In the first case, the Multiple R-squared is 0.4967; in the second case, it is 0.1805. The higher the R-squared, the better the model fits its data. Therefore, the first case, with the whole dataset, gives a better prediction.
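The R-squared values can also be pulled out of the fitted models programmatically rather than read off the printed summary. The sketch below shows one way to do that for a full and a trimmed fit; the data frame `d_all` is hypothetical (20 loosely related points plus one extreme point), not the criminal justice data. Keep in mind that the two values are computed on different sets of rows.

```r
# Hypothetical data: 20 loosely related points plus one extreme point (not the real data).
set.seed(42)
d <- data.frame(x = runif(20, 100, 400))
d$y <- 300 + 0.3 * d$x + rnorm(20, sd = 150)
d_all <- rbind(d, data.frame(x = 1350, y = 1500))

# One fit on all rows, one fit after dropping the extreme point with subset().
fit_all     <- lm(y ~ x, data = d_all)
fit_trimmed <- lm(y ~ x, data = subset(d_all, x < 1000))

# Multiple R-squared for each fit.
r2_all     <- summary(fit_all)$r.squared
r2_trimmed <- summary(fit_trimmed)$r.squared

print(c(all = r2_all, trimmed = r2_trimmed))
```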

A senator looks at your work and proposes slashing police costs. What do you say to this?

A: The results show that the higher the payment on police, the higher the violent crime rate. But this is only an association, and we cannot make a decision based on just these 2 variables. It may be that states where the violent crime rate is low have no need to pay more for police, while states with higher violent crime rates have higher police costs.

Question 4: A small, causal survey.

What questions were on your survey? Does each question correspond to a categorical or quantitative variable?

A: From the tweets using hashtag #iPhone6

What kind of topic is the user talking about? > Categorical variable
How many people retweeted that topic? > Quantitative variable
How many people favourited that topic? > Quantitative variable
How many followers does the user who tweeted that topic have? > Quantitative variable
What is the user’s gender? > Categorical variable
Did he/she already buy or reserve an iPhone6, or does it seem like he/she does not want to buy one? > Categorical variable

Make a small csv file with your data and load it into R.

# data loading
mydata<-read.csv("~/Documents/mydata.csv")
head(mydata)

##           Topics retweet favourite follower    sex BuyiPhone6
## 1 UserExperience    1379        46      356 female          Y
## 2 UserExperience    3598       249  1192000   male          Y
## 3        General     128        11      155 female          N
## 4        General    1931       114  1536000   male          N
## 5           News     683        43   716000   male          N
## 6           News     467        24  1067000   male          Y

Compute the mean of a quantitative variable.

# compute the mean
summary(mydata)

##             Topics     retweet       favourite         follower      
##  Features      :3   Min.   :  16   Min.   :   1.0   Min.   :    155  
##  General       :7   1st Qu.:  34   1st Qu.:   9.5   1st Qu.:   5142  
##  News          :5   Median : 128   Median :  31.0   Median :  86000  
##  Opinion       :4   Mean   : 599   Mean   : 173.6   Mean   : 467252  
##  UserExperience:4   3rd Qu.:1037   3rd Qu.:  82.0   3rd Qu.: 357500  
##                     Max.   :3598   Max.   :2838.0   Max.   :4285000  
##      sex     BuyiPhone6
##  female: 8   N:10      
##  male  :15   Y:13      
##                        
##                        
##                        
##

Make a table of your categorical variable.

# make a table
tally( BuyiPhone6~Topics, data=mydata, format="count")

##           Topics
## BuyiPhone6 Features General News Opinion UserExperience
##          N        0       4    3       3              0
##          Y        3       3    2       1              4

Either make a scatterplot or a side-by-side bar plot of two of your variables.

# plot of your choice with two variables
bargraph(~BuyiPhone6, group=Topics, data = mydata, auto.key=TRUE, horizontal = TRUE)

plot of chunk unnamed-chunk-19

A sentence of interpretation of your results (ignoring issues of sampling, etc.).

A: Among the Twitter accounts that were talking about #iPhone6, the users who tweeted about their user experience of the iPhone6 appear to be users who had already reserved or bought one.

Do your results generalize to some wider population? Why or why not? And if they do, what population?

A: My sample was haphazardly picked from highly retweeted tweets by Twitter users who put the hashtag #iPhone6 in their tweets. Ignoring issues of sampling, my results generalize to the population of Twitter users talking about the new iPhone6.