Homework #2

(Problem 1.1 in ALR)

United Nations (Data file: UN11) The data in the file UN11 contains several variables, including ppgdp (the gross national product per person in U.S. dollars) and fertility (the birth rate per 1000 females), both from the year 2009. The data is for 199 localities, mostly UN member countries, but also other areas such as Hong Kong that are not independent countries. The data was collected from the United Nations (2011). We will study the dependence of fertility on ppgdp. Identify the predictor and the response. The predictor (x) in this case is ppgdp (gross national product per person), and the response (y) is fertility (the birth rate per 1000 females).

Question 1.1

Identify the predictor and the response.

                                                  ANSWER

I first loaded the required package, alr4, and then plotted the data as shown below.

Code: plot(x = UN11$ppgdp, y = UN11$fertility)

Question 1.2

Draw the scatterplot of fertility on the vertical axis versus ppgdp on the horizontal axis and summarize the information in this graph. Does a straight-line mean function seem to be plausible for a summary of this graph?

                                                  ANSWER

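On the original scale (the plot of fertility versus ppgdp under Question 1.1), the points follow a strongly curved pattern, so a straight-line mean function is not plausible there. After taking logarithms of both variables, as plotted below, the relationship looks approximately linear, so I would conclude that the simple linear regression model does seem plausible for the log-log graph.

Code: plot(log(UN11$fertility) ~ log(UN11$ppgdp), ylab = "log(fertility)", xlab = "log(ppgdp)")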

Question 1.3

Draw the scatterplot of log(fertility) versus log(ppgdp) using natural logarithms. Does the simple linear regression model seem plausible for a summary of this graph? If you use a different base of logarithms, the shape of the graph won’t change, but the values on the axes will change.

                                                ANSWER

Yes. In the log-log graph above, the mean of log(fertility) decreases roughly linearly as log(ppgdp) increases, apart from a few outliers, so the simple linear regression model seems plausible as a summary.
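The remark in the question about logarithm bases can be checked numerically. The sketch below uses simulated values rather than the UN11 data, and the variable names are hypothetical. Because log10(v) = log(v) / log(10), changing the base rescales both axes by the same constant, so the correlation and the fitted log-log slope should come out the same.

```r
# Simulated positive data standing in for ppgdp and fertility
# (assumption: illustrative values, not the UN11 data)
set.seed(1)
x <- exp(rnorm(50, mean = 8, sd = 1.5))
y <- exp(2.5 - 0.2 * log(x) + rnorm(50, sd = 0.3))

# Correlation on the natural-log scale and on the base-10 scale
r_natural <- cor(log(x), log(y))
r_base10  <- cor(log10(x), log10(y))

# Slopes of the two log-log fits
slope_natural <- unname(coef(lm(log(y) ~ log(x)))[2])
slope_base10  <- unname(coef(lm(log10(y) ~ log10(x)))[2])

# Both comparisons should report TRUE: only the axis values change
all.equal(r_natural, r_base10)
all.equal(slope_natural, slope_base10)
```

Only the intercept and the tick values on the axes differ between the two fits.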

Question 2

Annual income, in dollars, is an explanatory variable in a regression analysis. For a British version of the report on the analysis, all responses are converted to British pounds sterling (1 pound equals about 1.33 dollars, as of 2016).

(a)

How, if at all, does the slope of the prediction equation change?

                                               ANSWER

The prediction equation ŷ = a + bx gives a predicted response ŷ for every value of the explanatory variable x. Converting income from dollars to pounds divides every x value by about 1.33 (x in pounds = x in dollars / 1.33). The predictions themselves must stay the same, so the slope is multiplied by 1.33: a one-pound increase in income equals a 1.33-dollar increase, and the predicted change in y is therefore 1.33b. The intercept a does not change.

(b)

How, if at all, does the correlation change?

                                               ANSWER

The correlation does not change. Correlation is a standardized, unit-free measure of linear association, so converting a variable from dollars to pounds (multiplying it by a positive constant) leaves both the sign and the magnitude of the correlation the same.
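Both parts of Question 2 can be checked with a small simulation. The sketch below uses made-up incomes and a made-up response (all names and numbers are assumptions for illustration). It converts the explanatory variable from dollars to pounds at the stated rate of 1.33 dollars per pound, then compares the fitted slopes and the correlations.

```r
# Simulated incomes in dollars and a response
# (assumption: illustrative values only, not from any data file)
set.seed(2)
income_usd <- runif(100, min = 20000, max = 90000)
response   <- 5 + 0.0001 * income_usd + rnorm(100, sd = 1)

# Convert the explanatory variable to pounds: 1 pound is about 1.33 dollars
income_gbp <- income_usd / 1.33

slope_usd <- unname(coef(lm(response ~ income_usd))[2])
slope_gbp <- unname(coef(lm(response ~ income_gbp))[2])

# Ratio of the slope per pound to the slope per dollar
slope_ratio <- slope_gbp / slope_usd
slope_ratio

# Correlation in each currency
cor_usd <- cor(response, income_usd)
cor_gbp <- cor(response, income_gbp)
c(usd = cor_usd, gbp = cor_gbp)
```

The ratio of the slopes equals the conversion factor, while the two correlations agree.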

Question 3

Water runoff in the Sierras (Data file: water) Can Southern California’s water supply in future years be predicted from past data? One factor affecting water availability is stream runoff. If runoff could be predicted, engineers, planners, and policy makers could do their jobs more efficiently. The data file contains 43 years’ worth of precipitation measurements taken at six sites in the Sierra Nevada mountains (labeled APMAM, APSAB, APSLAKE, OPBPC, OPRC, and OPSLAKE) and stream runoff volume at a site near Bishop, California, labeled BSAAM. Draw the scatterplot matrix for these data and summarize the information available from these plots.

                                             ANSWER

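Year appears to be largely unrelated to each of the other variables. The three precipitation sites whose names start with A (APMAM, APSAB, APSLAKE) are positively correlated with one another, as are the three sites whose names start with O (OPBPC, OPRC, OPSLAKE), but the A sites show little relationship with the O sites. Runoff (BSAAM) appears more closely related to the O sites than to the A sites. So precipitation at one site helps predict precipitation at the other sites in the same group, while the year itself carries little information for prediction.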

Question 4

Professor ratings (Data file: Rateprof) In the website and online forum RateMyProfessors.com, students rate and comment on their instructors. Launched in 1999, the site includes millions of ratings on thousands of instructors. The data file includes the summaries of the ratings of 364 instructors at a large campus in the Midwest (Bleske-Rechek and Fritsch, 2011). Each instructor included in the data had at least 10 ratings over a several year period. Students provided ratings of 1–5 on quality, helpfulness, clarity, easiness of instructor’s courses, and raterInterest in the subject matter covered in the instructor’s courses. The data file provides the averages of these five ratings. Use R to reproduce the scatterplot matrix in Figure 1.13 in the ALR book (page 20). Provide a brief description of the relationships between the five ratings. (The variables don’t have to be in the same order)

                                           ANSWER

We start by loading the data and viewing the full scatterplot matrix.

data(Rateprof)

pairs(Rateprof)

The full matrix includes every variable in the data frame, which is far more than we need to reproduce the scatterplot matrix in Figure 1.13 of the ALR book. The data frame has 17 columns, and the five ratings are in columns 8 through 12, so we subset to those columns with the code below.

Code: pairs(Rateprof[,8:12])

Question 5

For the student.survey data file in the smss package, conduct regression analyses relating

(i)

y = political ideology and x = religiosity,

Step 1 • Load the package and the data file

library(smss)

data("student.survey")

Both variables contain categorical data.

• Graph the variables and their relationship.

                                      Can't graph within chunk

Code: ggplot(data = student.survey, aes(x = re, y = pi)) + geom_point() + xlab("Religiosity") + ylab("Political Ideology")

(ii)

y = high school GPA and x = hours of TV watching.

                                         Can't graph within chunk

Code: ggplot(data = student.survey, aes(x = tv, y = hi)) + geom_point() + xlab("Hours of TV watching") + ylab("High school GPA")

The graph above suggests a negative association between the number of hours a student watches TV and the student's high school GPA: students who watch fewer hours of TV tend to have higher GPAs.

(a)

Use graphical ways to portray the individual variables and their relationship.

ANSWER

For political ideology and religiosity, the graph suggests a positive relationship: the more often a student attends religious services, the more conservative the student tends to be.

(b)

Interpret descriptive statistics for summarizing the individual variables and their relationship.

Code: summary(student.survey)

                                   Can't graph within chunk

ANSWER

After running a summary of the student survey data, I found several descriptive statistics for the individual variables. For political ideology, 63% of the 60 students identified as liberal to some degree. For religiosity, 73% of students attend church never or only occasionally, and only 27% attend most weeks or every week. Together these give a better sense of the relationship between the two variables: if more students attended church most weeks or every week, we would probably see a larger share of conservative students.

For high school GPA and hours of TV watching, the summary shows that the average high school GPA is about 3.3 and that students watch a little more than 7 hours of TV per week on average.

(c)

Summarize and interpret results of inferential analyses.

ANSWER

To explore the relationship between hours of TV watching and high school GPA inferentially, I test whether the correlation between the two variables is zero, which gives a t value and a p-value. The test statistic is calculated from the sample data under the null hypothesis of no association. The t value is -2.11. Its negative sign indicates a negative association, consistent with the visualization, and its magnitude is large enough to cast doubt on the null hypothesis.

The p-value, about .04, is below the conventional .05 level, so the result is statistically significant. In this case we reject the null hypothesis in favor of the alternative hypothesis and conclude that there is a statistically significant negative association between hours of TV watching and high school GPA.

Code: cor.test(student.survey$tv, student.survey$hi)

For political ideology and religiosity, the same test cannot be run directly: cor.test() requires numeric vectors, and re and pi are categorical, so the call below fails with an error.

library(smss)

data("student.survey")

cor.test(student.survey$re, student.survey$pi)

Error in cor.test.default(student.survey$re, student.survey$pi) : 'x' must be a numeric vector
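One possible workaround for this error, sketched below on a small made-up pair of ordered factors rather than the survey data, is to convert the ordered categories to their integer level codes with as.numeric() and then use a rank-based (Spearman) correlation test. Treating the categories as ranks is an assumption about the variables, and the level names and values here are hypothetical.

```r
# Hypothetical ordered responses (made-up values, not from student.survey)
attend_levels   <- c("never", "occasionally", "most weeks", "every week")
ideology_levels <- c("liberal", "moderate", "conservative")

attendance <- factor(
  c("never", "never", "occasionally", "occasionally", "most weeks",
    "most weeks", "every week", "every week", "occasionally", "never"),
  levels = attend_levels, ordered = TRUE
)
ideology <- factor(
  c("liberal", "liberal", "liberal", "moderate", "moderate",
    "conservative", "conservative", "conservative", "moderate", "moderate"),
  levels = ideology_levels, ordered = TRUE
)

# as.numeric() on a factor returns the integer level codes (1, 2, 3, ...)
attendance_num <- as.numeric(attendance)
ideology_num   <- as.numeric(ideology)

# Spearman's rank correlation; exact = FALSE skips the exact p-value,
# which cannot be computed when there are ties
result <- cor.test(attendance_num, ideology_num,
                   method = "spearman", exact = FALSE)
result$estimate
result$p.value
```

With the real data the same idea would be applied to student.survey$re and student.survey$pi, provided their factor levels are stored in a meaningful order.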

Question 6

For a class of 100 students, the teacher takes the 10 students who perform poorest on the midterm exam and enrolls them in a special tutoring program. The overall class mean is 70 on both the midterm and final, but the mean for the specially tutored students increases from 50 to 60.

Use the concept of regression toward the mean to explain why this is not sufficient evidence to imply that the tutoring program was successful. (Here’s a useful hint video: https://www.youtube.com/watch?v=1tSqSMOyNFE)

                                            ANSWER

This is not sufficient evidence because the 10 students were selected precisely because they had the lowest midterm scores. A test score reflects both a student's underlying ability and chance factors on the day, so the lowest scorers are likely to include students who had an unusually bad day. Regression toward the mean says that after an extreme result, the next result tends to be closer to the average. The tutored group would therefore be expected to score closer to the class mean of 70 on the final even if the tutoring had no effect whatsoever, and an increase from 50 to 60 is consistent with that. How do we know the specially tutored students would not have had the same result without tutoring? As mentioned in the video, this is why it is so important to use control groups, for example in clinical trials. Comparing the tutored students with similar low scorers who were not tutored would show whether the improvement is larger than what chance alone would produce.
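Regression toward the mean can be illustrated with a short simulation in which there is no tutoring at all. This is a sketch under assumed numbers: each score is modelled as a stable ability plus independent test-day noise, and the standard deviations and the function name simulate_class are hypothetical choices, not values from the problem.

```r
# One simulated class: scores = stable ability + independent test-day luck
# (assumption: means and standard deviations are illustrative only)
simulate_class <- function(n_students = 100, n_selected = 10) {
  ability <- rnorm(n_students, mean = 70, sd = 8)
  midterm <- ability + rnorm(n_students, sd = 8)
  final   <- ability + rnorm(n_students, sd = 8)

  # Select the lowest midterm scorers, as the teacher did
  lowest <- order(midterm)[seq_len(n_selected)]
  c(midterm = mean(midterm[lowest]), final = mean(final[lowest]))
}

# Repeat for many classes and average the selected group's two means
set.seed(3)
results <- replicate(2000, simulate_class())
avg <- rowMeans(results)
avg
```

In this setup the selected group's average on the final is higher than its midterm average but still below the class mean, even though no one was tutored.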
