DACSS-603
(Problem 1.1 in ALR) United Nations (Data file: UN11) The data in the file UN11 contains several variables, including ppgdp, the gross national product per person in U.S. dollars, and fertility, the birth rate per 1000 females, both from the year 2009. The data are for 199 localities, mostly UN member countries, but also other areas such as Hong Kong that are not independent countries. The data were collected from the United Nations (2011). We will study the dependence of fertility on ppgdp.
Predictor: ppgdp
Response: fertility
library(alr4)
data(UN11)
library(ggplot2)
ggplot(data = UN11, aes(x=ppgdp,y=fertility)) + geom_point()
The scatterplot shows a marked decline in fertility rates as per-person GDP (ppgdp) increases. I will now recreate the scatterplot with a fitted straight line to see whether a straight-line function is appropriate for the data on their original scale.
data(UN11)
ggplot(data = UN11, aes(x=ppgdp,y=fertility)) + geom_point()+
geom_smooth(method="lm",se=FALSE)
As can be seen above, a straight-line function is not appropriate: the relationship between the two variables is not linear on the original scale (the points form an L shape). I will now log-transform both variables to see whether a straight-line function is plausible on that scale.
data(UN11)
ggplot(data = UN11, aes(x=log(ppgdp),y=log(fertility))) + geom_point() +
geom_smooth(method="lm",se=FALSE)
After taking the log of both variables, the points fall much closer to a straight line, so a simple linear regression model is much more plausible on the log-log scale.
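The log-log relationship plotted above can also be checked numerically by fitting the corresponding model. This is a minimal sketch, assuming the alr4 package is installed; the object name fit_loglog is illustrative and does not come from the original analysis.

```r
library(alr4)  # provides the UN11 data used above

# Simple linear regression of log(fertility) on log(ppgdp)
fit_loglog <- lm(log(fertility) ~ log(ppgdp), data = UN11)

# Coefficients, standard errors, and R-squared for the log-log model
summary(fit_loglog)
```

A negative slope in this output would agree with the downward trend visible in the scatterplot.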
(Problem 9.47 in SMSS) Annual income, in dollars, is an explanatory variable in a regression analysis. For a British version of the report on the analysis, all responses are converted to British pounds sterling (1 pound equals about 1.33 dollars, as of 2016).
To convert the responses from USD to GBP, each response value must be divided by 1.33. Because the slope is expressed in units of the response per unit of the predictor, the slope is also divided by 1.33.
Correlation isn’t affected by units of measurement, so it would not change in this scenario.
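The effect of the currency conversion can be illustrated with a small simulation. This is only a sketch: the data are simulated, and the variable names (x, income_usd, income_gbp) and all numeric settings other than the 1.33 exchange rate are assumptions made for illustration. It covers both readings of the problem, with income as the response and with income as the predictor.

```r
# Simulated data only: every number below except 1.33 is made up for illustration.
set.seed(1)
x <- runif(100, min = 8, max = 20)                     # hypothetical second variable
income_usd <- 5000 + 3000 * x + rnorm(100, sd = 8000)  # hypothetical income in USD
income_gbp <- income_usd / 1.33                        # the same income in GBP

# Case 1: income is the response, so the slope is divided by 1.33
slope_usd <- unname(coef(lm(income_usd ~ x))["x"])
slope_gbp <- unname(coef(lm(income_gbp ~ x))["x"])
slope_gbp / slope_usd          # 1 / 1.33, up to rounding error

# Case 2: income is the predictor, so the slope is multiplied by 1.33
slope_on_usd <- unname(coef(lm(x ~ income_usd))["income_usd"])
slope_on_gbp <- unname(coef(lm(x ~ income_gbp))["income_gbp"])
slope_on_gbp / slope_on_usd    # 1.33, up to rounding error

# The correlation is the same in either currency
cor(x, income_usd) - cor(x, income_gbp)  # zero, up to rounding error
```

In both cases only the units of the slope change; the strength of the linear association does not.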
(Problem 1.5 in ALR) Water runoff in the Sierras (Data file: water) Can Southern California’s water supply in future years be predicted from past data? One factor affecting water availability is stream runoff. If runoff could be predicted, engineers, planners, and policy makers could do their jobs more efficiently. The data file contains 43 years’ worth of precipitation measurements taken at six sites in the Sierra Nevada mountains (labeled APMAM, APSAB, APSLAKE, OPBPC, OPRC, and OPSLAKE) and stream runoff volume at a site near Bishop, California, labeled BSAAM. Draw the scatterplot matrix for these data and summarize the information available from these plots.
Since this matrix presents a lot of information, I’ll summarize:
* Year doesn’t seem to be related to runoff or to precipitation at any of the six sites.
* OPBPC, OPRC, and OPSLAKE appear to be correlated with one another, and APMAM, APSAB, and APSLAKE also appear to be correlated with one another. Panels that pair two variables from the same group show a clear dependence that is not present in panels that pair a variable from one group with a variable from the other.
* BSAAM is more closely related to OPBPC, OPRC, and OPSLAKE than to APMAM, APSAB, APSLAKE.
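The scatterplot matrix summarized above could be drawn along the following lines. This is a sketch, assuming the alr4 package is installed; the correlation vector is an optional numeric check added here for illustration and is not part of the original answer.

```r
library(alr4)  # provides the water data

data(water)

# Scatterplot matrix of Year, the six precipitation sites, and BSAAM
pairs(water)

# Optional check: correlation of each variable with the runoff variable BSAAM
cor_with_bsaam <- round(cor(water)[, "BSAAM"], 2)
cor_with_bsaam
```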
(Problem 1.6 in ALR, modified) Professor ratings (Data file: Rateprof) In the website and online forum RateMyProfessors.com, students rate and comment on their instructors. Launched in 1999, the site includes millions of ratings on thousands of instructors. The data file includes the summaries of the ratings of 364 instructors at a large campus in the Midwest (Bleske-Rechek and Fritsch, 2011). Each instructor included in the data had at least 10 ratings over a several year period. Students provided ratings of 1–5 on quality, helpfulness, clarity, easiness of instructor’s courses, and raterInterest in the subject matter covered in the instructor’s courses. The data file provides the averages of these five ratings. Use R to reproduce the scatterplot matrix in Figure 1.13 in the ALR book (page 20). Provide a brief description of the relationships between the five ratings. (The variables don’t have to be in the same order)
library(alr4)
data(Rateprof)
pairs(Rateprof[c("quality","clarity","helpfulness","easiness","raterInterest")])
There is a strong positive correlation among “quality”, “clarity”, and “helpfulness”: instructors who are rated highly on one of these tend to be rated highly on the other two. This makes sense, as the quality and clarity of the material, as well as the professor’s helpfulness, are major factors in undergraduate learning. There appears to be a weaker positive correlation between “helpfulness” and “easiness”, with the points much more dispersed. “raterInterest” is concentrated around the middle of its range in each panel, indicating that raters are at least moderately interested in the subject matter of the courses they are rating.
(Problem 9.34 in SMSS) For the “student.survey” data file in the smss package, conduct regression analyses relating:
a. y = political ideology and x = religiosity
b. y = high school GPA and x = hours of TV watching
(You can use ?student.survey in the R console after loading the package to see what each variable means)
* Use graphical ways to portray the individual variables and their relationships.
* Interpret descriptive statistics for summarizing the individual variables and their relationships.
* Summarize and interpret the results of inferential analyses.
[1] "subj" "ge" "ag" "hi" "co" "dh" "dr" "tv" "sp"
[10] "ne" "ah" "ve" "pa" "pi" "re" "ab" "aa" "ld"
The variables I will be focusing on (as per the problem) are “re” (x) and “pi” (y) for subsection (a), and “tv” (x) and “hi” (y) for subsection (b).
library(smss)
data("student.survey")
ggplot(data=student.survey,aes(x=re,fill=pi))+
geom_bar() + labs(x="Religiosity", fill ="Political Ideology")
The graph above is one of several possible visualizations of the relationship between religiosity and political ideology. I couldn’t figure out how to get more info on what exactly the variables mean, so I’m assuming “Religiosity” refers to how often respondents attend church, temple, mosque, or other religious services. From left to right, the frequency goes from “never” to “every week.” As frequency increases, so too does conservatism. Although they are far from a majority, it is notable that those who identify as very conservative appear only in the bar labelled “every week,” whereas those who identify as very liberal do not appear in any bar to the right of “occasionally.” This suggests that respondents who lean heavily liberal are far less likely to attend religious services regularly than those who are more conservative.
data("student.survey")
ggplot(data=student.survey,aes(x=hi, y=tv)) +
geom_point() + labs(x="High School GPA", y="Hours Watching TV")
Once again, this graph is just one of several visualizations that could be used. I chose a scatterplot to reflect individual responses; a standard bar graph for this scenario is, in my opinion, misleading, because it makes the outliers look like a much larger share of the responses. Given the measurements on this graph, I am also assuming that the y-axis refers to hours of TV watched per week. While this graph does not show a linear relationship between the two variables, there is a higher concentration of responses with higher GPAs and fewer hours of TV. I will now log-transform both variables to check whether a linear relationship appears on that scale.
ggplot(data = student.survey, aes(x=log(hi),y=log(tv))) + geom_point() +
labs(x="log(High School GPA)",y="log(Hours Watching TV)")
Even after the log transformation, there does not appear to be a linear relationship between these 2 variables. There is still a higher concentration of responses at the higher end of the GPA range, but there is enough variation in the responses to suggest that “Hours Watching TV” is at most weakly associated with “High School GPA.”
Now I will present descriptive/summary statistics for all 4 variables to describe their distributions.
                    pi               re           hi
 very liberal         : 8   never       :15   Min.   :2.000
 liberal              :24   occasionally:29   1st Qu.:3.000
 slightly liberal     : 6   most weeks  : 7   Median :3.350
 moderate             :10   every week  : 9   Mean   :3.308
 slightly conservative: 6                     3rd Qu.:3.625
 conservative         : 4                     Max.   :4.000
 very conservative    : 2
       tv
 Min.   : 0.000
 1st Qu.: 3.000
 Median : 6.000
 Mean   : 7.267
 3rd Qu.:10.000
 Max.   :37.000
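The categories of “pi” and “re” contain very different numbers of respondents (for example, 24 liberal versus 2 very conservative, and 29 “occasionally” versus 7 “most weeks”), so comparisons across categories rest on small counts. The distribution of high school GPA appears roughly symmetric: the mean (3.308) and median (3.350) are close, and the quartiles are spread fairly evenly around them. The distribution of “Hours Watching TV”, on the other hand, is right-skewed with several outliers, which can be seen both on the above graph and in the summary statistics: the mean (7.267) exceeds the median (6.000), and the maximum (37) is far above the 3rd quartile (10).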
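The inferential step that the problem asks for could be sketched as follows. This is a minimal illustration, not the analysis that produced the output above: scoring the ordered categories of “pi” and “re” as consecutive integers with as.numeric() is an assumption of this sketch, and the object names fit_a and fit_b are hypothetical.

```r
library(smss)  # provides student.survey

data("student.survey")

# (a) political ideology (y) on religiosity (x), with the ordered
#     categories scored 1, 2, 3, ... (an assumption of this sketch)
fit_a <- lm(as.numeric(pi) ~ as.numeric(re), data = student.survey)

# (b) high school GPA (y) on hours of TV watching (x)
fit_b <- lm(hi ~ tv, data = student.survey)

# Slopes, standard errors, t tests, and R-squared for each model
summary(fit_a)
summary(fit_b)
```

The t test on each slope addresses whether the data are consistent with no linear association between the two variables in that model.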
Regression toward the mean implies that subjects with outlying/extreme values on one measurement tend, on average, to have less extreme values when the test or experiment is repeated, because part of any extreme value reflects chance. The values found in a repetition of the test or experiment are therefore not expected to be the same as the previous ones. In this scenario, the students were selected because they were not doing well, and some of those low scores could have occurred by chance, so the “improvement” seen in the graph might just be regression toward the mean and not actual academic improvement.
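A small simulation can illustrate the point. This is only a sketch with simulated data: the variable names and all numeric settings are assumptions chosen for illustration, and by construction no student's underlying ability changes between the two tests.

```r
# Simulated data only: every number below is made up for illustration.
set.seed(42)
n <- 1000
ability <- rnorm(n, mean = 70, sd = 8)  # hypothetical stable ability
test1 <- ability + rnorm(n, sd = 8)     # first test = ability + chance
test2 <- ability + rnorm(n, sd = 8)     # second test, no real improvement

# Select the students who scored in the bottom 10% on the first test
low <- test1 <= quantile(test1, 0.10)

mean_low_test1 <- mean(test1[low])  # group mean on the test used for selection
mean_low_test2 <- mean(test2[low])  # same students on the second test

c(first = mean_low_test1, second = mean_low_test2, overall = mean(test2))
```

The selected group scores higher on the second test than on the first even though nothing about the students changed, while still scoring below the overall mean.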