DACSS 603
United Nations (Data file: UN11) The data in the file UN11 contains several variables, including ppgdp, the gross national product per person in U.S. dollars, and fertility, the birth rate per 1000 females, both from the year 2009. The data are for 199 localities, mostly UN member countries, but also other areas such as Hong Kong that are not independent countries. The data were collected from the United Nations (2011). We will study the dependence of fertility on ppgdp.
1.1.1. Identify the predictor and the response.
1.1.2 Draw the scatterplot of fertility on the vertical axis versus ppgdp on the horizontal axis and summarize the information in this graph. Does a straight-line mean function seem to be plausible for a summary of this graph?
1.1.3 Draw the scatterplot of log(fertility) versus log(ppgdp) using natural logarithms. Does the simple linear regression model seem plausible for a summary of this graph? If you use a different base of logarithms, the shape of the graph won’t change, but the values on the axes will change.
1.1.1. Identify the predictor and the response.
The predictor is ppgdp. Since we are studying the dependence of fertility on ppgdp (gross national product per person), ppgdp is the explanatory (independent) variable.
The response is fertility, the dependent variable whose variation we want to explain with ppgdp.
1.1.2 Draw the scatterplot of fertility on the vertical axis versus ppgdp on the horizontal axis and summarize the information in this graph. Does a straight-line mean function seem to be plausible for a summary of this graph?
region group fertility ppgdp lifeExpF pctUrban
Afghanistan Asia other 5.968 499.0 49.49 23
Albania Europe other 1.525 3677.2 80.40 53
Algeria Africa africa 2.142 4473.0 75.00 67
Angola Africa africa 5.135 4321.9 53.17 59
Anguilla Caribbean other 2.000 13750.1 81.10 100
Argentina Latin Amer other 2.172 9162.1 79.89 93
fertility ppgdp
Afghanistan 5.968 499.0
Albania 1.525 3677.2
Algeria 2.142 4473.0
Angola 5.135 4321.9
Anguilla 2.000 13750.1
Here I use a table to represent the variables extracted from the UN11 data
| Locality | fertility | ppgdp |
|---|---|---|
| Afghanistan | 5.968 | 499.0 |
| Albania | 1.525 | 3677.2 |
| Algeria | 2.142 | 4473.0 |
| Angola | 5.135 | 4321.9 |
| Anguilla | 2.000 | 13750.1 |
| Argentina | 2.172 | 9162.1 |
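The output below is a missing-value check on the two variables of interest. A minimal sketch of the call that would produce it (reconstructed from the output format, so treat it as an assumption):
# is.na() returns a 199 x 2 logical matrix; str() showing FALSE everywhere means no missing values
str(is.na(UN11[, c("fertility", "ppgdp")]))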
logi [1:199, 1:2] FALSE FALSE FALSE FALSE FALSE FALSE ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:199] "Afghanistan" "Albania" "Algeria" "Angola" ...
..$ : chr [1:2] "fertility" "ppgdp"
summary(United_Nations_11) #numerical summary
fertility ppgdp
Min. :1.134 Min. : 114.8
1st Qu.:1.754 1st Qu.: 1283.0
Median :2.262 Median : 4684.5
Mean :2.761 Mean : 13012.0
3rd Qu.:3.545 3rd Qu.: 15520.5
Max. :6.925 Max. :105095.4
The UN11 dataset is renamed United_Nations_11 for better understandability. We then draw the scatterplot of the United_Nations_11 data with fertility on the vertical axis and ppgdp on the horizontal axis.
United_Nations_11<-UN11
ggplot(data = United_Nations_11, aes(x = ppgdp, y = fertility)) +
  geom_point(color = 5) +
  labs(title = "Fertility vs United Nations Gross National Product Per Person USD")
This scatterplot does not suggest that a straight-line mean function would be an effective summary of the data: the mean function is clearly curved and the variance is not constant. The heavy crowding of points at the low end of the ppgdp axis also makes the pattern hard to read. Transforming both axes with natural logarithms is therefore a reasonable next step to see whether a straight-line mean function becomes plausible.
1.1.3 Draw the scatterplot of log(fertility) versus log(ppgdp) using natural logarithms. Does the simple linear regression model seem plausible for a summary of this graph? If you use a different base of logarithms, the shape of the graph won’t change, but the values on the axes will change.
Scatterplot of the natural log of fertility versus the natural log of ppgdp, with a fitted linear regression line:
United_Nations_11<-UN11
ggplot(data = United_Nations_11, aes(x = log(ppgdp), y = log(fertility))) +
  geom_point(color = 5) +
  geom_smooth(method = "lm") +
  labs(title = "Natural Log of Fertility vs UN Gross National Product Per Person USD")
On the natural-log scale the scatterplot is far better behaved: the points fall roughly along a straight line with a negative slope, and the variability around that line looks reasonably constant. A simple linear regression model therefore seems plausible as a summary of this graph; in other words, the relationship between log(fertility) and log(ppgdp) is approximately linear.
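To quantify this relationship, the simple linear regression could be fit on the log scale. This is a sketch of the model fit (not run above; its coefficients are not reported here):
log_fit <- lm(log(fertility) ~ log(ppgdp), data = United_Nations_11)  # simple linear regression on the log-log scale
summary(log_fit)  # the slope estimates the expected change in log(fertility) per unit change in log(ppgdp)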
Annual income, in dollars, is an explanatory variable in a regression analysis. For a British version of the report on the analysis, all responses are converted to British pounds sterling (1 pound equals about 1.33 dollars, as of 2016).
Part A
When the incomes are converted to British pounds sterling, each value of the explanatory variable is divided by 1.33 (since 1 pound equals about 1.33 dollars), so the slope of the prediction equation is multiplied by 1.33. In general, the slope changes inversely to the scale of the explanatory variable: shrinking the x-values by a factor of 1.33 inflates the slope by the same factor, so the predicted values stay the same.
Part B
There will be no change in the correlation. Correlation is a unit-free measure of the strength and direction of the linear association, so changing the units of measurement cannot affect it.
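As an illustration of both answers, the sketch below uses simulated income data (an assumption for illustration only, not the textbook's data) to show that converting dollars to pounds multiplies the slope by 1.33 and leaves the correlation unchanged:
set.seed(42)
income_dollars <- rnorm(100, mean = 50000, sd = 10000)   # simulated annual incomes in dollars
y <- 20 + 0.0005 * income_dollars + rnorm(100, sd = 2)   # simulated response
income_pounds <- income_dollars / 1.33                   # 1 pound is about 1.33 dollars
coef(lm(y ~ income_dollars))["income_dollars"]           # slope in dollar units
coef(lm(y ~ income_pounds))["income_pounds"]             # slope is 1.33 times larger
cor(y, income_dollars); cor(y, income_pounds)            # correlation is identical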
Water runoff in the Sierras (Data file: water) Can Southern California’s water supply in future years be predicted from past data? One factor affecting water availability is stream runoff. If runoff could be predicted, engineers, planners, and policy makers could do their jobs more efficiently. The data file contains 43 years’ worth of precipitation measurements taken at six sites in the Sierra Nevada mountains (labeled APMAM, APSAB, APSLAKE, OPBPC, OPRC, and OPSLAKE) and stream runoff volume at a site near Bishop, California, labeled BSAAM.
Draw the scatterplot matrix for these data and summarize the information available from these plots.
First, we load the water data file and check the structure of the data.
Year APMAM APSAB APSLAKE OPBPC OPRC OPSLAKE BSAAM
1 1948 9.13 3.58 3.91 4.10 7.43 6.47 54235
2 1949 5.28 4.82 5.20 7.55 11.11 10.26 67567
3 1950 4.20 3.77 3.67 9.52 12.20 11.35 66161
4 1951 4.60 4.46 3.93 11.14 15.15 11.13 68094
5 1952 7.15 4.99 4.88 16.34 20.05 22.81 107080
str(water) #concise look at data frame
'data.frame': 43 obs. of 8 variables:
$ Year : int 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 ...
$ APMAM : num 9.13 5.28 4.2 4.6 7.15 9.7 5.02 6.7 10.5 9.1 ...
$ APSAB : num 3.58 4.82 3.77 4.46 4.99 5.65 1.45 7.44 5.85 6.13 ...
$ APSLAKE: num 3.91 5.2 3.67 3.93 4.88 4.91 1.77 6.51 3.38 4.08 ...
$ OPBPC : num 4.1 7.55 9.52 11.14 16.34 ...
$ OPRC : num 7.43 11.11 12.2 15.15 20.05 ...
$ OPSLAKE: num 6.47 10.26 11.35 11.13 22.81 ...
$ BSAAM : int 54235 67567 66161 68094 107080 67594 65356 67909 92715 70024 ...
summary(water) #provides numerical overview of data
Year APMAM APSAB APSLAKE
Min. :1948 Min. : 2.700 Min. : 1.450 Min. : 1.77
1st Qu.:1958 1st Qu.: 4.975 1st Qu.: 3.390 1st Qu.: 3.36
Median :1969 Median : 7.080 Median : 4.460 Median : 4.62
Mean :1969 Mean : 7.323 Mean : 4.652 Mean : 4.93
3rd Qu.:1980 3rd Qu.: 9.115 3rd Qu.: 5.685 3rd Qu.: 5.83
Max. :1990 Max. :18.080 Max. :11.960 Max. :13.02
OPBPC OPRC OPSLAKE BSAAM
Min. : 4.050 Min. : 4.350 Min. : 4.600 Min. : 41785
1st Qu.: 7.975 1st Qu.: 7.875 1st Qu.: 8.705 1st Qu.: 59857
Median : 9.550 Median :11.110 Median :12.140 Median : 69177
Mean :12.836 Mean :12.002 Mean :13.522 Mean : 77756
3rd Qu.:16.545 3rd Qu.:14.975 3rd Qu.:16.920 3rd Qu.: 92206
Max. :43.370 Max. :24.850 Max. :33.070 Max. :146345
Next, we check for missing data. The summary of is.na() below shows FALSE for all 43 observations in every column, so there are no missing values to remove.
Year APMAM APSAB APSLAKE
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:43 FALSE:43 FALSE:43 FALSE:43
OPBPC OPRC OPSLAKE BSAAM
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:43 FALSE:43 FALSE:43 FALSE:43
Draw the scatterplot matrix for these data and summarize the information available from these plots.
We now draw a scatterplot matrix for the water data for better understandability.
#Since data has already been imported we can rename the water variable
water_supply<-water
head(water_supply,5) #first five rows of dataset
Year APMAM APSAB APSLAKE OPBPC OPRC OPSLAKE BSAAM
1 1948 9.13 3.58 3.91 4.10 7.43 6.47 54235
2 1949 5.28 4.82 5.20 7.55 11.11 10.26 67567
3 1950 4.20 3.77 3.67 9.52 12.20 11.35 66161
4 1951 4.60 4.46 3.93 11.14 15.15 11.13 68094
5 1952 7.15 4.99 4.88 16.34 20.05 22.81 107080
Scatterplot matrix for water data
pairs(water_supply,main = "Sierra Southern California Water Supply Runoff",
pch = 21, bg = "green")
For more granular insight, we fit a multiple linear regression of runoff (BSAAM) on the six precipitation measurements.
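The summary output below corresponds to the following model fit (reconstructed from the Call line in the output):
water_fit <- lm(BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE, data = water)  # regress runoff on the six precipitation sites
summary(water_fit)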
Call:
lm(formula = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC +
OPSLAKE, data = water)
Residuals:
Min 1Q Median 3Q Max
-12690 -4936 -1424 4173 18542
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15944.67 4099.80 3.889 0.000416 ***
APMAM -12.77 708.89 -0.018 0.985725
APSAB -664.41 1522.89 -0.436 0.665237
APSLAKE 2270.68 1341.29 1.693 0.099112 .
OPBPC 69.70 461.69 0.151 0.880839
OPRC 1916.45 641.36 2.988 0.005031 **
OPSLAKE 2211.58 752.69 2.938 0.005729 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7557 on 36 degrees of freedom
Multiple R-squared: 0.9248, Adjusted R-squared: 0.9123
F-statistic: 73.82 on 6 and 36 DF, p-value: < 2.2e-16
Analysis:
Starting with the variable Year, it does not appear to be particularly related to any of the other variables. APMAM, APSAB, and APSLAKE appear to be correlated with one another, but none of them shows a strong relationship with the runoff variable BSAAM.
OPSLAKE and OPRC appear to be strongly and positively correlated with each other, and BSAAM appears to be strongly correlated with both of them. Consistent with this, OPRC and OPSLAKE are the only predictors with statistically significant coefficients in the regression (Pr(>|t|) < 0.05). Because these predictors are also correlated with each other, multicollinearity may be a problem. The overall model is statistically significant, with an F-test p-value < 2.2e-16. The residuals range widely, from a minimum of -12690 to a maximum of 18542, which may suggest outliers.
Finally, the multiple R-squared of 0.9248 and the adjusted R-squared of 0.9123 are both high and close to each other, which suggests the predictors explain most of the variation in BSAAM with minimal overfitting.
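One way to follow up on the possible multicollinearity, assuming the car package is available (an assumption; it is not used elsewhere in this document), is to look at the predictor correlations and the variance inflation factors:
# pairwise correlations among the six precipitation sites
cor(water[, c("APMAM", "APSAB", "APSLAKE", "OPBPC", "OPRC", "OPSLAKE")])
library(car)  # for vif(); assumed to be installed
vif(lm(BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE, data = water))  # large VIFs (e.g. > 10) flag problematic collinearity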
Professor ratings (Data file: Rateprof) In the website and online forum RateMyProfessors.com, students rate and comment on their instructors. Launched in 1999, the site includes millions of ratings on thousands of instructors. The data file includes the summaries of the ratings of 364 instructors at a large campus in the Midwest (Bleske-Rechek and Fritsch, 2011). Each instructor included in the data had at least 10 ratings over a several year period. Students provided ratings of 1–5 on quality, helpfulness, clarity, easiness of instructor’s courses, and raterInterest in the subject matter covered in the instructor’s courses. The data file provides the averages of these five ratings. Use R to reproduce the scatterplot matrix in Figure 1.13 in the ALR book (page 20). Provide a brief description of the relationships between the five ratings. (The variables don’t have to be in the same order)
First, the Rateprof data file is imported from the alr4 package.
gender numYears numRaters numCourses pepper discipline
1 male 7 11 5 no Hum
2 male 6 11 5 no Hum
3 male 10 43 2 no Hum
4 male 11 24 5 no Hum
5 male 11 19 7 no Hum
dept quality helpfulness clarity easiness
1 English 4.636364 4.636364 4.636364 4.818182
2 Religious Studies 4.318182 4.545455 4.090909 4.363636
3 Art 4.790698 4.720930 4.860465 4.604651
4 English 4.250000 4.458333 4.041667 2.791667
5 Spanish 4.684211 4.684211 4.684211 4.473684
raterInterest sdQuality sdHelpfulness sdClarity sdEasiness
1 3.545455 0.5518564 0.6741999 0.5045250 0.4045199
2 4.000000 0.9020179 0.9341987 0.9438798 0.5045250
3 3.432432 0.4529343 0.6663898 0.4129681 0.5407021
4 3.181818 0.9325048 0.9315329 0.9990938 0.5882300
5 4.214286 0.6500112 0.8200699 0.5823927 0.6117753
sdRaterInterest
1 1.1281521
2 1.0744356
3 1.2369438
4 1.3322506
5 0.9749613
We check for missing values by running a summary of is.na() on the data; every entry below is FALSE, so there are no missing values to remove.
gender numYears numRaters numCourses
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:366 FALSE:366 FALSE:366 FALSE:366
pepper discipline dept quality
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:366 FALSE:366 FALSE:366 FALSE:366
helpfulness clarity easiness raterInterest
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:366 FALSE:366 FALSE:366 FALSE:366
sdQuality sdHelpfulness sdClarity sdEasiness
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:366 FALSE:366 FALSE:366 FALSE:366
sdRaterInterest
Mode :logical
FALSE:366
str(Rateprof)
'data.frame': 366 obs. of 17 variables:
$ gender : Factor w/ 2 levels "female","male": 2 2 2 2 2 2 2 2 2 2 ...
$ numYears : int 7 6 10 11 11 10 7 11 11 7 ...
$ numRaters : int 11 11 43 24 19 15 17 16 12 18 ...
$ numCourses : int 5 5 2 5 7 9 3 3 4 4 ...
$ pepper : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
$ discipline : Factor w/ 4 levels "Hum","SocSci",..: 1 1 1 1 1 1 1 1 1 1 ...
$ dept : Factor w/ 48 levels "Accounting","Anthropology",..: 17 42 3 17 45 45 45 17 34 17 ...
$ quality : num 4.64 4.32 4.79 4.25 4.68 ...
$ helpfulness : num 4.64 4.55 4.72 4.46 4.68 ...
$ clarity : num 4.64 4.09 4.86 4.04 4.68 ...
$ easiness : num 4.82 4.36 4.6 2.79 4.47 ...
$ raterInterest : num 3.55 4 3.43 3.18 4.21 ...
$ sdQuality : num 0.552 0.902 0.453 0.933 0.65 ...
$ sdHelpfulness : num 0.674 0.934 0.666 0.932 0.82 ...
$ sdClarity : num 0.505 0.944 0.413 0.999 0.582 ...
$ sdEasiness : num 0.405 0.505 0.541 0.588 0.612 ...
$ sdRaterInterest: num 1.128 1.074 1.237 1.332 0.975 ...
The five rating variables we are focused on are extracted from the Rateprof dataset.
#data set renamed for subset
rate_my_prof<-Rateprof
colnames(rate_my_prof) #column names in dataset
[1] "gender" "numYears" "numRaters"
[4] "numCourses" "pepper" "discipline"
[7] "dept" "quality" "helpfulness"
[10] "clarity" "easiness" "raterInterest"
[13] "sdQuality" "sdHelpfulness" "sdClarity"
[16] "sdEasiness" "sdRaterInterest"
Five variable subset of RateProf dataset
quality helpfulness clarity easiness raterInterest
1 4.636364 4.636364 4.636364 4.818182 3.545455
2 4.318182 4.545455 4.090909 4.363636 4.000000
3 4.790698 4.720930 4.860465 4.604651 3.432432
4 4.250000 4.458333 4.041667 2.791667 3.181818
5 4.684211 4.684211 4.684211 4.473684 4.214286
6 4.233333 4.266667 4.200000 4.533333 3.916667
Table of subset of data for better understandability
| quality | helpfulness | clarity | easiness | raterInterest |
|---|---|---|---|---|
| 4.63636 | 4.63636 | 4.63636 | 4.81818 | 3.54545 |
| 4.31818 | 4.54545 | 4.09091 | 4.36364 | 4.00000 |
| 4.79070 | 4.72093 | 4.86047 | 4.60465 | 3.43243 |
| 4.25000 | 4.45833 | 4.04167 | 2.79167 | 3.18182 |
| 4.68421 | 4.68421 | 4.68421 | 4.47368 | 4.21429 |
| 4.23333 | 4.26667 | 4.20000 | 4.53333 | 3.91667 |
Scatterplot Matrix of five RateProf variables
pairs(rate_my_prof[, c("quality", "helpfulness", "clarity", "easiness", "raterInterest")],
      col = "green3",
      pch = 20,
      main = "Rate My Professor Scatterplot Matrix")
Provide a brief description of the relationships between the five ratings
Interpretation:
Each panel of the scatterplot matrix shows the relationship between one pair of ratings; the more tightly the points cluster around a straight line, the stronger the linear correlation. Some pairs of variables show clearly stronger positive linear correlations than others:
Quality-Clarity indicates a very strong positive linear correlation
Quality-Helpfulness indicates a very strong positive linear correlation
Quality-Easiness indicates a weak positive linear correlation
Quality-RaterInterest indicates a weak positive linear correlation
Helpfulness-Easiness indicates a weak linear correlation
Helpfulness-RaterInterest indicates a weak linear correlation
Clarity-Helpfulness indicates a positive correlation
Clarity-RaterInterest indicates a weak positive correlation
Clarity-Easiness indicates a weak positive correlation
Easiness-RaterInterest indicates a very weak positive correlation
(Problem 9.34 in SMSS)
For the student.survey data file in the smss package, conduct regression analyses relating (i) y = political ideology and x = religiosity, (ii) y = high school GPA and x = hours of TV watching.
(You can use ?student.survey in the R console, after loading the package, to see what each variable means.)
First, I import and inspect the student.survey dataset from the smss package.
subj ge ag hi co dh dr tv sp ne ah ve pa pi
1 1 m 32 2.2 3.5 0 5.0 3 5 0 0 FALSE r conservative
2 2 f 23 2.1 3.5 1200 0.3 15 7 5 6 FALSE d liberal
3 3 f 27 3.3 3.0 1300 1.5 0 4 3 0 FALSE d liberal
4 4 f 35 3.5 3.2 1500 8.0 5 5 6 3 FALSE i moderate
5 5 m 23 3.1 3.5 1600 10.0 6 6 3 0 FALSE i very liberal
re ab aa ld
1 most weeks FALSE FALSE FALSE
2 occasionally FALSE FALSE NA
3 most weeks FALSE FALSE NA
4 occasionally FALSE FALSE FALSE
5 never FALSE FALSE FALSE
To gain a better understanding of the data, I look at a summary, the structure, and the column names of the dataset.
colnames(student_survey_data) #column names of dataset
[1] "subj" "ge" "ag" "hi" "co" "dh" "dr" "tv" "sp"
[10] "ne" "ah" "ve" "pa" "pi" "re" "ab" "aa" "ld"
summary(student_survey_data) #numeric structure of data
subj ge ag hi
Min. : 1.00 f:31 Min. :22.00 Min. :2.000
1st Qu.:15.75 m:29 1st Qu.:24.00 1st Qu.:3.000
Median :30.50 Median :26.50 Median :3.350
Mean :30.50 Mean :29.17 Mean :3.308
3rd Qu.:45.25 3rd Qu.:31.00 3rd Qu.:3.625
Max. :60.00 Max. :71.00 Max. :4.000
co dh dr tv
Min. :2.600 Min. : 0 Min. : 0.200 Min. : 0.000
1st Qu.:3.175 1st Qu.: 205 1st Qu.: 1.450 1st Qu.: 3.000
Median :3.500 Median : 640 Median : 2.000 Median : 6.000
Mean :3.453 Mean :1232 Mean : 3.818 Mean : 7.267
3rd Qu.:3.725 3rd Qu.:1350 3rd Qu.: 5.000 3rd Qu.:10.000
Max. :4.000 Max. :8000 Max. :20.000 Max. :37.000
sp ne ah ve
Min. : 0.000 Min. : 0.000 Min. : 0.000 Mode :logical
1st Qu.: 3.000 1st Qu.: 2.000 1st Qu.: 0.000 FALSE:60
Median : 5.000 Median : 3.000 Median : 0.500
Mean : 5.483 Mean : 4.083 Mean : 1.433
3rd Qu.: 7.000 3rd Qu.: 5.250 3rd Qu.: 2.000
Max. :16.000 Max. :14.000 Max. :11.000
pa pi re ab
d:21 very liberal : 8 never :15 Mode :logical
i:24 liberal :24 occasionally:29 FALSE:60
r:15 slightly liberal : 6 most weeks : 7
moderate :10 every week : 9
slightly conservative: 6
conservative : 4
very conservative : 2
aa ld
Mode :logical Mode :logical
FALSE:59 FALSE:44
NA's :1 NA's :16
str(student_survey_data)
'data.frame': 60 obs. of 18 variables:
$ subj: int 1 2 3 4 5 6 7 8 9 10 ...
$ ge : Factor w/ 2 levels "f","m": 2 1 1 1 2 2 2 1 2 2 ...
$ ag : int 32 23 27 35 23 39 24 31 34 28 ...
$ hi : num 2.2 2.1 3.3 3.5 3.1 3.5 3.6 3 3 4 ...
$ co : num 3.5 3.5 3 3.2 3.5 3.5 3.7 3 3 3.1 ...
$ dh : int 0 1200 1300 1500 1600 350 0 5000 5000 900 ...
$ dr : num 5 0.3 1.5 8 10 3 0.2 1.5 2 2 ...
$ tv : num 3 15 0 5 6 4 5 5 7 1 ...
$ sp : int 5 7 4 5 6 5 12 3 5 1 ...
$ ne : int 0 5 3 6 3 7 4 3 3 2 ...
$ ah : int 0 6 0 3 0 0 2 1 0 1 ...
$ ve : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ pa : Factor w/ 3 levels "d","i","r": 3 1 1 2 2 1 2 2 2 2 ...
$ pi : Ord.factor w/ 7 levels "very liberal"<..: 6 2 2 4 1 2 2 2 1 3 ...
$ re : Ord.factor w/ 4 levels "never"<"occasionally"<..: 3 2 3 2 1 2 2 2 2 1 ...
$ ab : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ aa : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ ld : logi FALSE NA NA FALSE FALSE NA ...
The subset of variables we are inspecting (pi, re, hi, tv) is extracted from the student.survey dataset.
pi re hi tv
1 conservative most weeks 2.2 3
2 liberal occasionally 2.1 15
3 liberal most weeks 3.3 0
4 moderate occasionally 3.5 5
5 very liberal never 3.1 6
We check for missing values and then run a summary to inspect the structure of the subset; every entry below is FALSE, so there is no missing data in these four variables.
pi re hi tv
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:60 FALSE:60 FALSE:60 FALSE:60
For easier readability the data is placed in a table.
| Political_Ideology | Religiosity | HighSchoolGPA | HoursTVWatched |
|---|---|---|---|
| conservative | most weeks | 2.2 | 3 |
| liberal | occasionally | 2.1 | 15 |
| liberal | most weeks | 3.3 | 0 |
| moderate | occasionally | 3.5 | 5 |
| very liberal | never | 3.1 | 6 |
| liberal | occasionally | 3.5 | 4 |
Part A
Use graphical ways to portray the individual variables and their relationship.
i) Now that we know the structure of the data, we can represent it visually with a plot. The first plot shows y = political ideology (pi) versus x = religiosity (re).
# survey plot using the plot function
student_survey_plot <- plot(pi ~ re, data = student.survey,
                            main = "Political Ideology vs. Religiosity")
I use the xtabs() function to gain further insight into the data, since both variables are categorical.
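A sketch of the xtabs() call that produces the cross-tabulation below:
xtabs(~ pi + re, data = student.survey)  # counts of political ideology (rows) by religiosity (columns)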
re
pi never occasionally most weeks every week
very liberal 3 5 0 0
liberal 8 14 1 1
slightly liberal 2 1 1 2
moderate 1 8 1 0
slightly conservative 1 1 2 2
conservative 0 0 2 2
very conservative 0 0 0 2
ii)
The second plot is drawn with the following variables: y = high school GPA (hi) and x = hours of TV watching (tv). Since both variables are numeric, it is more straightforward to interpret.
# survey plot using the plot function
student_survey_plot <- plot(hi ~ tv, data = student.survey,
                            xlab = "Hours of TV Watching",
                            ylab = "High school GPA",
                            col = "green",
                            main = "High School GPA vs. Hours of TV Watching")
Analysis:
Inspection of the Political Ideology (pi) vs. Religiosity (re) plot yielded very little meaningful information in its current form, since both variables are categorical; the xtabs() cross-tabulation gives more insight. The High School GPA (hi) vs. Hours of TV Watching (tv) plot, with numeric variables, did yield some insight. I will further explore any correlations in Part B.
Part B
Interpret descriptive statistics for summarizing the individual variables and their relationship.
i)
I rename all four variables of the subset for better understandability.
I then use my favorite tool, the table, to gain further insight into the political ideology and religiosity variables.
Religiosity
Political_Ideology never occasionally most weeks every week
very liberal 3 5 0 0
liberal 8 14 1 1
slightly liberal 2 1 1 2
moderate 1 8 1 0
slightly conservative 1 1 2 2
conservative 0 0 2 2
very conservative 0 0 0 2
Summary of the four variables pi, re, hi, and tv
summary(student_survey_data)
pi re hi
very liberal : 8 never :15 Min. :2.000
liberal :24 occasionally:29 1st Qu.:3.000
slightly liberal : 6 most weeks : 7 Median :3.350
moderate :10 every week : 9 Mean :3.308
slightly conservative: 6 3rd Qu.:3.625
conservative : 4 Max. :4.000
very conservative : 2
tv
Min. : 0.000
1st Qu.: 3.000
Median : 6.000
Mean : 7.267
3rd Qu.:10.000
Max. :37.000
Interpretation:
The relationship between the two categorical variables, with Political Ideology as the dependent variable and Religiosity as the independent variable, yields the following insight from the xtabs and summary output. The most common political ideology is liberal (24 students) and the most common religiosity category is occasionally (29 students). This is consistent with the cross-tabulation, where the largest cell (14 students) corresponds to liberal students who attend a religious service occasionally.
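To describe the same relationship in relative terms, the counts can be converted to proportions; this is a sketch that was not part of the original output:
joint_tab <- xtabs(~ pi + re, data = student.survey)  # counts of ideology by religiosity
round(prop.table(joint_tab), 2)                       # joint proportions
round(prop.table(joint_tab, margin = 1), 2)           # religiosity distribution within each ideology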
ii)
Here I use a linear model (a fitted regression line) to see whether there is a relationship between high school GPA and hours watching TV.
ggscatter(student.survey,x="tv",y="hi",
add = "reg.line",conf.int = TRUE,
xlab = "Hours Watching TV",ylab = "HighSchool_GPA",title = "HighSchool_GPA vs Hours Watching TV")
I can now gain further insight into the variables high school GPA (hi) and hours watching TV (tv).
library(skimr)  # skim() comes from the skimr package
skim(student_survey_data) # provides a concise, descriptive overview of the data
| Name | student_survey_data |
| Number of rows | 60 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| factor | 2 |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| pi | 0 | 1 | TRUE | 7 | lib: 24, mod: 10, ver: 8, sli: 6 |
| re | 0 | 1 | TRUE | 4 | occ: 29, nev: 15, eve: 9, mos: 7 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| hi | 0 | 1 | 3.31 | 0.46 | 2 | 3 | 3.35 | 3.62 | 4 | ▂▁▇▇▆ |
| tv | 0 | 1 | 7.27 | 6.72 | 0 | 3 | 6.00 | 10.00 | 37 | ▇▃▁▁▁ |
str(student_survey_data)
'data.frame': 60 obs. of 4 variables:
$ pi: Ord.factor w/ 7 levels "very liberal"<..: 6 2 2 4 1 2 2 2 1 3 ...
$ re: Ord.factor w/ 4 levels "never"<"occasionally"<..: 3 2 3 2 1 2 2 2 2 1 ...
$ hi: num 2.2 2.1 3.3 3.5 3.1 3.5 3.6 3 3 4 ...
$ tv: num 3 15 0 5 6 4 5 5 7 1 ...
Interpretation:
Students' high school GPA has a mean of 3.31 and a median of 3.35. The GPAs range from a minimum of 2.00 to a maximum of 4.00, with a standard deviation of 0.46, which indicates the values are clustered fairly tightly around the mean. This is consistent with the graph.
Students' hours watching TV have a mean of 7.3 hours and a median of 6 hours, ranging from a minimum of 0 to a maximum of 37 hours per week.
PART C
Summarize and interpret results of inferential analyses.
i)
To gain better insight into the political ideology and religiosity variables, I use the cor.test() function (as discussed in class). Applied to the numeric codes of the two ordered factors, cor.test() tests the correlation between political ideology (pi) and religiosity (re).
# correlation test
cor.test(as.numeric(student_survey_data$pi),as.numeric(student_survey_data$re))
Pearson's product-moment correlation
data: as.numeric(student_survey_data$pi) and as.numeric(student_survey_data$re)
t = 5.4163, df = 58, p-value = 1.221e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.3818345 0.7265650
sample estimates:
cor
0.5795661
Interpretation:
The test statistic is t = 5.4163 and the p-value is 1.221e-06. Since the p-value is less than 0.05, the correlation between political ideology and religiosity (r ≈ 0.58) is statistically significant, and we reject the null hypothesis that the true correlation is zero.
To gain better insight into high school GPA vs. hours watching TV, I use cor.test() as discussed in class.
cor.test(as.numeric(student_survey_data$hi),as.numeric(student_survey_data$tv))
Pearson's product-moment correlation
data: as.numeric(student_survey_data$hi) and as.numeric(student_survey_data$tv)
t = -2.1144, df = 58, p-value = 0.03879
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.48826914 -0.01457694
sample estimates:
cor
-0.2675115
ii)
Pearson correlation plot “HighSchoolGPA vs Hours Watching TV”
ggscatter(student.survey,x="tv",y="hi",
add = "reg.line",conf.int = TRUE,
cor.coef=TRUE,cor.method = "pearson",
xlab = "Hours Watching TV",ylab = "HighSchool_GPA",title = "HighSchool_GPA vs Hours Watching TV")
Interpretation:
The test statistic is t = -2.1144. Based on the cor.test(), the correlation between GPA and television watching (r ≈ -0.27) is statistically significant, since the p-value of 0.039 is less than 0.05; we therefore reject the null hypothesis of no correlation. Graphically, we also observe a moderately weak negative relationship between GPA and hours of television watching.
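Because the problem asks for regression analyses, an equivalent way to carry out this inference is to fit the simple linear regression of hi on tv; the t-test on the tv slope is the same test as the Pearson correlation test above. A sketch (its output is not reproduced here):
gpa_tv_fit <- lm(hi ~ tv, data = student.survey)  # y = high school GPA, x = hours of TV watching
summary(gpa_tv_fit)  # the slope's t statistic and p-value match the correlation test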
For a class of 100 students, the teacher takes the 10 students who perform poorest on the midterm exam and enrolls them in a special tutoring program. The overall class mean is 70 on both the midterm and final, but the mean for the specially tutored students increases from 50 to 60. Use the concept of regression toward the mean to explain why this is not sufficient evidence to imply that the tutoring program was successful.
Regression toward the mean refers to the fact that when one measurement of a random variable is extreme, the next measurement of the same variable tends to be closer to its mean.
In this case, the mean midterm score of the specially tutored students was 50, far below the overall class mean of 70. These 10 students were selected precisely because they were the most extreme low scorers, so bad luck on exam day likely contributed to their low midterm scores. On a second measurement, the same students would be expected to score closer to the class mean of 70 even without any tutoring, purely because of regression toward the mean. The increase from 50 to 60 is therefore not, by itself, sufficient evidence that the tutoring program was successful.
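A minimal simulation sketch of this point (the scores are simulated, not the actual class data): with no tutoring at all, the 10 lowest midterm scorers still tend to score closer to the class mean on the final.
set.seed(603)
ability <- rnorm(100, mean = 70, sd = 8)   # each student's underlying ability
midterm <- ability + rnorm(100, sd = 10)   # midterm = ability + exam-day noise
final   <- ability + rnorm(100, sd = 10)   # final = ability + independent noise, no tutoring effect
lowest10 <- order(midterm)[1:10]           # the 10 poorest midterm performers
mean(midterm[lowest10])                    # well below 70
mean(final[lowest10])                      # typically closer to 70 purely by regression toward the mean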