Homework Two

DACSS 603

Cynthia Hester
March 5, 2022

Question 1

United Nations (Data file: UN11) The data in the file UN11 contains several variables, including ppgdp, the gross national product per person in U.S. dollars, and fertility, the birth rate per 1000 females, both from the year 2009. The data are for 199 localities, mostly UN member countries, but also other areas such as Hong Kong that are not independent countries. The data were collected from the United Nations (2011). We will study the dependence of fertility on ppgdp.

1.1.1. Identify the predictor and the response.

1.1.2 Draw the scatterplot of fertility on the vertical axis versus ppgdp on the horizontal axis and summarize the information in this graph. Does a straight-line mean function seem to be plausible for a summary of this graph?

1.1.3 Draw the scatterplot of log(fertility) versus log(ppgdp) using natural logarithms. Does the simple linear regression model seem plausible for a summary of this graph? If you use a different base of logarithms, the shape of the graph won’t change, but the values on the axes will change.


Solution

1.1.1. Identify the predictor and the response.

The predictor is ppgdp, the gross national product per person in U.S. dollars. Since the problem studies the dependence of fertility on ppgdp, ppgdp is the independent (explanatory) variable.

The response variable is fertility, the variable whose dependence on ppgdp we are modeling.

1.1.2 Draw the scatterplot of fertility on the vertical axis versus ppgdp on the horizontal axis and summarize the information in this graph. Does a straight-line mean function seem to be plausible for a summary of this graph?

First, we import the United Nations data (Data file: UN11) from the data sets accompanying the ALR textbook (the alr4 package).

data("UN11")                      #load UN11 data, which contains
United_Nations_11<-UN11           #"UN11" renamed for easier readability
head(United_Nations_11)
                region  group fertility   ppgdp lifeExpF pctUrban
Afghanistan       Asia  other     5.968   499.0    49.49       23
Albania         Europe  other     1.525  3677.2    80.40       53
Algeria         Africa africa     2.142  4473.0    75.00       67
Angola          Africa africa     5.135  4321.9    53.17       59
Anguilla     Caribbean  other     2.000 13750.1    81.10      100
Argentina   Latin Amer  other     2.172  9162.1    79.89       93

I now select the two variables, fertility and ppgdp, from the UN11 dataset for inspection.

#separates fertility and ppgdp from dataset

United_Nations_11 <- United_Nations_11 %>%
  select(c(fertility,ppgdp))
head(United_Nations_11,5)               # First five rows of variables fertility and ppgdp verified
            fertility   ppgdp
Afghanistan     5.968   499.0
Albania         1.525  3677.2
Algeria         2.142  4473.0
Angola          5.135  4321.9
Anguilla        2.000 13750.1

Here I use a table to represent the variables extracted from the UN11 data

kable(head(United_Nations_11), format = "markdown", digits = 3,
      col.names = c('fertility','ppgdp'),
      caption = "United Nations Fertility and Gross National Product Per Person in USD")
Table 1: United Nations Fertility and Gross National Product Per Person in USD
fertility ppgdp
Afghanistan 5.968 499.0
Albania 1.525 3677.2
Algeria 2.142 4473.0
Angola 5.135 4321.9
Anguilla 2.000 13750.1
Argentina 2.172 9162.1

Here I check the two variables for missing values and inspect the structure of the data.

is.na(United_Nations_11) %>% #flags missing values in the two variables
  str()                      #concise overview; all FALSE means no missing data
 logi [1:199, 1:2] FALSE FALSE FALSE FALSE FALSE FALSE ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:199] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ : chr [1:2] "fertility" "ppgdp"

I now look at a numerical summary of the two variables to get a more granular overview of the data.

  summary(United_Nations_11)                 #numerical summary
   fertility         ppgdp         
 Min.   :1.134   Min.   :   114.8  
 1st Qu.:1.754   1st Qu.:  1283.0  
 Median :2.262   Median :  4684.5  
 Mean   :2.761   Mean   : 13012.0  
 3rd Qu.:3.545   3rd Qu.: 15520.5  
 Max.   :6.925   Max.   :105095.4  

Variables fertility and ppgdp renamed for better understandability

United_Nations_rename<-UN11 %>%
  rename(Gross_National_Product_Per_Person_USD=ppgdp)%>%
  rename(Fertility_Birthrate_per_1000_females_from_2009=fertility)

We now look at the scatterplot of the UN11 data (renamed United_Nations_11), with fertility on the vertical axis and ppgdp on the horizontal axis.

United_Nations_11<-UN11
ggplot(data = UN11, aes(x = ppgdp ,y = fertility))+
  geom_point(color=5)+
  labs(title = "Fertility vs United Nations Gross National Product Per Person USD")


Analysis

This scatterplot does not appear to be effectively summarized by a straight-line mean function: the mean function looks curved and the variance does not appear constant. Part of the difficulty is the strong right-skew of ppgdp, which crowds most of the points against the left side of the plot. Re-expressing both axes with natural logarithms is therefore warranted to see whether a straight-line mean function becomes plausible.


1.1.3 Draw the scatterplot of log(fertility) versus log(ppgdp) using natural logarithms. Does the simple linear regression model seem plausible for a summary of this graph? If you use a different base of logarithms, the shape of the graph won’t change, but the values on the axes will change.


Scatterplot reflecting Natural Log of variables fertility vs. ppgdp

United_Nations_11<-UN11
ggplot(data = UN11,aes(x = log(ppgdp),y = log(fertility)))+
  geom_point(color=5)+
  labs(title = "Natural_Log of Fertility vs UN Gross National Product Per Person USD")

Scatterplot reflecting Natural Log of variables fertility vs. ppgdp with linear regression

United_Nations_11<-UN11
ggplot(data = UN11,aes(x = log(ppgdp),y = log(fertility)))+
  geom_point(color=5)+
  geom_smooth(method ="lm")+
  labs(title = "Natural_Log of Fertility vs UN Gross National Product Per Person USD")

Analysis

On the natural-log scale, the UN11 scatterplot appears to be well summarized by a simple linear regression. Compared with the previous scatterplot, the points fall roughly along a straight line and the variance appears reasonably constant, so the relationship between log(fertility) and log(ppgdp) appears approximately linear and negative: localities with higher gross national product per person tend to have lower fertility.
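As a rough check on this visual impression, a minimal sketch of the corresponding simple linear regression fit on the log-log scale (using the UN11 data already loaded above; the object name is mine, for illustration):

#Sketch: simple linear regression of log(fertility) on log(ppgdp)
log_fit <- lm(log(fertility) ~ log(ppgdp), data = UN11)
summary(log_fit)      #slope estimates the change in log(fertility) per unit change in log(ppgdp)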


Question 2

Annual income, in dollars, is an explanatory variable in a regression analysis. For a British version of the report on the analysis, all responses are converted to British pounds sterling (1 pound equals about 1.33 dollars, as of 2016).

Solution

Part A

How, if at all, does the slope of the prediction equation change?

When the incomes are re-expressed in British pounds sterling, each value of the explanatory variable is divided by 1.33 (since 1 pound equals about 1.33 dollars). The slope of the prediction equation is therefore multiplied by 1.33: the slope scales inversely with the units of the explanatory variable, so the fitted predictions themselves are unchanged.

Part B

How, if at all, does the correlation change?

There will be no change in the correlation. Correlation is unit-free: rescaling a variable by a positive constant (a change of units) does not affect the strength or direction of the linear relationship.
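A minimal simulated sketch of both answers (all values and object names below are invented for illustration, not taken from the assignment): dividing the income values by 1.33 to convert them to pounds multiplies the fitted slope by 1.33 but leaves the correlation unchanged.

set.seed(603)                                       #simulated example only
income_usd <- runif(100, 20000, 120000)             #hypothetical annual incomes in dollars
y <- 10 + 0.0005*income_usd + rnorm(100, sd = 5)    #hypothetical response variable
income_gbp <- income_usd/1.33                       #same incomes expressed in pounds
coef(lm(y ~ income_usd))[2]                         #slope per dollar
coef(lm(y ~ income_gbp))[2]                         #slope per pound: 1.33 times larger
cor(y, income_usd); cor(y, income_gbp)              #correlations are identical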


Question 3

Water runoff in the Sierras (Data file: water) Can Southern California’s water supply in future years be predicted from past data? One factor affecting water availability is stream runoff. If runoff could be predicted, engineers, planners, and policy makers could do their jobs more efficiently. The data file contains 43 years’ worth of precipitation measurements taken at six sites in the Sierra Nevada mountains (labeled APMAM, APSAB, APSLAKE, OPBPC, OPRC, and OPSLAKE) and stream runoff volume at a site near Bishop, California, labeled BSAAM.

Draw the scatterplot matrix for these data and summarize the information available from these plots.

Solution

First, we load the water data (Data file: water) and check the structure of the data.

data("water")                       #importing water data set
head(water,5)                       #looking at first 5 rows of data set
  Year APMAM APSAB APSLAKE OPBPC  OPRC OPSLAKE  BSAAM
1 1948  9.13  3.58    3.91  4.10  7.43    6.47  54235
2 1949  5.28  4.82    5.20  7.55 11.11   10.26  67567
3 1950  4.20  3.77    3.67  9.52 12.20   11.35  66161
4 1951  4.60  4.46    3.93 11.14 15.15   11.13  68094
5 1952  7.15  4.99    4.88 16.34 20.05   22.81 107080
str(water)                          #concise look at data frame
'data.frame':   43 obs. of  8 variables:
 $ Year   : int  1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 ...
 $ APMAM  : num  9.13 5.28 4.2 4.6 7.15 9.7 5.02 6.7 10.5 9.1 ...
 $ APSAB  : num  3.58 4.82 3.77 4.46 4.99 5.65 1.45 7.44 5.85 6.13 ...
 $ APSLAKE: num  3.91 5.2 3.67 3.93 4.88 4.91 1.77 6.51 3.38 4.08 ...
 $ OPBPC  : num  4.1 7.55 9.52 11.14 16.34 ...
 $ OPRC   : num  7.43 11.11 12.2 15.15 20.05 ...
 $ OPSLAKE: num  6.47 10.26 11.35 11.13 22.81 ...
 $ BSAAM  : int  54235 67567 66161 68094 107080 67594 65356 67909 92715 70024 ...
summary(water)                      #provides numerical overview of data
      Year          APMAM            APSAB           APSLAKE     
 Min.   :1948   Min.   : 2.700   Min.   : 1.450   Min.   : 1.77  
 1st Qu.:1958   1st Qu.: 4.975   1st Qu.: 3.390   1st Qu.: 3.36  
 Median :1969   Median : 7.080   Median : 4.460   Median : 4.62  
 Mean   :1969   Mean   : 7.323   Mean   : 4.652   Mean   : 4.93  
 3rd Qu.:1980   3rd Qu.: 9.115   3rd Qu.: 5.685   3rd Qu.: 5.83  
 Max.   :1990   Max.   :18.080   Max.   :11.960   Max.   :13.02  
     OPBPC             OPRC           OPSLAKE           BSAAM       
 Min.   : 4.050   Min.   : 4.350   Min.   : 4.600   Min.   : 41785  
 1st Qu.: 7.975   1st Qu.: 7.875   1st Qu.: 8.705   1st Qu.: 59857  
 Median : 9.550   Median :11.110   Median :12.140   Median : 69177  
 Mean   :12.836   Mean   :12.002   Mean   :13.522   Mean   : 77756  
 3rd Qu.:16.545   3rd Qu.:14.975   3rd Qu.:16.920   3rd Qu.: 92206  
 Max.   :43.370   Max.   :24.850   Max.   :33.070   Max.   :146345  

For better readability and understandability of the data, a table is created

library(DT)
datatable(head(water,25), rownames = TRUE, options = list(
  columnDefs = list(list(className = 'dt-center', targets = 7)),
  pageLength = 5,
  autoWidth = TRUE,
  lengthMenu = c(5,10,15,20)
))

Checking for any missing data

is.na(water) %>%     #flags missing values in the variables we are working with
  summary()          #all-FALSE counts confirm there are no missing values
    Year           APMAM           APSAB          APSLAKE       
 Mode :logical   Mode :logical   Mode :logical   Mode :logical  
 FALSE:43        FALSE:43        FALSE:43        FALSE:43       
   OPBPC            OPRC          OPSLAKE          BSAAM        
 Mode :logical   Mode :logical   Mode :logical   Mode :logical  
 FALSE:43        FALSE:43        FALSE:43        FALSE:43       

Draw the scatterplot matrix for these data and summarize the information available from these plots.

We now draw a scatterplot matrix for the water data for a better view of the pairwise relationships.

#Since data has already been imported we can rename the water variable
water_supply<-water
head(water_supply,5)            #first five rows of dataset
  Year APMAM APSAB APSLAKE OPBPC  OPRC OPSLAKE  BSAAM
1 1948  9.13  3.58    3.91  4.10  7.43    6.47  54235
2 1949  5.28  4.82    5.20  7.55 11.11   10.26  67567
3 1950  4.20  3.77    3.67  9.52 12.20   11.35  66161
4 1951  4.60  4.46    3.93 11.14 15.15   11.13  68094
5 1952  7.15  4.99    4.88 16.34 20.05   22.81 107080

Scatterplot matrix for water data

pairs(water_supply,main = "Sierra Southern California Water Supply Runoff",
      pch = 21, bg = "green")

For more granular insight, we fit a multiple linear regression of runoff (BSAAM) on the six precipitation measurements.

#Multiple linear regression of runoff (BSAAM) on the six precipitation sites

lm_water_supply<-lm(BSAAM~APMAM+APSAB+APSLAKE+OPBPC+OPRC+OPSLAKE,data = water)
summary(lm_water_supply)

Call:
lm(formula = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + 
    OPSLAKE, data = water)

Residuals:
   Min     1Q Median     3Q    Max 
-12690  -4936  -1424   4173  18542 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15944.67    4099.80   3.889 0.000416 ***
APMAM         -12.77     708.89  -0.018 0.985725    
APSAB        -664.41    1522.89  -0.436 0.665237    
APSLAKE      2270.68    1341.29   1.693 0.099112 .  
OPBPC          69.70     461.69   0.151 0.880839    
OPRC         1916.45     641.36   2.988 0.005031 ** 
OPSLAKE      2211.58     752.69   2.938 0.005729 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7557 on 36 degrees of freedom
Multiple R-squared:  0.9248,    Adjusted R-squared:  0.9123 
F-statistic: 73.82 on 6 and 36 DF,  p-value: < 2.2e-16

Analysis:

Starting with the variable Year, it does not appear to be particularly related to any of the other variables. APMAM, APSAB, and APSLAKE appear to be correlated with one another, but in relation to the runoff variable BSAAM they do not appear to have strong correlations.

OPSLAKE and OPRC appear to be strongly and positively correlated with each other, and BSAAM appears to be strongly correlated with both of them; consistent with this, OPRC and OPSLAKE are the only predictors in the regression with Pr(>|t|) < 0.05. Because these predictors are also correlated with each other, they may cause a multicollinearity problem. Furthermore, with an overall p-value < 2.2e-16, the model as a whole appears to be statistically significant. The residuals show a wide spread, from a minimum of -12690 to a maximum of 18542, which may suggest outliers in the model.

Finally, the multiple R-squared of 0.9248 and the adjusted R-squared of 0.9123 are close to 1 and close to each other, which suggests the predictors explain most of the variation in runoff and that the fit is not driven simply by the number of predictors in the model.
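As a rough follow-up on the multicollinearity concern noted above, one could look at the pairwise correlations among the sites and at variance inflation factors (a minimal sketch; it assumes the car package, which loads with alr4, is available):

round(cor(water[,-1]), 2)        #correlations among the six sites and BSAAM (Year dropped)
library(car)                     #for vif()
vif(lm_water_supply)             #large VIFs (roughly above 5-10) would flag collinearity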


Question 4

Professor ratings (Data file: Rateprof) In the website and online forum RateMyProfessors.com, students rate and comment on their instructors. Launched in 1999, the site includes millions of ratings on thousands of instructors. The data file includes the summaries of the ratings of 364 instructors at a large campus in the Midwest (Bleske-Rechek and Fritsch, 2011). Each instructor included in the data had at least 10 ratings over a several year period. Students provided ratings of 1–5 on quality, helpfulness, clarity, easiness of instructor’s courses, and raterInterest in the subject matter covered in the instructor’s courses. The data file provides the averages of these five ratings. Use R to reproduce the scatterplot matrix in Figure 1.13 in the ALR book (page 20). Provide a brief description of the relationships between the five ratings. (The variables don’t have to be in the same order)

Solution

First, the data (Data file: Rateprof) are imported from the alr4 package.

data("Rateprof")
head(Rateprof,5)                           #first 5 rows of data
  gender numYears numRaters numCourses pepper discipline
1   male        7        11          5     no        Hum
2   male        6        11          5     no        Hum
3   male       10        43          2     no        Hum
4   male       11        24          5     no        Hum
5   male       11        19          7     no        Hum
               dept  quality helpfulness  clarity easiness
1           English 4.636364    4.636364 4.636364 4.818182
2 Religious Studies 4.318182    4.545455 4.090909 4.363636
3               Art 4.790698    4.720930 4.860465 4.604651
4           English 4.250000    4.458333 4.041667 2.791667
5           Spanish 4.684211    4.684211 4.684211 4.473684
  raterInterest sdQuality sdHelpfulness sdClarity sdEasiness
1      3.545455 0.5518564     0.6741999 0.5045250  0.4045199
2      4.000000 0.9020179     0.9341987 0.9438798  0.5045250
3      3.432432 0.4529343     0.6663898 0.4129681  0.5407021
4      3.181818 0.9325048     0.9315329 0.9990938  0.5882300
5      4.214286 0.6500112     0.8200699 0.5823927  0.6117753
  sdRaterInterest
1       1.1281521
2       1.0744356
3       1.2369438
4       1.3322506
5       0.9749613

The data are checked for missing values, and a summary of the check is run to gain insight into the structure of the data.

is.na(Rateprof) %>%        #flags missing values in the data set
summary()                  #all-FALSE counts confirm there are no missing values
   gender         numYears       numRaters       numCourses     
 Mode :logical   Mode :logical   Mode :logical   Mode :logical  
 FALSE:366       FALSE:366       FALSE:366       FALSE:366      
   pepper        discipline         dept          quality       
 Mode :logical   Mode :logical   Mode :logical   Mode :logical  
 FALSE:366       FALSE:366       FALSE:366       FALSE:366      
 helpfulness      clarity         easiness       raterInterest  
 Mode :logical   Mode :logical   Mode :logical   Mode :logical  
 FALSE:366       FALSE:366       FALSE:366       FALSE:366      
 sdQuality       sdHelpfulness   sdClarity       sdEasiness     
 Mode :logical   Mode :logical   Mode :logical   Mode :logical  
 FALSE:366       FALSE:366       FALSE:366       FALSE:366      
 sdRaterInterest
 Mode :logical  
 FALSE:366      
str(Rateprof)
'data.frame':   366 obs. of  17 variables:
 $ gender         : Factor w/ 2 levels "female","male": 2 2 2 2 2 2 2 2 2 2 ...
 $ numYears       : int  7 6 10 11 11 10 7 11 11 7 ...
 $ numRaters      : int  11 11 43 24 19 15 17 16 12 18 ...
 $ numCourses     : int  5 5 2 5 7 9 3 3 4 4 ...
 $ pepper         : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ discipline     : Factor w/ 4 levels "Hum","SocSci",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ dept           : Factor w/ 48 levels "Accounting","Anthropology",..: 17 42 3 17 45 45 45 17 34 17 ...
 $ quality        : num  4.64 4.32 4.79 4.25 4.68 ...
 $ helpfulness    : num  4.64 4.55 4.72 4.46 4.68 ...
 $ clarity        : num  4.64 4.09 4.86 4.04 4.68 ...
 $ easiness       : num  4.82 4.36 4.6 2.79 4.47 ...
 $ raterInterest  : num  3.55 4 3.43 3.18 4.21 ...
 $ sdQuality      : num  0.552 0.902 0.453 0.933 0.65 ...
 $ sdHelpfulness  : num  0.674 0.934 0.666 0.932 0.82 ...
 $ sdClarity      : num  0.505 0.944 0.413 0.999 0.582 ...
 $ sdEasiness     : num  0.405 0.505 0.541 0.588 0.612 ...
 $ sdRaterInterest: num  1.128 1.074 1.237 1.332 0.975 ...

The five ratings we are focused on will be extracted from the Rateprof dataset.

#data set renamed for subset
rate_my_prof<-Rateprof
colnames(rate_my_prof)            #column names in dataset
 [1] "gender"          "numYears"        "numRaters"      
 [4] "numCourses"      "pepper"          "discipline"     
 [7] "dept"            "quality"         "helpfulness"    
[10] "clarity"         "easiness"        "raterInterest"  
[13] "sdQuality"       "sdHelpfulness"   "sdClarity"      
[16] "sdEasiness"      "sdRaterInterest"

Five variable subset of RateProf dataset

#subset of RateProf data

rate_my_prof<-select(Rateprof,c('quality','helpfulness','clarity','easiness','raterInterest'))
head(rate_my_prof)     #first six rows of the data set
   quality helpfulness  clarity easiness raterInterest
1 4.636364    4.636364 4.636364 4.818182      3.545455
2 4.318182    4.545455 4.090909 4.363636      4.000000
3 4.790698    4.720930 4.860465 4.604651      3.432432
4 4.250000    4.458333 4.041667 2.791667      3.181818
5 4.684211    4.684211 4.684211 4.473684      4.214286
6 4.233333    4.266667 4.200000 4.533333      3.916667

Table of subset of data for better understandability

kable(head(rate_my_prof), format = "markdown", digits = 5,
      col.names = c('Quality','Helpfulness','Clarity','Easiness','RaterInterest'),
      caption = "Rate My Professor")
Table 2: Rate My Professor
quality helpfulness clarity easiness raterInterest
4.63636 4.63636 4.63636 4.81818 3.54545
4.31818 4.54545 4.09091 4.36364 4.00000
4.79070 4.72093 4.86047 4.60465 3.43243
4.25000 4.45833 4.04167 2.79167 3.18182
4.68421 4.68421 4.68421 4.47368 4.21429
4.23333 4.26667 4.20000 4.53333 3.91667

Scatterplot Matrix of five RateProf variables

pairs(rate_my_prof,
      col = "green3",
      pch = 20,
      main = "Rate my Professor Matrix ScatterPlot ")

Provide a brief description of the relationships between the five ratings

Interpretation:

Each panel of the matrix plots one rating against another, and the strength of the linear pattern varies considerably across pairs. Quality, helpfulness, and clarity show strong, positive, roughly linear relationships with one another. Easiness is positively associated with those three ratings but less strongly, and raterInterest shows the weakest (though still positive) association with the other ratings.

In short, some pairs of variables show much stronger positive linear correlations than others.
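To put rough numbers on these visual impressions, the pairwise correlations of the five ratings can be computed from the subset created above (a minimal sketch):

round(cor(rate_my_prof), 2)      #Pearson correlations among quality, helpfulness, clarity, easiness, raterInterest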


Question 5

(Problem 9.34 in SMSS)

For the student.survey data file in the smss package, conduct regression analyses relating (i) y = political ideology and x = religiosity, (ii) y = high school GPA and x = hours of TV watching.

(You can use ?student.survey in the R console, after loading the package, to see what each variable means.)


Solution

First, I import and inspect the student.survey dataset from the smss package.

data("student.survey")                   #import dataset
student_survey_data<-student.survey      #renamed for better understandability
head(student_survey_data,5)              #inspects first five rows of data
  subj ge ag  hi  co   dh   dr tv sp ne ah    ve pa           pi
1    1  m 32 2.2 3.5    0  5.0  3  5  0  0 FALSE  r conservative
2    2  f 23 2.1 3.5 1200  0.3 15  7  5  6 FALSE  d      liberal
3    3  f 27 3.3 3.0 1300  1.5  0  4  3  0 FALSE  d      liberal
4    4  f 35 3.5 3.2 1500  8.0  5  5  6  3 FALSE  i     moderate
5    5  m 23 3.1 3.5 1600 10.0  6  6  3  0 FALSE  i very liberal
            re    ab    aa    ld
1   most weeks FALSE FALSE FALSE
2 occasionally FALSE FALSE    NA
3   most weeks FALSE FALSE    NA
4 occasionally FALSE FALSE FALSE
5        never FALSE FALSE FALSE

To gain a better understanding of the data, I look at a summary, the structure, and the column names of the data set.

colnames(student_survey_data)                 #column names of dataset
 [1] "subj" "ge"   "ag"   "hi"   "co"   "dh"   "dr"   "tv"   "sp"  
[10] "ne"   "ah"   "ve"   "pa"   "pi"   "re"   "ab"   "aa"   "ld"  
summary(student_survey_data)                  #numeric structure of data
      subj       ge           ag              hi       
 Min.   : 1.00   f:31   Min.   :22.00   Min.   :2.000  
 1st Qu.:15.75   m:29   1st Qu.:24.00   1st Qu.:3.000  
 Median :30.50          Median :26.50   Median :3.350  
 Mean   :30.50          Mean   :29.17   Mean   :3.308  
 3rd Qu.:45.25          3rd Qu.:31.00   3rd Qu.:3.625  
 Max.   :60.00          Max.   :71.00   Max.   :4.000  
                                                       
       co              dh             dr               tv        
 Min.   :2.600   Min.   :   0   Min.   : 0.200   Min.   : 0.000  
 1st Qu.:3.175   1st Qu.: 205   1st Qu.: 1.450   1st Qu.: 3.000  
 Median :3.500   Median : 640   Median : 2.000   Median : 6.000  
 Mean   :3.453   Mean   :1232   Mean   : 3.818   Mean   : 7.267  
 3rd Qu.:3.725   3rd Qu.:1350   3rd Qu.: 5.000   3rd Qu.:10.000  
 Max.   :4.000   Max.   :8000   Max.   :20.000   Max.   :37.000  
                                                                 
       sp               ne               ah             ve         
 Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Mode :logical  
 1st Qu.: 3.000   1st Qu.: 2.000   1st Qu.: 0.000   FALSE:60       
 Median : 5.000   Median : 3.000   Median : 0.500                  
 Mean   : 5.483   Mean   : 4.083   Mean   : 1.433                  
 3rd Qu.: 7.000   3rd Qu.: 5.250   3rd Qu.: 2.000                  
 Max.   :16.000   Max.   :14.000   Max.   :11.000                  
                                                                   
 pa                         pi                re         ab         
 d:21   very liberal         : 8   never       :15   Mode :logical  
 i:24   liberal              :24   occasionally:29   FALSE:60       
 r:15   slightly liberal     : 6   most weeks  : 7                  
        moderate             :10   every week  : 9                  
        slightly conservative: 6                                    
        conservative         : 4                                    
        very conservative    : 2                                    
     aa              ld         
 Mode :logical   Mode :logical  
 FALSE:59        FALSE:44       
 NA's :1         NA's :16       
                                
                                
                                
                                
str(student_survey_data)
'data.frame':   60 obs. of  18 variables:
 $ subj: int  1 2 3 4 5 6 7 8 9 10 ...
 $ ge  : Factor w/ 2 levels "f","m": 2 1 1 1 2 2 2 1 2 2 ...
 $ ag  : int  32 23 27 35 23 39 24 31 34 28 ...
 $ hi  : num  2.2 2.1 3.3 3.5 3.1 3.5 3.6 3 3 4 ...
 $ co  : num  3.5 3.5 3 3.2 3.5 3.5 3.7 3 3 3.1 ...
 $ dh  : int  0 1200 1300 1500 1600 350 0 5000 5000 900 ...
 $ dr  : num  5 0.3 1.5 8 10 3 0.2 1.5 2 2 ...
 $ tv  : num  3 15 0 5 6 4 5 5 7 1 ...
 $ sp  : int  5 7 4 5 6 5 12 3 5 1 ...
 $ ne  : int  0 5 3 6 3 7 4 3 3 2 ...
 $ ah  : int  0 6 0 3 0 0 2 1 0 1 ...
 $ ve  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ pa  : Factor w/ 3 levels "d","i","r": 3 1 1 2 2 1 2 2 2 2 ...
 $ pi  : Ord.factor w/ 7 levels "very liberal"<..: 6 2 2 4 1 2 2 2 1 3 ...
 $ re  : Ord.factor w/ 4 levels "never"<"occasionally"<..: 3 2 3 2 1 2 2 2 2 1 ...
 $ ab  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ aa  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ ld  : logi  FALSE NA NA FALSE FALSE NA ...

A subset of the variables we are inspecting is extracted from the student survey dataset.

student_survey_data<-student_survey_data %>% 
  select(c(pi,re,hi,tv))               #data subset for plots
  head(student_survey_data,5)
            pi           re  hi tv
1 conservative   most weeks 2.2  3
2      liberal occasionally 2.1 15
3      liberal   most weeks 3.3  0
4     moderate occasionally 3.5  5
5 very liberal        never 3.1  6

The subset is checked for missing values, and a summary of the check is run to inspect the structure of the data.

is.na(student_survey_data) %>%         #flags missing values in the subset
summary()                              #all-FALSE counts confirm there are no missing values
     pi              re              hi              tv         
 Mode :logical   Mode :logical   Mode :logical   Mode :logical  
 FALSE:60        FALSE:60        FALSE:60        FALSE:60       

For easier readability the data is placed in a table.

kable(head(student_survey_data), format = "markdown", digits = 3,
      col.names = c('Political_Ideology','Religiosity','HighSchoolGPA','HoursTVWatched'),
      caption = "Student_Survey_Data")
Table 3: Student_Survey_Data
Political_Ideology Religiosity HighSchoolGPA HoursTVWatched
conservative most weeks 2.2 3
liberal occasionally 2.1 15
liberal most weeks 3.3 0
moderate occasionally 3.5 5
very liberal never 3.1 6
liberal occasionally 3.5 4

Part A

Use graphical ways to portray the individual variables and their relationship.

i) Now that we know the structure of the data we can represent it visually with a plot. The first plot represents variables: y = political ideology(pi) vs x = religiosity(re).

student_survey_plot<-plot(pi~re,data = student.survey,
        #survey plot using plot function
                          main = "Political Ideology vs. Religiosity")

I use the xtabs() function to gain further insight into the data, since both variables are categorical.

data("student.survey")
xtabs(~pi+re,student.survey)
                       re
pi                      never occasionally most weeks every week
  very liberal              3            5          0          0
  liberal                   8           14          1          1
  slightly liberal          2            1          1          2
  moderate                  1            8          1          0
  slightly conservative     1            1          2          2
  conservative              0            0          2          2
  very conservative         0            0          0          2

ii)

The second plot shows y = high school GPA (hi) against x = hours of TV watching (tv). Since both variables are numeric, the plot is more straightforward to interpret.

student_survey_plot<-plot(hi~tv,data = student.survey,           #survey plot using plot function
                          xlab="Hours of TV Watching",
                          ylab="High school GPA",
                          col = "green",
                          main = "High School GPA vs. Hours of TV Watching")

Analysis:

Inspection of the Political Ideology (pi) vs. Religiosity (re) plot yielded very little meaningful information in its current form, since both variables are categorical, so I used the xtabs() function to gain further insight into the data. The High School GPA (hi) vs. Hours of TV Watching (tv) data, by contrast, are numeric, and that plot did yield some insight. I will explore any correlations further in Part B.


Part B

Interpret descriptive statistics for summarizing the individual variables and their relationship.

i)

I rename all 4 variables of the subset for better understandability

student_survey_rename<-student.survey %>% 
  rename(Political_Ideology=pi) %>% 
  rename(Religiosity=re) %>% 
  rename(Hours_of_TV=tv) %>% 
  rename(HighSchool_GPA=hi)

I then use my favorite tool, the table, to gain further insight into the political ideology and religiosity variables.

data("student.survey")
xtabs(~Political_Ideology+Religiosity,student_survey_rename)
                       Religiosity
Political_Ideology      never occasionally most weeks every week
  very liberal              3            5          0          0
  liberal                   8           14          1          1
  slightly liberal          2            1          1          2
  moderate                  1            8          1          0
  slightly conservative     1            1          2          2
  conservative              0            0          2          2
  very conservative         0            0          0          2

Summary of 4 variables pi,re,hi,tv

summary(student_survey_data)
                     pi                re           hi       
 very liberal         : 8   never       :15   Min.   :2.000  
 liberal              :24   occasionally:29   1st Qu.:3.000  
 slightly liberal     : 6   most weeks  : 7   Median :3.350  
 moderate             :10   every week  : 9   Mean   :3.308  
 slightly conservative: 6                     3rd Qu.:3.625  
 conservative         : 4                     Max.   :4.000  
 very conservative    : 2                                    
       tv        
 Min.   : 0.000  
 1st Qu.: 3.000  
 Median : 6.000  
 Mean   : 7.267  
 3rd Qu.:10.000  
 Max.   :37.000  
                 

Interpretation:

The relationship between the two categorical variables Political Ideology (the dependent variable) and Religiosity (the independent variable) can be read from the xtabs and summary output. The most common ideology is liberal (24 of the 60 students), and the most common level of religiosity is attending a service occasionally (29 of 60). Consistent with this, the largest cell of the cross-tabulation is the 14 students who identify as liberal and attend occasionally, while the conservative categories are concentrated among students who attend most weeks or every week.
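As an additional sketch, the cross-tabulation can be expressed as proportions within each religiosity category, which makes the pattern easier to compare across columns:

prop.table(xtabs(~pi + re, data = student.survey), margin = 2)   #distribution of ideology within each religiosity level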


ii)

Here I fit a linear regression line to determine whether there is a relationship between high school GPA and hours watching TV.

ggscatter(student.survey,x="tv",y="hi",
          add = "reg.line",conf.int = TRUE,
          xlab = "Hours Watching TV",ylab = "HighSchool_GPA",title = "HighSchool_GPA vs Hours Watching TV")

I can now gain further insight into the variables high school GPA (hi) and hours watching TV (tv).

skim(student_survey_data)      #provides concise,descriptive insight into the data
Table 4: Data summary
Name student_survey_data
Number of rows 60
Number of columns 4
_______________________
Column type frequency:
factor 2
numeric 2
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
pi 0 1 TRUE 7 lib: 24, mod: 10, ver: 8, sli: 6
re 0 1 TRUE 4 occ: 29, nev: 15, eve: 9, mos: 7

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
hi 0 1 3.31 0.46 2 3 3.35 3.62 4 ▂▁▇▇▆
tv 0 1 7.27 6.72 0 3 6.00 10.00 37 ▇▃▁▁▁
str(student_survey_data)
'data.frame':   60 obs. of  4 variables:
 $ pi: Ord.factor w/ 7 levels "very liberal"<..: 6 2 2 4 1 2 2 2 1 3 ...
 $ re: Ord.factor w/ 4 levels "never"<"occasionally"<..: 3 2 3 2 1 2 2 2 2 1 ...
 $ hi: num  2.2 2.1 3.3 3.5 3.1 3.5 3.6 3 3 4 ...
 $ tv: num  3 15 0 5 6 4 5 5 7 1 ...
summary(student_survey_data)
                     pi                re           hi       
 very liberal         : 8   never       :15   Min.   :2.000  
 liberal              :24   occasionally:29   1st Qu.:3.000  
 slightly liberal     : 6   most weeks  : 7   Median :3.350  
 moderate             :10   every week  : 9   Mean   :3.308  
 slightly conservative: 6                     3rd Qu.:3.625  
 conservative         : 4                     Max.   :4.000  
 very conservative    : 2                                    
       tv        
 Min.   : 0.000  
 1st Qu.: 3.000  
 Median : 6.000  
 Mean   : 7.267  
 3rd Qu.:10.000  
 Max.   :37.000  
                 

Interpretation:

Students' high school GPA has a mean of 3.31 and a median of 3.35, with GPAs ranging from a minimum of 2.00 to a maximum of 4.00. The standard deviation of 0.46 indicates the GPAs are clustered around the mean, which is consistent with the graph.

Students' hours watching TV have a mean of 7.3 hours and a median of 6 hours, ranging from a minimum of 0 to a maximum of 37 hours per week.

PART C

Summarize and interpret results of inferential analyses.

i)

To gain better insight into the political ideology and religiosity variables, I use the cor.test() function (as discussed in class). cor.test() tests the correlation between the paired samples political ideology (pi) and religiosity (re), with both ordered factors converted to numeric scores.

# correlation test
cor.test(as.numeric(student_survey_data$pi),as.numeric(student_survey_data$re))

    Pearson's product-moment correlation

data:  as.numeric(student_survey_data$pi) and as.numeric(student_survey_data$re)
t = 5.4163, df = 58, p-value = 1.221e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3818345 0.7265650
sample estimates:
      cor 
0.5795661 

Interpretation:

The test statistic is t = 5.4163 and the p-value = 1.221e-06 is less than 0.05, so the correlation between political ideology and religiosity (r ≈ 0.58) is statistically significant and we reject the null hypothesis of zero correlation. Because both variables are coded from least to most conservative/religious, the positive correlation indicates that students who attend religious services more often tend to report more conservative political ideologies.
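Since the assignment asks for a regression analysis, a minimal sketch of the corresponding simple linear regression is given below; treating the ordered factors as numeric 1-7 and 1-4 scores is an assumption of this sketch, as is the object name.

pi_re_fit <- lm(as.numeric(pi) ~ as.numeric(re), data = student.survey)   #ideology score regressed on religiosity score
summary(pi_re_fit)               #slope is the estimated change in ideology score per one-step increase in religiosity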


To gain better insight into the relationship between high school GPA and hours watching TV, I again use cor.test() as discussed in class.

cor.test(as.numeric(student_survey_data$hi),as.numeric(student_survey_data$tv))

    Pearson's product-moment correlation

data:  as.numeric(student_survey_data$hi) and as.numeric(student_survey_data$tv)
t = -2.1144, df = 58, p-value = 0.03879
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.48826914 -0.01457694
sample estimates:
       cor 
-0.2675115 

ii)

Pearson correlation plot “HighSchoolGPA vs Hours Watching TV”

ggscatter(student.survey,x="tv",y="hi",
          add = "reg.line",conf.int = TRUE,
          cor.coef=TRUE,cor.method = "pearson",
          xlab = "Hours Watching TV",ylab = "HighSchool_GPA",title = "HighSchool_GPA vs Hours Watching TV")

Interpretation:

The test statistic is t = -2.1144 with a p-value of 0.039.

Based on the cor.test we see that the correlation between GPA and television watching is statistically significant, since the p-value of 0.039 is less than 0.05; we therefore reject the null hypothesis of zero correlation. The sample correlation (r ≈ -0.27) and the plot both indicate a moderately weak negative relationship: more hours of television watching is associated with a slightly lower high school GPA.
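Because the assignment also asks for a regression analysis here, a minimal sketch of the simple linear regression of high school GPA on hours of TV watching (the object name is mine):

gpa_tv_fit <- lm(hi ~ tv, data = student.survey)    #GPA regressed on weekly hours of TV
summary(gpa_tv_fit)              #slope is the estimated change in GPA per additional hour of TV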


Question 6

For a class of 100 students, the teacher takes the 10 students who perform poorest on the midterm exam and enrolls them in a special tutoring program. The overall class mean is 70 on both the midterm and final, but the mean for the specially tutored students increases from 50 to 60. Use the concept of regression toward the mean to explain why this is not sufficient evidence to imply that the tutoring program was successful.


Solution

Regression toward the mean refers to the fact that if one measurement of a random variable is extreme, the next measurement of the same variable is likely to be closer to its mean.

In this case, the specially tutored students had a midterm mean of 50, which is very low compared with the overall class mean of 70. On the final, the same group is likely to score closer to the overall mean of 70 purely because of this statistical phenomenon, not because of any special effect of the tutoring program. Even without tutoring, the group's final-exam mean would most likely have been closer to 70 than their midterm mean of 50, so the increase from 50 to 60 is not, by itself, sufficient evidence that the tutoring program was successful.
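A minimal simulation sketch of this point (all numbers are invented for illustration): scores are true ability plus independent exam noise, and the ten lowest midterm scorers improve on the final even though no tutoring effect is simulated.

set.seed(603)
ability <- rnorm(100, mean = 70, sd = 8)    #each student's underlying ability
midterm <- ability + rnorm(100, sd = 8)     #midterm = ability + exam-day noise
final   <- ability + rnorm(100, sd = 8)     #final = ability + independent noise, no tutoring effect
lowest10 <- order(midterm)[1:10]            #the 10 poorest midterm performers
mean(midterm[lowest10])                     #well below the class mean of about 70
mean(final[lowest10])                       #typically closer to 70 purely by regression toward the mean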