
C. A. T STAT 324

QUESTION ONE (2 MARKS)

Write down the R syntax that you used to import the voice pitch data from Excel/CSV format into R. Familiarize yourself with the sample R code I have shared via class email.

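# stringsAsFactors = TRUE gives the factor columns shown below on R >= 4.0 (it was the default in older versions)
voice_data <- read.csv("POLITENESS_DATA_JK(1).csv", stringsAsFactors = TRUE)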
head(voice_data)

##   subject gender scenario attitude voice_pitch SEX age edu polite green_space
## 1      F1      F        1      pol       213.3   0  20   1      1           2
## 2      F1      F        1      inf       204.5   0  34   1      0           3
## 3      F1      F        2      pol       285.1   0  23   1      1           5
## 4      F1      F        2      inf       259.7   0  30   2      0           6
## 5      F1      F        3      pol       203.9   0  45   3      1           9
## 6      F1      F        3      inf       286.9   0  22   3      0          10

Structure of the data

str(voice_data)

## 'data.frame':    84 obs. of  10 variables:
##  $ subject    : Factor w/ 6 levels "F1","F2","F3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ gender     : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ scenario   : int  1 1 2 2 3 3 4 4 5 5 ...
##  $ attitude   : Factor w/ 2 levels "inf","pol": 2 1 2 1 2 1 2 1 2 1 ...
##  $ voice_pitch: num  213 204 285 260 204 ...
##  $ SEX        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ age        : int  20 34 23 30 45 22 50 42 38 33 ...
##  $ edu        : int  1 1 1 2 3 3 1 1 1 1 ...
##  $ polite     : int  1 0 1 0 1 0 1 0 1 0 ...
##  $ green_space: int  2 3 5 6 9 10 2 4 5 6 ...

Overview

summary(voice_data)

##  subject gender    scenario attitude  voice_pitch         SEX     
##  F1:14   F:42   Min.   :1   inf:42   Min.   : 82.2   Min.   :0.0  
##  F2:14   M:42   1st Qu.:2   pol:42   1st Qu.:131.6   1st Qu.:0.0  
##  F3:14          Median :4            Median :203.9   Median :0.5  
##  M3:14          Mean   :4            Mean   :193.6   Mean   :0.5  
##  M4:14          3rd Qu.:6            3rd Qu.:248.6   3rd Qu.:1.0  
##  M7:14          Max.   :7            Max.   :306.8   Max.   :1.0  
##                                      NA's   :1                    
##       age             edu            polite       green_space    
##  Min.   :17.00   Min.   :1.000   Min.   :0.000   Min.   : 0.000  
##  1st Qu.:28.75   1st Qu.:1.000   1st Qu.:0.000   1st Qu.: 4.000  
##  Median :36.00   Median :2.000   Median :1.000   Median : 7.000  
##  Mean   :41.00   Mean   :1.929   Mean   :0.619   Mean   : 7.702  
##  3rd Qu.:50.00   3rd Qu.:3.000   3rd Qu.:1.000   3rd Qu.: 9.000  
##  Max.   :95.00   Max.   :3.000   Max.   :1.000   Max.   :24.000  
##

Data Preprocessing

Some variables are stored as numeric codes although they are categorical, so they need to be converted to factors: for example polite, SEX and edu (education level).

voice_data <- transform(voice_data,gender = as.factor(gender),edu = as.factor(edu),SEX = as.factor(SEX),polite = as.factor(polite))
str(voice_data)

## 'data.frame':    84 obs. of  10 variables:
##  $ subject    : Factor w/ 6 levels "F1","F2","F3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ gender     : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ scenario   : int  1 1 2 2 3 3 4 4 5 5 ...
##  $ attitude   : Factor w/ 2 levels "inf","pol": 2 1 2 1 2 1 2 1 2 1 ...
##  $ voice_pitch: num  213 204 285 260 204 ...
##  $ SEX        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ age        : int  20 34 23 30 45 22 50 42 38 33 ...
##  $ edu        : Factor w/ 3 levels "1","2","3": 1 1 1 2 3 3 1 1 1 1 ...
##  $ polite     : Factor w/ 2 levels "0","1": 2 1 2 1 2 1 2 1 2 1 ...
##  $ green_space: int  2 3 5 6 9 10 2 4 5 6 ...
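
To see why the conversion matters, here is a minimal sketch on made-up data (toy and its values are invented for illustration and are not the class data). A numeric edu enters lm() as a single slope, while a factor edu is expanded into one coefficient per non-reference level, which is where rows such as edu2 and edu3 in a model summary come from.

```r
# Toy data, invented for illustration only
set.seed(1)
toy <- data.frame(edu = rep(1:3, times = 10), y = rnorm(30))

# edu as a number: a single slope
fit_numeric <- lm(y ~ edu, data = toy)

# edu as a factor: separate coefficients for levels 2 and 3 (level 1 is the reference)
toy_factor <- transform(toy, edu = as.factor(edu))
fit_factor <- lm(y ~ edu, data = toy_factor)

print(names(coef(fit_numeric)))
print(names(coef(fit_factor)))
```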

QUESTION TWO (3 MARKS)

Fit a linear model 1 with pitch voice as the dependent variable and age as the covariate variable and interpret all the parameter values

lmodel1 <- lm(voice_pitch ~ age, data = voice_data)
summary(lmodel1)

## 
## Call:
## lm(formula = voice_pitch ~ age, data = voice_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -116.304  -63.324    7.593   54.383  109.388 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 181.5171    18.8330   9.638 4.37e-15 ***
## age           0.2942     0.4242   0.694     0.49    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 65.75 on 81 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.005904,   Adjusted R-squared:  -0.006369 
## F-statistic: 0.481 on 1 and 81 DF,  p-value: 0.4899

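Residuals: describe the distribution of the error terms from the fitted model, that is, the differences between the observed values and the fitted values. Here they range from -116.304 to 109.388 with a median of 7.593.
Intercept: the expected voice pitch at age 0 is 181.5171. It is significantly different from zero (p = 4.37e-15), although age 0 lies outside the observed age range (17 to 95).
Age: each additional year of age is associated with an increase of 0.2942 in voice pitch, but this effect is not statistically significant. The p-value (0.49) is greater than the significance level of 0.05.
The model accounts for about 0.6% of the variability in voice pitch (Multiple R-squared = 0.005904), and the adjusted R-squared is negative (-0.006369).
The p-value of the F-statistic (0.4899) is greater than the significance level of 0.05. Therefore the model as a whole is not significant.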
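
As a cross-check on the age coefficient, the following sketch rebuilds the t value and an approximate 95% confidence interval from the numbers printed in the summary above. The variable names are mine, and the inputs are the rounded values from the output, so the result is only as precise as that rounding.

```r
# Values copied from the summary(lmodel1) output (rounded as printed)
estimate  <- 0.2942   # age coefficient
std_error <- 0.4242   # its standard error
resid_df  <- 81       # residual degrees of freedom

t_value <- estimate / std_error
t_crit  <- qt(0.975, df = resid_df)
ci_age  <- estimate + c(-1, 1) * t_crit * std_error

print(round(t_value, 3))
print(round(ci_age, 3))
```

An interval that contains zero agrees with the non-significant p-value for age.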

QUESTION THREE (5 MARKS)

Fit a linear model 2 with pitch voice as the dependent variable and with age and sex as independent variables of interest. Compare statistical output in model 2 with model 1 above and appropriately make the necessary conclusions

lmodel2 <- lm(voice_pitch ~ age + SEX, data = voice_data)
summary(lmodel2)

## 
## Call:
## lm(formula = voice_pitch ~ age + SEX, data = voice_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -91.929 -24.609  -6.436  25.437  88.708 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  223.3794    10.6165  21.041   <2e-16 ***
## age            0.5987     0.2305   2.598   0.0112 *  
## SEX1        -110.0293     7.8439 -14.027   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.57 on 80 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.7127, Adjusted R-squared:  0.7055 
## F-statistic:  99.2 on 2 and 80 DF,  p-value: < 2.2e-16

Age and sex are both statistically significant in this model. Their p-values (0.0112 and less than 2e-16) are below 0.05.
A one-year increase in age is associated with an increase of 0.5987 in voice pitch when sex is held constant.
Being male (SEX = 1, the SEX1 coefficient) is associated with a voice pitch that is 110.0293 lower than for a female of the same age.
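
The coefficients can be read as a prediction equation. This is a small sketch using the rounded estimates printed above; predict_pitch is a hypothetical helper written for illustration, and age 30 is an arbitrary example value.

```r
# Rounded estimates copied from the summary(lmodel2) output
b0     <- 223.3794    # intercept: female (SEX = 0) at age 0
b_age  <- 0.5987      # change in pitch per year of age
b_sex1 <- -110.0293   # shift for males (SEX = 1)

# Hypothetical helper: fitted voice pitch for a given age and sex (male = 0 or 1)
predict_pitch <- function(age, male) b0 + b_age * age + b_sex1 * male

female_30 <- predict_pitch(30, 0)
male_30   <- predict_pitch(30, 1)
print(c(female = female_30, male = male_30, difference = male_30 - female_30))
```

At any fixed age the difference between the two predictions equals the SEX1 coefficient.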

Compare model 1 and model 2

anova(lmodel1,lmodel2,test = "Chisq")

## Analysis of Variance Table
## 
## Model 1: voice_pitch ~ age
## Model 2: voice_pitch ~ age + SEX
##   Res.Df    RSS Df Sum of Sq  Pr(>Chi)    
## 1     81 350158                           
## 2     80 101214  1    248944 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

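The p-value is less than the significance level of 0.05, so we reject the null hypothesis that the coefficient of SEX is zero. Model 2 fits significantly better than model 1: the Multiple R-squared rises from 0.005904 to 0.7127 and the residual standard error falls from 65.75 to 35.57.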
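
The comparison above uses test = "Chisq". For nested linear models the partial F-test is the more conventional choice, and it can be sketched directly from the residual sums of squares in the table. The variable names are mine and the inputs are the rounded values printed above.

```r
# Residual sums of squares and residual df from the anova table
rss1 <- 350158; df1 <- 81   # model 1: voice_pitch ~ age
rss2 <- 101214; df2 <- 80   # model 2: voice_pitch ~ age + SEX

# Partial F statistic for the one extra term (SEX)
f_stat  <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
p_value <- pf(f_stat, df1 - df2, df2, lower.tail = FALSE)

print(round(f_stat, 2))
print(p_value)
```

With a single added coefficient, the square root of this F statistic should match the absolute t value of SEX1 in the model 2 summary (14.027), up to rounding.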

QUESTION FOUR (5 MARKS)

Fit an extended linear model 3 with pitch voice as the dependent variable and with age, sex and education level as independent variables of interest. Compare the statistical output in model 3 with model 2 above and make the necessary conclusions.

lmodel3 <- lm(voice_pitch ~ age + SEX + edu, data = voice_data)
summary(lmodel3)

## 
## Call:
## lm(formula = voice_pitch ~ age + SEX + edu, data = voice_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -91.408 -25.762  -5.787  24.887  89.369 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  222.6934    11.6022  19.194   <2e-16 ***
## age            0.5998     0.2350   2.553   0.0126 *  
## SEX1        -110.0545     7.9555 -13.834   <2e-16 ***
## edu2           0.1214     9.3181   0.013   0.9896    
## edu3           2.2865    10.1946   0.224   0.8231    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36.01 on 78 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.7129, Adjusted R-squared:  0.6982 
## F-statistic: 48.42 on 4 and 78 DF,  p-value: < 2.2e-16

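Sex and age are statistically significant. Their p-values are less than 0.05. Education level is not statistically significant (p = 0.9896 for edu2 and p = 0.8231 for edu3, each compared with education level 1).
The model accounts for about 71% of the variability in voice pitch (Multiple R-squared = 0.7129), essentially the same as model 2 (0.7127), while the adjusted R-squared falls from 0.7055 to 0.6982.
The p-value of the F-statistic is less than 0.05. The model as a whole is significant.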

Compare model 2 and model 3

anova(lmodel2,lmodel3,test = "Chisq")

## Analysis of Variance Table
## 
## Model 1: voice_pitch ~ age + SEX
## Model 2: voice_pitch ~ age + SEX + edu
##   Res.Df    RSS Df Sum of Sq Pr(>Chi)
## 1     80 101214                      
## 2     78 101134  2    80.071   0.9696

The p-value (0.9696) is greater than the significance level of 0.05. We fail to reject the null hypothesis. There is no significant difference between model 2 and model 3, so adding education level does not improve the fit.
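
The Pr(>Chi) value in the table can be rebuilt by hand. The sketch below assumes that anova() scales the extra sum of squares by the residual variance of the larger model before referring it to a chi-squared distribution, which is my reading of how test = "Chisq" behaves for lm fits. The inputs are the rounded values printed above.

```r
# Values copied from the anova(lmodel2, lmodel3) table
sum_sq  <- 80.071    # extra sum of squares explained by edu
df_diff <- 2         # edu adds two coefficients
rss_big <- 101134    # residual sum of squares of model 3
df_big  <- 78        # residual df of model 3

scale      <- rss_big / df_big          # residual variance of the larger model
chisq_stat <- sum_sq / scale
p_chisq    <- pchisq(chisq_stat, df = df_diff, lower.tail = FALSE)

# The equivalent partial F-test for comparison
f_stat <- chisq_stat / df_diff
p_f    <- pf(f_stat, df_diff, df_big, lower.tail = FALSE)

print(round(c(p_chisq = p_chisq, p_f = p_f), 4))
```

Both versions of the test lead to the same conclusion about education level.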

QUESTION FIVE (5 MARKS)

Fit a logistic regression model 1 with politeness as a dependent variable of interest and age, sex and education level as independent variables. Give your interpretation of the statistical output. Also, write down an R syntax that solves the same problem by specifying the link function.

logistic1 <- glm(polite ~ age + SEX + edu, data = voice_data, family = binomial(link = "logit"))
summary(logistic1)

## 
## Call:
## glm(formula = polite ~ age + SEX + edu, family = binomial(link = "logit"), 
##     data = voice_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5698  -1.3320   0.9004   0.9959   1.1262  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.865114   0.668216   1.295    0.195
## age         -0.007455   0.013347  -0.559    0.576
## SEX1        -0.181466   0.454061  -0.400    0.689
## edu2        -0.062542   0.531433  -0.118    0.906
## edu3         0.163872   0.585292   0.280    0.779
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 111.64  on 83  degrees of freedom
## Residual deviance: 110.94  on 79  degrees of freedom
## AIC: 120.94
## 
## Number of Fisher Scoring iterations: 4

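None of the independent variables is statistically significant in this model. All their p-values are greater than 0.05. The coefficients are on the log-odds scale: for example, the SEX1 estimate of -0.181466 means that males have lower log-odds of a polite response than females, but the difference is not significant (p = 0.689).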
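
Log-odds are easier to interpret after exponentiating them into odds ratios. The following sketch uses the rounded estimates and standard errors printed in the summary above and builds approximate 95% Wald intervals; the vector names are mine.

```r
# Estimates and standard errors copied from summary(logistic1)
est <- c(age = -0.007455, SEX1 = -0.181466, edu2 = -0.062542, edu3 = 0.163872)
se  <- c(age =  0.013347, SEX1 =  0.454061, edu2 =  0.531433, edu3 = 0.585292)

z <- qnorm(0.975)
odds_ratio <- exp(est)
lower <- exp(est - z * se)   # lower limit of the 95% Wald interval
upper <- exp(est + z * se)   # upper limit of the 95% Wald interval

or_table <- round(cbind(odds_ratio, lower, upper), 3)
print(or_table)
```

An odds ratio of 1 means no effect, so an interval that contains 1 agrees with a p-value above 0.05.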

anova(logistic1, test = "Chisq")

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: polite
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL                    83     111.64         
## age   1  0.38680        82     111.25   0.5340
## SEX   1  0.15534        81     111.10   0.6935
## edu   2  0.16279        79     110.94   0.9218

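Adding the covariates reduces the residual deviance only slightly, from 111.64 for the null model to 110.94, and none of the sequential tests is significant. The covariates do not significantly improve the model.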
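
The sequential tests can be summed up in one overall likelihood ratio test of the fitted model against the null model. This is a sketch from the rounded deviances printed in the summary of the logistic model; the variable names are mine.

```r
# Deviances and degrees of freedom copied from summary(logistic1)
null_dev  <- 111.64; null_df  <- 83
resid_dev <- 110.94; resid_df <- 79

lr_stat   <- null_dev - resid_dev        # drop in deviance from all covariates
lr_df     <- null_df - resid_df          # number of coefficients added
p_overall <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

print(c(lr_stat = lr_stat, df = lr_df, p = round(p_overall, 3)))
```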

QUESTION SIX (5 MARKS)

Fit a probit model 2 using the variables of interest as specified in model 1 above by specifying the link function. Briefly compare the statistical output in question (5) above and question (6) and make the necessary conclusions. In your own understanding, justify if any differences exist.

probit_1 <- glm(polite ~ age + SEX + edu, data = voice_data, family = binomial(link = "probit"))

Results

summary(probit_1)

## 
## Call:
## glm(formula = polite ~ age + SEX + edu, family = binomial(link = "probit"), 
##     data = voice_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5690  -1.3314   0.9013   0.9960   1.1244  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.535715   0.412301   1.299    0.194
## age         -0.004539   0.008285  -0.548    0.584
## SEX1        -0.112754   0.280561  -0.402    0.688
## edu2        -0.039881   0.329308  -0.121    0.904
## edu3         0.097995   0.359844   0.272    0.785
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 111.64  on 83  degrees of freedom
## Residual deviance: 110.94  on 79  degrees of freedom
## AIC: 120.94
## 
## Number of Fisher Scoring iterations: 4

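The covariates are not statistically significant in the probit model either. All their p-values are greater than 0.05. Compared with the logistic model in question 5, the signs of the coefficients, the z values and the p-values are almost identical, and both models have the same residual deviance (110.94) and AIC (120.94). The estimates differ in size because the two link functions work on different scales: the logit estimates are roughly 1.6 times the probit estimates (for example 0.865114 against 0.535715 for the intercept). The two models therefore lead to the same conclusions.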
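
The scale difference between the two links can be checked by dividing the logit estimates by the probit estimates. The values are copied from the two summaries above; the vector names are mine.

```r
# Estimates copied from summary(logistic1) and summary(probit_1)
logit_est  <- c(intercept = 0.865114, age = -0.007455, SEX1 = -0.181466,
                edu2 = -0.062542, edu3 = 0.163872)
probit_est <- c(intercept = 0.535715, age = -0.004539, SEX1 = -0.112754,
                edu2 = -0.039881, edu3 = 0.097995)

# Ratio of logit to probit coefficients, term by term
coef_ratio <- logit_est / probit_est
print(round(coef_ratio, 2))
```

Ratios that are all positive and close to each other show that the two fits differ mainly by a scale factor, not in direction.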

QUESTION SEVEN (5 MARKS)

Also, fit a Poisson regression model 1 with the frequency of visits to the green spaces for the last one month as the dependent variable and the variables age, sex and education level as explanatory variables respectively. Give interpretations of the coefficients in the statistical output model and make the necessary recommendations.

The response variable here is a count (the number of visits), so it can take more than two values. We therefore specify poisson() as the family. By default, it has link = "log".

poisson_1 <- glm(scenario ~ age + SEX + edu, data = voice_data, family = poisson(link = "log"))
summary(poisson_1)

## 
## Call:
## glm(formula = scenario ~ age + SEX + edu, family = poisson(link = "log"), 
##     data = voice_data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.97565  -0.91271   0.00206   0.74471   1.74251  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.111302   0.162848   6.824 8.84e-12 ***
## age          0.003787   0.003142   1.206   0.2280    
## SEX1        -0.026140   0.109942  -0.238   0.8121    
## edu2         0.152403   0.133339   1.143   0.2531    
## edu3         0.246519   0.140452   1.755   0.0792 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 91.924  on 83  degrees of freedom
## Residual deviance: 87.065  on 79  degrees of freedom
## AIC: 358.87
## 
## Number of Fisher Scoring iterations: 5
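
Note that the call above models scenario, whereas the question asks for the frequency of visits to green spaces, which appears to be the green_space column. The following is a minimal sketch of the model the question describes, run on simulated data (sim_data and all its values are invented for illustration and are not the class data), so the printed numbers are not results for the real data set. Exponentiating Poisson coefficients gives rate ratios, the usual way to interpret them.

```r
# Simulated stand-in for voice_data, invented for illustration only
set.seed(324)
n <- 84
sim_data <- data.frame(
  age = sample(17:95, n, replace = TRUE),
  SEX = factor(rep(c(0, 1), each = n / 2)),
  edu = factor(rep(1:3, length.out = n))
)
sim_data$green_space <- rpois(n, lambda = 7)   # simulated visit counts

# Poisson regression with the visit count as the response
poisson_gs <- glm(green_space ~ age + SEX + edu, data = sim_data,
                  family = poisson(link = "log"))

# exp(coefficient) = multiplicative change in the expected number of visits
rate_ratios <- exp(coef(poisson_gs))
print(round(rate_ratios, 3))
```

On the real data, the same call with data = voice_data would give the coefficients that the question asks to be interpreted.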

It is not perfect. Leave a comment or any correction. Anything..
