Write down the R syntax that you used to import the voice pitch data from Excel/CSV format into R. Familiarize yourself with the sample R code I have shared via class email.
voice_data <- read.csv("POLITENESS_DATA_JK(1).csv")
head(voice_data)
## subject gender scenario attitude voice_pitch SEX age edu polite green_space
## 1 F1 F 1 pol 213.3 0 20 1 1 2
## 2 F1 F 1 inf 204.5 0 34 1 0 3
## 3 F1 F 2 pol 285.1 0 23 1 1 5
## 4 F1 F 2 inf 259.7 0 30 2 0 6
## 5 F1 F 3 pol 203.9 0 45 3 1 9
## 6 F1 F 3 inf 286.9 0 22 3 0 10
str(voice_data)
## 'data.frame': 84 obs. of 10 variables:
## $ subject : Factor w/ 6 levels "F1","F2","F3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ gender : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
## $ scenario : int 1 1 2 2 3 3 4 4 5 5 ...
## $ attitude : Factor w/ 2 levels "inf","pol": 2 1 2 1 2 1 2 1 2 1 ...
## $ voice_pitch: num 213 204 285 260 204 ...
## $ SEX : int 0 0 0 0 0 0 0 0 0 0 ...
## $ age : int 20 34 23 30 45 22 50 42 38 33 ...
## $ edu : int 1 1 1 2 3 3 1 1 1 1 ...
## $ polite : int 1 0 1 0 1 0 1 0 1 0 ...
## $ green_space: int 2 3 5 6 9 10 2 4 5 6 ...
summary(voice_data)
## subject gender scenario attitude voice_pitch SEX
## F1:14 F:42 Min. :1 inf:42 Min. : 82.2 Min. :0.0
## F2:14 M:42 1st Qu.:2 pol:42 1st Qu.:131.6 1st Qu.:0.0
## F3:14 Median :4 Median :203.9 Median :0.5
## M3:14 Mean :4 Mean :193.6 Mean :0.5
## M4:14 3rd Qu.:6 3rd Qu.:248.6 3rd Qu.:1.0
## M7:14 Max. :7 Max. :306.8 Max. :1.0
## NA's :1
## age edu polite green_space
## Min. :17.00 Min. :1.000 Min. :0.000 Min. : 0.000
## 1st Qu.:28.75 1st Qu.:1.000 1st Qu.:0.000 1st Qu.: 4.000
## Median :36.00 Median :2.000 Median :1.000 Median : 7.000
## Mean :41.00 Mean :1.929 Mean :0.619 Mean : 7.702
## 3rd Qu.:50.00 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.: 9.000
## Max. :95.00 Max. :3.000 Max. :1.000 Max. :24.000
##
Some variables in the data are stored as numeric codes although they are categorical, so they need to be converted to factors: for example polite, SEX and edu (education level).
voice_data <- transform(voice_data, gender = as.factor(gender), edu = as.factor(edu), SEX = as.factor(SEX), polite = as.factor(polite))
str(voice_data)
## 'data.frame': 84 obs. of 10 variables:
## $ subject : Factor w/ 6 levels "F1","F2","F3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ gender : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
## $ scenario : int 1 1 2 2 3 3 4 4 5 5 ...
## $ attitude : Factor w/ 2 levels "inf","pol": 2 1 2 1 2 1 2 1 2 1 ...
## $ voice_pitch: num 213 204 285 260 204 ...
## $ SEX : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ age : int 20 34 23 30 45 22 50 42 38 33 ...
## $ edu : Factor w/ 3 levels "1","2","3": 1 1 1 2 3 3 1 1 1 1 ...
## $ polite : Factor w/ 2 levels "0","1": 2 1 2 1 2 1 2 1 2 1 ...
## $ green_space: int 2 3 5 6 9 10 2 4 5 6 ...
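The conversion above keeps the raw codes ("0", "1", ...) as level names. As a minimal sketch, the codes could also be given readable labels at conversion time. The data frame `toy` below is made up and is not the class data. The coding 0 = female, 1 = male is an assumption based on the `gender` and `SEX` columns in the `head()` output, 1 = polite is assumed from the `attitude` column, and the education labels are purely hypothetical placeholders.

```r
# Toy data standing in for the real file (values are made up).
toy <- data.frame(
  SEX    = c(0, 0, 1, 1),
  edu    = c(1, 2, 3, 1),
  polite = c(1, 0, 1, 0)
)

# factor() with explicit levels and labels keeps the coding visible
# in every later summary and model output.
toy <- transform(
  toy,
  SEX    = factor(SEX, levels = c(0, 1), labels = c("female", "male")),
  edu    = factor(edu, levels = 1:3, labels = c("level1", "level2", "level3")),
  polite = factor(polite, levels = c(0, 1), labels = c("informal", "polite"))
)

str(toy)
```

With labelled levels, a coefficient would print as `SEXmale` rather than `SEX1`, which makes the reference category easier to read off.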
Fit linear model 1 with voice pitch as the dependent variable and age as the covariate, and interpret all the parameter values.
lmodel1 <- lm(voice_pitch ~ age, data = voice_data)
summary(lmodel1)
##
## Call:
## lm(formula = voice_pitch ~ age, data = voice_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -116.304 -63.324 7.593 54.383 109.388
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 181.5171 18.8330 9.638 4.37e-15 ***
## age 0.2942 0.4242 0.694 0.49
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 65.75 on 81 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.005904, Adjusted R-squared: -0.006369
## F-statistic: 0.481 on 1 and 81 DF, p-value: 0.4899
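To help read the coefficient table, the short sketch below recomputes the t value and two-sided p-value for `age` from the printed estimate and standard error, and evaluates the fitted line at one age. All numbers are copied from the summary above. The age of 30 is an arbitrary illustration, not a value singled out in the assignment.

```r
# Values copied from the printed summary of lmodel1.
intercept <- 181.5171  # expected pitch when age = 0 (an extrapolation)
estimate  <- 0.2942    # change in expected pitch per additional year of age
std_error <- 0.4242
resid_df  <- 81

# t value = estimate / standard error; two-sided p-value from the t distribution.
t_value <- estimate / std_error
p_value <- 2 * pt(abs(t_value), df = resid_df, lower.tail = FALSE)

# Fitted mean pitch at an illustrative age of 30.
pred_30 <- intercept + estimate * 30

round(c(t = t_value, p = p_value, pred_30 = pred_30), 3)
```

The slope is small relative to its standard error, which is why the p-value is far above 0.05 and the R-squared is close to zero: age alone explains almost none of the variation in pitch.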
Fit linear model 2 with voice pitch as the dependent variable and with age and sex as the independent variables of interest. Compare the statistical output of model 2 with that of model 1 above and draw the appropriate conclusions.
lmodel2 <- lm(voice_pitch ~ age + SEX, data = voice_data)
summary(lmodel2)
##
## Call:
## lm(formula = voice_pitch ~ age + SEX, data = voice_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.929 -24.609 -6.436 25.437 88.708
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 223.3794 10.6165 21.041 <2e-16 ***
## age 0.5987 0.2305 2.598 0.0112 *
## SEX1 -110.0293 7.8439 -14.027 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.57 on 80 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.7127, Adjusted R-squared: 0.7055
## F-statistic: 99.2 on 2 and 80 DF, p-value: < 2.2e-16
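One way to interpret the model 2 coefficients is to plug them into the fitted equation. The sketch below does this for both levels of `SEX` at a single age, using the estimates printed above. The helper `predict_pitch` is a hypothetical name introduced here for illustration, and the age of 30 is arbitrary.

```r
# Coefficients copied from the printed summary of lmodel2.
b0    <- 223.3794   # intercept: SEX = 0 at age 0
b_age <- 0.5987     # change in pitch per year of age, holding SEX fixed
b_sex <- -110.0293  # shift in pitch for SEX = 1 relative to SEX = 0

# Fitted mean pitch for a given age and SEX code (0 or 1).
predict_pitch <- function(age, sex) b0 + b_age * age + b_sex * sex

pred <- c(SEX0 = predict_pitch(30, 0), SEX1 = predict_pitch(30, 1))
round(pred, 2)
```

At any fixed age the two fitted values differ by exactly the `SEX1` coefficient, about 110 pitch units, which is the main reason the R-squared rises from about 0.006 in model 1 to about 0.71 in model 2.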
Compare model 1 and model 2
anova(lmodel1,lmodel2,test = "Chisq")
## Analysis of Variance Table
##
## Model 1: voice_pitch ~ age
## Model 2: voice_pitch ~ age + SEX
## Res.Df RSS Df Sum of Sq Pr(>Chi)
## 1 81 350158
## 2 80 101214 1 248944 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
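The p-value (< 2.2e-16) is below the 0.05 significance level, so we reject the null hypothesis that the coefficient of SEX is zero. Model 2 fits significantly better than model 1: adding SEX reduces the residual sum of squares from 350158 to 101214.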
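The comparison of model 1 and model 2 can also be expressed as an F test built directly from the two residual sums of squares. The sketch below uses only the RSS and residual degrees of freedom printed in the analysis of variance table. It is a hand calculation for illustration, not a replacement for `anova()`.

```r
# Values copied from the analysis of variance table for lmodel1 vs lmodel2.
rss1 <- 350158; df1 <- 81  # model 1: voice_pitch ~ age
rss2 <- 101214; df2 <- 80  # model 2: voice_pitch ~ age + SEX

# F = (drop in RSS per added parameter) / (residual variance of the larger model).
f_stat <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
p_f    <- pf(f_stat, df1 - df2, df2, lower.tail = FALSE)

# With one added parameter, sqrt(F) equals the absolute t value of that term.
c(F = f_stat, sqrt_F = sqrt(f_stat), p = p_f)
```

The square root of the F statistic agrees, up to rounding, with the absolute t value reported for `SEX1` in the model 2 summary, so the nested-model test and the coefficient test tell the same story here.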
Fit an extended linear model 3 with pitch voice as the dependent variable and with age, sex and education level as independent variables of interest. Compare the statistical output in model 3 with model 2 above and make the necessary conclusions.
lmodel3 <- lm(voice_pitch ~ age + SEX + edu, data = voice_data)
summary(lmodel3)
##
## Call:
## lm(formula = voice_pitch ~ age + SEX + edu, data = voice_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.408 -25.762 -5.787 24.887 89.369
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 222.6934 11.6022 19.194 <2e-16 ***
## age 0.5998 0.2350 2.553 0.0126 *
## SEX1 -110.0545 7.9555 -13.834 <2e-16 ***
## edu2 0.1214 9.3181 0.013 0.9896
## edu3 2.2865 10.1946 0.224 0.8231
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.01 on 78 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.7129, Adjusted R-squared: 0.6982
## F-statistic: 48.42 on 4 and 78 DF, p-value: < 2.2e-16
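The adjusted R-squared values printed for the three linear models can be reproduced from the multiple R-squared values, which shows why the adjusted figure falls when education is added even though the unadjusted one rises slightly. This is a small sketch using the printed values. The helper `adj_r2` is a hypothetical name, and n = 83 assumes the 84 rows minus the one observation dropped for missingness.

```r
# Adjusted R-squared: penalises R-squared for the number of predictors p.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 83  # 84 observations minus 1 deleted due to missingness

# Multiple R-squared and number of estimated slopes, copied from the summaries.
models <- data.frame(
  model = c("lmodel1", "lmodel2", "lmodel3"),
  r2    = c(0.005904, 0.7127, 0.7129),
  p     = c(1, 2, 4)
)
models$adj_r2 <- adj_r2(models$r2, n, models$p)

models
```

Going from model 2 to model 3 costs two extra parameters for a gain of only 0.0002 in R-squared, so the penalty outweighs the gain and the adjusted value decreases.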
Compare model 2 and model 3
anova(lmodel2,lmodel3,test = "Chisq")
## Analysis of Variance Table
##
## Model 1: voice_pitch ~ age + SEX
## Model 2: voice_pitch ~ age + SEX + edu
## Res.Df RSS Df Sum of Sq Pr(>Chi)
## 1 80 101214
## 2 78 101134 2 80.071 0.9696
The p-value (0.9696) is greater than the 0.05 significance level, so we fail to reject the null hypothesis. Adding education level does not significantly improve the fit: model 3 is not significantly different from model 2.
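The p-value for the model 2 versus model 3 comparison can be recovered from the printed table. The sketch below assumes, as I understand the behaviour of `anova()` on `lm` objects with `test = "Chisq"`, that the sum of squares is scaled by the residual variance of the larger model and referred to a chi-squared distribution. The F version is shown alongside for comparison. All inputs are copied from the table above.

```r
# Values copied from the analysis of variance table for lmodel2 vs lmodel3.
rss3    <- 101134; df3 <- 78  # larger model: voice_pitch ~ age + SEX + edu
sum_sq  <- 80.071             # drop in RSS from adding edu
df_diff <- 2                  # edu has 3 levels, so 2 extra parameters

# Residual variance estimate from the larger model.
scale_est <- rss3 / df3

# Chi-squared version (assumed to mirror test = "Chisq").
chisq_stat <- sum_sq / scale_est
p_chisq    <- pchisq(chisq_stat, df = df_diff, lower.tail = FALSE)

# F version of the same comparison.
f_stat <- (sum_sq / df_diff) / scale_est
p_f    <- pf(f_stat, df_diff, df3, lower.tail = FALSE)

round(c(chisq = chisq_stat, p_chisq = p_chisq, F = f_stat, p_F = p_f), 4)
```

Both versions lead to the same conclusion: the reduction in RSS from adding `edu` is negligible compared with the residual variance.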
Fit a logistic regression model 1 with politeness as a dependent variable of interest and age, sex and education level as independent variables. Give your interpretation of the statistical output. Also, write down an R syntax that solves the same problem by specifying the link function.
logistic1 <- glm(polite ~ age + SEX + edu, data = voice_data, family = binomial(link = "logit"))
summary(logistic1)
##
## Call:
## glm(formula = polite ~ age + SEX + edu, family = binomial(link = "logit"),
## data = voice_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5698 -1.3320 0.9004 0.9959 1.1262
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.865114 0.668216 1.295 0.195
## age -0.007455 0.013347 -0.559 0.576
## SEX1 -0.181466 0.454061 -0.400 0.689
## edu2 -0.062542 0.531433 -0.118 0.906
## edu3 0.163872 0.585292 0.280 0.779
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 111.64 on 83 degrees of freedom
## Residual deviance: 110.94 on 79 degrees of freedom
## AIC: 120.94
##
## Number of Fisher Scoring iterations: 4
anova(logistic1, test = "Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: polite
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 83 111.64
## age 1 0.38680 82 111.25 0.5340
## SEX 1 0.15534 81 111.10 0.6935
## edu 2 0.16279 79 110.94 0.9218
Adding the covariates reduces the residual deviance only slightly, from 111.64 for the null model to 110.94. None of age, SEX or edu is a significant predictor of politeness (all p-values exceed 0.5).
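Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to interpret. The sketch below does this with the estimates printed in the `logistic1` summary and also runs an overall likelihood-ratio test from the null and residual deviances. It is a hand calculation on the printed values, offered as an illustration.

```r
# Estimates copied from the printed summary of logistic1 (log-odds scale).
logit_coef <- c(age = -0.007455, SEX1 = -0.181466,
                edu2 = -0.062542, edu3 = 0.163872)

# Odds ratios: multiplicative change in the odds of polite = 1.
odds_ratios <- exp(logit_coef)

# Overall likelihood-ratio test: null deviance minus residual deviance,
# on 83 - 79 = 4 degrees of freedom.
lr_stat <- 111.64 - 110.94
lr_p    <- pchisq(lr_stat, df = 83 - 79, lower.tail = FALSE)

round(odds_ratios, 3)
round(c(LR = lr_stat, p = lr_p), 3)
```

All odds ratios are close to 1 and the overall test is far from significant, consistent with the sequential analysis of deviance table. Since `binomial()` uses the logit link by default, `family = binomial` would fit the same model as `family = binomial(link = "logit")`.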
Fit a probit model 2 using the variables of interest as specified in model 1 above by specifying the link function. Briefly compare the statistical output in question (5) above and question (6) and make the necessary conclusions. In your own understanding, justify if any differences exist.
probit_1 <- glm(polite ~ age + SEX + edu, data = voice_data, family = binomial(link = "probit"))
Results
summary(probit_1)
##
## Call:
## glm(formula = polite ~ age + SEX + edu, family = binomial(link = "probit"),
## data = voice_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5690 -1.3314 0.9013 0.9960 1.1244
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.535715 0.412301 1.299 0.194
## age -0.004539 0.008285 -0.548 0.584
## SEX1 -0.112754 0.280561 -0.402 0.688
## edu2 -0.039881 0.329308 -0.121 0.904
## edu3 0.097995 0.359844 0.272 0.785
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 111.64 on 83 degrees of freedom
## Residual deviance: 110.94 on 79 degrees of freedom
## AIC: 120.94
##
## Number of Fisher Scoring iterations: 4
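To compare the logit and probit fits, the sketch below sets the two printed coefficient vectors side by side and takes their ratio. The numbers are copied from the `logistic1` and `probit_1` summaries. The interpretation in the note that follows is my reading of the output rather than part of the original answer.

```r
# Estimates copied from the two printed summaries.
logit_coef  <- c(Intercept = 0.865114, age = -0.007455, SEX1 = -0.181466,
                 edu2 = -0.062542, edu3 = 0.163872)
probit_coef <- c(Intercept = 0.535715, age = -0.004539, SEX1 = -0.112754,
                 edu2 = -0.039881, edu3 = 0.097995)

# Ratio of logit to probit estimates, term by term.
ratio <- logit_coef / probit_coef

round(cbind(logit = logit_coef, probit = probit_coef, ratio = ratio), 3)
```

The coefficients differ in size by a roughly constant factor of about 1.6 because the two links put the linear predictor on different scales (the logistic distribution has a larger variance than the standard normal). The signs, z values, residual deviance (110.94) and AIC (120.94) are essentially the same in both outputs, so the two models lead to the same conclusions for these data.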
Also, fit a Poisson regression model 1 with the frequency of visits to the green spaces for the last one month as the dependent variable and the variables age, sex and education level as explanatory variables respectively. Give interpretations of the coefficients in the statistical output model and make the necessary recommendations.
The response variable here is a count, so it can take more than two values. We therefore specify poisson() as the family; its default link is "log".
poisson_1 <- glm(scenario ~ age + SEX + edu, data = voice_data, family = poisson(link = "log"))
summary(poisson_1)
##
## Call:
## glm(formula = scenario ~ age + SEX + edu, family = poisson(link = "log"),
## data = voice_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.97565 -0.91271 0.00206 0.74471 1.74251
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.111302 0.162848 6.824 8.84e-12 ***
## age 0.003787 0.003142 1.206 0.2280
## SEX1 -0.026140 0.109942 -0.238 0.8121
## edu2 0.152403 0.133339 1.143 0.2531
## edu3 0.246519 0.140452 1.755 0.0792 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 91.924 on 83 degrees of freedom
## Residual deviance: 87.065 on 79 degrees of freedom
## AIC: 358.87
##
## Number of Fisher Scoring iterations: 5
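Poisson coefficients are on the log scale, so exponentiating them gives rate ratios. The first part of the sketch below does this for the estimates printed above. Note that the call above uses `scenario` as the response, whereas the question asks about the frequency of green space visits, which presumably corresponds to the `green_space` column. The second part therefore shows the same model form with `green_space` as the response, fitted to simulated data only (the data frame `sim` is made up, with an arbitrary Poisson mean of 7), so its estimates say nothing about the class data.

```r
# Part 1: rate ratios from the printed poisson_1 estimates (log scale).
pois_coef   <- c(age = 0.003787, SEX1 = -0.026140,
                 edu2 = 0.152403, edu3 = 0.246519)
rate_ratios <- exp(pois_coef)
round(rate_ratios, 3)

# Part 2: same model form with green_space as the response.
# SIMULATED data, not the class data; values are arbitrary.
set.seed(1)
sim <- data.frame(
  age = sample(17:95, 84, replace = TRUE),
  SEX = factor(rep(0:1, each = 42)),
  edu = factor(sample(1:3, 84, replace = TRUE))
)
sim$green_space <- rpois(84, lambda = 7)

poisson_gs <- glm(green_space ~ age + SEX + edu,
                  data = sim, family = poisson(link = "log"))

# Rate ratios for the simulated fit (illustrative only).
round(exp(coef(poisson_gs)), 3)
```

A rate ratio above 1 means a higher expected count relative to the reference level, and one below 1 means a lower expected count. Running the second call on `voice_data` instead of `sim` would give the model the question appears to ask for.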