Name your submission using the scheme:
LastName_FirstName.pdf etc.
For example: Zhao_Linda .rmd, .pdf, .html or .docx.
Instruction: This exam requires you to use R. It is completely open book/notes/internet. Write your answers using .rmd format and knitr it into one of the html/pdf/docx format. Show your codes, plots or R-output when needed. You can use echo = TRUE to show your codes. If you have trouble formatting the plots, don’t worry about it. We are not looking for pretty solutions, rather to see if you are able to make sense out of data using R. Make sure the compiled pdf/html/docx shows your answers completely and that they are not cut-off. Throughout the exam, you do not need to use any LaTeX or mathematical equations.
All the answers should be clearly supported by relevant R code.
Data for Midterm: The data for midterm can be found at:
/canvas/Files/Midterm/AFR_2012.csv,
/canvas/Files/Midterm/train_fram.csv, and
/canvas/Files/Midterm/test_fram.csv.
Midterm Question File can be found at:
/canvas/Files/Midterm/Miderm11_05_2019.Rmd.
Help: As always skip any part you have trouble with and you may come back to finish it if you have time. Ask one of us for help if you are stuck somewhere for technical issues.
Electronic Submission: In the Assignments section, go to the Midterm assignment and upload your completed files: your .rmd file and a compiled file (either a pdf/html/docx).
You can upload multiple files. The folder will be closed at 08:10PM.
If you have trouble to upload your files, email them to lzhao@wharton.upenn.edu and arunku@wharton.upenn.edu.
Whenever we ask for test at some level, assume all the model assumptions are satisfied.
The adolescent fertility rate (AFR) is defined as the number of births per 1,000 women of age 15 to 19. While world’s AFR has been decreasing steadily over the years, some countries still have high AFR. Having children this early in life exposes adolescent women to unnecessary risks. Their chance of dying is twice as high as that of women who wait until their 20s to begin childbearing. In addition, early childbearing greatly reduces the likelihood of a girl advancing her education and limits her opportunities for training and employment.
Based on a data set from the Data Bank of the World Bank (https://databank.worldbank.org/data/home.aspx), AFR together with other information of 2012 is available. Our goal is to identify important factors associated with AFR. Hope we could give some recommendations to lower the AFR for policymakers.
The data set is AFR_2012.csv.
| Variable | Description |
|---|---|
| mortality.rate | Mortality rate, under-5 (per 1,000 live births) |
| Country | Country name |
| AFR | Adolescent fertility rate (births per 1,000 women ages 15-19) |
| agri.forestry.fish.gdp.pct | Agriculture, forestry, and fishing, value added (% of GDP) |
| industry.gdp.pct | Industry (including construction), value added (% of GDP) |
| CO2 | CO2 emissions (metric tons per capita) |
| fertility.rate | Fertility rate, total (births per woman) |
| GDP | GDP (current USD) |
| GDP.per.capita | GDP per capita (current US$) |
| gdp.grwoth.rate | GDP growth (annual %) |
| gni | GNI, PPP (current international dollar) |
| inflation | Inflation, GDP deflator (annual %) |
| LE | Life expectancy at birth, total (years) |
| population.growth | Population growth (annual %) |
| population | Population, total |
| unemployment | Unemployment, total (% of total labor force)) |
| Continent | Continent |
| Urban.pop | Percentage of urban population |
| Household.consump | Household consumption expenditure in million |
| Forest.area | Percentage of forest |
| Water | Access to improved water source in percentage |
| Food.prod.index | Food production index |
| Arable.land | Arable land per capita |
| Health.expend | Health expenditure percentage of GDP |
| Immunization | DPT Immunization percentage of children |
| Sanitation.faci | Access to improved sanitation facilities in percentage |
| Immunization.measles | Measles Immunization percentage of children |
| Health.exp.pocket | Percentage of out of pocket health expenditure to total health |
| Fixed.tel | Fixed telephone subscriptions per 100 people |
| Mobile.cel | Mobile cellular subscriptions per 100 people |
| Internet.users | Internet users per 100 people |
Load AFR_2012.csv. Notice AFR is Adolescent Fertility Rate.
# you need to put the dataset in the same folder
# where this .rmd file sits.
data1 <- read.csv("AFR_2012.csv")
data1$X <- NULLUse data1 from now.
## Country mortality.rate AFR
## Algeria : 1 Min. : 2.500 Min. : 3.503
## Argentina : 1 1st Qu.: 5.625 1st Qu.: 13.194
## Armenia : 1 Median : 16.900 Median : 36.206
## Austria : 1 Mean : 31.968 Mean : 49.878
## Azerbaijan: 1 3rd Qu.: 45.400 3rd Qu.: 76.373
## Bangladesh: 1 Max. :142.900 Max. :202.109
## (Other) :108
## agri.forestry.fish.gdp.pct industry.gdp.pct CO2
## Min. : 0.3593 Min. : 3.326 Min. : 0.0303
## 1st Qu.: 3.0792 1st Qu.:20.891 1st Qu.: 0.8510
## Median : 7.0568 Median :25.514 Median : 3.1088
## Mean :11.1746 Mean :27.454 Mean : 4.2928
## 3rd Qu.:16.5350 3rd Qu.:32.127 3rd Qu.: 6.5080
## Max. :54.8998 Max. :74.768 Max. :20.0842
##
## fertility.rate GDP GDP.per.capita GDP.grwoth.rate
## Min. :1.269 Min. :1.564e+09 Min. : 250.4 Min. :-7.4446
## 1st Qu.:1.666 1st Qu.:1.445e+10 1st Qu.: 1653.0 1st Qu.: 0.4994
## Median :2.236 Median :5.523e+10 Median : 6155.6 Median : 3.5747
## Mean :2.728 Mean :5.596e+11 Mean : 14394.5 Mean : 3.2095
## 3rd Qu.:3.214 3rd Qu.:3.057e+11 3rd Qu.: 15433.9 3rd Qu.: 5.4452
## Max. :7.420 Max. :1.616e+13 Max. :106749.0 Max. :12.3198
##
## GNI inflation population.growth
## Min. :2.503e+09 Min. :-1.170 Min. :-1.3412
## 1st Qu.:2.964e+10 1st Qu.: 1.743 1st Qu.: 0.3716
## Median :1.123e+11 Median : 3.598 Median : 1.2666
## Mean :1.383e+12 Mean : 5.679 Mean : 1.3457
## 3rd Qu.:4.695e+11 3rd Qu.: 6.620 3rd Qu.: 2.1936
## Max. :7.215e+13 Max. :75.277 Max. : 6.9053
##
## population unemployment Continent Urban.pop
## Min. :3.207e+05 Min. : 0.160 Africa :31 Min. :11.19
## 1st Qu.:4.942e+06 1st Qu.: 3.862 Asia :28 1st Qu.:42.70
## Median :1.084e+07 Median : 6.825 Europe :35 Median :62.08
## Mean :5.194e+07 Mean : 8.171 North America:11 Mean :58.66
## 3rd Qu.:3.533e+07 3rd Qu.:10.992 South America: 9 3rd Qu.:74.50
## Max. :1.351e+09 Max. :24.790 Max. :97.73
##
## Household.consump Forest.area Arable.land Water
## Min. : 6.520 Min. : 0.01 Min. :0.0100 Min. : 48.70
## 1st Qu.: 9.228 1st Qu.:11.23 1st Qu.:0.1025 1st Qu.: 84.83
## Median :10.394 Median :30.90 Median :0.2000 Median : 95.45
## Mean :10.718 Mean :30.07 Mean :0.2540 Mean : 89.08
## 3rd Qu.:12.110 3rd Qu.:44.09 3rd Qu.:0.2975 3rd Qu.: 99.50
## Max. :16.219 Max. :84.10 Max. :1.7500 Max. :100.00
##
## Food.prod.index Health.expend Immunization Sanitation.faci
## Min. : 72.55 Min. : 0.610 Min. :45.00 Min. : 10.10
## 1st Qu.:100.53 1st Qu.: 2.243 1st Qu.:87.25 1st Qu.: 57.75
## Median :115.97 Median : 3.910 Median :95.00 Median : 86.25
## Mean :116.12 Mean : 4.193 Mean :90.61 Mean : 73.40
## 3rd Qu.:128.61 3rd Qu.: 5.625 3rd Qu.:97.00 3rd Qu.: 97.42
## Max. :173.83 Max. :10.090 Max. :99.00 Max. :100.00
##
## Immunization.measles Health.exp.pocket Fixed.tel Mobile.cel
## Min. :49.00 Min. : 5.30 Min. : 0.000 Min. : 14.92
## 1st Qu.:87.00 1st Qu.:18.69 1st Qu.: 4.175 1st Qu.: 81.97
## Median :93.00 Median :31.93 Median :15.480 Median :108.20
## Mean :89.96 Mean :33.04 Mean :19.137 Mean :104.48
## 3rd Qu.:97.00 3rd Qu.:46.03 3rd Qu.:29.415 3rd Qu.:127.18
## Max. :99.00 Max. :74.46 Max. :62.110 Max. :187.36
##
## Internet.users
## Min. : 1.22
## 1st Qu.:13.33
## Median :39.62
## Mean :40.71
## 3rd Qu.:61.81
## Max. :96.21
##
i) How many countries are there in this data?
## [1] 114
#Answer: There are 114 countries in the data
ii) Are there any missing values? If so, remove them. (You can use the function na.omit().)
## [1] 0
## Country mortality.rate
## 0 0
## AFR agri.forestry.fish.gdp.pct
## 0 0
## industry.gdp.pct CO2
## 0 0
## fertility.rate GDP
## 0 0
## GDP.per.capita GDP.grwoth.rate
## 0 0
## GNI inflation
## 0 0
## population.growth population
## 0 0
## unemployment Continent
## 0 0
## Urban.pop Household.consump
## 0 0
## Forest.area Arable.land
## 0 0
## Water Food.prod.index
## 0 0
## Health.expend Immunization
## 0 0
## Sanitation.faci Immunization.measles
## 0 0
## Health.exp.pocket Fixed.tel
## 0 0
## Mobile.cel Internet.users
## 0 0
#Answer: There are no missing values.
i) Which country has the highest AFR and which one has the lowest AFR?
## [1] Niger
## 114 Levels: Algeria Argentina Armenia Austria Azerbaijan ... Vietnam
## [1] Switzerland
## 114 Levels: Algeria Argentina Armenia Austria Azerbaijan ... Vietnam
#Answer: According to the code above, Niger has the highest AFR and Switzerland has the lowest AFR.
ii) Provide a boxplot of AFR among Continent. Comment on the relation between AFR and Continent in 3 sentences.
ggplot(data1) +
geom_boxplot(aes(x=Continent, y=AFR)) +
labs(title="Boxplot AFR among Continent", x="") #Answer: According to the boxplot, Africa has the highest average AFR rate, but also the highest spread among the countries within, showing that some countries within the continent have lower AFR than some countries in North or South America. Europe has the lowest average AFR and also the lowest spread, meaning most countries in the continent have very low AFR rate.
AFR vs. a single variablei) Fit a linear model of AFR vs. GDP.per.capita. Is GDP.per.capita significant at 0.01 level? Is the association appearing to be negative?
##
## Call:
## lm(formula = AFR ~ GDP.per.capita, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.696 -28.628 -7.529 18.814 136.656
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65.8883149 4.3024002 15.314 < 2e-16 ***
## GDP.per.capita -0.0011122 0.0001731 -6.425 3.31e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.45 on 112 degrees of freedom
## Multiple R-squared: 0.2693, Adjusted R-squared: 0.2628
## F-statistic: 41.28 on 1 and 112 DF, p-value: 3.31e-09
ii) Are the averages of AFR the same across all the Continents at 0.01 level? Which continent has the highest AFR on average?
## Analysis of Variance Table
##
## Response: AFR
## Df Sum Sq Mean Sq F value Pr(>F)
## Continent 4 117764 29441.1 33.014 < 2.2e-16 ***
## Residuals 109 97204 891.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = AFR ~ Continent, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -85.746 -9.475 -2.582 12.226 108.773
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 93.336 5.364 17.402 < 2e-16 ***
## ContinentAsia -59.795 7.786 -7.680 7.33e-12 ***
## ContinentEurope -79.334 7.365 -10.771 < 2e-16 ***
## ContinentNorth America -25.032 10.480 -2.388 0.0186 *
## ContinentSouth America -25.314 11.307 -2.239 0.0272 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.86 on 109 degrees of freedom
## Multiple R-squared: 0.5478, Adjusted R-squared: 0.5312
## F-statistic: 33.01 on 4 and 109 DF, p-value: < 2.2e-16
#Answer: Continent as a whole is significant as the p-value is lower than 0.01 according to the anova test performed above. Africa is the base and seems to be the highest because the other continents have negative coefficients. See below for further proof that AFR is highest for Africa.
## # A tibble: 5 x 2
## Continent `mean(AFR)`
## <fct> <dbl>
## 1 Africa 93.3
## 2 Asia 33.5
## 3 Europe 14.0
## 4 North America 68.3
## 5 South America 68.0
AFR vs GDP.per.capita and Continenti) Fit a linear model of AFR vs GDP.per.capita and Continent, assuming there is no interaction effect.
##
## Call:
## lm(formula = AFR ~ GDP.per.capita + Continent, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -84.850 -10.116 -1.985 11.456 107.944
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.435e+01 5.222e+00 18.066 < 2e-16 ***
## GDP.per.capita -4.607e-04 1.677e-04 -2.746 0.00706 **
## ContinentAsia -5.650e+01 7.657e+00 -7.379 3.47e-11 ***
## ContinentEurope -6.553e+01 8.742e+00 -7.497 1.93e-11 ***
## ContinentNorth America -2.130e+01 1.027e+01 -2.074 0.04046 *
## ContinentSouth America -2.260e+01 1.103e+01 -2.049 0.04286 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29 on 108 degrees of freedom
## Multiple R-squared: 0.5773, Adjusted R-squared: 0.5578
## F-statistic: 29.5 on 5 and 108 DF, p-value: < 2.2e-16
a) Is GDP.per.capita significant at 0.01 level controlling for Continent?
## Anova Table (Type II tests)
##
## Response: AFR
## Sum Sq Df F value Pr(>F)
## GDP.per.capita 6345 1 7.5422 0.007062 **
## Continent 66212 4 19.6759 3.456e-12 ***
## Residuals 90859 108
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
b) Is Continent significant at 0.01 level controlling for GDP.per.capita. For a given GDP.per.capita value, which continent seems to have the lowest AFR on average?
ii) Some summary statistics seem to indicate a possible interaction effect of Continent and GDP.per.capita over AFR. Run a linear model of AFR vs GDP.per.capita and Continent with interaction.
##
## Call:
## lm(formula = AFR ~ GDP.per.capita * Continent, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -87.447 -10.153 -1.849 8.412 86.525
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 120.419266 6.048705 19.908
## GDP.per.capita -0.012350 0.001882 -6.563
## ContinentAsia -76.166753 8.513451 -8.947
## ContinentEurope -99.189984 8.951556 -11.081
## ContinentNorth America -42.199062 11.167372 -3.779
## ContinentSouth America -45.140165 18.582881 -2.429
## GDP.per.capita:ContinentAsia 0.011204 0.001925 5.822
## GDP.per.capita:ContinentEurope 0.012125 0.001888 6.421
## GDP.per.capita:ContinentNorth America 0.011387 0.001963 5.802
## GDP.per.capita:ContinentSouth America 0.011453 0.002689 4.260
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## GDP.per.capita 2.11e-09 ***
## ContinentAsia 1.51e-14 ***
## ContinentEurope < 2e-16 ***
## ContinentNorth America 0.000263 ***
## ContinentSouth America 0.016850 *
## GDP.per.capita:ContinentAsia 6.53e-08 ***
## GDP.per.capita:ContinentEurope 4.12e-09 ***
## GDP.per.capita:ContinentNorth America 7.15e-08 ***
## GDP.per.capita:ContinentSouth America 4.50e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.62 on 104 degrees of freedom
## Multiple R-squared: 0.7067, Adjusted R-squared: 0.6813
## F-statistic: 27.84 on 9 and 104 DF, p-value: < 2.2e-16
a) Can we reject the null hypothesis of no interaction effect at 0.01 level?
## Analysis of Variance Table
##
## Model 1: AFR ~ GDP.per.capita + Continent
## Model 2: AFR ~ GDP.per.capita * Continent
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 108 90859
## 2 104 63055 4 27804 11.465 9.516e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Lastly we will build a parsimonious model to see what factors are related to AFR.
i) In any linear model you will run, can you include the variable Country in it? Why or Why not? Explain in no more than 2 sentences.
#Answer: we cannot include the variable Country because there is only one row/ entry for each country. The model has no degree of freedoms to predict anything because there is only one data.
We now take out Country, fertility.rate, Continent and save it as data2.
ii) LASSO with cv.glmnet
a) Run a LASSO analysis using all variables in data2. For reproducibility, use set.seed(1). Also use 10 folds by setting nfolds=10. Plot the LASSO output.
set.seed(1)
X2 <- model.matrix(AFR ~ ., data = data2)[, -1]
Y2 <- data2$AFR
fit.lasso1 <- cv.glmnet(x = X2, y = Y2, alpha = 1, nfolds = 10)
plot(fit.lasso1)## [1] 0.4522635
## [1] 6.715969
b) Choose 6 non-zero variables from LASSO. Hint: The top line in the plot shows the number of non-zero coefficients. Choose \(s\) approximately equal to exponential of value on x-axis that corresponds to \(6\) in the top line.
## (Intercept) mortality.rate Water Immunization
## 114.0591324 0.3698861 -0.1780301 -0.4178737
## Sanitation.faci Internet.users
## -0.2288822 -0.1346024
#Answer: By using s=9, we got 6 non-zero variables from Lasso, including the intercept. The other variables are mortality.rate, Water, Immunization, Sanitation.faci, Internet.users
i) Assume we obtain the following variables from LASSO: mortality.rate, Water, Immunization, Sanitation.faci. Run the final linear model of AFR with the variables listed here AND Continent. Call this fit fit_final_AFR. Report the Anova of this fit and report if any of the variables are insignificant at 0.05 level.
fit_final_AFR <- lm (AFR ~ mortality.rate + Water + Immunization + Sanitation.faci + Continent, data=data1)
Anova (fit_final_AFR)## Anova Table (Type II tests)
##
## Response: AFR
## Sum Sq Df F value Pr(>F)
## mortality.rate 4253 1 12.7503 0.0005386 ***
## Water 1513 1 4.5361 0.0355243 *
## Immunization 2374 1 7.1175 0.0088437 **
## Sanitation.faci 2210 1 6.6263 0.0114428 *
## Continent 22633 4 16.9636 9.308e-11 ***
## Residuals 35023 105
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Answer: Based on the Anova table, all of the variables listed above are significant at 0.05 level
Note: data2 does not contain Continent. Also, we are giving the variables so that students who are not able to output LASSO variables will not be double penalized. This may not be the variables from the LASSO output.
ii) By your judgement, are the linear model assumptions satisfied for fit_final_AFR? Provide relevant plots.
par(mfrow=c(1,2), mar=c(5,2,4,2), mgp=c(3,0.5,0))
plot(fit_final_AFR, 1, pch=16)
abline(h=0, col="blue", lwd=2)
plot(fit_final_AFR, 2) # Answer: The Residual plot shows a fanned out pattern that needs to be fixed first so that it better fits the linear model assumptions. Normal QQ-plot looks fine, although it deviates a little bit on both tails, due to the outliers.
iii) Based on the summary of fit_final_AFR, provide one variable in this which the policy makers can use to decrease AFR.
##
## Call:
## lm(formula = AFR ~ mortality.rate + Water + Immunization + Sanitation.faci +
## Continent, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.090 -6.926 -1.108 8.960 65.139
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 164.2197 33.0370 4.971 2.61e-06 ***
## mortality.rate 0.4835 0.1354 3.571 0.000539 ***
## Water -0.4954 0.2326 -2.130 0.035524 *
## Immunization -0.6875 0.2577 -2.668 0.008844 **
## Sanitation.faci -0.3702 0.1438 -2.574 0.011443 *
## ContinentAsia -4.9808 6.4118 -0.777 0.439014
## ContinentEurope -3.2148 7.7039 -0.417 0.677319
## ContinentNorth America 36.3918 8.1479 4.466 2.01e-05 ***
## ContinentSouth America 31.6557 8.5002 3.724 0.000317 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.26 on 105 degrees of freedom
## Multiple R-squared: 0.8371, Adjusted R-squared: 0.8247
## F-statistic: 67.43 on 8 and 105 DF, p-value: < 2.2e-16
\[ \mbox{\textbf{End of AFR analysis}}. \]
In this part, we will explore the relation between heart disease and smoking using Framingham dataset. As we saw in class this dataset has a factor variable of interest HD which takes values “0” or “1” with “1” indicating the presence of heart disease. It includes other variables such as AGE, SEX, SBP, DBP, CHOL, FRW and CIG. We will use a revised data for the purpose of the midterm. A new categorical variable Smoke is created by grouping the original continuous variable CIG into categories “None”, “Med”, “High” and “VHigh”. We have split the original Framingham dataset into training and testing data: train_fram.csv and test_fram.csv.
## load the dataset train_fram.csv and testing data here
HD_train <- read.csv("train_fram.csv")
HD_train$HD <- factor(HD_train$HD, levels = c("0", "1"))
HD_train$Smoke <- factor(HD_train$Smoke, levels = c("None", "Med", "High", "VHigh"))
HD_train$X <- NULLi) Fit a logistic regression between HD and Smoke. Call this model fit1_logi. Report the summary. What is the base level? At what level/category of Smoke, the probability of HD = 1 appears to be the highest?
##
## Call:
## glm(formula = HD ~ Smoke, family = binomial(logit), data = HD_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8702 -0.6740 -0.6740 -0.5945 1.9081
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.3665 0.1045 -13.081 <2e-16 ***
## SmokeMed -0.2771 0.2506 -1.106 0.2688
## SmokeHigh 0.3464 0.1913 1.811 0.0701 .
## SmokeVHigh 0.5907 0.2475 2.386 0.0170 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1053.8 on 999 degrees of freedom
## Residual deviance: 1043.2 on 996 degrees of freedom
## AIC: 1051.2
##
## Number of Fisher Scoring iterations: 4
#Answer: The base level is “SmokeNone”. The probability of HD=1 appears to be the highest for VHigh Smoker
ii) In model fit1_logi, is Smoke a significant variable at level 0.05?
## Analysis of Deviance Table (Type II tests)
##
## Response: HD
## LR Chisq Df Pr(>Chisq)
## Smoke 10.655 3 0.01375 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Answer: Smoke is a significant variable in fit1_logi, as the p-value is lower than 0.05
iii) Now fit a logistic regression model for HD using AGE, SEX, SBP, CHOL and Smoke as covariates/features. Let us call this model fit2_logi. Is Smoke a significant variable at level 0.05?
fit2_logi <- glm(HD ~ AGE + SEX + SBP + CHOL + Smoke, HD_train, family =binomial (logit))
summary (fit2_logi)##
## Call:
## glm(formula = HD ~ AGE + SEX + SBP + CHOL + Smoke, family = binomial(logit),
## data = HD_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5165 -0.7313 -0.5566 -0.3549 2.5273
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.875653 1.080740 -7.287 3.16e-13 ***
## AGE 0.052295 0.017208 3.039 0.00237 **
## SEXMALE 0.909009 0.183307 4.959 7.09e-07 ***
## SBP 0.016669 0.002852 5.845 5.06e-09 ***
## CHOL 0.003286 0.001770 1.857 0.06338 .
## SmokeMed -0.275544 0.261996 -1.052 0.29293
## SmokeHigh 0.224530 0.211482 1.062 0.28837
## SmokeVHigh 0.438044 0.275931 1.588 0.11240
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1053.82 on 999 degrees of freedom
## Residual deviance: 968.11 on 992 degrees of freedom
## AIC: 984.11
##
## Number of Fisher Scoring iterations: 4
## Analysis of Deviance Table (Type II tests)
##
## Response: HD
## LR Chisq Df Pr(>Chisq)
## AGE 9.349 1 0.00223 **
## SEX 25.309 1 4.885e-07 ***
## SBP 34.999 1 3.299e-09 ***
## CHOL 3.430 1 0.06401 .
## Smoke 5.380 3 0.14599
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Answer: According to Anova table, Smoke is not a significant variable as 0.145 is bigger than 0.05 level.
Load the testing data test_fram.csv.
HD_test <- read.csv("test_fram.csv")
HD_test$Smoke <- factor(HD_test$Smoke, levels = c("None", "Med", "High", "VHigh"))
HD_test$X <- NULL## HD AGE SEX SBP
## Min. :0.0000 Min. :45.0 FEMALE:204 Min. : 90
## 1st Qu.:0.0000 1st Qu.:48.0 MALE :189 1st Qu.:130
## Median :0.0000 Median :52.0 Median :144
## Mean :0.2214 Mean :51.8 Mean :149
## 3rd Qu.:0.0000 3rd Qu.:55.0 3rd Qu.:160
## Max. :1.0000 Max. :62.0 Max. :294
## DBP CHOL FRW CIG
## Min. : 60.00 Min. :133.0 Min. : 69.0 Min. : 0.000
## 1st Qu.: 80.00 1st Qu.:200.0 1st Qu.: 93.0 1st Qu.: 0.000
## Median : 90.00 Median :227.0 Median :102.0 Median : 0.000
## Mean : 90.68 Mean :233.1 Mean :104.1 Mean : 8.562
## 3rd Qu.:100.00 3rd Qu.:260.0 3rd Qu.:113.0 3rd Qu.:20.000
## Max. :155.00 Max. :430.0 Max. :204.0 Max. :50.000
## Smoke
## None :200
## Med : 64
## High : 88
## VHigh: 41
##
##
i) Use the 1/2 thresholding rule for predicting HD with models fit1_logi and fit2_logi. Predict HD on the testing data. What are the (testing) misclassification errors from models fit1_logi and fit2_logi? Report at least 3 decimals.
fit1.pred <- predict(fit1_logi, HD_test, type="response") >=0.5
error.training1 <- mean(fit1.pred != HD_test$HD)
error.training1## [1] 0.221374
#Answer: for fit1_logi, the misclassfication error is 0.221
fit2.predict.5 <- predict(fit2_logi, HD_test, type="response")>=0.5
error.fit2 <- mean(fit2.predict.5 != HD_test$HD)
error.fit2## [1] 0.2188295
#Answer: for fit1_logi, the misclassfication error is 0.219
ii) Based on the testing MCE, which model is the best? #Answer" Based on the testing MCE, Model 2, fit2_logi is better, because it has lower MCE of 0.219 compared to 0.221.
i) We have a male with features: AGE = 50, SBP = 160, CHOL = 230 and Smoke = None. Predict whether this person has a heart disease or not based on the 1/2 thresholding rule with fit1_logi.
## AGE SBP CHOL Smoke
## 1 50 160 230 None
## 1
## "0"
#Answer: This male is predicted to NOT have heart disease.
\[ \mbox{\textbf{End of the exam}}. \]
By submitting this document you certify that you have complied with the University of Pennsylvania’s Code of Academic Integrity, to the best of your knowledge. You further certify that you have taken this exam under its sanctioned conditions, i.e. solely within the set exam room and within the time allotted.