Name your submission using the scheme:

LastName_FirstName.pdf etc.

For example: Zhao_Linda .rmd, .pdf, .html or .docx.

Instruction: This exam requires you to use R. It is completely open book/notes/internet. Write your answers using .rmd format and knitr it into one of the html/pdf/docx format. Show your codes, plots or R-output when needed. You can use echo = TRUE to show your codes. If you have trouble formatting the plots, don’t worry about it. We are not looking for pretty solutions, rather to see if you are able to make sense out of data using R. Make sure the compiled pdf/html/docx shows your answers completely and that they are not cut-off. Throughout the exam, you do not need to use any LaTeX or mathematical equations.

All the answers should be clearly supported by relevant R code.

Data for Midterm: The data for midterm can be found at:

/canvas/Files/Midterm/AFR_2012.csv,

/canvas/Files/Midterm/train_fram.csv, and

/canvas/Files/Midterm/test_fram.csv.

Midterm Question File can be found at:

/canvas/Files/Midterm/Miderm11_05_2019.Rmd.

Help: As always skip any part you have trouble with and you may come back to finish it if you have time. Ask one of us for help if you are stuck somewhere for technical issues.

Electronic Submission: In the Assignments section, go to the Midterm assignment and upload your completed files: your .rmd file and a compiled file (either a pdf/html/docx).

You can upload multiple files. The folder will be closed at 08:10PM.

If you have trouble to upload your files, email them to lzhao@wharton.upenn.edu and arunku@wharton.upenn.edu.

Whenever we ask for test at some level, assume all the model assumptions are satisfied.

The adolescent fertility rate (AFR)

The adolescent fertility rate (AFR) is defined as the number of births per 1,000 women of age 15 to 19. While world’s AFR has been decreasing steadily over the years, some countries still have high AFR. Having children this early in life exposes adolescent women to unnecessary risks. Their chance of dying is twice as high as that of women who wait until their 20s to begin childbearing. In addition, early childbearing greatly reduces the likelihood of a girl advancing her education and limits her opportunities for training and employment.

Based on a data set from the Data Bank of the World Bank (https://databank.worldbank.org/data/home.aspx), AFR together with other information of 2012 is available. Our goal is to identify important factors associated with AFR. Hope we could give some recommendations to lower the AFR for policymakers.

The data set is AFR_2012.csv.

Variable Description
mortality.rate Mortality rate, under-5 (per 1,000 live births)
Country Country name
AFR Adolescent fertility rate (births per 1,000 women ages 15-19)
agri.forestry.fish.gdp.pct Agriculture, forestry, and fishing, value added (% of GDP)
industry.gdp.pct Industry (including construction), value added (% of GDP)
CO2 CO2 emissions (metric tons per capita)
fertility.rate Fertility rate, total (births per woman)
GDP GDP (current USD)
GDP.per.capita GDP per capita (current US$)
gdp.grwoth.rate GDP growth (annual %)
gni GNI, PPP (current international dollar)
inflation Inflation, GDP deflator (annual %)
LE Life expectancy at birth, total (years)
population.growth Population growth (annual %)
population Population, total
unemployment Unemployment, total (% of total labor force))
Continent Continent
Urban.pop Percentage of urban population
Household.consump Household consumption expenditure in million
Forest.area Percentage of forest
Water Access to improved water source in percentage
Food.prod.index Food production index
Arable.land Arable land per capita
Health.expend Health expenditure percentage of GDP
Immunization DPT Immunization percentage of children
Sanitation.faci Access to improved sanitation facilities in percentage
Immunization.measles Measles Immunization percentage of children
Health.exp.pocket Percentage of out of pocket health expenditure to total health
Fixed.tel Fixed telephone subscriptions per 100 people
Mobile.cel Mobile cellular subscriptions per 100 people
Internet.users Internet users per 100 people

Part 1. EDA

1) Reading data

Load AFR_2012.csv. Notice AFR is Adolescent Fertility Rate.

Use data1 from now.

##        Country    mortality.rate         AFR         
##  Algeria   :  1   Min.   :  2.500   Min.   :  3.503  
##  Argentina :  1   1st Qu.:  5.625   1st Qu.: 13.194  
##  Armenia   :  1   Median : 16.900   Median : 36.206  
##  Austria   :  1   Mean   : 31.968   Mean   : 49.878  
##  Azerbaijan:  1   3rd Qu.: 45.400   3rd Qu.: 76.373  
##  Bangladesh:  1   Max.   :142.900   Max.   :202.109  
##  (Other)   :108                                      
##  agri.forestry.fish.gdp.pct industry.gdp.pct      CO2         
##  Min.   : 0.3593            Min.   : 3.326   Min.   : 0.0303  
##  1st Qu.: 3.0792            1st Qu.:20.891   1st Qu.: 0.8510  
##  Median : 7.0568            Median :25.514   Median : 3.1088  
##  Mean   :11.1746            Mean   :27.454   Mean   : 4.2928  
##  3rd Qu.:16.5350            3rd Qu.:32.127   3rd Qu.: 6.5080  
##  Max.   :54.8998            Max.   :74.768   Max.   :20.0842  
##                                                               
##  fertility.rate       GDP            GDP.per.capita     GDP.grwoth.rate  
##  Min.   :1.269   Min.   :1.564e+09   Min.   :   250.4   Min.   :-7.4446  
##  1st Qu.:1.666   1st Qu.:1.445e+10   1st Qu.:  1653.0   1st Qu.: 0.4994  
##  Median :2.236   Median :5.523e+10   Median :  6155.6   Median : 3.5747  
##  Mean   :2.728   Mean   :5.596e+11   Mean   : 14394.5   Mean   : 3.2095  
##  3rd Qu.:3.214   3rd Qu.:3.057e+11   3rd Qu.: 15433.9   3rd Qu.: 5.4452  
##  Max.   :7.420   Max.   :1.616e+13   Max.   :106749.0   Max.   :12.3198  
##                                                                          
##       GNI              inflation      population.growth
##  Min.   :2.503e+09   Min.   :-1.170   Min.   :-1.3412  
##  1st Qu.:2.964e+10   1st Qu.: 1.743   1st Qu.: 0.3716  
##  Median :1.123e+11   Median : 3.598   Median : 1.2666  
##  Mean   :1.383e+12   Mean   : 5.679   Mean   : 1.3457  
##  3rd Qu.:4.695e+11   3rd Qu.: 6.620   3rd Qu.: 2.1936  
##  Max.   :7.215e+13   Max.   :75.277   Max.   : 6.9053  
##                                                        
##    population         unemployment            Continent    Urban.pop    
##  Min.   :3.207e+05   Min.   : 0.160   Africa       :31   Min.   :11.19  
##  1st Qu.:4.942e+06   1st Qu.: 3.862   Asia         :28   1st Qu.:42.70  
##  Median :1.084e+07   Median : 6.825   Europe       :35   Median :62.08  
##  Mean   :5.194e+07   Mean   : 8.171   North America:11   Mean   :58.66  
##  3rd Qu.:3.533e+07   3rd Qu.:10.992   South America: 9   3rd Qu.:74.50  
##  Max.   :1.351e+09   Max.   :24.790                      Max.   :97.73  
##                                                                         
##  Household.consump  Forest.area     Arable.land         Water       
##  Min.   : 6.520    Min.   : 0.01   Min.   :0.0100   Min.   : 48.70  
##  1st Qu.: 9.228    1st Qu.:11.23   1st Qu.:0.1025   1st Qu.: 84.83  
##  Median :10.394    Median :30.90   Median :0.2000   Median : 95.45  
##  Mean   :10.718    Mean   :30.07   Mean   :0.2540   Mean   : 89.08  
##  3rd Qu.:12.110    3rd Qu.:44.09   3rd Qu.:0.2975   3rd Qu.: 99.50  
##  Max.   :16.219    Max.   :84.10   Max.   :1.7500   Max.   :100.00  
##                                                                     
##  Food.prod.index  Health.expend     Immunization   Sanitation.faci 
##  Min.   : 72.55   Min.   : 0.610   Min.   :45.00   Min.   : 10.10  
##  1st Qu.:100.53   1st Qu.: 2.243   1st Qu.:87.25   1st Qu.: 57.75  
##  Median :115.97   Median : 3.910   Median :95.00   Median : 86.25  
##  Mean   :116.12   Mean   : 4.193   Mean   :90.61   Mean   : 73.40  
##  3rd Qu.:128.61   3rd Qu.: 5.625   3rd Qu.:97.00   3rd Qu.: 97.42  
##  Max.   :173.83   Max.   :10.090   Max.   :99.00   Max.   :100.00  
##                                                                    
##  Immunization.measles Health.exp.pocket   Fixed.tel        Mobile.cel    
##  Min.   :49.00        Min.   : 5.30     Min.   : 0.000   Min.   : 14.92  
##  1st Qu.:87.00        1st Qu.:18.69     1st Qu.: 4.175   1st Qu.: 81.97  
##  Median :93.00        Median :31.93     Median :15.480   Median :108.20  
##  Mean   :89.96        Mean   :33.04     Mean   :19.137   Mean   :104.48  
##  3rd Qu.:97.00        3rd Qu.:46.03     3rd Qu.:29.415   3rd Qu.:127.18  
##  Max.   :99.00        Max.   :74.46     Max.   :62.110   Max.   :187.36  
##                                                                          
##  Internet.users 
##  Min.   : 1.22  
##  1st Qu.:13.33  
##  Median :39.62  
##  Mean   :40.71  
##  3rd Qu.:61.81  
##  Max.   :96.21  
## 

i) How many countries are there in this data?

## [1] 114

#Answer: There are 114 countries in the data

ii) Are there any missing values? If so, remove them. (You can use the function na.omit().)

## [1] 0
##                    Country             mortality.rate 
##                          0                          0 
##                        AFR agri.forestry.fish.gdp.pct 
##                          0                          0 
##           industry.gdp.pct                        CO2 
##                          0                          0 
##             fertility.rate                        GDP 
##                          0                          0 
##             GDP.per.capita            GDP.grwoth.rate 
##                          0                          0 
##                        GNI                  inflation 
##                          0                          0 
##          population.growth                 population 
##                          0                          0 
##               unemployment                  Continent 
##                          0                          0 
##                  Urban.pop          Household.consump 
##                          0                          0 
##                Forest.area                Arable.land 
##                          0                          0 
##                      Water            Food.prod.index 
##                          0                          0 
##              Health.expend               Immunization 
##                          0                          0 
##            Sanitation.faci       Immunization.measles 
##                          0                          0 
##          Health.exp.pocket                  Fixed.tel 
##                          0                          0 
##                 Mobile.cel             Internet.users 
##                          0                          0

#Answer: There are no missing values.

2) Summaries

i) Which country has the highest AFR and which one has the lowest AFR?

## [1] Niger
## 114 Levels: Algeria Argentina Armenia Austria Azerbaijan ... Vietnam
## [1] Switzerland
## 114 Levels: Algeria Argentina Armenia Austria Azerbaijan ... Vietnam

#Answer: According to the code above, Niger has the highest AFR and Switzerland has the lowest AFR.

ii) Provide a boxplot of AFR among Continent. Comment on the relation between AFR and Continent in 3 sentences.

#Answer: According to the boxplot, Africa has the highest average AFR rate, but also the highest spread among the countries within, showing that some countries within the continent have lower AFR than some countries in North or South America. Europe has the lowest average AFR and also the lowest spread, meaning most countries in the continent have very low AFR rate.

Part 2. Analysis with domain knowledge

3) AFR vs. a single variable

i) Fit a linear model of AFR vs. GDP.per.capita. Is GDP.per.capita significant at 0.01 level? Is the association appearing to be negative?

## 
## Call:
## lm(formula = AFR ~ GDP.per.capita, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.696 -28.628  -7.529  18.814 136.656 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    65.8883149  4.3024002  15.314  < 2e-16 ***
## GDP.per.capita -0.0011122  0.0001731  -6.425 3.31e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.45 on 112 degrees of freedom
## Multiple R-squared:  0.2693, Adjusted R-squared:  0.2628 
## F-statistic: 41.28 on 1 and 112 DF,  p-value: 3.31e-09

Answer: Yes, GDP.per.capita’s p-values is lower than 0.01, we can reject the null hypothesis and accept that the variable is significant. The Association appears to be negative, as 1 unit higher in GDP.per.capita reduces AFR by 0.0011122

ii) Are the averages of AFR the same across all the Continents at 0.01 level? Which continent has the highest AFR on average?

## Analysis of Variance Table
## 
## Response: AFR
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## Continent   4 117764 29441.1  33.014 < 2.2e-16 ***
## Residuals 109  97204   891.8                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call:
## lm(formula = AFR ~ Continent, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -85.746  -9.475  -2.582  12.226 108.773 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              93.336      5.364  17.402  < 2e-16 ***
## ContinentAsia           -59.795      7.786  -7.680 7.33e-12 ***
## ContinentEurope         -79.334      7.365 -10.771  < 2e-16 ***
## ContinentNorth America  -25.032     10.480  -2.388   0.0186 *  
## ContinentSouth America  -25.314     11.307  -2.239   0.0272 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.86 on 109 degrees of freedom
## Multiple R-squared:  0.5478, Adjusted R-squared:  0.5312 
## F-statistic: 33.01 on 4 and 109 DF,  p-value: < 2.2e-16

#Answer: Continent as a whole is significant as the p-value is lower than 0.01 according to the anova test performed above. Africa is the base and seems to be the highest because the other continents have negative coefficients. See below for further proof that AFR is highest for Africa.

## # A tibble: 5 x 2
##   Continent     `mean(AFR)`
##   <fct>               <dbl>
## 1 Africa               93.3
## 2 Asia                 33.5
## 3 Europe               14.0
## 4 North America        68.3
## 5 South America        68.0

4) AFR vs GDP.per.capita and Continent

i) Fit a linear model of AFR vs GDP.per.capita and Continent, assuming there is no interaction effect.

## 
## Call:
## lm(formula = AFR ~ GDP.per.capita + Continent, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -84.850 -10.116  -1.985  11.456 107.944 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             9.435e+01  5.222e+00  18.066  < 2e-16 ***
## GDP.per.capita         -4.607e-04  1.677e-04  -2.746  0.00706 ** 
## ContinentAsia          -5.650e+01  7.657e+00  -7.379 3.47e-11 ***
## ContinentEurope        -6.553e+01  8.742e+00  -7.497 1.93e-11 ***
## ContinentNorth America -2.130e+01  1.027e+01  -2.074  0.04046 *  
## ContinentSouth America -2.260e+01  1.103e+01  -2.049  0.04286 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29 on 108 degrees of freedom
## Multiple R-squared:  0.5773, Adjusted R-squared:  0.5578 
## F-statistic:  29.5 on 5 and 108 DF,  p-value: < 2.2e-16

a) Is GDP.per.capita significant at 0.01 level controlling for Continent?

## Anova Table (Type II tests)
## 
## Response: AFR
##                Sum Sq  Df F value    Pr(>F)    
## GDP.per.capita   6345   1  7.5422  0.007062 ** 
## Continent       66212   4 19.6759 3.456e-12 ***
## Residuals       90859 108                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Answer:According to Anova test, GDP per capita is significant at 0.01 level controlling for Continent

b) Is Continent significant at 0.01 level controlling for GDP.per.capita. For a given GDP.per.capita value, which continent seems to have the lowest AFR on average?

Answer: Yes, Continent is significant at 0.01 level controlling for GDP.per.capita. Controlling for GDP.per.capita, Europe seems to have the lowest AFR on average since it has the most negative coefficient.

ii) Some summary statistics seem to indicate a possible interaction effect of Continent and GDP.per.capita over AFR. Run a linear model of AFR vs GDP.per.capita and Continent with interaction.

## 
## Call:
## lm(formula = AFR ~ GDP.per.capita * Continent, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -87.447 -10.153  -1.849   8.412  86.525 
## 
## Coefficients:
##                                         Estimate Std. Error t value
## (Intercept)                           120.419266   6.048705  19.908
## GDP.per.capita                         -0.012350   0.001882  -6.563
## ContinentAsia                         -76.166753   8.513451  -8.947
## ContinentEurope                       -99.189984   8.951556 -11.081
## ContinentNorth America                -42.199062  11.167372  -3.779
## ContinentSouth America                -45.140165  18.582881  -2.429
## GDP.per.capita:ContinentAsia            0.011204   0.001925   5.822
## GDP.per.capita:ContinentEurope          0.012125   0.001888   6.421
## GDP.per.capita:ContinentNorth America   0.011387   0.001963   5.802
## GDP.per.capita:ContinentSouth America   0.011453   0.002689   4.260
##                                       Pr(>|t|)    
## (Intercept)                            < 2e-16 ***
## GDP.per.capita                        2.11e-09 ***
## ContinentAsia                         1.51e-14 ***
## ContinentEurope                        < 2e-16 ***
## ContinentNorth America                0.000263 ***
## ContinentSouth America                0.016850 *  
## GDP.per.capita:ContinentAsia          6.53e-08 ***
## GDP.per.capita:ContinentEurope        4.12e-09 ***
## GDP.per.capita:ContinentNorth America 7.15e-08 ***
## GDP.per.capita:ContinentSouth America 4.50e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.62 on 104 degrees of freedom
## Multiple R-squared:  0.7067, Adjusted R-squared:  0.6813 
## F-statistic: 27.84 on 9 and 104 DF,  p-value: < 2.2e-16

a) Can we reject the null hypothesis of no interaction effect at 0.01 level?

## Analysis of Variance Table
## 
## Model 1: AFR ~ GDP.per.capita + Continent
## Model 2: AFR ~ GDP.per.capita * Continent
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1    108 90859                                  
## 2    104 63055  4     27804 11.465 9.516e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Answer:Yes, we reject the null hypothesis of no interaction given that the p-value is 9.51 e-08, much smaller than 0.01. The interaction effect is significant.

Part 3. Analysis with LASSO

Lastly we will build a parsimonious model to see what factors are related to AFR.

5) LASSO to reduce the number of factors

i) In any linear model you will run, can you include the variable Country in it? Why or Why not? Explain in no more than 2 sentences.

#Answer: we cannot include the variable Country because there is only one row/ entry for each country. The model has no degree of freedoms to predict anything because there is only one data.

We now take out Country, fertility.rate, Continent and save it as data2.

ii) LASSO with cv.glmnet

a) Run a LASSO analysis using all variables in data2. For reproducibility, use set.seed(1). Also use 10 folds by setting nfolds=10. Plot the LASSO output.

## [1] 0.4522635
## [1] 6.715969

b) Choose 6 non-zero variables from LASSO. Hint: The top line in the plot shows the number of non-zero coefficients. Choose \(s\) approximately equal to exponential of value on x-axis that corresponds to \(6\) in the top line.

##     (Intercept)  mortality.rate           Water    Immunization 
##     114.0591324       0.3698861      -0.1780301      -0.4178737 
## Sanitation.faci  Internet.users 
##      -0.2288822      -0.1346024

#Answer: By using s=9, we got 6 non-zero variables from Lasso, including the intercept. The other variables are mortality.rate, Water, Immunization, Sanitation.faci, Internet.users

6) Final analysis using variables from LASSO

i) Assume we obtain the following variables from LASSO: mortality.rate, Water, Immunization, Sanitation.faci. Run the final linear model of AFR with the variables listed here AND Continent. Call this fit fit_final_AFR. Report the Anova of this fit and report if any of the variables are insignificant at 0.05 level.

## Anova Table (Type II tests)
## 
## Response: AFR
##                 Sum Sq  Df F value    Pr(>F)    
## mortality.rate    4253   1 12.7503 0.0005386 ***
## Water             1513   1  4.5361 0.0355243 *  
## Immunization      2374   1  7.1175 0.0088437 ** 
## Sanitation.faci   2210   1  6.6263 0.0114428 *  
## Continent        22633   4 16.9636 9.308e-11 ***
## Residuals        35023 105                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#Answer: Based on the Anova table, all of the variables listed above are significant at 0.05 level

Note: data2 does not contain Continent. Also, we are giving the variables so that students who are not able to output LASSO variables will not be double penalized. This may not be the variables from the LASSO output.

ii) By your judgement, are the linear model assumptions satisfied for fit_final_AFR? Provide relevant plots.

# Answer: The Residual plot shows a fanned out pattern that needs to be fixed first so that it better fits the linear model assumptions. Normal QQ-plot looks fine, although it deviates a little bit on both tails, due to the outliers.

iii) Based on the summary of fit_final_AFR, provide one variable in this which the policy makers can use to decrease AFR.

Answer: Policy makers can focus on Immunization for DPT in children in order to reduce AFR. As shown below, higher immunization rate has the highest impact in reducing AFR.

## 
## Call:
## lm(formula = AFR ~ mortality.rate + Water + Immunization + Sanitation.faci + 
##     Continent, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.090  -6.926  -1.108   8.960  65.139 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            164.2197    33.0370   4.971 2.61e-06 ***
## mortality.rate           0.4835     0.1354   3.571 0.000539 ***
## Water                   -0.4954     0.2326  -2.130 0.035524 *  
## Immunization            -0.6875     0.2577  -2.668 0.008844 ** 
## Sanitation.faci         -0.3702     0.1438  -2.574 0.011443 *  
## ContinentAsia           -4.9808     6.4118  -0.777 0.439014    
## ContinentEurope         -3.2148     7.7039  -0.417 0.677319    
## ContinentNorth America  36.3918     8.1479   4.466 2.01e-05 ***
## ContinentSouth America  31.6557     8.5002   3.724 0.000317 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.26 on 105 degrees of freedom
## Multiple R-squared:  0.8371, Adjusted R-squared:  0.8247 
## F-statistic: 67.43 on 8 and 105 DF,  p-value: < 2.2e-16

\[ \mbox{\textbf{End of AFR analysis}}. \]

Relation between Heart Disease and Smoking

In this part, we will explore the relation between heart disease and smoking using Framingham dataset. As we saw in class this dataset has a factor variable of interest HD which takes values “0” or “1” with “1” indicating the presence of heart disease. It includes other variables such as AGE, SEX, SBP, DBP, CHOL, FRW and CIG. We will use a revised data for the purpose of the midterm. A new categorical variable Smoke is created by grouping the original continuous variable CIG into categories “None”, “Med”, “High” and “VHigh”. We have split the original Framingham dataset into training and testing data: train_fram.csv and test_fram.csv.

Part 1 Relation between HD and Smoke

1) Preliminary Models

i) Fit a logistic regression between HD and Smoke. Call this model fit1_logi. Report the summary. What is the base level? At what level/category of Smoke, the probability of HD = 1 appears to be the highest?

## 
## Call:
## glm(formula = HD ~ Smoke, family = binomial(logit), data = HD_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8702  -0.6740  -0.6740  -0.5945   1.9081  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.3665     0.1045 -13.081   <2e-16 ***
## SmokeMed     -0.2771     0.2506  -1.106   0.2688    
## SmokeHigh     0.3464     0.1913   1.811   0.0701 .  
## SmokeVHigh    0.5907     0.2475   2.386   0.0170 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1053.8  on 999  degrees of freedom
## Residual deviance: 1043.2  on 996  degrees of freedom
## AIC: 1051.2
## 
## Number of Fisher Scoring iterations: 4

#Answer: The base level is “SmokeNone”. The probability of HD=1 appears to be the highest for VHigh Smoker

ii) In model fit1_logi, is Smoke a significant variable at level 0.05?

## Analysis of Deviance Table (Type II tests)
## 
## Response: HD
##       LR Chisq Df Pr(>Chisq)  
## Smoke   10.655  3    0.01375 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#Answer: Smoke is a significant variable in fit1_logi, as the p-value is lower than 0.05

iii) Now fit a logistic regression model for HD using AGE, SEX, SBP, CHOL and Smoke as covariates/features. Let us call this model fit2_logi. Is Smoke a significant variable at level 0.05?

## 
## Call:
## glm(formula = HD ~ AGE + SEX + SBP + CHOL + Smoke, family = binomial(logit), 
##     data = HD_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5165  -0.7313  -0.5566  -0.3549   2.5273  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -7.875653   1.080740  -7.287 3.16e-13 ***
## AGE          0.052295   0.017208   3.039  0.00237 ** 
## SEXMALE      0.909009   0.183307   4.959 7.09e-07 ***
## SBP          0.016669   0.002852   5.845 5.06e-09 ***
## CHOL         0.003286   0.001770   1.857  0.06338 .  
## SmokeMed    -0.275544   0.261996  -1.052  0.29293    
## SmokeHigh    0.224530   0.211482   1.062  0.28837    
## SmokeVHigh   0.438044   0.275931   1.588  0.11240    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1053.82  on 999  degrees of freedom
## Residual deviance:  968.11  on 992  degrees of freedom
## AIC: 984.11
## 
## Number of Fisher Scoring iterations: 4
## Analysis of Deviance Table (Type II tests)
## 
## Response: HD
##       LR Chisq Df Pr(>Chisq)    
## AGE      9.349  1    0.00223 ** 
## SEX     25.309  1  4.885e-07 ***
## SBP     34.999  1  3.299e-09 ***
## CHOL     3.430  1    0.06401 .  
## Smoke    5.380  3    0.14599    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#Answer: According to Anova table, Smoke is not a significant variable as 0.145 is bigger than 0.05 level.

Part 2: Classification

2) Thresholding Rules

Load the testing data test_fram.csv.

##        HD              AGE           SEX           SBP     
##  Min.   :0.0000   Min.   :45.0   FEMALE:204   Min.   : 90  
##  1st Qu.:0.0000   1st Qu.:48.0   MALE  :189   1st Qu.:130  
##  Median :0.0000   Median :52.0                Median :144  
##  Mean   :0.2214   Mean   :51.8                Mean   :149  
##  3rd Qu.:0.0000   3rd Qu.:55.0                3rd Qu.:160  
##  Max.   :1.0000   Max.   :62.0                Max.   :294  
##       DBP              CHOL            FRW             CIG        
##  Min.   : 60.00   Min.   :133.0   Min.   : 69.0   Min.   : 0.000  
##  1st Qu.: 80.00   1st Qu.:200.0   1st Qu.: 93.0   1st Qu.: 0.000  
##  Median : 90.00   Median :227.0   Median :102.0   Median : 0.000  
##  Mean   : 90.68   Mean   :233.1   Mean   :104.1   Mean   : 8.562  
##  3rd Qu.:100.00   3rd Qu.:260.0   3rd Qu.:113.0   3rd Qu.:20.000  
##  Max.   :155.00   Max.   :430.0   Max.   :204.0   Max.   :50.000  
##    Smoke    
##  None :200  
##  Med  : 64  
##  High : 88  
##  VHigh: 41  
##             
## 

i) Use the 1/2 thresholding rule for predicting HD with models fit1_logi and fit2_logi. Predict HD on the testing data. What are the (testing) misclassification errors from models fit1_logi and fit2_logi? Report at least 3 decimals.

## [1] 0.221374

#Answer: for fit1_logi, the misclassfication error is 0.221

## [1] 0.2188295

#Answer: for fit1_logi, the misclassfication error is 0.219

ii) Based on the testing MCE, which model is the best? #Answer" Based on the testing MCE, Model 2, fit2_logi is better, because it has lower MCE of 0.219 compared to 0.221.

Part 3: Prediction

3) Prediction

i) We have a male with features: AGE = 50, SBP = 160, CHOL = 230 and Smoke = None. Predict whether this person has a heart disease or not based on the 1/2 thresholding rule with fit1_logi.

##   AGE SBP CHOL Smoke
## 1  50 160  230  None
##   1 
## "0"

#Answer: This male is predicted to NOT have heart disease.

\[ \mbox{\textbf{End of the exam}}. \]

Declaration

By submitting this document you certify that you have complied with the University of Pennsylvania’s Code of Academic Integrity, to the best of your knowledge. You further certify that you have taken this exam under its sanctioned conditions, i.e. solely within the set exam room and within the time allotted.