Take home exam - Johana Radojević

#Task one.

#loading library

library(car)

## Loading required package: carData

data(package = .packages(all.available = TRUE))

#asigning wong as mydata

mydata <-Wong

#Showing summary of the dataset, and printing the head

summary(Wong  )

##        id            days            duration         sex           age        
##  Min.   : 405   Min.   :   13.0   Min.   :  0.0   Female: 71   Min.   : 6.513  
##  1st Qu.:3808   1st Qu.:   59.0   1st Qu.:  1.0   Male  :260   1st Qu.:21.737  
##  Median :5085   Median :  150.0   Median :  7.0                Median :26.877  
##  Mean   :4735   Mean   :  481.9   Mean   : 14.3                Mean   :31.853  
##  3rd Qu.:5712   3rd Qu.:  416.0   3rd Qu.: 16.0                3rd Qu.:40.923  
##  Max.   :7548   Max.   :11628.0   Max.   :255.0                Max.   :80.033  
##       piq              viq        
##  Min.   : 50.00   Min.   : 64.00  
##  1st Qu.: 77.00   1st Qu.: 85.00  
##  Median : 87.00   Median : 94.00  
##  Mean   : 87.56   Mean   : 94.96  
##  3rd Qu.: 97.00   3rd Qu.:105.00  
##  Max.   :133.00   Max.   :132.00

head(Wong  )

##     id days duration  sex      age piq viq
## 1 3358   30        4 Male 20.67077  87  89
## 2 3535   16       17 Male 55.28816  95  77
## 3 3547   40        1 Male 55.91513  95 116
## 4 3592   13       10 Male 61.66461  59  73
## 5 3728   19        6 Male 30.12731  67  73
## 6 3790   13        3 Male 57.06229  76  69

Description of variables: The data frame has 331 rows (observations) and 7 columns(variables). The observations are longitudinal data on recovery of IQ after comas of varying duration for 200 subjects.

This data frame contains the following columns:

id-patient ID number.

days-number of days post coma at which IQs were measured.

duration-duration of the coma in days.

sex-a factor with levels Female and Male.

age-in years at the time of injury.

piq-performance (i.e., mathematical) IQ.

viq-verbal IQ.

#Rounding data

library (dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

mydata_rounded <- mydata %>% mutate(across(c('age'), round, 0))

## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `across(c("age"), round, 0)`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
## 
##   # Previously
##   across(a:b, mean, na.rm = TRUE)
## 
##   # Now
##   across(a:b, \(x) mean(x, na.rm = TRUE))

head(mydata_rounded)

##     id days duration  sex age piq viq
## 1 3358   30        4 Male  21  87  89
## 2 3535   16       17 Male  55  95  77
## 3 3547   40        1 Male  56  95 116
## 4 3592   13       10 Male  62  59  73
## 5 3728   19        6 Male  30  67  73
## 6 3790   13        3 Male  57  76  69

#Ordering dataset, by variable “duration”

order(mydata_rounded$duration)

##   [1]  15  38  39  40  42  43  44  45  62  73  87  88  89  91  94 114 117 123
##  [19] 138 139 142 144 154 160 191 194 201 222 225 229 230 237 260 263 272 274
##  [37] 276 286 289 291 292 298 302 304 314 331   3  13  18  20  23  27  28  31
##  [55]  33  34  37  41  50  57  72  75  76  80  95 104 107 111 153 158 161 181
##  [73] 182 185 195 202 203 204 211 216 233 239 246 249 254 270 281 283 294 300
##  [91]  35  58  79  82  93 125 163 166 172 183 213 218 221 251 257 293 296 309
## [109]   6   9  24  48  60  74  83  86 129 141 159 165 173 205 264 282 284   1
## [127]  16  21  61  66 116 124 134 137 140 152 170 192 206 317   7  17  19  55
## [145]  84  92 127 156 243   5  56 147   8  10  25  26  36  65  70  71  81 126
## [163] 130 133 135 151 157 171 178 179 188 198 220 228 244 250 253 279 295 307
## [181] 308 315 324  11  22  29  32 122 162 196 242 259 299  30  47  49  52  99
## [199] 169 327   4  68 100 118 132 224 323  12 121 168 240  69  85 105 131 180
## [217] 187 245 248 146 212 261 285  46  54  59  63  77  78  90 102 109 110 115
## [235] 149 155 164 190 197 199 223 235 255 273 278 312 328 265   2  64 143 150
## [253] 174 207 241 297 316  51  67 136 145 175 214 313 275 290 103 108 112 113
## [271] 177 184 200 210 247 252 303 325 267  14 101 209 258 318  53  97 119 120
## [289] 193 215 226 236 277 322 106 128 189 219 280  96 176 217 256 306 148 208
## [307] 238 288 320 329 227 301 319  98 287 330 231 232 262 268 269 186 167 266
## [325] 234 271 310 321 326 311 305

mydata_rounded[order(mydata_rounded$duration),]

##       id  days duration    sex age piq viq
## 15  4802    36        0   Male  62  88  97
## 38  6870    22        0   Male  42  84  95
## 39  6914    43        0   Male  62  85  90
## 40  6937    18        0 Female  21  94  81
## 42  7120    39        0   Male  70  84  86
## 43  7309    31        0 Female  51  85  95
## 44  7321    23        0   Male  26  84  83
## 45  7548    31        0   Male  24 108 106
## 62  4315    63        0   Male  38 107 130
## 73  5204    71        0   Male  59  97 107
## 87  6340    71        0   Male  19  76  72
## 88  6564    69        0   Male  34  67  74
## 89  6614    57        0   Male  45  80 101
## 91  6795    55        0   Male  31  87 104
## 94  7271    55        0   Male  42 100  95
## 114 5916    84        0 Female  27  93  73
## 117 7221    98        0   Male  64  74  79
## 123 4933   134        0   Male  18  69  83
## 138 6214   112        0   Male  60  65  74
## 139 6253   128        0 Female  46 104 112
## 142 6834   123        0   Male  31  72  75
## 144 2849   151        0   Male  20  51  86
## 154 4802   142        0   Male  62 101 117
## 160 5642   162        0   Male  66  89 103
## 191 2790   211        0   Male  49  89  99
## 194 4933   226        0   Male  18  79  86
## 201 5916   205        0 Female  27  92  76
## 222 5600   232        0   Male  49  75  81
## 225 5204   299        0   Male  59  99 105
## 229 5397   328        0 Female  63 121 108
## 230 6214   318        0   Male  60  78  82
## 237 3226   444        0   Male  27  76  64
## 260 6226   438        0   Male  37  84  92
## 263 6614   362        0   Male  45  88 106
## 272 2498   615        0 Female  17  86 113
## 274 2849   642        0   Male  20  76  98
## 276 3226   683        0   Male  27  89  78
## 286  405   986        0   Male  21  66 116
## 289 2849  1040        0   Male  20  91 103
## 291 3226  1123        0   Male  27  88  81
## 292 4864   936        0 Female  54 119 131
## 298 5642  1143        0   Male  66 104 109
## 302 7061   923        0   Male  37  74  81
## 304 2527  1294        0   Male  17  93 104
## 314 3768  1916        0   Male  19  69  80
## 331 5964 11038        0   Male  13  71  73
## 3   3547    40        1   Male  56  95 116
## 13  4705    18        1 Female  22 127 109
## 18  5129    26        1   Male  25  77  89
## 20  5162    33        1   Male  25 118 101
## 23  5253    29        1   Male  33 104 105
## 27  5680    17        1   Male  28  84  90
## 28  5699    26        1 Female  34  95 108
## 31  5754    36        1   Male  16  87  86
## 33  6122    29        1   Male  56  95 103
## 34  6163    21        1   Male  19 112 106
## 37  6859    27        1   Male  34  74  79
## 41  6977    30        1   Male  36  97  94
## 50  3277    51        1   Male  37 104  96
## 57  3919    58        1   Male  30  99  95
## 72  5192    60        1   Male  59  87  97
## 75  5280    83        1   Male  49  78  88
## 76  5289    52        1   Male  49  84  85
## 80  5568    64        1 Female  52  75  79
## 95  7371    55        1   Male  57  80  88
## 104 4962    63        1 Female  25  69  67
## 107 5253    86        1   Male  33 106 128
## 111 5837    82        1 Female  33  82 110
## 153 4705   146        1 Female  22 133 111
## 158 5162   144        1   Male  25 130 118
## 161 5699   138        1 Female  34 110 107
## 181 5192   179        1   Male  59  93 105
## 182 5505   171        1   Male  65  95  93
## 185 5680   184        1   Male  28  84  90
## 195 4962   210        1 Female  25  71  70
## 202 6122   212        1   Male  56 109 117
## 203 6136   216        1   Male  33  92  89
## 204 6175   278        1   Male  51  99  98
## 211 5837   242        1 Female  33  93 105
## 216 4342   263        1   Male  44  79  91
## 233 1836   375        1   Male  47 101 108
## 239 4342   432        1   Male  44  92 107
## 246 5289   417        1   Male  49  83  83
## 249 5505   527        1   Male  65 104  87
## 254 5680   403        1   Male  28  94  93
## 270 1624   604        1   Male  20  97  85
## 281 5253   591        1   Male  33 114 124
## 283 6059   794        1 Female  17  71  76
## 294 5568  1114        1 Female  52  81  82
## 300 5837   962        1 Female  33 109 110
## 35  6179    22        2   Male  38  89  95
## 58  4094    50        2   Male  20  79  93
## 79  5474    65        2 Female  29  95  86
## 82  5581    65        2   Male  26  85  95
## 93  7084    54        2   Male  37  87  93
## 125 5085   117        2   Male  49  67  71
## 163 5804   159        2 Female  29 102 107
## 166 6664   164        2   Male  25  66  73
## 172 4094   177        2   Male  20  89 102
## 183 5581   176        2   Male  26  96 110
## 213 1892   276        2   Male  22  87 107
## 218 5085   269        2   Male  49  65  77
## 221 5474   280        2 Female  29  99  91
## 251 5581   378        2   Male  26  95  95
## 257 5804   354        2 Female  29 122 105
## 293 5474  1100        2 Female  29  94  88
## 296 5581  1113        2   Male  26  99  96
## 309 5085  1512        2   Male  49  75  75
## 6   3790    13        3   Male  57  76  69
## 9   4253    40        3   Male  23 115 110
## 24  5298    30        3   Male  23  87  86
## 48  2761    40        3 Female  24  98 112
## 60  4183    42        3   Male  26  98 116
## 74  5238    44        3   Male  45  99 103
## 83  5628    51        3 Female  30  81  85
## 86  6314    58        3   Male  17  80  99
## 129 5298   107        3   Male  23 117 112
## 141 6665   119        3 Female  23 106  94
## 159 5238   150        3   Male  45 117 126
## 165 6314   140        3   Male  17  87  96
## 173 4253   175        3   Male  23 114 118
## 205 6228   174        3 Female  32 114 108
## 264 6665   368        3 Female  23 100  92
## 282 5628   609        3 Female  30  89  78
## 284 6228   662        3 Female  32 128 111
## 1   3358    30        4   Male  21  87  89
## 16  4941    46        4 Female  19  69  88
## 21  5174    38        4 Female  37  87  99
## 61  4189    69        4 Female  29  75  86
## 66  4696    54        4   Male  47 101 112
## 116 7173    84        4   Male  25  72  75
## 124 4941   131        4 Female  19  96  96
## 134 5896   126        4 Female  27  50  74
## 137 6173   125        4   Male  35  94  97
## 140 6433   120        4   Male  24 100 103
## 152 4696   150        4   Male  47 120 120
## 170 3358   175        4   Male  21  97  97
## 192 4189   202        4 Female  29  81  90
## 206 7173   210        4   Male  25  79  78
## 317 4696  1769        4   Male  47 105 124
## 7   3807    37        5   Male  25  74  77
## 17  4983    33        5   Male  38 102 117
## 19  5154    35        5   Male  22  82  95
## 55  3655    57        5 Female  22  90 103
## 84  6154    43        5 Female  23  74  80
## 92  7080    64        5 Female  77  76 106
## 127 5154   120        5   Male  22  89 109
## 156 4983   146        5   Male  38 107 123
## 243 4983   398        5   Male  38 121 132
## 5   3728    19        6   Male  30  67  73
## 56  3762    48        6   Male  20  85  93
## 147 3728   151        6   Male  30  96 105
## 8   3808    31        7   Male  28  91 110
## 10  4356    31        7   Male  21  86  83
## 25  5640    34        7   Male  26  93 113
## 26  5668    27        7   Male  41  72  79
## 36  6671    30        7 Female  28  71  82
## 65  4678    63        7   Male  47  96  95
## 70  5009    50        7   Male  24  61 104
## 71  5014    46        7 Female  24  75  90
## 81  5580    56        7   Male  18  86  95
## 126 5111   107        7   Male  22  71  80
## 130 5339   119        7   Male  22  87  82
## 133 5494   111        7   Male  55  86  86
## 135 5901   115        7   Male  22 112 116
## 151 4678   143        7   Male  47  98 107
## 157 5014   151        7 Female  24  97 110
## 171 3808   165        7   Male  28  94 111
## 178 5009   174        7   Male  24  77 103
## 179 5111   177        7   Male  22  72  81
## 188 6671   184        7 Female  28  91  92
## 198 5668   219        7   Male  41  76  90
## 220 5339   271        7   Male  22  94  89
## 228 4678   340        7   Male  47 108 119
## 244 5111   442        7   Male  22  77  86
## 250 5580   369        7   Male  18  96 107
## 253 5668   390        7   Male  41  92  92
## 279 5014   637        7 Female  24 101 114
## 295 5580  1087        7   Male  18 106  98
## 307 5009  1537        7   Male  24  76 112
## 308 5014  1523        7 Female  24 105 114
## 315 4356  2000        7   Male  21 104  91
## 324 3808  2434        7   Male  28 105 111
## 11  4384    35        8   Male  36  76  90
## 22  5208    31        8 Female  21  97  90
## 29  5713    36        8   Male  16  89  97
## 32  5776    26        8   Male  17  71  88
## 122 4902   102        8   Male  16  87  77
## 162 5713   144        8   Male  16 100  99
## 196 5208   193        8 Female  21 133 111
## 242 4902   397        8   Male  16  92  86
## 259 5841   415        8   Male  27  82  83
## 299 5713  1016        8   Male  16 126 106
## 30  5736    18        9   Male  16  89  86
## 47  2600  3333        9   Male  44  86  80
## 49  3237    65        9   Male  50  67  67
## 52  3359    59        9 Female  57  84  91
## 99  3844    73        9   Male  26  79  94
## 169 3237   189        9   Male  50  79  82
## 327 2600  3337        9   Male  44 101  84
## 4   3592    13       10   Male  62  59  73
## 68  4837    42       10   Male  20  83  88
## 100 4725   124       10   Male  33  93  97
## 118 2453   120       10   Male  37  63  99
## 132 5414   105       10 Female  40  93 104
## 224 4725   286       10   Male  33 105  94
## 323 3592  2569       10   Male  62  76  93
## 12  4542    22       11 Female  22  71  89
## 121 4542   121       11 Female  22  86 114
## 168 1085   159       11   Male  31 103  97
## 240 4542   431       11 Female  22  98 114
## 69  4996    51       12   Male  43  77  78
## 85  6180    59       12   Male  21  67  84
## 105 5125    78       12   Male  18  94 118
## 131 5387   109       12   Male  22  85 112
## 180 5125   173       12   Male  18 106 119
## 187 6180   177       12   Male  21  81  94
## 245 5125   510       12   Male  18 112 125
## 248 5387   480       12   Male  22  94 116
## 146 3051   131       13   Male  37  68  79
## 212 6247   228       13   Male  42  77  80
## 261 6247   389       13   Male  42  82  80
## 285 6247   616       13   Male  42  85  82
## 46  2364    41       14   Male  26  84  94
## 54  3544    32       14   Male  55  81  98
## 59  4133    34       14   Male  20  70  88
## 63  4482    58       14 Female  18  86 103
## 77  5456    48       14   Male  41  80 101
## 78  5458    44       14   Male  34  84  95
## 90  6686    44       14 Female  38  90 100
## 102 4807    64       14 Female  48  74  74
## 109 5534    87       14   Male  29  75  82
## 110 5712    88       14   Male  22  70  68
## 115 6410    80       14   Male  32  85  98
## 149 4133   133       14   Male  20  82  94
## 155 4807   139       14 Female  48  80  78
## 164 5818   125       14   Male  35  72  91
## 190 2646   187       14   Male  23  97  97
## 197 5456   193       14   Male  41  87 110
## 199 5712   192       14   Male  22  87  85
## 223 2826   290       14   Male  23  94 108
## 235 2646   438       14   Male  23  98  94
## 255 5712   365       14   Male  22  98  86
## 273 2826   636       14   Male  23 111 101
## 278 4807   532       14 Female  48  84  82
## 312 2826  1809       14   Male  23 104 108
## 328 3835  4933       14   Male  26  91  88
## 265  781   714       15   Male  30  85  85
## 2   3535    16       17   Male  55  95  77
## 64  4638    20       17   Male  21  82  72
## 143 1176   146       17 Female  20  65  98
## 150 4661   135       17 Female  31  84  93
## 174 4638   140       17   Male  21  89  78
## 207 1176   216       17 Female  20  74 100
## 241 4661   374       17 Female  31  93  95
## 297 5617  1113       17   Male  20  78  87
## 316 4638  1779       17   Male  21  92  76
## 51  3346    44       18 Female  57  79  85
## 67  4755    24       18   Male  28 105 102
## 136 6135    96       18   Male  27  66 105
## 145 2882   141       18   Male  19  84  85
## 175 4755   128       18   Male  28 105 109
## 214 2882   262       18   Male  19  94  90
## 313 2882  1716       18   Male  19 100 103
## 275 3032   525       20   Male  17  79  87
## 290 3032   884       20   Male  17  87  93
## 103 4892    62       21   Male  22  76  88
## 108 5386    78       21   Male  21  78  93
## 112 5879    75       21   Male  26  80 105
## 113 5893    71       21   Male  23  65  90
## 177 4892   148       21   Male  22 106 110
## 184 5599   148       21   Male  19  72  81
## 200 5893   200       21   Male  23  65  89
## 210 5386   241       21   Male  21  80  94
## 247 5386   436       21   Male  21  90 103
## 252 5599   443       21   Male  19  78  80
## 303  651  1491       21   Male  22  71  94
## 325  651  3412       21   Male  22  68  92
## 267 1157   810       23   Male  17  97  84
## 14  4744    15       25   Male  58  82  85
## 101 4744    65       25   Male  58 105 119
## 209 4744   217       25   Male  58 108 118
## 258 5811   431       25   Male  80  78  80
## 318 4744  1743       25   Male  58  97 118
## 53  3373    39       28 Female  26  87  91
## 97  3058    56       28   Male  22  65  75
## 119 2653    97       28   Male  30  93 112
## 120 4218    82       28   Male  26  74  92
## 193 4775   180       28   Male  54  70  86
## 215 3058   236       28   Male  22  85  88
## 226 6498   270       28   Male  24  82 101
## 236 2653   352       28   Male  30 105 126
## 277 4218   814       28   Male  26  99  96
## 322 2653  2191       28   Male  30 117 129
## 106 5222    63       30   Male  23  77  85
## 128 5222    93       30   Male  23  77  91
## 189 2124   173       30   Male  31  76 106
## 219 5222   247       30   Male  23  88  85
## 280 5222   690       30   Male  23  81  90
## 96  2569    49       35   Male  19  50 101
## 176 4865   142       35   Male  58  84 103
## 217 4865   240       35   Male  58  93 105
## 256 5772   412       35   Male  26 102 104
## 306 4865  1363       35   Male  58  88 104
## 148 3913    96       42 Female  24  56  80
## 208 3467   186       42   Male  25  53  69
## 238 3467   333       42   Male  25  68  74
## 288 1075   907       42 Female  27  63  64
## 320 1075  2259       42 Female  27  78  79
## 329 2773  7631       42   Male   7  88 103
## 227 2081   185       43   Male  18  77  97
## 301 6140  1077       44 Female  21  65  88
## 319 6140  1742       44 Female  21  67  87
## 98  3645    43       45   Male  27  72  90
## 287  626   870       55   Male  20  80  85
## 330 5142 11628       57   Male  16 101  95
## 231 7034   280       60   Male  23  78  80
## 232 1493   453       60   Male  18  59  81
## 262 6468   513       60   Male  43  99  94
## 268 1493   684       60   Male  18  66  75
## 269 1611   511       60   Male  23  69 107
## 186 5782   108       68 Female  20  69  85
## 167 1048    85       94   Male  20  63  82
## 266 1048   576       94   Male  20  91  96
## 234 1939   295      130   Male  28  67 117
## 271 1939   562      130   Male  28  85 111
## 310 1939  1926      130   Male  28  95 108
## 321 1939  3111      130   Male  28  88 111
## 326 1939  3864      130   Male  28  88 105
## 311 2662  1569      180   Male  28  90 101
## 305 2638  1093      255   Male  17  78  84

#Assigning a new dataset with rounded and ordered observations

mydata_rounded1 <- mydata_rounded[order(mydata_rounded$duration),]

#creating factors

mydata_rounded1$sexf <- factor(mydata_rounded1$sex, levels = c ("0", "1"), labels = c("male", "female") )

#Presenting statistic descripiton

library(pastecs)

## 
## Attaching package: 'pastecs'

## The following objects are masked from 'package:dplyr':
## 
##     first, last

round(stat.desc(mydata_rounded1))

##                   id    days duration sex   age   piq   viq sexf
## nbr.val          331     331      331  NA   331   331   331   NA
## nbr.null           0       0       46  NA     0     0     0   NA
## nbr.na             0       0        0  NA     0     0     0   NA
## min              405      13        0  NA     7    50    64   NA
## max             7548   11628      255  NA    80   133   132   NA
## range           7143   11615      255  NA    73    83    68   NA
## sum          1567184  159493     4732  NA 10542 28981 31433   NA
## median          5085     150        7  NA    27    87    94   NA
## mean            4735     482       14  NA    32    88    95   NA
## SE.mean           83      62        1  NA     1     1     1   NA
## CI.mean.0.95     164     123        3  NA     2     2     2   NA
## var          2293522 1292958      678  NA   193   229   197   NA
## std.dev         1514    1137       26  NA    14    15    14   NA
## coef.var           0       2        2  NA     0     0     0   NA

Days: the median value of days is 5085. The mean value of days is 4735. The min value is 13. Age:the median value of age is 27.The mean value of age is 32.The min value is 7 Piq:the median value of piq is 87.The mean value of piq is 88.The min value is 50 Viq:the median value of viq is 94.The mean value of viq is 95.The min value is 64 Duration:the median value of duration is 7.The mean value of duration is 482.The min value is 0.

#Creating a scatterplot of observations, by variables age and piq.

library(car)
scatterplot(x=mydata_rounded1$age,
            y=mydata_rounded1$piq, 
            xlab = "age of patients", 
            ylab = "performance IQ", 
            smooth = FALSE)

By examining the scatterplots we can conclude the following:

Comparing the correlation between age and performance IQ is not correlated.

#Graphing the distribution of the variables using a histogram

hist(mydata_rounded1$piq, 
     main = "Distribution of the variable Performance IQ", 
     ylab = "Frequency", 
     xlab = "piq", 
     breaks = seq(40, 150, by =1 ), 
     right = FALSE)

We can observe that the distribution of performance IQ is unimodal and esymmetrical. Furthermore the average is skewed lower than the global average of 100 or slightly positively skewed. this could be due to the brain trauma induced by the coma.

#Task 2:

#install.packages("readxl")
library(readxl)
business <- read_excel("~/Desktop/imb exam2024/Task 2/Business School.xlsx")

head(business)

## # A tibble: 6 × 9
##   `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade`
##          <dbl> <chr>                          <dbl>       <dbl>
## 1            1 Business                        68.4        90.2
## 2            2 Computer Science                70.2        68.7
## 3            3 Finance                         76.4        83.3
## 4            4 Business                        82.6        88.7
## 5            5 Finance                         76.9        75.4
## 6            6 Computer Science                83.3        82.1
## # ℹ 5 more variables: `Work Experience` <chr>, `Employability (Before)` <dbl>,
## #   `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>

#Creating factor variables for further analysis.

business$undergradf <- factor(business$`Undergrad Degree`, levels = c ("Business","Computer Science", "Finance", "Engineering", "Art"), labels = c("Business","Computer Science", "Finance", "Engineering", "Art") )

#Using ggplot to graphically present the undergrad degrees variable.

library(ggplot2) 
ggplot(business, aes(x = undergradf)) + geom_bar(fill = "skyblue", color = "black") + xlab("Undergraduate Degree") + ylab("Count") + ggtitle("Distribution of Undergraduate Degrees") + theme_minimal()

The most common degree is a degree in business.

#presenting descriptive statistics

library(pastecs)
round(stat.desc(business$`Annual Salary`))

##      nbr.val     nbr.null       nbr.na          min          max        range 
##          100            0            0        20000       340000       320000 
##          sum       median         mean      SE.mean CI.mean.0.95          var 
##     10905800       103500       109058         4150         8235   1722373475 
##      std.dev     coef.var 
##        41501            0

# Display descriptive statistics for Annual Salary
summary_stats <- summary(business$`Annual Salary`)
print(summary_stats)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20000   87125  103500  109058  124000  340000

#Graphically presenting the distribution of the variable Annual Salary with ggplot.

# Plot the histogram of Annual Salary
ggplot(business, aes(x = `Annual Salary`)) + 
  geom_histogram(binwidth = 5000, fill = "skyblue", color = "black") + 
  xlab("Annual Salary") + 
  ylab("Count") + 
  ggtitle("Distribution of Annual Salary") + 
  theme_minimal()

#Testing the following hypothesis: 𝐻0: 𝜇MBA Grade = 74.

t.test(business$"MBA Grade",
       mu = 74,
       alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  business$"MBA Grade"
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055

With the 95% confidence interval we can reject the null hypothesis, and conclude that the mean MBA Grade is not equal to 74. We can reject the null hypothesis because the p-value is smaller than 0,05. Furthermore the actual mean amounts to 76.04055.

#Task 3:

Import the dataset Apartments.xlsx

Apartments <- read_excel("~/Desktop/R Take Home Exam 2024 5/Task 3/Apartments.xlsx")

Description:

Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes

head(Apartments)

## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

Change categorical variables into factors.

Apartments$Balconyf <- factor(Apartments$Balcony, levels = c ("0", "1"), labels = c("No", "Yes") )
Apartments$Parkingf <- factor(Apartments$Parking, levels = c ("0", "1"), labels = c("No", "Yes") )

Apartments1 <- Apartments

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

t.test(Apartments1$Price,
       mu = 1900,
       alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  Apartments1$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

We can reject the null hypothisis and conclude with a 95% confidence interval that the mean (price) of the population is not equal to 1900.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(formula = Price ~ Age, data = Apartments1)
summary(fit1)

## 
## Call:
## lm(formula = Price ~ Age, data = Apartments1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

cor(Apartments1$Price, Apartments1$Age)

## [1] -0.230255

The estimation for b1 is b1 = -8.975. With every additional year on average price of the apartment per m2 will decrease by 8.975, under the condition everything else remains unchanged.

The Min residual (-623.9) indicates that some predicted prices were 623.9 units higher than the actual price. The Max residual (776.1) indicates that some predicted prices were 776.1 units lower than the actual price. The Median residual (-69.8) suggests that the model tends to slightly overestimate apartment prices.

The negative sign of the correlation level (-0.230255) means that as the age of apartments increases, the price of apartments tends to decrease. This suggests that older apartments generally have lower prices than newer ones.

The determination coefficient, R² = 0.05302, means that 5.3% of the variation in the price of apartments can be explained by the age of the apartments. The remaining 94.7% of the variation in apartment prices is due to other factors that are not explained by the age of the apartment

The F-statistic tests whether the overall regression model is a good fit for the data. In this case, the F-statistic is 4.647, with a p-value of 0.03401. Since the p-value is below 0.05, we can conclude that the overall model (using age to predict price) is statistically significant.

Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

scatterplotMatrix(Apartments1[,c(1,2,3)], 
                  smooth = FALSE)

Multicolinearity seems to not be present.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Apartments1$Price ~ Apartments1$Age + Apartments1$Distance, data = Apartments1)
summary(fit2)

## 
## Call:
## lm(formula = Apartments1$Price ~ Apartments1$Age + Apartments1$Distance, 
##     data = Apartments1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          2460.101     76.632   32.10  < 2e-16 ***
## Apartments1$Age        -7.934      3.225   -2.46    0.016 *  
## Apartments1$Distance  -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

correlation_matrix <- cor(Apartments1[, c("Price", "Age", "Distance")], use = "complete.obs")

#The Min residual is -603.23, which means some actual prices are much lower than predicted. #The Max residual is 689.58, meaning some actual prices are much higher than predicted. #The Median residual is -85.68, suggesting that, on average, the model slightly overestimates apartment prices.

#Apartments$Age (-7.934): For each additional year of apartment age, the price is expected to decrease by 7.93 units. Since the p-value (0.016) is below 0.05, this effect is statistically significant.

#Apartments$Distance (-20.667): For each additional unit of distance, the price decreases by 20.67 units. This is also statistically significant with a very small p-value of 6.18e-11.

#Age: The p-value of 0.016 suggests that the relationship between age and price is statistically significant at the 5% significance level.

#Distance: The p-value of 6.18e-11 indicates a very strong statistical significance for the relationship between distance and price.

#R-squared (0.4396): This tells us that 43.96% of the variability in apartment prices is explained by the model (age and distance).

#Adjusted R-squared (0.4259): This is the R-squared adjusted for the number of predictors. It’s slightly lower than the R-squared and indicates that 42.59% of the variability in prices is explained by the model, accounting for the number of predictors.

Chech the multicolinearity with VIF statistics. Explain the findings.

fit2 <-lm(formula = Price ~ Age + Distance, data = Apartments1)
vif (fit2)

##      Age Distance 
## 1.001845 1.001845

The VIF statistic is less than 5 so we can conclude there is no multicoliniarity.

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

#install.packages("olsrr")
Apartments1$StdResid <- round(rstandard(fit2), 3) 
Apartments1$CooksD <- round(cooks.distance(fit2), 3)

hist(Apartments1$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

shapiro.test(Apartments1$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  Apartments1$StdResid
## W = 0.95303, p-value = 0.003645

The p value is less than 5%,however there is one outlier that skews the conclusions about the regression. We have to remove the outlier in order to have valuable conclusions.

hist(Apartments1$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

Because we still had residuals on the Cooks distance, therefore we have to remove the variables that had the Cooks distance higher than 4/N.

#Exclugin the Cooks Distance residuals, and assigning them to excluded_cases

excluded_cases <- which(Apartments1$CooksD > (4/85))

#Assigning the clean dataset to Apartments 2

Apartments2<-Apartments1[c(-22,-33,-38,-53,-55) , ]
head(Apartments2[order(Apartments2$StdResid),], 3)

## # A tibble: 3 × 9
##     Age Distance Price Parking Balcony Balconyf Parkingf StdResid CooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>       <dbl>  <dbl>
## 1    12       14  1650       0       1 Yes      No          -1.50  0.013
## 2    12       14  1650       0       0 No       No          -1.50  0.013
## 3    13        8  1800       0       0 No       No          -1.38  0.012

head(Apartments2[order(-Apartments1$CooksD),], 6)

## # A tibble: 6 × 9
##     Age Distance Price Parking Balcony Balconyf Parkingf StdResid CooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>       <dbl>  <dbl>
## 1    18       12  1770       1       0 No       Yes        -1.05   0.005
## 2    30        9  2130       0       0 No       No          0.333  0.001
## 3    14       16  1660       0       1 Yes      No         -1.26   0.008
## 4     8        2  2820       1       0 No       Yes         1.66   0.037
## 5    22       23  1730       0       1 Yes      No         -0.283  0.001
## 6    24        1  2220       1       1 Yes      Yes        -0.103  0

#Printed the head of the highest residuals, in order to re-check for recurring residuals. I removed the outlier from the data, and will have to run the regression again to have an unbiased regression result. I removed outliers that had a cooks value greater than 4/N

Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings.

Apartments1$StdfittedValues <- scale(fit2$fitted.values)

library(car)
scatterplot(y = Apartments1$StdResid, x= Apartments1$StdfittedValues,
            ylab = "Standarized residuals",
            xlab = "Standarized fitted values",
            boxplots = FALSE, 
            regLine = FALSE, 
            smooth = FALSE)

We checked the possibility of homo/heteroskedasticity with the help of a scatter plot between standardized residuals and (standardized) fitted values. According to the scatterplot the Variance does not appear to be growing, the conclusion is that homoskedasticity is not violated.

Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings.

hist(Apartments2$StdResid,
     xlab = "Standarized Residuals",
     ylab = "Frequency",
     main = "Histogram of standarized residuals")

shapiro.test(Apartments2$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  Apartments2$StdResid
## W = 0.93418, p-value = 0.0004761

H0: The variables are normally distributed H1: The variables are not normally distributed

p-value is less than alpha = 5% (p-value = 0.0004761), so we can reject H0 and accept H1

The assumption of normal distribution of errors is violated, meaning t-distribution may not be correct.

Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients.

fit2 <- lm(formula = Price ~ Age + Distance, data = Apartments)
vif(fit2)

##      Age Distance 
## 1.001845 1.001845

mean(vif(fit2))

## [1] 1.001845

summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

#b1 is estimated with -7.934, which implies that With every additional year on average price of an apartment/m2 will decrease by 7.934 p<0.05, under the condition that all other factors remain unchanged.

#b2 is estimated with -20.667, which implies that With every additional km from city center on average apartment’s pric/m2 will decrease by -20.667 assuming the p<0.001, under the condition that all other factors remain unchanged.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(formula =Price ~ Age + Distance + Apartments$Balconyf + Apartments$Parkingf,
           data = Apartments)
fit3

## 
## Call:
## lm(formula = Price ~ Age + Distance + Apartments$Balconyf + Apartments$Parkingf, 
##     data = Apartments)
## 
## Coefficients:
##            (Intercept)                     Age                Distance  
##               2301.667                  -6.799                 -18.045  
## Apartments$BalconyfYes  Apartments$ParkingfYes  
##                  1.935                 196.168

#The coefficient for the Balcony variable is 1.935, indicating that having a balcony adds 1.94 units to the apartment price compared to apartments that do not have a balcony. #The coefficient for the Parking variable is 196.168, meaning that having parking increases the apartment price by 196.17 units compared to apartments without parking.

With function anova check if model fit3 fits data better than model fit2.

anova(fit2, fit3)

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Apartments$Balconyf + Apartments$Parkingf
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     82 6720983                              
## 2     80 5991088  2    729894 4.8732 0.01007 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The anova() test shows that adding the variables Balcony and Parking in fit3 significantly improves the model compared to fit2, which only includes Age and Distance. With a p-value of 0.01007, we reject the null hypothesis that the simpler model (fit2) fits just as well as the more complex model (fit3). This means fit3 is statistically a better fit for the data than fit2.

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + Apartments$Balconyf + Apartments$Parkingf, 
##     data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2301.667     94.271  24.415  < 2e-16 ***
## Age                      -6.799      3.110  -2.186  0.03172 *  
## Distance                -18.045      2.758  -6.543 5.28e-09 ***
## Apartments$BalconyfYes    1.935     60.014   0.032  0.97436    
## Apartments$ParkingfYes  196.168     62.868   3.120  0.00251 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11

Parking: Given Age, Distance and Balcony Factor, the apartments with parking space have on average price higher by 196 € given the p-value is smaller than 0.05 (p < α, (α=5%)) Balcony: p=0.97, p> α, (α=5%): we can not reject the null hypothesis (if a balcony is available in the apartment it effects the price of it)

F-statistics: H0: ρ squared = 0 H1: ρ squared > 0

p-value < 0.001 therefore we reject H0 at p< 0.001. It means there is a linear correlation between Price and at least one explanatory variable. The model is appropriate due to the r squared being higher than 0

Save fitted values and claculate the residual for apartment ID2.

Apartments$Fitted <- fitted(fit2)
Apartments$FittedRes <- Apartments$Fitted - Apartments$Price
print(Apartments1[2,])

## # A tibble: 1 × 10
##     Age Distance Price Parking Balcony Balconyf Parkingf StdResid CooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>       <dbl>  <dbl>
## 1    18        1  2800       1       0 No       Yes          1.78   0.03
## # ℹ 1 more variable: StdfittedValues <dbl[,1]>