#Task one.
#loading library
library(car)
## Loading required package: carData
data(package = .packages(all.available = TRUE))
#asigning wong as mydata
mydata <-Wong
#Showing summary of the dataset, and printing the head
summary(Wong )
## id days duration sex age
## Min. : 405 Min. : 13.0 Min. : 0.0 Female: 71 Min. : 6.513
## 1st Qu.:3808 1st Qu.: 59.0 1st Qu.: 1.0 Male :260 1st Qu.:21.737
## Median :5085 Median : 150.0 Median : 7.0 Median :26.877
## Mean :4735 Mean : 481.9 Mean : 14.3 Mean :31.853
## 3rd Qu.:5712 3rd Qu.: 416.0 3rd Qu.: 16.0 3rd Qu.:40.923
## Max. :7548 Max. :11628.0 Max. :255.0 Max. :80.033
## piq viq
## Min. : 50.00 Min. : 64.00
## 1st Qu.: 77.00 1st Qu.: 85.00
## Median : 87.00 Median : 94.00
## Mean : 87.56 Mean : 94.96
## 3rd Qu.: 97.00 3rd Qu.:105.00
## Max. :133.00 Max. :132.00
head(Wong )
## id days duration sex age piq viq
## 1 3358 30 4 Male 20.67077 87 89
## 2 3535 16 17 Male 55.28816 95 77
## 3 3547 40 1 Male 55.91513 95 116
## 4 3592 13 10 Male 61.66461 59 73
## 5 3728 19 6 Male 30.12731 67 73
## 6 3790 13 3 Male 57.06229 76 69
Description of variables: The data frame has 331 rows (observations) and 7 columns(variables). The observations are longitudinal data on recovery of IQ after comas of varying duration for 200 subjects.
This data frame contains the following columns:
id-patient ID number.
days-number of days post coma at which IQs were measured.
duration-duration of the coma in days.
sex-a factor with levels Female and Male.
age-in years at the time of injury.
piq-performance (i.e., mathematical) IQ.
viq-verbal IQ.
#Rounding data
library (dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata_rounded <- mydata %>% mutate(across(c('age'), round, 0))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `across(c("age"), round, 0)`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
##
## # Previously
## across(a:b, mean, na.rm = TRUE)
##
## # Now
## across(a:b, \(x) mean(x, na.rm = TRUE))
head(mydata_rounded)
## id days duration sex age piq viq
## 1 3358 30 4 Male 21 87 89
## 2 3535 16 17 Male 55 95 77
## 3 3547 40 1 Male 56 95 116
## 4 3592 13 10 Male 62 59 73
## 5 3728 19 6 Male 30 67 73
## 6 3790 13 3 Male 57 76 69
#Ordering dataset, by variable “duration”
order(mydata_rounded$duration)
## [1] 15 38 39 40 42 43 44 45 62 73 87 88 89 91 94 114 117 123
## [19] 138 139 142 144 154 160 191 194 201 222 225 229 230 237 260 263 272 274
## [37] 276 286 289 291 292 298 302 304 314 331 3 13 18 20 23 27 28 31
## [55] 33 34 37 41 50 57 72 75 76 80 95 104 107 111 153 158 161 181
## [73] 182 185 195 202 203 204 211 216 233 239 246 249 254 270 281 283 294 300
## [91] 35 58 79 82 93 125 163 166 172 183 213 218 221 251 257 293 296 309
## [109] 6 9 24 48 60 74 83 86 129 141 159 165 173 205 264 282 284 1
## [127] 16 21 61 66 116 124 134 137 140 152 170 192 206 317 7 17 19 55
## [145] 84 92 127 156 243 5 56 147 8 10 25 26 36 65 70 71 81 126
## [163] 130 133 135 151 157 171 178 179 188 198 220 228 244 250 253 279 295 307
## [181] 308 315 324 11 22 29 32 122 162 196 242 259 299 30 47 49 52 99
## [199] 169 327 4 68 100 118 132 224 323 12 121 168 240 69 85 105 131 180
## [217] 187 245 248 146 212 261 285 46 54 59 63 77 78 90 102 109 110 115
## [235] 149 155 164 190 197 199 223 235 255 273 278 312 328 265 2 64 143 150
## [253] 174 207 241 297 316 51 67 136 145 175 214 313 275 290 103 108 112 113
## [271] 177 184 200 210 247 252 303 325 267 14 101 209 258 318 53 97 119 120
## [289] 193 215 226 236 277 322 106 128 189 219 280 96 176 217 256 306 148 208
## [307] 238 288 320 329 227 301 319 98 287 330 231 232 262 268 269 186 167 266
## [325] 234 271 310 321 326 311 305
mydata_rounded[order(mydata_rounded$duration),]
## id days duration sex age piq viq
## 15 4802 36 0 Male 62 88 97
## 38 6870 22 0 Male 42 84 95
## 39 6914 43 0 Male 62 85 90
## 40 6937 18 0 Female 21 94 81
## 42 7120 39 0 Male 70 84 86
## 43 7309 31 0 Female 51 85 95
## 44 7321 23 0 Male 26 84 83
## 45 7548 31 0 Male 24 108 106
## 62 4315 63 0 Male 38 107 130
## 73 5204 71 0 Male 59 97 107
## 87 6340 71 0 Male 19 76 72
## 88 6564 69 0 Male 34 67 74
## 89 6614 57 0 Male 45 80 101
## 91 6795 55 0 Male 31 87 104
## 94 7271 55 0 Male 42 100 95
## 114 5916 84 0 Female 27 93 73
## 117 7221 98 0 Male 64 74 79
## 123 4933 134 0 Male 18 69 83
## 138 6214 112 0 Male 60 65 74
## 139 6253 128 0 Female 46 104 112
## 142 6834 123 0 Male 31 72 75
## 144 2849 151 0 Male 20 51 86
## 154 4802 142 0 Male 62 101 117
## 160 5642 162 0 Male 66 89 103
## 191 2790 211 0 Male 49 89 99
## 194 4933 226 0 Male 18 79 86
## 201 5916 205 0 Female 27 92 76
## 222 5600 232 0 Male 49 75 81
## 225 5204 299 0 Male 59 99 105
## 229 5397 328 0 Female 63 121 108
## 230 6214 318 0 Male 60 78 82
## 237 3226 444 0 Male 27 76 64
## 260 6226 438 0 Male 37 84 92
## 263 6614 362 0 Male 45 88 106
## 272 2498 615 0 Female 17 86 113
## 274 2849 642 0 Male 20 76 98
## 276 3226 683 0 Male 27 89 78
## 286 405 986 0 Male 21 66 116
## 289 2849 1040 0 Male 20 91 103
## 291 3226 1123 0 Male 27 88 81
## 292 4864 936 0 Female 54 119 131
## 298 5642 1143 0 Male 66 104 109
## 302 7061 923 0 Male 37 74 81
## 304 2527 1294 0 Male 17 93 104
## 314 3768 1916 0 Male 19 69 80
## 331 5964 11038 0 Male 13 71 73
## 3 3547 40 1 Male 56 95 116
## 13 4705 18 1 Female 22 127 109
## 18 5129 26 1 Male 25 77 89
## 20 5162 33 1 Male 25 118 101
## 23 5253 29 1 Male 33 104 105
## 27 5680 17 1 Male 28 84 90
## 28 5699 26 1 Female 34 95 108
## 31 5754 36 1 Male 16 87 86
## 33 6122 29 1 Male 56 95 103
## 34 6163 21 1 Male 19 112 106
## 37 6859 27 1 Male 34 74 79
## 41 6977 30 1 Male 36 97 94
## 50 3277 51 1 Male 37 104 96
## 57 3919 58 1 Male 30 99 95
## 72 5192 60 1 Male 59 87 97
## 75 5280 83 1 Male 49 78 88
## 76 5289 52 1 Male 49 84 85
## 80 5568 64 1 Female 52 75 79
## 95 7371 55 1 Male 57 80 88
## 104 4962 63 1 Female 25 69 67
## 107 5253 86 1 Male 33 106 128
## 111 5837 82 1 Female 33 82 110
## 153 4705 146 1 Female 22 133 111
## 158 5162 144 1 Male 25 130 118
## 161 5699 138 1 Female 34 110 107
## 181 5192 179 1 Male 59 93 105
## 182 5505 171 1 Male 65 95 93
## 185 5680 184 1 Male 28 84 90
## 195 4962 210 1 Female 25 71 70
## 202 6122 212 1 Male 56 109 117
## 203 6136 216 1 Male 33 92 89
## 204 6175 278 1 Male 51 99 98
## 211 5837 242 1 Female 33 93 105
## 216 4342 263 1 Male 44 79 91
## 233 1836 375 1 Male 47 101 108
## 239 4342 432 1 Male 44 92 107
## 246 5289 417 1 Male 49 83 83
## 249 5505 527 1 Male 65 104 87
## 254 5680 403 1 Male 28 94 93
## 270 1624 604 1 Male 20 97 85
## 281 5253 591 1 Male 33 114 124
## 283 6059 794 1 Female 17 71 76
## 294 5568 1114 1 Female 52 81 82
## 300 5837 962 1 Female 33 109 110
## 35 6179 22 2 Male 38 89 95
## 58 4094 50 2 Male 20 79 93
## 79 5474 65 2 Female 29 95 86
## 82 5581 65 2 Male 26 85 95
## 93 7084 54 2 Male 37 87 93
## 125 5085 117 2 Male 49 67 71
## 163 5804 159 2 Female 29 102 107
## 166 6664 164 2 Male 25 66 73
## 172 4094 177 2 Male 20 89 102
## 183 5581 176 2 Male 26 96 110
## 213 1892 276 2 Male 22 87 107
## 218 5085 269 2 Male 49 65 77
## 221 5474 280 2 Female 29 99 91
## 251 5581 378 2 Male 26 95 95
## 257 5804 354 2 Female 29 122 105
## 293 5474 1100 2 Female 29 94 88
## 296 5581 1113 2 Male 26 99 96
## 309 5085 1512 2 Male 49 75 75
## 6 3790 13 3 Male 57 76 69
## 9 4253 40 3 Male 23 115 110
## 24 5298 30 3 Male 23 87 86
## 48 2761 40 3 Female 24 98 112
## 60 4183 42 3 Male 26 98 116
## 74 5238 44 3 Male 45 99 103
## 83 5628 51 3 Female 30 81 85
## 86 6314 58 3 Male 17 80 99
## 129 5298 107 3 Male 23 117 112
## 141 6665 119 3 Female 23 106 94
## 159 5238 150 3 Male 45 117 126
## 165 6314 140 3 Male 17 87 96
## 173 4253 175 3 Male 23 114 118
## 205 6228 174 3 Female 32 114 108
## 264 6665 368 3 Female 23 100 92
## 282 5628 609 3 Female 30 89 78
## 284 6228 662 3 Female 32 128 111
## 1 3358 30 4 Male 21 87 89
## 16 4941 46 4 Female 19 69 88
## 21 5174 38 4 Female 37 87 99
## 61 4189 69 4 Female 29 75 86
## 66 4696 54 4 Male 47 101 112
## 116 7173 84 4 Male 25 72 75
## 124 4941 131 4 Female 19 96 96
## 134 5896 126 4 Female 27 50 74
## 137 6173 125 4 Male 35 94 97
## 140 6433 120 4 Male 24 100 103
## 152 4696 150 4 Male 47 120 120
## 170 3358 175 4 Male 21 97 97
## 192 4189 202 4 Female 29 81 90
## 206 7173 210 4 Male 25 79 78
## 317 4696 1769 4 Male 47 105 124
## 7 3807 37 5 Male 25 74 77
## 17 4983 33 5 Male 38 102 117
## 19 5154 35 5 Male 22 82 95
## 55 3655 57 5 Female 22 90 103
## 84 6154 43 5 Female 23 74 80
## 92 7080 64 5 Female 77 76 106
## 127 5154 120 5 Male 22 89 109
## 156 4983 146 5 Male 38 107 123
## 243 4983 398 5 Male 38 121 132
## 5 3728 19 6 Male 30 67 73
## 56 3762 48 6 Male 20 85 93
## 147 3728 151 6 Male 30 96 105
## 8 3808 31 7 Male 28 91 110
## 10 4356 31 7 Male 21 86 83
## 25 5640 34 7 Male 26 93 113
## 26 5668 27 7 Male 41 72 79
## 36 6671 30 7 Female 28 71 82
## 65 4678 63 7 Male 47 96 95
## 70 5009 50 7 Male 24 61 104
## 71 5014 46 7 Female 24 75 90
## 81 5580 56 7 Male 18 86 95
## 126 5111 107 7 Male 22 71 80
## 130 5339 119 7 Male 22 87 82
## 133 5494 111 7 Male 55 86 86
## 135 5901 115 7 Male 22 112 116
## 151 4678 143 7 Male 47 98 107
## 157 5014 151 7 Female 24 97 110
## 171 3808 165 7 Male 28 94 111
## 178 5009 174 7 Male 24 77 103
## 179 5111 177 7 Male 22 72 81
## 188 6671 184 7 Female 28 91 92
## 198 5668 219 7 Male 41 76 90
## 220 5339 271 7 Male 22 94 89
## 228 4678 340 7 Male 47 108 119
## 244 5111 442 7 Male 22 77 86
## 250 5580 369 7 Male 18 96 107
## 253 5668 390 7 Male 41 92 92
## 279 5014 637 7 Female 24 101 114
## 295 5580 1087 7 Male 18 106 98
## 307 5009 1537 7 Male 24 76 112
## 308 5014 1523 7 Female 24 105 114
## 315 4356 2000 7 Male 21 104 91
## 324 3808 2434 7 Male 28 105 111
## 11 4384 35 8 Male 36 76 90
## 22 5208 31 8 Female 21 97 90
## 29 5713 36 8 Male 16 89 97
## 32 5776 26 8 Male 17 71 88
## 122 4902 102 8 Male 16 87 77
## 162 5713 144 8 Male 16 100 99
## 196 5208 193 8 Female 21 133 111
## 242 4902 397 8 Male 16 92 86
## 259 5841 415 8 Male 27 82 83
## 299 5713 1016 8 Male 16 126 106
## 30 5736 18 9 Male 16 89 86
## 47 2600 3333 9 Male 44 86 80
## 49 3237 65 9 Male 50 67 67
## 52 3359 59 9 Female 57 84 91
## 99 3844 73 9 Male 26 79 94
## 169 3237 189 9 Male 50 79 82
## 327 2600 3337 9 Male 44 101 84
## 4 3592 13 10 Male 62 59 73
## 68 4837 42 10 Male 20 83 88
## 100 4725 124 10 Male 33 93 97
## 118 2453 120 10 Male 37 63 99
## 132 5414 105 10 Female 40 93 104
## 224 4725 286 10 Male 33 105 94
## 323 3592 2569 10 Male 62 76 93
## 12 4542 22 11 Female 22 71 89
## 121 4542 121 11 Female 22 86 114
## 168 1085 159 11 Male 31 103 97
## 240 4542 431 11 Female 22 98 114
## 69 4996 51 12 Male 43 77 78
## 85 6180 59 12 Male 21 67 84
## 105 5125 78 12 Male 18 94 118
## 131 5387 109 12 Male 22 85 112
## 180 5125 173 12 Male 18 106 119
## 187 6180 177 12 Male 21 81 94
## 245 5125 510 12 Male 18 112 125
## 248 5387 480 12 Male 22 94 116
## 146 3051 131 13 Male 37 68 79
## 212 6247 228 13 Male 42 77 80
## 261 6247 389 13 Male 42 82 80
## 285 6247 616 13 Male 42 85 82
## 46 2364 41 14 Male 26 84 94
## 54 3544 32 14 Male 55 81 98
## 59 4133 34 14 Male 20 70 88
## 63 4482 58 14 Female 18 86 103
## 77 5456 48 14 Male 41 80 101
## 78 5458 44 14 Male 34 84 95
## 90 6686 44 14 Female 38 90 100
## 102 4807 64 14 Female 48 74 74
## 109 5534 87 14 Male 29 75 82
## 110 5712 88 14 Male 22 70 68
## 115 6410 80 14 Male 32 85 98
## 149 4133 133 14 Male 20 82 94
## 155 4807 139 14 Female 48 80 78
## 164 5818 125 14 Male 35 72 91
## 190 2646 187 14 Male 23 97 97
## 197 5456 193 14 Male 41 87 110
## 199 5712 192 14 Male 22 87 85
## 223 2826 290 14 Male 23 94 108
## 235 2646 438 14 Male 23 98 94
## 255 5712 365 14 Male 22 98 86
## 273 2826 636 14 Male 23 111 101
## 278 4807 532 14 Female 48 84 82
## 312 2826 1809 14 Male 23 104 108
## 328 3835 4933 14 Male 26 91 88
## 265 781 714 15 Male 30 85 85
## 2 3535 16 17 Male 55 95 77
## 64 4638 20 17 Male 21 82 72
## 143 1176 146 17 Female 20 65 98
## 150 4661 135 17 Female 31 84 93
## 174 4638 140 17 Male 21 89 78
## 207 1176 216 17 Female 20 74 100
## 241 4661 374 17 Female 31 93 95
## 297 5617 1113 17 Male 20 78 87
## 316 4638 1779 17 Male 21 92 76
## 51 3346 44 18 Female 57 79 85
## 67 4755 24 18 Male 28 105 102
## 136 6135 96 18 Male 27 66 105
## 145 2882 141 18 Male 19 84 85
## 175 4755 128 18 Male 28 105 109
## 214 2882 262 18 Male 19 94 90
## 313 2882 1716 18 Male 19 100 103
## 275 3032 525 20 Male 17 79 87
## 290 3032 884 20 Male 17 87 93
## 103 4892 62 21 Male 22 76 88
## 108 5386 78 21 Male 21 78 93
## 112 5879 75 21 Male 26 80 105
## 113 5893 71 21 Male 23 65 90
## 177 4892 148 21 Male 22 106 110
## 184 5599 148 21 Male 19 72 81
## 200 5893 200 21 Male 23 65 89
## 210 5386 241 21 Male 21 80 94
## 247 5386 436 21 Male 21 90 103
## 252 5599 443 21 Male 19 78 80
## 303 651 1491 21 Male 22 71 94
## 325 651 3412 21 Male 22 68 92
## 267 1157 810 23 Male 17 97 84
## 14 4744 15 25 Male 58 82 85
## 101 4744 65 25 Male 58 105 119
## 209 4744 217 25 Male 58 108 118
## 258 5811 431 25 Male 80 78 80
## 318 4744 1743 25 Male 58 97 118
## 53 3373 39 28 Female 26 87 91
## 97 3058 56 28 Male 22 65 75
## 119 2653 97 28 Male 30 93 112
## 120 4218 82 28 Male 26 74 92
## 193 4775 180 28 Male 54 70 86
## 215 3058 236 28 Male 22 85 88
## 226 6498 270 28 Male 24 82 101
## 236 2653 352 28 Male 30 105 126
## 277 4218 814 28 Male 26 99 96
## 322 2653 2191 28 Male 30 117 129
## 106 5222 63 30 Male 23 77 85
## 128 5222 93 30 Male 23 77 91
## 189 2124 173 30 Male 31 76 106
## 219 5222 247 30 Male 23 88 85
## 280 5222 690 30 Male 23 81 90
## 96 2569 49 35 Male 19 50 101
## 176 4865 142 35 Male 58 84 103
## 217 4865 240 35 Male 58 93 105
## 256 5772 412 35 Male 26 102 104
## 306 4865 1363 35 Male 58 88 104
## 148 3913 96 42 Female 24 56 80
## 208 3467 186 42 Male 25 53 69
## 238 3467 333 42 Male 25 68 74
## 288 1075 907 42 Female 27 63 64
## 320 1075 2259 42 Female 27 78 79
## 329 2773 7631 42 Male 7 88 103
## 227 2081 185 43 Male 18 77 97
## 301 6140 1077 44 Female 21 65 88
## 319 6140 1742 44 Female 21 67 87
## 98 3645 43 45 Male 27 72 90
## 287 626 870 55 Male 20 80 85
## 330 5142 11628 57 Male 16 101 95
## 231 7034 280 60 Male 23 78 80
## 232 1493 453 60 Male 18 59 81
## 262 6468 513 60 Male 43 99 94
## 268 1493 684 60 Male 18 66 75
## 269 1611 511 60 Male 23 69 107
## 186 5782 108 68 Female 20 69 85
## 167 1048 85 94 Male 20 63 82
## 266 1048 576 94 Male 20 91 96
## 234 1939 295 130 Male 28 67 117
## 271 1939 562 130 Male 28 85 111
## 310 1939 1926 130 Male 28 95 108
## 321 1939 3111 130 Male 28 88 111
## 326 1939 3864 130 Male 28 88 105
## 311 2662 1569 180 Male 28 90 101
## 305 2638 1093 255 Male 17 78 84
#Assigning a new dataset with rounded and ordered observations
mydata_rounded1 <- mydata_rounded[order(mydata_rounded$duration),]
#creating factors
mydata_rounded1$sexf <- factor(mydata_rounded1$sex, levels = c ("0", "1"), labels = c("male", "female") )
#Presenting statistic descripiton
library(pastecs)
##
## Attaching package: 'pastecs'
## The following objects are masked from 'package:dplyr':
##
## first, last
round(stat.desc(mydata_rounded1))
## id days duration sex age piq viq sexf
## nbr.val 331 331 331 NA 331 331 331 NA
## nbr.null 0 0 46 NA 0 0 0 NA
## nbr.na 0 0 0 NA 0 0 0 NA
## min 405 13 0 NA 7 50 64 NA
## max 7548 11628 255 NA 80 133 132 NA
## range 7143 11615 255 NA 73 83 68 NA
## sum 1567184 159493 4732 NA 10542 28981 31433 NA
## median 5085 150 7 NA 27 87 94 NA
## mean 4735 482 14 NA 32 88 95 NA
## SE.mean 83 62 1 NA 1 1 1 NA
## CI.mean.0.95 164 123 3 NA 2 2 2 NA
## var 2293522 1292958 678 NA 193 229 197 NA
## std.dev 1514 1137 26 NA 14 15 14 NA
## coef.var 0 2 2 NA 0 0 0 NA
Days: the median value of days is 5085. The mean value of days is 4735. The min value is 13. Age:the median value of age is 27.The mean value of age is 32.The min value is 7 Piq:the median value of piq is 87.The mean value of piq is 88.The min value is 50 Viq:the median value of viq is 94.The mean value of viq is 95.The min value is 64 Duration:the median value of duration is 7.The mean value of duration is 482.The min value is 0.
#Creating a scatterplot of observations, by variables age and piq.
library(car)
scatterplot(x=mydata_rounded1$age,
y=mydata_rounded1$piq,
xlab = "age of patients",
ylab = "performance IQ",
smooth = FALSE)
By examining the scatterplots we can conclude the following:
Comparing the correlation between age and performance IQ is not correlated.
#Graphing the distribution of the variables using a histogram
hist(mydata_rounded1$piq,
main = "Distribution of the variable Performance IQ",
ylab = "Frequency",
xlab = "piq",
breaks = seq(40, 150, by =1 ),
right = FALSE)
We can observe that the distribution of performance IQ is unimodal and esymmetrical. Furthermore the average is skewed lower than the global average of 100 or slightly positively skewed. this could be due to the brain trauma induced by the coma.
#Task 2:
#install.packages("readxl")
library(readxl)
business <- read_excel("~/Desktop/imb exam2024/Task 2/Business School.xlsx")
head(business)
## # A tibble: 6 × 9
## `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade`
## <dbl> <chr> <dbl> <dbl>
## 1 1 Business 68.4 90.2
## 2 2 Computer Science 70.2 68.7
## 3 3 Finance 76.4 83.3
## 4 4 Business 82.6 88.7
## 5 5 Finance 76.9 75.4
## 6 6 Computer Science 83.3 82.1
## # ℹ 5 more variables: `Work Experience` <chr>, `Employability (Before)` <dbl>,
## # `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>
#Creating factor variables for further analysis.
business$undergradf <- factor(business$`Undergrad Degree`, levels = c ("Business","Computer Science", "Finance", "Engineering", "Art"), labels = c("Business","Computer Science", "Finance", "Engineering", "Art") )
#Using ggplot to graphically present the undergrad degrees variable.
library(ggplot2)
ggplot(business, aes(x = undergradf)) + geom_bar(fill = "skyblue", color = "black") + xlab("Undergraduate Degree") + ylab("Count") + ggtitle("Distribution of Undergraduate Degrees") + theme_minimal()
The most common degree is a degree in business.
#presenting descriptive statistics
library(pastecs)
round(stat.desc(business$`Annual Salary`))
## nbr.val nbr.null nbr.na min max range
## 100 0 0 20000 340000 320000
## sum median mean SE.mean CI.mean.0.95 var
## 10905800 103500 109058 4150 8235 1722373475
## std.dev coef.var
## 41501 0
# Display descriptive statistics for Annual Salary
summary_stats <- summary(business$`Annual Salary`)
print(summary_stats)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20000 87125 103500 109058 124000 340000
#Graphically presenting the distribution of the variable Annual Salary with ggplot.
# Plot the histogram of Annual Salary
ggplot(business, aes(x = `Annual Salary`)) +
geom_histogram(binwidth = 5000, fill = "skyblue", color = "black") +
xlab("Annual Salary") +
ylab("Count") +
ggtitle("Distribution of Annual Salary") +
theme_minimal()
#Testing the following hypothesis: 𝐻0: 𝜇MBA Grade = 74.
t.test(business$"MBA Grade",
mu = 74,
alternative = "two.sided")
##
## One Sample t-test
##
## data: business$"MBA Grade"
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
With the 95% confidence interval we can reject the null hypothesis, and conclude that the mean MBA Grade is not equal to 74. We can reject the null hypothesis because the p-value is smaller than 0,05. Furthermore the actual mean amounts to 76.04055.
#Task 3:
Apartments <- read_excel("~/Desktop/R Take Home Exam 2024 5/Task 3/Apartments.xlsx")
Description:
head(Apartments)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Apartments$Balconyf <- factor(Apartments$Balcony, levels = c ("0", "1"), labels = c("No", "Yes") )
Apartments$Parkingf <- factor(Apartments$Parking, levels = c ("0", "1"), labels = c("No", "Yes") )
Apartments1 <- Apartments
t.test(Apartments1$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: Apartments1$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
We can reject the null hypothisis and conclude with a 95% confidence interval that the mean (price) of the population is not equal to 1900.
fit1 <- lm(formula = Price ~ Age, data = Apartments1)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = Apartments1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
cor(Apartments1$Price, Apartments1$Age)
## [1] -0.230255
The estimation for b1 is b1 = -8.975. With every additional year on average price of the apartment per m2 will decrease by 8.975, under the condition everything else remains unchanged.
The Min residual (-623.9) indicates that some predicted prices were 623.9 units higher than the actual price. The Max residual (776.1) indicates that some predicted prices were 776.1 units lower than the actual price. The Median residual (-69.8) suggests that the model tends to slightly overestimate apartment prices.
The negative sign of the correlation level (-0.230255) means that as the age of apartments increases, the price of apartments tends to decrease. This suggests that older apartments generally have lower prices than newer ones.
The determination coefficient, R² = 0.05302, means that 5.3% of the variation in the price of apartments can be explained by the age of the apartments. The remaining 94.7% of the variation in apartment prices is due to other factors that are not explained by the age of the apartment
The F-statistic tests whether the overall regression model is a good fit for the data. In this case, the F-statistic is 4.647, with a p-value of 0.03401. Since the p-value is below 0.05, we can conclude that the overall model (using age to predict price) is statistically significant.
scatterplotMatrix(Apartments1[,c(1,2,3)],
smooth = FALSE)
Multicolinearity seems to not be present.
fit2 <- lm(Apartments1$Price ~ Apartments1$Age + Apartments1$Distance, data = Apartments1)
summary(fit2)
##
## Call:
## lm(formula = Apartments1$Price ~ Apartments1$Age + Apartments1$Distance,
## data = Apartments1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Apartments1$Age -7.934 3.225 -2.46 0.016 *
## Apartments1$Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
correlation_matrix <- cor(Apartments1[, c("Price", "Age", "Distance")], use = "complete.obs")
#The Min residual is -603.23, which means some actual prices are much lower than predicted. #The Max residual is 689.58, meaning some actual prices are much higher than predicted. #The Median residual is -85.68, suggesting that, on average, the model slightly overestimates apartment prices.
#Apartments$Age (-7.934): For each additional year of apartment age, the price is expected to decrease by 7.93 units. Since the p-value (0.016) is below 0.05, this effect is statistically significant.
#Apartments$Distance (-20.667): For each additional unit of distance, the price decreases by 20.67 units. This is also statistically significant with a very small p-value of 6.18e-11.
#Age: The p-value of 0.016 suggests that the relationship between age and price is statistically significant at the 5% significance level.
#Distance: The p-value of 6.18e-11 indicates a very strong statistical significance for the relationship between distance and price.
#R-squared (0.4396): This tells us that 43.96% of the variability in apartment prices is explained by the model (age and distance).
#Adjusted R-squared (0.4259): This is the R-squared adjusted for the number of predictors. It’s slightly lower than the R-squared and indicates that 42.59% of the variability in prices is explained by the model, accounting for the number of predictors.
fit2 <-lm(formula = Price ~ Age + Distance, data = Apartments1)
vif (fit2)
## Age Distance
## 1.001845 1.001845
The VIF statistic is less than 5 so we can conclude there is no multicoliniarity.
#install.packages("olsrr")
Apartments1$StdResid <- round(rstandard(fit2), 3)
Apartments1$CooksD <- round(cooks.distance(fit2), 3)
hist(Apartments1$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(Apartments1$StdResid)
##
## Shapiro-Wilk normality test
##
## data: Apartments1$StdResid
## W = 0.95303, p-value = 0.003645
The p value is less than 5%,however there is one outlier that skews the conclusions about the regression. We have to remove the outlier in order to have valuable conclusions.
hist(Apartments1$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
Because we still had residuals on the Cooks distance, therefore we have to remove the variables that had the Cooks distance higher than 4/N.
#Exclugin the Cooks Distance residuals, and assigning them to excluded_cases
excluded_cases <- which(Apartments1$CooksD > (4/85))
#Assigning the clean dataset to Apartments 2
Apartments2<-Apartments1[c(-22,-33,-38,-53,-55) , ]
head(Apartments2[order(Apartments2$StdResid),], 3)
## # A tibble: 3 × 9
## Age Distance Price Parking Balcony Balconyf Parkingf StdResid CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 12 14 1650 0 1 Yes No -1.50 0.013
## 2 12 14 1650 0 0 No No -1.50 0.013
## 3 13 8 1800 0 0 No No -1.38 0.012
head(Apartments2[order(-Apartments1$CooksD),], 6)
## # A tibble: 6 × 9
## Age Distance Price Parking Balcony Balconyf Parkingf StdResid CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 18 12 1770 1 0 No Yes -1.05 0.005
## 2 30 9 2130 0 0 No No 0.333 0.001
## 3 14 16 1660 0 1 Yes No -1.26 0.008
## 4 8 2 2820 1 0 No Yes 1.66 0.037
## 5 22 23 1730 0 1 Yes No -0.283 0.001
## 6 24 1 2220 1 1 Yes Yes -0.103 0
#Printed the head of the highest residuals, in order to re-check for recurring residuals. I removed the outlier from the data, and will have to run the regression again to have an unbiased regression result. I removed outliers that had a cooks value greater than 4/N
Apartments1$StdfittedValues <- scale(fit2$fitted.values)
library(car)
scatterplot(y = Apartments1$StdResid, x= Apartments1$StdfittedValues,
ylab = "Standarized residuals",
xlab = "Standarized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
We checked the possibility of homo/heteroskedasticity with the help of a
scatter plot between standardized residuals and (standardized) fitted
values. According to the scatterplot the Variance does not appear to be
growing, the conclusion is that homoskedasticity is not violated.
hist(Apartments2$StdResid,
xlab = "Standarized Residuals",
ylab = "Frequency",
main = "Histogram of standarized residuals")
shapiro.test(Apartments2$StdResid)
##
## Shapiro-Wilk normality test
##
## data: Apartments2$StdResid
## W = 0.93418, p-value = 0.0004761
H0: The variables are normally distributed H1: The variables are not normally distributed
p-value is less than alpha = 5% (p-value = 0.0004761), so we can reject H0 and accept H1
The assumption of normal distribution of errors is violated, meaning t-distribution may not be correct.
fit2 <- lm(formula = Price ~ Age + Distance, data = Apartments)
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
#b1 is estimated with -7.934, which implies that With every additional year on average price of an apartment/m2 will decrease by 7.934 p<0.05, under the condition that all other factors remain unchanged.
#b2 is estimated with -20.667, which implies that With every additional km from city center on average apartment’s pric/m2 will decrease by -20.667 assuming the p<0.001, under the condition that all other factors remain unchanged.
fit3 <- lm(formula =Price ~ Age + Distance + Apartments$Balconyf + Apartments$Parkingf,
data = Apartments)
fit3
##
## Call:
## lm(formula = Price ~ Age + Distance + Apartments$Balconyf + Apartments$Parkingf,
## data = Apartments)
##
## Coefficients:
## (Intercept) Age Distance
## 2301.667 -6.799 -18.045
## Apartments$BalconyfYes Apartments$ParkingfYes
## 1.935 196.168
#The coefficient for the Balcony variable is 1.935, indicating that having a balcony adds 1.94 units to the apartment price compared to apartments that do not have a balcony. #The coefficient for the Parking variable is 196.168, meaning that having parking increases the apartment price by 196.17 units compared to apartments without parking.
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Apartments$Balconyf + Apartments$Parkingf
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 82 6720983
## 2 80 5991088 2 729894 4.8732 0.01007 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The anova() test shows that adding the variables Balcony and Parking in fit3 significantly improves the model compared to fit2, which only includes Age and Distance. With a p-value of 0.01007, we reject the null hypothesis that the simpler model (fit2) fits just as well as the more complex model (fit3). This means fit3 is statistically a better fit for the data than fit2.
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Apartments$Balconyf + Apartments$Parkingf,
## data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## Apartments$BalconyfYes 1.935 60.014 0.032 0.97436
## Apartments$ParkingfYes 196.168 62.868 3.120 0.00251 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
Parking: Given Age, Distance and Balcony Factor, the apartments with parking space have on average price higher by 196 € given the p-value is smaller than 0.05 (p < α, (α=5%)) Balcony: p=0.97, p> α, (α=5%): we can not reject the null hypothesis (if a balcony is available in the apartment it effects the price of it)
F-statistics: H0: ρ squared = 0 H1: ρ squared > 0
p-value < 0.001 therefore we reject H0 at p< 0.001. It means there is a linear correlation between Price and at least one explanatory variable. The model is appropriate due to the r squared being higher than 0
Apartments$Fitted <- fitted(fit2)
Apartments$FittedRes <- Apartments$Fitted - Apartments$Price
print(Apartments1[2,])
## # A tibble: 1 × 10
## Age Distance Price Parking Balcony Balconyf Parkingf StdResid CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 18 1 2800 1 0 No Yes 1.78 0.03
## # ℹ 1 more variable: StdfittedValues <dbl[,1]>