Installed and loaded the necessary packages.
library(readr) # useful for importing data
library(magrittr) #useful for pipe operator
library(tidyr) #useful for tidying data
library(dplyr) #useful for data manipulation
library(Hmisc) #to replace the missing values
library(outliers) #useful for finding the outliers
library(lubridate) #useful for date transformation
library(car) # useful for plotting qqPlot
The objective of this report is to find an open data set with a Creative Commons licence and apply the data preprocessing concepts acquired in the Data Preprocessing course. The sequence of steps followed for data preprocessing is described below.
The data set contains the official details of the 11538 athletes who competed in the 2016 Olympic Games in Rio de Janeiro, together with details of their respective countries. The data set was collected from Kaggle (https://www.kaggle.com/rio2016/olympic-games).
Two data files were considered: ‘athletes.csv’ with 11 columns and ‘countries.csv’ with 4 columns.
The athletes data set contains the following columns:
id: Athlete ID
name: Athlete name
nationality: IOC country code of the athlete
sex: Athlete gender
dob: Athlete date of birth
height: Athlete height
weight: Athlete weight
sport: The sport in which the athlete competes
gold: Number of gold medals
silver: Number of silver medals
bronze: Number of bronze medals
The countries table contains the following attributes:
country: Country name
code: IOC country code
population: Total population of the country
gdp_per_capita: GDP per capita of the country
Imported the data sets using the base R function read.csv() with stringsAsFactors = FALSE, which prevents the automatic conversion of strings to factors. Using the merge() function, the athletes table was joined with the countries table on their common attribute, the IOC country code (nationality in athletes, code in countries), to form the olympics dataset. The first few rows of each table were displayed using the head() function.
athletes <- read.csv("athletes.csv",stringsAsFactors = FALSE)
head(athletes)
countries <- read.csv("countries.csv",stringsAsFactors = FALSE)
head(countries)
olympics <- merge(athletes,countries,by.x = "nationality",by.y = "code")
head(olympics)
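The merged data shown further down has 11464 rows, while the data set description mentions 11538 athletes. This is consistent with merge() performing an inner join by default, which drops athletes whose nationality code has no match in the countries table. The sketch below illustrates the behaviour on made-up toy tables (athletes_toy, countries_toy and the code "XXX" are hypothetical, not part of the original data) and shows how all.x = TRUE would keep the unmatched rows instead.

```r
# Toy tables standing in for athletes.csv and countries.csv (made-up values)
athletes_toy <- data.frame(id = 1:4,
                           nationality = c("AFG", "ALB", "XXX", "AFG"),
                           stringsAsFactors = FALSE)
countries_toy <- data.frame(code = c("AFG", "ALB"),
                            country = c("Afghanistan", "Albania"),
                            stringsAsFactors = FALSE)

# Default merge() is an inner join: the athlete with code "XXX" is dropped
inner <- merge(athletes_toy, countries_toy, by.x = "nationality", by.y = "code")

# all.x = TRUE keeps every athlete and fills country columns with NA where unmatched
left <- merge(athletes_toy, countries_toy, by.x = "nationality", by.y = "code",
              all.x = TRUE)

# Codes present in the athletes table but missing from the countries table
unmatched <- setdiff(athletes_toy$nationality, countries_toy$code)

nrow(inner)
nrow(left)
unmatched
```

Checking the unmatched codes before merging makes it explicit which athletes would be lost to the join.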
Summarised the types of variables and other statistics using the ‘summary()’ function. The data structure of each variable was inspected using the ‘str()’ function. Certain data types were captured incorrectly, so dob was converted from character to Date, and the sex, nationality and country variables were converted to factors.
summary(olympics)
 nationality              id                name               sex                dob
 Length:11464       Min.   :    18347   Length:11464       Length:11464       Length:11464
 Class :character   1st Qu.:245072255   Class :character   Class :character   Class :character
 Mode  :character   Median :499491784   Mode  :character   Mode  :character   Mode  :character
                    Mean   :499588457
                    3rd Qu.:753180230
                    Max.   :999987786
     height          weight           sport                gold             silver            bronze
 Min.   :1.210   Min.   : 31.00   Length:11464       Min.   :0.00000   Min.   :0.00000   Min.   :0.00000
 1st Qu.:1.690   1st Qu.: 60.00   Class :character   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000
 Median :1.760   Median : 70.00   Mode  :character   Median :0.00000   Median :0.00000   Median :0.00000
 Mean   :1.766   Mean   : 72.04                      Mean   :0.05792   Mean   :0.05714   Mean   :0.06132
 3rd Qu.:1.840   3rd Qu.: 81.00                      3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000
 Max.   :2.210   Max.   :170.00                      Max.   :5.00000   Max.   :2.00000   Max.   :2.00000
 NA's   :325     NA's   :654
   country            population        gdp_per_capita
 Length:11464       Min.   :1.022e+04   Min.   :   277.1
 Class :character   1st Qu.:1.035e+07   1st Qu.:  8027.7
 Mode  :character   Median :4.342e+07   Median : 18002.2
                    Mean   :1.240e+08   Mean   : 24858.1
                    3rd Qu.:8.141e+07   3rd Qu.: 41313.3
                    Max.   :1.371e+09   Max.   :101450.0
                    NA's   :83          NA's   :509
str(olympics)
'data.frame': 11464 obs. of 14 variables:
$ nationality : chr "AFG" "AFG" "AFG" "ALB" ...
$ id : int 103254143 289057786 152408417 539021692 103773001 324317073 345441615 915002256 997380920 690873472 ...
$ name : chr "Kamia Yousufi" "Mohammad Tawfiq Bakhshi" "Abdul Wahab Zahiri" "Nikol Merizaj" ...
$ sex : chr "female" "male" "male" "female" ...
$ dob : chr "5/20/96" "3/11/86" "5/27/92" "8/7/98" ...
$ height : num 1.65 1.81 1.75 1.8 1.6 1.95 1.7 1.59 1.93 1.9 ...
$ weight : int 55 99 68 65 52 86 69 45 87 100 ...
$ sport : chr "athletics" "judo" "athletics" "aquatics" ...
$ gold : int 0 0 0 0 0 0 0 0 0 0 ...
$ silver : int 0 0 0 0 0 0 0 0 0 0 ...
$ bronze : int 0 0 0 0 0 0 0 0 0 0 ...
$ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Albania" ...
$ population : int 32526562 32526562 32526562 2889167 2889167 2889167 2889167 2889167 2889167 39666519 ...
$ gdp_per_capita: num 594 594 594 3945 3945 ...
# dob is stored as month/day/two-digit year (e.g. "5/20/96"), so lubridate's mdy() parses it directly
olympics$dob <- mdy(olympics$dob)
olympics$sex <- as.factor(olympics$sex)
olympics$nationality <- as.factor(olympics$nationality)
olympics$country <- as.factor(olympics$country)
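One point worth checking after the date conversion: the raw dob strings carry two-digit years, and R's "%y" conversion maps 00 to 68 onto 20xx and 69 to 99 onto 19xx. Any athlete born before 1969 would therefore be parsed as born in the future. The sketch below, using base R only, shows one possible correction. The third date string and the choice of the opening day of the Games as reference_date are assumptions made for illustration, not part of the original analysis.

```r
# Two values taken from the str() output above, plus one hypothetical pre-1969 birth date
dob_raw <- c("5/20/96", "3/11/86", "7/4/54")

# "%y" maps 54 to 2054, which lies in the future
dob <- as.Date(dob_raw, format = "%m/%d/%y")

# Assumed reference date: the opening day of the Rio 2016 Games
reference_date <- as.Date("2016-08-05")

# Any birth date after the reference date must belong to the previous century
in_future <- !is.na(dob) & dob > reference_date
shifted <- as.POSIXlt(dob[in_future])
shifted$year <- shifted$year - 100
dob[in_future] <- as.Date(shifted)

format(dob)
```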
In order to tidy up the dataset, insignificant columns were removed. Using the subset() function, gdp_per_capita was removed from the dataset. On further analysis, it was found that the dataset does not need any structural reformation.
head(olympics)
olympics <- subset(olympics,select = -c(gdp_per_capita))
Created a new column, Total_Medals, holding the total number of medals received by each athlete, by summing the gold, silver and bronze medal counts using the mutate() function. Another column, population_interval, was created to give a better understanding of the population variable: using mutate() and case_when(), each population value was assigned to an interval. The factor() function was then used to convert population_interval into an ordered factor with descriptive labels.
The structure of the tidied dataset was checked using the str() function.
olympics <- olympics %>% mutate(Total_Medals=gold+silver+bronze)
olympics <- olympics %>% mutate (population_interval=case_when(
population>0 & population <50000000 ~"1",
population>=50000000 & population <100000000 ~"2",
population>=100000000& population <150000000 ~"3",
population>=150000000& population <200000000 ~"4",
population>=200000000& population <250000000 ~"5",
population>=250000000& population <300000000 ~"6",
population>=300000000& population <350000000 ~"7",
population>=350000000& population <2000000000 ~"8" ))
olympics$population_interval <-factor(olympics$population_interval,
levels = c("1","2","3","4","5","6","7","8"),
labels =c("0-50M","50M-100M","100M-150M","150M-200M",
"200M-250M","250M-300M","300M-350M","350M+"),
ordered = TRUE)
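As a possible alternative to the case_when() and factor() combination, base R's cut() can bin a numeric variable and label the bins in one call. The sketch below uses the same interval boundaries and labels as above. The population vector is illustrative only, and note two small differences from the original: the first bin includes 0 and the last bin is open-ended.

```r
# Illustrative population values (not the full data set)
population <- c(2889167, 32526562, 81413145, 321418820, 1371220000)

# Same boundaries as the case_when() call, with an open-ended last bin
breaks <- c(0, 50e6, 100e6, 150e6, 200e6, 250e6, 300e6, 350e6, Inf)
bin_labels <- c("0-50M", "50M-100M", "100M-150M", "150M-200M",
                "200M-250M", "250M-300M", "300M-350M", "350M+")

# right = FALSE makes each bin closed on the left, e.g. [50M, 100M)
population_interval <- cut(population, breaks = breaks, labels = bin_labels,
                           right = FALSE, ordered_result = TRUE)

population_interval
```

NA populations stay NA under cut(), just as they do under case_when() when no condition matches.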
str(olympics)
'data.frame': 11464 obs. of 15 variables:
$ nationality : Factor w/ 199 levels "AFG","ALB","ALG",..: 1 1 1 2 2 2 2 2 2 3 ...
$ id : int 103254143 289057786 152408417 539021692 103773001 324317073 345441615 915002256 997380920 690873472 ...
$ name : chr "Kamia Yousufi" "Mohammad Tawfiq Bakhshi" "Abdul Wahab Zahiri" "Nikol Merizaj" ...
$ sex : Factor w/ 2 levels "female","male": 1 2 2 1 1 2 2 1 2 2 ...
$ dob : Date, format: "1996-05-20" "1986-03-11" "1992-05-27" "1998-08-07" ...
$ height : num 1.65 1.81 1.75 1.8 1.6 1.95 1.7 1.59 1.93 1.9 ...
$ weight : int 55 99 68 65 52 86 69 45 87 100 ...
$ sport : chr "athletics" "judo" "athletics" "aquatics" ...
$ gold : int 0 0 0 0 0 0 0 0 0 0 ...
$ silver : int 0 0 0 0 0 0 0 0 0 0 ...
$ bronze : int 0 0 0 0 0 0 0 0 0 0 ...
$ country : Factor w/ 199 levels "Afghanistan",..: 1 1 1 2 2 2 2 2 2 3 ...
$ population : int 32526562 32526562 32526562 2889167 2889167 2889167 2889167 2889167 2889167 39666519 ...
$ Total_Medals : int 0 0 0 0 0 0 0 0 0 0 ...
$ population_interval: Ord.factor w/ 8 levels "0-50M"<"50M-100M"<..: 1 1 1 1 1 1 1 1 1 1 ...
In this step we scanned the dataset for missing values. Using the colSums() function, we found that a total of 5 columns contained missing values. To locate special values such as Inf, NaN and NA, we introduced a user-defined function, is.nullcheck(), and used it to find the affected row indices. All attributes except height and weight had very few missing values (less than 5% of the total samples).
To deal with the missing values in the height and weight columns, we split the dataset into olympics_m and olympics_f using the filter() function. We then checked the normality of both variables for each gender using qqPlot(), even though this is not strictly necessary according to the central limit theorem. Since height follows a normal distribution, the NAs in the height column were replaced with the mean height of the respective gender using the mutate() and group_by() functions. Since the weight distribution was right-skewed, the NAs in the weight column were replaced with the median weight of the respective gender, again using mutate() and group_by().
Once these substitutions were done, we removed the rows with NA values in any of the other fields using the na.omit() function. Finally, we checked for NAs using colSums() again to confirm that all NAs had been removed from the dataset.
par(mfrow=c(1,2))
colSums(is.na(olympics))
 nationality           id         name          sex          dob
           0            0            0            0            1
      height       weight        sport         gold       silver
         325          654            0            0            0
      bronze      country   population Total_Medals population_interval
           0            0           83            0                  83
is.nullcheck <- function(x){(is.infinite(x) | is.nan(x) | is.na(x))}
which(sapply(olympics$height, is.nullcheck))
[1] 111 398 423 433 434 519 604 610 650 674 675 689 752 787 942 943 948
[18] 949 952 954 957 958 959 961 965 972 974 975 976 977 978 1273 1275 1277
[35] 1278 1279 1280 1281 1282 1284 1350 1411 1423 1432 1435 1500 1670 1673 1765 1781 1783
[52] 1785 1793 1800 1801 1862 2129 2196 2200 2694 2695 2696 2794 2819 2877 3075 3443 3465
[69] 3480 3499 3539 3540 3542 3543 3544 3546 3548 3549 3551 3552 3553 3556 3742 3923 3936
[86] 3959 3972 3973 3987 3991 4003 4049 4484 4485 4486 4487 5351 5352 5353 5354 5355 5356
[103] 5360 5361 5363 5428 5459 5460 5461 5463 5464 5465 5505 5507 5509 5511 5778 5789 5797
[120] 5817 5839 5847 5852 5861 5875 5884 6111 6112 6113 6116 6141 6182 6431 6443 6461 6617
[137] 6884 6970 6976 6979 6993 7022 7037 7135 7200 7246 7273 7309 7310 7312 7313 7315 7316
[154] 7317 7319 7323 7324 7325 7327 7328 7329 7330 7335 7336 7337 7535 7536 7537 7538 7723
[171] 7738 7751 8072 8075 8113 8115 8116 8117 8118 8119 8120 8121 8122 8123 8124 8125 8126
[188] 8128 8129 8130 8132 8134 8135 8136 8137 8139 8140 8141 8142 8143 8144 8145 8148 8149
[205] 8150 8151 8152 8153 8154 8156 8157 8158 8161 8164 8166 8167 8168 8169 8170 8383 8389
[222] 8392 8447 8457 8458 8459 8462 8476 8517 8571 8652 8682 8689 8853 9067 9095 9146 9232
[239] 9237 9267 9294 9300 9320 9335 9342 9399 9420 9426 9453 9455 9456 9457 9459 9463 9464
[256] 9466 9467 9525 9532 9533 9534 9535 9605 9606 9607 9608 9835 9836 9837 9838 9839 10063
[273] 10064 10066 10069 10070 10151 10152 10153 10155 10226 10228 10232 10233 10278 10295 10419 10421 10422
[290] 10423 10424 10425 10426 10427 10428 10429 10430 10431 10432 10433 10434 10435 10436 10437 10439 10440
[307] 10441 10467 10545 10600 10791 10905 10985 11106 11144 11148 11149 11235 11256 11304 11335 11383 11416
[324] 11418 11426
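To compare columns against a threshold such as 5%, it can help to express the missing counts as a share of all rows. The sketch below applies the same is.nullcheck() idea column by column to a small made-up data frame (df and its values are hypothetical). It also shows that the check can be applied to a whole vector at once, without sapply().

```r
# Toy data frame standing in for the olympics data (made-up values)
df <- data.frame(height = c(1.65, NA, 1.75, 1.80),
                 weight = c(55, NA, NA, 65),
                 gold   = c(0, 1, 0, 0))

# Same check as in the report: TRUE for Inf, NaN or NA
is.nullcheck <- function(x) is.infinite(x) | is.nan(x) | is.na(x)

# Percentage of flagged values per column
missing_pct <- sapply(df, function(col) mean(is.nullcheck(col)) * 100)

# The function is already vectorised, so row indices can be found directly
missing_idx <- which(is.nullcheck(df$height))

missing_pct
missing_idx
```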
olympics_m <- olympics %>% filter(sex=="male")
olympics_f <- olympics %>% filter(sex=="female")
qqPlot(olympics_m$height,dist="norm",main=" Male Height")
[1] 5907 3033
qqPlot(olympics_f$height,dist="norm",main="Female Height")
[1] 3087 653
# A second group_by() replaces the first, so the data is grouped by sex alone
olympics <- olympics %>% group_by(sex) %>%
mutate(height=ifelse(is.na(height),mean(height,na.rm = TRUE),height))
which(sapply(olympics$height, is.nullcheck))
integer(0)
which(sapply(olympics$weight, is.nullcheck))
[1] 20 29 45 47 62 64 67 75 111 182 198 236 237 279 295 346 351
[18] 352 353 372 398 423 433 434 501 511 512 519 542 584 590 604 606 610
[35] 650 674 675 689 752 787 801 890 892 904 911 917 918 923 925 930 933
[52] 939 942 943 948 949 950 952 954 957 958 959 961 963 965 971 972 974
[69] 975 976 977 978 1140 1177 1228 1273 1275 1277 1278 1279 1280 1281 1282 1284 1306
[86] 1350 1352 1397 1411 1423 1432 1433 1435 1500 1522 1628 1653 1668 1670 1673 1757 1758
[103] 1765 1781 1783 1785 1793 1800 1801 1823 1826 1838 1862 1866 1921 1961 2093 2129 2196
[120] 2200 2208 2258 2300 2336 2361 2454 2468 2505 2538 2614 2625 2649 2658 2662 2673 2682
[137] 2687 2690 2694 2696 2718 2781 2819 2826 2830 2842 2867 2877 2956 2960 2973 2989 2994
[154] 3011 3014 3046 3048 3049 3057 3075 3077 3359 3375 3383 3392 3397 3402 3407 3441 3443
[171] 3445 3465 3480 3499 3510 3523 3539 3540 3542 3543 3544 3546 3548 3549 3551 3552 3553
[188] 3556 3564 3719 3742 3923 3936 3959 3991 4003 4049 4051 4075 4080 4096 4131 4217 4267
[205] 4268 4293 4317 4339 4388 4435 4475 4484 4485 4486 4487 4510 4526 4547 4565 4597 4618
[222] 4628 4636 4642 4710 4739 4780 4810 4817 4832 4836 4847 4849 4957 4971 4983 5029 5152
[239] 5306 5351 5352 5353 5354 5355 5356 5361 5362 5363 5428 5459 5460 5461 5463 5464 5505
[256] 5507 5509 5511 5580 5722 5724 5778 5789 5797 5801 5817 5839 5847 5852 5855 5861 5862
[273] 5875 5884 5890 5958 5976 5977 5982 5995 6003 6008 6026 6046 6111 6112 6113 6116 6133
[290] 6141 6159 6165 6182 6265 6282 6382 6383 6389 6443 6461 6492 6498 6508 6553 6617 6845
[307] 6850 6855 6856 6864 6875 6877 6884 6895 6906 6909 6910 6925 6965 6970 6987 7003 7022
[324] 7037 7039 7090 7135 7136 7184 7194 7200 7222 7246 7273 7309 7310 7312 7313 7315 7316
[341] 7317 7319 7323 7324 7325 7327 7328 7329 7330 7335 7336 7337 7359 7392 7408 7427 7428
[358] 7434 7437 7438 7440 7447 7466 7535 7536 7537 7538 7568 7595 7604 7626 7657 7660 7668
[375] 7670 7683 7686 7693 7697 7701 7717 7723 7724 7737 7738 7745 7751 7756 7760 7820 7846
[392] 7888 8027 8035 8037 8049 8050 8052 8055 8061 8064 8078 8079 8081 8082 8083 8085 8091
[409] 8093 8095 8098 8102 8112 8113 8114 8115 8116 8117 8118 8119 8120 8121 8122 8123 8124
[426] 8125 8126 8127 8128 8129 8130 8131 8132 8133 8134 8135 8136 8137 8139 8140 8141 8142
[443] 8143 8144 8145 8146 8147 8148 8149 8150 8151 8152 8153 8154 8155 8156 8157 8158 8160
[460] 8161 8162 8163 8164 8165 8166 8167 8168 8169 8170 8171 8172 8383 8389 8392 8400 8448
[477] 8452 8457 8458 8459 8462 8469 8476 8517 8558 8571 8595 8652 8682 8689 8849 8853 8909
[494] 8917 8927 9067 9095 9129 9146 9188 9200 9203 9232 9235 9237 9241 9267 9284 9294 9300
[511] 9305 9316 9320 9335 9342 9367 9384 9399 9419 9420 9426 9453 9455 9456 9457 9459 9463
[528] 9464 9466 9467 9494 9525 9532 9534 9605 9606 9607 9608 9835 9836 9837 9838 9839 10052
[545] 10063 10088 10093 10097 10111 10127 10134 10139 10148 10149 10150 10151 10152 10153 10155 10181 10183
[562] 10189 10217 10220 10226 10228 10232 10233 10240 10251 10269 10278 10280 10312 10326 10360 10389 10404
[579] 10406 10419 10421 10422 10423 10424 10425 10426 10427 10428 10429 10430 10431 10432 10433 10434 10435
[596] 10436 10437 10438 10439 10440 10441 10467 10470 10518 10545 10580 10600 10620 10622 10628 10750 10791
[613] 10837 10846 10905 10958 10960 10978 10985 10996 11039 11106 11144 11148 11149 11231 11235 11251 11253
[630] 11256 11257 11260 11261 11264 11268 11276 11282 11294 11304 11311 11325 11326 11332 11335 11338 11349
[647] 11372 11373 11379 11383 11386 11416 11418 11426
qqPlot(olympics_m$weight,dist="norm",main="Male Weight")
[1] 3268 4987
qqPlot(olympics_f$weight,dist="norm",main="Female Weight")
[1] 1346 3740
# Grouped by sex alone (a second group_by() would replace an earlier one)
olympics <- olympics %>% group_by(sex) %>%
mutate(weight=ifelse(is.na(weight),median(weight,na.rm = TRUE),weight))
which(sapply(olympics$weight, is.nullcheck))
integer(0)
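The group-wise imputation used for height and weight can be illustrated on a small example. The sketch below is a base-R equivalent built on ave() rather than the dplyr pipeline used in the report. The helper impute_by_group() and the toy data frame are hypothetical names and values introduced only for illustration.

```r
# Toy data with missing heights and weights in each gender (made-up values)
toy <- data.frame(
  sex    = c("female", "female", "female", "male", "male", "male"),
  height = c(1.60, NA, 1.70, 1.80, 1.90, NA),
  weight = c(50, 60, NA, 70, NA, 90)
)

# Replace NAs in x with a per-group summary (mean, median, ...) of the observed values
impute_by_group <- function(x, group, fun) {
  ave(x, group, FUN = function(v) ifelse(is.na(v), fun(v, na.rm = TRUE), v))
}

# Mean for the roughly symmetric variable, median for the skewed one
toy$height <- impute_by_group(toy$height, toy$sex, mean)
toy$weight <- impute_by_group(toy$weight, toy$sex, median)

toy
```

Observed values are left untouched; only the NA positions receive the group summary.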
colSums(is.na(olympics))
 nationality           id         name          sex          dob
           0            0            0            0            1
      height       weight        sport         gold       silver
           0            0            0            0            0
      bronze      country   population Total_Medals population_interval
           0            0           83            0                  83
olympics <- na.omit(olympics)
colSums(is.na(olympics))
 nationality           id         name          sex          dob
           0            0            0            0            0
      height       weight        sport         gold       silver
           0            0            0            0            0
      bronze      country   population Total_Medals population_interval
           0            0            0            0                   0
To deal with any outliers in the height and weight columns for each gender, boxplots of height and weight were plotted for both genders, using the olympics_m and olympics_f subsets created in the previous step. Outliers were identified in all 4 plots. We used the capping (winsorising) method to deal with them: values lying outside the outlier fences (more than 1.5 times the IQR below the first quartile or above the third quartile) were replaced with the lower and upper fence values respectively. For further analysis of the whole data set, we combined olympics_m and olympics_f into a new data set, ‘olympics_final’, using the rbind() function.
#Male
par(mfrow=c(1,2))
boxplot(olympics_m$height,main="Male Height Distribution",ylab="Height (m)",col = "cyan")
# Fences at 1.5 * IQR beyond the quartiles; values outside them are capped at the fence
iqr <- IQR(olympics_m$height, na.rm = TRUE)
q1 <- quantile(olympics_m$height, .25, na.rm = TRUE)
q3 <- quantile(olympics_m$height, .75, na.rm = TRUE)
benchq1 <- (q1-1.5 * iqr )
benchq3 <- (q3+1.5 * iqr )
olympics_m$height[olympics_m$height > benchq3] <- benchq3
olympics_m$height[olympics_m$height < benchq1] <- benchq1
boxplot(olympics_m$height,main="Male Height Dist. (Handled Outliers)",ylab="Height (m)",col = "cyan")
boxplot.stats(olympics_m$height)$out
numeric(0)
boxplot(olympics_m$weight,main="Male Weight Distribution",ylab="Weight (kg)",col = "cyan")
# Same capping procedure, applied to male weight
iqr <- IQR(olympics_m$weight, na.rm = TRUE)
q1 <- quantile(olympics_m$weight, .25, na.rm = TRUE)
q3 <- quantile(olympics_m$weight, .75, na.rm = TRUE)
benchq1 <- (q1-1.5 * iqr )
benchq3 <- (q3+1.5 * iqr )
olympics_m$weight[olympics_m$weight > benchq3] <- benchq3
olympics_m$weight[olympics_m$weight < benchq1] <- benchq1
boxplot(olympics_m$weight,main="Male Weight Dist. (Handled Outliers)",ylab="Weight (kg)",col = "cyan")
# Check the weight column (not height) for remaining outliers
boxplot.stats(olympics_m$weight)$out
numeric(0)
#Female
boxplot(olympics_f$height,main="Female Height Distribution",ylab="Height (m)",col = "deeppink")
# Same capping procedure, applied to female height
iqr <- IQR(olympics_f$height, na.rm = TRUE)
q1 <- quantile(olympics_f$height, .25, na.rm = TRUE)
q3 <- quantile(olympics_f$height, .75, na.rm = TRUE)
benchq1 <- (q1-1.5 * iqr )
benchq3 <- (q3+1.5 * iqr )
olympics_f$height[olympics_f$height > benchq3] <- benchq3
olympics_f$height[olympics_f$height < benchq1] <- benchq1
boxplot(olympics_f$height,main="Female Height Dist. (Handled Outliers)",ylab="Height (m)",col = "deeppink")
boxplot.stats(olympics_f$height)$out
numeric(0)
boxplot(olympics_f$weight,main="Female Weight Distribution",ylab="Weight (kg)",col = "deeppink")
# Same capping procedure, applied to female weight
iqr <- IQR(olympics_f$weight, na.rm = TRUE)
q1 <- quantile(olympics_f$weight, .25, na.rm = TRUE)
q3 <- quantile(olympics_f$weight, .75, na.rm = TRUE)
benchq1 <- (q1-1.5 * iqr )
benchq3 <- (q3+1.5 * iqr )
olympics_f$weight[olympics_f$weight > benchq3] <- benchq3
olympics_f$weight[olympics_f$weight < benchq1] <- benchq1
boxplot(olympics_f$weight,main="Female Weight Dist. (Handled Outliers)",ylab="Weight (kg)",col = "deeppink")
boxplot.stats(olympics_f$weight)$out
numeric(0)
olympics_final <- rbind(olympics_m,olympics_f)
str(olympics_final)
'data.frame': 11464 obs. of 15 variables:
$ nationality : Factor w/ 199 levels "AFG","ALB","ALG",..: 1 1 2 2 2 3 3 3 3 3 ...
$ id : int 289057786 152408417 324317073 345441615 997380920 690873472 268626951 545134894 133974151 218421111 ...
$ name : chr "Mohammad Tawfiq Bakhshi" "Abdul Wahab Zahiri" "Izmir Smajlaj" "Briken Calja" ...
$ sex : Factor w/ 2 levels "female","male": 2 2 2 2 2 2 2 2 2 2 ...
$ dob : Date, format: "1986-03-11" "1992-05-27" "1993-03-29" "1990-02-19" ...
$ height : num 1.81 1.75 1.95 1.7 1.93 1.9 1.6 1.68 1.85 1.78 ...
$ weight : num 99 68 86 69 87 100 60 62 79 70 ...
$ sport : chr "judo" "athletics" "athletics" "weightlifting" ...
$ gold : int 0 0 0 0 0 0 0 0 0 0 ...
$ silver : int 0 0 0 0 0 0 0 0 0 0 ...
$ bronze : int 0 0 0 0 0 0 0 0 0 0 ...
$ country : Factor w/ 199 levels "Afghanistan",..: 1 1 2 2 2 3 3 3 3 3 ...
$ population : int 32526562 32526562 2889167 2889167 2889167 39666519 39666519 39666519 39666519 39666519 ...
$ Total_Medals : int 0 0 0 0 0 0 0 0 0 0 ...
$ population_interval: Ord.factor w/ 8 levels "0-50M"<"50M-100M"<..: 1 1 1 1 1 1 1 1 1 1 ...
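The capping procedure above is repeated four times with only the column changing, so it could be wrapped in a small helper. The sketch below is one way to do that. The function name cap_outliers() and its parameter k are hypothetical, and the input vector is made up for illustration.

```r
# Cap values outside the fences [Q1 - k * IQR, Q3 + k * IQR] at the fence values
cap_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence_width <- k * (q[2] - q[1])
  lower <- q[1] - fence_width
  upper <- q[2] + fence_width
  # pmax() raises low values to the lower fence, pmin() lowers high values to the upper fence
  pmin(pmax(x, lower), upper)
}

# Made-up values with one obvious outlier
x <- c(1, 2, 3, 4, 5, 100)
capped <- cap_outliers(x)

capped
```

With such a helper, each of the four steps reduces to a single assignment, for example olympics_m$height <- cap_outliers(olympics_m$height). Missing values pass through unchanged.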
To analyse the body mass index (BMI) of the athletes, we introduced a new variable, BMI, computed from the height and weight attributes as weight / height^2. To check whether BMI is normally distributed for each gender, we used qqPlot() and histograms. These plots show that the BMI distribution is right-skewed for both genders. To bring the distributions closer to normal, we applied a logarithmic transformation to BMI for both genders.
olympics_final <- olympics_final %>% mutate(BMI=weight/height^2)
olympics_m <- olympics_m %>% mutate(BMI=weight/height^2)
olympics_f <- olympics_f %>% mutate(BMI=weight/height^2)
qqPlot(olympics_m$BMI,dist="norm",main="Male BMI")
[1] 4695 6178
qqPlot(olympics_f$BMI,dist="norm",main = "Female BMI")
[1] 1588 3743
par(mfrow=c(1,2))
# Subset by olympics_final$sex so that the logical index matches the rows of olympics_final
hist(olympics_final$BMI[olympics_final$sex=="male"],main = "Distribution of Male BMI",xlab = "BMI")
hist(olympics_final$BMI[olympics_final$sex=="female"],main = "Distribution of Female BMI",xlab = "BMI")
hist(log(olympics_final$BMI[olympics_final$sex=="male"]),main = "Male BMI (Log Transformation) ",xlab = "BMI with log transformation")
hist(log(olympics_final$BMI[olympics_final$sex=="female"]),main = "Female BMI (Log Transformation) ",xlab = "BMI with log transformation")
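The effect of the log transformation can also be checked numerically rather than only by eye. The sketch below computes BMI for one athlete from the data shown earlier (height 1.65 m, weight 55 kg) and then compares the sample skewness of a synthetic right-skewed variable before and after taking logs. The synthetic bmi vector, its parameters and the skewness() helper are assumptions made for illustration and are not derived from the Olympic data.

```r
# BMI = weight (kg) / height (m)^2, for the first athlete shown in the str() output
bmi_example <- 55 / 1.65^2

# Synthetic right-skewed values: evenly spaced quantiles of a log-normal distribution
p <- ppoints(1000)
bmi <- exp(qnorm(p, mean = log(23), sd = 0.15))

# Moment-based sample skewness: 0 for a symmetric sample, positive for a right skew
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

skew_raw <- skewness(bmi)       # positive: long right tail
skew_log <- skewness(log(bmi))  # close to zero after the transformation

c(raw = skew_raw, log = skew_log)
```

On real data the skewness after the transformation will rarely be exactly zero, so the comparison is better read as a before-and-after indicator than as a normality test.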