Intro and Problem:

We used the water quality dataset. Water is necessary for human survival. However, according to National Geographic, only 1.2 percent of the water on Earth can be used as drinking water. According to “What is the best PH level of water for drinking?”, the desirable pH level for drinking water is between 6.5 and 8.5.

Goal: find out whether the pH level is affected by the other variables in the dataset, and whether it determines if the water is drinkable.

First step:

Import the water quality data and load the libraries used in this project.

library(magrittr)
library(leaflet)
library(plotly)
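## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout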
library(ggplot2)
library(scales)
library(palmerpenguins)
water= read.csv("water quality.csv", header = TRUE)

Interpret the data.

Use the str function to list every column in the dataset: ph, Hardness, Solids, Chloramines, Sulfate, Conductivity, Organic_carbon, Trihalomethanes, Turbidity, and Potability.

str(water)
## 'data.frame':    3276 obs. of  10 variables:
##  $ ph             : num  NA 3.72 8.1 8.32 9.09 ...
##  $ Hardness       : num  205 129 224 214 181 ...
##  $ Solids         : num  20791 18630 19910 22018 17979 ...
##  $ Chloramines    : num  7.3 6.64 9.28 8.06 6.55 ...
##  $ Sulfate        : num  369 NA NA 357 310 ...
##  $ Conductivity   : num  564 593 419 363 398 ...
##  $ Organic_carbon : num  10.4 15.2 16.9 18.4 11.6 ...
##  $ Trihalomethanes: num  87 56.3 66.4 100.3 32 ...
##  $ Turbidity      : num  2.96 4.5 3.06 4.63 4.08 ...
##  $ Potability     : int  0 0 0 0 0 0 0 0 0 0 ...

Second step:

Clean the data

Use the summary function to find the NA values in the water dataset, then use the na.omit function to remove every row that contains an NA and store the result in a new dataset called water_cleaned. After that, run summary again to check that all NA values have been removed.

summary(water)
##        ph            Hardness          Solids         Chloramines    
##  Min.   : 0.000   Min.   : 47.43   Min.   :  320.9   Min.   : 0.352  
##  1st Qu.: 6.093   1st Qu.:176.85   1st Qu.:15666.7   1st Qu.: 6.127  
##  Median : 7.037   Median :196.97   Median :20927.8   Median : 7.130  
##  Mean   : 7.081   Mean   :196.37   Mean   :22014.1   Mean   : 7.122  
##  3rd Qu.: 8.062   3rd Qu.:216.67   3rd Qu.:27332.8   3rd Qu.: 8.115  
##  Max.   :14.000   Max.   :323.12   Max.   :61227.2   Max.   :13.127  
##  NA's   :491                                                         
##     Sulfate       Conductivity   Organic_carbon  Trihalomethanes  
##  Min.   :129.0   Min.   :181.5   Min.   : 2.20   Min.   :  0.738  
##  1st Qu.:307.7   1st Qu.:365.7   1st Qu.:12.07   1st Qu.: 55.845  
##  Median :333.1   Median :421.9   Median :14.22   Median : 66.622  
##  Mean   :333.8   Mean   :426.2   Mean   :14.28   Mean   : 66.396  
##  3rd Qu.:360.0   3rd Qu.:481.8   3rd Qu.:16.56   3rd Qu.: 77.337  
##  Max.   :481.0   Max.   :753.3   Max.   :28.30   Max.   :124.000  
##  NA's   :781                                     NA's   :162      
##    Turbidity       Potability    
##  Min.   :1.450   Min.   :0.0000  
##  1st Qu.:3.440   1st Qu.:0.0000  
##  Median :3.955   Median :0.0000  
##  Mean   :3.967   Mean   :0.3901  
##  3rd Qu.:4.500   3rd Qu.:1.0000  
##  Max.   :6.739   Max.   :1.0000  
## 
water_cleaned <- na.omit(water)
summary(water_cleaned)
##        ph             Hardness          Solids         Chloramines    
##  Min.   : 0.2275   Min.   : 73.49   Min.   :  320.9   Min.   : 1.391  
##  1st Qu.: 6.0897   1st Qu.:176.74   1st Qu.:15615.7   1st Qu.: 6.139  
##  Median : 7.0273   Median :197.19   Median :20933.5   Median : 7.144  
##  Mean   : 7.0860   Mean   :195.97   Mean   :21917.4   Mean   : 7.134  
##  3rd Qu.: 8.0530   3rd Qu.:216.44   3rd Qu.:27182.6   3rd Qu.: 8.110  
##  Max.   :14.0000   Max.   :317.34   Max.   :56488.7   Max.   :13.127  
##     Sulfate       Conductivity   Organic_carbon  Trihalomethanes  
##  Min.   :129.0   Min.   :201.6   Min.   : 2.20   Min.   :  8.577  
##  1st Qu.:307.6   1st Qu.:366.7   1st Qu.:12.12   1st Qu.: 55.953  
##  Median :332.2   Median :423.5   Median :14.32   Median : 66.542  
##  Mean   :333.2   Mean   :426.5   Mean   :14.36   Mean   : 66.401  
##  3rd Qu.:359.3   3rd Qu.:482.4   3rd Qu.:16.68   3rd Qu.: 77.292  
##  Max.   :481.0   Max.   :753.3   Max.   :27.01   Max.   :124.000  
##    Turbidity       Potability    
##  Min.   :1.450   Min.   :0.0000  
##  1st Qu.:3.443   1st Qu.:0.0000  
##  Median :3.968   Median :0.0000  
##  Mean   :3.970   Mean   :0.4033  
##  3rd Qu.:4.514   3rd Qu.:1.0000  
##  Max.   :6.495   Max.   :1.0000
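
The cleaning step above can also be checked numerically rather than by reading the summary tables. The following is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical and only mimic the shape of the water data); it counts missing values per column and confirms that na.omit leaves none behind.

```r
# Toy data frame standing in for the water dataset (values are made up).
toy <- data.frame(
  ph      = c(NA, 3.72, 8.10, 8.32),
  Sulfate = c(369, NA, NA, 357),
  Solids  = c(20791, 18630, 19910, 22018)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# na.omit() keeps only the rows with no missing value in any column.
toy_cleaned <- na.omit(toy)
rows_dropped <- nrow(toy) - nrow(toy_cleaned)
print(rows_dropped)

# anyNA() returns FALSE once nothing is missing.
print(anyNA(toy_cleaned))
```

Because na.omit drops a row when any single column is missing, the number of rows removed can be much larger than the NA count of any one column, so it is worth printing nrow before and after.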

Round the pH level to 1 decimal place and store it in a new column called ph_round, then round the whole dataset to 2 decimal places so that the data are easier to manipulate and calculate with.

water_cleaned$ph_round <- round(water_cleaned$ph,1)
water_cleaned <- round(water_cleaned,2)

Third step:

Check the claim that water with a pH level between 6.5 and 8.5 is drinkable.

Pie chart 1 shows that the pH level is not the only factor that decides potability: even though every sample in the chart has a pH level between 6.5 and 8.5, only 44% are drinkable and 56% are not.

drinkable <- nrow(filter(water_cleaned, ph_round >=6.5) %>% filter(ph_round <=8.5)%>% filter(Potability == 1))

notdrinkable <- nrow(filter(water_cleaned, ph_round >=6.5) %>% filter(ph_round <=8.5) %>%filter(Potability == 0))
df3 <- data.frame(group = c("PH level within range of 6.5 to 8.5 and drinkable", "PH level within range of 6.5 to 8.5 but not drinkable"), value = c(drinkable/(drinkable+notdrinkable), notdrinkable/(drinkable+notdrinkable)))
e <- ggplot(df3, aes(x = "", y =value, fill  = group)) +
  geom_bar(stat ="identity",width = 1, color = "white") +
  coord_polar("y",start = 0) +
  theme_void() +
  geom_text(aes(label = percent(value) ), size=3, position=position_stack(vjust=0.5))+ggtitle("Pie chart 1")
e

Pie chart 2 shows that some samples are still counted as drinkable even though their pH level is outside the range of 6.5 to 8.5, which means that other factors affect potability.

between_6.5and8.5 <- nrow(filter(water_cleaned, Potability == 1) %>% filter(ph_round >=6.5) %>% filter(ph_round <=8.5))
out_6.5and8.5 <-nrow(filter(water_cleaned, Potability == 1)) - between_6.5and8.5
df_new <- data.frame(group = c("drinkable and PH level within range of 6.5 to 8.5", "drinkable but PH level is out of the range of 6.5 to 8.5"), value = c(between_6.5and8.5/nrow(filter(water_cleaned, Potability == 1)), out_6.5and8.5/nrow(filter(water_cleaned, Potability == 1))))

x <- ggplot(df_new, aes(x = "", y =value, fill  = group)) +
  geom_bar(stat ="identity",width = 1, color = "white") +
  coord_polar("y",start = 0) +
  theme_void() +
  geom_text(aes(label = percent(value) ), size=3, position=position_stack(vjust=0.5))+ggtitle("Pie chart 2")
x
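
The two shares behind pie charts 1 and 2 are conditional proportions, and they can be computed directly with logical vectors. This is a minimal sketch on a hypothetical data frame (`toy` and its values are made up); it is not the code used for the charts above, only an illustration of the same arithmetic in base R.

```r
# Toy data standing in for water_cleaned (values are made up).
toy <- data.frame(
  ph_round   = c(6.4, 6.5, 7.0, 8.5, 8.6, 7.2),
  Potability = c(1, 1, 0, 0, 1, 1)
)

# TRUE where the pH level is inside the desirable range.
in_range  <- toy$ph_round >= 6.5 & toy$ph_round <= 8.5
drinkable <- toy$Potability == 1

# Pie chart 1: among in-range samples, what share is drinkable?
share_drinkable_given_in_range <- mean(drinkable[in_range])

# Pie chart 2: among drinkable samples, what share is in range?
share_in_range_given_drinkable <- mean(in_range[drinkable])

print(share_drinkable_given_in_range)
print(share_in_range_given_drinkable)
```

Taking the mean of a logical vector gives the proportion of TRUE values, which avoids counting rows with repeated filter calls.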

Fourth step:

Create charts for the pH level.

Plot the share of samples with a desirable pH level against the share without. As the pie chart shows, the share of samples whose pH level is outside the range of 6.5 to 8.5 is 0.34 percent larger than the share of samples whose pH level is within that range.

desire_ph <- nrow(filter(water_cleaned, ph_round >= 6.5) %>% filter(ph_round <= 8.5))
total <- nrow(water_cleaned)
not_desire <- total - desire_ph
portion_desire <- desire_ph/total
portion_no_desire <- not_desire/total
df <- data.frame(group = c(" PH level within range of 6.5 to 8.5","PH level out of range of 6.5 to 8.5"), value = c(portion_desire, portion_no_desire))

w <- ggplot(df, aes(x = "", y =value, fill  = group)) +
  geom_bar(stat ="identity",width = 1, color = "white") +
  coord_polar("y",start = 0) +
  theme_void() +
  geom_text(aes(label = percent(value) ), size=3, position=position_stack(vjust=0.5))+
  ggtitle("PH level within range of 6.5 to 8.5 v.s. PH level out of range of 6.5 to 8.5")
  
w

Create a chart that shows the pH level in more detail. First, count the samples in each one-unit pH band.

ph0_1 <- nrow(filter(water_cleaned, ph_round>=0) %>% filter(ph_round<1))
ph1_2 <- nrow(filter(water_cleaned, ph_round>=1) %>% filter(ph_round<2))
ph2_3 <- nrow(filter(water_cleaned, ph_round>=2) %>% filter(ph_round<3))
ph3_4 <- nrow(filter(water_cleaned, ph_round>=3) %>% filter(ph_round<4))
ph4_5 <- nrow(filter(water_cleaned, ph_round>=4) %>% filter(ph_round<5))
ph5_6 <- nrow(filter(water_cleaned, ph_round>=5) %>% filter(ph_round<6))
ph6_7 <- nrow(filter(water_cleaned, ph_round>=6) %>% filter(ph_round<7))
ph7_8 <- nrow(filter(water_cleaned, ph_round>=7) %>% filter(ph_round<8))
ph8_9 <- nrow(filter(water_cleaned, ph_round>=8) %>% filter(ph_round<9))
ph9_10 <- nrow(filter(water_cleaned, ph_round>=9) %>% filter(ph_round<10))
ph10_11 <- nrow(filter(water_cleaned, ph_round>=10) %>% filter(ph_round<11))
ph11_12 <- nrow(filter(water_cleaned, ph_round>=11) %>% filter(ph_round<12))
ph12_13 <- nrow(filter(water_cleaned, ph_round>=12) %>% filter(ph_round<13))
ph13_14 <- nrow(filter(water_cleaned, ph_round>=13) %>% filter(ph_round<14))
ph_14 <- nrow(filter(water_cleaned, ph_round>=14))

Create a bar plot of the counts in each pH band. As the bar plot shows, most of the samples have a pH value between 4 and 8.

df2 <- data.frame(group = c("PH 0~1","PH 1~2", "PH 2~3","PH 3~4","PH 4~5","PH 5~6","PH 6~7","PH 7~8","PH 8~9","PH 9~10","PH 10~11","PH 11~12","PH 12~13","PH 13~14","PH 14 above "), value = c(ph0_1, ph1_2, ph2_3, ph3_4, ph4_5, ph5_6, ph6_7, ph7_8, ph8_9, ph9_10, ph10_11, ph11_12, ph12_13,ph13_14, ph_14))
q <- ggplot(df2, aes(x = group, y = value, fill = group))+geom_bar(stat = "identity")
q
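
The fifteen filter calls above can be replaced by a single binning step. The following is a minimal sketch using base R's cut on a made-up vector (`ph_round` here is a hypothetical stand-in for the column in water_cleaned); the labels mirror the ones used in df2.

```r
# Made-up pH values standing in for water_cleaned$ph_round.
ph_round <- c(0.2, 3.7, 6.5, 7.0, 7.9, 8.1, 14.0)

# Left-closed bands [0,1), [1,2), ..., [13,14), and [14, Inf).
breaks <- c(0:14, Inf)
labels <- c(paste0("PH ", 0:13, "~", 1:14), "PH 14 above")

# right = FALSE makes each band include its lower bound, like >= and < above.
ph_bin <- cut(ph_round, breaks = breaks, labels = labels, right = FALSE)

# Count the samples per band; empty bands are kept with a count of 0.
counts <- table(ph_bin)
print(counts)
```

cut returns a factor whose levels follow the order of the breaks, so a plot built from it keeps the bands in numeric order, whereas a character column such as df2$group is sorted alphabetically on a discrete axis.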

Draw scatter plots of the pH level against each of the other variables in the dataset.

ggplot(water_cleaned, aes(x = ph_round, y = Hardness)) + geom_point()+ ggtitle("PH v.s. Hardness") + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot(water_cleaned, aes(x = ph_round, y = Solids)) + geom_point() +ggtitle("PH v.s. Solids") + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot(water_cleaned, aes(x = ph_round, y = Chloramines)) + geom_point() +ggtitle("PH v.s. Chloramines") + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot(water_cleaned, aes(x = ph_round, y = Sulfate)) + geom_point() +ggtitle("PH v.s. Sulfate") + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot(water_cleaned, aes(x = ph_round, y = Conductivity)) + geom_point() +ggtitle("PH v.s. Conductivity") + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot(water_cleaned, aes(x = ph_round, y = Organic_carbon)) + geom_point() +ggtitle("PH v.s. Organic_carbon") + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot(water_cleaned, aes(x = ph_round, y = Trihalomethanes)) + geom_point() +ggtitle("PH v.s. Trihalomethanes") +geom_smooth()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot(water_cleaned, aes(x = ph_round, y = Turbidity)) + geom_point() +ggtitle("PH v.s. Turbidity") +geom_smooth()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

The correlation coefficient lies between -1 and 1. The closer it is to -1 or 1, the stronger the linear relationship; the closer it is to zero, the weaker the relationship. If the correlation is negative, the dependent variable tends to decrease as the independent variable increases. If it is positive, the dependent variable tends to increase as the independent variable increases.
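
A small numeric illustration of these three cases, using base R's cor on made-up vectors (all names and values here are hypothetical and chosen so the results are easy to verify by hand):

```r
# Made-up independent variable.
x <- c(1, 2, 3, 4, 5)

# Perfectly increasing, perfectly decreasing, and no linear trend.
y_up   <- 2 * x + 1
y_down <- -3 * x + 10
y_flat <- c(2, -1, 0, -1, 2)

# Pearson correlation is the default method of cor().
r_up   <- cor(x, y_up)
r_down <- cor(x, y_down)
r_flat <- cor(x, y_flat)

print(c(up = r_up, down = r_down, flat = r_flat))
```

The same call, for example cor(water_cleaned$ph_round, water_cleaned$Hardness), would put a number on what the scatter plots show visually.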

As the graphs above show, ph_round has no linear relationship with any of the other variables. I therefore tried a log transformation to look for a linear relationship.

pH versus Hardness under the log transformation: no linear relationship between pH level and Hardness.

water_cleaned$ph_log <- log(water_cleaned$ph_round)
water_cleaned$hardness_log <- log(water_cleaned$Hardness)
ggplot(water_cleaned, aes(x =hardness_log , y = ph_round )) + geom_point()+ geom_smooth()+ggtitle("Hardness_log v.s. PH")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot(water_cleaned, aes(x =Hardness , y = ph_log )) + geom_point()+ geom_smooth()+ggtitle("Hardness v.s. PH_log")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

pH versus Solids under the log transformation: no linear relationship between pH level and Solids.

water_cleaned$solids_log <- log(water_cleaned$Solids)
ggplot(water_cleaned, aes(x =solids_log , y = ph_round )) + geom_point()+ geom_smooth()+ggtitle("Solid_log v.s. PH")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot(water_cleaned, aes(x =Solids , y = ph_log )) + geom_point()+ geom_smooth()+ggtitle("Solids v.s. PH_log")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

pH versus Chloramines under the log transformation: no linear relationship between pH level and Chloramines.

water_cleaned$Chloramines_log <- log(water_cleaned$Chloramines)
ggplot(water_cleaned, aes(x =Chloramines_log , y = ph_round )) + geom_point()+ geom_smooth()+ggtitle("Chloramines_log v.s. PH")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot(water_cleaned, aes(x =Chloramines , y = ph_log )) + geom_point()+ geom_smooth()+ggtitle("Chloramines v.s. PH_log")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

pH versus Sulfate under the log transformation: no linear relationship between pH level and Sulfate.

water_cleaned$Sulfate_log <- log(water_cleaned$Sulfate)
ggplot(water_cleaned, aes(x =Sulfate_log , y = ph_round )) + geom_point()+ geom_smooth()+ggtitle("Sulfate_log v.s. PH")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot(water_cleaned, aes(x =Sulfate , y = ph_log )) + geom_point()+ geom_smooth()+ggtitle("Sulfate v.s. PH_log")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

pH versus Conductivity under the log transformation: no linear relationship between pH level and Conductivity.

water_cleaned$Conductivity_log <- log(water_cleaned$Conductivity)
ggplot(water_cleaned, aes(x =Conductivity_log , y = ph_round )) + geom_point()+ geom_smooth()+ggtitle("Conductivity_log v.s. PH")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot(water_cleaned, aes(x =Conductivity , y = ph_log )) + geom_point()+ geom_smooth()+ggtitle("Conductivity v.s. PH_log")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

pH versus Organic_carbon under the log transformation: no linear relationship between pH level and Organic_carbon.

water_cleaned$Organic_carbon_log <- log(water_cleaned$Organic_carbon)
ggplot(water_cleaned, aes(x =Organic_carbon_log, y = ph_round )) + geom_point()+ geom_smooth()+ggtitle("Organic_carbon_log v.s. PH")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot(water_cleaned, aes(x =Organic_carbon , y = ph_log )) + geom_point()+ geom_smooth()+ggtitle("Organic_carbon v.s. PH_log")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

pH versus Trihalomethanes under the log transformation: no linear relationship between pH level and Trihalomethanes.

water_cleaned$Trihalomethanes_log <- log(water_cleaned$Trihalomethanes)
ggplot(water_cleaned, aes(x =Trihalomethanes_log, y = ph_round )) + geom_point()+ geom_smooth()+ggtitle("Trihalomethanes_log v.s. PH")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot(water_cleaned, aes(x =Trihalomethanes , y = ph_log )) + geom_point()+ geom_smooth()+ggtitle("Trihalomethanes v.s. PH_log")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

pH versus Turbidity under the log transformation: no linear relationship between pH level and Turbidity.

water_cleaned$Turbidity_log <- log(water_cleaned$Turbidity)
ggplot(water_cleaned, aes(x =Turbidity_log, y = ph_round )) + geom_point()+ geom_smooth()+ggtitle("Turbidity_log v.s. PH")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot(water_cleaned, aes(x =Turbidity , y = ph_log )) + geom_point()+ geom_smooth()+ggtitle("Turbidity v.s. PH_log")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Since the log transformation didn’t work, I switched to the square root transformation.

pH versus Hardness under the square root transformation: no linear relationship between pH level and Hardness.

water_cleaned$ph_sqrt <- sqrt(water_cleaned$ph_round)
water_cleaned$hardness_sqrt <- sqrt(water_cleaned$Hardness)
ggplot(water_cleaned, aes(x =hardness_sqrt , y = ph_round )) + geom_point()+ geom_smooth()+ggtitle("Hardness_sqrt v.s. PH")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot(water_cleaned, aes(x =Hardness , y = ph_sqrt )) + geom_point()+ geom_smooth()+ggtitle("Hardness v.s. PH_sqrt")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

pH versus Solids under the square root transformation: no linear relationship between pH level and Solids.

water_cleaned$solids_sqrt <- sqrt(water_cleaned$Solids)
ggplot(water_cleaned, aes(x =solids_sqrt, y = ph_round )) + geom_point()+ geom_smooth()+ggtitle("Solid_sqrt v.s. PH")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot(water_cleaned, aes(x =Solids , y = ph_sqrt )) + geom_point()+ geom_smooth()+ggtitle("Solids v.s. PH_sqrt")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

pH versus Chloramines under the square root transformation: no linear relationship between pH level and Chloramines.

water_cleaned$Chloramines_sqrt <- sqrt(water_cleaned$Chloramines)
ggplot(water_cleaned, aes(x =Chloramines_sqrt , y = ph_round )) + geom_point()+ geom_smooth()+ggtitle("Chloramines_sqrt v.s. PH")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot(water_cleaned, aes(x =Chloramines , y = ph_sqrt )) + geom_point()+ geom_smooth()+ggtitle("Chloramines v.s. PH_sqrt")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

pH versus Sulfate under the square root transformation: no linear relationship between pH level and Sulfate.

water_cleaned$Sulfate_sqrt <- sqrt(water_cleaned$Sulfate)
ggplot(water_cleaned, aes(x =Sulfate_sqrt , y = ph_round )) + geom_point()+ geom_smooth()+ggtitle("Sulfate_sqrt v.s. PH")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot(water_cleaned, aes(x =Sulfate , y = ph_sqrt )) + geom_point()+ geom_smooth()+ggtitle("Sulfate v.s. PH_sqrt")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

pH versus Conductivity under the square root transformation: no linear relationship between pH level and Conductivity.

water_cleaned$Conductivity_sqrt <- sqrt(water_cleaned$Conductivity)
ggplot(water_cleaned, aes(x =Conductivity_sqrt , y = ph_round )) + geom_point()+ geom_smooth()+ggtitle("Conductivity_sqrt v.s. PH")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot(water_cleaned, aes(x =Conductivity , y = ph_sqrt )) + geom_point()+ geom_smooth()+ggtitle("Conductivity v.s. PH_sqrt")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

pH versus Organic_carbon under the square root transformation: no linear relationship between pH level and Organic_carbon.

water_cleaned$Organic_carbon_sqrt <- sqrt(water_cleaned$Organic_carbon)
ggplot(water_cleaned, aes(x =Organic_carbon_sqrt, y = ph_round )) + geom_point()+ geom_smooth()+ggtitle("Organic_carbon_sqrt v.s. PH")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot(water_cleaned, aes(x =Organic_carbon , y = ph_sqrt )) + geom_point()+ geom_smooth()+ggtitle("Organic_carbon v.s. PH_sqrt")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

pH versus Trihalomethanes under the square root transformation: no linear relationship between pH level and Trihalomethanes.

water_cleaned$Trihalomethanes_sqrt <- sqrt(water_cleaned$Trihalomethanes)
ggplot(water_cleaned, aes(x =Trihalomethanes_sqrt, y = ph_round )) + geom_point()+ geom_smooth()+ggtitle("Trihalomethanes_sqrt v.s. PH")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot(water_cleaned, aes(x =Trihalomethanes , y = ph_sqrt )) + geom_point()+ geom_smooth()+ggtitle("Trihalomethanes v.s. PH_sqrt")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

pH versus Turbidity under the square root transformation: no linear relationship between pH level and Turbidity.

water_cleaned$Turbidity_sqrt <- sqrt(water_cleaned$Turbidity)
ggplot(water_cleaned, aes(x =Turbidity_sqrt, y = ph_round )) + geom_point()+ geom_smooth()+ggtitle("Turbidity_sqrt v.s. PH")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot(water_cleaned, aes(x =Turbidity , y = ph_sqrt )) + geom_point()+ geom_smooth()+ggtitle("Turbidity v.s. PH_sqrt")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
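
The visual comparison across raw, log, and square root scales can be summarised in one table of correlation coefficients. The sketch below is an illustration on randomly generated data, not the report's results: `toy`, its column values, and the `predictors` vector are all hypothetical, and only two predictors are included to keep it short.

```r
# Randomly generated stand-in for water_cleaned (not the real data).
set.seed(1)
toy <- data.frame(
  ph_round  = round(runif(50, 1, 14), 1),
  Hardness  = runif(50, 70, 320),
  Turbidity = runif(50, 1.5, 6.5)
)

predictors <- c("Hardness", "Turbidity")

# For each predictor, correlate pH with the raw, log, and sqrt values.
cor_table <- sapply(predictors, function(v) {
  c(raw  = cor(toy$ph_round, toy[[v]]),
    log  = cor(toy$ph_round, log(toy[[v]])),
    sqrt = cor(toy$ph_round, sqrt(toy[[v]])))
})

# Rows are the transformations, columns are the predictors.
print(round(cor_table, 3))
```

Applied to water_cleaned with all eight predictors, a table like this would show at a glance whether any transformation moves a coefficient away from zero.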

Conclusion:

As the graphs show, the correlation between the pH level and the other variables is low, so I conclude that the pH level is not affected by the other variables in the dataset and can serve as one of the independent variables in the potability model.

Potability = (2.875e-01) + (6.291e-03)*PH + (-8.805e-06)*Hardness + (2.371e-06)*Solids + (6.882e-03)*Chloramines + (-9.931e-05)*Sulfate + (-9.217e-05)*Conductivity + (-2.141e-03)*Organic_carbon + (2.880e-04)*Trihalomethanes + (1.408e-02)*Turbidity

model2 <- lm(Potability ~ ph + Hardness + Solids + Chloramines + Sulfate + Conductivity + Organic_carbon + Trihalomethanes + Turbidity, data =
water_cleaned)
summary(model2)
## 
## Call:
## lm(formula = Potability ~ ph + Hardness + Solids + Chloramines + 
##     Sulfate + Conductivity + Organic_carbon + Trihalomethanes + 
##     Turbidity, data = water_cleaned)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5031 -0.4067 -0.3745  0.5870  0.7012 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)  
## (Intercept)      2.875e-01  1.797e-01   1.600    0.110  
## ph               6.291e-03  7.035e-03   0.894    0.371  
## Hardness        -8.805e-06  3.406e-04  -0.026    0.979  
## Solids           2.371e-06  1.294e-06   1.832    0.067 .
## Chloramines      6.882e-03  6.929e-03   0.993    0.321  
## Sulfate         -9.931e-05  2.715e-04  -0.366    0.715  
## Conductivity    -9.217e-05  1.358e-04  -0.679    0.497  
## Organic_carbon  -2.141e-03  3.297e-03  -0.649    0.516  
## Trihalomethanes  2.880e-04  6.819e-04   0.422    0.673  
## Turbidity        1.408e-02  1.406e-02   1.002    0.316  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4909 on 2001 degrees of freedom
## Multiple R-squared:  0.003638,   Adjusted R-squared:  -0.0008431 
## F-statistic: 0.8119 on 9 and 2001 DF,  p-value: 0.6053
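
The equation above can be assembled from the fitted model object instead of being copied from the printed summary. The following is a minimal sketch on randomly generated data (`toy`, `fit`, and the two predictors are hypothetical stand-ins, not the report's model); it uses coef to read the estimates and coef(summary(...)) to read the p-values.

```r
# Randomly generated stand-in data (not the real water dataset).
set.seed(42)
toy <- data.frame(
  ph        = runif(30, 5, 9),
  Turbidity = runif(30, 2, 6)
)
toy$Potability <- rbinom(30, 1, 0.4)

# Same kind of model as model2, with fewer predictors.
fit <- lm(Potability ~ ph + Turbidity, data = toy)

# Named vector of estimates: intercept first, then one per predictor.
coefs <- coef(fit)

# Build "(estimate)*name" terms and join them into one equation string.
terms <- paste0("(", signif(coefs[-1], 4), ")*", names(coefs)[-1])
equation <- paste("Potability =", signif(coefs[1], 4), "+",
                  paste(terms, collapse = " + "))
print(equation)

# p-values for each coefficient, taken from the summary table.
p_values <- coef(summary(fit))[, "Pr(>|t|)"]
print(round(p_values, 3))
```

In the summary(model2) output above, no coefficient has a p-value below 0.05 and the multiple R-squared is 0.003638, so the fitted equation should be read with that in mind: the predictors explain very little of the variation in Potability in this linear model.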