We used the water quality dataset. Water is necessary for human survival. However, according to National Geographic, only 1.2 percent of the water on Earth can be used as drinking water. According to “What is the best PH level of water for drinking?”, the desirable pH level for drinking water is between 6.5 and 8.5.
## Goal: find whether the pH level is affected by the other variables in the dataset, and how it relates to whether the water is drinkable.
Import the water quality data and load the libraries used in this project.
library(magrittr)
library(leaflet)
library(plotly)
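## Loading required package: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout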
library(ggplot2)
library(scales)
library(palmerpenguins)
water= read.csv("water quality.csv", header = TRUE)
Use the str function to check all the columns in the dataset. The dataset contains ph, Hardness, Solids, Chloramines, Sulfate, Conductivity, Organic_carbon, Trihalomethanes, Turbidity, and Potability.
str(water)
## 'data.frame': 3276 obs. of 10 variables:
## $ ph : num NA 3.72 8.1 8.32 9.09 ...
## $ Hardness : num 205 129 224 214 181 ...
## $ Solids : num 20791 18630 19910 22018 17979 ...
## $ Chloramines : num 7.3 6.64 9.28 8.06 6.55 ...
## $ Sulfate : num 369 NA NA 357 310 ...
## $ Conductivity : num 564 593 419 363 398 ...
## $ Organic_carbon : num 10.4 15.2 16.9 18.4 11.6 ...
## $ Trihalomethanes: num 87 56.3 66.4 100.3 32 ...
## $ Turbidity : num 2.96 4.5 3.06 4.63 4.08 ...
## $ Potability : int 0 0 0 0 0 0 0 0 0 0 ...
Use the summary function to find the NA values in the water dataset, then use the na.omit function to remove every row that contains an NA and store the result in a new dataset called water_cleaned. After that, run summary again to check that all NA values have been removed.
summary(water)
## ph Hardness Solids Chloramines
## Min. : 0.000 Min. : 47.43 Min. : 320.9 Min. : 0.352
## 1st Qu.: 6.093 1st Qu.:176.85 1st Qu.:15666.7 1st Qu.: 6.127
## Median : 7.037 Median :196.97 Median :20927.8 Median : 7.130
## Mean : 7.081 Mean :196.37 Mean :22014.1 Mean : 7.122
## 3rd Qu.: 8.062 3rd Qu.:216.67 3rd Qu.:27332.8 3rd Qu.: 8.115
## Max. :14.000 Max. :323.12 Max. :61227.2 Max. :13.127
## NA's :491
## Sulfate Conductivity Organic_carbon Trihalomethanes
## Min. :129.0 Min. :181.5 Min. : 2.20 Min. : 0.738
## 1st Qu.:307.7 1st Qu.:365.7 1st Qu.:12.07 1st Qu.: 55.845
## Median :333.1 Median :421.9 Median :14.22 Median : 66.622
## Mean :333.8 Mean :426.2 Mean :14.28 Mean : 66.396
## 3rd Qu.:360.0 3rd Qu.:481.8 3rd Qu.:16.56 3rd Qu.: 77.337
## Max. :481.0 Max. :753.3 Max. :28.30 Max. :124.000
## NA's :781 NA's :162
## Turbidity Potability
## Min. :1.450 Min. :0.0000
## 1st Qu.:3.440 1st Qu.:0.0000
## Median :3.955 Median :0.0000
## Mean :3.967 Mean :0.3901
## 3rd Qu.:4.500 3rd Qu.:1.0000
## Max. :6.739 Max. :1.0000
##
water_cleaned <- na.omit(water)
summary(water_cleaned)
## ph Hardness Solids Chloramines
## Min. : 0.2275 Min. : 73.49 Min. : 320.9 Min. : 1.391
## 1st Qu.: 6.0897 1st Qu.:176.74 1st Qu.:15615.7 1st Qu.: 6.139
## Median : 7.0273 Median :197.19 Median :20933.5 Median : 7.144
## Mean : 7.0860 Mean :195.97 Mean :21917.4 Mean : 7.134
## 3rd Qu.: 8.0530 3rd Qu.:216.44 3rd Qu.:27182.6 3rd Qu.: 8.110
## Max. :14.0000 Max. :317.34 Max. :56488.7 Max. :13.127
## Sulfate Conductivity Organic_carbon Trihalomethanes
## Min. :129.0 Min. :201.6 Min. : 2.20 Min. : 8.577
## 1st Qu.:307.6 1st Qu.:366.7 1st Qu.:12.12 1st Qu.: 55.953
## Median :332.2 Median :423.5 Median :14.32 Median : 66.542
## Mean :333.2 Mean :426.5 Mean :14.36 Mean : 66.401
## 3rd Qu.:359.3 3rd Qu.:482.4 3rd Qu.:16.68 3rd Qu.: 77.292
## Max. :481.0 Max. :753.3 Max. :27.01 Max. :124.000
## Turbidity Potability
## Min. :1.450 Min. :0.0000
## 1st Qu.:3.443 1st Qu.:0.0000
## Median :3.968 Median :0.0000
## Mean :3.970 Mean :0.4033
## 3rd Qu.:4.514 3rd Qu.:1.0000
## Max. :6.495 Max. :1.0000
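As a more compact check than reading the summary tables, the missing values can be counted per column directly. The sketch below uses a small made-up data frame called toy (not the real water data) and assumes the same cleaning approach as above, where any row with an NA is dropped.

```r
# Hypothetical toy data standing in for the water dataset
toy <- data.frame(
  ph      = c(7.1, NA, 8.3, 6.9),
  Sulfate = c(369, 357, NA, 310),
  Solids  = c(20791, 18630, 19910, 22018)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only the rows that have no missing value in any column
toy_cleaned <- na.omit(toy)
print(nrow(toy_cleaned))

# Confirm that nothing is missing after cleaning
print(anyNA(toy_cleaned))
```

Comparing nrow before and after na.omit also shows how many rows the cleaning step costs, which is worth knowing when several columns have NA values in different rows.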
Round the pH level to 1 decimal place and store it in a new column called ph_round, then round the whole dataset to 2 decimal places to make the data easier to manipulate and calculate with.
water_cleaned$ph_round <- round(water_cleaned$ph,1)
water_cleaned <- round(water_cleaned,2)
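One thing to keep in mind is that rounding before filtering can move samples across the 6.5 and 8.5 thresholds. The sketch below uses made-up pH readings (not values from the dataset) to compare a range check on raw values with the same check on values rounded to one decimal.

```r
# Hypothetical pH readings close to the 6.5 and 8.5 thresholds
ph <- c(6.46, 7.00, 8.54, 8.56)

# Range check on the raw values
in_range_raw <- ph >= 6.5 & ph <= 8.5

# Range check after rounding to one decimal, as ph_round does
ph_round <- round(ph, 1)
in_range_round <- ph_round >= 6.5 & ph_round <= 8.5

print(data.frame(ph, ph_round, in_range_raw, in_range_round))
```

In this toy example 6.46 and 8.54 count as in range only after rounding, so the in-range counts based on ph_round can be slightly larger than counts based on ph.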
Pie chart 1 shows that the pH level is not the only factor that decides potability: even though every sample in this chart has a pH level between 6.5 and 8.5, only 44% are drinkable and 56% are not.
drinkable <- nrow(filter(water_cleaned, ph_round >=6.5) %>% filter(ph_round <=8.5)%>% filter(Potability == 1))
notdrinkable <- nrow(filter(water_cleaned, ph_round >=6.5) %>% filter(ph_round <=8.5) %>%filter(Potability == 0))
df3 <- data.frame(group = c("PH level within range of 6.5 to 8.5 and drinkable", "PH level within range of 6.5 to 8.5 but not drinkable"), value = c(drinkable/(drinkable+notdrinkable), notdrinkable/(drinkable+notdrinkable)))
e <- ggplot(df3, aes(x = "", y =value, fill = group)) +
geom_bar(stat ="identity",width = 1, color = "white") +
coord_polar("y",start = 0) +
theme_void() +
geom_text(aes(label = percent(value) ), size=3, position=position_stack(vjust=0.5))+ggtitle("Pie chart 1")
e
Pie chart 2 shows that even when the pH level is outside the range
6.5 to 8.5, some samples are still counted as drinkable, which means
that other factors affect potability.
between_6.5and8.5 <- nrow(filter(water_cleaned, Potability == 1) %>% filter(ph_round >=6.5) %>% filter(ph_round <=8.5))
out_6.5and8.5 <-nrow(filter(water_cleaned, Potability == 1)) - between_6.5and8.5
df_new <- data.frame(group = c("drinkable and PH level within range of 6.5 to 8.5", "drinkable but PH level is out of the range of 6.5 to 8.5"), value = c(between_6.5and8.5/nrow(filter(water_cleaned, Potability == 1)), out_6.5and8.5/nrow(filter(water_cleaned, Potability == 1))))
x <- ggplot(df_new, aes(x = "", y =value, fill = group)) +
geom_bar(stat ="identity",width = 1, color = "white") +
coord_polar("y",start = 0) +
theme_void() +
geom_text(aes(label = percent(value) ), size=3, position=position_stack(vjust=0.5))+ggtitle("Pie chart 2")
x
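The shares behind pie chart 1 and pie chart 2 can also be read from a single cross table. This is only a sketch on a made-up data frame called toy; the column names ph_round and Potability follow the ones used above, while in_range is a helper column introduced here for illustration.

```r
# Hypothetical toy data: rounded pH and potability flag (1 = drinkable)
toy <- data.frame(
  ph_round   = c(7.0, 7.2, 6.8, 8.0, 5.9, 9.1, 4.5, 7.5),
  Potability = c(1,   0,   0,   1,   1,   0,   0,   0)
)

# Flag the samples whose pH lies in the desirable range
toy$in_range <- toy$ph_round >= 6.5 & toy$ph_round <= 8.5

# Counts of in-range / out-of-range samples against potability
counts <- table(in_range = toy$in_range, Potability = toy$Potability)

# Row shares: among in-range samples, what fraction is drinkable? (pie chart 1)
by_range <- prop.table(counts, margin = 1)

# Column shares: among drinkable samples, what fraction is in range? (pie chart 2)
by_potability <- prop.table(counts, margin = 2)

print(by_range)
print(by_potability)
```

The two pie charts answer different questions, and keeping both sets of shares side by side makes it easier not to mix them up.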
Plot the share of samples with a desirable pH level against the share without. As the pie chart shows, the share of samples whose pH level is outside the range 6.5 to 8.5 is 0.34 percent higher than the share of samples whose pH level is within that range.
desire_ph <- nrow(filter(water_cleaned, ph_round >= 6.5) %>% filter(ph_round <= 8.5))
total <- nrow(water_cleaned)
not_desire <- total - desire_ph
portion_desire <- desire_ph/total
portion_no_desire <- not_desire/total
df <- data.frame(group = c(" PH level within range of 6.5 to 8.5","PH level out of range of 6.5 to 8.5"), value = c(portion_desire, portion_no_desire))
w <- ggplot(df, aes(x = "", y =value, fill = group)) +
geom_bar(stat ="identity",width = 1, color = "white") +
coord_polar("y",start = 0) +
theme_void() +
geom_text(aes(label = percent(value) ), size=3, position=position_stack(vjust=0.5))+
ggtitle("PH level within range of 6.5 to 8.5 v.s. PH level out of range of 6.5 to 8.5")
w
Create a chart that shows the pH level in more detail. First, count the samples in each pH interval.
ph0_1 <- nrow(filter(water_cleaned, ph_round>=0) %>% filter(ph_round<1))
ph1_2 <- nrow(filter(water_cleaned, ph_round>=1) %>% filter(ph_round<2))
ph2_3 <- nrow(filter(water_cleaned, ph_round>=2) %>% filter(ph_round<3))
ph3_4 <- nrow(filter(water_cleaned, ph_round>=3) %>% filter(ph_round<4))
ph4_5 <- nrow(filter(water_cleaned, ph_round>=4) %>% filter(ph_round<5))
ph5_6 <- nrow(filter(water_cleaned, ph_round>=5) %>% filter(ph_round<6))
ph6_7 <- nrow(filter(water_cleaned, ph_round>=6) %>% filter(ph_round<7))
ph7_8 <- nrow(filter(water_cleaned, ph_round>=7) %>% filter(ph_round<8))
ph8_9 <- nrow(filter(water_cleaned, ph_round>=8) %>% filter(ph_round<9))
ph9_10 <- nrow(filter(water_cleaned, ph_round>=9) %>% filter(ph_round<10))
ph10_11 <- nrow(filter(water_cleaned, ph_round>=10) %>% filter(ph_round<11))
ph11_12 <- nrow(filter(water_cleaned, ph_round>=11) %>% filter(ph_round<12))
ph12_13 <- nrow(filter(water_cleaned, ph_round>=12) %>% filter(ph_round<13))
ph13_14 <- nrow(filter(water_cleaned, ph_round>=13) %>% filter(ph_round<14))
ph_14 <- nrow(filter(water_cleaned, ph_round>=14))
Create a bar plot of the counts for each pH interval. The bar plot shows that the majority of samples have a pH value between 4 and 8.
df2 <- data.frame(group = c("PH 0~1","PH 1~2", "PH 2~3","PH 3~4","PH 4~5","PH 5~6","PH 6~7","PH 7~8","PH 8~9","PH 9~10","PH 10~11","PH 11~12","PH 12~13","PH 13~14","PH 14 above "), value = c(ph0_1, ph1_2, ph2_3, ph3_4, ph4_5, ph5_6, ph6_7, ph7_8, ph8_9, ph9_10, ph10_11, ph11_12, ph12_13,ph13_14, ph_14))
q <- ggplot(df2, aes(x = group, y = value, fill = group))+geom_bar(stat = "identity")
q
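The fifteen filter calls above can be replaced by one call to cut, which also keeps the intervals in numeric order. The sketch below is an alternative on made-up ph_round values, not the code used for the plot above; the names breaks, labels, ph_bin, and bin_counts are introduced here for illustration.

```r
# Hypothetical rounded pH values
ph_round <- c(0.2, 3.7, 6.5, 7.0, 7.9, 8.1, 10.4, 14.0)

# Bin edges 0, 1, ..., 14, plus an open upper bin so that exactly 14 is kept
breaks <- c(0:14, Inf)
labels <- c(paste0("PH ", 0:13, "~", 1:14), "PH 14 above")

# right = FALSE makes each bin [lower, upper), matching ">= lower" and "< upper"
ph_bin <- cut(ph_round, breaks = breaks, labels = labels, right = FALSE)

# table() keeps the factor levels in numeric order, including empty bins
bin_counts <- table(ph_bin)
print(bin_counts)
```

Because ph_bin is a factor with ordered levels, a bar plot built from it shows the bars in pH order. With plain character labels, ggplot2 sorts the groups alphabetically, so "PH 10~11" would appear before "PH 2~3".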
Scatter plots of the pH level against each of the other variables in the dataset.
ggplot(water_cleaned, aes(x = ph_round, y = Hardness)) + geom_point()+ ggtitle("PH v.s. Hardness") + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(water_cleaned, aes(x = ph_round, y = Solids)) + geom_point() +ggtitle("PH v.s. Solids") + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(water_cleaned, aes(x = ph_round, y = Chloramines)) + geom_point() +ggtitle("PH v.s. Chloramines") + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(water_cleaned, aes(x = ph_round, y = Sulfate)) + geom_point() +ggtitle("PH v.s. Sulfate") + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(water_cleaned, aes(x = ph_round, y = Conductivity)) + geom_point() +ggtitle("PH v.s. Conductivity") + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(water_cleaned, aes(x = ph_round, y = Organic_carbon)) + geom_point() +ggtitle("PH v.s. Organic_carbon") + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(water_cleaned, aes(x = ph_round, y = Trihalomethanes)) + geom_point() +ggtitle("PH v.s. Trihalomethanes") +geom_smooth()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(water_cleaned, aes(x = ph_round, y = Turbidity)) + geom_point() +ggtitle("PH v.s. Turbidity") +geom_smooth()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
The correlation coefficient is always between -1 and 1. The closer it
is to -1 or 1, the stronger the linear relationship; the closer it is
to zero, the weaker the relationship. If the correlation is negative,
the dependent variable tends to decrease as the independent variable
increases. If the correlation is positive, the dependent variable
tends to increase as the independent variable increases.
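To make the -1 to 1 scale concrete, here is a small sketch with made-up vectors using the base R cor function, which computes the Pearson correlation by default. The last case is a reminder that a correlation near zero only rules out a linear relationship, not every relationship.

```r
x <- 1:10

# Perfect increasing and decreasing linear relationships
r_up   <- cor(x, 2 * x + 1)
r_down <- cor(x, -3 * x + 5)

# A strong but non-linear, symmetric relationship (y = x^2 around zero)
x_sym <- -5:5
r_curve <- cor(x_sym, x_sym^2)

print(c(r_up = r_up, r_down = r_down, r_curve = r_curve))
```

Here r_up is 1, r_down is -1, and r_curve is 0 even though the third y is completely determined by x.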
As the graphs above show, ph_round does not have a linear relationship with any of the other variables. I therefore tried a log transformation to look for a linear relationship.
pH versus Hardness with log transformation: no linear relationship between pH level and Hardness.
water_cleaned$ph_log <- log(water_cleaned$ph_round)
water_cleaned$hardness_log <- log(water_cleaned$Hardness)
ggplot(water_cleaned, aes(x =hardness_log , y = ph_round )) + geom_point()+ geom_smooth()+ggtitle("Hardness_log v.s. PH")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(water_cleaned, aes(x =Hardness , y = ph_log )) + geom_point()+ geom_smooth()+ggtitle("Hardness v.s. PH_log")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
pH versus Solids with log transformation: no linear relationship between pH level and Solids.
water_cleaned$solids_log <- log(water_cleaned$Solids)
ggplot(water_cleaned, aes(x =solids_log , y = ph_round )) + geom_point()+ geom_smooth()+ggtitle("Solid_log v.s. PH")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(water_cleaned, aes(x =Solids , y = ph_log )) + geom_point()+ geom_smooth()+ggtitle("Solids v.s. PH_log")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
pH versus Chloramines with log transformation: no linear relationship between pH level and Chloramines.
water_cleaned$Chloramines_log <- log(water_cleaned$Chloramines)
ggplot(water_cleaned, aes(x =Chloramines_log , y = ph_round )) + geom_point()+ geom_smooth()+ggtitle("Chloramines_log v.s. PH")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(water_cleaned, aes(x =Chloramines , y = ph_log )) + geom_point()+ geom_smooth()+ggtitle("Chloramines v.s. PH_log")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
pH versus Sulfate with log transformation: no linear relationship between pH level and Sulfate.
water_cleaned$Sulfate_log <- log(water_cleaned$Sulfate)
ggplot(water_cleaned, aes(x =Sulfate_log , y = ph_round )) + geom_point()+ geom_smooth()+ggtitle("Sulfate_log v.s. PH")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(water_cleaned, aes(x =Sulfate , y = ph_log )) + geom_point()+ geom_smooth()+ggtitle("Sulfate v.s. PH_log")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
pH versus Conductivity with log transformation: no linear relationship between pH level and Conductivity.
water_cleaned$Conductivity_log <- log(water_cleaned$Conductivity)
ggplot(water_cleaned, aes(x =Conductivity_log , y = ph_round )) + geom_point()+ geom_smooth()+ggtitle("Conductivity_log v.s. PH")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(water_cleaned, aes(x =Conductivity , y = ph_log )) + geom_point()+ geom_smooth()+ggtitle("Conductivity v.s. PH_log")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
pH versus Organic_carbon with log transformation: no linear relationship between pH level and Organic_carbon.
water_cleaned$Organic_carbon_log <- log(water_cleaned$Organic_carbon)
ggplot(water_cleaned, aes(x =Organic_carbon_log, y = ph_round )) + geom_point()+ geom_smooth()+ggtitle("Organic_carbon_log v.s. PH")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(water_cleaned, aes(x =Organic_carbon , y = ph_log )) + geom_point()+ geom_smooth()+ggtitle("Organic_carbon v.s. PH_log")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
pH versus Trihalomethanes with log transformation: no linear relationship between pH level and Trihalomethanes.
water_cleaned$Trihalomethanes_log <- log(water_cleaned$Trihalomethanes)
ggplot(water_cleaned, aes(x =Trihalomethanes_log, y = ph_round )) + geom_point()+ geom_smooth()+ggtitle("Trihalomethanes_log v.s. PH")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(water_cleaned, aes(x =Trihalomethanes , y = ph_log )) + geom_point()+ geom_smooth()+ggtitle("Trihalomethanes v.s. PH_log")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
pH versus Turbidity with log transformation: no linear relationship between pH level and Turbidity.
water_cleaned$Turbidity_log <- log(water_cleaned$Turbidity)
ggplot(water_cleaned, aes(x =Turbidity_log, y = ph_round )) + geom_point()+ geom_smooth()+ggtitle("Turbidity_log v.s. PH")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(water_cleaned, aes(x =Turbidity , y = ph_log )) + geom_point()+ geom_smooth()+ggtitle("Turbidity v.s. PH_log")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Since the log transformation didn’t work, I switched to a square root transformation.
pH versus Hardness with square root transformation: no linear relationship between pH level and Hardness.
water_cleaned$ph_sqrt <- sqrt(water_cleaned$ph_round)
water_cleaned$hardness_sqrt <- sqrt(water_cleaned$Hardness)
ggplot(water_cleaned, aes(x =hardness_sqrt , y = ph_round )) + geom_point()+ geom_smooth()+ggtitle("Hardness_sqrt v.s. PH")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(water_cleaned, aes(x =Hardness , y = ph_sqrt )) + geom_point()+ geom_smooth()+ggtitle("Hardness v.s. PH_sqrt")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
pH versus Solids with square root transformation: no linear relationship between pH level and Solids.
water_cleaned$solids_sqrt <- sqrt(water_cleaned$Solids)
ggplot(water_cleaned, aes(x =solids_sqrt, y = ph_round )) + geom_point()+ geom_smooth()+ggtitle("Solid_sqrt v.s. PH")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(water_cleaned, aes(x =Solids , y = ph_sqrt )) + geom_point()+ geom_smooth()+ggtitle("Solids v.s. PH_sqrt")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
pH versus Chloramines with square root transformation: no linear relationship between pH level and Chloramines.
water_cleaned$Chloramines_sqrt <- sqrt(water_cleaned$Chloramines)
ggplot(water_cleaned, aes(x =Chloramines_sqrt , y = ph_round )) + geom_point()+ geom_smooth()+ggtitle("Chloramines_sqrt v.s. PH")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(water_cleaned, aes(x =Chloramines , y = ph_sqrt )) + geom_point()+ geom_smooth()+ggtitle("Chloramines v.s. PH_sqrt")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
pH versus Sulfate with square root transformation: no linear relationship between pH level and Sulfate.
water_cleaned$Sulfate_sqrt <- sqrt(water_cleaned$Sulfate)
ggplot(water_cleaned, aes(x =Sulfate_sqrt , y = ph_round )) + geom_point()+ geom_smooth()+ggtitle("Sulfate_sqrt v.s. PH")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(water_cleaned, aes(x =Sulfate , y = ph_sqrt )) + geom_point()+ geom_smooth()+ggtitle("Sulfate v.s. PH_sqrt")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
pH versus Conductivity with square root transformation: no linear relationship between pH level and Conductivity.
water_cleaned$Conductivity_sqrt <- sqrt(water_cleaned$Conductivity)
ggplot(water_cleaned, aes(x =Conductivity_sqrt , y = ph_round )) + geom_point()+ geom_smooth()+ggtitle("Conductivity_sqrt v.s. PH")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(water_cleaned, aes(x =Conductivity , y = ph_sqrt )) + geom_point()+ geom_smooth()+ggtitle("Conductivity v.s. PH_sqrt")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
pH versus Organic_carbon with square root transformation: no linear relationship between pH level and Organic_carbon.
water_cleaned$Organic_carbon_sqrt <- sqrt(water_cleaned$Organic_carbon)
ggplot(water_cleaned, aes(x =Organic_carbon_sqrt, y = ph_round )) + geom_point()+ geom_smooth()+ggtitle("Organic_carbon_sqrt v.s. PH")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(water_cleaned, aes(x =Organic_carbon , y = ph_sqrt )) + geom_point()+ geom_smooth()+ggtitle("Organic_carbon v.s. PH_sqrt")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
pH versus Trihalomethanes with square root transformation: no linear relationship between pH level and Trihalomethanes.
water_cleaned$Trihalomethanes_sqrt <- sqrt(water_cleaned$Trihalomethanes)
ggplot(water_cleaned, aes(x =Trihalomethanes_sqrt, y = ph_round )) + geom_point()+ geom_smooth()+ggtitle("Trihalomethanes_sqrt v.s. PH")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(water_cleaned, aes(x =Trihalomethanes , y = ph_sqrt )) + geom_point()+ geom_smooth()+ggtitle("Trihalomethanes v.s. PH_sqrt")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
pH versus Turbidity with square root transformation: no linear relationship between pH level and Turbidity.
water_cleaned$Turbidity_sqrt <- sqrt(water_cleaned$Turbidity)
ggplot(water_cleaned, aes(x =Turbidity_sqrt, y = ph_round )) + geom_point()+ geom_smooth()+ggtitle("Turbidity_sqrt v.s. PH")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(water_cleaned, aes(x =Turbidity , y = ph_sqrt )) + geom_point()+ geom_smooth()+ggtitle("Turbidity v.s. PH_sqrt")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
As the graphs show, the correlation between the pH level and the other variables is low, so I conclude that the pH level is not affected by the other variables in the dataset, and it will be one of the independent variables in the potability test.
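The visual impression from the scatter plots can be backed up with numbers by computing the correlation of ph with every other column, both raw and transformed. The sketch below runs on a made-up data frame called toy with random values, so its output says nothing about the real dataset; only the pattern of the calculation is meant to carry over.

```r
# Hypothetical toy data with the same kind of columns as water_cleaned
set.seed(1)
toy <- data.frame(
  ph        = runif(200, 4, 10),
  Hardness  = runif(200, 100, 300),
  Turbidity = runif(200, 1.5, 6.5)
)

# Every column except ph
others <- setdiff(names(toy), "ph")

# Pearson correlation of ph with each other column, raw and transformed
cor_raw  <- sapply(others, function(v) cor(toy$ph, toy[[v]]))
cor_log  <- sapply(others, function(v) cor(toy$ph, log(toy[[v]])))
cor_sqrt <- sapply(others, function(v) cor(toy$ph, sqrt(toy[[v]])))

print(round(rbind(raw = cor_raw, log = cor_log, sqrt = cor_sqrt), 3))
```

A table like this puts one number on each of the plots above, which makes "the correlation is low" a checkable statement rather than a judgment by eye.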
Potability = (2.875e-01) + (6.291e-03)*PH + (-8.805e-06)*Hardness + (2.371e-06)*Solids + (6.882e-03)*Chloramines + (-9.931e-05)*Sulfate + (-9.217e-05)*Conductivity + (-2.141e-03)*Organic_carbon + (2.880e-04)*Trihalomethanes + (1.408e-02)*Turbidity
model2 <- lm(Potability ~ ph + Hardness + Solids + Chloramines + Sulfate + Conductivity + Organic_carbon + Trihalomethanes + Turbidity, data =
water_cleaned)
summary(model2)
##
## Call:
## lm(formula = Potability ~ ph + Hardness + Solids + Chloramines +
## Sulfate + Conductivity + Organic_carbon + Trihalomethanes +
## Turbidity, data = water_cleaned)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5031 -0.4067 -0.3745 0.5870 0.7012
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.875e-01 1.797e-01 1.600 0.110
## ph 6.291e-03 7.035e-03 0.894 0.371
## Hardness -8.805e-06 3.406e-04 -0.026 0.979
## Solids 2.371e-06 1.294e-06 1.832 0.067 .
## Chloramines 6.882e-03 6.929e-03 0.993 0.321
## Sulfate -9.931e-05 2.715e-04 -0.366 0.715
## Conductivity -9.217e-05 1.358e-04 -0.679 0.497
## Organic_carbon -2.141e-03 3.297e-03 -0.649 0.516
## Trihalomethanes 2.880e-04 6.819e-04 0.422 0.673
## Turbidity 1.408e-02 1.406e-02 1.002 0.316
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4909 on 2001 degrees of freedom
## Multiple R-squared: 0.003638, Adjusted R-squared: -0.0008431
## F-statistic: 0.8119 on 9 and 2001 DF, p-value: 0.6053
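In the output above, no coefficient is significant at the 0.05 level, the multiple R-squared is 0.003638, and the p-value of the F-statistic is 0.6053. The same quantities can be pulled out of a fitted model in code rather than read off the printout. The sketch below fits a small model on a made-up data frame called toy with random values, and also shows that the fitted equation, evaluated by hand, gives the same number as predict.

```r
# Hypothetical toy data: a 0/1 outcome generated independently of two predictors
set.seed(42)
toy <- data.frame(
  Potability = rbinom(300, 1, 0.4),
  ph         = runif(300, 4, 10),
  Turbidity  = runif(300, 1.5, 6.5)
)

fit <- lm(Potability ~ ph + Turbidity, data = toy)
s <- summary(fit)

# Coefficient table: estimate, standard error, t value, p-value
print(coef(s))

# Share of the variance in the outcome explained by the model
print(s$r.squared)

# The fitted equation is intercept + sum(coefficient * value);
# evaluating it by hand reproduces predict()
new_obs <- data.frame(ph = 7, Turbidity = 4)
by_hand <- sum(coef(fit) * c(1, new_obs$ph, new_obs$Turbidity))
by_predict <- unname(predict(fit, newdata = new_obs))
print(c(by_hand = by_hand, by_predict = by_predict))
```

The column named Pr(>|t|) in coef(s) holds the p-values, so coefficients can be filtered by significance without copying numbers from the console.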