Data Mining Lab

Experiment 2

1. Create an excel sheet ( Or CSV format) with following columns and import it into R. This will be a data frame in R. “Sl.No” ,“Reg.No”, “Name”, “SSC %”, “Inter %”," B.Tech %“,”Cap/Non CAP“.

df1<-read.csv("C:/Users/pradeep/OneDrive/dm,cns,cp and JKC/data mining/dm files 2020 passouts/lab/sheet1.csv")
head(df1)

##   Sl.No Reg.No Name SSC.. Inter.. B.Tech.. Cap.Non.CAP
## 1     1    RAJ   16    90      88       75        TRUE
## 2     2  RAKHI   13    98      90       85       FALSE
## 3     3    RAM   23    80      80       65       FALSE
## 4     4    SAI   34    85      85       80        TRUE

2. Create another excel sheet ( Or CSV format) same as above with following columns and import it into R. This will be a data frame in R. “Sl.No” ,“Reg.No”, “Name”, “SSC %”, “Inter %”," B.Tech %“,”Cap/Non CAP“.

df2<-read.csv("C:/Users/pradeep/OneDrive/dm,cns,cp and JKC/data mining/dm files 2020 passouts/lab/sheet2.csv")
head(df2)

##   Sl.No   Reg.No         Name SSC.. Inter.. B.Tech.. Cap.Non.CAP
## 1     1 PREETHAM          5B4    93      91       75        TRUE
## 2     2    VARMA          5B5    98      97       81        TRUE
## 3     3  KARTHIK          5B6    98      98       85        TRUE

3a. Now, after creating the two data frames (above), merge them and create another data frame.

df<-merge(df1,df2,all=TRUE)
df

##   Sl.No   Reg.No         Name SSC.. Inter.. B.Tech.. Cap.Non.CAP
## 1     1      RAJ           16    90      88       75        TRUE
## 2     1 PREETHAM          5B4    93      91       75        TRUE
## 3     2    RAKHI           13    98      90       85       FALSE
## 4     2    VARMA          5B5    98      97       81        TRUE
## 5     3      RAM           23    80      80       65       FALSE
## 6     3  KARTHIK          5B6    98      98       85        TRUE
## 7     4      SAI           34    85      85       80        TRUE

3b. Display its top 2 rows of the merged data frame.

# here, in c(1,2)  c stands for concatenate. And is used for combing more than 1 entity
df[c(1,2),]

##   Sl.No   Reg.No   Name SSC.. Inter.. B.Tech.. Cap.Non.CAP
## 1     1      RAJ     16    90      88       75        TRUE
## 2     1 PREETHAM    5B4    93      91       75        TRUE

# You can also use head command: head(df,2)

4. Now export the above data frame into csv format.

write.csv(df,“df.csv”)

# To write the file use command:  write.csv(df,"df.csv")

5. Consider a dataset below (df3) . Predict the intermediate percentage of 11PA1A0503 by taking mean of other students and substitute in the place of the NA.

df3<-read.csv("C:/Users/pradeep/OneDrive/dm,cns,cp and JKC/data mining/dm files 2020 passouts/lab/sheet3.csv")
df3

##   Sl.No    Roll.No      Name  SSC inter B.Tech.perc CAP..NOT.CAP
## 1     1 11PA1A0501   AVINASH 87.3  65.3       40.00          CAP
## 2     2 11PA1A0503 KEERTHANA 89.0    NA       71.45      NOT_CAP
## 3     3 11PA1A0504 PRAVALIKA 67.0  68.0       45.26      NOT_CAP
## 4     4 11PA1A0505  SAI BABA   NA  67.0          NA          CAP
## 5     5 11PA1A0506  SRINIVAS 67.0  65.5       42.52          CAP

Now replace the missing values in Inter percentage

# Here, na.rm=TRUE because we want to calculate mean without considering NA. If this is not used, then we get an error.
df3$inter[is.na(df3$inter)]<-mean(df3$inter,na.rm=TRUE)
df3

##   Sl.No    Roll.No      Name  SSC inter B.Tech.perc CAP..NOT.CAP
## 1     1 11PA1A0501   AVINASH 87.3 65.30       40.00          CAP
## 2     2 11PA1A0503 KEERTHANA 89.0 66.45       71.45      NOT_CAP
## 3     3 11PA1A0504 PRAVALIKA 67.0 68.00       45.26      NOT_CAP
## 4     4 11PA1A0505  SAI BABA   NA 67.00          NA          CAP
## 5     5 11PA1A0506  SRINIVAS 67.0 65.50       42.52          CAP

6) Notice that in the above table(df3), 4th row is having more NA’s. It’s a good choice to remove the 4th row and having a new dataframe without 4th row. Remove the 4^th row.

# Use prefix minus ( - ) before the row number to remove that row
df3[-4,]

##   Sl.No    Roll.No      Name  SSC inter B.Tech.perc CAP..NOT.CAP
## 1     1 11PA1A0501   AVINASH 87.3 65.30       40.00          CAP
## 2     2 11PA1A0503 KEERTHANA 89.0 66.45       71.45      NOT_CAP
## 3     3 11PA1A0504 PRAVALIKA 67.0 68.00       45.26      NOT_CAP
## 5     5 11PA1A0506  SRINIVAS 67.0 65.50       42.52          CAP

Data Mining Lab

I Kali Pradeep

March 10, 2019

Experiment 2

1. Create an excel sheet ( Or CSV format) with following columns and import it into R. This will be a data frame in R. “Sl.No” ,“Reg.No”, “Name”, “SSC %”, “Inter %”," B.Tech %“,”Cap/Non CAP“.

2. Create another excel sheet ( Or CSV format) same as above with following columns and import it into R. This will be a data frame in R. “Sl.No” ,“Reg.No”, “Name”, “SSC %”, “Inter %”," B.Tech %“,”Cap/Non CAP“.

3a. Now, after creating the two data frames (above), merge them and create another data frame.

3b. Display its top 2 rows of the merged data frame.

4. Now export the above data frame into csv format.

5. Consider a dataset below (df3) . Predict the intermediate percentage of 11PA1A0503 by taking mean of other students and substitute in the place of the NA.

6) Notice that in the above table(df3), 4th row is having more NA’s. It’s a good choice to remove the 4th row and having a new dataframe without 4th row. Remove the 4^th row.

Data Mining Lab

I Kali Pradeep

March 10, 2019

Experiment 2

1. Create an excel sheet ( Or CSV format) with following columns and import it into R. This will be a data frame in R. “Sl.No” ,“Reg.No”, “Name”, “SSC %”, “Inter %”," B.Tech %“,”Cap/Non CAP“.

2. Create another excel sheet ( Or CSV format) same as above with following columns and import it into R. This will be a data frame in R. “Sl.No” ,“Reg.No”, “Name”, “SSC %”, “Inter %”," B.Tech %“,”Cap/Non CAP“.

3a. Now, after creating the two data frames (above), merge them and create another data frame.

3b. Display its top 2 rows of the merged data frame.

4. Now export the above data frame into csv format.

5. Consider a dataset below (df3) . Predict the intermediate percentage of 11PA1A0503 by taking mean of other students and substitute in the place of the NA.

6) Notice that in the above table(df3), 4th row is having more NA’s. It’s a good choice to remove the 4th row and having a new dataframe without 4th row. Remove the 4th row.

6) Notice that in the above table(df3), 4th row is having more NA’s. It’s a good choice to remove the 4th row and having a new dataframe without 4th row. Remove the 4^th row.