library(readr)
library(xlsx)
library(readxl)
library(foreign)
library(gdata)
library(rvest)
library(dplyr)
library(tidyr)
library(deductive)
library(deducorrect)
library(editrules)
library(validate)
library(Hmisc)
library(forecast)
library(stringr)
library(lubridate)
library(car)
library(outliers)
library(MVN)
library(infotheo)
library(MASS)
library(caret)
library(mlr)
library(ggplot2)
library(knitr)
library(xlsx2dfs)
GSDP is gross state domestic product. GSDP will be calculated among the countries in the world. In a country we can also calculate that how much does the particular state contributes to the country GSDP. The GSDP is calculated on the basis of the expenditure of the people living in the particular sate. Expenditure consists of the money earned and spent by the people across the state. It is used to measure the complete economy of the country in order to say whether the country is financially stable or not. More the expenditure of the people and more the earnings of individual contributed to the country’s GDP. If the GDP of the country is more, then its more financially stable. We have two data sets which tells the story of GSDP of country India in the year between 2015-2016 and the literacy rate of the total population. Even literacy plays a very vital role in the calculating the country’s GDP , as the literacy rate is a major part to contribute to country GDP. It is closely related as the region with good literacy rate of people contributes to more earnings of the individual which impacts the expenditure of individual and the country. Eventually with more GDP value , the country will have more financial freedom and adding to this the people in the country will be more benefitted by this because all the goods which will be generated in the country will be circulated around the world and it increases the wealth of the country.
Here, in this data we have taken two datasets from the website called Kaggle. One dataset contains the GSDP of the sates in India and in another dataset, we have taken the literacy rate of people across the states across India. country GSDP depends on the economy of country, and the education sector plays a very important role in building the GSDP of that country.
Literacy Rate <- https://www.kaggle.com/rajanand/education-in-india?select=2015_16_Statewise_Secondary.csv
GSDP <- https://www.kaggle.com/ankandash/gdp-data-for-indian-states
Firstly, we set a particular path to read the file and perform further analysis.We read the file using the function read_xlsx for the excel files and read.csv for the csv files.
setwd("C:/Users/ravindra/Desktop/Subjects/DATA WRANGLING/ASSIGNMENT/ASSIGNMENT 2/Datasets/final dataset")
The working directory was changed to C:/Users/ravindra/Desktop/Subjects/DATA WRANGLING/ASSIGNMENT/ASSIGNMENT 2/Datasets/final dataset inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.
getwd()
[1] "C:/Users/ravindra/Desktop/Subjects/DATA WRANGLING/ASSIGNMENT/ASSIGNMENT 2/Datasets/final dataset"
education <- read_xlsx("2015_16_Statewise_Secondary.xlsx")
gdp <- read.csv("GSDP.csv")
Secondly after reading the files we will come to know how many variables are present in each file. Then, moving forward the data is subsetted because there will be “n” number of variables which will be not appropriate or related to the analysis we are doing in the project. I am subsettting both the data in order to take only the desired values for further analysis. To merge the data sets we need to have the similar variables where it would be appropriate to merge the data set and use it for the further analysis. Here, we can see that the variable name ststname is converted to state and parallelly duration is changed to ac_year. The str() function is use to find the structure of the particular variables present in the dataset.
education2 <-education[-c(14:630)]
str(education2)
tibble [36 x 13] (S3: tbl_df/tbl/data.frame)
$ ac_year : chr [1:36] "2015-16" "2015-16" "2015-16" "2015-16" ...
$ statcd : num [1:36] 1 2 3 4 5 6 7 8 9 10 ...
$ statname : chr [1:36] "Jammu And Kashmir" "Himachal Pradesh" "Punjab" "Chandigarh" ...
$ area_sqkm : num [1:36] 222236 55673 50362 114 53483 ...
$ tot_population : num [1:36] 12549 6857 27704 1055 10117 ...
$ urban_population : num [1:36] 20.05 8.69 29.82 76.66 21.54 ...
$ grwoth_rate : num [1:36] 23.7 12.8 13.7 17.1 19.2 ...
$ sexratio : num [1:36] 883 974 893 818 963 877 866 926 908 916 ...
$ sc_population : num [1:36] 7.4 25.2 31.9 18.9 18.8 20.2 16.8 17.8 20.7 15.9 ...
$ st_population : num [1:36] 11.9 5.7 0 0 2.9 0 0 13.5 0.6 1.3 ...
$ literacy_rate : num [1:36] 68.7 83.8 76.7 86.4 79.6 ...
$ male_literacy_rate : num [1:36] 78.3 90.8 81.5 90.5 88.3 ...
$ female_literacy_rate: num [1:36] 58 76.6 71.3 81.4 70.7 ...
names(education2)[names(education2)=="statname"]<-"State"
gdp2 <- gdp[-c(1:4,6:11),]
str(gdp2)
'data.frame': 1 obs. of 36 variables:
$ Items..Description : chr "GSDP - CURRENT PRICES (` in Crore)"
$ Duration : chr "2015-16"
$ Andhra.Pradesh : num 609934
$ Arunachal.Pradesh : num 18784
$ Assam : num 224234
$ Bihar : num 413503
$ Chhattisgarh : num 260776
$ Goa : num 45002
$ Gujarat : num 994316
$ Haryana : num 485184
$ Himachal.Pradesh : num NA
$ Jammu...Kashmir : num 118387
$ Jharkhand : num 241955
$ Karnataka : num 1027068
$ Kerala : num 588337
$ Madhya.Pradesh : num 543975
$ Maharashtra : num NA
$ Manipur : num NA
$ Meghalaya : num 26745
$ Mizoram : num NA
$ Nagaland : num NA
$ Odisha : num 341887
$ Punjab : num NA
$ Rajasthan : num NA
$ Sikkim : num 16637
$ Tamil.Nadu : num 1212668
$ Telangana : num 575631
$ Tripura : num NA
$ Uttar.Pradesh : num 1153795
$ Uttarakhand : num 184091
$ West.Bengal1 : logi NA
$ Andaman...Nicobar.Islands: num NA
$ Chandigarh : num 30304
$ Delhi : num 551963
$ Puducherry : num 26533
$ All_India.GDP : num 13675331
names(gdp2)[names(gdp2)=="Duration"]<-"ac_year"
Here in one of the datasets the data was untidy. In order to make the untidy data into the tidy data the gather () function was used. All the values which were mixed up in the columns where made in the form of rows. The tidy data was given the key value as “state” and the value as “GSDP” current price. While making the untidy data into tidy data the representation of the data looks good and we can extract the variables in the simple and standard form.
gdp2<- gdp2%>% gather('Andhra.Pradesh','Arunachal.Pradesh','Assam','Bihar','Chhattisgarh','Goa','Gujarat','Haryana','Himachal.Pradesh','Jammu...Kashmir','Jharkhand','Karnataka','Kerala','Madhya.Pradesh','Maharashtra','Manipur','Meghalaya','Mizoram','Nagaland','Odisha','Punjab','Rajasthan','Sikkim','Tamil.Nadu','Telangana','Tripura','Uttar.Pradesh','Uttarakhand','West.Bengal1','Andaman...Nicobar.Islands','Chandigarh','Delhi','Puducherry','All_India.GDP',key = "State",value = "GSDP Current Price")
After tidying the data, we need to join the two datasets. I have used two common variables state and ac_year to join those datasets for further analysis. A new variable z is created and we find the structure of the combined dataset. I have used the inner join () function to merge the datasets. Then chances of losing the missing values will be less when we perform the inner join function. Then further moving levels is used to access the levels of attribute of the variables.
z<-inner_join(education2,gdp2,by= c("State","ac_year"))
str(z)
tibble [24 x 15] (S3: tbl_df/tbl/data.frame)
$ ac_year : chr [1:24] "2015-16" "2015-16" "2015-16" "2015-16" ...
$ statcd : num [1:24] 3 4 5 6 7 8 10 11 13 14 ...
$ State : chr [1:24] "Punjab" "Chandigarh" "Uttarakhand" "Haryana" ...
$ area_sqkm : num [1:24] 50362 114 53483 44212 1483 ...
$ tot_population : num [1:24] 27704 1055 10117 25353 16753 ...
$ urban_population : num [1:24] 29.8 76.7 21.5 24.1 77 ...
$ grwoth_rate : num [1:24] 13.7 17.1 19.2 19.9 21 ...
$ sexratio : num [1:24] 893 818 963 877 866 926 916 889 931 987 ...
$ sc_population : num [1:24] 31.9 18.9 18.8 20.2 16.8 17.8 15.9 4.6 0 3.8 ...
$ st_population : num [1:24] 0 0 2.9 0 0 13.5 1.3 33.8 86.5 35.1 ...
$ literacy_rate : num [1:24] 76.7 86.4 79.6 76.6 86.3 ...
$ male_literacy_rate : num [1:24] 81.5 90.5 88.3 85.4 91 ...
$ female_literacy_rate: num [1:24] 71.3 81.4 70.7 66.8 80.9 ...
$ Items..Description : chr [1:24] "GSDP - CURRENT PRICES (` in Crore)" "GSDP - CURRENT PRICES (` in Crore)" "GSDP - CURRENT PRICES (` in Crore)" "GSDP - CURRENT PRICES (` in Crore)" ...
$ GSDP Current Price : num [1:24] NA 30304 184091 485184 551963 ...
z$State<-factor(z$State,levels=c("Andhra.Pradesh", "Arunachal.Pradesh","Assam","Bihar","Chhattisgarh","Goa","Gujarat","Haryana","Himachal.Pradesh","Jammu...Kashmir","Jharkhand","Karnataka","Kerala","Madhya.Pradesh","Maharashtra","Manipur","Meghalaya","Mizoram","Nagaland","Odisha","Punjab","Rajasthan","Sikkim","Tamil.Nadu","Telangana","Tripura","Uttar.Pradesh","Uttarakhand","West.Bengal1","Andaman...Nicobar.Islands","Chandigarh","Delhi","Puducherry"),ordered = TRUE)
levels(z$State)
[1] "Andhra.Pradesh" "Arunachal.Pradesh" "Assam"
[4] "Bihar" "Chhattisgarh" "Goa"
[7] "Gujarat" "Haryana" "Himachal.Pradesh"
[10] "Jammu...Kashmir" "Jharkhand" "Karnataka"
[13] "Kerala" "Madhya.Pradesh" "Maharashtra"
[16] "Manipur" "Meghalaya" "Mizoram"
[19] "Nagaland" "Odisha" "Punjab"
[22] "Rajasthan" "Sikkim" "Tamil.Nadu"
[25] "Telangana" "Tripura" "Uttar.Pradesh"
[28] "Uttarakhand" "West.Bengal1" "Andaman...Nicobar.Islands"
[31] "Chandigarh" "Delhi" "Puducherry"
For further analysis, I have muted a new variable called “Male to female Literacy ratio” with the help of two variables which were already present in the dataset. We mutate the new variables for the better analysis of the data so that we get the broader analysis of the data or we can even simply the data by adding the new variables for betterment understanding.
y<- mutate(z,"Male to Female literacy ratio"=(z$male_literacy_rate/ z$female_literacy_rate))
is.na() function is used to check whether we have missing values or special values in the dataset. In this datasets I came across few missing values, to remove the missing values and to delete the incomplete observations I have used na.omit() function.
colSums(is.na(y))
ac_year statcd State
0 0 0
area_sqkm tot_population urban_population
1 1 1
grwoth_rate sexratio sc_population
1 1 1
st_population literacy_rate male_literacy_rate
1 1 1
female_literacy_rate Items..Description GSDP Current Price
1 0 7
Male to Female literacy ratio
1
xy <- na.omit(y)
xy
NA
Once we overcome the missing values, we need to check for any outliners if present. The outliners would be of any form it would be reading error, human error, or recording errors. But in our datasets, I did not find any outliners in any one of the variables. The data looks pretty much descent without any wrong observations which would give us the outliners. I have used the boxplot method to check for the outliners
boxplot(xy$male_literacy_rate ,main="Male literacy rate")
boxplot(xy$female_literacy_rate, main="Female literacy rate")
boxplot(xy$statcd, main="Statcd")
boxplot(xy$State, main="State")
boxplot(xy$area_sqkm, main="Area in square km")
boxplot(xy$tot_population , main="Total Polpulation")
boxplot(xy$urban_population, main="Urban population")
boxplot(xy$grwoth_rate, main="Growth rate")
boxplot(xy$sexratio,main="Sex ratio")
boxplot(xy$sc_population,main="Sc population")
boxplot(xy$st_population,main="Total St population")
boxplot(xy$literacy_rate,main="Literacy rate")
boxplot(xy$`GSDP Current Price`,main="GSDP current price")
b <- boxplot(xy$`Male to Female literacy ratio`)
b$out
numeric(0)
For the better development of data analysis and representation of data we use transformations. Firstly, I have created a histogram for the pictorial representation and then proceeding further I have applied the boxcox transformation and the sqrt transformation. For the male literacy I have used the boxcox transformation, the sqrt transformation had the right skewness so to overcome that I have used the bocox transformation, For the female literacy and the ratio of male to female literacy the sqrt transformation gave the better representation when compared to boxcox transformation.
hist(xy$male_literacy_rate, main = "Histogram of male literacy rate ", xlab = "males")
hist(xy$female_literacy_rate, main = "Histogram of female literacy rate ", xlab = "females")
hist(xy$`Male to Female literacy ratio`, main = "Histogram of male to female literacy rate ", xlab ="Comparision")
boxcox_male<- BoxCox(xy$male_literacy_rate,lambda = "auto")
hist(boxcox_male)
sqrt_female<-sqrt(xy$female_literacy_rate)
hist(sqrt_female,main = "Histogram after transformation - FeMale literacy rate", xlab = "FeMales")
sqrt_comparision<-sqrt(xy$`Male to Female literacy ratio`)
hist(sqrt_comparision,main = "Histogram after transformation - Male and female literacy rate", xlab = "comparision")
NA
NA
##References Data1<- https://www.kaggle.com/rajanand/education-in-india?select=2015_16_Statewise_Secondary.csv
Data2<- https://www.kaggle.com/ankandash/gdp-data-for-indian-states
Url <- https://www.kaggle.com/