1. Read the data.

MBASalaries.df <- read.csv("C:/interships/MBA Starting Salaries Data.csv")
View(MBASalaries.df)

2. Summary statistics for the variable “age”.

summary(MBASalaries.df$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22.00   25.00   27.00   27.36   29.00   48.00

This tells us that the average age of the students is about 27 years (mean 27.36, median 27).

3. Summary of the GMAT total scores.

summary(MBASalaries.df$gmat_tot)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   450.0   580.0   620.0   619.5   660.0   790.0

The mean GMAT total score is well below the highest score achieved: the maximum is 790, whereas the mean is about 619.5.

4. Table showing the proportion of male and female students.

table(MBASalaries.df$sex)
## 
##   1   2 
## 206  68
prop.table(table(MBASalaries.df$sex))
## 
##         1         2 
## 0.7518248 0.2481752

About 75.18% of the students in the program were male and only 24.82% were female.

5. Table showing the salaries of all the students.
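The percentages can be recomputed directly from the counts in the table. This is a minimal sketch that rebuilds the sex column from those counts; the labels "male" and "female" for codes 1 and 2 follow the reading used in the text and are an assumption about the coding.

```r
# Rebuild the sex column from the counts printed above (206 coded 1, 68 coded 2).
sex <- rep(c(1, 2), times = c(206, 68))

# Label the codes; 1 = male and 2 = female is the reading assumed in the text.
sex_f <- factor(sex, levels = c(1, 2), labels = c("male", "female"))

# Proportions expressed as percentages, rounded to two decimals.
pct <- round(100 * prop.table(table(sex_f)), 2)
print(pct)  # male 75.18, female 24.82
```

Labelling the factor once keeps later tables and plots readable without having to remember which code is which.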

table(MBASalaries.df$salary)
## 
##      0    998    999  64000  77000  78256  82000  85000  86000  88000 
##     90     46     35      1      1      1      1      4      2      1 
##  88500  90000  92000  93000  95000  96000  96500  97000  98000  99000 
##      1      3      3      3      7      4      1      2     10      1 
## 100000 100400 101000 101100 101600 102500 103000 104000 105000 106000 
##      9      1      2      1      1      1      1      2     11      3 
## 107000 107300 107500 108000 110000 112000 115000 118000 120000 126710 
##      1      1      1      2      1      3      5      1      4      1 
## 130000 145800 146000 162000 220000 
##      1      1      1      1      1

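The values 998 and 999 are evidently placeholder codes for undisclosed or missing salaries rather than real amounts: 46 students are coded 998 and 35 are coded 999, and a further 90 have a recorded salary of 0. We will see whether test scores are related to salary.

6. Table showing the students' satisfaction ratings.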
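Before averaging or modelling salary, placeholder values such as 998 and 999 would normally be set to missing. The sketch below uses a small made-up vector, not the real data, and treating 998 and 999 as missing-value codes is an assumption based on how they appear in the table.

```r
# Illustrative values only; not taken from the data set.
salary <- c(0, 998, 999, 64000, 85000, 105000)

# Assumption: 998 and 999 are missing-value codes, so recode them to NA.
salary_clean <- replace(salary, salary %in% c(998, 999), NA)

# Keep only students with a real, positive salary.
placed <- salary_clean[!is.na(salary_clean) & salary_clean > 0]

print(salary_clean)
print(mean(placed))
```

On the real data frame the same idea would apply to `MBASalaries.df$salary` before subsetting.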

table(MBASalaries.df$satis)
## 
##   1   2   3   4   5   6   7 998 
##   1   1   5  17  74  97  33  46

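This table shows that most students were well satisfied with the management program: of the 228 who gave a rating, 204 rated it 5 or higher (the observed ratings run from 1 to 7). The remaining 46 are coded 998 and gave no rating.

7. Summary of gmat_tpc (the GMAT percentile).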
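The share of satisfied students can be computed once the 998 code is set aside. This sketch rebuilds the ratings from the counts in the table; reading 998 as "no rating given" and taking 5 or above as "satisfied" are both assumptions.

```r
# Rebuild the ratings from the printed counts (values 1 to 7, plus the 998 code).
satis <- rep(c(1:7, 998), times = c(1, 1, 5, 17, 74, 97, 33, 46))

# Assumption: 998 marks students who gave no rating, so drop it.
answered <- satis[satis != 998]

# Share of respondents with a rating of 5 or higher.
share_high <- mean(answered >= 5)

print(length(answered))
print(round(share_high, 3))
```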

summary(MBASalaries.df$gmat_tpc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    78.0    87.0    84.2    94.0    99.0

8. Table showing the number of students whose first language is English.

table(MBASalaries.df$frstlang)
## 
##   1   2 
## 242  32

This table shows that 242 students have English as their first language and 32 have some other first language. We will see whether first language affects the students' scores and salaries.

9. Hypothesis test: the null hypothesis is that the mean salary is the same for male and female students; the alternative is that it differs.

library(psych)
# Keep only rows with a non-zero salary (rows coded 998 and 999 are still included).
mba_job <- MBASalaries.df[which(MBASalaries.df$salary != 0), ]
t.test(mba_job$sex, mba_job$salary, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  mba_job$sex and mba_job$salary
## t = -15.012, df = 366, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -65725.64 -50500.56
## sample estimates:
##    mean of x    mean of y 
##     1.244565 58114.342391

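The p-value is below 0.05, but this call compares the mean of the sex codes (about 1.24) with the mean of the salary column (about 58,114), so it only shows that those two columns have different means. It does not tell us whether male and female salaries differ, and it does not show that salary is independent of sex. Testing the hypothesis requires comparing salary between the two sex groups, for example with t.test(salary ~ sex, data = mba_job).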
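A two-group comparison of salary by sex uses the formula interface of `t.test`. The sketch below runs on simulated data, so its numbers say nothing about the MBA students; the data frame `toy` and its values are hypothetical.

```r
set.seed(1)

# Simulated data: 30 students per group, similar salary distributions.
toy <- data.frame(
  sex    = rep(c(1, 2), each = 30),
  salary = c(rnorm(30, mean = 100000, sd = 15000),
             rnorm(30, mean = 98000,  sd = 15000))
)
toy$sex <- factor(toy$sex, levels = c(1, 2), labels = c("male", "female"))

# Welch two-sample t-test of salary between the two groups.
res <- t.test(salary ~ sex, data = toy)

print(res$estimate)  # one mean per group
print(res$p.value)
```

With the formula `salary ~ sex`, the two sample estimates are the group means of salary, which is what the hypothesis is about.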

10. Boxplot of all the variables.

boxplot(MBASalaries.df, xlab = "value", ylab = "variable", horizontal = TRUE)

11. Regression analysis 1.

mba_job<- MBASalaries.df[which(MBASalaries.df$salary!='0'),]
salaryregg<-lm(mba_job$salary~mba_job$gmat_tot+mba_job$sex+mba_job$gmat_tpc+mba_job$work_yrs+mba_job$frstlang)
summary(salaryregg)
## 
## Call:
## lm(formula = mba_job$salary ~ mba_job$gmat_tot + mba_job$sex + 
##     mba_job$gmat_tpc + mba_job$work_yrs + mba_job$frstlang)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -88592 -51648  20551  39847 126530 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)   
## (Intercept)      177593.0    55338.3   3.209  0.00158 **
## mba_job$gmat_tot   -286.4      134.4  -2.131  0.03447 * 
## mba_job$sex       14577.9     8679.1   1.680  0.09478 . 
## mba_job$gmat_tpc    753.4      565.0   1.333  0.18407   
## mba_job$work_yrs   3797.4     1519.1   2.500  0.01333 * 
## mba_job$frstlang -32729.2    11226.1  -2.915  0.00401 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50350 on 178 degrees of freedom
## Multiple R-squared:  0.1057, Adjusted R-squared:  0.08053 
## F-statistic: 4.206 on 5 and 178 DF,  p-value: 0.001228

12. Regression analysis 2.

mba_job<- MBASalaries.df[which(MBASalaries.df$salary!='0'),]
salaryregg<-lm(mba_job$salary~mba_job$gmat_tot+mba_job$sex+mba_job$gmat_tpc+mba_job$work_yrs+mba_job$frstlang+mba_job$age)
summary(salaryregg)
## 
## Call:
## lm(formula = mba_job$salary ~ mba_job$gmat_tot + mba_job$sex + 
##     mba_job$gmat_tpc + mba_job$work_yrs + mba_job$frstlang + 
##     mba_job$age)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -86355 -51529  20692  39672 125538 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)   
## (Intercept)      166385.2    76804.1   2.166  0.03162 * 
## mba_job$gmat_tot   -292.6      137.9  -2.122  0.03527 * 
## mba_job$sex       14845.7     8794.5   1.688  0.09316 . 
## mba_job$gmat_tpc    780.6      581.0   1.344  0.18081   
## mba_job$work_yrs   3290.2     2845.2   1.156  0.24907   
## mba_job$frstlang -33379.5    11670.4  -2.860  0.00474 **
## mba_job$age         556.8     2638.0   0.211  0.83309   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50490 on 177 degrees of freedom
## Multiple R-squared:  0.1059, Adjusted R-squared:  0.07557 
## F-statistic: 3.493 on 6 and 177 DF,  p-value: 0.002736

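Comparing the two analyses, adding age changes very little: R-squared moves from 0.1057 to 0.1059 and adjusted R-squared falls from 0.0805 to 0.0756, so regression 2 includes more predictors but does not fit better. Neither model fits well, since each explains only about 10% of the variation in salary. This may be because only 184 rows are used, and because the filter on non-zero salary keeps the rows coded 998 and 999. The asterisks beside the coefficients mark those that are statistically significant at the levels given in the "Signif. codes" line: gmat_tot and frstlang in both models, and work_yrs only in regression 1.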
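Two nested regressions can be compared through adjusted R-squared and an F-test. The following is a sketch on simulated data with hypothetical values; only the column names echo the variables used above.

```r
set.seed(42)
n <- 100

# Simulated predictors; age is related to work experience but not identical to it.
work_yrs <- sample(0:10, n, replace = TRUE)
toy <- data.frame(
  gmat_tot = round(rnorm(n, mean = 620, sd = 50)),
  work_yrs = work_yrs,
  age      = 22 + work_yrs + sample(0:4, n, replace = TRUE)
)
toy$salary <- 60000 + 3000 * toy$work_yrs + rnorm(n, sd = 10000)

# Passing data = toy keeps the coefficient names short and readable.
fit1 <- lm(salary ~ gmat_tot + work_yrs, data = toy)
fit2 <- lm(salary ~ gmat_tot + work_yrs + age, data = toy)

# Adjusted R-squared penalises the extra predictor; anova() tests whether it helps.
print(c(summary(fit1)$adj.r.squared, summary(fit2)$adj.r.squared))
print(anova(fit1, fit2))
```

Plain R-squared can never decrease when a predictor is added, which is why adjusted R-squared or the F-test is the more useful comparison.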

13. Scatterplot between salary and sex.

library("car", lib.loc="~/R/win-library/3.4")
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
scatterplot(mba_job$sex,mba_job$salary,xlab = "sex", ylab = "salary")

14. Scatterplot between salary and age.

library("car", lib.loc="~/R/win-library/3.4")
scatterplot(mba_job$age,mba_job$salary,xlab = "age", ylab = "salary")

15. Corrgram

library("corrgram", lib.loc="~/R/win-library/3.4")
corrgram(mba_job, order=TRUE, lower.panel=panel.shade,
  upper.panel=panel.pie, text.panel=panel.txt,
  main="Corrgram of Variables")

16. Summary

This analysis shows that 90 of the 274 students have a recorded salary of 0 and were left out of the job-holder subset. Most of the students who rated the management programme were satisfied with it. The regression analysis gave a broader view of the factors associated with salary, with first language, GMAT total score and work experience standing out, although the models explain only about 10% of the variation in salary. The t-test as run compared the sex column with the salary column, so it does not establish whether salary is independent of sex; in the regressions the sex coefficient is not significant at the 5% level. The scatterplots show how salary varies with sex and with age. Overall, the models are a starting point rather than a good fit.