Loading the data
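from kaggle.api.kaggle_api_extended import KaggleApi
from zipfile import ZipFile
import os

# Authenticate against the Kaggle API (requires Kaggle credentials to be configured)
api = KaggleApi()
api.authenticate()

# Download the dataset archive into the current working directory
api.dataset_download_files('ruchi798/data-science-job-salaries')

# The extracted data will be saved in the following folder:
with ZipFile('data-science-job-salaries.zip') as zf:
    zf.extractall('data/')

# Delete the archive once its contents have been extracted
os.remove('data-science-job-salaries.zip')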
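The extract-then-delete step can be tried without Kaggle credentials. The following is a minimal sketch using only the standard library; the helper name `extract_and_remove` and the throwaway archive it builds are assumptions made for illustration and are not part of the original workflow.

```python
import os
import tempfile
from zipfile import ZipFile


def extract_and_remove(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir, then delete the archive."""
    # The context manager closes the archive even if extraction fails
    with ZipFile(zip_path) as zf:
        zf.extractall(dest_dir)
        names = zf.namelist()
    os.remove(zip_path)
    return names


# Build a small throwaway archive so the sketch runs on its own
workdir = tempfile.mkdtemp()
zip_path = os.path.join(workdir, "example.zip")
with ZipFile(zip_path, "w") as zf:
    zf.writestr("example.csv", "a,b\n1,2\n")

dest_dir = os.path.join(workdir, "data")
extracted = extract_and_remove(zip_path, dest_dir)
print(extracted)
```

Using `with` instead of an explicit `close()` means the archive is closed before `os.remove` runs, which matters on platforms that refuse to delete open files.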
import pandas as pd

data = pd.read_csv('data/ds_salaries.csv')
# Drop the first column and keep the remaining ones
data = data.iloc[:, 1:]
data.head()
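The `iloc[:, 1:]` slice drops the first column by position. Assuming that column is an unnamed row index written when the CSV was saved (an assumption; the original does not say what it contains), a roughly equivalent option is to pass `index_col=0` to `read_csv`. The sketch below uses a small in-memory CSV with two rows taken from the `data.head()` output rather than the real file.

```python
import io

import pandas as pd

# Two rows from the data.head() output, with a leading unnamed index column
# (the layout of the real file is assumed, not confirmed by the original)
csv_text = """,work_year,experience_level,salary_in_usd
0,2020,MI,79833
1,2020,SE,260000
"""

# Option 1: read everything, then drop the first column by position
by_position = pd.read_csv(io.StringIO(csv_text)).iloc[:, 1:]

# Option 2: let pandas treat the first column as the row index
by_index_col = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(by_position.columns))
print(list(by_index_col.columns))
```

Both options leave the same data columns; the second keeps the stored values as the row index instead of discarding them.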
   work_year  experience_level  employment_type                   job_title  salary  salary_currency  salary_in_usd  employee_residence  remote_ratio  company_location  company_size
0       2020                MI               FT              Data Scientist   70000              EUR          79833                  DE             0                DE             L
1       2020                SE               FT  Machine Learning Scientist  260000              USD         260000                  JP             0                JP             S
2       2020                SE               FT           Big Data Engineer   85000              GBP         109024                  GB            50                GB             M
3       2020                MI               FT        Product Data Analyst   20000              USD          20000                  HN             0                HN             S
4       2020                SE               FT   Machine Learning Engineer  150000              USD         150000                  US            50                US             L
Exploratory Data Analysis
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 607 entries, 0 to 606
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   work_year           607 non-null    int64
 1   experience_level    607 non-null    object
 2   employment_type     607 non-null    object
 3   job_title           607 non-null    object
 4   salary              607 non-null    int64
 5   salary_currency     607 non-null    object
 6   salary_in_usd       607 non-null    int64
 7   employee_residence  607 non-null    object
 8   remote_ratio        607 non-null    int64
 9   company_location    607 non-null    object
 10  company_size        607 non-null    object
dtypes: int64(4), object(7)
memory usage: 52.3+ KB
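The code for this cell is not shown in the page. The summary above has the layout printed by `DataFrame.info()`, so the cell was presumably a call along those lines. A minimal sketch, assuming a small frame built from three of the columns and the five rows listed by `data.head()`:

```python
import io

import pandas as pd

# Five rows from the data.head() output (three of the eleven columns)
sample = pd.DataFrame({
    "work_year": [2020, 2020, 2020, 2020, 2020],
    "job_title": [
        "Data Scientist",
        "Machine Learning Scientist",
        "Big Data Engineer",
        "Product Data Analyst",
        "Machine Learning Engineer",
    ],
    "salary_in_usd": [79833, 260000, 109024, 20000, 150000],
})

# info() prints to stdout by default; a buffer keeps the text for inspection
buffer = io.StringIO()
sample.info(buf=buffer)
report = buffer.getvalue()
print(report)
```

On the full dataset, every column reports 607 non-null values, so there are no missing entries to handle before further analysis.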
         work_year        salary  salary_in_usd  remote_ratio
count   607.000000  6.070000e+02     607.000000     607.00000
mean   2021.405272  3.240001e+05  112297.869852      70.92257
std       0.692133  1.544357e+06   70957.259411      40.70913
min    2020.000000  4.000000e+03    2859.000000       0.00000
25%    2021.000000  7.000000e+04   62726.000000      50.00000
50%    2022.000000  1.150000e+05  101570.000000     100.00000
75%    2022.000000  1.650000e+05  150000.000000     100.00000
max    2022.000000  3.040000e+07  600000.000000     100.00000
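These statistics have the layout returned by `DataFrame.describe()`, which by default summarizes numeric columns only; that matches the four `int64` columns in the dataset. Because `salary` is recorded in different currencies (EUR, USD and GBP appear in the first rows), `salary_in_usd` is the column that can be compared across rows. A minimal sketch on the five rows from `data.head()`, not the full 607-row dataset:

```python
import pandas as pd

# Five rows from the data.head() output; the statistics below describe only
# this sample and will differ from the full-dataset figures
sample = pd.DataFrame({
    "salary": [70000, 260000, 85000, 20000, 150000],
    "salary_currency": ["EUR", "USD", "GBP", "USD", "USD"],
    "salary_in_usd": [79833, 260000, 109024, 20000, 150000],
    "remote_ratio": [0, 0, 50, 0, 50],
})

# describe() skips the non-numeric salary_currency column by default
summary = sample.describe()
print(summary)
```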
The echo: false option disables the printing of code (only output is displayed).
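Assuming this page is rendered with Quarto (the sentence above matches Quarto's cell options, but the original does not name the tool), the option is written as a `#|` comment at the top of a Python cell. A minimal sketch of such a cell; the values are the first five `salary_in_usd` entries, used here only as placeholders:

```python
#| echo: false
# With the option above, only the printed output appears in the rendered page

salaries_usd = [79833, 260000, 109024, 20000, 150000]
highest = max(salaries_usd)
print(highest)
```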