# Import your packages here
import pandas as pd
# Set a display option so all columns of tables are displayed
pd.set_option('display.max_columns', None)Python Basics
Python Basics
In this milestone, you’ll become familiar with the covid data set.
Recreation
Part 1
Let’s start by loading the data that we’ll be using.
Can you:
- Run the setup chunk above, which will import pandas with the common alias,
pd. - Use
help(pd.read_csv)to explore the help page for the pandasread_csv()function. - Use
read_csv()to load our data set, which is stored atdata/covid.csv. - Save the data set to a variable named
covid.
import pandas as pd
covid = pd.read_csv("data/covid.csv")
covid.head()| date | state | tests | cases | hospitalizations | deaths | |
|---|---|---|---|---|---|---|
| 0 | 2020-01-13 | WA | 0 | 0 | 0 | 0 |
| 1 | 2020-01-14 | WA | 0 | 0 | 0 | 0 |
| 2 | 2020-01-15 | WA | 0 | 0 | 0 | 0 |
| 3 | 2020-01-16 | WA | 0 | 0 | 0 | 0 |
| 4 | 2020-01-17 | WA | 0 | 0 | 0 | 0 |
Part 2
Now that covid is loaded, run the following code chunk, which will pull out the cases, deaths, and state columns and save them as variables to use later.
# This code may look unfamiliar.
# Don't worry, we will cover this in next week's tutorials.
cases = covid['cases']
deaths = covid['deaths']
state = covid['state']Part 3
Use Python to explore our data:
- Display the contents of
covid - Display the contents of the columns that we saved separately for you:
cases,deaths, andstate - Use the built-in
sum()function to find the total number of cases recorded incases - Use the built-in
max()function to find the highest daily number of deaths recorded by any state indeaths - Use the pandas
value_counts()function to tally up the number of records for each state in the data set
Hint: you may want to look up the help pages for each of these functions.
Write your code in the following chunk:
- Let’s see the data:
print("Whole Dataset")
covid.head()
print("\nCases")
print(cases.head())
print("\nDeaths")
print(deaths.head())
print("\nState")
print(state.head())Whole Dataset
Cases
0 0
1 0
2 0
3 0
4 0
Name: cases, dtype: int64
Deaths
0 0
1 0
2 0
3 0
4 0
Name: deaths, dtype: int64
State
0 WA
1 WA
2 WA
3 WA
4 WA
Name: state, dtype: object
Now explore quantities:
print("Total Cases in USA")
total_cases = sum(cases)
print(f"{total_cases:,}")
print("\nMaximum Daily Deaths by State")
max_deaths = max(deaths)
print(f"{max_deaths:,}")
print("\nCases by State")
state.value_counts(dropna=False)Total Cases in USA
28,756,393
Maximum Daily Deaths by State
2,559
Cases by State
state
NaN 2153
WA 420
MA 411
VA 406
FL 404
NJ 392
NE 387
IN 375
CT 373
WY 372
RI 372
MI 372
NY 371
PA 370
TX 370
VT 370
WI 370
IL 369
OR 369
NC 369
NH 369
SC 369
HI 369
GA 369
CO 369
CA 369
AZ 369
MD 368
NV 368
NM 368
OH 368
TN 368
WV 367
MN 367
KY 367
KS 367
IA 367
DE 367
AR 367
AK 367
AL 366
ID 366
LA 366
ME 366
MS 366
MO 366
MT 366
ND 366
OK 366
SD 366
UT 366
Name: count, dtype: int64
Extension
Explore covid or one of the individual columns in a new way. Consider using a new function, or something that you can learn from a help page. Are there any functions that you would like to use from the statistics package?
For your first extension, we encourage you to think small. Throughout this course, you will build many skills to explore data; for this first week, your work will likely be fairly straightforward. Consider limiting yourself to 30-45 minutes for this portion of your work; at the end of that time interval, stop working and make notes on your progress, dead-ends, and aspirations – bringing those to your group for discussion during the milestone session.
Write your extension code in the following chunk:
First, explore variables of the dataset:
covid.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20780 entries, 0 to 20779
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 20780 non-null object
1 state 18627 non-null object
2 tests 20780 non-null int64
3 cases 20780 non-null int64
4 hospitalizations 20780 non-null int64
5 deaths 20780 non-null int64
dtypes: int64(4), object(2)
memory usage: 974.2+ KB
I wanna highlight that data dictionary indicates data from 56 US states and territories, but we have just 50 different states.
covid.state.nunique()50
Now let us examine what proportion of the data these null values represent:
rows = covid.shape[0]
null_values = covid['state'].isnull().sum()
percentage = round((null_values / rows) * 100,1)
print(f"We have {percentage}% of data with null states")We have 10.4% of data with null states
We can also explore the data with null states
covid_null = covid[covid['state'].isnull()]
covid_null| date | state | tests | cases | hospitalizations | deaths | |
|---|---|---|---|---|---|---|
| 288 | 2020-03-05 | NaN | 0 | 0 | 0 | 0 |
| 328 | 2020-03-06 | NaN | 2 | 0 | 0 | 0 |
| 379 | 2020-03-07 | NaN | 0 | 0 | 0 | 0 |
| 430 | 2020-03-08 | NaN | 3 | 1 | 0 | 0 |
| 481 | 2020-03-09 | NaN | 4 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 20775 | 2021-03-07 | NaN | 5726 | 146 | 0 | 0 |
| 20776 | 2021-03-07 | NaN | 0 | 0 | 0 | 0 |
| 20777 | 2021-03-07 | NaN | 1 | 1 | 0 | 0 |
| 20778 | 2021-03-07 | NaN | 305 | 305 | 0 | 3 |
| 20779 | 2021-03-07 | NaN | 0 | 0 | 0 | 0 |
2153 rows × 6 columns
Let’s see the statistics in these data with null state
covid_null.describe()| tests | cases | hospitalizations | deaths | |
|---|---|---|---|---|
| count | 2153.000000 | 2153.000000 | 2153.000000 | 2153.000000 |
| mean | 863.052020 | 71.365536 | 0.001858 | 1.510451 |
| std | 3156.191137 | 175.974696 | 0.086206 | 3.442843 |
| min | -4065.000000 | -640.000000 | 0.000000 | -2.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 87.000000 | 3.000000 | 0.000000 | 0.000000 |
| 75% | 477.000000 | 58.000000 | 0.000000 | 1.000000 |
| max | 79420.000000 | 2169.000000 | 4.000000 | 29.000000 |
In the table above, we can see negative values in the tests, cases and deaths columns. Also, we can see that 25% of the data values are less than or equal to 0.
Now let’s see data wich state is not null
covid_not_null = covid[~covid['state'].isnull()]
covid_not_null.describe()| tests | cases | hospitalizations | deaths | |
|---|---|---|---|---|
| count | 18627.000000 | 18627.000000 | 18627.000000 | 18627.000000 |
| mean | 19432.337789 | 1535.552853 | 41.678907 | 27.481559 |
| std | 34950.405917 | 3157.992845 | 219.533774 | 62.981636 |
| min | -130545.000000 | -7757.000000 | -12257.000000 | -201.000000 |
| 25% | 2207.500000 | 122.000000 | 0.000000 | 1.000000 |
| 50% | 7608.000000 | 555.000000 | 3.000000 | 8.000000 |
| 75% | 21608.500000 | 1534.500000 | 44.000000 | 27.000000 |
| max | 473076.000000 | 71734.000000 | 16373.000000 | 2559.000000 |
Here we can see that we also have negative cases. For further researches, data needs corrections or has anomalies. Approximately, we have 19.432 tests, 1.536 cases, 42 hospitalizations and 27 deaths on average. For the difference between the minimum, the maximum and percentiles in each category we can infer that the data vary a lot, which makes a lot of sense in this type of diseases.
Let’s explore how many of the data has negative
total_negative_cases = covid[covid.cases<0].shape[0]
percentage_negative = round((total_negative_cases / rows) * 100, 1)
print(f"We have {percentage_negative}% of data with negative values")
covid[covid.cases<0]We have 0.1% of data with negative values
| date | state | tests | cases | hospitalizations | deaths | |
|---|---|---|---|---|---|---|
| 2914 | 2020-04-22 | NaN | -330 | -383 | 0 | 3 |
| 3026 | 2020-04-24 | NaN | -44 | -44 | 0 | 8 |
| 3584 | 2020-05-04 | NaN | 89 | -1 | 0 | 0 |
| 3613 | 2020-05-05 | MT | 184 | -1 | 0 | 0 |
| 4158 | 2020-05-15 | HI | 832 | -1 | 0 | 0 |
| 4482 | 2020-05-20 | NaN | -1 | -1 | 0 | 1 |
| 4606 | 2020-05-23 | HI | 738 | -5 | 1 | 0 |
| 4706 | 2020-05-24 | NaN | -640 | -640 | 0 | 0 |
| 4826 | 2020-05-27 | CT | 9022 | -15 | 0 | 34 |
| 5405 | 2020-06-06 | MT | 866 | -1 | 0 | 0 |
| 6125 | 2020-06-19 | LA | 45688 | -119 | 0 | 22 |
| 7817 | 2020-07-19 | NJ | 9711 | -7 | 0 | 8 |
| 8131 | 2020-07-25 | DE | 3115 | -27 | 0 | 1 |
| 9069 | 2020-08-10 | WY | 2303 | -8 | 2 | 0 |
| 9303 | 2020-08-15 | AR | 0 | -400 | -4 | 13 |
| 9474 | 2020-08-18 | CT | 19513 | -12 | 0 | 0 |
| 9979 | 2020-08-27 | DE | 3285 | -10 | 0 | 0 |
| 10147 | 2020-08-30 | DE | 4223 | -6 | 0 | 0 |
| 10328 | 2020-09-02 | MA | 30141 | -7757 | -91 | -4 |
| 11634 | 2020-09-25 | RI | 10422 | -130 | 0 | 1 |
| 12211 | 2020-10-05 | NaN | 238 | -5 | 0 | 0 |
| 12323 | 2020-10-07 | NaN | 117 | -1 | 0 | 0 |
| 12771 | 2020-10-15 | NaN | 178 | -1 | 0 | 0 |
| 19301 | 2021-02-09 | ND | 578 | -9 | 3 | 3 |
It’s also important to analyze dates. We can see that we must to summarize data if we wanna analyze variables by date.
covid.date.value_counts()date
2020-08-10 56
2020-11-23 56
2020-10-26 56
2020-10-25 56
2020-10-24 56
..
2020-01-14 1
2020-01-19 1
2020-01-20 1
2020-01-21 1
2020-01-13 1
Name: count, Length: 420, dtype: int64
Additionally, we can use ‘pd.to_datetime’ to manipulate this column easily, even we can split by ‘-’ to identify month, day and year
covid["date_formatted"] = pd.to_datetime(covid["date"])
covid.head()| date | state | tests | cases | hospitalizations | deaths | date_formatted | |
|---|---|---|---|---|---|---|---|
| 0 | 2020-01-13 | WA | 0 | 0 | 0 | 0 | 2020-01-13 |
| 1 | 2020-01-14 | WA | 0 | 0 | 0 | 0 | 2020-01-14 |
| 2 | 2020-01-15 | WA | 0 | 0 | 0 | 0 | 2020-01-15 |
| 3 | 2020-01-16 | WA | 0 | 0 | 0 | 0 | 2020-01-16 |
| 4 | 2020-01-17 | WA | 0 | 0 | 0 | 0 | 2020-01-17 |
Our Data Set just have data of 2020 and 2021. We have all months of 2020 but just the first quarter.
covid["year"] = covid["date_formatted"].dt.year
covid["month"] = covid["date_formatted"].dt.month
covid["day"] = covid["date_formatted"].dt.day
print("Years")
covid["year"].value_counts()
print("\nYears-Months")
covid[["year","month"]].value_counts()Years
Years-Months
year month
2020 5 1736
7 1736
8 1736
10 1736
12 1736
2021 1 1736
2020 4 1680
6 1680
9 1680
11 1680
2021 2 1568
2020 3 1492
2021 3 392
2020 2 155
1 37
Name: count, dtype: int64
But January 2020 and March 2021 are incomplete:
covid.groupby(["year","month"]).day.nunique()year month
2020 1 19
2 29
3 31
4 30
5 31
6 30
7 31
8 31
9 30
10 31
11 30
12 31
2021 1 31
2 28
3 7
Name: day, dtype: int64