Python Basics

# Import your packages here
import pandas as pd

# Set a display option so all columns of tables are displayed
pd.set_option('display.max_columns', None)

Python Basics

In this milestone, you’ll become familiar with the covid data set.

Recreation

Part 1

Let’s start by loading the data that we’ll be using.

Can you:

Run the setup chunk above, which will import pandas with the common alias, pd.
Use help(pd.read_csv) to explore the help page for the pandas read_csv() function.
Use read_csv() to load our data set, which is stored at data/covid.csv.
Save the data set to a variable named covid.

import pandas as pd
covid = pd.read_csv("data/covid.csv")
covid.head()

	date	state
0	2020-01-13	WA
1	2020-01-14	WA
2	2020-01-15	WA
3	2020-01-16	WA
4	2020-01-17	WA

Part 2

Now that covid is loaded, run the following code chunk, which will pull out the cases, deaths, and state columns and save them as variables to use later.

# This code may look unfamiliar.
# Don't worry, we will cover this in next week's tutorials. 
cases = covid['cases']
deaths = covid['deaths']
state = covid['state']

Part 3

Use Python to explore our data:

Display the contents of covid
Display the contents of the columns that we saved separately for you: cases, deaths, and state
Use the built-in sum() function to find the total number of cases recorded in cases
Use the built-in max() function to find the highest daily number of deaths recorded by any state in deaths
Use the pandas value_counts() function to tally up the number of records for each state in the data set

Hint: you may want to look up the help pages for each of these functions.

Write your code in the following chunk:

Let’s see the data:

print("Whole Dataset")
covid.head()
print("\nCases")
print(cases.head())
print("\nDeaths")
print(deaths.head())
print("\nState")
print(state.head())

Whole Dataset

Cases
0    0
1    0
2    0
3    0
4    0
Name: cases, dtype: int64

Deaths
0    0
1    0
2    0
3    0
4    0
Name: deaths, dtype: int64

State
0    WA
1    WA
2    WA
3    WA
4    WA
Name: state, dtype: object

Now explore quantities:

print("Total Cases in USA")
total_cases = sum(cases)
print(f"{total_cases:,}")

print("\nMaximum Daily Deaths by State")
max_deaths = max(deaths)
print(f"{max_deaths:,}")

print("\nCases by State")
state.value_counts(dropna=False)

Total Cases in USA
28,756,393

Maximum Daily Deaths by State
2,559

Cases by State

state
NaN    2153
WA      420
MA      411
VA      406
FL      404
NJ      392
NE      387
IN      375
CT      373
WY      372
RI      372
MI      372
NY      371
PA      370
TX      370
VT      370
WI      370
IL      369
OR      369
NC      369
NH      369
SC      369
HI      369
GA      369
CO      369
CA      369
AZ      369
MD      368
NV      368
NM      368
OH      368
TN      368
WV      367
MN      367
KY      367
KS      367
IA      367
DE      367
AR      367
AK      367
AL      366
ID      366
LA      366
ME      366
MS      366
MO      366
MT      366
ND      366
OK      366
SD      366
UT      366
Name: count, dtype: int64

Extension

Explore covid or one of the individual columns in a new way. Consider using a new function, or something that you can learn from a help page. Are there any functions that you would like to use from the statistics package?

For your first extension, we encourage you to think small. Throughout this course, you will build many skills to explore data; for this first week, your work will likely be fairly straightforward. Consider limiting yourself to 30-45 minutes for this portion of your work; at the end of that time interval, stop working and make notes on your progress, dead-ends, and aspirations – bringing those to your group for discussion during the milestone session.

Write your extension code in the following chunk:

First, explore variables of the dataset:

covid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20780 entries, 0 to 20779
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   date              20780 non-null  object
 1   state             18627 non-null  object
 2   tests             20780 non-null  int64 
 3   cases             20780 non-null  int64 
 4   hospitalizations  20780 non-null  int64 
 5   deaths            20780 non-null  int64 
dtypes: int64(4), object(2)
memory usage: 974.2+ KB

I wanna highlight that data dictionary indicates data from 56 US states and territories, but we have just 50 different states.

covid.state.nunique()

Now let us examine what proportion of the data these null values represent:

rows = covid.shape[0]
null_values = covid['state'].isnull().sum()

percentage = round((null_values / rows) * 100,1)
print(f"We have {percentage}% of data with null states")

We have 10.4% of data with null states

We can also explore the data with null states

covid_null = covid[covid['state'].isnull()]
covid_null

	date	state	tests	cases	hospitalizations	deaths
288	2020-03-05	NaN	0	0	0	0
328	2020-03-06	NaN	2	0	0	0
379	2020-03-07	NaN	0	0	0	0
430	2020-03-08	NaN	3	1	0	0
481	2020-03-09	NaN	4	0	0	0
...	...	...	...	...	...	...
20775	2021-03-07	NaN	5726	146	0	0
20776	2021-03-07	NaN	0	0	0	0
20777	2021-03-07	NaN	1	1	0	0
20778	2021-03-07	NaN	305	305	0	3
20779	2021-03-07	NaN	0	0	0	0

2153 rows × 6 columns

Let’s see the statistics in these data with null state

covid_null.describe()

	tests	cases	hospitalizations	deaths
count	2153.000000	2153.000000	2153.000000	2153.000000
mean	863.052020	71.365536	0.001858	1.510451
std	3156.191137	175.974696	0.086206	3.442843
min	-4065.000000	-640.000000	0.000000	-2.000000
25%	0.000000	0.000000	0.000000	0.000000
50%	87.000000	3.000000	0.000000	0.000000
75%	477.000000	58.000000	0.000000	1.000000
max	79420.000000	2169.000000	4.000000	29.000000

In the table above, we can see negative values in the tests, cases and deaths columns. Also, we can see that 25% of the data values are less than or equal to 0.

Now let’s see data wich state is not null

covid_not_null = covid[~covid['state'].isnull()]
covid_not_null.describe()

	tests	cases	hospitalizations	deaths
count	18627.000000	18627.000000	18627.000000	18627.000000
mean	19432.337789	1535.552853	41.678907	27.481559
std	34950.405917	3157.992845	219.533774	62.981636
min	-130545.000000	-7757.000000	-12257.000000	-201.000000
25%	2207.500000	122.000000	0.000000	1.000000
50%	7608.000000	555.000000	3.000000	8.000000
75%	21608.500000	1534.500000	44.000000	27.000000
max	473076.000000	71734.000000	16373.000000	2559.000000

Here we can see that we also have negative cases. For further researches, data needs corrections or has anomalies. Approximately, we have 19.432 tests, 1.536 cases, 42 hospitalizations and 27 deaths on average. For the difference between the minimum, the maximum and percentiles in each category we can infer that the data vary a lot, which makes a lot of sense in this type of diseases.

Let’s explore how many of the data has negative

total_negative_cases = covid[covid.cases<0].shape[0]
percentage_negative = round((total_negative_cases / rows) * 100, 1)
print(f"We have {percentage_negative}% of data with negative values")

covid[covid.cases<0]

We have 0.1% of data with negative values

	date	state	tests	cases	hospitalizations	deaths
2914	2020-04-22	NaN	-330	-383	0	3
3026	2020-04-24	NaN	-44	-44	0	8
3584	2020-05-04	NaN	89	-1	0	0
3613	2020-05-05	MT	184	-1	0	0
4158	2020-05-15	HI	832	-1	0	0
4482	2020-05-20	NaN	-1	-1	0	1
4606	2020-05-23	HI	738	-5	1	0
4706	2020-05-24	NaN	-640	-640	0	0
4826	2020-05-27	CT	9022	-15	0	34
5405	2020-06-06	MT	866	-1	0	0
6125	2020-06-19	LA	45688	-119	0	22
7817	2020-07-19	NJ	9711	-7	0	8
8131	2020-07-25	DE	3115	-27	0	1
9069	2020-08-10	WY	2303	-8	2	0
9303	2020-08-15	AR	0	-400	-4	13
9474	2020-08-18	CT	19513	-12	0	0
9979	2020-08-27	DE	3285	-10	0	0
10147	2020-08-30	DE	4223	-6	0	0
10328	2020-09-02	MA	30141	-7757	-91	-4
11634	2020-09-25	RI	10422	-130	0	1
12211	2020-10-05	NaN	238	-5	0	0
12323	2020-10-07	NaN	117	-1	0	0
12771	2020-10-15	NaN	178	-1	0	0
19301	2021-02-09	ND	578	-9	3	3

It’s also important to analyze dates. We can see that we must to summarize data if we wanna analyze variables by date.

covid.date.value_counts()

date
2020-08-10    56
2020-11-23    56
2020-10-26    56
2020-10-25    56
2020-10-24    56
              ..
2020-01-14     1
2020-01-19     1
2020-01-20     1
2020-01-21     1
2020-01-13     1
Name: count, Length: 420, dtype: int64

Additionally, we can use ‘pd.to_datetime’ to manipulate this column easily, even we can split by ‘-’ to identify month, day and year

covid["date_formatted"] = pd.to_datetime(covid["date"])
covid.head()

	date	state	date_formatted
0	2020-01-13	WA	2020-01-13
1	2020-01-14	WA	2020-01-14
2	2020-01-15	WA	2020-01-15
3	2020-01-16	WA	2020-01-16
4	2020-01-17	WA	2020-01-17

Our Data Set just have data of 2020 and 2021. We have all months of 2020 but just the first quarter.

covid["year"] = covid["date_formatted"].dt.year
covid["month"] = covid["date_formatted"].dt.month
covid["day"] = covid["date_formatted"].dt.day

print("Years")
covid["year"].value_counts()
print("\nYears-Months")
covid[["year","month"]].value_counts()

Years

Years-Months

year  month
2020  5        1736
      7        1736
      8        1736
      10       1736
      12       1736
2021  1        1736
2020  4        1680
      6        1680
      9        1680
      11       1680
2021  2        1568
2020  3        1492
2021  3         392
2020  2         155
      1          37
Name: count, dtype: int64

But January 2020 and March 2021 are incomplete:

covid.groupby(["year","month"]).day.nunique()

year  month
2020  1        19
      2        29
      3        31
      4        30
      5        31
      6        30
      7        31
      8        31
      9        30
      10       31
      11       30
      12       31
2021  1        31
      2        28
      3         7
Name: day, dtype: int64