Python Basics

# Import your packages here
import pandas as pd

# Set a display option so all columns of tables are displayed
pd.set_option('display.max_columns', None)

Python Basics

In this milestone, you’ll become familiar with the covid data set.

Recreation

Part 1

Let’s start by loading the data that we’ll be using.

Can you:

  • Run the setup chunk above, which will import pandas with the common alias, pd.
  • Use help(pd.read_csv) to explore the help page for the pandas read_csv() function.
  • Use read_csv() to load our data set, which is stored at data/covid.csv.
  • Save the data set to a variable named covid.
import pandas as pd
covid = pd.read_csv("data/covid.csv")
covid.head()
date state tests cases hospitalizations deaths
0 2020-01-13 WA 0 0 0 0
1 2020-01-14 WA 0 0 0 0
2 2020-01-15 WA 0 0 0 0
3 2020-01-16 WA 0 0 0 0
4 2020-01-17 WA 0 0 0 0

Part 2

Now that covid is loaded, run the following code chunk, which will pull out the cases, deaths, and state columns and save them as variables to use later.

# This code may look unfamiliar.
# Don't worry, we will cover this in next week's tutorials. 
cases = covid['cases']
deaths = covid['deaths']
state = covid['state']

Part 3

Use Python to explore our data:

  • Display the contents of covid
  • Display the contents of the columns that we saved separately for you: cases, deaths, and state
  • Use the built-in sum() function to find the total number of cases recorded in cases
  • Use the built-in max() function to find the highest daily number of deaths recorded by any state in deaths
  • Use the pandas value_counts() function to tally up the number of records for each state in the data set

Hint: you may want to look up the help pages for each of these functions.

Write your code in the following chunk:

  • Let’s see the data:
print("Whole Dataset")
covid.head()
print("\nCases")
print(cases.head())
print("\nDeaths")
print(deaths.head())
print("\nState")
print(state.head())
Whole Dataset

Cases
0    0
1    0
2    0
3    0
4    0
Name: cases, dtype: int64

Deaths
0    0
1    0
2    0
3    0
4    0
Name: deaths, dtype: int64

State
0    WA
1    WA
2    WA
3    WA
4    WA
Name: state, dtype: object

Now explore quantities:

print("Total Cases in USA")
total_cases = sum(cases)
print(f"{total_cases:,}")

print("\nMaximum Daily Deaths by State")
max_deaths = max(deaths)
print(f"{max_deaths:,}")

print("\nCases by State")
state.value_counts(dropna=False)
Total Cases in USA
28,756,393

Maximum Daily Deaths by State
2,559

Cases by State
state
NaN    2153
WA      420
MA      411
VA      406
FL      404
NJ      392
NE      387
IN      375
CT      373
WY      372
RI      372
MI      372
NY      371
PA      370
TX      370
VT      370
WI      370
IL      369
OR      369
NC      369
NH      369
SC      369
HI      369
GA      369
CO      369
CA      369
AZ      369
MD      368
NV      368
NM      368
OH      368
TN      368
WV      367
MN      367
KY      367
KS      367
IA      367
DE      367
AR      367
AK      367
AL      366
ID      366
LA      366
ME      366
MS      366
MO      366
MT      366
ND      366
OK      366
SD      366
UT      366
Name: count, dtype: int64

Extension

Explore covid or one of the individual columns in a new way. Consider using a new function, or something that you can learn from a help page. Are there any functions that you would like to use from the statistics package?

For your first extension, we encourage you to think small. Throughout this course, you will build many skills to explore data; for this first week, your work will likely be fairly straightforward. Consider limiting yourself to 30-45 minutes for this portion of your work; at the end of that time interval, stop working and make notes on your progress, dead-ends, and aspirations – bringing those to your group for discussion during the milestone session.

Write your extension code in the following chunk:

First, explore variables of the dataset:

covid.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20780 entries, 0 to 20779
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   date              20780 non-null  object
 1   state             18627 non-null  object
 2   tests             20780 non-null  int64 
 3   cases             20780 non-null  int64 
 4   hospitalizations  20780 non-null  int64 
 5   deaths            20780 non-null  int64 
dtypes: int64(4), object(2)
memory usage: 974.2+ KB

I wanna highlight that data dictionary indicates data from 56 US states and territories, but we have just 50 different states.

covid.state.nunique()
50

Now let us examine what proportion of the data these null values represent:

rows = covid.shape[0]
null_values = covid['state'].isnull().sum()

percentage = round((null_values / rows) * 100,1)
print(f"We have {percentage}% of data with null states")
We have 10.4% of data with null states

We can also explore the data with null states

covid_null = covid[covid['state'].isnull()]
covid_null
date state tests cases hospitalizations deaths
288 2020-03-05 NaN 0 0 0 0
328 2020-03-06 NaN 2 0 0 0
379 2020-03-07 NaN 0 0 0 0
430 2020-03-08 NaN 3 1 0 0
481 2020-03-09 NaN 4 0 0 0
... ... ... ... ... ... ...
20775 2021-03-07 NaN 5726 146 0 0
20776 2021-03-07 NaN 0 0 0 0
20777 2021-03-07 NaN 1 1 0 0
20778 2021-03-07 NaN 305 305 0 3
20779 2021-03-07 NaN 0 0 0 0

2153 rows × 6 columns

Let’s see the statistics in these data with null state

covid_null.describe()
tests cases hospitalizations deaths
count 2153.000000 2153.000000 2153.000000 2153.000000
mean 863.052020 71.365536 0.001858 1.510451
std 3156.191137 175.974696 0.086206 3.442843
min -4065.000000 -640.000000 0.000000 -2.000000
25% 0.000000 0.000000 0.000000 0.000000
50% 87.000000 3.000000 0.000000 0.000000
75% 477.000000 58.000000 0.000000 1.000000
max 79420.000000 2169.000000 4.000000 29.000000

In the table above, we can see negative values in the tests, cases and deaths columns. Also, we can see that 25% of the data values are less than or equal to 0.

Now let’s see data wich state is not null

covid_not_null = covid[~covid['state'].isnull()]
covid_not_null.describe()
tests cases hospitalizations deaths
count 18627.000000 18627.000000 18627.000000 18627.000000
mean 19432.337789 1535.552853 41.678907 27.481559
std 34950.405917 3157.992845 219.533774 62.981636
min -130545.000000 -7757.000000 -12257.000000 -201.000000
25% 2207.500000 122.000000 0.000000 1.000000
50% 7608.000000 555.000000 3.000000 8.000000
75% 21608.500000 1534.500000 44.000000 27.000000
max 473076.000000 71734.000000 16373.000000 2559.000000

Here we can see that we also have negative cases. For further researches, data needs corrections or has anomalies. Approximately, we have 19.432 tests, 1.536 cases, 42 hospitalizations and 27 deaths on average. For the difference between the minimum, the maximum and percentiles in each category we can infer that the data vary a lot, which makes a lot of sense in this type of diseases.

Let’s explore how many of the data has negative

total_negative_cases = covid[covid.cases<0].shape[0]
percentage_negative = round((total_negative_cases / rows) * 100, 1)
print(f"We have {percentage_negative}% of data with negative values")

covid[covid.cases<0]
We have 0.1% of data with negative values
date state tests cases hospitalizations deaths
2914 2020-04-22 NaN -330 -383 0 3
3026 2020-04-24 NaN -44 -44 0 8
3584 2020-05-04 NaN 89 -1 0 0
3613 2020-05-05 MT 184 -1 0 0
4158 2020-05-15 HI 832 -1 0 0
4482 2020-05-20 NaN -1 -1 0 1
4606 2020-05-23 HI 738 -5 1 0
4706 2020-05-24 NaN -640 -640 0 0
4826 2020-05-27 CT 9022 -15 0 34
5405 2020-06-06 MT 866 -1 0 0
6125 2020-06-19 LA 45688 -119 0 22
7817 2020-07-19 NJ 9711 -7 0 8
8131 2020-07-25 DE 3115 -27 0 1
9069 2020-08-10 WY 2303 -8 2 0
9303 2020-08-15 AR 0 -400 -4 13
9474 2020-08-18 CT 19513 -12 0 0
9979 2020-08-27 DE 3285 -10 0 0
10147 2020-08-30 DE 4223 -6 0 0
10328 2020-09-02 MA 30141 -7757 -91 -4
11634 2020-09-25 RI 10422 -130 0 1
12211 2020-10-05 NaN 238 -5 0 0
12323 2020-10-07 NaN 117 -1 0 0
12771 2020-10-15 NaN 178 -1 0 0
19301 2021-02-09 ND 578 -9 3 3

It’s also important to analyze dates. We can see that we must to summarize data if we wanna analyze variables by date.

covid.date.value_counts()
date
2020-08-10    56
2020-11-23    56
2020-10-26    56
2020-10-25    56
2020-10-24    56
              ..
2020-01-14     1
2020-01-19     1
2020-01-20     1
2020-01-21     1
2020-01-13     1
Name: count, Length: 420, dtype: int64

Additionally, we can use ‘pd.to_datetime’ to manipulate this column easily, even we can split by ‘-’ to identify month, day and year

covid["date_formatted"] = pd.to_datetime(covid["date"])
covid.head()
date state tests cases hospitalizations deaths date_formatted
0 2020-01-13 WA 0 0 0 0 2020-01-13
1 2020-01-14 WA 0 0 0 0 2020-01-14
2 2020-01-15 WA 0 0 0 0 2020-01-15
3 2020-01-16 WA 0 0 0 0 2020-01-16
4 2020-01-17 WA 0 0 0 0 2020-01-17

Our Data Set just have data of 2020 and 2021. We have all months of 2020 but just the first quarter.

covid["year"] = covid["date_formatted"].dt.year
covid["month"] = covid["date_formatted"].dt.month
covid["day"] = covid["date_formatted"].dt.day

print("Years")
covid["year"].value_counts()
print("\nYears-Months")
covid[["year","month"]].value_counts()
Years

Years-Months
year  month
2020  5        1736
      7        1736
      8        1736
      10       1736
      12       1736
2021  1        1736
2020  4        1680
      6        1680
      9        1680
      11       1680
2021  2        1568
2020  3        1492
2021  3         392
2020  2         155
      1          37
Name: count, dtype: int64

But January 2020 and March 2021 are incomplete:

covid.groupby(["year","month"]).day.nunique()
year  month
2020  1        19
      2        29
      3        31
      4        30
      5        31
      6        30
      7        31
      8        31
      9        30
      10       31
      11       30
      12       31
2021  1        31
      2        28
      3         7
Name: day, dtype: int64