Data Analysis in Python
Analyzing Popular English Words in Python
0.1 Background
In this project, I analyze a data set of English words. The data is available on this site. The project is part of a project-based course created by freeCodeCamp, which is available on YouTube at this link.
The purpose is to illustrate the basics of data analysis using the Python language.
1 Project 1: English Words
- Load the packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
I read in the data and convert the word column to a string column.
words = pd.read_csv("unix-words.txt")
# help(pd.read_csv)
# Make sure every entry in the word column is a string
words['word'] = words['word'].astype(str)
words.columns
# Normalize the words: lower case, no surrounding whitespace
words['word'] = [word.lower().strip() for word in words['word']]
words.head()
| | word |
---|---|
0 | a |
1 | aa |
2 | aal |
3 | aalii |
4 | aam |
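I add a count of letters for each word as a column named count.
def count_letters(word):
    # Manual equivalent of len(word), kept for illustration
    count = 0
    for letter in word:
        count += 1
    return count
# The built-in len gives the same result and is what fills the column
words['count'] = [len(word) for word in words['word']]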
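The same column can also be built without an explicit loop. The sketch below assumes a tiny hypothetical data frame named sample in place of the full word list, and uses the pandas string accessor str.len() to count the characters of every word at once.

```python
import pandas as pd

# Hypothetical three-word sample standing in for the full word list
sample = pd.DataFrame({"word": ["a", "aal", "aalii"]})

# Series.str.len() returns the number of characters in each string
sample["count"] = sample["word"].str.len()

print(sample)
```

On a column of plain strings this gives the same numbers as the list comprehension with len.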
Lastly, I create a column sum that adds up the alphabet positions of the letters in a word. For instance, given that “a” is in position one alphabetically, the word “aa” has a sum of 2, while “bb” has a sum of 4, and so on.
letters = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 'i': 9, 'j': 10, 'k': 11, 'l': 12, 'm': 13,
           'n': 14, 'o': 15, 'p': 16, 'q': 17, 'r': 18, 's': 19, 't': 20, 'u': 21, 'v': 22, 'w': 23, 'x': 24, 'y': 25, 'z': 26}
# Function to get the sum of letters
def sum_letters(word):
    total = 0  # named total to avoid shadowing the built-in sum
    for character in word.lower():
        # Characters that are not in the dictionary contribute 0
        total += letters.get(character, 0)
    return total
words["sum"] = [sum_letters(word) for word in words["word"]]
words.head()
| | word | count | sum |
---|---|---|---|
0 | a | 1 | 1 |
1 | aa | 2 | 2 |
2 | aal | 3 | 14 |
3 | aalii | 5 | 32 |
4 | aam | 3 | 15 |
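As a cross-check on the letter values, here is a small sketch that computes the same sum from character codes instead of a dictionary. The function name sum_letters_ord is my own and is not part of the course material; it assumes that only the letters a to z should count.

```python
def sum_letters_ord(word):
    """Sum alphabet positions (a=1 ... z=26); any other character counts as 0."""
    return sum(ord(ch) - ord("a") + 1 for ch in word.lower() if "a" <= ch <= "z")


print(sum_letters_ord("aal"))    # 1 + 1 + 12 = 14
print(sum_letters_ord("aalii"))  # 1 + 1 + 12 + 9 + 9 = 32
```

Both results agree with the sum column in the table above.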
2 Exploring the Data
In this section, I explore the data.
Let's look at the first few rows of the data set.
words.head()
| | word | count | sum |
---|---|---|---|
0 | a | 1 | 1 |
1 | aa | 2 | 2 |
2 | aal | 3 | 14 |
3 | aalii | 5 | 32 |
4 | aam | 3 | 15 |
Then, we look at the information about the data.
words.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235886 entries, 0 to 235885
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 word 235886 non-null object
1 count 235886 non-null int64
2 sum 235886 non-null int64
dtypes: int64(2), object(1)
memory usage: 5.4+ MB
How about the dimensions of the data?
# NB shape is an attribute
words.shape
(235886, 3)
And the last few rows.
words.tail()
| | word | count | sum |
---|---|---|---|
235881 | zythem | 6 | 97 |
235882 | zythia | 6 | 89 |
235883 | zythum | 6 | 113 |
235884 | zyzomys | 7 | 149 |
235885 | zyzzogeton | 10 | 179 |
Next, we look at the columns of the data.
print(words.columns)
Index(['word', 'count', 'sum'], dtype='object')
Finally, let us convert the word column into the index.
words.set_index("word", inplace=True)
Let's see the updated data.
words.head()
word | count | sum |
---|---|---|
a | 1 | 1 |
aa | 2 | 2 |
aal | 3 | 14 |
aalii | 5 | 32 |
aam | 3 | 15 |
3 Slicing Rows of Data Using LOC
In this section, I illustrate how to slice data by rows using the label-based loc indexer in pandas; iloc is its position-based counterpart.
# Use square brackets with loc to select a row by its index label
words.loc['jump']
count 4
sum 60
Name: jump, dtype: int64
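To make the difference between loc and iloc concrete, here is a minimal sketch on a hypothetical three-row data frame named sample, indexed by word in the same way as the full data set.

```python
import pandas as pd

# Hypothetical sample with the word column as the index
sample = pd.DataFrame(
    {"count": [1, 2, 3], "sum": [1, 2, 14]},
    index=pd.Index(["a", "aa", "aal"], name="word"),
)

by_label = sample.loc["aal"]   # row whose index label is "aal"
by_position = sample.iloc[2]   # third row, whatever its label is

print(by_label.equals(by_position))
```

Here both lookups return the same row, because "aal" happens to sit in the third position.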
We examine the maximum values in the dataset. Here we can use the max method in pandas.
words.max()
count 24
sum 328
dtype: int64
We can also use the describe method instead. This method gives the summary statistics of all the variables, including the minimum and maximum values.
words.describe()
| | count | sum |
---|---|---|
count | 235886.000000 | 235886.000000 |
mean | 9.569156 | 112.388484 |
std | 2.927299 | 40.889882 |
min | 1.000000 | 1.000000 |
25% | 8.000000 | 83.000000 |
50% | 9.000000 | 110.000000 |
75% | 11.000000 | 139.000000 |
max | 24.000000 | 328.000000 |
Next, let's find the word with a count of 7 and a sum of 87 among the candidate words below.
words.loc[["pinfish", "glowing", "enfold", "microbrew"]]
word | count | sum |
---|---|---|
pinfish | 7 | 81 |
glowing | 7 | 87 |
enfold | 6 | 56 |
microbrew | 9 | 106 |
We see that the term glowing meets both criteria.
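Instead of reading the answer off the table, the two conditions can be combined into a boolean mask. This is a sketch on a hypothetical data frame named sample that holds only the four candidate rows shown above.

```python
import pandas as pd

# The four candidate words and their values, copied from the table above
sample = pd.DataFrame(
    {"count": [7, 7, 6, 9], "sum": [81, 87, 56, 106]},
    index=pd.Index(["pinfish", "glowing", "enfold", "microbrew"], name="word"),
)

# Each comparison gives a boolean Series; & keeps rows where both are True
mask = (sample["count"] == 7) & (sample["sum"] == 87)
matches = sample.loc[mask]

print(matches.index.tolist())
```

The parentheses around each comparison are needed because & binds more tightly than == in Python.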
3.1 Highest Possible Value of a Word
We can again use the describe method as an easy way to get all the summary statistics necessary to answer this question. The max row of the sum column shows that the highest value is 328.
words.describe()
| | count | sum |
---|---|---|
count | 235886.000000 | 235886.000000 |
mean | 9.569156 | 112.388484 |
std | 2.927299 | 40.889882 |
min | 1.000000 | 1.000000 |
25% | 8.000000 | 83.000000 |
50% | 9.000000 | 110.000000 |
75% | 11.000000 | 139.000000 |
max | 24.000000 | 328.000000 |
3.2 Word with a Value of 328
Here we can subset as follows.
words.loc[words['sum'] == 328]
word | count | sum |
---|---|---|
dacryocystosyringotomy | 22 | 328 |
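The idxmax method also returns useful results, but it answers a broader question: for each column, it returns the index label of that column's maximum, so the output has one row per column rather than only the word with the highest sum.
words.loc[words.idxmax()]
word | count | sum |
---|---|---|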
formaldehydesulphoxylate | 24 | 294 |
dacryocystosyringotomy | 22 | 328 |
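To see why idxmax produces two rows, here is a short sketch on a hypothetical three-row data frame named sample, with values copied from the tables above. Calling idxmax on a single column narrows the answer to one word.

```python
import pandas as pd

sample = pd.DataFrame(
    {"count": [24, 22, 3], "sum": [294, 328, 14]},
    index=pd.Index(
        ["formaldehydesulphoxylate", "dacryocystosyringotomy", "aal"],
        name="word",
    ),
)

# On a data frame: one index label per column (where that column peaks)
labels = sample.idxmax()
print(labels["count"])
print(labels["sum"])

# On a single column: just the word with the highest sum
top_word = sample["sum"].idxmax()
print(top_word)
```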
3.3 Most Common Value
What is the most common value for sum and for count? Here, we can use the value_counts method.
words['sum'].value_counts()
102 2362
107 2332
97 2323
100 2302
98 2279
...
301 1
291 1
292 1
297 1
314 1
Name: sum, Length: 305, dtype: int64
We can also use the mode method.
words["sum"].mode()
0 102
Name: sum, dtype: int64
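The relationship between value_counts and mode is easier to see on a small example. The sketch below uses a hypothetical six-value Series named sums, not the real data.

```python
import pandas as pd

# Hypothetical sample of sum values
sums = pd.Series([102, 107, 102, 97, 102, 107], name="sum")

# value_counts sorts by frequency, most frequent first
counts = sums.value_counts()
most_common = counts.index[0]
print(most_common, counts.iloc[0])

# mode returns a Series, because several values can tie for most frequent
print(sums.mode().tolist())
```

When there is a tie, mode returns every tied value, which is why its result is a Series rather than a single number.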
We see that for sum, 102 occurs 2362 times. We can then use this value to look at the first few words with that sum.
words.loc[words["sum"] == 102].head()
word | count | sum |
---|---|---|
abrogation | 10 | 102 |
absurdly | 8 | 102 |
acanthous | 9 | 102 |
acatamathesia | 13 | 102 |
acclimatize | 11 | 102 |
We can do the same for count.
words['count'].value_counts()
9 32404
10 30878
8 29989
11 26013
7 23869
12 20462
6 17706
13 14939
5 10230
14 9765
15 5925
4 5271
16 3377
17 1813
3 1421
18 842
19 428
20 198
2 160
21 82
1 51
22 41
23 17
24 5
Name: count, dtype: int64
words["count"].mode()
0 9
Name: count, dtype: int64
We see that for count, 9 occurs 32404 times. We can then use this value to look at the first few words with that count.
words.loc[words["count"] == 9].head()
word | count | sum |
---|---|---|
aaronical | 9 | 74 |
aaronitic | 9 | 90 |
microbrew | 9 | 106 |
abacinate | 9 | 56 |
abaciscus | 9 | 78 |
3.4 Shortest Word with a Value (SUM) of 271
Here, we combine the slicing methods with the sort_values method. We see that the term oversuperstitious meets the condition.
words.loc[words["sum"] == 271].sort_values(by="count").head()
word | count | sum |
---|---|---|
oversuperstitious | 17 | 271 |
prostatomyomectomy | 18 | 271 |
ultrarevolutionist | 18 | 271 |
ureteroenterostomy | 18 | 271 |
superconstitutional | 19 | 271 |
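When only the single shortest word is needed, sorting the whole subset is optional. As an alternative sketch (my own variation, not the approach used above), idxmin on the count column returns the index label of the smallest count directly. The data frame sample is a hypothetical three-row stand-in with values copied from the tables in this document.

```python
import pandas as pd

sample = pd.DataFrame(
    {"count": [17, 18, 17], "sum": [271, 271, 260]},
    index=pd.Index(
        ["oversuperstitious", "prostatomyomectomy", "gastrohysteropexy"],
        name="word",
    ),
)

# Keep only the words with a sum of 271, then find the smallest count
subset = sample.loc[sample["sum"] == 271]
shortest = subset["count"].idxmin()

print(shortest)
```

If several words tie for the smallest count, idxmin returns only the first one, so the sorted table is still the better choice for seeing ties.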
3.5 Create a New Column
In this section, I create a new column titled ratio that is the ratio of the sum of the word to the count of the word.
words['ratio'] = words['sum'] / words['count']
words.head()
word | count | sum | ratio |
---|---|---|---|
a | 1 | 1 | 1.000000 |
aa | 2 | 2 | 1.000000 |
aal | 3 | 14 | 4.666667 |
aalii | 5 | 32 | 6.400000 |
aam | 3 | 15 | 5.000000 |
What are the highest and lowest ratios? Here, we can describe the data. We see that the highest and lowest ratios are 26 and 1, respectively.
words.describe()
| | count | sum | ratio |
---|---|---|---|
count | 235886.000000 | 235886.000000 | 235886.000000 |
mean | 9.569156 | 112.388484 | 11.700352 |
std | 2.927299 | 40.889882 | 2.227709 |
min | 1.000000 | 1.000000 | 1.000000 |
25% | 8.000000 | 83.000000 | 10.272727 |
50% | 9.000000 | 110.000000 | 11.750000 |
75% | 11.000000 | 139.000000 | 13.166667 |
max | 24.000000 | 328.000000 | 26.000000 |
But which word has the highest ratio?
words.loc[words['ratio'] == 26]
word | count | sum | ratio |
---|---|---|---|
z | 1 | 26 | 26.0 |
z | 1 | 26 | 26.0 |
We see that the letter z has the highest ratio. Likewise, the terms “a” and “aa” have the lowest ratio.
words.loc[words['ratio'] == 1]
word | count | sum | ratio |
---|---|---|---|
a | 1 | 1 | 1.0 |
aa | 2 | 2 | 1.0 |
How many words have a ratio of 10?
words.loc[words['ratio'] == 10].shape
(3410, 3)
Here, we see that 3410 words have a ratio of 10.
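The ratio column holds floating-point numbers, and testing floats for exact equality can be fragile in general. One way to sidestep the question, sketched here on a hypothetical three-row data frame named sample, is to compare the integer columns directly: a ratio of 10 means the sum equals ten times the count.

```python
import pandas as pd

sample = pd.DataFrame(
    {"count": [21, 20, 3], "sum": [210, 200, 14]},
    index=pd.Index(
        ["pathologicoanatomical", "palaeometeorological", "aal"],
        name="word",
    ),
)

# Integer comparison: sum == 10 * count is the same condition as ratio == 10
ratio_ten = sample.loc[sample["sum"] == 10 * sample["count"]]

print(len(ratio_ten))
```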
What is the maximum sum value for words with a ratio of 10? Here, we can sort the words with a ratio of 10. We see the max is 210.
words.loc[words['ratio'] == 10].sort_values(by="sum", ascending=False).head()
word | count | sum | ratio |
---|---|---|---|
pathologicoanatomical | 21 | 210 | 10.0 |
anatomicopathological | 21 | 210 | 10.0 |
palaeometeorological | 20 | 200 | 10.0 |
anatomicochirurgical | 20 | 200 | 10.0 |
hypsidolichocephalic | 20 | 200 | 10.0 |
Of the words with a sum of 260, what is the lowest character count? Again, we can sort the values.
words.loc[words['sum'] == 260].sort_values(by="count", ascending=True).head()
word | count | sum | ratio |
---|---|---|---|
gastrohysteropexy | 17 | 260 | 15.294118 |
overobsequiousness | 18 | 260 | 14.444444 |
psychophysiologist | 18 | 260 | 14.444444 |
extraconstitutional | 19 | 260 | 13.684211 |
hypsiprymnodontinae | 19 | 260 | 13.684211 |
Here, we see that the count is 17 for the word gastrohysteropexy.
NB: When column names contain spaces, surround them with backticks, for example inside a query expression.
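As an illustration of the backtick rule, here is a minimal sketch using the pandas query method. The data frame sample and its column name "letter sum" are hypothetical; the real data set has no column names with spaces.

```python
import pandas as pd

# Hypothetical data frame with a space in one column name
sample = pd.DataFrame({"word": ["aa", "aal"], "letter sum": [2, 14]})

# Backticks let query() refer to a column whose name contains a space
result = sample.query("`letter sum` > 10")

print(result["word"].tolist())
```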
4 Conclusion
In this analysis, I have analyzed popular English words using Python. The objective was to illustrate the use of Python for data analysis.