Data Analysis in Python

Analyzing Popular English Words in Python

Author
Affiliations

John Karuitha

Karatina University, School of Business

Graduate School of Business Administration, University of the Witwatersrand

0.1 Background

In this project, I analyze a list of popular English words. The data is available on this site. The project is part of a project-based course created by FreeCodeCamp, which is available on YouTube on this link.

The purpose is to illustrate the basics of data analysis using the Python language.

1 Project 1: English Words

  • Load the packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

I read in the data, convert the word column to strings, and normalize each word to lowercase with surrounding whitespace removed.

words = pd.read_csv("unix-words.txt")
# help(pd.read_csv)

words['word'] = words['word'].astype(str)
words.columns

words['word'] = [word.lower().strip() for word in words['word']]

words.head()
word
0 a
1 aa
2 aal
3 aalii
4 aam
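
A vectorized variant of the loading step might look like the sketch below. It uses a small in-memory sample in place of unix-words.txt (the sample words are hypothetical), and it passes keep_default_na=False, an assumption that is not part of the code above, so that an entry such as "null" is read as text rather than as a missing value.

```python
import io

import pandas as pd

# Hypothetical three-word sample standing in for unix-words.txt
sample_file = io.StringIO("word\nAal\nnull\n Zebra \n")

# keep_default_na=False stops pandas from parsing "null" as a missing value
sample = pd.read_csv(sample_file, keep_default_na=False)

# The .str accessor lowercases and strips every entry without a Python loop
sample["word"] = sample["word"].astype(str).str.lower().str.strip()

print(sample["word"].tolist())
```

Whether the extra argument matters depends on the contents of the word list, so treat it as an optional safeguard.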

I add a count of letters for each word as a column named count.

# Manual letter counter, shown for illustration; it is equivalent to len(word)
def count_letters(word):
    count = 0
    for letter in word:
        count += 1
    return count


words['count'] = [len(word) for word in words['word']]
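
As a quick check on a few sample words, a manual counter, the built-in len, and the pandas .str.len() accessor should all agree. The frame named sample below is a hypothetical stand-in for the full data.

```python
import pandas as pd


def count_letters(word):
    """Count characters one at a time (equivalent to len)."""
    count = 0
    for _ in word:
        count += 1
    return count


# Small stand-in for the full word list
sample = pd.DataFrame({"word": ["a", "aal", "aalii"]})

manual = [count_letters(w) for w in sample["word"]]
builtin = [len(w) for w in sample["word"]]
vectorized = sample["word"].str.len().tolist()

print(manual, builtin, vectorized)
```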

Lastly, I create a column sum that adds up the values of the letters in each word. For instance, given that “a” is in position one alphabetically, the word “aa” has a sum of 2, while “bb” has a sum of 4, and so on.

letters = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 'i': 9, 'j': 10, 'k': 11, 'l': 12, 'm': 13,
           'n': 14, 'o': 15, 'p': 16, 'q': 17, 'r': 18, 's': 19, 't': 20, 'u': 21, 'v': 22, 'w': 23, 'x': 24, 'y': 25, 'z': 26}

# Function to get the sum of letters


def sum_letters(word):
    # Named total to avoid shadowing the built-in sum
    total = 0
    for character in word.lower():
        # Characters outside a-z contribute 0
        total += letters.get(character, 0)
    return total


words["sum"] = [sum_letters(word) for word in words["word"]]

words.head()
word count sum
0 a 1 1
1 aa 2 2
2 aal 3 14
3 aalii 5 32
4 aam 3 15
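
The letter values can also be computed without a lookup table. The sketch below is an alternative, not the approach used above: it derives each position from the character code with ord, and the function name sum_letters_ord is hypothetical.

```python
def sum_letters_ord(word):
    """Sum alphabet positions (a=1 ... z=26); other characters count as 0."""
    return sum(ord(ch) - ord("a") + 1 for ch in word.lower() if "a" <= ch <= "z")


# Words taken from the examples and output above
for word in ["aa", "bb", "aal", "aalii"]:
    print(word, sum_letters_ord(word))
```

The dictionary version is arguably easier to read, while this one avoids typing out 26 entries.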

2 Exploring the Data

In this section, I explore the data.

Let's look at the first few rows of the data set.

words.head()
word count sum
0 a 1 1
1 aa 2 2
2 aal 3 14
3 aalii 5 32
4 aam 3 15

Then, we look at the information about the data.

words.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235886 entries, 0 to 235885
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   word    235886 non-null  object
 1   count   235886 non-null  int64 
 2   sum     235886 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 5.4+ MB

How about the dimensions of the data?

# NB shape is an attribute
words.shape
(235886, 3)

And the last few rows.

words.tail()
word count sum
235881 zythem 6 97
235882 zythia 6 89
235883 zythum 6 113
235884 zyzomys 7 149
235885 zyzzogeton 10 179

Next, we look at the columns of the data.

print(words.columns)
Index(['word', 'count', 'sum'], dtype='object')

Finally, let us make the word column the index.

words.set_index("word", inplace=True)

Let's see the updated data.

words.head()
count sum
word
a 1 1
aa 2 2
aal 3 14
aalii 5 32
aam 3 15

3 Slicing Rows of Data Using LOC

In this section, I illustrate how to select rows by index label using the “loc” indexer in pandas; its counterpart “iloc” selects rows by integer position.

# Use the square brackets to subset
words.loc['jump']
count     4
sum      60
Name: jump, dtype: int64
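
The difference between label-based and position-based selection can be sketched on a four-row stand-in for the data. The frame named sample is hypothetical, with values copied from the output above.

```python
import pandas as pd

# Small stand-in for the indexed words frame
sample = pd.DataFrame(
    {"count": [1, 2, 3, 4], "sum": [1, 2, 14, 60]},
    index=pd.Index(["a", "aa", "aal", "jump"], name="word"),
)

by_label = sample.loc["jump"]          # row whose index label is "jump"
by_position = sample.iloc[3]           # fourth row, counting from 0
label_slice = sample.loc["aa":"aal"]   # label slices include both endpoints
position_slice = sample.iloc[1:3]      # position slices exclude the stop

print(by_label.equals(by_position))
print(label_slice.equals(position_slice))
```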

We examine the maximum values in the dataset. Here we can use the max method in pandas.

words.max()
count     24
sum      328
dtype: int64

We can use the describe method instead. This method gives the summary statistics of all the variables, including the minimum and maximum values.

words.describe()
count sum
count 235886.000000 235886.000000
mean 9.569156 112.388484
std 2.927299 40.889882
min 1.000000 1.000000
25% 8.000000 83.000000
50% 9.000000 110.000000
75% 11.000000 139.000000
max 24.000000 328.000000

Next, let's find which of the candidate words below has a count of 7 and a sum of 87.

words.loc[["pinfish", "glowing", "enfold", "microbrew"]]
count sum
word
pinfish 7 81
glowing 7 87
enfold 6 56
microbrew 9 106

We see that the term glowing meets these criteria.
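
Instead of reading the answer off the table, the two conditions can be combined in a boolean mask. This is a minimal sketch on a hypothetical frame named candidates, with values copied from the output above.

```python
import pandas as pd

# The four candidate words with their count and sum
candidates = pd.DataFrame(
    {"count": [7, 7, 6, 9], "sum": [81, 87, 56, 106]},
    index=pd.Index(["pinfish", "glowing", "enfold", "microbrew"], name="word"),
)

# Each comparison yields a boolean Series; & keeps rows where both hold
mask = (candidates["count"] == 7) & (candidates["sum"] == 87)
matches = candidates.loc[mask].index.tolist()

print(matches)
```

The parentheses around each comparison are required because & binds more tightly than ==.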

3.1 Highest Possible Value of a Word

We can again use the describe method as an easy way to get all the summary statistics necessary to answer this question. The maximum of the sum column is 328.

words.describe()
count sum
count 235886.000000 235886.000000
mean 9.569156 112.388484
std 2.927299 40.889882
min 1.000000 1.000000
25% 8.000000 83.000000
50% 9.000000 110.000000
75% 11.000000 139.000000
max 24.000000 328.000000

3.2 Word with a Value of 328

Here we can subset as follows.

words.loc[words['sum'] == 328]
count sum
word
dacryocystosyringotomy 22 328

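The idxmax method can also return useful results, although it is less targeted: it gives the index label of the maximum in each column, so the output below includes the longest word as well as the word with the highest sum.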

words.loc[words.idxmax()]
count sum
word
formaldehydesulphoxylate 24 294
dacryocystosyringotomy 22 328
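
To see what idxmax returns on its own, here is a small sketch on a hypothetical three-row frame named sample, with values copied from the outputs above. Called on the frame it gives one label per column; called on a single column it gives one label.

```python
import pandas as pd

sample = pd.DataFrame(
    {"count": [24, 22, 7], "sum": [294, 328, 87]},
    index=pd.Index(
        ["formaldehydesulphoxylate", "dacryocystosyringotomy", "glowing"],
        name="word",
    ),
)

per_column = sample.idxmax()         # Series: column name -> label of its maximum
top_by_sum = sample["sum"].idxmax()  # single label for the sum column only

print(per_column)
print(top_by_sum)
```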

3.3 Most Common Value

What is the most common value for sum and count? Here, we can use the value_counts method.

words['sum'].value_counts()
102    2362
107    2332
97     2323
100    2302
98     2279
       ... 
301       1
291       1
292       1
297       1
314       1
Name: sum, Length: 305, dtype: int64

We can also use the mode method.

words["sum"].mode()
0    102
Name: sum, dtype: int64
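
The relationship between value_counts and mode can be sketched on a short hypothetical Series named sums. value_counts sorts by frequency, so its first index entry is the most common value, while mode returns a Series because several values can tie.

```python
import pandas as pd

# Hypothetical sample of letter sums
sums = pd.Series([102, 97, 102, 107, 102, 107], name="sum")

counts = sums.value_counts()     # most frequent value first
most_common = counts.index[0]    # the value itself
occurrences = counts.iloc[0]     # how often it appears
modes = sums.mode().tolist()     # list, since ties are possible

print(most_common, occurrences, modes)
```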

We see that for sum, 102 occurs 2362 times. You could then use this to check the first few values.

words.loc[words["sum"] == 102].head()
count sum
word
abrogation 10 102
absurdly 8 102
acanthous 9 102
acatamathesia 13 102
acclimatize 11 102

We can do the same for count.

words['count'].value_counts()
9     32404
10    30878
8     29989
11    26013
7     23869
12    20462
6     17706
13    14939
5     10230
14     9765
15     5925
4      5271
16     3377
17     1813
3      1421
18      842
19      428
20      198
2       160
21       82
1        51
22       41
23       17
24        5
Name: count, dtype: int64

words["count"].mode()
0    9
Name: count, dtype: int64

We see that for count, 9 occurs 32404 times. You could then use this to check the first few values.

words.loc[words["count"] == 9].head()
count sum
word
aaronical 9 74
aaronitic 9 90
microbrew 9 106
abacinate 9 56
abaciscus 9 78

3.4 Shortest Word with a Value (SUM) of 271

Here, we combine the slicing methods with the sort_values method. We see that the term oversuperstitious meets the condition.

words.loc[words["sum"] == 271].sort_values(by="count").head()
count sum
word
oversuperstitious 17 271
prostatomyomectomy 18 271
ultrarevolutionist 18 271
ureteroenterostomy 18 271
superconstitutional 19 271
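
When only the single shortest word is needed, sorting the whole subset is optional. The sketch below uses a hypothetical frame named sample, with values copied from the output above, and compares idxmin with nsmallest as two alternatives.

```python
import pandas as pd

# Rows deliberately out of order to show that no prior sorting is assumed
sample = pd.DataFrame(
    {"count": [19, 17, 7, 18], "sum": [271, 271, 87, 271]},
    index=pd.Index(
        ["superconstitutional", "oversuperstitious", "glowing", "prostatomyomectomy"],
        name="word",
    ),
)

subset = sample.loc[sample["sum"] == 271]

shortest = subset["count"].idxmin()                     # label of the smallest count
shortest_alt = subset.nsmallest(1, "count").index[0]    # same answer via nsmallest

print(shortest, shortest_alt)
```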

3.5 Create a New Column

In this section, I create a new column titled ratio that is the ratio of a word's sum to its count.

words['ratio'] = words['sum'] / words['count']

words.head()
count sum ratio
word
a 1 1 1.000000
aa 2 2 1.000000
aal 3 14 4.666667
aalii 5 32 6.400000
aam 3 15 5.000000

What are the highest and lowest ratios? Here, we can describe the data. We see the highest and lowest ratios are 26 and 1, respectively.

words.describe()
count sum ratio
count 235886.000000 235886.000000 235886.000000
mean 9.569156 112.388484 11.700352
std 2.927299 40.889882 2.227709
min 1.000000 1.000000 1.000000
25% 8.000000 83.000000 10.272727
50% 9.000000 110.000000 11.750000
75% 11.000000 139.000000 13.166667
max 24.000000 328.000000 26.000000

But which word has the highest ratio?

words.loc[words['ratio'] == 26]
count sum ratio
word
z 1 26 26.0
z 1 26 26.0

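We see that the letter z has the highest ratio. It appears twice, which suggests the source list holds two entries that become identical after lowercasing. Likewise, the terms “a” and “aa” have the lowest ratio.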

words.loc[words['ratio'] == 1]
count sum ratio
word
a 1 1 1.0
aa 2 2 1.0

How many words have a ratio of 10?

words.loc[words['ratio'] == 10].shape
(3410, 3)

Here, we see that 3410 words have a ratio of 10.
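
The comparison ratio == 10 tests floating-point values for equality. It works here because a sum that is exactly ten times the count divides to exactly 10.0, but the same condition can be written with integers only. The sketch below uses a hypothetical frame named sample, with values copied from outputs in this document, and checks that both forms select the same rows.

```python
import pandas as pd

sample = pd.DataFrame(
    {"count": [21, 20, 3, 7], "sum": [210, 200, 14, 87]},
    index=pd.Index(
        ["pathologicoanatomical", "palaeometeorological", "aal", "glowing"],
        name="word",
    ),
)
sample["ratio"] = sample["sum"] / sample["count"]

float_mask = sample["ratio"] == 10               # float equality
int_mask = sample["sum"] == 10 * sample["count"]  # integer-only equivalent

n_float = int(float_mask.sum())
n_int = int(int_mask.sum())
print(n_float, n_int)
```

For targets that are not whole numbers, such as 4.666667, exact float equality is unlikely to match, and a tolerance-based check such as numpy.isclose is the usual alternative.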

What is the maximum sum value for words with a ratio of 10? Here, we can sort the words with a ratio of 10. We see the max is 210.

words.loc[words['ratio'] == 10].sort_values(by="sum", ascending=False).head()
count sum ratio
word
pathologicoanatomical 21 210 10.0
anatomicopathological 21 210 10.0
palaeometeorological 20 200 10.0
anatomicochirurgical 20 200 10.0
hypsidolichocephalic 20 200 10.0

Of the words with a sum of 260, what is the lowest character count? Again, we can sort the values.

words.loc[words['sum'] == 260].sort_values(by="count", ascending=True).head()
count sum ratio
word
gastrohysteropexy 17 260 15.294118
overobsequiousness 18 260 14.444444
psychophysiologist 18 260 14.444444
extraconstitutional 19 260 13.684211
hypsiprymnodontinae 19 260 13.684211

Here, we see that the count is 17 for the word gastrohysteropexy.

NB: When column names contain spaces, surround them with backticks when referring to them in the query method.
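
A minimal sketch of the backtick syntax with DataFrame.query follows. The column names "letter count" and "letter sum" are hypothetical; the columns in this analysis have no spaces, so they would not need backticks.

```python
import pandas as pd

# Hypothetical frame whose column names contain spaces
sample = pd.DataFrame(
    {
        "word": ["pinfish", "glowing", "enfold"],
        "letter count": [7, 7, 6],
        "letter sum": [81, 87, 56],
    }
)

# Backticks let the query expression refer to names with spaces
result = sample.query("`letter count` == 7 and `letter sum` == 87")

print(result["word"].tolist())
```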

4 Conclusion

In this analysis, I have analyzed popular English words using Python. The objective was to illustrate the use of Python for data analysis.