Data Analysis in Python

Analyzing Popular English Words in Python

Author: John Karuitha

Affiliations: Karatina University, School of Business; Graduate School of Business Administration, University of the Witwatersrand

0.1 Background

In this project, I analyze data about popular English words. The data is available on this site. The project is part of a project-based course created by freeCodeCamp, which is available on YouTube on this link.

The purpose is to illustrate the basics of data analysis using the Python language.

1 Project 1: English Words

Load the packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

I read in the data, convert the word column to strings, and then lowercase each word and strip surrounding whitespace.

words = pd.read_csv("unix-words.txt")
# help(pd.read_csv)

words['word'] = words['word'].astype(str)
words.columns

words['word'] = [word.lower().strip() for word in words['word']]

words.head()

	word
0	a
1	aa
2	aal
3	aalii
4	aam

I add a count of letters for each word as a column named count.

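# count_letters is a hand-written equivalent of the built-in len()
def count_letters(word):
    count = 0
    for letter in word:
        count += 1
    return count


# The built-in len() gives the same result, so the column uses it directly
words['count'] = [len(word) for word in words['word']]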
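The different ways of counting letters can be compared on a small frame. This is only an illustrative sketch: the frame `toy` and its column names are assumptions, and `Series.str.len` is shown as a vectorized pandas alternative to the list comprehension, not as the method used in this project.

```python
import pandas as pd


def count_letters(word):
    # Count characters one by one, as in the function above
    count = 0
    for letter in word:
        count += 1
    return count


# Toy data (hypothetical frame; the words come from the head of the data set)
toy = pd.DataFrame({"word": ["a", "aa", "aal", "aalii", "aam"]})

toy["count_loop"] = [count_letters(w) for w in toy["word"]]  # explicit loop
toy["count_len"] = [len(w) for w in toy["word"]]             # built-in len()
toy["count_str"] = toy["word"].str.len()                     # vectorized accessor

print(toy)
```

All three columns hold the same values, so the choice is mostly a matter of readability.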

Lastly, I create a column sum that adds up the values of the letters in each word, where a letter's value is its position in the alphabet. For instance, given that “a” is in position one alphabetically, the word “aa” has a sum of 2, while “bb” has a sum of 4, and so on.

letters = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 'i': 9, 'j': 10, 'k': 11, 'l': 12, 'm': 13,
           'n': 14, 'o': 15, 'p': 16, 'q': 17, 'r': 18, 's': 19, 't': 20, 'u': 21, 'v': 22, 'w': 23, 'x': 24, 'y': 25, 'z': 26}

# Function to get the sum of letters


def sum_letters(word):
    total = 0  # named total to avoid shadowing the built-in sum()
    for character in word.lower():
        total += letters.get(character, 0)  # non-letters contribute 0
    return total


words["sum"] = [sum_letters(word) for word in words["word"]]

words.head()

	word	count	sum
0	a	1	1
1	aa	2	2
2	aal	3	14
3	aalii	5	32
4	aam	3	15
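As a worked check of the letter values, “aal” gives 1 + 1 + 12 = 14 and “aalii” gives 1 + 1 + 12 + 9 + 9 = 32. The sketch below reproduces these numbers with a compact variant of the mapping built from `string.ascii_lowercase`; this variant is an alternative shown for illustration, not the code used above.

```python
import string

# Map each lowercase letter to its 1-based position: a -> 1, ..., z -> 26
letters = {
    letter: position
    for position, letter in enumerate(string.ascii_lowercase, start=1)
}


def sum_letters(word):
    # Add the alphabet positions of the letters; other characters count as 0
    return sum(letters.get(character, 0) for character in word.lower())


for word in ["a", "aa", "aal", "aalii", "aam", "jump"]:
    print(word, sum_letters(word))
```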

2 Exploring the Data

In this section, I explore the data.

Let's look at the first few rows of the data set.

words.head()

	word	count	sum
0	a	1	1
1	aa	2	2
2	aal	3	14
3	aalii	5	32
4	aam	3	15

Then, we look at the information about the data.

words.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235886 entries, 0 to 235885
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   word    235886 non-null  object
 1   count   235886 non-null  int64 
 2   sum     235886 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 5.4+ MB

How about the dimensions of the data?

# NB shape is an attribute
words.shape

(235886, 3)

And the last few rows.

words.tail()

	word	count	sum
235881	zythem	6	97
235882	zythia	6	89
235883	zythum	6	113
235884	zyzomys	7	149
235885	zyzzogeton	10	179

Next, we look at the columns of the data.

print(words.columns)

Index(['word', 'count', 'sum'], dtype='object')

Finally, let us convert the word column into the index.

words.set_index("word", inplace=True)

Let's see the updated data.

words.head()

	count	sum
word
a	1	1
aa	2	2
aal	3	14
aalii	5	32
aam	3	15
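Setting the index can also be done without `inplace=True`, by assigning the returned frame to a new name. The sketch below, on a hypothetical three-row frame named `toy`, shows that form together with `reset_index`, which reverses the operation.

```python
import pandas as pd

# Hypothetical toy frame with the same columns as the data set
toy = pd.DataFrame(
    {"word": ["a", "aa", "aal"], "count": [1, 2, 3], "sum": [1, 2, 14]}
)

indexed = toy.set_index("word")   # returns a new frame; toy itself is unchanged
restored = indexed.reset_index()  # moves the index back into a regular column

print(indexed)
print(restored)
```

Keeping the original frame untouched makes it easier to rerun a notebook cell without errors about a missing column.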

3 Slicing Rows of Data Using LOC

In this section, I illustrate how to slice data by rows using the loc indexer in pandas, which selects rows by label (iloc is its position-based counterpart).

# Use the square brackets to subset
words.loc['jump']

count     4
sum      60
Name: jump, dtype: int64
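To make the difference between label-based and position-based selection concrete, here is a minimal sketch on a hypothetical four-row frame named `toy`. It assumes the words are already the index, as in the data above.

```python
import pandas as pd

# Hypothetical toy frame indexed by word
toy = pd.DataFrame(
    {"count": [1, 2, 3, 4], "sum": [1, 2, 14, 60]},
    index=pd.Index(["a", "aa", "aal", "jump"], name="word"),
)

by_label = toy.loc["jump"]   # select the row whose label is "jump"
by_position = toy.iloc[3]    # select the fourth row, whatever its label

label_slice = toy.loc["aa":"aal"]  # label slices include both endpoints
position_slice = toy.iloc[1:3]     # position slices exclude the stop position

print(by_label)
print(label_slice)
```

In this toy frame both pairs select the same rows, but only loc keeps working if the rows are reordered.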

We examine the maximum values in the dataset. Here we can use the max method in pandas.

words.max()

count     24
sum      328
dtype: int64

We can also use the describe method instead. This method gives the summary statistics of all the variables, including the minimum and maximum values.

words.describe()

	count	sum
count	235886.000000	235886.000000
mean	9.569156	112.388484
std	2.927299	40.889882
min	1.000000	1.000000
25%	8.000000	83.000000
50%	9.000000	110.000000
75%	11.000000	139.000000
max	24.000000	328.000000

Next, let's find which of the following candidate words has a count of 7 and a sum of 87.

words.loc[["pinfish", "glowing", "enfold", "microbrew"]]

	count	sum
word
pinfish	7	81
glowing	7	87
enfold	6	56
microbrew	9	106

We see that the term glowing meets these criteria.
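Instead of reading the answer off the table, the two conditions can be combined into a boolean mask. The sketch below rebuilds the four candidate rows by hand (the frame name `candidates` is an assumption) and filters them.

```python
import pandas as pd

# The four candidate words with the values shown in the output above
candidates = pd.DataFrame(
    {"count": [7, 7, 6, 9], "sum": [81, 87, 56, 106]},
    index=pd.Index(["pinfish", "glowing", "enfold", "microbrew"], name="word"),
)

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds more tightly than ==.
mask = (candidates["count"] == 7) & (candidates["sum"] == 87)
matches = candidates.loc[mask]

print(matches)
```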

3.1 Highest Possible Value of a Word

We can again use the describe method as an easy way to get all the summary statistics necessary to answer this question. The max row shows that the highest sum is 328.

words.describe()

	count	sum
count	235886.000000	235886.000000
mean	9.569156	112.388484
std	2.927299	40.889882
min	1.000000	1.000000
25%	8.000000	83.000000
50%	9.000000	110.000000
75%	11.000000	139.000000
max	24.000000	328.000000
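When only one statistic is needed, it can be pulled out directly rather than read from the full summary. A minimal sketch on a hypothetical three-row frame named `toy`:

```python
import pandas as pd

# Hypothetical toy frame with the same columns as the data set
toy = pd.DataFrame({"count": [1, 4, 22], "sum": [1, 60, 328]})

summary = toy.describe()                  # full table of summary statistics
max_from_summary = summary.loc["max", "sum"]  # one cell of that table
max_direct = toy["sum"].max()             # the same number, computed directly

print(max_from_summary, max_direct)
```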

3.2 Word with a Value of 328

Here we can subset as follows.

words.loc[words['sum'] == 328]

	count	sum
word
dacryocystosyringotomy	22	328

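The idxmax method also returns useful results, although it answers a slightly different question: it gives the word at which each column reaches its maximum, so the output includes the longest word as well as the word with the highest sum.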

words.loc[words.idxmax()]

	count	sum
word
formaldehydesulphoxylate	24	294
dacryocystosyringotomy	22	328
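The behaviour of idxmax is easier to see on a small frame. In this sketch (the frame `toy` is hypothetical, with values copied from the outputs above), calling idxmax on the frame returns one label per column, while calling it on a single column returns just the label of interest.

```python
import pandas as pd

# Hypothetical toy frame; values copied from the outputs above
toy = pd.DataFrame(
    {"count": [24, 22, 4], "sum": [294, 328, 60]},
    index=pd.Index(
        ["formaldehydesulphoxylate", "dacryocystosyringotomy", "jump"],
        name="word",
    ),
)

labels = toy.idxmax()                # Series: column name -> label of its maximum
highest_sum_word = toy["sum"].idxmax()  # label of the maximum of one column only

print(labels)
print(highest_sum_word)
```

Note that idxmax returns only the first label when several rows share the maximum.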

3.3 Most Common Value

What is the most common value for sum and count? Here, we can use the value_counts method.

words['sum'].value_counts()

102    2362
107    2332
97     2323
100    2302
98     2279
       ... 
301       1
291       1
292       1
297       1
314       1
Name: sum, Length: 305, dtype: int64

We can also use the mode method.

words["sum"].mode()

0    102
Name: sum, dtype: int64

We see that the most common sum, 102, occurs 2362 times. We can then list the first few words with this sum.

words.loc[words["sum"] == 102].head()

	count	sum
word
abrogation	10	102
absurdly	8	102
acanthous	9	102
acatamathesia	13	102
acclimatize	11	102
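The relationship between value_counts and mode can be sketched on a short, made-up Series. The numbers below are illustrative only and are not taken from the data set.

```python
import pandas as pd

# Made-up values for illustration
sums = pd.Series([102, 107, 102, 97, 107, 102], name="sum")

counts = sums.value_counts()     # frequencies, sorted from most to least common
most_common = counts.index[0]    # the value that occurs most often
times_seen = counts.iloc[0]      # how many times it occurs

modes = sums.mode()              # a Series, because ties are possible
tied = pd.Series([1, 1, 2, 2, 3]).mode()  # two values share the top frequency

print(counts)
print(modes.tolist(), tied.tolist())
```

Because mode returns a Series, the `0` shown in the printed output is an index label, not part of the answer.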

We can do the same for count.

words['count'].value_counts()

9     32404
10    30878
8     29989
11    26013
7     23869
12    20462
6     17706
13    14939
5     10230
14     9765
15     5925
4      5271
16     3377
17     1813
3      1421
18      842
19      428
20      198
2       160
21       82
1        51
22       41
23       17
24        5
Name: count, dtype: int64

words["count"].mode()

0    9
Name: count, dtype: int64

We see that the most common count, 9, occurs 32404 times. We can then list the first few words with this count.

words.loc[words["count"] == 9].head()

	count	sum
word
aaronical	9	74
aaronitic	9	90
microbrew	9	106
abacinate	9	56
abaciscus	9	78

3.4 Shortest Word with a Value (SUM) of 271

Here, we combine the slicing methods with the sort_values method. Sorting by count in ascending order puts the shortest word first, so we see that oversuperstitious, with 17 letters, is the shortest word with a sum of 271.

words.loc[words["sum"] == 271].sort_values(by="count").head()

	count	sum
word
oversuperstitious	17	271
prostatomyomectomy	18	271
ultrarevolutionist	18	271
ureteroenterostomy	18	271
superconstitutional	19	271
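As an alternative to sorting, idxmin on the count column returns the shortest word directly. The sketch below compares the two approaches on a hypothetical frame named `toy` that holds three of the rows shown above.

```python
import pandas as pd

# Three of the words with a sum of 271, deliberately out of order
toy = pd.DataFrame(
    {"count": [18, 17, 19], "sum": [271, 271, 271]},
    index=pd.Index(
        ["prostatomyomectomy", "oversuperstitious", "superconstitutional"],
        name="word",
    ),
)

subset = toy.loc[toy["sum"] == 271]

shortest_by_sort = subset.sort_values(by="count").index[0]  # sort, take first
shortest_by_idxmin = subset["count"].idxmin()               # label of the minimum

print(shortest_by_sort, shortest_by_idxmin)
```

Sorting has the advantage of showing the runners-up as well, which is why it is used above.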

3.5 Create a New Column

In this section, I create a new column titled ratio, which is the sum of a word divided by its count, that is, the average value per letter.

words['ratio'] = words['sum'] / words['count']

words.head()

	count	sum	ratio
word
a	1	1	1.000000
aa	2	2	1.000000
aal	3	14	4.666667
aalii	5	32	6.400000
aam	3	15	5.000000

What are the highest and lowest ratios? Here, we can describe the data. We see the highest and lowest ratios are 26 and 1, respectively.

words.describe()

	count	sum	ratio
count	235886.000000	235886.000000	235886.000000
mean	9.569156	112.388484	11.700352
std	2.927299	40.889882	2.227709
min	1.000000	1.000000	1.000000
25%	8.000000	83.000000	10.272727
50%	9.000000	110.000000	11.750000
75%	11.000000	139.000000	13.166667
max	24.000000	328.000000	26.000000

But which word has the highest ratio?

words.loc[words['ratio'] == 26]

	count	sum	ratio
word
z	1	26	26.0
z	1	26	26.0

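We see that the letter z has the highest ratio. It appears twice, most likely because the source file lists both an uppercase and a lowercase entry, which become identical after lowercasing. Likewise, the terms “a” and “aa” have the lowest ratio.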

words.loc[words['ratio'] == 1]

	count	sum	ratio
word
a	1	1	1.0
aa	2	2	1.0

How many words have a ratio of 10?

words.loc[words['ratio'] == 10].shape

(3410, 3)

The first element of the shape shows that 3410 words have a ratio of 10.
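Comparing a floating-point column with `==` works here because a sum that is exactly ten times the count divides to exactly 10.0. For ratios that are not whole numbers, a tolerance-based comparison is safer. The sketch below, on a hypothetical frame named `toy`, shows the exact test, `numpy.isclose`, and an integer-only test side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values copied from the outputs in this section
toy = pd.DataFrame(
    {"count": [21, 20, 3], "sum": [210, 200, 14]},
    index=pd.Index(
        ["pathologicoanatomical", "palaeometeorological", "aal"], name="word"
    ),
)
toy["ratio"] = toy["sum"] / toy["count"]

exact = toy["ratio"] == 10                  # exact floating-point comparison
close = np.isclose(toy["ratio"], 10)        # comparison within a small tolerance
integer = toy["sum"] == 10 * toy["count"]   # avoids floating point altogether

n_words = int(exact.sum())                  # True counts as 1 when summed

print(n_words)
```

Summing the boolean mask is a compact alternative to reading the first element of shape.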

What is the maximum sum value for words with a ratio of 10? Here, we can sort the words with a ratio of 10. We see the max is 210.

words.loc[words['ratio'] == 10].sort_values(by="sum", ascending=False).head()

	count	sum	ratio
word
pathologicoanatomical	21	210	10.0
anatomicopathological	21	210	10.0
palaeometeorological	20	200	10.0
anatomicochirurgical	20	200	10.0
hypsidolichocephalic	20	200	10.0
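The maximum can also be computed directly and then used to list every word that reaches it, which makes ties explicit. A minimal sketch on a hypothetical frame named `toy`, with values copied from the output above plus one word with a different ratio:

```python
import pandas as pd

# Hypothetical toy frame; three rows from the output above plus "aalii"
toy = pd.DataFrame(
    {"count": [21, 21, 20, 5], "sum": [210, 210, 200, 32]},
    index=pd.Index(
        [
            "pathologicoanatomical",
            "anatomicopathological",
            "palaeometeorological",
            "aalii",
        ],
        name="word",
    ),
)
toy["ratio"] = toy["sum"] / toy["count"]

ratio_ten = toy.loc[toy["ratio"] == 10]   # keep only words with a ratio of 10
max_sum = ratio_ten["sum"].max()          # highest sum within that subset
top_words = ratio_ten.loc[ratio_ten["sum"] == max_sum]  # all words that reach it

print(max_sum)
print(top_words)
```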

Of the words with a sum of 260, what is the lowest character count? Again, we can sort the values.

words.loc[words['sum'] == 260].sort_values(by="count", ascending=True).head()

	count	sum	ratio
word
gastrohysteropexy	17	260	15.294118
overobsequiousness	18	260	14.444444
psychophysiologist	18	260	14.444444
extraconstitutional	19	260	13.684211
hypsiprymnodontinae	19	260	13.684211

Here, we see that the lowest count is 17, for the word gastrohysteropexy.

NB: When column names contain spaces, surround them with backticks when referring to them inside a query string, for instance in the query method.
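The note on backticks applies to the pandas query method, which is not used elsewhere in this analysis. The sketch below is an illustration under assumptions: the frame `toy` and the column names `char count` and `letter sum` are hypothetical and were chosen only because they contain spaces.

```python
import pandas as pd

# Hypothetical frame whose column names contain spaces
toy = pd.DataFrame(
    {
        "word": ["pinfish", "glowing", "enfold"],
        "char count": [7, 7, 6],
        "letter sum": [81, 87, 56],
    }
)

# Backticks let the query string refer to column names that contain spaces
result = toy.query("`char count` == 7 and `letter sum` == 87")

print(result)
```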

4 Conclusion

In this analysis, I have analyzed popular English words using Python. The objective was to illustrate the use of Python for data analysis.