# Chapter 13 Working with Textual Data

You may not be used to thinking about _text_, like a newspaper article or an e-mail, as data. But just like houses and wines, it makes sense to find articles that are similar or to classify e-mails into types. The focus of this chapter is how to turn raw, unstructured text into tabular form so that we can apply the data science techniques we have already learned.

### Documentation

* Bags: `collections.Counter`: https://docs.python.org/3/library/collections.html#collections.Counter
* Vectorized string functions across a Pandas Series: `pandas.Series.str`: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.html
* For one of the exercises, you may need to get rid of `nan` values: `pandas.Series.fillna()`: https://pandas.pydata.org/docs/reference/api/pandas.Series.fillna.html


# Chapter 13.1 Bag of Words and N-Grams

In data science, a text is typically called a **document**, even though a document can be anything from a text message to a full-length novel.  A collection of documents is called a **corpus**. In this chapter, we will work with a corpus of text messages, which contains both spam and non-spam ("ham") messages.

In [1]:
import pandas as pd
pd.options.display.max_rows = 10

texts = pd.read_csv(
    "../data/SMSSpamCollection.txt", 
    sep="\t",
    names=["label", "text"]
)
texts

| | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |


In [3]:
texts.text[2]

"Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"

We might, for example, want to train a classifier to predict whether or not a text message is spam. To use machine learning techniques like $k$-nearest neighbors, we have to transform each of these "documents" into a more regular representation.

A **bag of words** representation reduces a document to just the multiset of its words, ignoring grammar and word order. (A _multiset_ is like a set, except elements are allowed to appear more than once.)

So, for example, the **bag of words** representation of the string "I am Sam. Sam I am." would be `{I, I, am, am, Sam, Sam}`. In Python, it is easiest to represent multisets using dictionaries, where the keys are the (unique) words and the values are the counts. So we would represent the above bag of words as `{"I": 2, "am": 2, "Sam": 2}`.

Let's convert the text messages to a bag of words representation. To do this, we will use the `Counter` class in the `collections` module of the Python standard library. First, let's see how `Counter` works.

In [4]:
from collections import Counter
cc = Counter(["I", "am", "Sam", "Sam", "I", "am"])

cc

Counter({'I': 2, 'am': 2, 'Sam': 2})

In [8]:
s = "I am Sam Sam I am"
s.split(" ")
Counter(s.split(" "))

Counter({'I': 2, 'am': 2, 'Sam': 2})

In [9]:
cc["I"], cc["fancy"]

(2, 0)
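The last cell shows a property that makes `Counter` convenient for bags of words: a word that never appeared has a count of 0, whereas a plain dictionary would raise a `KeyError` for the same lookup. A minimal sketch using only the standard library (the variable names are illustrative):

```python
from collections import Counter

bag = Counter(["I", "am", "Sam", "Sam", "I", "am"])
plain = dict(bag)  # same keys and counts, but an ordinary dictionary

# A Counter reports 0 for a word it has never seen...
missing_count = bag["fancy"]

# ...while an ordinary dictionary raises KeyError for the same lookup.
try:
    plain["fancy"]
    raised = False
except KeyError:
    raised = True

# Looking up a missing word does not add it to the bag.
still_absent = "fancy" not in bag

print(missing_count, raised, still_absent)  # 0 True True
```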

`Counter` takes in a list and returns a dictionary of counts; in other words, the bag of words representation that we want. But to be able to use `Counter`, we first have to convert our text into a list of words. We can do this using the string methods in Pandas, such as `.str.split()`, which splits each string into a list on some separator (which, by default, is whitespace).

In [30]:
texts["text"].str.replace(r"[^\w\s]", "", regex=True).str.split()[0]


['Go',
 'until',
 'jurong',
 'point',
 'crazy',
 'Available',
 'only',
 'in',
 'bugis',
 'n',
 'great',
 'world',
 'la',
 'e',
 'buffet',
 'Cine',
 'there',
 'got',
 'amore',
 'wat']

Two issues need attention when splitting text into words:

- **Case sensitivity.**  The output above still contains capitalized words such as "Go" and "Available". Likewise, the word "the" in message 5567 and the word "The" in message 5570 are technically different strings and will be treated as different words by `Counter`.
- **Punctuation.**  In the raw text of message 0, one of the words is "point,". Unless punctuation is removed, as the `.str.replace()` call above does, this will be treated differently from the word "point".

We can normalize the text for case by converting all of the characters to lowercase, using the `.str.lower()` method.  We can also strip punctuation using a regular expression. The regular expression `[^\w\s]` matches any single character that is not (`^`) a word character (`\w`: a letter, digit, or underscore) or whitespace (`\s`). In practice, this means punctuation and other symbols. We then use the `.str.replace()` method to replace every match with the empty string, effectively removing all punctuation from the string.
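To see the pattern in isolation, here is a small sketch that applies it to a single string with `re.sub` from the standard library's `re` module, rather than through Pandas. The sample string is the beginning of message 0, and the variable names are illustrative.

```python
import re

message = "Go until jurong point, crazy.. Available only"

# Lowercase first, then remove every character that is neither a word
# character nor whitespace. The r"" prefix makes this a raw string, so the
# backslashes reach the regex engine unchanged.
cleaned = re.sub(r"[^\w\s]", "", message.lower())

print(cleaned)          # go until jurong point crazy available only
print(cleaned.split())
```

After cleaning, "point," and "point" become the same word, and so do "Available" and "available".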

By chaining these commands together, we obtain a Series of word lists. We can then apply `Counter` to each list to obtain the bag of words representation.

In [31]:
words = (
    texts["text"].
    str.lower().
    str.replace(r"[^\w\s]", "", regex=True).
    str.split()
)

words


0       [go, until, jurong, point, crazy, available, o...
1                          [ok, lar, joking, wif, u, oni]
2       [free, entry, in, 2, a, wkly, comp, to, win, f...
3       [u, dun, say, so, early, hor, u, c, already, t...
4       [nah, i, dont, think, he, goes, to, usf, he, l...
                              ...                        
5567    [this, is, the, 2nd, time, we, have, tried, 2,...
5568         [will, ü, b, going, to, esplanade, fr, home]
5569    [pity, was, in, mood, for, that, soany, other,...
5570    [the, guy, did, some, bitching, but, i, acted,...
5571                     [rofl, its, true, to, its, name]
Name: text, Length: 5572, dtype: object

In [33]:
bags = words.apply(Counter)
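The same pipeline can be checked on a single message without Pandas. The sketch below uses a hypothetical helper, `bag_of_words`, built from `re.sub` and `Counter`; it is meant to mirror the chained `.str` methods, not to replace them. The sample text is message 2 of the corpus, copied from the earlier output.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase, strip punctuation, split on whitespace, and count the words."""
    cleaned = re.sub(r"[^\w\s]", "", text.lower())
    return Counter(cleaned.split())

message = (
    "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. "
    "Text FA to 87121 to receive entry question(std txt rate)T&C's apply "
    "08452810075over18's"
)
bag = bag_of_words(message)

print(bag["to"], bag["entry"], bag["fa"])  # 3 2 2
print(bag.most_common(1))                  # [('to', 3)]
```

Stripping punctuation also fuses some words that were only separated by symbols: "question(std" becomes `questionstd`, for instance.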

In [38]:
bags[2], texts.text[2]

(Counter({'free': 1,
          'entry': 2,
          'in': 1,
          '2': 1,
          'a': 1,
          'wkly': 1,
          'comp': 1,
          'to': 3,
          'win': 1,
          'fa': 2,
          'cup': 1,
          'final': 1,
          'tkts': 1,
          '21st': 1,
          'may': 1,
          '2005': 1,
          'text': 1,
          '87121': 1,
          'receive': 1,
          'questionstd': 1,
          'txt': 1,
          'ratetcs': 1,
          'apply': 1,
          '08452810075over18s': 1}),
 "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's")

## N-Grams

The problem with the bag of words representation is that the ordering of the words is lost. For example, the following sentences have the exact same bag of words representation, but convey different meanings:

1. The dog bit her owner.
2. Her dog bit the owner.

The first sentence has only two actors (the dog and its owner), but the second sentence has three (a woman, her dog, and the owner of something). To better capture the _semantic_ meaning of these two documents, we can use **bigrams** instead of individual words. A **bigram** is simply a pair of consecutive words. The "bags of bigrams" of the two sentences above are quite different:

1. {"The dog", "dog bit", "bit her", "her owner"}
2. {"Her dog", "dog bit", "bit the", "the owner"}

They only share 1 bigram (out of 4) in common, even though they share the same 5 words.

Let's get the bag of bigrams representation for the words above. To generate the bigrams from the list of words, we will use the `zip` function in Python, which takes in two or more lists and returns tuples consisting of one element from each list, matched by position:

In [64]:
list(zip([1, 2, 3], [4, 5, 6], [7,8,9]))

[(1, 4, 7), (2, 5, 8), (3, 6, 9)]
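Two details of `zip` matter for what follows, sketched here with made-up lists: in Python 3 it returns a lazy iterator rather than a list (hence the call to `list` above), and it stops as soon as the shortest input runs out.

```python
numbers = [1, 2, 3]
letters = ["a", "b"]

pairs = zip(numbers, letters)

# Converting to a list consumes the iterator; the unmatched 3 is dropped.
first_pass = list(pairs)

# The iterator is now exhausted, so a second pass yields nothing.
second_pass = list(pairs)

print(first_pass)   # [(1, 'a'), (2, 'b')]
print(second_pass)  # []
```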

In [42]:
s = "The dog bit her owner"

sl = s.lower().split()
sl


['the', 'dog', 'bit', 'her', 'owner']

In [45]:
s1 = sl[0:-1]  ### first words in my bigrams
s2 = sl[1:]    ### second words in my bigrams
list(zip(s1, s2))  

[('the', 'dog'), ('dog', 'bit'), ('bit', 'her'), ('her', 'owner')]
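Returning to the two sentences about the dog, the following sketch checks the earlier claim that they have identical bags of words but share only one bigram. The helper `bigram_bag` is a hypothetical name introduced for this illustration.

```python
from collections import Counter

def bigram_bag(sentence):
    """Return the bag of bigrams of a sentence that contains no punctuation."""
    words = sentence.lower().split()
    return Counter(zip(words[:-1], words[1:]))

first = "The dog bit her owner"
second = "Her dog bit the owner"

# The bags of words are identical...
same_words = Counter(first.lower().split()) == Counter(second.lower().split())

# ...but the bags of bigrams overlap in a single pair.
# (The & operator keeps each key with the smaller of its two counts.)
shared = bigram_bag(first) & bigram_bag(second)

print(same_words)    # True
print(list(shared))  # [('dog', 'bit')]
```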

In [48]:
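def get_bigrams(words):
    # We need to line up the words as follows (where n = len(words)):
    #   words[0], words[1]
    #   words[1], words[2]
    #       ... ,  ...
    # words[n-2], words[n-1]
    # Return a list rather than the bare zip iterator so the result can be reused.
    return list(zip(words[:-1], words[1:]))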


In [51]:
Counter(get_bigrams(words[0]))

Counter({('go', 'until'): 1,
         ('until', 'jurong'): 1,
         ('jurong', 'point'): 1,
         ('point', 'crazy'): 1,
         ('crazy', 'available'): 1,
         ('available', 'only'): 1,
         ('only', 'in'): 1,
         ('in', 'bugis'): 1,
         ('bugis', 'n'): 1,
         ('n', 'great'): 1,
         ('great', 'world'): 1,
         ('world', 'la'): 1,
         ('la', 'e'): 1,
         ('e', 'buffet'): 1,
         ('buffet', 'cine'): 1,
         ('cine', 'there'): 1,
         ('there', 'got'): 1,
         ('got', 'amore'): 1,
         ('amore', 'wat'): 1})

In [53]:

bigrams = words.apply(get_bigrams).apply(Counter)

Instead of taking 2 words at a time, we could take 3, 4, or, in general, $n$ words. 
A tuple of $n$ consecutive words is called an $n$-gram, and we can convert any document to a "bag of $n$-grams" representation. 

The larger $n$ is, the more of the word order, and hence of the meaning of a document, the representation preserves. But larger $n$ also makes each individual $n$-gram rarer: if $n$ is so large that $n$-grams never occur more than once in a corpus, then we will not learn much from this representation.
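One way to generalize the bigram construction to any $n$ is to zip together $n$ shifted copies of the word list. This is only a sketch; the name `get_ngrams` is hypothetical and not part of any library.

```python
from collections import Counter

def get_ngrams(words, n):
    """Return the list of n-grams (tuples of n consecutive words)."""
    # words[i:] drops the first i words; zip stops at the shortest copy,
    # so every tuple holds exactly n consecutive words.
    shifted = [words[i:] for i in range(n)]
    return list(zip(*shifted))

words = "the dog bit her owner".split()

print(get_ngrams(words, 3))
# [('the', 'dog', 'bit'), ('dog', 'bit', 'her'), ('bit', 'her', 'owner')]

print(Counter(get_ngrams(words, 4)))
```

With $n = 2$, this produces the same pairs as `get_bigrams`. If a document has fewer than $n$ words, the result is an empty list.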

In [56]:
bigrams[777]

Counter({('why', 'dont'): 1,
         ('dont', 'you'): 1,
         ('you', 'go'): 1,
         ('go', 'tell'): 1,
         ('tell', 'your'): 1,
         ('your', 'friend'): 1,
         ('friend', 'youre'): 1,
         ('youre', 'not'): 1,
         ('not', 'sure'): 1,
         ('sure', 'you'): 1,
         ('you', 'want'): 1,
         ('want', 'to'): 1,
         ('to', 'live'): 1,
         ('live', 'with'): 1,
         ('with', 'him'): 1,
         ('him', 'because'): 1,
         ('because', 'he'): 1,
         ('he', 'smokes'): 1,
         ('smokes', 'too'): 1,
         ('too', 'much'): 1,
         ('much', 'then'): 1,
         ('then', 'spend'): 1,
         ('spend', 'hours'): 1,
         ('hours', 'begging'): 1,
         ('begging', 'him'): 1,
         ('him', 'to'): 1,
         ('to', 'come'): 1,
         ('come', 'smoke'): 1})

# Exercises

**Exercise 1.** Read in the OKCupid data set (`../data/okcupid/profiles.csv`). Convert the users' responses to `essay0` ("self summary") into a bag of words representation.

(_Hint:_ Test your code on the first 100 users before testing it on the entire data set.)

In [58]:
okCupid = pd.read_csv("../data/okCupid/profiles.csv")

In [61]:
okCupid["essay0"][7]

nan

In [None]:
# TYPE YOUR CODE HERE.




**Exercise 2.** The text of _Green Eggs and Ham_ by Dr. Seuss can be found in `../data/drseuss/greeneggsandham.txt`. Read in this file and convert this "document" into a bag of trigrams (3-grams) representation. Some code has been provided to get you started.

In [63]:
# TYPE YOUR CODE HERE.
with open("../data/drseuss/greeneggsandham.txt", "r") as f:
    for line in f:
        print(line)
        pass

I am Sam



I am Sam

Sam I am



That Sam-I-am

That Sam-I-am!

I do not like

that Sam-I-am



Do you like

green eggs and ham



I do not like them,

Sam-I-am.

I do not like

green eggs and ham.



Would you like them

Here or there?



I would not like them

here or there.

I would not like them

anywhere.

I do not like

green eggs and ham.

I do not like them,

Sam-I-am



Would you like them

in a house?

Would you like them

with a mouse?



I do not like them

in a house.

I do not like them

with a mouse.

I do not like them

here or there.

I do not like them

anywhere.

I do not like green eggs and ham.

I do not like them, Sam-I-am.





Would you eat them

in a box?

Would you eat them

with a fox?



Not in a box.

Not with a fox.

Not in a house.

Not with a mouse.

I would not eat them here or there.

I would not eat them anywhere.

I would not eat green eggs and ham.

I do not like them, Sam-I-am.



Would you? Could you?

in a car?

Eat them! Eat them!

Here they ar