# Chapter 4.2 Distance Metrics and Categorical Variables



The distance metrics that we studied in the previous section were designed for quantitative variables. But most data sets will contain a mix of categorical and quantitative variables. Let's return to the Titanic data set and see if we can measure similarity between passengers.

In [52]:
%matplotlib inline
import numpy as np
import pandas as pd
pd.options.display.max_rows = 10

titanic = pd.read_csv("../data/titanic.csv", sep=",")
titanic

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.5500,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,3,0,"Zabour, Miss. Hileni",female,14.5000,1,0,2665,14.4542,,C,,328.0,
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,C,,,
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5000,0,0,2656,7.2250,,C,,304.0,
1307,3,0,"Zakarian, Mr. Ortin",male,27.0000,0,0,2670,7.2250,,C,,,


In [53]:
titanic.embarked.unique(), titanic.groupby("embarked").name.count()

(array(['S', 'C', nan, 'Q'], dtype=object),
 embarked
 C    270
 Q    123
 S    914
 Name: name, dtype: int64)

In [42]:
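# One option is to map 'S', 'C', and 'Q' to arbitrary integers (left commented out).
# This would impose an artificial order and spacing on the categories:
# 'S' would be twice as far from 'Q' as from 'C'.

#titanic["embarkedN"] = titanic.embarked.map({"S":0, "C":1, "Q":2})
#titanic[["embarked", "embarkedN"]]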

## Converting Categorical Variables to Quantitative Variables

Binary categorical variables (categorical variables with two levels) can be converted into quantitative variables by coding one category as 1 and the other category as 0. But what about a categorical variable with more than 2 categories, like `embarked`, which has 3 categories?

We can convert categorical variables with $K$ categories to $K$ separate **0/1 variables**, or **dummy variables**, each of which is an indicator for a particular category. Although this can be done manually, the easiest way is to use the `pandas` function `get_dummies()`.
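As a rough illustration of what "manually" means here, the sketch below builds one indicator column per category on a small made-up series (the values are hypothetical, not read from the Titanic file) and checks that the result agrees with `get_dummies()`. The `dtype=int` argument is used only so that both versions hold 0/1 integers.

```python
import pandas as pd

# Toy stand-in for the embarked column (hypothetical values)
embarked = pd.Series(["S", "C", "Q", "S"], name="embarked")

# Manual encoding: one 0/1 indicator column per category
manual = pd.DataFrame({
    level: (embarked == level).astype(int)
    for level in sorted(embarked.unique())
})

# The same encoding in one call
auto = pd.get_dummies(embarked, dtype=int)

print(manual)
print((manual.values == auto.values).all())
```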

In [43]:
pd.get_dummies(titanic["embarked"])

Unnamed: 0,C,Q,S
0,0,0,1
1,0,0,1
...,...,...,...
1307,1,0,0
1308,0,0,1


In [44]:
pd.get_dummies(titanic, columns=["embarked", "sex"])

Unnamed: 0,pclass,survived,name,age,sibsp,parch,ticket,fare,cabin,boat,body,home.dest,embarked_C,embarked_Q,embarked_S,sex_female,sex_male
0,1,1,"Allen, Miss. Elisabeth Walton",29.0000,0,0,24160,211.3375,B5,2,,"St Louis, MO",0,0,1,1,0
1,1,1,"Allison, Master. Hudson Trevor",0.9167,1,2,113781,151.5500,C22 C26,11,,"Montreal, PQ / Chesterville, ON",0,0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1307,3,0,"Zakarian, Mr. Ortin",27.0000,0,0,2670,7.2250,,,,,1,0,0,0,1
1308,3,0,"Zimmerman, Mr. Leo",29.0000,0,0,315082,7.8750,,,,,0,0,1,0,1


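Since each observation is in exactly one category of a variable, each row contains exactly one 1 among that variable's dummy columns; the rest are 0s. The exception is a row whose value is missing: by default, `get_dummies` encodes it as all 0s.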
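One case worth checking is a missing value, since `embarked` contains `nan`. The sketch below uses a made-up series (not the Titanic data) to show that, by default, `get_dummies` leaves such a row as all 0s, and that passing `dummy_na=True` adds a separate indicator column for missing values.

```python
import numpy as np
import pandas as pd

# Toy series with one missing value (hypothetical values)
embarked = pd.Series(["S", "C", np.nan, "Q"])

# Default: the row with the missing value gets 0 in every dummy column
dummies = pd.get_dummies(embarked, dtype=int)
row_sums = dummies.sum(axis=1)
print(row_sums.tolist())

# dummy_na=True adds an extra column that flags missing values
with_na = pd.get_dummies(embarked, dummy_na=True, dtype=int)
print(with_na)
```

Whether a missing value should count as its own category depends on the analysis; this only shows the two behaviours.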

We can call `get_dummies` on a `DataFrame` to apply dummy-encoding to multiple variables at once. By default, `pandas` dummy-encodes only the columns that hold strings or have a categorical dtype, so any categorical variable that is stored as numbers must be converted first. `pandas` also automatically adds the variable name as a prefix to prevent collisions between category names.
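The default behavior can be seen on a toy `DataFrame` (the values below are made up for illustration). A numeric column such as `pclass` is passed through unchanged unless it is converted to strings or named explicitly in the `columns` argument.

```python
import pandas as pd

# Toy data: pclass is categorical but stored as integers
df = pd.DataFrame({
    "pclass": [1, 3, 2],
    "sex": ["female", "male", "male"],
})

# Default: only the string column is dummy-encoded
default = pd.get_dummies(df, dtype=int)
print(list(default.columns))

# Naming the columns explicitly forces pclass to be encoded as well
explicit = pd.get_dummies(df, columns=["pclass", "sex"], dtype=int)
print(list(explicit.columns))
```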

In [45]:
titanic["pclass"].unique()


array([1, 2, 3])

In [46]:
# Convert pclass to strings so that get_dummies treats it as categorical
titanic["pclass"] = titanic["pclass"].astype(str)

# Pass all variables to get_dummies, except ones that are "other" types
titanic_num = pd.get_dummies(
    titanic.drop(["name", "ticket", "cabin", "boat", "body", "home.dest"], axis=1)
)
titanic_num

Unnamed: 0,survived,age,sibsp,parch,fare,pclass_1,pclass_2,pclass_3,sex_female,sex_male,embarked_C,embarked_Q,embarked_S
0,1,29.0000,0,0,211.3375,1,0,0,1,0,0,0,1
1,1,0.9167,1,2,151.5500,1,0,0,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1307,0,27.0000,0,0,7.2250,0,0,1,0,1,1,0,0
1308,0,29.0000,0,0,7.8750,0,0,1,0,1,0,0,1


Now that we have a `DataFrame` of all numeric values, we can apply the methods from the previous section to calculate distances between observations. For example, to find the passenger who is most similar to the first passenger, Elisabeth Allen, we can find the row with the smallest Euclidean distance to her row in the above `DataFrame`.

In [47]:
# Standardize each column: subtract its mean and divide by its standard deviation

titanic_std = (titanic_num - titanic_num.mean())/titanic_num.std()
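To see why standardizing matters once dummy variables are mixed with quantitative ones, consider the sketch below. It uses four made-up passengers (hypothetical values, not rows from the Titanic file) and a helper `dist_to_first` that is introduced here only for illustration. On the raw scale, a difference of a few dollars in fare outweighs a difference in sex, because a dummy variable can differ by at most 1.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): fare in dollars and a 0/1 dummy variable
X = pd.DataFrame({
    "fare": [200.0, 210.0, 201.0, 10.0],
    "sex_female": [1, 1, 0, 0],
})

def dist_to_first(df):
    """Euclidean distance from every row to row 0."""
    return np.sqrt(((df - df.loc[0]) ** 2).sum(axis=1))

# Raw scale: fare dominates, so row 2 (different sex) looks closest
raw = dist_to_first(X)

# Standardized scale: both variables contribute comparably
X_std = (X - X.mean()) / X.std()
scaled = dist_to_first(X_std)

print(raw.drop(0).idxmin(), scaled.drop(0).idxmin())
```

In this toy example the nearest neighbor of row 0 changes from row 2 to row 1 after standardization.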




In [48]:
titanic_std.loc[0]

survived      1.271520
age          -0.061133
                ...   
embarked_Q   -0.321917
embarked_S    0.657142
Name: 0, Length: 13, dtype: float64

In [54]:

pd.DataFrame(np.sqrt(((titanic_std - titanic_std.loc[0]) ** 2).sum(axis=1)).sort_values())

Unnamed: 0,0
0,0.000000
24,0.201738
180,0.693794
32,0.900513
88,1.187990
...,...
1170,9.704504
1178,9.704504
1171,9.756508
1180,11.409460


The most similar passenger to Elisabeth Allen, other than herself, is passenger 24. Let's extract her row and the three closest passengers from the numeric `DataFrame` to see how similar they really are.

In [56]:
titanic_num.loc[[0, 24, 180, 32]]

Unnamed: 0,survived,age,sibsp,parch,fare,pclass_1,pclass_2,pclass_3,sex_female,sex_male,embarked_C,embarked_Q,embarked_S
0,1,29.0,0,0,211.3375,1,0,0,1,0,0,0,1
24,1,29.0,0,0,221.7792,1,0,0,1,0,0,0,1
180,1,39.0,0,0,211.3375,1,0,0,1,0,0,0,1
32,1,30.0,0,0,164.8667,1,0,0,1,0,0,0,1
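One caveat about the distance calculation above: some passengers have a missing `age`, and `sum(axis=1)` skips missing values by default. The sketch below, on made-up data (not the Titanic file), suggests what that implies: the missing term is silently dropped, so a row with missing values can look closer than it should. Passing `skipna=False` makes the missing value propagate instead.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): the second row has a missing age
X = pd.DataFrame({
    "age": [30.0, np.nan, 31.0],
    "fare": [10.0, 10.0, 12.0],
})

diff_sq = (X - X.loc[0]) ** 2

# Default: NaN terms are skipped, so row 1 appears identical to row 0
d_skip = np.sqrt(diff_sq.sum(axis=1))

# skipna=False: the distance for row 1 becomes NaN instead
d_strict = np.sqrt(diff_sq.sum(axis=1, skipna=False))

print(d_skip.tolist())
print(d_strict.tolist())
```

How to handle such rows (dropping them, imputing a value, or rescaling the distance) is a modeling choice that this sketch does not settle.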


# Exercises

The following exercise uses the Ames housing data set (`../data/AmesHousing.txt`).

**Exercise 1.** Suppose that you really like the first house in the data set, but it is too expensive. Find homes that are similar to it that are cheaper, by calculating distances after encoding categorical variables as dummy variables. Be sure to actually look at the profiles of the homes that your algorithm picked out as most similar. Do they make sense?

Try different distance metrics and different standardization methods. How sensitive are your results to these choices?

_Think:_ If your goal is to find a "good deal" on a similar house, should you include sale price as a variable in your distance metric? 

_Hint:_ There are too many variables in the data set. You will want to pare it down to a manageable number, but be sure to include a mixture of categorical and quantitative variables. Refer to the [data documentation](https://ww2.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt) for information about the variables.

In [None]:
# ENTER YOUR CODE HERE