Introduction
Greetings, friends; in this post, I will explain my solution for a variation of the Stable Matching problem, particularly for the case of couple matching. The idea is that given a set of Men (M) with, let's say, ten dudes and a set of Women (F) with 100 ladies, we will use a distance-based method (Euclidean Distance) to match every woman to every dude so that every dude has the same amount of ladies assigned to each one after a quota has been reached.
This exercise aims to replicate a little of Tinder's or Match's way of assigning you a person you might want to meet. We will make some assumptions. The first one is that there is at least one female for every male. The other one is that there might be more women in the data than men, so this will overflow. The algorithm in this post will assign many ladies to a single dude based on their matching score. Once a male has reached the assignment quota, the best matches will go to another male. The last assumption here is the direction of the assignments; we are assigning many females to a single male. Feel free to revert this if your heart feels this is wrong.
Ok, let's start. The first thing to do here is to create random data about the parties of interest. I have created a pandas data frame for each set with the following attributes: Age, Education, State, Children, and Income. You can add more if you want.
The following code will create 10 random Males and 100 random females. Each variable is encoded. This means that it uses a numeric representation of a particular category.
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from random import randrange
import math
Create the Fake DataSet
# 1: age 18 - 25
# 2: age 26 - 30
# 3: age 31 - 35
# 4: age 36 - 40
# 5: age 41 - 45
# 6: age 46 - 50
# 7: age 50+
# 1: edu none
# 2: edu elementary
# 3: edu highschool
# 4: edu college
# 1: state AL
# 2: state CO
# 3: state CT
# 4: state GA
# 1: income 20-50k
# 2: income 50-100k
# 3: income 100k-150k
# 4: income 150-200k
# 5: income 200k+
males = pd.DataFrame()
for i in range(0,10):
males = males.append({'Name': 'Male_' + str(i+1) , 'Age' : randrange(7)+1, 'Education' : randrange(4)+1, 'State' : randrange(4)+1, 'Children': randrange(3)+1, 'income' : randrange(5)+1}, ignore_index=True)
male_names = males.Name
males = males.drop(['Name'], axis = 1)
females = pd.DataFrame()
for i in range(0,100):
females = females.append({'Name': 'Female_' + str(i+1) , 'Age' : randrange(7)+1, 'Education' : randrange(4)+1, 'State' : randrange(4)+1, 'Children': randrange(3)+1, 'income' : randrange(5)+1}, ignore_index=True)
female_names = females.Name
females = females.drop(['Name'], axis = 1)
The code above will generate the following grid of data. You can observe that each attribute is represented by a number from the encoding defined as comments. The same attributes are used for both Males and Females, as this is the way, later, to compare how similar a couple is. This is the list of the ten males with their features.
The Main Algorithm
Part 1 - Create a Grid with Euclidean Distances
The first part of our journey to find a match is to compare each male to each female feature array and estimate how "similar" they are with a distance-based method. In our case, I chose to use the Euclidean Distance. If you want, select another technique like Cosine, Minkowski, or Chebychev.
# Estimation of Euclidean Distance between Males and Females...
male_matchings = []
for index, m_row in males.iterrows():
male = m_row.to_numpy()
matchings = []
for index, f_row in females.iterrows():
female = f_row.to_numpy()
similarity = np.linalg.norm(male-female) # euclidean distace
matchings.append(similarity)
male_matchings.append(matchings)
The code snippet shows an O(N^2) nested-loop approach to compare each male with each woman with the "np.linalg.norm(male-female)" method from Numpy, which tells the Euclidean Distance between two arrays. The closer they are, the closer to zero the value is. The male_matchings is a list of lists that contains such distances.
Part 2 - Sort Similarity Arrays
Each row of the male_matchings list represents a man, and the sublist attached represents the distance obtained for each woman. Let's sort each sublist so that the best match is on top. The following code obtains each sublist, sorts it, and appends it to the matches_values list of lists. Now the matches_values contain the sorted representation of the male_matchings list of lists.
# re-sorting of matchings & matrix of values and names
matches = [] # sorted collections
matches_values = []
for fem in male_matchings:
x = {}
for i in range(0, len(fem)):
x[female_names[i]] = fem[i]
sorted_x = sorted(x.items(), key=lambda kv: kv[1])
matches.append([k[0] for k in sorted_x])
matches_values.append([k[1] for k in sorted_x])
Part 3 - Assignments based on Quota
We did some pre-work to start making some assignments. We estimated the euclidean distance between each pair and sorted them so that the most similar other was on top of their stack. Now, we must start assigning one to another but with certain conditions. Here is the algorithm:
While there are Females available {
We will assign one lady to one man at a time. This means that "Anna" will be assigned to "Johnny," assuming "Anna" is his first option.
Once a lady has been assigned, her name is added to the taken_females array. This means that even if another dude has a good match with Anna, she is no longer available.
After Johnny gets a lady assigned, we must move down to the second male in the list and assign their first match if she is not assigned. We will continue like this until all males can get given. If the first lady in the list is unavailable, then the male gets nothing until the second iteration.
Quota validation: every male can only be matched with math.ceil((1/len(male_names)) * len(female_names)). This means that if there are 100 women and 10 men, only 10 ladies can be assigned per male. If the quota is reached for that dude, he will be skipped and will not get any more assignments.
}
The following code shows how this is done:
# main algorithm.
assigments = pd.DataFrame()
taken_females = []
min_score = 0
quota = math.ceil((1/len(male_names)) * len(female_names))
quotas = {}
for name in male_names:
quotas[name] = 0
print("Max Quota per Male", quota)
j_max = len(matches[0]) # width of matrix (number of females)
i_max = len(matches) # height of matrix (number of males)
for j in range(0,j_max):
#print("Iteration:", j+1)
current = pd.DataFrame({
'males' : male_names,
'females' : np.array(matches)[:,j].tolist(),
'scores' : np.array(matches_values)[:,j].tolist()
}).sort_values(by='scores', ascending=True)
# MAKE ASSIGMENTS based on best score for each column.
for index, row in current.iterrows():
female_name = row['females']
male_name = row['males']
score = row['scores']
# check quota before assigment...
if quotas[male_name] >= quota:
continue
# if the female is taken, then continue con next male...
if female_name in taken_females:
continue
else:
assigments = assigments.append({'Male': male_name , 'Female' : female_name}, ignore_index=True)
#print(" --- ",male_name, "assigned to", female_name)
taken_females.append(female_name)
min_score += score
quotas[male_name] += 1
# convergence, all females taken
if len(taken_females) == len(female_names):
print("All Females Assigned - Algorithm Converged at", round(min_score,2))
break
The code above shows that each male gets ten ladies each. This is to emulate the app functionality where you get a match, and if you don't like it, then you swipe left and continue with the next one.
The following image shows the assigments.groupby(["Male"]).sum() command to show which females were assigned to each male.
Additional Comments
The quota concept was introduced to allow all males to get assigned. If the quota validation is removed from the code, then you can guarantee the best matches will happen, but you might leave one bro without a lady, and here, we don't do that.