# Recommend Cereals with the Correlation Coefficient

Today's post is about how to recommend things. Concretely, we are going to build a basic item-to-item recommender using the correlation coefficient we explored in another [post](https://www.doczamora.com/covariance-and-correlation-coefficient).

To explore this idea, we will be using a Cereals dataset I downloaded from Kaggle. You can download the dataset from [here](https://www.kaggle.com/datasets/crawford/80-cereals).

The dataset contains 77 different cereals with some interesting nutritional features such as calories, protein, fat, sodium, carbohydrates, sugars, vitamins, etc.

```
import numpy as np
import pandas as pd
import seaborn as sns

import matplotlib.pyplot as plt
```

Now, let's load the data and select only the nutritional attributes we want to use:

```
data = pd.read_csv('data/cereals-products.csv')
data = data[['name','calories','protein','fat','sodium', 'fiber', 'carbo', 'sugars', 'potass', 'vitamins']]
```
![Capture.PNG](https://cdn.hashnode.com/res/hashnode/image/upload/v1650432248266/fB-Fc9mVO.PNG)

As you can see, there are 77 different kinds of cereal, each one described by 10 columns: the name plus 9 nutritional features. The objective is to calculate a *metric* that can tell us how "similar" one cereal is to another. That metric is the Pearson Correlation Coefficient.

NumPy has the **corrcoef** function, which allows us to calculate the correlation coefficient between all pairs of cereals at once! We will use the loaded data to estimate the correlation between cereals. This will generate a correlation matrix with dimensions [77,77].

To calculate the Pearson matrix, we just need the numerical attributes, so we leave out the "name" column for now.

```
pearson_matrix = np.corrcoef(data[data.columns[1:]])
``` 
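To see why passing the rows directly works, here is a minimal sketch on made-up numbers (the `toy` array is hypothetical, not taken from the cereals dataset). By default `np.corrcoef` treats each row as one variable, so three rows produce a 3x3 matrix, while transposing the input would correlate the features instead.

```python
import numpy as np

# Hypothetical toy data: 3 items (rows) described by 4 features (columns).
# The values are invented for illustration only.
toy = np.array([
    [110.0, 2.0, 2.0, 180.0],
    [70.0, 4.0, 1.0, 130.0],
    [120.0, 1.0, 2.0, 220.0],
])

# Default rowvar=True: each row is a variable, so we get item-to-item correlations.
item_matrix = np.corrcoef(toy)
print(item_matrix.shape)      # one row and one column per item

# Transposing correlates the columns instead (feature-to-feature).
feature_matrix = np.corrcoef(toy.T)
print(feature_matrix.shape)   # one row and one column per feature
```

The same reasoning applies to the cereal data: 77 rows give a 77x77 matrix.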

The pearson_matrix is a 77x77 matrix of the correlations between cereals. To visualize this, we will use the seaborn library with the following code.

```
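# A large figure so the 77 tick labels stay readable
plt.figure(figsize=(16, 16))

# fmt only affects annotations (annot=True), so it is omitted here
sns.heatmap(pearson_matrix,
        xticklabels=data["name"],
        yticklabels=data["name"])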
```

![descarga.png](https://cdn.hashnode.com/res/hashnode/image/upload/v1650432701873/_qWdJpTKT.png)

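Let's remember that the correlation measures the strength of the linear relationship between two lists of numbers. It ranges from -1 to 1: a perfect positive relationship gives 1, a perfect negative relationship gives -1, and values near 0 indicate little or no linear relationship.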
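As a quick reminder of what is being computed, here is a small sketch of the Pearson formula written by hand and checked against NumPy. The `pearson` helper and the lists `a`, `b`, and `c` are made up for illustration.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

a = [1, 2, 3, 4, 5]
b = [10, 20, 30, 40, 50]   # grows with a: perfect positive relationship
c = [5, 4, 3, 2, 1]        # shrinks as a grows: perfect negative relationship

r_ab = pearson(a, b)
r_ac = pearson(a, c)
print(r_ab, r_ac)

# The hand-written version should agree with np.corrcoef
print(np.isclose(r_ab, np.corrcoef(a, b)[0, 1]))
```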

The pearson_matrix is all we need now to search for cereals that are similar. The idea is that if we choose *"Special K"*, for example, we look in the pearson_matrix for the cereals that have the highest correlations with it. It's that simple! All we need to do now is create a function *"recommend"* that searches the pearson_matrix for the top-k most similar cereals.

The code looks like this:

```
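def recommend(cereal_name, top_k):
    cereal_names = data["name"]
    # Row position of the requested cereal (raises ValueError if the name is unknown)
    index = list(cereal_names).index(cereal_name)
    # Correlations between the requested cereal and every cereal
    coeff = pearson_matrix[index]
    # Pair each coefficient with its cereal name and sort from most to least similar
    df = pd.DataFrame({'pearson':coeff, 'name' : cereal_names}).sort_values('pearson', ascending=False)
    return df.head(top_k)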
```

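This function returns a DataFrame of cereal names sorted by their respective Pearson coefficient, from highest to lowest. Note that the requested cereal itself comes first, since its correlation with itself is 1.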

Let's search for **"Special K"** to get the top 10 most similar cereals:

```
recommend("Special K", 10)
```

![image.png](https://cdn.hashnode.com/res/hashnode/image/upload/v1650433078533/BPqo3vden.png)

What about "100% Bran"?

```
recommend("100% Bran", 10)
```

![image.png](https://cdn.hashnode.com/res/hashnode/image/upload/v1650433169328/RJk106cih.png)
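Because every cereal correlates perfectly with itself, the query cereal takes the first slot of the results. If that is not wanted, one option is to drop its row before sorting. The sketch below assumes a hypothetical variant called `recommend_others` and uses invented toy values instead of the real dataset so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the cereal data; values are invented for illustration.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "calories": [110, 70, 120, 100],
    "protein": [2, 4, 1, 3],
    "sugars": [12, 5, 14, 3],
    "fiber": [1, 10, 0, 3],
})
toy_matrix = np.corrcoef(toy[toy.columns[1:]])

def recommend_others(cereal_name, top_k, names, matrix):
    """Like recommend, but excludes the query cereal from the results."""
    index = list(names).index(cereal_name)
    df = pd.DataFrame({"pearson": matrix[index], "name": list(names)})
    df = df.drop(index)  # remove the query cereal's own row
    return df.sort_values("pearson", ascending=False).head(top_k)

result = recommend_others("A", 2, toy["name"], toy_matrix)
print(result)
```

Passing the names and the matrix as arguments, rather than reading globals, is just a choice made here to keep the sketch self-contained.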

**Final Thoughts:**

- The correlation coefficient is used as a similarity measure to find other products that are mathematically similar, based on the data attributes.
- If this recommender is used in a commercial setting, it should be used as a tool to find replacement cereals or similar cereals in terms of nutritional values, not flavor.
- The Pearson matrix calculated here should be persisted somewhere so that it is not recomputed on every call. If new products are added, the matrix should be generated again.
- The technique shown here also works with other types of data, such as categorical data or text. In the case of categorical data, variables should be one-hot encoded. If used with text, the text should be converted into tokens and then into numeric vectors (for example, token counts).
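As one possible way to persist the matrix, here is a minimal sketch using `np.save` and `np.load`. The file name and the temporary directory are assumptions made for the example, and the small matrix stands in for the real 77x77 one.

```python
import os
import tempfile
import numpy as np

# Stand-in for the real pearson_matrix; the input values are invented.
matrix = np.corrcoef(np.array([
    [1.0, 2.0, 4.0],
    [2.0, 1.0, 5.0],
    [3.0, 3.0, 1.0],
]))

# Hypothetical location; in practice this could be any durable storage.
path = os.path.join(tempfile.gettempdir(), "pearson_matrix.npy")

np.save(path, matrix)      # write once, after computing the matrix
restored = np.load(path)   # read back on later calls instead of recomputing

print(np.array_equal(matrix, restored))
```

The cereal names would need to be stored in the same order alongside the matrix, since row positions are what link a name to its correlations.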
