That sounds good, right? Well, let's test it out by building embeddings from the descriptions of some superheroes I have in a CSV file.

1 - To build this, you will need an OpenAI key. You can get one for free at https://openai.com/api/. Just sign in, click on Personal and then View API Keys, and create a new one. Make sure to copy-paste it somewhere secure.

2 - Download the csv file from My Github Datasets Repo

3 - Open a jupyter notebook and install the following libraries

- openai
- pandas
- numpy

Alright, enough chat, let's get into the business here. The first step is to load the required libraries.

```python
import pandas as pd
import numpy as np
import openai
from openai.embeddings_utils import get_embedding, cosine_similarity

openai.api_key = 'put-your-api-key-here'
```

In this snippet, we are loading all the utils we need as well as setting up the API Key.

We load the supers.csv into a pandas dataframe. We will search over the "history_text" and "powers_text" descriptions. To make this easier, we will merge both columns into a new one called "combined". The following code achieves exactly this. We will be using only the first 50 superheroes for our testing.

```python
df = pd.read_csv("data/supers.csv")
df = df.fillna('')

# let's use only the first 50 superheroes (because of the rate limit)
df = df.head(50)
df["combined"] = df[["history_text", "powers_text"]].apply(" ".join, axis=1)
```

The "combined" column plays an important role. We need to convert this text into an embedding: a vector representation of the tokens in the text according to the "text-embedding-ada-002" model. We will create a new pandas column named "embedding" to store this representation. The code below loops over the data frame and creates an embedding for each combined text row. Each embedding is then converted to a numpy array and added to the dataframe. Note that the embeddings are created by making a remote call to OpenAI through the get_embedding(...) function, which has a rate limit of 60 calls/second unless you upgrade your subscription.

```python
embeddings = []  # the list must be initialized before the loop
for item in df["combined"]:
    embedding = get_embedding(item, engine="text-embedding-ada-002")
    embeddings.append(np.array(embedding))
df["embedding"] = embeddings
```

Now we have a new column called "embedding" that holds the model representation for each row. The only thing left to do is to build a search function that can calculate the cosine similarity between the search query and the existing embeddings. The most similar results should be returned first.
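For intuition, cosine similarity is just the normalized dot product of two vectors. The openai.embeddings_utils helper does this for us, but a minimal NumPy sketch of the same idea looks like this (the toy 2-D vectors stand in for the real 1536-dimensional embeddings):

```python
import numpy as np

def cosine_sim(a, b):
    # cosine of the angle between the two vectors:
    # 1.0 means identical direction, 0.0 means unrelated (orthogonal)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_sim([1.0, 0.0], [1.0, 0.0]))  # identical direction -> 1.0
```

Because the measure only depends on direction, two descriptions of similar topics score high even if their embedding vectors differ in magnitude.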

This is where the magic happens. The search function converts a query such as "*helps batman*" to a vector embedding and then compares it to the rest of the dataset. The results are sorted by the similarity column calculated from the cosine function. Take a look.

```python
def search(df, description, n=3):
    text_embedding = get_embedding(description, engine="text-embedding-ada-002")
    df["similarity"] = df.embedding.apply(lambda x: cosine_similarity(x, text_embedding))
    results = df.sort_values("similarity", ascending=False).head(n)
    return results
```

Here comes the fun and final part. We now just look for information and let the search(...) function do its magic. The n=3 means show the top 3 results.

`results = search(df, "helps batman", n=3)`

Now we loop over the results with this code to see what came back.

```python
for index, row in results.iterrows():
    print("----------------------")
    print(row["real_name"], '[Similarity Score:', round(row["similarity"]*100), "%]")
    print(" ")
    print("History:", row["history_text"][:200])
    print(" ")
    print("Powers:", row["powers_text"][:200])
```

These are the results from the search:

```
----------------------
Alfred Pennyworth [Similarity Score: 84 %]
History: Alfred CranePennyworth is the butler, mentor, surrogate father, and close friend of Bruce Wayne. Alfred has served the Wayne family since before Bruce was born. After Bruce was left orphaned from the
Powers: Alfred is exceptionally intelligent, which extends to his considerable investigative, analytical, communications, computer operating, engineering, and medical skills, as well as those in ordinary hous
----------------------
Aaron Cash [Similarity Score: 82 %]
History: Aaron Cash is the head of security at Arkham Asylum. He has a hook for a hand after his real hand was eaten by Killer Croc.
Powers:
----------------------
Bruce Wayne [Similarity Score: 82 %]
History: He was one of the many prisoners of Indian Hill to be transferred to another facility upstate on the orders of The Court. However, Fish Mooney hijacks the bus and drives it into Gotham City, where the
Powers:
```

As you can see, with a Similarity Score of 84%, **Alfred Pennyworth** was the first result. For those who don't know, Alfred is Batman's butler.

Cheers!

Greetings, friends; in this post, I will explain my solution for a variation of the Stable Matching problem, particularly for the case of couple matching. The idea is that, given a set of Men (M) with, let's say, ten dudes and a set of Women (F) with 100 ladies, we will use a distance-based method (Euclidean distance) to match the women to the dudes so that every dude ends up with the same number of ladies assigned once a quota has been reached.

This exercise aims to replicate a little of Tinder's or Match's way of assigning you a person you might want to meet. We will make some assumptions. The first one is that there is at least one female for every male. The second is that there may be more women in the data than men, so the assignments will overflow: the algorithm in this post will assign many ladies to a single dude based on their matching score, and once a male has reached the assignment quota, the best matches will go to another male. The last assumption is the direction of the assignments; we are assigning many females to a single male. Feel free to reverse this if your heart feels this is wrong.

Ok, let's start. The first thing to do here is to create random data about the parties of interest. I have created a pandas data frame for each set with the following attributes: Age, Education, State, Children, and Income. You can add more if you want.

The following code will create 10 random Males and 100 random females. Each variable is encoded. This means that it uses a numeric representation of a particular category.

```python
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from random import randrange
import math
```

```python
# Age:       1: 18-25, 2: 26-30, 3: 31-35, 4: 36-40, 5: 41-45, 6: 46-50, 7: 50+
# Education: 1: none, 2: elementary, 3: highschool, 4: college
# State:     1: AL, 2: CO, 3: CT, 4: GA
# Income:    1: 20-50k, 2: 50-100k, 3: 100-150k, 4: 150-200k, 5: 200k+

# build rows as dicts first (DataFrame.append was removed in pandas 2.0)
male_rows = []
for i in range(0, 10):
    male_rows.append({'Name': 'Male_' + str(i+1),
                      'Age': randrange(7) + 1,
                      'Education': randrange(4) + 1,
                      'State': randrange(4) + 1,
                      'Children': randrange(3) + 1,
                      'income': randrange(5) + 1})
males = pd.DataFrame(male_rows)
male_names = males.Name
males = males.drop(['Name'], axis=1)

female_rows = []
for i in range(0, 100):
    female_rows.append({'Name': 'Female_' + str(i+1),
                        'Age': randrange(7) + 1,
                        'Education': randrange(4) + 1,
                        'State': randrange(4) + 1,
                        'Children': randrange(3) + 1,
                        'income': randrange(5) + 1})
females = pd.DataFrame(female_rows)
female_names = females.Name
females = females.drop(['Name'], axis=1)
```

The code above will generate the following grid of data. You can observe that each attribute is represented by a number from the encoding defined as comments. The same attributes are used for both Males and Females, as this is the way, later, to compare how similar a couple is. This is the list of the ten males with their features.

The first part of our journey to find a match is to compare each male's feature array to each female's and estimate how "similar" they are with a distance-based method. In our case, I chose the Euclidean distance. If you want, select another technique like cosine, Minkowski, or Chebyshev.

```python
# Estimation of the Euclidean distance between each male and each female
male_matchings = []
for m_index, m_row in males.iterrows():
    male = m_row.to_numpy()
    matchings = []
    for f_index, f_row in females.iterrows():
        female = f_row.to_numpy()
        similarity = np.linalg.norm(male - female)  # euclidean distance
        matchings.append(similarity)
    male_matchings.append(matchings)
```

The code snippet shows an O(N^2) nested-loop approach to compare each male with each female using the np.linalg.norm(male - female) call from NumPy, which computes the Euclidean distance between two arrays. The more similar they are, the closer the value is to zero. The **male_matchings** variable is a list of lists that contains these distances.

Each row of the male_matchings list represents a man, and the attached sublist holds the distance obtained for each woman. Let's sort each sublist so that the best match is on top. The following code obtains each sublist, sorts it, and appends it to the **matches_values** list of lists, with the corresponding names in **matches**.

```python
# re-sorting of matchings & matrices of values and names
matches = []         # sorted names
matches_values = []  # sorted distances
for fem in male_matchings:
    x = {}
    for i in range(0, len(fem)):
        x[female_names[i]] = fem[i]
    sorted_x = sorted(x.items(), key=lambda kv: kv[1])
    matches.append([k[0] for k in sorted_x])
    matches_values.append([k[1] for k in sorted_x])
```

We did some pre-work to start making some assignments. We estimated the euclidean distance between each pair and sorted them so that the most similar other was on top of their stack. Now, we must start assigning one to another but with certain conditions. Here is the algorithm:

*While there are Females available {*

We will assign one lady to one man at a time. This means that "Anna" will be assigned to "Johnny," assuming "Anna" is his first option.

Once a lady has been assigned, her name is added to the taken_females array. This means that even if another dude has a good match with Anna, she is no longer available.

After Johnny gets a lady assigned, we move down to the second male in the list and assign his first match if she is not taken. We continue like this until every male has had a chance at an assignment. If the first lady in a male's list is unavailable, then that male gets nothing until the next iteration.

Quota validation: every male can be matched with at most math.ceil((1/len(male_names)) * len(female_names)) females. This means that if there are 100 women and 10 men, only 10 ladies can be assigned per male. If the quota is reached for a dude, he will be skipped and will not get any more assignments.

*}*
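The quota in the loop above comes from a simple ceiling formula; a quick sanity check of the arithmetic with the sizes used in this post:

```python
import math

num_males = 10
num_females = 100

# each male can receive at most ceil((1/10) * 100) = 10 assignments
quota = math.ceil((1 / num_males) * num_females)
print(quota)  # -> 10

# when the sizes don't divide evenly, the quota rounds up:
print(math.ceil((1 / 3) * 10))  # 3 males, 10 females -> 4
```

Rounding up guarantees there are always enough quota slots for every female, at the cost of letting some males end up one assignment short.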

The following code shows how this is done:

```python
# main algorithm
assigment_rows = []
taken_females = []
min_score = 0
quota = math.ceil((1/len(male_names)) * len(female_names))
quotas = {}
for name in male_names:
    quotas[name] = 0
print("Max Quota per Male", quota)

j_max = len(matches[0])  # width of matrix (number of females)
i_max = len(matches)     # height of matrix (number of males)

for j in range(0, j_max):
    # print("Iteration:", j+1)
    current = pd.DataFrame({
        'males': male_names,
        'females': np.array(matches)[:, j].tolist(),
        'scores': np.array(matches_values)[:, j].tolist()
    }).sort_values(by='scores', ascending=True)

    # make assignments based on the best score for each column
    for index, row in current.iterrows():
        female_name = row['females']
        male_name = row['males']
        score = row['scores']

        # check the quota before assigning...
        if quotas[male_name] >= quota:
            continue

        # if the female is taken, continue with the next male...
        if female_name in taken_females:
            continue
        else:
            assigment_rows.append({'Male': male_name, 'Female': female_name})
            # print(" --- ", male_name, "assigned to", female_name)
            taken_females.append(female_name)
            min_score += score
            quotas[male_name] += 1

    # convergence: all females taken
    if len(taken_females) == len(female_names):
        print("All Females Assigned - Algorithm Converged at", round(min_score, 2))
        break

assigments = pd.DataFrame(assigment_rows)
```

The code above gives each male ten ladies. This emulates the app functionality where you get a match, and if you don't like it, you swipe left and continue with the next one.

The `assigments.groupby(["Male"]).sum()` command shows which females were assigned to each male.

The quota concept was introduced to allow all males to get assigned. If the quota validation is removed from the code, then you can guarantee the best matches will happen, but you might leave one bro without a lady, and here, we don't do that.

A content-based recommender is a type of system that makes recommendations to users based on their preferences and considers product attributes. This sounds very much like the Amazon "if you liked that, maybe you might also like this..." type of recommendation.

The idea is to recommend something the user has *never rated before*. When I say "rated," I mean the user probably hasn't visited, rated, or clicked it before.

In a real setting, you can replace ratings with visits, clicks, or watches. You might also find some other metric that describes the interest of the user in a certain product.

The example below is a hello-world type of example where we are going to recommend movies to users. We will only have 4 users and 6 movies: 'Star Wars', 'The Dark Knight', 'Shrek', 'The Incredibles', 'Bleu', and 'Memento'. Each movie is described by its genre; however, we will use one-hot encoding for the genres: 'Action', 'Sci-Fi', 'Comedy', 'Cartoon', 'Drama'.

Let's start coding. The first thing to do is to represent User ratings and Movie features as tensors (matrices).

Let's start by loading Tensorflow and Numpy.

```python
import numpy as np
import tensorflow as tf
```

We have the following data from 4 users that rated some movies. We also have the metadata from 6 movies described by the genre. Let's keep some arrays with the user names and movie names for later use.

```python
users = ['Juan', 'Daniel', 'Ana', 'Christian']
movies = ['Star Wars', 'The Dark Knight', 'Shrek', 'The Incredibles', 'Bleu', 'Memento']
features = ['Action', 'Sci-Fi', 'Comedy', 'Cartoon', 'Drama']

num_users = len(users)
num_movies = len(movies)
num_feats = len(features)
num_recommendations = 2
```

The following table shows the preferences shown by users for some of the movies. There are blank spaces as not all the users have watched all the movies.

The **users_movies** tensor contains the information from the table. Let's load this information manually.

```python
users_movies = tf.constant([
    [4, 6, 8, 0, 0, 0],
    [0, 0, 10, 0, 8, 3],
    [0, 6, 0, 0, 3, 7],
    [10, 9, 0, 5, 0, 2]], dtype=tf.float32)
```

Let's do the same thing with the **movies_feats**. The following tensor will contain the genre metadata for each movie as seen in the table above.

```python
movies_feats = tf.constant([
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 0, 0, 1],
    [1, 0, 0, 0, 1]], dtype=tf.float32)
```

Now that we have those tensors with user and movie data, we need to create the user embeddings. The user embeddings capture the relationship between each user and the genre features, obtained by multiplying the ratings matrix by the movie-features matrix. We can easily achieve this with tf.matmul (yes, for matrix multiplication) in TensorFlow.

```python
users_feats = tf.matmul(users_movies, movies_feats)
users_feats
```

This will print the following tensor:
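The same product can be double-checked with plain NumPy; the matrices below are exactly the ones defined above, so the resulting values follow directly:

```python
import numpy as np

users_movies = np.array([
    [4, 6, 8, 0, 0, 0],
    [0, 0, 10, 0, 8, 3],
    [0, 6, 0, 0, 3, 7],
    [10, 9, 0, 5, 0, 2]], dtype=float)

movies_feats = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 0, 0, 1],
    [1, 0, 0, 0, 1]], dtype=float)

# rows: Juan, Daniel, Ana, Christian
# columns: Action, Sci-Fi, Comedy, Cartoon, Drama
users_feats = users_movies @ movies_feats
print(users_feats)
# [[10. 10.  8.  8.  4.]
#  [ 3.  0. 10. 10. 11.]
#  [13.  6.  0.  0. 10.]
#  [26. 19.  5.  5. 12.]]
```

For example, Juan's Action weight of 10 is the sum of his ratings for the Action-tagged movies he rated: Star Wars (4) plus The Dark Knight (6).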

As we can see, this is basically telling us how important each feature is for each user. We must normalize each row so that its values add up to 1. The following code normalizes each row:

```python
users_feats = users_feats / tf.reduce_sum(users_feats, axis=1, keepdims=True)
users_feats
```

The **users_feats** tensor now has normalized values for each user (row) and each feature (column).

The **users_feats** tensor tells us how important each feature is for each user. Let's print it so we can understand the preferences of each user through the calculated weights.

The **top_users_features** tensor holds the sorted indices of the most important attributes for each row. We are basically sorting each row and returning the index of each attribute. We will later loop over it and print each feature's name.

We find the top-k values of each row with the tf.nn.top_k function from TensorFlow.

```python
top_users_features = tf.nn.top_k(users_feats, num_feats)[1]
top_users_features
```

The tensor above shows the indices of the features that are more relevant for each user. The following code prints, for every user, the name of what seems to be more relevant to them:

```python
for i in range(num_users):
    feature_names = [features[int(index)] for index in top_users_features[i]]
    print('{}: {}'.format(users[i], feature_names))
```

- Juan: ['Action', 'Sci-Fi', 'Comedy', 'Cartoon', 'Drama']
- Daniel: ['Drama', 'Comedy', 'Cartoon', 'Action', 'Sci-Fi']
- Ana: ['Action', 'Drama', 'Sci-Fi', 'Comedy', 'Cartoon']
- Christian: ['Action', 'Sci-Fi', 'Drama', 'Comedy', 'Cartoon']

So Juan and Christian prefer Action and Sci-Fi, and Daniel (and Ana) prefer Drama and Comedy. Note that they all have different preferences, in the order defined by the weights in the users_feats tensor.

Now, we need to calculate the similarity between the user ratings and the movie features (already calculated in the users_feats tensor). The idea is to create the projected ratings for each movie and store them in a new tensor called **users_ratings**.

```python
users_ratings = [tf.map_fn(lambda x: tf.tensordot(users_feats[i], x, axes=1),
                           tf.cast(movies_feats, tf.float32))
                 for i in range(num_users)]
```

In this code, we are using the dot product as a similarity measure between the tensors to calculate the projected ratings. This tensor will now serve as the prediction for the entire dataset: the higher the dot product, the more we will recommend the movie. The only problem is that we don't want to recommend movies the user has already seen, so we will remove those by applying a mask to the users_ratings tensor.

The first thing we will do is to create the tensor **users_unseen_movies** which will have a True if the movie has not been rated by the user and False otherwise.

```python
users_unseen_movies = tf.equal(users_movies, tf.zeros_like(users_movies))
users_unseen_movies
```

The **ignore_matrix** is just a tensor the same size as the users_movies filled with zeros. This will serve as a base tensor to fill with only the probabilities of the unseen movies.

```python
ignore_matrix = tf.zeros_like(tf.cast(users_movies, tf.float32))
ignore_matrix
```

Now, the magic is created with the **tf.where** function from tensorflow, which is able to apply the mask to the users_ratings so that only the unseen movies are kept in the tensor.

```python
users_ratings_new = tf.where(users_unseen_movies, users_ratings, ignore_matrix)
users_ratings_new
```

As you can observe, only the projected ratings of the unseen movies remain. These will be our movie recommendations!

We can now compute the top 2 movies for every user with the tf.nn.top_k function. The following code returns the top-k indices for each row of the users_ratings_new matrix. Remember that the rows of users_ratings_new represent the users and the columns the movies:

```python
top_movies = tf.nn.top_k(users_ratings_new, num_recommendations)[1]
top_movies
```

As we did previously, we can print each user's name and the top-2 recommended movies that they have never seen before!

```python
for i in range(num_users):
    movie_names = [movies[int(index)] for index in top_movies[i]]
    print('{}: {}'.format(users[i], movie_names))
```

This post has been a reproduction of the code provided by the Google Cloud Platform Data Analyst Recommender Systems program.

When we research something, we look to gain knowledge and answers about things (if any). The idea is to follow a process (the scientific method) that will lead us to an outcome, but that process might be affected by the desired results and the nature of the problem itself.

While there is sufficient documentation, books, and articles about research methods for some well-depicted problems, these often point to research in medicine, psychology, economics, and natural and political sciences (among others). During my doctoral studies, I found that the books I had to review as part of the program had a strong focus on psychology. I found it challenging, as I was trying to apply research to information systems, particularly machine learning, which is a quantitative discipline. Although the books in the program covered practically everything in terms of methodologies, they still left me spinning a bit on what to use if I wanted to research how to build a self-driving car or train a convolutional neural network to identify Covid in X-rays.

This post reviews how exploratory research has been used in machine learning to understand how problems are framed and solved. At the end of the post, I will also briefly introduce the method I used for my dissertation on sign language recognition.

Exploratory research looks to investigate questions that have not been previously studied in depth or that differ from other problems in the literature. For example, a research title such as "Generative adversarial nets" (Goodfellow, 2014) sounds like a candidate to be assessed as exploratory. This ground-breaking research reveals how to train generative models using an adversarial process, training two neural networks simultaneously. If we look at the 2014 paper, it goes straight to the action, with virtually no clarification on methodology, research techniques, or anything in this space. This type of research does not mean there are no research questions and methods; rather, the paper was written to show only the novelty, not the failed experiments or the original theories behind adversarial GANs beyond what was included in the literature review.

Tom Dietterich (2019) from Oregon State University recommends the following process to drive exploratory research successfully.

Exploratory Research -> Initial Solutions -> Refinement & Evaluation -> Competing Solutions & Comparative Evaluation -> Mapping the Solution Space -> Engineering & Technology Transfer.

This process starts with an "initial solution" to a problem that can be refined, reduced, and evaluated, and then compared to other solutions. This leads the research to determine whether the outcome is important enough for publication, needs additional refinement, or should be shelved.

The idea here is to provide the kick-start solution to the problem. It does not have to be pretty, but it will serve as the base model for improvement. This stage is essential as this helps to define a precedent for a base solution to a complex problem. A paper might be created at this stage even if the solution does not generalize well. For example, Bayesian networks (Pearl 1985) described simple message passing for tree-structured networks.

Nothing stimulates good research like a bad paper about an interesting problem - Dietterich

This is the process where the initial solution is evaluated for improvement. The refinement process can affect the initial solution by proposing new metrics, algorithms, hyperparameters, optimization techniques, or more data. The idea is to see if this could be done better, even if the new solution takes a 180-degree turn.

In the previous step, the model or algorithm was compared against itself. In this phase, we look to understand how it compares against other methods or techniques available for similar purposes.

It is not always easy to perform the comparative evaluation when the research proposes something completely new, as in the case of adversarial networks. But the analysis can also be put in perspective against other research efforts that look to solve the same problem, just not in the same way.

After Goodfellow's paper on adversarial networks, this has become easier: many researchers are pushing for progress, and improvements against each other's methods can be used as benchmarks.

This stage looks to identify the design space for a particular problem. What are the bounds and critical design decisions, and how does the algorithm compare to others? For example, among foundational machine learning algorithms, the concept of learning and the cost of training and prediction change drastically between KNN and logistic regression. The prediction time becomes a problem for KNN as the dataset size increases. But why? How does this affect the usage of the algorithm for particular problems? How can the issue be fixed? These are questions to revisit in this stage, as time and space complexity are under evaluation.
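To make the KNN point concrete: KNN defers essentially all of its work to prediction time, computing one distance per stored training example, while a fitted logistic regression only evaluates a fixed-size dot product. A rough sketch with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(42)
X_train = rng.normal(size=(1000, 5))  # 1000 stored examples, 5 features
x_new = rng.normal(size=5)

# "training" KNN is just storing X_train; prediction does the real work:
# one Euclidean distance per training example -> O(N) per query
distances = np.linalg.norm(X_train - x_new, axis=1)
nearest = int(np.argmin(distances))
print(len(distances))  # 1000 distance computations for a single prediction

# a fitted logistic regression, by contrast, evaluates one dot product
w, b = rng.normal(size=5), 0.0        # weights here are random placeholders
score = 1 / (1 + np.exp(-(x_new @ w + b)))  # O(d), independent of N
```

Doubling the training set doubles KNN's per-query cost, while the logistic regression's prediction cost stays constant; this is exactly the kind of design-space trade-off this stage is meant to map out.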

At least in machine learning research, applied research is recommended, and a proof-by-construction will help demonstrate that the investigation is sound. Many research papers test over well-known datasets such as ImageNet, Fake News Detection Dataset, Boston Housing, Atari RL, or MNIST. Study replicability is essential, as we often read articles to solve other problems and need the code or parts of the approach used to solve the research problem. Writing a paper with publicly accessible data and a code repository is one of the best things we can do for the scientific community that uses ML in their research efforts.

Design science research focuses on the development and performance of (designed) artifacts with the explicit intention of improving the functional performance of the artifact. Design science research is typically applied to categories of artifacts, including algorithms, human/computer interfaces, design methodologies (including process models), and languages. Its application is most notable in the Engineering and Computer Science disciplines, though it is not restricted to these and can be found in many disciplines and fields (Vaishnavi et al., 2019).

Design science is a research methodology that focuses on an artifact and looks for improvement iteratively. So, a machine learning model can be seen as an artifact, subject to the seven guidelines for the research:

- **Design as an artifact**: Design-science research must produce a viable artifact in the form of a construct, a model, a method, or an instantiation.
- **Problem relevance**: The objective of design-science research is to develop technology-based solutions to important and relevant business problems.
- **Design evaluation**: The utility, quality, and efficacy of a design artifact must be rigorously demonstrated via well-executed evaluation methods.
- **Research contributions**: Effective design-science research must provide clear and verifiable contributions in the areas of the design artifact, design foundations, and/or design methodologies.
- **Research rigor**: Design-science research relies upon the application of rigorous methods in both the construction and evaluation of the design artifact.
- **Design as a search process**: The search for an effective artifact requires utilizing available means to reach desired ends while satisfying laws in the problem environment.
- **Communication of research**: Design-science research must be presented effectively both to technology-oriented as well as management-oriented audiences.

Dietterich's exploratory research process can be merged with the Design Science methodology (they are highly compatible but not the same). Design science puts mathematical rigor and artifact validation as a key to improving the solution (research results), methods, constructs, and other design theories to construct new knowledge. The idea is that the exploratory analysis is performed in iterations. We can evaluate the effect of a particular change in the model or methods used on each cycle.

I used design science for my dissertation "Video-Based Costa Rican Sign Language Recognition for Emergency Services" which proposes the process required to transform a video into text when a person is communicating in sign language (LESCO).

The entire project was envisioned as a collection of artifacts: from the raw video data, the algorithms to transform the data, to the machine learning models used to classify each sign into its textual meaning (label).

Everything is measurable, therefore it can be improved.

The valuable thing about design science is that we can plan the experiment and collect information on each iteration. After every cycle, we observe what changed and rerun the experiment to see if there are any improvements (pretty much trial and error, which in essence is at the core of the experimental research paradigm). This gives us a path for improvement where we can tell the story of how we reached the eureka moment.

Dietterich, T. (2019). Research Methods in Machine Learning. 46.

Hevner, A. R.; March, S. T.; Park, J. & Ram, S. Design Science in Information Systems Research. MIS Quarterly, 2004, 28, 75-106. URL: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.103.1725&rep=rep1&type=pdf

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2 (NIPS'14). MIT Press, Cambridge, MA, USA, 2672-2680.

Pearl, J. (1985) A model of self-activated memory for evidential reasoning, in Proceedings of the 7th Conference of the Cognitive Science Society, University of California, Irvine, CA, pp. 329-334.

Vaishnavi, V., Kuechler, W., and Petter, S. (2004/19). "Design Science Research in Information Systems" January 20, 2004; last updated June 30, 2019. URL: http://desrist.org/design-research-in-information-systems

One of the challenges we face while designing machine learning solutions is picking the right prediction features. Deciding which elements are essential and which ones genuinely contribute to model performance becomes complicated as the number of features increases.

The infinite monkey theorem states that a monkey hitting keys at random on a typewriter keyboard for an infinite amount of time will almost surely type any given text, such as the complete works of William Shakespeare.

As absurd as the theorem sounds (and it has been proven, by the way), it has an important implication in machine learning. When there are too many features, there is a chance that the model can predict the response variable by pure luck, just like the monkey typing random text, because some variables turn out to be near-perfect predictors.

In statistics, a spurious correlation (or spuriousness) refers to a connection between two variables that appears to be causal but is not. With spurious correlation, any observed dependencies between variables are merely due to chance or related to some unseen confounder (Kenton, 2021).

Spuriousness then implies the existence of variables in machine learning models that might be near-perfect predictors, but by pure luck. To better understand the predictive capabilities of spurious relationships, let's look at the following example from Tyler Vigen:

Let's seriously consider the relationship from the chart above. If it were true, the US government should reduce spending on science and technology to reduce the suicide rate. But that relationship is not causal; it is just a coincidence. The same phenomenon can happen within your machine learning model: there might be variables that perfectly predict the response (y), but by luck. This situation can also go unnoticed as the dimensional space increases.
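To see how easily this appears, here is a small simulation (the sizes are arbitrary): we generate purely random features and a purely random target, and at least one feature still correlates noticeably with the target by chance alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 30, 1000

X = rng.normal(size=(n_samples, n_features))  # pure noise "features"
y = rng.normal(size=n_samples)                # pure noise "target"

# correlation of every feature with the target
corrs = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_features)])
best = np.abs(corrs).max()
print(round(best, 2))  # typically well above 0.5: a "great predictor" made of nothing
```

With few samples and many features, the best chance correlation grows, which is exactly why spurious predictors slip in unnoticed as the dimensional space increases.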

Covariance is a measure of the joint variability of two random variables. That is, changes in the X variable are aligned with changes in the Y variable. The issue with covariance is that the relationship is judged only on linear association (as in the Pearson correlation coefficient); two random variables might be perfectly related through a non-linear function and still show no covariance. Something similar can happen in neural networks, when an image classifier learns to identify a "moose" from images because it realizes that the moose images are always in snowy conditions. The classifier learns about the snow and forgets the moose. And because the dataset images with snow are the only ones with a moose, the classifier predicts the moose perfectly every time, making us think it is doing a great job.
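A tiny example of covariance missing a perfect non-linear relationship: below, y is fully determined by x, yet their covariance is exactly zero because the dependency is symmetric rather than linear.

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2  # perfect (but non-linear) dependency

cov_xy = np.cov(x, y)[0, 1]
print(cov_xy)  # 0.0 -- covariance sees no relationship at all
```

Knowing x tells you y exactly here, but any linear-only diagnostic (covariance, Pearson correlation) would report that the variables are unrelated.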

If you want to know how to influence neural networks to perform the wrong predictions, please read:

- **Unsupervised Adversarial Defense through Tandem Deep Image Priors** (https://www.researchgate.net/publication/347798137_Unsupervised_Adversarial_Defense_through_Tandem_Deep_Image_Priors/figures?lo=1)
- **A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks** (https://arxiv.org/pdf/1610.02136.pdf)
- **Adversarial Examples in the Physical World** (https://www.taylorfrancis.com/chapters/edit/10.1201/9781351251389-8/adversarial-examples-physical-world-alexey-kurakin-ian-goodfellow-samy-bengio)

- First, today's machine learning solutions are based on correlations, not causations. Many solutions work because of the correlations learned, even if they are full of nonsense. Keep this in mind.
- Looking for the best features is more art than science. When the model has way too many features, they all seem to contribute to the model's accuracy in the same way. Check for the curse of dimensionality.
- Focus on data first. Make sure you collect good data (moose pictures in many shapes, angles, and seasons) and use cross-validation to validate the model performance. Good data is the critical ingredient for a good model.
- Split the data into train/valid/test sets. The train set is used to make the model learn. The valid set is used to validate the model accuracy with cross-validation. The test set is a randomly picked sample set aside before the train and valid sets were created. This set is important as it represents real-world data the model has never seen before.
- Using the covariance matrix only helps spot linear relationships. Don't take this for granted.
- Test manually. Collect images or new samples and use them to validate the model. Spurious relationships usually overfit the model. This means that if the classifier is making lucky predictions, it should not work with other external images.
- Always start with a simple architecture and data model. The simpler it is, the easier it is to spot issues in the classifier. Neural networks that are too tall and wide are difficult to debug.
- Don't hack the data to make the classifier work. A penguin detector based on penguin images from cartoon books might be convenient because of the reduced amount of colors. Still, it will certainly not generalize to authentic penguin images from the wild.
- Test several algorithms. There is a chance that a machine learning algorithm might overfit the data, giving us the sensation that the classifier can learn beyond the training set. For example, don't just use decision trees; test random forests and compare the results.
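The train/valid/test split suggested above can be sketched with scikit-learn's `train_test_split` (hypothetical toy arrays; the key idea is to hold out the test set first, then split the remainder):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data: 50 samples, 2 features, binary labels
X = np.arange(100).reshape(50, 2)
y = np.arange(50) % 2

# hold out a test set first, then split the remainder into train/valid
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_valid), len(X_test))  # 30 10 10
```

Because the test rows are removed before any model or validation decision is made, they stay a fair proxy for unseen real-world data.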

We will start our exploration by building a binary classifier for Cat and Dog pictures. The images were downloaded from the Kaggle Dogs vs Cats Redux Edition competition. There are 25,000 images of dogs and cats we will use to train our convolutional neural network.

If you are wondering how to get PyTorch installed, I used miniconda with the following commands to get the environment started.

```shell
# install conda environment with pytorch support
conda create -n torch python=3.7
conda activate torch
conda install pytorch torchvision torchaudio cudatoolkit=11.0 -c pytorch
```

```python
import numpy as np
import pandas as pd
import os
import random
import time

import torch
import torchvision
import torch.nn as nn
from torchvision import datasets, transforms
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F

from sklearn.model_selection import train_test_split
from PIL import Image
import matplotlib.pyplot as plt
```

Once you have downloaded the data, put the training images in the data/train/ folder on your local computer. The following code will parse the train folder, collect the path of each image, and save it into the img_files list:

```python
img_files = os.listdir('data/train/')
img_files = list(filter(lambda x: x != 'train', img_files))

def train_path(p):
    return f"data/train/{p}"

img_files = list(map(train_path, img_files))

print("total training images", len(img_files))
print("First item", img_files[0])
```

output:

- total training images 25000
- First item data/train/cat.0.jpg

Before we start transforming our data, let's split the dataset into train-test sets with the following code.

```python
# create train-test split
random.shuffle(img_files)
train = img_files[:20000]
test = img_files[20000:]

print("train size", len(train))
print("test size", len(test))
```

output:

- train size 20000
- test size 5000

Now, we have to use each path to load the image, convert it to RGB, and also label the image based on the name. If the word "cat" is in the path, the label will be 0, otherwise 1 for a dog. Additional transformation is needed for the image, so a transform object will be created to resize the image to 224x224 and normalize its values. The following class extends Dataset (from torch.utils.data) and applies the transformation.

```python
# image normalization
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

# preprocessing of images
class CatDogDataset(Dataset):
    def __init__(self, image_paths, transform):
        super().__init__()
        self.paths = image_paths
        self.len = len(self.paths)
        self.transform = transform

    def __len__(self):
        return self.len

    def __getitem__(self, index):
        path = self.paths[index]
        image = Image.open(path).convert('RGB')
        image = self.transform(image)
        label = 0 if 'cat' in path else 1
        return (image, label)
```

Ok, let's use the CatDogDataset class to transform the train and test sets and convert them into an iterable dataset that PyTorch can use to train the model.

```python
# create train dataset
train_ds = CatDogDataset(train, transform)
train_dl = DataLoader(train_ds, batch_size=100)
print(len(train_ds), len(train_dl))

# create test dataset
test_ds = CatDogDataset(test, transform)
test_dl = DataLoader(test_ds, batch_size=100)
print(len(test_ds), len(test_dl))
```

output

- 20000 200
- 5000 50

PyTorch uses a pythonic approach to define the architecture of the neural network, in contrast to how you usually do it with Keras. The following architecture is simple: 3 convolutional layers followed by a core network of 3 fully connected layers. This architecture is not optimal but serves the purpose of testing the model. I recommend you change this architecture (increase layers, change depth, etc.) to get different results.

```python
# PyTorch Convolutional Neural Network Model Architecture
class CatAndDogConvNet(nn.Module):

    def __init__(self):
        super().__init__()

        # convolutional layers (3, 16, 32)
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=(5, 5), stride=2, padding=1)
        self.conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=(5, 5), stride=2, padding=1)
        self.conv3 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=(3, 3), padding=1)

        # connected layers
        self.fc1 = nn.Linear(in_features=64 * 6 * 6, out_features=500)
        self.fc2 = nn.Linear(in_features=500, out_features=50)
        self.fc3 = nn.Linear(in_features=50, out_features=2)

    def forward(self, X):
        X = F.relu(self.conv1(X))
        X = F.max_pool2d(X, 2)

        X = F.relu(self.conv2(X))
        X = F.max_pool2d(X, 2)

        X = F.relu(self.conv3(X))
        X = F.max_pool2d(X, 2)

        X = X.view(X.shape[0], -1)
        X = F.relu(self.fc1(X))
        X = F.relu(self.fc2(X))
        X = self.fc3(X)

        return X
```

Ok, we are all set to train on our data with the CatAndDogConvNet model. The following code will do the hard work. Please note that the loss function and optimizer are loaded from the PyTorch libraries, but the performance metrics are calculated manually.

```python
# Create instance of the model
model = CatAndDogConvNet()

losses = []
accuracies = []
epochs = 8
start = time.time()

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Model Training...
for epoch in range(epochs):
    epoch_loss = 0
    epoch_accuracy = 0

    for X, y in train_dl:
        preds = model(X)
        loss = loss_fn(preds, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # accumulate plain floats (not graph tensors) for the running metrics
        accuracy = (preds.argmax(dim=1) == y).float().mean().item()
        epoch_accuracy += accuracy
        epoch_loss += loss.item()
        print('.', end='', flush=True)

    epoch_accuracy = epoch_accuracy / len(train_dl)
    accuracies.append(epoch_accuracy)
    epoch_loss = epoch_loss / len(train_dl)
    losses.append(epoch_loss)

    print("\n --- Epoch: {}, train loss: {:.4f}, train acc: {:.4f}, time: {}".format(
        epoch, epoch_loss, epoch_accuracy, time.time() - start))

    # test set accuracy
    with torch.no_grad():
        test_epoch_loss = 0
        test_epoch_accuracy = 0

        for test_X, test_y in test_dl:
            test_preds = model(test_X)
            test_loss = loss_fn(test_preds, test_y)

            test_epoch_loss += test_loss.item()
            test_accuracy = (test_preds.argmax(dim=1) == test_y).float().mean().item()
            test_epoch_accuracy += test_accuracy

        test_epoch_accuracy = test_epoch_accuracy / len(test_dl)
        test_epoch_loss = test_epoch_loss / len(test_dl)

        print("Epoch: {}, test loss: {:.4f}, test acc: {:.4f}, time: {}\n".format(
            epoch, test_epoch_loss, test_epoch_accuracy, time.time() - start))
```

- Epoch: 7, train loss: 0.2084, train acc: 0.9138, time: 871.9559330940247
- Epoch: 7, test loss: 0.5432, test acc: 0.8340, time: 890.4497690200806

As we can observe, the model's train accuracy reached 91%, with 83% on the test set. Not bad, but this can be improved!

We will re-use the CatDogDataset code to create TestCatDogDataset, a class that does pretty much the same thing but returns the image object and the file id, as the /data/test/ folder contains a set of unlabeled images.

```python
test_files = os.listdir('data/test/')
test_files = list(filter(lambda x: x != 'test', test_files))

def test_path(p):
    return f"data/test/{p}"

test_files = list(map(test_path, test_files))

class TestCatDogDataset(Dataset):
    def __init__(self, image_paths, transform):
        super().__init__()
        self.paths = image_paths
        self.len = len(self.paths)
        self.transform = transform

    def __len__(self):
        return self.len

    def __getitem__(self, index):
        path = self.paths[index]
        image = Image.open(path).convert('RGB')
        image = self.transform(image)
        fileid = path.split('/')[-1].split('.')[0]
        return (image, fileid)

test_ds = TestCatDogDataset(test_files, transform)
test_dl = DataLoader(test_ds, batch_size=100)
len(test_ds), len(test_dl)
```

output:

- (12500, 125)

Let's now make predictions for the entire unlabeled test set from the /data/test/ folder. We will store the probability of each image being a dog, P(dog|image), in the dog_probs list:

```python
dog_probs = []

with torch.no_grad():
    for X, fileid in test_dl:
        preds = model(X)
        preds_list = F.softmax(preds, dim=1)[:, 1].tolist()
        dog_probs += list(zip(list(fileid), preds_list))
```

Let's print the top 5 images to see what the algorithm predicted on unseen images:

%matplotlib inline

```python
# display some images
for img, probs in zip(test_files[:5], dog_probs[:5]):
    pil_im = Image.open(img, 'r')
    label = "dog" if probs[1] > 0.5 else "cat"
    title = "prob of dog: " + str(probs[1]) + " Classified as: " + label
    plt.figure()
    plt.imshow(pil_im)
    plt.suptitle(title)
    plt.show()
```

The classifier is not perfect and is still making mistakes. How do we solve this? Let's try another approach that has proved very effective in training convolutional neural networks to achieve state-of-the-art performance: **transfer learning**. I will replicate this same example using transfer learning in my next post to check the difference.

Kudos to chriszou for the original code for this post.

Hoffmann et al. (2018), in their paper "Using machine learning techniques to generate laboratory diagnostic pathways: a case study", challenged the use of expert rules in the interpretation of laboratory testing with machine learning models. The idea behind using machine learning is to provide an alternative to laboratory diagnostics that can offer state-of-the-art detection capabilities for certain conditions.

Fortunately, the data from this paper is available at the UCI Machine Learning Repository from UC Irvine. The dataset is also available at Kaggle as a CSV file, so I used the latter.

The data was collected from 73 patients (52 males and 21 females), ages 19 to 75, with proven serological and histopathological diagnoses of hepatitis C. The data was labeled in several categories according to their hepatic activity index. This index was later used to map those groups into the machine learning classes.

Each entry in the dataset is described by ten biochemical tests, gender, and age. The following list details which test indicators were used in the data collection process:

**Test Codes**

- albumin (ALB)
- alkaline phosphatase (ALP)
- alanine amino-transferase (ALT)
- aspartate amino-transferase (AST)
- bilirubin (BIL)
- choline esterase (CHE)
- cholesterol (CHOL)
- creatinine (CREA)
- gamma-glutamyl transpeptidase (GGT)
- total protein (PROT)

**Additional attributes**

- age
- sex

**Response variable (y)**

- Category (diagnosis) (values: '0=Blood Donor', '0s=suspect Blood Donor', '1=Hepatitis', '2=Fibrosis', '3=Cirrhosis')

The paper reports the following results using decision trees.

We are going to use the PyCaret AutoML capabilities to improve the accuracy over all the classes by using cross-validation.

**Imports**

```python
import pandas as pd
import numpy as np
from pycaret.classification import *
```

**Load Data**

```python
data = pd.read_csv("HepatitisCdata.csv")
data = data.drop(labels='Unnamed: 0', axis=1)
data.head(10)
```

**Data Preprocessing**

```python
# pd.Series(data["Category"], dtype="category")
# ['0=Blood Donor', '0s=suspect Blood Donor', '1=Hepatitis', '2=Fibrosis', '3=Cirrhosis']
data["Category"] = [0 if x == "0=Blood Donor" else x for x in data["Category"]]
data["Category"] = [1 if x == "0s=suspect Blood Donor" else x for x in data["Category"]]
data["Category"] = [2 if x == "1=Hepatitis" else x for x in data["Category"]]
data["Category"] = [3 if x == "2=Fibrosis" else x for x in data["Category"]]
data["Category"] = [4 if x == "3=Cirrhosis" else x for x in data["Category"]]

# pd.Series(data["Sex"], dtype="category")
# ['f', 'm']
data["Sex"] = [0 if x == "f" else 1 for x in data["Sex"]]

data.head(10)
```

The data looks much cleaner, and categorical columns were encoded accordingly.
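As a side note, the same encoding can be written more compactly with pandas `Series.map`. Here is a sketch on a hypothetical two-row frame with the same column names as the dataset:

```python
import pandas as pd

# hypothetical mini-frame standing in for the real dataset
df = pd.DataFrame({"Category": ["0=Blood Donor", "3=Cirrhosis"], "Sex": ["f", "m"]})

# one dictionary per column replaces the chain of list comprehensions
category_codes = {
    "0=Blood Donor": 0, "0s=suspect Blood Donor": 1,
    "1=Hepatitis": 2, "2=Fibrosis": 3, "3=Cirrhosis": 4,
}
df["Category"] = df["Category"].map(category_codes)
df["Sex"] = df["Sex"].map({"f": 0, "m": 1})
print(df)
```

An added benefit of `map` is that any unexpected label becomes NaN instead of silently passing through, which makes data errors visible early.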

**Magic with PyCaret**

With minimal coding, PyCaret will benchmark a suite of machine learning algorithms, estimate metrics to determine which is the best model, and also produce many visualizations such as AUC/ROC curves.

```python
# pycaret setup
s = setup(data, target = "Category")

# model training and selection
best = compare_models()
```

This code snippet automatically created the following table of results, showing which algorithms performed better, and the metrics they aced are highlighted in yellow.

The clear winner here is the Gradient Boosting Classifier (GBC) algorithm.

**Performance Metrics**

With a single line of code, PyCaret can provide a set of analyses that help us identify how the model is performing: accuracy, F1-Score, sensitivity analysis, and many more. Here are some interesting visualizations provided by PyCaret.

`evaluate_model(best)`

That's it! Here are some analytics:

*ROC/AUC Curve for GBC*

*Confusion Matrix*

*Precision-Recall Curve*

*Classification Report*

Before we start making some predictions, PyCaret offers the save_model method to save the machine learning model as a pickle file. The following code shows how to save and load the model if you later want to use it in a web service.

*Save Model*

```python
# save model to disk as "hepatitis_model.pkl"
save_model(best, "hepatitis_model")
```

*Load Model*

```python
from pycaret.classification import load_model

# load model
hepatitis_model = load_model('hepatitis_model')
```

**Prediction of a new Patient**

We are simulating a new patient by creating a new pandas data frame with the information needed. The following snippet shows how to submit a new patient for inference.

*Create the Data Frame*

```python
# build a one-row data frame with the patient's test results
patient_data = pd.DataFrame([{
    'Age': 32.00, 'Sex': 1.00, 'ALB': 44.30, 'ALP': 52.30,
    'ALT': 21.70, 'AST': 22.40, 'BIL': 17.20, 'CHE': 4.15,
    'CHOL': 3.57, 'CREA': 78.00, 'GGT': 24.10, 'PROT': 75.40
}])
```

*Perform the Prediction*

```python
# class labels in the same order as the numeric encoding applied earlier
classes = ['Blood Donor', 'suspect Blood Donor', 'Hepatitis', 'Fibrosis', 'Cirrhosis']

# perform prediction
prediction = predict_model(hepatitis_model, data = patient_data)
print('Patient was CLASSIFIED as:', classes[prediction.Label[0]])
```

The outcome of this prediction is: **Patient was CLASSIFIED as: Blood Donor**.

**Summary**

The PyCaret POC demonstrated a 0.9395 accuracy and 0.9270 F1-Score for predicting the classes in the hepatitis C dataset. This is far better than the results shown by Hoffmann et al. Although this is a promising result, this code is still a POC and should not be used for production purposes until further evaluation by medical experts (I am, unfortunately, the wrong type of doctor for medical and clinical approval!)

I uploaded the code as a Gist here!

If you liked this, remember to send some love back.

I made a quick visit to the hardware store to find some PVC tubes. While looking for them, I realized that every item in the warehouse must be counted periodically, by someone, to refill when the reserve is running low (how boring). So, what if there were a camera in front of the racks counting things? This is exactly what I tried to do with an image I downloaded, to check how we can use computer vision (OpenCV) to count things.

While looking at the OpenCV documentation and samples online, I found that there are several techniques to identify geometric shapes in an image. Concretely, there is a method in OpenCV, HoughCircles, which is able to find circles by using the Hough transform.

The **HoughCircles(...)** method has the following documentation:

`cv.HoughCircles( image, method, dp, minDist[, circles[, param1[, param2[, minRadius[, maxRadius]]]]] )`

Parameters

- **image**: 8-bit, single-channel, grayscale input image.
- **circles**: Output vector of found circles. Each vector is encoded as a 3- or 4-element floating-point vector (x, y, radius) or (x, y, radius, votes).
- **method**: Detection method, see HoughModes. The available methods are HOUGH_GRADIENT and HOUGH_GRADIENT_ALT.
- **dp**: Inverse ratio of the accumulator resolution to the image resolution. For example, if dp=1, the accumulator has the same resolution as the input image. If dp=2, the accumulator has half the width and height. For HOUGH_GRADIENT_ALT the recommended value is dp=1.5, unless some very small circles need to be detected.
- **minDist**: Minimum distance between the centers of the detected circles. If the parameter is too small, multiple neighbor circles may be falsely detected in addition to a true one. If it is too large, some circles may be missed.
- **param1**: First method-specific parameter. In the case of HOUGH_GRADIENT and HOUGH_GRADIENT_ALT, it is the higher threshold of the two passed to the Canny edge detector (the lower one is twice smaller). Note that HOUGH_GRADIENT_ALT uses the Scharr algorithm to compute image derivatives, so the threshold value should normally be higher, such as 300, for normally exposed and contrasty images.
- **param2**: Second method-specific parameter. In the case of HOUGH_GRADIENT, it is the accumulator threshold for the circle centers at the detection stage. The smaller it is, the more false circles may be detected. Circles corresponding to the larger accumulator values will be returned first. In the case of the HOUGH_GRADIENT_ALT algorithm, this is the circle "perfectness" measure. The closer it is to 1, the better-shaped circles the algorithm selects. In most cases 0.9 should be fine. If you want better detection of small circles, you may decrease it to 0.85, 0.8 or even less. But then also try to limit the search range [minRadius, maxRadius] to avoid many false circles.
- **minRadius**: Minimum circle radius.
- **maxRadius**: Maximum circle radius. If <= 0, uses the maximum image dimension. If < 0, HOUGH_GRADIENT returns centers without finding the radius. HOUGH_GRADIENT_ALT always computes circle radiuses.

Well, let's try to use this method and adjust its parameters to identify circles in an image. The circles we want to identify are the ones that belong to the ends of the pipes.

```python
import numpy as np
import cv2 as cv
from matplotlib import pyplot as plt
```

```python
# use matplotlib to print image in jupyter notebook
def show(img):
    plt.figure(figsize=(10, 16))
    plt.imshow(img, cmap='gray')
    plt.show()
```

```python
# load the image in full color
img = cv.imread('pipes.png', cv.IMREAD_COLOR)
show(img)
```

So this is the original image with a bunch of pipes. I am not going to count them by hand. Let's use the function to identify them.

The cv.HoughCircles() function requires the image to be in grayscale. Applying a Gaussian blur is also a good idea, to blend imperfections in the image that might cause false positives.

```python
# Convert to grayscale
gray = cv.cvtColor(img, cv.COLOR_BGR2GRAY)

# Apply blur with a 3x3 kernel
gray_blurred = cv.blur(gray, (3, 3))
show(gray_blurred)
```

Nice! Now our image is lightly blurred and in grayscale. That's all the pre-processing we need.

```python
# Apply Hough transform on the blurred image.
detected_circles = cv.HoughCircles(gray_blurred,
                                   cv.HOUGH_GRADIENT, 1, 15,
                                   param1=100, param2=20,
                                   minRadius=0, maxRadius=20)
```

detected_circles is an array that contains all the circles identified in the image. Each circle is composed of 3 values: a, b, and r. The a-b pair is the x-y location of the circle's center and r is the radius. The data looks something like this:

Now it's time to loop over each circle and draw the origin of the circle over the original image. This is done in the following loop:

```python
pipes_count = 0

# Draw circles that are detected.
if detected_circles is not None:

    # Convert circle metadata to integers
    detected_circles = np.uint16(np.around(detected_circles))

    for points in detected_circles[0, :]:
        a, b, r = points[0], points[1], points[2]

        # Draw a small circle (of radius 1) to show the center.
        cv.circle(img, (a, b), 1, (0, 0, 255), 3)

        # count the number of pipes
        pipes_count += 1
```

I also added the variable pipes_count to count how many circles were detected. In the end, this is the number we need.

The final image with the small blue circle drawn at the origin of each pipe looks like this.

Well, there are **79 pipes** in this image.

The Gist is available here

Hope you like this!

In this post, we will use a cool project from Google, Mediapipe, to build a code snippet that is able to identify if a person is drowsy. This could be applicable to monitoring people while driving or while performing dangerous tasks that require being fully awake.

MediaPipe offers cross-platform, customizable ML solutions for live and streaming media. Learn more about the Mediapipe project here

The idea is simple: we will monitor your eyes. If your eyes are closed for some time, then we will show an alert. Now we have to list the things we need to provide a solution:

- **Camera**. We need a camera to monitor the person's eyes in real time to identify if he/she is falling asleep. For this exercise, we will use our PC or laptop webcam. The OpenCV library already has all the tools to capture each frame from a video stream in real time.
- **Capture Eyes Metadata**. Fortunately, the Google Mediapipe library has a Python wrapper (because it was originally accessible only in C++) that provides facial landmarks we can use to capture the contour around the eyes.
- **Determine if eyes are open or closed**. The metadata captured from the Mediapipe library is an array of [x,y] positions of each landmark in the face (as seen in the image below). We need to filter them to get only the points around the right and left eyes. Once these arrays are available, we will estimate the height of each eye. We will make a condition that if an eye is half-open during k frames, then we will raise an alert. We will use k = 20 in our example.

Ok, now that we have the idea of how to solve this, let's start with the fun part. Time to build the prototype.

Mediapipe is available with pip, so use **pip install mediapipe** to download the library.

```python
import cv2 as cv
import numpy as np
import mediapipe as mp
```

Mediapipe will output the points around both eyes as a collection of [x,y] points. The idea is to identify the highest and lowest y-values for each eye so that we can use them to calculate how open each one is. The following function takes the array from Mediapipe and calculates the height of each eye as the difference between the max and min y-values of the [x,y] point collection.

```python
def open_len(arr):
    y_arr = []

    for _, y in arr:
        y_arr.append(y)

    min_y = min(y_arr)
    max_y = max(y_arr)

    return max_y - min_y
```

```python
mp_face_mesh = mp.solutions.face_mesh

# A: location of the eye landmarks in the facemesh collection
RIGHT_EYE = [362, 382, 381, 380, 374, 373, 390, 249, 263, 466, 388, 387, 386, 385, 384, 398]
LEFT_EYE = [33, 7, 163, 144, 145, 153, 154, 155, 133, 173, 157, 158, 159, 160, 161, 246]

# handle of the webcam
cap = cv.VideoCapture(0)

# Mediapipe parameters
with mp_face_mesh.FaceMesh(
        max_num_faces=1,
        refine_landmarks=True,
        min_detection_confidence=0.5,
        min_tracking_confidence=0.5) as face_mesh:

    # B: count how many frames the user seems to be going to nap (half-closed eyes)
    drowsy_frames = 0

    # C: max height of each eye
    max_left = 0
    max_right = 0

    while True:
        # get every frame from the web-cam
        ret, frame = cap.read()
        if not ret:
            break

        # Get the current frame and collect the image information
        frame = cv.flip(frame, 1)
        rgb_frame = cv.cvtColor(frame, cv.COLOR_BGR2RGB)
        img_h, img_w = frame.shape[:2]

        # D: collect the mediapipe results
        results = face_mesh.process(rgb_frame)

        # E: if mediapipe was able to find any landmarks in the frame...
        if results.multi_face_landmarks:

            # F: collect all [x,y] pairs of all facial landmarks
            all_landmarks = np.array([np.multiply([p.x, p.y], [img_w, img_h]).astype(int)
                                      for p in results.multi_face_landmarks[0].landmark])

            # G: right and left eye landmarks
            right_eye = all_landmarks[RIGHT_EYE]
            left_eye = all_landmarks[LEFT_EYE]

            # H: draw only landmarks of the eyes over the image
            cv.polylines(frame, [left_eye], True, (0, 255, 0), 1, cv.LINE_AA)
            cv.polylines(frame, [right_eye], True, (0, 255, 0), 1, cv.LINE_AA)

            # I: estimate eye-height for each eye
            len_left = open_len(left_eye)
            len_right = open_len(right_eye)

            # J: keep highest distance of eye-height for each eye
            if len_left > max_left:
                max_left = len_left
            if len_right > max_right:
                max_right = len_right

            # print on screen the eye-height for each eye
            cv.putText(img=frame, text='Max: ' + str(max_left) + ' Left Eye: ' + str(len_left),
                       fontFace=0, org=(10, 30), fontScale=0.5, color=(0, 255, 0))
            cv.putText(img=frame, text='Max: ' + str(max_right) + ' Right Eye: ' + str(len_right),
                       fontFace=0, org=(10, 50), fontScale=0.5, color=(0, 255, 0))

            # K: condition: if eyes are half-open, then count.
            if (len_left <= int(max_left / 2) + 1 and len_right <= int(max_right / 2) + 1):
                drowsy_frames += 1
            else:
                drowsy_frames = 0

            # L: if the count is above k, the person has had drowsy eyes for more than k frames.
            if (drowsy_frames > 20):
                cv.putText(img=frame, text='ALERT', fontFace=0, org=(200, 300),
                           fontScale=3, color=(0, 255, 0), thickness=3)

        cv.imshow('img', frame)
        key = cv.waitKey(1)
        if key == ord('q'):
            break

cap.release()
cv.destroyAllWindows()
```

- **A**: RIGHT_EYE and LEFT_EYE are the arrays of the indexes of the points around the eyes in the face-landmark collection. We will use these indexes to filter the results from Mediapipe.
- **B**: drowsy_frames is the variable used to accumulate how many frames the user kept the eyes closed to at least 50% of the eye height.
- **C**: max_left and max_right capture the maximum height recorded for each eye while open.
- **D**: This is where Mediapipe takes the rgb_frame and searches for landmarks. This is basically the predict method of the Mediapipe FaceMesh functionality.
- **E**: results.multi_face_landmarks contains all the landmarks found. So if there are any, let's do something; otherwise, ignore the frame.
- **F**: all_landmarks contains all face landmarks found in the image by Mediapipe. This covers the whole face area, not only the eyes.
- **G**: we filter the all_landmarks array with the indexes from RIGHT_EYE and LEFT_EYE from section A. This returns two lists of the form [[x,y],[x,y],...,[x,y]] that contain only the points around the left and right eyes.
- **H**: this is for debugging. We use the OpenCV polylines method to draw the lines around the eyes from the right_eye and left_eye landmarks captured in section G.
- **I**: Estimate the height of each eye.
- **J**: Keep track of the max height of each eye.
- **K**: This is the drowsy condition. If the eyes are half-open, add 1 to the drowsy_frames counter. Otherwise, reset the variable to 0 to start counting again.
- **L**: If the user has had drowsy eyes for more than 20 frames, show the ALERT text on the screen.

You can get the Gist **here!**

**Warning**: This comparison is biased by my experience, which is mostly with TensorFlow, but there are interesting conclusions. Although they scored similarly, there are some aspects that might weigh a bit more; here they are.

According to AssemblyAI, TF has historically been the de facto framework for deep learning, but PyTorch's popularity has grown exponentially. In 2017, almost 90% of research papers used TF. In 2021, almost 80% of new research is done with PyTorch.

Papers with Code shows that most of the papers published there are done with PyTorch.

Although TensorFlow 1 was first, it was difficult to use, giving other libraries and frameworks a chance to shine.

Companies like OpenAI have moved a lot of their internal efforts from TF to PyTorch. There are things, like Reinforcement Learning, that are still done in TF.

According to GradientFlow, TensorFlow is the most popular framework in job postings, but I believe this is for historical reasons. PyTorch grew 194% year-over-year, in contrast to TensorFlow, which grew just 23%.

Most of the ready-to-use models on HuggingFace are made in PyTorch.

TensorFlow has some unique features, such as TFLite for mobile devices and TensorFlow.js for JavaScript, allowing you to train simple models on the web. PyTorch Mobile was released in 2019 for iOS and Android devices. PyTorch Live was Facebook's response for JavaScript and React Native support.

Reinforcement Learning is more robust on the TensorFlow side.

I believe the final message is loud and clear. PyTorch is the future, and it's becoming the new bad boy in town. TensorFlow is still widely used and supported. So, you'd better learn both, as they will keep competing.

If you are just starting, I recommend starting with TensorFlow and looking up the alternative PyTorch code.

To explore this idea, we will be using a Cereals dataset I downloaded from Kaggle. You can download the dataset from here.

The dataset contains about 77 different cereals with some interesting nutritional features such as calories, protein, fat, sodium, carbohydrates, sugars, vitamins, etc.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
```

Now, let's load the data and select only the nutritional attributes we want to use:

```python
data = pd.read_csv('data/cereals-products.csv')
data = data[['name', 'calories', 'protein', 'fat', 'sodium',
             'fiber', 'carbo', 'sugars', 'potass', 'vitamins']]
```

As you can see, there are 77 different kinds of cereal, each one described by 10 features. The objective is to calculate a *metric* that can tell us how "similar" one cereal is to another. That metric is the Pearson Correlation Coefficient.

Numpy has the **corrcoef** method that allows us to calculate the correlation coefficients of all cereals at once! We will use the loaded data to estimate the correlation between cereals. This will generate a correlation matrix with dimensions [77,77].

To calculate the Pearson matrix, we just need the numerical attributes, so we will leave out the "name" column, as it has no use for now.

`pearson_matrix = np.corrcoef(data[data.columns[1:]])`

The pearson_matrix is a 77x77 matrix of the correlations between cereals. To visualize this, we will use the seaborn library with the following code.

```python
plt.figure(figsize=(16, 16))
sns.heatmap(pearson_matrix, xticklabels=data["name"], yticklabels=data["name"], fmt="d")
```

Let's remember that the correlation measures the level of the linear relationship between two lists. If the correlation is perfect, then it will be 1.

The pearson_matrix is all we need now to search for cereals that are similar. The idea is that if we choose *"Special K"*, for example, we look in the pearson_matrix for the cereals that have the highest correlations with it. It's that simple! All we need to do now is create a method *"recommend"* that searches the pearson_matrix for the top-k most similar cereals.

The code looks like this:

```python
def recommend(cereal_name, top_k):
    cereal_names = data["name"]
    index = list(cereal_names).index(cereal_name)
    coeff = pearson_matrix[index]
    df = pd.DataFrame({'pearson': coeff, 'name': cereal_names}).sort_values('pearson', ascending=False)
    return df.head(top_k)
```

This method returns the list of cereal names sorted by their respective Pearson coefficients.

Let's search for **"Special K"** to get the top 10 most similar cereals:

`recommend("Special K", 10)`

What about **"100% Bran"**?

`recommend("100% Bran", 10)`

**Final Thoughts:**

- The correlation coefficient is used as a similarity measure to find other products that are mathematically similar, based on the data attributes.
- If this recommender is used in a commercial setting, it should be used as a tool to find replacement cereals or cereals that are similar in terms of nutritional values, not flavor.
- The Pearson matrix calculated here should be persisted somewhere so that it is not estimated on every call. If new products are added, the matrix should be regenerated.
- The technique shown here also works with other types of data, such as categorical data or text. For categorical data, variables should be one-hot encoded. For text, the text should first be converted into tokens.
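As a quick illustration of the categorical case, here is a minimal sketch. The item names and attributes below are made up for the example (not from the cereals dataset): the categorical columns are one-hot encoded before computing the same row-wise Pearson matrix.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical products (made-up data, not the cereals dataset)
items = pd.DataFrame({
    "name": ["A", "B", "C"],
    "color": ["red", "red", "blue"],
    "size": ["small", "large", "small"],
})

# One-hot encode the categorical columns so corrcoef receives numbers
encoded = pd.get_dummies(items[["color", "size"]]).astype(float)

# Same row-wise Pearson matrix trick as with the cereals
sim = np.corrcoef(encoded.to_numpy())
print(sim.shape)  # one row/column per product: (3, 3)
```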

Bootstrapping is a resampling technique with replacement; that is, each sample we draw may contain repeated elements from the dataset.

The bootstrap method is a statistical technique for estimating quantities of a population by averaging estimates from multiple small data samples (Brownlee, 2018).

**The Algorithm To Create One Sample**

```
N = size of dataset (population) D
n = target size of sample
cn = current size of the sample
s = sample

while (cn < n):
    randomly select an item i from the dataset D
    add the randomly selected item i to s
    cn++
```

This creates subset s full of randomly selected samples from D. This process can be repeated k times.
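The pseudocode above can be sketched in plain Python like this (the dataset `D` and the sizes `n` and `k` are arbitrary illustration values):

```python
import random

def bootstrap_sample(D, n):
    """Draw one bootstrap sample of size n from dataset D, with replacement."""
    s = []
    while len(s) < n:
        s.append(random.choice(D))  # the same item may be picked again
    return s

# Repeat the process k times to get k bootstrap samples
D = [10, 20, 30, 40, 50]
samples = [bootstrap_sample(D, n=4) for _ in range(3)]
```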

In Scikit-Learn, we can get a random sample of a dataset D with the resample method. Let's select from a 4-row matrix 3 samples of two rows each.

```python
import numpy as np
from sklearn.utils import resample

X = np.array([[1., 0.], [2., 1.], [0., 0.], [2., 4.]])
```

array([[1., 0.], [2., 1.], [0., 0.], [2., 4.]])

`resample(X, n_samples=2)`

array([[1., 0.], [1., 0.]])

`resample(X, n_samples=2)`

array([[0., 0.], [2., 4.]])

`resample(X, n_samples=2)`

array([[2., 4.], [0., 0.]])

Cross-validation is a very similar technique to bootstrapping, with the difference that it selects its samples without replacement; that is, there are no repeated elements in any subset. Selecting k samples with cross-validation is called k-fold cross-validation. Usually, in each cross-validation exercise, we define a portion of the selected sample to be the train set and the rest the test set (for example, 70/30 or 80/20). The type of cross-validation that selects a test set with one single example is called LOOCV (leave-one-out cross-validation).

Subsetting a dataset using LOOCV is computationally expensive, so we usually use k-fold cross-validation with k = {4, 5, 7, 10}.
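A quick sketch of LOOCV using scikit-learn's LeaveOneOut (the tiny 4-row matrix is just for illustration):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(8).reshape(4, 2)
loo = LeaveOneOut()
print(loo.get_n_splits(X))  # 4: one split per example, which gets expensive fast

for train_idx, test_idx in loo.split(X):
    print("TRAIN:", train_idx, "TEST:", test_idx)
```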

In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times, with each of the k subsamples used exactly once as the validation data. The k results can then be averaged to produce a single estimation.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=2)
kf.get_n_splits(X)
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
```

KFold(n_splits=2, random_state=None, shuffle=False)
TRAIN: [2 3] TEST: [0 1]
TRAIN: [0 1] TEST: [2 3]

- Bootstrapping selects samples with replacements that can be as big as the dataset.
- Cross-validation samples are smaller than the dataset.
- Bootstrap subsets may contain repeated elements. Bootstrapping relies on random sampling.
- Cross-validation does not rely on random sampling, just splitting the dataset into k unique subsets.
- Cross-validation is usually used to test an ML model's generalization capabilities.
- Bootstrapping is used more for statistical tests, ensemble machine learning, and parameter estimation.

If you are still not sure which one to use for testing your ML model, just go with cross-validation.

```python
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_score

diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]
lasso = linear_model.Lasso()
print(cross_val_score(lasso, X, y, cv=3))
```

[0.33150734 0.08022311 0.03531764]

In this example, we are applying Lasso to predict diabetes progression after one year. With k = 3, we obtained scores of 0.33, 0.08, and 0.035 (R² scores, since this is a regression task). Averaging them, we can say that the cross-validation score is mean(0.33, 0.08, 0.035) ≈ 0.15.

Now, you can test this same code with another ML model, such as Linear Regression (OLS). If the new model has a better cross-validation score, then you have good evidence that the new model performs better than the other one. Cross-validation is excellent for model selection.
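That comparison can be sketched like this, reusing the same data slice and 3-fold setup as above (which model wins is for you to check):

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_score

diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]

# Same 3-fold CV, two candidate models
lasso_scores = cross_val_score(linear_model.Lasso(), X, y, cv=3)
ols_scores = cross_val_score(linear_model.LinearRegression(), X, y, cv=3)

print("Lasso mean CV score:", np.mean(lasso_scores))
print("OLS   mean CV score:", np.mean(ols_scores))
```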

First, we will flip two coins two times to explain the probability of getting something out of these scenarios. The following image explains all possible outcomes.

As you can observe from the diagram, we can get HH, TT, or a mixed outcome [HT or TH], which are basically the same thing. We calculate the probability of these 3 outcomes, assuming each toss is independent. The probability of getting HH or TT is 0.25 each, but the probability of getting either HT or TH is 0.5.
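These probabilities can be verified by enumerating the four equally likely outcomes:

```python
from itertools import product
from fractions import Fraction

# All outcomes of flipping two fair coins
outcomes = list(product("HT", repeat=2))   # HH, HT, TH, TT
p = Fraction(1, len(outcomes))             # each elementary outcome: 1/4

p_hh = sum(p for o in outcomes if o == ("H", "H"))
p_mixed = sum(p for o in outcomes if set(o) == {"H", "T"})  # HT or TH

print(p_hh)     # 1/4
print(p_mixed)  # 1/2
```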

The p-value is not the coin flip probability. We have already calculated that.

The p-value is the probability that random chance has generated the data, something of equal chance, or something rarer. With p-values, we look to understand the probability of how the data was generated.

The p-value is the sum of the probability of HH (which we already know is 0.25) + the probability of something of equal chance (in this case the probability of TT which is also 0.25) + the probability of a rarer event (which in this case is 0 because there is nothing rarer than what we have in the diagram)

$$\text{p-value}(HH) = p(HH) + p(TT) + 0$$

$$\text{p-value}(HH) = 0.25 + 0.25 + 0 = 0.5$$

So to recap, the probability of HH is 0.25 but the p-value of HH is 0.5.
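Following the diagram's grouping (HT and TH counted as one combined outcome), this calculation can be sketched as:

```python
# Events as grouped in the diagram: HT and TH count as one combined outcome
events = {"HH": 0.25, "TT": 0.25, "HT or TH": 0.5}

observed = "HH"
# p-value: probability of the observed event plus every event that is
# equally likely or rarer
p_value = sum(p for p in events.values() if p <= events[observed])

print(events[observed])  # 0.25 - the plain probability of HH
print(p_value)           # 0.5  - the p-value of HH
```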

Well, the p-value checks the probability that HH was generated by luck. The 0.5 means that HH is not special; it's a very common combination that is very plausibly caused by luck. When a p-value is less than 0.05, we say that it is statistically significant, because it rejects the null hypothesis, which is that the observed event (in this case HH) was caused by luck.

We can use p-values in machine learning models to understand the significance of the predictors (variables) used for regression or classification. For example, we might use statsmodels to fit a regression model with OLS for a problem such as estimating the battery life of a Tesla car after x mileage. The OLS fit will build the linear regression model and calculate the p-values for each variable, including the intercept.

When we observe a predictor whose p-value is lower than 0.05, we say that that variable is **significant**, because the change in that predictor affects the response variable (y) not merely by luck. That p-value rejects the null hypothesis.

**Warning:** We have to be careful with the curse of dimensionality and p-values. If our ML model has too many variables, this might cause predictors to have a very low p-value, making them look significant when they are not. Reducing the number of dimensions and pruning noisy variables from the model is a must when using p-values.

The following image shows an example summary printed by statsmodels OLS. You can see that every regressor has its p-value (P>|t|) calculated, and (almost) all of them are less than 0.05, which means they are statistically significant. Since SqFtLot has 0.323 > 0.05, we can consider removing this feature, since its variability does not track y.

Covariance is a measure that helps us understand whether two variables vary together (also called joint variability). If the covariance is positive, the variables move in the same direction. If the covariance is negative, they move in opposite directions. Let's check the formula first, and then we will create an example to build the intuition.

$$cov(X,Y) = \frac{\sum(X_i - \overline{X})(Y_i - \overline{Y})}{n}$$

where:

- \(X_i\) represents each value of the X array
- \(\overline{X}\) is the mean of the X array
- \(Y_i\) represents each value of the Y array
- \(\overline{Y}\) is the mean of the Y array

This is a powerful image, as it demonstrates what covariance tells us about the relationship. When the relationship is nearly linear and its direction points down (negative slope), the covariance is negative; otherwise, it is positive. But look at the center case, where the shape is like a circle. Because covariance does not capture non-linear relationships, anything that looks like this will have a value close to zero.

For now, covariance can be used as a formula that tells us the direction of the relationship, but not how strong the relationship is. For that, we will use the Correlation Coefficient, which includes covariance as part of its formula. Let's check it out.

The correlation coefficient (also called the Pearson Correlation Coefficient, or PCC) measures the strength of a linear correlation between two variables. Pearson's correlation coefficient is the covariance of the two variables divided by the product of their standard deviations.

The standard deviation is a measure of the dispersion of data from its average. Covariance is a measure of how two variables change together. However, its magnitude is unbounded, so it is difficult to interpret. The normalized version of the statistic is calculated by dividing covariance by the product of the two standard deviations.

The correlation coefficient is normalized in the range of [-1, 1]

$$\rho(X,Y)=\frac{cov(X,Y)}{\sigma_X\sigma_Y}$$

As we can observe, we are normalizing the Covariance by the product of the standard deviations from the two variables.
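To connect the two formulas, here is a minimal hand-rolled version (the toy arrays are arbitrary example values) that should agree with numpy's built-in:

```python
import numpy as np

def cov(X, Y):
    """Population covariance, straight from the formula above."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    return np.sum((X - X.mean()) * (Y - Y.mean())) / len(X)

def pearson(X, Y):
    """Covariance normalized by the product of the standard deviations."""
    return cov(X, Y) / (np.std(X) * np.std(Y))

X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
print(pearson(X, Y))
print(np.corrcoef(X, Y)[0, 1])  # should match
```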

Higher values of one variable tend to be associated with either higher (positive correlation) or lower (negative correlation) values of the other variable, and vice versa (Schober et al., 2018).

Thankfully, we don't need to worry about implementing these formulas; the numpy library already implements them in a very efficient way. Let's use it with an example.

**Create a Fake Dataset using Sklearn datasets.make_regression**

```python
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.pyplot import figure
from sklearn import datasets

n_samples = 1000
n_outliers = 10
X, y, coef = datasets.make_regression(
    n_samples=n_samples,
    n_features=1,
    n_informative=1,
    noise=10,
    coef=True,
    random_state=0,
)
```

**Let's plot the data**

```python
figure(figsize=(8, 6), dpi=80)
plt.scatter(X, y, color="green", alpha=0.5)
```

**Calculate Covariance with Numpy Cov**

`np.cov(X.reshape(-1),y)[0,1]`

Covariance = 80.0952750884216. This does tell us about the positive relationship, but that 80 by itself is not telling us much. That's why we need the Correlation Coefficient: to get a normalized value in [-1, 1] for the strength of the relationship.

**Calculate the Correlation Coefficient with Numpy Corrcoef**

`np.corrcoef(X.reshape(-1),y)[0,1]`

Correlation Coefficient = 0.9925933634659728. This tells us the relationship is very linear, and also that the correlation is very strong!

We can build an item-to-item recommender system. If products are described as arrays of numerical values, we can use the Pearson correlation coefficient as an element to compare items that are correlated to each other, so that we can find the top k items that are similar to something of interest.

If an Xbox is X1, a PlayStation is X2, and Paper Towels are X3, we can calculate corrcoef(X1, X2) and corrcoef(X1, X3) to find that the best correlation is probably between X1 and X2, so that if you select X1 in your shopping cart, we can recommend X2, because it has a higher correlation with X1 than X3 does.
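As a toy sketch of that idea (the feature vectors below, say price, weight, and rating, are completely made up):

```python
import numpy as np

# Made-up feature vectors: [price, weight_kg, rating]
xbox        = [499, 4.4, 8]
playstation = [499, 4.5, 9]
towels      = [12, 0.5, 7]

corr_ps = np.corrcoef(xbox, playstation)[0, 1]
corr_towels = np.corrcoef(xbox, towels)[0, 1]

# The two console vectors correlate more strongly with each other
print(corr_ps > corr_towels)
```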

**References:**

Schober, Patrick; Boer, Christa; Schwarte, Lothar A. (2018). Correlation Coefficients: Appropriate Use and Interpretation. Anesthesia & Analgesia, 126(5), 1763-1768. doi:10.1213/ANE.0000000000002864

The Poisson distribution (Haight, 1967) is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant mean rate and independently of the time since the last event.

In other words, this distribution helps determine the probability of an event if it has been happening over and over again in some time interval.

The probability can be found using the following formula:

$$f(k; \lambda)= \Pr(X{=}k)= \frac{\lambda^k e^{-\lambda}}{k!}$$

where

- \(k\) is the number of occurrences (k = 0,1,2,3...)
- \(e\) is Euler's constant (2.71828...)
- \(\lambda\) is a real number > 0 that represents the event rate.

**How can we use it?**

So, to exemplify this, let's create a hypothetical scenario. If we know that an earthquake hits San Francisco once every 100 years (\(\lambda = 1\)), we might want to know the probability that \(k = \{0,1,2,3\}\) earthquakes will happen during the next 100 years.

$$f(k; \lambda)= \frac{\lambda^k e^{-\lambda}}{k!}$$

then:

$$f(0; 1)= \frac{1^0 e^{-1}}{0!} = 0.3678$$

$$f(1; 1)= \frac{1^1 e^{-1}}{1!} = 0.3678$$

$$f(2; 1)= \frac{1^2 e^{-1}}{2!} = 0.1839$$

$$f(3; 1)= \frac{1^3 e^{-1}}{3!} = 0.0613$$

This tells us that, given that San Francisco shakes hard about once every 100 years, the probability of seeing no earthquake and of seeing exactly one earthquake is the same, but the odds go down if we are expecting 2 or more earthquakes.
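The four values above can be reproduced with a few lines, as a direct translation of the formula:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events at rate lam."""
    return (lam ** k) * exp(-lam) / factorial(k)

# Earthquake example: lam = 1 per 100-year interval
for k in range(4):
    print(k, round(poisson_pmf(k, lam=1), 4))
```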

**Assumptions**

- All events (in our example, earthquakes) are independent. We assume one earthquake will not cause another of the same magnitude (although aftershocks will obviously happen).
- The rate of occurrence is also independent.
- The average rate is constant. Here we assume earthquakes happen every 100 years.
- Two events can't happen at the same instant. The San Andreas Fault can't produce two shakes at the same time.

**Other questions answered by the Poisson Distribution:**

- What is the probability of selling 10 pizzas in a day?
- What is the probability of Earth getting destroyed by a giant meteor?
- What is the probability that Cuba and the Dominican Republic will get hit by 3 tropical storms next year?
- What is the probability that Brazil will win the next world cup? (the WC happens every 4 years!)

**References:**

Haight, Frank A. (1967), Handbook of the Poisson Distribution, New York, NY, USA: John Wiley & Sons
