Enhance Athletes' Performance Through Data Analytics & Recommender Systems
A couple of months ago I was contacted by a small training-fitness startup. They wanted to know whether analytics techniques and methods could help them answer some of their business questions and enhance trainees' performance. I told them that it was indeed a doable task. However, they pointed out that the startup had limited resources, a small budget, and NO data... Only then did I realize that the task at hand was a huge challenge. So, how could I tackle this challenge effectively, get the job done, and deliver useful and actionable insight for this training-fitness startup? It was time to, once again, think outside the box.
The answer to these tough questions was to combine some of Google Workspace's cloud-based collaborative tools (Docs, Sheets, Slides, Forms) with the TRIFACTA platform and the comprehensive package library of the open-source R language to design and implement an effective low-cost solution, serving the results as fully interactive tables and easy-to-digest visualizations in Data Studio (now Looker Studio). BINGO!
To clarify the discussion that follows, I have divided this post into three sections:
- Gathering Data and Identifying Key Variables: Online Surveys and Principal Component Analysis - PCA.
- Graph Analysis and Weighted Association Rules Mining - WARM.
- Implementation of a Content-Based Recommendation Engine - CBRE.
Let's unlock another interesting use case that illustrates how to get the most value out of data, helping to solve a real-world problem.
Gathering Data and Identifying Key Variables: Online Surveys and Principal Component Analysis - PCA
The main issue that I had to deal with was the absence of reliable data. The easiest and cheapest solution was to use the free Google FORMS tool to build a few online surveys and ask the trainees to fill them out. Fortunately, there are several reliable free tutorials and video walkthroughs out there with instructions on how to construct a survey using Google Forms, covering question types such as short-answer and paragraph questions, single- and multiple-choice questions, and checkbox questions. Tips to improve online surveys are also available online.
The first survey deployed was designed to explore the trainees' experience and satisfaction. The knowledge gained from analyzing the results with the Form's own visualization capabilities was used directly to quickly address and fix a few important issues (that had not yet been detected) regarding trainers' performance, logistics, etc. Touchdown!
Now, Figure 1 below shows the surveys designed, using some of the question types already mentioned, to collect trainees' general data, like age, sex, and email (to be used as a unique identifier), among other important information.
Figure 1: Example of the survey - trainees' general data and information.
Figure 2 above depicts some of the trainees' general survey responses, saved as a Google Sheet; by the way, Google Sheets is another powerful, easy-to-use, free cloud-based tool that offers many of the features found in Excel. Sheets are easy to share, and they also make it easy to configure the access and roles of multiple users.
As I pointed out in the last post, the data preparation process is the most important step in any Data Science workflow. It was particularly important in this example because NO data was available to work with; so, to deliver reliable results, it was imperative to apply advanced data preparation techniques to extract every single drop of knowledge and relevant information that could be encapsulated in the survey responses. Yes, wrangling the data properly was absolutely critical to addressing the issues and successfully accomplishing the goals.
To start, a Principal Component Analysis or PCA was carried out to explore the survey responses and select a reduced set of variables from them. Simply put, the Principal Component Analysis was performed to identify the most relevant and informative variables for the upcoming analysis. Figure 4 above illustrates the data preparation process in TRIFACTA and, to the right, the recipe implemented to reshape the data and ready it for PCA. As will be discussed in the following sections, recipes were also implemented to reshape the data for visualization, graph analysis, etc.
The R packages FactoMineR and factoextra were leveraged to carry out the analysis and build some informative visualizations. The packages' ability to handle quantitative and categorical variables was key to tackling the challenge and quickly obtaining reliable results.
The excellent website Statistical Tools for High-Throughput Data Analysis (STHDA) contains tutorials and many examples of applications of the R packages already mentioned. Figure 5 above shows one of the available graphic tools to visually deliver the PCA results; the image can be interpreted as follows: variables located far from the intersection of the Dim1 and Dim2 axes account for the GREATEST VARIABILITY in the data, namely, they are the most impactful variables. For a detailed explanation, additional visual tools, and examples, please explore the STHDA web page. Taking into account the suggestions of trainers and the startup's domain experts, a set of 16 out of 27 variables (highlighted inside the dotted blue curve) was selected.
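To make the "distance from the axis intersection" reading more concrete, here is a minimal sketch of the underlying idea. It is written in plain Python rather than with the FactoMineR and factoextra packages used in the project, it only handles numeric columns (the categorical handling those packages provide is not reproduced), and it finds the components with a simple power iteration. The function names, variable names, and survey answers are all hypothetical.

```python
import math

def standardize(rows):
    """Center each column and scale it to unit (population) standard deviation."""
    n = len(rows)
    columns = []
    for col in zip(*rows):
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        columns.append([(v - mean) / sd for v in col])
    return [list(row) for row in zip(*columns)]

def correlation_matrix(z):
    """Correlation matrix of already standardized data."""
    n, p = len(z), len(z[0])
    return [[sum(z[k][i] * z[k][j] for k in range(n)) / n for j in range(p)]
            for i in range(p)]

def top_eigenpair(matrix, iterations=500):
    """Dominant eigenvalue and eigenvector of a symmetric matrix (power iteration)."""
    p = len(matrix)
    vec = [float(i + 1) for i in range(p)]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm < 1e-12:  # nothing left to extract
            return 0.0, [0.0] * p
        vec = [v / norm for v in nxt]
    value = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(p))
                for i in range(p))
    return value, vec

def variable_coordinates(rows, n_components=2):
    """Coordinates of each variable on the first principal components.

    For standardized data the coordinate is loading * sqrt(eigenvalue),
    i.e. the correlation between the variable and the component.
    """
    matrix = correlation_matrix(standardize(rows))
    p = len(matrix)
    coords = [[] for _ in range(p)]
    eigenvalues = []
    for _ in range(n_components):
        value, vec = top_eigenpair(matrix)
        eigenvalues.append(value)
        for i in range(p):
            coords[i].append(vec[i] * math.sqrt(max(value, 0.0)))
        # Deflation: remove the component just found before searching for the next.
        matrix = [[matrix[i][j] - value * vec[i] * vec[j] for j in range(p)]
                  for i in range(p)]
    return eigenvalues, coords

# Hypothetical numeric answers for six trainees (not the startup's data).
names = ["age", "sleep_hours", "sessions_per_week"]
answers = [
    [23, 7, 4],
    [35, 5, 2],
    [29, 6, 3],
    [41, 5, 1],
    [26, 8, 5],
    [32, 6, 3],
]
eigenvalues, coords = variable_coordinates(answers)
distance = {name: math.hypot(*xy) for name, xy in zip(names, coords)}
for name in sorted(distance, key=distance.get, reverse=True):
    print(f"{name}: distance from origin = {distance[name]:.2f}")
```

Variables with the largest distance are the ones best represented on the Dim1/Dim2 plane, which is the visual criterion described above. The final choice of variables in the project also relied on the trainers' and domain experts' suggestions, which no script can replace.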
Armed with the PCA results, the next step was to go back to TRIFACTA and reformat the available data, generate the inputs to apply Graph and Weighted Association Rules Mining techniques, and unearth additional actionable knowledge that could help the startup's trainers to enhance trainees' performance.
Graph Analysis and Weighted Association Rules Mining - WARM
Figure 6 below shows the results (trainees' identifiers have been anonymized) of the complex data preparation process in TRIFACTA to generate the input in a format suitable for the Graph and Association Rules Mining analysis.
Figure 7 below illustrates an example of a Graph of the surveyed trainees, built and plotted using tools from the arules and igraph R packages. By tuning the plot function parameters (size, colors, etc.), it's possible to surface a few interesting features, as well as key relationships between some of them. The figure suggests that there exists a connection between, for example, BEBE_ALCOHOL_FRECUENTE (frequent alcohol consumption) and other factors that clearly harm the trainees' performance, like LESION_MUSCULAR_ARTICULAR_SI (muscular and articular lesions), HORAS_DUERME_NOCHE_5-6 (sleep deprivation), etc.
To corroborate these visual findings and unearth more potentially useful associations between variables, a detailed Weighted Association Rules Mining analysis was carried out, following the workflow and discussion presented in my first post. To unlock the current use case, the Apriori and HITS algorithms, and the same analogy, were applied again. So, each trainee is now a "customer", and factors or variables mentioned above like BEBE_ALCOHOL_FRECUENTE (alcohol consumption), HORAS_DUERME_NOCHE_5-6 (sleep deprivation), etc., are the items "bought" by the customers. The task is to uncover relevant relationships or rules between the items.
Figure 8 and Figure 9 above show Graph and Parallel Coordinate plots, respectively, obtained following a procedure very similar to the one used to unlock the use case presented in the first post. The Lift metric was also used here to rank the rules or item associations (in Figure 9, the thicker the red line, the higher the Lift value). After carefully exploring both figures, it's not hard to conclude that the incidence of muscular and articular lesions could be closely associated with respiratory issues, frequent alcohol consumption, and other unhealthy habits of the trainees.
To facilitate the interpretation and add explainability, the rules are now tabulated (ranked by the Lift metric) in Figure 10 above (the top-10 associations or rules have been highlighted). Certainly, this unearthed actionable insight can be used to address and mitigate the negative factors and associations identified, helping to enhance both the health and the performance of the trainees. Touchdown!
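The "trainee as customer, factor as item" analogy can be sketched in a few lines. The snippet below is a simplified Python illustration, not the arules workflow used in the project: it only scores rules between pairs of items, and the optional `weights` argument stands in for per-trainee weights such as those a HITS step would produce (computing those weights is not shown). The item names come from the survey, but the baskets themselves are invented toy data.

```python
from itertools import permutations

def support(itemset, baskets, weights=None):
    """(Weighted) share of baskets that contain every item in `itemset`."""
    if weights is None:
        weights = [1.0] * len(baskets)  # unweighted case: every trainee counts the same
    hit = sum(w for basket, w in zip(baskets, weights) if itemset <= basket)
    return hit / sum(weights)

def pair_rules(baskets, weights=None, min_support=0.0):
    """Single-item rules lhs -> rhs with support, confidence and lift, best lift first."""
    items = sorted(set().union(*baskets))
    rules = []
    for lhs, rhs in permutations(items, 2):
        both = support({lhs, rhs}, baskets, weights)
        if both <= min_support:
            continue  # the two items never (or too rarely) appear together
        confidence = both / support({lhs}, baskets, weights)
        lift = confidence / support({rhs}, baskets, weights)
        rules.append({"lhs": lhs, "rhs": rhs, "support": both,
                      "confidence": confidence, "lift": lift})
    return sorted(rules, key=lambda r: r["lift"], reverse=True)

# Invented toy baskets: one set of "bought" items per trainee.
baskets = [
    {"BEBE_ALCOHOL_FRECUENTE", "LESION_MUSCULAR_ARTICULAR_SI", "HORAS_DUERME_NOCHE_5-6"},
    {"BEBE_ALCOHOL_FRECUENTE", "LESION_MUSCULAR_ARTICULAR_SI"},
    {"HORAS_DUERME_NOCHE_5-6"},
    {"LESION_MUSCULAR_ARTICULAR_SI"},
    set(),
]
for rule in pair_rules(baskets):
    print(f"{rule['lhs']} -> {rule['rhs']}: lift = {rule['lift']:.2f}")
```

A lift above 1 means the two items appear together more often than they would if they were independent, which is why it is a convenient metric for ranking rules.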
Content-Based Recommendation Engine - CBRE
Finally, the startup was interested in exploring innovative ways to optimize and personalize its training programs, and they wondered if it was possible to extract additional knowledge from the available survey response data. This is where the idea of designing and implementing a Content-Based Recommendation Engine (CBRE) came in handy. Yup! You guessed right... I had to go back to TRIFACTA.
Indeed, to tackle this challenge, it was necessary to implement the most complex data preparation recipes so far. An example is illustrated in Figure 11 above. Again, the trainees' identifiers have been anonymized.
First, one-hot encoding was used to transform the categorical variables into numeric 0/1 codes, generating the input required to evaluate a Similarity Matrix (based on the Pearson Correlation Coefficient) to compare the trainees. The Similarity Matrix for this example is a square matrix where each row (and each column) corresponds to a trainee; the diagonal is filled with values equal to 1 (each trainee is identical to himself or herself), and the off-diagonal elements are values between -1 (very different) and 1 (very similar). Calculations were performed once again in the R language framework.
Second, the Similarity Matrix was unpivoted (reshaped from wide to long format) and blended with the trainees' basic data and other relevant information. This refined dataset is the core of the recommender system. The final step in implementing the Content-Based Recommendation Engine consisted of making the results easy to access and use for the end-users (trainers and domain experts). To accomplish that, the results were served as a fully interactive table with controls, and as an easy-to-digest and informative visualization. Figure 12 below depicts the built solution.
In Figure 12 above (trainees' identifiers are anonymized), taking a trainer or an advanced trainee as a reference, and selecting her/his identifier in the REFERENCE dropdown filter, the engine shows, in the table at the bottom, a list (sorted in descending order) of the most similar trainees. Results are also displayed, to the right, in a compelling and convenient visualization. The SLIDER control can be used to easily adjust the upper and lower bounds of the similarity index. Additional relevant data and information were included in the table, too. If required, the user can export the filtered data or save it as a Google Sheet.
The results delivered by the CBRE can be used immediately by the trainers and domain experts to, for example, prescribe customized workout routines, nutrition supplements, etc., that have been tested and refined in the (more advanced) reference group, to the most similar trainees of different levels. TOUCHDOWN!
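The one-hot encoding and Pearson similarity step can be illustrated with a small Python sketch. The project itself did this in R; the code below is only an approximation of the idea, and the trainee identifiers, question names, and answers are hypothetical.

```python
import math

def one_hot(records):
    """Turn categorical answers into 0/1 columns, one per (question, answer) pair."""
    columns = sorted({(q, a) for rec in records.values() for q, a in rec.items()})
    return {tid: [1.0 if rec.get(q) == a else 0.0 for q, a in columns]
            for tid, rec in records.items()}

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return float("nan")  # undefined when one vector is constant
    return cov / (sd_x * sd_y)

def similarity_matrix(encoded):
    """Square matrix of pairwise Pearson similarities, rows and columns in id order."""
    ids = sorted(encoded)
    return ids, [[pearson(encoded[a], encoded[b]) for b in ids] for a in ids]

# Hypothetical survey answers for three anonymized trainees.
records = {
    "T01": {"alcohol": "frequent", "sleep": "5-6", "injury": "yes"},
    "T02": {"alcohol": "frequent", "sleep": "7-8", "injury": "yes"},
    "T03": {"alcohol": "never", "sleep": "7-8", "injury": "no"},
}
ids, matrix = similarity_matrix(one_hot(records))
for tid, row in zip(ids, matrix):
    print(tid, [round(value, 2) for value in row])
```

Each printed row is one trainee compared against all the others, with the value 1 on the diagonal as described above.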
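The unpivot-and-blend step was implemented as a TRIFACTA recipe; the Python sketch below only mimics the reshaping logic on a tiny example. The function names, field names, similarity values, and profile data are hypothetical.

```python
def unpivot(ids, matrix):
    """Reshape a square similarity matrix into long (reference, trainee, similarity) rows."""
    return [{"reference": ids[i], "trainee": ids[j], "similarity": matrix[i][j]}
            for i in range(len(ids)) for j in range(len(ids))]

def blend(pairs, profiles):
    """Attach the compared trainee's basic data to each similarity row (like a left join)."""
    return [{**row, **profiles.get(row["trainee"], {})} for row in pairs]

# Hypothetical similarity matrix and basic data for three anonymized trainees.
ids = ["T01", "T02", "T03"]
matrix = [
    [1.0, 0.6, -0.4],
    [0.6, 1.0, 0.1],
    [-0.4, 0.1, 1.0],
]
profiles = {
    "T01": {"age": 34, "level": "advanced"},
    "T02": {"age": 27, "level": "intermediate"},
    "T03": {"age": 41, "level": "beginner"},
}
dataset = blend(unpivot(ids, matrix), profiles)
for row in dataset[:3]:
    print(row)
```

The long format has one row per (reference, trainee) pair, which is the shape a dashboard tool needs in order to filter by reference and sort by similarity.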
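The behavior of the REFERENCE filter and the SLIDER control can be mimicked with a small function. This is a rough Python sketch of the filtering logic only, not of the dashboard itself; the function name, field names, and rows are hypothetical.

```python
def most_similar(rows, reference, lower=-1.0, upper=1.0):
    """Trainees compared against `reference`, within the similarity bounds, best first."""
    matches = [
        row for row in rows
        if row["reference"] == reference
        and row["trainee"] != reference          # skip the trivial self-match
        and lower <= row["similarity"] <= upper  # the slider bounds
    ]
    return sorted(matches, key=lambda row: row["similarity"], reverse=True)

# Hypothetical long-format rows, as produced by an unpivot step.
rows = [
    {"reference": "T01", "trainee": "T01", "similarity": 1.0},
    {"reference": "T01", "trainee": "T02", "similarity": 0.6},
    {"reference": "T01", "trainee": "T03", "similarity": -0.4},
    {"reference": "T01", "trainee": "T04", "similarity": 0.8},
    {"reference": "T02", "trainee": "T01", "similarity": 0.6},
]
for row in most_similar(rows, "T01", lower=0.5):
    print(row["trainee"], row["similarity"])
```

Raising the lower bound narrows the list to the closest matches, which is what a trainer would do before prescribing a routine tested on the reference trainee.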
Summary
In this post, an End-to-End Data Analytics workflow was presented and discussed. It comprises the construction of improved online surveys using the Google Forms tool; the implementation of complex data preparation recipes in TRIFACTA to transform and reshape the surveys' responses; PCA, Graph, and Weighted Association Rules Mining Analysis carried out by applying algorithms/methods available in the R language framework; and the implementation of a Content-Based Recommendation Engine in Google Looker Studio.
This low-cost analytics solution allowed me to generate the relevant data, reshape and refine it, and visually discover and extract actionable insight that can be used immediately to address the real-life issues of the training-fitness startup, as well as help its trainers and domain experts to deliver recommendations and customized workout routines aimed at enhancing the trainees' health and performance.
In future posts, I'll continue presenting and discussing more relevant real-life use cases. Please stay tuned and don't miss them. And kindly leave your comments below and share. Thank you!