
February 21, 2024

These days, we see a lot of memes and short videos on social media in the style of: “I survived another day without using the Pythagorean theorem!”.

Well, this is not one of those…

This is actually a story about one of the unsung heroes of the world of statistics – the Central Limit Theorem (CLT).

This theorem is like the reliable backbone of statistical analysis, making sense of the chaos in our data-filled world. It’s what allows us to use simple averages (means) to understand complex datasets, and that’s a big deal! Whether we are talking medical studies, market research, or even astronomical observations, the CLT is the quiet powerhouse making sure we can draw reliable conclusions from all that data.

So, what is CLT? Our friend, ChatGPT, says: “The Central Limit Theorem is a fundamental principle in probability theory and statistics. It states that when an adequately large number of samples are taken from any population with a finite level of variance, the mean of all samples will be approximately equal to the mean of the original population. More importantly, the theorem tells us that these sample means will tend to follow a normal distribution (commonly known as a bell curve), regardless of the shape of the population distribution. This convergence to a normal distribution occurs as the sample size becomes larger, making the CLT particularly powerful for predictions and statistical inference, as it allows for the approximation of the sampling distribution of almost any statistic using a normal distribution, provided the sample size is sufficiently large.”

Wow, sounds nice, but it means nothing to most people. Until an actual use case presents itself. Here comes one 😊!

A while back, Inspira Group’s IT job board HelloWorld started searching for an appropriate way to inform its users about the salaries corresponding to different positions, taking into account seniority, location, and even specific companies. The data had been collected over a significant period of time, and it was initially presented in the form of a (min, mean, max) triple.

While the triple contained valuable information for the website users, it didn’t provide insight into how representative a particular sample was of the entire population.

In practice, the CLT works for samples of 30 or more datapoints. In HelloWorld’s case, this means that if there are at least 30 known salaries for a specific combination of the elements considered (position, seniority, location, company), then a relevant conclusion can be drawn about the mean salary of all the people (well, in Serbia 😊) who correspond to that same position, seniority, location, and company, even though not all of their salaries are known. The same logic applies if one or more of the four elements are omitted, for example if we are analyzing a sample based only on position and location.

Anyway, what is the relevant conclusion that can be drawn about the mean salary of the entire population?

For example, let’s consider all the Software Developers in Belgrade. What our friend (ChatGPT) was telling us is that no matter what the distribution of their salaries looks like (the green histogram in the image below), if we took a bunch of samples larger than 30 and calculated the mean salaries for all the samples, those means would form a normal distribution (the blue histogram in the image below).

However, we only have one sample (the group of Software Developers in Belgrade who have shared information about their salaries on the HelloWorld website), and hopefully it is larger than 30. How does the CLT help with that? The implementation is in fact really simple. Based on the mean salary of the sample we have, we can determine the range of values in which, with high certainty, the mean salary of the entire population (all the Software Developers in Belgrade) lies. Depending on the level of certainty we want, we can use one of the following three formulas:
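
In their standard form, and assuming the commonly used 90%, 95%, and 99% certainty levels, the confidence intervals for the population mean are:

x̄ ± 1.645 · s/√n (90% certainty)
x̄ ± 1.960 · s/√n (95% certainty)
x̄ ± 2.576 · s/√n (99% certainty)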

Here, x̄ is the mean salary of the sample, s is the standard deviation calculated from the sample, and n is the size of the sample.

And that is it! That is the whole wisdom. If we have a large enough sample, we calculate its mean, and based on the formulas above, we can say where the true mean of the population probably lies.
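
As a quick illustration, here is a minimal Python sketch of the calculation, using made-up salary numbers rather than real HelloWorld data:

```python
import math
import statistics

# Hypothetical sample of 30 monthly salaries (EUR) for one position/location combination
salaries = [1850, 2100, 1950, 2300, 2050, 1700, 2450, 2200, 1900, 2150,
            2000, 2600, 1800, 2250, 2350, 1950, 2500, 2100, 1750, 2400,
            2050, 2300, 1850, 2200, 2650, 1900, 2450, 2000, 2550, 2150]

n = len(salaries)                 # sample size (should be >= 30 for the CLT to apply)
mean = statistics.mean(salaries)  # sample mean (x̄)
s = statistics.stdev(salaries)    # sample standard deviation (s)

# z-values for the usual certainty levels
for label, z in [("90%", 1.645), ("95%", 1.960), ("99%", 2.576)]:
    margin = z * s / math.sqrt(n)
    print(f"{label} interval: {mean - margin:.0f} to {mean + margin:.0f}")
```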

The data about mean salaries obtained with the help of CLT will soon be available on the HelloWorld website. There is, of course, much more to this story, for example the details about the process of cleaning and transforming the data, handling the factor of time when a certain datapoint occurred, security and so on, but that type of storytelling has more of an intimate café vibe 😊.

For more information and a fantastic explanation of CLT, check out the StatQuest YouTube channel.

Author

Stevan Ostrogonac, Senior AI Engineer at Inspira Group



February 21, 2024

A Paradigm Shift to Data as the Core Product

In the journey toward refining our product, one significant hurdle was the ambitious scope of functionality testing conducted in parallel, which led to complexities in communication and managing expectations. Through collaborative efforts with the business teams, we addressed a multifaceted challenge, streamlining the advertisement administration process to enhance both accuracy and efficiency. The culmination of these endeavors led to the emergence of a distinct and innovative product from Inspira Group’s Data Science team: a comprehensive text processing infrastructure.

This architectural overview and detailed description of the infrastructure’s components were contributed by Borko Rastovic, Inspira Group Senior Data Engineer.

This infrastructure is a complex network of services organized into detailed, flexible pipelines. The steps are closely connected: what happens in one phase affects the next, and all intermediate information is carefully combined into the final outcome. The system can handle many different types of inputs, such as documents, HTML, and images, showing its wide-ranging applicability and thorough approach.

Architectural Overview

The infrastructure’s architecture is underpinned by three fundamental components, each serving a critical role in the processing ecosystem:

  • Extraction, Processing, and Storage Services: These services form the backbone of the system, handling the selection of documents for processing and the extraction of text from various formats such as documents, images, and HTML.
  • Basic Text Processing Service (NLP): This part is equipped with an array of over ten tools, including functionalities for language detection, diacritization, transliteration, normalization, anonymization, sentence splitting, lemmatization, vectorization, and many more. These tools are indispensable across all our pipelines, supplying the necessary foundation for advanced text analysis and processing.
  • LLM Processing Service (LLM Module): Acting as a wrapper for the OpenAI API (or any alternative LLM platform), this module is designed for efficiency and scalability. It offers one prompt per endpoint, integrating seamlessly with the NLP service to enhance functionality. This part is crucial for monitoring costs effectively per endpoint and ensuring the reliability and accuracy of output validation.
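
As an illustration of the one-prompt-per-endpoint pattern described above, such a wrapper could be sketched roughly as follows. This is a hypothetical example using FastAPI and the OpenAI Python client, with an invented endpoint and prompt, not the actual Inspira Group implementation:

```python
from fastapi import FastAPI
from openai import OpenAI
from pydantic import BaseModel

app = FastAPI()
client = OpenAI()  # reads OPENAI_API_KEY from the environment

class AdText(BaseModel):
    text: str

# One fixed prompt per endpoint keeps cost monitoring and output validation simple
DISCRIMINATION_PROMPT = (
    "You are reviewing a job advertisement. Answer only YES or NO: "
    "does the following text contain discriminatory language?\n\n{text}"
)

@app.post("/detect-discrimination")
def detect_discrimination(ad: AdText):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat model could be plugged in here
        messages=[{"role": "user",
                   "content": DISCRIMINATION_PROMPT.format(text=ad.text)}],
    )
    usage = response.usage  # token counts, used for per-endpoint cost monitoring
    return {
        "answer": response.choices[0].message.content.strip(),
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
    }
```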

Bonus: Introducing the Prompt Batch Tester – A Catalyst for Development

Amidst the development of the Job Formatter, a pivotal tool emerged to address a critical need within our development teams: the “Prompt Batch Tester.” This tool was conceived to empower product development teams, enabling them to efficiently test prompts on larger datasets. This capability is crucial for the refinement and optimization of AI-driven features and functionalities across our product suite.

Key Features of the Prompt Batch Tester

  • Prompt Customization: Product managers can craft and define specific prompts, select the right model, and configure its parameters to best suit their project needs.
  • Dataset Upload: The tool supports the upload of extensive datasets, enabling a robust testing environment that closely mimics real-world application scenarios.
  • Testing at Scale: By simulating how a prompt performs on a larger volume of data, the tool offers valuable insight into its practical application and utility (a simplified sketch of such a batch run follows after this list).
  • Cost Forecasting: An essential feature of the Prompt Batch Tester is its ability to project the total operational costs of deploying such a product in a live environment. This aids in strategic planning and budget allocation.
  • Response Time Evaluation: Understanding the response time is critical for assessing the scalability and user experience of the product. This tool supplies precise metrics, helping teams to optimize performance and efficiency.
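
For illustration only, a heavily stripped-down version of such a batch run might look like the sketch below; the prices, model name, and function are hypothetical and not the tool’s actual code:

```python
import csv
import time

from openai import OpenAI

client = OpenAI()

# Hypothetical per-1K-token prices; real prices depend on the chosen model
PRICE_PER_1K_INPUT = 0.0005
PRICE_PER_1K_OUTPUT = 0.0015

def run_batch(prompt_template: str, dataset_path: str, model: str = "gpt-4o-mini"):
    """Run one prompt over every row of a CSV dataset and report cost and latency."""
    total_cost, total_time, rows = 0.0, 0.0, 0
    with open(dataset_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            start = time.perf_counter()
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt_template.format(**row)}],
            )
            total_time += time.perf_counter() - start
            total_cost += (response.usage.prompt_tokens / 1000) * PRICE_PER_1K_INPUT
            total_cost += (response.usage.completion_tokens / 1000) * PRICE_PER_1K_OUTPUT
            rows += 1
    print(f"{rows} rows processed")
    print(f"Estimated cost: ${total_cost:.4f}")
    print(f"Average response time: {total_time / max(rows, 1):.2f} s")

# Example call (hypothetical dataset with an "ad_text" column):
# run_batch("Extract the seniority level from this job ad:\n\n{ad_text}", "ads_sample.csv")
```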

Figure 1 – “Prompt batch tester” user interface

Conclusion

Our innovative journey at Inspira Group has led us from the first prototypes to a robust AI-driven text processing infrastructure, revolutionizing the efficiency and accuracy of job ad moderation. Through the development of advanced tools like the “Prompt Batch Tester,” we have embraced challenges, iterated solutions, and unlocked new potential.

In collaboration with the Infostud team, we’ve also achieved significant milestones beyond job ad moderation. Our development of an AI-powered CV parser has simplified the process for job seekers to craft distinctive profiles on our platforms. Moreover, we’re demystifying the interview preparation process with an innovative AI tool, making daunting tasks more manageable and less stressful for candidates.

As we look ahead, we are excited to collaborate with forward-thinkers and innovators. Together, let’s continue to break new ground and redefine what is possible in the evolving landscape of technology and work.

Author

Srđan Mijušković, Senior Product Manager at Inspira Group



February 21, 2024

Opening Insights: This article is structured in two main parts: the first delves into the nuances of product development, and the second outlines the architectural solutions and the envisioned future of the product. Given the comprehensive nature of our discussion, we encourage you to peruse the entire article for a thorough understanding.

In the following sections, I will unfold the narrative of our product’s evolution, tracing the path from early-stage prototyping to the polished end-product. The introduction of the GPT-3 API in the early months of 2023 heralded a shift towards enhancing the efficiency of job ad moderation on our platform, poslovi.infostud.com. Here, we manage the publication and moderation of hundreds of job advertisements each week, a process traditionally performed manually. Although meticulous, this approach is susceptible to inaccuracies and delays.

Confronted with many moderation guidelines, both explicit and implicit, alongside the transformative potential of generative AI models, we went beyond rigidly set KPIs and objectives. Our approach was characterized by an openness to experimentation, acknowledging the risk of failure while striving to maximize the likelihood of success through rapid prototyping. This method encompasses several key phases:

  • Prototyping: Crafting high-fidelity prototypes that deliver practical solutions.
  • Feedback: Gathering and interpreting feedback from users to inform development.
  • Improvements: Refining the prototype in response to insights gained from user feedback.

While our aims were intentionally not defined as SMART goals, they were nonetheless focused and ambitious:

  • Speed: Accelerating the moderation process for each job advertisement.
  • Quality: Elevating the overall quality of job advertisements post-moderation.

This introduction sets the stage for a deeper exploration into the iterative development process and architectural innovation, underscoring our commitment to pushing the boundaries of what is possible with AI in the realm of job ad moderation.

V1 Prototype: Foundational Steps and Preliminary Concept Evaluation

Work on our first prototype, designated V1, was driven by the ambition to evaluate a foundational concept. The ultimate aim was to iteratively refine this concept into a robust final product. Our preliminary strategy aimed to equip administrators with insights into potentially contentious sections of job advertisements while concurrently extracting a diverse array of metadata from these ads. For the implementation of this prototype, we identified Streamlit as the optimal platform, given its flexibility and user-friendly interface.

The prototype was designed to detect:

  • The existence of any form of discrimination within the advertisement’s text
  • The inclusion of contact details such as email addresses, salary information, or phone numbers
  • The presence of grammatical inaccuracies
  • Whether the advertisement’s text inadvertently contained multiple job listings

The prototype’s functionality also included the ability to extract:

  • The seniority level of the advertised position
  • Relevant IT-related tags
  • The geographical location of the job, pinpointing specific cities
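
Concretely, a single structured prompt covering these checks and extractions could look something like the hypothetical sketch below; the field names are illustrative and not the prototype’s actual schema:

```python
import json

# Hypothetical single prompt covering the V1 checks and extractions described above
V1_PROMPT = """Analyse the following job advertisement and answer in JSON with the keys:
  "discrimination" (true/false), "contact_details" (true/false),
  "grammar_issues" (true/false), "multiple_listings" (true/false),
  "seniority" (string), "it_tags" (list of strings), "city" (string).

Advertisement:
{ad_text}"""

# Example of what a model response might look like, parsed into a Python dict
sample_response = """{
  "discrimination": false, "contact_details": true, "grammar_issues": true,
  "multiple_listings": false, "seniority": "Medior",
  "it_tags": ["Python", "Django"], "city": "Belgrade"
}"""

result = json.loads(sample_response)
if result["contact_details"]:
    print("Flag for the administrator: the ad contains contact details or salary info.")
```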

The testing phase of the V1 prototype yielded encouraging results, demonstrating a high degree of accuracy in its operational performance. This initial success laid a solid foundation for later development stages, confirming the viability of our approach and the potential for further refinement and expansion of the prototype’s capabilities.

V2 Prototype: Broadening Functionality and Elevating User Experience

Building upon the insights gained from the first prototype, we embarked on the development of the second iteration, prototype version 2 (V2), intending to significantly enhance both the content quality and the visual appeal of advertisements. This ambition was realized through focused improvements in the user interface (UI) and user experience (UX), ensuring a more intuitive and efficient interaction for administrators.

UI and UX Improvements

  • Introduction of a feature enabling the live streaming of processed data into the application, without the need for manual text input.

Functional Enhancements

  • Acknowledging the considerable time administrators dedicate to formatting HTML versions of job ads, V2 aimed to automate this process. By employing a set of predefined rules, the prototype could transform ad text into HTML. This included the automatic identification of job descriptions, candidate profiles, benefits, etc. and their allocation to specified styles. For instance, the system was designed to recognize essential qualifications and skills from the ad text, correct any grammatical errors, remove discriminatory language, apply H2 headers in blue, and format benefits as bullet points.
  • To address issues such as grammatical mistakes and discriminatory language in both Serbian and English, we broadened the prototype’s scope of functionality.
  • A concerted effort was made to refine the ETL (Extract, Transform, Load) process, acknowledging the varied formats in which employers post job advertisements on poslovi.infostud.com, including PDF, DOCX, images, and others. We intensified our efforts in processing and converting these formats into text in near real-time, before sending the text on for further processing (a simplified sketch of this kind of format-dependent extraction follows after this list). The next sections will delve deeper into how this project catalyzed further investment in text processing infrastructure development.
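
As an illustration only, using common open-source libraries rather than necessarily the ones in our pipeline, such format-dependent extraction can be sketched as follows:

```python
from pathlib import Path

from bs4 import BeautifulSoup   # pip install beautifulsoup4
from docx import Document       # pip install python-docx
from pypdf import PdfReader     # pip install pypdf
from PIL import Image           # pip install pillow
import pytesseract              # pip install pytesseract (requires the Tesseract OCR binary)

def extract_text(path: str) -> str:
    """Convert an uploaded job ad into plain text, depending on its file format."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
    if suffix == ".docx":
        return "\n".join(p.text for p in Document(path).paragraphs)
    if suffix in {".html", ".htm"}:
        return BeautifulSoup(Path(path).read_text(encoding="utf-8"),
                             "html.parser").get_text(" ")
    if suffix in {".png", ".jpg", ".jpeg"}:
        # assumes the Serbian and English Tesseract language packs are installed
        return pytesseract.image_to_string(Image.open(path), lang="srp+eng")
    raise ValueError(f"Unsupported format: {suffix}")
```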

This phase of development aimed not only at enhancing the practical aspects of the platform but also at enriching the overall experience for both administrators and end-users, setting a precedent for continuous improvement and innovation.

V2.1 Prototype: Refinement and Advanced Change Detection

In response to the administrative challenge of spotting textual modifications in job advertisements through mere visual inspection, we introduced an innovative feature in version 2.1 aimed at enabling precise monitoring of changes. This was achieved through the integration of a diff algorithm, establishing a robust foundation and setting clear expectations for the final, production-ready iteration of our product.
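
To illustrate the underlying idea (this is not our exact implementation), Python’s standard difflib module can already produce this kind of word-level change report between a submitted ad and its moderated version:

```python
import difflib

original = "We are looking for a young and energetic developer to join our team."
revised = "We are looking for an energetic developer to join our team."

# Word-level diff between the submitted text and the moderated text
for token in difflib.ndiff(original.split(), revised.split()):
    if token.startswith(("+ ", "- ")):
        print(token)  # e.g. "- young" highlights wording that was removed
```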

In the evaluation of version 2, we decided to stop the automatic formatting of job ads for two main reasons:

  • Employer Branding Integrity: Automatic formatting could alter the original layout provided by employers, affecting their brand presentation. By dropping it, we ensure the employer’s intended formatting is preserved.
  • Resource Efficiency: The benefits of automatic formatting did not justify the significant resources needed for its maintenance and improvement.

Nonetheless, we kept and further developed all other functionalities from version 2.1 for inclusion in version 3, poised to be the definitive version of our product.

The enhanced product enables the direct annotation of sections of text within the administrative editor that may present issues. Administrators are empowered with the capability to either approve or dismiss suggested modifications. Internally, we affectionately refer to this feature as “Grammarly on steroids 😊,” a nod to its enhanced content-editing prowess tailored to our specific application.

Figure 1 – Implemented solution

As we close the first chapter of our adventure into the world of AI-powered job ad moderation at Inspira Group, we find ourselves on the brink of a deeper exploration. Our focus shifts to the core of our innovation journey: data as our primary product. This chapter has shed light on our progress in improving moderation processes and hints at a major shift in direction.

Read the continuation of our story here.

Author:

Srđan Mijušković, Senior Product Manager at Inspira Group