Building Trustworthy Big Data Algorithms for Insightful Analysis

Reams of our data sit in large databases of unstructured text. Finding insights among emails, text documents, and websites is extremely difficult unless we can search, characterize, and classify their text in a meaningful way.

One of the leading big data algorithms for finding related topics within unstructured text (an area called topic modeling) is latent Dirichlet allocation (LDA). But when Northwestern University professor Luis Amaral set out to test LDA, he found that it was neither as accurate nor as reproducible as a leading topic modeling algorithm should be.

Drawing on his network analysis background, Amaral, professor of chemical and biological engineering in Northwestern's McCormick School of Engineering and Applied Science, developed a new topic modeling algorithm that has shown very high accuracy and reproducibility in tests. His results, co-authored with Konrad Kording, associate professor of physical medicine and rehabilitation, physiology, and applied mathematics at Northwestern, were published Jan. 29 in Physical Review X.

Topic modeling algorithms take unstructured text and find a set of topics that can be used to describe each document in the set. They are the workhorses of big data science, used as the foundation for recommendation systems, spam filtering, and digital image processing. The LDA topic modeling algorithm was developed in 2003 and has been widely used for academic research and for commercial applications, like search engines.

When Amaral explored how LDA worked, he found that the algorithm produced different results each time it was run on the same set of data, and that those results were often inaccurate. Amaral and his group tested LDA by running it on documents they created that were written in English, French, Spanish, and other languages. Using separate languages let them prevent text overlap among documents.

“In this simple case, the algorithm should be able to perform at 100 percent accuracy and reproducibility,” he said. But when LDA was used, it separated these documents into similar groups with only 90 percent accuracy and 80 percent reproducibility. “While these numbers may appear to be good, they are actually very poor, since they are for an exceedingly easy case,” Amaral said.
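The article does not say how reproducibility was scored. One common way to compare two runs of a grouping algorithm is pairwise agreement (the Rand index): the fraction of document pairs that both runs either place together or keep apart. The sketch below assumes that measure; the function name and the sample labels are hypothetical and are not taken from the study.

```python
from itertools import combinations


def pairwise_agreement(labels_a, labels_b):
    """Fraction of document pairs on which two runs agree (Rand index).

    labels_a and labels_b give the group assigned to each document by
    two separate runs. Group names need not match between runs; only
    the together/apart decision for each pair is compared.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both runs must label the same documents")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    agreeing = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreeing / len(pairs)


# Hypothetical group labels for four documents from three runs.
run_1 = [0, 0, 1, 1]
run_2 = [1, 1, 0, 0]  # same grouping as run_1, group names swapped
run_3 = [0, 0, 0, 1]  # a different grouping

print(pairwise_agreement(run_1, run_2))
print(pairwise_agreement(run_1, run_3))
```

A perfectly reproducible algorithm would score 1.0 for every pair of runs on the same data, which is the standard Amaral describes for the multilingual test.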

To create a better algorithm, Amaral took a network approach. The result, called TopicMapping, begins by preprocessing data to replace each word with its stem (so “star” and “stars” would be considered the same word). It then builds a network of connected words and identifies “communities” of related words (just as one could look for communities of people on Facebook). The words within a given community define a topic.
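The steps described above can be illustrated with a minimal Python sketch. It is not the TopicMapping implementation: the toy stemmer, the rule that links words appearing in the same document, and the use of connected components as a stand-in for community detection are all assumptions made for illustration, as are the function names and sample documents.

```python
from collections import defaultdict
from itertools import combinations


def stem(word):
    """Toy stemmer (assumption): lowercase and strip one trailing 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def build_word_network(documents):
    """Link every pair of stems that appear in the same document."""
    graph = defaultdict(set)
    for text in documents:
        stems = {stem(w) for w in text.split()}
        for s in stems:
            graph[s]  # register the word even if it has no neighbors
        for a, b in combinations(sorted(stems), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def find_communities(graph):
    """Stand-in for community detection: connected components."""
    seen, communities = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        communities.append(component)
    return communities


# Hypothetical documents in two languages with no shared words.
docs = [
    "stars shine bright",
    "the star burns",
    "les étoiles brillent",
    "une étoile brille",
]
communities = find_communities(build_word_network(docs))
for topic in communities:
    print(sorted(topic))
```

With no words shared across languages, the word network splits into one component per language, which mirrors why the multilingual test is such an easy case. Real text needs a proper stemmer and a community-detection method that can separate densely connected groups inside a single connected network.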

TopicMapping separated the documents perfectly by language and reproduced its results from run to run. It also had high accuracy and reproducibility when separating 23,000 scientific papers and 1.2 million Wikipedia articles by topic.

These results show the need for more testing of big data algorithms and more research into making them more accurate and reproducible, Amaral said.

“Companies that make products must show that their products work,” he said. “They must be certified. There is no such case for algorithms. We have a lot of uninformed consumers of big data algorithms that are using tools that haven't been tested for reproducibility and accuracy.”

