Data Analysis Similar to Puzzle Solving
Data analysis typically deals with large quantities of information that data sources generate on a continuous basis. Companies can process the information as streaming data or collect it and process it in batches. If they have the data-processing capability, companies try to achieve continuous, real-time analysis. An informal investigation of group behavior in solving picture puzzles suggests that a combination of streaming and batch data processing may be the most effective approach.
According to InfoWorld, IBM scientist Jeff Jonas, who is responsible for entity analytics, discussed observations of informal puzzle-solving tests at the GigaOM conference in New York. He looked at how groups of friends and family handled solving the kinds of picture puzzles that have thousands of pieces. He found that fitting pieces together to form small entities and then guessing at the whole picture to place the remaining pieces was similar to data analysis. The observations yielded what he calls "profound effects that we could bring to big data."
He noticed that puzzle solvers were often able to make correct guesses about the whole picture with remarkably few pieces in place. He found that the most successful approach was to assemble small areas of the puzzle and study them in depth to determine the probable nature of their surroundings.
This, he concluded, is analogous to looking at a data point and studying the surrounding data. Data is often unstructured and superficially similar, but collecting it in a batch can yield clues to the characteristics of the larger picture. Such a strategy is especially useful for companies processing mid-range data volumes: their IT departments often face constraints on processing capacity, but they can tag data that might be of interest. Processing such data in batches for in-depth analysis can yield valuable information while consuming fewer resources.
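The tag-and-defer idea can be sketched in a few lines. This is a minimal illustration, not any specific IBM tool: the record format, the `score` field, and the `is_interesting` rule are all hypothetical stand-ins for whatever a real pipeline would use.

```python
from collections import deque

def is_interesting(record):
    """Hypothetical tagging rule: flag records whose score crosses a threshold."""
    return record["score"] >= 0.8

def stream_and_tag(records, batch_queue):
    """Single streaming pass: keep cheap, real-time bookkeeping for every
    record, and set tagged records aside for later in-depth batch analysis."""
    count = 0
    for record in records:
        count += 1                      # cheap work done on the live stream
        if is_interesting(record):
            batch_queue.append(record)  # costly analysis is deferred
    return count

records = [{"id": i, "score": s} for i, s in enumerate([0.2, 0.9, 0.5, 0.85])]
queue = deque()
seen = stream_and_tag(records, queue)
# 'seen' counts every record; 'queue' holds only the tagged ones,
# so the expensive batch step touches far fewer data points.
```

The design point is simply that the streaming pass stays cheap enough to run continuously, while only the tagged subset consumes batch-processing resources.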
To make his own study more realistic, Jonas removed pieces from some puzzles and bought extra copies to add duplicate pieces to others, simulating "low-quality data." The original approach still worked.
Real data often contains multiple copies of the same bits of information and may be missing many points of the complete set. Processing such data as a stream identifies some characteristics but misses others, and focusing on too few data points may actually give an incorrect overall impression. The observation of successful puzzle solvers suggests that the most effective approach is to assemble as many pieces as fit together into a small unit and examine that collection in detail to infer what the nearby pieces might look like.
That approach corresponds to analyzing streaming data for some characteristics while also collecting data in batches for in-depth analysis. The dual path may surface characteristics that either method alone would miss, and the batch processing is less costly and can help determine what the whole collection of data points might look like.
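The dual path can be illustrated with a toy example, assuming simple numeric readings (the article describes the idea, not a concrete implementation). The streaming pass produces a cheap one-pass summary; the batch pass takes a deeper look at the same data, including the duplicate "extra puzzle pieces."

```python
import statistics

def streaming_pass(readings):
    """Cheap one-pass summary, available in near real time: a running mean."""
    total, count = 0.0, 0
    for r in readings:
        total += r
        count += 1
    return total / count if count else 0.0

def batch_pass(readings):
    """Deeper offline look at the collected batch: deduplicate repeated
    readings and study the spread, not just the center."""
    unique = sorted(set(readings))
    return {
        "unique_values": unique,
        "median": statistics.median(unique),
        "spread": max(unique) - min(unique),
    }

data = [10.0, 12.0, 12.0, 11.0, 50.0]  # duplicates simulate low-quality data
quick = streaming_pass(data)           # fast, but skewed by duplicates and outliers
deep = batch_pass(data)                # batch view reveals duplicates and the outlier
```

Here the streaming mean is pulled toward the outlying reading, while the batch pass exposes both the duplicate values and the unusually wide spread, mirroring how the two paths complement each other.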
This post was written as part of the IBM for Midsize Business program, which provides midsize businesses with the tools, expertise and solutions they need to become engines of a smarter planet.