Cluster analysis

Cluster analysis is a means of examining and evaluating information by combining it into groups based on how similar the individual facts are to one another. The groups, or clusters, of related items that are formed help the analyst understand the information. Cluster analysis can also be used to summarize large amounts of information in service of another purpose, such as categorizing information to find the most relevant facts. This form of analysis has applications in biology, the health sciences, psychology, sociology, anthropology, and statistics. It also has applications in data mining, other forms of information retrieval and analysis, and pattern recognition.

Background

The basics of cluster analysis are so commonly used they seem to be an inherent human behavior. Even a small child playing with a bin of toys might pull out all the vehicles or all the dolls. This is a form of clustering because the items have been sorted based on what they have in common.

As humankind grew in scientific understanding, the natural world was increasingly categorized into clusters based on the ways in which things were similar. Early scientists created elaborate plant and animal kingdoms that clustered various life-forms into groups based on the ways in which they were the same and different. Rocks and minerals were classified in a similar way.

When computers came into more common use in the 1960s and 1970s, clustering became an important tool in all forms of data analysis. At the same time, computers made it increasingly easy to categorize things by their similarities: various facts about different items could be entered into the computer, and the computer could be programmed to sort the items by the ways they were the same and the ways they were different.

Among the more influential computer scientists in the field of cluster analysis were Roger Needham and his wife, Karen Spärck Jones. Much of their work in the 1960s focused on information retrieval and how clustering factored into the ways computers categorized and retrieved information. Both published extensively on the topic.

Overview

Clustering is a way of grouping together similar things that have a lot in common with one another. This holds true whether the clusters are made up of bits of information, types of animals or plants, kinds of diseases, geographic features, people, or anything else that can be compared and contrasted with something else. For example, zoos usually put all the primates together in one section, grocery stores shelve all canned vegetables in one aisle, health insurance companies categorize diseases by diagnosis groups, and employers group office workers together in departments according to their function.

Grouping things together in clusters like this works best when the groups have a high degree of similarity within the group and a low degree of similarity with things outside the group. For example, coffee, tea, and soda are all beverages, but they usually are not found in the same aisle in the grocery store. Coffee and tea will be found together, and soda will most likely be found elsewhere. This is because although they are all beverages, there are some key differences, including the fact that coffee and tea both need to be brewed while soda is ready to drink. Coffee and tea are more like each other than they are like soda.
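The within-group versus between-group contrast can be made concrete with a small sketch. The feature vectors below are illustrative assumptions (simple yes/no properties of each beverage), not data from the source; similarity is measured by counting the features on which two items differ:

```python
# Hypothetical feature vectors: (needs brewing, served hot, carbonated),
# encoded as 0/1 -- illustrative values chosen for this example.
items = {
    "coffee": (1, 1, 0),
    "tea":    (1, 1, 0),
    "soda":   (0, 0, 1),
}

def distance(a, b):
    """Count the features on which two items differ (Hamming distance)."""
    return sum(x != y for x, y in zip(a, b))

print(distance(items["coffee"], items["tea"]))   # within-group: 0
print(distance(items["coffee"], items["soda"]))  # between-group: 3
```

A good clustering keeps the first number small (high similarity inside the group) and the second number large (low similarity outside it).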

To some degree, the appropriateness of a cluster is subjective. It also depends on which qualities of the objects one is attempting to analyze. For instance, grouping coffee, tea, and soda together may not be the best way of clustering for a grocery store because of the differences in how they are prepared. However, if one were clustering information to analyze overall beverage consumption in a fast-food chain, it would be appropriate to cluster them together. The key to clustering is to group the items based on their relationship to one another in the context in which the analysis takes place.

There are a number of ways to approach forming clusters for analysis. Two of the most common types of clusters are partitional and hierarchical. A partitional cluster is one that does not overlap with another cluster; it is sometimes referred to as unnested. Military ranks provide an example of a partitional cluster: soldiers are members of only one rank at a time. A hierarchical cluster can overlap with another, meaning an individual item may belong to more than one cluster at once. These are referred to as nested clusters. Musicians in a rock band provide an example: each may sing and play an instrument at the same time, so they belong to two clusters at once.
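The distinction between unnested and nested membership can be sketched in a few lines of Python. The names and cluster labels below are invented for illustration:

```python
# Partitional (unnested): each item belongs to exactly one cluster.
ranks = {"sergeant": {"Ava"}, "captain": {"Ben"}}

# Overlapping (nested): an item may belong to several clusters at once.
band_roles = {"singers": {"Mia", "Leo"}, "guitarists": {"Leo"}}

def memberships(item, clusters):
    """Return the name of every cluster an item belongs to."""
    return {name for name, members in clusters.items() if item in members}

print(sorted(memberships("Ava", ranks)))       # exactly one cluster
print(sorted(memberships("Leo", band_roles)))  # two clusters -- overlap
```

In a partitional scheme, `memberships` always returns a single cluster; in an overlapping scheme it may return several.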

Clusters can also be labeled in a number of other ways, such as exclusive versus overlapping, or may be considered "fuzzy." A fuzzy cluster is one in which the object could belong to every available cluster but belongs to some more than others. This is usually defined by giving each item a weight depending on how well it fits into the category. For example, a physician who has a heart attack while working at his hospital is an employee, a physician, and a patient simultaneously, but each role is weighted differently depending on the circumstances.
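Fuzzy membership weights can be represented directly as numbers. The weights below are hypothetical values for the physician example, chosen so that the "patient" role dominates in the moment of the emergency:

```python
# Hypothetical membership weights (summing to 1.0) for one person who
# belongs to three clusters at once, but to some more than others.
weights = {"patient": 0.7, "physician": 0.2, "employee": 0.1}

def dominant_cluster(memberships):
    """Return the cluster with the highest membership weight."""
    return max(memberships, key=memberships.get)

print(dominant_cluster(weights))  # patient
```

Changing the circumstances means changing the weights; the person's set of possible clusters stays the same.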

The K-means algorithm is one way of partitioning bits of data into clusters. K clusters are formed by choosing k random data points that serve as initial cluster centers, known as centroids. Each remaining data point is then assigned to the centroid it most closely resembles, and each centroid is recalculated as the average of the points assigned to it. This process is repeated until the items settle into the assigned number of groups based on the ways in which the data points best fit together.
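The procedure described above can be sketched in a few lines of Python. This is a minimal one-dimensional version; the data values, fixed seed, and iteration count are illustrative assumptions:

```python
import random

def kmeans(points, k, iterations=10, seed=0):
    """Minimal 1-D K-means: pick k random points as the initial centroids,
    then alternate between assigning each point to its nearest centroid and
    recomputing each centroid as the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if a cluster is empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters

data = [1.0, 1.2, 0.8, 10.0, 10.5, 9.9]
print(sorted(kmeans(data, 2), key=min))  # two natural groups: low and high
```

Even with randomly chosen starting centroids, the repeated assign-and-average steps pull the centroids toward the natural centers of the two groups.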

Another important factor in cluster analysis is the outlier. An outlier is something that is noticeably different from the other objects in a cluster. For example, someone analyzing test scores for a group of students may find that while most of the scores fall in a certain range, some are significantly higher or lower than the rest. Paying attention to outliers is important because they can indicate that mistakes were made. They can also point to important facts or information not previously known.
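The test-score example can be sketched with one common rule of thumb: flag any score far from the group's mean. The scores and the two-standard-deviation threshold are illustrative assumptions, not the only way to define an outlier:

```python
import statistics

def outliers(scores, threshold=2.0):
    """Flag scores more than `threshold` standard deviations from the mean --
    a simple rule of thumb, not the only definition of an outlier."""
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)
    return [s for s in scores if abs(s - mean) > threshold * stdev]

test_scores = [72, 75, 78, 74, 76, 73, 21]
print(outliers(test_scores))  # -> [21]
```

The flagged score might be a data-entry mistake, or it might be a genuinely struggling student; either way, it deserves a closer look.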
