Convolutional neural network (CNN, or ConvNet)

A convolutional neural network (CNN, or ConvNet) is a type of artificial neural network used to interpret and recognize images. Humans and many other animals perform this task automatically using their eyes and the cells in their brains, but computers must use artificial means of deciphering images and determining their meaning. Using complex mathematical algorithms and powerful computers, engineers have created CNN systems that can correctly categorize the large majority of images presented to them. The technology developed gradually but reached a high level of speed and accuracy in the early 2010s, after which many major online companies adopted CNNs for a variety of purposes.

Background

The study of CNNs can be traced to research in biology and neuroscience (the study of the brain and nervous system) conducted mainly during the twentieth century. Scientists began to identify the means by which the eyes transfer information to the brain and the brain interprets visual stimuli to create understanding and useful information.

In 1962, neurophysiologists David Hubel and Torsten Wiesel performed experiments that revealed the workings of some of the brain cells responsible for interpreting visual input, showing in particular that specific parts of the visual cortex are activated for this task. Their experiments involved showing various visual patterns to cats while sensitive machinery monitored the electrical activity of the animals' brains. The readings were precise enough that the researchers could identify particular clusters of cells activated by different kinds of images, allowing them to map the operations of the visual cortex.

Hubel and Wiesel discovered that some of these cells responded only when the visual image contained particular kinds of shapes or edges. For example, some cells might respond only to horizontal edges in the image being viewed, while others might be triggered only by vertical edges. Subsequent studies showed that these highly specialized cells were arranged so that they could work together to process visual information and determine its meaning.

The two scientists helped to establish the Harvard Medical School Department of Neurobiology and made some of its most lasting contributions to biology. In 1981, they shared the Nobel Prize in Physiology or Medicine for their work on the visual powers of the brain.

The findings of the 1962 research helped later scientists grasp the idea that different parts of the visual system are responsible for different perceptive tasks, with all parts working together to process complex stimuli such as shape, size, dimension, and color. Over the following decades, programmers exploring the capabilities of computers began borrowing from this wealth of information and speculation in attempts to create machines that could mimic the visual interpretive powers of living things.

Overview

The main idea behind CNNs is image classification. Simply put, this is the ability to process an image and decide what it represents, such as a mountain or an apple. In humans and many other animals, this ability is inborn; even very young children can interpret visual information thanks to the remarkable neural abilities of their brains.

Visual information passes from the eyes into the brain, where complex layers of cells process the information and find patterns the person can understand, such as shapes and colors. From there, the information is cross-referenced with input from the other senses and with prior knowledge, allowing the person to fully grasp what he or she sees. Without this ability, people would be unable to understand their surroundings and would lack many of the abilities and accomplishments that depend on sight.

Although image classification is a basic human ability that many people take completely for granted, the science behind it is complex and sophisticated. During the rise of computer technology in the second half of the twentieth century, a growing number of scientists and engineers began pondering the uses of computers in image processing. They quickly found that understanding the complexities of human image classification, and then transferring that ability to a machine, was an extremely difficult task.

Computers do not see images the way humans see them, as visual representations of shapes, colors, and dimensions. Rather, computers can only process numeric data, meaning that the images humans see are interpreted by a computer as long lists of numbers. Each number corresponds to a pixel, a tiny point that will be shown on the computer monitor. Most color images use three values per pixel, for red, green, and blue; each pixel receives a red, a green, and a blue number indicating the strength of each color at that point. When the pixels are rendered at the correct strengths and the three color channels are layered together, they create the visual image of the type recognized by humans. This basic technology was well established by the 1980s and continued to improve in the following years. However, the problem of image classification remained. How could programmers use computers' unique method of "seeing" to allow these machines to analyze and "understand" exactly what they are seeing?
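To make this concrete, here is a minimal sketch in Python (assuming the NumPy library; the pixel values are invented purely for illustration) of how a tiny two-pixel-by-two-pixel color image looks to a computer:

```python
import numpy as np

# A color image, as a computer stores it: a grid of pixels, each holding
# three numbers for the strength of red, green, and blue (0 = none,
# 255 = full strength). This tiny 2x2 image is purely illustrative.
image = np.array([
    [[255, 0, 0], [0, 255, 0]],      # top row: a red pixel, a green pixel
    [[0, 0, 255], [255, 255, 255]],  # bottom row: a blue pixel, a white pixel
], dtype=np.uint8)

print(image.shape)  # (2, 2, 3): height, width, and three color channels
print(image[0, 0])  # [255 0 0] -- the "list of numbers" for the red pixel
```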

The answer came in the form of the CNN, a process in which computers analyze an image through a series of layers. Using mathematical operations called convolutions, each layer applies filters that detect one kind of feature in the image; for example, one layer might search for steep curves, while another might find circles. These findings are then passed through the rest of the neural network, and the results are combined with patterns the network has already learned from many kinds of images. This allows the computer to make an educated guess at what the image represents.
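The filtering step can be sketched in a few lines of Python with NumPy. In this illustrative example, the image and filter values are invented for demonstration (real CNNs learn their filter values from training data); it slides two small edge-detecting filters over a tiny grayscale image that contains one horizontal edge:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a small kernel over a grayscale image and record, at each
    position, the sum of elementwise products: a single convolutional
    filter in miniature (no padding, stride 1)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A tiny grayscale image: dark on top, bright on the bottom,
# so it contains one horizontal edge and no vertical edges.
image = np.array([
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [9, 9, 9, 9],
    [9, 9, 9, 9],
], dtype=float)

# Sobel-style filters: one responds to horizontal edges, one to vertical.
horizontal = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)
vertical = horizontal.T

print(convolve2d(image, horizontal))  # large values: the edge is detected
print(convolve2d(image, vertical))    # all zeros: no vertical edge here
```

The horizontal filter produces large values where the dark and bright regions meet, while the vertical filter produces only zeros, much as Hubel and Wiesel's specialized cells respond to one edge orientation but not another.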

Following the groundbreaking Hubel and Wiesel findings, the field of CNNs grew and developed through the end of the twentieth century and into the twenty-first. Even the most advanced computers with cutting-edge programming still struggled to interpret images, however, and continually fell short of even basic human capabilities. By the beginning of the 2010s, the best computer systems could correctly classify only about 74 percent of the varied images shown to them. However, the early years of that decade saw great improvements.

Around 2011, Alex Krizhevsky, then a Ukrainian-Canadian student at the University of Toronto, designed his own artificial neural network. Krizhevsky had been inspired by earlier research into learning algorithms by his professor and advisor, Geoff Hinton. Hinton's findings had been used to produce so-called restricted Boltzmann machines that were successful in interpreting many images. Krizhevsky believed that modifying that system, specifically by combining it into a larger network, would greatly increase the speed of image interpretation and ultimately improve its results. He added more layers to the neural network, thus "deepening" it.
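As a rough illustration of what "deepening" means, the following sketch (written with the PyTorch library; the layer counts and sizes are arbitrary choices for this example, not Krizhevsky's actual design) contrasts a shallow convolutional network with a deeper one that simply stacks more layers:

```python
import torch
import torch.nn as nn

# A shallow network: one convolutional layer, then a classifier that
# assigns the image to one of 10 hypothetical categories.
shallow_net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
)

# A "deeper" network: more convolutional layers stacked in sequence,
# so later layers can combine the simple features (edges, curves)
# detected by earlier ones into more complex features.
deeper_net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10),
)

x = torch.randn(1, 3, 32, 32)  # one fake 32x32 color image
print(deeper_net(x).shape)     # torch.Size([1, 10]): one score per category
```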

Fellow student Ilya Sutskever introduced Krizhevsky to the ImageNet competition, a contest among computer-vision researchers built around a database of more than a million images. Participants are challenged to use their own systems to analyze and interpret the images as quickly and accurately as possible.

Traditionally, systems might require weeks or longer to work through the ImageNet visual archives, but Krizhevsky's innovative system powered through the images in about five days. The speed was valuable in itself, and it also allowed the system to process more information and, through "deep learning," draw conclusions from its findings, making it even faster and "smarter" over time. Krizhevsky and his team won the event with a system that correctly classified about 85 percent of the images presented to it, a result nearly 11 percentage points better than the runner-up's.

The ImageNet performance helped trigger a reawakening of interest in CNN technology. Around the same time, social media was rising in use and popularity around the globe. Leading social networking site Facebook began experimenting with CNN systems that would allow its computers to automatically identify people shown in users' photos, who could then be "tagged" and added to the poster's network of friends and associates, as well as to Facebook's own stores of user information.

Even as Facebook was implementing these CNN systems, similar programs became major additions to the digital services of many other companies. For example, online retailer Amazon used the technology to identify products similar to those a customer had already purchased, and therefore of potential interest to that customer. Amazon's CNN technology allowed it to fine-tune its in-system advertising and provide more accurate automatic product recommendations for its users.

Meanwhile, Google also put CNN systems to use. After hiring Krizhevsky, Sutskever, and Hinton as project leaders and overseers, Google brought an increased focus on artificial intelligence and deep learning to its search-engine offerings. A prime example is its reverse image search feature, which involves automatic surveying and filtering of the billions of images accessed by Google's search engines. Users can upload their own image, or provide a link to an image already online, and Google's visual-recognition software will scan its vast archives for the most similar-looking images or attempt to identify the origin or content of the submitted image.
