Reconstruction of Body Height from Images: Does It Still Make Sense?


Computer vision constitutes a dynamic and intricate field within computer science, where the aim is to imbue computers with the capacity to interpret and understand the visual environment, much like the human visual system does. This discipline is at the intersection of algorithmic innovation, machine learning, and artificial intelligence, enabling the analysis of images, videos, and other visual data sources to extract meaningful information or to initiate responsive actions.

Image Processing is foundational to computer vision. Here, raw visual data undergoes multiple transformations to enhance quality and interpretability. Noise reduction is achieved through techniques like Gaussian blur or median filtering, where each pixel in the image is altered based on its neighbors to reduce spurious variations that could confuse later stages of analysis. Contrast adjustment might involve histogram equalization, which redistributes the pixel intensities to achieve better image contrast. Edge detection employs algorithms such as the Sobel or Canny operators, which identify changes in intensity, crucial for understanding where one object ends and another begins. This process involves calculating gradients in image intensity, which are then thresholded to produce binary edge maps. Color segmentation leverages color spaces like HSV or LAB for better distinction between colors under varying lighting conditions, using clustering methods like k-means to group similar colors, facilitating object recognition and scene understanding.
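A minimal sketch of this preprocessing pipeline, using OpenCV in Python. The input filename, kernel sizes, Canny thresholds, and the cluster count of four are illustrative assumptions, not recommended settings:

```python
import cv2
import numpy as np

img = cv2.imread("input.jpg")  # placeholder filename; loads a BGR image

# Noise reduction: each pixel is recomputed from its neighborhood,
# suppressing spurious variation before later analysis stages.
blurred = cv2.GaussianBlur(img, (5, 5), 0)
denoised = cv2.medianBlur(img, 5)

# Contrast adjustment: histogram equalization redistributes intensities.
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
equalized = cv2.equalizeHist(gray)

# Edge detection: Canny computes intensity gradients, then thresholds
# them into a binary edge map.
edges = cv2.Canny(equalized, 100, 200)

# Color segmentation: cluster pixels in HSV space with k-means (k = 4).
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
pixels = hsv.reshape(-1, 3).astype(np.float32)
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
_, labels, centers = cv2.kmeans(pixels, 4, None, criteria, 3,
                                cv2.KMEANS_RANDOM_CENTERS)
# Replace each pixel with its cluster center (result is still in HSV).
segmented = centers[labels.flatten()].reshape(hsv.shape).astype(np.uint8)
```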

Object Detection and Recognition involves identifying and locating objects within an image or video. Haar cascades are a traditional approach, particularly effective for fast face detection, where features are represented as simple rectangles and learned from positive and negative examples. However, modern deep learning has revolutionized this field with Convolutional Neural Networks (CNNs), notably architectures like YOLO (You Only Look Once) for real-time object detection, or more accurate models like R-CNN and its derivatives, which propose regions where objects might exist before classifying them. These models learn complex features from vast datasets, enabling them to differentiate not just objects but nuances within classes, like the make and model of cars or breeds of dogs. Transfer learning further enhances this capability by leveraging pre-trained models on new tasks with less data, accelerating learning processes.
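As a concrete illustration of the classical approach, the sketch below runs OpenCV's bundled Haar cascade for frontal faces; the input filename and detection parameters are assumptions. A deep detector such as YOLO would replace the cascade but follow the same detect-then-draw pattern:

```python
import cv2

# The cascade file ships with opencv-python.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
detector = cv2.CascadeClassifier(cascade_path)

img = cv2.imread("photo.jpg")  # placeholder filename
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# detectMultiScale scans the image at several scales and returns
# (x, y, w, h) boxes where the learned rectangle features respond.
faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("faces.jpg", img)
```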

Image Segmentation goes beyond detection to provide a detailed mapping of objects in an image. Semantic segmentation labels each pixel with a class, creating a dense prediction map where every pixel is understood in context, useful for applications like autonomous driving where road, sky, and pedestrian areas must be clearly delineated. Instance segmentation adds the complexity of distinguishing between individual instances of the same class, using methods like Mask R-CNN, which combines object detection with per-pixel segmentation. Underlying many of these systems, encoder-decoder architectures like U-Net or DeepLab preserve spatial information while learning to segment at high resolution.
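A minimal instance-segmentation sketch using torchvision's pretrained Mask R-CNN, assuming torchvision >= 0.13; the filename and the score and mask thresholds are illustrative:

```python
import torch
import torchvision
from torchvision.io import read_image
from torchvision.transforms.functional import convert_image_dtype

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# Load as CxHxW float tensor in [0, 1] ("scene.jpg" is a placeholder).
img = convert_image_dtype(read_image("scene.jpg"), torch.float)
with torch.no_grad():
    out = model([img])[0]  # dict with boxes, labels, scores, masks

# Keep confident detections; each mask is a soft HxW map per instance.
keep = out["scores"] > 0.7
masks = out["masks"][keep] > 0.5  # binarize to per-pixel instance masks
print(f"{keep.sum().item()} instances segmented")
```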

Motion Detection and Tracking analyzes temporal changes in video frames. Optical flow calculates the vector field describing the movement of brightness patterns in the image plane, revealing how objects move relative to the camera or to each other. This is crucial for applications like surveillance, where suspicious movements must be tracked, or sports analysis, where players and the ball are followed across frames. Background subtraction is another method used to detect moving objects: a model of the scene background is continuously updated so that foreground objects can be separated from it. Techniques like Gaussian Mixture Models (GMM) or more advanced deep learning models can handle complex scenes with dynamic backgrounds.

Depth Perception and 3D Reconstruction aim to infer a third dimension from two-dimensional images. Stereo vision uses disparity maps calculated from two slightly offset views to determine depth, much like human binocular vision. Structure from Motion (SfM) reconstructs 3D scenes by analyzing how objects appear to move across multiple camera viewpoints, solving for both the camera's pose and the 3D structure of the scene. This involves intricate processes like bundle adjustment, which jointly optimizes the 3D coordinates describing the scene geometry and the camera parameters. Multi-View Stereo (MVS) further refines this by matching features across many views to create dense point clouds or meshes, essential for applications like virtual reality, where a realistic 3D environment is necessary.
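The sketch below touches three of the ideas from the two paragraphs above with OpenCV: GMM background subtraction, dense optical flow, and disparity from a rectified stereo pair. File names and all numeric parameters are placeholder assumptions:

```python
import cv2

# --- Motion detection on a video clip ("clip.mp4" is a placeholder) ---
cap = cv2.VideoCapture("clip.mp4")
subtractor = cv2.createBackgroundSubtractorMOG2()  # per-pixel GMM background

ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Foreground mask: pixels deviating from the learned background model.
    fg_mask = subtractor.apply(frame)

    # Dense optical flow: a per-pixel (dx, dy) field of apparent motion.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    speed = cv2.magnitude(flow[..., 0], flow[..., 1])  # motion magnitude
    prev_gray = gray
cap.release()

# --- Depth from a rectified stereo pair (placeholder file names) ---
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right)  # larger disparity = closer surface
```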

In medicine, computer vision algorithms dissect images from CT, MRI, or X-ray scans to spot anomalies or assist in surgeries through augmented reality overlays that provide real-time anatomical guidance. Automotive applications include not just basic object detection, but also semantic understanding of the driving environment, predicting pedestrian intentions or road conditions.

Security leverages facial recognition to match faces against databases, using techniques like eigenfaces or more recent deep learning methods that offer higher accuracy with varying poses and expressions. Anomaly detection in video surveillance uses unsupervised learning to identify unusual behavior patterns.
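A minimal sketch of the eigenfaces idea using PCA from scikit-learn. Random arrays stand in for real face images purely to keep the example self-contained; image sizes, component count, and the nearest-neighbor matching rule are assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
gallery = rng.random((100, 64 * 64))  # 100 flattened 64x64 "face" images
probe = rng.random((1, 64 * 64))      # face to be identified

# PCA learns the principal components ("eigenfaces"); every face becomes
# a short coefficient vector in that low-dimensional subspace.
pca = PCA(n_components=32).fit(gallery)
gallery_codes = pca.transform(gallery)
probe_code = pca.transform(probe)

# Match by nearest neighbor in eigenface space.
distances = np.linalg.norm(gallery_codes - probe_code, axis=1)
print("best match: gallery image", distances.argmin())
```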

Industrial uses include automated inspection where vision systems can detect minute defects in manufacturing, using high-resolution imaging combined with pattern recognition or anomaly detection algorithms. Robotics integrates vision for tasks like part recognition for assembly, bin picking where items must be identified and handled, or navigation in cluttered or dynamic environments.
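One simple form of automated inspection compares each part against a known-good reference image. The sketch below assumes the inspected part is already aligned to the reference; file names, the difference threshold, and the minimum defect area are illustrative:

```python
import cv2

golden = cv2.imread("golden_part.png", cv2.IMREAD_GRAYSCALE)
sample = cv2.imread("inspected_part.png", cv2.IMREAD_GRAYSCALE)

# Absolute per-pixel difference, lightly blurred to ignore sensor noise.
diff = cv2.absdiff(golden, sample)
diff = cv2.GaussianBlur(diff, (5, 5), 0)

# Threshold the difference map and extract connected defect regions.
_, mask = cv2.threshold(diff, 40, 255, cv2.THRESH_BINARY)
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_SIMPLE)
defects = [c for c in contours if cv2.contourArea(c) > 25]
print(f"{len(defects)} candidate defects found")
```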

In agriculture, computer vision through drones or ground-based systems can assess plant health, detect pests or diseases by analyzing leaf colors or patterns, and optimize resource use like water or fertilizers based on precise plant condition data. Entertainment sees computer vision in creating lifelike visual effects, enhancing games with realistic physics or character movements based on motion capture, or enabling gesture-based controls for interactive experiences.
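As a toy version of the leaf-color analysis mentioned above, the sketch below measures the share of green pixels within an HSV hue band. The hue range and the health threshold are rough assumptions, not calibrated agronomic values:

```python
import cv2
import numpy as np

img = cv2.imread("leaf.jpg")  # placeholder filename
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

# OpenCV hue runs 0-179; roughly 35-85 covers green foliage.
green_mask = cv2.inRange(hsv, (35, 40, 40), (85, 255, 255))
green_fraction = np.count_nonzero(green_mask) / green_mask.size

print(f"green coverage: {green_fraction:.1%}")
if green_fraction < 0.5:
    print("possible stress, discoloration, or disease")
```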

The technological backbone of computer vision includes deep learning with layers specialized for vision tasks like convolutional layers for feature extraction, pooling layers for dimensionality reduction while retaining essential features, and fully connected layers for final classifications or regressions. Backpropagation, combined with gradient descent optimization, allows these networks to learn complex patterns from data. Classical methods like Hough transforms detect lines or circles in images by voting in parameter space, while SIFT and SURF algorithms match features across images for tasks like image stitching or object tracking, robust to changes in scale, rotation, and illumination.
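A minimal sketch of the two classical methods named above: probabilistic Hough line detection and SIFT feature matching (SIFT ships with opencv-python >= 4.4). File names, vote thresholds, and the 0.75 ratio-test value are assumptions:

```python
import cv2
import numpy as np

img1 = cv2.imread("frame1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame2.jpg", cv2.IMREAD_GRAYSCALE)

# Hough transform: edge pixels vote in (rho, theta) parameter space;
# peaks in the accumulator correspond to lines in the image.
edges = cv2.Canny(img1, 50, 150)
lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=80,
                        minLineLength=30, maxLineGap=5)

# SIFT: scale- and rotation-invariant keypoints with 128-d descriptors,
# matched across images with a ratio test to reject ambiguous pairs.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)
matcher = cv2.BFMatcher()
good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
        if m.distance < 0.75 * n.distance]
print(f"{len(lines) if lines is not None else 0} lines, {len(good)} matches")
```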

In essence, computer vision is a field that not only replicates but, in many ways, augments human visual perception, driving technology forward across industries by giving machines the ability to see, interpret, and interact with the world in a sophisticated and increasingly autonomous manner.

Sources for this article include:

• Tillmann, V., & Clayton, P. E. (2001). Diurnal variation in height and the reliability of height measurements using stretched and unstretched techniques in the evaluation of short-term growth. Annals of Human Biology, 28(2), 195-206.
• Voss, L. D., & Bailey, B. J. (1997). Diurnal variation in stature: is stretching the answer? Archives of Disease in Childhood, 77(4), 319-322.
• Bieler, D., Günel, S., Fua, P., & Rhodin, H. (2019). Gravity as a Reference for Estimating a Person’s Height from Video. arXiv preprint arXiv:1909.02211.
• Vuvor, F., & Harrison, O. A. (2017). A Study of the Diurnal Height Changes Among Sample of Adults Aged Thirty Years and Above in Ghana. Letters in Health and Biological Sciences, 2(2), 91-96.
• Brolund, A., & Bergström, P. (2007). Estimation of human height from surveillance camera footage. Linköping University Electronic Press.
• Criminisi, A., Reid, I., & Zisserman, A. (1998). A new approach to obtain height measurements from video. Proceedings of the 1998 International Conference on Image Processing, 2, 879-883.
• Finocchiaro, J., Khan, A. U., & Borji, A. (2016). Egocentric Height Estimation. arXiv preprint arXiv:1610.02714.
• Polito, E. (2023). Measure Heights from Surveillance Video. Amped Software Blog.
• Spreadborough, D. (2017). The Beginner’s Guide to Suspect Height Calculation from CCTV. Amped Software Blog.
• NCAVF. (n.d.). How Video Forensic Experts Determine Height in Images and Videos. NCAVF.