Department of Computer Science and Engineering, University of Nebraska–Lincoln, NE, USA
Received Date: June 20, 2017; Accepted Date: July 01, 2017; Published Date: July 15, 2017
Citation: Silva BVR, Cui J (2017) A Survey on Automated Food Monitoring and Dietary Management Systems. J Health Med Informat 8:272. doi: 10.4172/2157-7420.1000272
Copyright: © 2017 Silva BVR, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
A healthy diet with balanced nutrition is key to the prevention of life-threatening conditions such as obesity, cardiovascular disease, and cancer. Recent advances in smartphone and wearable sensor technologies have led to a proliferation of food monitoring applications based on automated food image processing and eating episode detection. Their goal is to overcome the drawbacks of traditional manual food journaling, which is time consuming and inaccurate and suffers from underreporting and low adherence. To give users nutritional feedback accompanied by insightful dietary advice, various techniques built on key computational learning principles have been explored. This survey presents a variety of methodologies and resources on this topic, along with unsolved problems, and closes with a perspective on the broader implications of this field.
Food monitoring; Food image classification; Food image dataset; Automatic nutrient assessment; Machine learning
Many people struggle to maintain a healthy diet and manage their weight, even though they know that bad eating habits lead to overweight and obesity, which increase the risk of heart disease, hypertension, other metabolic comorbidities such as type 2 diabetes, and cancer. Personal diet management is warranted in these scenarios, and it often involves manual food logging that is time consuming and tedious. With the growth of smartphone use, several mobile applications have been developed to facilitate food journaling, such as MyFitnessPal, LoseIt and Fooducate, and many have demonstrated great potential for effective diet control. For example, one study shows higher user retention with smartphone-based diet logging than with websites and paper diaries over a period of six months. Teenagers are willing to take food images with a mobile food recorder before eating, and dietary feedback contributes to weight loss. However, many of these applications require significant manual input from users and perform poorly in assessing exact ingredients and food portions, which has hindered long-term use.
The current consensus objective on this topic is to develop new methods that automatically identify food items and estimate nutrients from food images, using cutting-edge techniques in Computer Vision and Machine Learning, and that are ideally user friendly, effortless and accurate so that users can keep track of their meals. Along this line of research, several key issues have been raised. First, food image databases are expected to be comprehensive, containing a large number of food classes to cover food diversity and abundant images per class to reflect the variation among food images when training a classification system. Second, reliable food segmentation is highly recommended to identify all possible items in an image and separate them from the background, regardless of the lighting conditions or of whether the foods are mixed. Subsequently, classification is performed on each segmented item using machine learning models trained on large food datasets. Last, volume and weight estimation can be performed on each identified item, followed by nutrient assessment [10-12]. The workflow of an automated food monitoring system that connects these components is presented in Figure 1. Notably, every aforementioned step involves technical challenges, e.g., it is difficult to estimate food volume from 2-dimensional images.
In addition to the image-based strategy, several wearable devices have been explored to detect food intake events automatically, such as glasses with load cells, glasses connected to a sensor on the temporalis muscle and an accelerometer, and a wrist motion tracker. The collected information about eating episodes, which is pertinent to users’ dietary habits, can serve as a starting point for food consumption analysis and diet interventions, e.g., providing users with recommendations for physical exercise, healthier food, or eating habits [16,17].
In this paper, we review the most relevant applications for automatic food monitoring (up to April 2017) that address each of the aforementioned challenges. We first introduce current food image databases in section 2, and in the following section we survey existing methods for segmentation, feature extraction, classification, and volume and nutrient estimation. In addition, a few studies on food-monitoring wearable devices and diet intervention are described. Finally, we close the review by discussing the remaining challenges and presenting a future outlook for this field.
A comprehensive collection of quality food images is key to training a food-classification model and benchmarking its prediction performance. A common procedure to verify whether a new classifier outperforms previous methods is to compare their classification performance on large food image databases such as Food-101, UEC Food-100, and UEC Food-256. Current food image datasets vary in many aspects, such as type of cuisine, number of food groups, and total images per food class. For instance, the Menu-Match dataset contains 41 food classes and a total of 646 images captured in 3 distinct restaurants, while PFID has 61 classes with a total of 1098 pictures captured in fast food restaurants and in the laboratory. Table 1 summarizes the different databases with their respective features.
|Study||Database||Image content||Total # of class/image||Acquisition||Reference|
|Chen et al., 2009||PFID||Fast food items from USA||61/1098||Images taken in restaurants and in lab, with white background|||
|Mariappan, 2009||TADA*||Common food in USA||256 food+50 replicas||Images collected in a controlled environment|||
|Hoashi et al., 2010||Food85*||Japanese food||85/8500||Images derived from previous database with 50 Japanese food category and web|||
|Chen, 2012||Chen||Chinese food||50/5000||Images downloaded from the web|||
|Matsuda et al., 2012||UEC Food-100||Popular Japanese food||100/9060||Images acquired by digital camera (each photo has a bounding box indicating the location of the food item)|||
|Farinella et al., 2014||Diabetes||Selected food||11/4868||Images downloaded from the web|||
|Bossard et al., 2014||Food-101||Popular food in USA||101/101000||Images downloaded from the web|||
|Kawano and Yanai, 2014||UEC Food-256||Popular foods in Japan and other countries||256/31397||Images acquired by digital camera (each photo has a bounding box indicating the location of the food item)|||
|Meyers, 2015||Food201-Segmented*||Popular food in USA||201/12625||Images derived from Food 101 dataset; segmented|||
|Beijbom et al., 2015||Menu-Match||Food from three restaurants (Asian, Italian, and soup)||41/646||Images taken by authors|||
|Ciocca et al., 2016||UNIMIB2016||Food from dining hall||73/1027||Images acquired by digital camera in dining hall; segmented|||
|Chen and Ngo, 2016||Vireo||Chinese dishes||172/110241||Images downloaded from the web|||
Table 1: Food image databases.
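Since the number of images per class matters for training, the class/image counts in Table 1 can be turned into a rough images-per-class figure. The following is a minimal sketch in Python; the dictionary simply copies the counts from Table 1 (TADA is left out because its entry is not a class/image pair), and the function name `images_per_class` is introduced here for illustration only.

```python
# (number of classes, number of images) copied from Table 1.
# TADA is omitted: its entry is "256 food+50 replicas", not a class/image pair.
DATASETS = {
    "PFID": (61, 1098),
    "Food85": (85, 8500),
    "Chen": (50, 5000),
    "UEC Food-100": (100, 9060),
    "Diabetes": (11, 4868),
    "Food-101": (101, 101000),
    "UEC Food-256": (256, 31397),
    "Food201-Segmented": (201, 12625),
    "Menu-Match": (41, 646),
    "UNIMIB2016": (73, 1027),
    "Vireo": (172, 110241),
}


def images_per_class(classes: int, images: int) -> float:
    """Average number of images available for each food class."""
    return images / classes


if __name__ == "__main__":
    # List the datasets from the sparsest to the densest.
    ranked = sorted(DATASETS.items(), key=lambda kv: images_per_class(*kv[1]))
    for name, (classes, images) in ranked:
        print(f"{name:18s} {images_per_class(classes, images):8.1f}")
```

The average hides class imbalance, so it should be read only as a coarse indicator of how much training data a dataset offers per food type.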
It is noticeable that there is no benchmark food image database for general classification purposes, since most databases archive a specific food type. For example, the UNIMIB2016 database has Italian food images from a campus dining hall and UEC Food-100 consists of popular Japanese dishes. Similarly, Chen and PFID consist of images of Chinese food and American fast food, respectively. On the other hand, Food-101 and UEC Food-256 contain a mix of eastern and western food. Beyond food type, other image properties were taken into consideration when developing those databases, such as whether the picture was obtained in the wild or in a controlled environment and whether the image is segmented (Table 1).
Food image segmentation
Segmentation is an important process to separate parts of a scene. When dealing with food, the objective is to localize and extract food items from the image [23-26]. It takes place before food classification when authors attempt to identify multiple food items in the image [8,27] or estimate volume [11,12], which often contributes to improved classification accuracy [9,12].
It is challenging to segment food images since they may not present specific attributes such as edges and well-defined contours. Food items can lie on top of each other or be obstructed by another component, which hides them in the given image. Meanwhile, external factors such as illumination can interfere negatively in this step, since shadows can be identified as part of the food or even as a new food item [12,29].
Several methods have been proposed to address the segmentation issue, as summarized in Table 2. For example, one asks the user to draw bounding boxes over food items on the smartphone screen and performs segmentation using the GrabCut algorithm over the selected areas. Another segments items by integrating four methods to detect candidate regions: the whole image (assuming each image has one food), the Deformable Part Model (DPM, a method that uses sliding windows to detect object regions), a circle detector (detecting circular shapes in an image), and JSEG segmentation. A similar approach in Ciocca et al. combined different strategies, including image saturation, binarization, JSEG segmentation, and morphological operations (noise removal), to segment multiple food items. In addition, the work presented in Yang et al. tries to segment food by its ingredients and their spatial relationships using Semantic Texton Forest (STF).
|Study||Segmentation method||Reported result||Reference|
|Yang et al., 2010||Semantic Texton Forest calculates the probability for each pixel to belong to one of the food classes.||Output from Semantic Texton Forest is far from a precise parsing of an image|||
|Matsuda et al., 2012||Combined techniques: whole image, DPM, circle detector and JSEG segmentation||Overall accuracy of 21% (top 1) and 45% (top 5)*|||
|Kawano and Yanai, 2013||Each food item within user-generated bounding boxes is segmented by GrabCut algorithm||Performance depends on the size of the bounding boxes|||
|Pouladzadeh et al., 2014||Graph cut segmentation algorithm to extract food items and user's finger||Overall accuracy of 95%|||
|Shimoda and Yanai, 2015||CNN model searching for food item based on fragmented reference||Detects correct bounding boxes around food items with mean average precision of 49.9% when compared to ground truth values|||
|Meyers, 2015||DeepLab model||Classification accuracy increases with conditional random fields|||
|Zhu et al., 2015||Multiple segmentations generated for an image and selected by a classifier||It outperforms normalized cut|||
|Ciocca et al., 2016||Combines saturation, binarization, JSEG segmentation and morphological operations||Achieves better segmentation than using JSEG-only approach|||
*Top 1 and/or Top 5 indicate that the performance of the classification model was evaluated based on the first assigned class with the highest probability and/or the top 5 classes among the prediction for each given food item, respectively
Table 2: Food segmentation methods.
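To make the saturation, binarization and morphological clean-up steps listed in Table 2 more concrete, here is a minimal sketch in pure Python. It is not the pipeline of any cited study: the JSEG stage is omitted, the threshold of 0.3 and the 3x3 structuring element are assumptions chosen for illustration, and all function names are hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each in 0..255
Mask = List[List[int]]         # 1 = candidate food pixel, 0 = background


def saturation(pixel: Pixel) -> float:
    """HSV-style saturation in [0, 1]: (max - min) / max."""
    hi, lo = max(pixel), min(pixel)
    return 0.0 if hi == 0 else (hi - lo) / hi


def binarize(image: List[List[Pixel]], threshold: float = 0.3) -> Mask:
    """Mark pixels whose saturation exceeds the (assumed) threshold."""
    return [[1 if saturation(p) > threshold else 0 for p in row] for row in image]


def _window(mask: Mask, r: int, c: int) -> List[int]:
    """3x3 neighbourhood around (r, c); positions outside the image count as 0."""
    h, w = len(mask), len(mask[0])
    return [mask[i][j] if 0 <= i < h and 0 <= j < w else 0
            for i in range(r - 1, r + 2) for j in range(c - 1, c + 2)]


def erode(mask: Mask) -> Mask:
    """Keep a pixel only if its whole 3x3 neighbourhood is set."""
    return [[min(_window(mask, r, c)) for c in range(len(mask[0]))]
            for r in range(len(mask))]


def dilate(mask: Mask) -> Mask:
    """Set a pixel if any pixel in its 3x3 neighbourhood is set."""
    return [[max(_window(mask, r, c)) for c in range(len(mask[0]))]
            for r in range(len(mask))]


def opening(mask: Mask) -> Mask:
    """Erosion followed by dilation: removes specks smaller than the window."""
    return dilate(erode(mask))
```

On a grey plate with one coloured food patch, `opening(binarize(image))` keeps the patch and drops isolated saturated pixels. A strongly coloured shadow or tablecloth would also pass the threshold, which reflects the sensitivity to illumination and background noted above.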
Of particular interest is that Deep Learning has been used to tackle food segmentation [11,30], although this work is at an early stage. For example, the application named Im2Calories uses a Convolutional Neural Network (CNN) model that provides the unary potentials of a conditional random field, together with a fully connected graph that performs edge-sensitive label smoothing, which increased the overall classification accuracy (Table 2).
Image objects can be recognized based on their characteristics, such as colour, shape and texture. According to Hassannejad et al., selection of relevant features is important when building a recognition model capable of identifying food items. General image features, as mentioned above, may not be descriptive enough to distinguish foods, since the properties of the same food may change when it is prepared in different ways. For example, penne and spaghetti have the same colour and texture but distinct shapes.
In order to extract informative visual information from food images, descriptors such as Local Binary Patterns (LBP), color information, Gabor filters, and Scale-Invariant Feature Transform (SIFT), called handcrafted features, can be applied (Table 3). Different features and their fusion often result in different classification performance. For instance, when SIFT and LBP were used individually on the Chen dataset, they achieved accuracies of 53% and 45.9%, respectively; when they were combined with additional colour and Gabor filter features, accuracy rose to 68.3%. On the same dataset, another study, Menu-Match, extracted SIFT, LBP and colour in different settings, along with HOG and MR8, and obtained an accuracy of 77.4%. This also illustrates how sensitive classification can be when the same feature is extracted with different parameters.
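As an illustration of what a handcrafted descriptor looks like in practice, the sketch below computes the basic 8-neighbour Local Binary Pattern and its normalized histogram in pure Python. It is one simple LBP variant, not the configuration used in any of the cited studies; the neighbour ordering in `OFFSETS` and the function names are choices made here.

```python
from typing import List

# Clockwise neighbour offsets (row, col), starting at the top-left pixel.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(gray: List[List[int]], r: int, c: int) -> int:
    """8-bit code: bit k is set when neighbour k is >= the centre pixel."""
    centre = gray[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(OFFSETS):
        if gray[r + dr][c + dc] >= centre:
            code |= 1 << bit
    return code


def lbp_histogram(gray: List[List[int]]) -> List[float]:
    """Normalized 256-bin histogram of LBP codes over interior pixels."""
    hist = [0] * 256
    for r in range(1, len(gray) - 1):
        for c in range(1, len(gray[0]) - 1):
            hist[lbp_code(gray, r, c)] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else [0.0] * 256
```

The resulting 256-value vector describes local texture and can be concatenated with colour or shape descriptors before being passed to a classifier. Changing the neighbourhood radius or the binning changes the vector, which is one way the same named feature can yield different accuracies.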
|Study||Features||Classifier||Database||Top1 Acc.**||Top5 Acc.**||Reference|
|Chen, 2012||SIFT, LBP, color and Gabor||Multi-class AdaBoost||Chen||68.3%||90.9%|||
|Beijbom et al., 2015||SIFT, LBP, color, HOG and MR8||SVM||Chen||77.4%||96.2%|||
|Anthimopoulos et al., 2014||SIFT and color||Bag of Words and SVM||Diabetes||78.0%||-|||
|Bossard et al., 2014||SURF and L*a*b color values||RFDC||Food-101||50.8%||-|||
|Hoashi et al., 2010||Bag of features, color, Gabor texture and HOG||MKL||Food85||62.5%||-|||
|Beijbom et al., 2015||SIFT, LBP, color, HOG and MR8||SVM||Menu-Match||51.2%*||-|||
|Christodoulidis et al., 2015||Color and LBP||SVM||Local dataset||82.2%||-|||
|Pouladzadeh et al., 2014||Color, texture, size and shape||SVM||Local dataset||92.2%||-|||
|Pouladzadeh et al., 2014||Graph Cut, color, texture, size and shape||SVM||Local dataset||95.0%||-|||
|Kawano and Yanai, 2013||Color and SURF||SVM||Local dataset||-||81.6%|||
|Farinella et al., 2014||Bag of textons||SVM||PFID||31.3%||-|||
|Yang et al., 2010||Pairwise local features||SVM||PFID||78.0%||-|||
|He et al., 2014||DCD, MDSIFT, SCD, SIFT||KNN||TADA||64.5%||-|||
|Zhu et al., 2015||Color, texture and SIFT||KNN||TADA||70.0%||-|||
|Matsuda et al., 2012||SIFT, HOG, Gabor texture and color||MKL-SVM||UEC-Food-100||21.0%||45.0%|||
|Liu et al., 2016||Extended HOG and color||Fisher Vector||UEC-Food-100||59.6%||82.9%|||
|Kawano and Yanai, 2014||Color and HOG||Fisher Vector||UEC-Food-100||65.3%||-|||
|Yanai and Kawano, 2015||Color and HOG||Fisher Vector||UEC-Food-100||65.3%||86.7%|||
|Kawano and Yanai, 2014||Fisher Vector, HOG and color||One-vs-rest linear classifier||UEC-Food-256||50.1%||74.4%|||
|Yanai et al., 2015||Color and HOG||Fisher Vector||UEC-Food-256||52.9%||75.5%|||
|Deep Learning Methods|
|Study||Method||Database||Top1 Acc.**||Top5 Acc.**||Reference|
|Anthimopoulos et al., 2014||ANNnh||Diabetes||75.0%||-|||
|Bossard et al., 2014||Food-101||Food-101||56.4%||-|||
|Yanai and Kawano, 2015||DCNN-Food||Food-101||70.4%||-|||
|Liu et al., 2016||DeepFood||Food-101||77.4%||93.7%|||
|Hassannejad et al., 2016||Inception v3||Food-101||88.3%||96.9%|||
|Meyers, 2015||GoogLeNet||Food201 segmented||76.0%||-|||
|Christodoulidis et al., 2015||Patch-wise CNN||Own database||84.9%||-|||
|Pouladzadeh et al., 2016||Graph cut+Deep Neural Network||Own database||99.0%||-|||
|Kawano and Yanai, 2014||OverFeat+Fisher Vector||UEC-Food-100||72.3%||92.0%|||
|Liu et al., 2016||DeepFood||UEC-Food-100||76.3%||94.6%|||
|Yanai and Kawano, 2015||DCNN-Food||UEC-Food-100||78.8%||95.2%|||
|Hassannejad et al., 2016||Inception v3||UEC-Food-100||81.5%||97.3%|||
|Chen and Ngo, 2016||Arch-D||UEC-Food-100||82.1%||97.3%|||
|Liu et al., 2016||DeepFood||UEC-Food-256||54.7%||81.5%|||
|Yanai and Kawano, 2015||DCNN-Food||UEC-Food-256||67.6%||89.0%|||
|Hassannejad et al., 2016||Inception v3||UEC-Food-256||76.2%||92.6%|||
|Ciocca et al., 2016||VGG||UNIMIB2016||78.3%||-|||
|Chen and Ngo, 2016||Arch-D||VIREO||82.1%||95.9%|||
*Represents the mean average precision
**Top 1 and/or Top 5 indicate that the performance of the classification model was evaluated based on the first assigned class with the highest probability and/or the top 5 classes among the prediction for each given food item, respectively
Table 3: Traditional and deep learning classification methods.
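The Top 1 and Top 5 figures in Table 3 follow the definition given in the table footnote: a prediction counts as correct when the true class is among the k highest-scoring classes. A minimal sketch of that computation in Python, with a hypothetical function name and toy scores, is:

```python
from typing import List, Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]],
                   labels: Sequence[int], k: int = 1) -> float:
    """Fraction of samples whose true label is among the k best-scored classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked: List[int] = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


if __name__ == "__main__":
    toy_scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
    toy_labels = [1, 1, 0]
    print(top_k_accuracy(toy_scores, toy_labels, k=1))
    print(top_k_accuracy(toy_scores, toy_labels, k=2))
```

Top-k accuracy can only stay equal or increase as k grows, which is why the Top 5 column is always at least as high as the Top 1 column for the same method.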
Currently, there are two major classification strategies for food image recognition: 1) traditional machine learning approaches using handcrafted features and 2) Deep Learning approaches. The former usually start with a set of visual features extracted from the food image and use them to train a prediction model based on Machine Learning algorithms such as Support Vector Machine, Bag of Features, or K Nearest Neighbors. In contrast, emerging deep learning architectures have a large number of connected layers that are able to learn features, followed by a final layer responsible for classification. Recent approaches based on Deep Learning have become more popular and effective, e.g., the study in Christodoulidis et al. obtained astonishing results in the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012).
The following example compares a classifier trained with handcrafted features with a deep learning architecture. In Yanai and Kawano, color and HOG features are classified using Fisher Vectors, a strategy similar to Bag of Features, which achieved an accuracy of 65.3% on UEC Food-100. On the same database, the Deep Learning architecture DCNN-FOOD reached 78.8%, an improvement of 13.5 percentage points over the handcrafted method. A major advantage of Deep Learning methods is that they can learn relevant features from images automatically, which is particularly important when the pre-defined features are not discriminative enough. More studies based on both methods are shown in Table 3. A common issue with most current methods is that performance was reported mainly as overall accuracy, while the assessment of sensitivity and specificity was missing (Table 3).
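To illustrate the first strategy, the sketch below classifies a feature vector with K Nearest Neighbors in pure Python. The two-dimensional feature vectors and the labels "pasta" and "salad" are toy values invented for this example; real systems would use descriptors such as those in Table 3, and `knn_predict` is a hypothetical helper rather than any study's implementation.

```python
import math
from collections import Counter
from typing import List, Sequence


def knn_predict(train_x: Sequence[Sequence[float]], train_y: Sequence[str],
                query: Sequence[float], k: int = 3) -> str:
    """Label a query vector by majority vote among its k nearest training vectors."""
    # Euclidean distance from the query to every training sample.
    distances = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy 2-D "features" (purely illustrative, not measured from images).
    features: List[Sequence[float]] = [(0.9, 0.1), (0.8, 0.2), (0.85, 0.15),
                                       (0.1, 0.9), (0.2, 0.8), (0.15, 0.85)]
    labels = ["pasta"] * 3 + ["salad"] * 3
    print(knn_predict(features, labels, (0.82, 0.2)))
```

In this strategy the quality of the prediction depends entirely on how well the chosen features separate the classes, whereas a deep network learns the feature representation together with the classifier.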
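Sensitivity and specificity for a given food class can be derived from the same predictions used to compute overall accuracy. The sketch below is a minimal one-vs-rest computation in Python with invented toy labels; the function name is hypothetical.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[str], y_pred: Sequence[str],
                            positive: str) -> Tuple[float, float]:
    """One-vs-rest sensitivity (recall) and specificity for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec


if __name__ == "__main__":
    truth = ["rice", "rice", "rice", "soup", "soup", "tofu"]
    guess = ["rice", "rice", "soup", "soup", "rice", "tofu"]
    for food in ("rice", "soup", "tofu"):
        print(food, sensitivity_specificity(truth, guess, food))
```

In this toy case the overall accuracy is 4/6, yet the per-class values differ noticeably between classes, which is the kind of information an accuracy-only report leaves out.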
Food volume estimation and nutrient analysis
After identifying all food items in an image, it is important to assess the nutrients they contain, e.g., carbohydrates, sugar, or total calories, which requires volume or weight estimation, another challenge. In fact, not even an expert dietitian can estimate the total calories without a precise instrument, e.g., a scale. Taking image-based calorie estimation as an example: first, food candidate regions must be recognized, segmented, and classified correctly [22,36]; then the volume of each segmented item is calculated; finally, the nutrients are estimated from a nutritional facts table [37-39], such as the USDA Food Composition Database.
The most challenging part is estimating a food’s volume from a 2-dimensional image, which normally carries no depth information unless reference objects are placed next to the meal [8,41]. Volume can be underestimated or overestimated because of interference from external factors, such as lighting conditions, blurred images, and noisy backgrounds. Only a few strategies have been reported for estimating food volume and calorie intake, as the major focus in this domain still lies in food classification (Table 4).
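The last step of this chain, going from an estimated volume to calories through a nutritional facts table, can be sketched as follows in Python. The density and energy values are placeholders invented for illustration, not entries from the USDA database, and the names `FoodFacts`, `FACTS` and `estimate_kcal` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class FoodFacts:
    density_g_per_cm3: float  # placeholder value, not real nutrition data
    kcal_per_100g: float      # placeholder value, not real nutrition data


# Hypothetical lookup table; a real system would query a nutrition database.
FACTS: Dict[str, FoodFacts] = {
    "example_food_a": FoodFacts(density_g_per_cm3=1.0, kcal_per_100g=100.0),
    "example_food_b": FoodFacts(density_g_per_cm3=0.5, kcal_per_100g=200.0),
}


def estimate_kcal(items: Iterable[Tuple[str, float]]) -> float:
    """Sum of volume (cm^3) x density (g/cm^3) x energy (kcal/g) per item."""
    total = 0.0
    for label, volume_cm3 in items:
        facts = FACTS[label]
        grams = volume_cm3 * facts.density_g_per_cm3
        total += grams * facts.kcal_per_100g / 100.0
    return total


if __name__ == "__main__":
    meal = [("example_food_a", 150.0), ("example_food_b", 100.0)]
    print(estimate_kcal(meal))
```

Because the estimate is linear in volume, a given relative error in the volume of an item carries over unchanged to its calorie estimate, and a misclassified item pulls in the wrong density and energy values altogether.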
|Study||Approach||Remarks||Reference|
|Noronha et al., 2011||Via crowdsourcing (e.g. users from Amazon Mechanical Turk)||Better performance than other commercial app using crowdsourcing but overall it is error prone since users estimate food portion by just looking at the picture|||
|Chen, 2012||Use depth camera to acquire color and depth||Preliminary result showing some limitations when estimating quantity of cooked rice and water|||
|Villalobos et al., 2012||Use top+side view pictures with user's finger as reference||Results change due to illumination conditions and image angle; standard error is in an acceptable range|||
|Beijbom et al., 2015||Use menu items from nearby restaurants||Food calorie is from pre-defined restaurant’s menu|||
|Meyers, 2015||3D volume estimation by capturing images with a depth camera and reconstructing image using Convolutional Neural Network and RANSAC||Using toy food; the CNN volume predictor is accurate for most of the meals; no calorie estimation outside a controlled environment.|||
|Woo et al., 2010||Use a checkerboard as reference for camera calibration and 3D reconstruction||Mean volume error of 5.68% on a test of seven food items|||
Table 4: Methods for food volume and calorie estimation.
As listed in Table 4, crowdsourcing and depth sensor cameras [11,22] have been used for food volume estimation and nutrition assessment. Although they led to promising results, these studies were conducted either in a controlled environment or with an extra camera that is not practical in real-world settings. In addition, the user’s finger has been used as a reference when taking pictures from the top and side views of the plate to estimate food volume. The concern here is that multiple food items overlap in the side view, making them hard to distinguish. Similarly, another reference object, a checkerboard, was used to help obtain depth information alongside camera calibration, which also requires users to carry additional equipment in order to estimate a food’s volume.
Note that those methods are mostly tied to a controlled environment. For example, one study states that a broader study outside the laboratory is not feasible because nutrient values vary depending on how the food was prepared and there is no broad nutritional database for prepared foods yet. In addition, Woo et al. performed volume estimation for only 7 items, while Beijbom et al. only matched classified food to annotated menu items with known calories (Table 4).
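The role of a reference object can be shown with a deliberately simplified sketch in Python: a reference of known physical size gives a centimetre-per-pixel scale, which converts a segmented top-view area and a side-view height into a volume. This is not the procedure of the cited studies; it assumes the reference and the food lie at the same distance from the camera and approximates the food as a prism (area times height). All names and numbers are hypothetical.

```python
def cm_per_pixel(reference_cm: float, reference_px: float) -> float:
    """Scale factor from a reference object of known size (e.g. one checkerboard square)."""
    if reference_px <= 0:
        raise ValueError("reference must span at least one pixel")
    return reference_cm / reference_px


def area_cm2(mask_pixel_count: int, scale: float) -> float:
    """Physical area covered by a segmented region in the top view."""
    return mask_pixel_count * scale ** 2


def volume_cm3(top_area_px: int, side_height_px: float,
               top_scale: float, side_scale: float) -> float:
    """Prism approximation: top-view area times side-view height."""
    return area_cm2(top_area_px, top_scale) * side_height_px * side_scale


if __name__ == "__main__":
    # Assumed measurements: a 2 cm reference spans 40 px in both views.
    scale = cm_per_pixel(2.0, 40)
    print(volume_cm3(top_area_px=4000, side_height_px=60,
                     top_scale=scale, side_scale=scale))
```

The scale enters the volume cubed, so a small error in measuring the reference object is amplified in the result, and the prism assumption breaks down for heaped or overlapping items such as those hidden in the side view.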
Other than monitoring food intake through image processing, several wearable devices have been developed for automatic detection of eating episodes. For example, a proof of concept called Glassense uses a pair of glasses with load cells to detect the user’s digestive behaviours through facial signals. Likewise, glasses connected to a sensor placed on the temporalis muscle and an accelerometer were presented to detect food intake when users are physically active and/or talking. In addition, a wrist motion tracker was developed to identify eating activities and measure food intake.
Although these approaches can detect eating activities with decent resolution, more follow-up research is needed to explore the relationships between eating activities, nutrient intake and calorie consumption.
Dietary intervention can be realized after the aforementioned diet management systems learn adequate information about an individual’s eating habits. Often it requires functionality similar to a diet advisor capable of giving users feedback to improve their health, e.g., eat less often or replace A with B in the meal for weight loss. Recent applications are more sophisticated in this regard. For example, Faiz et al. introduce a Semantic Healthcare Assistant for Diet and Exercise (SHADE) that can identify user habits and generate suggestions not only for diet, but also for exercise for diabetic control.
Similarly, Lee et al. present a personal food recommendation agent that can create a meal plan according to a person’s lifestyle and particular health needs towards a certain health goal.
As mentioned above, despite the advances in food recognition technologies, challenges remain in each analytical step. For example, food image datasets and classification methods are closely related, since the former provide training data for the latter. Current image databases tend to grow in number of classes to incorporate different types of food, as happened with Food201-Segmented, Food85, and UEC Food-256. Meanwhile, classifiers are developed on new architectures capable of identifying new food items. Since Deep Learning approaches can provide better classification accuracy when trained on larger datasets, there is possibly also a need to generate more food images from existing datasets by randomly cropping images and applying distortions to brightness, contrast, saturation and hue.
Although segmentation of food items has shown significant improvements in Zhu et al., it is still difficult to segment hidden and mixed food items. Other factors such as lighting can also affect food segmentation negatively. For example, shadows can be treated by algorithms as part of the food or as candidate regions. Methods based on manually selected candidate items can be promising; however, the bounding box size may influence the result.
Nutrient and calorie estimation remains the most challenging problem in automated diet monitoring systems, since it is highly dependent on food segmentation and volume estimation. Calories can be overestimated or underestimated if any of the other steps is erroneous. However, as discussed above, volume estimation based on 2D images is still far from satisfactory, even with effective reference objects such as a checkerboard or a finger. This problem can be addressed by using stereo cameras, as illustrated in Im2Calories, which requires extra accessories, or by using SmartPlate, a device that integrates multiple scales into a dining plate to weigh food items. Once such functionalities and sensors are embedded in smartphones, much of this complexity could be alleviated.
In this review, we have surveyed a wide range of strategies in computer vision and artificial intelligence specifically designed for automated food recognition and dietary intervention. In particular, the entire framework can be broken down into four parts: development of comprehensive food image databases, classifiers capable of food item recognition, strategies for food volume estimation, and nutrient analysis that provides information for diet intervention. Even though improved performance has been demonstrated, challenging issues remain and call for novel algorithms and techniques. Worth mentioning is the increased appreciation of Deep Learning models for food image classification, which have outperformed traditional methodologies based on handcrafted features. Increased application of wearable sensor devices, especially those that can be integrated into smartphones, is expected to advance this line of research, and as a whole the food monitoring system will help generate novel insights for effective health promotion and disease prevention.
This work was supported by a COBRE grant [1P20GM104320] from the National Institutes of Health, a UNL Food for Health seed grant, and the Tobacco Settlement Fund as part of Cui’s startup grant.
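The augmentation idea mentioned here, generating extra training images by random cropping and photometric distortion, can be sketched in pure Python on a grayscale image stored as nested lists. Only cropping, brightness and contrast are covered; saturation and hue would require a colour representation. The distortion ranges and function names are assumptions made for this example.

```python
import random
from typing import List

Image = List[List[int]]  # grayscale, values in 0..255


def random_crop(img: Image, out_h: int, out_w: int, rng: random.Random) -> Image:
    """Cut an out_h x out_w window at a random position inside the image."""
    top = rng.randint(0, len(img) - out_h)
    left = rng.randint(0, len(img[0]) - out_w)
    return [row[left:left + out_w] for row in img[top:top + out_h]]


def adjust(img: Image, brightness: float = 0.0, contrast: float = 1.0) -> Image:
    """Scale contrast around mid-grey (128), shift brightness, clamp to 0..255."""
    def clamp(value: float) -> int:
        return max(0, min(255, int(round(value))))
    return [[clamp((p - 128) * contrast + 128 + brightness) for p in row] for row in img]


def augment(img: Image, out_h: int, out_w: int, rng: random.Random) -> Image:
    """One random crop followed by a random brightness/contrast distortion."""
    crop = random_crop(img, out_h, out_w, rng)
    return adjust(crop,
                  brightness=rng.uniform(-30, 30),   # assumed range
                  contrast=rng.uniform(0.8, 1.2))    # assumed range


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    source = [[(r * 8 + c) * 4 for c in range(8)] for r in range(8)]
    variants = [augment(source, 4, 4, rng) for _ in range(5)]
    print(len(variants), "augmented crops of size 4x4")
```

Each call produces a different crop and distortion of the same source image, so one labelled photo yields several training samples without any new annotation effort.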