Body Field: Structured Mean Field with Human Body Skeleton Model and Shifted Gaussian Edge Potentials

An efficient method for simultaneous human body part segmentation and pose estimation is introduced. A conditional random field with a fully-connected graphical model is used. The possible labels of each node (image pixel) comprise the human body parts and the background. In the human body skeleton model, the spatial dependencies among body parts are encoded in the pairwise energy functions of the conditional random field. Suitable pairwise edge potentials between image pixels are defined according to the presence or absence of human body parts that are near each other. Various Gaussian kernels in the position, color, and histogram of oriented gradients spaces are used for defining the pairwise energy terms. Shifted Gaussian kernels are defined between every two body parts that are connected to each other in the human body skeleton model. As shifted Gaussian kernels impose a high computational cost on inference, an efficient inference process is proposed, based on a mean field approximation that uses high dimensional shifted Gaussian filtering. Experimental results on the challenging KTH Football, Leeds Sports Pose, HumanEva, and Penn-Fudan datasets show that the proposed method increases the per-pixel accuracy of human body part segmentation and also improves the probability of correct parts metric of human body joint locations.


1-Introduction
Human body part segmentation is the problem of segmenting a given image into human body (HB) parts and the background. The main difference between this process and general object segmentation is that the HB has an articulated structure. Human pose estimation is defined as the problem of localizing human body joints in the 2D image or in 3D space. Human body part segmentation and pose estimation are challenging tasks in computer vision. Their wide range of applications includes surveillance, motion analysis, human-computer interaction, image understanding, augmented reality, and action recognition. As the HB has an articulated structure, pose estimation methods aim to find its configuration in a given image. The articulation in HB is often represented by a skeleton model with 14 body joints and the corresponding connections among them [1], [2], [3], [4], [5]. The main challenges in HB part segmentation and pose estimation are occluded body parts, the foreshortening effect on the length of some body parts (caused by projection from 3D space onto the 2D image plane), and the ambiguity of defective body parts (due to motion blur or self-occlusion). In this paper, a new and efficient method for simultaneous HB part segmentation and pose estimation is introduced. The block diagram of the proposed method is shown in Figure 1. The method is based on a conditional random field (CRF) graphical model.
The graphical model is a fully connected graph (shown in Figure 2). The graphical model for the human skeleton in the proposed dual pose and segmentation method is shown in Figure 3. The label of each image pixel (graph node) is a random variable of this CRF, taking values from the set $\mathcal{L} = \{l_0, l_1, \dots, l_{14}\}$, where labels $l_1, \dots, l_{14}$ are body part labels and $l_0$ is the background label (see Figure 4). In this work, HB joints are modeled in a graph with 14 nodes, and the connections among graph nodes are determined according to the HB skeleton, as shown in Figure 3. In the proposed method, the HB skeleton is not restricted to a tree; it can also have cycles. Only the unary and pairwise relations are considered in defining the energy function, and higher order relations (e.g. ternary, quaternary, etc.) are neglected. The spatial dependency of HB joints in the skeleton model, the length of limbs, and the difference between the features of two joints are encoded in the pairwise terms of the CRF energy function. The main contributions of this paper are summarized as follows.
• The semantic human body part segmentation and pose estimation problems are modeled simultaneously in a single graphical model. Then, an efficient inference method is proposed to minimize the energy function defined by the model.
• The body length constraint is modeled in the proposed fully connected graphical model by the shifted Gaussian kernels used in the definition of the pairwise energy terms.
• It is demonstrated that, although the proposed graphical model is fully connected and the Gaussian kernels are shifted, the message passing operation in the inner part of the mean field inference can be computed using the fast bilateral filtering approach. Therefore, the inference algorithm remains tractable.
• Experimental results on the popular and challenging pedestrian parsing benchmark, the Penn-Fudan dataset [6], for semantic human segmentation, and also on the HumanEva I [7], Extended Leeds Sports Pose [8], and KTH Football I [9] datasets, show that the proposed method outperforms the method of Xia et al. [10], which is the state of the art in HB segmentation, in terms of the per-pixel accuracy measure. It also achieves a substantial improvement in finding the locations of the corresponding joints, according to the probability of correct parts (PCP) and probability of correct key points (PCK) measures, in comparison with Chu et al. [2], which is the state of the art in 2D pose estimation.
The rest of this paper is organized as follows. In Section 2, related literature and previous research are reviewed. In Section 3, the method of Kraehenbuehl et al. [11], which is necessary for explaining the proposed method, is reviewed. In Section 4, the proposed method is explained. Next, in Section 5, experimental results are given. Finally, Section 6 concludes the paper.

2-Related Work
The problems of HB part segmentation and pose estimation can be approached simultaneously. The best graphical model for solving this joint problem would have to take into account the relations among all image pixels. However, when image pixels are considered as the nodes of a fully connected graphical model, the computational cost of the inference step becomes very high. Kraehenbuehl et al. [11] showed that inference in a dense CRF can be performed successfully by mean field approximation using efficient high dimensional Gaussian filtering operations [12]. Their method is specifically designed for the general segmentation problem, without any constraint on the articulation of HB parts. The kernels are Gaussian functions on the position or color space; no other image features, such as the histogram of oriented gradients (HOG) [13], are used. Other researchers have tried to use this efficient inference and filtering in pose estimation tasks. Vineet et al. [14] used this efficient inference for joint HB pose estimation, segmentation, and depth estimation in a method called PoseField.
However, the energy function defined by them is not specialized for HB and does not reflect the HB skeleton model. Kiefel et al. [15] tried to extend the inference method introduced in [11] to pose estimation problem. They introduced the field of parts method to detect HB joints in 2D images. In their method, the local appearance and joint spatial configuration of HB are modeled. Recently, models based on deep convolutional neural networks (DCNN) have been studied extensively in 2D human pose estimation [1], [2], [3], [16]. The convolutional pose machines (CPM) architecture proposed by Wei et al. [16] is a sequential convolutional neural network that enforces intermediate supervision at the end of each stage to prevent vanishing gradients. DeeperCut [1] is a multi-person pose estimation approach that adapts the deep residual network for human body part detection and uses integer linear programming to jointly detect multiple persons and estimate their body part configurations. Chu et al. [2] incorporated the DCNN with a multi-context attention mechanism into an end-to-end framework for human pose estimation. They adapt stacked hourglass networks to generate attention maps from features at multiple resolutions with various semantics. Bulat et al. [3] designed a DCNN cascaded architecture specifically for learning part relationships and spatial context. The first part of their cascade outputs part detection heat maps and the second part performs regression on these heat maps to estimate the 2D body pose. Kazemi et al. [9] tried to learn the body shape in a discriminative approach using random forest (RF) classifier to capture the variations in appearances of HB parts in 2D images. Semantic segmentation and human parsing based on shape-based methods has been studied in [17]. They generate region proposals, rank them using shape and appearance features, and assemble the proposals with simple geometric constraints. 
A Bayesian framework for jointly estimating articulated body pose and pixel-level segmentation of each body part is proposed in [18].
Wang et al. [19] proposed a joint solution that tackles semantic object and part segmentation simultaneously. In that method [19], the higher object-level context is provided to guide the part segmentation process, and the more detailed part-level localization is used to refine the object segmentation process. A deep decompositional network (DDN) for parsing pedestrian images into semantic regions is proposed in [20]. This method tries to directly map low-level visual features to the label maps of body parts. Top-down pose cues as well as deep-learned features are used in an and-or graph (AOG) for semantic part assembling [10]. This method tries to refine the semantic parts of objects by using the pose cues. The DeepLab framework [21] augments a fully convolutional network with dilated convolutions, atrous spatial pyramid pooling, and a CRF. DeepLab obtains state-of-the-art performance in the general problem of semantic segmentation. Guler et al. [22] proposed a surface-based framework for dense human pose estimation and body part segmentation. It is based on finding a dense correspondence between the image and a surface of the human body. Since there is no large-scale dataset containing correspondences between images and the human body surface, this method has some difficulties with general and natural images. An occlusion-aware framework for human pose estimation is proposed in [23]. It is based on adversarial training of a Convolutional Neural Network (CNN). The authors designed discriminators to distinguish real poses from fake ones (such as biologically implausible ones) in order to avoid fake estimated poses. Peng et al. [24] used data augmentation in the training phase of an adversarial learning framework. They proposed to optimize data augmentation and network training jointly to avoid overfitting in human pose estimation. Yang et al. [25] tried to learn the 3D human pose structure from a dataset with only 2D pose annotations as the ground-truth.
Their method is based on an adversarial learning framework that uses multi-source discriminators to distinguish the predicted 3D poses from the ground-truth ones. In effect, they force the pose estimator to generate anthropometrically valid poses even for images of natural scenes. Chen et al. [26] proposed a method for multi-person pose estimation in challenging scenes that contain occluded or invisible key points and complex backgrounds. They used cascaded networks of GlobalNet and RefineNet. Simple key points like eyes and hands are localized with GlobalNet. Hard key points, such as occluded or invisible ones, are addressed with RefineNet. However, this method handles only the pose estimation problem and does not address the body part segmentation problem.
PoseTrack is a large-scale benchmark for video-based human pose estimation and articulated tracking [27]. It is better suited to multiple human tracking in video sequences than to body part segmentation, since it does not provide any ground-truth information for segmented human body regions. It is worth mentioning that the proposed method differs from the work of Kraehenbuehl et al. [11] in that its CRF formulation is specifically defined according to the HB configuration, such that HB segments are naturally considered to appear in a set of constrained positions relative to each other. The definitions of the pairwise energy terms are also different from that work. Since the graphical model used in the proposed method is a fully-connected graph over image pixels, it is similar to the work of Kiefel et al. [15]. However, that method does not produce an HB part segmentation, and its authors only report PCP values on the Leeds Sports Pose [8] dataset.

3-Efficient Mean Field in Object Segmentation
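Kraehenbuehl et al. [11] proposed an efficient mean field inference method for the general segmentation problem. Their method is not designed for articulated objects such as the human body and is only evaluated on the PASCAL dataset for the general object segmentation problem. This section briefly reviews the method of Kraehenbuehl et al. [11], which is needed for introducing the proposed method in the next section. They defined a conditional random field over a set of random variables $X = \{X_1, \dots, X_N\}$, where $N$ is the total number of pixels in image $I$. Each variable has a set of possible labels $\mathcal{L} = \{l_0, l_1, \dots, l_K\}$, where $l_0$ corresponds to the background and $l_1, \dots, l_K$ are the other possible pixel labels. The conditional random field is characterized by the Gibbs energy function defined on this graph by

$$E(\mathbf{x}) = \sum_{i} \psi_u(x_i) + \sum_{i<j} \psi_p(x_i, x_j) \qquad (1)$$

where $i$ and $j$ range from $1$ to $N$. The Gibbs energy function is a summation of unary and pairwise terms. The unary term $\psi_u(x_i)$ is the cost of assigning label $x_i$ to random variable $X_i$. The second term, $\psi_p(x_i, x_j)$, measures the cost of assigning labels $x_i$ and $x_j$ to two neighboring pixels $i$ and $j$, respectively. The pairwise term is the cost of assigning two different labels to two arbitrary pixels, given by

$$\psi_p(x_i, x_j) = \mu(x_i, x_j) \sum_{m=1}^{M} w^{(m)} k^{(m)}(\mathbf{f}_i, \mathbf{f}_j) \qquad (2)$$

$$k^{(m)}(\mathbf{f}_i, \mathbf{f}_j) = \exp\left(-\tfrac{1}{2} (\mathbf{f}_i - \mathbf{f}_j)^{\top} \big(\Sigma^{(m)}\big)^{-1} (\mathbf{f}_i - \mathbf{f}_j)\right) \qquad (3)$$

in which vectors $\mathbf{f}_i$ and $\mathbf{f}_j$ are the feature vectors of pixels $i$ and $j$ in an arbitrary feature space, respectively, $w^{(m)}$ is the weight of the kernel, $m$ is the index of the kernel, $M$ is the number of kernels, and $\mu$ is the label compatibility function described below. $\Sigma^{(m)}$ is the matrix of variances between $\mathbf{f}_i$ and $\mathbf{f}_j$ of the $m$-th Gaussian kernel.

The energy function defined in the CRF formulation is minimized during the inference phase. The mean field approximation is an iterative process that, instead of computing the exact distribution $P(X)$, computes an approximation $Q(X)$ that minimizes the KL-divergence $D(Q \,\|\, P)$ among all distributions $Q$ that can be expressed as a product of independent marginals,

$$Q(X) = \prod_{i} Q_i(X_i).$$

According to the energy function defined in Equation (1), the closed-form solution of the mean field approximation can be written as

$$Q_i(l) = \frac{1}{Z_i} \exp\left\{-\psi_u(x_i = l) - \tilde{Q}_i(l)\right\} \qquad (4)$$

where $Q_i(l)$ is the belief of pixel $i$ about having the label $l$ and is updated in iterative steps. $Z_i$ is defined as $Z_i = \sum_{l \in \mathcal{L}} \exp\{-\psi_u(x_i = l) - \tilde{Q}_i(l)\}$ and is the normalization term. Also, $\psi_u(x_i = l)$ carries the initial belief about pixel $i$ having the label $l$. The belief of all other pixels about pixel $i$ having the part label $l$ is defined as

$$\tilde{Q}_i(l) = \sum_{l' \in \mathcal{L}} \mu(l, l') \sum_{m=1}^{M} w^{(m)} \bar{Q}^{(m)}_i(l') \qquad (5)$$

in which $\mu(l, l')$ is the label compatibility function between two possible labels $l$ and $l'$ for each pixel. A simple label compatibility function is the Potts model, in which

$$\mu(l, l') = [\, l \neq l' \,] \qquad (6)$$

where $[\cdot]$ denotes the indicator function. $w^{(m)}$ is the weight of the $m$-th Gaussian kernel, $M$ is the total number of kernels, and

$$\bar{Q}^{(m)}_i(l) = \sum_{j \neq i} k^{(m)}(\mathbf{f}_i, \mathbf{f}_j)\, Q_j(l) \qquad (7)$$

in which $k^{(m)}(\mathbf{f}_i, \mathbf{f}_j)$ is the Gaussian kernel defined in Equation (3). It is worth mentioning that Equation (7) is evaluated once for all pixels by using the permutohedral lattice filtering: every channel of the matrix $Q$ is blurred by the Gaussian kernel $k^{(m)}(\mathbf{f}_i, \mathbf{f}_j)$, as in Equation (7), applied over all image pixels. By substituting Equations (5), (6), and (7) in Equation (4), the message passing is performed as

$$Q_i(l) = \frac{1}{Z_i} \exp\left\{-\psi_u(x_i = l) - \sum_{l' \in \mathcal{L}} [\, l \neq l' \,] \sum_{m=1}^{M} w^{(m)} \sum_{j \neq i} k^{(m)}(\mathbf{f}_i, \mathbf{f}_j)\, Q_j(l')\right\} \qquad (8)$$

Since the graphical model is a fully-connected graph, the message passing step is the bottleneck of the mean field approximation. If evaluated naively, its run-time is quadratic in the number of pixels $N$.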
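The update of Equations (4) to (7) can be illustrated with a minimal sketch that evaluates the message passing naively, in time quadratic in the number of pixels. It is not the implementation of [11], which relies on permutohedral lattice filtering. The function names, the data layout (plain Python lists), and the restriction to diagonal covariance matrices are assumptions made for this illustration only.

```python
import math

def gaussian_kernel(fi, fj, inv_var):
    """Gaussian kernel of Equation (3), assuming a diagonal covariance.

    inv_var[d] is the inverse variance of feature dimension d (an assumption).
    """
    return math.exp(-0.5 * sum(w * (a - b) ** 2 for a, b, w in zip(fi, fj, inv_var)))

def mean_field_step(Q, unary, feats, kernels):
    """One naive O(N^2) mean field update with the Potts compatibility.

    Q[i][l]      current belief of pixel i about label l
    unary[i][l]  unary cost of label l at pixel i
    feats[i]     feature vector of pixel i
    kernels      list of (weight, inv_var) pairs, one per Gaussian kernel
    """
    n, num_labels = len(Q), len(Q[0])
    new_Q = []
    for i in range(n):
        # Equation (7) and the weighted sum over kernels of Equation (5):
        # filter the beliefs of all other pixels.
        blurred = [0.0] * num_labels
        for j in range(n):
            if j == i:
                continue
            k = sum(w * gaussian_kernel(feats[i], feats[j], iv) for w, iv in kernels)
            for l in range(num_labels):
                blurred[l] += k * Q[j][l]
        # Equations (5) and (6): with the Potts model, label l is penalised
        # by the filtered beliefs of every other label.
        total = sum(blurred)
        message = [total - blurred[l] for l in range(num_labels)]
        # Equation (4): exponentiate and normalise.
        scores = [math.exp(-unary[i][l] - message[l]) for l in range(num_labels)]
        z = sum(scores)
        new_Q.append([s / z for s in scores])
    return new_Q

# Toy example: three pixels on a line, two labels.
feats = [(0.0,), (1.0,), (10.0,)]
unary = [[0.1, 2.0], [1.0, 1.0], [2.0, 0.1]]
Q = [[math.exp(-u) for u in row] for row in unary]
Q = [[q / sum(row) for q in row] for row in Q]
for _ in range(5):
    Q = mean_field_step(Q, unary, feats, kernels=[(1.0, (1.0,))])
```

In this toy example the middle pixel has an uninformative unary term, so its final label is driven by its close neighbor (the first pixel), while the distant third pixel has practically no influence on it.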

4-Proposed Method
The block diagram of the proposed method is illustrated in Figure 1. The Image block is the input to the method. The FCRF block is a pre-processing step that computes the initial pose needed in the next block; the details of this pre-processing step are explained in Subsection 4-1. The BodyField Graphical Model block is the proposed method, which is explained in detail in Subsection 4-2. The Segmented HB Parts block is an output of the method. The Mean-Shift block is a post-processing step that is applied to the distribution of the segmented body parts to compute the final estimated pose. The Final Pose block is the final estimated pose and is also an output of the method.

4-1-Pre-processing: Computing the Initial Pose by a Fully Connected Pairwise CRF
A fully connected pairwise CRF is proposed for computing the initial pose that is needed in the proposed dual pose and segmentation method. The graphical model of the human body according to this CRF is shown in Figure 2. The nodes of this graph are the human body joints, all of which are connected to each other. Images are first processed with the DeeperCut 2D part detector [1] to obtain the score map of the body joints. The score map, $S$, is an array of size $W \times H \times C$, where $W$ and $H$ are the width and height of the image, respectively, and $C$ is the number of body joints (14) plus a special class for the background. The unary term of the energy function (Equation (9)) is computed from this first output of the body part detector, $S$.

Another output of the 2D part detector is $O$, an array of size $W \times H \times P \times 2$, where $P$ indicates the number of permutations of length two of 14 distinct variables ($P = 14 \times 13 = 182$), and 2 is for the two dimensions $x$ and $y$. According to the output of the part detector [1], if a pixel in location $(x_i, y_i)$ has the joint label $a$, it is expected that the joint $b$ will occur with an offset

$$\big(O(x_i, y_i, \mathrm{idx}(a,b), 1),\; O(x_i, y_i, \mathrm{idx}(a,b), 2)\big) \qquad (10)$$

from it. Here, $\mathrm{idx}(a,b)$ is an index between 1 and $P$ which indicates the permutation belonging to joints $a$ and $b$, according to [1]. Therefore, if joint $a$ is in location $(x_i, y_i)$, then the model expects joint $b$ to be in location

$$\mathbf{e}_{b \mid i} = \big(x_i + O(x_i, y_i, \mathrm{idx}(a,b), 1),\; y_i + O(x_i, y_i, \mathrm{idx}(a,b), 2)\big) \qquad (11)$$

In the same way, if a pixel in location $(x_j, y_j)$ has the joint label $b$, then according to the output of the part detector [1] it expects the joint $a$ to be at the offset

$$\big(O(x_j, y_j, \mathrm{idx}(b,a), 1),\; O(x_j, y_j, \mathrm{idx}(b,a), 2)\big) \qquad (12)$$

from it. Therefore, the expected location of joint $a$ from the point of view of pixel $(x_j, y_j)$ that has joint label $b$ is

$$\mathbf{e}_{a \mid j} = \big(x_j + O(x_j, y_j, \mathrm{idx}(b,a), 1),\; y_j + O(x_j, y_j, \mathrm{idx}(b,a), 2)\big) \qquad (13)$$

The difference vector between the expected location of joint $b$ from the point of view of pixel $(x_i, y_i)$ that has joint label $a$ and pixel $(x_j, y_j)$ that has joint label $b$ is

$$\mathbf{d}_{ab} = \mathbf{e}_{b \mid i} - (x_j, y_j) \qquad (14)$$

Also, the difference vector between the expected location of joint $a$ from the point of view of pixel $(x_j, y_j)$ that has joint label $b$ and pixel $(x_i, y_i)$ that has joint label $a$ is

$$\mathbf{d}_{ba} = \mathbf{e}_{a \mid j} - (x_i, y_i) \qquad (15)$$

The pairwise term (Equation (16)) is the cost of assigning label $a$ to pixel $(x_i, y_i)$ and label $b$ to pixel $(x_j, y_j)$. The inference in the proposed fully connected CRF is computed by the loopy belief propagation method [28].
This pre-processing step improves the pose estimated by the DeeperCut method. A comparison between the pose estimated in this pre-processing step (FCRF) and the DeeperCut method is provided in the experimental results of Section 5. The initial pose obtained by the pre-processing step (FCRF) is used for computing the required shift values in the proposed method, described in the next subsection.
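The geometric quantities used in this pre-processing step, namely the location at which a labeled pixel expects a partner joint and the two difference vectors between a pair of labeled pixels, can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the offsets are passed in directly instead of being read from the detector output of [1], the sign convention (expected location minus observed pixel) is assumed, and the way the pairwise cost is built from the difference vectors is not reproduced here.

```python
def expected_location(pixel, offset):
    """Location where a pixel expects its partner joint.

    pixel  (x, y) position of a pixel carrying some joint label
    offset (dx, dy) offset predicted at that pixel towards the partner joint
    """
    return (pixel[0] + offset[0], pixel[1] + offset[1])

def difference_vectors(pixel_a, offset_ab, pixel_b, offset_ba):
    """Difference vectors between expected and observed joint locations.

    pixel_a   position of a pixel labelled with joint a
    offset_ab offset predicted at pixel_a towards joint b
    pixel_b   position of a pixel labelled with joint b
    offset_ba offset predicted at pixel_b towards joint a
    """
    exp_b = expected_location(pixel_a, offset_ab)  # where a expects b
    exp_a = expected_location(pixel_b, offset_ba)  # where b expects a
    # Assumed convention: expected location minus observed pixel position.
    d_ab = (exp_b[0] - pixel_b[0], exp_b[1] - pixel_b[1])
    d_ba = (exp_a[0] - pixel_a[0], exp_a[1] - pixel_a[1])
    return d_ab, d_ba

# Hypothetical numbers: the two pixels almost agree with each other's predictions.
d_ab, d_ba = difference_vectors((10, 20), (5, 0), (15, 21), (-4, -1))
```

Small difference vectors indicate that the two labeled pixels are mutually consistent with the offsets predicted by the part detector.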

4-2-BodyField Graphical Model Definition
According to Figure 3, the human skeleton model is considered to contain 14 joints, and their connections are set according to the HB configuration. Furthermore, as the graph is not restricted to be a tree, the model can easily be extended to an arbitrary number of HB parts, and there is no hard constraint on the number of joints in the model. Figure 4 illustrates the proposed fully-connected graphical model. The nodes in the proposed graphical model are image pixels, and the pairwise terms are the weights of the connections between two arbitrary pixels. A pixel is shown to be connected to all other pixels with labels in $\mathcal{L}$. The same is true for all other pixels (due to visualization restrictions, the other connections are not shown). It is also important to note that there is no connection, and thus no pairwise term, between a pixel and itself. Since the pairwise terms between two pixels depend on the label compatibility, image labels are separated into channels for visualization purposes. These 15 channels should be merged to create a fully connected graphical model. Therefore, there are $W \times H$ nodes in the graph, in which $W$ and $H$ are the width and height of the image, respectively. Also, $Q(l)$ is the probability of assigning label $l$ to a set of image pixels $X$. The energy function should be defined such that a true configuration of HB corresponds to its minimum value; otherwise, finding the minimum of the energy function will not lead to a good configuration. Note that the sum of the probability values of all parts for each pixel is one. The label assigned to each pixel is the HB part with the highest probability value among all HB parts and the background. The pairwise energy terms are defined such that they have lower values when the two paired pixels have correct HB part labels. In the general segmentation problem, it is assumed that pixels that are close to each other (in the feature space) lie in the same segment.
This assumption can be met in general segmentation problems, but it does not always hold in HB part segmentation. Some body parts should occur at pre-defined distances from each other, in accordance with the existence of a connection between the related joints in the HB skeleton model. The inference process tries to find the minimum of the energy function; in the final solution, all pixels that are nearby (generally, in the feature space) will have similar labels. In pose estimation problems, however, this is not always true, simply because there might be nearby and similar pixels in the image of HB that do not belong to the same part. In images of HB, there are three common categories of relationships among pixels.
• Pixels that are close in the feature space and belong to the same HB part.
• Pixels that are close in the feature space but do not belong to the same HB part, while their corresponding parts are connected in the HB skeleton model.
• Pixels that may or may not be close in the feature space and do not belong to the same HB part, and whose corresponding parts are not connected in the HB skeleton model.
Pixels belonging to the third type can move and eventually appear close to each other; e.g. the wrist can appear near the other parts of HB. When defining the energy function and pairwise terms, all of these situations should be considered, and a suitable kernel and compatibility function should be assigned to any two labels. For any two arbitrary pixels, according to their labels, two different pairwise terms are defined: one for resolving the first and the third types, and the other for resolving the second type. In the proposed method the energy function is defined as

$$E(\mathbf{x} \mid I, \theta) = \sum_{i} \psi_u(x_i) + \sum_{i<j} \psi_p(x_i, x_j) + \sum_{i<j} \hat{\psi}_p(x_i, x_j) \qquad (17)$$

where $i$ and $j$ range from $1$ to $N$. Variable $\theta$ denotes the set of parameters of HB. It is computed from the initial pose that is estimated in Subsection 4-1. For the sake of conciseness, in the remainder of the paper, $I$ and $\theta$ are omitted in equations.

The first type of pairwise term, $\psi_p$, is the one defined in Equation (2) between two arbitrary pixels $i$ and $j$. If these two pixels are close to each other in the feature space, the energy cost for assigning different labels to them is high. When minimizing the energy function during the inference process, this configuration of labeling (two nearby pixels with two different labels) will be avoided. Therefore, in the best configuration, neighboring pixels tend towards identical labels. This is generally true in articulated HB shapes, and therefore these pairwise terms are defined between any two arbitrary pixels by using a simple Potts model.

The second type of pairwise energy term is specifically defined to encode the constraints of HB joints in the proposed CRF formulation, given by

$$\hat{\psi}_p(x_i, x_j) = \hat{\mu}(x_i, x_j) \sum_{m} \hat{w}^{(m)} \hat{k}^{(m)}(\mathbf{f}_i, \mathbf{f}_j) \qquad (18)$$

where $\hat{w}^{(m)}$ is the weight of the shifted kernel function. The label compatibility function $\hat{\mu}(l, l')$ is defined (Equation (19)) according to the existence of a connection between body part $l$ and body part $l'$ in the HB skeleton model, as shown in Figure 3. The value of $\hat{k}^{(m)}(\mathbf{f}_i, \mathbf{f}_j)$ is defined as

$$\hat{k}^{(m)}(\mathbf{f}_i, \mathbf{f}_j) = \exp\left(-\tfrac{1}{2} \big(\mathbf{f}_j - \mathbf{f}_i - \boldsymbol{\delta}^{(l,l')}\big)^{\top} \big(\Sigma^{(m)}_{l,l'}\big)^{-1} \big(\mathbf{f}_j - \mathbf{f}_i - \boldsymbol{\delta}^{(l,l')}\big)\right) \qquad (20)$$

where $\Sigma^{(m)}_{l,l'}$ is the variance matrix between the feature vectors of joint $l$ and joint $l'$ of the $m$-th shifted Gaussian kernel, and $\boldsymbol{\delta}^{(l,l')}$ is the mean expected difference vector between the features $\mathbf{f}_i$ and $\mathbf{f}_j$. When the features are simply the positions of points, the value of $\boldsymbol{\delta}^{(l,l')}$ is a difference vector that is computed from the initial pose estimated by the pre-processing step of Subsection 4-1.

Let us consider two arbitrary pixels which have two different labels and are connected, according to their labels, in the HB skeleton model. The pairwise term defined for these two pixels takes its minimum value when the pixels are placed at a predefined distance from each other. By this definition, the Gaussian term is shifted by $\boldsymbol{\delta}^{(l,l')}$, such that the mean of the Gaussian lies on pixels for which the difference between their features and the feature of pixel $i$ is $\boldsymbol{\delta}^{(l,l')}$. The pairwise energy is the weight of the edges between pixels in the fully connected model, and it depends on the labels of the pixels. There are pairwise energy terms between any two pixels. The skeleton graph imposes some constraints on the HB skeleton model, and the goal of the proposed method is to enforce all of these constraints in the mean field approximation process. Note that, up to here, body part lengths and nearby joints that are connected in the skeleton graph are encoded in the energy function of the fully connected conditional random field model that is defined on image pixels. In defining the pairwise energy terms between two arbitrary pixels, kernel is used that is a 36-D vector (

4-3-Efficient Inference Via High Dimensional Gaussian Filtering
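According to the energy function defined in Equation (17), the closed-form solution of the mean field approximation can be written as

$$Q_i(l) = \frac{1}{Z_i} \exp\left\{-\psi_u(x_i = l) - \tilde{Q}_i(l) - \hat{Q}_i(l)\right\} \qquad (21)$$

where $Q_i(l)$ is the belief of pixel $i$ about having the label $l$ and is updated in iterative steps. $Z_i$ is defined as $Z_i = \sum_{l \in \mathcal{L}} \exp\{-\psi_u(x_i = l) - \tilde{Q}_i(l) - \hat{Q}_i(l)\}$ and is the normalization term. $\psi_u(x_i = l)$ carries the initial belief about pixel $i$ having the label $l$. $\tilde{Q}_i(l)$ and $\hat{Q}_i(l)$ are the beliefs of all other pixels about pixel $i$ having the part label $l$, obtained with the non-shifted and the shifted Gaussian kernels, respectively. The value of $\hat{Q}_i(l)$ in Equation (21) is defined as

$$\hat{Q}_i(l) = \sum_{l' \in \mathcal{L}} \hat{\mu}(l, l') \sum_{m} \hat{w}^{(m)} \hat{Q}^{(m)}_i(l, l') \qquad (22)$$

in which $\hat{\mu}(l, l')$ is the label compatibility function defined in Equation (19), $\hat{w}^{(m)}$ is the weight of the shifted Gaussian kernel, and

$$\hat{Q}^{(m)}_i(l, l') = \sum_{j \neq i} \hat{k}^{(m)}(\mathbf{f}_i, \mathbf{f}_j)\, Q_j(l') \qquad (23)$$

in which $\hat{k}^{(m)}(\mathbf{f}_i, \mathbf{f}_j)$ is the shifted Gaussian kernel defined in Equation (20) for the label pair $(l, l')$. It is worth mentioning that Equations (7) and (23) are evaluated once for all pixels by using the permutohedral lattice filtering. Every channel of the matrix $Q$ is blurred by the Gaussian kernel $k^{(m)}(\mathbf{f}_i, \mathbf{f}_j)$, as in Equation (7), and by the shifted Gaussian kernel $\hat{k}^{(m)}(\mathbf{f}_i, \mathbf{f}_j)$, as in Equation (23), applied over all image pixels. By substituting Equations (5), (7), (22), and (23) in Equation (21), the message passing is performed as

$$Q_i(l) = \frac{1}{Z_i} \exp\left\{-\psi_u(x_i = l) - \sum_{l' \in \mathcal{L}} \mu(l, l') \sum_{m} w^{(m)} \sum_{j \neq i} k^{(m)}(\mathbf{f}_i, \mathbf{f}_j)\, Q_j(l') - \sum_{l' \in \mathcal{L}} \hat{\mu}(l, l') \sum_{m} \hat{w}^{(m)} \sum_{j \neq i} \hat{k}^{(m)}(\mathbf{f}_i, \mathbf{f}_j)\, Q_j(l')\right\} \qquad (24)$$

Since the graphical model is a fully-connected graph, the message passing step is the bottleneck of the mean field approximation. If evaluated naively, its run-time is quadratic in the number of pixels $N$. As another contribution of the proposed method, shifted Gaussian kernels are used in the pairwise terms in addition to the non-shifted Gaussian kernels, while the inference step is kept computationally tractable.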
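The role of the shifted Gaussian kernel in the message passing can be illustrated with a small sketch that evaluates the shifted-kernel sum for one pixel and one label by brute force. Several points are assumptions made only for this illustration: the compatibility is reduced to an indicator of whether two parts are connected in the skeleton (the sign and value used in the energy are not reproduced), the covariance is diagonal, a single kernel is used, and all names and numbers are hypothetical.

```python
import math

def shifted_gaussian(fi, fj, shift, inv_var):
    """Shifted Gaussian kernel: largest when fj - fi equals the expected shift."""
    return math.exp(
        -0.5 * sum(w * (b - a - s) ** 2 for a, b, s, w in zip(fi, fj, shift, inv_var))
    )

def shifted_support(i, label, Q, feats, shifts, connected, weight, inv_var):
    """Brute-force shifted-kernel sum for pixel i and one label.

    Q[j][l]         belief of pixel j about label l
    feats[j]        feature (here: position) vector of pixel j
    shifts[(l, l2)] expected feature difference from part l to part l2
    connected[l]    labels connected to l in the skeleton (assumed indicator)
    """
    support = 0.0
    for other in connected.get(label, ()):
        shift = shifts[(label, other)]
        for j in range(len(Q)):
            if j != i:
                k = shifted_gaussian(feats[i], feats[j], shift, inv_var)
                support += weight * k * Q[j][other]
    return support

# Toy example with two connected parts: label 0 ("elbow") and label 1 ("wrist").
feats = [(0.0, 0.0), (0.0, 30.0), (30.0, 0.0)]
Q = [[0.5, 0.5], [0.0, 1.0], [0.0, 1.0]]
connected = {0: [1], 1: [0]}
shifts = {(0, 1): (0.0, 30.0), (1, 0): (0.0, -30.0)}
inv_var = (1.0 / 25.0, 1.0 / 25.0)

support_elbow = shifted_support(0, 0, Q, feats, shifts, connected, 1.0, inv_var)
support_wrist = shifted_support(0, 1, Q, feats, shifts, connected, 1.0, inv_var)
```

In this example a confident wrist pixel lies exactly at the expected offset from pixel 0, so the elbow hypothesis at pixel 0 collects a response close to one, whereas the wrist hypothesis collects none. The wrist pixel at the wrong offset contributes practically nothing.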

4-4-Implementation Details of Shifted Gaussian Kernels
Permutohedral lattice high dimensional Gaussian filtering performs the filtering task in three steps [12]:
• Splatting the points to the lattice space.
• Performing the blurring process in the lattice space.
• Slicing the lattice to find the final values of the blurred points.
Splatting is the initial phase of lattice construction in the high dimensional space, according to the definition in [12]. Since the value of $Q_j(l')$ is to be blurred with position vectors that are shifted by $\boldsymbol{\delta}^{(l,l')}$, the position vector would first have to be shifted by $\boldsymbol{\delta}^{(l,l')}$ before performing the blurring task. However, in the lattice space, the operation of shifting and then blurring is equivalent to blurring and then slicing at the shifted positions. Substituting Equation (20) in Equation (23) results in

$$\hat{Q}^{(m)}_i(l, l') = \sum_{j \neq i} \exp\left(-\tfrac{1}{2} \big(\mathbf{f}_j - \mathbf{f}_i - \boldsymbol{\delta}^{(l,l')}\big)^{\top} \big(\Sigma^{(m)}_{l,l'}\big)^{-1} \big(\mathbf{f}_j - \mathbf{f}_i - \boldsymbol{\delta}^{(l,l')}\big)\right) Q_j(l') \qquad (25)$$

The permutohedral lattice filter [12] is implemented in the ImageStack library [29], which is a toolbox for high dimensional Gaussian filtering. It is used for performing the high dimensional blurring in the inference step of the proposed method. In the implementation, according to Equation (25), $Q$ is a matrix of size $(W \times H \times |\mathcal{L}|)$, given that the input image is of size $(W \times H)$, where $|\mathcal{L}|$ is the total number of body part and background labels, and $Q_i(p)$ is the probability of part $p$ for each arbitrary pixel $i$ of the image. It is necessary that $\sum_{p} Q_i(p) = 1$ for every $i \in \mathcal{V}$, in which $\mathcal{V}$ is the set of all image pixels. Taking Equations (7) and (23) into account, it is apparent that both equations are similar, except that in the latter the Gaussian weights are shifted.
Baek et al. [30] proved that, to use shifted Gaussian kernels, it is sufficient to slice the lattice at shifted positions. Using the ImageStack library, the lattice points are ordinary position vectors without shifting, and the values of $Q_j(l')$ are blurred in the lattice space by using the position vectors in the Gaussian weights. Afterwards, the lattice is sliced to find the final values of $\hat{Q}^{(m)}_i(l, l')$ in the initial space. In Equation (7), these operations are performed once for all channels of the matrix $Q$, using only a single lattice. The permutohedral lattice filter reduces the time complexity of the Gaussian filtering operation from quadratic to linear in $N$, where $N$ is the number of points to be blurred, with a cost that also depends on the dimension $d$ of the position space. Nevertheless, its three required steps of splatting, blurring, and slicing are time consuming, specifically in high dimensional spaces like the HOG feature space. In the proposed method, the shifted Gaussian filtering is performed several times, which is time consuming. To further speed up the process, one can force all $\Sigma^{(m)}_{l,l'}$ to be the same for all $l$ and $l'$, so that some steps need only be performed once when updating the belief about each label, as done in Equation (24). Note that, for all shifted Gaussian kernels that use the same feature space and covariance matrix, constructing and blurring the lattice in the feature space is performed only once. In contrast, due to the different values of $\boldsymbol{\delta}^{(l,l')}$, the lattice is sliced at different shifted positions.
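The equivalence between filtering with a shifted Gaussian kernel and blurring followed by slicing at shifted positions can be checked on a tiny one-dimensional example. The sketch below evaluates both sides exactly by brute force; it does not use the permutohedral lattice or the ImageStack library, and the function names are hypothetical. For simplicity the sums run over all points, including the query point itself.

```python
import math

def blur_at(query, positions, values, sigma):
    """Gaussian-weighted sum of the values, evaluated at an arbitrary query position."""
    return sum(
        v * math.exp(-0.5 * ((p - query) / sigma) ** 2)
        for p, v in zip(positions, values)
    )

def shifted_filter(positions, values, shift, sigma):
    """Direct filtering with a shifted kernel, centred at p_i + shift for each point i."""
    return [
        sum(
            v * math.exp(-0.5 * ((pj - pi - shift) / sigma) ** 2)
            for pj, v in zip(positions, values)
        )
        for pi in positions
    ]

def blur_then_slice(positions, values, shift, sigma):
    """Ordinary (non-shifted) blur, read out ("sliced") at the shifted positions."""
    return [blur_at(pi + shift, positions, values, sigma) for pi in positions]

positions = [0.0, 1.0, 2.0, 3.0, 4.0]
values = [0.0, 0.0, 1.0, 0.0, 0.0]
direct = shifted_filter(positions, values, shift=2.0, sigma=1.0)
sliced = blur_then_slice(positions, values, shift=2.0, sigma=1.0)
```

Both lists agree up to floating point rounding, which is why a single blurred lattice can serve several shifts: only the slicing positions change from one shift to another.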

5-Experimental Results
According to the definition in [31], a part is considered correctly localized if the average distance between its endpoints (joints) and the ground-truth data is less than a fixed fraction of the length between the annotated endpoints in the ground-truth data. For the Penn-Fudan dataset, a ground-truth body part segmentation is available. Since the face and hair are segmented into two different classes in this dataset, these data are used in the training phase of the BodyField method; that is, the training phase is performed with one extra class for this dataset. This shows that the proposed method generalizes to datasets with more finely segmented body part regions: extra classes can be defined for each of the finely segmented regions, and the mean expected difference vector between the fine regions and the other classes can be computed for use in the training phase of the BodyField method. Quantitative results of the proposed method on the challenging KTH Football I dataset are summarized in Table 1. The pre-processing step (FCRF) improves the results obtained by the DeeperCut method. The DeeperCut method [1] is also evaluated on this dataset in terms of total PCP. DeeperCut is a powerful body part detector that uses integer linear programming for estimating the pose from the probability map. However, it sometimes fails to estimate the correct pose of a player because of the high degree of motion blur in the images of the KTH Football I dataset. The proposed BodyField method improves the PCP of the original DeeperCut method, and it also improves the results obtained by Kazemi et al. [9] in terms of the PCP measure, due to its better and refined HB part segments. In the Extended Leeds Sports Pose dataset, the standard probability of correct key points (PCK) evaluation metric is used [1], [2].
According to the definition in [31], a candidate key point is considered to be correct if it falls within $\alpha \cdot \max(h, w)$ pixels of the ground-truth key point, where $h$ and $w$ are the height and width of the bounding box of the human, respectively, and $\alpha$ controls the relative threshold for considering correctness. The results in Table 2 are based on the Person-Centric ground truth with . Table 2 compares the proposed method with the methods of Chu et al. [2], Bulat et al. [3], Wei et al. [16], and Insafutdinov et al. [1]. As can be seen from Table 2, the original method of Insafutdinov et al. [1] has PCK. The pre-processing step, FCRF, improves the PCK to . The proposed BodyField method has efficiency in terms of the PCK measure. It outperforms the original method of Insafutdinov et al. [1] by , and also the method of Chu et al. [2] by in terms of the PCK measure.
For the HumanEva dataset, the standard measure of pose estimation accuracy is the average 2D error [7]. The proposed method is evaluated on sequences 1 and 2 of the walking, jogging, and balance actions. As can be seen in Table 3, the 2D error between the estimated pose and the ground-truth joint locations is decreased by using the proposed BodyField method. The overall average 2D error of Sigal et al. [7] is , while it decreases to in the pre-processing step, FCRF, and to in the proposed BodyField method. Since the official evaluation server of the HumanEva dataset, http://humaneva.is.tue.mpg.de/, is currently out of service, we used the validation set for reporting the average values of the 2D error.
The proposed method is also evaluated on the popular Penn-Fudan benchmark [6], which consists of pedestrians in outdoor scenes with large pose variation. The labels of the dataset include 7 body parts, namely hair, face, upper-clothes, lower-clothes, arms (arm skin), legs (leg skin), and shoes.
Also, since the proposed method segments the human body into 14 classes, we used a mapping process to convert these classes to those used in the Penn-Fudan dataset. For this conversion, the estimated pose is used as auxiliary information. Typical part segmentation results on these datasets are illustrated in Figure 8, Figure 9, and Figure 10. This dataset does not provide ground-truth joint locations, and therefore the pose estimation results cannot be evaluated on it in terms of the PCP or PCK measures. However, since pixel-by-pixel ground-truth segmentation is available for this dataset, the standard per-pixel accuracy metric is used [10]. Figure 9 and Figure 10 compare the proposed method with the methods of Xia et al. [10] and Bo et al. [17]. For the methods of Xia et al. [10] and Bo et al. [17], we used the source images provided by the authors. As shown in Table 4, the proposed method is compared with state-of-the-art methods, namely AOG [10], DDN [20], P&S [18], SBP [17], and Wang et al. [19], on the Penn-Fudan dataset. The proposed BodyField method outperforms the DDN [20] method by over and it has improvement in comparison with the state-of-the-art method of Xia et al. [10] (AOG method). The improvement is due to the fact that the estimated pose and the corresponding body part segments are refined simultaneously. In other words, the use of pose information in semantic human body part segmentation increases the per-pixel accuracy. More visual results of the inner steps of the BodyField method are available at http://ipl.ce.sharif.edu/bodyfield.html.
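For reference, the evaluation measures used in this section can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the evaluation code used for the reported results: the function names and array layouts are hypothetical, the PCK threshold is taken as alpha times the larger side of the person bounding box, and the values 0.5 and 0.2 in the usage lines are illustrative only.

```python
import numpy as np

def pcp(pred_a, pred_b, gt_a, gt_b, fraction):
    """PCP: a part is correct if the mean error of its two endpoints is
    less than `fraction` times the ground-truth part length."""
    err = 0.5 * (np.linalg.norm(pred_a - gt_a, axis=-1)
                 + np.linalg.norm(pred_b - gt_b, axis=-1))
    length = np.linalg.norm(gt_a - gt_b, axis=-1)
    return float(np.mean(err < fraction * length))

def pck(pred, gt, box_h, box_w, alpha):
    """PCK: a key point is correct if it lies within alpha * max(h, w) pixels
    of the ground truth (assumed form of the threshold)."""
    dist = np.linalg.norm(pred - gt, axis=-1)
    return float(np.mean(dist <= alpha * max(box_h, box_w)))

def mean_2d_error(pred, gt):
    """Average Euclidean distance (in pixels) between predicted and true joints."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def per_pixel_accuracy(pred_labels, gt_labels):
    """Fraction of pixels whose predicted label equals the ground-truth label."""
    return float(np.mean(pred_labels == gt_labels))

# Hypothetical toy data: two parts given by their endpoint coordinates (x, y).
gt_a = np.array([[0.0, 0.0], [0.0, 0.0]])
gt_b = np.array([[10.0, 0.0], [0.0, 10.0]])
pred_a = np.array([[1.0, 0.0], [0.0, 6.0]])
pred_b = np.array([[10.0, 1.0], [0.0, 16.0]])

# Hypothetical toy data: two key points inside a 100 x 50 bounding box.
gt = np.array([[0.0, 0.0], [0.0, 0.0]])
pred = np.array([[0.0, 0.0], [30.0, 0.0]])

print(pcp(pred_a, pred_b, gt_a, gt_b, fraction=0.5))
print(pck(pred, gt, box_h=100, box_w=50, alpha=0.2))
print(mean_2d_error(pred, gt))
print(per_pixel_accuracy(np.array([[0, 1], [2, 2]]), np.array([[0, 1], [2, 0]])))
```

PCP and PCK are both relative measures, normalized by the part length and by the bounding box size respectively, whereas the average 2D error is an absolute pixel distance.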

6-Conclusion
A new and efficient method for simultaneous single-view human body part segmentation and pose estimation is introduced, which opens a new approach to the problem of structured semantic segmentation. A new energy function is introduced that encodes the spatial dependency between human body parts, in addition to the available segmentation constraints. Although shifted Gaussian kernels are used in the proposed method, it is shown that the proposed energy function can be minimized by applying an efficient mean field approximation process. Due to challenges such as occlusion and self-occlusion, which occur frequently in human body pose data, previous learning methods that use only the appearance model cannot converge to a proper pose estimate, because not enough evidence is available about the occluded and self-occluded parts. The proposed BodyField method uses the probability map of the DeeperCut method to define a proper energy function with shifted Gaussian kernels between connected body parts. During the inference step, the evidence for occluded parts is refined by using the information of the other parts that are connected to them in the human body skeleton model. Although shifted Gaussian kernels (in the pairwise terms of the proposed energy function) add a large computational cost to the inference process, this problem is solved by an efficient mean field approximation algorithm that speeds up the message-passing steps even though the kernels are shifted. To demonstrate the effectiveness of the proposed fully connected model in comparison with state-of-the-art pose estimation methods, the probability maps of the DeeperCut method are used in the training phase, and it is shown that the results improve significantly on the KTH Football I, LSP, and HumanEva I datasets in comparison with the original DeeperCut method.
It is also shown that the BodyField method substantially improves HB segmentation on the Penn-Fudan dataset in terms of the per-pixel segmentation measure. The experimental results on the challenging KTH Football I, Extended Leeds Sports Pose, HumanEva I, and Penn-Fudan datasets show the superiority of the proposed method over other existing methods in terms of PCP, PCK, and per-pixel segmentation accuracy.