
Multimedia_Data_Mining_08



- The association of visual features and the textual words is determined in a Bayesian framework such that the confidence of the association can be provided.
- In the proposed probabilistic model, a hidden concept layer which connects the visual features and the word layer is discovered by fitting a generative model to the training images and annotation words.
- An Expectation-Maximization (EM) based iterative learning procedure is developed to determine the conditional probabilities of the visual features and the textual words given a hidden concept class.
- Based on the discovered hidden concept layer and the corresponding conditional probabilities, the image annotation and the text-to-image retrieval are performed using the Bayesian framework.
- The evaluations of the prototype system on 17,000 images and 7,736 automatically extracted annotation words from the crawled Web pages for multimodal image data mining and retrieval have indicated that the model and the framework are superior to a state-of-the-art peer system in the literature.
- The rest of the chapter is organized as follows: Section 7.2 introduces the motivations for this work and outlines its main contributions.
- In Section 7.4 the proposed probabilistic semantic model and the EM-based learning procedure are described.
- The acquisition of the training and testing data collected from the Web, and the experiments evaluating the proposed approach against a state-of-the-art peer system in several aspects, are reported in Section 7.6.
- In order to further reduce the semantic gap, multimodal approaches to image data mining and retrieval have recently been proposed in the literature [251] to explicitly exploit the redundancy co-existing in the collateral information accompanying the images.
- In this chapter, we propose a probabilistic semantic model and the corresponding learning procedure to address the problem of automatic image annotation and show its application to multimodal image data mining and retrieval.
- Specifically, we use the proposed probabilistic semantic model to explicitly exploit the synergy between the different modalities of the imagery and the collateral information.
- This hidden layer constitutes the concepts to be discovered through a probabilistic framework such that the confidence of the association can be provided.
- An Expectation-Maximization (EM) based iterative learning procedure is developed to determine the conditional probabilities of the visual features and the textual words given a hidden concept class.
- Based on the discovered hidden concept layer and the corresponding conditional probabilities, the image-to-text and text-to-image retrievals are performed in a Bayesian framework.
- It has been argued [217] that the COREL data are much easier to annotate and retrieve due to their small number of concepts and small variations of the visual content.
- In addition, the relatively small number (1,000 to 5,000) of training and test images typically used in the literature further makes the problem easier and the evaluation less convincing.
- In order to truly capture the difficulties in real scenarios such as Web image data mining and retrieval and to demonstrate the robustness and the promise of the proposed model and the framework in these challenging applications, we have evaluated the prototype system on a collection of 17,000 images with the automatically extracted textual annotations from various crawled Web pages.
- We have shown that the proposed model and framework work well on a very noisy image dataset of this scale and substantially outperform the state-of-the-art peer system MBRM [75].
- The association of visual features and textual words is determined in a Bayesian framework such that the confidence of the association can be provided.
- The experimental results demonstrate the superiority and the promise of the approach.
- A number of approaches have been proposed in the literature on automatic image annotation.
- The models consider image annotation as a process of translation from "visual language" to text and collect the co-occurrence information by the estimation of the translation probabilities.
- As noted by the authors [14], the performance of the models is strongly affected by the quality of image segmentation.
- The image annotation is based on the nearest-neighbor classification and word occurrence counting, while the correspondence between the visual content and the annotation words is not exploited.
- The essential idea is to first find annotated images that are similar to a test image and then use the words shared by the annotations of the similar images to annotate the test image.
- Improvement of the mining and retrieval performance is reported, attributed to the synergy of both modalities.
- To achieve automatic image annotation as well as multimodal image data mining and retrieval, a probabilistic semantic model is proposed for the training imagery and the associated textual word annotation dataset.
- First, a word about notation: f_i, i ∈ [1, N], denotes the visual feature vector of an image in the training database, where N is the size of the image database.
- In the probabilistic model, we assume the visual features of the images in the database are f_i = [f_i1, f_i2, ..., f_iL], where L is the dimension of the visual feature vector.
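To make the notation concrete, the following sketch sets up toy arrays of these shapes. The sizes and the names features, word_counts, and u are illustrative assumptions only, not the chapter's actual data (which comprises 17,000 images and 7,736 words); the later sketches reuse these arrays.

```python
import numpy as np

# Illustrative sizes only -- assumptions for the sketches below.
N, L, M, K = 200, 36, 500, 8   # images, feature dimension, vocabulary size, hidden concepts

rng = np.random.default_rng(0)

# f_i = [f_i1, ..., f_iL]: one L-dimensional visual feature vector per image.
features = rng.normal(size=(N, L))

# n(w_j^i): weight (occurrence frequency) of word w_j for image f_i.
# Most entries are zero because each image carries only a few annotation words.
word_counts = rng.poisson(0.02, size=(N, M)).astype(float)

# u_i: the number of annotation words attached to image f_i.
u = word_counts.sum(axis=1)
```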
- FIGURE 7.1: Graphic representation of the model proposed for the randomized data generation for exploiting the synergy between imagery and text.
- Σ_k and µ_k are the covariance matrix and mean of the visual features belonging to z_k, respectively.
- Following the likelihood principle, one determines P_F(f_i | z_k) by the maximization of the log-likelihood function.
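Since Σ_k and µ_k parameterize a Gaussian class-conditional density, P_F(f_i | z_k) can be evaluated with a short numpy routine. This is a minimal sketch of a plain multivariate normal log-density, not a reproduction of the chapter's equations.

```python
import numpy as np

def gaussian_logpdf(F, mu, Sigma):
    """log P_F(f_i | z_k) for every row f_i of F, assuming a multivariate
    normal with mean mu (length L) and covariance Sigma (L x L)."""
    L = F.shape[1]
    diff = F - mu                                   # (N, L) deviations from the mean
    logdet = np.linalg.slogdet(Sigma)[1]            # log |Sigma|
    maha = np.einsum("nl,lm,nm->n", diff, np.linalg.inv(Sigma), diff)
    return -0.5 * (L * np.log(2.0 * np.pi) + logdet + maha)
```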
- In Equation 7.5, u_i is the number of the annotation words for image f_i.
- Similarly, P_Z(z_k) and P_V(w_j | z_k) can be determined by the maximization of the log-likelihood function (Equation 7.6) Σ_i Σ_j n(w_j^i) log P(f_i, w_j), where n(w_j^i) denotes the weight of annotation word w_j, i.e., its occurrence frequency, for image f_i.
- The EM algorithm alternates between two steps: (i) an expectation (E) step, where the posterior probabilities are computed for the hidden variable z_k based on the current estimates of the parameters;
- and (ii) a maximization (M) step, where the parameters are updated to maximize the expectation of the complete-data likelihood log P(F, V, Z) given the posterior probabilities computed in the previous E-step.
- In the E-step, the posterior probability (Equation 7.8) is P(z_k | f_i, w_j) = P_Z(z_k) P_F(f_i | z_k) P_V(w_j | z_k) / Σ_{t=1}^{K} P_Z(z_t) P_F(f_i | z_t) P_V(w_j | z_t). The expectation of the complete-data likelihood log P(F, V, Z) for the estimated P(Z | F, V) derived from Equation 7.8 is given in Equation 7.9.
- In Equation 7.9 the notation z_{i,j} is the concept variable associated with the feature-word pair (f_i, w_j).
- Similarly, the expectation of the likelihood log P(F, Z) for the estimated P(Z | F) derived from Equation 7.7 is given in Equation 7.10.
- For any f_i, w_j, and z_l, the parameters µ_k, Σ_k, P_Z(z_k), and P_V(w_j | z_k) are determined by Equations 7.12–7.15, where the update for P_V(w_j | z_k) (Equation 7.15) is normalized by Σ_u Σ_v n(w_u^v) P(z_k | f_v, w_u).
- Alternating Equations 7.7 and 7.8 with Equations 7.12–7.15 defines a convergent procedure to a local maximum of the expectations in Equations 7.9 and 7.10.
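The alternation can be sketched compactly. The following is an illustrative implementation, not the chapter's Equations 7.12–7.15: it assumes diagonal-covariance Gaussians for P_F(f | z) and uses the standard responsibility-weighted re-estimation formulas, with array names carried over from the setup sketch above. For a 17,000-image collection the full posterior tensor would need mini-batching; the toy sizes here keep it in memory.

```python
import numpy as np

def fit_hidden_concepts(features, word_counts, K, n_iter=50, eps=1e-9, seed=0):
    """EM sketch for the hidden-concept layer: diagonal-covariance Gaussian
    P_F(f|z), discrete P_V(w|z), and prior P_Z(z).  The update formulas are
    standard responsibility-weighted averages (an assumption), not the
    chapter's exact Equations 7.12-7.15."""
    rng = np.random.default_rng(seed)
    N, L = features.shape
    M = word_counts.shape[1]

    # Initialisation: means at random images, shared variances, flat priors.
    mu = features[rng.choice(N, size=K, replace=False)]            # (K, L)
    var = np.tile(features.var(axis=0) + eps, (K, 1))              # (K, L)
    p_z = np.full(K, 1.0 / K)                                      # (K,)  P_Z(z_k)
    p_w = rng.dirichlet(np.ones(M), size=K)                        # (K, M) P_V(w_j | z_k)

    for _ in range(n_iter):
        # E-step: P(z_k | f_i, w_j) ∝ P_Z(z_k) P_F(f_i | z_k) P_V(w_j | z_k)  (cf. Eq. 7.8)
        log_pf = -0.5 * (((features[:, None, :] - mu) ** 2 / var).sum(-1)
                         + np.log(2.0 * np.pi * var).sum(-1))      # (N, K)
        pf = np.exp(log_pf - log_pf.max(axis=1, keepdims=True))    # per-image rescaling
        joint = p_z[None, :, None] * pf[:, :, None] * p_w[None, :, :]   # (N, K, M)
        post = joint / (joint.sum(axis=1, keepdims=True) + eps)         # (N, K, M)

        # M-step: re-estimate the parameters from n(w_j^i)-weighted posteriors.
        weighted = post * word_counts[:, None, :]                  # (N, K, M)
        resp = weighted.sum(axis=2)                                # (N, K) per-image concept mass
        mass = resp.sum(axis=0) + eps                              # (K,)

        p_z = mass / mass.sum()
        p_w = weighted.sum(axis=0) + eps
        p_w /= p_w.sum(axis=1, keepdims=True)
        mu = (resp.T @ features) / mass[:, None]
        var = (resp.T @ features ** 2) / mass[:, None] - mu ** 2
        var = np.maximum(var, 1e-6)                                # keep variances positive

    return mu, var, p_z, p_w
```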
- Ideally, we intend to select the value of K that best agrees with the number of the semantic classes in the training set.
- One readily available notion of the fitting goodness is the log-likelihood.
- For the experimental database reported in Section 7.6, K is determined by maximizing Equation 7.16.
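Equation 7.16 itself is not reproduced in this excerpt; the sketch below stands in for it with a BIC-style penalized log-likelihood, which is an assumption, and reuses fit_hidden_concepts from the EM sketch above.

```python
import numpy as np

def data_loglikelihood(features, word_counts, params, eps=1e-12):
    """Sum over i, j of n(w_j^i) log P(f_i, w_j)  (cf. Equation 7.6)."""
    mu, var, p_z, p_w = params
    log_pf = -0.5 * (((features[:, None, :] - mu) ** 2 / var).sum(-1)
                     + np.log(2.0 * np.pi * var).sum(-1))              # (N, K)
    p_fw = (p_z[None, :, None] * np.exp(log_pf)[:, :, None]
            * p_w[None, :, :]).sum(axis=1)                             # P(f_i, w_j), (N, M)
    return float((word_counts * np.log(p_fw + eps)).sum())

def choose_K(features, word_counts, candidates=(4, 8, 16, 32)):
    """Pick the number of hidden concepts K.  The BIC-style penalty stands in
    for Equation 7.16 (an assumption); fit_hidden_concepts is the EM sketch."""
    N, L = features.shape
    M = word_counts.shape[1]
    best_K, best_score = None, -np.inf
    for K in candidates:
        params = fit_hidden_concepts(features, word_counts, K)
        n_params = K * (2 * L + M + 1)          # means, variances, P_V rows, P_Z
        score = data_loglikelihood(features, word_counts, params) - 0.5 * n_params * np.log(N)
        if score > best_score:
            best_K, best_score = K, score
    return best_K
```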
- The objective of image annotation is to return words which best reflect the semantics of the visual content of images.
- In the above Equations 7.18 and 7.19, E_z denotes the expectation with respect to the hidden concept variable.
- With the combination of Equations 7.18 and 7.19, the automatic image annotation can be solved fully in the Bayesian framework.
- In practice, we derive an approximation of the expectation in Equation 7.18 by utilizing the Monte Carlo sampling [79] technique.
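Equation 7.18 is not reproduced here, so the following sketch only illustrates the general Monte Carlo pattern: draw hidden concepts from their posterior given the visual features and average the word probabilities under the sampled concepts. The exact quantity being approximated is an assumption; params is the tuple returned by the EM sketch.

```python
import numpy as np

def annotate_image(f, params, top_n=10, n_samples=1000, seed=0, eps=1e-12):
    """Monte Carlo annotation sketch: sample z ~ P(z | f) and average the word
    distributions P_V(w | z) over the samples to score candidate words."""
    mu, var, p_z, p_w = params
    rng = np.random.default_rng(seed)

    # Posterior over the hidden concepts given the visual features alone.
    log_pf = -0.5 * (((f - mu) ** 2 / var).sum(-1) + np.log(2.0 * np.pi * var).sum(-1))
    post = p_z * np.exp(log_pf - log_pf.max())
    post /= post.sum() + eps

    # Monte Carlo approximation of the expectation over the hidden concepts.
    z_samples = rng.choice(len(p_z), size=n_samples, p=post)
    word_scores = p_w[z_samples].mean(axis=0)            # approx E_z[P_V(w | z)]
    return np.argsort(word_scores)[::-1][:top_n]         # indices of the top words
```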
- FIGURE 7.2: The architecture of the prototype system.
- The images in the database with the highest P(f_i | w_j) are returned as the querying result for each query word.
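Under the hidden-concept decomposition, one way to score P(f_i | w_j) is to average the Gaussian densities P_F(f_i | z_k) under P(z_k | w_j), obtained by Bayes' rule over the concepts. The sketch below assumes this form and reuses the diagonal-Gaussian parameterization from the EM sketch.

```python
import numpy as np

def text_to_image_query(word_idx, features, params, top_n=20, eps=1e-12):
    """Rank database images by P(f_i | w_j) = sum_k P_F(f_i | z_k) P(z_k | w_j),
    with P(z_k | w_j) obtained by Bayes' rule over the hidden concepts."""
    mu, var, p_z, p_w = params

    # P(z_k | w_j) ∝ P_V(w_j | z_k) P_Z(z_k).
    pz_given_w = p_w[:, word_idx] * p_z
    pz_given_w /= pz_given_w.sum() + eps

    # P_F(f_i | z_k) for every database image (diagonal Gaussians, as above).
    log_pf = -0.5 * (((features[:, None, :] - mu) ** 2 / var).sum(-1)
                     + np.log(2.0 * np.pi * var).sum(-1))              # (N, K)
    scores = np.exp(log_pf) @ pz_given_w                               # P(f_i | w_j), (N,)
    return np.argsort(scores)[::-1][:top_n]
```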
- The architecture of the prototype system is illustrated in Figure 7.2.
- FIGURE 7.3: An example of image and annotation word pairs in the generated database.
- The number following each word is the corresponding weight of the word.
- It has been noted that the datasets used in the recent automatic image annotation systems fail to capture the difficulties inherent in many real image databases.
- Two issues are taken into consideration in the design of the experiments reported in this section.
- First, the commonly used COREL database is much easier for image annotation and retrieval due to the limited semantics it conveys and the small variations of its visual content.
- Second, the typically small scales of the datasets reported in the recent literature are far from realistic for real-world applications.
- To address these issues, we decide not to use the COREL database in the evaluation of the prototype system.
- Figure 7.3 shows an image example with the associated annotation words in the generated database.
- To evaluate the effectiveness and the promise of the prototype system for multimodal image data mining and retrieval, the following performance measures are defined:
- Complete-Length (CL): the average, over the test set, of the minimum length of the returned word list that contains all the ground truth words for a test image.
- Single Word Query Precision, SWQP(n): the average rate of relevant images (here "relevant" means that the ground truth annotation of the image contains the query word) among the top n returned images for a single-word query, over the test set.
- SWQP(n) measures the precision of the text-to-image querying.
- recall = B/C and precision = B/A, where A is the number of images automatically annotated with a given word in the top-10 returned word list, B is the number of images correctly annotated with that word in the top-10 returned word list, and C is the number of images having that word in the ground truth annotation.
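These definitions translate directly into counting code. The sketch below assumes per-image top-10 word lists and ground-truth word sets as plain Python containers; the function and argument names are illustrative.

```python
def word_recall_precision(word, top10_lists, ground_truth):
    """recall = B / C and precision = B / A for one annotation word.
    top10_lists[i] is the top-10 word list produced for test image i;
    ground_truth[i] is the set of ground-truth words for test image i."""
    A = sum(word in top10 for top10 in top10_lists)
    B = sum(word in top10 and word in truth
            for top10, truth in zip(top10_lists, ground_truth))
    C = sum(word in truth for truth in ground_truth)
    return (B / C if C else 0.0), (B / A if A else 0.0)

def swqp(n, ranked_images, ground_truth, query_word):
    """SWQP(n): fraction of the top-n retrieved images whose ground-truth
    annotation contains the query word."""
    top = ranked_images[:n]
    return sum(query_word in ground_truth[i] for i in top) / float(n)
```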
- The interface of the prototype system for automatic image annotation is shown in Figure 7.4.
- Applying the method of estimating the number of the hidden concepts described in Section 7.4.3 to the training set, the number of the concepts is determined to be 262.
- Compared with the number of the images in the training set, 12,000, and the number of the stemmed and cleaned annotation words, 7,736, the number of the semantic concept variables is far less.
- Table 7.1: Comparisons between the examples of the automatic annotations generated by the proposed prototype system and MBRM.
- FIGURE 7.4: The interface of the automatic image annotation prototype.
- In terms of the computational complexity, the model fitting is computation-intensive.
- To show the effectiveness and the promise of the probabilistic model in image annotation, we have compared the accuracy of the proposed method with that of MBRM [75].
- We compare the proposed approach with MBRM because MBRM reflects the performance of the state-of-the-art automatic image annotation research.
- In addition, since the same image visual features are used in MBRM, a fair comparison of the performance is expected.
- Table 7.1 shows examples of the automatic annotation obtained by the proposed prototype system and MBRM on the test image set.
- The performance comparison demonstrated in the table clearly indicates that the proposed system performs better than MBRM.
- The results are reported for all (7,736) words in the database.
- The multiple Bernoulli generation of the words in MBRM is artificial and the association of the words and features is noisy.
- In contrast, the proposed model assumes no explicit word distribution, and the synergy between the visual features and the words exploited by the hidden concept variables reduces the noise substantially.
- We believe that these reasons account for the better performance of the proposed approach.
- We note that certain returned words with top rank from the proposed system for a given image query are found semantically relevant by subjective examinations, although they are not contained in the ground truth annotation of the image.
- We did not count these words in the computation of the performance in Table 7.2.
- Consequently, the HR3, recall, and precision in the table are actually underestimated while the CL is overestimated for the proposed system.
- The average SWQP values of the proposed system and those of MBRM are reported.
- A returned image is considered as relevant to the single word query if this word is contained in the ground truth annotation of the image.
- It is shown that the proposed probabilistic model achieves higher overall SWQP than MBRM.
- It is also noticeable that when the scope of the returned images increases, the SWQP(n) of the proposed system attenuates more gracefully than that of MBRM, which is another advantage of the proposed model.
- For each of the m annotation words, ...
- Clearly, for a general query consisting of words and images, each component of the query may be individually processed and the final retrieval may be obtained by merging all the retrieved lists together based on the posterior probability..
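The merging step admits a very small sketch: score each database image under every query component and combine the posterior-based scores. Averaging the scores is an assumption made here; the chapter only states that the merge is based on the posterior probability.

```python
from collections import defaultdict

def merge_retrieval_lists(scored_lists, top_n=20):
    """Merge per-component retrieval results for a multimodal query.  Each
    element of scored_lists maps image_id -> posterior-based score for one
    query component (a word or an example image)."""
    totals = defaultdict(float)
    for scores in scored_lists:
        for image_id, score in scores.items():
            totals[image_id] += score / len(scored_lists)   # average across components
    return sorted(totals, key=totals.get, reverse=True)[:top_n]
```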
- We randomly select 20 words out of the 7,736-word vocabulary and use each of them as a pure text query to pose to UPMIR, Google image search, and Yahoo! image search, respectively; the comparison is made for different numbers of the top images retrieved.
- Instead of assuming artificial distributions of the annotation words and the unreliable association evidence used in many existing approaches, we assume a hidden concept layer to generate and connect visual features and annotation words.
- Based on the model obtained, the image-to-text and text-to-image queries are conducted in a Bayesian framework, which is adaptive to the dataset and has a clear interpretation of the confidence measure.
- This is demonstrated by the evaluations of the prototype system on a large collection of noisy images and annotation words automatically extracted from crawled Web pages.
- In comparison with a state-of-the-art image annotation system, MBRM, and a state-of-the-art image retrieval system, UFM, the evaluations show the higher reliability and the superior effectiveness of the proposed model and the querying framework, given the noisy and diverse semantics and visual variations of the data we have used.
