Google Reader RSS reader

A commercial Web page typically contains many information blocks. Apart from the main content blocks, it usually has blocks such as navigation panels, copyright and privacy notices, and advertisements (for business purposes and for easy user access). The news content of most web pages is embedded in a large amount of noisy material. Accurate extraction of news content is a necessary and crucial step for news text mining. This paper proposes a new approach to news content extraction from web pages, based on several simple features observed in most well-known news websites and channels. One of the most important features is the similarity of twin pages, which are collected from the same topic section of a site and published on the same or a nearby date. A similarity measure based on edit distance is introduced and applied in the algorithms to separate the news content from noisy information. The method is much less complicated than others, its accuracy and efficiency are fairly high, and its complexity is linear in the page size. Experimental results on 1514 pages collected from 54 well-known Chinese and English news websites and channels show that the approach is well suited to cleaning noise from most news pages before text mining.
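The twin-page idea can be illustrated with a small sketch. The text does not give the algorithm itself, so everything below is an assumption made for illustration: pages are taken to be already split into text blocks, the function names `edit_distance`, `similarity`, and `split_blocks` are hypothetical, and the threshold of 0.8 is arbitrary. The sketch uses plain Levenshtein distance and compares every block pairwise, so it does not reproduce the linear complexity reported for the method.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Normalized similarity in [0, 1]; 1.0 means the inputs are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def split_blocks(page_blocks, twin_blocks, threshold=0.8):
    """Treat blocks that closely match some block of the twin page as noise.

    The threshold is an illustrative assumption, not a value from the text.
    """
    content, noise = [], []
    for block in page_blocks:
        best = max((similarity(block, t) for t in twin_blocks), default=0.0)
        (noise if best >= threshold else content).append(block)
    return content, noise


if __name__ == "__main__":
    # Hypothetical blocks from two pages of the same section and date.
    page = ["Home | World | Sports",
            "Storm hits the coast overnight",
            "Copyright 2010 Example News"]
    twin = ["Home | World | Sports",
            "Markets rally after rate cut",
            "Copyright 2010 Example News"]
    content, noise = split_blocks(page, twin)
    print(content)
    print(noise)
```

Blocks shared by both pages (navigation, copyright notice) score a similarity of 1.0 and are classed as noise, while the article text, which differs between the twins, is kept as content.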

Online news, as an up-to-date and important information source, is an absorbing data repository for data mining. Really Simple Syndication (RSS), or Rich Site Summary, is a Web feed format used for publishing frequently updated content on the Internet, such as blogs, news, audio, and video, in a standardized format. Content categorization is the process in which contents are grouped into categories, usually for some specific purpose. In existing systems, users do not have control over the information, and up-to-date content is not available. To overcome this, this paper puts forward a web information extraction method based on an RSS feed reader, which helps to categorize news articles (informative content) effectively. RSS is a spam-free, quick, and efficient way to read news and weblogs. An experimental study of content categorization was carried out using RSS feed data sets. RSS news feeds comprising 2658 web pages with articles of different categories are used as the training data set, and 300 web page news contents are used as the testing data set. The extensive experiments demonstrate the effectiveness of the method's categorization.
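As a rough illustration of reading an RSS feed and grouping its articles, the sketch below parses an RSS 2.0 document with Python's standard library and groups item titles by the feed's own `<category>` elements. It is not the categorization method evaluated above, whose classifier and training procedure are not described here. The sample feed, its URLs, and the function name `group_by_category` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical RSS 2.0 feed used only for illustration.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://example.com/</link>
    <description>Illustrative feed</description>
    <item>
      <title>Team wins final</title>
      <link>http://example.com/sports/1</link>
      <category>Sports</category>
    </item>
    <item>
      <title>Markets rally</title>
      <link>http://example.com/business/1</link>
      <category>Business</category>
    </item>
    <item>
      <title>Local notice</title>
      <link>http://example.com/misc/1</link>
    </item>
  </channel>
</rss>"""


def group_by_category(feed_xml):
    """Group item titles by the <category> values declared in the feed."""
    root = ET.fromstring(feed_xml)
    groups = defaultdict(list)
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        categories = [c.text.strip() for c in item.findall("category") if c.text]
        # Items without a category tag go into a fallback group.
        for category in categories or ["Uncategorized"]:
            groups[category].append(title)
    return dict(groups)


if __name__ == "__main__":
    for category, titles in group_by_category(SAMPLE_FEED).items():
        print(category, titles)
```

A learned categorizer would replace the lookup of `<category>` tags with a model trained on labeled articles; the parsing step would stay the same.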