Subject: "Web Crawler that Lists URL and Pictures...."
Conferences The New MadBomber Marketing and SEO Forum Topic #44
sgtaw (moderator)
Member since Jun-1-05
402 posts, 2 feedbacks, 3 points
Dec-15-06, 11:24 AM (PST)
"Web Crawler that Lists URL and Pictures...."
 
Hi all,

Need some help.

I am an affiliate of a merchant that sells hard goods. They don't have a datafeed, so I am trying to create one myself.

I have found several programs that will crawl a site and list every single URL, keyword, description, and title, then spit it all out as CSV or whatever.

Here's the problem. I also want it to grab the image URLs on each page and associate them with that particular page. I know this will pick up multiple image URLs for each page, but I can clean that up easily.
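
A minimal sketch of that page-to-images association, assuming Python and nothing beyond its standard library. The names `PageInfoParser` and `extract_page_info` are made up for illustration, and fetching the pages is left to whatever crawler is already in use; this only handles the HTML of one page at a time.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageInfoParser(HTMLParser):
    """Collects the title, meta description, and image URLs of one page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.title = ""
        self.description = ""
        self.image_urls = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""
        elif tag == "img" and attrs.get("src"):
            # Turn relative image paths into full URLs based on the page URL.
            self.image_urls.append(urljoin(self.page_url, attrs["src"]))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_page_info(page_url, html):
    """Return one record tying every image URL to the page it came from."""
    parser = PageInfoParser(page_url)
    parser.feed(html)
    parser.close()
    return {
        "url": page_url,
        "title": parser.title.strip(),
        "description": parser.description,
        "images": parser.image_urls,
    }
```

Each record keeps all images found on the page (logos, buttons, and so on included), so the clean-up mentioned above would still be needed afterwards.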

If I can do this last step, then a full-blown datafeed (with pictures) is created!

Any suggestions?

Thanks,

Ed


Kurt (admin)
Member since Dec-5-02
8892 posts, 5 feedbacks, 8 points
Dec-15-06, 11:50 AM (PST)
1. "RE: Web Crawler that Lists URL and Pictures...."
 
Hey Ed,

Without seeing the exact data in question, it is hard for me to give a work-around.

Out of curiosity, have you tried to use an html2rss program to create a pheed from the pages?

You can download it here:
http://blogbomb.com/blogless.zip

There are a few other options that may work, depending on the original output.


-Boom boom boom boom.


sgtaw (moderator)
Member since Jun-1-05
402 posts, 2 feedbacks, 3 points
Dec-15-06, 12:55 PM (PST)
2. "RE: Web Crawler that Lists URL and Pictures...."
 
Merry Christmas, Kurt!

Thanks for the quick reply....

I'm not sure that html2rss will work... I tried to change your tags for my purposes and got an error.

Here is what I am trying to do.

1. Let's take this site as an example: http://www.tennis-warehouse.com
I want to "crawl" this site grabbing all the product pages.

2. In grabbing those pages, I want to be able to grab various bits of information (this can change from site to site). The key items are: URL of the page, title, meta description, and (the problem child) the picture URL.

For instance, http://www.tennis-warehouse.com/descpage.html?PCODE=MTLX10.

In addition to the items I mentioned, I want to grab the picture of the tennis racket. Preferably, I would want to have the URL of where the picture is located.

3. I then want to be able to have all that information saved as a CSV so that I can upload it, for instance to BIB.
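
The last of those three steps might look something like the sketch below, assuming Python's standard `csv` module. The function name `feed_to_csv`, the record layout (`url`, `title`, `description`, `images`), and the column names are all assumptions for illustration; whatever columns BIB actually expects would have to be substituted.

```python
import csv
import io


def feed_to_csv(pages):
    """Return CSV text with one row per (page, image URL) pair."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["url", "title", "description", "image_url"])
    for page in pages:
        # A page with no images still gets one row, with an empty image column.
        for image_url in page["images"] or [""]:
            writer.writerow(
                [page["url"], page["title"], page["description"], image_url]
            )
    return buffer.getvalue()
```

Writing one row per image keeps every candidate picture next to its page, so the extra rows can be deleted in a spreadsheet. To save the result, write the returned text to a file opened with `newline=""` so the line endings are not doubled on Windows.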

I played a tiny bit with instantrss. I found a unique tag in the webpages and replaced the instantrss tags with it, but I got an error.

I guess a work-around would be to download the site I am interested in, then do a replace using the tags that you have in instantrss, then upload the site to my server so that I can run instantrss.

Thanks Kurt!

Ed


Kurt (admin)
Member since Dec-5-02
8892 posts, 5 feedbacks, 8 points
Dec-15-06, 01:02 PM (PST)
3. "RE: Web Crawler that Lists URL and Pictures...."
 
Hey Ed,

It appears there is a pattern in this particular example that you may be able to use.

It seems the main graphic has the same name as the page name:
descpageRCDUNLOP-MF200P.html
-and-
MF200P.jpeg

Note that the data after the hyphen for the page name is the same as the graphic name.

I only checked this on 3 or 4 pages, but it held true each time. You'll need to check it some more.

If this holds up, you should be able to use the Tuelz to manipulate the data so that you can get the graphic URL using the page URL.

But don't spend 5x as much time and effort trying to find a work-around than it would take to do it "by hand".
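
For what it's worth, the page-name-to-graphic-name pattern could be sketched in a few lines of Python. This is only an illustration: `guess_image_name` is a made-up name, the `.jpeg` extension comes from the one example above, and splitting on the first hyphen is an assumption that may not hold for every product page. The folder the graphics live in isn't covered here, so only the file name is returned.

```python
import posixpath
from urllib.parse import urlparse


def guess_image_name(page_url, extension=".jpeg"):
    """Guess the main graphic's file name from a product page URL or name."""
    # Keep only the path, so hyphens in the domain name are ignored.
    page_name = posixpath.basename(urlparse(page_url).path)
    stem, _ = posixpath.splitext(page_name)
    if "-" not in stem:
        return None  # page name does not follow the pattern
    # Assumption: the product code is everything after the first hyphen.
    code = stem.split("-", 1)[1]
    return code + extension
```

Called with `descpageRCDUNLOP-MF200P.html`, this returns `MF200P.jpeg`; page names without a hyphen return `None` so they can be handled by hand.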


-Boom boom boom boom.

