Site Spider/ Crawler/ Scraper Software
Short of coding up your own web crawler, does anyone know of, or have any experience with, a good piece of software that can run through all the pages on a single domain?
(And potentially on linked domains 1 hop away...)
This could be either server- or desktop-based.
Useful capabilities would include:
- Scraping (XPath parameters)
- Number of clicks from homepage (site architecture)
- HTTP headers
- Multi-threading
- Use of proxies
- Robots.txt compliance option
- CSV output
- Anything else you can think of...
Perhaps an opportunity for an additional SEOmoz tool here, since they do it already!
Cheers!
Note:
I've had a look at:
- Nutch: http://nutch.apache.org/
- Heritrix: https://webarchive.jira.com/wiki/display/Heritrix/Heritrix
- Scrapy: http://doc.scrapy.org/en/latest/intro/overview.html
- Mozenda (does scraping but doesn't appear extensible..)
Any experience/ preferences with these or others?
-
Hey Alex,
Screaming Frog is hands down the best desktop crawling software and it has most of what you are looking for.
-Mike
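For anyone who does end up coding their own, here is a rough sketch of three items from the wish list above: clicks from the homepage, a robots.txt compliance option, and CSV output. It uses only the Python standard library. The names `crawl`, `to_csv` and `LinkParser` are made up for this example, and `fetch` is a hypothetical hook you supply yourself (that is where HTTP headers, proxies or threading would go). It is single-threaded and does no XPath scraping, so treat it as a starting point and not as a replacement for the tools mentioned in this thread.

```python
import csv
import io
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.robotparser import RobotFileParser


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag fed to it."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, robots_txt="", user_agent="*", max_pages=100):
    """Breadth-first crawl of a single host.

    fetch(url) is a caller-supplied function returning the page HTML as text
    (an assumption of this sketch, not part of any library).
    Returns {url: number of clicks from start_url} for every fetched page.
    """
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())  # an empty robots.txt allows everything
    host = urlparse(start_url).netloc

    seen = {start_url}
    queue = deque([(start_url, 0)])
    depths = {}
    while queue and len(depths) < max_pages:
        url, depth = queue.popleft()
        if not robots.can_fetch(user_agent, url):
            continue  # skip pages that robots.txt disallows
        depths[url] = depth
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.links:
            # Resolve relative links and drop #fragments before comparing.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return depths


def to_csv(depths):
    """Render the crawl result as CSV text, shallowest pages first."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["url", "clicks_from_home"])
    for url, depth in sorted(depths.items(), key=lambda item: (item[1], item[0])):
        writer.writerow([url, depth])
    return out.getvalue()
```

Because the crawl is breadth-first, the first time a URL is discovered is also its shortest click path from the start page, which is why the depth can be recorded at that point. To try it against a live site you would pass a `fetch` that downloads the page (for example with `urllib.request`) and the site's real robots.txt text.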