Bad Bots - which you might like to block

Discussion in 'Search Engines' started by Alunny, Dec 16, 2014.

  1. Alunny Member

    Several of my websites were hit extremely hard in the past by a bunch of bad bots (namely Choopa, which is Ahrefs, and Yandex); they seemed to come in one after the other. In less than a minute some of these had over 300 views on my site, racking up hundreds of thousands of requests out of nowhere.

    It took me a long while to trace them back to where they were from, and I ended up putting info into my .htaccess file to prevent them from accessing my site again, and adding them to my robots.txt. I searched a lot of blacklists and went through a bunch of blogs to get the list.

    Here is the list in htaccess:
    Code:
    # BLOCK USER AGENTS
    RewriteEngine on
    RewriteCond %{HTTP_USER_AGENT} AhrefsBot [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} spbot [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} DigExt [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} Sogou [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} MJ12 [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} majestic12 [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} 80legs [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} SISTRIX [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} Semrush [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} Ezooms [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} CCBot [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} TalkTalk [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} Ahrefs [NC]
    RewriteRule !^robots\.txt$ - [F]
    
    # BLOCK BLANK USER AGENTS
    RewriteCond %{HTTP_USER_AGENT} ^-?$
    RewriteRule .* - [F]
    Here's the list in my robots.txt (which, admittedly, I set up before I ended up using htaccess):
    Code:
    # Horrible bandwidth eating robots
    
    User-agent: AhrefsBot
    Disallow: /
    
    User-agent: Yandex
    Disallow: /
    
    User-agent: moget
    Disallow: /
    
    User-agent: ichiro
    Disallow: /
    
    User-agent: NaverBot
    Disallow: /
    
    User-agent: Yeti
    Disallow: /
    
    User-agent: sogou spider
    Disallow: /
    
    User-agent: YoudaoBot
    Disallow: /
    Some of these are malicious and some are harvesters. The majority are just nuisances. There's an old one in the list (TalkTalk) that I can't tell is malicious or not, but I thought better safe than sorry.

    Since doing this, my bandwidth usage has dropped by two-thirds, and my site has sped up! When I did the same on my other sites I noticed similar speed increases, so I am very happy with it :). Of course you could automate this process (perishablepress.com/blackhole-bad-bots/) in an extremely cunning way too :).

     
  2. Converse Active Member

    So far, I've been okay. I see a little more of the Baidu and Yandex bots than I'd like to see, since I have no foreign-language stuff here, but it's not outrageous. When did Ahrefs go over to the dark side? That didn't use to be a bad one.

    Later...

    Majestic12 (and probably MJ12) are from Majestic.com, which is a backlink checker used in SEO, as are SEMrush and Ahrefs. I don’t think they are malicious but, of course, the accumulation of bots can put a drag on resources. If you’re not using these tools, then there is probably no advantage to letting them in.

    SISTRIX is supposed to be an SEO tool, as well, but I don’t see the advantage in letting it have its way with your resources.

    Yandex is the Russian equivalent of Google, as Baidu (which I see a lot of) is to China, except that blunt force bots are sometimes identified as Baidu as well. While these may be legitimate search engine bots, I am not sure what the advantage is to having foreign language bots indexing my sites.

    Spbot is a Trojan, from what I can determine. 80legs is a web crawler and scraper, and HTTrack is identified as a website copier intended to download sites to a local computer.

    DigExt seems to be connected to Microsoft Internet Explorer, and no one seems to know what Ezooms does for a living, except that it is persistent and doesn’t follow the rules.

    CCBot is the Common Crawl spider, but I don’t know what task it might be employed to accomplish.

    You’re right -- TalkTalk is an odd duck, and I don’t know why it has a bot roaming around.

    Ichiro plays for the New York Yankees, so I don't know what he's doing on your site.

     
  3. Alunny Member

    Not sure when Ahrefs went dark side, but with the amount it was hitting my forum I had to make it stop. It was killing everything. It was coming in through "Choopa". If you Google "Choopa Invasion" you will see a bunch of forum software communities posting about it, and I have added my input to the Simple Machines topic, as I used to be on their staff.

    I've noticed they are not using the bots they say they are on their website, or were not at the time. My website was never once hit by Ahrefs until Choopa, which was insane, and I didn't even realise that's what was happening until someone else said they had a bunch of Choopas on their site and I added it to my spider list to monitor it. It just ate so much bandwidth and ignored robots.txt, and ugh, I can't even tell you the stress. I've since had to help other forums in similar positions, and it's the same story. I've seen over 15k hits in a day from them. It's just too much!
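
    For anyone trying to spot this kind of traffic, here is a minimal Python sketch of tallying requests per user agent from an access log. It assumes the log uses Apache's combined log format (user agent as the last quoted field); the sample lines, IP addresses, and the count_user_agents helper are made up for illustration and are not taken from any real log.

```python
import re
from collections import Counter

# In the combined log format each line ends with: "referer" "user-agent"
UA_PATTERN = re.compile(r'"[^"]*" "(?P<ua>[^"]*)"\s*$')

AHREFS_UA = "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)"

# Hypothetical sample lines (documentation-range IPs, invented requests).
SAMPLE_LOG = [
    '192.0.2.10 - - [16/Dec/2014:10:00:01 +0000] "GET /index.php HTTP/1.1" 200 5120 "-" "' + AHREFS_UA + '"',
    '192.0.2.11 - - [16/Dec/2014:10:00:02 +0000] "GET /index.php?topic=1 HTTP/1.1" 200 7301 "-" "' + AHREFS_UA + '"',
    '198.51.100.7 - - [16/Dec/2014:10:00:03 +0000] "GET / HTTP/1.1" 200 4096 "-" "-"',
    '203.0.113.5 - - [16/Dec/2014:10:00:04 +0000] "GET / HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (Windows NT 6.1; rv:34.0) Gecko/20100101 Firefox/34.0"',
]


def count_user_agents(lines):
    """Return a Counter mapping each user agent string to its request count."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        ua = match.group("ua")
        # Apache writes "-" when the client sent no User-Agent header.
        counts[ua if ua not in ("", "-") else "(blank)"] += 1
    return counts


if __name__ == "__main__":
    for ua, hits in count_user_agents(SAMPLE_LOG).most_common():
        print(hits, ua)
```

    Pointing count_user_agents at the lines of a real log file instead of SAMPLE_LOG should put the heaviest hitters at the top of the list.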

    The problem with Yandex is that it comes in large volumes as well. There was a similar problem with Yahoo Slurp a while back, but I believe they fixed it. Many of these were suggested in blacklisting blogs; I did check the sources, and each one (up to date at the time) was very thorough in stating what it was on the blacklist for. Many are nuisances because they suck up bandwidth and ignore the rules, but there are definitely some there you don't want! :D

     
  4. CallieJo New Member

    Alunny,
    You could probably condense some of your list like this:
    Code:
    RewriteCond %{HTTP_USER_AGENT} (AhrefsBot|spbot|DigExt|Sogou) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} (majestic12|80legs|SISTRIX|HTTrack|Semrush) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} (MJ12|Ezooms|CCBot|TalkTalk|Ahrefs) [NC]
    RewriteRule .* - [F] 
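
    A rough way to sanity-check a condensed pattern like this before deploying it is to confirm that every name from the original list is still matched. The Python sketch below approximates mod_rewrite's unanchored, case-insensitive ([NC]) matching with re.search; the is_blocked and uncovered helpers and the sample user agent strings are illustrative assumptions, and Apache's regex engine is not identical to Python's, although for plain literal alternations like these the behaviour should be the same.

```python
import re

# Names from the original one-per-line list.
ORIGINAL = [
    "AhrefsBot", "spbot", "DigExt", "Sogou", "MJ12", "majestic12", "80legs",
    "SISTRIX", "HTTrack", "Semrush", "Ezooms", "CCBot", "TalkTalk", "Ahrefs",
]

# The three condensed conditions, joined by [OR] in the rewrite rules.
CONDENSED = [
    r"(AhrefsBot|spbot|DigExt|Sogou)",
    r"(majestic12|80legs|SISTRIX|HTTrack|Semrush)",
    r"(MJ12|Ezooms|CCBot|TalkTalk|Ahrefs)",
]


def is_blocked(user_agent, patterns):
    """True if any pattern matches anywhere in the user agent, ignoring case."""
    return any(re.search(p, user_agent, re.IGNORECASE) for p in patterns)


def uncovered(names, patterns):
    """Return the names that none of the patterns would catch."""
    return [name for name in names if not is_blocked(name, patterns)]


if __name__ == "__main__":
    print("Not covered:", uncovered(ORIGINAL, CONDENSED))
    # Sample user agent strings, for illustration only.
    for ua in (
        "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)",
        "Mozilla/5.0 (Windows NT 6.1; rv:34.0) Gecko/20100101 Firefox/34.0",
    ):
        print(is_blocked(ua, CONDENSED), ua)
```

    One difference worth noting: the original rule was RewriteRule !^robots\.txt$ - [F], which still lets blocked bots fetch robots.txt, while .* forbids that file as well.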
    Converse,
    Very informative.

    The problem I see with some of these bots is that they allow your competitors to check up on you. Although there are other methods of spying on competitors, some of these bots make the information easier to obtain. Like you said, if you are not using any of these yourself then it's safe to block them.

     
  5. xTinx New Member

    No wonder my site has so many views and the comments on my shoutbox are all repetitive and random. Bots have got something to do with it. If I start disallowing those bots from my website, I think my visits would drop. On the brighter side, my ranking may increase. I'd have to start filling in my site's robots.txt with all the spam bots you mentioned.

     
  6. CallieJo New Member

    I've never seen any evidence where a ranking increases because of bot visits.

    If you are talking about PageRank, a lot of people believe PageRank is dead since Google has said they will not be updating it any longer.

    If you mean SERP ranking, I highly doubt that bots will help you rank better in the SERPs.

    Be cautious with only using robots.txt. Many bots ignore it. Especially the bad bots. There is no guarantee that this method will keep any of those bots out.
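
    To see what robots.txt can and cannot do, here is a small Python sketch using the standard library's urllib.robotparser to read rules like the ones posted above the way a well-behaved crawler would. The /forum/ path and the build_parser helper are assumptions for illustration. It only shows what a compliant bot would conclude; a bot that never performs this check is not affected at all.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules from earlier in the thread.
ROBOTS_TXT = """\
# Horrible bandwidth eating robots

User-agent: AhrefsBot
Disallow: /

User-agent: Yandex
Disallow: /

User-agent: moget
Disallow: /

User-agent: ichiro
Disallow: /

User-agent: NaverBot
Disallow: /

User-agent: Yeti
Disallow: /

User-agent: sogou spider
Disallow: /

User-agent: YoudaoBot
Disallow: /
"""


def build_parser(text):
    """Parse robots.txt content from a string instead of fetching a URL."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    # "/forum/" is a hypothetical path; Googlebot is not listed in the rules.
    for agent in ("AhrefsBot", "Yandex", "sogou spider", "Googlebot"):
        print(agent, "allowed:", parser.can_fetch(agent, "/forum/"))
```

    Because compliance is voluntary, pairing robots.txt with a server-side block such as the htaccess rules is the safer approach.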

    One of the biggest problems with bot traffic is the bandwidth they consume. There are bad bots that can suck up your bandwidth allotment if you are on shared hosting.

    Another issue is the spam bots. If you have user-generated content on your site (such as comments), these types of bots are looking to spam websites with junk. That junk can hurt your SERP ranking.

     
  7. Converse Active Member

    Unless it's Google, Bing, Yahoo, etc. Then, of course, you want your pages indexed.

     