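Yesterday Xiao Huang, a colleague who had just joined the company, asked me for a favor: could I use Python to scrape the global university rankings for the past few years? The URL was: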

https://www.compassedu.hk/qs_2016

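At first glance it looked like an ordinary static page, so I bragged that I could get it done in 5 minutes.

That boast came back to slap me in the face.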

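The first step in scraping a static page is to look at its source, but no matter how I right-clicked, nothing happened. I eventually concluded that the page was using JavaScript to disable the right-click menu.

So I opened the developer tools with F12 (or Ctrl+Shift+I), thinking that would settle it, but I ran into a new problem.

Opening the developer tools with a shortcut does not jump to any particular element. For example, to see where "麻省理工学院" (Massachusetts Institute of Technology) sits in the page, I would normally just right-click it and choose Inspect Element.

Now I would have to hunt through the markup line by line, which clearly wouldn't do. I had to find a way around the JavaScript right-click block.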

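After some digging I found two methods. I'll use Firefox as the example:

Method 1:

Press F12, click the Console tab (it has the same name in Chrome), enter the following, and press Enter: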

javascript:alert(document.onselectstart = document.oncontextmenu= document.onmousedown = document.onkeydown= function(){return true;});

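Method 2:

Press F12, click the Console tab (it has the same name in Chrome), enter the following, and press Enter:

javascript:(function () {
  function R(a) {
    var ona = "on" + a; // local, so each listener clears its own handler name
    if (window.addEventListener)
      window.addEventListener(a, function (e) {
        // originalTarget is Firefox-only; fall back to target in other browsers
        for (var n = e.originalTarget || e.target; n; n = n.parentNode) n[ona] = null;
      }, true);
    window[ona] = null;
    document[ona] = null;
    if (document.body) document.body[ona] = null;
  }
  R("contextmenu"); R("click"); R("mousedown"); R("mouseup"); R("selectstart");
})()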

After these steps the right-click menu worked again, but by then 20 minutes had already gone by...
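As an aside, the position of a piece of text can also be checked from Python, since the scraper only ever sees the raw HTML and is not affected by the right-click block. The following is a minimal sketch, assuming the text is present in the static source. `show_context` and `sample_html` are hypothetical names used for illustration, and the sample string stands in for `response.text`:

```python
def show_context(html, needle, width=60):
    """Return the text around the first occurrence of `needle` in `html`.

    Returns None when the needle is absent, which would suggest the content
    is loaded by JavaScript rather than present in the static source.
    """
    pos = html.find(needle)
    if pos == -1:
        return None
    start = max(0, pos - width)
    end = min(len(html), pos + len(needle) + width)
    return html[start:end]


# Stand-in for response.text; a real run would fetch the page first.
sample_html = (
    '<table id="rk"><tbody><tr class="odd">'
    '<td>1</td><td><a href="//www.compassedu.hk/univ_85_12">'
    '麻省理工学院<br/>Massachusetts Institute of Technology</a></td>'
    '</tr></tbody></table>'
)

snippet = show_context(sample_html, "麻省理工学院", width=40)
print(snippet)
```

The surrounding tags in the printed snippet hint at which element and attributes to target, without needing Inspect Element at all.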

I quickly wrote the scraper and saved the data to a spreadsheet. Just when I thought I was done, yet another problem showed up.

Here is the code:

import requests
from bs4 import BeautifulSoup

url = 'https://www.compassedu.hk/qs_2015'
response = requests.get(url)
response.encoding = 'utf-8'  # decode the page as UTF-8
soup = BeautifulSoup(response.text, 'html.parser')
ranks = soup.find('table', id='rk')  # the rankings table
print(ranks)

Output:

<table aria-describedby="rk_info" class="rank-items dataTable no-footer" id="rk" role="grid">
<thead><tr class="header" role="row" style="width:980px;">
<th class="sorting_disabled" colspan="1" rowspan="1" style="width:50px">Ranking</th>
<th class="sorting_disabled" colspan="1" rowspan="1" style="width:230px;text-align: center;">University Name</th>
<th class="sorting_disabled" colspan="1" rowspan="1" style="width:93px;">Country/Region</th>
<th class="sorting_disabled" colspan="1" rowspan="1" style="width:78px;">Academic Reputation</th>
<th class="sorting_disabled" colspan="1" rowspan="1" style="width:90px;">Employer Reputation</th>
<th class="sorting_disabled" colspan="1" rowspan="1" style="width:66px;">Faculty Student</th>
<th class="sorting_disabled" colspan="1" rowspan="1" style="width:79px;">International Faculty</th>
<th class="sorting_disabled" colspan="1" rowspan="1" style="width:75px;">International Students</th>
<th class="sorting_disabled" colspan="1" rowspan="1" style="width:71px;">Citations per Faculty</th>
<th class="sorting_disabled" colspan="1" rowspan="1" style="width:58px;">Overall Score</th>
<th class="sorting_disabled" colspan="1" rowspan="1" style="width:89px">Free</th></tr></thead><tbody>
<tr class="odd" role="row" style="font-family:'Times new Roman',宋体">
<td style="width:50px">1</td>
<td style="width:230px;text-align:left;"><a href="//www.compassedu.hk/univ_85_12" target="_blank">麻省理工学院<br/>Massachusetts Institute of Technology</a></td>
<td style="width: 93px;">United Sta</td></tr></tbody></table>

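The elements bs4 returned after parsing contained only a single entry, "麻省理工学院". Very strange.

At first I assumed my element selector was wrong, but several attempts gave the same result.

Then it suddenly occurred to me that the parser itself might be the problem, so I looked it up online.

Sure enough, it was a bug. I didn't bother digging any deeper and simply swapped 'html.parser' for 'lxml':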

import requests
from bs4 import BeautifulSoup

url = 'https://www.compassedu.hk/qs_2015'
response = requests.get(url)
response.encoding = 'utf-8'  # decode the page as UTF-8
# lxml is a separate package (pip install lxml)
soup = BeautifulSoup(response.text, 'lxml')
ranks = soup.find('table', id='rk')  # the rankings table
print(ranks)
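One way to check whether the parser really is what makes the difference is to feed the same HTML to both and compare how many rows each one finds. This is a rough sketch rather than the exact diagnosis used here. `count_rows` is a hypothetical helper, and `sample_html` is a small well-formed stand-in (its second row is a placeholder, not real ranking data), so both parsers should agree on it. On the real page you would pass `response.text` instead:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_rows(html, table_id="rk", parsers=("html.parser", "lxml")):
    """Return {parser_name: number of <tr> found in the table, or None}."""
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # The parser (for example lxml) is not installed here.
            results[name] = None
            continue
        table = soup.find("table", id=table_id)
        results[name] = len(table.find_all("tr")) if table else 0
    return results


# Placeholder markup; "Example University" is made up for illustration.
sample_html = """
<table id="rk">
  <thead><tr><th>Ranking</th><th>University Name</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>Massachusetts Institute of Technology</td></tr>
    <tr><td>2</td><td>Example University</td></tr>
  </tbody>
</table>
"""

row_counts = count_rows(sample_html)
print(row_counts)
```

If the counts differ widely on the real page, the markup is probably malformed in a way the two parsers recover from differently, which would match what happened above.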

That fixed it, but by then more than half an hour had passed.
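With the table parsing correctly, the remaining step of saving the data to a spreadsheet might look something like the sketch below. It is only an illustration under assumptions: `table_to_rows` and `rows_to_csv` are hypothetical helpers, the sample markup is a trimmed version of the output shown earlier, and the parser defaults to 'html.parser' so the snippet runs without lxml (on the real page, pass `parser='lxml'` as described above):

```python
import csv
import io

from bs4 import BeautifulSoup


def table_to_rows(html, table_id="rk", parser="html.parser"):
    """Extract the header and body cells of the table as lists of strings."""
    soup = BeautifulSoup(html, parser)
    table = soup.find("table", id=table_id)
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        cells = [c.get_text(" ", strip=True) for c in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text (write it to a file to open in Excel)."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


# Trimmed stand-in for response.text.
sample_html = (
    '<table id="rk"><thead><tr><th>Ranking</th><th>University Name</th></tr></thead>'
    '<tbody><tr><td>1</td><td><a href="//www.compassedu.hk/univ_85_12">'
    '麻省理工学院<br/>Massachusetts Institute of Technology</a></td></tr></tbody></table>'
)

rows = table_to_rows(sample_html)
csv_text = rows_to_csv(rows)
print(csv_text)
```

To produce an actual file, the same rows can be written with `csv.writer` to a file opened with `newline=''` and an encoding such as 'utf-8-sig', which Excel tends to handle well for Chinese text.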

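Lesson learned: don't show off too much, especially in front of a pretty colleague. That's exactly how you end up crashing and burning!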


Sharing a Python Scraping Job That Went Off the Rails (The Most Complete Python Scraping Tutorial)
