Re-hosting files with PHP to get around hotlink protection on Juejin and similar sites

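More and more websites now enable hotlink protection to protect their own interests, which means the images in articles we repost or save no longer display. I recently started using WordPress to build my own site, so I decided to copy the images of those articles to my own server and get around hotlink protection that way.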

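The approach: the wp_posts table in the WordPress database stores the post content. We read each post, iterate over all of its images, fetch each one in turn with curl, save it to the appropriate directory, and then replace the image URL. Simple enough, so let's get to it!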
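The read, iterate, replace loop described above can be tried without WordPress at all. The following is a minimal sketch, assuming PHP's bundled DOM extension rather than the phpQuery library used later in this post; `rewriteImages` and the `$rehost` callback are hypothetical names, and the callback stands in for the real download-and-save step.

```php
<?php
/**
 * Rewrite every <img src> in an HTML fragment.
 * $rehost is a hypothetical callback: it receives the original URL and
 * returns the new URL, or null to leave the image untouched.
 */
function rewriteImages(string $html, callable $rehost): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate sloppy real-world markup
    // The XML prolog hints UTF-8 to libxml; the <div> gives the fragment a single root.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('img') as $img) {
        $newSrc = $rehost($img->getAttribute('src'));
        if ($newSrc !== null) {
            $img->setAttribute('src', $newSrc);
        }
    }

    // Serialize only the children of the wrapper <div> (the first div in the document).
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

In the real script the callback would call fetchSavePic and return its "url" field, or null when the fetch fails.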

The final code is written in PHP:

PicFetcher.php:

<?php

/**---------------------------------------------------------------------------
 * Note: common.php must be loaded before calling these functions.
 * deps: slog, startsWith, file_save, curl_get_status, curl_get_header
 */

/**
 * Fetch an image resource.
 * @param string $url image URL
 * @return array [errno, errmsg, fetch_status, content, ext]; all five fields are always present
 */
function fetchPic($url)
{
    slog("fetching: " . $url);
    $ch = curl_init();
    $fetch_status = -1; // HTTP status returned by the remote server
    $content = ""; // fetched body; not necessarily an image
    $ext = "";
    $errno = 0;
    $errmsg = "";

    // Make at most 3 requests while following 301/302 redirects.
    for ($i = 0; $i < 3; $i++) {
        $options = array(
            CURLOPT_HEADER => 1,
            CURLOPT_POST => 0,
            CURLOPT_URL => $url,
            CURLOPT_RETURNTRANSFER => 1,
            CURLOPT_TIMEOUT_MS => 15000,
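            // Certificate verification is disabled here; set this to 1 if the source hosts have valid certificates.
            CURLOPT_SSL_VERIFYPEER => 0,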
        );
        curl_setopt_array($ch, $options);
        curl_setopt($ch, CURLOPT_ENCODING, "gzip,deflate");
        curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');

        if (($response = curl_exec($ch))) {
            $fetch_status = $code = curl_get_status($ch, $response);
            if ($code == 301 || $code == 302) {
                $redirect_url = curl_get_header($ch, $response, "Location");
                slog("redirect_url: $redirect_url");
                $parsed_re = parse_url($redirect_url);
                if (isset($parsed_re["host"])) {
                    $url = $redirect_url;
                } else {
                    $parsed = parse_url($url);
                    if (startsWith($redirect_url, "/")) {
                        $url = $parsed["scheme"] . "://" . $parsed["host"] . $redirect_url;
                    } else {
                        // TODO: normalize the joined relative path (/a/../b/c.html -> /b/c.html)
                        $dir = pathinfo($parsed["path"])["dirname"];
                        $url = $parsed["scheme"] . "://" . $parsed["host"] . $dir . "/" . $redirect_url;
                    }
                }
                continue;
            } else if ($code == 200) {
                $content_type = curl_get_header($ch, $response, "Content-Type");
                if (startsWith($content_type, "image/")) {
                    $header_size = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
                    $content = substr($response, $header_size);
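                    // Drop any parameters after ";" (e.g. "image/png; charset=binary") before building the extension.
                    $ext = "." . trim(explode(";", substr($content_type, strlen("image/")))[0]);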
                    $errmsg = "OK";
                } else {
                    $errno = -1;
                    $errmsg = "resource is not image!";
                }
            } else {
                $errno = -1;
                $errmsg = "unexpected status code get!";
            }
        } else {
            $errno = -1;
            $errmsg = "curl error!";
            slog("curl errno: " . curl_errno($ch) . ", errmsg: " . curl_error($ch));
        }
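        break;
    }
    curl_close($ch);
    if ($i >= 3) {
        // The loop ran out without a break, so every attempt ended in a redirect.
        $errno = -1;
        $errmsg = "too many redirects!";
    }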
    $data = [
        "errno" => $errno,
        "errmsg" => $errmsg,
        "fetch_status" => $fetch_status,
        "content" => $content,
        "ext" => $ext
    ];
    return $data;
}

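/**
 * Fetch an image and save it locally.
 *
 * @param string $url      image URL
 * @param string $saveRoot directory to save into; the file name is appended directly
 * @param string $pic_base base URL under which saved images are served
 * @param string $pic_host URL prefix treated as our own site (defaults to $pic_base)
 * @return array errno, errmsg and url, plus fetch_status and ext when a fetch was attempted
 */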
function fetchSavePic($url, $saveRoot, $pic_base, $pic_host = "")
{
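    // Do not fetch images that already live on our own site.
    // TODO: the check below can be bypassed in the following ways; it may need configuration to cover every entry point.
    // 1 http:///somehost
    // 2 http://ip
    // 3 pic/xxx.xxx
    // The host of $url should be computed and compared instead.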

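    // Protocol-relative URL: //xx.com becomes http://xx.com.
    // Assumption: plain http is acceptable here; keeping a scheme lets parse_url() in fetchPic resolve relative redirects.
    if (startsWith($url, "//")) {
        $url = "http:" . $url;
    }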

    $pic_host = $pic_host ?: $pic_base;
    // Lower-case both sides so the prefix comparison is case-insensitive.
    if (startsWith(strtolower($url), strtolower($pic_host)) || startsWith($url, "/")) {
        return [
            "errno" => 0,
            "errmsg" => "do NOT fetch inner pic!",
            "url" => $url
        ];
    }

    $data = fetchPic($url);
    if ($data["errno"] === 0) {
        $content = $data["content"];
        $save_name = md5($content);
        $save_name = substr($save_name, 0, 2) . "/" . substr($save_name, 2) . $data["ext"];
        file_save($content, $saveRoot . $save_name, false);
        $saveUrl = $pic_base . "/" . $save_name;
    } else {
        $saveUrl = "";
    }
    unset($data["content"]);
    $data["url"] = $saveUrl;
    slog(print_r($data, true));
    return $data;
}
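PicFetcher.php relies on helpers from common.php, which this post does not show. The versions below are only guesses at what they might look like, written from how they are called above; `parseStatus` and `parseHeader` are hypothetical names, the real common.php may differ, and slog and file_save are left out.

```php
<?php
// Hypothetical stand-ins for helpers from common.php; the real file is not shown.

function startsWith(string $haystack, string $needle): bool
{
    return strncmp($haystack, $needle, strlen($needle)) === 0;
}

/** Extract the status code from a raw header block ("HTTP/1.1 302 Found\r\n..."), or -1. */
function parseStatus(string $headers): int
{
    return preg_match('#^HTTP/\S+\s+(\d{3})#', $headers, $m) ? (int)$m[1] : -1;
}

/** Return the value of the first header with the given name (case-insensitive), or "". */
function parseHeader(string $headers, string $name): string
{
    foreach (preg_split('/\r?\n/', $headers) as $line) {
        $parts = explode(':', $line, 2);
        if (count($parts) === 2 && strcasecmp(trim($parts[0]), $name) === 0) {
            return trim($parts[1]);
        }
    }
    return "";
}

// Thin wrappers matching the call sites above: cut the header block off the
// response using the size that curl reports, then parse it.
function curl_get_status($ch, string $response): int
{
    return parseStatus(substr($response, 0, curl_getinfo($ch, CURLINFO_HEADER_SIZE)));
}

function curl_get_header($ch, string $response, string $name): string
{
    return parseHeader(substr($response, 0, curl_getinfo($ch, CURLINFO_HEADER_SIZE)), $name);
}
```

Splitting the parsing from the curl handle keeps the string logic easy to check on its own, for example against a hand-written header block.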

WPPicFetcher.php:

<?php
define('APPROOT', __DIR__ . '/../../');

SeasLog::setBasePath(APPROOT . '/logs/seaslog');

require_once APPROOT . "/system/config.php";
require_once APPROOT . "/system/common.php";
require_once APPROOT . "/system/PicFetcher.php";
require_once "phpQuery-onefile.php";

/**
 * @param string $configPath path to an extra config file; ignored if it does not exist
 */
function fetchWPPic($configPath)
{
    // Bind the global first so that a config file assigning to $g_config updates it.
    global $g_config;

    slog("+++++++++++++++++++++++++++++ wp pic processing: $configPath");
    if (file_exists($configPath)) {
        require_once $configPath;
    }

    try {
        $saveRoot = $g_config["save_root"];
        $picBase = $g_config["pic_base"];
        $dbh = new PDO("mysql:host={$g_config['host']};dbname={$g_config['wp_dbname']}", $g_config["username"], $g_config["pwd"]);
        $dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
        $dbh->query("set names utf8;");
        // pic_fetched is a custom column (not part of stock WordPress) that records the last re-hosting time.
        $sql = "
            SELECT id, post_content FROM wp_posts
            WHERE post_status = 'publish' AND post_type = 'post' AND `post_modified_gmt` > `pic_fetched`
        ";
        foreach ($dbh->query($sql) as $row) {
            $id = $row["id"];
            $content = $row["post_content"];
            slog("++++:processing id:$id");
            $doc = \phpQuery::newDocument($content);
            $elements = pq("img", $doc)->elements;
            $cnt = count($elements);
            for ($i = 0; $i < $cnt; $i++) {
                $ele = $elements[$i];
                // Set alt text so that a broken image is still visible
                $ele->setAttribute("alt", "x");
                // Re-host the image
                $src = $ele->getAttribute("src");
                $r = fetchSavePic($src, $saveRoot, $picBase);
                // slog($r);
                if ($r["errno"] === 0) {
                    $url = $r["url"];
                    $ele->setAttribute("src", $url);
                    slog(">>> fetch ok");
                } else {
                    slog(">>> fetch failed");
                }
            }

            // Bind the id as well instead of interpolating it into the SQL.
            $update = "update wp_posts set post_content=?, pic_fetched=? where id=?";
            $stmt = $dbh->prepare($update);
            $stmt->bindValue(1, $doc->html());
            $stmt->bindValue(2, gmdate("Y-m-d H:i:s"));
            $stmt->bindValue(3, (int)$id, PDO::PARAM_INT);
            $stmt->execute();
            $stmt = null;
        }
        $dbh = null;
    } catch (PDOException $e) {
        ob_clean();
        slog(print_r($e, true));
        die();
    }
}
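fetchWPPic reads all of its settings from a global `$g_config` array. Neither config.php nor the extra config file is shown in this post, so the following is only a guess at their shape: the keys are the ones the code above reads, and every value is a placeholder.

```php
<?php
// Hypothetical config file; keys are taken from the code above, values are placeholders.
// Writing through $GLOBALS works even when this file is required from inside a function.
$GLOBALS['g_config'] = [
    'host'      => '127.0.0.1',              // MySQL host
    'wp_dbname' => 'wordpress',              // WordPress database name
    'username'  => 'wp_user',                // MySQL user
    'pwd'       => 'change-me',              // MySQL password
    'save_root' => '/var/www/pic/',          // trailing slash: fetchSavePic appends the file name directly
    'pic_base'  => 'http://example.com/pic', // no trailing slash: fetchSavePic adds "/" itself
];
```

Note that the SELECT above also depends on a `pic_fetched` column in wp_posts. Stock WordPress has no such column, so it has to be added to the table before the script is run.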

Main file, WPPic.php:

<?php
define('DOCSPATH', __DIR__ . '/');

require_once "WPPicFetcher.php";

if ($argc < 2) {
    echo "Usage: php WPPic.php config-path1" . PHP_EOL;
    echo " Note: only relative path is supported" . PHP_EOL;
    return;
}

fetchWPPic(__DIR__ . "/" . $argv[1]);

This is a colleague's site, and the results after processing look good: http://wptp.rongyipiao.com/

Of course there are still some issues; I will work on them gradually when I have time.
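One of the remaining issues is the TODO in fetchSavePic: the prefix comparison is easy to bypass, and the comment there suggests comparing hosts instead. A rough sketch of that idea follows, assuming a hypothetical helper `shouldSkipFetch` and a caller-supplied list of the site's own host names and IPs; it is not part of the code above.

```php
<?php
/**
 * Decide whether $url should be left alone, judging by host rather than by prefix.
 * $ownHosts is a hypothetical list of host names and IPs that serve our own images.
 */
function shouldSkipFetch(string $url, array $ownHosts): bool
{
    // Give protocol-relative URLs (//host/path) a scheme so they parse like any other URL.
    if (strncmp($url, '//', 2) === 0) {
        $url = 'http:' . $url;
    }
    $host = parse_url($url, PHP_URL_HOST);
    if ($host === null || $host === false) {
        // No host (a relative path such as pic/xxx.png) or a URL that
        // parse_url rejects outright: there is nothing external to fetch.
        return true;
    }
    // Host names are case-insensitive, so compare in lower case.
    return in_array(strtolower($host), array_map('strtolower', $ownHosts), true);
}
```

Listing the server's IP next to its domain name in `$ownHosts` would cover the `http://ip` case from the TODO.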
