Import XML stripping embedded content

Hey there,

We’re finding that when we export a book in Pressbooks XML format and then import that file into a newly created book, the iframes and embedded content get stripped away and replaced by links in plain text (not interactive in any way). We would expect that iframe tags wouldn’t be stripped at all. Is this a Pressbooks thing or a Wordpress thing?

I’m still investigating but thought I’d pop in here and see if anyone else has had this issue.

Bryan

Yeah, that’s a WordPress thing…

kses doesn’t allow iframes.

Look in web/wp/wp-includes/default-filters.php for kses_init.

In my experience (and by looking at the kses_init function) admins are not affected by kses.

Who is importing?

“Super admins” are importing/exporting and running into this issue.

Ugh, my mistake. It’s a Pressbooks thing.

@see \Pressbooks\Modules\Import\Import::tidy

Okay, yep. Just confirmed that’s the issue. Would y’all be open to a pull request that would put a filter hook on the $config variable? Something like:

$config = apply_filters( 'pb_htmLawed_config', [
    'safe' => 1,
]);

Otherwise, we’ll have to maintain a hack that preserves our preferred config. Also, definitely open to other suggestions. Thanks! :slight_smile:

Hi @bryan, in talking through this with @dac.chartrand I think our preference would be to follow the WordPress approach, where in a multisite context only super admins (who have the unfiltered_html() capability) can bypass sanitization routines when creating/saving posts. Would this work for your use case? It would mean that super admins could import from WordPress XML files that contain iframes without issue, but other users would still be subject to the sanitizing that is applied in all other contexts. This would make the behaviour consistent. What are your thoughts?

No, we would find a way to bypass that. We understand the broader wordpress community’s concern with iframes, but we think it’s ok for our book editors/admins at schools to be able to use them. (also contractors who we don’t want to give super admin access) They’re likely to get content from many other books and want to import them.
And I assume importing content via PB’s api copying would also strip iframes.

I followed the other recent conversation about this problem, and yes, it’s better if people use oembed replacement methods, but that doesn’t solve everything and our users don’t generally understand what that is and just grab the embed code provided to them on a site. It’s not a quick task to educate casual editors making their courses.

Also, we already have a ton of embeds for various things and won’t be going through all our books in the short term to replace them all.

It’s just not scalable for a teacher to ask us anytime they want to import or edit a page that has an iframe.

Import::tidy is overridden by at least 4 child classes that we know of.

If we apply your change consistently then sometimes you would get:

$config = apply_filters( 'pb_htmLawed_config', [
    'safe' => 1,
]);

But other times it might look like (import\ooxml\class-docx.php):

$config = apply_filters( 'pb_htmLawed_config', [
	'safe' => 1,
	'valid_xhtml' => 1,
	'xml:lang' => 1, // keep xml:lang *and* lang
	'no_deprecated_attr' => 2,
	'deny_attribute' => 'div -id',
	'hook' => '\Pressbooks\Sanitize\html5_to_xhtml11',
]);

If you were to tweak that second one incorrectly, it would break DOCX imports.

I think the solution should focus on the modules you are interested in and not the tidy routine itself.

What we are proposing is that since its safe (LOUD OBNOXIOUS AIR HORN: IT IS NOT) to assume that a WXR is probably WordPress content then we could change (import/wordpress/class-wxr.php) from:

$doc = new HTML5();
$html = $this->tidy( wpautop( $p['post_content'] ) );
$dom = $doc->loadHtml( $html );
// snip ...

To:

$doc = new HTML5();
$html = wpautop( $p['post_content'] );
if ( ! current_user_can( 'unfiltered_html' ) ) {
    $html = $this->tidy( $html );
}
$dom = $doc->loadHtml( $html );
// Snip ...

This would bypass safety like how WordPress does it in kses.

contractors who we don’t want to give super admin access

Unfortunately, the WXR import module uses wp_insert_post() to create the new posts. As far as we can tell that function results in kses sanitization being applied. Here are some Trac tickets dating back six years about this:

So even if you modify tidy, you will get your iframes taken away by WordPress later. (As seen in my first reply…)

Still up for discussion though. Not entirely against the filter idea. But we need a consistent approach for child classes.

Let’s try to find a compromise?

1 Like

Aside:

When looking at the import code, tidy is not the same thing as sanitize. We use tidy to clean up HTML so we can use it in strict formats.

That we make it “safe” is a side effect.

nods … I see your point about focusing on the WXR import module instead of the tidy function. I’m on board with that.

I like what you’re suggesting around checking the user capability. As a compromise on this strategy, could we filter on the user capability? So …

$doc = new HTML5();
$html = wpautop( $p['post_content'] );
// just spitballing, but we probably would want to name the filter something else...
if ( ! current_user_can( apply_filters( 'pb_import_wxr_skip_tidy', 'unfiltered_html' ) ) ) {
    $html = $this->tidy( $html );
}
$dom = $doc->loadHtml( $html );
// Snip ...

That could give us the flexibility we need while making the default behavior “safe” as it is now. Thoughts? Thanks for the continued dialogue on this. :slight_smile:

1 Like

@dac.chartrand just suggested to me that we could implement my original suggestion (this) and if you want to enable editing/importing of iframes by all users (or editors and higher, etc.) you can just use the core user_has_cap filter to add the unfiltered_html capability to a range of users who don’t currently have it.

The current_user_can() check calls WP_User::has_cap(). Within the latter method, you have:

You could drop a small mu-plugin into your network which adds the unfiltered_html capability. Then the same users would also be able to add/edit iframes in the normal content editor.

Then we could still maintain the security that we want (only users with unfiltered_html get to add/edit iframes) and you could relax that restriction as you wish in a broader context. Of course, as the unfiltered_html capability is used ~20 times in the WordPress codebase, I’d be very cautious about enabling it for everyone. Maybe stick to editors and up. Our main criteria for these types of changes is that our biggest network is still Pressbooks.com, which has open registration— and obviously, we don’t want to enable anything that would allow the insertion of malicious code by default.

If this sounds like a good approach, we can make the tidy routine in the WXR importer conditional on the unfiltered_html capability and then you can handle the rest on your end.

That’s an interesting approach. I’ll mull this over with Bracken and see if this solution would work for us.

yeah, doing it as a role capability would work great for us I think. We definitely want book admins to be able to import this stuff, and I think we probably want book editors to as well, but haven’t discussed that yet.

Just out of curiosity, how do you handle the fact that network admins are the only ones who can insert iframes in the editor by default? Do you override that some other way?

1 Like

We have altered the default editor behavior, yes.

https://codex.wordpress.org/Function_Reference/wp_kses_allowed_html

In Pressbooks 5.1 we did this.

Our iframes whitelist is a two step process:

Step 1) The first hook parses post content and looks for iframes. If an iframe is found then it checks the iframe against a whitelist. If the iframe url is not in the whitelist then the iframe gets deleted. You can add your own domains to the whitelist with this filter:

$whitelist = apply_filters( 'pb_whitelisted_domains', $this->whitelistedDomains );

Step 2) At this point all iframes we don’t trust have been deleted. We modify KSES to allow the remaining iframes but only with certain attributes. The second hook is probably similar to what you are doing.

I’d like your opinions on this 2nd hook. Are our KSES rules similar or are you just accepting iframes untouched?

Edit: Fixed links to GitHub.

You’re spot on @dac.chartrand. We’ve employed something practically identical to the second hook. We aren’t doing the first part though, and in that regard I think the solution you’re presenting is better. I’m going to test this out on a sandbox instance of our app updated to PB 5.1 and see if this satisfies our needs while hardening our stuff appropriately.

As for the original issue, I think we are going to settle on @ned’s suggestion of adding the unfiltered_html capability to the users we’d like to enable using iframes (for now).

Another issue I’ve run into (along these same lines) is that our oembeds are no longer working because they’re being wpautop'ed during import. Is there an easy solution to that I’m not seeing? For now, in class-wxr.php, I’m having to replace this line:

$html = wpautop( $p['post_content'] );

with:

$html = $p['post_content'];

While this makes it so oembeds render on the page correctly, this hack does have implications about how all the rest of the content may look (it doesn’t effect us negatively, but it would probably effect others).

EDIT: specified the file i’m having to alter to make oembeds work

Thanks. I’ll make sure to test Oembed URLs when implementing. We use wpautop because we can’t send bad HTML to \Masterminds\HTML5 which is used to download images and change src attributes.

1 Like

I take that back ^. By taking out wpautop( ... ) there are cases now where entire paragraphs are getting stripped out of the content on import because they aren’t being wrapped in p tags (up to and including the opening of the first shortcode on a page). We may have to add wpautop back in here but that brings us back to the issue of Oembeds being wrapped in p tags, causing them to not render correctly.

@dac.chartrand, did you ever have a chance to think about how the wxr import can handle Oembeds properly? I’ve been thinking about it and I don’t really see an obvious solution to the problem, so thought I’d reach out.

Cheers!