commit f1aa865ae5d4608cbfbb02f42baa1ef5ed95fce2 Author: Greg Kroah-Hartman Date: Wed Jun 7 12:08:04 2017 +0200 Linux 4.9.31 commit 11214bd292ec95c52f636c187b15c9160215f988 Author: Jan Kara Date: Thu May 18 16:36:23 2017 -0700 xfs: Fix off-by-in in loop termination in xfs_find_get_desired_pgoff() commit d7fd24257aa60316bf81093f7f909dc9475ae974 upstream. There is an off-by-one error in the loop termination conditions in xfs_find_get_desired_pgoff(), since 'end' may index a page beyond the end of the desired range if 'endoff' is page aligned. It doesn't have any visible effects, but it is still good to fix it. Signed-off-by: Jan Kara Reviewed-by: Darrick J. Wong Signed-off-by: Darrick J. Wong Signed-off-by: Greg Kroah-Hartman commit 75c5afd58d4696b4dce6ed931c49da9975962f99 Author: Eric Sandeen Date: Mon May 22 19:54:10 2017 -0700 xfs: fix unaligned access in xfs_btree_visit_blocks commit a4d768e702de224cc85e0c8eac9311763403b368 upstream. This structure copy was throwing unaligned access warnings on sparc64: Kernel unaligned access at TPC[1043c088] xfs_btree_visit_blocks+0x88/0xe0 [xfs] xfs_btree_copy_ptrs does a memcpy, which avoids it. Signed-off-by: Eric Sandeen Reviewed-by: Darrick J. Wong Signed-off-by: Darrick J. Wong Signed-off-by: Greg Kroah-Hartman commit 7fb8ab8f0a38adb97cf2cf0738628959db5d840d Author: Darrick J. Wong Date: Mon May 15 19:16:15 2017 -0700 xfs: avoid mount-time deadlock in CoW extent recovery commit 3ecb3ac7b950ff8f6c6a61e8b7b0d6e3546429a0 upstream. If a malicious user corrupts the refcount btree to cause a cycle between different levels of the tree, the next mount attempt will deadlock in the CoW recovery routine while grabbing buffer locks. We can use the ability to re-grab a buffer that was previously locked to a transaction to avoid deadlocks, so do that here. Signed-off-by: Darrick J. 
Wong Reviewed-by: Brian Foster Signed-off-by: Greg Kroah-Hartman commit e40c145c023db15a0504e95ac442f0e47c0a816d Author: Christoph Hellwig Date: Sat Jun 3 15:18:31 2017 +0200 xfs: xfs_trans_alloc_empty This is a partial cherry-pick of commit e89c041338 ("xfs: implement the GETFSMAP ioctl"), which also adds this helper, and a great example of why feature patches should be properly split into their parts. Signed-off-by: Darrick J. Wong [hch: split from the larger patch for -stable] Signed-off-by: Christoph Hellwig commit 0e542792a046c788f39ab13fcd68c7ba65930407 Author: Zorro Lang Date: Mon May 15 08:40:02 2017 -0700 xfs: bad assertion for delalloc an extent that start at i_size commit 892d2a5f705723b2cb488bfb38bcbdcf83273184 upstream. By running fsstress for long enough on RHEL-7, I hit an assertion failure (harder to reproduce on linux-4.11, but the problem is still there): XFS: Assertion failed: (iflags & BMV_IF_DELALLOC) != 0, file: fs/xfs/xfs_bmap_util.c The assertion is in the xfs_getbmap() function: if (map[i].br_startblock == DELAYSTARTBLOCK && --> map[i].br_startoff <= XFS_B_TO_FSB(mp, XFS_ISIZE(ip))) ASSERT((iflags & BMV_IF_DELALLOC) != 0); When map[i].br_startoff == XFS_B_TO_FSB(mp, XFS_ISIZE(ip)), the startoff is exactly at EOF. But the assertion only needs to cover delalloc extents that are within EOF, not ones that start at EOF. Signed-off-by: Zorro Lang Reviewed-by: Brian Foster Reviewed-by: Darrick J. Wong Signed-off-by: Darrick J. Wong Signed-off-by: Greg Kroah-Hartman commit f60d76efa91a1450c73589d9c61b3af74bf2cf73 Author: Darrick J. Wong Date: Fri May 12 10:44:08 2017 -0700 xfs: BMAPX shouldn't barf on inline-format directories commit 6eadbf4c8ba816c10d1c97bed9aa861d9fd17809 upstream. When we're fulfilling a BMAPX request, jump out early if the data fork is in local format. This prevents us from hitting a debugging check in bmapi_read and barfing errors back to userspace. 
The on-disk extent count check later isn't sufficient for IF_DELALLOC mode because da extents are in memory and not on disk. Signed-off-by: Darrick J. Wong Reviewed-by: Brian Foster Reviewed-by: Christoph Hellwig Signed-off-by: Greg Kroah-Hartman commit 53c44c236f218fbb813ce45b79da7b7d9938bf68 Author: Brian Foster Date: Fri May 12 10:44:08 2017 -0700 xfs: fix indlen accounting error on partial delalloc conversion commit 0daaecacb83bc6b656a56393ab77a31c28139bc7 upstream. The delalloc -> real block conversion path uses an incorrect calculation in the case where the middle part of a delalloc extent is being converted. This is documented as a rare situation because XFS generally attempts to maximize contiguity by converting as much of a delalloc extent as possible. If this situation does occur, the indlen reservation for the two new delalloc extents left behind by the conversion of the middle range is calculated and compared with the original reservation. If more blocks are required, the delta is allocated from the global block pool. This delta value can be characterized as the difference between the new total requirement (temp + temp2) and the currently available reservation minus those blocks that have already been allocated (startblockval(PREV.br_startblock) - allocated). The problem is that the current code does not account for previously allocated blocks correctly. It subtracts the current allocation count from the (new - old) delta rather than the old indlen reservation. This means that more indlen blocks than have been allocated end up stashed in the remaining extents and free space accounting is broken as a result. Fix up the calculation to subtract the allocated block count from the original extent indlen and thus correctly allocate the reservation delta based on the difference between the new total requirement and the unused blocks from the original reservation. 
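As a worked illustration of the accounting described above, here is a minimal sketch in plain C. It is a simplified model, not the kernel code: the function and parameter names are hypothetical, and the precondition that no more blocks were consumed than were reserved is an assumption of the model.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified model of the reservation delta; all names are hypothetical.
 *   new_total  - indlen needed by the two leftover delalloc extents
 *                (temp + temp2 in the description above)
 *   old_indlen - indlen reserved for the original delalloc extent
 *   allocated  - blocks already consumed from that reservation
 */

/* Old behaviour as described: subtract 'allocated' from the (new - old) delta. */
static int64_t indlen_delta_old(int64_t new_total, int64_t old_indlen,
                                int64_t allocated)
{
    return (new_total - old_indlen) - allocated;
}

/* Fixed behaviour: take 'allocated' off the old reservation first. */
static int64_t indlen_delta_fixed(int64_t new_total, int64_t old_indlen,
                                  int64_t allocated)
{
    /* Model assumption: cannot have consumed more than was reserved. */
    assert(allocated >= 0 && allocated <= old_indlen);
    return new_total - (old_indlen - allocated);
}
```

With new_total = 10, old_indlen = 8 and allocated = 2, the old formula yields 0 while the fixed one yields 4, so under this model the old calculation comes up short by twice the allocated count.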
Also remove a bogus assert that contradicts the fact that the new indlen reservation can be larger than the original indlen reservation. Signed-off-by: Brian Foster Reviewed-by: Darrick J. Wong Signed-off-by: Darrick J. Wong Signed-off-by: Greg Kroah-Hartman commit 54894ea3c5420170597d1f6a4343637e68b7a3e2 Author: Eryu Guan Date: Tue May 2 13:54:47 2017 -0700 xfs: fix use-after-free in xfs_finish_page_writeback commit 161f55efba5ddccc690139fae9373cafc3447a97 upstream. Commit 28b783e47ad7 ("xfs: bufferhead chains are invalid after end_page_writeback") fixed one use-after-free issue by pre-calculating the loop conditionals before calling bh->b_end_io() in the end_io processing loop, but it assigned 'next' pointer before checking end offset boundary & breaking the loop, at which point the bh might be freed already, and caused use-after-free. This is caught by KASAN when running fstests generic/127 on sub-page block size XFS. [ 2517.244502] run fstests generic/127 at 2017-04-27 07:30:50 [ 2747.868840] ================================================================== [ 2747.876949] BUG: KASAN: use-after-free in xfs_destroy_ioend+0x3d3/0x4e0 [xfs] at addr ffff8801395ae698 ... [ 2747.918245] Call Trace: [ 2747.920975] dump_stack+0x63/0x84 [ 2747.924673] kasan_object_err+0x21/0x70 [ 2747.928950] kasan_report+0x271/0x530 [ 2747.933064] ? xfs_destroy_ioend+0x3d3/0x4e0 [xfs] [ 2747.938409] ? end_page_writeback+0xce/0x110 [ 2747.943171] __asan_report_load8_noabort+0x19/0x20 [ 2747.948545] xfs_destroy_ioend+0x3d3/0x4e0 [xfs] [ 2747.953724] xfs_end_io+0x1af/0x2b0 [xfs] [ 2747.958197] process_one_work+0x5ff/0x1000 [ 2747.962766] worker_thread+0xe4/0x10e0 [ 2747.966946] kthread+0x2d3/0x3d0 [ 2747.970546] ? process_one_work+0x1000/0x1000 [ 2747.975405] ? kthread_create_on_node+0xc0/0xc0 [ 2747.980457] ? syscall_return_slowpath+0xe6/0x140 [ 2747.985706] ? 
do_page_fault+0x30/0x80 [ 2747.989887] ret_from_fork+0x2c/0x40 [ 2747.993874] Object at ffff8801395ae690, in cache buffer_head size: 104 [ 2748.001155] Allocated: [ 2748.003782] PID = 8327 [ 2748.006411] save_stack_trace+0x1b/0x20 [ 2748.010688] save_stack+0x46/0xd0 [ 2748.014383] kasan_kmalloc+0xad/0xe0 [ 2748.018370] kasan_slab_alloc+0x12/0x20 [ 2748.022648] kmem_cache_alloc+0xb8/0x1b0 [ 2748.027024] alloc_buffer_head+0x22/0xc0 [ 2748.031399] alloc_page_buffers+0xd1/0x250 [ 2748.035968] create_empty_buffers+0x30/0x410 [ 2748.040730] create_page_buffers+0x120/0x1b0 [ 2748.045493] __block_write_begin_int+0x17a/0x1800 [ 2748.050740] iomap_write_begin+0x100/0x2f0 [ 2748.055308] iomap_zero_range_actor+0x253/0x5c0 [ 2748.060362] iomap_apply+0x157/0x270 [ 2748.064347] iomap_zero_range+0x5a/0x80 [ 2748.068624] iomap_truncate_page+0x6b/0xa0 [ 2748.073227] xfs_setattr_size+0x1f7/0xa10 [xfs] [ 2748.078312] xfs_vn_setattr_size+0x68/0x140 [xfs] [ 2748.083589] xfs_file_fallocate+0x4ac/0x820 [xfs] [ 2748.088838] vfs_fallocate+0x2cf/0x780 [ 2748.093021] SyS_fallocate+0x48/0x80 [ 2748.097006] do_syscall_64+0x18a/0x430 [ 2748.101186] return_from_SYSCALL_64+0x0/0x6a [ 2748.105948] Freed: [ 2748.108189] PID = 8327 [ 2748.110816] save_stack_trace+0x1b/0x20 [ 2748.115093] save_stack+0x46/0xd0 [ 2748.118788] kasan_slab_free+0x73/0xc0 [ 2748.122969] kmem_cache_free+0x7a/0x200 [ 2748.127247] free_buffer_head+0x41/0x80 [ 2748.131524] try_to_free_buffers+0x178/0x250 [ 2748.136316] xfs_vm_releasepage+0x2e9/0x3d0 [xfs] [ 2748.141563] try_to_release_page+0x100/0x180 [ 2748.146325] invalidate_inode_pages2_range+0x7da/0xcf0 [ 2748.152087] xfs_shift_file_space+0x37d/0x6e0 [xfs] [ 2748.157557] xfs_collapse_file_space+0x49/0x120 [xfs] [ 2748.163223] xfs_file_fallocate+0x2a7/0x820 [xfs] [ 2748.168462] vfs_fallocate+0x2cf/0x780 [ 2748.172642] SyS_fallocate+0x48/0x80 [ 2748.176629] do_syscall_64+0x18a/0x430 [ 2748.180810] return_from_SYSCALL_64+0x0/0x6a Fixed it by checking on offset against end & 
breaking out first, dereference bh only if there're still bufferheads to process. Signed-off-by: Eryu Guan Reviewed-by: Christoph Hellwig Reviewed-by: Darrick J. Wong Signed-off-by: Darrick J. Wong Signed-off-by: Greg Kroah-Hartman commit d457f822817f9ab3a28b89b090dee5e4c5fb7ddf Author: Darrick J. Wong Date: Wed Apr 12 12:26:07 2017 -0700 xfs: reserve enough blocks to handle btree splits when remapping commit fe0be23e68200573de027de9b8cc2b27e7fce35e upstream. In xfs_reflink_end_cow, we erroneously reserve only enough blocks to handle adding 1 extent. This is problematic if we fragment free space, have to do CoW, and then have to perform multiple bmap btree expansions. Furthermore, the BUI recovery routine doesn't reserve /any/ blocks to handle btree splits, so log recovery fails after our first error causes the filesystem to go down. Therefore, refactor the transaction block reservation macros until we have a macro that works for our deferred (re)mapping activities, and fix both problems by using that macro. With 1k blocks we can hit this fairly often in g/187 if the scratch fs is big enough. Signed-off-by: Darrick J. Wong Reviewed-by: Christoph Hellwig Signed-off-by: Greg Kroah-Hartman commit 0ba833fe73d246b0bfb76403b7b67a890cf87b5e Author: Brian Foster Date: Wed Apr 26 08:30:40 2017 -0700 xfs: wait on new inodes during quotaoff dquot release commit e20c8a517f259cb4d258e10b0cd5d4b30d4167a0 upstream. The quotaoff operation has a race with inode allocation that results in a livelock. An inode allocation that occurs before the quota status flags are updated acquires the appropriate dquots for the inode via xfs_qm_vop_dqalloc(). It then inserts the XFS_INEW inode into the perag radix tree, sometime later attaches the dquots to the inode and finally clears the XFS_INEW flag. Quotaoff expects to release the dquots from all inodes in the filesystem via xfs_qm_dqrele_all_inodes(). 
This invokes the AG inode iterator, which skips inodes in the XFS_INEW state because they are not fully constructed. If the scan occurs after dquots have been attached to an inode, but before XFS_INEW is cleared, the newly allocated inode will continue to hold a reference to the applicable dquots. When quotaoff invokes xfs_qm_dqpurge_all(), the reference count of those dquot(s) remain elevated and the dqpurge scan spins indefinitely. To address this problem, update the xfs_qm_dqrele_all_inodes() scan to wait on inodes marked on the XFS_INEW state. We wait on the inodes explicitly rather than skip and retry to avoid continuous retry loops due to a parallel inode allocation workload. Since quotaoff updates the quota state flags and uses a synchronous transaction before the dqrele scan, and dquots are attached to inodes after radix tree insertion iff quota is enabled, one INEW waiting pass through the AG guarantees that the scan has processed all inodes that could possibly hold dquot references. Reported-by: Eryu Guan Signed-off-by: Brian Foster Reviewed-by: Darrick J. Wong Signed-off-by: Darrick J. Wong Signed-off-by: Greg Kroah-Hartman commit 2ea882d8ebc7cd6e95b09479af9f5f5692818fdb Author: Brian Foster Date: Wed Apr 26 08:30:39 2017 -0700 xfs: update ag iterator to support wait on new inodes commit ae2c4ac2dd39b23a87ddb14ceddc3f2872c6aef5 upstream. The AG inode iterator currently skips new inodes as such inodes are inserted into the inode radix tree before they are fully constructed. Certain contexts require the ability to wait on the construction of new inodes, however. The fs-wide dquot release from the quotaoff sequence is an example of this. Update the AG inode iterator to support the ability to wait on inodes flagged with XFS_INEW upon request. Create a new xfs_inode_ag_iterator_flags() interface and support a set of iteration flags to modify the iteration behavior. 
When the XFS_AGITER_INEW_WAIT flag is set, include XFS_INEW flags in the radix tree inode lookup and wait on them before the callback is executed. Signed-off-by: Brian Foster Reviewed-by: Darrick J. Wong Signed-off-by: Darrick J. Wong Signed-off-by: Greg Kroah-Hartman commit e86b616b5b9ed5023ceef154764e327507e30007 Author: Brian Foster Date: Wed Apr 26 08:30:39 2017 -0700 xfs: support ability to wait on new inodes commit 756baca27fff3ecaeab9dbc7a5ee35a1d7bc0c7f upstream. Inodes that are inserted into the perag tree but still under construction are flagged with the XFS_INEW bit. Most contexts either skip such inodes when they are encountered or have the ability to handle them. The runtime quotaoff sequence introduces a context that must wait for construction of such inodes to correctly ensure that all dquots in the fs are released. In anticipation of this, support the ability to wait on new inodes. Wake the appropriate bit when XFS_INEW is cleared. Signed-off-by: Brian Foster Reviewed-by: Darrick J. Wong Signed-off-by: Darrick J. Wong Signed-off-by: Greg Kroah-Hartman commit 10f0b2c3c225bae7eddf88a18466b9133a716c72 Author: Brian Foster Date: Fri Apr 21 12:40:44 2017 -0700 xfs: fix up quotacheck buffer list error handling commit 20e8a063786050083fe05b4f45be338c60b49126 upstream. The quotacheck error handling of the delwri buffer list assumes the resident buffers are locked and doesn't clear the _XBF_DELWRI_Q flag on the buffers that are dequeued. This can lead to assert failures on buffer release and possibly other locking problems. Move this code to a delwri queue cancel helper function to encapsulate the logic required to properly release buffers from a delwri queue. Update the helper to clear the delwri queue flag and call it from quotacheck. Signed-off-by: Brian Foster Reviewed-by: Darrick J. Wong Signed-off-by: Darrick J. 
Wong Signed-off-by: Greg Kroah-Hartman commit 95487d4be1e9eb7a43120d852c86c80803cfe51d Author: Brian Foster Date: Thu Apr 20 08:06:47 2017 -0700 xfs: prevent multi-fsb dir readahead from reading random blocks commit cb52ee334a45ae6c78a3999e4b473c43ddc528f4 upstream. Directory block readahead uses a complex iteration mechanism to map between high-level directory blocks and underlying physical extents. This mechanism attempts to traverse the higher-level dir blocks in a manner that handles multi-fsb directory blocks and simultaneously maintains a reference to the corresponding physical blocks. This logic doesn't handle certain (discontiguous) physical extent layouts correctly with multi-fsb directory blocks. For example, consider the case of a 4k FSB filesystem with a 2 FSB (8k) directory block size and a directory with the following extent layout: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL 0: [0..7]: 88..95 0 (88..95) 8 1: [8..15]: 80..87 0 (80..87) 8 2: [16..39]: 168..191 0 (168..191) 24 3: [40..63]: 5242952..5242975 1 (72..95) 24 Directory block 0 spans physical extents 0 and 1, dirblk 1 lies entirely within extent 2 and dirblk 2 spans extents 2 and 3. Because extent 2 is larger than the directory block size, the readahead code erroneously assumes the block is contiguous and issues a readahead based on the physical mapping of the first fsb of the dirblk. This results in read verifier failure and a spurious corruption or crc failure, depending on the filesystem format. Further, the subsequent readahead code responsible for walking through the physical table doesn't correctly advance the physical block reference for dirblk 2. Instead of advancing two physical filesystem blocks, the first iteration of the loop advances 1 block (correctly), but the subsequent iteration advances 2 more physical blocks because the next physical extent (extent 3, above) happens to cover more than dirblk 2. 
At this point, the higher-level directory block walking is completely off the rails of the actual physical layout of the directory for the respective mapping table. Update the contiguous dirblock logic to consider the current offset in the physical extent to avoid issuing directory readahead to unrelated blocks. Also, update the mapping table advancing code to consider the current offset within the current dirblock to avoid advancing the mapping reference too far beyond the dirblock. Signed-off-by: Brian Foster Reviewed-by: Darrick J. Wong Signed-off-by: Darrick J. Wong Signed-off-by: Greg Kroah-Hartman commit 93bd169845e5d5911e51509d7c7e4cba4a3a2d5a Author: Eric Sandeen Date: Thu Apr 13 15:15:47 2017 -0700 xfs: handle array index overrun in xfs_dir2_leaf_readbuf() commit 023cc840b40fad95c6fe26fff1d380a8c9d45939 upstream. Carlos had a case where "find" seemed to start spinning forever and never return. This was on a filesystem with non-default multi-fsb (8k) directory blocks, and a fragmented directory with extents like this: 0:[0,133646,2,0] 1:[2,195888,1,0] 2:[3,195890,1,0] 3:[4,195892,1,0] 4:[5,195894,1,0] 5:[6,195896,1,0] 6:[7,195898,1,0] 7:[8,195900,1,0] 8:[9,195902,1,0] 9:[10,195908,1,0] 10:[11,195910,1,0] 11:[12,195912,1,0] 12:[13,195914,1,0] ... i.e. the first extent is a contiguous 2-fsb dir block, but after that it is fragmented into 1 block extents. At the top of the readdir path, we allocate a mapping array which (for this filesystem geometry) can hold 10 extents; see the assignment to map_info->map_size. During readdir, we are therefore able to map extents 0 through 9 above into the array for readahead purposes. If we count by 2, we see that the last mapped index (9) is the first block of a 2-fsb directory block. 
At the end of xfs_dir2_leaf_readbuf() we have 2 loops to fill more readahead; the outer loop assumes one full dir block is processed each loop iteration, and the inner loop ensures that this is so by advancing to the next extent until a full directory block is mapped. The problem is that this inner loop may step past the last extent in the mapping array as it tries to reach the end of the directory block. This will read garbage for the extent length, and as a result the loop control variable 'j' may become corrupted and never fail the loop conditional. The number of valid mappings we have in our array is stored in map->map_valid, so stop this inner loop based on that limit. There is an ASSERT at the top of the outer loop for this same condition, but we never made it out of the inner loop, so the ASSERT never fired. Huge appreciation for Carlos for debugging and isolating the problem. Debugged-and-analyzed-by: Carlos Maiolino Signed-off-by: Eric Sandeen Tested-by: Carlos Maiolino Reviewed-by: Carlos Maiolino Reviewed-by: Bill O'Donnell Reviewed-by: Darrick J. Wong Signed-off-by: Darrick J. Wong Signed-off-by: Greg Kroah-Hartman commit 99226b890d63c5aaf8b5ccc95b17d1a8087f45c9 Author: Christoph Hellwig Date: Tue Apr 11 16:45:52 2017 -0700 xfs: fix integer truncation in xfs_bmap_remap_alloc commit 52813fb13ff90bd9c39a93446cbf1103c290b6e9 upstream. bno should be an xfs_fsblock_t, which is 64 bits wide, instead of an xfs_agblock_t, which truncates the value to 32 bits. Signed-off-by: Christoph Hellwig Reviewed-by: Darrick J. Wong Signed-off-by: Darrick J. Wong Signed-off-by: Greg Kroah-Hartman commit 4e2762878a59072f2ba1e02987d22ef77921791e Author: Brian Foster Date: Tue Apr 11 10:50:05 2017 -0700 xfs: drop iolock from reclaim context to appease lockdep commit 3b4683c294095b5f777c03307ef8c60f47320e12 upstream. 
Lockdep complains about use of the iolock in inode reclaim context because it doesn't understand that reclaim has the last reference to the inode, and thus an iolock->reclaim->iolock deadlock is not possible. The iolock is technically not necessary in xfs_inactive() and was only added to appease an assert in xfs_free_eofblocks(), which can be called from other non-reclaim contexts. Therefore, just kill the assert and drop the use of the iolock from reclaim context to quiet lockdep. Signed-off-by: Brian Foster Reviewed-by: Darrick J. Wong Signed-off-by: Darrick J. Wong Signed-off-by: Greg Kroah-Hartman commit 4e8163fc8159590c031783bfb0d548ca45f4ebdf Author: Darrick J. Wong Date: Thu Apr 6 16:00:39 2017 -0700 xfs: actually report xattr extents via iomap commit 84358536dc355a9c8978ee425f87e116186bed16 upstream. Apparently FIEMAP for xattrs has been broken since we switched to the iomap backend because of an incorrect check for xattr presence. Also fix the broken locking. Signed-off-by: Darrick J. Wong Reviewed-by: Brian Foster Reviewed-by: Christoph Hellwig Signed-off-by: Greg Kroah-Hartman commit de417ea6b0a607e79b765838d397c0eb0bcb62d9 Author: Darrick J. Wong Date: Mon Apr 3 15:17:57 2017 -0700 xfs: fix over-copying of getbmap parameters from userspace commit be6324c00c4d1e0e665f03ed1fc18863a88da119 upstream. In xfs_ioc_getbmap, we should only copy the fields of struct getbmap from userspace, or else we end up copying random stack contents into the kernel. struct getbmap is a strict subset of getbmapx, so a partial structure copy should work fine. Signed-off-by: Darrick J. Wong Reviewed-by: Christoph Hellwig Signed-off-by: Greg Kroah-Hartman commit c2ad2dc3d2648d603e7ab7dff4c6b38122206673 Author: Brian Foster Date: Tue Mar 28 14:51:44 2017 -0700 xfs: use dedicated log worker wq to avoid deadlock with cil wq commit 696a562072e3c14bcd13ae5acc19cdf27679e865 upstream. The log covering background task used to be part of the xfssyncd workqueue. 
That workqueue was removed as of commit 5889608df ("xfs: syncd workqueue is no more") and the associated work item was scheduled to the xfs-log wq. The latter is used for log buffer I/O completion. Since xfs_log_worker() can invoke a log flush, a deadlock is possible between the xfs-log and xfs-cil workqueues. Consider the following codepath from xfs_log_worker(): xfs_log_worker() -> xfs_log_force() -> _xfs_log_force() -> xlog_cil_force() -> xlog_cil_force_lsn() -> xlog_cil_push_now() -> flush_work() The above is in xfs-log wq context and blocked waiting on the completion of an xfs-cil work item. Concurrently, the cil push in progress can end up blocked here: xlog_cil_push_work() -> xlog_cil_push() -> xlog_write() -> xlog_state_get_iclog_space() -> xlog_wait(&log->l_flush_wait, ...) The above is in xfs-cil context waiting on log buffer I/O completion, which executes in xfs-log wq context. In this scenario both workqueues are deadlocked waiting on each other. Add a new workqueue specifically for the high level log covering and ail pushing worker, as was the case prior to commit 5889608df. Diagnosed-by: David Jeffery Signed-off-by: Brian Foster Reviewed-by: Darrick J. Wong Signed-off-by: Darrick J. Wong Signed-off-by: Greg Kroah-Hartman commit 3890d83805fe0f9c854fa3445358714aaa71d1ca Author: Darrick J. Wong Date: Mon Apr 3 12:22:39 2017 -0700 xfs: fix kernel memory exposure problems commit bf9216f922612d2db7666aae01e65064da2ffb3a upstream. Fix a memory exposure problem in inumbers where we allocate an array of structures with holes, fail to zero the holes, then blindly copy the kernel memory contents (junk and all) into userspace. Signed-off-by: Darrick J. Wong Reviewed-by: Christoph Hellwig Signed-off-by: Greg Kroah-Hartman commit ca659e086fb7c1584d5e665ea51bd1abd687ea00 Author: Darrick J. Wong Date: Mon Apr 3 12:22:20 2017 -0700 xfs: rework the inline directory verifiers commit 78420281a9d74014af7616958806c3aba056319e upstream. 
The inline directory verifiers should be called on the inode fork data, which means after iformat_local on the read side, and prior to ifork_flush on the write side. This makes the fork verifier more consistent with the way buffer verifiers work -- i.e. they will operate on the memory buffer that the code will be reading and writing directly. Furthermore, revise the verifier function to return -EFSCORRUPTED so that we don't flood the logs with corruption messages and assert notices. This has been a particular problem with xfs/348, which triggers the XFS_WANT_CORRUPTED_RETURN assertions, which halts the kernel when CONFIG_XFS_DEBUG=y. Disk corruption isn't supposed to do that, at least not in a verifier. Reviewed-by: Brian Foster Reviewed-by: Christoph Hellwig Signed-off-by: Darrick J. Wong Signed-off-by: Greg Kroah-Hartman commit 815414e7648b6d94cb81542dfb15cb71cbbd2ae9 Author: Darrick J. Wong Date: Wed Mar 15 00:24:25 2017 -0700 xfs: verify inline directory data forks commit 630a04e79dd41ff746b545d4fc052e0abb836120 upstream. When we're reading or writing the data fork of an inline directory, check the contents to make sure we're not overflowing buffers or eating garbage data. xfs/348 corrupts an inline symlink into an inline directory, triggering a buffer overflow bug. v2: add more checks consistent with _dir2_sf_check and make the verifier usable from anywhere. Signed-off-by: Darrick J. Wong Reviewed-by: Brian Foster Signed-off-by: Greg Kroah-Hartman commit 11b4854772859d5a755cd575f68e2e2e3234ef43 Author: Eryu Guan Date: Tue May 23 08:30:46 2017 -0700 xfs: fix off-by-one on max nr_pages in xfs_find_get_desired_pgoff() commit 8affebe16d79ebefb1d9d6d56a46dc89716f9453 upstream. xfs_find_get_desired_pgoff() is used to search for offset of hole or data in page range [index, end] (both inclusive), and the max number of pages to search should be at least one, if end == index. Otherwise the only page is missed and no hole or data is found, which is not correct. 
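The inclusive-range sizing described above can be sketched in a few lines of C. This is an illustrative model under stated assumptions, not the kernel source: the helper names and the LOOKUP_BATCH_MAX cap are hypothetical stand-ins, and the "old" variant only mirrors the behaviour the description implies (zero pages requested when end == index).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-lookup cap, standing in for the kernel's page vector size. */
#define LOOKUP_BATCH_MAX ((size_t)14)

/* Pre-fix sizing as described: treats 'end' as if it were exclusive. */
static size_t pages_wanted_old(size_t index, size_t end)
{
    size_t span;

    assert(index <= end);
    span = end - index;
    return span < LOOKUP_BATCH_MAX ? span : LOOKUP_BATCH_MAX;
}

/* Sizing for the inclusive range [index, end]: always at least one page. */
static size_t pages_wanted_fixed(size_t index, size_t end)
{
    size_t span;

    assert(index <= end);
    span = end - index;
    /* Clamp before adding one so the result stays within the batch cap. */
    return (span < LOOKUP_BATCH_MAX - 1 ? span : LOOKUP_BATCH_MAX - 1) + 1;
}
```

For a single-page range (index == end) the old sizing asks for zero pages and so finds nothing, while the fixed sizing asks for one; for long ranges both stay within the cap.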
When block size is smaller than page size, this can be demonstrated by preallocating a file with size smaller than page size and writing data to the last block. E.g. run this xfs_io command on a 1k block size XFS on x86_64 host. # xfs_io -fc "falloc 0 3k" -c "pwrite 2k 1k" \ -c "seek -d 0" /mnt/xfs/testfile wrote 1024/1024 bytes at offset 2048 1 KiB, 1 ops; 0.0000 sec (33.675 MiB/sec and 34482.7586 ops/sec) Whence Result DATA EOF Data at offset 2k was missed, and lseek(2) returned ENXIO. This is uncovered by generic/285 subtest 07 and 08 on ppc64 host, where pagesize is 64k. Because a recent change to generic/285 reduced the preallocated file size to smaller than 64k. Signed-off-by: Eryu Guan Reviewed-by: Jan Kara Reviewed-by: Darrick J. Wong Signed-off-by: Darrick J. Wong Signed-off-by: Greg Kroah-Hartman commit 9c795fff53f934952c84e80e568b488492d8f805 Author: Brian Foster Date: Wed May 31 08:22:52 2017 -0700 xfs: use ->b_state to fix buffer I/O accounting release race commit 63db7c815bc0997c29e484d2409684fdd9fcd93b upstream. We've had user reports of unmount hangs in xfs_wait_buftarg() that analysis shows is due to btp->bt_io_count == -1. bt_io_count represents the count of in-flight asynchronous buffers and thus should always be >= 0. xfs_wait_buftarg() waits for this value to stabilize to zero in order to ensure that all untracked (with respect to the lru) buffers have completed I/O processing before unmount proceeds to tear down in-core data structures. The value of -1 implies an I/O accounting decrement race. Indeed, the fact that xfs_buf_ioacct_dec() is called from xfs_buf_rele() (where the buffer lock is no longer held) means that bp->b_flags can be updated from an unsafe context. While a user-level reproducer is currently not available, some intrusive hacks to run racing buffer lookups/ioacct/releases from multiple threads was used to successfully manufacture this problem. Existing callers do not expect to acquire the buffer lock from xfs_buf_rele(). 
Therefore, we can not safely update ->b_flags from this context. It turns out that we already have separate buffer state bits and associated serialization for dealing with buffer LRU state in the form of ->b_state and ->b_lock. Therefore, replace the _XBF_IN_FLIGHT flag with a ->b_state variant, update the I/O accounting wrappers appropriately and make sure they are used with the correct locking. This ensures that buffer in-flight state can be modified at buffer release time without racing with modifications from a buffer lock holder. Fixes: 9c7504aa72b6 ("xfs: track and serialize in-flight async buffers against unmount") Signed-off-by: Brian Foster Reviewed-by: Nikolay Borisov Tested-by: Libor Pechacek Reviewed-by: Darrick J. Wong Signed-off-by: Darrick J. Wong Signed-off-by: Greg Kroah-Hartman commit c9eab63b9e6210a9bd3b124e43ad575f5f52a130 Author: Jan Kara Date: Thu May 18 16:36:22 2017 -0700 xfs: Fix missed holes in SEEK_HOLE implementation commit 5375023ae1266553a7baa0845e82917d8803f48c upstream. XFS SEEK_HOLE implementation could miss a hole in an unwritten extent as can be seen by the following command: xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "pwrite 128k 8k" -c "seek -h 0" file wrote 57344/57344 bytes at offset 0 56 KiB, 14 ops; 0.0000 sec (49.312 MiB/sec and 12623.9856 ops/sec) wrote 8192/8192 bytes at offset 131072 8 KiB, 2 ops; 0.0000 sec (70.383 MiB/sec and 18018.0180 ops/sec) Whence Result HOLE 139264 Where we can see that hole at offset 56k was just ignored by SEEK_HOLE implementation. The bug is in xfs_find_get_desired_pgoff() which does not properly detect the case when pages are not contiguous. Fix the problem by properly detecting when found page has larger offset than expected. Fixes: d126d43f631f996daeee5006714fed914be32368 Signed-off-by: Jan Kara Reviewed-by: Brian Foster Reviewed-by: Darrick J. Wong Signed-off-by: Darrick J. 
Wong Signed-off-by: Greg Kroah-Hartman commit 670821b9482d037915e75bdd307ec87f231dd70a Author: Patrik Jakobsson Date: Tue Apr 18 13:43:32 2017 +0200 drm/gma500/psb: Actually use VBT mode when it is found commit 82bc9a42cf854fdf63155759c0aa790bd1f361b0 upstream. With LVDS we were incorrectly picking the pre-programmed mode instead of the preferred mode provided by VBT. Make sure we pick the VBT mode if one is provided. It is likely that the mode read-out code is still wrong but this patch fixes the immediate problem on most machines. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=78562 Signed-off-by: Patrik Jakobsson Link: http://patchwork.freedesktop.org/patch/msgid/20170418114332.12183-1-patrik.r.jakobsson@gmail.com Signed-off-by: Greg Kroah-Hartman commit 74b416367b4ea3aad7c4ff566345a4cf3cf5a80b Author: Daniel Thompson Date: Tue Jan 24 15:18:02 2017 -0800 mm/slub.c: trace free objects at KERN_INFO commit aa2efd5ea4041754da4046c3d2e7edaac9526258 upstream. Currently when trace is enabled (e.g. slub_debug=T,kmalloc-128) the trace messages are mostly output at KERN_INFO. However the trace code also calls print_section() to hexdump the head of a free object. This is hard coded to use KERN_ERR, meaning the console is deluged with trace messages even if we've asked for quiet. Fix this the obvious way by adding a level parameter to print_section(), allowing calls from the trace code to use the same trace level as other trace messages. 
Link: http://lkml.kernel.org/r/20170113154850.518-1-daniel.thompson@linaro.org Signed-off-by: Daniel Thompson Acked-by: Christoph Lameter Acked-by: David Rientjes Cc: Pekka Enberg Cc: Joonsoo Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman commit c1bb2a899b5f5edc10cd5b5b9b0496cddb533565 Author: Thomas Gleixner Date: Fri Jun 2 14:46:25 2017 -0700 slub/memcg: cure the brainless abuse of sysfs attributes commit 478fe3037b2278d276d4cd9cd0ab06c4cb2e9b32 upstream. memcg_propagate_slab_attrs() abuses the sysfs attribute file functions to propagate settings from the root kmem_cache to a newly created kmem_cache. It does that with: attr->show(root, buf); attr->store(new, buf, strlen(buf)); Aside from being lazy and absurd hackery, this is broken because it does not check the return value of the show() function. Some of the show() functions return 0 w/o touching the buffer. That means in such a case the store function is called with the stale content of the previous show(). That causes nonsense like invoking kmem_cache_shrink() on a newly created kmem_cache. In the worst case it would cause handing in an uninitialized buffer. This should be rewritten properly by adding a propagate() callback to those slub_attributes which must be propagated and avoiding that insane conversion to and from ASCII, but that's too large for a hot fix. Check at least the return value of the show() function, so calling store() with stale content is prevented. Steven said: "It can cause a deadlock with get_online_cpus() that has been uncovered by recent cpu hotplug and lockdep changes that Thomas and Peter have been doing. 
Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(cpu_hotplug.lock); lock(slab_mutex); lock(cpu_hotplug.lock); lock(slab_mutex); *** DEADLOCK ***" Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1705201244540.2255@nanos Signed-off-by: Thomas Gleixner Reported-by: Steven Rostedt Acked-by: David Rientjes Cc: Johannes Weiner Cc: Michal Hocko Cc: Peter Zijlstra Cc: Christoph Lameter Cc: Pekka Enberg Cc: Joonsoo Kim Cc: Christoph Hellwig Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman commit 873f3b0ebbfe454f092d038a4bd1e0e17622c786 Author: Andrea Arcangeli Date: Fri Jun 2 14:46:11 2017 -0700 ksm: prevent crash after write_protect_page fails commit a7306c3436e9c8e584a4b9fad5f3dc91be2a6076 upstream. "err" needs to be left set to -EFAULT if split_huge_page succeeds. Otherwise if "err" gets clobbered with zero and write_protect_page fails, try_to_merge_one_page() will succeed instead of returning -EFAULT and then try_to_merge_with_ksm_page() will continue thinking kpage is a PageKsm when in fact it's still an anonymous page. Eventually it'll crash in page_add_anon_rmap. This has been reproduced on Fedora25 kernel but I can reproduce with upstream too. The bug was introduced in commit f765f540598a ("ksm: prepare to new THP semantics") introduced in v4.5. page:fffff67546ce1cc0 count:4 mapcount:2 mapping:ffffa094551e36e1 index:0x7f0f46673 flags: 0x2ffffc0004007c(referenced|uptodate|dirty|lru|active|swapbacked) page dumped because: VM_BUG_ON_PAGE(!PageLocked(page)) page->mem_cgroup:ffffa09674bf0000 ------------[ cut here ]------------ kernel BUG at mm/rmap.c:1222! CPU: 1 PID: 76 Comm: ksmd Not tainted 4.9.3-200.fc25.x86_64 #1 RIP: do_page_add_anon_rmap+0x1c4/0x240 Call Trace: page_add_anon_rmap+0x18/0x20 try_to_merge_with_ksm_page+0x50b/0x780 ksm_scan_thread+0x1211/0x1410 ? prepare_to_wait_event+0x100/0x100 ? try_to_merge_with_ksm_page+0x780/0x780 kthread+0xd9/0xf0 ? 
kthread_park+0x60/0x60 ret_from_fork+0x25/0x30 Fixes: f765f54059 ("ksm: prepare to new THP semantics") Link: http://lkml.kernel.org/r/20170513131040.21732-1-aarcange@redhat.com Signed-off-by: Andrea Arcangeli Reported-by: Federico Simoncelli Acked-by: Kirill A. Shutemov Cc: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman commit d5ecb4ca0da7528c0f1ebc676d831b00efd37efa Author: Rob Landley Date: Sat May 20 15:03:29 2017 -0500 x86/boot: Use CROSS_COMPILE prefix for readelf commit 3780578761921f094179c6289072a74b2228c602 upstream. The boot code Makefile contains a straight 'readelf' invocation. This causes build warnings in cross compile environments, when there is no unprefixed readelf accessible via $PATH. Add the missing $(CROSS_COMPILE) prefix. [ tglx: Rewrote changelog ] Fixes: 98f78525371b ("x86/boot: Refuse to build with data relocations") Signed-off-by: Rob Landley Acked-by: Kees Cook Cc: Jiri Kosina Cc: Paul Bolle Cc: "H.J. Lu" Link: http://lkml.kernel.org/r/ced18878-693a-9576-a024-113ef39a22c0@landley.net Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit d1cff22220718d2fd2a694271a11727f30413658 Author: Mike Marciniszyn Date: Fri May 12 09:02:00 2017 -0700 RDMA/qib,hfi1: Fix MR reference count leak on write with immediate commit 1feb40067cf04ae48d65f728d62ca255c9449178 upstream. The handling of IB_RDMA_WRITE_ONLY_WITH_IMMEDIATE will leak a memory reference when a buffer cannot be allocated for returning the immediate data. The issue is that the rkey validation has already occurred and the RNR nak fails to release the reference that was fruitlessly gotten. The peer will send the identical single packet request when its RNR timer pops. The fix is to release the held reference prior to the rnr nak exit. This is the only sequence that requires both rkey validation and the buffer allocation on the same packet.
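The leak pattern and its fix can be sketched in a few lines of plain C. Everything here (struct mr, mr_get(), mr_put(), handle_write_with_imm()) is a hypothetical, heavily simplified stand-in and not the qib/hfi1 driver code; it only illustrates releasing a reference on the early-exit path.

```c
#include <stdbool.h>

/* Hypothetical, simplified memory-region object with a reference count. */
struct mr {
    int refcount;
};

static void mr_get(struct mr *mr) { mr->refcount++; }
static void mr_put(struct mr *mr) { mr->refcount--; }

enum rx_result { RX_OK, RX_RNR_NAK };

/*
 * Handle a "write with immediate" style request: key validation takes a
 * reference on the region, then a receive buffer is needed for the
 * immediate data. If no buffer is available, the reference is dropped
 * before the RNR NAK exit, because the peer resends the same request
 * later and validation will take the reference again.
 */
static enum rx_result handle_write_with_imm(struct mr *mr, bool have_buffer)
{
    mr_get(mr);             /* reference taken by key validation */

    if (!have_buffer) {
        mr_put(mr);         /* the fix: release before the early exit */
        return RX_RNR_NAK;
    }

    /* ... the write itself would be processed here ... */
    mr_put(mr);             /* normal completion releases the reference */
    return RX_OK;
}
```

Without the mr_put() on the early-exit path, every retried request would add one more reference that nothing ever drops.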
Tested-by: Tadeusz Struk Reviewed-by: Dennis Dalessandro Signed-off-by: Mike Marciniszyn Signed-off-by: Dennis Dalessandro Signed-off-by: Doug Ledford Signed-off-by: Greg Kroah-Hartman commit 292f70cd9649170243fe29331654f8c5f0c8d5d6 Author: Michal Hocko Date: Fri Jun 2 14:46:49 2017 -0700 mm: consider memblock reservations for deferred memory initialization sizing commit 864b9a393dcb5aed09b8fd31b9bbda0fdda99374 upstream. We have seen an early OOM killer invocation on ppc64 systems with crashkernel=4096M: kthreadd invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=7, order=0, oom_score_adj=0 kthreadd cpuset=/ mems_allowed=7 CPU: 0 PID: 2 Comm: kthreadd Not tainted 4.4.68-1.gd7fe927-default #1 Call Trace: dump_stack+0xb0/0xf0 (unreliable) dump_header+0xb0/0x258 out_of_memory+0x5f0/0x640 __alloc_pages_nodemask+0xa8c/0xc80 kmem_getpages+0x84/0x1a0 fallback_alloc+0x2a4/0x320 kmem_cache_alloc_node+0xc0/0x2e0 copy_process.isra.25+0x260/0x1b30 _do_fork+0x94/0x470 kernel_thread+0x48/0x60 kthreadd+0x264/0x330 ret_from_kernel_thread+0x5c/0xa4 Mem-Info: active_anon:0 inactive_anon:0 isolated_anon:0 active_file:0 inactive_file:0 isolated_file:0 unevictable:0 dirty:0 writeback:0 unstable:0 slab_reclaimable:5 slab_unreclaimable:73 mapped:0 shmem:0 pagetables:0 bounce:0 free:0 free_pcp:0 free_cma:0 Node 7 DMA free:0kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:52428800kB managed:110016kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:320kB slab_unreclaimable:4672kB kernel_stack:1152kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? 
yes lowmem_reserve[]: 0 0 0 0 Node 7 DMA: 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 0kB 0 total pagecache pages 0 pages in swap cache Swap cache stats: add 0, delete 0, find 0/0 Free swap = 0kB Total swap = 0kB 819200 pages RAM 0 pages HighMem/MovableOnly 817481 pages reserved 0 pages cma reserved 0 pages hwpoisoned The reason is that the managed memory is too low (only 110MB) while the rest of the 50GB is still waiting for the deferred initialization to be done. update_defer_init estimates the initial memory to initialize to 2GB at least but it doesn't consider any memory allocated in that range. In this particular case we've had Reserving 4096MB of memory at 128MB for crashkernel (System RAM: 51200MB) so the low 2GB is mostly depleted. Fix this by considering memblock allocations in the initial static initialization estimation. Move the max_initialise to reset_deferred_meminit and implement a simple memblock_reserved_memory helper which iterates all reserved blocks and sums the size of all that start below the given address. The cumulative size is then added on top of the initial estimation. This is still not ideal because reset_deferred_meminit doesn't consider holes and so reservation might be above the initial estimation which we ignore but let's make the logic simpler until we really need to handle more complicated cases. Fixes: 3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set") Link: http://lkml.kernel.org/r/20170531104010.GI27783@dhcp22.suse.cz Signed-off-by: Michal Hocko Acked-by: Mel Gorman Tested-by: Srikar Dronamraju Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman commit 1163e785b1506a4f46dbdee89bbab161dd742186 Author: Yisheng Xie Date: Fri Jun 2 14:46:43 2017 -0700 mlock: fix mlock count can not decrease in race condition commit 70feee0e1ef331b22cc51f383d532a0d043fbdcc upstream. 
Kefeng reported that when running the following test, the mlock count in meminfo will increase permanently: [1] testcase linux:~ # cat test_mlockal grep Mlocked /proc/meminfo for j in `seq 0 10` do for i in `seq 4 15` do ./p_mlockall >> log & done sleep 0.2 done # wait some time to let mlock counter decrease and 5s may not be enough sleep 5 grep Mlocked /proc/meminfo linux:~ # cat p_mlockall.c #include <sys/mman.h> #include <stdlib.h> #include <stdio.h> #define SPACE_LEN 4096 int main(int argc, char ** argv) { int ret; void *adr = malloc(SPACE_LEN); if (!adr) return -1; ret = mlockall(MCL_CURRENT | MCL_FUTURE); printf("mlockall ret = %d\n", ret); ret = munlockall(); printf("munlockall ret = %d\n", ret); free(adr); return 0; } In __munlock_pagevec() we should decrement NR_MLOCK for each page where we clear the PageMlocked flag. Commit 1ebb7cc6a583 ("mm: munlock: batch NR_MLOCK zone state updates") has introduced a bug where we don't decrement NR_MLOCK for pages where we clear the flag, but fail to isolate them from the lru list (e.g. when the pages are on some other cpu's percpu pagevec). Since PageMlocked stays cleared, the NR_MLOCK accounting gets permanently disrupted by this. Fix it by counting the number of pages whose PageMlocked flag is cleared. Fixes: 1ebb7cc6a583 (" mm: munlock: batch NR_MLOCK zone state updates") Link: http://lkml.kernel.org/r/1495678405-54569-1-git-send-email-xieyisheng1@huawei.com Signed-off-by: Yisheng Xie Reported-by: Kefeng Wang Tested-by: Kefeng Wang Cc: Vlastimil Babka Cc: Joern Engel Cc: Mel Gorman Cc: Michel Lespinasse Cc: Hugh Dickins Cc: Rik van Riel Cc: Johannes Weiner Cc: Michal Hocko Cc: Xishi Qiu Cc: zhongjiang Cc: Hanjun Guo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman commit d494cab70697413fe1a1b6ae57e4ba18a5e43e36 Author: Punit Agrawal Date: Fri Jun 2 14:46:40 2017 -0700 mm/migrate: fix refcount handling when !hugepage_migration_supported() commit 30809f559a0d348c2dfd7ab05e9a451e2384962e upstream. 
On failing to migrate a page, soft_offline_huge_page() performs the necessary update to the hugepage ref-count. But when !hugepage_migration_supported() , unmap_and_move_hugepage() also decrements the page ref-count for the hugepage. The combined behaviour leaves the ref-count in an inconsistent state. This leads to soft lockups when running the overcommitted hugepage test from mce-tests suite. Soft offlining pfn 0x83ed600 at process virtual address 0x400000000000 soft offline: 0x83ed600: migration failed 1, type 1fffc00000008008 (uptodate|head) INFO: rcu_preempt detected stalls on CPUs/tasks: Tasks blocked on level-0 rcu_node (CPUs 0-7): P2715 (detected by 7, t=5254 jiffies, g=963, c=962, q=321) thugetlb_overco R running task 0 2715 2685 0x00000008 Call trace: dump_backtrace+0x0/0x268 show_stack+0x24/0x30 sched_show_task+0x134/0x180 rcu_print_detail_task_stall_rnp+0x54/0x7c rcu_check_callbacks+0xa74/0xb08 update_process_times+0x34/0x60 tick_sched_handle.isra.7+0x38/0x70 tick_sched_timer+0x4c/0x98 __hrtimer_run_queues+0xc0/0x300 hrtimer_interrupt+0xac/0x228 arch_timer_handler_phys+0x3c/0x50 handle_percpu_devid_irq+0x8c/0x290 generic_handle_irq+0x34/0x50 __handle_domain_irq+0x68/0xc0 gic_handle_irq+0x5c/0xb0 Address this by changing the putback_active_hugepage() in soft_offline_huge_page() to putback_movable_pages(). This only triggers on systems that enable memory failure handling (ARCH_SUPPORTS_MEMORY_FAILURE) but not hugepage migration (!ARCH_ENABLE_HUGEPAGE_MIGRATION). I imagine this wasn't triggered as there aren't many systems running this configuration. 
[akpm@linux-foundation.org: remove dead comment, per Naoya] Link: http://lkml.kernel.org/r/20170525135146.32011-1-punit.agrawal@arm.com Reported-by: Manoj Iyer Tested-by: Manoj Iyer Suggested-by: Naoya Horiguchi Signed-off-by: Punit Agrawal Cc: Joonsoo Kim Cc: Wanpeng Li Cc: Christoph Lameter Cc: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman commit 7d8ef0e0bc1e25cc23f6012aacafe3b28e8c01f7 Author: Alexander Tsoy Date: Mon May 22 20:58:11 2017 +0300 ALSA: hda - apply STAC_9200_DELL_M22 quirk for Dell Latitude D430 commit 1fc2e41f7af4572b07190f9dec28396b418e9a36 upstream. This model is actually called 92XXM2-8 in Windows driver. But since pin configs for M22 and M28 are identical, just reuse M22 quirk. Fixes external microphone (tested) and probably docking station ports (not tested). Signed-off-by: Alexander Tsoy Signed-off-by: Takashi Iwai Signed-off-by: Greg Kroah-Hartman commit da856d05645c8d4ca74a6465a98126ec69815266 Author: Nicolas Iooss Date: Fri Jun 2 14:46:28 2017 -0700 pcmcia: remove left-over %Z format commit ff5a20169b98d84ad8d7f99f27c5ebbb008204d6 upstream. Commit 5b5e0928f742 ("lib/vsprintf.c: remove %Z support") removed some usages of format %Z but forgot "%.2Zx". This makes clang 4.0 reports a -Wformat-extra-args warning because it does not know about %Z. Replace %Z with %z. Link: http://lkml.kernel.org/r/20170520090946.22562-1-nicolas.iooss_linux@m4x.org Signed-off-by: Nicolas Iooss Cc: Harald Welte Cc: Alexey Dobriyan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman commit ebd4c110fd0b4247277e8f351aa2d6694a0b5d94 Author: Michel Dänzer Date: Mon Jan 30 12:06:35 2017 +0900 drm/radeon: Fix vram_size/visible values in DRM_RADEON_GEM_INFO ioctl commit 51964e9e12d0a054002a1a0d1dec4f661c7aaf28 upstream. 
vram_size is supposed to be the total amount of VRAM that can be used by userspace, which corresponds to the TTM VRAM manager size (which is normally the full amount of VRAM, but can be just the visible VRAM when DMA can't be used for BO migration for some reason). The above was incorrectly used for vram_visible before, resulting in generally too large values being reported. Reviewed-by: Christian König Reviewed-by: Nicolai Hähnle Reviewed-by: Alex Deucher Signed-off-by: Michel Dänzer Signed-off-by: Alex Deucher Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman commit acc771fdaec7008a3b68db2a07e6d5901ea021fe Author: Lyude Date: Thu May 11 19:31:12 2017 -0400 drm/radeon: Unbreak HPD handling for r600+ commit 3d18e33735a02b1a90aecf14410bf3edbfd4d3dc upstream. We end up reading the interrupt register for HPD5, and then writing it to HPD6 which on systems without anything using HPD5 results in permanently disabling hotplug on one of the display outputs after the first time we acknowledge a hotplug interrupt from the GPU. This code is really bad. But for now, let's just fix this. I will hopefully have a large patch series to refactor all of this soon. Reviewed-by: Christian König Signed-off-by: Lyude Signed-off-by: Alex Deucher Signed-off-by: Greg Kroah-Hartman commit c8d25fcb5980fa7305215c870045dd717b54bab4 Author: Alex Deucher Date: Thu May 11 13:14:14 2017 -0400 drm/radeon/ci: disable mclk switching for high refresh rates (v2) commit 58d7e3e427db1bd68f33025519a9468140280a75 upstream. Even if the vblank period would allow it, it still seems to be problematic on some cards. v2: fix logic inversion (Nils) bug: https://bugs.freedesktop.org/show_bug.cgi?id=96868 Acked-by: Christian König Signed-off-by: Alex Deucher Signed-off-by: Greg Kroah-Hartman commit 9869fb485cc64a2be3dde7952aef10809a631497 Author: Ram Pai Date: Thu Jan 26 16:37:01 2017 -0200 scsi: mpt3sas: Force request partial completion alignment commit f2e767bb5d6ee0d988cb7d4e54b0b21175802b6b upstream. 
The firmware or device, possibly under a heavy I/O load, can return on a partial unaligned boundary. Scsi-ml expects these requests to be completed on an alignment boundary. Scsi-ml blindly requeues the I/O without checking the alignment boundary of the I/O request for the remaining bytes. This leads to errors, since devices cannot perform non-aligned read/write operations. This patch fixes the issue in the driver. It aligns unaligned completions of FS requests, by truncating them to the nearest alignment boundary. [mkp: simplified if statement] Reported-by: Mauricio Faria De Oliveira Signed-off-by: Guilherme G. Piccoli Signed-off-by: Ram Pai Acked-by: Sreekanth Reddy Signed-off-by: Martin K. Petersen Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman commit 21f33b157721172595eb06c711cbf6a9f1a155fd Author: Ming Lei Date: Mon May 22 23:05:04 2017 +0800 nvme: avoid to use blk_mq_abort_requeue_list() commit 986f75c876dbafed98eba7cb516c5118f155db23 upstream. NVMe may add request into requeue list simply and not kick off the requeue if hw queues are stopped. Then blk_mq_abort_requeue_list() is called in both nvme_kill_queues() and nvme_ns_remove() for dealing with this issue. Unfortunately blk_mq_abort_requeue_list() is absolutely a race maker, for example, one request may be requeued during the aborting. So this patch just calls blk_mq_kick_requeue_list() in nvme_kill_queues() to handle this issue like what nvme_start_queues() does. Now all requests in requeue list when queues are stopped will be handled by blk_mq_kick_requeue_list() when queues are restarted, either in nvme_start_queues() or in nvme_kill_queues(). 
Reported-by: Zhang Yi Reviewed-by: Keith Busch Reviewed-by: Johannes Thumshirn Signed-off-by: Ming Lei Signed-off-by: Christoph Hellwig Signed-off-by: Greg Kroah-Hartman commit 510b0ec7f60fd762971286b3246cdd9c37aa41f8 Author: Ming Lei Date: Mon May 22 23:05:03 2017 +0800 nvme: use blk_mq_start_hw_queues() in nvme_kill_queues() commit 806f026f9b901eaf1a6baeb48b5da18d6a4f818e upstream. Inside nvme_kill_queues(), we have to start hw queues for draining requests in sw queues, .dispatch list and requeue list, so use blk_mq_start_hw_queues() instead of blk_mq_start_stopped_hw_queues(), which only runs queues if they are stopped, but the queues may have been started already, for example nvme_start_queues() is called in reset work function. blk_mq_start_hw_queues() runs hw queues in current context, instead of running asynchronously like before. Given nvme_kill_queues() is run from either remove context or reset worker context, both are fine to run hw queue directly. And the mutex of namespaces_mutex isn't a problem either because nvme_start_freeze() runs hw queue in this way already. Reported-by: Zhang Yi Reviewed-by: Keith Busch Reviewed-by: Johannes Thumshirn Signed-off-by: Ming Lei Signed-off-by: Christoph Hellwig Signed-off-by: Greg Kroah-Hartman commit ae057808924227e244492444b8f6e6ea6fc9544d Author: Marta Rybczynska Date: Mon Apr 10 17:12:34 2017 +0200 nvme-rdma: support devices with queue size < 32 commit 0544f5494a03b8846db74e02be5685d1f32b06c9 upstream. In the case of small NVMe-oF queue size (<32) we may enter a deadlock caused by the fact that IB completions are not signaled until 32 sends have been posted, so the send queue will fill up. The error is seen as (using mlx5): [ 2048.693355] mlx5_0:mlx5_ib_post_send:3765:(pid 7273): [ 2048.693360] nvme nvme1: nvme_rdma_post_send failed with error code -12 This patch changes the way the signaling is done so that it depends on the queue depth now. The magic define has been removed completely.
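A minimal sketch of depth-dependent signaling follows. The commit text only says that signaling now depends on the queue depth; the specific rule used here (signal once every queue_size / 2 sends, and on every send when the depth is 1) is an assumption for illustration, and struct send_queue and queue_should_signal() are hypothetical names rather than the driver's own.

```c
#include <stdbool.h>

/* Hypothetical, reduced queue state: only what the decision needs. */
struct send_queue {
    int queue_size;   /* negotiated queue depth */
    int sig_count;    /* sends posted so far */
};

/*
 * Decide whether the next send should request a signaled completion.
 * Assumed rule for this sketch: signal once every queue_size / 2 sends,
 * and on every send when the depth is 1, so that a completion is always
 * requested before the send queue can fill up.
 */
static bool queue_should_signal(struct send_queue *q)
{
    int limit = q->queue_size / 2;

    if (limit < 1)
        limit = 1;
    return (++q->sig_count % limit) == 0;
}
```

With a fixed interval of 32, a queue of depth 8 would never request a completion before filling; with an interval derived from the depth, it requests one every fourth send.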
Signed-off-by: Marta Rybczynska Signed-off-by: Samuel Jones Acked-by: Sagi Grimberg Signed-off-by: Christoph Hellwig Signed-off-by: Greg Kroah-Hartman commit 34808d76dd779ba78aaed8a87c1363106d378aeb Author: Jason Gerecke Date: Tue Apr 25 11:29:56 2017 -0700 HID: wacom: Have wacom_tpc_irq guard against possible NULL dereference commit 2ac97f0f6654da14312d125005c77a6010e0ea38 upstream. The following Smatch complaint was generated in response to commit 2a6cdbd ("HID: wacom: Introduce new 'touch_input' device"): drivers/hid/wacom_wac.c:1586 wacom_tpc_irq() error: we previously assumed 'wacom->touch_input' could be null (see line 1577) The 'touch_input' and 'pen_input' variables point to the 'struct input_dev' used for relaying touch and pen events to userspace, respectively. If a device does not have a touch interface or pen interface, the associated input variable is NULL. The 'wacom_tpc_irq()' function is responsible for forwarding input reports to a more-specific IRQ handler function. An unknown report could theoretically be mistaken as e.g. a touch report on a device which does not have a touch interface. This can be prevented by only calling the pen/touch functions when the pen/touch pointers are valid. Fixes: 2a6cdbd ("HID: wacom: Introduce new 'touch_input' device") Signed-off-by: Jason Gerecke Reviewed-by: Ping Cheng Signed-off-by: Jiri Kosina Signed-off-by: Greg Kroah-Hartman commit 69b1d90e6a0f848e84d49b5771acbfe97ec30150 Author: Bryant G. Ly Date: Wed May 10 14:35:47 2017 -0500 ibmvscsis: Fix the incorrect req_lim_delta commit 75dbf2d36f6b122ad3c1070fe4bf95f71bbff321 upstream. The current code is not correctly calculating the req_lim_delta. We want to make sure vscsi->credit is always incremented when we do not send a response for the scsi op. Thus for the case where there is a successfully aborted task we need to make sure the vscsi->credit is incremented. v2 - Moves the original location of the vscsi->credit increment to a better spot. 
If we increment credit, the next command we send back will have increased req_lim_delta. But we probably shouldn't be doing that until the aborted cmd is actually released. Otherwise the client will think that it can send a new command, and we could find ourselves short of command elements. Not likely, but could happen. This patch depends on both: commit 25e78531268e ("ibmvscsis: Do not send aborted task response") commit 98883f1b5415 ("ibmvscsis: Clear left-over abort_cmd pointers") Signed-off-by: Bryant G. Ly Reviewed-by: Michael Cyr Signed-off-by: Nicholas Bellinger Signed-off-by: Greg Kroah-Hartman commit 80569d0e09add209e80c1354af8433daac506f35 Author: Bryant G. Ly Date: Tue May 9 11:50:26 2017 -0500 ibmvscsis: Clear left-over abort_cmd pointers commit 98883f1b5415ea9dce60d5178877d15f4faa10b8 upstream. With the addition of ibmvscsis->abort_cmd pointer within commit 25e78531268e ("ibmvscsis: Do not send aborted task response"), make sure to explicitly NULL these pointers when clearing DELAY_SEND flag. Do this for two cases, when getting the new ibmvscsis descriptor in ibmvscsis_get_free_cmd() and before posting the response completion in ibmvscsis_send_messages(). Signed-off-by: Bryant G. Ly Reviewed-by: Michael Cyr Signed-off-by: Nicholas Bellinger Signed-off-by: Greg Kroah-Hartman commit 49d33fd10070902fa1cd4a0c5bfb4b41c072af5d Author: Jiang Yi Date: Tue May 16 17:57:55 2017 +0800 iscsi-target: Always wait for kthread_should_stop() before kthread exit commit 5e0cf5e6c43b9e19fc0284f69e5cd2b4a47523b0 upstream. There are three timing problems in the kthread usages of iscsi_target_mod: - np_thread of struct iscsi_np - rx_thread and tx_thread of struct iscsi_conn In iscsit_close_connection(), it calls send_sig(SIGINT, conn->tx_thread, 1); kthread_stop(conn->tx_thread); In conn->tx_thread, which is iscsi_target_tx_thread(), when it receives SIGINT the kthread will exit without checking the return value of kthread_should_stop(). 
So if iscsi_target_tx_thread() exits right between send_sig(SIGINT...) and kthread_stop(...), the kthread_stop() will try to stop an already stopped kthread. This is invalid according to the documentation of kthread_stop(). (Fix -ECONNRESET logout handling in iscsi_target_tx_thread and early iscsi_target_rx_thread failure case - nab) Signed-off-by: Jiang Yi Signed-off-by: Nicholas Bellinger Signed-off-by: Greg Kroah-Hartman commit ecbf0f48d65ecd1eecfe6c81af8b9c2589ab5641 Author: Srinath Mannam Date: Thu May 18 22:27:40 2017 +0530 mmc: sdhci-iproc: suppress spurious interrupt with Multiblock read commit f5f968f2371ccdebb8a365487649673c9af68d09 upstream. The stingray SDHCI hardware supports ACMD12 and automatically issues it after a multi-block transfer completes. If ACMD12 in SDHCI is disabled, spurious tx done interrupts are seen on multi block read command with below error message: Got data interrupt 0x00000002 even though no data operation was in progress. This patch uses SDHCI_QUIRK_MULTIBLOCK_READ_ACMD12 to enable ACMD12 support in SDHCI hardware and suppress spurious interrupt. Signed-off-by: Srinath Mannam Reviewed-by: Ray Jui Reviewed-by: Scott Branden Acked-by: Adrian Hunter Fixes: b580c52d58d9 ("mmc: sdhci-iproc: add IPROC SDHCI driver") Signed-off-by: Ulf Hansson Signed-off-by: Greg Kroah-Hartman commit 8735cf2291cdfaddf517113bf92edb6eaf20a371 Author: Benjamin Tissoires Date: Wed May 10 18:12:40 2017 +0200 Revert "ACPI / button: Change default behavior to lid_init_state=open" commit 878d8db039daac0938238e9a40a5bd6e50ee3c9b upstream. Revert commit 77e9a4aa9de1 (ACPI / button: Change default behavior to lid_init_state=open) which changed the kernel's behavior on laptops that boot with closed lids and expect the lid switch state to be reported accurately by the kernel. 
If you boot or resume your laptop with the lid closed on a docking station while using an external monitor connected to it, both internal and external displays will light on, while only the external should. There is a design choice in gdm to only provide the greeter on the internal display when lit on, so users only see a gray area on the external monitor. Also, the cursor will not show up as it's by default on the internal display too. To "fix" that, users have to open the laptop once and close it once again to sync the state of the switch with the hardware state. Even if the "method" operation mode implementation can be buggy on some platforms, the "open" choice is worse. It breaks docking stations basically and there is no way to have a user-space hwdb to fix that. On the contrary, it's rather easy in user-space to have a hwdb with the problematic platforms. Then, libinput (1.7.0+) can fix the state of the lid switch for us: you need to set the udev property LIBINPUT_ATTR_LID_SWITCH_RELIABILITY to 'write_open'. When libinput detects internal keyboard events, it will overwrite the state of the switch to open, making it reliable again. Given that logind only checks the lid switch value after a timeout, we can assume the user will use the internal keyboard before this timeout expires. For example, such a hwdb entry is: libinput:name:*Lid Switch*:dmi:*svnMicrosoftCorporation:pnSurface3:* LIBINPUT_ATTR_LID_SWITCH_RELIABILITY=write_open Link: https://bugzilla.gnome.org/show_bug.cgi?id=782380 Signed-off-by: Benjamin Tissoires Signed-off-by: Rafael J. Wysocki Signed-off-by: Greg Kroah-Hartman commit 32d8077f1e9bd1c23c478648163d03bf475c55f4 Author: Vishal Verma Date: Fri May 19 11:39:10 2017 +0200 acpi, nfit: Fix the memory error check in nfit_handle_mce() commit fc08a4703a418a398bbb575ac311d36d110ac786 upstream. The check for an MCE being a memory error in the NFIT mce handler was bogus. Use the new mce_is_memory_error() helper to detect the error properly. 
Reported-by: Tony Luck Signed-off-by: Vishal Verma Signed-off-by: Borislav Petkov Link: http://lkml.kernel.org/r/20170519093915.15413-3-bp@alien8.de Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit 68c83a379106115c4c2f33c6cba393d5f0a87a52 Author: Borislav Petkov Date: Fri May 19 11:39:09 2017 +0200 x86/MCE: Export memory_error() commit 2d1f406139ec20320bf38bcd2461aa8e358084b5 upstream. Export the function which checks whether an MCE is a memory error to other users so that we can reuse the logic. Drop the boot_cpu_data use, while at it, as mce.cpuvendor already has the CPU vendor in there. Integrate a piece from a patch from Vishal Verma to export it for modules (nfit). The main reason we're exporting it is that the nfit handler nfit_handle_mce() needs to detect a memory error properly before doing its recovery actions. Signed-off-by: Borislav Petkov Cc: Tony Luck Cc: Vishal Verma Link: http://lkml.kernel.org/r/20170519093915.15413-2-bp@alien8.de Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit 4472887cbd1373d7781bea9d8935f2d4968dd580 Author: Herbert Xu Date: Wed May 10 03:48:23 2017 +0800 crypto: skcipher - Add missing API setkey checks commit 9933e113c2e87a9f46a40fde8dafbf801dca1ab9 upstream. The API setkey checks for key sizes and alignment went AWOL during the skcipher conversion. This patch restores them. Fixes: 4e6c3df4d729 ("crypto: skcipher - Add low-level skcipher...") Reported-by: Baozeng Signed-off-by: Herbert Xu Signed-off-by: Greg Kroah-Hartman commit 63399974effb65d755a94edd10f33cd74e5a1b77 Author: Sebastian Reichel Date: Fri May 5 11:06:50 2017 +0200 i2c: i2c-tiny-usb: fix buffer not being DMA capable commit 5165da5923d6c7df6f2927b0113b2e4d9288661e upstream. Since v4.9 i2c-tiny-usb generates the below call trace and no longer works, since it can't communicate with the USB device. The reason is that since v4.9 the USB stack checks that the buffer it should transfer is DMA capable. 
This was a requirement since v2.2 days, but it usually worked nevertheless. [ 17.504959] ------------[ cut here ]------------ [ 17.505488] WARNING: CPU: 0 PID: 93 at drivers/usb/core/hcd.c:1587 usb_hcd_map_urb_for_dma+0x37c/0x570 [ 17.506545] transfer buffer not dma capable [ 17.507022] Modules linked in: [ 17.507370] CPU: 0 PID: 93 Comm: i2cdetect Not tainted 4.11.0-rc8+ #10 [ 17.508103] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 [ 17.509039] Call Trace: [ 17.509320] ? dump_stack+0x5c/0x78 [ 17.509714] ? __warn+0xbe/0xe0 [ 17.510073] ? warn_slowpath_fmt+0x5a/0x80 [ 17.510532] ? nommu_map_sg+0xb0/0xb0 [ 17.510949] ? usb_hcd_map_urb_for_dma+0x37c/0x570 [ 17.511482] ? usb_hcd_submit_urb+0x336/0xab0 [ 17.511976] ? wait_for_completion_timeout+0x12f/0x1a0 [ 17.512549] ? wait_for_completion_timeout+0x65/0x1a0 [ 17.513125] ? usb_start_wait_urb+0x65/0x160 [ 17.513604] ? usb_control_msg+0xdc/0x130 [ 17.514061] ? usb_xfer+0xa4/0x2a0 [ 17.514445] ? __i2c_transfer+0x108/0x3c0 [ 17.514899] ? i2c_transfer+0x57/0xb0 [ 17.515310] ? i2c_smbus_xfer_emulated+0x12f/0x590 [ 17.515851] ? _raw_spin_unlock_irqrestore+0x11/0x20 [ 17.516408] ? i2c_smbus_xfer+0x125/0x330 [ 17.516876] ? i2c_smbus_xfer+0x125/0x330 [ 17.517329] ? i2cdev_ioctl_smbus+0x1c1/0x2b0 [ 17.517824] ? i2cdev_ioctl+0x75/0x1c0 [ 17.518248] ? do_vfs_ioctl+0x9f/0x600 [ 17.518671] ? vfs_write+0x144/0x190 [ 17.519078] ? SyS_ioctl+0x74/0x80 [ 17.519463] ? entry_SYSCALL_64_fastpath+0x1e/0xad [ 17.519959] ---[ end trace d047c04982f5ac50 ]--- Signed-off-by: Sebastian Reichel Reviewed-by: Greg Kroah-Hartman Acked-by: Till Harbaum Signed-off-by: Wolfram Sang Signed-off-by: Greg Kroah-Hartman commit d3b2d9ca90c2a1e6119c3626f00e62fe23c50f2e Author: Ard Biesheuvel Date: Thu May 18 12:29:55 2017 +0100 drivers/tty: 8250: only call fintek_8250_probe when doing port I/O commit 4c4fc90964b1cf205a67df566cc82ea1731bcb00 upstream. 
Commit fa01e2ca9f53 ("serial: 8250: Integrate Fintek into 8250_base") modified the probing logic for PNP0501 devices, to remove a collision between the generic 16550A driver and the Fintek driver, which reused the same ACPI _HID. The Fintek device probe is now incorporated into the common 8250 probe path, and gets called for all discovered 16550A compatible devices, including ones that are MMIO mapped rather than IO mapped. However, the Fintek driver assumes the port base is a I/O address, and proceeds to probe some arbitrary offsets above it. This is generally a wrong thing to do, but on ARM systems (having no native port I/O), this may result in faulting accesses of completely unrelated MMIO regions in the PCI I/O space. Given that this is at serial probe time, this results in hard to diagnose crashes at boot. So let's restrict the Fintek probe to devices that we know are using port I/O in the first place. Fixes: fa01e2ca9f53 ("serial: 8250: Integrate Fintek into 8250_base") Suggested-by: Arnd Bergmann Reviewed-by: Ricardo Ribalda Signed-off-by: Ard Biesheuvel Signed-off-by: Greg Kroah-Hartman commit 1d74fc36f3eccb0a9c552d0dfcee12f87a456c98 Author: Jeremy Kerr Date: Wed May 24 16:49:59 2017 +1000 powerpc/spufs: Fix hash faults for kernel regions commit d75e4919cc0b6fbcbc8d6654ef66d87a9dbf1526 upstream. Commit ac29c64089b7 ("powerpc/mm: Replace _PAGE_USER with _PAGE_PRIVILEGED") swapped _PAGE_USER for _PAGE_PRIVILEGED, and introduced check_pte_access() which denied kernel access to non-_PAGE_PRIVILEGED pages. However, it didn't add _PAGE_PRIVILEGED to the hash fault handler for spufs' kernel accesses, so the DMAs required to establish SPE memory no longer work. This change adds _PAGE_PRIVILEGED to the hash fault handler for kernel accesses. 
Fixes: ac29c64089b7 ("powerpc/mm: Replace _PAGE_USER with _PAGE_PRIVILEGED") Signed-off-by: Jeremy Kerr Reported-by: Sombat Tragolgosol Reviewed-by: Aneesh Kumar K.V Signed-off-by: Michael Ellerman Signed-off-by: Greg Kroah-Hartman commit 68a056175512222019e9959c2504f7c9ad53bbc0 Author: Richard Narron Date: Sun Jun 4 16:23:18 2017 -0700 fs/ufs: Set UFS default maximum bytes per file commit 239e250e4acbc0104d514307029c0839e834a51a upstream. This fixes a problem with reading files larger than 2GB from a UFS-2 file system: https://bugzilla.kernel.org/show_bug.cgi?id=195721 The incorrect UFS s_maxsize limit became a problem as of commit c2a9737f45e2 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()") which started using s_maxbytes to avoid a page index overflow in do_generic_file_read(). That caused files to be truncated on UFS-2 file systems because the default maximum file size is 2GB (MAX_NON_LFS) and UFS didn't update it. Here I simply increase the default to a common value used by other file systems. Signed-off-by: Richard Narron Cc: Al Viro Cc: Will B Cc: Theodore Ts'o Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman commit 1a658771d5e129c64b34ce22d3018665d631c173 Author: Liam R. Howlett Date: Wed May 17 11:47:00 2017 -0400 sparc/ftrace: Fix ftrace graph time measurement [ Upstream commit 48078d2dac0a26f84f5f3ec704f24f7c832cce14 ] The ftrace function_graph time measurements of a given function is not accurate according to those recorded by ftrace using the function filters. This change pulls the x86_64 fix from 'commit 722b3c746953 ("ftrace/graph: Trace function entry before updating index")' into the sparc specific prepare_ftrace_return which stops ftrace from counting interrupted tasks in the time measurement. Example measurements for select_task_rq_fair running "hackbench 100 process 1000": | tracing/trace_stat/function0 | function_graph Before patch | 2.802 us | 4.255 us After patch | 2.749 us | 3.094 us Signed-off-by: Liam R. 
Howlett Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 45ceb845ef34984357e6c2e0170570cc6f7179a3 Author: Orlando Arias Date: Tue May 16 15:34:00 2017 -0400 sparc: Fix -Wstringop-overflow warning [ Upstream commit deba804c90642c8ed0f15ac1083663976d578f54 ] Greetings, GCC 7 introduced the -Wstringop-overflow flag to detect buffer overflows in calls to string handling functions [1][2]. Due to the way ``empty_zero_page'' is declared in arch/sparc/include/setup.h, this causes a warning to trigger at compile time in the function mem_init(), which is subsequently converted to an error. The ensuing patch fixes this issue and aligns the declaration of empty_zero_page to that of other architectures. Thank you. Cheers, Orlando. [1] https://gcc.gnu.org/ml/gcc-patches/2016-10/msg02308.html [2] https://gcc.gnu.org/gcc-7/changes.html Signed-off-by: Orlando Arias -------------------------------------------------------------------------------- Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit c1133c671a0467a8cc2fc79bdc34c764fb623043 Author: Daniel Borkmann Date: Thu May 25 01:05:07 2017 +0200 bpf: add bpf_clone_redirect to bpf_helper_changes_pkt_data [ Upstream commit 41703a731066fde79c3e5ccf3391cf77a98aeda5 ] The bpf_clone_redirect() still needs to be listed in bpf_helper_changes_pkt_data() since we call into bpf_try_make_head_writable() from there, thus we need to invalidate prior pkt regs as well. Fixes: 36bbef52c7eb ("bpf: direct packet write and access for helpers for clsact progs") Signed-off-by: Daniel Borkmann Acked-by: Alexei Starovoitov Signed-off-by: David S. 
Miller Signed-off-by: Greg Kroah-Hartman commit 988b9792b8569871ccaabc6ec0bc4d70f982be96 Author: Eric Dumazet Date: Thu May 25 14:27:35 2017 -0700 ipv4: add reference counting to metrics [ Upstream commit 3fb07daff8e99243366a081e5129560734de4ada ] Andrey Konovalov reported crashes in ipv4_mtu() I could reproduce the issue with KASAN kernels, between 10.246.7.151 and 10.246.7.152 : 1) 20 concurrent netperf -t TCP_RR -H 10.246.7.152 -l 1000 & 2) At the same time run following loop : while : do ip ro add 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500 ip ro del 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500 done Cong Wang attempted to add back rt->fi in commit 82486aa6f1b9 ("ipv4: restore rt->fi for reference counting") but this proved to add some issues that were complex to solve. Instead, I suggested to add a refcount to the metrics themselves, being a standalone object (in particular, no reference to other objects) I tried to make this patch as small as possible to ease its backport, instead of being super clean. Note that we believe that only ipv4 dst need to take care of the metric refcount. But if this is wrong, this patch adds the basic infrastructure to extend this to other families. Many thanks to Julian Anastasov for reviewing this patch, and Cong Wang for his efforts on this problem. Fixes: 2860583fe840 ("ipv4: Kill rt->fi") Signed-off-by: Eric Dumazet Reported-by: Andrey Konovalov Reviewed-by: Julian Anastasov Acked-by: Cong Wang Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 1de51502a02598a8deaced6ab7575089108ec823 Author: Davide Caratti Date: Thu May 25 19:14:56 2017 +0200 sctp: fix ICMP processing if skb is non-linear [ Upstream commit 804ec7ebe8ea003999ca8d1bfc499edc6a9e07df ] sometimes ICMP replies to INIT chunks are ignored by the client, even if the encapsulated SCTP headers match an open socket. 
This happens when the ICMP packet is carried by a paged skb: use skb_header_pointer() to read packet contents beyond the SCTP header, so that chunk header and initiate tag are validated correctly. v2: - don't use skb_header_pointer() to read the transport header, since icmp_socket_deliver() already puts these 8 bytes in the linear area. - change commit message to make specific reference to INIT chunks. Signed-off-by: Davide Caratti Acked-by: Marcelo Ricardo Leitner Acked-by: Vlad Yasevich Reviewed-by: Xin Long Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 4b81271ed1c3f60b37196b381fd4bca47a20cf0e Author: Wei Wang Date: Wed May 24 09:59:31 2017 -0700 tcp: avoid fastopen API to be used on AF_UNSPEC [ Upstream commit ba615f675281d76fd19aa03558777f81fb6b6084 ] Fastopen API should be used to perform fastopen operations on the TCP socket. It does not make sense to use fastopen API to perform disconnect by calling it with AF_UNSPEC. The fastopen data path is also prone to race conditions and bugs when using with AF_UNSPEC. One issue reported and analyzed by Vegard Nossum is as follows: +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Thread A: Thread B: ------------------------------------------------------------------------ sendto() - tcp_sendmsg() - sk_stream_memory_free() = 0 - goto wait_for_sndbuf - sk_stream_wait_memory() - sk_wait_event() // sleep | sendto(flags=MSG_FASTOPEN, dest_addr=AF_UNSPEC) | - tcp_sendmsg() | - tcp_sendmsg_fastopen() | - __inet_stream_connect() | - tcp_disconnect() //because of AF_UNSPEC | - tcp_transmit_skb()// send RST | - return 0; // no reconnect! | - sk_stream_wait_connect() | - sock_error() | - xchg(&sk->sk_err, 0) | - return -ECONNRESET - ... // wake up, see sk->sk_err == 0 - skb_entail() on TCP_CLOSE socket If the connection is reopened then we will send a brand new SYN packet after thread A has already queued a buffer. At this point I think the socket internal state (sequence numbers etc.) 
becomes messed up. When the new connection is closed, the FIN-ACK is rejected because the sequence number is outside the window. The other side tries to retransmit, but __tcp_retransmit_skb() calls tcp_trim_head() on an empty skb which corrupts the skb data length and hits a BUG() in copy_and_csum_bits(). +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Hence, this patch adds a check for AF_UNSPEC in the fastopen data path and returns EOPNOTSUPP to the user if such a case happens. Fixes: cf60af03ca4e7 ("tcp: Fast Open client - sendmsg(MSG_FASTOPEN)") Reported-by: Vegard Nossum Signed-off-by: Wei Wang Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 9e056584770b021eb82fe378c65c12e5f71f8e68 Author: Vlad Yasevich Date: Tue May 23 13:38:43 2017 -0400 virtio-net: enable TSO/checksum offloads for Q-in-Q vlans [ Upstream commit 2836b4f224d4fd7d1a2b23c3eecaf0f0ae199a74 ] Since virtio does not provide its own ndo_features_check handler, TSO, and now checksum offload, are disabled for stacked vlans. Re-enable the support and let the host take care of it. This restores/improves Guest-to-Guest performance over Q-in-Q vlans. Acked-by: Jason Wang Acked-by: Michael S. Tsirkin Signed-off-by: Vladislav Yasevich Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 9c6cfd5811bd53c4ef1800b768397d38c4790fbf Author: Vlad Yasevich Date: Tue May 23 13:38:42 2017 -0400 be2net: Fix offload features for Q-in-Q packets [ Upstream commit cc6e9de62a7f84c9293a2ea41bc412b55bb46e85 ] At least some of the be2net cards do not seem to be capable of performing checksum offload computations on Q-in-Q packets. In these cases, the received checksum on the remote is invalid and TCP syn packets are dropped. This patch adds a call to check disabled acceleration features on Q-in-Q tagged traffic. CC: Sathya Perla CC: Ajit Khaparde CC: Sriharsha Basavapatna CC: Somnath Kotur Signed-off-by: Vladislav Yasevich Signed-off-by: David S.
Miller Signed-off-by: Greg Kroah-Hartman commit 5f595d5297961c0c641e76f78bb39d1618639d37 Author: Vlad Yasevich Date: Tue May 23 13:38:41 2017 -0400 vlan: Fix tcp checksum offloads in Q-in-Q vlans [ Upstream commit 35d2f80b07bbe03fb358afb0bdeff7437a7d67ff ] It appears that TCP checksum offloading has been broken for Q-in-Q vlans. The behavior was exacerbated by the series commit afb0bc972b52 ("Merge branch 'stacked_vlan_tso'") that enabled acceleration features on stacked vlans. However, even without that series, it is possible to trigger this issue. It just requires a lot more specialized configuration. The root cause is the interaction between how netdev_intersect_features() works, the features actually set on the vlan devices and HW having the ability to run checksum with longer headers. The issue starts when netdev_intersect_features() replaces NETIF_F_HW_CSUM with a combination of NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM, if the HW advertises IP|IPV6 specific checksums. This happens for tagged and multi-tagged packets. However, HW that enables IP|IPV6 checksum offloading doesn't guarantee that packets with arbitrarily long headers can be checksummed. This patch disables IP|IPV6 checksums on the packet for multi-tagged packets. CC: Toshiaki Makita CC: Michal Kubecek Signed-off-by: Vladislav Yasevich Acked-by: Toshiaki Makita Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit cc6773b51bf318fb775106fec51d66906e345fac Author: Andrew Lunn Date: Tue May 23 17:49:13 2017 +0200 net: phy: marvell: Limit errata to 88m1101 [ Upstream commit f2899788353c13891412b273fdff5f02d49aa40f ] The 88m1101 has an erratum when configuring autoneg. However, it was being applied to many other Marvell PHYs as well. Limit its scope to just the 88m1101. Fixes: 76884679c644 ("phylib: Add support for Marvell 88e1111S and 88e1145") Reported-by: Daniel Walker Signed-off-by: Andrew Lunn Acked-by: Harini Katakam Reviewed-by: Florian Fainelli Signed-off-by: David S.
Miller Signed-off-by: Greg Kroah-Hartman commit 4fb5fd27dec09083c5badb5031972a54bc464937 Author: Mohamad Haj Yahia Date: Thu Feb 23 11:19:36 2017 +0200 net/mlx5: Avoid using pending command interface slots [ Upstream commit 73dd3a4839c1d27c36d4dcc92e1ff44225ecbeb7 ] Currently when firmware command gets stuck or it takes long time to complete, the driver command will get timeout and the command slot is freed and can be used for new commands, and if the firmware receive new command on the old busy slot its behavior is unexpected and this could be harmful. To fix this when the driver command gets timeout we return failure, but we don't free the command slot and we wait for the firmware to explicitly respond to that command. Once all the entries are busy we will stop processing new firmware commands. Fixes: 9cba4ebcf374 ('net/mlx5: Fix potential deadlock in command mode change') Signed-off-by: Mohamad Haj Yahia Cc: kernel-team@fb.com Signed-off-by: Saeed Mahameed Signed-off-by: Greg Kroah-Hartman commit 1730a2b9e5b5df516c6c6e90b03ee517d06524ad Author: Jarod Wilson Date: Fri May 19 19:43:45 2017 -0400 bonding: fix accounting of active ports in 3ad [ Upstream commit 751da2a69b7cc82d83dc310ed7606225f2d6e014 ] As of 7bb11dc9f59d and 0622cab0341c, bond slaves in a 3ad bond are not removed from the aggregator when they are down, and the active slave count is NOT equal to number of ports in the aggregator, but rather the number of ports in the aggregator that are still enabled. The sysfs spew for bonding_show_ad_num_ports() has a comment that says "Show number of active 802.3ad ports.", but it's currently showing total number of ports, both active and inactive. Remedy it by using the same logic introduced in 0622cab0341c in __bond_3ad_get_active_agg_info(), so sysfs, procfs and netlink all report the number of active ports. Note that this means that IFLA_BOND_AD_INFO_NUM_PORTS really means NUM_ACTIVE_PORTS instead of NUM_PORTS, and thus perhaps should be renamed for clarity. 
Lightly tested on a dual i40e lacp bond, simulating link downs with an ip link set dev down, was able to produce the state where I could see both in the same aggregator, but a number of ports count of 1. MII Status: up Active Aggregator Info: Aggregator ID: 1 Number of ports: 2 <--- Slave Interface: ens10 MII Status: up <--- Aggregator ID: 1 Slave Interface: ens11 MII Status: up Aggregator ID: 1 MII Status: up Active Aggregator Info: Aggregator ID: 1 Number of ports: 1 <--- Slave Interface: ens10 MII Status: down <--- Aggregator ID: 1 Slave Interface: ens11 MII Status: up Aggregator ID: 1 CC: Jay Vosburgh CC: Veaceslav Falico CC: Andy Gospodarek CC: netdev@vger.kernel.org Signed-off-by: Jarod Wilson Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 304b41014acbdc5fa5126c86bac31dc41a245f9f Author: Eric Dumazet Date: Fri May 19 14:17:48 2017 -0700 ipv6: fix out of bound writes in __ip6_append_data() [ Upstream commit 232cd35d0804cc241eb887bb8d4d9b3b9881c64a ] Andrey Konovalov and idaifish@gmail.com reported crashes caused by one skb shared_info being overwritten from __ip6_append_data() Andrey program lead to following state : copy -4200 datalen 2000 fraglen 2040 maxfraglen 2040 alloclen 2048 transhdrlen 0 offset 0 fraggap 6200 The skb_copy_and_csum_bits(skb_prev, maxfraglen, data + transhdrlen, fraggap, 0); is overwriting skb->head and skb_shared_info Since we apparently detect this rare condition too late, move the code earlier to even avoid allocating skb and risking crashes. Once again, many thanks to Andrey and syzkaller team. Signed-off-by: Eric Dumazet Reported-by: Andrey Konovalov Tested-by: Andrey Konovalov Reported-by: Signed-off-by: David S. 
Miller Signed-off-by: Greg Kroah-Hartman commit ee72e7e5c2b472ff9321c800554d1538fc09b8b4 Author: Xin Long Date: Fri May 19 22:20:29 2017 +0800 bridge: start hello_timer when enabling KERNEL_STP in br_stp_start [ Upstream commit 6d18c732b95c0a9d35e9f978b4438bba15412284 ] Since commit 76b91c32dd86 ("bridge: stp: when using userspace stp stop kernel hello and hold timers"), bridge would not start hello_timer if stp_enabled is not KERNEL_STP when br_dev_open. The problem is even if users set stp_enabled with KERNEL_STP later, the timer will still not be started. It causes that KERNEL_STP can not really work. Users have to re-ifup the bridge to avoid this. This patch is to fix it by starting br->hello_timer when enabling KERNEL_STP in br_stp_start. As an improvement, it's also to start hello_timer again only when br->stp_enabled is KERNEL_STP in br_hello_timer_expired, there is no reason to start the timer again when it's NO_STP. Fixes: 76b91c32dd86 ("bridge: stp: when using userspace stp stop kernel hello and hold timers") Reported-by: Haidong Li Signed-off-by: Xin Long Acked-by: Nikolay Aleksandrov Reviewed-by: Ivan Vecera Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 0d10ebbc835f22cc27750c78f5dfe02651bec31f Author: Bjørn Mork Date: Wed May 17 16:31:41 2017 +0200 qmi_wwan: add another Lenovo EM74xx device ID [ Upstream commit 486181bcb3248e2f1977f4e69387a898234a4e1e ] In their infinite wisdom, and never ending quest for end user frustration, Lenovo has decided to use a new USB device ID for the wwan modules in their 2017 laptops. The actual hardware is still the Sierra Wireless EM7455 or EM7430, depending on region. Signed-off-by: Bjørn Mork Signed-off-by: David S. 
Miller Signed-off-by: Greg Kroah-Hartman commit 2ea4221eb4aec92db6c9a1b579248f52374f8725 Author: Tobias Jungel Date: Wed May 17 09:29:12 2017 +0200 bridge: netlink: check vlan_default_pvid range [ Upstream commit a285860211bf257b0e6d522dac6006794be348af ] Currently it is allowed to set the default pvid of a bridge to a value above VLAN_VID_MASK (0xfff). This patch adds a check to br_validate and returns -EINVAL in case the pvid is out of bounds. Reproduce by calling: [root@test ~]# ip l a type bridge [root@test ~]# ip l a type dummy [root@test ~]# ip l s bridge0 type bridge vlan_filtering 1 [root@test ~]# ip l s bridge0 type bridge vlan_default_pvid 9999 [root@test ~]# ip l s dummy0 master bridge0 [root@test ~]# bridge vlan port vlan ids bridge0 9999 PVID Egress Untagged dummy0 9999 PVID Egress Untagged Fixes: 0f963b7592ef ("bridge: netlink: add support for default_pvid") Acked-by: Nikolay Aleksandrov Signed-off-by: Tobias Jungel Acked-by: Sabrina Dubroca Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 3fa202ef74c83f9c44057b370dccc0981e55d62b Author: David S. Miller Date: Wed May 17 22:54:11 2017 -0400 ipv6: Check ip6_find_1stfragopt() return value properly. [ Upstream commit 7dd7eb9513bd02184d45f000ab69d78cb1fa1531 ] Do not use unsigned variables to see if it returns a negative error or not. Fixes: 2423496af35d ("ipv6: Prevent overrun when parsing v6 header options") Reported-by: Julia Lawall Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit a2c845e51a820549a6df5a1e8907ee754422119e Author: Craig Gallek Date: Tue May 16 14:36:23 2017 -0400 ipv6: Prevent overrun when parsing v6 header options [ Upstream commit 2423496af35d94a87156b063ea5cedffc10a70a1 ] The KASAN warning reported below was discovered with a syzkaller program.
The reproducer is basically: int s = socket(AF_INET6, SOCK_RAW, NEXTHDR_HOP); send(s, &one_byte_of_data, 1, MSG_MORE); send(s, &more_than_mtu_bytes_data, 2000, 0); The socket() call sets the nexthdr field of the v6 header to NEXTHDR_HOP, the first send call primes the payload with a non zero byte of data, and the second send call triggers the fragmentation path. The fragmentation code tries to parse the header options in order to figure out where to insert the fragment option. Since nexthdr points to an invalid option, the calculation of the size of the network header can made to be much larger than the linear section of the skb and data is read outside of it. This fix makes ip6_find_1stfrag return an error if it detects running out-of-bounds. [ 42.361487] ================================================================== [ 42.364412] BUG: KASAN: slab-out-of-bounds in ip6_fragment+0x11c8/0x3730 [ 42.365471] Read of size 840 at addr ffff88000969e798 by task ip6_fragment-oo/3789 [ 42.366469] [ 42.366696] CPU: 1 PID: 3789 Comm: ip6_fragment-oo Not tainted 4.11.0+ #41 [ 42.367628] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014 [ 42.368824] Call Trace: [ 42.369183] dump_stack+0xb3/0x10b [ 42.369664] print_address_description+0x73/0x290 [ 42.370325] kasan_report+0x252/0x370 [ 42.370839] ? ip6_fragment+0x11c8/0x3730 [ 42.371396] check_memory_region+0x13c/0x1a0 [ 42.371978] memcpy+0x23/0x50 [ 42.372395] ip6_fragment+0x11c8/0x3730 [ 42.372920] ? nf_ct_expect_unregister_notifier+0x110/0x110 [ 42.373681] ? ip6_copy_metadata+0x7f0/0x7f0 [ 42.374263] ? ip6_forward+0x2e30/0x2e30 [ 42.374803] ip6_finish_output+0x584/0x990 [ 42.375350] ip6_output+0x1b7/0x690 [ 42.375836] ? ip6_finish_output+0x990/0x990 [ 42.376411] ? ip6_fragment+0x3730/0x3730 [ 42.376968] ip6_local_out+0x95/0x160 [ 42.377471] ip6_send_skb+0xa1/0x330 [ 42.377969] ip6_push_pending_frames+0xb3/0xe0 [ 42.378589] rawv6_sendmsg+0x2051/0x2db0 [ 42.379129] ? 
rawv6_bind+0x8b0/0x8b0 [ 42.379633] ? _copy_from_user+0x84/0xe0 [ 42.380193] ? debug_check_no_locks_freed+0x290/0x290 [ 42.380878] ? ___sys_sendmsg+0x162/0x930 [ 42.381427] ? rcu_read_lock_sched_held+0xa3/0x120 [ 42.382074] ? sock_has_perm+0x1f6/0x290 [ 42.382614] ? ___sys_sendmsg+0x167/0x930 [ 42.383173] ? lock_downgrade+0x660/0x660 [ 42.383727] inet_sendmsg+0x123/0x500 [ 42.384226] ? inet_sendmsg+0x123/0x500 [ 42.384748] ? inet_recvmsg+0x540/0x540 [ 42.385263] sock_sendmsg+0xca/0x110 [ 42.385758] SYSC_sendto+0x217/0x380 [ 42.386249] ? SYSC_connect+0x310/0x310 [ 42.386783] ? __might_fault+0x110/0x1d0 [ 42.387324] ? lock_downgrade+0x660/0x660 [ 42.387880] ? __fget_light+0xa1/0x1f0 [ 42.388403] ? __fdget+0x18/0x20 [ 42.388851] ? sock_common_setsockopt+0x95/0xd0 [ 42.389472] ? SyS_setsockopt+0x17f/0x260 [ 42.390021] ? entry_SYSCALL_64_fastpath+0x5/0xbe [ 42.390650] SyS_sendto+0x40/0x50 [ 42.391103] entry_SYSCALL_64_fastpath+0x1f/0xbe [ 42.391731] RIP: 0033:0x7fbbb711e383 [ 42.392217] RSP: 002b:00007ffff4d34f28 EFLAGS: 00000246 ORIG_RAX: 000000000000002c [ 42.393235] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbbb711e383 [ 42.394195] RDX: 0000000000001000 RSI: 00007ffff4d34f60 RDI: 0000000000000003 [ 42.395145] RBP: 0000000000000046 R08: 00007ffff4d34f40 R09: 0000000000000018 [ 42.396056] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000400aad [ 42.396598] R13: 0000000000000066 R14: 00007ffff4d34ee0 R15: 00007fbbb717af00 [ 42.397257] [ 42.397411] Allocated by task 3789: [ 42.397702] save_stack_trace+0x16/0x20 [ 42.398005] save_stack+0x46/0xd0 [ 42.398267] kasan_kmalloc+0xad/0xe0 [ 42.398548] kasan_slab_alloc+0x12/0x20 [ 42.398848] __kmalloc_node_track_caller+0xcb/0x380 [ 42.399224] __kmalloc_reserve.isra.32+0x41/0xe0 [ 42.399654] __alloc_skb+0xf8/0x580 [ 42.400003] sock_wmalloc+0xab/0xf0 [ 42.400346] __ip6_append_data.isra.41+0x2472/0x33d0 [ 42.400813] ip6_append_data+0x1a8/0x2f0 [ 42.401122] rawv6_sendmsg+0x11ee/0x2db0 [ 42.401505] 
inet_sendmsg+0x123/0x500 [ 42.401860] sock_sendmsg+0xca/0x110 [ 42.402209] ___sys_sendmsg+0x7cb/0x930 [ 42.402582] __sys_sendmsg+0xd9/0x190 [ 42.402941] SyS_sendmsg+0x2d/0x50 [ 42.403273] entry_SYSCALL_64_fastpath+0x1f/0xbe [ 42.403718] [ 42.403871] Freed by task 1794: [ 42.404146] save_stack_trace+0x16/0x20 [ 42.404515] save_stack+0x46/0xd0 [ 42.404827] kasan_slab_free+0x72/0xc0 [ 42.405167] kfree+0xe8/0x2b0 [ 42.405462] skb_free_head+0x74/0xb0 [ 42.405806] skb_release_data+0x30e/0x3a0 [ 42.406198] skb_release_all+0x4a/0x60 [ 42.406563] consume_skb+0x113/0x2e0 [ 42.406910] skb_free_datagram+0x1a/0xe0 [ 42.407288] netlink_recvmsg+0x60d/0xe40 [ 42.407667] sock_recvmsg+0xd7/0x110 [ 42.408022] ___sys_recvmsg+0x25c/0x580 [ 42.408395] __sys_recvmsg+0xd6/0x190 [ 42.408753] SyS_recvmsg+0x2d/0x50 [ 42.409086] entry_SYSCALL_64_fastpath+0x1f/0xbe [ 42.409513] [ 42.409665] The buggy address belongs to the object at ffff88000969e780 [ 42.409665] which belongs to the cache kmalloc-512 of size 512 [ 42.410846] The buggy address is located 24 bytes inside of [ 42.410846] 512-byte region [ffff88000969e780, ffff88000969e980) [ 42.411941] The buggy address belongs to the page: [ 42.412405] page:ffffea000025a780 count:1 mapcount:0 mapping: (null) index:0x0 compound_mapcount: 0 [ 42.413298] flags: 0x100000000008100(slab|head) [ 42.413729] raw: 0100000000008100 0000000000000000 0000000000000000 00000001800c000c [ 42.414387] raw: ffffea00002a9500 0000000900000007 ffff88000c401280 0000000000000000 [ 42.415074] page dumped because: kasan: bad access detected [ 42.415604] [ 42.415757] Memory state around the buggy address: [ 42.416222] ffff88000969e880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 42.416904] ffff88000969e900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 42.417591] >ffff88000969e980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 42.418273] ^ [ 42.418588] ffff88000969ea00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 42.419273] ffff88000969ea80: fb fb fb 
fb fb fb fb fb fb fb fb fb fb fb fb fb [ 42.419882] ================================================================== Reported-by: Andrey Konovalov Signed-off-by: Craig Gallek Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 68647616fd537b3f5dee698da49e2ce8c2c306e8 Author: David Ahern Date: Mon May 15 23:19:17 2017 -0700 net: Improve handling of failures on link and route dumps [ Upstream commit f6c5775ff0bfa62b072face6bf1d40f659f194b2 ] In general, rtnetlink dumps do not anticipate failure to dump a single object (e.g., link or route) on a single pass. As both route and link objects have grown via more attributes, that is no longer a given. netlink dumps can handle a failure if the dump function returns an error; specifically, netlink_dump adds the return code to the response if it is <= 0 so userspace is notified of the failure. The missing piece is the rtnetlink dump functions returning the error. Fix route and link dump functions to return the errors if no object is added to an skb (detected by skb->len != 0). IPv6 route dumps (rt6_dump_route) already return the error; this patch updates IPv4 and link dumps. Other dump functions may need to be adjusted as well. Reported-by: Jan Moskyto Matejka Signed-off-by: David Ahern Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 0174b07408f28669fddb2c1eed3806115f45a758 Author: Soheil Hassas Yeganeh Date: Mon May 15 17:05:47 2017 -0400 tcp: eliminate negative reordering in tcp_clean_rtx_queue [ Upstream commit bafbb9c73241760023d8981191ddd30bb1c6dbac ] tcp_ack() can call tcp_fragment() which may deduct the value tp->fackets_out when MSS changes. When prior_fackets is larger than tp->fackets_out, tcp_clean_rtx_queue() can invoke tcp_update_reordering() with negative values. This results in absurd tp->reordering values higher than sysctl_tcp_max_reordering.
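A minimal sketch of why a negative metric yields an absurdly large reordering value, assuming a signed limit and an unsigned destination field. The names max_reordering, clamp_signed and clamp_checked are hypothetical and the value 300 is only illustrative; this is not the kernel code and clamp_checked is not the change made by this patch.

```c
/* Hypothetical stand-in for the sysctl limit; 300 is an illustrative value. */
static const int max_reordering = 300;

/*
 * Signed min(): a negative metric compares lower than the limit, so it
 * is the value that gets stored.
 */
static inline unsigned int clamp_signed(int metric)
{
	int clamped = metric < max_reordering ? metric : max_reordering;

	/* Converting a negative int to unsigned wraps to a huge value. */
	return (unsigned int)clamped;
}

/*
 * Illustrative guard: ignore negative metrics and keep the current value.
 * This only shows one way to avoid the wrap-around.
 */
static inline unsigned int clamp_checked(int metric, unsigned int current)
{
	if (metric < 0)
		return current;

	return (unsigned int)(metric < max_reordering ? metric : max_reordering);
}
```

With these definitions, clamp_signed(1000) is capped at 300, while clamp_signed(-1) becomes the largest unsigned int and so bypasses the cap entirely.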
Note that tcp_update_reordering indeed sets tp->reordering to min(sysctl_tcp_max_reordering, metric), but because the comparison is signed, a negative metric always wins. Fixes: c7caf8d3ed7a ("[TCP]: Fix reord detection due to snd_una covered holes") Reported-by: Rebecca Isaacs Signed-off-by: Soheil Hassas Yeganeh Signed-off-by: Neal Cardwell Signed-off-by: Yuchung Cheng Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit ac3735bf97f0693f2cae8e835c704be2f2bf0ad5 Author: Gal Pressman Date: Wed Apr 19 14:35:15 2017 +0300 net/mlx5e: Fix ethtool pause support and advertise reporting [ Upstream commit e3c19503712d6360239b19c14cded56dd63c40d7 ] The Pause bit should be set when RX pause is on, not TX pause. Also, setting Asym_Pause is incorrect, and it should be turned off. Fixes: 665bc53969d7 ("net/mlx5e: Use new ethtool get/set link ksettings API") Signed-off-by: Gal Pressman Cc: kernel-team@fb.com Signed-off-by: Saeed Mahameed Signed-off-by: Greg Kroah-Hartman commit 1594973b8e0700f187d367f95810ce3244a57ce4 Author: Gal Pressman Date: Mon Apr 3 15:11:22 2017 +0300 net/mlx5e: Use the correct pause values for ethtool advertising [ Upstream commit b383b544f2666d67446b951a9a97af239dafed5d ] Query the operational pause from firmware (PFCC register) instead of always passing zeros. Fixes: 665bc53969d7 ("net/mlx5e: Use new ethtool get/set link ksettings API") Signed-off-by: Gal Pressman Cc: kernel-team@fb.com Signed-off-by: Saeed Mahameed Signed-off-by: Greg Kroah-Hartman commit f79d3307c0356b63711d53d999158bdad667004b Author: Douglas Caetano dos Santos Date: Fri May 12 15:19:15 2017 -0300 net/packet: fix missing net_device reference release [ Upstream commit d19b183cdc1fa3d70d6abe2a4c369e748cd7ebb8 ] When using a TX ring buffer, if an error occurs processing a control message (e.g. invalid message), the net_device reference is not released.
Fixes c14ac9451c348 ("sock: enable timestamping using control messages") Signed-off-by: Douglas Caetano dos Santos Acked-by: Soheil Hassas Yeganeh Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 5e7d9f0b3f729a64b99e58047f7bb0ff36acb759 Author: Eric Dumazet Date: Wed May 17 07:16:40 2017 -0700 sctp: do not inherit ipv6_{mc|ac|fl}_list from parent [ Upstream commit fdcee2cbb8438702ea1b328fb6e0ac5e9a40c7f8 ] SCTP needs fixes similar to 83eaddab4378 ("ipv6/dccp: do not inherit ipv6_mc_list from parent"), otherwise bad things can happen. Signed-off-by: Eric Dumazet Reported-by: Andrey Konovalov Tested-by: Andrey Konovalov Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit eb7f6d6989adecd24e8ed050513d4b5d3d4c9ece Author: Xin Long Date: Fri May 12 14:39:52 2017 +0800 sctp: fix src address selection if using secondary addresses for ipv6 [ Upstream commit dbc2b5e9a09e9a6664679a667ff81cff6e5f2641 ] Commit 0ca50d12fe46 ("sctp: fix src address selection if using secondary addresses") has fixed a src address selection issue when using secondary addresses for ipv4. Now sctp ipv6 also has the similar issue. When using a secondary address, sctp_v6_get_dst tries to choose the saddr which has the most same bits with the daddr by sctp_v6_addr_match_len. It may make some cases not work as expected. hostA: [1] fd21:356b:459a:cf10::11 (eth1) [2] fd21:356b:459a:cf20::11 (eth2) hostB: [a] fd21:356b:459a:cf30::2 (eth1) [b] fd21:356b:459a:cf40::2 (eth2) route from hostA to hostB: fd21:356b:459a:cf30::/64 dev eth1 metric 1024 mtu 1500 The expected path should be: fd21:356b:459a:cf10::11 <-> fd21:356b:459a:cf30::2 But addr[2] matches addr[a] more bits than addr[1] does, according to sctp_v6_addr_match_len. It causes the path to be: fd21:356b:459a:cf20::11 <-> fd21:356b:459a:cf30::2 This patch is to fix it with the same way as Marcelo's fix for sctp ipv4. 
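The bit-matching effect described above can be checked with a small sketch that counts the leading bits two IPv6 addresses share. The helper common_prefix_bits and the array names are hypothetical; this is a generic longest-common-prefix count under that assumption, not the implementation of sctp_v6_addr_match_len.

```c
#include <stdint.h>

/* Addresses from the example above, written as 16 raw bytes each. */
static const uint8_t hosta_addr1[16] = {	/* fd21:356b:459a:cf10::11 */
	0xfd, 0x21, 0x35, 0x6b, 0x45, 0x9a, 0xcf, 0x10,
	0, 0, 0, 0, 0, 0, 0, 0x11
};
static const uint8_t hosta_addr2[16] = {	/* fd21:356b:459a:cf20::11 */
	0xfd, 0x21, 0x35, 0x6b, 0x45, 0x9a, 0xcf, 0x20,
	0, 0, 0, 0, 0, 0, 0, 0x11
};
static const uint8_t hostb_addra[16] = {	/* fd21:356b:459a:cf30::2 */
	0xfd, 0x21, 0x35, 0x6b, 0x45, 0x9a, 0xcf, 0x30,
	0, 0, 0, 0, 0, 0, 0, 0x02
};

/* Count how many leading bits two 128-bit addresses have in common. */
static inline int common_prefix_bits(const uint8_t a[16], const uint8_t b[16])
{
	int bits = 0;

	for (int i = 0; i < 16; i++) {
		uint8_t diff = a[i] ^ b[i];

		if (diff == 0) {
			bits += 8;	/* whole byte matches */
			continue;
		}
		/* Count matching high-order bits of the first differing byte. */
		while (!(diff & 0x80)) {
			bits++;
			diff <<= 1;
		}
		break;
	}
	return bits;
}
```

For these addresses the count is 58 bits for addr[1] against addr[a] and 59 bits for addr[2] against addr[a], so a pure longest-match rule prefers addr[2] even though the route to hostB goes out through eth1.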
As there is no ip_dev_find for ipv6, this patch uses ipv6_chk_addr to check if the saddr is in a dev instead. Note that for backwards compatibility, it will still do the addr_match_len check here when no optimal one is found. Reported-by: Patrick Talbert Signed-off-by: Xin Long Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 8d625242e86b2e9e05b2cfd82e87aee7b351a0a7 Author: Yuchung Cheng Date: Wed May 10 17:01:27 2017 -0700 tcp: avoid fragmenting peculiar skbs in SACK [ Upstream commit b451e5d24ba6687c6f0e7319c727a709a1846c06 ] This patch fixes a bug in splitting an SKB during SACK processing. Specifically if an skb contains multiple packets and is only partially sacked in the higher sequences, tcp_match_sack_to_skb() splits the skb and marks the second fragment as SACKed. The current code further attempts rounding up the first fragment to MSS boundaries. But it misses a boundary condition when the rounded-up fragment size (pkt_len) is exactly the skb size. Splitting such an skb is pointless and causes a kernel warning and aborts the SACK processing. This patch universally checks for such an over-split before calling tcp_fragment to prevent these unnecessary warnings. Fixes: adb92db857ee ("tcp: Make SACK code to split only at mss boundaries") Signed-off-by: Yuchung Cheng Signed-off-by: Eric Dumazet Signed-off-by: Soheil Hassas Yeganeh Acked-by: Neal Cardwell Signed-off-by: David S.
Miller Signed-off-by: Greg Kroah-Hartman commit a5db124dc2a4819aefb71f4bcb1a825c7818734c Author: Eric Dumazet Date: Tue May 16 13:27:53 2017 -0700 net: fix compile error in skb_orphan_partial() [ Upstream commit 9142e9007f2d7ab58a587a1e1d921b0064a339aa ] If CONFIG_INET is not set, net/core/sock.c can not compile : net/core/sock.c: In function ‘skb_orphan_partial’: net/core/sock.c:1810:2: error: implicit declaration of function ‘skb_is_tcp_pure_ack’ [-Werror=implicit-function-declaration] if (skb_is_tcp_pure_ack(skb)) ^ Fix this by always including Fixes: f6ba8d33cfbb ("netem: fix skb_orphan_partial()") Signed-off-by: Eric Dumazet Reported-by: Paul Gortmaker Reported-by: Randy Dunlap Reported-by: Stephen Rothwell Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 5d165daafc4438b89670b2e77ef8f8df906f6308 Author: Eric Dumazet Date: Thu May 11 15:24:41 2017 -0700 netem: fix skb_orphan_partial() [ Upstream commit f6ba8d33cfbb46df569972e64dbb5bb7e929bfd9 ] I should have known that lowering skb->truesize was dangerous :/ In case packets are not leaving the host via a standard Ethernet device, but looped back to local sockets, bad things can happen, as reported by Michael Madsen ( https://bugzilla.kernel.org/show_bug.cgi?id=195713 ) So instead of tweaking skb->truesize, lets change skb->destructor and keep a reference on the owner socket via its sk_refcnt. Fixes: f2f872f9272a ("netem: Introduce skb_orphan_partial() helper") Signed-off-by: Eric Dumazet Reported-by: Michael Madsen Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 21e3113298f97ec95622c0359be62145ffd055c8 Author: Daniel Borkmann Date: Thu May 11 01:53:15 2017 +0200 bpf, arm64: fix faulty emission of map access in tail calls [ Upstream commit d8b54110ee944de522ccd3531191f39986ec20f9 ] Shubham was recently asking on netdev why in arm64 JIT we don't multiply the index for accessing the tail call map by 8. 
    That led me into testing out arm64 JIT wrt tail calls and it turned
    out I got a NULL pointer dereference on the tail call.

    The buggy access is at:

      prog = array->ptrs[index];
      if (prog == NULL)
          goto out;

      [...]
      00000060:  d2800e0a  mov x10, #0x70 // #112
      00000064:  f86a682a  ldr x10, [x1,x10]
      00000068:  f862694b  ldr x11, [x10,x2]
      0000006c:  b40000ab  cbz x11, 0x00000080
      [...]

    The code triggering the crash is f862694b. x1 at the time contains
    the address of the bpf array, x10 offsetof(struct bpf_array, ptrs).
    Meaning, above we load the pointer to the program at map slot 0 into
    x10. x10 can then be NULL if the slot is not occupied, which we later
    on try to access with a user given offset in x2 that is the map index.

    Fix this by emitting the following instead:

      [...]
      00000060:  d2800e0a  mov x10, #0x70 // #112
      00000064:  8b0a002a  add x10, x1, x10
      00000068:  d37df04b  lsl x11, x2, #3
      0000006c:  f86b694b  ldr x11, [x10,x11]
      00000070:  b40000ab  cbz x11, 0x00000084
      [...]

    This basically adds the offset to ptrs to the base address of the
    bpf array we got and we later on access the map with an index * 8
    offset relative to that. The tail call map itself is basically one
    large area with meta data at the head followed by the array of prog
    pointers. This makes tail calls working again, tested on Cavium
    ThunderX ARMv8.

    Fixes: ddb55992b04d ("arm64: bpf: implement bpf_tail_call() helper")
    Reported-by: Shubham Bansal
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

commit c1f3f197d65018134e65f8f4df62562b2d0a7a65
Author: Ursula Braun
Date:   Wed May 10 19:07:54 2017 +0200

    s390/qeth: add missing hash table initializations

    [ Upstream commit ebccc7397e4a49ff64c8f44a54895de9d32fe742 ]

    commit 5f78e29ceebf ("qeth: optimize IP handling in rx_mode callback")
    added new hash tables, but did not initialize them.

    Fixes: 5f78e29ceebf ("qeth: optimize IP handling in rx_mode callback")
    Signed-off-by: Ursula Braun
    Reviewed-by: Julian Wiedmann
    Signed-off-by: Julian Wiedmann
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

commit 96a81eb6ad5ab062dea32c6e764518effa5891e1
Author: Julian Wiedmann
Date:   Wed May 10 19:07:53 2017 +0200

    s390/qeth: avoid null pointer dereference on OSN

    [ Upstream commit 25e2c341e7818a394da9abc403716278ee646014 ]

    Access card->dev only after checking whether it's valid.

    Signed-off-by: Julian Wiedmann
    Reviewed-by: Ursula Braun
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

commit b68c2e387a232a9fc09ccbed792649481ca226f4
Author: Julian Wiedmann
Date:   Wed May 10 19:07:52 2017 +0200

    s390/qeth: unbreak OSM and OSN support

    [ Upstream commit 2d2ebb3ed0c6acfb014f98e427298673a5d07b82 ]

    commit b4d72c08b358 ("qeth: bridgeport support - basic control")
    broke the support for OSM and OSN devices as follows:

    As OSM and OSN are L2 only, qeth_core_probe_device() does an early
    setup by loading the l2 discipline and calling qeth_l2_probe_device().
    In this context, adding the l2-specific bridgeport sysfs attributes
    via qeth_l2_create_device_attributes() hits a BUG_ON in
    fs/sysfs/group.c, since the basic sysfs infrastructure for the device
    hasn't been established yet.

    Note that OSN actually has its own unique sysfs attributes
    (qeth_osn_devtype), so the additional attributes shouldn't be created
    at all.
    For OSM, add a new qeth_l2_devtype that contains all the common
    and l2-specific sysfs attributes.
    When qeth_core_probe_device() does early setup for OSM or OSN, assign
    the corresponding devtype so that the ccwgroup probe code creates the
    full set of sysfs attributes.
    This allows us to skip qeth_l2_create_device_attributes() in case
    of an early setup.

    Any device that can't do early setup will initially have only the
    generic sysfs attributes, and when it's probed later
    qeth_l2_probe_device() adds the l2-specific attributes.

    If an early-setup device is removed (by calling ccwgroup_ungroup()),
    device_unregister() will - using the devtype - delete the
    l2-specific attributes before qeth_l2_remove_device() is called.
    So make sure to not remove them twice.

    What complicates the issue is that qeth_l2_probe_device() and
    qeth_l2_remove_device() are also called on a device when its
    layer2 attribute changes (i.e. its layer mode is switched).
    For early-setup devices this wouldn't work properly - we wouldn't
    remove the l2-specific attributes when switching to L3.
    But switching the layer mode doesn't actually make any sense;
    we already decided that the device can only operate in L2!
    So just refuse to switch the layer mode on such devices. Note that
    OSN doesn't have a layer2 attribute, so we only need to special-case
    OSM.

    Based on an initial patch by Ursula Braun.

    Fixes: b4d72c08b358 ("qeth: bridgeport support - basic control")
    Signed-off-by: Julian Wiedmann
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

commit 25c1a1e4d891fbe49bf9c1da6ba0cf5e1f21231b
Author: Ursula Braun
Date:   Wed May 10 19:07:51 2017 +0200

    s390/qeth: handle sysfs error during initialization

    [ Upstream commit 9111e7880ccf419548c7b0887df020b08eadb075 ]

    When setting up the device from within the layer discipline's
    probe routine, creating the layer-specific sysfs attributes can fail.
    Report this error back to the caller, and handle it by
    releasing the layer discipline.

    Signed-off-by: Ursula Braun
    [jwi: updated commit msg, moved an OSN change to a subsequent patch]
    Signed-off-by: Julian Wiedmann
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

commit 4bd8f5e38e5a1612ce4373068b518b14d3e38ec8
Author: WANG Cong
Date:   Tue May 9 16:59:54 2017 -0700

    ipv6/dccp: do not inherit ipv6_mc_list from parent

    [ Upstream commit 83eaddab4378db256d00d295bda6ca997cd13a52 ]

    Like commit 657831ffc38e ("dccp/tcp: do not inherit mc_list from parent")
    we should clear ipv6_mc_list etc. for IPv6 sockets too.

    Cc: Eric Dumazet
    Signed-off-by: Cong Wang
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

commit 8f1f08be397469edf297e2686efc5c70190880d0
Author: Gao Feng
Date:   Tue May 9 18:27:33 2017 +0800

    driver: vrf: Fix one possible use-after-free issue

    [ Upstream commit 1a4a5bf52a4adb477adb075e5afce925824ad132 ]

    The current code only deals with the case where the skb is dropped; it
    may hit a use-after-free when NF_HOOK returns 0, which means the skb
    was stolen by a netfilter rule or hook.

    When a netfilter rule or hook steals the skb and returns NF_STOLEN,
    the skb has been taken by the rule, and other modules should not touch
    it again. The skb may have been queued or freed directly by the rule.

    Now use nf_hook instead of NF_HOOK to get the netfilter result, and
    check the return value of nf_hook. Only a return value of 1 means the
    skb may go ahead; otherwise reset the skb to NULL.

    BTW, because vrf_rcv_finish is an empty function, there is no need to
    invoke it even when nf_hook returns 1. But we need to modify
    vrf_rcv_finish to deal with the NF_STOLEN case.

    There are two cases when the skb is stolen.
    1. The skb is stolen and freed directly.
       There is nothing we need to do, and vrf_rcv_finish isn't invoked.
    2. The skb is queued and reinjected again.
       vrf_rcv_finish would be invoked as okfn, so the skb needs to be
       freed in it.

    Signed-off-by: Gao Feng
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

commit 4eed44029507acc666ac7afe9c6a8ea0abf857b7
Author: Eric Dumazet
Date:   Tue May 9 06:29:19 2017 -0700

    dccp/tcp: do not inherit mc_list from parent

    [ Upstream commit 657831ffc38e30092a2d5f03d385d710eb88b09a ]

    syzkaller found a way to trigger double frees from ip_mc_drop_socket()

    It turns out that we leave a copy of the parent mc_list at accept()
    time, which is very bad.

    Very similar to commit 8b485ce69876 ("tcp: do not inherit
    fastopen_req from parent")

    Initial report from Pray3r, completed by Andrey one.
    Thanks a lot to them !

    Signed-off-by: Eric Dumazet
    Reported-by: Pray3r
    Reported-by: Andrey Konovalov
    Tested-by: Andrey Konovalov
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman