
&p&&b&最近研究了一下协同过滤算法,其中主要为基于物品与基于用户的两种推荐算法,真实案例中往往会结合两种算法,来实现整体目标最优。本文首先介绍了User CF和Item CF的基本概念,然后介绍了基于物品和基于用户的协同过滤算法的主要区别。若有不妥的地方,欢迎指出,共同学习。&/b&&/p&&ol&&li&User CF:基于用户的协同过滤算法&/li&&li&Item CF:基于物品的协同过滤算法&/li&&li&User CF vs. Item CF&/li&&ol&&li&计算复杂度&/li&&li&适用场景&/li&&li&推荐多样性和精度&/li&&/ol&&/ol&&h2&&b&User CF:基于用户的协同过滤算法&/b&&/h2&&p&基于用户的 CF 的基本思想相当简单:基于用户对物品的偏好找到相似的邻居用户,然后将邻居用户喜欢的物品推荐给当前用户。计算上,就是&b&将一个用户对所有物品的偏好作为一个向量&/b&来计算用户之间的相似度,找到 K 个邻居后,根据邻居的相似度权重以及他们对物品的偏好,预测当前用户尚未表示偏好的物品,计算得到一个排序的物品列表作为推荐。图 2 给出了一个例子,对于用户 A,根据用户的历史偏好,这里只计算得到一个邻居 – 用户 C,然后将用户 C 喜欢的物品 D 推荐给用户 A。&/p&&h2&&b&Item CF:基于物品的协同过滤算法&/b&&/h2&&p&基于物品的 CF 的原理和基于用户的 CF 类似,只是在计算邻居时采用物品本身,而不是从用户的角度,即基于用户对物品的偏好找到相似的物品,然后根据用户的历史偏好,推荐相似的物品给他。从计算的角度看,就是&b&将所有用户对某个物品的偏好作为一个向量&/b&来计算物品之间的相似度,得到物品的相似物品后,根据用户历史的偏好预测当前用户还没有表示偏好的物品,计算得到一个排序的物品列表作为推荐。图 3 给出了一个例子,对于物品 A,根据所有用户的历史偏好,喜欢物品 A 的用户都喜欢物品 C,得出物品 A 和物品 C 比较相似;而用户 C 喜欢物品 A,那么可以推断出用户 C 可能也喜欢物品 C。&/p&&h2&&b&User CF vs. Item CF&/b&&/h2&&p&&b&计算复杂度&/b&&/p&&p&user-based 的缺点也比较明显:相似用户的稳定性要小于物品相似度的稳定性,所以需要在线更新;而 item-based 通过离线计算相似度,就能获得比较好的复杂度。&/p&&ul&&li&电商网站的商品推荐:Item CF 从性能和复杂度上比 User CF 更优
其中的一个主要原因就是对于一个在线网站,用户的数量往往大大超过物品的数量,同时物品的数据相对稳定,因此计算物品的相似度不但计算量较小,同时也不必频繁更新。&/li&&li&新闻资讯的内容推荐:User CF 从性能和复杂度上比 Item CF 更优
对于新闻、博客或者微内容的推荐系统,情况往往是相反的:内容的数量是海量的,同时也是更新频繁的。所以单从复杂度的角度,这两个算法在不同的系统中各有优势,推荐引擎的设计者需要根据自己应用的特点选择更加合适的算法。&/li&&/ul&&p&&b&适用场景&/b&&/p&&p&user-based 按理说更适合新闻这种 item 量远大于用户量、且更新频繁的场景;item-based 则更适合 item 量较小且相对稳定的场景。&/p&&ul&&li&非社交网络的网站更适合Item CF&/li&&/ul&&p&在非社交网络的网站中,内容内在的联系是很重要的推荐原则,它比基于相似用户的推荐原则更加有效。比如在购书网站上,当你看一本书的时候,推荐引擎会给你推荐相关的书籍,这个推荐的重要性远远超过了网站首页对该用户的综合推荐。可以看到,在这种情况下,Item CF 的推荐成为了引导用户浏览的重要手段。同时 Item CF 便于为推荐做出解释,在一个非社交网络的网站中,给某个用户推荐一本书,同时给出的解释是某某和你有相似兴趣的人也看了这本书,这很难让用户信服,因为用户可能根本不认识那个人;但如果解释说是因为这本书和你以前看的某本书相似,用户可能就觉得合理而采纳了此推荐。&/p&&ul&&li&社交网络的网站更适合User CF&/li&&/ul&&p&相反的,在现今很流行的社交网络站点中,User CF 是一个不错的选择,User CF 加上社会网络信息,可以增加用户对推荐解释的信服程度。&/p&&p&&b&推荐多样性和精度&/b&&/p&&p&研究推荐引擎的学者们在相同的数据集合上分别用 User CF 和 Item CF 计算推荐结果,发现推荐列表中,只有 50% 是一样的,还有 50% 完全不同。但是这两个算法却有相似的精度,所以可以说,这两个算法是很互补的。&/p&&blockquote&从单一用户角度,肯定是 user-based 更加多样,但同时也更容易推荐比较热门的东西。item-based 是根据用户的历史记录来推荐,所以相对来说更容易拟合用户已有的兴趣。item-based 对于新用户(历史记录少于 10 条时)推荐效果更好。item-based 也更容易发现长尾数据,因为只要有一些用户同时买了某个长尾物品和另一个物品,这两个物品之间就会有很高的相关性。&/blockquote&&ul&&li&单个用户:User CF 的多样性 > Item CF 的多样性&/li&&/ul&&p&就是说给定一个用户,查看系统给出的推荐列表是否多样,也就是要比较推荐列表中的物品之间两两的相似度。不难想到,对这种度量方法,Item CF 的多样性显然不如 User CF 的好,因为 Item CF 的推荐就是和以前看的东西最相似的。&/p&&ul&&li&整个系统:Item CF 的多样性 > User CF 的多样性&/li&&/ul&&p&这种情况下的多样性,也被称为覆盖率 (Coverage),它是指一个推荐系统是否能够提供给所有用户丰富的选择。&/p&&p&在这种指标下,Item CF 的多样性要远远好于 User CF,因为 User CF 总是倾向于推荐热门的物品。从另一个侧面看,也就是说,Item CF 的推荐有很好的新颖性,很擅长推荐长尾里的物品。所以,尽管大多数情况下 Item CF 的精度略小于 User CF,但如果考虑多样性,Item CF 却比 User CF 好很多。&/p&&p&&b&本文主要整理自POLL的笔记:&/b&&a href=&///?target=http%3A///102819/& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&协同过滤(CF)算法详解和实现 &i class=&icon-external&&&/i&&/a&,有兴趣的同学可以继续深入阅读。&/p&
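上文反复提到的"把用户(或物品)的偏好作为一个向量来计算相似度",可以用一个极简的纯 Python 示例来说明思路。注意:评分矩阵、物品编号等都是随意假设的演示数据,并非真实案例:

```python
from math import sqrt

# 假设的"用户-物品"评分矩阵:键为用户,值为该用户对各物品的偏好(0 表示未接触)
ratings = {
    "A": {"i1": 5, "i2": 3, "i3": 0, "i4": 0},
    "B": {"i1": 4, "i2": 0, "i3": 4, "i4": 0},
    "C": {"i1": 5, "i2": 3, "i3": 0, "i4": 4},
}

def cosine(u, v):
    """计算两个偏好向量的余弦相似度"""
    dot = sum(u[k] * v[k] for k in u)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# User CF:把"一个用户对所有物品的偏好"作为向量,计算用户之间的相似度
sim_ab = cosine(ratings["A"], ratings["B"])
sim_ac = cosine(ratings["A"], ratings["C"])
# 用户 A 与 C 更相似,于是可以把 C 喜欢而 A 未接触的物品 i4 推荐给 A

# Item CF:把"所有用户对某个物品的偏好"作为向量,计算物品之间的相似度
def item_vector(item):
    return {u: prefs[item] for u, prefs in ratings.items()}

sim_i1_i2 = cosine(item_vector("i1"), item_vector("i2"))
```

真实系统中还需要在相似度之上做加权评分预测与 Top-N 排序,这里只演示"偏好向量 + 相似度"这一步。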
&p&谢邀。我并非搞推荐算法的,只是给你些建议,仅供参考。&/p&&p&你可以去搜索下阿里巴巴的PAI平台,里面有一个阿里对外发布的协同过滤算法Etrec;另外2016年有一个影响比较大的推荐框架 Wide & Deep,供你参考。&/p&&p&&a href=&///?target=https%3A//arxiv.org/abs/& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&[] Wide & Deep Learning for Recommender Systems&i class=&icon-external&&&/i&&/a&&/p&
&figure&&img src=&/50/v2-82cc73ef1f0_b.jpg& data-rawwidth=&637& data-rawheight=&372& class=&origin_image zh-lightbox-thumb& width=&637& data-original=&/50/v2-82cc73ef1f0_r.jpg&&&/figure&&p&推荐算法具有非常多的应用场景和商业价值,因此推荐算法值得好好研究。推荐算法种类很多,但是目前应用最广泛的应该是协同过滤类别的推荐算法,本文就对协同过滤类别的推荐算法做一个概括总结,后续也会对一些典型的协同过滤推荐算法做原理总结。&/p&&h1&1. 推荐算法概述&/h1&&p&推荐算法是非常古老的,在机器学习还没有兴起的时候就有需求和应用了。概括来说,可以分为以下5种:&/p&&p&1)&strong&基于内容的推荐&/strong&:这一类一般依赖于自然语言处理NLP的一些知识,通过挖掘文本的TF-IDF特征向量,来得到用户的偏好,进而做推荐。这类推荐算法可以找到用户独特的小众喜好,而且还有较好的解释性。这一类由于需要NLP的基础,本文就不多讲,在后面专门讲NLP的时候再讨论。&/p&&p&2)&strong&协同过滤推荐&/strong&:本文后面要专门讲的内容。协同过滤是推荐算法中目前最主流的种类,花样繁多,在工业界已经有了很多广泛的应用。它的优点是不需要太多特定领域的知识,可以通过基于统计的机器学习算法来得到较好的推荐效果。最大的优点是工程上容易实现,可以方便应用到产品中。目前绝大多数实际应用的推荐算法都是协同过滤推荐算法。&/p&&p&3)&strong&混合推荐&/strong&:这个类似我们机器学习中的集成学习,博采众长,通过多个推荐算法的结合,得到一个更好的推荐算法,起到三个臭皮匠顶一个诸葛亮的作用。比如通过建立多个推荐算法的模型,最后用投票法决定最终的推荐结果。混合推荐理论上不会比任何单一推荐算法差,但是使用混合推荐,算法复杂度就提高了;在实际应用中有使用,但并不如单一的协同过滤推荐算法(比如逻辑回归之类的二分类推荐算法)广泛。&/p&&p&4)&strong&基于规则的推荐&/strong&:这类算法常见的比如基于最多用户点击,最多用户浏览等,属于大众型的推荐方法,在目前的大数据时代并不主流。&/p&&p&5)&strong&基于人口统计信息的推荐&/strong&:这一类是最简单的推荐算法了,它只是简单地根据系统用户的基本信息发现用户的相关程度,然后进行推荐,目前在大型系统中已经较少使用。&/p&&br&&figure&&img src=&/50/v2-96ab7c82c2_b.png& data-rawwidth=&669& data-rawheight=&161& class=&origin_image zh-lightbox-thumb& width=&669& data-original=&/50/v2-96ab7c82c2_r.png&&&/figure&&br&&h1&2. 
协同过滤推荐概述&/h1&&p&协同过滤(Collaborative Filtering)作为推荐算法中最经典的类型,包括在线的协同和离线的过滤两部分。所谓在线协同,就是通过在线数据找到用户可能喜欢的物品,而离线过滤,则是过滤掉一些不值得推荐的数据,比如推荐值评分低的数据,或者虽然推荐值高但是用户已经购买的数据。&/p&&p&协同过滤的模型一般为 m 个物品、n 个用户的数据,只有部分用户和部分物品之间有评分数据,其它部分评分是空白,此时我们要用已有的部分稀疏数据来预测那些空白的用户和物品之间的评分关系,找到最高评分的物品推荐给用户。&/p&&p&一般来说,协同过滤推荐分为三种类型。第一种是&strong&基于用户(user-based)的协同过滤&/strong&,第二种是&strong&基于项目(item-based)的协同过滤&/strong&,第三种是&strong&基于模型(model based)的协同过滤&/strong&。&/p&&p&基于用户(user-based)的协同过滤主要考虑的是用户和用户之间的相似度,只要找出相似用户喜欢的物品,并预测目标用户对对应物品的评分,就可以找到评分最高的若干个物品推荐给用户。而基于项目(item-based)的协同过滤和基于用户的协同过滤类似,只不过这时我们转向找到物品和物品之间的相似度,只要找到了目标用户对某些物品的评分,那么我们就可以对相似度高的类似物品进行预测,将评分最高的若干个相似物品推荐给用户。比如你在网上买了一本机器学习相关的书,网站马上会推荐一堆机器学习、大数据相关的书给你,这里就明显用到了基于项目的协同过滤思想。&/p&&p&我们可以简单比较下基于用户的协同过滤和基于项目的协同过滤:基于用户的协同过滤需要在线找用户和用户之间的相似度关系,计算复杂度肯定会比基于项目的协同过滤高,但是可以帮助用户找到新类别的有惊喜的物品。而基于项目的协同过滤,由于物品的相似性在一段时间内不会改变,因此可以很容易地离线计算,准确度一般也可以接受,但从推荐的多样性来说,就很难带给用户惊喜了。一般对于小型的推荐系统来说,基于项目的协同过滤肯定是主流;但对于大型的推荐系统来说,则可以考虑基于用户的协同过滤,当然更加可以考虑我们的第三种类型,基于模型的协同过滤。&/p&&p&基于模型(model based)的协同过滤是目前最主流的协同过滤类型了,我们的一大堆机器学习算法也可以在这里找到用武之地。下面我们就重点介绍基于模型的协同过滤。&/p&&h1&3. 
基于模型的协同过滤&/h1&&p&基于模型的协同过滤作为目前最主流的协同过滤类型,其相关算法可以写一本书了,当然我们这里主要是对其思想做一个归类概括。我们的问题是这样的:m 个物品、n 个用户的数据,只有部分用户和部分物品之间有评分数据,其它部分评分是空白,此时我们要用已有的部分稀疏数据来预测那些空白的用户和物品之间的评分关系,找到最高评分的物品推荐给用户。&/p&&p&对于这个问题,用机器学习的思想来建模解决,主流的方法可以分为:用关联算法,聚类算法,分类算法,回归算法,矩阵分解,神经网络,图模型以及隐语义模型来解决。下面我们分别加以介绍。&/p&&h2&3.1 用关联算法做协同过滤&/h2&&p&一般我们可以找出用户购买的所有物品数据里频繁出现的项集或序列,来做频繁集挖掘,找到满足支持度阈值的关联物品的频繁N项集或者序列。如果用户购买了频繁N项集或者序列里的部分物品,那么我们可以将频繁项集或序列里的其他物品按一定的评分准则推荐给用户,这个评分准则可以包括支持度,置信度和提升度等。&/p&&p&常用的关联推荐算法有Apriori,FP Tree和PrefixSpan。如果大家不熟悉这些算法,可以参考我的另外几篇文章:&/p&&p&Apriori算法原理总结:&a href=&/?target=http%3A///pinard/p/6293298.html& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&Apriori算法原理总结 - 刘建平Pinard - 博客园&i class=&icon-external&&&/i&&/a&&/p&&p&FP Tree算法原理总结:&a href=&/?target=http%3A///pinard/p/6307064.html& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&FP Tree算法原理总结 - 刘建平Pinard - 博客园&i class=&icon-external&&&/i&&/a&&/p&&p&PrefixSpan算法原理总结:&a href=&/?target=http%3A///pinard/p/6323182.html& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&PrefixSpan算法原理总结 - 刘建平Pinard - 博客园&i class=&icon-external&&&/i&&/a&&/p&&h2&3.2 用聚类算法做协同过滤&/h2&&p&用聚类算法做协同过滤就和前面的基于用户或者项目的协同过滤有些类似了。我们可以按照用户或者按照物品基于一定的距离度量来进行聚类。如果基于用户聚类,则可以将用户按照一定距离度量方式分成不同的目标人群,将同样目标人群评分高的物品推荐给目标用户。基于物品聚类的话,则是将用户评分高物品的相似同类物品推荐给用户。&/p&&p&常用的聚类推荐算法有K-Means, BIRCH, DBSCAN和谱聚类,如果大家不熟悉这些算法,可以参考我的另外几篇文章:&/p&&p&K-Means聚类算法原理:&a href=&/?target=http%3A///pinard/p/6164214.html& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&K-Means聚类算法原理 - 刘建平Pinard - 博客园&i class=&icon-external&&&/i&&/a&&/p&&p&BIRCH聚类算法原理:&a href=&/?target=http%3A///pinard/p/6179132.html& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&BIRCH聚类算法原理 - 刘建平Pinard - 博客园&i class=&icon-external&&&/i&&/a&&/p&&p&DBSCAN密度聚类算法:&a href=&/?target=http%3A///pinard/p/6208966.html& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&DBSCAN密度聚类算法 - 刘建平Pinard - 博客园&i class=&icon-external&&&/i&&/a&&/p&&p&谱聚类(spectral 
clustering)原理总结:&a href=&/?target=http%3A///pinard/p/6221564.html& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&谱聚类(spectral clustering)原理总结&i class=&icon-external&&&/i&&/a&&br&&/p&&br&&h2&3.3 用分类算法做协同过滤&/h2&&p&如果我们根据用户评分的高低,将分数分成几段的话,则这个问题变成分类问题。比如最直接的,设置一个评分阈值,评分高于阈值的就是推荐,评分低于阈值就是不推荐,我们将问题变成了一个二分类问题。虽然分类问题的算法多如牛毛,但是目前使用最广泛的是逻辑回归。为啥是逻辑回归而不是看起来更加高大上的比如支持向量机呢?因为逻辑回归的解释性比较强,每个物品是否推荐我们都有一个明确的概率放在这,同时可以对数据的特征做工程化,达到调优的目的。目前逻辑回归做协同过滤在BAT等大厂已经非常成熟了。&/p&&p&常见的分类推荐算法有逻辑回归和朴素贝叶斯,两者的特点是解释性很强。如果大家不熟悉这些算法,可以参考我的另外几篇文章:&/p&&p&逻辑回归原理小结:&a href=&/?target=http%3A///pinard/p/6029432.html& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&逻辑回归原理小结 - 刘建平Pinard - 博客园&i class=&icon-external&&&/i&&/a&&/p&&p&朴素贝叶斯算法原理小结:&a href=&/?target=http%3A///pinard/p/6069267.html& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&朴素贝叶斯算法原理小结 - 刘建平Pinard - 博客园&i class=&icon-external&&&/i&&/a&&/p&&br&&h2&3.4 用回归算法做协同过滤&/h2&&p&用回归算法做协同过滤比分类算法看起来更加的自然。我们的评分可以是一个连续的值而不是离散的值,通过回归模型我们可以得到目标用户对某商品的预测打分。&/p&&p&常用的回归推荐算法有Ridge回归,回归树和支持向量回归。如果大家不熟悉这些算法,可以参考我的另外几篇文章:&/p&&p&线性回归原理小结:&a href=&/?target=http%3A///pinard/p/6004041.html& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&线性回归原理小结 - 刘建平Pinard - 博客园&i class=&icon-external&&&/i&&/a&&/p&&p&决策树算法原理(下):&a href=&/?target=http%3A///pinard/p/6053344.html& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&决策树算法原理(下) - 刘建平Pinard - 博客园&i class=&icon-external&&&/i&&/a&&/p&&p&支持向量机原理(五)线性支持回归:&a href=&/?target=http%3A///pinard/p/6113120.html& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&支持向量机原理(五)线性支持回归 - 刘建平Pinard - 博客园&i class=&icon-external&&&/i&&/a&&/p&&br&&h2&3.5 用矩阵分解做协同过滤&/h2&&p&用矩阵分解做协同过滤是目前使用也很广泛的一种方法。由于传统的奇异值分解SVD要求矩阵不能有缺失数据,必须是稠密的,而我们的用户物品评分矩阵是一个很典型的稀疏矩阵,直接将传统的SVD用到协同过滤是比较复杂的。&/p&&p&目前主流的矩阵分解推荐算法主要是SVD的一些变种,比如FunkSVD,BiasSVD和SVD++。这些算法和传统SVD的最大区别是不再要求将矩阵分解为 UΣV^T 的形式,而是分解为两个低秩矩阵的乘积 P^TQ 的形式。对于矩阵分解的推荐算法,后续我会专门开篇来讲。&/p&&br&&h2&3.6 
用神经网络做协同过滤&/h2&&p&用神经网络乃至深度学习做协同过滤应该是以后的一个趋势。目前比较主流的用两层神经网络来做推荐算法的是受限玻尔兹曼机(RBM)。在 Netflix 算法比赛中,RBM 算法的表现很出色。当然如果用深层的神经网络来做协同过滤应该会更好,大厂商用深度学习的方法来做协同过滤也会越来越普遍。后续我会专门开篇来讲讲RBM。&/p&&br&&h2&3.7
用图模型做协同过滤&/h2&&p&用图模型做协同过滤,则将用户之间的相似度放到了一个图模型里面去考虑,常用的算法是SimRank系列算法和马尔科夫模型算法。对于SimRank系列算法,它的基本思想是被相似对象引用的两个对象也具有相似性,算法思想有点类似于大名鼎鼎的PageRank。而马尔科夫模型算法当然是基于马尔科夫链了,它的基本思想是基于传导性来找出普通距离度量算法难以找出的相似性。后续我会专门开篇来讲讲SimRank系列算法。&/p&&br&&h2&3.8 用隐语义模型做协同过滤&/h2&&p&隐语义模型主要是基于NLP的,涉及到对用户行为的语义分析来做评分推荐,主要方法有隐性语义分析LSA和隐含狄利克雷分布LDA,这些等讲NLP的时候再专门讲。&/p&&br&&h1&4. 协同过滤的一些新方向&/h1&&p&当然推荐算法的变革也在进行中,就算是最火爆的基于逻辑回归推荐算法也在面临被取代。哪些算法可能取代逻辑回归之类的传统协同过滤呢?下面是我的理解:&/p&&p&a)&strong& 基于集成学习的方法和混合推荐&/strong&:这一类和前面提到的混合推荐思想是相通的。由于集成学习的成熟,在推荐算法上也有较好的表现。一个可能取代逻辑回归的算法是GBDT。目前GBDT在很多算法比赛中都有好的表现,而且有工业级的并行化实现类库。&/p&&p&b)&strong&基于矩阵分解的方法&/strong&:矩阵分解,由于方法简单,一直受到青睐。目前开始渐渐流行的矩阵分解方法有分解机(Factorization Machine)和张量分解(Tensor Factorization)。&/p&&p&c) &strong&基于深度学习的方法&/strong&:目前两层的神经网络RBM都已经有非常好的推荐算法效果,而随着深度学习和多层神经网络的兴起,以后可能推荐算法就是深度学习的天下了?目前看最火爆的是基于CNN和RNN的推荐算法。&/p&&br&&h1&5. 协同过滤总结 &/h1&&p&协同过滤作为一种经典的推荐算法种类,在工业界应用广泛,它的优点很多:模型通用性强,不需要太多对应数据领域的专业知识,工程实现简单,效果也不错。这些都是它流行的原因。&/p&&p&当然,协同过滤也有些难以避免的难题,比如令人头疼的“冷启动”问题,我们没有新用户任何数据的时候,无法较好地为新用户推荐物品。同时也没有考虑情景的差异,比如用户所在的场景和用户当前的情绪。当然,也无法得到一些小众的独特喜好,这块是基于内容的推荐比较擅长的。&/p&&p&以上就是协同过滤推荐算法的一个总结,希望可以帮大家对推荐算法有一个更深的认识,并预祝大家新年快乐!&/p&&p&出处:&a href=&/?target=http%3A///pinard/p/6349233.html& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&协同过滤推荐算法总结 - 刘建平Pinard&i class=&icon-external&&&/i&&/a&&/p&
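上面 3.5 节说到的"用两个低秩矩阵的乘积去拟合已有评分、再预测空白项"的矩阵分解思想,可以用一个极简的 FunkSVD 随机梯度下降示例来说明。注意:评分数据、学习率、迭代次数等都是随意假设的演示参数,并非某个真实系统的配置:

```python
import random

random.seed(42)

# 假设的稀疏评分数据:(用户编号, 物品编号, 评分),用户 0 未对物品 2 评分
data = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 4.0), (2, 2, 5.0)]
n_users, n_items, k = 3, 3, 2      # k 为隐因子个数

# 随机初始化用户因子矩阵 P 和物品因子矩阵 Q
P = [[random.random() * 0.1 for _ in range(k)] for _ in range(n_users)]
Q = [[random.random() * 0.1 for _ in range(k)] for _ in range(n_items)]

lr, reg = 0.02, 0.01               # 学习率与正则化系数
for _ in range(2000):              # 只在已有评分上迭代,空白项不参与训练
    for u, i, r in data:
        err = r - sum(P[u][f] * Q[i][f] for f in range(k))
        for f in range(k):
            puf = P[u][f]          # 先保存旧值,保证两个更新使用同一轮的因子
            P[u][f] += lr * (err * Q[i][f] - reg * puf)
            Q[i][f] += lr * (err * puf - reg * Q[i][f])

# 训练完成后即可预测空白项,例如用户 0 对物品 2 的评分
pred_02 = sum(P[0][f] * Q[2][f] for f in range(k))
```

真实场景中一般还会加上用户/物品偏置项(即 BiasSVD),或改用交替最小二乘(ALS)来并行求解,但思路与此一致。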
&figure&&img src=&/50/66fd180ad3eeedbe0eee_b.png& data-rawwidth=&529& data-rawheight=&300& class=&origin_image zh-lightbox-thumb& width=&529& data-original=&/50/66fd180ad3eeedbe0eee_r.png&&&/figure&&p&Spark入门学习资源:&a href=&/?target=https%3A///courses/%3Fcourse_type%3Dall%26tag%3DSpark& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&Spark入门系列实验课程&i class=&icon-external&&&/i&&/a&。&br&&/p&&h2&一、Spark简介&/h2&&p&Spark是UC Berkeley AMP lab开发的一个集群计算的框架,类似于Hadoop,但有很多的区别。最大的优化是让计算任务的中间结果可以存储在内存中,不需要每次都写入HDFS,更适用于需要迭代的MapReduce算法场景中,可以获得更好的性能提升。例如一次&a href=&/?target=http%3A///blog//spark-officially-sets-a-new-record-in-large-scale-sorting.html& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&排序测试&i class=&icon-external&&&/i&&/a&中,对100TB数据进行排序,Spark比Hadoop快三倍,并且只需要十分之一的机器。Spark集群目前最大的可以达到8000节点,处理的数据达到PB级别,在互联网企业中应用非常广泛。&/p&&h2&二、Spark理论导读&/h2&&p&学习spark前推荐的理论文章:&/p&&p&2.1 &a href=&/question//answer/& class=&internal&&大数据技术生态介绍&/a&&/p&&p&写的很好的一篇大数据技术生态圈介绍文章,层次条理分明,内容详尽。推荐必读。&/p&&p&2.2 &a href=&/?target=https%3A//cwiki.apache.org/confluence/display/SPARK/Powered%2BBy%2BSpark& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&谁在使用Spark?&i class=&icon-external&&&/i&&/a&&/p&&p&这个页面列举了部分使用Spark的公司和组织,有使用场景的介绍,可做简单了解。&/p&&p&2.3 &a href=&/question//answer/& class=&internal&&Spark与Hadoop对比&/a&&/p&&p&这篇介绍是我看到过最详尽的,讲到很多Spark基本原理和对比Hadoop的优势,推荐必读。&/p&&h1&三、Spark入门实践教程&/h1&&p&有很多想要学习Spark的小伙伴都在自学,实验楼最近整理了一系列的spark入门教程,并提供线上配套的练习环境,希望对Spark学习者有所帮助~&/p&&p&Spark线上实验环境:&br&&/p&&p&&figure&&img src=&/d68818b5eba7f0b807a11b_b.png& data-rawwidth=&1362& data-rawheight=&615& class=&origin_image zh-lightbox-thumb& width=&1362& data-original=&/d68818b5eba7f0b807a11b_r.png&&&/figure& Spark生态圈(图来自&a href=&/?target=http%3A//xiguada.org/spark/& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&这里&i class=&icon-external&&&/i&&/a&):&br&&/p&&p&&figure&&img src=&/853b7dbbb0af6f14c1fe_b.png& data-rawwidth=&589& data-rawheight=&281& 
class=&origin_image zh-lightbox-thumb& width=&589& data-original=&/853b7dbbb0af6f14c1fe_r.png&&&/figure&下面依照上图,对Spark入门系列课程做介绍。&/p&&p&3.1 &a href=&/?target=https%3A///courses/586& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&Spark 讲堂之 SQL 入门&i class=&icon-external&&&/i&&/a&&/p&&p&Spark SQL 是一个分布式查询引擎,在这个教程里你可以学习到 Spark SQL 的基础知识和常用 API 用法,了解常用的数学和统计函数。最后将通过一个分析股票价格与石油价格关系的实例进一步学习如何利用 Spark SQL 分析数据。&/p&&p&3.2 &a href=&/?target=https%3A///courses/571& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&Spark 讲堂之 Streaming 入门&i class=&icon-external&&&/i&&/a&&/p&&p&Spark Streaming 适用于实时处理流式数据。该教程带你学习 Spark Streaming 的工作机制,了解 Streaming 应用的基本结构,以及如何在 Streaming 应用中附加 SQL 查询。&/p&&p&Streaming图:&/p&&br&&figure&&img src=&/dbca5b0c77e2e7d8b36b3dd_b.jpg& data-rawwidth=&580& data-rawheight=&214& class=&origin_image zh-lightbox-thumb& width=&580& data-original=&/dbca5b0c77e2e7d8b36b3dd_r.jpg&&&/figure&&p&3.3 &a href=&/?target=https%3A///courses/600& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&Spark 讲堂之 MLlib 入门&i class=&icon-external&&&/i&&/a&&/p&&p&这个教程你可以了解到 Spark 的 MLlib 库相关知识,掌握 MLlib 的几个基本数据类型,并且可以动手练习如何通过机器学习中的一些算法来推荐电影。&/p&&p&3.4 &a href=&/?target=https%3A///courses/529& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&Spark 讲堂之 GraphX 入门&i class=&icon-external&&&/i&&/a&&/p&&p&GraphX是Spark用于解决图和并行图计算问题的新组件。GraphX通过RDD的扩展,在其中引入了一个新的图抽象,即顶点和边带有特性的有向多重图,提供了一些基本运算符和优化了的Pregel API,来支持图计算。&/p&&p&3.5 &a href=&/?target=https%3A///courses/534& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&Spark 讲堂之 GraphX 图算法&i class=&icon-external&&&/i&&/a&&/p&&p&GraphX包含了一些用于简化图分析任务的的图计算算法。你可以通过图操作符来直接调用其中的方法。这个教程中讲解这些算法的含义,以及如何实现它们。&/p&&p&3.6 &a href=&/?target=https%3A///courses/615& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&Spark 讲堂之 SparkR 入门&i 
class=&icon-external&&&/i&&/a&&/p&&p&SparkR是一个提供轻量级前端的R包,集成了Spark的分布式计算和存储等特性。这个教程将以较为轻松的方式带你学习如何在SparkR中创建和操作DataFrame,如何应用SQL查询和机器学习算法等。&/p&&p&3.7 &a href=&/?target=https%3A///courses/536& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&Spark 讲堂之 DataFrame 入门&i class=&icon-external&&&/i&&/a&&/p&&p&DataFrame让Spark具备了处理大规模结构化数据的能力,比原有的RDD转化方式更加易用、计算性能更好。这个教程通过一个简单的数据集分析任务,讲解DataFrame的由来、构建方式以及一些常用操作。&/p&&p&3.8 &a href=&/?target=https%3A///courses/543& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&Spark 讲堂之 DataFrame 详解&i class=&icon-external&&&/i&&/a&&/p&&p&这个教程通过更加深入的讲解,使用真实的数据集,并结合实际问题分析过程作为引导,旨在让Spark学习者掌握DataFrame的高级操作技巧,如创建DataFrame的两种方式、UDF等。&/p&&p&3.9 &a href=&/?target=https%3A///courses/575& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&Sqoop 数据迁移工具&i class=&icon-external&&&/i&&/a&&/p&&p&Sqoop 是大数据环境中重要的数据转换工具,这个教程对Sqoop 的安装配置进行了详细的讲解,并列举了Sqoop 在数据迁移过程中的基本操作指令。&/p&&p&以上9个教程比较适合有一定的Spark基础的人学习。&/p&&p&3.10 &a href=&/?target=https%3A///courses/456& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&Spark 大数据动手实验&i class=&icon-external&&&/i&&/a&&/p&&p&这个教程是一个系统性的教程,总共15个小节,带你亲身体验Spark大数据分析的魅力,课程中可以实践:&br&Spark,Scala,Python,Spark Streaming,SparkSQL,MLlib,GraphX,IndexedRDD,SparkR,Tachyon,KeystoneML,BlinkDB等技术点,无疑是学习Spark最快的上手教程!&/p&&p&这个教程较为系统,非常适合零基础的人进行学习。&/p&&p&希望以上10个教程可以帮助想入门Spark的人技术更上一层楼。&/p&
&h2&协同过滤&/h2&&blockquote&算法介绍:&/blockquote&&p&协同过滤常被用于推荐系统。这类技术目标在于填充“用户-商品”联系矩阵中的缺失项。Spark.ml目前支持基于模型的协同过滤,其中用户和商品以少量的潜在因子来描述,用以预测缺失项。Spark.ml使用交替最小二乘(ALS)算法来学习这些潜在因子。&/p&&p&*注意基于DataFrame的ALS接口目前仅支持整数型的用户和商品编号。&/p&&p&显式与隐式反馈&/p&&p&基于矩阵分解的协同过滤的标准方法中,“用户-商品”矩阵中的条目是用户给予商品的显式偏好,例如,用户给电影评级。然而在现实世界中使用时,我们常常只能访问隐式反馈(如意见、点击、购买、喜欢以及分享等),在spark.ml中我们使用“隐式反馈数据集的协同过滤”来处理这类数据。本质上来说它不是直接对评分矩阵进行建模,而是将数据当作数值来看待,这些数值代表用户行为的观察值(如点击次数,用户观看一部电影的持续时间)。这些数值被用来衡量用户偏好观察值的置信水平,而不是显式地给商品一个评分。然后,模型用来寻找可以用来预测用户对商品预期偏好的潜在因子。&/p&&p&正则化参数&/p&&p&在求解每个最小二乘问题时,我们会按照用户在更新用户因子时产生的评分数,或商品在更新商品因子时收到的评分数,来缩放正则化参数regParam。这个方法叫做“ALS-WR”,它降低了regParam对数据集规模的依赖,所以我们可以将从部分子集中学习到的最佳参数应用到整个数据集中时获得同样的性能。&/p&&blockquote&参数:&/blockquote&&p&alpha:&/p&&p&类型:双精度型。&/p&&p&含义:隐式偏好中的alpha参数(非负)。&/p&&p&checkpointInterval:&/p&&p&类型:整数型。&/p&&p&含义:设置检查点间隔(≥1),或不设置检查点(-1)。&/p&&p&implicitPrefs:&/p&&p&类型:布尔型。&/p&&p&含义:是否使用隐式偏好(而非显式评分)。&/p&&p&itemCol:&/p&&p&类型:字符串型。&/p&&p&含义:商品编号列名。&/p&&p&maxIter:&/p&&p&类型:整数型。&/p&&p&含义:迭代次数(≥0)。&/p&&p&nonnegative:&/p&&p&类型:布尔型。&/p&&p&含义:是否需要非负约束。&/p&&p&numItemBlocks:&/p&&p&类型:整数型。&/p&&p&含义:商品分块数目(正数)。&/p&&p&numUserBlocks:&/p&&p&类型:整数型。&/p&&p&含义:用户分块数目(正数)。&/p&&p&predictionCol:&/p&&p&类型:字符串型。&/p&&p&含义:预测结果列名。&/p&&p&rank:&/p&&p&类型:整数型。&/p&&p&含义:分解矩阵的秩,即潜在因子个数(正数)。&/p&&p&ratingCol:&/p&&p&类型:字符串型。&/p&&p&含义:评分列名。&/p&&p&regParam:&/p&&p&类型:双精度型。&/p&&p&含义:正则化参数(≥0)。&/p&&p&seed:&/p&&p&类型:长整型。&/p&&p&含义:随机种子。&/p&&p&userCol:&/p&&p&类型:字符串型。&/p&&p&含义:用户编号列名。&/p&&blockquote&调用示例:&/blockquote&&p&下面的例子中,我们从&a href=&/?target=http%3A//grouplens.org/datasets/movielens/& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&MovieLens dataset&i class=&icon-external&&&/i&&/a&读入评分数据,每一行包括用户、电影、评分以及时间戳。我们默认其评分是显式反馈来训练ALS模型。我们通过预测评分的均方根误差来评价推荐模型。如果评分矩阵来自其他信息来源(即由其他信号推断而来),也可将implicitPrefs设置为true来获得更好的结果。&/p&&p&Scala:&/p&&div class=&highlight&&&pre&&code class=&language-scala&&&span&&/span&&span class=&k&&import&/span& &span class=&nn&&org.apache.spark.ml.evaluation.RegressionEvaluator&/span&
&span class=&k&&import&/span& &span class=&nn&&org.apache.spark.ml.recommendation.ALS&/span&
&span class=&k&&case&/span& &span class=&k&&class&/span& &span class=&nc&&Rating&/span&&span class=&o&&(&/span&&span class=&n&&userId&/span&&span class=&k&&:&/span& &span class=&kt&&Int&/span&&span class=&o&&,&/span& &span class=&n&&movieId&/span&&span class=&k&&:&/span& &span class=&kt&&Int&/span&&span class=&o&&,&/span& &span class=&n&&rating&/span&&span class=&k&&:&/span& &span class=&kt&&Float&/span&&span class=&o&&,&/span& &span class=&n&&timestamp&/span&&span class=&k&&:&/span& &span class=&kt&&Long&/span&&span class=&o&&)&/span&
&span class=&k&&def&/span& &span class=&n&&parseRating&/span&&span class=&o&&(&/span&&span class=&n&&str&/span&&span class=&k&&:&/span& &span class=&kt&&String&/span&&span class=&o&&)&/span&&span class=&k&&:&/span& &span class=&kt&&Rating&/span& &span class=&o&&=&/span& &span class=&o&&{&/span&
&span class=&k&&val&/span& &span class=&n&&fields&/span& &span class=&k&&=&/span& &span class=&n&&str&/span&&span class=&o&&.&/span&&span class=&n&&split&/span&&span class=&o&&(&/span&&span class=&s&&&::&&/span&&span class=&o&&)&/span&
&span class=&n&&assert&/span&&span class=&o&&(&/span&&span class=&n&&fields&/span&&span class=&o&&.&/span&&span class=&n&&size&/span& &span class=&o&&==&/span& &span class=&mi&&4&/span&&span class=&o&&)&/span&
&span class=&nc&&Rating&/span&&span class=&o&&(&/span&&span class=&n&&fields&/span&&span class=&o&&(&/span&&span class=&mi&&0&/span&&span class=&o&&).&/span&&span class=&n&&toInt&/span&&span class=&o&&,&/span& &span class=&n&&fields&/span&&span class=&o&&(&/span&&span class=&mi&&1&/span&&span class=&o&&).&/span&&span class=&n&&toInt&/span&&span class=&o&&,&/span& &span class=&n&&fields&/span&&span class=&o&&(&/span&&span class=&mi&&2&/span&&span class=&o&&).&/span&&span class=&n&&toFloat&/span&&span class=&o&&,&/span& &span class=&n&&fields&/span&&span class=&o&&(&/span&&span class=&mi&&3&/span&&span class=&o&&).&/span&&span class=&n&&toLong&/span&&span class=&o&&)&/span&
&span class=&o&&}&/span&
&span class=&k&&val&/span& &span class=&n&&ratings&/span& &span class=&k&&=&/span& &span class=&n&&spark&/span&&span class=&o&&.&/span&&span class=&n&&read&/span&&span class=&o&&.&/span&&span class=&n&&textFile&/span&&span class=&o&&(&/span&&span class=&s&&&data/mllib/als/sample_movielens_ratings.txt&&/span&&span class=&o&&)&/span&
&span class=&o&&.&/span&&span class=&n&&map&/span&&span class=&o&&(&/span&&span class=&n&&parseRating&/span&&span class=&o&&)&/span&
&span class=&o&&.&/span&&span class=&n&&toDF&/span&&span class=&o&&()&/span&
&span class=&k&&val&/span& &span class=&nc&&Array&/span&&span class=&o&&(&/span&&span class=&n&&training&/span&&span class=&o&&,&/span& &span class=&n&&test&/span&&span class=&o&&)&/span& &span class=&k&&=&/span& &span class=&n&&ratings&/span&&span class=&o&&.&/span&&span class=&n&&randomSplit&/span&&span class=&o&&(&/span&&span class=&nc&&Array&/span&&span class=&o&&(&/span&&span class=&mf&&0.8&/span&&span class=&o&&,&/span& &span class=&mf&&0.2&/span&&span class=&o&&))&/span&
&span class=&c1&&// Build the recommendation model using ALS on the training data&/span&
&span class=&k&&val&/span& &span class=&n&&als&/span& &span class=&k&&=&/span& &span class=&k&&new&/span& &span class=&nc&&ALS&/span&&span class=&o&&()&/span&
&span class=&o&&.&/span&&span class=&n&&setMaxIter&/span&&span class=&o&&(&/span&&span class=&mi&&5&/span&&span class=&o&&)&/span&
&span class=&o&&.&/span&&span class=&n&&setRegParam&/span&&span class=&o&&(&/span&&span class=&mf&&0.01&/span&&span class=&o&&)&/span&
&span class=&o&&.&/span&&span class=&n&&setUserCol&/span&&span class=&o&&(&/span&&span class=&s&&&userId&&/span&&span class=&o&&)&/span&
&span class=&o&&.&/span&&span class=&n&&setItemCol&/span&&span class=&o&&(&/span&&span class=&s&&&movieId&&/span&&span class=&o&&)&/span&
&span class=&o&&.&/span&&span class=&n&&setRatingCol&/span&&span class=&o&&(&/span&&span class=&s&&&rating&&/span&&span class=&o&&)&/span&
&span class=&k&&val&/span& &span class=&n&&model&/span& &span class=&k&&=&/span& &span class=&n&&als&/span&&span class=&o&&.&/span&&span class=&n&&fit&/span&&span class=&o&&(&/span&&span class=&n&&training&/span&&span class=&o&&)&/span&
&span class=&c1&&// Evaluate the model by computing the RMSE on the test data&/span&
&span class=&k&&val&/span& &span class=&n&&predictions&/span& &span class=&k&&=&/span& &span class=&n&&model&/span&&span class=&o&&.&/span&&span class=&n&&transform&/span&&span class=&o&&(&/span&&span class=&n&&test&/span&&span class=&o&&)&/span&
&span class=&k&&val&/span& &span class=&n&&evaluator&/span& &span class=&k&&=&/span& &span class=&k&&new&/span& &span class=&nc&&RegressionEvaluator&/span&&span class=&o&&()&/span&
&span class=&o&&.&/span&&span class=&n&&setMetricName&/span&&span class=&o&&(&/span&&span class=&s&&&rmse&&/span&&span class=&o&&)&/span&
&span class=&o&&.&/span&&span class=&n&&setLabelCol&/span&&span class=&o&&(&/span&&span class=&s&&&rating&&/span&&span class=&o&&)&/span&
&span class=&o&&.&/span&&span class=&n&&setPredictionCol&/span&&span class=&o&&(&/span&&span class=&s&&&prediction&&/span&&span class=&o&&)&/span&
&span class=&k&&val&/span& &span class=&n&&rmse&/span& &span class=&k&&=&/span& &span class=&n&&evaluator&/span&&span class=&o&&.&/span&&span class=&n&&evaluate&/span&&span class=&o&&(&/span&&span class=&n&&predictions&/span&&span class=&o&&)&/span&
&span class=&n&&println&/span&&span class=&o&&(&/span&&span class=&s&&s&Root-mean-square error = &/span&&span class=&si&&$rmse&/span&&span class=&s&&&&/span&&span class=&o&&)&/span&
&/code&&/pre&&/div&&p&Java:&/p&&div class=&highlight&&&pre&&code class=&language-java&&&span&&/span&&span class=&kn&&import&/span& &span class=&nn&&java.io.Serializable&/span&&span class=&o&&;&/span&
&span class=&kn&&import&/span& &span class=&nn&&org.apache.spark.api.java.JavaRDD&/span&&span class=&o&&;&/span&
&span class=&kn&&import&/span& &span class=&nn&&org.apache.spark.api.java.function.Function&/span&&span class=&o&&;&/span&
&span class=&kn&&import&/span& &span class=&nn&&org.apache.spark.ml.evaluation.RegressionEvaluator&/span&&span class=&o&&;&/span&
&span class=&kn&&import&/span& &span class=&nn&&org.apache.spark.ml.recommendation.ALS&/span&&span class=&o&&;&/span&
&span class=&kn&&import&/span& &span class=&nn&&org.apache.spark.ml.recommendation.ALSModel&/span&&span class=&o&&;&/span&
&span class=&kd&&public&/span& &span class=&kd&&static&/span& &span class=&kd&&class&/span& &span class=&nc&&Rating&/span& &span class=&kd&&implements&/span& &span class=&n&&Serializable&/span& &span class=&o&&{&/span&
&span class=&kd&&private&/span& &span class=&kt&&int&/span& &span class=&n&&userId&/span&&span class=&o&&;&/span&
&span class=&kd&&private&/span& &span class=&kt&&int&/span& &span class=&n&&movieId&/span&&span class=&o&&;&/span&
&span class=&kd&&private&/span& &span class=&kt&&float&/span& &span class=&n&&rating&/span&&span class=&o&&;&/span&
&span class=&kd&&private&/span& &span class=&kt&&long&/span& &span class=&n&&timestamp&/span&&span class=&o&&;&/span&
&span class=&kd&&public&/span& &span class=&nf&&Rating&/span&&span class=&o&&()&/span& &span class=&o&&{}&/span&
&span class=&kd&&public&/span& &span class=&nf&&Rating&/span&&span class=&o&&(&/span&&span class=&kt&&int&/span& &span class=&n&&userId&/span&&span class=&o&&,&/span& &span class=&kt&&int&/span& &span class=&n&&movieId&/span&&span class=&o&&,&/span& &span class=&kt&&float&/span& &span class=&n&&rating&/span&&span class=&o&&,&/span& &span class=&kt&&long&/span& &span class=&n&&timestamp&/span&&span class=&o&&)&/span& &span class=&o&&{&/span&
&span class=&k&&this&/span&&span class=&o&&.&/span&&span class=&na&&userId&/span& &span class=&o&&=&/span& &span class=&n&&userId&/span&&span class=&o&&;&/span&
&span class=&k&&this&/span&&span class=&o&&.&/span&&span class=&na&&movieId&/span& &span class=&o&&=&/span& &span class=&n&&movieId&/span&&span class=&o&&;&/span&
&span class=&k&&this&/span&&span class=&o&&.&/span&&span class=&na&&rating&/span& &span class=&o&&=&/span& &span class=&n&&rating&/span&&span class=&o&&;&/span&
&span class=&k&&this&/span&&span class=&o&&.&/span&&span class=&na&&timestamp&/span& &span class=&o&&=&/span& &span class=&n&&timestamp&/span&&span class=&o&&;&/span&
&span class=&o&&}&/span&
&span class=&kd&&public&/span& &span class=&kt&&int&/span& &span class=&nf&&getUserId&/span&&span class=&o&&()&/span& &span class=&o&&{&/span&
&span class=&k&&return&/span& &span class=&n&&userId&/span&&span class=&o&&;&/span&
&span class=&o&&}&/span&
&span class=&kd&&public&/span& &span class=&kt&&int&/span& &span class=&nf&&getMovieId&/span&&span class=&o&&()&/span& &span class=&o&&{&/span&
&span class=&k&&return&/span& &span class=&n&&movieId&/span&&span class=&o&&;&/span&
&span class=&o&&}&/span&
&span class=&kd&&public&/span& &span class=&kt&&float&/span& &span class=&nf&&getRating&/span&&span class=&o&&()&/span& &span class=&o&&{&/span&
&span class=&k&&return&/span& &span class=&n&&rating&/span&&span class=&o&&;&/span&
&span class=&o&&}&/span&
&span class=&kd&&public&/span& &span class=&kt&&long&/span& &span class=&nf&&getTimestamp&/span&&span class=&o&&()&/span& &span class=&o&&{&/span&
&span class=&k&&return&/span& &span class=&n&&timestamp&/span&&span class=&o&&;&/span&
&span class=&o&&}&/span&
&span class=&kd&&public&/span& &span class=&kd&&static&/span& &span class=&n&&Rating&/span& &span class=&nf&&parseRating&/span&&span class=&o&&(&/span&&span class=&n&&String&/span& &span class=&n&&str&/span&&span class=&o&&)&/span& &span class=&o&&{&/span&
&span class=&n&&String&/span&&span class=&o&&[]&/span& &span class=&n&&fields&/span& &span class=&o&&=&/span& &span class=&n&&str&/span&&span class=&o&&.&/span&&span class=&na&&split&/span&&span class=&o&&(&/span&&span class=&s&&&::&&/span&&span class=&o&&);&/span&
&span class=&k&&if&/span& &span class=&o&&(&/span&&span class=&n&&fields&/span&&span class=&o&&.&/span&&span class=&na&&length&/span& &span class=&o&&!=&/span& &span class=&mi&&4&/span&&span class=&o&&)&/span& &span class=&o&&{&/span&
&span class=&k&&throw&/span& &span class=&k&&new&/span& &span class=&n&&IllegalArgumentException&/span&&span class=&o&&(&/span&&span class=&s&&&Each line must contain 4 fields&&/span&&span class=&o&&);&/span&
&span class=&o&&}&/span&
&span class=&kt&&int&/span& &span class=&n&&userId&/span& &span class=&o&&=&/span& &span class=&n&&Integer&/span&&span class=&o&&.&/span&&span class=&na&&parseInt&/span&&span class=&o&&(&/span&&span class=&n&&fields&/span&&span class=&o&&[&/span&&span class=&mi&&0&/span&&span class=&o&&]);&/span&
&span class=&kt&&int&/span& &span class=&n&&movieId&/span& &span class=&o&&=&/span& &span class=&n&&Integer&/span&&span class=&o&&.&/span&&span class=&na&&parseInt&/span&&span class=&o&&(&/span&&span class=&n&&fields&/span&&span class=&o&&[&/span&&span class=&mi&&1&/span&&span class=&o&&]);&/span&
&span class=&kt&&float&/span& &span class=&n&&rating&/span& &span class=&o&&=&/span& &span class=&n&&Float&/span&&span class=&o&&.&/span&&span class=&na&&parseFloat&/span&&span class=&o&&(&/span&&span class=&n&&fields&/span&&span class=&o&&[&/span&&span class=&mi&&2&/span&&span class=&o&&]);&/span&
&span class=&kt&&long&/span& &span class=&n&&timestamp&/span& &span class=&o&&=&/span& &span class=&n&&Long&/span&&span class=&o&&.&/span&&span class=&na&&parseLong&/span&&span class=&o&&(&/span&&span class=&n&&fields&/span&&span class=&o&&[&/span&&span class=&mi&&3&/span&&span class=&o&&]);&/span&
&span class=&k&&return&/span& &span class=&k&&new&/span& &span class=&n&&Rating&/span&&span class=&o&&(&/span&&span class=&n&&userId&/span&&span class=&o&&,&/span& &span class=&n&&movieId&/span&&span class=&o&&,&/span& &span class=&n&&rating&/span&&span class=&o&&,&/span& &span class=&n&&timestamp&/span&&span class=&o&&);&/span&
&span class=&o&&}&/span&
&span class=&o&&}&/span&
&span class=&n&&JavaRDD&/span&&span class=&o&&&&/span&&span class=&n&&Rating&/span&&span class=&o&&&&/span& &span class=&n&&ratingsRDD&/span& &span class=&o&&=&/span& &span class=&n&&spark&/span&
&span class=&o&&.&/span&&span class=&na&&read&/span&&span class=&o&&().&/span&&span class=&na&&textFile&/span&&span class=&o&&(&/span&&span class=&s&&&data/mllib/als/sample_movielens_ratings.txt&&/span&&span class=&o&&).&/span&&span class=&na&&javaRDD&/span&&span class=&o&&()&/span&
&span class=&o&&.&/span&&span class=&na&&map&/span&&span class=&o&&(&/span&&span class=&k&&new&/span& &span class=&n&&Function&/span&&span class=&o&&&&/span&&span class=&n&&String&/span&&span class=&o&&,&/span& &span class=&n&&Rating&/span&&span class=&o&&&()&/span& &span class=&o&&{&/span&
&span class=&kd&&public&/span& &span class=&n&&Rating&/span& &span class=&nf&&call&/span&&span class=&o&&(&/span&&span class=&n&&String&/span& &span class=&n&&str&/span&&span class=&o&&)&/span& &span class=&o&&{&/span&
&span class=&k&&return&/span& &span class=&n&&Rating&/span&&span class=&o&&.&/span&&span class=&na&&parseRating&/span&&span class=&o&&(&/span&&span class=&n&&str&/span&&span class=&o&&);&/span&
&span class=&o&&}&/span&
&span class=&o&&});&/span&
&span class=&n&&Dataset&/span&&span class=&o&&&&/span&&span class=&n&&Row&/span&&span class=&o&&&&/span& &span class=&n&&ratings&/span& &span class=&o&&=&/span& &span class=&n&&spark&/span&&span class=&o&&.&/span&&span class=&na&&createDataFrame&/span&&span class=&o&&(&/span&&span class=&n&&ratingsRDD&/span&&span class=&o&&,&/span& &span class=&n&&Rating&/span&&span class=&o&&.&/span&&span class=&na&&class&/span&&span class=&o&&);&/span&
&span class=&n&&Dataset&/span&&span class=&o&&&&/span&&span class=&n&&Row&/span&&span class=&o&&&[]&/span& &span class=&n&&splits&/span& &span class=&o&&=&/span& &span class=&n&&ratings&/span&&span class=&o&&.&/span&&span class=&na&&randomSplit&/span&&span class=&o&&(&/span&&span class=&k&&new&/span& &span class=&kt&&double&/span&&span class=&o&&[]{&/span&&span class=&mf&&0.8&/span&&span class=&o&&,&/span& &span class=&mf&&0.2&/span&&span class=&o&&});&/span&
&span class=&n&&Dataset&/span&&span class=&o&&&&/span&&span class=&n&&Row&/span&&span class=&o&&&&/span& &span class=&n&&training&/span& &span class=&o&&=&/span& &span class=&n&&splits&/span&&span class=&o&&[&/span&&span class=&mi&&0&/span&&span class=&o&&];&/span&
&span class=&n&&Dataset&/span&&span class=&o&&&&/span&&span class=&n&&Row&/span&&span class=&o&&&&/span& &span class=&n&&test&/span& &span class=&o&&=&/span& &span class=&n&&splits&/span&&span class=&o&&[&/span&&span class=&mi&&1&/span&&span class=&o&&];&/span&
&span class=&c1&&// Build the recommendation model using ALS on the training data&/span&
&span class=&n&&ALS&/span& &span class=&n&&als&/span& &span class=&o&&=&/span& &span class=&k&&new&/span& &span class=&n&&ALS&/span&&span class=&o&&()&/span&
&span class=&o&&.&/span&&span class=&na&&setMaxIter&/span&&span class=&o&&(&/span&&span class=&mi&&5&/span&&span class=&o&&)&/span&
&span class=&o&&.&/span&&span class=&na&&setRegParam&/span&&span class=&o&&(&/span&&span class=&mf&&0.01&/span&&span class=&o&&)&/span&
&span class=&o&&.&/span&&span class=&na&&setUserCol&/span&&span class=&o&&(&/span&&span class=&s&&&userId&&/span&&span class=&o&&)&/span&
&span class=&o&&.&/span&&span class=&na&&setItemCol&/span&&span class=&o&&(&/span&&span class=&s&&&movieId&&/span&&span class=&o&&)&/span&
&span class=&o&&.&/span&&span class=&na&&setRatingCol&/span&&span class=&o&&(&/span&&span class=&s&&&rating&&/span&&span class=&o&&);&/span&
&span class=&n&&ALSModel&/span& &span class=&n&&model&/span& &span class=&o&&=&/span& &span class=&n&&als&/span&&span class=&o&&.&/span&&span class=&na&&fit&/span&&span class=&o&&(&/span&&span class=&n&&training&/span&&span class=&o&&);&/span&
&span class=&c1&&// Evaluate the model by computing the RMSE on the test data&/span&
&span class=&n&&Dataset&/span&&span class=&o&&&&/span&&span class=&n&&Row&/span&&span class=&o&&&&/span& &span class=&n&&predictions&/span& &span class=&o&&=&/span& &span class=&n&&model&/span&&span class=&o&&.&/span&&span class=&na&&transform&/span&&span class=&o&&(&/span&&span class=&n&&test&/span&&span class=&o&&);&/span&
&span class=&n&&RegressionEvaluator&/span& &span class=&n&&evaluator&/span& &span class=&o&&=&/span& &span class=&k&&new&/span& &span class=&n&&RegressionEvaluator&/span&&span class=&o&&()&/span&
&span class=&o&&.&/span&&span class=&na&&setMetricName&/span&&span class=&o&&(&/span&&span class=&s&&&rmse&&/span&&span class=&o&&)&/span&
&span class=&o&&.&/span&&span class=&na&&setLabelCol&/span&&span class=&o&&(&/span&&span class=&s&&&rating&&/span&&span class=&o&&)&/span&
&span class=&o&&.&/span&&span class=&na&&setPredictionCol&/span&&span class=&o&&(&/span&&span class=&s&&&prediction&&/span&&span class=&o&&);&/span&
&span class=&n&&Double&/span& &span class=&n&&rmse&/span& &span class=&o&&=&/span& &span class=&n&&evaluator&/span&&span class=&o&&.&/span&&span class=&na&&evaluate&/span&&span class=&o&&(&/span&&span class=&n&&predictions&/span&&span class=&o&&);&/span&
&span class=&n&&System&/span&&span class=&o&&.&/span&&span class=&na&&out&/span&&span class=&o&&.&/span&&span class=&na&&println&/span&&span class=&o&&(&/span&&span class=&s&&&Root-mean-square error = &&/span& &span class=&o&&+&/span& &span class=&n&&rmse&/span&&span class=&o&&);&/span&
&/code&&/pre&&/div&&p&Python:&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&kn&&from&/span& &span class=&nn&&pyspark.ml.evaluation&/span& &span class=&kn&&import&/span& &span class=&n&&RegressionEvaluator&/span&
&span class=&kn&&from&/span& &span class=&nn&&pyspark.ml.recommendation&/span& &span class=&kn&&import&/span& &span class=&n&&ALS&/span&
&span class=&kn&&from&/span& &span class=&nn&&pyspark.sql&/span& &span class=&kn&&import&/span& &span class=&n&&Row&/span&
&span class=&n&&lines&/span& &span class=&o&&=&/span& &span class=&n&&spark&/span&&span class=&o&&.&/span&&span class=&n&&read&/span&&span class=&o&&.&/span&&span class=&n&&text&/span&&span class=&p&&(&/span&&span class=&s2&&&data/mllib/als/sample_movielens_ratings.txt&&/span&&span class=&p&&)&/span&&span class=&o&&.&/span&&span class=&n&&rdd&/span&
&span class=&n&&parts&/span& &span class=&o&&=&/span& &span class=&n&&lines&/span&&span class=&o&&.&/span&&span class=&n&&map&/span&&span class=&p&&(&/span&&span class=&k&&lambda&/span& &span class=&n&&row&/span&&span class=&p&&:&/span& &span class=&n&&row&/span&&span class=&o&&.&/span&&span class=&n&&value&/span&&span class=&o&&.&/span&&span class=&n&&split&/span&&span class=&p&&(&/span&&span class=&s2&&&::&&/span&&span class=&p&&))&/span&
&span class=&n&&ratingsRDD&/span& &span class=&o&&=&/span& &span class=&n&&parts&/span&&span class=&o&&.&/span&&span class=&n&&map&/span&&span class=&p&&(&/span&&span class=&k&&lambda&/span& &span class=&n&&p&/span&&span class=&p&&:&/span& &span class=&n&&Row&/span&&span class=&p&&(&/span&&span class=&n&&userId&/span&&span class=&o&&=&/span&&span class=&nb&&int&/span&&span class=&p&&(&/span&&span class=&n&&p&/span&&span class=&p&&[&/span&&span class=&mi&&0&/span&&span class=&p&&]),&/span& &span class=&n&&movieId&/span&&span class=&o&&=&/span&&span class=&nb&&int&/span&&span class=&p&&(&/span&&span class=&n&&p&/span&&span class=&p&&[&/span&&span class=&mi&&1&/span&&span class=&p&&]),&/span&
&span class=&n&&rating&/span&&span class=&o&&=&/span&&span class=&nb&&float&/span&&span class=&p&&(&/span&&span class=&n&&p&/span&&span class=&p&&[&/span&&span class=&mi&&2&/span&&span class=&p&&]),&/span& &span class=&n&&timestamp&/span&&span class=&o&&=&/span&&span class=&nb&&long&/span&&span class=&p&&(&/span&&span class=&n&&p&/span&&span class=&p&&[&/span&&span class=&mi&&3&/span&&span class=&p&&])))&/span&
&span class=&n&&ratings&/span& &span class=&o&&=&/span& &span class=&n&&spark&/span&&span class=&o&&.&/span&&span class=&n&&createDataFrame&/span&&span class=&p&&(&/span&&span class=&n&&ratingsRDD&/span&&span class=&p&&)&/span&
&span class=&p&&(&/span&&span class=&n&&training&/span&&span class=&p&&,&/span& &span class=&n&&test&/span&&span class=&p&&)&/span& &span class=&o&&=&/span& &span class=&n&&ratings&/span&&span class=&o&&.&/span&&span class=&n&&randomSplit&/span&&span class=&p&&([&/span&&span class=&mf&&0.8&/span&&span class=&p&&,&/span& &span class=&mf&&0.2&/span&&span class=&p&&])&/span&
&span class=&c1&&# Build the recommendation model using ALS on the training data&/span&
&span class=&n&&als&/span& &span class=&o&&=&/span& &span class=&n&&ALS&/span&&span class=&p&&(&/span&&span class=&n&&maxIter&/span&&span class=&o&&=&/span&&span class=&mi&&5&/span&&span class=&p&&,&/span& &span class=&n&&regParam&/span&&span class=&o&&=&/span&&span class=&mf&&0.01&/span&&span class=&p&&,&/span& &span class=&n&&userCol&/span&&span class=&o&&=&/span&&span class=&s2&&&userId&&/span&&span class=&p&&,&/span& &span class=&n&&itemCol&/span&&span class=&o&&=&/span&&span class=&s2&&&movieId&&/span&&span class=&p&&,&/span& &span class=&n&&ratingCol&/span&&span class=&o&&=&/span&&span class=&s2&&&rating&&/span&&span class=&p&&)&/span&
&span class=&n&&model&/span& &span class=&o&&=&/span& &span class=&n&&als&/span&&span class=&o&&.&/span&&span class=&n&&fit&/span&&span class=&p&&(&/span&&span class=&n&&training&/span&&span class=&p&&)&/span&
&span class=&c1&&# Evaluate the model by computing the RMSE on the test data&/span&
&span class=&n&&predictions&/span& &span class=&o&&=&/span& &span class=&n&&model&/span&&span class=&o&&.&/span&&span class=&n&&transform&/span&&span class=&p&&(&/span&&span class=&n&&test&/span&&span class=&p&&)&/span&
&span class=&n&&evaluator&/span& &span class=&o&&=&/span& &span class=&n&&RegressionEvaluator&/span&&span class=&p&&(&/span&&span class=&n&&metricName&/span&&span class=&o&&=&/span&&span class=&s2&&&rmse&&/span&&span class=&p&&,&/span& &span class=&n&&labelCol&/span&&span class=&o&&=&/span&&span class=&s2&&&rating&&/span&&span class=&p&&,&/span&
&span class=&n&&predictionCol&/span&&span class=&o&&=&/span&&span class=&s2&&&prediction&&/span&&span class=&p&&)&/span&
&span class=&n&&rmse&/span& &span class=&o&&=&/span& &span class=&n&&evaluator&/span&&span class=&o&&.&/span&&span class=&n&&evaluate&/span&&span class=&p&&(&/span&&span class=&n&&predictions&/span&&span class=&p&&)&/span&
&span class=&k&&print&/span&&span class=&p&&(&/span&&span class=&s2&&&Root-mean-square error = &&/span& &span class=&o&&+&/span& &span class=&nb&&str&/span&&span class=&p&&(&/span&&span class=&n&&rmse&/span&&span class=&p&&))&/span&
&/code&&/pre&&/div&
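&p&Conceptually, ALS alternates between two regularized least-squares solves: holding the item factors fixed to update the user factors, then the reverse. A minimal dense NumPy sketch of that alternation follows; the rating matrix and hyperparameters are toy values chosen for illustration, not the Spark implementation or the MovieLens sample above:&/p&

```python
import numpy as np

# Toy dense rating matrix (3 users x 3 items); values are illustrative only.
np.random.seed(42)
R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [1.0, 1.0, 5.0]])
k, lam = 2, 0.1                      # latent factors, ridge regularization

U = np.random.rand(3, k)             # user factors
V = np.random.rand(3, k)             # item factors

for _ in range(20):
    # Fix V and solve a regularized least-squares problem for U, then swap roles.
    U = np.linalg.solve(V.T @ V + lam * np.eye(k), V.T @ R.T).T
    V = np.linalg.solve(U.T @ U + lam * np.eye(k), U.T @ R).T

rmse = np.sqrt(np.mean((R - U @ V.T) ** 2))
print("reconstruction RMSE:", rmse)
```

&p&Each update is an ordinary ridge regression, which is one reason ALS parallelizes well: every user's (or item's) factor vector can be solved independently of the others.&/p&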
An introduction to collaborative filtering: collaborative filtering is commonly used in recommender systems. These techniques aim to fill in the missing entries of a user-item association matrix. spark.ml currently supports model-based collaborative filtering, in which users and items are described by a small set of latent factors that can be used to predict the missing entries. spark.ml uses the alternating least squares (ALS) algorithm to learn these latent…
&p&This part covers the basic operations on the two pandas data structures introduced earlier.&/p&&p&Suppose we have data a with the DataFrame structure shown below:&/p&&p&I. Viewing data (these inspection methods work the same way on a Series)&/p&&p&1. Viewing the first or last xx rows of a DataFrame&br&a = pd.DataFrame(data)&br&a.head(6) shows the first 6 rows; called without an argument, head() shows the first 5 rows by default.&br&a.tail(6) shows the last 6 rows; called without an argument, tail() likewise shows the last 5 rows by default.&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&kn&&import&/span& &span class=&nn&&pandas&/span& &span class=&kn&&as&/span& &span class=&nn&&pd&/span&
&span class=&kn&&import&/span& &span class=&nn&&numpy&/span& &span class=&kn&&as&/span& &span class=&nn&&np&/span&
&span class=&n&&a&/span&&span class=&o&&=&/span&&span class=&n&&pd&/span&&span class=&o&&.&/span&&span class=&n&&DataFrame&/span&&span class=&p&&([[&/span&&span class=&mi&&4&/span&&span class=&p&&,&/span&&span class=&mi&&1&/span&&span class=&p&&,&/span&&span class=&mi&&1&/span&&span class=&p&&],[&/span&&span class=&mi&&6&/span&&span class=&p&&,&/span&&span class=&mi&&2&/span&&span class=&p&&,&/span&&span class=&mi&&0&/span&&span class=&p&&],[&/span&&span class=&mi&&6&/span&&span class=&p&&,&/span&&span class=&mi&&1&/span&&span class=&p&&,&/span&&span class=&mi&&6&/span&&span class=&p&&]],&/span&&span class=&n&&index&/span&&span class=&o&&=&/span&&span class=&p&&[&/span&&span class=&s1&&'one'&/span&&span class=&p&&,&/span&&span class=&s1&&'two'&/span&&span class=&p&&,&/span&&span class=&s1&&'three'&/span&&span class=&p&&],&/span&&span class=&n&&columns&/span&&span class=&o&&=&/span&&span class=&p&&[&/span&&span class=&s1&&'a'&/span&&span class=&p&&,&/span&&span class=&s1&&'b'&/span&&span class=&p&&,&/span&&span class=&s1&&'c'&/span&&span class=&p&&])&/span&
&span class=&n&&a&/span&
&/code&&/pre&&/div&&figure&&img src=&/v2-35b71c8729eeb89dfa1de_b.png& data-rawwidth=&100& data-rawheight=&119& class=&content_image& width=&100&&&/figure&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&n&&a&/span&&span class=&o&&.&/span&&span class=&n&&head&/span&&span class=&p&&(&/span&&span class=&mi&&2&/span&&span class=&p&&)&/span&
&/code&&/pre&&/div&&figure&&img src=&/v2-abae2a147cfabad144beb534_b.png& data-rawwidth=&101& data-rawheight=&98& class=&content_image& width=&101&&&/figure&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&n&&a&/span&&span class=&o&&.&/span&&span class=&n&&tail&/span&&span class=&p&&(&/span&&span class=&mi&&2&/span&&span class=&p&&)&/span&
&/code&&/pre&&/div&&figure&&img src=&/v2-40e7f982d40f0e467aab91884d0edf96_b.png& data-rawwidth=&101& data-rawheight=&94& class=&content_image& width=&101&&&/figure&&br&&p&2. Viewing a DataFrame's index, columns, and values&br&Use a.index, a.columns, and a.values respectively.&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&n&&a&/span&&span class=&o&&.&/span&&span class=&n&&index&/span&
&span class=&n&&Index&/span&&span class=&p&&([&/span&&span class=&s1&&u'one'&/span&&span class=&p&&,&/span& &span class=&s1&&u'two'&/span&&span class=&p&&,&/span& &span class=&s1&&u'three'&/span&&span class=&p&&],&/span& &span class=&n&&dtype&/span&&span class=&o&&=&/span&&span class=&s1&&'object'&/span&&span class=&p&&)&/span&
&/code&&/pre&&/div&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&n&&a&/span&&span class=&o&&.&/span&&span class=&n&&columns&/span&
&span class=&n&&Index&/span&&span class=&p&&([&/span&&span class=&s1&&u'a'&/span&&span class=&p&&,&/span& &span class=&s1&&u'b'&/span&&span class=&p&&,&/span& &span class=&s1&&u'c'&/span&&span class=&p&&],&/span& &span class=&n&&dtype&/span&&span class=&o&&=&/span&&span class=&s1&&'object'&/span&&span class=&p&&)&/span&
&/code&&/pre&&/div&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&n&&a&/span&&span class=&o&&.&/span&&span class=&n&&values&/span&
&span class=&n&&array&/span&&span class=&p&&([[&/span&&span class=&mi&&4&/span&&span class=&p&&,&/span& &span class=&mi&&1&/span&&span class=&p&&,&/span& &span class=&mi&&1&/span&&span class=&p&&],&/span&
&span class=&p&&[&/span&&span class=&mi&&6&/span&&span class=&p&&,&/span& &span class=&mi&&2&/span&&span class=&p&&,&/span& &span class=&mi&&0&/span&&span class=&p&&],&/span&
&span class=&p&&[&/span&&span class=&mi&&6&/span&&span class=&p&&,&/span& &span class=&mi&&1&/span&&span class=&p&&,&/span& &span class=&mi&&6&/span&&span class=&p&&]])&/span&
&/code&&/pre&&/div&&br&&p&3. Quick summary statistics with describe()&/p&&p&a.describe() computes summary statistics for each numeric column: count, mean, standard deviation, min, the quartiles, and max. Note the parentheses: a.describe without them merely returns the bound method object instead of calling it.&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&n&&a&/span&&span class=&o&&.&/span&&span class=&n&&describe&/span&&span class=&p&&()&/span&
              a         b          c
count  3.000000  3.000000   3.000000
mean   5.333333  1.333333   2.333333
std    1.154701  0.577350   3.214550
min    4.000000  1.000000   0.000000
25%    5.000000  1.000000   0.500000
50%    6.000000  1.000000   1.000000
75%    6.000000  1.500000   3.500000
max    6.000000  2.000000   6.000000
&/code&&/pre&&/div&&p&4. Transposing the data&/p&&p&a.T&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&a.T
&/code&&/pre&&/div&&figure&&img src=&/v2-258f3b9ed78d85ed09ef6d83c71344f9_b.png& data-rawwidth=&139& data-rawheight=&123& class=&content_image& width=&139&&&/figure&&br&&p&5. Sorting by axis labels&br&a.sort_index(axis=1, ascending=False) sorts the columns (axis=1) by their labels, and the data moves along with them. ascending=False sorts in descending order; if omitted, the default is ascending.&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&a.sort_index(axis=1,ascending=False)
&/code&&/pre&&/div&&figure&&img src=&/v2-d4a1ccaa9d8e76_b.png& data-rawwidth=&107& data-rawheight=&127& class=&content_image& width=&107&&&/figure&&br&&p&6. Sorting a DataFrame by its values&br&a.sort_values(by='x')&br&sorts the rows by the values of column x, from smallest to largest. Note that only column x determines the new order, whereas sorting by axis labels above reorders all the columns. (Older pandas versions spelled this a.sort(columns='x'), which has since been removed.)&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&a.sort_values(by='c')
&/code&&/pre&&/div&&figure&&img src=&/v2-fbc69e6d612_b.png& data-rawwidth=&105& data-rawheight=&121& class=&content_image& width=&105&&&/figure&&br&&p&II. Selecting data&/p&&p&1. Selecting specific columns and rows&br&a['x'] returns the column labeled x; note this form returns one column at a time, and a.x is equivalent to a['x'].&/p&&p&Rows are selected by slicing with []:&br&e.g. a[0:3] returns the first three rows.&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&n&&a&/span&&span class=&p&&[&/span&&span class=&s1&&'a'&/span&&span class=&p&&]&/span&
one      4
two      6
three    6
Name: a, dtype: int64
&/code&&/pre&&/div&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&a[0:2]
&/code&&/pre&&/div&&figure&&img src=&/v2-e6cb1be938dbf2a8e16b1_b.png& data-rawwidth=&99& data-rawheight=&95& class=&content_image& width=&99&&&/figure&&br&&p&2. Selecting by label&br&a.loc['one'] selects the row whose label is 'one';&/p&&p&a.loc[:,['a','b']] selects all rows and the columns labeled a and b;&/p&&p&a.loc[['one','two'],['a','b']] selects the rows 'one' and 'two' and the columns a and b;&/p&&p&a.loc['one','a'] and a.loc[['one'],['a']] select the same element, but the former displays only the value while the latter displays it together with its row and column labels.&/p&&p&3. Selecting by position&br&&/p&&p&This works like selection by label, but with integer positions:&br&a.iloc[1:2,1:2] selects the element at row position 1 and column position 1 (positions are 0-based, and the end of a slice is excluded);&/p&&p&a.iloc[1:2] selects the row at position 1 when the column part is omitted;&/p&&p&a.iloc[[0,2],[1,2]] selects any combination of row positions and column positions.&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&a.iloc[1:2,1:2]
&/code&&/pre&&/div&&figure&&img src=&/v2-2ddaf46e05b787dfcb14f_b.png& data-rawwidth=&62& data-rawheight=&65& class=&content_image& width=&62&&&/figure&&br&&p&4. Selecting with conditions&br&Selecting rows with a single column's values:&br&a[a.c>0] selects the rows where column c is greater than 0.&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&a[a.c>0]
&/code&&/pre&&/div&&figure&&img src=&/v2-e332e9fc3cff1cfc3bf2a0b_b.png& data-rawwidth=&104& data-rawheight=&95& class=&content_image& width=&104&&&/figure&&br&&p&Selecting with where semantics&br&a[a>0] keeps every entry of a that is greater than 0 (non-matching entries become NaN).&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&a[a>0]
&/code&&/pre&&/div&&figure&&img src=&/v2-1e46d0ce87d8980aec320_b.png& data-rawwidth=&124& data-rawheight=&123& class=&content_image& width=&124&&&/figure&&br&&p&Selecting rows whose column contains particular values with isin()&br&a1=a.copy()&br&a1[a1['x'].isin(['2','3'])] shows every row where column x takes one of the values '2' or '3'. (The code below filters on column a containing the value 4.)&/p&&br&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&n&&a1&/span&&span class=&o&&=&/span&&span class=&n&&a&/span&&span class=&o&&.&/span&&span class=&n&&copy&/span&&span class=&p&&()&/span&
&span class=&n&&a1&/span&&span class=&p&&[&/span&&span class=&n&&a1&/span&&span class=&p&&[&/span&&span class=&s1&&'a'&/span&&span class=&p&&]&/span&&span class=&o&&.&/span&&span class=&n&&isin&/span&&span class=&p&&([&/span&&span class=&mi&&4&/span&&span class=&p&&])]&/span&
&/code&&/pre&&/div&&figure&&img src=&/v2-bfff671c7d_b.png& data-rawwidth=&100& data-rawheight=&63& class=&content_image& width=&100&&&/figure&&p&III. Setting values (assignment)&/p&&p&Assignment simply combines the selection operations above with =.&br&For example, a.loc[:,['a','c']]=9 sets every row of columns a and c to 9;&br&a.iloc[:,[0,2]]=9 does the same by position (columns a and c sit at positions 0 and 2).&/p&&p&Conditions can also be used to assign directly:&/p&&p&a[a>0]=-a turns every positive value in a into its negative.&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&a.loc[:,['a','c']]=9
&/code&&/pre&&/div&&figure&&img src=&/v2-deae7ce88ea5cddfc9c906_b.png& data-rawwidth=&102& data-rawheight=&122& class=&content_image& width=&102&&&/figure&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&a.iloc[:,[0,1,2]]=7
&/code&&/pre&&/div&&figure&&img src=&/v2-1e48e704c30_b.png& data-rawwidth=&104& data-rawheight=&121& class=&content_image& width=&104&&&/figure&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&a[a>0]=-a
&/code&&/pre&&/div&&figure&&img src=&/v2-4f422c078dff2c1962385_b.png& data-rawwidth=&119& data-rawheight=&123& class=&content_image& width=&119&&&/figure&&br&&p&IV. Handling missing values&/p&&p&pandas represents missing values as np.nan; by default they are excluded from computations.&/p&&p&1. The reindex() method&br&changes, adds, or removes the labels on a given axis, returning a copy of the original data:&br&a.reindex(index=list(a.index)+['five'], columns=list(a.columns)+['d'])&/p&&p&a.reindex(index=['one','five'], columns=list(a.columns)+['d'])&/p&&p&That is, index=[] operates on the row index and columns=[] on the columns; labels that did not exist in the original are filled with NaN.&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&b=a.reindex(index=list(a.index)+['four'],columns=list(a.columns)+['d'])
c=b.copy()
&/code&&/pre&&/div&&figure&&img src=&/v2-5e47f01e84ced_b.png& data-rawwidth=&207& data-rawheight=&153& class=&content_image& width=&207&&&/figure&&br&&p&2. Filling missing values&br&a.fillna(value=x)&br&fills every missing value with x.&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&b.fillna(value=100)
&/code&&/pre&&/div&&figure&&img src=&/v2-51dc1eb673d_b.png& data-rawwidth=&189& data-rawheight=&150& class=&content_image& width=&189&&&/figure&&br&&p&3. Dropping rows that contain missing values&br&a.dropna(how='any')&br&drops every row that contains at least one missing value.&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&c.dropna(how='any')
&/code&&/pre&&/div&&figure&&img src=&/v2-c20da6d8fcd64d81139da_b.png& data-rawwidth=&90& data-rawheight=&35& class=&content_image& width=&90&&&/figure&&br&&p&V. Combining data&/p&&p&1. concat&br&pd.concat(a1, axis=0/1, keys=['xx','xx',...]), where a1 is the list of objects to concatenate. axis=1 concatenates horizontally (side by side); axis=0, the default, concatenates vertically. Supply one key per object in a1; the keys let you tell the original pieces apart after concatenation.&/p&&p&Example: a1=[b['a'],b['c']]&br&result=pd.concat(a1, axis=1, keys=['1','2'])&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&a1=[b['a'],b['c']]
d=pd.concat(a1,axis=1,keys=['1','2'])
&/code&&/pre&&/div&&figure&&img src=&/v2-a02b93e8e01eaeb1a93ed0_b.png& data-rawwidth=&134& data-rawheight=&153& class=&content_image& width=&134&&&/figure&&br&&p&2. append: appends one or more rows to a DataFrame&br&a.append(a[2:], ignore_index=True)&br&appends everything from the third row of a onward back onto a. Without ignore_index, the appended rows keep their original index labels; with ignore_index=True, all rows are renumbered automatically.&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&a.append(a[2:],ignore_index=True)
&/code&&/pre&&/div&&figure&&img src=&/v2-9cfd1d1cfa2fb2b69c02ae_b.png& data-rawwidth=&96& data-rawheight=&152& class=&content_image& width=&96&&&/figure&&br&&p&3. merge is analogous to a SQL join&br&Let a1 and a2 be two DataFrames sharing a key column. They can be joined in the following ways:&br&(1) inner join: pd.merge(a1, a2, on='key')&br&(2) left join: pd.merge(a1, a2, on='key', how='left')&br&(3) right join: pd.merge(a1, a2, on='key', how='right')&br&(4) outer join: pd.merge(a1, a2, on='key', how='outer')&br&The differences between the four mirror the corresponding SQL join semantics.&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&pd.merge(b,c,on='a')
&/code&&/pre&&/div&&figure&&img src=&/v2-dd7346754_b.png& data-rawwidth=&289& data-rawheight=&330& class=&content_image& width=&289&&&/figure&&br&&p&VI. Grouping (groupby)&/p&&p&pd.date_range generates a run of consecutive dates:&br&pd.date_range(start, periods=10) (the concrete start date used in the original example was lost in export)&/p&&p&Build a sample DataFrame of 10 rows of random data:&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&def shuju():
    data={
        'date':pd.date_range('2017-01-01',periods=10),  # example start date; the original value was lost
        'gender':np.random.randint(0,2,size=10),
        'height':np.random.randint(40,50,size=10),
        'weight':np.random.randint(150,180,size=10)
    }
    a=pd.DataFrame(data)
    print(a)  # 10 rows: date, gender, height, weight (values are random)

shuju()
&/code&&/pre&&/div&&p&a.groupby('gender').sum() groups the rows by gender and sums each numeric column within each group. Note that a.groupby('gender') alone only returns a GroupBy object; you must apply an aggregation such as sum() to obtain a displayable result. (The sample output is omitted here because the data are randomly generated.)&/p&&p&a.groupby('gender').size() counts how many rows fall into each gender group.&/p&&p&So groupby effectively classifies the rows by gender: columns holding numbers are summed automatically, while string-typed columns are dropped from the result. You can also group by several fields at once with groupby(['x1','x2',...]), which behaves analogously.&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&a2=pd.DataFrame({
    'date':pd.date_range('2017-01-01',periods=10),  # example start date
    'gender':np.random.randint(0,2,size=10),
    'height':np.random.randint(40,50,size=10),
    'weight':np.random.randint(150,180,size=10)
})
a2.groupby('gender').sum()
&/code&&/pre&&/div&

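&p&Putting the groupby section together, here is a reproducible end-to-end sketch; the start date is a stand-in, since the original example's date string did not survive the export, and the seed is fixed only so the sketch is repeatable:&/p&

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so the sketch is reproducible
df = pd.DataFrame({
    'date': pd.date_range('2017-01-01', periods=10),  # hypothetical start date
    'gender': np.random.randint(0, 2, size=10),
    'height': np.random.randint(40, 50, size=10),
    'weight': np.random.randint(150, 180, size=10),
})

totals = df.groupby('gender').sum(numeric_only=True)  # numeric columns summed per group
counts = df.groupby('gender').size()                  # row count per group
print(totals)
print(counts)
```

&p&numeric_only=True is spelled out because on recent pandas versions sum() may otherwise raise when it reaches the datetime column; older versions silently dropped it.&/p&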