基于球员和裁判数据进行探索性数据分析实践-大数据ML样本集案例实战

967 阅读2分钟

版权声明:本套技术专栏是作者(秦凯新)平时工作的总结和升华,通过从真实商业环境抽取案例进行总结和分享,并给出商业应用的调优建议和集群环境容量规划等内容,请持续关注本套博客。QQ邮箱地址:1120746959@qq.com,如有任何学术交流,可随时联系。

1 数据简介

  • 数据包含球员和裁判的信息,2012-2013年的比赛数据,总共设计球员2053名,裁判3147名,特征列表如下:

    Variable Name: Variable Description:
    playerShort short player ID
    player player name
    club player club
    leagueCountry country of player club (England, Germany, France, and Spain)
    height player height (in cm)
    weight player weight (in kg)
    position player position
    games number of games in the player-referee dyad
    goals number of goals in the player-referee dyad
    yellowCards number of yellow cards player received from the referee
    yellowReds number of yellow-red cards player received from the referee
    redCards number of red cards player received from the referee
    photoID ID of player photo (if available)
    rater1 skin rating of photo by rater 1
    rater2 skin rating of photo by rater 2
    refNum unique referee ID number (referee name removed for anonymizing purposes)
    refCountry unique referee country ID number
    meanIAT mean implicit bias score (using the race IAT) for referee country
    nIAT sample size for race IAT in that particular country
    seIAT standard error for mean estimate of race IAT
    meanExp mean explicit bias score (using a racial thermometer task) for referee country
    nExp sample size for explicit bias in that particular country
    seExp standard error for mean estimate of explicit bias measure

2 数据预处理

  • 数据基本特征挖掘

      # Uncomment one of the following lines and run the cell:
      df = pd.read_csv("redcard.csv.gz", compression='gzip')
      df.shape
      (146028, 28)
      
      df.head()
    

     df.describe().T 

  • df.dtypes

      playerShort       object
      player            object
      club              object
      leagueCountry     object
      birthday          object
      height           float64
      weight           float64
      position          object
      games              int64
      victories          int64
      ties               int64
      defeats            int64
      goals              int64
      yellowCards        int64
      yellowReds         int64
      redCards           int64
      photoID           object
      rater1           float64
      rater2           float64
      refNum             int64
      refCountry         int64
      Alpha_3           object
      meanIAT          float64
      nIAT             float64
      seIAT            float64
      meanExp          float64
      nExp             float64
      seExp            float64
      dtype: object
    
  • all_columns = df.columns.tolist()

     all_columns
    
     ['playerShort',
      'player',
      'club',
      'leagueCountry',
      'birthday',
      'height',
      'weight',
      'position',
      'games',
      'victories',
      'ties',
      'defeats',
      'goals',
      'yellowCards',
      'yellowReds',
      'redCards',
      'photoID',
      'rater1',
      'rater2',
      'refNum',
      'refCountry',
      'Alpha_3',
      'meanIAT',
      'nIAT',
      'seIAT',
      'meanExp',
      'nExp',
      'seExp']
    
  • df['height'].mean()

     181.93593798236887
    
  • df['height'].mean()

     181.93593798236887
    
  • np.mean(df.groupby('playerShort').height.mean())

      181.74372848007872
    
  • Tidy Data

      df2 = pd.DataFrame({'key1':['a', 'a', 'b', 'b', 'a'],
           'key2':['one', 'two', 'one', 'two', 'one'],
           'data1':np.random.randn(5),
           'data2':np.random.randn(5)})
    

  • 分组聚合

      grouped = df2['data1'].groupby(df['key1'])
      grouped.mean()
    
      key1
      a   -0.093686
      b   -0.322711
      Name: data1, dtype: float64  
          
       player_index = 'playerShort'
       player_cols = [#'player', # drop player name, we have unique identifier
                     'birthday',
                     'height',
                     'weight',
                     'position',
                     'photoID',
                     'rater1',
                     'rater2',
                    ]   
    
      all_cols_unique_players = df.groupby(' ').agg({col:'nunique' for col in player_cols})
      all_cols_unique_players.head()
    

    all_cols_unique_players[all_cols_unique_players > 1].dropna().shape[0] == 0
    True
  • 去重

      def get_subgroup(dataframe, g_index, g_columns):
          g = dataframe.groupby(g_index).agg({col:'nunique' for col in g_columns})
          if g[g > 1].dropna().shape[0] != 0:
              print("Warning: you probably assumed this had all unique values but it doesn't.")
          return dataframe.groupby(g_index).agg({col:'max' for col in g_columns})
    
      players = get_subgroup(df, player_index, player_cols)
      players.head()
    

3 数据缺失值指标可视化

未完待续