This is for Part 1 Problem 1. The information extracted from the database goes into two dataframes, one for total salaries and one for winning percentage, because these variables live in different tables of the database. Once the information is collected, the two dataframes are merged into one through an inner join. I chose an inner join because data is present for all years after 1984 except for teams NYM and SFG in 2014: missing data occurs in the Salaries table for years before 1985 and in the winning-percentage table for NYM and SFG in 2014.
[19]: import sqlite3 as sql   # imports required by this cell (not shown in the original export)
      import pandas as pd

      sqlite_file = 'lahman2014.sqlite'
      conn = sql.connect(sqlite_file)  # connects the database to Python

      # query that extracts payroll information from the database
      salary_query = "SELECT yearID, teamID, sum(salary) as total_payroll FROM Salaries WHERE lgID == 'AL' OR lgID == 'NL' GROUP BY yearID, teamID"
      team_salaries = pd.read_sql(salary_query, conn)  # runs the payroll query to create a dataframe

      # query that extracts winning-percentage information from the database
      winning_percentage_query = "SELECT yearID, teamID, franchID, W, G, ROUND(CAST(W AS float) / CAST(G AS float) * 100, 2) as winning_percentage FROM Teams WHERE lgID == 'AL' OR lgID == 'NL' GROUP BY yearID, teamID"
      winning_percentage = pd.read_sql(winning_percentage_query, conn)  # runs the winning-percentage query to create a dataframe

      # merges both dataframes into one with an inner join; used throughout the project as the main table
      main = winning_percentage.merge(team_salaries[['yearID', 'teamID', 'total_payroll']], how='inner')
      display(main)

yearID teamID franchID W G winning_percentage total_payroll
0 1985 ATL ATL 66 162 40.74 14807000.0
1 1985 BAL BAL 83 161 51.55 11560712.0
2 1985 BOS BOS 81 163 49.69 10897560.0
3 1985 CAL ANA 90 162 55.56 14427894.0
4 1985 CHA CHW 85 163 52.15 9846178.0
.. … … … .. … … …
853 2014 SLN STL 90 162 55.56 120693000.0
854 2014 TBA TBD 77 162 47.53 72689100.0
855 2014 TEX TEX 67 162 41.36 112255059.0
856 2014 TOR TOR 83 162 51.23 109920100.0
857 2014 WAS WSN 96 162 59.26 131983680.0
[858 rows x 7 columns]
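As a quick check on the inner-join choice described above (a sketch added here, not part of the original notebook), an outer merge with an indicator column would list exactly which year/team rows lack a match in the other table:

# Sketch: find rows present in only one of the two source tables.
check = winning_percentage.merge(team_salaries[['yearID', 'teamID', 'total_payroll']],
                                 how='outer', indicator=True)
unmatched = check[check['_merge'] != 'both']
print(unmatched[['yearID', 'teamID', '_merge']])  # expect pre-1985 Teams rows, which have no salary data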
This is for Part 2 Problem 2. I create a new dataframe that keeps only years after 1989, then pivot it so the columns are the team/franchise names, the index is the year, and the values are the total_payroll for each team each year. I use franchID instead of teamID because team names have changed over time, which reduces the amount of missing data. After this, I create a line plot of payroll across all teams. Some values are left as NaN because those teams did not exist until a certain year (e.g., ARI did not start until 1998).
[20]: # creates a new dataframe from main that keeps only years > 1989
      main_grt_1989 = main.loc[main['yearID'] > 1989]

      # pivots the table to have franchise names as columns, years as the index,
      # and the total_payroll for each team each year as the values
      TS_line_table = main_grt_1989.pivot_table(index='yearID', columns='franchID', values='total_payroll')
      display(TS_line_table)

      # creates the line plot for payroll across teams
      TS_line_plot = TS_line_table.plot(xlabel='Year', ylabel='Total Payroll', title='Payrolls Across Teams from 1990-2014')
      TS_line_plot.legend(bbox_to_anchor=(1.26, 1.05))  # moves the legend next to the plot
franchID          ANA          ARI          ATL          BAL          BOS
yearID
1990       21720000.0          NaN   14555501.0    9680084.0   20558333.0
1991       33060001.0          NaN   18403500.0   17519000.0   35167500.0
1992       34749334.0          NaN   34625333.0   23780667.0   43610584.0
1993       28588334.0          NaN   41641417.0   29096500.0   37120583.0
1994       25156218.0          NaN   49383513.0   38849769.0   37859084.0
1995       31223171.0          NaN   47235445.0   43942521.0   32455518.0
1996       28738000.0          NaN   49698500.0   54490315.0   42393500.0
1997       31135472.0          NaN   52278500.0   58516400.0   43558750.0
1998       41281000.0   32347000.0   61186000.0   72355634.0   56757000.0
1999       55388166.0   68703999.0   73140000.0   80605863.0   63497500.0
2000       51464167.0   81027833.0   84537836.0   81447435.0   77940333.0
2001       47535167.0   85082999.0   91936166.0   67599540.0  110035833.0
2002       61721667.0  102819999.0   92870367.0   60493487.0  108366060.0
2003       79031667.0   80657000.0  106243667.0   73877500.0   99946500.0
2004      100534667.0   69780750.0   90182500.0   51623333.0  127298500.0
2005       94867822.0   62329166.0   86457302.0   73914333.0  123505125.0
2006      103472000.0   59684226.0   90156876.0   72585582.0  120099824.0
2007      109251333.0   52067546.0   87290833.0   93174808.0  143026214.0
2008      119216333.0   66202712.0  102365683.0   67196246.0  133390035.0
2009      113709000.0   73115666.0   96726166.0   67101666.0  121345999.0
2010      104963866.0   60718166.0   84423666.0   81612500.0  162447333.0
2011      138543166.0   53639833.0   87002692.0   85304038.0  161762475.0
2012      154485166.0   73804833.0   82829942.0   77353999.0  173186617.0
2013      124174750.0   90132000.0   87871525.0   84393333.0  151530000.0
2014      121988250.0   97861500.0   97609000.0  103416000.0  139019929.0

franchID          CHC          CHW          CIN          CLE          COL  …
yearID                                                                     …
1990       13624000.0    9491500.0   14370000.0   14487000.0          NaN  …
1991       23175667.0   16919667.0   26305333.0   17635000.0          NaN  …
1992       29829686.0   30160833.0   35931499.0    9373044.0          NaN  …
1993       39386666.0   39696166.0   44879666.0   18561000.0   10353500.0  …
1994       36287333.0   39183836.0   40961833.0   30490500.0   23887333.0  …
1995       29505834.0   46961282.0   43144670.0   37937835.0   34154717.0  …
1996       33081000.0   45139500.0   42526334.0   48107360.0   40179823.0  …
1997       42155333.0   57740000.0   49768000.0   56802460.0   43559667.0  …
1998       50838000.0   38335000.0   23005000.0   60800166.0   50484648.0  …
1999       62343000.0   25620000.0   33962761.0   72978462.0   61935837.0  …
2000       60539333.0   31133500.0   46867200.0   75880771.0   61111190.0  …
2001       64715833.0   65653667.0   48986000.0   93152001.0   71541334.0  …
2002       75690833.0   57052833.0   45050390.0   78909449.0   56851043.0  …
2003       79868333.0   51010000.0   59355667.0   48584834.0   67179667.0  …
2004       90560000.0   65212500.0   46615250.0   34319300.0   65445167.0  …
2005       87032933.0   75178000.0   61892583.0   41502500.0   47839000.0  …
2006       94424499.0  102750667.0   60909519.0   56031500.0   41233000.0  …
2007       99670332.0  108671833.0   68524980.0   61673267.0   54041000.0  …
2008      118345833.0  121189332.0   74117695.0   78970066.0   68655500.0  …
2009      134809000.0   96068500.0   73558500.0   81579166.0   75201000.0  …
2010      146609000.0  105530000.0   71761542.0   61203966.0   84227000.0  …
2011      125047329.0  127789000.0   75947134.0   48776566.0   88148071.0  …
2012       88197033.0   96919500.0   82203616.0   78430300.0   78069571.0  …
2013      100567726.0  120065277.0  106404462.0   75771800.0   74409071.0  …
2014       65522500.0   81830500.0  108217500.0   82151899.0   95403500.0  …

franchID          PHI          PIT          SDP          SEA          SFG
yearID
1990       13173667.0   15556000.0   17588334.0   12553667.0   19335333.0
1991       22487332.0   23634667.0   22150001.0   15691833.0   30967666.0
1992       24383834.0   33944167.0   26854167.0   23179833.0   33163168.0
1993       28538334.0   24822467.0   25511333.0   32696333.0   35050000.0
1994       31599000.0   24217250.0   14916333.0   29228500.0   42638666.0
1995       30555945.0   18355345.0   26382334.0   36481311.0   36462777.0
1996       34314500.0   23017500.0   28348172.0   41328501.0   37144725.0
1997       36656500.0   10771667.0   37363672.0   41540661.0   35592378.0
1998       36297500.0   15065000.0   46861500.0   54087036.0   42565834.0
1999       31692500.0   24697666.0   49768179.0   54125003.0   46595057.0
2000       47308000.0   28928334.0   54821000.0   58915000.0   53737826.0
2001       41663833.0   57760833.0   39182833.0   74720834.0   63280167.0
2002       57954999.0   42323599.0   41425000.0   80282668.0   78299835.0
2003       70780000.0   54812429.0   45210000.0   86959167.0   82852167.0
2004       92919167.0   32227929.0   55384833.0   81515834.0   82019166.0
2005       95522000.0   38133000.0   63290833.0   87754334.0   90199500.0
2006       88273333.0   46717750.0   69896141.0   87959833.0   90056419.0
2007       89428213.0   38537833.0   58110567.0  106460833.0   90219056.0
2008       97879880.0   48689783.0   73677616.0  117666482.0   76594500.0
2009      113004046.0   48693000.0   43333700.0   98904166.0   83026450.0
2010      141928379.0   34943000.0   37799300.0   86510000.0   98641333.0
2011      172976379.0   45047000.0   45869140.0   86110600.0  118198333.0
2012      174538938.0   62951999.0   55244700.0   81978100.0  117620683.0
2013      169863189.0   77062000.0   65585500.0   74005043.0  140180334.0
2014      180944967.0   77178000.0   75685700.0   92531100.0   20000000.0

franchID          STL          TBD          TEX          TOR          WSN
yearID
1990       20523334.0          NaN   14874372.0   17756834.0   16586388.0
1991       21860001.0          NaN   18224500.0   19902417.0   10732333.0
1992       27583836.0          NaN   30128167.0   44788666.0   15822334.0
1993       23367334.0          NaN   36376959.0   47279166.0   18899333.0
1994       29275601.0          NaN   32973597.0   43433668.0   19098000.0
1995       37101000.0          NaN   34581451.0   50590000.0   12364000.0
1996       40269667.0          NaN   39041528.0   29555083.0   16264500.0
1997       45456667.0          NaN   53448838.0   47079833.0   19295500.0
1998       54672521.0   27280000.0   56572095.0   51376000.0   10641500.0
1999       49778195.0   38870000.0   76709931.0   45444333.0   17903000.0
2000       61453863.0   62765129.0   70795921.0   44838332.0   32994333.0
2001       78538333.0   56980000.0   88633500.0   76895999.0   35159500.0
2002       74660875.0   34380000.0  105526122.0   76864333.0   38670500.0
2003       83786666.0   19630000.0  103491667.0   51269000.0   51948500.0
2004       83228333.0   29556667.0   55050417.0   50017000.0   40897500.0
2005       92106833.0   29679067.0   55849000.0   45719500.0   48581500.0
2006       88891371.0   34917967.0   68228662.0   71365000.0   63143000.0
2007       90286823.0   24123500.0   68318675.0   81942800.0   36947500.0
2008       99624449.0   43820597.0   67712326.0   97793900.0   54961000.0
2009       88528409.0   63313034.0   68178798.0   80538300.0   59928000.0
2010       93540751.0   71923471.0   55250544.0   62234000.0   61400000.0
2011      105433572.0   41053571.0   92299264.0   62567800.0   63856928.0
2012      110300862.0   64173500.0  120510974.0   75009200.0   80855143.0
2013       92260110.0   52955272.0  112522600.0  126288100.0  113703270.0
2014      120693000.0   72689100.0  112255059.0  109920100.0  131983680.0
[25 rows x 30 columns]
[20]: <matplotlib.legend.Legend at 0xffff5af40280>
The code below is written for Part 2 Question 1 to make specific observations regarding central tendency. The mean and median of salaries for all teams each year are shown below.
[21]: TS_mean = TS_line_table.mean(axis=1)      # Series of the mean payroll of all teams each year
      TS_median = TS_line_table.median(axis=1)  # Series of the median payroll of all teams each year
      display(TS_mean)
      display(TS_median)

yearID
1990 1.707235e+07
1991 2.357879e+07
1992 3.098244e+07
1993 3.220500e+07
1994 3.313701e+07
1995 3.398105e+07
1996 3.417798e+07
1997 4.026021e+07
1998 4.260943e+07
1999 4.980762e+07
2000 5.553784e+07
2001 6.535544e+07
2002 6.746925e+07
2003 7.094207e+07
2004 6.902220e+07
2005 7.295711e+07
2006 7.738242e+07
2007 8.255630e+07
2008 8.949529e+07
2009 8.882423e+07
2010 9.071200e+07
2011 9.281684e+07
2012 9.775804e+07
2013 1.011509e+08
2014 9.980002e+07
dtype: float64
yearID
1990 17590786.0
1991 23145583.5
1992 30587250.0
1993 33873166.5
1994 33049798.5
1995 34375642.5
1996 33697750.0
1997 41847997.0
1998 42469917.0
1999 49773187.0
2000 56543083.5
2001 63998000.0
2002 61107577.0
2003 68979833.5
2004 62319083.5
2005 66191416.5
2006 71975291.0
2007 84616816.5
2008 79953782.5
2009 81058733.0
2010 84325333.0
2011 86556646.0
2012 85513487.5
2013 89001762.5
2014 96506250.0
dtype: float64
This is for Part 2 Question 1. The first series above gives the mean salary of all teams per year, and the second gives the median. Both series show that as time passes, payrolls overall increase across both leagues. They also flag certain years where we might expect skewness, such as 2003, where the mean was $70,942,070 and the median was $68,979,833.50; since the mean exceeds the median, the 2003 payroll distribution is positively skewed.
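To make the skewness observation concrete, here is a small sketch (my addition, not an original cell) that computes the yearly mean-median gap; a positive gap marks a right-skewed year:

# Sketch: positive (mean - median) suggests a right-skewed payroll distribution that year.
gap = TS_mean - TS_median
print(gap.loc[2003])  # roughly 1.96e6, so the 2003 mean exceeds the median
print((gap > 0).sum(), "of", gap.size, "years have mean above median")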
In Part 2 Problem 3, I graph line plots that show the mean and the median for each year. One can see that from 1990 to approximately 2000 the mean and the median stay close to each other, suggesting a roughly symmetric distribution, while from 2000-2014 the median frequently sits below the mean, indicating a positively skewed distribution. I create a new dataframe by concatenating the Mean Series and the Median Series of all teams each year, and then create a line plot of the mean and median salaries across time in all teams.
[22]: # creates a new dataframe by concatenating the Mean Series and the Median Series of all teams each year
      TS_CT = pd.concat({"Mean": TS_mean, "Median": TS_median}, axis=1)

      # creates the line plot for the mean and median salaries across time in all teams
      TS_CT_plot = TS_CT.plot(xlabel='Year', ylabel='Salaries')
      # centers the title on the graph while keeping it clear of the "1e8" offset label
      TS_CT_plot.set_title('Mean and Median Salaries Across Time in all Teams', x=0.54, pad=-14)
[22]: Text(0.54, 1.0, ‘Mean and Median Salaries Across Time in all Teams’)
This is for Part 2 Problem 4. I use pd.cut to discretize year into five time periods in the main table, then use groupby() twice. The first groupby, on Time Period and franchID, yields each team's mean salary and mean winning percentage for each time period. The second groupby, on Time Period over the resulting summary table, produces one scatter plot per time period.
[23]: # discretizes year into five time periods
      main['Time Period'] = pd.cut(main.yearID, bins=[1984, 1990, 1996, 2002, 2008, 2014],
                                   labels=['1985-1990', '1990-1996', '1996-2002', '2002-2008', '2008-2014'])

      # groupby object over Time Period and franchID for obtaining the mean salary
      # and mean winning percentage
      TP_and_fID_groupby = main.groupby(['Time Period', 'franchID'])

      group_data = []  # stores the mean winning percentage and mean salary for each group
      for k, v in TP_and_fID_groupby:
          # appends Time Period, winning-percentage mean, salary mean, and franchID
          group_data.append([k[0], v['winning_percentage'].mean(), v['total_payroll'].mean(), k[1]])

      # creates a dataframe from the collected rows
      TS_and_WP_mean = pd.DataFrame(group_data, columns=['Time Period', 'Mean Winning Percentage', 'Mean Salary', 'franchID'])

      # groupby object over Time Period, one group per scatter plot
      TS_and_WP_mean_groupby = TS_and_WP_mean.groupby('Time Period')
      for k, v in TS_and_WP_mean_groupby:
          # pivots each group so franchID is the index, with the mean salary and
          # mean winning percentage as values
          v2 = v.pivot_table(index=['franchID'])
          x = v2['Mean Salary'].tolist()              # mean salaries
          y = v2['Mean Winning Percentage'].tolist()  # mean winning percentages
          txt = v2.index.values                       # team names
          # creates the scatter plot for this time period
          TS_and_WP_mean_plot = v2.plot(kind='scatter', x='Mean Salary', y='Mean Winning Percentage',
                                        title='Mean Salary vs. Mean Winning Percentage at ' + k)
          for i, t in enumerate(txt):  # adds labels to specific points in each plot
              if t == 'OAK' or t == 'NYY' or t == 'BOS':
                  TS_and_WP_mean_plot.annotate(t, (x[i], y[i]))  # label as normal for OAK, NYY, or BOS
              if t == 'ARI':
                  TS_and_WP_mean_plot.annotate(t, (x[i] - .04e7, y[i]))  # shift ARI label to avoid collisions
              if t == 'NYM' and k == '1996-2002':
                  # shift the NYM label to avoid collisions, specifically in the 1996-2002 plot
                  TS_and_WP_mean_plot.annotate(t, (x[i] - .23e7, y[i]))
              elif t == 'NYM':
                  TS_and_WP_mean_plot.annotate(t, (x[i], y[i]))  # label NYM as normal outside 1996-2002
This is for Part 2 Question 2. Team payrolls are increasing overall. One can tell from the x-axis of each plot: as the time periods advance, the x-axis range grows, and since the x-axis represents each team's mean salary, this shows that payrolls overall are increasing. Some teams are particularly good at paying for wins, such as NYY and BOS. From 1985-1990 and 1990-1996, OAK paid its players at rates comparable to other teams; in these two time periods OAK is clustered with many other teams on the right side of the plot, which from the data's point of view represents fair spending on player salaries. However, in 1996-2002, 2002-2008, and 2008-2014, OAK moved to the left side of the plot while remaining high on it. This indicates that OAK was exploiting its players, spending less on them than other teams did while still expecting a high winning percentage, since it hired players who are skilled at baseball but not nationally recognized.
This is for Part 3 Problem 5. As in Part 2 Problem 2, I pivot the table with yearID as the index, franchID as the columns, and the payroll for each team each year as the values. I compute the mean and standard deviation of salary for each year in a separate dataframe, then merge it with the main dataframe on yearID. After that, I create the standardized payroll column as (total payroll − yearly mean) / yearly standard deviation, computed from the total, mean, and standard deviation columns.
[24]: # pivots the main table into a new dataframe with yearID as the index,
      # franchID as the columns, and total_payroll as the values
      TS = main.pivot_table(index='yearID', columns='franchID', values='total_payroll')

      # central-tendency table with the mean and standard deviation of salary for each year
      TS_CT = pd.concat({"Mean Salary Each Year": TS_line_table.mean(axis=1),
                         "Standard Deviation Salary Each Year": TS_line_table.std(axis=1)}, axis=1)

      # merges the main table with the central-tendency table
      main = pd.merge(main, TS_CT, on='yearID')

      # standardized payroll: (total payroll - yearly mean) / yearly standard deviation
      main['std_payroll'] = (main['total_payroll'] - main['Mean Salary Each Year']) / main['Standard Deviation Salary Each Year']
      display(main)
yearID teamID franchID W G winning_percentage total_payroll
0 1990 ATL ATL 65 162 40.12 14555501.0
1 1990 BAL BAL 76 161 47.20 9680084.0
2 1990 BOS BOS 88 162 54.32 20558333.0
3 1990 CAL ANA 80 162 49.38 21720000.0
4 1990 CHA CHW 94 162 58.02 9491500.0
.. … … … .. … … …
723 2014 SLN STL 90 162 55.56 120693000.0
724 2014 TBA TBD 77 162 47.53 72689100.0
725 2014 TEX TEX 67 162 41.36 112255059.0
726 2014 TOR TOR 83 162 51.23 109920100.0
727 2014 WAS WSN 96 162 59.26 131983680.0
Time Period Mean Salary Each Year Standard Deviation Salary Each Year
0 1985-1990 1.707235e+07 3.771834e+06
1 1985-1990 1.707235e+07 3.771834e+06
2 1985-1990 1.707235e+07 3.771834e+06
3 1985-1990 1.707235e+07 3.771834e+06
4 1985-1990 1.707235e+07 3.771834e+06
.. … … …
723 2008-2014 9.980002e+07 4.570505e+07
724 2008-2014 9.980002e+07 4.570505e+07
725 2008-2014 9.980002e+07 4.570505e+07
726 2008-2014 9.980002e+07 4.570505e+07
727 2008-2014 9.980002e+07 4.570505e+07
std_payroll
0 -0.667275
1 -1.959861
2 0.924213
3 1.232198
4 -2.009859
.. …
723 0.457126
724 -0.593171
725 0.272509
726 0.221422
727 0.704160
[728 rows x 11 columns]
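As a sanity check on the standardization (a sketch of mine, assuming main is built as in the cell above), std_payroll should average about 0 with a standard deviation of about 1 within each year:

# Sketch: verify the z-score computation year by year.
check = main.groupby('yearID')['std_payroll'].agg(['mean', 'std'])
print(check.round(6).head())  # means near 0 and stds near 1 confirm the standardization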
This is for Part 3 Problem 6. I use groupby() twice. The first groupby, on Time Period and franchID, yields each team's mean standardized payroll and mean winning percentage for each time period. The second groupby, on Time Period over the resulting summary table, produces one scatter plot per time period.
[25]: # groupby object over Time Period and franchID for obtaining the mean standardized
      # payroll and mean winning percentage
      TP_and_fID_groupby = main.groupby(['Time Period', 'franchID'])

      group_data = []  # stores the mean winning percentage and mean standardized payroll for each group
      for k, v in TP_and_fID_groupby:
          # appends Time Period, winning-percentage mean, standardized-payroll mean, and franchID
          group_data.append([k[0], v['winning_percentage'].mean(), v['std_payroll'].mean(), k[1]])

      # creates a dataframe from the collected rows
      SP_and_WP_mean = pd.DataFrame(group_data, columns=['Time Period', 'Mean Winning Percentage', 'Mean Standardized Payroll', 'franchID'])

      # groupby object over Time Period, one group per scatter plot
      SP_and_WP_mean_groupby = SP_and_WP_mean.groupby('Time Period')
      for k, v in SP_and_WP_mean_groupby:
          # pivots each group so franchID is the index, with the mean standardized payroll
          # and mean winning percentage as values
          v2 = v.pivot_table(index=['franchID'])
          x = v2['Mean Standardized Payroll'].tolist()  # mean standardized payrolls
          y = v2['Mean Winning Percentage'].tolist()    # mean winning percentages
          txt = v2.index.values                         # team names
          # creates the scatter plot for this time period
          SP_and_WP_mean_plot = v2.plot(kind='scatter', x='Mean Standardized Payroll', y='Mean Winning Percentage',
                                        title='Mean Standardized Payroll vs. Mean Winning Percentage at ' + k)
          for i, t in enumerate(txt):  # adds labels to specific points in each plot
              if t == 'OAK' or t == 'ARI' or t == 'NYY' or t == 'BOS':
                  SP_and_WP_mean_plot.annotate(t, (x[i], y[i]))  # label as normal for OAK, ARI, NYY, or BOS
              if t == 'NYM' and k == '1996-2002':
                  # shift the NYM label down to avoid collisions, specifically in the 1996-2002 plot
                  SP_and_WP_mean_plot.annotate(t, (x[i], y[i] - .5))
This is for Part 3 Question 3. The two sets of plots look similar in terms of where each data point is positioned and clustered. For example, NYY is still in the middle right in the first two plots and in the top right in the last three plots of both Problem 4 and Problem 6. There are some small differences, such as ARI in 1996-2002: ARI has a lower mean standardized payroll than NYM but a higher mean raw payroll. Although the same data points cluster together in both sets of plots, there are minor differences between mean standardized payroll and mean total payroll when comparing individual data points.
This is for Part 3 Problem 7. I extract the standardized payroll and winning percentage columns into two arrays. I create a LinearRegression() object, fit it on the two arrays, store the regression's predicted y-values in an array, and draw a scatter plot of the data points together with a line plot of the regression on the same axes.
[26]: from sklearn.linear_model import LinearRegression  # imports required by this cell
      import matplotlib.pyplot as plt

      x = main['std_payroll'].values.reshape(-1, 1)         # values for standardized payroll
      y = main['winning_percentage'].values.reshape(-1, 1)  # values for winning percentage

      lr = LinearRegression()    # creates the LinearRegression() object
      lr.fit(x, y)               # fits the regression on the two arrays
      Y_predict = lr.predict(x)  # stores the y-values predicted by the linear regression

      plt.scatter(x, y)                    # scatter plot of the data points
      plt.plot(x, Y_predict, color='red')  # line plot of the regression
      plt.title("Winning percentage vs. Standardized payroll")  # creates the title
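To read the fitted line numerically (a small follow-up of mine, not in the original cell), the slope gives the expected gain in winning percentage per one standard deviation of payroll:

# Sketch: inspect the fitted regression coefficients.
slope = lr.coef_[0][0]        # winning-percentage points per 1 std. dev. of payroll
intercept = lr.intercept_[0]  # expected winning percentage at an exactly average payroll
print(f"expected_winning_percentage = {intercept:.2f} + {slope:.2f} * std_payroll")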
This is for Part 3 Problem 8. I create a new column in the main table for the expected values and another new column computed from it for the efficiency. After that, I get the years and efficiency values for teams OAK, TBA, BOS, NYA, and ATL using query(). This time I use teamID, since the query method was difficult to use with franchID. After that, I plot the teams' efficiencies as multiple lines on one line graph.
[27]: main['expected_winning_percentage'] = Y_predict  # new column in the main table for the expected values
      # efficiency: actual winning percentage minus the expected winning percentage from the regression
      main['efficency'] = main['winning_percentage'] - main['expected_winning_percentage']

      x = main.query("teamID == 'OAK'")['yearID'].values.reshape(-1, 1)        # years for OAK
      xTBA = main.query("teamID == 'TBA'")['yearID'].values.reshape(-1, 1)     # years for TBA, which started later
      yOAK = main.query("teamID == 'OAK'")['efficency'].values.reshape(-1, 1)  # efficiency values for OAK
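The cell appears truncated in this export; the remaining team extractions and the multi-line plot described above are missing. A minimal sketch of how it could continue, assuming the same query pattern (the y-series names below are my own):

# Sketch: efficiency series for the remaining teams, then one line plot for all of them.
yTBA = main.query("teamID == 'TBA'")['efficency'].values.reshape(-1, 1)
yBOS = main.query("teamID == 'BOS'")['efficency'].values.reshape(-1, 1)
yNYA = main.query("teamID == 'NYA'")['efficency'].values.reshape(-1, 1)
yATL = main.query("teamID == 'ATL'")['efficency'].values.reshape(-1, 1)

plt.plot(x, yOAK, label='OAK')
plt.plot(xTBA, yTBA, label='TBA')  # TBA has its own year axis since the franchise started in 1998
plt.plot(x, yBOS, label='BOS')     # assumes BOS, NYA, and ATL share OAK's 1990-2014 year range
plt.plot(x, yNYA, label='NYA')
plt.plot(x, yATL, label='ATL')
plt.xlabel('Year')
plt.ylabel('Efficiency')
plt.title('Spending Efficiency Across Time')
plt.legend()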
Compared to the plots I examined in Questions 2 and 3, this plot indicates why the mean payroll and the mean standardized payroll are positioned as they are. Efficiency gives a better picture of how well teams are spending money on their players than the averages alone in relation to winning percentage. For example, BOS had a high average salary/standardized payroll and a low winning percentage in the 1990-1996 scatter plots, and this shows up in the efficiency plot, where BOS hit a low point in efficiency (almost -10). In the Moneyball period, OAK's spending efficiency was gradually decreasing, as the team had caught on that it could exploit players who perform well but are not nationally known.