Data Visualization - Line Plotting

Preparing the enviroment

In [9]:
import numpy as np
import pandas as pd
import openpyxl # used to read excel files

Enter the url where your dataset is:

In [11]:
URL = 'https://github.com/mmouty/energynds/blob/main/energynds/assets/datasets/UN_MigFlow_All_CountryFiles.zip'

fetching the excel file:

In [15]:
df_can = pd.read_excel(
    URL,
    sheet_name='Canada by Citizenship',
    skiprows=range(20),
    skipfooter=2,
    engine='openpyxl')
print('Data downloaded and read into a dataframe!')
Data downloaded and read into a dataframe!
In [16]:
df_can.head() #Just checking the data
Out[16]:
Type Coverage OdName AREA AreaName REG RegName DEV DevName 1980 ... 2012 2013 Unnamed: 43 Unnamed: 44 Unnamed: 45 Unnamed: 46 Unnamed: 47 Unnamed: 48 Unnamed: 49 Unnamed: 50
0 Immigrants Foreigners Afghanistan 935.0 Asia 5501.0 Southern Asia 902.0 Developing regions 16.0 ... 2635.0 2004.0 NaN NaN NaN NaN NaN NaN NaN NaN
1 Immigrants Foreigners Albania 908.0 Europe 925.0 Southern Europe 901.0 Developed regions 1.0 ... 620.0 603.0 NaN NaN NaN NaN NaN NaN NaN NaN
2 Immigrants Foreigners Algeria 903.0 Africa 912.0 Northern Africa 902.0 Developing regions 80.0 ... 3774.0 4331.0 NaN NaN NaN NaN NaN NaN NaN NaN
3 Immigrants Foreigners American Samoa 909.0 Oceania 957.0 Polynesia 902.0 Developing regions 0.0 ... 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
4 Immigrants Foreigners Andorra 908.0 Europe 925.0 Southern Europe 901.0 Developed regions 0.0 ... 1.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 51 columns

Let's get basic information about our dataframe:

In [17]:
df_can.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1013 entries, 0 to 1012
Columns: 51 entries, Type to Unnamed: 50
dtypes: float64(45), object(6)
memory usage: 403.7+ KB

Let's get the list of column headers

In [18]:
df_can.columns
Out[18]:
Index([       'Type',    'Coverage',      'OdName',        'AREA',
          'AreaName',         'REG',     'RegName',         'DEV',
           'DevName',          1980,          1981,          1982,
                1983,          1984,          1985,          1986,
                1987,          1988,          1989,          1990,
                1991,          1992,          1993,          1994,
                1995,          1996,          1997,          1998,
                1999,          2000,          2001,          2002,
                2003,          2004,          2005,          2006,
                2007,          2008,          2009,          2010,
                2011,          2012,          2013, 'Unnamed: 43',
       'Unnamed: 44', 'Unnamed: 45', 'Unnamed: 46', 'Unnamed: 47',
       'Unnamed: 48', 'Unnamed: 49', 'Unnamed: 50'],
      dtype='object')

Let's get the list of indicies:

In [19]:
df_can.index
Out[19]:
RangeIndex(start=0, stop=1013, step=1)

Let's check the type of the columns and index:

In [20]:
print(type(df_can.columns))
print(type(df_can.index))
<class 'pandas.core.indexes.base.Index'>
<class 'pandas.core.indexes.range.RangeIndex'>

The default type of intance variables index and columns are not list, so let's change them into list:

In [21]:
df_can.columns.tolist()
Out[21]:
['Type',
 'Coverage',
 'OdName',
 'AREA',
 'AreaName',
 'REG',
 'RegName',
 'DEV',
 'DevName',
 1980,
 1981,
 1982,
 1983,
 1984,
 1985,
 1986,
 1987,
 1988,
 1989,
 1990,
 1991,
 1992,
 1993,
 1994,
 1995,
 1996,
 1997,
 1998,
 1999,
 2000,
 2001,
 2002,
 2003,
 2004,
 2005,
 2006,
 2007,
 2008,
 2009,
 2010,
 2011,
 2012,
 2013,
 'Unnamed: 43',
 'Unnamed: 44',
 'Unnamed: 45',
 'Unnamed: 46',
 'Unnamed: 47',
 'Unnamed: 48',
 'Unnamed: 49',
 'Unnamed: 50']
In [22]:
df_can.index.tolist()
Out[22]:
[0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29,
 30,
 31,
 32,
 33,
 34,
 35,
 36,
 37,
 38,
 39,
 40,
 41,
 42,
 43,
 44,
 45,
 46,
 47,
 48,
 49,
 50,
 51,
 52,
 53,
 54,
 55,
 56,
 57,
 58,
 59,
 60,
 61,
 62,
 63,
 64,
 65,
 66,
 67,
 68,
 69,
 70,
 71,
 72,
 73,
 74,
 75,
 76,
 77,
 78,
 79,
 80,
 81,
 82,
 83,
 84,
 85,
 86,
 87,
 88,
 89,
 90,
 91,
 92,
 93,
 94,
 95,
 96,
 97,
 98,
 99,
 100,
 101,
 102,
 103,
 104,
 105,
 106,
 107,
 108,
 109,
 110,
 111,
 112,
 113,
 114,
 115,
 116,
 117,
 118,
 119,
 120,
 121,
 122,
 123,
 124,
 125,
 126,
 127,
 128,
 129,
 130,
 131,
 132,
 133,
 134,
 135,
 136,
 137,
 138,
 139,
 140,
 141,
 142,
 143,
 144,
 145,
 146,
 147,
 148,
 149,
 150,
 151,
 152,
 153,
 154,
 155,
 156,
 157,
 158,
 159,
 160,
 161,
 162,
 163,
 164,
 165,
 166,
 167,
 168,
 169,
 170,
 171,
 172,
 173,
 174,
 175,
 176,
 177,
 178,
 179,
 180,
 181,
 182,
 183,
 184,
 185,
 186,
 187,
 188,
 189,
 190,
 191,
 192,
 193,
 194,
 195,
 196,
 197,
 198,
 199,
 200,
 201,
 202,
 203,
 204,
 205,
 206,
 207,
 208,
 209,
 210,
 211,
 212,
 213,
 214,
 215,
 216,
 217,
 218,
 219,
 220,
 221,
 222,
 223,
 224,
 225,
 226,
 227,
 228,
 229,
 230,
 231,
 232,
 233,
 234,
 235,
 236,
 237,
 238,
 239,
 240,
 241,
 242,
 243,
 244,
 245,
 246,
 247,
 248,
 249,
 250,
 251,
 252,
 253,
 254,
 255,
 256,
 257,
 258,
 259,
 260,
 261,
 262,
 263,
 264,
 265,
 266,
 267,
 268,
 269,
 270,
 271,
 272,
 273,
 274,
 275,
 276,
 277,
 278,
 279,
 280,
 281,
 282,
 283,
 284,
 285,
 286,
 287,
 288,
 289,
 290,
 291,
 292,
 293,
 294,
 295,
 296,
 297,
 298,
 299,
 300,
 301,
 302,
 303,
 304,
 305,
 306,
 307,
 308,
 309,
 310,
 311,
 312,
 313,
 314,
 315,
 316,
 317,
 318,
 319,
 320,
 321,
 322,
 323,
 324,
 325,
 326,
 327,
 328,
 329,
 330,
 331,
 332,
 333,
 334,
 335,
 336,
 337,
 338,
 339,
 340,
 341,
 342,
 343,
 344,
 345,
 346,
 347,
 348,
 349,
 350,
 351,
 352,
 353,
 354,
 355,
 356,
 357,
 358,
 359,
 360,
 361,
 362,
 363,
 364,
 365,
 366,
 367,
 368,
 369,
 370,
 371,
 372,
 373,
 374,
 375,
 376,
 377,
 378,
 379,
 380,
 381,
 382,
 383,
 384,
 385,
 386,
 387,
 388,
 389,
 390,
 391,
 392,
 393,
 394,
 395,
 396,
 397,
 398,
 399,
 400,
 401,
 402,
 403,
 404,
 405,
 406,
 407,
 408,
 409,
 410,
 411,
 412,
 413,
 414,
 415,
 416,
 417,
 418,
 419,
 420,
 421,
 422,
 423,
 424,
 425,
 426,
 427,
 428,
 429,
 430,
 431,
 432,
 433,
 434,
 435,
 436,
 437,
 438,
 439,
 440,
 441,
 442,
 443,
 444,
 445,
 446,
 447,
 448,
 449,
 450,
 451,
 452,
 453,
 454,
 455,
 456,
 457,
 458,
 459,
 460,
 461,
 462,
 463,
 464,
 465,
 466,
 467,
 468,
 469,
 470,
 471,
 472,
 473,
 474,
 475,
 476,
 477,
 478,
 479,
 480,
 481,
 482,
 483,
 484,
 485,
 486,
 487,
 488,
 489,
 490,
 491,
 492,
 493,
 494,
 495,
 496,
 497,
 498,
 499,
 500,
 501,
 502,
 503,
 504,
 505,
 506,
 507,
 508,
 509,
 510,
 511,
 512,
 513,
 514,
 515,
 516,
 517,
 518,
 519,
 520,
 521,
 522,
 523,
 524,
 525,
 526,
 527,
 528,
 529,
 530,
 531,
 532,
 533,
 534,
 535,
 536,
 537,
 538,
 539,
 540,
 541,
 542,
 543,
 544,
 545,
 546,
 547,
 548,
 549,
 550,
 551,
 552,
 553,
 554,
 555,
 556,
 557,
 558,
 559,
 560,
 561,
 562,
 563,
 564,
 565,
 566,
 567,
 568,
 569,
 570,
 571,
 572,
 573,
 574,
 575,
 576,
 577,
 578,
 579,
 580,
 581,
 582,
 583,
 584,
 585,
 586,
 587,
 588,
 589,
 590,
 591,
 592,
 593,
 594,
 595,
 596,
 597,
 598,
 599,
 600,
 601,
 602,
 603,
 604,
 605,
 606,
 607,
 608,
 609,
 610,
 611,
 612,
 613,
 614,
 615,
 616,
 617,
 618,
 619,
 620,
 621,
 622,
 623,
 624,
 625,
 626,
 627,
 628,
 629,
 630,
 631,
 632,
 633,
 634,
 635,
 636,
 637,
 638,
 639,
 640,
 641,
 642,
 643,
 644,
 645,
 646,
 647,
 648,
 649,
 650,
 651,
 652,
 653,
 654,
 655,
 656,
 657,
 658,
 659,
 660,
 661,
 662,
 663,
 664,
 665,
 666,
 667,
 668,
 669,
 670,
 671,
 672,
 673,
 674,
 675,
 676,
 677,
 678,
 679,
 680,
 681,
 682,
 683,
 684,
 685,
 686,
 687,
 688,
 689,
 690,
 691,
 692,
 693,
 694,
 695,
 696,
 697,
 698,
 699,
 700,
 701,
 702,
 703,
 704,
 705,
 706,
 707,
 708,
 709,
 710,
 711,
 712,
 713,
 714,
 715,
 716,
 717,
 718,
 719,
 720,
 721,
 722,
 723,
 724,
 725,
 726,
 727,
 728,
 729,
 730,
 731,
 732,
 733,
 734,
 735,
 736,
 737,
 738,
 739,
 740,
 741,
 742,
 743,
 744,
 745,
 746,
 747,
 748,
 749,
 750,
 751,
 752,
 753,
 754,
 755,
 756,
 757,
 758,
 759,
 760,
 761,
 762,
 763,
 764,
 765,
 766,
 767,
 768,
 769,
 770,
 771,
 772,
 773,
 774,
 775,
 776,
 777,
 778,
 779,
 780,
 781,
 782,
 783,
 784,
 785,
 786,
 787,
 788,
 789,
 790,
 791,
 792,
 793,
 794,
 795,
 796,
 797,
 798,
 799,
 800,
 801,
 802,
 803,
 804,
 805,
 806,
 807,
 808,
 809,
 810,
 811,
 812,
 813,
 814,
 815,
 816,
 817,
 818,
 819,
 820,
 821,
 822,
 823,
 824,
 825,
 826,
 827,
 828,
 829,
 830,
 831,
 832,
 833,
 834,
 835,
 836,
 837,
 838,
 839,
 840,
 841,
 842,
 843,
 844,
 845,
 846,
 847,
 848,
 849,
 850,
 851,
 852,
 853,
 854,
 855,
 856,
 857,
 858,
 859,
 860,
 861,
 862,
 863,
 864,
 865,
 866,
 867,
 868,
 869,
 870,
 871,
 872,
 873,
 874,
 875,
 876,
 877,
 878,
 879,
 880,
 881,
 882,
 883,
 884,
 885,
 886,
 887,
 888,
 889,
 890,
 891,
 892,
 893,
 894,
 895,
 896,
 897,
 898,
 899,
 900,
 901,
 902,
 903,
 904,
 905,
 906,
 907,
 908,
 909,
 910,
 911,
 912,
 913,
 914,
 915,
 916,
 917,
 918,
 919,
 920,
 921,
 922,
 923,
 924,
 925,
 926,
 927,
 928,
 929,
 930,
 931,
 932,
 933,
 934,
 935,
 936,
 937,
 938,
 939,
 940,
 941,
 942,
 943,
 944,
 945,
 946,
 947,
 948,
 949,
 950,
 951,
 952,
 953,
 954,
 955,
 956,
 957,
 958,
 959,
 960,
 961,
 962,
 963,
 964,
 965,
 966,
 967,
 968,
 969,
 970,
 971,
 972,
 973,
 974,
 975,
 976,
 977,
 978,
 979,
 980,
 981,
 982,
 983,
 984,
 985,
 986,
 987,
 988,
 989,
 990,
 991,
 992,
 993,
 994,
 995,
 996,
 997,
 998,
 999,
 ...]

Now, let's check again:

In [23]:
print(type(df_can.columns.tolist()))
print(type(df_can.index.tolist()))
<class 'list'>
<class 'list'>

Let's check the dimensions of our dataframe:

In [24]:
# size of dataframe (rows, columns)
df_can.shape
Out[24]:
(1013, 51)

We are not going to use all of these columns, so let's drop some:

In [25]:
# in pandas axis=0 represents rows (default) and axis=1 represents columns.
df_can.drop(['AREA','REG','DEV','Type','Coverage','Unnamed: 43','Unnamed: 44','Unnamed: 45','Unnamed: 46','Unnamed: 47','Unnamed: 48','Unnamed: 49','Unnamed: 50'], axis=1, inplace=True)
df_can.head(2)
Out[25]:
OdName AreaName RegName DevName 1980 1981 1982 1983 1984 1985 ... 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
0 Afghanistan Asia Southern Asia Developing regions 16.0 39.0 39.0 47.0 71.0 340.0 ... 2978.0 3436.0 3009.0 2652.0 2111.0 1746.0 1758.0 2203.0 2635.0 2004.0
1 Albania Europe Southern Europe Developed regions 1.0 0.0 0.0 0.0 0.0 0.0 ... 1450.0 1223.0 856.0 702.0 560.0 716.0 561.0 539.0 620.0 603.0

2 rows × 38 columns

Let's rename the columns:

In [26]:
df_can.rename(columns={'OdName':'Country', 'AreaName':'Continent', 'RegName':'Region'}, inplace=True)
df_can.columns
Out[26]:
Index([  'Country', 'Continent',    'Region',   'DevName',        1980,
              1981,        1982,        1983,        1984,        1985,
              1986,        1987,        1988,        1989,        1990,
              1991,        1992,        1993,        1994,        1995,
              1996,        1997,        1998,        1999,        2000,
              2001,        2002,        2003,        2004,        2005,
              2006,        2007,        2008,        2009,        2010,
              2011,        2012,        2013],
      dtype='object')

Let's add a 'Total' column that sums up the total immigrants by country over the entire period 1980 - 2013

In [27]:
df_can['Total'] = df_can.sum(axis=1)

Let's see how many null objects we have in the dataset

In [28]:
df_can.isnull().sum()
Out[28]:
Country      816
Continent    816
Region       816
DevName      816
1980         816
1981         816
1982         816
1983         816
1984         816
1985         816
1986         816
1987         816
1988         816
1989         816
1990         816
1991         816
1992         816
1993         816
1994         816
1995         816
1996         816
1997         816
1998         816
1999         816
2000         816
2001         816
2002         816
2003         816
2004         816
2005         816
2006         816
2007         816
2008         816
2009         816
2010         816
2011         816
2012         816
2013         816
Total          0
dtype: int64

Let's view a quick summary of each column in our dataframe:

In [29]:
df_can.describe()
Out[29]:
1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 ... 2005 2006 2007 2008 2009 2010 2011 2012 2013 Total
count 197.000000 197.000000 197.000000 197.000000 197.000000 197.000000 197.000000 197.000000 197.000000 197.000000 ... 197.00000 197.000000 197.000000 197.000000 197.000000 197.000000 197.000000 197.000000 197.000000 1.013000e+03
mean 1453.167513 1306.000000 1230.203046 905.431472 896.162437 856.304569 1008.639594 1543.908629 1640.456853 1944.670051 ... 2662.35533 2554.720812 2403.583756 2510.091371 2560.101523 2849.614213 2525.360406 2618.304569 2629.654822 1.367070e+04
std 10784.524807 9449.373841 8864.905615 6503.149859 6452.570413 6155.858422 7205.630681 11045.558746 11809.252241 13939.908653 ... 19104.65330 18261.461660 17129.490442 17908.071420 18276.881231 20386.174875 18082.496369 18763.159434 18838.224387 2.219784e+05
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 ... 29.00000 26.000000 32.000000 32.000000 37.000000 42.000000 38.000000 43.000000 45.000000 0.000000e+00
50% 14.000000 10.000000 12.000000 12.000000 14.000000 17.000000 21.000000 33.000000 36.000000 45.000000 ... 211.00000 230.000000 198.000000 205.000000 217.000000 216.000000 196.000000 235.000000 217.000000 0.000000e+00
75% 266.000000 299.000000 299.000000 197.000000 207.000000 202.000000 257.000000 467.000000 476.000000 538.000000 ... 857.00000 856.000000 975.000000 975.000000 917.000000 953.000000 851.000000 799.000000 852.000000 0.000000e+00
max 143137.000000 128641.000000 121175.000000 89185.000000 88272.000000 84346.000000 99351.000000 152075.000000 161585.000000 191550.000000 ... 262242.00000 251640.000000 236753.000000 247244.000000 252170.000000 280687.000000 248748.000000 257903.000000 259021.000000 6.924212e+06

8 rows × 35 columns

Data Slicing:

Let's filter on the list of countries ('Country'):

In [30]:
df_can.Country  # returns a series
Out[30]:
0          Afghanistan
1              Albania
2              Algeria
3       American Samoa
4              Andorra
             ...
1008               NaN
1009               NaN
1010               NaN
1011               NaN
1012               NaN
Name: Country, Length: 1013, dtype: object

Some values are NaN, so let's drop them:

In [32]:
df_can.dropna(subset=["Country"], axis=0, inplace=True)
In [33]:
df_can.Country
Out[33]:
0         Afghanistan
1             Albania
2             Algeria
3      American Samoa
4             Andorra
            ...
192             Yemen
193            Zambia
194          Zimbabwe
195           Unknown
196             Total
Name: Country, Length: 197, dtype: object

Let's filter on the list of countries and the data for years: 1980 - 1985:

In [34]:
df_can[['Country', 1980, 1981, 1982, 1983, 1984, 1985]]
Out[34]:
Country 1980 1981 1982 1983 1984 1985
0 Afghanistan 16.0 39.0 39.0 47.0 71.0 340.0
1 Albania 1.0 0.0 0.0 0.0 0.0 0.0
2 Algeria 80.0 67.0 71.0 69.0 63.0 44.0
3 American Samoa 0.0 1.0 0.0 0.0 0.0 0.0
4 Andorra 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ...
192 Yemen 1.0 2.0 1.0 6.0 0.0 18.0
193 Zambia 11.0 17.0 11.0 7.0 16.0 9.0
194 Zimbabwe 72.0 114.0 102.0 44.0 32.0 29.0
195 Unknown 44000.0 18078.0 16904.0 13635.0 14855.0 14368.0
196 Total 143137.0 128641.0 121175.0 89185.0 88272.0 84346.0

197 rows × 7 columns

Notice that the default index of the dataset is a numeric range from 0 to 195. This makes it very difficult to do a query by a specific country. For example to search for data on Japan, we need to know the corresponding index value, so let's make the Country column as our index:

In [35]:
df_can.set_index('Country', inplace=True)
In [36]:
df_can.head(3)
Out[36]:
Continent Region DevName 1980 1981 1982 1983 1984 1985 1986 ... 2005 2006 2007 2008 2009 2010 2011 2012 2013 Total
Country
Afghanistan Asia Southern Asia Developing regions 16.0 39.0 39.0 47.0 71.0 340.0 496.0 ... 3436.0 3009.0 2652.0 2111.0 1746.0 1758.0 2203.0 2635.0 2004.0 58639.0
Albania Europe Southern Europe Developed regions 1.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 1223.0 856.0 702.0 560.0 716.0 561.0 539.0 620.0 603.0 15699.0
Algeria Africa Northern Africa Developing regions 80.0 67.0 71.0 69.0 63.0 44.0 69.0 ... 3626.0 4807.0 3623.0 4005.0 5393.0 4752.0 4325.0 3774.0 4331.0 69439.0

3 rows × 38 columns

Let's view the number of immigrants from Japan for the following scenarios:

1. The full row data (all columns)

In [37]:
df_can.loc['Japan']
# or you can use df_can.iloc[87]
Out[37]:
Continent                 Asia
Region            Eastern Asia
DevName      Developed regions
1980                       701
1981                       756
1982                       598
1983                       309
1984                       246
1985                       198
1986                       248
1987                       422
1988                       324
1989                       494
1990                       379
1991                       506
1992                       605
1993                       907
1994                       956
1995                       826
1996                       994
1997                       924
1998                       897
1999                      1083
2000                      1010
2001                      1092
2002                       806
2003                       817
2004                       973
2005                      1067
2006                      1212
2007                      1250
2008                      1284
2009                      1194
2010                      1168
2011                      1265
2012                      1214
2013                       982
Total                    27707
Name: Japan, dtype: object

2. For year 2013

In [40]:
df_can.loc['Japan', 2013]
# or you can use df_can.iloc[87, 36]
Out[40]:
982.0

3. For years 1980 to 1985

In [41]:
df_can.loc['Japan', [1980, 1981, 1982, 1983, 1984, 1984]]
# or you can use df_can.iloc[87, [3, 4, 5, 6, 7, 8]]
Out[41]:
1980    701
1981    756
1982    598
1983    309
1984    246
1984    246
Name: Japan, dtype: object

To avoid this ambuigity, let's convert the column names into strings: '1980' to '2013':

In [42]:
df_can.columns = list(map(str, df_can.columns))

Now, let's declare a variable that will allow us to easily call upon the full range of years:

In [44]:
years = list(map(str, range(1980, 2014)))
years
Out[44]:
['1980',
 '1981',
 '1982',
 '1983',
 '1984',
 '1985',
 '1986',
 '1987',
 '1988',
 '1989',
 '1990',
 '1991',
 '1992',
 '1993',
 '1994',
 '1995',
 '1996',
 '1997',
 '1998',
 '1999',
 '2000',
 '2001',
 '2002',
 '2003',
 '2004',
 '2005',
 '2006',
 '2007',
 '2008',
 '2009',
 '2010',
 '2011',
 '2012',
 '2013']

Filtering based on criteria:

To filter the dataframe based on a condition, we pass the condition as a boolean vector.

For example, Let's filter the dataframe to show the data on Asian countries (AreaName = Asia).

In [45]:
# 1. create the condition boolean series
condition = df_can['Continent'] == 'Asia'
print(condition)
Country
Afghanistan        True
Albania           False
Algeria           False
American Samoa    False
Andorra           False
                  ...
Yemen              True
Zambia            False
Zimbabwe          False
Unknown           False
Total             False
Name: Continent, Length: 197, dtype: bool
In [46]:
# 2. pass this condition into the dataFrame
df_can[condition]
Out[46]:
Continent Region DevName 1980 1981 1982 1983 1984 1985 1986 ... 2005 2006 2007 2008 2009 2010 2011 2012 2013 Total
Country
Afghanistan Asia Southern Asia Developing regions 16.0 39.0 39.0 47.0 71.0 340.0 496.0 ... 3436.0 3009.0 2652.0 2111.0 1746.0 1758.0 2203.0 2635.0 2004.0 58639.0
Armenia Asia Western Asia Developing regions 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 224.0 218.0 198.0 205.0 267.0 252.0 236.0 258.0 207.0 3310.0
Azerbaijan Asia Western Asia Developing regions 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 359.0 236.0 203.0 125.0 165.0 209.0 138.0 161.0 57.0 2649.0
Bahrain Asia Western Asia Developing regions 0.0 2.0 1.0 1.0 1.0 3.0 0.0 ... 12.0 12.0 22.0 9.0 35.0 28.0 21.0 39.0 32.0 475.0
Bangladesh Asia Southern Asia Developing regions 83.0 84.0 86.0 81.0 98.0 92.0 486.0 ... 4171.0 4014.0 2897.0 2939.0 2104.0 4721.0 2694.0 2640.0 3789.0 65568.0
Bhutan Asia Southern Asia Developing regions 0.0 0.0 0.0 0.0 1.0 0.0 0.0 ... 5.0 10.0 7.0 36.0 865.0 1464.0 1879.0 1075.0 487.0 5876.0
Brunei Darussalam Asia South-Eastern Asia Developing regions 79.0 6.0 8.0 2.0 2.0 4.0 12.0 ... 4.0 5.0 11.0 10.0 5.0 12.0 6.0 3.0 6.0 600.0
Cambodia Asia South-Eastern Asia Developing regions 12.0 19.0 26.0 33.0 10.0 7.0 8.0 ... 370.0 529.0 460.0 354.0 203.0 200.0 196.0 233.0 288.0 6538.0
China Asia Eastern Asia Developing regions 5123.0 6682.0 3308.0 1863.0 1527.0 1816.0 1960.0 ... 42584.0 33518.0 27642.0 30037.0 29622.0 30391.0 28502.0 33024.0 34129.0 659962.0
China, Hong Kong Special Administrative Region Asia Eastern Asia Developing regions 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 729.0 712.0 674.0 897.0 657.0 623.0 591.0 728.0 774.0 9327.0
China, Macao Special Administrative Region Asia Eastern Asia Developing regions 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 21.0 32.0 16.0 12.0 21.0 21.0 13.0 33.0 29.0 284.0
Cyprus Asia Western Asia Developing regions 132.0 128.0 84.0 46.0 46.0 43.0 48.0 ... 7.0 9.0 4.0 7.0 6.0 18.0 6.0 12.0 16.0 1126.0
Democratic People's Republic of Korea Asia Eastern Asia Developing regions 1.0 1.0 3.0 1.0 4.0 3.0 0.0 ... 14.0 10.0 7.0 19.0 11.0 45.0 97.0 66.0 17.0 388.0
Georgia Asia Western Asia Developing regions 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 114.0 125.0 132.0 112.0 128.0 126.0 139.0 147.0 125.0 2068.0
India Asia Southern Asia Developing regions 8880.0 8670.0 8147.0 7338.0 5704.0 4211.0 7150.0 ... 36210.0 33848.0 28742.0 28261.0 29456.0 34235.0 27509.0 30933.0 33087.0 691904.0
Indonesia Asia South-Eastern Asia Developing regions 186.0 178.0 252.0 115.0 123.0 100.0 127.0 ... 632.0 613.0 657.0 661.0 504.0 712.0 390.0 395.0 387.0 13150.0
Iran (Islamic Republic of) Asia Southern Asia Developing regions 1172.0 1429.0 1822.0 1592.0 1977.0 1648.0 1794.0 ... 5837.0 7480.0 6974.0 6475.0 6580.0 7477.0 7479.0 7534.0 11291.0 175923.0
Iraq Asia Western Asia Developing regions 262.0 245.0 260.0 380.0 428.0 231.0 265.0 ... 2226.0 1788.0 2406.0 3543.0 5450.0 5941.0 6196.0 4041.0 4918.0 69789.0
Israel Asia Western Asia Developing regions 1403.0 1711.0 1334.0 541.0 446.0 680.0 1212.0 ... 2446.0 2625.0 2401.0 2562.0 2316.0 2755.0 1970.0 2134.0 1945.0 66508.0
Japan Asia Eastern Asia Developed regions 701.0 756.0 598.0 309.0 246.0 198.0 248.0 ... 1067.0 1212.0 1250.0 1284.0 1194.0 1168.0 1265.0 1214.0 982.0 27707.0
Jordan Asia Western Asia Developing regions 177.0 160.0 155.0 113.0 102.0 179.0 181.0 ... 1940.0 1827.0 1421.0 1581.0 1235.0 1831.0 1635.0 1206.0 1255.0 35406.0
Kazakhstan Asia Central Asia Developing regions 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 506.0 408.0 436.0 394.0 431.0 377.0 381.0 462.0 348.0 8490.0
Kuwait Asia Western Asia Developing regions 1.0 0.0 8.0 2.0 1.0 4.0 4.0 ... 66.0 35.0 62.0 53.0 68.0 67.0 58.0 73.0 48.0 2025.0
Kyrgyzstan Asia Central Asia Developing regions 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 173.0 161.0 135.0 168.0 173.0 157.0 159.0 278.0 123.0 2353.0
Lao People's Democratic Republic Asia South-Eastern Asia Developing regions 11.0 6.0 16.0 16.0 7.0 17.0 21.0 ... 42.0 74.0 53.0 32.0 39.0 54.0 22.0 25.0 15.0 1089.0
Lebanon Asia Western Asia Developing regions 1409.0 1119.0 1159.0 789.0 1253.0 1683.0 2576.0 ... 3709.0 3802.0 3467.0 3566.0 3077.0 3432.0 3072.0 1614.0 2172.0 115359.0
Malaysia Asia South-Eastern Asia Developing regions 786.0 816.0 813.0 448.0 384.0 374.0 425.0 ... 593.0 580.0 600.0 658.0 640.0 802.0 409.0 358.0 204.0 24417.0
Maldives Asia Southern Asia Developing regions 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 2.0 1.0 7.0 4.0 3.0 1.0 1.0 30.0
Mongolia Asia Eastern Asia Developing regions 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 59.0 64.0 82.0 59.0 118.0 169.0 103.0 68.0 99.0 952.0
Myanmar Asia South-Eastern Asia Developing regions 80.0 62.0 46.0 31.0 41.0 23.0 18.0 ... 210.0 953.0 1887.0 975.0 1153.0 556.0 368.0 193.0 262.0 9245.0
Nepal Asia Southern Asia Developing regions 1.0 1.0 6.0 1.0 2.0 4.0 13.0 ... 607.0 540.0 511.0 581.0 561.0 1392.0 1129.0 1185.0 1308.0 10222.0
Oman Asia Western Asia Developing regions 0.0 0.0 0.0 8.0 0.0 0.0 0.0 ... 14.0 18.0 16.0 10.0 7.0 14.0 10.0 13.0 11.0 224.0
Pakistan Asia Southern Asia Developing regions 978.0 972.0 1201.0 900.0 668.0 514.0 691.0 ... 14314.0 13127.0 10124.0 8994.0 7217.0 6811.0 7468.0 11227.0 12603.0 241600.0
Philippines Asia South-Eastern Asia Developing regions 6051.0 5921.0 5249.0 4562.0 3801.0 3150.0 4166.0 ... 18139.0 18400.0 19837.0 24887.0 28573.0 38617.0 36765.0 34315.0 29544.0 511391.0
Qatar Asia Western Asia Developing regions 0.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 11.0 2.0 5.0 9.0 6.0 18.0 3.0 14.0 6.0 157.0
Republic of Korea Asia Eastern Asia Developing regions 1011.0 1456.0 1572.0 1081.0 847.0 962.0 1208.0 ... 5832.0 6215.0 5920.0 7294.0 5874.0 5537.0 4588.0 5316.0 4509.0 142581.0
Saudi Arabia Asia Western Asia Developing regions 0.0 0.0 1.0 4.0 1.0 2.0 5.0 ... 198.0 252.0 188.0 249.0 246.0 330.0 278.0 286.0 267.0 3425.0
Singapore Asia South-Eastern Asia Developing regions 241.0 301.0 337.0 169.0 128.0 139.0 205.0 ... 392.0 298.0 690.0 734.0 366.0 805.0 219.0 146.0 141.0 14579.0
Sri Lanka Asia Southern Asia Developing regions 185.0 371.0 290.0 197.0 1086.0 845.0 1838.0 ... 4930.0 4714.0 4123.0 4756.0 4547.0 4422.0 3309.0 3338.0 2394.0 148358.0
State of Palestine Asia Western Asia Developing regions 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 453.0 627.0 441.0 481.0 400.0 654.0 555.0 533.0 462.0 6512.0
Syrian Arab Republic Asia Western Asia Developing regions 315.0 419.0 409.0 269.0 264.0 385.0 493.0 ... 1458.0 1145.0 1056.0 919.0 917.0 1039.0 1005.0 650.0 1009.0 31485.0
Tajikistan Asia Central Asia Developing regions 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 85.0 46.0 44.0 15.0 50.0 52.0 47.0 34.0 39.0 503.0
Thailand Asia South-Eastern Asia Developing regions 56.0 53.0 113.0 65.0 82.0 66.0 78.0 ... 575.0 500.0 487.0 519.0 512.0 499.0 396.0 296.0 400.0 9174.0
Turkey Asia Western Asia Developing regions 481.0 874.0 706.0 280.0 338.0 202.0 257.0 ... 2065.0 1638.0 1463.0 1122.0 1238.0 1492.0 1257.0 1068.0 729.0 31781.0
Turkmenistan Asia Central Asia Developing regions 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 40.0 26.0 37.0 13.0 20.0 30.0 20.0 20.0 14.0 310.0
United Arab Emirates Asia Western Asia Developing regions 0.0 2.0 2.0 1.0 2.0 0.0 5.0 ... 31.0 42.0 37.0 33.0 37.0 86.0 60.0 54.0 46.0 836.0
Uzbekistan Asia Central Asia Developing regions 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 330.0 262.0 284.0 215.0 288.0 289.0 162.0 235.0 167.0 3368.0
Viet Nam Asia South-Eastern Asia Developing regions 1191.0 1829.0 2162.0 3404.0 7583.0 5907.0 2741.0 ... 1852.0 3153.0 2574.0 1784.0 2171.0 1942.0 1723.0 1731.0 2112.0 97146.0
Yemen Asia Western Asia Developing regions 1.0 2.0 1.0 6.0 0.0 18.0 7.0 ... 161.0 140.0 122.0 133.0 128.0 211.0 160.0 174.0 217.0 2985.0

49 rows × 38 columns

We can pass multiple criteria in the same line

Let's filter for AreaNAme = Asia and RegName = Southern Asia

In [47]:
df_can[(df_can['Continent']=='Asia') & (df_can['Region']=='Southern Asia')]
Out[47]:
Continent Region DevName 1980 1981 1982 1983 1984 1985 1986 ... 2005 2006 2007 2008 2009 2010 2011 2012 2013 Total
Country
Afghanistan Asia Southern Asia Developing regions 16.0 39.0 39.0 47.0 71.0 340.0 496.0 ... 3436.0 3009.0 2652.0 2111.0 1746.0 1758.0 2203.0 2635.0 2004.0 58639.0
Bangladesh Asia Southern Asia Developing regions 83.0 84.0 86.0 81.0 98.0 92.0 486.0 ... 4171.0 4014.0 2897.0 2939.0 2104.0 4721.0 2694.0 2640.0 3789.0 65568.0
Bhutan Asia Southern Asia Developing regions 0.0 0.0 0.0 0.0 1.0 0.0 0.0 ... 5.0 10.0 7.0 36.0 865.0 1464.0 1879.0 1075.0 487.0 5876.0
India Asia Southern Asia Developing regions 8880.0 8670.0 8147.0 7338.0 5704.0 4211.0 7150.0 ... 36210.0 33848.0 28742.0 28261.0 29456.0 34235.0 27509.0 30933.0 33087.0 691904.0
Iran (Islamic Republic of) Asia Southern Asia Developing regions 1172.0 1429.0 1822.0 1592.0 1977.0 1648.0 1794.0 ... 5837.0 7480.0 6974.0 6475.0 6580.0 7477.0 7479.0 7534.0 11291.0 175923.0
Maldives Asia Southern Asia Developing regions 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 2.0 1.0 7.0 4.0 3.0 1.0 1.0 30.0
Nepal Asia Southern Asia Developing regions 1.0 1.0 6.0 1.0 2.0 4.0 13.0 ... 607.0 540.0 511.0 581.0 561.0 1392.0 1129.0 1185.0 1308.0 10222.0
Pakistan Asia Southern Asia Developing regions 978.0 972.0 1201.0 900.0 668.0 514.0 691.0 ... 14314.0 13127.0 10124.0 8994.0 7217.0 6811.0 7468.0 11227.0 12603.0 241600.0
Sri Lanka Asia Southern Asia Developing regions 185.0 371.0 290.0 197.0 1086.0 845.0 1838.0 ... 4930.0 4714.0 4123.0 4756.0 4547.0 4422.0 3309.0 3338.0 2394.0 148358.0

9 rows × 38 columns

Let's review the changes we have made to our dataframe.

In [48]:
print('data dimensions:', df_can.shape)
print(df_can.columns)
df_can.head(2)
data dimensions: (197, 38)
Index(['Continent', 'Region', 'DevName', '1980', '1981', '1982', '1983',
       '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992',
       '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001',
       '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010',
       '2011', '2012', '2013', 'Total'],
      dtype='object')
Out[48]:
Continent Region DevName 1980 1981 1982 1983 1984 1985 1986 ... 2005 2006 2007 2008 2009 2010 2011 2012 2013 Total
Country
Afghanistan Asia Southern Asia Developing regions 16.0 39.0 39.0 47.0 71.0 340.0 496.0 ... 3436.0 3009.0 2652.0 2111.0 1746.0 1758.0 2203.0 2635.0 2004.0 58639.0
Albania Europe Southern Europe Developed regions 1.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 1223.0 856.0 702.0 560.0 716.0 561.0 539.0 620.0 603.0 15699.0

2 rows × 38 columns

Data Visualization:

Let's prepare the enviroment:

In [49]:
# we are using the inline backend
%matplotlib inline

import matplotlib as mpl
import matplotlib.pyplot as plt

Now let's check if Matplotlib is loaded

In [50]:
print('Matplotlib version: ', mpl.__version__)  # >= 2.0.0
Matplotlib version:  3.4.3

Let's apply style:

In [51]:
print(plt.style.available)
mpl.style.use(['ggplot']) # optional: for ggplot-like style
['Solarize_Light2', '_classic_test_patch', 'bmh', 'classic', 'dark_background', 'fast', 'fivethirtyeight', 'ggplot', 'grayscale', 'seaborn', 'seaborn-bright', 'seaborn-colorblind', 'seaborn-dark', 'seaborn-dark-palette', 'seaborn-darkgrid', 'seaborn-deep', 'seaborn-muted', 'seaborn-notebook', 'seaborn-paper', 'seaborn-pastel', 'seaborn-poster', 'seaborn-talk', 'seaborn-ticks', 'seaborn-white', 'seaborn-whitegrid', 'tableau-colorblind10']

Line Pots (Series/Dataframe)

Let's plot a line graph of immigration from Haiti

In [52]:
# First, we will extract the data series for Haiti:
haiti = df_can.loc['Haiti', years] # passing in years 1980 - 2013 to exclude the 'total' column
haiti.head()
Out[52]:
1980    1666
1981    3692
1982    3498
1983    2860
1984    1418
Name: Haiti, dtype: object
In [53]:
haiti.plot()
Out[53]:
<AxesSubplot:>

Let's change the index values of Haiti to type integer for plotting, and label the x and y axes

In [54]:
haiti.index = haiti.index.map(int) # let's change the index values of Haiti to type integer for plotting
haiti.plot(kind='line')

plt.title('Immigration from Haiti')
plt.ylabel('Number of immigrants')
plt.xlabel('Years')

plt.show() # need this line to show the updates made to the figure

Let's annotate this spike in the plot "year 2010" (it was due to an earthquake in Haiti)

In [55]:
haiti.plot(kind='line')

plt.title('Immigration from Haiti')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')

# annotate the 2010 Earthquake. 
# syntax: plt.text(x, y, label)
plt.text(2000, 6000, '2010 Earthquake')

plt.show()

Let's compare the number of immigrants from India and China from 1980 to 2013:

In [56]:
df_CI = df_can.loc[['India', 'China'], years]
df_CI
Out[56]:
1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 ... 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
Country
India 8880.0 8670.0 8147.0 7338.0 5704.0 4211.0 7150.0 10189.0 11522.0 10343.0 ... 28235.0 36210.0 33848.0 28742.0 28261.0 29456.0 34235.0 27509.0 30933.0 33087.0
China 5123.0 6682.0 3308.0 1863.0 1527.0 1816.0 1960.0 2643.0 2758.0 4323.0 ... 36619.0 42584.0 33518.0 27642.0 30037.0 29622.0 30391.0 28502.0 33024.0 34129.0

2 rows × 34 columns

In [57]:
# plotting a graph
df_CI.plot(kind='line')
Out[57]:
<AxesSubplot:xlabel='Country'>

Not working, we need to transpose the x and y axis:

In [58]:
df_CI = df_CI.transpose()
df_CI.head()
Out[58]:
Country India China
1980 8880.0 5123.0
1981 8670.0 6682.0
1982 8147.0 3308.0
1983 7338.0 1863.0
1984 5704.0 1527.0
In [59]:
# Now, let's try again
df_CI.index = df_CI.index.map(int) # let's change the index values of df_CI to type integer for plotting
df_CI.plot(kind='line')

plt.title('Immigrants from China and India')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')

plt.show()

Let's compare the trend of top 5 countries that contributed the most to immigration to Canada.

1.Get the dataset. Recall that we created a Total column that calculates cumulative immigration by country, we will sort on this column to get our top 5 countries using pandas sort_values() method.

In [ ]:
inplace = True # paramemter saves the changes to the original df_can dataframe

df_can.sort_values(by='Total', ascending=False, axis=0, inplace=True)

# get the top 5 entries
df_top5 = df_can.head(5)

# transpose the dataframe
df_top5 = df_top5[years].transpose()

print(df_top5)

2. Plot the dataframe. To make the plot more readeable, we will change the size using the figsize parameter.

In [61]:
df_top5.index = df_top5.index.map(int) # let's change the index values of df_top5 to type integer for plotting
df_top5.plot(kind='line', figsize=(14, 8)) # pass a tuple (x, y) size

plt.title('Immigration Trend of Top 5 Countries')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')

plt.show()