Taller de manejo de Datos en Pandas#

¿Qué es?#

Pandas es una librería de código abierto muy popular dentro del ámbito de Data Science y Machine Learning puesto que facilita el tratamiento y manipulación de datos.

Pandas nació de la necesidad de tener todas las funcionalidades necesarias para un analista de datos en una sola librería:

import pandas as pd

¿Cómo importar datos?#

Para leer los datos la estructura básica es pd.read_tipo-archivo. Si estamos en colab y queremos utilizar algunos datos que están en nuestro google drive podemos utilizar:

      from google.colab import drive
      drive.mount('/content/gdrive')

Primeros pasos Pandas

#from google.colab import ##drive correr solo en colab
#drive.mount('/content/gdrive')
counties = pd.read_excel("https://github.com/AprendizajeProfundo/Libro_Fundamentos_Programacion/blob/main/Python/Datos/counties.xlsx?raw=true")#cambiar dirección

Para tener una idea de qué variables y datos tengo en lo base de datos, sólo debemos llamarla.

counties
codestate codecounty county population area
0 1 1001 Auta#%&()uga 54571.0 594.436000
1 1 1003 Baldwin#%&() ? 182265.0 1589.784000
2 1 1005 Barbour 27457.0 884.876000
3 1 1007 Bi#%&()bb 22915.0 622.582000
4 1 1009 Blount ? 57322.0 644.776000
... ... ... ... ... ...
3229 72 72151 Yabucoa 37941.0 55.215000
3230 72 72153 Ya_uco 42043.0 68.192000
3231 78 78010 ; St. Croix ? 50601.0 83.345868
3232 78 78020 St. John 4170.0 19.689867
3233 78 78030 St. Thomas 51634.0 31.313503

3234 rows × 5 columns

En lo anterior veiamos todas las variables con las 5 primeras y 5 últimas filas. Pero, podemos ver una cantidad determinada de registros tanto del inicio de la tabla como del final. Utlizando .head() y .tail() respectivamente.

counties.head(3) #por defecto salen 5 primeros
codestate codecounty county population area
0 1 1001 Auta#%&()uga 54571.0 594.436
1 1 1003 Baldwin#%&() ? 182265.0 1589.784
2 1 1005 Barbour 27457.0 884.876
counties.tail(2)
codestate codecounty county population area
3232 78 78020 St. John 4170.0 19.689867
3233 78 78030 St. Thomas 51634.0 31.313503

Si aparte de ver solamente la tabla, queremos más información de la base de datos, por ejemplo sus dimensiones y tipo de variables podemos utilizar lo siguiente:

counties.info()
print("\n")
print("La forma de la base de datos es:",counties.shape)  #shape:forma de un array
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3234 entries, 0 to 3233
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   codestate   3234 non-null   int64  
 1   codecounty  3234 non-null   int64  
 2   county      3234 non-null   object 
 3   population  3232 non-null   float64
 4   area        3234 non-null   float64
dtypes: float64(2), int64(2), object(1)
memory usage: 126.5+ KB


La forma de la base de datos es: (3234, 5)

dtypes

print(counties.dtypes) #tipo de objeto en cada columna
print("\n")
print(counties.describe())
print("\n")
print(counties.describe(include="all")) #summary, si no se pone el include solo aparece de las variables numéricas
codestate       int64
codecounty      int64
county         object
population    float64
area          float64
dtype: object


         codestate    codecounty    population           area
count  3234.000000   3234.000000  3.232000e+03    3234.000000
mean     31.441868  31544.737786  9.679656e+04    1093.361817
std      16.411236  16425.545223  3.088044e+05    3564.706999
min       1.000000   1001.000000  1.700000e+01       0.031696
25%      19.000000  19039.500000  1.129700e+04     416.360000
50%      30.000000  30038.000000  2.607550e+04     602.977500
75%      46.000000  46128.500000  6.566050e+04     913.884500
max      78.000000  78030.000000  9.818605e+06  145504.789000


          codestate    codecounty    county    population           area
count   3234.000000   3234.000000      3234  3.232000e+03    3234.000000
unique          NaN           NaN      2625           NaN            NaN
top             NaN           NaN  Lincoln            NaN            NaN
freq            NaN           NaN        14           NaN            NaN
mean      31.441868  31544.737786       NaN  9.679656e+04    1093.361817
std       16.411236  16425.545223       NaN  3.088044e+05    3564.706999
min        1.000000   1001.000000       NaN  1.700000e+01       0.031696
25%       19.000000  19039.500000       NaN  1.129700e+04     416.360000
50%       30.000000  30038.000000       NaN  2.607550e+04     602.977500
75%       46.000000  46128.500000       NaN  6.566050e+04     913.884500
max       78.000000  78030.000000       NaN  9.818605e+06  145504.789000

Seleccionar subconjuntos de una base de datos#

Documentación

Escoger columnas#

Para ver el nombre de las columnas que existen, podemos utilizar .columns.

counties.columns
Index(['codestate', 'codecounty', 'county', 'population', 'area'], dtype='object')

Escoger una columna específica o un conjunto de ellas

condado = counties["county"]
condado
0               Auta#%&()uga 
1        Baldwin#%&()  ?     
2                    Barbour 
3                  Bi#%&()bb 
4              Blount  ?     
                ...          
3229                 Yabucoa 
3230                  Ya_uco 
3231       ; St. Croix ?     
3232                 St. John
3233               St. Thomas
Name: county, Length: 3234, dtype: object
print(condado.shape) #array unidimensional con 3234 elementos
print(counties["county"].shape)
print(type(condado))
(3234,)
(3234,)
<class 'pandas.core.series.Series'>
codi_condado=counties[["codecounty","county"]]
print(codi_condado)
print(type(codi_condado))
      codecounty                 county
0           1001          Auta#%&()uga 
1           1003   Baldwin#%&()  ?     
2           1005               Barbour 
3           1007             Bi#%&()bb 
4           1009         Blount  ?     
...          ...                    ...
3229       72151               Yabucoa 
3230       72153                Ya_uco 
3231       78010     ; St. Croix ?     
3232       78020               St. John
3233       78030             St. Thomas

[3234 rows x 2 columns]
<class 'pandas.core.frame.DataFrame'>

Escoger Filas#

Podemos escoger ciertas filas a través de su posición. utilizando .iloc

from seaborn import load_dataset
print(counties.iloc[0:1,])
print(type(counties.iloc[0:1,]))
   codestate  codecounty         county  population     area
0          1        1001  Auta#%&()uga      54571.0  594.436
<class 'pandas.core.frame.DataFrame'>
print(counties.iloc[8:12,1:3])
    codecounty                 county
8         1017     ; Chambers  ?     
9         1019              Cherokee 
10        1021             ; Chilton 
11        1023               Choctaw 

O también escoger filas con base a alguna condicion o característica específica.

counties.tail()
codestate codecounty county population area
3229 72 72151 Yabucoa 37941.0 55.215000
3230 72 72153 Ya_uco 42043.0 68.192000
3231 78 78010 ; St. Croix ? 50601.0 83.345868
3232 78 78020 St. John 4170.0 19.689867
3233 78 78030 St. Thomas 51634.0 31.313503
counties["codestate"] > 72 #Verdadero o falso si cumple la condición
0       False
1       False
2       False
3       False
4       False
        ...  
3229    False
3230    False
3231     True
3232     True
3233     True
Name: codestate, Length: 3234, dtype: bool
codsat72 = counties[counties["codestate"] > 72] # Llama aquellos cuyo valor es verdadero
codsat72.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 3231 to 3233
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   codestate   3 non-null      int64  
 1   codecounty  3 non-null      int64  
 2   county      3 non-null      object 
 3   population  3 non-null      float64
 4   area        3 non-null      float64
dtypes: float64(2), int64(2), object(1)
memory usage: 144.0+ bytes

Filas que no tengan valores faltantes en determinada columna

no_na = counties[counties["population"].notna()]

print(no_na.shape)
print(counties.shape)
(3232, 5)
(3234, 5)

Filas con valores específicos#

Existen varias maneras de realizar estas búsquedas, algunas de ellas son utilizando isin, loc de la librería seaborn y or.

codcou1 = counties[counties["codecounty"].isin([78010, 72151])]
codcou1
codestate codecounty county population area
3229 72 72151 Yabucoa 37941.0 55.215000
3231 78 78010 ; St. Croix ? 50601.0 83.345868
codcou2 = counties.loc[:, 'codecounty'] == 78010
codcou2_ = counties.loc[codcou2]
codcou2_
codestate codecounty county population area
3231 78 78010 ; St. Croix ? 50601.0 83.345868
counties[(counties["codecounty"] == 78010) | (counties["codecounty"] == 72151)]
codestate codecounty county population area
3229 72 72151 Yabucoa 37941.0 55.215000
3231 78 78010 ; St. Croix ? 50601.0 83.345868

Filas con más de una característica específica.

counties[(counties["codestate"] == 72) & (counties["area"] >= 60)].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16 entries, 3153 to 3230
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   codestate   16 non-null     int64  
 1   codecounty  16 non-null     int64  
 2   county      16 non-null     object 
 3   population  16 non-null     float64
 4   area        16 non-null     float64
dtypes: float64(2), int64(2), object(1)
memory usage: 768.0+ bytes

Eliminar valores omitidos#

#Eliminar valores omitidos: dropna
# axis= 0 -> Eliminar fila completa
# axis=1 -> Eliminar columna completa
counties.dropna(subset=["population"], axis=0, inplace=False) #inplace:modifique directamente la BD
codestate codecounty county population area
0 1 1001 Auta#%&()uga 54571.0 594.436000
1 1 1003 Baldwin#%&() ? 182265.0 1589.784000
2 1 1005 Barbour 27457.0 884.876000
3 1 1007 Bi#%&()bb 22915.0 622.582000
4 1 1009 Blount ? 57322.0 644.776000
... ... ... ... ... ...
3229 72 72151 Yabucoa 37941.0 55.215000
3230 72 72153 Ya_uco 42043.0 68.192000
3231 78 78010 ; St. Croix ? 50601.0 83.345868
3232 78 78020 St. John 4170.0 19.689867
3233 78 78030 St. Thomas 51634.0 31.313503

3232 rows × 5 columns

Manipulación de datos textuales (Limpieza)#

Para realizarle limpieza a columna county de la base de datos counties, podemos utilizar lo siguiente:

  • **lower:**Poner en minúscula todo el texto.

  • **replace:**Remplazar ciertos valores por otros.

  • **strip:**Eliminar los espacios al principio y al final de la cadena.

  • **title:**Poner primera letra de cada palabra en mayúscula.

counties["county"] = (counties["county"]
                            .str.lower()
                            .str.replace("[^a-záéíóúüñ ]","")
                            .str.replace(" +"," ")
                            .str.strip()
                            .str.title()
                           ) 
/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/ipykernel_launcher.py:3: FutureWarning: The default value of regex will change from True to False in a future version.
  This is separate from the ipykernel package so we can avoid doing imports until
/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/ipykernel_launcher.py:4: FutureWarning: The default value of regex will change from True to False in a future version.
  after removing the cwd from sys.path.
print(counties)
      codestate  codecounty     county  population         area
0             1        1001    Autauga     54571.0   594.436000
1             1        1003    Baldwin    182265.0  1589.784000
2             1        1005    Barbour     27457.0   884.876000
3             1        1007       Bibb     22915.0   622.582000
4             1        1009     Blount     57322.0   644.776000
...         ...         ...        ...         ...          ...
3229         72       72151    Yabucoa     37941.0    55.215000
3230         72       72153      Yauco     42043.0    68.192000
3231         78       78010   St Croix     50601.0    83.345868
3232         78       78020    St John      4170.0    19.689867
3233         78       78030  St Thomas     51634.0    31.313503

[3234 rows x 5 columns]

Creación de columnas a partir de otras#

counties["densidad"]= counties["population"]/counties["area"]
print(counties[["densidad","population","area"]])
         densidad  population         area
0       91.802986     54571.0   594.436000
1      114.647650    182265.0  1589.784000
2       31.029206     27457.0   884.876000
3       36.806397     22915.0   622.582000
4       88.902192     57322.0   644.776000
...           ...         ...          ...
3229   687.150231     37941.0    55.215000
3230   616.538597     42043.0    68.192000
3231   607.120676     50601.0    83.345868
3232   211.784062      4170.0    19.689867
3233  1648.937201     51634.0    31.313503

[3234 rows x 3 columns]

Renombrar columnas#

print(counties.columns)
counties.rename(columns={"densidad":"densidad_pob"},inplace=True) #renombrar columna
print(counties.columns) #Verificación
Index(['codestate', 'codecounty', 'county', 'population', 'area', 'densidad'], dtype='object')
Index(['codestate', 'codecounty', 'county', 'population', 'area',
       'densidad_pob'],
      dtype='object')

Modificar tipo de dato de una columna#

print(counties.info())
print("\n")
counties["codestate"]=counties["codestate"].astype("float")
print(counties.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3234 entries, 0 to 3233
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   codestate     3234 non-null   int64  
 1   codecounty    3234 non-null   int64  
 2   county        3234 non-null   object 
 3   population    3232 non-null   float64
 4   area          3234 non-null   float64
 5   densidad_pob  3232 non-null   float64
dtypes: float64(3), int64(2), object(1)
memory usage: 151.7+ KB
None


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3234 entries, 0 to 3233
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   codestate     3234 non-null   float64
 1   codecounty    3234 non-null   int64  
 2   county        3234 non-null   object 
 3   population    3232 non-null   float64
 4   area          3234 non-null   float64
 5   densidad_pob  3232 non-null   float64
dtypes: float64(4), int64(1), object(1)
memory usage: 151.7+ KB
None

Eliminar filas y columnas#

Eliminación filas, puede ser por posición o que cumpla una característica.

print(counties.shape)
counties = counties.drop(counties.iloc[0:3,].index)
counties = counties.drop(counties[counties['codecounty']==70].index)
print(counties.shape)
(3234, 6)
(3231, 6)
counties.drop(['densidad_pob'], axis=1,inplace=True) #inplace=True, para que me modifique la BD
counties.columns
Index(['codestate', 'codecounty', 'county', 'population', 'area'], dtype='object')

Conteos#

La función groupby devuelve un objeto DataFrameGroupBy que contiene la información agrupada según las especificaciones que indiquemos.

group=counties.groupby("codestate")
print(type(group))
<class 'pandas.core.groupby.generic.DataFrameGroupBy'>

Podemos obtener detalles del anterior dataframe

print(group.first().head(10)) # Nos muestra el primer elemento de cada grupo según "codestate"
print('\n')
print(group.get_group(78)) #Nos muestra los elementos cuyo codestate es 78
           codecounty                county  population       area
codestate                                                         
1.0              1007                  Bibb     22915.0    622.582
2.0              2013        Aleutians East      3141.0   6981.943
4.0              4001                Apache     71518.0  11197.521
5.0              5001              Arkansas     19019.0    988.770
6.0              6001               Alameda   1510271.0    739.017
8.0              8001                 Adams    441603.0   1167.653
9.0              9001             Fairfield    916829.0    624.890
10.0            10001                  Kent    162310.0    586.179
11.0            11001  District Of Columbia    601723.0     61.048
12.0            12001               Alachua    247336.0    875.020


      codestate  codecounty     county  population       area
3231       78.0       78010   St Croix     50601.0  83.345868
3232       78.0       78020    St John      4170.0  19.689867
3233       78.0       78030  St Thomas     51634.0  31.313503
counties.groupby("codestate")["codecounty"].count().head(10)
codestate
1.0     64
2.0     29
4.0     15
5.0     75
6.0     58
8.0     64
9.0      8
10.0     3
11.0     1
12.0    67
Name: codecounty, dtype: int64

Modificar Tablas#

es posible hacer cambios de orden y estructura en las tablas de pandas.

folder_path="https://github.com/AprendizajeProfundo/Libro_Fundamentos_Programacion/blob/main/Python/Datos/"
elections=pd.read_excel(folder_path+'elections.xlsx?raw=true')
elections.head(5)
year democrat republic other codecounty
0 2000 4942 11993 273 1001
1 2000 13997 40872 1611 1003
2 2000 5188 5096 111 1005
3 2000 2710 4273 118 1007
4 2000 4977 12667 329 1009

organicemos los datos descendentemente depende de la cantidad de votos democratas

elections.sort_values(["democrat"], ascending=False)
year democrat republic other codecounty
12626 2016 2464364 769743 200201 6037
6400 2008 2295853 956425 65970 6037
9513 2012 2216903 885333 78831 6037
3287 2004 1907736 1076225 37686 6037
175 2000 1710505 871930 112719 6037
... ... ... ... ... ...
11981 2012 9 54 1 48301
8852 2008 8 151 4 48269
11965 2012 5 139 1 48269
15078 2016 5 149 5 48269
15094 2016 4 58 3 48301

15563 rows × 5 columns

usando groupby() podemos juntar los datos por año y sumarlos para tener el total de votos por año

elections.groupby('year')[['democrat', 'republic']].sum()
democrat republic
year
2000 50922964 50290498
2004 58614073 61603653
2008 69204582 59703000
2012 65508961 60460609
2016 65626398 62790289

Podemos aplicar ambos métodos en un mismo código

elections_sort=(elections.groupby('year')[['democrat', 'republic']]
                .sum()
                .sort_values('democrat', ascending=False))
display(elections_sort)
democrat republic
year
2008 69204582 59703000
2016 65626398 62790289
2012 65508961 60460609
2004 58614073 61603653
2000 50922964 50290498

Para reestructurar los datos usamos la función pivot().

pivot_elections=elections.pivot(index='codecounty',
                columns='year',
                values=['democrat','republic'])
pivot_elections.head()
democrat republic
year 2000 2004 2008 2012 2016 2000 2004 2008 2012 2016
codecounty
1001 4942.0 4758.0 6093.0 6363.0 5936.0 11993.0 15196.0 17403.0 17379.0 18172.0
1003 13997.0 15599.0 19386.0 18424.0 18458.0 40872.0 52971.0 61271.0 66016.0 72883.0
1005 5188.0 4832.0 5697.0 5912.0 4871.0 5096.0 5899.0 5866.0 5550.0 5454.0
1007 2710.0 2089.0 2299.0 2202.0 1874.0 4273.0 5472.0 6262.0 6132.0 6738.0
1009 4977.0 3938.0 3522.0 2970.0 2156.0 12667.0 17386.0 20389.0 20757.0 22859.0

Múltiples índices#

Se pueden elegir múltiples variables como índice del dataframe. Esto es util para facilitar la extracción de información en ciertos casos

counties_multi=pd.read_excel(folder_path+"counties.xlsx?raw=true", index_col=[0, 1])
display(counties_multi)
county population area
codestate codecounty
1 1001 Auta#%&()uga 54571.0 594.436000
1003 Baldwin#%&() ? 182265.0 1589.784000
1005 Barbour 27457.0 884.876000
1007 Bi#%&()bb 22915.0 622.582000
1009 Blount ? 57322.0 644.776000
... ... ... ... ...
72 72151 Yabucoa 37941.0 55.215000
72153 Ya_uco 42043.0 68.192000
78 78010 ; St. Croix ? 50601.0 83.345868
78020 St. John 4170.0 19.689867
78030 St. Thomas 51634.0 31.313503

3234 rows × 3 columns

Podemos obtener la suma de población de cada estado

counties_multi['population'].sum(level='codestate').head(5)
/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/ipykernel_launcher.py:1: FutureWarning: Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead. df.sum(level=1) should use df.groupby(level=1).sum().
  """Entry point for launching an IPython kernel.
codestate
1     4779736.0
2      710231.0
4     6392017.0
5     2915918.0
6    37253956.0
Name: population, dtype: float64

Concatenar y unir#

Es posible unir varias tablas tanto vertical como horizontalmente.

En ambos casos, podemos usar concat(), y se juntarán las bases con bases a los nombres de columnas o los indices de las filas

elections_2000=elections[elections['year']==2000]
display(elections_2000) #display:ejecuta el método dunder apropiado para obtener los datos apropiados para mostrar
year democrat republic other codecounty
0 2000 4942 11993 273 1001
1 2000 13997 40872 1611 1003
2 2000 5188 5096 111 1005
3 2000 2710 4273 118 1007
4 2000 4977 12667 329 1009
... ... ... ... ... ...
3107 2000 5521 9425 745 56037
3108 2000 4019 5454 958 56039
3109 2000 1650 5469 299 56041
3110 2000 806 3138 107 56043
3111 2000 449 2521 90 56045

3112 rows × 5 columns

elections_2004=elections[elections['year']==2004]
elections_2004['dummy']=0
display(elections_2004)
/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
year democrat republic other codecounty dummy
3112 2004 4758 15196 127 1001 0
3113 2004 15599 52971 750 1003 0
3114 2004 4832 5899 46 1005 0
3115 2004 2089 5472 39 1007 0
3116 2004 3938 17386 180 1009 0
... ... ... ... ... ... ...
6220 2004 5208 10653 411 56037 0
6221 2004 5972 5124 263 56039 0
6222 2004 1815 6081 185 56041 0
6223 2004 855 3200 59 56043 0
6224 2004 580 2739 73 56045 0

3113 rows × 6 columns

Si se hace la concatenación sin más, se tomarán todas las columnas y se agregarán NaN

elections_00_04=pd.concat([elections_2000, elections_2004])
display(elections_00_04)
year democrat republic other codecounty dummy
0 2000 4942 11993 273 1001 NaN
1 2000 13997 40872 1611 1003 NaN
2 2000 5188 5096 111 1005 NaN
3 2000 2710 4273 118 1007 NaN
4 2000 4977 12667 329 1009 NaN
... ... ... ... ... ... ...
6220 2004 5208 10653 411 56037 0.0
6221 2004 5972 5124 263 56039 0.0
6222 2004 1815 6081 185 56041 0.0
6223 2004 855 3200 59 56043 0.0
6224 2004 580 2739 73 56045 0.0

6225 rows × 6 columns

Ahora, usando merge(), podemos hacer uniones de las tablas de forma horizontal que compartan una columna/indice en común

Tipos de Joins

display(counties)
codestate codecounty county population area
3 1.0 1007 Bibb 22915.0 622.582000
4 1.0 1009 Blount 57322.0 644.776000
5 1.0 1011 Bullock 10914.0 622.805000
6 1.0 1013 Butler 20947.0 776.829000
7 1.0 1015 Calhoun 118572.0 605.868000
... ... ... ... ... ...
3229 72.0 72151 Yabucoa 37941.0 55.215000
3230 72.0 72153 Yauco 42043.0 68.192000
3231 78.0 78010 St Croix 50601.0 83.345868
3232 78.0 78020 St John 4170.0 19.689867
3233 78.0 78030 St Thomas 51634.0 31.313503

3231 rows × 5 columns

inner_joined=pd.merge(elections, counties)
inner_joined.head(10)
year democrat republic other codecounty codestate county population area
0 2000 2710 4273 118 1007 1.0 Bibb 22915.0 622.582
1 2004 2089 5472 39 1007 1.0 Bibb 22915.0 622.582
2 2008 2299 6262 83 1007 1.0 Bibb 22915.0 622.582
3 2012 2202 6132 86 1007 1.0 Bibb 22915.0 622.582
4 2016 1874 6738 207 1007 1.0 Bibb 22915.0 622.582
5 2000 4977 12667 329 1009 1.0 Blount 57322.0 644.776
6 2004 3938 17386 180 1009 1.0 Blount 57322.0 644.776
7 2008 3522 20389 356 1009 1.0 Blount 57322.0 644.776
8 2012 2970 20757 279 1009 1.0 Blount 57322.0 644.776
9 2016 2156 22859 573 1009 1.0 Blount 57322.0 644.776
outer_joined=pd.merge(elections, counties, how='outer')
outer_joined.head(10)
year democrat republic other codecounty codestate county population area
0 2000.0 4942.0 11993.0 273.0 1001 NaN NaN NaN NaN
1 2004.0 4758.0 15196.0 127.0 1001 NaN NaN NaN NaN
2 2008.0 6093.0 17403.0 145.0 1001 NaN NaN NaN NaN
3 2012.0 6363.0 17379.0 190.0 1001 NaN NaN NaN NaN
4 2016.0 5936.0 18172.0 865.0 1001 NaN NaN NaN NaN
5 2000.0 13997.0 40872.0 1611.0 1003 NaN NaN NaN NaN
6 2004.0 15599.0 52971.0 750.0 1003 NaN NaN NaN NaN
7 2008.0 19386.0 61271.0 756.0 1003 NaN NaN NaN NaN
8 2012.0 18424.0 66016.0 898.0 1003 NaN NaN NaN NaN
9 2016.0 18458.0 72883.0 3874.0 1003 NaN NaN NaN NaN
left_joined=pd.merge(elections, counties, how='left')
left_joined.head(10)
year democrat republic other codecounty codestate county population area
0 2000 4942 11993 273 1001 NaN NaN NaN NaN
1 2000 13997 40872 1611 1003 NaN NaN NaN NaN
2 2000 5188 5096 111 1005 NaN NaN NaN NaN
3 2000 2710 4273 118 1007 1.0 Bibb 22915.0 622.582
4 2000 4977 12667 329 1009 1.0 Blount 57322.0 644.776
5 2000 3395 1433 76 1011 1.0 Bullock 10914.0 622.805
6 2000 3606 4127 70 1013 1.0 Butler 20947.0 776.829
7 2000 15781 22306 822 1015 1.0 Calhoun 118572.0 605.868
8 2000 5616 6037 181 1017 1.0 Chambers 34215.0 596.531
9 2000 3497 4154 172 1019 1.0 Cherokee 25989.0 553.700
right_joined=pd.merge(elections, counties, how='right')
right_joined.head(10)
year democrat republic other codecounty codestate county population area
0 2000.0 2710.0 4273.0 118.0 1007 1.0 Bibb 22915.0 622.582
1 2004.0 2089.0 5472.0 39.0 1007 1.0 Bibb 22915.0 622.582
2 2008.0 2299.0 6262.0 83.0 1007 1.0 Bibb 22915.0 622.582
3 2012.0 2202.0 6132.0 86.0 1007 1.0 Bibb 22915.0 622.582
4 2016.0 1874.0 6738.0 207.0 1007 1.0 Bibb 22915.0 622.582
5 2000.0 4977.0 12667.0 329.0 1009 1.0 Blount 57322.0 644.776
6 2004.0 3938.0 17386.0 180.0 1009 1.0 Blount 57322.0 644.776
7 2008.0 3522.0 20389.0 356.0 1009 1.0 Blount 57322.0 644.776
8 2012.0 2970.0 20757.0 279.0 1009 1.0 Blount 57322.0 644.776
9 2016.0 2156.0 22859.0 573.0 1009 1.0 Blount 57322.0 644.776

Series de tiempo y time stamps#

Existe un tipo de datos llamado “timestamp” que se usa para medir registros de tiempo. Es posible pasar strings a este tipo de dato en diferentes formatos y trabajar con este usándolo como indice

airrpm=pd.read_csv(folder_path+'airrpm.txt?raw=true', header=None, delimiter= '\s+', decimal=",")
airrpm.columns=["Time", "R1", "R2", "R3"]
airrpm.head()
Time R1 R2 R3
0 Jan-79 26.64 15.50 11.15
1 Feb-79 27.20 16.58 10.62
2 Mar-79 27.87 18.85 9.02
3 Apr-79 23.22 17.23 5.99
4 May-79 23.27 16.04 7.23

La variable “Time” en este momento es un string. Pero podemos transformarlo usando to_datetime(). Para obtener más información de qué se podría poner en format, pueden revisar este link

airrpm['Time']=pd.to_datetime(airrpm['Time'], format='%b-%y')
airrpm.index=airrpm['Time']
airrpm=airrpm.drop(['Time'], axis=1)
airrpm.head()
R1 R2 R3
Time
1979-01-01 26.64 15.50 11.15
1979-02-01 27.20 16.58 10.62
1979-03-01 27.87 18.85 9.02
1979-04-01 23.22 17.23 5.99
1979-05-01 23.27 16.04 7.23

Definiendo la fecha como índice, podemos extraer información con respecto al año y respecto al mes

airrpm.groupby([airrpm.index.month])['R1'].describe()
count mean std min 25% 50% 75% max
Time
1 26.0 43.736538 9.778402 26.64 36.6575 44.82 51.86 59.36
2 25.0 43.680800 9.983696 27.20 35.5400 44.74 52.73 59.74
3 25.0 44.514400 10.363336 27.87 35.7000 45.01 53.26 60.10
4 25.0 44.287200 10.480331 23.22 36.6100 45.63 52.79 60.49
5 25.0 43.891600 10.524814 23.27 34.6000 45.11 52.57 60.36
6 25.0 45.655600 10.586951 27.30 35.4200 47.29 54.55 61.88
7 25.0 46.478400 10.333910 29.28 38.2700 47.79 54.94 63.02
8 25.0 46.773600 10.613169 27.27 38.8600 48.06 55.39 63.71
9 25.0 44.202400 9.731246 27.13 37.2600 46.39 51.83 59.54
10 25.0 44.433200 10.093832 26.97 37.8500 45.89 51.56 60.26
11 25.0 43.841200 9.923719 26.19 37.3300 45.50 51.02 59.21
12 25.0 44.600400 9.573277 27.32 38.7500 45.33 51.98 57.65
airrpm.groupby([airrpm.index.year]).sum().head(10)
R1 R2 R3
Time
1979 328.83 210.49 118.34
1980 340.95 200.92 140.05
1981 341.96 198.53 143.45
1982 355.17 210.23 144.91
1983 374.03 226.76 147.26
1984 415.88 243.71 172.18
1985 440.43 271.80 168.61
1986 492.37 302.20 190.18
1987 522.07 324.58 197.52
1988 530.25 328.69 201.57