Sourcing data format from multiple different structures

Problem



I want to read the data into a dictionary like this:

person = {
    'name': 'John Doe',
    'email': 'johndoe@email.com',
    'age': 50,
    'connected': False
}


The data comes from different formats:



Format A.



dict_a = {
    'name': {
        'first_name': 'John',
        'last_name': 'Doe'
    },
    'workEmail': 'johndoe@email.com',
    'age': 50,
    'connected': False
}


Format B.



dict_b = {
    'fullName': 'John Doe',
    'workEmail': 'johndoe@email.com',
    'age': 50,
    'connected': False
}


More sources, each with its own structure, will be added in the future.
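
To make the mapping concrete, here is a minimal sketch of how the two formats above could be normalized into the target dictionary. The function names, the `MAPPERS` registry, and the source labels `'format_a'` and `'format_b'` are hypothetical and chosen only for illustration; they do not come from Scrapy or any other library.

```python
from typing import Any, Callable, Dict

PersonDict = Dict[str, Any]


def from_format_a(raw: Dict[str, Any]) -> PersonDict:
    """Map a Format A record (nested name) to the target structure."""
    name = raw.get('name') or {}  # tolerate a missing 'name' key
    parts = [name.get('first_name'), name.get('last_name')]
    return {
        'name': ' '.join(p for p in parts if p),
        'email': raw.get('workEmail'),
        'age': raw.get('age'),
        'connected': raw.get('connected'),
    }


def from_format_b(raw: Dict[str, Any]) -> PersonDict:
    """Map a Format B record (flat fullName) to the target structure."""
    return {
        'name': raw.get('fullName'),
        'email': raw.get('workEmail'),
        'age': raw.get('age'),
        'connected': raw.get('connected'),
    }


# Hypothetical registry: a new source only needs one new entry here.
MAPPERS: Dict[str, Callable[[Dict[str, Any]], PersonDict]] = {
    'format_a': from_format_a,
    'format_b': from_format_b,
}


def normalize(source: str, raw: Dict[str, Any]) -> PersonDict:
    """Look up the mapper for the given source label and apply it."""
    return MAPPERS[source](raw)


dict_a = {
    'name': {'first_name': 'John', 'last_name': 'Doe'},
    'workEmail': 'johndoe@email.com',
    'age': 50,
    'connected': False,
}
dict_b = {
    'fullName': 'John Doe',
    'workEmail': 'johndoe@email.com',
    'age': 50,
    'connected': False,
}

person_from_a = normalize('format_a', dict_a)
person_from_b = normalize('format_b', dict_b)
```

With this layout, both sample records produce the same target dictionary, and supporting a further structure means writing one more function and registering it.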



Background



For this specific case, I'm building a Scrapy spider that scrapes the data from different APIs and web pages. Scrapy's recommended way would be to use their Item or ItemLoader, but it's ruled out in my case.



There could potentially be 5-10 different structures from which the data will be read.



My ideas so far



One potential way to solve this is to take advantage of polymorphism, where I create a Person class

class Person:
    def __init__(self, name, email, age, connected):
        self.name = name
        self.email = email
        self.age = age
        self.connected = connected


and subclass it once for each "data mapper" that handles a particular data structure, e.g.

class FormatA(Person):
    def __init__(self, dict_a):
        # Format A nests the name, so join the first and last name.
        self.name = ' '.join([dict_a.get('name').get(key) for key in ['first_name', 'last_name']])
        self.email = dict_a.get('workEmail')
        self.age = dict_a.get('age')
        self.connected = dict_a.get('connected')


class FormatB(Person):
    def __init__(self, dict_b):
        # Format B already carries the full name as a single string.
        self.name = dict_b.get('fullName')
        self.email = dict_b.get('workEmail')
        self.age = dict_b.get('age')
        self.connected = dict_b.get('connected')
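
For comparison, a variation on the same idea can be sketched with alternate constructors instead of subclasses: one record class with a classmethod per source format. This is only a sketch under assumptions; the class name `PersonRecord` and the method names `from_format_a` and `from_format_b` are hypothetical and not part of the code above.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class PersonRecord:
    """Hypothetical target record; the name is chosen for illustration only."""
    name: Optional[str]
    email: Optional[str]
    age: Optional[int]
    connected: Optional[bool]

    @classmethod
    def from_format_a(cls, raw: Dict[str, Any]) -> "PersonRecord":
        # Format A nests the name; tolerate a missing 'name' key.
        name = raw.get('name') or {}
        parts = [name.get('first_name'), name.get('last_name')]
        return cls(
            name=' '.join(p for p in parts if p) or None,
            email=raw.get('workEmail'),
            age=raw.get('age'),
            connected=raw.get('connected'),
        )

    @classmethod
    def from_format_b(cls, raw: Dict[str, Any]) -> "PersonRecord":
        # Format B already carries the full name as a single string.
        return cls(
            name=raw.get('fullName'),
            email=raw.get('workEmail'),
            age=raw.get('age'),
            connected=raw.get('connected'),
        )


dict_a = {
    'name': {'first_name': 'John', 'last_name': 'Doe'},
    'workEmail': 'johndoe@email.com',
    'age': 50,
    'connected': False,
}
dict_b = {
    'fullName': 'John Doe',
    'workEmail': 'johndoe@email.com',
    'age': 50,
    'connected': False,
}

record_a = PersonRecord.from_format_a(dict_a)
record_b = PersonRecord.from_format_b(dict_b)
```

Here `asdict(record_a)` returns a plain dictionary with the four target keys, so every source ends up as the same type regardless of where it was read from.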


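Now let's say I want to store these objects with SQLAlchemy:

from sqlalchemy import Boolean, Column, Integer, String
from sqlalchemy.orm import declarative_base  # SQLAlchemy 1.4+; older versions import it from sqlalchemy.ext.declarative

Base = declarative_base()


class Person(Base):
    __tablename__ = 'Person'

    id = Column(Integer, primary_key=True)
    name = Column(String)
    email = Column(String)
    age = Column(Integer)
    connected = Column(Boolean)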


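So now I can unpack the __dict__ of a FormatA or FormatB instance to instantiate a new SQLAlchemy object like this:

# Assumes `session` is an open SQLAlchemy Session and that `Person` here
# refers to the SQLAlchemy model, not the plain mapper base class.
person = Person(**vars(FormatA(dict_a)))
session.add(person)
session.commit()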


Question



I have limited experience in Python programming, and I'm building my first scraper application for a data engineering project that needs to scale to storing millions of Persons from tens of different structures. I was wondering:

  1. Is this a good solution? What could be the drawbacks and problems down the line?

  2. Is there a better, industry-tested solution in Python that is commonly used for this kind of input data mapping?









Tags: python, design-patterns, scrapy





asked 5 mins ago by Maivel
