Sourcing data format from multiple different structures
Problem
I want to read the data into a dictionary like this:
person = {
    'name': 'John Doe',
    'email': 'johndoe@email.com',
    'age': 50,
    'connected': False
}
The data comes from different formats:
Format A.
dict_a = {
    'name': {
        'first_name': 'John',
        'last_name': 'Doe'
    },
    'workEmail': 'johndoe@email.com',
    'age': 50,
    'connected': False
}
Format B.
dict_b = {
    'fullName': 'John Doe',
    'workEmail': 'johndoe@email.com',
    'age': 50,
    'connected': False
}
More sources, each with its own structure, will be added in the future.
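As a minimal sketch of the mapping described above, each source format can be given one small function that returns the target dictionary. The names from_format_a, from_format_b, MAPPERS and to_person are hypothetical and only illustrate the idea; they are not part of the original code.

```python
def from_format_a(raw):
    """Map a Format A record onto the target person dictionary."""
    name_parts = raw.get('name') or {}
    # Join only the parts that are present, so a missing key does not raise
    full_name = ' '.join(
        part
        for part in (name_parts.get('first_name'), name_parts.get('last_name'))
        if part
    )
    return {
        'name': full_name,
        'email': raw.get('workEmail'),
        'age': raw.get('age'),
        'connected': raw.get('connected'),
    }


def from_format_b(raw):
    """Map a Format B record onto the target person dictionary."""
    return {
        'name': raw.get('fullName'),
        'email': raw.get('workEmail'),
        'age': raw.get('age'),
        'connected': raw.get('connected'),
    }


# Hypothetical registry: one entry per source format (the keys are assumptions)
MAPPERS = {'a': from_format_a, 'b': from_format_b}


def to_person(source, raw):
    """Look up the mapper for a source and apply it to a raw record."""
    return MAPPERS[source](raw)


dict_a = {
    'name': {'first_name': 'John', 'last_name': 'Doe'},
    'workEmail': 'johndoe@email.com',
    'age': 50,
    'connected': False,
}
dict_b = {
    'fullName': 'John Doe',
    'workEmail': 'johndoe@email.com',
    'age': 50,
    'connected': False,
}

person_a = to_person('a', dict_a)
person_b = to_person('b', dict_b)
```

In this sketch, supporting a new source would mean writing one more function and adding one more entry to the registry.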
Background
For this specific case, I'm building a Scrapy spider that scrapes the data from different APIs and web pages. Scrapy's recommended way would be to use its Item or ItemLoader, but that is ruled out in my case.
There could be 5-10 different structures from which the data will be read.
My ideas so far
One potential way to solve this is to take advantage of polymorphism, where I create a Person class
class Person:
    def __init__(self, name, email, age, connected):
        self.name = name
        self.email = email
        self.age = age
        self.connected = connected
and subclass it once per data structure, so that each subclass acts as the "data mapper" for one format, e.g.
class FormatA(Person):
    def __init__(self, dict_a):
        self.name = ' '.join([dict_a.get('name').get(key) for key in ['first_name', 'last_name']])
        self.email = dict_a.get('workEmail')
        self.age = dict_a.get('age')
        self.connected = dict_a.get('connected')

class FormatB(Person):
    def __init__(self, dict_b):
        self.name = dict_b.get('fullName')
        self.email = dict_b.get('workEmail')
        self.age = dict_b.get('age')
        self.connected = dict_b.get('connected')
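A short usage sketch of the subclass approach, assuming the sample records from the Problem section: both mappers should end up with the same attributes as the target dictionary. The classes are repeated so the snippet runs on its own, and delegating to Person.__init__ through super() is a variation introduced for this sketch, not part of the code above.

```python
class Person:
    def __init__(self, name, email, age, connected):
        self.name = name
        self.email = email
        self.age = age
        self.connected = connected


class FormatA(Person):
    def __init__(self, dict_a):
        name_parts = dict_a.get('name') or {}
        full_name = ' '.join(
            name_parts[key] for key in ('first_name', 'last_name') if name_parts.get(key)
        )
        # Delegating to Person.__init__ keeps the attribute list in one place
        super().__init__(
            full_name,
            dict_a.get('workEmail'),
            dict_a.get('age'),
            dict_a.get('connected'),
        )


class FormatB(Person):
    def __init__(self, dict_b):
        super().__init__(
            dict_b.get('fullName'),
            dict_b.get('workEmail'),
            dict_b.get('age'),
            dict_b.get('connected'),
        )


dict_a = {
    'name': {'first_name': 'John', 'last_name': 'Doe'},
    'workEmail': 'johndoe@email.com',
    'age': 50,
    'connected': False,
}
dict_b = {
    'fullName': 'John Doe',
    'workEmail': 'johndoe@email.com',
    'age': 50,
    'connected': False,
}

# vars(obj) returns the instance's __dict__, i.e. the mapped attributes
mapped_a = vars(FormatA(dict_a))
mapped_b = vars(FormatB(dict_b))
```

Note that the attributes live on the instance, so it is the instance's `__dict__` (not the class's) that holds the mapped values.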
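Now let's say I want to store these objects with SQLAlchemy:
from sqlalchemy import Column, Integer, String, Boolean
from sqlalchemy.orm import declarative_base  # SQLAlchemy 1.4+; older versions import it from sqlalchemy.ext.declarative

# Declarative base class that the model below inherits from
Base = declarative_base()

class Person(Base):
    __tablename__ = 'Person'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    email = Column(String)
    age = Column(Integer)
    connected = Column(Boolean)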
So now I can unpack the __dict__ of a FormatA or FormatB instance to instantiate a new SQLAlchemy object like this:
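# Unpack the attributes of a mapper instance, not of the FormatA class itself
person = Person(**FormatA(dict_a).__dict__)
# add() and commit() are Session methods; `session` is assumed to be an existing SQLAlchemy Session
session.add(person)
session.commit()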
Question
I have limited experience in Python programming, and I'm building my first scraper application for a data engineering project that needs to scale to storing millions of Persons from tens of different structures. I was wondering:
- Is this a good solution? What could be the drawbacks and problems down the line?
- Is there a better, industry-tested solution in Python for this kind of input data mapping?
python design-patterns scrapy
asked 5 mins ago
Maivel