Hi All!
I'm performing an econometric analysis over several billion rows of data
and would like to use the PySpark (Spark ML) implementation of linear
regression. In the example below I'm trying to interact hour-of-day and
month-of-year indicators. The StringIndexer documentation explains how
string/factor columns are handled when they are one-hot encoded (i.e. which
level is dropped: the most/least common value, or the first/last when
sorted alphabetically), but it doesn't give you a way to recover your
coefficient names. This feels like such a general case that I must be
missing something. How can I get my column names back after the regression
so that I can map them to coefficient values? Do I need to rebuild the
RFormula logic myself if this isn't already implemented? I would be happy
to use a different Spark language (Scala, Java, etc.) if it is implemented
there.
Thanks in advance
Andrew
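import pandas as pd
from pyspark.ml.feature import RFormula
from pyspark.ml.regression import LinearRegression

# Wrapper name is illustrative; regression_input is a Spark DataFrame.
def fit_regression(regression_input):
    # Expand the formula into a "features" vector and a "label" column.
    rform = RFormula(
        formula=("log_outcome ~ log_treatment + hour_of_day + month_of_year"
                 " + hour_of_day:month_of_year + additional_column"),
        featuresCol="features",
        labelCol="label")
    rform_regression_input = rform.fit(regression_input).transform(regression_input)
    lr = LinearRegression(featuresCol="features",
                          labelCol="label",
                          solver="normal")
    lr_model = lr.fit(rform_regression_input)
    # The summary statistics list the intercept last, so append it here too.
    coefs = [*lr_model.coefficients, lr_model.intercept]
    return pd.DataFrame(
        {"pvalues": lr_model.summary.pValues,
         "tvalues": lr_model.summary.tValues,
         "std_errs": lr_model.summary.coefficientStandardErrors,
         "coefs": coefs}
    )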
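
One possible route to the names, sketched here rather than confirmed: Spark ML vector columns can carry per-slot attributes in the column metadata under the "ml_attr" key, and the feature column produced by RFormula may expose them that way. The helper below only assumes that layout (groups such as "numeric" and "binary", each a list of entries with "idx" and "name"). The function name and the sample metadata are made up for illustration, and the exact names Spark generates for interaction terms should be checked against a real run.

```python
def feature_names_from_metadata(field_metadata):
    """Return feature names ordered by their slot index in the feature vector.

    Assumes the Spark ML attribute layout:
    {"ml_attr": {"attrs": {"numeric": [{"idx": 0, "name": "x"}], ...}}}
    """
    attrs = field_metadata.get("ml_attr", {}).get("attrs", {})
    indexed = [
        # Fall back to a positional placeholder if a slot has no name.
        (attr["idx"], attr.get("name", "feature_{}".format(attr["idx"])))
        for group in attrs.values()
        for attr in group
    ]
    return [name for _, name in sorted(indexed)]


# Hand-built metadata for illustration only (not taken from a real run).
example_metadata = {
    "ml_attr": {
        "attrs": {
            "numeric": [
                {"idx": 0, "name": "log_treatment"},
                {"idx": 3, "name": "additional_column"},
            ],
            "binary": [
                {"idx": 1, "name": "hour_of_day_1"},
                {"idx": 2, "name": "hour_of_day_1:month_of_year_2"},
            ],
        },
        "num_attrs": 4,
    }
}

names = feature_names_from_metadata(example_metadata)

# With Spark available, the metadata would come from the transformed frame:
#   meta = rform_regression_input.schema["features"].metadata
#   names = feature_names_from_metadata(meta) + ["(Intercept)"]
```

If the metadata is present, appending a label for the intercept gives a list with the same length and order as the coefs, pvalues, tvalues and std_errs columns above, so it could be added to the returned DataFrame as a "names" column.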